So, then the question becomes: how do we pick, how do we calculate, how do we learn the values of the thetas? We could try it by trial and error, punch in numbers and see what theta values work; we could do it that way. But there are at least three algorithmic ways that we're going to take a look at here. The starting point is that we make our hypothesis function and we say it's approximately equal to our y output value, or target value, because that's what we want to predict; we want to predict the price of the house.

So, we define a cost function. There are many ways we can define a cost function; the sky's the limit for picking one. But this one is maybe familiar to you, you've probably come across it before, and it was chosen because it's been around for a long time and it's well understood. The cost function J of theta is equal to one-half times the sum, over all m training examples, of the hypothesis evaluated on example i minus that example's target value, all squared. The one-half is in there for a handy reason: we're going to take a derivative here in a minute, and it cancels nicely. A note on notation: the superscript i's refer to the training examples. They're not exponents, it's not x to the 1, x squared, x cubed; these are indices into the training set of data. So, it's one-half the summation of the hypothesis function applied to one example's features, minus its target value, squared, summed over all the examples, and that's the cost function, or error function: how big our error is.

What we want to do is choose the thetas to minimize J of theta, and to do so, we use a search algorithm that starts with some initial guess for theta and then repeatedly changes the theta values to make J of theta smaller, until we reach the minimum value of J of theta. So, make that error term get as small as possible.

One way to do that is an algorithm called gradient descent. Gradient descent starts with some initial guess of theta and then repeatedly performs this update: it takes the current theta_j and subtracts from it alpha times the partial derivative of J of theta with respect to theta_j, for all j from 0 to n. This update rule is called the least mean squares (LMS) rule, and alpha is the learning rate, which is also called a hyperparameter. If you study machine learning algorithms, you'll see that many algorithms have a number of hyperparameters; these are what we as human beings choose to help our algorithm achieve the desired results that we're after. Great care needs to be taken, I discovered, in choosing alpha. I set it to one, thinking that was harmless; that was a poor choice, and I'll explain why in a couple of minutes.

So, we've got to calculate this derivative, and when you do that for all n, and I'm not going to do it here, I'm just going to give you the answer, you get this: repeat until convergence (I'll talk about what that means), the new theta_j is equal to the previous theta_j plus the quantity alpha times the summation, over the training examples, of the target value minus the hypothesis, times that example's feature value x_j, for all j from 0 to n. That's what you get. For each theta_j we update, we cruise through all m training examples. Every iteration. This method is called batch gradient descent. This particular cost function is a convex quadratic function, so it has only a single global minimum; there's an animation I'll show you coming up here. If you pick some other function, it may not have a single global minimum, and you need to manually check for this.
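To make the cost function concrete, here's a minimal sketch in Python with NumPy. The language and the names (cost, X, y) are my choices for illustration, not from the lecture; it assumes the m training examples are stacked as rows of an m-by-(n+1) matrix X, with a leading column of ones so that theta_0 acts as the intercept, and that y is the length-m vector of target values.

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/2 * sum_i (h(x^(i)) - y^(i))^2.

    X: m-by-(n+1) matrix, one training example per row,
       first column all ones for the intercept term theta_0.
    y: length-m vector of target values (e.g. house prices).
    """
    residuals = X @ theta - y   # h(x^(i)) - y^(i) for every example at once
    return 0.5 * np.sum(residuals ** 2)
```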
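And here's a sketch of batch gradient descent built on that cost function. The update line is the LMS rule just described, vectorized so that one line sums over all m examples. The stopping test (quit when J barely changes between iterations) is one reasonable version of the user-defined convergence test I'll get to in a moment, and alpha, tol, and max_iters are illustrative defaults, not values from the lecture.

```python
def batch_gradient_descent(X, y, alpha=0.01, tol=1e-6, max_iters=10_000):
    """Batch gradient descent: every theta update touches all m examples.

    If alpha is too large (my alpha = 1 mistake), J grows instead of
    shrinking and the loop never converges.
    """
    theta = np.zeros(X.shape[1])    # initial guess: all thetas zero
    prev_J = cost(theta, X, y)
    for _ in range(max_iters):
        # LMS rule: theta_j := theta_j + alpha * sum_i (y^(i) - h(x^(i))) * x_j^(i)
        errors = y - X @ theta
        theta = theta + alpha * (X.T @ errors)
        J = cost(theta, X, y)
        if abs(prev_J - J) < tol:   # converged: J barely changed
            break
        prev_J = J
    return theta
```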
This one here has a single global minimum. As Andrew described it: imagine you're up on the side of a hill, the side of a mountain, and you start up here with your initial guess. As the algorithm iterates, you work your way down, and hopefully you reach the final, lowest value of your cost function, which is the global minimum. But if you pick other functions, there may be a little nonlinearity in there, and if you start out here you can get stuck and never get out, so the algorithm will converge here instead of converging at the global minimum. I had to write my own convergence code; Andrew's notes didn't tell us how to do that, so I had to figure it out on my own, and I'll share it with you here. So, here's our cost function, and here we're up on the side of the hill as this algorithm iterates. We start up here, and because it's this convex shape, it has just a single global minimum and no other local minima. So, as the algorithm runs, you see your error function eventually get down to this very small amount, and then you stop, and you know you're done.

As I said, the test for convergence is a user-defined function. The downside to batch gradient descent is that every theta update needs to run through all m training samples. So, if your dataset is enormous, it's computationally expensive: every pass through, you might have 10 million training samples, and you've got to cruise through all 10 million to update every single theta. Is there a better way to do it? Yes.

The second method is called stochastic gradient descent. Here we go through the training data one example at a time and update our theta values as we go. It's faster for large training sets but potentially less accurate, and we'll see that that is in fact the case. So, you go through all the math, which I'm not going to do here for you, and you end up with this: for i equals 1 to m, the new theta_j is equal to the current theta_j plus the learning rate times the quantity target value for example i minus the hypothesis on example i, times that example's feature value x_j, for all j from 0 to n. Whether you index the examples i from 0 to m minus 1 or i from 1 to m doesn't matter either way; the point is that we sample each example from our training data once and compute our theta values as we go.

The third method: if you had told me this was possible to do, I would have said no way. Shows you what I know; I'm not a mathematician, I'm a chip designer. The third method is pretty stunning. Is it possible to directly calculate the theta_j's? The answer, it turns out, is yes: a closed-form equation with no iterations at all. In this method, we minimize J of theta by explicitly taking the derivative with respect to each individual theta and setting it to zero. Then Andrew did this wild, crazy linear algebra on his chalkboard at Stanford, a couple of whiteboards full of these manipulations, and I was thinking: these aren't just arrays of numbers, how can you take derivatives with respect to matrices and set them to zero? How can this possibly work? Sure enough, it does. So, I'm glad there are smarter people in the world than I am. You can directly calculate all the thetas using this equation: stack your feature vectors into a matrix X, take the transpose of it, multiply it by X, take the inverse of that result, multiply it by X transpose again, and then multiply it by the column vector y of the target values, and out come your theta values (two of them, in this example). That's pretty cool.
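Here's the same kind of sketch for stochastic gradient descent, under the same conventions (X with a leading column of ones, y the targets). The thing to notice is that theta is updated after every individual example, so a single pass through the data already makes m updates; the epochs parameter is my addition for when you want more than one pass.

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, epochs=1):
    """Stochastic gradient descent: update theta after each example."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in range(X.shape[0]):   # visit each of the m examples once
            error = y[i] - X[i] @ theta              # y^(i) - h(x^(i))
            theta = theta + alpha * error * X[i]     # update all theta_j
    return theta
```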
And the other methods are iterative and computationally expensive compared to this one.
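Finally, a sketch of that closed-form calculation, theta equals (X transpose X) inverse times X transpose y. One note of my own: rather than forming the inverse explicitly with np.linalg.inv, it's numerically safer to have NumPy solve the equivalent linear system; either way you get all the thetas in one shot, with no learning rate and no iterations.

```python
import numpy as np

def normal_equation(X, y):
    """Closed form: theta = (X^T X)^(-1) X^T y, computed as a linear solve."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```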