So, here are the results of batch gradient descent. The blue dots are the training data, the line is the same line I drew earlier, I took this point and that point and just plotted a straight line through them, and these little red triangles are the price predictions. These x values are pretty close to the training data, but you can see this value here was something the algorithm had never seen before, and this one's x value it had never seen either. That one was close, but these new ones, it had never seen any of the values in here, and it's still really close to the tracking line. That one too, and again up, and up, and up. So it did really well, even though it's computationally expensive to calculate the thetas.

The next one is stochastic gradient descent. Here, you can see it has some issues. I always meant to go back, you keep hearing me say that, time is precious, and study this in a little more detail. I think the issue is that I simply didn't have enough data. When we get into the predictive analytics part, many of the authors and people doing data science and big data analytics talk about the importance of having large data sets, and I think I just didn't have enough data here. So in this case, even though stochastic gradient descent is computationally faster, it did not produce very accurate results. One of the things we do in machine learning is stand back, calculate an error, look at it mathematically and/or graph it, and see that one didn't do so well, so we don't want to use that one, okay.

Now check this out. This is the closed form, and it did amazingly well too, also on values it hadn't seen. I thought that was so cool, that you could take a whole bunch of numbers, take the derivatives with respect to the thetas, do all this linear algebra, and just calculate the thetas directly. It's pretty cool. So, take a look at the thetas here.
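To make the comparison concrete, here is a minimal sketch of the two update rules, batch (every training row per update) versus stochastic (one row per update). The data, learning rate, and iteration counts are made up for illustration; the lecture's actual housing dataset isn't shown here.

```python
import numpy as np

# Hypothetical data: a column of ones for the intercept plus two features.
# These numbers are invented for illustration only.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 3.0],
              [1.0, 3.0, 4.0],
              [1.0, 4.0, 5.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

alpha = 0.01          # learning rate
theta = np.zeros(3)   # start the thetas at zero

# Batch gradient descent: each update uses the gradient over ALL rows,
# which is accurate but computationally expensive per step.
for _ in range(5000):
    gradient = X.T @ (X @ theta - y) / len(y)
    theta = theta - alpha * gradient

# Stochastic gradient descent: each update uses a single random row,
# which is cheap per step but noisier.
theta_sgd = np.zeros(3)
rng = np.random.default_rng(0)
for _ in range(5000):
    i = rng.integers(len(y))
    error = X[i] @ theta_sgd - y[i]
    theta_sgd = theta_sgd - alpha * error * X[i]
```

In this made-up data the feature columns are linearly dependent, so more than one theta vector fits the points equally well, the same effect the lecture notes: there isn't necessarily one right answer for the weights.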
These were the batch gradient descent thetas: theta sub zero came out to just about one, theta sub one came out to 45.5, and theta sub two came out just over one. In stochastic gradient descent, theta sub zero came out right around one again, theta sub one came out to 35.5, and theta sub two came out just over one. The closed form was very different. Its intercept term came out exactly equal to zero, or very close; I don't know if there were some bits out to the right that just weren't displayed, but it's essentially zero. Its theta sub one was four, and this one is 4.6. It's interesting how this one and this one produced very similar results. So there's not necessarily one right answer for the weights, for those theta values.

Choosing alpha, which is what we're doing here, was really tricky. I saw some crazy things happening. Like I said, I initially set my alpha to one, thinking, how harmful could that be? It turned out to be a huge deal. If you choose your alpha too big, the error value you're calculating each iteration gets really big: the algorithm looks at the error and says, "I've got to go negative," and it goes way negative, then it sees that big negative and goes even further positive, then the next iteration even more negative, and it runs away. Eventually, since Python stores real numbers internally as double-precision floating point, it overflowed double precision, and I was scratching my head wondering what was going on. I have a graph of that coming up, so we won't dwell on it, but I had to debug this myself. So I picked an out-of-range value for the thetas, which I learned through trial and error: I decided that if my thetas got as big as five times ten to the twentieth, I was headed in the wrong direction, and I would stop the simulation, stop the program.
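The closed form the lecture refers to is what's usually called the normal equation, where you set the derivatives with respect to the thetas to zero and solve directly with linear algebra, no iterations and no alpha needed. A minimal sketch with made-up data (not the lecture's dataset):

```python
import numpy as np

# Hypothetical data: intercept column plus one feature; invented numbers.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

# Normal equation: theta = (X^T X)^(-1) X^T y, computed in one shot.
# pinv (pseudoinverse) is used so the math still goes through even if
# X^T X happens to be singular or near-singular.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
```

With this toy data the intercept comes out essentially zero and the slope exactly two, and notice there is no learning rate, no iterations, and no convergence threshold to tune.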
I also wanted to make sure that as we were iterating, as we were doing our gradient descent and coming down, we watched how much the thetas changed from one iteration to the next. As long as that change was over a threshold value of two, I kept going. When the change from one theta update to the next became less than two, I said, "Okay, I've converged." And you can tune these numbers. Now, look at what I had to set alpha to in order to keep it from running away: it was a very, very small number. You can look at the code, and that's what it does.

So this is what I saw happening. Theta was set to zero, then it went negative, then it went positive, and then on the next iteration more negative, and more positive, and more negative, and more positive, and it kept going back and forth until eventually it overflowed double-precision floating point and I got an error message from Python. What's going on here? It took me a little while to figure that out, and I was playing with these hyperparameters to get it to work. What we want it to look like is this: theta starts out here, set equal to zero, the iterations go, and the theta values come up and start to asymptotically approach the final value. This is where that threshold value of two that I used comes in. As long as it was changing by more than two from one update to the next, I kept going; once the delta became less than two, I stopped it, said that was enough iterations, used those theta values, then gave the algorithm data it had never seen before and looked at the results. Now, problems that can occur: you may hear the terms underfitting and overfitting.
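The stopping logic described above can be sketched like this. The data and the tiny threshold are made up (the lecture used a threshold of two against much larger price values); the runaway limit of five times ten to the twentieth is the one from the lecture.

```python
import numpy as np

# Hypothetical data; invented numbers for illustration.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

def gradient_descent(alpha, threshold=1e-6, runaway=5e20):
    """Returns (theta, converged_flag)."""
    theta = np.zeros(2)
    while True:
        new_theta = theta - alpha * X.T @ (X @ theta - y) / len(y)
        # Alpha too big: thetas oscillate negative/positive with growing
        # magnitude, so bail out before double precision overflows.
        if np.any(np.abs(new_theta) > runaway):
            return theta, False
        # Converged: thetas barely changed from one update to the next.
        if np.max(np.abs(new_theta - theta)) < threshold:
            return new_theta, True
        theta = new_theta

theta, converged = gradient_descent(alpha=0.05)  # small alpha: settles down
_, runaway_converged = gradient_descent(alpha=2.0)  # big alpha: runs away
```

Running the same loop with the big alpha reproduces the back-and-forth blow-up the lecture describes: each iteration the sign flips and the magnitude grows, until the runaway guard trips.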
If you have a bunch of data points here, and you stand back, look at them, do the finger-in-the-air thing, and say it looks like a straight line, it might not be. In your initial assumption you assume it's a linear relationship, but the actual pattern of the data, if you had more data (and this is why volume of data is important), might reveal that there's actually an exponential shape to it. It might be a constant times some exponential function, K times one minus e to the minus square footage, or something like that. So this is an example of underfitting: we chose linear, but it was actually exponential, or some other function; in this example it's exponential. Overfitting is the problem where the algorithm just memorizes the data points. It learns the training data really well, but when you give it data it has never seen before, you might get a value up here, and down there, and over here; some might be on, some might be off, and it produces very poor results. If you see that happening in your machine learning algorithm, you have an overfitting problem, and you can dig into it and see why that's happening.
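Both failure modes can be shown in a few lines. This is a made-up example where the data really does follow a one-minus-e-to-the-minus-x curve plus a little noise: a straight line underfits it, while a maximum-degree polynomial memorizes every noisy training point.

```python
import numpy as np

# Invented data following an exponential-style curve, 1 - e^(-x),
# plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 3.0, 8)
y = 1.0 - np.exp(-x) + rng.normal(0.0, 0.02, size=x.shape)

# Underfit: a straight line can't follow the curvature.
line = np.polyfit(x, y, deg=1)
# Overfit: a degree-7 polynomial through 8 points memorizes every
# training point, noise and all.
wiggle = np.polyfit(x, y, deg=7)

# On the training data the overfit model looks perfect and the line
# looks bad, which is exactly the trap: the memorizer wins on data it
# has seen, then swings wildly on x values it has never seen.
train_err_line = np.max(np.abs(np.polyval(line, x) - y))
train_err_wiggle = np.max(np.abs(np.polyval(wiggle, x) - y))
```

This is why you hold data back and evaluate on values the algorithm never saw: training error alone would tell you the memorizing model is the better one.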