So, that was the supervised learning example. Now I'm going to take a look at another kind of learning algorithm called unsupervised learning. The process is similar to supervised learning: you load the data set, you fit the data to the model, you visualize the model (in this case, clusters), and you tune parameters. As with all of these, that human intervention, tuning the parameters, is intimately part of machine learning. You repeat these steps until you hopefully get a solution to the question you're answering or the prediction you're seeking, and that includes evaluating the model and calculating its error: how well is it doing?

So, there's an unsupervised learning method called K-Means, which is looking to extract structure from data. It's applied to problems where we don't know what the outcomes are and we don't have target values, so we can't use supervised learning, because that requires us to have some data that we already know the answers to, to use as training data. We want to cluster similar examples together into K clusters. This isn't the greatest algorithm in the world, but it is an example of one unsupervised learning algorithm. The biggest challenge or problem with K-Means is having some intuition ahead of time about what the correct number of clusters is, because you tell it the number of clusters when you call it. If you're wrong, well, we'll look at some examples. We know how many clusters there are in the Iris data set: three. But I also ran it with two, and I ran it with four, so you could see what happens. It's interesting.

So, imagine you have a 10-petabyte data file. As a human being trying to visualize that data, and let's say it has 24 dimensions to it, how do you find structure in that? K-Means might be a good first place to start, just to do some poking at that data to see what you might discover: to see if there are any trends, any clustering of data values in that 24-dimensional space. It is good for problems that have a small number of clusters of approximately similar sizes. So if you've got a data set much like our Iris data set, where the points are relatively clustered together, it works well. If your data has a bunch of really small clusters, say in three-dimensional space you've got a small cluster here, a small cluster here, a small cluster here, and then a big cluster, K-Means might not be the best algorithm to use on it, but running K-Means may lead you to discover that that clustering characteristic is present in your data. It makes the assumption that the clusters are linearly separable, like we saw with the SVM: there is some way to draw a separating hypersurface.

This plot was printed with known outcomes, because I knew what they were, so I went through and drew this graph; this was the original graph. This was the data set given to K-Means with three clusters, and it doesn't know the names of the classes; it just refers to them as cluster zero, cluster one, and cluster two. You can see it did a pretty good job. A few of these data points right here, where Versicolor and Virginica overlap, were put in the wrong cluster, but it did a pretty good job with three clusters. You can see where I called K-Means: you pass in, or set, the number of clusters equal to three, then you call fit on it, and then you can print out the results numerically or you can graph them; that's what this bottom graph was. There's a small sketch of that call below.
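Here's a minimal sketch of that call, assuming the scikit-learn K-Means that the lecture's vocabulary (fit, number of clusters, random state) points at; the variable names are mine, and the plotting is left out.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

iris = load_iris()
X = iris.data  # 150 samples, 4 features each

# n_clusters is the knob the lecture turns to 2, 3, and 4;
# a fixed random_state keeps the run repeatable
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(X)

print(kmeans.labels_[:10])       # cluster id (0, 1, or 2) for the first few samples
print(kmeans.cluster_centers_)   # the final centroids in feature space
```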
Here's an example with three clusters. I was curious to see what happens when you turn the knob on the number of clusters, so I told it there were two: Virginica and Versicolor got merged together into one cluster, and Setosa is off to the side. Then I went with four, and it split this area into another cluster, cluster three; there was a dividing line here and a dividing line there.

So, K-Means works by leveraging similarities among examples, the data points in a multidimensional data space, and the distance between them, and distance has to follow certain rules. You can't have any negative distances. The distance from point one to point two is the same as the distance from point two to point one; that's the symmetry property. And the distance from an initial point to a final point is less than or equal to the distance from the initial point to some third point plus the distance from that third point to the final point; that's called the triangle inequality, and in short it means there aren't any shortcuts in the space, no wormholes or folding of space or anything like that.

There are a number of ways to measure distances. We remember from geometry that the Euclidean distance is the square root of the sum of the squares of the differences in the coordinates. There's the Manhattan distance, which is up and over, up and over, over and over. And then there's one I hadn't heard of before, the Chebyshev distance, and I like this one. Who's played chess? You know how the king moves? If the king is here, the distance to every square immediately around it is one, and the squares in the next ring out, which the king needs two moves to reach, are at distance two, and so on outward from the square the king is on. That's the Chebyshev distance. I'd never heard of it before; that was interesting. There's a small sketch of all three below.

K-Means assumes that the data has clusters, that there is some structure in there, and that the clusters are made up of similar examples around some starting example. It just randomly picks a sample, and that sample is called the prototype. You can also think about it as the centroid, the center of mass, if you will, at the center of the cluster. It assumes the clusters have a roughly spherical shape; they won't be perfectly spherical, certainly not for real-world data. And K-Means uses Euclidean distance. It can work with ordinal values (one, two, three), and it can work with binary values, like in the spam classification problem. One thing to be aware of: because it uses Euclidean distances, the features need to have roughly the same scale, otherwise features with very large values dominate and potentially end up throwing off your results. One solution to that problem is to transform your data before K-Means by statistically standardizing all of the features, and possibly transforming them into new features with a dimensionality reduction process such as principal component analysis; that's sketched below too.

I think its greatest challenge, again, is that it expects you to have prior knowledge of how many clusters there are in your data, but you can also use it to discover that: run it with two clusters and see what you get, then run it with three, four, five, six, and hopefully you begin to intuit some structure in your data. It's always important to do a reality check, to see whether the result is reproducible under different conditions and whether the results make sense; that's true for all of these algorithms.
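Just to make those three distance measures concrete, here's a tiny sketch computing them by hand with NumPy; the two example points are made up for illustration.

```python
import numpy as np

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

# Euclidean: square root of the sum of the squared coordinate differences
euclidean = np.sqrt(np.sum((p - q) ** 2))   # 5.0

# Manhattan: "up and over", the sum of the absolute coordinate differences
manhattan = np.sum(np.abs(p - q))           # 7.0

# Chebyshev: the king-move distance, the largest single coordinate difference
chebyshev = np.max(np.abs(p - q))           # 4.0

print(euclidean, manhattan, chebyshev)
```

SciPy also ships these as scipy.spatial.distance.euclidean, cityblock, and chebyshev if you'd rather not write them out by hand.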
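And here's a hedged sketch of the preprocessing idea mentioned above: standardize the features so that no large-valued feature dominates the Euclidean distances, and optionally reduce the dimensionality with principal component analysis before clustering. The choice of two components is just for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = load_iris().data

# Standardize each feature to mean 0 and unit variance so they share a scale
X_scaled = StandardScaler().fit_transform(X)

# Optionally project the standardized features down to 2 principal components
X_reduced = PCA(n_components=2).fit_transform(X_scaled)

# Cluster in the reduced space
labels = KMeans(n_clusters=3, random_state=0).fit_predict(X_reduced)
print(labels[:10])
```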
You always have to test your algorithm and verify that it's making accurate predictions within whatever error tolerance is acceptable for the problem you're trying to solve.

So, the user provides K, the number of clusters, and the algorithm picks K random samples as the original centroids. The algorithm assigns every example to one of the K clusters based on its Euclidean distance to each of those centroids, and those examples become part of that cluster. After all the examples have been assigned to a centroid, the algorithm recalculates the new centroid for each cluster based on the examples within the cluster. The algorithm then checks how much the position of each centroid has changed, and if the change is under a certain threshold, much like we saw in the linear regression example, it assumes it's stable, stops, and returns the result. There's a stripped-down sketch of that loop below.

I used K-Means above. The library contains another method called MiniBatchKMeans, and it differs from K-Means in that it can operate on portions of the data, whereas with K-Means you have to give it the whole data set. So if you have a 100-petabyte file, you're not going to load a 100-petabyte file into your machine's working memory all at one time. The advantage of MiniBatch is that it can process the data in little chunks, perhaps on a Hadoop or Lustre file system, where, as we saw before, these file systems were designed to store very large files. You run MiniBatch on it, and it pulls across the data as it needs it to make its predictions. It can take more time, but it produces results that are very similar to what K-Means produces when it can process all of the data at one time; there's a sketch of that below as well.

So, in the code: I import my libraries, I load the Iris data set, I import K-Means, I tell it the number of clusters, and I set the random state variable and just hold it constant, so it's not changing every time I run this. Then I call fit on the data; that's where it goes off and does its clustering. Then I print a bunch of results, and the rest of it is using principal component analysis to reduce the data, making those plots that we saw, and printing the results. That was a two-cluster example, this was a four-cluster example, and then the results are here.
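As a way of seeing those steps, here's a stripped-down sketch of the loop just described, written from scratch with NumPy. This is illustrative only, not scikit-learn's implementation, and it skips details such as handling a cluster that ends up empty.

```python
import numpy as np

def simple_kmeans(X, k, tol=1e-4, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # pick K random samples as the original centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # assign every example to its nearest centroid by Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recalculate each centroid from the examples assigned to it
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # stop once the centroids have moved less than the threshold
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids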
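And here's a minimal sketch of MiniBatchKMeans consuming the data a chunk at a time via partial_fit. The chunking here is simulated by slicing an array that's already in memory; with a genuinely huge file you would stream each chunk in from storage instead.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import load_iris

X = load_iris().data
mbk = MiniBatchKMeans(n_clusters=3, random_state=0, batch_size=50)

# feed the model one chunk at a time instead of the whole data set at once
for start in range(0, len(X), 50):
    mbk.partial_fit(X[start:start + 50])

print(mbk.cluster_centers_)
```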