Okay, so where are we? We've covered some selected topics in the very broad field of machine learning, particularly supervised learning. So, we've covered rules and trees as representations of certain kinds of hypotheses. We've covered combining a set of weak learners to make a stronger learner with ensembles and boosting. And we talked about this optimization technique of gradient descent, along with some applications of it, how to speed it up, and then how to parallelize it.

And so, now we're gonna talk about a few selected topics in unsupervised learning. So, this is a very broad field as well. And we're only gonna cover a couple of selected topics that data scientists ought to be familiar with, but this is by no means a coverage of the entire area.

So, what is unsupervised learning? So, if we have these four categories of machine learning that are characterized by where the feedback comes from, then unsupervised learning is perhaps the one that stands out the most. So, we've only talked about supervised learning, but you can think about all four of these. So, in supervised learning, the feedback for the learning comes explicitly in the data. In reinforcement learning, the feedback is supplied by the environment. So, this is control theory, when you're trying to keep a plane in the air, for example. And in game theory, the feedback comes from the other participants in the system, the other players in the game. But in unsupervised learning, there is no feedback. And so, here, you're just looking for patterns in the data itself and trying to put them to use.

Okay, so almost all work in unsupervised learning can be viewed in terms of learning a probabilistic model of the data. So, applications of unsupervised learning include detecting outliers: is this factory behaving normally or not? And classification, which is a little different than the classification we've talked about, given that there are no class labels.
But you might find groups of similar items, and when you see a new item come in, you can figure out which among the groups you've found previously this one is most similar to, and thereby classify it as within that group. You can also think about compression and communication. So, instead of sending a sequence of ab, ab, ab, ab pairs, you may say, let's just send the sequence ab and say repeat seven times. And so this is a shorter representation of that string. So in all cases, we're identifying these patterns that describe the data and then putting them to use for various services.

Okay, so in particular, what I wanna do is talk about clustering, and this is by no means the only form of unsupervised learning. But it's the one that you'll bump into quite a bit. Another common one that we're not gonna spend much time on is dimension reduction.

Okay, so in clustering, there's perhaps no precise definition of what a cluster is, which is one of the reasons why you see all these different algorithms for different scenarios. But the output is usually the same, which is a set of sets of data items. Okay, so these items may be points in some multidimensional space, and your goal is to group them by similarity. Or the items may be, say, vertices in a graph, and your goal is to find communities that collaborate or communicate more closely with each other than with other communities. Okay, so here you're looking at the edge structure of the graph to identify the clusters, but the output is still the same: it's a set of sets of items.

All right, so to give you an example that comes up in a collaboration we have with some oceanographers here at the University of Washington. So, in this SeaFlow device, they borrowed techniques from flow cytometry and adapted them for environmental monitoring.
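Before looking at the device itself, here is a minimal sketch of two ideas from the discussion above: a clustering result represented as a set of sets of items, and classifying a new item by the previously found group it is most similar to. The points, the `centroid` and `nearest_cluster` helpers, and the choice of Euclidean distance to a group's centroid as the similarity measure are all illustrative assumptions, not something taken from the lecture.

```python
import math

# Hypothetical output of some earlier clustering step:
# a collection of sets, each set holding 2-D points.
clusters = [
    {(1.0, 1.0), (1.2, 0.8), (0.9, 1.1)},
    {(5.0, 5.0), (5.2, 4.9), (4.8, 5.1)},
]


def centroid(points):
    """Coordinate-wise mean of a collection of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def nearest_cluster(item, clusters):
    """Index of the cluster whose centroid is closest to item.

    Similarity is assumed to be Euclidean distance to the centroid.
    """
    centers = [centroid(c) for c in clusters]
    return min(range(len(centers)), key=lambda i: math.dist(item, centers[i]))


# A new item arrives and is assigned to the group it most resembles.
new_item = (4.5, 4.7)
group = nearest_cluster(new_item, clusters)  # index 1, the group around (5, 5)
```

Other similarity measures, such as distance to the nearest member rather than to the centroid, would fit the same description; which one is appropriate depends on the data.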
So, flow cytometry is a diagnostic tool in medical applications, and the idea is to take a particular amount of blood or plasma from the patient and put the particles in that sample through a capillary single file. You bounce light off of each particle, and by the absorption and refraction patterns of the particle, you can sort of work out which kind of pathogen it is. With these applications, you're typically looking for one of a small set of possible critters. So, you kind of know what you're looking for. And you're at a lab bench in a kind of controlled environment.

What these researchers have done is developed their own device from scratch, ruggedized it, turned it upside down, and put it at the bottom of an ocean-going vessel, in order to get, in real time, a continuous stream of labeled particles representing the microbial populations of the open ocean. And this is completely unknown. You can get some information out of satellite images, but there's really nothing that's known below the surface. There's no real ground truth for measuring this besides very sparse data from direct sampling.

Okay, so how does this work? One of the challenges here: I said that in flow cytometry they push the particles through single file. Here they do not. They actually open up the pipe a little bit and allow lots of particles to go through, because the engineering is too difficult to get things to go through single file when you're under water and the boat's doing this, okay. So, that's one of the problems. You save effort on the engineering problem, but you have to make it up in the data analysis, because now the data is much noisier, okay. So as light shines through this particle stream, it's collected on either side by various detectors.
And there are two detectors that work initially to help identify which particles are at the center of the stream and which ones are at the edges, so you can filter out the ones at the edges. And then there are other detectors for various wavelengths of light. And what you can do here is first filter the data so that you're only worried about the particles that were in the middle of the stream. And then you can start inspecting the different wavelengths to identify these clusters that represent different microbial populations. So, in this case, we're looking at size: nanoplankton, ultraplankton, and picoplankton. And some of the absorption and refraction patterns of these different populations are understood. But in the data stream, you still need to figure out what these clusters are. And so they apply a clustering algorithm, a variation on a very popular clustering algorithm called k-means, to analyze this data. And so we'll talk about k-means next.
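
As a preview, the filter-then-cluster workflow described above might be sketched roughly as follows. This is plain textbook k-means (Lloyd's algorithm) in NumPy, not the variation the SeaFlow researchers use, which is not specified here. The synthetic data, the `offset` column standing in for the two position detectors, the 0.5 threshold, and the deterministic farthest-point initialization are all assumptions made for illustration.

```python
import numpy as np


def farthest_point_init(points, k):
    """Pick k starting centers: the first point, then repeatedly the point
    farthest from every center chosen so far (an assumption, chosen so the
    sketch is deterministic)."""
    centers = [points[0]]
    for _ in range(k - 1):
        diffs = points[:, None, :] - np.array(centers)[None, :, :]
        nearest = np.linalg.norm(diffs, axis=2).min(axis=1)
        centers.append(points[nearest.argmax()])
    return np.array(centers)


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    diffs = points[:, None, :] - centers[None, :, :]
    return np.linalg.norm(diffs, axis=2).argmin(axis=1)


def kmeans(points, k, n_iter=100):
    """Alternate assignment and mean-update steps until the centers settle."""
    centers = farthest_point_init(points, k)
    for _ in range(n_iter):
        labels = assign(points, centers)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(points, centers), centers


# Synthetic stand-in for detector readings: three hypothetical populations,
# each a tight blob in a 2-D signal space.
rng = np.random.default_rng(0)
blob_means = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
signals = np.vstack([m + 0.1 * rng.standard_normal((100, 2)) for m in blob_means])
true_blob = np.repeat(np.arange(3), 100)

# Hypothetical position reading: how far each particle was from the stream center.
offset = rng.uniform(-1.0, 1.0, size=len(signals))

# Step 1: keep only particles near the middle of the stream.
mask = np.abs(offset) < 0.5
centered, centered_blob = signals[mask], true_blob[mask]

# Step 2: cluster the remaining particles.
labels, centers = kmeans(centered, k=3)
```

On real data the populations overlap and the number of clusters is not known in advance, which is part of why a variation on the basic algorithm is needed.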