In this part of the module we will look at two things: how we explore data, with an eye to deciding which features to keep and which to drop, and what methods to use. There are a lot of methods out there, but we'll talk about only two. One is very simple, one is a bit more technical.
Before I proceed, I want to mention Edward Tufte, in case you have not heard of him. He is the doyen of visualization. The New York Times called him the Leonardo da Vinci of data, and Bloomberg called him the Galileo of graphics. These are nice alliterations, but you should actually look at what he says: often the best way to describe, explore, and summarize a set of numbers is to draw pictures. Nobody denies that.
Very simply, think of a flat map. You can go to the web and look up a topographic map. Type in the word "topography" and go to Wikipedia: here is a topographic map, and what you see is the same two-dimensional map coming alive with features. These features help you differentiate: here are mountains, here are rivers, here's a road, here's a desert, here are trees. You would like to do the same thing with data, so how do you do that? Obviously we are simplifying. We are taking something multi-dimensional, putting it down on a flat piece of paper, and pretending that it conveys the same information.
So how many different ways are there to visualize data? As many as you want, and people keep finding new ones. We've already seen histograms and distributions, which lead into a topic called density estimation; we will not go there, but you know what they are. We also saw covariance: taking the variables two at a time and normalizing the covariance gives you the correlation, a number between minus one and one that tells you the strength and the direction of the relationship.
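As a quick reminder of that last point, here is a minimal sketch of the pairwise number in question, assuming the normalized form (the Pearson correlation) is what is meant. The function name and the toy values are made up for illustration; they are not from the lecture data.

```python
import math

def pearson_correlation(xs, ys):
    """Covariance of xs and ys divided by the product of their standard deviations.

    The 1/n factors cancel between numerator and denominator, so plain sums are used.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy values (hypothetical): one variable rising with x, one falling with x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
rising = [2.0, 4.0, 6.0, 8.0, 10.0]
falling = [10.0, 8.0, 6.0, 4.0, 2.0]

r_up = pearson_correlation(x, rising)     # perfect positive relationship
r_down = pearson_correlation(x, falling)  # perfect negative relationship
```

The sign gives the direction of the relationship and the magnitude gives its strength; raw covariance has the same sign but is not bounded.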
Today we're going to talk about two things: the scatter plot, this innocent thing that is usually very informative, and a more complicated way of looking at data called principal component analysis (PCA), which allows you to reduce the number of features. There are also three advanced techniques; I will say a little more about them towards the end of this module, and they are something you may like to pursue on your own.
So let's take scatter plots. Whether or not you have seen it before, this is the Iris dataset. It is a dataset on three species of iris. It was introduced by Fisher in a 1936 paper, and it is also called Anderson's Iris dataset because Anderson collected the data. I can't pronounce the species very well: one is called setosa, another versicolor, and the third virginica. The interesting thing is that two of them were collected from the same pasture, picked on the same day, and measured at the same time by the same person, which is fascinating.
You've got 50 samples of each species, and each flower was measured on four dimensions: sepal length and sepal width, and petal length and petal width. I have to apologize, I keep mixing the two up; when I want to say petal I say sepal, and that's going to happen. So we have four variables measured on each flower and 150 observations.
How does it look? Let's say I do a scatter plot. The first scatter plot is not as meaningful. It shows sepal length versus sepal width, but if you stare at it you already see two clusters. So even without knowing what the data is, you can start saying, "hey, there are essentially two classes of iris here". The second scatter plot is petal length versus petal width.
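The two plots just described can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used to make the slides: the column order, the names `FEATURES`, `ROWS`, `feature_pair`, and `scatter` are all hypothetical, only six of the 150 rows are listed, and matplotlib is assumed to be installed if you call `scatter`.

```python
# Assumed column order: sepal length, sepal width, petal length, petal width (cm), species.
FEATURES = ["sepal length", "sepal width", "petal length", "petal width"]

# Six rows only, two per species, for illustration; the full dataset has 150 rows.
ROWS = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (6.4, 3.2, 4.5, 1.5, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
    (5.8, 2.7, 5.1, 1.9, "virginica"),
]

def feature_pair(rows, x_name, y_name):
    """Pull out two of the four measurements as parallel lists, ignoring the species label."""
    xi, yi = FEATURES.index(x_name), FEATURES.index(y_name)
    xs = [row[xi] for row in rows]
    ys = [row[yi] for row in rows]
    return xs, ys

def scatter(rows, x_name, y_name):
    """Draw one unlabeled scatter plot; matplotlib is imported lazily, only when plotting."""
    import matplotlib.pyplot as plt
    xs, ys = feature_pair(rows, x_name, y_name)
    plt.scatter(xs, ys)
    plt.xlabel(f"{x_name} (cm)")
    plt.ylabel(f"{y_name} (cm)")
    plt.show()
```

Calling `scatter(ROWS, "sepal length", "sepal width")` and then `scatter(ROWS, "petal length", "petal width")` gives the two views in the order discussed; with the full 150 rows the grouping is much easier to judge than with six points.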
In that second plot you clearly say, "wow, there are two clusters in it", and you suspect that the data is telling you something. So from four dimensions I've reduced it to two, we've started to look at the data, and although we can pair the features up in many ways, two clusters seem to be emerging. Then we say, "hey, wait a minute, there are three species here". When you start labeling the points, and I hope you can see the colors, the blue at the bottom is one species, the red in the middle is another, and the gray at the top is a third. If you didn't know which point belongs to which species, you wouldn't see three clusters, but now you can almost see cluster one, cluster two, cluster three.
So the point is this: without the labels you would think there are two clusters, and with the labels you know there are three. That's the difference between knowing what the object is, which is supervised learning, and not knowing what the object is, which is unsupervised learning. We will use this as a very small example of how labels help you separate the groups. But again, it's a scatter plot, and it's useful.
Here is the same thing for every pair. Remember I have four variables, so I can choose two of them in six different ways: four times three divided by one times two, that is, four choose two. You can look at each of these scatter plots, and you can imagine that no scatter plot by itself tells you there are three clusters until you put color on it.
These kinds of plots are useful as a way of visualizing, but they have limitations. First, to be clear, they are a powerful visualization tool and I would definitely suggest we use them. Second, the problem is that they are limited to two or three dimensions. Maybe in the next module I'll show you a three-dimensional dataset where we can peek at it from different angles and start seeing clusters.
Third, I don't think anybody can peek into four dimensions easily. I haven't heard of anybody who can, though I once had a friend who talked of five dimensions. So obviously that is a limitation.
There is another important limitation. We can only think in terms of sepal width and length and petal width and length. We don't think of combining these features, because our mind doesn't work that way, but maybe there is a function that combines them, a new way of looking at them, which can reduce the dimensions. So the scatter plot is limited to the features as they have already been named.
Another limitation: say we have hundreds of features. Say you are taking body measurements, and there is wonderful data on how many measurements you can take of a body, and let's say we took 100 of them. How many scatter plots is that? Roughly 100 times 100. Even if you count each pair only once, that is 100 choose 2, or 4,950 plots, and close to 10,000 if you count both orientations. So you are going to sit there looking at thousands of scatter plots; will you see anything in the data? Not possible.
So what is the solution? I think the solution is to use a lens and project this high-dimensional data into a lower-dimensional space, or maybe onto this table. From three dimensions you can project into two, and we have the four-dimensional Iris data, which we are going to look at next, and I want to project it into two dimensions.
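Two points from this passage can be checked in a few lines: how fast the number of pairwise scatter plots grows, and what projecting a four-dimensional row down to two numbers means mechanically. This is only a sketch. The two directions below are picked by hand purely for illustration (they are an assumption, not the lens the lecture goes on to use), and the names `project` and `DIRECTIONS` are hypothetical.

```python
from itertools import combinations
from math import comb

FEATURES = ["sepal length", "sepal width", "petal length", "petal width"]

# Each unordered pair of features is one scatter plot.
pairs = list(combinations(FEATURES, 2))
plots_for_4 = len(pairs)        # four choose two
plots_for_100 = comb(100, 2)    # distinct pairs among 100 features

def project(row, directions):
    """Map one row to len(directions) numbers, each a weighted sum of the original features."""
    return tuple(
        sum(weight * value for weight, value in zip(direction, row))
        for direction in directions
    )

# Hand-picked, hypothetical directions: each new coordinate combines two original features.
DIRECTIONS = [
    (1.0, 1.0, 0.0, 0.0),  # sepal length + sepal width
    (0.0, 0.0, 1.0, 1.0),  # petal length + petal width
]

flower = (5.1, 3.5, 1.4, 0.2)           # one setosa row, four measurements in cm
point_2d = project(flower, DIRECTIONS)  # the same flower described by two numbers
```

The design point is that each new coordinate is a combination of the named features rather than one of them, which is exactly what a single scatter plot of the original columns cannot show. A method such as PCA would choose the directions from the data instead of by hand.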