We're going to look at a small example in Rattle. It's a toy example; we could use a much bigger data set. This is data from a book, Applied Multivariate Statistical Analysis, sixth edition, by Richard Johnson and Dean Wichern. Okay. Now, this data is used in many ways, and you can get it from many university websites. We don't have any preconceived notion that Illinois is the best university in the world. You know it, I know it, we don't have to tell anybody, right? What we really want to do is this: let's say I have data on different universities, private, public [inaudible], collected on many features, and what we try to see is how these universities are grouped in some way. But we are really not going to say anything about it.

So let's say I collected data. I have, of course, the name of the university; the state the university is in; the SAT scores, which are the standardized test scores for admitting freshmen; and what percentage of freshmen were in the top 10 percent of their high school class, right? Then the acceptance ratio, which is high in some places and low in other places. It's the percentage of people who apply who get accepted. Then you have the student-faculty ratio, which is an important measure of the quality of education according to some people. Then the expenses, which of course have become an interesting topic of discussion: what are the annual expenses? There is of course the question of how you measure them, but that's a different story altogether, because it is tuition plus residence plus books and various other things, right? And finally, the graduation rate. When I was a young professor, I thought everybody graduated; being a professor who didn't fail too many students, I thought everybody graduated on time. Shockingly, it's not true. There are places where people don't graduate, for many reasons. They may shift.
They may leave the program. They may go to another university. They may simply drop out. This is shocking news to some of you, at least. To me it was, when I first learned it. But then it's a measure of how good a university is. Here are some universities whose names you would know, right? I've just shown you the top five or six universities in our data. Okay.

So how do you run principal component analysis in Rattle? I'll show you in the software as well, because it's something we have not done before. We're going to read this data, which is given to you in a file called universities.csv. We're not going to partition it. I could have simply unchecked Partition, but instead I set the partition to 100/0/0, which keeps all the data. As we know, principal component analysis can be done only on numerical variables, so I'm going to choose just the numerical variables. The name of the university is an identifier, as you can see. I'm going to ignore the state because it's a categorical variable, and the target variable is the graduation rate, so I'm not going to use that. I'm only going to do the principal component analysis on the numerical variables: SAT scores, top 10, acceptance rate, student-faculty ratio, and expenses.

When I run it, it extracts five components. It calls them PC1, PC2, PC3, PC4, PC5: principal components 1, 2, 3, 4, 5. It reports five, although it turns out that not all five dimensions are necessary. What it has done is rotate the data into five different components. Component one, you can see, has different values: it's 0.49 of the SAT score, 0.45 of the top 10, minus 0.44 of the acceptance ratio, minus 0.43 of the student-faculty ratio, and 0.41 of the expenses. So this is the component.
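To make the idea of a component as a weighted sum concrete, here is a minimal sketch that applies the PC1 loadings quoted above to one made-up university. The column names and the standardized values are hypothetical, chosen only for illustration; they are not taken from universities.csv, and this is not how Rattle computes scores internally.

```python
# PC1 loadings as read out in the lecture. The column names are
# hypothetical labels, not necessarily those in universities.csv.
pc1_loadings = {
    "SAT": 0.49,
    "Top10": 0.45,
    "Accept": -0.44,
    "SFRatio": -0.43,
    "Expenses": 0.41,
}


def component_score(loadings, z_values):
    """Weighted sum of standardized values, one weight per variable."""
    return sum(loadings[name] * z_values[name] for name in loadings)


# A made-up university, expressed in standard deviations from the mean:
# above average on SAT, top 10 and expenses, below average on
# acceptance rate and student-faculty ratio.
selective = {
    "SAT": 1.0,
    "Top10": 1.0,
    "Accept": -1.0,
    "SFRatio": -1.0,
    "Expenses": 1.0,
}

score = component_score(pc1_loadings, selective)

# A loading vector should have length close to 1; the quoted values
# are rounded, so the squared length is only approximately 1.
squared_length = sum(w * w for w in pc1_loadings.values())

print("PC1 score:", round(score, 2))
print("Squared length of loadings:", round(squared_length, 4))
```

Because every sign in the made-up profile matches the sign of its loading, all five terms add up and the score is large and positive. Flipping the signs of the profile would give an equally large negative score.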
It is, if you will, measuring high SAT scores, and it puts a positive value on having students from the top 10 percent of their high school class at the university. It penalizes the acceptance rate, it puts a penalty on a higher student-faculty ratio, and it says this is an expensive university. I'm not going to label universities based on that, but you can get an idea of what kind of university this is, right?

The second principal component, which measures another kind of variation, has a negative loading on SAT scores. It has a negative loading on the top 10. It has a positive loading on acceptance, right? So acceptance loads more on this component. It has a negative loading on the student-faculty ratio. So this is a university which is probably admitting students but treating them very, very well, keeping small class sizes. And of course it is more expensive; look at the loading on expenses. So people who can afford it go to this university. Okay? And so forth.

So each of these principal components is a linear function of the five variables. I have decomposed this data into five different directions. Maybe the first direction is research; the second direction is nurturing. You could give a label to each; you can dispute it, but each of these components could have a label. We're trying to predict the graduation rate later on, but each of these components accounts for a certain fraction of the variability in the data. Take the first component: not everybody is a research university, but it's a component for how researchy you are, right? If you look at the bottom of the output, that component accounts for 80 percent of the variation in the data. The second component is how nurturing you are, let's say. I'm just giving it a label for discussion, no particular reason beyond that, right?
That accounts for an additional 12 percent of the variation in the data. So the first two components account for 92 percent of the variation that you see in the data. If you add up the five components together you get all of it; the last component, of course, is accounting for only 0.53 percent of the variation. It may be important sometimes, because that may be the component which actually makes a difference to whether you attend a university or not. Okay. So that's why we might label the data, but there we may get into value discussions. In the end, the computer doesn't know about values; it is just saying it has the data and has decomposed it into five components. The first one accounts for most of the variation. You may think it's research, but you don't know what it is; it is just a component. The second component after that accounts for 12 percent, the next one accounts for 4 percent, and so forth, right? So the caution is that you may need to study this carefully before you try to label it. Labeling always has its own hazards. Okay.
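For readers who want to see where the loadings and the percentage-of-variation figures come from, the following is a minimal sketch in Python with NumPy. It is not Rattle's implementation. It assumes the variables are centered and scaled to unit variance before the decomposition, and it runs on randomly generated placeholder numbers rather than universities.csv, so its output will not match the 80, 12 and 4 percent quoted in the lecture. The function name `pca_on_standardized` is made up for this example.

```python
import numpy as np


def pca_on_standardized(X):
    """Return (loadings, variance_fraction) for the columns of X.

    Assumption of this sketch: each column is centered and scaled to
    unit variance first, so the decomposition is of the correlation
    matrix. Column j of `loadings` holds the weights of component j+1.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = (Z.T @ Z) / (len(Z) - 1)
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvecs, eigvals / eigvals.sum()


# Placeholder data only: five correlated columns driven by one hidden
# factor plus noise. These are NOT the university measurements.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(25, 1))
signs = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
X = hidden * signs + 0.5 * rng.normal(size=(25, 5))

loadings, var_frac = pca_on_standardized(X)
cumulative = np.cumsum(var_frac)

for i, (f, c) in enumerate(zip(var_frac, cumulative), start=1):
    print(f"PC{i}: {100 * f:5.1f}% of variance, {100 * c:5.1f}% cumulative")
```

The cumulative column is the same running total used above when adding the first two components to reach 92 percent. The overall sign of each column of loadings is arbitrary, so a component may come out with every sign flipped relative to another tool's output.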