Now, I want to show you this in Rattle and then move forward to see what we can do with it. As usual, it's good practice to clear your workspace and start fresh, so I'm going to open Rattle. There you go. This is easy to use, as you can see. I'm going to load this file, and it is deep inside my computer as usual. Sorry. It's right there. Okay. So here is your CSV file and I open it. There it is, and I execute. I'm going to use the entire set of data, so I'll make the partition 100/0/0. I could also have said "don't partition" by simply unchecking this box, like that, or I could have done it this way. Here, university is an identifier. Next, state I'm going to ignore because it's categoric. And what else? The graduation rate is the target variable. I want to do a principal component analysis on the predictor variables, which are SAT, Top 10, acceptance rate, student-faculty ratio, and expenses, and I hit Execute. It has loaded the data. If you look at Explore, very strangely, very interestingly, principal components sits under Explore, and that's the right place for it to sit. You've got a principal components option. It tells you there are two ways of doing principal components, and centering the data and scaling it is always a good thing; we will see this in R in a minute. We execute. There it goes. So I have reproduced this result. As we saw, the first component accounts for 80 percent of the variance, the second adds another 12 percent (92 percent cumulative), the next about four percent more, and so forth. What can I do with that? From this, we learned that the first two components account for 92 percent of the variance in the data. So instead of using five features of the university, I can just use two, and I will demonstrate. What I'm going to do is open an R file, a script file again, which will allow you to play around a little bit with it and do a little exercise. So here, I'm going to Data_Session_4.
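A quick way to see where the "80 percent, then 92 percent cumulative" reading comes from: each component's share of the variance is its squared standard deviation divided by the total, and a cumulative sum gives the running totals that Rattle displays. This sketch uses R's built-in USArrests dataset as a stand-in, since the universities CSV is not bundled with these notes.

```r
# Variance shares from prcomp: sdev^2 / sum(sdev^2) per component,
# cumsum for the cumulative proportions shown by summary(pca).
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)
variance_share <- pca$sdev^2 / sum(pca$sdev^2)
round(variance_share, 3)          # proportion of variance per component
round(cumsum(variance_share), 3)  # cumulative proportion
```

With the universities data, the first two entries of the cumulative vector would be roughly 0.80 and 0.92.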
In this folder, you see an R file called PCA, or Principal Component Analysis, and I'm going to open that. Here is the source file. It works the same way: when I first run it, it gets the working directory. As you can see, the working directory is here, and that will be your working directory, and then it reads the data. Now notice, the universities data is in my working directory. For this command to work, you had better move your universities CSV file to your working directory. The next command does the principal component analysis. Notice that I have specified that centering is true and scaling is true. Basically, what that says is: center the data by removing the mean, and scale each feature by dividing it by its standard deviation. And it is saving the components. If you notice here on the right, you can see how it has created the principal components. So here it is saving the results of all the components in the next command. Here are the component scores of each university: this is component one of university one, component two of university one, and so on. So what this command has done is run the PCA and extracted the components of each of the universities. The next command adds the components to the original variables. If you look at it, this dataset, university PCA, has 13 variables. Why 13? You had eight initial variables and we have added five new ones. That's what it does. The next command is summary. The summary is what you just saw: 80 percent of the variance from the first principal component, and 12 additional percent from the second. So this command does exactly what Rattle did. The next command is where the rubber hits the road. First, we're going to run a regression model, which is here, and we are going to use all the variables and not the principal components.
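The steps described above can be sketched as follows. The real script reads the universities CSV from the working directory; here a small made-up data frame stands in for it, and the column names are assumptions based on the features named in the lecture.

```r
# Stand-in for the universities data (made-up numbers, assumed column names).
set.seed(1)
universities <- data.frame(
  SAT        = rnorm(50, 1200, 100),
  Top10      = rnorm(50, 60, 15),
  AcceptRate = rnorm(50, 40, 10),
  SFRatio    = rnorm(50, 12, 3),
  Expenses   = rnorm(50, 20000, 5000)
)

# center = TRUE removes each column's mean; scale. = TRUE divides each column
# by its standard deviation, so no feature dominates because of its units.
pca <- prcomp(universities, center = TRUE, scale. = TRUE)

# pca$x holds the scores: row i, column j is component j of university i.
head(pca$x)

# Append the five component scores to the original variables, as the script
# does (with the lecture's eight original variables this gives 8 + 5 = 13).
universities_pca <- cbind(universities, pca$x)
ncol(universities_pca)

summary(pca)  # proportion of variance explained by each component
```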
So basically, first I run a regression model with all the variables, no principal components. You will see that this regression has an adjusted R-squared of 63 percent. What is the model we are running? We are regressing the graduation rate on these five features; remember, graduation rate is our target variable. By the way, if you look at the regression, you might expect SAT to be important, and you might expect expenses to be important. Well, if they pay more, you graduate? No. Expenses actually has a negative coefficient, and its t-value is not significant. The only significant variable in this whole regression is acceptance rate. SAT is not significant, Top 10 is not significant, student-faculty ratio is not significant, and that makes me feel not too happy about things. The thing that matters is acceptance rate. So it says the graduation rate is better if you accept fewer students. Okay, fine. Well, that's enough philosophy here. Let's go on. Let's run a model. This time, we estimate the graduation rate as a function of just the first principal component. So if you look at it, here is just the first principal component. I'm running a model which regresses the graduation rate against the component we called "research genius." You can see that it has an adjusted R-squared of 52 percent and it is highly significant. That shouldn't surprise you: the component that captures the most variability in the data is principal component one, and when you run it against graduation rate, it says the higher the research genius, the lower the graduation rate. I don't like it, but that's what it says. Maybe in research universities they just flunk their students. Don't go to them. Next, I run the model with two principal components. If you can see this, here I have PC1 plus PC2, and this is the way we write it in R. When you have leisure, look at these commands.
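The two regressions compared above can be sketched like this. The data here are made up, and the variable names (GradRate, SAT, and so on) are assumptions from the lecture, so the coefficients will not match the lecture's numbers.

```r
# Stand-in data (made-up; names assumed from the lecture).
set.seed(2)
n <- 60
d <- data.frame(
  SAT        = rnorm(n, 1200, 100),
  Top10      = rnorm(n, 60, 15),
  AcceptRate = rnorm(n, 40, 10),
  SFRatio    = rnorm(n, 12, 3),
  Expenses   = rnorm(n, 20000, 5000)
)
d$GradRate <- 90 - 0.5 * d$AcceptRate + rnorm(n, 0, 5)

# Model 1: graduation rate on all five raw features.
full <- lm(GradRate ~ SAT + Top10 + AcceptRate + SFRatio + Expenses, data = d)
summary(full)$adj.r.squared  # the lecture reports about 63% on the real data

# Model 2: graduation rate on the first principal component only.
pca <- prcomp(d[, c("SAT", "Top10", "AcceptRate", "SFRatio", "Expenses")],
              center = TRUE, scale. = TRUE)
d$PC1 <- pca$x[, 1]
one_pc <- lm(GradRate ~ PC1, data = d)
summary(one_pc)  # t-values and p-values for the single component
```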
If you don't have time, don't worry about it. Very interestingly, you get a model that is almost as good as the one with all five variables. This model has an adjusted R-squared of 64 percent, and it is telling you something very interesting: both variables are significant. You can see the t-values are big, the p-values are small, and the R-squared is high. Two components are doing the job of five variables; we have reduced the features. And it's saying: if you go to a university that is high on principal component two, the chances of graduating are higher. Makes sense. Maybe it's a nurturing university, and because it scores higher on the nurturing component, you tend to graduate better. Now it's interesting which regression you like; I'll leave that to you. What we are trying to say is that these two components, by themselves, are able to do the job of five variables. When we used five variables, we found acceptance rate was the only significant criterion, but when we look at this, we get a more nuanced view. It says it's not just acceptance, but how the university loads on component number two. It's such a small dataset, so don't make a big deal of it.
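The two-component model the lecture ends on can be sketched like this, again on stand-in data with assumed variable names; `GradRate ~ PC1 + PC2` is the R formula syntax the lecture points at.

```r
# Stand-in data (made-up; names assumed from the lecture).
set.seed(3)
n <- 60
X <- data.frame(
  SAT        = rnorm(n, 1200, 100),
  Top10      = rnorm(n, 60, 15),
  AcceptRate = rnorm(n, 40, 10),
  SFRatio    = rnorm(n, 12, 3),
  Expenses   = rnorm(n, 20000, 5000)
)
GradRate <- 90 - 0.5 * X$AcceptRate + 0.01 * X$SAT + rnorm(n, 0, 5)

# Extract the first two component scores and regress the target on them.
pca <- prcomp(X, center = TRUE, scale. = TRUE)
d <- data.frame(GradRate, PC1 = pca$x[, 1], PC2 = pca$x[, 2])
two_pc <- lm(GradRate ~ PC1 + PC2, data = d)

summary(two_pc)$adj.r.squared  # compare against the full five-variable model
```

On the lecture's real data this adjusted R-squared came out around 64 percent, about the same as the five-variable model, which is the whole point of the dimensionality reduction.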