Hey everybody. Welcome to our lecture on Task 1, which is data wrangling. In the previous video, I mentioned that you need to complete five tasks in order to complete your capstone project. In this video, I'm going to mimic what Task 1 is about, but using a different dataset. I'm going to use the German cars data that I downloaded from Kaggle and saved to my local machine. It's a CSV file, so I'm reading it as a CSV, and then I can look at the head, the first five rows. I have mileage, make, model, and so on.

Now, the first thing I want to do is check whether there are any missing values, and then add all of the missing values together. If I call isnull, True means there is a missing value and False means there is no missing value. Let me look at the tail. There are some missing values here, as you can see, but the full True/False table shows too many dots, so let me do it with sum. The model, the gear, and the hp columns have missing data. Next, we need to check the shape of this dataset. We've got 46,405 rows. We can drop these 142 rows, which will not affect our data too much, or we can replace the missing data with something that we feel comfortable with. For simplicity, I'm just going to drop the NAs, and I'm going to drop by rows. I call dropna with inplace=True to make it permanent, and if I check again, none of the columns have missing data.

The next thing that I want to do is get the age of the car. You see the year column: this is the year the car was built. To get how old the car is, I need today's year minus the year the car was built. I build a datetime object here, and then I grab the year from that. We are in 2022. Now I will create a new column called age, which is equal to this year minus the year that the car was built. If you look at the head again, you see the new column here. Now there is no need to keep the year column, so I can drop that.
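The steps so far can be sketched in pandas roughly as follows. This is a minimal sketch, not the notebook from the video: the small data frame is a made-up stand-in for the Kaggle CSV (in the lecture the data comes from reading the downloaded file), the column names `fuel`, `price`, `year` and `age` are assumptions based on what is mentioned in the lecture, and the year is fixed at 2022 so the result does not change over time.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Kaggle CSV; the values are invented for illustration.
cars = pd.DataFrame({
    "mileage": [235000, 92800, 149300, 96200, 156000, 10],
    "make": ["BMW", "Volkswagen", "SEAT", "Renault", "Peugeot", "Audi"],
    "model": ["316", "Golf", "Exeo", None, "308", "A4"],
    "fuel": ["Diesel", "Gasoline", "Gasoline", "Gasoline", "Diesel", "Gasoline"],
    "gear": ["Manual", "Manual", "Manual", "Manual", None, "Automatic"],
    "price": [6800, 6877, 6900, 6950, 6950, 38900],
    "hp": [116.0, 122.0, 160.0, 110.0, np.nan, 150.0],
    "year": [2011, 2015, 2018, 2016, 2012, 2021],
})

# Count missing values per column: isnull() gives True/False, sum() adds them up.
missing_per_column = cars.isnull().sum()
print(missing_per_column)

# Check the size before dropping anything.
n_rows_before = cars.shape[0]

# Drop every row that contains at least one missing value, permanently.
cars.dropna(inplace=True)

# Age of the car = current year minus build year.
# The lecture takes the year from a datetime object; it is hard-coded here.
this_year = 2022
cars["age"] = this_year - cars["year"]

# The build year is now redundant.
cars.drop(columns=["year"], inplace=True)

print(cars.head())
```

To follow the lecture more closely, `this_year` could instead come from `datetime.datetime.now().year`, at the cost of the ages changing from year to year.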
Then I might even drop make and model for simplicity. Let's look at describe. The mean of the price, for example, is around 16,000, and you can see the mean age here as well. The maximum age is 11 years, meaning there is no car built before 2011, so everything is from 2011 up to now. Now, the next thing that I want to do, like I said, is drop the year column. I want to drop the make and the model too. I was going to drop the fuel as well, but I want to keep the fuel type, so I won't drop that yet. Let me leave it there for now. Now we drop those, and we've got fewer columns. Next I will change the categorical data into dummy variables.

But before that, let's do some visualization. We plot price versus mileage. The y-axis is the price and the x-axis is the mileage. If you look at it, the newer cars are more expensive and the cars with more mileage are cheaper, which is not a surprise.

Now let's create our dummy variables. I get an error: df is not defined. This is cars, I believe; that is what we called our data frame, so I need to use that. Let me just select this. Our conversion is here, using our cars data frame. We're going to convert all the categorical data into dummy variables. You can see the gear column has been converted.

Now, if you look at the box plot of price versus year again, you see older cars are cheaper. The horizontal line inside each box represents the median of the price. The bottom of the box is the 25th percentile, the top of the box is the 75th percentile, and these points are outliers.

Now, the next thing that I want to do is scale the data. We take the data, subtract its mean, and divide by its standard deviation. I build a function that does that, and then I apply the function to our cars data frame. I will call the result the scaled dummy data. Let me check the shape. Now we set capital X to be everything except price. That means we are dropping the price column.
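The dummy-variable conversion, the scaling function, and the split into X and y described above might look roughly like this. It is a sketch under assumptions: the data frame is a tiny invented example with make and model already dropped, the names `standardize`, `cars_dummies` and `scaled` are hypothetical, and scaling every column (including price) is one reading of "apply the function to our data frame".

```python
import pandas as pd

# Hypothetical cleaned data frame (make, model and year already dropped).
cars = pd.DataFrame({
    "mileage": [235000, 92800, 149300, 10],
    "fuel": ["Diesel", "Gasoline", "Gasoline", "Diesel"],
    "gear": ["Manual", "Manual", "Automatic", "Automatic"],
    "price": [6800, 6877, 6900, 38900],
    "hp": [116.0, 122.0, 160.0, 150.0],
    "age": [11, 7, 4, 1],
})

# Turn each categorical column into 0/1 indicator (dummy) columns.
cars_dummies = pd.get_dummies(cars, columns=["fuel", "gear"], dtype=float)


def standardize(column: pd.Series) -> pd.Series:
    """Subtract the column mean and divide by the column standard deviation."""
    return (column - column.mean()) / column.std()


# Apply the scaling function column by column.
scaled = cars_dummies.apply(standardize)

# X is every column except price; y is the price column only.
X = scaled.drop(columns=["price"])
y = scaled["price"]

print(X.shape, y.shape)
```

After this step each column has mean 0 and standard deviation 1, which puts mileage, horsepower, and the dummy columns on a comparable scale.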
Now X will have nine columns only, and I will set y to be the price column only, so y should have only one column. The last graph that we want to do is a three-dimensional plot. Now we have that. If you look at it, this is in three dimensions. This is what Plotly allows you to do: you get an interactive graph, and you can zoom in and zoom out. Let's go back to our Jupyter Notebook.

In this section, we learned how to read our dataset into a pandas DataFrame. Once we did that, we were able to get some insight from the data: getting some info, describing it, checking for missing data and getting rid of that missing data, doing some visualization, and then scaling the dataset at the end. We did our two-dimensional plot and a three-dimensional plot as well. Thanks, everybody. I will see you in the next video, where we're going to talk about principal component analysis, one of the most important dimensionality reduction techniques for datasets. Thank you.