Welcome to session three of the third week of Modeling Risk and Realities. I'm Senthil Veeraraghavan again, a faculty member in the Operations, Information and Decisions Department at the Wharton School. In this session we will explore fitting some common distributions, such as the normal distribution, to data. We'll think about the parameters of the distribution and about goodness-of-fit tests, for example the Chi-Square test. So far, we looked at data visualization in session one, and in session two we looked at a variety of discrete and continuous distributions. In this session we're going to focus on how well a distribution fits. We will look at hypothesis testing and goodness of fit. Let's go forward.

Fitting distributions to data. We made a case in the previous sessions that it is important to fit the right distribution by visualizing the data. In fact, we generated two histograms for our two datasets. The Dataset1_histogram.xls file and the Dataset2_histogram.xls file are available on your course website. Now we can use those files to test goodness of fit. Before we do that, let's understand the concepts behind testing how well a distribution fits a data set.

Goodness-of-fit tests. After evaluating the histograms and the summary statistics, such as the mean and the standard deviation, we can now explore distributions that provide a good fit to our data set. Goodness-of-fit tests provide statistical evidence to test hypotheses about the nature of the distribution that fits our data. Two of the popular statistical goodness-of-fit tests are the Chi-Square test and the Kolmogorov-Smirnov test. The Chi-Square test is named for the Greek letter chi, as in the χ² statistic. The Kolmogorov-Smirnov test is named after two Russian mathematicians, Kolmogorov and Smirnov, who made fundamental contributions to probability. The Anderson-Darling test, named after two American mathematicians, Anderson and Darling, is another test that is used less frequently. In this course, we will focus on the Chi-Square test. In general, all these tests can be very tedious to apply for complex distributions, and in such cases we recommend computer software to evaluate them. However, we will run the Chi-Square test ourselves for two common distributions, the normal distribution and the uniform distribution, on our data sets.

What is a Chi-Square test? The Chi-Square test pits the following null hypothesis against an alternate hypothesis. The null hypothesis is that the sample data comes from a random variable that follows a specified distribution, such as a normal distribution or a uniform distribution. The alternate hypothesis is that the sample data does not come from the specified distribution. Note that this is a one-sided test. What do I mean by one-sided? With this test, you can disprove that the data came from a specific distribution, but you cannot prove that it came from that distribution. You can disprove that it came from a normal distribution, but you cannot categorically prove that it did come from a normal distribution.

Let's think about running the Chi-Square test on our datasets. We'll run it shortly, but first let's look at some rules of thumb for the test. Ideally, you should have at least 50 data points; otherwise the test is not as powerful. Divide your data into buckets, with at least 5 observations in each bucket.
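If you'd like to check these rules of thumb outside Excel, here is a minimal Python sketch. The 250-point sample below is synthetic and purely illustrative, standing in for one of the course datasets, and the choice of ten equal-width buckets is an assumption.

```python
import numpy as np

# Hypothetical sample: 250 points standing in for one of the course datasets.
rng = np.random.default_rng(0)
data = rng.uniform(0.09, 99.87, size=250)

# Rule of thumb 1: at least 50 data points, or the test loses power.
assert len(data) >= 50, "chi-square test is weak on small samples"

# Rule of thumb 2: bucket the data and check each bucket holds >= 5 points.
counts, edges = np.histogram(data, bins=10)
print(counts)  # observed frequency in each bucket
assert (counts >= 5).all(), "merge sparse buckets before running the test"
```

If a bucket comes up short of 5 observations, the usual remedy is to merge it with a neighboring bucket before running the test.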
Also, remember that every Chi-Square test has degrees of freedom. The degrees of freedom equal the number of buckets you're using, minus the number of parameters of the specified distribution, minus one: degrees of freedom = k − p − 1, where k is the number of buckets and p is the number of parameters. We would need more statistical detail to explain why the degrees of freedom are measured this way, but let's look at an example. Suppose you have ten buckets and you're trying to fit a normal distribution, which has two parameters, the mean and the standard deviation. The degrees of freedom here are 10 − 2 − 1 = 7. For each Chi-Square test with some degrees of freedom, we can test the null hypothesis at some confidence level, which could be set at 99% or 95%, and so on. Chi-Square confidence tables are available from lots of sources; for example, see the table at the link that I've provided.

We will explore Chi-Square tests on our two datasets, using the Dataset1_histogram and Dataset2_histogram files. We used Dataset 1 in session one, and we generated the histogram. The histogram gave us two curves: the pdf, the probability density function, which we saw in week two, session two, shown in the blue bar chart; and the cumulative distribution function, which gives us accumulated values, shown in the red curve. Just visualizing the pdf, it looks pretty flat, like a uniform distribution. Therefore we run a Chi-Square test for the uniform distribution, and for that test we are going to use the min and max values from the data. You can also try to fit a normal distribution, and you will see that the normal distribution fits poorly compared to the uniform distribution. So, what uniform distribution are we going to use? We're going to use the uniform distribution with the minimum value of 0.09 and the maximum value of 99.87 that we saw in the dataset. So there are two parameters to this uniform distribution. Recall that our null hypothesis is that the data comes from a random variable that follows this uniform distribution. We have 7 degrees of freedom, because there are 10 bins and 2 parameters, and 10 − 2 − 1 = 7. In the Excel video, I show you how to run the Chi-Square test.

We have the Dataset1 histogram file now, and we are going to test the fit against a theoretical distribution: a uniform distribution with a minimum of 0.09 and a maximum of 99.87. For that, we first need to generate the theoretical cdf. Recall the formula from our discussion in session two: for a uniform distribution, F(x) = (x − min)/(max − min). In Excel, we take the value we're interested in, 10 in this case, minus the minimum value (which I want to fix with an absolute reference), divided by the maximum value minus the minimum value, fixing everything except the bin-edge cell. Then I calculate the theoretical CDF all the way down, remembering that the maximum point is not 100 but 99.87; at that maximum value, the cdf equals 1. So we have the cumulative distribution function; let's write it in percentages so that it's easy to view. From this, we can also generate the theoretical probability of being in each bin. For the first bin, which runs from zero to ten, the bin probability is exactly the cdf value. For the second bin, the cdf is cumulative over the first and second bins, so to calculate the theoretical probability of falling within the second bin alone, you take the second cdf value minus the first. That gets you about 10%.
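The same theoretical cdf and bin probabilities can be sketched in a few lines of Python. The minimum 0.09 and maximum 99.87 are the values from Dataset 1; the bin edges at 10, 20, ..., 100 are my assumption, matching a ten-bucket histogram.

```python
import numpy as np

lo, hi = 0.09, 99.87                  # min and max observed in Dataset 1
edges = np.arange(10.0, 101.0, 10.0)  # assumed upper edge of each of 10 bins

# Theoretical uniform cdf at each bin edge: F(x) = (x - lo) / (hi - lo),
# capped at 1 because the distribution ends at the maximum, 99.87.
cdf = np.clip((edges - lo) / (hi - lo), 0.0, 1.0)

# Probability of landing in each bin = difference of adjacent cdf values.
bin_prob = np.diff(cdf, prepend=0.0)
print(np.round(100 * bin_prob, 2))    # roughly 9.93, 10.02, ... percent
```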
And we can do this for all the bins, all the way through, and we get the theoretical bin probabilities. So, for any data point drawn from that random variable, the theoretical probability of falling in the lowest bin is 9.93%, the probability of falling in the second bin is 10.02%, and so on. This distribution is almost uniform: the theoretical distribution is uniform, but the ends are cut off at 0.09 and 99.87. So let's see what frequencies to expect for our 250 points. The 250 points fall into these bins with these probabilities, so 250 multiplied by a bin's theoretical probability gives the number of points likely to be in that bin, and so on for all the bins. Theoretically speaking, you should get about 24.8 points in the first bin, about 25 points in the second bin, about 25 in the next, and so on. The first and the last bins are slightly smaller because they're cut off not at 0 and 100 but at 0.09 and 99.87, so they have slightly smaller probabilities.

So now we have the theoretical frequency and the actual frequency from the data set, and we can run the chi-square test. I am going to write "chi-square test" here, and its value comes from a formula: CHISQ.TEST. You choose the actual frequency range and the theoretical frequency range, close the parentheses, and you get about 0.0127; rounding to three decimals, it's 0.013. That's the chi-square test value we're going to use, and there are 7 degrees of freedom here. So now we have run the chi-square test for fitting a uniform distribution to our data. Let's look at the table and see whether we are able to reject the null hypothesis. The chi-square test gives us a value of 0.013. We can look up that value for our degrees of freedom in the tables that I provided you; for example, follow the web link and you'll find that we fail to reject the null hypothesis. That is, we fail to reject the hypothesis that the data comes from a uniform distribution with a high degree of confidence. Remember, we cannot prove for sure that the data comes from a uniform distribution; we have only failed to reject the null hypothesis that it does. Remember, the chi-square test is a one-sided test.
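To summarize the whole Dataset 1 procedure, here is a hedged Python sketch using scipy.stats.chisquare in place of Excel's CHISQ.TEST (Excel's function reports the test's p-value). The observed bin counts below are placeholders, not the actual Dataset 1 frequencies, and the bin edges are the same assumption as before.

```python
import numpy as np
from scipy import stats

n = 250
lo, hi = 0.09, 99.87
edges = np.arange(10.0, 101.0, 10.0)

# Expected frequency per bin under the fitted uniform distribution.
cdf = np.clip((edges - lo) / (hi - lo), 0.0, 1.0)
expected = n * np.diff(cdf, prepend=0.0)

# Placeholder observed frequencies; use the real counts from the histogram.
observed = np.array([25, 24, 26, 25, 23, 26, 24, 25, 27, 25])

# ddof=2 removes the two estimated parameters (min and max), leaving
# 10 - 2 - 1 = 7 degrees of freedom, matching the lecture.
stat, p_value = stats.chisquare(observed, f_exp=expected, ddof=2)
print(stat, p_value)  # a large p-value means we fail to reject
```

Unlike the printed tables, scipy returns both the chi-square statistic and its p-value, so no manual lookup is needed.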
Now let's look at Dataset 2. For Dataset 2, the figure gives us a histogram with the pdf, the probability density function, in the blue bars, and the cdf, the cumulative distribution function, in the red curve. The visualization of the pdf tells us it looks like a normal distribution, so let's fit a normal distribution to this data set. We run a chi-square test for the normal distribution using the average and the standard deviation from the data. For Dataset 2, we will look at the goodness of fit of a normal distribution with the sample average of 47.2 and the standard deviation of 15.78, which we calculated from the data set. So our null hypothesis is that the data comes from a normal distribution with those two parameters as mean and standard deviation. Again, the degrees of freedom are 7. We run the chi-square test as shown in the Excel video.

In the Dataset 2 histogram file, we have the histogram that we generated in the first session of this week. It looks like a bell curve, which suggests we should check a normal distribution. So we're going to test for a normal distribution and see whether it fits our data set well; the chi-square test is a goodness-of-fit test. The first step is to derive the theoretical CDF of the normal distribution, using the formulas we saw in session two. We'll use NORMDIST. We pick a value x, then the mean of the normal distribution, 47.20, then the standard deviation, 15.78. Let's fix those cells with F4. The last argument is whether we want the cumulative distribution or the probability density; we want the cumulative, so write TRUE (or 1) to choose cumulative. We get the value 0.001, and we take the formula all the way to the last cell, which gets us 0.9999589, pretty close to one. It's not exactly one, because the normal distribution has a tail going to infinity.

The CDF gives us the cumulative value, the sum of all the bars up to that point. To calculate what falls into each individual bin, we need to subtract adjacent cumulative values, and that's what we do in the next column: we figure out the theoretical probability of falling in each bin. The first bin probability is just the probability of being below the first bin edge, which is M4, 0.001. The probability of falling in the second bin is the probability of being below the second bin edge but above the first, so M5 minus M4. We do this for every value up to the last point, and we get the bin probabilities. To make better visual sense of them, I'm going to convert them to percentages. Pick a data point at random: where is it going to fall? It has about a 25% chance of falling in the middle bin and an 18% chance of falling in the mid ranges, versus a 0.14% chance of falling in the lowest bin and a 0.29% chance of falling in the highest bin, so the shape looks like a bell curve.

With the bin probabilities, we can now generate the theoretically expected bin frequencies. We have 250 data points, and each of the 250 data points has some probability of falling in each bin, so we just multiply 250 by the bin probability. We get 0.34 for the first bin, and we can take it all the way to the last bin. So we should expect about 61 values in the middle bins and very few towards the edges. Let's compare the actual frequency with the theoretical frequency. The theoretical frequencies are not whole numbers, but that's fine for our chi-square test. Now we can run the chi-square test: CHISQ.TEST, pick the actual range of frequencies, pick the theoretical range of frequencies, and you have the chi-square test value, 0.8851. We have seven degrees of freedom in this chi-square test; we'll see that soon in the PowerPoint presentation. We take this value and check whether the normal distribution gives a good fit. You can also test the uniform distribution here, and you will see that the uniform distribution gives a poorer fit than the normal distribution. We get a value of 0.8851; again, the precise value doesn't matter much. Looking it up at the link for our degrees of freedom, we find that we fail to reject the null hypothesis that the data came from a normal distribution.
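Here is the analogous Python sketch for the normal fit. The mean 47.20 and standard deviation 15.78 are from the lecture; the ten equal-width bins, with the outer bins absorbing the distribution's infinite tails, and the observed counts are my placeholder assumptions.

```python
import numpy as np
from scipy import stats

n = 250
mu, sigma = 47.20, 15.78  # sample mean and standard deviation from Dataset 2

# Assumed bin edges for a 10-bucket histogram; the final 1.0 folds the upper
# tail into the last bin so the bin probabilities sum to exactly one.
inner_edges = np.arange(10.0, 91.0, 10.0)  # 10, 20, ..., 90
cdf = np.append(stats.norm.cdf(inner_edges, mu, sigma), 1.0)

expected = n * np.diff(cdf, prepend=0.0)

# Placeholder observed frequencies; use the real counts from Dataset 2.
observed = np.array([1, 7, 23, 45, 61, 55, 38, 15, 4, 1])

# 10 bins - 2 fitted parameters (mean, sd) - 1 = 7 degrees of freedom.
stat, p_value = stats.chisquare(observed, f_exp=expected, ddof=2)
print(stat, p_value)  # a large p-value means we fail to reject normality
```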
The tabulated Excel files are now available as Dataset1_FIT.xlsx and Dataset2_FIT.xlsx. Both files are on your course website.

The second goodness-of-fit test that we will take a look at is the Kolmogorov-Smirnov test, known simply as the K-S test. For small samples, the K-S test is more suitable. The basic idea of the Kolmogorov-Smirnov test is as follows. First, we arrange the data values in ascending order. Then, we arrange the theoretical values, derived similarly from the cumulative distribution function, in ascending order. Then we find the maximal difference between each data value and its corresponding theoretical value. If this maximal difference is low, the fit is very good: we are comparing two columns in ascending order, and if the gap between the two columns is never very high, then this is a good fit. Typically a maximal value of 0.03 or 0.04, or even lower, is considered a very good fit. (A short sketch of this comparison appears at the end of this session.)

Modeling using continuous distributions. As you can see, depending on the size and the nature of the data, modeling reality using continuous distributions, and choosing the correct distribution that fits our data, is a challenging task. Surely it is mathematically very elegant to use continuous distributions, but the approach also creates several complexities. Hence, in real life, simulation is often used, and that will be our focus in week four. Congratulations on finishing week three, and best wishes for week four.

I'm Senthil Veeraraghavan, a faculty member in the Operations, Information and Decisions Department, and you can follow me @senthil_veer. We've just completed week three of the course. Continuous distributions, such as the normal and uniform distributions, are often used to model uncertainty in the real world. This week you had an opportunity to look at fitting distributions to data for modeling future outcomes. Using continuous distributions to model the world can be elegant but complex. My colleague Sergei will be back next week to talk about how we can use simulations to model complex realities and to compare different settings. Enjoy week four.
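As promised above, here is a minimal sketch of the K-S comparison using scipy.stats.kstest. The uniform bounds are the Dataset 1 values from this session, and the 50-point sample is synthetic and purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.uniform(0.09, 99.87, size=50)  # hypothetical small sample

# kstest computes the maximal gap between the empirical cdf of the sorted
# sample and the theoretical cdf, the comparison described in the session.
lo, hi = 0.09, 99.87
stat, p_value = stats.kstest(sample, "uniform", args=(lo, hi - lo))
print(stat, p_value)  # a small statistic indicates a good fit
```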