Hello, everyone. Right now we're going to go through the second part of the case study. We're actually going to look at what that model fit produced. Using a Bayesian framework, we're going to see if we can tease out some important facts about the way that the model fit, and maybe about our parameter estimates. So we're going to walk through, once again, our multilevel regression problem, right? We can do a lot of different things within this framework. And sometimes, when we get all of our parameter estimates at the end, we need to massage them a little bit to figure out what we can do and what we can estimate from them. So the first thing that we need to realize is that, just like in the intro, whenever we fit models in a Bayesian sense, we do not get point estimates, okay? If I were fitting these in the usual frequentist sense, I'd get my R output or my Python output, and the output in that scenario would look something like this. It would say, hey, your best estimate of the intercept is probably 0, right? Your best estimate of the IQ coefficient is probably 0.61, same idea with the age coefficient. So for that one, if I calculate the mean, for argument's sake, let's make it 0.45. And for the standard deviation of the error, for argument's sake, let's just make it 18.2. So this is what I would expect to get whenever I do a frequentist analysis: I get some point estimate, and then I get something called the standard error. So I can end up saying the standard error for this guy is probably 0.05, right? This would be my estimate, and this would be my standard error. In a Bayesian framework, things are kind of flipped on their head. And the reason why is that we now have distributions on the parameters, instead of just point estimates from maximum likelihood. And because of this, we get a much richer description of the fit.
We can notice, too, that this is actually a histogram instead of a density. And the reason why is that we use sampling techniques to actually fit this model, instead of a lot of math, okay? So this is one of the benefits of using a computational paradigm: we can end up exploiting it to create these really fancy models. And we'll see how this can be used in the next set of videos to do more complex analysis. But for right now, we simply need to know that we get distributions on our parameters. So the distribution of the intercept looks like it's between -2 and 2, centered at about 0. This guy should come down here; it's centered at about 0, and I could calculate a bunch of statistics on it. This should not be a surprise, because at the end of the day, I centered my data, and as a result, I expect that my intercept should be right around 0. My IQ coefficient is kind of interesting. It's saying that if I were to take a mother's IQ and multiply it, on average, by about 0.6, that's going to be one of my best estimates of the child's IQ. We can see that this estimate varies from 0.5 to 0.7, right? So if I have: child IQ = intercept + beta 1 × mom's IQ + beta 2 × mother's age, I would put some number between 0.5 and 0.7 as my best estimate of this beta 1 term, and something between -2 and 2 for the intercept. And for my age coefficient, we originally expressed the idea that, hey, maybe 0 is a reasonable value for this guy. So we can see that the IQ coefficient is above 0. We can see that, on average, there's not a perfect relationship between the mother's IQ and the child's, but on average, the coefficient is between 0.5 and 0.7. The age coefficient, on the other hand, we can see that 0 is actually a pretty reasonable value. It's still well within the bounds of my distribution, thus there could be no relationship.
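To make this concrete, here's a minimal sketch of how you'd summarize posterior draws like these. The draws below are simulated stand-ins for the sampler output, centered around the values mentioned above (intercept near 0, IQ coefficient near 0.61, age coefficient around 0.45 but wide); they are illustrative, not the actual case-study fit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior draws standing in for the sampler output
# (illustrative values only, not the real case-study fit).
draws = {
    "intercept": rng.normal(0.0, 1.0, 4000),    # centered data -> near 0
    "beta_iq":   rng.normal(0.61, 0.05, 4000),  # IQ coefficient, 0.5-0.7
    "beta_age":  rng.normal(0.45, 0.35, 4000),  # wide, overlaps 0
}

# With a whole distribution per parameter, we can report any summary we
# like: here, a posterior mean and a 95% credible interval for each.
for name, samples in draws.items():
    lo, hi = np.percentile(samples, [2.5, 97.5])
    print(f"{name}: mean={samples.mean():.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

The point is that the interval comes straight from the draws themselves, rather than from an asymptotic standard-error formula.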
But in general, we don't have much certainty about this parameter, right? Unlike in the frequentist case, where we use standard errors to gauge how wide our estimates are, in the posterior distribution we can actually see, hey, this number should probably be between 0 and 1, but at the end of the day, we don't have much certainty over it. And as a result, my age coefficient is going to be quite wide. The standard deviation of the error, at the end of the day, is telling me how far, on average, my actual observation was from my mean estimate, so how far from my child-IQ-hat. And we can see that it has an average of, I think we said, 18.2 a little bit earlier. What this is telling me is, if I end up predicting that something is, let's say, 85, and here's my distribution, then the standard deviation on this guy will be roughly 18.2, and that's going to be my uncertainty. So that's what these distributions are telling me, and we can see that this is really powerful. In a frequentist analysis, we only get one number in this thing called a standard error. In a Bayesian analysis, we get an entire distribution on the values. And this allows me to do something really cool, which we'll see in a few slides, which is to calculate distributions on my estimates. Another thing that we can do is plot the samples against each other. We can try to see if there's any correlation, perhaps, between the IQ coefficient and the age coefficient. So here are my three histograms smoothed out. And we can see that because the joint plots are all relatively cloud-like, kind of stretched out, we don't see much correlation, to be totally honest, between any of the coefficients. That means that we probably don't have collinearity in this model. This is actually a pretty good model.
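The "cloud-like" joint plots can also be checked numerically: with the posterior draws in hand, the correlation between the coefficient samples should be near 0. A small sketch, using independent stand-in draws (illustrative values, not the real fit):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in posterior draws for the two slope coefficients, drawn
# independently to mimic the stretched-out, uncorrelated clouds.
beta_iq = rng.normal(0.61, 0.05, 4000)
beta_age = rng.normal(0.45, 0.35, 4000)

# A correlation near 0 between the draws matches the cloud-like joint
# plots, suggesting little collinearity between the coefficients.
r = np.corrcoef(beta_iq, beta_age)[0, 1]
print(f"correlation between IQ and age coefficient draws: {r:.3f}")
```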
The lack of multicollinearity is pretty evident, because these coefficients themselves are not correlated in a positive or a negative way. And this tells us a lot about how the coefficients interact with each other. In order to understand what's going on, and before we do something called predictives, we need to think about what's actually happening in one of these analyses. So what we do in a Bayesian analysis is we take a bunch of data, and I have some belief about a parameter, right? Let's say that this is the coefficient on IQ. What I do is I have some belief about this guy, and I'm going to update him using my data. And this is going to give me a distribution on the IQ coefficient. I'm then going to use this distribution along with a bunch of others: the age distribution, the distribution of the intercept, the distribution of the IQ coefficient. And because I have uncertainty, I'm now going to use them kind of like a machine to get predictive data. So let's say that my boss comes up to me and says, hey, I have a mother with this IQ, this age. Let's say that the IQ is, I don't know, 100, and the age is 25. He's going to say, what's your best estimate for what the child's IQ might be? And so I'm going to ask myself this question. In a frequentist argument, I'm going to get a point estimate, one single point, and then I have some standard errors, which asymptotically tell me the error I expect on this point. In a Bayesian estimate, I'm actually going to get an entire predictive distribution.
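A rough sketch of that "machine": draw parameter values from the posterior, push the new mother's predictors through the model, and add observation noise, so the answer is a whole predictive distribution rather than a single point. The posterior draws below are assumed stand-ins, and the IQ of 100 and age of 25 are taken to be on whatever centered scale the model was fit on:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4000

# Illustrative posterior draws (not the actual fit): intercept, the two
# slopes, and the error standard deviation sigma near 18.2.
intercept = rng.normal(0.0, 1.0, n)
beta_iq = rng.normal(0.61, 0.05, n)
beta_age = rng.normal(0.45, 0.35, n)
sigma = np.abs(rng.normal(18.2, 0.7, n))

# The boss's question: mother with IQ 100, age 25. Each posterior draw
# gives one plausible prediction, so the output is a distribution.
mom_iq, mom_age = 100.0, 25.0
mu = intercept + beta_iq * mom_iq + beta_age * mom_age
pred = rng.normal(mu, sigma)  # posterior predictive draws

lo, hi = np.percentile(pred, [2.5, 97.5])
print(f"predictive mean {pred.mean():.1f}, 95% interval ({lo:.1f}, {hi:.1f})")
```

Notice that the predictive spread combines two sources: uncertainty about the parameters and the irreducible observation noise sigma.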
And this is one of the things that I want to show you right now. So what I did here was I took the first observation, and I ran the model. And once I have my estimates on all the coefficients, once I have all of these estimates right here, all I did was generate data conditional on those distributions, to determine what the distribution of my output should look like, and here's what I found. So for my first data point, here's the distribution I expect a new observation should follow, say if I were to put it through the model again. The real value is this blue line, and the 95% credible interval runs from this red line over to this red line. And we can see that it falls just outside, so in this case my 95% predictive interval did not capture what I was looking for, right? In this one, we can see my mean estimate was probably right around here, and it was pretty close to the actual value. Similarly, right here, this is the distribution around my prediction for observation 3. We can see that observation 3, once again, was inside of my interval, and observation 4 was as well. We can do this for all of the points, and we can get an idea of how well our model may actually be fitting the data, given what we've observed, okay? So the way that we think about this is, imagine that I'm going to look at, let's say, this predictive distribution right here. It goes from, let's say for argument's sake, 70, all the way up to, let's say, 140. The first thing that I'm going to do is plot 70 down here and 140 up here; these will be my bounds. My point, in this case, was just below 70, so my point goes right here, and here's my interval. And so we can see that my point fell slightly outside of it, right?
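That coverage check can be written in a couple of lines: form the 95% interval from the predictive draws and ask whether the observed value landed inside. The draws and the observed value below are made up to mimic that first observation, with an interval of roughly 70 to 140 and the true point just below the lower bound:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical predictive draws for one observation, giving a 95%
# interval of roughly 70 to 140; the observed value sits just below it.
pred = rng.normal(105.0, 17.5, 4000)
observed = 67.0

# The 95% predictive interval comes straight from the draws.
lo, hi = np.percentile(pred, [2.5, 97.5])
covered = lo <= observed <= hi
print(f"95% predictive interval ({lo:.1f}, {hi:.1f}), covered: {covered}")
```

Repeating this for every observation gives the coverage picture described above.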
Similarly, for the second one, we have a lower bound and an upper bound, but the point fell inside. I think it fell a little bit high, so we're going to put the point right here. And I can do this for every single observation across. To make this plot a little bit easier to visualize, I standardized the upper bound to 1 for every interval, and the lower bound to -1 for every interval, okay? And so what does this new plot look like? This is the normalized posterior predictive interval plot. What this tells us is, we can see the very first point fell exactly as we predicted, right? It's a little bit outside of my credible interval. My second one is up here, then the third one, the fourth one. If I go back, my second one was a little bit high, so I should see that in my predictive interval; my third one was a little bit lower, and I see that right here. And what this plot tells me is: all of the blue dots are points my 95% posterior predictive interval would have covered. It would have made the correct prediction. All the points that are outside, all of these red points, are the points that my interval would have missed. So this type of plot can give me an idea of, hey, maybe my model is not predicting things as it should. The good news is that roughly 5% of the points fall outside, so these are good predictive intervals. The one downside is that it looks like we have a lot of points on this left tail that we're missing, right? So this is something that we'll get to in a second, but this predictive distribution might be missing these lower tails, and sometimes we're committing massive errors because of this. And so this plot basically tells me, hey, having a symmetrical belief about the error, having that normal distribution, might not be correct. We can also talk about what the width of our uncertainty is.
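The normalization step is simple arithmetic: map each interval's midpoint to 0 and its bounds to plus or minus 1, so any point with |z| greater than 1 is one the interval missed. A toy sketch with hypothetical intervals and observations (the first and last points deliberately fall outside):

```python
import numpy as np

# Toy predictive intervals (lo, hi) and observed values for four points,
# mimicking the normalized posterior predictive interval plot.
lo = np.array([70.0, 60.0, 75.0, 80.0])
hi = np.array([140.0, 130.0, 145.0, 150.0])
obs = np.array([68.0, 125.0, 90.0, 160.0])

# Map each interval to [-1, 1]: midpoint -> 0, bounds -> +/-1.
mid = (hi + lo) / 2
half = (hi - lo) / 2
z = (obs - mid) / half

# Points with |z| > 1 are the ones the 95% interval would have missed.
missed = np.abs(z) > 1
print(z.round(2), missed)
```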
So let's look at the histogram of the widths of the predictive intervals, and take this one first. This predictive interval, let's say it ran from 70 to 140, so this width would've been 70. Let's say that this one was from 60 to 130; this one, again, would have been 70. What this is telling me is that for all of the intervals in this plot, the average width was right around 71 and some change. And so this tells me that I don't really have good certainty around my predictions, right? Someone ends up saying, hey, can you predict for this point? Yes, I can end up saying it's probably 100, plus or minus 35, on average, right? That's what this width of the posterior predictive intervals is telling me. And because of this, I need to be super careful whenever I'm making estimates. In a frequentist analysis, we would just have a point estimate and some idea of the standard errors. In a Bayesian analysis, we can directly see that all of my uncertainty is pretty wide, and as a result of the wideness of each of these intervals, I may not have good resolution on what my actual estimates are. So this is something that we can see directly in a Bayesian setting, and it's pretty cool that it tells us our model isn't fitting all that well. Next is the histogram of the errors. All I did was take the mean of the prediction, so let's go back to one of our predictions. For this one, let's say the mean is right here, at 98, and let's say that this point, for argument's sake, was right around 60, so the error here would have been about 38, okay? Similarly, I'm going to say that the mean of this one is roughly about 85, and this point is maybe about 98, so the error on this one would be 13. And I do this for all of the points, and I plot this histogram of errors. Two things that I notice here: number one, my model is doing a decent job of keeping most of the errors within about 15.
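Both histograms come from simple quantities: interval widths (upper bound minus lower bound) and absolute errors (observed value versus mean prediction). Using the worked numbers from above:

```python
import numpy as np

# Interval widths: upper bound minus lower bound for each prediction,
# e.g. 70-to-140 and 60-to-130 both give a width of 70.
lo = np.array([70.0, 60.0])
hi = np.array([140.0, 130.0])
widths = hi - lo

# Absolute errors: observed value vs. the mean of each prediction,
# e.g. mean 98 vs. observed 60 -> error 38; mean 85 vs. 98 -> error 13.
means = np.array([98.0, 85.0])
obs = np.array([60.0, 98.0])
errors = np.abs(obs - means)
print(widths, errors)
```

Histogramming `widths` and `errors` over every observation reproduces the two plots discussed here.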
So we can see that a pretty good majority of the data is in here. We can also see that it has this far right tail, and specifically for these two estimates, some bad things are happening, right? Some estimates miss by over 40, and this is telling me that my model may not be the best at predicting everything. It's also telling me that there are some definite outliers in the data that I need to account for, okay? This last plot, I think, is one of the most interesting. It's called the posterior predictive check, and so what is this plot telling me? If I were to plot a histogram of all of the children's IQs in the original dataset, I'm going to get this blue curve right here. So this is the distribution of the actual values in the dataset. Then, for each of my predictive intervals, let's say that I made predictions on every single data point and plotted the density of that. So this could've been one of my sets of predictions, and this could've been another set. And because I'm using a Bayesian framework, these sets have variability. What this plot can tell me is whether I have systematic bias in either direction, and it can also tell me if the variance of my estimates is approximately correct. So I see that the width of this interval is roughly equal to the width of this interval; I'm not too worried. But two things jump out at me whenever I look at this plot. Number one, the mean of all of my predictions looks to be perhaps a little bit off center from the mean of this blue distribution. Furthermore, we have this tail right here. And what this is saying is, I predict that very few people will have IQs less than 50, or in this 50 to 70 range, right? But I predict a lot more will have IQs in maybe the 70 to 85 range, right? So I'm underpredicting the amount here, but overpredicting the amount here.
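A posterior predictive check like this can also be made quantitative: draw replicated datasets from the fitted model and compare a summary statistic, such as skewness, between the real data and the replicates. The sketch below uses a made-up left-skewed "observed" sample and normal replicates to mimic the mismatch described here; none of these numbers come from the actual case study:

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up "observed" child IQs with a fat left tail, versus replicated
# datasets drawn from a normal model with matching mean and spread.
observed = 100 - rng.gamma(shape=2.0, scale=10.0, size=434)
replicates = rng.normal(observed.mean(), observed.std(), (200, 434))

def skew(x, axis=-1):
    """Sample skewness along the given axis (standardize, then cube)."""
    x = (x - x.mean(axis=axis, keepdims=True)) / x.std(axis=axis, keepdims=True)
    return (x ** 3).mean(axis=axis)

# If the model captured the shape of the data, the observed skewness
# would sit comfortably inside the distribution of replicate skewnesses.
# An extreme tail probability flags the left-tail mismatch.
p = (skew(replicates) < skew(observed)).mean()
print(f"PPC tail probability for skewness: {p:.3f}")
```

An extreme value of `p` (near 0 or 1) is exactly the "my model can't produce data shaped like this" signal that the density overlay shows visually.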
And because of this, this is telling me that my estimates may be biased. And furthermore, in general, my estimates may be pulled to the left, trying to account for this odd fat tail right here. Overall, all of my estimates are heading that way, and so this is something that I can pick up on. And just so we can get a better intuition for these plots, imagine that my predictions were perfect, right? Imagine that this was the plot of the data. If my predictions were perfect, or close to perfect, I should see a bunch of lines that very closely follow the actual data. So I should expect to see this, but instead I'm seeing this. And that tells me that the distribution I'm actually producing on results may not be the best, and that's something I may want to account for. So, a few observations about the model that we've seen so far, just from looking at all these plots. First, we didn't include the high school education variable, and this is something that we may want to include. Second, we have a reasonable number of observations that fall outside of their 95% predictive intervals. That means that, overall, our intervals are doing what we want, but the intervals themselves are a little bit wide, right? Our uncertainty about each of the estimates is pretty wide, and so we want to see if we can do a little bit better on that. We systematically overestimate IQs of about 65 to 80 and underestimate those of 90 to 115; we saw this in this plot. And furthermore, the posterior predictive check, right, shows that our model is perhaps not accounting for these children in that left tail. So if the errors actually were distributed normally, we would expect that this distribution itself, this blue curve, might actually be a little bit closer to normal, right? But in general, it seems like our predictions are getting pulled toward that left tail to try to account for it.
So next time we're going to explore updates that we can make to this model, right? We're going to look to incorporate the high school status of the mother, and work with a hierarchical structure to improve our model. We'll also want to come up with some way of addressing the skew in the data, so that we properly account for that left tail and don't have our estimates all biased downward toward it. Thank you, and I'll see you in part three.