Hello everyone. Right now we're going to go over part three of Bayesian approaches to statistics and modeling. We're going to finish up the case study, making a few edits to the model that we built in part two. In part two we focused mainly on trying to find a simple model, seeing how the fits worked, and we introduced a few plots. We had to gloss over quite a bit of the math, but the idea is that we used plots to see how our model is working and to find some places where we could try to improve it. So, the first thing that we want to take into account in our new model, to improve upon the last one, is the mom-high-school variable: did the mother of the child go to high school? Hopefully this will capture some more implicit grouping. What variables, though, should we cluster on? Should we only use high school? Should we include different IQ buckets? Are we going to do both? It depends on the belief that we're expressing. In this analysis we can cluster on a bunch of different things, and what that's implicitly saying, just like in any other multilevel or hierarchical model, is that the data comes from the same process if you're in the same group, and a slightly different process if you're coming from a different group. For this analysis we're going to keep it simple: we're only going to cluster based on the binary variable of whether the mother went to high school or not. Similarly, I'm going to bucket the mother's IQ into three ranges: an IQ of less than 85 we'll call low IQ, 85 to 115 is medium IQ, and above 115 is high IQ. Okay. So, we'll have six buckets in the mix, if you will, and we'll see how these things play out. In the model, we're going to allow each of the three parameters to vary.
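The six buckets just described can be sketched in a few lines of code. This is an illustrative sketch, not code from the case study: the array names `mom_iq` and `mom_hs` and the example values are hypothetical, while the 85/115 cutoffs come from the discussion above.

```python
import numpy as np

def assign_group(mom_iq, mom_hs):
    """Map each mother to one of six buckets: IQ bucket crossed with high-school flag.

    IQ cutoffs follow the text: <85 low, 85-115 medium, >115 high.
    Returns integer labels 0..5, computed as hs * 3 + iq_bucket.
    """
    # np.digitize with right=False puts 85 in the medium bucket and 115 in the
    # high bucket; exact boundary handling is a modeling choice.
    iq_bucket = np.digitize(mom_iq, bins=[85, 115])  # 0=low, 1=medium, 2=high
    return np.asarray(mom_hs) * 3 + iq_bucket

# hypothetical example data
mom_iq = np.array([80.0, 100.0, 120.0, 90.0])
mom_hs = np.array([0, 1, 0, 1])
print(assign_group(mom_iq, mom_hs))  # one group label per mother
```

Crossing the two variables into a single integer label keeps downstream indexing simple: each of the six intercepts and slopes can then be stored in a length-6 array and looked up by group label.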
We're going to allow for different intercepts, as well as a different coefficient on the mother's IQ and a different coefficient on the mother's age. We're going to have six of each: one for low-, medium-, and high-IQ mothers who did not go to high school, and another three for low-, medium-, and high-IQ mothers who did go to high school. So, we have six different intercept parameters, six different slopes on the age, and six different slopes on the mother's IQ, and this will hopefully bucket the data and give us a better idea of what's going on by allowing these slopes and intercepts to vary. We're also going to say, and this is going to be another assumption, that each of these intercepts comes from a common distribution itself. We're going to say the same thing about the slopes on the age, and the same with the slopes on the IQ. We're going to do this to regularize these values. In a traditional hierarchical or multilevel model, we allow these things to vary as much as they want. In this Bayesian model we can actually constrain them at an even higher level, so this is going to be almost like a three-stage model. Modeling in a Bayesian way, though, as I've said before, forces us to state a lot of our assumptions outright, and I want to make this point before we get started. Stating these assumptions, stating these priors, allows someone to challenge them. I make a hierarchical assumption on these parameters; is this belief right? That's the first question you should be asking, and the answer is, of course, not exactly. Whenever you're modeling, always ask why a decision is being made, and try to back it up. My belief is that there's a relationship between the IQ of the mother and how the age variable ends up relating to the response within each group. Each of these groups, I believe, may have a different slope, and I'm going to try to account for that belief. Okay.
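The regularizing effect of that common parent distribution can be seen in the classic partial-pooling formula: each group's estimate gets pulled toward the grand mean, and sparsely populated groups get pulled hardest. Here is a minimal numpy sketch under a simple normal-normal setup with known within-group variance `sigma2` and between-group variance `tau2`; every number below is made up for illustration and is not a fitted quantity from the case study.

```python
import numpy as np

# made-up raw group means and group sizes for six hypothetical buckets
group_means = np.array([93.0, 101.0, 112.0, 88.0, 104.0, 118.0])
group_n     = np.array([12,   90,    60,    15,   110,   40])
sigma2 = 15.0 ** 2   # assumed within-group variance
tau2   = 6.0 ** 2    # assumed between-group (parent) variance
grand_mean = group_means.mean()

# precision-weighted shrinkage: weight -> 1 as group_n grows,
# so big groups keep their own mean and small groups get pulled in
weight = (group_n / sigma2) / (group_n / sigma2 + 1.0 / tau2)
pooled = weight * group_means + (1.0 - weight) * grand_mean
print(np.round(pooled, 1))
```

This is the same qualitative behavior the hierarchical prior produces in the full Bayesian fit: groups with little data borrow strength from the rest, which is exactly what "constrain them at a higher level" means.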
We could have used different modeling techniques, we could have used a variety of other methods, but this is the one that I decided on just to keep this topic relatively approachable. At the end of the day, all models are subjective, Bayesian or not. Remember, as we increase the complexity, these are modeling choices based on our updated beliefs from our first model; they're not ground truth. So, all modeling is subjective, even in the frequentist sense; Bayesian statistics just forces us to lay a lot more out on the table. So, this is going to be the form of the model: we're going to say that child IQ is a function of my intercept term. We have six intercepts, as we can see right here; similarly, we have six different slopes on the mom's IQ, and six different slopes on the mom's age. So, if I have a mom who went to high school, so high school equals one, and her IQ is high, I'm going to have a different set of parameters. Similarly, if the mom did not go to high school, so high school equals zero, and maybe her IQ is medium, we're going to have a different set of parameters for this person as well. This allows me to have multiple groups, and then we use the individual intercept and slopes corresponding to the mom's IQ and high-school group, together with the mom's age, to predict the child's IQ. So, how does this look pictorially? I think this is probably one of the easiest ways to think about hierarchical models. Imagine that I had some distribution on Beta naught itself. This is going to be the belief that I'm going to start out with. I'm going to have similar distributions on Beta one and Beta two. These are the global Beta one and Beta two that we actually fit in the last model. So, I have some belief on them. What I'm going to do, from these distributions, is draw individual means.
So we're only going to talk about the mean structure in this picture. I'm going to draw means from these distributions. Let's say that I get a mean of five for the intercept. This will now be a normal distribution with mean five and some standard deviation; I don't know what the spread of this distribution is. Let's say on this one I draw a zero; this will now be a normal distribution with mean zero. Similarly, let's say that I draw a third point out of here, and it's two; this will now be a normal distribution with mean two. So, what this is saying is: draw the mean of each of these Beta naughts, one for each group, from some parent distribution. This is the hierarchical assumption. I'm going to do the exact same thing for Beta one. So, I'm going to draw six Beta ones out of the hat, another six Beta twos out of the hat; I'm going to do all of those. So, now I have these six draws for the intercept. Now what I'm going to do is build up a response distribution. What this is saying is that the mean itself is equal to some Beta naught, that I'm going to get from this first guy, plus some Beta one that I'm going to get from over here, plus some Beta two that I'm going to get from over here. For this response distribution, let's make the assumption that this is high school equals yes, IQ equals low; so this will be the response distribution for high school equals yes, IQ equals low. Similarly, this one over here would be the response distribution for high school equals no, IQ equals high. How do I get my data? I'm actually sampling from that response distribution. So, this is the full picture of what's going on, if it looks confusing. Let's actually walk the model back.
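The two-stage draw just described can be simulated directly, which is often the quickest way to see the generative story. A hypothetical sketch: the hyperparameter values and the noise scale below are illustrative stand-ins, not fitted quantities from the case study.

```python
import numpy as np

rng = np.random.default_rng(0)

n_groups = 6  # high school {no, yes} crossed with IQ {low, medium, high}

# Stage 1: global (parent) distributions -- illustrative hyperparameters
mu_b0, sd_b0 = 0.0, 5.0   # parent of the six intercepts
mu_b1, sd_b1 = 0.0, 1.0   # parent of the six mom-IQ slopes
mu_b2, sd_b2 = 0.0, 1.0   # parent of the six mom-age slopes

# Stage 2: one draw per group from each parent distribution
b0 = rng.normal(mu_b0, sd_b0, n_groups)
b1 = rng.normal(mu_b1, sd_b1, n_groups)
b2 = rng.normal(mu_b2, sd_b2, n_groups)

# Stage 3: sample a child's score from the response distribution of group g
def draw_child_iq(g, mom_iq, mom_age, noise_sd=10.0):
    mean = b0[g] + b1[g] * mom_iq + b2[g] * mom_age
    return rng.normal(mean, noise_sd)

print(draw_child_iq(g=3, mom_iq=0.5, mom_age=-0.2))  # centered predictors
```

Running this top to bottom is exactly the "generative, top down" reading of the model; inference runs the same picture in reverse, from observed draws back to the group and global parameters.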
My responses themselves are first grouped based on their mom's IQ level and also based on whether their mom went to high school. So, these are the observations for whenever high school equals yes and IQ equals low. I use these to come up with a response distribution. This response distribution gets its intercept from this guy over here; it gets its intercept from this distribution, which is a posterior distribution on the intercept for that particular high-school-equals-yes, IQ-equals-low group. Similarly, all of my Beta zeros, all of my intercepts, get their mean from this guy, from this global Beta naught. So, we can think about the model as being generative, top down, or we can think about it as being inferential, bottom up: what does my data in this group, down here, tell me about the response distribution? What does that tell me about the distribution of the group intercepts? What does that tell me about the global intercept as a whole? So, there's a lot going on. These models often have a lot of moving parts, as we can see in the graphic, but they can model really complicated relationships in a more intuitive way, by thinking about things generatively: how did our data come about? The downfall, though, is that fitting one of these models in a Bayesian framework can be very computationally difficult. So, in a lot of circumstances whenever we're fitting these models, we do have to do some simulation, and we sometimes have to work out quite a bit of math. So, on to the model updates. We have six new intercepts, and each of these is going to be constrained by that hierarchical, kind of global, relationship; and then we're also going to change the distribution of the errors. I wanted to do this to try to capture that left tail we originally saw: our data was kind of shaped like this, and our predictions were kind of like this.
So, what we want to do is give the distribution a little bit of skew so that it can capture this tail, but at the same time be able to generalize to the rest of the distribution. We could have done this with a mixture model or a bunch of other methods, but overall I think this was the easiest next step to illustrate for this case analysis. So, just like before, we'll run the model, we'll get the posterior distributions, and this time we can compare each subgroup's intercept, all six of them, and see if they have any relationships. Why didn't we incorporate non-linear terms instead of creating a hierarchical model? That would be expressing a different belief. What I'm doing with the hierarchical model is expressing the belief that different IQ groups come from different distributions; a non-linear extension implies that they all come from the same model but perhaps just vary in a non-linear way. We can do both, we can have hierarchical models with non-linear terms, but to keep it simple we just did hierarchical models with linear terms so that we don't have lots of coefficients running around to visualize. So, we print out all the distributions, and this is what we get. The distributions for the intercepts are relatively boring because we centered all the data, so we would expect each of the intercepts to be centered with a mean of zero. The one thing that we do notice, though, is that some of the intercepts have much worse resolution. For instance, if someone went to high school and her IQ was low, we don't have much information about this intercept because the posterior is very wide. Similarly, we have quite a bit of information about this guy and this guy because the posterior distribution is narrow. This is because N is small in this group.
If I remember right, it's something like 10 or 15 observations; it's very small, whereas we have quite a few more people in these other groups, so we get better resolution of what's going on. Similarly, we also have posterior distributions for the IQ coefficient. These are the actual posteriors, and we can see that our resolution on this parameter up here is pretty poor: if you had a mom who went to high school and her IQ is low, we really have no idea what's going on with the relationship between her IQ and yours. It's kind of all over the place. But if your mother didn't go to high school and her IQ is medium, we have a pretty good idea of what's going on. So, this tells us that we have good resolution around that parameter but poor resolution around this one; we have a lot of uncertainty. The one odd man out is this observation right here. We can see that if your mom did not go to high school but her IQ is high, you have a negative relationship between her IQ and your IQ. This may be because whenever your mother's IQ is high, you may actually see a little bit of regression toward the mean; you may be a little bit more normal, taking into account maybe your father or other factors. What's interesting, though, is that if your mom did go to high school and her IQ is high, it's actually a positive coefficient, so there's something going on here. The certainty is pretty good, though we still can't rule out zero in either one of these two distributions, because zero is reasonably within those intervals. But what we can say is that this would be an interesting relationship to maybe look into. We have decent resolution on these parameters because the distributions sit firmly to one side or the other, and so that's something that we may want to look at. If we look at the age coefficient, it seems like, except for perhaps this estimate right here, everybody contains zero. So, there's not much that we can really say about the age coefficient.
We have very low resolution right here. We can see that even though this axis runs from minus five to five, for a lot of these we can't really tell what's going on with the age relationship, except maybe in these two cases. There's really not much that we can say about the age coefficient because the variance of all these parameters is so high. There seems to be some relationship that if your mom went to high school and she's medium or high IQ, the age actually does make a difference; it appears as if children of older moms end up having a higher IQ. We don't know if that's a relationship that we should study or not, but it is something we can look into. We also have this thing called a skew parameter. For the distributional assumption on the errors, I said that a child's IQ is distributed with a distribution called a skew normal. What does this mean? It's basically a normal distribution, but I have a third parameter to control skew. If the parameter is negative, then it has a skew that looks like this, a negative skew; and if the parameter is positive, then it'll have a distribution that looks like this. So the skew normal distribution is a way of getting skewed data. What we can actually see, whenever I fit the model, is that the posterior for this skew parameter sits largely at negative numbers, and what this tells me is that there actually is some left skew in my data.
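The skew-normal behavior just described is easy to check empirically. Here is a small sketch using `scipy.stats.skewnorm`, whose shape parameter `a` plays the role of the skew parameter: negative `a` gives a left tail, positive `a` a right tail. The specific values are illustrative, not the fitted parameter from the case study.

```python
import numpy as np
from scipy.stats import skewnorm, skew

rng = np.random.default_rng(1)
left  = skewnorm.rvs(a=-5, size=50_000, random_state=rng)  # negative shape -> left skew
right = skewnorm.rvs(a=5,  size=50_000, random_state=rng)  # positive shape -> right skew

# the sample skewness should match the sign of the shape parameter
print(skew(left), skew(right))
```

With `a = 0` the skew normal reduces to a plain normal, which is why a posterior for the skew parameter concentrated away from zero is evidence against the original symmetric-errors assumption.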
If the normal model that we originally used was correct, then we should expect all of this to be clustered around zero, but we can see that, nope, we actually have quite a bit of skew, because most of this distribution is negative. We can say, with reasonable certainty, that this data actually might be left skewed in general, and hopefully this will help us account for some of those errors in the model. Just like last time, we built up these predictive distributions. If we remember, in the last model observation one's estimate was actually outside of the interval. In this case, it's inside. So, in this case we actually did a pretty good job of capturing our estimate. Same with observation three, observation four, observation two: we did a pretty good job of capturing all of these. Whenever we look at the plot, it's hard to see, I think, because all these things are so close, but we can see that we're still doing a pretty good job of capturing all of these points in the posterior. The histogram of the widths of the predictive intervals, this one's pretty interesting. Originally we had a distribution that was pretty strongly centered on 70 and looked a lot like this: basically all of my predictive intervals were roughly about 70 units wide. We can see here now that each cluster actually has a different width. Down here, I don't have a lot of data; I think this is actually high-IQ mothers that didn't go to high school. We can see that we have quite a bit of certainty; maybe they didn't have a very large in-group variance. Similarly, I can see that individual groups actually stick out quite well. We don't have a lot of resolution for some groups, where we have high uncertainty, and we do have high certainty for some other groups.
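Computing the width of a predictive interval from posterior predictive draws is a one-liner, which makes the histogram just described cheap to reproduce. A hypothetical sketch: the draws below are simulated with deliberately different spreads to mimic groups with different amounts of certainty, rather than coming from a real fit.

```python
import numpy as np

rng = np.random.default_rng(2)

# hypothetical posterior predictive draws: 4000 samples for each of 3 observations
pp = np.column_stack([
    rng.normal(100, 5, 4000),   # tight group (small in-group variance)
    rng.normal(95, 18, 4000),   # wide group
    rng.normal(110, 18, 4000),  # another wide group
])

# central 95% predictive interval for each observation
lo, hi = np.percentile(pp, [2.5, 97.5], axis=0)
widths = hi - lo
print(widths)  # one interval width per observation
```

In the global model every observation shared one error scale, so all the widths piled up around a single value; once each group gets its own spread, the width histogram separates into per-group clusters exactly as described above.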
This may be overfitting, because we don't have a lot of data, but at the end of the day this characteristic, that each of these intervals has a different width, is actually pretty interesting. The histogram of the errors: we can see that, once again, if a mother's IQ is high and she didn't go to high school, we don't have a lot of data, but it looks like we nailed those estimates. They probably just don't have a very high in-group variance. Otherwise, it looks like most of these distributions are centered around zero; we have a long tail on the left side for moms that did go to high school but have a low IQ. The interpretation of these errors at the end of the day is that they look relatively homogeneous. We may need a little bit more data to see if there are any clear trends, but overall the errors look pretty good. The histogram of the errors overall: we don't have numbers way out in the 50s and 60s anymore for the error, though we do now have some concentrated around 40. Similarly, most of our data now is relatively within 15, though we still have this right tail. So, we didn't do a phenomenal job predicting, but we did do a little bit better than before. Lastly, our posterior predictive check. If you'll remember from last time, our original estimates looked like this: they went under, came up like this, and then they came back down. So, we heavily underestimated here and heavily overestimated here. We still see a little bit of this trend going on, but we don't have nearly as much white space here, and here, and here as we did before. We can see that we're better capturing this entire skew, which is something that we wanted to capture whenever we did this. We're actually getting slightly better estimates from a distributional standpoint, and overall this is a great thing.
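A posterior predictive check like the one above compares the observed data to replicated datasets from the model through some test statistic. Here is a sketch on synthetic data, assuming a plain normal model fit to data with an exaggerated left tail; every number is fabricated for illustration, and the tail statistic (the 5th percentile) is just one reasonable choice.

```python
import numpy as np

rng = np.random.default_rng(3)

# fabricated observed scores: a main cluster plus a heavy left tail
observed = np.concatenate([rng.normal(100, 12, 380), rng.normal(55, 5, 30)])

# replicated datasets drawn from a (deliberately too-simple) normal model
mu_hat, sd_hat = observed.mean(), observed.std()
reps = rng.normal(mu_hat, sd_hat, size=(1000, observed.size))

# posterior predictive check on a tail statistic: the 5th percentile
t_obs = np.percentile(observed, 5)
t_rep = np.percentile(reps, 5, axis=1)
p_value = (t_rep <= t_obs).mean()  # how often the replicas reach the observed low tail
print(p_value)
```

A p-value near 0 or 1 flags that the model cannot reproduce this feature of the data; on this synthetic example the symmetric normal cannot reach the observed left tail, which mirrors the under-coverage the original flat model showed on the low-IQ hump.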
The one downside is that it looks like we still miss this hump, and so maybe a mixture distribution or something is in order to capture the points that sit down here, these low-IQ predictions that we just can't seem to fit with this model either. The histogram of the errors, we can see, looks relatively the same. Once again, the flat model has this error that's way out on the right tail. It appears as if we're still overestimating one point pretty badly, but overall we can see the scale on this one is 50, and we have a big bar right near zero. The histogram of the errors is not impressively different, but it is a little bit different, and as a result we also have many more estimates on our parameters. So, the multilevel model for this data isn't a massive improvement in predictive ability, but it is an improvement in inference. We can now tease out individual group effects and provide insight into the slopes and the intercepts, but this also means that there's increased variance in the parameter estimates as a result. We don't have as much data in each of the groups as we had in the first model, because each of these groups is a subset of the total population, and as a result each of our parameter variances has blown up just a little bit, and this creates fatter distributions and maybe a little bit more uncertainty overall. For the next steps in modeling, we may want to account for that over-inflated left tail using a mixture distribution or another technique, maybe even something highly non-distributional. Then we may also want to find more intelligent groupings for IQ. I arbitrarily chose 85 and 115 as my cutoff points; I could choose more intelligent points as well. I would use these to update my belief about the world, and to model again. As a few parting notes: as we went through this case, we saw that Bayesian models give you distributions on the parameters.
They are computationally intensive, but they can model very complex relationships; we just modeled a multi-stage hierarchical model very quickly. We can perform regularization. There's a bunch of math that allows us to take a lot of the methods in frequentist statistics and model them using the same principles, all under the same framework, and this can be a very powerful idea. These methods were originally very intractable due to the amount of sampling that you need and the computational power required, but the method always remains the same: create a belief about the world, collect data, model, do your updates. With the advances in sampling techniques and computational power, especially in the cloud, a lot more compute can be used to fit these models and to make better inferences. Modeling is an iterative process; it's something that's filled with assumptions, and it is subjective by its very nature. Capturing the variance of an estimate is sometimes more important than the estimate itself. If I get an estimate on that age parameter of, I don't know, 0.2, and I get an estimate on that IQ parameter of 0.6, I don't really care. What I care more about is how certain I am about that relationship, and we can see that as we get less data, we're less certain about a lot of things. Large posterior intervals indicate large uncertainty in the estimate, given the model and given the current data, and this is something that we definitely need to watch out for. Bayesian methods bring this to the forefront whenever we have a badly fitting model or a model with a lot of variance, which I think we do in this case; these models of IQ are very, very noisy. But especially with this IQ data, we now get a better idea of what's going on, and perhaps better ways to tackle the modeling to get closer to what might actually be going on in the real world. So, thank you everyone for walking through this three-part case study on how to go about Bayesian modeling.
We saw a lot of different types of pictures, and a lot of different ways that we can go about modeling data, updating our beliefs, and trying to understand what our model might be telling us. Bayesian methods, because of their power, allow us to do a lot of very complex analysis, but at the same time their downsides, such as computational complexity and sometimes a little bit of extra mathematics, make them a little less accessible. Sometimes, though, that increase in power, and in really trying to get a better model of the world, can be quite worth it. So, thank you so much once again; I hope you enjoyed these special topics on Bayesian statistics. Goodbye.