Hello, everyone. Today we're going to be talking about the first of the three sections of the case study on Bayesian approaches to statistics and modeling. We're going to walk through a multi-level regression problem using one of these Bayesian frameworks. Bayesian frameworks are an incredibly flexible way of doing things such as variable selection and regularization, akin to the lasso and ridge models that we've seen earlier. We can fit models where the number of parameters exceeds the number of observations. We can model dependence, hierarchical structures, and a lot more, all within the same model, because we have posterior distributions on all of our parameters. We're going to barely scratch the surface of these methods, but we want to explore the Bayesian workflow for modeling, and most importantly, we want to get an intuition as to what's going on with these methods. The one thing that I do want to say is that if you are familiar with these methods, I'm going to gloss over some of the math, and maybe even some of the best practices, in order to get at the heart of what we're trying to do. So please be patient if you notice something like that; it's a trade-off we made deliberately, because this is a special topics part of the course.

So, before we get started, if you are interested in this kind of analysis, I recommend two books. The first one is Doing Bayesian Data Analysis, also called the "puppies book." It's a very approachable, great introduction to Bayesian statistics, and it is by far my personal favorite on the subject. The second one is Bayesian Data Analysis by Andrew Gelman and coauthors. It's a beautiful book that's typically used at the graduate level. Gelman is one of the authors of Stan, and he's overall just a great person to follow in the statistics community, especially when it comes to Bayesian data analysis.

Before we proceed, we need to get back to this idea of what Bayesian data analysis is. To do this, we need to talk about the three steps we typically go through whenever we build one of these models. The first step is to establish a belief about the world. This includes the prior and likelihood functions. You can think of this as setting up the model and making sure that all of the working parts are in place. The second step is to use data and probability to update our beliefs, and to check that the model agrees with the data. This is checking fits; this is making sure that our model actually captures as much of reality as it can. The third step is to update our view of the world based on the results from our model. Given our data and given our model, how should we change our beliefs, so that we can conduct this process all over again if we explore new data?

So, for this case study, we could have used a lot of different data. The one that I ended up choosing is based on the National Longitudinal Survey of Youth. It has 434 observations, and the goal is to predict a kid's IQ score given information about the mother. So, imagine that your mom in this first example has an IQ score of 121. Let's highlight some of these variables here: your mom went to high school, your mom has an IQ score of 121, and your mom is 27.
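If you want to follow along in Python, here's a minimal sketch of loading the data. The file name and column names are my assumptions about how the dataset is stored (it circulates as a "kidiq" file with Gelman and Hill's course materials), so adjust them to match your copy.

```python
import pandas as pd

# Load the NLSY kids' IQ data. The file name and column names are
# assumptions -- adjust them to match your copy of the dataset.
kidiq = pd.read_csv("kidiq.csv")

print(kidiq.shape)  # expecting 434 rows, per the lecture
print(kidiq[["kid_score", "mom_hs", "mom_iq", "mom_age"]].head())
```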
I'm going to try to see if I can predict your child's IQ score based on these three variables. That's going to be the heart of this analysis. This dataset is pretty large, but it's still small enough that we can start seeing the benefits of Bayesian data analysis.

For this, we're going to be using a linear model. For better or worse, I chose a linear regression model to start out with. In frequentist statistics, we see this all the time: we express the form of our model. In this form, we say that a child's IQ is equal to some intercept term, plus some slope term times the mother's IQ, plus some slope term times the mother's age. I'm going to start out with this very basic regression model. We're not going to include the high school variable, and we're going to do this for a few reasons. The first is to keep it simple at first; we'll build out a more advanced model in a bit. The second is that we want to really see what's going on: how does this compare to traditional models? If I just fit this in a frequentist way, what advantages and disadvantages do I have compared with doing it in a more Bayesian way?

So, up until now, we haven't done anything different from what we've done in the past. But in the Bayesian framework, we need to specify prior distributions on our beliefs, and this is something new. So, a key point, and this is the first key point: every parameter must begin with a distribution that captures our belief. The distributions that we place on these parameters are called priors. Just as we did in the intro case, imagine that I have the belief in this model that my intercept, let's go back to this line, should be centered on zero and look like this. I may express this using a normal prior. Similarly, someone else could go about this analysis with more information than me, and they might say, "No, this should actually be skewed." They could make that argument; they could say it should look like this. Similarly, for Beta one, I have no idea; I've never dealt with this type of IQ data before. So I'm going to say, "Hey, maybe it's centered at one. Maybe I think that a child's IQ is perfectly predicted by the mother's IQ, so it's centered at one." But I'm going to include a lot of variance in mine. The idea is that if I don't know something for sure, if I don't know for sure that it should be centered at one, I should make sure that I include a lot of variance in my prior belief, and we state these beliefs via distributional assumptions.

So, the first step in a Bayesian analysis is to state: what are my priors? What do I actually think things are going to look like before I go about doing my updates? Ideally, we do this before we even see the data, just so we can get our prior beliefs down on paper. A good practice is to write out what all of your priors are, along with a short sentence about why you chose each one. So, for my initial prior, and once again I'm not an expert in this area, I put a relatively weak prior on the intercept after centering my data. Once I center my data, I know that my intercept is probably going to be centered at zero, or it should be, but I'm going to give it a variance of 20. I want to make sure that this parameter is not constrained, so I put a really weak prior on this one. I'm not largely concerned about the intercept, though.
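To make that intercept prior concrete, here's a small sketch comparing the weak normal prior described above (centered at zero, variance 20) with the kind of skewed alternative another modeler might argue for. The skew-normal shape parameter is purely an illustrative choice of mine, not anything from the lecture.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.linspace(-20, 20, 500)

# My prior on the intercept: centered at zero with variance 20.
# (scipy parameterizes the normal by its standard deviation.)
weak_normal = stats.norm(loc=0, scale=np.sqrt(20))

# A skewed alternative someone with more information might propose;
# the skewness value a=4 is purely illustrative.
skewed = stats.skewnorm(a=4, loc=0, scale=np.sqrt(20))

plt.plot(x, weak_normal.pdf(x), label="Normal(0, var=20)")
plt.plot(x, skewed.pdf(x), label="Skew-normal alternative")
plt.xlabel("Intercept (Beta zero)")
plt.ylabel("Prior density")
plt.legend()
plt.show()
```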
The two terms that I'm most concerned about in this model are Beta one and Beta two. If we remember right, Beta one corresponds to the mom's IQ, and Beta two corresponds to the slope term on the age of the mother. My belief coming into this analysis is that the mom's IQ explains most, if not all, of the IQ of the child. I express that by saying: let's say that a mom has an IQ of 110. My prior belief is that the child has an IQ of 110 as well, so Beta one is centered at one. Is this a correct assumption? No. But it's the one that I'm going to go with. However, even though this is a ratio term, even though this is a slope, I'm still going to put a relatively wide prior on it, because I have no idea how this data is going to play out. Because of this, I don't want my prior to affect the way that my model fits, more so than anything else. Similarly, I'm going to say that a mother's age, regardless of what it is, is not going to make a difference to the IQ, so Beta two is centered at zero. I could have put a different prior on this. I could have said, "Oh no, I actually think that older mothers may produce children with higher IQs, because of the way they're raised or something similar, independent of this Beta one variable, of this IQ variable, and this may have a positive association." So, these are going to be my priors. These beliefs can be anything that you want, but ideally, you inform them with your best guesses.

The last thing that I'm going to say is: given my mean function, which is this linear function, I'm going to say that child IQs in general are distributed normally with this mean, plus some error term that I'm just going to call sigma. This is the general error in the model, and it gives me an idea of how far away, on average, my estimates were from reality. Ideally, we want a really small sigma, because it tells us that our predictions are quite good.

The one point that I need to drive home, though, is: why did I choose these specific distributions? Why did I put these exact priors on? I explain them in these sentences, and this is always a good practice: always try to write out what your priors are. But the thing about Bayesian statistics, and this is why some people may shy away from it, is that it inherently allows subjectivity on the part of the modelers. Two different people may use different priors, and as a result, they can conclude two different things. If you have the strong belief that this term is actually zero and that this term is actually one, or rather, since a child's IQ being perfectly predicted by the mother's isn't going to be true, let's just say some non-zero constant, then the model predictions that you get are going to be wildly different from mine. This induces a little bit of subjectivity. If this model seems subjective because of my assumptions, that's the point, and that's what I want to drive home. My assumptions are mine, and they're brought to the forefront of the analysis. I'm not an expert in this area; I've never dealt with this data. So, I'm trying to express that in the uncertainty about my estimates. I'm trying to say, "I have no idea what they might be, so I'm going to give my model a lot of room to interpret what they could possibly be." I don't want my prior to influence my measurements too much.
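To tie the full specification together, here's a minimal prior-predictive sketch: draw the parameters from the priors described above and simulate child IQs before seeing any data. Only the intercept's variance (20) is pinned down in the lecture; the scales for Beta one, Beta two, and sigma are stand-ins of mine for "wide," so treat them as placeholders. Everything is in centered units, that is, deviations from the sample means.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 1000

# Priors as described in the lecture. Only the intercept's variance (20)
# is stated explicitly; the other scales are illustrative stand-ins.
beta0 = rng.normal(0.0, np.sqrt(20), n_draws)   # intercept, centered data
beta1 = rng.normal(1.0, 1.0, n_draws)           # mom's IQ slope, centered at 1, wide
beta2 = rng.normal(0.0, 1.0, n_draws)           # mom's age slope, centered at 0, wide
sigma = np.abs(rng.normal(0.0, 10.0, n_draws))  # half-normal stand-in for the error scale

# Simulate a child's IQ (in centered units) for one example mother:
# 10 IQ points above the mean mom and 2 years older than the mean mom.
mom_iq_c, mom_age_c = 10.0, 2.0
mu = beta0 + beta1 * mom_iq_c + beta2 * mom_age_c
kid_iq_sim = rng.normal(mu, sigma)

print("Prior predictive mean:", kid_iq_sim.mean())
print("Prior predictive 95% interval:", np.percentile(kid_iq_sim, [2.5, 97.5]))
```

A wide interval here is exactly what we want from these weak priors: the model has a lot of room before it sees any data.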
This is another point that I want to drive home. Often, it's a great idea to ask experts or to incorporate prior data, and this is where Bayesian analysis really shines. If you have an idea from a previous experiment, let's say someone did an analysis on a different IQ dataset and said, "Hey, this intercept term could actually be 0.5," you should use that. Furthermore, you would hope to tighten the variance accordingly; you might make it one or two instead of 20. This is the idea that we update our beliefs, that we update our priors based on previous knowledge. If I talk to an expert and they say, "No, the age actually matters a lot; you shouldn't have it centered at zero," these things can work their way into my priors before I see my data, and they allow me to get better parameter estimates, if they're in line with the data.

Before we get into what the actual model fits look like, I need to say a few things about the model fit. I fit these models using a program called Stan, which is a domain-specific language for Bayesian modeling. All this means is that it's specifically designed for Bayesian modeling. We can interface with Stan through Python, but we do need to build up some scripts for some of these problems, like simple linear regression. This is the flip side: if we want all the upsides of Bayesian statistics, we need to deal with the downsides. The math can quickly become intractable, and because of this, we often need to use sampling methods. So, instead of getting a beautiful smooth curve for our estimates, we may actually get something that looks like a histogram. This is fine: as long as we have good samples from a sampling algorithm that approximates the posterior, we're usually okay.

The last thing that I need to say is that this data was centered before fitting. Taking a step back, what would it mean if a mother had an IQ of zero, or an age of zero, and we're trying to predict her child's IQ? It doesn't really have an interpretation. So, the one thing that I did was center the predictors before we got started. This builds in a little interpretability around the mean, and it also removes the noisy intercept term that we would get from fitting data quite far away from zero.

So, let's hop into the actual model, and I'll see you in part two. Thank you.
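For a sense of what that Stan-through-Python setup might look like, here's a hedged sketch using the older pystan 2 interface, with the centering done in pandas. This is my reconstruction of the kind of script the lecture alludes to, not the actual course code; it reuses the `kidiq` frame from the loading sketch above, and the prior scales are the same stand-ins as in the prior-predictive sketch.

```python
import pystan  # pystan 2.x interface; an assumption on my part

# The model itself is written in Stan's own language and passed as a string.
model_code = """
data {
  int<lower=0> N;
  vector[N] mom_iq_c;   // mother's IQ, centered
  vector[N] mom_age_c;  // mother's age, centered
  vector[N] kid_score;
}
parameters {
  real beta0;
  real beta1;
  real beta2;
  real<lower=0> sigma;
}
model {
  beta0 ~ normal(0, sqrt(20));  // weak prior on the intercept (Stan uses sd)
  beta1 ~ normal(1, 1);         // mom's IQ slope centered at 1, wide (stand-in scale)
  beta2 ~ normal(0, 1);         // mom's age slope centered at 0 (stand-in scale)
  kid_score ~ normal(beta0 + beta1 * mom_iq_c + beta2 * mom_age_c, sigma);
}
"""

# Center the predictors, as described in the lecture.
mom_iq_c = kidiq["mom_iq"] - kidiq["mom_iq"].mean()
mom_age_c = kidiq["mom_age"] - kidiq["mom_age"].mean()

data = {
    "N": len(kidiq),
    "mom_iq_c": mom_iq_c.values,
    "mom_age_c": mom_age_c.values,
    "kid_score": kidiq["kid_score"].values,
}

sm = pystan.StanModel(model_code=model_code)
fit = sm.sampling(data=data, iter=2000, chains=4)
print(fit)  # posterior summaries built from draws -- the "histograms" mentioned above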