Having defined the principle of maximum likelihood estimation for a single parameter, we can now consider the more complicated scenario where we are interested in doing maximum likelihood estimation for the full ensemble of parameters of a given Bayesian network structure. So, let's look at the more general case, where now we have a general Bayesian network, not necessarily binary valued, and our parameters now consist of a set of parameters theta X for all possible values of the variable X, and a set of parameters theta Y given X for all values of the parent X and the child Y; that is the parameterization of this Bayesian network in the general case of table CPDs. Now let's think about the likelihood of this set of parameters theta: what does the likelihood function look like? Well, the likelihood function, remember, is the probability of the data given the parameters, which, because the data instances are IID, decomposes as the product, over all data instances, of the probability of x[m] and y[m] relative to the parameterization theta. We're going to break that probability up using the chain rule for Bayesian networks, and that gives me a decomposition of this joint probability as a product of the probability of the parent X times the conditional probability of Y given X, which is just the application of the chain rule in this very simple example. And now I'm going to break this up into two separate products. So I'm going to move this product here, so that I have one product over just the first set of terms and then a second product over the Y given X terms, and then I'm going to observe that the first set of terms, the probability of x[m], depends only on the parameters theta X, and the probability of y[m] given x[m] depends only on theta Y given X. 
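As a concrete check of that last observation, here is a small sketch, in Python, of the likelihood of a binary X -> Y network splitting into a product of two local likelihoods; the parameter values and the data set are made up purely for illustration:

```python
# A hypothetical binary X -> Y network with table CPDs (all numbers made up
# for illustration).
theta_x = {0: 0.6, 1: 0.4}                       # P(X = x)
theta_y_given_x = {0: {0: 0.7, 1: 0.3},          # P(Y = y | X = 0)
                   1: {0: 0.2, 1: 0.8}}          # P(Y = y | X = 1)

data = [(0, 0), (0, 1), (1, 1), (1, 1), (0, 0)]  # IID (x, y) instances

# Joint likelihood: product over instances of P(x) * P(y | x),
# using the chain rule for this two-variable network.
joint = 1.0
for x, y in data:
    joint *= theta_x[x] * theta_y_given_x[x][y]

# The same quantity as a product of two local likelihoods, one per variable.
local_x = 1.0
local_y_given_x = 1.0
for x, y in data:
    local_x *= theta_x[x]
    local_y_given_x *= theta_y_given_x[x][y]

assert abs(joint - local_x * local_y_given_x) < 1e-12
```

Because the likelihood factors this way, maximizing it over theta X and over theta Y given X can be done independently, which is exactly the decomposition exploited in what follows.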
And so now I've broken up the likelihood function into a product of two what are called local likelihoods, where a local likelihood is the likelihood of just the variable X, or of just the variable Y given its parents, in this case X. More generally, I can do this exact same process in a somewhat more general case: if we have a set of variables X1 up to Xn, then we can once again look at the product over all data instances and break this up, using the chain rule for Bayesian networks, into a product, for each instance, of the probability of Xi given its parents, where I'm assuming that Ui are the parents of Xi. I'm then going to switch the order of the products, so that instead of multiplying over the instances m first, I do the product over the variables first and then the product over the data cases, and that gives me once again a product of terms, each of which I'm going to call Li, the local likelihood of the data for the CPD of Xi. And if we're willing to assume that the different CPDs don't share any parameters, so that each CPD has its own set of parameters that can be estimated separately, then I can now optimize the CPD of the i-th variable separately from all other variables by optimizing this local likelihood term. If we have table CPDs, it turns out that we can decompose even further. Remember that we're now looking at the local likelihood for a variable X given its parents U, which depends on the CPD of X given U. What we're going to do is break up the data cases into separate buckets: for every value little x of X and every value little u of the parents U, we multiply over all of the instances m for which the m-th instance takes that value little x and that value little u. So what I've basically done is divide all of my instances into these disjoint buckets. Now, the point is that we know what this is, the probability that x[m] takes the value little x. 
And, given that u[m] takes the value little u, that probability is simply the appropriate CPD entry theta x given u, so this turns into the expression over here. So now the question is: how many occurrences of theta x given u do I get? Well, the answer is that the number of times I multiply it in is the number of times that the assignment little x, little u occurs in the data. And so this turns out to be simply a product, over all possible values of the child x and the parent u, of the appropriate CPD entry theta x given u, raised to the power of the number of occurrences M[x, u]. And if we now consider maximum likelihood estimation, it turns out that the maximum likelihood estimate of theta x given u is simply, as one would expect, the fraction of instances where X = x among the instances where U = u. So what happens with maximum likelihood estimation when we have a model with shared parameters? Let's consider that question in the context of a hidden Markov model, where initially we're actually just going to look at a Markov chain, that is, a case where we have just a single state variable S and we transition from one state to the next. So let's look at the likelihood function in this context before we consider what maximum likelihood estimation would look like. The likelihood function of the parameter vector theta, which dictates the transition model, given an observed sequence of states between time 0 and time T, decomposes, using the Markov assumption, as a product over t from 1 to T of the probability of the state at time t given the state at time t - 1. Now we can further reformulate this product by looking at how it decomposes across specific pairs of states Si and Sj. So we can first multiply over all possible pairs of states i and j, and then consider all of the time points at which we transitioned from Si to Sj; for each of those, we multiply in the probability of that transition from Si to Sj. 
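Before continuing with the Markov chain case, the table-CPD estimate just described, theta hat x given u = M[x, u] / M[u], can be sketched as follows; the data here, a child X with a single binary parent U, is made up for illustration:

```python
from collections import Counter

# Hypothetical data for a child X with one binary parent U;
# each instance is a (u, x) pair.
data = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0), (0, 0), (1, 1)]

# Sufficient statistics: M[x, u] and M[u].
m_xu = Counter(data)                  # counts of each joint (u, x) assignment
m_u = Counter(u for u, _ in data)     # counts of each parent assignment u

# Maximum likelihood estimate: theta_hat[x | u] = M[x, u] / M[u].
theta_hat = {(x, u): m_xu[(u, x)] / m_u[u] for (u, x) in m_xu}
```

Each parent assignment little u defines one bucket of instances, and within that bucket the estimate is just the single-multinomial estimate from the earlier discussion.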
Now, the critical observation is that this transition probability is the same regardless of the time point; this is because of the time invariance assumption that we have in these models. And so the parameter that we would be multiplying in this context is simply the transition parameter theta Si to Sj. Now, if we look at this expression, with the product over i and j and then the product over the time points at which the transition from i to j took place, we end up with the product over here, which multiplies, over all i, j, the parameter associated with that transition, exponentiated by the sufficient statistic M[Si -> Sj], which is simply the number of times in the data stream that we saw a transition between those two states. If we now consider what maximum likelihood estimation would look like for the parameter theta Si to Sj, the maximum likelihood estimate theta hat Si to Sj is exactly what we would expect: the number of transitions in which Si transitioned to Sj, divided by the total number of occurrences of the state Si within the time sequence. Let's now extend this to the context of a hidden Markov model, just to see what it would look like. Here we have, in addition to the first component of the likelihood function, which is the same one that we had before, just the transition probabilities of the state sequence, an additional component that multiplies, over every time point from 1 to big T, the probability of the particular observation O of t given the state S of t. Now we can do the exact same process of reformulating the likelihood function. We've already seen what the first term looks like: it's exactly the same as the one that we had on the previous slide, theta Si to Sj exponentiated with its sufficient statistic. 
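The transition estimate just described, theta hat Si to Sj = M[Si -> Sj] / M[Si], can be sketched as follows; the state sequence here is made up for illustration:

```python
from collections import Counter

# A hypothetical observed state sequence s(0), ..., s(T).
states = ['a', 'b', 'a', 'a', 'b', 'a', 'b', 'b', 'a']

# Sufficient statistics M[Si -> Sj], accumulated over ALL time points,
# because time invariance ties every time step to the same parameter.
transitions = Counter(zip(states, states[1:]))

# Number of transitions leaving each state (occurrences of Si, excluding
# the final time point, which has no outgoing transition).
out_of = Counter(states[:-1])

# theta_hat[Si -> Sj] = M[Si -> Sj] / M[Si].
theta_hat = {(si, sj): n / out_of[si] for (si, sj), n in transitions.items()}
```

Every time step contributes a count to the same shared parameter, which is exactly the effect of the time invariance assumption on the sufficient statistics.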
And we have a very analogous term that multiplies, over pairs of states i and observations k, the parameter that corresponds to the k-th observation in the state i, exponentiated with its sufficient statistic, which is the number of times that we saw, at the same time point, the observation Ok and the state Si. And so, adding now to our set of sufficient statistics, in addition to the count of going from Si to Sj, we also have sufficient statistics that count the number of time points at which we had the state Si and the observation Ok. So, to summarize maximum likelihood estimation in the case of discrete Bayesian networks. For a Bayesian network that has disjoint sets of parameters in the CPDs, that is, where each CPD has its own set of parameters, the likelihood function decomposes as a product of local likelihood functions, one per variable, and this is important, because we're going to use it later on. When we have table CPDs, we get a further decomposition: even the local likelihood decomposes as a product of likelihoods for specific multinomials, one for each combination of values of the parents. That further decomposition is basically going to allow us to estimate each of those multinomials separately, via the same maximum likelihood estimation that we had in the original case of estimating parameters for a single multinomial, because if we have a product of these likelihood functions, each of them can be optimized separately. Finally, consider the very common case of networks that have shared CPDs; this is a case unlike the first bullet, where we had disjoint sets of parameters, because here we have shared CPDs. 
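For a fully observed HMM run, both kinds of sufficient statistics can be accumulated in a single pass over the sequences; here is a minimal sketch, with made-up state and observation sequences:

```python
from collections import Counter

# Hypothetical fully observed HMM run: state and observation sequences.
states = [0, 1, 1, 0, 1, 0, 0]
obs = ['x', 'y', 'y', 'x', 'y', 'x', 'y']

# Shared-parameter sufficient statistics, accumulated in one pass:
m_trans = Counter(zip(states, states[1:]))  # M[Si -> Sj]
m_emit = Counter(zip(states, obs))          # M[Si, Ok]

# MLE for the emission parameters:
# theta_hat[Ok | Si] = M[Si, Ok] / M[Si].
m_state = Counter(states)
theta_emit = {(s, o): n / m_state[s] for (s, o), n in m_emit.items()}
```

The transition parameters are then estimated from m_trans in exactly the way shown for the plain Markov chain.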
With shared CPDs, we simply accumulate the sufficient statistics over all occurrences, or uses, of that CPD, and once we do that, maximum likelihood estimation can be done in effectively the same way. Now, one important observation that arises from the form of the maximum likelihood estimate is worth remembering and thinking about when one uses it. Let's look at the form of the maximum likelihood estimate for the parameter theta x given u. As we've seen, it is a ratio between M[x, u], the number of counts of the instantiation little x together with the parent assignment little u, and the sum over all M[x', u] for the different values x'; written otherwise, it's M[x, u] divided by M[u]. What are the consequences of this expression when the number of parents grows large? When the number of parents grows larger, the number of buckets, that is, the number of different possible assignments little u, increases exponentially with the number of parents. If you have two binary parents, there are only four buckets, but if you have fifteen or twenty parents, then the number of buckets into which we need to partition the data increases very quickly. And that means that most buckets, that is, most assignments little u, will have very few instances assigned to them, which means that most of these estimates, which divide M[x, u] by M[u], will be made with few or even zero instances of u, at which point any estimate is just totally bogus. Specifically, that means that the parameter estimates we're going to get for most of the multinomial parameters, once the number of parents gets large, are going to be very poor. This concept is called fragmentation of the data, and it has the following important consequence: if the amount of data is limited relative to the model dimensionality, we can often get better estimates of the parameters with simpler structures, even when these structures are actually wrong. 
That is, even if we make incorrect independence assumptions, removing edges that really ought to be there, we can sometimes get better generalization, in terms of, say, log-likelihood of test data, than when using the much more complicated correct structure.
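To see fragmentation concretely, here is a small sketch that spreads a fixed number of instances over the parent-assignment buckets; the instance count and the uniform bucket assignment are illustrative assumptions (real data need not be uniform):

```python
import random

random.seed(0)
M = 1000  # hypothetical number of data instances

for n_parents in (2, 5, 10, 15):
    n_buckets = 2 ** n_parents  # parent assignments u, for binary parents
    counts = [0] * n_buckets
    for _ in range(M):
        # Assign each instance to a random bucket (uniform, for illustration).
        counts[random.randrange(n_buckets)] += 1
    empty = sum(c == 0 for c in counts)
    print(f"{n_parents:2d} parents: {n_buckets:6d} buckets, "
          f"{empty:5d} empty, avg {M / n_buckets:.3f} instances per bucket")
```

With fifteen binary parents there are 32768 buckets, so with 1000 instances at least 31768 of the multinomials theta x given u have no data at all, and most of the rest are estimated from a single instance.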