This lecture is about the expectation-maximization algorithm, also called the EM algorithm. In this lecture, we're going to continue the discussion of probabilistic topic models. In particular, we're going to introduce the EM algorithm, which is a family of useful algorithms for computing the maximum likelihood estimate of mixture models. So this is the now-familiar scenario of using a two-component mixture model to try to factor out the background words from the topic word distribution. We're interested in computing this estimate, and we're going to try to adjust these probability values to maximize the probability of the observed document. Note that we assume all the other parameters are known, so the only thing unknown is the word probabilities given by theta sub d. In this lecture, we're going to look into how to compute this maximum likelihood estimate.

Now, let's start with the idea of separating the words in the text data into two groups. One group would be explained by the background model; the other group would be explained by the unknown topic word distribution. After all, this is the basic idea of a mixture model. But suppose we actually knew which word is from which distribution. That would mean, for example, that the words "the", "is", and "we" are known to be from the background word distribution, while the other words, "text", "mining", "clustering", etc., are known to be from the topic word distribution. If you can see the colors, these words are shown in blue, and the blue words are assumed to be from the topic word distribution. If we already knew how to separate the words, then the problem of estimating the word distribution would be extremely simple. If you think about this for a moment, you'll realize that we can simply take all the words that are known to be from this word distribution, theta sub d, and normalize their counts. So indeed this problem would be very easy to solve if we knew precisely which words are from which distribution, and in fact this would make the model no longer a mixture model, because we could already observe which distribution has been used to generate which part of the data. We would actually be back to the single word distribution problem. In this case, let's call the words that are known to be from theta sub d a pseudo-document d prime, and all we need to do is normalize the counts of each word w_i in this pseudo-document. That's fairly straightforward; it's just dictated by the maximum likelihood estimator.

Now, this idea by itself doesn't work, because in practice we don't really know which word is from which distribution, but it does suggest that perhaps we can guess which word is from which distribution. Specifically, given all the parameters, can we infer which distribution a word is from? So let's assume that we actually know tentative probabilities for these words in theta sub d. Now all the parameters of the mixture model are known, and let's consider a word like "text". The question is: do you think "text" is more likely to have been generated from theta sub d or from theta sub b? In other words, we want to infer which distribution has been used to generate this occurrence of "text". This inference process is a typical Bayesian inference situation, where we have some prior about the two distributions. So can you see what our prior is here? Well, the prior here is the probability of each distribution, so the prior is given by these two probabilities.
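In standard notation (a brief sketch added here for reference, where \(\theta_d\) is the topic word distribution, \(\theta_B\) is the background word distribution, and \(d'\) is the pseudo-document just described), the two quantities mentioned so far can be written as:

$$p(w \mid \theta_d) = \frac{c(w, d')}{\sum_{w'} c(w', d')} \quad \text{(the simple maximum likelihood estimate if we knew which words came from } \theta_d\text{)}$$

$$p(\theta_d) + p(\theta_B) = 1 \quad \text{(the prior probabilities of selecting each distribution)}$$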
In this case, the prior is saying that each model is equally likely, but we can imagine that a different prior is possible. This is called a prior because it is our guess of which distribution has been used to generate a word before we even observe the word. That's why we call it the prior. If we haven't observed the word, our best guess is to say, well, they're equally likely; it's just like flipping a coin. Now, in Bayesian inference we typically update our belief after we have observed the evidence. So what is the evidence here? Well, the evidence here is the word "text", since that is the word we're interested in. So "text" can be regarded as the evidence, and if we use Bayes' rule to combine the prior and the data likelihood, what we end up with is the prior combined with the likelihood you see here, which is basically the probability of the word "text" under each distribution. We see that "text" is possible in both cases; note that even under the background model it is still possible, it just has a very small probability. So intuitively, what would be your guess in this case? If you're like many others, you would guess that "text" is probably from theta sub d; it's more likely from theta sub d. Why? Probably because "text" has a much higher probability under theta sub d than under the background model, where it has a very small probability. Based on this, we're going to say, well, "text" is more likely from theta sub d. So you see, our guess of which distribution has been used to generate "text" depends on how high the probability of "text" is in each word distribution. We tend to guess the distribution that gives the word a higher probability, and this is likely to maximize the likelihood. So we're going to choose the distribution that gives the word a higher likelihood. In other words, we're going to compare the two probabilities of the word given by the two distributions. But our guess must also be affected by the prior, so we also need to compare the two priors. Why? Because imagine we adjust these probabilities so that the probability of choosing the background model is almost 100 percent. If you have that kind of strong prior, it would affect your guess. You might think, well, wait a moment, maybe "text" could have come from the background as well; although its probability there is very small, the prior is very high. So in the end, we have to combine the two, and Bayes' rule provides a solid and principled way of making and quantifying this kind of guess.

More specifically, let's think about the probability that this word has in fact been generated from theta sub d. Well, in order for "text" to be generated from theta sub d, two things must happen. First, theta sub d must have been selected, so we have the selection probability here. Second, "text" must actually have been generated from that distribution. So when we multiply the two together, we get the probability that "text" has in fact been generated from theta sub d. Similarly, for the background model, the probability of generating "text" is a product of the same form. Now, we also introduce a latent variable z here to denote whether the word is from the background or the topic: when z is zero, the word is from the topic theta sub d; when z is one, it's from the background theta sub b. So now we have the probability that "text" is generated from each distribution.
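Written out in the same notation (a sketch; z is the latent indicator just introduced), the two products are:

$$p(w = \text{"text"}, z = 0) = p(\theta_d)\, p(\text{"text"} \mid \theta_d), \qquad p(w = \text{"text"}, z = 1) = p(\theta_B)\, p(\text{"text"} \mid \theta_B)$$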
Then we can simply normalize them to obtain an estimate of the probability that the word "text" is from theta sub d or from theta sub b, or equivalently, the probability that z is equal to zero given that the observed evidence is "text". This is an application of Bayes' rule. This step is crucial for understanding the EM algorithm, because if we can do this, then we can first initialize the parameter values somewhat randomly and then take a guess at these z values, that is, at which distribution has been used to generate which word. The initialized parameter values give us a complete specification of the mixture model, which in turn allows us to apply Bayes' rule to infer which distribution is more likely to have generated each word. This inference essentially helps us separate the words generated by the two distributions. Although we can't separate them for sure, we can separate them probabilistically, as shown here.
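Concretely, the normalization gives

$$p(z = 0 \mid w) = \frac{p(\theta_d)\, p(w \mid \theta_d)}{p(\theta_d)\, p(w \mid \theta_d) + p(\theta_B)\, p(w \mid \theta_B)}$$

Below is a minimal Python sketch of this inference step. The function name `prob_from_topic` and the word probabilities are illustrative assumptions made up for this example, not numbers from the lecture.

```python
# Minimal sketch of the Bayes-rule inference step described above.
# The distributions below are toy values (only a few words shown),
# not numbers from the lecture.

def prob_from_topic(word, p_topic, p_background, prior_topic=0.5):
    """Return p(z = 0 | word): the probability that `word` was generated
    by the topic distribution theta_d rather than the background theta_B."""
    prior_background = 1.0 - prior_topic
    joint_topic = prior_topic * p_topic[word]                  # p(w, z = 0)
    joint_background = prior_background * p_background[word]   # p(w, z = 1)
    return joint_topic / (joint_topic + joint_background)

# Tentative (e.g., randomly initialized) topic word distribution theta_d
p_topic = {"the": 0.05, "text": 0.20, "mining": 0.20, "clustering": 0.15}
# Known background word distribution theta_B
p_background = {"the": 0.30, "text": 0.01, "mining": 0.005, "clustering": 0.005}

for w in ["the", "text", "mining", "clustering"]:
    print(w, round(prob_from_topic(w, p_topic, p_background), 3))
# "the" is mostly attributed to the background, while "text", "mining",
# and "clustering" are mostly attributed to the topic -- the probabilistic
# separation described above.
```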