This lecture is about generative probabilistic models for text clustering. In this lecture, we're going to continue discussing text clustering, and we're going to introduce generative probabilistic models as a way to do text clustering. So this is the overall plan for covering text clustering. In the previous lecture, we talked about what text clustering is and why it is interesting. In this lecture, we're going to talk about how to do text clustering. In general, as you see on this slide, there are two kinds of approaches. One is generative probabilistic models, which is the topic of this lecture. Later, we'll also discuss similarity-based approaches. To talk about generative models for text clustering, it's useful to revisit the problem of topic mining using topic models, because the two problems are very similar. This is a slide that you have seen earlier, in the lecture on topic models. Here we show that we have as input a text collection C, a number of topics k, and a vocabulary V. And we hope to generate as output two things. One is a set of topics denoted by theta sub i's, each of which is a word distribution; the other is pi sub i j, the probability that document i covers topic j. So this is the topic coverage, and it's also visualized here on this slide. You can see that this is what we can get by using a topic model. Now, the main difference between this and the text clustering problem is that here, a document is assumed to possibly cover multiple topics. And indeed, in general, a document will cover more than one topic with nonzero probabilities. In text clustering, however, we only allow a document to cover one topic, if we assume one topic corresponds to one cluster. So that means if we change the problem definition just slightly, by assuming that each document can only be generated using precisely one topic, then we'll have a definition of the clustering problem, as you see here.
So here the output is changed so that we no longer have the detailed coverage distributions pi sub i j. Instead, we're going to have cluster assignment decisions, C sub i. C sub i is the decision for document i; it takes a value from 1 through k to indicate one of the k clusters, and it basically tells us which cluster d sub i is in. As illustrated here, we no longer have multiple topics covered in each document; each document covers precisely one topic, although which topic is still uncertain. There is also a connection with the problem of mining one topic that we discussed earlier. Here, again, is a slide that you have seen before, where we hope to estimate one topic, a word distribution, based on precisely one document. And that's when we assume this document covers precisely one topic. But we can also consider some variations of the problem. For example, we can consider N documents, each covering a different topic, so that's N documents and N topics. Of course, in this case, these documents are independent, and these topics are also independent. But we can further allow these documents to share topics: we can assume there are fewer topics than documents, so the documents must share some topics. And if we have N documents that share k topics, then we again have precisely the document clustering problem. So because of these connections, naturally we can think about how to use a probabilistic generative model to solve the problem of text clustering. The question now is: what generative model can be used to do clustering? As in all cases of designing a generative model, we hope the generative model would embody the output that we hope to generate, or the structure that we hope to model. In this case, that's a clustering structure: the topics, with each document covering exactly one topic. And we hope to embed such preferences in the generative model.
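To make the contrast between the two outputs concrete, here is a minimal sketch in Python. The topic labels, vocabulary, and numbers are all made-up illustrations, not values from the lecture's slides:

```python
# Hypothetical outputs of the two problems, with k = 2 topics and a toy vocabulary.

# Topic modeling: each document i gets a full coverage distribution pi[i][j],
# so one document can cover several topics with nonzero probability.
topic_model_output = {
    "topics": [  # theta_i: one word distribution per topic
        {"text": 0.4, "mining": 0.3, "data": 0.3},
        {"sports": 0.5, "game": 0.3, "team": 0.2},
    ],
    "coverage": [  # pi[i][j]: probability that document i covers topic j
        [0.8, 0.2],
        [0.1, 0.9],
    ],
}

# Document clustering: the coverage distribution collapses to a single
# cluster assignment c_i for each document (0-based indexing here).
clustering_output = {
    "topics": topic_model_output["topics"],
    "assignments": [0, 1],  # c_i: document i belongs to exactly one cluster
}
```

Each coverage row is a distribution that sums to 1, while an assignment picks exactly one of the k topics outright.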
But if you think about the main difference between this problem and the topic model that we talked about earlier, you will see a main requirement: how can we force every document to be generated from precisely one topic, instead of k topics, as in the topic model? So let's revisit the topic model again in more detail. This is a detailed view of a two-component mixture model; when we have k components, it looks similar. Here we see that when we generate a document, we generate each word independently. And when we generate each word, we first make a choice between these two distributions: we decide to use one of them with some probability. So p of theta 1 is the probability of choosing the distribution on the top. We first make this decision regarding which distribution should be used to generate the word, and then we use that distribution to sample a word. Now note that in such a generative model, the decision on which distribution to use is made independently for each word. That means, for example, the word "the" here could have been generated from the second distribution, theta 2, whereas "text" is more likely generated from the first one on the top. So in general, the words in a document could have been generated from multiple distributions. Now, this is not what we want. As we said, for document clustering, we hope each document is generated from precisely one topic. So that means we need to modify the model. But how? Well, let's first think about why this model cannot be used for clustering. As I just said, the reason is that it allows multiple topics to contribute words to the same document. That causes confusion, because we're not going to know which cluster this document is from. And, more importantly, it violates our assumption that the documents are partitioned into clusters.
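The per-word decision just described can be sketched in a few lines of Python. The two word distributions and their mixing weights are toy values chosen for illustration:

```python
import random

random.seed(0)

# Two illustrative word distributions (theta_1, theta_2) and the
# probability of choosing each one; all numbers are made up.
theta = [
    {"text": 0.6, "mining": 0.4},
    {"the": 0.7, "a": 0.3},
]
p_theta = [0.5, 0.5]

def sample_word(dist):
    """Draw one word from a word distribution."""
    words, probs = zip(*dist.items())
    return random.choices(words, weights=probs, k=1)[0]

def generate_doc_topic_model(length):
    """Topic-model generation: the distribution choice is made
    independently for EVERY word, so one document can end up mixing
    words drawn from both topics."""
    doc = []
    for _ in range(length):
        z = random.choices([0, 1], weights=p_theta, k=1)[0]  # per-word choice
        doc.append(sample_word(theta[z]))
    return doc
```

A document sampled this way can contain both "text" (likely from theta 1) and "the" (likely from theta 2), which is exactly the behavior that makes this model unsuitable for clustering.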
If we really have one topic corresponding to one cluster of documents, then a document would be generated from precisely one topic. That means all the words in the document must have been generated from precisely one distribution. And this is not true for the topic model we're seeing here. That's why it cannot be used for clustering: it does not ensure that only one distribution is used to generate all the words in one document. Once you realize this problem, you can naturally design an alternative mixture model for doing clustering. This is what you're seeing here. We again have to make a decision regarding which distribution to use to generate this document, because the document could potentially be generated from any of the k word distributions that we have. But this time, once we have made a decision to choose one of the topics, we're going to stay with that distribution to generate all the words in the document. That means, once we have made a choice of the distribution when generating the first word, we stay with this distribution when generating all the other words in the document. In other words, we make the decision just once for this document, and then stay with it to generate all the words. Similarly, if we had chosen the second distribution, theta sub 2 here, we would stay with that one and generate the entire document d. Now, if you compare this picture with the previous one, you will see that the decision of using a particular distribution is made just once per document in the case of document clustering. But in the case of the topic model, we have to make as many decisions as there are words in the document, because for each word we can make a potentially different decision. And that's the key difference between the two models.
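In code, the clustering-style process differs in exactly one place: the topic is chosen once, before the word loop, rather than inside it. A minimal sketch, again with illustrative toy distributions:

```python
import random

random.seed(0)

# Illustrative word distributions and mixing weights (made-up numbers).
theta = [
    {"text": 0.6, "mining": 0.4},
    {"the": 0.7, "a": 0.3},
]
p_theta = [0.5, 0.5]

def sample_word(dist):
    """Draw one word from a word distribution."""
    words, probs = zip(*dist.items())
    return random.choices(words, weights=probs, k=1)[0]

def generate_doc_clustering(length):
    """Clustering-style generation: choose a topic ONCE, then stay with
    that same distribution for every word in the document."""
    z = random.choices([0, 1], weights=p_theta, k=1)[0]  # single choice per doc
    return z, [sample_word(theta[z]) for _ in range(length)]
```

Every word of a document generated this way comes from the single chosen distribution theta[z], so the hidden choice z is exactly the document's cluster identity.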
But this is obviously also a mixture model, so we can just group them together as one box to show that this is the model that will give us the probability of the document. Inside this model, there is also the switch of choosing a particular distribution, and we don't observe that, so this is a mixture model. And of course, a main problem in document clustering is to infer which distribution has been used to generate a document, and that would allow us to recover the cluster identity of the document. So it's useful to think about the differences from the topic model, as I have also mentioned multiple times. There are mainly two differences. One is that the choice of a particular distribution is made just once in document clustering, whereas in the topic model, it's made multiple times, for different words. The second is that here, one word distribution is used to generate all the words in a document. But in the case of the topic model, one distribution doesn't have to generate all the words in the document; multiple distributions could have been used to generate the words in the document. Let's also think about a special case, when the probability of choosing one particular distribution is equal to 1. That just means we have no uncertainty: we stick with one particular distribution. In that case, clearly, this is no longer a mixture model, because there's no uncertainty, and we just use precisely one of the distributions to generate a document. And we're back to the case of estimating one word distribution based on one document. So that's the connection we discussed earlier; now you can see it more clearly. As in all cases of using a generative model to solve a problem, we first look at the data and then think about how to design the model. Once we design the model, the next step is to write down the likelihood function. And after that, we're going to look at how to estimate the parameters.
So in this case, what's the likelihood function? It's going to be very similar to what you have seen before in topic models, but it will also be different. If you still recall what the likelihood function looks like in a mixture model, you will realize that, in general, the probability of observing a data point from a mixture model is a sum over all the possible ways of generating the data. In this case, it's a sum over the k topics, because every one of them could have been used to generate the document. Inside the sum, you may recall, the formula is a product of two probabilities: one is the probability of choosing the distribution; the other is the probability of observing that particular data point from that distribution. If you map this formula to our problem here, you will see that the probability of observing a document d is a sum, in this case over two different distributions, because we have a very simplified situation of just two clusters. So you can see it's a sum of two cases. In each case, it's indeed the probability of choosing the distribution, either theta 1 or theta 2, multiplied by the probability of observing this document from that particular distribution. And if you further expand the probability of observing the whole document, you see that it's a product of the probabilities of observing each word x sub i. Here we made the assumption that each word is generated independently, so the probability of the whole document is just the product of the probabilities of the words in the document. This form should look very similar to the topic model's. But it's also useful to think about the difference, and for that purpose, I am also copying here the likelihood function of the topic model with these two components. You can see the formulas look very similar, and in many ways they are. But there is also a difference, and in particular, the difference is on the top.
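Both likelihood functions can be written down directly in code, which makes it easy to see where the sum and the product sit in each one. The word distributions and weights below are made-up toy values, with full support so the products stay nonzero:

```python
import math

# Illustrative two-cluster setup (made-up numbers; both distributions
# assign nonzero probability to every vocabulary word).
theta_full = [
    {"text": 0.5, "mining": 0.3, "the": 0.1, "a": 0.1},
    {"text": 0.1, "mining": 0.1, "the": 0.5, "a": 0.3},
]
p_theta = [0.4, 0.6]

def doc_likelihood_clustering(doc):
    """Clustering mixture: p(d) = sum over j of p(theta_j) * product over
    words of p(w | theta_j). The sum (the single topic choice) is OUTSIDE
    the product over words."""
    return sum(p_j * math.prod(dist[w] for w in doc)
               for p_j, dist in zip(p_theta, theta_full))

def doc_likelihood_topic_model(doc, pi):
    """Topic model: p(d) = product over words of a sum over topics,
    because the topic choice is made independently for each word."""
    return math.prod(sum(pi_j * dist[w] for pi_j, dist in zip(pi, theta_full))
                     for w in doc)

doc = ["text", "mining", "text"]
```

For this toy document, the clustering likelihood works out to 0.4 * (0.5 * 0.3 * 0.5) + 0.6 * (0.1 * 0.1 * 0.1) = 0.0306, while the topic-model likelihood with coverage [0.4, 0.6] multiplies a per-word sum: 0.26 * 0.18 * 0.26 = 0.012168.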
You see, for the mixture model for document clustering, we first take a product and then take a sum. That corresponds to our assumption of first making a choice of one distribution and then staying with that distribution to generate all the words; that's why we have the product inside the sum, and the sum corresponds to the choice. In the topic model, by contrast, the sum is inside the product. That's because we generate each word independently, which is why we have the product outside; but when we generate each word, we have to make a decision regarding which distribution to use, so we have a sum there for each word. In general, though, these are all mixture models, and we can estimate them using the EM algorithm, as we will discuss more later.
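The lecture defers the estimation details, but assuming the usual EM algorithm for such a mixture, a minimal sketch might look like the following. The initialization scheme, smoothing constant, and toy documents are all illustrative choices, not prescribed by the lecture:

```python
import math
import random

def em_cluster(docs, k, n_iter=50, seed=0):
    """A minimal EM sketch for the document-clustering mixture model.
    E-step: p(z_d = j | d) is proportional to p(theta_j) * prod_w p(w | theta_j).
    M-step: re-estimate p(theta_j) and p(w | theta_j) from those posteriors."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})

    # Random initialization of the k word distributions; uniform topic priors.
    word_p = []
    for _ in range(k):
        raw = [rng.random() + 0.1 for _ in vocab]
        s = sum(raw)
        word_p.append({w: r / s for w, r in zip(vocab, raw)})
    prior = [1.0 / k] * k

    post = []
    for _ in range(n_iter):
        # E-step: posterior over clusters for each document (log-space
        # for the product over words, to avoid underflow).
        post = []
        for d in docs:
            logp = [math.log(prior[j]) + sum(math.log(word_p[j][w]) for w in d)
                    for j in range(k)]
            m = max(logp)
            ex = [math.exp(lp - m) for lp in logp]
            s = sum(ex)
            post.append([e / s for e in ex])

        # M-step: re-estimate priors and word distributions, with a small
        # pseudo-count so no word probability collapses to zero.
        prior = [sum(p[j] for p in post) / len(docs) for j in range(k)]
        for j in range(k):
            counts = {w: 0.01 for w in vocab}
            for d, p in zip(docs, post):
                for w in d:
                    counts[w] += p[j]
            s = sum(counts.values())
            word_p[j] = {w: c / s for w, c in counts.items()}

    # Hard assignment: the most probable cluster for each document.
    return [max(range(k), key=lambda j: p[j]) for p in post]

docs = [["text", "mining"], ["text", "data"], ["sports", "game"], ["game", "team"]]
labels = em_cluster(docs, k=2)
print(labels)
```

The final hard assignment is exactly the C sub i output defined earlier: each document gets the cluster whose posterior probability is largest.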