This lecture is about probabilistic latent semantic analysis, or PLSA. This is the most basic topic model and also one of the most useful. Models of this kind can, in general, be used to mine multiple topics from text documents, and PLSA is one of the most basic topic models for doing this.

So let's first examine this problem in a bit more detail. Here I show a sample article, which is a blog article about Hurricane Katrina, along with some example topics: government response, flooding of the city of New Orleans, donation, and the background. You can see that the article uses words from all of these distributions. For example, we first see criticism of the government response, which is followed by discussion of the flooding of the city, donation, and so on. We also see background words mixed in with them. So the overall goal of topic analysis here is to decode these topics behind the text: to segment the topics, to figure out which words come from which distribution, and to figure out, first of all, what these topics are. How do we know there is a topic about the government response, or a topic about the flooding of the city? These are the tasks of topic models. Once we have discovered these topics, we can color the words, as you see here, to separate the different topics. Then we can do a lot of things, such as summarization, segmentation of the topics, clustering of the sentences, and so on.

The formal definition of the problem of mining multiple topics from text is shown here; this is a slide that you have seen in an earlier lecture. The input is a collection, the number of topics, and a vocabulary set, and of course the text data. The output is of two kinds. One is the topic characterization, the theta_i's; each theta_i is a word distribution. The second is the topic coverage for each document; these are the pi_{i,j}'s, and they tell us to what extent each document covers each topic. We hope to generate these as output, because there are many useful applications if we can do that.

The idea of PLSA is actually very similar to the two-component mixture model that we have already introduced. The only difference is that we are going to have more than two topics; otherwise, it is essentially the same. So here I illustrate how we can generate text that has multiple topics, and naturally, as in all cases of probabilistic modeling, we want to figure out the likelihood function. So we will also ask the question: what is the probability of observing a word from such a mixture model? If you look at this picture and compare it with the picture we have seen earlier, you will see that the only difference is that we have added more topics. Before, we had just one topic besides the background topic; now we have more topics, specifically k topics. All of these are topics that we assume exist in the text data. The consequence is that our switch for choosing a topic is now a multi-way switch, whereas before it was just a two-way switch. We can still think of it as flipping a coin, but now the choice happens in stages. First, we flip a coin to decide whether we are going to talk about the background: the background is chosen with probability lambda_B, and 1 minus lambda_B gives us the probability of choosing a non-background topic. After we have made this decision, we have to make another decision to choose one of these k distributions; so there is a k-way switch here, characterized by the pi's, which sum to one. This is the only difference in the design, and it makes things a little more complicated, but once we decide which distribution to use, the rest is the same: we just generate a word using that distribution, as shown here.
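To make this generative story concrete, here is a minimal Python sketch of how a single word could be sampled from such a mixture. This is only an illustration of the process just described, not code from the lecture; the names lambda_B, background, pi, and topics are illustrative choices.

```python
import random

def sample_word(lambda_B, background, pi, topics):
    """Sample one word from a PLSA-style mixture with a background model.

    lambda_B:   probability of choosing the background model
    background: dict mapping word -> p(w | theta_B)
    pi:         list of topic coverage probabilities for this document (sums to 1)
    topics:     list of dicts, each mapping word -> p(w | theta_j)
    """
    if random.random() < lambda_B:
        # First decision: the coin flip chose the background topic.
        dist = background
    else:
        # Second decision: the k-way switch picks one of the k topics according to pi.
        j = random.choices(range(len(topics)), weights=pi, k=1)[0]
        dist = topics[j]
    # Finally, generate the word from the chosen word distribution.
    words, probs = zip(*dist.items())
    return random.choices(words, weights=probs, k=1)[0]

# Tiny made-up example with k = 2 topics.
bg = {"the": 0.6, "about": 0.4}
topics = [{"flood": 0.7, "city": 0.3}, {"donate": 0.5, "relief": 0.5}]
print(sample_word(0.3, bg, pi=[0.6, 0.4], topics=topics))
```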
So now let's look at the question about the likelihood: what is the probability of observing a word from such a mixture model? What do you think? We have seen this problem many times now, and if you recall, it is generally a sum over all the different ways of generating the word.

Let's first look at how the word can be generated from the background model. The probability that the word is generated from the background model is lambda_B multiplied by the probability of the word given the background model. Two things must happen: first, we have to have chosen the background model, which has probability lambda_B; second, we must have actually obtained the word w from the background model, which is the probability of w given theta_B. Similarly, we can figure out the probability of observing the word from another topic, say topic theta_k. Notice that here we have a product of three terms, because choosing topic theta_k requires two things to happen: first, we decide not to talk about the background, which has probability 1 minus lambda_B; second, we actually choose theta_k among these k topics, which is the probability of theta_k, or its pi value. And finally, the word w itself must be generated from theta_k, which gives the third term. Similarly, the probabilities of generating the word from the second topic and the first topic are as you see here. In the end, the probability of observing the word is just the sum of all these cases. I have to stress again that this is a very important formula to know, because it is really the key to understanding all topic models, and indeed a lot of mixture models. So make sure you really understand that the probability of w is indeed the sum of these terms.

Next, once we have the likelihood function, we are interested in estimating the parameters. But first, let's put all of this together to get the complete likelihood function for PLSA. The first line shows the probability of a word, as illustrated on the previous slide. This is an important formula, as I said, so let's take a closer look at it. It actually contains all the important parameters. First, we see lambda_B here, which represents the percentage of background words that we believe exists in the text data; this can be a known value that we set empirically. Second, we see the background language model, and typically we also assume this is known; we can use a large collection of text, or all the text we have available, to estimate this word distribution. Next, in the rest of the formula, you see two interesting kinds of parameters, and these are the most important parameters that we are interested in. One is the pi's, which give the coverage of each topic in a document; the other is the word distributions that characterize all the topics. The next line, then, simply plugs this in to calculate the probability of a document. This is again of the familiar form, where you have a sum over words, with the count of a word in the document multiplied by the log of a probability. It is a little more complicated than the two-component case, because now we have more components, so the sum inside the logarithm involves more terms.
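For reference, the word probability and the document log-likelihood being described can be written out as follows. This is a reconstruction in LaTeX from the description above, using c(w, d) for the count of word w in document d and pi_{d,j} for the coverage of topic j in document d:

```latex
p_d(w) = \lambda_B \, p(w \mid \theta_B)
       + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j} \, p(w \mid \theta_j)

\log p(d) = \sum_{w \in V} c(w, d) \,
  \log \Big[ \lambda_B \, p(w \mid \theta_B)
    + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j} \, p(w \mid \theta_j) \Big]
```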
The last line is just the likelihood for the whole collection, and it is very similar, simply accounting for all the documents in the collection.

So what are the unknown parameters? I already said that there are two kinds: one is the coverage, and the other is the word distributions. Again, it is a useful exercise for you to think about exactly how many unknown parameters there are here. Trying to answer that question will help you understand the model in more detail, and it will also allow you to understand what output we generate when we use PLSA to analyze text data, because the output is precisely these unknown parameters.

After we have obtained the likelihood function shown here, the next step is parameter estimation, and we can do the usual thing: maximum likelihood estimation. Again, it is a constrained optimization problem, like what we have seen before, only now we have a collection of text and more parameters to estimate. We still have two kinds of constraints. One is on the word distributions: within each distribution, the word probabilities must sum to one. The other is on the topic coverage distribution: each document has to cover precisely these k topics, so the probabilities of covering each topic must sum to one. At this point, then, it is basically a well-defined applied math problem: we just need to solve the optimization problem. There is a function of many variables, and we need to figure out the values of these variables that make the function reach its maximum.
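Putting this together, the maximum likelihood estimation problem described here can be summarized as the following constrained optimization. This is again a reconstruction in the lecture's notation, with C denoting the collection and V the vocabulary:

```latex
\max_{\{\pi_{d,j}\},\,\{\theta_j\}} \; \log p(C)
  = \sum_{d \in C} \sum_{w \in V} c(w, d)\,
    \log \Big[ \lambda_B\, p(w \mid \theta_B)
      + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j) \Big]

\text{subject to} \quad
\sum_{w \in V} p(w \mid \theta_j) = 1 \;\; (j = 1, \dots, k),
\qquad
\sum_{j=1}^{k} \pi_{d,j} = 1 \;\; \text{for every } d \in C.
```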