This lecture gives an overview of statistical language models. These are general models that cover probabilistic topic models as special cases.

So first, what is a statistical language model? A statistical language model is basically a probability distribution over word sequences. So, for example, we might have a distribution that gives "today is Wednesday" a probability of 0.001. It might give "today Wednesday is", which is a non-grammatical sentence, a very, very small probability, as shown here. And similarly, another sentence, "the eigenvalue is positive", might get a probability of 0.00001. So as you can see, such a distribution is clearly context dependent. It depends on the context of discussion. Some word sequences might have higher probabilities than others, but the same sequence of words might have different probabilities in different contexts. And so this suggests that such a distribution can actually characterize a topic. Such a model can also be regarded as a probabilistic mechanism for generating text, and that just means we can view text data as data observed from such a model. For this reason, we call such a model a generative model.

So now, given a model, we can sample sequences of words. For example, based on the distribution that I have shown here on this slide, we might sample a sequence like "today is Wednesday", because it has a relatively high probability; we might often get such a sequence. We might also get "the eigenvalue is positive" sometimes, with a smaller probability, and very, very occasionally we might get "today Wednesday is", because its probability is so small. So in general, in order to characterize such a distribution, we must specify probability values for all these different sequences of words. Obviously, that's impossible, because it's impossible to enumerate all the possible sequences of words. So in practice, we will have to simplify the model in some way.

The simplest language model is called the unigram language model. In such a case, we simply assume that the text is generated by generating each word independently. In general, of course, the words may not be generated independently, but after we make this assumption, we can significantly simplify the language model. Basically, now the probability of a sequence of words w1 through wn will be just the product of the probabilities of the individual words: p(w1 ... wn) = p(w1) p(w2) ... p(wn). So for such a model, we have as many parameters as there are words in our vocabulary. Here we assume the vocabulary has N words, so we have N probabilities, one for each word, and they sum to 1.

So now we assume that our text is a sample drawn according to this word distribution. That just means we're going to draw a word each time, and eventually we'll get a text. So, for example, we can again try to sample words according to the distribution: we might get "Wednesday" often, or "today" often, and some other words like "eigenvalue" might have a small probability, etcetera. But with this, we can also compute the probability of every sequence, even though our model only specifies the probabilities of individual words. And this is because of the independence assumption. So specifically, we can compute the probability of "today is Wednesday", because it's just the product of the probability of "today", the probability of "is", and the probability of "Wednesday".
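To make that product computation concrete, here is a minimal Python sketch of a unigram language model. The word probabilities are made-up numbers for illustration only, not the values from the slide, and only a tiny part of a vocabulary is shown.

```python
# Minimal sketch of a unigram language model with made-up probabilities.
# Only a few vocabulary words are listed; a real model would assign a
# probability to every word in the vocabulary, summing to 1.
unigram_probs = {
    "today": 0.002,
    "is": 0.02,
    "wednesday": 0.001,
    "eigenvalue": 0.0001,
}

def sequence_probability(words, probs):
    """Under the unigram (independence) assumption:
    p(w1 ... wn) = p(w1) * p(w2) * ... * p(wn)."""
    p = 1.0
    for w in words:
        p *= probs[w]
    return p

# Probability of "today is Wednesday" as a product of word probabilities.
print(sequence_probability(["today", "is", "wednesday"], unigram_probs))
```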
For example, I show some fake numbers here, and when you multiply these numbers together, you get the probability of "today is Wednesday". So as you can see, with N probabilities, one for each word, we can actually characterize the probability distribution over all kinds of sequences of words. And so this is a very simple model. It ignores word order, so it may not be sufficient for some problems, such as speech recognition, where you may care about the order of words. But it turns out to be quite sufficient for many tasks that involve topic analysis, and that's also what we're interested in here.

So when we have a model, there are generally two problems that we can think about. One is: given a model, how likely are we to observe a certain kind of data? That is, we are interested in the sampling process. The other is the estimation process, and that is to estimate the parameters of the model given some observed data; we're going to talk about that in a moment.

Let's first talk about sampling. Here I show two examples of word distributions, or unigram language models. The first one has higher probabilities for words like "text", "mining", "association", etcetera. This signals a topic about text mining, because when we sample words from such a distribution, we tend to see words that often occur in text mining contexts. So in this case, if we ask what kind of document we are likely to generate, then we will likely see text that looks like a text mining paper. Of course, the text that we generate by drawing words from this distribution is unlikely to be coherent, although the probability of generating a text mining paper [INAUDIBLE] published in a top conference is non-zero, assuming that no word has a zero probability in the distribution. And that just means we can essentially generate all kinds of text documents, including very meaningful text documents. Now, the second distribution, shown on the bottom, gives high probabilities to a different set of words: "food", [INAUDIBLE], "healthy", [INAUDIBLE], etcetera. So this clearly indicates a different topic; in this case, it's probably about health. So if we sample words from such a distribution, then the probability of observing a text mining paper would be very, very small. On the other hand, the probability of observing a text that looks like a food nutrition paper would be relatively higher. So that just means different distributions tend to generate different kinds of text.
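Here is a small Python sketch of this sampling process, using two made-up word distributions that loosely mimic the text mining topic and the health topic. All of the words and probabilities are invented for illustration; they are not the values from the lecture slide.

```python
import random

# Two made-up unigram distributions (word -> probability), loosely mimicking
# a "text mining" topic and a "health" topic. The numbers are invented for
# illustration, and each toy distribution sums to 1.
text_mining_topic = {"text": 0.20, "mining": 0.15, "association": 0.10,
                     "clustering": 0.05, "the": 0.30, "is": 0.20}
health_topic = {"food": 0.25, "nutrition": 0.15, "healthy": 0.10,
                "diet": 0.05, "the": 0.30, "is": 0.15}

def sample_document(distribution, length=10, seed=None):
    """Draw `length` words independently from the given word distribution."""
    rng = random.Random(seed)
    words = list(distribution.keys())
    weights = list(distribution.values())
    return rng.choices(words, weights=weights, k=length)

# Text sampled from the first distribution tends to contain text mining words;
# text sampled from the second tends to contain health-related words.
print(" ".join(sample_document(text_mining_topic, seed=0)))
print(" ".join(sample_document(health_topic, seed=0)))
```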
Now let's look at the estimation problem. In this case, we're going to assume that we have observed the data, so we know exactly what the text data looks like. Let's assume we have a text mining paper; in fact, it's the abstract of a paper, so the total number of words is 100, and I've shown the counts of some individual words here. Now we ask the question: what is the most likely language model that has been used to generate this text data? That is, assuming the text is observed from some language model, what's our best guess of this language model? So the problem now is just to estimate the probabilities of these words, as I've shown here. So what do you think? What would be your guess? Would you guess that "text" has a very small probability, or a relatively large probability? What about "query"? Well, your guess would probably depend on how many times we have observed each word in the text data, right? Think about it for a moment.

If you are like many others, you would have guessed that "text" has a probability of 10 out of 100, because we have observed "text" 10 times in a text that has a total of 100 words. Similarly, "mining" gets 5 out of 100, and "query" gets a relatively small probability; it's observed just once, so it's 1 out of 100. So that, intuitively, is a reasonable guess. But the question is: is this our best guess, or best estimate, of the parameters? Of course, to answer this question, we have to define what we mean by "best". In this case, it turns out that our guess is indeed the best in a specific sense, and it is called the maximum likelihood estimate. It is the best in the sense that it gives the observed data the maximum probability; meaning that if you change the estimate somehow, even slightly, then the probability of the observed text data will become smaller. And this is called the maximum likelihood estimate.
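Here is a minimal Python sketch of this count-and-normalize estimate. The counts for "text", "mining", and "query" follow the example above; the remaining words of the hypothetical 100-word abstract are filled with a placeholder word just to make the total come out to 100.

```python
from collections import Counter

def max_likelihood_unigram(words):
    """Maximum likelihood estimate for a unigram language model:
    p(w) = count(w in document) / total number of words in the document."""
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

# Hypothetical 100-word abstract: "text" occurs 10 times, "mining" 5 times,
# "query" once, as in the lecture; the other 84 positions are a placeholder.
document = ["text"] * 10 + ["mining"] * 5 + ["query"] * 1 + ["other"] * 84
estimate = max_likelihood_unigram(document)
print(estimate["text"], estimate["mining"], estimate["query"])  # 0.1 0.05 0.01
```

Any other assignment of probabilities to these words would give this particular 100-word document a lower probability, which is exactly the sense in which the count-based estimate is the maximum likelihood estimate.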