So now let's talk about the problem a little bit more, and specifically let's talk about two different ways of estimating the parameters. One is called the maximum likelihood estimate, which I just mentioned. The other is Bayesian estimation.

In maximum likelihood estimation, we define "best" as meaning the data likelihood has reached its maximum. Formally, it's given by this expression here, where we define the estimate as the arg max of the probability of X given theta. Arg max here means a function that returns the argument that makes the function reach its maximum value, not the maximum value itself. So the value of arg max is not the value of this function, but rather the argument at which the function reaches its maximum. In this case, the value of arg max is a theta: the theta that makes the probability of X given theta reach its maximum.

This estimator is intuitive, and it's often very useful: it seeks the parameters that best explain the data. But it has a problem when the data sample is too small. If there are very few data points and we trust the data entirely and try to fit it exactly, the estimate will be biased. In the case of text data, let's say the 100 observed words did not contain another word related to text mining. Then our maximum likelihood estimator will give that word a zero probability, because giving it a non-zero probability would take away probability mass from some observed word, which is obviously not optimal in terms of maximizing the likelihood of the observed data. But this zero probability for all unseen words may not be reasonable, especially if we want the distribution to characterize the topic of text mining.

One way to address this problem is to use Bayesian estimation, where we look at both the data and our prior knowledge about the parameters. We assume that we have some prior belief about the parameters. In this case, we are not going to look at just the data, but also at the prior. The prior is given by p(theta), and it means we impose a preference for certain thetas over others. By using Bayes' rule, which I have shown here, we can then combine the likelihood function with the prior to give us the posterior probability of the parameters.

Now, a full explanation of Bayes' rule and Bayesian reasoning would be outside the scope of this course, but I'll give a brief introduction because this is general knowledge that may be useful to you. Bayes' rule, shown here, allows us to write one conditional probability, p(X given Y), in terms of the other conditional probability, p(Y given X); you can see that the two probabilities differ in the order of the two variables. Often the rule is used for making inferences about a variable, so let's take a look at it again. We can assume that p(X) encodes our prior belief about X. That means before we observe any other data, that's our belief about X: we believe some X values have higher probability than others. The conditional probability p(X given Y) is our posterior belief about X, because it's our belief about X values after we have observed Y. Given that we have observed Y, what do we now believe about X? Which values do we now believe have higher probabilities than others?
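To make the zero-probability issue concrete, here is a minimal Python sketch of the maximum likelihood estimate for a unigram language model. The toy word list is made up for illustration and is not from the lecture.

```python
from collections import Counter

def mle_unigram(words):
    """Maximum likelihood estimate of a unigram language model:
    p(w | theta) = count(w) / total number of words observed."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Hypothetical "document" of observed words.
observed = ["text", "mining", "text", "data", "clustering"]
theta_mle = mle_unigram(observed)

print(theta_mle["text"])                 # 0.4 -- seen twice out of 5 words
print(theta_mle.get("retrieval", 0.0))   # 0.0 -- any unseen word gets zero probability
```

The second print line shows the problem discussed above: the MLE assigns zero probability to every word it has not seen, however relevant that word might be to the topic.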
Now, the two probabilities are related through this term, which can be regarded as the probability of the observed evidence Y given a particular X. So you can think of X as our hypothesis: we have some prior belief about which hypothesis to choose, and after we have observed Y, we update our belief. The updating formula is based on the combination of our prior and the likelihood of observing this Y if X is indeed true. So much for the detour about Bayes' rule.

In our case, what we are interested in is inferring the theta values. We have a prior here that encodes our prior knowledge about the parameters, and we have the data likelihood here, which tells us which parameter values can explain the data well. The posterior probability combines both of them, so it represents a compromise between the two preferences. In such a case, we can maximize this posterior probability to find the theta that maximizes it, and this estimator is called the maximum a posteriori, or MAP, estimate. This estimator is more general than the maximum likelihood estimator, because if we define our prior as a noninformative prior, meaning it is uniform over all theta values with no preference, then we basically go back to the maximum likelihood estimate; in that case the result is mainly determined by the likelihood, the same as here. But if we have an informative prior, some bias towards certain values, then the MAP estimator allows us to incorporate that. The problem here, of course, is how to define the prior. There is no free lunch: if you want to solve the problem with more knowledge, you have to have that knowledge, and that knowledge ideally should be reliable. Otherwise, your estimate may not necessarily be more accurate than the maximum likelihood estimate.

So now let's look at Bayesian estimation in more detail. I show the theta values along a single dimension, which is a simplification of course, and we're interested in which value of theta is optimal. First we have the prior. The prior tells us that some values are more likely than others according to our belief; for example, these values are more likely than the values over here, or here, or other places. Then we have the likelihood function, which also tells us which values of theta are more likely; that is, which values of theta can best explain our data. When we combine the two, we get the posterior distribution, and that's just a compromise of the two: it says the answer is somewhere in between.

Now we can look at some interesting points. This point represents the mode of the prior, that is, the most likely parameter value according to our prior, before we observe any data. This point is the maximum likelihood estimate; it is the theta that gives the data the maximum probability. And this point is interesting: it's the posterior mode, the most likely value of theta according to the posterior distribution, and it represents a good compromise between the prior mode and the maximum likelihood estimate. In general, in Bayesian inference, we are interested in the whole distribution over parameter values, as you see here: a distribution over theta values, p(theta given X). One concrete way to encode such a prior for a unigram model is sketched below.
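As a working assumption (a standard choice, not something spelled out in this segment of the lecture), the prior for a unigram model can be a Dirichlet distribution, which behaves like pseudo-counts added to the observed counts. The MAP estimate then sits between the prior and the maximum likelihood estimate, and an all-ones, noninformative prior reduces exactly to the MLE. The vocabulary and counts below are made up for illustration.

```python
from collections import Counter

def map_unigram(words, vocab, pseudo_counts):
    """MAP estimate of a unigram model under a Dirichlet prior.

    With prior p(theta) ~ Dirichlet(alpha), the posterior mode is
    p(w) = (count(w) + alpha_w - 1) / (N + sum(alpha) - |V|),
    assuming every alpha_w >= 1.
    """
    counts = Counter(words)
    n = len(words)
    alpha_sum = sum(pseudo_counts.values())
    denom = n + alpha_sum - len(vocab)
    return {w: (counts[w] + pseudo_counts[w] - 1) / denom for w in vocab}

vocab = ["text", "mining", "data", "retrieval"]
observed = ["text", "mining", "text", "data", "data"]

# Noninformative prior: alpha_w = 1 everywhere -> MAP reduces to the MLE,
# so the unseen word "retrieval" still gets probability 0.
uniform = {w: 1.0 for w in vocab}
print(map_unigram(observed, vocab, uniform))

# Informative prior: as if each vocabulary word had been seen twice before.
# Now no word gets zero probability -- the compromise described above.
informative = {w: 2.0 for w in vocab}
print(map_unigram(observed, vocab, informative))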
So the problem of Bayesian inference is to infer this posterior distribution, and also to infer other interesting quantities that may depend on theta. I show f of theta here as an interesting quantity that we want to compute. In order to compute this value, we need to know the value of theta, but in Bayesian inference we treat theta as an uncertain variable, so we consider all the possible values of theta. Therefore, we can estimate the value of this function f as the expected value of f according to the posterior distribution of theta given the observed evidence X.

As a special case, we can assume f of theta is just equal to theta. In this case, we get the expected value of theta, which is the posterior mean. That gives us another point estimate of theta; it's sometimes the same as the posterior mode, but not always. So it gives us yet another way to estimate the parameters. This is a general illustration of Bayesian estimation and inference, and later you will see that it can be useful for topic mining, where we want to inject some prior knowledge about the topics.

To summarize, we've introduced the language model, which is basically a probability distribution over text; it's also called a generative model for text data. The simplest language model is the unigram language model, which is basically a word distribution. We introduced the concept of the likelihood function, which is the probability of the data given the model. This function is very important: given a particular set of parameter values, it tells us which X, which data point, has a higher likelihood, a higher probability. Given a data sample X, we can use this function to determine which parameter values maximize the probability of the observed data, and this is the maximum likelihood estimate. We also talked about Bayesian estimation, or inference. In this case, we must define a prior on the parameters, p(theta), and then we are interested in computing the posterior distribution of the parameters, which is proportional to the prior times the likelihood. This distribution then allows us to infer any quantity derived from theta, as the sketch below illustrates for the posterior mean.
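Continuing the same Dirichlet working assumption from the earlier sketch, the posterior over theta is again a Dirichlet, so the posterior mean E[theta_w | X] has a simple closed form: (count(w) + alpha_w) / (N + sum(alpha)). This is the special case f(theta) = theta of the expectation described above. The toy data is again made up.

```python
from collections import Counter

def posterior_mean_unigram(words, vocab, pseudo_counts):
    """Posterior mean of a unigram model under a Dirichlet(alpha) prior:
    E[theta_w | X] = (count(w) + alpha_w) / (N + sum(alpha))."""
    counts = Counter(words)
    denom = len(words) + sum(pseudo_counts.values())
    return {w: (counts[w] + pseudo_counts[w]) / denom for w in vocab}

vocab = ["text", "mining", "data", "retrieval"]
observed = ["text", "mining", "text", "data", "data"]
alpha = {w: 2.0 for w in vocab}

print(posterior_mean_unigram(observed, vocab, alpha))
# Compare with the MAP (posterior mode) sketch above: the two point
# estimates are close on this toy data but are in general not identical.
```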