This lecture is about syntagmatic relation discovery and mutual information. In this lecture, we are going to continue discussing syntagmatic relation discovery. In particular, we are going to talk about another concept in information theory, called mutual information, and how it can be used to discover syntagmatic relations.

Before, we talked about a problem with conditional entropy, namely that the conditional entropy computed for different pairs of words is not really comparable, and that makes it harder to discover strong syntagmatic relations globally from a corpus. So now we are going to introduce mutual information, which is another concept in information theory that allows us to, in some sense, normalize the conditional entropy to make it more comparable across different pairs.

In particular, mutual information, denoted by I(X;Y), measures the entropy reduction of X obtained by knowing Y. More specifically, the question we are interested in here is: how much reduction in the entropy of X can we obtain by knowing Y? Mathematically, it can be defined as the difference between the original entropy of X and the conditional entropy of X given Y. And as you can see here, it can also be defined as the reduction of the entropy of Y obtained by knowing X. Now, normally the two conditional entropies, H(X|Y) and H(Y|X), are not equal, but interestingly, the reduction of entropy obtained by knowing one of them is actually the same. So this quantity is called mutual information, denoted by I here.

This function has some interesting properties. First, it is non-negative. This is easy to understand, because the original entropy is never lower than the possibly reduced conditional entropy. In other words, the conditional entropy will never exceed the original entropy: knowing some information can potentially help us, but will never hurt us, in predicting X. The second property is that it is symmetric. While conditional entropy is not symmetric, mutual information is. And the third property is that it reaches its minimum, zero, if and only if the two random variables are completely independent. That means knowing one of them does not tell us anything about the other. This last property can be verified by simply looking at the equation above: it reaches zero if and only if the conditional entropy of X given Y is exactly the same as the original entropy of X. That means knowing Y did not help at all, and that is when X and Y are completely independent.
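In symbols, the definition and the three properties just discussed can be summarized as follows (this is just a compact restatement of what was described above):

```latex
\begin{align}
I(X;Y) &= H(X) - H(X \mid Y) \;=\; H(Y) - H(Y \mid X) \\
I(X;Y) &\ge 0 \\
I(X;Y) &= I(Y;X) \\
I(X;Y) &= 0 \quad \text{if and only if } X \text{ and } Y \text{ are independent}
\end{align}
```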
Now, when we fix X and rank different Ys, using conditional entropy would give the same order as ranking based on mutual information, because in the formula here H(X) is fixed, since X is fixed. So ranking based on mutual information is exactly the same as ranking based on the conditional entropy of X given Y, but mutual information allows us to compare different pairs of X and Y. That is why mutual information is more general and, in general, more useful.

So let us examine the intuition of using mutual information for syntagmatic relation mining. The question we ask in syntagmatic relation mining is: whenever "eats" occurs, what other words also tend to occur? This question can be framed as a mutual information question, that is, which words have high mutual information with "eats"? So we compute the mutual information between "eats" and other words. If we do that, the intuition is basically the same as with conditional entropy: words that are strongly associated with "eats" will have high mutual information, whereas words that are not related will have lower mutual information.

I will give some examples here. The mutual information between "eats" and "meat", which is the same as between "meat" and "eats" because mutual information is symmetric, is expected to be higher than the mutual information between "eats" and "the", because knowing "the" does not really help us predict "eats", and similarly, knowing "eats" does not help us predict "the". You can also easily see that the mutual information between a word and itself is the largest, and it is equal to the entropy of that word, because in this case the reduction is maximal: knowing one allows us to predict the other completely, so the conditional entropy is zero and the mutual information reaches its maximum. It is going to be larger than or equal to the mutual information between "eats" and any other word. In other words, if you pick any other word and compute the mutual information between "eats" and that word, you will not get a value larger than the mutual information between "eats" and itself.

So now let us look at how to compute the mutual information. In order to do that, we often use a different form of mutual information; we can mathematically rewrite the mutual information into the form shown on this slide, where we essentially see a formula that computes what is called KL-divergence. This is another term from information theory, and it measures the divergence between two distributions. If you look at the formula, it is a sum over the combinations of different values of the two random variables, but inside the sum we are mainly doing a comparison between two joint distributions. The numerator is the actual observed joint distribution of the two random variables. The bottom part, or the denominator, can be interpreted as the expected joint distribution of the two random variables if they were independent, because when two random variables are independent, their joint distribution is equal to the product of the two marginal probabilities. So this comparison tells us whether the two variables are indeed independent. If they are indeed independent, then we would expect the two to be the same, but if the numerator is different from the denominator, that means the two variables are not independent, and that helps measure the association. The sum simply takes into consideration all of the combinations of the values of these two random variables. In our case, each random variable can take one of two values, zero or one, so we have four combinations here. If we look at this form of mutual information, it shows that the mutual information measures the divergence of the actual joint distribution from the expected joint distribution under the independence assumption. The larger this divergence is, the higher the mutual information will be.
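Writing X_{w1} and X_{w2} for the binary presence variables of the two words (1 if the word occurs in a segment, 0 if it does not), the form just described is the following sum over the four combinations of values; the numerator is the observed joint distribution and the denominator is the expected joint distribution under independence. A base-2 logarithm is assumed here, matching the convention used for entropy.

```latex
\begin{equation}
I(X_{w_1}; X_{w_2}) \;=\;
\sum_{u \in \{0,1\}} \; \sum_{v \in \{0,1\}}
p(X_{w_1}=u, X_{w_2}=v) \,
\log_2 \frac{p(X_{w_1}=u, X_{w_2}=v)}{p(X_{w_1}=u)\, p(X_{w_2}=v)}
\end{equation}
```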
So now let us look further at exactly which probabilities are involved in this formula of mutual information. Here are all the probabilities involved, and it is easy for you to verify that. Basically, we first have the marginal probabilities corresponding to the presence or absence of each word. So, for w1, we have the two probabilities shown here. They should sum to one, because a word is either present or absent in a segment. Similarly, for the second word, we also have two probabilities representing the presence or absence of this word, and they sum to one as well. And finally, we have the joint probabilities that represent the scenarios of co-occurrence of the two words, and they are shown here. They also sum to one, because the two words can only be in these four possible scenarios: either they both occur, in which case both variables have a value of one; or only one of them occurs, which gives two scenarios in which one of the random variables equals one and the other equals zero; or, finally, neither of them occurs, which is when both variables take a value of zero. So these are the probabilities involved in the calculation of mutual information over here. Once we know how to calculate these probabilities, we can easily calculate the mutual information.

It is also interesting to note that there are actually some relations, or constraints, among these probabilities, and we have already seen two of them. On the previous slide, you have seen that the marginal probabilities of each word sum to one, and we have also seen the constraint that says the two words can only be in these four scenarios of co-occurrence. But we also have some additional constraints, listed at the bottom. For example, this one means that if we add up the probability that we observe the two words occurring together and the probability that the first word occurs while the second word does not, we get exactly the probability that the first word is observed. In other words, when the first word is observed, there are only two scenarios, depending on whether the second word is also observed. So this probability captures the first scenario, when the second word is also observed, and this one captures the second scenario, when the second word is not observed and we see only the first word. It is easy to see that the other equations follow the same reasoning.

Now, these equations allow us to compute some probabilities based on other probabilities, and this can simplify the computation. More specifically, if we know the probability that a word is present, like in this case, and if we know the probability of the presence of the second word, then we can easily compute the absence probabilities; it is very easy to use this equation to do that. So that takes care of the computation of the probabilities of presence and absence of each word. Now let us look at the joint distribution. Let us assume that we also have available the probability that the two words occur together. It is easy to see that we can then compute all the rest of these probabilities. Specifically, for example, using this equation we can compute the probability that the first word occurs and the second word does not, because we know the probabilities in the boxes, and similarly, using this equation we can compute the probability that we observe only the second word. And finally, this probability can be calculated by using this equation, because this one is known, this one is also known, and this one is already known, so it can now be calculated as well. So this slide shows that we only need to know how to compute the three probabilities shown in the boxes, namely the presence of each word and the co-occurrence of both words in a segment.
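As an illustration of the computation just described, here is a minimal sketch in Python. It assumes we are given only the three boxed probabilities, namely the presence probability of each word and their co-occurrence probability (the function name and the example values are made up for illustration), derives the remaining probabilities from the constraints above, and then sums the four terms of the KL-divergence form.

```python
import math

def mutual_information(p_w1, p_w2, p_w1_w2):
    """Compute I(X_w1; X_w2) from three probabilities:
    p_w1    = p(X_w1 = 1)            -- word 1 is present in a segment
    p_w2    = p(X_w2 = 1)            -- word 2 is present in a segment
    p_w1_w2 = p(X_w1 = 1, X_w2 = 1)  -- both words occur in the same segment
    """
    # Absence probabilities: presence and absence sum to one.
    p_not_w1 = 1.0 - p_w1
    p_not_w2 = 1.0 - p_w2

    # Remaining joint probabilities, derived from the constraints:
    # p(1,1) + p(1,0) = p(X_w1 = 1), p(1,1) + p(0,1) = p(X_w2 = 1),
    # and the four joint probabilities sum to one.
    joint = {
        (1, 1): p_w1_w2,
        (1, 0): p_w1 - p_w1_w2,
        (0, 1): p_w2 - p_w1_w2,
        (0, 0): 1.0 - p_w1 - p_w2 + p_w1_w2,
    }
    marg_w1 = {1: p_w1, 0: p_not_w1}
    marg_w2 = {1: p_w2, 0: p_not_w2}

    # Sum over the four combinations: observed joint distribution vs. the
    # expected joint distribution under independence (product of marginals).
    mi = 0.0
    for (u, v), p_uv in joint.items():
        if p_uv > 0:  # zero-probability terms contribute nothing
            mi += p_uv * math.log2(p_uv / (marg_w1[u] * marg_w2[v]))
    return mi

# Hypothetical values: "eats" occurs in 30% of segments, "meat" in 10%,
# and they co-occur in 8% of segments (vs. 3% expected under independence).
print(mutual_information(0.3, 0.1, 0.08))  # clearly positive
print(mutual_information(0.3, 0.1, 0.03))  # essentially zero: independent
```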