By now, you should have a sense of how word embeddings can help you build NLP applications. One of the most fascinating properties of word embeddings is that they can also help with analogy reasoning. And while reasonable analogies may not be by itself the most important NLP application, they might also help convey a sense of what these word embeddings are doing, what these word embeddings can do. Let me show you what I mean here are the featurized representations of a set of words that you might hope a word embedding could capture. Let's say I pose a question, man is to woman as king is to what? Many of you will say, man is to woman as king is to queen. But is it possible to have an algorithm figure this out automatically? Well, here's how you could do it, let's say that you're using this four dimensional vector to represent man. So this will be your E5391, although just for this video, let me call this e subscript man. And let's say that's the embedding vector for woman, so I'm going to call that e subscript woman, and similarly for king and queen. And for this example, I'm just going to assume you're using four dimensional embeddings, rather than anywhere from 50 to 1,000 dimensional, which would be more typical. One interesting property of these vectors is that if you take the vector, e man, and subtract the vector e woman, then, You end up with approximately -1, negative another 1 is -2, decimal 0- 0, 0- 0, close to 0- 0, so you get roughly -2 0 0 0. And similarly if you take e king minus e queen, then that's approximately the same thing. That's about -1- 0.97, it's about -2. This is about 1- 1, since kings and queens are both about equally royal. So that's 0, and then age difference, food difference, 0. And so what this is capturing is that the main difference between man and woman is the gender. And the main difference between king and queen, as represented by these vectors, is also the gender. Which is why the difference e man- e woman, and the difference e king- e queen, are about the same. So one way to carry out this analogy reasoning is, if the algorithm is asked, man is to woman as king is to what? What it can do is compute e man- e woman, and try to find a vector, try to find a word so that e man- e woman is close to e king- e of that new word. And it turns out that when queen is the word plugged in here, then the left hand side is close to the the right hand side. So these ideas were first pointed out by Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. And it's been one of the most remarkable and surprisingly influential results about word embeddings. And I think has helped the whole community get better intuitions about what word embeddings are doing. So let's formalize how you can turn this into an algorithm. In pictures, the word embeddings live in maybe a 300 dimensional space. And so the word man is represented as a point in the space, and the word woman is represented as a point in the space. And the word king is represented as another point, and the word queen is represented as another point. And what we pointed out really on the last slide is that the vector difference between man and woman is very similar to the vector difference between king and queen. And this arrow I just drew is really the vector that represents a difference in gender. And remember, these are points we're plotting in a 300 dimensional space. So in order to carry out this kind of analogical reasoning to figure out, man is to woman is king is to what, what you can do is try to find the word w, So that, This equation holds true, so you want there to be, so you want to find the word w Then finding the word that maximizes the similarity e w compared to e king minus e man plus e woman Right, so what I did is, I took this e question mark, and replaced that with ew, and then brought ew to just one side of the equation. And then the other three terms to the right hand side of this equation. So we have some appropriate similarity function for measuring how similar is the embedding of some word w to this quantity of the right. Then finding the word that maximizes the similarity should hopefully let you pick out the word queen. And the remarkable thing is, this actually works. If you learn a set of word embeddings and find a word w that maximizes this type of similarity, you can actually get the exact right answer. Depending on the details of the task, but if you look at research papers, it's not uncommon for research papers to report anywhere from, say, 30% to 75% accuracy on analogy using tasks like these. Where you count an anology attempt as correct only if it guesses the exact word right. So only if, in this case, it picks out the word queen. Before moving on, I just want to clarify what this plot on the left is. Previously, we talked about using algorithms like t-SAE to visualize words. What t-SAE does is, it takes 300-D data, and it maps it in a very non-linear way to a 2D space. And so the mapping that t-SAE learns, this is a very complicated and very non-linear mapping. So after the t-SAE mapping, you should not expect these types of parallelogram relationships, like the one we saw on the left, to hold true. And it's really in this original 300 dimensional space that you can more reliably count on these types of parallelogram relationships in analogy pairs to hold true. And it may hold true after a mapping through t-SAE, but in most cases, because of t-SAE's non-linear mapping, you should not count on that. And many of the parallelogram analogy relationships will be broken by t-SAE. Now before moving on, let me just describe the similarity function that is most commonly used. So the most commonly used similarity function is called cosine similarity. So this is the equation we had from the previous slide. So in cosine similarity, you define the similarity between two vectors u and v as u transpose v divided by the lengths by the Euclidean lengths. So ignoring the denominator for now, this is basically the inner product between u and v. And so if u and v are very similar, their inner product will tend to be large. And this is called cosine similarity because this is actually the cosine of the angle between the two vectors, u and v. So that's the angle phi, so this formula is actually the cosine of the angle between them. And so you remember from calculus that if this phi, then the cosine of phi looks like this. So if the angle between them is 0, then the cosine similarity is equal to 1. And if their angle is 90 degrees, the cosine similarity is 0. And then if they're 180 degrees, or pointing in completely opposite directions, it ends up being -1. So that's where the term cosine similarity comes from, and it works quite well for these analogy reasoning tasks. If you want, you can also use square distance or Euclidian distance, u-v squared. Technically, this would be a measure of dissimilarity rather than a measure of similarity. So we need to take the negative of this, and this will work okay as well. Although I see cosine similarity being used a bit more often. And the main difference between these is how it normalizes the lengths of the vectors u and v. So one of the remarkable results about word embeddings is the generality of analogy relationships they can learn. So for example, it can learn that man is to woman as boy is to girl, because the vector difference between man and woman, similar to king and queen and boy and girl, is primarily just the gender. It can learn that Ottawa, which is the capital of Canada, that Ottawa is to Canada as Nairobi is to Kenya. So that's the city capital is to the name of the country. It can learn that big is to bigger as tall is to taller, and it can learn things like that. Yen is to Japan, since yen is the currency of Japan, as ruble is to Russia. And all of these things can be learned just by running a word embedding learning algorithm on the large text corpus It can spot all of these patterns by itself, So in this video, you saw how word embeddings can be used for analogy reasoning. And while you might not be trying to build an analogy reasoning system yourself as an application, this I hope conveys some intuition about the types of feature-like representations that these representations can learn. And you also saw how cosine similarity can be a way to measure the similarity between two different word embeddings. Now, we talked a lot about properties of these embeddings and how you can use them. Next, let's talk about how you'd actually learn these word embeddings, let's go on to the next video.