[MUSIC] This lecture is about topic mining and analysis. We're going to talk about using a term as a topic. This is a slide that you have seen in an earlier lecture, where we defined the task of topic mining and analysis. We also raised the question: how exactly do we define a topic theta? In this lecture, we're going to offer one way to define it, and that's our initial idea. The idea is to define a topic simply as a term. A term can be a word or a phrase, and in general we can use such terms to describe topics. So our first thought is just to define a topic as one term, for example terms like sports, travel, or science, as you see here.

If we define topics in this way, we can then analyze the coverage of these topics in each document. For example, we might want to discover to what extent document one covers sports, and we might find that 30% of the content of document one is about sports, 12% is about travel, and so on. We might also discover that document two does not cover sports at all, so its coverage is zero.

Now, as we discussed in the task definition for topic mining and analysis, we have two tasks: the first is to discover the topics, and the second is to analyze their coverage. So let's first think about how we can discover topics if we represent each topic by a term. That means we need to mine k topical terms from a collection. There are, of course, many different ways of doing that, and we're going to talk about a natural way of doing it, which is also likely to be effective.

First, we parse the text data in the collection to obtain candidate terms. Candidate terms can be words or phrases; the simplest solution is just to take each word as a term. These words then become candidate topics. Next, we design a scoring function to measure how good each term is as a topic. How can we design such a function? There are many things we can consider. For example, we can use pure statistics to design the scoring function. Intuitively, we would like to favor representative terms, meaning terms that can represent a lot of content in the collection, so we want to favor frequent terms. However, if we simply use frequency to design the scoring function, then the highest-scored terms would be general or functional terms like "the", which occur very frequently in English. We want to avoid having such words at the top, so we want to penalize them. In general, we would like to favor terms that are fairly frequent but not too frequent. One particular approach is based on TF-IDF weighting from information retrieval, where TF stands for term frequency and IDF stands for inverse document frequency. We talked about some of these ideas in the lectures on discovering word associations. These are statistical methods, meaning the scoring function is defined mostly based on statistics, so it is very general and can be applied to any language and any text. When we apply such an approach to a particular problem, we might also be able to leverage some domain-specific heuristics. For example, in news articles we might want to favor title words, because authors tend to use the title to describe the topic of an article. If we're dealing with tweets, we could favor hashtags, which were invented to denote topics, so naturally hashtags can be good candidates for representing topics.
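To make the scoring step concrete, here is a minimal sketch in Python of one possible TF-IDF-style score for candidate terms. It is only an illustration under simple assumptions, not the lecture's prescribed formula: documents are plain strings, every word is a candidate term, and the particular weighting used here (sublinear collection frequency times a smoothed IDF) is just one reasonable choice.

```python
import math
import re
from collections import Counter

def tfidf_term_scores(docs):
    """Score each candidate word as a potential topic.

    Favors terms that are frequent in the collection (high total TF)
    but penalizes terms that appear in almost every document (low IDF),
    so function words like "the" are pushed down the ranking.
    """
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    n_docs = len(tokenized)

    total_tf = Counter()   # total term frequency across the collection
    df = Counter()         # number of documents containing the term
    for tokens in tokenized:
        total_tf.update(tokens)
        df.update(set(tokens))

    scores = {}
    for term, tf in total_tf.items():
        idf = math.log((n_docs + 1) / (df[term] + 1))   # smoothed IDF
        scores[term] = (1 + math.log(tf)) * idf         # sublinear TF * IDF
    return scores

# Example: rank candidate terms for a tiny toy collection.
docs = [
    "the team won the basketball game",
    "the travel plan covers the flight and the hotel",
    "the star player scored in the game",
]
ranked = sorted(tfidf_term_scores(docs).items(), key=lambda x: -x[1])
print(ranked[:5])
```

Note how a word like "the", which appears in every document, receives an IDF of zero and drops to the bottom of the ranking, which is exactly the penalization described above.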
Anyway, after we have designed this scoring function, we can discover the k topical terms by simply picking the k terms with the highest scores. Of course, we might encounter a situation where the highest-scored terms are all very similar: semantically similar, closely related, or even synonyms. That's not desirable, because we also want coverage over all the content in the collection, so we would like to remove redundancy. One way to do that is a greedy algorithm sometimes called maximal marginal relevance (MMR) ranking. The idea is to go down the list ranked by our scoring function and gradually pick terms until we have collected k topical terms. The first term, of course, is always picked. When we pick the next term, we look at the terms that have already been picked and try to avoid picking a term that is too similar to them. So while we consider the rank of a candidate term in the list, we also consider its redundancy with respect to the terms we have already picked. With some thresholding, we can get a balance between redundancy removal and a high score for each picked term. After this step we will have k topical terms, and those can be regarded as the topics discovered from the collection.

Next, let's think about how to compute the topic coverage pi sub ij, the coverage of topic j in document i. Looking at this picture, we have sports, travel, and science as the topics, and suppose we are given a document. How should we compute the coverage of each topic in the document? One approach is to simply count occurrences of these terms. For example, sports might occur four times in the document and travel twice, and so on. Then we can just normalize these counts to get our estimate of the coverage probability for each topic. In general, the formula is to collect the counts of all the terms that represent the topics and then normalize them, so that the coverage of all the topics in a document adds up to one. This gives us a distribution over the topics for each document, characterizing the coverage of different topics in that document.
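To make these two steps concrete, the greedy redundancy-aware selection and the count-and-normalize coverage estimate, here is a minimal sketch in Python. It assumes the term scores from the earlier sketch are already available and that some term-similarity function exists (in practice this might come from word embeddings or co-occurrence statistics); the threshold of 0.8 and the toy similarity function below are purely illustrative, not part of the lecture.

```python
def pick_topical_terms(scores, k, similarity, threshold=0.8):
    """Greedy, MMR-style selection: walk down the score-ranked list and
    skip any candidate that is too similar to an already-picked term."""
    picked = []
    for term, _ in sorted(scores.items(), key=lambda x: -x[1]):
        if all(similarity(term, t) < threshold for t in picked):
            picked.append(term)
        if len(picked) == k:
            break
    return picked

def topic_coverage(doc_tokens, topical_terms):
    """Estimate coverage pi_i = count(term_i, doc) / sum_j count(term_j, doc)."""
    counts = {t: doc_tokens.count(t) for t in topical_terms}
    total = sum(counts.values())
    if total == 0:
        return {t: 0.0 for t in topical_terms}   # no topical term occurs at all
    return {t: c / total for t, c in counts.items()}

# Tiny usage example with a stand-in similarity function (a real system would
# use embeddings or co-occurrence statistics instead of first letters).
same_first_letter = lambda a, b: 1.0 if a[0] == b[0] else 0.0
topics = pick_topical_terms({"sports": 3.0, "soccer": 2.9, "travel": 2.5}, k=2,
                            similarity=same_first_letter)
print(topics)                                     # ['sports', 'travel']
print(topic_coverage("the travel star scored in the game".split(), topics))
```

Notice that the coverage estimate only counts exact occurrences of the chosen topical terms, which is precisely the weakness the following example exposes.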
Now, as always, when we think about an idea for solving a problem, we have to ask: how good is it? Is this the best way of solving the problem? So let's examine this approach. In general, we would do some empirical evaluation using actual data sets to see how well it works, but in this case let's just look at a simple example. Here we have a text document about an NBA basketball game, so in terms of content it is about sports. But if we simply count the words that represent our topics, we find that the word "sports" did not actually occur in the article, even though the content is about sports. The count of "sports" is zero, which means the coverage of sports would be estimated as zero. Of course, the term "science" also did not occur in the document, and its estimate is also zero; that's okay. But a zero estimate for sports is certainly not okay, because we know the content is about sports. So this estimate has a problem. What's worse, the term "travel" actually occurred in the document, so when we estimate the coverage of the topic travel, we get a non-zero count, and its estimated coverage will be non-zero. That is obviously not desirable either.

So this simple example illustrates some problems with this approach. First, when we count which words belong to a topic, we also need to consider related words. We can't simply count the topic word "sports"; in this case it did not occur at all, but there are many related words like basketball, game, and so on, and we need to count those as well. The second problem is that a word like "star" can be ambiguous. Here it probably means a basketball star, but we can imagine it might also mean a star in the sky, in which case it might actually suggest a topic like science. We need to deal with that as well. Finally, a main restriction of this approach is that we have only one term to describe a topic, so it cannot really describe complicated topics. For example, a very specialized topic in sports would be hard to describe with just one word or one phrase; we would need more words.

So this example illustrates some general problems with treating a term as a topic. First, it lacks expressive power: it can only represent simple, general topics, not complicated topics that require more words to describe. Second, it is incomplete in vocabulary coverage: the topic is represented by only one term, and that term does not tell us which other terms are related to the topic. Even for sports there are many related terms, so this representation does not allow us to easily count related terms when computing the coverage of the topic. Finally, there is the problem of word sense ambiguity: a topical term or a related term can itself be ambiguous, for example a basketball star versus a star in the sky. So in the next lecture, we're going to talk about how to address these problems by improving this way of representing a topic. [MUSIC]