[MUSIC] This lecture is about the evaluation of text clustering. So far we have talked about multiple ways of doing text clustering, but how do we know which method works the best? This has to do with evaluation. Now, to talk about evaluation, we must go back to the clustering bias that we introduced at the beginning. Because two objects can be similar depending on how you look at them, we must clearly specify the perspective of similarity; without that, the problem of clustering is not well defined. So this perspective is also very important for evaluation. If you look at this slide, you can see two different ways to cluster these shapes, and if you ask which one is the best, or which one is better, you'll see there is no way to answer this question without knowing whether we'd like to cluster based on shapes or based on sizes. And that's precisely why the perspective, the clustering bias, is crucial for evaluation. In general, we can evaluate text clusters in two ways: one is direct evaluation, and the other is indirect evaluation. In direct evaluation, we want to answer the following question: how close are the system-generated clusters to the ideal clusters that are generated by humans? The closeness here can be assessed from multiple perspectives, which helps us characterize the quality of a clustering result from multiple angles, and this is sometimes desirable. We also want to quantify the closeness, because this allows us to easily compare different methods based on their performance figures. And finally, you can see that in this case we essentially inject the clustering bias by using humans: humans bring in the clustering bias needed by the application. Now, how do we do that exactly? Well, the general procedure looks like this.
Given a test set consisting of many text objects, we can have humans create the ideal clustering result; that is, we ask humans to partition the objects to create a gold standard. They will use their judgment, based on the needs of a particular application, to generate what they think is the best clustering result, and this is then compared with the system-generated clusters on the same test set. Ideally, we want the system results to be the same as the human-generated results, but in general they are not going to be the same. So we would like to quantify the similarity between the system-generated clusters and the gold standard clusters. This similarity can also be measured from multiple perspectives, which gives us various measures for quantitatively evaluating a clustering result. Some of the commonly used measures include purity, which measures whether the objects in a system-generated cluster come from the same cluster in the gold standard. Normalized mutual information is another commonly used measure, which basically asks: given the cluster identity of an object in the system-generated result, how well can you predict the cluster of that object in the gold standard, or vice versa? Mutual information captures the correlation between these cluster labels, and normalized mutual information is often used to quantify the similarity for this evaluation purpose. The F measure is another possible measure. Now, again, a thorough discussion of these evaluation issues is beyond the scope of this course; I've suggested some readings at the end that you can look at to learn more. Here I just want to discuss some high-level ideas that will allow you to think about how to do evaluation in your applications. The second way to evaluate text clusters is indirect evaluation.
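To make these measures concrete, here is a small self-contained sketch (not from the lecture) that computes purity and normalized mutual information for two aligned lists of cluster labels. The toy `gold` and `system` assignments are invented for illustration, and NMI here uses the arithmetic-mean normalization; other normalizations exist.

```python
import math
from collections import Counter

def purity(system, gold):
    """Fraction of objects whose system cluster's majority gold label matches them."""
    clusters = {}
    for s, g in zip(system, gold):
        clusters.setdefault(s, []).append(g)
    # For each system cluster, count the most common gold label among its members.
    correct = sum(Counter(members).most_common(1)[0][1]
                  for members in clusters.values())
    return correct / len(system)

def nmi(system, gold):
    """Normalized mutual information between two clusterings (arithmetic-mean norm)."""
    n = len(system)
    joint = Counter(zip(system, gold))   # joint counts of (system label, gold label)
    ps, pg = Counter(system), Counter(gold)
    mi = sum((c / n) * math.log((c / n) / ((ps[s] / n) * (pg[g] / n)))
             for (s, g), c in joint.items())
    hs = -sum((c / n) * math.log(c / n) for c in ps.values())
    hg = -sum((c / n) * math.log(c / n) for c in pg.values())
    return 2 * mi / (hs + hg) if hs + hg > 0 else 1.0

# Toy example: 8 objects, 3 gold clusters, an imperfect system clustering.
gold   = [0, 0, 0, 1, 1, 2, 2, 2]
system = [0, 0, 1, 1, 1, 2, 2, 0]
print(round(purity(system, gold), 3))  # → 0.75
print(round(nmi(system, gold), 3))     # → 0.559
```

A perfect system clustering would give purity 1.0 and NMI 1.0; note that purity alone can be gamed by putting every object in its own cluster, which is one reason NMI is often preferred.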
So in this case the question to answer is: how useful are the clustering results for the intended application? This is, of course, an application-specific question, so usefulness will depend on the specific application. In this case, the clustering bias is imposed by the intended application as well, so what counts as the best clustering result depends on the application. Procedure-wise, we would also create a test set with text objects for the intended application to quantify the performance of the system. Here, what we care about is the contribution of clustering to some application, so we often have a baseline system to compare with. This could be the current system for doing something, which you hope to improve by adding clustering; or the baseline system could use a different clustering method than the one you are experimenting with, which you hope is a better way of clustering. In any case, you have a baseline system to work with, and you add a clustering algorithm to the baseline to produce a clustering system. Then we compare the performance of the clustering system and the baseline system in terms of the performance measure for that particular application. We call this indirect evaluation of clusters because there is no explicit assessment of the quality of the clusters; rather, we assess the contribution of the clusters to a particular application. So, to summarize text clustering: it is a very useful unsupervised general text mining technique, and it is particularly useful for obtaining an overall picture of the text content. This is often needed to explore text data, and it is often the first step when you deal with a lot of text data. The second kind of application is to discover interesting clustering structures in text data, and these structures can be very meaningful.
There are many approaches that can be used to perform text clustering, and we discussed model-based approaches and similarity-based approaches. In general, strong clusters tend to show up no matter what method is used. Also, the effectiveness of a method highly depends on whether the desired clustering bias is captured appropriately, and this can be done either through the right generative model, with the model designed appropriately for the clustering task, or through the right similarity function that explicitly defines the bias. Deciding the optimal number of clusters is a very difficult problem for all clustering methods, because clustering is unsupervised, and there is no training data to guide us in selecting the best number of clusters. Now, sometimes you may see methods that can automatically determine the number of clusters, but in general there is some implied clustering bias there that is just not specified. Without clearly defining a clustering bias, it is impossible to say what the optimal number of clusters is, so this is important to keep in mind. I should also say that sometimes we can use the application to determine the number of clusters; for example, if you are clustering search results, then obviously you don't want to generate 100 clusters, so the number can be dictated by the interface design. In other situations, we might be able to use the fit to the data to assess whether we have a good number of clusters to explain the data well. To do that, you can vary the number of clusters and watch how well you can fit the data. In general, when you add more components to a mixture model, you should fit the data at least as well as before, because you can always set the probability of using the new component to zero.
So in general you can't fit the data worse than before; the question is whether, as you add more components, you can significantly improve the fit to the data, and that can be used to determine the right number of clusters. And finally, evaluation of clustering results can be done both directly and indirectly, and we often would like to do both in order to get a good sense of how well our method works. So here's some suggested reading, which is particularly useful for better understanding how the measures are calculated, and clustering in general. [MUSIC]
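The "vary the number of components and watch the fit" idea can be sketched with a tiny EM routine. This is a minimal illustration under simplifying assumptions, not the lecture's own method: the data is a made-up 1-D sample, and each mixture component is a Gaussian with fixed unit variance, so EM only re-estimates the mixing weights and the means.

```python
import math

def em_gmm_loglik(data, k, iters=100):
    """Fit a 1-D Gaussian mixture (fixed unit variances) by EM;
    return the final log-likelihood of the data."""
    lo, hi = min(data), max(data)
    # Spread the initial component means evenly over the data range.
    means = [lo + (j + 0.5) * (hi - lo) / k for j in range(k)]
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in data:
            probs = [w * math.exp(-0.5 * (x - m) ** 2)
                     for w, m in zip(weights, means)]
            z = sum(probs)
            resp.append([p / z for p in probs])
        # M-step: re-estimate mixing weights and component means.
        for j in range(k):
            rj = sum(r[j] for r in resp)
            weights[j] = rj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / rj
    const = -0.5 * math.log(2 * math.pi)  # log-normalizer of N(x; m, 1)
    return sum(const + math.log(sum(w * math.exp(-0.5 * (x - m) ** 2)
                                    for w, m in zip(weights, means)))
               for x in data)

# Two well-separated groups of points, around 0 and around 5 (made-up data).
data = [-0.3, 0.1, 0.4, -0.1, 0.2, 4.8, 5.1, 5.3, 4.9, 5.2]
for k in (1, 2, 3):
    print(k, round(em_gmm_loglik(data, k), 1))
```

On this data the log-likelihood should jump sharply when going from one component to two (the data really has two groups) and then barely improve for three, which is exactly the pattern that suggests two clusters.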