This lecture is about the evaluation of text categorization. We've talked about many different methods for text categorization, but how do you know which method works better? And for a particular application, how do you know this is the best way of solving your problem? To answer these questions, we have to know how to evaluate categorization results.

First, some general thoughts about evaluation. In general, for evaluating this kind of empirical task, such as categorization, we use a methodology that was developed in the 1960s by information retrieval researchers, called the Cranfield evaluation methodology. The basic idea is to have humans create a test collection where we already know, for every document, which categories it should be tagged with. Or, in the case of search, which documents should have been retrieved for which query. This is called the ground truth. With this ground-truth test collection, we can then reuse the collection to test many different systems and compare them. We can also turn off some components in a system to see what happens. Basically, it provides a way to do controlled experiments to compare different methods. This methodology has been used for virtually all tasks that involve empirically defined problems.

So in our case, we are going to compare our system's categorization results with the ground-truth categorization created by humans. We're going to compare the system's decisions about which documents should get which categories with the categories humans have actually assigned to those documents. We want to quantify the similarity of these decisions, or equivalently, measure the difference between the system output and the desired ideal output generated by the humans. Obviously, the higher the similarity, the better the results. The similarity can be measured in different ways, and that leads to different measures. Sometimes it's also desirable to measure the similarity from different perspectives, just to get a better and more detailed understanding of the results. For example, we might also be interested in knowing which categories perform better and which categories are easy to categorize, etc.

In general, however, different categorization mistakes have different costs for specific applications; some errors are more serious than others. Ideally, we would like to model such differences, but if you read many papers on categorization you will see that they generally don't do that. Instead, they use a simplified measure, and that's because it's often okay not to consider such cost variation when we compare methods and are mainly interested in their relative difference. It's okay to introduce some bias, as long as the bias does not favor a particular method; we should then still expect a more effective method to perform better than a less effective one, even though the measure is not perfect.

So the first measure that we'll introduce is called classification accuracy, and this simply measures the percentage of correct decisions. Here you see that there are categories denoted by c1 through ck, and there are N documents, denoted by d1 through dN. For each pair of a category and a document, we can look at the situation and see whether the system has said yes to this pair, that is, has assigned this category to this document, or no. This is denoted by Y or N; that's the system's decision.
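Just to make that decision matrix concrete, here is one way to write it down; the symbol y_ij below is my own shorthand, not notation from the lecture slides.

```latex
% System decisions over all (category, document) pairs:
% y_{ij} = Y means the system assigns category c_j to document d_i.
y_{ij} \in \{\mathrm{Y}, \mathrm{N}\}, \qquad 1 \le i \le N, \;\; 1 \le j \le k
```

In total the system makes N × k such binary decisions, one per category-document pair.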
Similarly, we can look at the human's decisions. If the human has assigned the category to the document, there will be a plus sign there; that just means the human considers this assignment correct. If it's incorrect, then it's a minus. So we'll see all combinations of these yeses and nos, pluses and minuses. There are four combinations in total. Two of them are correct, namely Y(+) and N(-), and the other two are the two kinds of errors.

The measure of classification accuracy simply counts how many of these decisions are correct and normalizes that by the total number of decisions we have made. We know that the total number of decisions is N multiplied by k. The correct decisions are of two kinds: one is the Y-plus cases, and the other is the N-minus cases; we just add up the counts. Now, this is a very convenient measure that gives us one number to characterize the performance of a method, and the higher, the better, of course.

But this measure also has some problems. First, it treats all the decisions equally, but in reality, some decision errors are more serious than others. For example, it may be more important to get the decisions right on some documents than on others, or more important to get the decisions right on some categories than on others. This calls for a more detailed evaluation of the results, to understand the strengths and weaknesses of different methods and to understand their performance in detail on a per-category or per-document basis. One example that clearly shows that decision errors have different costs is spam filtering, which can be treated as a two-category categorization problem. Missing a legitimate email is one type of error, but letting spam into your folder is another type of error. The two types of errors are clearly very different, because it's very important not to miss a legitimate email, while it's okay to occasionally let a spam email into your inbox. So the first error, missing a legitimate email, has a high cost; it's a very serious mistake, and classification accuracy does not address this issue.

There's also another problem, with imbalanced test sets. Imagine a skewed test set where most instances are in category one, say 98% of the instances, and only 2% are in category two. In such a case, we can have a very simple baseline that appears to perform very well: that baseline simply puts all instances in the majority category. That will get us 98% accuracy in this case. It will appear to be very effective, but in reality this is obviously not a good result. So in general, when we use classification accuracy as a measure, we want to ensure that the classes are balanced, with roughly an equal number of instances in each class, for example; otherwise the minority categories or classes tend to be overlooked when evaluating with classification accuracy.

To address these problems, we of course would also like to evaluate the results in other, different ways. As I said, it's beneficial to look at the results from multiple perspectives. For example, we can take the perspective of each document. The question here is: how good are the decisions on this document?
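As a small illustration of the accuracy computation and of the skewed-test-set problem, here is a sketch in Python; the function name, the variable names, and the 98/2 toy data are my own, chosen only to mirror the example above.

```python
# Toy sketch: classification accuracy and the majority-class baseline
# on a skewed two-category test set (98% "c1", 2% "c2").

def classification_accuracy(system_decisions, human_labels):
    """Fraction of decisions on which the system agrees with the human."""
    correct = sum(1 for s, h in zip(system_decisions, human_labels) if s == h)
    return correct / len(human_labels)

# Ground truth: 98 instances of category c1, 2 instances of category c2.
human_labels = ["c1"] * 98 + ["c2"] * 2

# Trivial baseline: put every instance into the majority category.
majority_baseline = ["c1"] * 100

print(classification_accuracy(majority_baseline, human_labels))  # prints 0.98
```

Here each instance carries a single label, so accuracy is just the fraction of instances labeled correctly; in the general setup from the lecture, you would instead count agreement over all N × k yes/no decisions and divide by N × k.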
Now, as in the general case of all decisions, we can think about four combinations of possibilities, depending on whether the system has said yes or no and whether the human has said the decision is correct or incorrect. The four combinations are: first, when both the system and the human say yes, that's a true positive. When the system says yes, it's a positive; when the human confirms that it is indeed correct, it becomes a true positive. When the system says yes but the human says no, that's incorrect, so it's a false positive, or FP. When the system says no but the human says yes, it's a false negative; we missed one assignment. And when both the system and the human say no, the decision is also correct, and those are the true negatives.

So then we can define some measures that better characterize the performance using these four numbers, and two popular measures are precision and recall. These were also proposed by information retrieval researchers in the 1960s for evaluating search results, but they have since become standard measures used everywhere. When the system says yes, we can ask: how many of those decisions are correct? What's the percentage of correct decisions when the system says yes? That's called precision. It's the true positives divided by all the cases where the system said yes, that is, all the positives. The other measure is called recall, and it measures whether the document has been assigned all the categories it should have. In this case, we divide the true positives by the true positives plus the false negatives; these are all the cases where the human says the document should have the category. So recall tells us whether the system has indeed assigned to this document all the categories it should have.

This gives us a detailed view per document, and we can aggregate the results later. If we're interested in a particular subset of documents, perhaps more important than the others, this tells us how well we did on those documents. It also allows us to analyze errors in more detail: we can separate documents with certain characteristics from the others and then look at the errors. You might see a pattern, for example that the method works well for long documents but not as well for short documents, and this gives you some insight for improving the method.

Similarly, we can look at a per-category evaluation. In this case, we ask how good the decisions are on a particular category. As in the previous case, we can define precision and recall; they just answer the questions from a different perspective. When the system says yes, how many are correct? That means looking at this category to see whether all the documents that are assigned this category are indeed in this category, right? And recall tells us whether the category has actually been assigned to all the documents that should have this category.

It's sometimes also useful to combine precision and recall into one measure, and this is often done by using the F measure, which is just a harmonic mean of precision and recall, defined on this slide. It's also controlled by a parameter beta that indicates whether precision or recall is more important. When beta is set to 1, we have the measure called F1, and in this case we give equal weight to both precision and recall.
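Written out from the counts just described (TP, FP, FN), the standard formulas are as follows; the slide may parameterize the F measure slightly differently, and the form below is the commonly used one.

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_\beta = \frac{(1 + \beta^2)\, P R}{\beta^2\, P + R}, \qquad
F_1 = \frac{2 P R}{P + R}
```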
F1 is very often used as a measure for categorization. Now, as in all cases when we combine results, you should always think about the best way of combining them. In this case, I don't know if you have thought about it, but we could have combined precision and recall with a simple arithmetic mean. That would still give us the same range of values, but obviously there is a reason why we don't do that and why F1 is more popular, and it's actually useful to think about the difference. If you think about it, you'll see that there is indeed a difference, and the arithmetic mean has an undesirable property. It becomes obvious if you consider the case where the system says yes for all the category-document pairs; try to compute precision and recall in that case and see what happens. Basically, the arithmetic mean is not going to be as reasonable as F1, because F1 rewards a balanced trade-off, where the two values are close to each other. In the extreme case where one of the two is 0 and the other is 1, F1 will be low, but the arithmetic mean would still be reasonably high.
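Here is a quick numeric check of that argument as a small Python sketch; the specific numbers are just an illustration of the say-yes-to-everything extreme case discussed above, not figures from the lecture.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def arithmetic_mean(precision, recall):
    return (precision + recall) / 2

# Saying yes to every (category, document) pair drives recall to 1
# while precision can be tiny.
p, r = 0.02, 1.0
print(arithmetic_mean(p, r))  # 0.51  -- looks deceptively decent
print(f1(p, r))               # ~0.039 -- F1 penalizes the imbalance
```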