Welcome to Module 5. Our topic is "Manual annotation and evaluation". In this module, we will address the question of how computational linguistic tools can facilitate manual annotation, and how these tools can in turn be improved through manual annotation. The second topic is "evaluation": we will ask how the performance of computational linguistic tools can be measured systematically on one's own data. The topic "agreement" deals with the task of unifying the judgements, and the potential differences, of human annotators in order to estimate the difficulty of a task and the performance limit that such tools can reach.

What if we look at it from a rather pessimistic perspective? Computational linguistic tools work fairly well for standard languages and for the contemporary newspaper texts on which they have been developed and trained. This applies especially to the dozen or so languages that are relatively well equipped with such tools. The situation is worse if one does not want to process newspaper texts or wants to work in another area of application. For historical languages, tools developed on modern languages are often completely unsuitable and useless. When dealing with written data without an orthographic norm, or with "user-generated content", the results are typically also rather poor. And it does not work at all for languages without any resources or tools.

What can be done about this? By adapting existing tools, these gaps can be closed. This very important technique is known as "domain adaptation": the tools are adapted, usually automatically, to fit the specific domain of application. You can either pre-process your data or post-process the results. If a part-of-speech tagger has been trained on Standard German of the 20th century and we want to run it on texts from the 18th or 19th century, we might let the training texts "grow old" artificially, or we might modernise the orthography of the 18th- or 19th-century texts in an additional preprocessing step. If you are working with rule-based systems, such an adaptation often requires extending lexicons and adding specific names to the gazetteers.
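To make the preprocessing strategy mentioned above a bit more concrete, here is a minimal sketch in Python: modernise historical spellings before handing the text to a tagger trained on the modern standard language. The normalisation table and the tagger interface are purely hypothetical placeholders, not the API of any particular tool.

```python
# Minimal sketch of orthographic normalisation as a preprocessing step.
# The mapping is a tiny, invented example; a real project would build a
# normalisation lexicon or learn rewrite rules from the corpus itself.
NORMALISATION = {
    "vnnd": "und",                    # early modern spelling -> modern form
    "thun": "tun",
    "hertzlibster": "herzliebster",
}

def normalise(tokens):
    """Replace archaic spellings by modern equivalents where known."""
    return [NORMALISATION.get(tok.lower(), tok) for tok in tokens]

def tag_historical(tokens, tagger):
    """Normalise first, then run a tagger trained on modern standard German.

    `tagger` is assumed to offer a tag(list_of_tokens) method; adapt the
    call to whatever POS tagger you actually use.
    """
    return tagger.tag(normalise(tokens))
```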
A third important paradigm is to use manual annotation for machine learning approaches, in order to finally solve the problem on larger amounts of text data: first you create your own annotations, then you build your application-specific computational linguistic tool by training a machine learning system on this data. A good introduction to this topic is the book by Pustejovsky and Stubbs mentioned in the bibliographic references at the end of this module.

What are the most important aspects of manual annotation? The first question is: why do we annotate at all? One important reason is to facilitate access to the information in corpora. For historical language data, for example, a normalisation in the form of an interlinear gloss helps us find words that may be written in very different ways in the text. Another important reason is to enable effective re-use by third parties: someone might want to add deeper annotations on top of the existing ones, or carry out further analyses with the data. Annotations turn possible interpretations of the text into digitised, accessible and processable data. In this way, analyses can be reconstructed, and they become accessible or, if necessary, contestable. This is desirable for the scientific method. The last aspect we will focus on today are the so-called supervised machine learning approaches, which learn automatically from annotated data how to compute such annotations on large amounts of text. Depending on the quality of the models, manually annotated data can thus be used for a partially or, at best, a fully automated annotation.

Let's look at an example of manual annotation. The texts come from a historical correspondence of princesses, written in German. The sentence "Gnädiger, sehr hochgeEhrter, hertzlibster H Vatter, Ew Gnd" contains many abbreviations and idiosyncratic spellings. In this manual annotation project, part-of-speech tags from the STTS tag set were assigned at the token level, together with morphological annotations, a normalised spelling of the archaic language and lemmatised forms. On top of that, semantic annotations were added concerning the polite forms of address, the grammatical function, the epistolary writing style and the persons involved. It is altogether a very complex and detailed annotation project that is exemplary in many respects.

Annotations are expensive, so we should consider how to make them as economical as possible. A first principle could be formulated as follows: annotate automatically as much as possible and avoid manual annotation where you can. The two approaches differ in several respects; these are the most important criteria. Manual annotations are typically highly accurate, while automatic annotations are less precise and correct. Automatic annotation is cheap, since computing time costs little; manual annotation is time-consuming, demanding and therefore more expensive. Manual annotation is slow, especially for large amounts of data, whereas automatic annotation is fast. Depending on the size of the data to be annotated, we are therefore often compelled to annotate automatically; at best, a small part of the data can be annotated manually. We said before that manual annotation is correct, but we need to be aware that human annotators also make mistakes. In contrast to automatic approaches, these errors are typically less systematic. Machine learning approaches usually make very systematic errors, which can be corrected automatically if they are detected in large numbers. Finally, we need to know how many errors are introduced into the corpus by automatic annotation: this error rate should be analysed and communicated to the end users.

Let's take a look at a case study and ask how much annotated material is necessary to achieve a part-of-speech tagging accuracy of 95%, i.e. if an error rate of 5% were tolerable. The training curve that we already encountered in Module 4 shows where the overall performance, measured over known and unknown words, crosses the 95% threshold. That is the case at about 20,000 tokens, which corresponds to roughly 1,000 sentences. So not much annotation is needed to reach a good performance of 95% accuracy on data with normalised orthography and a simple tag set, namely the Universal POS tag set. If you use a more complex tag set, or data with non-normalised orthography, at least 30,000 tokens should be annotated manually.
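The exact numbers depend on the data and the tag set, but the shape of such a training curve is easy to reproduce yourself. The following sketch assumes POS-annotated sentences in the common (token, tag) list format and uses a deliberately simple most-frequent-tag baseline; a real tagger will reach higher accuracy, but the curve behaves similarly. Corpus loading is left open.

```python
from collections import Counter, defaultdict

def train_unigram_tagger(train_sents):
    """Most-frequent-tag-per-word baseline (a deliberately simple model)."""
    counts = defaultdict(Counter)
    for sent in train_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    # Fallback tag for unknown words: the overall most frequent tag.
    fallback = Counter(t for s in train_sents for _, t in s).most_common(1)[0][0]
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}, fallback

def accuracy(model, fallback, test_sents):
    """Proportion of tokens (known and unknown words) tagged correctly."""
    correct = total = 0
    for sent in test_sents:
        for word, gold in sent:
            correct += (model.get(word, fallback) == gold)
            total += 1
    return correct / total

def training_curve(train_sents, test_sents, steps=(1000, 5000, 10000, 20000, 30000)):
    """How does accuracy grow with the amount of annotated training material?"""
    curve = []
    for n_tokens in steps:
        slice_, seen = [], 0
        for sent in train_sents:
            if seen >= n_tokens:
                break
            slice_.append(sent)
            seen += len(sent)
        model, fb = train_unigram_tagger(slice_)
        curve.append((seen, accuracy(model, fb, test_sents)))
    return curve
```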
What might an economical automation of the annotation task look like? At the very beginning, you often simply have to annotate manually. As soon as a certain amount of annotated data is available, a model can be learned. Using this model, unannotated material can be pre-annotated, and a human annotator validates these pre-annotations by accepting or correcting them. As soon as sufficient material has been annotated, the final model can be computed, so that from then on the annotation is done automatically and only corrected where really necessary. This is a learning cycle, a so-called "incremental process of semi-automatic annotation", which allows us to economise on annotation. We can speed up this process further if we want to handle more data with a smaller annotation budget: this is so-called "active learning", whose purpose is to make the training curve somewhat steeper by looking for informative examples that are worth annotating, i.e. examples from which the machine learning process benefits most. Another variant of machine learning is so-called "online learning", for which the MIRA algorithm is frequently used. Here we build a model that is slightly adjusted each time an annotated example is added, so that the system can be improved continuously.
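As an illustration of the active learning idea, here is a small, generic sketch of uncertainty sampling inside such a semi-automatic cycle: from a pool of unannotated items, pick those the current model is least confident about and give exactly those to the human annotator. The `predict_proba`, `annotate` and `retrain` interfaces are assumptions mirroring common machine learning toolkits, not calls of a specific library.

```python
def least_confident(model, pool, k=50):
    """Uncertainty sampling: return the k pool items the model is least sure about.

    Assumes `model.predict_proba(item)` returns a probability distribution
    over the possible labels for that item.
    """
    def confidence(item):
        return max(model.predict_proba(item))
    return sorted(pool, key=confidence)[:k]

def active_learning_loop(model, pool, annotate, retrain, rounds=10, k=50):
    """One possible semi-automatic annotation cycle.

    `annotate` asks the human annotator to label (or correct) the selected
    items; `retrain` fits a new model on all labelled data collected so far.
    """
    labelled = []
    for _ in range(rounds):
        batch = least_confident(model, pool, k)      # pick informative examples
        labelled.extend(annotate(batch))             # human validates/corrects them
        pool = [x for x in pool if x not in batch]   # remove them from the pool
        model = retrain(labelled)                    # update the model
    return model, labelled
```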
What are the quality criteria for annotation tools? Clearly, the user interface should be intuitive and allow efficient operation. The interface should also support consistency checking of the entered annotations. It should be possible to annotate several types of layers with multiple annotations in the same tool. Data import together with metadata, for example the import of TEI-encoded corpora, and a good export function are important. A built-in search functionality can be useful for finding similar cases of a problem. Documentation, training material and support for the tool throughout the duration of the project also matter. Platform-independent or web-based tools are an advantage. And finally, the aspect mentioned before: the integration of automatic annotation and automatic learning, i.e. online learning, is useful for an economical annotation project.

A tool that fulfils all these criteria in an exemplary manner is "WebAnno", which allows you to annotate named entities, for instance, with just a few clicks. Very complex relations in texts, such as the co-reference of pronouns and nominal expressions, can also be annotated. Apart from such general tools, there are more specialised tools that have proven useful: "EXMARaLDA", for example, works very well for complex annotations on multiple layers, as we saw in the correspondence of the princesses, where digital image data and audio data are used. Another specialised tool, "Arborator", is designed for the efficient annotation of syntactic dependencies. A tool for annotating rhetorical text structures is "RSTTool".

Once you have produced such annotations, the question is how the annotated material can be made accessible, i.e. how it can be freed from the data pool of the annotation tool and provided to a larger audience for analysis, visualisation and export. "ANNIS" is a good system for this purpose. It is a web-based tool that allows queries on all linguistic levels, from morphology and syntax to semantics and text linguistics, as well as queries over metadata. The data can also be presented in a clear and attractive way. One example is token-based annotation for Arabic, with its right-to-left writing system; another is structured information, for example chunks together with their information structure. ANNIS can also align facsimiles with text passages and annotations and display them in one user interface. Yet another example is a text analysis with rhetorical text structure that can be displayed in ANNIS after querying certain properties.

Another application, which focuses more on digital editing and publishing, is provided at textgrid.de. "TextGrid" is a platform providing several technologies in the field of digital editing for the Digital Humanities. There is also a platform-independent editor for TEI XML, the so-called "TextGridLab", which can be used to explore the digital library of the TextGrid repository, containing many freely available texts in TEI format. You can create collections and editions, align facsimiles or other images with text passages, and you have access to many historical dictionaries. The degree of linguistic annotation provided in the TextGridLab and the repository is, however, still relatively low. Here is an example from the TextGrid repository, a poem by Hans Aßmann von Abschatz. We see that the representation in TEI format is limited to the level of the text and that no deeper linguistic annotations are planned or implemented.

I also want to mention a few things about the interaction between manual annotation and machine learning. Pustejovsky and Stubbs propose the MATTER cycle (Model, Annotate, Train, Test, Evaluate, Revise), consisting of six stages, for annotation in the context of machine learning. Step 1 is the modelling of the phenomena of interest and, subsequently, the creation of annotation guidelines. In step 2, real data are annotated according to the guidelines. In step 3, a statistical model is trained on the training material, which is then evaluated in step 4 on the test data. The results give an indication of whether it is necessary to revise the guidelines or the data. This six-stage cycle is run through once, twice, three times or even more often, until one is satisfied with the results of the machine learning system. When working with manual annotation, it is important, especially at the very beginning, to collect annotations from different, independent human annotators, so-called multiple annotations. Tools like "WebAnno" can calculate automatically whether there are differences between the annotations. If there are, a third person can adjudicate and provide a final, unambiguous annotation. This process of manual harmonisation is supported very well by WebAnno. Multiple annotations help to reveal weak points in the guidelines: in such cases, the human annotators are very likely to produce diverging annotations. Obviously, multiple annotations are expensive, but especially during the initial phase it is important to invest this effort.

Let's move on to the second topic, "evaluation and agreement". We want to address the question of how good the performance of annotation tools actually is. We might want to know whether there is a need for optimisation, or we might want to make an informed decision between two alternative tools and be able to say: this tool works better on my data. That is why we need quantitative evaluations. For such an evaluation we need a so-called "gold standard", and we need to know how good the inter-rater reliability is, i.e. how consistent human annotations are for a certain type of annotation. This can lead to improved guidelines and the clarification of doubtful cases. It also allows us to estimate how objectifiable the annotations are: disagreement between two annotators often points to problematic cases, i.e. to the overall difficulty of a task or to unclear annotation guidelines. In computational linguistics, four evaluation metrics are standard; I want to explain them using a concrete evaluation of the performance of a named entity tagger.

The first metric is called "recall", also known as coverage or completeness: of all the entities present in the data, how many was the system able to identify and label correctly? Example: an NER tagger correctly classified 600 out of 800 entities in a test corpus. The resulting recall is 600/800 = 75%. Recall can be increased trivially by marking as many entities as possible, even where we are not sure at all, because instances that are recognised by mistake do not influence this metric. The next important evaluation metric is "precision", the proportion of correctly identified entities among all entities the system recognised. An NER tagger that recognised and classified 1,000 entities in a test corpus, of which only 600 were actually correct, has a precision of 600/1,000 = 60%. Precision, in turn, can easily be increased by not marking the entities we are unsure about, i.e. by recognising as few entities as possible: entities that are not identified do not influence the precision metric, and precision rises if fewer erroneous instances are produced. The third important metric is the "F-score", which combines precision and recall into their harmonic mean. An NER tagger with a precision of 60% and a recall of 75% obtains as F-score the harmonic mean of these two numbers, which is exactly two thirds, i.e. 66.7%. The F-score can be increased by keeping precision and recall at a similar level; a one-sided optimisation of either recall or precision is penalised, and that is exactly what the harmonic mean is there for. These three metrics can also be described in a slightly different way, using the classification scheme of true positives, false positives, false negatives and true negatives. In this table, "true" stands for a match between the system and the truth, whereas "false" covers the cases where system and truth do not match. There are two types of errors: a "false positive", also called a type I error, means that the system identifies an entity where it should not; the opposite, a "false negative" or type II error, means that the system fails to identify an entity where it should have. The four evaluation measures can be compared within this true positive / false positive / false negative / true negative scheme. On the current slide you can see that recall relates the true positives to the true positives plus false negatives, whereas precision relates the true positives to the true positives plus false positives. The F-score neglects the true negatives, while the accuracy metric (the overall accuracy) sets all four cells in relation to one another.
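The relationship between these counts and the metrics fits in a few lines of Python. The numbers below reproduce the NER example from above (600 true positives, 1,000 system entities, 800 gold entities); the number of true negatives is only needed for the accuracy and is not given in the example.

```python
def evaluation_metrics(tp, fp, fn, tn=None):
    """Precision, recall, F-score (and accuracy, if true negatives are known)."""
    precision = tp / (tp + fp)              # correct among everything the system marked
    recall    = tp / (tp + fn)              # correct among everything that exists
    f_score   = 2 * precision * recall / (precision + recall)   # harmonic mean
    accuracy  = (tp + tn) / (tp + fp + fn + tn) if tn is not None else None
    return precision, recall, f_score, accuracy

# The NER example from the lecture:
# 1,000 entities recognised, 600 of them correct, 800 entities in the gold standard.
tp = 600
fp = 1000 - tp   # 400 system entities that are not in the gold standard
fn = 800 - tp    # 200 gold entities the system missed

p, r, f, _ = evaluation_metrics(tp, fp, fn)
print(f"precision = {p:.1%}, recall = {r:.1%}, F-score = {f:.1%}")
# -> precision = 60.0%, recall = 75.0%, F-score = 66.7%
```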
The next important subtopic in evaluation is the so-called "inter-annotator agreement". If you are interested, I recommend the article by Artstein and Poesio (2008) (see references). The simplest way to measure the agreement of independent annotators is to count the cases in which the human annotators agree with each other. However, if the categories are very unevenly distributed, this is problematic: if one category covers 90% of the cases and the other only 10%, an annotator only has to always choose the majority class to reach an agreement of at least 90%. A better solution is therefore to correct for chance agreement, i.e. to take into account the agreement expected by chance, which is determined by the distribution of the classes. "Cohen's kappa" is such a metric for comparing two annotators. If you have more than two annotators, you should use "Fleiss' kappa" instead. If differences in classification should also be weighted, i.e. if some disagreements are less serious than others, "Krippendorff's alpha" should be used, since it allows a differentiated weighting of the differences; in addition, Krippendorff's alpha is more general than Fleiss' kappa. As for the qualitative interpretation of agreement values, i.e. which values are sufficient, which are good, and what counts as good agreement: there are some rules of thumb, but it always depends on the concrete application. We therefore need to check which values have been achieved in other projects and in similar annotation tasks.
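To show what chance correction means in practice, here is a compact sketch of Cohen's kappa for two annotators; Fleiss' kappa and Krippendorff's alpha follow the same idea of comparing observed agreement with expected chance agreement, but their implementations are more involved. The toy labels at the end are invented purely for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators: (A_o - A_e) / (1 - A_e),
    where A_o is the observed agreement and A_e the agreement expected
    by chance given each annotator's label distribution."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    dist_a, dist_b = Counter(labels_a), Counter(labels_b)
    expected = sum((dist_a[c] / n) * (dist_b[c] / n) for c in dist_a)
    return (observed - expected) / (1 - expected)

# Invented toy example: two annotators label ten tokens as entity (E) or other (O).
ann1 = ["E", "O", "O", "E", "O", "O", "O", "E", "O", "O"]
ann2 = ["E", "O", "E", "E", "O", "O", "O", "O", "O", "O"]
print(round(cohens_kappa(ann1, ann2), 2))
# -> 0.52 (observed agreement 0.80, chance agreement 0.58)
```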
"Cohen's kappa" is a metric to compare two annotators. If you have more than two annotators, you should rather use "Fleiss' Kappa". If the differences in classification should also be weighted, i.e. some differences are less serious than others during annotation "Krippendorff's Alpha" should be used since a differentiated weighting of the differences is possible. Additionally, "Krippendorff's Alpha" is more general compared to "Fleiss' Kappa". The qualitative interpretation of the agreement, i.e. which values are sufficient, which values are good and what can be considered a good agreement? There are some rules of thumb here, but, nevertheless, it always depends on the concrete application. We need to check thus which values can be achieved in other projects and similar annotation tasks? Let's summarise: We've seen that due to economic reasons manual annotation should be supported by automatic methods, for example using semi-automatic annotation. We've also seen that machine learning approaches can benefit from the effort made to generate manual annotations in order to apply such information on new data and large amounts of text data. When dealing with automatic annotation it is importance to know how good these tools are and how well these systems and models work. The error rate needs to be evaluated and analysed quantitatively on those texts where we actually apply the tools. When creating a "goldstandard" for the sake of evaluating a system's performance the inter-annotator agreement should be taken into account. This indicates how the system's performance can be approximated in comparison to manual annotations. Thank you very much for your attention. I hope you were able to see how manual annotations can be combined with automatic methods in order to make computer linguistic tools more efficient.