Conference article

Normalization in Context: Inter-Annotator Agreement for Meaning-Based Target Hypothesis Annotation

Adriane Boyd
Department of Linguistics, University of Tübingen, Germany


Published in: Proceedings of the 7th Workshop on NLP for Computer Assisted Language Learning (NLP4CALL 2018) at SLTC, Stockholm, 7th November 2018

Linköping Electronic Conference Proceedings 152:2, p. 10-22

NEALT Proceedings Series 36:2, p. 10-22


Published: 2018-11-02

ISBN: 978-91-7685-173-9

ISSN: 1650-3686 (print), 1650-3740 (online)


We explore the contribution of explicit task contexts in the annotation of word-level and sentence-level normalizations for learner language. We present the annotation schemes and tools used to annotate both word- and sentence-level target hypotheses given an explicit task context for the Corpus of Reading Exercises in German (Ott et al., 2012) and discuss a range of inter-annotator agreement measures appropriate for evaluating target hypothesis and error annotation.

For learner answers to reading comprehension questions, we find that both the amount of task context and the correctness of the learner answer influence the inter-annotator agreement for word-level normalizations. For sentence-level normalizations, the teachers’ detailed assessments of the learner answer meaning provided in the corpus give indications of the difficulty of the target hypothesis annotation task. We provide a thorough evaluation of inter-annotator agreement for multiple aspects of meaning-based target hypothesis annotation in context and explore metrics beyond inter-annotator agreement that can be used to evaluate the quality of normalization annotation.


normalization, target hypothesis annotation, reliability of annotation


Ron Artstein and Massimo Poesio. 2009. Survey article: Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):1–42.

Marcel Bollmann, Stefanie Dipper, and Florian Petran. 2016. Evaluating inter-annotator agreement on historical spelling normalization. Proceedings of LAW X – The 10th Linguistic Annotation Workshop, pages 89–98.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Daniel Dahlmeier, Hwee Tou Ng, and Siew Mei Wu. 2013. Building a large annotated corpus of learner English: The NUS corpus of learner English. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 22–31. Association for Computational Linguistics.

Eileen Fitzpatrick and M. S. Seegmiller. 2004. The Montclair electronic language database project. In U. Connor and T.A. Upton, editors, Applied Corpus Linguistics: A Multidimensional Perspective. Rodopi, Amsterdam.

Jirka Hana, Alexandr Rosen, Barbora Štindlová, and Petr Jäger. 2012. Building a learner corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey. European Language Resources Association (ELRA).

Hagen Hirschmann, Seanna Doolittle, and Anke Lüdeling. 2007. Syntactic annotation of non-canonical linguistic structures. In Proceedings of Corpus Linguistics 2007, Birmingham.

Institut für Deutsche Sprache. 2009. Korpusbasierte Wortformenliste DEREWO, v-100000t-2009-04-30-0.1, mit Benutzerdokumentation. Technical Report IDS-KL-2009-02, Institut für Deutsche Sprache, Programmbereich Korpuslinguistik.

Christine Köhn and Arne Köhn. 2018. An annotated corpus of picture stories retold by language learners. In Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018), pages 121–132. Association for Computational Linguistics.

Klaus Krippendorff. 1980. Content Analysis: An Introduction to its Methodology. Sage Publications, Beverly Hills, CA.

John Lee, Joel Tetreault, and Martin Chodorow. 2009. Human evaluation of article and noun number usage: Influences of context and construction variability. In ACL 2009 Proceedings of the Linguistic Annotation Workshop III (LAW3). Association for Computational Linguistics.

Anke Lüdeling. 2008. Mehrdeutigkeiten und Kategorisierung: Probleme bei der Annotation von Lernerkorpora. In Maik Walter and Patrick Grommes, editors, Fortgeschrittene Lernervarietäten: Korpuslinguistik und Zweitspracherwerbsforschung, pages 119–140. Max Niemeyer Verlag, Tübingen.

Detmar Meurers. 2015. Learner corpora and natural language processing. In Sylviane Granger, Gaëtanelle Gilquin, and Fanny Meunier, editors, The Cambridge Handbook of Learner Corpus Research, pages 537–566. Cambridge University Press.

Detmar Meurers and Markus Dickinson. 2017. Evidence and interpretation in language learning research: Opportunities for collaboration with computational linguistics. Language Learning, Special Issue on Language learning research at the intersection of experimental, corpus-based and computational methods: Evidence and interpretation. To appear.

Shogo Miura. 1998. Hiroshima English Learners’ Corpus: English learner No. 2 (English I & English II). Department of English Language Education, Hiroshima University.

Joseph Olive. 2005. Global autonomous language exploitation (GALE). Technical report, DARPA/IPTO Proposer Information Pamphlet.

Niels Ott, Ramon Ziai, and Detmar Meurers. 2012. Creation and analysis of a reading comprehension exercise corpus: Towards evaluating meaning in context. In Thomas Schmidt and Kai Wörner, editors, Multilingual Corpora and Multilingual Corpus Analysis, Hamburg Studies in Multilingualism (HSM), pages 47–69. Benjamins, Amsterdam.

Rebecca Passonneau. 2006. Measuring agreement on set-valued items (MASI) for semantic and pragmatic annotation. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC-06).

Marc Reznicek, Anke Lüdeling, Cedric Krummes, and Franziska Schwantuschke. 2012. Das Falko-Handbuch. Korpusaufbau und Annotationen Version 2.0.

Alexandr Rosen, Jirka Hana, Barbora Štindlová, and Anna Feldman. 2013. Evaluating and automating the annotation of a learner corpus. Language Resources and Evaluation, pages 1–28.

Heike Telljohann, Erhard Hinrichs, and Sandra Kübler. 2004. The TüBa-D/Z treebank: Annotating German with a context-free backbone. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), Lisbon.

Joel Tetreault and Martin Chodorow. 2008. Native judgments of non-native usage: Experiments in preposition error detection. In Proceedings of the workshop on Human Judgments in Computational Linguistics at COLING-08, pages 24–32, Manchester, UK. Association for Computational Linguistics.

Weiyue Wang, Jan-Thorsten Peter, Hendrik Rosendahl, and Hermann Ney. 2016. CharacTer: Translation edit rate on character level. In Proceedings of the First Conference on Machine Translation, pages 505–510, Berlin, Germany. Association for Computational Linguistics.

Seid Muhie Yimam, Chris Biemann, Richard Eckart de Castilho, and Iryna Gurevych. 2014. Automatic annotation suggestions and custom annotation layers in WebAnno. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 91–96, Baltimore, Maryland. Association for Computational Linguistics.
