Conference article

Annotating Errors in Student Texts: First Experiences and Experiments

Sara Stymne
Linguistics and Philology, Uppsala University, Sweden

Eva Pettersson
Linguistics and Philology, Uppsala University, Sweden

Beáta Megyesi
Linguistics and Philology, Uppsala University, Sweden

Anne Palmér
Scandinavian Languages, Uppsala University, Sweden


Published in: Proceedings of the Joint 6th Workshop on NLP for Computer Assisted Language Learning and 2nd Workshop on NLP for Research on Language Acquisition at NoDaLiDa, Gothenburg, 22nd May 2017

Linköping Electronic Conference Proceedings 134:6, p. 47-60

NEALT Proceedings Series 30:6, p. 47-60


Published: 2017-05-11

ISBN: 978-91-7685-502-7

ISSN: 1650-3686 (print), 1650-3740 (online)


We describe the creation of an annotation layer for word-based writing errors in a corpus of student writings. The texts are written in Swedish by students between 9 and 19 years old. Our main purpose is to identify errors in spelling, split compounds and merged words. We also identify simple word-based grammatical errors, including morphological errors and extra words. In this paper we describe the corpus and the annotation process, including detailed descriptions of the error types and the annotation guidelines. We find that this annotation can be performed with substantial inter-annotator agreement, although some issues with the annotation remain. We also report the results of two pilot experiments, on spelling correction and on the consistency of downstream NLP tools, to exemplify the usefulness of the annotated corpus.




