SVALA: Annotation of Second-Language Learner Text Based on Mostly Automatic Alignment of Parallel Corpora

Mats Wirén
Department of Linguistics, Stockholm University, Sweden

Arild Matsson
Språkbanken, Department of Swedish, University of Gothenburg, Sweden

Dan Rosén
Språkbanken, Department of Swedish, University of Gothenburg, Sweden

Elena Volodina
Språkbanken, Department of Swedish, University of Gothenburg, Sweden


Published in: Selected papers from the CLARIN Annual Conference 2018, Pisa, 8-10 October 2018

Linköping Electronic Conference Proceedings 159:23, pp. 227-239


Published: 2019-05-28

ISBN: 978-91-7685-034-3

ISSN: 1650-3686 (print), 1650-3740 (online)


Annotating second-language learner text is a cumbersome manual task that requires interpretation, since the annotator must postulate the meaning the learner intended. This paper describes SVALA, a tool that separates the logical steps in this process and provides rich visual support for each of them. The first step is to pseudonymize the learner text to fulfil the legal and ethical requirements for a distributable learner corpus. The second step is to correct the text, which is done in the simplest possible way: by text editing. During editing, SVALA automatically maintains a parallel corpus with alignments between words in the learner source text and the corrected text, and the annotator may repair inconsistent word alignments. Finally, the corrections (the postulated errors) are labelled. We describe the objectives, design and workflow of SVALA, and our plans for further development.
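To make the idea of aligning learner text with its corrected version concrete, the following is a minimal sketch in Python. It is not SVALA's implementation: it assumes whitespace tokenization and uses the standard library's `difflib.SequenceMatcher` as a stand-in diff algorithm, and the function name `align_words` is hypothetical. Unchanged words are linked one-to-one, while each changed span is kept as a single many-to-many link that an annotator could later split or repair.

```python
from difflib import SequenceMatcher


def align_words(source: str, target: str):
    """Link words of a learner text to words of its corrected version.

    Returns a list of (source_words, target_words) pairs. This is an
    illustrative assumption, not the alignment procedure used by SVALA.
    """
    src = source.split()  # assumption: plain whitespace tokenization
    tgt = target.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    links = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged words are linked one-to-one.
            for k in range(i2 - i1):
                links.append(([src[i1 + k]], [tgt[j1 + k]]))
        else:
            # replace / delete / insert: keep the changed span as one link,
            # leaving finer-grained repair to the annotator.
            links.append((src[i1:i2], tgt[j1:j2]))
    return links


if __name__ == "__main__":
    for src_words, tgt_words in align_words("I has a apple", "I have an apple"):
        print(src_words, "->", tgt_words)
```

Grouping a changed span into one link is a deliberate simplification: a diff alone cannot tell which replaced word corresponds to which, which is where manual repair of alignments becomes useful.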


Normalization, Error annotation, Learner corpora, Parallel corpora, Word alignment


