Towards error annotation in a learner corpus of Portuguese

Iria del Río
University of Lisbon – CLUL, Portugal

Sandra Antunes
University of Lisbon – CLUL, Portugal

Amália Mendes
University of Lisbon – CLUL, Portugal

Maarten Janssen
University of Coimbra – CELGA-ILTEC, Portugal

Published in: Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition at SLTC, Umeå, 16th November 2016

Linköping Electronic Conference Proceedings 130:2, pp. 8-17

Published: 2016-11-15

ISBN: 978-91-7685-633-8

ISSN: 1650-3686 (print), 1650-3740 (online)


In this article, we present COPLE2, a new corpus of Portuguese that encompasses written and spoken data produced by learners of Portuguese as a foreign or second language (FL/L2). Following the trend towards learner corpus research applied to less commonly taught languages, we aim to expand the learner data available for Portuguese L2. These data may be useful not only for educational purposes (design of learning materials, curricula, etc.) but also for the development of NLP tools that support students in their learning process. The corpus is available online through the TEITOK environment, a web-based framework for corpus treatment that provides several built-in NLP tools and a rich set of functionalities (multiple orthographic transcription layers, lemmatization and POS tagging, token normalization, error annotation) to automatically process and annotate texts in XML format. A CQP-based search interface allows users to query the corpus by different fields, such as words, lemmas, POS tags or error tags. We briefly describe the work in progress on the constitution and linguistic annotation of this corpus, focusing in particular on error annotation.
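To make the idea of token-level annotation and field-based search more concrete, the following is a minimal sketch in Python. It is not the TEITOK schema or CQP query syntax: the element name `tok`, the attribute names `lemma`, `pos`, `nform` and `err`, the error label `PREP-CHOICE`, and the sample sentence are all assumptions chosen for illustration. The sketch only shows how tokens annotated in XML could be filtered by lemma, POS tag or error tag.

```python
import xml.etree.ElementTree as ET

# Hypothetical token-level XML. Element names, attribute names and the
# error label are illustrative assumptions, not the actual COPLE2 encoding.
SAMPLE = """
<text>
  <s>
    <tok lemma="eu" pos="PRON">Eu</tok>
    <tok lemma="gostar" pos="V">gosto</tok>
    <tok lemma="em" pos="PREP" nform="de" err="PREP-CHOICE">em</tok>
    <tok lemma="ler" pos="V">ler</tok>
  </s>
</text>
"""


def load_tokens(xml_string):
    """Return one dict per <tok> element: its attributes plus the written form."""
    root = ET.fromstring(xml_string)
    tokens = []
    for tok in root.iter("tok"):
        record = dict(tok.attrib)
        record["form"] = (tok.text or "").strip()
        tokens.append(record)
    return tokens


def search(tokens, **criteria):
    """Keep the tokens whose annotation fields equal every given criterion."""
    return [
        t for t in tokens
        if all(t.get(field) == value for field, value in criteria.items())
    ]


tokens = load_tokens(SAMPLE)
error_hits = search(tokens, err="PREP-CHOICE")  # tokens carrying this error tag
verb_hits = search(tokens, pos="V")             # tokens tagged as verbs
```

Here `search(tokens, err="PREP-CHOICE")` plays the role of a query on the error-tag field, while `search(tokens, pos="V")` queries the POS field; a real CQP query would express the same constraints in its own syntax over the indexed corpus.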


Learner corpus, Error annotation, Corpus processing tool, Pedagogical resource


