Towards error annotation in a learner corpus of Portuguese

Iria del Río
University of Lisbon – CLUL, Portugal

Sandra Antunes
University of Lisbon – CLUL, Portugal

Amália Mendes
University of Lisbon – CLUL, Portugal

Maarten Janssen
University of Coimbra – CELGA-ILTEC, Portugal

Part of: Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition at SLTC, Umeå, 16th November 2016

Linköping Electronic Conference Proceedings 130:2, pp. 8-17

Published: 2016-11-15

ISBN: 978-91-7685-633-8

ISSN: 1650-3686 (print), 1650-3740 (online)


In this article, we present COPLE2, a new corpus of Portuguese that encompasses written and spoken data produced by learners of Portuguese as a foreign or second language (FL/L2). Following the trend towards learner corpus research on less commonly taught languages, we aim to expand the learner data available for Portuguese L2. These data may be useful not only for educational purposes (design of learning materials, curricula, etc.) but also for the development of NLP tools that support students in their learning process. The corpus is available online through the TEITOK environment, a web-based framework for corpus treatment that provides several built-in NLP tools and a rich set of functionalities (multiple orthographic transcription layers, lemmatization and POS tagging, token normalization, error annotation) for automatically processing and annotating texts in XML format. A CQP-based search interface allows users to search the corpus by different fields, such as words, lemmas, POS tags or error tags. We briefly describe the work in progress on the constitution and linguistic annotation of this corpus, focusing in particular on error annotation.


Learner corpus, Error annotation, Corpus processing tool, Pedagogical resource


