From Distributions to Labels: A Lexical Proficiency Analysis using Learner Corpora

David Alfter
University of Gothenburg, Sweden

Yuri Bizzoni
University of Gothenburg, Sweden

Anders Agebjörn
University of Gothenburg, Sweden

Elena Volodina
University of Gothenburg, Sweden

Ildikó Pilán
University of Gothenburg, Sweden


In: Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition at SLTC, Umeå, 16th November 2016

Linköping Electronic Conference Proceedings 130:1, pp. 1-7


Published: 2016-11-15

ISBN: 978-91-7685-633-8

ISSN: 1650-3686 (print), 1650-3740 (online)


In this work we examine how information from second language learner essay corpora can be used to evaluate unseen learner essays. From a corpus of learner essays graded on the CEFR scale by well-trained human assessors, we extract a list of word distributions over CEFR levels. To analyze unseen essays, we want to use this word list to map each word to a single so-called target CEFR level. However, mapping from a distribution to a single label is not trivial, and we also investigate how such a mapping can be evaluated. We show that the distributional profile of words from the essays, informed by the essays’ levels, consistently overlaps with our frequency-based method: words assigned the same proficiency level by our mapping tend to cluster together in a semantic space. In the absence of a gold standard, this information shows how often a word is associated with the same level in two different models. The semantic space also provides a similarity measure that can show which words are more central to a given level and which are more peripheral.
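To make concrete why mapping a distribution to a single label is not trivial, the following minimal sketch compares two plausible heuristics on the same word. It is an illustration only: the abstract does not state which mapping the authors use, and the function names, the example counts, and the `threshold` parameter are all hypothetical. The counts are assumed to be already comparable across levels (for example, normalized by the size of each level's sub-corpus).

```python
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def normalize(counts):
    """Turn per-level counts into a relative-frequency distribution."""
    total = sum(counts.get(level, 0) for level in CEFR_LEVELS)
    if total == 0:
        raise ValueError("word has no occurrences at any level")
    return {level: counts.get(level, 0) / total for level in CEFR_LEVELS}


def argmax_level(counts):
    """Heuristic 1 (assumed): the level where the word is most frequent."""
    dist = normalize(counts)
    # max() returns the first maximal item, so ties go to the lower level.
    return max(CEFR_LEVELS, key=lambda level: dist[level])


def first_significant_level(counts, threshold=0.2):
    """Heuristic 2 (assumed): the lowest level holding at least
    `threshold` of the word's probability mass."""
    dist = normalize(counts)
    for level in CEFR_LEVELS:
        if dist[level] >= threshold:
            return level
    # No level reaches the threshold: fall back to the most frequent level.
    return argmax_level(counts)


# Hypothetical counts for one word, not taken from the corpus.
example = {"A2": 3, "B1": 10, "B2": 4}
label_by_peak = argmax_level(example)
label_by_onset = first_significant_level(example, threshold=0.15)
```

With these made-up counts the two heuristics disagree: the peak-based mapping yields B1, while the onset-based mapping with a 0.15 threshold yields A2. Disagreements of this kind are why the mapping itself needs to be evaluated.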


Lexical complexity, Common European Framework of Reference, Mapping, Semantic space


