SweLLex: second language learners’ productive vocabulary

Elena Volodina
University of Gothenburg, Sweden

Ildikó Pilán
University of Gothenburg, Sweden

Lorena Llozhi
University of Gothenburg, Sweden

Baptiste Degryse
Université catholique de Louvain, Belgium

Thomas François
Université catholique de Louvain, Belgium / FNRS Post-doctoral Researcher

Published in: Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition at SLTC, Umeå, 16th November 2016

Linköping Electronic Conference Proceedings 130:10, pp. 76-84

Published: 2016-11-15

ISBN: 978-91-7685-633-8

ISSN: 1650-3686 (print), 1650-3740 (online)


This paper presents SweLLex, a new lexical resource for learners of Swedish as a second language, and the know-how behind its creation. We concentrate on L2 learners’ productive vocabulary, i.e. the words they are actively able to produce, rather than the vocabulary they comprehend (receptive vocabulary). The list covers the productive vocabulary that L2 learners use in their essays. Each lexical item on the list is linked to its frequency distribution over the six proficiency levels defined by the Common European Framework of Reference (CEFR) (Council of Europe, 2001). To make the list a more reliable resource, we experiment with normalizing L2 word-level errors by replacing them with their correct equivalents. SweLLex has been tested in a prototype system for automatic CEFR level classification of essays, as well as in a visualization tool for exploring L2 vocabulary that contrasts receptive and productive vocabulary usage at different levels of language proficiency.
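
The idea of linking each lexical item to a frequency distribution over the six CEFR levels, after normalizing word-level errors, can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `level_distribution`, the `corrections` mapping, the use of raw counts rather than any normalized frequency, and the two toy essays are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict

# The six CEFR proficiency levels, lowest to highest.
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def level_distribution(essays, corrections=None):
    """Count how often each word occurs at each CEFR level.

    essays: iterable of (level, tokens) pairs (hypothetical input format).
    corrections: hypothetical mapping from an erroneous form to its
        correct equivalent, applied before counting.
    Returns a dict mapping each word to a list of six raw counts,
    ordered as in CEFR_LEVELS.
    """
    corrections = corrections or {}
    counts = defaultdict(Counter)
    for level, tokens in essays:
        for token in tokens:
            word = token.lower()
            # Normalize a word-level error to its correct form, if known.
            word = corrections.get(word, word)
            counts[word][level] += 1
    return {
        word: [per_level[level] for level in CEFR_LEVELS]
        for word, per_level in counts.items()
    }


# Invented toy data: one A1 essay with a misspelling, one B1 essay.
essays = [
    ("A1", ["jag", "tyker", "om", "kaffe"]),
    ("B1", ["jag", "tycker", "om", "kaffe"]),
]
dist = level_distribution(essays, corrections={"tyker": "tycker"})
print(dist["tycker"])  # counts for A1..C2
```

Without the `corrections` mapping, the misspelled form would be counted as a separate entry, which is the kind of noise that normalization is meant to reduce.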


Productive vocabulary scope, CEFR, normalization of learner writing, Swedish as a second language


Sture Allén. 2002. Våra viktiga ord. Liber, Sweden.

Lene Antonsen. 2012. Improving feedback on L2 misspellings - an FST approach. In Proceedings of the SLTC 2012 workshop on NLP for CALL; Lund; 25th October; 2012, number 080, pages 1–10. Linköping University Electronic Press.

Steven Bird. 2006. NLTK: the natural language toolkit. In Proceedings of the COLING/ACL on Interactive presentation sessions, pages 69–72. Association for Computational Linguistics.

Stefan Bordag. 2008. A comparison of co-occurrence and similarity measures as simulations of context. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 52–63. Springer.

Lars Borin, Markus Forsberg, and Johan Roxendal. 2012. Korp - the corpus infrastructure of Språkbanken. In LREC, pages 474–478.

Lars Borin, Markus Forsberg, and Lennart Lönngren. 2013. SALDO: a touch of yin to WordNet’s yang. Language Resources and Evaluation, 47(4):1191–1211.

A. Capel. 2010. A1–B2 vocabulary: insights and issues arising from the English Profile Wordlists project. English Profile Journal, 1(1):1–11.

A. Capel. 2012. Completing the English Vocabulary Profile: C1 and C2 vocabulary. English Profile Journal, 3:1–14.

Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Press Syndicate of the University of Cambridge.

Averil Coxhead. 2000. A new academic word list. TESOL quarterly, 34(2):213–238.

M. Dickinson and M. Ragheb. 2013. Annotation for Learner English Guidelines, v. 0.1. Technical report. Indiana University, Bloomington.

W. Francis and H. Kucera. 1982. Frequency analysis of English usage. Houghton Mifflin Company, Boston, MA.

Thomas François, Elena Volodina, Ildikó Pilán, and Anais Tack. 2016. SVALex: a CEFR-graded Lexical Resource for Swedish Foreign and Second Language Learners. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France, May.

Ann-Kristin Hult, Sven-Göran Malmgren, and Emma Sköldberg. 2010. Lexin - a report from a recycling lexicographic project in the North. In Proceedings of the XIV Euralex International Congress (Leeuwarden, 6-10 July 2010).

Adam Kilgarriff, Pavel Rychly, Pavel Smrz, and David Tugwell. 2004. Itri-04-08 the Sketch Engine. Information Technology, 105:116.

A. Kilgarriff, F. Charalabopoulou, M. Gavrilidou, J. B. Johannessen, S. Khalil, S. J. Kokkinakis, R. Lew, S. Sharoff, R. Vadlapudi, and E. Volodina. 2014. Corpus-based vocabulary lists for language learners for nine languages. Language Resources and Evaluation, 48(1):121–163.

B. Laufer and G.C. Ravenhorst-Kalovski. 2010. Lexical Threshold Revisited: Lexical Text Coverage, Learners’ Vocabulary Size and Reading Comprehension. Reading in a foreign language, 22(1):15–30.

Beáta Megyesi, Jesper Näsman, and Anne Palmér. 2016. The Uppsala Corpus of Student Writings: Corpus Creation, Annotation, and Analysis. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016).

Daniel Naber. 2003. A rule-based style and grammar checker. Master’s thesis, Bielefeld University, Bielefeld, Germany.

Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant, editors. 2014. Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task. Association for Computational Linguistics, Baltimore, Maryland.

OECD. 2013. OECD Skills Outlook 2013. First Results from the Survey of Adult Skills.

PIAAC. 2013. Survey of Adult Skills (PIAAC).

Anne Rimrott and Trude Heift. 2005. Language learners and generic spell checkers in CALL. CALICO journal, pages 17–48.

SCB. 2013. Tema utbildning, rapport 2013:2, Den internationella undersökningen av vuxnas färdigheter. Statistiska centralbyrån.

E.L. Thorndike. 1921. The teacher’s word book. Teachers College, Columbia University, New York.

Elena Volodina, Beata Megyesi, Mats Wirén, Lena Granstedt, Julia Prentice, Monica Reichenberg, and Gunlög Sundberg. 2016a. A Friend in Need? Research agenda for electronic Second Language infrastructure. In Proceedings of SLTC 2016, Umeå, Sweden.

Elena Volodina, Ildikó Pilán, Ingegerd Enström, Lorena Llozhi, Peter Lundkvist, Gunlög Sundberg, and Monica Sandell. 2016b. SweLL on the rise: Swedish Learner Language corpus for European Reference Level studies. LREC 2016, Slovenia.

M. West. 1953. A General Service List of English Words. London: Longman, Green and Co.

Katrin Wisniewski, Karin Schöne, Lionel Nicolas, Chiara Vettori, Adriane Boyd, Detmar Meurers, Andrea Abel, and Jirka Hana. 2013. MERLIN: An online trilingual learner corpus empirically grounding the European reference levels in authentic learner data. In ICT for Language Learning 2013, Conference Proceedings, Florence, Italy. Libreriauniversitaria.it Edizioni.
