Automatic CEFR Level Prediction for Estonian Learner Text

Sowmya Vajjala
LEAD Graduate School, University of Tübingen, Germany

Kaidi Lõo
Department of Linguistics, University of Alberta, Canada


Published in: Proceedings of the third workshop on NLP for computer-assisted language learning at SLTC 2014, Uppsala University

Linköping Electronic Conference Proceedings 107:9, pp. 113–127

NEALT Proceedings Series 22:9, pp. 113–127


Published: 2014-11-11

ISBN: 978-91-7519-175-1

ISSN: 1650-3686 (print), 1650-3740 (online)


This paper reports on approaches for automatically predicting a learner’s language proficiency in Estonian according to the European CEFR scale. We used morphological and POS tag information extracted from texts written by learners, and compared classification and regression modeling for this task. Our models achieve a classification accuracy of 79% and a correlation of 0.85 when the task is modeled as regression. Comparing the two, we concluded that classification is more effective than regression in terms of both exact error and the direction of error. In addition, we investigated the most predictive features for both multi-class and binary classification between level groups, and explored the nature of the correlations between highly predictive features. Our results show a considerable improvement in classification accuracy over previously reported results and take us a step closer towards the automated assessment of Estonian learner text.
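The comparison of classification against regression in terms of "exact error and the direction of error" can be made concrete by mapping CEFR levels onto an ordinal scale and checking whether a model's predictions match the gold level exactly, overshoot it, or undershoot it. The sketch below is illustrative only: the level mapping and the toy gold/predicted labels are assumptions for the example, not data from the paper.

```python
# Ordinal encoding of the six CEFR proficiency levels.
LEVELS = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5, "C2": 6}

def compare(gold, pred):
    """Return exact-match accuracy plus counts of over- and
    under-estimated texts, given gold and predicted CEFR labels."""
    g = [LEVELS[x] for x in gold]
    p = [LEVELS[x] for x in pred]
    exact = sum(a == b for a, b in zip(g, p)) / len(g)
    over = sum(b > a for a, b in zip(g, p))    # predicted too high
    under = sum(b < a for a, b in zip(g, p))   # predicted too low
    return exact, over, under

# Toy example: five texts, one over-estimated and one under-estimated.
gold = ["A2", "B1", "B1", "B2", "C1"]
pred = ["A2", "B1", "B2", "B2", "B2"]
print(compare(gold, pred))  # → (0.6, 1, 1)
```

For a regression model, the predicted real value would first be rounded to the nearest level before the same comparison is applied, which is what makes the error direction of the two model types directly comparable.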


Estonian; Proficiency Classification; CEFR; Morphological Features; Machine Learning


Burstein, J. (2003). The e-rater Scoring Engine: Automated Essay Scoring with Natural Language Processing, chapter 7, pages 107–115. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.

Burstein, J. and Chodorow, M. (2010). Progress and New Directions in Technology for Automated Essay Evaluation, chapter 36, pages 487–497. Oxford University Press, 2nd edition.

Council of Europe (2001). Common European Framework of Reference for Languages: Learning, teaching, assessment. Cambridge University Press, Cambridge.

Crossley, S. A., Salsbury, T., McNamara, D. S., and Jarvis, S. (2011). Predicting lexical proficiency in language learners using computational indices. Language Testing, 28:561–580.

Eslon, P. (2014). Eesti vahekeele korpus (Estonian Interlanguage Corpus). Keel ja Kirjandus, 6:436–451.

Gyllstad, H., Granfeldt, J., Bernardini, P., and Källkvist, M. (2014). Linguistic correlates to communicative proficiency levels of the CEFR: The case of syntactic complexity in written L2 English, L3 French and L4 Italian. EUROSLA Yearbook, 14(1):1–30.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. (2009). The WEKA data mining software: An update. The SIGKDD Explorations, 11(1):10–18.

Hall, M. A. (1998). Correlation-based Feature Subset Selection for Machine Learning. PhD thesis, The University of Waikato, Hamilton, New Zealand.

Hancke, J. (2013). Automatic prediction of CEFR proficiency levels based on linguistic features of learner language. Master’s thesis, International Studies in Computational Linguistics. Seminar für Sprachwissenschaft, Universität Tübingen.

Hancke, J. and Meurers, D. (2013). Exploring CEFR classification for German based on rich linguistic modeling. In Learner Corpus Research 2013, Book of Abstracts, Bergen, Norway.

Kira, K. and Rendell, L. A. (1992). A practical approach to feature selection. In Ninth International Workshop on Machine Learning, pages 249–256.

Kononenko, I. (1994). Estimating attributes: Analysis and extensions of RELIEF. In European Conference on Machine Learning, pages 171–182.

Kyle, K. and Crossley, S. A. (2014). Automatically assessing lexical sophistication: Indices, tools, findings, and application. TESOL Quarterly, –:–.

Lu, X. (2010). Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics, 15(4):474–496.

Lu, X. (2012). The relationship of lexical richness to the quality of ESL learners’ oral narratives. The Modern Language Journal.

Östling, R., Smolentzov, A., Tyrefors Hinnerich, B., and Höglin, E. (2013). Automated essay scoring for Swedish. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 42–47, Atlanta, Georgia. Association for Computational Linguistics.

Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, pages 44–49, Manchester, UK.

Tono, Y. (2000). A corpus-based analysis of interlanguage development: analysing POS tag sequences of EFL learner corpora. In PALC’99: Practical Applications in Language Corpora, pages 323–340.

Vajjala, S. and Lõo, K. (2013). Role of morpho-syntactic features in Estonian proficiency classification. In Proceedings of the 8th Workshop on Innovative Use of NLP for Building Educational Applications (BEA8), Association for Computational Linguistics.

Vyatkina, N. (2012). The development of second language writing complexity in groups and individuals: A longitudinal learner corpus study. The Modern Language Journal.

Williamson, D. M. (2009). A framework for implementing automated scoring. In The annual meeting of the American Educational Research Association (AERA) and the National Council on Measurement in Education (NCME).

Yannakoudakis, H., Briscoe, T., and Medlock, B. (2011). A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT ’11, pages 180–189, Stroudsburg, PA, USA. Association for Computational Linguistics. Corpus available: http://ilexir.co.uk/applications/clc-fce-dataset.

Zhang, B. (2008). Investigating proficiency classification for the Examination for the Certificate of Proficiency in English (ECPE). In Spaan Fellow Working Papers in Second or Foreign Language Assessment.
