Conference article

Using shallow syntactic features to measure influences of L1 and proficiency level in EFL writings

Andrea Horbach
Department of Computational Linguistics, Saarland University, Saarbrücken, Germany

Jonathan Poitz
Department of Computational Linguistics, Saarland University, Saarbrücken, Germany

Alexis Palmer
Institute for Natural Language Processing, Stuttgart University, Stuttgart, Germany


Published in: Proceedings of the 4th workshop on NLP for Computer Assisted Language Learning at NODALIDA 2015, Vilnius, 11th May, 2015

Linköping Electronic Conference Proceedings 114:4, p. 21-34

NEALT Proceedings Series 26:4, p. 21-34


Published: 2015-05-06

ISBN: 978-91-7519-036-5

ISSN: 1650-3686 (print), 1650-3740 (online)


This paper proposes a framework for modeling and analyzing differences between texts written by different subgroups of learners of English as a Foreign Language (EFL), where subgroups are defined by native language (L1) and proficiency level. Using frequency vectors of both POS trigrams and mixed POS and function-word trigrams, we compare learner language variants to each other and to native English, German, and Chinese texts. We introduce the trigram usage factor, a metric for identifying sequences that are especially characteristic of a particular subgroup of learners. We show that the distance between learner English and native English decreases as proficiency increases. We then measure the distance between learner English and the other native languages, German and Chinese. Finally, we show that automatic proficiency classification benefits from using L1-specific classifiers.
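The comparison of trigram frequency vectors described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which distance measure is used, so cosine distance is an assumption here, and the function-word list, the function names, and the example sentences are all hypothetical. The trigram usage factor is not sketched because its definition is not given in the abstract.

```python
from collections import Counter
from math import sqrt

# Hypothetical, deliberately tiny function-word list (illustration only).
FUNCTION_WORDS = {"the", "a", "of", "to", "in", "with", "that", "is"}


def mixed_sequence(tagged):
    """Keep function words as lowercased tokens; replace other words by their POS tag."""
    return [w.lower() if w.lower() in FUNCTION_WORDS else t for w, t in tagged]


def trigram_freqs(sequences):
    """Relative frequencies of trigrams over a list of symbol sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:], seq[2:]))
    total = sum(counts.values())
    return {tri: c / total for tri, c in counts.items()} if total else {}


def cosine_distance(p, q):
    """1 - cosine similarity of two sparse frequency vectors (assumed measure)."""
    dot = sum(v * q.get(k, 0.0) for k, v in p.items())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return 1.0 - dot / norm if norm else 1.0


# Hypothetical pre-tagged sentences (Penn Treebank style tags).
learner = [[("I", "PRP"), ("am", "VBP"), ("agree", "VB"),
            ("to", "TO"), ("the", "DT"), ("idea", "NN")]]
native = [[("I", "PRP"), ("agree", "VBP"), ("with", "IN"),
           ("the", "DT"), ("idea", "NN")]]

# Pure POS trigrams versus mixed POS and function-word trigrams.
pos_dist = cosine_distance(
    trigram_freqs([[t for _, t in s] for s in learner]),
    trigram_freqs([[t for _, t in s] for s in native]),
)
mixed_dist = cosine_distance(
    trigram_freqs([mixed_sequence(s) for s in learner]),
    trigram_freqs([mixed_sequence(s) for s in native]),
)
print(pos_dist, mixed_dist)
```

In a real setting the sequences would come from a POS tagger run over whole corpora, and one frequency vector would be built per learner subgroup and per native reference corpus before any distances are computed.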


learner language; shallow syntactic features; proficiency classification


