Conference article

On the Development of a Large Scale Corpus for Native Language Identification

Thomas G. Hudson
Durham University, Durham, UK

Sardar Jaf
Durham University, Durham, UK

Published in: Proceedings of the 17th International Workshop on Treebanks and Linguistic Theories (TLT 2018), December 13–14, 2018, Oslo University, Norway

Linköping Electronic Conference Proceedings 155:12, pp. 115–129

Published: 2018-12-10

ISBN: 978-91-7685-137-1

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

Native Language Identification (NLI) is the task of identifying an author’s native language from their writing in a second language. In this paper, we introduce a new corpus (italki), which is larger than existing corpora. It can be used to train machine-learning-based systems that identify the native language of authors of English text. To examine the usefulness of italki, we use it to train and test some of the well-performing NLI systems presented in the 2017 NLI shared task. We present some aspects of italki and show how varying the size of italki’s training dataset for some languages affects system performance. Our empirical findings highlight the potential of italki as a large-scale corpus for training machine learning classifiers that identify authors’ native language from their written English text. We obtained promising results that show the potential of italki to improve the performance of current NLI systems. More importantly, we found that current NLI systems trained on italki generalize better than the same systems trained on existing corpora.

Keywords

native language, training data, italki, NLI, native language identification, language identification, dataset, corpus

References

Brooke, J. & Hirst, G. (2012), Robust, Lexicalized Native Language Identification, in ‘COLING2012: Conference on Computational Linguistics’, The COLING 2012 Organizing Committee, Mumbai, India, pp. 391–408.

Chan, S., Jahromi, M. H., Benetti, B., Lakhani, A. & Fyshe, A. (2017), Ensemble Methods for Native Language Identification, in ‘BEA2017: Workshop on Innovative Use of NLP for Building Educational Applications’, Association for Computational Linguistics, Copenhagen, pp. 217–223.

Estival, D., Gaustad, T., Pham, S. B., Radford, W. & Hutchinson, B. (2007), Author profiling for English emails, in ‘PACLING2007: Conference of the Pacific Association for Computational Linguistics’, Melbourne, Australia, pp. 263–272.

Gibbons, J. (2003), Forensic Linguistics: An Introduction to Language in the Justice System, John Wiley & Sons.

Granger, S., Dagneaux, E., Meunier, F. & Paquot, M. (2002), International corpus of learner English, Presses universitaires de Louvain.

Jarvis, S. & Crossley, S. A. (2012), Approaching Language Transfer Through Text Classification: Explorations in the Detection-Based Approach, Vol. 64, Multilingual Matters, Bristol, UK.

Koppel, M., Schler, J. & Zigdon, K. (2005), ‘Automatically determining an anonymous author’s native language’, Intelligence and Security Informatics pp. 41–76.

Kulmizev, A., Blankers, B., Bjerva, J., Nissim, M., Van Noord, G., Plank, B. & Wieling, M. (2017), The Power of Character N-grams in Native Language Identification, in ‘BEA2017: Workshop on Innovative Use of NLP for Building Educational Applications’, Association for Computational Linguistics, Copenhagen, pp. 382–389.

Malmasi, S., Evanini, K., Cahill, A., Tetreault, J., Pugh, R., Hamill, C., Napolitano, D. & Qian, Y. (2017), A Report on the 2017 Native Language Identification Shared Task, in ‘BEA2017: Workshop on Innovative Use of NLP for Building Educational Applications’, Association for Computational Linguistics, Copenhagen, pp. 62–75.

Nigam, K., Lafferty, J. & McCallum, A. (1999), Using Maximum Entropy for Text Classification, in ‘IJCAI1999: Workshop on Machine Learning for Information Filtering’, Stockholm, Sweden, pp. 61–67.

Rama, T. & Coltekin, C. (2017), Fewer features perform well at Native Language Identification task, in ‘BEA2017: Workshop on Innovative Use of NLP for Building Educational Applications’, Association for Computational Linguistics, Copenhagen, pp. 255–260.

Rozovskaya, A. & Roth, D. (2011), Algorithm Selection and Model Adaptation for ESL Correction Tasks, in ‘ACL2011: Meeting of the Association for Computational Linguistics’, Portland, Oregon, USA.

Tetreault, J., Blanchard, D. & Cahill, A. (2013), A report on the first native language identification shared task, in ‘BEA2013: Workshop on innovative use of NLP for building educational applications’, Association for Computational Linguistics, Atlanta, Georgia, pp. 48–57.

Tetreault, J., Blanchard, D., Cahill, A. & Chodorow, M. (2012), Native Tongues, Lost and Found: Resources and Empirical Evaluations in Native Language Identification, in ‘COLING2012: Conference on Computational Linguistics’, Vol. 2, Mumbai, India, pp. 2585–2602.

Tofighi, P., Köse, C. & Rouka, L. (2012), ‘Author’s native language identification from web-based texts’, International Journal of Computer and Communication Engineering 1(1), 47–50.

Vajjala, S. & Banerjee, S. (2017), A study of N-gram and Embedding Representations for Native Language Identification, in ‘BEA2017: Workshop on Innovative Use of NLP for Building Educational Applications’, Association for Computational Linguistics, Copenhagen, pp. 240–248.

Yang, Y. & Liu, X. (1999), A re-examination of text categorization methods, in ‘Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval’, ACM, pp. 42–49.
