Evaluation of language identification methods using 285 languages

Tommi Jauhiainen
University of Helsinki, Finland

Krister Lindén
University of Helsinki, Finland

Heidi Jauhiainen
University of Helsinki, Finland

Part of: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden

Linköping Electronic Conference Proceedings 131:21, pp. 183-191

NEALT Proceedings Series 29:21, pp. 183-191

Published: 2017-05-08

ISBN: 978-91-7685-601-7

ISSN: 1650-3686 (print), 1650-3740 (online)


Language identification is the task of assigning a language label to a text. It is an important preprocessing step in many automatic systems that operate on written text. In this paper, we present an evaluation of seven language identification methods on the task of distinguishing between 285 languages, using an out-of-domain test set. The evaluated methods are also described using a unified notation. We show that a method that performs well with a small number of languages does not necessarily scale to a large number of languages. The HeLI method performs best on test texts longer than 25 characters, reaching an F1-score of 99.5 at only 60 characters.


