Evaluation of language identification methods using 285 languages

Tommi Jauhiainen
University of Helsinki, Finland

Krister Lindén
University of Helsinki, Finland

Heidi Jauhiainen
University of Helsinki, Finland

Published in: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden

Linköping Electronic Conference Proceedings 131:21, pp. 183-191

NEALT Proceedings Series 29:21, pp. 183-191

Published: 2017-05-08

ISBN: 978-91-7685-601-7

ISSN: 1650-3686 (print), 1650-3740 (online)


Language identification is the task of assigning a language label to a text. It is an important preprocessing step in many automatic systems that operate on written text. In this paper, we evaluate seven language identification methods on the task of distinguishing between 285 languages, using an out-of-domain test set. We also describe the evaluated methods using a unified notation. We show that a method that performs well with a small number of languages does not necessarily scale to a large number of languages. The HeLI method performs best on test texts longer than 25 characters, reaching an F1-score of 99.5 at just 60 characters.


