Exploring Features for Named Entity Recognition in Lithuanian Text Corpus

Jurgita Kapočūtė-Dzikienė
Kaunas University of Technology, Kaunas, Lithuania

Anders Nøklestad
University of Oslo, Norway

Janne Bondi Johannessen
University of Oslo, Norway

Algis Krupavičius
Kaunas University of Technology, Kaunas, Lithuania

Published in: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22-24, 2013, Oslo University, Norway. NEALT Proceedings Series 16

Linköping Electronic Conference Proceedings 85:11, pp. 73-88

NEALT Proceedings Series 16:11, pp. 73-88

Published: 2013-05-17

ISBN: 978-91-7519-589-6

ISSN: 1650-3686 (print), 1650-3740 (online)


Despite the existence of effective methods that solve named entity recognition tasks for such widely used languages as English, there is no clear answer as to which methods are the most suitable for languages that are substantially different. In this paper we attempt to solve a named entity recognition task for Lithuanian, using a supervised machine learning approach and exploring different sets of features in terms of orthographic and grammatical information, different windows, etc. Although the performance is significantly higher when language-dependent features based on gazetteer lookup and automatic grammatical tools (part-of-speech tagger, lemmatizer or stemmer) are taken into account, we demonstrate that the performance does not degrade when features based on grammatical tools are replaced with affix information only. The best results (micro-averaged F-score=0.895) were obtained using all available features, but the results decreased by only 0.002 when features based on grammatical tools were omitted.
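
To make the kinds of features named in the abstract concrete, the following is a minimal Python sketch of token-level feature extraction that combines orthographic cues, affix information and a context window. It is an illustration only, not the authors' implementation: the function name `token_features`, the feature names, the window size of 1 and the affix length of 3 are all assumptions chosen for the example, and the sample sentence is made up.

```python
def token_features(tokens, i, window=1, affix_len=3):
    """Build a feature dict for tokens[i] from the token and its neighbours.

    All names and defaults here are hypothetical; the paper's actual
    feature set and settings are not reproduced.
    """
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if j < 0 or j >= len(tokens):
            # Mark positions that fall outside the sentence.
            feats[f"{offset}:pad"] = True
            continue
        tok = tokens[j]
        # Orthographic features: case pattern and presence of digits.
        feats[f"{offset}:lower"] = tok.lower()
        feats[f"{offset}:is_title"] = tok.istitle()
        feats[f"{offset}:is_upper"] = tok.isupper()
        feats[f"{offset}:has_digit"] = any(c.isdigit() for c in tok)
        # Affix features: first and last characters, used here in place
        # of output from a lemmatizer, stemmer or part-of-speech tagger.
        feats[f"{offset}:prefix"] = tok[:affix_len].lower()
        feats[f"{offset}:suffix"] = tok[-affix_len:].lower()
    return feats


# Illustrative three-token sentence (invented for this example).
tokens = ["Jonas", "gyvena", "Kaune"]
feats = token_features(tokens, 2)
```

Feature dictionaries of this shape can be passed to most supervised sequence or token classifiers. Gazetteer lookup, which the abstract also mentions, would add one more boolean feature per token and is omitted here.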


Named entity recognition and classification; supervised machine learning; Lithuanian


