Exploring Features for Named Entity Recognition in Lithuanian Text Corpus

Jurgita Kapočūtė-Dzikienė
Kaunas University of Technology, Kaunas, Lithuania

Anders Nøklestad
University of Oslo, Norway

Janne Bondi Johannessen
University of Oslo, Norway

Algis Krupavičius
Kaunas University of Technology, Kaunas, Lithuania


Published in: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22-24, 2013, Oslo University, Norway. NEALT Proceedings Series 16

Linköping Electronic Conference Proceedings 85:11, pp. 73-88

NEALT Proceedings Series 16:11, pp. 73-88


Published: 2013-05-17

ISBN: 978-91-7519-589-6

ISSN: 1650-3686 (print), 1650-3740 (online)


Despite the existence of effective methods that solve named entity recognition tasks for such widely used languages as English, there is no clear answer as to which methods are the most suitable for languages that are substantially different. In this paper we attempt to solve a named entity recognition task for Lithuanian, using a supervised machine learning approach and exploring different sets of features in terms of orthographic and grammatical information, different windows, etc. Although the performance is significantly higher when language-dependent features based on gazetteer lookup and automatic grammatical tools (part-of-speech tagger, lemmatizer or stemmer) are taken into account, we demonstrate that the performance does not degrade when features based on grammatical tools are replaced with affix information only. The best results (micro-averaged F-score = 0.895) were obtained using all available features, but the results decreased by only 0.002 when features based on grammatical tools were omitted.
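
To make the kinds of features named in the abstract concrete, the following is a minimal sketch of orthographic and affix features collected over a window of neighbouring tokens. It is not the authors' feature set: the function name `token_features`, the feature names, the window size of 1, the affix length of 3, and the example sentence are all assumptions chosen for illustration.

```python
from typing import Dict, List


def token_features(
    tokens: List[str], i: int, window: int = 1, affix_len: int = 3
) -> Dict[str, object]:
    """Collect orthographic and affix features for tokens[i] and its neighbours.

    Hypothetical helper: feature names and defaults are illustrative only.
    """
    feats: Dict[str, object] = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if j < 0 or j >= len(tokens):
            # The window position falls outside the sentence.
            feats[f"{offset}:pad"] = True
            continue
        tok = tokens[j]
        low = tok.lower()
        # Orthographic features.
        feats[f"{offset}:is_capitalized"] = tok[:1].isupper()
        feats[f"{offset}:is_all_caps"] = tok.isupper()
        feats[f"{offset}:has_digit"] = any(c.isdigit() for c in tok)
        # Affix features: leading and trailing characters of the lowercased token.
        feats[f"{offset}:prefix"] = low[:affix_len]
        feats[f"{offset}:suffix"] = low[-affix_len:]
    return feats


# Illustrative sentence (assumed example, not taken from the paper's corpus).
sentence = ["Jonas", "gyvena", "Kaune", "."]
feats = token_features(sentence, 2)
```

Feature dictionaries of this shape can be fed, one per token, to a supervised sequence classifier. Affix features need no part-of-speech tagger, lemmatizer, or stemmer, which is the trade-off the abstract evaluates.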


Named entity recognition and classification; supervised machine learning; Lithuanian


