Jurgita Kapočūtė-Dzikienė
Kaunas University of Technology, Kaunas, Lithuania
Anders Nøklestad
University of Oslo, Norway
Janne Bondi Johannessen
University of Oslo, Norway
Algis Krupavičius
Kaunas University of Technology, Kaunas, Lithuania
Published in: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22-24, 2013, Oslo University, Norway. NEALT Proceedings Series 16
Linköping Electronic Conference Proceedings 85:11, p. 73-88
NEALT Proceedings Series 16:11, p. 73-88
Published: 2013-05-17
ISBN: 978-91-7519-589-6
ISSN: 1650-3686 (print), 1650-3740 (online)
Despite the existence of effective methods that solve named entity recognition tasks for such widely used languages as English, there is no clear answer as to which methods are the most suitable for languages that are substantially different. In this paper we attempt to solve a named entity recognition task for Lithuanian, using a supervised machine learning approach and exploring different sets of features in terms of orthographic and grammatical information, different windows, etc. Although the performance is significantly higher when language-dependent features based on gazetteer lookup and automatic grammatical tools (part-of-speech tagger, lemmatizer or stemmer) are taken into account, we demonstrate that the performance does not degrade appreciably when features based on grammatical tools are replaced with affix information only. The best results (micro-averaged F-score = 0.895) were obtained using all available features, but the results decreased by only 0.002 when features based on grammatical tools were omitted.
Named entity recognition and classification; supervised machine learning; Lithuanian
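The kind of token-level features the abstract mentions (orthographic cues, affix information, a context window) can be illustrated with a minimal sketch. This is not the authors' feature set or implementation: the function name `token_features`, the feature names, the default affix length of 3, and the default window of one token on each side are all assumptions chosen for illustration.

```python
from typing import Dict, List


def token_features(
    tokens: List[str], i: int, window: int = 1, affix_len: int = 3
) -> Dict[str, object]:
    """Build orthographic and affix features for tokens[i] and its neighbours.

    Hypothetical helper: the feature names and default sizes are illustrative
    assumptions, not values taken from the paper.
    """
    feats: Dict[str, object] = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if j < 0 or j >= len(tokens):
            # Mark positions that fall outside the sentence.
            feats[f"{offset}:pad"] = True
            continue
        tok = tokens[j]
        # Orthographic features.
        feats[f"{offset}:lower"] = tok.lower()
        feats[f"{offset}:is_capitalized"] = tok[:1].isupper()
        feats[f"{offset}:is_all_caps"] = tok.isupper()
        feats[f"{offset}:has_digit"] = any(ch.isdigit() for ch in tok)
        # Affix features: a cheap stand-in for lemmatizer or stemmer output.
        feats[f"{offset}:prefix"] = tok[:affix_len].lower()
        feats[f"{offset}:suffix"] = tok[-affix_len:].lower()
    return feats


# Example sentence: "Jonas gyvena Kaune" ("Jonas lives in Kaunas").
tokens = ["Jonas", "gyvena", "Kaune"]
features = token_features(tokens, 2, window=1)
```

Feature dictionaries of this shape could be passed to a sequence labeller such as a conditional random field. Affix features need no external tools, which is why they are attractive for languages where taggers and lemmatizers are scarce.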