Published: 2017-05-08
ISBN: 978-91-7685-601-7
ISSN: 1650-3686 (print), 1650-3740 (online)
Named Entity Recognition (NER), the search, classification and tagging of names and name-like informational elements in texts, has become a standard information extraction procedure for textual data during the last two decades. NER has been applied to many types of texts (newspapers, fiction, historical records) and to many types of entities (persons, locations, chemical compounds, protein families, animals, etc.). In general, a NER system’s performance is genre and domain dependent, and the entity categories used also vary considerably (Nadeau and Sekine, 2007). The most general set of named entities is usually some version of a three-part categorization of locations, persons and corporations. In this paper we report evaluation results of NER with two different data sets: the digitized Finnish historical newspaper collection Digi and modern Finnish technology news from Digitoday. The historical newspaper collection Digi contains 1,960,921 pages of newspaper material from the years 1771–1910 in both Finnish and Swedish. We use only the Finnish documents in our evaluation. The OCRed newspaper collection contains many OCR errors; its estimated word-level correctness is about 70–75%, and its NER evaluation collection consists of 75,931 words (Kettunen and Pääkkönen, 2016; Kettunen et al., 2016). Digitoday’s annotated collection consists of 240 articles from six different sections of the newspaper. The new tool we evaluate for NER tagging is unconventional: it is a rule-based Finnish Semantic Tagger, the FST (Löfberg et al., 2005), and its results are compared to those of a standard rule-based NE tagger, FiNER.
Maud Ehrmann, Giovanni Colavizza, Yannick Rochat, and Frédéric Kaplan. 2016. Diachronic Evaluation of NER Systems on Old Newspapers. In Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016), 97–107. https://www.linguistics.rub.de/konvens16/pub/13_konvensproc.pdf
Graeme Hirst. 2009. Ontology and the Lexicon. ftp://ftp.cs.toronto.edu/pub/gh/Hirst-Ontol-2009.pdf
Kimmo Kettunen and Tuula Pääkkönen. 2016. Measuring Lexical Quality of a Historical Finnish Newspaper Collection – Analysis of Garbled OCR Data with Basic Language Technology Tools and Means. LREC 2016, Tenth International Conference on Language Resources and Evaluation. http://www.lrec-conf.org/proceedings/lrec2016/pdf/17_Paper.pdf
Kimmo Kettunen, Eetu Mäkelä, Juha Kuokkala, Teemu Ruokolainen and Jyrki Niemi. 2016. Modern Tools for Old Content - in Search of Named Entities in a Finnish OCRed Historical Newspaper Collection 1771-1910. In Krestel, R., Mottin, D. and Müller, E. (eds.), Proceedings of Conference "Lernen, Wissen, Daten, Analysen", LWDA 2016. http://ceur-ws.org/Vol-1670/
Laura Löfberg, Jukka-Pekka Juntunen, Asko Nykänen, Krista Varantola, Paul Rayson and Dawn Archer. 2004. Using a semantic tagger as dictionary search tool. In 11th EURALEX (European Association for Lexicography) International Congress Euralex 2004: 127–134.
Laura Löfberg, Scott Piao, Paul Rayson, Jukka-Pekka Juntunen, Asko Nykänen and Krista Varantola. 2005. A semantic tagger for the Finnish language. http://eprints.lancs.ac.uk/12685/1/cl2005_fst.pdf
Daniel Lopresti. 2009. Optical character recognition errors and their effects on natural language processing. International Journal on Document Analysis and Recognition, 12(3): 141–151.
Eetu Mäkelä. 2014. Combining a REST Lexical Analysis Web Service with SPARQL for Mashup Semantic Annotation from Text. Presutti, V. et al. (Eds.), The Semantic Web: ESWC 2014 Satellite Events. Lecture Notes in Computer Science, vol. 8798, Springer: 424–428.
Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.
Mónica Marrero, Julián Urbano, Sonia Sánchez-Cuadrado, Jorge Morato and Juan Miguel Gómez-Berbís. 2013. Named Entity Recognition: Fallacies, challenges and opportunities. Computer Standards & Interfaces 35(5): 482–489.
David Miller, Sean Boisen, Richard Schwartz, Rebecca Stone and Ralph Weischedel. 2000. Named entity extraction from noisy input: Speech and OCR. Proceedings of the 6th Applied Natural Language Processing Conference: 316–324, Seattle, WA. http://www.anthology.aclweb.org/A/A00/A00-1044.pdf
David Nadeau and Satoshi Sekine. 2007. A Survey of Named Entity Recognition and Classification. Linguisticae Investigationes 30(1): 3–26.
Thomas L. Packer, Joshua F. Lutes, Aaron P. Stewart, David W. Embley, Eric K. Ringger, Kevin D. Seppi and Lee S. Jensen. 2010. Extracting Person Names from Diverse and Noisy OCR Text. Proceedings of the fourth workshop on Analytics for noisy unstructured text data. Toronto, ON, Canada: ACM. http://dl.acm.org/citation.cfm?id=1871845
Scott Piao, Paul Rayson, Dawn Archer, Francesca Bianchi, Carmen Dayrell, Mahmoud El-Haj, Ricardo-María Jiménez, Dawn Knight, Michal Kren, Laura Löfberg, Rao Muhammad Adeel Nawab, Jawad Shafi, Phoey Lee Teh and Olga Mudraya. 2016. Lexical Coverage Evaluation of Large-scale Multilingual Semantic Lexicons for Twelve Languages. Proceedings of LREC. http://www.lrec-conf.org/proceedings/lrec2016/pdf/257_Paper.pdf
Thierry Poibeau and Leila Kosseim. 2001. Proper Name Extraction from Non-Journalistic Texts. Language and Computers, 37: 144–157.
Paul Rayson, Dawn Archer, Scott Piao and Tony McEnery. 2004. The UCREL semantic analysis system. Proceedings of the workshop on Beyond Named Entity Recognition: Semantic labelling for NLP tasks, in association with the 4th International Conference on Language Resources and Evaluation (LREC 2004): 7–12. http://www.lancaster.ac.uk/staff/rayson/publications/usas_lrec04ws.pdf
Kepa Joseba Rodriquez, Mike Bryant, Tobias Blanke and Magdalena Luszczynska. 2012. Comparison of Named Entity Recognition Tools for raw OCR text. Proceedings of KONVENS 2012 (LThist 2012 workshop), Vienna, September 21: 410–414.
Miikka Silfverberg. 2015. Reverse Engineering a Rule-Based Finnish Named Entity Recognizer. https://kitwiki.csc.fi/twiki/pub/FinCLARIN/KielipankkiEventNERWorkshop2015/Silfverberg_presentation.pdf