The Effect of Excluding Out of Domain Training Data from Supervised Named-Entity Recognition

Adam Persson
Department of Linguistics, Stockholm University, Stockholm, Sweden


Published in: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden

Linköping Electronic Conference Proceedings 131:40, pp. 289-292

NEALT Proceedings Series 29:40, pp. 289-292


Published: 2017-05-08

ISBN: 978-91-7685-601-7

ISSN: 1650-3686 (print), 1650-3740 (online)


Supervised named-entity recognition (NER) systems perform better on text that is similar to their training data. Despite this, systems are often trained on as much data as possible, regardless of its relevance. This study explores whether NER can be improved by excluding out-of-domain training data. A maximum entropy model is developed and evaluated twice on each domain in the Stockholm-Umeå Corpus (SUC), once with all data and once with only in-domain data. For some domains, excluding out-of-domain training data improves tagging, but over the entire corpus it has a negative effect of less than two percentage points (for both strict and fuzzy matching).
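The two training conditions compared in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy corpus, the domain labels, the function name `training_conditions`, and the held-out split are hypothetical and are not taken from the paper, whose actual evaluation protocol is not described on this page.

```python
# Hypothetical toy corpus: (sentence_id, domain) pairs standing in for
# annotated SUC sentences. The domain names are invented for illustration.
corpus = [
    (0, "news"), (1, "news"), (2, "news"),
    (3, "fiction"), (4, "fiction"),
    (5, "academic"), (6, "academic"),
]


def training_conditions(corpus, domain, test_ids):
    """Build the two training sets compared for one target domain.

    Both sets exclude the held-out test sentences. They differ only in
    whether sentences from other domains are kept.
    """
    held_out = set(test_ids)
    # Condition 1: all available data except the held-out test sentences.
    all_data = [s for s in corpus if s[0] not in held_out]
    # Condition 2: the same data restricted to the target domain.
    in_domain = [s for s in all_data if s[1] == domain]
    return all_data, in_domain


all_data, in_domain = training_conditions(corpus, "news", test_ids=[0])
print(len(all_data), len(in_domain))
```

In such a setup the same tagger would be trained once on each set and scored on the same held-out sentences, so that any difference in score comes from the excluded out-of-domain data alone.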




