Conference article

Modeling OOV Words With Letter N-Grams in Statistical Taggers: Preliminary Work in Biomedical Entity Recognition

Teemu Ruokolainen
Aalto University, Helsinki, Finland

Miikka Silfverberg
University of Helsinki, Helsinki, Finland

Published in: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22-24, 2013, Oslo University, Norway. NEALT Proceedings Series 16

Linköping Electronic Conference Proceedings 85:18, p. 181-193

NEALT Proceedings Series 16:18, p. 181-193

Published: 2013-05-17

ISBN: 978-91-7519-589-6

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

We discuss sequential tagging problems in natural language processing using statistical methodology. We propose an automatic, domain-independent approach to modeling out-of-vocabulary (OOV) words, that is, words that do not occur in the training data. Our method uses probabilistic letter n-gram models to model the orthography of different tags. We show how to combine the approach with two widely used statistical models, Hidden Markov Models and Conditional Random Fields. Instead of taking the common approach of using substrings directly as features, which results in an explosion in the number of model parameters, we compress orthographic information into a small number of parameters. Experiments in biomedical entity recognition on the Genia corpus show that the approach can alleviate the OOV problem, resulting in improved overall model performance.
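As a rough illustration of the idea in the abstract, the sketch below trains one letter n-gram model per tag and reduces a word's orthography to a single log-probability per tag. It is not the authors' implementation: the class `LetterNgramModel`, the function `orthographic_features`, the add-k smoothing, the trigram order, and the toy word lists are all assumptions made for this example, and the abstract does not say how the scores enter the HMM or CRF.

```python
import math
from collections import defaultdict


class LetterNgramModel:
    """Add-k smoothed letter n-gram model over the words of one tag.

    Hypothetical helper for illustration; the smoothing method is an
    assumption, not taken from the paper.
    """

    def __init__(self, n=3, k=0.1):
        self.n = n
        self.k = k
        self.ngram_counts = defaultdict(int)
        self.context_counts = defaultdict(int)
        self.alphabet = {"</w>"}

    def _pad(self, word):
        # Pad with start symbols so the first letters have a full context.
        return ["<w>"] * (self.n - 1) + list(word) + ["</w>"]

    def train(self, words):
        for word in words:
            chars = self._pad(word)
            self.alphabet.update(chars[self.n - 1:])
            for i in range(self.n - 1, len(chars)):
                context = tuple(chars[i - self.n + 1:i])
                self.ngram_counts[context + (chars[i],)] += 1
                self.context_counts[context] += 1

    def logprob(self, word):
        chars = self._pad(word)
        # One extra slot reserves probability mass for unseen letters.
        vocab_size = len(self.alphabet) + 1
        total = 0.0
        for i in range(self.n - 1, len(chars)):
            context = tuple(chars[i - self.n + 1:i])
            num = self.ngram_counts.get(context + (chars[i],), 0) + self.k
            den = self.context_counts.get(context, 0) + self.k * vocab_size
            total += math.log(num / den)
        return total


def orthographic_features(word, models):
    """Return one real-valued score per tag instead of many substring features."""
    return {tag: model.logprob(word) for tag, model in models.items()}


# Toy training words per tag (illustrative only, not drawn from the Genia corpus).
training_words = {
    "protein": ["interleukin", "kinase", "NF-kappaB"],
    "O": ["the", "of", "and", "with"],
}
models = {}
for tag, words in training_words.items():
    models[tag] = LetterNgramModel(n=3, k=0.1)
    models[tag].train(words)

# "interferon" is OOV for both toy models, yet its letters resemble the protein words.
features = orthographic_features("interferon", models)
```

With these toy lists, the unseen word `interferon` scores higher under the `protein` model than under the `O` model, so the tagger receives one informative number per tag rather than a separate parameter for every substring.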

Keywords

Biomedical Entity Recognition; CRF; HMM; Letter N-Grams; OOV; Tagging
