Conference article

Tagging a Morphologically Complex Language Using an Averaged Perceptron Tagger: The Case of Icelandic

Hrafn Loftsson
School of Computer Science, Reykjavik University, Iceland

Robert Östling
Department of Linguistics, Stockholm University, Sweden


Published in: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22-24, 2013, Oslo University, Norway. NEALT Proceedings Series 16

Linköping Electronic Conference Proceedings 85:13, p. 105-119

NEALT Proceedings Series 16:13, p. 105-119


Published: 2013-05-17

ISBN: 978-91-7519-589-6

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

In this paper, we experiment with using Stagger, an open-source implementation of an Averaged Perceptron tagger, to tag Icelandic, a morphologically complex language. By adding language-specific linguistic features and using IceMorphy, an unknown word guesser, we obtain state-of-the-art tagging accuracy of 92.82%. Furthermore, by adding data from a morphological database and word embeddings induced from an unannotated corpus, the accuracy increases to 93.84%. This is equivalent to an error reduction of 5.5% compared to the previously best tagger for Icelandic, which consists of linguistic rules and a Hidden Markov Model.
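The error-reduction figure is a relative measure: it compares error rates (100% minus accuracy), not accuracies. The sketch below illustrates that arithmetic with the numbers given in the abstract. The function names are hypothetical and introduced only for illustration. The baseline accuracy it derives is merely what the stated 5.5% and 93.84% imply, subject to rounding; it is not a figure quoted in the abstract.

```python
def relative_error_reduction(acc_old: float, acc_new: float) -> float:
    """Relative reduction in error rate, in percent.

    Both accuracies are given in percent; the error rate is 100 - accuracy.
    """
    err_old = 100.0 - acc_old
    err_new = 100.0 - acc_new
    return 100.0 * (err_old - err_new) / err_old


def implied_baseline_accuracy(acc_new: float, reduction: float) -> float:
    """Accuracy (percent) of the baseline implied by a relative error reduction.

    Solves (err_old - err_new) / err_old = reduction / 100 for err_old.
    """
    err_new = 100.0 - acc_new
    err_old = err_new / (1.0 - reduction / 100.0)
    return 100.0 - err_old


if __name__ == "__main__":
    # Gain from adding the morphological database and word embeddings
    # (92.82% -> 93.84%), expressed as a relative error reduction.
    print(round(relative_error_reduction(92.82, 93.84), 2))

    # Baseline accuracy implied by "5.5% error reduction" at 93.84%
    # (an inference from the abstract's numbers, not a reported value).
    print(round(implied_baseline_accuracy(93.84, 5.5), 2))
```

Read this way, the step from 92.82% to 93.84% removes roughly one seventh of the remaining errors, while the 5.5% figure refers to the comparison with the earlier rule-based and HMM tagger.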

Keywords

Averaged Perceptron; Part-of-Speech Tagging; Morphological Database; Linguistic Features; Word Embeddings

