Conference article

Tagging the Past: Experiments using the Saga Corpus

Hrafn Loftsson
School of Computer Science, Reykjavik University, Iceland

Published in: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22-24, 2013, Oslo University, Norway. NEALT Proceedings Series 16

Linköping Electronic Conference Proceedings 85:12, pp. 89-104

NEALT Proceedings Series 16:12, pp. 89-104

Published: 2013-05-17

ISBN: 978-91-7519-589-6

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

There is increasing interest in the NLP community in developing tools for annotating historical data, for example, to facilitate research in the field of corpus linguistics. In this work, we experiment with several PoS taggers using a sub-corpus of the Icelandic Saga Corpus. The work is carried out in three main steps. First, we evaluate taggers trained on Modern Icelandic when they are applied to Old Icelandic. Second, we semi-automatically correct errors in the training corpus using a bootstrapping method. Finally, we evaluate the taggers on the corrected training corpus. The best-performing single tagger is Stagger, a tagger based on the averaged perceptron algorithm, which obtains an accuracy of 91.76%. Combining the output of three taggers with a simple voting scheme increases the accuracy to 92.32%.
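The abstract does not spell out the voting scheme, so the following is only a minimal sketch of how per-token voting over three taggers could look, not the paper's implementation. The function names `vote` and `accuracy`, the toy tags, and the tie-breaking rule (when no tag wins outright, the earliest-listed tagger's tag is kept) are all assumptions made for illustration.

```python
from collections import Counter


def vote(tag_sequences):
    """Combine per-token tags from several taggers by plurality vote.

    tag_sequences: list of tag lists, one list per tagger, all of equal length.
    Ties are resolved in favour of the earliest-listed tagger (an assumption).
    """
    if len({len(seq) for seq in tag_sequences}) != 1:
        raise ValueError("all taggers must tag the same number of tokens")
    combined = []
    for candidates in zip(*tag_sequences):
        counts = Counter(candidates)
        best = max(counts.values())
        # Keep candidates in tagger order so the first tagger wins ties.
        winners = [tag for tag in candidates if counts[tag] == best]
        combined.append(winners[0])
    return combined


def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical outputs of three taggers for a three-token sentence.
tagger_a = ["N", "V", "ADJ"]
tagger_b = ["N", "ADV", "ADV"]
tagger_c = ["P", "ADV", "N"]
gold = ["N", "ADV", "N"]

combined = vote([tagger_a, tagger_b, tagger_c])
print(combined, accuracy(combined, gold))
```

Listing the most accurate single tagger first means that a three-way disagreement falls back to its tag, which is one common way to break ties in such schemes.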

Keywords

Historical Data; Icelandic Saga Corpus; Part-of-Speech Tagging
