Conference article

An SMT approach to automatic annotation of historical text

Eva Pettersson
Department of Linguistics and Philology, Uppsala University, Sweden and Swedish National Graduate School of Language Technology

Beáta Megyesi
Department of Linguistics and Philology, Uppsala University, Sweden

Jörg Tiedemann
Department of Linguistics and Philology, Uppsala University, Sweden

Published in: Proceedings of the workshop on computational historical linguistics at NODALIDA 2013, May 22-24, 2013, Oslo, Norway. NEALT Proceedings Series 18

Linköping Electronic Conference Proceedings 87:5, pp. 54-69

NEALT Proceedings Series 18:5, pp. 54-69

Published: 2013-05-17

ISBN: 978-91-7519-587-2

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

In this paper we propose an approach to tagging and parsing of historical text, using character-based SMT methods for translating the historical spelling to a modern spelling before applying the NLP tools. This way, existing modern taggers and parsers may be used to analyse historical text instead of training new tools specialised in historical language, which might be hard considering the lack of linguistically annotated historical corpora. We show that our approach to spelling normalisation is successful even with small amounts of training data, and that it is generalisable to several languages. For the two languages presented in this paper, the proportion of tokens with a spelling identical to the modern gold standard spelling increases from 64.8% to 83.9%, and from 64.6% to 92.3% respectively, which has a positive impact on subsequent tagging and parsing using modern tools.
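To make the two central ideas of the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that character-based translation treats each character as a token (so a word is written as a space-separated character sequence), and that the reported percentages are the share of tokens whose spelling matches the modern gold standard. The function names and the toy word lists are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def to_char_sequence(word: str) -> str:
    """Write a word as space-separated characters, so that each character
    plays the role of a 'word' for a phrase-based translation system."""
    return " ".join(word)


def from_char_sequence(chars: str) -> str:
    """Join a space-separated character sequence back into a word."""
    return "".join(chars.split())


def identical_spelling_ratio(tokens: Sequence[str], gold: Sequence[str]) -> float:
    """Proportion of tokens whose spelling equals the modern gold spelling."""
    if len(tokens) != len(gold):
        raise ValueError("tokens and gold must be aligned one-to-one")
    if not tokens:
        return 0.0
    matches = sum(1 for token, reference in zip(tokens, gold) if token == reference)
    return matches / len(tokens)


# Toy, made-up data for illustration only (not from the paper's corpora).
historical = ["thet", "är", "godt"]
modern_gold = ["det", "är", "gott"]

char_level_input = [to_char_sequence(w) for w in historical]
baseline = identical_spelling_ratio(historical, modern_gold)
```

Under these assumptions, the same ratio would be computed once on the unnormalised historical tokens (the baseline) and once on the output of the normalisation step, which is how a before-and-after comparison such as the one in the abstract could be obtained.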

Keywords

Digital Humanities; Natural Language Processing; Historical Text; Normalisation; Underresourced Languages; Less-Resource Languages; SMT
