Conference article

Normalisation of Historical Text Using Context-Sensitive Weighted Levenshtein Distance and Compound Splitting

Eva Pettersson
Department of Linguistics and Philology, Uppsala University, Sweden and Swedish National Graduate School of Language Technology

Beàta Megyesi
Department of Linguistics and Philology, Uppsala University, Sweden

Joakim Nivre
Department of Linguistics and Philology, Uppsala University, Sweden


Part of: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22-24, 2013, Oslo University, Norway. NEALT Proceedings Series 16

Linköping Electronic Conference Proceedings 85:17, pp. 163-179

NEALT Proceedings Series 16:17, pp. 163-179


Published: 2013-05-17

ISBN: 978-91-7519-589-6

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

Natural language processing for historical text poses a variety of challenges, such as dealing with a high degree of spelling variation. Furthermore, there is often not enough linguistically annotated data available for training part-of-speech taggers and other tools aimed at handling this specific kind of text. In this paper we present a Levenshtein-based approach to normalising historical text to a modern spelling. This enables us to apply standard NLP tools trained on contemporary corpora to the normalised version of the historical input text. In its basic version, no annotated historical data is needed, since the only data used for the Levenshtein comparisons is a contemporary dictionary or corpus. In addition, a (small) corpus of manually normalised historical text can optionally be included to learn normalisations for frequent words and weights for edit operations in a supervised fashion, which improves precision. We show that this method is successful both in terms of normalisation accuracy and in terms of the performance of a standard modern tagger applied to the historical text. We also compare our method to a previously implemented approach using a set of hand-written normalisation rules, and we see that the Levenshtein-based approach clearly outperforms the hand-crafted rules. The experiments were carried out on Swedish data with promising results, and we believe that our method could be successfully applied to the analysis of historical text in other languages, including those with fewer resources.
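The basic idea in the abstract, comparing each historical word form against a contemporary lexicon and choosing the closest entry, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `weights` table format, and the `max_distance` threshold are assumptions made for illustration, the weights here are context-free rather than context-sensitive, and compound splitting is not covered.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical weight table: (source_char, target_char) -> cost,
# with "" standing for the empty side of an insertion or deletion.
Weights = Dict[Tuple[str, str], float]


def edit_distance(source: str, target: str,
                  weights: Optional[Weights] = None) -> float:
    """Levenshtein distance; each edit costs 1.0 unless `weights` overrides it."""
    weights = weights or {}
    # Row for the empty source prefix: insert every target character.
    previous = [0.0]
    for t in target:
        previous.append(previous[-1] + weights.get(("", t), 1.0))
    for s in source:
        # First column: delete every source character seen so far.
        current = [previous[0] + weights.get((s, ""), 1.0)]
        for j, t in enumerate(target, start=1):
            substitution = 0.0 if s == t else weights.get((s, t), 1.0)
            current.append(min(
                previous[j] + weights.get((s, ""), 1.0),     # delete s
                current[j - 1] + weights.get(("", t), 1.0),  # insert t
                previous[j - 1] + substitution,              # match or substitute
            ))
        previous = current
    return previous[-1]


def normalise(word: str, lexicon: Iterable[str],
              weights: Optional[Weights] = None,
              max_distance: float = 1.0) -> List[str]:
    """Return the closest modern candidates, or the word itself if none is close.

    `max_distance` is an assumed cut-off, not a value taken from the paper.
    """
    candidates = list(lexicon)
    if word in candidates:
        return [word]
    scored = [(edit_distance(word, c, weights), c) for c in candidates]
    best = min((d for d, _ in scored), default=None)
    if best is None or best > max_distance:
        return [word]
    return sorted(c for d, c in scored if d == best)


if __name__ == "__main__":
    modern = ["skriva", "skrika", "vad"]  # toy lexicon for illustration only
    print(normalise("skrifva", modern))   # ['skriva']
    print(normalise("hvad", modern))      # ['vad']
```

With uniform costs, every insertion, deletion and substitution counts as 1.0. Passing a table such as `{("h", ""): 0.2}` makes deleting `h` cheaper, which is one simple way to mimic weights learned from manually normalised text.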

Keywords

Digital Humanities; Natural Language Processing; Historical Text; Normalisation; Levenshtein Edit Distance; Compound Splitting; Part-of-Speech Tagging; Underresourced Languages; Less-Resource Languages
