Normalisation of Historical Text Using Context-Sensitive Weighted Levenshtein Distance and Compound Splitting

Eva Pettersson
Department of Linguistics and Philology, Uppsala University, Sweden and Swedish National Graduate School of Language Technology

Beàta Megyesi
Department of Linguistics and Philology, Uppsala University, Sweden

Joakim Nivre
Department of Linguistics and Philology, Uppsala University, Sweden

Published in: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22-24, 2013, Oslo University, Norway. NEALT Proceedings Series 16

Linköping Electronic Conference Proceedings 85:17, pp. 163-179

NEALT Proceedings Series 16:17, pp. 163-179

Published: 2013-05-17

ISBN: 978-91-7519-589-6

ISSN: 1650-3686 (print), 1650-3740 (online)


Natural language processing for historical text poses a variety of challenges, such as dealing with a high degree of spelling variation. Furthermore, there is often not enough linguistically annotated data available for training part-of-speech taggers and other tools aimed at handling this specific kind of text. In this paper we present a Levenshtein-based approach to normalisation of historical text to a modern spelling. This enables us to apply standard NLP tools trained on contemporary corpora to the normalised version of the historical input text. In its basic version, no annotated historical data is needed, since the only data used for the Levenshtein comparisons is a contemporary dictionary or corpus. In addition, a (small) corpus of manually normalised historical text can optionally be included to learn normalisations for frequent words and weights for edit operations in a supervised fashion, which improves precision. We show that this method is successful both in terms of normalisation accuracy and in terms of the performance of a standard modern tagger applied to the historical text. We also compare our method to a previously implemented approach using a set of hand-written normalisation rules, and we see that the Levenshtein-based approach clearly outperforms the hand-crafted rules. The experiments were carried out on Swedish data with promising results, and we believe that our method could be successfully applied to the analysis of historical text in other languages, including those with fewer resources.
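
The basic idea of Levenshtein-based normalisation against a modern lexicon can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the toy lexicon, the example weight, and the distance threshold are all assumptions made for illustration. The sketch uses only per-character substitution weights, so it covers neither the context-sensitive weighting nor the compound splitting named in the title.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical weight table type: (historical char, modern char) -> cost.
SubCosts = Dict[Tuple[str, str], float]


def weighted_levenshtein(source: str, target: str,
                         sub_costs: Optional[SubCosts] = None) -> float:
    """Edit distance where listed substitutions may cost less than 1.0.

    Insertions and deletions always cost 1.0 in this simplified sketch.
    """
    sub_costs = sub_costs or {}
    # prev[j] holds the distance between the processed prefix of `source`
    # and the first j characters of `target`.
    prev = [float(j) for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        curr = [float(i)]
        for j, t_char in enumerate(target, start=1):
            if s_char == t_char:
                sub = 0.0
            else:
                sub = sub_costs.get((s_char, t_char), 1.0)
            curr.append(min(prev[j] + 1.0,        # delete s_char
                            curr[j - 1] + 1.0,    # insert t_char
                            prev[j - 1] + sub))   # match or substitute
        prev = curr
    return prev[-1]


def normalise(word: str, lexicon: Iterable[str],
              sub_costs: Optional[SubCosts] = None,
              max_distance: float = 1.0) -> str:
    """Return the closest lexicon entry within max_distance, else the word."""
    entries = sorted(set(lexicon))  # sorted so that ties resolve deterministically
    if word in entries:
        return word
    best, best_dist = word, max_distance
    found = False
    for entry in entries:
        dist = weighted_levenshtein(word, entry, sub_costs)
        if dist < best_dist or (dist == best_dist and not found):
            best, best_dist, found = entry, dist, True
    return best


# Toy modern lexicon and a single illustrative weight (both assumptions).
LEXICON = {"vad", "kvinna", "skriva"}
WEIGHTS = {("q", "k"): 0.2}
```

With these toy settings, `normalise("hvad", LEXICON)` maps the older spelling to `vad` through one deletion, and `normalise("qvinna", LEXICON, WEIGHTS)` reaches `kvinna` through a cheap `q` to `k` substitution. A word with no lexicon entry within the threshold is returned unchanged.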


Digital Humanities; Natural Language Processing; Historical Text; Normalisation; Levenshtein Edit Distance; Compound Splitting; Part-of-Speech Tagging; Underresourced Languages; Less-Resourced Languages


