Eva Pettersson
Department of Linguistics and Philology, Uppsala University, Sweden and Swedish National Graduate School of Language Technology
Beáta Megyesi
Department of Linguistics and Philology, Uppsala University, Sweden
Jörg Tiedemann
Department of Linguistics and Philology, Uppsala University, Sweden
Published in: Proceedings of the workshop on computational historical linguistics at NODALIDA 2013, May 22-24, 2013, Oslo, Norway. NEALT Proceedings Series 18
Linköping Electronic Conference Proceedings 87:5, p. 54-69
NEALT Proceedings Series 18:5, p. 54-69
Published: 2013-05-17
ISBN: 978-91-7519-587-2
ISSN: 1650-3686 (print), 1650-3740 (online)
In this paper we propose an approach to tagging and parsing of historical text, using character-based statistical machine translation (SMT) methods for translating the historical spelling to a modern spelling before applying the NLP tools. This way, existing modern taggers and parsers may be used to analyse historical text instead of training new tools specialised in historical language, which might be hard considering the lack of linguistically annotated historical corpora. We show that our approach to spelling normalisation is successful even with small amounts of training data, and that it is generalisable to several languages. For the two languages presented in this paper, the proportion of tokens with a spelling identical to the modern gold standard spelling increases from 64.8% to 83.9%, and from 64.6% to 92.3% respectively, which has a positive impact on subsequent tagging and parsing using modern tools.
Digital Humanities; Natural Language Processing; Historical Text; Normalisation; Underresourced Languages; Less-Resource Languages; SMT
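The abstract mentions two ideas that a short sketch may make more concrete: treating words as character sequences so that an SMT system can translate spelling at the character level, and measuring success as the proportion of tokens whose spelling matches the modern gold standard. The Python below is only an illustration under stated assumptions, not the authors' implementation. The function names and the toy tokens are hypothetical, and the space-separated character format is an assumption about how input for a character-level system could be prepared.

```python
from typing import Sequence


def to_char_sequence(token: str) -> str:
    """Split a word into space-separated characters.

    Assumption: a character-level SMT system sees each character
    as one "word", so a token becomes one short "sentence".
    """
    return " ".join(token)


def from_char_sequence(chars: str) -> str:
    """Join space-separated characters back into a single word."""
    return chars.replace(" ", "")


def spelling_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Proportion of tokens whose spelling is identical to the gold spelling."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must contain the same number of tokens")
    if not gold:
        return 0.0
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)


if __name__ == "__main__":
    # Toy tokens invented for illustration; they are not data from the paper.
    historical = ["hafwer", "och", "warit"]
    normalised = ["haver", "och", "varit"]  # hypothetical system output
    gold = ["har", "och", "varit"]          # hypothetical modern gold standard

    print(to_char_sequence(historical[0]))
    print(f"before normalisation: {spelling_accuracy(historical, gold):.3f}")
    print(f"after normalisation:  {spelling_accuracy(normalised, gold):.3f}")
```

Computing the same measure on the unnormalised tokens and on the system output gives the kind of before/after comparison reported in the abstract.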