Conference article

An SMT approach to automatic annotation of historical text

Eva Pettersson
Department of Linguistics and Philology, Uppsala University, Sweden and Swedish National Graduate School of Language Technology

Beáta Megyesi
Department of Linguistics and Philology, Uppsala University, Sweden

Jörg Tiedemann
Department of Linguistics and Philology, Uppsala University, Sweden

Published in: Proceedings of the workshop on computational historical linguistics at NODALIDA 2013, May 22-24, 2013, Oslo, Norway. NEALT Proceedings Series 18

Linköping Electronic Conference Proceedings 87:5, p. 54-69

NEALT Proceedings Series 18:5, p. 54-69

Published: 2013-05-17

ISBN: 978-91-7519-587-2

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

In this paper we propose an approach to tagging and parsing of historical text, using character-based SMT methods for translating the historical spelling to a modern spelling before applying the NLP tools. This way, existing modern taggers and parsers may be used to analyse historical text instead of training new tools specialised in historical language, which might be hard considering the lack of linguistically annotated historical corpora. We show that our approach to spelling normalisation is successful even with small amounts of training data, and that it is generalisable to several languages. For the two languages presented in this paper, the proportion of tokens with a spelling identical to the modern gold standard spelling increases from 64.8% to 83.9%, and from 64.6% to 92.3% respectively, which has a positive impact on subsequent tagging and parsing using modern tools.
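
As a rough illustration of two ideas mentioned in the abstract, the Python sketch below shows (a) how a word can be rewritten as a space-separated character sequence so that a translation system treats each character as a token, and (b) how the proportion of tokens whose spelling matches a modern gold standard can be computed. The function names and the toy word pairs are invented for illustration and are not taken from the paper; the SMT system itself is not shown.

```python
from typing import Sequence


def to_char_sequence(word: str) -> str:
    """Rewrite a word as space-separated characters (hypothetical helper),
    so that a translation model sees each character as one token."""
    return " ".join(word)


def from_char_sequence(seq: str) -> str:
    """Inverse of to_char_sequence: glue the characters back into a word."""
    return "".join(seq.split(" "))


def normalisation_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Proportion of tokens whose spelling is identical to the gold spelling."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same number of tokens")
    if not gold:
        return 0.0
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)


# Toy data, invented for this sketch (not from the paper).
historical = ["vppon", "the", "hous"]
modern_gold = ["upon", "the", "house"]

# Baseline: how many historical spellings already match the gold standard?
baseline = normalisation_accuracy(historical, modern_gold)

# Character-level input as it might be passed to a translation system.
char_level_input = [to_char_sequence(w) for w in historical]
```

In this sketch, `baseline` plays the role of the "before normalisation" figure; the same function applied to the normalised output would give the "after" figure.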

Keywords

Digital Humanities; Natural Language Processing; Historical Text; Normalisation; Underresourced Languages; Less-Resource Languages; SMT
