Conference article

Comparing Rule-based and SMT-based Spelling Normalisation for English Historical Texts

Gerold Schneider
Institute of Computational Linguistics and Department of English, University of Zurich, Switzerland

Eva Pettersson
Department of Linguistics and Philology, Uppsala University, Sweden

Michael Percillier
Department of English, University of Mannheim, Germany


Published in: Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language

Linköping Electronic Conference Proceedings 133:8, p. 40-46

NEALT Proceedings Series 32:8, p. 40-46


Published: 2017-05-10

ISBN: 978-91-7685-503-4

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

To be able to use existing natural language processing tools for analysing historical text, an important preprocessing step is spelling normalisation: converting the original spelling to present-day spelling before applying tools such as taggers and parsers. In this paper, we compare a probabilistic, language-independent approach to spelling normalisation, based on statistical machine translation (SMT) techniques, with a rule-based system combining dictionary lookup with rules and non-probabilistic weights. The rule-based system reaches the best accuracy, up to 94% precision at 74% recall, while the SMT system yields an improvement for each tested period.


