Normalizing Medieval German Texts: from rules to deep learning

Natalia Korchagina
Institute of Computational Linguistics, University of Zurich, Switzerland


In: Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language

Linköping Electronic Conference Proceedings 133:4, pp. 12-17

NEALT Proceedings Series 32:4, pp. 12-17


Published: 2017-05-10

ISBN: 978-91-7685-503-4

ISSN: 1650-3686 (print), 1650-3740 (online)


The application of NLP tools to historical texts is complicated by a high level of spelling variation, and different methods of historical text normalization have been proposed. In this comparative evaluation I test the following three approaches to text canonicalization on historical German texts from the 15th–16th centuries: rule-based normalization, statistical machine translation, and neural machine translation. Character-based neural machine translation, which had not previously been tested for the normalization task, showed the best results.
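The rule-based baseline mentioned above maps historical spellings to modern forms through ordered rewrite rules. The sketch below illustrates the general idea with a few toy rules for Early Modern German; the patterns are illustrative assumptions, not the rule set evaluated in the paper.

```python
# Toy sketch of rule-based spelling normalization for Early Modern German.
# Each rule is a (regex, replacement) pair applied in order to a token.
# The rules here are illustrative examples only.
import re

RULES = [
    (r"^v(?=[nm])", "u"),  # word-initial v before a nasal: "vnd" -> "und"
    (r"th", "t"),          # "thun" -> "tun"
    (r"y", "i"),           # "seyn" -> "sein"
]

def normalize(token: str) -> str:
    """Apply all rewrite rules to a single historical token."""
    for pattern, replacement in RULES:
        token = re.sub(pattern, replacement, token)
    return token

print(normalize("vnd"))   # -> und
print(normalize("seyn"))  # -> sein
```

Such hand-written rules are precise but brittle, which is what motivates the data-driven SMT and character-based NMT alternatives compared in the paper.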


No keywords are available


