OCR and post-correction of historical Finnish texts

Senka Drobac
Department of Modern Languages, University of Helsinki, Finland

Pekka Kauppinen
Department of Modern Languages, University of Helsinki, Finland

Krister Lindén
Department of Modern Languages, University of Helsinki, Finland

Published in: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden

Linköping Electronic Conference Proceedings 131:9, pp. 70-76

NEALT Proceedings Series 29:9, pp. 70-76

Published: 2017-05-08

ISBN: 978-91-7685-601-7

ISSN: 1650-3686 (print), 1650-3740 (online)


This paper presents experiments on optical character recognition (OCR) that combine the Ocropy software with data-driven spelling correction based on weighted finite-state methods. Both model training and testing were done on Finnish corpora of historical newspaper text, and the best combination of OCR and post-processing models gives 95.21% character recognition accuracy.
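
To make the reported figure concrete, the following is a minimal sketch of one common way to compute character recognition accuracy: one minus the Levenshtein edit distance between the reference transcription and the OCR output, divided by the reference length. This definition is an assumption for illustration only, since the abstract does not state the exact formula used, and the function names `edit_distance` and `character_accuracy` are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (unit cost per edit)."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(ref: str, hyp: str) -> float:
    """Character accuracy in percent, relative to the reference length.

    Assumed definition (not taken from the paper):
    100 * (1 - edit_distance / len(ref)).
    """
    if not ref:
        raise ValueError("reference text must not be empty")
    return 100.0 * (1.0 - edit_distance(ref, hyp) / len(ref))
```

Under this assumed definition, a line whose OCR output differs from a four-character reference by one substituted character scores 75%, and an identical line scores 100%.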


