Conference article

OCR and post-correction of historical Finnish texts

Senka Drobac
Department of Modern Languages, University of Helsinki, Finland

Pekka Kauppinen
Department of Modern Languages, University of Helsinki, Finland

Krister Lindén
Department of Modern Languages, University of Helsinki, Finland

Published in: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden

Linköping Electronic Conference Proceedings 131:9, pp. 70-76

NEALT Proceedings Series 29:9, pp. 70-76

Published: 2017-05-08

ISBN: 978-91-7685-601-7

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

This paper presents experiments on optical character recognition (OCR) that combine the Ocropy software with data-driven spelling correction based on weighted finite-state methods. Both model training and testing were done on Finnish corpora of historical newspaper text, and the best combination of OCR and post-processing models gives 95.21% character recognition accuracy.
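
To make the reported figure concrete, here is a minimal sketch of one common way to compute character recognition accuracy: one minus the Levenshtein edit distance between the OCR output and the ground truth, divided by the ground-truth length. The abstract does not state the exact definition the authors used, so this formula is an assumption, and the function names `levenshtein` and `character_accuracy` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    # Keep the shorter string in the inner loop to save memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed definition: 1 - edit_distance / len(reference)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 1.0 - levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Illustrative strings only, not data from the paper.
    print(character_accuracy("suomalainen", "suomalaiuen"))
```

Under this definition a single substituted character in an 11-character word yields an accuracy of 10/11. A post-correction step counts as an improvement when it raises this score on held-out text.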
