Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing

Koistinen, Mika; Kettunen, Kimmo; Pääkkönen, Tuula

Conference article

Mika Koistinen
National Library of Finland, The Centre for Preservation and Digitisation, Finland

Kimmo Kettunen
National Library of Finland, The Centre for Preservation and Digitisation, Finland

Tuula Pääkkönen
National Library of Finland, The Centre for Preservation and Digitisation, Finland

Published in: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden

Linköping Electronic Conference Proceedings 131:38, pp. 277-283

NEALT Proceedings Series 29:38, pp. 277-283

Published: 2017-05-08

ISBN: 978-91-7685-601-7

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

In this paper we describe a method for improving the results of the optical character recognition (OCR) toolkit Tesseract on Finnish historical documents. First, we create a model for Finnish Fraktur fonts. Second, we test Tesseract with the created Fraktur model and an Antiqua model on single images and on combinations of images produced with different image preprocessing methods. Compared with the commercial ABBYY FineReader toolkit, our method achieves a 27.48% (FineReader 7 or 8) and a 9.16% (FineReader 11) improvement at the word level.

Keywords: Optical Character Recognition, OCR Quality, Digital Image Processing, Binarization, Noise Removal, Tesseract, Finnish, Historical Documents
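
Binarization is one of the preprocessing families named in the keywords. As an illustration only, the sketch below implements Otsu's global threshold (Otsu 1979, listed in the references) in plain Python. It is not taken from the paper and does not claim to be the preprocessing configuration the authors used. The function names `otsu_threshold` and `binarize`, and the representation of an image as a flat list of 8-bit gray values, are assumptions made for this example.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes between-class variance.

    Pixels with value <= t form the dark class, the rest the bright class.
    `pixels` is a flat list of integers in the range [0, levels).
    """
    # Build the gray-level histogram.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # sum of gray values in the dark class so far
    for t in range(levels):
        w0 += hist[t]
        if w0 == 0:
            continue          # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break             # bright class empty, nothing left to split
        sum0 += t * hist[t]
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        score = w0 * w1 * (mean0 - mean1) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 0 (dark) or 255 (bright) using the threshold."""
    return [0 if p <= threshold else 255 for p in pixels]


if __name__ == "__main__":
    # Hypothetical toy "image": three dark ink pixels, three bright paper pixels.
    sample = [20, 25, 30, 200, 210, 220]
    t = otsu_threshold(sample)
    print(t, binarize(sample, t))
```

A single global threshold like this tends to struggle with uneven illumination and stained paper, which is why locally adaptive methods such as those of Niblack and of Sauvola and Pietikäinen also appear in the reference list.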

References

R. C. Carrasco. 2014. An open-source OCR evaluation tool. In DATeCH ’14 Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, pages 179–184.

M. Droettboom. 2003. Correcting broken characters in the recognition of historical documents. In JCDL 03 Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries, pages 364–366.

A. El Harraj and N. Raissouni. 2015. OCR accuracy improvement on document images through a novel preprocessing approach. In Signal & Image Processing: An International Journal (SIPIJ), volume 6, pages 114–133.

J. Evershed and K. Fitch. 2014. Correcting Noisy OCR: Context beats Confusion. In DATeCH ’14 Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, pages 45–51.

G. Ganchimeg. 2015. History document image background noise and removal methods. In International Journal of Knowledge Content Development & Technology, volume 5, pages 11–24.

R. C. Gonzalez and R. E. Woods. 2002. Digital Image Processing. Prentice-Hall.

M. Helinski, M. Kmieciak, and T. Parkola. 2012. Report on the comparison of Tesseract and ABBYY FineReader OCR engines. Technical report, Poznan Supercomputing and Networking Center, Poland.

N. Howe. 2013. Document Binarization with Automatic Parameter Tuning. In International Journal on Document Analysis and Recognition, volume 16, pages 247–258.

A. Järvelin, H. Keskustalo, E. Sormunen, M. Saastamoinen, and K. Kettunen. 2015. Information retrieval from historical newspaper collections in highly inflectional languages: A query expansion approach. In Journal of the Association for Information Science and Technology, volume 67.

K. Kettunen, T. Pääkkönen, and M. Koistinen. 2016. Between diachrony and synchrony: evaluation of lexical quality of a digitized historical Finnish newspaper collection with morphological analyzers. In Baltic HLT 2016, volume 289, pages 122–129.

R. Krutsch and D. Tenorio. 2011. Histogram Equalization, Application Note. Technical report.

D. Lopresti. 2009. Optical character recognition errors and their effects on natural language processing. In International Journal on Document Analysis and Recognition, volume 12, pages 141–151.

N. Makkar and S. Singh. 2012. A Brief Tour to Various Skew Detection and Correction Techniques. In International Journal for Science and Emerging Technologies with Latest Trend, volume 4, pages 54–58.

W. Niblack. 1986. An Introduction to Image Processing, volume SMC-9. Prentice-Hall, Englewood Cliffs, NJ.

K. Ntirogiannis, B. Gatos, and I. Pratikakis. 2014. ICFHR2014 Competition on Handwritten Document Image Binarization (H-DIBCO 2014). In 2014 14th International Conference on Frontiers in Handwriting Recognition, pages 809–813.

N. Otsu. 1979. A Threshold Selection Method from Gray-Level Histograms. In IEEE Transactions on Systems, Man and Cybernetics, volume SMC-9, pages 62–66.

S. Parashar and S. Sogi. 2012. Finding skewness and deskewing scanned document. 3(4):1619–1924.

S. M. Pizer, R. E. Johnston, J. P. Ericksen, B. C. Yankaskas, and K. E. Muller. 1990. Contrast Limited Histogram Equalization Speed and Effectiveness.

I. Pratikakis, B. Gatos, and K. Ntirogiannis. 2013. ICDAR 2013 Document Image Binarization Contest (DIBCO 2013). In 2013 12th International Conference on Document Analysis and Recognition, pages 1471–1476.

S. V. Rice and T. A. Nartker. 1996. The ISRI Analytic Tools for OCR Evaluation Version 5.1. Technical report, Information Science Research Institute (ISRI).

J. Sauvola and M. Pietikäinen. 1999. Adaptive Document Image Binarization. In The Journal of the Pattern Recognition Society, volume 33, pages 225–236.

M. Sezgin and B. Sankur. 2004. Survey over image thresholding techniques and quantitative performance evaluation.

R. Smith. 1995. A Simple and Efficient Skew Detection Algorithm via Text Row Algorithm. In Proceedings 3rd ICDAR’95, IEEE (1995), pages 1145–1148.

R. Smith. 2007. An Overview of the Tesseract OCR Engine. In Proc. Ninth Int. Conference on Document Analysis and Recognition (ICDAR), IEEE, pages 629–633.

M. L. Smitha, P. J. Antony, and D. N. Sachin. 2016. Document Image Analysis Using Imagemagick and Tesseract-ocr. In International Advanced Research Journal in Science, Engineering and Technology (IARJSET), volume 3, pages 108–112.

T. Stanhope. 2016. Applications of Low-Cost Computer Vision for Agricultural Implement Feedback and Control.

O. Tange. 2011. GNU Parallel - The Command-Line Power Tool. In The USENIX Magazine, pages 42–47.

S. Tanner, T. Muñoz, and P. Hemy Ros. 2009. Measuring Mass Text Digitization Quality and Usefulness. Lessons Learned from Assessing the OCR Accuracy of the British Library’s 19th Century Online Newspaper Archive. 15(7/8).

C. Wolf, J. Jolion, and F. Chassaing. 2002. Text Localization, Enhancement and Binarization in Multimedia Documents. In Proceedings of the International Conference on Pattern Recognition (ICPR), volume 4, pages 1037–1040. Quebec City, Canada.
