Konferensartikel

A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora

Aleksi Vesanto
Turku NLP Group, Department of FT

Asko Nivala
Cultural History, Finland / Turku Institute for Advanced Studies, University of Turku, Finland

Tapio Salakoski
Turku NLP Group, Department of FT

Hannu Salmi
Cultural History, Finland

Filip Ginter
Turku NLP Group, Department of FT

Ladda ner artikel

Ingår i: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden

Linköping Electronic Conference Proceedings 131:49, s. 330-333

NEALT Proceedings Series 29:49, s. 330-333

Visa mer +

Publicerad: 2017-05-08

ISBN: 978-91-7685-601-7

ISSN: 1650-3686 (tryckt), 1650-3740 (online)

Abstract

We present a software for retrieving and exploring duplicated text passages in low quality OCR historical text corpora. The system combines NCBI BLAST, a software created for comparing and aligning biological sequences, with the Solr search and indexing engine, providing a web interface to easily query and browse the clusters of duplicated texts. We demonstrate the system on a corpus of scanned and OCR-recognized Finnish newspapers and journals from years 1771 to 1910.

Nyckelord

Inga nyckelord är tillgängliga

Referenser

Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. 1990. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, Oct. Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008. Kimmo Kettunen, Tuula Pääkkönen, and Mika Koistinen. 2016. Between diachrony and synchrony: Evaluation of lexical quality of a digitized historical finnish newspaper and journal collection with morphological analyzers. In Baltic HLT. David A. Smith, Ryan Cordell, Elizabeth Maddock Dillon, Nick Stramp, and John Wilkerson. 2014. Detecting and modeling local text reuse. In Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’14, pages 183–192, Piscataway, NJ, USA. IEEE Press.

Citeringar i Crossref