Conference article
A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora
Aleksi Vesanto
Turku NLP Group, Department of FT
Asko Nivala
Cultural History, Finland / Turku Institute for Advanced Studies, University of Turku, Finland
Tapio Salakoski
Turku NLP Group, Department of FT
Hannu Salmi
Cultural History, Finland
Filip Ginter
Turku NLP Group, Department of FT
Published in: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden
Linköping Electronic Conference Proceedings 131:49, p. 330-333
NEALT Proceedings Series 29:49, p. 330-333
Published: 2017-05-08
ISBN: 978-91-7685-601-7
ISSN: 1650-3686 (print), 1650-3740 (online)
Abstract
We present software for retrieving and exploring duplicated text passages in low-quality OCR historical text corpora. The system combines NCBI BLAST, a tool created for comparing and aligning biological sequences, with the Solr search and indexing engine, and provides a web interface for querying and browsing the clusters of duplicated texts. We demonstrate the system on a corpus of scanned and OCR-recognized Finnish newspapers and journals from the years 1771 to 1910.
References
Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. 1990. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, Oct.
Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008.
Kimmo Kettunen, Tuula Pääkkönen, and Mika Koistinen. 2016. Between diachrony and synchrony: Evaluation of lexical quality of a digitized historical Finnish newspaper and journal collection with morphological analyzers. In Baltic HLT.
David A. Smith, Ryan Cordell, Elizabeth Maddock Dillon, Nick Stramp, and John Wilkerson. 2014. Detecting and modeling local text reuse. In Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’14, pages 183–192, Piscataway, NJ, USA. IEEE Press.