Conference article

Applying BLAST to Text Reuse Detection in Finnish Newspapers and Journals, 1771-1910

Aleksi Vesanto
Turku NLP Group, Department of FT, University of Turku, Finland

Asko Nivala
Cultural History and Turku Institute for Advanced Studies, University of Turku, Finland

Heli Rantala
Cultural History, University of Turku, Finland

Tapio Salakoski
Turku NLP Group, Department of FT, University of Turku, Finland

Hannu Salmi
Cultural History, University of Turku, Finland

Filip Ginter
Turku NLP Group, Department of FT, University of Turku, Finland


Part of: Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language

Linköping Electronic Conference Proceedings 133:10, pp. 54-58

NEALT Proceedings Series 32:10, pp. 54-58


Published: 2017-05-10

ISBN: 978-91-7685-503-4

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

We present the results of text reuse detection on a corpus of scanned and OCR-processed Finnish newspapers and journals from 1771 to 1910. Our study draws on BLAST, software originally created for comparing and aligning biological sequences. We show different types of text reuse in this corpus and compare our results with Passim, text reuse detection software developed at Northeastern University in Boston.
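To make the idea of sequence alignment for text reuse concrete, the following is a minimal sketch of character-level local alignment (the classic Smith-Waterman dynamic program). It is not BLAST itself, which uses a faster seed-and-extend heuristic, and it is not the authors' pipeline; the function name `local_alignment` and the scoring values are illustrative assumptions. It shows why alignment tolerates OCR noise: a single misread character lowers the score slightly but does not break the match.

```python
def local_alignment(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -1):
    """Smith-Waterman local alignment (illustrative scoring values).

    Returns (best_score, aligned_span_of_a, aligned_span_of_b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix; a cell never drops below zero, which is
    # what makes the alignment local rather than global.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    end_i, end_j = i, j
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, a[i:end_i], b[j:end_j]


if __name__ == "__main__":
    # Hypothetical example: the same phrase with one OCR misreading (g -> q).
    s, span_a, span_b = local_alignment("xx the king arrived yy",
                                        "zz the kinq arrived ww")
    print(s, repr(span_a), repr(span_b))
```

This quadratic-time version is only practical for short passages; on a corpus of this size, a heuristic search such as BLAST's is what makes all-against-all comparison feasible.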

Keywords

No keywords are available

References

Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. 1990. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, Oct.

Ryan Cordell. 2015. Reprinting, Circulation, and the Network Author in Antebellum Newspapers. American Literary History, 27(3):417–445.

Kimmo Kettunen. 2016. Keep, change or delete? Setting up a low resource OCR post-correction framework for a digitized old Finnish newspaper collection. In D. Calvanese, D. De Nart, and C. Tasso, editors, Digital Libraries on the Move. IRCDL 2015. Communications in Computer and Information Science, volume 612. Springer, Cham.

Tuula Pääkkönen, Jukka Kervinen, Asko Nivala, Kimmo Kettunen, and Eetu Mäkelä. 2016. Exporting Finnish Digitized Historical Newspaper Contents for Offline Use. D-Lib Magazine, 22(7).

David A. Smith, Ryan Cordell, Elizabeth Maddock Dillon, Nick Stramp, and John Wilkerson. 2014. Detecting and modeling local text reuse. In Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’14, pages 183–192, Piscataway, NJ, USA. IEEE Press.

David A. Smith, Ryan Cordell, and Abby Mullen. 2015. Computational Methods for Uncovering Reprinted Texts in Antebellum Newspapers. American Literary History, 27(3):E1–E15.

Aleksi Vesanto, Asko Nivala, Tapio Salakoski, Hannu Salmi, and Filip Ginter. 2017. A system for identifying and exploring text repetition in large historical document corpora. In Proceedings of NoDaLiDa 2017.
