Maria Moritz
Institute of Computer Science, University of Goettingen, Germany
Marco Büchler
Institute of Computer Science, University of Goettingen, Germany
In: Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language
Linköping Electronic Conference Proceedings 133:5, pp. 18-23
NEALT Proceedings Series 32:5, pp. 18-23
Published: 2017-05-10
ISBN: 978-91-7685-503-4
ISSN: 1650-3686 (print), 1650-3740 (online)
Text reuse is a common way of transmitting historical texts. It refers to the repetition of text in a new context and ranges from near-verbatim (literal) and paraphrasal reuse to completely non-literal reuse (e.g., allusions or translations). To improve the detection of reuse in historical texts, we need to better understand its characteristics. In this work, we investigate the relationship between paraphrasal reuse and word senses. Specifically, we examine the conjecture that words with ambiguous senses are less prone to replacement in paraphrasal text reuse. Our corpus comprises three historical English Bibles, one of which has previously been annotated with word senses. We perform automated word-sense disambiguation based on supervised learning. By investigating this conjecture, we aim to understand whether unambiguous words are the ones preferentially replaced when text is reused, and whether they could consequently serve as a discriminating feature for reuse detection.