Adam Ek
Department of Linguistics, Stockholm University, Sweden
Sofia Knuutinen
Department of Linguistics, Stockholm University, Sweden
Published in: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden
Linköping Electronic Conference Proceedings 131:36, p. 266-270
NEALT Proceedings Series 29:36, p. 266-270
Published: 2017-05-08
ISBN: 978-91-7685-601-7
ISSN: 1650-3686 (print), 1650-3740 (online)
This article explores the application of text normalization methods based on Levenshtein distance and Statistical Machine Translation to literary text, specifically the collected works of August Strindberg. The goal is to normalize archaic spellings to modern-day spelling. The study finds evidence that text normalization succeeds on this material, and discusses problems with, and improvements to, the process of analysing mid-19th to early 20th century Swedish texts. This article is part of an ongoing project at Stockholm University which aims to create a corpus and web-friendly texts from Strindberg's collected works.
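As a rough illustration of the Levenshtein-based approach mentioned in the abstract, the sketch below maps an archaic spelling to its nearest entry in a modern word list. It is a minimal example under stated assumptions, not the method used in the article: it uses plain unweighted edit distance, and the function names, the `max_distance` threshold, and the four-word toy lexicon are all hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain (unweighted) edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(word: str, lexicon: set, max_distance: int = 1) -> str:
    """Return the closest modern form, or the word itself if none is close enough."""
    if word in lexicon:
        return word
    # Ties are broken alphabetically so the result is deterministic.
    best = min(lexicon, key=lambda w: (levenshtein(word, w), w))
    return best if levenshtein(word, best) <= max_distance else word


# Hypothetical toy lexicon of modern Swedish forms (assumption, for illustration only).
MODERN = {"vad", "kvinna", "av", "och"}

for archaic in ["hvad", "qvinna", "af", "Strindberg"]:
    print(archaic, "->", normalize(archaic, MODERN))
```

Words with no lexicon entry within the threshold, such as proper names, are left unchanged. A realistic system would need a full modern lexicon and some way of ranking several candidates at the same distance.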