Mainstreaming August Strindberg with Text Normalization

Ek, Adam; Knuutinen, Sofia

Conference article

Adam Ek
Department of Linguistics, Stockholm University, Sweden

Sofia Knuutinen
Department of Linguistics, Stockholm University, Sweden

Published in: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden

Linköping Electronic Conference Proceedings 131:36, pp. 266-270

NEALT Proceedings Series 29:36, pp. 266-270

Published: 2017-05-08

ISBN: 978-91-7685-601-7

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

This article explores the application of text normalization methods based on Levenshtein distance and Statistical Machine Translation to literary text, specifically the collected works of August Strindberg. The goal is to normalize archaic spellings to modern-day spelling. The study finds evidence that text normalization succeeds on this material, and discusses some problems with, and improvements to, the process of analysing Swedish texts from the mid-19th to the early 20th century. This article is part of an ongoing project at Stockholm University that aims to create a corpus and web-friendly texts from Strindberg's collected works.
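
To make the Levenshtein-based idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a plain, unweighted edit distance and a tiny hand-written modern lexicon; the names `levenshtein`, `normalize`, `MODERN_LEXICON` and the `max_distance` threshold are hypothetical and chosen only for illustration. Each historical token is mapped to the closest modern word, and left unchanged if nothing is close enough.

```python
def levenshtein(a: str, b: str) -> int:
    """Unweighted edit distance (insert, delete, substitute), row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


# Assumption: a toy stand-in for a real modern Swedish lexicon.
MODERN_LEXICON = ["av", "vad", "kvinna", "över"]


def normalize(token: str, lexicon=MODERN_LEXICON, max_distance: int = 2) -> str:
    """Return the closest lexicon entry, or the token itself if none is near."""
    if token in lexicon:
        return token
    # Ties on distance are broken alphabetically to keep the result deterministic.
    best = min(lexicon, key=lambda word: (levenshtein(token, word), word))
    return best if levenshtein(token, best) <= max_distance else token


if __name__ == "__main__":
    for old in ["hvad", "af", "qvinna", "öfver", "Strindberg"]:
        print(old, "->", normalize(old))
```

In this sketch the threshold keeps proper names and other out-of-lexicon words intact. A realistic system would need a full lexicon and, most likely, a better way to choose among equally distant candidates.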

