Automatic conversion of colloquial Finnish to standard Finnish

Inari Listenmaa
Chalmers Institute of Technology, Sweden

Francis M. Tyers
HSL-fakultehta, UiT Norgga árktalaš universitehta, Norway

Published in: Proceedings of the 20th Nordic Conference of Computational Linguistics, NODALIDA 2015, May 11-13, 2015, Vilnius, Lithuania

Linköping Electronic Conference Proceedings 109:27, p. 219-223

NEALT Proceedings Series 23:27, p. 219-223

Published: 2015-05-06

ISBN: 978-91-7519-098-3

ISSN: 1650-3686 (print), 1650-3740 (online)


This paper presents an unsupervised method for converting between colloquial Finnish and standard Finnish. The method relies on a small number of orthographical rules, combined with a large language model of standard Finnish that ranks the possible conversions. In addition, the paper presents an evaluation corpus of aligned sentences in colloquial Finnish, orthographically standardised colloquial Finnish and standard Finnish. The method we present outperforms the baseline of simply treating colloquial Finnish as standard Finnish, and it offers promise for adapting language-technology tools created for standard Finnish to colloquial Finnish. To this end, the paper also presents preliminary results which suggest that normalisation is useful in the machine translation task.
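The generate-and-rank idea described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `RULES` table, the tiny training corpus, the add-one smoothed `BigramLM` class and the `normalise` helper are hypothetical and are not the rule set, data or language model used in the paper. The sketch only shows the general shape of the approach: rewrite rules propose candidate sentences, and a language model of the standard language picks the most probable one.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical rewrite table: colloquial form -> possible standard forms.
# Illustrative only; this is not the rule set used in the paper.
RULES = {
    "mä": ["minä"],
    "sä": ["sinä"],
    "oon": ["olen"],
    "oot": ["olet"],
}


def candidates(tokens):
    """Yield every sentence obtained by optionally rewriting each token."""
    options = [[tok] + RULES.get(tok, []) for tok in tokens]
    for combo in product(*options):
        yield list(combo)


class BigramLM:
    """Toy add-one smoothed bigram model trained on standard-language text."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sent in sentences:
            padded = ["<s>"] + sent + ["</s>"]
            self.unigrams.update(padded)
            self.bigrams.update(zip(padded, padded[1:]))
        # One extra slot stands in for all unseen words.
        self.vocab_size = len(self.unigrams) + 1

    def logprob(self, sent):
        """Return the smoothed log-probability of a tokenised sentence."""
        padded = ["<s>"] + sent + ["</s>"]
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            num = self.bigrams[(prev, cur)] + 1
            den = self.unigrams[prev] + self.vocab_size
            total += math.log(num / den)
        return total


def normalise(tokens, lm):
    """Pick the candidate that the standard-language model scores highest."""
    return max(candidates(tokens), key=lm.logprob)


# Tiny stand-in for a large corpus of standard Finnish.
corpus = [
    "minä olen kotona".split(),
    "sinä olet kotona".split(),
    "minä olen täällä".split(),
]
lm = BigramLM(corpus)
result = normalise("mä oon kotona".split(), lm)
print(" ".join(result))
```

Because each candidate keeps the original token as one of its options, the untouched input is always in the candidate set, so the ranking step can fall back to it when no rewrite looks more like standard Finnish to the model.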

