Towards the Classification of the Finnish Internet Parsebank: Detecting Translations and Informality

Laippala, Veronika; Kanerva, Jenna; Missilä, Anna; Pyysalo, Sampo; Salakoski, Tapio; Ginter, Filip

Conference article

Veronika Laippala
Turku Institute for Advanced Studies / School of Languages and Translation Studies, University of Turku, Finland

Jenna Kanerva
Department of Information Technology, University of Turku, Finland

Anna Missilä
School of Languages and Translation Studies, University of Turku, Finland

Sampo Pyysalo
Department of Information Technology, University of Turku, Finland

Tapio Salakoski
Department of Information Technology, University of Turku, Finland

Filip Ginter
Department of Information Technology, University of Turku, Finland

Published in: Proceedings of the 20th Nordic Conference of Computational Linguistics, NODALIDA 2015, May 11-13, 2015, Vilnius, Lithuania

Linköping Electronic Conference Proceedings 109:15, p. 107-116

NEALT Proceedings Series 23:15, p. 107-116

Published: 2015-05-06

ISBN: 978-91-7519-098-3

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

This paper presents the first results on detecting informality, machine and human translations in the Finnish Internet Parsebank, a project aiming at a large-scale, web-based corpus with full morphological and syntactic analyses. The paper aims at classifying the Parsebank according to these criteria using syntactic n-grams, as well as studying the linguistic characteristics of the classes. The results are practically applicable, with an AUC range of 85–85% for the human, ~98% for the machine-translated texts and 73% for the informal texts. While word-based classification performs well for the in-domain experiments, delexicalized methods with morpho-syntactic features prove to be more tolerant to variation caused by genre or source language. In addition, the results show that the features used in the classification provide interesting pointers for further, more detailed studies on the linguistic characteristics of these texts.

References

Alekh Agarwal, Olivier Chapelle, Miroslav Dudik, and John Langford. 2011. A reliable effective terascale linear learning system. CoRR, abs/1110.4198.

Roee Aharoni, Moshe Koppel, and Yoav Goldberg. 2014. Automatic detection of machine translated text and translation quality estimation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: Short Papers, pages 289–295.

Yuki Arase and Ming Zhou. 2013. Machine translation detection from monolingual web-text. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: Long Papers, pages 1597–1607.

Alexander Ehud Avner, Noam Ordan, and Shuly Wintner. 2014. Identifying translationese at the word and sub-word level. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.

Mona Baker. 1993. Corpus linguistics and translation studies: implications and applications. In Gill Francis and Elena Tognini-Bonelli, editors, Text and Technology: In Honour of John Sinclair, pages 233–252. John Benjamins.

Marco Baroni and Silvia Bernardini. 2006. A new approach to the study of translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259–274.

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.

Douglas Biber, Susan Conrad, and Randi Reppen. 1998. Corpus Linguistics: Investigating Language Structure and Use. Cambridge University Press, Cambridge.

Shoshana Blum-Kulka and Edwards Levenston. 1983. Universals of lexical simplification. Language Learning, 28:399–415.

Yoav Goldberg and John Orwant. 2013. A dataset of syntactic-ngrams over time from a very large corpus of English books. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task; Semantic Textual Similarity, pages 241–247. Association for Computational Linguistics.

Google. 2015. Google Translate.

Eija-Riitta Grönros. 2006. Arkikielestä yleiskieleen (From everyday language to standard language). Kielikello, 4.

Katri Haverinen, Jenna Nyblom, Timo Viljanen, Veronika Laippala, Samuel Kohonen, Anna Missilä, Stina Ojala, Tapio Salakoski, and Filip Ginter. 2013. Building the essential resources for Finnish: the Turku Dependency Treebank. Language Resources and Evaluation, pages 1–39.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. 2010. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503–511. Springer.

Institute of Languages of Finland. 2014. Kielitoimiston sanakirja / Dictionary of the Institute for Languages of Finland. Number 35 in Kotimaisten kielten keskuksen verkkojulkaisuja. Kotimaisten kielten keskus / Institute for Languages of Finland.

Jenna Kanerva, Juhani Luotolahti, Veronika Laippala, and Filip Ginter. 2014. Syntactic n-gram collection from a large-scale corpus of Internet Finnish. In Proceedings of the Sixth International Conference Baltic HLT.

Adam Kilgarriff and G. Grefenstette. 2003. Introduction to the special issue on web as corpus. Computational Linguistics, 29:333–347.

Shibamouli Lahiri, Prasenjit Mitra, and Xiaofei Lu. 2011. Informality judgment at sentence level and experiments with formality score. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, volume 6609 of Lecture Notes in Computer Science, pages 446–457. Springer Berlin Heidelberg.

Veronika Laippala, Timo Viljanen, Antti Airola, Jenna Kanerva, Sanna Salanterä, Tapio Salakoski, and Filip Ginter. 2014. Statistical parsing of varieties of clinical Finnish. Artificial Intelligence in Medicine, 61(3):131–136.

Sara Laviosa-Braithwaite. 1995. Comparable corpora: towards a corpus linguistic methodology for the empirical study of translation. In M. Thelen and B. Lewandowska-Tomaszczyk, editors, Translation and Meaning Part 3. Proceedings of the Maastricht Session of the 2nd International Maastricht-Lodz Duo Colloquium on “Translation and Meaning”, pages 153–163. Hoogeschool Maastricht, Maastricht.

Sara Laviosa. 2002. Corpus-based Translation Studies: Theory, Findings, Applications. Rodopi, Amsterdam, New York.

Lotta Lehti and Veronika Laippala. 2014. Style in French politicians’ blogs: Degree of formality. Language at Internet, 11.

Adam Lopez. 2008. Statistical machine translation. ACM Computing Surveys, 40(3):1–49.

Kirsti Mäkinen. 1989. Sanojen tyyliväri. In Nykysuomen sanavarat, pages 200–212. WSOY.

Anna Mauranen. 2000. Strange strings in translated language: A study on corpora. In Intercultural Faultlines. Research Models in Translation Studies 1, pages 119–141. St. Jerome Publishing, Manchester.

Alejandro Mosquera and Paloma Moreda. 2011. The use of metrics for measuring informality levels in web 2.0 texts. In Proceedings of the 8th Brazilian Symposium in Information and Human Language Technology.

Sampo Nevalainen. 2003. Käännöskirjallisuuden puhekielisyyksistä - kaksinkertaista illuusiota? (On the informality of translated literature - a double illusion?). Virittäjä, (1):2–26.

Marius Popescu. 2011. Studying translationese on the character level. In Proceedings of Recent Advances in Natural Language Processing, pages 634–639.

Tiina Puurtinen. 2003. Genre-specific features of translationese? Linguistic differences between translated and non-translated Finnish children’s literature. Literary and Linguistic Computing, 18(4):389–406.

John M. Sinclair. 1996. Preliminary recommendations on Corpus Typology. http://www.ilc.cnr.it/EAGLES/corpustyp/corpustyp.html.
