Conference article

Towards the Classification of the Finnish Internet Parsebank: Detecting Translations and Informality

Veronika Laippala
Turku Institute for Advanced Studies / School of Languages and Translation Studies, University of Turku, Finland

Jenna Kanerva
Department of Information Technology, University of Turku, Finland

Anna Missilä
School of Languages and Translation Studies, University of Turku, Finland

Sampo Pyysalo
Department of Information Technology, University of Turku, Finland

Tapio Salakoski
Department of Information Technology, University of Turku, Finland

Filip Ginter
Department of Information Technology, University of Turku, Finland

Published in: Proceedings of the 20th Nordic Conference of Computational Linguistics, NODALIDA 2015, May 11-13, 2015, Vilnius, Lithuania

Linköping Electronic Conference Proceedings 109:15, pp. 107-116

NEALT Proceedings Series 23:15, pp. 107-116

Published: 2015-05-06

ISBN: 978-91-7519-098-3

ISSN: 1650-3686 (print), 1650-3740 (online)


This paper presents the first results on detecting informality and machine and human translations in the Finnish Internet Parsebank, a project aiming at a large-scale, web-based corpus with full morphological and syntactic analyses. The paper aims at classifying the Parsebank according to these criteria using syntactic n-grams, as well as studying the linguistic characteristics of the classes. The results are practically applicable, with an AUC range of 85–85% for the human-translated texts, ~98% for the machine-translated texts and 73% for the informal texts. While word-based classification performs well in the in-domain experiments, delexicalized methods with morpho-syntactic features prove more tolerant of variation caused by genre or source language. In addition, the results show that the features used in the classification provide interesting pointers for further, more detailed studies on the linguistic characteristics of these texts.


