Veronika Laippala
School of Languages and Translation Studies, University of Turku, Finland
Roosa Kyllönen
School of Languages and Translation Studies, University of Turku, Finland
Jesse Egbert
Applied Linguistics, Northern Arizona University, USA
Douglas Biber
Applied Linguistics, Northern Arizona University, USA
Sampo Pyysalo
Department of Future Technologies, University of Turku, Finland
Published in: Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa), September 30 - October 2, Turku, Finland
Linköping Electronic Conference Proceedings 167:30, p. 292--297
NEALT Proceedings Series 42:30, p. 292--297
Published: 2019-10-02
ISBN: 978-91-7929-995-8
ISSN: 1650-3686 (print), 1650-3740 (online)
We consider cross- and multilingual text classification approaches to the identification of online registers (genres), i.e. text varieties with specific situational characteristics. Register is the most important predictor of linguistic variation, and register information could improve the potential of online data for many applications. We introduce the first manually annotated non-English corpus of online registers featuring the full range of linguistic variation found online. The data set consists of 2,237 Finnish documents and follows the register taxonomy developed for the Corpus of Online Registers of English (CORE). Using CORE and the newly introduced corpus, we demonstrate the feasibility of cross-lingual register identification using a simple approach based on convolutional neural networks and multilingual word embeddings. We further find that register identification results can be improved through multilingual training even when a substantial number of annotations is available in the target language.
Multilingual text classification
Online Registers
Convolutional neural networks
Multilingual word vectors
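
The cross-lingual idea summarized in the abstract can be illustrated with a toy sketch: if English and Finnish words are embedded in one shared vector space, convolutional filters fitted on one language can be applied unchanged to the other. Everything below is an assumption made for illustration and is not taken from the paper: the two-dimensional embedding values, the hand-set filter and output weights, the window width, and the register labels `promotional` and `opinion`. The paper's actual architecture, embeddings, and hyperparameters are not described in this summary.

```python
# Toy shared embedding space (made-up values): translation pairs from
# English and Finnish are placed close to each other.
EMBEDDINGS = {
    "buy": [1.0, 0.0], "osta": [0.9, 0.1],
    "now": [0.8, 0.2], "nyt": [0.8, 0.1],
    "think": [0.1, 0.9], "luulen": [0.1, 0.9],
    "so": [0.0, 0.9], "niin": [0.0, 0.9],
}
UNKNOWN = [0.0, 0.0]  # fallback vector for out-of-vocabulary tokens

# Hypothetical, hand-set parameters standing in for trained weights.
# Each filter spans a window of two word vectors (2 words x 2 dimensions).
FILTERS = [
    [1.0, 0.0, 1.0, 0.0],  # responds to the first embedding dimension
    [0.0, 1.0, 0.0, 1.0],  # responds to the second embedding dimension
]
OUTPUT_WEIGHTS = [[1.0, 0.0], [0.0, 1.0]]  # one row per label
LABELS = ["promotional", "opinion"]        # illustrative labels only


def embed(tokens):
    """Look up each token in the shared multilingual space."""
    return [EMBEDDINGS.get(token.lower(), UNKNOWN) for token in tokens]


def conv_max_pool(vectors, filters, width=2):
    """Slide each filter over windows of `width` vectors, apply ReLU,
    and keep the maximum activation per filter (max-over-time pooling)."""
    pooled = []
    for weights in filters:
        best = 0.0  # ReLU floor; also the value for too-short inputs
        for start in range(len(vectors) - width + 1):
            window = [x for vec in vectors[start:start + width] for x in vec]
            activation = sum(w * x for w, x in zip(weights, window))
            best = max(best, activation)
        pooled.append(best)
    return pooled


def classify(tokens):
    """Return the label with the highest linear score over pooled features."""
    features = conv_max_pool(embed(tokens), FILTERS)
    scores = [sum(w * x for w, x in zip(row, features))
              for row in OUTPUT_WEIGHTS]
    return LABELS[scores.index(max(scores))]


english_label = classify(["buy", "now"])
finnish_label = classify(["osta", "nyt"])
```

Because the classifier only ever sees vectors, the same parameters give the English and Finnish inputs the same label here. In a real setting the filters and output layer would be learned from annotated documents, and the embeddings would come from pretrained multilingual word vectors.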