Conference article

Toward Multilingual Identification of Online Registers

Veronika Laippala
School of Languages and Translation Studies, University of Turku, Finland

Roosa Kyllönen
School of Languages and Translation Studies, University of Turku, Finland

Jesse Egbert
Applied Linguistics, Northern Arizona University, USA

Douglas Biber
Applied Linguistics, Northern Arizona University, USA

Sampo Pyysalo
Department of Future Technologies, University of Turku, Finland

Published in: Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa), September 30 - October 2, Turku, Finland

Linköping Electronic Conference Proceedings 167:30, pp. 292-297

NEALT Proceedings Series 42:30, pp. 292-297

Published: 2019-10-02

ISBN: 978-91-7929-995-8

ISSN: 1650-3686 (print), 1650-3740 (online)


We consider cross- and multilingual text classification approaches to the identification of online registers (genres), i.e. text varieties with specific situational characteristics. Register is the most important predictor of linguistic variation, and register information could improve the potential of online data for many applications. We introduce the first manually annotated non-English corpus of online registers featuring the full range of linguistic variation found online. The data set consists of 2,237 Finnish documents and follows the register taxonomy developed for the Corpus of Online Registers of English (CORE). Using CORE and the newly introduced corpus, we demonstrate the feasibility of cross-lingual register identification using a simple approach based on convolutional neural networks and multilingual word embeddings. We further find that register identification results can be improved through multilingual training even when a substantial number of annotations is available in the target language.


Keywords: Multilingual text classification, Online registers, Convolutional neural networks, Multilingual word vectors

