Creating register sub-corpora for the Finnish Internet Parsebank

Veronika Laippala
Turku Institute for Advanced Studies, University of Turku, Finland / School of Languages and Translation Studies, University of Turku, Finland / Turku NLP Group, University of Turku, Finland

Juhani Luotolahti
Turku NLP Group, University of Turku, Finland

Aki-Juhani Kyröläinen
School of Languages and Translation Studies, University of Turku, Finland / Turku NLP Group, University of Turku, Finland

Tapio Salakoski
Turku NLP Group, University of Turku, Finland

Filip Ginter
Turku NLP Group, University of Turku, Finland

In: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden

Linköping Electronic Conference Proceedings 131:18, pp. 152-161

NEALT Proceedings Series 29:18, pp. 152-161

Published: 2017-05-08

ISBN: 978-91-7685-601-7

ISSN: 1650-3686 (print), 1650-3740 (online)


This paper develops register sub-corpora for the Web-crawled Finnish Internet Parsebank. Currently, documents belonging to different registers, such as news and user manuals, all have an equal status in this corpus. Detecting the text register would be useful for both NLP and linguistics (Giesbrecht and Evert, 2009; Webber, 2009; Sinclair, 1996; Egbert et al., 2015). We assemble the sub-corpora by first naively deducing four register classes from the Parsebank document URLs and then training a classifier on these documents to detect the register of the remaining documents as well. The results show that the naive method of deducing the register is efficient and that the classification can be done sufficiently reliably. An analysis of the prediction errors, however, indicates that texts with similar communicative purposes but belonging to different registers, such as news and blogs that inform the reader, share similar linguistic characteristics. This attests to the well-known difficulty of defining the notion of register for practical use. Finally, as a significant improvement to the usability of the Parsebank, we release two sets of sub-corpus collections for it. The A collection consists of two million documents classified as blogs, forum discussions, encyclopedia articles and news with a naive classification precision of >90%, and the B collection consists of four million documents with a precision of >80%.
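The first step described above, deducing a register label naively from the document URL, can be illustrated with a minimal Python sketch. The cue strings in `REGISTER_CUES` and the function name `guess_register` are assumptions made for illustration only; the abstract does not state which URL patterns were actually used. The sketch labels a document only when the cues of exactly one register match, and leaves all other documents unlabeled for a classifier to handle.

```python
from urllib.parse import urlparse

# Hypothetical URL cues for the four register classes named in the abstract.
# These strings are illustrative assumptions, not the patterns used in the paper.
REGISTER_CUES = {
    "blog": ("blog",),
    "forum": ("forum", "keskustelu"),
    "encyclopedia": ("wiki",),
    "news": ("uutiset", "news"),
}


def guess_register(url):
    """Return a register label if exactly one register's cues match, else None."""
    parsed = urlparse(url.lower())
    # Look for cues in the host name and the path only.
    haystack = parsed.netloc + parsed.path
    matches = [
        register
        for register, cues in REGISTER_CUES.items()
        if any(cue in haystack for cue in cues)
    ]
    # Ambiguous or unmatched URLs are left for the classifier.
    return matches[0] if len(matches) == 1 else None
```

Documents labeled this way would serve as training data, and the documents for which the function returns `None` would be the ones passed to the trained classifier.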


