Building a Large Automatically Parsed Corpus of Finnish

Filip Ginter
Department of IT, University of Turku, Finland

Jenna Nyblom
Department of IT, University of Turku, Finland

Veronika Laippala
Department of Languages and Translation Studies, University of Turku, Finland

Samuel Kohonen
Department of IT, University of Turku, Finland

Katri Haverinen
Department of IT, University of Turku, Finland and Turku Centre for Computer Science (TUCS), Turku, Finland

Simo Vihjanen
Lingsoft, Inc., Turku, Finland

Tapio Salakoski
Department of IT, University of Turku, Finland and Turku Centre for Computer Science (TUCS), Turku, Finland

Part of: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22-24, 2013, Oslo University, Norway. NEALT Proceedings Series 16

Linköping Electronic Conference Proceedings 85:26, pp. 291-300

NEALT Proceedings Series 16:26, pp. 291-300

Published: 2013-05-17

ISBN: 978-91-7519-589-6

ISSN: 1650-3686 (print), 1650-3740 (online)


We describe the methods and resources used to build FinnTreeBank-3, a 76.4 million token corpus of Finnish with automatically produced morphological and dependency syntax analyses. Starting from a definition of the target dependency scheme, we show how existing resources are transformed to conform to this definition and subsequently used to develop a parsing pipeline capable of processing a large-scale corpus. An independent formal evaluation demonstrates high accuracy of both the morphological and the syntactic annotation layers. The parsed corpus is freely available within the FIN-CLARIN infrastructure project.


Dependency parsing; Finnish; CLARIN; parsebank; treebank


