Building a Large Automatically Parsed Corpus of Finnish

Ginter, Filip; Nyblom, Jenna; Laippala, Veronika; Kohonen, Samuel; Haverinen, Katri; Vihjanen, Simo; Salakoski, Tapio

Conference article

Filip Ginter
Department of IT, University of Turku, Finland

Jenna Nyblom
Department of IT, University of Turku, Finland

Veronika Laippala
Department of Languages and Translation Studies, University of Turku, Finland

Samuel Kohonen
Department of IT, University of Turku, Finland

Katri Haverinen
Department of IT, University of Turku, Finland and Turku Centre for Computer Science (TUCS), Turku, Finland

Simo Vihjanen
Lingsoft, Inc., Turku, Finland

Tapio Salakoski
Department of IT, University of Turku, Finland and Turku Centre for Computer Science (TUCS), Turku, Finland

Published in: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22-24, 2013, Oslo University, Norway. NEALT Proceedings Series 16

Linköping Electronic Conference Proceedings 108:26, pp. 291-300

NEALT Proceedings Series 16:26, pp. 291-300

Published: 2013-05-17

ISBN: 978-91-7519-589-6

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

We describe the methods and resources used to build FinnTreeBank-3, a 76.4 million token corpus of Finnish with automatically produced morphological and dependency syntax analyses. Starting from a definition of the target dependency scheme, we show how existing resources are transformed to conform to this definition and subsequently used to develop a parsing pipeline capable of processing a large-scale corpus. An independent formal evaluation demonstrates high accuracy of both morphological and syntactic annotation layers. The parsed corpus is freely available within the FIN-CLARIN infrastructure project.

Keywords

Dependency parsing; Finnish; CLARIN; parsebank; treebank

