Conference article

Universal Dependencies for Finnish

Sampo Pyysalo
Department of Information Technology, University of Turku, Finland

Jenna Kanerva
Department of Information Technology / University of Turku Graduate School (UTUGS), University of Turku, Finland

Anna Missilä
School of Languages and Translation Studies, University of Turku, Finland

Veronika Laippala
Turku Institute for Advanced Studies (TIAS) / School of Languages and Translation Studies, University of Turku, Finland

Filip Ginter
Department of Information Technology, University of Turku, Finland


Published in: Proceedings of the 20th Nordic Conference of Computational Linguistics, NODALIDA 2015, May 11-13, 2015, Vilnius, Lithuania

Linköping Electronic Conference Proceedings 109:21, pp. 163-172

NEALT Proceedings Series 23:21, pp. 163-172


Published: 2015-05-06

ISBN: 978-91-7519-098-3

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

In recent years, there has been substantial interest in annotation schemes that apply consistently to many languages. Building on several recent proposals that have aimed to unify morphological and syntactic annotation, the Universal Dependencies (UD) project seeks to introduce a cross-linguistically applicable part-of-speech tagset, feature inventory, and set of dependency relations, as well as a large number of uniformly annotated treebanks. In this paper, we present Universal Dependencies for Finnish, one of the ten languages in the recent first release of UD project treebank data. We detail the mapping of previously introduced annotation to the UD standard, describing a number of specific challenges and their resolution. We additionally present a first set of dependency parsing experiments comparing the performance of a state-of-the-art parser trained on a language-specific annotation scheme to its performance on the corresponding UD annotation. The results show higher parsing scores on the UD annotation than on the source annotation, indicating that the conversion is accurate and supporting the feasibility of UD as a parsing target.


