Increasing Return on Annotation Investment: the Automatic Construction of a Universal Dependency Treebank for Dutch

Gosse Bouma
Centre for Language and Cognition, University of Groningen, The Netherlands

Gertjan van Noord
Centre for Language and Cognition, University of Groningen, The Netherlands

Published in: Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies, 22 May, Gothenburg, Sweden

Linköping Electronic Conference Proceedings 135:3, pp. 19-26

NEALT Proceedings Series 31:3, pp. 19-26

Published: 2017-05-29

ISBN: 978-91-7685-501-0

ISSN: 1650-3686 (print), 1650-3740 (online)


We present a method for automatically converting the Dutch Lassy Small treebank, a phrasal dependency treebank, to Universal Dependencies (UD). All of the information required to produce accurate UD annotation appears to be available in the underlying annotation. However, we also note that the close connection between POS tags and dependency labels that is present in UD is missing in the Lassy treebanks. As a consequence, annotation decisions in the Dutch data for phenomena such as nominalization and clausal complements of prepositions seem to differ to some extent from comparable data in English and German. Because the conversion is automatic, we can now also compare three state-of-the-art dependency parsers trained on UD Lassy Small with Alpino, a hybrid Dutch parser that produces output compatible with the original Lassy annotations.



