Defining the Eukalyptus forest - the Koala treebank of Swedish

Yvonne Adesam
Språkbanken, Department of Swedish, University of Gothenburg, Sweden

Gerlof Bouma
Språkbanken, Department of Swedish, University of Gothenburg, Sweden

Richard Johansson
Språkbanken, Department of Swedish, University of Gothenburg, Sweden


In: Proceedings of the 20th Nordic Conference of Computational Linguistics, NODALIDA 2015, May 11-13, 2015, Vilnius, Lithuania

Linköping Electronic Conference Proceedings 109:4, p. 1-9

NEALT Proceedings Series 23:4, p. 1-9


Published: 2015-05-06

ISBN: 978-91-7519-098-3

ISSN: 1650-3686 (print), 1650-3740 (online)


This paper describes the creation of the Koala corpus, a manually annotated corpus of 100k tokens of contemporary Swedish text, focusing in particular on its part-of-speech and syntactic annotation. The resource will be made freely available.




