Conference article

Joint UD Parsing of Norwegian Bokmål and Nynorsk

Erik Velldal
Department of Informatics, University of Oslo, Norway

Lilja Øvrelid
Department of Informatics, University of Oslo, Norway

Petter Hohle
Department of Informatics, University of Oslo, Norway

Published in: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden

Linköping Electronic Conference Proceedings 131:1, p. 1-10

NEALT Proceedings Series 29:1, p. 1-10

Published: 2017-05-08

ISBN: 978-91-7685-601-7

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

This paper investigates interactions in parser performance for the two official standards for written Norwegian: Bokmål and Nynorsk. We demonstrate that while applying models across standards yields poor performance, combining the training data for both standards yields better results than previously achieved for each of them in isolation. This has immediate practical value for processing Norwegian, as it means that a single parsing pipeline is sufficient to cover both varieties, with no loss in accuracy. Based on the Norwegian Universal Dependencies treebank, we present results for multiple taggers and parsers, experimenting with different ways of varying the training data given to the learners, including the use of machine translation.
