Automatic Morpheme Segmentation and Labeling in Universal Dependencies Resources

Miikka Silfverberg
Department of Linguistics, University of Colorado, USA

Mans Hulden
Department of Linguistics, University of Colorado, USA


Published in: Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies, 22 May, Gothenburg, Sweden

Linköping Electronic Conference Proceedings 135:18, pp. 140-145

NEALT Proceedings Series 31:18, pp. 140-145


Published: 2017-05-29

ISBN: 978-91-7685-501-0

ISSN: 1650-3686 (print), 1650-3740 (online)


Newer versions of the Universal Dependencies (UD) resources feature rich morphological annotation at the word-token level, covering tense, mood, aspect, case, gender, and other grammatical information. This information, however, is not aligned to any part of the word forms in the data. In this work, we present an algorithm for inferring this latent alignment between morphosyntactic labels and substrings of word forms. We evaluate the method on three languages for which we have manually labeled part of the Universal Dependencies data (Finnish, Swedish, and Spanish) and show that the method is robust enough to use for automatic discovery, segmentation, and labeling of allomorphs in the data sets. The model allows us to provide a more detailed morphosyntactic labeling and segmentation of the UD data.
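
To make the alignment problem concrete, the following minimal sketch reads one token line in the CoNLL-U format used by UD and shows that the morphological features attach to the whole word form rather than to any substring. It is an illustration of the input and of the kind of output the abstract describes, not the algorithm from the paper. The helper names `parse_feats` and `parse_token`, the example token, and the hand-written target segmentation are assumptions introduced here and are not taken from the evaluated data.

```python
from typing import Dict, List, Tuple


def parse_feats(feats: str) -> Dict[str, str]:
    """Parse a CoNLL-U FEATS string such as 'Case=Ine|Number=Plur'."""
    if feats == "_":  # '_' marks an empty field in CoNLL-U
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def parse_token(line: str) -> Tuple[str, str, Dict[str, str]]:
    """Return (form, lemma, features) from one 10-column CoNLL-U token line."""
    cols = line.rstrip("\n").split("\t")
    # Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
    return cols[1], cols[2], parse_feats(cols[5])


# Hypothetical token line (Finnish 'taloissa', "in the houses").
line = "1\ttaloissa\ttalo\tNOUN\t_\tCase=Ine|Number=Plur\t0\troot\t_\t_"
form, lemma, feats = parse_token(line)

# The annotation says WHICH features the token carries, but not WHERE in the
# string each one is realized.
print(form, lemma, feats)

# Hand-written illustration of the latent alignment one would like to recover:
# each substring of the form paired with the label it realizes.
target: List[Tuple[str, str]] = [
    ("talo", "STEM"),
    ("i", "Number=Plur"),
    ("ssa", "Case=Ine"),
]

# The segments must concatenate back to the surface form.
assert "".join(segment for segment, _ in target) == form
```

An inference method of the kind described in the abstract would have to produce segmentations like `target` automatically, using only the word forms and their unaligned feature sets.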




