Conference article

Improving cross-domain dependency parsing with dependency-derived clusters

Jostein Lien
Department of Informatics, University of Oslo, Norway

Erik Velldal
Department of Informatics, University of Oslo, Norway

Lilja Øvrelid
Department of Informatics, University of Oslo, Norway


Published in: Proceedings of the 20th Nordic Conference of Computational Linguistics, NODALIDA 2015, May 11-13, 2015, Vilnius, Lithuania

Linköping Electronic Conference Proceedings 109:16, pp. 117-126

NEALT Proceedings Series 23:16, pp. 117-126


Published: 2015-05-06

ISBN: 978-91-7519-098-3

ISSN: 1650-3686 (print), 1650-3740 (online)


This paper describes a semi-supervised approach to improving statistical dependency parsing using dependency-based word clusters. A baseline parser is first applied to unlabeled text, and clusters are induced using K-means with word features based on the resulting dependency structures. The parser is then re-trained using information about the clusters, yielding improved parsing accuracy on a range of different data sets, including WSJ and the English Web Treebank. We report improved results using both in-domain and out-of-domain data, and also include a comparison with n-gram-based Brown clustering.
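
The clustering step summarized in the abstract (parse unlabeled text, derive word features from the dependency structures, cluster the words with K-means) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the toy arcs, the `as_dep=`/`has_dep=` feature templates, the function names, and the deterministic farthest-point initialisation are all assumptions made for the example, and the K-means here is a plain standard-library version of Lloyd's algorithm.

```python
from collections import Counter, defaultdict
import math

# Hypothetical output of a baseline parser on unlabeled text:
# (dependent, relation, head) triples. Toy data, for illustration only.
PARSED_ARCS = [
    ("dog", "nsubj", "barks"),
    ("cat", "nsubj", "sleeps"),
    ("dog", "nsubj", "chases"),
    ("cat", "dobj", "chases"),
    ("loudly", "advmod", "barks"),
    ("quietly", "advmod", "sleeps"),
]


def word_features(arcs):
    """Count, per word, the relations it has to its head and to its dependents.

    The two feature templates are assumptions, not the paper's feature set.
    """
    feats = defaultdict(Counter)
    for dep, rel, head in arcs:
        feats[dep]["as_dep=" + rel] += 1
        feats[head]["has_dep=" + rel] += 1
    return feats


def to_unit_vectors(feats):
    """Turn sparse counts into dense, L2-normalised vectors (sorted for determinism)."""
    names = sorted({name for counts in feats.values() for name in counts})
    words = sorted(feats)
    vectors = []
    for word in words:
        row = [float(feats[word][name]) for name in names]
        norm = math.sqrt(sum(x * x for x in row))
        vectors.append([x / norm for x in row])
    return words, vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's K-means with a deterministic farthest-point initialisation."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Next seed: the vector farthest from all centroids chosen so far.
        centroids.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest centroid for every vector.
        labels = [
            min(range(k), key=lambda j: sq_dist(v, centroids[j])) for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def induce_clusters(arcs, k):
    """Map each word seen in the parsed arcs to a cluster id."""
    words, vectors = to_unit_vectors(word_features(arcs))
    return dict(zip(words, kmeans(vectors, k)))


clusters = induce_clusters(PARSED_ARCS, k=3)
for word in sorted(clusters):
    print(word, "cluster=" + str(clusters[word]))
```

On this toy input the subject/object words, the verbs, and the adverbs fall into three separate clusters. In a re-training setup along the lines the abstract describes, the cluster id of each token could then be supplied to the parser as an additional feature; how that is wired into a particular parser's feature model is not shown here.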




