Jenna Kanerva
TurkuNLP Group, University of Turku Graduate School (UTUGS), Turku, Finland
Sampo Pyysalo
Language Technology Lab, DTAL, University of Cambridge, United Kingdom
Filip Ginter
TurkuNLP Group, University of Turku, Finland
Published in: Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017), September 18-20, 2017, Università di Pisa, Italy
Linköping Electronic Conference Proceedings 139:11, pp. 83–91
Published: 2017-09-13
ISBN: 978-91-7685-467-9
ISSN: 1650-3686 (print), 1650-3740 (online)
Word embeddings induced from large amounts of unannotated text are a key resource for many NLP tasks. Several recent studies have proposed extensions of the basic distributional semantics approach, in which words form the context of other words, adding features derived from e.g. syntactic dependencies. In this study, we look in a different direction, exploring models that leave words out entirely, instead basing the context representation exclusively on syntactic and morphological features. Remarkably, we find that the resulting vectors still capture clear semantic aspects of words in addition to syntactic ones. We assess the properties of the vectors using both intrinsic and extrinsic evaluations, demonstrating in a multilingual parsing experiment on 55 treebanks that fully delexicalized syntax-based word representations yield higher average parsing performance than conventional word2vec embeddings.
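To illustrate the idea of a fully delexicalized context, the sketch below extracts (word, context) pairs from a CoNLL-U treebank in which each context consists only of a dependency relation plus the POS tag and morphological features of the head or dependent, never a word form. This is a hypothetical illustration, not the paper's implementation: the function names and the exact context-feature encoding are our own assumptions, and training over such arbitrary (word, context) pairs would require a generalized word2vec implementation such as word2vecf rather than standard word2vec.

import sys

def read_conllu(path):
    # Yield sentences as lists of token dicts from a CoNLL-U file.
    sent = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                if sent:
                    yield sent
                    sent = []
                continue
            if line.startswith("#"):
                continue
            cols = line.split("\t")
            if "-" in cols[0] or "." in cols[0]:
                continue  # skip multiword token ranges and empty nodes
            sent.append({
                "id": int(cols[0]),
                "form": cols[1],
                "upos": cols[3],
                "feats": cols[5],
                "head": int(cols[6]),
                "deprel": cols[7],
            })
        if sent:
            yield sent

def delex_contexts(sent):
    # For each word, emit delexicalized context features: the dependency
    # relation combined with the POS tag and morphology of the head and of
    # each dependent. No word forms appear in the contexts. The encoding
    # (relation + ">"/"<" direction marker + UPOS + FEATS) is an assumption
    # made for this sketch.
    by_head = {}
    for tok in sent:
        by_head.setdefault(tok["head"], []).append(tok)
    for tok in sent:
        contexts = []
        if tok["head"] > 0:
            head = sent[tok["head"] - 1]  # regular token ids are 1..n
            contexts.append("%s>_%s|%s" % (tok["deprel"], head["upos"], head["feats"]))
        for dep in by_head.get(tok["id"], []):
            contexts.append("%s<_%s|%s" % (dep["deprel"], dep["upos"], dep["feats"]))
        yield tok["form"], contexts

if __name__ == "__main__":
    # Print one (word, context) pair per line, a format that a
    # word2vecf-style trainer can consume.
    for sent in read_conllu(sys.argv[1]):
        for word, contexts in delex_contexts(sent):
            for c in contexts:
                print("%s %s" % (word, c))

Because the word itself never enters the context vocabulary, any semantic signal in the resulting embeddings must come from the distribution of syntactic and morphological environments alone, which is precisely the property the paper evaluates.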