Conference article

Word vectors, reuse, and replicability: Towards a community repository of large-text resources

Murhaf Fares
Language Technology Group, Department of Informatics, University of Oslo, Norway

Andrey Kutuzov
Language Technology Group, Department of Informatics, University of Oslo, Norway

Stephan Oepen
Language Technology Group, Department of Informatics, University of Oslo, Norway

Erik Velldal
Language Technology Group, Department of Informatics, University of Oslo, Norway

Published in: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden

Linköping Electronic Conference Proceedings 131:37, pp. 271-276

NEALT Proceedings Series 29:37, pp. 271-276

Published: 2017-05-08

ISBN: 978-91-7685-601-7

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

This paper describes an emerging shared repository of large-text resources for creating word vectors, including pre-processed corpora and pre-trained vectors for a range of frameworks and configurations. This will facilitate reuse, rapid experimentation, and replicability of results.
