Conference article

New Measures to Investigate Term Typology by Distributional Data

Jussi Karlgren
Kungliga Tekniska Högskolan, Stockholm, Sweden and Gavagai, Stockholm

Published in: Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22-24, 2013, Oslo University, Norway. NEALT Proceedings Series 16

Linköping Electronic Conference Proceedings 85:28, p. 311-319

NEALT Proceedings Series 16:28, p. 311-319

Published: 2013-05-17

ISBN: 978-91-7519-589-6

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

This report describes a series of exploratory experiments to establish whether terms of different semantic type can be distinguished in useful ways in a semantic space constructed from distributional data. The hypotheses explored in this paper are that some words vary more in their distribution than others; that the varying semantic character of words is reflected in their distribution; that this distributional difference is encoded in current distributional models; but that the information is not accessible through the methods typically used when applying those models. This paper proposes some new measures to explore variation that is encoded in distributional models but is not usually put to use in understanding the character of the words represented in them. The exploratory findings show that some of the proposed measures vary widely across words of various types.

Keywords

Term typology; distributional semantics
