An Unsupervised Query Rewriting Approach Using N-gram Co-occurrence Statistics to Find Similar Phrases in Large Text Corpora

Moen, Hans; Peltonen, Laura-Maria; Suhonen, Henry; Matinolli, Hanna-Maria; Mieronkoski, Riitta; Telen, Kirsi; Terho, Kirsi; Salakoski, Tapio; Salanterä, Sanna

Conference article

Hans Moen
Turku NLP Group, Department of Future Technologies, University of Turku, Finland

Laura-Maria Peltonen
Department of Nursing Science, University of Turku, Finland

Henry Suhonen
Department of Nursing Science, University of Turku, Finland / Turku University Hospital, Finland

Hanna-Maria Matinolli
Department of Nursing Science, University of Turku, Finland

Riitta Mieronkoski
Department of Nursing Science, University of Turku, Finland

Kirsi Telen
Department of Nursing Science, University of Turku, Finland

Kirsi Terho
Department of Nursing Science, University of Turku, Finland / Turku University Hospital, Finland

Tapio Salakoski
Turku NLP Group, Department of Future Technologies, University of Turku, Finland

Sanna Salanterä
Department of Nursing Science, University of Turku, Finland / Turku University Hospital, Finland

Published in: Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa), September 30 - October 2, Turku, Finland

Linköping Electronic Conference Proceedings 167:14, pp. 131--139

NEALT Proceedings Series 42:14, pp. 131--139

Published: 2019-10-02

ISBN: 978-91-7929-995-8

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

We present our work towards developing a system that finds, in a large text corpus, contiguous phrases expressing a meaning similar to that of a query phrase of arbitrary length. Depending on the use case, this task can be seen as a form of (phrase-level) query rewriting. The suggested approach works in a generative manner, is unsupervised, and combines a semantic word n-gram model, a statistical language model and a document search engine. A central component is a distributional semantic model containing word n-gram vectors (or embeddings), which captures semantic similarities between n-grams of different orders. As data we use a large corpus of PubMed abstracts. The presented experiment is based on manual evaluation of extracted phrases for arbitrary queries provided by a group of evaluators. The results indicate that the proposed approach is promising and that distributional semantic models trained with uni-, bi- and trigrams seem to work better than a more traditional unigram model.
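To make the idea of comparing n-grams of different orders concrete, the following is a minimal sketch, not the system described in the paper. It assumes simple count-based co-occurrence vectors over a toy corpus and cosine similarity. The abstract does not specify how the n-gram vectors are trained, and the language model and search engine components are left out entirely. All function names (`ngrams`, `cooccurrence_vectors`, `cosine`, `most_similar`) and the example sentences are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngrams(tokens, max_n=3):
    """Yield (start, end, text) for every n-gram of order 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield i, i + n, " ".join(tokens[i:i + n])


def cooccurrence_vectors(sentences, max_n=3, window=2):
    """Map each uni-, bi- and trigram to counts of the words around it.

    Assumption: the context is the `window` tokens on each side of the
    n-gram; this is an illustrative choice, not taken from the paper.
    """
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for start, end, gram in ngrams(tokens, max_n):
            left = tokens[max(0, start - window):start]
            right = tokens[end:end + window]
            vectors.setdefault(gram, Counter()).update(left + right)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def most_similar(query, vectors, top_k=3):
    """Rank all other n-grams, of any order, by similarity to the query."""
    q = vectors.get(query, Counter())
    scored = [(cosine(q, v), gram) for gram, v in vectors.items() if gram != query]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_k]


# Toy corpus (hypothetical sentences, not PubMed data).
corpus = [
    "patients with heart attack were treated with aspirin",
    "patients with myocardial infarction were treated with aspirin",
    "patients with high blood pressure were given medication",
]
vectors = cooccurrence_vectors(corpus)
print(most_similar("heart attack", vectors, top_k=1))
```

Because bigrams and trigrams get their own vectors in the same space as unigrams, a multi-word query such as "heart attack" can be matched directly against other multi-word phrases instead of being composed from single-word vectors.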
