Topic Models: Accounting Component Structure of Bigrams

Michael Nokel
Lomonosov Moscow State University, Russian Federation

Natalia Loukachevitch
Lomonosov Moscow State University, Russian Federation

Published in: Proceedings of the 20th Nordic Conference of Computational Linguistics, NODALIDA 2015, May 11-13, 2015, Vilnius, Lithuania

Linköping Electronic Conference Proceedings 109:19, pp. 145-152

NEALT Proceedings Series 23:19, pp. 145-152

Published: 2015-05-06

ISBN: 978-91-7519-098-3

ISSN: 1650-3686 (print), 1650-3740 (online)


This paper describes an experimental study of integrating bigram collocations, and the similarities between them and unigrams, into topic models. First, we propose a novel algorithm, PLSA-SIM, a modification of the original PLSA algorithm. It incorporates bigrams and maintains relationships between unigrams and bigrams based on their component structure. We then analyze a variety of word association measures in order to integrate top-ranked bigrams into topic models. All experiments were conducted on four text collections covering different domains and languages. The experiments identify a subgroup of the tested measures whose top-ranked bigrams, when integrated into the PLSA-SIM algorithm, significantly improve topic model quality on all collections.
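
To make the idea of ranking bigrams with a word association measure concrete, the following is a minimal sketch and not the authors' implementation. It assumes pointwise mutual information (PMI) as the measure, purely for illustration; the abstract does not say which measures were tested. The function name `rank_bigrams_by_pmi`, the `min_count` threshold, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter


def rank_bigrams_by_pmi(documents, min_count=2):
    """Return (bigram, pmi) pairs sorted from highest to lowest PMI.

    `documents` is a list of token lists. Bigrams seen fewer than
    `min_count` times are skipped (an assumed filter, since PMI is
    unreliable for very rare pairs).
    """
    unigram_counts = Counter()
    bigram_counts = Counter()
    for tokens in documents:
        unigram_counts.update(tokens)
        # Adjacent token pairs within one document form the bigrams.
        bigram_counts.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigram_counts.values())
    total_bigrams = sum(bigram_counts.values())

    scored = []
    for (w1, w2), count in bigram_counts.items():
        if count < min_count:
            continue
        p_bigram = count / total_bigrams
        p_w1 = unigram_counts[w1] / total_unigrams
        p_w2 = unigram_counts[w2] / total_unigrams
        # PMI compares the observed pair probability with the one
        # expected if the two words occurred independently.
        scored.append(((w1, w2), math.log2(p_bigram / (p_w1 * p_w2))))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy corpus, already tokenized.
toy_documents = [
    "topic model quality depends on the topic model".split(),
    "a topic model groups words by topic".split(),
    "the quality of the model depends on the words".split(),
]
ranked = rank_bigrams_by_pmi(toy_documents)
for bigram, score in ranked:
    print(" ".join(bigram), round(score, 2))
```

In a pipeline like the one the abstract describes, the top-ranked bigrams from such a list would be added to the vocabulary as single tokens before the topic model is trained. How PLSA-SIM then links each bigram to its component unigrams is specified in the paper itself and is not reproduced here.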



