This paper reports an experimental study of integrating bigram collocations, and the similarities between them and unigrams, into topic models. First, we propose PLSA-SIM, a novel modification of the original PLSA algorithm. It incorporates bigrams and maintains relationships between unigrams and bigrams based on their component structure. We then analyze a variety of word association measures in order to integrate top-ranked bigrams into topic models. All experiments were conducted on four text collections of different domains and languages. The experiments identify a subgroup of the tested measures whose top-ranked bigrams, when integrated into the PLSA-SIM algorithm, significantly improve topic model quality on all collections.