Topic Modeling: A Consistent Framework for Comparative Studies

Ana Amaro, Fernando Bacao

Abstract


In recent years, the field of Topic Modeling (TM) has grown in importance due to the increasing availability of digital text data. TM is an unsupervised learning technique that uncovers latent semantic structures in large collections of documents, making it a valuable tool for finding relevant patterns. However, evaluating the performance of TM algorithms is challenging because studies often rely on different metrics and datasets, leading to inconsistent results. In addition, many current surveys of TM algorithms cover only a limited number of models and exclude state-of-the-art approaches. This paper addresses these issues by presenting a comprehensive comparative study of five TM algorithms across three benchmark datasets using five evaluation metrics. We offer an updated survey of the latest TM approaches and evaluation metrics, providing a consistent framework for comparing algorithms while introducing state-of-the-art approaches that have been overlooked in the literature. The experiments, which primarily use Context Vectors (CV) Topic Coherence as an evaluation metric, show that Top2Vec is the best-performing model across all datasets, breaking the tendency for Latent Dirichlet Allocation to be the best performer.
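
As a concrete illustration of the shared evaluation protocol, the minimal Python sketch below scores a set of topics with CV Topic Coherence using gensim's CoherenceModel (the coherence measure family explored by Röder et al., referenced below). The tokenized corpus and topic word lists are hypothetical placeholders rather than the paper's actual experimental data; in practice, the same scoring step would be applied to the top words produced by each compared model, such as LDA or Top2Vec.

    # Minimal sketch of CV coherence scoring with gensim (assumed installed).
    # The documents and topic word lists below are illustrative placeholders.
    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel

    # Tokenized reference corpus (the paper uses benchmark datasets instead).
    texts = [
        ["nasa", "launch", "orbit", "satellite"],
        ["match", "goal", "league", "season"],
        ["court", "judge", "law", "ruling"],
    ]

    # Top words per topic, e.g. extracted from a fitted LDA or Top2Vec model.
    topics = [
        ["nasa", "orbit", "satellite", "launch"],
        ["match", "season", "league", "goal"],
    ]

    dictionary = Dictionary(texts)
    cm = CoherenceModel(topics=topics, texts=texts,
                        dictionary=dictionary, coherence="c_v")
    print("CV coherence:", cm.get_coherence())

Scoring every model's topics against the same reference corpus with the same coherence measure is what makes the comparison consistent: higher CV coherence indicates top words that co-occur more systematically in the corpus.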


DOI: 10.28991/ESJ-2024-08-01-09

Full Text: PDF


Keywords


Natural Language Processing; Top2Vec; Topic Coherence; Topic Modeling; Unsupervised Learning.

References


Li, X., & Lei, L. (2021). A bibliometric analysis of topic modeling studies (2000–2017). Journal of Information Science, 47(2), 161–175. doi:10.1177/0165551519877049.

Lisena, P., Harrando, I., Kandakji, O., & Troncy, R. (2020). TOMODAPI: A Topic Modeling API to Train, Use and Compare Topic Models. Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS). doi:10.18653/v1/2020.nlposs-1.19.

Hasan, Md., Hossain, Md. M., Ahmed, A., & Rahman, M. S. (2019). Topic Modeling: A Comparison of The Performance of Latent Dirichlet Allocation and LDA2vec Model on Bangla Newspaper. 2019 International Conference on Bangla Speech and Language Processing (ICBSLP). doi:10.1109/icbslp47725.2019.202047.

Mohammed, S. H., & Al-Augby, S. (2020). LSA & LDA topic modeling classification: Comparison study on E-books. Indonesian Journal of Electrical Engineering and Computer Science, 19(1), 353–362. doi:10.11591/ijeecs.v19.i1.pp353-362.

O’Callaghan, D., Greene, D., Carthy, J., & Cunningham, P. (2015). An analysis of the coherence of descriptors in topic modeling. Expert Systems with Applications, 42(13), 5645–5657. doi:10.1016/j.eswa.2015.02.055.

Albalawi, R., Yeap, T. H., & Benyoucef, M. (2020). Using Topic Modeling Methods for Short-Text Data: A Comparative Analysis. Frontiers in Artificial Intelligence, 3, 42. doi:10.3389/frai.2020.00042.

Bennett, A., Misra, D., & Than, N. (2021). Have you tried Neural Topic Models? Comparative Analysis of Neural and Non-Neural Topic Models with Application to COVID-19 Twitter Data. arXiv preprint arXiv:2105.10165. doi:10.48550/arXiv.2105.10165.

Harrando, I., Lisena, P., & Troncy, R. (2021). Apples to Apples: A Systematic Evaluation of Topic Models. Proceedings of the Conference Recent Advances in Natural Language Processing - Deep Learning for Natural Language Processing Methods and Applications. doi:10.26615/978-954-452-072-4_055.

Calheiros, A. C., Moro, S., & Rita, P. (2017). Sentiment Classification of Consumer-Generated Online Reviews Using Topic Modeling. Journal of Hospitality Marketing & Management, 26(7), 675–693. doi:10.1080/19368623.2017.1310075.

Zhou, D., Chen, L., & He, Y. (2014). A Simple Bayesian Modeling Approach to Event Extraction from Twitter. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). doi:10.3115/v1/p14-2114.

Tang, H., Li, M., & Jin, B. (2019). A Topic Augmented Text Generation Model: Joint Learning of Semantics and Structural Features. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). doi:10.18653/v1/d19-1513.

Haghighi, A., & Vanderwende, L. (2009). Exploring content models for multi-document summarization. Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics on - NAACL ’09. doi:10.3115/1620754.1620807.

Boyd-Graber, J., Blei, D., & Zhu, X. (2007). A topic model for word sense disambiguation. Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL), 28-30 June, 2007, Prague, Czech Republic.

Zhao, B., & Xing, E. (2007). HM-BiTAM: Bilingual topic exploration, word alignment, and translation. Advances in Neural Information Processing Systems, 20, 3-6 December, 2007, Vancouver, Canada.

Luo, W., Stenger, B., Zhao, X., & Kim, T.-K. (2015). Automatic Topic Discovery for Multi-Object Tracking. Proceedings of the AAAI Conference on Artificial Intelligence, 29(1). doi:10.1609/aaai.v29i1.9789.

Grimmer, J. (2010). A Bayesian hierarchical topic model for political texts: Measuring expressed agendas in senate press releases. Political Analysis, 18(1), 1–35. doi:10.1093/pan/mpp034.

Chen, Y., Ghosh, J., Bejan, C. A., Gunter, C. A., Gupta, S., Kho, A., Liebovitz, D., Sun, J., Denny, J., & Malin, B. (2015). Building bridges across electronic health record systems through inferred phenotypic topics. Journal of Biomedical Informatics, 55, 82–93. doi:10.1016/j.jbi.2015.03.011.

Reisenbichler, M., & Reutterer, T. (2019). Topic modeling in marketing: recent advances and research opportunities. Journal of Business Economics, 89(3), 327–356. doi:10.1007/s11573-018-0915-7.

Fino, E., Hanna-Khalil, B., & Griffiths, M. D. (2021). Exploring the public’s perception of gambling addiction on Twitter during the COVID-19 pandemic: Topic modeling and sentiment analysis. Journal of Addictive Diseases, 39(4), 489–503. doi:10.1080/10550887.2021.1897064.

Poongodi, M., Nguyen, T. N., Hamdi, M., & Cengiz, K. (2021). Global cryptocurrency trend prediction using social media. Information Processing & Management, 58(6), 102708. doi:10.1016/j.ipm.2021.102708.

Saura, J. R., Ribeiro-Soriano, D., & Zegarra Saldaña, P. (2022). Exploring the challenges of remote work on Twitter users’ sentiments: From digital technology development to a post-pandemic era. Journal of Business Research, 142, 242–254. doi:10.1016/j.jbusres.2021.12.052.

Saura, J. R., Palacios-Marqués, D., & Ribeiro-Soriano, D. (2023). Exploring the boundaries of open innovation: Evidence from social media mining. Technovation, 119, 102447. doi:10.1016/j.technovation.2021.102447.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407. doi:10.1002/(sici)1097-4571(199009)41:6<391::aid-asi1>3.0.co;2-9.

Hofmann, T. (1999). Probabilistic latent semantic indexing. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. doi:10.1145/312624.312649.

Paatero, P., & Tapper, U. (1994). Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2), 111–126. doi:10.1002/env.3170050203.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.

Harris, Z. S. (1954). Distributional Structure. Word, 10(2–3), 146–162. doi:10.1080/00437956.1954.11659520.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26, 5-10 December, 2013, Lake Tahoe, United States.

Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents. Proceedings of the 31st International Conference on Machine Learning, 21-26 June, 2014, Beijing, China.

Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. doi:10.3115/v1/d14-1162.

Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2017). Bag of Tricks for Efficient Text Classification. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. doi:10.18653/v1/e17-2068.

Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5, 135–146. doi:10.1162/tacl_a_00051.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. doi:10.48550/arXiv.1810.04805.

Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3982-3992. doi:10.18653/v1/d19-1410.

Wu, Y., Ding, Y., Wang, X., & Xu, J. (2010). A comparative study of topic models for topic clustering of Chinese web news. 2010 3rd International Conference on Computer Science and Information Technology, Chengdu, China. doi:10.1109/iccsit.2010.5564723.

Pietsch, A.-S., & Lessmann, S. (2018). Topic modeling for analyzing open-ended survey responses. Journal of Business Analytics, 1(2), 93–116. doi:10.1080/2573234x.2019.1590131.

Wang, C., Paisley, J., & Blei, D. M. (2011). Online variational inference for the hierarchical Dirichlet process. Proceedings of the fourteenth international conference on artificial intelligence and statistics, 11-13 April, 2011, Fort Lauderdale, United States.

Lenz, D., & Winker, P. (2020). Measuring the diffusion of innovations with paragraph vector topic models. PLoS ONE, 15(1), e0226685. doi:10.1371/journal.pone.0226685.

Angelov, D. (2020). Top2Vec: Distributed representations of topics. arXiv preprint arXiv:2008.09470. doi:10.48550/arXiv.2008.09470.

Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint, arXiv:2203.05794. doi:10.48550/arXiv.2203.05794.

Hong, L., & Davison, B. D. (2010). Empirical study of topic modeling in Twitter. Proceedings of the First Workshop on Social Media Analytics, 80-88. doi:10.1145/1964858.1964870.

Bellman, R., & Kalaba, R. E. (1965). Dynamic programming and modern control theory. Academic Press, New York, United States.

McInnes, L., Healy, J., Saul, N., & Großberger, L. (2018). UMAP: Uniform Manifold Approximation and Projection. Journal of Open Source Software, 3(29), 861. doi:10.21105/joss.00861.

Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2579-2605.

Campello, R. J. G. B., Moulavi, D., & Sander, J. (2013). Density-Based Clustering Based on Hierarchical Density Estimates. Advances in Knowledge Discovery and Data Mining. PAKDD 2013, Lecture Notes in Computer Science, 7819, Springer, Berlin, Germany. doi:10.1007/978-3-642-37456-2_14.

Jones, K. S. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1), 11–21. doi:10.1108/eb026526.

Carbonell, J., & Goldstein, J. (1998). The use of MMR, diversity-based reranking for reordering documents and producing summaries. Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. doi:10.1145/290941.291025.

Aletras, N., & Stevenson, M. (2013). Evaluating topic coherence using distributional semantics. Proceedings of the 10th international conference on computational semantics (IWCS 2013), 19-22 March, Potsdam, Germany.

Newman, D., Lau, J. H., Grieser, K., & Baldwin, T. (2010). Automatic evaluation of topic coherence. Human language technologies: The 2010 annual conference of the North American chapter of the association for computational linguistics, 2-4 June, 2010, Los Angeles, United States.

Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J., & Blei, D. (2009). Reading tea leaves: How humans interpret topic models. Advances in neural information processing systems, 22, 7-10 December, 2009, Vancouver, Canada.

Mimno, D., Wallach, H., Talley, E., Leenders, M., & McCallum, A. (2011). Optimizing semantic coherence in topic models. Proceedings of the 2011 conference on empirical methods in natural language processing, 27-31 July, Edinburgh, Scotland.

Lau, J. H., Newman, D., & Baldwin, T. (2014). Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality. Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics. doi:10.3115/v1/e14-1056.

Bouma, G. (2009). Normalized (pointwise) Mutual Information in Collocation Extraction. Proceedings of GSCL, 30, 31-40.

Newman, D., Bonilla, E. V., & Buntine, W. (2011). Improving topic coherence with regularized topic models. Advances in neural information processing systems, 24, 12-14 December, 2011, Granada, Spain.

Doogan, C., & Buntine, W. (2021). Topic Model or Topic Twaddle? Re-evaluating Semantic Interpretability Measures. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. doi:10.18653/v1/2021.naacl-main.300.

Dieng, A. B., Ruiz, F. J. R., & Blei, D. M. (2020). Topic modeling in embedding spaces. Transactions of the Association for Computational Linguistics, 8, 439–453. doi:10.1162/tacl_a_00325.

Lang, K. (1995). NewsWeeder: Learning to Filter Netnews. Machine Learning Proceedings 1995, 331–339, Morgan Kaufmann, Burlington, United States. doi:10.1016/b978-1-55860-377-6.50048-7.

Zhao, H., Du, L., Buntine, W., & Liu, G. (2017). MetaLDA: A Topic Model that Efficiently Incorporates Meta Information. 2017 IEEE International Conference on Data Mining (ICDM), New Orleans, United States. doi:10.1109/icdm.2017.73.

Röder, M., Both, A., & Hinneburg, A. (2015). Exploring the Space of Topic Coherence Measures. Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, 399-408. doi:10.1145/2684822.2685324.

Sharma, E., Li, C., & Wang, L. (2019). BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent Summarization. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2204–2213. doi:10.18653/v1/p19-1212.

Hoffman, M., Bach, F., & Blei, D. (2010). Online learning for latent Dirichlet allocation. Advances in Neural Information Processing Systems 23 (NIPS 2010), 6-9 December, 2010, Vancouver, Canada.

Zhao, R., & Tan, V. Y. F. (2017). Online Nonnegative Matrix Factorization with Outliers. IEEE Transactions on Signal Processing, 65(3), 555–570. doi:10.1109/TSP.2016.2620967.




Copyright (c) 2024 Ana Amaro, Fernando Bacao