Topic Modeling: A Consistent Framework for Comparative Studies
DOI: 10.28991/ESJ-2024-08-01-09
Copyright (c) 2024 Ana Amaro, Fernando Bacao