본문 바로가기
  • Home

A Comparative Study on Topic Modeling of LDA, Top2Vec, and BERTopic Models Using LIS Journals in WoS

  • Journal of the Korean Society for Library and Information Science
  • 2024, 58(1), pp.5-30
  • DOI : 10.4275/KSLIS.2024.58.1.005
  • Publisher : 한국문헌정보학회
  • Research Area : Interdisciplinary Studies > Library and Information Science
  • Received : December 23, 2023
  • Accepted : January 29, 2024
  • Published : February 28, 2024

LeeYong-Gu 1 Kim SeonWook ORD ID 2

1경북대학교
2대구가톨릭대학교

Excellent Accredited

ABSTRACT

The purpose of this study is to extract topics from experimental data using the topic modeling methods(LDA, Top2Vec, and BERTopic) and compare the characteristics and differences between these models. The experimental data consist of 55,442 papers published in 85 academic journals in the field of library and information science, which are indexed in the Web of Science(WoS). The experimental process was as follows: The first topic modeling results were obtained using the default parameters for each model, and the second topic modeling results were obtained by setting the same optimal number of topics for each model. In the first stage of topic modeling, LDA, Top2Vec, and BERTopic models generated significantly different numbers of topics(100, 350, and 550, respectively). Top2Vec and BERTopic models seemed to divide the topics approximately three to five times more finely than the LDA model. There were substantial differences among the models in terms of the average and standard deviation of documents per topic. The LDA model assigned many documents to a relatively small number of topics, while the BERTopic model showed the opposite trend. In the second stage of topic modeling, generating the same 25 topics for all models, the Top2Vec model tended to assign more documents on average per topic and showed small deviations between topics, resulting in even distribution of the 25 topics. When comparing the creation of similar topics between models, LDA and Top2Vec models generated 18 similar topics(72%) out of 25. This high percentage suggests that the Top2Vec model is more similar to the LDA model. For a more comprehensive comparison analysis, expert evaluation is necessary to determine whether the documents assigned to each topic in the topic modeling results are thematically accurate.

Citation status

* References for papers published after 2023 are currently being built.