Improving Topic-Specific Document Selection through BERTopic and Re-Clustering

  • Journal of The Korea Society of Computer and Information
  • Abbr : JKSCI
  • 2025, 30(5), pp.39~49
  • Publisher : The Korean Society of Computer and Information
  • Research Area : Engineering > Computer Science
  • Received : April 1, 2025
  • Accepted : May 12, 2025
  • Published : May 30, 2025

Woon-Kyo Lee 1, Ja-Hee Kim 1

1 Seoul National University of Science and Technology

Accredited

ABSTRACT

This study analyzes the performance of BERTopic-based clustering across various data distributions and proposes a re-clustering method that improves the selection of documents on a specific target topic. Existing clustering techniques often fail to select documents accurately when the proportion of target-topic documents is very low or very high. To address this issue, the retrieved documents are sampled so that target-topic documents are included at varying ratios. Each sample is clustered using SBERT-based BERTopic with the K-means and HDBSCAN algorithms, and the results are compared. In the re-clustering step, documents initially classified as outliers are clustered again and merged with the original results. The results before and after re-clustering are compared using four evaluation metrics: accuracy, F1 score, ARI, and NMI. Accuracy improved from 0.7251 to 0.9421 and the F1 score from 0.8449 to 0.9423, while ARI increased by 0.3626 and NMI by 0.2805. These results indicate that the proposed method enhances clustering quality and improves the accuracy of document selection.
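The re-clustering and merge step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`recluster_outliers`, `accuracy_and_f1`, `toy_cluster_fn`), the use of `-1` as the outlier label (the convention HDBSCAN follows for noise points), the rule of offsetting new cluster ids, and the toy data are all assumptions made for illustration. In practice the second clustering pass would presumably be another embedding-based clustering run rather than the keyword grouping used here.

```python
from typing import Callable, List, Sequence, Tuple

OUTLIER = -1  # assumed outlier label; HDBSCAN conventionally labels noise as -1


def recluster_outliers(
    labels: Sequence[int],
    docs: Sequence[str],
    cluster_fn: Callable[[List[str]], List[int]],
) -> List[int]:
    """Cluster the outlier documents again and merge the new labels back."""
    outlier_idx = [i for i, lab in enumerate(labels) if lab == OUTLIER]
    merged = list(labels)
    if not outlier_idx:
        return merged
    new_labels = cluster_fn([docs[i] for i in outlier_idx])
    # Offset the new ids so they cannot collide with the original cluster ids.
    offset = max(labels) + 1
    for i, new in zip(outlier_idx, new_labels):
        merged[i] = OUTLIER if new == OUTLIER else new + offset
    return merged


def accuracy_and_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Accuracy and F1 for binary target-topic selection (1 = selected)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


def toy_cluster_fn(docs: List[str]) -> List[int]:
    """Hypothetical stand-in for a second clustering pass: group by first word."""
    ids: dict = {}
    return [ids.setdefault(d.split()[0], len(ids)) for d in docs]


# Toy data: three target-topic documents, one of which was first marked an outlier.
docs = ["topic a1", "topic a2", "noise x", "topic a3", "noise y"]
is_target = [1, 1, 0, 1, 0]
initial_labels = [0, 0, OUTLIER, OUTLIER, OUTLIER]

merged_labels = recluster_outliers(initial_labels, docs, toy_cluster_fn)

# Assumption: clusters judged to be on the target topic are chosen by inspection.
before = [1 if lab in {0} else 0 for lab in initial_labels]
after = [1 if lab in {0, 2} else 0 for lab in merged_labels]

print("before:", accuracy_and_f1(is_target, before))
print("after: ", accuracy_and_f1(is_target, after))
```

In this toy case the target-topic document that was first treated as an outlier is recovered after re-clustering, so both selection metrics rise. The figures reported in the abstract come from the authors' own experiments and are not reproduced by this sketch.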
