@article{ART003204593,
author={Woon-Kyo Lee and Ja-Hee Kim},
title={Improving Topic-Specific Document Selection through BERTopic and Re-Clustering},
journal={Journal of The Korea Society of Computer and Information},
issn={1598-849X},
year={2025},
volume={30},
number={5},
pages={39-49}
}
TY  - JOUR
AU  - Woon-Kyo Lee
AU  - Ja-Hee Kim
TI  - Improving Topic-Specific Document Selection through BERTopic and Re-Clustering
JO  - Journal of The Korea Society of Computer and Information
PY  - 2025
VL  - 30
IS  - 5
PB  - The Korean Society Of Computer And Information
SP  - 39
EP  - 49
SN  - 1598-849X
AB  - This study analyzes the performance of BERTopic-based clustering across various data distributions. A re-clustering method is proposed to improve the selection of documents on a specific target topic. Existing clustering techniques often face challenges in accurately selecting documents when the proportion of documents related to the target topic is very low or very high. To address this issue, sampling is performed on retrieved documents to include target topic documents at varying ratios. Clustering is performed using SBERT-based BERTopic with K-means and HDBSCAN algorithms, and the results are compared. In the re-clustering step, documents initially classified as outliers are re-clustered and merged with the original results. The results before and after re-clustering were compared using four evaluation metrics. Accuracy improved from 0.7251 to 0.9421, and the F1 Score increased from 0.8449 to 0.9423. ARI increased by 0.3626 and NMI by 0.2805. This indicates that the proposed method enhances clustering quality and improves the accuracy of document selection.
KW  - BERTopic;Clustering;Re-Clustering;K-means;HDBSCAN
DO  - 
UR  - 
ER  - 
Woon-Kyo Lee and Ja-Hee Kim. (2025). Improving Topic-Specific Document Selection through BERTopic and Re-Clustering. Journal of The Korea Society of Computer and Information, 30(5), 39-49.
Woon-Kyo Lee and Ja-Hee Kim. 2025, "Improving Topic-Specific Document Selection through BERTopic and Re-Clustering", Journal of The Korea Society of Computer and Information, vol.30, no.5, pp.39-49.
Woon-Kyo Lee, Ja-Hee Kim "Improving Topic-Specific Document Selection through BERTopic and Re-Clustering" Journal of The Korea Society of Computer and Information 30.5 pp.39-49 (2025) : 39.
Woon-Kyo Lee, Ja-Hee Kim. Improving Topic-Specific Document Selection through BERTopic and Re-Clustering. 2025; 30(5), 39-49.
Woon-Kyo Lee and Ja-Hee Kim. "Improving Topic-Specific Document Selection through BERTopic and Re-Clustering" Journal of The Korea Society of Computer and Information 30, no.5 (2025) : 39-49.
Woon-Kyo Lee; Ja-Hee Kim. Improving Topic-Specific Document Selection through BERTopic and Re-Clustering. Journal of The Korea Society of Computer and Information, 30(5), 39-49.
Woon-Kyo Lee; Ja-Hee Kim. Improving Topic-Specific Document Selection through BERTopic and Re-Clustering. Journal of The Korea Society of Computer and Information. 2025; 30(5) 39-49.