@article{ART001226360,
author={Lee, Jae Yun},
title={Improving the Performance of Document Clustering with Distributional Similarities},
journal={Journal of the Korean Society for Information Management},
issn={1013-0799},
year={2007},
volume={24},
number={4},
pages={267--283},
doi={10.3743/KOSIM.2007.24.4.267}
}
TY - JOUR
AU - Lee, Jae Yun
TI - Improving the Performance of Document Clustering with Distributional Similarities
JO - Journal of the Korean Society for Information Management
PY - 2007
VL - 24
IS - 4
PB - 한국정보관리학회
SP - 267
EP - 283
SN - 1013-0799
AB - In this study, measures of distributional similarity such as KL-divergence are applied to document clustering in place of the traditional cosine measure, which is the most prevalent vector similarity measure for this task. Three variations of KL-divergence are investigated: Jensen-Shannon divergence, symmetric skew divergence, and minimum skew divergence. To verify the contribution of distributional similarities to document clustering, two experiments were designed and carried out on three test collections. In the first experiment, the clustering performance of the three divergence measures was compared with that of the cosine measure. The results showed that minimum skew divergence outperformed the other divergence measures as well as the cosine measure. In the second experiment, second-order distributional similarities were calculated with the Pearson correlation coefficient from the first-order similarity matrices. The results showed that second-order distributional similarities improved the overall performance of document clustering. These results suggest that minimum skew divergence should be selected as the document vector similarity measure when both time and accuracy matter, and that second-order similarity is a good choice when only clustering accuracy matters.
KW - distributional similarity;divergence;second-order similarity;document clustering;automatic classification
DO - 10.3743/KOSIM.2007.24.4.267
ER -
Lee, Jae Yun. (2007). Improving the Performance of Document Clustering with Distributional Similarities. Journal of the Korean Society for Information Management, 24(4), 267-283.
Lee, Jae Yun. 2007, "Improving the Performance of Document Clustering with Distributional Similarities", Journal of the Korean Society for Information Management, vol.24, no.4 pp.267-283. Available from: doi:10.3743/KOSIM.2007.24.4.267
Lee, Jae Yun "Improving the Performance of Document Clustering with Distributional Similarities" Journal of the Korean Society for Information Management 24.4 pp.267-283 (2007) : 267.
Lee, Jae Yun. Improving the Performance of Document Clustering with Distributional Similarities. 2007; 24(4), 267-283. Available from: doi:10.3743/KOSIM.2007.24.4.267
Lee, Jae Yun. "Improving the Performance of Document Clustering with Distributional Similarities" Journal of the Korean Society for Information Management 24, no.4 (2007) : 267-283. doi: 10.3743/KOSIM.2007.24.4.267
Lee, Jae Yun. Improving the Performance of Document Clustering with Distributional Similarities. Journal of the Korean Society for Information Management, 24(4), 267-283. doi: 10.3743/KOSIM.2007.24.4.267
Lee, Jae Yun. Improving the Performance of Document Clustering with Distributional Similarities. Journal of the Korean Society for Information Management. 2007; 24(4) 267-283. doi: 10.3743/KOSIM.2007.24.4.267