Improving the Performance of Document Clustering with Distributional Similarities (분포 유사도를 이용한 문헌클러스터링의 성능향상에 대한 연구)

Lee, Jae Yun (이재윤)

doi:10.3743/KOSIM.2007.24.4.267


Journal of the Korean Society for Information Management
Abbr : JKOSIM
2007, 24(4), pp.267~283
DOI : 10.3743/KOSIM.2007.24.4.267
Publisher : Korean Society for Information Management
Research Area : Interdisciplinary Studies > Library and Information Science
Received : November 30, 2007
Accepted : December 10, 2007
Published : December 30, 2007

Lee, Jae Yun ¹

¹Kyonggi University

Accredited

ABSTRACT

This study applies measures of distributional similarity, such as KL-divergence, to document clustering in place of the traditional cosine measure, which is the most prevalent vector similarity measure for that task. Three variations of KL-divergence are investigated: Jensen-Shannon divergence, symmetric skew divergence, and minimum skew divergence. To verify the contribution of distributional similarities to document clustering, two experiments were designed and carried out on three test collections. In the first experiment, the clustering performance of the three divergence measures was compared with that of the cosine measure. The results showed that minimum skew divergence outperformed both the other divergence measures and the cosine measure. In the second experiment, second-order distributional similarities were calculated with the Pearson correlation coefficient from the first-order similarity matrices. The second-order distributional similarities were found to improve the overall performance of document clustering. These results suggest that minimum skew divergence should be selected as the document vector similarity measure when both time and accuracy matter, and that second-order similarity is a good choice when only clustering accuracy matters.
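The measures named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so the definitions here are assumptions based on common usage: skew divergence is taken as D(p || alpha*q + (1 - alpha)*p), the symmetric variant as the mean of the two directions, and the minimum variant as the smaller of the two. The value alpha = 0.99, the function names, and the toy term counts are all hypothetical and are not taken from the paper.

```python
import numpy as np


def kl(p, q):
    """KL divergence D(p || q) in bits; terms with p_i = 0 contribute 0.

    Assumes q_i > 0 wherever p_i > 0, which holds for the mixtures used below.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL divergence to the midpoint distribution."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def skew_divergence(p, q, alpha=0.99):
    """Skew divergence D(p || alpha*q + (1 - alpha)*p); alpha is an assumed value."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return kl(p, alpha * q + (1.0 - alpha) * p)


def symmetric_skew_divergence(p, q, alpha=0.99):
    """Assumed definition: mean of the two directed skew divergences."""
    return 0.5 * (skew_divergence(p, q, alpha) + skew_divergence(q, p, alpha))


def minimum_skew_divergence(p, q, alpha=0.99):
    """Assumed definition: smaller of the two directed skew divergences."""
    return min(skew_divergence(p, q, alpha), skew_divergence(q, p, alpha))


def pairwise(docs, measure):
    """First-order matrix: the chosen measure applied to every pair of documents."""
    n = len(docs)
    return np.array([[measure(docs[i], docs[j]) for j in range(n)] for i in range(n)])


def second_order_similarity(first_order):
    """Pearson correlation between the rows of a first-order matrix."""
    return np.corrcoef(np.asarray(first_order, dtype=float))


# Toy term counts (hypothetical), normalized so each row is a probability distribution.
counts = np.array([[2.0, 1.0, 0.0, 0.0],
                   [1.0, 2.0, 0.0, 0.0],
                   [0.0, 0.0, 3.0, 1.0]])
docs = counts / counts.sum(axis=1, keepdims=True)

first_order = pairwise(docs, minimum_skew_divergence)
second_order = second_order_similarity(first_order)
```

In this sketch the first-order matrix holds divergences, so smaller values mean more similar documents, while the second-order matrix holds correlations, so larger values mean more similar documents. A clustering routine would need to be told which convention applies.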

KEYWORDS

distributional similarity, divergence, second-order similarity, document clustering, automatic classification

