Paragraph-based K-Means Clustering by using Meaning-based Paragraph Division (의미 기반 문단 분할을 활용한 문단 기반 K-Means 클러스터링)

SAJOON PARK (박사준); Jae Ho Kim (김재호)

doi:10.34163/jkits.2017.12.1.014

Journal of Knowledge Information Technology and Systems
Abbr : JKITS
2017, 12(1), pp.157~164
Publisher : Korea Knowledge Information Technology Society
Research Area : Interdisciplinary Studies > Interdisciplinary Research
Published : February 28, 2017

SAJOON PARK ¹, Jae Ho Kim ²

¹ Daegu Haany University
² Gangneung-Wonju National University

Accredited

ABSTRACT

As the number of electronic documents grows explosively, retrieving information from them quickly and accurately becomes increasingly difficult. To address this problem, documents are clustered in various ways, most commonly with the K-Means algorithm. K-Means can cluster large numbers of documents quickly and easily, but it does not take the meaning of the documents into account. In this research, we propose a document clustering technique based on meaning-based paragraphs. The proposed technique divides each document in a document set into meaning-based paragraphs by measuring the similarity between sentences, chooses from each document the representative paragraph having the maximum coherence value, and then runs the K-Means algorithm on those representative paragraphs. Unlike existing methods, we propose a novel similarity function between two adjacent sentences that uses WordNet as an ontology to calculate the similarity between words. We also introduce a method for calculating the coherence of a meaning-based paragraph by normalizing the sum of the tf-idf values of the words in the paragraph. We conducted experiments on the Reuters-21578 document set to evaluate the performance of the proposed technique. The experimental results showed that document clustering with meaning-based paragraphs improves the precision and recall of document retrieval.
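
The pipeline described in the abstract (split into meaning-based paragraphs, pick one representative paragraph per document, then run K-Means) can be illustrated with a minimal sketch. It is not the authors' implementation, and several pieces are assumptions made only for illustration: the sentence similarity is plain Jaccard word overlap standing in for the WordNet-based function, the coherence is the paragraph's tf-idf sum divided by the document's total (the abstract does not state the exact normalization), the idf is smoothed with +1, and all function names, the threshold, and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tokens(sentence):
    """Lowercased words with surrounding punctuation stripped."""
    words = (w.strip(".,;:!?\"'()").lower() for w in sentence.split())
    return [w for w in words if w]


def sentence_similarity(a, b):
    """ASSUMPTION: Jaccard overlap, a stand-in for the WordNet-based function."""
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def split_paragraphs(sentences, threshold=0.1):
    """Start a new paragraph when adjacent sentences fall below the threshold."""
    paragraphs = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if sentence_similarity(prev, cur) >= threshold:
            paragraphs[-1].append(cur)
        else:
            paragraphs.append([cur])
    return paragraphs


def idf_table(docs):
    """Smoothed idf over documents (the +1 is an assumption, not from the paper)."""
    df = Counter()
    for doc in docs:
        df.update({w for s in doc for w in tokens(s)})
    return {w: math.log(len(docs) / df[w]) + 1.0 for w in df}


def tfidf_sum(paragraph, idf):
    """Sum of tf * idf over the words of one paragraph."""
    tf = Counter(w for s in paragraph for w in tokens(s))
    return sum(tf[w] * idf[w] for w in tf)


def coherence(paragraph, paragraphs, idf):
    """ASSUMPTION: normalize by the tf-idf total of all paragraphs in the document."""
    return tfidf_sum(paragraph, idf) / sum(tfidf_sum(p, idf) for p in paragraphs)


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's algorithm; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(v, centroids[c])))
            for v in vectors
        ]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_documents(docs, k):
    """Cluster documents by the tf-idf vector of their representative paragraph."""
    idf = idf_table(docs)
    reps = []
    for doc in docs:
        paragraphs = split_paragraphs(doc)
        reps.append(max(paragraphs, key=lambda p: coherence(p, paragraphs, idf)))
    vocab = sorted(idf)
    vectors = []
    for p in reps:
        tf = Counter(w for s in p for w in tokens(s))
        vectors.append([tf[w] * idf[w] for w in vocab])
    return reps, kmeans(vectors, k)


# Hypothetical toy documents, each given as a list of sentences.
docs = [
    ["Oil prices rose sharply.", "Crude oil prices hit a record.", "The weather was sunny."],
    ["The bank raised interest rates.", "Interest rates affect bank loans."],
    ["Oil exports fell.", "Oil prices fell too."],
    ["Bank interest rates dropped.", "Loans grew as rates dropped."],
]
reps, labels = cluster_documents(docs, k=2)
print(labels)
```

In this toy run the off-topic weather sentence is split into its own paragraph and is not chosen as the representative of the first document, so it does not influence the clustering. Swapping `sentence_similarity` for a WordNet-based measure, or changing the normalization in `coherence`, changes only those two functions.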

KEYWORDS

Document retrieval, Clustering, Meaning-based paragraph, K-means clustering method, Reuters-21578 document set, Precision, Recall

