Paragraph-based K-Means Clustering by using Meaning-based Paragraph Division

  • Journal of Knowledge Information Technology and Systems
  • Abbr : JKITS
  • 2017, 12(1), pp.157-164
  • DOI : 10.34163/jkits.2017.12.1.014
  • Publisher : Korea Knowledge Information Technology Society
  • Research Area : Interdisciplinary Studies > Interdisciplinary Research
  • Published : February 28, 2017

Sajoon Park 1, Jae Ho Kim 2

1 Daegu Haany University
2 Gangneung-Wonju National University

Accredited

ABSTRACT

As the number of electronic documents grows explosively, retrieving information from them quickly and accurately becomes increasingly difficult. To address this problem, documents are clustered in various ways, most commonly with the K-Means algorithm. K-Means can cluster large numbers of documents quickly and easily, but it does not take the meaning of the documents into account. In this research, we propose a document clustering technique based on meaning-based paragraphs. The proposed technique divides each document in a document set into meaning-based paragraphs by measuring the similarity between sentences, selects from each document the representative paragraph with the maximum coherence value, and then runs the K-Means algorithm on those representative paragraphs. Unlike existing methods, we propose a novel similarity function between two adjacent sentences that uses WordNet as an ontology to calculate the similarity between words. We also introduce a method for calculating the coherence of a meaning-based paragraph by normalizing the sum of the tf-idf values of the words in the paragraph. We evaluated the proposed technique experimentally on the Reuters-21578 document set. The experimental results showed that document clustering using meaning-based paragraphs improves the precision and recall of document retrieval.
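
The first two stages described in the abstract (dividing a document into meaning-based paragraphs and picking the most coherent one) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not give the formulas, so several pieces are assumptions: Jaccard word overlap stands in for the WordNet-based sentence similarity, the `threshold` value is arbitrary, idf is computed over the paragraphs of a single document, and the tf-idf sum is normalized by the paragraph's token count. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def sentence_similarity(a, b):
    """Jaccard overlap of word sets.

    ASSUMPTION: a simple stand-in for the paper's WordNet-based
    similarity function, which the abstract does not specify.
    """
    wa, wb = set(tokenize(a)), set(tokenize(b))
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def split_paragraphs(sentences, threshold=0.1):
    """Start a new paragraph whenever two adjacent sentences
    are less similar than `threshold` (an arbitrary value here)."""
    if not sentences:
        return []
    paragraphs = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if sentence_similarity(prev, cur) >= threshold:
            paragraphs[-1].append(cur)
        else:
            paragraphs.append([cur])
    return paragraphs


def idf_table(paragraphs):
    """Smoothed idf, treating each paragraph as one 'document'.

    ASSUMPTION: the paper may compute idf over the whole document set.
    """
    n = len(paragraphs)
    df = Counter()
    for p in paragraphs:
        df.update(set(tokenize(" ".join(p))))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}


def coherence(paragraph, idf):
    """Sum of tf-idf over the paragraph's words, divided by token count.

    ASSUMPTION: the normalization used in the paper is not stated.
    """
    tokens = tokenize(" ".join(paragraph))
    if not tokens:
        return 0.0
    tf = Counter(tokens)
    return sum(count * idf.get(w, 0.0) for w, count in tf.items()) / len(tokens)


def representative_paragraph(sentences, threshold=0.1):
    """Return the paragraph with the highest coherence score."""
    paragraphs = split_paragraphs(sentences, threshold)
    idf = idf_table(paragraphs)
    return max(paragraphs, key=lambda p: coherence(p, idf))
```

In a full pipeline, the representative paragraph of each document would then be vectorized and passed to a K-Means implementation in place of the whole document text.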
