Optimization of Number of Training Documents in Text Categorization

  • Journal of the Korean Society for Information Management
  • Abbr : JKOSIM
  • 2006, 23(4), pp.277~294
  • DOI : 10.3743/KOSIM.2006.23.4.277
  • Publisher : Korean Society for Information Management (한국정보관리학회)
  • Research Area : Interdisciplinary Studies > Library and Information Science
  • Received : November 25, 2006
  • Accepted : December 11, 2006
  • Published : December 30, 2006

Shim, Kyung 1

1아이리스닷넷

Accredited

ABSTRACT

This paper examines the level of categorization performance achievable on a real-life collection of article abstracts in the fields of science and technology, and tests the optimal number of documents per category in a training set using a kNN classifier. The corpus is built by first choosing categories that hold more than 2,556 documents and then randomly selecting 2,556 documents from each category. From this corpus, eight training subsets of different sizes are built by random selection, ranging from 20 documents (Tr20) to 2,000 documents (Tr2000) per category. The categorization performance of the eight subsets is compared. The average performance across the eight subsets is 30% in F1 measure, which is relatively poor compared with the findings of previous studies. The experimental results suggest that, among the eight subsets, Tr100 is the optimal size for training a kNN classifier. In addition, the correctness of the subject categories assigned to the training sets is probed by manually reclassifying the training sets, in order to support the above conclusion by establishing a relation between that correctness and categorization performance.
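
The per-category sampling described in the abstract can be illustrated with a minimal Python sketch. It is not the author's code: the function name `build_training_subsets`, the toy category labels, and the choice to draw each subset independently are assumptions made for illustration. Only the three sizes named in the abstract (20, 100, 2,000) are used, because the other five sizes are not stated there.

```python
import random


def build_training_subsets(docs_by_category, sizes, seed=0):
    """Randomly draw `size` documents per category for each requested size.

    docs_by_category: dict mapping a category label to a list of document ids.
    sizes: per-category training sizes, e.g. 20 for a subset named "Tr20".
    Returns a dict such as {"Tr20": {category: [doc ids]}, ...}.
    Assumption: each subset is drawn independently of the others.
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    subsets = {}
    for size in sizes:
        per_category = {}
        for category, docs in docs_by_category.items():
            if size > len(docs):
                raise ValueError(
                    f"category {category!r} has only {len(docs)} documents"
                )
            # sample without replacement within one category
            per_category[category] = rng.sample(docs, size)
        subsets[f"Tr{size}"] = per_category
    return subsets


# Toy corpus: two hypothetical categories with 2,556 document ids each.
corpus = {
    "cat_a": [f"a{i}" for i in range(2556)],
    "cat_b": [f"b{i}" for i in range(2556)],
}

# Only the sizes named in the abstract are listed here.
subsets = build_training_subsets(corpus, sizes=[20, 100, 2000])
```

Each resulting subset could then be used to train a kNN classifier and be scored with the F1 measure, so that performance can be compared across training sizes.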
