The Effect of the Quality of Pre-Assigned Subject Categories on the Text Categorization Performance

  • Journal of the Korean Society for Information Management
  • Abbr : JKOSIM
  • 2006, 23(2), pp.265~285
  • DOI : 10.3743/KOSIM.2006.23.2.265
  • Publisher : Korean Society for Information Management
  • Research Area : Interdisciplinary Studies > Library and Information Science
  • Received : May 30, 2006
  • Accepted : June 22, 2006
  • Published : June 30, 2006

Kyung Shim 1, Young-Mee Chung 2

1Systems R&D Center, Iris.Net
2Yonsei University

Accredited

ABSTRACT

Text categorization assumes that the labels assigned to training documents are correct to a certain degree, yet little is known about how correct the labels in real-world collections actually are. This study examines the quality of pre-assigned subject categories in a real-world collection and identifies the relationship between the quality of category assignment in the training set and text categorization performance. In particular, we ask to what extent performance can be improved by enhancing the quality (i.e., correctness) of the category assignments in the training documents. A collection of 1,150 abstracts in computer science was re-classified by an expert group and divided into 907 training documents and 227 test documents (15 duplicates were removed). The performance of the sets before and after re-classification, called the Initial set and the Recat-1/Recat-2 sets respectively, was compared using a kNN classifier. The average correctness of subject categories in the Initial set is 16%, and categorization with the Initial set yields an F1 value of 17%. By contrast, the Recat-1 set achieves an F1 value of 61%, about 3.6 times that of the Initial set.
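The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy documents, the labels `DB` and `NET`, the bag-of-words cosine similarity, the choice of k = 3, and the macro-averaged F1 are all assumptions made for illustration, since the abstract does not specify these details. The sketch trains the same kNN classifier once on correctly labeled documents and once on a copy with deliberately wrong labels, then scores both on the same test documents.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term-frequency vector (an assumed representation)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(train, text, k=3):
    """Majority vote over the k training documents most similar to `text`."""
    q = vectorize(text)
    ranked = sorted(train, key=lambda d: cosine(vectorize(d[0]), q), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def macro_f1(gold, pred):
    """Unweighted mean of per-category F1 over the categories in `gold`."""
    scores = []
    for c in sorted(set(gold)):
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def evaluate(train, test, k=3):
    """Classify every test document and return macro-averaged F1."""
    gold = [label for _, label in test]
    pred = [knn_predict(train, text, k) for text, _ in test]
    return macro_f1(gold, pred)

# Hypothetical toy collection with correct category labels.
texts = [
    "sql query index database",
    "relational database transaction",
    "database schema query optimizer",
    "network routing protocol packet",
    "wireless network packet latency",
    "tcp protocol congestion network",
]
clean_labels = ["DB", "DB", "DB", "NET", "NET", "NET"]
# Same documents with four of six labels wrong (simulated poor assignment).
noisy_labels = ["NET", "NET", "DB", "DB", "DB", "NET"]

train_clean = list(zip(texts, clean_labels))
train_noisy = list(zip(texts, noisy_labels))
test_docs = [("database query transaction", "DB"),
             ("network packet protocol", "NET")]

f1_clean = evaluate(train_clean, test_docs)
f1_noisy = evaluate(train_noisy, test_docs)
print(f"F1 with correct labels: {f1_clean:.2f}")
print(f"F1 with noisy labels:   {f1_noisy:.2f}")
```

On this toy data the classifier with mislabeled training documents scores lower than the one trained on correct labels, which is the direction of the effect the abstract reports. The size of the gap here says nothing about the 17% and 61% figures measured on the real collection.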
