An Experimental Study on Feature Selection Using Wikipedia for Text Categorization

  • Journal of the Korean Society for Information Management
  • Abbr : JKOSIM
  • 2012, 29(2), pp.155~171
  • DOI : 10.3743/KOSIM.2012.29.2.155
  • Publisher : Korean Society for Information Management
  • Research Area : Interdisciplinary Studies > Library and Information Science
  • Received : May 21, 2012
  • Accepted : June 16, 2012
  • Published : June 30, 2012

Yong Hwan Kim 1, Young-Mee Chung 1

1 Yonsei University

Accredited

ABSTRACT

In text categorization, core terms of an input document are rarely selected as classification features if they do not occur in the training document set. In addition, synonymous terms that express the same concept are usually treated as different features. This study aims to improve text categorization performance by integrating synonyms into a single feature and by replacing input terms that are absent from the training document set with the most similar term occurring in the training documents, using Wikipedia. For the selection of classification features, experiments were performed in various settings defined by three conditions: whether category information of non-training terms is used, which part of Wikipedia is used for measuring term-term similarity, and which similarity measure is applied. The categorization performance of a kNN classifier improved by 0.35 to 1.85% in F1 value in all experimental settings when non-training terms were replaced by the training term with the highest similarity above the threshold value. Although the improvement is not as large as expected, several semantic as well as structural devices of Wikipedia could be used for selecting more effective classification features.
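
The replacement step described in the abstract can be illustrated with a minimal sketch. It assumes a generic term-term similarity function and a fixed threshold. The abstract does not name the similarity measures or the threshold used in the study, so the Jaccard measure, the toy `contexts` data, and the names `replace_unseen_terms` and `sim` are all hypothetical stand-ins, not the authors' implementation.

```python
from typing import Callable, Iterable


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def replace_unseen_terms(
    doc_terms: Iterable[str],
    training_vocab: set[str],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> list[str]:
    """Replace each term missing from training_vocab with its most similar
    training term, but only when that similarity is above the threshold.
    Terms with no sufficiently similar training term are kept unchanged."""
    result = []
    for term in doc_terms:
        if term in training_vocab:
            result.append(term)
            continue
        # sorted() makes tie-breaking deterministic across runs
        best = max(
            sorted(training_vocab),
            key=lambda t: similarity(term, t),
            default=None,
        )
        if best is not None and similarity(term, best) > threshold:
            result.append(best)
        else:
            result.append(term)
    return result


# Hypothetical toy data: each term is described by a set of items that
# could be derived from Wikipedia (for example links or categories).
contexts = {
    "automobile": {"vehicle", "engine", "road"},
    "car": {"vehicle", "engine", "road", "wheel"},
    "banana": {"fruit", "plant"},
    "quasar": {"astronomy"},
}


def sim(a: str, b: str) -> float:
    """Assumed similarity: Jaccard overlap of the toy context sets."""
    return jaccard(contexts.get(a, set()), contexts.get(b, set()))


mapped = replace_unseen_terms(
    ["automobile", "banana", "quasar"], {"car", "banana"}, sim, threshold=0.5
)
print(mapped)
```

In this toy run, "automobile" is not a training term and is mapped to "car" because their similarity (0.75) is above the threshold, while "quasar" has no sufficiently similar training term and is left as is. The mapped terms would then be passed to the classifier, a kNN classifier in the study.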
