Utilizing Unlabeled Documents in Automatic Classification with Inter-document Similarities

  • Journal of the Korean Society for Information Management
  • Abbr : JKOSIM
  • 2007, 24(1), pp.251~271
  • DOI : 10.3743/KOSIM.2007.24.1.251
  • Publisher : Korean Society for Information Management (한국정보관리학회)
  • Research Area : Interdisciplinary Studies > Library and Information Science
  • Received : February 26, 2007
  • Accepted : March 14, 2007
  • Published : March 30, 2007

Kim, Pan Jun 1 Lee, Jae Yun 2

1 Silla University
2 Kyonggi University

Accredited

ABSTRACT

This paper studies the problem of classifying documents using both labeled and unlabeled training data, with particular attention to document similarity features. Exploiting unlabeled data is practically important because in many information systems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. A general semi-supervised learning algorithm has two steps. First, it trains a classifier on the available labeled documents and uses it to classify the unlabeled documents. Then, it trains a new classifier on all the training documents, whether they were labeled manually or automatically. We propose two types of semi-supervised learning algorithm that use document similarity features. The first is one-step semi-supervised learning, which uses unlabeled documents only to generate document similarity features. The second is two-step semi-supervised learning, which uses unlabeled documents as training examples as well as for similarity features. Experimental results obtained with support vector machines and a naive Bayes classifier show that combining a small set of labeled documents with a large set of unlabeled documents improves performance over supervised learning that uses only labeled data. Considering the efficiency of the classification system, the one-step semi-supervised learning algorithm proposed in this study could be a good solution for improving classification performance with unlabeled documents.
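The two variants described in the abstract can be illustrated with a minimal Python sketch. Several choices here are assumptions and are not taken from the paper: the similarity features are cosine similarities between a document and every document in the combined labeled and unlabeled pool, a simple nearest-centroid classifier stands in for the support vector machines and naive Bayes classifier used in the experiments, and all function names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words term counts for one document."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def similarity_features(doc, pool):
    """Represent a document by its similarity to every pool document (assumed feature design)."""
    return [cosine(doc, p) for p in pool]

def train_centroids(X, y):
    """Nearest-centroid stand-in classifier: one mean feature vector per class."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda c: math.dist(centroids[c], x))

def one_step(labeled, labels, unlabeled):
    """Unlabeled documents only extend the pool used for similarity features."""
    pool = [bow(d) for d in labeled + unlabeled]
    X = [similarity_features(bow(d), pool) for d in labeled]
    return train_centroids(X, labels), pool

def two_step(labeled, labels, unlabeled):
    """Unlabeled documents are also pseudo-labeled and added as training examples."""
    model, pool = one_step(labeled, labels, unlabeled)
    pseudo = [predict(model, similarity_features(bow(d), pool)) for d in unlabeled]
    X = [similarity_features(bow(d), pool) for d in labeled + unlabeled]
    return train_centroids(X, labels + pseudo), pool

def classify(model, pool, text):
    """Classify new text using the same pool-based similarity features."""
    return predict(model, similarity_features(bow(text), pool))

# Hypothetical toy data: two labeled and four unlabeled documents.
labeled = ["library catalog search index", "protein gene cell biology"]
labels = ["lis", "bio"]
unlabeled = [
    "catalog index metadata records",
    "gene expression cell culture",
    "metadata records archive",
    "culture sequencing expression",
]

model1, pool1 = one_step(labeled, labels, unlabeled)
model2, pool2 = two_step(labeled, labels, unlabeled)

# The query shares no words with any labeled document; only its similarity
# to unlabeled documents in the pool links it to a class.
query = "archive metadata"
result_one_step = classify(model1, pool1, query)
result_two_step = classify(model2, pool2, query)
print(result_one_step, result_two_step)
```

In this toy setting the query has no terms in common with either labeled document, so a classifier built on the labeled documents alone would have nothing to go on. The one-step variant trains a single classifier, while the two-step variant trains a second one on the pseudo-labeled documents, which is the efficiency trade-off the abstract alludes to.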
