@article{ART001050317,
author={Kim, Pan Jun and Lee, Jae Yun},
title={Utilizing Unlabeled Documents in Automatic Classification with Inter-document Similarities},
journal={Journal of the Korean Society for Information Management},
issn={1013-0799},
year={2007},
volume={24},
number={1},
pages={251--271},
doi={10.3743/KOSIM.2007.24.1.251}
}
TY - JOUR
AU - Kim, Pan Jun
AU - Lee, Jae Yun
TI - Utilizing Unlabeled Documents in Automatic Classification with Inter-document Similarities
JO - Journal of the Korean Society for Information Management
PY - 2007
VL - 24
IS - 1
PB - Korean Society for Information Management
SP - 251
EP - 271
SN - 1013-0799
AB - This paper studies the problem of classifying documents using both labeled and unlabeled training data, with particular attention to document similarity features. Using unlabeled data is practically important because in many information systems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. A general semi-supervised learning algorithm has two steps. First, it trains a classifier on the available labeled documents and uses it to classify the unlabeled documents. Then it trains a new classifier on all the training documents, whether they were labeled manually or automatically. We propose two types of semi-supervised learning algorithm that use document similarity features. The first is one-step semi-supervised learning, which uses unlabeled documents only to generate document similarity features. The second is two-step semi-supervised learning, which uses unlabeled documents as learning examples as well as for similarity features. Experimental results obtained with support vector machines and a naive Bayes classifier show that a small set of labeled documents combined with a large set of unlabeled documents yields better performance than supervised learning on labeled data alone. Considering the efficiency of a classifier system, the one-step semi-supervised learning algorithm proposed in this study could be a good solution for improving classification performance with unlabeled documents.
KW - automatic classification;text categorization;semi-supervised learning;unlabeled documents;document similarities;SVM classifier;naive Bayes classifier
DO - 10.3743/KOSIM.2007.24.1.251
ER -
Kim, Pan Jun and Lee, Jae Yun. (2007). Utilizing Unlabeled Documents in Automatic Classification with Inter-document Similarities. Journal of the Korean Society for Information Management, 24(1), 251-271.
Kim, Pan Jun and Lee, Jae Yun. 2007, "Utilizing Unlabeled Documents in Automatic Classification with Inter-document Similarities", Journal of the Korean Society for Information Management, vol.24, no.1, pp.251-271. Available from: doi:10.3743/KOSIM.2007.24.1.251
Kim, Pan Jun, Lee, Jae Yun "Utilizing Unlabeled Documents in Automatic Classification with Inter-document Similarities" Journal of the Korean Society for Information Management 24.1 pp.251-271 (2007) : 251.
Kim, Pan Jun, Lee, Jae Yun. Utilizing Unlabeled Documents in Automatic Classification with Inter-document Similarities. Journal of the Korean Society for Information Management. 2007; 24(1), 251-271. Available from: doi:10.3743/KOSIM.2007.24.1.251
Kim, Pan Jun and Lee, Jae Yun. "Utilizing Unlabeled Documents in Automatic Classification with Inter-document Similarities" Journal of the Korean Society for Information Management 24, no.1 (2007) : 251-271. doi: 10.3743/KOSIM.2007.24.1.251
Kim, Pan Jun; Lee, Jae Yun. Utilizing Unlabeled Documents in Automatic Classification with Inter-document Similarities. Journal of the Korean Society for Information Management, 24(1), 251-271. doi: 10.3743/KOSIM.2007.24.1.251
Kim, Pan Jun; Lee, Jae Yun. Utilizing Unlabeled Documents in Automatic Classification with Inter-document Similarities. Journal of the Korean Society for Information Management. 2007; 24(1) 251-271. doi: 10.3743/KOSIM.2007.24.1.251