@article{ART001096271,
author={Shim, Kyung},
title={Optimization of Number of Training Documents in Text Categorization},
journal={Journal of the Korean Society for Information Management},
issn={1013-0799},
year={2006},
volume={23},
number={4},
pages={277-294},
doi={10.3743/KOSIM.2006.23.4.277}
}
TY - JOUR
AU - Shim, Kyung
TI - Optimization of Number of Training Documents in Text Categorization
JO - Journal of the Korean Society for Information Management
PY - 2006
VL - 23
IS - 4
PB - 한국정보관리학회
SP - 277
EP - 294
SN - 1013-0799
AB - This paper examines the level of categorization performance on a real-life collection of abstract articles in the fields of science and technology, and tests the optimal number of documents per category in a training set using a kNN classifier. The corpus is built by first choosing categories that hold more than 2,556 documents and then randomly selecting 2,556 documents per category. It is further divided into eight subsets with different numbers of training documents: each subset is randomly selected, with training sets ranging from 20 documents (Tr20) to 2,000 documents (Tr2000) per category. The categorization performance of the eight subsets is compared. The average performance of the eight subsets is 30% in F1 measure, which is relatively poor compared with the findings of previous studies. The experimental results suggest that, among the eight subsets, Tr100 appears to be the optimal size for training a kNN classifier. In addition, the correctness of the subject categories assigned to the training sets is probed by manually reclassifying the training sets, in order to support the above conclusion by establishing a relation between that correctness and categorization performance.
KW - text categorization;KNN classifier;test collections;size of training documents
DO - 10.3743/KOSIM.2006.23.4.277
ER -
Shim, Kyung. (2006). Optimization of Number of Training Documents in Text Categorization. Journal of the Korean Society for Information Management, 23(4), 277-294.
Shim, Kyung. 2006, "Optimization of Number of Training Documents in Text Categorization", Journal of the Korean Society for Information Management, vol.23, no.4 pp.277-294. Available from: doi:10.3743/KOSIM.2006.23.4.277
Shim, Kyung "Optimization of Number of Training Documents in Text Categorization" Journal of the Korean Society for Information Management 23.4 pp.277-294 (2006) : 277.
Shim, Kyung. Optimization of Number of Training Documents in Text Categorization. 2006; 23(4), 277-294. Available from: doi:10.3743/KOSIM.2006.23.4.277
Shim, Kyung. "Optimization of Number of Training Documents in Text Categorization" Journal of the Korean Society for Information Management 23, no.4 (2006): 277-294. doi: 10.3743/KOSIM.2006.23.4.277
Shim, Kyung. Optimization of Number of Training Documents in Text Categorization. Journal of the Korean Society for Information Management, 23(4), 277-294. doi: 10.3743/KOSIM.2006.23.4.277
Shim, Kyung. Optimization of Number of Training Documents in Text Categorization. Journal of the Korean Society for Information Management. 2006; 23(4) 277-294. doi: 10.3743/KOSIM.2006.23.4.277