Classification of Literary Works(Novels) Using Text Mining (텍스트 마이닝에 의한 문학 작품 분류)

Chung, Wonil (정원일); Bahng, Seunghee (방승희); Park, Myung Kwan (박명관)

doi:10.33639/ptc.2021..35.016

PHILOSOPHY·THOUGHT·CULTURE
2021, (35), pp.381~407
DOI : 10.33639/ptc.2021..35.016
Publisher : Research Institute for East-West Thought
Research Area : Humanities > Other Humanities
Received : December 6, 2020
Accepted : January 29, 2021
Published : January 31, 2021

Chung, Wonil ¹, Bahng, Seunghee ², Park, Myung Kwan ¹

¹ Dongguk University
² Kookmin University

Accredited

ABSTRACT

This paper introduces a quantitative text analysis of literary works registered in Project Gutenberg, treated as a source of big data, and classifies those works using text mining techniques. After preprocessing the data in the programming language R, we measured the cosine similarity between chapters within a novel and between chapters of different novels in order to classify the novels. The cosine similarity between chapters of the same novel was relatively high, whereas the similarity between chapters of different novels was not. Furthermore, cluster analysis, an unsupervised machine learning task, showed strong cohesion in semantic distance, and classification analysis, a supervised machine learning task, showed high accuracy. In addition, we confirmed that children's novels can be classified as easy-to-read works because of the large cosine similarity values and small semantic distances between their chapters. Quantitative text analysis using text mining techniques is therefore expected to serve as a foundation for qualitative text analysis.
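
The chapter-level comparison described in the abstract can be illustrated with a minimal sketch. It is written in Python rather than the R used in the paper, and it is not the authors' pipeline: the tokenizer, the raw term-count weighting, the function names `term_counts` and `cosine_similarity`, and the three toy "chapters" are all assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count word tokens (a deliberately simple tokenizer)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty chapter shares nothing with any other chapter
    return dot / (norm_a * norm_b)


# Hypothetical toy "chapters"; not taken from the paper's corpus.
chapters = {
    ("novel_a", 1): "the garden was quiet and the garden gate was shut",
    ("novel_a", 2): "she opened the gate and walked into the quiet garden",
    ("novel_b", 1): "the ship sailed at dawn across a grey and restless sea",
}

vectors = {key: term_counts(text) for key, text in chapters.items()}
within = cosine_similarity(vectors[("novel_a", 1)], vectors[("novel_a", 2)])
between = cosine_similarity(vectors[("novel_a", 1)], vectors[("novel_b", 1)])
print(f"within novel_a: {within:.3f}, novel_a vs novel_b: {between:.3f}")
```

In this toy input the two chapters of the same invented novel share more vocabulary than chapters of different novels, so the within-novel score is the larger of the two. Nothing about the paper's actual results should be inferred from it; a real analysis would add the preprocessing steps the abstract mentions and would likely use a weighting such as TF-IDF.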

KEYWORDS

text mining, classification, clustering, cosine similarity, children’s novel

Citation status

[journal] Furnas, G. W. / 1983 / Statistical semantics : Analysis of the potential performance of keyword information systems / Bell System Technical Journal 62(6) : 1753~1806

[confproc] Hearst, Marti A. / 1992 / Automatic acquisition of hyponyms from large text corpora / Proceedings of the 14th Conference on Computational Linguistics : 539~545

[web] / http://blog.schoollibraryjournal.com/afuse8production/2010/03/15/top-100-childrens-novels-25-21/

[web] / http://blog.schoollibraryjournal.com/afuse8production/2010/03/31/top-100-childrens-novels-9/

[web] / http://blogs.slj.com/afuse8production/2012/06/13/top-100-childrens-novels-15-the-secret-garden-by-frances-hodgson-burnett/

[book] Jockers, Matthew L. / 2014 / Text Analysis with R for Students of Literature / Springer

[journal] Kumar, Anil A / 2012 / Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering / International Journal of Engineering Research & Technology (IJERT) 1(5)

[book] Mehl, M. R. / 2006 / Handbook of multimethod measurement in psychology / American Psychological Association : 141~156

[journal] Salton, G. / 1975 / A vector space model for automatic indexing / Communications of the ACM 18(11) : 613~620

[book] Weaver, W. / 1955 / Machine Translation of Languages: Fourteen Essays / MIT Press

[confproc] Webster, Jonathan J. / 1992 / Tokenization As The Initial Phase In NLP / Actes de COLING-92, Nantes : 23~28

This paper was written with support from the National Research Foundation of Korea.

PHILOSOPHY·THOUGHT·CULTURE 2024 KCI Impact Factor : 0.5

KCI Citation Counts (3)