Document Classification Methodology Using Autoencoder-based Keywords Embedding (오토인코더 기반 키워드 임베딩을 통한 문서 분류 방법론)

Seobin Yoon (윤서빈); Namgyu Kim (김남규)

doi:10.9708/jksci.2023.28.09.035

Journal of The Korea Society of Computer and Information
Abbr : JKSCI
2023, 28(9), pp.35~46
DOI : 10.9708/jksci.2023.28.09.035
Publisher : The Korean Society of Computer and Information
Research Area : Engineering > Computer Science
Received : August 9, 2023
Accepted : September 15, 2023
Published : September 27, 2023

Seobin Yoon ¹, Namgyu Kim ¹

¹ Kookmin University

Accredited

ABSTRACT

In this study, we propose a Dual Approach methodology that improves the accuracy of document classifiers by using both contextual and keyword information. First, contextual information is extracted using Google's BERT, a pre-trained language model known for its strong performance on a range of natural language understanding tasks. Specifically, we employ KoBERT, a model pre-trained on a Korean corpus, and take the CLS token as the contextual representation of each document. Second, keyword information is generated for each document by encoding its set of keywords into a single vector using an Autoencoder. We applied the proposed approach to 40,130 healthcare and medicine documents from the National R&D Projects database of the National Science and Technology Information Service (NTIS). The experimental results show that the proposed methodology achieves higher document classification accuracy than existing methods that rely solely on document information or solely on word information.
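The abstract does not give implementation details, so the following is only a minimal sketch of the dual-feature idea under stated assumptions. It assumes that each document's keyword set is represented as a multi-hot vector, that the Autoencoder has a single hidden layer, and that the keyword vector is concatenated with the contextual vector; none of these choices is confirmed by the source. The vocabulary, keyword sets, dimensions, and the names `multi_hot` and `train_autoencoder` are hypothetical, and the contextual vectors are random stand-ins for KoBERT CLS outputs.

```python
import numpy as np


def multi_hot(keyword_sets, vocab):
    """Turn each document's keyword set into a multi-hot row vector."""
    index = {kw: i for i, kw in enumerate(vocab)}
    x = np.zeros((len(keyword_sets), len(vocab)))
    for row, keywords in enumerate(keyword_sets):
        for kw in keywords:
            x[row, index[kw]] = 1.0
    return x


def train_autoencoder(x, hidden_dim, epochs=300, lr=1.0, seed=0):
    """Train a one-hidden-layer autoencoder (tanh encoder, sigmoid decoder)
    with full-batch gradient descent on mean squared reconstruction error.
    Returns the encoder function and the per-epoch loss history."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w_enc = rng.normal(0.0, 0.1, (d, hidden_dim))
    b_enc = np.zeros(hidden_dim)
    w_dec = rng.normal(0.0, 0.1, (hidden_dim, d))
    b_dec = np.zeros(d)
    losses = []
    for _ in range(epochs):
        h = np.tanh(x @ w_enc + b_enc)                   # encoded keywords
        y = 1.0 / (1.0 + np.exp(-(h @ w_dec + b_dec)))   # reconstruction
        err = y - x
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared error through both layers.
        g_out = (2.0 / (n * d)) * err * y * (1.0 - y)
        g_h = (g_out @ w_dec.T) * (1.0 - h ** 2)
        w_dec -= lr * (h.T @ g_out)
        b_dec -= lr * g_out.sum(axis=0)
        w_enc -= lr * (x.T @ g_h)
        b_enc -= lr * g_h.sum(axis=0)

    def encode(inputs):
        return np.tanh(inputs @ w_enc + b_enc)

    return encode, losses


# Hypothetical toy data; the study itself uses 40,130 NTIS documents.
vocab = ["cancer", "genome", "imaging", "vaccine", "clinical", "protein"]
keyword_sets = [
    {"cancer", "imaging"},
    {"genome", "protein"},
    {"vaccine", "clinical"},
    {"cancer", "clinical", "protein"},
]

x = multi_hot(keyword_sets, vocab)
encode, losses = train_autoencoder(x, hidden_dim=3)
keyword_vectors = encode(x)  # one keyword embedding per document

# Stand-in for the KoBERT CLS vectors (random values, assumed dimension 8).
cls_vectors = np.random.default_rng(1).normal(size=(len(keyword_sets), 8))

# Assumed fusion step: concatenate contextual and keyword vectors.
features = np.concatenate([cls_vectors, keyword_vectors], axis=1)
```

In this sketch, `features` would be the input to a downstream classifier. Concatenation is only one possible way to combine the two sources of information.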

KEYWORDS

Deep Learning, Document Classification, Keyword Embedding, Document Embedding, Pre-Trained Language Model

Journal of The Korea Society of Computer and Information 2025 KCI Impact Factor : 1.01
