Document Classification Methodology Using Autoencoder-based Keywords Embedding

  • Journal of The Korea Society of Computer and Information
  • Abbr : JKSCI
  • 2023, 28(9), pp.35-46
  • DOI : 10.9708/jksci.2023.28.09.035
  • Publisher : The Korean Society Of Computer And Information
  • Research Area : Engineering > Computer Science
  • Received : August 9, 2023
  • Accepted : September 15, 2023
  • Published : September 27, 2023

Seobin Yoon 1, Namgyu Kim 1

1 Kookmin University

Accredited

ABSTRACT

In this study, we propose a Dual Approach methodology that improves document classification accuracy by using both contextual and keyword information. First, contextual information is extracted with BERT, Google's pre-trained language model, which performs strongly across natural language understanding tasks. Specifically, we employ KoBERT, a model pre-trained on a Korean corpus, and take the [CLS] token representation as the contextual information. Second, keyword information is generated for each document by encoding its set of keywords into a single vector with an autoencoder. We applied the proposed approach to 40,130 healthcare and medicine documents from the National R&D Projects database of the National Science and Technology Information Service (NTIS). The experimental results show that the proposed methodology achieves higher classification accuracy than existing methods that rely solely on document or word information.
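
The keyword branch described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: keywords are represented as a multi-hot vector over a toy vocabulary, the autoencoder has a single tanh hidden layer trained with plain SGD, the `cls_vector` values are a hand-written placeholder standing in for a KoBERT [CLS] embedding, and the two vectors are combined by concatenation (the abstract does not state how they are fused). The names `multi_hot` and `TinyAutoencoder` are hypothetical.

```python
import math
import random


def multi_hot(keywords, vocab):
    """Encode a keyword set as a 0/1 vector over a fixed vocabulary (assumed input format)."""
    return [1.0 if term in keywords else 0.0 for term in vocab]


class TinyAutoencoder:
    """Illustrative autoencoder: tanh encoder, linear decoder, squared-error loss."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.enc = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.dec = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]

    def encode(self, x):
        # The hidden code is the single vector that summarizes the keyword set.
        return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in self.enc]

    def decode(self, h):
        return [sum(w * hj for w, hj in zip(row, h)) for row in self.dec]

    def loss(self, data):
        # Mean squared reconstruction error over all documents.
        total = 0.0
        for x in data:
            out = self.decode(self.encode(x))
            total += sum((o - xi) ** 2 for o, xi in zip(out, x)) / len(x)
        return total / len(data)

    def train(self, data, epochs=300, lr=0.1):
        for _ in range(epochs):
            for x in data:
                h = self.encode(x)
                err = [o - xi for o, xi in zip(self.decode(h), x)]
                # Backpropagate through the linear decoder and the tanh encoder;
                # constant factors of the loss are folded into the learning rate.
                grad_h = [
                    sum(err[i] * self.dec[i][j] for i in range(len(x))) * (1 - h[j] ** 2)
                    for j in range(len(h))
                ]
                for i in range(len(x)):
                    for j in range(len(h)):
                        self.dec[i][j] -= lr * err[i] * h[j]
                for j in range(len(h)):
                    for i in range(len(x)):
                        self.enc[j][i] -= lr * grad_h[j] * x[i]


# Toy vocabulary and documents, invented for illustration only.
vocab = ["cancer", "genome", "imaging", "vaccine", "clinical", "sensor"]
docs = [{"cancer", "genome"}, {"imaging", "sensor"},
        {"vaccine", "clinical"}, {"cancer", "clinical"}]
data = [multi_hot(d, vocab) for d in docs]

ae = TinyAutoencoder(n_in=len(vocab), n_hidden=2)
loss_before = ae.loss(data)
ae.train(data)
loss_after = ae.loss(data)

# Placeholder values standing in for a KoBERT [CLS] embedding (assumption).
cls_vector = [0.12, -0.40, 0.33, 0.05]
keyword_vector = ae.encode(data[0])

# Concatenation is an assumed fusion step; a classifier would consume `fused`.
fused = cls_vector + keyword_vector
```

In a realistic setting the placeholder `cls_vector` would come from the language model and the hidden layer would be much wider; the sketch only shows how a variable-size keyword set can be compressed to a fixed-length vector and paired with a contextual vector.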
