
Self-Supervised Document Representation Method

  • Journal of The Korea Society of Computer and Information
  • Abbr : JKSCI
  • 2020, 25(5), pp.187-197
  • DOI : 10.9708/jksci.2020.25.05.187
  • Publisher : The Korean Society Of Computer And Information
  • Research Area : Engineering > Computer Science
  • Received : April 7, 2020
  • Accepted : May 5, 2020
  • Published : May 29, 2020

Yun Yeo Il 1, Namgyu Kim 1

1 Kookmin University


ABSTRACT

Recently, various text embedding methods based on deep learning algorithms have been proposed. In particular, pre-trained language models, which are trained on tremendous amounts of text data, are now widely used to embed new text. However, traditional pre-trained language models have a limitation: they struggle to capture the unique context of new text when the text contains too many tokens. In this paper, we propose a self-supervised fine-tuning method for pre-trained language models that infers vectors for long texts. We applied the method to news articles, classified them into categories, and compared the classification accuracy with that of traditional models. The results confirm that the vectors generated by the proposed model express the inherent characteristics of a document more accurately than those generated by traditional models.
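The abstract describes inferring a single vector for a document that exceeds a language model's token limit. A common way to realize this, sketched below, is to split the document into fixed-size token chunks, encode each chunk, and mean-pool the chunk vectors into one document representation. This is a minimal illustration, not the paper's implementation: `encode_chunk` is a hypothetical stand-in for the (fine-tuned) pre-trained encoder, and all names and sizes here are assumptions.

```python
import numpy as np

def chunk_tokens(tokens, max_len=512):
    """Split a long token sequence into chunks that fit the model's input limit."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def encode_chunk(chunk, dim=768):
    # Hypothetical stand-in for a pre-trained language model encoder
    # (e.g., BERT); in the paper's setting this would be the output of
    # the self-supervised, fine-tuned model. Deterministic per chunk.
    rng = np.random.default_rng(abs(hash(tuple(chunk))) % (2 ** 32))
    return rng.standard_normal(dim)

def document_vector(tokens, max_len=512, dim=768):
    """Mean-pool chunk vectors into a single document representation."""
    chunks = chunk_tokens(tokens, max_len)
    vecs = np.stack([encode_chunk(c, dim) for c in chunks])
    return vecs.mean(axis=0)

doc = ["tok%d" % i for i in range(1300)]  # long document -> 3 chunks of <= 512
vec = document_vector(doc)
print(vec.shape)  # (768,)
```

The resulting fixed-length vector can then be fed to any downstream classifier (e.g., for the news-category classification experiment the abstract reports); mean pooling is only one of several aggregation choices.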
