
KorPatELECTRA: A Pre-trained Language Model for Korean Patent Literature to Improve Performance in the Field of Natural Language Processing (Korean Patent ELECTRA)

  • Journal of The Korea Society of Computer and Information
  • Abbr : JKSCI
  • 2022, 27(2), pp.15-23
  • DOI : 10.9708/jksci.2022.27.02.015
  • Publisher : The Korean Society Of Computer And Information
  • Research Area : Engineering > Computer Science
  • Received : December 14, 2021
  • Accepted : January 26, 2022
  • Published : February 28, 2022

Ji-Mo Jang 1, Jae-Ok Min 2, Han-Sung Noh 1

1 Korea Institute of Patent Information
2 Korea Institute of Patent Information, R&D Center

ABSTRACT

In the patent field, NLP (Natural Language Processing) is a challenging task because of the linguistic specificity of patent literature, so there is an urgent need for a language model optimized for Korean patent documents. Recently, there have been continuing efforts in NLP to build pre-trained language models for specific domains in order to improve performance on the various tasks of those fields. Among them, ELECTRA is a pre-trained language model released by Google after BERT that increases training efficiency through a new method called RTD (Replaced Token Detection). The purpose of this paper is to propose KorPatELECTRA, a model pre-trained on a large amount of Korean patent literature. In addition, optimal pre-training was conducted by preprocessing the training corpus according to the characteristics of patent literature and by applying a patent-specific vocabulary and tokenizer. To confirm its performance, KorPatELECTRA was evaluated on NER (Named Entity Recognition), MRC (Machine Reading Comprehension), and patent classification tasks using real patent data, and it achieved the best performance in all three tasks compared with general-purpose language models.
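The abstract does not specify the fine-tuning setup; as a rough illustration only, the sketch below shows how an ELECTRA-style checkpoint could be fine-tuned for token classification (NER) with the Hugging Face Transformers library. The checkpoint name "kipi/KorPatELECTRA", the label count, and the example sentence are assumptions for illustration, not values given in the paper.

```python
# Hypothetical sketch: fine-tuning an ELECTRA-style checkpoint for patent NER
# with Hugging Face Transformers. "kipi/KorPatELECTRA" is a placeholder name;
# the paper does not state a public checkpoint path.
import torch
from transformers import ElectraTokenizerFast, ElectraForTokenClassification

MODEL_NAME = "kipi/KorPatELECTRA"  # assumed identifier, not from the paper
NUM_LABELS = 5                     # illustrative label set (e.g. B/I entity tags + O)

tokenizer = ElectraTokenizerFast.from_pretrained(MODEL_NAME)
model = ElectraForTokenClassification.from_pretrained(MODEL_NAME, num_labels=NUM_LABELS)

# One toy training step: token-classification loss over a single patent-style sentence.
sentence = "본 발명은 리튬 이차 전지용 음극 활물질에 관한 것이다."
encoding = tokenizer(sentence, return_tensors="pt")
labels = torch.zeros_like(encoding["input_ids"])  # dummy all-"O" labels for illustration

outputs = model(**encoding, labels=labels)
outputs.loss.backward()
print(float(outputs.loss))
```

In practice the same pre-trained encoder would be wrapped with task-specific heads for MRC and patent classification as well; only the head and the labeled data change between the three downstream tasks reported in the paper.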

This paper was written with support from the National Research Foundation of Korea.