
KorPatELECTRA: A Pre-trained Language Model for Korean Patent Literature to Improve Performance in the Field of Natural Language Processing (Korean Patent ELECTRA)

  • Journal of The Korea Society of Computer and Information
  • Abbr : JKSCI
  • 2022, 27(2), pp.15-23
  • DOI : 10.9708/jksci.2022.27.02.015
  • Publisher : The Korean Society Of Computer And Information
  • Research Area : Engineering > Computer Science
  • Received : December 14, 2021
  • Accepted : January 26, 2022
  • Published : February 28, 2022

Ji-Mo Jang 1, Jae-Ok Min 2, Han-Sung Noh 1

1 Korea Institute of Patent Information
2 Korea Institute of Patent Information, R&D Center

ABSTRACT

In the patent field, NLP (Natural Language Processing) is a challenging task because of the linguistic specificity of patent literature, so there is an urgent need for a language model optimized for Korean patent documents. Recently, there have been continuing efforts in NLP to build pre-trained language models for specific domains in order to improve performance on the various tasks of those fields. Among them, ELECTRA is a pre-trained language model released by Google after BERT that increases training efficiency through a new method called RTD (Replaced Token Detection). The purpose of this paper is to propose KorPatELECTRA, a model pre-trained on a large amount of Korean patent literature. In addition, optimal pre-training was conducted by preprocessing the training corpus according to the characteristics of patent literature and by applying a patent-specific vocabulary and tokenizer. To confirm its performance, KorPatELECTRA was evaluated on NER (Named Entity Recognition), MRC (Machine Reading Comprehension), and patent classification tasks using real patent data, and it achieved the best performance in all three tasks compared with general-purpose language models.
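The abstract does not specify the fine-tuning setup; as a rough illustration only, the sketch below shows how an ELECTRA-style checkpoint could be fine-tuned for token classification (NER) with the Hugging Face Transformers library. The checkpoint name "kipi/KorPatELECTRA", the label count, and the example sentence are assumptions for illustration, not values given in the paper.

```python
# Hypothetical sketch: fine-tuning an ELECTRA-style checkpoint for patent NER
# with Hugging Face Transformers. "kipi/KorPatELECTRA" is a placeholder name;
# the paper does not state a public checkpoint path.
import torch
from transformers import ElectraTokenizerFast, ElectraForTokenClassification

MODEL_NAME = "kipi/KorPatELECTRA"  # assumed identifier, not from the paper
NUM_LABELS = 5                     # illustrative label set (e.g. B/I entity tags + O)

tokenizer = ElectraTokenizerFast.from_pretrained(MODEL_NAME)
model = ElectraForTokenClassification.from_pretrained(MODEL_NAME, num_labels=NUM_LABELS)

# One toy training step: token-classification loss over a single patent-style sentence.
sentence = "본 발명은 리튬 이차 전지용 음극 활물질에 관한 것이다."
encoding = tokenizer(sentence, return_tensors="pt")
labels = torch.zeros_like(encoding["input_ids"])  # dummy all-"O" labels for illustration

outputs = model(**encoding, labels=labels)
outputs.loss.backward()
print(float(outputs.loss))
```

In practice the same pre-trained encoder would be wrapped with task-specific heads for MRC and patent classification as well; only the head and the labeled data change between the three downstream tasks reported in the paper.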

This paper was written with support from the National Research Foundation of Korea.