A Novel Two-Stage Attacks on Korean Language Models: Single-Token Triggers Search and Morphology-Preserving Minimal Edits

  • Journal of The Korea Society of Computer and Information
  • Abbr : JKSCI
  • 2026, 31(2), pp.75~85
  • DOI : 10.9708/jksci.2026.31.02.075
  • Publisher : The Korean Society Of Computer And Information
  • Research Area : Engineering > Computer Science
  • Received : December 8, 2025
  • Accepted : January 26, 2026
  • Published : February 27, 2026

Areum Im 1, Taehwa Lee 1, Soojin Lee 1

1 Korea National Defense University

Accredited

ABSTRACT

In this study, we propose a novel two-stage attack framework for language models of Korean, an agglutinative language. The first stage is an inference-time universal adversarial trigger (UAT) attack that requires no intervention in the training process: using only gradient information, it searches for single-token triggers capable of reversing the model's predictions. The second stage is an adversarial example attack applied only to the samples on which the first stage failed. Following a morphology-preserving minimal-edit strategy, it replaces no more than two tokens, combining particles and suffixes. We evaluated the framework on the NSMC dataset with the KoBERT and KoELECTRA models. The experiments showed that triggers attached to the end of a sentence achieved a high attack success rate, because Korean tends to place key information at the end of the sentence. Words that express sentiment only indirectly also functioned as powerful triggers. The attack success rate was 0.963 against KoBERT and 0.940 against KoELECTRA.
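The abstract does not give the scoring rule used in the gradient-based trigger search, so the sketch below assumes a first-order (HotFlip-style) approximation of the kind commonly used for universal adversarial triggers. It also illustrates the cascade in which the second stage runs only on samples the first stage failed to flip. The function names, the toy embedding table, and the `stage1`/`stage2` callables are hypothetical and are not taken from the paper or from KoBERT/KoELECTRA.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")


def rank_trigger_candidates(
    embeddings: Sequence[Sequence[float]],
    grad: Sequence[float],
    current_id: int,
    top_k: int = 3,
) -> List[int]:
    """Rank replacement tokens for a single trigger slot.

    Assumed first-order estimate (not confirmed by the source): swapping the
    current token embedding e_cur for e_v changes the loss by roughly
    (e_v - e_cur) . grad, where grad is the loss gradient with respect to
    the trigger embedding. A larger score means a larger expected increase.
    """
    e_cur = embeddings[current_id]
    scored = []
    for token_id, e_v in enumerate(embeddings):
        if token_id == current_id:
            continue
        score = sum((a - b) * g for a, b, g in zip(e_v, e_cur, grad))
        scored.append((score, token_id))
    # Highest estimated loss increase first; ties broken by token id.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [token_id for _, token_id in scored[:top_k]]


def two_stage_success_rate(
    samples: Sequence[T],
    stage1: Callable[[T], bool],
    stage2: Callable[[T], bool],
) -> Dict[str, float]:
    """Run stage 2 only on samples where stage 1 failed; report success rates.

    Each callable returns True when the attack reverses the model's prediction.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    stage1_hits = [bool(stage1(x)) for x in samples]
    # Short-circuit keeps stage 2 from touching samples stage 1 already flipped.
    stage2_hits = [
        (not hit) and bool(stage2(x)) for x, hit in zip(samples, stage1_hits)
    ]
    n = len(samples)
    return {
        "stage1": sum(stage1_hits) / n,
        "overall": (sum(stage1_hits) + sum(stage2_hits)) / n,
    }
```

In a real setting the gradient would come from backpropagating the classification loss to the trigger's embedding, and each top-ranked candidate would still need to be re-evaluated with a forward pass, since the first-order score is only an estimate.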
