A Defensive Data Preprocessing Pipeline for Mitigating Data Poisoning in AI Code Generators

  • Journal of The Korea Society of Computer and Information
  • Abbr : JKSCI
  • 2026, 31(6), pp.153~159
  • Publisher : The Korean Society of Computer and Information
  • Research Area : Engineering > Computer Science
  • Received : March 26, 2026
  • Accepted : June 8, 2026
  • Published : June 30, 2026

Min-Ju Kang 1, Jin-Young Kim 1

1 Sungkyunkwan University

Accredited

ABSTRACT

The rapid adoption of AI code generators based on Large Language Models (LLMs) has been accompanied by an escalation in data poisoning attacks. Previous research focuses primarily on post-training defense mechanisms, whereas preventative measures at the training-data level remain scarce; as a result, poisoned data continues to exert a direct and persistent influence on the internal representations of these models. This study therefore proposes a defensive data preprocessing pipeline that addresses data poisoning attacks on AI code generators. The pipeline uses a code language model to compute a distributional anomaly score and a consistency score for each data point. These scores are then combined with Common Vulnerability Scoring System (CVSS) scores to determine a final risk level, which enables the systematic identification and removal of high-risk data and improves the overall reliability of the training set. Applying the pipeline reduced the Attack Success Rate (ASR) by approximately 75% relative to the baseline. These findings demonstrate that the proposed preprocessing pipeline effectively mitigates data poisoning and highlight the practical viability of a data-centric defense strategy.
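
The abstract does not specify how the three signals are combined, so the following is only a minimal sketch of one possible risk-scoring and filtering step, not the authors' method. The `Sample` fields, the linear weighting, the weights, and the `threshold` value are all hypothetical assumptions introduced for illustration; the only external fact relied on is that CVSS base scores range from 0.0 to 10.0. The last function shows how a relative ASR reduction, such as the roughly 75% figure reported, would be computed from a baseline and a defended ASR.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One training example with precomputed scores (all fields hypothetical)."""
    code: str
    anomaly: float      # assumed in [0, 1]; higher = further from the expected distribution
    consistency: float  # assumed in [0, 1]; higher = more consistent
    cvss: float         # CVSS base score in [0, 10]


def risk_score(s: Sample,
               w_anomaly: float = 0.4,
               w_inconsistency: float = 0.3,
               w_cvss: float = 0.3) -> float:
    """Combine the three signals into a risk value in [0, 1].

    The weighted sum and the default weights are assumptions; the paper's
    actual integration rule is not given in the abstract.
    """
    return (w_anomaly * s.anomaly
            + w_inconsistency * (1.0 - s.consistency)
            + w_cvss * (s.cvss / 10.0))


def filter_training_set(samples: list[Sample],
                        threshold: float = 0.6) -> list[Sample]:
    """Keep only samples whose risk is below the (assumed) threshold."""
    return [s for s in samples if risk_score(s) < threshold]


def relative_asr_reduction(asr_baseline: float, asr_defended: float) -> float:
    """Fractional drop in Attack Success Rate relative to the baseline."""
    return (asr_baseline - asr_defended) / asr_baseline


if __name__ == "__main__":
    benign = Sample("def add(a, b): return a + b", 0.1, 0.9, 0.0)
    suspect = Sample("os.system(user_input)", 0.9, 0.2, 9.8)
    kept = filter_training_set([benign, suspect])
    print([s.code for s in kept])
    # Illustrative numbers only, not results from the paper.
    print(relative_asr_reduction(0.40, 0.10))
```

A weighted sum is used here only because it is the simplest way to merge scores on a common scale; a rule-based or tiered risk level would fit the abstract's description equally well.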