
Performance and Stability Analysis of a Fine-Tuned BERT-BiLSTM-CRF NER Model for Automated Information Extraction in Openly Licensed Works

  • Journal of Software Assessment and Valuation
  • Abbr : JSAV
  • 2025, 21(4), pp.53~61
  • Publisher : Korea Software Assessment and Valuation Society
  • Research Area : Engineering > Computer Science
  • Received : December 2, 2025
  • Accepted : December 20, 2025
  • Published : December 26, 2025

Hwang Sung Hun 1 Milandu Keith Moussavou Boussougou 2 Dong Joo Park 1

1Soongsil University
2Soongsil University

Accredited

ABSTRACT

Korean legal documents pose challenges for information extraction due to complex layouts, Optical Character Recognition (OCR) noise, and agglutinative morphology. This paper proposes an automated Named-Entity Recognition (NER) pipeline that integrates Qwen-VL-based OCR, a Begin-Inside-Outside (B-I-O)-tagged training dataset, and fine-tuned BERT-family encoders with a BiLSTM-Conditional Random Field (CRF) decoder. We fine-tune mBERT, KLUE-RoBERTa-Large, and XLM-RoBERTa-Large under both Pure and BiLSTM-CRF settings, incorporating 30% OCR-style noise into the training data. Five-fold cross-validation demonstrates that the CRF-enhanced models achieve more stable and structurally consistent predictions, with XLM-RoBERTa-Large-CRF reaching an average F1-score of 0.998. The results highlight a practical design for robust NER in noisy OCR environments.
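To make the "30% OCR-style noise" step concrete, the sketch below shows one plausible way to corrupt roughly 30% of tokens with confusable-character substitutions before training. The confusion table, function name, and API are illustrative assumptions for this page, not the authors' actual implementation.

```python
import random

# Hypothetical confusable-character pairs typical of OCR errors (assumption,
# not taken from the paper; a Korean pipeline would also include Hangul pairs).
CONFUSIONS = {"0": "O", "O": "0", "1": "l", "l": "1", "S": "5"}

def inject_ocr_noise(tokens, rate=0.3, seed=42):
    """Corrupt roughly `rate` of the tokens with one confusable-character swap.

    A seeded random.Random instance keeps the corruption reproducible
    across cross-validation folds.
    """
    rng = random.Random(seed)
    noisy = []
    for tok in tokens:
        if rng.random() < rate:
            # Apply at most one substitution per selected token.
            for src, dst in CONFUSIONS.items():
                if src in tok:
                    tok = tok.replace(src, dst, 1)
                    break
        noisy.append(tok)
    return noisy
```

Because the NER labels are per-token B-I-O tags, token-level (rather than character-insertion) noise keeps the token/tag alignment intact, which is one reason this style of corruption is convenient for augmenting tagged training data.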
