Embedding Model-Based Approach to Duplicate Verification in MARC Records

  • Journal of Korean Library and Information Science Society
  • Abbr : JKLISS
  • 2025, 56(4), pp.1~20
  • Publisher : Korean Library And Information Science Society
  • Research Area : Interdisciplinary Studies > Library and Information Science
  • Received : November 19, 2025
  • Accepted : December 18, 2025
  • Published : December 30, 2025

Soon-Young Lee 1 Song Min-Geon 1 Soosang Lee 1

1 Pusan National University

Accredited

ABSTRACT

This study aimed to improve the performance of duplicate verification algorithms for MARC records by applying AI technology. To overcome the limitations of existing rule-based algorithms, we used AI embedding models, which capture the semantic similarity of text, to vectorize MARC records and to verify duplicates through similarity search and semantic similarity analysis. The research proceeded in two phases. First, we implemented a duplicate verification algorithm for MARC records based on vector similarity search with embedding models and evaluated it on the same dataset as the prior study. Second, drawing on the results of that first experiment, we implemented an algorithm designed to maximize the main advantage of the embedding approach: identifying duplicate records that arise from variations in string notation. We evaluated this algorithm with newly constructed experimental data and evaluation metrics. The experimental dataset was built by applying eight transformation rules that reflect notational variations likely to occur in actual library settings. In the first experiment, the rate at which identical groups were correctly identified as duplicates improved over the prior study. However, the embedding approach showed limitations where numbers and special characters must match precisely; for example, it incorrectly judged multi-volume materials with different volume information to be similar. The second experiment, designed to validate the advantages of the embedding approach, achieved 100% identification of both duplicate records and transformation rules across the entire experimental dataset.
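The similarity-search step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the embedding model, the MARC fields used, or the similarity threshold, so everything below is an assumption. In particular, `embed` is a toy character-trigram vectorizer standing in for a real embedding model, and `find_duplicates`, the `records` layout, and the `0.8` threshold are hypothetical names and values chosen only for illustration.

```python
import math
import re
from collections import Counter


def embed(text, n=3):
    """Toy stand-in for an embedding model (assumption, not the paper's model).

    Lowercases the text, collapses whitespace, and returns a sparse vector
    of character n-gram counts.
    """
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    padded = f" {normalized} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_duplicates(records, threshold=0.8):
    """Return (id, id, score) for every record pair at or above the threshold.

    `records` maps a record ID to a text string built from MARC fields
    (hypothetical layout). The threshold is an illustrative value.
    """
    vectors = {rid: embed(text) for rid, text in records.items()}
    ids = sorted(vectors)
    pairs = []
    for i, left in enumerate(ids):
        for right in ids[i + 1:]:
            score = cosine(vectors[left], vectors[right])
            if score >= threshold:
                pairs.append((left, right, round(score, 3)))
    return pairs


if __name__ == "__main__":
    sample = {
        "r1": "Introduction to cataloging",
        "r2": "introduction  to Cataloging",  # notational variant of r1
        "r3": "Korean history. Vol. 1",
        "r4": "Korean history. Vol. 2",  # different volume, nearly same text
    }
    for pair in find_duplicates(sample):
        print(pair)
```

In this sketch, the notational variants `r1` and `r2` are flagged as duplicates, which mirrors the advantage the abstract reports. The pair `r3` and `r4` is also flagged even though the volume numbers differ, which illustrates the kind of limitation the abstract describes for numbers that require exact matching. A real system would presumably replace `embed` with an actual embedding model and might add an exact-match check on volume fields, but neither detail is specified in the abstract.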
