A method for metadata extraction from a collection of records using Named Entity Recognition in Natural Language Processing

  • Journal of Korean Society of Archives and Records Management
  • Abbr : JRMASK
  • 2024, 24(2), pp.65~88
  • DOI : 10.14404/JKSARM.2024.24.2.065
  • Publisher : Korean Society of Archives and Records Management
  • Research Area : Interdisciplinary Studies > Library and Information Science > Archival Studies / Conservation
  • Received : April 16, 2024
  • Accepted : May 10, 2024
  • Published : May 31, 2024

Chiho Song 1

1 Director, Research Institute for Korean Archives and Records ((사)한국국가기록연구원, an incorporated association)

Accredited

ABSTRACT

This pilot study explores a method of extracting metadata values and descriptions from records using named entity recognition (NER), a technique in natural language processing (NLP), which is a subfield of artificial intelligence. The study focuses on handwritten records from the Guro Industrial Complex, produced during the 1960s and 1970s, comprising approximately 1,200 pages and 80,000 words. After preprocessing the records, which included digitization, the study employed a publicly available language API based on Google’s Bidirectional Encoder Representations from Transformers (BERT) language model to recognize named entities within the text. As a result, 173 personal names and 314 names of organizations and institutions were extracted from the Guro Industrial Complex’s past records. These extracted entities are expected to serve as direct search terms for accessing the contents of the records. The study also identifies challenges that arose when applying the theoretical methodology of NLP to real-world records consisting of semistructured text, and it presents potential solutions and implications to consider when addressing these issues.
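The step from recognized entities to searchable metadata values can be illustrated with a minimal sketch. It assumes that an NER service has already returned entity mentions as (page, text, label) tuples. The label names `PERSON` and `ORGANIZATION`, the function `build_entity_index`, and the sample mentions are hypothetical and are not taken from the study or from any particular API.

```python
from collections import defaultdict

# Hypothetical NER output: (page number, entity text, entity label).
# The label names and sample values are illustrative assumptions,
# not data or an API response from the study.
SAMPLE_MENTIONS = [
    (1, "Hong Gildong", "PERSON"),
    (1, "Example Textile Co.", "ORGANIZATION"),
    (2, "hong gildong ", "PERSON"),
    (3, "Example Textile Co.", "ORGANIZATION"),
    (3, "Seoul", "LOCATION"),
]


def build_entity_index(mentions, keep_labels=("PERSON", "ORGANIZATION")):
    """Group mentions by label and normalized name, recording page numbers."""
    index = {label: defaultdict(set) for label in keep_labels}
    display = {}  # first-seen spelling for each (label, normalized name)
    for page, text, label in mentions:
        if label not in index:
            continue  # skip entity types not used as metadata here
        cleaned = " ".join(text.split())  # collapse stray whitespace
        key = cleaned.casefold()          # case-insensitive matching
        display.setdefault((label, key), cleaned)
        index[label][key].add(page)
    return {
        label: {display[(label, key)]: sorted(pages) for key, pages in names.items()}
        for label, names in index.items()
    }


if __name__ == "__main__":
    for label, names in build_entity_index(SAMPLE_MENTIONS).items():
        print(label, len(names), names)
```

Each distinct name becomes one candidate search term linked to the pages where it occurs. Real records would likely need stronger normalization than whitespace and case folding, for example to reconcile spelling variants introduced during transcription.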
