A method for metadata extraction from a collection of records using Named Entity Recognition in Natural Language Processing

Chiho Song (송치호)

doi:10.14404/JKSARM.2024.24.2.065

Journal of Korean Society of Archives and Records Management
Abbr : JRMASK
2024, 24(2), pp.65~88
DOI : 10.14404/JKSARM.2024.24.2.065
Publisher : Korean Society of Archives and Records Management
Research Area : Interdisciplinary Studies > Library and Information Science > Archival Studies / Conservation
Received : April 16, 2024
Accepted : May 10, 2024
Published : May 31, 2024

Chiho Song ¹

¹ Director, Research Institute for Korean Archives and Records (한국국가기록연구원, an incorporated association)

Accredited

ABSTRACT

This pilot study explores a method for extracting metadata values and descriptions from records using named entity recognition (NER), a technique in natural language processing (NLP), which is a subfield of artificial intelligence. The study focuses on handwritten records from the Guro Industrial Complex produced during the 1960s and 1970s, comprising approximately 1,200 pages and 80,000 words. After preprocessing the records, including digitization, the study used a publicly available language API based on Google’s Bidirectional Encoder Representations from Transformers (BERT) language model to recognize named entities in the text. As a result, 173 names of people and 314 names of organizations and institutions were extracted from the records. These extracted entities are expected to serve as direct search terms for accessing the contents of the records. The study also identifies challenges that arose when applying the theoretical methodology of NLP to real-world records consisting of semistructured text, and presents potential solutions and implications to consider when addressing them.
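The step from recognized entities to searchable metadata can be illustrated with a minimal sketch. It assumes the NER step has already produced a list of mentions, each with a text, a label, and a page number. The label names, the `build_entity_index` helper, and the sample mentions are all hypothetical and are not taken from the study, whose API and tag set are not specified in the abstract.

```python
from collections import defaultdict

# Hypothetical label names; the API used in the study may use different tags.
PERSON, ORGANIZATION = "PERSON", "ORGANIZATION"


def build_entity_index(mentions):
    """Group NER mentions into per-type indexes: name -> sorted list of pages."""
    index = {PERSON: defaultdict(set), ORGANIZATION: defaultdict(set)}
    for m in mentions:
        label = m["label"]
        if label not in index:
            continue  # ignore entity types outside the two metadata fields
        name = " ".join(m["text"].split())  # collapse repeated whitespace
        if name:
            index[label][name].add(m["page"])
    # Convert sets of pages to sorted lists for stable output.
    return {
        label: {name: sorted(pages) for name, pages in names.items()}
        for label, names in index.items()
    }


# Made-up sample mentions for illustration only (not data from the study).
sample = [
    {"text": "Kim  Example", "label": PERSON, "page": 3},
    {"text": "Kim Example", "label": PERSON, "page": 7},
    {"text": "Example Trading Co.", "label": ORGANIZATION, "page": 3},
    {"text": "1968", "label": "DATE", "page": 3},
]
index = build_entity_index(sample)
```

Deduplicating by normalized name and keeping page numbers is one plausible way to turn raw mentions into direct search terms; real handwritten records would likely also need spelling-variant merging, which this sketch does not attempt.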

KEYWORDS

AI, NLP, Metadata, LLM, NER

Journal of Korean Society of Archives and Records Management 2024 KCI Impact Factor : 0.72
