A Study on Rule-Based Text Mining and Named Entity Recognition Approaches for Data Formalization of “The Memoirs of Casanova”

  • Journal of The Korea Society of Computer and Information
  • Abbr : JKSCI
  • 2026, 31(5), pp.165~178
  • Publisher : The Korean Society Of Computer And Information
  • Research Area : Engineering > Computer Science
  • Received : March 27, 2026
  • Accepted : May 5, 2026
  • Published : May 29, 2026

Sunghoon Jeong 1 Yujin Noh 1 Jeongeun Hwang 1 Jinsun Kim 1 Hajin Kim 1 Hyoji Ha 1

1 Ajou University

Accredited

ABSTRACT

This study constructs and analyzes structured data by applying digital humanities methodologies to Casanova's memoirs, considered the most extensive autobiographical record of the 18th century. Working from the text of the memoirs, we designed a data refinement pipeline that integrates NLP tools such as Stanza, spaCy, and NRCLex with generative AI. To resolve the complex naming conventions and titles in the text, we introduced a rule-based algorithm that verifies data accuracy. Through this process, we constructed structured data for 1,924 individuals, covering seven attributes including gender, mention frequency, and associated emotion words. Analysis of the data revealed a distinct alternation between sections marked by large-scale influxes of new characters and sections in which a select few individuals appear repeatedly, forming dense relationship networks. Notably, in sections centered on public and institutional events, as well as in the latter half of the narrative, the proportion of female characters fell below half, a pattern in which the narrative converges on a core group of male figures. To validate the methodology, the same system was applied to Benjamin Franklin's autobiography. It operated stably and achieved higher accuracy than conventional methods in areas such as regional classification, demonstrating its accessibility and scalability.
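
The abstract does not spell out the rule-based step, so the following is only a minimal sketch of how title handling and mention counting might look. The title table, the function names, and the record layout are all assumptions made for illustration, not the authors' rule set; in the described pipeline the mentions would come from an NER tool such as Stanza or spaCy, whereas here they are passed in as plain strings.

```python
from collections import Counter

# Hypothetical title-to-gender table (an assumption, not the paper's rules).
TITLE_GENDER = {
    "m": "male", "mr": "male", "monsieur": "male", "count": "male",
    "signor": "male", "abbe": "male",
    "mme": "female", "mlle": "female", "madame": "female",
    "countess": "female", "signora": "female",
}


def normalize_mention(mention):
    """Strip leading titles and return (canonical_name, gender_or_None)."""
    tokens = mention.split()
    gender = None
    while tokens:
        key = tokens[0].lower().rstrip(".")
        if key not in TITLE_GENDER:
            break
        if gender is None:
            gender = TITLE_GENDER[key]  # the first title decides the gender hint
        tokens.pop(0)
    return " ".join(tokens), gender


def build_records(mentions):
    """Merge title variants of a name and count how often each name occurs."""
    counts = Counter()
    genders = {}
    for mention in mentions:
        name, gender = normalize_mention(mention)
        if not name:
            continue  # a bare title such as "Madame" cannot be resolved here
        counts[name] += 1
        if gender and name not in genders:
            genders[name] = gender
    return {
        name: {"gender": genders.get(name), "mentions": n}
        for name, n in counts.items()
    }
```

For example, `build_records(["Madame d'Urfe", "Mme. d'Urfe"])` merges both variants into a single `d'Urfe` record. A rule this simple would also merge a "Count X" and a "Countess X" into one person, which is the kind of ambiguity a fuller rule set or a manual verification pass would need to handle.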

This paper was written with support from the National Research Foundation of Korea.