This study constructs and analyzes structured data by applying digital humanities methodologies to Casanova's memoirs, considered the most extensive autobiographical record of the 18th century. Working from the text of the memoirs, we designed a data refinement pipeline that integrates NLP tools (such as Stanza, spaCy, and NRCLex) with generative AI. To handle the complex naming conventions and titles found in the text, we introduced a rule-based algorithm that verifies the accuracy of the extracted data.
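As an illustration of what a rule-based check on names and titles might look like, the following minimal sketch groups surface forms that differ only by an honorific or a nobiliary particle, and records the gender implied by the title. It is not the study's algorithm: the title lists, the function names `parse_mention` and `group_mentions`, and the sample mentions are all hypothetical and chosen only for the example.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical, deliberately short lists; a real rule set would be far larger.
MALE_TITLES = {"m", "mr", "monsieur", "signor", "count", "marquis", "duke", "don"}
FEMALE_TITLES = {"mme", "madame", "mlle", "mademoiselle", "signora",
                 "countess", "marquise", "duchess", "donna"}
PARTICLES = {"de", "di", "da", "d", "von"}
PREFIXES = MALE_TITLES | FEMALE_TITLES | PARTICLES


def parse_mention(mention: str) -> Tuple[str, Optional[str]]:
    """Return (canonical key, gender hint) for one surface form."""
    # Keep runs of letters only, so "M." becomes "m" and "d'Urfé" becomes "d", "urfé".
    tokens = re.findall(r"[^\W\d_]+", mention.lower())
    gender = None
    # Strip leading titles and particles, but never the last remaining token.
    while len(tokens) > 1 and tokens[0] in PREFIXES:
        head = tokens.pop(0)
        if gender is None and head in MALE_TITLES:
            gender = "M"
        elif gender is None and head in FEMALE_TITLES:
            gender = "F"
    return " ".join(tokens), gender


def group_mentions(mentions: Iterable[str]) -> Dict[str, dict]:
    """Group surface forms by canonical key and collect their gender hints."""
    groups = defaultdict(lambda: {"forms": set(), "genders": set()})
    for mention in mentions:
        key, gender = parse_mention(mention)
        groups[key]["forms"].add(mention)
        if gender is not None:
            groups[key]["genders"].add(gender)
    return dict(groups)


groups = group_mentions([
    "M. de Bragadin", "Bragadin", "Count de Bragadin",
    "Madame d'Urfé", "Marquise d'Urfé",
])
```

A group whose `genders` set contains both "M" and "F" would signal that two different people share a surname and should not be merged, which is the kind of case a verification rule would need to flag for review.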
Through this process, we constructed structured data for 1,924 individuals, with seven attributes per individual, including gender, mention frequency, and associated emotion words. Analysis of the data revealed a distinct alternation between sections marked by large influxes of new characters and sections in which a small number of individuals appear repeatedly, forming dense relationship networks.
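The alternation between influx sections and recurrence sections can be made concrete with a small sketch. It assumes a hypothetical per-person record holding only the three attributes named above (the remaining four are not specified here) and assumes that mentions are counted per numbered section; the names `Person` and `section_profile` and the sample data are illustrative, not the study's schema or results.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class Person:
    name: str
    gender: str  # "F", "M", or "unknown"
    # Assumed layout: section number -> number of mentions in that section.
    mentions_by_section: Counter = field(default_factory=Counter)
    emotion_words: Counter = field(default_factory=Counter)

    @property
    def mention_frequency(self) -> int:
        """Total number of mentions across all sections."""
        return sum(self.mentions_by_section.values())

    @property
    def first_section(self) -> Optional[int]:
        """Earliest section in which the person is mentioned, if any."""
        seen = [s for s, n in self.mentions_by_section.items() if n > 0]
        return min(seen, default=None)


def section_profile(people: List[Person], sections: Iterable[int]) -> Dict[int, dict]:
    """Count newly introduced versus recurring characters in each section."""
    profile = {}
    for s in sections:
        present = [p for p in people if p.mentions_by_section[s] > 0]
        new = sum(1 for p in present if p.first_section == s)
        profile[s] = {"new": new, "recurring": len(present) - new}
    return profile


# Toy data for illustration only.
people = [
    Person("a", "F", Counter({1: 3, 2: 1})),
    Person("b", "M", Counter({2: 5})),
    Person("c", "M", Counter({1: 1})),
]
profile = section_profile(people, [1, 2])
```

Under these assumptions, a section with a high `new` count would correspond to an influx of characters, while a section dominated by `recurring` would correspond to a small group appearing repeatedly.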
Notably, in sections centered on public and institutional events, as well as in the latter half of the narrative, the proportion of female characters dropped sharply to below half, showing a pattern in which the narrative converges on a core group of male figures. To validate the methodology, we applied the same system to Benjamin Franklin's autobiography. The system operated stably and achieved higher accuracy than conventional methods in areas such as regional classification, demonstrating its accessibility and scalability.