Fine-tuning of Korean Text-to-Speech Model Based on Elderly Speaker Voice Data

  • Journal of Internet of Things and Convergence
  • Abbr : JKIOTS
  • 2026, 12(1), pp.9~16
  • Publisher : The Korea Internet of Things Society
  • Research Area : Engineering > Computer Science > Internet Information Processing
  • Received : December 31, 2025
  • Accepted : February 20, 2026
  • Published : February 28, 2026

YeongJu Kim 1 Kwangmoon Cho 1 Do Hyun Lee 1

1 Mokpo National University

Accredited

ABSTRACT

This study proposes a method for building a Text-to-Speech (TTS) model under limited-data conditions using Korean voice data from elderly speakers aged 60 to 90. We collected approximately 250 minutes of voice data from 50 elderly speakers (25 male and 25 female) and fine-tuned the XTTS (Cross-lingual Text-to-Speech) model using an average of 5 minutes of data per speaker. In the preprocessing stage, we refined speech segments and transcription quality using Voice Activity Detection (VAD) and Automatic Speech Recognition (ASR) with the Whisper large-v3 model. We improved training efficiency and stability by applying mixed-precision training and a cosine annealing learning-rate scheduler. Experimental results show that, with optimal hyperparameter settings, most speakers achieved low Word Error Rate (WER) and Character Error Rate (CER). Even for speakers who showed high error rates in the initial training phase, performance improved significantly after retraining. This study demonstrates the feasibility of building a Korean TTS system that reflects the voice characteristics of elderly speakers, and it provides an efficient few-shot learning methodology applicable when the data available per speaker is extremely limited.
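
The abstract reports results as WER and CER. As a rough illustration of how these metrics are conventionally defined (edit distance between a reference transcript and a recognized transcript, divided by the reference length), a minimal Python sketch follows. The function names and the whitespace handling for Korean are assumptions made for this example; the abstract does not state which tool or text normalization the authors used.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, counting
    substitutions, insertions, and deletions with unit cost."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words.
    Assumption: words are split on whitespace (eojeol units for Korean)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length.
    Assumption: spaces are removed before comparison."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # Hypothetical sentence pair: the text given to the TTS model and
    # an ASR transcript of the synthesized audio.
    reference = "오늘 날씨가 좋습니다"
    hypothesis = "오늘 날씨는 좋습니다"
    print(f"WER = {wer(reference, hypothesis):.3f}")
    print(f"CER = {cer(reference, hypothesis):.3f}")
```

In a TTS evaluation of this kind, the reference is typically the input text and the hypothesis is an ASR transcript of the synthesized speech, so lower values indicate more intelligible output. For Korean, CER is often the more forgiving of the two because a single wrong syllable counts as a whole-word error under WER.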
