
Design and Implementation of Sentence-Level Lip-reading with a Korean Morpheme-Based Multimodal AVSR Model (KM-AVSR)

  • Journal of The Korea Society of Computer and Information
  • Abbr : JKSCI
  • 2025, 30(8), pp.75~85
  • Publisher : The Korean Society Of Computer And Information
  • Research Area : Engineering > Computer Science
  • Received : June 18, 2025
  • Accepted : August 4, 2025
  • Published : August 29, 2025

Hee-Dong Yoon 1 Se-Uk Lee 2 Dong-Kyu Moon 2 Myung-Ho Kim 1

1Soongsil University
2Daekyo CNS

Accredited

ABSTRACT

In this paper, we propose KM-AVSR, a Korean Morpheme-based Multimodal Audio-Visual Speech Recognition (AVSR) model designed to enhance sentence-level lip-reading accuracy. Lip-reading has become increasingly valuable for understanding speech in noisy environments or in the absence of audio, with promising applications in Korean language education, assistive technologies, and surveillance. To address the challenges posed by the syllabic and agglutinative nature of Korean, KM-AVSR adopts morpheme-based subword tokenization. The model independently encodes visual (lip movements) and auditory (raw waveform) inputs using separate encoders, fuses the modalities through a multilayer perceptron, and decodes the output using a hybrid Connectionist Temporal Classification (CTC) and Transformer-based decoder. Evaluations on a Korean lip-reading dataset demonstrate that KM-AVSR achieves a Character Error Rate (CER) of 15.66%, representing a 39.35% improvement over a conventional CNN-based AVSR model. These results highlight the effectiveness of morpheme-level subword modeling and hybrid decoding in Korean AVSR.
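The reported results rest on two simple quantities: the Character Error Rate (edit distance between reference and hypothesis transcripts, divided by reference length) and the relative improvement over the baseline. A minimal sketch of both follows; note that the baseline CER of roughly 25.82% used in the example is inferred from the reported 15.66% CER and 39.35% relative improvement, and is not stated explicitly in the abstract.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Character-level edit distance via standard dynamic programming."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))          # distances for the previous row
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[j] = min(dp[j] + 1,       # deletion
                        dp[j - 1] + 1,   # insertion
                        prev + cost)     # substitution or match
            prev = cur
    return dp[n]

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate = edit distance / reference length."""
    return levenshtein(ref, hyp) / len(ref)

def relative_improvement(baseline: float, new: float) -> float:
    """Relative reduction of an error rate with respect to a baseline."""
    return (baseline - new) / baseline

# One substituted syllable out of five gives a CER of 0.2.
print(cer("안녕하세요", "안녕하세오"))
# Baseline 25.82% is an illustrative value inferred from the paper's
# reported 39.35% relative improvement over 15.66% CER.
print(relative_improvement(25.82, 15.66))
```

With a morpheme-based tokenizer the same edit-distance machinery applies at the subword level; CER is simply the character-level instance of it.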
