An Enhanced Scene Retrieval Method for OTT Content Based on Multimodal Action and Speech Recognition

  • Journal of Software Assessment and Valuation
  • Abbr : JSAV
  • 2025, 21(2), pp.61~69
  • Publisher : Korea Software Assessment and Valuation Society
  • Research Area : Engineering > Computer Science
  • Received : May 2, 2025
  • Accepted : June 20, 2025
  • Published : June 30, 2025

Seung-Hyeon Park 1 Injae Yoo 1 Byuong-Chan Park 1 Seok-Yoon Kim 1 Youngmo Kim 1

1 Soongsil University

Accredited

ABSTRACT

With the rapid growth of OTT (Over-the-Top) platforms, the volume of video content has increased exponentially, leading to a rising demand for precise and context-aware scene retrieval technologies. In narrative-driven content such as dramas and films, which often involve complex editing techniques and character-centric storytelling, conventional keyframe-based search methods fall short in capturing semantic continuity and scene context. This paper proposes an advanced method for OTT content scene retrieval based on multimodal action and speech recognition, combining a Transformer-based action recognition model with Speech-to-Text (STT) technology. The proposed approach segments continuous video frames into meaningful action intervals and constructs a de-duplicated scene graph by integrating key objects and their relationships within each segment. Furthermore, speech segments are accurately extracted and temporally aligned with visual data, enabling a unified multimodal representation of scenes. This integration supports more refined and semantically rich scene searches, such as character-centered navigation, emotion-based clip extraction, and dialogue-driven retrieval. The proposed method is expected to significantly enhance the personalization and reusability of OTT content in various user-centered applications.
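
The temporal alignment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the alignment rule, so the largest-overlap rule below is an assumption, and every name here (`ActionInterval`, `SpeechSegment`, `merge_frame_triples`, `align`, `search_dialogue`) is hypothetical. The action intervals and speech segments are assumed to come from an action recognition model and an STT system that are not shown.

```python
from dataclasses import dataclass, field


@dataclass
class ActionInterval:
    """A span of video (in seconds) labelled with one recognized action."""
    start: float
    end: float
    label: str
    # De-duplicated scene graph edges: (subject, relation, object) triples.
    triples: set = field(default_factory=set)


@dataclass
class SpeechSegment:
    """One transcribed utterance with its start and end time in seconds."""
    start: float
    end: float
    text: str


def merge_frame_triples(per_frame_triples):
    """Collapse per-frame triples into one set, dropping repeated edges."""
    merged = set()
    for frame_triples in per_frame_triples:
        merged.update(frame_triples)
    return merged


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time spans (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def align(intervals, speech):
    """Attach each utterance to the action interval it overlaps the most.

    Assumed rule: utterances that overlap no interval are left unassigned.
    """
    scenes = [{"interval": iv, "dialogue": []} for iv in intervals]
    for seg in speech:
        best_index, best_overlap = None, 0.0
        for i, iv in enumerate(intervals):
            shared = overlap(iv.start, iv.end, seg.start, seg.end)
            if shared > best_overlap:
                best_index, best_overlap = i, shared
        if best_index is not None:
            scenes[best_index]["dialogue"].append(seg.text)
    return scenes


def search_dialogue(scenes, keyword):
    """Return the action labels of scenes whose dialogue contains the keyword."""
    needle = keyword.lower()
    return [
        scene["interval"].label
        for scene in scenes
        if any(needle in line.lower() for line in scene["dialogue"])
    ]
```

Each entry returned by `align` pairs an action interval with the dialogue spoken during it, which is the kind of unified record that dialogue-driven retrieval can query. A production system would likely need a more careful rule for utterances that straddle two intervals.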
