An Enhanced Scene Retrieval Method for OTT Content Based on Multimodal Action and Speech Recognition (멀티모달 행동 및 음성 인식 기반의 OTT 콘텐츠 장면 검색 고도화 방법)

Seung-Hyeon Park (박승현); Injae Yoo (유인재); Byuong-Chan Park (박병찬); Seok-Yoon Kim (김석윤); Youngmo Kim (김영모)

Journal of Software Forensics
Abbr : JSF
2025, 21(2), pp.61~69
Publisher : Korea Software Assessment and Valuation Society
Research Area : Engineering > Computer Science
Received : May 2, 2025
Accepted : June 20, 2025
Published : June 30, 2025

Seung-Hyeon Park ¹, Injae Yoo ¹, Byuong-Chan Park ¹, Seok-Yoon Kim ¹, Youngmo Kim ¹

¹ Soongsil University

Accredited

ABSTRACT

With the rapid growth of OTT (Over-the-Top) platforms, the volume of video content has increased exponentially, leading to a rising demand for precise and context-aware scene retrieval technologies. In narrative-driven content such as dramas and films, which often involve complex editing techniques and character-centric storytelling, conventional keyframe-based search methods fall short in capturing semantic continuity and scene context. This paper proposes an advanced method for OTT content scene retrieval based on multimodal action and speech recognition, combining a Transformer-based action recognition model with Speech-to-Text (STT) technology. The proposed approach segments continuous video frames into meaningful action intervals and constructs a de-duplicated scene graph by integrating key objects and their relationships within each segment. Furthermore, speech segments are accurately extracted and temporally aligned with visual data, enabling a unified multimodal representation of scenes. This integration supports more refined and semantically rich scene searches, such as character-centered navigation, emotion-based clip extraction, and dialogue-driven retrieval. The proposed method is expected to significantly enhance the personalization and reusability of OTT content in various user-centered applications.
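The abstract gives no implementation details, so the following is only a minimal sketch of the data flow it describes, under stated assumptions. It assumes an action recognition model has already produced labelled time intervals, an STT system has already produced timestamped utterances, and an object-relation detector has produced per-frame (subject, relation, object) triples. All names (`ActionSegment`, `Utterance`, `build_segment`, `align`, `search`) are hypothetical, and the largest-overlap alignment rule is an illustrative choice, not the authors' stated method.

```python
from dataclasses import dataclass, field


@dataclass
class Utterance:
    start: float  # seconds
    end: float
    text: str


@dataclass
class ActionSegment:
    start: float  # seconds
    end: float
    action: str
    triples: set = field(default_factory=set)      # de-duplicated scene graph edges
    utterances: list = field(default_factory=list)  # aligned dialogue text


def overlap(seg, utt):
    """Length in seconds of the time overlap between a segment and an utterance."""
    return max(0.0, min(seg.end, utt.end) - max(seg.start, utt.start))


def build_segment(start, end, action, per_frame_triples):
    """Merge per-frame triples into one set, so repeated relations are stored once."""
    seg = ActionSegment(start, end, action)
    for frame in per_frame_triples:
        seg.triples.update(frame)
    return seg


def align(segments, utterances):
    """Attach each utterance to the segment it overlaps most (assumed rule)."""
    for utt in utterances:
        best = max(segments, key=lambda s: overlap(s, utt), default=None)
        if best is not None and overlap(best, utt) > 0:
            best.utterances.append(utt.text)
    return segments


def search(segments, keyword):
    """Return segments whose action label, dialogue, or graph mentions the keyword."""
    kw = keyword.lower()
    hits = []
    for seg in segments:
        fields = [seg.action, *seg.utterances]
        fields += [part for triple in seg.triples for part in triple]
        if any(kw in f.lower() for f in fields):
            hits.append(seg)
    return hits


# Toy data, invented purely for illustration.
segments = [
    build_segment(0.0, 5.0, "walking", [
        [("man", "holds", "cup")],
        [("man", "holds", "cup"), ("cup", "on", "table")],
    ]),
    build_segment(5.0, 10.0, "hugging", [[("man", "hugs", "woman")]]),
]
utterances = [
    Utterance(1.0, 2.0, "Where is my coffee?"),
    Utterance(4.0, 7.0, "I missed you"),
    Utterance(12.0, 13.0, "Goodbye"),  # overlaps no segment, so it is dropped
]
align(segments, utterances)
dialogue_hits = search(segments, "missed")
object_hits = search(segments, "cup")
```

In this sketch a dialogue query and an object query both resolve to the same unit, the action segment, which is one plausible reading of the "unified multimodal representation" the abstract refers to.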

KEYWORDS

Multimodal Scene Retrieval, Transformer-based Action Recognition, Speech-Visual Information Integration, OTT Content Analysis, Semantics-Centered Scene Graph

Journal of Software Forensics 2025 KCI Impact Factor : 0.3