Text-Controlled 4D Human Generation

  • Journal of The Korea Society of Computer and Information
  • Abbr : JKSCI
  • 2026, 31(4), pp.55~66
  • Publisher : The Korean Society Of Computer And Information
  • Research Area : Engineering > Computer Science
  • Received : February 20, 2026
  • Accepted : April 1, 2026
  • Published : April 30, 2026

Chanwoo Kim¹, Sanghun Kim², Hwasup Lim¹

¹University of Science and Technology (UST)
²Broz Co., Ltd.

Accredited

ABSTRACT

Generating 4D humans from textual descriptions has become an important problem for applications such as the metaverse and virtual reality. However, previous Text-to-4D generation methods typically generate appearance and motion jointly, which limits controllability and incurs high computational cost. In this paper, we propose a novel text-driven 4D human generation pipeline that integrates separately generated appearance and motion. First, from the given appearance and motion descriptions, a human appearance image and a motion sequence are generated using Stable Diffusion and the Motion Diffusion Model (MDM), respectively. Next, MusePose combines the generated appearance and motion into a frontal-view video, which SV4D then extends into multi-view videos. Finally, Grid4D is employed to learn a 4D representation from the synthesized multi-view videos. To validate the proposed pipeline, we construct a dataset for 4D human generation and conduct quantitative and qualitative evaluations on the rendered videos. Experimental results show that the proposed method achieves 77.5% in Dynamic Degree, 58.3% in Aesthetic Quality, and 24.8% in Overall Consistency, indicating that, although trade-offs exist among the metrics, the method maintains a balance between dynamic expressiveness and visual quality.
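For readers who want the pipeline at a glance, the four stages described above can be summarized in pseudocode. The sketch below is a minimal conceptual outline under stated assumptions, not the authors' implementation: every function in it is a hypothetical stand-in for the named model, since the real interfaces of Stable Diffusion, MDM, MusePose, SV4D, and Grid4D differ from these signatures.

```python
# Conceptual sketch of the proposed text-to-4D human pipeline.
# All functions here are hypothetical stand-ins for the named models;
# the actual Stable Diffusion / MDM / MusePose / SV4D / Grid4D APIs differ.

def generate_appearance_image(prompt: str):
    """Hypothetical wrapper: appearance text -> human appearance image (Stable Diffusion)."""
    raise NotImplementedError("stand-in for a Stable Diffusion call")

def generate_motion_sequence(prompt: str):
    """Hypothetical wrapper: motion text -> skeletal motion sequence (Motion Diffusion Model)."""
    raise NotImplementedError("stand-in for an MDM call")

def animate_with_musepose(image, motion):
    """Hypothetical wrapper: appearance image + motion -> frontal-view video (MusePose)."""
    raise NotImplementedError("stand-in for a MusePose call")

def lift_to_multiview(video):
    """Hypothetical wrapper: frontal-view video -> multi-view videos (SV4D)."""
    raise NotImplementedError("stand-in for an SV4D call")

def fit_grid4d(multiview_videos):
    """Hypothetical wrapper: multi-view videos -> learned 4D representation (Grid4D)."""
    raise NotImplementedError("stand-in for Grid4D optimization")

def text_to_4d_human(appearance_prompt: str, motion_prompt: str):
    # Step 1: generate appearance and motion independently from their prompts,
    # which is what gives the pipeline its separate controllability.
    image = generate_appearance_image(appearance_prompt)
    motion = generate_motion_sequence(motion_prompt)
    # Step 2: combine them into a single frontal-view human video.
    frontal_video = animate_with_musepose(image, motion)
    # Step 3: extend the frontal view into multi-view videos.
    multiview_videos = lift_to_multiview(frontal_video)
    # Step 4: learn the 4D representation from the synthesized multi-view videos.
    return fit_grid4d(multiview_videos)
```

The key design choice reflected in the sketch is that appearance and motion are generated by independent models and only fused in Step 2, so either prompt can be changed without regenerating the other component.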
