
A Simulation-Based VQA Dataset for Evaluating Intrinsic Physical Property Inference Capabilities of Vision-Language Models

  • Journal of The Korea Society of Computer and Information
  • Abbr : JKSCI
  • 2026, 31(5), pp.95~104
  • Publisher : The Korean Society of Computer and Information
  • Research Area : Engineering > Computer Science
  • Received : March 27, 2026
  • Accepted : May 13, 2026
  • Published : May 29, 2026

DongJu Jang 1, Yeong-In Lee 1, Ha-Young Kim 1

1 Yonsei University

Accredited

ABSTRACT

This study proposes a multi-view video question-answering benchmark and dataset for training and evaluating vision-language models on inferring intrinsic physical properties, such as mass and elasticity, from robot-object interactions rather than from simple scene recognition. To this end, we designed cube-pushing and sphere-dropping tasks based on inverse kinematics control in a simulation environment, collected data from them, and analyzed performance by fine-tuning state-of-the-art models. The experiments showed that pre-trained models had low accuracy. After fine-tuning, performance improved significantly on the mass inference task, where the final displacement remains static, and the models overcame their existing text response biases. Conversely, on the elasticity inference task, which requires tracking a momentary dynamic trajectory, the improvement was limited and the models regressed to linguistic biases. In conclusion, this dataset provides an environment for quantitatively evaluating the physical reasoning capabilities of such models, helping to lay a foundation for efficient action planning and decision-making in real-world robots.


This paper was written with support from the National Research Foundation of Korea.