A Study on a Multimodal RAG Framework for Overcoming the Limitations of Multimodal Large Language Models (MLLMs)

  • Journal of Software Forensics
  • Abbr : JSF
  • 2026, 22(2), pp.179~191
  • DOI : 10.29056/jsf.2026.06.16
  • Publisher : Korea Software Assessment and Valuation Society
  • Research Area : Engineering > Computer Science
  • Received : June 3, 2026
  • Accepted : June 20, 2026
  • Published : June 30, 2026

Ahn Cheolbum 1, Kim Jin Hong 2

1 Seoil University
2 Pai Chai University

Accredited

ABSTRACT

Recent multimodal large language models (MLLMs) have made remarkable advances in visual information processing, yet two limitations persist. First, they cannot access up-to-date information or domain-specific private data that lies outside their training datasets. Second, they frequently exhibit visual hallucination, in which the model misinterprets factual relationships and generates plausible but erroneous outputs. These limitations are critical obstacles to the practical adoption of MLLMs in fields that demand high reliability, such as medical diagnosis, legal analysis, and precision manufacturing inspection. To address them, this paper proposes a novel framework that integrates Retrieval-Augmented Generation (RAG) into image analysis, and systematically examines the key implementation hurdles and the prospects for future advancement.
