
A Comparative Study on the Performance of GPT-4o-mini, Claude 4 Sonnet, and Gemini 2.5 Flash Models Using the Prompt Runner Framework

  • Journal of The Korea Society of Computer and Information
  • Abbr : JKSCI
  • 2026, 31(2), pp.43~50
  • DOI : 10.9708/jksci.2026.31.02.043
  • Publisher : The Korean Society Of Computer And Information
  • Research Area : Engineering > Computer Science
  • Received : January 7, 2026
  • Accepted : February 1, 2026
  • Published : February 27, 2026

Misun Lee 1

1 Sejong University


ABSTRACT

This study presents a comparative analysis of three large language models (LLMs)—GPT-4o-mini, Claude 4 Sonnet, and Gemini 2.5 Flash—using a novel evaluation framework called Prompt Runner. The framework systematically measures model performance across nine linguistic and reasoning prompt types, totaling 90 items. Evaluation criteria include Accuracy, Consistency, Logic, Creativity, and Response Time. Accuracy was computed via Sentence-BERT-based cosine similarity, with Consistency and Logic derived by applying weight factors of 0.95 and 0.9, respectively. Creativity was assessed as a weighted sum of Novelty, Diversity, and Fluency (0.5N + 0.3D + 0.2F). The analysis revealed that Claude 4 Sonnet demonstrated superior performance in logical reasoning (0.58) and creativity (0.44), while GPT-4o-mini exhibited faster response times. Gemini 2.5 Flash showed higher performance in accuracy (0.66) and consistency (0.62). Notably, Claude 4 Sonnet achieved the most stable balance between overall capability and response time, and was therefore evaluated as the model that best ensures both efficiency and quality. By comparing quantitative performance indicators obtained through each model's API, this study systematically identified the characteristics and performance differences of LLMs across various prompt types.
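The scoring scheme described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes reference and response texts have already been encoded into Sentence-BERT embedding vectors, and the function names (`cosine_similarity`, `score_response`) are hypothetical. Only the weight factors (0.95 for Consistency, 0.9 for Logic) and the creativity formula (0.5N + 0.3D + 0.2F) come from the abstract.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_response(ref_emb: np.ndarray, resp_emb: np.ndarray,
                   novelty: float, diversity: float, fluency: float) -> dict:
    """Compute the abstract's five-part scores (Response Time omitted;
    it would be measured separately per API call)."""
    accuracy = cosine_similarity(ref_emb, resp_emb)   # Sentence-BERT similarity
    consistency = 0.95 * accuracy                     # weight factor 0.95
    logic = 0.90 * accuracy                           # weight factor 0.9
    creativity = 0.5 * novelty + 0.3 * diversity + 0.2 * fluency
    return {"accuracy": accuracy, "consistency": consistency,
            "logic": logic, "creativity": creativity}

# Example with identical stand-in embeddings (accuracy = 1.0)
scores = score_response(np.array([1.0, 0.0]), np.array([1.0, 0.0]),
                        novelty=0.4, diversity=0.5, fluency=0.6)
```

With identical embeddings the example yields accuracy 1.0, consistency 0.95, logic 0.9, and creativity 0.47, showing how the derived scores scale off the base similarity.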
