Large language models (LLMs) have recently demonstrated strong performance on natural language processing tasks, including sentiment analysis; however, both prediction performance and the consistency of model explanations can vary significantly with prompt design. This study presents a comparative analysis of the sentiment analysis performance and LIME-based explainability of foundation models across different prompt types. Experiments were conducted on Korean app reviews, English IMDB reviews, and the English TweetEval dataset using four prompt types (role-based, context-rich, few-shot, and format-constrained), evaluating the GPT-4o-mini and Gemini 2.5 Flash models. In addition, the semantic consistency and explanation stability of the predictions were analyzed both quantitatively and qualitatively using sentence-embedding cosine similarity and LIME. In the binary classification setting (IMDB), performance differences across prompts remained within 1.5 percentage points (%p) and explanation consistency showed no difference (0%p), indicating limited sensitivity to prompt design. In contrast, in the three-class settings (APP and Tweet), performance differences of up to approximately 3.0%p were observed; in the tweet domain, explanation consistency differed by 25.0%p in LIME agreement and by 12.5%p in overall agreement. The contribution of this study is a systematic analysis of how prompt design affects both performance and explainability in LLM-based sentiment analysis.
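For concreteness, the sketch below illustrates one way the two consistency measures named above could be operationalized: sentence-embedding cosine similarity between model answers elicited by two prompt variants, and overlap of top-k LIME token attributions. This is a minimal sketch under stated assumptions, not the paper's exact pipeline: the toy keyword classifier (standing in for an LLM call), the all-MiniLM-L6-v2 embedder, and the token-overlap agreement metric are all illustrative choices.

```python
# Illustrative sketch: semantic consistency via sentence-embedding cosine
# similarity, and explanation stability via LIME token attributions.
# Assumptions: toy_predict_proba stands in for a real LLM sentiment
# classifier; the embedder and agreement metric are not from the paper.
import numpy as np
from sentence_transformers import SentenceTransformer, util
from lime.lime_text import LimeTextExplainer

def toy_predict_proba(texts):
    """Stand-in classifier: (negative, positive) probabilities from a
    tiny keyword lexicon, squashed through a logistic function."""
    probs = []
    for t in texts:
        score = sum(w in t.lower() for w in ("great", "love", "good")) \
              - sum(w in t.lower() for w in ("bad", "awful", "hate"))
        p_pos = 1.0 / (1.0 + np.exp(-score))
        probs.append([1.0 - p_pos, p_pos])
    return np.array(probs)

review = "The update is great, but the login screen is awful."

# 1) Semantic consistency: cosine similarity between the answers the model
#    gave to two different prompt formulations of the same review.
answer_role_prompt = "positive overall, despite a login complaint"
answer_few_shot = "mixed leaning positive; login issue noted"
embedder = SentenceTransformer("all-MiniLM-L6-v2")
emb = embedder.encode([answer_role_prompt, answer_few_shot],
                      convert_to_tensor=True)
print("cosine similarity:", float(util.cos_sim(emb[0], emb[1])))

# 2) Explanation stability: top-k LIME tokens for one prompt variant.
#    With real LLMs, each prompt type would define its own classifier_fn,
#    and agreement could be the overlap of the resulting top-k token sets.
explainer = LimeTextExplainer(class_names=["negative", "positive"])
exp = explainer.explain_instance(review, toy_predict_proba, num_features=4)
print("top LIME tokens:", {tok for tok, _ in exp.as_list()})
```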