Recent advances in attention-based language models have increased interest in automated essay scoring (AES) for English essays, particularly analytic scoring.
However, prior work has focused primarily on improving score agreement with human raters, paying limited attention to how different model architectures realize analytic assessment from an educational perspective. Accordingly, this study examines how architectural differences in attention-based language models influence analytic English essay scoring by comparing encoder-based (BERT), decoder-based (Qwen), and encoder–decoder-based (T5) models on the PERSUADE 1.0 dataset. The dataset consists of essays written by U.S. students in grades 8–12, each rated by two trained raters, with final scores determined through adjudication by a third rater. The analysis evaluates three aspects: agreement with human raters, measured by Quadratic Weighted Kappa (QWK); attention patterns, analyzed by classifying attention-weighted tokens into content and function words via part-of-speech tagging; and scoring efficiency, measured as average inference time per essay in both CPU and GPU environments. The results show that the encoder–decoder-based (T5) model achieves the highest agreement with human raters and attends more to content words, reflecting a meaning-focused assessment strategy, whereas the other models show lower agreement and place less emphasis on content words.
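The two analyses above can be sketched in a few lines. The following is a minimal illustration, not the authors' code: QWK computed from its standard definition, and a crude content/function-word split using a hand-picked closed-class word list (the study itself uses part-of-speech tagging; the word list here is an assumption for illustration).

```python
def qwk(a, b):
    """Quadratic Weighted Kappa between two integer score sequences."""
    lo, hi = min(min(a), min(b)), max(max(a), max(b))
    n = hi - lo + 1
    # Observed agreement matrix over the rating scale
    O = [[0] * n for _ in range(n)]
    for x, y in zip(a, b):
        O[x - lo][y - lo] += 1
    # Expected matrix from the two raters' marginal histograms
    hist_a = [sum(row) for row in O]
    hist_b = [sum(O[i][j] for i in range(n)) for j in range(n)]
    total = len(a)
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = ((i - j) ** 2) / ((n - 1) ** 2) if n > 1 else 0.0
            num += w * O[i][j]
            den += w * hist_a[i] * hist_b[j] / total
    return 1.0 - num / den if den else 1.0

# Toy closed-class list; a POS tagger would replace this in real use.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "or", "but",
                  "is", "are", "was", "were", "it", "that", "this", "with"}

def content_word_ratio(tokens):
    """Fraction of (attention-weighted) tokens that are content words."""
    content = [t for t in tokens if t.lower() not in FUNCTION_WORDS]
    return len(content) / len(tokens)

print(qwk([3, 2, 4, 5, 1, 3], [3, 2, 3, 5, 1, 4]))
print(content_word_ratio("the essay presents a clear argument".split()))
```

Quadratic weighting penalizes large score disagreements more heavily than adjacent ones, which is why QWK is the conventional agreement metric in essay-scoring work.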
Despite moderate computational costs, the encoder–decoder-based model remains feasible for educational use. These findings highlight the importance of model architecture in analytic AES and offer guidance for selecting practical scoring systems.