Fine-tuned Korean Language Models for Sociolinguistic Studies (사회언어학 연구를 위한 한국어 미세조정 언어모델)

Kangsan Noh (노강산); Kim, Soo Yeon (김수연); Hye-Won Choi (최혜원); JANG, HAYEUN (장하연); Sanghoun Song (송상헌)

The Sociolinguistic Journal of Korea
Abbr : 사회언어학
2024, 32(3), pp.41~64
Publisher : The Sociolinguistic Society Of Korea
Research Area : Humanities > Linguistics
Received : August 12, 2024
Accepted : September 11, 2024
Published : September 30, 2024

Kangsan Noh ¹, Kim, Soo Yeon ², Hye-Won Choi ³, JANG, HAYEUN ⁴, Sanghoun Song ¹

¹Korea University
²Sejong University
³Ewha Womans University
⁴Sungkyunkwan University

Accredited

ABSTRACT

This paper tests the capacity of deep-learning-based Korean language models to learn and detect social registers embedded in speech data, specifically age, gender, and regional dialect. A comprehensive understanding of linguistic phenomena requires contextualizing speech by the speaker’s age, gender, and geographic background, in addition to processing syntactic structure. To bridge the gap between human language understanding and model processing, we fine-tuned three representative Korean language models (KR-BERT, KoELECTRA-base, and KLUE-RoBERTa-base) on transcribed data from 4,000 hours of speech by middle-aged and elderly Korean speakers. KoELECTRA-base outperformed the other two models across all social registers, which is likely attributable to its larger vocabulary and parameter count. Among the dialects, the Jeju dialect was identified with the highest accuracy at inference, a result attributed to its distinctiveness, which makes it easier for the models to detect. We have also made our fine-tuned models publicly available to support researchers interested in Korean computational sociolinguistics.

KEYWORDS

age, dialect, gender, Korean language model, social register

This paper was written with support from the National Research Foundation of Korea.


The Sociolinguistic Journal of Korea 2024 KCI Impact Factor : 0.55