The Korean Coronavirus Corpus: A Large-Scale Analysis Using Computational Skills

Gyu-min Lee (이규민); Song, Sanghoun (송상헌)

doi:10.14353/sjk.2022.30.3.08

The Sociolinguistic Journal of Korea
Abbr : 사회언어학
2022, 30(3), pp.213~243
Publisher : The Sociolinguistic Society Of Korea
Research Area : Humanities > Linguistics
Received : August 10, 2022
Accepted : September 1, 2022
Published : September 30, 2022

Gyu-min Lee ¹, Song, Sanghoun ²

¹ Other institution
² Korea University

Accredited

ABSTRACT

Despite the massive impact of COVID-19 on society, large-scale data describing the effects of the virus on the Republic of Korea, beyond the numbers of confirmed cases and deaths, remain scarce. To fill this gap, we collected 1.822 million news articles containing more than 1 billion morphemes, published between January 2020 and June 2022, to create a Korean version of the Coronavirus Corpus, which this study introduces. To demonstrate how such a massive corpus can be used, we conducted information-theoretic analyses of how the stance of the press on topics such as vaccines and social distancing affected the COVID-19 situation in the Republic of Korea. Specifically, we applied several computational linguistic techniques, including concordance building and sentiment analysis with both traditional and machine learning methods, and measured transfer entropy to estimate the impact. The results suggest that the overall impact of the press on society was minimal to non-existent.
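
As an illustration of the transfer entropy measure mentioned in the abstract, the following is a minimal sketch of a lag-1 plug-in estimator for two discrete time series. It is not the authors' implementation: the function name `transfer_entropy`, the lag of one step, and the assumption that both series have already been discretized (for example, a binned daily sentiment score and a binned daily case count) are all assumptions made here for illustration only.

```python
from collections import Counter
from math import log2


def transfer_entropy(source, target):
    """Lag-1 transfer entropy (in bits) from `source` to `target`.

    Both arguments are equal-length sequences of discrete symbols.
    This is a simple plug-in (frequency count) estimate; it is an
    illustrative sketch, not the method used in the paper.
    """
    if len(source) != len(target):
        raise ValueError("source and target must have the same length")

    # Each triple is (target at t+1, target at t, source at t).
    triples = list(zip(target[1:], target[:-1], source[:-1]))
    n = len(triples)
    if n == 0:
        return 0.0

    joint = Counter(triples)                                   # (y_next, y_prev, x_prev)
    prev_pair = Counter((yp, xp) for _, yp, xp in triples)     # (y_prev, x_prev)
    target_pair = Counter((yn, yp) for yn, yp, _ in triples)   # (y_next, y_prev)
    target_prev = Counter(yp for _, yp, _ in triples)          # y_prev

    te = 0.0
    for (yn, yp, xp), count in joint.items():
        p_joint = count / n
        # p(y_next | y_prev, x_prev): prediction using both histories
        p_with_source = count / prev_pair[(yp, xp)]
        # p(y_next | y_prev): prediction using the target's own history
        p_without_source = target_pair[(yn, yp)] / target_prev[yp]
        te += p_joint * log2(p_with_source / p_without_source)
    return te


if __name__ == "__main__":
    # Hypothetical toy data: the target simply repeats the source one step later.
    press = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1]
    cases = [0] + press[:-1]
    print(transfer_entropy(press, cases))
```

A value near zero means that knowing the source's past adds little to predicting the target beyond the target's own past, which is the kind of outcome the abstract summarizes as a minimal to non-existent impact.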

KEYWORDS

COVID-19, media, Republic of Korea, corpus, computational linguistics, sentiment analysis, diachronic analysis

This paper was written with support from the National Research Foundation of Korea.
