An Analysis of Trends and Achievements in Corpus Linguistics: Using the Text Mining Method

  • The Japanese Language Association of Korea
  • Abbr : JLAK
  • 2025, (85), pp.51~68
  • Publisher : The Japanese Language Association Of Korea
  • Research Area : Humanities > Japanese Language and Literature
  • Received : June 28, 2025
  • Accepted : August 22, 2025
  • Published : September 20, 2025

Jang, Kun-Soo 1

1詳明大

Accredited

ABSTRACT

This study uses text mining to identify and describe research trends and outcomes in Japanese corpus linguistics. Specifically, it aims to clarify when corpora were first used in Japanese linguistics and Japanese language education, and to identify the domains that have been studied most extensively. Google Scholar was used to collect publications containing the terms “Japanese language” and “corpus,” and text mining was then performed with KH Coder on the titles of 1,117 research papers and books published between 1995 and 2024. The main results are as follows. [1] High-frequency words were extracted from the titles of the papers and books. Corpus linguistics is most commonly applied in Japanese language education: the terms “learner” (150 occurrences), “native language” (111 occurrences), and “Japanese language education” (59 occurrences) are among the most frequent. [2] Corpora were categorized into types such as “spoken corpus,” “written corpus,” “learner corpus,” and “historical corpus,” and word frequencies were analyzed for each category. Research is being conducted across various domains and is most active for spoken corpora (268 occurrences) and written corpora (213 occurrences). [3] Hierarchical cluster analysis and co-occurrence network analysis were performed to examine the similarities among the top 100 extracted terms. In addition, the year of publication was set as an external variable to trace the trends and results of corpus research over the past 30 years. Research has progressed chronologically in the following order: parallel corpora, spoken corpora, written corpora, and Japanese learner corpora.
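The title-level analysis described in the abstract can be illustrated with a minimal Python sketch. This is not the study's KH Coder workflow: the titles are hypothetical toy examples invented for illustration, the whitespace tokenizer and stopword list are simplifications, and the use of the Jaccard coefficient as the co-occurrence strength measure is an assumption made here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy titles, invented for illustration; NOT data from the study.
TITLES = [
    "learner corpus analysis of Japanese particles",
    "spoken corpus study of Japanese learner errors",
    "written corpus and Japanese language education",
]

# Simplified stopword list (an assumption; real pipelines use morphological analysis).
STOPWORDS = {"of", "and", "the", "a"}


def tokenize(title):
    """Lowercase a title and split on whitespace, dropping stopwords."""
    return [w for w in title.lower().split() if w not in STOPWORDS]


def term_frequencies(titles):
    """Count how often each term occurs across all titles."""
    counts = Counter()
    for title in titles:
        counts.update(tokenize(title))
    return counts


def cooccurrences(titles):
    """Count, for each unordered term pair, the titles containing both terms."""
    pairs = Counter()
    for title in titles:
        terms = sorted(set(tokenize(title)))
        pairs.update(combinations(terms, 2))
    return pairs


def jaccard(term_a, term_b, titles):
    """Jaccard coefficient between the sets of titles containing each term."""
    docs_a = {i for i, t in enumerate(titles) if term_a in tokenize(t)}
    docs_b = {i for i, t in enumerate(titles) if term_b in tokenize(t)}
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0


if __name__ == "__main__":
    print(term_frequencies(TITLES).most_common(3))
    print(cooccurrences(TITLES).most_common(3))
    print(jaccard("learner", "spoken", TITLES))
```

Term pairs with a high coefficient would become the edges of a co-occurrence network, and the same pairwise similarities could feed a hierarchical clustering step. Grouping titles by publication year before counting would approximate the external-variable analysis mentioned in result [3].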
