A Study on methods of Data Collection and application for Optical Character Recognition of Chinese Character in ancient book

  • The Journal of Study on Language and Culture of Korea and China
  • Abbr : JSLCKC
  • 2022, (65), pp.43-84
  • DOI : 10.16874/jslckc.2022..65.002
  • Publisher : Korean Society of Study on Chinese Language and Culture
  • Research Area : Humanities > Chinese Language and Literature
  • Received : July 10, 2022
  • Accepted : August 20, 2022
  • Published : August 31, 2022

KHOO HYUN AH 1

1 Yongin University

Accredited

ABSTRACT

Digitalization has been adopted in many fields in the information age, but the digitalization of ancient Chinese characters in Korea is still at an elementary stage. The government first attempted to develop OCR for ancient Chinese characters in Korea in 2009. A large-scale, systematic project covering 10 million characters followed in 2020, but it is limited in that most of the books it collected were woodblock prints. This study therefore explores data collection methods for building OCR for ancient Chinese characters, as well as applications of that OCR. First, building a highly accurate OCR for ancient Chinese characters requires collecting data in a variety of typefaces. Because old books are mostly written in the square style or the semi-cursive style of Chinese handwriting, the types of source data must be expanded to include calligraphy, art works, and household goods. Fonts can also be collected according to printing method, such as metal type, wood type, and woodblock print. Data diversity can be secured by including fonts from different types and block books. In addition, it is essential to collect Chinese rhyme books, Okpyeon, and dictionaries, such as 『Hongmu Jeongwun yeokhun』, 『Sasungtonghae』, 『Samwunsunghui』, and 『Gyujangjeonwun』, so that many different Chinese characters are covered. The results of OCR for ancient Chinese characters can be used in translation, digital archive construction, font development, and the tourism industry. Font recognition can help ascertain the author and publication period of a text. The results can also be used in preservation studies, for example to document the degree of damage using the original image. Ancient Chinese-character records, the subject of this OCR, are a very important documentary heritage containing Korean culture. Digitalizing this heritage will accelerate the development of Korean humanities, contribute to academic and industrial fields, and create a variety of jobs through the production of new content. It is hoped that the results of this study will help develop OCR for ancient Chinese characters with higher accuracy and usability in the future.
