본문 바로가기
  • Home

HTML Tag Depth Embedding: An Input Embedding Method of the BERT Model for Improving Web Document Reading Comprehension Performance

  • Journal of Internet of Things and Convergence
  • Abbr : JKIOTS
  • 2022, 8(5), pp.17-25
  • DOI : 10.20465/KIOTS.2022.8.5.017
  • Publisher : The Korea Internet of Things Society
  • Research Area : Engineering > Computer Science > Internet Information Processing
  • Received : July 18, 2022
  • Accepted : September 6, 2022
  • Published : October 31, 2022

Jin-Wang Mok 1 Jang Hyun Jae 1 Hyun-Seob Lee 1

1백석대학교

Accredited

ABSTRACT

Recently the massive amount of data has been generated because of the number of edge devices increases. And especially, the number of raw unstructured HTML documents has been increased. Therefore, MRC(Machine Reading Comprehension) in which a natural language processing model finds the important information within an HTML document is becoming more important. In this paper, we propose HTDE(HTML Tag Depth Embedding Method), which allows the BERT to train the depth of the HTML document structure. HTDE makes a tag stack from the HTML document for each input token in the BERT and then extracts the depth information. After that, we add a HTML embedding layer that takes the depth of the token as input to the step of input embedding of BERT. Since tokenization using HTDE identifies the HTML document structures through the relationship of surrounding tokens, HTDE improves the accuracy of BERT for HTML documents. Finally, we demonstrated that the proposed idea showing the higher accuracy compared than the accuracy using the conventional embedding of BERT.

Citation status

* References for papers published after 2023 are currently being built.

This paper was written with support from the National Research Foundation of Korea.