A Study of Main Contents Extraction from Web News Pages based on XPath Analysis

  • Journal of The Korea Society of Computer and Information
  • Abbr : JKSCI
  • 2015, 20(7), pp.1-7
  • Publisher : The Korean Society Of Computer And Information
  • Research Area : Engineering > Computer Science

Bokkeun Sun 1

1 Hoseo University

Accredited

ABSTRACT

Although data on the internet can serve as a data source in various fields such as IR (Information Retrieval), data mining, and knowledge information services, it contains a lot of unnecessary information. Removing this unnecessary data is a problem that must be solved before studying knowledge-based information services built on web page data. In this paper, we address the problem by implementing XTractor (XPath Extractor). Since XPath is used to navigate the elements and attributes of an XML document, XTractor carries out its analysis on XPaths. XTractor extracts the main text through HTML parsing, XPath grouping, and detection of the XPath that contains the main data. In experiments on a large amount of data, the recognition rate and precision were 97.9% and 93.9%, respectively, and apart from a few cases it was confirmed that the main text of news pages can be extracted properly.
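
The three steps named in the abstract (HTML parsing, XPath grouping, and detecting the XPath that contains the main data) can be illustrated with a minimal sketch. This is not the XTractor implementation, whose details the abstract does not give. It assumes that XPaths are grouped with positional indices dropped and that the main-text group is the one holding the most characters, which is only one plausible detection rule. The names `XPathTextCollector`, `extract_main_text`, and `SAMPLE_HTML` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the path stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Tags whose text is never article content.
SKIP_TAGS = {"script", "style"}


class XPathTextCollector(HTMLParser):
    """Collects text nodes grouped by the index-free XPath of their parent."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # open tags from the root to the current node
        self.groups = defaultdict(list)  # XPath -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray end tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not SKIP_TAGS.intersection(self.stack):
            self.groups["/" + "/".join(self.stack)].append(text)


def extract_main_text(html):
    """Return (xpath, text) for the XPath group holding the most characters.

    Assumption: "most characters" stands in for the paper's detection step.
    """
    collector = XPathTextCollector()
    collector.feed(html)
    collector.close()
    groups = collector.groups
    if not groups:
        return None, ""
    best = max(groups, key=lambda path: sum(len(t) for t in groups[path]))
    return best, "\n".join(groups[best])


# Hypothetical news page: a short navigation list, two body paragraphs, a script.
SAMPLE_HTML = """
<html><body>
<div><ul><li>Home</li><li>Sports</li></ul></div>
<div><p>First paragraph of the news article body.</p>
<p>Second paragraph with more article text.</p></div>
<script>var x = "a long tracking string that is not article content";</script>
</body></html>
"""

path, text = extract_main_text(SAMPLE_HTML)
print(path)
print(text)
```

Grouping by index-free XPath puts sibling paragraphs of the article body in one group, while navigation items and scripts fall into other groups or are skipped. A real extractor would likely need further signals (link density, punctuation, attribute values) to be robust across news sites.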
