A Study of Main Contents Extraction from Web News Pages based on XPath Analysis

BokKeun Sun (선복근)

Journal of The Korea Society of Computer and Information
Abbr : JKSCI
2015, 20(7), pp.1~7
Publisher : The Korean Society Of Computer And Information
Research Area : Engineering > Computer Science

BokKeun Sun ¹

¹ Hoseo University

Accredited

ABSTRACT

Data on the internet can serve as a source for fields such as IR (Information Retrieval), data mining, and knowledge information services, but it also contains a lot of unnecessary information. Removing this unnecessary data is a problem that must be solved before studying knowledge-based information services built on web page data. In this paper, we address the problem by implementing XTractor (XPath Extractor). Since XPath is used to navigate the elements and attributes of an XML document, XTractor carries out its analysis in terms of XPath. XTractor extracts the main text through HTML parsing, XPath grouping, and detection of the XPath that contains the main data. In experiments on a large amount of data, the recognition rate and precision were 97.9% and 93.9%, respectively; except for a few cases, this confirmed that the main text of news pages can be extracted properly.
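The three stages named in the abstract (HTML parsing, XPath grouping, detecting the XPath that holds the main data) can be illustrated with a minimal sketch. It is not the XTractor implementation: the names `XPathTextCollector`, `extract_main_text`, and `SAMPLE` are hypothetical, the paths are index-free tag paths rather than full XPath expressions, and the selection rule (pick the group with the most characters) is an assumed heuristic, since the abstract does not state the criterion XTractor uses.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags whose text is never article content; void tags have no closing tag.
SKIP_TAGS = {"script", "style"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class XPathTextCollector(HTMLParser):
    """Records each text node together with the tag path leading to it."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # currently open tags
        self.groups = defaultdict(list)  # index-free path -> text fragments

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not SKIP_TAGS.intersection(self.stack):
            # Grouping step: text nodes sharing a tag path share a group.
            self.groups["/" + "/".join(self.stack)].append(text)


def extract_main_text(html):
    """Return (path, text) for the path group holding the most text."""
    collector = XPathTextCollector()
    collector.feed(html)
    collector.close()
    if not collector.groups:
        return None, ""
    # Assumed heuristic (not from the paper): longest total text wins.
    best = max(collector.groups,
               key=lambda path: sum(len(t) for t in collector.groups[path]))
    return best, "\n".join(collector.groups[best])


# Hypothetical news page: navigation, article body, footer.
SAMPLE = """
<html><body>
  <div id="nav"><ul><li><a href="/">Home</a></li><li><a href="/world">World</a></li></ul></div>
  <div id="article">
    <p>The city council approved the new transit plan on Monday.</p>
    <p>Construction is expected to begin next spring.</p>
  </div>
  <div id="footer"><span>Copyright</span></div>
</body></html>
"""

if __name__ == "__main__":
    path, text = extract_main_text(SAMPLE)
    print(path)
    print(text)
```

On `SAMPLE`, the navigation links, the two article paragraphs, and the footer fall into three separate path groups, and the paragraph group is selected because it carries the most text. A production extractor would likely need a more robust score (for example link density or punctuation counts), which is outside what the abstract specifies.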

KEYWORDS

Main Text Extraction, Web News Page, XPath Grouping, HTML Parsing

