A Study on Extracting News Contents from News Web Pages (뉴스 웹 페이지에서 기사 본문 추출에 관한 연구)

Yong-Gu Lee (이용구)

doi:10.3743/KOSIM.2009.26.1.305

Journal of the Korean Society for Information Management
Abbr : JKOSIM
2009, 26(1), pp.305~320
Publisher : Korean Society for Information Management (한국정보관리학회)
Research Area : Interdisciplinary Studies > Library and Information Science
Received : February 17, 2009
Accepted : March 2, 2009
Published : March 30, 2009

Yong-Gu Lee ¹

¹ University of Pittsburgh

Accredited

ABSTRACT

News pages delivered over the web contain unnecessary information, which lowers the performance and efficiency of news processing systems. This study proposes news content extraction methods based on sentence identification and on block-level tags in news web pages. To obtain optimal performance, combinations of these methods were applied. Good performance was obtained with an extraction method that applied sentence identification and eliminated hyperlink text from web pages. This method performed even better when combined with the extraction method that used block-level tags. Extraction methods that used sentence identification were effective at raising extraction recall.
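As a rough illustration of the combination the abstract describes (splitting a page at block-level tags, dropping hyperlink text, and keeping only blocks that contain identifiable sentences), a minimal Python sketch follows. The abstract does not give the paper's actual rules, so everything specific here is an assumption: the `BLOCK_TAGS` set, the punctuation-based sentence pattern, the `min_words` and `min_sentences` thresholds, and the names `BlockCollector`, `count_sentences`, and `extract_news_text` are all hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumed block-level tags; the paper's actual tag list is not given in the abstract.
BLOCK_TAGS = {"p", "div", "td", "tr", "table", "li", "ul", "ol", "br", "h1", "h2", "h3"}
# Tags whose text content is never article text.
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks at block-level tags, ignoring hyperlink text."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished text blocks, in document order
        self._buf = []         # text fragments of the block being built
        self._a_depth = 0      # > 0 while inside <a>...</a>
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def _flush(self):
        # Close the current block, normalising whitespace and dropping empty blocks.
        text = " ".join(" ".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._a_depth += 1
        elif tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag == "a":
            self._a_depth = max(0, self._a_depth - 1)
        elif tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Hyperlink text (menus, related-article links) is eliminated here.
        if self._a_depth == 0 and self._skip_depth == 0:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


# Assumed sentence heuristic: a run of text ending in ., ! or ? followed by
# whitespace or end of block. Real sentence identification would be language-specific.
SENTENCE_RE = re.compile(r"[^.!?]+[.!?](?:\s|$)")


def count_sentences(text, min_words=4):
    """Count candidate sentences that have at least min_words words."""
    return sum(1 for s in SENTENCE_RE.findall(text) if len(s.split()) >= min_words)


def extract_news_text(html, min_sentences=1):
    """Return the text blocks that contain at least min_sentences sentences."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return [b for b in parser.blocks if count_sentences(b) >= min_sentences]
```

Calling `extract_news_text(html)` on a page returns the surviving blocks in document order. Under these assumptions, navigation bars made of links and short boilerplate lines such as copyright notices contain no qualifying sentence and are discarded, while paragraphs of running text are kept. Lowering `min_words` or `min_sentences` would be one way to trade precision for recall.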

KEYWORDS

web news content extraction, sentence based extraction, block based extraction, web mining

This paper was written with support from the National Research Foundation of Korea.

Journal of the Korean Society for Information Management 2025 KCI Impact Factor : 1.27
