An Automatic Schema Generation System based on the Contents for Integrating Web Information Sources (웹 정보원 통합을 위한 내용 기반의 스키마 자동생성시스템)

곽준영; Bae Jong Min (배종민)

Journal of The Korea Society of Computer and Information
Abbr : JKSCI
2008, 13(6), pp.77~86
Publisher : The Korean Society of Computer and Information
Research Area : Engineering > Computer Science

곽준영 ¹, Bae Jong Min ²

¹Winnerstech Co., Ltd. ((주)위너스텍)
²Gyeongsang National University (경상대학교)

Accredited

ABSTRACT

From the user's point of view, Web information sources can be regarded as the largest distributed database. By virtually integrating these distributed sources and treating them as a single large database, we can query them to extract information, a capability that is important for developing Web applications. Integrating the sources requires inferring a database schema from Web documents that were designed for browsing. This paper presents a heuristic algorithm that infers an XML Schema fully automatically from semi-structured Web documents. The algorithm first extracts candidate pattern regions based on predefined structure-making tags, then determines a target pattern region using a few heuristic factors, and finally derives XML Schema extraction rules from that region. The extraction rules are expressed in XQuery, so a variety of application systems can be built on them with open, standard XML tools. We also present experimental results on several public Web sources to show the effectiveness of the algorithm.
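
The three steps named in the abstract (candidate pattern regions, a target region chosen by heuristics, and a rule derived from it) can be illustrated with a minimal Python sketch. It is not the authors' algorithm. The tag set `STRUCTURE_TAGS`, the scoring rule (repeat count multiplied by text length), and every function name below are assumptions made for illustration, and the output is only a plain path expression of the kind an XQuery rule might start from.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed "structure-making" tags; the paper's actual list is not given in the abstract.
STRUCTURE_TAGS = {"table", "tbody", "ul", "ol", "dl", "div"}
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text_len = 0  # number of text characters anywhere below this node


class TreeBuilder(HTMLParser):
    """Builds a minimal element tree; assumes reasonably well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        n = len(data.strip())
        node = self.current
        while node is not None:
            node.text_len += n
            node = node.parent


def candidate_regions(node):
    """Yield (node, child_tag, count) for structure tags whose children repeat."""
    if node.tag in STRUCTURE_TAGS and node.children:
        tag, count = Counter(c.tag for c in node.children).most_common(1)[0]
        if count >= 2:
            yield node, tag, count
    for child in node.children:
        yield from candidate_regions(child)


def score(region):
    # Assumed heuristic: many repetitions and much text suggest the main data region.
    node, _tag, count = region
    return count * node.text_len


def path_to(node):
    parts = []
    while node.parent is not None:
        parts.append(node.tag)
        node = node.parent
    return "/" + "/".join(reversed(parts))


def infer_record_path(html):
    """Return a path to the repeated record element, or None if nothing repeats."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    regions = list(candidate_regions(builder.root))
    if not regions:
        return None
    node, tag, _count = max(regions, key=score)
    return f"{path_to(node)}/{tag}"
```

On a page that contains a short navigation list and a longer table of records, `infer_record_path` would be expected to prefer the table rows, because they repeat more often and carry more text. A real system would need further factors, such as attribute similarity or region position, to be robust.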

KEYWORDS

Information Extraction, XML Schema, XML, Repeated Pattern, Information Integration

Journal of The Korea Society of Computer and Information 2025 KCI Impact Factor : 1.01