A Design and Implementation of Paragraph-based Focused Web Crawler Using Semantic Priority of Link (링크의 의미 중요성을 이용하는 문단 기반 집중 웹 크롤러 설계 및 구현)

NamOh Kang (강남오); Jae-Ho Kim (김재호)

doi:10.34163/jkits.2020.15.6.015


Journal of Knowledge Information Technology and Systems
Abbr : JKITS
2020, 15(6), pp.1075~1083
DOI : 10.34163/jkits.2020.15.6.015
Publisher : Korea Knowledge Information Technology Society
Research Area : Interdisciplinary Studies > Interdisciplinary Research
Received : October 8, 2020
Accepted : December 11, 2020
Published : December 31, 2020

NamOh Kang ¹, Jae-Ho Kim ²

¹Keimyung University
²Gangneung-Wonju National University

Accredited

ABSTRACT

For a search engine to retrieve information correctly and efficiently, it must stay consistent with the whole Web. However, because the Web is growing rapidly and its content changes dynamically, a search engine cannot achieve this goal with limited resources such as hardware, network, and computing time. To address this problem, focused web crawlers have been introduced: they identify and visit the links most likely to be related to a specific topic and avoid downloading off-topic documents, making efficient use of limited resources. In this paper, we propose a paragraph-based focused web crawler that uses the semantic priority of links. The proposed system selects promising links from a downloaded web page by measuring the similarity between the topic and each link's data, namely its anchor text and the paragraph containing the link. Unlike existing methods, we propose a novel similarity function that uses WordNet to calculate link priority, and we introduce a method for visiting high-priority links first. We conducted experiments on several topics to evaluate the performance of the proposed crawler. The experimental results show that the paragraph-based focused web crawler using the semantic priority of links improves term frequency in document retrieval.
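
The abstract names two mechanisms: a link priority computed from the similarity between the topic and a link's anchor text and enclosing paragraph, and a visiting order that takes high-priority links first. The abstract does not give the similarity function itself, so the following Python sketch is only an illustration under stated assumptions, not the authors' method. The `RELATED` table is a hand-written, hypothetical stand-in for WordNet lookups, and the 0.5 score for related words, the `anchor_weight` parameter, and all function and class names are assumptions introduced here.

```python
import heapq
import re
from itertools import count

# Hypothetical stand-in for WordNet: maps a word to a few related words.
# A real system would query WordNet instead of this hand-written table.
RELATED = {
    "crawler": {"spider", "robot"},
    "search": {"retrieval", "lookup"},
}


def tokens(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_similarity(a, b):
    """1.0 for identical words, 0.5 for related words (assumed score), else 0.0."""
    if a == b:
        return 1.0
    if b in RELATED.get(a, set()) or a in RELATED.get(b, set()):
        return 0.5
    return 0.0


def text_similarity(topic, text):
    """Average, over the topic words, of the best match found in the text."""
    topic_words, text_words = tokens(topic), tokens(text)
    if not topic_words or not text_words:
        return 0.0
    best = [max(word_similarity(t, w) for w in text_words) for t in topic_words]
    return sum(best) / len(topic_words)


def link_priority(topic, anchor_text, paragraph, anchor_weight=0.5):
    """Blend anchor-text and paragraph similarity; the weight is an assumption."""
    return (anchor_weight * text_similarity(topic, anchor_text)
            + (1.0 - anchor_weight) * text_similarity(topic, paragraph))


class Frontier:
    """Best-first queue: pop() returns the highest-priority URL still queued."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._order = count()  # tie-breaker: equal scores pop in insertion order

    def push(self, url, priority):
        """Queue a URL once; repeated pushes of the same URL are ignored."""
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, next(self._order), url))

    def pop(self):
        """Remove and return (url, priority) for the best queued link."""
        neg_priority, _, url = heapq.heappop(self._heap)
        return url, -neg_priority

    def __len__(self):
        return len(self._heap)
```

In use, a crawler built this way would call `link_priority` for every link extracted from a downloaded page, `push` each URL with its score, and `pop` the next page to fetch. Scoring the paragraph alongside the anchor text lets a link with a vague anchor such as "click here" still rank highly when the text around it is on topic.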

KEYWORDS

Web search engines, Focused web crawlers, Link priorities, Semantic web, Information retrieval, WordNet

