A Design and Implementation of Paragraph-based Focused Web Crawler Using Semantic Priority of Link

  • Journal of Knowledge Information Technology and Systems
  • Abbr : JKITS
  • 2020, 15(6), pp.1075-1083
  • DOI : 10.34163/jkits.2020.15.6.015
  • Publisher : Korea Knowledge Information Technology Society
  • Research Area : Interdisciplinary Studies > Interdisciplinary Research
  • Received : October 8, 2020
  • Accepted : December 11, 2020
  • Published : December 31, 2020

NamOh Kang 1, Jae Ho Kim 2

1 Keimyung University
2 Gangneung-Wonju National University

Accredited

ABSTRACT

For a search engine to retrieve information correctly and efficiently, it must keep its view of the whole Web consistent. However, because the Web is growing rapidly and its content changes dynamically, a search engine cannot achieve this goal with limited resources such as hardware, network bandwidth, and computing time. To address this problem, the focused web crawler was introduced: it identifies and visits the links most likely to be related to a specific topic and avoids downloading off-topic documents, so it works efficiently under limited resources. In this research, we propose a paragraph-based focused web crawler that uses the semantic priority of links. The proposed system selects promising links from a downloaded web page by measuring the similarity between a topic and each link's data, namely its anchor text and the paragraph containing the link. Unlike existing methods, we propose a novel similarity function that uses WordNet to calculate link priority, and we introduce a method for visiting high-priority links first. We conducted experiments on several topics to evaluate the performance of the proposed paragraph-based focused web crawler. The experimental results show that the paragraph-based focused web crawler using the semantic priority of links improves the term frequency of the retrieved documents.
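
The link-selection idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the WordNet-based similarity function, so a plain token-overlap (Jaccard) score stands in for it here. The names `overlap_similarity`, `link_priority`, `Frontier`, and the `anchor_weight` parameter are all hypothetical, as are the example URLs.

```python
import heapq
import re


def tokens(text):
    """Return the set of lowercase word tokens (a deliberately simple tokenizer)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def overlap_similarity(topic, text):
    """Jaccard overlap between topic tokens and text tokens.

    Stand-in only: the paper uses a WordNet-based similarity function
    whose definition is not given in the abstract.
    """
    a, b = tokens(topic), tokens(text)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def link_priority(topic, anchor_text, paragraph, anchor_weight=0.5):
    """Blend anchor-text and paragraph similarity; the weighting is an assumption."""
    return (anchor_weight * overlap_similarity(topic, anchor_text)
            + (1.0 - anchor_weight) * overlap_similarity(topic, paragraph))


class Frontier:
    """Crawl frontier that returns the highest-priority unseen URL first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker: equal scores pop in insertion order

    def push(self, url, priority):
        if url in self._seen:
            return  # never queue the same URL twice
        self._seen.add(url)
        # heapq is a min-heap, so the priority is negated to pop the maximum.
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

In a crawler loop, each link extracted from a downloaded page would be scored with `link_priority` from its anchor text and enclosing paragraph, pushed onto the `Frontier`, and the next page to download taken from `pop()`. Substituting a WordNet-based measure for `overlap_similarity` would bring the sketch closer to the approach the abstract describes.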
