A study on the enhanced filtering method of the deduplication for bulk harvest of web records

  • The Korean Journal of Archival Studies
  • 2013, (35), pp.133-160
  • Publisher : Korean Society Of Archival Studies
  • Research Area : Interdisciplinary Studies > Library and Information Science

이연수 (1), 남성운 (1), 윤대현 (2)

(1) National Archives of Korea (국가기록원)
(2) National Information Society Agency (한국정보화진흥원)

Accredited

ABSTRACT

As networks and electronic devices have developed rapidly, the web's influence on our daily lives has grown. Information created on the web plays an increasingly essential role as a record that reflects its era, so there is strong demand to archive web information by a standardized method. One such method is the snapshot strategy, which crawls web content periodically using automated software. This strategy has two problems. First, it can harvest identical, duplicate content. Second, it can crawl meaningless and useless content because of the complex techniques used to implement websites. In this paper, we categorize the problems that can arise when crawling web content with the snapshot strategy and, based on crawls of public institutions' websites, present possible technical solutions.
