Mining and Harvesting High Quality Topical Resources from the Web
Abstract
Focused crawlers aim to prioritize uncrawled URLs effectively, so that relevant pages are harvested and irrelevant ones are avoided. In practice, the explosion of Web information makes harvesting high-quality topical Web resources more important. Our study shows that the popular focused crawling strategy cannot achieve this goal. In this paper we develop a new focused crawler, On-line Topical Quality Estimation (OTQE), which evaluates the topical quality of uncrawled pages from the observed link and content evidence and prioritizes their URLs accordingly. The new crawler is scalable and requires fewer additional resources for link-based analysis. Experimental results from crawling 3.6 million Web pages demonstrate the advantages of the proposed method over traditional focused crawlers.
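To illustrate the general idea of prioritizing uncrawled URLs by an estimated score, the following is a minimal sketch of a crawl frontier. It is not the OTQE algorithm: the abstract does not specify the estimator, so the `Frontier` class, the `estimate_score` function, and the weight `alpha` are hypothetical names, and the linear mix of content and link evidence is only a placeholder assumption.

```python
import heapq
from typing import Dict, List, Set, Tuple


def estimate_score(content_relevance: float,
                   inlink_scores: List[float],
                   alpha: float = 0.5) -> float:
    """Placeholder estimator (an assumption, not the paper's formula).

    Mixes a content-based relevance value with the mean score of the
    already-crawled pages that link to the uncrawled URL.
    """
    link_evidence = sum(inlink_scores) / len(inlink_scores) if inlink_scores else 0.0
    return alpha * content_relevance + (1.0 - alpha) * link_evidence


class Frontier:
    """Priority queue of uncrawled URLs, highest estimated score first."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, str]] = []
        self._best: Dict[str, float] = {}   # best score seen per pending URL
        self._done: Set[str] = set()        # URLs already handed out
        self._count = 0                     # tie-breaker for equal scores

    def push(self, url: str, score: float) -> None:
        # Ignore URLs already handed out, and scores that do not improve
        # on the best score seen so far for a pending URL.
        if url in self._done or score <= self._best.get(url, float("-inf")):
            return
        self._best[url] = score
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, self._count, url))
        self._count += 1

    def pop(self) -> str:
        while self._heap:
            neg_score, _, url = heapq.heappop(self._heap)
            # Skip stale entries superseded by a later, higher score.
            if self._best.get(url) == -neg_score:
                del self._best[url]
                self._done.add(url)
                return url
        raise IndexError("frontier is empty")
```

In this sketch, re-pushing a URL with a higher score models new evidence arriving as more pages are crawled; older heap entries are left in place and discarded lazily when popped.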