Abstract:
This article analyzes the limitations of traditional topic crawlers and, on this basis, compares depth-first and breadth-first crawling strategies to construct an improved topic URL crawling strategy. Regular expressions and web page selectors are used to locate actionable positions in a web page, and the program simulates human operations at these positions to obtain more topic-related URLs and web page content. An experimental process for the topic crawler is then established, the topic crawler algorithm is designed and improved, and the experimental results are compared and analyzed. The results show that the improved URL crawling strategy greatly reduces the total number of URLs crawled, shortens the crawling time, and improves per-crawler efficiency in fetching target topic web pages.
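To make the comparison of depth-first and breadth-first strategies concrete, the following is a minimal sketch and not the article's algorithm. It assumes a small hypothetical in-memory link graph (LINKS) in place of real web pages and a hypothetical regular-expression topic filter (TOPIC); the only difference between the two strategies is which end of the frontier the next URL is taken from.

```python
import re
from collections import deque

# Hypothetical link graph standing in for fetched pages (not from the article).
LINKS = {
    "/home": ["/news", "/sports/football", "/about"],
    "/news": ["/sports/tennis", "/news/politics"],
    "/about": ["/contact"],
    "/sports/football": ["/sports/football/scores"],
    "/sports/tennis": [],
    "/news/politics": [],
    "/contact": [],
    "/sports/football/scores": [],
}

# Assumed topic filter: a URL counts as on-topic if its path has a "sports" segment.
TOPIC = re.compile(r"/sports(/|$)")


def crawl(start, strategy="bfs"):
    """Return URLs in visit order.

    "bfs" takes the oldest frontier entry (queue behaviour);
    "dfs" takes the newest frontier entry (stack behaviour).
    """
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        url = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:  # avoid revisiting a URL
                seen.add(link)
                frontier.append(link)
    return order


def on_topic(urls):
    """Keep only the URLs that match the assumed topic pattern."""
    return [u for u in urls if TOPIC.search(u)]


if __name__ == "__main__":
    for strategy in ("bfs", "dfs"):
        visited = crawl("/home", strategy)
        print(strategy, visited, on_topic(visited))
```

Both strategies reach the same set of URLs on this toy graph; they differ only in visit order, which is what matters when a crawl is cut off by a page or time budget.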
Published in: 2023 International Conference on Applied Intelligence and Sustainable Computing (ICAISC)
Date of Conference: 16-17 June 2023
Date Added to IEEE Xplore: 09 August 2023