Abstract:
In this paper, we present the development and characteristics of a specialized web-scale forum crawler. The main idea is to crawl relevant forum content from the web with minimal server resource consumption, and to organize the crawled content into logical units that are easier to process and analyze. Forum posts contain the information that is of interest to a forum crawler. Although forums differ in design and are built on different technologies, they share the same navigation logic, which connects the homepage to individual posts through forum lists and threads via specific URLs. Exploiting this common implicit navigation, we reduce the web crawling problem to a URL-type recognition problem. URL-type recognition is achieved with a URL-type database and regular expressions. These regular expressions are extended with special custom characters and commands that give this forum crawler an advantage over other web-based crawlers. The results shown in this paper were obtained by crawling a set of web forums that differ in technology, location, and design. Each test compared the results obtained by a standard web-based crawler with those of our specialized forum crawler. Our test results show that, by crawling only specific data and URL paths on the forum, we reduced the crawling time and achieved lower server resource consumption.
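The URL-type recognition idea can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the type names (`home`, `board`, `thread`), the phpBB-style URL patterns, and the helper functions `classify_url` and `should_follow` are hypothetical and do not come from the paper. The sketch uses only standard Python regular expressions and does not attempt to reproduce the custom characters and commands mentioned above.

```python
import re
from urllib.parse import urlparse

# Hypothetical URL-type table: each entry maps a type name to a regular
# expression over "path?query". The patterns are illustrative only.
URL_TYPES = [
    ("thread", re.compile(r"^/viewtopic\.php\?(?:.*&)?t=\d+")),
    ("board", re.compile(r"^/viewforum\.php\?(?:.*&)?f=\d+")),
    ("home", re.compile(r"^/(?:index\.php)?$")),
]


def classify_url(url):
    """Return the first matching URL type, or "other" if none matches."""
    parts = urlparse(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    for name, pattern in URL_TYPES:
        if pattern.match(target):
            return name
    return "other"


def should_follow(url):
    """Follow only URLs that lie on the homepage -> board -> thread path."""
    return classify_url(url) != "other"
```

Under these assumptions, a crawler would call `should_follow` on every extracted link and skip anything classified as `other` (member lists, search pages, and so on), which is one way the number of requests sent to the forum server could be kept down.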
Published in: 2013 21st Telecommunications Forum Telfor (TELFOR)
Date of Conference: 26-28 November 2013
Date Added to IEEE Xplore: 20 January 2014