A distributed spider (Web crawler) prototype was designed using Web services based on a service-oriented architecture (SOA). The prototype consists of one server and several clients. The server directs the clients to download new Web pages based on the pages already crawled. Each client analyzes downloaded pages with multiple threads and maintains three queues: the to-crawl URL queue, the crawled URL queue, and the noise URL queue. The clients also keep a connection with the server to pass along unknown URLs and domain names. The server is made up of the front platform and the background. The front platform controls the clients, which includes a load-balancing policy and real-time monitoring of the clients through Microsoft Message Queuing (MSMQ). A Web service is deployed on the server background, which contains the persistent data connection layer. Through this layer, the front platform and the clients access data via a standard interface. Finally, experiments show that the distributed spider system is robust.
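
The client-side queue handling described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `ClientFrontier`, the suffix-based noise rule, and the idea that each client is given a set of assigned domains are all assumptions made for illustration, since the abstract does not specify them. URLs outside the assumed assigned domains are collected as "unknown" so they could be passed to the server.

```python
import threading
from collections import deque
from urllib.parse import urlparse


class ClientFrontier:
    """Hypothetical per-client URL bookkeeping; all names are illustrative."""

    def __init__(self, assigned_domains, noise_suffixes=(".jpg", ".css", ".js")):
        self.assigned_domains = set(assigned_domains)  # assumed assignment by the server
        self.noise_suffixes = tuple(noise_suffixes)    # assumed noise rule
        self.to_crawl = deque()  # URLs waiting to be downloaded
        self.crawled = set()     # URLs already handed out for download
        self.noise = set()       # URLs judged not worth crawling
        self.unknown = []        # URLs to report to the server
        self._queued = set()     # mirrors to_crawl for fast duplicate checks
        self._lock = threading.Lock()  # analysis threads share these queues

    def add_links(self, links):
        """Sort links extracted from a page into the appropriate queue."""
        with self._lock:
            for url in links:
                if url in self.crawled or url in self._queued or url in self.noise:
                    continue  # already known to this client
                parts = urlparse(url)
                if parts.path.lower().endswith(self.noise_suffixes):
                    self.noise.add(url)
                elif parts.netloc.lower() not in self.assigned_domains:
                    if url not in self.unknown:
                        self.unknown.append(url)
                else:
                    self.to_crawl.append(url)
                    self._queued.add(url)

    def next_url(self):
        """Pop the next URL to download and record it as crawled."""
        with self._lock:
            if not self.to_crawl:
                return None
            url = self.to_crawl.popleft()
            self._queued.discard(url)
            self.crawled.add(url)
            return url

    def drain_unknown(self):
        """Return and clear the URLs that should be passed to the server."""
        with self._lock:
            pending, self.unknown = self.unknown, []
            return pending
```

In this sketch a single lock guards all queues, which keeps the example simple; a real client with many analysis threads might prefer finer-grained synchronization.
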
Published in:
Web Mining and Web-based Application, 2009. WMWA '09. Second Pacific-Asia Conference on
Date of Conference: 6-7 June 2009