Cloud Based Web Scraping for Big Data Applications


Abstract:

With the spread of new technologies, the number of internet users and the volume of data (mostly unstructured) that those users generate on the internet are growing rapidly. Because scraping is one of the major means of extracting unstructured data from the internet, we have analyzed how the scraping process behaves under bulk data extraction. We faced several challenges while scraping large amounts of data: encountering CAPTCHAs, storing a large volume of data, the need for intensive computation capacity, and the reliability of data extraction. In this paper, we investigate a cloud-based web scraping architecture that handles storage and computing resources elastically, on demand, using Amazon Web Services (Elastic Compute Cloud and DynamoDB). Our solution tries to address both scraping and feasibility for big data applications in a single cloud-based architecture for data-based industries. We discuss Selenium as one of our tools for web scraping because the web drivers it supports simulate a real user working with a browser. We also analyze the scalability and performance of the proposed cloud-based scraper and describe its advantages over other cloud-based scrapers.
Date of Conference: 03-05 November 2017
Date Added to IEEE Xplore: 23 November 2017
Conference Location: New York, NY, USA

I. Introduction

The abundance of data on the World Wide Web can be both a blessing and a curse. The web is an enormous source of relevant information; however, acquiring that information is challenging. Most information on the web is presented as HTML documents, a format designed to present information to humans, not to computers, and such data are unstructured [1]. The data growth rate on the internet is soaring every year, and the data is predominantly unstructured [3]. Since unstructured data do not follow any data model, processing them to obtain information is not easy. Our concern is extracting a large volume of data from the internet and saving it for future work. Such extraction can be realized using a software solution called web scraping software. We introduce a cloud-based scraping architecture for unstructured data acquisition from the web.
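To make the gap between human-oriented HTML and a machine-usable record concrete, the following is a minimal sketch in Python. It is not the paper's implementation: it uses only the standard library's `html.parser`, and the names `LinkTitleExtractor`, `html_to_record`, and `SAMPLE_HTML` are hypothetical, introduced here for illustration. In the architecture described in the abstract, the HTML would presumably come from a Selenium-driven browser and the resulting record would be written to DynamoDB; both steps are omitted here.

```python
from html.parser import HTMLParser


class LinkTitleExtractor(HTMLParser):
    """Collect the page title and all hyperlink targets from an HTML document.

    Hypothetical helper for illustration; not part of the paper.
    """

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            # attrs is a list of (name, value) pairs; anchors without href are skipped
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Accumulate text only while inside <title>...</title>
        if self._in_title:
            self.title += data


def html_to_record(url, html):
    """Turn one unstructured HTML page into a flat, structured record."""
    parser = LinkTitleExtractor()
    parser.feed(html)
    parser.close()
    return {"url": url, "title": parser.title.strip(), "links": parser.links}


# Assumed sample input standing in for a page fetched by a scraper.
SAMPLE_HTML = """<html><head><title> Example Page </title></head>
<body><p>Hello</p><a href="/a">A</a><a href="https://example.com/b">B</a><a>no href</a></body></html>"""

record = html_to_record("https://example.com/", SAMPLE_HTML)
print(record)
```

A record of this shape (a dictionary keyed by URL) is the kind of item that a key-value store could hold, which is one plausible reason a scraper would pair with such a store; the actual schema used by the authors is not given in this excerpt.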
