
Cloud Based Web Scraping for Big Data Applications


Abstract:

With the penetration of new technologies, the number of internet users and the volume of (mostly unstructured) data they generate on the internet are growing rapidly. As scraping is one of the major means of extracting unstructured data from the Internet, we have analyzed the scraping process when applied to bulk data extraction. We faced several challenges while scraping large amounts of data, such as encountering CAPTCHAs, storage issues for a large volume of data, the need for intensive computation capacity, and the reliability of data extraction. In this paper, we investigate a cloud-based web scraping architecture able to handle storage and computing resources with elasticity on demand using Amazon Web Services (Elastic Compute Cloud and DynamoDB). Our solution tries to address both scraping and feasibility for big data applications in a single cloud-based architecture for data-based industries. We discuss Selenium as one of our tools for web scraping because of the web drivers it supports, which simulate a real user working with a browser. We also analyze the scalability and performance of the proposed cloud-based scraper and describe its advantages over other cloud-based scrapers.

Published in: 2017 IEEE International Conference on Smart Cloud (SmartCloud)

Date of Conference: 03-05 November 2017

Date Added to IEEE Xplore: 23 November 2017

DOI: 10.1109/SmartCloud.2017.28

Conference Location: New York, NY, USA

I. Introduction

The abundance of data on the World Wide Web can be both a blessing and a curse. The web is an enormous source of relevant information; however, acquiring that information is challenging. Most information on the web is presented as HTML documents, a format designed to convey information to humans rather than to computers, and such data are unstructured [1]. The data growth rate on the internet rises every year, and the data is predominantly unstructured [3]. Since unstructured data do not follow any data model, extracting information from them is not easy. Our concern is the extraction of a large volume of data from the internet and its storage for future work. Such extraction can be realized using a software solution called web scraping software. We introduce a cloud-based scraping architecture for unstructured data acquisition from the web.
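To illustrate what turning a human-oriented HTML document into structured records can look like, here is a minimal sketch. It is not the authors' method: the paper uses Selenium with browser drivers, while this sketch uses only Python's standard-library `html.parser` so that it runs without a browser. The names `LinkExtractor`, `extract_links`, and `SAMPLE_HTML`, as well as the page fragment itself, are hypothetical and invented for illustration.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect {"href", "text"} records from anchor tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished records
        self._href = None  # href of the anchor currently open, if any
        self._text = []    # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            self.links.append({"href": self._href, "text": text})
            self._href = None


def extract_links(html):
    """Parse an HTML string and return a list of link records."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page fragment, invented for illustration only.
SAMPLE_HTML = """
<html><body>
  <h1>Products</h1>
  <ul>
    <li><a href="/item/1">Blue <b>widget</b></a></li>
    <li><a href="/item/2">Red widget</a></li>
  </ul>
</body></html>
"""

if __name__ == "__main__":
    for record in extract_links(SAMPLE_HTML):
        print(record)
```

Each record is a plain dictionary, so it could be handed to whatever storage layer is in use. A static parser like this does not execute JavaScript or interact with a page, which is one reason a browser-driving tool such as Selenium may be preferred for pages that need a simulated user.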
