Data-at-rest security for Spark

Abstract:

Apache Spark enables fast computations and greatly accelerates analytics applications by efficiently utilizing main memory and caching data for later use. At its core, Apache Spark uses data structures called RDDs (Resilient Distributed Datasets) to give a unified view of distributed data. However, the data represented in RDDs remains unencrypted, which can result in leakage of confidential data produced or processed by applications. Apache Spark persists unencrypted RDDs to disk storage under various circumstances, including but not limited to caching, RDD checkpointing, and data spill during shuffle operations. This lack of security makes Apache Spark unsuitable for processing sensitive information that should be secured at all times. Moreover, RDDs stored in main memory are prone to main-memory attacks such as RAM scraping. In this paper, we propose and develop solutions to close these security gaps in the current Apache Spark framework. We present three different approaches to incorporating security into the Apache Spark framework. These approaches are designed to limit the exposure of unencrypted data during data processing, caching, and data spill to disk. We use a combination of cryptographic splitting and encryption to secure data stored and spilled by Apache Spark, both on disk and in main memory. Our approaches provide strong security by combining an Information Dispersal Algorithm (IDA) with Shamir's Perfect Secret Sharing (PSS). Extensive experimentation shows that, with appropriately chosen parameters, our security approaches provide high security at a performance penalty of 10% to 25%.
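
The abstract names Shamir's secret sharing as one building block but gives no construction details here. As a rough illustration of the general idea only, the following is a minimal textbook sketch of (k, n) threshold sharing over a prime field. It is not the authors' implementation; the names `PRIME`, `split_secret`, and `recover_secret` and the choice of field size are assumptions made for this example.

```python
import secrets

# Assumption for this sketch: a Mersenne prime larger than any secret we split.
PRIME = 2**127 - 1


def split_secret(secret, n, k, prime=PRIME):
    """Split `secret` into n shares such that any k of them recover it."""
    if not 1 <= k <= n < prime:
        raise ValueError("need 1 <= k <= n < prime")
    if not 0 <= secret < prime:
        raise ValueError("secret must lie in [0, prime)")
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(prime) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        # Evaluate the polynomial at x with Horner's rule, modulo the prime.
        y = 0
        for c in reversed(coeffs):
            y = (y * x + c) % prime
        shares.append((x, y))
    return shares


def recover_secret(shares, prime=PRIME):
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % prime
                den = (den * (xi - xj)) % prime
        # Modular inverse via Fermat's little theorem (prime modulus).
        inv_den = pow(den, prime - 2, prime)
        secret = (secret + yi * num * inv_den) % prime
    return secret


if __name__ == "__main__":
    shares = split_secret(123456789, n=5, k=3)
    # Any 3 of the 5 shares are enough to rebuild the value.
    print(recover_secret(shares[:3]) == 123456789)
```

In a scheme of this kind, fewer than k shares reveal nothing about the secret, which is why it is typically applied to small values such as encryption keys, with a space-efficient dispersal scheme handling the bulk data.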

Published in: 2016 IEEE International Conference on Big Data (Big Data)

Date of Conference: 05-08 December 2016

Date Added to IEEE Xplore: 06 February 2017

DOI: 10.1109/BigData.2016.7840754

Conference Location: Washington, DC, USA

I. Introduction

Apache Spark has emerged as a fast-growing and widely adopted framework for big-data analytics [1]. The capability to run batch processing, streaming, iterative, and interactive jobs within a single framework [2] has made Apache Spark a natural choice for researchers, data scientists, and industry. It is supported and maintained by a large community of open-source contributors. Many companies not only use Apache Spark but also offer their own version of it to customers as a service in the cloud. However, when it comes to security and confidentiality of data, Apache Spark faces the same challenges as cloud computing infrastructures and other existing big-data platforms such as Hadoop [3]. Researchers have proposed various solutions for data security and confidentiality at different levels in big-data frameworks such as Hadoop. However, these approaches are not directly applicable to Apache Spark because of architectural differences between Apache Spark and other frameworks. A review of existing security approaches is provided in Section VII.
