Data-at-rest security for spark | IEEE Conference Publication | IEEE Xplore

Data-at-rest security for spark


Abstract:

Apache Spark enables fast computations and greatly accelerates analytics applications by efficiently utilizing main memory and caching data for later use. At its core, Apache Spark uses data structures called RDDs (Resilient Distributed Datasets) to give a unified view of distributed data. However, the data represented in RDDs remains unencrypted, which can result in leakage of confidential data produced or processed by applications. Apache Spark persists unencrypted RDDs to disk storage under various circumstances, including but not limited to caching, RDD checkpointing, and data spill during shuffle operations. This lack of security makes Apache Spark unsuitable for processing sensitive information that should be secured at all times. Moreover, RDDs stored in main memory are prone to main-memory attacks such as RAM scraping. In this paper, we propose and develop solutions to close these security gaps in the current Apache Spark framework. We present three different approaches to incorporating security in the Apache Spark framework. These approaches are designed to limit the exposure of unencrypted data during data processing, caching, and data spill to disk. We use a combination of cryptographic splitting and encryption to secure data stored and spilled by Apache Spark, both on disk and in main memory. Our approaches provide strong security by incorporating a combination of the Information Dispersal Algorithm (IDA) and Shamir's Perfect Secret Sharing (PSS). Extensive experimentation shows that, with appropriately chosen parameters, our security approaches provide high security at a performance penalty of 10% to 25%.
Date of Conference: 05-08 December 2016
Date Added to IEEE Xplore: 06 February 2017
Conference Location: Washington, DC, USA
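
The abstract names Shamir's secret sharing as one of the two building blocks. As a rough illustration of that primitive only, the sketch below splits an integer secret into n shares so that any k of them recover it. This is not the authors' implementation and does not cover the IDA part. The function names, the prime modulus, and the parameter values are all assumptions made for this example.

```python
import secrets

# Assumed parameter: a Mersenne prime larger than any secret we split.
PRIME = 2**127 - 1


def split_secret(secret, n, k, prime=PRIME):
    """Split `secret` into n shares; any k of them reconstruct it."""
    if not (1 <= k <= n < prime):
        raise ValueError("need 1 <= k <= n < prime")
    if not (0 <= secret < prime):
        raise ValueError("secret must lie in [0, prime)")
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(prime) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        # Horner evaluation of the polynomial at x, modulo prime.
        y = 0
        for c in reversed(coeffs):
            y = (y * x + c) % prime
        shares.append((x, y))
    return shares


def recover_secret(shares, prime=PRIME):
    """Lagrange-interpolate the polynomial at x = 0 from k shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % prime
                den = (den * (xi - xj)) % prime
        # pow(den, -1, prime) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, prime)) % prime
    return secret


if __name__ == "__main__":
    shares = split_secret(123456789, n=5, k=3)
    print(recover_secret(shares[:3]) == 123456789)
```

In a setting like the one described, such a scheme would plausibly protect a small value such as an encryption key rather than bulk data, since each share is as large as the secret itself. That reading is an assumption and is not stated in the text shown here.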

I. Introduction

Apache Spark has emerged as a fast-growing and widely adopted framework for big-data analytics [1]. The capability to run batch processing, streaming, iterative, and interactive jobs within a single framework [2] has made Apache Spark a natural choice for researchers, data scientists, and industry. It is supported and maintained by a large community of open-source contributors. Many companies not only use Apache Spark but also offer their own versions of it to customers as a service over the cloud. However, when it comes to security and confidentiality of data, Apache Spark faces the same challenges as cloud computing infrastructures and other existing big-data platforms such as Hadoop [3]. Researchers have proposed various solutions for data security and confidentiality at different levels in big-data frameworks such as Hadoop. However, these approaches are not directly applicable to Apache Spark because of architectural differences between Apache Spark and other frameworks. A review of existing security approaches is provided in Section VII.
