Abstract:
Spark SQL lets Spark programmers query structured data inside Spark programs using SQL statements. It makes the benefits of relational processing convenient to use, and its underlying distributed RDD processing accelerates queries on large data sets. However, Spark SQL is not designed for long-running services, and its built-in data source loads data from the storage system, such as HDFS or the local file system, on every table scan, with no caching mechanism. Although users can keep data in memory explicitly with the "cache" command, the data cached this way is coarse-grained. In this paper, we present an indexing structure that is a pluggable component of Spark SQL, built on Apache Spark. Compared with Spark SQL alone, it offers two additional advantages. First, it allows users to create an index on the structured data to be processed, which greatly speeds up queries. Second, it enables programmers to load structured data into memory at the granularity of individual data files, which makes it flexible to load "hot data" into memory and evict "cold data" from memory.
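The two advantages named above can be illustrated with a small, self-contained sketch. It is not the paper's implementation and does not use Spark: the class `IndexedFileStore`, its methods, and the `storage_reads` counter are hypothetical names introduced only for illustration, and a plain dictionary stands in for files on HDFS or a local file system. It assumes an equality lookup on one column and a simple hash index that maps each value to the files and row positions that contain it.

```python
from collections import defaultdict


class IndexedFileStore:
    """Hypothetical sketch: per-file caching plus a hash index (not the paper's code)."""

    def __init__(self, files):
        # files: file name -> list of row dicts; stands in for HDFS/local files.
        self._storage = files
        self._cache = {}   # file name -> rows currently held in memory
        self._index = {}   # column -> value -> {file name: [row positions]}
        self.storage_reads = 0  # counts reads that go to "storage" during load/lookup

    def _rows(self, name):
        # Serve from memory when the file is cached, otherwise read from storage.
        if name in self._cache:
            return self._cache[name]
        self.storage_reads += 1
        return self._storage[name]

    def load(self, name):
        # Bring a single "hot" file into memory.
        if name not in self._cache:
            self.storage_reads += 1
            self._cache[name] = list(self._storage[name])

    def evict(self, name):
        # Drop a single "cold" file from memory.
        self._cache.pop(name, None)

    def create_index(self, column):
        # One-time scan of every file; this build pass is not counted in storage_reads.
        index = defaultdict(lambda: defaultdict(list))
        for name, rows in self._storage.items():
            for pos, row in enumerate(rows):
                index[row[column]][name].append(pos)
        self._index[column] = index

    def lookup(self, column, value):
        index = self._index.get(column)
        if index is None:
            # No index on this column: fall back to scanning every file.
            return [row for name in self._storage
                    for row in self._rows(name) if row[column] == value]
        result = []
        # With an index, only the files that contain the value are touched.
        for name, positions in index.get(value, {}).items():
            rows = self._rows(name)
            result.extend(rows[pos] for pos in positions)
        return result


files = {
    "part-0": [{"id": 1, "city": "Paris"}, {"id": 2, "city": "Rome"}],
    "part-1": [{"id": 3, "city": "Oslo"}, {"id": 4, "city": "Paris"}],
}
store = IndexedFileStore(files)
store.create_index("city")
store.load("part-0")                  # keep only the hot file in memory
print(store.lookup("city", "Rome"))   # answered from the cached file
print(store.storage_reads)
```

In this sketch, caching and eviction operate on one file at a time rather than on a whole table, and an indexed lookup reads only the files that hold matching rows. Both behaviours are simplifications chosen to mirror the description above, not details confirmed by the paper.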
Date of Conference: 04-06 November 2017
Date Added to IEEE Xplore: 23 November 2017