Indexing for Large Scale Data Querying Based on Spark SQL | IEEE Conference Publication | IEEE Xplore


Abstract:

Spark SQL lets Spark programmers query structured data inside Spark programs using SQL statements. It makes it convenient for them to leverage the benefits of relational processing, and its internal RDD-based distributed processing also accelerates queries on large data sets. However, Spark SQL is not designed for long-running services: its built-in data sources load data from the storage system, such as HDFS or the local file system, on every table scan, with no caching mechanism. Although users can keep data in memory explicitly with the "cache" command, the cached data is coarse-grained. In this paper, we present an indexing structure that is a pluggable component of Spark SQL, built on Apache Spark. Compared with plain Spark SQL, it has two additional advantages. First, it allows users to create indexes on the structured data to be processed, which greatly improves query performance. Second, it enables programmers to load structured data into memory at the granularity of individual data files, which makes it flexible to load "hot data" into memory and to evict "cold data" from it.
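
The two ideas in the abstract, an index over structured data and file-level caching of hot data, can be illustrated with a small sketch. This is not the paper's implementation and does not use Spark: it is a plain-Python toy, and every name in it (`FileGrainedTable`, `lookup`, the `part-N` file names, the least-recently-used eviction policy) is an assumption made for illustration only.

```python
from collections import OrderedDict, defaultdict


class FileGrainedTable:
    """Toy table split into data files, with a value-to-files index
    and a least-recently-used cache that holds whole files."""

    def __init__(self, files, key, capacity):
        self.files = files              # stands in for storage: {file name: [row dicts]}
        self.key = key                  # column the index is built on
        self.capacity = capacity        # maximum number of files kept in memory
        self.cache = OrderedDict()      # file name -> rows, coldest first
        self.loads = 0                  # number of simulated reads from storage
        self.index = defaultdict(set)   # key value -> names of files containing it
        for name, rows in files.items():
            for row in rows:
                self.index[row[key]].add(name)

    def _read(self, name):
        if name in self.cache:
            self.cache.move_to_end(name)        # file is hot again
            return self.cache[name]
        self.loads += 1                         # simulated read from storage
        rows = self.files[name]
        self.cache[name] = rows
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict the coldest file
        return rows

    def lookup(self, value):
        """Return rows whose key column equals value, touching only indexed files."""
        rows = []
        for name in sorted(self.index.get(value, ())):
            rows.extend(r for r in self._read(name) if r[self.key] == value)
        return rows


# Hypothetical data: three files of one logical table.
files = {
    "part-0": [{"id": 1, "city": "Shanghai"}, {"id": 2, "city": "Beijing"}],
    "part-1": [{"id": 3, "city": "Shanghai"}],
    "part-2": [{"id": 4, "city": "Shenzhen"}],
}
table = FileGrainedTable(files, key="city", capacity=2)
shanghai = table.lookup("Shanghai")   # index points at part-0 and part-1 only
again = table.lookup("Shanghai")      # both files are served from the cache
shenzhen = table.lookup("Shenzhen")   # loads part-2 and evicts the coldest file
```

In this sketch the index avoids scanning files that cannot contain the requested value, and the cache works per file rather than per table, which is the contrast with coarse-grained caching that the abstract draws. How the paper's component actually builds its index or chooses what to evict is not stated in the abstract.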
Date of Conference: 04-06 November 2017
Date Added to IEEE Xplore: 23 November 2017
Conference Location: Shanghai, China
