The volume of data generated by scientific simulations, sensors, monitors, and optical telescopes is growing dramatically. To analyze the raw data quickly and space-efficiently, a data preprocessing step is needed to achieve better performance in the data analysis phase. Current research shows an increasing trend of adopting the MapReduce framework for large-scale data processing. However, the data access patterns generally applied to scientific data sets are not directly supported by the current MapReduce framework. This gap between the requirements of analytics applications and the properties of the MapReduce framework motivates us to provide support for these data access patterns in MapReduce. In our work, we studied the data access patterns in matrix files and proposed a new concentric data layout to facilitate matrix data access and analysis in the MapReduce framework. The concentric data layout is a hierarchical layout that preserves the dimensional properties of large data sets. In contrast to the continuous data layout adopted in the current Hadoop framework, the concentric data layout stores the data from the same sub-matrix in one chunk, and then stores the chunks symmetrically at a higher level. This matches matrix-like computation well. The concentric data layout preprocesses the data beforehand, which optimizes the subsequent run of the MapReduce application. Experiments show that the concentric data layout improves overall performance, reducing execution time by about 38% when reading a 64 GB file. It also mitigates the overhead of reading unused data and increases useful data efficiency by 32% on average.
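To make the unused-read overhead concrete, here is a minimal Python sketch. It is not the authors' implementation. It only compares how many storage blocks a sub-matrix read touches under two layouts. It assumes that the continuous layout is row-major and that a chunk holds exactly one square tile. The function names (`blocks_row_major`, `blocks_tiled`, `useful_fraction`) and the toy sizes are hypothetical. The abstract does not say how chunks are arranged symmetrically at the higher level, so that part is not modelled.

```python
def blocks_row_major(n_cols, block_elems, r0, c0, h, w):
    """Blocks touched when an h x w sub-matrix starting at (r0, c0) is read
    from a matrix stored row by row (assumed form of the continuous layout)."""
    touched = set()
    for r in range(r0, r0 + h):
        for c in range(c0, c0 + w):
            # Linear offset of element (r, c), mapped to its storage block.
            touched.add((r * n_cols + c) // block_elems)
    return touched


def blocks_tiled(n_cols, tile, r0, c0, h, w):
    """Chunks touched when every tile x tile sub-matrix is stored as one chunk
    (the chunking step only; higher-level chunk placement is not modelled)."""
    tiles_per_row = n_cols // tile
    touched = set()
    for r in range(r0, r0 + h):
        for c in range(c0, c0 + w):
            touched.add((r // tile) * tiles_per_row + (c // tile))
    return touched


def useful_fraction(requested_elems, blocks, block_elems):
    """Share of the elements actually read that the application asked for."""
    return requested_elems / (len(blocks) * block_elems)


# Toy case: 16 x 16 matrix, 16 elements per block, read the 4 x 4 tile at (4, 4).
row_major = blocks_row_major(n_cols=16, block_elems=16, r0=4, c0=4, h=4, w=4)
tiled = blocks_tiled(n_cols=16, tile=4, r0=4, c0=4, h=4, w=4)

print(len(row_major), useful_fraction(16, row_major, 16))  # 4 blocks, 0.25
print(len(tiled), useful_fraction(16, tiled, 16))          # 1 chunk, 1.0
```

In this toy case the row-major layout reads four blocks to obtain 16 useful elements, while the tiled layout reads a single chunk. The figures come from the toy sizes chosen here and are unrelated to the 38% and 32% results reported above.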