Demand for highly scalable parallel data processing platforms is rising due to an explosion in the number of massive-scale, data-intensive applications in both industry and the sciences. Performing statistical computing over huge data repositories poses a significant challenge to existing statistical software and computational infrastructure. An analysis of various open source computational infrastructures and their programming APIs shows that most of them are JVM-based and expose their APIs as Java interfaces or abstract classes. This paper proposes a generic framework, JR Bridge, which integrates R with JVM-based computational infrastructures by automatically generating Java API wrapper code around native R code and handling type conversion. Using this framework, we build a distributed statistical computing environment by integrating R with Hadoop. The Hadoop Distributed File System plug-in provides a way to store and access datasets with millions of objects. The MapReduce plug-in provides a natural environment for coding MapReduce algorithms in R. Experimental results show that JR Bridge scales linearly with the size of the datasets and thus provides a scalable solution for large-scale statistical computing in R.
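The wrapper idea described above can be illustrated with a minimal sketch. It is not JR Bridge's actual generated code or API: `RNumeric`, `RFunction`, `JavaMapper`, `toR`, `fromR`, and `wrap` are all hypothetical names, and the "R function" is simulated by a Java lambda so that the example runs without an R installation. It only shows the general pattern of adapting a native function to a Java interface while converting types in both directions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class JrBridgeSketch {

    /** Hypothetical stand-in for an R numeric vector. */
    static final class RNumeric {
        final double[] values;
        RNumeric(double[] values) { this.values = values; }
    }

    /** Hypothetical stand-in for a native R function of one vector argument. */
    interface RFunction {
        RNumeric call(RNumeric arg);
    }

    /** Hypothetical Java-side interface that a JVM platform might expect. */
    interface JavaMapper {
        List<Double> map(List<Double> input);
    }

    /** Type conversion: Java list to the R-like vector. */
    static RNumeric toR(List<Double> xs) {
        double[] out = new double[xs.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = xs.get(i);
        }
        return new RNumeric(out);
    }

    /** Type conversion: R-like vector back to a Java list. */
    static List<Double> fromR(RNumeric v) {
        List<Double> out = new ArrayList<>();
        for (double d : v.values) {
            out.add(d);
        }
        return out;
    }

    /** The wrapper: exposes an R-style function through the Java interface. */
    static JavaMapper wrap(RFunction f) {
        return input -> fromR(f.call(toR(input)));
    }

    public static void main(String[] args) {
        // Simulated R function, analogous to mean(x); NaN for empty input.
        RFunction mean = v -> {
            double sum = 0.0;
            for (double d : v.values) {
                sum += d;
            }
            double m = v.values.length == 0 ? Double.NaN : sum / v.values.length;
            return new RNumeric(new double[] { m });
        };

        JavaMapper mapper = wrap(mean);
        System.out.println(mapper.map(Arrays.asList(1.0, 2.0, 3.0))); // prints [2.0]
    }
}
```

In a real bridge, the body of `RFunction.call` would delegate to an R runtime rather than a lambda, and the wrapper class would be generated rather than written by hand; both of those details are outside what this sketch attempts.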
Date of Conference: 6-8 Dec. 2012