Skip to Main Content
In many domains such as Telecom various scenarios necessitate the processing of large amounts of data using statistical and machine learning algorithms. A noticeable effort has been made to move the data management systems into MapReduce parallel processing environments such as Hadoop and Pig. Nevertheless these systems lack the features of advanced machine learning and statistical analysis. Frame-works such as Mahout on top of Hadoop support machine learning but their implementations are at the preliminary stage. For example Mahout does not provide Support Vector Machine (SVM) algorithms and it is difficult to use. On the other hand traditional statistical software tools such as R containing comprehensive statistical algorithms for advanced analysis are widely used. But such software can only run on a single computer and therefore it is not scalable. In this paper we propose an integrated solution RPig which takes the advantages of R (for machine learning and statistical analysis capabilities) and parallel data processing capabilities of Pig. The RPig framework offers a scalable advanced data analysis solution for machine learning and statistical analysis. Analysis jobs can be easily developed with RPig script in high level languages. We describe the design implementation and an eclipse-based RPigEditor for the RPig framework. Using application scenarios from the Telecom domain we show the usage of RPig and how the framework can significantly reduce the development effort. The results demonstrate the scalability of our framework and the simplicity of deployment for analysis jobs.
Date of Conference: 3-6 Dec. 2012