RPig: A scalable framework for machine learning and advanced statistical functionalities | IEEE Conference Publication | IEEE Xplore

RPig: A scalable framework for machine learning and advanced statistical functionalities


Abstract:

In many domains such as Telecom various scenarios necessitate the processing of large amounts of data using statistical and machine learning algorithms. A noticeable effo...Show More

Abstract:

In many domains such as Telecom various scenarios necessitate the processing of large amounts of data using statistical and machine learning algorithms. A noticeable effort has been made to move the data management systems into MapReduce parallel processing environments such as Hadoop and Pig. Nevertheless these systems lack the features of advanced machine learning and statistical analysis. Frame-works such as Mahout on top of Hadoop support machine learning but their implementations are at the preliminary stage. For example Mahout does not provide Support Vector Machine (SVM) algorithms and it is difficult to use. On the other hand traditional statistical software tools such as R containing comprehensive statistical algorithms for advanced analysis are widely used. But such software can only run on a single computer and therefore it is not scalable. In this paper we propose an integrated solution RPig which takes the advantages of R (for machine learning and statistical analysis capabilities) and parallel data processing capabilities of Pig. The RPig framework offers a scalable advanced data analysis solution for machine learning and statistical analysis. Analysis jobs can be easily developed with RPig script in high level languages. We describe the design implementation and an eclipse-based RPigEditor for the RPig framework. Using application scenarios from the Telecom domain we show the usage of RPig and how the framework can significantly reduce the development effort. The results demonstrate the scalability of our framework and the simplicity of deployment for analysis jobs.
Date of Conference: 03-06 December 2012
Date Added to IEEE Xplore: 04 February 2013
ISBN Information:
Conference Location: Taipei, Taiwan

References

References is not available for this document.