Massive-scale analytics (MSA) applications are characterized by the large volumes of data they process and by the complexity of the algorithms used to process that data. The ideal MSA system not only supports processing of large amounts of data but also offers a high degree of parallelism and supports scheduling and resource allocation for complex workloads. Designers of MSA systems must address three necessities: programming abstractions, runtime systems, and hardware. Historically, two communities have undertaken the task of designing MSA systems: the database community, which has argued for an SQL (Structured Query Language)-influenced processing paradigm, and the high-performance computing community, which has focused on developing infrastructures for highly efficient, but complex, parallel implementations. These two communities have developed disparate technologies to meet the necessities of MSA systems, and the solutions each community provides are not completely satisfactory. In this paper, we characterize the strengths and weaknesses of the two communities' approaches at all levels of the MSA stack, examine the implications for resource management within the MSA system, and propose how an MSA system should be designed.
Note: The Institute of Electrical and Electronics Engineers, Incorporated is distributing this Article with permission of the International Business Machines Corporation (IBM), which is the exclusive owner. The recipient of this Article may not assign, sublicense, lease, rent or otherwise transfer, reproduce, prepare derivative works, publicly display or perform, or distribute the Article.