Heterogeneous parallel systems that include accelerators such as Graphics Processing Units (GPUs) are expected to play a major role in the architecture of the world's largest systems, as well as the most powerful embedded devices. Impressive computational speedups have been reported for numerous algorithms in fields such as medical image processing, digital signal processing, astrophysics, and modeling and simulation. However, it is frequently assumed that the working data set of the application fits in the memory of the accelerator. In this paper, we first lift this constraint by presenting a simple and scalable compile-time approach for processing large data sets based on I/O tiling. Second, we combine tiling with streaming in our asynchronous execution model, which enables efficient data-driven processing of large data sets on heterogeneous platforms with accelerators. Finally, we present results for several microbenchmarks and three data-parallel kernels.
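As a rough illustration of the general idea of combining tiling with streaming, the following minimal Python sketch splits a data set into tiles small enough for an assumed accelerator memory and starts the transfer of the next tile while the current one is being computed. It is not the compile-time approach or the execution model of the paper. All names (`split_into_tiles`, `load_tile`, `run_kernel`, `process_streamed`) are hypothetical, the "transfer" is a plain slice copy, and the "kernel" is an element-wise square.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_tiles(n_items, tile_size):
    """Yield (start, stop) index ranges that each cover at most tile_size items."""
    for start in range(0, n_items, tile_size):
        yield start, min(start + tile_size, n_items)


def load_tile(data, start, stop):
    """Stand-in for a host-to-accelerator copy (assumption: a plain slice copy)."""
    return list(data[start:stop])


def run_kernel(tile):
    """Stand-in for a data-parallel kernel (assumption: square every element)."""
    return [x * x for x in tile]


def process_streamed(data, tile_size):
    """Process data tile by tile, requesting tile i+1 before computing tile i."""
    ranges = list(split_into_tiles(len(data), tile_size))
    result = []
    if not ranges:
        return result
    with ThreadPoolExecutor(max_workers=1) as copier:
        # Start the transfer of the first tile.
        pending = copier.submit(load_tile, data, *ranges[0])
        for next_range in ranges[1:] + [None]:
            tile = pending.result()  # wait until the current tile has arrived
            if next_range is not None:
                # Request the next tile before computing on the current one.
                pending = copier.submit(load_tile, data, *next_range)
            result.extend(run_kernel(tile))
    return result


if __name__ == "__main__":
    print(process_streamed(list(range(10)), tile_size=4))
```

Only two tiles are in flight at any time (one being computed, one being loaded), so the memory needed on the accelerator side depends on the tile size and not on the size of the whole data set. In CPython, threads running pure Python code do not execute in parallel, so the sketch shows only the control structure. On a real platform the copy and the kernel would be issued through the accelerator's asynchronous interfaces.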