This paper presents a multiple-stream scheduling mechanism that enables parallel execution of kernels, host-to-device data transfers, and device-to-host data transfers using multiple CUDA streams. The mechanism divides kernels and bi-directional data transfers into small subtasks, allowing them to be overlapped easily and efficiently on a CUDA-compatible graphics processing unit (GPU). To set the optimal subtask size, we have built a compute-bound model for compute-intensive applications and a data-bound model for applications dominated by bi-directional data transfer. Based on these two models, we also provide three scheduling algorithms, covering data-dependent and data-independent applications, to maximize the efficiency of the overlap. We have applied the mechanism to a set of benchmarks to evaluate its performance. The results show that our approach successfully hides transfer latency and achieves performance very close to the optimum.
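The latency hiding the abstract describes can be illustrated with a standard three-stage pipeline model (host-to-device copy, kernel execution, device-to-host copy, each on its own engine). This is a simplified sketch, not the paper's actual compute-bound or data-bound model: the function names, the assumption of equal-sized subtasks, and the idealized stage times are all ours, introduced only to show why splitting work into subtasks and overlapping stages approaches the throughput of the slowest engine.

```python
def sequential_time(n_chunks, h2d, kernel, d2h):
    """No overlap: each subtask finishes its copy-in, compute,
    and copy-out before the next subtask begins."""
    return n_chunks * (h2d + kernel + d2h)

def pipelined_time(n_chunks, h2d, kernel, d2h):
    """Idealized overlap across three independent engines
    (copy-in, compute, copy-out), as with CUDA streams on a
    device with dual copy engines: after the pipeline fills,
    one subtask completes every max-stage interval."""
    stages = (h2d, kernel, d2h)
    return sum(stages) + (n_chunks - 1) * max(stages)

# With equal stage times, overlap cuts total time roughly 3x
# for many subtasks; with one dominant stage, total time is
# bounded by that stage alone (compute-bound or data-bound).
print(sequential_time(4, 1, 1, 1))   # 12
print(pipelined_time(4, 1, 1, 1))    # 6
```

In this toy model, making subtasks smaller (larger `n_chunks` for the same total work) shrinks the pipeline fill/drain overhead relative to the steady-state term, which is one intuition behind tuning the subtask size; the paper's models additionally account for per-subtask launch and transfer overheads that penalize overly small subtasks.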