Graphics processing units (GPUs) have been accepted as a powerful and viable coprocessor solution in the high-performance computing domain. To maximize the benefit of GPUs on a multicore platform, a mechanism is needed for the CPU threads of a parallel application to share this computing resource efficiently. NVIDIA's Fermi architecture pioneered concurrent kernel execution; however, only kernels of the same thread context can execute in parallel. To make the best use of a GPU device in a multi-threaded application environment, this paper explores techniques for effectively sharing a context, i.e., context funneling, which can be done either manually at the application level or automatically by the GPU runtime starting from CUDA v4.0. In synthetic microbenchmark tests, we find that both funneling mechanisms exploit concurrent kernel execution better than traditional context switching, and therefore improve overall application performance. We also find that manual funneling provides the highest performance and more explicit control, while CUDA v4.0 provides better productivity with good performance. Finally, we assess the impact of these techniques on a compact application benchmark, SSCA#3 (SAR sensor processing).
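As a rough illustration of manual funneling at the application level, the host-only C++ sketch below routes work from several CPU threads through a single thread that would own the GPU context. The abstract does not describe the actual implementation, so everything here is an assumption: the `Funnel` class and `count_executing_threads` are hypothetical names, and the empty lambda stands in for an asynchronous kernel launch, which in a real program would presumably go to a per-submitter stream so that kernels can overlap.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <set>
#include <thread>
#include <vector>

// Hypothetical helper: a single thread that would own the GPU context.
// Other CPU threads never touch the device; they only enqueue work here.
class Funnel {
public:
    Funnel() : worker_([this] { run(); }) {}

    ~Funnel() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();
    }

    // Enqueue a task. The future yields the id of the thread that ran it.
    std::future<std::thread::id> submit(std::function<void()> gpu_work) {
        auto task = std::make_shared<std::packaged_task<std::thread::id()>>(
            [gpu_work] { gpu_work(); return std::this_thread::get_id(); });
        std::future<std::thread::id> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mu_);
            tasks_.push([task] { (*task)(); });
        }
        cv_.notify_one();
        return result;
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mu_);
                cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // shutdown requested, queue drained
                job = std::move(tasks_.front());
                tasks_.pop();
            }
            job();  // in a real program: issue the kernel launch here
        }
    }

    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool done_ = false;
    std::thread worker_;  // declared last so the members above exist before it starts
};

// Returns how many distinct threads executed the submitted work.
std::size_t count_executing_threads(int n_cpu_threads) {
    assert(n_cpu_threads > 0);
    Funnel funnel;
    std::mutex ids_mu;
    std::set<std::thread::id> ids;
    std::vector<std::thread> cpu_threads;
    for (int i = 0; i < n_cpu_threads; ++i) {
        cpu_threads.emplace_back([&] {
            // Placeholder for an asynchronous kernel launch (assumption).
            std::thread::id id = funnel.submit([] {}).get();
            std::lock_guard<std::mutex> lock(ids_mu);
            ids.insert(id);
        });
    }
    for (auto& t : cpu_threads) t.join();
    return ids.size();
}
```

However many CPU threads submit work, every task runs on the one funnel thread, which is the property that would keep all kernels in a single context. The sketch says nothing about the performance trade-offs reported in the paper.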