
Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters



Abstract:

GPUDirect RDMA (GDR) brings the high-performance communication capabilities of RDMA networks like InfiniBand (IB) to GPUs (referred to as "Device"). It enables IB network adapters to directly write/read data to/from GPU memory. Partitioned Global Address Space (PGAS) programming models, such as OpenSHMEM, provide an attractive approach for developing scientific applications with irregular communication characteristics by providing shared-memory address space abstractions along with one-sided communication semantics. However, current approaches and designs of OpenSHMEM on GPU clusters do not take advantage of GDR features, leading to inefficiencies and sub-optimal performance. In this paper, we analyze the performance of various OpenSHMEM operations with different inter-node and intra-node communication configurations (Host-to-Device, Device-to-Device, and Device-to-Host) on GPU-based systems. We propose novel designs that ensure "truly one-sided" communication for the different inter-/intra-node configurations identified above while working around hardware limitations. To the best of our knowledge, this is the first work that investigates GDR-aware designs for OpenSHMEM communication operations. Experimental evaluations indicate 2.5X and 7X improvements in intra-node and inter-node point-to-point communication, respectively. The proposed framework achieves 2.2µs for an intra-node 8-byte put operation from Host-to-Device, and 3.13µs for an inter-node 8-byte put operation from a GPU to a remote GPU. With the Stencil2D application kernel from the SHOC benchmark suite, we observe a 19% reduction in execution time on 64 GPU nodes. Further, for the GPULBM application, we improve the performance of the evolution phase by 53% and 45% on 32 and 64 GPU nodes, respectively.
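To make the one-sided put semantics discussed above concrete, the following is a minimal sketch of an 8-byte OpenSHMEM put using the standard C API. This is illustrative code, not taken from the paper; whether the symmetric buffer allocated with shmem_malloc can reside in GPU memory (the Device-to-Device and Host-to-Device cases evaluated here) depends on a GDR-aware runtime such as the one proposed in this work.

```c
/* Minimal OpenSHMEM put sketch (illustrative only, not the paper's code).
 * Uses the standard OpenSHMEM 1.2 C API. With a GDR-aware runtime, the
 * symmetric allocation below could be backed by GPU memory, enabling the
 * "truly one-sided" Device-to-Device and Host-to-Device puts studied here. */
#include <shmem.h>
#include <stdio.h>

int main(void) {
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric buffer on every PE (host memory by default; GPU-resident
     * placement is runtime-specific). */
    char *buf = (char *)shmem_malloc(8);

    if (me == 0) {
        const char msg[8] = "GDRput";
        /* 8-byte one-sided put to the next PE: the message size used for
         * the latencies reported in the abstract. */
        shmem_putmem(buf, msg, 8, (me + 1) % npes);
        shmem_quiet();          /* ensure remote completion of the put */
    }
    shmem_barrier_all();

    if (me == 1)
        printf("PE %d received: %s\n", me, buf);

    shmem_free(buf);
    shmem_finalize();
    return 0;
}
```

In a GDR-aware design, such a put into a device-resident symmetric buffer can complete without staging through host memory, which is what the point-to-point improvements reported above measure.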
Date of Conference: 08-11 September 2015
Date Added to IEEE Xplore: 29 October 2015
Electronic ISBN: 978-1-4673-6598-7

Conference Location: Chicago, IL, USA
