By Topic

Optimizing MPI Communication on Multi-GPU Systems Using CUDA Inter-Process Communication

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$33 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

6 Author(s)
Potluri, S. ; Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA ; Wang, H. ; Bureddy, D. ; Singh, A.K.
more authors

Many modern clusters are being equipped with multiple GPUs per node to achieve better compute density and power efficiency. However, moving data in/out of GPUs continues to remain a major performance bottleneck. With CUDA 4.1, NVIDIA has introduced Inter-Process Communication (IPC) to address data movement overheads between processes using different GPUs connected to the same node. State-of-the-art MPI libraries like MVAPICH2 are being modified to allow application developers to use MPI calls directly over GPU device memory. This improves the programmability for application developers by removing the burden of dealing with complex data movement optimizations. In this paper, we propose efficient designs for intra-node MPI communication on multi-GPU nodes, taking advantage of IPC capabilities provided in CUDA. We also demonstrate how MPI one-sided communication semantics can provide better performance and overlap by taking advantage of IPC and the Direct Memory Access (DMA) engine on a GPU. We demonstrate the effectiveness of our designs using micro-benchmarks and an application. The proposed designs improve GPU-to-GPU MPI Send/Receive latency for 4MByte messages by 79% and achieve 4 times the bandwidth for the same message size. One-sided communication using Put and Active synchronization shows 74% improvement in latency for 4MByte message, compared to the existing Send/Receive based implementation. Our benchmark using Get and Passive Synchronization demonstrates that true asynchronous progress can be achieved using IPC and the GPU DMA engine. Our designs for two-sided and one-sided communication improve the performance of GPULBM, a CUDA implementation of Lattice Boltzmann Method for multiphase flows, by 16%, compared to the performance using existing designs in MVAPICH2. To the best of our knowledge, this is the first paper to provide a comprehensive solution for MPI two-sided and one-sided GPU-to-GPU communication within a node, using CUDA IPC.

Published in:

Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International

Date of Conference:

21-25 May 2012