Fast and scalable MPI-level broadcast using InfiniBand's hardware multicast support

3 Author(s)
Jiuxing Liu (Dept. of Comput. & Inf. Sci., Ohio State Univ., Columbus, OH, USA); Mamidala, A.R.; Panda, D.K.

Summary form only given. Modern high-performance applications require efficient and scalable collective communication operations. Currently, most collective operations are implemented on top of point-to-point operations. We propose to use hardware multicast in InfiniBand to design fast and scalable broadcast operations in MPI. InfiniBand supports multicast with the unreliable datagram (UD) transport service, which makes it hard for an upper layer such as MPI to use directly. To bridge the semantic gap between MPI_Bcast and InfiniBand hardware multicast, we have designed and implemented a substrate on top of InfiniBand that provides functionality such as reliability, in-order delivery, and large message handling. By using a sliding-window-based design, we improve MPI_Bcast latency by moving most of the substrate overhead out of the communication critical path. By using optimizations such as a new co-root-based scheme and delayed ACK, we further balance and reduce the overhead. We have also addressed many detailed design issues such as buffer management, efficient handling of out-of-order and duplicate messages, timeout and retransmission, flow control, and RDMA-based ACK communication. Our performance evaluation shows that on an 8-node cluster testbed, hardware-multicast-based designs can improve MPI broadcast latency by up to 58% and broadcast throughput by up to 112%. The proposed solutions are also much more tolerant of process skew than the current point-to-point-based implementation. We have also developed an analytical model for our multicast-based schemes and validated it against experimental numbers. The model shows that with the new designs, one can achieve an MPI broadcast latency of 20.0 μs for small messages and 40.0 μs for a one-MTU message (around 1836 bytes of data payload) in a 1024-node cluster.
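
The abstract names in-order delivery and the handling of out-of-order and duplicate messages as duties of the substrate, but gives no implementation details. The following is a minimal sketch of how a receiver might provide those guarantees over an unreliable datagram service. It is not the authors' design: the class `WindowedReceiver`, its fields, the window size, and the cumulative-ACK return value are all hypothetical names and choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class WindowedReceiver:
    """Hypothetical receiver: delivers each packet once, in sequence order."""
    window: int = 8      # assumed window size, in packets
    next_seq: int = 0    # lowest sequence number not yet delivered
    pending: dict = field(default_factory=dict)    # buffered out-of-order packets
    delivered: list = field(default_factory=list)  # payloads handed to the upper layer

    def on_packet(self, seq, payload):
        """Process one datagram; return the cumulative ACK (next expected seq)."""
        if seq < self.next_seq or seq in self.pending:
            return self.next_seq  # duplicate: drop it and repeat the ACK
        if seq >= self.next_seq + self.window:
            return self.next_seq  # beyond the window: drop, rely on retransmission
        self.pending[seq] = payload
        # Deliver the longest in-order run that is now available.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return self.next_seq


# Packets arrive reordered (1 before 0), duplicated (0 twice), and out of window (9).
rx = WindowedReceiver(window=4)
acks = [rx.on_packet(s, f"chunk{s}") for s in (1, 0, 0, 3, 2, 9)]
print(acks)          # [0, 2, 2, 2, 4, 4]
print(rx.delivered)  # ['chunk0', 'chunk1', 'chunk2', 'chunk3']
```

In this sketch the ACK value is only computed and returned. When to send it (for example, delayed or batched) and over which channel are separate design decisions that the abstract mentions but does not specify.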

Published in:

Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International

Date of Conference:

26-30 April 2004