Multi-dimensional MPI communications, in which MPI communications are performed along each dimension of a Cartesian communicator, are frequently used in many of today's high-performance computing applications. While individual MPI collective communications on regular communicators with a one-dimensional linear ranking of processes have been extensively studied and optimized, little optimization has been performed for multi-dimensional MPI collective communications on multi-dimensional Cartesian topologies. In this paper, we optimize multi-dimensional MPI collective communications for SMP and multi-core systems at the application level. We show that the default Cartesian topology built by state-of-the-art MPI implementations produces sub-optimal performance for multi-dimensional MPI collective communications. We design optimal process-to-core mapping schemes for Cartesian communicators that minimize the total inter-node communication. The proposed technique improves performance by up to 80% over the default Cartesian topology built by Cray's MPI implementation, MPT 3.1.02, on Jaguar at Oak Ridge National Laboratory, currently the world's second-fastest supercomputer.
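To illustrate why the process-to-core mapping matters, the following is a minimal sketch in plain Python (no MPI required). It is not the mapping scheme from the paper. It assumes that the default placement fills each node with consecutive row-major ranks, and it compares that against a hypothetical blocked placement in which each node hosts a rectangular sub-grid. The cost proxy (the number of distinct nodes spanned by each one-dimensional sub-communicator, minus one, summed over all sub-communicators) and the names `row_major_node`, `blocked_node`, and `internode_cost` are assumptions introduced here for illustration only.

```python
from itertools import product


def row_major_node(coords, dims, cores_per_node):
    """Assumed default placement: consecutive row-major ranks fill each node."""
    rank = 0
    for c, d in zip(coords, dims):
        rank = rank * d + c
    return rank // cores_per_node


def blocked_node(coords, dims, block):
    """Hypothetical blocked placement: each node hosts one sub-grid of shape `block`.

    Assumes every entry of `dims` is divisible by the matching entry of `block`.
    """
    node = 0
    for c, d, b in zip(coords, dims, block):
        node = node * (d // b) + c // b
    return node


def internode_cost(dims, node_of):
    """Sum, over every 1-D sub-communicator, of (distinct nodes spanned - 1)."""
    total = 0
    for axis in range(len(dims)):
        # Fix the coordinates of all other axes; vary only `axis`.
        other_dims = [range(d) for i, d in enumerate(dims) if i != axis]
        for fixed in product(*other_dims):
            nodes = set()
            for k in range(dims[axis]):
                coords = list(fixed)
                coords.insert(axis, k)
                nodes.add(node_of(tuple(coords)))
            total += len(nodes) - 1
    return total


dims = (8, 8)          # 8 x 8 Cartesian process grid
cores_per_node = 4     # assumed node size

default_cost = internode_cost(dims, lambda c: row_major_node(c, dims, cores_per_node))
blocked_cost = internode_cost(dims, lambda c: blocked_node(c, dims, (2, 2)))
print(default_cost, blocked_cost)
```

Under these assumptions, the 8 x 8 grid with 4 cores per node gives a proxy cost of 64 for the row-major placement and 48 for 2 x 2 blocks: row-major keeps each row communicator on 2 nodes but spreads each column communicator over 8, whereas blocking spreads both over 4. The proxy ignores message sizes and the collective algorithm, so it indicates only the direction of the effect, not the measured speedup.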
Published in:
Cluster Computing (CLUSTER), 2012 IEEE International Conference on
Date of Conference: 24-28 Sept. 2012