In today's high-performance computing, many MPI programs (e.g., ScaLAPACK applications, the High Performance Linpack benchmark (HPL), and many PDE solvers based on domain decomposition methods) organize their computational processes as multidimensional process grids. Communication is often necessary in each dimension. Multidimensional broadcast, in which a broadcast is performed along each dimension of the grid, is one of the common operations in applications that use multidimensional process grids. In this paper, we study the impact of the MPI process-to-core mapping on the performance of multidimensional broadcast operations. We show that the default process-to-core mappings in today's state-of-the-art MPI implementations are often sub-optimal for multidimensional broadcast. We propose an application-level, multicore-aware process-to-core re-mapping scheme that can achieve optimal performance for multidimensional broadcast operations. The proposed re-mapping scheme improves the performance of multidimensional broadcast operations by up to 64% over the default mapping scheme on Kraken, the world's current eighth-fastest supercomputer, at the Oak Ridge National Laboratory.
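To illustrate why the process-to-core mapping matters, the following minimal Python sketch counts how many compute nodes a row broadcast and a column broadcast span on a P x Q process grid. It assumes that consecutive ranks fill each node in turn and that the default layout is a row-major grid. The 12 x 12 grid, the 12 cores per node, and the tiled alternative (`tiled_coords`) are illustrative assumptions only; they are not the re-mapping scheme proposed in the paper.

```python
from collections import defaultdict


def row_major_coords(rank, Q):
    """Assumed default layout: rank -> (row, col) in row-major order."""
    return divmod(rank, Q)


def tiled_coords(rank, Q, cores_per_node, tile_w):
    """Hypothetical alternative: each node holds a tile_h x tile_w block of the grid.

    Assumes tile_w divides both cores_per_node and Q.
    """
    tile_h = cores_per_node // tile_w
    tiles_per_row = Q // tile_w
    node, local = divmod(rank, cores_per_node)      # which node, which core on it
    tile_row, tile_col = divmod(node, tiles_per_row)
    local_row, local_col = divmod(local, tile_w)
    return tile_row * tile_h + local_row, tile_col * tile_w + local_col


def nodes_spanned(P, Q, cores_per_node, coords_of):
    """Return (max nodes touched by any row, max nodes touched by any column)."""
    row_nodes = defaultdict(set)
    col_nodes = defaultdict(set)
    for rank in range(P * Q):
        row, col = coords_of(rank)
        node = rank // cores_per_node               # consecutive ranks share a node
        row_nodes[row].add(node)
        col_nodes[col].add(node)
    return (max(len(s) for s in row_nodes.values()),
            max(len(s) for s in col_nodes.values()))


if __name__ == "__main__":
    P = Q = 12
    CORES = 12  # assumed cores per node
    print(nodes_spanned(P, Q, CORES, lambda r: row_major_coords(r, Q)))
    print(nodes_spanned(P, Q, CORES, lambda r: tiled_coords(r, Q, CORES, 4)))
```

Under these assumptions, the row-major layout keeps every row on a single node but spreads every column across all 12 nodes, giving (1, 12). The 3 x 4 tiling balances the two dimensions at (3, 4), so neither broadcast direction has to cross as many node boundaries as the worst case in the default layout. Node counts are only a rough proxy for broadcast cost; real performance also depends on the network and on the broadcast algorithm used by the MPI library.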