The IBM Blue Gene/P (BG/P) system is a massively parallel supercomputer that succeeds BG/L and adds many design enhancements and new architectural features at both the hardware and software levels. This paper presents techniques that leverage these features to deliver high-performance MPI collective communication primitives. In particular, we exploit BG/P's rich set of network hardware to explore three classes of collective algorithms: global algorithms on the global interrupt and collective networks for MPI_COMM_WORLD; rectangular algorithms for rectangular communicators on the torus network; and binomial algorithms for irregular communicators over the torus point-to-point network. We also use several forms of data movement, including the direct memory access (DMA) engine, the collective network, and shared memory, to implement synchronous and asynchronous algorithms with different objectives and performance characteristics. Our performance study on BG/P hardware with up to 16K nodes demonstrates the efficiency and scalability of the algorithms and optimizations.
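To make the binomial class concrete, the sketch below enumerates the send schedule of a textbook binomial-tree broadcast over point-to-point messages. It is an illustration under stated assumptions, not the BG/P implementation: the function name `binomial_bcast_schedule` and its parameters are hypothetical, ranks are plain integers, and nothing about the torus topology, the DMA engine, or message pipelining is modeled.

```python
def binomial_bcast_schedule(nranks, root=0):
    """Return the rounds of a generic binomial-tree broadcast.

    Hypothetical helper for illustration only. Each round is a list of
    (sender, receiver) rank pairs; in every round, each rank that already
    holds the data forwards it to one rank that does not.
    """
    if nranks < 1:
        raise ValueError("nranks must be at least 1")
    if not 0 <= root < nranks:
        raise ValueError("root must be a valid rank")

    rounds = []
    step = 1  # number of ranks that hold the data before this round
    while step < nranks:
        pairs = []
        # Relative ranks 0..step-1 hold the data; each sends to rel + step.
        for rel in range(step):
            peer = rel + step
            if peer < nranks:
                # Shift relative ranks back to real ranks around the root.
                sender = (rel + root) % nranks
                receiver = (peer + root) % nranks
                pairs.append((sender, receiver))
        rounds.append(pairs)
        step *= 2
    return rounds


if __name__ == "__main__":
    for i, pairs in enumerate(binomial_bcast_schedule(6)):
        print(f"round {i}: {pairs}")
```

For six ranks rooted at rank 0, the schedule has three rounds: `[(0, 1)]`, then `[(0, 2), (1, 3)]`, then `[(0, 4), (1, 5)]`. The number of rounds grows as the ceiling of log2 of the communicator size, so a communicator of 16384 ranks needs 14 rounds in this simplified model.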