In the HPC domain, a majority of applications are built on MPI and employ collective operations in their communication kernels. Improving the performance of collectives has long been the focus of much work. Recently, hierarchical algorithm designs have become prominent in work on optimizing collectives for multi-core clusters. Such algorithms can greatly reduce inter-node traffic, but they increase the intra-node traffic load at the same time. Moreover, as the number of cores per node keeps growing, the intra-node part of a hierarchical collective takes an increasing share of the time. Improving the performance of intra-node collectives is therefore critical to overall performance. However, on multi-core systems, process affinity greatly affects the performance of an intra-node collective, which makes it challenging to improve the overall performance of intra-node collectives. To address this problem, we propose a novel and systematic strategy for tuning the performance of intra-node collectives. As illustrative examples, we have implemented our strategy on a dual-socket Intel Clovertown platform and tuned Broadcast and Allgather, achieving improvements of up to 14% and 52%, respectively.
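The trade-off between inter-node and intra-node traffic in hierarchical designs can be illustrated with a simple message count. The sketch below is not the strategy proposed in this paper. It assumes a binomial-tree broadcast from rank 0, block placement of ranks (rank r runs on node r // cores_per_node), and a two-level scheme in which one leader per node first receives the data and then forwards it within its node. The function names are hypothetical.

```python
def flat_binomial_bcast(num_nodes, cores_per_node):
    """Count (inter_node, intra_node) messages of a flat binomial-tree
    broadcast from rank 0, assuming block placement of ranks on nodes."""
    total = num_nodes * cores_per_node
    inter = intra = 0
    step = 1
    while step < total:
        # In this round, every rank below `step` already holds the data
        # and sends it to rank + step (if that rank exists).
        for src in range(step):
            dst = src + step
            if dst >= total:
                break
            if src // cores_per_node == dst // cores_per_node:
                intra += 1
            else:
                inter += 1
        step *= 2
    return inter, intra


def hierarchical_bcast(num_nodes, cores_per_node):
    """Count (inter_node, intra_node) messages of a two-level broadcast:
    the root reaches one leader per node, then each leader broadcasts
    inside its own node (assumed scheme, for illustration only)."""
    inter = num_nodes - 1                      # every non-root leader receives once
    intra = num_nodes * (cores_per_node - 1)   # every non-leader receives once
    return inter, intra


if __name__ == "__main__":
    for nodes, cores in [(4, 4), (2, 8)]:
        print(nodes, cores,
              flat_binomial_bcast(nodes, cores),
              hierarchical_bcast(nodes, cores))
```

Under these assumptions, with 4 nodes of 4 cores each, the flat broadcast sends 12 inter-node and 3 intra-node messages, while the hierarchical one sends 3 inter-node and 12 intra-node messages. The total is the same, but the load shifts onto the intra-node phase, which is why the intra-node collective becomes the part worth tuning.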