Skip to Main Content
Shared memory optimizations for blocking collective communications implemented for multi-core, and distributed systems have previously shown to improve the performance of these operations. Such previous studies have tended to neglect the architecture of multi-core node and shared-memory communication characteristics. In this paper, we examine in detail the impact of on-node memory and cache hierarchy, and the optimization opportunities these provide, on the performance of the barrier and broadcast collective operations. The primary contribution of this paper is the demonstration of how exploiting the local memory-hierarchy impacts the performance of these operations in the distributed system context. Our results show that factors such as the location of communicating process in the node, number of communication processes, amount of shared-memory communication, and the amount of inter-socket (Central Processing Unit (CPU) socket) communication affect latency-sensitive and bandwidth-sensitive collective operations. The effect of these parameters varies on the type of operations, and are coupled to the architecture of the shared-memory node and the scale of collective operation. We have seen that for 3,072 processes on Jaguar, and considering the socket layout in collective communication algorithm improves the large-data MPI Bcast () performance by 50% and MPI Barrier by 40% when compared to neglecting this architectural feature. For 512 processes job on Smoky, the corresponding improvement is 38%, and an order of magnitude, respectively. Small data broadcast performance is not noticeably impacted on Jaguar, when considering the shared-memory hierarchy, and on Smoky the corresponding performance improvement is 3%.