Skip to Main Content
Low latency collective communications are key to application scalability. As systems grow larger, minimizing collective communication time becomes increasingly challenging. Offload is an effective technique for accelerating collective operations, however, algorithms for collective communication constantly evolve such that flexible implementations are critical. This paper presents triggered operations -- a semantic building block that allows the key components of collective communications to be offloaded while allowing the host side software to define the algorithm. Simulations are used to demonstrate the performance improvements achievable through the offload of MPI_Allreduce using these building blocks.