Skip to Main Content
Message-Passing is the dominant programming model for distributed memory parallel computers and Message-Passing Interface (MPI) is the standard. Along with point-to-point send and receive message primitives, MPI includes a set of collective communication operations that are used to synchronize and coordinate groups of tasks. The MPI_Barrier, one of the most important collective procedures, has been extensively studied on a variety of architectures over last twenty years. However, a cluster of Platform FPGAs is a new architecture and offers interesting, resource-efficient options for implementing the barrier operation. This paper describes an FPGA implementation of MPI_Barrier. The premise is that barrier (and other collective communication operations) are very sensitive to latency as the number of nodes scales to the tens-of-thousands. The relatively slow processors found on FPGAs will significantly cap performance. The FPGA hardware design implements a tree-based algorithm and is tightly integrated with the custom high-speed on-chip/off-chip network. MPI access is available through a specially-designed kernel module. This effectively offloads the work from the CPU and OS into hardware. The evaluation of this design shows significant performance gains compared with a conventional software implementation on both an FPGA cluster and a commodity cluster. Further, it suggests that moving other MPI collective operations into hardware would be beneficial.