In message-passing environments, the main contributors to message delivery latency are the copy operations needed to transfer a received message and bind it to the consuming process or thread; message copying accounts for a significant portion of the software communication overhead. Recently, several factors have been pushing high-performance processor architectures toward designs with multiple processing cores on a single chip, known as chip multiprocessors (CMPs). The Cell Broadband Engine (Cell BE) has the potential to deliver high performance to parallel applications (e.g., MPI applications). An efficient implementation of collective communication operations is one of the keys to achieving high performance and scalability in parallel applications. In this work, we implement several collective communication operations and investigate their performance in terms of latency and the components that contribute to it. Specifically, broadcast and total-exchange functions are implemented on the Cell BE processor.
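For readers unfamiliar with the two collectives named above, the following is a minimal sketch of their data-movement semantics only. It is not the implementation studied in this work and models nothing specific to the Cell BE: processes are simulated as list indices, and the function names (`broadcast`, `total_exchange`, `min_transfers_broadcast`, `min_transfers_total_exchange`) are hypothetical, chosen for illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def broadcast(root_value: T, num_procs: int) -> List[T]:
    """Broadcast: after the operation, every process holds the root's message."""
    if num_procs < 1:
        raise ValueError("need at least one process")
    return [root_value for _ in range(num_procs)]


def total_exchange(send: Sequence[Sequence[T]]) -> List[List[T]]:
    """Total exchange (all-to-all personalized communication).

    send[i][j] is the block that process i sends to process j.
    The result recv satisfies recv[j][i] == send[i][j], i.e. the
    send matrix is transposed across processes.
    """
    n = len(send)
    if any(len(row) != n for row in send):
        raise ValueError("each process must supply one block per process")
    return [[send[i][j] for i in range(n)] for j in range(n)]


def min_transfers_broadcast(num_procs: int) -> int:
    """Each of the p - 1 non-root processes must receive the message once."""
    return num_procs - 1


def min_transfers_total_exchange(num_procs: int) -> int:
    """Each process delivers a distinct block to each of the other p - 1."""
    return num_procs * (num_procs - 1)
```

The transfer counts hint at why copy overhead matters for these operations: the number of inter-process block transfers grows linearly with the process count for broadcast and quadratically for total exchange, so any per-message copy cost is multiplied accordingly.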