Many-core architectures provide an efficient way of harnessing the growing number of transistors available. However, the energy and latency costs of communication increasingly limit the performance of parallel programs running on these platforms. Existing designs provide a functional communication layer, but not necessarily the most efficient one. Because of power limitations, efficiency is now a primary concern, which motivates us to look again at cache coherence. First, we analyze the communication behavior of parallel applications. The observed sharing patterns reveal considerable locality of shared data accesses between threads with consecutive IDs. This pattern corresponds to strong physical locality between adjacent cores in a chip-multiprocessor (CMP). To exploit it, this paper explores the design of Proximity Coherence: a novel scheme in which L1 load misses are optimistically forwarded to nearby caches via new dedicated links, improving the efficiency of communication. The results show that careful analysis of sharing behavior leads to the design of a more efficient coherence protocol. The protocol reduces the latency of load misses by up to 33 percent (17 percent on average), improving overall execution time by up to 13 percent. Furthermore, it reduces network-on-chip traffic by 19 percent and energy consumption by up to 30 percent.
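As a rough illustration of the forwarding idea described above, the following Python sketch models an L1 load miss that first probes the caches of adjacent cores and only then falls back to a conventional lookup. It is a toy model under stated assumptions, not the protocol evaluated in the paper: the 2D mesh layout, the probe order, and the names `Core`, `mesh_neighbors`, `load`, and `directory` are all hypothetical, and coherence states, invalidations, and directory updates are deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """Toy core with a private L1 modeled as an address -> value map."""
    core_id: int
    l1: dict = field(default_factory=dict)


def mesh_neighbors(core_id, width, height):
    """Return IDs of cores adjacent to core_id on an assumed 2D mesh.

    Cores are numbered row by row, so consecutive IDs within a row
    are physical neighbors (an assumption of this sketch).
    """
    x, y = core_id % width, core_id // width
    candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [ny * width + nx
            for nx, ny in candidates
            if 0 <= nx < width and 0 <= ny < height]


def load(cores, core_id, addr, width, height, directory):
    """Serve a load and report where the data came from.

    Order of lookup: local L1, then neighboring L1s (the optimistic
    forward), then the fallback store standing in for the directory.
    """
    core = cores[core_id]
    if addr in core.l1:
        return core.l1[addr], "l1_hit"

    # L1 miss: optimistically probe nearby caches first.
    for n in mesh_neighbors(core_id, width, height):
        if addr in cores[n].l1:
            core.l1[addr] = cores[n].l1[addr]
            return core.l1[addr], "proximity_hit"

    # No neighbor holds the line: take the conventional path.
    value = directory[addr]
    core.l1[addr] = value
    return value, "directory"
```

For example, on a 2x2 mesh where core 1 already holds a line, a miss at core 0 is served by its neighbor, while a line held by no neighbor is fetched through the fallback path. A real design would also have to keep the forwarded copy coherent, which this sketch does not attempt.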