The rapid scaling of silicon technology has enabled massive transistor integration densities. Nanometer feature sizes, however, are marred by increasing variability and susceptibility to wear-out. Billion-transistor designs, such as chip multiprocessors (CMPs), are especially vulnerable to defects. CMPs rely on a network-on-chip for all their communication needs. A single link failure within this on-chip fabric can impede, halt, or even deadlock intertile communication, which can render the entire chip multiprocessor useless. In this paper, we present a technique capable of handling very large numbers of permanent wire failures that occur in parallel links, either at manufacture time or dynamically at runtime. Rather than marking an entire parallel link as faulty whenever some of its wires fail, the proposed methodology employs these partially faulty links (PFLs) to continue the transfer of information, albeit in a gracefully degraded mode, in order to maintain network connectivity. Furthermore, when several wires fail, the presented technique can designate a PFL as fully faulty and utilize appropriate routing algorithms that bypass nonoperational links, while still maintaining load balance in the vicinity of PFLs. The proposed scheme employs architectural support within the on-chip routers to detect link failures and enable reconfiguration at the granularity of individual wires. Hardware synthesis confirms the low-cost nature of the proposed architecture, and full-system simulations using both synthetic network traffic and real workloads demonstrate its efficacy.
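The abstract does not specify how data is mapped onto the surviving wires of a PFL, so the following Python sketch is only an illustration of the general idea of graceful degradation, not the proposed architecture. It assumes that a flit is serialized over the healthy wires across multiple cycles, and that a link is treated as fully faulty once the fraction of healthy wires drops below a threshold. The function names, the `min_healthy_fraction` parameter, and its default value of 0.5 are all hypothetical.

```python
from math import ceil
from typing import Dict, List, Optional


def transfer_cycles(flit_bits: int, wire_ok: List[bool]) -> Optional[int]:
    """Cycles needed to send one flit over the healthy wires, or None if none work."""
    healthy = sum(wire_ok)
    if healthy == 0:
        return None
    return ceil(flit_bits / healthy)


def serialize(flit: List[int], wire_ok: List[bool]) -> List[Dict[int, int]]:
    """Split the flit into per-cycle groups, each mapping healthy wire index -> bit."""
    healthy = [i for i, ok in enumerate(wire_ok) if ok]
    if not healthy:
        raise ValueError("link is fully faulty: no healthy wires")
    cycles = []
    for start in range(0, len(flit), len(healthy)):
        chunk = flit[start:start + len(healthy)]
        cycles.append(dict(zip(healthy, chunk)))
    return cycles


def deserialize(cycles: List[Dict[int, int]]) -> List[int]:
    """Reassemble the flit at the receiver by reading wires in ascending index order."""
    bits: List[int] = []
    for cycle in cycles:
        bits.extend(cycle[wire] for wire in sorted(cycle))
    return bits


def classify(wire_ok: List[bool], min_healthy_fraction: float = 0.5) -> str:
    """Hypothetical policy: label a link by the fraction of wires still working."""
    fraction = sum(wire_ok) / len(wire_ok)
    if fraction == 1.0:
        return "healthy"
    if fraction >= min_healthy_fraction:
        return "partially_faulty"  # keep using the link at reduced bandwidth
    return "fully_faulty"          # routing should bypass this link


# Example: an 8-wire link with two failed wires carries an 8-bit flit.
wires = [True, False, True, True, False, True, True, True]
flit = [1, 0, 1, 1, 0, 0, 1, 0]
sent = serialize(flit, wires)
received = deserialize(sent)
```

Under these assumptions, the link with two failed wires stays in service but needs more than one cycle per flit, which is the bandwidth-for-connectivity trade that the degraded mode implies. A link that falls below the threshold would instead be handed to the routing algorithm as nonoperational.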