The processing elements of many modern tightly coupled multicomputers are connected via mesh or toroidal networks. Such interconnects are simple and highly scalable, but they suffer from high fragmentation, low utilization, and insufficient fault tolerance when the resources allocated to each job are dedicated. High-dimensional interconnects may be more efficient in certain cases, but they rely on complex and expensive components and scale poorly. We present a novel hardware/software architectural approach that detaches the processing elements of the system from the interconnect and augments the traditional toroidal topology with additional connectivity options and link redundancy. We explore the properties of the new "multitoroidal" topology and the improvements it offers in resource utilization and failure tolerance. We present the results of extensive simulation studies showing that, for practically important types of workloads, resource utilization may be increased by 50 percent and, in certain cases, by as much as 100 percent compared to toroidal machines, and that it is close to the theoretically optimal case of a full crossbar interconnect. This combined hardware/software architectural innovation yields a significant improvement in resource utilization on top of the state of the art in scheduling algorithm research. In addition, multitoroidal multicomputers are able to operate under link failure rates of 0.002 failures per week, a rate that would shut down toroidal machines. A variant of the multitoroidal architecture is implemented in the Blue Gene/L supercomputer.