Evolving submicron technology makes decentralized designs particularly attractive. A common form of decentralization adopted in processors is to partition the execution core into multiple clusters. Each cluster has a small instruction window and a set of functional units. A number of algorithms have been proposed for distributing instructions among the clusters. The first part of this paper analyzes, qualitatively as well as quantitatively, how various hardware parameters affect the performance of different instruction distribution algorithms. The parameters studied are the type of cluster interconnect, the fetch size, the cluster issue width, the cluster window size, and the number of clusters. The study shows that the relative performance of the algorithms is very sensitive to these hardware parameters, and that the algorithms that perform relatively better with four or fewer clusters are generally not the best ones for a larger number of clusters. This is important because, with an imminent increase in the transistor budget, more clusters are expected to be integrated on a single chip. The second part of the paper investigates alternative interconnects that provide scalable performance as the number of clusters increases. In particular, it investigates two hierarchical interconnects, a single ring of crossbars and multiple rings of crossbars, as well as instruction distribution algorithms that take advantage of these interconnects. Our study shows that these new interconnects, with the appropriate distribution techniques, achieve an IPC (instructions per cycle) that is 15-20 percent better than that of the most scalable existing configuration and within 2 percent of that achieved by a hypothetical ideal processor having a 1-cycle latency crossbar interconnect. These results confirm the utility and applicability of hierarchical interconnects and hierarchical distribution algorithms in clustered processors.