The growth in the complexity of a wide-issue processor with its pipeline width is one of the primary concerns of processor designers. In a conventional design, the hardware in the processor core is laid out to handle multiple instructions with two source operands in each pipeline stage. However, an analysis of SPEC2000 programs reveals that, on average, an integer program consists of 25.2 percent two-op (both source registers) integer instructions and 72.5 percent one-op/zero-op integer instructions. Floating-point (FP) programs consist, on average, of 15.8 percent two-op integer instructions and 44.1 percent one-op/zero-op integer instructions. The analysis shows that hardware laid out for the worst-case requirements of the integer pipeline is highly underutilized for a significant portion of the time. To alleviate these complexity issues, we propose the split pipeline architecture, a novel technique that distinguishes and processes instructions based on their source operand requirements. The conventional pipeline is split into two after the decode stage, and the two pipelines converge again at the execution stage. This allows instructions to be processed at a higher clock rate and at almost the same instructions-per-cycle (IPC) throughput as in a conventional processor. Various flavors of the proposed architecture are simulated and analyzed in this paper, with a circuit-level analysis to determine the impact on critical path delays. Results show that a processor that can fetch, decode, and commit eight instructions in each cycle, with split pipelines of two two-source integer instructions and six zero/one-source integer instructions, can achieve a clock rate that is 15.8 percent faster than an eight-wide conventional processor while reducing the IPC throughput by only 0.7 percent for SPEC2000 benchmarks.
Similarly, a four-wide processor with split pipelines of one two-source integer instruction and three zero/one-source integer instructions can achieve a clock rate that is 19.69 percent faster than a four-wide conventional processor while reducing the IPC throughput by only 1.9 percent.
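
The two reported figures can be combined into a rough net performance estimate. The sketch below is not part of the paper's methodology. It assumes that performance scales as IPC times clock frequency, and that "X percent faster clock rate" means the frequency is multiplied by (1 + X/100). The function name `net_speedup` and the configuration labels are hypothetical and used only for illustration.

```python
def net_speedup(clock_gain_pct: float, ipc_loss_pct: float) -> float:
    """Relative performance versus the conventional baseline.

    Assumption (not from the paper): performance = IPC * clock frequency,
    so the two relative factors simply multiply.
    """
    clock_factor = 1.0 + clock_gain_pct / 100.0  # faster clock
    ipc_factor = 1.0 - ipc_loss_pct / 100.0      # slightly lower IPC
    return clock_factor * ipc_factor


# Clock-rate gain and IPC loss (in percent) as quoted in the abstract.
configs = {
    "eight-wide split (2 two-source + 6 zero/one-source)": (15.8, 0.7),
    "four-wide split (1 two-source + 3 zero/one-source)": (19.69, 1.9),
}

for name, (clock_gain, ipc_loss) in configs.items():
    gain_pct = 100.0 * (net_speedup(clock_gain, ipc_loss) - 1.0)
    print(f"{name}: about {gain_pct:.1f} percent net gain")
```

Under these assumptions, the estimate comes to roughly 15.0 percent for the eight-wide configuration and 17.4 percent for the four-wide one. This shows that the small IPC loss offsets only a minor part of the clock-rate gain.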