
Optimizing Instruction Scheduling through Combined In-Order and O-O-O Execution in SMT Processors

3 Author(s)
Hui Wang; Dept. of Electr. Eng., Univ. of Texas, Richardson, TX; Sangireddy, R.; Baldawa, S.

The resource-sharing nature of Simultaneous Multithreading (SMT) processors, together with long-latency instructions from concurrent threads, makes the instruction scheduling window (IW), a primary shared component among the key pipeline structures in SMT, a performance bottleneck. Because of tight constraints on its physical size, the IW is under severe pressure to handle instructions from multiple threads while avoiding resource monopolization by low-ILP threads. Optimizing the efficiency and fairness of IW utilization, so that SMT delivers the performance it is capable of in the presence of long-latency instructions, is particularly challenging. Most existing optimization schemes in SMT processors rely on the fetch policy to control which instructions are allowed to enter the pipeline, while little effort has gone into controlling the long-latency instructions that are already in the IW. In this paper, we propose streamline buffers to handle the long-latency instructions that have already entered the pipeline and clog the IW while the controlling fetch policies take time to react. Each streamline buffer extracts from the IW, and holds, a chain of instructions from one thread that are stalled by a dependency on a long-latency load. When the load value returns, the streamline buffer serves these instructions directly for in-order execution, avoiding any instruction replay. This supplements the conventional IW, which in parallel serves the other instructions for out-of-order (o-o-o) execution. Analysis of SPEC2000 integer and FP benchmarks reveals that instructions dependent on long-latency loads typically have their first source operand ready within 5 to 15 percent of their total wait time in the IW. Our scheme exploits this asymmetry in the source operands' ready times to achieve a complexity-effective design.
Compared to the baseline SMT architecture, our design, when working in conjunction with the earlier proposed ICOUNT.2.8 fetch policy for 4 threads, reduces the IW full rate by 9.4 percent (11 percent for 2 threads), improves average IPC for MIXED workloads by 9.6 percent (8 percent for MEM workloads and 4.4 percent for CPU workloads), and improves fairness by 7.56 percent (7.24 percent for 2 threads). Similar enhancements are observed when run in conjunction with an RR.2.8 fetch policy. Further, our scheme combined with DCRA improves performance by 21.7 percent on average, whereas DCRA alone improves it by 16.3 percent.
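
The streamline-buffer idea in the abstract can be illustrated with a small software model. The sketch below is only a toy, not the authors' hardware design: the names `Instr`, `extract_chain`, and `drain_in_order` are hypothetical, register renaming is assumed to give every instruction a unique destination tag, and timing, buffer capacity, and the readiness of non-load operands are all ignored. It shows only the two steps the abstract describes: pulling the chain of instructions that depend on a long-latency load out of the IW, and later releasing that chain in program order.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Instr:
    tag: int        # unique destination tag (assumes renamed registers)
    srcs: tuple     # tags of the instructions that produce the source operands
    thread: int = 0


def extract_chain(iw, load_tag, thread=0):
    """Move every instruction of `thread` that depends, directly or
    transitively, on `load_tag` out of `iw` (a list kept in program order).
    Returns the extracted chain as a deque, still in program order."""
    pending = {load_tag}          # tags whose values are not yet available
    chain = deque()
    remaining = []
    for ins in iw:
        if ins.thread == thread and any(s in pending for s in ins.srcs):
            pending.add(ins.tag)  # its consumers are stalled as well
            chain.append(ins)
        else:
            remaining.append(ins)
    iw[:] = remaining             # the IW keeps only the independent instructions
    return chain


def drain_in_order(chain):
    """Once the load value has returned, hand the chain to execution
    strictly in program order (oldest first)."""
    while chain:
        yield chain.popleft()


# Example: instruction 1 consumes the load (tag 100); instruction 3 consumes 1.
iw = [Instr(1, (100,)), Instr(2, (7,)), Instr(3, (1, 2)), Instr(4, (2,))]
buffer = extract_chain(iw, load_tag=100)
issued = [ins.tag for ins in drain_in_order(buffer)]
```

In this toy run the dependent instructions 1 and 3 leave the window, so instructions 2 and 4 remain available for out-of-order scheduling while the load is outstanding.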

Published in:

IEEE Transactions on Parallel and Distributed Systems (Volume: 20, Issue: 3)

Date of Publication:

March 2009
