37th International Symposium on Microarchitecture (MICRO-37 2004)

4-8 December 2004

  • [Cover page]

    Page(s): c1
  • [Title page]

    Page(s): i - iv
  • Table of contents

    Page(s): v - viii
  • Message from the General and Program Chairs

    Page(s): ix
  • Committees

    Page(s): x - xi
  • List of Reviewers

    Page(s): xii - xiii
  • Dynamic Strands: Collapsing Speculative Dependence Chains for Reducing Pipeline Communication

    Page(s): 7 - 17

    In the modern era of wire-dominated architectures, specific effort must be made to reduce needless communication within out-of-order pipelines while still maintaining binary compatibility. To ease pressure on highly connected elements such as the issue logic and bypass network, we propose the dynamic detection and speculative execution of instruction strands: linear chains of dependent instructions without intermediate fan-out. The hardware required for detecting these chains is simple and resides off the critical path of the pipeline, and the execution targets are the normal ALUs with a self-bypass mode. By collapsing these strings of dependencies into atomic macro-instructions, the efficiency of the issue queue and reorder buffer can be increased. Our results show that over 25% of all dynamic ALU instructions can be grouped, decreasing both the average reorder buffer occupancy and issue queue occupancy by over a third. Additionally, these strands have several properties which make them amenable to simple performance optimizations. Our experiments show average IPC increases of 17% on a four-wide machine and 20% on an eight-wide machine on SPECint2000 and MediaBench applications. Finally, strands ease the IPC penalties of multicycle issue and bypass by reducing dependency pressures, providing opportunity for clock frequency gains as well.
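
    As a rough sketch of the grouping the abstract describes, the following Python collapses linear dependence chains over a toy instruction stream. The Inst record and the single-consumer fan-out test are illustrative assumptions, not the authors' hardware detection logic.

        from dataclasses import dataclass, field

        @dataclass
        class Inst:
            idx: int                      # position in the dynamic stream
            is_alu: bool
            srcs: list                    # indices of producer instructions
            consumers: list = field(default_factory=list)

        def find_strands(insts):
            """Collapse chains in which each ALU op has exactly one
            consumer (no intermediate fan-out) and that consumer is
            itself an ALU op. Assumes insts[i].idx == i."""
            for i in insts:
                for s in i.srcs:
                    insts[s].consumers.append(i.idx)
            strands, grouped = [], set()
            for i in insts:
                if not i.is_alu or i.idx in grouped:
                    continue
                chain, cur = [i.idx], i
                while len(cur.consumers) == 1 and insts[cur.consumers[0]].is_alu:
                    cur = insts[cur.consumers[0]]
                    chain.append(cur.idx)
                    grouped.add(cur.idx)
                if len(chain) > 1:
                    strands.append(chain)   # candidate macro-instruction
            return strands

        # three chained ALU ops with no fan-out form one strand:
        prog = [Inst(0, True, []), Inst(1, True, [0]), Inst(2, True, [1])]
        print(find_strands(prog))           # -> [[0, 1, 2]]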

  • Dataflow Mini-Graphs: Amplifying Superscalar Capacity and Bandwidth

    Page(s): 18 - 29

    A mini-graph is a dataflow graph that has an arbitrary internal size and shape but the interface of a singleton instruction: two register inputs, one register output, a maximum of one memory operation, and a maximum of one (terminal) control transfer. Previous work has exploited dataflow sub-graphs whose execution latency can be reduced via programmable FPGA-style hardware. In this paper we show that mini-graphs can improve performance by amplifying the bandwidths of a superscalar processor's stages and the capacities of many of its structures without custom latency-reduction hardware. Amplification is achieved because the processor deals with a complete mini-graph via a single quasi-instruction, the handle. By constraining mini-graph structure and forcing handles to behave as much like singleton instructions as possible, the number and scope of the modifications over a conventional superscalar microarchitecture are kept to a minimum. This paper describes mini-graphs, a simple algorithm for extracting them from basic block frequency profiles, and a microarchitecture for exploiting them. Cycle-level simulation of several benchmark suites shows that mini-graphs can provide average performance gains of 2-12% over an aggressive baseline, with peak gains exceeding 40%. Alternatively, they can compensate for substantial reductions in register file and scheduler size, and in pipeline bandwidth.
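
    The interface constraint can be made concrete with a small qualifying test implied by the abstract; the node encoding below is an assumption made for illustration, not the paper's extraction algorithm.

        def is_mini_graph(nodes):
            """nodes: topologically ordered dicts with keys 'id',
            'srcs' (producer node ids, or register names for external
            inputs), 'is_mem', 'is_branch', and 'live_out' (result is
            read outside the subgraph)."""
            ids = {n["id"] for n in nodes}
            ext_inputs = {s for n in nodes for s in n["srcs"] if s not in ids}
            live_outs = sum(n["live_out"] for n in nodes)
            mem_ops = sum(n["is_mem"] for n in nodes)
            branches = [n for n in nodes if n["is_branch"]]
            return (len(ext_inputs) <= 2          # two register inputs
                    and live_outs <= 1            # one register output
                    and mem_ops <= 1              # one memory operation
                    and len(branches) <= 1        # one control transfer,
                    and (not branches             # which must be terminal
                         or branches[0]["id"] == nodes[-1]["id"]))

        # a load feeding an add whose result is live-out qualifies:
        nodes = [{"id": 1, "srcs": ["r1"], "is_mem": True,
                  "is_branch": False, "live_out": False},
                 {"id": 2, "srcs": [1, "r2"], "is_mem": False,
                  "is_branch": False, "live_out": True}]
        print(is_mini_graph(nodes))   # -> True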

  • Application-Specific Processing on a General-Purpose Core via Transparent Instruction Set Customization

    Page(s): 30 - 40

    Application-specific instruction set extensions are an effective way of improving the performance of processors. Critical computation subgraphs can be accelerated by collapsing them into new instructions that are executed on specialized function units. Collapsing the subgraphs simultaneously reduces the length of computation as well as the number of intermediate results stored in the register file. The main problem with this approach is that a new processor must be generated for each application domain. While new instructions can be designed automatically, there is a substantial engineering cost incurred to verify and to implement the final custom processor. In this work, we propose a strategy for transparent customization of the core computation capabilities of the processor without changing its instruction set. A configurable array of function units is added to the baseline processor that enables the acceleration of a wide range of dataflow subgraphs. To exploit the array, the microarchitecture performs subgraph identification at run-time, replacing the identified subgraphs with new microcode instructions that configure and utilize the array. We compare the effectiveness of replacing subgraphs in the fill unit of a trace cache versus using a translation table during decode, and evaluate the tradeoffs between static and dynamic identification of subgraphs for instruction set customization.
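
    A sketch of the decode-time option: a translation table keyed by PC replaces a matched subgraph with a micro-op that configures the function-unit array. The table layout, the micro-op name, and the 4-byte instruction size are placeholders, not the paper's encoding.

        # PC of a subgraph's first instruction -> replacement micro-op
        translation_table = {
            0x4005d0: {"config": 17, "skip": 4},   # hypothetical entry
        }

        def decode(pc, imem):
            """imem: pc -> decoded instruction. Returns (uops, next_pc)."""
            entry = translation_table.get(pc)
            if entry:
                # one micro-op configures and fires the array in place
                # of the 'skip' original instructions in the subgraph
                return [("CCA_EXEC", entry["config"])], pc + 4 * entry["skip"]
            return [imem[pc]], pc + 4

        imem = {0x4005e0: ("ADD", "r1", "r2")}
        print(decode(0x4005d0, imem))   # -> ([('CCA_EXEC', 17)], 0x4005e0)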

  • MicroLib: A Case for the Quantitative Comparison of Micro-Architecture Mechanisms

    Page(s): 43 - 54

    While most research papers on computer architecture include some performance measurements, these performance numbers tend to be distrusted, so much so that, after many research articles on data cache architectures, for instance, few researchers have a clear view of which data cache mechanisms are best. To illustrate the usefulness of a fair quantitative comparison, we have picked a target architecture component for which many optimizations have been proposed (data caches), and we have implemented most of the performance-oriented hardware data cache optimizations published in top conferences over the past four years. Beyond the comparison of data cache ideas, our goals are twofold: (1) to clearly and quantitatively evaluate the effect of methodology shortcomings (model precision, benchmark selection, trace selection, and so on) on assessing and comparing research ideas, and to show how strong the methodology effect can be; and (2) to show that the lack of interoperable simulators, and the practice of not disclosing simulators at publication time, make it difficult if not impossible to fairly assess the benefit of research ideas. This study is part of a broader effort, called MicroLib, an open library of modular simulators aimed at promoting the disclosure and sharing of simulator models.

  • Automatic Synthesis of High-Speed Processor Simulators

    Page(s): 55 - 66

    Microprocessor simulators are very popular in research and teaching environments. For example, functional simulators are often used to perform architectural studies, to fast-forward over uninteresting code, to generate program traces, and to warm up tables before switching to a more detailed but slower simulator. Unfortunately, most portable functional simulators are on the order of 100 times slower than native execution. This paper describes a set of novel techniques and optimizations to synthesize portable functional simulators that are only 6.6 times slower on average (16 times in the worst case) than native execution and 19 times faster than SimpleScalar's sim-fast on the SPECcpu2000 programs. When simulating a memory hierarchy, the synthesized code is 2.6 times faster than the equivalent ATOM code. Our fully automated synthesis approach works without access to source/assembly code or debug information. It generates C code, integrates optional user-provided code, performs unwanted-code removal, preserves basic blocks, generates low-overhead profiles, employs a simple heuristic to determine potential jump targets, only compiles important instructions, and utilizes mixed-mode execution, i.e., it interleaves compiled and interpreted simulation to maximize performance.
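
    The mixed-mode idea can be illustrated on a toy register machine: hot instructions are "compiled" into specialized closures (standing in for the generated C), while cold ones stay interpreted. The threshold and the instruction format are assumptions for illustration.

        COMPILE_THRESHOLD = 50

        def interpret(prog, pc, regs):
            op, d, a, b = prog[pc]
            regs[d] = regs[a] + regs[b] if op == "add" else regs[a] * regs[b]
            return pc + 1

        def compile_inst(prog, pc):
            """Specialize one hot instruction on its decoded fields,
            a stand-in for the C code the synthesizer would emit."""
            op, d, a, b = prog[pc]
            if op == "add":
                def run(regs): regs[d] = regs[a] + regs[b]; return pc + 1
            else:
                def run(regs): regs[d] = regs[a] * regs[b]; return pc + 1
            return run

        def simulate(prog, regs, steps):
            counts, compiled, pc = [0] * len(prog), {}, 0
            for _ in range(steps):
                if pc in compiled:
                    pc = compiled[pc](regs)          # compiled fast path
                else:
                    counts[pc] += 1                  # low-overhead profile
                    if counts[pc] >= COMPILE_THRESHOLD:
                        compiled[pc] = compile_inst(prog, pc)
                    pc = interpret(prog, pc, regs)   # interpreted slow path
                pc %= len(prog)                      # toy: loop the program
            return regs

        print(simulate([("add", 0, 0, 1)], [0, 1], 10))   # -> [10, 1]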

  • Thermal Modeling, Characterization and Management of On-Chip Networks

    Page(s): 67 - 78

    Due to the wire delay constraints in deep submicron technology and the increasing demand for on-chip bandwidth, networks are becoming the pervasive interconnect fabric for connecting processing elements on chip. With ever-increasing power density and cooling costs, the thermal impact of on-chip networks needs to be urgently addressed. In this work, we first characterize the thermal profile of the MIT Raw chip. Our study shows that the network has a thermal impact comparable to that of the processing elements and contributes significantly to overall chip temperature, motivating the need for network thermal management. The characterization is based on an architectural thermal model we developed for on-chip networks that takes into account the thermal correlation between routers across the chip and factors in the thermal contribution of on-chip interconnects. Our thermal model is validated against finite-element based simulators for an actual chip with associated power measurements, with an average error of 5.3%. We next propose ThermalHerd, a distributed, collaborative run-time thermal management scheme for on-chip networks that uses distributed throttling and thermal-correlation based routing to tackle thermal emergencies. Our simulations show that ThermalHerd effectively ensures thermal safety with little performance impact. With Raw as our platform, we further show how our work can be extended to the analysis and management of entire on-chip systems, jointly considering both processors and networks.
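
    A minimal sketch of correlation-aware throttling in the spirit of ThermalHerd; the correlation matrix, threshold, and throttle factor below are toy stand-ins, not the paper's validated model.

        import numpy as np

        def estimate_temps(power, corr):
            """Router temperatures as a correlation-weighted sum of all
            routers' power: corr[i, j] is router j's influence on i."""
            return corr @ power

        def thermal_herd_step(power, corr, t_limit, throttle=0.5):
            temps = estimate_temps(power, corr)
            hot = temps > t_limit
            # distributed throttling: only hot routers slow down
            return np.where(hot, power * throttle, power), hot

        # 3 routers in a row; each neighbor leaks 20% of its power as heat:
        corr = np.array([[1.0, 0.2, 0.0],
                         [0.2, 1.0, 0.2],
                         [0.0, 0.2, 1.0]])
        power = np.array([1.0, 2.0, 1.0])
        print(thermal_herd_step(power, corr, t_limit=2.0))
        # -> middle router exceeds the limit and gets throttled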

  • Pinpointing Representative Portions of Large Intel® Itanium® Programs with Dynamic Instrumentation

    Page(s): 81 - 92

    Detailed modeling of the performance of commercial applications is difficult. The applications can take a very long time to run on real hardware and it is impractical to simulate them to completion on performance models. Furthermore, these applications have complex execution environments that cannot easily be reproduced on a simulator, making porting the applications to simulators difficult. We attack these problems using the well-known SimPoint methodology to find representative portions of an application to simulate, and a dynamic instrumentation framework called Pin to avoid porting altogether. Our system uses dynamic instrumentation instead of simulation to find representative portions, called PinPoints, for simulation. We have developed a toolkit that automatically detects PinPoints, validates whether they are representative using hardware performance counters, and generates traces for large Itanium® programs. We compared SimPoint-based selection to random selection of simulation points. We found that for 95% of the SPEC2000 programs we tested, the PinPoints prediction was within 8% of the actual whole-program CPI, as opposed to 18% for random selection. We measure the end-to-end error, comparing real hardware to a performance model, and present a simple and efficient methodology to determine the step that introduced the error. Finally, we evaluate the system in the context of multiple configurations of real hardware, commercial applications, and industrial-strength performance models to understand the behavior of a complete and practical workload collection system. We have successfully used our system with many commercial Itanium® programs, some running for trillions of instructions, and have used the resulting traces for predicting performance of those applications on future Itanium processors.
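
    The selection step can be sketched over basic-block vectors (BBVs) gathered by instrumentation. SimPoint proper uses k-means clustering; picking the interval nearest the centroid, as below, is a simplification to keep the sketch short.

        import numpy as np

        def pick_pinpoint(bbvs):
            """bbvs[i, b]: how often basic block b executed in interval i,
            gathered by instrumentation rather than simulation."""
            norm = bbvs / bbvs.sum(axis=1, keepdims=True)   # per-interval mix
            centroid = norm.mean(axis=0)
            dists = np.abs(norm - centroid).sum(axis=1)     # Manhattan distance
            return int(dists.argmin())      # most representative interval

        bbvs = np.array([[90, 10,  0],      # interval 0: mostly block 0
                         [80, 15,  5],      # interval 1: close to the mix
                         [10, 10, 80]])     # interval 2: an outlier phase
        print(pick_pinpoint(bbvs))          # -> 1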

  • The Fuzzy Correlation between Code and Performance Predictability

    Page(s): 93 - 104

    Recent studies have shown that most SPEC CPU2K benchmarks exhibit strong phase behavior, and that the Cycles per Instruction (CPI) performance metric can be accurately predicted from a program's control-flow behavior, by simply observing the sequencing of the program counters, or extended instruction pointers (EIPs). One motivation of this paper is to see whether server workloads also exhibit such phase behavior. In particular, can EIPs effectively predict CPI in server workloads? We propose using regression trees to measure the theoretical upper bound on the accuracy of predicting CPI using EIPs, where accuracy is measured by the explained variance of CPI with EIPs. Our results show that for most server workloads and, surprisingly, even for CPU2K benchmarks, the accuracy of predicting CPI from EIPs varies widely. We classify the benchmarks into four quadrants based on their CPI variance and the predictability of CPI using EIPs. Our results indicate that no single sampling technique can be broadly applied to a large class of applications. We propose a new methodology that selects the best-suited sampling technique to accurately capture the program behavior.
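
    The explained-variance measure reduces to a simple computation: group CPI samples by EIP and compare within-group variance to total variance. The sketch below shows that core idea; the paper's regression trees generalize it to splits over EIP sequences.

        from collections import defaultdict

        def explained_variance(samples):
            """samples: (eip, cpi) pairs. Returns the fraction of CPI
            variance explained by grouping on EIP (1.0 = EIP fully
            predicts CPI, 0.0 = no predictive power)."""
            cpis = [c for _, c in samples]
            mean = sum(cpis) / len(cpis)
            ss_total = sum((c - mean) ** 2 for c in cpis)
            groups = defaultdict(list)
            for eip, c in samples:
                groups[eip].append(c)
            ss_within = 0.0
            for g in groups.values():
                gmean = sum(g) / len(g)
                ss_within += sum((c - gmean) ** 2 for c in g)
            return 1.0 - ss_within / ss_total

        # CPI depends only on EIP here, so variance is fully explained:
        print(explained_variance([(0x1, 1.0), (0x1, 1.0), (0x2, 3.0)]))  # 1.0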

  • Whole Execution Traces

    Page(s): 105 - 116

    Different types of program profiles (control flow, value, address, and dependence) have been collected and extensively studied by researchers to identify program characteristics that can then be exploited to develop more effective compilers and architectures. Due to the large amounts of profile data produced by realistic program runs, most work has focused on separately collecting and compressing different types of profiles. In this paper we present a unified representation of profiles called the Whole Execution Trace (WET), which includes the complete information contained in each of the above types of traces. WETs thus provide a basis for a next-generation software tool that will enable mining of program profiles to identify program characteristics that require an understanding of the relationships among various types of profiles. The key features of our WET representation are: WET is constructed by labeling a static program representation with profile information such that relevant and related profile information can be directly accessed by analysis algorithms as they traverse the representation; a highly effective two-tier strategy is used to significantly compress the WET; and the compression techniques are designed such that they do not adversely affect the ability to rapidly traverse the WET to extract subsets of information corresponding to individual profile types or to combinations of profile types (e.g., in the form of dynamic slices of WETs). Our experiments show that, on average, execution traces resulting from the execution of 647 million statements can be stored in 331 megabytes of storage after compression. The compression factors range from 16 to 83. Moreover, the rates at which different types of profiles can be individually or simultaneously extracted are high.
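
    The labeling idea can be sketched as one static representation annotated with timestamped dynamic events, so the different profile types stay linked. Field names are assumptions; the actual WET additionally applies the two-tier compression the abstract describes.

        from collections import defaultdict

        class WET:
            """One static representation labeled with per-instance
            events, keeping control-flow, value, address, and
            dependence profiles linked through shared timestamps."""
            def __init__(self):
                self.labels = defaultdict(list)   # static stmt -> events

            def record(self, stmt, ts, value=None, addr=None, dep=None):
                self.labels[stmt].append(
                    {"ts": ts, "val": value, "addr": addr, "dep": dep})

            def control_flow(self, stmt):         # one profile type...
                return [e["ts"] for e in self.labels[stmt]]

            def dynamic_slice(self, stmt, ts):    # ...or a combination
                for e in self.labels[stmt]:
                    if e["ts"] == ts and e["dep"] is not None:
                        d_stmt, d_ts = e["dep"]
                        yield (d_stmt, d_ts)
                        yield from self.dynamic_slice(d_stmt, d_ts)

        w = WET()
        w.record("s1", ts=1, value=5)
        w.record("s2", ts=2, value=7, dep=("s1", 1))
        print(list(w.dynamic_slice("s2", 2)))     # -> [('s1', 1)]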

  • Wrong Path Events: Exploiting Unusual and Illegal Program Behavior for Early Misprediction Detection and Recovery

    Page(s): 119 - 128

    Control and data speculation are widely used to improve processor performance. Correct speculation can reduce execution time, but incorrect speculation can lead to increased execution time and greater energy consumption. This paper proposes a mechanism to leverage unexpected program behavior, called wrong-path events, that occur during periods of incorrect speculation. A wrong-path event is an instance of illegal or unusual program behavior that is more likely to occur on the wrong path than on the correct path, such as a NULL pointer dereference. When a wrong-path event occurs, the processor can predict that it is on the wrong path and speculatively initiate misprediction recovery. The purpose of the proposed mechanism is to improve the effectiveness of speculative execution in a processor by helping to ensure that the processor remains "on the correct path" throughout periods of speculative execution. We describe a set of wrong-path events which can be used as strong indicators of misprediction. We find that on average 5% of the mispredicted branches in the SPEC2000 integer benchmarks produce a wrong-path event an average of 51 cycles before the branch is executed. We show that once a wrong-path event occurs, it is possible to accurately predict which unresolved branch in the processor is mispredicted using a simple, novel prediction mechanism. We discuss the advantages and shortcomings of wrong-path events and propose new areas for future research.
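
    The flow can be sketched as: detect an event during speculative execution, then predict which in-flight branch to recover from. The event set and the "blame the oldest branch" rule below are simplifications; the paper proposes a more refined predictor for the second step.

        WP_EVENTS = {"null-deref", "misaligned-access", "write-to-code"}

        def on_speculative_event(event, inflight_branches, predictor):
            """inflight_branches: oldest-first unresolved branches. On a
            wrong-path event, predict which branch was mispredicted and
            start recovery before that branch resolves."""
            if event not in WP_EVENTS:
                return None                  # legal behavior: do nothing
            return predictor(inflight_branches)   # branch to recover from

        # toy predictor: blame the oldest unresolved branch
        oldest = lambda branches: branches[0] if branches else None
        print(on_speculative_event("null-deref", [0x40A0, 0x40F4], oldest))
        # -> 0x40a0: recovery begins ~tens of cycles before resolution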

  • Control Flow Optimization Via Dynamic Reconvergence Prediction

    Page(s): 129 - 140

    This paper presents a novel microarchitecture technique for accurately predicting control-flow reconvergence dynamically. A reconvergence point is the earliest dynamic instruction in the program where we can expect program paths to reconverge regardless of the outcome or target of the current branch. Thus, even if the immediate control flow after a branch is uncertain, execution following the reconvergence point is certain. This paper proposes a novel hardware reconvergence predictor which is both implementable and accurate, with a 4KB predictor achieving more than 95% accuracy for SPEC INT, and larger implementations achieving greater than 99% accuracy. The information provided by reconvergence prediction can increase the effectiveness of a range of previously proposed performance optimizations, including speculative multithreading, control independence, and squash reuse. This paper also demonstrates a new technique that takes advantage of the dynamic reconvergence prediction information in order to predict a wrong-path excursion ahead of branch resolution. On average, 34% of wrong-path fetches are eliminated.
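
    A minimal sketch of a table-based reconvergence predictor: training records the earliest PC observed on both sides of a branch. A real predictor is a fixed-size hardware table trained from the retired instruction stream; the unbounded dict and path lists here are illustrative assumptions.

        class ReconvergencePredictor:
            def __init__(self):
                self.table = {}             # branch PC -> reconvergence PC

            def predict(self, branch_pc):
                return self.table.get(branch_pc)

            def train(self, branch_pc, taken_path, not_taken_path):
                """Record the earliest PC observed on both the taken
                and not-taken paths following the branch."""
                on_taken = set(taken_path)
                for pc in not_taken_path:
                    if pc in on_taken:
                        self.table[branch_pc] = pc
                        return

        p = ReconvergencePredictor()
        p.train(0x4000, taken_path=[0x4100, 0x4200],
                not_taken_path=[0x4004, 0x4200])
        print(hex(p.predict(0x4000)))       # -> 0x4200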

  • A Case for Clumsy Packet Processors

    Page(s): 147 - 156

    Hardware faults can occur in any computer system. Although faults cannot be tolerated in most systems (e.g., servers or desktop processors), many applications (e.g., networking applications) provide robustness in software. However, processors do not utilize this resiliency: regardless of the application at hand, a processor is expected to operate completely fault-free. In this paper, we question this traditional insistence on complete correctness and investigate the performance and energy optimizations that become possible when this correctness constraint is relaxed. We first develop a realistic model that estimates the change in fault rates as a function of the clock frequency of the cache. Then, we present a scheme that dynamically adjusts the clock frequency of the data caches to achieve the desired optimization goal, e.g., reduced energy or reduced access latency. Finally, we present simulation results investigating the optimal operating frequency of the data caches, where reliability is compromised in exchange for reduced energy and increased performance. Our simulation results indicate that the clock frequency of the data caches can be increased as much as 4 times without incurring a major penalty on reliability. This also yields a 41% reduction in the energy consumed in the data caches and a 24% reduction in the energy-delay-fallibility product.
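
    The tradeoff can be sketched as picking the cache clock multiplier that minimizes an energy-delay-fallibility product. The curves below (exponential fault growth, partly latency-bound energy) are contrived toy assumptions chosen only to make the tradeoff visible, not the paper's calibrated model.

        import math

        def best_cache_multiplier(multipliers, base_fault=1e-6, k=3.5):
            """Pick the data-cache clock multiplier minimizing a toy
            energy x delay x fallibility product."""
            def score(m):
                delay = 1.0 / m                  # faster cache, shorter access
                energy = 0.6 + 0.4 / m           # toy: part fixed, part speed-bound
                fallibility = 1.0 + base_fault * math.exp(k * (m - 1.0))
                return energy * delay * fallibility
            return min(multipliers, key=score)

        print(best_cache_multiplier([1, 2, 3, 4, 5, 6]))
        # -> 4 with these toy curves: beyond that, fault growth dominates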

  • Dynamically Trading Frequency for Complexity in a GALS Microprocessor

    Page(s): 157 - 168

    Microprocessors are traditionally designed to provide "best overall" performance across a wide range of applications and operating environments. Several groups have proposed hardware techniques that save energy by "downsizing" hardware resources that are underutilized by the current application phase. Others have proposed a different energy-saving approach: dividing the processor into domains and dynamically changing the clock frequency and voltage within each domain during phases when the full domain frequency is not required. What has not been studied to date is how to exploit the adaptive nature of these approaches to improve performance rather than to save energy. In this paper, we describe an adaptive globally asynchronous, locally synchronous (GALS) microprocessor with a fixed global voltage and four independently clocked domains. Each domain is streamlined with modest hardware structures for very high clock frequency. Key structures can then be upsized on demand to exploit more distant parallelism, improve branch prediction, or increase cache capacity. Although doing so requires decreasing the associated domain frequency, other domain frequencies are unaffected. Our approach, therefore, is to maximize the throughput of each domain by finding the proper balance between the number of clock periods and the clock frequency for each application phase. To achieve this objective, we use novel hardware-based control techniques that accurately and efficiently capture the performance of all possible cache and queue configurations within a single interval, without having to resort to exhaustive online exploration or expensive offline profiling. Measuring across a broad suite of application benchmarks, we find that configuring our adaptive GALS processor just once per application yields 17.6% better performance, on average, than the "best overall" fully synchronous design. By adapting automatically to application phases, we can increase this advantage to more than 20%.
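
    The per-domain decision reduces to maximizing instructions per second rather than per cycle: a bigger structure may raise IPC enough to beat its slower clock. The configuration names and numbers below are illustrative assumptions.

        def pick_domain_config(configs, ipc):
            """configs: (name, clock GHz) pairs for one domain, where
            bigger structures force a slower clock; ipc: per-config IPC
            captured within a single sampling interval."""
            return max(configs, key=lambda c: ipc[c[0]] * c[1])

        # a larger issue queue lowers the clock but may still win on IPC:
        configs = [("small-queue", 4.0), ("large-queue", 3.0)]
        print(pick_domain_config(configs,
                                 {"small-queue": 1.2, "large-queue": 1.7}))
        # -> ('large-queue', 3.0): 1.7 * 3.0 > 1.2 * 4.0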

  • Dynamically Controlled Resource Allocation in SMT Processors

    Page(s): 171 - 182

    SMT processors increase performance by executing instructions from several threads simultaneously. Sharing lets these threads use the processor's resources more efficiently, but it also puts them in competition for those resources, and the way critical resources are distributed among threads determines the final performance. Currently, processor resources are distributed among threads as determined by the fetch policy, which decides which threads enter the processor to compete for resources. However, current fetch policies use only indirect indicators of resource usage in their decisions, which can lead to resource monopolization by a single thread or to resource waste when no thread can use the resources. Both situations can harm performance, and both can arise, for example, after an L2 cache miss. In this paper, we introduce the concept of dynamic resource control in SMT processors. Using this concept, we propose a novel resource allocation policy for SMT processors. This policy directly monitors the usage of resources by each thread and guarantees that all threads get their fair share of the critical shared resources, avoiding monopolization. We also define a mechanism to allow a thread to borrow resources from another thread if that thread does not require them, thereby reducing resource under-use. Simulation results show that our dynamic resource allocation policy outperforms a static resource allocation policy by 8%, on average. It also improves on the best dynamic resource-conscious fetch policies, such as FLUSH++, by 4%, on average, using the harmonic mean as the metric. This indicates that our policy does not obtain its ILP boost by unfairly favoring high-ILP threads over slower memory-bound threads; instead, it achieves a better throughput-fairness balance.
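
    A minimal sketch of the fair-share-with-borrowing rule the abstract describes: a thread may exceed its equal share of a resource only by borrowing capacity that no under-share thread is entitled to. The accounting scheme is an assumption, not the paper's hardware mechanism.

        def may_allocate(thread, usage, total):
            """Grant one more entry of a shared resource to 'thread'?
            usage: thread -> entries currently held."""
            share = total // len(usage)
            free = total - sum(usage.values())
            if free == 0:
                return False
            if usage[thread] < share:
                return True                       # within guaranteed share
            owed = sum(max(0, share - held)       # entries reserved for
                       for t, held in usage.items() if t != thread)
            return free > owed                    # borrow only true slack

        usage = {"t0": 10, "t1": 2}               # 16 entries, share = 8
        print(may_allocate("t0", usage, 16))      # False: 4 free, 6 owed to t1
        print(may_allocate("t1", usage, 16))      # True: under its share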

  • Balanced Multithreading: Increasing Throughput via a Low Cost Multithreading Hierarchy

    Page(s): 183 - 194

    A simultaneous multithreading (SMT) processor can issue instructions from several threads every cycle, allowing it to effectively hide various instruction latencies; this effect increases with the number of simultaneous contexts supported. However, each added context on an SMT processor incurs a cost in complexity, which may lead to an increase in pipeline length or a decrease in the maximum clock rate. This paper presents new designs for multithreaded processors which combine a conservative SMT implementation with a coarse-grained multithreading capability. By presenting more virtual contexts to the operating system and user than are supported in the core pipeline, the new designs can take advantage of the memory parallelism present in workloads with many threads, while avoiding the performance penalties inherent in a many-context SMT processor design. A design with 4 virtual contexts, but which is based on a 2-context SMT processor core, gains an additional 26% throughput when 4 threads are run together.
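
    A sketch of the coarse-grained half of the hierarchy: when an active SMT context stalls on a long-latency miss, a ready virtual context is swapped in so the core's few hardware contexts stay busy. The context records and the trigger are illustrative assumptions.

        def on_l2_miss(active, ready_virtual):
            """active: the context that just missed in the L2;
            ready_virtual: queue of runnable virtual contexts."""
            if not ready_virtual:
                return active, None          # nothing to swap in
            incoming = ready_virtual.pop(0)  # inexpensive thread swap
            return incoming, active          # park the stalled context

        active, parked = on_l2_miss("t0", ["t2", "t3"])
        print(active, parked)                # -> t2 t0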

  • Conjoined-Core Chip Multiprocessing

    Page(s): 195 - 206

    Chip multiprocessing (CMP) and simultaneous multithreading (SMT) are two approaches that have been proposed to increase processor efficiency. We believe these two approaches are the extremes of a viable spectrum: between them lies a range of possible architectures that share varying degrees of hardware between processors or threads. This paper proposes conjoined-core chip multiprocessing: topologically feasible resource sharing between adjacent cores of a chip multiprocessor to reduce die area with minimal impact on performance, and hence improve overall computational efficiency. It investigates the possible sharing of floating-point units, crossbar ports, instruction caches, and data caches, and details the area savings that each kind of sharing entails. It also shows that the negative impact on performance due to sharing is significantly smaller than the benefit of reduced area. Several novel techniques for intelligent sharing of the hardware resources to minimize performance degradation are presented.
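
    One simple sharing policy can be sketched as follows: a unit conjoined between two adjacent cores alternates ownership each cycle, with an idle core ceding its slot. This is one plausible arbitration rule for illustration; the paper evaluates several sharing techniques.

        def shared_fpu_owner(cycle, core_pair, busy):
            """Arbitrate a floating-point unit shared by two adjacent
            cores: static odd/even schedule, with idle slots ceded."""
            a, b = core_pair
            owner = a if cycle % 2 == 0 else b
            other = b if owner == a else a
            return owner if busy[owner] else other

        print(shared_fpu_owner(0, ("c0", "c1"), {"c0": False, "c1": True}))
        # -> c1: it inherits c0's idle slot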

  • Hardware and Binary Modification Support for Code Pointer Protection From Buffer Overflow

    Page(s): 209 - 220

    Buffer overflow vulnerabilities are currently the most prevalent security vulnerability: they are responsible for over half of the CERT advisories issued in the last three years. Since many attacks exploit buffer overflow vulnerabilities, techniques that prevent buffer overflow attacks would greatly increase the difficulty of writing a new worm. This paper examines both software and hardware solutions for protecting code pointers from buffer overflow attacks. We first evaluate the performance overhead of the existing PointGuard software solution for protecting code pointers, and show that it can be applied using binary modification to protect return pointers on the stack. These software techniques guard against write attacks, but not read attacks, in which an attacker tries to gain information about the pointer protection mechanism in order to later mount a write buffer attack. To address this, we examine encryption hardware that protects code pointers from both read and write attacks. In addition, we show that pure software solutions can degrade program performance, while the lightweight encryption hardware techniques we examine can provide protection with little performance overhead.
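
    PointGuard-style protection XORs code pointers with a per-process key, so pointers live encrypted in memory and are decrypted only when loaded for use; an attacker's overwrite decrypts to an unpredictable address. The Python below is only a behavioral sketch of that XOR scheme; key management and the hardware variant are outside its scope.

        import os

        KEY = int.from_bytes(os.urandom(8), "little")   # per-process secret

        def encrypt_ptr(p):
            return p ^ KEY        # stored form, useless without the key

        def decrypt_ptr(c):
            return c ^ KEY        # applied when the pointer is loaded

        ret = 0x400A3F
        stored = encrypt_ptr(ret)
        assert decrypt_ptr(stored) == ret
        # an overwrite without the key decrypts to garbage:
        print(hex(decrypt_ptr(0x4141414141414141)))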
