
Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-33), 2000

Date: 10-13 Dec. 2000


Displaying Results 1 - 25 of 33
  • Proceedings 33rd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO-33 2000

  • Author index

    Page(s): 357
  • Eager writeback-a technique for improving bandwidth utilization

    Page(s): 11 - 21

    Modern high-performance processors utilize multi-level cache structures to help tolerate the increasing latency of main memory. Most of these caches employ either a writeback or a write-through strategy to deal with store operations. Write-through caches propagate data to more distant memory levels at the time each store occurs, which requires a very large bandwidth between the memory hierarchy levels. Writeback caches can significantly reduce the bandwidth requirements between caches and memory by marking cache lines as dirty when stores are processed and writing those lines to the memory system only when that dirty line is evicted. Unfortunately, for applications that experience significant numbers of cache misses due to streaming data, writeback cache designs can degrade overall system performance by clustering bus activity when dirty lines contend with data being fetched into the cache. In this paper we present a new technique called Eager Writeback, which re-distributes and balances memory traffic by writing and “cleaning” dirty cache lines prior to their eviction. Eager Writeback can be viewed as a compromise between write-through and writeback policies, in which dirty lines are written later than write-through, but prior to writeback. We will show that this approach can reduce the large number of writes seen in a write-through design, while avoiding the performance degradation caused by clustering bus traffic in a writeback approach.

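A minimal Python sketch of the eager-writeback idea described above: dirty lines are cleaned during otherwise-idle bus cycles so that their later eviction carries no writeback traffic. The cache and bus model below is an illustrative assumption, not the paper's design.

```python
# Minimal sketch (not the paper's design): a cache that uses idle bus cycles to
# write back ("clean") dirty lines early, so their later eviction costs no bus time.

class EagerWritebackCache:
    def __init__(self):
        self.lines = {}            # tag -> {"data": ..., "dirty": bool}
        self.bus_writes = []       # trace of when writebacks reach the bus

    def store(self, tag, data):
        self.lines[tag] = {"data": data, "dirty": True}

    def idle_cycle(self, cycle):
        # Eager writeback: spend an otherwise-idle bus cycle cleaning one dirty line.
        for tag, line in self.lines.items():
            if line["dirty"]:
                self.bus_writes.append((cycle, tag, "eager"))
                line["dirty"] = False
                break

    def evict(self, tag, cycle):
        # Only lines that are still dirty generate writeback traffic at eviction time.
        if self.lines.pop(tag)["dirty"]:
            self.bus_writes.append((cycle, tag, "on-evict"))

cache = EagerWritebackCache()
cache.store("A", 1)
cache.idle_cycle(cycle=5)   # line A is cleaned while the bus is idle
cache.evict("A", cycle=9)   # the eviction no longer contends with demand fetches
print(cache.bus_writes)     # [(5, 'A', 'eager')]
```
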
  • Reducing wire delay penalty through value prediction

    Page(s): 317 - 326

    In this paper we show that value prediction can be used to avoid the penalty of long wire delays by predicting the data that is communicated through these long wires and validating the prediction locally where the value is produced. Only in the case of a misprediction is the long wire delay experienced. We apply this concept to a clustered microarchitecture in order to reduce inter-cluster communication. The predictability of values provides the dynamic instruction partitioning hardware with fewer constraints to optimize the trade-off between communication requirements and workload balance, which is the most critical issue of the partitioning scheme. We show that value prediction reduces the penalties caused by inter-cluster communication by 18% on average for a realistic implementation of a 4-cluster microarchitecture.

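A minimal Python sketch of the idea above: the consumer cluster speculatively uses a predicted value for an inter-cluster operand while the producer validates the prediction locally, so only mispredictions pay the long-wire delay. The last-value predictor and latency constants are illustrative assumptions.

```python
# Minimal sketch (illustrative last-value predictor and latencies, not the paper's):
# the consumer uses a predicted operand value; the producer validates it locally,
# so only a misprediction exposes the inter-cluster wire delay.

WIRE_DELAY, LOCAL_DELAY = 4, 1
last_value = {}                                 # producing instruction -> last result

def communicate(producer_id, actual_value):
    predicted = last_value.get(producer_id)
    last_value[producer_id] = actual_value
    if predicted == actual_value:
        return actual_value, LOCAL_DELAY        # prediction validated where produced
    return actual_value, WIRE_DELAY             # misprediction: pay the long-wire delay

results = [communicate("r7_def", v) for v in (10, 10, 10, 42)]
print(results)   # only the first use and the value change pay the wire delay
```
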
  • Silent stores for free

    Page(s): 22 - 31

    Silent store instructions write values that exactly match the values that are already stored at the memory address that is being written. A recent study reveals that significant benefits can be gained by detecting and removing such stores from a program's execution. This paper studies the problem of detecting silent stores and shows that an average of 31% and 50% of silent stores can be detected for very low implementation cost, by exploiting temporal and spatial locality in a processor's load and store queues. We also show that over 83% of all silent stores can be detected using idle cache read access ports. Furthermore, we show that processors that use standard error-correction codes to protect data caches from transient errors can be modified only slightly to detect 100% of silent stores that hit in the cache. Finally, we show that silent store detection via these methods can result in an 11% harmonic mean performance improvement in a two-level store-through on-chip cache hierarchy that is based on a real microprocessor design.

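A minimal Python sketch of silent-store detection: before a store commits, the current contents of the target location are compared with the value being written, and matching ("silent") stores are squashed. The flat memory model and counters are illustrative; the paper's mechanisms use the load/store queues, idle cache ports, and ECC hardware instead.

```python
# Minimal sketch (flat memory model, not the paper's queue/ECC mechanisms):
# a store whose value already matches memory is "silent" and can be squashed.

memory = {}          # address -> value (zero-initialized, like real memory here)
silent, total = 0, 0

def store(addr, value):
    global silent, total
    total += 1
    if memory.get(addr, 0) == value:    # the store would not change memory
        silent += 1
        return                          # squash it: no write, no downstream traffic
    memory[addr] = value

for addr, val in [(0x100, 7), (0x100, 7), (0x104, 0), (0x100, 8)]:
    store(addr, val)

print(f"{silent}/{total} stores were silent")   # 2/4 in this toy trace
```
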
  • Improving BTB performance in the presence of DLLs

    Page(s): 77 - 86

    Dynamically Linked Libraries (DLLs) promote software modularity, portability, and flexibility, and their use has become widespread. The authors characterize the behavior of five applications that make heavy use of DLLs, with a particular focus on the effects of DLLs on Branch Target Buffer (BTB) performance. DLLs aggravate hot set contention in the BTB. Standard software remedies are ineffective because the DLLs are shared, compiled separately, and dynamically linked to applications. We propose a hardware technique, the DLL BTB, that adds a small second buffer to the BTB and dedicates it to storing DLL target addresses. We show that the DLL BTB performance is similar to a BTB with a victim buffer, but the DLL BTB requires no parallel lookups or datapaths between the original BTB and the added buffer.

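A minimal Python sketch of steering DLL branches into a small dedicated buffer so that library code stops evicting application entries from the main BTB. The address-range test, direct-mapped organization, and table sizes are illustrative assumptions, not the paper's selection mechanism.

```python
# Minimal sketch (address-range test and table sizes are assumptions): branches
# whose PC falls in the DLL region use a small dedicated buffer, so library code
# stops evicting application entries from the main BTB.

DLL_BASE = 0x7000_0000               # assumed load region for shared libraries
MAIN_BTB, DLL_BTB = {}, {}

def btb_insert(branch_pc, target):
    table = DLL_BTB if branch_pc >= DLL_BASE else MAIN_BTB
    table[branch_pc % 512] = (branch_pc, target)     # direct-mapped, 512 entries

def btb_lookup(branch_pc):
    table = DLL_BTB if branch_pc >= DLL_BASE else MAIN_BTB
    entry = table.get(branch_pc % 512)
    return entry[1] if entry and entry[0] == branch_pc else None

btb_insert(0x0040_1000, 0x0040_2000)   # application branch -> main BTB
btb_insert(0x7000_1000, 0x7000_3000)   # DLL branch -> dedicated DLL BTB
print(hex(btb_lookup(0x7000_1000)))    # 0x70003000
```
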
  • A static power model for architects

    Page(s): 191 - 201

    Static power dissipation due to transistor leakage constitutes an increasing fraction of the total power in modern semiconductor technologies. Current technology trends indicate that the contribution will increase rapidly, reaching one half of total power dissipation within three process generations. Developing power efficient products will require consideration of static power in the earliest phases of design, including architecture and microarchitecture definition. We propose a simple equation for estimating static power consumption at the architectural level: P_static = V_CC · N · k_design · Î_leak, where V_CC is the supply voltage, N is the number of transistors, k_design is a design dependent parameter, and Î_leak is a technology dependent parameter. This model enables high-level reasoning about the likely static power demands of alternative microarchitectures. Reasonably accurate values for the factors within the equation may be obtained directly from the high-level designs or by straightforward scaling arguments. The factors within the equation also suggest opportunities for static power optimization, including reducing the total number of devices, partitioning the design to allow for lower supply voltages or slower, less leaky transistors, turning off unused devices, favoring certain design styles, and favoring high bandwidth over low latency. Speculation is also examined as a means to employ slower transistors without a significant performance penalty.

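The proposed model is a single product of four factors, so it can be evaluated directly. The Python sketch below computes P_static = V_CC · N · k_design · Î_leak for two hypothetical design alternatives; all numeric values are made-up assumptions, not figures from the paper.

```python
# Minimal sketch with made-up parameter values (not figures from the paper):
# evaluate P_static = V_CC * N * k_design * I_leak for two design alternatives.

def static_power(vcc, n_transistors, k_design, i_leak):
    """Architectural static-power estimate in watts."""
    return vcc * n_transistors * k_design * i_leak

baseline    = static_power(vcc=1.2, n_transistors=50e6, k_design=0.1, i_leak=20e-9)
partitioned = static_power(vcc=1.0, n_transistors=55e6, k_design=0.1, i_leak=12e-9)

print(f"baseline:    {baseline:.3f} W")     # 0.120 W
print(f"partitioned: {partitioned:.3f} W")  # 0.066 W
```
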
  • Accurate and efficient predicate analysis with binary decision diagrams

    Page(s): 112 - 123

    Functionality and performance of EPIC architectural features depend on extensive compiler support. Predication, one of these features, promises to reduce control flow overhead and to enhance optimization, provided that compilers can utilize it effectively. Previous work has established the need for accurate, direct predicate analysis and has demonstrated a few useful techniques, but has not provided an efficient, general framework. The paper presents the Predicate Analysis System (PAS), which maps knowledge of predicate and condition relations in general control flow onto a convenient logical substrate, the reduced ordered binary decision diagram. PAS is the first such framework to demonstrate direct, accurate, and efficient analysis of arbitrary condition and predicate define networks in arbitrary control flow.

  • Performance improvement with circuit-level speculation

    Page(s): 348 - 355

    A superscalar microprocessor's performance depends on its frequency and on the number of useful instructions that can be processed per cycle (IPC). In this paper we propose a method called approximation to reduce the logic delay of a pipe-stage. The basic idea of approximation is to implement the logic function partially instead of fully. Most of the time the partial implementation gives the correct result, as if the function were implemented fully, but with fewer gate delays, allowing a higher pipeline frequency. We apply this method to three logic blocks. Simulation results show that this method provides some performance improvement for a wide-issue superscalar if these stages are finely pipelined.

  • A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality

    Page(s): 32 - 41

    DRAM row-buffer conflicts occur when a sequence of requests on different rows goes to the same memory bank, causing much higher memory access latency than requests to the same row or to different banks. We analyze the sources of row-buffer conflicts in the context of superscalar processors, and propose a permutation-based page interleaving scheme to reduce row-buffer conflicts and to exploit data access locality in the row-buffer. Compared with several existing schemes, we show that the permutation-based scheme dramatically increases the hit rates on DRAM row-buffers and reduces memory stall time of the SPEC95 and TPC-C workloads. The memory stall times of the workloads are reduced by up to 68% and 50%, compared with the conventional cache line and page interleaving schemes, respectively.

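A minimal Python sketch of a permutation-based bank mapping: the bank index is formed by XORing the conventional bank bits with some low-order row bits, so request streams that would conventionally thrash one row buffer are spread across banks. The field widths and example addresses are illustrative assumptions.

```python
# Minimal sketch (field widths and addresses are assumptions): the bank index is
# the conventional bank bits XORed with low-order row bits, spreading same-bank
# request streams across banks to avoid row-buffer conflicts.

PAGE_BITS = 12   # 4 KB DRAM page
BANK_BITS = 3    # 8 banks

def conventional_bank(addr):
    return (addr >> PAGE_BITS) & ((1 << BANK_BITS) - 1)

def permuted_bank(addr):
    bank = conventional_bank(addr)
    row_low = (addr >> (PAGE_BITS + BANK_BITS)) & ((1 << BANK_BITS) - 1)
    return bank ^ row_low            # XOR row bits into the bank index

# Two addresses in different rows that map to the same bank conventionally:
a, b = 0x0000, 0x8000
print(conventional_bank(a), conventional_bank(b))   # 0 0 -> row-buffer conflict
print(permuted_bank(a), permuted_bank(b))           # 0 1 -> different banks
```
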
  • Dynamic zero compression for cache energy reduction

    Page(s): 214 - 220

    Dynamic zero compression reduces the energy required for cache accesses by only writing and reading a single bit for every zero-valued byte. This energy-conscious compression is invisible to software and is handled with additional circuitry embedded inside the cache RAM arrays and the CPU. The additional circuitry imposes a cache area overhead of 9% and a read latency overhead of around two FO4 gate delays. Simulation results show that we can reduce total data cache energy by around 26% and instruction cache energy by around 10% for SPECint95 and MediaBench benchmarks. We also describe the use of an instruction recoding technique that increases instruction cache energy savings to 18%.

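A minimal Python sketch of the zero-compression encoding at the data level: each byte carries a zero-indicator bit and only non-zero bytes are stored or transferred in full. The byte-string encoding is only an illustration; in the proposed design the indicator bits live in extra circuitry inside the cache RAM arrays.

```python
# Minimal sketch of the encoding at the data level (in hardware the indicator
# bits live in extra circuitry inside the cache RAM arrays, not in a byte string).

def dzc_encode(data: bytes):
    indicators = [b == 0 for b in data]            # one "zero" bit per byte
    payload = bytes(b for b in data if b != 0)     # only non-zero bytes kept
    return indicators, payload

def dzc_decode(indicators, payload) -> bytes:
    it = iter(payload)
    return bytes(0 if zero else next(it) for zero in indicators)

line = bytes([0, 0, 0, 7, 0, 0, 255, 0])
ind, pay = dzc_encode(line)
assert dzc_decode(ind, pay) == line
print(f"bytes accessed in full: {len(pay)} of {len(line)}")   # 2 of 8
```
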
  • Flexible hardware acceleration for multimedia oriented microprocessors

    Page(s): 171 - 177

    The execution of multimedia applications on a microprocessor greatly benefits from hardware acceleration, both in terms of speed and energy consumption. While the basic functionality implemented in these accelerators remains constant over different product versions, small changes are still often required. With the proposed architecture and protocol, the accelerator hardware has the performance and cost benefits of a hardwired solution, while featuring all the flexibility needed in practice. From a user point of view, the entire application is still programmable.

  • Frequent value compression in data caches

    Page(s): 258 - 265

    Since the area occupied by cache memories on processor chips continues to grow, an increasing percentage of power is consumed by memory. We present the design and evaluation of the compression cache (CC), which is a first level cache that has been designed so that each cache line can either hold one uncompressed line or two cache lines which have been compressed to at least half their lengths. We use a novel data compression scheme based upon encoding of a small number of values that appear frequently during memory accesses. This compression scheme preserves the ability to randomly access individual data items. We observed that the contents of 40%, 52% and 51% of the memory blocks of size 4, 8, and 16 words respectively in SPECint95 benchmarks can be compressed to at least half their sizes by encoding the top 2, 4, and 8 frequent values respectively. Compression allows greater amounts of data to be stored, leading to substantial reductions in miss rates (0-36.4%), off-chip traffic (3.9-48.1%), and energy consumed (1-27%). Traffic and energy reductions are in part derived by transferring data over external buses in compressed form.

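A minimal Python sketch of frequent-value encoding: words found in a small table of frequent values are replaced by a short index, while other words are kept verbatim with an "uncompressed" flag, preserving random access to individual items. The table contents and the example block are illustrative assumptions.

```python
# Minimal sketch (table contents and data are assumptions): words found in a small
# frequent-value table are replaced by a short index; others stay verbatim.

FREQUENT = [0, 1, 0xFFFFFFFF, 4]           # e.g. the top-4 frequent 32-bit values
INDEX = {v: i for i, v in enumerate(FREQUENT)}

def encode_word(word):
    if word in INDEX:
        return ("F", INDEX[word])           # flag bit + log2(len(FREQUENT)) bits
    return ("U", word)                      # flag bit + full 32-bit word

def decode_word(tagged):
    tag, payload = tagged
    return FREQUENT[payload] if tag == "F" else payload

block = [0, 0, 4, 123456, 1, 0xFFFFFFFF]
encoded = [encode_word(w) for w in block]
assert [decode_word(t) for t in encoded] == block
hits = sum(tag == "F" for tag, _ in encoded)
print(f"{hits}/{len(block)} words hit the frequent-value table")   # 5/6
```
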
  • Modulo scheduling for a fully-distributed clustered VLIW architecture

    Page(s): 124 - 133

    Clustering is an approach that many microprocessors are adopting in recent times in order to mitigate the increasing penalties of wire delays. We propose a novel clustered VLIW architecture which has all its resources partitioned among clusters, including the cache memory. A modulo scheduling scheme for this architecture is also proposed. This algorithm takes into account both register and memory inter-cluster communications so that the final schedule results in a cluster assignment that favors cluster locality in cache references and register accesses. It has been evaluated for both 2- and 4-cluster configurations and for differing numbers and latencies of inter-cluster buses. The proposed algorithm produces schedules with very low communication requirements and outperforms previous cluster-oriented schedulers.

  • Relational profiling: enabling thread-level parallelism in virtual machines

    Page(s): 281 - 290

    Virtual machine service threads can perform many tasks in parallel with program execution such as garbage collection, dynamic compilation, and profile collection and analysis. Hardware-assisted profiling is essential for providing service threads with needed information in a flexible and efficient way. A relational profiling architecture (RPA) is proposed for meeting this goal. The RPA selects particular instructions for profiling, and communicates collected information to service threads through shared memory message queues. The RPA's capabilities lead to new profiling applications, such as concurrent garbage collection. Simulations indicate that a low-cost implementation of the RPA should be able to profile four in-flight instructions simultaneously, and provide storage for eight profile records. Profiling overhead is less than 0.5% for concurrent garbage collection and edge profiling.

  • Instruction distribution heuristics for quad-cluster, dynamically-scheduled, superscalar processors

    Page(s): 337 - 347

    We investigate instruction distribution methods for quad-cluster, dynamically-scheduled superscalar processors. We study a variety of methods with different cost, performance and complexity characteristics. We investigate both non-adaptive and adaptive methods and their sensitivity both to inter-cluster communication latencies and pipeline depth. Furthermore, we develop a set of models that allow us to identify how well each method attacks issue-bandwidth and inter-cluster communication restrictions. We find that a relatively simple method that changes clusters every three instructions offers only a 17% performance slowdown compared to a non-clustered configuration operating at the same frequency. Moreover, we show that by utilizing adaptive methods it is possible to further reduce this gap down to about 14%. Furthermore, performance appears to be more sensitive to inter-cluster communication latencies than to pipeline depth. The best performing method offers a slowdown of about 24% when inter-cluster communication latency is two cycles. This gap is only 20% when two additional stages are introduced in the front-end pipeline.

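A minimal Python sketch of one reading of the simple non-adaptive heuristic above: the steering logic switches to the next cluster after every three renamed instructions. The cluster count follows the quad-cluster setting; the code is only an illustration of the steering rule, not the paper's simulator.

```python
# Minimal sketch of the steering rule only (quad-cluster setting from the paper;
# the grouping of three is one reading of the heuristic described above).

NUM_CLUSTERS = 4
GROUP = 3

def assign_clusters(num_instructions):
    """Switch to the next cluster after every GROUP renamed instructions."""
    return [(i // GROUP) % NUM_CLUSTERS for i in range(num_instructions)]

print(assign_clusters(12))   # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
```
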
  • Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures

    Page(s): 245 - 257

    Conventional microarchitectures choose a single memory hierarchy design point targeted at the average application. In this paper, we propose a cache and TLB layout and design that leverages repeater insertion to provide dynamic low-cost configurability, trading off size and speed on a per-application-phase basis. A novel configuration management algorithm dynamically detects phase changes and reacts to an application's hit and miss intolerance in order to improve memory hierarchy performance while taking energy consumption into consideration. When applied to a two-level cache and TLB hierarchy at 0.1 μm technology, the result is an average 15% reduction in cycles per instruction (CPI), corresponding to an average 27% reduction in memory-CPI, across a broad class of applications compared to the best conventional two-level hierarchy of comparable size. Projecting to sub-0.1 μm technology design considerations that call for a three-level conventional cache hierarchy for performance reasons, we demonstrate that a configurable L2/L3 cache hierarchy coupled with a conventional L1 results in an average 43% reduction in memory hierarchy energy in addition to improved performance.

  • The impact of delay on the design of branch predictors

    Page(s): 67 - 76

    Modern microprocessors employ increasingly complicated branch predictors to achieve instruction fetch bandwidth that is sufficient for wide out-of-order execution cores. While existing predictors can still be accessed in a single clock cycle, recent studies show that slower wires and faster clock rates will require multi-cycle access times to large on-chip structures, such as branch prediction tables. Thus, future branch predictors must consider not only area and accuracy, but also delay. The paper explores these tradeoffs in designing branch predictors and shows that increased accuracy alone cannot overcome the penalties in delay that arise with larger predictor structures. We evaluate three schemes for accommodating delay: a caching approach, an overriding approach, and a cascading lookahead approach. While we use a common branch predictor, gshare, as the prediction component, these schemes can be constructed using most types of predictors.

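A minimal Python sketch of the overriding approach mentioned above: a small, fast predictor supplies a prediction immediately, and a larger, slower predictor later overrides it on disagreement at the cost of a small re-steer penalty. Both stand-in predictors and the penalty value are illustrative assumptions.

```python
# Minimal sketch (stand-in predictors and penalty are assumptions): a fast,
# small predictor answers immediately; a slower, larger one overrides it later
# if they disagree, costing a small re-steer penalty instead of a full stall.

def fetch_with_override(pc, fast_predict, slow_predict, override_penalty=2):
    fast = fast_predict(pc)            # available in the fetch cycle
    slow = slow_predict(pc)            # available some cycles later
    if slow == fast:
        return fast, 0                 # fast prediction confirmed, no penalty
    return slow, override_penalty      # re-steer fetch using the better prediction

fast_bimodal = lambda pc: (pc >> 2) % 3 != 0    # illustrative stand-ins only
slow_gshare  = lambda pc: (pc >> 2) % 5 != 0

print(fetch_with_override(0x1000, fast_bimodal, slow_gshare))   # (True, 0)
```
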
  • Register integration: a simple and efficient implementation of squash reuse

    Page(s): 223 - 234

    Register integration (or simply integration) is a mechanism for incorporating speculative results directly into a sequential execution using data-dependence relationships. In this paper we use integration to implement squash reuse, the salvaging of instruction results that were needlessly discarded during the course of sequential recovery from a control- or data mis-speculation. To implement integration, we first allow the results of squashed instructions to remain in the physical register file past mis-speculation recovery. As the processor re-traces portions of the squashed path, integration logic examines each instruction as it is being renamed. Using an auxiliary table, this circuit searches the physical register file for the physical register belonging to the corresponding squashed instance of the instruction. If this register is found, integration succeeds and the squashed result is re-validated by a simple update of the rename table. Once integrated, an instruction is complete and may bypass the out-of-order core of the machine entirely. Integration reduces contention for queuing and execution resources, collapses dependent chains of instructions and accelerates the resolution of branches. It achieves this using only rename-table manipulations; no additional values are read from or written to the physical registers. Our preliminary evaluation shows that a minimal integration configuration can provide performance improvements of up to 8% when applied to current-generation microarchitectures and up to 11.5% when applied to more aggressive microarchitectures. Integration also reduces the amount of wasteful speculation in the machine, cutting the number of instructions executed by up to 15% and the number of instructions fetched along mis-speculated paths by as much as 6%.

  • Predictor-directed stream buffers

    Page(s): 42 - 53

    An effective method for reducing the effect of load latency in modern processors is data prefetching. One form of data prefetching, stream buffers, has been shown to be particularly effective due to its ability to detect data streams and run ahead of them, prefetching as it goes. Unfortunately, in the past, the applicability of streaming was limited to stride-intensive code. We propose Predictor-Directed Stream Buffers (PSB), a scheme in which the stream buffer follows an address prediction stream instead of a fixed stride. In addition, we examine using confidence techniques to guide the allocation and prioritization of stream buffers and their prefetch requests. Our results show for pointer-based applications that PSB provides a 30% speedup on average over no prefetching, and provides an average 10% speedup over using previously proposed stride-based stream buffers for pointer-intensive applications.

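A minimal Python sketch of a predictor-directed stream buffer: on allocation, the buffer asks an address predictor for each next address in the stream instead of repeatedly adding a fixed stride. The toy pointer-chasing predictor and the fall-back to the next cache line are illustrative assumptions.

```python
# Minimal sketch (toy pointer-chasing predictor is an assumption): the stream
# buffer runs ahead by asking a predictor for each next address, rather than
# repeatedly adding a fixed stride.

from collections import deque

class StreamBuffer:
    def __init__(self, predictor, depth=4):
        self.predictor = predictor
        self.entries = deque(maxlen=depth)

    def allocate(self, miss_addr):
        self.entries.clear()
        addr = miss_addr
        for _ in range(self.entries.maxlen):      # run ahead of the miss stream
            addr = self.predictor(addr)
            self.entries.append(addr)             # issue a prefetch for addr

history = {0x100: 0x580, 0x580: 0x2C0, 0x2C0: 0x9A0, 0x9A0: 0x110}
predictor = lambda a: history.get(a, a + 64)      # fall back to the next cache line

sb = StreamBuffer(predictor)
sb.allocate(0x100)
print([hex(a) for a in sb.entries])   # follows the pointer chain, not a stride
```
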
  • PipeRench implementation of the Instruction Path Coprocessor

    Page(s): 147 - 158

    The paper demonstrates how an Instruction Path Coprocessor (I-COP) can be efficiently implemented using the PipeRench reconfigurable architecture. An I-COP is a programmable on-chip coprocessor that operates on the core processor's instructions to transform them into a new format that can be more efficiently executed. The I-COP can be used to implement many sophisticated hardware code modification techniques. We show how four specific techniques can be mapped to the PipeRench pipelined computation model. The experimental results show that a PipeRench I-COP used to perform trace construction and trace optimizations for a trace cache fill unit not only achieves good performance gains but can potentially be implemented in less than 10 mm² (assuming 0.18 micron technology), or approximately 3% of the die area of a current high-end microprocessor. We believe these results demonstrate the usefulness and feasibility of the I-COP concept.

  • Efficient checker processor design

    Page(s): 87 - 97

    The design and implementation of a modern microprocessor creates many reliability challenges. Designers must verify the correctness of large complex systems and construct implementations that work reliably in varied (and occasionally adverse) operating conditions. In our previous work, we proposed a solution to these problems by adding a simple, easily verifiable checker processor at pipeline retirement. Performance analyses of our initial design were promising: overall slowdowns due to checker processor hazards were less than 3%. However, slowdowns for some outlier programs were larger. The authors closely examine the operation of the checker processor. We identify the specific reasons why the initial design works well for some programs, but slows others. Our analyses suggest a variety of improvements to the checker processor storage system. Through the addition of a 4 KB checker cache and an eight-entry store queue, our optimized design eliminates virtually all core processor slowdowns. Moreover, we develop insights into why the optimized checker processor performs well, insights that suggest it should perform well for any program.

  • Increasing the size of atomic instruction blocks using control flow assertions

    Page(s): 303 - 313

    For a variety of reasons, branch-less regions of instructions are desirable for high-performance execution. In this paper we propose a means for increasing the dynamic length of branch-less regions of instructions for the purposes of dynamic program optimization. We call these atomic regions frames, and we construct them by replacing original branch instructions with assertions. Assertion instructions check if the original branching conditions still hold. If they hold, no action is taken. If they do not, then the entire region is undone. In this manner an assertion has no explicit control flow. We demonstrate that using branch correlation to decide when a branch should be converted into an assertion results in atomic regions that average over 100 instructions in length, with a probability of completion of 97%, and that constitute over 80% of the dynamic instruction stream. We demonstrate both static and dynamic means for constructing frames. When frames are built dynamically using finite-sized hardware, they average 80 instructions in length and have good caching properties.

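A minimal Python sketch of executing an atomic region whose branches have been converted into assertions: if every assertion holds, the region's writes commit; if any fails, the whole region is rolled back. The region representation and the checkpoint/rollback scheme are illustrative assumptions, not the paper's hardware.

```python
# Minimal sketch (region representation and rollback are assumptions): a frame
# executes atomically; a failed assertion undoes every write in the region.

def run_frame(state, operations):
    checkpoint = dict(state)                  # snapshot for atomic execution
    for op in operations:
        if op[0] == "assert":
            if not op[1](state):              # asserted branch condition violated
                state.clear()
                state.update(checkpoint)      # undo the entire region
                return False
        else:                                 # ("write", key, value)
            _, key, value = op
            state[key] = value
    return True                               # all assertions held: commit

state = {"x": 3}
frame = [("write", "y", 7), ("assert", lambda s: s["x"] < 10), ("write", "z", 1)]
print(run_frame(state, frame), state)         # True {'x': 3, 'y': 7, 'z': 1}
```
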
  • Very low power pipelines using significance compression

    Page(s): 181 - 190

    Data, addresses, and instructions are compressed by maintaining only significant bytes with two or three extension bits appended to indicate the significant byte positions. This significance compression method is integrated into a 5-stage pipeline, with the extension bits flowing down the pipeline to enable pipeline operations only for the significant bytes. Consequently, register logic and cache activity (and dynamic power) are substantially reduced. An initial trace-driven study shows reduction in activity of approximately 30-40% for each pipeline stage. Several pipeline organizations are studied. A byte serial pipeline is the simplest implementation, but suffers a CPI (cycles per instruction) increase of 79% compared with a conventional 32-bit pipeline. Widening certain pipeline stages in order to balance processing bandwidth leads to an implementation with a CPI 24% higher than the baseline 32-bit design. Finally, full-width pipeline stages with operand gating achieve a CPI within 2-6% of the baseline 32-bit pipeline.

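A minimal Python sketch of significance compression for one 32-bit value: only the bytes that are not pure sign extension are kept, and a small count plays the role of the extension bits recording how many bytes are significant. The exact encoding (little-endian, 1-4 byte counts) is an illustrative assumption.

```python
# Minimal sketch (little-endian, 1-4 byte counts are assumptions): keep only the
# bytes that are not pure sign extension; the count plays the role of the
# extension bits carried down the pipeline.

def compress(value: int):
    raw = value.to_bytes(4, "little", signed=True)
    n = 4
    while n > 1:                               # drop trailing sign-extension bytes
        sign = 0xFF if raw[n - 2] & 0x80 else 0x00
        if raw[n - 1] != sign:
            break
        n -= 1
    return n, raw[:n]                          # n = number of significant bytes

def decompress(n: int, payload: bytes) -> int:
    sign = 0xFF if payload[-1] & 0x80 else 0x00
    return int.from_bytes(payload + bytes([sign]) * (4 - n), "little", signed=True)

for v in (0, 5, -1, 300, -70000, 2**31 - 1):
    n, payload = compress(v)
    assert decompress(n, payload) == v
    print(f"{v:>12}: {n} significant byte(s)")
```
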
  • A framework for dynamic energy efficiency and temperature management

    Page(s): 202 - 213

    While technology is delivering increasingly sophisticated and powerful chip designs, it is also imposing alarmingly high energy requirements on the chips. One way to address this problem is to manage the energy dynamically. Unfortunately, current dynamic schemes for energy management are relatively limited. In addition, they manage energy either for energy efficiency or for temperature control, but not for both simultaneously. In this paper, we design and evaluate for the first time an energy-management framework that tackles both energy efficiency and temperature control in a unified manner. We call this general approach Dynamic Energy Efficiency and Temperature Management (DEETM). Our framework combines many energy-management techniques and can activate them individually or in groups in a fine-grained manner according to a given policy. The goal of the framework is two-fold: maximize energy savings without extending application execution time beyond a given tolerable limit, and guarantee that the temperature remains below a given limit while minimizing any resulting slowdown. The framework successfully meets these goals. For example, it delivers a 40% energy reduction with only a 10% application slowdown.
