Proceedings 33rd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO-33 2000

10-13 Dec. 2000

Displaying Results 1 - 25 of 33
  • Proceedings 33rd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO-33 2000

    Publication Year: 2000
  • Author index

    Publication Year: 2000, Page(s): 357
  • Two-level hierarchical register file organization for VLIW processors

    Publication Year: 2000, Page(s):137 - 146
    Cited by:  Papers (16)  |  Patents (10)

    High-performance microprocessors are currently designed to exploit the inherent instruction level parallelism (ILP) available in most applications. The techniques used in their design and the aggressive scheduling techniques used to exploit this ILP tend to increase the register requirements of the loops. If more registers than those available in the architecture are required, some actions (such a...
  • Very low power pipelines using significance compression

    Publication Year: 2000, Page(s):181 - 190
    Cited by:  Papers (34)  |  Patents (9)

    Data, addresses, and instructions are compressed by maintaining only significant bytes with two or three extension bits appended to indicate the significant byte positions. This significance compression method is integrated into a 5-stage pipeline, with the extension bits flowing down the pipeline to enable pipeline operations only for the significant bytes. Consequently, register logic and cache ...
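The significance-compression idea summarized in this abstract (keep only the significant low-order bytes of a word, with extension bits recording how many there are) can be sketched in software. This is an illustrative model only, assuming unsigned 32-bit words; the function names are invented and this is not the paper's hardware scheme:

```python
def compress(word: int) -> tuple[int, bytes]:
    """Keep only the significant low-order bytes of a 32-bit unsigned
    word; the returned count plays the role of the extension bits."""
    raw = word.to_bytes(4, "little")
    n = 4
    while n > 1 and raw[n - 1] == 0:   # high-order zero bytes are insignificant
        n -= 1
    return n, raw[:n]

def decompress(ext: int, payload: bytes) -> int:
    """Zero-extend the stored bytes back to a full 32-bit word."""
    return int.from_bytes(payload + b"\x00" * (4 - ext), "little")
```

A small constant such as 5 then occupies one byte plus the extension bits instead of four bytes, which is the case the pipeline exploits by gating off work on the insignificant bytes.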
  • Eager writeback: a technique for improving bandwidth utilization

    Publication Year: 2000, Page(s):11 - 21
    Cited by:  Papers (22)  |  Patents (2)

    Modern high-performance processors utilize multi-level cache structures to help tolerate the increasing latency of main memory. Most of these caches employ either a writeback or a write-through strategy to deal with store operations. Write-through caches propagate data to more distant memory levels at the time each store occurs, which requires a very large bandwidth between the memory hierarchy le...
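The contrast this abstract draws (write-through pushes every store down immediately, writeback defers until eviction) is what eager writeback relaxes: dirty lines are cleaned during idle bus cycles so evictions no longer burst writeback traffic. A minimal sketch with a fully-associative toy cache; the class and field names are invented, and the real policy predicts which dirty lines are dead rather than simply cleaning the LRU line:

```python
from collections import OrderedDict

class EagerWritebackCache:
    """Toy fully-associative writeback cache that eagerly cleans the
    LRU dirty line on idle bus cycles, spreading writeback traffic out
    instead of bursting it at eviction time (illustrative only)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.lines = OrderedDict()   # addr -> (data, dirty), LRU first
        self.memory = {}             # next memory level
        self.writebacks_on_evict = 0

    def store(self, addr, data):
        self.lines.pop(addr, None)
        self.lines[addr] = (data, True)              # most-recently used, dirty
        if len(self.lines) > self.capacity:
            old, (odata, dirty) = self.lines.popitem(last=False)
            if dirty:                                # forced, bursty writeback
                self.memory[old] = odata
                self.writebacks_on_evict += 1

    def idle_cycle(self):
        """On an idle bus cycle, write back the LRU dirty line early."""
        for addr, (data, dirty) in self.lines.items():
            if dirty:
                self.memory[addr] = data
                self.lines[addr] = (data, False)     # line is now clean
                break
```

After an idle-cycle cleaning, a later eviction of that line costs no bus traffic, because the copy below is already up to date.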
  • Modulo scheduling for a fully-distributed clustered VLIW architecture

    Publication Year: 2000, Page(s):124 - 133
    Cited by:  Papers (15)  |  Patents (9)

    Clustering is an approach that many recent microprocessors have adopted to mitigate the increasing penalties of wire delays. We propose a novel clustered VLIW architecture which has all its resources partitioned among clusters, including the cache memory. A modulo scheduling scheme for this architecture is also proposed. This algorithm takes into account both register and memory ...
  • Flexible hardware acceleration for multimedia oriented microprocessors

    Publication Year: 2000, Page(s):171 - 177
    Cited by:  Papers (2)  |  Patents (12)

    The execution of multimedia applications on a microprocessor greatly benefits from hardware acceleration, both in terms of speed and energy consumption. While the basic functionality implemented in these accelerators remains constant over different product versions, small changes are still often required. With the proposed architecture and protocol, the accelerator hardware has the performance and...
  • Accurate and efficient predicate analysis with binary decision diagrams

    Publication Year: 2000, Page(s):112 - 123
    Cited by:  Papers (7)  |  Patents (2)

    Functionality and performance of EPIC architectural features depend on extensive compiler support. Predication, one of these features, promises to reduce control flow overhead and to enhance optimization, provided that compilers can utilize it effectively. Previous work has established the need for accurate, direct predicate analysis and has demonstrated a few useful techniques, but has not provid...
  • Efficient conditional operations for data-parallel architectures

    Publication Year: 2000, Page(s):159 - 170
    Cited by:  Papers (14)

    Many data-parallel applications, including emerging media applications, have regular structures that can easily be expressed as a series of arithmetic kernels operating on data streams. Data-parallel architectures are designed to exploit this regularity by performing the same operation on many data elements concurrently. However, applications containing data-dependent control constructs perform po...
  • Increasing the size of atomic instruction blocks using control flow assertions

    Publication Year: 2000, Page(s):303 - 313
    Cited by:  Papers (15)  |  Patents (1)

    For a variety of reasons, branch-less regions of instructions are desirable for high-performance execution. In this paper we propose a means for increasing the dynamic length of branch-less regions of instructions for the purposes of dynamic program optimization. We call these atomic regions frames and we construct them by replacing original branch instructions with assertions. Assertion instructi...
  • Relational profiling: enabling thread-level parallelism in virtual machines

    Publication Year: 2000, Page(s):281 - 290
    Cited by:  Papers (9)  |  Patents (10)

    Virtual machine service threads can perform many tasks in parallel with program execution such as garbage collection, dynamic compilation, and profile collection and analysis. Hardware-assisted profiling is essential for providing service threads with needed information in a flexible and efficient way. A relational profiling architecture (RPA) is proposed for meeting this goal. The RPA selects par...
  • Dynamic zero compression for cache energy reduction

    Publication Year: 2000, Page(s):214 - 220
    Cited by:  Papers (41)  |  Patents (2)

    Dynamic zero compression reduces the energy required for cache accesses by only writing and reading a single bit for every zero-valued byte. This energy-conscious compression is invisible to software and is handled with additional circuitry embedded inside the cache RAM arrays and the CPU. The additional circuitry imposes a cache area overhead of 9% and a read latency overhead of around two FO4 ga...
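The mechanism this abstract describes, keeping only a single bit for every zero-valued byte, can be modeled with an indicator bitmask plus the surviving non-zero bytes. Function names here are illustrative; the paper implements the idea in the cache RAM arrays, not in software:

```python
def dzc_encode(line: bytes) -> tuple[int, bytes]:
    """Encode a cache line as (mask, payload): bit i of the mask is 1
    when byte i is zero, and the payload holds only non-zero bytes."""
    mask = 0
    payload = bytearray()
    for i, b in enumerate(line):
        if b == 0:
            mask |= 1 << i           # a zero byte costs one bit, not eight
        else:
            payload.append(b)
    return mask, bytes(payload)

def dzc_decode(mask: int, payload: bytes, length: int) -> bytes:
    """Rebuild the original line from the mask and non-zero bytes."""
    it = iter(payload)
    return bytes(0 if mask >> i & 1 else next(it) for i in range(length))
```

For a mostly-zero line, almost all of the byte storage is skipped and only the indicator bits are touched, which is where the energy saving comes from.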
  • Predictor-directed stream buffers

    Publication Year: 2000, Page(s):42 - 53
    Cited by:  Papers (17)  |  Patents (2)

    An effective method for reducing the effect of load latency in modern processors is data prefetching. One form of data prefetching, stream buffers, has been shown to be particularly effective due to its ability to detect data streams and run ahead of them, prefetching as it goes. Unfortunately, in the past, the applicability of streaming was limited to stride intensive code. We propose Predictor-D...
  • Instruction distribution heuristics for quad-cluster, dynamically-scheduled, superscalar processors

    Publication Year: 2000, Page(s):337 - 347
    Cited by:  Papers (17)  |  Patents (2)

    We investigate instruction distribution methods for quad-cluster, dynamically-scheduled superscalar processors. We study a variety of methods with different cost, performance and complexity characteristics. We investigate both non-adaptive and adaptive methods and their sensitivity to both inter-cluster communication latencies and pipeline depth. Furthermore, we develop a set of models that allow...
  • Frequent value compression in data caches

    Publication Year: 2000, Page(s):258 - 265
    Cited by:  Papers (15)  |  Patents (5)

    Since the area occupied by cache memories on processor chips continues to grow, an increasing percentage of power is consumed by memory. We present the design and evaluation of the compression cache (CC) which is a first level cache that has been designed so that each cache line can either hold one uncompressed line or two cache lines which have been compressed to at least half their lengths. We u...
  • An integrated approach to accelerate data and predicate computations in hyperblocks

    Publication Year: 2000, Page(s):101 - 111

    To exploit increased instruction-level parallelism available in modern processors, we describe the formation and optimization of tracenets, an integrated approach to reducing the length of the critical path in data and predicated computation. By tightly integrating selective path expansion and path optimization within hyperblocks, our algorithm is able to produce highly optimized code without expl...
  • PipeRench implementation of the Instruction Path Coprocessor

    Publication Year: 2000, Page(s):147 - 158
    Cited by:  Papers (3)

    The paper demonstrates how an Instruction Path Coprocessor (I-COP) can be efficiently implemented using the PipeRench reconfigurable architecture. An I-COP is a programmable on-chip coprocessor that operates on the core processor's instructions to transform them into a new format that can be more efficiently executed. The I-COP can be used to implement many sophisticated hardware code modification...
  • A static power model for architects

    Publication Year: 2000, Page(s):191 - 201
    Cited by:  Papers (74)  |  Patents (1)

    Static power dissipation due to transistor leakage constitutes an increasing fraction of the total power in modern semiconductor technologies. Current technology trends indicate that the contribution will increase rapidly, reaching one half of total power dissipation within three process generations. Developing power efficient products will require consideration of static power in the earliest pha...
  • Silent stores for free

    Publication Year: 2000, Page(s):22 - 31
    Cited by:  Papers (17)  |  Patents (3)

    Silent store instructions write values that exactly match the values that are already stored at the memory address that is being written. A recent study reveals that significant benefits can be gained by detecting and removing such stores from a program's execution. This paper studies the problem of detecting silent stores and shows that an average of 31% and 50% of silent stores can be detected f...
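A silent store, as the abstract defines it, writes a value identical to the one already held at the target address, so squashing it changes no architectural state. A toy software model of the detection (the function name and dict-based memory are illustrative; the paper studies doing this in hardware):

```python
def filter_silent_stores(memory: dict, stores):
    """Apply a sequence of (addr, value) stores to `memory`, dropping
    silent ones whose value already matches the stored value.
    Returns the list of stores that actually changed state."""
    applied = []
    for addr, value in stores:
        if memory.get(addr) == value:
            continue                 # silent store: squash it
        memory[addr] = value
        applied.append((addr, value))
    return applied
```

Note that a store can be silent only because an earlier store in the same sequence already wrote that value, so detection must use the current, not the initial, memory state.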
  • Reducing wire delay penalty through value prediction

    Publication Year: 2000, Page(s):317 - 326
    Cited by:  Papers (10)  |  Patents (5)

    In this paper we show that value prediction can be used to avoid the penalty of long wire delays by predicting the data that is communicated through these long wires and validating the prediction locally where the value is produced. Only in the case of a misprediction is the long wire delay experienced. We apply this concept to a clustered microarchitecture in order to reduce inter-cluster communic...
  • Calpa: a tool for automating selective dynamic compilation

    Publication Year: 2000, Page(s):291 - 302
    Cited by:  Papers (6)  |  Patents (7)

    Selective dynamic compilation systems, typically driven by annotations that identify run-time constants, can achieve significant program speedups. However, manually inserting annotations is a tedious and time-consuming process that requires careful inspection of a program's static characteristics and run-time behavior and much trial and error in order to select the most beneficial annotations. Cal...
  • Register integration: a simple and efficient implementation of squash reuse

    Publication Year: 2000, Page(s):223 - 234
    Cited by:  Papers (12)  |  Patents (4)

    Register integration (or simply integration) is a mechanism for incorporating speculative results directly into a sequential execution using data-dependence relationships. In this paper we use integration to implement squash reuse, the salvaging of instruction results that were needlessly discarded during the course of sequential recovery from a control- or data mis-speculation. To implement integr...
  • On pipelining dynamic instruction scheduling logic

    Publication Year: 2000, Page(s):57 - 66
    Cited by:  Papers (22)  |  Patents (7)

    A machine's performance is the product of its IPC (Instructions Per Cycle) and clock frequency. Recently, S. Palacharla et al. (1997) warned that the dynamic instruction scheduling logic for current machines performs an atomic operation: either you sacrifice IPC by pipelining this logic, thereby eliminating its ability to execute dependent instructions in consecutive cycles, or you sacrifice clock...
  • Performance improvement with circuit-level speculation

    Publication Year: 2000, Page(s):348 - 355
    Cited by:  Papers (3)

    A superscalar microprocessor's performance depends on its frequency and the number of useful instructions that can be processed per cycle (IPC). In this paper we propose a method called approximation to reduce the logic delay of a pipe-stage. The basic idea of approximation is to implement the logic function partially instead of fully. Most of the time the partial implementation gives the co...
  • A study of slipstream processors

    Publication Year: 2000, Page(s):269 - 280
    Cited by:  Papers (20)  |  Patents (13)

    A slipstream processor reduces the length of a running program by dynamically skipping computation non-essential for correct forward progress. The shortened program runs faster as a result, but it is speculative, so a second unreduced copy of the program is run concurrently with and slightly behind the reduced copy, leveraging a chip multiprocessor (CMP) or simultaneous multithreading (SMT). The sh...