Microarchitecture, 2000. MICRO-33. Proceedings. 33rd Annual IEEE/ACM International Symposium on

Date 10-13 Dec. 2000

Displaying Results 1 - 25 of 33
  • Proceedings 33rd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO-33 2000

    Publication Year: 2000
    Freely Available from IEEE
  • Author index

    Publication Year: 2000 , Page(s): 357
    Freely Available from IEEE
  • Efficient conditional operations for data-parallel architectures

    Publication Year: 2000 , Page(s): 159 - 170
    Cited by:  Papers (13)

    Many data-parallel applications, including emerging media applications, have regular structures that can easily be expressed as a series of arithmetic kernels operating on data streams. Data-parallel architectures are designed to exploit this regularity by performing the same operation on many data elements concurrently. However, applications containing data-dependent control constructs perform po...
  • A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality

    Publication Year: 2000 , Page(s): 32 - 41
    Cited by:  Papers (19)  |  Patents (10)

    DRAM row-buffer conflicts occur when a sequence of requests on different rows goes to the same memory bank, causing much higher memory access latency than requests to the same row or to different banks. We analyze the sources of row-buffer conflicts in the context of superscalar processors, and propose a permutation based page interleaving scheme to reduce row-buffer conflicts and to exploit data ...
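    The definition of a row-buffer conflict in the abstract above can be made concrete with a minimal sketch. It assumes each request is already decoded into a hypothetical `(bank, row)` pair and that each bank keeps exactly one open row; it only counts conflicts and does not model the paper's permutation scheme or any timing.

```python
def count_row_buffer_conflicts(requests):
    """Count requests that target a bank whose open row differs from the requested row.

    `requests` is an iterable of (bank, row) pairs; this representation is an
    assumption made for illustration, not taken from the paper.
    """
    open_row = {}  # bank -> row currently held in that bank's row buffer
    conflicts = 0
    for bank, row in requests:
        if bank in open_row and open_row[bank] != row:
            conflicts += 1  # different row, same bank: the open row must be replaced
        open_row[bank] = row
    return conflicts


# Alternating rows in bank 0 conflict; repeated access to one row in bank 1 does not.
same_bank = [(0, 1), (0, 2), (0, 1), (1, 5), (1, 5)]
spread_out = [(0, 1), (1, 2), (2, 3)]
print(count_row_buffer_conflicts(same_bank))
print(count_row_buffer_conflicts(spread_out))
```

    Requests spread across different banks, or repeated to the same row, produce no conflicts in this model, which matches the two low-latency cases the abstract names.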
  • Predictor-directed stream buffers

    Publication Year: 2000 , Page(s): 42 - 53
    Cited by:  Papers (17)  |  Patents (1)

    An effective method for reducing the effect of load latency in modern processors is data prefetching. One form of data prefetching, stream buffers, has been shown to be particularly effective due to its ability to detect data streams and run ahead of them, prefetching as it goes. Unfortunately, in the past, the applicability of streaming was limited to stride intensive code. We propose Predictor-D...
  • Eager writeback - a technique for improving bandwidth utilization

    Publication Year: 2000 , Page(s): 11 - 21
    Cited by:  Papers (20)  |  Patents (1)

    Modern high-performance processors utilize multi-level cache structures to help tolerate the increasing latency of main memory. Most of these caches employ either a writeback or a write-through strategy to deal with store operations. Write-through caches propagate data to more distant memory levels at the time each store occurs, which requires a very large bandwidth between the memory hierarchy le...
  • An integrated approach to accelerate data and predicate computations in hyperblocks

    Publication Year: 2000 , Page(s): 101 - 111

    To exploit increased instruction-level parallelism available in modern processors, we describe the formation and optimization of tracenets, an integrated approach to reducing the length of the critical path in data and predicated computation. By tightly integrating selective path expansion and path optimization within hyperblocks, our algorithm is able to produce highly optimized code without expl...
  • On pipelining dynamic instruction scheduling logic

    Publication Year: 2000 , Page(s): 57 - 66
    Cited by:  Papers (16)  |  Patents (5)

    A machine's performance is the product of its IPC (Instructions Per Cycle) and clock frequency. Recently, S. Palacharla et al. (1997) warned that the dynamic instruction scheduling logic for current machines performs an atomic operation. Either you sacrifice IPC by pipelining this logic, thereby eliminating its ability to execute dependent instructions in consecutive cycles. Or you sacrifice clock...
  • Dynamic zero compression for cache energy reduction

    Publication Year: 2000 , Page(s): 214 - 220
    Cited by:  Papers (41)

    Dynamic zero compression reduces the energy required for cache accesses by only writing and reading a single bit for every zero-valued byte. This energy-conscious compression is invisible to software and is handled with additional circuitry embedded inside the cache RAM arrays and the CPU. The additional circuitry imposes a cache area overhead of 9% and a read latency overhead of around two FO4 ga...
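    The core idea in the abstract above, accessing a single bit for each zero-valued byte, can be sketched as a simple bit-count model. The function name is hypothetical, and the model assumes that every byte carries one zero-indicator bit and that a nonzero byte additionally costs its full 8 data bits; it counts bits only and says nothing about the actual circuit or energy figures.

```python
def dzc_bits_accessed(data: bytes) -> int:
    """Bits touched when accessing `data` under a simplified zero-compression model.

    Assumption: one zero-indicator bit per byte is always accessed; the 8 data
    bits are accessed only when the byte is nonzero.
    """
    bits = 0
    for byte in data:
        bits += 1        # zero-indicator bit
        if byte != 0:
            bits += 8    # full data byte
    return bits


word = bytes([0, 0, 0, 5])
print(dzc_bits_accessed(word), "bits instead of", 8 * len(word))
```

    Under these assumptions, data with no zero bytes costs slightly more than an uncompressed access (9 bits per byte rather than 8), so the benefit depends on how common zero bytes are.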
  • Silent stores for free

    Publication Year: 2000 , Page(s): 22 - 31
    Cited by:  Papers (12)  |  Patents (3)

    Silent store instructions write values that exactly match the values that are already stored at the memory address that is being written. A recent study reveals that significant benefits can be gained by detecting and removing such stores from a program's execution. This paper studies the problem of detecting silent stores and shows that an average of 31% and 50% of silent stores can be detected f...
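    The definition of a silent store in the abstract above can be illustrated with a minimal sketch. It models memory as a hypothetical Python dictionary from address to value and compares every store against the current contents; it shows the definition only, not the detection mechanisms the paper evaluates.

```python
def count_silent_stores(memory, stores):
    """Count stores whose value already matches the contents of the target address.

    `memory` maps address -> value and is updated in place; `stores` is a
    sequence of (address, value) pairs. Both representations are assumptions
    made for illustration.
    """
    silent = 0
    for addr, value in stores:
        if addr in memory and memory[addr] == value:
            silent += 1           # value already present: the store changes nothing
        else:
            memory[addr] = value  # ordinary store: memory state changes
    return silent


memory = {0x10: 7}
stores = [(0x10, 7), (0x10, 8), (0x10, 8), (0x20, 0)]
print(count_silent_stores(memory, stores))
```

    In this trace the first and third stores are silent; a store to an address that has never been written is treated as non-silent, which is a modeling choice of the sketch.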
  • Instruction distribution heuristics for quad-cluster, dynamically-scheduled, superscalar processors

    Publication Year: 2000 , Page(s): 337 - 347
    Cited by:  Papers (10)  |  Patents (2)

    We investigate instruction distribution methods for quad-cluster, dynamically-scheduled superscalar processors. We study a variety of methods with different cost, performance and complexity characteristics. We investigate both non-adaptive and adaptive methods and their sensitivity both to inter-cluster communication latencies and pipeline depth. Furthermore, we develop a set of models that allow...
  • The impact of delay on the design of branch predictors

    Publication Year: 2000 , Page(s): 67 - 76
    Cited by:  Papers (11)  |  Patents (11)

    Modern microprocessors employ increasingly complicated branch predictors to achieve instruction fetch bandwidth that is sufficient for wide out-of-order execution cores. While existing predictors can still be accessed in a single clock cycle, recent studies show that slower wires and faster clock rates will require multi-cycle access times to large on-chip structures, such as branch prediction tab...
  • Reducing wire delay penalty through value prediction

    Publication Year: 2000 , Page(s): 317 - 326
    Cited by:  Papers (8)  |  Patents (5)

    In this paper we show that value prediction can be used to avoid the penalty of long wire delays by predicting the data that is communicated through these long wires and validating the prediction locally where the value is produced. The long wire delay is experienced only in the case of a misprediction. We apply this concept to a clustered microarchitecture in order to reduce inter-cluster communic...
  • Modulo scheduling for a fully-distributed clustered VLIW architecture

    Publication Year: 2000 , Page(s): 124 - 133
    Cited by:  Papers (8)  |  Patents (8)

    Clustering is an approach that many microprocessors are adopting in recent times in order to mitigate the increasing penalties of wire delays. We propose a novel clustered VLIW architecture which has all its resources partitioned among clusters, including the cache memory. A modulo scheduling scheme for this architecture is also proposed. This algorithm takes into account both register and memory ...
  • Two-level hierarchical register file organization for VLIW processors

    Publication Year: 2000 , Page(s): 137 - 146
    Cited by:  Papers (12)  |  Patents (7)

    High-performance microprocessors are currently designed to exploit the inherent instruction level parallelism (ILP) available in most applications. The techniques used in their design and the aggressive scheduling techniques used to exploit this ILP tend to increase the register requirements of the loops. If more registers than those available in the architecture are required, some actions (such a...
  • A framework for dynamic energy efficiency and temperature management

    Publication Year: 2000 , Page(s): 202 - 213
    Cited by:  Papers (32)  |  Patents (2)

    While technology is delivering increasingly sophisticated and powerful chip designs, it is also imposing alarmingly high energy requirements on the chips. One way to address this problem is to manage the energy dynamically. Unfortunately, current dynamic schemes for energy management are relatively limited. In addition, they manage energy either for energy efficiency or for temperature control, bu...
  • The store-load address table and speculative register promotion

    Publication Year: 2000 , Page(s): 235 - 244
    Cited by:  Papers (1)  |  Patents (6)

    Register promotion is an optimization that allocates a value to a register for a region of its lifetime where it is provably not aliased. Conventional compiler analysis cannot always prove that a value is free of aliases, and thus promotion cannot always be applied. This paper proposes a new hardware structure, the store-load address table (SLAT), which watches both load and store instructions to ...
  • Relational profiling: enabling thread-level parallelism in virtual machines

    Publication Year: 2000 , Page(s): 281 - 290
    Cited by:  Papers (8)  |  Patents (10)

    Virtual machine service threads can perform many tasks in parallel with program execution such as garbage collection, dynamic compilation, and profile collection and analysis. Hardware-assisted profiling is essential for providing service threads with needed information in a flexible and efficient way. A relational profiling architecture (RPA) is proposed for meeting this goal. The RPA selects par...
  • Calpa: a tool for automating selective dynamic compilation

    Publication Year: 2000 , Page(s): 291 - 302
    Cited by:  Papers (6)  |  Patents (7)

    Selective dynamic compilation systems, typically driven by annotations that identify run-time constants, can achieve significant program speedups. However, manually inserting annotations is a tedious and time-consuming process that requires careful inspection of a program's static characteristics and run-time behavior and much trial and error in order to select the most beneficial annotations. Cal...
  • PipeRench implementation of the Instruction Path Coprocessor

    Publication Year: 2000 , Page(s): 147 - 158
    Cited by:  Papers (2)

    The paper demonstrates how an Instruction Path Coprocessor (I-COP) can be efficiently implemented using the PipeRench reconfigurable architecture. An I-COP is a programmable on-chip coprocessor that operates on the core processor's instructions to transform them into a new format that can be more efficiently executed. The I-COP can be used to implement many sophisticated hardware code modification...
  • Compiler controlled value prediction using branch predictor based confidence

    Publication Year: 2000 , Page(s): 327 - 336
    Cited by:  Papers (1)

    Value prediction breaks data dependencies in a program thereby creating instruction level parallelism that can increase program performance. Hardware based value prediction techniques have been shown to increase speed, but at great cost as designs include prediction tables, selection logic, and a confidence mechanism. This paper proposes compiler-controlled value prediction optimizations that obta...
  • Efficient checker processor design

    Publication Year: 2000 , Page(s): 87 - 97
    Cited by:  Papers (9)

    The design and implementation of a modern microprocessor creates many reliability challenges. Designers must verify the correctness of large complex systems and construct implementations that work reliably in varied (and occasionally adverse) operating conditions. In our previous work, we proposed a solution to these problems by adding a simple, easily verifiable checker processor at pipeline reti...
  • Improving BTB performance in the presence of DLLs

    Publication Year: 2000 , Page(s): 77 - 86
    Cited by:  Patents (4)

    Dynamically Linked Libraries (DLLs) promote software modularity, portability, and flexibility and their use has become widespread. The authors characterize the behavior of five applications that make heavy use of DLLs, with a particular focus on the effects of DLLs on Branch Target Buffer (BTB) performance. DLLs aggravate hot set contention in the BTB. Standard software remedies are ineffective be...
  • Accurate and efficient predicate analysis with binary decision diagrams

    Publication Year: 2000 , Page(s): 112 - 123
    Cited by:  Papers (6)  |  Patents (2)

    Functionality and performance of EPIC architectural features depend on extensive compiler support. Predication, one of these features, promises to reduce control flow overhead and to enhance optimization, provided that compilers can utilize it effectively. Previous work has established the need for accurate, direct predicate analysis and has demonstrated a few useful techniques, but has not provid...
  • Performance improvement with circuit-level speculation

    Publication Year: 2000 , Page(s): 348 - 355
    Cited by:  Papers (3)

    The performance of current superscalar microprocessors depends on their frequency and the number of useful instructions that can be processed per cycle (IPC). In this paper we propose a method called approximation to reduce the logic delay of a pipe-stage. The basic idea of approximation is to implement the logic function partially instead of fully. Most of the time the partial implementation gives the co...