38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05)

Date 12-16 Nov. 2005

  • Proceedings. 38th Annual IEEE/ACM International Symposium on Microarchitecture

    Publication Year: 2005, Page(s): c1
    Freely Available from IEEE
  • 38th Annual IEEE/ACM International Symposium on Microarchitecture - Title Page

    Publication Year: 2005, Page(s):i - iii
    Freely Available from IEEE
  • 38th Annual IEEE/ACM International Symposium on Microarchitecture - Copyright

    Publication Year: 2005, Page(s): iv
    Freely Available from IEEE
  • 38th Annual IEEE/ACM International Symposium on Microarchitecture - Table of contents

    Publication Year: 2005, Page(s):v - vii
    Freely Available from IEEE
  • Message from the General Chairs

    Publication Year: 2005, Page(s):viii - ix
    Freely Available from IEEE
  • Message from the Program Co-Chairs

    Publication Year: 2005, Page(s): x
    Freely Available from IEEE
  • The Cell Processor Architecture

    Publication Year: 2005, Page(s): 3
    Cited by:  Papers (17)

    This talk will present the Cell processor, jointly developed by the STI (Sony-Toshiba-IBM) partnership. Cell is a non-homogeneous chip multiprocessor intended for general-purpose applications but with a particular emphasis on multimedia performance. The Cell processor combines a 64-bit Power Architecture(TM) core with 8 Synergistic Processors. In many cases, it delivers more than an order of magnit...
  • How to fake 1000 registers

    Publication Year: 2005, Page(s):12 pp. - 18
    Cited by:  Papers (9)  |  Patents (1)

    Large numbers of logical registers can improve performance by allowing fast access to multiple subroutine contexts (register windows) and multiple thread contexts (multithreading). Support for both of these together requires a multiplicative number of registers that quickly becomes prohibitive. We overcome this limitation with the virtual context architecture (VCA), a new register-file architectur...
  • Reducing instruction fetch cost by packing instructions into register windows

    Publication Year: 2005
    Cited by:  Papers (2)

    Instruction packing is a combination compiler/architectural approach that allows for decreased code size, reduced power consumption and improved performance. The packing is obtained by placing frequently occurring instructions into an instruction register file (IRF). Multiple IRF entries can then be accessed using special packed instructions. Previous IRF efforts focused on using a single 32-entry...
  • Efficient use of invisible registers in Thumb code

    Publication Year: 2005
    Cited by:  Papers (3)

    The ARM processor is a dual-width ISA processor that provides a 16-bit Thumb instruction set in addition to the 32-bit ARM instruction set. The compromises made in designing the Thumb instruction set lead to significantly increased instruction counts. This increase is in part due to the fact that only half of the register file is visible to most instructions in Thumb code. In this paper we addres...
  • Wish branches: combining conditional branching and predication for adaptive predicated execution

    Publication Year: 2005, Page(s):12 pp. - 54
    Cited by:  Papers (8)  |  Patents (6)

    Predicated execution has been used to reduce the number of branch mispredictions by eliminating hard-to-predict branches. However, the additional instruction overhead and additional data dependencies due to predicated execution sometimes offset the performance advantage of having fewer mispredictions. We propose a mechanism in which the compiler generates code that can be executed either as predic...
  • A criticality analysis of clustering in superscalar processors

    Publication Year: 2005, Page(s):12 pp. - 66
    Cited by:  Papers (10)  |  Patents (1)

    Clustered machines partition hardware resources to circumvent the cycle time penalties incurred by large, monolithic structures. This partitioning introduces a long inter-cluster forwarding latency and the potential for load imbalance, both of which degrade IPC and thus counter the cycle time benefits of clustering. We show that program dataflow can be mapped to clustered machines so as to achieve...
  • Incremental commit groups for non-atomic trace processing

    Publication Year: 2005
    Cited by:  Papers (2)  |  Patents (1)

    We introduce techniques to support efficient non-atomic execution of very long traces on a new binary translation based, x86-64 compatible VLIW microprocessor. Incrementally committed long traces significantly reduce wasted computations on exception induced rollbacks by retaining the correctly committed parts of traces. We divide each scheduled trace into multiple commit groups; groups are c...
  • Pinot: speculative multi-threading processor architecture exploiting parallelism over a wide range of granularities

    Publication Year: 2005
    Cited by:  Papers (16)  |  Patents (1)

    We propose a speculative multi-threading processor architecture called Pinot. Pinot exploits parallelism over a wide range of granularities without modifying program sources. Since exploitation of fine-grain parallelism suffers from limits of parallelism and overhead incurred by parallelization, it is better to extract coarse-grain parallelism. Coarse-grain parallelism is biased in some programs (...
  • Dynamic helper threaded prefetching on the Sun UltraSPARC® CMP processor

    Publication Year: 2005
    Cited by:  Papers (14)  |  Patents (2)

    Data prefetching via helper threading has been extensively investigated on simultaneous multi-threading (SMT) or virtual multi-threading (VMT) architectures. Although reportedly large cache latency can be hidden by helper threads at runtime, most techniques rely on hardware support to reduce context switch overhead between the main thread and helper thread as well as rely on static profile feedbac...
  • Automatic thread extraction with decoupled software pipelining

    Publication Year: 2005
    Cited by:  Papers (71)  |  Patents (10)

    Until recently, a steadily rising clock rate and other uniprocessor microarchitectural improvements could be relied upon to consistently deliver increasing performance for a wide range of applications. Current difficulties in maintaining this trend have led microprocessor manufacturers to add value by incorporating multiple processors on a chip. Unfortunately, since decades of compiler research ...
  • Exploiting vector parallelism in software pipelined loops

    Publication Year: 2005, Page(s):11 pp. - 129
    Cited by:  Papers (3)  |  Patents (3)

    An emerging trend in processor design is the addition of short vector instructions to general-purpose and embedded ISAs. Frequently, these extensions are employed using traditional vectorization technology first developed for supercomputers. In contrast, scalar hardware is typically targeted using ILP techniques such as software pipelining. This paper presents a novel approach for exploiting vecto...
  • Continuous path and edge profiling

    Publication Year: 2005, Page(s):11 pp. - 140
    Cited by:  Papers (5)  |  Patents (1)

    Microarchitectures increasingly rely on dynamic optimization to improve performance in ways that are difficult or impossible for ahead-of-time compilers. Dynamic optimizers in turn require continuous, portable, low-cost, and accurate control-flow profiles to inform their decisions, but prior approaches have struggled to meet these goals simultaneously. This paper presents PEP, a hybrid instrument...
  • Improving region selection in dynamic optimization systems

    Publication Year: 2005, Page(s):11 pp. - 154
    Cited by:  Papers (5)  |  Patents (4)

    The performance of a dynamic optimization system depends heavily on the code it selects to optimize. Many current systems follow the design of HP Dynamo and select a single interprocedural path, or trace, as the unit of code optimization and code caching. Though this approach to region selection has worked well in practice, we show that it is possible to adapt this basic approach to produce region...
  • The future evolution of high-performance microprocessors

    Publication Year: 2005
    Cited by:  Papers (6)  |  Patents (1)

    The evolution of high-performance microprocessors has reached several significant inflection points. First, the marginal utility of additional single-core complexity is now rapidly diminishing due to a number of factors. The increase in instructions per cycle from increases in sizes and numbers of functional units has plateaued. Meanwhile the increasing sizes of functional units and cores are havi...
  • Scalable store-load forwarding via store queue index prediction

    Publication Year: 2005
    Cited by:  Papers (7)  |  Patents (1)

    Conventional processors use a fully-associative store queue (SQ) to implement store-load forwarding. Associative search latency does not scale well to capacities and bandwidths required by wide-issue, large window processors. In this work, we improve SQ scalability by implementing store-load forwarding using speculative indexed access rather than associative search. Our design uses prediction to i...
  • Address-indexed memory disambiguation and store-to-load forwarding

    Publication Year: 2005, Page(s):12 pp. - 182
    Cited by:  Papers (12)  |  Patents (4)

    This paper describes a scalable, low-complexity alternative to the conventional load/store queue (LSQ) for superscalar processors that execute load and store instructions speculatively and out-of-order prior to resolving their dependences. Whereas the LSQ requires associative and age-prioritized searches for each access, we propose that an address-indexed store-forwarding cache (SFC) perform store...
  • Store memory-level parallelism optimizations for commercial applications

    Publication Year: 2005, Page(s):12 pp. - 196
    Cited by:  Papers (3)  |  Patents (3)

    This paper studies the impact of off-chip store misses on processor performance for modern commercial applications. The performance impact of off-chip store misses is largely determined by the extent of their overlap with other off-chip cache misses. The epoch MLP model is used to explain and quantify how these overlaps are affected by various store handling optimizations and by the memory consist...
  • A mechanism for online diagnosis of hard faults in microprocessors

    Publication Year: 2005
    Cited by:  Papers (31)  |  Patents (5)

    We develop a microprocessor design that tolerates hard faults, including fabrication defects and in-field faults, by leveraging existing microprocessor redundancy. To do this, we must: detect and correct errors, diagnose hard faults at the field deconfigurable unit (FDU) granularity, and deconfigure FDUs with hard faults. In our reliable microprocessor design, we use DIVA dynamic verification to d...
  • μComplexity: estimating processor design effort

    Publication Year: 2005, Page(s):10 pp. - 218
    Cited by:  Papers (6)

    Microprocessor design complexity is growing rapidly. As a result, current development costs for top-of-the-line processors are staggering, and are doubling every 4 years. As we design ever larger and more complex processors, it is becoming increasingly difficult to estimate how much time it will take to design and verify them. To compound this problem, processor design cost estimation still does n...