
Proceedings of the 26th Annual International Symposium on Microarchitecture, 1993

Date: 1-3 Dec. 1993


Displaying Results 1 - 25 of 28
  • Proceedings of 26th Annual International Symposium on Microarchitecture (Cat. No.93TH0602-3)

    Publication Year: 1993
  • Superblock formation using static program analysis

    Publication Year: 1993, Page(s): 247-255
    Cited by: Papers (20) | Patents (4)

    Compile-time code transformations which expose instruction-level parallelism (ILP) typically take into account the constraints imposed by all execution scenarios in the program. However, there are additional opportunities to increase ILP along some execution sequences if the constraints from alternative execution sequences can be ignored. Traditionally, profile information has been used to identify important execution sequences for aggressive compiler optimization and scheduling. The paper presents a set of static program analysis heuristics used in the IMPACT compiler to identify execution sequences for aggressive optimization. The authors show that the static program analysis heuristics identify execution sequences without hazardous conditions that tend to prohibit compiler optimizations. As a result, the static program analysis approach often achieves optimization results comparable to profile information in spite of its inferior branch prediction accuracy. This observation makes a strong case for using static program analysis, with or without profile information, to facilitate aggressive compiler optimization and scheduling.

  • Dynamically scheduled VLIW processors

    Publication Year: 1993, Page(s): 80-92
    Cited by: Papers (24)

    VLIW processors are viewed as an attractive way of achieving instruction-level parallelism because of their ability to issue multiple operations per cycle with relatively simple control logic. They are also perceived as being of limited interest as products because of the problem of object code compatibility across processors having different hardware latencies and varying levels of parallelism. The author introduces the concept of delayed split-issue and the dynamic scheduling hardware which, together, solve the compatibility problem for VLIW processors and, in fact, make it possible for such processors to use all of the interlocking and scoreboarding techniques that are known for superscalar processors.

  • An evaluation of bottom-up and top-down thread generation techniques

    Publication Year: 1993, Page(s): 118-127
    Cited by: Patents (2)

    Presents a model of coarse-grain dataflow execution. The authors present one top-down and two bottom-up methods for the generation of multithreaded code, and evaluate their effectiveness. The bottom-up techniques start from a fine-grain dataflow graph and coalesce it into coarse-grain clusters. The top-down technique generates clusters directly from the intermediate data dependence graph used for compiler optimizations. The authors discuss the relevant phases in the compilation process. They compare the effectiveness of the strategies by measuring the total number of clusters executed, the total number of instructions executed, cluster size, and the number of matches per cluster. It turns out that the top-down method generates more efficient code and larger clusters. However, the number of matches per cluster is larger for the top-down method, which could incur higher cluster synchronization costs.

  • The 16-fold way: a microparallel taxonomy

    Publication Year: 1993, Page(s): 60-69
    Cited by: Patents (2)

    Presents a novel microparallel taxonomy for machines with multiple-instruction processing capabilities, including VLIW, superscalar, and decoupled machines. The taxonomy is based upon the static or dynamic behavior of four abstract, operational stages that an instruction passes through. These stages are fetch, decode, execute, and retire. This two-valued, four-variable taxonomy results in sixteen ways that a processor's microarchitecture can be specified. The paper categorizes different machine instances that are either actual implementations or proposed systems within the taxonomy framework. Four new processor microarchitectures are postulated which provide additional features and are instances of the remaining unexplored microparallel classifications.

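The combinatorics behind this abstract are easy to make concrete: four stages, each either static or dynamic, give 2^4 = 16 classes. A minimal sketch of the enumeration only; the paper's class names and machine assignments are not reproduced here:

```python
from itertools import product

# The four abstract pipeline stages an instruction passes through,
# each of which may behave statically or dynamically.
STAGES = ("fetch", "decode", "execute", "retire")

def microparallel_classes():
    """Enumerate all 16 static/dynamic combinations of the four stages."""
    return [dict(zip(STAGES, combo))
            for combo in product(("static", "dynamic"), repeat=len(STAGES))]

classes = microparallel_classes()
assert len(classes) == 16  # the "16-fold way"
```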
  • Employing finite automata for resource scheduling

    Publication Year: 1993, Page(s): 12-20
    Cited by: Papers (8)

    Static instruction scheduling is an important optimization to exploit instruction-level parallelism. If the scheduler has to consider resource constraints to prevent structural hazards, the processor timing is usually simulated by overlaying binary matrices representing the resource usage of instructions. This technique is rather time-consuming. It is shown that the timing can be simulated by a deterministic finite automaton, and the matrix operations for a simulation step are replaced by two table lookups. A prototype implementation shows that about an eighteenfold speedup of the simulation can be expected. This performance gain can be used either to speed up existing scheduling algorithms or to use more complex algorithms to improve scheduling results.

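The replacement of matrix overlays by automaton transitions can be sketched as follows. The instruction classes, resource masks, and lazy table construction are illustrative assumptions, not the paper's actual construction (which builds the automaton offline):

```python
# State = tuple of resource-busy bitmasks for upcoming cycles; a
# transition checks for a structural hazard, ORs in the new usage,
# and advances one cycle. Entries are memoized so that a repeated
# simulation step is a single table lookup.

# Resource usage per instruction class: one bitmask per cycle after issue.
USAGE = {
    "alu": (0b01,),          # uses resource 0 for one cycle
    "mem": (0b10, 0b10),     # uses resource 1 for two cycles
}

transitions = {}  # (state, op) -> next state, or None on structural hazard

def step(state, op):
    """One simulation step; a table lookup once the entry is built."""
    key = (state, op)
    if key not in transitions:
        usage = USAGE[op]
        width = max(len(state), len(usage))
        masks = [state[i] if i < len(state) else 0 for i in range(width)]
        if any(masks[i] & usage[i] for i in range(len(usage))):
            transitions[key] = None              # collision: cannot issue now
        else:
            for i in range(len(usage)):
                masks[i] |= usage[i]
            transitions[key] = tuple(masks[1:])  # advance one cycle
    return transitions[key]

s = ()                        # empty pipeline
s = step(s, "mem")            # issue a memory op
assert step(s, "mem") is None # a second mem op collides next cycle
assert step(s, "alu") == ()   # an ALU op issues without conflict
```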
  • Measuring limits of parallelism and characterizing its vulnerability to resource constraints

    Publication Year: 1993, Page(s): 105-117
    Cited by: Papers (3)

    Addresses a two-fold question: whether there is enough parallelism in numeric and non-numeric workloads, such as the SPEC92 benchmark suite, under ideal conditions, disregarding any resource constraints; and, more importantly, whether a high ideal parallelism can be further characterized to assess its extractability with finite resources. The authors have designed and implemented an analysis tool that accepts as input a dynamic execution trace from an IBM RS/6000 environment, and outputs a parallelized instruction trace (schedule) that could be executed on an abstract machine with unlimited functional units and various constraints on the rest of its resources, namely registers, stack, and memory. They also analyze two different instruction scheduling policies: greedy and lazy. The paper further offers a characterization of ideal parallelism (obtainable on a machine with infinite resources) using a measure called slack to assess its sustainability with finite resources.

  • An extended classification of inter-instruction dependency and its application in automatic synthesis of pipelined processors

    Publication Year: 1993, Page(s): 236-246
    Cited by: Papers (2) | Patents (2)

    The conventional classification of inter-instruction dependencies (data, anti-, and output dependencies) provides a basic scheme for the analysis of pipeline hazards in pipelined instruction set processors. However, it does not consider the relative spatial positions of micro-operations in the pipeline, thus providing limited hints to hardware designers and compiler writers about hazard resolution in generalized pipeline structures. The authors propose an extension to the conventional classification of dependencies which is capable of encapsulating the spatial/temporal relationship and providing precise hardware/software resolution strategies. With the extended classification and its associated hardware/software resolution strategies, they are able to systematically analyze the potential register-related pipeline hazards for a given pipeline structure, determine appropriate resolution strategies, and explore the tradeoff between hardware and software complexities. The methodology enables the systematic synthesis of high-performance pipelined microarchitectures, and is useful for deriving the back-end of the supporting compilers.

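The conventional three-way classification this abstract starts from can be sketched directly; the instruction encoding here is an invented simplification, and the paper's spatial/temporal extension is not modeled:

```python
def classify(first, second):
    """Classify register dependences of `second` on `first` using the
    conventional scheme: data (RAW), anti (WAR), and output (WAW).
    Instructions are (dest_regs, src_regs) pairs of sets."""
    deps = set()
    if first[0] & second[1]:
        deps.add("data")    # read-after-write
    if first[1] & second[0]:
        deps.add("anti")    # write-after-read
    if first[0] & second[0]:
        deps.add("output")  # write-after-write
    return deps

# r1 = r2 + r3  followed by  r2 = r1 + r4
i, j = ({"r1"}, {"r2", "r3"}), ({"r2"}, {"r1", "r4"})
assert classify(i, j) == {"data", "anti"}
```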
  • A study on the number of memory ports in multiple instruction issue machines

    Publication Year: 1993, Page(s): 49-58
    Cited by: Papers (9) | Patents (1)

    One of the key design concerns of multiple instruction issue (MII) processors is deciding how many memory ports need to be provided, considering the performance and efficiency of the target processor. For an MII processor that exploits instruction-level parallelism (ILP) in non-numerical code, this decision is difficult to make due to the code's irregularity. The authors perform an empirical study aimed at characterizing a suitable MII organization that best exploits irregular ILP. The study is based on the selective scheduling compiler that performs precise memory disambiguation for concurrent execution of multiple memory operations, along with renaming, speculation, and software pipelining. The results indicate that a small number of memory ports (i.e., less than half of the issue rate) is enough for exploiting most of the irregular ILP. The authors also examine related issues such as the utilization of memory ports and additional data cache misses caused by speculative loads.

  • Efficient scheduling of fine grain parallelism in loops

    Publication Year: 1993, Page(s): 2-11
    Cited by: Papers (7)

    This paper presents a new technique for software pipelining using Petri nets. Our technique, called the Petri Net Pacemaker (PNP), can create near-optimal pipelines with less algorithmic effort than other techniques. The pacemaker is a novel idea which exploits the behavior of Petri nets to model the problem of scheduling the operations of a loop body for software pipelining.

  • Instruction scheduling for the Motorola 88110

    Publication Year: 1993, Page(s): 257-262
    Cited by: Papers (2) | Patents (26)

    The Motorola 88110 is an advanced superscalar design. The processor can issue up to two instructions per cycle among ten functional units, and it includes sophisticated load-store, speculative execution, exception recovery, and branch target buffer facilities. This paper examines several computationally inexpensive instruction scheduling strategies for a post-processor code optimizer for the 88110, including basic block scheduling using reservation tables for writeback buses as well as functional units, delayed branch removal, loop alignment, and special loop entry scheduling. For a set of 32 loop-intensive benchmarks, a combination of delayed branch removal and loop alignment yields the best code improvement.

  • Speculative execution exception recovery using write-back suppression

    Publication Year: 1993, Page(s): 214-223
    Cited by: Papers (7) | Patents (10)

    Compiler-controlled speculative execution has been shown to be effective in increasing the available instruction-level parallelism (ILP) found in non-numeric programs. An important problem associated with compiler-controlled speculative execution is to accurately report and handle exceptions caused by speculatively executed instructions. Previous solutions to this problem incur either excessive hardware overhead or significant register pressure. The paper introduces a new architectural scheme referred to as write-back suppression. This scheme systematically suppresses register file updates for subsequent speculative instructions after an exception condition is detected for a speculatively executed instruction. The authors show that with a modest amount of hardware, write-back suppression supports accurate reporting and handling of exceptions for compiler-controlled speculative execution with minimal additional register pressure. Experiments based on a prototype compiler implementation and hardware simulation indicate that ensuring accurate handling of exceptions with write-back suppression incurs little run-time performance overhead.

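The suppression idea can be illustrated with a toy execution model; the instruction format and the policy of suppressing all subsequent speculative writes are simplifying assumptions for the sketch, not the paper's hardware design:

```python
# Once a speculative instruction records an exception, register-file
# writes of subsequent speculative instructions are suppressed, so the
# excepting code can later be re-executed non-speculatively with the
# register state it expects. Details are deliberately simplified.

def execute(instrs):
    regs, suppressing = {}, False
    for dest, value, speculative, raises_exception in instrs:
        if speculative and raises_exception:
            suppressing = True   # remember the exception; no trap yet
        if suppressing and speculative:
            continue             # suppress this write-back
        regs[dest] = value
    return regs

regs = execute([
    ("r1", 10, True, False),   # speculative, completes normally
    ("r2", 0, True, True),     # speculative, would fault
    ("r3", 99, True, False),   # suppressed: follows the exception
    ("r4", 7, False, False),   # non-speculative, still written
])
assert regs == {"r1": 10, "r4": 7}
```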
  • A comparison of superscalar and decoupled access/execute architectures

    Publication Year: 1993, Page(s): 100-103
    Cited by: Papers (2) | Patents (3)

    Presents a comparison of superscalar and decoupled access/execute architectures. Both architectures attempt to exploit instruction-level parallelism by issuing multiple instructions per cycle, employing dynamic scheduling to maximize performance. Simulation results are presented for four different configurations, demonstrating that the architectural queues of the decoupled machines provide similar functionality to the register renaming, dynamic loop unrolling, and out-of-order execution of the superscalar machines with significantly less complexity.

  • A microarchitectural performance evaluation of a 3.2 Gbyte/s microprocessor bus

    Publication Year: 1993, Page(s): 31-40
    Cited by: Papers (1)

    Several architectural innovations intended to reduce access latency and improve overall throughput increase system bandwidth requirements. Bandwidth scales with clock speed, and can be regarded as an architectural resource to be applied to latency reduction. A properly designed bus provides low arbitration latency and delivers high sustained bandwidth. The paper evaluates the performance of a 3.2 Gbyte/s peak-bandwidth, low-latency arbitration bus connecting a GaAs superscalar CPU to a GaAs memory management unit. A microarchitectural performance model was written in the Verilog hardware description language. Bus transactions characteristic of the SPECint92 benchmarks and other workloads were generated as input. Sustained bandwidths of 1.68 Gbyte/s were achieved with arbitration costs of less than 0.5 cycles per data transfer.

  • An analysis of dynamic scheduling techniques for symbolic applications

    Publication Year: 1993, Page(s): 185-191

    Instruction-level parallelism in a single stream of code for non-numerical applications has been the subject of much recent research. This work extends the analysis to symbolic applications described with logic programming. In particular, the authors analyze the effects on performance of speculative execution, memory alias disambiguation, renaming, and flow prediction. The obtained results indicate that one can reach a sustained parallelism of 4 (comparable with imperative languages) with the proper optimizations. The authors also show a comparison between statically and dynamically scheduled approaches, outlining the conditions under which a dynamic solution can achieve substantial improvements over a static one. In this way, they point out some important optimizations and parameters of a dynamic scheduling approach, indicating a guideline for future architectural implementations.

  • EXPLORER: a retargetable and visualization-based trace-driven simulator for superscalar processors

    Publication Year: 1993, Page(s): 225-235
    Cited by: Papers (6) | Patents (1)

    The quantitative approach to computer architecture and processor design requires extensive experimentation to assess the potential performance benefits of individual techniques. Using trace-driven simulation (TDS) tools in conjunction with optimizing compilers, a designer can quickly characterize the dynamic behavior and the resultant performance of a candidate machine running a realistic benchmark. Most current TDS tools have two shortcomings, namely the lack of retargetability and the lack of visualization support. The authors present the development of a highly retargetable TDS tool suite, called EXPLORER, that incorporates powerful visualization and interactive capabilities. A prototype of EXPLORER has been implemented, and examples of retargeting and visualization-based simulation of the RS/6000 and the MPC 601 have been performed to demonstrate its effectiveness.

  • Two-ported cache alternatives for superscalar processors

    Publication Year: 1993, Page(s): 41-48
    Cited by: Papers (4) | Patents (1)

    Superscalar implementations of RISC architectures are emerging as the dominant high-performance microprocessor technology for the mid-1990s. For instruction-level parallelism to increase beyond present levels, multiple memory operations per cycle are required. The paper evaluates several alternatives for two-ported data cache memory systems. A new split data cache memory design is compared to a more conventional true dual-ported memory. Experimental simulations are used to determine the performance benefits of these cache models on superscalar processors. These experiments are reported for a contemporary processor with modest instruction-level parallelism and for a hypothetical, very aggressive, highly parallel processor.

  • Techniques for extracting instruction level parallelism on MIMD architectures

    Publication Year: 1993, Page(s): 128-137
    Cited by: Patents (3)

    Extensive research has been done on extracting parallelism from single instruction stream processors. The authors present some results of an investigation into ways to modify MIMD architectures to allow them to extract the instruction-level parallelism achieved by current superscalar and VLIW machines. A new architecture is proposed which utilizes the advantages of a multiple instruction stream design while addressing some of the limitations that have prevented MIMD architectures from exploiting ILP. A new code scheduling mechanism is described to support this new architecture by partitioning instructions across multiple processing elements in order to exploit this level of parallelism.

  • A comparative performance evaluation of various state maintenance mechanisms

    Publication Year: 1993, Page(s): 70-79
    Cited by: Papers (1) | Patents (4)

    Speculative execution and dynamic scheduling are two promising techniques for achieving high performance in superscalar processors. These techniques require a mechanism for maintaining all architecturally visible machine state. The authors examine the performance implications of three common state maintenance mechanisms: the reorder buffer, the history buffer, and checkpointing. They model the execution of the four integer benchmarks from the SPEC89 suite for a variety of maintenance techniques. They report the results of these measurements and their implications with respect to the design of high-performance superscalar processors.

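Of the three mechanisms compared, the reorder buffer is the simplest to sketch: results may complete out of order, but architectural state is updated strictly in program order. A minimal illustrative model, not the authors' simulator:

```python
from collections import deque

class ReorderBuffer:
    """Toy reorder buffer: in-order retirement of out-of-order results."""

    def __init__(self):
        self.entries = deque()          # [dest, value] in program order

    def dispatch(self, dest):
        entry = [dest, None]            # None = not yet completed
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        entry[1] = value                # out-of-order completion

    def retire(self, arch_regs):
        # Drain only completed entries from the head, in program order,
        # so architectural state is precise at every boundary.
        while self.entries and self.entries[0][1] is not None:
            dest, value = self.entries.popleft()
            arch_regs[dest] = value

regs = {}
rob = ReorderBuffer()
a = rob.dispatch("r1")
b = rob.dispatch("r2")
rob.complete(b, 2)          # r2 finishes first...
rob.retire(regs)
assert regs == {}           # ...but cannot retire past the incomplete r1
rob.complete(a, 1)
rob.retire(regs)
assert regs == {"r1": 1, "r2": 2}
```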
  • A VLIW architecture based on shifting register files

    Publication Year: 1993, Page(s): 263-268
    Cited by: Papers (3) | Patents (1)

    Anti-dependencies are a major cause of bottlenecks in software pipelining. Anti-dependencies can be removed in software by code duplication, as in the technique of modulo variable expansion. However, it is also possible to eliminate anti-dependencies in hardware. So far, two such VLIW architectures have been proposed: the polycyclic and URPR-1 architectures. These architectures also introduce attractive solutions for the memory contention problem in general. We, on the other hand, propose a more cost-effective architecture for the same purpose: SRFA, which is based on “shifting register files”.

  • Register renaming and dynamic speculation: an alternative approach

    Publication Year: 1993, Page(s): 202-213
    Cited by: Papers (41) | Patents (23)

    Presents a novel mechanism that implements register renaming, dynamic speculation, and precise interrupts. Renaming of registers is performed during the instruction fetch stage instead of the decode stage, and the mechanism is designed to operate in parallel with the tag match logic used by most cache designs. It is estimated that the critical path of the mechanism requires approximately the same number of logic levels as the tag match logic, and therefore should not impact cycle time.

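Register renaming itself, independent of the fetch-stage/tag-match implementation the paper proposes, can be sketched with a map table and a free list of physical registers (the names and register counts below are invented for illustration):

```python
def rename(instrs, n_phys=8):
    """Rename architectural destinations to fresh physical registers,
    removing anti- and output dependences. Instructions are
    (dest, [srcs]) pairs; no free-list reclamation is modeled."""
    table = {}                          # architectural -> physical
    free = [f"p{i}" for i in range(n_phys)]
    renamed = []
    for dest, srcs in instrs:
        phys_srcs = [table.get(s, s) for s in srcs]  # read current mapping
        phys_dest = free.pop(0)                      # allocate fresh register
        table[dest] = phys_dest
        renamed.append((phys_dest, phys_srcs))
    return renamed

# r1 = r2+r3 ; r1 = r1+r4  ->  the output dependence on r1 disappears
out = rename([("r1", ["r2", "r3"]), ("r1", ["r1", "r4"])])
assert out == [("p0", ["r2", "r3"]), ("p1", ["p0", "r4"])]
```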
  • MIDEE: smoothing branch and instruction cache miss penalties on deep pipelines

    Publication Year: 1993, Page(s): 193-201
    Cited by: Patents (1)

    Pipelining is a major technique used in high-performance processors, but its effectiveness is reduced by branch instructions. A new organization for implementing branch instructions is presented: the Multiple Instruction Decode Effective Execution (MIDEE) organization. All pipeline depths may be addressed using this organization. MIDEE is based on the use of double fetch and decode, early computation of the target address for branch instructions, and two instruction queues. The double fetch-decode concerns a pair of instructions stored at consecutive addresses. These instructions are decoded simultaneously, but no execution hardware is duplicated; only useful instructions are effectively executed. A pair of instruction queues is used between the fetch-decode stages and the execution stages; this hides the branch penalty and most of the instruction cache miss penalty. Trace-driven simulations show that the performance of deeply pipelined processors may be dramatically improved when the MIDEE organization is implemented: the branch penalty is reduced, and the pipeline stall delay due to instruction cache misses is also decreased.

  • Prophetic branches: a branch architecture for code compaction and efficient execution

    Publication Year: 1993, Page(s): 94-99
    Cited by: Papers (2) | Patents (3)

    Deeply pipelined processors increase the cost of executing conditional branches. Several branch architectures based on both hardware and software techniques have been proposed to reduce this cost. A popular branch mechanism based on software techniques is static branch prediction with delay slot annulling. This mechanism reduces the cost of conditional branches by making delay slots visible in the architecture. Architectural visibility allows the software to exploit delay slots by executing instructions speculatively. The visibility of the delay slots, however, also results in an increase in code size; compilers must find appropriate instructions which can be scheduled into the delay slots. If no “useful” instructions can be found, then nops must be inserted in the delay slot. The authors propose a novel branch architecture called prophetic branches which allows compilers to exploit branch delays, yet could result in only a minimal increase in code size over non-pipelined code. They show that this branch mechanism can be implemented in deeply pipelined processors with only a minor change in the control logic.

  • Predictability of load/store instruction latencies

    Publication Year: 1993, Page(s): 139-152
    Cited by: Papers (33) | Patents (9)

    Due to increasing cache-miss latencies, cache control instructions are being implemented for future systems. The authors study the memory referencing behavior of individual machine-level instructions using simulations of fully-associative caches under MIN replacement. Their objective is to obtain a deeper understanding of useful program behavior that can eventually be employed in optimizing programs, and to motivate architectural features aimed at improving the efficacy of memory hierarchies. The simulation results show that a very small number of load/store instructions account for a majority of data cache misses. Specifically, fewer than 10 instructions account for half the misses for six out of nine SPEC89 benchmarks. Selectively prefetching data referenced by a small number of instructions identified through profiling can reduce the overall miss ratio significantly while incurring only a small number of unnecessary prefetches.

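The profiling-guided selection this abstract suggests can be sketched as follows; the trace and instruction labels are invented, and a real profile would key on static instruction addresses:

```python
from collections import Counter

def prefetch_candidates(miss_trace, coverage=0.5):
    """Pick the smallest set of static load/store instructions (by PC)
    whose misses cover the requested fraction of all misses."""
    counts = Counter(miss_trace)          # misses per static instruction
    needed = coverage * len(miss_trace)
    picked, covered = [], 0
    for pc, n in counts.most_common():    # greedily take the worst offenders
        if covered >= needed:
            break
        picked.append(pc)
        covered += n
    return picked

# One load dominates the miss stream, mirroring the paper's observation
# that a handful of instructions cause most data-cache misses.
trace = ["ld_a"] * 60 + ["ld_b"] * 25 + ["st_c"] * 10 + ["ld_d"] * 5
assert prefetch_candidates(trace) == ["ld_a"]
```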
  • GPMB-software pipelining branch-intensive loops

    Publication Year: 1993, Page(s): 21-29
    Cited by: Papers (7) | Patents (6)

    To achieve higher instruction-level parallelism, the constraint imposed by a single control flow must be relaxed. Control operations should execute in parallel just like data operations. We present a new software pipelining method called GPMB (Global Pipelining with Multiple Branches), which is based on architectures supporting multi-way branching and multiple control flows. Preliminary experimental results show that, for IF-less loops, GPMB performs as well as modulo scheduling, and for branch-intensive loops, GPMB performs much better than software pipelining assuming the constraint of one two-way branch per cycle.