High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings. The Ninth International Symposium on

Date 12-12 Feb. 2003

  • Proceedings the Ninth International Symposium on High-Performance Computer Architecture. HPCA-9 2003

    The following topics are dealt with: multi-threading; branch prediction; power efficient designs; superscalars; multiprocessor systems; memory and communication performance; profiling and simulation support; caching and prefetching; and networks and communication.
  • Variability in architectural simulations of multi-threaded workloads

    Page(s): 7 - 18

    Multi-threaded commercial workloads implement many important Internet services. Consequently, these workloads are increasingly used to evaluate the performance of uniprocessor and multiprocessor system designs. This paper identifies performance variability as a potentially major challenge for architectural simulation studies using these workloads. Variability refers to the differences between multiple estimates of a workload's performance. Time variability occurs when a workload exhibits different characteristics during different phases of a single run. Space variability occurs when small variations in timing cause runs starting from the same initial condition to follow widely different execution paths. Variability is a well-known phenomenon in real systems, but is nearly universally ignored in simulation experiments. In a central result of this paper, we show that variability in multi-threaded commercial workloads can lead to incorrect architectural conclusions (e.g., 31% of the time in one experiment). We propose a methodology, based on multiple simulations and standard statistical techniques, to compensate for variability. Our methodology greatly reduces the probability of reaching incorrect conclusions, while enabling simulations to finish within reasonable time limits.
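    The abstract above proposes multiple simulations plus standard statistical techniques. A minimal sketch of one such technique follows, assuming a normal-approximation confidence interval; the function names and the sample values are hypothetical and are not taken from the paper. For very small sample counts a Student t interval would be more appropriate.

```python
import statistics

def mean_ci(samples, confidence=0.95):
    """Return (mean, half_width) of a normal-approximation confidence interval."""
    mean = statistics.mean(samples)
    std_err = statistics.stdev(samples) / len(samples) ** 0.5
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, z * std_err

def intervals_overlap(runs_a, runs_b, confidence=0.95):
    """True if the two confidence intervals overlap, i.e. no clear winner."""
    mean_a, half_a = mean_ci(runs_a, confidence)
    mean_b, half_b = mean_ci(runs_b, confidence)
    return abs(mean_a - mean_b) <= half_a + half_b

# Hypothetical per-run performance estimates from five perturbed runs each.
baseline = [100, 102, 98, 101, 99]
enhanced = [110, 111, 109, 112, 108]
```

    Comparing two configurations only when their intervals do not overlap is a conservative way to avoid drawing a conclusion from a single, possibly unrepresentative, run.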
  • Author index

    Page(s): 389 - 390
    Freely Available from IEEE
  • Memory system behavior of Java-based middleware

    Page(s): 217 - 228

    In this paper, we present a detailed characterization of the memory system behavior of ECperf and SPECjbb using both commercial server hardware and Simics full-system simulation. We find that the memory footprint and primary working sets of these workloads are small compared to other commercial workloads (e.g. on-line transaction processing), and that a large fraction of the working sets are shared between processors. We observed two key differences between ECperf and SPECjbb that highlight the importance of isolating the behavior of the middle tier. First, ECperf has a larger instruction footprint, resulting in much higher miss rates for intermediate-size instruction caches. Second, SPECjbb's data set size increases linearly as the benchmark scales up, while ECperf's remains roughly constant. This difference can lead to opposite conclusions on the design of multiprocessor memory systems, such as the utility of moderate-sized (i.e. 1 MB) shared caches in a chip multiprocessor.
  • Dynamic voltage scaling with links for power optimization of interconnection networks

    Page(s): 91 - 102

    Interconnection networks were originally developed to connect processors and memories in multicomputers, and prior research and design have focused largely on performance. As these networks get deployed in a wide range of new applications, where power is becoming a key design constraint, we need to seriously consider power efficiency in designing interconnection networks. As the demand for network bandwidth increases, communication links, already significant consumers of power, will take up an ever larger portion of the total system power budget. In this paper we motivate the use of dynamic voltage scaling (DVS) for links, where the frequency and voltage of links are dynamically adjusted to minimize power consumption. We propose a history-based DVS policy that judiciously adjusts link frequencies and voltages based on past utilization. Our approach realizes up to 6.3× power savings (4.6× on average). This is accompanied by a moderate impact on performance (15.2% increase in average latency before network saturation and 2.5% reduction in throughput). To the best of our knowledge, this is the first study that targets dynamic power optimization of interconnection networks.
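    To make the idea of a history-based policy concrete, here is a minimal sketch under stated assumptions: utilization is smoothed with a weighted average, and the link steps one frequency/voltage level up or down when the prediction crosses a threshold. The thresholds, weight, and number of levels are hypothetical and are not the paper's parameters.

```python
def next_level(level, history_util, current_util,
               num_levels=10, low=0.3, high=0.6, weight=0.5):
    """Pick the next frequency/voltage level for one link.

    level         current level index (0 = slowest, num_levels - 1 = fastest)
    history_util  smoothed utilization from earlier windows, in [0, 1]
    current_util  utilization measured in the latest window, in [0, 1]
    Returns (new_level, predicted_util); predicted_util becomes the history
    value for the next window.
    """
    predicted = weight * current_util + (1 - weight) * history_util
    if predicted > high and level < num_levels - 1:
        level += 1   # link is busy: speed up to protect latency
    elif predicted < low and level > 0:
        level -= 1   # link is mostly idle: slow down to save power
    return level, predicted
```

    Stepping one level at a time keeps the sketch simple; a real link would also have to account for the time and energy cost of each voltage transition.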
  • Deterministic clock gating for microprocessor power reduction

    Page(s): 113 - 122

    With the scaling of technology and the need for higher performance and more functionality, power dissipation is becoming a major bottleneck for microprocessor designs. Pipeline balancing (PLB), a previous technique, is essentially a methodology to clock-gate unused components whenever a program's instruction-level parallelism is predicted to be low. However, no nonpredictive methodologies are available in the literature for efficient clock gating. This paper introduces deterministic clock gating (DCG) based on the key observation that for many of the stages in a modern pipeline, a circuit block's usage in a specific cycle in the near future is deterministically known a few cycles ahead of time. Our experiments show an average of 19.9% reduction in processor power with virtually no performance loss for an 8-issue, out-of-order superscalar processor by applying DCG to execution units, pipeline latches, D-Cache wordline decoders, and result bus drivers. In contrast, PLB achieves 9.9% average power savings at 2.9% performance loss.
  • Caches and hash trees for efficient memory integrity verification

    Page(s): 295 - 306

    We study the hardware cost of implementing hash-tree based verification of untrusted external memory by a high performance processor. This verification could enable applications such as certified program execution. A number of schemes are presented with different levels of integration between the on-processor L2 cache and the hash-tree machinery. Simulations show that for the best of our methods, the performance overhead is less than 25%, a significant decrease from the 10× overhead of a naive implementation.
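    The basic hash-tree check that the abstract builds on can be sketched in software as follows. This is an illustration only, not the paper's hardware scheme: SHA-256, the power-of-two block count, and all names are assumptions made for the example. Only the root is trusted; the block and the sibling hashes are read from untrusted memory.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(blocks):
    """Return the tree as a list of levels, leaves first.

    Assumes the number of blocks is a power of two.
    """
    level = [h(block) for block in blocks]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def verify_block(block, index, levels, trusted_root):
    """Recompute the path from one block up to the root and compare."""
    digest = h(block)
    for level in levels[:-1]:
        sibling = level[index ^ 1]          # untrusted sibling hash
        if index % 2 == 0:
            digest = h(digest + sibling)
        else:
            digest = h(sibling + digest)
        index //= 2
    return digest == trusted_root
```

    Each verification touches one hash per tree level, which is why caching intermediate hashes on chip, as the abstract describes, can cut the overhead substantially.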
  • Active I/O switches in system area networks

    Page(s): 365 - 376

    We present an active switch architecture to improve the performance of systems connected via system area networks. Our programmable active switches not only flexibly route packets between any combination of hosts and I/O devices, but also have the capability of running application-level code, forming a parallel processor in the SAN subsystem. By replacing existing SAN-based switches with a new active switch architecture, we can design a prototype system with otherwise commercially available, commodity parts that can dramatically speed up data-intensive applications and workloads on modern multi-programmed servers. We explain the programming model and detail the microarchitecture of our active switch, and analyze simulation results for nine benchmark applications that highlight various advantages of active switch-based systems.
  • Reconsidering complex branch predictors

    Page(s): 43 - 52

    To sustain instruction throughput rates in more aggressively clocked microarchitectures, microarchitects have incorporated larger and more complex branch predictors into their designs, taking advantage of the increasing numbers of transistors available on a chip. Unfortunately, because of penalties associated with their implementations, the extra accuracy provided by many branch predictors does not produce a proportionate increase in performance. Specifically, we show that the techniques used to hide the latency of a large and complex branch predictor do not scale well and will be unable to sustain IPC for deeper pipelines. We investigate a different way to build large branch predictors. We propose an alternative predictor design that completely hides predictor latency so that accuracy and hardware budget are the only factors that affect the efficiency of the predictor. Our simple design allows the predictor to be pipelined efficiently by avoiding difficulties introduced by complex predictors. Because this predictor eliminates the penalties associated with complex predictors, overall performance exceeds that of even the most accurate known branch predictors in the literature at large hardware budgets. We conclude that as chip densities increase in the next several years, the accuracy of complex branch predictors must be weighed against the performance benefits of simple branch predictors.
  • Scalar operand networks: on-chip interconnect for ILP in partitioned architectures

    Page(s): 341 - 353

    The bypass paths and multiported register files in microprocessors serve as an implicit interconnect to communicate operand values among pipeline stages and multiple ALUs. Previous superscalar designs implemented this interconnect using centralized structures that do not scale with increasing ILP demands. In search of scalability, recent microprocessor designs in industry and academia exhibit a trend towards distributed resources such as partitioned register files, banked caches, multiple independent compute pipelines, and even multiple program counters. Some of these partitioned microprocessor designs have begun to implement bypassing and operand transport using point-to-point interconnects rather than centralized networks. We call interconnects optimized for scalar data transport, whether centralized or distributed, scalar operand networks. Although these networks share many of the challenges of multiprocessor networks such as scalability and deadlock avoidance, they have many unique requirements, including ultra-low latencies (a few cycles versus tens of cycles) and ultra-fast operation-operand matching. This paper discusses the unique properties of scalar operand networks, examines alternative ways of implementing them, and describes in detail the implementation of one such network in the Raw microprocessor. The paper analyzes the performance of these networks for ILP workloads and the sensitivity of overall ILP performance to network properties.
  • Power-aware control speculation through selective throttling

    Page(s): 103 - 112

    With the constant advances in technology that lead to increasing transistor counts and processor frequencies, power dissipation is becoming one of the major issues in high-performance processors. These processors increase their clock frequency by lengthening the pipeline, which puts more pressure on the branch prediction engine since branches take longer to be resolved. Branch mispredictions are responsible for around 28% of the power dissipated by a typical processor due to the useless activities performed by instructions that are squashed. This work focuses on reducing the power dissipated by mis-speculated instructions. We propose selective throttling as an effective way of triggering different power-aware techniques (fetch throttling, decode throttling or disabling the selection logic). The particular set of techniques applied to each branch is dynamically chosen depending on the branch prediction confidence level. For branches with low prediction confidence, the most aggressive throttling mechanism is used, whereas high-confidence branch predictions trigger the least aggressive techniques. Results show that combining fetch bandwidth reduction along with select logic disabling provides the best performance both in terms of energy reduction and energy-delay improvement (14% and 9% respectively for 14 stages, and 17% and 12% respectively for 28 stages).
  • Inter-cluster communication models for clustered VLIW processors

    Page(s): 354 - 364

    Clustering is a well-known technique to improve the implementation of single register file VLIW processors. Many previous studies in clustering assume inter-cluster communication by means of copy operations. This paper, however, identifies and evaluates five different inter-cluster communication models, including copy operations, dedicated issue slots, extended operands, extended results, and broadcasting. Our study reveals that these models have a major impact on performance and implementation of the clustered VLIW. We found that copy operations executed in regular VLIW issue slots significantly constrain the scheduling freedom of regular operations. For example, in the dense code for our four-cluster machine the total cycle count overhead reached 46.8% with respect to the unicluster architecture, 56% of which is caused by the copy operation constraint. Therefore, we propose to use other models (e.g. extended results or broadcasting), which deliver higher performance than the copy operation model at the same hardware cost.
  • TCP: tag correlating prefetchers

    Page(s): 317 - 326

    Although caches for decades have been the backbone of the memory system, the speed gap between CPU and main memory suggests their augmentation with prefetching mechanisms. Recently, sophisticated hardware correlating prefetching mechanisms have been proposed, in some cases coupled with some form of dead-block prediction. In many proposals, however, correlating prefetchers demand a significant investment in hardware. In this paper we show that correlating prefetchers that work with tags instead of cache-line addresses are significantly more resource-efficient, providing equal or better performance than previous proposals. We support this claim by showing that per-set tag sequences exhibit highly repetitive patterns both within a set and across different sets. Because a single tag sequence can capture multiple address sequences spread over different cache sets, significant space savings can be achieved. We propose a tag-based prefetcher called a tag correlating prefetcher (TCP). Even with very small history tables, TCP outperforms address-based correlating prefetchers many times larger. In addition, we show that such a prefetcher can yield most of its performance benefits if placed at the L2 level of an aggressive out-of-order processor. Only if one wants prefetching all the way up to L1 is dead-block prediction required. Finally, we draw parallels between the two-level structure of TCP and similar structures for branch prediction mechanisms; these parallels raise interesting opportunities for improving correlating memory prefetchers by harnessing lessons already learned for correlating branch predictors.
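    The two-level, tag-based structure described above can be sketched roughly as follows. This is a simplified software model under assumptions of my own: a per-set history of the last two miss tags, and a single pattern table shared by all sets that maps a tag sequence to the tag that followed it. The class and method names are hypothetical, and table sizing and replacement are ignored.

```python
class TagCorrelatingPrefetcher:
    def __init__(self, num_sets, history_len=2):
        self.history_len = history_len
        # First level: recent miss tags, kept separately for each cache set.
        self.history = [[] for _ in range(num_sets)]
        # Second level: tag sequence -> next tag, shared across all sets.
        self.pattern = {}

    def on_miss(self, set_index, tag):
        """Record a miss; return (predicted_tag, set_index) or None."""
        hist = self.history[set_index]
        if len(hist) == self.history_len:
            self.pattern[tuple(hist)] = tag      # learn what followed this sequence
        hist.append(tag)
        if len(hist) > self.history_len:
            hist.pop(0)
        if len(hist) < self.history_len:
            return None
        predicted = self.pattern.get(tuple(hist))
        return None if predicted is None else (predicted, set_index)
```

    Because the pattern table is keyed by tags alone, a sequence learned in one set can trigger a prefetch in another set, which is the source of the space savings the abstract mentions.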
  • Exploring the VLSI scalability of stream processors

    Page(s): 153 - 164

    Stream processors are high-performance programmable processors optimized to run media applications. Recent work has shown these processors to be more area- and energy-efficient than conventional programmable architectures. This paper explores the scalability of stream architectures to future VLSI technologies where over a thousand floating-point units on a single chip will be feasible. Two techniques for increasing the number of ALUs in a stream processor are presented: intracluster and intercluster scaling. These scaling techniques are shown to be cost-efficient to tens of ALUs per cluster and to hundreds of arithmetic clusters. A 640-ALU stream processor with 128 clusters and 5 ALUs per cluster is shown to be feasible in 45-nanometer technology, sustaining over 300 GOPS on kernels and providing 15.3× of kernel speedup and 8.0× of application speedup over a 40-ALU stream processor with a 2% degradation in area per ALU and a 7% degradation in energy dissipated per ALU operation.
  • A methodology for designing efficient on-chip interconnects on well-behaved communication patterns

    Page(s): 377 - 388

    As the level of chip integration continues to advance at a fast pace, the desire for efficient interconnects - whether on-chip or off-chip - is rapidly increasing. Traditional interconnects like buses, point-to-point wires and regular topologies may suffer from poor resource sharing in the time and space domains, leading to high contention or low resource utilization. In this paper, we propose a design methodology for constructing networks for special-purpose computer systems with well-behaved (known) communication characteristics. A temporal and spatial model is proposed to define the sufficient condition for contention-free communication. Based upon this model, a design methodology using a recursive bisection technique is applied to systematically partition a parallel system such that the required number of links and switches is minimized while achieving low contention. Results show that the design methodology can generate more optimized on-chip networks with up to 60% fewer resources than meshes or tori while providing blocking performance closer to that of a fully connected crossbar.
  • Catching accurate profiles in hardware

    Page(s): 269 - 280

    Run-time optimization is one of the most important ways of getting performance out of modern processors. Techniques such as prefetching, trace caching, and memory disambiguation are all based upon the principle of observation followed by adaptation, and all make use of some sort of profile information gathered at run-time. Programs are very complex, and the real trick in generating useful run-time profiles is sifting through all the unimportant and infrequently occurring events to find those that are important enough to warrant optimization. In this paper, we present the multi-hash architecture to catch important events even in the presence of extensive noise. Multi-hash uses a small amount of area, between 7 and 16 kilobytes, to accurately capture these important events in hardware, without requiring any software support. This is achieved using multiple hash tables for the filtering, and interval-based profiling to help identify how important an event is in relationship to all the other events. We evaluate our design for value and edge profiling, and show that over a set of benchmarks, we get an average error of less than 1%.
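    A rough software analogue of filtering with multiple hash tables and interval-based profiling is sketched below. The table count, table size, threshold, and class name are assumptions for illustration and do not reflect the paper's hardware parameters. An event is reported as hot only when its counter in every table reaches the threshold, so an infrequent event that collides with a hot one in a single table is unlikely to be promoted.

```python
class MultiHashProfiler:
    def __init__(self, num_tables=4, table_size=64, threshold=8):
        self.table_size = table_size
        self.threshold = threshold
        self.tables = [[0] * table_size for _ in range(num_tables)]
        self.hot = set()

    def _index(self, table_id, event):
        # Mixing in table_id gives each table a different hash function.
        return hash((table_id, event)) % self.table_size

    def record(self, event):
        """Count one occurrence of an integer event id (e.g. a branch PC)."""
        counts = []
        for table_id, table in enumerate(self.tables):
            slot = self._index(table_id, event)
            table[slot] += 1
            counts.append(table[slot])
        if min(counts) >= self.threshold:
            self.hot.add(event)

    def end_interval(self):
        """Return the events found hot in this interval and reset all counters."""
        hot, self.hot = self.hot, set()
        for table in self.tables:
            for slot in range(len(table)):
                table[slot] = 0
        return hot
```

    Resetting the counters at each interval boundary keeps the profile relative to recent behavior rather than to the whole run.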
  • Incorporating predicate information into branch predictors

    Page(s): 53 - 64

    Predicated execution can be used to alleviate the costs associated with frequently mispredicted branches. This is accomplished by trading the cost of a mispredicted branch for execution of both paths following the conditional branch. In this paper we examine two enhancements for branch prediction in the presence of predicated code. Both of the techniques use recently calculated predicate definitions to provide a more intelligent branch prediction. The first branch predictor, called the squash false path filter, recognizes fetched branches known to be guarded with a false predicate and predicts them as not-taken with 100% accuracy. The second technique, called the predicate global update branch predictor, improves prediction by incorporating recent predicate information into the branch predictor. We use these techniques to aid the prediction of region-based branches. A region-based branch is a branch that is left in a predicated region of code. A region-based branch may be correlated with predicate definitions in the region in addition to those that define the branch's guarding predicate.
  • Runahead execution: an alternative to very large instruction windows for out-of-order processors

    Page(s): 129 - 140

    Today's high performance processors tolerate long latency operations by means of out-of-order execution. However, as latencies increase, the size of the instruction window must increase even faster if we are to continue to tolerate these latencies. We have already reached the point where the size of an instruction window that can handle these latencies is prohibitively large in terms of both design complexity and power consumption. And the problem is getting worse. This paper proposes runahead execution as an effective way to increase memory latency tolerance in an out-of-order processor without requiring an unreasonably large instruction window. Runahead execution unblocks the instruction window blocked by long latency operations, allowing the processor to execute far ahead in the program path. This results in data being prefetched into caches long before it is needed. On a machine model based on the Intel® Pentium® processor, having a 128-entry instruction window, adding runahead execution improves the IPC (instructions per cycle) by 22% across a wide range of memory intensive applications. Also, for the same machine model, runahead execution combined with a 128-entry window performs within 1% of a machine with no runahead execution and a 384-entry instruction window.
  • Mini-threads: increasing TLP on small-scale SMT processors

    Page(s): 19 - 30

    Several manufacturers have recently announced the first simultaneous-multithreaded processors, both as single CPUs and as components of multi-CPU chips. All are small scale, comprising only two to four thread contexts. A significant impediment to the construction of larger-scale SMTs is the register file size required by a large number of contexts. This paper introduces and evaluates mini-threads, a simple extension to SMT that increases thread-level parallelism without the commensurate increase in register file size. A mini-threaded SMT CPU adds additional per-thread state to each hardware context; an application executing in a context can create mini-threads that will utilize its own per-thread state, but share the context's architectural register set. The resulting performance will depend on the benefits of additional TLP compared to the costs of executing mini-threads with fewer registers. Our results quantify these factors in detail and demonstrate that mini-threads can improve performance significantly, particularly on small-scale, space-sensitive CPU designs.
  • Hierarchical backoff locks for nonuniform communication architectures

    Page(s): 241 - 252

    This paper identifies node affinity as an important property for scalable general-purpose locks. Nonuniform communication architectures (NUCA), for example CC-NUMA built from a few large nodes or from chip multiprocessors (CMP), have a lower penalty for reading data from a neighbor's cache than from a remote cache. Lock implementations that encourage handing over locks to neighbors will improve the lock handover time, as well as the access to the critical data guarded by the lock, but will also be vulnerable to starvation. We propose a set of simple software-based hierarchical backoff locks (HBO) that create node affinity in NUCA. A solution for lowering the risk of starvation is also suggested. The HBO locks are compared with other software-based lock implementations using simple benchmarks, and are shown to be very competitive for uncontested locks while being more than twice as fast for contended locks. An application study also demonstrates superior performance for applications with high lock contention and competitive performance for other programs.
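    The node-affinity idea can be illustrated with a small sketch, assuming a lock word that stores the node id of the current holder: a waiter on the holder's node backs off briefly, while a waiter on another node backs off longer, so the lock tends to be handed to a neighbor. The backoff constants and names are hypothetical, a `threading.Lock` stands in for a hardware compare-and-swap, and the starvation safeguard mentioned in the abstract is not modeled.

```python
import threading
import time

FREE = -1  # lock word value when nobody holds the lock

class HBOLock:
    def __init__(self, local_backoff=1e-6, remote_backoff=2e-5, cap=1e-3):
        self._holder_node = FREE
        self._atomic = threading.Lock()  # stand-in for an atomic compare-and-swap
        self.local_backoff = local_backoff
        self.remote_backoff = remote_backoff
        self.cap = cap

    def _cas(self, expected, new):
        """Store `new` if the lock word equals `expected`; return the old value."""
        with self._atomic:
            old = self._holder_node
            if old == expected:
                self._holder_node = new
            return old

    def acquire(self, my_node):
        attempt = 0
        while True:
            holder = self._cas(FREE, my_node)
            if holder == FREE:
                return
            # Shorter backoff when the holder is on our own node.
            base = self.local_backoff if holder == my_node else self.remote_backoff
            time.sleep(min(base * 2 ** attempt, self.cap))
            attempt += 1

    def release(self):
        with self._atomic:
            self._holder_node = FREE
```

    Capping the exponential backoff bounds how long a remote waiter can sleep, but it does not by itself prevent starvation.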
  • Dynamic data replication: an approach to providing fault-tolerant shared memory clusters

    Page(s): 203 - 214

    A challenging issue in today's server systems is to transparently deal with failures and application-imposed requirements for continuous operation. In this paper we address this problem in shared virtual memory (SVM) clusters at the programming abstraction layer. We design extensions to an existing SVM protocol that has been tuned for low-latency, high-bandwidth interconnects and SMP nodes and we achieve reliability through dynamic replication of application shared data and protocol information. Our extensions allow us to tolerate single (or multiple, but not simultaneous) node failures. We implement our extensions on a state-of-the-art cluster and we evaluate the common, failure-free case. We find that, although the complexity of our protocol is substantially higher than its failure-free counterpart, by taking advantage of architectural features of modern systems our approach imposes low overhead and can be employed for transparently dealing with system failures.
  • Control techniques to eliminate voltage emergencies in high performance processors

    Page(s): 79 - 90

    Increasing focus on power dissipation issues in current microprocessors has led to a host of proposals for clock gating and other power-saving techniques. While generally effective at reducing average power, many of these techniques have the undesired side-effect of increasing both the variability of power dissipation and the variability of current drawn by the processor. This increase in current variability, often referred to as the dI/dt problem, can cause supply voltage fluctuations. Such voltage fluctuations lead to unreliable circuits if not addressed, and increasingly expensive chip packaging techniques are needed to mitigate them. This paper proposes and evaluates a methodology for augmenting packaging techniques for dI/dt with microarchitectural control mechanisms. We discuss the resonant frequencies most relevant to current microprocessor packages, produce and evaluate a "dI/dt stressmark" that exercises the system at its resonant frequency, and characterize the behavior of more mainstream applications. Based on these results plus evaluations of the impact of controller error and delay, our microarchitectural control proposals offer bounds on supply voltage fluctuations, with nearly negligible impact on performance and energy. With the ITRS roadmap predicting aggressive drops in supply voltage and power supply impedances in coming chip generations, novel voltage control techniques will be required to stay on track. Our microarchitectural dI/dt controllers represent a step in this direction.
  • A statistically rigorous approach for improving simulation methodology

    Page(s): 281 - 291

    Due to cost, time, and flexibility constraints, simulators are often used to explore the design space when developing new processor architectures, as well as when evaluating the performance of new processor enhancements. However, despite this dependence on simulators, statistically rigorous simulation methodologies are not typically used in computer architecture research. A formal methodology can provide a sound basis for drawing conclusions gathered from simulation results by adding statistical rigor, and consequently, can increase confidence in the simulation results. This paper demonstrates the application of a rigorous statistical technique to the setup and analysis phases of the simulation process. Specifically, we apply a Plackett and Burman design to: (1) identify key processor parameters; (2) classify benchmarks based on how they affect the processor; and (3) analyze the effect of processor performance enhancements. Our technique expands on previous work by applying a statistical method to improve the simulation methodology instead of applying a statistical model to estimate the performance of the processor.
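    For readers unfamiliar with Plackett and Burman designs, the sketch below builds the standard 8-run design for up to 7 two-level factors by cyclically shifting a generator row and appending a row of low settings, then computes each factor's effect as the signed sum of the responses. It illustrates the general technique only; the paper's actual parameter set and design size are not reproduced, and the function names are hypothetical.

```python
GENERATOR = [1, 1, 1, -1, 1, -1, -1]  # standard first row of the 8-run design

def plackett_burman_8():
    """Return 8 runs x 7 factors, with +1 = high setting and -1 = low setting."""
    n = len(GENERATOR)
    rows = [[GENERATOR[(col - shift) % n] for col in range(n)]
            for shift in range(n)]
    rows.append([-1] * n)  # final run: every factor at its low setting
    return rows

def effects(design, responses):
    """Effect of each factor: sum over runs of (setting * measured response)."""
    num_factors = len(design[0])
    return [sum(row[f] * y for row, y in zip(design, responses))
            for f in range(num_factors)]
```

    Ranking factors by the magnitude of their effects identifies the parameters that matter most, using 8 simulations instead of the 128 a full two-level factorial over 7 factors would need.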