
Proceedings of the 13th International Conference on Parallel Architecture and Compilation Techniques (PACT 2004)

Date: 29 Sept.-3 Oct. 2004


Displaying Results 1 - 25 of 36
  • The energy impact of aggressive loop fusion

    Page(s): 153 - 164

    Loop fusion combines corresponding iterations of different loops. It is traditionally used to decrease program run time by reducing loop overhead and increasing data locality. In this paper, however, we consider its effect on energy. By merging program phases, fusion tends to increase the uniformity, or balance, of demand for system resources. On a conventional superscalar processor, increased balance tends to increase IPC, and thus dynamic power, so that fusion-induced improvements in program energy are slightly smaller than improvements in program run time. If IPC is held constant, however, by reducing frequency and voltage - particularly on a processor with multiple clock domains - then energy improvements may significantly exceed run time improvements. We demonstrate the benefits of increased program balance under a theoretical model of processor energy consumption. We then evaluate the benefits of fusion empirically on synthetic and real-world benchmarks, using our existing loop-fusing compiler and a heavily modified version of the SimpleScalar/Wattch simulator. For the real-world benchmarks, we demonstrate energy savings ranging from 7% to 40%, with run-time changes ranging from a 1% slowdown to a 17% speedup. In addition to validating our theoretical model, the simulation results allow us to "tease apart" the factors that contribute to fusion-induced time and energy savings.
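
    To make the transformation itself concrete, here is a minimal sketch of loop fusion (example code ours, not the paper's compiler):

        /* Before fusion: two traversals of a[]; each loop pays its own
           overhead, and a[] is streamed through the cache twice. */
        void unfused(int n, const double *a, double *b, double *c) {
            for (int i = 0; i < n; i++)
                b[i] = 2.0 * a[i];
            for (int i = 0; i < n; i++)
                c[i] = a[i] + b[i];
        }

        /* After fusion: one traversal; a[i] and b[i] are reused while
           still in registers or cache, and loop overhead is paid once. */
        void fused(int n, const double *a, double *b, double *c) {
            for (int i = 0; i < n; i++) {
                b[i] = 2.0 * a[i];
                c[i] = a[i] + b[i];
            }
        }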

  • An adaptive algorithm selection framework

    Page(s): 278 - 289

    Irregular and dynamic memory reference patterns can cause performance variations for low-level algorithms in general and for parallel algorithms in particular. We present an adaptive algorithm selection framework which can collect and interpret the inputs of a particular instance of a parallel algorithm and select the best-performing one from an existing library. We present the dynamic selection of parallel reduction algorithms. First we introduce a set of high-level parameters that can characterize different parallel reduction algorithms. Then we describe an offline, systematic process to generate predictive models, which can be used for run-time algorithm selection. Our experiments show that our framework: (a) selects the most appropriate algorithms in 85% of the cases studied, (b) overall delivers 98% of the optimal performance, (c) adaptively selects the best algorithms for dynamic phases of a running program (resulting in performance improvements otherwise not possible), and (d) adapts to the underlying machine architecture (tested on IBM Regatta and HP V-Class systems).
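
    The skeleton of such run-time selection might look as follows (a hedged sketch; the variant names, the characterization parameters, and the threshold are hypothetical stand-ins for the paper's trained predictive models):

        #include <stddef.h>

        typedef double (*reduction_fn)(const double *x, size_t n);

        /* Library variants; sequential stand-ins here, parallel in reality
           (e.g., privatized per-thread copies vs. synchronized updates). */
        static double reduce_replicated(const double *x, size_t n) {
            double s = 0.0;
            for (size_t i = 0; i < n; i++) s += x[i];
            return s;
        }
        static double reduce_synchronized(const double *x, size_t n) {
            double s = 0.0;
            for (size_t i = 0; i < n; i++) s += x[i];
            return s;
        }

        /* Stand-in for the offline-trained predictive model: map measured
           characteristics of this instance to the predicted-best variant. */
        reduction_fn select_reduction(size_t n, double conflict_ratio) {
            if (n < 100000 || conflict_ratio < 0.05)
                return reduce_synchronized;  /* contention rare: sync is cheap */
            return reduce_replicated;        /* heavy updates: replicate, merge */
        }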

  • Compiler estimation of load imbalance overhead in speculative parallelization

    Page(s): 203 - 214

    Speculative parallelization is a technique that complements automatic compiler parallelization by allowing code sections that cannot be fully analyzed by the compiler to be aggressively executed in parallel. However, while speculative parallelization can potentially deliver significant speedups, several overheads associated with the technique limit these speedups in practice. We propose a novel compiler model of speculative multithreaded execution that can be used to reason about the overheads and expected resulting performance gains, or losses, from speculative parallelization. This model is based on estimating the likely execution duration of threads, properly takes into account the scheduling restrictions of most speculative execution environments, and can include all speculative parallelization overheads. Also, different from heuristics that attempt to qualitatively estimate potentially "good" or "bad" sections for speculative multithreaded execution, this model allows the compiler to estimate the speedup or slowdown quantitatively. Such a quantitative estimate can then be used by the compiler or run-time system to make more complex and educated tradeoff decisions. We use the proposed framework on a number of loops from a collection of SPEC benchmarks that suffer mainly from load imbalance and thread dispatch and commit overheads. Experimental results show that our framework can identify on average 68% of the loops that cause slowdowns and on average 97% of the loops that lead to speedups. In fact, our framework predicts the speedups or slowdowns with an error of less than 20% for an average of 44% of the loops across the benchmarks, and with an error of less than 50% for an average of 84% of the loops. Overall, our framework leads to a performance improvement of 5% on average, and as high as 38%, over a naive approach that attempts to speculatively parallelize all the loops considered.
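
    In spirit, the quantitative estimate has roughly this shape (a hedged sketch; the formula is illustrative and not the paper's exact model):

        #include <stddef.h>

        /* Predict the speedup of speculatively parallelizing a loop, given
           the compiler's estimated duration of each would-be thread.
           In-order dispatch and commit mean the slowest thread dominates
           the batch, and each thread adds dispatch/commit overhead. */
        double estimated_speedup(const double *thread_time, size_t nthreads,
                                 double dispatch_ovh, double commit_ovh) {
            double seq = 0.0, longest = 0.0;
            for (size_t i = 0; i < nthreads; i++) {
                seq += thread_time[i];
                if (thread_time[i] > longest)
                    longest = thread_time[i];      /* load imbalance term */
            }
            double par = longest + (double)nthreads * (dispatch_ovh + commit_ovh);
            return seq / par;    /* > 1 predicts speedup, < 1 a slowdown */
        }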

  • Fair cache sharing and partitioning in a chip multiprocessor architecture

    Page(s): 111 - 122

    This paper presents a detailed study of fairness in cache sharing between threads in a chip multiprocessor (CMP) architecture. Prior work in CMP architectures has only studied throughput optimization techniques for a shared cache. The issue of fairness in cache sharing, and its relation to throughput, has not been studied. Fairness is a critical issue because the operating system (OS) thread scheduler's effectiveness depends on the hardware to provide fair cache sharing to co-scheduled threads. Without such hardware, serious problems, such as thread starvation and priority inversion, can arise and render the OS scheduler ineffective. This paper makes several contributions. First, it proposes and evaluates five cache fairness metrics that measure the degree of fairness in cache sharing, and shows that two of them correlate very strongly with execution-time fairness, defined in terms of how uniformly the execution times of co-scheduled threads change, where each change is measured relative to the execution time of the same thread running alone. Second, using the metrics, the paper proposes static and dynamic L2 cache partitioning algorithms that optimize fairness. The dynamic partitioning algorithm is easy to implement, requires little or no profiling, has low overhead, and does not restrict the cache replacement algorithm to LRU. The static algorithm, although requiring the cache to maintain LRU stack information, can help the OS thread scheduler to avoid cache thrashing. Finally, this paper studies the relationship between fairness and throughput in detail. We found that optimizing fairness usually increases throughput, while maximizing throughput does not necessarily improve fairness. Using a set of co-scheduled pairs of benchmarks, on average our algorithms improve fairness by a factor of 4, while increasing throughput by 15%, compared to a nonpartitioned shared cache.
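
    For a co-scheduled pair, that definition of execution-time fairness reduces to a simple check (our illustration, not one of the paper's five cache-level metrics):

        /* Each thread's slowdown under sharing, relative to running alone.
           Perfectly fair sharing slows both threads uniformly, so the two
           ratios are equal and the gap below is 0. */
        double exec_time_unfairness(double t_shared_1, double t_alone_1,
                                    double t_shared_2, double t_alone_2) {
            double x1 = t_shared_1 / t_alone_1;
            double x2 = t_shared_2 / t_alone_2;
            return x1 > x2 ? x1 - x2 : x2 - x1;
        }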

  • Impact of Java memory model on out-of-order multiprocessors

    Page(s): 99 - 110

    The semantics of Java multithreading dictates all possible behaviors that a multithreaded Java program can exhibit on any platform. This is called the Java memory model (JMM) and describes the allowed reorderings among the memory operations in a thread. However, multiprocessor platforms traditionally have memory consistency models of their own. In this paper, we study the interaction between the JMM and the multiprocessor memory consistency models. In particular, memory barriers may have to be inserted to ensure that the multiprocessor execution of a multithreaded Java program respects the JMM. We study the impact of these additional memory barriers on program performance. Our experimental results indicate that the performance gain achieved by relaxed hardware memory consistency models far exceeds the performance degradation due to the memory barriers the JMM makes necessary.
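
    The kind of barrier at issue can be shown with a C11 analogue (ours; the paper itself concerns Java programs on a JVM): on relaxed hardware, safely publishing an object requires ordering its initialization before the publication.

        #include <stdatomic.h>

        typedef struct { int payload; } obj;
        static obj the_obj;
        static _Atomic(obj *) shared = NULL;

        void publish(void) {
            the_obj.payload = 42;                    /* initialize fields */
            atomic_store_explicit(&shared, &the_obj,
                                  memory_order_release);  /* barrier: make
                                     initialization visible before pointer */
        }

        int consume(void) {                  /* runs on another processor */
            obj *p = atomic_load_explicit(&shared, memory_order_acquire);
            return p ? p->payload : -1;      /* acquire pairs with release */
        }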

  • Decoupled software pipelining with the synchronization array

    Page(s): 177 - 188

    Despite the success of instruction-level parallelism (ILP) optimizations in increasing the performance of microprocessors, certain codes remain elusive. In particular, codes containing recursive data structure (RDS) traversal loops have been largely immune to ILP optimizations, due to the fundamental serialization and variable latency of the loop-carried dependence through a pointer-chasing load. To address these and other situations, we introduce decoupled software pipelining (DSWP), a technique that statically splits a single-threaded sequential loop into multiple nonspeculative threads, each of which performs useful computation essential for overall program correctness. The resulting threads execute on thread-parallel architectures such as simultaneous multithreaded (SMT) cores or chip multiprocessors (CMP), expose additional instruction level parallelism, and tolerate latency better than the original single-threaded RDS loop. To reduce overhead, these threads communicate using a synchronization array, a dedicated hardware structure for pipelined inter-thread communication. DSWP used in conjunction with the synchronization array achieves an 11% to 76% speedup in the optimized functions on both statically and dynamically scheduled processors.
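
    In software terms the split looks roughly like this (a sketch to be run as two pthreads, with a single-producer/single-consumer queue standing in for the paper's hardware synchronization array):

        #include <stdatomic.h>
        #include <stddef.h>

        typedef struct node { int data; struct node *next; } node;

        #define QSIZE 64
        static node *slot[QSIZE];
        static atomic_size_t qhead, qtail;

        static void put(node *p) {
            while (atomic_load(&qtail) - atomic_load(&qhead) == QSIZE)
                ;                                      /* queue full: wait */
            slot[atomic_load(&qtail) % QSIZE] = p;
            atomic_fetch_add(&qtail, 1);
        }
        static node *get(void) {
            while (atomic_load(&qhead) == atomic_load(&qtail))
                ;                                      /* queue empty: wait */
            node *p = slot[atomic_load(&qhead) % QSIZE];
            atomic_fetch_add(&qhead, 1);
            return p;
        }

        /* Thread 1: only the variable-latency pointer chase. */
        void *traverse(void *list) {
            for (node *p = list; p != NULL; p = p->next)
                put(p);
            put(NULL);                             /* end-of-list sentinel */
            return NULL;
        }

        /* Thread 2: the loop body's computation, pipelined behind thread 1. */
        void *compute(void *arg) {
            (void)arg;
            long sum = 0;
            for (node *p; (p = get()) != NULL; )
                sum += p->data;
            return (void *)sum;
        }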

  • Adding limited reconfigurability to superscalar processors

    Page(s): 53 - 62

    When adding reconfigurability to custom hardware, one must take great care that the reduction in speed due to the reconfigurable logic does not cancel out the gains obtained by reconfiguration. These gains are greatest in very specific and computation-intensive applications, and lessen as the applications become more general and heterogeneous. In the case of superscalar processors, this leads to limiting the amount of reconfigurability to precise changes in existing functional units instead of adding a fully configurable functional unit. We present a detailed study of the modifications necessary in a superscalar processor to allow an FPU to be dynamically reconfigured as several ALUs with a minimal increase in the latency of these functional units. We detail the timing of the FPU's multiplier tree and the decision of when to reconfigure. As there is more than one simple unit involved, this decision is more global than a cycle-by-cycle reconfiguration and must be made for a longer period of time. We discuss possible policies for the dynamic reconfiguration decisions. The results show interesting gains of up to 56% in the best cases, and average gains of 10%, on typical architectures over a wide range of applications.

  • Architectural support for high speed protection of memory integrity and confidentiality in multiprocessor systems

    Page(s): 123 - 134

    Recently there has been a growing effort in both the architecture and the security communities to create a hardware solution for authenticating system memory. As prior work has shown, hardware-based memory authentication is becoming a vital component for creating future trusted computing environments and digital rights protection. Almost all of this prior work has focused on authenticating memory exclusively owned by a single processing element. However, in today's computing platforms, memory is often shared by multiple processing elements that support a shared system memory with a snooping cache coherence protocol. Authenticating shared memory is a new challenge to memory protection. In this paper, we present a secure and fast architecture for authenticating shared memory. To incorporate memory authentication into the processor pipeline, we propose a new scheme called authentication speculative execution (ASE). Unlike the prior approaches, our scheme does not compromise security for performance. The novel ASE scheme is not only secure, as it is combined with one-time-pad (OTP) based memory encryption, but also efficient, tolerating authentication latency by executing unauthenticated instructions speculatively. Results using a modified RSIM running the SPLASH2 benchmarks show only 5% performance overhead on dual- and quad-processor platforms. Furthermore, ASE shows 80% better performance on average over conservative nonspeculative execution based authentication schemes. The scheme is of practical use for both multiprocessor systems and uniprocessor systems where memory is shared by one main processor and other co-processors on the system bus.

  • The value evolution graph and its use in memory reference analysis

    Page(s): 243 - 254

    We introduce a framework for the analysis of memory reference sets addressed by induction variables without closed forms. This framework relies on a new data structure, the value evolution graph (VEG), which models the global flow of values taken by induction variables with and without closed forms. We describe the application of our framework to array data-flow analysis, privatization, and dependence analysis. This results in the automatic parallelization of loops that contain arrays addressed by induction variables without closed forms. We implemented this framework in the Polaris research compiler. We present experimental results on a set of codes from the PERFECT, SPEC, and NCSA benchmark suites.
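
    A loop of the targeted kind looks like this (our illustration, not from the paper): the index k has no closed form because its step depends on a run-time branch, yet its value evolution is still analyzable.

        /* k cannot be written as a closed-form function of i, so classical
           induction-variable analysis gives up. Tracking how k's value
           evolves (it grows by 1 or 2 every iteration, hence is strictly
           increasing) proves the writes to a[] never repeat an index, so
           cross-iteration output dependences can be ruled out. */
        void pack(double *a, const int *flag, int n) {
            int k = 0;
            for (int i = 0; i < n; i++) {
                a[k] = (double)i;
                k += flag[i] ? 2 : 1;    /* strictly increasing step */
            }
        }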

  • TO-Lock: removing lock overhead using the owners' temporal locality

    Page(s): 255 - 266

    The performance of locking is critical, as programming languages with built-in thread support are coming into wide use. Many techniques for optimizing Java monitors have been proposed, based on the observation that locks are rarely contended in many applications. However, the performance degradation in SMP environments caused by the necessary serialization of the processors' execution has not been addressed for shared objects. We propose a new algorithm for this problem. It uses simple instructions to acquire the lock by exploiting the owners' temporal locality for objects, even if the ownership has migrated among the threads. Our algorithm is particularly effective in SMP environments because we can remove the overhead of the serialization caused by complex atomic operations for uncontended locks, by allowing the lock operation and the code protected by the lock to be executed in parallel. We verified the safety of the algorithm using the Spin model checker. The experimental results of our benchmarking on an SMP machine using Intel Xeon processors showed that our algorithm can significantly improve performance, by 83% on average, compared to the case using a complex atomic instruction.

  • Scalable high performance cross-module inlining

    Page(s): 165 - 176

    Performing inlining of routines across file boundaries is known to yield significant run-time performance improvements. We present a scalable cross-module inlining framework that reduces the compiler's memory footprint, file thrashing, and overall compile time. Instead of using the call-site ordering generated by the analysis phase, the transformation phase dynamically produces a new inlining order depending on the resource constraints of the system. We introduce dependences among call-sites and affinity among source files based on the inlines performed. We discuss the implementation of our technique and show how it substantially reduces compile time and memory usage without sacrificing any run-time performance.

  • Architectural support for enhanced SMT job scheduling

    Page(s): 63 - 73

    By converting thread-level parallelism to instruction level parallelism, simultaneous multithreaded (SMT) processors are emerging as effective ways to utilize the resources of modern superscalar architectures. However, the full potential of SMT has not yet been reached as most modern operating systems use existing single-thread or multiprocessor algorithms to schedule threads, neglecting contention for resources between threads. To date, even the best SMT scheduling algorithms simply try to group threads for co-residency based on each thread's expected resource utilization but do not take into account variance in thread behavior. As such, we introduce architectural support that enables new thread scheduling algorithms to group threads for co-residency based on fine-grain memory system activity information. The proposed memory monitoring framework centers on the concept of a cache activity vector, which exposes runtime cache resource information to the operating system to improve job scheduling. Using this scheduling technique, we experimentally evaluate the overall performance improvement of workloads on an SMT machine compared against the most recent Linux job scheduler. This work is first motivated with experiments in a simulated environment, then validated on a hyperthreading-enabled Intel Pentium-4 Xeon microprocessor running a modified version of the latest Linux kernel.
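
    One plausible encoding of such a vector (ours; the width, granularity, and pairing policy are assumptions, not the paper's design) is a bitmap over cache sets, with the scheduler pairing threads whose bitmaps overlap least:

        #include <stdint.h>

        #define WORDS 4                    /* 256 cache sets -> 4 x 64 bits */
        typedef struct { uint64_t set[WORDS]; } activity_vec;

        /* Number of cache sets both threads kept active during the last
           sampling interval; the scheduler co-schedules pairs that
           minimize this count. */
        int cache_overlap(const activity_vec *a, const activity_vec *b) {
            int n = 0;
            for (int w = 0; w < WORDS; w++)
                for (uint64_t bits = a->set[w] & b->set[w]; bits; bits &= bits - 1)
                    n++;                   /* clear lowest set bit, count it */
            return n;
        }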

  • A high-performance SIMD floating point unit for BlueGene/L: architecture, compilation, and algorithm design

    Page(s): 85 - 96

    We describe the design, implementation, and evaluation of a dual-issue SIMD-like extension of the PowerPC 440 floating-point unit (FPU) core. This extended FPU is targeted at both IBM's massively parallel BlueGene/L machine as well as more pervasive embedded platforms. It has several novel features, such as a computational crossbar and cross-load/store instructions, which enhance the performance of numerical codes. We further discuss the hardware-software co-design that was essential to fully realize the performance benefits of the FPU when constrained by the memory bandwidth limitations and high penalties for misaligned data access imposed by the memory hierarchy on a BlueGene/L node. We describe several novel compiler and algorithmic techniques to take advantage of this architecture. Using both hand-optimized and compiled code for key linear algebraic kernels, we validate the architectural design choices, evaluate the success of the compiler, and quantify the effectiveness of the novel algorithm design techniques. Preliminary performance data shows that the algorithm-compiler-hardware combination delivers a significant fraction of peak floating-point performance for compute-bound kernels such as matrix multiplication, and delivers a significant fraction of peak memory bandwidth for memory-bound kernels such as daxpy, while being largely insensitive to data alignment.

  • Partitioning of code for a massively parallel machine

    Page(s): 225 - 236

    Code partitioning is the problem of dividing sections of code among a set of processors for execution in parallel, taking into account the communication overhead between the processors. Code partitioning of large amounts of code onto numerous processors requires variations to the classical partitioning algorithms, in part due to the memory and time requirements to partition a large set of data, but also due to the nature of the target machine and multiple constraints imposed by its architectural features. We present our experience in the design of enhancements to the classical multilevel k-way partitioning algorithm to deal with large graphs of over 1 million nodes, 5 constraints, and nodes of irregular size. Our algorithm was implemented to produce code for a massively parallel machine of up to 40,000 processors, and forms part of a hardware description language compiler. The algorithm and the compiler were tested on RTL designs for a next-generation SPARC® processor. We present performance results and comparisons for partitioning multiprocessor hardware designs.

  • Implementing malleability on MPI jobs

    Page(s): 215 - 224

    Parallel jobs are characterized by processes that communicate and synchronize with each other frequently. A processor allocation strategy widely used in parallel supercomputers is space-sharing, that is, assigning a partition of processors to each job for its exclusive use. We present a global solution to offer virtual malleability to message-passing parallel jobs, by applying a processor allocation strategy, Folding by JobType (FJT). This technique is based on folding and moldability concepts and tries to decide the optimal initial number of processes, when to fold jobs, and how many times to fold, by analyzing current and past system information. At the processor level, we apply co-scheduling. We implement and evaluate the FJT under several workloads with different job sizes, classes, and machine utilization. Results show that the FJT adapts easily to load changes, and can obtain better performance than the other strategies evaluated on workloads with a high coefficient of variation, especially those with burst arrivals.

  • A multi-platform co-array Fortran compiler

    Page(s): 29 - 40

    Co-array Fortran (CAF) - a small set of extensions to Fortran 90 - is an emerging model for scalable, global address space parallel programming. CAF's global address space programming model simplifies the development of single-program-multiple-data parallel programs by shifting the burden for managing the details of communication from developers to compilers. This paper describes CAFC - a prototype implementation of an open-source, multiplatform CAF compiler that generates code well-suited for today's commodity clusters. The CAFC compiler translates CAF into Fortran 90 plus calls to one-sided communication primitives. The paper describes key details of CAFC's approach to generating efficient code for multiple platforms. Experiments compare the performance of CAF and MPI versions of several NAS parallel benchmarks on an Alpha cluster with a Quadrics interconnect, an Itanium 2 cluster with a Myrinet 2000 interconnect and an Itanium 2 cluster with a Quadrics interconnect. These experiments show that CAFC compiles CAF programs into code that delivers performance roughly equal to that of hand-optimized MPI programs.

  • Code generation in the polyhedral model is easier than you think

    Page(s): 7 - 16

    Many advances in automatic parallelization and optimization have been achieved through the polyhedral model. It has been extensively shown that this computational model provides convenient abstractions to reason about and apply program transformations. Nevertheless, the complexity of code generation has long been a deterrent for using polyhedral representation in optimizing compilers. First, code generators have a hard time coping with generated code size and control overhead that may spoil theoretical benefits achieved by the transformations. Second, this step is usually time consuming, hampering the integration of the polyhedral framework in production compilers or feedback-directed, iterative optimization schemes. Moreover, current code generation algorithms only cover a restrictive set of possible transformation functions. This paper discusses a general transformation framework able to deal with nonunimodular, noninvertible, nonintegral or even nonuniform functions. It presents several improvements to a state-of-the-art code generation algorithm. Two directions are explored: generated code size and code generator efficiency. Experimental evidence proves the ability of the improved method to handle real-life problems.
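
    As an illustration of the nonunimodular case (our example, not the paper's): mapping iteration i of a statement to time t = 2i leaves holes in the time domain, and the quality of the generated code depends on recovering the stride instead of guarding every scanned point.

        /* Source: for (i = 0; i < n; i++) a[i] += 1;  with schedule t = 2i. */

        void scan_naive(double *a, int n) {
            for (int t = 0; t <= 2 * (n - 1); t++)
                if (t % 2 == 0)           /* guard fires on every hole */
                    a[t / 2] += 1.0;
        }

        void scan_strided(double *a, int n) {
            for (int t = 0; t <= 2 * (n - 1); t += 2)
                a[t / 2] += 1.0;          /* stride recovered: no guards */
        }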

  • Retargeting JIT compilers by using C-compiler generated executable code

    Page(s): 41 - 50

    JIT compilers produce fast code, whereas interpreters are easy to port between architectures. We propose to combine the advantages of these language implementation techniques as follows: we generate native code by concatenating and patching machine code fragments taken from interpreter-derived code (generated by a C compiler); we completely eliminate the interpreter dispatch overhead and accesses to the interpreted code by patching jump target addresses and other constants into the fragments. In this paper we present the basic idea, discuss some issues in more detail, and present results from a proof-of-concept implementation, providing speedups of up to 1.87 over the fastest previous interpreter-based technique, and performance comparable to simple native-code compilers. The effort required for retargeting our implementation from the 386 to the PPC architecture was less than a person-day.

  • Static placement, dynamic issue (SPDI) scheduling for EDGE architectures

    Page(s): 74 - 84

    Technology trends present new challenges for processor architectures and their instruction schedulers. Growing transistor density increases the number of execution units on a single chip, and decreasing wire transmission speeds cause long and variable on-chip latencies. These trends severely limit the two dominant conventional architectures: dynamic issue superscalars, and static placement and issue VLIWs. We present a new execution model in which the hardware and static scheduler instead work cooperatively, called static placement dynamic issue (SPDI). This paper focuses on the static instruction scheduler for SPDI. We identify and explore three issues SPDI schedulers must consider - locality, contention, and depth of speculation. We evaluate a range of SPDI scheduling algorithms executing on an explicit data graph execution (EDGE) architecture. We find that a surprisingly simple one achieves an average of 5.6 instructions per cycle (IPC) for SPEC2000 on a 64-wide issue machine, and comes within 80% of the performance achievable with no on-chip latencies. These results suggest that the compiler is effective at balancing on-chip latency and parallelism, and that the division of responsibilities between the compiler and the architecture is well suited to future systems.

  • Fast paths in concurrent programs

    Page(s): 189 - 200

    Compiling concurrent programs to run on a sequential processor presents a difficult tradeoff between execution time and size of generated code. On one hand, the process-based approach to compilation generates reasonable sized code but incurs significant execution overhead due to concurrency. On the other hand, the automata-based approach incurs a much smaller execution overhead but can result in code that is several orders of magnitude larger. We propose a way of combining the two approaches so that the performance of the automata-based approach can be achieved without suffering the code size increase due to it. The key insight is that the best of the two approaches can be achieved by using symbolic execution (similar to the automata-based approach) to generate code for the commonly executed paths (referred to as fast paths) and using the process-based approach to generate code for the rest of the program. We demonstrate the effectiveness of this approach by implementing our techniques in the ESP compiler and applying them to a set of filter programs and to VMMC network firmware.

  • A compiler framework for recovery code generation in general speculative optimizations

    Page(s): 17 - 28

    A general framework that integrates both control and data speculation using alias profiling and/or compiler heuristic rules has been shown to improve SPEC2000 performance on Itanium systems. However, speculative optimizations require check instructions and recovery code to ensure correct execution when speculation fails at runtime. How to generate check instructions and their associated recovery code efficiently and effectively is an issue yet to be well studied. Also, it is very important that the recovery code generated in the earlier phases integrates gracefully with the later optimization phases. At the very least, it should not hinder later optimizations, thus ensuring overall performance improvement. This paper proposes a framework that uses an if-block structure to facilitate check instruction and recovery code generation for general speculative optimizations. It allows speculative instructions and their recovery code generated in the early compiler optimization phases to be integrated effectively with the subsequent optimization phases. It also allows multilevel speculation for multilevel pointers and multilevel expression trees to be handled with no additional complexity. The proposed recovery code generation framework has been implemented in the open research compiler (ORC).

  • AC/DC: an adaptive data cache prefetcher

    Page(s): 135 - 145

    AC/DC is an adaptive method for prefetching data from main memory. The basic prefetch method divides the memory address space into equal-sized concentration zones (CZones), and uses a global history buffer to track and detect patterns in miss address "deltas" (differences between consecutive addresses) within each CZone. When simulated with a realistic desktop memory system, CZone prefetching with delta correlations (C/DC) outperforms four other previously proposed prefetching methods. C/DC yields an average performance improvement of 23% when compared with no prefetching. Adaptivity is then added to the basic method. A tuning algorithm dynamically configures the CZone size and prefetch degree (i.e., the amount of data prefetched) on a per program-phase basis. Adaptive reconfiguration provides additional performance improvements of 4% over C/DC. Overall, the adaptive CZone/delta correlation (AC/DC) method outperforms the other methods studied by 10%.
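
    The core delta-correlation step can be sketched as follows (structure simplified for illustration; a real global history buffer links miss entries per CZone rather than scanning a flat array):

        /* Given the recent miss addresses of one CZone, find the last
           earlier occurrence of the two most recent deltas and replay the
           deltas that followed it as prefetch candidates. */
        #define DEGREE 4   /* prefetch degree: addresses issued per miss */

        int czone_predict(const long *miss, int n, long *out) {
            if (n < 3) return 0;
            long d1 = miss[n-2] - miss[n-3];     /* two most recent deltas */
            long d2 = miss[n-1] - miss[n-2];
            for (int i = n - 3; i >= 2; i--) {
                if (miss[i-1] - miss[i-2] == d1 && miss[i] - miss[i-1] == d2) {
                    int np = 0;
                    long addr = miss[n-1];
                    for (int k = i + 1; k < n - 1 && np < DEGREE; k++) {
                        addr += miss[k] - miss[k-1];   /* replay history */
                        out[np++] = addr;
                    }
                    return np;
                }
            }
            return 0;   /* no correlated pattern found */
        }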

  • The stream virtual machine

    Page(s): 267 - 277

    Stream programming is currently being pushed as a way to expose concurrency and separate communication from computation. Since there are many stream languages and potential stream execution engines, we propose an abstract machine model that captures the essential characteristics of stream architectures: the stream virtual machine (SVM). The goal of the SVM is to improve interoperability, allow development of common compilation tools, and enable reasoning about stream program performance. The SVM contains control processors, slave kernel processors, and slave DMA units. The compilation process takes a stream program down to the SVM and finally down to machine binary. To extract the parameters for our SVM model, we use micro-kernels to characterize two graphics processors and a stream engine, Imagine. The results are encouraging; the model estimates the performance of the target machines with high accuracy.

  • Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques (PACT 2004)
