
PACT 2005: 14th International Conference on Parallel Architectures and Compilation Techniques

17-21 September 2005

  • PACT 2005. 14th International Conference on Parallel Architectures and Compilation Techniques

  • 14th International Conference on Parallel Architectures and Compilation Techniques - Title Page

    Page(s): i - iii
  • 14th International Conference on Parallel Architectures and Compilation Techniques - Copyright Page

    Page(s): iv
  • 14th International Conference on Parallel Architectures and Compilation Techniques - Table of contents

    Page(s): v - viii
  • Message from the General Chair

    Page(s): ix
  • Message from the Program Chair

    Page(s): x
  • Tutorials and Workshops

    Page(s): xi

    Provides an abstract for each of the tutorial presentations and a brief professional biography of each presenter. The complete presentations were not made available for publication as part of the conference proceedings.

  • Committees

    Page(s): xii - xiii
  • List of Reviewers

    Page(s): xiv - xv
  • Multi-core to the masses


    Summary form only given. It is likely that 2005 will be viewed as the year that parallelism came to the masses, with multiple vendors shipping dual/multi-core platforms into the mainstream consumer and enterprise markets. Assuming that this trend will follow Moore's Law scaling, mainstream systems will contain over 10 processing cores by the end of the decade, yielding unprecedented theoretical peak performance. However, it is unclear whether the software community is sufficiently ready for this transition and will be able to unleash these capabilities due to the significant challenges associated with parallel programming. This keynote addresses the motivation for multi-core architectures, their unique characteristics, and potential solutions to the fundamental software challenges, including architectural enhancements for transactional memory, fine-grain message passing, and speculative multi-threading. Finally, we stress the need for a concerted, accelerated effort, starting at the academic level and encompassing the entire platform software ecosystem, to successfully make the multi-core architectural transition.

  • Variational path profiling

    Page(s): 7 - 16

    Current profiling techniques are good at identifying where time is being spent during program execution. These techniques are not as good at pinpointing exactly where in the execution there are definite opportunities a programmer can exploit with optimization. In this paper we present a new type of profiling analysis called variational path profiling (VPP). VPP pinpoints exactly where in the program there are potentially significant optimization opportunities for speedup. VPP finds the acyclic control flow paths that vary the most in execution time (the time it takes to execute each occurrence of the path). This is calculated by sampling the time it takes to execute frequent paths using hardware performance counters. The motivation for concentrating on a path with a high net variation in its execution time is that it can potentially be optimized so that most or all executions of that path have the minimal execution time seen during profiling. We present a profiling and analysis approach to find these variational paths, so that they can be communicated back to a programmer to guide optimization. Our results show that this variation accounts for a significant fraction of overall program execution time and that a small number of paths account for a large fraction of this variation. By applying straightforward prefetching optimizations to these variational paths we see 8.5% speedups on average.
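
    A minimal sketch of the scoring step the abstract describes, assuming per-path timing samples are already in hand (the path names and cycle counts below are invented for illustration):

        # Hypothetical samples: path id -> execution times (cycles)
        # gathered by sampling hardware performance counters.
        samples = {
            "A->B->D": [120, 118, 121, 119],
            "A->C->D": [300, 95, 310, 92, 305],
        }

        def rank_by_variation(samples):
            """Score each path by how far its sampled executions
            exceed the fastest time observed for that path."""
            scores = {}
            for path, times in samples.items():
                best = min(times)
                scores[path] = sum(t - best for t in times)
            return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

        for path, variation in rank_by_variation(samples):
            print(f"{path}: net variation {variation} cycles")

    Paths like "A->C->D", whose executions swing between fast and slow, surface at the top and are the candidates for the prefetching optimizations the paper applies.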

  • Extended whole program paths

    Page(s): 17 - 26

    We describe the design, generation and compression of the extended whole program path (eWPP) representation that captures not only the control flow history of a program execution but also its data dependence history. This representation is motivated by the observation that typically a significant fraction of data dependence history can be recovered from the control flow trace. To capture the remainder of the data dependence history we introduce disambiguation checks in the program whose control flow signatures capture the results of the checks. The resulting extended control flow trace enables the recovery of otherwise unrecoverable data dependences. The code for the checks is designed to minimize the increase in program execution time and extended control flow trace size when compared to directly collecting control flow and dependence traces. Our experiments show that compressed eWPPs are only 4% of the size of combined compressed control flow and dependence traces, and that their collection requires 20% more runtime overhead than directly collecting the control flow and dependence traces.
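
    The following toy example suggests how a compiler-inserted disambiguation check can expose a data dependence through control flow alone; the trace format and function names are ours, not the paper's:

        trace = []  # control-flow signature left by the checks

        def load_with_check(mem, addr, last_store_addr):
            # Compiler-inserted check: which branch executes here is
            # recorded in the control-flow trace, so a trace reader
            # can recover whether this load depends on the store.
            if addr == last_store_addr:
                trace.append("dep")
            else:
                trace.append("no-dep")
            return mem[addr]

        mem = {0x10: 42}
        print(load_with_check(mem, 0x10, 0x10), trace)  # 42 ['dep']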

  • Instruction based memory distance analysis and its application to optimization

    Page(s): 27 - 37

    Feedback-directed optimization has become an increasingly important tool in designing and building optimizing compilers as it provides a means to analyze complex program behavior that is not possible using traditional static analysis. Feedback-directed optimization offers the compiler opportunities to analyze and optimize the memory behavior of programs even when traditional array-based analysis is not applicable. As a result, both floating-point and integer programs can benefit from memory hierarchy optimization. In this paper, we examine the notion of memory distance as it is applied to the instruction space of a program and to feedback-directed optimization. Memory distance is defined as a dynamic quantifiable distance in terms of memory references between two accesses to the same memory location. We use memory distance to predict the miss rates of instructions in a program. Using the miss rates, we then identify the program's critical instructions - the set of high-miss instructions whose cumulative misses account for 95% of the L2 cache misses in the program - in both integer and floating-point programs. Our experiments show that memory-distance analysis can effectively identify critical instructions in both integer and floating-point programs. Additionally, we apply memory-distance analysis to memory disambiguation in out-of-order issue processors, using those distances to determine when a load may be speculated ahead of a preceding store. Our experiments show that memory-distance-based disambiguation on average achieves within 5-10% of the performance gain of the store set technique, which requires a hardware table.
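
    A small sketch of the underlying measurement, assuming a flat address trace (the paper's per-instruction bookkeeping and miss-rate prediction are omitted):

        def memory_distances(trace):
            """Distance of each access = number of memory references
            since the previous access to the same address."""
            last_seen = {}
            dists = []
            for i, addr in enumerate(trace):
                if addr in last_seen:
                    dists.append((i, addr, i - last_seen[addr]))
                last_seen[addr] = i
            return dists

        for i, addr, d in memory_distances(["a", "b", "a", "c", "b", "a"]):
            print(f"ref {i} to {addr}: distance {d}")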

  • HPS: hybrid profiling support

    Page(s): 38 - 47

    Key to understanding and optimizing complex applications is our ability to dynamically monitor executing programs with low overhead and high accuracy. Toward this end, we present HPS, a hybrid profiling support system. HPS employs a hardware/software approach to program sampling that transparently, efficiently, and dynamically samples an executing instruction stream. Our system is an extension and application of dynamic instruction stream editing (DISE), a hardware technique that macro-expands instructions in the pipeline decode stage at runtime. HPS toggles profiling to sample the executing program as required by the profile consumer, e.g. a dynamic optimizer. Our system requires few hardware resources and introduces no "basic" overhead - the execution of instructions that triggers profiling. We use HPS to investigate the tradeoffs between overhead and accuracy of different profile types as well as different profiling schemes. In particular, we empirically evaluate hot data stream, hot call pair, and hot method identification using a number of parameterizations of bursty tracing, a popular sampling scheme used in dynamic optimization systems.
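
    For reference, bursty tracing in its simplest parameterization looks roughly like the following; the period/burst parameter names are ours, not HPS's:

        def bursty_sample(events, period=100, burst=5):
            """Profile `burst` consecutive events out of every
            `period` events; everything else runs unprofiled."""
            return [e for i, e in enumerate(events) if i % period < burst]

        events = list(range(1000))
        print(len(bursty_sample(events)), "of", len(events), "events profiled")

    Tuning period and burst trades profiling overhead against accuracy, which is exactly the space the paper explores empirically.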

  • Maximizing CMP throughput with mediocre cores

    Page(s): 51 - 62

    In this paper we compare the performance of area-equivalent small, medium, and large-scale multithreaded chip multiprocessors (CMTs) using throughput-oriented applications. We use area models based on SPARC processors incorporating these architectural features. We examine CMTs with in-order scalar processor cores, 2-way or 4-way in-order superscalar cores, private primary instruction and data caches, and a shared secondary cache. We explore a large design space, ranging from processor-intensive to cache-intensive CMTs. We use SPEC JBB2000, TPC-C, TPC-W, and XML Test to demonstrate that the scalar simple-core CMTs do a better job of addressing the problems of low instruction-level parallelism and high cache miss rates that dominate Web service middleware and online transaction processing applications. For the best overall CMT performance, smaller cores with lower performance, so-called "mediocre" cores, maximize the total number of CMT cores and outperform CMTs built from larger, higher-performance cores.
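
    The area-equivalence argument can be caricatured in a few lines; the areas and per-core throughputs below are illustrative numbers, not the paper's SPARC-based area models:

        AREA_BUDGET = 64  # arbitrary units of die area

        core_types = {
            # name: (area per core, relative per-core throughput on
            #        low-ILP, cache-miss-bound server workloads)
            "scalar in-order": (2, 1.0),
            "2-way in-order":  (4, 1.4),
            "4-way in-order":  (8, 1.7),
        }

        for name, (area, perf) in core_types.items():
            n = AREA_BUDGET // area
            print(f"{name}: {n} cores, aggregate throughput {n * perf:.1f}")

    With numbers in this regime, many small cores win on aggregate throughput even though each individual core is slower, which is the paper's central claim.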

  • Characterization of TCC on chip-multiprocessors

    Page(s): 63 - 74

    Transactional coherence and consistency (TCC) is a novel coherence scheme for shared memory multiprocessors that uses programmer-defined transactions as the fundamental unit of parallel work, synchronization, coherence, and consistency. TCC has the potential to simplify parallel program development and optimization by providing a smooth transition from sequential to parallel programs. In this paper, we study the implementation of TCC on chip-multiprocessors (CMPs). We explore design alternatives such as the granularity of state tracking, double-buffering, and write-update and write-invalidate protocols. Furthermore, we characterize the performance of TCC in comparison to conventional snoopy cache coherence (SCC) using parallel applications optimized for each scheme. We conclude that the two coherence schemes perform similarly, with each scheme having a slight advantage for some applications. The bandwidth requirements of TCC are slightly higher but well within the capabilities of CMP systems. Also, we find that overflow of speculative state can be effectively handled by a simple victim cache. Our results suggest TCC can provide its programming advantages without compromising the performance expected from well-tuned parallel applications.
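
    A toy model of the transactional execution style, as a deliberate simplification and not the TCC hardware protocol:

        class Transaction:
            """Buffers writes; publishes them atomically at commit.
            A transaction that read data overwritten by another
            commit must re-execute."""
            def __init__(self, shared):
                self.shared = shared
                self.read_set, self.write_buf = set(), {}

            def read(self, addr):
                self.read_set.add(addr)
                return self.write_buf.get(addr, self.shared.get(addr))

            def write(self, addr, val):
                self.write_buf[addr] = val  # buffered, not yet visible

            def commit(self, others_writes):
                if self.read_set & others_writes:
                    return False            # violation: re-execute
                self.shared.update(self.write_buf)
                return True

        shared = {"x": 0}
        t = Transaction(shared)
        t.write("x", t.read("x") + 1)
        print(t.commit(set()), shared)  # True {'x': 1}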

  • Store-ordered streaming of shared memory

    Page(s): 75 - 84

    Coherence misses in shared-memory multiprocessors account for a substantial fraction of execution time in many important scientific and commercial workloads. Memory streaming provides a promising solution to the coherence miss bottleneck because it improves memory level parallelism and lookahead while using on-chip resources efficiently. We observe that the order in which shared data are consumed by one processor is correlated to the order in which they were produced by another. We investigate this phenomenon and demonstrate that it can be exploited to send store-ordered streams (SORDS) of shared data from producers to consumers, thereby eliminating coherent read misses. Using a trace-driven analysis of all user and OS memory references in a cache-coherent distributed shared-memory multiprocessor, we show that SORDS-based memory streaming can eliminate between 36% and 100% of all coherent read misses in scientific workloads and between 23% and 48% in online transaction processing workloads.
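
    The producer/consumer order correlation can be illustrated with two toy traces; the window-based buffer below is our stand-in for the paper's streaming machinery:

        def stream_coverage(store_order, load_order, window=4):
            """Fraction of consumer reads satisfied by a stream of
            values forwarded in the producer's store order."""
            buf, hits, it = [], 0, iter(store_order)
            for addr in load_order:
                while len(buf) < window:        # keep stream topped up
                    try:
                        buf.append(next(it))
                    except StopIteration:
                        break
                if addr in buf:
                    hits += 1
                    buf.remove(addr)
            return hits / len(load_order)

        print(stream_coverage(["a", "b", "c", "d"], ["a", "b", "d", "c"]))  # 1.0

    The closer the consumption order tracks the production order, the smaller the window needed and the more coherent read misses the stream eliminates.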

  • An event-driven multithreaded dynamic optimization framework

    Page(s): 87 - 98

    Dynamic optimization has the potential to adapt the program's behavior at run-time to deliver performance improvements over static optimization. Dynamic optimization systems usually perform their optimization in series with the application's execution. This incurs overhead that reduces the benefit of dynamic optimization and prevents some aggressive optimizations from being performed. In this paper we propose a new dynamic optimization framework called Trident. Concurrent with the program's execution, the framework uses hardware support to identify optimization opportunities, and uses spare threads on a multithreaded processor to perform dynamic optimizations for these optimization events. We evaluate the benefit of using Trident to guide code layout, basic compiler optimizations, and value specialization. Our results show that using Trident with these optimizations achieves an average 20% speedup, and that it is complementary with other memory latency tolerance techniques, such as prefetching.
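
    The event-driven structure reduces to a producer/consumer pattern; the sketch below models optimization events with a queue and a spare worker thread (all names are illustrative, not Trident's API):

        import queue
        import threading

        events = queue.Queue()

        def optimizer():
            while True:
                ev = events.get()
                if ev is None:           # shutdown sentinel
                    break
                print("optimizing hot region at", hex(ev))

        worker = threading.Thread(target=optimizer)
        worker.start()
        for pc in (0x400A10, 0x400B40):  # events a profiler might post
            events.put(pc)
        events.put(None)
        worker.join()

    Because the worker runs concurrently with the main program, optimization work stays off the application's critical path, which is the overhead argument the paper makes.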

  • Design and implementation of a compiler framework for helper threading on multi-core processors

    Page(s): 99 - 109

    Helper threading is a technique that utilizes a second core or logical processor in a multi-threaded system to improve the performance of the main thread. A helper thread executes in parallel with the main thread that it attempts to accelerate. In this paper, the helper thread merely prefetches data into a shared cache and does not incur any other programmer-visible effects. Helper thread prefetching has been proposed as a viable solution in various scenarios where it is difficult to prefetch efficiently within the main thread itself. This paper presents our helper threading experience on Sun's second dual-core SPARC microprocessor, the UltraSPARC IV+. The two cores on this processor share an on-chip L2 and an off-chip L3 cache. We present a compiler framework to automatically construct helper threads and evaluate our scheme on the UltraSPARC IV+ processor. Our preliminary results using helper threads on the SPEC CPU2000 suite show gains of up to 22% on programs that suffer substantial L2 cache misses, while at the same time incurring negligible losses on programs that do not.
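
    Stripped of all architectural detail, the idea can be simulated as follows; the shared dictionary stands in for the shared L2 and the timings are faked, so this is a model of the concept rather than the paper's framework:

        import threading
        import time

        cache = {}                       # stands in for the shared cache

        def slow_load(i):
            time.sleep(0.001)            # model a cache miss
            return i * i

        def helper(indices):
            for i in indices:            # runs ahead of the main thread
                cache.setdefault(i, slow_load(i))

        idxs = range(100)
        t = threading.Thread(target=helper, args=(idxs,))
        t.start()
        total = 0
        for i in idxs:                   # main thread: hit if prefetched
            v = cache.get(i)
            total += v if v is not None else slow_load(i)
        t.join()
        print(total)

    As in the paper's scheme, prefetching is best-effort: a miss in the warmed cache simply falls back to the normal (slow) load, with no other visible effect.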

  • Compiler directed early register release

    Page(s): 110 - 119

    This paper presents a novel compiler-directed technique to reduce the register pressure and power of the register file by releasing registers early. The compiler identifies registers that will only be read once and renames them to different logical registers. Upon issuing an instruction with one of these logical registers as a source, the processor knows that there will be no more uses of it and can release the register through checkpointing. This reduces the occupancy of our banked register file, allowing banks to be turned off for power savings. Our scheme is faster, simpler and requires less hardware than recently proposed techniques. It also maintains precise interrupts and exceptions where many other techniques do not. We reduce register occupancy by 28% in a large register file and gain in performance too; this translates into dynamic and static power savings of 18%. When compared to state-of-the-art approaches for varying register file sizes, our scheme is always faster (higher IPC) and always achieves a greater reduction in register file occupancy.
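
    The compiler-side analysis can be sketched on a toy IR (our format, not the paper's): find virtual registers whose value is read exactly once, so the single consuming instruction can signal the release:

        from collections import Counter

        # (dest, [sources]) triples for a straight-line block
        code = [
            ("r1", []),            # r1 = ...
            ("r2", ["r1"]),        # r2 = f(r1)
            ("r3", ["r2", "r2"]),  # r3 = g(r2, r2)
        ]

        uses = Counter(s for _, srcs in code for s in srcs)
        single_use = {d for d, _ in code if uses[d] == 1}
        print(single_use)  # {'r1'}: releasable at its only read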

  • Automatic selection of compiler options using non-parametric inferential statistics

    Page(s): 123 - 132

    In this paper, we propose a statistical method to determine the setting of compiler options. Conventionally, programmers use the standard -Ox settings provided by compiler developers. However, in order to obtain maximal performance, it is necessary to tune the compiler settings for the application as well as the underlying architecture. In this paper, we propose a methodology to configure compiler options automatically using profile information. We apply non-parametric statistical analysis, in particular the Mann-Whitney test, to decide whether to turn compiler flags on or off. This approach produces compiler settings for gcc 3.3.1 on the SPEC2000 benchmark suite that outperform the standard -Ox switches on a Pentium 4 processor.
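
    A per-flag decision in this style might look as follows; the runtimes are invented and SciPy's implementation of the test is used for convenience (this is our sketch, not the paper's code):

        from scipy.stats import mannwhitneyu

        # hypothetical profiled runtimes (seconds) for one benchmark
        with_flag    = [10.1, 10.3, 9.9, 10.0, 10.2]
        without_flag = [10.8, 11.0, 10.7, 10.9, 11.1]

        # keep the flag only if runtimes with it on are statistically
        # smaller than with it off
        stat, p = mannwhitneyu(with_flag, without_flag, alternative="less")
        print(f"p = {p:.4f} ->", "enable" if p < 0.05 else "disable")

    Being non-parametric, the test needs no assumption that runtimes are normally distributed, which is the methodological point of the paper.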

  • Data-centric transformations on non-integer iteration spaces

    Page(s): 133 - 142

    Data-centric transformations have been used in recent years to improve locality for several classes of applications. However, the existing work has applied these transformations for integer iteration spaces, i.e., the iteration spaces involving loop variables that take integer values between specified lower and upper bounds. In many applications, a loop could involve a loop variable which takes values from a sequence or set of real numbers, strings, or any other data type. We refer to such iteration spaces as non-integer iteration spaces. This paper focuses on the problem of applying data-centric transformations on applications with non-integer iteration spaces. We first present a general algorithm that uses a hash table. Then, we show how in many cases, we can exploit the repetitive nature of the dataset to avoid the overhead associated with such a table. Our algorithms have been implemented as part of a compiler for the XML query language XQuery, which supports processing over virtual XML. Our system also parallelizes the processing. We present experimental results from several applications to demonstrate the effectiveness of our transformations and parallel performance.
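
    The hash-table algorithm can be suggested in miniature, here with strings as the non-integer "iteration space" (a toy reduction of the paper's approach):

        from collections import defaultdict

        items = ["cat", "dog", "cat", "bird", "dog", "cat"]

        buckets = defaultdict(list)
        for i, key in enumerate(items):       # one pass builds the table
            buckets[key].append(i)

        for key, idxs in buckets.items():
            # all iterations touching `key`'s data now run back to back
            print(key, "->", idxs)

    Grouping iterations that touch the same data restores the locality that loop-based (integer-space) transformations would normally provide.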

  • Efficient techniques for advanced data dependence analysis

    Page(s): 143 - 153

    Scientific source code for high performance computers is extremely complex, containing irregular control structures with complicated expressions. This complexity makes it difficult for compilers to analyze the code and perform optimizations. In particular, with regard to program parallelization, complex expressions are often not taken into consideration during the data dependence analysis phase. In this work we propose new data dependence analysis techniques to handle such complex instances of the dependence problem and increase program parallelization. Our method is based on a set of polynomial-time techniques that can prove or disprove dependences in the presence of non-linear expressions, complex loop bounds, arrays with coupled subscripts, and if-statement constraints. In addition, our algorithm can produce accurate and complete direction vector information, enabling the compiler to apply further transformations. To validate our method we performed an experimental evaluation and comparison against the I-Test, the Omega test and the Range test on the Perfect and SPEC benchmarks. The experimental results indicate that our dependence analysis tool is efficient and more effective in program parallelization than the other dependence tests. The improved parallelization of key loops results in higher speedups and better program execution performance in several benchmarks.
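
    For flavor, here is the classic GCD test, a much simpler relative of the polynomial-time tests the paper proposes (not the paper's own algorithm): a dependence between accesses A[a*i + b] and A[c*j + d] requires gcd(a, c) to divide d - b.

        from math import gcd

        def gcd_test(a, b, c, d):
            """False => provably independent; True => may depend."""
            return (d - b) % gcd(a, c) == 0

        print(gcd_test(2, 0, 2, 1))  # False: A[2i] never meets A[2j+1]
        print(gcd_test(1, 0, 2, 1))  # True: a dependence is possible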

  • Parallel programming and parallel abstractions in Fortress


    Summary form only given. The Programming Language Research Group at Sun Microsystems Laboratories seeks to apply lessons learned from the Java (TM) programming language to the next generation of programming languages. The Java language supports platform-independent parallel programming with explicit multithreading and explicit locks. As part of the DARPA program for High Productivity Computing Systems, we are developing Fortress, a language intended to support large-scale scientific computation. One of the design principles is that parallelism be encouraged everywhere (for example, it is intentionally just a little bit harder to write a sequential loop than a parallel loop). Another is to have fairly rich mechanisms for encapsulation and abstraction; the idea is to have a fairly complicated language for library writers that enables them to write libraries presenting a relatively simple set of interfaces to the application programmer. We will discuss ideas for using a rich polymorphic type system to organize multithreading and data distribution on large parallel machines. The net result is similar in some ways to the data distribution facilities of other languages such as HPF and Chapel, but more open-ended, because in Fortress these facilities are defined by user-replaceable libraries rather than wired into the compiler.

  • Optimizing Compiler for the CELL Processor

    Page(s): 161 - 172

    Developed for multimedia and game applications, as well as other numerically intensive workloads, the CELL processor provides support both for highly parallel codes, which have high computation and memory requirements, and for scalar codes, which require fast response time and a full-featured programming environment. This first-generation CELL processor implements on a single chip a Power Architecture processor with two levels of cache, and eight attached streaming processors with their own local memories and globally coherent DMA engines. In addition to processor-level parallelism, each processing element has a Single Instruction Multiple Data (SIMD) unit that can process from 2 double-precision floating-point values up to 16 byte values per instruction. This paper describes, in the context of a research prototype, several compiler techniques that aim at automatically generating high-quality code for the wide range of heterogeneous parallelism available on the CELL processor. Techniques include compiler-supported branch prediction, compiler-assisted instruction fetch, generation of scalar codes on SIMD units, automatic generation of SIMD codes, and data and code partitioning across the multiple processor elements in the system. Results indicate that significant speedup can be achieved with a high level of support from the compiler.
