International Symposium on Code Generation and Optimization (CGO 2005)

Date: 20-23 March 2005

  • [Cover]

    Publication Year: 2005 , Page(s): c1
    PDF (46 KB)
    Freely Available from IEEE
  • International Symposium on Code Generation and Optimization

    Publication Year: 2005
    PDF (73 KB)
    Freely Available from IEEE
  • Table of contents

    Publication Year: 2005 , Page(s): v - viii
    PDF (48 KB)
    Freely Available from IEEE
  • Message from the General Co-Chairs

    Publication Year: 2005 , Page(s): ix
    PDF (33 KB) | HTML
    Freely Available from IEEE
  • Message from the Program Chair

    Publication Year: 2005 , Page(s): x
    PDF (32 KB) | HTML
    Freely Available from IEEE
  • Committees

    Publication Year: 2005 , Page(s): xi - xiii
    PDF (38 KB)
    Freely Available from IEEE
  • Reviewers

    Publication Year: 2005 , Page(s): xiv
    PDF (22 KB)
    Freely Available from IEEE
  • Virtual machine learning: thinking like a computer architect

    Publication Year: 2005
    PDF (52 KB) | HTML

    Summary form only given. Modern commercial software is written in languages that execute on a virtual machine. Such languages often have dynamic features that require rich runtime support and preclude traditional static optimization. Implementations of these languages have employed dynamic optimization strategies to achieve significant performance improvements. In this paper the author describes some of these strategies and demonstrates their effectiveness. The author then argues that further advances in this field are being hindered by our bias toward adapting traditional static optimization techniques. Instead, we need to think more like a computer architect to create new approaches to optimization in virtual machines.

  • Context threading: a flexible and efficient dispatch technique for virtual machine interpreters

    Publication Year: 2005 , Page(s): 15 - 26
    Cited by:  Papers (1)  |  Patents (1)
    PDF (240 KB) | HTML

    Direct-threaded interpreters use indirect branches to dispatch bytecodes, but deeply-pipelined architectures rely on branch prediction for performance. Due to the poor correlation between the virtual program's control flow and the hardware program counter, which we call the context problem, direct threading's indirect branches are poorly predicted by the hardware, limiting performance. Our dispatch technique, context threading, improves branch prediction and performance by aligning hardware and virtual machine state. Linear virtual instructions are dispatched with native calls and returns, aligning the hardware and virtual PC. Thus, sequential control flow is predicted by the hardware return stack. We convert virtual branching instructions to native branches, mobilizing the hardware's branch prediction resources. We evaluate the impact of context threading on both branch prediction and performance using interpreters for Java and OCaml on the Pentium and PowerPC architectures. On the Pentium IV our technique reduces mean mispredicted branches by 95%. On the PowerPC, it reduces mean branch stall cycles by 75% for OCaml and 82% for Java. Due to reduced branch hazards, context threading reduces mean execution time by 25% for Java and by 19% and 37% for OCaml on the P4 and PPC970, respectively. We also combine context threading with a conservative inlining technique and find its performance comparable to that of selective inlining.
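
    For reference, here is a minimal direct-threaded interpreter in GNU C (computed goto); the bytecode and opcode names are invented. Every handler ends in the same indirect branch, which is exactly the poorly predicted dispatch that context threading replaces with native call/return pairs so the hardware return stack can do the prediction.

        /* Direct-threaded dispatch sketch (GNU C computed goto). */
        #include <stdio.h>

        int main(void) {
            enum { PUSH, ADD, PRINT, HALT };
            /* Bytecode: push 2, push 3, add, print, halt. */
            static const int code[] = { PUSH, 2, PUSH, 3, ADD, PRINT, HALT };
            /* Dispatch table: one handler label per virtual opcode. */
            static void *dispatch[] = { &&op_push, &&op_add, &&op_print, &&op_halt };
            int stack[16], *sp = stack;
            const int *vpc = code;

            /* Each handler re-dispatches through one indirect branch whose
             * target depends on the virtual PC, not the hardware PC -- the
             * "context problem" that defeats the branch target buffer. */
            goto *dispatch[*vpc++];
        op_push:  *sp++ = *vpc++;           goto *dispatch[*vpc++];
        op_add:   sp--; sp[-1] += sp[0];    goto *dispatch[*vpc++];
        op_print: printf("%d\n", sp[-1]);   goto *dispatch[*vpc++];
        op_halt:  return 0;
        }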

  • Automatically reducing repetitive synchronization with a just-in-time compiler for Java

    Publication Year: 2005 , Page(s): 27 - 36
    Cited by:  Papers (1)  |  Patents (1)
    PDF (200 KB) | HTML

    We describe an automatic technique to remove repetitive synchronization in Java™ programs by removing selected MONITORENTER/EXIT operations. Once these operations are removed, parts of a method that were not originally locked become protected by a lock. If it is unsafe to synchronize the code between the original locked regions, however, the code is not transformed. Scalability is also protected by not allowing a lock to be held for a significantly longer time than it would be held in the original program code. Our base algorithm improved the throughput of the industry-standard SPECjbb2000 benchmark by 2% to 5% for three different platforms. We also describe an extension to our algorithm to better handle virtual calls, which are prevalent in Java code, and this extension provides up to a further 5% improvement. Our computationally efficient algorithm was implemented and evaluated in a production Just-In-Time (JIT) compiler.
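
    The transformation is easiest to see in a scalar sketch. The C/pthreads fragment below mirrors the Java MONITORENTER/EXIT removal on two adjacent critical sections guarding the same lock; the names are hypothetical, and the paper's safety and scalability checks are only noted in comments.

        #include <pthread.h>

        static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
        static long counter_a, counter_b;

        void update_before(void) {   /* original: two lock/unlock pairs */
            pthread_mutex_lock(&m);  counter_a++;  pthread_mutex_unlock(&m);
            pthread_mutex_lock(&m);  counter_b++;  pthread_mutex_unlock(&m);
        }

        void update_after(void) {    /* coarsened: one pair */
            /* Legal only if the code between the original regions is safe to
             * run while holding the lock, and the merged region does not hold
             * the lock significantly longer than the original program did. */
            pthread_mutex_lock(&m);
            counter_a++;
            counter_b++;
            pthread_mutex_unlock(&m);
        }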

  • Compile-time concurrent marking write barrier removal

    Publication Year: 2005 , Page(s): 37 - 48
    PDF (232 KB) | HTML

    Garbage collectors incorporating concurrent marking to cope with large live data sets and stringent pause time constraints have become common in recent years. The snapshot-at-the-beginning style of concurrent marking has several advantages over the incremental update alternative, but one main disadvantage: it requires the mutator to execute a significantly more expensive write barrier. This paper demonstrates that a large fraction of these write barriers are unnecessary, and may be eliminated by static analysis.
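
    To make the write-barrier cost concrete, here is a hedged C sketch of a snapshot-at-the-beginning barrier and the kind of store where static analysis can remove it; all names are invented, not the paper's implementation.

        #include <stddef.h>

        typedef struct Obj Obj;
        struct Obj { Obj *field; };

        static int marking_active;                    /* set during concurrent marking */
        static void satb_log(Obj *old) { (void)old; } /* records the overwritten ref   */

        void store_with_barrier(Obj *o, Obj *v) {
            /* SATB must preserve every object reachable in the snapshot, so the
             * old value is logged before being overwritten. */
            if (marking_active && o->field != NULL)
                satb_log(o->field);
            o->field = v;
        }

        void store_to_fresh_object(Obj *o, Obj *v) {
            /* If analysis proves o->field is still NULL here (e.g., o was just
             * allocated and never stored to), the log call is dead code and the
             * compiler can emit a plain store. */
            o->field = v;
        }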

  • Collecting and exploiting high-accuracy call graph profiles in virtual machines

    Publication Year: 2005 , Page(s): 51 - 62
    Cited by:  Papers (4)  |  Patents (9)
    PDF (248 KB) | HTML

    Due to the high dynamic frequency of virtual method calls in typical object-oriented programs, feedback-directed devirtualization and inlining is one of the most important optimizations performed by high-performance virtual machines. A critical input to effective feedback-directed inlining is an accurate dynamic call graph. In a virtual machine, the dynamic call graph is computed online during program execution. Therefore, to maximize overall system performance, the profiling mechanism must strike a balance between profile accuracy, the speed at which the profile becomes available to the optimizer, and profiling overhead. This paper introduces a new low-overhead sampling-based technique that rapidly converges on a high-accuracy dynamic call graph. We have implemented the technique in two high-performance virtual machines: Jikes RVM and J9. We empirically assess our profiling technique by reporting on the accuracy of the dynamic call graphs it computes and by demonstrating that increasing the accuracy of the dynamic call graph results in more effective feedback-directed inlining.
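
    The flavor of low-overhead call graph sampling can be sketched in a few lines of C; the flag-based trigger below is a stand-in for the timer-driven mechanism a VM would use, and all names are hypothetical rather than the Jikes RVM / J9 mechanism.

        #include <stdio.h>

        #define MAX_EDGES 64
        static struct { const char *caller, *callee; unsigned long n; } dcg[MAX_EDGES];
        static int edges;
        static volatile int sample_flag;    /* set by a timer tick in a real VM */

        static void profile_edge(const char *caller, const char *callee) {
            if (!sample_flag) return;       /* sampling keeps overhead low */
            sample_flag = 0;
            for (int i = 0; i < edges; i++)
                if (dcg[i].caller == caller && dcg[i].callee == callee) {
                    dcg[i].n++;
                    return;
                }
            if (edges < MAX_EDGES) {
                dcg[edges].caller = caller;
                dcg[edges].callee = callee;
                dcg[edges++].n = 1;
            }
        }

        static void callee(const char *from) { profile_edge(from, "callee"); }

        int main(void) {
            for (unsigned i = 0; i < 100000; i++) {
                if (i % 97 == 0) sample_flag = 1;   /* stand-in for the timer */
                callee("main");
            }
            for (int i = 0; i < edges; i++)
                printf("%s -> %s: %lu samples\n", dcg[i].caller, dcg[i].callee, dcg[i].n);
            return 0;
        }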

  • Effective adaptive computing environment management via dynamic optimization

    Publication Year: 2005 , Page(s): 63 - 73
    PDF (184 KB) | HTML

    To minimize the surging power consumption of microprocessors, adaptive computing environments (ACEs) where microarchitectural resources can be dynamically tuned to match a program's runtime requirement and characteristics are becoming increasingly common. Adaptive computing environments usually have multiple configurable hardware units, necessitating exploration of a large number of combinatorial configurations in order to identify the most energy-efficient configuration. In this paper, we propose a scheme for efficient management of multiple configurable units, utilizing the inherent capabilities of dynamic optimization systems. Most dynamic optimizers typically detect dominant code regions (hotspots). We develop an ACE management scheme where hotspot boundaries are used for phase detection and adaptation. Since hotspots are of variable sizes and are often nested, program phase behavior, which is hierarchical in nature, is automatically captured by this technique. To demonstrate the usefulness and effectiveness of our framework, we use the proposed framework to dynamically adapt the sizes of L1 data and L2 caches that have different reconfiguration latencies and overheads. Our technique reduces L1D and L2 cache energy consumption by 47% and 58%, while a popular previously proposed technique achieves reductions of only 32% and 52%, respectively.
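
    A schematic of the management loop, with everything hardware-specific stubbed out: the hotspot detector already present in a dynamic optimizer drives reconfiguration at phase boundaries. The configuration values and lookup are illustrative only, not the paper's policy.

        #include <stdio.h>

        struct cache_cfg { int l1d_kb, l2_kb; };

        /* Best known configuration per hotspot, found by earlier trials. */
        static struct cache_cfg best_cfg(int hotspot_id) {
            static const struct cache_cfg t[] = { { 8, 256 }, { 32, 1024 } };
            return t[hotspot_id & 1];
        }

        static void reconfigure(struct cache_cfg c) {
            /* Stand-in for the hardware knobs; real units have different
             * reconfiguration latencies, which is why adaptation happens at
             * hotspot boundaries rather than continuously. */
            printf("L1D=%dKB L2=%dKB\n", c.l1d_kb, c.l2_kb);
        }

        void on_hotspot_entry(int hotspot_id) { reconfigure(best_cfg(hotspot_id)); }

        int main(void) { on_hotspot_entry(0); on_hotspot_entry(1); return 0; }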

  • Maintaining consistency and bounding capacity of software code caches

    Publication Year: 2005 , Page(s): 74 - 85
    Cited by:  Papers (8)  |  Patents (3)
    PDF (256 KB) | HTML

    Software code caches are becoming ubiquitous in dynamic optimizers, runtime tool platforms, dynamic translators, fast simulators and emulators, and dynamic compilers. Caching frequently executed fragments of code provides significant performance boosts, reducing the overhead of translation and emulation and meeting or exceeding native performance in dynamic optimizers. One disadvantage of caching, memory expansion, can sometimes be ignored when executing a single application. However, as optimizers and translators are applied more and more in production systems, the memory expansion from running multiple applications simultaneously becomes problematic. A second drawback to caching is the added requirement of maintaining consistency between the code cache and the original code. On architectures like IA-32 that do not require explicit application actions when modifying code, detecting code changes is challenging. Again, consistency can be ignored for certain sets of applications, but as caching systems scale up to executing large, modern, complex programs, consistency becomes critical. This paper presents efficient schemes for keeping a software code cache consistent and for dynamically bounding code cache size to match the current working set of the application. These schemes are evaluated in the DynamoRIO runtime code manipulation system, and operate on stock hardware in the presence of multiple threads and dynamic behavior, including dynamically-loaded, generated, and even modified code.
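
    One standard way to get consistency events on IA-32-class hardware is page protection: keep source pages of translated code read-only and catch writes. The POSIX sketch below shows the mechanism only; it is not DynamoRIO's implementation, and a real system would invalidate just the fragments built from the written page.

        #include <signal.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        static char *code_page;
        static size_t page_sz;

        static void on_write_fault(int sig, siginfo_t *si, void *ctx) {
            (void)sig; (void)ctx;
            fprintf(stderr, "write to %p: flush affected fragments\n", si->si_addr);
            mprotect(code_page, page_sz, PROT_READ | PROT_WRITE);  /* let the write retry */
        }

        int main(void) {
            page_sz = (size_t)sysconf(_SC_PAGESIZE);
            code_page = mmap(NULL, page_sz, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            struct sigaction sa = { 0 };
            sa.sa_sigaction = on_write_fault;
            sa.sa_flags = SA_SIGINFO;
            sigemptyset(&sa.sa_mask);
            sigaction(SIGSEGV, &sa, NULL);

            mprotect(code_page, page_sz, PROT_READ);  /* "translated" page: read-only */
            code_page[0] = 1;                         /* app modifies its code: faults */
            printf("modification observed: %d\n", code_page[0]);
            return 0;
        }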

  • Performance of runtime optimization on BLAST

    Publication Year: 2005 , Page(s): 86 - 96
    Cited by:  Papers (2)
    PDF (224 KB) | HTML

    Optimization of a real-world application, BLAST, is used to demonstrate the limitations of static and profile-guided optimizations and to highlight the potential of runtime optimization systems. We analyze the performance profile of this application to determine performance bottlenecks and evaluate the effect of aggressive compiler optimizations on BLAST. We find that applying common optimizations (e.g. O3) can degrade performance. Profile-guided optimizations do not show much improvement across the board, as current implementations do not address critical performance bottlenecks in BLAST. In some cases, these optimizations lower performance significantly due to unexpected secondary effects of aggressive optimizations. We also apply runtime optimization to BLAST using the ADORE framework. ADORE is able to detect performance bottlenecks and deploy optimizations resulting in performance gains of up to 58% on some queries using data cache prefetching.
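
    The headline optimization, data-cache prefetching, can be approximated at source level with a compiler builtin; the list walk below is an invented example of the pointer-chasing pattern a runtime optimizer like ADORE would target (ADORE patches prefetches into the running binary, not the source).

        /* Software prefetch sketch (GCC/Clang __builtin_prefetch). */
        struct node { struct node *next; long score; };

        long sum_scores(const struct node *n) {
            long total = 0;
            while (n) {
                if (n->next)
                    __builtin_prefetch(n->next, 0, 3);  /* read, high locality */
                total += n->score;
                n = n->next;
            }
            return total;
        }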

  • Optimizing sorting with genetic algorithms

    Publication Year: 2005 , Page(s): 99 - 110
    Cited by:  Papers (7)
    PDF (280 KB) | HTML

    The growing complexity of modern processors has made the generation of highly efficient code increasingly difficult. Manual code generation is very time consuming, but it is often the only choice since the code generated by today's compiler technology often has much lower performance than the best hand-tuned codes. A promising code generation strategy, implemented by systems like ATLAS, FFTW, and SPIRAL, uses empirical search to find the parameter values of the implementation, such as the tile size and instruction schedules, that deliver near-optimal performance for a particular machine. However, this approach has only proven successful on scientific codes whose performance does not depend on the input data. In this paper we study machine learning techniques to extend empirical search to the generation of sorting routines, whose performance depends on the input characteristics and the architecture of the target machine. We build on a previous study that selects a "pure" sorting algorithm at the outset of the computation as a function of the standard deviation. The approach discussed in this paper uses genetic algorithms and a classifier system to build hierarchically-organized hybrid sorting algorithms capable of adapting to the input data. Our results show that such algorithms generated using the approach presented in this paper are quite effective at taking into account the complex interactions between architectural and input data characteristics and that the resulting code performs significantly better than conventional sorting implementations and the code generated by our earlier study. In particular, the routines generated using our approach perform better than all the commercial libraries that we tried, including IBM ESSL, Intel MKL, and the C++ STL. The best algorithm we have been able to generate is on average 26% and 62% faster than the IBM ESSL on an IBM Power 3 and IBM Power 4, respectively.
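
    A toy illustration of the underlying idea, input-dependent algorithm selection: pick the sort from a cheap runtime feature of the data. The feature, threshold, and algorithm choices are invented; the paper instead evolves a hierarchical hybrid with a genetic algorithm and classifier system.

        #include <stdlib.h>

        static int cmp(const void *a, const void *b) {
            int x = *(const int *)a, y = *(const int *)b;
            return (x > y) - (x < y);
        }
        /* Placeholders for two specialized routines. */
        static void radix_style(int *a, size_t n) { qsort(a, n, sizeof *a, cmp); }
        static void merge_style(int *a, size_t n) { qsort(a, n, sizeof *a, cmp); }

        void adaptive_sort(int *a, size_t n) {
            if (n == 0) return;
            long min = a[0], max = a[0];          /* sample the key spread */
            for (size_t i = 0; i < n; i += n / 16 + 1) {
                if (a[i] < min) min = a[i];
                if (a[i] > max) max = a[i];
            }
            if (max - min < (1L << 16)) radix_style(a, n);  /* narrow keys */
            else                        merge_style(a, n);  /* wide keys   */
        }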

  • Combining models and guided empirical search to optimize for multiple levels of the memory hierarchy

    Publication Year: 2005 , Page(s): 111 - 122
    Cited by:  Papers (14)
    PDF (336 KB) | HTML

    This paper describes an algorithm for simultaneously optimizing across multiple levels of the memory hierarchy for dense-matrix computations. Our approach combines compiler models and heuristics with guided empirical search to take advantage of their complementary strengths. The models and heuristics limit the search to a small number of candidate implementations, and the empirical results provide the most accurate information to the compiler to select among candidates and tune optimization parameter values. We have developed an initial implementation and applied this approach to two case studies, matrix multiply and Jacobi relaxation. For matrix multiply, our results on two architectures, SGI R10000 and Sun UltraSparc IIe, outperform the native compiler, and either outperform or achieve performance comparable to the ATLAS self-tuning library and the hand-tuned vendor BLAS library. Jacobi results also substantially outperform the native compilers.
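
    The combined strategy reduces to a small loop one can write down: a model keeps only a few cache-plausible tile sizes, and timing picks the winner. The sizes, matrix dimension, and harness below are illustrative, not the paper's search space.

        #include <stdio.h>
        #include <time.h>

        #define N 256
        static double A[N][N], B[N][N], C[N][N];

        static void matmul_tiled(int T) {
            for (int ii = 0; ii < N; ii += T)
                for (int kk = 0; kk < N; kk += T)
                    for (int jj = 0; jj < N; jj += T)
                        for (int i = ii; i < ii + T; i++)
                            for (int k = kk; k < kk + T; k++)
                                for (int j = jj; j < jj + T; j++)
                                    C[i][j] += A[i][k] * B[k][j];
        }

        int main(void) {
            const int candidate[] = { 16, 32, 64 };   /* survivors of the model step */
            int best = candidate[0];
            double best_t = 1e30;
            for (int c = 0; c < 3; c++) {             /* guided empirical step */
                clock_t t0 = clock();
                matmul_tiled(candidate[c]);
                double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
                if (t < best_t) { best_t = t; best = candidate[c]; }
            }
            printf("selected tile: %d\n", best);
            return 0;
        }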

  • Predicting unroll factors using supervised classification

    Publication Year: 2005 , Page(s): 123 - 134
    Cited by:  Papers (13)
    PDF (224 KB) | HTML

    Compilers base many critical decisions on abstracted architectural models. While recent research has shown that modeling is effective for some compiler problems, building accurate models requires a great deal of human time and effort. This paper describes how machine learning techniques can be leveraged to help compiler writers model complex systems. Because learning techniques can effectively make sense of high dimensional spaces, they can be a valuable tool for clarifying and discerning complex decision boundaries. In this work we focus on loop unrolling, a well-known optimization for exposing instruction level parallelism. Using the Open Research Compiler as a testbed, we demonstrate how one can use supervised learning techniques to determine the appropriateness of loop unrolling. We use more than 2,500 loops - drawn from 72 benchmarks - to train two different learning algorithms to predict unroll factors (i.e., the amount by which to unroll a loop) for any novel loop. The technique correctly predicts the unroll factor for 65% of the loops in our dataset, which leads to a 5% overall improvement for the SPEC 2000 benchmark suite (9% for the SPEC 2000 floating point benchmarks).
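
    What the learned predictor computes can be caricatured as a rule over a loop feature vector; the features, thresholds, and factors below are invented stand-ins for what the trained classifiers actually learn from the 2,500-loop dataset.

        #include <stdio.h>

        struct loop_features {
            int body_insns;   /* static size of the loop body      */
            int trip_count;   /* known or profiled iteration count */
            int has_call;     /* calls usually inhibit unrolling   */
        };

        int predict_unroll(struct loop_features f) {
            if (f.has_call || f.body_insns > 64) return 1;   /* leave rolled */
            if (f.trip_count >= 8 && f.body_insns <= 16) return 8;
            if (f.trip_count >= 4) return 4;
            return 2;
        }

        int main(void) {
            struct loop_features f = { 12, 100, 0 };
            printf("unroll factor: %d\n", predict_unroll(f));
            return 0;
        }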

  • Multicores from the compiler's perspective: a blessing or a curse?

    Publication Year: 2005
    Cited by:  Papers (2)
    PDF (58 KB) | HTML

    With all major processor vendors building multiple processor cores on a single chip, multicores are the next tour-de-force in computer architecture. The ability to parallelize and execute applications on multiple cores provides an opportunity to get processor performance back on track with Moore's law. In this article the author analyzes the seismic changes in computer architecture that led to multicores and their impact on compilation. Next, we review different facets of the multicore compiler problem using three case studies - coarse grain parallelism from the perspective of the SUIF parallelizing compiler, instruction level parallelism with the RAW ILP compiler, and language exposed parallelism using the StreamIt language and compiler. Finally, we attempt to answer the ultimate question: with multicores, can we finally realize the dream that has eluded us for so long?

  • Optimizing address code generation for array-intensive DSP applications

    Publication Year: 2005 , Page(s): 141 - 152
    Cited by:  Papers (5)
    PDF (520 KB) | HTML

    Application code size is a critical design factor for many embedded systems. Unfortunately, most available compilers optimize primarily for speed of execution rather than code density. As a result, the compiler-generated code can be much larger than necessary. In particular, past research in the DSP domain found that optimizing address code generation can be very important, since address code can account for over 50% of all program bits. This paper presents a compiler-directed scheme to minimize the number of instructions generated to manipulate the address registers found in DSP architectures. As opposed to most prior techniques, which attempt to reduce the number of such instructions through careful address register assignment, this paper proposes modifying loop access patterns in array-intensive signal processing applications. In addition, it demonstrates how the proposed scheme can cooperate with a data layout optimizer to increase its benefits further. We also discuss how optimizations that target effective address code generation can conflict with data locality-enhancing transformations. We evaluate the proposed approach using twelve array-intensive embedded applications. Our experimental results indicate that the proposed approach not only leads to significant reductions in code size but also outperforms prior efforts on reducing the code size of array-intensive DSP applications.
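
    A source-level caricature of the goal: restructure the access pattern so each reference advances by a constant stride that a DSP address register can cover with a post-increment, leaving no separate address-computation instructions. Pointer bumps model the addressing mode; the example is illustrative, not the paper's transformation.

        /* Index form: each access rebuilds an address from (base, i). */
        void scale_indexed(int *dst, const int *src, int n, int k) {
            for (int i = 0; i < n; i++)
                dst[i] = k * src[i];
        }

        /* Pointer form: src and dst act as address registers updated by
         * post-increment, the pattern that loop access pattern changes
         * expose for the address-code generator. */
        void scale_postinc(int *dst, const int *src, int n, int k) {
            while (n-- > 0)
                *dst++ = k * *src++;
        }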

  • Efficient SIMD code generation for runtime alignment and length conversion

    Publication Year: 2005 , Page(s): 153 - 164
    Cited by:  Papers (7)  |  Patents (5)
    PDF (264 KB) | HTML

    When generating code for today's multimedia extensions, one of the major challenges is dealing with memory alignment. While hand programming still yields the best-performing SIMD code, it is both time consuming and error prone. Compiler technology has greatly improved, including techniques that simdize loops with misaligned accesses by automatically rearranging misaligned memory streams in registers. Current techniques are applicable to runtime alignments, but they aggressively reduce the alignment overhead only when all alignments are known at compile time. This paper presents two major enhancements to the state of the art, improving both performance and coverage. First, we propose a novel technique to simdize loops with runtime alignment nearly as efficiently as those with compile-time misalignment. Runtime alignment is pervasive in real applications because it is either part of the algorithms, or it is an artifact of the compiler's inability to extract accurate alignment information from complex applications. Second, we incorporate length conversion operations, e.g., conversions between data of different sizes, into the alignment handling framework. Length conversions are pervasive in multimedia applications where mixed integer types are often used. Supporting length conversion can greatly improve the coverage of simdizable loops. Experimental results indicate that our runtime alignment technique achieves a 19% to 32% speedup over prior art on a benchmark stressing the impact of misaligned data. We also demonstrate speedup factors of up to 8.11 for real benchmarks over sequential execution.
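
    The stream shifting such simdization performs in vector registers can be mimicked with scalar C: take the two aligned loads that straddle a misaligned address and splice the wanted bytes out of them. A 64-bit integer stands in for a vector register, and the splice assumes a little-endian machine.

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        /* Return 8 bytes at aligned_base + offset (offset in 0..7) using only
         * the two aligned 8-byte loads that straddle the address. */
        static uint64_t load_realigned(const uint8_t *aligned_base, unsigned offset) {
            uint64_t lo, hi;
            memcpy(&lo, aligned_base, 8);       /* aligned load #1 */
            memcpy(&hi, aligned_base + 8, 8);   /* aligned load #2 */
            if (offset == 0) return lo;
            return (lo >> (8 * offset)) | (hi << (8 * (8 - offset)));
        }

        int main(void) {
            uint8_t buf[16];
            for (int i = 0; i < 16; i++) buf[i] = (uint8_t)i;
            printf("%016llx\n", (unsigned long long)load_realigned(buf, 3));
            return 0;
        }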

  • Superword-level parallelism in the presence of control flow

    Publication Year: 2005 , Page(s): 165 - 175
    Cited by:  Papers (19)  |  Patents (2)
    PDF (208 KB) | HTML

    In this paper, we describe how to extend the concept of superword-level parallelization (SLP), used for multimedia extension architectures, so that it can be applied in the presence of control flow constructs. Superword-level parallelization involves identifying scalar instructions in a large basic block that perform the same operation, and, if dependences do not prevent it, combining them into a superword operation on a multi-word object. A key insight is that we can use techniques related to optimizations for architectures supporting predicated execution, even for multimedia ISAs that do not provide hardware predication. We derive large basic blocks with predicated instructions to which SLP can be applied. We describe how to minimize overheads for superword predicates and re-introduce control flow for scalar operations. We discuss other extensions to SLP to address common features of real multimedia codes. We present automatically-generated performance results on 8 multimedia codes to demonstrate the power of this approach. We observe speedups ranging from 1.97X to 15.07X as compared to both sequential execution and SLP alone.
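
    The enabling transformation can be shown in scalar form: if-convert the branchy body into a select, so every iteration performs the same straight-line operation and iterations can be packed into superword operations (with a vector select standing in for predication on ISAs without hardware support). A hedged miniature, not the paper's algorithm:

        /* Branchy form: control flow prevents SLP from packing iterations. */
        void clamp_branchy(int *a, int n, int hi) {
            for (int i = 0; i < n; i++)
                if (a[i] > hi)
                    a[i] = hi;
        }

        /* Predicated form: one select per element, packable into vectors. */
        void clamp_selected(int *a, int n, int hi) {
            for (int i = 0; i < n; i++)
                a[i] = (a[i] > hi) ? hi : a[i];
        }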

  • Compiler managed dynamic instruction placement in a low-power code cache

    Publication Year: 2005 , Page(s): 179 - 190
    Cited by:  Papers (16)
    PDF (632 KB) | HTML

    Modern embedded microprocessors use low power on-chip memories called scratch-pad memories to store frequently executed instructions and data. Unlike traditional caches, scratch-pad memories lack the complex tag checking and comparison logic, thereby proving to be efficient in area and power. In this work, we focus on exploiting scratch-pad memories for storing hot code segments within an application. Static placement techniques focus on placing the most frequently executed portions of programs into the scratch-pad. However, static schemes are inherently limited by not allowing the contents of the scratch-pad memory to change at run time. In a large fraction of applications, the instruction memory footprints exceed the scratch-pad memory size, thereby limiting the usefulness of the scratch-pad. We propose a compiler managed dynamic placement algorithm, wherein multiple hot code sequences, or traces, are overlapped with each other in the scratch-pad memory at different points in time during execution. Special copy instructions are provided to copy the traces into the scratch-pad memory at run-time. Using a power estimate, the compiler initially selects the most frequent traces in an application for relocation into the scratch-pad memory. Through iterative code motion and redundancy elimination, copy instructions are inserted in infrequently executed regions of the code. For a 64-byte code cache, the compiler managed dynamic placement achieves an average of 64% energy improvement over the static solution in a low-power embedded microcontroller.
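
    Schematically, the inserted copy operations look like the fragment below, with a static buffer standing in for the scratch-pad window and memcpy standing in for the special copy instructions; trace selection and placement of the copies are the compiler's job, and all names are hypothetical.

        #include <string.h>

        static char spm[4096];   /* stand-in for the scratch-pad memory */

        struct trace { const char *code; unsigned size; char *spm_slot; };

        /* Emitted in an infrequently executed region, ahead of the trace's
         * next hot use; overlapping slots let traces that are hot at
         * different times share the same scratch-pad bytes. */
        void copy_trace_in(struct trace *t, unsigned spm_offset) {
            t->spm_slot = spm + spm_offset;
            memcpy(t->spm_slot, t->code, t->size);
            /* branches to the trace are then redirected to t->spm_slot */
        }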

  • Phase-aware remote profiling

    Publication Year: 2005 , Page(s): 191 - 202
    Cited by:  Papers (7)
    PDF (200 KB) | HTML

    Recent advances in networking and embedded device technology have made the vision of ubiquitous computing a reality; users can access the Internet's vast offerings anytime and anywhere. Moreover, battery-powered devices such as personal digital assistants and Web-enabled mobile phones have successfully emerged as new access points to the world's digital infrastructure. This ubiquity offers a new opportunity for software developers: users can now participate in the software development, optimization, and evolution process while they use their software. Such participation requires effective techniques for gathering profile information from remote, resource-constrained devices. Further, these techniques must be unobtrusive and transparent to the user; profiles must be gathered using minimal computation, communication, and power. Toward this end, we present a flexible hardware-software scheme for efficient remote profiling. We rely on the extraction of meta information from executing programs in the form of phases, and then use this information to guide intelligent online sampling and to manage the communication of those samples. Our results indicate that phase-based remote profiling can reduce the communication, computation, and energy consumption overheads by 50-75% over random and periodic sampling.
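
    The sampling policy itself is simple enough to sketch: profile densely on entry to an unseen phase, throttle inside phases already profiled, and transmit samples accordingly. Phase IDs would come from a phase detector; here they are passed in, and the rates are invented.

        #include <stdio.h>

        #define MAX_PHASES 32
        static int profiled[MAX_PHASES];

        /* Decide whether to take (and later transmit) a sample. */
        int should_sample(int phase, unsigned event) {
            if (!profiled[phase]) {         /* new phase: sample densely */
                profiled[phase] = 1;
                return 1;
            }
            return event % 1024 == 0;       /* known phase: occasional check */
        }

        int main(void) {
            unsigned taken = 0;
            for (unsigned e = 0; e < 10000; e++)
                taken += should_sample((int)(e / 5000), e);   /* two fake phases */
            printf("samples: %u of 10000\n", taken);
            return 0;
        }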

  • Practical path profiling for dynamic optimizers

    Publication Year: 2005 , Page(s): 205 - 216
    Cited by:  Papers (4)
    PDF (264 KB) | HTML

    Modern processors are hungry for instructions. To satisfy them, compilers need to find and optimize execution paths across multiple basic blocks. Path profiles provide this context, but their high overhead has so far limited their use by dynamic compilers. We present new techniques for low overhead online practical path profiling (PPP). Following targeted path profiling (TPP), PPP uses an edge profile to simplify path profile instrumentation (profile-guided profiling). PPP improves over prior work by (1) reducing the amount of profiling instrumentation on cold paths and paths that the edge profile predicts well and (2) reducing the cost of the remaining instrumentation. Experiments in an ahead-of-time compiler perform edge profile-guided inlining and unrolling prior to path profiling instrumentation. These transformations are faithful to staged optimization, and create longer, harder to predict paths. We introduce the branch-flow metric to measure path flow as a function of branch decisions, rather than weighting all paths equally as in prior work. On SPEC2000, PPP maintains high accuracy and coverage, but has only 5% overhead on average (ranging from -3% to 13%), making it appealing for use by dynamic compilers.
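
    PPP builds on the classic Ball-Larus substrate, in which cheap edge increments accumulate a unique path number that indexes a counter at path end; the textbook miniature below shows that substrate (PPP's contribution is deciding, from an edge profile, where this instrumentation can be thinned out or cheapened).

        #include <stdio.h>

        static unsigned long path_count[4];   /* 2 branches => 4 acyclic paths */

        void routine(int c1, int c2) {
            unsigned r = 0;                   /* path register               */
            if (c1) r += 2;                   /* edge increments chosen so   */
            if (c2) r += 1;                   /* each path gets an ID 0..3   */
            path_count[r]++;                  /* one bump per completed path */
        }

        int main(void) {
            routine(0, 0); routine(1, 0); routine(1, 1); routine(1, 1);
            for (int i = 0; i < 4; i++)
                printf("path %d: %lu\n", i, path_count[i]);
            return 0;
        }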
