International Symposium on Code Generation and Optimization (CGO 2004)

20-24 March 2004

Papers 1-25 of 35
  • Exploring code cache eviction granularities in dynamic optimization systems

    Page(s): 89 - 99

    Dynamic optimization systems store optimized or translated code in a software-managed code cache in order to maximize reuse of transformed code. Code caches store superblocks that are not fixed in size, contain links to other superblocks, and carry a high replacement overhead. These additional constraints reduce the effectiveness of conventional hardware-based cache management policies. In this paper, we explore code cache management policies that evict large blocks of code from the code cache, thus avoiding the bookkeeping overhead of managing single cache blocks. Through a combined simulation and analytical study of cache management overheads, we show that employing a medium-grained FIFO eviction policy results in an effective balance of cache management complexity and cache miss rates. Under high cache pressure the choice of medium granularity translates into a significant reduction in overall execution time versus both coarse and fine granularities.
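
    A minimal sketch of the medium-grained FIFO idea (our illustration, not the authors' implementation; class and parameter names are invented): the cache is split into a fixed number of regions, and when the current region cannot accept a new superblock, insertion advances to the next region and flushes it wholesale, so no per-block bookkeeping is needed.

        class GranularFifoCodeCache:
            """Toy code cache evicted one FIFO region at a time."""
            def __init__(self, capacity, num_regions):
                self.region_size = capacity // num_regions
                self.regions = [dict() for _ in range(num_regions)]  # tag -> size
                self.used = [0] * num_regions
                self.current = 0  # FIFO insertion pointer

            def insert(self, tag, size):
                # Assumes a single superblock is smaller than one region.
                if self.used[self.current] + size > self.region_size:
                    # Advance and evict the whole next region at once.
                    self.current = (self.current + 1) % len(self.regions)
                    self.regions[self.current].clear()
                    self.used[self.current] = 0
                self.regions[self.current][tag] = size
                self.used[self.current] += size

            def lookup(self, tag):
                return any(tag in region for region in self.regions)

    With num_regions = 1 this degenerates to a full-cache flush (coarse granularity), while a very large num_regions approaches per-block eviction (fine granularity).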

  • Optimizing translation out of SSA using renaming constraints

    Page(s): 265 - 276

    Static Single Assignment form is an intermediate representation that uses Φ instructions to merge values at each confluent point of the control flow graph. Φ instructions are not machine instructions and must be renamed back to move instructions when translating out of SSA form. Without a coalescing algorithm, the out-of-SSA translation generates many move instructions. Leung and George [A. L. Leung et al., (1999)] use an SSA form for programs represented as native machine instructions, including the use of machine dedicated registers. For this purpose, they handle renaming constraints thanks to a pinning mechanism. Pinning Φ arguments and their corresponding definition to a common resource is also a very attractive technique for coalescing variables. Extending this idea, we propose a method to reduce the Φ-related copies during the out-of-SSA translation, thanks to a pinning-based coalescing algorithm that is aware of renaming constraints. We implemented our algorithm in the STMicroelectronics Linear Assembly Optimizer [B. Dupont de Dinechin et al., (2000)]. Our experiments show interesting results when compared to the existing approaches of Leung and George [A. L. Leung et al., (1999)], Sreedhar et al. [V. Sreedhar et al., (1999)], and Appel and George for register coalescing [L. George et al., (1996)].
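
    For readers unfamiliar with Φ-related copies, a naive out-of-SSA lowering can be sketched as follows (our illustration; function and variable names are invented). Each Φ becomes one move per predecessor edge, and coalescing tries to pin destination and source to the same resource so the move disappears:

        def lower_phis(phis, preds):
            # phis: list of (dst, args), where args[i] is the value reaching
            # the Phi from predecessor preds[i].
            # Returns the copies each predecessor block must execute; a real
            # translator must also sequentialize these parallel copies.
            moves = {p: [] for p in preds}
            for dst, args in phis:
                for pred, src in zip(preds, args):
                    if src != dst:  # pinned to one resource: copy eliminated
                        moves[pred].append((dst, src))
            return moves

        # lower_phis([("x3", ["x1", "x2"])], ["B1", "B2"])
        #   -> {"B1": [("x3", "x1")], "B2": [("x3", "x2")]}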

  • Ispike: a post-link optimizer for the Intel® Itanium® architecture

    Page(s): 15 - 26

    Ispike is a post-link optimizer developed for the Intel® Itanium Processor Family (IPF) processors. The IPF architecture poses both opportunities and challenges to post-link optimizations. IPF offers a rich set of performance counters to collect detailed profile information at a low cost, which is essential for post-link optimization to be practical. At the same time, the predication and bundling features on IPF make post-link code transformation more challenging than on other architectures. In Ispike, we have implemented optimizations like code layout, instruction prefetching, data layout, and data prefetching that exploit the IPF advantages, and strategies that cope with the IPF-specific challenges. Using SPEC CINT2000 as benchmarks, we show that Ispike improves performance by as much as 40% on the Itanium® 2 processor, with average improvements of 8.5% and 9.9% over executables generated by the Intel® Electron compiler and by the Gcc compiler, respectively. We also demonstrate that statistical profiles collected via IPF performance counters and complete profiles collected via instrumentation produce equal performance benefit, but the profiling overhead is significantly lower for performance counters.

  • Extending path profiling across loop backedges and procedure boundaries

    Page(s): 251 - 262

    Since their introduction, path profiles have been used to guide the application of aggressive code optimizations and to perform instruction scheduling. However, for optimization and scheduling, it is often desirable to obtain frequency counts of paths that extend across loop iterations and cross procedure boundaries. These longer paths, referred to as interesting paths, account for over 75% of the flow in a subset of SPEC benchmarks. Although the frequency counts of interesting paths can be estimated from path profiles, the degree of imprecision of these estimates is very high. We extend Ball-Larus (BL) paths to create slightly longer overlapping paths and develop an instrumentation algorithm to collect their frequencies. While these paths are slightly longer than BL paths, they enable very precise estimation of the frequencies of potentially much longer interesting paths. Our experiments show that the average cost of collecting frequencies of overlapping paths is 86.8%, which is 4.2 times that of BL paths. However, while the average imprecision in the estimated total flow of interesting paths derived from BL path frequencies ranges from -38% to +138%, the average imprecision in flow estimates derived from overlapping path frequencies ranges only from -4% to +8%.
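
    For context, the Ball-Larus numbering that overlapping paths extend admits a compact statement; below is our own minimal sketch for an acyclic CFG (backedges removed), with invented names. Summing the edge increments along any entry-to-exit path yields a unique path index:

        def ball_larus_increments(succ, entry, exit_node):
            # succ: adjacency lists of a DAG; every non-exit node reaches exit_node.
            num_paths, inc = {}, {}
            def count(v):
                if v == exit_node:
                    return 1
                if v in num_paths:
                    return num_paths[v]
                total = 0
                for w in succ[v]:
                    inc[(v, w)] = total  # paths already numbered via earlier successors
                    total += count(w)
                num_paths[v] = total
                return total
            count(entry)
            return num_paths, inc

        # Diamond A->{B,C}, B->D, C->D: path A-B-D sums to 0, A-C-D sums to 1.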

  • FLASH: foresighted latency-aware scheduling heuristic for processors with customized datapaths

    Page(s): 201 - 212

    Application-specific instruction set processors (ASIPs) have the potential to meet the challenging cost, performance, and power goals of future embedded processors by customizing the hardware to suit an application. A central problem is creating compilers that are capable of dealing with the heterogeneous and nonuniform hardware created by the customization process. The processor datapath provides an effective area to customize, but specialized datapaths often have nonuniform connectivity between the function units, making the effective latency of a function unit dependent on the consuming operation. Traditional instruction schedulers break down in this environment due to their locally greedy nature of binding the best choice for a single operation even though that choice may be poor due to a lack of communication paths. To effectively schedule with nonuniform connectivity, we propose a foresighted latency-aware scheduling heuristic (FLASH) that performs lookahead across future scheduling steps to estimate the effects of a potential binding. FLASH combines a set of lookahead heuristics to achieve effective foresight with low compile-time overhead.

  • VHC: quickly building an optimizer for complex embedded architectures

    Page(s): 53 - 64

    To meet the high demand for powerful embedded processors, VLIW architectures are increasingly complex (e.g., multiple clusters), and moreover, they now run increasingly sophisticated control-intensive applications. As a result, developing architecture-specific compiler optimizations is becoming both increasingly critical and complex, while time-to-market constraints remain very tight. We present a novel program optimization approach, called the virtual hardware compiler (VHC), that performs as well as static compiler optimizations but requires far less compiler development effort, even for complex VLIW architectures and complex target applications. The principle is to augment the target processor simulator with superscalar-like features, observe how the target program is dynamically optimized during execution, and deduce an optimized binary for the static VLIW architecture. Developing an architecture-specific optimizer then amounts to modifying the processor simulator, which is very fast compared to adapting static compiler optimizations to an architecture. We also show that a VHC-optimized binary trained on a number of data sets performs as well as a statically-optimized binary on other test data sets. The only drawback of the approach is a greatly increased compilation time, which is often acceptable for embedded applications and devices. Using the Texas Instruments C62 VLIW processor and the associated compiler, we experimentally show that this approach performs as well as static compiler optimizations for a much lower research and development effort. Using a single-core C60 and a dual-core clustered C62 processor, we also show that the same approach can be used to efficiently retarget binary programs within a family of processors.

  • A compiler scheme for reusing intermediate computation results

    Page(s): 277 - 288

    Recent research has shown that programs often exhibit value locality. Such locality occurs when a code segment, although executed repeatedly in the program, takes only a small number of different values as input and, naturally, generates a small number of different outputs. It is potentially beneficial to replace such a code segment by a table, which records the computation results for the previous inputs. When the program execution re-enters the code segment with a repeated input, its computation can be simplified to a table look-up. We discuss a compiler scheme to identify code segments that are good candidates for computation reuse. We discuss the conditions under which the table look-up costs less than repeating the execution, and we perform profiling to identify candidates that have many repeated inputs at run time. Compared to previous work, this scheme requires no special hardware support and is therefore particularly useful for resource-constrained systems such as handheld computing devices. We implement our scheme and its supporting analyses in GCC. We experiment with several multimedia benchmarks and the GNU Go game by executing them on a handheld computing device. The results show that the scheme improves performance and reduces energy consumption quite substantially for these programs.
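
    The core mechanism is essentially compiler-inserted memoization of a region. A minimal sketch (ours; names invented), assuming the region's live-in values can be packed into a hashable tuple:

        def reuse_region(table, inputs, compute, max_entries=64):
            # inputs: tuple of the region's live-in values.
            if inputs in table:
                return table[inputs]        # reuse the recorded outputs
            result = compute(*inputs)       # miss: execute the region
            if len(table) < max_entries:    # bound the table, as a compiler must
                table[inputs] = result
            return result

    The paper's contribution is deciding statically, with profile help, which regions make the look-up cheaper than re-execution; this sketch shows only the run-time table discipline.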

  • LLVM: a compilation framework for lifelong program analysis & transformation

    Page(s): 75 - 86

    We describe LLVM (low level virtual machine), a compiler framework designed to support transparent, lifelong program analysis and transformation for arbitrary programs, by providing high-level information to compiler transformations at compile-time, link-time, run-time, and in idle time between runs. LLVM defines a common, low-level code representation in static single assignment (SSA) form, with several novel features: a simple, language-independent type-system that exposes the primitives commonly used to implement high-level language features; an instruction for typed address arithmetic; and a simple mechanism that can be used to implement the exception handling features of high-level languages (and setjmp/longjmp in C) uniformly and efficiently. The LLVM compiler framework and code representation together provide a combination of key capabilities that are important for practical, lifelong analysis and transformation of programs. To our knowledge, no existing compilation approach provides all these capabilities. We describe the design of the LLVM representation and compiler framework, and evaluate the design in three ways: (a) the size and effectiveness of the representation, including the type information it provides; (b) compiler performance for several interprocedural problems; and (c) illustrative examples of the benefits LLVM provides for several challenging compiler problems.

  • Custom data layout for memory parallelism

    Page(s): 291 - 302

    We describe a generalized approach to deriving a custom data layout in multiple memory banks for array-based computations, to facilitate high-bandwidth parallel memory accesses in modern architectures, where multiple memory banks can simultaneously feed one or more functional units. We do not use a fixed data layout, but rather select application-specific layouts according to access patterns in the code. A unique feature of this approach is its flexibility in the presence of code reordering transformations, such as the loop nest transformations commonly applied to array-based computations. We have implemented this algorithm in the DEFACTO system, a design environment for automatically mapping C programs to hardware implementations for FPGA-based systems. We present experimental results for five multimedia kernels that demonstrate the benefits of this approach. Our results show that custom data layout yields results as good as, or better than, naive or fixed cyclic layouts, and is significantly better for certain access patterns and in the presence of code reordering transformations. When used in conjunction with unrolling loops in a nest to expose instruction-level parallelism, we observe greater than a 75% reduction in the number of memory access cycles and speedups ranging from 3.96 to 46.7 for 8 memories, as compared to using a single memory with no unrolling.
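
    To make the contrast concrete, here is our toy comparison of a fixed cyclic layout with an access-pattern-aware one (invented names; not the DEFACTO algorithm). If an unrolled loop touches elements i, i+stride, i+2*stride, ..., dividing by the stride before taking the modulus spreads those accesses across distinct banks:

        def cyclic_bank(index, num_banks):
            # Fixed layout: element i lives in bank i mod N.
            return index % num_banks

        def custom_bank(index, num_banks, stride):
            # Strided accesses i, i+stride, ... land in consecutive banks,
            # so one unrolled iteration's loads can proceed in parallel.
            return (index // stride) % num_banks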

  • Static identification of delinquent loads

    Page(s): 303 - 314

    The effective use of processor caches is crucial to the performance of applications. It has been shown that cache misses are not evenly distributed throughout a program. In applications running on RISC-style processors, a small number of delinquent load instructions are responsible for most of the cache misses. Identification of delinquent loads is the key to the success of many cache optimization and prefetching techniques. We propose a method for identifying delinquent loads that can be implemented at compile time. Our experiments over eighteen benchmarks from the SPEC suite show that our proposed scheme is stable across benchmarks, inputs, and cache structures, identifying an average of 10% of the loads in the benchmarks we tested, which account for over 90% of all data cache misses. As far as we know, this is the first time a technique for static delinquent load identification with such a level of precision and coverage has been reported. While comparable techniques can also identify load instructions that cover 90% of all data cache misses, they do so by selecting over 50% of all load instructions in the code, resulting in a high number of false positives. If basic block profiling is used in conjunction with our heuristic, our results show that it is possible to pin down just 1.3% of the load instructions while still accounting for 82% of all data cache misses.
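
    The coverage figures above can be computed from a miss histogram with a simple greedy pass; here is our toy version of the metric (not the paper's static heuristic itself):

        def covering_loads(miss_counts, coverage=0.90):
            # miss_counts: {load_pc: profiled miss count}. Returns the
            # smallest set of loads accounting for the target miss fraction.
            total = sum(miss_counts.values())
            picked, acc = [], 0
            for load, n in sorted(miss_counts.items(), key=lambda kv: -kv[1]):
                if acc >= coverage * total:
                    break
                picked.append(load)
                acc += n
            return picked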

  • Specialized dynamic optimizations for high-performance energy-efficient microarchitecture

    Page(s): 137 - 148

    We study several major characteristics of dynamic optimization within the PARROT power-aware, trace-cache-based microarchitectural framework. We investigate the benefit of providing optimizations which, although tightly coupled with the microarchitecture in substance, are decoupled from it in time. The tight coupling in substance provides the potential for tailoring optimizations to the microarchitecture in a manner impossible or impractical not only for traditional static compilers but even for a JIT. We show that the contribution of common, generic optimizations to processor performance and energy efficiency may be more than doubled by creating a more intimate correlation between hardware specifics and the optimizer. In particular, dynamic optimizations can profit greatly from hardware supporting fused and SIMDified operations. At the same time, the decoupling in time allows optimizations to be arbitrarily aggressive without significant performance loss. We demonstrate that requiring up to 512 repetitions before a trace is optimized sacrifices almost no performance or efficiency as compared with lower thresholds. These results confirm the feasibility of an energy-efficient hardware implementation of an aggressive optimizer.

  • Software-controlled operand-gating

    Page(s): 125 - 136

    Operand gating is a technique for improving processor energy efficiency by gating off sections of the data path that are unneeded by short-precision (narrow) operands. A method for implementing software-controlled operand gating is proposed and evaluated. The instruction set architecture (ISA) is enhanced to include opcodes that specify operand widths (if not already included in the ISA). A compiler or a binary translator uses statically available information to determine initial value ranges. The technique is enhanced through a profile-based analysis that results in the specialization of certain code regions for a given value range. After the analysis, instruction opcodes are assigned using the minimum required width. To evaluate this technique, the Alpha instruction set is enhanced to include opcodes for 8, 16, and 32 bit operands. Applying the proposed software technique to the SPECint95 benchmarks results in energy-delay savings of 14%. When combined with previously proposed hardware-based techniques, the energy-delay benefit is 28%.
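
    The opcode-assignment step reduces to picking the narrowest width whose range covers a variable's proven value range. A minimal sketch under that assumption (ours; the width set mirrors the enhanced Alpha opcodes):

        def min_operand_width(lo, hi, widths=(8, 16, 32, 64)):
            # Narrowest signed width covering the static value range [lo, hi].
            for w in widths:
                if -(1 << (w - 1)) <= lo and hi < (1 << (w - 1)):
                    return w
            raise ValueError("range exceeds the widest operand")

        # min_operand_width(0, 200) -> 16 (200 does not fit in signed 8 bits)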

  • Exposing memory access regularities using object-relative memory profiling

    Page(s): 315 - 323

    Memory profiling is the process of characterizing a program's memory behavior by observing and recording its response to specific input sets. Relevant aspects of the program's memory behavior may then be used to guide memory optimizations in an aggressively optimizing compiler. In general, memory access behavior has eluded meaningful characterization because of confounding artifacts from memory allocators, linker data layout, and OS memory management. Since these artifacts may change from run to run, memory access patterns may appear different in each run even for the same input set. Worse, regular memory access behavior such as linked list traversals appear to have no structure. We present object-relative translation and decomposition techniques to eliminate these artifacts and to expose previously obscured memory access patterns. To demonstrate the potential of these ideas, we implement two different memory profilers targeted at different sets of applications. These profilers outperform the existing ones in terms of profile size and useful information per byte of data. The first profiler is a lossless profiler, called WHOMP, which uses object-relativity to achieve a 22% better compression than the previously best known scheme. The second profiler, called LEAP, uses lossy compression to get highly compact profiles while providing useful information to the targeted applications. LEAP correctly characterizes the memory alias rates for 56% more instruction pairs than the previously best known scheme with a practical running time.
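
    The translation step can be pictured as rewriting each absolute address as an (allocation id, offset) pair, so a layout decided by the allocator or linker no longer perturbs the trace. A minimal sketch (ours; names invented), assuming a recorded allocation trace sorted by base address:

        import bisect

        def object_relative(addr, allocations):
            # allocations: [(base, size, alloc_id)] sorted by base address.
            bases = [a[0] for a in allocations]
            i = bisect.bisect_right(bases, addr) - 1
            if i >= 0:
                base, size, alloc_id = allocations[i]
                if addr < base + size:
                    return (alloc_id, addr - base)
            return (None, addr)  # not in a tracked object: keep absolute

    Under this view a linked-list traversal becomes a repeating sequence of (id, offset) pairs, exposing the regularity that the absolute addresses hid.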

  • Physical experimentation with prefetching helper threads on Intel's hyper-threaded processors

    Page(s): 27 - 38

    Pre-execution techniques have received much attention as an effective way of prefetching cache blocks to tolerate the ever-increasing memory latency. A number of pre-execution techniques based on hardware, compiler, or both have been proposed and studied extensively by researchers. They report promising results on simulators that model a simultaneous multithreading (SMT) processor. We apply the helper threading idea on a real multithreaded machine, i.e., an Intel Pentium 4 processor with hyper-threading technology, and show that it can indeed provide wall-clock speedup on real silicon. To achieve further performance improvements via helper threads, we investigate three helper threading scenarios that are driven by automated compiler infrastructure, and identify several key challenges and opportunities for novel hardware and software optimizations. Our study shows that program behavior changes dynamically during execution. In addition, certain critical hardware structures in the hyper-threaded processors are either shared or partitioned in multithreading mode, and thus the tradeoffs regarding resource contention can be intricate. Therefore, it is essential to judiciously invoke helper threads by adapting to the dynamic program behavior so that we can alleviate potential performance degradation due to resource contention. Moreover, since adapting to the dynamic behavior requires frequent thread synchronization, having light-weight thread synchronization mechanisms is important.

  • SYZYGY - a framework for scalable cross-module IPO

    Page(s): 65 - 74

    Performing analysis across module boundaries for an entire program is important for exploiting several runtime performance opportunities. However, due to scalability problems in existing full-program analysis frameworks, such performance opportunities are only realized by paying tremendous compile-time costs. Alternative solutions, such as partial compilations or user assertions, are complicated or unsafe, and as a result, not many commercial applications are compiled today with cross-module optimizations. We present SYZYGY, a practical framework for performing efficient, scalable, interprocedural optimizations. The framework is implemented in the HP-UX Itanium® compilers and we have successfully compiled many very large applications consisting of millions of lines of code. We achieved performance improvements of up to 40% over optimization level two and compilation time improvements on the order of 100% and more compared to a previous approach.

  • Using dynamic binary translation to fuse dependent instructions

    Page(s): 213 - 224

    Instruction scheduling hardware can be simplified and easily pipelined if pairs of dependent instructions are fused so they share a single instruction scheduling slot. We study an implementation of the x86 ISA that dynamically translates x86 code to an underlying ISA that supports instruction fusing. A microarchitecture that is codesigned with the fused instruction set completes the implementation. We focus on the dynamic binary translator for such a codesigned x86 virtual machine. The dynamic binary translator first cracks x86 instructions belonging to hot superblocks into RISC-style microoperations, and then uses heuristics to fuse together pairs of dependent microoperations. Experimental results with SPEC2000 integer benchmarks demonstrate that: (1) the fused ISA with dynamic binary translation reduces the number of scheduling decisions by about 30% versus a conventional implementation that uses hardware cracking into RISC microoperations; (2) an instruction scheduling slot needs to hold only two source register fields even though it may hold two instructions; (3) translations generated in the proposed ISA consume about 30% less storage than a corresponding fixed-length RISC-style ISA.
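
    The pairing step can be caricatured in a few lines (our sketch; real translators also check operand-field, latency, and ordering constraints that are omitted here):

        def fuse_pairs(ops, edges):
            # ops: micro-ops in program order; edges: (producer, consumer) pairs.
            consumers = {}
            for p, c in edges:
                consumers.setdefault(p, []).append(c)
            fused, taken = [], set()
            for op in ops:
                if op in taken:
                    continue
                cand = consumers.get(op, [])
                if len(cand) == 1 and cand[0] not in taken and cand[0] != op:
                    fused.append((op, cand[0]))  # pair occupies one slot
                    taken.update((op, cand[0]))
                else:
                    fused.append((op,))
                    taken.add(op)
            return fused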

  • Code generation for single-dimension software pipelining of multi-dimensional loops

    Page(s): 175 - 186

    Traditionally, software pipelining is applied either to the innermost loop of a given loop nest or from the innermost loop to the outer loops. We proposed a scheduling method, called single-dimension software pipelining (SSP), to software pipeline a multidimensional loop nest at an arbitrary loop level. We describe our solution to SSP code generation. In contrast to traditional software pipelining, SSP handles two distinct repetitive patterns, and thus requires new code generation algorithms. Further, these two distinct repetitive patterns complicate register assignment and require two levels of register renaming. As rotating registers support renaming at only one level, our solution is based on a combination of dynamic register renaming (using rotating registers) and static register renaming (using code replication). Finally, code size increase, an even more important issue for SSP than for traditional software pipelining, is also addressed. Optimizations are proposed to reduce code size without significant performance degradation. We first present a code generation scheme and subsequently implement it for the IA-64 architecture, making effective use of rotating registers and predicated execution. We present some initial experimental results, which demonstrate not only the feasibility and correctness of our code generation scheme, but also its code quality.

  • A dynamically tuned sorting library

    Page(s): 111 - 122

    Empirical search is a strategy used during the installation of library generators such as ATLAS, FFTW, and SPIRAL to identify the algorithm or the version of an algorithm that delivers the best performance. In the past, empirical search has been applied almost exclusively to scientific problems. In this paper, we discuss the application of empirical search to sorting, which is one of the best understood symbolic computing problems. When contrasted with the dense numerical computations of ATLAS, FFTW, and SPIRAL, sorting presents a new challenge, namely that the relative performance of the algorithms depends not only on the characteristics of the target machine and the size of the input data but also on the distribution of values in the input data set. Empirical search is applied in the study reported here as part of a sorting library generator. The resulting routines dynamically adapt to the characteristics of the input data by selecting the best sorting algorithm from a small set of alternatives. To generate the run-time selection mechanism, our generator makes use of machine learning to predict the best algorithm as a function of the characteristics of the input data set and the performance of the different algorithms on the target machine. This prediction is based on the data obtained through empirical search at installation time. Our results show that our approach is quite effective. When sorting data inputs of 12M keys with various standard deviations, our adaptive approach selected the best algorithm for all the input data sets and all platforms that we tried in our experiments. The wrong decision could have introduced a performance degradation of up to 133%, with an average value of 44%.
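
    A run-time selection mechanism of this flavor might look as follows (our sketch; the bucket features and thresholds are invented, whereas the paper learns its predictor from installation-time measurements):

        import random, statistics

        def choose_sort(keys, dispatch):
            # dispatch: {(size_bucket, spread_bucket): sort_function}, filled
            # in by timing candidate algorithms at install time.
            # Assumes keys is a non-empty sequence of numbers.
            sample = random.sample(keys, min(1000, len(keys)))
            size_bucket = "large" if len(keys) > 4_000_000 else "small"
            spread = statistics.pstdev(sample)
            spread_bucket = "wide" if spread > 100_000 else "narrow"
            return dispatch[(size_bucket, spread_bucket)]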

  • Compiler optimization of memory-resident value communication between speculative threads

    Page(s): 39 - 50

    Efficient inter-thread value communication is essential for improving performance in thread-level speculation (TLS). Although several mechanisms for improving value communication using hardware support have been proposed, there is relatively little work on exploiting the potential of compiler optimization. Building on recent research on compiler optimization of scalar value communication between speculative threads, we propose compiler techniques for the optimization of memory-resident values. In TLS, data dependences through memory-resident values are tracked by the underlying hardware and preserved by re-executing any speculative thread that violates a dependence; however, re-execution incurs a large performance penalty and should be used only to resolve data dependences that are infrequent. In contrast, value communication for frequently-occurring data dependences must be very efficient. We propose using the compiler to first identify frequently-occurring memory-resident data dependences, then insert synchronization for communicating values to preserve these dependences. We find that by synchronizing frequently-occurring data dependences we can significantly improve the efficiency of parallel execution. A comparison between compiler-inserted and hardware-inserted memory synchronization reveals that the two techniques are complementary, with each technique benefitting different benchmarks.
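
    Compiler-inserted synchronization for one frequently-violated dependence amounts to a signal/wait pair around the memory-resident value; here is a minimal sketch using ordinary threads (ours; TLS hardware would provide much lighter-weight primitives):

        import threading

        class DependenceChannel:
            def __init__(self):
                self._ready = threading.Event()
                self._value = None

            def signal(self, value):   # inserted after the producer's store
                self._value = value
                self._ready.set()

            def wait(self):            # inserted before the consumer's load
                self._ready.wait()
                return self._value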

  • Single-dimension software pipelining for multi-dimensional loops

    Page(s): 163 - 174

    Traditionally, software pipelining is applied either to the innermost loop of a given loop nest or from the innermost loop to outer loops. We propose a three-step approach, called single-dimension software pipelining (SSP), to software pipeline a loop nest at an arbitrary loop level. The first step identifies the most profitable loop level for software pipelining in terms of initiation rate or data reuse potential. The second step simplifies the multidimensional data-dependence graph (DDG) into a 1-dimensional DDG and constructs a 1-dimensional schedule for the selected loop level. The third step derives a simple mapping function which specifies the schedule time for the operations of the multidimensional loop, based on the 1-dimensional schedule. We prove that the SSP method is correct and at least as efficient as other modulo scheduling methods. We establish the feasibility and correctness of our approach by implementing it on the IA-64 architecture. Experimental results on a small number of loops show significant performance improvements over existing modulo scheduling methods that software pipeline a loop nest from the innermost loop.

  • Probabilistic predicate-aware modulo scheduling

    Page(s): 151 - 162

    Predicated execution enables the removal of branches by converting segments of branching code into sequences of conditional operations. An important side effect of this transformation is that the compiler must unconditionally assign resources to predicated operations. However, a resource is only put to productive use when the predicate associated with an operation evaluates to True. To reduce this superfluous commitment of resources, we propose probabilistic predicate-aware scheduling to assign multiple operations to the same resource at the same time, thereby over-subscribing its use. Assignment is performed in a probabilistic manner using a combination of predicate profile information and predicate analysis aimed at maximizing the benefits of over-subscription in view of the expected degree of conflict. Conflicts occur when two or more operations assigned to the same resource have their predicates evaluate to True. A predicate-aware VLIW processor pipeline detects such conflicts, recovers, and correctly executes the conflicting operations. By increasing the effective throughput of a fixed set of resources, probabilistic predicate-aware scheduling provided an average of 20% performance gain in our evaluations on a 4-issue processor, and 8% gain on a 6-issue processor.
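
    The over-subscription decision can be framed probabilistically: co-schedule two predicated operations on one resource only if the chance that both predicates are True is small. A toy version of that test (ours; the paper combines profiled probabilities with predicate analysis rather than assuming independence):

        def conflict_probability(p_a, p_b, disjoint=False):
            # p_a, p_b: profiled probabilities that each predicate is True.
            if disjoint:       # analysis proved the predicates mutually exclusive
                return 0.0
            return p_a * p_b   # independence assumption, a simplification

        def may_share_slot(p_a, p_b, threshold=0.05, disjoint=False):
            return conflict_probability(p_a, p_b, disjoint) <= threshold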

  • Improving 64-bit Java IPF performance by compressing heap references

    Page(s): 100 - 110

    64-bit processor architectures like the Intel® Itanium® processor family are designed for large applications that need large memory addresses. When running applications that fit within a 32-bit address space, 64-bit CPUs are at a disadvantage compared to 32-bit CPUs because of the larger memory footprints for their data. This results in worse cache and TLB utilization, and consequently lower performance because of increased miss ratios. This paper considers software techniques for virtual machines that allow 32-bit pointers to be used on 64-bit CPUs for managed runtime applications that do not need the full 64-bit address space. We describe our pointer compression techniques and discuss our experience implementing these for Java applications. In addition, we give performance results with our techniques for both the SPEC JVM98 and SPEC JBB2000 benchmarks. We demonstrate a 12% performance improvement on SPEC JBB2000 and a reduction in the number of garbage collections required for a given heap size.
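
    The basic trick is to store each 64-bit heap reference as a 32-bit offset from a fixed heap base, decompressing on load. A minimal sketch (ours; the constant and names are invented):

        HEAP_BASE = 0x0000_2000_0000_0000  # hypothetical managed-heap base

        def compress(ptr):
            offset = ptr - HEAP_BASE
            assert 0 <= offset < (1 << 32), "live heap must fit in 4 GB"
            return offset

        def decompress(offset):
            return HEAP_BASE + offset

    Shifting out alignment bits before storing would stretch the reachable heap beyond 4 GB at the cost of an extra shift on each access.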

  • The accuracy of initial prediction in two-phase dynamic binary translators

    Page(s): 227 - 238

    Dynamic binary translators use a two-phase approach to identify and optimize frequently executed code dynamically. In the first phase (profiling phase), blocks of code are interpreted or quickly translated to collect execution frequency information for the blocks. In the second phase (optimization phase), frequently executed blocks are grouped into regions and advanced optimizations are applied to them. This approach implicitly assumes that the initial profile of each block is representative of the block throughout its lifetime. We investigate the ability of the initial profile to predict the average program behavior. We compare the behavior predicted from initial executions of varying lengths with the average behavior over the whole program execution, and use the prediction from the training input as the reference. Our results indicate that, for the SPEC2000 benchmarks, even very short initial profiles have prediction accuracy comparable to traditional profile-guided optimization using the training input, although the initial profile is inadequate for predicting loop trip count information for some integer programs, and several benchmarks could benefit from phase-awareness during dynamic binary translation.

  • Targeted path profiling: lower overhead path profiling for staged dynamic optimization systems

    Page(s): 239 - 250

    We present a technique for reducing the overhead of collecting path profiles in the context of a dynamic optimizer. The key idea of our approach, called targeted path profiling (TPP), is to use an edge profile to simplify the collection of a path profile. This notion of profile-guided profiling is a natural fit for dynamic optimizers, which typically optimize the code in a series of stages. TPP is an extension of the Ball-Larus efficient path profiling algorithm. Its increased efficiency comes from two sources: (i) reducing the number of potential paths by not enumerating paths with cold edges, allowing array accesses to be substituted for more expensive hash table lookups, and (ii) not instrumenting regions where paths can be unambiguously derived from an edge profile. Our results suggest that on average the overhead of profile collection can be reduced by half (SPEC95) to almost two-thirds (SPEC2000) relative to the Ball-Larus algorithm, with minimal impact on the information collected.
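
    The first of the two ideas, dropping cold paths from the enumeration, can be sketched by filtering the CFG with the stage-one edge profile before applying Ball-Larus numbering (our illustration; names invented, and a real implementation still counts the rare cold paths through a fallback bucket):

        def hot_subgraph(succ, edge_counts, threshold):
            # Keep only edges whose stage-one count meets the threshold, so
            # far fewer paths remain and array indexing can replace hashing.
            return {v: [w for w in ws
                        if edge_counts.get((v, w), 0) >= threshold]
                    for v, ws in succ.items()}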

  • Exploring the performance potential of Itanium® processors with ILP-based scheduling

    Page(s): 189 - 200

    HP and Intel's Itanium processor family (IPF) is considered one of the most challenging processor architectures to generate code for. During global instruction scheduling, the compiler must balance the use of strongly interdependent techniques like code motion, speculation, and predication. Too conservative an application of these features can lead to empty execution slots, contrary to the EPIC philosophy, but overuse can cause resource shortages that spoil the benefit. We tackle this problem using integer linear programming (ILP), a proven standard optimization method. Our ILP model comprises global, partial-ready code motion with automated generation of compensation code, as well as vital IPF features like control/data speculation and predication. The ILP approach can, with some restrictions, resolve the interdependences between these decisions and deliver the global optimum. This promises a speedup for compute-intensive applications as well as some theoretically founded insights into the potential of the architecture. Experiments with several hot functions from the SPEC benchmarks show substantial improvements: our postpass optimizer reduces the schedule lengths produced by Intel's compiler by about 20-40%. The resulting speedup of these routines is 16% on average.
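
    What "global optimum" buys over greedy scheduling can be seen even with a brute-force stand-in for the ILP solver (our toy, nothing like the paper's model: unit latencies, a bare issue-width constraint, and exhaustive search that is tractable only for tiny regions):

        from itertools import combinations

        def optimal_length(deps, n_ops, width):
            # deps: op -> set of predecessors (must be acyclic). With unit
            # latencies some optimal schedule never idles, so we try every
            # maximal issue group per cycle and keep the shortest completion.
            best = [n_ops]  # one op per cycle is always feasible on a DAG
            def explore(done, cycle):
                remaining = n_ops - len(done)
                if remaining == 0:
                    best[0] = min(best[0], cycle)
                    return
                if cycle + -(-remaining // width) >= best[0]:
                    return  # cannot beat the incumbent
                ready = [i for i in range(n_ops)
                         if i not in done and deps.get(i, set()) <= done]
                for group in combinations(ready, min(width, len(ready))):
                    explore(done | set(group), cycle + 1)
            explore(set(), 0)
            return best[0]

        # optimal_length({2: {0, 1}}, 3, 2) -> 2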
