
Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques

Date: 18 Oct. 1998


Displaying results 1-25 of 53
  • Proceedings. 1998 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.98EX192)

    Publication Year: 1998
    PDF (89 KB)
    Freely Available from IEEE
  • Table of contents

    Publication Year: 1998, Page(s): v-viii
    PDF (205 KB)
    Freely Available from IEEE
  • Author index

    Publication Year: 1998, Page(s): 434-435
    PDF (210 KB)
    Freely Available from IEEE
  • Instance-wise reaching definition analysis for recursive programs using context-free transductions

    Publication Year: 1998, Page(s): 332-339
    PDF (136 KB)

    Automatic parallelization of recursive programs is still an open problem today, lacking suitable and precise static analyses. We present a novel reaching definition framework based on context-free transductions. The technique achieves a global and precise description of the data flow and discovers important semantic properties of programs. Taking the example of a real-world program that cannot be converted to iterative form, we show the need for a reaching definition analysis able to handle run-time instances of statements separately. A running example sketches our parallelization scheme and presents our reaching definition analysis. We also point to fruitful future research at the crossroads of program analysis and formal language theory.

  • Adaptive scheduling of computations and communications on distributed memory systems

    Publication Year: 1998, Page(s): 366-373
    PDF (192 KB)

    Compile-time scheduling is one approach to extracting parallelism that has proved effective when the execution behavior is predictable. Unfortunately, the performance of most priority-based scheduling algorithms is computation dependent. Scheduling by earliest-task-first (ETF) produces reasonably short schedules only when the available parallelism is large enough to cover the communications; a priority-based decision is much more effective when parallelism is low. We propose a scheduling approach in which the decision function combines (1) task level as a global priority and (2) earliest-task-first as a local priority. The degree of dominance of each is controlled by a computation-profile factor capturing task parallelism and communication. An iterative (forward and backward) scheduler is proposed for tuning the solution: in each iteration, the new schedule is used to sharpen the task levels, which helps find shorter schedules in the next iteration. Evaluation uses synthetic task graphs with communication times for which optimum schedules are known. We find that the finish times of pure local scheduling (like ETF) and of static priority-based scheduling deviate significantly from the optimum when task parallelism is low and communication is relatively large. Our approach of adapting the scheduling decision to the computation profile produces near-optimum solutions in far fewer iterations than other approaches.

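    As a rough illustration of the hybrid decision function described above, the sketch below (our reconstruction, not the authors' code) ranks ready tasks by a blend of task level (longest path to the exit, the global priority) and earliest possible start time (the local, ETF-style priority). The weight alpha stands in for the paper's computation-profile factor, and the iterative forward/backward refinement is omitted.

        # Hypothetical list scheduler blending a global priority (task level)
        # with a local ETF-style priority (earliest start time).
        def levels(succ, cost):
            """Longest path from each task to the end of the DAG."""
            lv = {}
            def walk(t):
                if t not in lv:
                    lv[t] = cost[t] + max((walk(s) for s in succ.get(t, [])), default=0)
                return lv[t]
            for t in cost:
                walk(t)
            return lv

        def schedule(succ, cost, comm, n_procs, alpha=0.5):
            pred = {t: [] for t in cost}
            for t, ss in succ.items():
                for s in ss:
                    pred[s].append(t)
            lv = levels(succ, cost)
            proc_free = [0.0] * n_procs      # when each processor goes idle
            finish, placed = {}, {}
            ready = [t for t in cost if not pred[t]]
            while ready:
                best = None
                for t in ready:
                    for p in range(n_procs):
                        # data arrives after predecessor finish + communication
                        # (no communication if the predecessor ran on p itself)
                        arrive = max((finish[u] + (0 if placed[u] == p else comm)
                                      for u in pred[t]), default=0.0)
                        est = max(arrive, proc_free[p])
                        # lower score is better: start early, prefer high level
                        score = alpha * est - (1 - alpha) * lv[t]
                        if best is None or score < best[0]:
                            best = (score, t, p, est)
                _, t, p, est = best
                finish[t] = est + cost[t]
                placed[t] = p
                proc_free[p] = finish[t]
                ready.remove(t)
                ready += [s for s in succ.get(t, [])
                          if all(u in finish for u in pred[s])]
            return placed, finish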
  • Transformations for improving data access locality in non-perfectly nested loops

    Publication Year: 1998, Page(s): 314-321
    PDF (84 KB)

    Loop transformation techniques have matured to the point where they are well integrated into production optimizing compilers. However, we believe that new and aggressive techniques are necessary because (i) computer system manufacturers have announced plans for processors with clock frequencies over 1 GHz, for which cache miss penalties can be very high, and (ii) existing techniques have saturated, in that they have delivered the majority of the benefit obtainable for perfectly nested loops. New techniques must target improved coverage, that is, the classes of loops that can be optimized for locality of data access. In this paper, we present a new technique, the size-reduction transformation, to improve data access locality in a class of non-perfectly nested loops. The technique is very effective when existing techniques, namely linear loop and array transformations, fail to improve locality of reference. Size-reduction transformations are implemented in IBM's recently released Fortran 90 optimizing compiler and have contributed significantly to its high performance.

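    The toy example below (ours, not the paper's size-reduction transformation) shows why non-perfectly nested loops block classic linear transformations such as interchange, and what a locality-improving restructuring looks like once the nest is made perfect.

        import numpy as np

        def column_sums_imperfect(b):
            """Non-perfectly nested original: the scalar accumulator and the
            a[i] store sit between loop levels, blocking interchange, and the
            inner loop walks b column-wise (stride n)."""
            n = b.shape[0]
            a = np.empty(n)
            for i in range(n):
                t = 0.0
                for j in range(n):
                    t += b[j, i]          # stride-n access
                a[i] = t
            return a

        def column_sums_transformed(b):
            """After scalar expansion (accumulate into a[] directly) the nest
            is perfect and can be interchanged, turning the b accesses into
            stride-1 row walks."""
            n = b.shape[0]
            a = np.zeros(n)
            for j in range(n):
                for i in range(n):
                    a[i] += b[j, i]       # stride-1 access
            return a

        b = np.random.rand(256, 256)
        assert np.allclose(column_sums_imperfect(b), column_sums_transformed(b))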
  • A direct-execution framework for fast and accurate simulation of superscalar processors

    Publication Year: 1998, Page(s): 286-293
    Cited by: Papers (21) | Patents (2)
    PDF (184 KB)

    Multiprocessor system evaluation has traditionally been based on direct-execution-based execution-driven simulation (EDS). In such environments, the processor component of the system is not fully modeled. With wide-issue superscalar processors being the norm in today's multiprocessor nodes, there is an urgent need to model the processor accurately. However, using direct execution to model a superscalar processor has been considered an open problem, so current approaches model the processor by interpreting the application executable. Unfortunately, this approach can be slow. In this paper, we propose a novel direct-execution framework that allows accurate simulation of wide-issue superscalar processors without code interpretation. This is achieved with the aid of an interface window between the front end and the architectural simulator that buffers the necessary information, eliminating the need for full-fledged instruction emulation. Overall, this approach enables detailed yet fast EDS of superscalar processors. Finally, we evaluate the framework and show good performance for uni- and multiprocessor configurations.

  • General parallel computation can be performed with a cycle-free heap

    Publication Year: 1998, Page(s): 96-103
    Cited by: Papers (1) | Patents (2)
    PDF (80 KB)

    We argue that a powerful and general programming model for parallel computation exists that honors the principles of modular software construction but disallows the formation of heap cycles. We believe this cycle-free frame-and-heap model can be used as the basis for a new species of computer systems that satisfies all principles of modular software construction and offers performance and programmability beyond what is possible within the limitations of today's computer system technology.

  • Static methods in hybrid branch prediction

    Publication Year: 1998, Page(s): 222-229
    Cited by: Papers (5) | Patents (3)
    PDF (72 KB)

    Hybrid branch predictors combine the predictions of multiple single-level or two-level branch predictors. The prediction-combining hardware, the "meta-predictor", may itself be large, complex and slow. We show that the combination function is better performed statically, using prediction hints in the branch instructions; the hints are set by profiling or static analysis. Although the meta-predictor is static, the actual predictions remain dynamic, so there is little risk of worst-case performance. An important advantage of our approach is that a branch site only causes interference within a single component predictor, reducing capacity demands. We argue that our proposal is implementable and that it addresses the scaling issues currently facing hardware designers. We show that the static hybrid method we propose is more effective than existing techniques based on dynamic selection, and requires less hardware. For example, one result shows a conventional 4096-bit dynamic selection mechanism achieving a 4.7% average miss rate, while our static approach achieves 3.6%. These results are obtained with the Instruction Benchmark Suite (IBS), a realistic whole-system benchmark, and the SPECint95 suite, using realistic hardware sizes. All results are based on a cross-validation methodology, in which the profile data used for static selection come from training inputs entirely different from the inputs used to evaluate the technique.

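    The division of labor described above (static selection, dynamic component predictions) can be sketched as follows; the choice of components and the per-branch hint bit are our simplification, not the authors' exact design.

        class TwoBitCounters:
            """Table of 2-bit saturating counters, initialized weakly not-taken."""
            def __init__(self, bits=12):
                self.mask = (1 << bits) - 1
                self.table = [1] * (1 << bits)
            def predict(self, idx):
                return self.table[idx & self.mask] >= 2
            def update(self, idx, taken):
                c = self.table[idx & self.mask]
                self.table[idx & self.mask] = min(c + 1, 3) if taken else max(c - 1, 0)

        class StaticHybrid:
            """A profile-set hint bit in each branch selects the component,
            so no dynamic meta-predictor table is needed."""
            def __init__(self, bits=12):
                self.bimodal = TwoBitCounters(bits)   # PC-indexed
                self.gshare = TwoBitCounters(bits)    # PC xor global history
                self.history = 0
                self.mask = (1 << bits) - 1
            def predict(self, pc, hint_use_gshare):
                if hint_use_gshare:
                    return self.gshare.predict(pc ^ self.history)
                return self.bimodal.predict(pc)
            def update(self, pc, hint_use_gshare, taken):
                # only the statically selected component sees this branch, so a
                # branch site never interferes with the other component's table
                if hint_use_gshare:
                    self.gshare.update(pc ^ self.history, taken)
                else:
                    self.bimodal.update(pc, taken)
                self.history = ((self.history << 1) | int(taken)) & self.mask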
  • Athapascan-1: On-line building data flow graph in a parallel language

    Publication Year: 1998, Page(s): 88-95
    Cited by: Papers (8)
    PDF (124 KB)

    To achieve practical, efficient execution on a parallel architecture, knowledge of the application's data dependencies is the key to building an efficient schedule. By restricting accesses to shared memory, we show that such a data dependency graph can be computed on-line on a distributed architecture. The overhead introduced is bounded with respect to the parallelism expressed by the user: each basic computation corresponds to a user-defined task, and each data dependency to a user-defined data structure. We introduce a language named Athapascan-1 in which a graph of dependencies is built from strong typing of shared-memory accesses, and we detail the compilation and implementation of the language. Moreover, the performance of a program (parallel time, communication and arithmetic work, memory space) is defined by a cost model without the need for a machine model. We exhibit efficient scheduling with respect to these costs on theoretical machine models.

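    The core mechanism, deriving a data-flow graph on-line from the declared access mode of each shared datum, can be mimicked in a few lines. This is a toy reconstruction in Python, not Athapascan-1 syntax.

        class DataFlowGraph:
            """Tasks declare which shared objects they read and write; edges
            record true, anti and output dependences between tasks."""
            def __init__(self):
                self.edges = set()
                self.last_writer = {}   # object -> task that wrote it last
                self.readers = {}       # object -> readers since last write

            def add_task(self, task, reads=(), writes=()):
                for obj in reads:
                    if obj in self.last_writer:               # true dependence
                        self.edges.add((self.last_writer[obj], task))
                    self.readers.setdefault(obj, []).append(task)
                for obj in writes:
                    for r in self.readers.get(obj, []):       # anti dependence
                        if r != task:
                            self.edges.add((r, task))
                    w = self.last_writer.get(obj)
                    if w is not None and w != task:           # output dependence
                        self.edges.add((w, task))
                    self.last_writer[obj] = task
                    self.readers[obj] = []

        g = DataFlowGraph()
        g.add_task("t1", writes=["x"])
        g.add_task("t2", reads=["x"], writes=["y"])
        g.add_task("t3", reads=["x", "y"])
        print(sorted(g.edges))  # [('t1', 't2'), ('t1', 't3'), ('t2', 't3')]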
  • MADELEINE: an efficient and portable communication interface for RPC-based multithreaded environments

    Publication Year: 1998, Page(s): 240-247
    Cited by: Papers (2)
    PDF (88 KB)

    Owing to their ever-growing success in the development of distributed applications, today's multithreaded environments have to be highly portable and efficient on a large variety of hardware. Most of these environments are implemented on top of standard communication interfaces such as PVM or MPI, which are widely available on existing architectures. This approach obviously ensures a high level of portability. However, we show in this paper that these communication interfaces do not meet the efficiency needs of RPC-based multithreaded environments. We propose a new portable and efficient communication interface called MADELEINE, designed especially for such multithreaded environments, and we report on several implementations of MADELEINE on top of various network protocols that demonstrate the efficiency of our approach.

  • Efficient methods for multi-dimensional array redistribution

    Publication Year: 1998, Page(s): 410-417
    PDF (96 KB)

    In this paper, we present efficient methods for multidimensional array redistribution. Building on our previous work, the basic-cycle calculation technique, we present the basic-block calculation (BBC) and complete-dimension calculation (CDC) techniques. We have developed a theoretical model to analyze the computation costs of these two techniques. The model shows that the BBC method has smaller indexing costs and performs well for redistributions of small arrays, while the CDC method has smaller packing/unpacking costs and performs well when the array is large. We have also implemented these two techniques, along with the PITFALLS method and Prylli's method, on an IBM SP2 parallel machine. The experimental results show that the BBC method has the smallest execution time of the four algorithms when the array size is small, and the CDC method the smallest when the array size is large. Furthermore, the BBC method outperforms the PITFALLS method and Prylli's method on all test samples.

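    Redistribution reduces to computing, for every element, its owner under the new distribution. The brute-force sketch below (ours; the paper's BBC and CDC techniques exist precisely to avoid this per-element indexing cost) makes the communication sets explicit for a 1-D CYCLIC(x) to CYCLIC(y) redistribution.

        def owner(gidx, block, p):
            """Owner of global index gidx under a CYCLIC(block) distribution."""
            return (gidx // block) % p

        def send_sets(n, p, x, y):
            """sends[src][dst] = global indices src must ship to dst."""
            sends = [[[] for _ in range(p)] for _ in range(p)]
            for g in range(n):
                src, dst = owner(g, x, p), owner(g, y, p)
                if src != dst:
                    sends[src][dst].append(g)
            return sends

        # redistribute 16 elements over 4 processors, CYCLIC(2) -> CYCLIC(4)
        for src, row in enumerate(send_sets(16, 4, 2, 4)):
            for dst, idxs in enumerate(row):
                if idxs:
                    print(f"P{src} -> P{dst}: {idxs}")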
  • The START-VOYAGER parallel system

    Publication Year: 1998, Page(s): 185-194
    Cited by: Papers (3) | Patents (4)
    PDF (152 KB)

    This paper presents the communication architecture of the START-VOYAGER system, a parallel machine composed of a cluster of unmodified IBM 604e-based SMPs connected via a high-speed interconnection network. A custom network interface unit (NIU) plugs into a processor card slot of each SMP, providing a high-performance message-passing substrate that supports both fast user-level message passing and cache-line-coherent shared memory. The substrate consists of four hardware-implemented message-passing mechanisms that achieve high performance over a wide spectrum of communication patterns. START-VOYAGER also introduces a novel protection scheme that improves on past designs by not requiring strictly synchronized gang scheduling and by allowing system code and multiple user applications to share the network simultaneously without compromising protection or performance. Performance predictions based on synthesized Verilog code show that START-VOYAGER's message-passing mechanisms offer a definitive advantage in a multi-threaded environment without compromising single-threaded performance. Preliminary shared-memory simulations are also promising.

  • Handling cross interferences by cyclic cache line coloring

    Publication Year: 1998, Page(s): 112-117
    PDF (232 KB)

    Cross interference, i.e., conflicts between data from several arrays, is particularly severe for caches with limited associativity. We present a uniform scheme that reduces both self and cross interference. Techniques from cyclic register allocation, namely the meeting graph, help improve the usage of cache lines and avoid conflicts; cyclic graph coloring determines a new memory mapping function. Preliminary experiments show that, in spite of the penalty of the more complex indexing functions, run times are improved.

  • Using algebraic transformations to optimize expression evaluation in scientific code

    Publication Year: 1998, Page(s): 376-384
    Cited by: Papers (2)
    PDF (156 KB)

    Algebraic properties such as associativity and distributivity allow the manipulation of a set of mathematically equivalent expressions. However, as shown in this paper, the cost of evaluating such expressions on a computer is not constant within this domain. We suggest the use of algebraic transformations to improve the performance of computationally intensive applications on modern computer architectures, and we claim that taking instruction-level parallelism and the new capabilities of processors into account when applying these transformations leads to large run-time improvements. Due to combinatorial explosion, associative-commutative pattern-matching techniques cannot be used systematically in this context. We therefore introduce two performance-enhancing algorithms providing factorization and multiply-add extraction heuristics, with choice criteria based on a simple cost model. This paper describes our approach and a first implementation. Experiments on real code, including an excerpt from SPEC FP95, are very promising: we automatically obtain the same results as manual transformations, with a performance improvement of up to a factor of three.

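    Two of the rewrites the paper's heuristics search for, factorization and multiply-add extraction, look like this on a concrete polynomial (a worked example of ours; the paper drives such choices with a cost model rather than applying them blindly).

        import math

        a, b, c, x = 1.5, 2.5, 3.5, 0.1

        # original association: 3 multiplies, 2 adds
        e1 = a * x * x + b * x + c

        # factorization (Horner form): 2 multiplies, 2 adds, and each
        # multiply-add pair can feed a fused multiply-add unit
        e2 = (a * x + b) * x + c
        assert math.isclose(e1, e2)

        # multiply-add extraction: map each a*x + b onto a fused
        # multiply-add (math.fma requires Python >= 3.13)
        if hasattr(math, "fma"):
            e3 = math.fma(math.fma(a, x, b), x, c)
            assert math.isclose(e2, e3)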
  • Integrating loop and data transformations for global optimisation

    Publication Year: 1998, Page(s): 12-19
    Cited by: Papers (14)
    PDF (108 KB)

    This paper is concerned with integrating global data transformations and local loop transformations in order to minimise overhead on distributed shared memory machines such as the SGI Origin 2000. By first developing an extended algebraic transformation framework, we present a new technique to allow the static application of global data transformations, such as partitioning, to reshaped arrays, eliminating the need for expensive temporary copies and hence eliminating any communication and synchronisation. In addition, by integrating loop and data transformations, any introduced poor spatial locality and expensive array subscripts can be eliminated. A specific optimisation algorithm is derived and applied to well-known benchmarks, where it gives a significant improvement in execution time over existing approaches.

  • Exploiting method-level parallelism in single-threaded Java programs

    Publication Year: 1998, Page(s): 176-184
    Cited by: Papers (6) | Patents (2)
    PDF (52 KB)

    Method speculation of object-oriented programs attempts to exploit method-level parallelism (MLP) by executing sequential method invocations in parallel, while still maintaining correct sequential ordering of data dependencies and memory accesses. In this paper, we show why the Java virtual machine is an effective environment for exploiting method-level parallelism and demonstrate how method speculation can potentially speed up single-threaded, general-purpose Java programs. Results from our study show that significant speedups can be achieved on data-parallel applications with minimal programmer and compiler effort. On control-flow-dependent programs, moderate speedups have been achieved, suggesting that more significant improvements for these programs may come from more careful analysis or re-coding of the application. For both classes of applications, we find that performance debugging drastically improves speedups by eliminating or minimizing the dependencies that limit the effectiveness of method speculation.

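    A rough structural analogue of method speculation, minus the hard part (dependence-violation detection and rollback), is to run a method call and its continuation concurrently when they are believed independent. The names and thread-pool mechanism below are ours, and CPython's GIL means this sketch shows the structure rather than a real speedup.

        from concurrent.futures import ThreadPoolExecutor

        pool = ThreadPoolExecutor()

        def expensive_method(n):
            return sum(i * i for i in range(n))

        def caller():
            # sequential form: r = expensive_method(...); other_work(); use(r)
            future = pool.submit(expensive_method, 2_000_000)  # "speculative" call
            other = sum(range(1_000_000))        # continuation runs concurrently
            return future.result() + other       # first use of r: join point

        print(caller())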
  • Code generation in the polytope model

    Publication Year: 1998, Page(s): 106-111
    Cited by: Papers (3) | Patents (3)
    PDF (88 KB)

    Automatic parallelization of nested loops based on a mathematical model, the polytope model, has improved significantly over the last decade: state-of-the-art methods allow flexible distributions of computations in space and time, which lead to high-quality parallelism. However, these methods have not found their way into practical parallelizing compilers, due to the lack of code generation schemes able to deal with this new-found flexibility. Closing this gap is the purpose of this paper.

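    Code generation in the polytope model means turning a set of affine inequalities into exact loop bounds. The toy below (far simpler than the general schemes the paper targets) scans the triangle { (i, j) : 0 <= i <= n, i <= j <= n } first with a guard inside a bounding box, then with generated bounds and no guard.

        n = 4

        naive = [(i, j)
                 for i in range(0, n + 1)
                 for j in range(0, n + 1)
                 if i <= j]                    # guard executed at every point

        generated = [(i, j)
                     for i in range(0, n + 1)   # 0 <= i <= n
                     for j in range(i, n + 1)]  # max(0, i) <= j <= n

        assert naive == generated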
  • Capturing the effects of code improving transformations

    Publication Year: 1998, Page(s): 118-123
    Cited by: Papers (1)
    PDF (152 KB)

    Symbolic debugging of transformed code requires information about the impact of applying transformations on statement instances, so that the appropriate values can be displayed to the user. We present a technique to automatically identify statement-instance correspondences between untransformed and transformed code and to generate mappings reflecting these correspondences as code-improving transformations are applied. The mappings support classical optimizations as well as loop transformations. Establishing the mappings requires analyzing how the position, number, and order of instances of a statement can change in a particular context when transformations are applied. Beyond enabling symbolic debugging of transformed code, these mappings can be used to understand transformed code and to compare the values computed in both program versions, either manually or automatically.

  • Sirocco: cost-effective fine-grain distributed shared memory

    Publication Year: 1998, Page(s): 40-49
    Cited by: Papers (21)
    PDF (92 KB)

    Software fine-grain distributed shared memory (FGDSM) provides a simplified shared-memory programming interface with minimal or no hardware support. Originally, software FGDSMs targeted parallel machines with uniprocessor nodes. This paper presents Sirocco, a family of software FGDSMs implemented on a network of low-cost SMPs. Sirocco takes full advantage of SMP nodes by implementing inter-node sharing directly in hardware and overlapping computation with protocol execution. To maintain correct shared-memory semantics, however, SMP nodes require mechanisms to guarantee atomic coherence operations, and multiple SMP processors may contend for shared resources and reduce performance. SMP nodes also change the cost trade-off: while SMPs typically charge higher price premiums, for a given system size they substantially reduce the networking hardware required compared to uniprocessor nodes. In this paper, we ask: are SMPs cost-effective building blocks for software FGDSM? We present experimental measurements on Sirocco implementations ranging from an all-software system to a system with minimal hardware support. Together with simple cost models, we show that low-cost SMP nodes (i) deliver performance competitive with uniprocessor nodes, (ii) substantially reduce the hardware requirement and are more cost-effective than uniprocessor nodes, (iii) benefit significantly from hardware support for coherence operations, and (iv) are especially beneficial for FGDSMs with high-overhead coherence operations.

  • Efficient edge profiling for ILP-processors

    Publication Year: 1998, Page(s): 294-303
    Cited by: Papers (4)
    PDF (96 KB)

    Compilers for VLIW and superscalar machines increasingly use dynamic application behavior, or profiling information, in optimizations such as instruction scheduling, speculative code motion, and code layout. Hence, it is extremely useful to develop inexpensive techniques that gather accurate profiling information. This paper presents novel edge profiling techniques that greatly reduce run-time overhead by efficiently exploiting instruction-level parallelism between the application and its instrumentation. The best results are achieved when speculatively executing a software-pipelined version of the instrumentation code. For an 8-wide-issue machine, measurements on the SPECint95 benchmarks indicate a 10-fold reduction in overhead (from 32.8% to 3.3%) compared with previous techniques.

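    Plain edge profiling is just one counter per control-flow edge; the paper's contribution is hiding the cost of those increments with ILP. The sketch below (hypothetical block trace, not the authors' instrumentation) shows only the baseline bookkeeping being optimized.

        from collections import Counter

        def profile_edges(block_trace):
            """Count traversals of each (source, target) control-flow edge."""
            edge_counts = Counter()
            for src, dst in zip(block_trace, block_trace[1:]):
                edge_counts[(src, dst)] += 1
            return edge_counts

        # hypothetical basic-block trace for a loop with an if-else body
        trace = ["entry", "head", "then", "head", "else", "head", "then", "head", "exit"]
        for edge, count in sorted(profile_edges(trace).items()):
            print(edge, count)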
  • Efficacy and performance impact of value prediction

    Publication Year: 1998, Page(s): 148-154
    Cited by: Papers (15)
    PDF (572 KB)

    Value prediction is a technique that bypasses inter-instruction data dependencies by speculating on the outcomes of producer instructions, thereby allowing dependent consumer instructions to execute in parallel. This work makes several contributions to value prediction research. A hybrid value predictor that achieves an overall prediction rate of up to 83% is presented, and the design of a value-predicting, eight-wide superscalar machine with a speculative execution core is described; this design achieves 8.6% to 23% IPC improvements on the SPEC benchmarks. Furthermore, it is shown that prediction rate is not a good indicator of speedup, because over 40% of the predictions made may not be useful in enhancing performance, and a simple hardware mechanism that eliminates many of these useless predictions is introduced.

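    Hybrid value predictors of the kind evaluated here typically pair a last-value component with a stride component and choose per instruction by confidence. The sketch below is our simplification of that structure, not the paper's design.

        class HybridValuePredictor:
            def __init__(self):
                self.last = {}     # pc -> (value, confidence)
                self.stride = {}   # pc -> (value, stride, confidence)

            def predict(self, pc):
                lv, lc = self.last.get(pc, (None, 0))
                entry = self.stride.get(pc)
                sv = entry[0] + entry[1] if entry else None
                sc = entry[2] if entry else 0
                if max(lc, sc) == 0:
                    return None                  # no confident prediction
                return sv if sc >= lc else lv

            def update(self, pc, actual):
                lv, lc = self.last.get(pc, (None, 0))
                self.last[pc] = (actual, min(lc + 1, 3) if actual == lv else 0)
                if pc in self.stride:
                    v, s, c = self.stride[pc]
                    if actual == v + s:
                        self.stride[pc] = (actual, s, min(c + 1, 3))
                    else:
                        self.stride[pc] = (actual, actual - v, 0)
                else:
                    self.stride[pc] = (actual, 0, 0)

        p = HybridValuePredictor()
        for value in [10, 20, 30, 40, 50]:        # strided values at one pc
            print(p.predict(0x400), "->", value)  # predicts 40 and 50 correctly
            p.update(0x400, value)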
  • Breaking the barriers: two models for MPI programming

    Publication Year: 1998, Page(s): 248-255
    PDF (80 KB)

    The asynchronous nature of many MPI/PVM programs does not fit the BSP model: the barrier synchronization imposed by the model restricts the range of available algorithms and their performance. Through the suppression of barriers and the generalization of the concept of a superstep, we propose two new models, the BSP-like model and BSP Without Barriers (BSPWB). While the BSP-like model extends the BSP* model to programs written using collective operations, the more general BSPWB model admits the MPI/PVM parallel asynchronous programming style. Like LogP, the model encourages locality, but it is simpler to use. The parameters of the models and their quality are evaluated on a distributed shared memory machine, the Origin 2000, and on a distributed memory machine, the CRAY T3E. The time spent in an h-relation depends more strongly on the communication pattern than on the number of processors; the total variation of the h-relation time across both patterns and processor counts is smaller than sixty nanoseconds. To illustrate the proposed models, two different applications are considered: a Parallel Sort using Regular Sampling (PSRS) and a parallel dynamic programming algorithm solving the Single Resource Allocation Problem (SRAP). The PSRS is a synchronous algorithm with a rich set of collective communication patterns and coarse-grain communications; at the opposite extreme, the SRAP algorithm uses fine-grain communication with permutation patterns. The computational results confirm the accuracy of the models: the prediction of communication times is robust even for the SRAP, where communication is dominated by small messages.

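    For orientation, the classic BSP superstep cost that BSPWB relaxes is T = w + g*h + L, where w is the maximum local computation, h the h-relation (maximum words any processor sends or receives), g the per-word cost, and L the barrier latency; BSPWB drops the global barrier term. A minimal helper in our notation:

        def bsp_cost(w, h, g, L):
            """Cost of one BSP superstep: computation + h-relation + barrier."""
            return w + g * h + L

        def bsp_program_cost(supersteps, g, L):
            """supersteps: list of (w, h) pairs, one per superstep."""
            return sum(bsp_cost(w, h, g, L) for w, h in supersteps)

        print(bsp_program_cost([(1000, 50), (400, 200)], g=4.0, L=150.0))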
  • A fast algorithm for scheduling time-constrained instructions on processors with ILP

    Publication Year: 1998, Page(s): 158-166
    Cited by: Papers (2)
    PDF (144 KB)

    Instruction scheduling is central to achieving performance in modern processors with instruction-level parallelism (ILP). Classical work in this area spans the theoretical foundations of instruction scheduling algorithms with provable optimality as well as heuristic approaches with experimentally validated performance improvements; typically, the theoretical foundations are developed in the context of basic blocks of code. In this paper, we provide theoretical foundations for scheduling basic blocks of instructions with time constraints, which can play an important role in compile-time ILP optimizations for embedded applications. We present an algorithm for scheduling unit-execution-time instructions on machines with multiple pipelines, in the presence of precedence constraints, release times, deadlines, and latencies l_ij between any pair of instructions i and j. Our algorithm runs in time O(n^3 α(n)), where α(n) is the functional inverse of the Ackermann function. It can be used to construct feasible schedules for two classes of instances: (1) one pipeline with latencies between instructions restricted to the values 0 and 1, and (2) an arbitrary number of pipelines with monotone-interval-order precedences. Our result can be seen as a natural extension of previous work on instruction scheduling for pipelined machines in the presence of deadlines.

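    The simplest relative of the algorithm analyzed here is list scheduling of unit-time instructions by earliest modified deadline, where d'(i) = min(d(i), min over successors j of d'(j) - 1). The toy below (ours) handles precedence and deadlines on m pipelines but ignores the paper's release times and pairwise latencies l_ij, and carries none of its optimality guarantees.

        def modified_deadlines(succ, deadline):
            d = dict(deadline)
            def fix(i):
                for j in succ.get(i, []):
                    d[i] = min(d[i], fix(j) - 1)
                return d[i]
            for i in list(d):
                fix(i)
            return d

        def schedule(succ, deadline, m):
            d = modified_deadlines(succ, deadline)
            pred_count = {i: 0 for i in deadline}
            for i, js in succ.items():
                for j in js:
                    pred_count[j] += 1
            ready = [i for i in deadline if pred_count[i] == 0]
            t, start = 0, {}
            while len(start) < len(deadline):
                ready.sort(key=lambda i: d[i])   # earliest modified deadline
                issued, ready = ready[:m], ready[m:]
                for i in issued:
                    if t + 1 > deadline[i]:
                        raise ValueError(f"{i} misses its deadline")
                    start[i] = t
                    for j in succ.get(i, []):
                        pred_count[j] -= 1
                        if pred_count[j] == 0:
                            ready.append(j)
                t += 1
            return start

        succ = {"a": ["c"], "b": ["c"], "c": []}
        print(schedule(succ, {"a": 2, "b": 2, "c": 3}, m=2))
        # {'a': 0, 'b': 0, 'c': 1}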
  • Fast, accurate and flexible data locality analysis

    Publication Year: 1998, Page(s): 124-129
    Cited by: Papers (1)
    PDF (36 KB)

    This paper presents a tool based on a new approach for analyzing the locality exhibited by data memory references. The tool is very fast because it is based on a static locality analysis enhanced with very simple profiling information, resulting in a negligible slowdown. This allows the tool to be used on highly time-consuming applications and to be included as a step in a typical iterative analysis-optimization process. The tool provides a detailed evaluation of the reuse exhibited by a program, quantifying and qualifying the different types of misses, either globally or broken down by program section, data structure, memory instruction, etc. The accuracy of the tool is validated by comparing its results with those provided by a simulator.

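    The static half of such an analysis often starts from reference strides. The fragment below (ours, showing only the flavor of the first, purely static step; real tools add volume and interference analysis plus light profiling) classifies each reference's innermost-loop reuse from its stride relative to the cache line size.

        LINE_BYTES = 64

        def classify(stride_bytes):
            """Innermost-loop reuse class of a reference with the given stride."""
            if stride_bytes == 0:
                return "temporal"              # same datum every iteration
            if abs(stride_bytes) < LINE_BYTES:
                return "spatial"               # usually hits the same cache line
            return "none"                      # touches a new line each iteration

        # references in:  for j: s += a[j] * b[k][j] + c[j * 32]   (8-byte data)
        refs = {"a[j]": 8, "b[k][j]": 8, "c[j*32]": 8 * 32, "s": 0}
        for ref, stride in refs.items():
            print(f"{ref}: {classify(stride)}")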