Parallel Processing, 2005. ICPP 2005. International Conference on

Date 14-17 June 2005

  • Proceedings. 2005 International Conference on Parallel Processing

  • 2005 International Conference on Parallel Processing - Title Page

    Page(s): i - iii
  • 2005 International Conference on Parallel Processing - Copyright Page

    Page(s): iv
  • 2005 International Conference on Parallel Processing - Table of contents

    Page(s): v - x
  • Message from the General Co-Chairs

    Page(s): xi
  • Message from the Program Co-Chairs

    Page(s): xii
  • Organizing Committee

    Page(s): xiii - xiv
  • Program Committee

    Page(s): xv - xvi
  • List of reviewers

    Page(s): xvii
  • SAREC: a security-aware scheduling strategy for real-time applications on clusters

    Page(s): 5 - 12

    Security requirements of security-critical real-time applications must be met in addition to satisfying timing constraints. However, conventional real-time scheduling algorithms ignore the applications' security requirements. In recognition that an increasing number of applications running on clusters demand both real-time performance and security, we investigate the problem of scheduling a set of independent real-time tasks with various security requirements. We propose a security overhead model that is capable of measuring security overheads incurred by security-critical tasks. Further, we propose a security-aware scheduling strategy, or SAREC, which integrates security requirements into scheduling for real-time applications by employing our security overhead model. To evaluate the effectiveness of SAREC, we implement a security-aware real-time scheduling algorithm (SAREC-EDF), which incorporates the earliest deadline first (EDF) scheduling algorithm into SAREC. Extensive simulation experiments show that SAREC-EDF significantly improves overall system performance over three baseline scheduling algorithms (variations of EDF) by up to 72.55%.
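
The idea of charging a security overhead inside an EDF scheduler can be illustrated with a minimal sketch. This is not SAREC-EDF as defined in the paper: the `Task` fields, the single-node setting, and the rule that a task missing its deadline is rejected without consuming time are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    exec_time: float          # base execution time
    security_overhead: float  # hypothetical extra time spent on security services
    deadline: float           # absolute deadline


def edf_schedule(tasks):
    """Visit tasks in earliest-deadline-first order on one node.

    Each accepted task costs exec_time + security_overhead. A task that
    would finish after its deadline is rejected (an assumed admission rule).
    Returns a list of (task name, accepted) pairs in EDF order.
    """
    clock = 0.0
    results = []
    for task in sorted(tasks, key=lambda t: t.deadline):
        finish = clock + task.exec_time + task.security_overhead
        accepted = finish <= task.deadline
        if accepted:
            clock = finish  # only accepted tasks advance the clock
        results.append((task.name, accepted))
    return results


tasks = [
    Task("a", exec_time=2.0, security_overhead=0.5, deadline=3.0),
    Task("b", exec_time=1.0, security_overhead=0.5, deadline=3.5),
    Task("c", exec_time=2.0, security_overhead=1.0, deadline=9.0),
]
print(edf_schedule(tasks))
```

With these made-up numbers, task "b" would meet its deadline without the overhead of the preceding task (2.0 + 1.0 = 3.0) but is rejected once overheads are charged (2.5 + 1.5 = 4.0), which is the kind of effect a security overhead model has to account for.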

  • Multiprocessor energy-efficient scheduling for real-time tasks with different power characteristics

    Page(s): 13 - 20

    In the past decades, a number of research results have been reported for energy-efficient scheduling over uniprocessor and multiprocessor environments. Unlike many past results, which rest on particular assumptions about task power characteristics, we consider real-time scheduling of tasks with different power characteristics. The objective is to minimize the energy consumption of task executions under the given deadline constraint. When tasks have a common deadline and are ready at time 0, we propose an optimal real-time task scheduling algorithm for multiprocessor environments that allow task migration. When no task migration is allowed, a 1.412-approximation algorithm for task scheduling is proposed for different settings of power characteristics. The performance of the approximation algorithm was evaluated by an extensive set of experiments, with excellent results.

  • A utility-based two level market solution for optimal resource allocation in computational grid

    Page(s): 23 - 30

    The paper presents a market-oriented resource allocation strategy for grid resources. The proposed model uses utility functions to calculate the utility of a resource allocation. This allows different optimization objectives to be integrated into the allocation process. This paper targets the above issues with a utility-based optimization scheme. We decompose the optimization problem into two levels of subproblems so that the computational complexity is reduced. When the two market levels converge to their optimal points, a globally optimal point is achieved. Total user benefit of the computational grid is maximized when the equilibrium prices are obtained through the service market level optimization and resource market level optimization. The economic model is the basis of an iterative algorithm that, given a finite set of requests, is used to perform optimal resource allocation. The experiments show that scheduling based on pricing-directed resource allocation involves less overhead and leads to more efficient resource allocation than conventional round-robin scheduling.

  • Two-tier resource allocation for slowdown differentiation on server clusters

    Page(s): 31 - 38

    Slowdown, defined as the ratio of a request's queueing delay to its service time, is accepted as an important quality of service metric of Internet servers. In this paper, we investigate the problem of providing proportional slowdown differentiation (PSD) services to various applications and clients on cluster-based Internet servers. We extend a closed-form expression of the expected slowdown of a popular Internet workload model with a typical heavy-tailed service time distribution from a single server mode to a server cluster mode. Based on the closed-form expression, we design a two-tier resource allocation approach, which integrates a dispatcher-based node partitioning scheme and a server-based dynamic process allocation scheme. We evaluate the two-tier resource allocation approach via extensive simulations and compare it with a one-tier node partitioning approach. Simulation results show that the two-tier approach can provide fine-grained PSD services on cluster-based Internet servers. We implement the two-tier approach on a cluster testbed. Experimental results further demonstrate the feasibility of the approach in practice.
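
The slowdown metric defined above is simple to compute per request. The sketch below uses made-up request data for two hypothetical classes and compares their mean slowdowns; reading PSD as "keep this ratio close to a pre-specified target" is an assumption about the general PSD model, not a detail taken from the abstract.

```python
def slowdown(queueing_delay, service_time):
    """Slowdown of one request: queueing delay divided by service time."""
    if service_time <= 0:
        raise ValueError("service_time must be positive")
    return queueing_delay / service_time


def mean_slowdown(requests):
    """Average slowdown over (queueing_delay, service_time) pairs."""
    return sum(slowdown(d, s) for d, s in requests) / len(requests)


# Hypothetical measurements: (queueing delay, service time) in seconds.
class_a = [(2.0, 1.0), (6.0, 2.0)]   # slowdowns 2.0 and 3.0
class_b = [(5.0, 1.0), (10.0, 2.0)]  # slowdowns 5.0 and 5.0

# Observed slowdown ratio between the two classes.
ratio = mean_slowdown(class_b) / mean_slowdown(class_a)
print(mean_slowdown(class_a), mean_slowdown(class_b), ratio)
```

Here class B experiences twice the mean slowdown of class A; a resource allocator aiming at proportional differentiation would adjust the allocation until the observed ratio matches its target.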

  • Design and implementation of overlay multicast protocol for multimedia streaming

    Page(s): 41 - 48

    In this paper, we propose a new protocol called the shared tree streaming (STS) protocol, which is designed for interactive multimedia streaming applications. STS is a decentralized protocol that constructs a shared tree called s-DBMDT (sender-dependent degree-bounded minimum diameter tree) as an overlay network that involves all the participants of the application. For a given set of nodes where some of them are senders, s-DBMDT is a spanning tree where the maximum delay on the tree from those senders is minimized and the degree constraint on each node is held. We believe that this is the first approach that defines the s-DBMDT construction problem and presents a distributed protocol for the purpose. Our performance evaluation is based on experiments in both simulated networks and real networks, and it strongly shows the efficiency and usefulness of the STS protocol.

  • Embedding a cluster-based overlay mesh in mobile ad hoc networks without cluster heads

    Page(s): 49 - 56

    One strategy to tackle the complexity and scalability issues in large-scale mobile ad hoc networks (MANETs) is to use extra layers of abstraction. A common tactic is to group the nodes in the network into clusters. The clusters and the paths between them constitute an extra layer of overlay abstraction. To maintain the overlay structure, a head node is often elected in each cluster. The head nodes form a backbone to provide, among other things, passages from one cluster to another. Nonetheless, the use of dedicated head nodes also creates problems such as load and power imbalance. In this paper, we investigate the feasibility of building a cluster-based overlay mesh on MANETs without using cluster heads. Without head nodes, the challenge is in maintaining the overlay structure and performing inter-cluster routing. We will examine one possible scheme through simulation.

  • Exploring processor design options for Java-based middleware

    Page(s): 59 - 68

    Java-based middleware is a rapidly growing workload for high-end server processors, particularly chip multiprocessors (CMP). To help architects design future microprocessors to run this important new workload, we provide a detailed characterization of two popular Java server benchmarks, ECperf and SPECjbb2000. We first estimate the amount of instruction-level parallelism in these workloads by simulating a very wide issue processor with perfect caches and perfect branch predictors. We then identify performance bottlenecks for these workloads on a more realistic processor by selectively idealizing individual processor structures. Finally, we combine our findings on available ILP in Java middleware with results from previous papers that characterize the availability of TLP to investigate the optimal balance between ILP and TLP in CMPs. We find that, like other commercial workloads, Java middleware has only a small amount of instruction-level parallelism, even when run on very aggressive processors. When run on processors resembling currently available processors, the performance of Java middleware is limited by frequent traps, address translation and stalls in the memory system. We find that SPECjbb2000 differs from ECperf in two meaningful ways: (1) the performance of ECperf is affected much more by cache and TLB misses during instruction fetch and (2) SPECjbb2000 has more memory-level parallelism.

  • A vector-μSIMD-VLIW architecture for multimedia applications

    Page(s): 69 - 77

    Media processing has motivated strong changes in the focus and design of processors. These applications are composed of heterogeneous regions of code, some of them with high levels of DLP and others with only modest amounts of ILP. A common approach to dealing with these applications is to use μSIMD-VLIW processors. However, the ILP regions fail to scale when we increase the width of the machine, which, on the other hand, is desired to achieve high performance in the DLP regions. In this paper, we propose and evaluate adding vector capabilities to a μSIMD-VLIW core to speed up the execution of the DLP regions while, at the same time, reducing the fetch bandwidth requirements. Results show that, in the DLP regions, both 2- and 4-issue width vector-μSIMD-VLIW architectures outperform an 8-issue width μSIMD-VLIW by factors of up to 2.7X and 4.2X (1.6X and 2.1X on average) respectively. As a result, the DLP regions become less than 10% of the total execution time and performance is dominated by the ILP regions.

  • Design tradeoffs for BLAS operations on reconfigurable hardware

    Page(s): 78 - 86

    Numerical linear algebra operations are key primitives in scientific computing. Performance optimizations of such operations have been extensively investigated and some basic operations have been implemented as software libraries. With the rapid advances in technology, hardware acceleration of linear algebra applications using FPGAs (field programmable gate arrays) has become feasible. In this paper, we propose FPGA-based designs for several BLAS operations, including vector product, matrix-vector multiply, and matrix multiply. By identifying the design parameters for each BLAS operation, we analyze the design tradeoffs. In the implementations of the designs, the values of the design parameters are determined according to the hardware constraints, such as the available area, the size of on-chip memory, the external memory bandwidth and the number of I/O pins. The proposed designs are implemented on a Xilinx Virtex-II Pro FPGA.

  • Timing high performance kernels through empirical compilation

    Page(s): 89 - 98

    There are a few application areas that remain almost untouched by the historical and continuing advancement of compilation research. For the extremes of optimization required for high performance computing on one end, and embedded systems at the opposite end of the spectrum, many critical routines are still hand-tuned, often directly in assembly. At the same time, architecture implementations are performing an increasing number of compiler-like transformations in hardware, making it harder to predict the performance impact of a given series of optimizations applied at the ISA level. These issues, together with the rate of hardware evolution dictated by Moore's Law, make it almost impossible to keep key kernels running at peak efficiency. Automated empirical systems, where direct timings are used to guide optimization, have provided the most successful response to these challenges. This paper describes our approach to performing empirical optimization, which utilizes a low-level iterative compilation framework specialized for optimizing high performance computing kernels. We present results showing that this approach can not only provide speedups over traditional optimizing compilers, but can improve overall performance when compared to the best hand-tuned kernels selected by the empirical search of our well-known ATLAS package.

  • A novel approach for detecting heap-based loop-carried dependences

    Page(s): 99 - 106

    The problem of data dependences in pointer-based codes is crucial to various compiler optimizations. The approach presented in this paper focuses on detecting data dependences induced by heap-directed pointers in loops that access dynamic data structures. Knowledge about the shape of the data structure accessible from a heap-directed pointer provides critical information for disambiguating heap accesses originating from it. Our approach is based on a previously developed shape analysis that maintains topological information of the connections among the different nodes (memory locations) in the data structure. As a novelty, our approach carries out abstract interpretation of the statements being analyzed, annotating memory locations with read/write information. This information is later used in a very accurate dependence test, which we describe in this paper. We also discuss its application to three different programs: the sparse matrix-vector product, mst from Olden and twolf from the SPEC CPU2000 suite.

  • Enabling loop fusion and tiling for cache performance by fixing fusion-preventing data dependences

    Page(s): 107 - 115

    This paper presents a new approach to enabling loop fusion and tiling for arbitrary affine loop nests. Given a set of multiple loop nests, we present techniques that automatically eliminate all the fusion-preventing dependences by means of loop tiling and array copying. Applying our techniques iteratively to multiple loop nests yields a single loop nest that can be tiled for cache locality. Our approach handles LU, QR, Cholesky and Jacobi in a unified framework. Our experimental evaluation on an SGI Octane2 system shows that the benefit from the significantly reduced L1 and L2 cache misses has far more than offset the branching and loop control overhead introduced by our approach.

  • Integrated performance monitoring of a cosmology application on leading HEC platforms

    Page(s): 119 - 128

    The cosmic microwave background (CMB) is an exquisitely sensitive probe of the fundamental parameters of cosmology. Extracting this information is computationally intensive, requiring massively parallel computing and sophisticated numerical algorithms. In this work we present MADbench, a lightweight version of the MADCAP CMB power spectrum estimation code that retains the operational complexity and integrated system requirements. In addition, to quantify communication behavior across a variety of architectural platforms, we introduce the integrated performance monitoring (IPM) package: a portable, lightweight, and scalable tool for effectively extracting MPI message-passing overheads. A performance characterization study is conducted on some of the world's most powerful supercomputers, including the superscalar Seaborg (IBM Power3+) and CC-NUMA Columbia (SGI Altix), as well as the vector-based Earth Simulator (NEC SX-6 enhanced) and Phoenix (Cray X1) systems. In-depth analysis shows that in order to bridge the gap between theoretical and sustained system performance, it is critical to gain a clear understanding of how the distinct parts of large-scale parallel applications interact with the individual subcomponents of HEC platforms.

  • First evaluation of parallel methods of automatic global image registration based on wavelets

    Page(s): 129 - 136

    With the increasing importance of multiple multiplatform remote sensing missions, fast and automatic integration of digital data from disparate sources has become critical to the success of these endeavors. First, an overview of the development of automatic and parallel global image registration is given. Then, based on analyses of three existing parallel methods of wavelet-based global registration, a new parallel strategy is proposed. Moreover, towards quantitative evaluation, first results of the intercomparison of four parallel global registration algorithms are presented in theory and in experiments.

  • Parallel algorithm and implementation for realtime dynamic simulation of power system

    Page(s): 137 - 144

    As power systems continue to develop, realtime simulation and online dynamic security analysis using parallel computing are becoming increasingly important. This paper presents a novel multilevel partition scheme for parallel computing based on power network regional characteristics and describes the design and implementation of a hierarchical block bordered diagonal form (BBDF) algorithm for power network computation. Some optimization schemes are also proposed to reduce the computing and communication time and to improve the scalability of the program. The simulation results of a large network having 10188 nodes, 13499 branches, 1072 generators and 3003 loads show that the proposed algorithms and schemes running on a cluster system with 12 CPUs run 15 times faster than on a single CPU, satisfying the realtime simulation requirements for large scale power grids.
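
The reported figures (a 15-fold speedup on 12 CPUs) can be put in terms of parallel efficiency. The sketch below applies the standard definition, efficiency = speedup / number of processors; the function name is illustrative and the calculation is not taken from the paper.

```python
def parallel_efficiency(speedup, num_cpus):
    """Standard parallel efficiency: speedup divided by processor count."""
    if num_cpus <= 0:
        raise ValueError("num_cpus must be positive")
    return speedup / num_cpus


# Figures quoted in the abstract: 15x speedup on a 12-CPU cluster.
efficiency = parallel_efficiency(15.0, 12)
print(efficiency)
```

An efficiency above 1.0 corresponds to superlinear speedup. The abstract does not say what accounts for it.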

  • Heuristics for profile-driven method-level speculative parallelization

    Page(s): 147 - 156

    Thread level speculation (TLS) is an effective technique for extracting parallelism from sequential code. Method calls provide good templates for the boundaries of speculative threads as they often describe independent tasks. However, selecting the most profitable methods to speculate on is difficult as it involves complicated trade-offs between speculation violations, thread overheads, and resource utilization. This paper presents a first analysis of heuristics for automatic selection of speculative threads across method boundaries using a dynamic or profile-driven compiler. We study the potential of three classes of heuristics that involve increasing amounts of profiling information and runtime complexity. Several of the heuristics allow for speculation to start at internal method points, nested speculation, and speculative thread preemption. Using a set of Java benchmarks, we demonstrate that careful thread selection at method boundaries leads to speedups of 1.4 to 1.8 on practical TLS hardware. Single-pass heuristics that filter out less profitable methods using simple speedup estimates lead to the best average performance by consistently providing a good balance between over- and under-speculation. On the other hand, multi-pass heuristics that perform additional filtering by taking into account interactions between nested method calls often lead to significant under-speculation and perform poorly.
