
19th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP 2011)

Date: 9-11 Feb. 2011


Displaying Results 1 - 25 of 99
  • [Front cover]

    Page(s): C1
    Freely Available from IEEE
  • [Title page i]

    Page(s): i
    Freely Available from IEEE
  • [Title page iii]

    Page(s): iii
    Freely Available from IEEE
  • [Copyright notice]

    Page(s): iv
    Freely Available from IEEE
  • Table of contents

    Page(s): v - xii
    Freely Available from IEEE
  • Preface from the Program Chairs

    Page(s): xiii
    Freely Available from IEEE
  • Preface from the Organizing Chair

    Page(s): xiv
    Freely Available from IEEE
  • Program Committee

    Page(s): xv
    Freely Available from IEEE
  • Additional Reviewers

    Page(s): xvi
    Freely Available from IEEE
  • A Fast and Verified Algorithm for Proving Store-and-Forward Networks Deadlock-Free

    Page(s): 3 - 10

    Deadlocks are an important issue in the design of interconnection networks. A successful approach is to restrict the routing function so that it satisfies a necessary and sufficient condition for deadlock-free routing. Typically, such a condition states that some (extended) dependency graph must be acyclic. Defining and proving such a condition is complex, and proving that a routing function satisfies it can be complex as well. In this paper we present the first algorithm that automatically proves routing functions deadlock-free for store-and-forward networks. The time complexity of our algorithm is linear in the size of the resource dependency graph. The algorithm checks a variation of Duato's condition for adaptive routing. Both the condition and the algorithm have been formalized in the logic of the ACL2 interactive theorem prover, and the correctness of the algorithm with respect to the condition has been formally checked using ACL2.

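    The linear-time check the abstract describes reduces to verifying that a dependency graph has no cycle. A minimal sketch of such a check (not the authors' ACL2-verified algorithm) using an iterative three-color depth-first search, which runs in time linear in the number of nodes and edges:

```python
def is_acyclic(graph):
    """Return True iff the directed graph (dict: node -> successor list)
    has no cycle. Runs in O(nodes + edges), linear in the graph size."""
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on DFS stack / done
    color = {v: WHITE for v in graph}
    for start in graph:
        if color[start] != WHITE:
            continue
        color[start] = GRAY
        stack = [(start, iter(graph[start]))]
        while stack:
            node, succs = stack[-1]
            for nxt in succs:
                if color[nxt] == GRAY:    # back edge: cycle found
                    return False
                if color[nxt] == WHITE:
                    color[nxt] = GRAY
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                color[node] = BLACK       # all successors explored
                stack.pop()
    return True

# A routing function is deadlock-free under such a condition only if its
# resource dependency graph passes this check.
acyclic = is_acyclic({"a": ["b"], "b": ["c"], "c": []})
cyclic = is_acyclic({"a": ["b"], "b": ["a"]})
```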
  • Dynamic I/O Reconfiguration for a NFS-Based Parallel File System

    Page(s): 11 - 18

    The large gap between the speed at which data can be processed and the performance of I/O devices makes the shared storage infrastructure of a cluster a major bottleneck. Parallel file systems try to bridge this gap by distributing data over several servers, increasing the system's available bandwidth. However, most implementations use a fixed number of I/O servers, defined during system initialization, and cannot add new resources without a complete redistribution of the existing data. When different applications execute at the same time, concurrent access to these resources can aggravate the existing bottleneck, making it very hard to define an initial number of servers that satisfies the performance requirements of all applications. This paper presents a reconfiguration mechanism for the dNFSp file system that uses online monitoring of applications' I/O behavior to detect performance contention and dedicate more I/O resources to applications with higher demands. These extra resources are taken from the available nodes of the cluster, using their I/O devices as temporary storage. We show that this strategy can increase I/O performance by up to 200% for access patterns with short I/O phases and 47% for longer I/O phases.

  • Reliability Study of Coding Schemes for Wide-Area Distributed Storage Systems

    Page(s): 19 - 23

    Distributed storage systems comprise a large number of commodity hardware components distributed across several data centers. Even in the presence of (permanent) failures, the system should provide reliable storage. While replication has advantages because of its simplicity, there exist coding techniques, e.g., MDS (maximum distance separable) erasure codes, that provide adaptable reliability properties with an optimal redundancy ratio. The coding and distribution scheme influences the prospective storage reliability. In this paper we present reliability models for erasure coding and replication techniques, especially for their application in wide-area storage systems. Furthermore, we use these models to quantify the reliability properties of concrete data storage scenarios.

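    The basic reliability comparison underlying such models can be sketched with a simple independent-failure assumption: an (n, k) MDS-coded object survives as long as at least k of its n fragments survive, and r-way replication is the special case k = 1. A hedged illustration (not the paper's models, which also account for wide-area placement):

```python
from math import comb

def mds_reliability(n, k, p):
    """Probability that an (n, k) MDS-coded object survives when each of
    the n fragments independently survives with probability p: at least
    k of the n fragments must remain to reconstruct the data."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def replication_reliability(r, p):
    """r-way replication survives if at least one replica survives;
    equivalent to an (r, 1) MDS code."""
    return mds_reliability(r, 1, p)

# Same 3x storage overhead: 3 full replicas vs. a (9, 3) erasure code.
p = 0.9
rep = replication_reliability(3, p)   # 1 - 0.1**3 = 0.999
mds = mds_reliability(9, 3, p)        # higher, at the same redundancy ratio
```

At equal redundancy, the erasure code tolerates more failure patterns, which is why such codes are attractive for wide-area storage.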
  • A Redundant Communication Approach to Scalable Fault Tolerance in PGAS Programming Models

    Page(s): 24 - 31

    Recent trends in high-performance computing point toward increasingly large machines with millions of processing, storage, and networking elements. Unfortunately, the reliability of these machines is inversely proportional to their size, resulting in a system-wide mean time between failures (MTBF) ranging from a few days to a few hours. As such, for long-running applications, the ability to recover efficiently from frequent failures is essential. Traditional forms of fault tolerance, such as checkpoint/restart, suffer from performance issues related to limited I/O and memory bandwidth. In this paper, we present a fault-tolerance mechanism that reduces the cost of failure recovery by maintaining shadow data structures and performing redundant remote memory accesses. Results from a computational chemistry application running at scale show that our techniques provide applications with a high degree of fault tolerance at low (2%-4%) overhead on 2048 processors.

  • Quantifying Thread Vulnerability for Multicore Architectures

    Page(s): 32 - 39

    Continuously shrinking transistor sizes and the aggressive low-power operating modes employed by modern architectures tend to increase transient error rates. Concurrently, multicore machines are coming to dominate the architectural spectrum in various application domains. These two trends require a fresh look at the resiliency of multithreaded applications against transient errors from a software perspective. In this paper, we propose and evaluate a new metric called the Thread Vulnerability Factor (TVF). A distinguishing characteristic of TVF is that its calculation for a given thread (typically one of the threads of a multithreaded application) does not depend on its code alone, but also on the code of the threads that share data with it. As a result, we decompose the TVF of a thread into two complementary parts: local and remote. While the former captures the TVF induced by the code of the target thread, the latter represents the vulnerability impact of the threads that interact with it. We quantify the local and remote TVF values for three architectural components (register file, ALUs, and caches) using a set of four multithreaded applications. Our experimental evaluation shows that TVF values tend to increase with the number of cores, meaning that the system becomes more vulnerable as the core count rises. We also discuss how TVF values and execution cycles together can be used to explore performance-reliability tradeoffs in multicores at the source-code level.

  • In Situ Power Analysis of General Purpose Graphical Processing Units

    Page(s): 40 - 44

    In this paper, an in situ power-analysis profiling over time for general-purpose graphics processing units (GPGPUs) is presented. With this method, the power consumption of different modes of operation, such as data transfer between GPU and host CPU, basic single-precision floating-point arithmetic operations (addition, subtraction, multiplication) on the multiprocessor units, and shared and global memory access instructions, can be measured. There is a factor-of-2 difference in power dissipation between the various instructions and modes of operation of the GPGPUs. These measurements provide data for an instruction-based power estimation of GPU software. It turns out that the power profile over time also gives a good indication of which section of the program is executed at a certain point in time. The experimental results have been derived from two GPU architectures, the GT200 and the GF100.

  • Job Scheduling with License Reservation: A Semantic Approach

    Page(s): 47 - 54

    License management is one of the main concerns when independent software vendors (ISVs) distribute their software on computing platforms such as Clouds: they want to be sure that customers use their software according to the license terms. The work presented in this paper addresses part of this problem by extending a semantic resource allocation approach to support job scheduling that takes software licenses into account. This approach defines licenses as another type of computational resource that is available in the system and must be allocated to the different jobs requested by users. License terms are modeled as resource properties, which describe the license constraints. A resource ontology has been extended to model the relations between customers, providers, jobs, resources and licenses in detail and make them machine-processable. License scheduling has been introduced into a semantic resource allocation process by providing a set of rules that evaluate the semantic license terms during job scheduling.

  • A Deadline Satisfaction Enhanced Workflow Scheduling Algorithm

    Page(s): 55 - 61

    Meeting users' deadline constraints is usually the most important goal of workflow scheduling in a Grid environment. To account for the dynamism of Grid resources, we adopt a stochastic model to describe their dynamic workloads. We define the Deadline Satisfaction Degree of a Workflow (DSDW) as the probability that a workflow completes before its deadline. We calculate task execution priorities based on the precedence relations in the workflow, then determine the candidate resource for each task so as to maximize the DSDW, and finally convert the distribution of the overall workflow deadline into a constrained nonlinear programming problem, which we solve with known methods. We present a Deadline Satisfaction Enhanced Scheduling Algorithm for Workflows (DSESAW) involving deadline distribution and resource selection. Extensive simulation experiments using a practical medical image analysis application were conducted to verify the algorithm. The results indicate that it adapts to dynamic Grid environments and provides a good guarantee for users' deadline requirements.

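    The DSDW quantity the abstract defines, the probability that a workflow finishes before its deadline, can be illustrated with a Monte Carlo estimate over a simple task chain. This toy model (exponentially distributed task durations, a linear chain) is an assumption for illustration only; the paper derives the DSDW analytically from its stochastic resource-workload model:

```python
import random

def dsdw_chain(task_means, deadline, trials=20000, seed=42):
    """Monte Carlo estimate of the Deadline Satisfaction Degree of a
    Workflow (DSDW) for a chain of tasks whose durations are modeled
    as independent exponentials with the given means: the fraction of
    sampled runs whose makespan meets the deadline."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        makespan = sum(rng.expovariate(1.0 / m) for m in task_means)
        if makespan <= deadline:
            hits += 1
    return hits / trials

# Three tasks averaging 2, 3, and 5 time units against a deadline of 15:
# the scheduler would pick, per task, the resource that maximizes this value.
prob = dsdw_chain([2.0, 3.0, 5.0], deadline=15.0)
```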
  • Distributed Load Balancing for Parallel Agent-Based Simulations

    Page(s): 62 - 69

    We focus on agent-based simulations where a large number of agents move through space, obeying simple rules. Since such simulations are computationally intensive, it is challenging, in this context, to let the number of agents grow and to increase the quality of the simulation. An appealing way to meet this need is to exploit parallel architectures. In this paper, we present a novel distributed load-balancing scheme for a parallel implementation of such simulations, whose purpose is to achieve high scalability. Our approach to load balancing is designed to be lightweight and fully distributed: the balancing calculations take place at each computational step and influence the following one. To the best of our knowledge, ours is the first distributed load-balancing scheme in this context. We present both the design and the implementation, which allowed us to perform experiments with up to 1,000,000 agents. Tests show that, although the load-balancing algorithm is local, the workload distribution is balanced while the communication overhead remains negligible.

  • A Failure Handling Framework for Distributed Data Mining Services on the Grid

    Page(s): 70 - 79

    Fault tolerance is an important issue in Grid computing, where many heterogeneous machines are used. In this paper we present a flexible failure-handling framework that extends a previously proposed service-oriented architecture for distributed data mining, addressing the requirements for fault tolerance in the Grid. The framework allows users to achieve failure recovery whenever a crash occurs on a Grid node involved in the computation. The implemented framework has been evaluated in a real Grid setting to assess its effectiveness and performance.

  • Balancing Workloads of Servers Maintaining Scalable Distributed Data Structures

    Page(s): 80 - 84

    A new architecture for Scalable Distributed Data Structures (SDDS) is presented and evaluated. It applies to SDDS files with overactive servers. Every bucket of the file is supplemented with a reference counter that counts the number of references to the bucket. The counter reflects the bucket's activity and is used to select the most active, most often used buckets (overactive servers). Workloads of the servers are then balanced with the help of so-called throughput scalability. It is shown that this gives very good results for read-mostly databases, where extensive pattern matching takes place.

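    The reference-counter idea described above is simple to sketch: each bucket counts how often it is referenced, and the hottest buckets are flagged as overactive servers whose workload should be rebalanced. A toy model (the class and method names are illustrative, not the authors' implementation):

```python
from collections import Counter

class SDDSFile:
    """Toy model of an SDDS file whose buckets carry reference counters,
    used to spot overactive servers."""

    def __init__(self, buckets):
        self.refs = Counter({b: 0 for b in buckets})

    def access(self, bucket):
        self.refs[bucket] += 1            # count every reference

    def overactive(self, top=1):
        """Most frequently referenced buckets: candidates to have their
        workload spread for better throughput."""
        return [b for b, _ in self.refs.most_common(top)]

f = SDDSFile(["s1", "s2", "s3"])
for b in ["s1", "s2", "s2", "s2", "s3"]:
    f.access(b)
hot = f.overactive()                      # ["s2"] is the overactive server
```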
  • High Performance Matrix Inversion on a Multi-core Platform with Several GPUs

    Page(s): 87 - 93

    Inversion of large-scale matrices appears in a few scientific applications, such as model reduction and optimal control. Matrix inversion requires a significant computational effort and, therefore, the application of high-performance computing techniques and architectures for matrices with dimensions on the order of thousands. Following the recent rise of graphics processors (GPUs), we present and evaluate high-performance codes for matrix inversion, based on Gauss-Jordan elimination with partial pivoting, which off-load the main computational kernels to one or more GPUs while performing fine-grain operations on the general-purpose processor. The target architecture consists of a multi-core processor connected to several GPUs. Parallelism is extracted from parallel implementations of BLAS and from the concurrent execution of operations on the available computational units. Numerical experiments on a system with two Intel quad-core processors and four NVIDIA C1060 GPUs illustrate the efficiency and scalability of the different implementations, which deliver over 1.2 × 10^12 floating-point operations per second.

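    The numerical method the paper accelerates, Gauss-Jordan elimination with partial pivoting, can be sketched in a few lines of single-threaded Python; the paper's codes off-load the bulk of the per-column update work to one or more GPUs via BLAS:

```python
def gauss_jordan_inverse(a):
    """Invert a square matrix (list of lists) by Gauss-Jordan elimination
    with partial pivoting. Plain single-threaded sketch of the method."""
    n = len(a)
    # Augment with the identity: [A | I].
    m = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: swap up the row with the largest entry in col.
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        p = m[col][col]
        if p == 0.0:
            raise ValueError("matrix is singular")
        m[col] = [x / p for x in m[col]]      # normalize the pivot row
        # Eliminate the column from every other row (the bulk of the work,
        # which the paper's implementations execute as GPU BLAS updates).
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [row[n:] for row in m]             # right half is A^{-1}

inv = gauss_jordan_inverse([[4.0, 7.0], [2.0, 6.0]])
# inverse of [[4, 7], [2, 6]] is [[0.6, -0.7], [-0.2, 0.4]]
```

Unlike LU-based inversion, Gauss-Jordan keeps the per-iteration work uniform, which is what makes it attractive for off-loading to GPUs.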
  • Parallization of Adaboost Algorithm through Hybrid MPI/OpenMP and Transactional Memory

    Page(s): 94 - 100

    This paper proposes a parallelization of the Adaboost algorithm through hybrid use of MPI, OpenMP, and transactional memory. After a detailed analysis of the Adaboost algorithm, we show that multiple levels of parallelism exist in it. We exploit the lower level of parallelism through OpenMP and the higher level through MPI. Software transactional memory is used to facilitate the management of data shared among different threads. We evaluated the hybrid parallelized Adaboost algorithm on a heterogeneous PC cluster. The results show that nearly linear speedup can be achieved given a good load-balancing scheme. Moreover, the hybrid parallelized Adaboost algorithm outperforms a purely MPI-based approach by about 14% to 26%.

  • Scaleable Sparse Matrix-Vector Multiplication with Functional Memory and GPUs

    Page(s): 101 - 108

    Sparse matrix-vector multiplication on GPUs faces a serious problem when the vector length is too large to be stored in the GPU's device memory. To solve this problem, we propose a novel software-hardware hybrid method for a heterogeneous system with GPUs and functional memory modules connected by PCI Express. The functional memory offers a huge memory capacity and provides scatter/gather operations. We perform a preliminary evaluation of the proposed method using a sparse matrix benchmark collection. We observe that the proposed method, which converts indirect references to direct references without exhausting the GPU's cache memory, achieves a 4.1x speedup compared with conventional methods. The proposed method is intrinsically highly scalable in the number of GPUs, because intercommunication among GPUs is completely eliminated. Therefore we estimate that its performance can be expressed as the single-GPU execution performance, which may be limited by the burst-transfer bandwidth of PCI Express, multiplied by the number of GPUs.

  • Accelerating Parameter Sweep Applications Using CUDA

    Page(s): 111 - 118

    This paper proposes a parallelization scheme for parameter sweep (PS) applications using the Compute Unified Device Architecture (CUDA). Our scheme focuses on PS applications with irregular access patterns, which usually perform poorly on the GPU. The key idea for resolving this irregularity is to exploit the similarity of data accesses across different parameters: the scheme processes multiple parameters simultaneously instead of a single parameter. This simultaneous sweep allows data accesses to be coalesced into a single access if the irregularity appears in the same way for every parameter. It also reduces the amount of off-chip memory access by using fast on-chip memory for data commonly accessed by multiple parameters. As a result, the scheme achieves up to 4.5 times higher performance than a naive scheme that processes a single parameter per kernel invocation.

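    The access-sharing idea behind the simultaneous sweep can be illustrated without a GPU: if every parameter walks the same irregular index list, resolving each index once and applying it to all parameters replaces one scattered gather per parameter with a single shared gather. A sequential Python sketch of the two loop orderings (function names are illustrative):

```python
def sweep_single(data, indices, params):
    """Naive sweep: one full pass over the irregular index list per
    parameter, so every parameter repeats the same scattered gathers."""
    return [[data[i] * p for i in indices] for p in params]

def sweep_simultaneous(data, indices, params):
    """Simultaneous sweep: resolve each irregular index once and apply it
    to all parameters, mirroring how the CUDA scheme coalesces the
    per-parameter accesses and reuses the gathered value on-chip."""
    out = [[] for _ in params]
    for i in indices:
        v = data[i]                # one gather, shared by every parameter
        for j, p in enumerate(params):
            out[j].append(v * p)
    return out

naive = sweep_single(list(range(10)), [3, 1, 3, 7], [2.0, 0.5])
fused = sweep_simultaneous(list(range(10)), [3, 1, 3, 7], [2.0, 0.5])
# Both orderings of the loops produce identical results.
```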
  • FFT Implementation on a Streaming Architecture

    Page(s): 119 - 126

    The Fast Fourier Transform (FFT) is a useful tool for applications requiring signal analysis and processing. However, its high computational cost requires efficient implementations, especially in real-time applications, where response time is a decisive factor. Thus, the computational cost and the wide range of applications that require FFTs have motivated research into efficient implementations. Recently, GPU computing has become more and more relevant because of its high computational power and low cost, but due to its novelty there is still a lack of tools and libraries. In this paper we propose an efficient implementation of the FFT in AMD's Brook+ language. We describe several features and optimization strategies, analyzing scalability and performance compared to other well-known existing solutions.

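    For reference, the transform the paper implements on the GPU is, at its core, the radix-2 Cooley-Tukey recursion; a plain Python sketch for power-of-two input lengths (not the paper's Brook+ code):

```python
import cmath

def fft(x):
    """Radix-2 Cooley-Tukey FFT for input lengths that are powers of two:
    split into even and odd samples, transform each half, then combine
    with twiddle factors."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]   # twiddle factor
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

spectrum = fft([1, 1, 1, 1, 0, 0, 0, 0])
# spectrum[0] is the sum of the samples: 4
```

Each combine stage is a uniform, data-parallel butterfly over the whole array, which is exactly the structure that maps well onto streaming GPU architectures.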