2010 Ninth International Symposium on Parallel and Distributed Computing (ISPDC)

Date: 7-9 July 2010

  • [Front cover]

    Publication Year: 2010, Page(s): C1
    PDF (738 KB)
    Freely Available from IEEE
  • [Title page i]

    Publication Year: 2010, Page(s): i
    PDF (81 KB)
    Freely Available from IEEE
  • [Title page iii]

    Publication Year: 2010, Page(s): iii
    PDF (138 KB)
    Freely Available from IEEE
  • [Copyright notice]

    Publication Year: 2010, Page(s): iv
    PDF (109 KB)
    Freely Available from IEEE
  • Table of contents

    Publication Year: 2010, Page(s): v - vii
    PDF (169 KB)
    Freely Available from IEEE
  • Message from the ISPDC 2010 Chairs

    Publication Year: 2010, Page(s): viii
    PDF (93 KB) | HTML
    Freely Available from IEEE
  • ISPDC 2010 Committees

    Publication Year: 2010, Page(s): ix - xi
    PDF (110 KB)
    Freely Available from IEEE
  • Optimizing the Reliability of Pipelined Applications under Throughput Constraints

    Publication Year: 2010, Page(s): 1 - 8
    PDF (317 KB) | HTML

    Mapping a pipelined application onto a distributed and parallel platform is a challenging problem. The problem becomes even more difficult when multiple optimization criteria are involved, and when the target resources are heterogeneous (processors and communication links) and subject to failures. This paper investigates the problem of mapping pipelined applications, consisting of a linear chain of stages executed in pipelined fashion, onto such platforms. The objective is to optimize reliability under a performance constraint, i.e., while guaranteeing a threshold throughput. In order to increase reliability, we replicate the execution of stages on multiple processors. We present complexity results, proving that this bi-criteria optimization problem is NP-hard. We then propose several heuristics and discuss extensive experiments evaluating their performance.

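    As a hedged illustration of the underlying reliability model (assuming independent failures and illustrative failure probabilities of our own; this is not the paper's code), a replicated stage fails only if all of its replicas fail:

        import math

        def stage_reliability(replica_failure_probs):
            # A replicated stage fails only if every replica fails
            # (independent-failure assumption).
            return 1.0 - math.prod(replica_failure_probs)

        def mapping_reliability(stages):
            # The pipeline works only if every stage works.
            return math.prod(stage_reliability(s) for s in stages)

        # Three stages replicated on 2, 1 and 3 processors (made-up values).
        print(mapping_reliability([[0.05, 0.10], [0.02], [0.10, 0.10, 0.05]]))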
  • Algorithm for Mapping Multilayer BP Networks onto the SpiNNaker Neuromorphic Hardware

    Publication Year: 2010, Page(s): 9 - 16
    Cited by: Papers (1)
    PDF (287 KB) | HTML

    This paper demonstrates the feasibility, and evaluates the performance, of using the SpiNNaker neuromorphic hardware to simulate traditional non-spiking multi-layer perceptron networks with the back-propagation learning rule. In addition to investigating the mapping of the checker-boarding partitioning scheme onto SpiNNaker, we propose a new algorithm, called pipelined checker-boarding partitioning, which introduces a pipelined mode and captures the parallelism within each partition of the weight matrix, allowing communication and computation to overlap. Not only does the proposed algorithm localize communication, it can also hide part or even all of the communication. The performance is evaluated with SpiNNaker configurations of up to 1,000 nodes (20,000 cores).

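    A minimal sketch of plain checker-boarding, the baseline the pipelined variant builds on: the weight matrix is split into a 2D grid of blocks, one per processing node. Grid shape and matrix size here are illustrative assumptions, not taken from the paper:

        import numpy as np

        def checkerboard_blocks(w, p_rows, p_cols):
            # Split weight matrix w into a p_rows x p_cols grid of blocks.
            # The pipelined variant described above would additionally stream
            # partial sums through each block so that communication overlaps
            # with computation.
            return [np.array_split(row_band, p_cols, axis=1)
                    for row_band in np.array_split(w, p_rows, axis=0)]

        w = np.arange(36.0).reshape(6, 6)       # toy 6x6 weight matrix
        blocks = checkerboard_blocks(w, 2, 3)   # 2x3 grid of nodes
        print(blocks[1][2])                     # block owned by node (1, 2)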
  • Decomposition Based Algorithm for State Prediction in Large Scale Distributed Systems

    Publication Year: 2010, Page(s): 17 - 24
    Cited by: Papers (3)
    PDF (1719 KB) | HTML

    Prediction is an important component of resource management, providing information about the future state, utilization, and availability of resources. We propose a new prediction algorithm inspired by the decomposition of a complex wave into simpler waves with fixed frequencies (similar to Fourier decomposition). The partial results obtained from this decomposition stage are combined using approaches inspired by artificial-intelligence models. Experimental results for different system parameters used in the ALICE experiment highlight the substantial improvement, in terms of error reduction, offered by the new algorithm. The tests were made using real-time monitoring data provided by a system monitoring tool, for both one-step and multi-step-ahead prediction. The prediction results can be used by resource management systems to improve scheduling decisions, ensuring load balancing and optimizing resource utilization.

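    The decomposition idea can be sketched as follows (an assumed, simplified illustration: FFT the history, keep the dominant frequencies, and extend each past the end of the series; the paper combines the components with AI-inspired models instead of simple recombination):

        import numpy as np

        def fourier_extrapolate(history, n_ahead, n_components=8):
            n = len(history)
            t = np.arange(n)
            trend = np.polyfit(t, history, 1)        # remove the linear trend
            spectrum = np.fft.fft(history - np.polyval(trend, t))
            freqs = np.fft.fftfreq(n)
            # keep only the strongest frequency components
            top = np.argsort(-np.abs(spectrum))[:n_components]
            t_future = np.arange(n, n + n_ahead)
            pred = np.polyval(trend, t_future)
            for k in top:
                amp, phase = np.abs(spectrum[k]) / n, np.angle(spectrum[k])
                pred += amp * np.cos(2 * np.pi * freqs[k] * t_future + phase)
            return pred

        load = np.sin(np.arange(200) / 5.0) + 0.01 * np.arange(200)
        print(fourier_extrapolate(load, n_ahead=5))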
  • Operational Semantics of the Marte Repetitive Structure Modeling Concepts for Data-Parallel Applications Design

    Publication Year: 2010, Page(s): 25 - 32
    PDF (407 KB) | HTML

    This paper presents an operational semantics of the repetitive model of computation, which is the basis for the repetitive structure modeling (RSM) package defined in the standard UML Marte profile. It also deals with the semantics of an RSM extension for control-oriented design. The goal of this semantics is to serve as formal support for i) reasoning about the behavioral properties of models specified in Marte with RSM, and ii) defining correct-by-construction model transformations for the production of executable code in a model-driven engineering framework.

  • Butterfly Automorphisms and Edge Faults

    Publication Year: 2010, Page(s): 33 - 40
    Cited by: Papers (1)
    PDF (298 KB) | HTML

    This paper obtains all the automorphisms of a wrapped butterfly network of degree n using an algebraic model. It also investigates the translation of butterfly edges by automorphisms, and proposes a new strategy for algorithm mappings on an architecture with faulty edges. The strategy essentially consists of finding an automorphism that maps the faulty edges to free edges in the graph. Having a set of n·2^(n+1) well-defined, simple automorphisms that translate graph edges deterministically makes this a very powerful technique for dealing with edge faults. We illustrate the technique by mapping a Hamiltonian cycle onto the butterfly under various edge-fault scenarios.

  • Cost Performance Analysis in Multi-level Tree Networks

    Publication Year: 2010, Page(s): 41 - 48
    PDF (888 KB) | HTML

    A monetary network cost problem involving a homogeneous multi-level tree of processors and links is discussed. The monetary cost of processing a divisible load, which is linearly dependent on the amount of workload, is composed of a communication cost and a computing cost. A monetary network analysis is performed by aggregating the network speed parameters and network cost parameters. This yields a closed-form solution for the total monetary network cost while maintaining a minimum total parallel-processing finish time. Through a mathematical derivation of the ratio of total computing-time variation to total network-cost variation against changes in network size, insights into the trends of network performance versus network cost are obtained.

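    The flavor of the cost model can be sketched numerically (names and numbers below are our assumptions; the paper derives a closed form for the homogeneous tree rather than evaluating instances):

        def network_cost(total_load, fractions, c_comm, c_comp):
            # Divisible-load cost: node i handles a fraction of the load and
            # pays communication and computing costs linear in that share.
            return sum(f * total_load * (cm + cp)
                       for f, cm, cp in zip(fractions, c_comm, c_comp))

        # Root plus two children splitting 100 units of load (illustrative).
        print(network_cost(100.0, [0.5, 0.3, 0.2],
                           c_comm=[0.0, 1.2, 1.2],   # root pays no link cost
                           c_comp=[2.0, 1.5, 1.5]))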
  • Pretty Good Accuracy in Matrix Multiplication with GPUs

    Publication Year: 2010, Page(s): 49 - 55
    PDF (172 KB) | HTML

    With systems such as Roadrunner, there is a trend in supercomputing to offload parallel tasks to special-purpose co-processors composed of many relatively simple scalar processors. The cheaper, commodity-class equivalent of such a processor is the graphics card, potentially offering supercomputer power within the confines of a desktop PC. Graphics cards, however, are not without problems: these range from the lack of double precision on most cards to a fairly steep drop in performance when using double precision on others, with the end result that, in order to utilize the graphics card, the computation must be done in single precision. In this paper we propose a method whereby a whole digit of the accuracy lost in single-precision matrix multiplication can be regained with only a 7% loss in performance, by applying a compensated summation algorithm in a previously unexplored manner, one that at first glance should not provide any benefit. Empirical evidence shows that, though the idea is simple, it provides unexpected accuracy benefits at little cost to performance.

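    The compensated-summation idea (Kahan summation) that the method applies can be sketched in a few lines; here float32 arithmetic stands in for the GPU's single precision, and a dot product stands in for one inner loop of the matrix multiply (all of this is illustrative, not the paper's kernel):

        import numpy as np

        def kahan_dot(a, b):
            s = np.float32(0.0)   # running sum
            c = np.float32(0.0)   # running compensation (lost low-order bits)
            for x, y in zip(a, b):
                t = x * y - c
                u = np.float32(s + t)
                c = (u - s) - t   # what was rounded away in s + t
                s = u
            return s

        rng = np.random.default_rng(0)
        a = rng.random(10000, dtype=np.float32)
        b = rng.random(10000, dtype=np.float32)
        exact = np.dot(a.astype(np.float64), b.astype(np.float64))
        print(abs(np.float32(np.dot(a, b)) - exact))  # plain single precision
        print(abs(kahan_dot(a, b) - exact))           # compensated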
  • Exploiting the Power of GPUs for Multi-gigabit Wireless Baseband Processing

    Publication Year: 2010, Page(s): 56 - 62
    Cited by: Papers (1)
    PDF (333 KB) | HTML

    In this paper, we explore the feasibility of achieving gigabit baseband throughput using the vast computational power offered by graphics processors (GPUs). One of the most computationally intensive functions commonly used in baseband communications, the Fast Fourier Transform (FFT), is implemented on an NVIDIA GPU using its general-purpose computing platform, the Compute Unified Device Architecture (CUDA). The paper first investigates the implementation of an FFT algorithm on the GPU hardware, exploiting the available computational capability. It then outlines the limitations discovered and the methods used to overcome them. Finally, a new FFT algorithm is proposed which reduces interprocessor communication; further optimized by improved memory access, it enables the processing rate to exceed 4 Gbps, computing a 512-point FFT in less than 200 ns.

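    The paper's algorithm itself is not reproduced here; as an assumed illustration of the general principle (splitting one large FFT into many small, independent FFTs so that processing elements exchange data only at well-defined points), this is the classic four-step Cooley-Tukey decomposition:

        import numpy as np

        def four_step_fft(x, n1, n2):
            # len(x) == n1 * n2; the two FFT passes are embarrassingly
            # parallel, with all data exchange confined to the twiddle
            # step and the final transpose.
            a = x.reshape(n1, n2)
            b = np.fft.fft(a, axis=0)                    # n2 FFTs of size n1
            k1 = np.arange(n1).reshape(-1, 1)
            j2 = np.arange(n2).reshape(1, -1)
            b = b * np.exp(-2j * np.pi * k1 * j2 / (n1 * n2))  # twiddles
            c = np.fft.fft(b, axis=1)                    # n1 FFTs of size n2
            return c.T.ravel()                           # X[k1 + n1*k2]

        x = np.random.rand(512) + 1j * np.random.rand(512)
        assert np.allclose(four_step_fft(x, 16, 32), np.fft.fft(x))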
  • NQueens on CUDA: Optimization Issues

    Publication Year: 2010, Page(s): 63 - 70
    Cited by: Papers (2)
    PDF (428 KB) | HTML

    Today's commercial off-the-shelf computer systems are multicore computing systems combining CPUs, graphics processors (GPUs), and custom devices. In comparison with CPU cores, graphics cards are capable of executing hundreds to thousands of compute units in parallel. To benefit from these GPU computing resources, applications have to be parallelized and adapted to the target architecture. In this paper we report our experience in implementing the NQueens puzzle solution on GPUs using NVIDIA's CUDA (Compute Unified Device Architecture) technology. Using the example of memory usage and memory access, we demonstrate that optimizations of CUDA programs may have contrary results on different CUDA architectures. The evaluation results point out that it is not sufficient to use new programming languages or compilers to achieve the best results with emerging graphics-card computing.

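    For reference, a compact bit-parallel NQueens counter, the usual sequential baseline (this sketch is ours, not the paper's code); on a GPU, each subtree below a fixed placement prefix would typically become one thread's work item:

        def n_queens(n):
            full = (1 << n) - 1  # bitmask: all n columns occupied

            def solve(cols, diag_l, diag_r):
                if cols == full:
                    return 1
                count = 0
                free = full & ~(cols | diag_l | diag_r)
                while free:
                    bit = free & -free          # lowest free column
                    free ^= bit
                    count += solve(cols | bit,
                                   ((diag_l | bit) << 1) & full,
                                   (diag_r | bit) >> 1)
                return count

            return solve(0, 0, 0)

        print(n_queens(8))  # 92 solutions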
  • Parallel Cycle Based Logic Simulation Using Graphics Processing Units

    Publication Year: 2010, Page(s): 71 - 78
    Cited by: Papers (7)
    PDF (412 KB) | HTML

    Graphics Processing Units (GPUs) are gaining popularity for the parallelization of general-purpose applications. GPUs are massively parallel processors offering huge performance in a small and readily available package. At the same time, the emergence of general-purpose programming environments for GPUs, such as CUDA, shortens the learning curve of GPU programming. We present a GPU-based parallelization of a logic simulation algorithm for electronic designs. Logic simulation is a crucial component of design verification that allows one to check whether a design behaves according to its specification. Verification consumes more than 60% of the overall design cycle, so any speedup of the verification process (and of logic simulation) yields great savings and shorter time-to-market. We develop a parallel cycle-based logic simulation algorithm that uses And-Inverter Graphs (AIGs) as the design representation and exploits the massively parallel GPU architecture. We demonstrate speedups of several orders of magnitude on benchmarks using our system.

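    A minimal sequential sketch of cycle-based AIG evaluation (the node encoding is an assumption): nodes are visited in topological order, and each computes the AND of two possibly inverted fan-ins. A GPU version would evaluate all nodes of one level in parallel:

        def simulate_aig(primary_inputs, nodes):
            # nodes: list of (name, (in_a, invert_a, in_b, invert_b)),
            # already in topological order.
            values = dict(primary_inputs)
            for name, (a, inv_a, b, inv_b) in nodes:
                values[name] = (values[a] ^ inv_a) and (values[b] ^ inv_b)
            return values

        nodes = [("n1", ("x", False, "y", True)),    # n1  = x AND NOT y
                 ("out", ("n1", False, "x", False))] # out = n1 AND x
        print(simulate_aig({"x": True, "y": False}, nodes)["out"])  # True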
  • Parallel ID Shadow-Map Decompression on GPU

    Publication Year: 2010, Page(s): 79 - 84
    PDF (341 KB) | HTML

    ID shadow-maps are used for robust real-time rendering of shadows. Their primary disadvantage is their excessive size for large scenes when high-quality shadows are needed. Texture compression is an important tool for eliminating the large memory requirements and texture-size limitations of current-generation GPUs. We present a framework in which compressed ID shadow-maps are used for real-time rendering of static scene shadows. Compression is performed off-line on the CPU, and real-time decompression is performed on the GPU within a fragment shader that shadows the pixels. The use of ID shadow-maps (instead of conventional depth-based shadow-maps) is the key to high compression ratios. The ID shadow-map is compressed on the CPU by first partitioning it into blocks. Each compressed block is packed densely into a global array, while a pointer table holds a pointer to the start of every compressed block in the global array. This data organization gives the GPU random access to the start of each compressed block and thus enables fast parallel decompression. The proposed decompression shader program and the underlying data structures can be applied to any array of integers. The framework is implemented using OpenGL and GLSL.

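    A sketch of the data layout described above, with run-length encoding standing in for the actual codec: compressed blocks are packed into one global array while a pointer table records where each block starts, so any block can be decompressed independently, which is the property that makes parallel GPU decompression possible:

        def compress(ids, block_size):
            stream, pointers = [], []
            for i in range(0, len(ids), block_size):
                pointers.append(len(stream))
                block = ids[i:i + block_size]
                j = 0
                while j < len(block):          # run-length encode the block
                    run = 1
                    while j + run < len(block) and block[j + run] == block[j]:
                        run += 1
                    stream += [block[j], run]
                    j += run
            return stream, pointers

        def decompress_block(stream, pointers, b):
            end = pointers[b + 1] if b + 1 < len(pointers) else len(stream)
            out, p = [], pointers[b]
            while p < end:
                out += [stream[p]] * stream[p + 1]
                p += 2
            return out

        stream, ptrs = compress([7, 7, 7, 3, 3, 5, 5, 5, 5, 2, 2, 2], 4)
        print(decompress_block(stream, ptrs, 1))   # block 1 -> [3, 5, 5, 5]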
  • Realizing Optimization Opportunities for Distributed Applications in the Middleware Layer by Utilizing InDiGO Framework

    Publication Year: 2010, Page(s): 85 - 92
    PDF (614 KB) | HTML

    The InDiGO framework provides an infrastructure for developing generic but customizable middleware services, together with tools to customize the middleware algorithms for specific applications. Such customization allows one to optimize algorithms by removing communication that is redundant in the context of a specific application. In this paper, we apply the InDiGO framework to study a class of bidding distributed applications. In particular, we study how the level of optimization is affected by varying several application parameters, such as the number of clusters, the number of components per cluster, the number of clusters with local ordering, and the number of components per process. The results of this study help answer questions such as: What type of application information is useful for optimization in the InDiGO framework? How do the application's structure and size affect the level of optimization? We present experimental results that demonstrate the optimizations obtained when our infrastructure is utilized.

  • Practical Uniform Peer Sampling under Churn

    Publication Year: 2010, Page(s): 93 - 100
    PDF (366 KB) | HTML

    Providing independent uniform samples from a system population poses considerable problems in highly dynamic settings, such as P2P systems, where the number of participants and their unpredictable behavior (e.g., churn, crashes, etc.) may introduce significant bias. Current implementations of the Peer Sampling Service are designed to provide uniform samples only in static settings and do not consider that biased samples can directly affect the correctness of algorithms relying on a uniformity property, or be exploited by a malicious adversary to increase the effectiveness of its attacks on the system. In this paper we provide a practical solution to the biasing problem by deploying a fully distributed Peer Sampling Correction Module on top of a given, possibly biased, peer sampling service. Samples provided by the peer sampling service are locally processed by this module, using computationally efficient hash functions, before reaching the application. The effectiveness of our approach is evaluated through an extensive simulation-based study.

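    The module's internals are not given in this abstract; as an assumed illustration of how cheap hashing can correct a biased sample stream, min-wise hashing picks uniformly among the distinct peers observed, however unevenly the underlying sampler returns them:

        import hashlib

        def h(peer_id, round_salt):
            # Deterministic per-round hash of a peer identifier.
            return hashlib.sha256(f"{round_salt}:{peer_id}".encode()).digest()

        def corrected_sample(raw_samples, round_salt):
            # The distinct peer with the smallest hash is a uniform choice
            # among the distinct peers seen, regardless of sampling bias.
            return min(set(raw_samples), key=lambda p: h(p, round_salt))

        biased = ["a", "a", "a", "a", "b", "c"]  # sampler heavily favors "a"
        print(corrected_sample(biased, round_salt="round-17"))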
  • Improving Grid Fault Tolerance by Means of Global Behavior Modeling

    Publication Year: 2010, Page(s): 101 - 108
    PDF (309 KB) | HTML

    Grid systems have proved to be one of the most important new alternatives for tackling challenging problems, but to exploit their benefits, dependability and fault tolerance are key. However, the vast complexity of these systems limits the efficiency of traditional fault-tolerance techniques. It seems necessary to distinguish between resource-level fault tolerance (focused on each machine) and service-level fault tolerance (focused on global behavior). Techniques based on these concepts can handle the system's complexity and increase dependability. We present an autonomous, self-adaptive fault-tolerance framework for grid systems, based on a new approach to modeling distributed environments: the grid is considered a single entity instead of a set of independent resources. This point of view focuses on service-level fault tolerance, allowing us to see the big picture and understand the system's global behavior. The resulting model's simplicity is the key to providing system-wide fault tolerance.

  • Toward a Reliable Distributed Data Management System

    Publication Year: 2010, Page(s): 109 - 116
    PDF (607 KB) | HTML

    Modern collaborative science has placed an increasing burden on data management infrastructure to handle the increasingly large data archives being generated. Besides functionality, reliability and availability are key factors in delivering a data management system that can efficiently and effectively meet the challenges posed, and compounded, by the unbounded growth in the size of the data archives generated by scientific applications. In this paper, we present our work on improving reliability and availability in the data management system we designed for the PetaShare project; we also discuss our work on benchmarking the performance and scalability of PetaShare's metadata management system.

  • Early Performance Evaluation of New Six-Core Intel® Xeon® 5600 Family Processors for HPC

    Publication Year: 2010, Page(s): 117 - 124
    Cited by: Papers (3)
    PDF (418 KB) | HTML

    In this paper we examine what the newest member of the Intel Xeon processor family, code-named Westmere, brings to high-performance computing. We compare three generations of Intel Xeon-based systems and present a performance evolution based on 16-node clusters built from these CPUs. We compare the CPU generations using dual-socket platforms and a cluster across a number of HPC benchmarks, focusing on different performance fields and aspects. We also evaluate technologies and features such as Intel Hyper-Threading Technology (HT) and Intel Turbo Boost Technology (Turbo Mode), and the performance implications of these technologies for HPC.

  • Energy Minimization on Thread-Level Speculation in Multicore Systems

    Publication Year: 2010, Page(s): 125 - 132
    PDF (317 KB) | HTML

    Thread-Level Speculation (TLS) has shown great promise as an automatic parallelization technique for achieving high performance by partitioning a sequential program into threads that are optimistically executed in parallel. In this paper, we propose a load-balancing approach to saving energy using dynamic voltage scaling. By scaling down the voltage of processors running short threads, energy consumption on those processors can be reduced while keeping the overall system speedup nearly unchanged. Two voltage-selection strategies are investigated. With the assistance of profiling tools, we propose a static voltage-selection algorithm that minimizes energy consumption without degrading the parallelism provided by pure TLS. The other, dynamic, algorithm selects a voltage for each thread during execution, based on prediction. Our experimental results show that, on average in a 16-core CMP, energy consumption is reduced to 78.8% of that of pure TLS while execution time is stretched to 1.07 times.

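    The energy argument can be sketched with back-of-the-envelope numbers (dynamic power roughly proportional to V^2·f, with f roughly proportional to V; all figures below are illustrative, not the paper's measurements):

        def energy(cycles, v):
            # time ~ cycles / f and power ~ V^2 * f  =>  energy ~ V^2 * cycles
            return v ** 2 * cycles

        threads = [100, 60, 40]       # per-thread cycle counts; 100 is critical
        e_full = sum(energy(c, 1.0) for c in threads)
        # Slow each short thread just enough to finish with the longest one.
        e_dvs = energy(threads[0], 1.0) + sum(energy(c, c / threads[0])
                                              for c in threads[1:])
        print(e_full, e_dvs)          # 200.0 vs 128.0, same finish time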
  • Resource-Aware Compiler Prefetching for Many-Cores

    Publication Year: 2010, Page(s): 133 - 140
    Cited by: Papers (1)
    PDF (324 KB) | HTML

    Super-scalar, out-of-order processors that can have tens of read and write requests in the execution window place significant demands on Memory-Level Parallelism (MLP). Multi- and many-cores with shared parallel caches further increase MLP demand. Current cache hierarchies, however, have been unable to keep up with this trend, with modern designs allowing only 4-16 concurrent cache misses. This disconnect is exacerbated by recent highly parallel architectures (e.g., GPUs), where per-core power and area budgets favor lighter cores with fewer resources. Support for hardware and software prefetching increases MLP pressure, since these techniques overlap multiple memory requests with existing computation. In this paper, we propose and evaluate a novel Resource-Aware Prefetching (RAP) compiler algorithm that is aware of, and optimized for, the number of simultaneous prefetches supported. We show that, in situations where not enough resources are available to issue prefetch instructions for all references in a loop, it is more beneficial to decrease the prefetch distance and prefetch for as many references as possible, rather than use a fixed prefetch distance and skip prefetching for some references, as in current approaches. We implemented our algorithm in a GCC-derived compiler and evaluated its performance on an emerging fine-grained many-core architecture. Our results show that the RAP algorithm outperforms a well-known loop prefetching algorithm by up to 40.15% and the state-of-the-art GCC implementation by up to 34.79%. Moreover, we compare the RAP algorithm with a simple hardware prefetching mechanism and show improvements of up to 24.61%.

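    The core trade-off can be sketched as follows (formula and parameter names are assumptions, not the paper's exact algorithm): with too few prefetch slots, shrink the distance so every reference in the loop is still covered, rather than dropping references at a fixed distance:

        import math

        def prefetch_distance(mem_latency, loop_body_cycles, n_refs, slots):
            ideal = math.ceil(mem_latency / loop_body_cycles)  # classic choice
            if n_refs * ideal <= slots:
                return ideal                  # enough outstanding-miss slots
            return max(1, slots // n_refs)    # RAP-style: cover every reference

        # 300-cycle memory latency, 20-cycle loop body, 6 references, 16 slots.
        print(prefetch_distance(300, 20, 6, 16))   # 2 instead of the ideal 15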