Parallel and Distributed Processing with Applications (ISPA), 2012 IEEE 10th International Symposium on

Date 10-13 July 2012

  • [Cover art]

    Publication Year: 2012 , Page(s): C4
  • [Title page i]

    Publication Year: 2012 , Page(s): i
  • [Title page iii]

    Publication Year: 2012 , Page(s): iii
  • [Copyright notice]

    Publication Year: 2012 , Page(s): iv
  • Table of contents

    Publication Year: 2012 , Page(s): v - xiv
  • Message from the General Chairs

    Publication Year: 2012 , Page(s): xv
  • Message from the Program Committee Chairs

    Publication Year: 2012 , Page(s): xvi - xvii
  • Organizing Committee

    Publication Year: 2012 , Page(s): xviii - xix
  • Program Committee

    Publication Year: 2012 , Page(s): xx - xxii
  • Workshop Committees

    Publication Year: 2012 , Page(s): xxiii - xxvii
  • Reviewers

    Publication Year: 2012 , Page(s): xxviii
  • A Parallel Procedure for Dynamic Multi-objective TSP

    Publication Year: 2012 , Page(s): 1 - 8

    This paper proposes a new parallel search procedure for the dynamic multi-objective traveling salesman problem (TSP). We design a multi-objective TSP in a stochastic dynamic environment. The proposed procedure first uses parallel processors to identify, simultaneously, the extreme solutions of the search space for each of the k objectives individually. These solutions are merged into a matrix E. The solutions in E are then searched by parallel processors and evaluated for dominance relationships. The procedure was implemented in two different ways: a master-worker architecture and a pipeline architecture.
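The dominance evaluation mentioned in the abstract can be illustrated with a minimal sequential sketch. It assumes every objective is minimized and uses made-up objective values; the function names `dominates` and `non_dominated` are hypothetical, and nothing here reproduces the paper's parallel procedure or its matrix E.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a is no worse than b in every objective and
    strictly better in at least one (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(solutions: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep only the solutions that no other solution dominates."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions)]


# Hypothetical (tour length, tour time) values for four candidate tours.
candidates = [(10.0, 7.0), (8.0, 9.0), (12.0, 8.0), (8.0, 7.5)]
front = non_dominated(candidates)
print(front)
```

The pairwise check is quadratic in the number of candidates, which is one reason such an evaluation is a natural candidate for distribution across workers.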

  • Multiple spanning tree construction for deadlock-free adaptive routing in irregular networks

    Publication Year: 2012 , Page(s): 9 - 16
    Cited by:  Papers (1)

    This paper proposes a new adaptive deadlock-free routing scheme that improves the performance of irregular networks by using multiple spanning trees. The traditional up*/down* routing algorithm uses a single root, which can become a hotspot. The multiple-layer spanning tree scheme uses different virtual networks for different spanning trees. The bandwidth of the system may not be fully utilized. The proposed deadlock-free adaptive routing scheme uses multiple spanning trees, where different packets are delivered along different trees. Routing based on multiple spanning trees may introduce potential cyclic channel dependencies in VCT-switched networks. A simple scheme is proposed to avoid these potential cyclic channel dependencies by selecting the minimum number of constrained turns. Simulation results are presented to show the effectiveness of the proposed method.
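For readers unfamiliar with the single-root up*/down* baseline that the abstract contrasts against, the sketch below shows the classic rule: links are oriented by a breadth-first spanning tree, and a route is legal only if it never takes an up link after a down link. The four-node topology and the function names are assumptions for illustration; this is not the multiple-spanning-tree scheme the paper proposes.

```python
from collections import deque
from typing import Dict, List


def bfs_levels(adj: Dict[int, List[int]], root: int) -> Dict[int, int]:
    """Distance of every node from the root of the BFS spanning tree."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level


def is_up(u: int, v: int, level: Dict[int, int]) -> bool:
    """Link u->v is 'up' if v is closer to the root; ties go to the lower id."""
    return (level[v], v) < (level[u], u)


def is_legal(path: List[int], level: Dict[int, int]) -> bool:
    """A legal route uses zero or more up links, then zero or more down links."""
    gone_down = False
    for u, v in zip(path, path[1:]):
        if is_up(u, v, level):
            if gone_down:
                return False  # an up link after a down link is forbidden
        else:
            gone_down = True
    return True


# Hypothetical irregular network: a 4-node ring rooted at node 0.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
level = bfs_levels(adj, root=0)
print(is_legal([3, 1, 0, 2], level))  # up, up, down
print(is_legal([1, 3, 2], level))     # down, then up
```

Every route that changes subtrees must climb toward the root, which is why the single root can become the hotspot the abstract describes.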

  • Chemical Reaction Optimization for Heterogeneous Computing Environments

    Publication Year: 2012 , Page(s): 17 - 23

    Task scheduling has been proven to be an NP-hard problem, and the best solutions are usually approximated with classical algorithms such as Heterogeneous Earliest Finish Time (HEFT) or genetic algorithms. However, the wide variety of scheduling problems and the small number of generally acknowledged methods mean that more methods are needed. In this paper, we propose a new method to schedule the execution of a group of dependent tasks in heterogeneous computing environments. The algorithm consists of two elements: an intelligent approach that assigns the execution order of tasks by task level, and an allocation algorithm based on a chemical-reaction-inspired metaheuristic called Chemical Reaction Optimization (CRO) that maps processors to tasks. The experiments show that the CRO-based algorithm performs consistently better than HEFT and Critical Path On a Processor (CPOP) without incurring much computational cost. Multiple runs of the algorithm can further improve the search result.
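Ordering dependent tasks by level can be sketched as follows. The abstract does not define "task level", so this example assumes a common definition: the length of the longest dependency chain from any entry task. The task graph and the name `task_levels` are hypothetical, and the CRO-based processor allocation is not shown.

```python
from typing import Dict, List


def task_levels(preds: Dict[str, List[str]]) -> Dict[str, int]:
    """Level of a task = longest chain of predecessors leading to it.
    Entry tasks (no predecessors) have level 0. Assumes the graph is a DAG."""
    levels: Dict[str, int] = {}

    def level(task: str) -> int:
        if task not in levels:
            parents = preds[task]
            levels[task] = 0 if not parents else 1 + max(level(p) for p in parents)
        return levels[task]

    for task in preds:
        level(task)
    return levels


# Hypothetical task graph: each task maps to the tasks it depends on.
preds = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["C"]}
levels = task_levels(preds)

# Executing tasks in order of increasing level respects every dependency.
order = sorted(preds, key=lambda t: (levels[t], t))
print(levels, order)
```

Tasks that share a level have no dependencies on one another, so they are the ones a scheduler is free to spread across processors.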

  • On Adaptive Contention Management Strategies for Software Transactional Memory

    Publication Year: 2012 , Page(s): 24 - 31

    Software Transactional Memory (STM) is an alternative synchronization method to traditional lock-based schemes. In an STM system, the contention manager (CM) decides what action to take when a conflict occurs. The CM is crucial to the performance of STM systems. However, the performance of existing CMs is sensitive to transaction workloads and STM configurations. A static policy is therefore unsatisfactory. In this paper, we argue that an adaptive contention manager (ACM) is necessary and feasible. We further present an ACM policy that can adaptively choose a suitable CM at run time. We prove that our adaptation strategy preserves livelock (starvation) freedom as long as the pool of CMs to adapt from contains at least one livelock-free (starvation-free) CM. Experimental results demonstrate that our approach can choose proper CMs and achieves higher average throughput than existing static CM strategies.

  • Moving Multimedia Simulations into the Cloud: A Cost-effective Solution

    Publication Year: 2012 , Page(s): 32 - 39

    Researchers often need bursts of computing power to quickly obtain the results of certain simulation activities. Multimedia communication simulations usually belong to this category. Depending on the complexity of the scenario, they may require several days on a generic PC to test a comprehensive set of conditions. This paper proposes to use a cloud computing framework to accelerate these simulations and, consequently, research activities, while at the same time reducing the overall costs. A practical example is shown, representative of a typical simulation of H.264/AVC video communications over a wireless channel. This work shows that, by means of a commercial cloud computing provider, the gains of the proposed technique compared to more traditional solutions using dedicated computers can be significant in terms of speed and cost reduction.

  • Design and Performance Issues of Cholesky and LU Solvers Using UPCBLAS

    Publication Year: 2012 , Page(s): 40 - 47

    Partitioned Global Address Space (PGAS) languages offer programmers a shared memory view that increases their productivity, and they allow locality exploitation to obtain good performance on current large-scale distributed memory systems. UPCBLAS is a parallel numerical library for dense matrix computations using the PGAS Unified Parallel C (UPC) language. The interface of this library exploits the characteristics of the PGAS memory model and is thus easier to use than MPI-based libraries. This paper addresses the implementation of solvers of systems of equations through Cholesky and LU factorizations in UPC using UPCBLAS. The developed codes are experimentally evaluated and compared to the MPI versions using ScaLAPACK. Parallel solvers of equations are present in many parallel numerical applications, and they have traditionally been developed in MPI. This work shows that UPCBLAS can be considered a good alternative to MPI-based libraries for increasing the productivity of numerical application developers.
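As a reminder of what a Cholesky-based solver computes, here is a small sequential sketch in plain Python: factor a symmetric positive definite matrix as A = L L^T, then solve by forward and backward substitution. It is purely illustrative; it does not use UPC, UPCBLAS, or ScaLAPACK, and the function names and the 2x2 system are assumptions.

```python
import math
from typing import List

Matrix = List[List[float]]


def cholesky(a: Matrix) -> Matrix:
    """Return lower-triangular L with A = L * L^T (A symmetric positive definite)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = math.sqrt(a[i][i] - s)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l


def solve_spd(a: Matrix, b: List[float]) -> List[float]:
    """Solve A x = b via Cholesky: L y = b (forward), then L^T x = y (backward)."""
    l = cholesky(a)
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x


# Hypothetical 2x2 system: 4x + 2y = 10, 2x + 3y = 8.
x = solve_spd([[4.0, 2.0], [2.0, 3.0]], [10.0, 8.0])
print(x)
```

Parallel libraries typically apply the same factorization to blocks of the matrix rather than to single entries, so that the bulk of the work becomes matrix-matrix operations.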

  • Hybrid MPI/StarSs -- A Case Study

    Publication Year: 2012 , Page(s): 48 - 55
    Cited by:  Papers (1)

    Hybrid parallel programming models combining distributed and shared memory paradigms are well established in high-performance computing (HPC). The classical prototype of hybrid programming in HPC is MPI/OpenMP, but many other combinations are being investigated. Recently, the data-dependency-driven, task-parallel model for shared memory parallelisation named StarSs has been suggested for use in combination with MPI. In this paper we apply hybrid MPI/StarSs to a Lattice-Boltzmann code. In particular, we present the hybrid programming model, the benefits we expect, the challenges in porting, and finally a comparison of the performance of the MPI/StarSs hybrid, the MPI/OpenMP hybrid, and the original MPI-only versions of the same code.

  • Reducing Energy Consumption of Dense Linear Algebra Operations on Hybrid CPU-GPU Platforms

    Publication Year: 2012 , Page(s): 56 - 62
    Cited by:  Papers (1)

    We investigate the balance between the time-to-solution and the energy consumption of a task-parallel execution of the Cholesky and LU factorizations on a hybrid platform equipped with a multi-core processor and several GPUs. To improve energy efficiency, we incorporate two energy-saving techniques into the runtime in charge of scheduling the computations, to block idle threads and enable the transition of the general-purpose cores to a more energy-friendly state. Experiments on an Intel Xeon-based platform connected to an NVIDIA Tesla server report an average reduction in energy consumption close to 9% (38% when only the consumption associated with the application is considered), for a minor increase in the execution time of the algorithm.

  • Binding Performance and Power of Dense Linear Algebra Operations

    Publication Year: 2012 , Page(s): 63 - 70

    In this paper we combine a powerful tracing framework with a power measurement setup to perform a visual analysis of the computational performance and the power consumption of tuned implementations for three key dense linear algebra operations: the LU factorization, the Cholesky factorization, and the reduction to tridiagonal form. Our results using 6 and 12 cores of an AMD Opteron-based platform reveal the serial/concurrent phases of the algorithms, and their connection to periods of low/high power consumption, as well as the linear dependency between execution time and energy for this class of operations.

  • Optimizing Option Pricing Algorithms and Profiling Power Consumption on VLIW APU Architecture

    Publication Year: 2012 , Page(s): 71 - 78
    Cited by:  Papers (1)

    Heterogeneous multi-core architectures have become an integral component of high performance systems and high performance scientific computing (HPC). These systems have been vital for research applications but, until recently, have not been a factor in the consumer-level experience. However, with new technologies such as AMD's Accelerated Processing Unit (APU), which combines the Central Processing Unit and Graphics Processing Unit on a single die, consumers now have an affordable high performance system at their disposal. AMD's APUs are aimed at providing good performance and low power consumption for all markets. Financial applications can benefit from this heterogeneous architecture for real-time processing. However, to obtain good performance, algorithms must be coded to utilize the APU architecture efficiently. In this paper, we optimize two option pricing algorithms on the APU, making use of vectorization and loop unrolling for improved performance. Our algorithms are tested on both an ATI Mobility Radeon 5870 and an AMD E-350 APU, which use the VLIW5 architecture. We also study the power consumption of these architectures to determine how they compare to traditional CPU- and GPU-based systems.

  • Efficient GPU Asynchronous Implementation of a Watershed Algorithm Based on Cellular Automata

    Publication Year: 2012 , Page(s): 79 - 86

    The watershed transform is a widely used method for non-supervised image segmentation, especially suitable for low-contrast images. In this paper we show that an algorithm calculating the watershed transform based on a cellular automaton is a good choice for the most recent GPU architectures, especially when the synchronization rules are relaxed. In particular, we compare a synchronous and an asynchronous implementation of the algorithm. The results show high speedups for both implementations, especially for the asynchronous one, indicating the potential of this kind of algorithm for new architectures based on hundreds of cores.

  • Memory Hierarchy Optimization for Large Tridiagonal System Solvers on GPU

    Publication Year: 2012 , Page(s): 87 - 94

    GPUs are now commodity hardware containing hundreds of cores and supporting thousands of threads that can be used to accelerate a wide range of applications. From a programmer's perspective, GPUs offer a stream processing model that requires new techniques to exploit their capabilities. In this paper we present the application of the split-and-merge technique to the following parallel tridiagonal system solvers on the GPU: cyclic reduction and recursive doubling. The split-and-merge technique naturally splits the algorithm flow into parallel paths that can be solved in shared memory and later merged in global memory. In this way, we can solve large systems of equations efficiently by exploiting the memory hierarchy of the GPU. The results obtained show a significant acceleration compared with the direct implementation of the algorithms on the GPU.
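To make cyclic reduction concrete, the following sequential sketch solves a tridiagonal system a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] by eliminating the even-indexed unknowns, recursing on the half-size system, and back-substituting. It assumes a[0] = c[n-1] = 0 and a well-conditioned (for example diagonally dominant) matrix. The function name and test system are hypothetical, and the paper's split-and-merge use of shared and global memory is not modeled.

```python
from typing import List


def cyclic_reduction(a: List[float], b: List[float],
                     c: List[float], d: List[float]) -> List[float]:
    """Solve a tridiagonal system; a is the sub-diagonal (a[0] == 0),
    b the diagonal, c the super-diagonal (c[-1] == 0), d the right-hand side."""
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]

    # Forward reduction: rewrite each odd-indexed equation so that it only
    # couples odd-indexed unknowns. On a GPU these iterations are independent.
    ra, rb, rc, rd = [], [], [], []
    for i in range(1, n, 2):
        alpha = a[i] / b[i - 1]
        new_a = -alpha * a[i - 1]
        new_b = b[i] - alpha * c[i - 1]
        new_c = 0.0
        new_d = d[i] - alpha * d[i - 1]
        if i + 1 < n:
            gamma = c[i] / b[i + 1]
            new_b -= gamma * a[i + 1]
            new_c = -gamma * c[i + 1]
            new_d -= gamma * d[i + 1]
        ra.append(new_a)
        rb.append(new_b)
        rc.append(new_c)
        rd.append(new_d)

    # Solve the reduced system for the odd-indexed unknowns.
    x = [0.0] * n
    x[1::2] = cyclic_reduction(ra, rb, rc, rd)

    # Back substitution: recover the even-indexed unknowns.
    for i in range(0, n, 2):
        left = a[i] * x[i - 1] if i > 0 else 0.0
        right = c[i] * x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - left - right) / b[i]
    return x


# Hypothetical 5x5 system whose exact solution is [1, 2, 3, 4, 5].
sol = cyclic_reduction([0.0, 1.0, 1.0, 1.0, 1.0],
                       [4.0, 4.0, 4.0, 4.0, 4.0],
                       [1.0, 1.0, 1.0, 1.0, 0.0],
                       [6.0, 12.0, 18.0, 24.0, 24.0])
print(sol)
```

Each level of the recursion halves the number of unknowns, so the reduction depth grows logarithmically with the system size.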

  • Efficient Image Re-Ranking Computation on GPUs

    Publication Year: 2012 , Page(s): 95 - 102
    Cited by:  Papers (1)

    The growth of image collections and available multimedia resources is remarkable. One of the most common approaches to supporting image search relies on Content-Based Image Retrieval (CBIR) systems. CBIR systems aim to retrieve the images in a collection that are most similar to a given query image. Since the effectiveness of those systems depends heavily on the accuracy of ranking approaches, re-ranking algorithms have been proposed to exploit contextual information and improve the effectiveness of CBIR systems. Image re-ranking algorithms typically consider the relationships among all images in a given dataset when computing the new ranking. This approach demands a huge amount of computational power, which may render it prohibitive on very large datasets. To mitigate this problem, we propose using the computational power of Graphics Processing Units (GPUs) to speed up the computation of image re-ranking algorithms. GPUs are fast-emerging and relatively inexpensive parallel processors that are becoming available on a wide range of computer systems. In this paper, we propose a parallel implementation of an image re-ranking algorithm designed to fit the computational model of GPUs. Experimental results demonstrate that relevant performance gains can be obtained by our approach.

  • Portfolio Management Using Particle Swarm Optimization on GPU

    Publication Year: 2012 , Page(s): 103 - 110
    Cited by:  Papers (1)

    Mathematical models such as the Black-Scholes-Merton model are used to price simple, plain options approximately, in the form of a closed-form solution. The market is flooded with various styles of options, which are difficult to price. Numerical techniques used for pricing take an exorbitant amount of time to reach reasonable accuracy. Heuristic approaches such as Particle Swarm Optimization (PSO) have been proposed for option pricing; for simple options, they provide the same or better results than numerical techniques at much lower computational cost (time). In this work, we first investigate the characteristics of PSO for option pricing and propose improvements to PSO modeling that reduce the number of PSO parameters without loss of generality for the financial application under study. We use our improved PSO model (called NPSO) to price the complex chooser option, one of the more complicated options in the market. Cooperation among the particles of the NPSO helps reach the solution in less time. Interest in diversifying investments stems from the need to avert the risk involved in any single type of investment. The complex chooser option is shown to exhibit the characteristics of a financial portfolio. As a further study, we use NPSO for portfolio optimization. We have implemented our NPSO model on a state-of-the-art multi-core graphics processing unit (GPU) platform and show that the computational time can be significantly reduced.
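For orientation, the sketch below is a textbook global-best PSO, in which particles cooperate by being pulled toward the best position any of them has found. It is not the paper's NPSO: the inertia and acceleration constants, the search bounds, and the sphere objective are all assumptions chosen for illustration, and no option-pricing or portfolio model is included.

```python
import random
from typing import Callable, List, Tuple


def pso(f: Callable[[List[float]], float], dim: int,
        n_particles: int = 30, iters: int = 200,
        lo: float = -5.0, hi: float = 5.0,
        w: float = 0.729, c1: float = 1.49445, c2: float = 1.49445,
        seed: int = 0) -> Tuple[List[float], float]:
    """Minimize f with a basic global-best particle swarm."""
    rng = random.Random(seed)
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]          # best position seen by each particle
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position seen by the swarm

    for _ in range(iters):
        for i in range(n_particles):
            for k in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward own best + pull toward swarm best.
                vel[i][k] = (w * vel[i][k]
                             + c1 * r1 * (pbest[i][k] - pos[i][k])
                             + c2 * r2 * (gbest[k] - pos[i][k]))
                pos[i][k] += vel[i][k]
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def sphere(x: List[float]) -> float:
    """Placeholder objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


best, best_val = pso(sphere, dim=2)
print(best, best_val)
```

The objective evaluations and velocity updates of different particles within one iteration do not depend on each other, which is what makes the method a natural fit for GPUs.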
