2012 Third Workshop on Applications for Multi-Core Architectures (WAMCA)

Date: 24-25 Oct. 2012

  • [Back cover]

    Page(s): C4
    PDF (1248 KB)
    Freely Available from IEEE
  • [Title page i]

    Page(s): i
    PDF (87 KB)
    Freely Available from IEEE
  • [Title page iii]

    Page(s): iii
    PDF (189 KB)
    Freely Available from IEEE
  • [Copyright notice]

    Page(s): iv
    PDF (136 KB)
    Freely Available from IEEE
  • Table of contents

    Page(s): v - vi
    PDF (144 KB)
    Freely Available from IEEE
  • Message from the Organizers

    Page(s): vii
    PDF (121 KB)
    Freely Available from IEEE
  • Committees

    Page(s): viii
    PDF (117 KB)
    Freely Available from IEEE
  • Program Committee

    Page(s): ix
    PDF (118 KB)
    Freely Available from IEEE
  • A Load Distribution Algorithm Based on Profiling for Heterogeneous GPU Clusters

    Page(s): 1 - 6
    PDF (213 KB) | HTML

    Clusters of GPUs are becoming commonly used to execute computationally demanding applications. Due to frequent changes in GPU architecture, many clusters contain heterogeneous types of GPUs, leading to the problem of load distribution among the machines. In this work, we propose a load distribution algorithm for scientific applications executed on heterogeneous GPU clusters. The algorithm finds a distribution of data that minimizes the execution time of the application by guaranteeing that all GPUs spend the same amount of time processing their assigned kernels and data. We use the algorithm to execute simulations of large-scale neuronal networks, and show that it effectively balances the load among the GPUs and reduces the execution time of the application.
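The equal-time criterion described in this abstract can be illustrated with a minimal sketch (a hypothetical helper, not the authors' algorithm, assuming per-GPU throughput has already been measured by profiling): each device receives a share of the data proportional to its measured rate, so all devices finish at roughly the same moment.

```python
def split_work(total_items, measured_rates):
    """Distribute total_items across devices in proportion to each device's
    measured throughput (items/second), so all finish at about the same time."""
    total_rate = sum(measured_rates)
    shares = [int(total_items * r / total_rate) for r in measured_rates]
    # hand any integer-rounding remainder to the fastest device
    shares[measured_rates.index(max(measured_rates))] += total_items - sum(shares)
    return shares

# a GPU measured at 3x the throughput gets 3x the data
print(split_work(100, [1.0, 3.0]))  # -> [25, 75]
```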
  • A Study on Mixed Precision Techniques for a GPU-based SIP Solver

    Page(s): 7 - 12
    PDF (240 KB) | HTML

    This article presents the study and application of mixed precision techniques to accelerate a GPU-based implementation of the Strongly Implicit Procedure (SIP) for solving hepta-diagonal linear systems. In particular, two different options for incorporating mixed precision in the GPU implementation are discussed, and one of them is implemented. The experimental evaluation of our proposal demonstrates that a runtime similar to that of a single-precision GPU implementation can be attained while achieving numerical accuracy comparable to double-precision arithmetic.
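One common way to trade single-precision speed for double-precision accuracy is iterative refinement: solve in float32, then correct the solution using a float64 residual. This is a generic sketch of that idea, not necessarily the scheme the authors adopt for SIP.

```python
import numpy as np

def mixed_precision_solve(A, b, iters=3):
    """Solve Ax = b cheaply in float32, then refine toward float64 accuracy.
    Each pass computes the residual in double precision and solves for a
    correction in single precision."""
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                    # residual in double
        dx = np.linalg.solve(A32, r.astype(np.float32))  # cheap correction
        x += dx.astype(np.float64)
    return x
```

The expensive factorization/solve stays in the fast precision; only a matrix-vector product per iteration runs in double.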
  • Architecture of Request Distributor for GPU Clusters

    Page(s): 13 - 18
    PDF (441 KB) | HTML

    The advent of GPU computing has enabled the development of many strategies for accelerating different kinds of simulations. Moreover, instead of processing an application on just one GPU, it is common to use a collection of GPUs. These GPUs can be located in the same machine, in the same network, or even across a wide area network. Unfortunately, the distribution and management of GPUs require additional effort from the user, such as dealing with data transfer, connections, and processing among GPUs. The Request Distributor for GPU Clusters (RDGPUC) is a software architecture that allows companies, institutes, and other users to share their GPU resources. With this architecture, each cluster can keep its own software for managing internal resources, and only a small amount of code is needed to interact with RDGPUC. This design brings flexibility to the system and allows everyone to share resources without changing their GPU cluster tools. The system also allows users to submit requests from all kinds of devices and platforms. An administrator can specify resource groups and special schedules for using resources, while end users can submit requests through a simple interface without knowing the internal design or the current status of the GPU clusters.
  • A Hybrid CPU-GPU Local Search Heuristic for the Unrelated Parallel Machine Scheduling Problem

    Page(s): 19 - 23
    PDF (175 KB) | HTML

    This work addresses the development of a hybrid CPU-GPU local search heuristic for the unrelated parallel machine scheduling problem, in which setup times are both sequence-dependent and machine-dependent. The objective is to minimize the maximum completion time of the schedule, known as the makespan. Since the problem is NP-hard, no polynomial-time algorithm is known to solve it, so metaheuristics and local search heuristics are usually developed to find good near-optimal solutions. In general, the local search is the most expensive part of a heuristic method, so our algorithm harnesses the computing power of the GPU to decrease the local search time. We use a local search based on swapping jobs between different machines, since it is able to find good near-optimal solutions, as reported in the literature. We show that the hybrid CPU-GPU local search achieves average speedups of 10 to 27 times over the pure CPU local search.
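The swap neighbourhood mentioned above can be sketched as a first-improvement local search (a simplified illustration that omits the sequence-dependent setup times; `p[m][j]` gives the time of job `j` on machine `m`, and `assign[j]` is the machine of job `j` — all names are hypothetical, not the paper's):

```python
def swap_local_search(p, assign):
    """Swap pairs of jobs on different machines whenever the swap reduces
    the makespan; repeat until no improving swap exists."""
    def makespan(a):
        loads = [0.0] * len(p)
        for job, machine in enumerate(a):
            loads[machine] += p[machine][job]
        return max(loads)

    improved = True
    while improved:
        improved = False
        best = makespan(assign)
        for i in range(len(assign)):
            for j in range(i + 1, len(assign)):
                if assign[i] == assign[j]:
                    continue                     # same machine: no-op swap
                assign[i], assign[j] = assign[j], assign[i]
                if makespan(assign) < best:
                    best = makespan(assign)
                    improved = True              # keep the swap
                else:
                    assign[i], assign[j] = assign[j], assign[i]  # undo
    return assign, best
```

On the GPU version described in the paper, the inner pair evaluation is the part that parallelizes naturally, since each candidate swap can be scored independently.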
  • A High-Level Implementation of STM Haskell with Write/Write Conflict Detection

    Page(s): 24 - 29
    PDF (170 KB) | HTML

    This paper describes a high-level implementation of Software Transactional Memory (STM) for the Haskell language. The library is implemented entirely in Haskell and, as opposed to all other implementations of STM Haskell, features early detection of write/write conflicts. Preliminary performance measurements using the Haskell STM benchmark show that the library performs much better than a TL2 implementation written in Haskell, and performs reasonably well compared to the current implementation of STM Haskell written in C.
  • Optimal Virtual Channel Insertion for Contention Alleviation and Deadlock Avoidance in Custom NoCs

    Page(s): 30 - 35
    PDF (470 KB) | HTML

    Deadlock and contention can be avoided in an NoC architecture by employing virtual channels (VCs). However, VC insertion can increase power and chip area with little performance improvement. We present a novel VC insertion technique for deadlock avoidance and contention relief in irregular NoC architectures that avoids significant power and area increases. Given a resource pool of VCs, analytical deadlock/contention models, and a systematic pre-evaluation technique, minimal VC resources are inserted, resulting in higher performance. Several experiments are conducted on various SoC benchmark applications. Compared to past techniques, our results indicate an average performance improvement of 21%, a 32.4% decrease in power dissipation, and 79.5% resource savings.
  • A Novel Virtual Channel Implementation Technique for Multi-core On-chip Communication

    Page(s): 36 - 41
    PDF (321 KB) | HTML

    In this paper, a new approach to implementing virtual channels (VCs) for multi-core interconnection networks is presented. In this approach, the flits of different packets interleave in a channel with a single buffer of nominal depth by using rotating flit-by-flit arbitration. The routing path of each flit is preserved because flits belonging to the same packet are tagged with an ID at each router, making them distinguishable at downstream routers. This on-chip communication of packets through shared channels and buffers is a novel method of virtual channel implementation. Furthermore, we demonstrate it by adding virtual channels on demand, depending on the number of packet requests for a physical channel; in this way, NoC (Network-on-Chip) contention can be removed cheaply. Moreover, we discuss contention-free communication, in which the depth of the shared buffer does not affect performance. Contention-free communication with a small buffer depth (one) can yield efficient on-chip communication with high performance, small chip area, and low power consumption.
  • Autotuning Wavefront Abstractions for Heterogeneous Architectures

    Page(s): 42 - 47
    PDF (1009 KB) | HTML

    We present our autotuned heterogeneous parallel programming abstraction for the wavefront pattern. An exhaustive search of the tuning space indicates that the correct setting of tuning factors yields an average 37x speedup over a sequential baseline. Our best automated machine-learning-based heuristic obtains 92% of this ideal speedup, averaged across our full range of wavefront examples.
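For context, the wavefront pattern referred to above sweeps a grid along anti-diagonals: each cell depends on its north and west neighbours, so all cells on the same anti-diagonal are independent and can run in parallel. A minimal sequential sketch of the dependency structure (an illustration, not the authors' abstraction):

```python
def wavefront(n, m, f):
    """Fill an n-by-m grid where cell (i, j) depends on cells (i-1, j) and
    (i, j-1). Cells on the same anti-diagonal d = i + j are independent,
    which is what a parallel wavefront implementation exploits."""
    grid = [[0] * m for _ in range(n)]
    for d in range(n + m - 1):                        # sweep anti-diagonals
        for i in range(max(0, d - m + 1), min(n, d + 1)):
            j = d - i
            north = grid[i - 1][j] if i > 0 else 0
            west = grid[i][j - 1] if j > 0 else 0
            grid[i][j] = f(north, west)
    return grid
```

The tuning factors an autotuner explores for this pattern typically include how many diagonals to batch per kernel launch and how cells are grouped into work items.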
  • Time-to-Solution and Energy-to-Solution: A Comparison between ARM and Xeon

    Page(s): 48 - 53
    PDF (163 KB) | HTML

    Most High Performance Computing (HPC) systems today are known as "power hungry" because they aim at computing speed regardless of energy consumption. Some scientific applications still demand more speed, and the community expects to reach exascale by the end of the decade. Nevertheless, to reach exascale we need to search for alternatives that cope with energy constraints. A promising step in this direction is the use of low-power processors such as ARM. ARM processors target low power consumption, in contrast to Xeon processors, which are conventional in HPC and aim at computing speed. This paper presents a comparison between ARM and Xeon to evaluate whether ARM is the future building block for HPC. We use time-to-solution, peak power, and energy-to-solution to evaluate both processors from the user's perspective. The results show that, although ARM has lower peak power, Xeon still offers a better tradeoff from the user's point of view.
  • Scheduling Cyclic Task Graphs with SCC-Map

    Page(s): 54 - 59
    PDF (643 KB) | HTML

    The dataflow execution model has been shown to be a good way of exploiting TLP, making parallel programming easier. In this model, tasks must be mapped to processing elements (PEs) considering the trade-off between communication and parallelism. Previous work on scheduling dependency graphs has mostly focused on directed acyclic graphs, which are not suitable for dataflow (loops in the code become cycles in the graph). Thus, we present SCC-Map: a novel static mapping algorithm that considers the importance of cycles during the mapping process. To validate our approach, we ran a set of benchmarks on our dataflow simulator, varying the communication latency, the number of PEs in the system, and the placement algorithm. Our results show that the benchmark programs run significantly faster when mapped with SCC-Map. Moreover, we observed that SCC-Map is more effective than the other mapping algorithms when communication latency is higher.
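The algorithm's name refers to strongly connected components: cycles in the dataflow graph collapse into SCCs, which can then be mapped as units. For reference, a minimal SCC computation using the standard Kosaraju algorithm (this is background, not the authors' mapping heuristic):

```python
def sccs(adj):
    """Strongly connected components of a digraph given as
    {node: [successors]}, via Kosaraju's two-pass algorithm."""
    order, seen = [], set()

    def dfs(u):                      # pass 1: record finish order
        seen.add(u)
        for v in adj.get(u, []):
            if v not in seen:
                dfs(v)
        order.append(u)

    for u in adj:
        if u not in seen:
            dfs(u)

    rev = {}                         # transpose the graph
    for u, vs in adj.items():
        for v in vs:
            rev.setdefault(v, []).append(u)

    comps, seen = [], set()
    for u in reversed(order):        # pass 2: flood-fill the transpose
        if u in seen:
            continue
        comp, stack = [], [u]
        seen.add(u)
        while stack:
            x = stack.pop()
            comp.append(x)
            for y in rev.get(x, []):
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        comps.append(sorted(comp))
    return comps
```

A mapper that respects cycles would place each resulting component on one PE (or a tightly coupled group) to keep the cycle's communication local.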
  • Author index

    Page(s): 60
    PDF (67 KB)
    Freely Available from IEEE
  • [Publisher's information]

    Page(s): 62
    PDF (135 KB)
    Freely Available from IEEE