By Topic

Application Accelerators in High-Performance Computing (SAAHPC), 2011 Symposium on

Date 19-21 July 2011

Filter Results

Displaying Results 1 - 25 of 33
  • [Front cover]

    Page(s): C1
    Save to Project icon | Request Permissions | PDF file iconPDF (571 KB)  
    Freely Available from IEEE
  • [Title page i]

    Page(s): i
    Save to Project icon | Request Permissions | PDF file iconPDF (66 KB)  
    Freely Available from IEEE
  • [Title page iii]

    Page(s): iii
    Save to Project icon | Request Permissions | PDF file iconPDF (130 KB)  
    Freely Available from IEEE
  • [Copyright notice]

    Page(s): iv
    Save to Project icon | Request Permissions | PDF file iconPDF (110 KB)  
    Freely Available from IEEE
  • Table of contents

    Page(s): v - vii
    Save to Project icon | Request Permissions | PDF file iconPDF (194 KB)  
    Freely Available from IEEE
  • Foreword

    Page(s): viii
    Save to Project icon | Request Permissions | PDF file iconPDF (84 KB)  
    Freely Available from IEEE
  • Conference Committees

    Page(s): ix - x
    Save to Project icon | Request Permissions | PDF file iconPDF (56 KB)  
    Freely Available from IEEE
  • Conference sponsors

    Page(s): xi
    Save to Project icon | Request Permissions | PDF file iconPDF (163 KB)  
    Freely Available from IEEE
  • Real-Time Object Tracking System on FPGAs

    Page(s): 1 - 7
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (475 KB) |  | HTML iconHTML  

    Object tracking is an important task in computer vision applications. One of the crucial challenges is the real-time speed requirement. In this paper we implement an object tracking system in reconfigurable hardware using an efficient parallel architecture. In our implementation, we adopt a background subtraction based algorithm. The designed object tracker exploits hardware parallelism to achieve high system speed. We also propose a dual object region search technique to further boost the performance of our system under complex tracking conditions. For our hardware implementation we use the Alter a Stratix III EP3SL340H1152C2 FPGA device. We compare the proposed FPGA-based implementation with the software implementation running on a 2.2 GHz processor. The observed speedup can reach more than 100X for complex video inputs. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Iterative Refinement on FPGAs

    Page(s): 8 - 13
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (220 KB) |  | HTML iconHTML  

    Achievable accuracy for mixed precision iterative refinement depends on the precisions supported by computing platforms. Even though the arithmetic unit precision can be flexible for programmable logic computing architectures (e.g. FPGAs), previous work rarely discusses the performance benefits due to enabling flexible achievable accuracy. Hence, we propose an iterative refinement approach on FPGAs which employs an arbitrary precision for the iterative refinement to obtain an arbitrary accuracy. We implement single processing elements for the refinement on the Xilinx XC5VLX110T and compare them to Xilinx XC6VSX475T for performance estimation. This paper shows that the performance is similar to the NVIDIA GTX480 when a user requires accuracies between single and double precision, but the implementation can also produce beyond double precision accuracy. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • GPU-Accelerated Wire-Length Estimation for FPGA Placement

    Page(s): 14 - 23
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (3339 KB) |  | HTML iconHTML  

    In the FPGA design flow, placement remains one of the most time-consuming stages, and is also crucial in terms of quality of result. HPWL and Star+ are widely used as cost metrics in FPGA placement for estimating the total wire-length of a candidate placement prior to routing. However, both wire-length models are expensive to compute requiring O(nm) time, where n is the number of nets and m is the average net cardinality. This paper proposes using the massively multi-threaded architecture provided by GPUs to reduce the time required to compute HPWL and Star+. First, a specialized set of data structures is developed for storing net-connectivity information on the GPU. Next, a study is performed to determine how to best map the data structures onto the GPU to exploit the heterogeneous memories and thread-level parallelism that are available. Finally, a study is performed to determine what effect circuit size and net cardinality have on the speedups that can be achieved. Overall, the results show that speedups of as much as 160x over a serial CPU implementation can be achieved for both models when tested using standard benchmarks. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Accelerating a Climate Physics Model with OpenCL

    Page(s): 24 - 33
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (334 KB) |  | HTML iconHTML  

    Open Computing Language (OpenCL) is fast becoming the standard for heterogeneous parallel computing. It is designed to run on CPUs, GPUs, and other accelerator architectures. By implementing a real world application, a solar radiation model component widely used in climate and weather models, we show that OpenCL multi-threaded programming and execution model can dramatically increase performance even on CPU architectures. Our preliminary investigation indicates that low-level vector instructions and code representations in OpenCL contribute to dramatic performance improvement over the serial version when compared with the execution of the serial code compiled across various compilers on multiple platforms with auto vectorization flags. However, the portability of OpenCL implementations needs to improve, even for CPU architectures. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Experience Applying Fortran GPU Compilers to Numerical Weather Prediction

    Page(s): 34 - 41
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (239 KB) |  | HTML iconHTML  

    Graphics Processing Units (GPUs) have enabled significant improvements in computational performance compared to traditional CPUs in several application domains. Until recently, GPUs have been programmed using C/C++ based methods such as CUDA (NVIDIA) and OpenCL (NVIDIA and AMD). Using these approaches, Fortran Numerical Weather Prediction (NWP) codes would have to be completely re-written to take full advantage of GPU performance gains. Emerging commercial Fortran compilers allow NWP codes to take advantage of GPU processing power with much less software development effort. The Non-hydrostatic Icosahedral Model (NIM) is a prototype dynamical core for global NWP. We use NIM to examine Fortran directive-based GPU compilers, evaluating code porting effort and computational performance. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • A Study of the Performance of Multifluid PPM Gas Dynamics on CPUs and GPUs

    Page(s): 42 - 51
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (387 KB) |  | HTML iconHTML  

    The potential for GPUs and many-core CPUs to support high performance computation in the area of computational fluid dynamics (CFD) is explored quantitatively through the example of the PPM gas dynamics code with PPB multi fluid volume fraction advection. This code has already been implemented on the IBM Cell processor and run at full scale on the Los Alamos Roadrunner machine. This implementation has involved a complete restructuring of the code that has been described in detail elsewhere. Here the lessons learned from that work are exploited to take advantage oftoday's latest generations of multi-core CPUs and many-core GPUs. The operations performed by this code are characterized in detail after being first decomposed into a series of individual code kernels to allow an implementation on GPUs. Careful implementations of this code for both CPUs and GPU sare then contrasted from a performance point of view. In addition, a single kernel that has many of the characteristics of the full application on CPUs has been built into a full, standalone, scalable parallel application. This single-kernel application shows the GPU at its best. In contrast, the full multi fluid gas dynamics application brings into play computational requirements that highlight the essential differences in CPU and GPU designs today and the different programming strategies needed to achieve the best performance for applications of this type on the two devices. The single kernel application code performs extremely well on both platforms. This application is not limited by main memory bandwidth on either device instead it is limited only by the computational capability of each. In this case, the GPU has the advantage, because it has more computational cores. The full multi fluid gas dynamics code is, however, of necessity memory bandwidth limited on the GPU, while it is still computational capability limited on the CPU. We believe that these codes provide a useful context for quantifying the costs a- - nd benefits of design decisions for these powerful new computing devices. Suggestions for improvements in both devices and codes based upon this work are offered in our conclusions. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Non-serial Polyadic Dynamic Programming on a Data-Parallel Many-core Architecture

    Page(s): 52 - 55
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (283 KB) |  | HTML iconHTML  

    Dynamic Programming (DP) is a method for efficiently solving a broad range of search and optimization problems. As a result, techniques for managing large-scale DP problems are often critical to the performance of many applications. DP algorithms are often hard to parallelize. In this paper, we address the challenge of exploiting fine grain parallelism on a family of DP algorithms known as non-serial polyadic. We use an abstract formulation of non-serial polyadic DP, derived from RNA secondary structure prediction and matrix parenthesization approaches that are well-known and important problems from this family. We present a load balancing algorithm that achieves the best overall performance with this type of workload on many-core architectures. A divide-and-conquer approach previously used on multi-core architectures is compared against an iterative version. To evaluate these approaches, the algorithm was implemented on three NVIDIA GPUs using CUDA. We achieved up to 10 GFLOP/s performance and up to 228x speedup over the single-threaded CPU implementation. Moreover, the iterative approach results in up to 3.92x speedup over the divide-and-conquer approach. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Design and Simulation of a Rectangular Meshotron Unit Prototype

    Page(s): 56 - 59
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (207 KB) |  | HTML iconHTML  

    A novel application-specific hardware (ASH) unit was designed to form the building block of the Meshotron - a parallelisation network for three-dimensional (3D) digital wave guide-mesh (DWM) room acoustic models. The rectangular mesh topology was elected. This ASH unit was tested using professional hardware simulation tools, assuming 32-bit integer data. Room impulse responses (RIR) were obtained for a set of small models under different test conditions, using both single-unit and multi-unit configurations. They proved exactly identical to those obtained using 3D DWM modelling software for the same models and test conditions, which validates the design. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Transformation of Scientific Algorithms to Parallel Computing Code: Single GPU and MPI Multi GPU Backends with Subdomain Support

    Page(s): 60 - 63
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (1374 KB) |  | HTML iconHTML  

    We propose an approach for high-performance scientific computing that separates the description of algorithms from the generation of code for parallel hardware architectures like Multi-Core CPUs, GPUs or FPGAs. This way, a scientist can focus on his domain of expertise by describing his algorithms generically without the need to have knowledge of specific hardware architectures, programming languages, APIs or tool flows. We present our prototype implementation that allows for transforming generic descriptions of algorithms with intensive array-type data access to highly optimized code for GPU and multi GPU cluster systems. We evaluate the approach for an example from the domain of computational nanophotonics and show that our current tool flow is able to generate efficient code that achieves speedups of up to 15.3x for a single GPU and even 35.9x for a multi GPU setup compared to a reference CPU implementation. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • A First Analysis of a Dynamic Memory Allocation Controller (DMAC) Core

    Page(s): 64 - 67
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (670 KB) |  | HTML iconHTML  

    Networking performance continues to grow but processor clock frequencies have not. Likewise, the latency to primary memory is not expected to improve dramatically either. This is leading computer architects to reconsider the networking subsystem and the roles and responsibilities of hardware and the operating system. This paper presents the first component of a new networking subsystem where the hardware is responsible for buffering, when necessary, messages without interrupting or involving the operating system. The design is presented and its functionality is demonstrated. The core on an FPGA is exercised with a synthetic stream of messages and the results show that the analytical performance model and measured performance agree. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Application of Graphics Processing Units (GPUs) to the Study of Non-linear Dynamics of the Exciton Bose-Einstein Condensate in a Semiconductor Quantum Well

    Page(s): 68 - 71
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (206 KB) |  | HTML iconHTML  

    In this paper, we explore the use of Graphics Processing Units (GPUs) to solve numerically the nonlinear Gross-Pitaevskii equation with an external potential. Our implementation uses NVIDIA's Compute Unified Device Architecture (CUDA) programming paradigm and demonstrates a speedup of 190x on an NVIDIA Tesla C2050 (Fermi) GPU compared to an optimized software implementation on a single-core of an Intel Xeon 5500-series processor. We apply the developed technique to the study of Bose-Einstein condensation (BEC) of excitons in semiconductor nanostructures. The technique is also applicable to the studies of atomic condensates, quantized vortices in quantum fluids, propagation of light pulses in optical wave guides, and ocean wave dynamics. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Porting Optimized GPU Kernels to a Multi-core CPU: Computational Quantum Chemistry Application Example

    Page(s): 72 - 75
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (507 KB) |  | HTML iconHTML  

    We investigate techniques for optimizing a multi-core CPU code back ported from a highly optimized GPU kernel. We show that common sub-expression elimination and loop unrolling optimization techniques improve code performance on the GPU, but not on the CPU. On the other hand, register reuse and loop merging are effective on the CPU and in combination they improve performance of the ported code by 16%. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • G-NetMon: A GPU-accelerated Network Performance Monitoring System

    Page(s): 76 - 79
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (673 KB) |  | HTML iconHTML  

    At Fermilab, we have prototyped a GPU-accelerated network performance monitoring system, called G-NetMon, to support large-scale scientific collaborations. In this work, we explore new opportunities in network traffic monitoring and analysis with GPUs. Our system exploits the data parallelism that exists within network flow data to provide fast analysis of bulk data movement between Fermilab and collaboration sites. Experiments demonstrate that our G-NetMon can rapidly detect sub-optimal bulk data movements. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Implications of Memory-Efficiency on Sparse Matrix-Vector Multiplication

    Page(s): 80 - 83
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (178 KB) |  | HTML iconHTML  

    Sparse Matrix Vector-Multiplication is an important operation for many iterative solvers. However, peak performance is limited by the fact that the commonly used algorithm alternates between compute-bound and memory-bound steps. This paper proposes a novel data structure and an FPGA-based hardware core that eliminates the limitations imposed by memory. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • GPU Performance Comparison for Accelerated Radar Data Processing

    Page(s): 84 - 92
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (816 KB) |  | HTML iconHTML  

    Radar is a data-intensive measurement technique often requiring significant processing to make full use of the received signal. However, computing capacity is limited at remote or mobile radar installations thereby limiting radar data products used for real-time decisions. We used graphics processing units (GPUs) to accelerate processing of high resolution phase-coded radar data from the Modular UHF Ionosphere Radar (MUIR) at the High-frequency Active Auroral Research Program (HAARP) facility in Gakona, Alaska. Previously, this data could not be processed on-site in sufficient time to be useful for decisions made during active experiment campaigns, nor could the data be uploaded for off-site processing to high-performance computing (HPC) resources at the Arctic Region Supercomputing Center (ARSC) in Fairbanks. In this paper, we present a radar data-processing performance comparison of a workstation equipped with dual NVIDIA GeForce GTX 480 GPU accelerator cards and a node from ARSC's PACMAN cluster equipped with dual NVIDIA Tesla M2050 cards. Both platforms meet performance requirements, are relatively inexpensive and could operate effectively at remote observatories such as HAARP. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Evaluation of GPU Architectures Using Spiking Neural Networks

    Page(s): 93 - 102
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (309 KB) |  | HTML iconHTML  

    During recent years General-Purpose Graphical Processing Units (GP-GPUs) have entered the field of High-Performance Computing (HPC) as one of the primary architectural focuses for many research groups working with complex scientific applications. Nvidia's Tesla C2050, codenamed Fermi, and AMD's Radeon 5870 are two devices positioned to meet the computationally demanding needs of supercomputing research groups across the globe. Though Nvidia GPUs powered by CUDA have been the frequent choices of the performance centric research groups, the introduction and growth of OpenCL has promoted AMD GP-GPUs as potential accelerator candidates that can challenge Nvidia's stronghold. These architectures not only offer a plethora of features for application developers to explore, but their radically different architectures calls for a detailed study that weighs their merits and evaluates their potential to accelerate complex scientific applications. In this paper, we present our performance analysis research comparing Nvidia's Fermi and AMD's Radeon 5870 using OpenCL as the common programming model. We have chosen four different neuron models for Spiking Neural Networks (SNNs), each with different communication and computation requirements, namely the Izhikevich, Wilson, Morris Lecar (ML), and the Hodgkin Huxley (HH) models. We compare the runtime performance of the Fermi and Radeon GPUs with an implementation that exhausts all optimization techniques available with OpenCL. Several equivalent architectural parameters of the two GPUs are studied and correlated with the application performance. In addition to the comparative study effort, our implementations were able to achieve a speed-up of 857.3x and 658.51x on the Fermi and Radeon architectures respectively for the most compute intensive HH model with a dense network containing 9.72 million neurons. The final outcome of this research is a detailed architectural comparison of the two GPU architectures with a common programm- - ing platform. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Adaptable Two-Dimension Sliding Windows on NVIDIA GPUs with Runtime Compilation

    Page(s): 103 - 112
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (231 KB) |  | HTML iconHTML  

    For some classes of problems, NVIDIA CUDA abstraction and hardware properties combine with problem characteristics to limit the specific problem instances that can be effectively accelerated. As a real-world example, a two-dimensional correlation-based template-matching MATLAB application is considered. While this problem has a well known solution for the common case of linear image filtering -- small fixed templates of a known size applied to a much larger image -- the application considered here uses large arbitrarily-sized templates, up to 156-by-116 pixels, with small search spaces containing no more than 703 window positions per template. Our CUDA implementation approach employs template tiling and problem-specific kernel compilation to achieve speedups of up to 15 when compared to an optimized multi-threaded implementation running on a 3.33 GHz four core Intel Nehalem processor. Tiling the template enables exploiting the parallelism within the computation and shared memory usage. At the same time, problem-specific kernel compilation allows greater levels of adaptability than would otherwise be possible. View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.