Field-Programmable Custom Computing Machines (FCCM), 2011 IEEE 19th Annual International Symposium on

Date: 1-3 May 2011

  • [Front cover]

    Page(s): C1
  • [Title page i]

    Page(s): i
  • [Title page iii]

    Page(s): iii
  • [Copyright notice]

    Page(s): iv
  • Table of contents

    Page(s): v - ix
  • Message from the General and Program Chairs

    Page(s): x - xi
  • Organizing Committee

    Page(s): xii
  • Program Committee

    Page(s): xiii - xiv
  • Additional Reviewers

    Page(s): xv
  • Preconference Workshop Summary

    Page(s): xvi
  • Panel Session Summary

    Page(s): xvii
  • Sponsors

    Page(s): xviii
  • A Sparse Matrix Personality for the Convey HC-1

    Page(s): 1 - 8

    In this paper we describe a double-precision floating-point sparse matrix-vector multiplier (SpMV) and its performance as implemented on a Convey HC-1 reconfigurable computer. The primary contributions of this work are a novel streaming reduction architecture for floating-point accumulation, a novel on-chip cache optimized for streaming compressed sparse row (CSR) matrices, and end-to-end integration with the HC-1's system, programming model, and runtime environment. The design is composed of 32 parallel processing elements (PEs), each connected to the HC-1's coprocessor memory and each containing a streaming multiply-accumulator and local vector cache. When used on the HC-1, each PE has a peak throughput of 300 double-precision MFLOP/s, giving a total peak throughput of 9.6 GFLOP/s. For our test matrices, we demonstrate up to 40% of the peak performance and compare these results with results obtained using the CUSparse library on an NVIDIA Tesla S1070 GPU. In most cases our implementation exceeds the performance of the GPU.

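For readers unfamiliar with the CSR format named in the abstract above, the following is a minimal software sketch of a CSR matrix-vector product. It illustrates only the storage layout and the row-wise multiply-accumulate; it is not the authors' hardware design, and the function name `spmv_csr`, the example matrix, and the variable names are assumptions made for illustration. The final lines simply re-derive the quoted peak figure from 32 PEs at 300 MFLOP/s each.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in compressed sparse row form.

    values  : nonzero entries, stored row by row
    col_idx : column index of each entry in `values`
    row_ptr : row_ptr[i]..row_ptr[i+1] delimits the entries of row i
    """
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        # Multiply-accumulate over the nonzeros of this row only.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


# Hypothetical 3x3 example matrix:
# [[4, 0, 1],
#  [0, 2, 0],
#  [3, 0, 5]]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0]

y = spmv_csr(values, col_idx, row_ptr, x)

# Peak throughput arithmetic from the abstract: 32 PEs x 300 MFLOP/s each.
NUM_PES = 32
PE_PEAK_MFLOPS = 300
peak_gflops = NUM_PES * PE_PEAK_MFLOPS / 1000
```

With the example data, row 0 contributes 4*1 + 1*3, row 1 contributes 2*2, and row 2 contributes 3*1 + 5*3. The irregular accesses to `x[col_idx[k]]` are the part a local vector cache would serve in a hardware design.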
  • Modeling Dynamically Reconfigurable Systems for Simulation-Based Functional Verification

    Page(s): 9 - 16

    Dynamically Reconfigurable Systems (DRS), which allow logic to be partially reconfigured at run time, are promising candidates for embedded and high-performance systems. However, their architectural flexibility introduces a new dimension to the functional verification problem. Dynamic reconfiguration requires the designer to consider new issues such as synchronizing, isolating and initializing reconfigurable modules. Furthermore, by exposing the FPGA architecture to the application specification, dynamic reconfiguration makes functional verification dependent on the physical implementation. This paper studies simulation as the most fundamental approach to the functional verification of DRS. The main contribution of this paper is a verification-driven top-down modeling methodology that guides designers in refining their reconfigurable system design from the behavioral level to the register transfer level. We assess the feasibility of our methodology via a case study involving the design of a generic partial reconfiguration platform.

  • Mixed Precision Processing in Reconfigurable Systems

    Page(s): 17 - 24

    Customisable data formats provide an opportunity for exploring trade-offs in accuracy and performance of reconfigurable systems. This paper introduces a novel methodology for mixed-precision comparison, which improves comparison performance by using reduced-precision data paths while maintaining accuracy by using high-precision data paths. Our methodology adopts reduced-precision data-paths for preliminary comparison, and high-precision data-paths when the accuracy for preliminary comparison is insufficient. We develop an analytical model for performance estimation of the proposed mixed-precision methodology. Optimisation based on integer linear programming is employed for determining the optimal precision and resource allocation for each of the data paths. The effectiveness of our approach is evaluated using a common collision detection problem. Performance gains of 4 to 7.3 times are obtained over baseline fixed-precision designs for the same FPGAs. With the help of the proposed mixed-precision methodology, our FPGA designs are 15.4 to 16.7 times faster than software running on multi-core CPUs with the same technology.

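The two-stage idea in the abstract above (a cheap preliminary comparison, with a high-precision fallback when the cheap result is inconclusive) can be sketched in software as follows. This is an illustration under stated assumptions, not the paper's data paths: reduced precision is modeled here as truncation to a fixed-point grid, and the names `quantize` and `mixed_compare` and the `frac_bits` parameter are hypothetical.

```python
import math


def quantize(x, frac_bits):
    """Truncate x to a fixed-point grid with `frac_bits` fractional bits.

    The result q satisfies q <= x < q + 2**-frac_bits.
    """
    scale = 2 ** frac_bits
    return math.floor(x * scale) / scale


def mixed_compare(a, b, frac_bits):
    """Return (sign of a - b, used_high_precision).

    Stage 1 compares truncated values. Each truncation error is below
    eps = 2**-frac_bits, so the reduced-precision result is trusted only
    when the truncated values differ by at least eps.
    Stage 2 (fallback) compares the full-precision values.
    """
    eps = 2.0 ** -frac_bits
    diff = quantize(a, frac_bits) - quantize(b, frac_bits)
    if diff >= eps:
        return 1, False
    if diff <= -eps:
        return -1, False
    # Preliminary comparison is inconclusive: use the high-precision path.
    return (a > b) - (a < b), True


far_apart = mixed_compare(1.0, 2.0, frac_bits=4)
close_pair = mixed_compare(1.01, 1.02, frac_bits=4)
```

In this sketch, `far_apart` is resolved by the reduced-precision stage alone, while `close_pair` truncates to the same grid value and therefore takes the fallback. Choosing `frac_bits` trades the cost of the first stage against how often the fallback is taken, which is the kind of trade-off the abstract says is optimised.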
  • Dynamic Communication in a Coarse Grained Reconfigurable Array

    Page(s): 25 - 28

    Coarse Grained Reconfigurable Arrays (CGRAs) are typically very efficient for a single task. However, all functional units are required to operate in lockstep, wasting resources and making complex programming flows difficult. Massively Parallel Processor Arrays (MPPAs) excel at executing unrelated tasks simultaneously, but limit the amount of resources dedicated to a single task. We propose an architecture with an MPPA's design flexibility and a CGRA's throughput, capable of processing and transferring data in a pre-compiled schedule, with dynamic transfers between components. Alternative interconnect strategies are compared for silicon area cost and power utilization.

  • Run-Time Resource Allocation for Simultaneous Multi-tasking in Multi-core Reconfigurable Processors

    Page(s): 29 - 32

    State-of-the-art multi-core reconfigurable processors do not exploit the full potential of simultaneous multi-tasking with run-time adaptive reconfigurable fabric allocation. We propose a novel run-time system for simultaneous multi-tasking in a multi-core reconfigurable processor that adaptively allocates the mixed-grained reconfigurable fabric resource at run time among different tasks considering their performance constraints. Our scheme employs the novel concept of refined task-criticality (based on the functional-block-level performance constraints) considering the computational properties of dependent tasks and their inherent potential for acceleration. Our scheme dynamically compensates for deadline misses at the functional block level. It thereby reduces the potential task-level deadline misses under competing scenarios. With the help of a secure video conferencing application (with four dependent tasks of diverse computational properties), we demonstrate that our scheme reduces the deadline misses by (on average) 6× under given performance constraints, when compared to state-of-the-art reconfigurable processors.

  • An Autonomous Vector/Scalar Floating Point Coprocessor for FPGAs

    Page(s): 33 - 36

    We present a Floating Point Vector Coprocessor that works with the Xilinx embedded processors. The FPVC is completely autonomous from the embedded processor, exploiting parallelism and exhibiting greater speedup than alternative vector processors. The FPVC supports scalar computation so that loops can be executed independently of the main embedded processor. Floating point addition, multiplication, division and square root are implemented with the Northeastern University VFLOAT library. The FPVC is parameterized so that the number of vector lanes and maximum vector length can be easily modified. We have implemented the FPVC on a Xilinx Virtex 5 connected via the Processor Local Bus (PLB) to the embedded PowerPC. Our results show more than five times improved performance over the PowerPC augmented with the Xilinx Floating Point Unit on applications from linear algebra: QR and Cholesky decomposition.

  • Hecto-Scale Frame Rate Face Detection System for SVGA Source on FPGA Board

    Page(s): 37 - 40

    This paper proposes techniques for face detection and gives the implementation details for an FPGA development board. We analyze and discuss the relation between the system computation cost and selection of the image scaling factor. We give a new method to select the stop threshold for the image reduction process, which reduces the total computation by half. We also provide a color image output mode for a more human-oriented design. Test results show that the system achieves real-time face detection speed (100 fps) and a high face detection rate (87.2%) for an SVGA (800 × 600) video source. The low power consumption (3.5 W) is another advantage over previous work.

  • An FPGA Implementation of Information Theoretic Visual-Saliency System and Its Optimization

    Page(s): 41 - 48

    Biological vision systems use saliency-based visual attention mechanisms to limit higher-level vision processing to the most visually salient subsets of an input image. Among several computational models that capture visual saliency in biological systems, the information-theoretic AIM (Attention based on Information Maximization) algorithm has been demonstrated to predict human gaze patterns better than other existing models. We present an FPGA-based implementation of this computationally intensive AIM algorithm to support embedded vision applications. Our implementation processes about 4M pixels/sec for 25 basis functions with a convolution kernel size of 21 by 21 for each of the R, G, and B color channels, when implemented on a Virtex-6 LX240T. We also provide an optimization aimed at controlling the trade-off between power consumption and latency, and performance comparisons with a GPU implementation.

  • Scalable, High Performance Fourier Domain Optical Coherence Tomography: Why FPGAs and Not GPGPUs

    Page(s): 49 - 56

    Fourier Domain Optical Coherence Tomography (FD-OCT) is an emerging biomedical imaging technology featuring ultra-high resolution and fast imaging speed. Due to the complexity of the FD-OCT algorithm, real-time FD-OCT imaging demands high performance computing platforms. However, the scaling of real-time FD-OCT processing for increasing data acquisition rates and 3-dimensional (3D) imaging is quickly outpacing the performance of general purpose processors. Our research analyzes the scalability of accelerating FD-OCT processing on two potential implementation platforms: General Purpose Graphical Processing Units (GPGPUs) and Field Programmable Gate Arrays (FPGAs). We implemented a complete FD-OCT system using an NVIDIA GPGPU as co-processor, with a speedup of 6.9x over general purpose processors (GPPs). We also created a hardware processing engine using FPGAs with a speedup of 15.5x over GPPs for a single pipeline, which can be replicated to further increase performance. Our analysis of the performance and scalability for both platforms shows that, while GPGPUs offer an easy and low cost solution for accelerating FD-OCT, FPGAs are more likely to match the long term demands for real-time, 3D, FD-OCT imaging.

  • Architecture, Design, and Experimental Evaluation of a Lightfield Descriptor Depth Buffer Algorithm on Reconfigurable Logic and on a GPU

    Page(s): 57 - 64

    The Lightfield Descriptor method for 3D computer graphics offers the highest quality object retrieval from a database at the expense of higher storage and computational cost vs. other methods. This paper presents two special purpose architectures, based on FPGAs and GPUs, for the depth buffer extraction algorithm which is used by the Lightfield Descriptor method. The two architectures were fully designed and implemented in hardware on a Virtex 5 FPGA device and on a GeForce GPU. The FPGA-based design offers a measured average speedup of 50x vs. software. The corresponding GPU results were by comparison less promising, but still better than software solutions. Results reported in this paper are from actual runs on hardware.

  • Implementation and Performance Analysis of SEAL Encryption on FPGA, GPU and Multi-core Processors

    Page(s): 65 - 68

    Accelerators, such as field programmable gate arrays (FPGAs) and graphics processing units (GPUs), are special purpose processors designed to speed up compute-intensive sections of applications. FPGAs are highly customizable, while GPUs provide massive parallel execution resources and high memory bandwidth. In this paper, we compare the performance of these architectures, presenting a performance study of SEAL, a fast, software-oriented encryption algorithm, on a Virtex-6 FPGA, a GPU, and an Intel Core i7, a 2-way hyper-threaded, 4-core processor. We show that each platform has relative competitive advantages in encrypting an input plaintext using SEAL.

  • FPGA Communication Framework

    Page(s): 69 - 72

    FPGA-CF is an open-source, portable, extensible communications package that consists of a small hardware core (less than 600 slices) and a host-software library/API. It enables a host PC to transmit data at 120 Mb/s to Xilinx-based FPGA boards via Ethernet using standard internet protocols. The hardware core is directly connected to the Xilinx internal configuration port (ICAP) and supports all ICAP functionality. The core also provides an extensible user-channel interface with up to 15 8-bit user-data channels. The host software API supports both Java and C++ and provides high-level functionality for making connections and transmitting data. The utility of the system is demonstrated by implementing an on-chip test/debug system.

  • Efficient Calculation of Pairwise Nonbonded Forces

    Page(s): 73 - 76

    A major bottleneck in molecular dynamics (MD) simulations is the calculation of the pairwise nonbonded interactions. Previous work on FPGAs has shown that these calculations can be implemented with a number of force computation pipelines operating in parallel (4 and 8 for the Stratix-III and Stratix-V, respectively). Optimization has received some attention previously in CPU, GPU, FPGA, and ASIC implementations, with direct computation of the equations of interaction being replaced with table lookup with interpolation, and the order and granularity of those interpolations being optimized. FPGAs lend themselves to a particularly rich design space both of opportunities and constraints. We explore and evaluate this space with respect to both resource requirements and simulation quality. We find that FPGAs' BRAM architecture makes them well suited to support unusually fine-grained intervals. This leads to a reduction in other logic and a proportional increase in performance. We demonstrate these designs with prototype implementations supporting full electrostatics and integrated into NAMD-lite. Throughput is improved by 50% over the previous best FPGA implementation while simulation quality is maintained.

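The "table lookup with interpolation" mentioned in the abstract above can be illustrated with a small software sketch. Everything specific here is an assumption made for illustration rather than a detail from the paper: the tabulated function (an inverse-cube term in the squared distance), the range, uniform intervals, first-order (linear) interpolation, and the helper names `build_table`, `lookup`, and `max_rel_error`. The sketch only shows the general trade-off that finer intervals reduce interpolation error for a fixed interpolation order.

```python
def build_table(f, lo, hi, intervals):
    """Tabulate f at the endpoints of `intervals` uniform intervals on [lo, hi]."""
    step = (hi - lo) / intervals
    return [f(lo + i * step) for i in range(intervals + 1)], step


def lookup(table, step, lo, x):
    """Approximate f(x) by linear interpolation between adjacent table entries."""
    pos = (x - lo) / step
    i = min(int(pos), len(table) - 2)  # clamp so x == hi uses the last interval
    frac = pos - i
    return table[i] + frac * (table[i + 1] - table[i])


def max_rel_error(f, lo, hi, intervals, samples=1000):
    """Largest relative error of the table approximation over sample points."""
    table, step = build_table(f, lo, hi, intervals)
    worst = 0.0
    for s in range(samples + 1):
        x = lo + (hi - lo) * s / samples
        exact = f(x)
        worst = max(worst, abs(lookup(table, step, lo, x) - exact) / abs(exact))
    return worst


def inv_r3(r2):
    """Hypothetical force term: r**-3 expressed in the squared distance r2."""
    return r2 ** -1.5


coarse = max_rel_error(inv_r3, 1.0, 16.0, intervals=64)
fine = max_rel_error(inv_r3, 1.0, 16.0, intervals=1024)
```

Comparing `coarse` and `fine` shows the error shrinking as the table grows. In hardware the table size is bounded by on-chip memory, which is presumably why the abstract ties fine-grained intervals to the FPGAs' BRAM architecture.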