
2012 IEEE 20th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

Date: April 29 – May 1, 2012

Displaying Results 1–25 of 57
  • [Front cover]

    Publication Year: 2012, Page(s): C1
  • [Title page i]

    Publication Year: 2012, Page(s): i
  • [Title page iii]

    Publication Year: 2012, Page(s): iii
  • [Copyright notice]

    Publication Year: 2012, Page(s): iv
  • Table of contents

    Publication Year: 2012, Page(s): v - viii
  • A Message from the General Chair and Program Chair

    Publication Year: 2012, Page(s): ix - x
  • Organizing Committee

    Publication Year: 2012, Page(s): xi
  • Program Committee

    Publication Year: 2012, Page(s): xii - xiii
  • Additional reviewers

    Publication Year: 2012, Page(s): xiv
  • Workshop description

    Publication Year: 2012, Page(s): xv - xvi

    Provides an abstract of the presentation and a brief professional biography of the presenter. The complete presentation was not made available for publication as part of the conference proceedings.

  • Sponsors

    Publication Year: 2012, Page(s): xvii - xviii
  • A Low-Overhead Profiling and Visualization Framework for Hybrid Transactional Memory

    Publication Year: 2012, Page(s): 1 - 8

    Multi-core prototyping presents a good opportunity for establishing low-overhead, detailed profiling and visualization in order to study new research topics. In this paper, we design and implement a profiling mechanism with low execution and area overhead, together with a visualization tool, for observing Transactional Memory behavior on an FPGA. To achieve this, we non-disruptively generate and export events on the fly and process them offline on a host. There, our tool regenerates the execution from the collected events and produces traces for comprehensively inspecting the behavior of interacting multithreaded programs. With zero execution overhead for hardware TM events, single-instruction overhead for software TM events, and a low logic area of 2.3% per processor core, we run TM benchmarks to evaluate different levels of profiling detail with an average runtime overhead of 6%. We demonstrate the usefulness of such detailed examination of SW/HW transactional behavior in two parts: (i) we speed up a TM benchmark by 24.1%, and (ii) we closely inspect transactions to point out pathologies.
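
    As a purely illustrative sketch (the record layout and event types below are invented, not taken from the paper), the workflow described above amounts to emitting compact per-core event records in hardware and merging them offline by timestamp on the host:

      #include <stdint.h>
      #include <stdio.h>
      #include <stdlib.h>

      typedef struct {
          uint64_t cycle;    /* global cycle count when the event fired */
          uint16_t tx_id;    /* transaction identifier */
          uint8_t  core;     /* originating processor core */
          uint8_t  type;     /* e.g. TX_BEGIN, TX_COMMIT, TX_ABORT */
      } tm_event_t;

      static int by_cycle(const void *a, const void *b)
      {
          const tm_event_t *x = a, *y = b;
          return (x->cycle > y->cycle) - (x->cycle < y->cycle);
      }

      /* Offline step on the host: order events from all cores into one trace. */
      void build_trace(tm_event_t *events, size_t n)
      {
          qsort(events, n, sizeof *events, by_cycle);
          for (size_t i = 0; i < n; i++)
              printf("cycle %llu core %d type %d tx %d\n",
                     (unsigned long long)events[i].cycle,
                     events[i].core, events[i].type, events[i].tx_id);
      }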

  • Towards a Universal FPGA Matrix-Vector Multiplication Architecture

    Publication Year: 2012, Page(s): 9 - 16
    Cited by: Papers (7)

    We present the design and implementation of a universal, single-bitstream library for accelerating matrix-vector multiplication using FPGAs. Our library handles multiple matrix encodings ranging from dense to multiple sparse formats. A key novelty in our approach is the introduction of a hardware-optimized sparse matrix representation called Compressed Variable-Length Bit Vector (CVBV), which reduces the storage and bandwidth requirements by up to 43% (25% on average) compared to compressed sparse row (CSR) across all the matrices from the University of Florida Sparse Matrix Collection. Our hardware incorporates a runtime-programmable decoder that performs on-the-fly decoding of various formats such as dense, COO, CSR, DIA, and ELL. The flexibility and scalability of our design are demonstrated across two FPGA platforms: (1) the BEE3 (Virtex-5 LX155T with 16 GB of DRAM) and (2) the ML605 (Virtex-6 LX240T with 2 GB of DRAM). For dense matrices, our approach scales to large data sets with over 1 billion elements and achieves robust performance independent of the matrix aspect ratio. For sparse matrices, our compressed representation reduces the overall bandwidth while achieving efficiency comparable to state-of-the-art approaches.
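
    For context, here is a minimal software sketch (illustration only, not code from the paper) of a matrix-vector product over the compressed sparse row (CSR) baseline that the CVBV encoding is compared against:

      #include <stddef.h>

      /* y = A*x with A stored in compressed sparse row (CSR) form. */
      void csr_spmv(size_t n_rows,
                    const size_t *row_ptr,   /* length n_rows + 1 */
                    const size_t *col_idx,   /* length nnz */
                    const double *val,       /* length nnz */
                    const double *x,
                    double *y)
      {
          for (size_t i = 0; i < n_rows; i++) {
              double acc = 0.0;
              for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                  acc += val[k] * x[col_idx[k]];
              y[i] = acc;
          }
      }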

  • Impact of Cache Architecture and Interface on Performance and Area of FPGA-Based Processor/Parallel-Accelerator Systems

    Publication Year: 2012, Page(s): 17 - 24
    Cited by: Papers (8)

    We describe new multi-ported cache designs suitable for use in FPGA-based processor/parallel-accelerator systems and evaluate their impact on application performance and area. The baseline system comprises a MIPS soft processor and custom hardware accelerators with a shared memory architecture: an on-FPGA L1 cache backed by off-chip DDR2 SDRAM. Within this general system model, we evaluate traditional cache design parameters (cache size, line size, associativity). In the parallel-accelerator context, we examine the impact of the cache design and its interface. Specifically, we look at how the number of cache ports affects performance when multiple hardware accelerators operate (and access memory) in parallel, and evaluate two different hardware implementations of multi-ported caches: 1) multi-pumping, and 2) a recently published approach based on the concept of a live-value table. Results show that application performance depends strongly on the cache interface and architecture: for a system with 6 accelerators, depending on the cache design, speedup swings from 0.73× to 6.14× on average, relative to a baseline sequential system (with a single accelerator and a direct-mapped, 2 KB cache with 32 B lines). Considering both performance and area, the best architecture is found to be a 4-port multi-pumped direct-mapped cache with a 16 KB cache size and a 128 B line size.
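
    As a conceptual sketch of the multi-pumping idea only (a software simplification, not the paper's hardware), one physical memory port can be time-multiplexed several times per accelerator clock cycle so that it appears as multiple logical ports:

      #define NUM_PORTS 4

      typedef struct { unsigned addr; unsigned write; unsigned wdata; } cache_req_t;

      static unsigned cache_mem[2048];   /* single-ported data array (2048 words) */

      /* One accelerator cycle: the physical port is "pumped" NUM_PORTS times,
       * serving one logical port per fast-clock step. */
      void cache_cycle(const cache_req_t req[NUM_PORTS], unsigned rdata[NUM_PORTS])
      {
          for (int p = 0; p < NUM_PORTS; p++) {
              unsigned idx = req[p].addr % 2048;
              if (req[p].write)
                  cache_mem[idx] = req[p].wdata;
              rdata[p] = cache_mem[idx];
          }
      }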

  • Area-Efficient Architectures for Large Integer and Quadruple Precision Floating Point Multipliers

    Publication Year: 2012, Page(s): 25 - 28
    Cited by: Papers (1)

    Large integer multiplication and floating point multiplication are the two dominant operations in many scientific and cryptographic applications. Large integer multipliers generally have area requirements that grow linearly, yet remain high, with respect to the operand bit-width. The high precision requirements of an application can lead to the use of quadruple precision arithmetic, whose operation is dominated by the large integer multiplication of the mantissa product. In this paper, we propose a hardware-efficient approach to implementing fully pipelined large integer multipliers, and further extend it to Quadruple Precision (QP) floating point multiplication. The proposed design uses fewer hardware resources in terms of DSP48 blocks and slices while attaining high performance. Promising results are obtained when comparing our designs with the best reported large integer multipliers and QP floating point multipliers in the literature. For instance, our proposed QP multiplier demonstrates over 50% improvement in DSP48 block usage, at the cost of a small number of additional slices, compared to the best result in the literature on a Virtex-4 device.
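
    To illustrate the underlying operation (not the paper's pipelined architecture), a schoolbook multiplication of large integers stored as 32-bit limbs, the kind of computation that dominates a quadruple-precision mantissa product, looks as follows:

      #include <stdint.h>
      #include <stddef.h>
      #include <string.h>

      /* r (na + nb limbs) = a (na limbs) * b (nb limbs); limbs are little-endian. */
      void bigmul(const uint32_t *a, size_t na,
                  const uint32_t *b, size_t nb,
                  uint32_t *r)
      {
          memset(r, 0, (na + nb) * sizeof(uint32_t));
          for (size_t i = 0; i < na; i++) {
              uint64_t carry = 0;
              for (size_t j = 0; j < nb; j++) {
                  uint64_t t = (uint64_t)a[i] * b[j] + r[i + j] + carry;
                  r[i + j] = (uint32_t)t;            /* low 32 bits stay in place */
                  carry = t >> 32;                   /* high 32 bits propagate */
              }
              r[i + nb] = (uint32_t)carry;
          }
      }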

  • Multi-Resolution Real-Time Dense Stereo Vision Processing in FPGA

    Publication Year: 2012, Page(s): 29 - 32
    Cited by: Papers (1)

    High-performance dense stereo is a critical component of computer vision applications such as 3D reconstruction, robot navigation, and augmented reality. In this paper, we present a low-power, high-performance FPGA implementation of a stereo algorithm suitable for embedded real-time platforms. The design scales to higher resolution images and frame rates and supports different cameras and application requirements. We achieve this by designing highly parallel computation cores with very efficient memory access to the image data. Using a prototype board, we demonstrate real-time stereo processing with 640×480-pixel GigE Vision cameras at 30 frames per second. We show that this FPGA design consumes 10 times less power, is more scalable, and has lower latency than a GPU-based implementation of the same stereo algorithm.
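
    The abstract does not spell out the matching cost, so the following is only a generic block-matching sketch (sum of absolute differences over a fixed disparity range, with made-up window and range parameters); it shows the regular, data-parallel structure that makes dense stereo a good fit for an FPGA pipeline:

      #include <stdlib.h>
      #include <limits.h>

      #define W 640
      #define H 480
      #define WIN 3          /* half window: 7x7 support region */
      #define MAX_DISP 64

      static int sad(const unsigned char *L, const unsigned char *R,
                     int x, int y, int d)
      {
          int s = 0;
          for (int dy = -WIN; dy <= WIN; dy++)
              for (int dx = -WIN; dx <= WIN; dx++)
                  s += abs(L[(y + dy) * W + (x + dx)] - R[(y + dy) * W + (x + dx - d)]);
          return s;
      }

      /* For each pixel, pick the disparity with the lowest matching cost. */
      void disparity_map(const unsigned char *L, const unsigned char *R, unsigned char *D)
      {
          for (int y = WIN; y < H - WIN; y++)
              for (int x = WIN + MAX_DISP; x < W - WIN; x++) {
                  int best = INT_MAX, best_d = 0;
                  for (int d = 0; d < MAX_DISP; d++) {
                      int c = sad(L, R, x, y, d);
                      if (c < best) { best = c; best_d = d; }
                  }
                  D[y * W + x] = (unsigned char)best_d;
              }
      }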

  • A Mixed Precision Methodology for Mathematical Optimisation

    Publication Year: 2012, Page(s): 33 - 36

    This paper introduces a novel mixed precision methodology for mathematical optimisation. It uses reduced precision FPGA optimisers to search potential regions containing the global optimum, and double precision optimisers on a general purpose processor (GPP) to verify the results. An empirical method is proposed to determine the parameters of the mixed precision methodology running on a reconfigurable accelerator consisting of an FPGA and a GPP. The effectiveness of our approach is evaluated using a set of optimisation benchmarks. Using our mixed precision methodology and a modern reconfigurable accelerator, we can locate the global optima 1.7 to 6 times faster than a quad-core optimiser. The mixed precision optimisations search up to 40.3 times more starting vectors per unit time than quad-core optimisers, and only 0.7% to 2.7% of these searches are refined using GPP double precision optimisers. The proposed methodology also allows us to accelerate problems with more complicated functions or to solve problems involving higher dimensions.
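
    A much-simplified sketch of the general idea (the objective function, search range and iteration counts below are invented): screen many starting points in reduced precision, with float standing in here for the FPGA's reduced-precision optimiser, then refine only the most promising candidate in double precision:

      #include <stdio.h>
      #include <stdlib.h>
      #include <math.h>

      static float  obj_f(float x)  { return (x - 1.234f) * (x - 1.234f) + 0.5f; }
      static double obj_d(double x) { return (x - 1.234)  * (x - 1.234)  + 0.5; }

      /* Crude double-precision local descent, standing in for the GPP optimiser. */
      static double refine_double(double x)
      {
          for (int i = 0; i < 1000; i++)
              x -= 0.01 * 2.0 * (x - 1.234);   /* gradient step on obj_d */
          return x;
      }

      int main(void)
      {
          double best_x = 0.0;
          float  best_f = INFINITY;
          for (int i = 0; i < 10000; i++) {    /* reduced-precision screening phase */
              float x = (float)(rand() / (double)RAND_MAX * 20.0 - 10.0);
              float f = obj_f(x);
              if (f < best_f) { best_f = f; best_x = x; }
          }
          double x_star = refine_double(best_x);   /* double-precision verification */
          printf("screened minimum near %f, refined to %f (f = %f)\n",
                 best_x, x_star, obj_d(x_star));
          return 0;
      }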

  • GoAhead: A Partial Reconfiguration Framework

    Publication Year: 2012, Page(s): 37 - 44
    Cited by: Papers (13)

    Exploiting the benefits of partial run-time reconfiguration requires efficient tools. In this paper, we introduce the tool GoAhead, which can implement run-time reconfigurable systems for all recent Xilinx FPGAs, including, in particular, low-cost and low-power Spartan-6 FPGAs. GoAhead assists with floorplanning, automates constraint generation, and interacts with the Xilinx vendor tools to trigger the physical implementation phases all the way down to the final configuration bitstreams. GoAhead enables the building of flexible systems that integrate many reconfigurable modules very efficiently. The tool targets (re)usability, portability to future devices, and migration paths among reconfigurable systems featuring different FPGAs or even FPGA families. Moreover, it provides a scripting interface, and all features can be accessed remotely.

  • On-the-fly Composition of FPGA-Based SQL Query Accelerators Using a Partially Reconfigurable Module Library

    Publication Year: 2012, Page(s): 45 - 52
    Cited by: Papers (12)

    In this paper, we introduce a novel FPGA-based methodology for accelerating SQL queries using dynamic partial reconfiguration. Query acceleration is of utmost importance in large database systems for achieving very high throughput. Although common FPGA-based accelerators are suitable for achieving such throughput, their designs are hard to extend to new operations. Using dynamic partial reconfiguration, we are able to build more flexible architectures that can be extended to new operations or SQL constructs with very low area overhead on the FPGA. Furthermore, the reconfiguration of a few FPGA frames can be used to switch very quickly from one query to the next. In our approach, an SQL query is transformed into a hardware pipeline consisting of partially reconfigurable modules. The assembly of the FPGA data path is done at run time by a static system that provides the stream-based communication interfaces to the partial modules and the database management system. More specifically, each incoming SQL query is analyzed and divided into individual operations, which are subsequently mapped onto library modules, and the composed data path is loaded onto the FPGA. We show that our approach achieves a substantially higher throughput than a software-only solution.
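
    As a software analogy only (the operator names and interfaces below are invented), a query can be assembled as a chain of streaming operator modules, each consuming and producing tuples, in the spirit of composing partial modules into an FPGA data path at run time:

      #include <stdbool.h>
      #include <stdio.h>

      typedef struct { int id; int price; } tuple_t;
      typedef bool (*op_fn)(tuple_t *t);      /* return false to drop the tuple */

      static bool op_filter_price(tuple_t *t) { return t->price > 100; }  /* WHERE price > 100 */
      static bool op_discount(tuple_t *t)     { t->price = t->price * 9 / 10; return true; }

      /* Stream every tuple through the operator pipeline and emit survivors. */
      static void run_query(const tuple_t *table, int n, op_fn *pipeline, int n_ops)
      {
          for (int i = 0; i < n; i++) {
              tuple_t t = table[i];
              bool keep = true;
              for (int k = 0; keep && k < n_ops; k++)
                  keep = pipeline[k](&t);
              if (keep)
                  printf("id=%d price=%d\n", t.id, t.price);
          }
      }

      int main(void)
      {
          tuple_t table[] = { { 1, 50 }, { 2, 150 }, { 3, 300 } };
          op_fn pipeline[] = { op_filter_price, op_discount };
          run_query(table, 3, pipeline, 2);
          return 0;
      }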

  • Fixed Point Lanczos: Sustaining TFLOP-equivalent Performance in FPGAs for Scientific Computing

    Publication Year: 2012, Page(s): 53 - 60
    Cited by: Papers (3)

    We consider the problem of enabling fixed-point implementations of linear algebra kernels to match the strengths of the field-programmable gate array (FPGA). Algorithms for solving linear equations, finding eigenvalues or finding singular values are typically nonlinear and recursive, making the problem of establishing analytical bounds on variable dynamic range non-trivial. Current approaches fail to provide tight bounds for this type of algorithm. As a case study, we use one of the most important kernels in scientific computing, the Lanczos iteration, which lies at the heart of well-known methods such as conjugate gradient and minimum residual, and we show how the algorithm can be modified so that standard linear algebra analysis yields tight analytical bounds on all variables of the process, regardless of the properties of the original matrix. It is shown that the numerical behaviour of fixed-point implementations of the modified problem can be chosen to be at least as good as that of a double precision floating point implementation. Using this approach, it is possible to obtain sustained FPGA performance very close to the peak general-purpose graphics processing unit (GPGPU) performance on FPGAs of comparable size when solving a single problem. If several independent problems are solved simultaneously, it is possible to exceed the peak floating-point performance of a GPGPU, obtaining approximately 1, 2 or 4 TFLOPs for error tolerances of 10⁻⁷, 10⁻⁵ and 10⁻³, respectively, on a large Virtex-7 FPGA.
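
    For reference, a textbook double-precision version of the Lanczos three-term recurrence that the paper bounds and then implements in fixed point (illustration only, not the authors' modified algorithm):

      #include <math.h>
      #include <stddef.h>

      /* Lanczos on a symmetric n-by-n matrix A (row major), starting vector b,
       * producing m tridiagonal coefficients alpha[] and beta[].
       * q_prev, q_cur and w are caller-provided scratch vectors of length n. */
      void lanczos(size_t n, const double *A, const double *b, size_t m,
                   double *alpha, double *beta,
                   double *q_prev, double *q_cur, double *w)
      {
          double nb = 0.0;
          for (size_t i = 0; i < n; i++) nb += b[i] * b[i];
          nb = sqrt(nb);
          for (size_t i = 0; i < n; i++) { q_prev[i] = 0.0; q_cur[i] = b[i] / nb; }

          for (size_t j = 0; j < m; j++) {
              for (size_t i = 0; i < n; i++) {            /* w = A * q_cur */
                  double acc = 0.0;
                  for (size_t k = 0; k < n; k++) acc += A[i * n + k] * q_cur[k];
                  w[i] = acc;
              }
              if (j > 0)
                  for (size_t i = 0; i < n; i++) w[i] -= beta[j - 1] * q_prev[i];
              double a = 0.0;
              for (size_t i = 0; i < n; i++) a += q_cur[i] * w[i];
              alpha[j] = a;
              for (size_t i = 0; i < n; i++) w[i] -= a * q_cur[i];
              double nw = 0.0;
              for (size_t i = 0; i < n; i++) nw += w[i] * w[i];
              beta[j] = sqrt(nw);
              if (beta[j] == 0.0) break;                  /* invariant subspace reached */
              for (size_t i = 0; i < n; i++) {
                  q_prev[i] = q_cur[i];
                  q_cur[i]  = w[i] / beta[j];
              }
          }
      }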

  • Formic: Cost-efficient and Scalable Prototyping of Manycore Architectures

    Publication Year: 2012, Page(s): 61 - 64

    Modeling emerging multicore architectures is challenging and imposes a tradeoff between simulation speed and accuracy. An effective practice that balances both targets well is to map the target architecture onto FPGA platforms. We find that accurate prototyping of hundreds of cores on existing FPGA boards faces at least one of the following problems: (i) limited fast memory resources (SRAM) for modeling caches, (ii) insufficient inter-board connectivity for scaling the design, or (iii) excessive board cost. We address these shortcomings by designing a new FPGA board for multicore architecture prototyping that explicitly targets scalability and cost-efficiency. Formic has a 35% bigger FPGA, three times more SRAM, four times more links, and costs at most half as much as the popular Xilinx XUPV5 prototyping platform. We build and test a 64-board system by developing a 512-core, MicroBlaze-based, non-coherent hardware prototype with DMA capabilities and a full network-on-chip in a 3D-mesh topology. We believe that Formic offers significant advantages over existing academic and commercial platforms and can facilitate hardware prototyping for future manycore architectures.

  • Fast Multi-Objective Algorithmic Design Co-Exploration for FPGA-based Accelerators

    Publication Year: 2012, Page(s): 65 - 68
    Cited by: Papers (1)

    The reconfigurability of Field Programmable Gate Arrays (FPGAs) makes them an attractive platform for accelerating algorithms. Accelerating a particular algorithm is a challenging task, as the large number of possible algorithmic and hardware design parameters leads to different accelerator variant implementations, each with its own performance, area, power, and arithmetic accuracy characteristics. To identify the parameters that optimize the accelerator for certain metrics, we propose techniques for fast design space exploration and non-linear multi-objective optimization (e.g., minimizing power under arithmetic inaccuracy bounds). Our methodology samples a small part of the design space and uses measurements from the sampled implementations to train mathematical models for the different metrics. To automate and improve the model generation process, we propose the use of L1-regularized least squares regression techniques. To demonstrate the effectiveness of our approach, we implement a high-throughput real-time accelerator for image deblurring. We demonstrate the accuracy of our modeling techniques (e.g., within 8% for power modeling) and their ability to identify optimal accelerator designs with large speed-ups (340×) compared to brute-force enumeration.
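
    A minimal sketch of the regression technique named above, L1-regularised least squares fitted by cyclic coordinate descent (illustration only, not the paper's tool flow; feature columns are assumed to be pre-scaled to unit squared norm):

      #include <stddef.h>

      static double soft_threshold(double v, double lam)
      {
          if (v >  lam) return v - lam;
          if (v < -lam) return v + lam;
          return 0.0;
      }

      /* X is n-by-p (row major), y has length n, w has length p and starts at 0,
       * resid is caller-provided scratch of length n. Minimises
       * 0.5*||y - X*w||^2 + lam*||w||_1, assuming unit-norm columns of X. */
      void lasso_cd(size_t n, size_t p, const double *X, const double *y,
                    double lam, int sweeps, double *w, double *resid)
      {
          for (size_t i = 0; i < n; i++) resid[i] = y[i];       /* r = y - X*w with w = 0 */
          for (int s = 0; s < sweeps; s++) {
              for (size_t j = 0; j < p; j++) {
                  double rho = 0.0;                    /* feature j vs. partial residual */
                  for (size_t i = 0; i < n; i++)
                      rho += X[i * p + j] * (resid[i] + w[j] * X[i * p + j]);
                  double w_new = soft_threshold(rho, lam);
                  double delta = w_new - w[j];
                  if (delta != 0.0)
                      for (size_t i = 0; i < n; i++)
                          resid[i] -= delta * X[i * p + j];
                  w[j] = w_new;
              }
          }
      }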

  • FLEXDET: Flexible, Efficient Multi-Mode MIMO Detection Using Reconfigurable ASIP

    Publication Year: 2012, Page(s): 69 - 76
    Cited by: Papers (4)

    This paper describes the implementation of a multi-mode MIMO detector based on the concept of a partially reconfigurable ASIP (rASIP). The multi-mode detector supports three detection algorithms: Maximum Ratio Combining, linear Minimum Mean Square Error (MMSE) detection, and MMSE Successive Interference Cancellation. The detection algorithms also support different antenna configurations and modulation schemes. The rASIP is based on a Coarse-Grained Reconfigurable Architecture (CGRA) designed for efficient architectural support of matrix operations. A matrix inversion algorithm, used in the preprocessing of the different detection algorithms, is mapped onto the CGRA. By integrating a processor with the CGRA, the variations in the control path of different algorithm configurations can be handled efficiently. To the best of our knowledge, we show for the first time that CGRA-based multi-mode MIMO detection is extremely efficient and matches the performance of dedicated ASIC implementations.
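
    For reference, the linear MMSE detector named above has the standard closed form (a textbook expression, not specific to this paper); the matrix inversion it requires is the preprocessing step mapped onto the CGRA:

      \hat{\mathbf{s}} = \left( \mathbf{H}^{H}\mathbf{H} + \sigma^{2}\mathbf{I} \right)^{-1} \mathbf{H}^{H} \mathbf{y}

    where H is the channel matrix, y the received vector, and σ² the noise variance.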

  • FX-SCORE: A Framework for Fixed-Point Compilation of SPICE Device Models Using Gappa++

    Publication Year: 2012, Page(s): 77 - 84
    Cited by: Papers (3)

    Automated, offline precision analysis of dataflow computations containing elementary functions (e.g., exp) and if-then-else control flow enables accurate fixed-point FPGA implementation of SPICE device equations. We perform interval analysis of these equations using Gappa++ to statically compare the error bounds of fixed-point and double-precision implementations. This is possible due to the limited dynamic range of the physical voltage, current and conductance quantities in a SPICE simulation of real-world circuits. In contrast to previous custom-precision SPICE device mappings, our fixed-point implementation has the same accuracy as a double-precision implementation when compared to ideal arithmetic (reals). To deliver these implementations, we develop FX-SCORE, a high-level framework based on the SCORE streaming FPGA framework, which automatically generates Gappa++ scripts and AutoESL circuits to explore the cost-quality tradeoffs of fixed-point FPGA implementations. Using our methodology, we can determine whether fixed-point is always better than a double-precision implementation at the same relative error. We demonstrate a 35% geometric mean area improvement for SPICE device models such as the diode, Level-1 MOSFET and an approximate MOSFET when comparing custom fixed-point implementations with standard double-precision realizations.
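
    A toy illustration of interval (range) analysis, the flavour of reasoning that a tool such as Gappa++ automates far more rigorously (the diode-style expression and the ranges below are invented):

      #include <math.h>
      #include <stdio.h>

      typedef struct { double lo, hi; } ival_t;

      static ival_t iv_add(ival_t a, ival_t b) { return (ival_t){ a.lo + b.lo, a.hi + b.hi }; }

      static ival_t iv_mul(ival_t a, ival_t b)
      {
          double c[4] = { a.lo * b.lo, a.lo * b.hi, a.hi * b.lo, a.hi * b.hi };
          ival_t r = { c[0], c[0] };
          for (int i = 1; i < 4; i++) {
              if (c[i] < r.lo) r.lo = c[i];
              if (c[i] > r.hi) r.hi = c[i];
          }
          return r;
      }

      static ival_t iv_exp(ival_t a) { return (ival_t){ exp(a.lo), exp(a.hi) }; }  /* exp is monotone */

      int main(void)
      {
          /* Bound a diode-style term Is*(exp(V/Vt) - 1) with V in [-0.2 V, 0.7 V]. */
          ival_t V  = { -0.2, 0.7 }, Vt = { 0.025, 0.026 }, Is = { 1e-14, 1e-12 };
          ival_t x  = iv_mul(V, (ival_t){ 1.0 / Vt.hi, 1.0 / Vt.lo });   /* V / Vt */
          ival_t ex = iv_add(iv_exp(x), (ival_t){ -1.0, -1.0 });         /* exp(V/Vt) - 1 */
          ival_t I  = iv_mul(Is, ex);
          printf("current bounded by [%g, %g]\n", I.lo, I.hi);
          return 0;
      }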

  • FPGA-based Acceleration for Tracking Audio Effects in Movies

    Publication Year: 2012, Page(s): 85 - 92

    In this paper we propose an FPGA-based hardware platform to accelerate an audio tracking method. Our tracking approach is inspired by the problem of molecular sequence alignment and adopts a well-known dynamic programming algorithm (the Smith-Waterman algorithm) from the area of bioinformatics. However, the high computational complexity of such algorithms imposes a significant barrier to their adoption in audio tracking systems. To alleviate this computational burden and achieve realistic response times, we accelerate the computationally intensive parts of our tracking method on an FPGA-based platform. Our FPGA accelerator builds on the systolization of the Smith-Waterman algorithm proposed in previous approaches to accelerating bio-sequence scanning, but the special requirements of the audio tracking method impose significant design challenges on the accelerator architecture. The accelerator has been implemented on a Xilinx Virtex-5 device, and the experimental results show that it achieves significant speedup compared with the software implementation of the tracking method. The proposed approach has been tested in the context of detecting animal sounds in audio streams from movies, where a basic requirement is to reduce the noisiness of the detection results by exploiting the statistical nature of the scores generated by the dynamic programming algorithm.
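
    For reference, the textbook Smith-Waterman local-alignment recurrence that the accelerator systolises (the scoring parameters below are arbitrary):

      #include <string.h>
      #include <stddef.h>

      #define MATCH     2
      #define MISMATCH -1
      #define GAP       2   /* linear gap penalty */

      static int max4(int a, int b, int c, int d)
      {
          int m = a > b ? a : b;
          if (c > m) m = c;
          if (d > m) m = d;
          return m;
      }

      /* Returns the best local-alignment score between sequences a and b. */
      int smith_waterman(const char *a, const char *b)
      {
          size_t n = strlen(a), m = strlen(b);
          int best = 0;
          int prev[m + 1], cur[m + 1];     /* two DP rows are enough for the score */
          memset(prev, 0, sizeof prev);
          for (size_t i = 1; i <= n; i++) {
              cur[0] = 0;
              for (size_t j = 1; j <= m; j++) {
                  int s = (a[i - 1] == b[j - 1]) ? MATCH : MISMATCH;
                  cur[j] = max4(0, prev[j - 1] + s, prev[j] - GAP, cur[j - 1] - GAP);
                  if (cur[j] > best) best = cur[j];
              }
              memcpy(prev, cur, sizeof cur);
          }
          return best;
      }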
