15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2007)

Date: 23-25 April 2007

  • 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines - Cover

    Page(s): c1
  • 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines - Title

    Page(s): i - iii
  • 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines - Copyright

    Page(s): iv
  • 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines - TOC

    Page(s): v - ix
  • Conference organizers

    Page(s): x
  • Sampling from the Multivariate Gaussian Distribution using Reconfigurable Hardware

    Page(s): 3 - 12

    The multivariate Gaussian distribution models random processes as vectors of Gaussian samples with a fixed correlation matrix. Such distributions are useful for modelling real-world multivariate time-series such as equity returns, where the returns for businesses in the same sector are likely to be correlated. Generating random samples from such a distribution presents a computational challenge due to the dense matrix-vector multiplication needed to introduce the required correlations. This paper proposes a hardware architecture for generating random vectors, utilising the embedded block RAMs and multipliers found in contemporary FPGAs. The approach generates a new n-dimensional random vector every n clock cycles, and has a raw generation rate over 200 times that of a single Opteron 2.2GHz using an optimised BLAS package for linear algebra computation. The generation architecture is an ideal source both for software simulations connected via a high-bandwidth connection and for completely FPGA-based simulations. Practical performance is explored in a case study in Delta-Gamma Value-at-Risk, where a standalone Virtex-4 xc4vsx55 solution at 400 MHz is 33 times faster than a quad Opteron 2.2GHz SMP. The FPGA solution also scales well for larger problem sizes, allowing larger portfolios to be simulated.

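    For readers outside quantitative finance, the correlation is introduced by multiplying a vector of independent Gaussian samples by a factor (e.g., Cholesky) of the covariance matrix; this dense matrix-vector product is what the paper maps onto block RAMs and embedded multipliers. A minimal software sketch of the underlying math (illustrative only, not the authors' architecture; function names are invented):

        import numpy as np

        def correlated_gaussians(cov, num_samples, rng=np.random.default_rng(0)):
            """Draw vectors from N(0, cov) via a Cholesky factor."""
            L = np.linalg.cholesky(cov)             # cov = L @ L.T
            z = rng.standard_normal((num_samples, cov.shape[0]))
            return z @ L.T                          # each row ~ N(0, cov)

        # Example: three correlated "equity returns"
        cov = np.array([[1.0, 0.8, 0.3],
                        [0.8, 1.0, 0.2],
                        [0.3, 0.2, 1.0]])
        samples = correlated_gaussians(cov, 100000)
        print(np.round(np.corrcoef(samples.T), 2))  # close to the target matrix
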
  • A Fast FPGA-Based 2-Opt Solver for Small-Scale Euclidean Traveling Salesman Problem

    Page(s): 13 - 22

    In this paper we discuss and analyze the FPGA-based implementation of an algorithm for the traveling salesman problem (TSP), in particular 2-Opt, one of the most famous local optimization algorithms, for Euclidean TSP instances of up to a few hundred cities. We introduce the notion of "symmetrical 2-Opt moves", which allows us to uncover fine-grain parallelism when executing the algorithm. We propose a novel architecture that exploits this parallelism, and demonstrate its implementation in reconfigurable hardware. We evaluate our proposed architecture and its implementation on a state-of-the-art FPGA using a subset of the TSPLIB benchmark, and find that our approach exhibits better quality of final results and an average speedup of 600% compared with the state-of-the-art software implementation. Our approach produces, to the best of our knowledge, the fastest TSP 2-Opt solver to date for small-scale Euclidean TSP instances.

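    For context, a 2-Opt move deletes two edges of the tour and reverses the segment between them, keeping the move when it shortens the tour; the paper's "symmetrical" moves expose many such candidate exchanges for parallel evaluation in hardware. A sequential sketch of the basic local search (illustrative only):

        import math

        def two_opt(tour, pts):
            """Improve a tour (list of point indices) to a 2-Opt local optimum."""
            n = len(tour)
            improved = True
            while improved:
                improved = False
                for i in range(n - 2):
                    for j in range(i + 2, n - (i == 0)):  # skip adjacent edges
                        a, b = tour[i], tour[i + 1]
                        c, d = tour[j], tour[(j + 1) % n]
                        # Gain of replacing edges (a,b),(c,d) with (a,c),(b,d)
                        delta = (math.dist(pts[a], pts[c]) + math.dist(pts[b], pts[d])
                                 - math.dist(pts[a], pts[b]) - math.dist(pts[c], pts[d]))
                        if delta < -1e-9:
                            tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                            improved = True
            return tour
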
  • On the Acceleration of Shortest Path Calculations in Transportation Networks

    Page(s): 23 - 34

    Shortest path algorithms are key elements of many graph problems. They are used in applications such as online direction finding and navigation, and in modeling of traffic for large-scale simulations of major metropolitan areas. As shortest path algorithms are execution bottlenecks, it is beneficial to move their execution to parallel hardware such as field-programmable gate arrays (FPGAs). One of the innovations of this approach is the use of a small bubble sort core to implement the extract-min function. While bubble sort is not usually considered an appropriate algorithm for any non-trivial usage, it is appropriate here because it can produce a single minimum out of the list in O(n) cycles, where n is the number of elements in the vertex list. The cost of this min operation does not impact the running time of the architecture, because the queue depth for fetching the next set of edges from memory is roughly equivalent to the number of cores in the system. Additionally, this work provides a collection of simulation results that model the behavior of the node queue in hardware. The results show that a hardware queue, implementing a small bubble-type minimum function, need only be on the order of 16 elements to provide both correct and optimal paths. With support for a large DRAM graph store with SRAM-based caching on a Cray XD-1 FPGA-accelerated system, the system provides a speedup of roughly 50x over the CPU-based implementation.

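    In software terms, the bubble-sort extract-min above is simply a linear scan for the cheapest unsettled vertex; the hardware overlaps that O(n) scan with the memory fetches of the next edge set, which is why it does not lengthen the critical path. A plain sketch of Dijkstra's algorithm with such a scan (illustrative only, not the Cray XD-1 implementation):

        def dijkstra_linear_min(graph, source):
            """graph: {u: [(v, weight), ...]}. Returns distances from source.
            Uses an O(n) scan for extract-min, mirroring the paper's
            bubble-type minimum, instead of a binary heap."""
            dist = {u: float("inf") for u in graph}
            dist[source] = 0.0
            unsettled = set(graph)
            while unsettled:
                u = min(unsettled, key=dist.__getitem__)  # the O(n) extract-min
                unsettled.remove(u)
                for v, w in graph[u]:
                    if dist[u] + w < dist[v]:
                        dist[v] = dist[u] + w
            return dist

        print(dijkstra_linear_min({0: [(1, 2.0), (2, 5.0)], 1: [(2, 1.0)], 2: []}, 0))
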
  • Enhancing Relocatability of Partial Bitstreams for Run-Time Reconfiguration

    Page(s): 35 - 44

    This paper introduces a method that enhances the relocatability of partial bitstreams for FPGA run-time reconfiguration. Reconfigurable applications usually employ partial bitstreams that are specific to one target region on the FPGA. Previously, techniques have been proposed that allow relocation between identical regions on the FPGA. However, as FPGAs are becoming increasingly heterogeneous, this approach is often too restrictive. We introduce a method that circumvents the need to find fully identical regions by instead matching compatible subsets of resources, enabling flexible placement of relocatable modules. In a software-defined radio prototype with two reconfigurable regions, the number of partial bitstreams is reduced by 50% and the compile time is shortened by 43%.

  • A Library and Platform for FPGA Bitstream Manipulation

    Page(s): 45 - 54

    Since 1998, no commercially available FPGA has been accompanied by public documentation of its native machine code (or bitstream) format. Consequently, research in reconfigurable hardware has been confined to areas which are specifically supported by manufacturer-supplied tools. Recently, detailed documentation of the bitstream format for the Atmel FPSLIC series of FPGAs appeared on the Usenet group comp.arch.fpga. This information has been used to create abits, a Java library for direct manipulation of FPSLIC bitstreams and partial reconfiguration. The abits library is accompanied by the slipway reference design, a low-cost USB bus-powered board carrying an FPSLIC. This paper describes the abits library and slipway platform, as well as a few applications that they make possible. Both the abits source code and the slipway board layout are publicly available under the terms of the BSD license. It is our hope that these tools will enable further research in reconfigurable hardware that would not otherwise be possible.

  • A Structural Object Programming Model, Architecture, Chip and Tools for Reconfigurable Computing

    Page(s): 55 - 64

    A new platform for reconfigurable computing has an object-based programming model, with architecture, silicon and tools designed to faithfully realize this model. The platform is aimed at application developers using software languages and methodologies. Its objectives are massive performance, long-term scalability, and easy development. In our structural object programming model, objects are strictly encapsulated software programs running concurrently on an asynchronous array of processors and memories. They exchange data and control through a structure of self-synchronizing asynchronous channels. Objects are combined hierarchically to create new objects, connected through the common channel interface. The first chip is a 130 nm ASIC with 360 32-bit processors, 360 1KB RAM banks with access engines, and a configurable word-wide channel interconnect. Applications written in Java and block diagrams compile in one minute. Sub-millisecond runtime reconfiguration is inherent.

  • Configurable Transactional Memory

    Page(s): 65 - 72

    Programming efficiency of heterogeneous concurrent systems is limited by the use of lock-based synchronization mechanisms. Transactional memories can greatly improve the programming efficiency of such systems. In field-programmable computing machines, a conventional fixed transactional memory becomes an inefficient use of silicon. We propose configurable transactional memory (CTM) as a mechanism to implement application-specific synchronization that utilizes the field-programmability of such devices to match the requirements of an application. The proposed configurable transactional memory is targeted at embedded applications and is area-efficient compared to conventional schemes that are implemented with cache-coherence protocols. In particular, the CTM is designed to be incorporated into the compilation and synthesis paths of high-level languages, or into the system creation process using tools such as Xilinx EDK. We study the impact of deploying a CTM in a packet metering and statistics application and two micro-benchmarks, as compared to a lock-based synchronization scheme. We have implemented this application in a Xilinx Virtex-4 device and found that the CTM was 0-73% better than a fine-grained lock-based scheme.

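    As a point of reference for transactional semantics, a transaction executes speculatively and commits only if no conflicting update occurred in the meantime, retrying otherwise; the paper's contribution is tailoring the conflict-detection hardware to the application rather than fixing it in silicon. A deliberately coarse software sketch of the optimistic execute/validate/commit cycle (illustrative only; all names here are invented for the example):

        import threading

        class ToySTM:
            """Global-version optimistic concurrency: snapshot, run, validate, commit."""
            def __init__(self):
                self._data, self._version = {}, 0
                self._lock = threading.Lock()

            def atomically(self, txn):
                """txn maps a read snapshot to a dict of writes."""
                while True:                                # retry on conflict
                    with self._lock:
                        start = self._version
                        snapshot = dict(self._data)
                    writes = txn(snapshot)                 # speculative work
                    with self._lock:
                        if self._version == start:         # nobody committed meanwhile
                            self._data.update(writes)
                            self._version += 1
                            return

        stm = ToySTM()
        stm.atomically(lambda s: {"packets": s.get("packets", 0) + 1})
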
  • A Reconfigurable Hardware Interface for a Modern Computing System

    Page(s): 73 - 84

    Reconfigurable hardware (RH) is used in an increasing variety of applications, many of which require support for features commonly found in general-purpose systems. In this work we examine some of the challenges faced in integrating RH with general-purpose processors and memory systems. We propose a new CPU-RH-memory interface that takes advantage of on-chip caches and uses virtual memory for communication. Additionally, we describe the simulator model we developed to evaluate this new architecture. This work shows that an efficient interface can greatly accelerate RH applications, and provides a strong first step toward multiprocessor reconfigurable computing.

  • FPGA Acceleration of Gene Rearrangement Analysis

    Page(s): 85 - 94

    In this paper we present our work toward FPGA acceleration of phylogenetic reconstruction, a type of analysis that is commonly performed in the fields of systematic biology and comparative genomics. In our initial study, we have targeted a specific application that reconstructs maximum-parsimony (MP) phylogenies for gene-rearrangement data. Like other prevalent applications in computational biology, this application relies on a control-dependent, memory-intensive, and non-arithmetic combinatorial optimization algorithm. To achieve hardware acceleration, we developed an FPGA core design that implements the application's primary bottleneck computation. Because our core is lightweight, we are able to synthesize multiple cores on a single FPGA. By using several cores in parallel, we have achieved a 25X end-to-end application speedup using simulated input data.

  • FPGA-accelerated seed generation in Mercury BLASTP

    Page(s): 95 - 106

    BLASTP is the most popular tool for comparative analysis of protein sequences. In recent years, an exponential increase in the size of protein sequence databases has required either exponentially more runtime or a cluster of machines to keep pace. To address this problem, we have designed and built a high-performance FPGA-accelerated version of BLASTP, Mercury BLASTP. In this paper, we focus on seed generation, the first stage of the BLASTP algorithm. Our seed generator is capable of processing database residues at up to 219 Mresidues/second for 2048-residue queries. The full Mercury BLASTP pipeline, including our seed generator, achieves a speedup of 37× over the popular NCBI BLASTP software on a 2.8 GHz Intel P4 CPU, with sensitivity more than 99% that of the software. Our architecture can be generalized to accelerate the seed generation stage in other important biocomputing applications.

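    As background, seed generation scans the database for short words (e.g., 3-mers) that score highly against some word of the query; building a lookup table from the query makes each database position a constant-time probe, which is the operation the pipeline streams at hundreds of Mresidues/second. A simplified exact-match sketch (illustrative only; real BLASTP seeding also matches near-neighbor words under a substitution matrix):

        from collections import defaultdict

        def find_seeds(query, database, w=3):
            """Return (query_pos, db_pos) pairs where length-w words match."""
            table = defaultdict(list)
            for i in range(len(query) - w + 1):
                table[query[i:i + w]].append(i)
            seeds = []
            for j in range(len(database) - w + 1):   # one probe per residue
                for i in table.get(database[j:j + w], ()):
                    seeds.append((i, j))
            return seeds

        print(find_seeds("MKVLA", "AAMKVQKVLA"))     # [(0, 2), (1, 6), (2, 7)]
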
  • Systolic Architecture for Computational Fluid Dynamics on FPGAs

    Page(s): 107 - 116

    This paper presents an FPGA-based flow solver built on the systolic architecture. We show that the fractional-step method employing central difference schemes can be expressed as a systolic algorithm, and therefore the systolic architecture is suitable for a processor dedicated to the flow solver. We have designed a 2D systolic array of cells, each of which has a micro-programmable datapath containing a MAC (multiply-accumulate) unit and a local memory to store the data needed for computational fluid dynamics. On an Altera Stratix II FPGA, we implemented 96 (= 12 × 8) cells running at 60 MHz. Since the MAC unit has both an adder and a multiplier for single-precision floating-point numbers, the total peak performance is 11.5 (= 96 × 60 MHz × 2) GFLOPS. We chose 2D square driven-cavity flow as a benchmark computation based on the fractional-step method. For this computation, the FPGA-based processor running at only 60 MHz achieved computations 7.14 and 6.41 times faster than a Pentium 4 at 3.2 GHz and an Itanium 2 at 1.4 GHz, respectively.

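    The regularity that makes the fractional-step method systolic is easy to see in miniature: every grid cell updates from its four neighbors with the same multiply-accumulate pattern, which is precisely what each MAC-equipped cell of the array computes. A sketch of one such central-difference pass, a Jacobi sweep of the pressure Poisson step (illustrative only, not the authors' datapath):

        import numpy as np

        def jacobi_sweep(p, rhs, h):
            """One Jacobi update of the Poisson equation (5-point stencil),
            interior points only; h is the grid spacing."""
            new = p.copy()
            new[1:-1, 1:-1] = 0.25 * (p[2:, 1:-1] + p[:-2, 1:-1] +
                                      p[1:-1, 2:] + p[1:-1, :-2] -
                                      h * h * rhs[1:-1, 1:-1])
            return new
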
  • FPGA-Based Multigrid Computation for Molecular Dynamics Simulations

    Page(s): 117 - 126

    FPGA-based acceleration of molecular dynamics (MD) has been the subject of several recent studies. Implementing long-range forces, however, has only recently been addressed. Here we describe a solution based on the multigrid method. We show that multigrid is, in general, an excellent match to FPGAs: the primary operations take advantage of the large number of independently addressable RAMs and the efficiency with which complex systolic structures can be implemented. The multigrid accelerator has been integrated into our existing MD system, and an overall performance gain of 5x to 7x has been obtained, depending on hardware configuration and reference code. The simulation accuracy is comparable to the original double precision serial code.

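    For readers new to multigrid, the method smooths the error on the fine grid, restricts the residual to a coarser grid, solves there recursively, and prolongs the correction back; these smooth/restrict/prolong passes are the "primary operations" mapped onto the independently addressable RAMs. A 1D V-cycle sketch for -u'' = f, assuming a grid of 2^k + 1 points (illustrative only, not the MD long-range force solver):

        import numpy as np

        def v_cycle(u, f, h, nu=3):
            def smooth(u, iters):                   # weighted-Jacobi smoother
                for _ in range(iters):
                    u[1:-1] += 0.5 * (0.5 * (u[:-2] + u[2:] + h * h * f[1:-1])
                                      - u[1:-1])
                return u

            u = smooth(u, nu)
            if len(u) <= 3:
                return u
            r = np.zeros_like(u)                    # residual r = f + u''
            r[1:-1] = f[1:-1] + (u[:-2] - 2 * u[1:-1] + u[2:]) / (h * h)
            e = v_cycle(np.zeros(len(u) // 2 + 1), r[::2], 2 * h, nu)
            u[::2] += e                             # prolong correction
            u[1:-1:2] += 0.5 * (e[:-1] + e[1:])     # and interpolate
            return smooth(u, nu)

        n = 65
        u, f = np.zeros(n), np.ones(n)
        for _ in range(10):                         # a few V-cycles converge fast
            u = v_cycle(u, f, 1.0 / (n - 1))
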
  • Reconfigurable Computing Cluster (RCC) Project: Investigating the Feasibility of FPGA-Based Petascale Computing

    Page(s): 127 - 140

    While medium- and large-sized computing centers have increasingly relied on clusters of commodity PC hardware to provide cost-effective capacity and capability, it is not clear that this technology will scale to the PetaFLOP range. Semiconductor technology is expected to continue its exponential advancement over the next fifteen years; however, new issues are rapidly emerging, and the relative importance of current performance metrics is shifting. Future PetaFLOP architectures will require system designers to solve computer architecture problems ranging from how to house, power, and cool the machine, all while remaining sensitive to cost. The Reconfigurable Computing Cluster (RCC) project is a multi-institution, multi-disciplinary project investigating the use of platform FPGAs to build cost-effective petascale computers. This paper describes the nascent project's objectives and a 64-node prototype cluster. Specifically, the aim is to provide a detailed motivation for the project, describe the design principles guiding development, and present a preliminary performance assessment. Microbenchmark results are reported to answer several pragmatic questions about key subsystems, including the system software, network performance, memory bandwidth, and power consumption of nodes in the cluster. Results suggest that the approach is sound.

  • Efficient Mapping of Dimensionality Reduction Designs onto Heterogeneous FPGAs

    Page(s): 141 - 150

    Dimensionality reduction or feature extraction has been widely used in applications that need to reduce the amount of original data, as in image compression, or to represent the original data by a small set of variables that capture the main modes of data variation, as in face recognition and detection applications. A linear projection is often chosen due to its computational attractiveness. The calculation of the linear basis that best explains the data is usually addressed using the Karhunen-Loeve transform (KLT). Moreover, for applications where real-time performance and flexibility to accommodate new data are required, the linear projection is implemented in FPGAs due to their fine-grain parallelism and reconfigurability. Currently, the optimization of such a design, in terms of area usage and efficient allocation of the embedded multipliers that exist in modern FPGAs, is treated as a problem separate from the basis calculation. In this paper, we propose a novel approach that couples the calculation of the linear projection basis, the area optimization problem, and the heterogeneity exploration of modern FPGAs under a probabilistic Bayesian framework. The power of the proposed framework lies in the flexibility to insert information regarding the implementation requirements of the linear basis by assigning a proper prior distribution. Results using real-life examples demonstrate the effectiveness of our approach.

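    For reference, the KLT basis consists of the leading eigenvectors of the data covariance matrix, and dimensionality reduction is the projection onto them. The standard software formulation is sketched below; the paper's point is that the basis can instead be chosen jointly with its FPGA arithmetic cost rather than computed first and optimized afterwards (illustrative only):

        import numpy as np

        def klt_basis(X, k):
            """Top-k KLT (PCA) basis of data matrix X (rows are samples)."""
            Xc = X - X.mean(axis=0)                  # center the data
            w, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
            return V[:, np.argsort(w)[::-1][:k]]     # k largest eigenvectors

        X = np.random.default_rng(0).normal(size=(500, 16))
        B = klt_basis(X, 4)
        Y = (X - X.mean(axis=0)) @ B                 # the linear projection
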
  • K-means Clustering for Multispectral Images Using Floating-Point Divide

    Page(s): 151 - 162

    Many signal processing algorithms can be accelerated using reconfigurable hardware. To achieve a good speedup compared to running software on a general-purpose processor, fine-grained control over the bitwidth of each component in the datapath is desired. This goal can be achieved by using NU's variable-precision floating-point library. To analyze the usefulness of the floating-point divide unit, we incorporate it into our previous implementation of the K-means clustering algorithm applied to multispectral satellite images. Previously, lacking a hardware floating-point divide implementation, the mean-updating step in each iteration of the K-means algorithm had to be moved to the host computer for calculation, and the new means calculated on the host then had to be moved back to the FPGA board for the next iteration of the algorithm. This added data transfer overhead between the host and the FPGA board. In this work, we use the new fp_div module to implement the mean-updating step in FPGA hardware. This greatly reduces the communication overhead between host and FPGA board and further accelerates run time. The K-means clustering example illustrates the use of the fp_div, fix2float and float2fix modules seamlessly assembled together in a real application. It is the first implementation that performs the complete K-means computation in FPGA hardware. Our results show that the hardware implementation achieves a speedup of over 2150x for core computation time and about 11x for total run time including data transfer time. They also show that the divide in FPGA hardware is 100 times faster than in software. Moreover, implementing divide in the FPGA frees the host to work on other tasks concurrently with K-means clustering, thus providing further speedup by allowing the image analyst to exploit this coarse-grained parallelism.

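    The divide that previously forced a host round trip sits in the mean-updating step: each new cluster center is a per-cluster vector sum divided by a member count. A plain K-means sketch with that step marked (illustrative only, not the hardware datapath):

        import numpy as np

        def kmeans(X, k, iters=20, rng=np.random.default_rng(0)):
            centers = X[rng.choice(len(X), k, replace=False)].astype(float)
            for _ in range(iters):
                # Assignment: nearest center per pixel
                d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
                labels = d.argmin(axis=1)
                # Mean update: the floating-point divide done by fp_div on chip
                for j in range(k):
                    members = X[labels == j]
                    if len(members):
                        centers[j] = members.sum(axis=0) / len(members)
            return centers, labels
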
  • Optimizing Logarithmic Arithmetic on FPGAs

    Page(s): 163 - 172

    This paper proposes optimizations of the methods and parameters used in both mathematical approximation and hardware design for logarithmic number system (LNS) arithmetic. First, we introduce a general polynomial approximation approach with an adaptive divide-in-halves segmentation method for evaluating LNS arithmetic functions. Second, we develop a library generator that automatically generates optimized LNS arithmetic units over a wide bit-width range, from 21 to 64 bits, to support LNS application development and design exploration. The basic arithmetic units are tested on practical FPGA boards as well as in software simulation. Compared with existing LNS designs, our generated units provide in most cases a 6% to 37% reduction in area and a 20% to 50% reduction in latency. The key challenge for LNS remains at the application level. We show the performance of LNS versus floating-point for realistic applications: a digital sine/cosine waveform generator, matrix multiplication, and radiative Monte Carlo simulation. Our infrastructure for fast prototyping of LNS FPGA applications allows us to efficiently study the LNS number representation and its trade-offs in speed and size compared with floating-point designs.

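    In an LNS, each value is stored as its base-2 logarithm, so multiplication and division collapse to fixed-point addition and subtraction; the costly operations are addition and subtraction, which require the nonlinear function log2(1 ± 2^d) that the paper's segmented polynomials approximate. A toy floating-point sketch of the identities (illustrative only; real units evaluate the function in fixed point from tables and polynomials, not math.log2):

        import math

        def lns_mul(a, b):                   # log(x*y) = log x + log y
            return a + b

        def lns_add(a, b):                   # log2(2^a + 2^b)
            hi, lo = max(a, b), min(a, b)
            return hi + math.log2(1.0 + 2.0 ** (lo - hi))

        x, y = 6.0, 7.0
        a, b = math.log2(x), math.log2(y)
        assert math.isclose(2 ** lns_mul(a, b), x * y)
        assert math.isclose(2 ** lns_add(a, b), x + y)
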
  • Generating FPGA-Accelerated DFT Libraries

    Page(s): 173 - 184

    We present a domain-specific approach to generating high-performance hardware/software partitioned implementations of the discrete Fourier transform (DFT) in fixed-point precision. The partitioning strategy is a heuristic based on the DFT's divide-and-conquer algorithmic structure and is fine-tuned by feedback-driven exploration of candidate designs. We have integrated this approach into the Spiral linear-transform code-generation framework to support push-button automatic implementation. We present evaluations of hardware/software DFT implementations running on the embedded PowerPC processor and the reconfigurable fabric of the Xilinx Virtex-II Pro FPGA. In our experiments, the FPGA-accelerated 1D and 2D DFT libraries exhibit between 2 and 7.5 times higher performance (operations per second) and up to 2.5 times better energy efficiency (operations per Joule) than the software-only version.

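    The divide-and-conquer structure the partitioning heuristic exploits is the Cooley-Tukey recursion: a size-N DFT becomes two size-N/2 DFTs plus a combining pass, so the recursion tree offers natural hardware/software cut points. A textbook radix-2 sketch (illustrative only; Spiral generates far more sophisticated fixed-point variants):

        import cmath

        def fft(x):
            """Recursive radix-2 DFT; len(x) must be a power of two."""
            n = len(x)
            if n == 1:
                return list(x)
            even, odd = fft(x[0::2]), fft(x[1::2])   # two half-size DFTs
            out = [0j] * n
            for k in range(n // 2):                  # combining pass
                t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
                out[k], out[k + n // 2] = even[k] + t, even[k] - t
            return out
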
  • An FPGA implementation of pipelined multiplicative division with IEEE Rounding

    Page(s): 185 - 196

    We report the results of an FPGA implementation of double-precision floating-point division with IEEE rounding. We achieve a total latency (i.e., cycles times clock period) that is 2.6 times smaller than the latency of the fastest previous implementation on FPGAs. The amount of hardware, on the other hand, is comparable to commercial cores. The division circuit is based on Goldschmidt's algorithm. All IEEE rounding modes are supported and are implemented using dewpoint rounding. The precision of the initial approximation of the reciprocal is 14 bits. To save hardware and reduce the critical path, a half-sized 62×30 Booth radix-8 multiplier is used. This multiplier can receive both the multiplicand and the multiplier in carry-save representation. The division circuit is partitioned into four pipeline stages, has a latency of 11 cycles, and can accept a new double-precision division operation after 8 cycles. Synthesis results of an implementation (not including the computation of the initial approximation of the reciprocal and the exponent path) guarantee a clock frequency of 131 MHz on an Altera Stratix II using 3592 ALMs. The implementation was successfully tested with over 10 million random vectors as well as over a million hard-to-round vectors.

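    Goldschmidt's algorithm multiplies numerator and denominator by a sequence of factors that drive the denominator toward 1, leaving the quotient in the numerator; because each step is just two independent multiplications, it pipelines well. A floating-point sketch of the iteration, with a deliberately crude initial estimate standing in for the paper's 14-bit reciprocal table (illustrative only):

        def goldschmidt_divide(a, b, iterations=4):
            """Approximate a / b for b in [1, 2); error roughly squares per step."""
            f = 1.5 - 0.5 * b          # crude linear estimate of 1/b on [1, 2)
            n, d = a * f, b * f        # the ratio n/d == a/b is invariant
            for _ in range(iterations):
                f = 2.0 - d            # next Goldschmidt factor
                n, d = n * f, d * f    # d -> 1, hence n -> a/b
            return n

        print(goldschmidt_divide(1.0, 1.7), 1.0 / 1.7)   # nearly equal
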
  • Integer Factorization Based on Elliptic Curve Method: Towards Better Exploitation of Reconfigurable Hardware

    Page(s): 197 - 206

    Currently, the best known algorithm for factoring the modulus of the RSA public-key cryptosystem is the Number Field Sieve. One of its important phases usually combines a sieving technique and a method for checking the smoothness of mid-size numbers. For this factorization task, the Elliptic Curve Method (ECM) is an attractive solution. As ECM is highly regular and many parallel computations are required, hardware-based platforms have been shown to be more cost-effective than software solutions. The few papers dealing with implementations of ECM on FPGAs are all based on bit-serial architectures; they use only general-purpose logic and low-cost FPGAs, which appear to be the best performance/cost solution. This work explores another approach, based on the exploitation of the embedded multipliers available in modern FPGAs and the use of high-performance FPGAs. The proposed architecture - based on a fully parallel and pipelined modular multiplier circuit - exhibits a 15-fold improvement in throughput/hardware-cost ratio over previously published results.

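    As background, ECM picks a random elliptic curve modulo n and computes a large scalar multiple of a point; a factor of n is exposed the moment a modular inverse fails, i.e., when a gcd with n is nontrivial. A textbook affine-coordinate sketch (illustrative only; the paper's hardware, like most serious implementations, uses projective coordinates to avoid the per-step inversion):

        import math, random

        def _inv(d, n):
            g = math.gcd(d % n, n)
            if g != 1:
                raise ZeroDivisionError(g)       # g may be the factor we want
            return pow(d, -1, n)

        def _add(P, Q, a, n):                    # points on y^2 = x^3 + ax + b
            if P is None: return Q
            if Q is None: return P
            (x1, y1), (x2, y2) = P, Q
            if x1 == x2 and (y1 + y2) % n == 0:
                return None                      # point at infinity
            s = ((3 * x1 * x1 + a) * _inv(2 * y1, n) if P == Q
                 else (y2 - y1) * _inv(x2 - x1, n)) % n
            x3 = (s * s - x1 - x2) % n
            return (x3, (s * (x1 - x3) - y1) % n)

        def _mul(k, P, a, n):                    # double-and-add
            Q = None
            while k:
                if k & 1:
                    Q = _add(Q, P, a, n)
                P = _add(P, P, a, n)
                k >>= 1
            return Q

        def ecm_stage1(n, bound=200, tries=50, rng=random.Random(1)):
            for _ in range(tries):
                a, x, y = (rng.randrange(n) for _ in range(3))
                P = (x, y)                       # curve constant b is implied
                try:
                    for s in range(2, bound):
                        P = _mul(s, P, a, n)
                        if P is None:
                            break
                except ZeroDivisionError as e:
                    if 1 < e.args[0] < n:
                        return e.args[0]
            return None

        print(ecm_stage1(101 * 103))             # usually prints 101 or 103
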
  • Matched Filter Computation on FPGA, Cell and GPU

    Page(s): 207 - 218

    The matched filter is an important kernel in the processing of hyperspectral data. The filter enables researchers to sift useful data from instruments that span large frequency bands and can produce gigabytes of data in seconds. In this work, we evaluate the performance of a matched filter algorithm implementation on an FPGA-accelerated co-processor (Cray XD-1), the IBM Cell microprocessor, and the NVIDIA GeForce 7900 GTX GPU. We provide extensive discussion of the challenges and opportunities afforded by each platform. In particular, we explore the problem of partitioning the filter most efficiently between the host CPU and the co-processor. Using our results, we derive several performance metrics that help identify the optimal solution for a variety of application situations.

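    Concretely, the matched filter for a target signature s against background with mean mu and covariance R scores each pixel by a dot product with a weight vector proportional to R^{-1}(s - mu); estimating R, inverting it, and streaming the per-pixel dot products are the pieces being partitioned between host and co-processor. A compact sketch (illustrative only):

        import numpy as np

        def matched_filter_scores(cube, target):
            """cube: (num_pixels, num_bands) hyperspectral data; target: (num_bands,).
            Scores are scaled so a pure target pixel is near 1."""
            mu = cube.mean(axis=0)
            R = np.cov(cube, rowvar=False)        # background covariance
            w = np.linalg.solve(R, target - mu)   # w = R^{-1} (target - mu)
            w /= (target - mu) @ w                # normalize
            return (cube - mu) @ w
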