
Field-Programmable Custom Computing Machines, 2002. Proceedings. 10th Annual IEEE Symposium on

Date 24-24 April 2002


Displaying Results 1 - 25 of 46
  • Proceedings 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines. FCCM 2002

    Freely Available from IEEE
  • Author index

    Page(s): 321 - 322
    Freely Available from IEEE
  • Reconfigurable shape-adaptive template matching architectures

    Page(s): 98 - 107

    This paper presents reconfigurable computing strategies for a Shape-Adaptive Template Matching (SA-TM) method to retrieve arbitrarily shaped objects within images or video frames. A generic systolic array architecture is proposed as the basis for comparing three designs: a static design whose configuration does not change after compilation, a partially-dynamic design where a static circuit can be reconfigured to use different on-chip data, and a dynamic design which completely adapts to a particular computation. While the logic resources required to implement the static and partially-dynamic designs are constant and depend only on the size of the search frame, the dynamic design is adapted to the size and shape of the template object, and hence requires much less area. The execution time of the matching process depends greatly on the number of frames against which the same object is matched. For a small number of frames, the dynamic and partially-dynamic designs suffer from high reconfiguration overheads. This overhead is significantly reduced if the matching process is repeated on a large number of consecutive frames. We find that the dynamic SA-TM design in a 50 MHz Virtex 1000E device, including reconfiguration time, can perform almost 7,000 times faster than a 1.4 GHz Pentium 4 PC when processing a 100×100 template on 300 consecutive video frames in HDTV format.
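
    The core computation behind all three designs can be sketched in plain software. The following is a minimal reference (not the paper's systolic architecture): a binary mask restricts the sum of absolute differences to the object's shape, so pixels outside the shape never contribute to the score.

```python
def sa_tm(frame, template, mask):
    """Shape-adaptive template matching: sum of absolute differences
    restricted to a binary shape mask (software reference, exhaustive
    search over all window positions)."""
    th, tw = len(template), len(template[0])
    fh, fw = len(frame), len(frame[0])
    best, best_pos = float("inf"), None
    for y in range(fh - th + 1):
        for x in range(fw - tw + 1):
            # SAD over masked pixels only -- the shape adaptivity
            sad = sum(abs(frame[y + i][x + j] - template[i][j])
                      for i in range(th) for j in range(tw) if mask[i][j])
            if sad < best:
                best, best_pos = sad, (y, x)
    return best_pos, best
```

    A systolic realization evaluates many window positions concurrently; this loop nest only fixes the arithmetic being parallelized.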

  • FPGA-based template matching using distance transforms

    Page(s): 89 - 97

    This paper presents a high-performance FPGA solution to generic shape-based object detection in images. The underlying detection method involves representing the target object by binary templates containing positional and directional edge information. A particular scene image is preprocessed by edge segmentation, edge cleaning and distance transforms. Matching involves correlating the templates with the distance-transformed scene image and determining the locations where the mismatch is below a certain user-defined threshold. Although successful in the past, a significant drawback of these matching methods has been their large computational cost when implemented on a sequential general-purpose processor. In this paper we present a step-by-step implementation of the components of such object detection systems, taking advantage of the data and logical parallelism opportunities offered by an FPGA architecture. The realization of a pipelined calculation of the preprocessing and correlation on FPGA is presented in detail.
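
    The preprocess-then-match idea can be illustrated with a small software sketch (a generic chamfer-style reference, not the paper's pipeline): a two-pass city-block distance transform of the edge map, followed by summing the distances under the template's edge points.

```python
def chamfer_dt(edges):
    """Two-pass city-block (L1) distance transform of a binary edge map:
    forward sweep propagates distances from up/left, backward sweep from
    down/right. Exact for the 4-neighborhood metric."""
    h, w = len(edges), len(edges[0])
    INF = 10 ** 6
    d = [[0 if edges[y][x] else INF for x in range(w)] for y in range(h)]
    for y in range(h):                      # forward pass
        for x in range(w):
            if y > 0: d[y][x] = min(d[y][x], d[y - 1][x] + 1)
            if x > 0: d[y][x] = min(d[y][x], d[y][x - 1] + 1)
    for y in range(h - 1, -1, -1):          # backward pass
        for x in range(w - 1, -1, -1):
            if y < h - 1: d[y][x] = min(d[y][x], d[y + 1][x] + 1)
            if x < w - 1: d[y][x] = min(d[y][x], d[y][x + 1] + 1)
    return d

def chamfer_cost(d, template_pts, dy, dx):
    """Mismatch score: sum of distances under the template's edge points
    when the template is placed at offset (dy, dx); low cost = good match."""
    return sum(d[dy + y][dx + x] for y, x in template_pts)
```

    In the paper this correlation is the inner loop that the FPGA pipelines; thresholding `chamfer_cost` over all offsets yields the detections.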

  • Assisting network intrusion detection with reconfigurable hardware

    Page(s): 111 - 120

    String matching is used by Network Intrusion Detection Systems (NIDS) to inspect incoming packet payloads for hostile data. String-matching speed is often the main factor limiting NIDS performance. String-matching performance can be dramatically improved by using Field-Programmable Gate Arrays (FPGAs); accordingly, a "regular-expression to FPGA circuit" module generator has been developed. The module generator extracts strings from the Snort NIDS rule-set, generates a regular expression that matches all extracted strings, synthesizes an FPGA-based string matching circuit, and generates an EDIF netlist that can be processed by Xilinx software to create an FPGA bitstream. The feasibility of this approach is demonstrated by comparing the performance of the FPGA-based string matcher against the software-based GNU regex program. The FPGA-based string matcher exceeds the performance of the software-based system by 600x for large patterns.
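
    The string-to-regex step of such a module generator has a direct software analogue. The sketch below is illustrative, not the actual generator: it folds a set of literal patterns into one alternation and scans a payload with it.

```python
import re

def build_matcher(strings):
    """Fold literal attack strings into a single alternation regex,
    longest-first so a shorter pattern never shadows a longer match."""
    alts = sorted(strings, key=len, reverse=True)
    return re.compile(b"|".join(re.escape(s) for s in alts))

def scan(matcher, payload):
    """Return every suspicious substring found in a packet payload."""
    return [m.group() for m in matcher.finditer(payload)]
```

    The FPGA version compiles the same alternation into parallel match logic clocked at line rate, rather than a sequential backtracking engine.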

  • Fast area estimation to support compiler optimizations in FPGA-based reconfigurable systems

    Page(s): 239 - 247

    Several projects have developed compiler tools that translate high-level languages down to hardware description languages for mapping onto FPGA-based reconfigurable computers. These compiler tools can apply extensive transformations that exploit the parallelism inherent in the computations. However, the transformations can have a major impact on the chip area (number of logic blocks) used on the FPGA. It is imperative, therefore, that the compiler user be provided with feedback indicating how much space is being used. In this paper we present a fast compile-time area estimation technique to guide the compiler optimizations. Experimental results show that our technique achieves an accuracy within 2.5% for small image-processing operators, and within 5.0% for larger benchmarks, as compared to the usual post-compilation synthesis tool estimations. The estimation time is on the order of milliseconds, as compared to several minutes for a synthesis tool.

  • Mapping multi-mode circuits to LUT-based FPGA using embedded MUXes

    Page(s): 318 - 319

    For some systems, a general-purpose FPGA solution tends to be large and slow. A reconfigurable solution is smaller and faster but has a delay associated with the reconfiguration. In this paper, embedded MUXes are used to achieve the performance of reconfiguration without the time penalty. For a CRC circuit, an area reduction of 93% compared to a general-purpose solution and a reduction of 17-34% compared to similar software-compiled systems is achieved.
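
    A bit-serial software model of a CRC datapath makes the "mode" explicit as a polynomial parameter; in hardware, the embedded MUXes select among such modes without reconfiguring the fabric. This is a generic CRC sketch, not the paper's circuit.

```python
def crc(data, poly, width, init=0):
    """Bit-serial CRC -- the software analogue of the LFSR a CRC circuit
    implements. `poly` is the truncated generator polynomial; switching
    it corresponds to switching the circuit's mode."""
    topbit = 1 << (width - 1)
    mask = (1 << width) - 1
    reg = init
    for byte in data:
        reg ^= byte << (width - 8)   # fold the next byte into the register
        for _ in range(8):           # one shift per message bit
            if reg & topbit:
                reg = ((reg << 1) ^ poly) & mask
            else:
                reg = (reg << 1) & mask
    return reg
```

    The two asserted check values for the string "123456789" (0xF4 for CRC-8 with poly 0x07, 0x29B1 for CRC-16/CCITT-FALSE with poly 0x1021 and init 0xFFFF) are the standard catalogue values for the unreflected variants.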

  • Tabu search with intensification strategy for functional partitioning in hardware-software codesign

    Page(s): 297 - 298

    This paper presents a tabu search (TS) method with an intensification strategy for hardware-software partitioning. The algorithm operates on functional blocks for designs represented as directed acyclic graphs (DAG), with the objective of minimising processing time under various hardware area constraints. Results are compared to two other heuristic search algorithms: a genetic algorithm (GA) and simulated annealing (SA). The comparison involves a scheduling model based on list scheduling for calculating the processing time used as the system cost, assuming that shared resource conflicts do not occur. The results show that TS, which rarely appears for solving this kind of problem, is superior to SA and GA in terms of both search time and the quality of solutions. In addition, we have implemented an intensification strategy in TS, called penalty reward, which can further improve the quality of results.

  • The design of the Amalgam reconfigurable cluster

    Page(s): 309 - 310

    Amalgam is a novel architecture for multifunction embedded systems. It integrates multiple reconfigurable and programmable processing resources (known as clusters) to achieve high performance with low design effort on a variety of multimedia applications. The reconfigurable cluster (RClust) enables Amalgam to exploit the natural parallelism and operator granularities of a target application. The RClust contains a ring of reconfigurable logic interleaved with a banked register file to support Amalgam's register-based inter-cluster communication mechanism. This low-latency mechanism allows the RClust to coordinate with a programmable cluster (PClust) as a special-purpose functional unit implementing small custom operations. The relatively large size of the cluster, however, allows it to implement larger, more independent computational kernels. In this extended abstract, we describe the initial design of the RClust and present results from mapping several benchmarks to Amalgam architectures with and without RClust elements.

  • An FPGA implementation of triangle mesh decompression

    Page(s): 22 - 31

    This paper presents an FPGA-based design and implementation of a three-dimensional (3D) triangle mesh decompressor. Triangle meshes are the dominant representation of 3D geometric models. The prototype decompressor is based on a simple and highly efficient triangle mesh compression algorithm, called BFT mesh encoding. To the best of our knowledge, this is the first hardware implementation of triangle mesh decompression. The decompressor can be added at the front-end of a 3D graphics card sitting on the PCI/AGP bus. It can reduce the bandwidth requirement on the bus between the host and the graphics card by up to 80% compared to standard triangle mesh representations. Other mesh decompression algorithms with comparable compression efficiency to BFT mesh encoding are too complex to be implemented in hardware.

  • Module generators driving the compilation for adaptive computing systems

    Page(s): 293 - 294

    We present GLACE, the Generic Library for Adaptive Computing Systems, which offers a comprehensive set of user-extensible module generators and associated meta-data (e.g., timing, interfaces, topology, etc.). Furthermore, we discuss some of the issues that need to be addressed when using GLACE from a high-level compilation flow.

  • A scalable FPGA-based custom computing machine for a medical image processing

    Page(s): 307 - 308

    The concentration index filter is a kind of spatial image filter; a typical application is diagnosis from medical images. This paper presents a dedicated computing engine for concentration index filtering. The original algorithm is modified to extract full parallelism, and the data width is optimized to maximize clock speed and minimize hardware scale. Evaluation results reveal that the system runs 100 times faster than a current workstation and enables real-time diagnosis.

  • Queue machines: hardware compilation in hardware

    Page(s): 152 - 160

    In this paper we hypothesize that reconfigurable computing is not more widely used because of the logistical difficulties caused by the close coupling of applications and hardware platforms. As an alternative, we propose computing machines that use a single, serial instruction representation for the entire reconfigurable computing application. We show how it is possible to convert, at runtime, the parallel portions of the application into a spatial representation suitable for execution on a reconfigurable fabric. The conversion to spatial representation is facilitated by the use of an instruction set architecture based on an operand queue. We describe techniques to generate code for queue machines and hardware virtualization techniques necessary to allow any application to execute on any platform.
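
    The operand-queue execution model can be demonstrated with a toy interpreter (a sketch of the concept, not the paper's ISA): every operation dequeues its inputs from the front of a FIFO and enqueues its result at the back, so a serial instruction stream encodes a dataflow whose independent operations can be laid out spatially.

```python
from collections import deque

def run_queue_machine(program):
    """Operand-queue interpreter: unlike a stack machine, operands come
    off the FRONT of a FIFO and results go on the BACK."""
    q = deque()
    for op, *args in program:
        if op == "push":
            q.append(args[0])          # literal operand
        elif op == "add":
            a, b = q.popleft(), q.popleft()
            q.append(a + b)
        elif op == "mul":
            a, b = q.popleft(), q.popleft()
            q.append(a * b)
    return q.popleft()                 # final result at the queue head
```

    Evaluating (1+2)*(3+4) as [push 1, push 2, push 3, push 4, add, add, mul] yields 21; the two adds consume disjoint operands, so a spatial implementation could issue them concurrently.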

  • Single-chip gigabit mixed-version IP router on Virtex-II Pro

    Page(s): 35 - 44

    This paper concerns novel single-chip system architecture options, based on the Xilinx Virtex-II Pro part, which includes up to four PowerPC cores and was launched in Spring 2002. The research described here was carried out pre-launch (i.e., prior to availability of real parts), so the paper focuses on initial architectural experiments based on simulation. The application is a Mixed-version IP Router, named MIR, servicing Gigabit Ethernet ports. This would be of use to organizations with several Gigabit Ethernet networks, with a mixture of IPv4 and IPv6 hosts and routers attached directly to the networks. A particular benefit of a programmable approach based on Virtex-II Pro is that the router's functions can evolve smoothly, maintaining router performance as the organization migrates from IPv4 to IPv6 internally, and also as the Internet migrates externally. The basic aim is to carry out the more frequent, and less control-intensive, functions in logic, and other functions in the processor. Two prototypes are described here. Both support four Ethernet ports, but the designs are scalable upwards. The second one, the more ambitious of the two, instantiates a configuration appropriate when the bulk of the incoming packets are IPv4. Such packets are processed and switched entirely by logic, with no internal copying of packets between buffers and virtually no delay between packet receipt and onward forwarding. This involves a specially-tailored internal interconnection network between the four ports, and also processing performed in parallel with packet receipt, i.e. multi-threading in logic. IPv6 packets, or some rare IPv4 cases, are passed to a PowerPC core for processing. In essence, the PowerPC acts as a slave to the logic, rather than the more common opposite master-slave relationship.
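
    The version dispatch at the heart of a mixed-version router reduces to inspecting the high nibble of the first IP header byte. The path labels below are illustrative, not the paper's terminology:

```python
def ip_version(packet: bytes) -> int:
    """The IP version number is the high nibble of the first header byte
    (0x4_ for IPv4, 0x6_ for IPv6)."""
    return packet[0] >> 4

def dispatch(packet: bytes) -> str:
    """Mirror of MIR's split: common IPv4 packets stay on the logic fast
    path; IPv6 and rare cases go to the embedded processor."""
    v = ip_version(packet)
    if v == 4:
        return "logic-fast-path"
    if v == 6:
        return "powerpc-slow-path"
    return "drop"
```

    In the hardware, this single-nibble test is what lets the logic forward IPv4 traffic with no processor involvement at all.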

  • PAM-Blox II: design and evaluation of C++ module generation for computing with FPGAs

    Page(s): 67 - 76

    This paper explores the implications of integrating flexible module generation into a compiler for FPGAs. The objective is to improve the programmability of FPGAs, or in other words, the productivity of the FPGA programmer. We describe (1) the module generation library PAM-Blox II, the second generation of object-oriented module generators in C++, targeted at computing with FPGAs, and (2) examples of design tradeoffs and performance results using redundant representations for addition and multiplication, and technology mapping of comparison and elementary function evaluation. PAM-Blox II is built on top of a set of extensions to the gate level FPGA design library PamDC to provide a more efficient, portable, scalable, and maintainable module generator library. Using PAM-Blox II we demonstrate a simplified interface to bit-level programmability. The simplification results from the bottom-up approach and a close coupling of architecture generation, module generation and gate level CAD. The tradeoffs for the module generators are based on trading area for speed and hand-optimizing technology mapping to the specific FPGA technology. As an example, we show that redundant number representations hold one key to unleashing the full potential of reconfigurability on the bit-level. The presented module generators are applied to encryption and compression to show the impact of the bit-level optimizations on application performance.

  • GRIP: a reconfigurable architecture for host-based gigabit-rate packet processing

    Page(s): 121 - 130

    One of the fundamental challenges for modern high-performance network interfaces is the processing capability required to handle packets at high speeds. Simply transmitting or receiving data at gigabit speeds fully utilizes the CPU on a standard workstation. Any processing that must be done to the data, whether at the application layer or the network layer, decreases the achievable throughput. This paper presents an architecture for offloading a significant portion of the network processing from the host CPU onto the network interface. A prototype, called the GRIP (Gigabit Rate IPSec) card, has been constructed based on an FPGA coupled with a commodity Gigabit Ethernet MAC. Experimental results based on the prototype are presented and analyzed. In addition, a second-generation design is presented in the context of lessons learned from the prototype.

  • On sparse matrix-vector multiplication with FPGA-based system

    Page(s): 273 - 274

    In this paper we report on our experimentation with the use of an FPGA-based system to solve the irregular computation problem of evaluating y = Ax when the matrix A is sparse. The main features of our matrix-vector multiplication algorithm are (i) an organization of the operations to suit the FPGA-based system's ability to process a stream of data, and (ii) the use of the distributed arithmetic technique together with an efficient scheduling heuristic to exploit the inherent parallelism in the matrix-vector multiplication problem. The performance of our algorithm has been evaluated with an implementation on the Pamette FPGA-based system.
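
    The streaming-friendly organization can be seen in a plain compressed-sparse-row (CSR) multiply, where a single sequential pass over the nonzeros is exactly the access pattern an FPGA pipeline exploits. This is a generic reference, not the paper's distributed-arithmetic design:

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """y = A x with A in compressed sparse row (CSR) form.
    values/col_idx hold the nonzeros in row order; row_ptr[i]:row_ptr[i+1]
    delimits row i. One streaming pass over the nonzero arrays."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y
```

    The irregularity lives entirely in the indirect access `x[col_idx[k]]`; the scheduling heuristic the abstract mentions is about keeping such accesses from stalling the stream.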

  • Mobile Memory: Improving memory locality in very large reconfigurable fabrics

    Page(s): 195 - 204

    As the size of reconfigurable fabrics increases, we can envision entire applications being mapped to a reconfigurable device; not just the code, but also the memory. These larger circuits, unfortunately, will suffer from the problem of a growing memory bottleneck. In this paper we explore how mobile memory techniques, inspired by cache-only memory architectures, can be applied to help solve this problem. The basic idea is to move the memory to the location of the accessor. Using both an analytical model and simulation, we investigate several different memory movement algorithms. The results show that mobility can, on average, decrease memory latency by 2×, which translates into a speedup of about 15%.

  • Hyperspectral image compression on reconfigurable platforms

    Page(s): 251 - 260

    In this paper we present an implementation of the image compression routine SPIHT in reconfigurable logic. A discussion of why adaptive logic is required, as opposed to an ASIC, is provided along with background material on the image compression algorithm. We analyzed several discrete wavelet transform architectures and selected the folded DWT design. In addition, we provide a study of what storage elements are required for each wavelet coefficient. The paper uses a modification to the original SPIHT algorithm needed to parallelize the computation. The architecture of the SPIHT engine is based upon fixed-order SPIHT, developed specifically for use within adaptive hardware. For an N × N image, fixed-order SPIHT may be calculated in N²/4 cycles. Square images with power-of-two dimensions up to 1024 × 1024 are supported by the architecture. Our system was developed on an Annapolis Microsystems WildStar board populated with Xilinx Virtex-E parts.
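
    As background, a one-level 1-D Haar transform shows the kind of low-band/high-band coefficients a DWT feeds to SPIHT. The paper uses a folded 2-D DWT; this Haar sketch is only illustrative:

```python
def haar_1d(signal):
    """One level of a 1-D Haar DWT: pairwise averages (low band) and
    pairwise half-differences (details, high band)."""
    avg = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    det = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return avg, det

def haar_1d_inverse(avg, det):
    """Exact reconstruction: each pair is (a + d, a - d)."""
    out = []
    for a, d in zip(avg, det):
        out += [a + d, a - d]
    return out
```

    SPIHT then exploits the fact that most detail coefficients are near zero, coding the coefficient tree bit-plane by bit-plane.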

  • Coarse-grain pipelining on multiple FPGA architectures

    Page(s): 77 - 86

    Reconfigurable systems, and in particular, FPGA-based custom computing machines, offer a unique opportunity to define application-specific architectures. These architectures offer performance advantages for application domains such as image processing, where the use of customized pipelines exploits the inherent coarse-grain parallelism. In this paper we describe a set of program analyses and an implementation that map a sequential and un-annotated C program into a pipelined implementation running on a set of FPGAs, each with multiple external memories. Based on well-known parallel computing analysis techniques, our algorithms perform unrolling for operator parallelization, reuse and data layout for memory parallelization and precise communication analysis. We extend these techniques for FPGA-based systems to automatically partition the application data and computation into custom pipeline stages, taking into account the available FPGA and interconnect resources. We illustrate the analysis components by way of an example, a machine vision program. We present the algorithm results, derived with minimal manual intervention, which demonstrate the potential of this approach for automatically deriving pipelined designs from high-level sequential specifications.

  • Automatic latency-optimal design of FPGA-based systolic arrays

    Page(s): 299 - 300

    "Systolic" algorithms have been shown to be suitable for a very large range of structured problems (i.e., linear algebra, graph theory, computational geometry, number-theoretic algorithms, string matching, sorting/searching, dynamic programming, discreet mathematics). Usage of this systolic architecture class has not been widespread in the past, in part because programmable hardware that supported this computing paradigm was not cost-effective to build and no design tools existed. However, suitable hardware has begun to appear. Complex FPGAs now provide an adequate level of speed, density and programmability in the form of reconfigurable computers, boards, and chips with embedded computational support. Such hardware could allow rapid implementation and change of systolic algorithms leading to inexpensive "programmable" systolic array hardware. Furthermore, the architectural characteristics of much FPGA hardware matches that required by systolic processing, because this technology is constructed from tiling identical memory and logic blocks along with supporting mesh interconnection networks. The symbolic parallel algorithm development environment (SPADE) described here is being developed to allow a designer to easily and rapidly explore the design space of various systolic algorithm implementations so that FPGA system tradeoffs can be efficiently analyzed. The intention is to allow a user to specify his algorithm with traditional high-level code, set some architectural constraints and then view the results in a meaningful graphical format. View full abstract»

  • Mapping algorithms to the Amalgam programmable-reconfigurable processor

    Page(s): 311 - 312

    The Amalgam programmable-reconfigurable processor is designed to provide the computational power required by upcoming embedded applications without requiring the design of application-specific hardware. It integrates multiple programmable processors and blocks of reconfigurable logic onto a single chip, using a clustered architecture similar to that of the M-Machine to reduce wire length and delay and to allow implementation at high clock rates. The clustered architecture provides tremendous flexibility, allowing applications to exploit parallelism at whatever granularity is best-suited to the application, while the combination of reconfigurable logic and programmable processors delivers much higher performance than could be achieved through programmable processors alone. This abstract presents the results of our initial experiments in hand-mapping applications onto Amalgam. Five applications (IDCT, Rijndael encryption, nQueens, DNA sequence comparison, and image dithering) have been implemented, achieving speedups ranging from 8.7× to 23.2× over the performance of a single programmable cluster by using the complete resources of an Amalgam chip.

  • Analysis and implementation of the discrete element method using a dedicated highly parallel architecture in reconfigurable computing

    Page(s): 173 - 181

    The Discrete Element Method (DEM) is a numerical model to describe the mechanical behaviour of discontinuous bodies. It has been traditionally used to simulate particle flows (e.g. sand, sugar), but is becoming more popular as a method to represent solid materials. The DEM is very computationally expensive, but has properties that make it amenable to acceleration by reconfigurable computing. This paper describes the implementation of a dedicated hardware architecture for the DEM implemented on an FPGA, which is capable of giving a speed-up of about 30 times compared to an optimised software version running on a fast microprocessor.

  • Control and configuration software for a reconfigurable networking hardware platform

    Page(s): 45 - 54

    A suite of tools called NCHARGE (Networked Configurable Hardware Administrator for Reconfiguration and Governing via End-systems) has been developed to simplify the co-design of hardware and software components that process packets within a network of Field Programmable Gate Arrays (FPGAs). A key feature of NCHARGE is that it provides a high-performance packet interface to hardware and standard Application Programming Interface (API) between software and reprogrammable hardware modules. Using this API, multiple software processes can communicate to one or more hardware modules using standard TCP/IP sockets. NCHARGE also provides a Web-Based User Interface to simplify the configuration and control of an entire network switch that contains several software and hardware modules.

  • A massively parallel RC4 key search engine

    Page(s): 13 - 21

    A massively parallel implementation of an RC4 key search engine on an FPGA is described. The design employs parallelism at the logic level to perform many operations per cycle, uses on-chip memories to achieve very high memory bandwidth, floorplanning to reduce routing delays, and multiple decryption units to achieve further parallelism. A total of 96 RC4 decryption engines were integrated on a single Xilinx Virtex XCV1000-E field programmable gate array (FPGA). The resulting design operates at a 50 MHz clock rate and achieves a search speed of 6.06 × 10⁶ keys/second, which is a speedup of 58 over a 1.5 GHz Pentium 4 PC.
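
    The per-key work each of the 96 engines performs is ordinary RC4 (key schedule plus keystream generation). A software sketch of the search loop follows; keys are tried serially here, in parallel on the FPGA:

```python
def rc4_keystream(key, n):
    """Standard RC4: key-scheduling algorithm (KSA) then n bytes of
    pseudo-random generation (PRGA)."""
    S = list(range(256))
    j = 0
    for i in range(256):                     # KSA: key-dependent shuffle
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    out, i, j = [], 0, 0
    for _ in range(n):                       # PRGA: emit keystream bytes
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        out.append(S[(S[i] + S[j]) % 256])
    return bytes(out)

def search(ciphertext, known_plaintext, keyspace):
    """Known-plaintext brute force: try each candidate key until the
    decryption matches. The FPGA runs 96 such trials concurrently."""
    for key in keyspace:
        ks = rc4_keystream(key, len(known_plaintext))
        if bytes(c ^ k for c, k in zip(ciphertext, ks)) == known_plaintext:
            return key
    return None
```

    As a sanity check, the keystream for the key "Key" begins EB 9F 77, the widely published RC4 test vector.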
