
14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM '06)

Date: 24-26 April 2006


Displaying results 1-25 of 72
  • 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines - Cover

    Page(s): c1
    PDF (64 KB)
    Freely Available from IEEE
  • 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines - Title

    Page(s): i - iii
    PDF (60 KB)
    Freely Available from IEEE
  • 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines - Copyright

    Page(s): iv
    PDF (53 KB)
    Freely Available from IEEE
  • 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines - Table of contents

    Page(s): v - x
    PDF (70 KB)
    Freely Available from IEEE
  • Conference organizers

    Page(s): xi
    PDF (46 KB)
    Freely Available from IEEE
  • A Hybrid Approach for Mapping Conjugate Gradient onto an FPGA-Augmented Reconfigurable Supercomputer

    Page(s): 3 - 12
    PDF (489 KB) | HTML

    Supercomputer companies such as Cray, Silicon Graphics, and SRC Computers now offer reconfigurable computer (RC) systems that combine general-purpose processors (GPPs) with field-programmable gate arrays (FPGAs). The FPGAs can be programmed to become, in effect, application-specific processors. These supercomputers allow end-users to create custom computing architectures aimed at the computationally intensive parts of each problem. This report describes a parameterized, parallelized, deeply pipelined, dual-FPGA, IEEE-754 64-bit floating-point design for accelerating the conjugate gradient (CG) iterative method on an FPGA-augmented RC. The FPGA-based elements are developed via a hybrid approach that uses a high-level language (HLL)-to-hardware description language (HDL) compiler in conjunction with custom-built, VHDL-based, floating-point components. A reference version of the design is implemented on a contemporary RC. Actual run-time performance data compare the FPGA-augmented CG to the software-only version and show that the FPGA-based version runs 1.3 times faster than the software version. Estimates show that the design can achieve a 4-fold speedup on a next-generation RC.

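As context for the kernel being accelerated, the conjugate gradient iteration has a compact software reference form. The following Python/NumPy sketch is illustrative only - function and variable names are not from the paper's VHDL design:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Plain CG for a symmetric positive-definite matrix A (software reference)."""
    x = np.zeros_like(b)
    r = b - A @ x          # residual
    p = r.copy()           # search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```

The matrix-vector product `A @ p` dominates the per-iteration cost, which is what makes a deeply pipelined floating-point datapath attractive for this method.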
  • A case study in porting a production scientific supercomputing application to a reconfigurable computer

    Page(s): 13 - 22
    PDF (123 KB) | HTML

    This case study presents the results of porting a production scientific code, called NAMD, to the SRC-6 high-performance reconfigurable computing platform based on field programmable gate array (FPGA) technology. NAMD is a molecular dynamics code designed to run on large supercomputing systems and used extensively by the computational biophysics community. NAMD's computational kernel is highly optimized to run on conventional von Neumann processors; this presents numerous challenges to its reimplementation on an FPGA architecture. This paper presents an overview of the SRC-6 architecture and the NAMD application and then discusses the challenges, solutions, and results of the porting effort. The rationale in choosing the development path taken and the general framework for porting an existing scientific code, such as NAMD, to the SRC-6 platform are presented and discussed in detail. The results and methods presented in this paper are applicable to a large class of problems in scientific computing.

  • Hardware/Software Approach to Molecular Dynamics on Reconfigurable Computers

    Page(s): 23 - 34
    PDF (211 KB) | HTML

    With advances in reconfigurable hardware, especially field-programmable gate arrays (FPGAs), it has become possible to use reconfigurable hardware to accelerate complex applications, such as those in scientific computing. There has been a resulting development of reconfigurable computers - computers which have both general-purpose processors and reconfigurable hardware, as well as memory and high-performance interconnection networks. In this paper, we study the acceleration of molecular dynamics simulations using reconfigurable computers. We describe how we partition the application between software and hardware and then model the performance of several alternatives for the task mapped to hardware. We describe an implementation of one of these alternatives on a reconfigurable computer and demonstrate that for two real-world simulations, it achieves a 2× speedup over the software baseline. We then compare our design and results to those of prior efforts and explain the advantages of the hardware/software approach, including flexibility.

  • Virtual Embedded Blocks: A Methodology for Evaluating Embedded Elements in FPGAs

    Page(s): 35 - 44
    PDF (205 KB) | HTML

    Embedded elements, such as block multipliers, are increasingly used in advanced field programmable gate array (FPGA) devices to improve efficiency in speed, area and power consumption. A methodology is described for assessing the impact of such embedded elements on efficiency. The methodology involves creating dummy elements, called virtual embedded blocks (VEBs), in the FPGA to model the size, position and delay of the embedded elements. The standard design flow offered by FPGA and CAD vendors can be used for mapping, placement, routing and retiming of designs with VEBs. The speed and resource utilisation of the resulting designs can then be inferred using the FPGA vendor's timing analysis tools. We illustrate the application of this methodology to the evaluation of various schemes involving embedded elements that support floating-point computations.

  • Automated Generation of Hardware Accelerators with Direct Memory Access from ANSI/ISO Standard C Functions

    Page(s): 45 - 56
    PDF (736 KB) | HTML

    Methodologies for synthesis of stand-alone hardware modules from C/C++ based languages have been gaining adoption for embedded system design, as an essential means to stay ahead of increasing performance, complexity, and time-to-market demands. However, using C to generate stand-alone blocks does not allow for truly seamless unification of embedded software and hardware development flows. This paper describes a methodology for generating hardware accelerator modules that are tightly coupled with a soft RISC CPU, its tool chain, and its memory system. This coupling allows for several significant advancements: (1) a unified development environment with true pushbutton switching between original software and hardware-accelerated implementations, (2) direct access to memory from the accelerator module, (3) full support for pointers and arrays, and (4) latency-aware pipelining of memory transactions. We also present results of our implementation, the C2H compiler. Eight user test cases on common embedded applications show speedup factors of 13x-73x achieved in less than a few days.

  • Efficient Hardware Generation of Random Variates with Arbitrary Distributions

    Page(s): 57 - 66
    PDF (223 KB) | HTML

    This paper presents a technique for efficiently generating random numbers from a given probability distribution. This is achieved by using a generic hardware architecture, which transforms uniform random numbers according to a distribution mapping stored in RAM, and a software approximation generator that creates distribution mappings for any given target distribution. This technique has many features not found in current non-uniform random number generators, such as the ability to adjust the target distribution while the generator is running, per-cycle switching between distributions, and the ability to generate distributions with discontinuities in the probability density function.

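The core idea - transforming uniform random numbers through a distribution mapping held in memory - can be modelled in software as a quantized inverse-CDF lookup table. This sketch is illustrative only; the function names and the table-building method are assumptions, not the paper's approximation generator:

```python
import numpy as np

def build_mapping(pdf, support, table_size=1024):
    """Build a quantized inverse-CDF table, analogous to the mapping stored in RAM."""
    xs = np.linspace(support[0], support[1], table_size)
    cdf = np.cumsum(pdf(xs))          # unnormalized CDF on the grid
    cdf /= cdf[-1]
    # For each equally spaced uniform value u, find the smallest x with CDF(x) >= u.
    us = (np.arange(table_size) + 0.5) / table_size
    return xs[np.searchsorted(cdf, us)]

def generate(mapping, n, rng=np.random.default_rng(0)):
    """Transform uniform random indices into target-distributed samples by lookup."""
    idx = rng.integers(0, len(mapping), size=n)
    return mapping[idx]
```

Because the transform is just an indexed read, swapping the table contents at run time changes the target distribution - which is the per-cycle switching property the abstract highlights.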
  • An Architecture for Efficient Hardware Data Mining using Reconfigurable Computing Systems

    Page(s): 67 - 75
    PDF (229 KB) | HTML

    The Apriori algorithm is a fundamental correlation-based data mining kernel used in a variety of fields. The innovation in this paper is a highly parallel custom architecture implemented on a reconfigurable computing system. Using this "bitmapped CAM," the time and area required for executing the subset operations fundamental to data mining can be significantly reduced. The bitmapped CAM architecture implementation on an FPGA-accelerated high performance workstation provides a performance acceleration of orders of magnitude over software-based systems. The bitmapped CAM utilizes redundancy within the candidate data to efficiently store and process many subset operations simultaneously. The efficiency of this operation allows 140 units to process about 2,240 subset operations simultaneously. Using industry-standard benchmarking databases, we have tested the bitmapped CAM architecture and shown the platform provides a minimum of 24× (and often much higher) time performance advantage over the fastest software Apriori implementations.

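The subset operation at the heart of Apriori - testing whether a candidate itemset is contained in a transaction - reduces to a single bitwise AND when itemsets are encoded as bit vectors, which is what makes a CAM-style hardware realization attractive. A small software sketch with illustrative names (not the paper's design):

```python
def to_bitmap(itemset, item_index):
    """Encode an itemset as a bit vector over the item universe."""
    bits = 0
    for item in itemset:
        bits |= 1 << item_index[item]
    return bits

def count_support(candidates, transactions, item_index):
    """Subset test as a bitwise AND: c is a subset of t  iff  t & c == c."""
    t_bits = [to_bitmap(t, item_index) for t in transactions]
    counts = {}
    for cand in candidates:
        c = to_bitmap(cand, item_index)
        counts[cand] = sum((t & c) == c for t in t_bits)
    return counts
```

In hardware, each candidate's bit vector can occupy its own comparison unit, so all candidates are tested against a streamed transaction in parallel.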
  • Automatic Sliding Window Operation Optimization for FPGA-Based Computing Boards

    Page(s): 76 - 88
    PDF (271 KB) | HTML

    FPGA-based computing boards are frequently used as hardware accelerators for image processing algorithms based on sliding window operations (SWOs). SWOs are both computationally intensive and data intensive and benefit from hardware acceleration with FPGAs, especially for delay sensitive applications. The current design process requires that, for each specific application using SWOs with different window and image sizes, a detailed design must be completed before a realistic estimate of the achievable speedup can be obtained. We present an automated tool, sliding window operation optimization (SWOOP), that generates a speedup estimate for a high-performance design before the detailed implementation is complete. The achievable speedup is determined by the area of the FPGA, or, more often, the memory bandwidth to the processing elements. The memory bandwidth to each processing element is a combination of bandwidth to the FPGA and the efficient use of on-chip RAM as a data cache. SWOOP uses analytic techniques to automatically determine the number of parallel processing elements to implement on the FPGA, the assignment of input and output data to on-board memory, and the organization of data in on-chip memory to most effectively keep the processing elements busy. The result is a block layout of the final design, its memory architecture, and a measure of the achievable speedup. The results, compared to manual designs, show that the estimates obtained using SWOOP are very accurate.

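For reference, the basic structure of a sliding window operation - the computation SWOOP sizes hardware for - is shown below in a naive software form. The tool's contribution is deciding how many hardware copies of the inner body to instantiate and how to stage the data; this sketch only illustrates the access pattern:

```python
import numpy as np

def sliding_window_op(image, k, op=np.mean):
    """Apply op to every k x k window of the image (valid positions only)."""
    h, w = image.shape
    out = np.empty((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each output pixel reads a k x k neighborhood; adjacent windows
            # overlap in k*(k-1) pixels, which is the data reuse an on-chip
            # cache exploits.
            out[i, j] = op(image[i:i+k, j:j+k])
    return out
```

Each input pixel is read up to k² times, so keeping recently used rows in on-chip RAM (rather than refetching from board memory) is what determines the effective bandwidth per processing element.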
  • Enabling a Uniform Programming Model Across the Software/Hardware Boundary

    Page(s): 89 - 98
    PDF (174 KB) | HTML

    In this paper, we present hthreads, a unifying programming model for specifying application threads running within a hybrid CPU/FPGA system. Threads are specified from a single pthreads multithreaded application program and compiled to run on the CPU or synthesized to run on the FPGA. The hthreads system, in general, is unique within the reconfigurable computing community as it abstracts the CPU/FPGA components into a unified custom threaded multiprocessor architecture platform. To support the abstraction of the CPU/FPGA component boundary, we have created the hardware thread interface (HWTI) component that frees the designer from having to specify and embed platform specific instructions to form customized hardware/software interactions. Instead, the hardware thread interface supports the generalized pthreads API semantics, and allows passing of abstract data types between hardware and software threads. Thus the hardware thread interface provides an abstract, platform independent compilation target that enables thread and instruction-level parallelism across the software/hardware boundary.

  • A Type Architecture for Hybrid Micro-Parallel Computers

    Page(s): 99 - 110
    PDF (283 KB) | HTML

    Platform FPGAs that integrate sequential processors with a spatial fabric have become prevalent. While these hybrid architectures ease the burden of integrating sequential and spatial code in a single application, programming them, particularly their spatial fabrics, remains challenging. The difficulty arises in part from the lack of an agreed upon computational model and family of programming languages. In addition, moving algorithms into hardware is an arcane art far removed from the experience of most programmers. To address this challenge, we present a new type architecture, an abstract model analogous to the von Neumann machine for sequential computers, that can serve as common ground for algorithm designers, language designers, and hardware architects. We show that many parallel architectures, including platform FPGAs, are implementations of this type architecture. Using examples from a variety of application domains, we show how algorithms can be analyzed to estimate their performance on implementations of this type architecture. This analysis is done without having to delve into the details of any architecture in particular. Finally, we describe some of the common features of languages designed for expressing micro-parallelism, highlighting connections with the type architecture.

  • A Scalable FPGA-based Multiprocessor

    Page(s): 111 - 120
    PDF (340 KB) | HTML

    It has been shown that a small number of FPGAs can significantly accelerate certain computing tasks by up to two or three orders of magnitude. However, particularly intensive large-scale computing applications, such as molecular dynamics simulations of biological systems, underscore the need for even greater speedups to address relevant length and time scales. In this work, we propose an architecture for a scalable computing machine built entirely using FPGA computing nodes. The machine enables designers to implement large-scale computing applications using a heterogeneous combination of hardware accelerators and embedded microprocessors spread across many FPGAs, all interconnected by a flexible communication network. Parallelism at multiple levels of granularity within an application can be exploited to obtain the maximum computational throughput. By focusing on applications that exhibit a high computation-to-communication ratio, we narrow the extent of this investigation to the development of a suitable communication infrastructure for our machine, as well as an appropriate programming model and design flow for implementing applications. By providing a simple, abstracted communication interface with the objective of being able to scale to thousands of FPGA nodes, the proposed architecture appears to the programmer as a unified, extensible FPGA fabric. A programming model based on the MPI message-passing standard is also presented as a means for partitioning an application into independent computing tasks that can be implemented on our architecture. Finally, we demonstrate the first use of our design flow by developing a simple molecular dynamics simulation application for the proposed machine, which runs on a small platform of development boards.

  • A Reconfigurable Distributed Computing Fabric Exploiting Multilevel Parallelism

    Page(s): 121 - 130
    PDF (625 KB) | HTML

    This paper presents a novel reconfigurable data flow processing architecture that promises high performance by explicitly targeting both fine- and coarse-grained parallelism. This architecture is based on multiple FPGAs organized in a scalable direct network that is substantially more interconnect-efficient than currently used crossbar technology. In addition, we discuss several ancillary issues and propose solutions required to support this architecture and achieve maximal performance for general-purpose applications; these include supporting IP, mapping techniques, and routing policies that enable greater flexibility for architectural evolution and code portability.

  • A Multithreaded Soft Processor for SoPC Area Reduction

    Page(s): 131 - 142
    PDF (246 KB) | HTML

    The growth in size and performance of field programmable gate arrays (FPGAs) has compelled system-on-a-programmable-chip (SoPC) designers to use soft processors for controlling systems with large numbers of intellectual property (IP) blocks. Soft processors control IP blocks, which are accessed by the processor as peripheral devices and/or through custom instructions (CIs). In large systems, chip multiprocessors (CMPs) are used to execute many programs concurrently. When these programs require the use of the same IP blocks which are accessed as peripheral devices, they may have to stall waiting for their turn. In the case of CIs, the FPGA logic blocks that implement the CIs may have to be replicated for each processor. In both of these cases FPGA area is wasted, either by idle soft processors or the replication of CI logic blocks. This paper presents a multithreaded (MT) soft processor for area reduction in SoPC implementations. An MT processor allows multiple programs to access the same IP without the need for the replication of CI logic or of whole processors. We first designed a single-threaded processor that is instruction-set compatible with Altera's Nios II soft processor. Our processor is approximately the same size as the Nios II economy version, with equivalent performance. We augmented our processor to have 4-way interleaved multithreading capabilities. This paper compares the area usage and performance of the MT processor versus two CMP systems, using Altera's and our single-threaded processors, separately. Our results show that we can achieve an area savings of about 45% for the processor itself, in addition to the area savings due to not replicating CI logic blocks.

  • GraphStep: A System Architecture for Sparse-Graph Algorithms

    Page(s): 143 - 151
    PDF (175 KB) | HTML

    Many important applications are organized around long-lived, irregular sparse graphs (e.g., data and knowledge bases, CAD optimization, numerical problems, simulations). The graph structures are large, and the applications need regular access to a large, data-dependent portion of the graph for each operation (e.g., the algorithm may need to walk the graph, visiting all nodes, or propagate changes through many nodes in the graph). On conventional microprocessors, the graph structures exceed on-chip cache capacities, making main-memory bandwidth and latency the key performance limiters. To avoid this "memory wall," we introduce a concurrent system architecture for sparse graph algorithms that places graph nodes in small distributed memories paired with specialized graph processing nodes interconnected by a lightweight network. This gives us a scalable way to map these applications so that they can exploit the high-bandwidth and low-latency capabilities of embedded memories (e.g., FPGA Block RAMs). On typical spreading-activation queries on the ConceptNet Knowledge Base, a sample application, this translates into an order of magnitude speedup per FPGA compared to a state-of-the-art Pentium processor.

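The spreading-activation query pattern targeted here can be modelled as synchronous, step-wise propagation: in each step, every node forwards a decayed share of its activation along its out-edges, and receivers accumulate. This is a toy software model of the access pattern only - parameter names and the decay rule are illustrative assumptions, not the paper's system:

```python
def graph_step(graph, activation, decay=0.5):
    """One synchronous step: every node propagates a decayed value
    along all of its out-edges; receiving nodes accumulate."""
    nxt = {v: 0.0 for v in graph}
    for v, out_edges in graph.items():
        for u in out_edges:
            nxt[u] += activation.get(v, 0.0) * decay
    return nxt

def spreading_activation(graph, seeds, steps=3, decay=0.5):
    """Run several graph steps from a seed set, accumulating activation."""
    act = {v: 1.0 for v in seeds}
    for _ in range(steps):
        step = graph_step(graph, act, decay)
        for v, a in step.items():
            act[v] = act.get(v, 0.0) + a
    return act
```

Every step touches a data-dependent frontier of nodes and edges, so distributing node state across many small memories (rather than one cached main memory) is what removes the bandwidth bottleneck.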
  • Hardware/Software Integration for FPGA-based All-Pairs Shortest-Paths

    Page(s): 152 - 164
    PDF (267 KB) | HTML

    Field-programmable gate arrays (FPGAs) are being employed in high performance computing systems owing to their potential to accelerate a wide variety of long-running routines. Parallel FPGA-based designs often yield a very high speedup. Applications using these designs on reconfigurable supercomputers involve software on the system managing computation on the FPGA. To extract maximum performance from an FPGA design at the application level, it becomes necessary to minimize associated data movement costs on the system. We address this hardware/software integration challenge in the context of the all-pairs shortest-paths (APSP) problem in a directed graph. We employ a parallel FPGA-based design using a blocked algorithm to solve large instances of APSP. With appropriate design choices and optimizations, experimental results on the Cray XD1 show that the FPGA-based implementation sustains an application-level speedup of 15× over an optimized CPU-based implementation.

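The underlying APSP recurrence is the classic Floyd-Warshall update; the paper applies a blocked variant of it so that large instances can be streamed through on-chip memory. For orientation, the unblocked software baseline (illustrative, not the paper's blocked formulation):

```python
def floyd_warshall(dist):
    """In-place all-pairs shortest paths.
    dist[i][j] holds the edge weight, float('inf') if no edge, 0 on the diagonal."""
    n = len(dist)
    for k in range(n):
        for i in range(n):
            dik = dist[i][k]
            for j in range(n):
                # Relax path i -> k -> j against the current best i -> j.
                if dik + dist[k][j] < dist[i][j]:
                    dist[i][j] = dik + dist[k][j]
    return dist
```

The blocked variant partitions `dist` into tiles and processes one tile set per phase, which bounds the working set to what fits on-chip and makes the host-to-FPGA data movement pattern predictable.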
  • A Field Programmable RFID Tag and Associated Design Flow

    Page(s): 165 - 174
    PDF (336 KB) | HTML

    Current radio frequency identification (RFID) systems generally have long design times and low tolerance to changes in specification. This paper describes a field programmable, low-power active RFID tag, and its associated specification and automated design flow. RFID primitives to be supported by the tag are enumerated with RFID macros, or assembly-like descriptions of the tag operations. From these, the RFID preprocessor generates templates automatically. The behavior of each RFID primitive is specified using ANSI C in the template. The resulting file is compiled by the RFID compiler. A smart buffer sits between the transceiver and the tag controller, to detect whether incoming packets are intended for the tag. By doing so, the main controller may remain powered down to reduce power consumption. Two system-on-a-chip implementation strategies are presented: the first is a microprocessor-based system for which a C program is automatically generated; the second includes a block of low-power FPGA logic. The user supplied RFID logic in ANSI C is automatically converted into combinational VHDL by the RFID compiler. Based on a test program, the processors required 183, 43, and 19 µJ per transaction for StrongARM, XScale, and EISC processors, respectively. By replacing the processor with a Coolrunner II, the controller can be reduced to 1.11 nJ per transaction.

  • Combining Instruction Coding and Scheduling to Optimize Energy in System-on-FPGA

    Page(s): 175 - 184
    PDF (202 KB) | HTML

    In this paper, we investigate a combination of two techniques - instruction coding and instruction re-ordering - for optimizing energy in embedded processor control. We present the first practical, hardware implementation incorporating both approaches as part of a novel flow for automatic power-optimization of an FPGA soft processor. Our infrastructure generates customized processors and associated software, to enable power optimizations to be evaluated on multiple architectures and FPGA platforms. We evaluate using both software estimates of power and actual measurements from both low-cost and high-performance FPGAs. We generate over 150 optimized processor designs for two FPGA platforms, two processor architectures and six different benchmarks at four different clock rates and achieve consistent measured dynamic power reduction of up to 74%, without performance cost. Our results are applicable beyond processor optimization, quantifying the benefits of practical switching reduction and highlighting non-obvious pitfalls and complexities in dynamic power optimization.

  • Power Visualization, Analysis, and Optimization Tools for FPGAs

    Page(s): 185 - 194
    PDF (1225 KB) | HTML

    This paper introduces the low-power intelligent tool environment (LITE), an object oriented tool set designed for power visualization, analysis, and optimization. These tools leverage an established FPGA design environment, JHDL, that allows design logic and power utilization to be displayed, analyzed, and cross-probed simultaneously at a level of abstraction close to the design entry point. Circuit logic, FPGA architecture and power information are correlated to create accurate power prediction and estimation models. These models and power analysis tools can then be used to create power optimization algorithms. Power optimization algorithm development is supported through tools that query and sort circuit characteristics and insert constraints compliant with COTS CAD tools. These constraints can be used to guide the COTS placement and routing tools to optimize for power.

  • Systematic Characterization of Programmable Packet Processing Pipelines

    Page(s): 195 - 204
    PDF (234 KB) | HTML

    This paper considers the elaboration of custom pipelines for network packet processing, built upon flexible programmability of pipeline stage granularity. A systematic procedure for accurately characterizing throughput, latency, and FPGA resource requirements of different programmed pipeline variants is presented. This procedure may be exploited at design time, configuration time, or run time, to program pipeline architectures to meet specific networking application requirements. The procedure is illustrated using three case studies drawn from real-life packet processing at different levels of networking protocol. Detailed results are presented, demonstrating that the procedure estimates pipeline characteristics well, thus allowing rapid architecture space exploration prior to elaboration.

  • Packet Switched vs. Time Multiplexed FPGA Overlay Networks

    Page(s): 205 - 216
    PDF (206 KB) | HTML

    Dedicated, spatially configured FPGA interconnect is efficient for applications that require high throughput connections between processing elements (PEs) but with a limited degree of PE interconnectivity (e.g. wiring up gates and datapaths). Applications which virtualize PEs may require a large number of distinct PE-to-PE connections (e.g. using one PE to simulate 100s of operators, each requiring input data from thousands of other operators), but with each connection having low throughput compared with the PE's operating cycle time. In these highly interconnected conditions, dedicating spatial interconnect resources for all possible connections is costly and inefficient. Alternatively, we can time share physical network resources by virtualizing interconnect links, either by statically scheduling the sharing of resources prior to runtime or by dynamically negotiating resources at runtime. We explore the tradeoffs (e.g. area, route latency, route quality) between time-multiplexed and packet-switched networks overlaid on top of commodity FPGAs. We demonstrate modular and scalable networks which operate on a Xilinx XC2V6000-4 at 166 MHz. For our applications, time-multiplexed, offline scheduling offers up to a 63% performance increase over online, packet-switched scheduling for equivalent topologies. When applying designs to equivalent area, packet-switching is up to 2× faster for small area designs while time-multiplexing is up to 5× faster for larger area designs. When limited to the capacity of a XC2V6000, if all communication is known, time-multiplexed routing outperforms packet-switching; however when the active set of links drops below 40% of the potential links, packet-switched routing can outperform time-multiplexing.
