
2013 International Conference on Field-Programmable Technology (FPT)

Date: 9-11 Dec. 2013


Displaying Results 1 - 25 of 118
  • [Title page]

    Publication Year: 2013 , Page(s): 1
    PDF (1757 KB)
    Freely Available from IEEE
  • [Copyright notice]

    Publication Year: 2013 , Page(s): 1
    PDF (52 KB)
    Freely Available from IEEE
  • Message from the general chair and program co-chairs

    Publication Year: 2013 , Page(s): 1
    PDF (75 KB) | HTML
    Freely Available from IEEE
  • Organization

    Publication Year: 2013 , Page(s): 1 - 2
    PDF (77 KB)
    Freely Available from IEEE
  • Contents

    Publication Year: 2013 , Page(s): 1 - 6
    PDF (274 KB)
    Freely Available from IEEE
  • Keynote lectures [breaker page]

    Publication Year: 2013 , Page(s): 1
    PDF (35 KB)
    Freely Available from IEEE
  • Recent advances in die stacking and 3D FPGA

    Publication Year: 2013 , Page(s): 1
    PDF (67 KB)
    Freely Available from IEEE
  • Reconfigurable chip advantage compared with GPGPU from the compiler perspective

    Publication Year: 2013 , Page(s): 2
    Cited by:  Papers (1)
    PDF (68 KB)
    Freely Available from IEEE
  • Why Put FPGAs in your CPU socket?

    Publication Year: 2013 , Page(s): 3
    PDF (71 KB)
    Freely Available from IEEE
  • 1.1 Best paper candidate session

    Publication Year: 2013 , Page(s): 1
    PDF (35 KB)
    Freely Available from IEEE
  • Accelerating validation of time-triggered automotive systems on FPGAs

    Publication Year: 2013 , Page(s): 4 - 11
    Cited by:  Papers (1)
    PDF (480 KB) | HTML

    Automotive systems comprise a large number of networked safety-critical functions. Any design change or addition of new functionality must be rigorously tested to ensure that no performance or safety issues are introduced, and this consumes a significant amount of time. Validation should be conducted using a faithful representation of the system, and so typically a full subsystem is built for validation. We present a scalable scheme for emulating a complete cluster of automotive embedded compute units on an FPGA, with accelerated network communication using custom physical-level interfaces. With these interfaces, we can accelerate system emulation by 8× or more, while systematically exploring real-world issues such as jitter, network delays, and data corruption. By using the same communication infrastructure as in a real deployed system, this validation is closer to the requirements of standards compliance. This approach also enables hardware-in-the-loop (HIL) validation, allowing rapid prototyping of distributed functions, including changes in network topology and parameters and modification of time-triggered schedules, without physical hardware modification. We demonstrate the potential of the framework with an implementation on the Xilinx ML605 evaluation board that integrates six FlexRay automotive functions.

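    [Illustrative sketch] As a rough software illustration of the time-triggered idea in the entry above, the toy model below assigns functions to static FlexRay-like communication slots and injects jitter and corruption per slot, the kind of real-world effects the emulator is designed to explore. The schedule, slot and cycle lengths, and parameters are our assumptions, not values from the paper.

        import random

        # Toy model of a time-triggered (FlexRay-like) static segment: each function
        # owns one static slot per communication cycle; jitter and corruption are
        # injected to mimic the effects the emulation framework lets designers study.
        CYCLE_LEN_US = 5000      # length of one communication cycle in microseconds (assumed)
        SLOT_LEN_US = 50         # length of one static slot (assumed)
        SCHEDULE = ["brake", "steer", "engine", "gateway", "lidar", "body"]  # slot owners (assumed)

        def emulate(cycles=3, jitter_us=5, corruption_rate=0.01, seed=0):
            rng = random.Random(seed)
            log = []
            for c in range(cycles):
                cycle_start = c * CYCLE_LEN_US
                for slot, owner in enumerate(SCHEDULE):
                    nominal = cycle_start + slot * SLOT_LEN_US
                    actual = nominal + rng.uniform(0, jitter_us)   # injected jitter
                    corrupted = rng.random() < corruption_rate     # injected corruption
                    log.append((owner, nominal, round(actual, 1), corrupted))
            return log

        if __name__ == "__main__":
            for owner, nominal, actual, corrupted in emulate():
                flag = " CORRUPTED" if corrupted else ""
                print(f"{owner:8s} nominal={nominal:6d}us actual={actual:8.1f}us{flag}")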
  • Exploiting partially defective LUTs: Why you don't need perfect fabrication

    Publication Year: 2013 , Page(s): 12 - 19
    Cited by:  Papers (2)
    PDF (301 KB) | HTML

    Shrinking integrated circuit feature sizes lead to increased variation and higher defect rates. Prior work has shown how to tolerate the failure of entire LUTs and how to tolerate failures and high variation in interconnect. We show how to use LUTs even when they are partially defective - a form of fine-grained defect tolerance. We characterize the defect tolerance of a range of mapping strategies for defective LUTs, including LUT swapping within a cluster, input permutation, input polarity selection, defect-aware packing, and defect-aware placement. By tolerating partially defective LUTs, we show that, even without allocating dedicated spare LUTs, it is possible to achieve near-perfect yield with cluster-local remapping when roughly 1% of the LUT multiplexers fail to switch. With full, defect-aware placement, this can increase to 10-25% with just a few extra rows and columns. In contrast, substituting perfect LUTs from dedicated spares tolerates failure rates of only 0.01-0.05%.

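    [Illustrative sketch] The toy search below illustrates two of the mapping strategies named above - input permutation and input polarity selection - for a 4-input LUT, treating a defect as a configuration cell stuck at a known value (a simplification of the failed LUT multiplexers studied in the paper). The function, LUT size, and stuck cells are our assumptions.

        from itertools import permutations, product

        K = 4  # LUT input count (assumed)

        def lut_config(f, perm, polarity):
            """Configuration bits that make the physical LUT realize f when logical
            input i is wired to physical input perm[i], optionally inverted."""
            bits = []
            for addr in range(1 << K):
                # Recover the logical input vector corresponding to this physical address.
                x = [((addr >> perm[i]) & 1) ^ polarity[i] for i in range(K)]
                bits.append(f(x))
            return bits

        def remap_onto_defective_lut(f, stuck_cells):
            """Search input permutations and polarities for a configuration that agrees
            with every stuck configuration cell (address -> stuck value)."""
            for perm in permutations(range(K)):
                for polarity in product((0, 1), repeat=K):
                    cfg = lut_config(f, perm, polarity)
                    if all(cfg[a] == v for a, v in stuck_cells.items()):
                        return perm, polarity
            return None  # this LUT cannot host f; try another LUT in the cluster

        if __name__ == "__main__":
            f = lambda x: (x[0] & x[1]) | x[2]          # logical function to place (assumed)
            # Cells 0 and 15 stuck at 1: the identity wiring needs cell 0 = 0, so the
            # search has to invert an input to succeed.
            print(remap_onto_defective_lut(f, {0: 1, 15: 1}))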
  • Maximum flow algorithms for maximum observability during FPGA debug

    Publication Year: 2013 , Page(s): 20 - 27
    Cited by:  Papers (1)
    PDF (558 KB) | HTML

    Due to the ever-increasing density and complexity of integrated circuits, FPGA prototyping has become a necessary part of the design process. To enhance observability into these devices, designers commonly insert trace-buffers to record and expose the values on a small subset of internal signals during live operation to help root-cause errors. For dense designs, routing congestion restricts the number of signals that can be connected to these trace-buffers. In this work, we apply optimal network flow graph algorithms, a well-studied technique, to the problem of transporting circuit signals to embedded trace-buffers for observation. Specifically, we apply a minimum-cost maximum-flow algorithm to gain maximum signal observability with minimum total wirelength. We showcase our techniques both on theoretical FPGA architectures using VPR and on a Xilinx Virtex-6 device, finding that for the latter, over 99.6% of all spare RAM inputs can be reclaimed for tracing across four large benchmarks.

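    [Illustrative sketch] A toy version of the minimum-cost maximum-flow formulation above: signals and spare trace-buffer (RAM) inputs become graph nodes, unit-capacity edges model which signal can reach which input, and edge weights stand in for wirelength. networkx is used purely for illustration; the signal names, reachability, and costs are assumptions, not data from the paper.

        import networkx as nx

        signals = ["s0", "s1", "s2"]
        ram_inputs = ["r0", "r1", "r2"]
        # (signal, RAM input) -> estimated wirelength of that connection (assumed)
        reachable = {("s0", "r0"): 3, ("s0", "r1"): 7,
                     ("s1", "r1"): 2, ("s1", "r2"): 9,
                     ("s2", "r0"): 4, ("s2", "r2"): 5}

        G = nx.DiGraph()
        for s in signals:
            G.add_edge("SRC", s, capacity=1, weight=0)       # each signal traced at most once
        for r in ram_inputs:
            G.add_edge(r, "SINK", capacity=1, weight=0)      # each RAM input used at most once
        for (s, r), wirelength in reachable.items():
            G.add_edge(s, r, capacity=1, weight=wirelength)  # feasible signal-to-input routes

        flow = nx.max_flow_min_cost(G, "SRC", "SINK")        # maximize observability, minimize wire
        for s in signals:
            for r, f in flow[s].items():
                if f:
                    print(f"trace {s} on {r} (wirelength {reachable[(s, r)]})")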
  • The architecture and placement algorithm for a uni-directional routing based 3D FPGA

    Publication Year: 2013 , Page(s): 28 - 33
    PDF (1189 KB) | HTML

    The three-dimensional (3D) FPGA is a promising design trend that achieves significant performance improvements over conventional 2D FPGAs. The maturity of the uni-directional routing architecture, which achieves a 25% saving in area-delay product (ADP) over bi-directional routing architectures, has driven major vendors such as Xilinx and Altera to adopt it in their 2D products. However, few studies have explored performance-optimal uni-directional 3D routing architectures. In this paper, we propose and evaluate a novel uni-directional 3D routing architecture named UNI-3D. On the EDA side, we also propose an improved simulated annealing (SA)-based placement algorithm tailored to the uni-directional architecture, which alleviates the signal-propagation imbalance in the vertical channels that results from using a conventional bi-directional SA approach. Our simulation results show that the proposed architecture achieves up to 28.44% delay reduction and 26.21% planar channel width reduction compared with the baseline 2D uni-directional architecture. At the same time, the proposed SA algorithm improves the average vertical channel width by up to 16% compared to state-of-the-art works.

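    [Illustrative sketch] A toy simulated-annealing placement loop in the spirit of the SA algorithm mentioned above: blocks are placed on a small 3D grid and the cost mixes half-perimeter wirelength with a penalty on uneven vertical-channel usage. The grid, netlist, cost weights, and annealing schedule are all assumptions; this is not the paper's algorithm.

        import math
        import random

        GRID = (4, 4, 2)                      # x, y, layers (assumed)
        BLOCKS = [f"b{i}" for i in range(10)]
        NETS = [("b0", "b1"), ("b1", "b2"), ("b2", "b7"), ("b3", "b4"),
                ("b4", "b9"), ("b5", "b6"), ("b6", "b8"), ("b0", "b9")]

        def cost(place, lam=2.0):
            """Half-perimeter wirelength plus a penalty for imbalanced vertical-channel use."""
            hpwl, vert = 0, {}
            for a, b in NETS:
                (xa, ya, za), (xb, yb, zb) = place[a], place[b]
                hpwl += abs(xa - xb) + abs(ya - yb) + abs(za - zb)
                if za != zb:                                 # net crosses layers
                    vert[(xa, ya)] = vert.get((xa, ya), 0) + 1
            imbalance = max(vert.values(), default=0) - min(vert.values(), default=0)
            return hpwl + lam * imbalance

        def anneal(seed=1, temp=5.0, cooling=0.95, moves=200):
            rng = random.Random(seed)
            sites = [(x, y, z) for x in range(GRID[0]) for y in range(GRID[1]) for z in range(GRID[2])]
            rng.shuffle(sites)
            place = dict(zip(BLOCKS, sites))
            cur = cost(place)
            while temp > 0.01:
                for _ in range(moves):
                    a, b = rng.sample(BLOCKS, 2)
                    place[a], place[b] = place[b], place[a]          # propose a swap
                    new = cost(place)
                    if new <= cur or rng.random() < math.exp((cur - new) / temp):
                        cur = new                                    # accept
                    else:
                        place[a], place[b] = place[b], place[a]      # reject: undo swap
                temp *= cooling
            return cur, place

        if __name__ == "__main__":
            final_cost, _ = anneal()
            print("final placement cost:", final_cost)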
  • 1.2 Architecture

    Publication Year: 2013 , Page(s): 1
    PDF (35 KB)
    Freely Available from IEEE
  • COFFE: Fully-automated transistor sizing for FPGAs

    Publication Year: 2013 , Page(s): 34 - 41
    Cited by:  Papers (1)
    PDF (796 KB) | HTML

    In this paper, we present COFFE (Circuit Optimization For FPGA Exploration), a new fully-automated transistor sizing tool for FPGAs. Automated transistor-level CAD tools are an important part of the architecture exploration flow because they provide accurate area and delay estimates of low-level FPGA circuitry, which must be obtained for each architecture. We show that modeling transistors as linear resistances and capacitances, as has been done in previous FPGA transistor sizing tools, is highly inaccurate for fine-grained transistor-level design in advanced process nodes. Therefore, COFFE's transistor sizing algorithm maintains circuit non-linearities by relying exclusively on HSPICE simulations to measure delay. Area is estimated with a transistor size-based model that incorporates a number of improvements to enhance its accuracy in advanced process technologies compared with prior methods. In addition to more accurate area and delay estimation, COFFE considers more layout effects than prior published work by automatically accounting for transistor and wire loads, which are computed based on architectural parameters and layout area. This new FPGA transistor sizing tool requires only a few hours to produce high-quality transistor sizing results for an entire FPGA tile, a task that would normally take months of manual effort. We demonstrate COFFE's utility in FPGA architecture studies by investigating an important new architectural question at the logic-to-routing interface.

  • A case for hardened multiplexers in FPGAs

    Publication Year: 2013 , Page(s): 42 - 49
    PDF (299 KB) | HTML

    This paper presents a case for a hybrid configurable logic block that contains a mixture of LUTs and hardened multiplexers, with the goal of higher logic density and area reduction. Technology mapping optimizations, called MuxMap, that target the proposed architecture are implemented using a modified version of the mapper in the ABC logic synthesis tool. VPR is used to model the new hybrid configurable logic block and verify the post-place-and-route implementation. Multiple hybrid configurable logic block architectures with varying MUX:LUT ratios are evaluated across three benchmark suites with both Quartus II and Odin-II front-end RTL synthesis tools. Experimentally, we show that without any mapper optimizations we naturally save ~4% area post place and route, and that the MuxMap optimizations in ABC yield ~6% area reduction post place and route, while maintaining mapping depth, overall configurable logic block count, and routing demand.

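    [Illustrative sketch] The small check below captures the kind of structure a mapper such as the MuxMap flow described above would look for: whether a 3-input function can be realized directly by a hardened 2:1 multiplexer (output = a if s else b) under some assignment of its inputs to the mux pins. The example functions are assumptions for illustration.

        from itertools import permutations, product

        def as_mux(f):
            """Return a (select, a, b) input assignment if f(x0, x1, x2) is a 2:1 mux, else None."""
            for s, a, b in permutations(range(3)):
                if all(f(*x) == (x[a] if x[s] else x[b]) for x in product((0, 1), repeat=3)):
                    return s, a, b
            return None

        if __name__ == "__main__":
            mux_like = lambda x0, x1, x2: x1 if x0 else x2                   # maps to a hardened mux
            majority = lambda x0, x1, x2: (x0 & x1) | (x1 & x2) | (x0 & x2)  # needs a LUT
            print("mux_like ->", as_mux(mux_like))
            print("majority ->", as_mux(majority))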
  • Debugging processors with advanced features by reprogramming LUTs on FPGA

    Publication Year: 2013 , Page(s): 50 - 57
    PDF (630 KB) | HTML

    In this paper, we propose an automated method for debugging and rectifying logical bugs in processors that are implemented on FPGAs. Our method is based on preserving the current circuit topology and correcting bugs by changing only the contents of LUTs, without any modification to the wiring. As a result, correcting the bugs does not require re-synthesis, which can be very time-consuming for complex processors due to possible timing closure problems. Because the topology of the circuit is preserved, correcting the bugs does not affect the timing of the circuit. In the design phase, we may add additional LUTs or additional inputs to LUTs in the original circuit, so that we can use them in the debugging and rectification phase. After a bug is found, we first try to identify candidate signals as well as the changes required to correct their behavior. This is achieved using symbolic simulation and equivalence checking between an instruction-set architecture model of the processor and its erroneous model at the micro-architecture level. Then, we try to map the corrected functionality onto the existing LUT topology. This is realized by a novel method that formulates the problem as a QBF (Quantified Boolean Formula) problem and solves it by repeatedly and incrementally applying ordinary SAT solvers instead of QBF solvers, using ideas from the CEGAR (Counterexample-Guided Abstraction Refinement) paradigm. We show the effectiveness as well as the efficiency of our method by correcting bugs in two complex out-of-order superscalar processors with a timing-error recovery mechanism.

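    [Illustrative sketch] The toy loop below illustrates the exists-forall CEGAR idea described above for re-synthesizing LUT contents: a candidate configuration is synthesized from the counterexamples collected so far, then verified against the specification, and each failed verification adds a new counterexample. A real flow would delegate both steps to incremental SAT; here they are done by direct evaluation, and the circuit, wiring, and specification are assumptions.

        from itertools import product

        N_IN = 3                                   # primary inputs (assumed)
        spec = lambda x: (x[0] ^ x[1]) & x[2]      # intended behavior (assumed)

        def circuit(cfg, x):
            """Erroneous design: the output comes from a 3-input LUT fed by the primary
            inputs; only the LUT configuration bits cfg may be changed."""
            addr = x[0] | (x[1] << 1) | (x[2] << 2)
            return (cfg >> addr) & 1

        def synthesize(counterexamples):
            """Pick any cfg consistent with the behavior required on every recorded
            counterexample (the 'exists' step)."""
            cfg = 0
            for x in counterexamples:
                addr = x[0] | (x[1] << 1) | (x[2] << 2)
                cfg |= spec(x) << addr
            return cfg

        def verify(cfg):
            """Exhaustively look for an input where design and spec differ (the
            'forall' step).  Returns a counterexample or None."""
            for x in product((0, 1), repeat=N_IN):
                if circuit(cfg, x) != spec(x):
                    return x
            return None

        def cegar():
            counterexamples = []
            while True:
                cfg = synthesize(counterexamples)
                cex = verify(cfg)
                if cex is None:
                    return cfg, len(counterexamples)
                counterexamples.append(cex)

        if __name__ == "__main__":
            cfg, iters = cegar()
            print(f"repaired LUT contents: {cfg:08b} after {iters} refinements")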
  • 1.3 FPGA applications I

    Publication Year: 2013 , Page(s): 1
    PDF (35 KB)
    Freely Available from IEEE
  • Virtual-to-Physical address translation for an FPGA-based interconnect with host and GPU remote DMA capabilities

    Publication Year: 2013 , Page(s): 58 - 65
    Cited by:  Papers (1)
    PDF (1729 KB) | HTML

    We developed a custom FPGA-based Network Interface Controller named APEnet+ aimed at GPU-accelerated clusters for High Performance Computing. The card exploits the peer-to-peer capabilities (GPUDirect RDMA) of the latest NVIDIA GPGPU devices and the RDMA paradigm to perform fast direct communication between computing nodes, offloading network task execution from the host CPU. In this work we focus on the implementation of a virtual-to-physical address translation mechanism using the FPGA's embedded soft-processor. Address management is the most demanding task on the NIC receiving side - we estimated up to 70% of the μC load - making it the main culprit for the data bottleneck. To improve the performance of this task, and hence data transfer over the network, we added a specialized hardware logic block acting as a Translation Lookaside Buffer. This block makes use of a custom Content Addressable Memory implementation designed for scalability and speed. We present detailed measurements to demonstrate the benefits of introducing such custom logic: a substantial address translation latency reduction (from a measured value of 1.9 μs to 124 ns) and a performance enhancement of both host-bound and GPU-bound data transfers (up to ~60% bandwidth increase) in given message size ranges.

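    [Illustrative sketch] A toy software model of the receive-path translation described above: a small TLB (an associative, CAM-like table) answers hot translations quickly, and only misses fall back to the slow soft-processor page-table walk. Page size, TLB depth, and the page table contents are assumptions, not APEnet+ parameters.

        from collections import OrderedDict

        PAGE_SHIFT = 12                       # 4 KiB pages (assumed)

        class ToyTLB:
            def __init__(self, entries=8):
                self.entries = entries
                self.cam = OrderedDict()      # virtual page -> physical page, in LRU order
                self.hits = self.misses = 0

            def translate(self, vaddr, page_table):
                vpage, offset = vaddr >> PAGE_SHIFT, vaddr & ((1 << PAGE_SHIFT) - 1)
                if vpage in self.cam:                       # CAM hit: fast path
                    self.hits += 1
                    self.cam.move_to_end(vpage)             # refresh LRU position
                    ppage = self.cam[vpage]
                else:                                       # miss: slow soft-processor walk
                    self.misses += 1
                    ppage = page_table[vpage]
                    self.cam[vpage] = ppage
                    if len(self.cam) > self.entries:
                        self.cam.popitem(last=False)        # evict least recently used
                return (ppage << PAGE_SHIFT) | offset

        if __name__ == "__main__":
            page_table = {v: v + 0x100 for v in range(64)}  # toy mapping (assumed)
            tlb = ToyTLB()
            for vaddr in [0x1000, 0x1008, 0x2000, 0x1010, 0x3000, 0x2004]:
                print(hex(vaddr), "->", hex(tlb.translate(vaddr, page_table)))
            print(f"hits={tlb.hits} misses={tlb.misses}")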
  • Accelerating iterative algorithms with asynchronous accumulative updates on FPGAs

    Publication Year: 2013 , Page(s): 66 - 73
    PDF (364 KB) | HTML

    Iterative algorithms represent a pervasive class of data mining, web search, and scientific computing applications. In iterative algorithms, a final result is derived by performing repetitive computations on an input data set. Existing techniques to parallelize such algorithms typically use software frameworks such as MapReduce and Hadoop to distribute data for an iteration across multiple CPU-based workstations in a cluster and collect per-iteration results. These platforms are marked by the need to synchronize data computations at iteration boundaries, impeding system performance. In this paper, we demonstrate that FPGAs in distributed computing systems can serve a vital role in breaking this synchronization barrier with the help of asynchronous accumulative updates. These updates allow intermediate results to be accumulated for numerous data points without iteration-based barriers, letting individual nodes in a cluster independently make progress towards the final outcome. Computation is dynamically prioritized to accelerate algorithm convergence. A general class of iterative algorithms has been implemented on a cluster of four FPGAs. A speedup of 7× is achieved over an implementation of asynchronous accumulative updates on a general-purpose CPU. The system offers up to 154× speedup versus a standard Hadoop-based CPU workstation, and performance improves further as FPGAs are clustered.

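    [Illustrative sketch] A single-node toy of asynchronous accumulative updates, using PageRank as the iterative algorithm: every vertex accumulates a pending delta, the vertex with the largest pending delta is processed next, and no iteration-wide barrier is ever needed. The graph, damping factor, and threshold are assumptions; the paper's FPGA cluster distributes this idea across nodes.

        import heapq

        GRAPH = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}      # vertex -> out-neighbors (assumed)
        DAMPING, EPS = 0.85, 1e-10

        def accumulative_pagerank(graph):
            rank = {v: 0.0 for v in graph}
            delta = {v: 1.0 - DAMPING for v in graph}       # initial accumulated deltas
            heap = [(-d, v) for v, d in delta.items()]      # max-heap via negated priority
            heapq.heapify(heap)
            while heap:
                _, v = heapq.heappop(heap)
                d = delta[v]
                if d <= EPS:
                    continue                                # stale or negligible entry
                rank[v] += d                                # fold the delta into the result
                delta[v] = 0.0
                share = DAMPING * d / len(graph[v])
                for u in graph[v]:                          # propagate to neighbors
                    delta[u] += share
                    heapq.heappush(heap, (-delta[u], u))    # re-prioritize the receiver
            return rank

        if __name__ == "__main__":
            for v, r in sorted(accumulative_pagerank(GRAPH).items()):
                print(f"vertex {v}: {r:.6f}")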
  • High throughput, tree automata based XML processing using FPGAs

    Publication Year: 2013 , Page(s): 74 - 81
    PDF (315 KB) | HTML

    A novel and efficient approach to XML processing using FPGAs, based upon the sound theoretical formalism of tree automata, is presented. The approach enables the key tasks of schema validation and querying to be performed in a unified manner. A remarkably simple implementation of a tree automaton in hardware, as a pair of interacting automata with the states of one forming the input to the other, is described. The implementation can process one XML token in at most two clock cycles. Moreover, this throughput is achieved for any schema grammar or query (that can be accommodated in the state tables), independent of its complexity. Further, the use of tree automata offers greater expressive power for specifying schemas as well as queries than previous hardware-based approaches. A detailed performance evaluation demonstrates the significant throughput improvements of the proposed tree-automata-based approach compared with software as well as earlier FPGA-based approaches. The implementation of XML schema validation on a mid-range FPGA provides sustained throughput from 1.7 to 3.1 Gbps, yielding a five to ten times speedup over an efficient software approach. Due to the very compact implementation, multiple instances can be deployed to further improve throughput significantly.

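    [Illustrative sketch] The toy validator below conveys the flavor of streaming schema validation with a stack-coupled finite control - a simplification of the pair of interacting automata described above. The schema (which children may appear under which parent) and the token stream are assumptions.

        SCHEMA = {                       # parent element -> allowed child elements (assumed)
            None: {"library"},           # document root
            "library": {"book"},
            "book": {"title", "author"},
            "title": set(),
            "author": set(),
        }

        def validate(tokens):
            stack = []                                       # stack of currently open elements
            for kind, name in tokens:                        # token = ("open"|"close", element)
                if kind == "open":
                    parent = stack[-1] if stack else None
                    if name not in SCHEMA.get(parent, set()):
                        return False, f"<{name}> not allowed under <{parent}>"
                    stack.append(name)
                elif kind == "close":
                    if not stack or stack[-1] != name:
                        return False, f"unexpected </{name}>"
                    stack.pop()
            return (not stack), "ok" if not stack else "unclosed elements remain"

        if __name__ == "__main__":
            stream = [("open", "library"), ("open", "book"), ("open", "title"),
                      ("close", "title"), ("open", "author"), ("close", "author"),
                      ("close", "book"), ("close", "library")]
            print(validate(stream))                          # accepted
            print(validate([("open", "book")]))              # rejected: book not allowed at root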
  • Transparent FPGA based device for SQL DDoS mitigation

    Publication Year: 2013 , Page(s): 82 - 89
    PDF (480 KB) | HTML

    A Distributed Denial-of-Service (DDoS) attack is an attempt to make a computer resource unavailable to its intended users. Typically, a large number of bots are triggered by an attacker simultaneously to create a huge load on a web server and bring it down. However, when a web server processes SQL queries, their large resource requirements mean that even a small number of queries from a smaller set of bots can create a huge load on the server. Such sophisticated application-layer attacks go undetected by the network security solutions deployed today. Therefore, we propose an SQL DDoS Mitigator device that focuses on preventing such attacks targeting SQL database resources. It can parse packets at line speed, with a maximum latency of 20 μs for detecting HTTP GET packets with embedded SQL queries. The query pattern information for requester IP addresses is stored in a red-black tree data structure. Clients exceeding the server load limit, which is set dynamically based on the server state, are redirected to a CAPTCHA server to identify bots. IPs confirmed as bots are blacklisted for a configurable timeout period. The complete system, except the CAPTCHA server, is built on the Xilinx Virtex-II Pro 50 FPGA-based NetFPGA-1G platform. The device achieved a throughput of 400 kilopackets/s in a 1 Gbps network.

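    [Illustrative sketch] A toy model of the mitigation policy described above: SQL-bearing requests are counted per source IP (the hardware keeps these in a red-black tree; a plain dict stands in here), clients that exceed a load-dependent limit are redirected to a CAPTCHA check, and confirmed bots are blacklisted for a timeout period. All thresholds, the load policy, and helper names are assumptions.

        import time

        BLACKLIST_TIMEOUT_S = 300    # configurable timeout period (assumed)

        class ToyMitigator:
            def __init__(self):
                self.query_counts = {}        # ip -> SQL queries seen in the current window
                self.blacklist = {}           # ip -> blacklist expiry timestamp

            def limit_for(self, server_load):
                """Allowed SQL queries per IP, tightened as server load rises (assumed policy)."""
                return 100 if server_load < 0.5 else 20 if server_load < 0.8 else 5

            def handle(self, ip, has_sql_query, server_load, now=None):
                now = now if now is not None else time.time()
                if self.blacklist.get(ip, 0) > now:
                    return "DROP"                             # blacklisted bot
                if not has_sql_query:
                    return "FORWARD"                          # plain traffic passes through
                self.query_counts[ip] = self.query_counts.get(ip, 0) + 1
                if self.query_counts[ip] > self.limit_for(server_load):
                    return "REDIRECT_TO_CAPTCHA"              # suspicious: verify the client
                return "FORWARD"

            def confirm_bot(self, ip, now=None):
                now = now if now is not None else time.time()
                self.blacklist[ip] = now + BLACKLIST_TIMEOUT_S

        if __name__ == "__main__":
            m = ToyMitigator()
            for i in range(7):
                print(i, m.handle("10.0.0.9", has_sql_query=True, server_load=0.9))
            m.confirm_bot("10.0.0.9")
            print("after failed CAPTCHA:", m.handle("10.0.0.9", has_sql_query=True, server_load=0.9))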
  • 1.4 Power-aware and dynamically reconfigurable systems

    Publication Year: 2013 , Page(s): 1
    PDF (35 KB)
    Freely Available from IEEE
  • Discrete event system specification, synthesis, and optimization of low-power FPGA-based embedded systems

    Publication Year: 2013 , Page(s): 98 - 105
    PDF (1237 KB) | HTML

    Discrete event system specification (DEVS) has been widely used within modeling and simulation to design, verify, and implement complex reactive systems. DEVS provides a robust formalism for designing systems using event-driven, state-based models in which timing information is explicitly defined. In this paper, we present an overview of a DEVS-based hardware design, synthesis, and optimization methodology. Within this approach, hardware DEVS (HDEVS) specifications can be synthesized to hardware, where the event-driven model and explicit timing allow for an efficient hardware realization using a globally asynchronous, locally synchronous design approach. Additionally, we present an optimization method for reducing power consumption through optimal frequency mapping and clock gating of individual components while ensuring that system latency constraints are met. We further demonstrate the resulting power consumption savings for activity-driven forest fire and asthma health management applications targeting two low-power FPGA devices.

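    [Illustrative sketch] A toy DEVS-style atomic model and simulation loop, showing the event-driven, explicitly timed formalism that the methodology above synthesizes to hardware: a time-advance function, an external transition on inputs, and an internal transition that produces output. The sensor model and all times are assumptions.

        INFINITY = float("inf")

        class SensorModel:
            """Toy atomic model: wakes on an external trigger, reports after a fixed
            processing delay, then goes back to sleep."""
            def __init__(self):
                self.phase = "idle"
                self.sigma = INFINITY          # time remaining until the next internal event

            def time_advance(self):
                return self.sigma

            def external(self, elapsed, event):
                if self.phase == "idle" and event == "trigger":
                    self.phase, self.sigma = "measuring", 2.0     # 2 time units to measure (assumed)
                else:
                    self.sigma -= elapsed                          # keep the remaining time

            def output(self):
                return "reading" if self.phase == "measuring" else None

            def internal(self):
                self.phase, self.sigma = "idle", INFINITY          # go back to sleep

        def simulate(model, external_events, end_time=20.0):
            t, events = 0.0, sorted(external_events)               # (time, payload) pairs
            while t < end_time:
                next_internal = t + model.time_advance()
                next_external = events[0][0] if events else INFINITY
                if min(next_internal, next_external) == INFINITY:
                    break
                if next_internal <= next_external:                 # internal event fires
                    t = next_internal
                    print(f"t={t:4.1f}  output: {model.output()}")
                    model.internal()
                else:                                              # external event arrives
                    elapsed, (t, payload) = events[0][0] - t, events.pop(0)
                    model.external(elapsed, payload)

        if __name__ == "__main__":
            simulate(SensorModel(), [(1.0, "trigger"), (7.5, "trigger")])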