2009 5th Southern Conference on Programmable Logic (SPL)

Date 1-3 April 2009

  • Authors index

    Page(s): 229 - 231
  • [Front cover]

    Page(s): c1
  • [Copyright notice]

    Page(s): i
  • Organizing Committee

    Page(s): iv - v
  • PAM Map: An architecture-independent logic block mapping algorithm for SRAM-based FPGAs

    Page(s): 15 - 19

    In this paper, we consider the general problem of mapping a given logic circuit onto an SRAM-based FPGA whose programmable logic blocks may have arbitrary architectures. We formulate the problem as a graph matching problem and present an architecture-independent algorithm for it. The algorithm also achieves a best-case area saving of 4% compared with architecture-dependent methods.

  • FPGA/soft-processor based real-time object tracking system

    Page(s): 33 - 37

    This paper presents a low-cost FPGA-based solution for real-time tracking of moving objects. A specialized architecture is presented, built around a soft RISC processor capable of running a kernel-based mean-shift tracking algorithm. The system includes a frame grabber unit that stores video frames in DDR RAM using direct memory access, a video display unit that monitors the tracking statistics, and a soft processor that runs the mean-shift tracking algorithm within the required time constraint.
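
The location update at the heart of mean-shift tracking can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that a per-pixel weight map has already been computed (in kernel-based trackers these weights typically come from comparing target and candidate color histograms). It also uses a flat square window, for which each mean-shift step reduces to the weighted centroid of the pixels in the window. The function name and parameters are hypothetical.

```python
import math


def mean_shift_track(weights, start, half_width, max_iter=20, eps=0.5):
    """Move a square window to the local weighted centroid of a 2-D weight map.

    weights:    list of rows, weights[y][x] >= 0 (assumed precomputed)
    start:      (x, y) initial window centre, e.g. the previous frame's result
    half_width: half the side length of the square window, in pixels
    """
    height, width = len(weights), len(weights[0])
    cx, cy = float(start[0]), float(start[1])
    for _ in range(max_iter):
        # Clip the window to the image borders.
        x0 = max(0, round(cx) - half_width)
        x1 = min(width - 1, round(cx) + half_width)
        y0 = max(0, round(cy) - half_width)
        y1 = min(height - 1, round(cy) + half_width)
        total = sum_x = sum_y = 0.0
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                w = weights[y][x]
                total += w
                sum_x += w * x
                sum_y += w * y
        if total == 0.0:
            break  # no evidence of the target inside the window
        # Mean-shift step with a flat kernel: jump to the weighted centroid.
        nx, ny = sum_x / total, sum_y / total
        shift = math.hypot(nx - cx, ny - cy)
        cx, cy = nx, ny
        if shift < eps:
            break  # converged
    return cx, cy
```

Each iteration needs only running sums of w, w·x and w·y over the window, which keeps the per-frame work small.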

  • Concurrent calculations on reconfigurable logic devices applied to the analysis of video images

    Page(s): 109 - 114

    This paper presents the design and FPGA implementation of an algorithm that computes the similarity between neighboring frames (photograms) of a video sequence using luminance information. Exploiting the well-known flexibility of reconfigurable logic devices, we designed a hardware implementation of the algorithm, which is used in video segmentation and indexing. The experimental work established a tradeoff between concurrent sequential resources and functional blocks in order to achieve maximum operating speed with minimum silicon area. To evaluate the efficiency of the system, we compared the performance of the hardware solution with that of software calculations on general-purpose processors with and without the MMX extension.
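
The abstract does not state which luminance-based similarity measure is used. As an illustration only, the sketch below assumes one common choice. Each pixel is converted to luma with the ITU-R BT.601 weights, a luminance histogram is built per frame, and neighboring frames are scored by normalized histogram intersection (1.0 for identical histograms, 0.0 for disjoint ones). A low score between consecutive frames would then suggest a shot boundary for segmentation. All names are hypothetical.

```python
def luma(r, g, b):
    """ITU-R BT.601 luma from 8-bit RGB components."""
    return 0.299 * r + 0.587 * g + 0.114 * b


def luma_histogram(frame, bins=16):
    """Histogram of pixel luminance; frame is a list of (r, g, b) tuples."""
    hist = [0] * bins
    for r, g, b in frame:
        index = min(int(luma(r, g, b) * bins / 256), bins - 1)
        hist[index] += 1
    return hist


def frame_similarity(frame_a, frame_b, bins=16):
    """Normalized histogram intersection of two equally sized frames."""
    if len(frame_a) != len(frame_b) or not frame_a:
        raise ValueError("frames must be non-empty and the same size")
    hist_a = luma_histogram(frame_a, bins)
    hist_b = luma_histogram(frame_b, bins)
    overlap = sum(min(a, b) for a, b in zip(hist_a, hist_b))
    return overlap / len(frame_a)
```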

  • Automatic VHDL generation for solving rotation and scale-invariant template matching in FPGA

    Page(s): 21 - 26

    Template matching is a classical problem in computer vision. It consists of detecting the presence of a given template in a digital image. The task becomes considerably more complex when invariance to rotation, scale, translation, brightness and contrast (RSTBC) is required. A novel RSTBC-invariant robust template matching algorithm named Ciratefi was recently proposed; however, its execution on a conventional computer takes several seconds. Moreover, implementing its general version in hardware is difficult because it has many adjustable parameters. This paper proposes software that automatically generates compilable VHDL (VHSIC Hardware Description Language) modules implementing Ciratefi on Field Programmable Gate Array (FPGA) devices. The proposed solution reduces the time to process a frame from 7 s (on a 3 GHz PC) to 1.06 ms. This performance, more than is required for a real-time system, may lead to cost-effective, high-performance computer vision co-processing systems.

  • Design of asynchronous MSP430 microprocessor using Balsa back-end retargeting

    Page(s): 223 - 228

    Balsa, developed by the Advanced Processor Technology (APT) Group at the University of Manchester, is a robust design environment that provides both a framework for synthesizing asynchronous hardware systems and a language for describing them. In this paper, a design of the MSP430 microprocessor in the Balsa language and the functional verification of its controller are presented. Back-end retargeting is performed as part of the Balsa design methodology: through this procedure, a new technology library, including an FPGA cell library, is incorporated into the Balsa design environment. Moreover, the circuit area is analyzed and reduced for different implementation styles by replacing Balsa helper cells with standard cells of the target library.

  • Fast radix 2^k dividers for FPGAs

    Page(s): 115 - 122

    In this paper we present a radix r = 2^k divider for fixed-point operands. The divider works in radix r = 2^k, producing k bits at each iteration. The proposed digit-recurrence algorithm has two architectures: the first for general hardware implementation, and the second optimized for configurable logic (FPGAs). Results show a speedup of more than three times with respect to a classical non-restoring divider implemented in Xilinx devices. Additionally, a throughput-latency-area comparison of pipelined and sequential divider implementations is presented.
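
The digit-recurrence idea of retiring k quotient bits per iteration can be illustrated with a simple restoring radix-2^k loop for unsigned integers. This is a sketch of the general principle only, not the paper's algorithm or its FPGA-optimized architecture. In hardware, the digit-selection loop below would typically be replaced by parallel comparisons against the multiples q·divisor. The function name is hypothetical.

```python
def radix2k_divide(dividend, divisor, n_bits, k):
    """Unsigned division that retires k quotient bits per iteration.

    dividend must fit in n_bits, and n_bits must be a multiple of k.
    Returns (quotient, remainder), matching divmod for valid inputs.
    """
    if divisor <= 0:
        raise ZeroDivisionError("divisor must be positive")
    if n_bits % k or dividend < 0 or dividend >> n_bits:
        raise ValueError("invalid operand width or dividend")
    radix = 1 << k
    quotient = remainder = 0
    for step in range(n_bits // k):
        # Bring down the next k-bit digit of the dividend (MSB first).
        shift = n_bits - k * (step + 1)
        remainder = remainder * radix + ((dividend >> shift) & (radix - 1))
        # Invariant: remainder < divisor * radix, so the digit lies in [0, radix - 1].
        # Select the largest q with q * divisor <= remainder.
        q = radix - 1
        while q * divisor > remainder:
            q -= 1
        remainder -= q * divisor
        quotient = quotient * radix + q
    return quotient, remainder
```

For example, `radix2k_divide(1000, 7, 12, 4)` returns `(142, 6)` after three iterations, where a radix-2 loop would need twelve.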

  • Design space exploration of PRESENT implementations for FPGAs

    Page(s): 141 - 145

    In this paper we investigate the performance of the block cipher PRESENT on FPGAs. We provide implementation results for a design optimized for efficiency (i.e. throughput per slice) and compare them with other block ciphers. Although PRESENT was originally designed with a minimal hardware footprint in mind, our results show that it is also well suited for high-speed and high-throughput applications. Its hardware efficiency, i.e. the throughput per slice, is especially noteworthy.
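
For readers unfamiliar with PRESENT, the sketch below models the variant with a 64-bit block and an 80-bit key (PRESENT-80) in software, following the published specification: 31 rounds of round-key addition, a 4-bit S-box layer and a bit permutation, followed by a final key addition. It is a readability aid and does not reflect the efficiency-optimized FPGA design in the paper. The function names are illustrative. The small 4-bit S-box and the permutation, which costs only wiring in hardware, are what make the cipher cheap to implement.

```python
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
MASK80 = (1 << 80) - 1


def s_box_layer(state):
    """Apply the 4-bit S-box to each of the 16 nibbles of a 64-bit state."""
    out = 0
    for i in range(16):
        out |= SBOX[(state >> (4 * i)) & 0xF] << (4 * i)
    return out


def p_layer(state):
    """Move bit i to position 16*i mod 63 (bit 63 stays in place)."""
    out = 0
    for i in range(64):
        dest = 63 if i == 63 else (16 * i) % 63
        out |= ((state >> i) & 1) << dest
    return out


def round_keys_80(key):
    """Derive the 32 round keys of PRESENT-80 from an 80-bit key."""
    reg = key & MASK80
    keys = []
    for counter in range(1, 33):
        keys.append(reg >> 16)  # round key = leftmost 64 bits of the register
        if counter == 32:
            break
        reg = ((reg << 61) | (reg >> 19)) & MASK80  # rotate left by 61
        reg = (SBOX[reg >> 76] << 76) | (reg & ((1 << 76) - 1))  # S-box on top nibble
        reg ^= counter << 15  # round counter into bits 19..15
    return keys


def present80_encrypt(plaintext, key):
    """Encrypt one 64-bit block with an 80-bit key."""
    keys = round_keys_80(key)
    state = plaintext
    for i in range(31):
        state = p_layer(s_box_layer(state ^ keys[i]))
    return state ^ keys[31]
```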

  • Decimal addition in FPGA

    Page(s): 101 - 108

    This paper presents a study of classical BCD adders, from which a carry-chain adder is redesigned to fit Xilinx FPGAs. Some new concepts are presented for computing the carry-propagate (P) and carry-generate (G) functions in order to optimize the carry chain. Several alternative designs are then presented with their corresponding timing and area figures. For comparison, a straightforward decimal ripple-carry adder and an FPGA-optimized base-2 adder covering the same range are also implemented. Results for large operands show that the decimal adder is faster than an equivalent binary implementation; furthermore, the coding/decoding steps are no longer needed.
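
Such adders build on classical per-digit decimal carry logic. A BCD digit generates a decimal carry when its binary sum is at least 10, and propagates an incoming carry when its sum is exactly 9. The sketch below models a decimal ripple-carry adder with these P and G signals and the usual correction step. It does not reproduce the paper's carry-chain formulation for Xilinx FPGAs, and the function name is hypothetical.

```python
def bcd_add(a_digits, b_digits, carry_in=0):
    """Add two equal-length BCD numbers given as digit lists, least significant first.

    Returns (sum_digits, carry_out).
    """
    result = []
    carry = carry_in
    for a, b in zip(a_digits, b_digits):
        s = a + b                     # binary sum of two BCD digits, 0..18
        generate = s >= 10            # G: digit produces a carry on its own
        propagate = s == 9            # P: digit passes an incoming carry on
        carry_out = generate or (propagate and carry == 1)
        digit = s + carry
        if carry_out:
            digit -= 10               # same as adding 6 modulo 16 in 4-bit logic
        result.append(digit)
        carry = int(carry_out)
    return result, carry
```

In this formulation the carry into each digit depends only on the P and G signals and the previous carry. This is the structure that an FPGA's dedicated carry chain can exploit.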

  • Performance analysis of double digit decimal multiplier on various FPGA logic families

    Page(s): 165 - 170

    Decimal multiplication is an integral part of financial, commercial, and Internet-based computations. This paper presents a novel double digit decimal multiplication (DDDM) technique that performs two digit multiplications simultaneously in one clock cycle. The design offers low latency and high throughput: multiplying two n-digit operands to produce a 2n-digit product takes [(n/2) + 1] cycles. The paper presents area and delay comparisons for 7-digit, 16-digit and 34-digit double digit decimal multipliers on different families of Xilinx, Altera, Actel and QuickLogic FPGAs. The multipliers can be extended to support decimal floating-point multiplication for the IEEE P754 standard.

  • T-NDPack: Timing-driven non-uniform depopulation based clustering

    Page(s): 9 - 14

    Low-cost FPGAs have a number of configurable logic blocks (CLBs) comparable to that of resource-rich FPGAs, but far fewer routing tracks. This makes it difficult for CAD tools to map a circuit onto these devices successfully and optimally. Instead of switching to resource-rich FPGAs, designers can employ depopulation-based clustering, which underuses CLBs and hence improves routability by spreading the logic over the architecture. However, all depopulation-based clustering algorithms to date increase critical path delay. In this paper, we present T-NDPack, a timing-driven non-uniform depopulation-based clustering technique that targets critical path delay and channel width constraints simultaneously. We adjust the capacity of each CLB based on the criticality of the logic block. The paper analyzes the effect of depopulation strategies on area and delay. Results show that T-NDPack reduces minimum channel width by 11.07% while increasing the number of CLBs by 13.28%. More importantly, T-NDPack decreases critical path delay by 2.89%.

  • Experiences applying OVM 2.0 to an 8B/10B RTL design

    Page(s): 1 - 8

    The SystemVerilog implementation of the Open Verification Methodology (OVM) is exercised on an 8b/10b RTL open-core design, as a simple yet complete exercise that exposes the key features of OVM. Emphasis is placed on the actual usage of the verification components rather than on a complete verification flow, with the aim of helping readers unfamiliar with OVM apply the methodology to their own designs. A link to the complete code is given to reinforce this aim. We found the methodology easy to use but intimidating at first glance, especially for someone with little experience in object-oriented programming. However, the flexibility, portability and reusability of the verification code become clear once the first steps have been taken.

  • Parameterized hardware design on reconfigurable computers: An image registration case study

    Page(s): 71 - 76

    Reconfigurable computers (RCs) with hardware (FPGA) co-processors can achieve significant performance improvements over traditional computers for certain categories of applications. The speedup an RC can deliver depends on the intrinsic parallelism of the target application as well as on the characteristics of the target platform. In this paper, we use an image registration implementation as a case study to show how a hardware implementation is parameterized by the co-processor architecture, particularly the local memory layout. Image registration is a fundamental image processing task used to match two or more pictures taken at different times, by different sensors, or from different viewpoints. One of several basic transformations in image registration is the rigid-body transformation, which combines a rotation θ, a translation (tx, ty), and a scale change s. In this work, a rigid-body transformation is applied to the test image to register it with the reference image, and the correlation coefficient is used as the similarity metric between the two images. Two algorithms, an exhaustive search algorithm and a discrete wavelet transform (DWT)-based search algorithm, are implemented in hardware (i.e., the FPGA device on a Cray XD1 reconfigurable computer). The hardware implementation of the exhaustive search algorithm is 10 times faster than the software implementation. The hardware implementation of the DWT-based search algorithm is roughly twice as fast as the corresponding software implementation.
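
The correlation coefficient named in the abstract as the similarity metric can be computed as below for two equally sized grayscale images stored as lists of rows. This is a plain software sketch, and the function name is illustrative. The rigid-body transform and the search over (θ, tx, ty, s) are omitted. Because the coefficient is invariant to positive affine intensity changes, a brightened copy of an image still scores 1.

```python
import math


def correlation_coefficient(image_a, image_b):
    """Pearson correlation coefficient between two equally sized images."""
    a = [p for row in image_a for p in row]
    b = [p for row in image_b for p in row]
    if len(a) != len(b) or not a:
        raise ValueError("images must be non-empty and the same size")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant image")
    return cov / math.sqrt(var_a * var_b)
```

An exhaustive search would evaluate this score for every candidate transform and keep the parameters with the highest value.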

  • Real-time particle image velocimetry based on FPGA technology

    Page(s): 147 - 152

    Particle image velocimetry (PIV) measures distributed flow velocity fields. It is well established as an experimental tool in modern fluid dynamics research and is applied to liquids, gases and multiphase flows. Images of tracer particles are processed with a statistical strategy, which makes real-time implementation difficult to achieve. In this paper, we describe the design and implementation of an embedded architecture for real-time PIV based on FPGA technology. The proposed scheme exploits low-level parallelism in both the direct cross-correlation computation and the handling of interrogation windows. We propose a bus architecture to manage multiple interfaces among the processing modules and external devices. This scheme provides design flexibility and improved processing speed. The main benefits of the speed improvement are enhanced experimental capabilities, such as feedback control and on-line visualization of the flow regime, and a significant speedup in off-line processing. We show experimental results for a physical velocity field calculated in real time.
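
The direct cross-correlation step mentioned above can be sketched for a single pair of interrogation windows. For each candidate integer displacement (dx, dy), the sketch sums the products of overlapping pixel intensities and keeps the displacement with the largest correlation. Every (dx, dy) can be evaluated independently, which is what makes this brute-force form attractive for parallel hardware. The sketch omits sub-pixel peak fitting and any normalization that a practical PIV system would likely use, and the function name is hypothetical.

```python
def best_displacement(window_a, window_b, max_shift):
    """Integer (dx, dy) maximising the direct cross-correlation
    R(dx, dy) = sum of A[y][x] * B[y + dy][x + dx] over the overlapping region.
    """
    height, width = len(window_a), len(window_a[0])
    best_score, best_shift = None, (0, 0)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            score = 0
            # Only pixels whose shifted partner lies inside window B contribute.
            for y in range(max(0, -dy), min(height, height - dy)):
                for x in range(max(0, -dx), min(width, width - dx)):
                    score += window_a[y][x] * window_b[y + dy][x + dx]
            if best_score is None or score > best_score:
                best_score, best_shift = score, (dx, dy)
    return best_shift
```

Dividing the winning displacement by the time between the two exposures, and scaling by the optical magnification, gives the local velocity vector for that window.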

  • ChipCflow: A dynamic dataflow machine using dynamic reconfigurable hardware

    Page(s): 213 - 216

    A Control Dataflow Graph (CDFG) is a fundamental element in converting a High Level Language (HLL) into hardware. Moreover, a dataflow architecture can be obtained directly from the CDFG. In the 1970s and late 1980s, the dataflow model received much attention because it expresses parallelism in a natural form; in particular, dynamic dataflow architectures can achieve a high level of parallelism. This paper describes the ChipCflow project, a system that converts HLL code into a dynamic dataflow graph to be executed on dynamically reconfigurable hardware, exploiting dynamic reconfiguration. ChipCflow consists of several parts: the compiler that converts the C program into a dataflow graph; the operators and their instances; the tagged tokens; and the data matching. Some results are presented as a proof of concept for the project.
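
In a dynamic (tagged-token) dataflow machine, each token carries a tag, for example a loop-iteration number. A two-input operator fires only when operands with the same tag have arrived on both inputs. The sketch below models that matching step in software. It assumes two-input operators with ports 0 and 1 and is not ChipCflow's actual matching hardware; the class and method names are hypothetical.

```python
class MatchingStore:
    """Pairs operand tokens by (operator, tag) before an operator fires."""

    def __init__(self):
        self.waiting = {}  # (operator, tag) -> {port: value}

    def receive(self, operator, tag, port, value):
        """Store a token; return (left, right) once both operands match, else None."""
        if port not in (0, 1):
            raise ValueError("this sketch assumes two-input operators (ports 0 and 1)")
        key = (operator, tag)
        pending = self.waiting.setdefault(key, {})
        if port in pending:
            raise ValueError("duplicate token for the same operator, tag and port")
        pending[port] = value
        if len(pending) < 2:
            return None  # still waiting for the partner token
        del self.waiting[key]
        return pending[0], pending[1]
```

Tokens from different iterations may arrive out of order. Matching on the tag keeps them apart, which lets several iterations of the same operator be in flight at once.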

  • FPGA accelerator for protein structure prediction algorithm

    Page(s): 123 - 128

    Bioinformatics applications are computationally very expensive: they work with large data sets, consume many CPU cycles, and often require high precision. An important application in this area is the prediction of protein tertiary structure. This paper reports a codesign methodology for building hardware accelerators that minimize the running time of a protein energy minimization algorithm. Significant speedups are obtained by moving core time-consuming functions onto an FPGA: moving just one core function into hardware reduces the run time of the application fivefold. Up to an order of magnitude improvement in run time can be obtained by moving two functions (which are also core functions in many other bioinformatics applications) that together consume 99% of the CPU cycles in the chosen application. A generalized speedup analysis using single and multiple FPGA cards is also presented.

  • Mitigating and tolerating SEU effects in switch modules of SRAM-based FPGAs

    Page(s): 171 - 176

    This paper proposes three methods to mitigate and tolerate SEU-caused errors in the configuration bits of SRAM-based field programmable gate arrays. The proposed methods are based on error detection and correction codes that detect or correct SEU-caused errors in switch modules. The effects of the proposed methods on parameters such as area, delay and power consumption were evaluated for ten ITC'99 benchmark circuits with Synopsys® CAD tools and compared with previous work. The experimental results show that the proposed methods can detect or correct 100% of single errors in switch modules. The overheads are between 2% and 60% in area, between 25% and 100% in delay, and between 1% and 25% in power consumption.
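
The abstract does not say which codes are used. As an illustration of how a single-error-correcting code can protect a group of configuration bits, the sketch below assumes a Hamming(7,4) code. Three parity bits protect four data bits, and the 3-bit syndrome computed on readback directly identifies the position of any single flipped bit. The function names are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits (an int 0..15) into a 7-bit codeword list.

    Codeword positions 1..7 hold p1 p2 d1 p3 d2 d3 d4.
    """
    d1, d2, d3, d4 = [(data >> i) & 1 for i in range(4)]
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct up to one flipped bit; return (data, syndrome).

    A non-zero syndrome is the 1-based position of the corrected bit.
    """
    b = list(codeword)
    s1 = b[0] ^ b[2] ^ b[4] ^ b[6]
    s2 = b[1] ^ b[2] ^ b[5] ^ b[6]
    s3 = b[3] ^ b[4] ^ b[5] ^ b[6]
    syndrome = s1 | (s2 << 1) | (s3 << 2)
    if syndrome:
        b[syndrome - 1] ^= 1  # flip the erroneous bit back
    data = b[2] | (b[4] << 1) | (b[5] << 2) | (b[6] << 3)
    return data, syndrome
```

Under this assumed code, the area cost comes from storing three extra bits per four protected bits, plus the XOR trees that compute parity and syndrome.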

  • SCAR-FPGA: A novel side-channel attack resistant FPGA

    Page(s): 177 - 182

    In the design of embedded systems for security applications, flexibility and tamper resistance are two important factors. Frequent updates, together with the high cost and long design time of ASICs, make a secure FPGA an attractive alternative. In this paper, a secure FPGA is proposed for the secure implementation of crypto devices. The FPGA architecture is based on an asynchronous methodology and is resistant to multiple side-channel attacks, such as power attacks and fault attacks. An AES implementation demonstrates the native resistance of SCAR-FPGA.

  • Optimising multi-loop programs for heterogeneous computing systems

    Page(s): 129 - 134

    This paper presents a method for optimising the parallelisation and scheduling of task graphs containing loops for implementation on heterogeneous computing systems with both software and hardware processors. The method integrates loop unrolling with task scheduling and determines how far each loop should be unrolled to maximise performance while meeting size constraints. A performance-driven strategy is proposed to find the best unrolling factor for each loop: the more closely run-time conditions match the compile-time parameters, the higher the performance. Experimental results for a speech recognition system show that the proposed method outperforms an approach without unrolling by 2.1 times. Using the processing time of a 2.6 GHz microprocessor as a reference, a speedup of 10 times can be achieved when compile-time and run-time parameters match, and performance degrades gradually as they diverge.

  • Flexible communication support for dynamically reconfigurable FPGAs

    Page(s): 65 - 70

    Dynamic reconfiguration of FPGAs allows the dynamic management of the various tasks that make up an application. For optimization purposes, this feature allows tasks to be placed on line in an available region of the FPGA. However, dynamic reconfiguration leads to communication problems, since tasks are not present in the matrix during the whole computation time. The interconnection network must support this dynamicity. In this paper, we propose the implementation of a flexible interconnection network that does so. The proposed architecture is fully compliant with present state-of-the-art dynamically reconfigurable circuits such as the Xilinx Virtex family of FPGAs.

  • FuSE - a hardware accelerated HDL fault injection tool

    Page(s): 89 - 94

    The ongoing miniaturization of digital circuits makes them increasingly susceptible to faults, which complicates the design of fault-tolerant systems. In this context, fault injection plays an important role in validating fault tolerance, and as a result many fault injection tools have emerged during the last decade. However, these tools operate only in specific domains and can therefore be classified as hardware- or software-based, and as simulation- or emulation-based techniques. In this paper we present FuSE, a single fault injection tool that covers multiple domains as well as different fault injection purposes. FuSE was designed for use with the SEmulator®, an FPGA-based hardware accelerator. The tool set fully automates the fault injection process and requires only a VHDL description and a test bench of the circuit under test. FuSE can then perform fault injection experiments with the diagnostic resolution of simulation-based approaches, but at a speed that handles even long-running experiments with ease.

  • Cube: A 512-FPGA cluster

    Page(s): 51 - 57

    Cube, a massively parallel FPGA-based platform, is presented. The machine is built from boards that each contain 64 FPGA devices; eight boards can be connected in a cube structure for a total of 512 FPGA devices. With high-bandwidth systolic inter-FPGA communication and a flexible programming scheme, the result is a low-power, high-density and scalable supercomputing machine suitable for various large-scale parallel applications. An RC4 key search engine was built as a demonstration application. In a fully implemented Cube, the engine can search the full 40-bit key space within 3 minutes, 359 times faster than a multi-threaded software implementation running on a 2.5 GHz Intel Quad-Core Xeon processor.
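
The RC4 key search can be illustrated in software. For each candidate key, generate the RC4 keystream and compare it with the keystream recovered from a known plaintext/ciphertext pair. The sketch below uses the standard RC4 key-scheduling and output algorithms. The search function and the known-plaintext setting are assumptions for illustration, and the demonstration searches a 16-bit space rather than 40 bits. For scale, the abstract's figures imply that a full Cube tests 2^40 ≈ 1.1 × 10^12 keys in at most 180 s. That is at least about 6.1 × 10^9 keys per second, or roughly 1.2 × 10^7 keys per second per FPGA.

```python
def rc4_keystream(key, length):
    """Return `length` bytes of RC4 keystream for the byte string `key`."""
    # Key-scheduling algorithm (KSA).
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    # Pseudo-random generation algorithm (PRGA).
    out = bytearray()
    i = j = 0
    for _ in range(length):
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(s[(s[i] + s[j]) % 256])
    return bytes(out)


def search_key(known_plaintext, ciphertext, key_bytes):
    """Try every key of `key_bytes` bytes in order; return the first match or None."""
    target = bytes(p ^ c for p, c in zip(known_plaintext, ciphertext))
    for candidate in range(256 ** key_bytes):
        key = candidate.to_bytes(key_bytes, "big")
        if rc4_keystream(key, len(target)) == target:
            return key
    return None
```

Each candidate key is tested independently, so the key space can be partitioned across FPGAs with no communication beyond reporting a hit.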
