Application-Specific Systems, Architectures and Processors, 2008. ASAP 2008. International Conference on

Date 2-4 July 2008

Displaying Results 1 - 25 of 58
  • Fast custom instruction identification by convex subgraph enumeration

    Publication Year: 2008 , Page(s): 1 - 6
    Cited by:  Papers (15)

    Automatic generation of custom instruction processors from high-level application descriptions enables fast design space exploration, while offering very favorable performance and silicon area combinations. This work introduces a novel method for adapting the instruction set to match an application captured in a high-level language. A simplified model is used to find the optimal instructions via enumeration of maximal convex subgraphs of application data flow graphs (DFGs). Our experiments involving a set of multimedia and cryptography benchmarks show that an order of magnitude performance improvement can be achieved using only a limited amount of hardware resources. In most cases, our algorithm takes less than a second to execute.
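
    A minimal sketch of the convexity constraint that the enumeration relies on: a subgraph of a DFG is convex when no path between two of its nodes passes through a node outside it. The function name `is_convex`, the adjacency-dict encoding, and the example graph are assumptions made for illustration; this is not the paper's enumeration algorithm.

```python
from typing import Dict, List, Set


def is_convex(dfg: Dict[str, List[str]], subgraph: Set[str]) -> bool:
    """Return True if no path leaves `subgraph` and later re-enters it."""
    # Seed the search with every outside node directly fed by a subgraph node.
    stack = [s for n in subgraph for s in dfg.get(n, []) if s not in subgraph]
    seen = set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for succ in dfg.get(node, []):
            if succ in subgraph:
                return False  # a path exits the subgraph and re-enters it
            stack.append(succ)
    return True


# Hypothetical DFG: a -> b -> d and a -> c -> d
example_dfg = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
```

    Here `{"a", "b", "d"}` is not convex, because the path a -> c -> d passes through the excluded node `c`, whereas `{"a", "b"}` is.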

  • Bit matrix multiplication in commodity processors

    Publication Year: 2008 , Page(s): 7 - 12
    Cited by:  Papers (3)

    Registers in processors generally contain words or, with the addition of multimedia extensions, short vectors of subwords of bytes or 16-bit elements. In this paper, we view the contents of registers as vectors or matrices of individual bits. However, the facility to operate efficiently on the bit-level is generally lacking. A commodity processor usually only has logical and shift instructions and occasionally population count instructions. Perhaps the most powerful primitive bit-level operation is the bit matrix multiply (BMM) instruction, currently found only in supercomputers like Cray. This instruction multiplies two n × n bit matrices. In this paper, we show the power of BMM. We propose and analyze new processor instructions that implement simpler BMM primitive operations more suitable for a commodity processor. We show the impact of BMM on the performance of critical application kernels and discuss its hardware cost.
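
    A software sketch of an n × n bit matrix multiply, assuming arithmetic over GF(2) (AND as the product, XOR as the sum). The row encoding (one integer per row, most significant bit as column 0) and the name `bmm` are illustrative assumptions and do not describe the proposed instructions.

```python
def bmm(a_rows, b_rows, n):
    """Multiply two n x n bit matrices over GF(2).

    Each matrix is a list of n integers; bit (n - 1 - j) of a row is column j.
    """
    result = []
    for a in a_rows:
        acc = 0
        for k in range(n):
            # If A[i][k] is 1, XOR row k of B into the output row.
            if (a >> (n - 1 - k)) & 1:
                acc ^= b_rows[k]
        result.append(acc)
    return result


identity = [0b100, 0b010, 0b001]
reverse_rows = [0b001, 0b010, 0b100]  # permutation matrix that reverses row order
m = [0b101, 0b011, 0b110]
```

    Multiplying by a permutation matrix such as `reverse_rows` rearranges the rows of `m`, which hints at why BMM is useful for bit permutations.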

  • Synthesis of application accelerators on Runtime Reconfigurable Hardware

    Publication Year: 2008 , Page(s): 13 - 18
    Cited by:  Papers (3)

    Application accelerators are predominantly ASICs. The cost of ASIC solutions is orders of magnitude higher than that of programmable processing cores. Despite this, ASIC solutions are preferred when both high performance and low power are the targets. ASICs offer no flexibility to cater to application derivatives unless this has been provisioned for at design time. In this paper we define the architecture of Runtime Reconfigurable Hardware (RRH) as the platform for application acceleration. The proposed RRH is a homogeneous fabric comprising computing, storage and communication resources. We also propose a synthesis methodology to realize applications written in a high-level language (HLL) on the RRH. Applications described in an HLL are compiled into application substructures. For each application substructure, a set of Compute Elements (CEs), interconnected in a manner that closely matches the communication pattern within it, is allocated. The CEs in such a configuration are called a hardware affine. Hardware affines are defined at compile time and are provisioned (carved out) on the RRH fabric at runtime. Because these hardware affines are NOT instruction set processor cores or Logic Elements as in FPGAs, we gain the performance and power advantages of an ASIC together with the hardware reconfigurability and programmability of an FPGA or instruction set processor.

  • Floating point multiplication rounding schemes for interval arithmetic

    Publication Year: 2008 , Page(s): 19 - 24
    Cited by:  Papers (1)

    Floating point multipliers with two differently rounded results for the same operation can be used for increasing the performance of interval multiplication. This paper builds on that idea by investigating three existing floating point multiplication rounding algorithms for such multipliers: the Even-Seidel, Quach and Yu-Zyner algorithms. These three rounding schemes are modified for interval arithmetic; furthermore, a new rounding scheme is proposed. The estimates rendered by our analysis show that the proposed scheme has the best performance/area ratio.
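
    A rough software illustration of why interval multiplication wants two differently rounded products: the lower bound must be rounded toward negative infinity and the upper bound toward positive infinity. The sketch below assumes Python 3.9 or later (for `math.nextafter`) and simply widens each round-to-nearest bound by one unit in the last place, which is conservative; it does not model any of the rounding schemes named in the abstract.

```python
import math


def interval_mul(x, y):
    """Multiply intervals x = (xl, xu) and y = (yl, yu) with outward rounding."""
    (xl, xu), (yl, yu) = x, y
    products = [xl * yl, xl * yu, xu * yl, xu * yu]
    # Step one ulp outward so the result encloses the exact product interval.
    lo = math.nextafter(min(products), -math.inf)
    hi = math.nextafter(max(products), math.inf)
    return lo, hi


lo, hi = interval_mul((1.0, 2.0), (-3.0, 4.0))
```

    A hardware multiplier that returns both roundings at once avoids computing each product twice under different rounding modes.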

  • Fast multivariate signature generation in hardware: The case of rainbow

    Publication Year: 2008 , Page(s): 25 - 30

    This paper presents a time-area efficient hardware architecture for the multivariate signature scheme Rainbow. As a part of this architecture, a high-performance hardware optimized variant of the well-known Gaussian elimination over GF(2^l) and its efficient implementation are presented. The resulting signature generation core of Rainbow requires 63,593 gate equivalents and signs a message in just 804 clock cycles at 67 MHz using AMI 0.35 µm CMOS technology. Thus, Rainbow provides significant performance improvements compared to RSA and ECDSA.
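
    A minimal sketch of Gaussian elimination restricted to the simplest case, GF(2), where addition is XOR and every nonzero pivot is already 1. For GF(2^l) with l > 1, field multiplication and inversion would also be needed. The bit-packed row encoding and the name `solve_gf2` are assumptions for illustration, not the paper's hardware design.

```python
def solve_gf2(rows, n):
    """Solve an n x n linear system over GF(2).

    Each of the n rows is an int: bit i is the coefficient of x_i and bit n
    is the right-hand side. Returns the solution list, or None if singular.
    """
    rows = list(rows)
    for col in range(n):
        # Find a row at or below the diagonal with a 1 in this column.
        pivot = next((r for r in range(col, n) if (rows[r] >> col) & 1), None)
        if pivot is None:
            return None
        rows[col], rows[pivot] = rows[pivot], rows[col]
        # Clear this column in every other row (row addition is XOR).
        for r in range(n):
            if r != col and (rows[r] >> col) & 1:
                rows[r] ^= rows[col]
    return [(rows[i] >> n) & 1 for i in range(n)]


# x0 + x1 = 1 and x1 = 1, encoded as 0b111 and 0b110
solution = solve_gf2([0b111, 0b110], 2)
```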

  • Fault-tolerant dynamically reconfigurable NoC-based SoC

    Publication Year: 2008 , Page(s): 31 - 36
    Cited by:  Papers (4)

    This paper proposes a network-on-chip (NoC)-based dynamically reconfigurable platform which can run multiple applications simultaneously. A tile attached to a router in the NoC consists of a core container which can host a core permanently or temporarily. The tile also has a hardwired controller and a cache-like memory to control the hosted cores. A core, which runs a task, may be described by a bitstream (called a hardware core) or by programme code (called a software core). Because of the dynamic behaviour of the proposed platform, a stochastic dynamic routing algorithm uses the task identifier to find (or map) the task in the platform. Because the routing algorithm uses the task identifier and the tiles are reconfigurable, the proposed platform can tolerate probable faults. The proposed SoC architecture can easily run new protocols and tasks. Our results show that the proposed platform follows user interests, running tasks with higher temporal locality much faster than tasks with lower temporal locality.

  • Security processor with quantum key distribution

    Publication Year: 2008 , Page(s): 37 - 42

    We present a fully operable security gateway prototype, integrating quantum key distribution and realised as a system-on-chip. It is implemented on a field-programmable gate array and provides a virtual private network with low latency and gigabit throughput. The seamless hard- and software integration of a quantum key distribution layer enables high key-update rates for the encryption modules. Hence, the amount of data encrypted with one session key can be significantly decreased. We realise a highly modular architecture and make extensive use of software/hardware partitioning. This work is the first approach towards application of a new key distribution technology in dedicated security processors. In particular, it elaborates requirements for the integration of quantum key distribution on a chip level.

  • Fully-pipelined efficient architectures for FPGA realization of discrete Hadamard transform

    Publication Year: 2008 , Page(s): 43 - 48
    Cited by:  Papers (2)

    Fully-pipelined simple modular structures are presented in this paper for efficient hardware realization of the discrete Hadamard transform (HT). From the kernel matrix of the HT, we have derived four different pipelined modular designs for transform length N = 4. It is shown further that the HT of transform length N = 8 can be obtained from two 4-point HT modules, and similarly, the HT of transform length N = 16 can be obtained from four 4-point HT modules. Longer transforms may, however, be computed from these short-length modules, since an N-point transform can be computed from 2M M-point HT modules, where M = N^(1/2). The proposed architectures are coded in VHDL, simulated with the Xilinx ISE tool for validation and testing, and then synthesized for implementation in different FPGA devices, e.g., Virtex-E, Virtex-II Pro and Virtex-4. The synthesis results show that the proposed designs use considerably fewer slices and provide a significantly higher best-achievable frequency than existing architectures for FPGA implementation of the HT.
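
    One way to read the 2M-module claim, sketched in software under the assumption of the Sylvester (natural) ordering, for which the N-point matrix is the Kronecker product of two M-point matrices: arrange the input as an M × M array, apply an M-point transform to each of the M rows, then to each of the M columns. The function names are illustrative and the code does not reflect the paper's pipelined structures.

```python
import math


def hadamard(x):
    """Unnormalized Hadamard transform in Sylvester order; len(x) is a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b  # butterfly: sum and difference
        h *= 2
    return y


def hadamard_from_modules(x):
    """N-point transform built from 2M calls to an M-point module, M = sqrt(N)."""
    n = len(x)
    m = math.isqrt(n)
    if m * m != n:
        raise ValueError("length must be a perfect square")
    rows = [hadamard(x[r * m:(r + 1) * m]) for r in range(m)]                 # M row modules
    cols = [hadamard([rows[r][c] for r in range(m)]) for c in range(m)]      # M column modules
    return [cols[c][r] for r in range(m) for c in range(m)]


data = list(range(16))
```

    For `data` of length 16, `hadamard_from_modules` uses eight 4-point transforms and agrees with the direct 16-point result.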

  • Reconfigurable Viterbi decoder on mesh connected multiprocessor architecture

    Publication Year: 2008 , Page(s): 49 - 54

    In modern wireline and wireless communication systems, the Viterbi decoder is one of the most compute-intensive and essential elements. Each standard requires a different configuration of the Viterbi decoder. Hence there is a need to design a flexible reconfigurable Viterbi decoder to support different configurations on a single platform. In this paper we present a reconfigurable Viterbi decoder which can be reconfigured for standards such as WCDMA, CDMA2000, IEEE 802.11, DAB, DVB, and GSM. Parameters such as code rate, constraint length, polynomials and truncation length can be configured to map any of the above-mentioned standards. Our design provides higher throughput and scalable power consumption across the various configurations of the reconfigurable Viterbi decoder. The power and throughput can also be optimized for different standards.

  • Run-time thread sorting to expose data-level parallelism

    Publication Year: 2008 , Page(s): 55 - 60

    We address the problem of data parallel processing for computational quantum chemistry (CQC). CQC is a computationally demanding tool to study the electronic structure of molecules. An important algorithmic component of these computations is the evaluation of Electron Repulsion Integrals (ERIs). A key problem with ERI evaluation is control-flow variation between different ERI evaluations, which can only be resolved at runtime. This causes the computation to be unsuitable for data parallel execution. However, it is observed that although there is variation between ERI evaluations, the variation is limited; in fact there are a limited number of ERI classes present within any given workload. Conceptually, it is possible to classify the ERIs into sizable sets, and execute these sets in a data parallel fashion. Practically, creating these sets is computationally expensive. We describe an architecture to perform this thread sorting, where high throughput is achieved with small associative and multiport memories. The performance of the prototype is evaluated with FPGA synthesis. We go on to envision other uses for thread sorting in general-purpose manycore architectures.

  • A new high-performance scalable dynamic interconnection for FPGA-based reconfigurable systems

    Publication Year: 2008 , Page(s): 61 - 66
    Cited by:  Papers (8)

    Networks on chip (NoCs) are viable interconnection architectures, characterized in particular by a high level of parallelism, high performance and scalability. The NoC architectures already proposed in the literature are mostly intended for system-on-chip (SoC) designs. For an FPGA-based reconfigurable system, these NoCs are not suitable. In this paper, we present a new high-performance interconnection approach intended for FPGA-based reconfigurable systems. Our proposed NoC is based on a scalable communication unit characterized by its particular architecture, an arbitration policy based on the priority-to-the-right rule, and high performance. We present the basic concept of this communication approach and demonstrate its feasibility on examples through simulation. Implementation results are also detailed.

  • Extending the SIMPPL SoC architectural framework to support application-specific architectures on multi-FPGA platforms

    Publication Year: 2008 , Page(s): 67 - 72
    Cited by:  Papers (1)

    Process technology has reduced in size such that it is possible to implement complete application-specific architectures as systems-on-chip (SoCs) using both application-specific integrated circuits (ASICs) and field programmable gate arrays (FPGAs). However, the reconfigurable nature of an FPGA results in lower logic density, such that large, complex applications require multi-FPGA implementation platforms. Although designing SoCs is challenging, SoC models such as systems integrating modules with predefined physical links (SIMPPL) exist to facilitate the design process. SIMPPL leverages defined physical interfaces and communication protocols to enable rapid system-level integration for application-specific architectures. This paper presents a "SIMPPL repeater" that enables the SIMPPL SoC architectural framework to be used for systems spanning multiple FPGAs. The SIMPPL repeater abstracts inter-chip communication, allowing designers to treat a multi-FPGA platform as a single large reconfigurable fabric and focus on their application-specific architecture.

  • PERMAP: A performance-aware mapping for application-specific SoCs

    Publication Year: 2008 , Page(s): 73 - 78
    Cited by:  Papers (1)

    Future system-on-chip (SoC) designs will need on-chip communication architectures that provide efficient and scalable data transport among the intellectual properties (IPs). Designing and optimizing SoCs is an increasingly difficult task due to the size and complexity of the SoC design space, the high cost of detailed simulation, and the several constraints that the design must satisfy. For efficient design of SoCs, an efficient mapping of IPs onto networks-on-chip (NoCs) is highly desirable. Towards this end, we present PERMAP, a performance-aware mapping algorithm which maps the IPs onto a generic NoC architecture such that the average communication delay is minimized. This is accomplished by an analytical performance model which can be used for any arbitrary network topology with wormhole routing. The algorithm is used for mapping a video application onto a tile-based NoC, and experimental results show that PERMAP is fast and robust.

  • Low-cost implementations of NTRU for pervasive security

    Publication Year: 2008 , Page(s): 79 - 84
    Cited by:  Papers (5)

    NTRU is a public-key cryptosystem based on the shortest vector problem in a lattice, and it is an alternative to RSA and ECC. This work presents a compact and low-power NTRU design that is suitable for pervasive security applications such as RFIDs and sensor nodes. We have designed two architectures: one is only capable of encryption, and the other performs both encryption and decryption. The strategy for the designs includes clock gating of registers, operand isolation and precomputation. This work is also the first to present a complete NTRU design with encryption/decryption circuitry. Our encryption-only NTRU design has a gate count of 2.8 kgates and a dynamic power consumption of 1.72 µW. Moreover, the encryption-decryption NTRU design consumes about 6 µW of dynamic power and consists of 10.5 kgates.

  • On the high-throughput implementation of RIPEMD-160 hash algorithm

    Publication Year: 2008 , Page(s): 85 - 90
    Cited by:  Papers (1)

    In this paper we present two new architectures of the RIPEMD-160 hash algorithm for high throughput implementations. The first architecture achieves the iteration bound of RIPEMD-160, i.e. it achieves a theoretical upper bound on throughput at the micro-architecture level. The second architecture is designed by performing a gate level optimization and achieves a better performance than the first one at the cost of a larger gate area. Throughputs of 3.122 Gbps and 624 Mbps are achieved, with and without pipelining, respectively.

  • Zodiac: System architecture implementation for a high-performance Network Security Processor

    Publication Year: 2008 , Page(s): 91 - 96
    Cited by:  Papers (2)

    The last few years have seen significant progress in the field of application-specific processors. One example is the Network Security Processor (NSP), which performs various cryptographic operations specified by network security protocols and helps to offload computation-intensive burdens from Network Processors (NPs). This paper proposes a high-performance NSP intended for acceleration of both the IPSec and SSL protocols. With a programmable descriptor-based instruction set architecture, the novel system architecture design leads to a Gbps-rate NSP named Zodiac, which is programmable with domain-specific instructions for Gbps-throughput IPSec and SSL applications. Synthesized with a 0.18 µm CMOS technology, the peak throughput of IPSec ESP tunnel mode can reach up to 1.651 Gbps, and over 1000 full SSL handshakes per second are attainable.

  • Efficient systolization of cyclic convolution for systolic implementation of sinusoidal transforms

    Publication Year: 2008 , Page(s): 97 - 101
    Cited by:  Papers (3)

    This paper presents an algorithm to convert composite-length cyclic convolution into a block cyclic convolution sum of small matrix-vector products, even if the co-factors of the convolution length are not mutually prime. It is shown that by using optimal short-length convolution algorithms, the block convolution can be computed from a few short-length cyclic and cyclic-like convolutions when one of the co-factors belongs to {2, 3, 4, 6, 8}. A generalized systolic array is derived for cyclic-like convolution and is used for the computation of long-length convolutions. The proposed structure for convolution length N = 2L involves nearly the same hardware and half the time complexity of the direct implementation, and the structure for N = 4L involves approximately 12.5% more hardware and one-fourth the time complexity of the latter. The structures for N = 2L and N = 4L have, respectively, the same and approximately 12.5% less area-time complexity than the corresponding existing prime-factor systolic structures, but unlike the latter type, they do not involve complex input/output mapping and can be used even if the co-factors of the convolution length are not relatively prime.
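
    For reference, a direct (unoptimized) length-N cyclic convolution is sketched below; it is the baseline operation that block decompositions of this kind reorganize. The function name is an assumption, and the code does not implement the paper's block algorithm or its systolic mapping.

```python
def cyclic_convolution(x, h):
    """Direct cyclic convolution: y[i] = sum over k of x[k] * h[(i - k) mod N]."""
    n = len(x)
    if len(h) != n:
        raise ValueError("inputs must have the same length")
    return [sum(x[k] * h[(i - k) % n] for k in range(n)) for i in range(n)]


signal = [1, 2, 3, 4]
unit_impulse = [1, 0, 0, 0]      # leaves the signal unchanged
delayed_impulse = [0, 1, 0, 0]   # rotates the signal by one position
```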

  • Resource efficient generators for the floating-point uniform and exponential distributions

    Publication Year: 2008 , Page(s): 102 - 107
    Cited by:  Papers (3)

    Monte-Carlo simulations and many other stochastic algorithms are almost ideal applications for FPGAs, as the huge amount of available parallelism allows deep pipelining without loop-carried dependencies and spatial scaling across large devices without shared resource bottlenecks. Another key advantage is that random number generation is very cheap (when compared to software), and can be tailored to meet the performance and quality needs of each application. However, in many cases this advantage is not exploited, either because an inefficient but simple to implement generator is chosen, or because a generator with properties that far exceed the needs of the application is used. This paper describes generators for the floating-point uniform and exponential distributions, which provide efficient resource usage, while remaining sufficiently simple to make them attractive to users.
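
    As a point of comparison in software, an exponential variate can be derived from a uniform one with the standard inversion method, x = -ln(1 - u) / rate. This is a textbook technique used here only for illustration; the abstract does not say which method the hardware generators use. The function name, the seed, and the sample count are assumptions.

```python
import math
import random


def exponential_from_uniform(u, rate=1.0):
    """Map a uniform sample u in [0, 1) to an exponential sample by inversion."""
    if not 0.0 <= u < 1.0:
        raise ValueError("u must lie in [0, 1)")
    return -math.log1p(-u) / rate  # log1p keeps precision when u is small


rng = random.Random(2008)  # fixed seed so the run is repeatable
samples = [exponential_from_uniform(rng.random()) for _ in range(10000)]
sample_mean = sum(samples) / len(samples)  # should be close to 1 / rate
```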

  • Low discrepancy sequences for Monte Carlo simulations on reconfigurable platforms

    Publication Year: 2008 , Page(s): 108 - 113
    Cited by:  Papers (2)

    Low-discrepancy sequences, also known as "quasi-random" sequences, are numbers that are better equidistributed in a given volume than pseudo-random numbers. Evaluation of high-dimensional integrals is commonly required in scientific fields as well as other areas (such as finance), and is performed by stochastic Monte Carlo simulations. Simulations which use quasi-random numbers can achieve faster convergence and better accuracy than simulations using conventional pseudo-random numbers. Such simulations are called Quasi-Monte Carlo. Conventional Monte Carlo simulations are increasingly implemented on reconfigurable devices such as FPGAs due to their inherently parallel nature. This has not been possible for Quasi-Monte Carlo simulations because, to our knowledge, no low-discrepancy sequences have been generated in hardware before. We present FPGA-optimized scalable designs to generate three different common low-discrepancy sequences: Sobol, Niederreiter and Halton. We implement these three generators on Virtex-4 FPGAs with varying degrees of fine-grained parallelization, although our ideas can be applied to a far broader class of sequences. We conclude with results from the implementation of an actual Quasi-Monte Carlo simulation for extracting partial inductances from integrated circuits.
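
    Of the three sequences named, Halton is the simplest to sketch in software: element i in base b is the radical inverse of i, obtained by mirroring the base-b digits of i about the radix point. The function name and the choice of starting at index 1 are assumptions; the code says nothing about the FPGA designs.

```python
def halton(index, base):
    """Return the radical inverse of `index` in `base` (one Halton coordinate)."""
    result = 0.0
    fraction = 1.0
    i = index
    while i > 0:
        fraction /= base
        result += fraction * (i % base)  # next digit goes one place further right
        i //= base
    return result


first_points_base2 = [halton(i, 2) for i in range(1, 5)]
```

    In base 2 the first four values are 0.5, 0.25, 0.75 and 0.125, each new point landing in the largest remaining gap.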

  • A subsampling pulsed UWB demodulator based on a flexible complex SVD

    Publication Year: 2008 , Page(s): 114 - 119

    A flexible digital architecture for a pulsed ultra-wideband demodulator sampling below Nyquist rate is presented. The system is based on a complex Singular Value Decomposition implemented on a configurable systolic array of simple processors. Automatic code generation is applied to cut design time and rapidly assess the implementation cost of several architectures of the processors.

  • Dynamically reconfigurable regular expression matching architecture

    Publication Year: 2008 , Page(s): 120 - 125
    Cited by:  Papers (4)  |  Patents (2)

    Regular expressions are generic representations for a string or a collection of strings. This paper focuses on the implementation of a regular expression matching architecture on reconfigurable fabric such as an FPGA. We present a Non-deterministic Finite Automata based implementation with an extended regular expression syntax set compared to previous approaches. We also describe a dynamically reconfigurable generic block that implements the supported regular expression syntax. This enables formation of the regular expression hardware by a simple cascade of generic blocks, as well as the possibility of reconfiguring the generic blocks to change the regular expression being matched. Further, we have developed an HDL code generator to obtain the VHDL description of the hardware for any regular expression set. Our optimized regular expression engine achieves a throughput of 2.45 Gbps. Our dynamically reconfigurable regular expression engine achieves a throughput of 0.8 Gbps using 12 FPGA slices per generic block on a Xilinx Virtex2Pro FPGA.
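
    The idea of matching with a cascade of identical, configurable blocks can be mimicked in software with a one-hot NFA: each block holds one state bit, fires when its predecessor was active and the input character is in its class, and may loop on itself. The block format `(allowed_chars, repeat)` and the function name are hypothetical; the paper's generic block and its syntax set are not described in the abstract.

```python
def cascade_match(blocks, text):
    """Unanchored search with a cascade of one-bit blocks.

    blocks: list of (allowed_chars, repeat) pairs; repeat=True behaves like '+'.
    Returns True as soon as the last block fires.
    """
    active = [False] * len(blocks)
    for ch in text:
        nxt = []
        for i, (chars, repeat) in enumerate(blocks):
            # Block 0 is always armed; others need the previous block,
            # or their own state if they are allowed to repeat.
            armed = (i == 0) or active[i - 1] or (repeat and active[i])
            nxt.append(armed and ch in chars)
        active = nxt
        if active and active[-1]:
            return True
    return False


# Hypothetical configuration equivalent to the expression ab+c
pattern_ab_plus_c = [("a", False), ("b", True), ("c", False)]
```

    Changing the expression only requires a new `blocks` list, loosely analogous to reconfiguring the blocks rather than regenerating the hardware.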

  • An MPSoC architecture for the Multiple Target Tracking application in driver assistant system

    Publication Year: 2008 , Page(s): 126 - 131
    Cited by:  Papers (6)

    This article discusses the design of an application-specific MPSoC architecture dedicated to multiple target tracking (MTT). This application is useful in driver assistant systems, more precisely in collision avoidance and warning systems. An automotive radar is used as the front-end sensor in our application. The article examines the tradeoffs that must be taken into consideration in the realization of the entire MTT application in an embedded system. In our implementation of MTT, several independent parallel tasks have been identified and mapped onto a multiprocessor architecture to meet the deadlines imposed by the application. Our study demonstrates that the joint use of reconfigurable circuits (namely FPGAs) and MPSoC facilitates the development of a flexible and efficient MTT system.

  • Managing multi-core soft-error reliability through utility-driven cross domain optimization

    Publication Year: 2008 , Page(s): 132 - 137
    Cited by:  Papers (3)

    As semiconductor processing technology continues to scale down, managing reliability becomes an increasingly difficult challenge in high-performance microprocessor design. Transient faults, also known as soft errors, corrupt program data at the circuit level and cause incorrect program execution and system crashes. Future processors will consist of billions of transistors organized as multicore microarchitectures. Packaging multiple cores (and hence more transistors) onto the same die exposes more devices to soft error strikes. This paper explores utility-function-driven (benefit driven) cross domain optimization for both performance and reliability. We propose the use of utility-based resource management for individual cores while applying utility-based shared cache partitioning across multiple cores. Moreover, we coordinate the optimization of multiple resources based on their cross domain utility information to achieve attractive performance and reliability tradeoffs. Extensive experimental results show that, on average, our utility-driven cross domain optimization reduces the soft error rate of the most vulnerable core in a chip multiprocessor (CMP) by up to 35% and improves the CMP's overall reliability by 22% with less than 3% performance degradation across 15 investigated workloads.

  • An efficient implementation of a phase unwrapping kernel on reconfigurable hardware

    Publication Year: 2008 , Page(s): 138 - 143

    The optical quadrature method of microscopy (OQM) uses phase data to capture information about the sample being studied. These phase data must be unwrapped before they can be used. Phase unwrapping is the process by which an integer multiple of 2π is added to a measured, wrapped phase value in order to generate a continuous function. The algorithm used is the minimum L^p norm method, whose most computationally expensive part is a two-dimensional discrete cosine transform (2-D DCT). This paper presents an implementation on reconfigurable hardware that performs the 2-D DCT over the entire 1024 × 512 image, solves the intermediate equation and then performs the two-dimensional inverse discrete cosine transform (2-D IDCT) using a novel FPGA implementation of the DCT with a semi-floating point data representation. This represents the largest 2-D DCT FPGA implementation in the literature, with most previous work focusing on the 8 × 8 transform.
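
    The definition of unwrapping given above can be illustrated in one dimension: whenever consecutive wrapped samples jump by more than π, a further multiple of 2π is added or subtracted. This simple path-following sketch is not the minimum L^p norm method, which works on 2-D images through DCTs; the function name and test signal are assumptions.

```python
import math


def unwrap_1d(phases):
    """Add integer multiples of 2*pi so consecutive samples differ by at most pi."""
    if not phases:
        return []
    out = [phases[0]]
    offset = 0.0
    for prev, cur in zip(phases, phases[1:]):
        delta = cur - prev
        if delta > math.pi:
            offset -= 2 * math.pi   # wrapped downward jump in the true phase
        elif delta < -math.pi:
            offset += 2 * math.pi   # wrapped upward jump in the true phase
        out.append(cur + offset)
    return out


true_phase = [0.0, 2.0, 4.0, 6.0, 8.0]
# Wrap into (-pi, pi] the way a phase measurement would
wrapped = [math.atan2(math.sin(p), math.cos(p)) for p in true_phase]
recovered = unwrap_1d(wrapped)
```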

  • A parallel hardware architecture for connected component labeling based on fast label merging

    Publication Year: 2008 , Page(s): 144 - 149
    Cited by:  Papers (4)

    This paper presents a dedicated parallel hardware architecture for fast connected component labeling. Both label generation and the merging of equivalent labels are accelerated. Label generation is performed for four pixels in parallel. A special linked-list-based approach for fast label merging is proposed. This results in a compact implementation and shorter processing times compared to published implementations. For prototyping and evaluation purposes, the hardware architecture was integrated into an FPGA-based modular coprocessor architecture. A binary D1 test image is labeled in 1.74 ms on a Virtex-II Pro FPGA running at 140 MHz. Moreover, the architecture can be easily integrated into embedded image processing systems.
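
    The two steps mentioned, label generation and merging of equivalent labels, appear in the classic sequential two-pass algorithm sketched below. It assumes 4-connectivity and uses a union-find table for merging instead of the linked-list scheme the paper proposes; names and the test image are illustrative.

```python
def label_components(image):
    """Two-pass connected component labeling of a binary image (4-connectivity).

    Returns (labels, count); background pixels keep label 0.
    """
    rows = len(image)
    cols = len(image[0]) if image else 0
    labels = [[0] * cols for _ in range(rows)]
    parent = [0]  # parent[k] is the representative of provisional label k

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Pass 1: generate provisional labels and record equivalences.
    for r in range(rows):
        for c in range(cols):
            if not image[r][c]:
                continue
            up = labels[r - 1][c] if r > 0 else 0
            left = labels[r][c - 1] if c > 0 else 0
            if up and left:
                a, b = find(up), find(left)
                parent[max(a, b)] = min(a, b)  # merge equivalent labels
                labels[r][c] = min(a, b)
            elif up or left:
                labels[r][c] = up or left
            else:
                parent.append(len(parent))
                labels[r][c] = len(parent) - 1

    # Pass 2: replace each provisional label by a compact final label.
    remap = {}
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                root = find(labels[r][c])
                labels[r][c] = remap.setdefault(root, len(remap) + 1)
    return labels, len(remap)


u_shape = [[1, 0, 1],
           [1, 0, 1],
           [1, 1, 1]]
u_labels, u_count = label_components(u_shape)
```

    The U-shaped image first receives two provisional labels, one per arm, which are merged when the scan reaches the bottom-right pixel.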
