Very Large Scale Integration (VLSI) Systems, IEEE Transactions on

Issue 10 • Date Oct. 2008

  • Table of contents

    Page(s): C1
    Freely Available from IEEE
  • IEEE Transactions on Very Large Scale Integration (VLSI) Systems publication information

    Page(s): C2
    Freely Available from IEEE
  • Guest Editorial Special Section on Application Specific Processors

    Page(s): 1257 - 1258
    Freely Available from IEEE
  • Recurrence-Aware Instruction Set Selection for Extensible Embedded Processors

    Page(s): 1259 - 1267

    Automatic generation of a customized instruction set, starting from an input application code, is a complex problem that has received considerable attention in the past few years. Because of its complexity, only simplified versions of the problem have been solved exactly so far. For example, exact algorithms have been proposed for custom instruction identification, but they do not consider recurrence; other methods can handle recurrence, but are limited in how complex an instruction they can identify. However, an exact solution that can handle identification and recurrence simultaneously has been missing. We divide the problem into several parts and concentrate on covering, that is, selecting a set of nonoverlapping and possibly recurrent custom instructions to be implemented and used. We then propose a range of novel algorithms, both exact and approximate, to solve the covering problem in conjunction with the recurrence of candidate extensions. We propose an optimal search technique that uses branch-and-bound to improve an existing solution, in conjunction with a greedy search to help the algorithm out of any local optima, and achieve a tangible improvement over nonrecurrence-aware covering.
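
To make the covering problem in the abstract above concrete, here is a minimal sketch of a recurrence-aware branch-and-bound. It is not the paper's algorithm: the data model (each template pays its area once, and every additional non-overlapping instance of it is free), the single area budget, and all names such as `best_cover` are assumptions made for illustration only.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical data model: (template name, data-flow-graph nodes covered, cycles saved)
Instance = Tuple[str, FrozenSet[int], int]


def best_cover(instances: List[Instance], area: Dict[str, int],
               budget: int) -> Tuple[int, List[Instance]]:
    """Pick non-overlapping instances that maximize saved cycles.

    A template's area is charged once, however many of its instances are
    used; that is the (assumed) way recurrence pays off in this sketch.
    """
    # Try high-gain instances first so a good incumbent is found early.
    order = sorted(instances, key=lambda inst: -inst[2])

    # suffix[i] = total gain of instances i..end: an optimistic upper bound.
    suffix = [0] * (len(order) + 1)
    for i in range(len(order) - 1, -1, -1):
        suffix[i] = suffix[i + 1] + order[i][2]

    best_gain = 0
    best_pick: List[Instance] = []

    def search(i, used_nodes, templates, used_area, gain, pick):
        nonlocal best_gain, best_pick
        if gain > best_gain:
            best_gain, best_pick = gain, list(pick)
        # Bound: even taking every remaining instance cannot beat the incumbent.
        if i == len(order) or gain + suffix[i] <= best_gain:
            return
        name, nodes, saved = order[i]
        extra = 0 if name in templates else area[name]
        # Branch 1: use this instance if it overlaps nothing and fits the budget.
        if not (nodes & used_nodes) and used_area + extra <= budget:
            pick.append(order[i])
            search(i + 1, used_nodes | nodes, templates | {name},
                   used_area + extra, gain + saved, pick)
            pick.pop()
        # Branch 2: skip this instance.
        search(i + 1, used_nodes, templates, used_area, gain, pick)

    search(0, frozenset(), frozenset(), 0, 0, [])
    return best_gain, best_pick


if __name__ == "__main__":
    demo = [
        ("big", frozenset({1, 2, 3}), 5),   # one large instruction, used once
        ("mac", frozenset({1, 2}), 2),      # a small instruction that recurs
        ("mac", frozenset({3, 4}), 2),
        ("mac", frozenset({5, 6}), 2),
    ]
    print(best_cover(demo, {"big": 3, "mac": 3}, budget=3))
```

In the demo, a selection that looks only at single-instance gain would take `big`; counting the three recurrences of `mac` under the same area budget yields a larger total saving.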

  • Outer Loop Pipelining for Application Specific Datapaths in FPGAs

    Page(s): 1268 - 1280

    Most hardware compilers apply loop pipelining to increase the parallelism achieved, but pipelining is restricted to only the innermost level in a nested loop. In this work we extend and adapt an existing outer loop pipelining approach known as single dimension software pipelining to generate schedules for field-programmable gate-array (FPGA) hardware coprocessors. Each loop level in nine test loops is pipelined and the resulting schedules are implemented in VHDL and targeted to an Altera Stratix II FPGA. The results show that the fastest solution for all but one of the loops occurs when pipelining is applied one to three levels above the innermost loop. Across the nine test loops we achieve an acceleration over the innermost loop solution of up to seven times, with a mean speedup of 3.2 times. The results suggest that inclusion of outer loop pipelining in future hardware compilers may be worthwhile, as it can allow significantly improved results to be achieved at the cost of a small increase in compile time.
  • A Design Flow for Architecture Exploration and Implementation of Partially Reconfigurable Processors

    Page(s): 1281 - 1294

    In recent years, growing application complexity and rising design and mask costs have compelled embedded system designers to increasingly consider partially reconfigurable application-specific instruction set processors (rASIPs), which combine a programmable base processor with a reconfigurable fabric. Although such processors promise to deliver an excellent balance between performance and flexibility, their design remains a challenging task. The key to the successful design of a rASIP is combined architecture exploration of all three major components: the programmable core, the reconfigurable fabric, and the interfaces between these two. This work presents a design flow that supports fast architecture exploration for rASIPs. The design flow is centered around a unified description of an entire rASIP in an architecture description language (ADL). This ADL description facilitates consistent modeling and exploration of all three components of a rASIP through automatic generation of the software tools (compiler tool chain and instruction set simulator) and the RTL hardware model. The generated software tools and the RTL model can be used either for final implementation of the rASIP or as a preoptimized starting point for an implementation that is hand optimized afterward. The design flow is further enhanced by a number of automatic application analysis tools, including a fine-grained application profiler, an instruction set extension (ISE) generator, and a data path mapper for coarse-grained reconfigurable architectures (CGRAs). We present some case studies on embedded benchmarks to show how the design space exploration process helps to efficiently design an application domain specific rASIP.
  • Efficient Resource Utilization for an Extensible Processor Through Dynamic Instruction Set Adaptation

    Page(s): 1295 - 1308

    State-of-the-art application-specific instruction set processors (ASIPs) allow the designer to define individual prefabrication customizations, thus improving the degree of specialization towards the actual application requirements, e.g., the computational hot spots. However, only a subset of hot spots can be targeted to keep the ASIP within a reasonable size. We propose a modular special instruction composition with multiple implementation possibilities per special instruction, compile-time embedded instructions to trigger a run-time adaptation of the instruction set, and a run-time system that dynamically selects an appropriate variation of the instruction set, i.e., a situation-dependent beneficial implementation for each special instruction. We thereby achieve a better efficiency of resource usage of up to 3.0 times (average 1.4 times) compared with current state-of-the-art ASIPs, resulting in a 3.1 times (average 1.4 times) improved application performance (up to 25.7 times, and 17.6 times on average, compared with a general purpose processor).
  • A Reconfigurable ASIP for Convolutional and Turbo Decoding in an SDR Environment

    Page(s): 1309 - 1320

    Future mobile and wireless communication networks require flexible modem architectures to support seamless services between different network standards. Hence, a common hardware platform that can support multiple protocols implemented or controlled by software, generally referred to as software defined radio (SDR), is essential. This paper presents a family of dynamically reconfigurable application-specific instruction-set processors (ASIPs) for channel coding in wireless communication systems. As a weakly programmable intellectual property (IP) core, it can implement trellis-based channel decoding in an SDR environment. It features binary convolutional decoding, and turbo decoding for binary as well as duobinary turbo codes for all current and upcoming standards. The ASIP consists of a specialized pipeline with 15 stages and a dedicated communication and memory infrastructure. Logic synthesis revealed a maximum clock frequency of 400 MHz and an area of 0.11 mm² for the processor's logic using a low power 65-nm technology. Memories require another 0.31 mm². Simulation results for Viterbi and turbo decoding demonstrate maximum throughput of 196 and 34 Mb/s, respectively. The ASIP hence outperforms state-of-the-art decoder architectures targeting software defined radio by at least a factor of three while consuming only 60% or less of the logic area.
  • High Performance Architecture of an Application Specific Processor for the H.264 Deblocking Filter

    Page(s): 1321 - 1334

    This paper presents an efficient architecture of an application specific processor (ASP) designed for the deblocking filter algorithm of the H.264 video compression standard. Several optimization techniques at different design levels, such as vector registers, pipeline processing, a very long instruction word (VLIW) organization, and predication, are utilized in this design. The proposed ASP can meet the real time constraint of the deblocking filter algorithm for the 16:9 video format (4096 × 2304) at 30 frames per second (fps) at a 200-MHz clock rate.
  • A Processing Path Dispatcher in Network Processor MPSoCs

    Page(s): 1335 - 1345

    Multi-field packet classification problems discussed in the literature are typically constrained to the Internet five-tuple and primarily address the problem of network quality-of-service (QoS) support and access control. In this paper, we present a solution for a classification problem that is used for optimized packet assignment to different data paths within a network processor system-on-chip (SoC). In contrast to the five-tuple-based rules discussed in the prior art, our problem has rules that consider a larger set of fields from the packet header. However, for each individual rule a different subset of fields is relevant, and the number of rules is smaller. Based on a specification of the usage case for our classifier, we derive the heterogeneous decision graph algorithm (HDGA), a heuristic approach to construct a decision tree classifier that integrates external lookup results for certain types of rules. We evaluate various parameters for optimizing the proposed decision tree and present simulation results to show the scalability of HDGA for typical problem sizes. The paper concludes with the results of an implementation on our field-programmable gate-array (FPGA)-based prototyping platform.
  • Automatic Processor Customization for Zero-Overhead Online Software Verification

    Page(s): 1346 - 1357

    The PSL-to-Verilog (P2V) compiler can translate a set of assertions about a block-structured software program into a hardware design to be executed concurrently with the program. The assertions validate the correctness of the software program without altering the program's temporal behavior in any way, a result never previously achieved by any online model-checking system. Additionally, the techniques and implementations described apply to any general purpose program, and the absence of execution overhead renders the system ideal for the verification and debugging of real-time systems. Assertions are expressed in a simple subset of the property specification language (PSL), an IEEE standard originally intended for the behavioral specification of hardware designs. The target execution system is the eMIPS processor, a dynamically self-extensible processor realized with a field-programmable gate array (FPGA). The system can execute and check multiple programs concurrently. Assertions are compiled into eMIPS Extensions, which are loaded by the operating system software into a portion of the FPGA, and discarded once the program terminates. If an assertion is violated, the program receives an exception; otherwise, it executes fully unaware of its verifier. The software program is not modified in any way. It can be compiled separately with full optimizations and executes with or without the corresponding hardware checker. The P2V compiler, implemented in Python, generates code for the implementation of the eMIPS processor running on the Xilinx ML401 development board. It is currently used to verify software properties in areas such as testing, debugging, intrusion detection, and the behavioral verification of concurrent and real-time programs.
  • Unified Convolutional/Turbo Decoder Design Using Tile-Based Timing Analysis of VA/MAP Kernel

    Page(s): 1358 - 1371

    To satisfy advanced forward-error-correction (FEC) standards, in which the Convolutional code and Turbo code may coexist, a prototype design of a unified Convolutional/Turbo decoder is proposed. In this paper, we systematically analyze the timing charts of both the Viterbi algorithm (VA) and the MAP algorithm. Then, three techniques, the Distribution, Pointer, and Parallel schemes, are introduced; they can be used as flexible tools in timing-chart analysis either to reduce memory size or to increase throughput rate. Furthermore, we propose a tile-based methodology to analyze the key features of timing charts, such as computing/memory units and hardware utilization. On the basis of the timing analysis, we developed a VA/MAP timing chart that has three modes (VA mode, MAP mode, and concurrent VA/MAP mode) by complementing the idle time of both VA and MAP decoding procedures. The new combined timing analysis helps us construct a unified component decoder with a nearly 100% utilization rate of the processing element (PE) in both VA and MAP decoding functions. According to the triple-mode VA/MAP timing chart, we construct a triple-mode FEC kernel that can perform both Convolutional and Turbo decoding functions seamlessly for different communication systems. By integrating the FEC kernel with memories of different sizes, we can construct four types of FEC decoders for different application scenarios: 1) standalone Convolutional decoder (VA mode); 2) standalone Turbo decoder (MAP mode); 3) dual-mode Convolutional/Turbo decoder (VA mode and MAP mode); and 4) triple-mode Convolutional/Turbo decoder (VA mode, MAP mode, and concurrent VA/MAP mode). Finally, a prototype FEC kernel processor that is compliant with the 3GPP standard is verified in a TSMC 0.18-μm CMOS process as a triple-mode FEC decoder.
  • A New Family of Sequential Elements With Built-in Soft Error Tolerance for Dual-VDD Systems

    Page(s): 1372 - 1384

    In this paper, we propose some soft-error-tolerant latches and flip-flops that can be used in dual-VDD systems. By utilizing local redundancy and inner feedback techniques, the latches and flip-flops can recover from soft errors caused by cosmic rays and particle strikes. The proposed flip-flop can be used as a level shifter without the problems of static leakage and redundant switching activity. Implemented in a standard 0.18-μm technology, the proposed latches and flip-flops show superior performance compared to conventional ones in terms of delay and power while keeping the soft-error-tolerant characteristic. Experimental results show that the D-QN delay of the new D latch is 29.1% less than that of the traditional built-in soft-error-tolerant D latch, while it consumes 16.5% less power. The D-Q delay and power of the new flip-flop are about 47.7% and 54% less than those of the traditional high speed level-converting flip-flop, respectively. In addition, the proposed flip-flop is more robust to soft errors. The critical charge, which represents the minimum charge at the D input required to cause an error in the flip-flop, can be increased by more than 46.4%. The time window during which a single-event upset at the D input can corrupt the flip-flop is reduced by more than 22.2%.
  • Efficient Hierarchical Motion Estimation Algorithm and Its VLSI Architecture

    Page(s): 1385 - 1398

    This paper addresses the development and hardware implementation of an efficient hierarchical motion estimation algorithm, HMEA, using multiresolution frames to reduce the computational complexity. Excellent estimation performance is ensured using an averaging filter to downsample the original image. At the smallest resolution, the two best motion vector candidates are selected using a full-search block matching algorithm. At the middle level, these two candidate motion vectors are employed as the center points for small range local searches. Then, at the original resolution, the final motion vector is obtained by performing a local search around the single candidate from the middle level. HMEA exhibits regular data flow and is suitable for hardware implementation. An efficient VLSI architecture that includes an averaging filter to downsample the image and two 2-D semisystolic processing element arrays to determine the sum of absolute differences in a pipeline is also presented. Simulation results indicate that HMEA is more area-efficient and faster than many full-search and multiresolution architectures while maintaining high video quality. This architecture, with 59K gates and 1393 bytes of RAM, is implemented for a search range of [-16.0, +15.5].
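
The three-level search described in the abstract above can be sketched in a few lines. This is an illustrative reading of the abstract only, not the paper's HMEA: the 2x2 averaging filter, the ±1 refinement radius, integer-pel accuracy, the tie-breaking rule, and every function name here are assumptions.

```python
import random


def downsample(frame):
    """Halve the resolution with a 2x2 averaging filter."""
    h, w = len(frame) // 2, len(frame[0]) // 2
    return [[(frame[2 * y][2 * x] + frame[2 * y][2 * x + 1] +
              frame[2 * y + 1][2 * x] + frame[2 * y + 1][2 * x + 1]) // 4
             for x in range(w)] for y in range(h)]


def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy).
    Returns None when the displaced block leaves the frame."""
    rx, ry = bx + dx, by + dy
    if rx < 0 or ry < 0 or rx + n > len(ref[0]) or ry + n > len(ref):
        return None
    return sum(abs(cur[by + j][bx + i] - ref[ry + j][rx + i])
               for j in range(n) for i in range(n))


def search(cur, ref, bx, by, n, centers, radius, keep):
    """Score every displacement within `radius` of each center and
    return the `keep` vectors with the lowest SAD (shorter vectors win ties)."""
    scored = {}
    for cx, cy in centers:
        for dy in range(cy - radius, cy + radius + 1):
            for dx in range(cx - radius, cx + radius + 1):
                cost = sad(cur, ref, bx, by, dx, dy, n)
                if cost is not None:
                    scored[(dx, dy)] = cost
    ranked = sorted(scored, key=lambda v: (scored[v], abs(v[0]) + abs(v[1]), v))
    return ranked[:keep]


def hierarchical_mv(cur, ref, bx, by, n=16, coarse_radius=4):
    """Three-level estimate for the block at (bx, by); assumes the block
    lies far enough from the frame edge that candidates stay in bounds."""
    cur1, ref1 = downsample(cur), downsample(ref)      # middle level
    cur2, ref2 = downsample(cur1), downsample(ref1)    # smallest resolution
    # Smallest resolution: full search, keep two candidates.
    c2 = search(cur2, ref2, bx // 4, by // 4, n // 4, [(0, 0)], coarse_radius, keep=2)
    # Middle level: small local search around both candidates, keep one.
    c1 = search(cur1, ref1, bx // 2, by // 2, n // 2,
                [(2 * dx, 2 * dy) for dx, dy in c2], 1, keep=1)
    # Original resolution: local search around the single survivor.
    c0 = search(cur, ref, bx, by, n, [(2 * dx, 2 * dy) for dx, dy in c1], 1, keep=1)
    return c0[0]


# Demo: a random reference frame and a copy of it shifted by (4, -8).
rng = random.Random(0)
H = W = 64
ref = [[rng.randrange(256) for _ in range(W)] for _ in range(H)]
cur = [[ref[(y - 8) % H][(x + 4) % W] for x in range(W)] for y in range(H)]
mv = hierarchical_mv(cur, ref, 24, 24)
print("estimated motion vector:", mv)
```

The coarse level evaluates only (2 · 4 + 1)² = 81 positions on 4 x 4 blocks, and each finer level adds at most 18 and 9 positions, instead of roughly a thousand full-size block comparisons for an exhaustive ±16 search.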

  • Error-Resilient Motion Estimation Architecture

    Page(s): 1399 - 1412

    In this paper, we propose an energy-efficient motion estimation architecture. The proposed architecture employs the principle of error-resiliency to combat logic level timing errors that may arise in average-case designs in the presence of process variations and/or due to overscaling of the supply voltage [voltage overscaling (VOS)], and thereby achieves power reduction. Error-resiliency is incorporated via algorithmic noise-tolerance (ANT). Referred to as input subsampled replica ANT (ISR-ANT), the proposed technique incorporates an input subsampled replica of the main sum-of-absolute-difference (MSAD) block for detecting and correcting errors in the MSAD block. Simulations show that the proposed technique can save up to 60% power over an optimal error-free system in a 130-nm CMOS technology. These power savings increase to 78% in a 45-nm predictive process technology. Performance of the ISR-ANT architecture in the presence of process variations indicates that the average peak signal-to-noise ratio (PSNR) of the ISR-ANT architecture increases by up to 1.8 dB over that of the conventional architecture in 130-nm IBM process technology. Furthermore, the PSNR variation (σ/μ) is also reduced by 7× over that of the conventional architecture at the slow corner while achieving a power reduction of 33%.
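
The detect-and-correct idea behind an input subsampled replica can be sketched as follows. This assumes the usual ANT decision rule (fall back to the estimate when it disagrees with the main output by more than a threshold); the subsampling step, the threshold value, and the single-bit fault model are hypothetical choices for illustration, not figures from the paper.

```python
def sad(a, b):
    """Sum of absolute differences over two equal-length pixel sequences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def replica_sad(a, b, step=4):
    """Low-cost estimate: SAD over every `step`-th pixel, scaled back up.
    Being small and slow-switching, the replica is assumed error-free."""
    return step * sad(a[::step], b[::step])


def ant_select(main_out, estimate, threshold):
    """ANT decision: trust the main block unless it strays too far
    from the replica's estimate, in which case use the estimate."""
    return main_out if abs(main_out - estimate) <= threshold else estimate


# Demo data: a flat block against a block with a small repeating ripple.
cur_block = [10] * 64
ref_block = [10 + (i % 3) for i in range(64)]

exact = sad(cur_block, ref_block)             # what an error-free main block gives
estimate = replica_sad(cur_block, ref_block)  # subsampled replica
faulty = exact ^ (1 << 10)                    # model a timing error in a high-order bit

print("exact:", exact, "estimate:", estimate, "faulty:", faulty)
print("error-free case ->", ant_select(exact, estimate, threshold=16))
print("timing-error case ->", ant_select(faulty, estimate, threshold=16))
```

Timing errors under voltage overscaling tend to hit the most significant bits, so they show up as large deviations that a coarse estimate can catch; the price is the small residual error of the estimate whenever it is substituted.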

  • Dynamically Configurable Bus Topologies for High-Performance On-Chip Communication

    Page(s): 1413 - 1426

    The on-chip communication architecture is a primary determinant of overall performance in complex system-on-chip (SoC) designs. Since the communication requirements of SoC components can vary significantly over time, communication architectures that dynamically detect and adapt to such variations can substantially improve system performance. In this paper, we propose Flexbus, a new architecture that can efficiently adapt the logical connectivity of the communication architecture and the components connected to it. Flexbus achieves this by dynamically controlling both the communication architecture topology and the mapping of SoC components to the communication architecture. This is achieved through new dynamic bridge by-pass and component remapping techniques. In this paper, we introduce these techniques, describe how they can be realized within modern on-chip buses, and discuss policies for run-time reconfiguration of Flexbus-based architectures. The techniques underlying Flexbus are general, and are applicable to a variety of bus standards. We have implemented Flexbus as an extension of the popular AMBA AHB bus, and have evaluated it using a commercial design flow. We report on experiments conducted to analyze its area, timing, and performance under a wide variety of system-level traffic profiles. We have applied Flexbus to two example SoC designs: 1) an IEEE 802.11 MAC processor and 2) a UMTS turbo decoder. Our results show that Flexbus provides gains of up to 34.55% in application data-rates over conventional architectures, with negligible area overhead and a 3.2% penalty in clock period.
  • Scitopia.org [advertisement]

    Page(s): 1427
    Freely Available from IEEE
  • Over 1 million scientific documents easily within reach, from IEEE [advertisement]

    Page(s): 1428
    Freely Available from IEEE
  • IEEE Transactions on Very Large Scale Integration (VLSI) Systems society information

    Page(s): C3
    Freely Available from IEEE
  • IEEE Transactions on Very Large Scale Integration (VLSI) Systems information for authors

    Page(s): C4
    Freely Available from IEEE

Aims & Scope

Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing, and systems applications. Generation of specifications, design, and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor, and process levels.

To address this critical area through a common forum, the IEEE Transactions on VLSI Systems was founded. The editorial board, consisting of international experts, invites original papers which emphasize the novel system integration aspects of microelectronic systems, including interactions among system design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chips and wafer fabrication, testing and packaging, and system level qualification. Thus, the coverage of this Transactions focuses on VLSI/ULSI microelectronic system integration.

Topics of special interest include, but are not strictly limited to, the following:
  • System Specification, Design and Partitioning
  • System-level Test
  • Reliable VLSI/ULSI Systems
  • High Performance Computing and Communication Systems
  • Wafer Scale Integration and Multichip Modules (MCMs)
  • High-Speed Interconnects in Microelectronic Systems
  • VLSI/ULSI Neural Networks and Their Applications
  • Adaptive Computing Systems with FPGA components
  • Mixed Analog/Digital Systems
  • Cost, Performance Tradeoffs of VLSI/ULSI Systems
  • Adaptive Computing Using Reconfigurable Components (FPGAs)

Meet Our Editors

Editor-in-Chief
Yehea Ismail
CND Director
American University in Cairo and Zewail City of Science and Technology
New Cairo, Egypt
y.ismail@aucegypt.edu