Very Large Scale Integration (VLSI) Systems, IEEE Transactions on

Issue 4 • Date April 2010

  • Table of contents

    Page(s): C1 - C4
    Freely Available from IEEE
  • IEEE Transactions on Very Large Scale Integration (VLSI) Systems publication information

    Page(s): C2
    Freely Available from IEEE
  • Computation Error Analysis in Digital Signal Processing Systems With Overscaled Supply Voltage

    Page(s): 517 - 526

    It has recently been demonstrated that digital signal processing systems can leverage unconventional voltage overscaling (VOS) to reduce energy consumption while maintaining satisfactory signal processing performance. Due to the computation-intensive nature of most signal processing algorithms, the energy saving potential largely depends on the behavior of computer arithmetic units in response to overscaled supply voltage. This paper shows that different hardware implementations of the same computer arithmetic function may respond to VOS very differently and result in different energy saving potentials. Therefore, the selection of appropriate computer arithmetic architecture is an important issue in voltage-overscaled signal processing system design. This paper presents an analytical method to estimate the statistics of computer arithmetic computation errors due to supply voltage overscaling. Compared with computation-intensive circuit simulations, this analytical approach can be several orders of magnitude faster and can achieve a reasonable accuracy. This approach can be used to choose the appropriate computer arithmetic architecture in voltage-overscaled signal processing systems. Finally, we carry out case studies on a coordinate rotation digital computer processor and a finite-impulse-response filter to further demonstrate the importance of choosing proper computer arithmetic implementations.

  • Self-Adaptive System for Addressing Permanent Errors in On-Chip Interconnects

    Page(s): 527 - 540

    We present a self-contained adaptive system for detecting and bypassing permanent errors in on-chip interconnects. The proposed system reroutes data on erroneous links to a set of spare wires without interrupting the data flow. To detect permanent errors at runtime, a novel in-line test (ILT) method using spare wires and a test pattern generator is proposed. In addition, an improved syndrome storing-based detection (SSD) method is presented and compared to the ILT method. Each detection method (ILT and SSD) is integrated individually into the noninterrupting adaptive system, and a case study is performed to compare them with Hamming and Bose-Chaudhuri-Hocquenghem (BCH) code implementations. In the presence of permanent errors, the probability of correct transmission in the proposed systems is improved by up to 140% over the standalone Hamming code. Furthermore, our methods achieve up to 38% area, 64% energy, and 61% latency improvements over the BCH implementation at comparable error performance.

  • Single- and Multi-core Configurable AES Architectures for Flexible Security

    Page(s): 541 - 552

    As networking technology advances, the gap between network bandwidth and network processing power widens. Information security issues add to the need for developing high-performance network processing hardware, particularly that for real-time processing of cryptographic algorithms. This paper presents a configurable architecture for Advanced Encryption Standard (AES) encryption, whose major building blocks are a group of AES processors. Each AES processor provides 219 block cipher schemes with a novel on-the-fly key expansion design for the original AES algorithm and an extended AES algorithm. In this multicore architecture, the memory controller of each AES processor is designed for the maximum overlapping between data transfer and encryption, reducing interrupt handling load of the host processor. This design can be applied to high-speed systems since its independent data paths greatly reduce the input/output bandwidth problem. A test chip has been fabricated for the AES architecture, using a standard 0.25-μm CMOS process. It has a silicon area of 6.29 mm², containing about 200,500 logic gates, and runs at a 66-MHz clock. In electronic codebook (ECB) and cipher-block chaining (CBC) cipher modes, the throughput rates are 844.9, 704, and 603.4 Mb/s for 128-, 192-, and 256-b keys, respectively. In order to achieve 1-Gb/s throughput (including overhead) at the worst case, we design a multicore architecture containing three AES processors with 0.18-μm CMOS process. The throughput rate of the architecture is between 1.29 and 3.75 Gb/s at 102 MHz. The architecture performs encryption and decryption of large data with 128-b key in CBC mode using on-the-fly key generation and composite field S-box, making it more cost effective (with better thousand-gate/gigabit-per-second ratio) than conventional methods.
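
The single-core throughput figures quoted in this abstract are consistent with a simple model in which one 128-b block completes every Nr clock cycles, where Nr is the standard AES round count (10, 12, or 14). The abstract does not state the cycle count per block, so the one-round-per-cycle model below is an assumption, and the function name `throughput_mbps` is illustrative only.

```python
# Back-of-the-envelope check of the ECB/CBC throughput figures quoted above.
# Assumption (not stated in the abstract): one AES round per clock cycle,
# so a 128-b block takes Nr cycles, with Nr = 10, 12, 14 for 128-, 192-, 256-b keys.

BLOCK_BITS = 128
ROUNDS = {128: 10, 192: 12, 256: 14}  # standard AES round counts


def throughput_mbps(key_bits: int, clock_mhz: float) -> float:
    """Throughput in Mb/s if one block completes every ROUNDS[key_bits] cycles."""
    return BLOCK_BITS * clock_mhz / ROUNDS[key_bits]


if __name__ == "__main__":
    for key_bits in (128, 192, 256):
        print(f"{key_bits}-b key: {throughput_mbps(key_bits, 66.0):.1f} Mb/s")
```

At a nominal 66 MHz this model gives about 844.8, 704.0, and 603.4 Mb/s, which agrees with the quoted 844.9, 704, and 603.4 Mb/s to within rounding of the clock frequency.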

  • An Efficient Multimode Multiplier Supporting AES and Fundamental Operations of Public-Key Cryptosystems

    Page(s): 553 - 563

    This paper presents a highly efficient multimode multiplier supporting prime field, polynomial field, and matrix-vector multiplications based on an asymmetric word-based Montgomery multiplication (MM) algorithm. The proposed multimode 128 × 32 b multiplier provides throughput rates of 441 and 511 Mb/s for 256-b operands over GF(P) and GF(2^n), respectively, at a clock rate of 100 MHz. With 21 930 additional gates for Advanced Encryption Standard (AES), the multiplier is extended to provide 1.28-, 1.06-, and 0.91-Gb/s throughput rates for 128-, 192-, and 256-b keys, respectively. The comparison result shows that the proposed integration architecture outperforms others in terms of performance and efficiency for both AES and MM, which is essential in most public-key cryptosystems.

  • LOPASS: A Low-Power Architectural Synthesis System for FPGAs With Interconnect Estimation and Optimization

    Page(s): 564 - 577

    In this paper, we present a low-power architectural synthesis system (LOPASS) for field-programmable gate-array (FPGA) designs with interconnect power estimation and optimization. LOPASS includes three major components: 1) a flexible high-level power estimator for FPGAs considering the power consumption of various FPGA logic components and interconnects; 2) a simulated-annealing optimization engine that carries out resource selection and allocation, scheduling, functional unit binding, register binding, and interconnection estimation simultaneously to reduce power effectively; and 3) a k-cofamily-based register binding algorithm and an efficient port assignment algorithm that reduce interconnections in the data path through multiplexer optimization. The experimental results show that LOPASS produces promising results on latency optimization compared to an academic high-level synthesis tool SPARK. Compared to an early commercial high-level synthesis tool, namely, Synopsys Behavioral Compiler, LOPASS is 61.6% better on power consumption and 10.6% better on clock period on average. Compared to a current commercial tool, namely, Impulse C, LOPASS is 31.1% better on power reduction with an 11.8% penalty on clock period.

  • Improving FPGA Performance for Carry-Save Arithmetic

    Page(s): 578 - 590

    The selective use of carry-save arithmetic, where appropriate, can accelerate a variety of arithmetic-dominated circuits. Carry-save arithmetic occurs naturally in a variety of DSP applications, and further opportunities to exploit it can be exposed through systematic data flow transformations that can be applied by a hardware compiler. Field-programmable gate arrays (FPGAs), however, are not particularly well suited to carry-save arithmetic. To address this concern, we introduce the "field programmable counter array" (FPCA), an accelerator for carry-save arithmetic intended for integration into an FPGA as an alternative to DSP blocks. In addition to multiplication and multiply accumulation, the FPCA can accelerate more general carry-save operations, such as multi-input addition (e.g., add k > 2 integers) and multipliers that have been fused with other adders. Our experiments show that the FPCA accelerates a wider variety of applications than DSP blocks and improves performance, area utilization, and energy consumption compared with soft FPGA logic.
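
As a reminder of what multi-input carry-save addition does, the sketch below reduces k operands with word-wide 3:2 compressors and performs a single carry-propagate addition at the end. It is a generic software illustration on non-negative Python integers, not the FPCA design described in the abstract; the function names are hypothetical.

```python
from typing import List, Tuple

# Carry-save addition sketch: reduce k integers to at most two terms using
# 3:2 compressors, then do one carry-propagate addition at the end.
# Generic illustration on non-negative integers; not the FPCA itself.


def compress_3_to_2(a: int, b: int, c: int) -> Tuple[int, int]:
    """Bitwise full adder applied to whole words, so a + b + c == s + cy."""
    s = a ^ b ^ c                              # per-bit sum, ignoring carries
    cy = ((a & b) | (a & c) | (b & c)) << 1    # per-bit majority, moved up one position
    return s, cy


def carry_save_sum(operands: List[int]) -> int:
    """Add non-negative integers using carry-save reduction."""
    terms = list(operands)
    while len(terms) > 2:
        a, b, c = terms.pop(), terms.pop(), terms.pop()
        terms.extend(compress_3_to_2(a, b, c))  # three terms become two
    return sum(terms)                           # the single carry-propagate addition


if __name__ == "__main__":
    print(carry_save_sum([3, 5, 7, 9]))
```

No carry ripples across bit positions inside `compress_3_to_2`, which is why hardware compressor trees avoid the long carry chains of repeated ordinary additions.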

  • Initialization-Based Test Pattern Generation for Asynchronous Circuits

    Page(s): 591 - 601

    A novel test pattern generation method for asynchronous circuits is described and evaluated in detail. The method combines conventional pattern generation with hazard-free state initialization. Any type of asynchronous circuit can be processed, and all stuck-at faults, even those inside state-holding elements, such as C-elements, are considered. The results on some of the largest benchmarks ever used for asynchronous circuit testing show fault coverage on the order of 99% with no area overhead for (quasi-)delay-insensitive datapath circuits.

  • Statistical Leakage Estimation Based on Sequential Addition of Cell Leakage Currents

    Page(s): 602 - 615

    This paper presents a novel method for full-chip statistical leakage estimation that considers the impact of process variation. The proposed method considers the correlations among leakage currents in a chip and the state dependence of the leakage current of a cell for an accurate analysis. For an efficient addition of the cell leakage currents, we propose the virtual-cell approximation (VCA), which sums cell leakage currents sequentially by approximating their sum as the leakage current of a single virtual cell while preserving the correlations among leakage currents. By the use of the VCA, the proposed method efficiently calculates a full-chip leakage current. Experimental results using ISCAS benchmarks at various process variation levels showed that the proposed method provides an accurate result, with average leakage mean and standard deviation errors of 3.12% and 2.22%, respectively, when compared with the results of a Monte Carlo (MC) simulation-based leakage estimation. In terms of efficiency, the proposed method was also shown to be 5000 times faster than MC simulation-based leakage estimation and 9000 times faster than Wilkinson's method-based leakage estimation.

  • Robust Bioinspired Architecture for Optical-Flow Computation

    Page(s): 616 - 629

    Motion estimation from image sequences, called optical flow, has been deeply analyzed by the scientific community. Despite the number of different models and algorithms, none of them covers all problems associated with real-world processing. This paper presents a novel customizable architecture of a neuromorphic robust optical flow (multichannel gradient model) based on reconfigurable hardware with the properties of the cortical motion pathway, thus obtaining a useful framework for building future complex bioinspired real-time systems with high computational complexity. The presented architecture is customizable and adaptable, while emulating several neuromorphic properties, such as the use of several information channels of small bit width, which is the nature of the brain. This paper includes the resource usage and performance data, as well as a comparison with other systems. This hardware platform has many application fields in difficult environments due to its bioinspired nature and robustness properties, and it can be used as a starting point in more complex systems.

  • A Fast Heuristic Algorithm for Multidomain Clock Skew Scheduling

    Page(s): 630 - 637

    In the most general form, clock skew scheduling (CSS) generates a dedicated clock delay for each individual sequential component in the clock distribution network in order to minimize the clock period. Multidomain CSS (MDCSS) relieves this requirement. Instead, sequential components are grouped into several clusters (called clock domains), each of which has a uniform clock delay for all registers within that domain. The skew values of clock domains are provided by a set of deskew buffers with electrically programmable phase shifts and injected after the chip is manufactured. This technique is attractive since, due to process variations, it is becoming overwhelmingly difficult to create precise clock network delays for all sequential elements in a design globally. In this paper, we present a fast algorithm for determining the minimum number of clock domains to be used by MDCSS. The exact solution to this problem cannot be found within a reasonable time if the number of clock domains increases beyond three domains. We show that, even with a small-size circuit, in order to obtain the minimum clock period, more than three clock domains may be required. Therefore, a fast heuristic algorithm is needed to identify these domains. To the best of our knowledge, we present the first efficient heuristic algorithm for this problem. For large benchmark circuits, we solve the problem within 14.7 min on average (as high as 31.7 min for the worst case), while a commercial mixed-integer linear program solver cannot finish in over 5 h. Furthermore, our results show that, for 19 out of 21 small- and medium-size benchmarks, our algorithm yields the optimal solution.

  • Analysis and Design of a Multistage CMOS Band-Pass Low-Noise Preamplifier for Ultrawideband RF Receiver

    Page(s): 638 - 651

    A CMOS low-noise preamplifier for application in a 3.1-10.6-GHz ultrawideband radio-frequency (RF) receiver system is presented. This is essentially a wideband-pass multistage RF preamplifier using a cascade of a three-segment band-pass LC π-section filter with a common-gate stage as the front end. Fundamental design analysis in terms of gain, bandwidth, noise, and impedance matching for the amplifier is presented in detail. The preamplifier was fabricated using the low-cost TSMC 0.18-μm 6M1P CMOS process technology. The amplifier delivered a buffered power gain (S21) of ≈14 dB with a -3-dB bandwidth (between the corner frequencies) of around 7.5 GHz. It consumed around 30 mW from a 2.5-V supply voltage. It had a minimum passband noise figure of around 4.7 dB, an input-referred third-order intercept point of -5.3 dBm, and reverse isolation (S12) under -65 dB.

  • Area and Power Optimization of High-Order Gain Calibration in Digitally-Enhanced Pipelined ADCs

    Page(s): 652 - 657

    Digital calibration techniques are widely utilized to linearize pipelined analog-to-digital converters (ADCs). However, their power dissipation can be prohibitively high, particularly when high-order gain calibration is needed. This paper demonstrates the need for high-order gain calibration in pipelined ADCs designed using low-gain opamps in scaled digital CMOS. For high-order gain calibration, this paper then proposes a design methodology to optimize the data precision (number of bits) within the digital calibration unit. Thus, the power dissipation and chip area of the calibration unit can be minimized, without affecting the ADC linearity. A 90-nm field-programmable gate array synthesis of a second-order gain calibration unit shows that the proposed optimization methodology results in 53% and 30% reductions in digital power dissipation and chip area, respectively.

  • Techniques to Prioritize Paths for Diagnosis

    Page(s): 658 - 661

    Existing techniques for path delay fault (PDF) diagnosis prune fault-free candidates using nonfailing patterns but fail to reduce the size of the suspect set significantly. This paper presents two alternative techniques that can be applied in a postprocessing manner to further reduce the suspect set by prioritizing paths using only the failing patterns. Experimental results on the ISCAS benchmarks demonstrate that they are time and memory efficient.

  • A High-Performance Three-Engine Architecture for H.264/AVC Fractional Motion Estimation

    Page(s): 662 - 666

    Variable-block-size motion estimation (VBSME) is one of the contributors to H.264/Advanced Video Coding (AVC)'s excellent coding efficiency. Due to its high computational complexity, however, VBSME needs acceleration for real-time high-resolution applications. We propose a high-performance hardware architecture for H.264/AVC fractional motion estimation. Our architecture consists of three parallel processing engines, one for 4 × 4 and 8 × 8 blocks, one for 8 × 4 and 4 × 8 blocks, and another for the remaining types of blocks. In addition, we propose a resource-sharing scheme which saves 33% of hardware cost for the computation of the sum of absolute transformed difference. Synthesized into a Taiwan Semiconductor Manufacturing Company (TSMC) 180-nm CMOS cell library, our 321-K gate design only needs to run at 154 MHz when encoding a 1920 × 1088 video at 30 frames per second. Compared with the most comparable previous work, which consumes 311 K gates and runs at 200 MHz, our proposed architecture is more efficient.
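
To put the quoted operating point in perspective, the short calculation below converts it into a clock-cycle budget per macroblock. It assumes the standard 16 × 16 H.264 macroblock and that every macroblock of every frame passes through the engine; neither the budget nor the helper name `cycles_per_macroblock` comes from the abstract.

```python
# Cycle budget per macroblock implied by the quoted operating point.
# Assumptions (not from the abstract): 16x16 macroblocks, and every
# macroblock of every frame is processed by the motion estimation engine.

MB_SIZE = 16  # H.264 macroblock edge length in luma samples


def cycles_per_macroblock(width: int, height: int, fps: float, clock_hz: float) -> float:
    """Average number of clock cycles available for each macroblock."""
    mbs_per_frame = (width // MB_SIZE) * (height // MB_SIZE)
    return clock_hz / (mbs_per_frame * fps)


if __name__ == "__main__":
    budget = cycles_per_macroblock(1920, 1088, 30, 154e6)
    print(f"{budget:.0f} cycles per macroblock")
```

A 1920 × 1088 frame holds 120 × 68 = 8160 macroblocks, so 30 frames per second at 154 MHz leaves roughly 629 cycles for each one.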

  • Modeling and Analysis of the Nonrectangular Gate Effect for Postlithography Circuit Simulation

    Page(s): 666 - 670

    For nanoscale CMOS devices, gate roughness has a severe impact on the device I-V characteristics, particularly in the subthreshold region. In particular, the nonrectangular gate (NRG) geometries are caused by subwavelength lithography and have relatively low spatial frequency. In this paper, we present an analytical approach to model NRG effects on I-V characteristics. To predict the change of I-V characteristics due to the NRG effect, the proposed model converts the postlithography gate profile into an equivalent gate length (Le), which is a function of the gate bias voltage but independent of the drain bias voltage. We demonstrate the accuracy of this approach by comparing it to TCAD simulation results for 65-nm technology. The new Le model is readily integrated into standard transistor models in traditional circuit simulation tools, such as SPICE, for both dc and transient analyses. We further develop a generic procedure to systematically extract the Le value from the postlithography gate profile. The interaction with the narrow-width effect is also efficiently incorporated into the proposed algorithm. TCAD verification demonstrates that the proposed Le model is simple to implement, scalable with both transistor geometries and bias conditions, and continuous across all the operation regions.

  • Configuration Locking and Schedulability Estimation for Reduced Reconfiguration Overheads of Reconfigurable Systems

    Page(s): 671 - 674

    Dynamically reconfigurable field-programmable gate arrays (FPGAs) hold the promise of providing a virtual hardware resource in which hardware circuits can be dynamically scheduled onto the available FPGA resources. However, reconfiguring an FPGA can incur significant performance and energy overheads. This paper analyzes the relationship between several hardware task scheduling algorithms and their impact on the number of reconfigurations required to execute a set of hardware tasks. In addition, three new hardware scheduling algorithms, specifically designed to reduce the number of required reconfigurations, are presented and analyzed. By selectively locking configurations within the reconfigurable tiles of an FPGA, significant reductions in the number of required reconfigurations can be achieved.

  • Post-Manufacture Tuning for Nano-CMOS Yield Recovery Using Reconfigurable Logic

    Page(s): 675 - 679

    In this paper, an architectural framework for post-silicon tuning of nanoscale CMOS circuits is developed. The tuning methodology is driven by a "tunable" gate design that allows the gate to be switched from a high-speed/high-power mode to a low-speed/low-power mode under digital control. A small number of "critical" logic gates are replaced with tunable gates for post-silicon power-performance tuning. In addition, supply voltage and body bias can be employed as hardware "tuning knobs" as well to deal with delay and leakage variations. After silicon is manufactured, the hardware "knobs" are programmed through the use of an implicit self-test methodology that can be exercised by the proposed self-adaptation architectural framework. It is seen that the delay yield can be improved by an average of 40% with minimal impact on area.

  • Accurate Predictive Interconnect Modeling for System-Level Design

    Page(s): 679 - 684

    We propose new accurate predictive models for the delay, power, and area of buffered interconnects to enable a more effective system-level design exploration with existing and future nanometer technology processes. We show that our models are significantly more accurate than previous models, essentially matching sign-off analyses. We integrate our models in the COSI-OCC communication synthesis infrastructure and show how they impact the feasibility and optimality of the network-on-chip architectures that are synthesized by this tool.

  • An Approach for Adaptive DRAM Temperature and Power Management

    Page(s): 684 - 688

    High-performance DRAMs are providing increasing memory access bandwidth to processors, which is leading to high power consumption and operating temperature in DRAM chips. In this paper, we propose a customized low-power technique for high-performance DRAM systems to improve DRAM page hit rate by buffering write operations that may incur page misses. This approach reduces DRAM system power consumption and temperature without any performance penalty. We combine the throughput-aware page-hit-aware write buffer (TAP) with low-power-state-based techniques for further power and temperature reduction, namely, TAP-low. Our experiments show that a system with TAP-low could reduce the total DRAM power consumption by up to 68.6% (19.9% on average). The steady-state temperature can be reduced by as much as 7.84°C and 2.55°C on average across eight representative workloads.

  • IEEE Transactions on Very Large Scale Integration (VLSI) Systems society information

    Page(s): C3
    Freely Available from IEEE

Aims & Scope

Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing, and systems applications. Generation of specifications, design, and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor, and process levels.

To address this critical area through a common forum, the IEEE Transactions on VLSI Systems was founded. The editorial board, consisting of international experts, invites original papers which emphasize the novel system integration aspects of microelectronic systems, including interactions among system design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chips and wafer fabrication, testing and packaging, and system level qualification. Thus, the coverage of this Transactions focuses on VLSI/ULSI microelectronic system integration.

Topics of special interest include, but are not strictly limited to, the following:
  • System Specification, Design and Partitioning
  • System-level Test
  • Reliable VLSI/ULSI Systems
  • High Performance Computing and Communication Systems
  • Wafer Scale Integration and Multichip Modules (MCMs)
  • High-Speed Interconnects in Microelectronic Systems
  • VLSI/ULSI Neural Networks and Their Applications
  • Adaptive Computing Systems with FPGA components
  • Mixed Analog/Digital Systems
  • Cost, Performance Tradeoffs of VLSI/ULSI Systems
  • Adaptive Computing Using Reconfigurable Components (FPGAs)

Meet Our Editors

Editor-in-Chief
Yehea Ismail
CND Director
American University of Cairo and Zewail City of Science and Technology
New Cairo, Egypt
y.ismail@aucegypt.edu