Very Large Scale Integration (VLSI) Systems, IEEE Transactions on

Issue 9 • Date Sept. 2008

  • Table of contents

    Publication Year: 2008 , Page(s): C1
  • IEEE Transactions on Very Large Scale Integration (VLSI) Systems publication information

    Publication Year: 2008 , Page(s): C2
  • Two-Phase Fine-Grain Sleep Transistor Insertion Technique in Leakage Critical Circuits

    Publication Year: 2008 , Page(s): 1101 - 1113
    Cited by:  Papers (1)

    Sleep transistor (ST) insertion is a valuable leakage reduction technique in circuit standby mode. Fine-grain sleep transistor insertion (FGSTI) makes it easier to guarantee circuit functionality and improve circuit noise margins. In this paper, we introduce a novel two-phase FGSTI technique that consists of ST placement and ST sizing. These two phases are formally modeled using mixed integer linear programming (MILP) models. When the circuit timing relaxation is not large enough to assign STs everywhere, the leakage feedback (LF) gates used to avoid floating states induce a large area and dynamic power overhead. An extended multi-objective ST placement model is further proposed to reduce the leakage current and the number of LF gates simultaneously. Finally, heuristic algorithms are developed to speed up the ST placement phase. Our experimental results on the ISCAS'85 benchmarks reveal that: 1) the two-phase FGSTI technique achieves better results than the simultaneous ST placement and sizing method; 2) when the circuit timing relaxation varies from 0% to 5%, the multi-objective ST placement model achieves on average a 4× to 9× reduction in the number of LF gates, while the leakage difference is only about 8% of the original circuit leakage; and 3) our heuristic algorithm is 1000× faster than the MILP method with an acceptable loss of accuracy.

  • A Comparative Study Between Static and Dynamic Sleep Signal Generation Techniques for Leakage Tolerant Designs

    Publication Year: 2008 , Page(s): 1114 - 1126
    Cited by:  Papers (6)

    Power gating techniques are rapidly gaining popularity for managing the leakage power consumption of functional units in deep-submicrometer microprocessors. Power gating relies on an input sleep signal to put the functional unit into a low-leakage mode. However, power gating techniques in general inherently lack information about the utilization profile of the functional units they manage. This limitation is usually handled either statically, by using a fixed-length counter that generates the sleep signal when the functional unit has been idle for a specified number of cycles, or dynamically, by changing the number of cycles before the sleep signal is generated depending on the previous history of operation. In this paper, a comparative study of the static and dynamic approaches with respect to the power-performance tradeoff is presented. It is shown that the dynamic sleep signal generator is capable of tracking the operation of the functional units while achieving accuracies of up to 90%, compared to an average of 40%-60% for the static sleep signal generator (SSSG). Additionally, it saves up to 80% more leakage than the SSSG. This study helps circuit designers choose between the two techniques depending on the power-gated circuit.
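
    As a rough illustration of the two approaches described above, the sketch below contrasts a fixed-length idle counter with a history-based variant whose threshold adapts after each idle period. The adaptation rule, the `break_even` parameter, and all class and function names are hypothetical assumptions made for illustration; they are not the generators evaluated in the paper, and wake-up latency is not modeled.

```python
from dataclasses import dataclass


@dataclass
class StaticSleepGenerator:
    """Fixed-length idle counter: request sleep after `threshold` idle cycles."""
    threshold: int
    idle: int = 0

    def step(self, busy: bool) -> bool:
        # Any activity resets the counter; otherwise count consecutive idle cycles.
        self.idle = 0 if busy else self.idle + 1
        return self.idle >= self.threshold


@dataclass
class DynamicSleepGenerator:
    """Hypothetical history-based variant: the threshold adapts after each idle period."""
    threshold: int
    break_even: int  # assumed sleep cycles needed to amortize one wake-up
    min_threshold: int = 1
    max_threshold: int = 64
    idle: int = 0

    def step(self, busy: bool) -> bool:
        if not busy:
            self.idle += 1
            return self.idle >= self.threshold
        # An idle period just ended (or the unit stayed busy): adapt the threshold.
        if self.idle >= self.threshold + self.break_even:
            # Long idle period: sleeping earlier would have saved more leakage.
            self.threshold = max(self.min_threshold, self.threshold - 1)
        elif self.idle >= self.threshold:
            # Slept too briefly to pay off: be more conservative next time.
            self.threshold = min(self.max_threshold, self.threshold + 1)
        self.idle = 0
        return False


def sleep_cycles(generator, trace) -> int:
    """Count the cycles in which the generator asserts the sleep signal."""
    return sum(generator.step(busy) for busy in trace)
```

    On a trace of ten 20-cycle idle periods, each preceded by one busy cycle, the static counter with a threshold of 8 asserts sleep for 130 cycles, while the adaptive version lowers its threshold after each long idle period and asserts sleep for 172.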

  • Static and Dynamic Temperature-Aware Scheduling for Multiprocessor SoCs

    Publication Year: 2008 , Page(s): 1127 - 1140
    Cited by:  Papers (38)  |  Patents (2)

    Thermal hot spots and high temperature gradients degrade reliability and performance, and increase cooling costs and leakage power. In this paper, we explore the benefits of temperature-aware task scheduling for multiprocessor system-on-a-chip (MPSoC) designs. We evaluate our techniques using workload characteristics collected from a real system by Sun's Continuous System Telemetry. We first solve the task scheduling problem statically using integer linear programming (ILP). The ILP solution is guaranteed to be optimal under the given task assumptions. We formulate ILPs for minimizing energy, balancing energy, and reducing hot spots, and provide an extensive comparison of their thermal behavior with that of our technique. Our static solution can reduce the frequency of hot spots by 35%, spatial gradients by 85%, and thermal cycles by 61% in comparison to the ILP for minimizing energy. We then design dynamic scheduling policies at the OS level with negligible performance overhead. Our adaptive dynamic policy reduces the frequency of high-magnitude thermal cycles and spatial gradients by around 50% and 90%, respectively, in comparison to state-of-the-art schedulers. Reactive thermal management strategies, such as thread migration, can be combined with our scheduling policy to further reduce hot spots, temperature variations, and the associated performance cost.

  • Low-Power Mixed-Signal CVNS-Based 64-Bit Adder for Media Signal Processing

    Publication Year: 2008 , Page(s): 1141 - 1150
    Cited by:  Papers (8)

    In this paper, the design of a mixed-signal 64-bit adder based on the continuous valued number system (CVNS) is presented. The 64-bit adder is built by cascading four 16-bit radix-2 CVNS adders. Truncated summation of the CVNS digits reduces the number of required interconnections in the system, which in turn reduces design complexity and hardware cost. The adder can perform one 64-bit, two 32-bit, four 16-bit, or eight 8-bit additions on demand for media signal processing applications. Its compact, low-power, and low-noise design makes it suitable for this type of application. The 64-bit adder, designed in TSMC 0.18-µm CMOS technology, has a worst-case delay of 1.5 ns, an energy dissipation of about 14 pJ, and a core area of 13 250 µm².
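
    The on-demand 64/32/16/8-bit operation described above amounts to partitioned (subword) addition, in which carries are not allowed to cross lane boundaries. The sketch below models only that lane behavior with ordinary binary integers; it does not model CVNS digit arithmetic, and the function name is a hypothetical choice for illustration.

```python
LANE_MODES = (64, 32, 16, 8)  # one 64-bit, two 32-bit, four 16-bit, or eight 8-bit additions


def partitioned_add(a: int, b: int, lane_bits: int, width: int = 64) -> int:
    """Add two `width`-bit words as independent `lane_bits`-wide lanes.

    The carry out of each lane is discarded, so no carry crosses a lane boundary.
    """
    if lane_bits not in LANE_MODES or width % lane_bits:
        raise ValueError("unsupported lane width")
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, width, lane_bits):
        lane_sum = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask)
        result |= (lane_sum & lane_mask) << shift  # drop this lane's carry out
    return result
```

    For example, adding 0x00010001 to 0x00FF00FF gives 0 in 8-bit mode (each 0xFF + 0x01 wraps within its lane) but 0x01000100 in 16-bit mode, where the carry stays inside the 16-bit lane.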

  • A New Modular Exponentiation Architecture for Efficient Design of RSA Cryptosystem

    Publication Year: 2008 , Page(s): 1151 - 1161
    Cited by:  Papers (20)  |  Patents (1)

    Modular exponentiation with a large modulus, which is usually accomplished by repeated modular multiplications, is widely used in public-key cryptosystems for secure data communications. To speed up the computation, the Montgomery modular multiplication algorithm is used to simplify quotient determination, and carry-save addition (CSA) is employed to reduce the critical path delay. In this paper, based on the inherent data dependency between the modular multiplication and square operations in the H-algorithm of modular exponentiation, we present a new modular exponentiation architecture with a unified modular multiplication/square module and show how to reduce the number of input operands for the CSA tree by mathematical manipulation. The developed architecture has the following advantages: 1) there is no need to convert the carry-save form of an operand into its binary representation at the end of each modular multiplication, so the time-consuming carry propagation is eliminated except in the final step that produces the result of the modular exponentiation; 2) the number of input operands for the CSA tree is reduced very efficiently; and 3) compared with a design using two distinct modular multiplication and square components, hardware is saved with very limited impact on the original critical path delay. Experimental results show that our modular exponentiation design has the lowest hardware complexity among existing designs and also outperforms them in terms of area-time (AT) complexity.
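
    To make the arithmetic concrete, the sketch below models radix-2 Montgomery multiplication and left-to-right (most-significant-bit-first) square-and-multiply exponentiation, which is what the H-algorithm usually denotes. It is a plain software model under that assumption: it uses ordinary integers rather than carry-save operands, and it does not reproduce the paper's unified multiplication/square module or CSA-tree reduction. Function names are illustrative.

```python
def montgomery_multiply(a: int, b: int, n: int, k: int) -> int:
    """Radix-2 Montgomery product a * b * 2^(-k) mod n, for odd n < 2^k and a, b < n."""
    t = 0
    for i in range(k):
        # Add the i-th partial product a_i * b.
        t += ((a >> i) & 1) * b
        # The quotient digit is just t's LSB: add n if t is odd so t becomes even, then halve.
        if t & 1:
            t += n
        t >>= 1
    # The loop keeps t < 2n, so one conditional subtraction completes the reduction.
    return t - n if t >= n else t


def mod_exp(m: int, e: int, n: int) -> int:
    """Left-to-right binary exponentiation m^e mod n in the Montgomery domain (n odd, m < n)."""
    k = n.bit_length()
    r = 1 << k
    m_bar = (m * r) % n  # map the base into the Montgomery domain
    x_bar = r % n        # Montgomery form of 1
    for bit in bin(e)[2:]:
        x_bar = montgomery_multiply(x_bar, x_bar, n, k)      # square for every bit
        if bit == "1":
            x_bar = montgomery_multiply(x_bar, m_bar, n, k)  # multiply when the bit is set
    return montgomery_multiply(x_bar, 1, n, k)  # map the result back out of the domain
```

    With the textbook toy RSA key n = 3233, e = 17, d = 2753, `mod_exp(65, 17, 3233)` returns 2790 and `mod_exp(2790, 2753, 3233)` recovers 65. In hardware, keeping the intermediate `t` in carry-save form is what avoids a full carry propagation in every iteration.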

  • On Parallelization of High-Speed Processors for Elliptic Curve Cryptography

    Publication Year: 2008 , Page(s): 1162 - 1175
    Cited by:  Papers (15)

    This paper discusses the parallelization of elliptic curve cryptography hardware accelerators using elliptic curves over binary fields GF(2^m). Elliptic curve point multiplication, which is the operation used in every elliptic curve cryptosystem, is hierarchical in nature, and parallelism can be exploited at different hierarchy levels, as shown in many publications. However, a comprehensive analysis of the effects of parallelization has not been presented previously. This paper provides tools for evaluating the use of parallelism and shows where it should be used to maximize efficiency. Special attention is given to a family of curves called Koblitz curves because they offer very efficient point multiplication. A new method that reduces the latency of point multiplication with parallel field arithmetic processors is introduced. It is shown to outperform previously presented multiple-field-multiplier techniques for Koblitz curves and for generic curves with fixed base points. A highly efficient general elliptic curve cryptography processor architecture is presented and analyzed. Based on this architecture and the analysis of the effects of parallelization, a few designs are implemented on an Altera Stratix II field-programmable gate array (FPGA).
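
    Field multiplication in GF(2^m) is the basic operation that the field arithmetic processors above execute. The sketch below is a bit-serial, polynomial-basis software model: addition is XOR (no carries), and the partial product is reduced by the irreducible polynomial whenever its degree reaches m. The internal structure of the paper's processors is not described in the abstract, so this is only an illustration; the constant `K163_POLY` is the pentanomial NIST recommends for m = 163, and the names are illustrative.

```python
# x^163 + x^7 + x^6 + x^3 + 1, the NIST-recommended reduction polynomial for m = 163.
K163_POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def gf2m_multiply(a: int, b: int, modulus: int) -> int:
    """Multiply two GF(2^m) elements given as bit vectors of polynomial coefficients.

    `modulus` is the degree-m reduction polynomial including its x^m term;
    `a` and `b` must already be reduced (degree < m).
    """
    m = modulus.bit_length() - 1
    result = 0
    while b:
        if b & 1:
            result ^= a      # add the current multiple of a (addition in GF(2) is XOR)
        b >>= 1
        a <<= 1              # multiply a by x
        if (a >> m) & 1:
            a ^= modulus     # reduce as soon as the degree reaches m
    return result
```

    The same routine works for any binary field; for instance, with the AES polynomial 0x11B it reproduces the FIPS-197 example {57}·{83} = {C1}.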

  • A Trace-Based Framework for Verifiable GALS Composition of IPs

    Publication Year: 2008 , Page(s): 1176 - 1186
    Cited by:  Papers (1)

    Composing intellectual property (IP) blocks running at different clock speeds over asynchronous communication links into a system-on-chip (SoC) design is a challenging task, especially when it comes to ensuring the functional correctness of the overall design. In this paper, we propose a trace-based framework that helps identify a class of IPs that can be composed into “correct-by-construction” globally asynchronous locally synchronous (GALS) designs whose correctness is maintained with respect to their synchronous compositions. Our notion of correctness is latency equivalence: the order of valid values on each signal is the same in the synchronous and the asynchronous compositions. We also describe the protocol to be inserted between the IPs to obtain this equivalence.
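
    The latency-equivalence criterion can be illustrated on finite traces. In the sketch below, each signal trace is a list of per-cycle values in which `None` marks a cycle with no valid value; this encoding and the function names are assumptions for illustration, not the paper's trace formalism. Two compositions are judged equivalent when every signal carries the same sequence of valid values, regardless of how many empty cycles separate them.

```python
from typing import Hashable, Mapping, Optional, Sequence

Trace = Sequence[Optional[Hashable]]  # None = no valid value in that cycle (assumed encoding)


def valid_values(trace: Trace) -> list:
    """Drop the empty cycles, keeping valid values in their original order."""
    return [v for v in trace if v is not None]


def latency_equivalent(sync_trace: Trace, async_trace: Trace) -> bool:
    """Two traces are latency equivalent if their valid-value sequences match."""
    return valid_values(sync_trace) == valid_values(async_trace)


def compositions_equivalent(sync: Mapping[str, Trace], asyn: Mapping[str, Trace]) -> bool:
    """Check latency equivalence signal by signal; both compositions must expose the same signals."""
    return sync.keys() == asyn.keys() and all(
        latency_equivalent(sync[name], asyn[name]) for name in sync
    )
```

    For example, the synchronous trace `[1, 2, 3]` and the asynchronous trace `[None, 1, None, 2, None, None, 3]` are latency equivalent, whereas `[1, 3, 2]` is not, because the order of valid values changed.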

  • An Over-1-Gb/s Transceiver Core for Integration Into Large System-on-Chips for Consumer Electronics

    Publication Year: 2008 , Page(s): 1187 - 1198
    Cited by:  Papers (1)

    This paper describes an area-efficient 1.5-Gb/s transceiver core with spread spectrum clocking (SSC) capability that is suitable for integration into large system-on-chips (SoCs) for consumer electronics applications such as audio and video stream data transmission. To achieve a good balance between SSC performance and core area, a novel SSC scheme using a multi-level (hierarchical) phase-interpolator technique has been developed. This technique achieves a very fine clock phase shift of about 0.1 ps for precise and smooth frequency modulation. The SSC scheme is based on a digital feed-forward operation, which leads to a small area and good noise robustness for SoC integration. The core also has digital clock data recovery (CDR) with jitter tolerance enhancement and a simple adaptive data equalizer (AEQ). These functions also operate digitally and are controlled by digital codes, and the core assumes a multiphase clock for the digital SSC, CDR, and AEQ, supplied by a shared phase-locked loop (PLL) topology. A test chip including two of these cores with a shared PLL was fabricated. The core showed significant peak power reduction (-19 dB relative to the non-SSC case) and a small core area of 0.25 mm² in a 0.13-µm CMOS process, giving a remarkable ratio of peak power reduction to area of 76 dB/mm². Moreover, it achieved good jitter tolerance (flat 0.8 UI at >1 MHz) and stable data communication over shielded twisted pair (STP) cables ranging in length from 1 m to over 22 m.

  • Robust Concurrent Online Testing of Network-on-Chip-Based SoCs

    Publication Year: 2008 , Page(s): 1199 - 1209
    Cited by:  Papers (3)

    Lifetime concerns for complex system-on-a-chip (SoC) designs, caused by decreasing levels of reliability, motivate the development of solutions that ensure reliable operation. Any recovery scheme first requires the identification of failures in the system. Non-concurrent in-field testing is impractical because of its prohibitive cost in test power and test time. This research proposes the use of concurrent online testing (COLT) to circumvent these issues. A test infrastructure-intellectual property (TI-IP) is deployed within network-on-chip (NoC)-based SoC designs to provide online test support while managing the intrusion of testing into the applications executing in the system. This research describes the architecture and operation of a TI-IP capable of COLT. To address the scalability of this solution, we show how TI-IPs operate when more than one is deployed in an SoC. In the absence of benchmarks for the analysis of COLT, two baseline and eight TI-IP configuration variations within SoC test configurations were developed using application and test benchmarks from the research domain. Power profiles from the NoCSim simulation environment are reported, demonstrating how different configurations of TI-IPs would operate. A robust TI-IP protocol is also specified, and possible hazards and their mitigations are identified.

  • Offline and Online Aspects of Defragmenting the Module Layout of a Partially Reconfigurable Device

    Publication Year: 2008 , Page(s): 1210 - 1219
    Cited by:  Papers (2)

    Modern generations of field-programmable gate arrays (FPGAs) allow for partial reconfiguration. In an online context, where the sequence of modules to be loaded on the FPGA is unknown beforehand, repeated insertion and deletion of modules leads to progressive fragmentation of the available space, making defragmentation an important issue. We address this problem by proposing an online and an offline component for defragmenting the available space. Defragmenting the module layout on a reconfigurable device corresponds to solving a 2D strip packing problem. Problems of this type are NP-hard in the strong sense, and previous algorithmic results are rather limited. Based on a graph-theoretic characterization of feasible packings, we develop a method that can solve 2D defragmentation instances of practical size to optimality. Our approach is validated on a set of benchmark instances. We also discuss a simple strategy for dealing with online scenarios, called “least-interference fit” (LIF); we give a number of analytic results that allow a comparison of LIF with the best offline solution, and demonstrate that it works well on benchmark instances of moderate size.

  • Dynamically De-Skewable Clock Distribution Methodology

    Publication Year: 2008 , Page(s): 1220 - 1229
    Cited by:  Papers (2)

    In a typical clock distribution scheme, a central clock signal is distributed to several sites on the integrated circuit (IC). Local regenerators at these sites buffer the clock signal for the logic in the regions close to them. Minimizing the skew between the clocks at these regeneration sites is critical, and it is becoming harder because of increasing intra-die process variations. In this paper, we describe a novel technique to distribute a clock signal from a central location to several sites on a VLSI IC. Our technique uses a buffered H-tree and includes circuitry to dynamically remove any skew that may result from intra-die process variations. While existing approaches to deskewing a clock tree use several phase detection circuits (the number of phase detectors depends on the number of clock regenerators), our method requires only one phase detector. Also, unlike in existing techniques, the resolution of the phase detector is inconsequential in our approach. Our deskewing technique can be applied dynamically, either at boot time or periodically during the operation of the IC. Using a six-level H-tree clock distribution network with process variations deliberately included, we demonstrate that our technique can reduce skews as high as 300 ps down to just 3 ps. We compare our clock tree with traditional buffered and unbuffered H-tree networks.

  • Reducing Interconnect Delay Uncertainty via Hybrid Polarity Repeater Insertion

    Publication Year: 2008 , Page(s): 1230 - 1239
    Cited by:  Papers (9)

    Capacitive crosstalk between adjacent signal wires has a significant effect on the performance and delay uncertainty of point-to-point on-chip buses in deep-submicrometer (DSM) VLSI technologies. We propose a hybrid polarity repeater insertion technique that combines inverting and non-inverting repeater insertion to achieve a constant average effective coupling capacitance per wire transition for all possible switching patterns. Theoretical analysis shows the superiority of the proposed method in terms of performance and delay uncertainty compared to conventional and staggered repeater insertion methods. Simulations at the 90-nm node on the semi-global METAL5 layer show around 25% reduction in worst-case delay and around 86% reduction in delay uncertainty compared to a standard bus with an optimal repeater configuration. The reduction in worst-case capacitive coupling reduces peak energy, which is a critical factor for thermal regulation and packaging. Isodelay comparisons with the standard bus show that the proposed technique achieves a considerable reduction in total buffer area, which in turn reduces average energy and peak current. Comparisons with staggered repeater insertion, one of the simplest and most effective crosstalk reduction techniques in the literature, show that hybrid polarity repeater insertion offers higher performance, less delay uncertainty, and reduced sensitivity to repeater placement variation.
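
    A first-order way to see why a constant average coupling is possible uses the Miller coupling factor (MCF): a switching victim sees 0x the coupling capacitance when its neighbor switches in the same direction, 1x when the neighbor is quiet, and 2x when it switches the opposite way. The sketch below assumes equal-length segments and that the repeater arrangement makes each neighbor appear inverted relative to the victim in exactly half of the segments; that arrangement is an illustrative assumption, not the paper's placement, and averaging coupling over segments is only a first-order proxy for delay.

```python
from itertools import product

TRANSITIONS = (-1, 0, 1)  # falling, quiet, rising


def miller_factor(victim: int, neighbor: int) -> int:
    """Effective coupling multiplier for a switching victim: 0 (same), 1 (quiet), 2 (opposite)."""
    return 1 - victim * neighbor


def average_coupling(victim: int, neighbor: int, polarities) -> float:
    """Average MCF over segments; polarity -1 means the neighbor appears inverted in that segment."""
    return sum(miller_factor(victim, p * neighbor) for p in polarities) / len(polarities)


def coupling_spread(polarities):
    """(min, max) total coupling factor on a middle wire over all switching patterns."""
    totals = [
        average_coupling(victim, left, polarities) + average_coupling(victim, right, polarities)
        for victim in (-1, 1)  # the victim always switches
        for left, right in product(TRANSITIONS, repeat=2)
    ]
    return min(totals), max(totals)


STANDARD = (1, 1, 1, 1)  # all repeaters preserve relative polarity
HYBRID = (1, -1, 1, -1)  # neighbor polarity relative to the victim alternates per segment
```

    Under these assumptions, `coupling_spread(STANDARD)` ranges from 0 to 4 (pattern-dependent), while `coupling_spread(HYBRID)` is 2 for every pattern, because a neighbor that switches the same way in one segment switches the opposite way in the next.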

  • Adaptive On-Chip Power Supply With Robust One-Cycle Control Technique

    Publication Year: 2008 , Page(s): 1240 - 1243
    Cited by:  Papers (3)

    In this paper, an integrated adaptive-output switching converter is proposed. The design employs one-cycle control for fast line regulation and a single outer loop for tight load regulation and fine tuning. A switched-capacitor integrator is introduced into the one-cycle control to obtain positive integration with a single positive power supply, allowing a standard low-cost CMOS fabrication process to be used. To improve efficiency, a dynamic loss control technique is presented. The converter was designed and fabricated in a 0.35-µm N-well CMOS process. With a supply voltage of 3 V, a voltage ripple of less than ±20 mV is measured. The maximum efficiency is 92% at a load power of 475 mW. The converter exhibits a tracking speed of 23.75 µs/V for both start-up and reference voltage transitions. The recovery time for a 20% load change is approximately 9.5 µs.
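
    The reason one-cycle control gives fast line regulation can be shown with an idealized simulation of a single switching period: the switch turns on at the start of the period, the switched voltage is integrated, and the switch turns off as soon as that integral equals the reference integrated over the whole period. The average switched voltage therefore equals the reference within one cycle, whatever the input voltage. This sketch assumes an ideal buck-type switch, a constant input during the cycle, and a normalized period; it does not model the paper's switched-capacitor integrator or outer loop, and the function name is illustrative.

```python
def one_cycle_duty(v_in: float, v_ref: float, steps: int = 1000) -> float:
    """Duty cycle chosen by ideal one-cycle control within one normalized period (T = 1)."""
    dt = 1.0 / steps
    integral = 0.0
    for n in range(steps):
        integral += v_in * dt    # integrate the switched voltage while the switch is on
        if integral >= v_ref:    # reference integrated over the full period is v_ref * T
            return (n + 1) / steps
    return 1.0                   # input too low: the switch stays on for the whole period
```

    For a 1.2-V reference, the returned duty cycle is about 0.48 at 2.5 V and about 0.33 at 3.6 V, so the product of duty cycle and input voltage stays near 1.2 V without waiting for a feedback loop to settle.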

  • Low-Complexity Policies for Energy-Performance Tradeoff in Chip-Multi-Processors

    Publication Year: 2008 , Page(s): 1243 - 1248
    Cited by:  Papers (2)

    Chip multiprocessors (CMPs) use multiple energy-efficient processing elements (PEs) to deliver high performance while reducing energy consumption. Dynamic frequency-voltage scaling (DVS) balances performance and energy consumption by varying the PEs' frequency-voltage workpoints to save energy while meeting performance requirements. We consider multi-task CMP applications with unknown workloads, and dynamically set workpoints to minimize . Heuristic policies for serial and parallel task graphs are investigated. We compare these policies to a theoretical bound and show that they achieve good results with low complexity. In most cases, the simplest policy, which usually assigns constant workpoints, is also the most cost-effective one.

  • An Accumulator-Based Compaction Scheme For Online BIST of RAMs

    Publication Year: 2008 , Page(s): 1248 - 1251
    Cited by:  Papers (5)

    Transparent built-in self-test (BIST) schemes for RAM modules preserve the memory contents during periodic testing. Symmetric transparent BIST skips the signature prediction phase required in traditional transparent BIST schemes, achieving a considerable reduction in test time. In the symmetric transparent BIST schemes proposed to date, output data compaction is performed using either single-input or multiple-input shift registers whose characteristic polynomials are modified during testing. In this paper, the use of accumulator modules for output data compaction in symmetric transparent BIST for RAMs is proposed. It is shown that this considerably reduces the hardware overhead, the complexity of the controller, and the aliasing probability.
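
    A simplified illustration of why accumulator compaction can avoid signature prediction: if every word is read once with its original contents and once complemented, then for n-bit words x + ~x = 2^n - 1, so the fault-free accumulator value is N(2^n - 1) mod 2^n for N words, independent of what the memory holds. The read/complement/restore sequence, the stuck-at-0 fault model, and all names below are hypothetical assumptions; this is not the paper's march sequence or accumulator design.

```python
def fault_free_signature(n_words: int, width: int) -> int:
    """Expected accumulator value: each word contributes x + ~x = 2^width - 1."""
    mask = (1 << width) - 1
    return (n_words * mask) & mask


def run_transparent_test(memory: list, width: int, stuck_at_0=None) -> int:
    """Read each word, write back its complement, read again, then restore the original.

    `stuck_at_0` is an optional hypothetical (address, bit) fault that forces that cell to read 0.
    Returns the accumulator signature (modular sum of all words read).
    """
    mask = (1 << width) - 1

    def read(addr: int) -> int:
        word = memory[addr]
        if stuck_at_0 is not None and stuck_at_0[0] == addr:
            word &= ~(1 << stuck_at_0[1]) & mask  # faulty cell always reads 0
        return word

    acc = 0
    for addr in range(len(memory)):
        original = read(addr)
        acc = (acc + original) & mask       # accumulate the original contents
        memory[addr] = ~original & mask     # complement the word in place
        acc = (acc + read(addr)) & mask     # accumulate the complemented read
        memory[addr] = original             # restore contents (transparency)
    return acc
```

    For four 8-bit words the fault-free signature is 252 for any contents, so it can be checked without a prediction pass; a stuck-at-0 cell at bit b shifts the signature by -2^b modulo 256.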

  • Injection-Locked Clocking: A Low-Power Clock Distribution Scheme for High-Performance Microprocessors

    Publication Year: 2008 , Page(s): 1251 - 1256
    Cited by:  Papers (6)

    We propose injection-locked clocking (ILC) to combat deteriorating clock skew and jitter and to reduce power consumption in high-performance microprocessors. In the new clocking scheme, injection-locked oscillators are used as local clock receivers. Compared to conventional clocking with buffered trees or grids, ILC can achieve better power efficiency, lower jitter, and much simpler skew compensation thanks to its built-in deskewing capability. Unlike other alternatives, ILC is fully compatible with conventional clock distribution networks. In this paper, a quantitative study based on circuit- and microarchitecture-level simulations is performed. The Alpha 21264 is used as the baseline processor and is scaled to 0.13 µm and 3 GHz. Simulations show 20-ps and 23-ps jitter reductions and 10.1% and 17% power savings in two ILC configurations. A test chip distributing a 5-GHz clock was implemented in a standard 0.18-µm CMOS technology and achieved excellent jitter performance and a deskew range of up to 80 ps.
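
    The built-in deskewing capability can be understood through Adler's classic first-order model of injection locking, which is background theory and not taken from the paper. A locked oscillator's steady-state phase offset from the injected clock satisfies sin(phi) = (f0 - f_inj) / f_L, where the one-sided lock range is f_L = f0 / (2Q) * (I_inj / I_osc). Tuning the local free-running frequency f0 therefore shifts the output phase, which is what allows skew to be compensated. The Q and injection-ratio values in the example are illustrative assumptions, and the sign convention for phi varies between texts.

```python
import math


def lock_range(f0: float, q: float, injection_ratio: float) -> float:
    """One-sided lock range in Hz (Adler, first order): f0 / (2Q) * (I_inj / I_osc)."""
    return f0 / (2 * q) * injection_ratio


def locked_phase_offset(f_inj: float, f0: float, q: float, injection_ratio: float):
    """Steady-state phase offset (rad) between injected and output clocks, or None if unlocked."""
    detuning = (f0 - f_inj) / lock_range(f0, q, injection_ratio)
    if abs(detuning) > 1:
        return None  # outside the lock range: the oscillator is pulled but does not lock
    return math.asin(detuning)


def phase_to_time(phase: float, f: float) -> float:
    """Convert a phase offset in radians to a time offset in seconds at frequency f."""
    return phase / (2 * math.pi * f)
```

    With f0 = 3 GHz, Q = 5, and an injection ratio of 0.2, the lock range is 60 MHz; detuning the oscillator by 30 MHz from a 2.97-GHz injected clock gives a phase offset of pi/6, roughly 28 ps at that frequency.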

  • IEEE Transactions on Very Large Scale Integration (VLSI) Systems society information

    Publication Year: 2008 , Page(s): C3
  • IEEE Transactions on Very Large Scale Integration (VLSI) Systems Information for authors

    Publication Year: 2008 , Page(s): C4

Aims & Scope

Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing, and systems applications. Generation of specifications, design, and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor, and process levels.

To address this critical area through a common forum, the IEEE Transactions on VLSI Systems was founded. The editorial board, consisting of international experts, invites original papers which emphasize the novel system integration aspects of microelectronic systems, including interactions among system design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chips and wafer fabrication, testing and packaging, and system level qualification. Thus, the coverage of this Transactions focuses on VLSI/ULSI microelectronic system integration.

Topics of special interest include, but are not strictly limited to, the following:
  • System Specification, Design and Partitioning
  • System-level Test
  • Reliable VLSI/ULSI Systems
  • High Performance Computing and Communication Systems
  • Wafer Scale Integration and Multichip Modules (MCMs)
  • High-Speed Interconnects in Microelectronic Systems
  • VLSI/ULSI Neural Networks and Their Applications
  • Adaptive Computing Systems with FPGA components
  • Mixed Analog/Digital Systems
  • Cost, Performance Tradeoffs of VLSI/ULSI Systems
  • Adaptive Computing Using Reconfigurable Components (FPGAs)

Meet Our Editors

Editor-in-Chief

Krishnendu Chakrabarty
Department of Electrical Engineering
Duke University
Durham, NC 27708 USA
Krish@duke.edu