
Very Large Scale Integration (VLSI) Systems, IEEE Transactions on

Issue 9 • Date Sept. 2009


  • Table of contents

    Publication Year: 2009 , Page(s): C1 - C4
    Freely Available from IEEE
  • IEEE Transactions on Very Large Scale Integration (VLSI) Systems publication information

    Publication Year: 2009 , Page(s): C2
    Freely Available from IEEE
  • On Topology Reconfiguration for Defect-Tolerant NoC-Based Homogeneous Manycore Systems

    Publication Year: 2009 , Page(s): 1173 - 1186
    Cited by:  Papers (21)  |  Patents (1)

    Homogeneous manycore systems are emerging for tera-scale computation and typically use a Network-on-Chip (NoC) as the communication scheme between embedded cores. Effective defect tolerance techniques are essential to improve the yield of such complex integrated circuits. We propose to achieve fault tolerance by employing redundancy at the core level instead of at the microarchitecture level. When faulty cores exist on-chip in this architecture, however, the physical topologies of different manufactured chips can vary significantly. How to reconfigure the system with the most effective NoC topology is a relevant research problem. In this paper, we first show that this problem is an instance of a well-known NP-complete problem. We then present novel solutions that not only maximize the performance of the on-chip communication scheme, but also provide a unified topology to the operating system and application software running on the processor. Experimental results show the effectiveness of the proposed techniques.

  • An Analytical Model for Soft Error Critical Charge of Nanometric SRAMs

    Publication Year: 2009 , Page(s): 1187 - 1195
    Cited by:  Papers (16)

    Scaling transistors to nanometer dimensions, coupled with the reduction of supply voltage, has made SRAMs more vulnerable to soft errors than ever before. The vulnerability has been accentuated by increased variability in device parameters. In this paper, we present an analytical model for critical charge in order to assess the soft error vulnerability of a 6T SRAM cell. The model takes into account the dynamic behavior of the cell and demonstrates a simple technique to decouple the nonlinearly coupled storage nodes. Decoupling the storage nodes enables solving the associated current equations to determine the critical charge for an exponential noise current. The critical charge model thus developed includes both NMOS and PMOS transistor parameters. Consequently, the model can estimate critical charge variations due to variability of transistor parameters and manufacturing defects, such as resistive contacts and vias. In addition, the model can serve as a tool to optimize the hibernation voltage of low-power SRAMs or the size of the MIM capacitor per cell in order to achieve a target soft error robustness. Critical charge calculated by the model is in good agreement with SPICE simulations for a commercial 90-nm CMOS process, with a maximum discrepancy of less than 5%.

  • Low-Power Clocked-Pseudo-NMOS Flip-Flop for Level Conversion in Dual Supply Systems

    Publication Year: 2009 , Page(s): 1196 - 1202
    Cited by:  Papers (11)

    Clustered voltage scaling (CVS) is an effective way to decrease power dissipation. One of the design challenges is the design of an efficient level converter with low power and delay overheads. In this paper, level-shifting flip-flop topologies are investigated. Different level-shifting schemes are analyzed and classified into groups: differential style, n-type metal-oxide-semiconductor (NMOS) pass-transistor style, and precharged style. An efficient level-shifting scheme, the clocked-pseudo-NMOS (CPN) level conversion scheme, is presented. A novel level conversion flip-flop (CPN-LCFF) is proposed, which combines the conditional discharge technique and the pseudo-NMOS technique. In terms of power and delay, the new CPN-LCFF outperforms previous LCFFs by over 8% and 15.6%, respectively.

  • Sleep Transistor Sizing and Adaptive Control for Supply Noise Minimization Considering Resonance

    Publication Year: 2009 , Page(s): 1203 - 1211
    Cited by:  Papers (2)

    Conventional sleep transistor sizing schemes do not consider the resonant supply noise, which represents the worst-case supply disturbance. This paper investigates the impact of sleep transistor sizing on different on-chip noise components and shows that, contrary to conventional wisdom, a larger sleep transistor is not always favored in terms of performance when the resonant supply noise is taken into account. To minimize the worst-case supply noise, an optimal sizing scheme using an explicit noise and impedance model is developed and verified on benchmark circuits. The proposed technique reduces the worst-case noise by 19% and saves 60% of standby leakage and area overhead in comparison with the conventional sizing scheme. To deal with the sporadic nature of the resonant noise, we propose an adaptive sleep transistor circuit which adjusts the size of the sleep transistor on the fly to remove the DC noise penalty of the fixed sizing scheme. Simulation results in 32-nm CMOS technology demonstrate the functionality and effectiveness of the proposed adaptive sizing circuits.

  • A Low-Power Delay Buffer Using Gated Driver Tree

    Publication Year: 2009 , Page(s): 1212 - 1219
    Cited by:  Papers (1)

    This paper presents the circuit design of a low-power delay buffer. The proposed delay buffer uses several new techniques to reduce its power consumption. Since delay buffers are accessed sequentially, it adopts a ring-counter addressing scheme. In the ring counter, double-edge-triggered (DET) flip-flops are used to halve the operating frequency, and a C-element gated-clock strategy is proposed. A novel gated-clock-driver tree is then applied to further reduce the activity along the clock distribution network. Moreover, the gated-driver-tree idea is also employed in the input and output ports of the memory block to decrease their loading, thus saving even more power. Both simulation and experimental results show a substantial improvement in power consumption. A 256 × 8 delay buffer is fabricated and verified in 0.18-μm CMOS technology, and it dissipates only 2.56 mW when operating at 135 MHz from a 1.8-V supply.

  • Variation-Tolerant Dynamic Power Management at the System-Level

    Publication Year: 2009 , Page(s): 1220 - 1232
    Cited by:  Papers (6)

    The power characteristics of systems-on-chip (SoCs) in nanoscale technologies are significantly impacted by manufacturing process variations, making it important to consider these effects during system-level power analysis and optimization. In this paper, we identify and address the problem of designing effective power management schemes in the presence of such variations. In particular, we demonstrate that conventional power management schemes, which are designed without considering the impact of variations, can result in substantial power wastage. We therefore propose two approaches to variation-aware power management, namely, design-specific and chip-specific approaches. In each of these approaches, the goal is to consider the impact of variations while deriving power management policy parameters, in order to optimize metrics that are relevant under variations. We motivate and introduce these metrics, and present both exact and heuristic approaches to optimize them. The methods are designed and implemented in the context of two power management frameworks, namely an ideal oracle-based framework and a timeout-based framework. We experimentally evaluate the proposed ideas using an ARM946 processor core model. For the oracle-based framework, variation-aware power management can result in improvements of up to 59% for μ+σ, and up to 55% for the 95th percentile of the energy distribution, over conventional power management schemes that do not consider variations. For the timeout-based framework, we obtain reductions of up to 43% in μ+σ and up to 55% in the 99th percentile of the energy distribution.

  • Multiplication Acceleration Through Twin Precision

    Publication Year: 2009 , Page(s): 1233 - 1246
    Cited by:  Papers (6)

    We present the twin-precision technique for integer multipliers. The twin-precision technique can reduce power dissipation by adapting a multiplier to the bitwidth of the operands being computed. The technique also enables increased computational throughput by allowing several narrow-width operations to be computed in parallel. We describe how to apply the twin-precision technique to signed multiplier schemes as well, such as Baugh-Wooley and modified-Booth multipliers. It is shown that the twin-precision delay penalty is small (5%-10%) and that a significant reduction in power dissipation (40%-70%) can be achieved when operating on narrow-width operands. In an application case study, we show that by extending the multiplier of a general-purpose processor with the twin-precision scheme, the execution time of a Fast Fourier Transform is reduced by 15% at a 14% reduction in datapath energy dissipation. All our evaluations are based on layout-extracted data from multipliers implemented in 130-nm and 65-nm commercial process technologies.

  • CGADL: An Architecture Description Language for Coarse-Grained Reconfigurable Arrays

    Publication Year: 2009 , Page(s): 1247 - 1259
    Cited by:  Papers (6)

    The high degree of freedom in the design of coarse-grained reconfigurable arrays imposes new challenges on their description and modeling. In this paper, we introduce an architecture description language for describing coarse-grained reconfigurable architecture templates. It comprises innovative key features that allow fast modeling and analysis of such architectures, namely representation of processing element array (ir)regularities, and flexible and concise description of the interconnection network. We demonstrate that the proposed language enables formal validation of the described template, and that it eases the analysis and estimation of hardware costs earlier in the design phase. Finally, we show how we automatically generate a SystemC-based simulator of the described architecture. Our results suggest that the semantic and technical innovations of the proposed architecture description language may have a positive impact on the productivity of the design phase.

  • A 152-mW Mobile Multimedia SoC With Fully Programmable 3-D Graphics and MPEG4/H.264/JPEG

    Publication Year: 2009 , Page(s): 1260 - 1266
    Cited by:  Papers (2)

    This paper presents a low-power multimedia system-on-chip (SoC) for mobile devices that integrates fully programmable 3-D graphics, an MPEG4 codec, an H.264 decoder, and a JPEG codec. The mobile unified shader in the 3-D graphics engine provides fully programmable 3-D graphics with 35% area and 28% power reduction. A low-power lighting engine, which employs a logarithmic number datapath, and a specialized lighting instruction enable a vertex fill rate of 9.1 Mvertices/s, a 2.5 times improvement over previous works, including transformations and OpenGL lighting. The SoC consumes less than 152 mW for video applications and less than 195 mW for 3-D graphics applications. The mobile unified shader and merged JPEG/MPEG4 codec reduce the silicon area, and the SoC occupies 6.4 mm × 6.4 mm in a 0.13-μm complementary metal-oxide-semiconductor (CMOS) logic process.

  • A 32-Gb/s On-Chip Bus With Driver Pre-Emphasis Signaling

    Publication Year: 2009 , Page(s): 1267 - 1274
    Cited by:  Papers (7)

    This paper describes a differential current-mode bus architecture based on driver pre-emphasis for on-chip global interconnects that achieves high data rates while reducing bus power dissipation and signal latency. The 16-b bus core, fabricated in 0.25-μm complementary metal-oxide-semiconductor (CMOS) technology, attains an aggregate signaling data rate of 32 Gb/s over 5-10-mm-long lossy interconnects. With a supply of 2.5 V, a power dissipation of 25.5-48.7 mW was measured for signal activity above 0.1, equivalent to 0.80-1.52 pJ/b. This work demonstrates a 15.0%-67.5% power reduction over a conventional single-ended voltage-mode static bus while reducing latency by 28.3% and peak current by 70%. The proposed bus architecture is robust against crosstalk noise and occupies a routing area comparable to that of a reference static bus design.

  • VLSI Implementation of an Edge-Oriented Image Scaling Processor

    Publication Year: 2009 , Page(s): 1275 - 1284
    Cited by:  Papers (9)

    Image scaling is an important technique that is widely used in image processing applications. In this paper, we present an edge-oriented area-pixel scaling processor. To keep cost low, the area-pixel scaling technique is implemented with a low-complexity VLSI architecture in our design. A simple edge-catching technique is adopted to preserve image edge features effectively and thereby achieve better image quality. Compared with previous low-complexity techniques, our method performs better in terms of both quantitative evaluation and visual quality. The seven-stage VLSI architecture of our image scaling processor has a gate count of 10.4 K and yields a processing rate of about 200 MHz using TSMC 0.18-μm technology.

  • A VLIW Vector Media Coprocessor With Cascaded SIMD ALUs

    Publication Year: 2009 , Page(s): 1285 - 1296
    Cited by:  Papers (3)

    High-definition video applications, such as digital TV and digital video cameras, require high processing performance for high-quality visual images in addition to a complex video CODEC. Pre-/postprocessing to improve video quality is becoming much more important because requirements for pre-/postprocessing vary among applications and processing algorithms have not been stabilized. Therefore, a new processor architecture that has a highly parallel datapath is needed. In this paper, we introduce a VLIW vector media coprocessor, "vector coprocessor (VCP)," that includes three asymmetric execution pipelines with cascaded SIMD ALUs. To improve performance efficiency, we reduce the area ratio of the control circuit while increasing the ratio of the arithmetic circuit. The total gate count of VCP is 1268 kgates and its maximum operating frequency is 300 MHz in a 90-nm CMOS process. Some of the processing kernels in an adaptive prefilter that is applied to preprocessing for video encoding are evaluated. In the case of the edgeness and the sum of absolute differences, the performance is 183 giga operations per second. VCP offers enough performance for HD video processing and good cost-performance while all processing pipeline units operate effectively.

  • A New Architecture of a Two-Stage Lossless Data Compression and Decompression Algorithm

    Publication Year: 2009 , Page(s): 1297 - 1303
    Cited by:  Papers (1)

    In this paper, we propose a new architecture for a previously proposed two-level lossless data compression and decompression algorithm that combines the PDLZW algorithm and an approximated adaptive Huffman algorithm with dynamic-block exchange (AHDB). In the new architecture, we replace the CAM dictionary set used in the PDLZW algorithm with a CAM-tag-based dictionary set to reduce hardware cost, and we replace the CAM-based ordered list used in the AHDB algorithm with a memory inter-reference (MIR) stage realized with two SRAMs. The resulting architecture is then implemented with cell-based libraries in both 0.35-μm 2P4M and 0.18-μm 1P6M process technologies. With the same process technology, the prototyped chip demonstrates that the new architecture not only has better performance (at least 33% improvement) but also occupies less area (only about 44%) and consumes less power (about 50%) in comparison with the previously proposed architecture. In addition, the maximum data rate reaches 2 Gbps when realized in the 0.35-μm 2P4M process technology and 4 Gbps when realized in the 0.18-μm 1P6M process technology.

  • Signal Assignment to Hierarchical Memory Organizations for Embedded Multidimensional Signal Processing Systems

    Publication Year: 2009 , Page(s): 1304 - 1317
    Cited by:  Papers (5)

    The storage requirements of the array-dominated and loop-organized algorithmic specifications running on embedded systems can be significant. Employing a data memory space much larger than needed has negative consequences for energy consumption, latency, and chip area. Finding an optimized storage of the usually large arrays from these algorithmic specifications is an essential task of memory allocation. This paper proposes an efficient algorithm for mapping multidimensional arrays to the data memory. Similarly to [1], it computes bounding windows for live elements in the index space of arrays, but this algorithm is several times faster. More importantly, since this algorithm works not only for entire arrays but also for parts of arrays (for instance, array references or, more generally, sets of array elements represented by lattices [2]), this signal-to-memory mapping technique can also be applied in hierarchical memory architectures.

  • Synthesis Algorithm for Application-Specific Homogeneous Processor Networks

    Publication Year: 2009 , Page(s): 1318 - 1329
    Cited by:  Papers (1)

    The application-specific multiprocessor system-on-a-chip is a promising design alternative because of its high degree of flexibility, short development time, and potentially high performance attributed to application-specific optimizations. However, designing an optimal application-specific multiprocessor system is still challenging because there are a number of important metrics, such as throughput, latency, and resource usage, which need to be explored and optimized. This paper addresses the problem of synthesizing an application-specific multiprocessor system for stream-oriented embedded applications to minimize system latency under a throughput constraint. We employ a novel framework for this problem, similar to that of technology mapping in the logic synthesis domain, and develop a set of efficient algorithms, including labeling and clustering, for efficient generation of a multiprocessor architecture with application-specific optimized latency. Specifically, the result of our algorithm is latency optimal for directed acyclic task graphs. Application of our approach to the Motion JPEG example on Xilinx's Virtex II Pro platform FPGA shows interesting design tradeoffs.

  • Sleep Transistor Sizing for Leakage Power Minimization Considering Charge Balancing

    Publication Year: 2009 , Page(s): 1330 - 1334
    Cited by:  Papers (5)

    Power gating is an effective technique for reducing leakage power. Previously, a distributed sleep transistor network (DSTN) was proposed to reduce the sleep transistor area for power gating by connecting all the virtual ground lines together to minimize the maximum instantaneous current flowing through the sleep transistors. In this paper, we propose a new methodology for determining the sizes of the sleep transistors of the DSTN structure. We present novel algorithms and theorems for efficiently estimating a tight upper bound on the voltage drop and minimizing the sizes of the sleep transistors. We also present mathematical proofs of our theorems and lemmas in detail. Our experimental results show an average sleep transistor area reduction of 23.36% compared to previous work.

  • Programmable Logic Core Enhancements for High-Speed On-Chip Interfaces

    Publication Year: 2009 , Page(s): 1334 - 1339

    Programmable logic cores (PLCs) offer a means of providing post-fabrication reconfigurability to an SoC design. This ability has the potential to significantly enhance the SoC design process by enabling post-silicon debugging, design error correction, and post-fabrication feature enhancement. However, circuits implemented in general-purpose programmable logic will inevitably have lower timing performance than fixed-function circuits. This fundamental mismatch makes it difficult to use the PLC effectively. We address this problem by proposing changes to the structure of the PLC itself; these architectural enhancements enable circuit implementations with high-performance interfaces. In previous work we addressed system bus interfaces; in this work we address direct synchronous interfaces. Our results show significant improvement in PLC interface timing, such that interaction with full-speed fixed-function SoC logic is possible. Our enhanced PLCs are able to implement direct synchronous interfaces running at, on average, 662 MHz (compared to 249 MHz in regular programmable logic). We are able to do this without compromising the basic structure or routability of the programmable fabric. At the same time, we show that the area overhead for these architectural changes is approximately 1%.

  • Design of Parasitic and Process-Variation Aware Nano-CMOS RF Circuits: A VCO Case Study

    Publication Year: 2009 , Page(s): 1339 - 1342
    Cited by:  Papers (15)

    This paper proposes a novel flow for parasitic and process-variation aware design of radio-frequency integrated circuits (RFICs). A nano-CMOS current-starved voltage controlled oscillator (VCO) circuit has been designed using this flow as a case study. The oscillation frequency is treated as the optimization objective, with the area overhead as a constraint. Extensive Monte Carlo simulations have been carried out on the parasitic-extracted netlist of the VCO to study the effect of process variation on the oscillation frequency. In the design cycle, a performance degradation of 43.5% is observed when the parasitic-extracted netlist is subjected to worst-case process variation. The proposed design flow could bring the oscillation frequency within 4.5% of the target, leading to convergence of the complete design in only one design iteration. To the best of the authors' knowledge, this paper presents the first work focused on a current-starved VCO in which the combined effect of parasitics and process variations has been considered.

  • Partially Protected Caches to Reduce Failures Due to Soft Errors in Multimedia Applications

    Publication Year: 2009 , Page(s): 1343 - 1347
    Cited by:  Papers (1)

    With advances in process technology, soft errors are becoming an increasingly critical design concern. Owing to their large area, high density, and low operating voltages, caches are the worst hit by soft errors. Based on the observation that, in multimedia applications, not all data require the same amount of protection from soft errors, we propose a partially protected cache (PPC) architecture, in which there are two caches, one protected and the other unprotected, at the same level of the memory hierarchy. We demonstrate that, compared to existing unprotected cache architectures, PPC architectures can provide a 47 times reduction in failure rate at only 1% runtime and 3% power overheads. In addition, the failure rate reduction obtained by PPCs is very sensitive to the PPC cache configuration. This observation provides an opportunity to further improve the solution by correctly parameterizing the PPC configurations. Consequently, we develop design space exploration (DSE) strategies to discover the best PPC configuration. Our DSE technique can reduce the exploration time by more than six times compared to an exhaustive approach.

  • Multivoltage Multifrequency Low-Energy Synthesis for Functionally Pipelined Datapath

    Publication Year: 2009 , Page(s): 1348 - 1352

    In this paper, an algorithm named MuVoF is proposed to perform multivoltage multifrequency low-energy high-level synthesis for functionally pipelined datapaths under resource and throughput constraints. A datapath is partitioned into a number of pipelined stages such that the clock period can be extended maximally. A multivoltage assignment algorithm then uses the extended clock period to reduce energy by lowering the supply voltages of the resources. The results are further refined by four local transformations performed in an iterative process. The experimental results show that MuVoF is capable of exploring the design space effectively and achieves efficient energy reduction.

  • Time-Efficient Single Constant Multiplication Based on Overlapping Digit Patterns

    Publication Year: 2009 , Page(s): 1353 - 1357
    Cited by:  Papers (12)

    Common subexpression elimination (CSE) algorithms try to minimize the number of adders (or subtracters) required to implement constant multiplication by searching and substituting common patterns in the CSE representation of a constant. CSE algorithms, in general, cannot find certain patterns due to inherent restrictions in the CSE representation. We propose overlapping digit patterns (ODPs) to remove some of these restrictions. We integrate ODPs into H(k), the best existing heuristic algorithm for single constant multiplication (SCM). H(k) is not applicable to the multiple constant multiplication (MCM) problem, so we do not consider that problem. Generally, H(k) finds solutions very close to optimal, so there is a strict limit on the further improvement that any new heuristic can achieve. Instead, by integrating ODPs within H(k), we can on average significantly improve the run time of the algorithm (typically by one order of magnitude) while still reducing the number of adders.

  • Maximizing the Functional Yield of Wafer-to-Wafer 3-D Integration

    Publication Year: 2009 , Page(s): 1357 - 1362
    Cited by:  Papers (39)

    Three-dimensional integrated circuit technology with through-silicon vias offers many advantages, including improved form factor, increased circuit performance, robust heterogeneous integration, and reduced costs. Wafer-to-wafer integration supports the highest possible density of through-silicon vias and the highest throughput; however, in contrast to die-to-wafer integration, it does not benefit from the ability to bond only tested and diced good die. In wafer-to-wafer integration, entire wafers are bonded together, which can unintentionally integrate a bad die from one wafer with a good die from another wafer, reducing the yield. In this paper, we propose solutions that maximize the yield of wafer-to-wafer 3-D integration, assuming that the individual die can be tested on the wafers before bonding. We exploit some of the available flexibility in the integration process, and propose wafer assignment algorithms that maximize the number of good 3-D ICs. Our algorithms range from scalable, fast heuristics to optimal methods that exactly maximize the yield of wafer-to-wafer 3-D integration. Using realistic defect models and yield simulations, we demonstrate the effectiveness of our methods up to large numbers of wafer stacks. Our results demonstrate that it is possible to significantly improve the yield in comparison to yield-oblivious wafer assignment methods.

  • Low-Power Snoop Architecture for Synchronized Producer-Consumer Embedded Multiprocessing

    Publication Year: 2009 , Page(s): 1362 - 1366
    Cited by:  Papers (2)

    We introduce a cross-layer customization methodology where application knowledge regarding data sharing in producer-consumer relationships is used to aggressively eliminate unnecessary and predictable snoop-induced cache lookups even for references to shared data, thus achieving significant power reductions with minimal hardware cost. The technique exploits application-specific information regarding the exact producer-consumer relationships between tasks as well as information regarding the precise timing of synchronized accesses to shared memory buffers by their corresponding producers and/or consumers. Snoop-induced cache lookups for accesses to the shared data are eliminated when it is ensured that such lookups will not result in extra knowledge regarding the cache state with respect to the other caches and the memory. Our experiments show average power reductions of more than 80% compared to a general-purpose snoop protocol.

Aims & Scope

Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing, and systems applications. Generation of specifications, design, and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor, and process levels.

To address this critical area through a common forum, the IEEE Transactions on VLSI Systems was founded. The editorial board, consisting of international experts, invites original papers which emphasize the novel system integration aspects of microelectronic systems, including interactions among system design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chips and wafer fabrication, testing and packaging, and system level qualification. Thus, the coverage of this Transactions focuses on VLSI/ULSI microelectronic system integration.

Topics of special interest include, but are not strictly limited to, the following:

  • System Specification, Design and Partitioning
  • System-level Test
  • Reliable VLSI/ULSI Systems
  • High Performance Computing and Communication Systems
  • Wafer Scale Integration and Multichip Modules (MCMs)
  • High-Speed Interconnects in Microelectronic Systems
  • VLSI/ULSI Neural Networks and Their Applications
  • Adaptive Computing Systems with FPGA components
  • Mixed Analog/Digital Systems
  • Cost, Performance Tradeoffs of VLSI/ULSI Systems
  • Adaptive Computing Using Reconfigurable Components (FPGAs)

Meet Our Editors

Editor-in-Chief

Krishnendu Chakrabarty
Department of Electrical Engineering
Duke University
Durham, NC 27708 USA
Krish@duke.edu