
2006 International Conference on Computer Design (ICCD 2006)

Date: 1-4 Oct. 2006

  • [Copyright notice]

    Page(s): iv
  • Welcome to ICCD 2006!

    Page(s): xii - xiii
  • Organizing Committee

    Page(s): xiv
  • Program Committee

    Page(s): xv - xvi
  • Additional reviewers

    Page(s): xvii
  • Table of contents

  • Long-term Performance Bottleneck Analysis and Prediction

    Page(s): 3 - 9

    Identifying performance bottlenecks is important for microarchitects and application developers to produce high-performance microprocessor designs and application software. Many techniques are used for this purpose, including simulation, software profiling and hardware event counters. Recently, long-term program behavior has been receiving more attention from researchers because of its potential applications in system-level as well as program-level optimizations. In this paper, we study performance bottlenecks from a long-term program behavior viewpoint by classifying dynamic program execution into bottleneck phases: the portions of execution that have similar performance bottlenecks. We propose an event-counter-based performance model that can accurately estimate the performance cost of critical system events. Based on this model, we propose the bottleneck vector as the basis of long-term performance bottleneck analysis, along with a runtime bottleneck phase tracking scheme. In addition, three bottleneck phase prediction schemes are studied. Finally, we present an application of our performance bottleneck analysis model: an adaptive value predictor, which improves average performance by 7% compared to the original value predictor design.

  • Dynamic Code Value Specialization Using the Trace Cache Fill Unit

    Page(s): 10 - 16

    Value specialization is a technique that can improve a program's performance when its code frequently operates on the same values. In this paper, speculative value specialization is applied dynamically by utilizing the trace cache hardware. We implement a small, efficient hardware profiler to identify loads that have semi-invariant runtime values. A specialization engine off the program's critical path generates highly optimized traces using these values, which reside in the trace cache. Specialized traces are dynamically verified during execution, and mis-specialization is recovered automatically without new hardware overhead. Our simulation shows that dynamic value specialization in the trace cache achieves a 17% speedup, even over a system with support for hardware value prediction. This technique also performs well when combined with other approaches aimed at tolerating memory latencies: combined with an aggressive hardware prefetcher, it achieves 24% better performance than prefetching alone.

  • Fast, Performance-Optimized Partial Match Address Compression for Low-Latency On-Chip Address Buses

    Page(s): 17 - 24

    The influence of interconnects on processor performance and cost is becoming increasingly pronounced with technology scaling. In this paper, we present a fast compression scheme that exploits the spatial and temporal locality of addresses to dynamically compress them to different extents, depending on how closely they match the higher-order portions of recently occurring addresses saved in a very small "compression cache" of capacity less than 500 bits. When a maximal match occurs, the address is compressed to the maximum extent and is transmitted on a narrow bus in one cycle. When a partial match occurs, one or more extra cycles are required for address transmission, depending upon the extent of the partial match. To minimize this transmission cycle penalty (TCP), we use an efficient algorithm to determine the optimal set of partial matches to be supported in our partial match compression (PMC) scheme; we refer to this scheme as performance-optimized PMC (PO-PMC). A previously proposed scheme called bus expander (BE) supports only a single, fixed-size match for compression. We show that all addresses that result in (maximal) matches in BE also do so in PMC, but the remaining addresses that are considered "no matches" in BE frequently result in partial matches in PMC, thus helping curtail the latter's TCP significantly. Across many SPEC CPU2000 integer and floating-point benchmarks, we find that average program performance improves by 3% when using PO-PMC compared to using BE. Further, we investigate how area slack arising from compression can be exploited for bus latency improvement by increasing inter-wire spacing. We find that, on average, it can reduce bus latency by up to 84.63% and thereby improve program performance by about 16%.

  • Joint Performance Improvement and Error Tolerance for Memory Design Based on Soft Indexing

    Page(s): 25 - 30

    Memory design faces the dual challenges of performance improvement and error tolerance due to a combination of technology scaling and higher levels of integration. To address these challenges, we propose a new memory microarchitecture referred to as soft indexing. The proposed technique allocates memory resources in a self-adaptive manner in accordance with runtime program variations, thereby achieving efficient memory access and effective error protection in a coherent manner. Statistical analysis shows a 10× improvement in error detection capability over existing error-control techniques. The benefits of the proposed technique are also experimentally demonstrated using the SPEC CPU2000 benchmarks. Simulation results show 94.9% average error-control coverage on the 23 benchmarks, with an average 23.2% reduction in memory miss rates compared to conventional techniques.

  • A Low Power Highly Associative Cache for Embedded Systems

    Page(s): 31 - 36

    Reducing energy consumption is an important issue for battery-powered embedded computing systems. Content addressable memory (CAM)-based highly-associative caches (HACs) are widely used in low-power embedded microprocessors, but the CAM tag is costly in power, access time, and area. We have designed a low-power highly associative cache (LPHAC) whose tag is partially implemented using CAM, with the remaining tag implemented using SRAM. Experimental results from 10 MediaBench and all 26 SPEC2K benchmarks show that the proposed LPHAC exhibits almost the same miss rate as a traditional HAC, while consuming 27% less power per cache access and 1.6% less area, with a faster access time.

  • On the Improvement of Statistical Timing Analysis

    Page(s): 37 - 42

    As the minimum feature sizes of VLSI fabrication processes continue to shrink, the impact of process variations is becoming increasingly significant. This has prompted research into extending traditional static timing analysis so that it can be performed statistically. However, statistical static timing analysis (SSTA) tends to be quite pessimistic. In this paper, we present a sensitizable statistical timing analysis (StatSense) technique to overcome the pessimism of SSTA. Our StatSense approach implicitly eliminates false paths, and also uses different delay distributions for different input transitions for any gate. These features enable our StatSense approach to perform less conservative timing analysis than the SSTA approach. Our results show that on average, the worst-case (μ + 3σ) circuit delay reported by StatSense is about 20% lower than that reported by SSTA.

  • FA-STAC: A Framework for Fast and Accurate Static Timing Analysis with Coupling

    Page(s): 43 - 49

    This paper presents a framework for fast and accurate static timing analysis considering coupling. With technology scaling to smaller dimensions, the impact of coupling-induced delay variations can no longer be ignored. Timing analysis considering coupling is iterative, and can have considerably larger run-times than a single-pass approach. We propose a novel and accurate coupling delay model, and present techniques to increase the convergence rate of timing analysis when complex coupling models are employed. Experimental results obtained for the ISCAS benchmarks show promising accuracy improvements using our coupling model, while an efficient iteration scheme shows significant speedup (up to 62.1%) in comparison to traditional approaches.

  • Reduction of Crosstalk Pessimism Using Tendency Graph Approach

    Page(s): 50 - 55

    Accurate estimation of worst-case crosstalk effects is critical for a realistic estimation of the worst-case behavior of deep sub-micron circuits. Crosstalk analysis models usually assume that the worst-case crosstalk occurs with all the aggressors of a victim (net or path) simultaneously inducing crosstalk, even though this may not be possible at all; this overestimated crosstalk is called false noise. Logic correlations have been explored to reduce false noise in J.C. Beck, et al. (2004), which also used a branch-and-bound method to solve the problem. In this paper, we propose a novel approach, named the tendency graph approach (TGA), which preprocesses the logic constraints of the circuit to drastically speed up the fundamental branch-and-bound algorithm. The new approach has been implemented in C++ and tested on an industrial circuit in a current 90 nm technology, demonstrating that TGA considerably accelerates the solution of the false noise problem and, in many cases, makes branch and bound feasible in the first place.

  • Statistical Analysis of Power Grid Networks Considering Lognormal Leakage Current Variations with Spatial Correlation

    Page(s): 56 - 62

    As technology scales to 90 nm and below, process-induced variations become more pronounced. In this paper, we propose an efficient stochastic method for analyzing the voltage drop variations of on-chip power grid networks, considering log-normal leakage current variations with spatial correlation. The new analysis is based on the Hermite polynomial chaos (PC) representation of random processes. Unlike the existing Hermite PC based method for power grid analysis, which models all random variations as Gaussian processes without considering spatial correlation, the new method focuses on the impact of stochastic sub-threshold leakage currents, modeled as log-normally distributed random variables, on power grid voltage variations. To consider the spatial correlation, we apply orthogonal decomposition to map the correlated random variables into independent variables. Our experimental results show that the new method is more accurate than the Gaussian-only Hermite PC method using the Taylor expansion for analyzing leakage current variations, and two orders of magnitude faster than the Monte Carlo method with small variance errors. We also show that neglecting spatial correlation may lead to large errors in the statistical analysis.

  • RasP: An Area-efficient, On-chip Network

    Page(s): 63 - 69

    We present RasP, our asynchronous on-chip network, which uses high-speed pulse-based signalling techniques. RasP offers numerous advantages over conventional interconnects, such as clock-domain crossing and skew tolerance. Most importantly, it features a very small global-wiring footprint. This compact nature allows a system designer to give priority to link bandwidth or signal-to-noise ratios, rather than being restricted by lane areas. We describe our point-to-point link and develop it into a fully-routable system, with a repeater, router, arbiter and multiplexer. Simulations give throughput figures of between 700 Mbit/s and 1 Gbit/s in a 0.18 µm technology, depending on interconnect length. We also show that it compares favourably in performance and area to Bainbridge et al.'s Chain interconnect.

  • Quantitative Prediction of On-chip Capacitive and Inductive Crosstalk Noise and Discussion on Wire Cross-Sectional Area Toward Inductive Crosstalk Free Interconnects

    Page(s): 70 - 75

    Capacitive and inductive crosstalk noise is expected to become more serious in advanced technologies. However, although capacitive crosstalk noise has been intensively studied on its own as a primary factor of interconnect delay variation, capacitive and inductive crosstalk noise in future processes has not been discussed concurrently and in sufficient quantitative detail. This paper quantitatively predicts the impact of capacitive and inductive crosstalk in prospective processes, and reveals that interconnect scaling strategies strongly affect the relative dominance of capacitive versus inductive coupling. Our prediction also makes the point that interconnect resistance significantly influences both inductive coupling noise and propagation delay. We then evaluate the tradeoff between wire cross-sectional area and worst-case propagation delay, focusing on inductive coupling noise, and show that an appropriate selection of wire cross-section can reduce delay uncertainty at a small sacrifice in propagation delay.

  • CMOS Comparators for High-Speed and Low-Power Applications

    Page(s): 76 - 81

    In this paper, we present two designs for CMOS comparators: one targeted for high-speed applications and another for low-power applications. Additionally, we present hierarchical pipelined comparators which can be optimized for delay, area, or power consumption by using either design in different stages. Simulation results for our fastest hierarchical 64-bit comparator with a 1.2 V 100 nm process demonstrate a worst-case delay of 440 ps. To enable a fair comparison with previously reported approaches, we also simulated our designs with a 5.0 V AMIS 0.5 µm process. For this experiment, the fastest design has a latency of 1.33 ns, which represents a 37% speed improvement over the best previously reported approach to date (also implemented in a 0.5 µm process).

  • A Reconfigurable CAM Architecture for Network Search Engines

    Page(s): 82 - 87

    A novel reconfigurable content addressable memory, called RCAM, is proposed that supports on-the-fly reconfiguration between CAM and TCAM. The area overhead of the proposed RCAM cell is only 5.6% compared to conventional TCAM; this overhead is compensated by the area saved by removing the priority encoder. Other features of our architecture include reconfigurability and better overall performance and power. To achieve these, we incorporate two novel techniques: (i) a hybrid CAM/TCAM architecture that allows the user to pre-define CAM/TCAM cell behavior in each bit or word position and ultimately curtail the overall power consumption of the memory unit; and (ii) a wired-AND technique by which we can completely eliminate the sorting requirement and thus significantly reduce the update time. A 4 Kb RCAM architecture was implemented using 0.18 µm CMOS technology. The simulations indicate a search time of 6.15 ns, i.e., the capability of handling about five OC-192 links at wire speed.

  • Delay and Area Efficient First-level Cache Soft Error Detection and Correction

    Page(s): 88 - 92

    Soft error rates are an increasing problem in modern VLSI circuits. Commonly used error-correcting codes reduce soft error rates in large memories and second-level caches but are not suited to small, fast memories such as first-level caches, due to the area and speed penalties they entail. Here, an error detection and correction scheme is presented that is appropriate for low-latency first-level caches and other small, fast memories such as register files. The scheme allows fine (e.g., byte) write granularity with acceptable storage overhead. Analysis demonstrates that the proposed method provides adequate soft error rate reduction with improved latency and area cost.

  • Automated Design of Microfluidics-Based Biochips: Connecting Biochemistry to Electronics CAD

    Page(s): 93 - 100

    Microfluidics-based biochips offer exciting possibilities for high-throughput sequencing, parallel immunoassays, blood chemistry for clinical diagnostics, DNA sequencing, and environmental toxicity monitoring. The complexity of microfluidic devices is expected to become significant in the near future due to the need for multiple and concurrent biochemical assays on multifunctional and reconfigurable platforms. This paper presents early work on top-down system-level computer-aided design (CAD) tools for the synthesis, testing and reconfiguration of microfluidic biochips. Synthesis tools map behavioral descriptions to a droplet-based microfluidic biochip and generate an optimized schedule of assay operations, the binding of assay operations to functional units, and the layout and droplet flow-paths. Cost-effective testing techniques lead to the detection of manufacturing defects and operational faults. Reconfiguration techniques, incorporated in these CAD tools, can easily bypass faults once they are detected. Thus the biochip user can concentrate on the development of the nano- and micro-scale bioassays, leaving assay optimization and implementation details to design automation tools.

  • Fast Speculative Address Generation and Way Caching for Reducing L1 Data Cache Energy

    Page(s): 101 - 107

    L1 data caches in high-performance processors continue to grow in set associativity. Higher associativity can significantly increase cache energy consumption, and cache access latency can be affected as well, leading to an increase in overall energy consumption due to increased execution time. At the same time, the static energy consumption of the cache increases significantly with each new process generation. This paper proposes a new approach to reducing overall L1 cache energy consumption using a combination of way caching and fast, speculative address generation. A 16-entry way cache storing a 3-bit way number for recently accessed L1 data cache lines is shown to be sufficient to significantly reduce both static and dynamic energy consumption of the L1 cache. Fast speculative address generation helps to hide the way cache access latency and is highly accurate. The L1 cache energy-delay product is reduced by 10% compared to using the way cache alone and by 37% compared to the multiple-MRU technique.

  • Customizable Fault Tolerant Caches for Embedded Processors

    Page(s): 108 - 113

    The continuing divergence of processor and memory speeds has led to increasing reliance on larger caches, which have become major consumers of area and power in embedded processors. Concurrently, intra-die and inter-die process variation at future technology nodes will cause defect-free yield to drop sharply unless mitigated. This paper focuses on an architectural technique to configure cache designs to be resilient to memory cell failures brought on by the effects of process variation. Profile-driven re-mapping of memory lines to cache lines is proposed to tolerate failures while minimizing degradation in average memory access time (AMAT), thereby significantly boosting performance-based die yield beyond what can be achieved with current techniques. For example, with 50% of cache lines faulty, the performance drop using our technique is a 12.5% increase in AMAT, compared to a 60% increase using existing techniques.
