
Microarchitecture, 2007. MICRO 2007. 40th Annual IEEE/ACM International Symposium on

Date: 1-5 Dec. 2007


Displaying Results 1 - 25 of 44
  • 40th Annual IEEE/ACM International Symposium on Microarchitecture - Cover

    Publication Year: 2007 , Page(s): c1
    PDF (180 KB)
    Freely Available from IEEE
  • 40th Annual IEEE/ACM International Symposium on Microarchitecture - Title page

    Publication Year: 2007 , Page(s): i - iii
    PDF (53 KB)
    Freely Available from IEEE
  • 40th Annual IEEE/ACM International Symposium on Microarchitecture - Copyright

    Publication Year: 2007 , Page(s): iv
    PDF (52 KB)
    Freely Available from IEEE
  • 40th Annual IEEE/ACM International Symposium on Microarchitecture - Table of contents

    Publication Year: 2007 , Page(s): v - vii
    PDF (48 KB)
    Freely Available from IEEE
  • Message from the General Chairs

    Publication Year: 2007 , Page(s): viii
    PDF (31 KB) | HTML
    Freely Available from IEEE
  • Message from the Program Chairs

    Publication Year: 2007 , Page(s): ix
    PDF (31 KB) | HTML
    Freely Available from IEEE
  • Organizing Committee

    Publication Year: 2007 , Page(s): x
    PDF (30 KB)
    Freely Available from IEEE
  • List of Reviewers

    Publication Year: 2007
    PDF (32 KB)
    Freely Available from IEEE
  • Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0

    Publication Year: 2007 , Page(s): 3 - 14
    Cited by:  Papers (95)  |  Patents (9)
    PDF (530 KB) | HTML

    A significant part of future microprocessor real estate will be dedicated to L2 or L3 caches. These on-chip caches will heavily impact processor performance, power dissipation, and thermal management strategies. There are a number of interconnect design considerations that influence power/performance/area characteristics of large caches, such as wire models (width/spacing/repeaters), signaling strategy (RC/differential/transmission), router design, etc. Yet, to date, there exists no analytical tool that takes all of these parameters into account to carry out a design space exploration for large caches and estimate an optimal organization. In this work, we implement two major extensions to the CACTI cache modeling tool that focus on interconnect design for a large cache. First, we add the ability to model different types of wires, such as RC-based wires with different power/delay characteristics and differential low-swing buses. Second, we add the ability to model non-uniform cache access (NUCA). We not only adopt state-of-the-art design space exploration strategies for NUCA, we also enhance this exploration by considering on-chip network contention and a wider spectrum of wiring and routing choices. We present a validation analysis of the new tool (to be released as CACTI 6.0) and present a case study to showcase how the tool can improve architecture research methodologies.

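The bank-count trade-off at the heart of such a NUCA design-space exploration can be caricatured in a few lines: fewer, larger banks are slow to access but need a small network, while many small banks are fast but push latency into the interconnect. All latency numbers below are illustrative assumptions, not CACTI output:

```python
# Toy NUCA design-space sweep: pick the bank count that minimizes average
# access latency (bank access time + network hops from a corner core).
# Cycle counts and configurations are invented for illustration.

def avg_access_cycles(num_banks, grid_dim, bank_cycles, hop_cycles):
    """Average access latency over a grid of NUCA banks, with the core
    attached at grid position (0, 0)."""
    total = 0
    for x in range(grid_dim):
        for y in range(grid_dim):
            hops = x + y  # Manhattan distance to bank (x, y)
            total += bank_cycles + hops * hop_cycles
    return total / num_banks

# banks -> (bank access cycles, grid dimension): assumed trade-off points
candidates = {
    4:  (16, 2),
    16: (8, 4),
    64: (4, 8),
}
best = min(candidates,
           key=lambda n: avg_access_cycles(n, candidates[n][1],
                                           candidates[n][0], 2))
```

A real tool additionally weighs wire type, router complexity, contention, power, and area, which is exactly the space CACTI 6.0 automates.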
  • Process Variation Tolerant 3T1D-Based Cache Architectures

    Publication Year: 2007 , Page(s): 15 - 26
    Cited by:  Papers (27)  |  Patents (2)
    PDF (554 KB) | HTML

    Process variations will greatly impact the stability, leakage power consumption, and performance of future microprocessors. These variations are especially detrimental to 6T SRAM (6-transistor static memory) structures and will become critical with continued technology scaling. In this paper, we propose new on-chip memory architectures based on novel 3T1D DRAM (3-transistor, 1-diode dynamic memory) cells. We provide a detailed comparison between 6T and 3T1D designs in the context of an L1 data cache. The effects of physical device variation on a 3T1D cache can be lumped into variation of data retention times. This paper proposes a range of cache refresh and placement schemes that are sensitive to retention time, and we show that most of the retention time variations can be masked by the microarchitecture when using these schemes. We have performed detailed circuit and architectural simulations assuming different degrees of variability in advanced technology nodes, and we show that the resulting memory architecture can tolerate large process variations with little or even no impact on performance when compared to ideal 6T SRAM designs. Furthermore, these designs are robust to memory cell stability issues and can achieve large power savings. These advantages make the new memory architectures a promising choice for on-chip variation-tolerant cache structures required for next generation microprocessors.

  • Mitigating Parameter Variation with Dynamic Fine-Grain Body Biasing

    Publication Year: 2007 , Page(s): 27 - 42
    Cited by:  Papers (31)  |  Patents (3)
    PDF (1501 KB) | HTML

    Parameter variation is detrimental to a processor's frequency and leakage power. One proposed technique to mitigate it is fine-grain body biasing (FGBB), where different parts of the processor chip are given a voltage bias that changes the speed and leakage properties of their transistors. This technique has been proposed for static application, with the bias voltages being programmed at manufacturing time for worst-case conditions. In this paper, we introduce dynamic FGBB (D-FGBB), which allows the continuous re-evaluation of the bias voltages to adapt to dynamic conditions. Our results show that D-FGBB is very versatile and effective. Specifically, with the processor working in normal mode at fixed frequency, D-FGBB reduces the leakage power of the chip by an average of 28%-42% compared to static FGBB. Alternatively, with the processor working in a high-performance mode, D-FGBB increases the processor frequency by an average of 7-9% compared to static FGBB - or 7-16% compared to no body biasing. Finally, we also show that D-FGBB can be synergistically combined with dynamic voltage and frequency scaling (DVFS), creating an effective means to manage power.

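The "continuous re-evaluation" idea can be sketched as a simple feedback loop (this is not the paper's actual controller): each domain periodically compares a sensed critical-path delay against the cycle-time target and steps its body bias. The step size, thresholds, and sensor interface are all assumptions:

```python
# Minimal D-FGBB-style retuning loop sketch. Forward body bias (FBB, positive)
# speeds up a slow domain at a leakage cost; reverse body bias (RBB, negative)
# trades unused timing slack for lower leakage.

BIAS_STEP_MV = 50      # assumed bias-voltage granularity
TARGET_DELAY = 1.00    # normalized cycle-time budget

def retune(domains):
    """domains: dict name -> measured critical-path delay (normalized).
    Returns a per-domain bias adjustment in mV for this interval."""
    adjustments = {}
    for name, delay in domains.items():
        if delay > TARGET_DELAY:
            adjustments[name] = +BIAS_STEP_MV   # too slow: apply FBB
        elif delay < 0.9 * TARGET_DELAY:
            adjustments[name] = -BIAS_STEP_MV   # plenty of slack: apply RBB
        else:
            adjustments[name] = 0               # within the guard band
    return adjustments
```

Re-running this loop as temperature and workload change is what distinguishes the dynamic scheme from a one-time, worst-case static calibration.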
  • Optimal versus Heuristic Global Code Scheduling

    Publication Year: 2007 , Page(s): 43 - 55
    PDF (415 KB) | HTML

    We present a global instruction scheduler based on integer linear programming (ILP) that was implemented experimentally in the Intel Itanium® product compiler. It features virtually the full scale of known EPIC scheduling optimizations, more than its heuristic counterpart in the compiler, GCS, and in contrast to the latter it computes optimal solutions in the form of schedules with minimal length. Due to our highly efficient ILP model it can solve problem instances with 500-750 instructions, and in combination with region scheduling we are able to schedule routines of arbitrary size. In experiments on five SPEC® CPU2006 integer benchmarks, ILP-scheduled code exhibits a 32% schedule length advantage and a 10% runtime speedup over GCS-scheduled code, at the highest compiler optimization levels typically used for SPEC submissions. We further study the impact of different code motion classes, region sizes, and target microarchitectures, gaining insights into the nature of the global instruction scheduling problem.

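The heuristic-versus-optimal contrast the paper studies can be illustrated on a toy machine: a greedy list scheduler next to an exhaustive (brute-force) optimal scheduler over a tiny dependence graph. The two-wide, unit-latency machine and the example DAG are assumptions for illustration, far simpler than an EPIC model or a real ILP formulation:

```python
from itertools import combinations

# Toy dependence graph: op -> list of predecessors.
# A->B->C is a 3-op chain; D, E, F are independent.
DEPS = {"A": [], "B": ["A"], "C": ["B"], "D": [], "E": [], "F": []}
WIDTH = 2  # two ops may issue per cycle; unit latency (an assumption)

def ready(done):
    return [op for op in DEPS if op not in done and all(p in done for p in DEPS[op])]

def greedy_length(priority):
    """List scheduling: each cycle, issue the WIDTH highest-priority ready ops."""
    done, cycles = set(), 0
    while len(done) < len(DEPS):
        issue = sorted(ready(done), key=priority)[:WIDTH]
        done |= set(issue)
        cycles += 1
    return cycles

def optimal_length(done=frozenset()):
    """Exhaustive search over issue groups: the minimal schedule length."""
    if len(done) == len(DEPS):
        return 0
    r = ready(done)
    best = None
    for k in range(1, min(WIDTH, len(r)) + 1):
        for group in combinations(r, k):
            cost = 1 + optimal_length(done | frozenset(group))
            best = cost if best is None else min(best, cost)
    return best
```

An ILP scheduler plays the role of `optimal_length` but scales to hundreds of instructions by encoding the search as constraints rather than enumerating schedules.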
  • Global Multi-Threaded Instruction Scheduling

    Publication Year: 2007 , Page(s): 56 - 68
    Cited by:  Papers (1)  |  Patents (3)
    PDF (586 KB) | HTML

    The microprocessor industry has moved toward chip multiprocessor (CMP) designs as a means of utilizing the increasing transistor counts in the face of physical and micro-architectural limitations. Despite this move, CMPs do not directly improve the performance of single-threaded codes, a characteristic of most applications. In order to support parallelization of general-purpose applications, computer architects have proposed CMPs with lightweight scalar communication mechanisms (R. Rangan et al., 2004), (K. Sankaralingam et al., 2003), (M.B. Taylor et al., 2005). Despite such support, most existing compiler multi-threading techniques have generally demonstrated little effectiveness in extracting parallelism from non-scientific applications (W. Lee et al., 1998), (W. Lee et al., 2002), (K. Rich and M. Farrens, 2004). The main reason for this is that such techniques are mostly restricted to extracting parallelism within straight-line regions of code. In this paper, we first propose a framework that enables global multi-threaded instruction scheduling in general. We then describe GREMIO, a scheduler built using this framework. GREMIO operates at a global scope, at the procedure level, and uses control dependence analysis to extract non-speculative thread-level parallelism from sequential codes. Using a fully automatic compiler implementation of GREMIO and a validated processor model, this paper demonstrates gains for a dual-core CMP model running a variety of codes. Our experiments demonstrate the advantage of exploiting global scheduling for multithreaded architectures, and present gains in a detailed comparison with the decoupled software pipelining (DSWP) multi-threading technique (G. Ottoni et al., 2005). Furthermore, our experiments show that adding GREMIO to a compiler with DSWP improves the average speedup from 16.5% to 32.8% for important benchmark functions when utilizing two cores, indicating the importance of this technique in making compilers extract threads effectively.

  • Revisiting the Sequential Programming Model for Multi-Core

    Publication Year: 2007 , Page(s): 69 - 84
    Cited by:  Papers (14)
    PDF (448 KB) | HTML

    Single-threaded programming is already considered a complicated task. The move to multi-threaded programming only increases the complexity and cost involved in software development due to rewriting legacy code, training of the programmer, increased debugging of the program, and efforts to avoid race conditions, deadlocks, and other problems associated with parallel programming. To address these costs, other approaches, such as automatic thread extraction, have been explored. Unfortunately, the amount of parallelism that has been automatically extracted is generally insufficient to keep many cores busy. This paper argues that this lack of parallelism is not an intrinsic limitation of the sequential programming model, but rather occurs for two reasons. First, there exists no framework for automatic thread extraction that brings together key existing state-of-the-art compiler and hardware techniques. This paper shows that such a framework can yield scalable parallelization on several SPEC CINT2000 benchmarks. Second, existing sequential programming languages force programmers to define a single legal program outcome, rather than allowing for a range of legal outcomes. This paper shows that natural extensions to the sequential programming model enable parallelization for the remainder of the SPEC CINT2000 suite. Our experience demonstrates that, by changing only 60 source code lines, all of the C benchmarks in the SPEC CINT2000 suite were parallelizable by automatic thread extraction. This process, constrained by the limits of modern optimizing compilers, yielded a speedup of 454% on these applications.

  • Penelope: The NBTI-Aware Processor

    Publication Year: 2007 , Page(s): 85 - 96
    Cited by:  Papers (48)  |  Patents (6)
    PDF (527 KB) | HTML

    Transistors consist of fewer atoms with every technology generation. Such atoms may be displaced due to the stress caused by high temperature, frequency and current, leading to failures. NBTI (negative bias temperature instability) is one of the most important sources of failure affecting transistors. NBTI degrades PMOS transistors whenever the voltage at the gate is negative (logic input "0"). The main consequence is a reduction in the maximum operating frequency and an increase in the minimum supply voltage of storage structures to cope with the degradation. Many PMOS transistors affected by NBTI can be found in both combinational and storage blocks since they observe a "0" at their gates most of the time. This paper proposes and evaluates the design of Penelope, an NBTI-aware processor. We propose (i) generic strategies to mitigate degradation in both combinational and storage blocks, (ii) specific techniques to protect individual blocks by applying the global strategies, and (iii) a metric to assess the benefits of reduced degradation and the overheads in performance and power.

  • Software-Based Online Detection of Hardware Defects: Mechanisms, Architectural Support, and Evaluation

    Publication Year: 2007 , Page(s): 97 - 108
    Cited by:  Papers (35)
    PDF (702 KB) | HTML

    As silicon process technology scales deeper into the nanometer regime, hardware defects are becoming more common. Such defects are bound to hinder the correct operation of future processor systems, unless new online techniques become available to detect and to tolerate them while preserving the integrity of software applications running on the system. This paper proposes a new, software-based, defect detection and diagnosis technique. We introduce a novel set of instructions, called access-control extension (ACE), that can access and control the microprocessor's internal state. Special firmware periodically suspends microprocessor execution and uses the ACE instructions to run directed tests on the hardware. When a hardware defect is present, these tests can diagnose and locate it, and then activate system repair through resource reconfiguration. The software nature of our framework makes it flexible: testing techniques can be modified/upgraded in the field to trade off performance with reliability without requiring any change to the hardware. We evaluated our technique on a commercial chip-multiprocessor based on Sun's Niagara and found that it can provide very high coverage, with 99.22% of all silicon defects detected. Moreover, our results show that the average performance overhead of software-based testing is only 5.5%. Based on a detailed RTL-level implementation of our technique, we find its area overhead to be quite modest, with only a 5.8% increase in total chip area.

  • Self-calibrating Online Wearout Detection

    Publication Year: 2007 , Page(s): 109 - 122
    Cited by:  Papers (19)  |  Patents (1)
    PDF (540 KB) | HTML

    Technology scaling, characterized by decreasing feature size, thinning gate oxide, and non-ideal voltage scaling, will become a major hindrance to microprocessor reliability in future technology generations. Physical analysis of device failure mechanisms has shown that most wearout mechanisms projected to plague future technology generations are progressive, meaning that the circuit-level effects of wearout develop and intensify with age over the lifetime of the chip. This work leverages the progression of wearout over time in order to present a low-cost hardware structure that identifies increasing propagation delay, which is symptomatic of many forms of wearout, to accurately forecast the failure of microarchitectural structures. To motivate the use of this predictive technique, an HSPICE analysis of the effects of one particular failure mechanism, gate oxide breakdown, on gates from a standard cell library characterized for a 90 nm process is presented. This gate-level analysis is then used to demonstrate the aggregate change in output delay of high-level structures within a synthesized Verilog model of an embedded microprocessor core. Leveraging this analysis, a self-calibrating hardware structure for conducting statistical analysis of output delay is presented and its efficacy in predicting the failure of a variety of structures within the microprocessor core is evaluated.

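The self-calibrating statistical idea can be sketched simply: learn a per-structure delay baseline early in life, then raise an alarm when the recent mean delay drifts past the baseline by a guard band. The guard-band factor and sample values below are illustrative, not the paper's hardware parameters:

```python
import statistics

def calibrate(samples):
    """Early-life calibration: record the structure's baseline delay
    distribution (mean and sample standard deviation)."""
    return statistics.mean(samples), statistics.stdev(samples)

def wearout_alarm(baseline, window, k=3.0):
    """Predictive alarm: recent mean delay exceeds baseline mean + k*sigma,
    suggesting progressive wearout rather than noise."""
    mean, sigma = baseline
    return statistics.mean(window) > mean + k * sigma
```

Because the baseline is measured per structure on each chip, the check automatically absorbs die-to-die process variation, which is the "self-calibrating" part.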
  • Implementing Signatures for Transactional Memory

    Publication Year: 2007 , Page(s): 123 - 133
    Cited by:  Papers (22)  |  Patents (2)
    PDF (458 KB) | HTML

    Transactional Memory (TM) systems must track the read and write sets - items read and written during a transaction - to detect conflicts among concurrent transactions. Several TMs use signatures, which summarize unbounded read/write sets in bounded hardware at a performance cost of false positives (conflicts detected when none exists). This paper examines different organizations to achieve hardware-efficient and accurate TM signatures. First, we find that implementing each signature with a single k-hash-function Bloom filter (True Bloom signature) is inefficient, as it requires multi-ported SRAMs. Instead, we advocate using k single-hash-function Bloom filters in parallel (Parallel Bloom signature), using area-efficient single-ported SRAMs. Our formal analysis shows that both organizations perform equally well in theory and our simulation-based evaluation shows this to hold approximately in practice. We also show that by choosing high-quality hash functions we can achieve signature designs noticeably more accurate than the previously proposed implementations. Finally, we adapt Pagh and Rodler's cuckoo hashing to implement Cuckoo-Bloom signatures. While this representation does not support set intersection, it mitigates false positives for the common case of small read/write sets and performs like a Bloom filter for large sets.

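A Parallel Bloom signature can be sketched as k single-hash-function filters, each with its own bit vector (modeling k single-ported SRAMs). Salted SHA-1 stands in for the paper's hardware hash functions, and the sizes are illustrative:

```python
import hashlib

K, BITS = 4, 64   # k filters, BITS bits per filter (illustrative sizes)

def _hash(i, addr):
    # Stand-in for hardware hash function i; salted so the k hashes differ.
    digest = hashlib.sha1(f"{i}:{addr}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % BITS

def empty_signature():
    return [0] * K            # one bit vector per filter, stored as ints

def insert(sig, addr):
    # Each filter sets one bit: a single-ported SRAM write per filter.
    for i in range(K):
        sig[i] |= 1 << _hash(i, addr)

def might_contain(sig, addr):
    # A miss in ANY filter proves absence (no false negatives);
    # hits in all k filters may still be false positives.
    return all((sig[i] >> _hash(i, addr)) & 1 for i in range(K))
```

Conflict detection between two transactions then reduces to testing each incoming address against the other transaction's signature; the no-false-negative property is what makes signatures safe to use for this.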
  • Smart Refresh: An Enhanced Memory Controller Design for Reducing Energy in Conventional and 3D Die-Stacked DRAMs

    Publication Year: 2007 , Page(s): 134 - 145
    Cited by:  Papers (32)  |  Patents (4)
    PDF (522 KB) | HTML

    DRAMs require periodic refresh for preserving data stored in them. The refresh interval for DRAMs depends on the vendor and the design technology they use. For each refresh in a DRAM row, the stored information in each cell is read out and then written back to itself as each DRAM bit read is self-destructive. The refresh process is inevitable for maintaining data correctness, unfortunately, at the expense of power and bandwidth overhead. The future trend to integrate layers of 3D die-stacked DRAMs on top of a processor further exacerbates the situation as accesses to these DRAMs will be more frequent and hiding refresh cycles in the available slack becomes increasingly difficult. Moreover, due to the implication of temperature increase, the refresh interval of 3D die-stacked DRAMs will become shorter than those of conventional ones. This paper proposes an innovative scheme to alleviate the energy consumed in DRAMs. By employing a time-out counter for each memory row of a DRAM module, all the unnecessary periodic refresh operations can be eliminated. The basic concept behind our scheme is that a DRAM row that was recently read or written to by the processor (or other devices that share the same DRAM) does not need to be refreshed again by the periodic refresh operation, thereby eliminating excessive refreshes and the energy dissipated. Based on this concept, we propose a low-cost technique in the memory controller for DRAM power reduction. The simulation results show that our technique can reduce up to 86% of all refresh operations and 59.3% on the average for a 2 GB DRAM. This in turn results in a 52.6% energy savings for refresh operations. The overall energy saving in the DRAM is up to 25.7% with an average of 12.13% obtained for SPLASH-2, SPECint2000, and Biobench benchmark programs simulated on a 2 GB DRAM. For a 64 MB 3D DRAM, the energy saving is up to 21% and 9.37% on an average when the refresh rate is 64 ms. For a faster 32 ms refresh rate the maximum and average savings are 12% and 6.8% respectively.

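The counter-per-row idea can be sketched directly: any read or write implicitly refreshes the row (DRAM reads are destructive and written back), so it resets the row's time-out counter, and only rows whose counter expires need an explicit refresh. The retention window below is illustrative:

```python
RETENTION_TICKS = 4   # intervals a row can go without refresh (assumed)

class SmartRefreshController:
    """Toy memory-controller model: per-row time-out counters suppress
    periodic refreshes for recently accessed rows."""

    def __init__(self, rows):
        self.counters = [0] * rows
        self.refreshes_issued = 0

    def access(self, row):
        # A read or write implicitly refreshes the row: reset its counter.
        self.counters[row] = 0

    def tick(self):
        # One refresh-scheduling interval elapses for every row.
        for row, c in enumerate(self.counters):
            if c + 1 >= RETENTION_TICKS:
                self.refreshes_issued += 1   # explicit refresh required
                self.counters[row] = 0
            else:
                self.counters[row] = c + 1
```

Over four intervals, a row touched every interval triggers no explicit refresh while an idle row triggers exactly one, which is the source of the scheme's refresh-energy savings.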
  • Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors

    Publication Year: 2007 , Page(s): 146 - 160
    Cited by:  Papers (62)  |  Patents (10)
    PDF (480 KB) | HTML

    DRAM memory is a major resource shared among cores in a chip multiprocessor (CMP) system. Memory requests from different threads can interfere with each other. Existing memory access scheduling techniques try to optimize the overall data throughput obtained from the DRAM and thus do not take into account inter-thread interference. Therefore, different threads running together on the same chip can experience extremely different memory system performance: one thread can experience a severe slowdown or starvation while another is unfairly prioritized by the memory scheduler. This paper proposes a new memory access scheduler, called the Stall-Time Fair Memory scheduler (STFM), that provides quality of service to different threads sharing the DRAM memory system. The goal of the proposed scheduler is to "equalize" the DRAM-related slowdown experienced by each thread due to interference from other threads, without hurting overall system performance. As such, STFM takes into account inherent memory characteristics of each thread and does not unfairly penalize threads that use the DRAM system without interfering with other threads. We show that STFM significantly reduces the unfairness in the DRAM system while also improving system throughput (i.e., weighted speedup of threads) on a wide variety of workloads and systems. For example, averaged over 32 different workloads running on an 8-core CMP, the ratio between the highest DRAM-related slowdown and the lowest DRAM-related slowdown reduces from 5.26X to 1.4X, while the average system throughput improves by 7.6%. We qualitatively and quantitatively compare STFM to one new and three previously proposed memory access scheduling algorithms, including network fair queueing. Our results show that STFM provides the best fairness, system throughput, and scalability.

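The core policy can be sketched as follows: estimate each thread's slowdown as the ratio of its stall time when sharing the DRAM to its (estimated) stall time running alone; when the ratio of the largest to the smallest slowdown exceeds a threshold, prioritize the most-slowed thread instead of the throughput-oriented order. The threshold and the stall-time inputs below are illustrative, and estimating "alone" stall time online is itself a key mechanism in the paper that this sketch simply assumes:

```python
UNFAIRNESS_THRESHOLD = 1.10   # assumed; the paper treats this as a knob

def pick_prioritized_thread(stall_shared, stall_alone):
    """Return the thread id whose requests should be prioritized this
    interval, or None to keep the baseline (throughput-first) order."""
    slowdown = {t: stall_shared[t] / stall_alone[t] for t in stall_shared}
    unfairness = max(slowdown.values()) / min(slowdown.values())
    if unfairness > UNFAIRNESS_THRESHOLD:
        return max(slowdown, key=slowdown.get)   # most-slowed thread
    return None
```

Falling back to the throughput-oriented order whenever unfairness is low is what lets the scheduler equalize slowdowns without sacrificing overall system throughput.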
  • Impact of Cache Coherence Protocols on the Processing of Network Traffic

    Publication Year: 2007 , Page(s): 161 - 171
    Cited by:  Papers (3)
    PDF (461 KB) | HTML

    Since the introduction of the 10 GbE standard in 2002, the ability of general purpose processors to efficiently process network traffic with common protocols such as TCP/IP has been revisited and critically evaluated. However, recent commercially available processors such as the Intel® Core™ 2 Duo processor introduce microarchitectural enhancements that could significantly influence the approach to accelerating network processing. We examine the network performance of a real platform containing Intel® Core™ microarchitecture based processors, the role of coherency and a prototype implementation of direct cache placement (direct cache access or DCA) of inbound network traffic. We observe that a substantial portion of the time relates to the inefficiency of I/O specific coherence protocols in the platform. We demonstrate that a relatively low-complexity implementation of DCA called 'Prefetch Hint' provides a 15 to 43% speed-up to receive-side processing across a range of I/O sizes and present a detailed characterization of the benefits.

  • Flattened Butterfly Topology for On-Chip Networks

    Publication Year: 2007 , Page(s): 172 - 182
    Cited by:  Papers (28)  |  Patents (2)
    PDF (783 KB) | HTML

    With the trend toward increasing numbers of cores in chip multiprocessors, the on-chip interconnect that connects the cores needs to scale efficiently. In this work, we propose the use of high-radix networks in on-chip interconnection networks and describe how the flattened butterfly topology can be mapped to on-chip networks. By using high-radix routers to reduce the diameter of the network, the flattened butterfly offers lower latency and energy consumption than conventional on-chip topologies. In addition, by exploiting the two-dimensional planar VLSI layout, the on-chip flattened butterfly can exploit the bypass channels such that non-minimal routing can be used with minimal impact on latency and energy consumption. We evaluate the flattened butterfly and compare it to alternate on-chip topologies using synthetic traffic patterns and traces and show that the flattened butterfly can increase throughput by up to 50% compared to a concentrated mesh and reduce latency by 28% while reducing the power consumption by 38% compared to a mesh network.

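A back-of-envelope hop-count comparison shows why high-radix topologies help. In a 2D flattened butterfly, a router reaches every other router in its row or column in one hop, so any node is at most two hops away; a mesh pays the full Manhattan distance. This sketch counts only hops and ignores channel widths, serialization, and contention, which the paper's evaluation does account for:

```python
def mesh_avg_hops(n):
    """Average Manhattan distance between distinct nodes of an n x n mesh
    under minimal (dimension-ordered) routing."""
    nodes = [(x, y) for x in range(n) for y in range(n)]
    dists = [abs(a - c) + abs(b - d)
             for (a, b) in nodes for (c, d) in nodes if (a, b) != (c, d)]
    return sum(dists) / len(dists)

def flattened_butterfly_avg_hops(n):
    """2D flattened butterfly with minimal routing: 1 hop within a row or
    column, 2 hops otherwise."""
    nodes = [(x, y) for x in range(n) for y in range(n)]
    hops = [1 if a == c or b == d else 2
            for (a, b) in nodes for (c, d) in nodes if (a, b) != (c, d)]
    return sum(hops) / len(hops)
```

For 64 nodes the mesh averages more than five hops per packet while the flattened butterfly averages under two, which is the latency (and router-energization) gap the paper exploits.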
  • Using Address Independent Seed Encryption and Bonsai Merkle Trees to Make Secure Processors OS- and Performance-Friendly

    Publication Year: 2007 , Page(s): 183 - 196
    Cited by:  Papers (15)  |  Patents (3)
    PDF (931 KB) | HTML

    In today's digital world, computer security issues have become increasingly important. In particular, researchers have proposed designs for secure processors which utilize hardware-based memory encryption and integrity verification to protect the privacy and integrity of computation even from sophisticated physical attacks. However, currently proposed schemes remain hampered by problems that make them impractical for use in today's computer systems: lack of virtual memory and inter-process communication support as well as excessive storage and performance overheads. In this paper, we propose 1) address independent seed encryption (AISE), a counter-mode based memory encryption scheme using a novel seed composition, and 2) Bonsai Merkle trees (BMT), a novel Merkle tree-based memory integrity verification technique, to eliminate these system and performance issues associated with prior counter-mode memory encryption and Merkle tree integrity verification schemes. We present both a qualitative discussion and a quantitative analysis to illustrate the advantages of our techniques over previously proposed approaches in terms of complexity, feasibility, performance, and storage. Our results show that AISE+BMT reduces the overhead of prior memory encryption and integrity verification schemes from 12% to 2% on average, while eliminating critical system-level problems.

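The counter-mode mechanism underneath AISE can be sketched in a few lines: the pad is a keyed function of a seed, XORed with the cache line. The address-independence point is about what goes into the seed, a per-line logical identifier plus a counter rather than the line's virtual or physical address, so paging and inter-process sharing do not break decryption. SHA-256 stands in for the block cipher here, and the seed layout is an assumption, not the paper's exact composition:

```python
import hashlib

KEY = b"processor-secret-key"   # illustrative; held on-chip in a real design

def pad(line_id, counter, nbytes):
    """Keyed pad generated from an address-independent seed."""
    seed = f"{line_id}:{counter}".encode()
    return hashlib.sha256(KEY + seed).digest()[:nbytes]

def encrypt(plaintext, line_id, counter):
    # Counter mode: ciphertext = plaintext XOR pad(seed). The pad can be
    # generated while the line is being fetched, hiding decryption latency.
    p = pad(line_id, counter, len(plaintext))
    return bytes(a ^ b for a, b in zip(plaintext, p))

decrypt = encrypt   # XOR with the same pad inverts the encryption
```

Bumping the counter on every write-back keeps seeds unique, which counter-mode security requires; BMT then protects the counters (rather than all of memory) with a much smaller Merkle tree.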
  • Multi-bit Error Tolerant Caches Using Two-Dimensional Error Coding

    Publication Year: 2007 , Page(s): 197 - 209
    Cited by:  Papers (52)  |  Patents (3)
    PDF (537 KB) | HTML

    In deep sub-micron ICs, growing amounts of on-die memory and scaling effects make embedded memories increasingly vulnerable to reliability and yield problems. As scaling progresses, soft and hard errors in the memory system will increase and single error events are more likely to cause large-scale multi-bit errors. However, conventional memory protection techniques can neither detect nor correct large-scale multi-bit errors without incurring large performance, area, and power overheads. We propose two-dimensional (2D) error coding in embedded memories, a scalable multi-bit error protection technique to improve memory reliability and yield. The key innovation is the use of vertical error coding across words that is used only for error correction in combination with conventional per-word horizontal error coding. We evaluate this scheme in the cache hierarchies of two representative chip multiprocessor designs and show that 2D error coding can correct clustered errors up to 32×32 bits with significantly smaller performance, area, and power overheads than conventional techniques.

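The two-dimensional geometry can be sketched with plain parity (the paper uses stronger codes, but parity keeps the idea visible): per-word horizontal parity detects an error and names the word, while a vertical parity word XORed down the array names the flipped bit columns, so the error can be corrected at their intersection:

```python
WORD_BITS = 8   # illustrative word width

def parity(word):
    # Horizontal check bit for one word (even parity).
    return bin(word).count("1") & 1

def vertical_parity(words):
    # Vertical check word: column-wise XOR across the array.
    v = 0
    for w in words:
        v ^= w
    return v

def correct_single_bit(words, h_par, v_par):
    """Repair one flipped bit in-place using the stored horizontal parity
    bits (h_par) and vertical parity word (v_par)."""
    bad_row = next(i for i, w in enumerate(words) if parity(w) != h_par[i])
    bad_cols = vertical_parity(words) ^ v_par   # one-hot mask of flipped column
    words[bad_row] ^= bad_cols
    return words
```

Because the vertical code spans many words, the same structure extends to correcting a clustered burst (one flipped column mask applied to each failing row), which is how the scheme reaches large multi-bit clusters cheaply.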
  • Argus: Low-Cost, Comprehensive Error Detection in Simple Cores

    Publication Year: 2007 , Page(s): 210 - 222
    Cited by:  Papers (28)  |  Patents (4)
    PDF (499 KB) | HTML

    We have developed Argus, a novel approach for providing low-cost, comprehensive error detection for simple cores. The key to Argus is that the operation of a von Neumann core consists of four fundamental tasks - control flow, dataflow, computation, and memory access - that can be checked separately. We prove that Argus can detect any error by observing whether any of these tasks are performed incorrectly. We describe a prototype implementation, Argus-1, based on a single-issue, 4-stage, in-order processor to illustrate the potential of our approach. Experiments show that Argus-1 detects transient and permanent errors in simple cores with much lower impact on performance (<4% average overhead) and chip area (<17% overhead) than previous techniques.
