
42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42), 2009

Date: 12-16 Dec. 2009


Displaying Results 1 - 25 of 56
  • [Front matter]

    Publication Year: 2009 , Page(s): i - xiv
    PDF (1471 KB) | Freely Available from IEEE
  • POWER7 multi-core processor design

    Publication Year: 2009 , Page(s): 1
    PDF (127 KB) | HTML

    Summary form only given. In this talk, we will describe many key architectural, micro-architectural, RAS and power-management features of the POWER7 core for the first time. POWER7 is IBM's first 8-core processor chip, with each core capable of 4-way SMT, fabricated in IBM's 45 nm SOI technology with 11 levels of metal. Details of the processor core will be discussed, along with insights, technical issues and challenges related to designing high performance, power-efficient multi-core chips for building balanced servers.

  • Characterizing and mitigating the impact of process variations on phase change based memory systems

    Publication Year: 2009 , Page(s): 2 - 13
    Cited by:  Papers (10)  |  Patents (4)
    PDF (769 KB) | HTML

    Dynamic Random Access Memory (DRAM) has been used in main memory design for decades. However, DRAM consumes an increasing power budget and faces difficulties in scaling down for small feature size CMOS processing technologies. Compared to conventional DRAM, emerging phase change random access memory (PRAM) demonstrates superior power efficiency and processing scalability as VLSI technologies and integration density continue to advance. Nevertheless, using nano-scale fabrication technologies will unavoidably introduce design parameter variability in the manufacturing stage. In the past, the impact of process variation (PV) on conventional transistor-based storage cells and combinational logic has been studied extensively. However, the implication of PV on non-volatile memory design using emerging phase change techniques has not been well understood. In this paper, we take the first step toward characterizing the effect of process variation on PRAM and explore PV-aware design techniques. We show that process variation increases the PRAM programming power by 96% and degrades PRAM endurance by 50X. Our proposed circuit and two microarchitecture techniques with system-level support reduce PRAM power by 44%, 59% and 57% and improve PRAM endurance by 27X, 277X and 268X, relative to PV-affected PRAM design. Moreover, we show that the synergy of the proposed cross-layer approaches, which achieve an average 63% power savings and 13050X endurance improvement over the conventional case, provides an attractive design solution to mitigate the deleterious impact of PV for non-volatile memory in the upcoming nano-scale processing technology era.

  • Enhancing lifetime and security of PCM-based Main Memory with Start-Gap Wear Leveling

    Publication Year: 2009 , Page(s): 14 - 23
    Cited by:  Papers (35)  |  Patents (3)
    PDF (337 KB) | HTML

    Phase Change Memory (PCM) is an emerging memory technology that can increase main memory capacity in a cost-effective and power-efficient manner. However, PCM cells can endure only a maximum of 10^7-10^8 writes, making a PCM-based system have a lifetime of only a few years under ideal conditions. Furthermore, we show that non-uniformity in writes to different cells reduces the achievable lifetime of a PCM system by 20×. Writes to PCM cells can be made uniform with Wear-Leveling. Unfortunately, existing wear-leveling techniques require large storage tables and indirection, resulting in significant area and latency overheads. We propose Start-Gap, a simple, novel, and effective wear-leveling technique that uses only two registers. By combining Start-Gap with simple address-space randomization techniques we show that the achievable lifetime of the baseline 16 GB PCM-based system is boosted from 5% (with no wear-leveling) to 97% of the theoretical maximum, while incurring a total storage overhead of less than 13 bytes and obviating the latency overhead of accessing large tables. We also analyze the security vulnerabilities for memory systems that have limited write endurance, showing that under adversarial settings, a PCM-based system can fail in less than one minute. We provide a simple extension to Start-Gap that makes PCM-based systems robust to such malicious attacks.
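
    As a concrete illustration of the two-register scheme, the sketch below models Start-Gap as a functional toy in Python: GAP names the one spare slot, the line next to the gap is copied into it every PSI writes, and START advances each time the gap completes a full rotation. The PSI value is illustrative and the address-space randomization layer is omitted; this is a reading of the abstract, not the authors' hardware.

        class StartGap:
            PSI = 100  # gap movement interval in writes (illustrative)

            def __init__(self, num_lines):
                self.n = num_lines    # N logical lines in N+1 physical slots
                self.start = 0        # register 1: rotation offset
                self.gap = num_lines  # register 2: index of the empty slot
                self.writes = 0

            def remap(self, la):
                # Logical address -> physical slot, skipping over the gap.
                pa = (la + self.start) % self.n
                return pa + 1 if pa >= self.gap else pa

            def write(self, la):
                # Every PSI writes, move the gap one slot; when it wraps,
                # bump START so lines keep rotating through all slots.
                self.writes += 1
                if self.writes % self.PSI == 0:
                    if self.gap == 0:
                        self.gap = self.n              # [0] = [n], copy elided
                        self.start = (self.start + 1) % self.n
                    else:
                        self.gap -= 1                  # [gap] = [gap-1], copy elided
                return self.remap(la)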

  • Characterizing flash memory: Anomalies, observations, and applications

    Publication Year: 2009 , Page(s): 24 - 33
    Cited by:  Papers (28)  |  Patents (1)
    PDF (4504 KB) | HTML

    Despite flash memory's promise, it suffers from many idiosyncrasies such as limited durability, data integrity problems, and asymmetry in operation granularity. As architects, we aim to find ways to overcome these idiosyncrasies while exploiting flash memory's useful characteristics. To be successful, we must understand the trade-offs between the performance, cost (in both power and dollars), and reliability of flash memory. In addition, we must understand how different usage patterns affect these characteristics. Flash manufacturers provide conservative guidelines about these metrics, and this lack of detail makes it difficult to design systems that fully exploit flash memory's capabilities. We have empirically characterized flash memory technology from five manufacturers by directly measuring the performance, power, and reliability. We demonstrate that performance varies significantly across vendors and devices, and deviates from publicly available datasheets. We also demonstrate and quantify some unexpected device characteristics and show how we can use them to improve the responsiveness and energy consumption of solid state disks by 44% and 13%, respectively, as well as increase flash device lifetime by 5.2x.

  • Complexity effective memory access scheduling for many-core accelerator architectures

    Publication Year: 2009 , Page(s): 34 - 44
    Cited by:  Papers (1)  |  Patents (1)
    PDF (369 KB) | HTML

    Modern DRAM systems rely on memory controllers that employ out-of-order scheduling to maximize row access locality and bank-level parallelism, which in turn maximizes DRAM bandwidth. This is especially important in graphics processing unit (GPU) architectures, where the large quantity of parallelism places a heavy demand on the memory system. The logic needed for out-of-order scheduling can be expensive in terms of area, especially when compared to an in-order scheduling approach. In this paper, we propose a complexity-effective solution to DRAM request scheduling which recovers most of the performance loss incurred by a naive in-order first-in first-out (FIFO) DRAM scheduler compared to an aggressive out-of-order DRAM scheduler. We observe that the memory request stream from individual GPU shader cores tends to have sufficient row access locality to maximize DRAM efficiency in most applications without significant reordering. However, the interconnection network across which memory requests are sent from the shader cores to the DRAM controller tends to finely interleave the numerous memory request streams in a way that destroys the row access locality of the resultant stream seen at the DRAM controller. To address this, we employ an interconnection network arbitration scheme that preserves the row access locality of individual memory request streams and, in doing so, achieves DRAM efficiency and system performance close to that of out-of-order memory request scheduling, with a much simpler design. We evaluate our interconnection network arbitration scheme using crossbar, mesh, and ring networks for a baseline architecture of 8 memory channels, each controlled by its own DRAM controller, and 28 shader cores (224 ALUs), supporting up to 1,792 in-flight memory requests. Our results show that our interconnect arbitration scheme coupled with a banked FIFO in-order scheduler obtains up to 91% of the performance obtainable with an out-of-order memory scheduler for a crossbar network with eight-entry DRAM controller queues.
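
    The pivotal mechanism here is the arbitration policy rather than the scheduler. One way to picture it is a hold-grant arbiter, sketched below: the previous winner keeps the grant while its head-of-line request targets the same DRAM row, and otherwise the grant moves round-robin. The Request shape and the hold-grant rule are illustrative assumptions, not the paper's exact circuit.

        from collections import deque, namedtuple

        Request = namedtuple("Request", "row payload")

        def grant(ports, last_winner, last_row):
            # Hold the grant while the last winner keeps streaming to the
            # same row, so its row locality survives the interconnect.
            q = ports[last_winner]
            if q and q[0].row == last_row:
                return last_winner
            n = len(ports)
            for step in range(1, n + 1):        # round-robin fallback
                cand = (last_winner + step) % n
                if ports[cand]:
                    return cand
            return None

        ports = [deque([Request(7, "a"), Request(7, "b")]),  # streaming row 7
                 deque([Request(3, "c")]), deque()]
        print(grant(ports, last_winner=0, last_row=7))       # -> 0, grant held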

  • Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping

    Publication Year: 2009 , Page(s): 45 - 55
    Cited by:  Papers (29)
    PDF (275 KB) | HTML

    Heterogeneous multiprocessors are increasingly important in the multi-core era due to their potential for high performance and energy efficiency. In order for software to fully realize this potential, the step that maps computations to processing elements must be as automated as possible. However, the state-of-the-art approach is to rely on the programmer to specify this mapping manually and statically. This approach is not only labor intensive but also not adaptable to changes in runtime environments like problem sizes and hardware/software configurations. In this study, we propose adaptive mapping, a fully automatic technique to map computations to processing elements on a CPU+GPU machine. We have implemented it in our experimental heterogeneous programming system called Qilin. Our results show that, by judiciously distributing work over the CPU and GPU, automatic adaptive mapping achieves a 25% reduction in execution time and a 20% reduction in energy consumption compared to static mappings, on average, for a set of important computation benchmarks. We also demonstrate that our technique is able to adapt to changes in the input problem size and system configuration.
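
    The abstract does not spell out Qilin's model, but adaptive mapping of this kind is commonly realized by fitting per-device execution-time models from a few training runs and picking the split that makes both devices finish together. The sketch below assumes linear models T(x) = a + b*x; all profiling numbers are made up.

        def fit_linear(sizes, times):
            # Least-squares fit of T(x) = a + b*x from profiling runs.
            n = len(sizes)
            mx, my = sum(sizes) / n, sum(times) / n
            b = (sum((x - mx) * (y - my) for x, y in zip(sizes, times))
                 / sum((x - mx) ** 2 for x in sizes))
            return my - b * mx, b

        def cpu_share(cpu_model, gpu_model, n):
            # Fraction f of n elements for the CPU so both finish together:
            # a_c + b_c*(f*n) = a_g + b_g*((1-f)*n), solved for f.
            (ac, bc), (ag, bg) = cpu_model, gpu_model
            f = (ag + bg * n - ac) / ((bc + bg) * n)
            return min(max(f, 0.0), 1.0)

        cpu = fit_linear([1e6, 2e6, 4e6], [2.1, 4.0, 8.2])  # CPU training runs
        gpu = fit_linear([1e6, 2e6, 4e6], [0.9, 1.5, 2.8])  # GPU training runs
        print(cpu_share(cpu, gpu, 8e6))   # CPU gets the smaller share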

  • DDT: Design and evaluation of a dynamic program analysis for optimizing data structure usage

    Publication Year: 2009 , Page(s): 56 - 66
    Cited by:  Patents (1)
    PDF (586 KB) | HTML

    Data structures define how values being computed are stored and accessed within programs. By recognizing what data structures are being used in an application, tools can make applications more robust by enforcing data structure consistency properties, and developers can better understand and more easily modify applications to suit the target architecture for a particular application. This paper presents the design and application of DDT, a new program analysis tool that automatically identifies data structures within an application. An application binary is instrumented to dynamically monitor how the data is stored and organized for a set of sample inputs. The instrumentation detects which functions interact with the stored data, and creates a signature for these functions using dynamic invariant detection. The invariants of these functions are then matched against a library of known data structures, providing a probable identification. That is, DDT uses program consistency properties to identify what data structures an application employs. The empirical evaluation shows that this technique is highly accurate across several different implementations of standard data structures, enabling aggressive optimizations in many situations.

  • Tree register allocation

    Publication Year: 2009 , Page(s): 67 - 77
    PDF (402 KB) | HTML

    This paper presents tree register allocation, which maps the lifetimes of the variables in a program into a set of trees, colors each tree in a greedy style, which is optimal when there is no spilling, and connects dataflow between and within the trees afterward. This approach generalizes and subsumes as special cases SSA-based, linear scan, and local register allocation. It keeps their simplicity and low throughput cost, and exposes a wide solution space beyond them. Its flexibility enables control flow structure and/or profile information to be better reflected in the trees. This approach has been prototyped in the Phoenix production compiler framework. Preliminary experiments suggest this is a promising direction with great potential. Register allocators based on two special kinds of trees, extended basic blocks and the maximal spanning tree, are found to be competitive alternatives to SSA-based register allocation, and both tend to generate better code than linear scan.

  • Portable compiler optimisation across embedded programs and microarchitectures using machine learning

    Publication Year: 2009 , Page(s): 78 - 88
    Cited by:  Papers (1)
    PDF (860 KB) | HTML

    Building an optimising compiler is a difficult and time consuming task which must be repeated for each generation of a microprocessor. As the underlying microarchitecture changes from one generation to the next, the compiler must be retuned to optimise specifically for that new system. It may take several releases of the compiler to effectively exploit a processor's performance potential, by which time a new generation has appeared and the process starts again. We address this challenge by developing a portable optimising compiler. Our approach employs machine learning to automatically learn the best optimisations to apply for any new program on a new microarchitectural configuration. It achieves this by learning a model off-line which maps a microarchitecture description plus the hardware counters from a single run of the program to the best compiler optimisation passes. Our compiler gains 67% of the maximum speedup obtainable by an iterative compiler search using 1000 evaluations. We obtain, on average, a 1.16x speedup over the highest default optimisation level across an entire microarchitecture configuration space, achieving a 4.3x speedup in the best case. We demonstrate the robustness of this technique by applying it to an extended microarchitectural space where we achieve comparable performance.
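
    The learned model maps one feature vector (microarchitecture description plus hardware counters from a single run) to a set of passes. The sketch below stands in a 1-nearest-neighbour lookup for the paper's learner, which the abstract does not name, just to show the shape of that mapping; the feature vectors and pass lists are invented.

        def predict_passes(training_runs, uarch_desc, counters):
            # 1-NN over prior (features -> best passes) observations.
            feats = list(uarch_desc) + list(counters)
            def sqdist(other):
                return sum((a - b) ** 2 for a, b in zip(other, feats))
            best = min(training_runs, key=lambda r: sqdist(r["features"]))
            return best["passes"]

        runs = [{"features": [2, 32, 1, 9.1e8], "passes": ["-O2", "-funroll-loops"]},
                {"features": [4, 64, 2, 3.2e8], "passes": ["-O3", "-fvectorize"]}]
        print(predict_passes(runs, uarch_desc=[4, 64, 2], counters=[3.0e8]))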

  • Improving cache lifetime reliability at ultra-low voltages

    Publication Year: 2009 , Page(s): 89 - 99
    Cited by:  Papers (13)  |  Patents (1)
    PDF (360 KB) | HTML

    Voltage scaling is one of the most effective mechanisms to reduce microprocessor power consumption. However, the increased severity of manufacturing-induced parameter variations at lower voltages limits voltage scaling to a minimum voltage, Vccmin, below which a processor cannot operate reliably. Memory cell failures in large memory structures (e.g., caches) typically determine the Vccmin for the whole processor. Memory failures can be persistent (i.e., failures at time zero which cause yield loss) or non-persistent (e.g., soft errors or erratic bit failures). Both types of failures increase as supply voltage decreases and both need to be addressed to achieve reliable operation at low voltages. In this paper, we propose a novel adaptive technique to improve cache lifetime reliability and enable low voltage operation. This technique, multi-bit segmented ECC (MS-ECC), addresses both persistent and non-persistent failures. Like previous work on mitigating persistent failures, MS-ECC trades off cache capacity for lower voltages. However, unlike previous schemes, MS-ECC does not rely on testing to identify and isolate defective bits, and therefore enables error tolerance for non-persistent failures like erratic bits and soft errors at low voltages. Furthermore, MS-ECC's design can allow the operating system to adaptively change the cache size and ECC capability to adjust to system operating conditions. Compared to current designs with single-bit correction, the most aggressive implementation of MS-ECC enables a 30% reduction in supply voltage, reducing power by 71% and energy per instruction by 42%.

  • ZerehCache: Armoring cache architectures in high defect density technologies

    Publication Year: 2009 , Page(s): 100 - 110
    Cited by:  Papers (4)
    PDF (1542 KB) | HTML

    Aggressive technology scaling to 45 nm and below introduces serious reliability challenges to the design of microprocessors. Large SRAM structures used for caches are particularly sensitive to process variation due to their high density and organization. Designers typically over-provision caches with additional resources to overcome hard faults. However, static allocation and binding of redundant resources results in low utilization of the extra resources and ultimately limits the number of defects that can be tolerated. This work re-examines the design of process variation tolerant on-chip caches with a focus on flexibility and dynamic reconfigurability to allow a large number of defects to be tolerated with modest hardware overhead. Our approach, ZerehCache, combines redundant data array elements with a permutation network to provide a higher degree of freedom in replacement. A graph coloring algorithm is used to configure the network and find the proper mapping of replacement elements. We perform an extensive design space exploration of both L1/L2 caches to identify several Pareto-optimal ZerehCaches. For the yield analysis, a population of 1000 chips was studied at the 45 nm technology node; an L1 design with 16% and an L2 design with 8% area overhead achieve yields of 99% and 96%, respectively.

  • Low Vccmin fault-tolerant cache with highly predictable performance

    Publication Year: 2009 , Page(s): 111 - 121
    Cited by:  Papers (4)
    PDF (330 KB) | HTML

    The number of transistors per unit area doubles with every new technology node. However, the electric field density and power demand grow if Vcc is not scaled. Therefore, Vcc must be scaled in step with new technology nodes to prevent excessive degradation and keep power demand within reasonable limits. Unfortunately, low Vcc operation exacerbates the effect of variations and decreases noise and stability margins, increasing the likelihood of errors in SRAM memories such as caches. Those errors translate into performance loss and performance variation across different cores, which is especially undesirable in a multi-core processor. This paper presents (i) a novel scheme to tolerate high faulty-bit rates in caches by disabling only faulty subblocks, (ii) a dynamic address remapping scheme to reduce performance variation across different cores, which is key for performance predictability, and (iii) a comparison with state-of-the-art techniques for faulty-bit tolerance in caches. Results for some typical first-level data cache configurations show a 15% average performance increase and a standard deviation reduction from 3.13% down to 0.55% when compared to cache line disabling schemes.
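
    A minimal sketch of scheme (i), assuming a per-line fault map obtained from manufacturing test or BIST: an access that touches a disabled subblock is treated as a miss and served from the next level, while the remaining subblocks of the line stay usable. The dynamic address remapping of scheme (ii) is not modeled here.

        class SubblockDisableCache:
            def __init__(self, lines, subblocks, fault_map):
                # fault_map[line][sb] is True if that subblock is faulty
                # at the chosen low Vcc (assumed given by test/BIST).
                self.fault_map = fault_map
                self.data = [[None] * subblocks for _ in range(lines)]

            def read(self, line, sb):
                if self.fault_map[line][sb]:
                    return None        # disabled subblock: treat as a miss
                return self.data[line][sb]

            def fill(self, line, sb, value):
                if not self.fault_map[line][sb]:
                    self.data[line][sb] = value   # never fill faulty subblocks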

  • mSWAT: Low-cost hardware fault detection and diagnosis for multicore systems

    Publication Year: 2009 , Page(s): 122 - 132
    Cited by:  Papers (7)
    PDF (356 KB) | HTML

    Continued technology scaling is resulting in systems with billions of devices. Unfortunately, these devices are prone to failures from various sources, resulting in even commodity systems being affected by the growing reliability threat. Thus, traditional solutions involving high redundancy or piecemeal solutions targeting specific failure modes will no longer be viable owing to their high overheads. Recent reliability solutions have explored using low-cost monitors that watch for anomalous software behavior as a symptom of hardware faults. We previously proposed the SWAT system that uses such low-cost detectors to detect hardware faults, and a higher cost mechanism for diagnosis. However, all of the prior work in this context, including SWAT, assumes single-threaded applications and has not been demonstrated for multithreaded applications running on multicore systems. This paper presents mSWAT, the first work to apply symptom-based detection and diagnosis for faults in multicore architectures running multithreaded software. For detection, we extend the symptom-based detectors in SWAT and show that they result in a very low silent data corruption (SDC) rate for both permanent and transient hardware faults. For diagnosis, the multicore environment poses significant new challenges. First, the deterministic replay required for SWAT's single-threaded diagnosis incurs higher overheads for multithreaded workloads. Second, the fault may propagate to fault-free cores, resulting in symptoms from fault-free cores and no available known-good core, breaking fundamental assumptions of SWAT's diagnosis algorithm. We propose a novel permanent fault diagnosis algorithm for multithreaded applications running on multicore systems that uses a lightweight isolated deterministic replay to diagnose the faulty core with no prior knowledge of a known-good core. Our results show that this technique successfully diagnoses over 95% of the detected permanent faults while incurring low hardware overheads. mSWAT thus offers an affordable solution to protect future multicore systems from hardware faults.

  • BulkCompiler: High-performance Sequential Consistency through cooperative compiler and hardware support

    Publication Year: 2009 , Page(s): 133 - 144
    Cited by:  Papers (7)
    PDF (196 KB) | HTML

    A platform that supported sequential consistency (SC) for all codes - not only the well-synchronized ones - would simplify the task of programmers. Recently, several hardware architectures that support high-performance SC by committing groups of instructions at a time have been proposed. However, for a platform to support SC, it is insufficient that the hardware does; the compiler has to support SC as well. This paper presents the hardware-compiler interface, and the main compiler ideas for BulkCompiler, a simple compiler layer that works with the group-committing hardware to provide a whole-system high-performance SC platform. We introduce ISA primitives and software algorithms for BulkCompiler to drive instruction-group formation, and to transform code to exploit the groups. Our simulation results show that BulkCompiler not only enables a whole-system SC environment, but also one that actually outperforms a conventional platform that uses the more relaxed Java Memory Model by an average of 37%. The speedups come from code optimization inside software-assembled instruction groups.

  • EazyHTM: EAger-LaZY hardware Transactional Memory

    Publication Year: 2009 , Page(s): 145 - 155
    Cited by:  Papers (12)
    PDF (455 KB) | HTML

    Transactional memory aims to provide a programming model that makes parallel programming easier. Hardware implementations of transactional memory (HTM) have lower overheads than software implementations, and refinements in conflict management strategies for HTM allow for even larger improvements. In particular, lazy conflict management has been shown to deliver better performance, but it has hitherto required complex protocols and implementations. In this paper we show a new scalable HTM architecture that performs comparably to the state-of-the-art and can be implemented by minor modifications to the MESI protocol rather than re-engineering it from the ground up. Our approach detects conflicts eagerly while a transaction is running, but defers the resolution lazily until commit time. We evaluate this EAger-laZY system, EazyHTM, by comparing it with the scalable-TCC-like approach and a system employing ideal lazy conflict management with a zero-cycle transaction validation and fully-parallel commits. We show that EazyHTM performs on average 7% faster than scalable-TCC. In addition, EazyHTM has fast commits and aborts, can commit in parallel even if there is only one directory present, and does not suffer from cascading waits.
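
    The eager-detect/lazy-resolve split can be pictured as below: each transaction records conflicting peers in a "racers" set the moment a conflict is observed, but nobody is aborted until the recording transaction commits. This is a sketch of the idea only; in the actual design the bookkeeping lives in the minor MESI protocol modifications the abstract mentions.

        class EazyTx:
            def __init__(self, tid):
                self.tid = tid
                self.racers = set()      # conflicting transactions seen so far

            def on_conflicting_access(self, other_tid):
                self.racers.add(other_tid)   # detect eagerly, act later

            def commit(self, abort_fn):
                for t in self.racers:        # resolve lazily at commit time
                    abort_fn(t)
                self.racers.clear()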

  • Proactive transaction scheduling for contention management

    Publication Year: 2009 , Page(s): 156 - 167
    Cited by:  Papers (6)
    PDF (325 KB) | HTML

    Hardware transactional memory offers a promising high-performance and easier-to-program alternative to lock-based synchronization for creating parallel programs. This is particularly important as hardware manufacturers continue to put more cores on die. But transactional memory still has one main drawback: contention. Contention is caused by multiple transactions trying to speculatively modify the same memory location concurrently, causing one or more transactions to abort and retry their execution. Contention serializes the execution, meaning high contention leads to very poor parallel performance. As more cores are added, contention worsens. To date, contention-manager designs have been primarily reactive in nature and limited to various forms of randomized backoff to effectively stall contending transactions when conflicts occur. While backoff-based managers have been popular due to their simplicity, at higher core counts our analysis on the STAMP benchmark suite shows that backoff-based managers perform poorly. In particular, small groups of transactions create hot spots of contention that lead to this poor performance. We show these hot spots commonly consist of small sets of conflicts that occur in a predictable manner. To counter this challenge we introduce a dynamic contention management strategy that minimizes contention by using past history to identify when these hot spots will reoccur in the future and proactively schedule affected transactions around these hot spots. The strategy predicts future contention and schedules to avoid it at runtime without the need for programmer input. Our experiments show that by using our proactive scheduling technique we outperform a backoff-based policy for a 16 processor system by an average of 85%.
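
    A sketch of the history-driven idea, under assumptions about the bookkeeping: a table counts past conflicts between transaction pairs, and a transaction is deferred while a frequent past adversary is still running. The table organization and the threshold are illustrative, not the paper's predictor.

        class ProactiveScheduler:
            def __init__(self, threshold=2):
                self.history = {}        # frozenset({a, b}) -> conflict count
                self.running = set()
                self.threshold = threshold

            def record_conflict(self, a, b):
                key = frozenset((a, b))
                self.history[key] = self.history.get(key, 0) + 1

            def try_start(self, tx):
                # Defer tx while any running transaction has repeatedly
                # conflicted with it in the past; otherwise admit it.
                hot = any(self.history.get(frozenset((tx, r)), 0) >= self.threshold
                          for r in self.running)
                if not hot:
                    self.running.add(tx)
                return not hot           # False -> caller should defer tx

            def finish(self, tx):
                self.running.discard(tx)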

  • Into the wild: Studying real user activity patterns to guide power optimizations for mobile architectures

    Publication Year: 2009 , Page(s): 168 - 178
    Cited by:  Papers (8)  |  Patents (2)
    PDF (907 KB) | HTML

    As the market for mobile architectures continues its rapid growth, it has become increasingly important to understand and optimize the power consumption of these battery-driven devices. While energy consumption has been heavily explored, there is one critical factor that is often overlooked - the end user. Ultimately, the energy consumption of a mobile architecture is defined by user activity. In this paper, we study mobile architectures in their natural environment - in the hands of the end user. Specifically, we develop a logger application for Android G1 mobile phones and release the logger into the wild to collect traces of real user activity. We then show how the traces can be used to characterize power consumption, and guide the development of power optimizations. We present a regression-based power estimation model that relies only on easily-accessible measurements collected by our logger. The model accurately estimates power consumption and provides insights about the power breakdown among hardware components. We show that energy consumption varies widely depending upon the user. In addition, our results show that the screen and the CPU are the two largest power-consuming components. We also study patterns in user behavior to derive power optimizations. We observe that the majority of the active screen time is dominated by long screen intervals. To reduce the energy consumption during these long intervals, we implement a scheme that slowly reduces the screen brightness over time. Our results reveal that users are happier with a system that slowly reduces the screen brightness rather than abruptly doing so, even though the two schemes settle at the same brightness. Similarly, we experiment with a scheme that slowly reduces the CPU frequency over time. We evaluate these optimizations with a user study and demonstrate 10.6% total system energy savings with a minimal impact on user satisfaction.
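
    The gradual-dimming optimization reduces to a small control loop; a sketch under stated assumptions is below. set_brightness is an assumed platform hook, and the starting level, floor, step, and interval are illustrative; the point is that the gradual and abrupt schemes end at the same floor.

        import time

        def dim_gradually(set_brightness, start=1.0, floor=0.4,
                          step=0.05, every_s=30):
            # Lower the backlight a little at a time during a long
            # screen-on interval instead of one abrupt drop.
            level = start
            set_brightness(level)
            while level > floor:
                time.sleep(every_s)
                level = max(floor, level - step)
                set_brightness(level)

        # dim_gradually(lambda b: print(f"brightness={b:.2f}"), every_s=0)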

  • A microarchitecture-based framework for pre- and post-silicon power delivery analysis

    Publication Year: 2009 , Page(s): 179 - 188
    PDF (4079 KB)

    Variations in power supply voltage, which is a function of the power delivery network and dynamic current consumption, can affect circuit reliability. Much work has been done to understand power delivery robustness during both the design phase and the post-silicon validation phase. Methods applicable at the design phase typically synthesize worst-case current waveforms based on simple current constraints, but fail to provide corresponding instruction streams due to their ignorance of the functional aspects of the machine, and hence cannot be validated. Approaches used for post-silicon validation are not useful during design: they either rely heavily on available test content (from power, performance, or defect testing), and hence are limited in validation potential, or employ manually-crafted tests aimed at power delivery, and hence are highly labor-intensive. In this paper, we provide a novel approach to construct processor current waveforms that induce significant droops while at the same time producing instruction streams to achieve those waveforms. We solve the pre-silicon current stimulus generation problem as an optimization problem. The modular framework in this paper utilizes microarchitectural information, current consumption estimates of fine-grained microarchitectural components and a pre-characterized power delivery network to obtain significant droop-inducing current waveforms. The paper further discusses techniques to convert operations associated with these generated waveforms to functional instruction streams. Silicon measurements of such tests run on an industrial microprocessor validate the approach.

  • Reducing peak power with a table-driven adaptive processor core

    Publication Year: 2009 , Page(s): 189 - 200
    PDF (956 KB) | HTML

    The increasing power dissipation of current processors and processor cores constrains design options, increases packaging and cooling costs, increases power delivery costs, and decreases reliability. Much research has been focused on decreasing average power dissipation, which most directly addresses cooling costs and reliability. However, much less has been done to decrease peak power, which most directly impacts the processor design, packaging, and power delivery. This research proposes a new architecture which provides a significant decrease in peak power with limited performance loss. It does this through the use of a highly adaptive processor. Many components of the processor can be configured at different levels, but because they are centrally controlled, the architecture can guarantee that they are never all configured maximally at the same time. This paper describes this adaptive processor and explores mechanisms for transitioning between allowed configurations to maximize performance within a peak power constraint. Such an architecture can cut peak power by 25% with less than 5% performance loss; among other advantages, this frees 5.3% of total core area used for decoupling capacitors.
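
    The central guarantee, that the components can never all be maximally configured at once, amounts to a table-driven admission check. The sketch below shows that check; the unit names, per-level peak powers, and budget are illustrative stand-ins for the paper's tables.

        # Per-unit peak power (W) at each allowed level (all values invented).
        PEAK_W = {"fetch": (2.0, 3.5), "issue": (3.0, 5.0),
                  "alu":   (4.0, 7.0), "dcache": (2.5, 4.0)}

        def legal(config, budget_w=16.0):
            # A configuration (unit -> level index) is admitted only if
            # the summed per-unit peak powers fit within the budget.
            return sum(PEAK_W[u][lvl] for u, lvl in config.items()) <= budget_w

        print(legal({"fetch": 1, "issue": 0, "alu": 1, "dcache": 0}))  # True
        print(legal({u: 1 for u in PEAK_W}))   # False: all-maximal is refused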

  • Extending the effectiveness of 3D-stacked DRAM caches with an adaptive multi-queue policy

    Publication Year: 2009 , Page(s): 201 - 212
    Cited by:  Papers (4)
    PDF (6893 KB) | HTML

    3D-integration is a promising technology to help combat the "memory wall" in future multi-core processors. Past work has considered using 3D-stacked DRAM as a large last-level cache (LLC). While significant performance benefits can be gained with such an approach, there remain additional opportunities beyond the simple integration of commodity DRAM chips. In this work, we leverage the hardware organization typical of DRAM architectures to propose new cache management policies that would otherwise not be practical for standard SRAM-based caches. We propose a cache where each set is organized as multiple logical FIFO or queue structures that simultaneously provide performance isolation between threads as well as reduce the number of entries occupied by dead lines. Our results show that beyond the simplistic approach of stacking DRAM as cache, such tightly-integrated 3D architectures enable new opportunities for optimizing and improving system performance.
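
    A sketch of one set organized as per-thread logical FIFOs, as described above: each thread inserts into its own queue, giving isolation between threads and FIFO aging of dead lines. The victim rule used here (pop from the longest queue) is an assumption standing in for the paper's adaptive policy.

        from collections import deque

        class MultiQueueSet:
            def __init__(self, ways, nthreads):
                self.ways = ways
                self.queues = [deque() for _ in range(nthreads)]

            def insert(self, tid, tag):
                if sum(len(q) for q in self.queues) >= self.ways:
                    max(self.queues, key=len).popleft()   # FIFO eviction
                self.queues[tid].append(tag)

        s = MultiQueueSet(ways=4, nthreads=2)
        for t, tag in [(0, "a"), (0, "b"), (1, "x"), (1, "y"), (0, "c")]:
            s.insert(t, tag)   # 5th insert evicts "a" from thread 0's queue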

  • An hybrid eDRAM/SRAM macrocell to implement first-level data caches

    Publication Year: 2009 , Page(s): 213 - 221
    Cited by:  Papers (3)
    PDF (1463 KB) | HTML

    SRAM and DRAM cells have been the predominant technologies used to implement memory cells in computer systems, each one having its advantages and shortcomings. SRAM cells are faster and require no refresh since reads are not destructive. In contrast, DRAM cells provide higher density and minimal leakage energy since there are no paths within the cell from Vdd to ground. Recently, DRAM cells have been embedded in logic-based technology, thus overcoming the speed limit of typical DRAM cells. In this paper we propose an n-bit macrocell that implements one static cell and n-1 dynamic cells. This cell is aimed at being used in an n-way set-associative first-level data cache. Our study shows that in a four-way set-associative cache with this macrocell, compared to an SRAM-based cache of the same capacity, leakage is reduced by about 75% and area by more than half, with a minimal impact on performance. Architectural mechanisms have also been devised to avoid refresh logic. Experimental results show that no performance is lost when the retention time is larger than 50K processor cycles. In addition, the proposed delayed writeback policy that avoids refreshing performs a similar number of writebacks to a conventional cache with the same organization, so no extra power is incurred.

  • Variation-tolerant non-uniform 3D cache management in die stacked multicore processor

    Publication Year: 2009 , Page(s): 222 - 231
    PDF (1810 KB) | HTML

    Process variations in integrated circuits have significant impact on their performance, leakage and stability. This is particularly evident in large, regular and dense structures such as DRAMs. DRAMs are built using minimized transistors with presumably uniform speed in an organized array structure. Process variation can introduce latency disparity among different memory arrays. With the proliferation of 3D stacking technology, DRAMs become a favorable choice for stacking on top of a multicore processor as a last-level cache for large capacity, high bandwidth, and low power. Hence, variation in bank speed creates a unique problem of non-uniform cache accesses in 3D space. In this paper, we investigate cache management techniques for tolerating process variation in a 3D DRAM stacked onto a multicore processor. We modeled the process variation in a 4-layer DRAM memory to characterize the latency variations among different banks. As a result, the notion of fast and slow banks from the core's standpoint is no longer associated with physical distance to the banks; it is determined by the banks' differing latencies due to process variation. We develop cache migration schemes that utilize fast banks while limiting the cost due to migration. Our experiments show that there is a great performance benefit in exploiting fast memory banks through migration. On average, variation-aware management can improve the performance of a workload over the baseline (where the slowest bank speed is assumed for all banks) by 17.8%. We are also only 0.45% away in performance from an ideal memory where no process variation is present.
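
    The management idea reduces to choosing banks by measured latency rather than physical distance. A sketch under assumptions: hot lines go to the fastest bank with room and cold lines to slower banks, with hot/cold classification and migration-cost control left out.

        def choose_bank(bank_latency, free_ways, hot):
            # Prefer fast banks (low measured latency) for hot lines and
            # slow banks for cold ones, keeping fast banks available.
            order = sorted(range(len(bank_latency)),
                           key=lambda b: bank_latency[b], reverse=not hot)
            for b in order:
                if free_ways[b] > 0:
                    return b
            return order[0]   # no free way anywhere: fall back to first choice

        print(choose_bank([38, 24, 30, 45], free_ways=[2, 0, 1, 3], hot=True))
        # -> 2: the fastest bank (latency 24) is full, so the next-fastest wins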

  • In-Network Coherence Filtering: Snoopy coherence without broadcasts

    Publication Year: 2009 , Page(s): 232 - 243
    Cited by:  Papers (3)
    PDF (574 KB) | HTML

    With transistor miniaturization leading to an abundance of on-chip resources and uniprocessor designs providing diminishing returns, the industry has moved beyond single-core microprocessors and embraced the many-core wave. Scalable cache coherence protocol implementations are necessary to allow fast sharing of data among various cores and drive the many-core revolution forward. Snoopy coherence protocols, if realizable, have the desirable property of having low storage overhead and not adding indirection delay to cache-to-cache accesses. There are various proposals, like Token Coherence (TokenB), Uncorq, Intel QPI, INSO and Timestamp Snooping, that tackle the ordering of requests in snoopy protocols and make them realizable on unordered networks. However, snoopy protocols still have a broadcast overhead because each coherence request goes to all cores in the system. This has substantial network bandwidth and power implications. In this work, we propose embedding small in-network coherence filters inside on-chip routers that dynamically track sharing patterns among the cores. This sharing information is used to filter away redundant snoop requests that are traveling towards unshared cores. Filtering these useless messages saves network bandwidth and power and makes snoopy protocols on many-core systems truly scalable. Our in-network coherence filters reduce the total number of snoops in the system by 41.9% on average, thereby reducing total network traffic by 25.4% on 16-processor chip multiprocessor (CMP) systems running parallel applications. For 64-processor CMP systems, our filtering technique achieves a 46.5% reduction in the total number of snoops on average, which reduces total network traffic by 27.3% on average.
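
    The filter can be pictured as a per-router table, sketched below under assumptions about granularity and management: for each output port, regions known to have no sharers behind that port are recorded, snoops to those regions are suppressed on that port, and the conservative default is to forward everywhere.

        class SnoopFilter:
            def __init__(self, out_ports):
                # port -> regions known to have no sharers behind it
                self.no_sharers = {p: set() for p in out_ports}

            def mark_unshared(self, port, region):
                self.no_sharers[port].add(region)

            def mark_shared(self, port, region):
                self.no_sharers[port].discard(region)   # stay conservative

            def forward_ports(self, region):
                # Forward on every port not proven sharer-free (default:
                # broadcast, so correctness never depends on the table).
                return [p for p, dead in self.no_sharers.items()
                        if region not in dead]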

  • SCARAB: A single cycle adaptive routing and bufferless network

    Publication Year: 2009 , Page(s): 244 - 254
    Cited by:  Papers (9)
    PDF (610 KB) | HTML

    As technology scaling drives the number of processor cores upward, current on-chip routers consume substantial portions of chip area and power budgets. Since existing research has greatly reduced router latency overheads and capitalized on available on-chip bandwidth, power constraints dominate interconnection network design. Recently, research has proposed bufferless routers as a means to alleviate these constraints, but to date all designs exhibit poor operational frequency, throughput, or latency. In this paper, we propose an efficient bufferless router which lowers average packet latency by 17.6% and dynamic energy by 18.3% over existing bufferless on-chip network designs. In order to maintain the energy and area benefit of bufferless routers while delivering ultra-low latencies, our router utilizes an opportunistic processor-side buffering technique and an energy-efficient circuit-switched network for delivering negative acknowledgments for dropped packets.
