By Topic

Computer Architecture Letters

Issue 1 • Date Jan.-June 2006

Filter Results

Displaying Results 1 - 11 of 11
  • Balanced instruction cache: reducing conflict misses of direct-mapped caches through balanced subarray accesses

    Publication Year: 2006 , Page(s): 2 - 5
    Cited by:  Papers (2)
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (238 KB)  

    It is observed that the limited memory space of direct-mapped caches is not used in balance therefore incurs extra conflict misses. We propose a novel cache organization of a balanced cache, which balances accesses to cache sets at the granularity of cache subarrays. The key technique of the balanced cache is a programmable subarray decoder through which the mapping of memory reference addresses to cache subarrays can be optimized hence conflict misses of direct-mapped caches can be resolved. The experimental results show that the miss rate of balanced cache is lower than that of the same sized two-way set-associative caches on average and can be as low as that of the same sized four-way set-associative caches for particular applications. Compared with previous techniques, the balanced cache requires only one cycle to access all cache hits and has the same access time as direct-mapped caches View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • From sequential programs to concurrent threads

    Publication Year: 2006 , Page(s): 6 - 9
    Cited by:  Papers (7)  |  Patents (1)
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (523 KB)  

    Chip multiprocessors are of increasing importance due to difficulties in achieving higher clock frequencies in uniprocessors, but their success depends on finding useful work for the processor cores. This paper addresses this challenge by presenting a simple compiler approach that extracts non-speculative thread-level parallelism from sequential codes. We present initial results from this technique targeting a validated dual-core processor model, achieving speedups ranging from 9-48% with an average of 25% for important benchmark loops over their single-threaded versions. We also identify important next steps found during our pursuit of higher degrees of automatic threading View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Topology optimization of interconnection networks

    Publication Year: 2006 , Page(s): 10 - 13
    Cited by:  Papers (8)
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (779 KB)  

    This paper describes an automatic optimization tool that searches a family of network topologies to select the topology that best achieves a specified set of design goals while satisfying specified packaging constraints. Our tool uses a model of signaling technology that relates bandwidth, cost and distance of links. This model captures the distance-dependent bandwidth of modern high-speed electrical links and the cost differential between electrical and optical links. Using our optimization tool, we explore the design space of hybrid Clos-torus (C-T) networks. For a representative set of packaging constraints we determine the optimal hybrid C-T topology to minimize cost and the optimal C-T topology to minimize latency for various packet lengths. We then use the tool to measure the sensitivity of the optimal topology to several important packaging constraints such as pin count and critical distance View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Performance, power efficiency and scalability of asymmetric cluster chip multiprocessors

    Publication Year: 2006 , Page(s): 14 - 17
    Cited by:  Papers (26)  |  Patents (10)
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (324 KB)  

    This paper evaluates asymmetric cluster chip multiprocessor (ACCMP) architectures as a mechanism to achieve the highest performance for a given power budget. ACCMPs execute serial phases of multithreaded programs on large high-performance cores whereas parallel phases are executed on a mix of large and many small simple cores. Theoretical analysis reveals a performance upper bound for symmetric multiprocessors, which is surpassed by asymmetric configurations at certain power ranges. Our emulations show that asymmetric multiprocessors can reduce power consumption by more than two thirds with similar performance compared to symmetric multiprocessors View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Probabilistic counter updates for predictor hysteresis and bias

    Publication Year: 2006 , Page(s): 18 - 21
    Cited by:  Papers (3)  |  Patents (1)
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (700 KB)  

    Hardware predictor designers have incorporated hysteresis and/or bias to achieve desired behavior by increasing the number of bits per counter. Some resulting proposed predictor designs are currently impractical because their counter tables are too large. We describe a method for dramatically reducing the amount of storage required for a predictor's counter table with minimal impact on prediction accuracy. Probabilistic updates to counter state are implemented using a hardware pseudo-random number generator to increment or decrement counters a fraction of the time, meaning fewer counter bits are required. We demonstrate the effectiveness of probabilistic updates in the context of Fields et al.'s critical path predictor, which employs a biased 6-bit counter. Averaged across the SPEC CINT2000 benchmarks, our 2-bit and 3-bit probabilistic counters closely approximate a 6-bit deterministic one (achieving speedups of 7.75% and 7.91% compared to 7.94%) when used for criticality-based scheduling in a clustered machine. Performance degrades gracefully, enabling even a 1-bit probabilistic counter to outperform the best 3-bit deterministic counter we found View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • A case for fault tolerance and performance enhancement using chip multi-processors

    Publication Year: 2006 , Page(s): 22 - 25
    Cited by:  Papers (7)
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (174 KB)  

    This paper makes a case for using multi-core processors to simultaneously achieve transient-fault tolerance and performance enhancement. Our approach is extended from a recent latency-tolerance proposal, dual-core execution (DCE). In DCE, a program is executed twice in two processors, named the front and back processors. The front processor pre-processes instructions in a very fast yet highly accurate way and the back processor re-executes the instruction stream retired from the front processor. The front processor runs faster as it has no correctness constraints whereas its results, including timely prefetching and prompt branch misprediction resolution, help the back processor make faster progress. In this paper, we propose to entrust the speculative results of the front processor and use them to check the un-speculative results of the back processor. A discrepancy, either due to a transient fault or a mispeculation, is then handled with the existing mispeculation recovery mechanism. In this way, both transient-fault tolerance and performance improvement can be delivered simultaneously with little hardware overhead View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Adopting system call based address translation into user-level communication

    Publication Year: 2006 , Page(s): 26 - 29
    Cited by:  Papers (2)
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (1137 KB)  

    User-level communication alleviates the software overhead of the communication subsystem by allowing applications to access the network interface directly. For that purpose, efficient address translation of virtual address to physical address is critical. In this study, we propose a system call based address translation scheme where every translation is done by the kernel instead of a translation cache on a network interface controller as in the previous cache based address translation. According to our experiments, our scheme achieves up to 4.5 % reduction in application execution time compared to the previous cache based approach View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Data parallel address architecture

    Publication Year: 2006 , Page(s): 30 - 33
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (238 KB)  

    Data parallel memory systems must maintain a large number of outstanding memory references to fully use increasing DRAM bandwidth in the presence of increasing latency. At the same time, the throughput of modern DRAMs is very sensitive to access pattern's due to the time required to precharge and activate banks and to switch between read and write access. To achieve memory reference parallelism a system may simultaneously issue references from multiple reference threads. Alternatively multiple references from a single thread can be issued in parallel. In this paper, we examine this tradeoff and show that allowing only a single thread to access DRAM at any given time significantly improves performance by increasing the locality of the reference stream and hence reducing precharge/activate operations and read/write turnaround. Simulations of scientific and multimedia applications show that generating multiple references from a single thread gives, on average, 17% better performance than generating references from two parallel threads View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • In-network cache coherence

    Publication Year: 2006 , Page(s): 34 - 37
    Cited by:  Papers (2)
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (422 KB)  

    We propose implementing cache coherence protocols within the network, demonstrating how an in-network implementation of the MSI directory-based protocol allows for in-transit optimizations of read and write delay. Our results show 15% and 24% savings on average in memory access latency for SPLASH-2 parallel benchmarks running on a 4times4 and a 16times16 multiprocessor respectively View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Performance modeling using Monte Carlo simulation

    Publication Year: 2006 , Page(s): 38 - 41
    Cited by:  Papers (4)
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (412 KB)  

    Cycle accurate simulation has long been the primary tool for micro-architecture design and evaluation. Though accurate, the slow speed often imposes constraints on the extent of design exploration. In this work, we propose a fast, accurate Monte-Carlo based model for predicting processor performance. We apply this technique to predict the CPI of in-order architectures and validate it against the Itanium-2. The Monte Carlo model uses micro-architecture independent application characteristics, and cache, branch predictor statistics to predict CPI with an average error of less than 7%. Since prediction is achieved in a few seconds, the model can be used for fast design space exploration that can efficiently cull the space for cycle-accurate simulations. Besides accurately predicting CPI, the model also breaks down CPI into various components, where each component quantifies the effect of a particular stall condition (branch misprediction, cache miss, etc.) on overall CPI. Such a CPI decomposition can help processor designers quickly identify and resolve critical performance bottlenecks View full abstract»

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • Foreword

    Publication Year: 2006 , Page(s): 11
    Save to Project icon | Request Permissions | PDF file iconPDF (128 KB)  
    Freely Available from IEEE

Aims & Scope

IEEE Computer Architecture Letters is a rigorously peer-reviewed forum for publishing early, high-impact results in the areas of uni- and multiprocessor computer systems, computer architecture, microarchitecture, workload characterization, performance evaluation and simulation techniques, and power-aware computing. 

Full Aims & Scope

Meet Our Editors

Editor-in-Chief
José Martinez
Cornell University
336 Frank H.T. Rhodes Hall
Ithaca, NY 14853 USA