Computer Architecture Letters

Issue 2 • July-Dec. 2008

Displaying Results 1 - 13 of 13
  • [Front cover]

    Publication Year: 2008 , Page(s): c1
    PDF (92 KB)
    Freely Available from IEEE
  • Editorial Board [Cover2]

    Publication Year: 2008 , Page(s): c2
    PDF (72 KB)
    Freely Available from IEEE
  • Pipelined Architecture for Multi-String Matching

    Publication Year: 2008 , Page(s): 33 - 36
    Cited by:  Papers (8)  |  Patents (2)
    PDF (116 KB)

  • Randomized Partially-Minimal Routing on Three-Dimensional Mesh Networks

    Publication Year: 2008 , Page(s): 37 - 40
    Cited by:  Papers (8)
    PDF (114 KB)

    This letter presents a new oblivious routing algorithm for 3D mesh networks called Randomized Partially-Minimal (RPM) routing that provably achieves optimal worst-case throughput for 3D meshes when the network radix k is even and is within a factor of 1/k^2 of optimal when k is odd. Although this optimality result has been achieved with the minimal routing algorithm O1TURN [9] for the 2D case, the worst-case throughput of O1TURN degrades tremendously in higher dimensions. Other existing routing algorithms suffer from either poor worst-case throughput (DOR [10], ROMM [8]) or poor latency (VAL [14]). RPM, on the other hand, achieves near-optimal worst-case throughput, good average-case throughput, and good latency.

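    The letter does not spell out the route-selection rule here, so the following C sketch illustrates only one plausible reading of "randomized partially-minimal": load-balance non-minimally in a single dimension by first visiting a uniformly random intermediate coordinate (Valiant-style), then route minimally in the remaining two dimensions. The function names, the choice of Z as the randomized dimension, and the hop accounting are assumptions made for illustration, not the letter's algorithm.

        /* Hedged sketch: one plausible "partially-minimal" route on a k-ary 3D mesh.
         * Assumption: Z is load-balanced Valiant-style (random intermediate plane),
         * X and Y are routed minimally.  Names and structure are illustrative only. */
        #include <stdio.h>
        #include <stdlib.h>

        #define K 4                              /* network radix (k-ary 3D mesh) */

        typedef struct { int x, y, z; } node_t;

        /* Print the per-phase hop counts for one packet from src to dst. */
        static void rpm_route(node_t src, node_t dst)
        {
            int zi = rand() % K;                 /* random intermediate Z plane       */

            int hops_z1 = abs(zi - src.z);       /* phase 1: non-minimal Z balancing  */
            int hops_xy = abs(dst.x - src.x)     /* phase 2: minimal X-Y routing      */
                        + abs(dst.y - src.y);    /*          (e.g., XY or YX order)   */
            int hops_z2 = abs(dst.z - zi);       /* phase 3: minimal Z to destination */

            printf("via z=%d: %d + %d + %d = %d hops\n",
                   zi, hops_z1, hops_xy, hops_z2, hops_z1 + hops_xy + hops_z2);
        }

        int main(void)
        {
            node_t src = {0, 0, 0}, dst = {3, 2, 1};
            for (int i = 0; i < 3; i++)          /* three randomized route choices */
                rpm_route(src, dst);
            return 0;
        }
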
  • Hierarchical Instruction Register Organization

    Publication Year: 2008 , Page(s): 41 - 44
    Cited by:  Papers (2)
    PDF (89 KB)

    This paper analyzes a range of architectures for efficient delivery of VLIW instructions for embedded media kernels. The analysis takes an efficient filter cache as a baseline and examines the benefits of 1) removing the tag overhead, 2) distributing the storage, 3) adding indirection, 4) adding efficient NOP generation, and 5) sharing instruction memory. The result is a hierarchical instruction register organization that provides a 56% energy and a 40% area savings over an already efficient filter cache.

  • A Parallel Deadlock Detection Algorithm with O(1) Overall Run-time Complexity

    Publication Year: 2008 , Page(s): 45 - 48
    PDF (124 KB)

    This article proposes a novel parallel, hardware-oriented deadlock detection algorithm for multiprocessor systems-on-chip. The proposed algorithm takes full advantage of hardware parallelism in computation and maintains the information needed for deadlock detection by classifying all resource-allocation events and performing class-specific operations, which together make the overall run-time complexity of the new method O(1). We implement the proposed algorithm in Verilog HDL and demonstrate in simulation that each algorithm invocation takes at most four clock cycles in hardware.

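    As a rough, hedged illustration of the event-classification idea only: every resource-allocation event is assigned a class and triggers a small, class-specific, constant-time update of allocation/wait state, so no global scan is needed at event time. The letter's actual contribution, producing the deadlock decision itself in O(1) with parallel hardware, is not reproduced here; the class names and state layout below are assumptions for illustration.

        /* Hedged sketch of class-specific O(1) bookkeeping per allocation event.
         * The letter's deadlock decision logic (parallel hardware, at most four
         * cycles per invocation) is NOT reproduced; only the event dispatch is. */
        #include <stdbool.h>
        #include <stdio.h>

        #define NPROC 4
        #define NRES  4

        static bool holds[NPROC][NRES];          /* holds[p][r]: p owns r          */
        static bool waits[NPROC][NRES];          /* waits[p][r]: p is blocked on r */

        typedef enum { EV_REQUEST, EV_GRANT, EV_RELEASE } event_class_t;

        /* Constant-time, class-specific update for one event on (p, r). */
        static void on_event(event_class_t c, int p, int r)
        {
            switch (c) {
            case EV_REQUEST: waits[p][r] = true;                      break;
            case EV_GRANT:   waits[p][r] = false; holds[p][r] = true; break;
            case EV_RELEASE: holds[p][r] = false;                     break;
            }
        }

        int main(void)
        {
            on_event(EV_REQUEST, 0, 1);          /* P0 requests R1      */
            on_event(EV_GRANT,   0, 1);          /* R1 granted to P0    */
            on_event(EV_REQUEST, 1, 1);          /* P1 now blocks on R1 */
            printf("P1 waits on R1: %d, P0 holds R1: %d\n", waits[1][1], holds[0][1]);
            return 0;
        }
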
  • Beyond Fat-Tree: Unidirectional Load-Balanced Multistage Interconnection Network

    Publication Year: 2008 , Page(s): 49 - 52
    Cited by:  Papers (9)
    PDF (109 KB)

    The fat-tree is one of the topologies most widely used by interconnection network manufacturers. It has recently been demonstrated that a deterministic routing algorithm that optimally balances the network traffic can not only achieve almost the same performance as an adaptive routing algorithm but can even outperform it. On the other hand, fat-trees require a large number of switches with non-negligible wiring complexity. In this paper, we propose replacing the fat-tree with a unidirectional multistage interconnection network (UMIN) that uses a traffic-balancing deterministic routing algorithm. As a consequence, switch hardware is almost halved, which reduces power consumption, arbitration complexity, switch size, and network cost. Preliminary evaluation results show that the UMIN with the load-balancing scheme obtains lower latency than the fat-tree for low and medium traffic loads. Furthermore, in networks with a large number of stages or with high-radix switches, it obtains the same or even higher throughput than the fat-tree.

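    The letter's balancing algorithm is not given here, so purely as background, the C sketch below shows classic destination-tag self-routing in a k-ary, n-stage unidirectional MIN: the output port taken at stage s is simply digit s of the destination address, which needs no tables or adaptivity and spreads uniform traffic evenly across the stages. The letter's traffic-balancing deterministic algorithm may differ from this textbook scheme.

        /* Hedged background sketch: destination-tag self-routing in a k-ary,
         * n-stage unidirectional MIN.  Textbook MIN routing, shown only to
         * illustrate deterministic self-routing; not necessarily the letter's
         * load-balancing algorithm. */
        #include <stdio.h>

        #define K 4                              /* switch radix     */
        #define N 3                              /* number of stages */

        /* Output port used at 'stage' for destination terminal 'dest'
         * (the most-significant base-K digit selects the port at stage 0). */
        static int port_at_stage(int dest, int stage)
        {
            for (int s = N - 1; s > stage; s--)
                dest /= K;
            return dest % K;
        }

        int main(void)
        {
            int dest = 27;                       /* terminal 27 of K^N = 64 endpoints */
            for (int s = 0; s < N; s++)
                printf("stage %d -> output port %d\n", s, port_at_stage(dest, s));
            return 0;
        }
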
  • Transaction-Aware Network-on-Chip Resource Reservation

    Publication Year: 2008 , Page(s): 53 - 56
    Cited by:  Papers (2)
    PDF (149 KB)

    Performance and scalability are critically important for on-chip interconnect in many-core chip-multiprocessor systems. Packet-switched interconnect fabric, widely viewed as the de facto on-chip data communication backplane in the many-core era, offers high throughput and excellent scalability. However, these benefits come at the price of router latency due to run-time multi-hop data buffering and resource arbitration, and the network accounts for a majority of on-chip data transaction latency. In this work, we propose dynamic in-network resource reservation techniques to optimize run-time on-chip data transactions. This idea is motivated by the need to preserve existing abstractions and general-purpose network performance while optimizing for frequently occurring network events such as data transactions. Experimental studies using multithreaded benchmarks demonstrate that the proposed techniques can reduce on-chip data access latency by 28.4% on average in a 16-node system and by 29.2% on average in a 36-node system.

  • Proactive Use of Shared L3 Caches to Enhance Cache Communications in Multi-Core Processors

    Publication Year: 2008 , Page(s): 57 - 60
    Cited by:  Papers (1)  |  Patents (1)
    PDF (252 KB)

    Software and hardware techniques to exploit the potential of multi-core processors are falling behind, even though the number of cores and cache levels per chip is increasing rapidly. There is no explicit communication support available, so inter-core communication depends on cache coherence protocols, resulting in demand-based cache line transfers with their inherent latency and overhead. In this paper, we present software controlled eviction (SCE) to improve the performance of multithreaded applications running on multi-core processors by moving shared data to shared cache levels before it is demanded by remote private caches. Simulation results show that SCE offers significant performance improvement (8-28%) and reduces L3 cache misses by 88-98%.

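    As a rough illustration only: after the producer finishes writing a shared buffer, it issues an eviction hint over each cache line so the data is already sitting in the shared L3 when a consumer core asks for it, instead of being fetched on demand from the producer's private cache. The sce_evict_line() hint below is hypothetical, standing in for whatever interface an SCE-capable core would expose; some ISAs offer instructions in a similar spirit (e.g. x86 CLDEMOTE), but the letter's mechanism is not reproduced here.

        /* Hedged sketch of the software-controlled-eviction (SCE) usage pattern.
         * sce_evict_line() is HYPOTHETICAL: it is compiled as a no-op and merely
         * marks where an SCE hint would push a line from private caches into L3. */
        #include <stddef.h>
        #include <stdio.h>

        #define CACHE_LINE 64

        /* Hypothetical hint: demote one cache line toward the shared L3. */
        static inline void sce_evict_line(const void *p)
        {
            (void)p;                             /* placeholder, no real effect */
        }

        /* Producer: fill a shared buffer, then proactively evict it line by line
         * so consumers hit in L3 rather than stalling on a coherence transfer. */
        static void produce(int *shared_buf, size_t n)
        {
            for (size_t i = 0; i < n; i++)
                shared_buf[i] = (int)i;
            for (size_t off = 0; off < n * sizeof(int); off += CACHE_LINE)
                sce_evict_line((const char *)shared_buf + off);
        }

        int main(void)
        {
            static int buf[1024];
            produce(buf, 1024);
            printf("buf[1023] = %d\n", buf[1023]);
            return 0;
        }
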
  • BENoC: A Bus-Enhanced Network on-Chip for a Power Efficient CMP

    Publication Year: 2008 , Page(s): 61 - 64
    Cited by:  Papers (3)
    PDF (138 KB)

    Networks-on-chip (NoCs) outperform buses in terms of scalability, parallelism, and system modularity, and are therefore considered the main interconnect infrastructure for future chip multiprocessors (CMPs). However, while NoCs are very efficient for delivering high-throughput point-to-point data from sources to destinations, their multi-hop operation is too slow for latency-sensitive signals. In addition, current NoCs are inefficient for broadcast operations and for centralized control of CMP resources. Consequently, state-of-the-art NoCs may not meet the needs of future CMP systems. In this paper, the benefit of adding a low-latency, customized shared bus as an integral part of the NoC architecture is explored. BENoC (Bus-Enhanced Network on-Chip) possesses two main advantages: first, the bus is inherently capable of performing broadcast transmission efficiently; second, the bus has lower and more predictable propagation latency. In order to demonstrate the potential benefit of the proposed architecture, an analytical comparison of the power savings of BENoC versus a standard NoC providing similar services is presented. Then, simulation is used to evaluate BENoC in a dynamic non-uniform cache access (DNUCA) multiprocessor system.

  • DDMR: Dynamic and Scalable Dual Modular Redundancy with Short Validation Intervals

    Publication Year: 2008 , Page(s): 65 - 68
    Cited by:  Papers (5)
    PDF (88 KB)

    DMR (dual modular redundancy) has been proposed as a means of increasing reliability. Classical DMR consists of pairs of cores that check each other and are pre-connected during manufacturing by dedicated links. In this paper we introduce the dynamic dual modular redundancy (DDMR) architecture. DDMR supports run-time scheduling of redundant threads, which has significant benefits relative to static binding. To allow dynamic pairing, DDMR replaces the dedicated links with a novel ring architecture. DDMR uses short instruction sequences for validation, smaller than the processor's reorder buffer. Such short sequences reduce latencies in parallel programs and save the resources needed to buffer uncommitted data. DDMR scales with the number of cores and may be used in large multicore architectures.

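    Hedged illustration of the short-validation-interval idea only: each of the two redundant executions folds its committed results into a running signature, and the pair compares signatures every N instructions, with N kept smaller than the reorder buffer so little uncommitted state has to be buffered before a check. The signature function, the interval of 32, and the fault-injection comment below are arbitrary choices for illustration, not DDMR's actual checking hardware or ring interconnect.

        /* Hedged sketch: validate the two redundant executions every INTERVAL
         * committed instructions by comparing running signatures.  Hash and
         * interval are illustrative; DDMR's hardware is not reproduced. */
        #include <stdint.h>
        #include <stdio.h>

        #define INTERVAL 32                      /* kept below the ROB size */

        /* Fold one committed result into a running signature (FNV-1a style). */
        static uint64_t fold(uint64_t sig, uint64_t value)
        {
            return (sig ^ value) * 1099511628211ULL;
        }

        int main(void)
        {
            uint64_t sig_a = 14695981039346656037ULL;   /* redundant thread A */
            uint64_t sig_b = 14695981039346656037ULL;   /* redundant thread B */

            for (int i = 1; i <= 128; i++) {
                uint64_t result = (uint64_t)i * 3;      /* stand-in for a committed value */
                sig_a = fold(sig_a, result);
                sig_b = fold(sig_b, result);            /* corrupt 'result' here to force a mismatch */

                if (i % INTERVAL == 0) {                /* end of one short validation interval */
                    if (sig_a != sig_b) {
                        printf("mismatch at instruction %d: roll back and recover\n", i);
                        return 1;
                    }
                    printf("interval ending at %d validated\n", i);
                }
            }
            return 0;
        }
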
  • Information for authors

    Publication Year: 2008 , Page(s): c3
    PDF (72 KB)
    Freely Available from IEEE
  • IEEE Computer Society [Cover4]

    Publication Year: 2008 , Page(s): c4
    PDF (92 KB)
    Freely Available from IEEE

Aims & Scope

IEEE Computer Architecture Letters is a rigorously peer-reviewed forum for publishing early, high-impact results in the areas of uni- and multiprocessor computer systems, computer architecture, microarchitecture, workload characterization, performance evaluation and simulation techniques, and power-aware computing. 

Meet Our Editors

Editor-in-Chief
José Martinez
Cornell University
336 Frank H.T. Rhodes Hall
Ithaca, NY 14853 USA