Computer Architecture Letters

Issue 1 • Date January-June 2007

  • Dynamic Predication of Indirect Jumps

    Page(s): 1 - 4

    Indirect jumps are used to implement increasingly common programming language constructs such as virtual function calls, switch-case statements, jump tables, and interface calls. Unfortunately, the prediction accuracy of indirect jumps has remained low because many indirect jumps have multiple targets that are difficult to predict even with specialized hardware. This paper proposes a new way of handling hard-to-predict indirect jumps: dynamically predicating them. The compiler identifies indirect jumps that are suitable for predication along with their control-flow merge (CFM) points. The microarchitecture predicates the instructions between different targets of the jump and its CFM point if the jump turns out to be hard-to-predict at run time. We describe the new indirect jump predication architecture, provide code examples showing why it could reduce the performance impact of jumps, derive an analytical cost-benefit model for deciding which jumps and targets to predicate, and present preliminary evaluation results.

  • Microarchitectures for Managing Chip Revenues under Process Variations

    Page(s): 5 - 8

    As transistor feature sizes continue to shrink into the sub-90nm range and beyond, the effects of process variations on critical path delay and chip yields have amplified. A common concept to remedy the effects of variation is speed-binning, by which chips from a single batch are rated by a discrete range of frequencies and sold at different prices. In this paper, we discuss strategies to modify the number of chips in different bins and hence enhance the profits obtained from them. Particularly, we propose a scheme that introduces a small Substitute Cache associated with each cache way to replicate the data elements that will be stored in the high latency lines. Assuming a fixed pricing model, this method increases the revenue by as much as 13.8% without any impact on the performance of the chips.

  • Physical Register Reference Counting

    Page(s): 9 - 12

    Several proposed techniques including CPR (checkpoint processing and recovery) and NoSQ (no store queue) rely on reference counting to manage physical registers. However, the register reference counting mechanism itself has received surprisingly little attention. This paper fills this gap by describing potential register reference counting schemes for NoSQ, CPR, and a hypothetical NoSQ/CPR hybrid. Although previously described in terms of binary counters, we find that reference counts are actually more naturally represented as matrices. Binary representations can be used as an optimization in specific situations.
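
One way to picture reference counts kept as a matrix rather than as binary counters is sketched below. This is an illustration only, not the paper's design: the class `RefCountMatrix`, its method names, and the idea that a column stands for a checkpoint or an in-flight instruction slot are all assumptions of this sketch. Each physical register gets a row of bits, one bit per potential referencer, and a register is free when its row is all zeros.

```python
class RefCountMatrix:
    """Hypothetical bit-matrix reference counter.

    One row per physical register, one column per potential referencer
    (assumed here to be a checkpoint or an in-flight instruction slot).
    """

    def __init__(self, num_regs, num_slots):
        self.num_slots = num_slots
        # Each row is an int used as a bit vector: bit s set means slot s
        # currently holds a reference to this register.
        self.rows = [0] * num_regs

    def add_ref(self, reg, slot):
        self.rows[reg] |= 1 << slot

    def drop_ref(self, reg, slot):
        self.rows[reg] &= ~(1 << slot)

    def release_slot(self, slot):
        """Clear a whole column, dropping every reference held by one slot."""
        mask = ~(1 << slot)
        self.rows = [row & mask for row in self.rows]

    def count(self, reg):
        # The conventional reference count is the number of set bits in the row.
        return bin(self.rows[reg]).count("1")

    def is_free(self, reg):
        return self.rows[reg] == 0
```

In this sketch, releasing a referencer is a single column clear, with no per-register decrement, which is one plausible reason a matrix could be the more natural representation.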

  • Logic-Based Distributed Routing for NoCs

    Page(s): 13 - 16

    The design of scalable and reliable interconnection networks for multicore chips (NoCs) introduces new design constraints like power consumption, area, and ultra low latencies. Although 2D meshes are usually proposed for NoCs, heterogeneous cores, manufacturing defects, hard failures, and chip virtualization may lead to irregular topologies. In this context, efficient routing becomes a challenge. Although switches can be easily configured to support most routing algorithms and topologies by using routing tables, this solution does not scale in terms of latency and area. We propose a new circuit that removes the need for using routing tables. The new mechanism, referred to as logic-based distributed routing (LBDR), enables the implementation in NoCs of many routing algorithms for most of the practical topologies we might find in the near future in a multicore chip. From an initial topology and routing algorithm, a set of three bits per switch output port is computed. By using a small logic block, LBDR mimics (demonstrated by evaluation) the behavior of routing algorithms implemented with routing tables. This result is achieved both in regular and irregular topologies. Therefore, LBDR removes the need for using routing tables for distributed routing, thus enabling flexible, fast and power-efficient routing in NoCs.
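
To make "three bits per switch output port plus a small logic block" concrete, the sketch below routes on a 2D mesh with two routing bits and one connectivity bit per port. The abstract gives only the bit count, so everything else here is an assumption of this sketch: the bit names (`Rne`, `Cn`, and so on), the gate equations, the coordinate convention (x grows eastward, y grows southward), and the function name `lbdr_route`.

```python
def lbdr_route(cur, dst, bits):
    """Return the set of admissible output ports at switch `cur` for `dst`.

    cur, dst: (x, y) mesh coordinates (assumed: x grows east, y grows south).
    bits: this switch's configuration, three bits per output port:
      - two routing bits, e.g. "Rne" = a packet leaving north may later turn east
      - one connectivity bit, e.g. "Cn" = the north link exists
    """
    (cx, cy), (dx, dy) = cur, dst
    # Comparator stage: in which direction(s) does the destination lie?
    n, s = dy < cy, dy > cy
    e, w = dx > cx, dx < cx

    # Logic stage: a port is a candidate if the destination lies straight
    # along it, or if the turn needed after taking it is permitted.
    candidates = {
        "N": n and ((not e and not w) or (e and bits["Rne"]) or (w and bits["Rnw"])),
        "E": e and ((not n and not s) or (n and bits["Ren"]) or (s and bits["Res"])),
        "S": s and ((not e and not w) or (e and bits["Rse"]) or (w and bits["Rsw"])),
        "W": w and ((not n and not s) or (n and bits["Rwn"]) or (s and bits["Rws"])),
    }
    # Connectivity bits mask out ports whose link is absent.
    return {p for p, ok in candidates.items() if ok and bits["C" + p.lower()]}


# Bit setting that reproduces XY (dimension-order) routing on a full mesh:
# turns from a vertical port into a horizontal direction are forbidden.
XY_BITS = dict(Rne=0, Rnw=0, Rse=0, Rsw=0,
               Ren=1, Res=1, Rwn=1, Rws=1,
               Cn=1, Ce=1, Cs=1, Cw=1)
```

With `XY_BITS`, a packet at `(0, 0)` bound for `(2, 1)` is offered only the east port, as XY routing would choose; an empty result means the packet has arrived at its destination switch.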

  • Chameleon: A High Performance Flash/FRAM Hybrid Solid State Disk Architecture

    Page(s): 17 - 20

    Flash memory solid state disk (SSD) is gaining popularity and replacing hard disk drive (HDD) in mobile computing systems such as ultra mobile PCs (UMPCs) and notebook PCs because of lower power consumption, faster random access, and higher shock resistance. One of the key challenges in designing a high-performance flash memory SSD is an efficient handling of small random writes to non-volatile data whose performance suffers from the inherent limitation of flash memory that prohibits in-place update. In this paper, we propose a high performance Flash/FRAM hybrid SSD architecture called Chameleon. In Chameleon, metadata used by the flash translation layer (FTL), a software layer in the flash memory SSD, is maintained in a small FRAM since this metadata is a target of intensive small random writes, whereas the bulk data is kept in the flash memory. Performance evaluation based on an FPGA implementation of the Chameleon architecture shows that the use of FRAM in Chameleon improves the performance by 21.3%. The results also show that even for bulk data that cannot be maintained in FRAM because of the size limitation, the use of fine-grained write buffering is critically important because of the inability of flash memory to perform in-place update of data.
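
The reason FTL metadata attracts so many small writes can be seen in a toy flash translation layer. The sketch below is not Chameleon's FTL: the abstract does not describe its mapping scheme, so page-level mapping, the class name `TinyFTL`, and the `metadata_updates` counter are assumptions made for illustration, and garbage collection is omitted. Because a flash page cannot be overwritten in place, every logical write goes to a fresh physical page and forces an update of the mapping table.

```python
class TinyFTL:
    """Toy page-mapping FTL with out-of-place updates (illustrative only)."""

    def __init__(self, num_physical_pages):
        # Metadata: logical page number -> physical page number. This is the
        # kind of small, frequently rewritten state the abstract places in FRAM.
        self.mapping = {}
        # Bulk data: the flash pages themselves.
        self.flash = [None] * num_physical_pages
        self.valid = [False] * num_physical_pages
        self.next_free = 0
        self.metadata_updates = 0

    def write(self, lpn, data):
        if self.next_free >= len(self.flash):
            raise RuntimeError("no free page (garbage collection is omitted)")
        old = self.mapping.get(lpn)
        if old is not None:
            # No in-place update: mark the stale copy invalid instead.
            self.valid[old] = False
        ppn = self.next_free
        self.next_free += 1
        self.flash[ppn] = data
        self.valid[ppn] = True
        # Every data write also triggers one small metadata write.
        self.mapping[lpn] = ppn
        self.metadata_updates += 1
        return ppn

    def read(self, lpn):
        return self.flash[self.mapping[lpn]]
```

Writing the same logical page twice consumes two physical pages and two mapping updates, which is the small-random-write pattern that motivates keeping the mapping in a byte-addressable, in-place-updatable memory.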

  • Computing Accurate AVFs using ACE Analysis on Performance Models: A Rebuttal

    Page(s): 21 - 24

    ACE (architecturally correct execution) analysis computes AVFs (architectural vulnerability factors) of hardware structures. AVF expresses the fraction of radiation-induced transient faults that result in user-visible errors. Architects usually perform this analysis on a high-level performance model to quickly compute per-structure AVFs. If, however, low-level details of a microarchitecture are not modeled appropriately, then their effects may not be reflected in the per-structure AVFs. In this paper we refute Wang, et al.'s (2007) claim that this detail is difficult to model and imposes a practical threshold on ACE analysis that forces its estimates to have a high error margin. We show that carefully choosing a small amount of additional detail can result in a much tighter AVF bound than Wang, et al. were able to achieve in their refined ACE analysis. Even the inclusion of small details, such as read/write pointers and appropriate inter-structure dependencies, can increase the accuracy of the AVF computation by 40% or more. We argue that this is no different than modeling the IPC (instructions per cycle) of a microprocessor pipeline. A less detailed performance model will provide less accurate IPCs. AVFs are no different.
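
A per-structure AVF from ACE analysis is commonly computed as the fraction of the structure's bit-cycles that hold ACE state. The sketch below shows only that arithmetic; the function name `structure_avf` and the interval-based input format are assumptions of this sketch, standing in for whatever residency data a performance model would actually produce.

```python
def structure_avf(ace_intervals, num_entries, bits_per_entry, total_cycles):
    """AVF = ACE bit-cycles / (total bits in the structure * total cycles).

    ace_intervals: iterable of (start_cycle, end_cycle, ace_bits) tuples, one
    per residency period of an entry; `ace_bits` is how many of that entry's
    bits were needed for architecturally correct execution during the
    half-open interval [start_cycle, end_cycle).
    """
    ace_bit_cycles = sum((end - start) * ace_bits
                         for start, end, ace_bits in ace_intervals)
    total_bit_cycles = num_entries * bits_per_entry * total_cycles
    return ace_bit_cycles / total_bit_cycles


# Two 32-bit entries observed for 100 cycles: one fully ACE for 50 cycles,
# one half ACE (16 bits) for 50 cycles.
example = structure_avf([(0, 50, 32), (10, 60, 16)],
                        num_entries=2, bits_per_entry=32, total_cycles=100)
```

Here `example` evaluates to 0.375. Bits that are conservatively counted as ACE because the model lacks detail inflate the numerator, which is why added modeling detail can tighten the bound.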

  • Corollaries to Amdahl's Law for Energy

    Page(s): 25 - 28

    This paper studies the important interaction between parallelization and energy consumption in a parallelizable application. Given the ratio of serial and parallel portion in an application and the number of processors, we first derive the optimal frequencies allocated to the serial and parallel regions in the application to minimize the total energy consumption, while the execution time is preserved (i.e., speedup = 1). We show that dynamic energy improvement due to parallelization has a function rising faster with the increasing number of processors than the speed improvement function given by the well-known Amdahl's Law. Furthermore, we determine the conditions under which one can obtain both energy and speed improvement, as well as the amount of improvement. The formulas we obtain capture the fundamental relationship between parallelization, speedup, and energy consumption and can be directly utilized in energy aware processor resource management. Our results form a basis for several interesting research directions in the area of power and energy aware parallel processing.
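
The comparison between Amdahl speedup and energy improvement at unchanged execution time can be worked through under a simple model. The sketch below is a derivation under stated assumptions, not the paper's formulas: dynamic power is assumed proportional to the cube of frequency (so energy per unit of work scales with frequency squared), idle processors and static power are assumed to cost nothing, and frequencies are normalized so that the single-processor baseline runs at 1. Minimizing energy subject to the execution time staying at 1 then gives a serial-region frequency of s + (1 - s) / N^(2/3) and a parallel-region frequency lower by a factor of N^(1/3).

```python
def amdahl_speedup(s, n):
    """Classic Amdahl's Law: s is the serial fraction, n the processor count."""
    return 1.0 / (s + (1.0 - s) / n)


def optimal_frequencies(s, n):
    """Energy-minimizing (serial, parallel) frequencies with speedup = 1.

    Assumes dynamic power ~ f**3 and a baseline frequency of 1.
    """
    f_serial = s + (1.0 - s) / n ** (2.0 / 3.0)
    f_parallel = f_serial / n ** (1.0 / 3.0)
    return f_serial, f_parallel


def energy_improvement(s, n):
    """Baseline energy divided by minimized energy, at equal execution time."""
    f_serial, f_parallel = optimal_frequencies(s, n)
    # Energy per unit of work ~ f**2; the parallel work (1 - s) is spread
    # over n processors but its total amount is unchanged.
    energy = s * f_serial ** 2 + (1.0 - s) * f_parallel ** 2
    return 1.0 / energy
```

For a half-serial application on four processors, `amdahl_speedup(0.5, 4)` is 1.6 while `energy_improvement(0.5, 4)` is roughly 2.9, which illustrates, under these assumptions, the abstract's claim that the energy improvement grows faster than the speed improvement.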

  • An Energy-Efficient Processor Architecture for Embedded Systems

    Page(s): 29 - 32

    We present an efficient programmable architecture for compute-intensive embedded applications. The processor architecture uses instruction registers to reduce the cost of delivering instructions, and a hierarchical and distributed data register organization to deliver data. Instruction registers capture instruction reuse and locality in inexpensive storage structures that are located near the functional units. The data register organization captures reuse and locality in different levels of the hierarchy to reduce the cost of delivering data. Exposed communication resources eliminate pipeline registers and control logic, and allow the compiler to schedule efficient instruction and data movement. The architecture keeps a significant fraction of instruction and data bandwidth local to the functional units, which reduces the cost of supplying instructions and data to large numbers of functional units. This architecture achieves an energy efficiency that is 23x greater than an embedded RISC processor.


Aims & Scope

IEEE Computer Architecture Letters is a rigorously peer-reviewed forum for publishing early, high-impact results in the areas of uni- and multiprocessor computer systems, computer architecture, microarchitecture, workload characterization, performance evaluation and simulation techniques, and power-aware computing. 

Meet Our Editors

Editor-in-Chief
José Martinez
Cornell University
336 Frank H.T. Rhodes Hall
Ithaca, NY 14853 USA