Innovative Architecture for Future Generation High-Performance Processors and Systems, 2005

Date: 17 Jan. 2005

Contents (23 items)
  • Innovative architecture for future generation high-performance processors and systems

    Publication Year: 2005
    Freely Available from IEEE
  • Innovative Architecture for Future Generation High-Performance Processors and Systems - Title Page

    Publication Year: 2005 , Page(s): i - iii
    Freely Available from IEEE
  • Innovative Architecture for Future Generation High-Performance Processors and Systems - Copyright Page

    Publication Year: 2005 , Page(s): iv
    Freely Available from IEEE
  • Innovative Architecture for Future Generation High-Performance Processors and Systems - Table of contents

    Publication Year: 2005 , Page(s): v - vi
    Freely Available from IEEE
  • Message from the Editors

    Publication Year: 2005 , Page(s): vii
    Freely Available from IEEE
  • Committees

    Publication Year: 2005 , Page(s): viii
    Freely Available from IEEE
  • Superscalar processor with multi-bank register file

    Publication Year: 2005

    Register files in highly parallel superscalar processors tend to occupy a large chip area and require many access ports, which causes problems with chip size, access time, and power consumption. To address these problems, we have proposed a multi-bank register file that achieves small area, high speed, and low power consumption. We have demonstrated the effectiveness of this method by software simulation and by designing it in detail as a synthesizable Verilog-HDL description with a full-custom multi-bank register file. In this paper, we present the detailed architecture of a superscalar processor with the multi-bank register file and its evaluation results.
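
The banked organization the abstract describes can be illustrated with a toy conflict-detection model. All names, the interleaving scheme, and the port counts below are illustrative assumptions, not details from the paper:

```python
# Toy model of a multi-bank register file: registers are interleaved
# across banks, each bank exposes only a few ports (keeping it small and
# fast), and same-cycle reads that overload a bank must be deferred.

NUM_BANKS = 4
PORTS_PER_BANK = 2  # assumed per-bank read-port limit

def bank_of(reg: int) -> int:
    """Map an architectural register to a bank by simple interleaving."""
    return reg % NUM_BANKS

def schedule_reads(regs: list[int]) -> tuple[list[int], list[int]]:
    """Grant reads up to each bank's port limit; defer the rest a cycle."""
    used = [0] * NUM_BANKS
    granted, deferred = [], []
    for r in regs:
        b = bank_of(r)
        if used[b] < PORTS_PER_BANK:
            used[b] += 1
            granted.append(r)
        else:
            deferred.append(r)  # bank conflict: retry next cycle
    return granted, deferred

# registers 0, 4, 8 all map to bank 0; only two can be read this cycle
granted, deferred = schedule_reads([0, 4, 8, 1, 2])
```

The trade-off the abstract targets is visible even here: fewer ports per bank shrink each bank, at the cost of occasional deferred reads.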
  • Steering and forwarding techniques for reducing memory communication on a clustered microarchitecture

    Publication Year: 2005

    In a clustered microarchitecture, the execution core, with its large RAMs, large CAMs, and fully connected result-bypass loops, is partitioned into smaller execution cores called clusters. A clustered microarchitecture allows a scalable core design because intra-cluster operation remains fast regardless of the overall execution width of the core, but localizing critical memory transfers (store to load to consumer) remains a problem. In this work, we propose a technique named distributed speculative memory forwarding (DSMF) that localizes critical memory transfers within a cluster. DSMF learns memory dependences at the retire stage, steers a dependent store-consumer pair to the same cluster, and transfers data locally within the cluster. We show that this localization yields an IPC improvement of 15% over the baseline clustered microarchitecture.
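
The learn-at-retire, steer-to-cluster idea behind DSMF can be sketched roughly as follows; the table structure and steering policy are simplified assumptions rather than the paper's actual design:

```python
# Sketch of memory-dependence steering: at retire time, record which
# cluster held the store a load depended on; later, steer that load to
# the same cluster so the forwarded value never crosses clusters.

NUM_CLUSTERS = 4
dep_table: dict[int, int] = {}  # load PC -> cluster of its producer store

def learn_at_retire(load_pc: int, store_cluster: int) -> None:
    """Record the producing store's cluster when the dependent load retires."""
    dep_table[load_pc] = store_cluster

def steer(load_pc: int, default_cluster: int) -> int:
    """Steer a load to its predicted producer's cluster, else use the default."""
    return dep_table.get(load_pc, default_cluster)

learn_at_retire(load_pc=0x480, store_cluster=2)
assert steer(0x480, default_cluster=0) == 2  # steered next to its producer
assert steer(0x500, default_cluster=1) == 1  # no history: default policy
```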
  • The bimode++ branch predictor

    Publication Year: 2005

    Modern wide-issue superscalar processors tend to adopt deeper pipelines in order to attain high clock rates. This trend increases the number of in-flight instructions, so a mispredicted branch can waste a substantial amount of work. Accurate branch prediction is therefore required for high-performance processors. To improve prediction accuracy, we propose the bimode++ branch predictor, an enhanced version of the bimode branch predictor. Some branch instructions produce the same outcome every time, from the start to the end of a program; we define these as extremely biased branches. The bimode++ branch predictor is unique in predicting the outcome of an extremely biased branch with a simple hardware structure. In addition, it improves accuracy through refined indexing and a fusion function. Our experimental results with benchmarks from SPECfp, SPECint, multimedia, and server workloads show that the bimode++ branch predictor reduces the misprediction rate by 13.2% relative to bimode and by 32.5% relative to gshare.
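
The notion of an extremely biased branch — one that has resolved the same way on every execution so far — can be sketched with a toy detector; the table and fallback rule below are illustrative assumptions, not the bimode++ hardware:

```python
# Toy detector for extremely biased branches: remember each branch's
# first outcome, and mark the bias as broken the first time the branch
# resolves differently. Biased branches get a trivial fixed prediction;
# everything else would fall back to the main (bimode-style) predictor.

bias_table: dict[int, tuple[bool, bool]] = {}  # PC -> (first_outcome, still_biased)

def update(pc: int, taken: bool) -> None:
    if pc not in bias_table:
        bias_table[pc] = (taken, True)
    else:
        first, biased = bias_table[pc]
        if biased and taken != first:
            bias_table[pc] = (first, False)  # bias broken

def predict(pc: int):
    """Fixed outcome for an extremely biased branch, else None (defer)."""
    if pc in bias_table:
        first, biased = bias_table[pc]
        if biased:
            return first
    return None

for outcome in [True, True, True]:
    update(0x10, outcome)
assert predict(0x10) is True   # always-taken so far: predict taken
update(0x10, False)
assert predict(0x10) is None   # bias broken: defer to main predictor
```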
  • On the use of bit filters in shared nothing partitioned systems

    Publication Year: 2005

    Parallel query processing is at the core of many business analysis environments. Such applications place high demands on computer hardware to deliver results in reasonable time, especially when queries are launched against huge amounts of warehouse data. We examine the problem of parallel query processing on large data sets, focusing on rational use of network and memory resources. In this context, we propose a new protocol for using bit filters in parallel shared-nothing systems for non-collocated joins, which we call remote bit filters with requests (RBFR). We have implemented a prototype of RBFR for the first time in a major commercial database, IBM® DB2 Universal Database™ (DB2 UDB). RBFR has two important advantages over previous uses of bit filters in the same context. First, it reduces the amount of memory used compared to previous solutions, which allows more or larger queries to be processed. Second, the protocol itself has an insignificant impact on communication: it is as efficient as previous strategies while avoiding network saturation in communication-intensive parallel environments.
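
The underlying bit-filter idea — hash the build-side join keys into a compact bit array so probe-side nodes can discard non-matching tuples before shipping them over the network — can be sketched as follows. The sizes, the hash, and the "requests" step are assumptions for illustration, not the RBFR protocol itself:

```python
# Bloom-style bit filter for a distributed join: the build side publishes
# a bit array of hashed keys; probe-side nodes test it locally and only
# ship tuples that might match. False positives are possible, false
# negatives are not.

FILTER_BITS = 1 << 16  # illustrative filter size

def build_filter(keys) -> bytearray:
    """Set one bit per hashed build-side key."""
    bits = bytearray(FILTER_BITS // 8)
    for k in keys:
        h = hash(k) % FILTER_BITS
        bits[h // 8] |= 1 << (h % 8)
    return bits

def maybe_match(bits: bytearray, key) -> bool:
    """False means definitely absent; True means 'maybe present'."""
    h = hash(key) % FILTER_BITS
    return bool(bits[h // 8] & (1 << (h % 8)))

bits = build_filter([10, 20, 30])
assert maybe_match(bits, 20)  # present keys always pass
# most non-matching probe tuples are filtered before hitting the network
shipped = [k for k in range(100) if maybe_match(bits, k)]
```

Memory versus traffic is the tension the abstract highlights: a larger filter filters more precisely but costs more memory per node, which is what RBFR is said to reduce.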
  • Incorporating a secure coprocessor in the database-as-a-service model

    Publication Year: 2005
    Cited by:  Papers (5)

    In this paper, we suggest an extension to the database-as-a-service (DAS) model that introduces a secure coprocessor (SC) at an untrusted database service provider in order to overcome drawbacks of the plain DAS model. The coprocessor serves as a neutral party between clients and service providers, with the goal of increasing the security of outsourced data. Additionally, it supports a much broader range of queries and reduces both the bandwidth and computational burdens on the client. We expect these improvements to make the DAS model more viable and attractive from a client's perspective.
  • Understanding and comparing the performance of optimized JVMs

    Publication Year: 2005

    Java virtual machines have different performance characteristics depending on their interpretation and just-in-time compilation strategies. These characteristics become even more complex when running on a modern out-of-order superscalar processor. This paper analyzes the behavior of the SPECjvm98 benchmarks on IBM's JikesRVM Java virtual machine executing on the IBM Power4 processor. Execution-time parameters such as the number of instructions and cycles, the behavior of instruction and data caches, and the branching characteristics obtained from hardware performance counters are used to explain performance differences between interpreted, JIT-compiled, and dynamically optimized JVMs. Our goal is to understand benchmark and processor behavior under different JIT optimization options and strategies, and to use this knowledge in the design of future JVMs. The results show that the reduction in the number of executed instructions due to compiler optimizations is the main reason for improved performance. An increase in instruction-level parallelism in compiled code provides further improvement; the increased ILP is largely due to the elimination of dependences in the optimized code.
  • An exploration of the technology space for multi-core memory/logic chips for highly scalable parallel systems

    Publication Year: 2005
    Cited by:  Papers (3)

    Chip-level multiprocessing, where more than one CPU core shares the same die with significant parts of the memory hierarchy, is appearing with increasing frequency as standard design practice. This paper takes a broader look at how such mixed logic/memory dies may evolve in the future by walking through the latest CMOS roadmap projections and casting them in terms of the key chip-level system building blocks. Given the increasing importance of memory density in such systems, especially as we move to single chip-type designs, we pay particular attention to the potential use of leading-edge DRAM rather than SRAM for many memory structures. The roles of other factors, such as interconnect and power, are also considered.
  • Optimal loop-unrolling mechanisms and architectural extensions for an energy-efficient design of shared register files in MPSoCs

    Publication Year: 2005
    Cited by:  Papers (1)

    In this paper, we introduce a new hardware/software approach to reducing the energy of the shared register file in upcoming embedded architectures with several VLIW processors. The paper presents a set of architectural extensions and special loop-unrolling techniques for the compilers of MPSoC platforms. This combined hardware/software support reduces the energy consumed in the register file of MPSoC architectures by up to 60% without introducing performance penalties.
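
Loop unrolling, the class of transformation the paper's compiler support builds on, can be illustrated generically; this toy example shows only the transformation itself, not the paper's MPSoC-specific register-file mechanisms:

```python
# Generic loop unrolling: the rolled loop does one element per iteration;
# the unrolled version processes four per iteration with independent
# accumulators (fewer branches, more instruction-level parallelism),
# plus a remainder loop for leftover elements.

def dot(a, b):
    s = 0.0
    for i in range(len(a)):  # rolled loop
        s += a[i] * b[i]
    return s

def dot_unrolled4(a, b):
    s0 = s1 = s2 = s3 = 0.0
    n = len(a) - len(a) % 4
    for i in range(0, n, 4):  # unrolled by 4
        s0 += a[i] * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
    for i in range(n, len(a)):  # remainder loop
        s0 += a[i] * b[i]
    return s0 + s1 + s2 + s3

a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 2.0, 2.0, 2.0, 2.0]
assert dot(a, b) == dot_unrolled4(a, b) == 30.0
```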
  • A New Kind of Processor Interface for a System-on-Chip Processor with TIE Ports and TIE Queues of Xtensa LX

    Publication Year: 2005 , Page(s): 72 - 79
    Cited by:  Papers (1)

    Today, most System-on-a-Chip (SoC) ASICs integrate multiple processor cores as well as hard-wired RTL blocks to realize very complex applications. While the computational performance of processors increases, data throughput becomes the bottleneck. Moreover, as processors and RTL blocks need to share data and control/status information, processor-to-processor and processor-to-RTL communication becomes a serious issue. While various system interconnects have been introduced, processor interface architecture has remained conceptually the same. To overcome the communication bottleneck, this paper presents a new type of embedded processor interface for SoC design. As a realization of such an interface, the TIE ports and TIE queues of the Xtensa LX processor from Tensilica, Inc. are introduced.
  • A multi-thread processor architecture based on the continuation model

    Publication Year: 2005
    Cited by:  Papers (2)

    We are developing the Fuce processor based on the dataflow computing model; Fuce stands for "fusion of communication and execution." To execute many threads efficiently with multiple thread execution units, the Fuce processor uses the exclusive multi-thread execution model, whose core concept is continuation-based multi-thread execution derived from dataflow computing. The Fuce processor aims to fuse intra-processor execution and inter-processor communication: it unifies processing inside the processor and communication with outside processors as events, and executes each event as a thread. In this paper, we introduce the architecture of the Fuce processor and evaluate the concurrency performance of a Fuce processor described in VHDL. The results show that the processor achieves good concurrency when sufficient thread-level parallelism is available.
  • PRESTOR-1: a processor extending multithreaded architecture

    Publication Year: 2005
    Cited by:  Papers (1)

    Multithreaded processors are becoming widespread. Multithreaded architecture enables fast context switching for tolerating memory access latency and bridging synchronization gaps, and thus enables efficient utilization of execution pipelines. However, it cannot avoid all pipeline stalls: stalls still occur when all processor built-in threads are in a wait state, or when there are not enough threads in a task/process to fill all available context slots, since the mechanism for switching active threads is effective only for the contexts of processor built-in threads. We developed a new multithreaded processor, PRESTOR-1, that increases the virtual number of built-in thread contexts and enables seamless task/thread switching by allocating and swapping task/thread contexts hierarchically between processor and memory in a multitasking environment. The processor supports real-time applications through hierarchical task/thread allocation based on task/thread priority and through fast response mechanisms for interrupt requests that exploit the multiple-context architecture. Moreover, the processor has reconfigurable caches that provide a priority-based partitioned cache and a FIFO buffer. In this paper, we describe the details of PRESTOR-1.
  • Continuum computer architecture for nano-scale and ultra-high clock rate technologies

    Publication Year: 2005

    Continuum computer architecture (CCA) is a non-von Neumann architecture that offers an alternative to conventional structures as digital technology evolves toward the nano-scale and the ultimate flat-lining of Moore's law. Coincidentally, it also defines a model of architecture particularly well suited to logic families that exhibit ultra-high clock rates (> 100 GHz), such as rapid single flux quantum (RSFQ) gates. CCA eliminates the concept of the "CPU" that has dominated computer architecture since its inception more than half a century ago, and establishes a new local element that merges the properties of state storage, state transfer, and state operation. A CCA system architecture is a simple multidimensional organization of these elemental blocks and physically may be considered a new family of cellular computer. But CCA differs dramatically from conventional cellular automata: while both derive emergent global behavior from the aggregation of local rules and their ensuing operation, the CCA emergent behavior is a global general-purpose model of parallel computation, as opposed to simply mimicking some limited phenomenon like heat and mass transfer as conventional cellular automata do. This paper presents the motivation and foundational concepts of CCA and exposes key issues for further work.
  • Performance evaluation of dynamic network reconfiguration using Detour-UD routing

    Publication Year: 2005
    Cited by:  Papers (2)

    Fault tolerance is an emerging issue for massively parallel computers. This paper describes the performance impact of dynamic network reconfiguration protocols using Detour-UD, a fault-tolerant, adaptive deadlock-recovery routing algorithm for k-ary n-cubes. We propose a scheme that identifies unroutable packets by managing drain flags in routing tables, together with two selective drainage protocols. One protocol drains the unroutable packets identified by the drain flags after the reconfiguration process; the other drains deadlocked packets to reduce the network load during the reconfiguration process. Our simulation results show that the first protocol helps reduce the number of drained packets, while the second maintains network throughput during the reconfiguration process.
  • Preliminary evaluations of a FPGA-based-prototype of DIMMnet-2 network interface

    Publication Year: 2005
    Cited by:  Papers (3)  |  Patents (1)

    Performance improvements in interconnection networks for PC clusters expose a bottleneck in standard I/O buses such as the PCI bus. DIMMnet is a network interface plugged into a memory slot instead of a standard I/O bus, one approach to keeping network performance in balance with future microprocessors. DIMMnet-2 is a prototype that plugs into a DDR-DIMM slot to verify its functions. In this paper, we outline the FPGA-based DIMMnet-2 prototype and the improvements from DIMMnet-1 to DIMMnet-2. Although DIMMnet-2 uses an FPGA instead of an ASIC, its latency for writing 8 bytes into remote memory is only 0.948 μs, about one third the latency of QsNET II, a high-performance commercial network interface plugged into the PCI-X bus of an Intel-based IA-32 PC. The CoreLogic delay for BOTF sending in the FPGA-based DIMMnet-2 is 5.75 times as fast as that of DIMMnet-1.
  • SIMD optimization in COINS compiler infrastructure

    Publication Year: 2005
    Cited by:  Papers (1)  |  Patents (1)

    COINS is a compiler infrastructure that makes it easy to construct a new compiler by adding or modifying only part of COINS's compilation and optimization features. SIMD optimization is one of its major strengths. We present an overview of COINS and discuss several topics in its SIMD optimization.
  • Performance comparison of vector-calculations between Itanium2 and other processors

    Publication Year: 2005

    This paper examines the performance similarity between the Intel Itanium2 processor and a vector processor. Measurements of vector calculations on the latest scalar processors show that Itanium2 shares similar performance strengths and weaknesses with the VPP5000. For multiplications of dense matrices, Itanium2 and the VPP5000 both achieve relatively high sustained performance compared to their theoretical peaks. For matrix-vector multiplications with sparse matrices, on the other hand, both processors show poor performance.
  • Author index

    Publication Year: 2005 , Page(s): 147
    Freely Available from IEEE