
Innovative Architecture for Future Generation High-Performance Processors and Systems, 2004. Proceedings

Date: 12-14 Jan. 2004


Displaying Results 1 - 22 of 22
  • [Cover page]

    Publication Year: 2004 , Page(s): c1
    PDF (775 KB) | Freely Available from IEEE
  • [Title page]

    Publication Year: 2004 , Page(s): i - iv
    PDF (71 KB) | Freely Available from IEEE
  • Table of contents - IWIA 2004

    Publication Year: 2004 , Page(s): v - vi
    PDF (33 KB) | Freely Available from IEEE
  • Preface

    Publication Year: 2004 , Page(s): vii
    PDF (16 KB) | HTML | Freely Available from IEEE
  • Program Committee

    Publication Year: 2004 , Page(s): viii - viiii
    PDF (15 KB) | Freely Available from IEEE
  • Direct Instruction Wakeup for Out-of-Order Processors

    Publication Year: 2004 , Page(s): 2 - 9
    Cited by:  Papers (2)
    PDF (133 KB) | HTML

    Instruction queues consume a significant amount of power in high-performance processors, primarily due to instruction wakeup logic accesses to the queue structures. The wakeup logic delay is also a critical timing parameter. This paper proposes a new queue organization that uses a small number of successor pointers plus a small number of dynamically allocated full successor bit vectors for cases with a larger number of successors. The details of the new organization are described, and it is shown to achieve the performance of CAM-based or full-dependency-matrix organizations using just one pointer per instruction plus eight full bit vectors; only two full bit vectors are needed when two successor pointers are stored per instruction. Finally, a design and pre-layout of all critical structures in 70 nm technology were performed for the proposed organization as well as for a CAM-based baseline. The new design uses one-half to one-fifth of the baseline instruction queue power, depending on queue size, and significantly less power than the full-dependency-matrix design.
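
    The pointer-plus-shared-bit-vector bookkeeping this abstract describes can be sketched in a few lines. Everything below — the class names, the queue size, and the escalation policy — is an illustrative assumption, not the paper's actual design:

```python
# Sketch of the hybrid wakeup bookkeeping: each queue entry stores one
# successor pointer inline; an entry whose instruction accumulates more
# successors is escalated to one of a few shared full bit vectors.
# All names and sizes here are illustrative, not the paper's.

QUEUE_SIZE = 32
NUM_FULL_VECTORS = 8  # the paper reports 8 suffice with 1 pointer/entry

class Entry:
    def __init__(self):
        self.succ = None        # single inline successor pointer
        self.vector = None      # id of an allocated full bit vector, if any

class WakeupTable:
    def __init__(self):
        self.entries = [Entry() for _ in range(QUEUE_SIZE)]
        self.vectors = {}       # vector id -> set of successor slots
        self.free_vectors = list(range(NUM_FULL_VECTORS))

    def add_successor(self, producer, consumer):
        e = self.entries[producer]
        if e.succ is None and e.vector is None:
            e.succ = consumer                 # common case: first successor
        else:
            if e.vector is None:              # escalate to a full bit vector
                e.vector = self.free_vectors.pop()
                self.vectors[e.vector] = {e.succ}
                e.succ = None
            self.vectors[e.vector].add(consumer)

    def wakeup(self, producer):
        """Return the queue slots to wake when `producer` completes."""
        e = self.entries[producer]
        if e.vector is not None:
            return sorted(self.vectors[e.vector])
        return [e.succ] if e.succ is not None else []
```

    The point of the hybrid is visible in `add_successor`: the scarce full vectors are spent only on the rare many-successor producers, while the common single-successor case costs one pointer.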

  • A Super Instruction-Flow Architecture for High Performance and Low Power Processors

    Publication Year: 2004 , Page(s): 10 - 19
    PDF (830 KB) | HTML

    Microprocessor performance has improved at about 55% per year for the past three decades. To maintain this growth rate, next-generation processors must achieve higher levels of instruction-level parallelism. However, conditional branches pose serious performance problems in modern processors, and the problem worsens as pipelines become deeper and issue widths wider. The goal of this study is to develop a novel processor architecture that mitigates the performance degradation caused by branch instructions. To this end, we propose a super instruction-flow architecture and describe its concept. The architecture provides a mechanism that processes multiple instruction flows efficiently in order to mitigate this degradation. Preliminary evaluation results with small benchmark programs show that the first-generation super instruction-flow processor efficiently mitigates branch overhead.

  • Power-Aware Register Renaming in High-Performance Processors

    Publication Year: 2004 , Page(s): 20 - 27
    Cited by:  Patents (1)
    PDF (289 KB) | HTML

    This work presents an efficient multi-banked register file architecture and low-power compiler support that reduce energy consumption in this device by more than 78%. The key idea is based on a quasi-deterministic interpretation of the register assignment task and on the use of voltage-scaling techniques.

  • Custom-Enabled System Architectures for High End Computing

    Publication Year: 2004 , Page(s): 30 - 39
    PDF (121 KB) | HTML

    The US Federal Government has convened a major committee to determine future directions for government-sponsored high-end computing system acquisitions and enabling research. The High End Computing Revitalization Task Force (HECRTF) was inaugurated in 2003, involving all Federal agencies for which high-end computing is critical to meeting mission goals. As part of the HECRTF agenda, a multi-day, community-wide workshop was conducted involving experts from academia, industry, and the national laboratories and centers to provide the broadest perspective on issues related to the HECRTF purview. Among the most critical issues in establishing future directions are the relative merits of commodity-based systems, such as clusters and MPPs, versus custom system architecture strategies. This paper presents a perspective on the importance and value of the custom architecture approach in meeting future US requirements in supercomputing. The contents reflect the ideas of the participants of the working group chartered to explore custom-enabled system architectures for high-end computing. As in any such consensus presentation, while this paper captures the key ideas and tradeoffs, it does not exactly match the viewpoint of any single contributor, and there remains much room for constructive disagreement and refinement of the essential conclusions.

  • A New Memory Module for COTS-Based Personal Supercomputing

    Publication Year: 2004 , Page(s): 40 - 48
    Cited by:  Papers (6)
    PDF (214 KB) | HTML

    This paper presents how to build inexpensive personal supercomputers that retain the benefits of commercial-off-the-shelf (COTS) parts now that vector supercomputer vendors have disappeared. The design requires no modification to the CPU, the bridge chips on the motherboard, or the memory chips: simply plugging in a new memory module with vector load/store functions turns an inexpensive home personal computer into a node similar to one of the Earth Simulator's. These nodes can be connected by COTS InfiniBand 4X or 12X switches to build parallel systems. COTS SO-DIMMs on the memory modules can be accessed quickly by remote nodes using AOTF, BOTF, RDMA, and remote vector load/store operations, accelerating applications with unit-stride or indexed accesses. How to accelerate NAS CG class B is shown as an example. The evaluation methodology used is about 500 times faster than a SimpleScalar-based methodology. Bandwidth analysis predicts that the proposed system can achieve up to an 8.75-times improvement for a single-CPU Pentium 4 PC without parallel processing.

  • Fault-Tolerant Adaptive Deadlock-Recovery Routing for k-ary n-cube Networks

    Publication Year: 2004 , Page(s): 49 - 58
    Cited by:  Papers (3)
    PDF (852 KB) | HTML

    This paper proposes a fault-tolerant, fully adaptive, deadlock-recovery routing algorithm for k-ary n-cube networks. We consider both adaptability to faults and communication performance by integrating regular and irregular network routing. Our algorithm tolerates any number or shape of faults without disabling fault-free nodes by maintaining routing tables configured from fault information. It also provides minimal misrouting paths around faults while guaranteeing deadlock freedom using only two virtual channels per physical channel. Simulation results show that the proposed algorithm attains robust communication performance for uniform and nonuniform traffic patterns, not only on a fault-free torus but also on irregular tori with faulty nodes.

  • GXP : An Interactive Shell for the Grid Environment

    Publication Year: 2004 , Page(s): 59 - 67
    Cited by:  Papers (7)
    PDF (173 KB) | HTML

    We describe GXP, a shell for distributed multi-cluster environments. With GXP, users can quickly submit a command to many nodes simultaneously (approximately 600 milliseconds on over 300 nodes spread across five local-area networks). It therefore brings an interactive, instantaneous response to many cluster/network operations, such as trouble diagnosis, parallel program invocation, installation and deployment, testing and debugging, monitoring, and dead-process cleanup. It features (1) very fast parallel (simultaneous) command submission, (2) parallel pipes (pipes between a local command and all parallel commands), and (3) a flexible and efficient method for interactively selecting a subset of nodes on which to execute subsequent commands. GXP is very easy to start using because it requires no cumbersome per-node setup or installation and depends on only a very small number of pre-installed tools. We describe how GXP achieves these features and demonstrate through examples how they make many otherwise tedious and error-prone tasks simple, efficient, and fun.
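
    The fan-out/fan-in pattern behind GXP's parallel command submission can be mimicked locally with threads and subprocesses. This is only a toy sketch of the pattern — real GXP multiplexes ssh/rsh sessions and their I/O, and none of the names below come from GXP itself:

```python
# Toy fan-out/fan-in: submit one command to many "nodes" at once and
# gather the outputs. Here the nodes are simulated by local shells; a
# real implementation would wrap `command` in an ssh invocation.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_on_node(node, command):
    out = subprocess.run(["sh", "-c", command], capture_output=True, text=True)
    return node, out.stdout.strip()

def parallel_submit(nodes, command):
    # One thread per node so all submissions are in flight simultaneously.
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        return dict(pool.map(lambda n: run_on_node(n, command), nodes))

results = parallel_submit(["node0", "node1", "node2"], "echo hello")
```

    The interactive-response property in the abstract comes from keeping this fan-out concurrent: total latency tracks the slowest node, not the sum over nodes.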

  • Array Data Dependence Testing with the Chains of Recurrences Algebra

    Publication Year: 2004 , Page(s): 70 - 81
    Cited by:  Papers (1)
    PDF (225 KB) | HTML

    This paper presents a new approach to dependence testing in the presence of nonlinear and non-closed array index expressions and pointer references. The chains of recurrences formalism and algebra is used to analyze the recurrence relations of induction variables and to construct recurrence forms of array index expressions and pointer references. We use these recurrence forms to determine whether the array and pointer references are free of dependences in a loop nest. Our recurrence formulation enhances the accuracy of standard dependence algorithms such as the extreme value test and the range test. Because the recurrence forms are easily converted to closed forms (when they exist), induction variable substitution and array recovery can be delayed until after the loop is analyzed.
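
    As a rough illustration of the formalism this abstract relies on: a depth-1 chain of recurrences {phi, +, s} denotes the value phi + i*s at iteration i, and linear CRs are closed under addition and constant scaling, which is what lets a compiler manipulate index expressions symbolically. A minimal sketch, restricted to linear CRs — the paper's algebra covers deeper and nonlinear chains:

```python
# A chain of recurrences {phi, +, step} describes the value phi + i*step
# of a linear induction expression at iteration i. Linear CRs are closed
# under addition and constant scaling.

class CR:
    def __init__(self, phi, step):
        self.phi, self.step = phi, step

    def at(self, i):
        return self.phi + i * self.step     # closed form of the recurrence

    def __add__(self, other):               # {a,+,b} + {c,+,d} = {a+c,+,b+d}
        return CR(self.phi + other.phi, self.step + other.step)

    def scale(self, k):                     # k * {a,+,b} = {k*a,+,k*b}
        return CR(k * self.phi, k * self.step)

# Index expression 2*i + 3, built from the basic induction variable i = {0,+,1}:
i = CR(0, 1)
idx = i.scale(2) + CR(3, 0)                 # yields {3,+,2}
```

    A dependence test can then compare two such forms (e.g. their extreme values over the iteration space) without first rewriting the loop into closed form.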

  • Memory Management for Data Localization on OSCAR Chip Multiprocessor

    Publication Year: 2004 , Page(s): 82 - 88
    Cited by:  Papers (2)  |  Patents (1)
    PDF (168 KB) | HTML

    Chip multiprocessor (CMP) architecture has been attracting much attention as a next-generation microprocessor architecture, and many kinds of CMP are being widely researched. However, CMP architectures pose several difficulties for effective use of memory, especially the cache or local memory near a processor core. The authors have proposed the OSCAR CMP architecture, which works cooperatively with a multigrain parallelizing compiler that provides much higher parallelism than instruction-level or loop-level parallelism, along with high application-program productivity. To support compiler optimization for effective use of cache or local memory, OSCAR CMP has local data memory (LDM) for processor-private data and distributed shared memory (DSM) for synchronization and fine-grain data transfers among processors, in addition to centralized shared memory (CSM) to support dynamic task scheduling. This paper proposes a static coarse-grain task scheduling scheme for data localization using live-variable analysis. A remote-memory data transfer scheduling scheme using the same live-variable information is also described. The proposed scheme is implemented in the OSCAR FORTRAN multigrain parallelizing compiler and evaluated on OSCAR CMP using Tomcatv and Swim from the SPEC CFP95 benchmark suite.
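
    Live-variable analysis, which the scheduling scheme above builds on, is the standard backward dataflow fixed point: live_in[B] = use[B] ∪ (live_out[B] − def[B]) and live_out[B] = ∪ live_in[S] over successors S. A minimal sketch on a toy control-flow graph — not the OSCAR compiler's implementation:

```python
# Backward liveness to a fixed point over a tiny CFG.
# blocks: name -> (use set, def set); succ: name -> successor names.

def liveness(blocks, succ):
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:                      # iterate until nothing changes
        changed = False
        for b, (use, defs) in blocks.items():
            out = set().union(*(live_in[s] for s in succ[b])) if succ[b] else set()
            new_in = use | (out - defs)
            if new_in != live_in[b] or out != live_out[b]:
                live_in[b], live_out[b] = new_in, out
                changed = True
    return live_in, live_out

# b1 defines a; b2 uses a and defines b; b3 uses b.
blocks = {"b1": (set(), {"a"}), "b2": ({"a"}, {"b"}), "b3": ({"b"}, set())}
succ = {"b1": ["b2"], "b2": ["b3"], "b3": []}
live_in, live_out = liveness(blocks, succ)
```

    Knowing which values stay live across task boundaries is what tells a localization scheduler which data must remain in (or be transferred to) a processor's local memory.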

  • Implementation Details and Evaluation of a New Exact and Fast Test for Array Data Dependence Analysis Based on Simplex Method

    Publication Year: 2004 , Page(s): 89 - 100
    PDF (263 KB) | HTML

    Data dependence analysis (DDA) is essential for any automatic parallelizing compiler to determine the parallelizability of given portions of programs. Several techniques and tests for analyzing data dependence between array elements have already been proposed, and among these conventional DDA techniques there is a clear trade-off between analysis speed and exactness of the result. When exactness is of primary importance, one usually selects the Omega test; however, the Omega test is on average several tens of times slower than the Banerjee test, and it is hard to implement. Therefore, this paper proposes a new exact test whose algorithm is simple enough that its analysis is generally several times faster than the Omega test. The algorithm combines the Simplex method for linear programming with an exhaustive solution search. To evaluate the new algorithm, 55 benchmark programs were created; a comparison with the Omega test on these benchmarks is also described.
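
    The exactness question here is whether two subscripts can collide for integer iteration values within the loop bounds: do A[a*i+b] and A[c*j+d] ever touch the same element? The toy tester below uses only the exhaustive-search half of the paper's Simplex-plus-search combination, which is exact but viable only for tiny bounds:

```python
# Exact-by-brute-force dependence test for a single pair of linear
# subscripts: is there an integer solution to a*i + b == c*j + d with
# lo <= i, j <= hi? Illustrative only; the paper prunes the search space
# with the Simplex method first.

def depends(a, b, c, d, lo, hi):
    """True iff a*i+b == c*j+d for some integers lo <= i, j <= hi."""
    return any(a * i + b == c * j + d
               for i in range(lo, hi + 1)
               for j in range(lo, hi + 1))

# A[2i] vs A[2j+1]: even vs odd subscripts never collide.
no_dep = depends(2, 0, 2, 1, 0, 100)
# A[i+1] vs A[j]: overlaps whenever j = i+1.
dep = depends(1, 1, 1, 0, 0, 100)
```

    An exact test must answer exactly this integer feasibility question; approximate tests such as Banerjee's trade that exactness for speed by bounding the expressions over the reals.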

  • Large-Scale 3-D Fluid Simulations for Implosion Hydrodynamics on the Earth Simulator

    Publication Year: 2004 , Page(s): 102 - 108
    PDF (522 KB) | HTML

    A three-dimensional fluid code, IMPACT-3D, has been parallelized with High Performance Fortran (HPF) on the Earth Simulator. IMPACT-3D is an implosion analysis code using a TVD scheme, performing three-dimensional compressible, inviscid Eulerian fluid computation with an explicit 5-point stencil scheme for spatial differentiation and fractional time steps for time integration. The third dimension of each array was distributed for parallelization, while the first dimension was preserved for vector processing. Shift communications were manually tuned for best performance using the HPF/JA extensions, which were designed to give users more control over sophisticated parallelization and communication optimizations. Although the Earth Simulator Center requires more than a 95% vectorization ratio and more than 50% parallelization efficiency to run a simulation code on the Earth Simulator, we have cleared these severe criteria using 128 nodes. We are very encouraged to obtain this outstanding performance on a real-world scientific application parallelized with HPF.

  • Highly Functional Memory Architecture for Large-Scale Data Applications

    Publication Year: 2004 , Page(s): 109 - 118
    Cited by:  Papers (1)  |  Patents (1)
    PDF (230 KB) | HTML

    Response time in database systems is not shrinking even as processors get faster, because of the growing gap between processor and memory speed and the increase in data sizes. A conventional memory controller and processor caches cannot provide enough data-transfer bandwidth between processor and memory. For fast processing of large data, it is effective to equip the memory controller with mechanisms for transferring large data and the processor with a buffer for receiving it. In this paper, to accelerate query processing, we propose fast, large-scale data transfer methods that exploit the data structures of main-memory database systems and the characteristics of DRAM, and we evaluate them in simulations of several queries. The simulations show that query processing with the proposed mechanisms executes about 10 times faster than the conventional method.

  • Parallel Processing using Data Localization for MPEG2 Encoding on OSCAR Chip Multiprocessor

    Publication Year: 2004 , Page(s): 119 - 127
    Cited by:  Patents (1)
    PDF (186 KB) | HTML

    Currently, many people enjoy multimedia applications with image and audio processing on PCs, PDAs, mobile phones, and so on. With the popularization of multimedia applications, the need for low-cost, low-power, high-performance processors has been increasing. To this end, chip multiprocessor architectures, which allow scalable performance improvement by exploiting multigrain parallelism, are attracting much attention. However, to extract higher performance from a chip multiprocessor, more sophisticated software techniques are required, such as decomposing a program into tasks of adequate grain and assigning them to processors with attention to parallelism and data locality. This paper describes a parallel processing scheme for MPEG2 encoding that uses data localization, improving execution efficiency by consecutively assigning coarse-grain tasks that share the same data to the same processor. Performance evaluation on the OSCAR chip multiprocessor architecture shows that the proposed scheme delivers 6.97-times speedup on 8 processors and 10.93-times speedup on 16 processors over sequential execution. Moreover, it delivers 1.61-times and 2.08-times speedup on 8 and 16 processors, respectively, over loop-parallel processing, which has been widely used for multiprocessor systems, on the same number of processors.

  • Of Piglets and Threadlets: Architectures for Self-Contained, Mobile, Memory Programming

    Publication Year: 2004 , Page(s): 130 - 138
    Cited by:  Papers (3)
    PDF (142 KB) | HTML

    Virtually all of the discussion on "commodity" vs. "custom" architectures, especially for highly parallel systems, has focused on the high-glamor, high-complexity processor core. This paper takes a different tack: it explores the potential for directly attacking the memory wall by programming the classically "dumb" memory interface. Several related but separable techniques are involved: converting the data that makes up a memory request into the machine state of a "traveling thread", and developing an ISA that can manipulate this state via extremely short instructions, allowing complete programs to be stored within the access packet. The results are interesting: a wide spectrum of functions and applications exists for which this approach provides significant improvement in bandwidth and latency.

  • Impact of Dynamic Allocation of Physical Register Banks for an SMT Processor

    Publication Year: 2004 , Page(s): 139 - 147
    Cited by:  Patents (2)
    PDF (5799 KB) | HTML

    In an SMT processor, the increased register context of multiple threads requires a large number of physical registers. Moreover, the physical register file in an SMT processor requires more ports for the execution units, which causes significant growth in the area, access time, and power consumption of the register file. These problems are critical hurdles to implementing a large-scale SMT processor; in particular, the growth in register file access time has a large impact on performance. In this paper, we propose a strategy that divides the physical register file into banks and dynamically allocates the banks to threads in order to reduce register file access time. Using the proposed strategy, we reduce register file access time by up to 60% without an increase in area, while IPC degradation is limited to at most 6%.

  • YAWARA: A Meta-Level Optimizing Computer System

    Publication Year: 2004 , Page(s): 148 - 153
    Cited by:  Papers (1)
    PDF (199 KB) | HTML

    This paper proposes a new autonomous, dynamic optimization framework called meta-level computation. In this framework, a meta-level processor acquires the execution profile of a base-level processor, i.e., a conventional von Neumann machine, produces an optimized base-level configuration, and performs the reconfiguration. We define the meta-level computation model based on considerations of hardware versus software reconfiguration, static versus dynamic reconfiguration, and homogeneous versus heterogeneous architecture. The model employs thread-level reconfiguration to realize autonomous, dynamic optimization on a uniformly structured multiprocessor. As an implementation of the computation model, we propose a combined software/hardware system called the YAWARA system. The software system realizes both static and dynamic feedback-directed autonomous optimization; the hardware system consists of thread engines, each of which includes hardware mechanisms for profiling and feedback-directed resource control.

  • Author index

    Publication Year: 2004 , Page(s): 154
    PDF (18 KB) | Freely Available from IEEE