International Workshop on Innovative Architecture for Future Generation High Performance Processors and Systems (IWIA '06), 2006

Date: 23-25 Jan. 2006

  • International Workshop on Innovative Architecture for Future Generation High Performance Processors and Systems [Cover]

    Publication Year: 2006 , Page(s): c1
    PDF (6833 KB) | Freely Available from IEEE
  • International Workshop on Innovative Architecture for Future Generation High Performance Processors and Systems-Title

    Publication Year: 2006 , Page(s): i - iii
    PDF (54 KB) | Freely Available from IEEE
  • International Workshop on Innovative Architecture for Future Generation High Performance Processors and Systems-Copyright

    Publication Year: 2006 , Page(s): iv
    PDF (62 KB) | Freely Available from IEEE
  • International Workshop on Innovative Architecture for Future Generation High Performance Processors and Systems - TOC

    Publication Year: 2006 , Page(s): v - vi
    PDF (33 KB) | Freely Available from IEEE
  • Message from the Editors

    Publication Year: 2006 , Page(s): vii
    PDF (24 KB) | HTML | Freely Available from IEEE
  • Reviewing Committee

    Publication Year: 2006 , Page(s): viii
    PDF (19 KB) | Freely Available from IEEE
  • A Holistic Approach to System Reliability in Blue Gene

    Publication Year: 2006 , Page(s): 3 - 12
    Cited by:  Papers (1)
    PDF (5157 KB) | HTML

    Optimizing supercomputer performance requires balancing objectives for processor performance, network performance, power delivery and cooling, cost, and reliability. In particular, scaling a system to a large number of processors poses challenges for reliability, availability, and serviceability. Given the power and thermal constraints of data centers, the BlueGene/L supercomputer has been designed with a focus on maximizing floating-point operations per second per Watt (FLOPS/Watt). This also yields high FLOPS per unit of floor space and per dollar, allowing for affordable scale-up. The BlueGene/L system has been scaled to a total of 65,536 compute nodes in 64 racks. A system-level approach was used to minimize power at all levels, from the processor to the cooling plant. A BlueGene/L compute node consists of a single ASIC and associated memory. The ASIC integrates all system functions, including processors, the memory subsystem, and communication, thereby minimizing chip count, interfaces, and power dissipation. As the number of components increases, even a low failure rate per component leads to an unacceptable system failure rate, so additional mechanisms have to be deployed to achieve sufficient reliability at the system level. In particular, the data transfer volume in the communication networks of a massively parallel system poses significant challenges for bit error rates and recovery mechanisms in the communication links. Low power dissipation and high performance, along with reliability, availability, and serviceability, were prime considerations in the BlueGene/L hardware architecture, system design, and packaging. A high-performance software stack, consisting of operating system services, compilers, libraries, and middleware, completes the system while enhancing reliability and data integrity.
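The abstract's scaling argument (even a low per-component failure rate becomes unacceptable at 65,536 nodes) can be made concrete with a toy reliability model. This sketch assumes independent component failures, and the per-component failure probability is illustrative, not from the paper:

```python
# Toy model: probability that a system of n independent components sees no
# failure, given a per-component failure probability over some interval.
# The 1e-6 figure is an illustrative assumption, NOT from the paper.
def system_survival(p_fail, n_components):
    """Return the probability that none of n_components fails."""
    return (1.0 - p_fail) ** n_components

# A component failing with probability 1e-6 is very reliable in isolation,
# but across 65,536 nodes, system-level survival drops noticeably:
single_node = system_survival(1e-6, 1)
full_system = system_survival(1e-6, 65536)
```

This is why the abstract argues for additional system-level mechanisms rather than relying on component reliability alone.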

  • Redundancy in Multi-core Memory-Rich Application-Specific PIM Chips

    Publication Year: 2006 , Page(s): 13 - 20
    PDF (328 KB) | HTML

    A trend of growing significance in advanced microprocessor chip design is the inclusion of multiple processor cores on the same die together with significant parts of the memory hierarchy. This is done to reduce both non-recurring design costs and power dissipation, and to get more computational capability and utilization out of the silicon. A side effect, however, is the opportunity to leverage the redundancy offered by these multiple cores to improve both die yield (and thus reduce chip costs) and the longevity of systems employing such chips. This paper discusses the key variables that go into the configuration of such multi-core chips, where the goal is complete integration with the memory hierarchy in a single part type. The emphasis of the study is on understanding how many cores, and of what complexity, are most appropriate.
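The yield benefit of core-level redundancy can be sketched with a simplified binomial defect model (this is an illustrative sketch with made-up yields, not the paper's analysis): a die is sellable when at least a required number of its cores are defect-free.

```python
from math import comb

# Simplified binomial defect model for multi-core die yield: a chip is good
# when at least n_required of its n_cores cores are defect-free.
# Assumes independent per-core defects; the 90% per-core yield is made up.
def chip_yield(core_yield, n_cores, n_required):
    return sum(
        comb(n_cores, k) * core_yield ** k * (1 - core_yield) ** (n_cores - k)
        for k in range(n_required, n_cores + 1)
    )

# Demanding all 8 cores work gives ~43% die yield at 90% per-core yield,
# while tolerating a single spare core recovers it to ~81%:
strict = chip_yield(0.9, 8, 8)
with_spare = chip_yield(0.9, 8, 7)
```

Even one spare core roughly doubles sellable dies in this toy setting, which is the kind of trade-off the paper's configuration study quantifies.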

  • Improving Instruction Issue Bandwidth for Concurrent Error-Detecting Processors

    Publication Year: 2006 , Page(s): 21 - 28
    Cited by:  Papers (1)
    PDF (219 KB) | HTML

    Soft error tolerance is a hot research topic for modern microprocessors. We have been investigating a soft-error-tolerant microarchitecture, RED, which exploits time redundancy to achieve soft error tolerance without requiring prohibitive additional hardware resources. Unfortunately, our previous study revealed that a RED-based processor suffers a severe performance penalty. We attribute this to the reduction in effective instruction issue queue (ISQ) capacity: since RED uses a register update unit (RUU), which combines an ISQ and a reorder buffer (ROB) into a single structure, redundant instructions occupy the ISQ. Contemporary microprocessors, however, use a dedicated ISQ that is decoupled from the ROB rather than an RUU. In this paper, in order to reduce the performance penalty, we adapt RED to ROB-based microprocessors. We reduce the penalty from 17.4% to 12.4% and from 23.9% to 18.3% for integer and floating-point programs, respectively.

  • The Speculative Prefetcher and Evaluator Processor for Pipelined Memory Hierarchies

    Publication Year: 2006 , Page(s): 29 - 43
    PDF (260 KB) | HTML

    We consider extensible processor designs in which the number of gates and the distance that a signal traverses in one clock period are, within a given technology, independent of system size. Consequently, such designs scale with system size (in particular, with memory latency) as well as with technological advancement. We assume aggressive memories that are not only hierarchical in nature, but are also heavily pipelined, accepting requests at a constant rate. In such a setting, we propose a processor organization called the speculative prefetcher and evaluator (SPE), which performs memory accesses on speculated addresses and executes operations on speculated operand values. The speculation policy simply assumes the absence of dependences among suitable sets of instructions that are executed concurrently; it is not based on estimating properties of the program under execution. The SPE also supports branch target speculation; however, the performance results of this paper only assume static prediction of loop branches. In order to appraise the performance of the SPE, we evaluate the execution time on various algorithms. First we consider a class of programs, based on loops, which includes a number of interesting algorithms such as matrix addition and multiplication, FFT, bitonic merging and sorting, finite-difference solutions for some PDEs, and digital filtering simulations. Then, we consider a recursive implementation of quicksort. For all these programs, the execution time is proportional to the number of executed instructions, that is, the cycle-per-instruction metric is constant, even if memory latency is pessimistically taken to grow linearly with the physical address. The result for the loop class exploits only the pipelinability of the memory. The result for quicksort also exploits the hierarchical nature of memory.

  • Responsive Multithreaded Processor for Distributed Real-Time Processing

    Publication Year: 2006 , Page(s): 44 - 56
    Cited by:  Papers (1)  |  Patents (3)
    PDF (2405 KB) | HTML

    The responsive multithreaded (RMT) processor is a processor chip that integrates almost all functions needed for parallel/distributed real-time systems such as robots, intelligent rooms/buildings, and amusement systems. Concretely, the RMT processor integrates a real-time processing core (RMT PU), real-time communication (five sets of responsive links), computer I/O peripherals (DDR SDRAM I/Fs, DMAC, PCI-X, USB 2.0, IEEE 1394, etc.), and control I/O peripherals (PWM generators, pulse counters, etc.). The design rule of the RMT processor is TSMC 0.13 µm CMOS Cu 1P8M and its die size is 100 mm². The RMT PU can execute eight prioritized threads simultaneously using a priority-based SMT architecture, called the RMT architecture. Priority is introduced into all functional units, including the cache systems, the fetch unit, the issue unit, and the execution units, so that the RMT PU can guarantee real-time execution of the prioritized threads. If a resource conflict occurs at a functional unit, a higher-priority thread can overtake lower-priority threads at that unit. The RMT PU is thus an SMT core with priority, executing threads simultaneously in the order of priorities set by a real-time operating system. The RMT PU has hierarchical storage of thread states: eight hardware contexts serve as first-level (native) register sets to execute eight prioritized threads simultaneously, and a context cache can save 32 further hardware contexts, so that 40 prioritized threads can be handled and executed concurrently by hardware.
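The priority-based arbitration the abstract describes (a higher-priority thread overtakes lower-priority ones at a conflicting functional unit) can be sketched minimally; the names and the lower-number-is-higher-priority convention are illustrative assumptions, not the chip's interface:

```python
# Toy sketch of per-functional-unit arbitration among contending threads:
# the highest-priority contender wins the resource this cycle.
# Convention assumed here: a lower number means a higher priority.
def arbitrate(contenders):
    """contenders maps thread id -> priority; return the winning thread id."""
    return min(contenders, key=contenders.get)

# Three of the eight hardware contexts contend for one unit; thread 3 holds
# the highest priority and overtakes the others:
winner = arbitrate({0: 5, 3: 1, 7: 4})
```

A real-time OS would set these priorities, and the hardware would apply the same rule independently at the fetch, issue, and execution stages.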

  • A Partial Irregular-Network Routing on Faulty k-ary n-cubes

    Publication Year: 2006 , Page(s): 57 - 64
    PDF (460 KB) | HTML

    Interconnection networks have been studied as a way to connect large numbers of processing elements in parallel computers. As entire systems become more complicated, their design increasingly faces the challenge of high fault tolerance. This paper presents a partial irregular-network routing scheme that provides high fault tolerance in k-ary n-cube networks. Since irregular-network routing usually performs poorly in k-ary n-cube networks, it is used only for progressive deadlock recovery and for avoiding hard failures. The network is logically divided into a fault region and a regular region. In the regular region, most packets are transferred along fully adaptive paths that are computed, assuming that there are no hard failures, so as to uniformly distribute the traffic. Simulation results show that the proposed routing achieves the same throughput as Duato's protocol when there are no hard failures. As the number of faulty links increases up to 8 on 256 nodes, its throughput decreases by only 15%. Moreover, the throughput of the proposed deadlock-recovery routing is almost maintained during a dynamic reconfiguration.
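As background for routing in the regular region, here is a minimal sketch of textbook dimension-order routing on a k-ary n-cube (the paper's fault-tolerant scheme is more involved; this only shows the baseline topology and path computation):

```python
# Sketch: dimension-order routing on a k-ary n-cube. Correct each dimension
# in order, stepping the shorter way around each k-node ring.
def dimension_order_route(src, dst, k):
    """Return the hop-by-hop node sequence from src to dst (tuples of
    per-dimension coordinates in range(k))."""
    pos = list(src)
    hops = []
    for dim in range(len(src)):
        while pos[dim] != dst[dim]:
            forward = (dst[dim] - pos[dim]) % k   # hops remaining if we step +1
            step = 1 if forward <= k - forward else -1
            pos[dim] = (pos[dim] + step) % k
            hops.append(tuple(pos))
    return hops

# In a 4-ary 2-cube, (0,0) -> (3,1) wraps around dimension 0 in one hop:
route = dimension_order_route((0, 0), (3, 1), 4)
```

Fault-tolerant schemes such as the one above must deviate from these fixed paths around the fault region while still avoiding deadlock.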

  • Predictive Switching in 2-D Torus Routers

    Publication Year: 2006 , Page(s): 65 - 72
    Cited by:  Papers (3)
    PDF (309 KB) | HTML

    This paper proposes predictive switching in 2-D torus routers to reduce the number of pipeline stages for low-latency communication. By exploiting the communication regularity of parallel applications, a dynamic prediction mechanism presets packet traversal paths inside the router before packets arrive. Hence, the pipeline stages of routing computation, virtual-channel allocation, and switch allocation can be bypassed when the prediction hits. We consider the predictor architecture and accuracy for several traffic patterns in the NAS parallel benchmarks. Our experiments show that a sampled pattern matching (SPM) predictor achieves prediction hit rates of 77% to 96% when the dimension-order routing algorithm is used. We also discuss a method to improve the prediction accuracy of SPM by examining the frequency of occurrence of the prediction values in the communication history.
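The flavor of such history-based prediction can be shown with a toy pattern-matching predictor. This is a simplification in the spirit of SPM, not the paper's algorithm: find past occurrences of the most recent pattern of output ports and predict the port that most often followed it.

```python
from collections import Counter

# Toy history-based predictor: match the most recent pattern_len-long pattern
# against the history and predict its most frequent follower.
# A simplified illustration only, not the SPM predictor from the paper.
def predict_next(history, pattern_len=2):
    if len(history) <= pattern_len:
        return None                       # not enough history yet
    pattern = tuple(history[-pattern_len:])
    followers = Counter(
        history[i + pattern_len]
        for i in range(len(history) - pattern_len)
        if tuple(history[i:i + pattern_len]) == pattern
    )
    return followers.most_common(1)[0][0] if followers else None

# Regular traffic (output ports 1, 2, 3 repeating) is predicted correctly:
nxt = predict_next([1, 2, 3, 1, 2, 3, 1, 2])
```

Counting follower frequencies, as in the last line of `predict_next`, corresponds loosely to the accuracy improvement the authors discuss: weighting prediction values by their frequency of occurrence in the history.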

  • Hardware Support for MPI in DIMMnet-2 Network Interface

    Publication Year: 2006 , Page(s): 73 - 82
    Cited by:  Papers (2)  |  Patents (1)
    PDF (303 KB) | HTML

    In this paper, hardware support for MPI on the DIMMnet-2 network interface, which plugs into a DDR DIMM slot, is presented. This hardware support realizes an efficient eager protocol and efficient derived-datatype communication for MPI. As a preliminary evaluation, results on a real prototype concerning the bandwidth of the elements constituting MPI are shown. IPUSH, a remote indirect write, showed almost the same performance as RDMA, a remote direct write, while sharply reducing the memory space required for receive buffers; this reduction grows on systems with more nodes. Compared with a method that issues the burst vector load many times, VLS, which performs vector loads at regular intervals, sharply accelerated access to data arranged at regular intervals. These results indicate that the proposed hardware support is a promising way to improve the speed of MPI.
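The regular-interval access pattern that VLS accelerates can be sketched behaviorally (the function name and software framing are illustrative; the real mechanism is a hardware vector load on the network interface, issued once instead of many separate burst loads):

```python
# Behavioral sketch of a regular-interval (strided) vector load: gather
# `count` elements spaced `stride` apart starting at `base`. In DIMMnet-2
# this is one hardware operation rather than `count` separate burst loads.
def vector_load_strided(memory, base, stride, count):
    return [memory[base + i * stride] for i in range(count)]

# Example: reading one column of an 8-wide row-major matrix flattened into
# memory -- exactly the pattern MPI derived datatypes often describe:
mem = list(range(64))
column = vector_load_strided(mem, 4, 8, 4)
```

Strided patterns like this arise directly from MPI derived datatypes (e.g. `MPI_Type_vector`), which is why accelerating them in the network interface helps derived-datatype communication.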

  • Compilation for Delay Impact Minimization in VLIW Embedded Systems

    Publication Year: 2006 , Page(s): 83 - 90
    Cited by:  Papers (1)
    PDF (364 KB) | HTML

    Tomorrow's embedded devices need to run high-resolution multimedia and to support multi-standard wireless systems, which demand enormous computational capability under very low energy consumption and very tight performance constraints. In this context, the register file is one of the key sources of power consumption and a potential performance bottleneck, and its inappropriate design and management can severely affect the performance of the system. In this paper, we present a new compilation approach to mitigate the performance implications of technology variation in the shared register file of upcoming embedded VLIW architectures with several processing units. The compilation approach is based on a redefined register assignment policy and a set of architectural modifications to this device. Experimental results show up to a 67% performance improvement with our technique.

  • Real-Time Operating System Kernel for Multithreaded Processor

    Publication Year: 2006 , Page(s): 91 - 100
    PDF (403 KB) | HTML

    In embedded system development, multithreaded processors are used for further performance improvement to satisfy large-scale and sophisticated applications. PRESTOR-1, a multithreaded processor we developed, has a mechanism, the processor context buffer (PCB), that accommodates thread contexts spilled from the built-in context slots. Threads/tasks located in the PCB are controlled and swapped with the built-in active contexts entirely by hardware, so the performance of a system with many threads/tasks can be enhanced. Our RTOS kernel, which is compatible with the ITRON specification, is extended to utilize the PRESTOR-1 multithreaded architecture, including the PCB mechanism and several extended instructions. Evaluation showed that execution with the PCB achieves higher performance than single-threaded execution or multithreaded execution without the PCB, in spite of more cache misses.
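The spill/swap behavior between built-in context slots and a larger context buffer can be sketched as follows. The class structure, slot/buffer sizes matching the abstract (8 active contexts, 32 buffered, 40 total), and the lower-number-is-higher-priority convention are illustrative assumptions, not PRESTOR-1's actual design:

```python
# Toy model of hardware thread-context management: 8 built-in slots hold the
# highest-priority contexts; up to 32 more spill into a processor context
# buffer (PCB), for 40 concurrently managed threads. Illustrative only.
class ContextManager:
    SLOTS, PCB_SIZE = 8, 32

    def __init__(self):
        self.active = {}  # thread id -> priority, in built-in context slots
        self.pcb = {}     # thread id -> priority, spilled to the PCB

    def activate(self, tid, prio):
        """Admit a thread (lower prio number = higher priority)."""
        if len(self.active) < self.SLOTS:
            self.active[tid] = prio            # free slot available
        elif len(self.pcb) < self.PCB_SIZE:
            victim = max(self.active, key=self.active.get)
            if prio < self.active[victim]:     # newcomer outranks the victim
                self.pcb[victim] = self.active.pop(victim)
                self.active[tid] = prio
            else:                              # newcomer waits in the PCB
                self.pcb[tid] = prio

cm = ContextManager()
for t in range(8):
    cm.activate(t, prio=10 + t)
cm.activate(99, prio=1)   # highest priority: spills thread 7 to the PCB
```

In hardware, as the abstract notes, this spilling and swapping happens without RTOS intervention; the kernel only assigns priorities.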

  • Author index

    Publication Year: 2006 , Page(s): 101
    PDF (25 KB) | Freely Available from IEEE