Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture

22-25 Jan. 1995

  • Memory access reordering in vector processors

    Publication Year: 1995, Page(s):380 - 389
    Cited by:  Papers (3)

    Interference among multiple vector streams that access memory concurrently is the major source of performance degradation in main memory of pipelined vector processors. While totally eliminating interference appears to be impossible, little is known on how to design a memory system that can reduce it. In this paper, we introduce a concept called memory access reordering for reducing interference. ...
  • Program balance and its impact on high performance RISC architectures

    Publication Year: 1995, Page(s):370 - 379
    Cited by:  Papers (11)

    Information on the behavior of programs is essential for deciding the number and nature of functional units in high performance architectures. In this paper, we present studies on the balance of access and computation tasks on a typical RISC architecture, the MIPS. The MIPS programs are analyzed to find the demands they place on the memory system and the floating point or integer computation units...
  • Optimizing instruction cache performance for operating system intensive workloads

    Publication Year: 1995, Page(s):360 - 369
    Cited by:  Papers (24)  |  Patents (2)

    High instruction cache hit rates are key to high performance. One known technique to improve the hit rate of caches is to use an optimizing compiler to minimize cache interference via an improved layout of the code. This technique, however, has been applied to application code only, even though there is evidence that the operating system often uses the cache heavily and with less uniform patterns ...
  • Architectural support for inter-stream communication in a MSIMD system

    Publication Year: 1995, Page(s):348 - 357
    Cited by:  Papers (1)  |  Patents (5)

    This paper considers hardware support for the exploitation of control parallelism on data parallel architectures. It is well known that data parallel algorithms may also possess control parallel structure. However, the splitting of control leads to data dependency and synchronization issues that were implicitly handled in conventional SIMD architectures. These include synchronization of access to s...
  • Massively parallel array processor for logic, fault, and design error simulation

    Publication Year: 1995, Page(s):340 - 347
    Cited by:  Papers (2)

    Digital logic, fault, and error simulation of large VLSI circuits is one of the most compute-intensive tasks in digital systems analysis. This paper describes a massively parallel special purpose array processor, or hardware accelerator, for digital logic, fault, and error simulation. Hardware simulation is a viable approach for simulation of large systems, since simulation time increases rapidly ...
  • A VLSI architecture for computing the tree-to-tree distance

    Publication Year: 1995, Page(s):330 - 339
    Cited by:  Papers (1)

    The distance between two labeled ordered trees, α and β, is the minimum cost sequence of editing operations (insertions, deletions and substitutions) needed to transform α into β such that the predecessor-descendant relation between nodes and the ordering of nodes is not changed. Approximate tree matching has applications in genetic sequence comparison, scene analysis,...
  • The effects of STEF in finely parallel multithreaded processors

    Publication Year: 1995, Page(s):318 - 325
    Cited by:  Patents (7)

    The throughput of a multiple-pipelined processor suffers due to lack of sufficient instructions to make multiple pipelines busy and due to delays associated with pipeline dependencies. Finely Parallel Multithreaded Processor (FPMP) architectures try to solve these problems by dispatching multiple instructions from multiple instruction threads in parallel. This paper proposes an analytic model whic...
  • Fine-grain multi-thread processor architecture for massively parallel processing

    Publication Year: 1995, Page(s):308 - 317
    Cited by:  Papers (6)  |  Patents (3)

    Latency, caused by remote memory access and remote procedure call, is one of the most serious problems in massively parallel computers. In order to eliminate the processors' idle time caused by these latencies, processors must perform fast context switching among fine-grain concurrent processes. In this paper, we propose a processor architecture, called Datarol-II, that promotes efficient fine-gra...
  • Design and performance evaluation of a multithreaded architecture

    Publication Year: 1995, Page(s):298 - 307
    Cited by:  Papers (10)  |  Patents (4)

    Multithreaded architectures have the ability to tolerate long memory latencies and unpredictable synchronization delays. We propose a multithreaded architecture that is capable of exploiting both coarse-grain parallelism, and fine-grain instruction level parallelism in a program. Instruction-level parallelism is exploited by grouping instructions from a number of active threads at runtime. The arc...
  • Software cache coherence for large scale multiprocessors

    Publication Year: 1995, Page(s):286 - 295
    Cited by:  Papers (11)  |  Patents (3)

    Shared memory is an appealing abstraction for parallel programming. It must be implemented with caches in order to perform well, however, and caches require a coherence mechanism to ensure that processors reference current data. Hardware coherence mechanisms for large-scale machines are complex and costly, but existing software mechanisms for message-passing machines have not provided a performance...
  • An argument for simple COMA

    Publication Year: 1995, Page(s):276 - 285
    Cited by:  Papers (38)  |  Patents (5)

    We present design details and some initial performance results of a novel scalable shared memory multiprocessor architecture. This architecture features the automatic data migration and replication capabilities of cache-only memory architecture (COMA) machines, without the accompanying hardware complexity. A software layer manages cache space allocation at a page granularity, similarly to distribut...
  • Two techniques for improving performance on bus-based multiprocessors

    Publication Year: 1995, Page(s):264 - 275
    Cited by:  Papers (11)  |  Patents (3)

    We explore two techniques for reducing memory latency in bus-based multiprocessors. The first one, designed for sector caches, is a snoopy cache coherence protocol that uses a large transfer block to take advantage of spatial locality, while using a small coherence block (called a subblock) to avoid false sharing. The second technique is read snarfing (or read broadcasting), in which all caches ca...
  • Access ordering and memory-conscious cache utilization

    Publication Year: 1995, Page(s):253 - 262
    Cited by:  Papers (32)  |  Patents (1)

    As processor speeds increase relative to memory speeds, memory bandwidth is rapidly becoming the limiting performance factor for many applications. Several approaches to bridging this performance gap have been suggested. This paper examines one approach, access ordering, and pushes its limits to determine bounds on memory performance. We present several access-ordering schemes, and compare their ...
  • U-cache: a cost-effective solution to synonym problem

    Publication Year: 1995, Page(s):243 - 252
    Cited by:  Papers (2)

    This paper proposes a cost-effective solution to the synonym problem. In this proposed solution, a minimal hardware addition guarantees the correctness whereas the software counterpart helps improve the performance. The key to this proposed solution is an addition of a small physically-indexed cache called U-cache. The U-cache maintains the reverse translation information of the cache blocks that ...
  • Improving performance by cache driven memory management

    Publication Year: 1995, Page(s):234 - 242
    Cited by:  Papers (1)  |  Patents (3)

    The efficient utilization of caches is crucial for a competitive memory hierarchy. Access times required by modern processors are continuously decreasing. Direct mapped caches provide the shortest access time. Using them yields reduced hardware costs and fast memory access but implies additional misses in the cache, resulting in performance degradation. Another source of conflicts is the addressin...
  • Implementation of atomic primitives on distributed shared memory multiprocessors

    Publication Year: 1995, Page(s):222 - 231
    Cited by:  Papers (7)  |  Patents (3)

    In this paper we consider several hardware implementations of the general-purpose atomic primitives fetch and Φ, compare and swap, load linked, and store conditional on large-scale shared-memory multiprocessors. These primitives have proven popular on small-scale bus-based machines, but have yet to become widely available on large-scale, distributed shared memory machines. We propose seve...
  • Thread prioritization: a thread scheduling mechanism for multiple-context parallel processors

    Publication Year: 1995, Page(s):210 - 221
    Cited by:  Papers (2)  |  Patents (29)

    Multiple-context processors provide register resources that allow rapid context switching between several threads as a means of tolerating long communication and synchronization latencies. When scheduling threads on such a processor, we must first decide which threads should have their state loaded into the multiple contexts, and second, which loaded thread is to execute instructions at any given ...
  • Fast barrier synchronization in wormhole k-ary n-cube networks with multidestination worms

    Publication Year: 1995, Page(s):200 - 209
    Cited by:  Papers (20)

    This paper presents a new approach to implement fast barrier synchronization in wormhole k-ary n-cubes. The novelty lies in using multidestination messages instead of the traditional single destination messages. Two different multidestination worm types, gather and broadcasting, are introduced to implement the report and wake-up phases of barrier synchronization, respectively. Algorithms for compl...
  • Simulation study of cached RAID5 designs

    Publication Year: 1995, Page(s):186 - 197
    Cited by:  Papers (10)  |  Patents (1)

    This paper considers the performance of cached RAID5 using simulations that are driven by database I/O traces collected at customer sites. This is in contrast to previous performance studies using analytical modelling or random-number simulations. We studied issues of cache size, disk buffering, cache replacement policies, cache allocation policies, destage policies and striping. Our results indic...
  • An initial evaluation of the Convex SPP-1000 for earth and space science applications

    Publication Year: 1995, Page(s):176 - 185
    Cited by:  Papers (4)

    The Convex SPP-1000, the most recent SPC, is distinguished by a true global shared memory capability based on the first commercial version of directory based cache coherence mechanisms and SCI protocol. The system was evaluated at NASA/GSFC in the Beta-test environment using three classes of operational experiments targeting earth and space science applications. A multiple program workload tested ...
  • Modeling virtual channel flow control in hypercubes

    Publication Year: 1995, Page(s):166 - 175
    Cited by:  Papers (2)

    An analytical model for virtual channel flow control in n-dimensional hypercubes using the e-cube routing algorithm is developed. The model is based on determining the values of the different components that make up the average message latency. These components include the message transfer time, the blocking delay at each dimension, the multiplexing delay at each dimension, and the waiting delay a...
  • Software assistance for data caches

    Publication Year: 1995, Page(s):154 - 163
    Cited by:  Papers (6)  |  Patents (2)

    Hardware and software cache optimizations are active fields of research that have yielded powerful but occasionally complex designs and algorithms. The purpose of this paper is to investigate the performance of combined though simple software and hardware optimizations. Because current caches provide little flexibility for exploiting temporal and spatial locality, two hardware modifications are p...
  • A design framework for hybrid-access caches

    Publication Year: 1995, Page(s):144 - 153
    Cited by:  Papers (9)  |  Patents (4)

    High-speed microprocessors need fast on-chip caches in order to keep busy. Direct-mapped caches have better access times than set-associative caches, but poorer miss rates. This has led to several hybrid on-chip caches combining the speed of direct-mapped caches with the hit rates of associative caches. In this paper, we unify these hybrids within a single framework which we call the hybrid access...
  • DASC cache

    Publication Year: 1995, Page(s):134 - 143
    Cited by:  Papers (7)  |  Patents (2)

    For many microprocessors, cache hit time determines the clock cycle. On the other hand, cache miss penalty (measured in instruction issue delays) becomes higher and higher. Conciliating low cache miss ratio with low cache hit time is an important issue. When caches are virtually indexed, the operating system (or some specific hardware) has to manage data consistency of caches and memory. Unfortunat...
  • Fault-tolerant adaptive routing for two-dimensional meshes

    Publication Year: 1995, Page(s):122 - 131
    Cited by:  Papers (26)  |  Patents (4)

    Many massively parallel computers in use today utilize simple deterministic XY wormhole routing to transmit messages between nodes. Because XY routing does not provide any routing adaptability, it lacks the ability to avoid congested links, as well as faults. Therefore, the focus of this paper will be two-fold: improving the performance of wormhole routing and providing fault tolerance for up to N...