# [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture

• ### Proceedings. The 17th Annual International Symposium on Computer Architecture (Cat. No.90CH2887-8)

Publication Year: 1990
• ### The performance impact of block sizes and fetch strategies

Publication Year: 1990, Page(s):160 - 169
The interactions between a cache's block size, fetch size, and fetch policy from the perspective of maximizing system-level performance are explored. It has been previously noted that, given a simple fetch strategy, the performance-optimal block size is almost always four or eight words. If there is even a small cycle time penalty associated with either longer blocks or fetches, then the performan...

• ### Virtual-channel flow control

Publication Year: 1990, Page(s):60 - 68
Network throughput can be increased by dividing the buffer storage associated with each network channel into several virtual channels. Each physical channel is associated with several small queues, virtual channels, rather than a single deep queue. The virtual channels associated with one physical channel are allocated independently but compete with each other for physical bandwidth. Virtual chann...

• ### The directory-based cache coherence protocol for the DASH multiprocessor

Publication Year: 1990, Page(s):148 - 159
DASH is a scalable shared-memory multiprocessor whose architecture consists of powerful processing nodes, each with a portion of the shared memory, connected to a scalable interconnection network. A key feature of DASH is its distributed directory-based cache coherence protocol. Unlike traditional snoopy coherence protocols, the DASH protocol does not rely on broadcast; instead it uses point-to-po...

• ### A new approach to fast control of r²×r² 3-stage Benes networks of r×r crossbar switches

Publication Year: 1990, Page(s):50 - 59
The authors introduce an approach to fast control of N×N three-stage Benes networks of r×r crossbar switches as building blocks. The approach consists of setting the leftmost column of switches to an appropriately chosen configuration so that the network becomes self-routed while still able to realize a given family of permutations. This approach req...

• ### An empirical evaluation of two memory-efficient directory methods

Publication Year: 1990, Page(s):138 - 147
The authors present an empirical evaluation of two memory-efficient directory methods for maintaining coherent caches in large shared-memory multiprocessors. Both directory methods are modifications of a scheme proposed by L.M. Censier and P. Feautrier (1978) that does not rely on a specific interconnection network and can be readily distributed across interleaved main memory. The schemes consider...

• ### Dynamic processor allocation in hypercube computers

Publication Year: 1990, Page(s):40 - 49
Recognizing various subcubes in a hypercube computer fully and efficiently is nontrivial because of the specific structure of the hypercube. The authors propose a method that has much less complexity than the multiple-GC strategy in generating the search space, while achieving complete subcube recognition. This method is referred to as a dynamic processor allocation scheme because the search space...

• ### Adaptive software cache management for distributed shared memory architectures

Publication Year: 1990, Page(s):125 - 134
An adaptive cache coherence mechanism exploits semantic information about the expected or observed access behavior of particular data objects. The authors contend that, in distributed shared-memory systems, adaptive cache coherence mechanisms will outperform static cache coherence mechanisms. They have examined the sharing and synchronization behavior of a variety of shared-memory parallel program...

• ### Synchronization with multiprocessor caches

Publication Year: 1990, Page(s):27 - 37
A new lock-based cache scheme that incorporates synchronization into the cache coherency mechanism is presented. With this scheme, high-level synchronization primitives, as well as low-level ones, can be implemented without excessive overhead. Cost functions for well-known synchronization methods are derived for invalidation schemes, write update schemes, and the authors' lock-based scheme. To pre...

• ### PLUS: a distributed shared-memory system

Publication Year: 1990, Page(s):115 - 124
PLUS is a multiprocessor architecture tailored to the fast execution of a single multithreaded process; its goal is to accelerate the execution of CPU-bound applications. PLUS supports shared memory and efficient synchronization. Memory access latency is reduced by nondemand replication of pages with hardware-supported coherence between replicated pages. The architecture has been simulated in deta...

• ### Memory consistency and event ordering in scalable shared-memory multiprocessors

Publication Year: 1990, Page(s):15 - 26
A new model of memory consistency, called release consistency, that allows for more buffering and pipelining than previously proposed models is introduced. A framework for classifying shared accesses and reasoning about event ordering is developed. The release consistency model is shown to be equivalent to the sequential consistency model for parallel programs with sufficient synchronization. Poss...

• ### An investigation of static versus dynamic scheduling

Publication Year: 1990, Page(s):192 - 201
Two techniques for instruction scheduling, dynamic and static scheduling, are investigated. A decoupled access execute architecture consists of an execution unit and a memory unit with separate program counters and separate instruction memories. The very long instruction word (VLIW) architecture has only one program counter and relies on the compiler to perform static scheduling of multiple units....

• ### The K2 parallel processor: architecture and hardware implementation

Publication Year: 1990, Page(s):92 - 101
K2 is a distributed-memory parallel processor designed to support a multiuser, multitasking, time-sharing operating system and an automatically parallelizing Fortran compiler. The architecture and the hardware implementation of K2 are presented. The authors focus on the architectural features required by the operating system and the compiler. A prototype machine with 24 processors is currently bei...

• ### Weak ordering - a new definition

Publication Year: 1990, Page(s):2 - 14
A memory model for a shared-memory multiprocessor commonly and often implicitly assumed by programmers is that of sequential consistency, which guarantees that all memory accesses will appear to execute atomically and in program order. An alternative model, weak ordering, offers greater performance potential. The central hypothesis of this work is that programmers prefer to reason about sequential...

• ### Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

Publication Year: 1990, Page(s):364 - 373
Hardware techniques for improving the performance of caches are presented. Miss caching places a small, fully associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a 1-cycle miss penalty. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching...
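
As a rough illustration of the miss-caching idea summarized above, the following Python sketch models a direct-mapped cache backed by a small fully associative miss cache. The class name, the LRU replacement policy, and the `"hit"` / `"short_miss"` / `"long_miss"` labels are assumptions made for this sketch and are not taken from the paper; timing is not modeled, and a `"short_miss"` only stands in for the 1-cycle miss penalty the abstract mentions.

```python
from collections import OrderedDict


class DirectMappedWithMissCache:
    """Toy model: direct-mapped cache plus a small fully associative miss cache.

    Hypothetical sketch; LRU replacement in the miss cache is an assumption.
    """

    def __init__(self, num_sets, miss_entries):
        self.num_sets = num_sets
        self.lines = [None] * num_sets      # one block address per set
        self.miss_entries = miss_entries
        self.miss_cache = OrderedDict()     # block address -> True, in LRU order

    def access(self, block):
        idx = block % self.num_sets
        if self.lines[idx] == block:
            return "hit"
        # Miss in the direct-mapped cache: probe the miss cache.
        found = block in self.miss_cache
        if found:
            self.miss_cache.move_to_end(block)       # refresh LRU position
        else:
            # Fetched block is placed in the miss cache as well.
            self.miss_cache[block] = True
            if len(self.miss_cache) > self.miss_entries:
                self.miss_cache.popitem(last=False)  # evict least recently used
        self.lines[idx] = block                      # refill the direct-mapped line
        return "short_miss" if found else "long_miss"


if __name__ == "__main__":
    cache = DirectMappedWithMissCache(num_sets=4, miss_entries=2)
    # Blocks 0 and 4 map to the same set and would otherwise thrash.
    print([cache.access(b) for b in (0, 4, 0, 4)])
```

In this toy model, two blocks that conflict in the same set stop paying the full miss penalty once both reside in the miss cache, which is the kind of mapping conflict miss the abstract describes.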

• ### A distributed I/O architecture for HARTS

Publication Year: 1990, Page(s):332 - 342
The issue of I/O device access in HARTS (Hexagonal Architecture for Real-Time Systems), a distributed real-time computer system under construction at the University of Michigan, is explicitly addressed. Several candidate solutions are introduced, explored and evaluated according to cost, complexity, reliability, and performance: (1) 'node-direct' distribution with the intranode bus and a local I/O b...

• ### Balance in architectural design

Publication Year: 1990, Page(s):302 - 310
A performance metric, normalized time, which is closely related to such measures as the area-time product of VLSI theory and the price/performance ratio of advertising literature, is introduced. This metric captures the idea of a piece of hardware 'pulling its own weight', that is, contributing as much to performance as it costs in resources. The authors prove general theorems for stating when the ...

• ### Reducing the cost of branches by using registers

Publication Year: 1990, Page(s):182 - 191
In an attempt to reduce the number of operand memory references, many RISC (reduced-instruction-set-computer) machines have 32 or more general-purpose registers (e.g., MIPS, ARM, Spectrum, 88K). Without special compiler optimizations, such as inlining or interprocedural register allocation, it is rare that a compiler will use a majority of these registers for a function. The authors explore the po...

• ### Monsoon: an explicit token-store architecture

Publication Year: 1990, Page(s):82 - 91
Data-flow architectures tolerate long unpredictable communication delays and support generation and coordination of parallel activities directly in hardware, instead of assuming that program mapping will cause these issues to disappear. However, the proposed mechanisms are complex and introduce new mapping complications. A greatly simplified approach to data-flow execution, called the explicit tok...

• ### APRIL: a processor architecture for multiprocessing

Publication Year: 1990, Page(s):104 - 114
The architecture of a rapid-context-switching processor called APRIL, with support for fine-grain threads and synchronization, is described. APRIL achieves high single-thread performance and supports virtual dynamic threads. A commercial reduced-instruction-set-computer (RISC) based implementation of APRIL and a run-time software system that can switch contexts in about 10 cycles are described. M...

• ### The TLB slice - a low-cost high-speed address translation mechanism

Publication Year: 1990, Page(s):355 - 363
The MIPS R6000 microprocessor relies on a new type of translation lookaside buffer, called a TLB slice, which is less than one-tenth the size of a conventional TLB and as fast as one multiplexer delay, yet has a high enough hit rate to be practical. The fast translation makes it possible to use a physical cache without adding a translation stage to the processor's pipeline. The small size makes it...

• ### Maximizing performance in a striped disk array

Publication Year: 1990, Page(s):322 - 331
Improvements in disk speeds have not kept up with improvements in processor and memory speeds. One way to correct the resulting speed mismatch is to stripe data across many disks. The authors address how to stripe data to get maximum performance from the disks. Specifically, they examine how to choose the striping unit, that is, the amount of logically contiguous data on each disk. Rules for deter...
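
To make the notion of a striping unit concrete, the short Python sketch below maps a logical byte offset to a disk and an offset on that disk under simple round-robin striping. The function name `locate`, the round-robin layout, and the example sizes are assumptions for illustration; the paper's own rules for choosing the striping unit are not reproduced here.

```python
def locate(logical_offset, striping_unit, num_disks):
    """Map a logical byte offset to (disk index, byte offset on that disk).

    Assumes plain round-robin striping: consecutive striping units of
    `striping_unit` bytes go to consecutive disks, wrapping around.
    """
    unit_index, within_unit = divmod(logical_offset, striping_unit)
    row, disk = divmod(unit_index, num_disks)
    return disk, row * striping_unit + within_unit


if __name__ == "__main__":
    # Hypothetical array: 4 disks, 4096-byte striping unit.
    for offset in (0, 4096, 4 * 4096 + 10):
        print(offset, locate(offset, 4096, 4))
```

A smaller striping unit spreads a single request over more disks, while a larger one keeps each request on fewer disks and leaves the others free for concurrent requests; that trade-off is what the choice of striping unit controls.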

• ### Performance of an OLTP application on Symmetry multiprocessor system

Publication Year: 1990, Page(s):228 - 238
Sequent's Symmetry series is a bus-based shared-memory multiprocessor. System performance in an OLTP (online transaction processing) relational database application was investigated using the TP1 benchmark. System performance was tested with fully cached benchmarks and with scaled benchmarks. In fully cached tests, the entire database fits inside main memory. In scaled tests, the database is large...

• ### Performance measurement and trace driven simulation of parallel CAD and numeric applications on a hypercube multicomputer

Publication Year: 1990, Page(s):260 - 269
The performance evaluation, workload characterization, and trace-driven simulation of a hypercube multicomputer running realistic workloads are presented. Six representative parallel applications were selected as benchmarks. Software monitoring techniques were then used to collect execution traces. On the basis of the measurement results, the authors investigated both the computation and communica...

• ### Architectural support for the management of tightly-coupled fine-grain goals in Flat Concurrent Prolog

Publication Year: 1990, Page(s):292 - 301
Architectural support is proposed for goal management as part of a special-purpose processor architecture for the efficient execution of Flat Concurrent Prolog. Goal management operations, namely, halt, spawn, suspend, and commit, are decoupled from goal reduction and overlapped in the goal management unit. Their efficient execution is enabled using a goal cache. The authors evaluate the performan...