
15th Annual International Symposium on Computer Architecture, 1988: Conference Proceedings

Date: May 30 - June 2, 1988


Displaying Results 1 - 25 of 51
  • 15th Annual International Symposium on Computer Architecture. Conference Proceedings (Cat. No.88CH2545-2)

  • Analysis of bus hierarchies for multiprocessors

    Page(s): 100 - 107

    To build large shared-memory multiprocessor systems that take advantage of current hardware-enforced cache coherence protocols, an interconnection network is needed that acts logically as a single bus while avoiding the electrical loading problems of a large bus. Models of bus delay and bus throughput are developed to aid in optimizing the design of such a network. These models are used to derive a method for determining the maximum number of processors that can be supported by each of several bus organizations, including conventional single-level buses, two-level bus hierarchies, and binary tree interconnections. An example based on a TTL bus is presented to illustrate the methods and to show that shared-memory multiprocessors with several dozen processors are feasible using a simple two-level bus hierarchy.

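
The bus-delay modeling described above can be illustrated with a toy calculation. This is a sketch only: the linear delay model, the constants `t0` and `t_load`, and the brute-force search are illustrative assumptions, not the paper's TTL analysis.

```python
def bus_delay(loads, t0=10.0, t_load=1.5):
    """Toy electrical model: base propagation delay plus a fixed
    penalty per attached load, in nanoseconds (made-up constants)."""
    return t0 + t_load * loads

def max_processors_single_bus(budget_ns):
    """Largest processor count whose single-bus delay fits the cycle budget."""
    n = 0
    while bus_delay(n + 1) <= budget_ns:
        n += 1
    return n

def max_processors_two_level(budget_ns):
    """Two-level hierarchy: c cluster buses of k processors each feed a
    top-level bus; a transaction crosses both levels, so both delays
    must fit within the budget."""
    best = 0
    for k in range(1, 100):
        for c in range(1, 100):
            if bus_delay(k) + bus_delay(c) <= budget_ns:
                best = max(best, k * c)
    return best
```

With a 60 ns budget the single bus tops out at a few dozen processors while the two-level hierarchy supports far more, echoing the abstract's conclusion that a simple two-level hierarchy makes machines with several dozen processors feasible.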
  • Toward a dataflow/von Neumann hybrid architecture

    Page(s): 131 - 140

    Dataflow architectures offer the ability to trade program-level parallelism for machine-level latency. Dataflow further offers a uniform synchronization paradigm, representing one end of a spectrum wherein the unit of scheduling is a single instruction. At the opposite extreme are the von Neumann architectures, which schedule on a task, or process, basis. The spectrum is examined and an architecture which is a hybrid of dataflow and von Neumann organizations is proposed. The analysis attempts to discover those features of the dataflow architecture, lacking in a von Neumann machine, which are essential for tolerating latency and synchronization costs. These features are captured in the concept of a parallel machine language which can be grafted on top of an otherwise traditional von Neumann base. In such an architecture, the units of scheduling, called scheduling quanta, are bound at compile time rather than at instruction-set design time. The parallel machine language supports this notion using a large synchronization name space. A prototypical architecture is described, and results of simulation studies are presented. A comparison is made between the MIT tagged-token dataflow machine and the subject machine, which presents a model for understanding the cost of synchronization in a parallel environment.

  • Parallel architecture for OPS5

    Page(s): 452 - 457

    An architecture that captures some of the inherent parallelism of the OPS5 expert system language has been designed and implemented at Oak Ridge National Laboratory. A central feature of this architecture is a network bus over which a single host processor broadcasts messages to a set of parallel-rule processors. This transmit-only bus is implemented by a memory-mapped scheme which permits the rule processors to be decoded in parallel. All OPS5 rule-matching processes, and most of the processes associated with conflict resolution, are executed by the parallel-rule processors. The host performs the tasks associated with the firing of a rule selected by the conflict resolution process. Performance data are presented for the prototype system, which comprises a host processor and 64 parallel-rule processors, each embodying a Motorola MC68000 microprocessor and 512 kbytes of unshared memory.

  • Critical issues in mapping neural networks on message-passing multicomputers

    Page(s): 3 - 11

    The architectural requirements for efficiently simulating large neural networks on a multicomputer system with thousands of fine-grained processors and distributed memory are investigated. Models for characterizing the structure of a neural network and the function of individual cells are developed. These models provide guidelines for efficiently mapping the network onto multicomputer technologies such as the hypercube, hypernet, and torus. They are further used to estimate the amount of interprocessor communication bandwidth required, and the number of processors needed to meet a particular cost/performance goal. Design issues such as memory organization and the effect of VLSI technology are also considered.

  • The Wisconsin Multicube: a new large-scale cache-coherent multiprocessor

    Page(s): 422 - 431

    The Wisconsin Multicube, a large-scale, shared-memory multiprocessor architecture that uses a snooping cache protocol over a grid of buses, is introduced. The authors describe its cache coherence protocol and discuss efficient synchronization primitives. Then they discuss a number of other important design issues and modeling results. They introduce the general Multicube topology and discuss the scalability of the Wisconsin Multicube. A formal description of the cache consistency protocol is also given.

  • Cache performance of vector processors

    Page(s): 261 - 268

    An instruction-level simulator for the IBM 3090 with VF (vector facility) has been developed for studying the performance of vector processors and their memory hierarchies. Results of a study of the locality of several large scientific applications are presented. The cache miss ratios of vectorized applications are found to be almost equal to those of their original scalar executions. Moreover, both the spatial and temporal locality of these applications (in scalar and vector executions) are strong enough to show a sufficiently high hit ratio on conventional cache structures.

  • Trade-offs between devices and paths in achieving disk interleaving

    Page(s): 196 - 201

    Four alternative implementations for achieving higher data rates in a disk subsystem (parallel heads without replication, parallel heads with replication, parallel actuators without replication, and parallel actuators with replication) are studied. Focus is on the tradeoffs between the number of devices and the number of data paths while keeping the number of physical devices constant (which may keep the cost roughly constant). The performance advantages and limitations of the alternative implementations are analyzed using an analytic queuing model and compared to a conventional disk subsystem. The study shows that parallel heads with replication from a single actuator performs best for average application environments, although other configurations may be more cost-effective.

  • Multinomial conjunctoid statistical learning machines

    Page(s): 12 - 17

    A statistical learning model called the multinomial conjunctoid is reviewed. Multinomial conjunctoids are based on a well-developed statistical-decision-theory framework, which guarantees that conjunctoid learning will converge to optimal states over learning trials and that learning will be fast during these trials. In addition, a prototype multinomial conjunctoid module based on CMOS VLSI technology is introduced.

  • Design and performance of special purpose hardware for Time Warp

    Page(s): 401 - 408

    A special-purpose simulation engine based on the Time Warp mechanism is proposed to attack large-scale discrete-event simulation problems. A key component of this engine is the rollback chip, a hardware component that efficiently implements state saving and rollback functions in Time Warp. The algorithms implemented by the rollback chip are described, as well as mechanisms that allow efficient implementation. Results of simulation studies are presented that show that the rollback chip can virtually eliminate the state-saving overhead that plagues current software implementations of Time Warp.

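
The state-saving and rollback functions the rollback chip implements in hardware can be sketched as a software analogue. The class and method names below are hypothetical, and the real chip manages memory marks rather than Python dictionaries:

```python
class TimeWarpState:
    """Software analogue of Time Warp state saving: snapshot the
    process state at each virtual time, restore on rollback."""
    def __init__(self, **state):
        self.vars = dict(state)
        self.log = []                    # (virtual_time, snapshot) pairs

    def save(self, virtual_time):
        # Checkpoint the current state before processing an event.
        self.log.append((virtual_time, dict(self.vars)))

    def rollback(self, virtual_time):
        # Discard checkpoints later than virtual_time and restore
        # the most recent surviving snapshot.
        while self.log and self.log[-1][0] > virtual_time:
            self.log.pop()
        if self.log:
            self.vars = dict(self.log[-1][1])
```

In software, `save` must copy the whole state every event, which is exactly the overhead the paper's hardware aims to eliminate.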
  • The architecture of a Linda coprocessor

    Page(s): 240 - 249

    The architecture of a coprocessor that supports the communication primitives of the Linda parallel-programming environment in hardware is described. The coprocessor is a critical element in the architecture of the Linda machine, a MIMD (multiple-instruction, multiple-data-stream) parallel-processing system that is designed top-down from the specifications of Linda. Communication in Linda programs takes place through a logically shared associative memory mechanism called tuple space. The Linda machine, however, has no physically shared memory. The microprogrammable coprocessor implements distributed protocols for executing tuple-space operations over the Linda machine communication network. The coprocessor has been designed and is in the process of fabrication. The projected performance of the coprocessor is discussed and compared with a software implementation of Linda.

  • Data buffer performance for sequential Prolog architectures

    Page(s): 434 - 442

    Several local data buffers are proposed and measurements are presented for variations of the Warren abstract machine (WAM) architecture for Prolog. Choice-point buffers, stack buffers, split-stack buffers, multiple-register sets, copyback caches, and smart caches are examined. Statistics collected from four benchmark programs indicate that small conventional local memories perform quite well because of the WAM's high locality. The data memory performance results are equally valid for native code and reduced instruction set implementations of Prolog.

  • A partial-multiple-bus computer structure with improved cost-effectiveness

    Page(s): 116 - 122

    The design and performance analysis of partial-multiple-bus interconnection networks is described. One such structure, called processor-oriented partial-multiple-bus (or PPMB), is proposed. It serves as an alternative to the conventional structure, called memory-oriented partial-multiple-bus (or MPMB), and is aimed at higher system performance at less or equal system cost. PPMB's structural feature, which distinguishes it from the conventional structure, is to provide every memory module with B paths to processors (where B is the total number of buses). This, in contrast to the B/g paths provided in the conventional MPMB structure (where g is the number of groups), suggests a potential for higher system bandwidth. This potential is fully realized by the suggested load-balancing arbitration mechanism, which in turn highlights the advantages of the proposed structure. As a result, it has been shown, both analytically and by simulation, that a substantial increase in system bandwidth (up to 20%) is achieved by the PPMB structure over the MPMB structure. In addition to the fact that the cost of PPMB is less than, or equal to, that of MPMB, its reliability is shown to be slightly increased.

  • Performance tradeoffs in cache design

    Page(s): 290 - 298

    A series of simulations that explore the interactions between various organizational decisions and program execution time are presented. The tradeoffs between cache size and CPU/cache cycle time, set associativity and cycle time, and block size and main-memory speed, are investigated. The results indicate that neither cycle time nor cache size dominates the other across the entire design space. For common implementation technologies, performance is maximized when the size is increased to the 32-kB to 128-kB range with modest penalties to the cycle time. If set associativity impacts the cycle time by more than a few nanoseconds, it increases overall execution time. Since the block size and memory-transfer rate combine to affect the cache miss penalty, the optimum block size is substantially smaller than that which minimizes the miss rate. The interdependence between optimal cache configuration and the main memory speed necessitates multilevel cache hierarchies for high-performance uniprocessors.

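
The cycle-time versus miss-rate tradeoff the abstract describes is commonly summarized by average memory access time. A minimal sketch with made-up numbers, not the paper's measurements:

```python
def amat(cycle_ns, miss_rate, miss_penalty_ns):
    # Average memory access time: every access pays the cache cycle
    # time, and misses additionally pay the main-memory penalty.
    return cycle_ns + miss_rate * miss_penalty_ns

# Illustrative: quadrupling the cache lowers the miss rate but
# stretches the CPU/cache cycle time.
small_fast = amat(10.0, 0.125, 80.0)    # 10 + 10  = 20.0 ns
large_slow = amat(12.0, 0.03125, 80.0)  # 12 + 2.5 = 14.5 ns
```

Here the larger, slower cache still wins; with a bigger cycle-time penalty the comparison flips, which is the paper's point that neither cycle time nor cache size dominates across the whole design space.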
  • Exploiting parallel microprocessor microarchitectures with a compiler code generator

    Page(s): 45 - 53

    Several experiments using a versatile optimizing compiler to evaluate the benefit of four forms of microarchitectural parallelism (multiple microoperations issued per cycle, multiple result-distribution buses, multiple execution units, and pipelined execution units) are described. The first 14 Livermore loops and 10 of the LINPACK subroutines are used as the preliminary benchmarks. The compiler generates optimized code for different microarchitecture configurations. It is shown how the compiler can help to derive a balanced design for high performance. For each given set of technology constraints, these experiments can be used to derive a cost-effective microarchitecture to execute each given set of workload programs at high speed.

  • Design of a concurrent computer for solving systems of linear equations

    Page(s): 204 - 211

    The systematic synthesis of a systolic array of orthogonal and hyperbolic Householder processor elements for solving large (dense) systems of linear equations is described. The design procedure allows the design of an array which is independent of the size of the problem. The design stage for full problem size arrays was executed with the CAD tool SYSTARS. A special partitioning strategy is used to handle (virtually) infinitely large problems on a fixed-size array. Moreover, the partitioning maintains a high degree of pipelining of the data. A relatively simple architecture for the processor elements of the array is also presented. As a result of the systematic approach, a complete specification of the array is obtained in terms of its interconnections, processor elements, and controller.

  • Distributed round-robin and first-come first-serve protocols and their application to multiprocessor bus arbitration

    Page(s): 269 - 277

    The round-robin (RR) protocol, which uses statically assigned arbitration numbers to resolve conflicts during an arbitration, is more robust and simpler to implement than previous distributed RR protocols that are based on rotating or aging priorities. The proposed first-come-first-served (FCFS) protocol uses partly static arbitration numbers and is the first practical proposal for an FCFS arbiter known to the authors. The proposed protocols have a better combination of efficiency, cost, and fairness characteristics than existing multiprocessor bus arbitration algorithms. Three implementations of the RR protocol, and two implementations of the FCFS protocol, are discussed. Simulation results are presented that address: (1) the practical potential for unfairness in the simpler implementation of the FCFS protocol; (2) the practical implications of the higher waiting-time variance in the RR protocol; and (3) the allocation of bus bandwidth among agents with unequal request rates in each protocol. The simulation results indicate that there is very little practical difference in the performance of the two protocols.

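
A round-robin arbiter with statically assigned agent numbers and a rotating starting point can be modeled in a few lines. This is a behavioral sketch only, not the paper's distributed hardware implementation:

```python
def rr_arbitrate(requesters, last_granted, n_agents):
    """Grant the first requesting agent found by scanning agent IDs
    cyclically, starting just after the previous grant; the IDs are
    static, and only the scan origin rotates."""
    for offset in range(1, n_agents + 1):
        agent = (last_granted + offset) % n_agents
        if agent in requesters:
            return agent
    return None                          # no agent is requesting
```

The rotating origin is what provides fairness: after agent 0 is granted, the scan resumes at agent 1, so a persistent requester cannot starve the others.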
  • Resource requirements of dataflow programs

    Page(s): 141 - 150

    Parallel execution of programs requires more resources and more complex resource management than sequential execution. If concurrent tasks can be spawned dynamically, programs may require an inordinate amount of resources when the potential parallelism in the program is much greater than the amount of parallelism the machine can utilize. Loop bounding, a technique for dynamically controlling the amount of parallelism exposed in dataflow programs, is described. The effectiveness of the technique in reducing token storage requirements is supported by experimental data in the form of parallelism profiles and waiting-token profiles. Comparisons are made throughout with more conventional approaches to parallel computing. It is shown that limiting the maximum number of concurrent iterations of loops is effective in reducing the resource requirements of typical scientific programs without sacrificing performance. The implementation of this idea is based on compiling loops into dataflow graphs with a loop-bounding parameter that can be set at run time according to some policy.

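
Loop bounding limits how many loop iterations are in flight at once. A rough analogue using threads and a semaphore, where the semaphore stands in for the dataflow machine's token-storage limit and `bound` plays the role of the run-time loop-bounding parameter:

```python
import threading

def bounded_loop(body, n_iters, bound):
    """Run all loop iterations in parallel, but allow at most `bound`
    of them to execute concurrently -- the loop-bounding idea,
    sketched with threads instead of dataflow tokens."""
    sem = threading.Semaphore(bound)
    results = [None] * n_iters

    def run(i):
        with sem:                        # wait for a free "slot"
            results[i] = body(i)

    threads = [threading.Thread(target=run, args=(i,)) for i in range(n_iters)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

All iterations still complete in order-independent fashion; only the peak resource usage (here, concurrently running bodies) is capped.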
  • A fetch-and-op implementation for parallel computers

    Page(s): 384 - 392

    A fetch-and-op circuit is described. A bit-serial circuit-switched implementation requires only five gates per node in a binary tree. This circuit is also capable of test-and-set primitives (priority circuits) and swap operators, as well as AND and OR operations used in SIMD (single-instruction, multiple-data-stream) tests such as branch on all carries set. It provides an alternative implementation for the combining fetch-and-add circuit to the one designed for the Ultracomputer project; this implementation is suited to SIMD computing and can be adapted to MIMD (multiple-instruction, multiple-data-stream) computing.

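
The combining behavior of a fetch-and-add tree can be modeled functionally: requests are summed pairwise on the way up, and on the way down each requester receives the prefix value it would have observed in some serial order. A sketch only; the paper's circuit is bit-serial hardware, and the recursive function here is a hypothetical model:

```python
def fetch_and_add_tree(base, incs):
    """Return (values, total): values[i] is the old value requester i
    observes (as if the fetch-and-adds ran serially in list order),
    and total is the single combined increment applied to memory."""
    if len(incs) == 1:
        return [base], incs[0]
    mid = len(incs) // 2
    # Up-sweep: the left subtree combines first; its sum offsets the right.
    left_vals, left_sum = fetch_and_add_tree(base, incs[:mid])
    right_vals, right_sum = fetch_and_add_tree(base + left_sum, incs[mid:])
    return left_vals + right_vals, left_sum + right_sum

vals, total = fetch_and_add_tree(100, [1, 2, 3, 4])
# vals == [100, 101, 103, 106]; memory is written once with 100 + total == 110
```

The key property, as in the Ultracomputer combining scheme, is that N simultaneous requests cost one memory update rather than N serialized ones.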
  • The VMP multiprocessor: initial experience, refinements and performance evaluation

    Page(s): 410 - 421

    VMP is an experimental multiprocessor being developed at Stanford University, suitable for high-performance workstations and server machines. Its primary novelty lies in the use of software management of the per-processor caches and the design decisions in the cache and bus that make this approach feasible. The design and some uniprocessor trace-driven simulations indicating its performance have been reported previously. Initial experience with the VMP design, based on a running prototype as well as various refinements to the design, is presented. Performance evaluation is based both on measurement of actual execution and on trace-driven simulation of multiprocessor executions from the Mach operating system.

  • On the inclusion properties for multi-level cache hierarchies

    Page(s): 73 - 80

    The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. Some necessary and sufficient conditions for imposing the inclusion property for fully-associative and set-associative caches, which allow different block sizes at different levels of the hierarchy, are given. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, and bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads to the presentation of an inclusion-coherence mechanism for two-level bus-based architectures.

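
The inclusion property itself is easy to state as a predicate: every block resident at level 1 must lie within some block resident at level 2, even when the two levels use different block sizes. A simplified checker (a hypothetical helper; the address lists stand for the byte addresses of resident blocks):

```python
def inclusion_holds(l1_addrs, l2_addrs, b1, b2):
    """True if every level-1 block (size b1 bytes, identified by any
    byte address it contains) falls inside a resident level-2 block
    (size b2 bytes, assumed a multiple of b1)."""
    assert b2 % b1 == 0, "L2 block size must be a multiple of L1's"
    l2_tags = {a // b2 for a in l2_addrs}
    return all(a // b2 in l2_tags for a in l1_addrs)
```

For example, two 64-byte L1 blocks at addresses 0 and 64 both map into the 128-byte L2 block at 0, so inclusion holds; an L1 block at 256 whose L2 parent is absent violates it, which is what an inclusion-maintaining replacement policy must prevent.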
  • Regular CC-banyan networks

    Page(s): 325 - 332

    Construction algorithms for rectangular and nonrectangular CC-banyan networks, as well as routing algorithms for double-ended and single-ended rectangular and nonrectangular CC-banyan networks, are presented. The time complexities of the routing algorithms are O(l), where l is the number of stages in the network.

  • Hyperswitch network for the hypercube computer

    Page(s): 90 - 99

    A method is presented for realizing a hyperswitch network, an interconnection network achieved using a mixture of static and dynamic topologies. Availability of fault-free paths need not be specified by a source, because the routing header can be modified in response to congestion or faults encountered as a path is established. This method can be accomplished in a static topology such as the hypercube network if the nodes have switching elements which are capable of dynamically performing the necessary routing-header revisions. Detailed simulation results show that the hyperswitch network is consistently more efficient than fixed-path routing under large message traffic conditions. The simulation results also show that the hyperswitch network has equivalent latency overhead for messages with localized and antilocal destinations (i.e., less than a 25% difference between diameters 1 and 5).

  • Flagship: a parallel architecture for declarative programming

    Page(s): 124 - 130

    The Flagship project aims to produce a computing technology based on the declarative style of programming. A major component of that technology is the design for a parallel machine that can efficiently utilize the implicit parallelism in declarative programs. The computational models that expose this implicit parallelism are described, and an architecture designed to use it is outlined. The operational issues, such as dynamic load balancing, that arise in such a system are discussed, and the mechanisms being used to evaluate the architecture are described.

  • Deadlock avoidance for systolic communication

    Page(s): 252 - 260

    The nature of the deadlock problem for the systolic model of communication is described. This problem does not exist for special-purpose systolic arrays, for which the hardware designer can afford to provide as many queues as required by the specific computation the array is intended to implement. However, for programmable systolic arrays, the number of messages crossing the interval between two adjacent cells can be arbitrarily large, depending on the program. As a result, the possibility of deadlock always exists, since the number of queues between adjacent cells is fixed. The problem of avoiding queue-induced deadlocks at run time, for deadlock-free programs, is described, and a solution to the problem is given. Schemes for consistent labeling and compatible queue assignment, for which the solution calls, are also described, as is how to take advantage of the buffering capability provided by queues.