IEEE Transactions on Computers

Issue 5 • May 1998

  • Architecture scalability of parallel vector computers with a shared memory

    Page(s): 614 - 624

    Based on a model of a parallel vector computer with a shared memory, its scalability properties are derived. The processor-memory interconnection network is assumed to be composed of crossbar switches of size b×b. This paper analyzes sustainable peak performance under optimal conditions, i.e., no memory bank conflicts, sufficient processor-memory bank pathways, and no interconnection network conflicts. It is shown that, with fully vectorizable algorithms and no communication overhead, the sustainable peak performance does not scale up linearly with the number of processors p. If the interconnection network is unbuffered, the number of memory banks must increase at least as O(p log_b p) to sustain peak performance. If the network is buffered, this bottleneck can be alleviated; however, the half-performance vector length still increases as O(log_b p). The paper confirms the validity of the model by examining the performance behavior of the LINPACK benchmark.

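    The scaling claim above can be made concrete with a small back-of-the-envelope sketch (Python, with made-up constants; the abstract gives only asymptotic orders, so the stage count and the bank bound below are illustrative assumptions, not the paper's formulas):

    ```python
    def network_stages(p: int, b: int) -> int:
        """Stages of b-by-b crossbars needed to connect p ports: ceil(log_b p)."""
        stages, capacity = 1, b
        while capacity < p:
            stages += 1
            capacity *= b
        return stages

    def min_memory_banks(p: int, b: int) -> int:
        """Assumed lower bound on banks for an unbuffered network, of order p * log_b p."""
        return p * network_stages(p, b)

    if __name__ == "__main__":
        b = 8  # crossbar switch size (hypothetical)
        for p in (8, 64, 512, 4096):
            print(f"p={p:5d}  stages={network_stages(p, b)}  banks >= {min_memory_banks(p, b)}")
    ```

    The point of the sketch is only that the required number of banks grows faster than p itself once the network needs more than one crossbar stage.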
  • An efficient solution to the cache thrashing problem caused by true data sharing

    Page(s): 527 - 543

    When parallel programs are executed on multiprocessors with private caches, a set of data may be repeatedly used and modified by different threads. Such data sharing can often result in cache thrashing, which degrades memory performance. This paper presents and evaluates a loop restructuring method to reduce or even eliminate cache thrashing caused by true data sharing in nested parallel loops. The method uses a compiler analysis that applies linear algebra and number theory to the subscript expressions of array references. Because of its simplicity, the method can be implemented efficiently in any parallel compiler. Experimental results show significant performance improvements over existing static and dynamic scheduling methods.

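    As a rough illustration of the kind of subscript analysis the abstract alludes to (not the paper's actual method), the hypothetical sketch below maps each iteration of a block-scheduled parallel loop writing A[stride*i + offset] to the cache line it touches and reports lines written by more than one thread, i.e., candidates for true-sharing thrashing:

    ```python
    def shared_lines(n_iters, n_threads, stride, offset, elem_bytes=8, line_bytes=64):
        """For a parallel loop writing A[stride*i + offset], block-scheduled over
        n_threads threads, return the cache lines written by more than one thread."""
        block = (n_iters + n_threads - 1) // n_threads   # iterations per thread
        owners = {}                                      # cache line -> set of thread ids
        for i in range(n_iters):
            tid = i // block
            line = (stride * i + offset) * elem_bytes // line_bytes
            owners.setdefault(line, set()).add(tid)
        return {line: tids for line, tids in owners.items() if len(tids) > 1}

    # 60 iterations over 4 threads with unit stride: the lines that straddle the
    # block boundaries are written by two different threads and can thrash.
    print(shared_lines(n_iters=60, n_threads=4, stride=1, offset=0))
    ```

    A restructuring method would try to choose iteration blocks (or reshape the loop nest) so that this dictionary comes out empty.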
  • A performance study of instruction cache prefetching methods

    Page(s): 497 - 508

    Prefetching methods for instruction caches are studied via trace-driven simulation. The two primary methods are “fall-through” prefetch (sometimes referred to as “one block lookahead”) and “target” prefetch. Fall-through prefetch covers sequential line accesses; a key parameter is the distance from the end of the current line at which the prefetch for the next line is initiated. Target prefetch also covers nonsequential line accesses; it uses a prediction table, and a key aspect is the prediction algorithm the table implements. Fall-through prefetch and target prefetch each improve performance significantly, and when combined in a hybrid algorithm their improvements are nearly additive. An instruction cache using a combined target and fall-through method can provide the same performance as a nonprefetching cache two to four times larger. A good prediction method must not only be accurate; prefetches must also be initiated early enough to allow time for the instructions to return from main memory. To quantify this, we define a “prefetch efficiency” measure that reflects the amount of memory fetch delay that can be successfully hidden by prefetching. The better prefetch methods (in terms of miss rate) also have very high efficiencies, hiding approximately 90 percent of the miss delay for prefetched lines. Another performance measure of interest is memory traffic. Without prefetching, large line sizes give better hit rates; with prefetching, small line sizes tend to give better overall hit rates. Because smaller line sizes tend to reduce memory traffic, the top-performing prefetch caches produce less memory traffic than the top-performing nonprefetch caches of the same size.

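    To make the fall-through (“next-line”) policy concrete, here is a toy trace-driven simulator in the spirit of the study, not the paper's simulator: a direct-mapped instruction cache in which touching the last few words of a line triggers a prefetch of its sequential successor. The trace, cache geometry, and trigger distance are invented for illustration:

    ```python
    def simulate(trace, n_lines=64, words_per_line=8, prefetch=True, trigger=2):
        """Direct-mapped I-cache with optional fall-through (next-line) prefetch.
        `trace` is a sequence of instruction word addresses; returns demand misses."""
        tags = [None] * n_lines
        misses = 0
        for addr in trace:
            line = addr // words_per_line
            idx, tag = line % n_lines, line // n_lines
            if tags[idx] != tag:          # demand miss
                misses += 1
                tags[idx] = tag
            # Fall-through prefetch: within `trigger` words of the end of the
            # current line, start fetching the sequential successor line.
            if prefetch and addr % words_per_line >= words_per_line - trigger:
                nxt = line + 1
                tags[nxt % n_lines] = nxt // n_lines
        return misses

    # A made-up, mostly sequential instruction trace (overlapping straight-line runs).
    trace = [a for start in range(0, 4096, 256) for a in range(start, start + 320)]
    print("misses without prefetch:", simulate(trace, prefetch=False))
    print("misses with prefetch:   ", simulate(trace, prefetch=True))
    ```

    Timing is ignored here, so prefetched lines are always ready by the time they are used; the “prefetch efficiency” measure in the abstract exists precisely to quantify how much of that delay is actually hidden.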
  • Double step branching CORDIC: a new algorithm for fast sine and cosine generation

    Page(s): 587 - 602

    Duprat and Muller (1993) introduced the ingenious “Branching CORDIC” algorithm. It enables a fast implementation of the CORDIC algorithm using signed digits and requires a constant normalization factor. The speedup is achieved by performing two basic CORDIC rotations in parallel in two separate modules. In their method, both modules perform identical computations except when the algorithm is in a “branching” state [1]. We have improved the algorithm and show that it is possible to perform two circular-mode rotations in a single step, with little additional hardware. In our method, both modules perform distinct computations at each step, which leads to better utilization of the hardware and the possibility of further speedup over the original method. Architectures for VLSI implementation of our algorithm are discussed.

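    For background, the sketch below shows the conventional circular-mode CORDIC iteration for sine and cosine (one micro-rotation per step, with the usual scale-factor compensation). It is only the textbook baseline; the double step branching variant of the abstract builds on this scheme and is not reproduced here:

    ```python
    import math

    def cordic_sin_cos(theta: float, n_iter: int = 32):
        """Conventional circular-mode CORDIC: returns (cos(theta), sin(theta))
        for theta in [-pi/2, pi/2], using n_iter shift-add micro-rotations."""
        angles = [math.atan(2.0 ** -i) for i in range(n_iter)]
        # Gain K = prod(1/sqrt(1 + 2^-2i)); start from x = K so no final scaling is needed.
        k = 1.0
        for i in range(n_iter):
            k /= math.sqrt(1.0 + 2.0 ** (-2 * i))
        x, y, z = k, 0.0, theta
        for i in range(n_iter):
            d = 1.0 if z >= 0 else -1.0          # rotation direction
            x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
            z -= d * angles[i]
        return x, y

    print(cordic_sin_cos(math.pi / 6))   # approximately (0.8660, 0.5000)
    ```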
  • A configurable membership service

    Page(s): 573 - 586

    A membership service is used to maintain information about which sites are functioning in a distributed system at any given time. Many such services have been defined, each implementing a unique combination of properties that simplify the construction of higher levels of the system. Despite this wealth of possibilities, however, any given service typically realizes only one set of properties, which makes it difficult to tailor the service to the specific needs of the application. Here, a configurable membership service that addresses this problem is described. The service is based on decomposing membership into its constituent abstract properties and then implementing these properties as separate software modules, called micro-protocols, that can be configured together to produce a customized membership service. A prototype C++ implementation of the membership service for a simulated distributed environment is also described.

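    The decomposition into micro-protocols might look roughly like the following hypothetical sketch (class names, events, and interfaces are invented for illustration; the paper's framework and its C++ prototype differ): each module implements one membership property, and a service is assembled from whichever modules are configured in.

    ```python
    class MicroProtocol:
        """Base class: each micro-protocol implements one membership property."""
        def on_event(self, event, service):   # event: ("suspect", site) or ("recover", site)
            raise NotImplementedError

    class FailureDetection(MicroProtocol):
        """Remove sites that are reported as suspected."""
        def on_event(self, event, service):
            kind, site = event
            if kind == "suspect":
                service.members.discard(site)

    class RejoinHandling(MicroProtocol):
        """Re-admit sites that are reported as recovered."""
        def on_event(self, event, service):
            kind, site = event
            if kind == "recover":
                service.members.add(site)

    class MembershipService:
        """A membership view maintained by whatever micro-protocols were configured in."""
        def __init__(self, initial_members, protocols):
            self.members = set(initial_members)
            self.protocols = list(protocols)
        def handle(self, event):
            for proto in self.protocols:
                proto.on_event(event, self)

    # Configure a service with both properties; a leaner variant could omit RejoinHandling.
    svc = MembershipService({"A", "B", "C"}, [FailureDetection(), RejoinHandling()])
    svc.handle(("suspect", "B"))
    svc.handle(("recover", "B"))
    print(sorted(svc.members))   # ['A', 'B', 'C']
    ```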
  • Design of balanced and constant weight codes for VLSI systems

    Page(s): 556 - 572

    A constant weight w code with k information bits and r check bits is a binary code of length n = k + r and cardinality 2^k such that the number of 1s in each code word is equal to w. When w = ⌊n/2⌋, the code is called balanced. This paper describes the design of balanced and constant weight codes with parallel encoding and parallel decoding. Infinite families of efficient constant weight codes are given with the parameters k, r, and p, the “number of balancing functions used in the code design.” The larger p grows, the smaller r will be, and the codes can be encoded and decoded with VLSI circuits whose sizes and depths are proportional to pk and log_2 p, respectively. For example, a design is given for a constant weight w = 33 code with k = 64 information bits, r = 10 check bits, and p = 8 balancing functions. This code can be implemented by a VLSI circuit using fewer than 4,054 transistors with a depth of less than 30 transistors.

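    One classical way to realize a single balancing function is Knuth's prefix-complementation scheme, sketched below as a hypothetical illustration of the general idea; the paper's codes, with p balancing functions and their check-bit constructions, are considerably more refined:

    ```python
    def balance(bits):
        """Knuth-style balancing: complement the first j bits of an even-length word
        so that it has exactly k/2 ones; such a j always exists because the weight
        changes by +/-1 per step and ranges from w to k-w. Returns (word, j); j is
        the value a real code would have to carry in its check bits."""
        k = len(bits)
        assert k % 2 == 0
        for j in range(k + 1):
            word = [1 - b for b in bits[:j]] + list(bits[j:])
            if sum(word) == k // 2:
                return word, j
        raise AssertionError("unreachable: a balancing point always exists")

    word, j = balance([1, 1, 1, 1, 0, 1, 1, 0])
    print(word, "bits complemented:", j)   # balanced word with 4 ones out of 8
    ```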
  • Theory and design of adjacent asymmetric error masking codes

    Page(s): 544 - 555

    Recently, Matsuzawa and Fujiwara (1988) proposed a novel scheme to mask line faults on bus circuits (such as address buses) caused by short-circuit defects between adjacent lines. In this paper, we first develop the fundamental theory of such codes and then present some efficient designs. Some lower and upper bounds on the optimal codes are also given.

  • Damage assessment for optimal rollback recovery

    Page(s): 603 - 613

    Conventional schemes of rollback recovery with checkpointing for concurrent processes have overlooked an important problem: contamination of checkpoints as a result of error propagation among the cooperating processes. Error propagation is unavoidable due to imperfect detection mechanisms and random interprocess communications, and it can give rise to contaminated checkpoints which, in turn, result in unsuccessful rollbacks. To counter the problem of error propagation, a damage assessment model is developed to estimate the correctness of saved checkpoints under various circumstances. Using the result of damage assessment, determination of the “optimal” checkpoints for rollback recovery, which minimize the average total recovery overhead, is formulated and solved as a nonlinear integer programming problem. Integration of damage assessment into existing recovery schemes is also discussed.

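    A toy version of the selection problem (invented numbers, and a much cruder cost model than the paper's nonlinear integer program): given, for each saved checkpoint, an estimate of the work lost by rolling back to it and of the probability that it is contaminated, pick the checkpoint with the smallest expected total recovery overhead.

    ```python
    def best_checkpoint(checkpoints, restart_cost):
        """checkpoints: list of (name, work_lost, p_contaminated).
        The expected overhead of rolling back to a checkpoint is modeled (crudely) as
        work_lost + p_contaminated * restart_cost, where restart_cost is the extra
        penalty paid when the rollback fails because the checkpoint was contaminated."""
        def expected(cp):
            _, work_lost, p_bad = cp
            return work_lost + p_bad * restart_cost
        return min(checkpoints, key=expected)

    cps = [("ckpt-3", 10.0, 0.40),   # recent, but likely contaminated by propagation
           ("ckpt-2", 25.0, 0.10),
           ("ckpt-1", 60.0, 0.01)]   # old, almost certainly clean
    print(best_checkpoint(cps, restart_cost=200.0))   # -> ('ckpt-2', 25.0, 0.1)
    ```

    The trade-off is the one the abstract describes: rolling back further costs more lost work but lowers the risk that the restored state is itself contaminated.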
  • CPU cache prefetching: Timing evaluation of hardware implementations

    Page(s): 509 - 526

    Prefetching into CPU caches has long been known to be effective in reducing the cache miss ratio, but known implementations of prefetching have been unsuccessful in improving CPU performance. The reason is that prefetches interfere with normal cache operations: they make the cache address and data ports busy, the memory bus busy, and the memory banks busy, and they are not necessarily complete by the time the prefetched data is actually referenced. In this paper, we present extensive quantitative results of a detailed cycle-by-cycle trace-driven simulation of a uniprocessor memory system in which we vary most of the relevant parameters in order to determine when and if hardware prefetching is useful. We find that, in order for prefetching to actually improve performance, the address array needs to be double ported and the data array needs to be either double ported or fully buffered. It is also very helpful for the bus to be wide (e.g., 16 bytes), for bus transactions to be split, and for main memory to be interleaved. Under the best circumstances, i.e., with a significant investment in extra hardware, prefetching can significantly improve performance. For implementations without adequate hardware, prefetching often decreases performance.

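    The port-contention effect highlighted above can be seen even in a crude, made-up cycle model like the one below: prefetch fills and demand accesses compete for the cache data-array port, so with a single port every fill steals cycles from demand traffic, while a second port (or a fully buffered array) lets them overlap.

    ```python
    def total_cycles(n_demand, n_prefetch_fills, fill_busy=4, ports=1):
        """Crude model (made-up numbers): every demand access needs the data-array
        port for 1 cycle; every prefetch fill needs a port for `fill_busy` cycles.
        With one port the fills serialize behind demand traffic; with two ports
        (or a fully buffered array) they proceed in parallel."""
        demand_time = n_demand                      # one demand access per cycle
        fill_time = n_prefetch_fills * fill_busy    # port-cycles consumed by fills
        if ports == 1:
            return demand_time + fill_time          # fills steal demand-port cycles
        return max(demand_time, fill_time)          # fills overlap with demand accesses

    print("1 port :", total_cycles(1000, 120))              # prefetch fills add stalls
    print("2 ports:", total_cycles(1000, 120, ports=2))     # fills are hidden
    ```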

Aims & Scope

The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field.

Meet Our Editors

Editor-in-Chief
Albert Y. Zomaya
School of Information Technologies
Building J12
The University of Sydney
Sydney, NSW 2006, Australia
http://www.cs.usyd.edu.au/~zomaya
albert.zomaya@sydney.edu.au