Parallel and Distributed Processing, 1994. Proceedings. Sixth IEEE Symposium on

Date 26-29 Oct. 1994

  • Proceedings of 1994 6th IEEE Symposium on Parallel and Distributed Processing

    Publication Year: 1994
    Freely Available from IEEE
  • Optimal fault-tolerant communication algorithms on product networks using spanning trees

    Publication Year: 1994 , Page(s): 188 - 195
    Cited by:  Papers (8)

    In recent years, Cartesian product graphs have received increasing attention as a general class of networks for multiprocessor systems. One reason is that many efficient and popular networks, such as meshes, tori, hypercubes, hyper de Bruijn, product shuffle, and the newly proposed folded Petersen networks, belong to this class. Secondly, with the help of Cartesian product graphs, a unique method for the design and analysis of a class of networks, as well as techniques for embedding and communication algorithms, can be provided. In this paper, multiple arc-disjoint spanning trees are first constructed on product networks with bidirectional links. These trees are then used to design fault-tolerant algorithms for several important communication primitives, assuming all-port communication. The problems under consideration include broadcasting, gossiping, scattering, and total exchange.
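
    As a rough illustration of the product construction the abstract refers to, the sketch below builds the Cartesian product of two graphs given as adjacency dictionaries. The function name `cartesian_product` and the dictionary representation are assumptions made for this example, not taken from the paper; it only shows why meshes and hypercubes fall into this class.

```python
from itertools import product as iproduct


def cartesian_product(g, h):
    """Cartesian product of two undirected graphs.

    g and h map each node to the set of its neighbours (hypothetical
    representation chosen for this sketch). Node (u, v) of the product is
    adjacent to (u2, v) when u2 neighbours u in g, and to (u, v2) when
    v2 neighbours v in h.
    """
    prod = {}
    for u, v in iproduct(g, h):
        prod[(u, v)] = {(u2, v) for u2 in g[u]} | {(u, v2) for v2 in h[v]}
    return prod


# K2 is a single edge; K2 x K2 is the 2-dimensional hypercube (a 4-cycle).
K2 = {0: {1}, 1: {0}}
Q2 = cartesian_product(K2, K2)

# A 3-node path; its product with itself is a 3 x 3 mesh.
P3 = {0: {1}, 1: {0, 2}, 2: {1}}
MESH = cartesian_product(P3, P3)
```

    Repeating the product with K2 yields higher-dimensional hypercubes, and replacing the path with a cycle yields a torus.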

  • Star-Graph based multistage interconnection network for ATM switch fabric

    Publication Year: 1994 , Page(s): 444 - 451

    This paper considers a multistage interconnection network based on the Indirect Star Graph topology as a candidate for an ATM (Asynchronous Transfer Mode) switch fabric. We consider both buffered and unbuffered versions of the indirect star. The performance of three existing routing algorithms is studied, and it is found that the packet acceptance probability offered by these algorithms is unsatisfactory. We propose two solutions to alleviate the problem: a modification of the indirect star topology, which we call Star Net; and an adaptive routing algorithm based on the concept of packet priorities. We study the performance of the proposed performance enhancement schemes through simulation.

  • Eager combining: a coherency protocol for increasing effective network and memory bandwidth in shared-memory multiprocessors

    Publication Year: 1994 , Page(s): 204 - 213
    Cited by:  Papers (1)

    An excessive number of remote accesses or a non-uniform distribution of remote accesses can cause even well-designed multiprocessors to exhibit severe memory and network contention. Producer/consumer data generates a particularly common sharing pattern that results in a non-uniform distribution of references. In this paper we quantify the performance impact of producer/consumer sharing as a function of memory and network bandwidth, and argue that the contention caused by this form of sharing severely impacts performance on large-scale machines. We propose a new coherency protocol, called eager combining, which is designed to alleviate this contention. We use execution-driven simulation of parallel programs on a large-scale multiprocessor to show that eager combining can improve the performance of programs with producer/consumer data by a factor of 4 or more.

  • A class of scalable architectures for high-performance, cost-effective parallel computing

    Publication Year: 1994 , Page(s): 162 - 169

    The family of reconfigurable generalized hypercube (RGH) architectures is proposed for the construction of scalable parallel computers. The objective is to reduce the high VLSI complexity of generalized hypercubes while maintaining, to a high extent, their outstanding performance. Generalized hypercubes are versatile topologies of very high cost that optimally emulate binary hypercubes and k-ary n-cubes. RGHs, which are lower-cost reconfigurable systems, efficiently emulate generalized hypercubes for application algorithms that use regular communication patterns. RGHs generally perform better than binary hypercubes and k-ary n-cubes with the same number of nodes. To illustrate the viability of RGHs, extensive cost analysis and comparisons with relevant systems are carried out. The hardware cost of RGHs is shown to be even lower than that of fat trees. Therefore, scalable RGHs are viable candidates for the construction of versatile parallel computers.

  • On the design and implementation of broadcast and global combine operations using the postal model

    Publication Year: 1994 , Page(s): 594 - 602
    Cited by:  Papers (6)

    Two models for message-passing parallel systems are the postal model and its generalization, the LogP model. In the postal model a parameter λ is used to model the communication latency of the message-passing system. During each round, each node can send a fixed-size message and, simultaneously, receive a message of the same size. Furthermore, a message sent out during round r will incur a latency of λ and will arrive at the receiving node at round r+λ-1. The goal of the article is to bridge the gap between the theoretical modeling and the practical implementation. In particular, we investigate a number of practical issues related to the design and implementation of two collective communication operations, namely the broadcast operation and the global combine operation. Those practical issues include, for example: techniques for measuring the value of λ on a given machine; creating efficient broadcast algorithms that take the latency λ and the number of nodes n as parameters; and creating efficient global combine algorithms for parallel machines in which λ is not an integer. We propose solutions that address those practical issues and present results of an experimental study of the new algorithms on the Intel Delta machine. The main conclusion is that the postal model can help in performance prediction and tuning; for example, a properly tuned broadcast improves on the known implementation by more than 20%.
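
    The arrival rule stated above (a message sent in round r arrives in round r+λ-1) can be explored with a small simulation. The sketch below is not the paper's algorithm: it assumes integer λ ≥ 1, a greedy strategy in which every informed node sends to a new node in every round, and that a recipient may start sending in the round after its copy arrives. The function name `postal_broadcast_rounds` is hypothetical.

```python
def postal_broadcast_rounds(n, lam):
    """Rounds needed until n nodes hold the message under greedy flooding.

    Assumptions for this sketch: rounds are numbered from 1, lam is an
    integer >= 1, a message sent in round r arrives in round r + lam - 1,
    and a node can begin sending in the round after it receives the message.
    """
    if n <= 1:
        return 0
    informed = 1      # nodes holding the message (initially the source)
    senders = 1       # nodes able to send in the current round
    arrivals = {}     # round -> number of copies arriving in that round
    r = 0
    while informed < n:
        r += 1
        # Every current sender emits one message to a fresh node.
        due = r + lam - 1
        arrivals[due] = arrivals.get(due, 0) + senders
        # Copies arriving this round inform new nodes, who send from r + 1.
        got = arrivals.pop(r, 0)
        informed += got
        senders += got
    return r
```

    With λ = 1 the informed set doubles every round, so 8 nodes need 3 rounds; with λ = 2 the informed count grows like the Fibonacci sequence and 8 nodes need 5 rounds under these assumptions.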

  • Parallel bidirectional heuristic search on the EM-4 multiprocessor

    Publication Year: 1994 , Page(s): 100 - 107
    Cited by:  Patents (1)

    Solving search problems takes a large amount of computational resources, both in terms of execution time and memory usage. This report presents experimental results of Parallel Bidirectional Heuristic Search (PBiHS) on the 80-processor EM-4 multithreaded data-flow multiprocessor. PBiHS searches from two directions in parallel, while the search in each direction is also performed in parallel. Important data structures are distributed to all processors to help reduce the execution time of realistic problem sizes to a few seconds or less. We implement two search problems, the Eight Puzzle and the Tower of Hanoi, and execute them on the target multiprocessor. Execution results demonstrate that Parallel Bidirectional Heuristic Search can solve the Eight Puzzle at tree depths 20-40 and the Tower of Hanoi with 3-9 disks in an optimal or near-optimal number of iterations in less than two seconds, is highly scalable, giving over 40-fold speedup on 80 processors, and yields on average a 10-fold improvement over unidirectional search for the Eight Puzzle while generating far fewer nodes.

  • A comparison of techniques used for mapping parallel algorithms to message-passing multiprocessors

    Publication Year: 1994 , Page(s): 434 - 442
    Cited by:  Papers (2)

    This paper presents a comparison study of popular clustering and mapping heuristics which are used to map task-flow graphs to message-passing multiprocessors. To this end, we use task-graphs which are representative of important scientific algorithms running on data-sets of practical interest. The annotation which assigns weights to nodes and edges of the task-graphs is realistic. It reflects current trends in processor, communication channel, and message-passing interface technology and takes into consideration hardware characteristics of state-of-the-art multiprocessors. Our experiments show that applying realistic models for task-graph annotation affects the effectiveness and functionality of clustering and mapping techniques. Therefore, new heuristics are necessary that will take into account more practical models of communication costs. We present modifications to existing clustering and mapping algorithms which improve their efficiency and running-time for the practical models adopted.

  • Hierarchical adaptive routing: a framework for fully adaptive and deadlock-free wormhole routing

    Publication Year: 1994 , Page(s): 688 - 695
    Cited by:  Papers (2)

    Adaptive routing can improve network performance and fault-tolerance by providing multiple routing paths. However, the implementation complexity of adaptive routing can be significant, discouraging its use in commercial massively parallel systems. In this paper we introduce Hierarchical Adaptive Routing (HAR), a new framework for simple, high-performance, fully adaptive, deadlock-free wormhole routing. HAR divides the physical network into several levels of virtual networks. There is one connection channel between two adjacent virtual networks that allows blocked packets in the higher level to move to the lower level. Different routing algorithms can be used in each virtual network, and the overall network is deadlock-free provided the routing algorithm in the lowest-level virtual network is deadlock-free. The routing algorithm in any other virtual network, however, can be fully adaptive, even non-minimal, to increase performance. HAR has three advantages: fully adaptive deadlock-free routing in any non-wrapped and wrapped k-ary n-cube network with 2 and 3 virtual channels respectively, relatively small crossbars, and applicability to a wide variety of network topologies. Detailed implementation and simulation studies of a HAR for 2D mesh networks are presented.

  • Optimal polling in communication networks

    Publication Year: 1994 , Page(s): 224 - 231
    Cited by:  Papers (1)

    Polling is the process in which an issuing node of a communication network (the polling station) broadcasts a query to every other node in the network and must receive a unique response from each of them. Polling can be thought of as a combination of broadcasting and gathering, and it finds wide application in the control of distributed systems. In this paper we consider the problem of polling in minimum time. We give a general lower bound on the minimum number of time units needed to accomplish polling in any network, and we present optimal polling algorithms for several classes of graphs, including hypercubes and recursively decomposable Cayley graphs.

  • Write grouping for update-based cache coherence protocols

    Publication Year: 1994 , Page(s): 334 - 341
    Cited by:  Papers (1)

    In our previous work, we demonstrated the possible performance gains from update-based cache coherence protocols for a set of fine-grain scientific applications running on a scalable shared-memory multiprocessor. In this paper, we examine in detail the hardware-based write grouping scheme presented in our earlier work. First we describe both software-based and hardware-based write grouping schemes. The software-based scheme, with its perfect knowledge of the application's write pattern, is able to achieve optimal write grouping efficiency, but not without added complexity in the application's code. Nevertheless, we use the software-based scheme to determine the optimal grouping efficiency for each application studied and then demonstrate that the hardware-based write grouping scheme is almost as efficient as the software-based scheme while requiring little, if any, software modification.

  • Synchronization expressions and languages

    Publication Year: 1994 , Page(s): 257 - 264

    New constructs for synchronization termed synchronization expressions (SEs) have been developed as high-level language constructs for parallel programming languages. We introduce a new family of languages named synchronization languages which we use to give a precise semantic description for SEs. Under this description, relations such as equivalence and inclusion between SEs can be easily understood and tested. In practice, it also provides us with a systematic way for the implementation as well as the simplification of SEs in parallel programming languages. We also show that each synchronization language is closed under the following rewriting rules: (1) a_s b_s → b_s a_s, (2) a_t b_t → b_t a_t, (3) a_s b_t → b_t a_s, (4) a_t a_s b_t b_s → b_t b_s a_t a_s, and also h(a_t a_s b_t b_s) → h(b_t b_s a_t a_s) for any morphism h that satisfies certain conditions which will be specified in the paper. We show that this property can be used to reduce the number of states of a finite automaton that describes a synchronization language.

  • An efficient dynamic processor allocation algorithm for adaptive mesh applications

    Publication Year: 1994 , Page(s): 38 - 45

    In numerical algorithms based on adaptive mesh refinement, the computational workload changes during execution. In mapping such algorithms onto distributed-memory architectures, it is necessary to balance the workload among the processors dynamically in order to obtain high performance. In this paper, we propose a dynamic processor allocation algorithm for a mesh architecture that reassigns the workload in an attempt to minimize both the computational and communication costs. Our algorithm is based on a heuristic for a 2D packing problem that gives provably close to optimal solutions for special cases of the problem. We also demonstrate through experiments how our algorithm provides good quality solutions in general.

  • Modelling accesses to migratory and producer-consumer characterised data in a shared memory multiprocessor

    Publication Year: 1994 , Page(s): 612 - 619
    Cited by:  Papers (4)  |  Patents (1)

    Directory-based, write-invalidate cache coherence protocols are effective in reducing latencies to the memory but suffer from cache misses due to coherence actions. It is therefore important to understand the nature of data sharing causing misses for this class of protocols. We identify a set of parameters that characterises the accesses to migratory and producer-consumer data in sufficient detail to predict the number of cache misses in directory-based, write-invalidate protocols. We show that the parameters can be extracted from real programs and used as input to a reference generator that artificially generates a stream of references, yielding accurate estimates of cold, coherence and directory replacement misses compared with the program itself.

  • Minimum dependence distance tiling of nested loops with non-uniform dependences

    Publication Year: 1994 , Page(s): 74 - 81
    Cited by:  Papers (7)  |  Patents (2)

    We address the problem of partitioning nested loops with non-uniform (irregular) dependence vectors. Although many methods exist for nested loop partitioning, most of these perform poorly when parallelizing nested loops with irregular dependencies. We apply the results of classical convex theory and principles of linear programming to iteration spaces and show the correspondence between minimum dependence distance computation and iteration space tiling. The cross-iteration dependencies are analyzed by forming an Integer Dependence Convex Hull (IDCH). A simple way to compute minimum dependence distances from the dependence distance vectors of the extreme points of the IDCH is presented. Using these minimum dependence distances the iteration space can be tiled. Iterations in a tile can be executed in parallel and the tiles can be executed with proper synchronization. We demonstrate that our technique gives much better speedup and extracts more parallelism than the existing techniques.

  • Minimal turn restrictions for designing deadlock-free adaptive routing

    Publication Year: 1994 , Page(s): 680 - 687

    A routing algorithm is required, at a minimum, to be connected and deadlock-free. To avoid deadlock, we can restrict some of the directions in which messages can turn in a network. A deadlock-free adaptive routing with fewer turn restrictions is considered to possess a greater degree of adaptiveness. We present two basic strategies for designing feasible routings on networks which have bidirectional channels. Preliminary investigation of our strategies reveals their ability to obtain minimal turn restrictions on some typical multicomputer networks, such as the hypercube, mesh and torus.

  • DTVS: a distributed trace visualization system

    Publication Year: 1994 , Page(s): 281 - 288
    Cited by:  Patents (1)

    We present a straightforward and simple method of visualizing the execution of a distributed system as it is recorded in a collection of per-process traces. The technique is based directly on Lamport's space-time diagrams and the notion of causal precedence as embodied in vector time. Timestamping trace events with vector time provides the leverage required for the rapid display of the space-time diagram. It also allows us to define and display concurrent regions of the trace and to implement a fast algorithm for the evaluation of global state predicates over those regions. We describe an implementation of DTVS for distributed systems in which synchronous communication is used.
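
    Causal precedence under vector time reduces to a componentwise comparison of timestamps. The sketch below shows the standard comparison only; it is not DTVS code, and the function names `happened_before` and `concurrent` are chosen for this example. Timestamps are assumed to be equal-length tuples with one entry per process.

```python
def happened_before(u, v):
    """True if vector timestamp u causally precedes v.

    u precedes v when every component of u is <= the matching component
    of v and at least one component is strictly smaller.
    """
    return (all(a <= b for a, b in zip(u, v))
            and any(a < b for a, b in zip(u, v)))


def concurrent(u, v):
    """True if neither timestamp causally precedes the other.

    Identical timestamps are treated as the same event, not as concurrent.
    """
    return (tuple(u) != tuple(v)
            and not happened_before(u, v)
            and not happened_before(v, u))
```

    Two events whose timestamps are concurrent in this sense can be drawn in either order on a space-time diagram, which is what makes regions of concurrency identifiable from the trace alone.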

  • An analysis of data distribution methods for Gaussian elimination in distributed-memory multicomputers

    Publication Year: 1994 , Page(s): 152 - 159
    Cited by:  Papers (1)

    In multicomputers, an appropriate data distribution is crucial for reducing communication overhead and therefore for improving overall performance. For this reason, data parallel languages provide programmers with primitives, such as BLOCK and CYCLIC, that can be used to distribute data across the distributed memory. However, the languages do not guide the programmer as to how the distribution should be performed to maximize performance. Therefore, this paper presents an analysis of data distribution methods for overlapping computation and communication in the Gaussian elimination algorithm. The analysis indicates that both BLOCK and CYCLIC distributions have their own merits; however, BLOCK-CYCLIC, with its hybrid characteristic, consistently outperforms its counterparts.
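
    The three distributions named above differ only in how a row index maps to an owning processor. The sketch below uses the usual textbook definitions rather than anything specific to the paper; the function names and the helper `active_rows` are hypothetical, and rows and processors are numbered from 0.

```python
def block_owner(i, n, p):
    """BLOCK: contiguous chunks of ceil(n / p) rows per processor."""
    size = -(-n // p)  # ceiling division
    return i // size


def cyclic_owner(i, p):
    """CYCLIC: rows dealt out round-robin, one at a time."""
    return i % p


def block_cyclic_owner(i, p, b):
    """BLOCK-CYCLIC: blocks of b rows dealt out round-robin."""
    return (i // b) % p


def active_rows(owners, k, p):
    """Rows still being updated after k elimination steps, per processor."""
    counts = [0] * p
    for row, owner in enumerate(owners):
        if row >= k:
            counts[owner] += 1
    return counts


n, p = 8, 4
block = [block_owner(i, n, p) for i in range(n)]
cyclic = [cyclic_owner(i, p) for i in range(n)]
```

    In Gaussian elimination the active part of the matrix shrinks as elimination proceeds. Halfway through this 8-row example, BLOCK leaves two of the four processors with no active rows, while CYCLIC keeps one active row on each.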

  • Good algorithm design style for multiprocessors

    Publication Year: 1994 , Page(s): 538 - 543
    Cited by:  Papers (8)

    We discuss a style of designing parallel algorithms with the following characteristics for a problem of the best known sequential time T(n): C1. Each processor spends O(T(n)/P) time in computing. C2. Each processor sends and/or receives O(n/P) messages of one-word size. C3. The number of communication phases is constant, independent of the input size n. We show this is possible to achieve for several fundamental computational problems.

  • Wormhole routing algorithms for twisted cube networks

    Publication Year: 1994 , Page(s): 696 - 703
    Cited by:  Papers (3)

    The hypercube can be “improved” by “twisting” or rearranging edges to create new networks with smaller diameter and average distance. There are two criticisms of these twisted cube networks. First, these networks have not been shown to have deadlock-free routing algorithms. Second, while they can sometimes provide better performance for a store-and-forward routing strategy, they have not been shown to be efficient when using a wormhole routing strategy. In this paper, we introduce a new network, the Bent Cube, and examine one recently published network, the Generalized Twisted Cube, to address these issues.

  • Optimal multiple message broadcasting in telephone-like communication systems

    Publication Year: 1994 , Page(s): 216 - 223
    Cited by:  Papers (3)

    We consider the problem of broadcasting multiple messages from one processor to many processors in telephone-like communication systems. In such systems, processors communicate in rounds, where in every round, each processor can communicate with exactly one other processor by exchanging messages with it. Finding an optimal solution for this problem was open for over a decade. In this paper, we present an optimal algorithm for this problem when the number of processors is even. For an odd number of processors, we provide an algorithm which is within an additive term of 1 of the optimum. A by-product of our solution is an algorithm for the problem of broadcasting multiple messages for any number of processors in the simultaneous send/receive model. In this latter model, in every round, each processor can send a message to one processor and receive a message from another processor.

  • Pin-efficient networks for cubic neighborhoods

    Publication Year: 1994 , Page(s): 402 - 408

    Pin-efficient bussed network families are discussed that can, in one clock tick, simultaneously shift all data in a k-dimensional grid to neighboring processors in any one of the 3^k - 1 `compass directions' x → x + δ, for every nonzero vector δ ∈ {-1,0,1}^k. The networks have the advantages of being simple to describe (using a single 5-state automaton), extendible (the k-dimensional network is obtained by extending the busses of the (k-1)-dimensional network), and provably optimal for k⩽3. The networks use only [3/2 (√3)^k] pins per processor, which is within 3/2 of the theoretical minimum number of pins required. The best previously known family uses 2^k pins.
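
    The count of compass directions follows from enumerating every vector with entries in {-1, 0, 1} and discarding the zero vector, which leaves 3^k - 1 of them. The sketch below only illustrates that neighborhood; it says nothing about the bus layout itself, and the names `compass_directions` and `shift` are hypothetical.

```python
from itertools import product


def compass_directions(k):
    """All nonzero offset vectors with entries in {-1, 0, 1} in k dimensions."""
    return [d for d in product((-1, 0, 1), repeat=k) if any(d)]


def shift(x, d):
    """Grid position reached from x by moving one step in direction d."""
    return tuple(a + b for a, b in zip(x, d))
```

    For k = 2 this gives the eight familiar compass points, and for k = 3 the 26 neighbors of a cell in a cubic grid.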

  • Characterization of applications with I/O for processor scheduling in multiprogrammed parallel systems

    Publication Year: 1994 , Page(s): 298 - 307
    Cited by:  Papers (1)

    Most studies of processor scheduling in multiprogrammed parallel systems have ignored the I/O performed by applications. Recent studies have demonstrated that significant I/O operations are performed by a number of different classes of parallel applications. This paper focuses on some basic issues that underlie scheduling in multiprogrammed parallel environments running applications with I/O. Characterization of the I/O behavior of parallel applications is discussed first. Based on simulation models this research investigates the influence of these I/O characteristics on processor scheduling.

  • Virtual computers-a new paradigm for distributed operating systems

    Publication Year: 1994 , Page(s): 326 - 333
    Cited by:  Papers (1)

    The virtual computers (VC) paradigm enables the incorporation of predictability and choice into the design of an operating system. Predictability refers to the ability of the system to provide each user with a computing environment whose performance is independent of the behavior of other users. Choice refers to the ability of a user to select a computer system that meets that user's specifications, needs or budget. In this paper, we introduce this new paradigm and show how it can be incorporated into processor scheduling and how on-line schedulers can be effectively implemented.

  • Efficient submesh permutations in wormhole-routed meshes

    Publication Year: 1994 , Page(s): 672 - 678

    This paper studies how to concurrently permute related logical or physical submeshes in a d-dimensional n×…×n physical mesh via wormhole and dimension-ordered routing. Our objective is to minimize the congestion for realizing the permutations, while maximizing the number and dimensionality of permuted submeshes. We show that for d⩽2α+β, concurrent independent permutations of n^β related physical submeshes, each of α dimensions, can be performed in two routing steps without congestion. If the permuted submeshes are logical ones, they can be permuted in one, instead of two, routing step. In addition, any shift operation along any axis of the logical mesh can be performed in the physical mesh without congestion. We also show that if all nodes know the permutation function, any permutation within a submesh of dimensions [2(d-1)/3] can be realized in three routing steps without congestion.
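
    Dimension-ordered routing, which the abstract takes as given, corrects one coordinate at a time in a fixed axis order. The sketch below traces the path of a single message in a mesh without wraparound; it illustrates the routing discipline only, not the paper's permutation schemes, and the function name `dimension_ordered_route` is hypothetical.

```python
def dimension_ordered_route(src, dst):
    """Nodes visited when routing from src to dst, axis 0 first, then axis 1, ...

    src and dst are equal-length coordinate tuples in a mesh with no
    wraparound links (an assumption of this sketch).
    """
    path = [tuple(src)]
    cur = list(src)
    for axis in range(len(cur)):
        step = 1 if dst[axis] > cur[axis] else -1
        # Fully correct this coordinate before touching the next axis.
        while cur[axis] != dst[axis]:
            cur[axis] += step
            path.append(tuple(cur))
    return path
```

    Because every message finishes its moves along one axis before starting the next, two messages contend for a link only if they overlap on the same axis segment, which is what makes congestion analyzable for structured permutations.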
