
Proceedings of the Eighth International Parallel Processing Symposium, 1994

Date: 26-29 April 1994


Displaying Results 1 - 25 of 138
  • Proceedings of 8th International Parallel Processing Symposium

  • Parallel extended GCD algorithm

    Page(s): 357 - 361

    The extended GCD algorithm is very useful for data dependence tests, for example the Power Test on supercomputers. We parallelize the extended GCD algorithm on a CREW SM MIMD computer with O(n) processors. We improve the sequential extended GCD algorithm and parallelize it in two ways. First, we parallelize the triangularization of the matrix by reducing elements in the same column simultaneously. This algorithm has almost no algorithmic redundancy, but has efficiency O(1/log2 n), where n is the number of variables. Second, some rows that have been reduced at the current column in the first algorithm can be reduced at the next column immediately. This algorithm has efficiency O(min(m, n-1)/n), where m is the number of linear diophantine equations, and is powerful when m is large. When m equals n-1, the efficiency of the algorithm is O(1).

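As background for the abstract above, the sequential extended GCD for a single pair of integers can be sketched as follows (a minimal textbook extended Euclid in Python, not the authors' parallel matrix formulation):

```python
def ext_gcd(a, b):
    """Return (g, x, y) with a*x + b*y == g == gcd(a, b)."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r:
        q = old_r // r
        # one reduction step; the paper applies such reductions to whole
        # matrix columns in parallel to triangularize the system
        old_r, r = r, old_r - q * r
        old_x, x = x, old_x - q * x
        old_y, y = y, old_y - q * y
    return old_r, old_x, old_y
```

For a system of linear diophantine equations, analogous reductions are applied column by column to a coefficient matrix, which is the part the paper parallelizes.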
  • A multi-paradigm object oriented parallel environment

    Page(s): 182 - 186

    Control and data parallelism are two complementary but often mutually exclusive paradigms used to program massively parallel systems. We propose to encapsulate both control and data parallelism in regular classes of a sequential object oriented language: a SPMD programming model is used and thus no language extensions are needed, provided a shared virtual memory is available. We show how these ideas are implemented in EPEE, our Eiffel Parallel Execution Environment. As an example, we present the implementation of both paradigms on a toy linear algebra example and show how they can interoperate. We conclude with some performance results and prospective remarks.

  • Communication and computation patterns of large scale image convolutions on parallel architectures

    Page(s): 926 - 931

    Segmentation and other image processing operations rely on convolution calculations with heavy computational and memory access demands. The article presents an analysis of a texture segmentation application containing a 96×96 convolution. Sequential execution required several hours on single-processor systems, with over 99% of the time spent performing the large convolution. 70% to 75% of execution time is attributable to cache misses within the convolution. We implemented the same application on CM-5, iPSC/860 and PVM distributed-memory multicomputers, tailoring the parallel algorithms to each machine's architecture. Parallelization significantly reduced execution time, taking 49 seconds on a 512-node CM-5 and 6.5 minutes on a 32-node iPSC/860. The results indicate that for large-kernel convolutions, the size and bandwidth of the fast memory store are more important than processor power or communication overhead.

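To make the cost argument above concrete, a naive "valid" 2-D convolution (correlation-style, as commonly implemented in image processing) can be sketched in Python; this is an illustration only, not the authors' parallel implementation. Each output pixel costs K×K multiply-adds, so a 96×96 kernel needs 9216 of them per pixel, which is why the convolution dominates the run time.

```python
def conv2d_valid(img, ker):
    """Naive 'valid' 2-D convolution over nested lists.
    Cost: (H-K+1) * (W-K+1) * K * K multiply-adds."""
    H, W = len(img), len(img[0])
    K = len(ker)
    out = []
    for i in range(H - K + 1):
        row = []
        for j in range(W - K + 1):
            s = 0
            for u in range(K):          # K*K inner loop dominates the cost
                for v in range(K):
                    s += img[i + u][j + v] * ker[u][v]
            row.append(s)
        out.append(row)
    return out
```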
  • Parallel routing of VLSI circuits based on net independency

    Page(s): 949 - 953

    During the layout synthesis of integrated circuits, a major part of the time is spent routing the interconnections of the chip's cells. Even for the simplest optimization criteria, this problem is NP-complete, making the use of heuristics necessary. But even when using heuristics, the time required by the routing phase is very high. In the past, several approaches have been proposed to speed up the routing phase by applying parallel processing. Most of these approaches distribute the routing area among processors and have to cope with considerable communication overhead. In this paper, we present a novel approach in which sets of nets are distributed. We show experimentally that this approach leads to significant speedups even in workstation networks.

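Distributing sets of nets requires a notion of net independency. The abstract does not spell out the authors' criterion; one plausible stand-in is that nets whose bounding boxes are disjoint can be routed concurrently. A greedy grouping under that assumption can be sketched as:

```python
def boxes_disjoint(a, b):
    """a, b: (xmin, ymin, xmax, ymax). True if the boxes do not overlap."""
    return a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1]

def independent_batches(nets):
    """Greedily place each net in the first batch where it conflicts
    with nothing; each batch could then be routed in parallel.
    (Bounding-box disjointness is an assumed independency test,
    not necessarily the paper's.)"""
    batches = []
    for net in nets:
        for batch in batches:
            if all(boxes_disjoint(net, other) for other in batch):
                batch.append(net)
                break
        else:
            batches.append([net])
    return batches
```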
  • Sorting strings and constructing digital search trees in parallel

    Page(s): 349 - 356

    We describe two simple optimal-work parallel algorithms for sorting a list L=(X_1, X_2, ..., X_m) of m strings over an arbitrary alphabet Σ, where Σ_i |X_i| = n. The first algorithm is a deterministic algorithm that runs in O((log^2 m)/(log log m)) time and the second is a randomized algorithm that runs in O(log m) time. Both algorithms use O(m log m + n) operations. Compared to the best known parallel algorithms for sorting strings, the algorithms offer the following improvements: the total number of operations used by the algorithms is optimal while all previous parallel algorithms use a non-optimal number of operations; we make no assumption about the alphabet while the previous algorithms assume that the alphabet is restricted to {1, 2, ..., n^O(1)}; the computation model assumed by the algorithms is the Common CRCW PRAM unlike the known algorithms that assume the Arbitrary CRCW PRAM; and the presented algorithms use O(m log m + n) space, while previous parallel algorithms use O(n^(1+ε)) space, where ε is a positive constant. We also present optimal-work parallel algorithms to construct a digital search tree for a given set of strings and to search for a string in a sorted list of strings. We use the parallel sorting algorithms to solve the problem of determining a minimal starting point of a circular string with respect to lexicographic ordering.

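The last application mentioned above, the minimal starting point of a circular string, has a very short quadratic-time sequential reference (Python sketch; the paper obtains this via optimal-work parallel string sorting):

```python
def min_rotation_start(s):
    """Index of the lexicographically least rotation of s.
    O(n^2) reference implementation: compare all rotations via s+s."""
    n = len(s)
    d = s + s  # every rotation of s is a length-n window of s+s
    return min(range(n), key=lambda i: d[i:i + n])
```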
  • A framework for programming using non-atomic variables

    Page(s): 133 - 140

    The semantics of interprocess communication occurring through shared objects is investigated. A number of definitions of non-atomic memory are considered. Two different notions of commutativity are defined. These are later used to develop conditions under which computations using a non-atomic memory are equivalent to atomic histories with regard to the final state of the objects.

  • A scalable MIMD volume rendering algorithm

    Page(s): 916 - 920

    Volume rendering is a compute intensive graphics algorithm with wide application. Researchers have sought to speed it up using parallel computers. The algorithm distributes the data for storage efficiency, avoids bottlenecks, and scales to more processors than rays. The main contribution is explicit partitioning of the input volume for higher memory utilization, while retaining viewpoint freedom and speedup. The largest volumes processed on the MIMD (multiple instruction multiple data) machine (Proteus) are 512×512×128 voxels (32 Mbytes). Performance measurements show a speedup of 22 over sequential code on 32 Intel i860 processors. We have used no preprocessing or data-dependent optimization. The efficiency results from nonconflicting communication, a permutation warp, which remains efficient with larger data sets, larger parallel machines, and high-order filters, showing that scalability can be achieved through object-space partitioning.

  • Parallel logic simulation using Time Warp on shared-memory multiprocessors

    Page(s): 942 - 948

    The article presents an efficient parallel logic-circuit simulation scheme based on the Time Warp optimistic algorithm. The Time Warp algorithm is integrated with a new global virtual time (GVT) computation scheme for fossil collection. The new GVT computation is based on a token ring passing method, so that global synchronization is not required in a shared-memory multiprocessor system. This allows us to process large logic simulation problems, where the GVT computation is executed frequently for fossil collection due to limited memory space. We also show how to reduce the frequency of the GVT computation and the rollback ratio by scheduling the process with the smallest timestamp first. We implement the parallel logic-circuit simulator using Time Warp on BBN Butterfly machines, and the experimental results show that the algorithm provides a significant speedup in processing time, even for very large circuits.

  • Fuzzy communication for guided loop scheduling in multicomputers

    Page(s): 439 - 443

    We propose the use of guided loop scheduling and fuzzy communications to map shared-variable communications into message passing operations among multicomputers. The mapping mechanism converts scalar message passing operations into multiple broadcast or multiple multicast operations. The proposed method is evaluated by both simulation experiments and theoretical analysis. The performance results, considering both communication and computation, are reported for mapping two distributed matrix algorithms. Simulated benchmark results demonstrate improved performance and minimized execution time. The proposed method offers the potential to take advantage of the programmability of shared-memory systems and the scalability of distributed-memory systems.

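Guided loop scheduling, mentioned above, is commonly realized as guided self-scheduling: each idle processor grabs roughly 1/p of the remaining iterations, so chunks shrink as the loop drains. The exact chunking rule the authors use is not given in the abstract; a generic sketch:

```python
def guided_chunks(n, p):
    """Chunk sizes handed out by guided self-scheduling for a loop of
    n iterations on p processors: each grab takes ceil(remaining / p)."""
    chunks = []
    rem = n
    while rem > 0:
        c = max(1, -(-rem // p))  # -(-a // b) is ceil(a / b)
        chunks.append(c)
        rem -= c
    return chunks
```

Early chunks are large (good for low scheduling overhead), late chunks are small (good for load balance), which is the usual motivation for guided scheduling.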
  • The block distributed memory model for shared memory multiprocessors

    Page(s): 752 - 756

    Introduces a computation model for developing and analyzing parallel algorithms on distributed memory machines. The model allows the design of algorithms using a single address space and does not assume any particular interconnection topology. We capture performance by incorporating a cost measure for interprocessor communication induced by remote memory accesses. The cost measure includes parameters reflecting memory latency, communication bandwidth, and spatial locality. Our model allows the initial placement of the input data and pipelined prefetching. We use our model to develop parallel algorithms for various data rearrangement problems, load balancing, sorting, FFT, and matrix multiplication. We show that most of these algorithms achieve optimal or near optimal communication complexity while simultaneously guaranteeing an optimal speed-up in computational complexity.

  • Integrating functional and imperative parallel programming: CC++ solutions to the Salishan problems

    Page(s): 61 - 67

    We investigate the practical integration of functional and imperative parallel programming in the context of a popular sequential object-based language. As the basis of our investigation, we develop solutions to the Salishan problems, a set of problems intended as a standard by which to compare parallel programming notations. The language that we use is CC++, C++ extended with single-assignment variables, parallel composition, and atomic functions. We demonstrate how deterministic parallel programs can be written that are identical, except for the addition of a few keywords, to sequential programs that satisfy the same specifications.

  • Encapsulating networks and routing

    Page(s): 546 - 553

    Presents a new view of routing messages in interconnection networks based on the known compact interval labeling. The authors propose simple algorithms, encapsulating networks and routing, suitable for a large class of topologies. They define a floating rule that unifies the notions of virtual channels and multiple intervals labeling. The introduced approach is applied to some usual structures.

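Compact interval labeling, on which the paper builds, is easiest to see on the simplest topology, a path: each outgoing link is labeled with one interval of destination addresses, and a node forwards a message along the link whose interval contains the destination. A minimal sketch (not the authors' floating rule):

```python
def links(i, n):
    """Outgoing links of node i on a path 0..n-1, each labeled with the
    interval of destinations it serves."""
    out = []
    if i > 0:
        out.append((i - 1, (0, i - 1)))      # left neighbour serves [0, i-1]
    if i < n - 1:
        out.append((i + 1, (i + 1, n - 1)))  # right neighbour serves [i+1, n-1]
    return out

def route(n, src, dst):
    """Forward along the unique link whose interval contains dst."""
    cur, hops = src, [src]
    while cur != dst:
        for nxt, (lo, hi) in links(cur, n):
            if lo <= dst <= hi:
                cur = nxt
                hops.append(cur)
                break
    return hops
```

Each node stores only one interval per link, which is the "compact" part; richer topologies need multiple intervals per link, the case the paper's floating rule addresses.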
  • A theoretical network model and the Hamming cube networks

    Page(s): 18 - 22

    We introduce a network model, called the Hamming group, which can be used to generate several important classes of hypercube-like topologies. The Hamming group is a specific group for which the Hamming-distance relations are used as the generators. This model enhanced with the unit incremental capability provides a framework for generating many possible supergraphs of incomplete hypercubes, having an arbitrary number of nodes. In particular, we derive from our model a new family of succinctly representable and labeled networks, called the Hamming cubes (HCs). These networks can recursively grow from the existing ones with the increment of one node at a time, have half of logarithmic diameter and are easily decomposable. Simple routing schemes are designed for Hamming cubes, which are optimally fault-tolerant since the node-connectivity is equal to the minimum degree. With respect to several topological and performance parameters, Hamming cubes are strong competitors of binary hypercubes or folded hypercubes.

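For comparison with the Hamming-cube routing schemes above, classic hypercube bit-correction routing, which also follows Hamming distance, looks as follows; the authors' HC-specific scheme is not spelled out in the abstract, so this is only the familiar baseline:

```python
def bit_fix_route(src, dst, dim):
    """Greedy bit-correction routing on a dim-dimensional hypercube:
    flip differing address bits low-to-high. The path length equals
    the Hamming distance between src and dst."""
    path = [src]
    cur = src
    for b in range(dim):
        if (cur ^ dst) & (1 << b):  # bit b differs: cross that dimension
            cur ^= (1 << b)
            path.append(cur)
    return path
```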
  • An evaluation of multiprocessor cache coherence based on virtual memory support

    Page(s): 158 - 164

    This paper presents an evaluation of the impact of several architectural parameters on the performance of virtual memory (VM) based cache coherence schemes for shared-memory multiprocessors. The VM-based cache coherence schemes use the traditional VM translation hardware on each processor to detect memory access attempts that might leave caches incoherent, and maintain coherence through VM-level system software. The implementation of this class of coherence schemes is flexible and economical: it allows different consistency models, requires no special hardware for multiprocessor cache coherence, and supports arbitrary interconnection networks. We used trace-driven simulations to evaluate the effect of the architectural parameters on the performance of the VM-based schemes. These parameters include VM page sizes, write-back and write-through caches, memory access latencies, bus and crossbar interconnections, and different cache sizes. Our results show that VM-based cache coherence can be a very practical approach for building shared-memory multiprocessors.

  • Designing a parallel debugger for portability

    Page(s): 909 - 914

    The growing variety of parallel computers has made it difficult to design portable tools for parallel programs. The article shows how an interactive visualization tool can be designed to work with a variety of parallel machines. The design includes a strategy for adapting to differences in the interfaces and capabilities of the low-level debuggers supplied by hardware vendors. The tool uses these debuggers to perform basic tasks like setting breakpoints and examining variables. By dividing each interaction between the visualization tool and the “base debugger” into a sequence of customizable steps, one can write code that adapts cleanly and efficiently to differences in the debuggers. This design has been implemented in the Panorama parallel debugger, which runs on several message-passing multicomputers.

  • Solving the all-pair shortest path problem on interval and circular-arc graphs

    Page(s): 224 - 228

    We show that, given the interval model of an unweighted n-vertex interval graph, the information for the all-pair shortest paths can be made available very efficiently, both in parallel and sequentially. After sorting the input intervals by their endpoints, an O(n) space data structure can be constructed optimally in parallel, in O(log n) time using O(n/log n) CREW PRAM processors. Using the data structure, a query on the length of the shortest path between any two input intervals can be answered in O(1) time using one processor, and a query on the actual shortest path can be answered in O(1) time using k processors, where k is the number of intervals on that path. Our parallel algorithm immediately implies a new sequential result: after an O(n) time preprocessing, shortest paths can be reported optimally. Our techniques can be extended to solve the problem on circular-arc graphs, both in parallel and sequentially, in the same complexity bounds. The previously best known sequential algorithm for computing the all-pair shortest paths in interval graphs takes O(n^2) time and uses O(n^2) space to store the lengths of the all-pair shortest paths.

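A naive O(n²) baseline for shortest paths on an interval graph is plain breadth-first search over the overlap relation (Python sketch for checking answers on small inputs; the paper's contribution is an O(n)-space structure answering such queries in O(1) after near-linear preprocessing):

```python
from collections import deque

def overlap(a, b):
    """Two closed intervals (l, r) are adjacent iff they intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

def interval_dist(intervals, s, t):
    """Hop count of the shortest path from interval s to interval t
    in the overlap graph, via BFS. O(n^2); baseline only."""
    n = len(intervals)
    d = [None] * n
    d[s] = 0
    q = deque([s])
    while q:
        u = q.popleft()
        for v in range(n):
            if d[v] is None and overlap(intervals[u], intervals[v]):
                d[v] = d[u] + 1
                q.append(v)
    return d[t]
```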
  • Building multithreaded architectures with off-the-shelf microprocessors

    Page(s): 288 - 294

    Present day parallel computers often face the problems of large software overheads for process switching and inter-processor communication. These problems are addressed by the Multi-Threaded Architecture (MTA), a multiprocessor model designed for efficient parallel execution of both numerical and non-numerical programs. We begin with a conventional processor, and add the minimal external hardware necessary for efficient support of multithreaded programs. The article begins with the top-level architecture and the program execution model. The latter includes a description of activation frames and thread synchronization. This is followed by a detailed presentation of the processor. Major features of the MTA include the Register-Use Cache for exploiting temporal locality in multiple register set microprocessors, support for programs requiring non-determinism and speculation, and local function invocations which can utilize registers for parameter passing.

  • Extracting parallelism in Fortran by translation to a single assignment intermediate form

    Page(s): 329 - 334

    The paper presents MUSTANG, a system for translating Fortran to single assignment form in an effort to automatically extract parallelism. Specifically, a sequential Fortran source program is translated into IF1, a machine-independent dataflow graph description language that is the intermediate form for the SISAL language. During this translation, Parafrase 2 is used to detect opportunities for parallelization which are then explicitly introduced into the IF1 program. The resulting IF1 program is then processed by the Optimizing SISAL Compiler which produces parallel executables on multiple target platforms. The execution results of several Livermore Loops are presented and compared against Fortran and SISAL implementations on two different platforms. The results show that the translation is an efficient method for exploiting parallelism from the sequential Fortran source code.

  • On evil twin networks and the value of limited randomized routing

    Page(s): 566 - 575

    A dynamic 2-stage Delta network (N inputs and outputs) is introduced and analyzed for permutation routing. The notion of evil twins is introduced, and a deterministic procedure is given to route any permutation in no more than 2×N^(1/4) network cycles. Two limited randomized routing schemes are then given; for any input permutation they require on average at most N_A + 1 + 1/N network cycles (where N_A = O(log N/log(log N + 1)) is the greatest integer such that (N_A)! ⩽ N) and ⌈log(log N + 1)⌉ + 2 + 1/N network cycles, respectively. The probability of any permutation requiring at least c network cycles more than the average bounds above is at most 1/(c+1)! and 1/(N·2^c), respectively.

  • Parallel benchmarks on the Transtech Paramid

    Page(s): 694 - 699

    This paper presents the results of running some benchmarks from the Genesis suite on the Transtech Paramid. The benchmarks use the PARMACS parallel processing standard, and are based on applications in the fields of general relativity, molecular dynamics and QCD. The Paramid is a distributed-memory parallel computer using up to 64 Intel i860-XP processors. The results demonstrate good parallel performance, and the ability of the machine to run standard portable software.

  • All-to-all communication on meshes with wormhole routing

    Page(s): 561 - 565

    Describes several algorithms to perform all-to-all communication on a two-dimensional mesh connected computer with wormhole routing. The authors discuss both direct algorithms, in which data is sent directly from source to destination processor, and indirect algorithms, in which data is sent through one or more intermediate processors. The authors propose algorithms for both power-of-two and non-power-of-two meshes as well as an algorithm which works for any arbitrary mesh. They have developed analytical models to estimate the performance of the algorithms on the basis of system parameters. Performance results obtained on the Intel Touchstone Delta are compared with the estimated values.

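One standard direct schedule for power-of-two process counts is XOR pairwise exchange: in step k, process i exchanges blocks with process i XOR k, so after p-1 steps every pair has communicated exactly once. Whether this matches the authors' exact direct algorithms is not stated in the abstract; the sketch below only simulates the pattern:

```python
def all_to_all_direct(send):
    """send[p][q] is the block process p holds for process q.
    Simulates the direct XOR-exchange schedule and returns recv,
    where recv[q][p] is what q received from p.
    Assumes the process count is a power of two."""
    p = len(send)
    recv = [[None] * p for _ in range(p)]
    for i in range(p):
        recv[i][i] = send[i][i]          # local block, no communication
    for k in range(1, p):                # p-1 communication steps
        for i in range(p):
            j = i ^ k                    # i's partner in step k
            recv[j][i] = send[i][j]      # i ships its block destined for j
    return recv
```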
  • A multipath contention model for analyzing job interactions in 2-D mesh multicomputers

    Page(s): 744 - 751

    The heterogeneous multipath contention model is a representation of arbitrarily overlapped communication paths of jobs that have different message injection rates. Based on this model, we analyze the degradation of communication performance due to multiple interacting jobs in a 2D mesh wormhole-routed multicomputer system. One problem we address is computing the real contention delay seen by a message on a path in the model. A divide-and-conquer strategy divides the problem into several manageable problems of computing the real contention delay for the heterogeneous 2-path contention model. In order to verify the analysis of our model, we compared our analytic results with a simulation model.

  • Latency hiding in message-passing architectures

    Page(s): 704 - 709

    The paper demonstrates the advantages of having two processors in the node of a distributed memory architecture, one for computation and one for communication. The architecture of such a dual-processor node is discussed. To exploit fully the potential for parallel execution of computation threads and communication threads, a novel, compiler-optimized IPC mechanism allows for an unbuffered no-wait send and a prefetched receive without the danger of semantics violation. It is shown how an optimized parallel operating system can be constructed such that the application processor's involvement in communication is kept to a minimum while the utilization of both processors is maximized. The MANNA implementation results in an effective message start-up latency of only 1 to 4 microseconds. It is also shown how the dual-processor node is utilized to efficiently realize virtual shared memory.

  • The Oracle media server for nCUBE massively parallel systems

    Page(s): 670 - 673

    Information users today are hampered by inadequate access to the information they want and need: broadcast and cable TV offer inflexible scheduling, while dial-up services are still fairly primitive, offering poor visual quality and arcane user interfaces. Videotape rental is time-consuming and frustrating, and on-line shopping is in its infancy. The network infrastructure now exists to deliver vastly improved versions of these services. What has been lacking is a server system sophisticated enough to cost-effectively pump information to the delivery networks. Together, Oracle and nCUBE provide a high-performance, cost-effective solution that enables information providers and users to realize the benefits of interactive multimedia services.
