
Proceedings of the Fourth IEEE International Symposium on High Performance Distributed Computing, 1995

Date: 2-4 Aug. 1995


Displaying Results 1 - 25 of 29
  • Proceedings of the Fourth IEEE International Symposium on High Performance Distributed Computing

  • Author index

  • Portable checkpointing and recovery

    Page(s): 188 - 195

    This paper presents a checkpointing scheme implemented in a parallel library that runs on top of CHIMP/MPI. The main goals of the checkpointing mechanism are portability and efficiency: it runs on every platform supported by MPI in a machine-independent way. The scheme allows the migration of checkpoints and offers a flexible recovery mechanism based on data reconfiguration. Performance results are presented at the end of the paper, together with techniques that can be used to increase the efficiency of the checkpointing mechanism.

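The portability goal above largely comes down to writing checkpoint data in a machine-independent representation. The sketch below illustrates only that idea; it is not the paper's CHIMP/MPI library, and the file layout and function names are invented for illustration. State is serialized with an explicit byte order so a checkpoint taken on one platform can be restored on another.

```python
# Minimal sketch of machine-independent checkpointing (not the paper's
# CHIMP/MPI library): state is written with a fixed, explicit byte order.
import struct

def write_checkpoint(path, iteration, values):
    """Save an iteration counter and a list of floats in little-endian form."""
    with open(path, "wb") as f:
        f.write(struct.pack("<I", iteration))
        f.write(struct.pack("<I", len(values)))
        f.write(struct.pack("<%dd" % len(values), *values))

def read_checkpoint(path):
    """Restore the state written by write_checkpoint, on any platform."""
    with open(path, "rb") as f:
        (iteration,) = struct.unpack("<I", f.read(4))
        (n,) = struct.unpack("<I", f.read(4))
        values = list(struct.unpack("<%dd" % n, f.read(8 * n)))
    return iteration, values

if __name__ == "__main__":
    write_checkpoint("state.ckpt", 42, [0.5, 1.25, -3.0])
    print(read_checkpoint("state.ckpt"))   # (42, [0.5, 1.25, -3.0])
```
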
  • A multithreaded message passing environment for ATM LAN/WAN

    Page(s): 238 - 245

    Large scale High Performance Computing and Communication (HPCC) applications (e.g. Video-on-Demand and HPDC) require storage and processing capabilities beyond those of existing single computer systems. Current advances in networking technology (e.g. ATM) have made high performance network computing an attractive environment for such applications. However, a high speed network alone is not sufficient to achieve a high performance distributed computing environment unless several hardware and software problems are resolved. These problems include the limited communication bandwidth available to the application, the high overhead associated with context switching, redundant data copying during protocol processing, and the lack of support for overlapping computation and communication at the application level. In this paper, we propose a multithreaded message passing system for parallel/distributed processing that we refer to as the NYNET communication system (NCS). NCS, being developed for NYNET (an ATM wide area network testbed), is built on top of an ATM application programmer interface (API). The multithreaded environment allows applications to overlap computation and communication and provides a modular approach to efficiently support HPDC applications with different quality of service (QOS) requirements.

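The core idea NCS relies on, overlapping communication with computation via threads, can be shown with a generic sketch. This is not NCS or its ATM API; the "send" below is only a simulated delay.

```python
# Generic sketch of overlapping communication with computation using a
# separate thread; the "send" is simulated with a sleep.
import threading
import time

def send_message(payload, done):
    time.sleep(0.2)                         # pretend this is a blocking network send
    done.append(len(payload))

def main():
    done = []
    sender = threading.Thread(target=send_message, args=(b"x" * 4096, done))
    sender.start()                          # communication proceeds in the background

    partial = sum(i * i for i in range(100000))   # useful work done meanwhile

    sender.join()                           # wait for the send to finish
    print("computed", partial, "while sending", done[0], "bytes")

if __name__ == "__main__":
    main()
```
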
  • Efficient causally ordered communications for multimedia real-time applications

    Page(s): 140 - 147

    Multimedia real-time collaborative applications, or groupware real-time applications, require participants to exchange real-time audio and video information over a communication network. This flow of information must preserve causal dependency even though part of the information can be lost, or can be discarded if it violates the timing constraints imposed by a real-time interaction. In this paper we propose a communication abstraction to cope with unreliable communication networks with real-time delivery constraints: messages have a lifetime, Δ, after which their contents can no longer be used; moreover, some of them can be lost. This new abstraction, called Δ-causal order, requires delivering as many messages as possible within their lifetime in such a way that these deliveries respect causal order. An efficient protocol is proposed for the case of one-to-one communications. A variation of this protocol well suited to broadcast communications is also shown.

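A toy sketch of the Δ-causal delivery rule described above: a message is dropped once its lifetime Δ has elapsed, and is delivered as soon as every causal predecessor has been delivered or dropped. This is not the paper's protocol (which handles one-to-one and broadcast communication without explicit dependency sets); dependencies are passed in explicitly here to keep the example short.

```python
# Toy sketch of delta-causal delivery (not the paper's protocol):
# drop a message once its lifetime Delta has elapsed, deliver it as soon as
# every causal predecessor has been either delivered or dropped.
class DeltaCausalReceiver:
    def __init__(self, delta):
        self.delta = delta
        self.delivered = set()
        self.dropped = set()
        self.pending = []        # (msg_id, deps, deadline, payload)

    def receive(self, msg_id, deps, send_time, payload, now):
        self.pending.append((msg_id, set(deps), send_time + self.delta, payload))
        self.flush(now)

    def flush(self, now):
        progress = True
        while progress:
            progress = False
            for entry in list(self.pending):
                msg_id, deps, deadline, payload = entry
                if now > deadline:                            # lifetime exceeded
                    self.pending.remove(entry)
                    self.dropped.add(msg_id)
                    progress = True
                elif deps <= self.delivered | self.dropped:   # predecessors settled
                    self.pending.remove(entry)
                    self.delivered.add(msg_id)
                    print("deliver", msg_id, payload)
                    progress = True

r = DeltaCausalReceiver(delta=1.0)
r.receive("m2", deps={"m1"}, send_time=0.0, payload="video frame 2", now=0.5)
r.receive("m1", deps=set(), send_time=0.0, payload="video frame 1", now=0.6)
# m1 arrives within its lifetime, so m1 and then m2 are delivered in causal order.
```
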
  • Performance benefits of optimistic programming: a measure of HOPE

    Page(s): 197 - 204

    Optimism is a powerful technique for avoiding latency by increasing concurrency. By optimistically assuming the results of some computation, other computations can be executed in parallel, even when they depend on the assumed result. Optimistic techniques can be particularly beneficial to parallel and distributed systems because of the critical impact of inter-node communication latency. This paper describes how optimism can be used to enhance the performance of distributed programs by avoiding remote communication delay. We then present a new programming model that automates much of the difficulty of using optimistic techniques in a general programming environment, and describe a prototype implementation. Finally, we present performance measurements showing how optimism improved the performance of a test application in this environment.

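The basic optimistic pattern the paper builds on can be sketched as follows. This is not the HOPE system; a thread pool stands in for a remote node, and the guessed value is arbitrary.

```python
# Sketch of the optimistic pattern (not the HOPE system): guess the result of
# a slow "remote" call, continue with the guess, and redo the dependent work
# only if the real result contradicts the guess.
from concurrent.futures import ThreadPoolExecutor
import time

def remote_lookup(key):
    time.sleep(0.3)                 # stands in for remote communication delay
    return {"quota": 100}.get(key, 0)

def dependent_work(value):
    return value * 2                # cheap computation that uses the result

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(remote_lookup, "quota")

    guess = 100                     # optimistic assumption about the result
    speculative = dependent_work(guess)

    actual = future.result()        # the real answer arrives later
    if actual == guess:
        result = speculative        # speculation paid off: no extra latency
    else:
        result = dependent_work(actual)   # rollback: recompute with the truth

print(result)
```
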
  • A performance comparison of RAID-5 and log-structured arrays

    Page(s): 167 - 178

    In this paper, we compare the performance of well-known RAID-5 arrays to that of log-structured arrays (LSA) on transaction-processing workloads. LSA borrows heavily from the log-structured file system (LFS) approach, but is executed in an outboard disk controller. The LSA technique we examine combines LFS, RAID, compression and a non-volatile cache. We examine the sensitivity of LSA performance to the amount of free space on the physical disks and to the compression ratio achieved. We also evaluate a RAID-5 design that supports compression in the cache.

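The RAID-5 side of this comparison is dominated by the small-write penalty: updating one data block requires reading the old data and parity and writing both back. A minimal sketch of the parity update rule, with equal-length byte strings standing in for disk blocks (an assumption made for brevity):

```python
# RAID-5 small-write parity update: new_parity = old_parity XOR old_data XOR new_data.
def xor_blocks(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(old_data, new_data, old_parity):
    """Read-modify-write: two reads (old data, old parity) and two writes."""
    return xor_blocks(xor_blocks(old_parity, old_data), new_data)

# Three data blocks and their parity.
d = [b"\x01\x02", b"\x10\x20", b"\x0f\x0f"]
parity = xor_blocks(xor_blocks(d[0], d[1]), d[2])

new_d0 = b"\xff\x00"
parity = small_write(d[0], new_d0, parity)
d[0] = new_d0

# Parity still equals the XOR of the current data blocks.
assert parity == xor_blocks(xor_blocks(d[0], d[1]), d[2])
```
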
  • An ATM-based multimedia integrated manufacturing system

    Page(s): 230 - 237

    Along with the emergence of high-speed communication technologies and protocols, tools for multimedia applications are growing in variety and performance, making multimedia data flows candidates for integration into almost every computer system. We focus in particular on the factory plant, where audio and video equipment could soon take part in the control and monitoring of manufacturing processes. Our paper presents a novel architecture for real-time distributed systems using a 155.52 Mbit/s ATM network. The system is based on an intelligent network interface board, which integrates a processor and a field programmable gate array component to implement flexible high-level real-time data services in an off-host implementation approach. Moreover, the embedded ATM switch allows a large variety of topologies to be built and provides high-speed communication links, in particular for audio and video streams, inside the workstation. We also propose an MMS-like multimedia application service element for the integrated service computer manufacturing system and present some target applications.

  • CALYPSO: a novel software system for fault-tolerant parallel processing on distributed platforms

    Page(s): 122 - 129

    The importance of adapting networks of workstations for use as parallel processing platforms is well established. However, current solutions do not always address important issues that exist in real networks. External factors such as the sharing of resources, unpredictable network behavior and failures are present in multiuser networks and must be addressed. CALYPSO is a prototype software system for writing and executing parallel programs on non-dedicated platforms, based on COTS networked workstations, operating systems, and compilers. Notable properties of the system are: (1) a simple programming paradigm incorporating shared memory constructs and separating programming parallelism from execution parallelism, (2) transparent utilization of unreliable shared resources through dynamic load balancing and fault tolerance, and (3) effective performance for large classes of coarse-grained computations. We present the system and report our initial experiments and performance results in settings that closely resemble the dynamic behavior of a “real” network. Under varying workload conditions, resource availability and process failures, the efficiency of the test program we present ranged from 84% to 94%, benchmarked against a sequential program.

  • A spanning tree based recursive refinement algorithm for fast task mapping

    Page(s): 58 - 65

    An early version of recursive refinement, an algorithm for fast task mapping, is reported in this paper. An intended application for this algorithm is dynamic load balancing on a parallel/distributed system, including a network of workstations. Since this requires fast mapping, the algorithm restricts the range of task movement during optimization and maps groups of tasks to groups of PEs in each iteration. Tasks are allowed to be swapped only between the two partitions that comprise the parent partition from the previous iteration. More freedom of task movement is unnecessary because structural characteristics of the problem graph are used to guide the mapping. The groups are derived from a spanning tree that is constructed from the original problem graph and is used to identify the structural characteristics required by the algorithm. Experimental results show that the algorithm achieves a mapping quality as good as that of another mapping scheme, representative of a class of algorithms that require an order of magnitude more time.

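A heavily simplified sketch of spanning-tree-guided mapping, for orientation only: it omits the paper's swap-based refinement and its restriction on task movement, and simply linearizes the task graph with a BFS spanning tree before splitting that order recursively over groups of PEs.

```python
# Hedged sketch of spanning-tree-guided recursive mapping (not the paper's
# algorithm): a BFS spanning tree linearizes the task graph so neighbouring
# tasks stay close together, and the order is split recursively over PE groups.
from collections import deque

def bfs_order(graph, root):
    order, seen, queue = [], {root}, deque([root])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in graph[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return order

def recursive_map(tasks, pes):
    if len(pes) == 1:
        return {t: pes[0] for t in tasks}
    half_t, half_p = len(tasks) // 2, len(pes) // 2
    mapping = recursive_map(tasks[:half_t], pes[:half_p])
    mapping.update(recursive_map(tasks[half_t:], pes[half_p:]))
    return mapping

# A small task graph: a 2x3 grid of communicating tasks mapped onto two PEs.
graph = {0: [1, 3], 1: [0, 2, 4], 2: [1, 5],
         3: [0, 4], 4: [1, 3, 5], 5: [2, 4]}
print(recursive_map(bfs_order(graph, 0), pes=[0, 1]))
```
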
  • Communication overhead for space science applications on the Beowulf parallel workstation

    Page(s): 23 - 30

    The Beowulf parallel workstation combines 16 PC-compatible processing subsystems and disk drives using dual Ethernet networks to provide a single-user environment with 1 Gops peak performance, half a Gbyte of disk storage, and up to 8 times the disk I/O bandwidth of conventional workstations. The Beowulf architecture establishes a new operating point in price-performance for single-user environments requiring high disk capacity and bandwidth. The Beowulf research project is investigating the feasibility of exploiting mass-market commodity computing elements in support of Earth and space science requirements for large data-set browsing and visualization, simulation of natural physical processes, and assimilation of remote sensing data. This paper reports the findings from a series of experiments for characterizing the Beowulf dual-channel communication overhead. It is shown that dual networks can sustain 70% greater throughput than a single network alone, but that the bandwidth achieved is more sensitive to message size than to the number of messages at peak demand. While overhead is shown to be high for global synchronization, its overall impact on the scalability of real-world applications for computational fluid dynamics and N-body gravitational simulation is shown to be modest.

  • TCP/ATM experiences in the MAGIC testbed

    Page(s): 87 - 93

    This paper describes performance measurements taken in the MAGIC gigabit testbed relating to the performance of TCP in wide area ATM networks. The behavior of TCP with and without cell-level pacing is studied. In particular, we focus on results indicating that the TCP rate control mechanism alone is inadequate for congestion avoidance and control in wide-area gigabit networks. We also present results showing that TCP augmented by cell-level pacing addresses these problems and allows the full bandwidth capacity to be utilized. These results demonstrate the viability of high performance distributed systems based on wide area ATM networks, given the proper ATM traffic management infrastructure.

  • Disk-directed I/O for an out-of-core computation

    Page(s): 159 - 166

    New file systems are critical to obtain good I/O performance on large multiprocessors. Several researchers have suggested the use of collective file-system operations, in which all processes in an application cooperate in each I/O request. Others have suggested that the traditional low-level interface (read, write, seek) be augmented with various higher-level requests (e.g., read matrix). Collective, high-level requests permit a technique called disk-directed I/O to significantly improve performance over traditional file systems and interfaces, at least on simple I/O benchmarks. In this paper we present the results of experiments with an “out-of-core” LU-decomposition program. Although its collective interface was awkward in some places, and forced additional synchronization, disk-directed I/O was able to obtain much better overall performance than the traditional system.

  • TPVM: distributed concurrent computing with lightweight processes

    Page(s): 211 - 218

    The TPVM (Threads-oriented PVM) system is an experimental auxiliary subsystem for the PVM distributed system which supports the use of lightweight processes or “threads” as the basic unit of parallelism and scheduling. TPVM provides a library interface that presents both a traditional, task-based, explicit message passing model and a data-driven scheduling model that enables straightforward specification of computations based on data dependencies. Our system design is still under development, but a prototype implementation has allowed us to perform a number of preliminary experiments. These have provided strong evidence that TPVM can offer improved performance, processor utilization, and load balance to several application categories. Through our experiments we have also determined that the current TPVM design is not very well suited to certain types of applications, most notably highly synchronous, SPMD-style algorithms.

  • Network shared memory: a new approach for clustering workstations for parallel processing

    Page(s): 48 - 56

    In this paper, we describe a new approach for clustering workstations into a dedicated, medium-sized, shared memory parallel processor. This new approach, called the network shared memory (NSM) approach, is based upon a new way of looking at the role of communication networks in a multi-computer system. We develop an implementation model of the architecture of an NSM-based workstation cluster. This model serves as the basis for simulations that we use to assess the performance of NSM-based workstation clusters. We also use simulations to evaluate the performance of architectures representative of existing approaches for workstation clustering, as well as architectures representative of commercial symmetric multi-processors. The results of the performance assessment show that the NSM approach outperforms existing approaches for clustering workstations.

  • Nimrod: a tool for performing parametrised simulations using distributed workstations

    Page(s): 112 - 121

    This paper discusses Nimrod, a tool for performing parametrised simulations over networks of loosely coupled workstations. Using Nimrod the user interactively generates a parametrised experiment; Nimrod then controls the distribution of jobs to machines and the collection of results. A simple graphical user interface, built for each application, allows the user to view the simulation in terms of the problem domain. The current version of Nimrod is implemented above OSF DCE and runs on DEC Alpha and IBM RS6000 workstations (including a 22-node SP2). Two different case studies are discussed as an illustration of the utility of the system.

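The parametrised-experiment idea can be sketched generically: every combination of parameter values becomes one job, and jobs are farmed out to workers. This is not Nimrod itself (which runs over OSF DCE and dispatches jobs to workstations); the worker processes and the toy run_simulation function below are stand-ins.

```python
# Sketch of a parametrised experiment: enumerate all parameter combinations
# and farm them out to a pool of worker processes standing in for workstations.
from itertools import product
from multiprocessing import Pool

def run_simulation(params):
    angle, speed = params
    # A stand-in for the user's simulation executable.
    return {"angle": angle, "speed": speed, "range": speed * angle / 10.0}

if __name__ == "__main__":
    angles = [15, 30, 45]
    speeds = [100, 200]
    jobs = list(product(angles, speeds))        # the parametrised experiment

    with Pool(processes=4) as pool:             # distribute jobs, collect results
        results = pool.map(run_simulation, jobs)

    for r in results:
        print(r)
```
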
  • Multimedia intra-group communication protocol

    Page(s): 180 - 187

    In distributed applications, a group of application processes has to send and receive multimedia messages over a high-speed network. A multimedia message at the application level is decomposed into multiple smaller packets which are transmitted by the communication system. In this paper, we discuss the atomic and ordered delivery of messages at the application level rather than the system level. In some multimedia applications, the application processes do not mind if some packets are lost or in what order packets from different processes are received. The application process specifies a minimum receipt ratio ε (⩽1) indicating what fraction of the data in each message the destination processes must receive at a minimum. The communication system delivers the packets to the destinations in the group so as to satisfy the receipt ratio ε. The protocol is based on a fully distributed control scheme, i.e. there is no master controller.

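A toy sketch of the receipt-ratio rule, not the paper's protocol: a message split into n packets is handed to the application once the fraction of packets received reaches the requested ε, even if the remaining packets are lost.

```python
# Deliver a message once the received fraction of its packets reaches epsilon.
class MessageAssembler:
    def __init__(self, total_packets, epsilon):
        self.total = total_packets
        self.epsilon = epsilon
        self.received = {}
        self.delivered = False

    def on_packet(self, seq, data):
        self.received[seq] = data
        if not self.delivered and len(self.received) / self.total >= self.epsilon:
            self.delivered = True
            print("deliver message with %d/%d packets" % (len(self.received), self.total))

m = MessageAssembler(total_packets=10, epsilon=0.8)
for seq in [0, 1, 2, 4, 5, 6, 7, 9]:     # packets 3 and 8 were lost
    m.on_packet(seq, b"...")
# The message is delivered after the 8th packet arrives (8/10 >= 0.8).
```
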
  • A high speed implementation of adaptive shaping for dynamic bandwidth allocation

    Page(s): 94 - 101

    Most algorithms proposed for controlling traffic before it enters ATM networks are based on static mechanisms. Such static control mechanisms do not account for the dynamics of the user traffic or the network state. Some dynamic control algorithms have been proposed, but most of them are extremely complex, which makes it difficult to provide real-time control. In this paper, we present an adaptive rate control algorithm that has been implemented in hardware. The algorithm controls the traffic submitted by a source based on the indirectly observed average rate and burst size for the source. The algorithm is highly efficient and thereby provides real-time control at high speed. Our implementation, in concert with flow control in the local area network, provides the basis for ATM-based high performance distributed systems.

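The paper's algorithm is implemented in hardware; the software sketch below only illustrates the general idea of adapting a shaping rate to an observed average. The EWMA update, the rate bounds and the token-bucket details are assumptions for illustration, not the published algorithm.

```python
# Hedged sketch of adaptive shaping: a token bucket whose refill rate tracks an
# exponentially weighted average of the observed rate, within fixed bounds.
class AdaptiveShaper:
    def __init__(self, min_rate, max_rate, bucket_size, alpha=0.2):
        self.min_rate, self.max_rate = min_rate, max_rate
        self.bucket_size = bucket_size
        self.alpha = alpha
        self.rate = min_rate          # current refill rate, cells per second
        self.avg_rate = min_rate      # EWMA estimate of the source's rate
        self.tokens = bucket_size
        self.last_time = 0.0

    def offer(self, now, cells):
        """Source offers `cells` at time `now`; return how many may enter the network."""
        elapsed = max(now - self.last_time, 1e-9)
        observed = cells / elapsed
        # Update the average-rate estimate and adapt the shaping rate to it.
        self.avg_rate = (1 - self.alpha) * self.avg_rate + self.alpha * observed
        self.rate = min(self.max_rate, max(self.min_rate, self.avg_rate))
        # Refill tokens at the adapted rate and admit what the bucket allows.
        self.tokens = min(self.bucket_size, self.tokens + self.rate * elapsed)
        admitted = min(cells, int(self.tokens))
        self.tokens -= admitted
        self.last_time = now
        return admitted

s = AdaptiveShaper(min_rate=100, max_rate=1000, bucket_size=50)
print(s.offer(now=1.0, cells=30), s.offer(now=1.1, cells=80))
```
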
  • A portable distributed implementation of the parallel multipole tree algorithm

    Page(s): 17 - 22

    Several variants of parallel multipole-based algorithms have been implemented to further research in fields such as computational chemistry and astrophysics. We present a distributed parallel implementation of a multipole-based algorithm that is portable to a wide variety of applications and parallel platforms. Performance data are presented for loosely coupled networks of workstations as well as for more tightly coupled distributed multiprocessors, demonstrating the portability and scalability of the application to a large number of processors.

  • Loop scheduling for heterogeneity

    Page(s): 78 - 85

    In this paper we study the problem of scheduling parallel loops at compile time for a heterogeneous network of machines. We consider heterogeneity in three aspects of parallel programming: program, processor and network. A heterogeneous program has parallel loops with a different amount of work in each iteration; heterogeneous processors have different speeds; and a heterogeneous network has different communication costs between processors. We propose a simple yet comprehensive model for use in compiling for a network of processors, and develop compiler algorithms for generating optimal and sub-optimal schedules of loops that account for load balancing, communication optimization and network contention. Experiments show that a significant performance improvement is achieved using our techniques.

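One ingredient of such a model, speed-proportional partitioning of loop iterations, can be sketched in a few lines. Communication cost and network contention, which the paper's algorithms also handle, are deliberately ignored here.

```python
# Speed-proportional loop partitioning: split iterations among processors in
# proportion to their relative speeds (communication cost ignored).
def partition_iterations(n_iterations, speeds):
    total = sum(speeds)
    shares = [int(n_iterations * s / total) for s in speeds]
    # Hand out any leftover iterations to the fastest processors first.
    leftover = n_iterations - sum(shares)
    for i in sorted(range(len(speeds)), key=lambda i: -speeds[i])[:leftover]:
        shares[i] += 1
    bounds, start = [], 0
    for share in shares:
        bounds.append((start, start + share))   # [start, end) for each processor
        start += share
    return bounds

# Three processors with relative speeds 1 : 2 : 4 share 700 iterations.
print(partition_iterations(700, speeds=[1, 2, 4]))
# -> [(0, 100), (100, 300), (300, 700)]
```
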
  • Indigo: user-level support for building distributed shared abstractions

    Page(s): 130 - 137

    Distributed systems that consist of workstations connected by high performance interconnects offer computational power comparable to moderate-size parallel machines. It is desirable that such workstation clusters can also be programmed the same way as shared memory machines. We develop a portable, user-level library, called Indigo, that can be used to program a variety of state sharing techniques. In particular, Indigo can be used to program DSM protocols as well as distributed shared abstractions where objects can be fragmented/replicated and consistency actions are customized according to application needs. We present an evaluation of Indigo by using its calls to implement a distributed shared memory system as well as shared abstractions for a number of applications.

  • Hybrid media access protocols for a DSM system based on optical WDM networks

    Page(s): 40 - 47

    Scalable, hierarchical, all-optical wavelength division multiplexed (WDM) networks for interconnection in distributed cluster-based computing systems have recently been considered. Hybrid access protocols combining reservation and pre-allocation have been studied for this type of network, which supports a distributed shared memory (DSM) environment. The objectives of the protocols are reduced average latency per packet, support of broadcast/multicast, collisionless communication, and exploitation of inherent DSM traffic characteristics. This paper compares random and static access strategies on the control channel used to establish reservations for data packets. Random access provides reduced packet latency under light traffic conditions and has a simpler implementation. Static access is free of collisions and instability but has longer control cycle lengths. The performance of the network is analyzed through simulation models with varying system parameters such as the number of nodes and channels. Dynamic schemes which switch between random and static access, and vice versa, are also considered.

  • The performance impact of scheduling for cache affinity in parallel network processing

    Page(s): 66 - 77

    We explore processor-cache affinity scheduling of parallel network protocol processing, in a setting in which protocol processing executes on a shared-memory multiprocessor concurrently with a general workload of non-protocol activity. We find that affinity-based scheduling can significantly reduce the communication delay associated with protocol processing, enabling the host to support a greater number of concurrent streams and to provide higher maximum throughput to individual streams. In addition, we compare the performance of two parallelization alternatives, locking and independent protocol stacks (IPS), which have very different caching behaviors. We find that IPS (which maximizes cache affinity) delivers much lower message latency and significantly higher message throughput capacity, yet exhibits a less robust response to intra-stream burstiness and limited intra-stream scalability.

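The affinity idea being measured can be sketched with a simple dispatcher that binds each stream to a fixed worker, so protocol processing for a stream keeps hitting the same cache instead of running on whichever worker happens to be free. This is an illustration, not the paper's multiprocessor implementation.

```python
# Affinity dispatch: packets for the same stream always go to the same worker
# (stable within one run), rather than to any free worker.
NUM_WORKERS = 4
worker_queues = [[] for _ in range(NUM_WORKERS)]

def dispatch(packet):
    worker = hash(packet["stream_id"]) % NUM_WORKERS   # fixed stream -> worker binding
    worker_queues[worker].append(packet)

for i in range(8):
    dispatch({"stream_id": "tcp-%d" % (i % 3), "seq": i})

for w, queue in enumerate(worker_queues):
    print("worker", w, [p["stream_id"] for p in queue])
```
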
  • A versatile packet multiplexer for quality-of-service networks

    Page(s): 148 - 155

    A novel packet multiplexing technique, called rotating-priority-queues (RPQ), is presented which exploits the tradeoff between high efficiency, i.e., the ability to support many connections with delay bounds, and low complexity. The operations required by the RPQ multiplexer are similar to those of the simple, but inefficient, static-priority (SP) multiplexer. The overhead of RPQ, as compared to SP, consists of a periodic rearrangement (rotation) of the priority queues. It is shown that queue rotations can be implemented by updating a set of pointers. The efficiency of RPQ can be made arbitrarily close to that of the highly efficient, yet complex, earliest-deadline-first (EDF) multiplexer. Exact expressions for the worst case delays in an RPQ multiplexer are presented and compared to the corresponding expressions for an EDF multiplexer.

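A toy sketch of the rotating-priority-queues idea, simplified relative to the paper: packets sit in a ring of FIFO queues, priority order is defined by an offset into the ring, and a rotation just advances that offset (a pointer update), so queued packets gain priority as they age.

```python
# Simplified rotating-priority-queues: rotation is an index update, not a copy.
from collections import deque

class RPQMultiplexer:
    def __init__(self, num_queues):
        self.queues = [deque() for _ in range(num_queues)]
        self.top = 0                     # index of the current highest-priority queue

    def enqueue(self, packet, priority_class):
        # Class 0 is most urgent; classes are taken relative to the current top.
        index = (self.top + priority_class) % len(self.queues)
        self.queues[index].append(packet)

    def rotate(self):
        self.top = (self.top + 1) % len(self.queues)   # every waiting queue gains one level

    def dequeue(self):
        for k in range(len(self.queues)):
            q = self.queues[(self.top + k) % len(self.queues)]
            if q:
                return q.popleft()
        return None

m = RPQMultiplexer(num_queues=4)
m.enqueue("old packet", priority_class=2)
m.rotate(); m.rotate()                    # two rotations promote it to the top queue
m.enqueue("new urgent packet", priority_class=0)
print(m.dequeue(), m.dequeue())           # both now sit in the highest-priority queue
```
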
  • Parallel simulation of subsonic fluid dynamics on a cluster of workstations

    Page(s): 6 - 16

    An effective approach to simulating subsonic fluid dynamics on a cluster of non-dedicated workstations is presented. The approach is applied to simulate the flow of air in wind instruments. The use of local-interaction methods and coarse-grain decompositions leads to small communication requirements. The automatic migration of processes from busy hosts to free hosts enables the use of non-dedicated workstations. Simulations of 2D flow achieve 80% parallel efficiency (speedup/processors) using 20 HP-Apollo workstations. Detailed measurements of the parallel efficiency of 2D and 3D simulations are presented, and a theoretical model of efficiency is developed and compared against the measurements. Two numerical methods of fluid dynamics are tested: explicit finite differences and the lattice Boltzmann method.

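The efficiency figure quoted above is speedup divided by processor count. A quick check with made-up timings (the actual run times are not reported in the abstract):

```python
# Parallel efficiency = speedup / processors; the timings here are invented.
def parallel_efficiency(serial_time, parallel_time, processors):
    speedup = serial_time / parallel_time
    return speedup / processors

print(parallel_efficiency(serial_time=1600.0, parallel_time=100.0, processors=20))
# -> 0.8, i.e. 80% efficiency on 20 workstations with a speedup of 16
```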