
Advances in Parallel and Distributed Computing, 1997. Proceedings

Date 19-21 March 1997


  • Proceedings. Advances in Parallel and Distributed Computing

  • Author index

    Page(s): 425 - 426
  • Parallel replacement mechanism for multithread

    Page(s): 338 - 344

    This paper presents a new rapid thread replacement mechanism, which is important in multithread technology. Analysis of the memory system indicates that memory utilization decreases as the cache hit ratio increases. The parallelism between thread computation and thread replacement is found by analyzing their working processes. Based on these observations, we propose a rapid multithread replacement mechanism that overlaps thread replacement with thread computation. In particular, with a finite number of hardware contexts, this mechanism can play the same role as infinite contexts by tolerating the replacement overhead. By modifying the general thread switching model, we build a thread replacement model and evaluate the mechanism both theoretically and experimentally. Finally, we discuss the hardware implementation and identify problems to be resolved in the future.

  • Parallel matrix computations and their applications for biomagnetic fields

    Page(s): 139 - 142

    In this paper we present the results of a parallel implementation of a heart field simulation algorithm. Biomagnetic field applications offer wide scope for parallel algorithms. Pathological changes in the human body, especially in the heart muscle, can be diagnosed and localised by means of biomagnetic field parameters. The benefit of this diagnostic method is that an individual reference model of a patient's heart field can be fitted. Based on differences between the reference model and the measured biomagnetic field parameters, the type and position of defects in the heart can be located. The most time-consuming components of the whole algorithm are the matrix computations, especially the matrix inversion. The matrix inversion can be implemented on a parallel distributed memory system. We discuss the routing, the parallel matrix inversion, and the speedup for different network topologies as a function of the number of processors and the problem size.

  • Parallel solver of generalized eigenproblem on Dawning-1000

    Page(s): 144 - 148

    In this paper, we consider a parallel implementation of a solver for the generalized eigenproblem of Hermitian matrices on the Dawning-1000. The problem arises from the theoretical analysis of nonlinear optical crystal structures. We use Cholesky factorisation, Householder transformation, the bisection method, and inverse iteration to complete the computation. The implementation is based on the BLAS library and the communication function library provided on the Dawning-1000. The numerical results show very good performance, and the application in physics is satisfactory.

  • Implementation of efficient and reliable multicast servers

    Page(s): 253 - 260

    Reliable multicast services in a group of autonomous distributed processes/sites are desirable for maintaining a consistent state of shared information accessed by transactions in distributed systems. Many existing protocols are complicated, and thus expensive and inefficient with respect to the availability of distributed systems. This paper discusses the design and implementation of a new multicast communication service based on a logical token ring. It provides total ordering, atomicity of multicast messages, membership, and fault-tolerant services in the presence of fail-stop site failures and network partitioning. A unique feature of the protocol is that all members of the group know exactly who holds the token and are therefore able to detect the correct order of a multicast message, thereby reducing the synchronization overhead, preventing possible token loss problems, and minimizing control messages. The services are implemented using a finite state machine approach, and they are highly efficient compared with related services in the same network settings.

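As a rough illustration of the token-based ordering idea in the abstract above, the sketch below simulates a logical token ring in which only the current token holder may stamp and multicast messages, so every member delivers the same sequence. The class `TokenRingGroup`, its methods, and the in-memory delivery are assumptions made for illustration. The sketch omits membership changes, site failures, and network partitions, and it is not the protocol from the paper.

```python
from collections import defaultdict


class TokenRingGroup:
    """Toy logical token ring: only the token holder may stamp a multicast."""

    def __init__(self, members):
        self.members = list(members)
        self.holder_index = 0               # position of the token in the ring
        self.next_seq = 0                   # sequence number carried by the token
        self.pending = defaultdict(list)    # member -> messages waiting for the token
        self.delivered = {m: [] for m in self.members}

    def submit(self, sender, payload):
        """Queue a message; it is multicast once the sender holds the token."""
        self.pending[sender].append(payload)

    def rotate(self):
        """Let the current holder multicast its queue, then pass the token on."""
        holder = self.members[self.holder_index]
        for payload in self.pending.pop(holder, []):
            stamped = (self.next_seq, holder, payload)
            self.next_seq += 1
            # In this toy model delivery is atomic: every member gets the message.
            for member in self.members:
                self.delivered[member].append(stamped)
        self.holder_index = (self.holder_index + 1) % len(self.members)


group = TokenRingGroup(["A", "B", "C"])
group.submit("C", "c1")
group.submit("A", "a1")
group.submit("B", "b1")
group.submit("A", "a2")
for _ in range(3):          # one full trip of the token around the ring
    group.rotate()
```

Because the sequence number travels with the token, the submission order does not matter: all three members end up with an identical delivery list ordered by sequence number.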
  • Interaction nets revisited

    Page(s): 108 - 115

    Past attempts to apply Girard's linear logic to Lafont's interaction nets by treating “symbols” as logical rules have failed to yield a significant explanation. In this paper, we model “symbols” as external axioms and use “tensor” to describe the partition of auxiliary ports. We show that our solution leads to a very natural logical interpretation of computation on interaction nets.

  • Design and analysis of an efficient algorithm for coordinated checkpointing in distributed systems

    Page(s): 261 - 268

    A synchronous checkpointing algorithm coordinates a set of processes in taking checkpoints in such a way that the set of local checkpoints always forms part of a consistent global system state. Whenever a process p requests to take a checkpoint, a set of processes, called the cohort set of p, must be checked, and some of them may also have to take checkpoints in order to preserve system consistency. Although several synchronous checkpointing algorithms have been proposed in the literature, most of them do not address performance. In this paper we propose an efficient distributed algorithm for synchronous checkpointing. A proof of correctness and an analysis of the efficiency of the algorithm are presented. It is shown that the algorithm has better message and time complexity than existing algorithms. The proposed method can also be applied to enhance the performance of rollback operations, which always require synchronization of the inter-dependent processes.

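To make the notion of a consistent set of local checkpoints concrete, here is a minimal sketch of the usual orphan-message test: a set of checkpoints is inconsistent if some message is recorded as received but not as sent. The event-index representation and the function names `orphan_messages` and `is_consistent` are assumptions for illustration; this checks consistency only and is not the coordination algorithm proposed in the paper.

```python
def orphan_messages(checkpoint, messages):
    """Return the messages that make a set of local checkpoints inconsistent.

    checkpoint[p] is the number of local events of process p that are
    included in p's checkpoint. Each message is a tuple
    (sender, send_index, receiver, receive_index) of local event indices.
    """
    orphans = []
    for sender, send_index, receiver, receive_index in messages:
        received_before = receive_index < checkpoint[receiver]
        sent_before = send_index < checkpoint[sender]
        if received_before and not sent_before:
            orphans.append((sender, send_index, receiver, receive_index))
    return orphans


def is_consistent(checkpoint, messages):
    """A set of local checkpoints is consistent if it has no orphan message."""
    return not orphan_messages(checkpoint, messages)


# Hypothetical run: p sends at its local event 2, q receives at its local event 1.
messages = [("p", 2, "q", 1)]
inconsistent_cut = {"p": 2, "q": 3}   # q recorded the receive, p did not record the send
consistent_cut = {"p": 3, "q": 3}     # both the send and the receive are recorded
```

In the inconsistent case, p would have to take a new checkpoint after its send for the pair to become consistent, which is the kind of dependency a cohort set has to capture.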
  • The χ-calculus

    Page(s): 74 - 81

    The paper proposes a new process algebra called the χ-calculus. The language differs from the π-calculus in several respects. First, it takes a more uniform view of input and output. Second, the closed names of the language are homogeneous in the sense that there is only one kind of bound name. Third, the effects of communications in the χ-calculus are delimited by localization operators, not by a sequentiality combinator. Finally, the language allows more freedom of parallelism than the π-calculus. The algebraic properties of χ-processes are studied in terms of local bisimulation. It is shown that local bisimilarity is a congruence equivalence on χ-processes.

  • Fast parallel algorithm for finding the kth longest path in a tree

    Page(s): 164 - 169

    We present a fast parallel algorithm running in O(log² n) time on a CREW PRAM with O(n) processors for finding the kth longest path in a given tree of n vertices (with Θ(n²) intervertex distances). Our algorithm is obtained by efficient parallelization of a sequential algorithm that is a variant of both N. Megiddo et al.'s algorithm and G.N. Fredrickson et al.'s algorithm, based on centroid decomposition of the tree and a succinct representation of the set of intervertex distances. With the same time and space bounds as the best known result, our sequential algorithm maintains a shorter decomposition tree.

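For reference, the problem stated in the abstract above can be pinned down with a brute-force baseline that enumerates all n(n-1)/2 intervertex distances and sorts them. This is only a sketch for checking small cases; it is not the centroid-decomposition algorithm from the paper, and the function name `kth_longest_path` and the edge-list input format are assumptions.

```python
from collections import deque


def kth_longest_path(n, edges, k):
    """Return the k-th largest intervertex distance in a weighted tree.

    n is the number of vertices (labelled 0..n-1) and edges is a list of
    (u, v, weight) tuples. Runs in O(n^2 log n) time, so it is only
    suitable as a correctness baseline.
    """
    adjacency = [[] for _ in range(n)]
    for u, v, weight in edges:
        adjacency[u].append((v, weight))
        adjacency[v].append((u, weight))

    distances = []
    for source in range(n):
        # In a tree the path between two vertices is unique, so a plain
        # traversal from the source yields exact distances.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, weight in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + weight
                    queue.append(v)
        # Count each unordered pair once.
        distances.extend(d for v, d in dist.items() if v > source)

    distances.sort(reverse=True)
    return distances[k - 1]


# Hypothetical example: the path 0 - 1 - 2 - 3 with edge weights 1, 2, 3.
path_edges = [(0, 1, 1), (1, 2, 2), (2, 3, 3)]
```

On this example the six distances are 6, 5, 3, 3, 2, 1 in decreasing order, so `kth_longest_path(4, path_edges, 2)` returns 5.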
  • Performance of buffered multistage interconnection networks in case of packet multicasting

    Page(s): 50 - 57

    Multistage Banyan networks are frequently proposed as connections in multiprocessor systems. Several studies have determined the performance of networks in which messages are unicast (one processor sends a message to exactly one other processor). In this paper, a timed Petri net model is used to derive the performance of buffered Banyan networks in which messages may also be multicast (one processor can send a message to more than one other processor). We consider a Banyan network with 2×2 switches and the two cases of complete and partial broadcasting within the switching elements. An algorithm is presented to calculate the destination distribution in all network stages for arbitrary destination patterns of incoming uniform packet traffic. Thus, timed Petri net models can be generated automatically for arbitrary destination patterns of the packets. The dependency on the network size is also considered.

  • Control mechanism for software pipelining on nested loop

    Page(s): 345 - 350

    ILSP (Interlaced inner and outer Loop Software Pipelining) is an efficient algorithm for optimizing operations in nested loops. For ILSP to achieve good time and space efficiency, it must be supported by an efficient nested-loop control mechanism. Our control mechanism is realized in hardware; it avoids adding many extra instructions and minimises the II (Initiation Interval) of each loop in the nest. In cooperation with the compiler, our nested-loop control mechanism can efficiently support software pipelining of nested loops and can ensure that ILSP achieves a high speedup at a low space cost.

  • Coherent parallel programming in C∥

    Page(s): 116 - 122

    This paper presents the coherent parallel programming concept using a new parallel language called C|| (pronounced C Parallel). The C|| language is based on the standard C language with a small set of extended constructs for parallelism and process interaction. At the core of C|| is a structured construct called the coherent region, which facilitates the development of coherent programs, i.e., parallel programs that are structured, determinate, terminative, and compositional. We present the basic features of C|| and show that the coherent region is a versatile construct.

  • Dependence analysis of parallel and distributed programs and its applications

    Page(s): 370 - 377

    This paper surveys program dependence analysis techniques for parallel and/or distributed programs and their applications from the viewpoint of software engineering. We present the primary program dependences that may exist in a parallel and/or distributed program; a general approach to defining, analyzing, and representing these dependences formally; and applications of an explicit dependence-based representation of parallel and/or distributed programs in various software engineering activities. We also suggest some research problems in this direction.

  • “SEQ OF PAR” style structured parallel programming

    Page(s): 82 - 89

    This paper presents a new structured parallel programming model, “SEQ OF PAR”, based on the Communication Closed Layer (CCL) principle of causal composition for parallel programs and the Bird-Meertens formalism (BMF) of locality-based parallel computation. The model supports more general, architecture-independent parallel programming. It provides a structured approach to integrating task (or process) parallelism and data parallelism in one framework. The well-founded algebra of CCL and BMF also makes it possible to derive, optimize, and verify parallel programs through algebraic transformations. Experimental results show that this programming model is very promising for obtaining efficient, portable parallel code.

  • A simulation research on multiprocessor interconnection networks with wormhole routing

    Page(s): 58 - 64

    Selecting an appropriate network is an important issue in the design of a parallel computer system. This paper presents simulation results on the performance of message-passing interconnection networks commonly used in multiprocessor systems. The performance of various interconnection networks with wormhole routing, including crossbar, mesh, hypercube, tree, and hypertree, is compared. The performance factors compared include network throughput and message delay. To obtain a more general model of tree-structured networks, this paper presents the definition of the m-fold n-ary tree, which is an extension of the hypertree network.

  • Parallel VLSI neural system design for time-delay speech recognition computing

    Page(s): 12 - 17

    Neural systems, as processors of time-sequence patterns, have been successfully applied to several speaker-dependent speech recognition tasks. They can be efficiently implemented with a pipelined architecture. In this paper, parallel time-delay speech recognition computing for VLSI neural systems is presented. The system design methodology emphasizes coordination between the computational model, the architectural description, and the VLSI systolic implementation. Examples of time-delay speech recognition applications in VLSI neural system design, together with performance analysis, are given to illustrate the effectiveness of the parallel computation.

  • GPR-Tree: a global parallel index structure for multiattribute declustering on cluster of workstations

    Page(s): 300 - 306

    The R-tree is a very popular dynamic access structure capable of storing multidimensional and spatial data. Considering its merits of efficient global balance and dynamic reorganization, we use the R-tree to decluster multiattribute data in a database system or file system. As many previous multiattribute declustering mechanisms do not take into account the properties of the Cluster of Workstations (COW), we present the Global Parallel R-tree (GPR-Tree) for the COW architecture. First, we examine efficiency issues in the R-tree and its variants, and we enhance R-tree efficiency by using heuristic information in the reconstruction of the R-tree during node splitting and in the treatment of the orphan entries of underfilled nodes. Then we parallelize the improved R-tree among the components of the system. The basic idea is to alleviate the bottleneck effect of the I/O subsystem by making use of high-speed network communication and memory. The GPR-Tree is shared among the processing units (PUs) of the system. We use a mixed LRU algorithm to schedule pages so that frequently visited nodes stay in memory. A write-update-like protocol is used to keep the multiple copies maintained in the system coherent. This mechanism is shown to improve the scalability and performance of the system.

  • Parallel design and implementation of SOM neural computing model in PVM environment of a distributed system

    Page(s): 26 - 31

    A parallel design and implementation of the Self-Organizing Map (SOM) neural computing model is proposed. The parallel design is implemented in a Parallel Virtual Machine (PVM) environment on a distributed system. A practical realization of the SOM algorithm is investigated, and the construction of the computing module in PVM is discussed. Communication methods and an optimization of message passing between multiple processes are proposed, and the parallel programming technique and a PVM implementation of the SOM model are described in detail.

  • A lifetime-sensitive scheduling method

    Page(s): 351 - 354

    This paper presents a lifetime-sensitive scheduling method. By shortening the lifetimes of variables in the scheduling phase, it relieves register pressure in the register allocation phase, reduces spill code, and results in more efficient object code. Preliminary experimental results show that the method is effective.

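The link between variable lifetimes and register pressure in the abstract above can be shown with a small sketch that measures both for two orderings of the same straight-line code. The schedule format (a list of `(destination, sources)` pairs) and the helper names `lifetimes` and `max_pressure` are assumptions for illustration; the paper's actual scheduling heuristic is not reproduced here.

```python
def lifetimes(schedule):
    """Map each variable to (definition index, last use index)."""
    spans = {}
    for index, (dest, sources) in enumerate(schedule):
        for source in sources:
            if source in spans:
                spans[source] = (spans[source][0], index)
        spans[dest] = (index, index)
    return spans


def max_pressure(schedule):
    """Largest number of variables that are live after any instruction."""
    spans = lifetimes(schedule)
    return max(
        sum(1 for start, end in spans.values() if start <= index < end)
        for index in range(len(schedule))
    )


# Hypothetical code computing g = (a + b) + (c + d) from four loads.
loads_first = [
    ("a", []), ("b", []), ("c", []), ("d", []),
    ("e", ["a", "b"]), ("f", ["c", "d"]), ("g", ["e", "f"]),
]
# Same operations, reordered so each value is consumed soon after it is defined.
lifetime_aware = [
    ("a", []), ("b", []), ("e", ["a", "b"]),
    ("c", []), ("d", []), ("f", ["c", "d"]), ("g", ["e", "f"]),
]
```

With the loads hoisted to the top, four values are live at once; consuming `a` and `b` before loading `c` and `d` brings the peak down to three, which is the effect a lifetime-sensitive scheduler aims for.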
  • A scalable parallel workstation cluster system

    Page(s): 307 - 313

    In this paper, we argue that, because of recent advances in network and CPU technologies, workstation clusters are poised to become the primary parallel computing infrastructure for science and engineering computing. After analyzing and comparing the communication performance of three popular networks (10 Mbps Ethernet, 100 Mbps Ethernet, and 640 Mbps Myrinet) on an experimental workstation cluster, we point out two main factors that hinder the wider application of workstation clusters: the low efficiency of the communication system (both hardware and software) and the lack of a friendly parallel program development environment with accessory tools. To address these two problems, we implemented two workstation cluster systems for different performance/price requirements: one consists of 8 PowerPCs with a shared-media network, the other of 8 Sun SPARCstations with a switched network. By using the Reduced Communication Protocol (RCP), we dramatically improved the performance of the communication system; by expanding the language support of PVM and adding several useful tools, we built a visual integrated parallel program development environment, IPCE. On our platform, we also analyzed several massive applications, such as the GRI benchmark, an earthquake simulator, weather forecasting, and some NAS benchmarks, and we obtained very good results for these coarse-grain to middle-grain applications. The speedup ranges from 5.83 to 7.98, and the parallel efficiency ranges from 72.88% to 99.7%.

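The speedup and efficiency figures quoted in the abstract above are related by the standard definition, efficiency = speedup / number of processors. The short check below assumes 8 processors, matching the two 8-node clusters described; the abstract does not state the processor count behind each figure, so that value is an assumption, as is the function name `parallel_efficiency`.

```python
def parallel_efficiency(speedup, processors):
    """Parallel efficiency as the fraction of ideal linear speedup achieved."""
    return speedup / processors


ASSUMED_PROCESSORS = 8   # assumption: one process per node of an 8-node cluster

for speedup in (5.83, 7.98):
    efficiency = parallel_efficiency(speedup, ASSUMED_PROCESSORS)
    print(f"speedup {speedup} on {ASSUMED_PROCESSORS} processors "
          f"gives efficiency {efficiency * 100:.3f}%")
```

Under that assumption the two speedups correspond to efficiencies of about 72.875% and 99.75%, which agrees with the reported range of 72.88% to 99.7% up to rounding.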
  • The design considerations and test results of AFT-a new generation parallelizing compiler

    Page(s): 416 - 423

    An effective automatic parallelizer is critical for users to exploit the resources of parallel computers. Research in this area has made much progress in recent years. This paper introduces AFT, a new-generation parallelizing compiler that we have developed. It integrates many advanced techniques into an effective and efficient system. The experimental results show that AFT achieves notable parallelization on many programs.

  • Parallel recursive algorithm for tridiagonal systems

    Page(s): 124 - 130

    In this paper, a parallel algorithm for solving tridiagonal equations based on recurrence is presented. Compared with the parallel prefix (PP) method, which is also based on the recursive method, the computation cost is reduced by a factor of two while the communication cost stays the same. The method can be viewed as a modified prefix method or as prefix with substructuring. The complexity of the algorithm is analysed using the Bulk Synchronous Parallel (BSP) model. Experimental results are obtained on a Sun workstation using the Oxford BSP Library.

  • Small, scalable, and efficient, microkernels for highly parallel computers are possible: Cosy as an example

    Page(s): 196 - 203

    Although highly parallel distributed memory computers have existed for several years, the operating systems used on them have not fitted their requirements very well. Most of these systems were designed for sequential, shared memory parallel, or distributed computers. Examples are Unix on the IBM SP/2 and Mach on the Intel Paragon. This results in poor scalability, caused by inefficient communication primitives designed for wide area networks or by waste of resources due to huge kernels (e.g. 8 MB per node are reported for Mach on the Paragon), which is especially harmful in highly parallel systems with hundreds or thousands of nodes. With Cosy (Concurrent Operating System) we have shown that a well structured and carefully designed system can be small (70 Kb for the kernel, 372 total memory usage per node), efficient (33 μs for communication), and scalable (applications run efficiently on up to 1024 processors).

  • Consistent state restoration in shared memory systems

    Page(s): 330 - 337

    In many systems, backward recovery is a classical technique for ensuring fault tolerance. It consists of restoring a computation to a consistent global state, saved in a global checkpoint, from which the computation can be resumed. A global checkpoint includes a set of local checkpoints, one from each process, which correspond to local states dumped onto stable storage. In this paper we formally define the domino effect for shared memory systems, whether the shared memory is physical (as in multiprocessor systems) or virtual (as in distributed shared memory systems), and we design a domino-free adaptive algorithm. These results rest on a necessary and sufficient condition that shows when a set of local checkpoints can belong to some consistent global checkpoint.
