Advances in Parallel and Distributed Computing, 1997. Proceedings

Date 19-21 March 1997

  • Proceedings. Advances in Parallel and Distributed Computing

  • Author index

    Page(s): 425 - 426
  • A versatile directory scheme (Dir2NB+L) and its implementation on BY91-1 multiprocessors system

    Page(s): 180 - 185

    Cache coherence and synchronization between processors are two critical issues in designing a shared-memory multiprocessor system. From the perspective of hardware design, a directory-based cache coherence protocol and a lock mechanism are employed to prevent cache inconsistency and guarantee atomic memory accesses. The BY91-1 multiprocessor efficiently integrates support for cache coherence and hardware-based primitives by using a uniform directory scheme dubbed Dir2NB+L. This integration allows for low hardware overhead while maintaining both a coherent cache system and indivisible memory accesses in a scalable and cohesive fashion. This paper describes the design and rationale of this versatile directory scheme. An evaluation of different directory schemes on a preliminary simulator, CASIMU, demonstrates that the Dir2NB+L scheme is cost-effective. We also report on the experience gained by implementing this directory scheme on the BY91-1 multiprocessor system. We believe that this scheme is well suited to CC-NUMA architectures.

  • Utilization of disk drives for RAID

    Page(s): 186 - 189

    A stochastic Petri net (SPN) model of RAID-5 is constructed. With the model and its isomorphic Markov chain, the average utilization of disk drives in RAID for small write requests and large I/O requests can be calculated. The paper thus provides a good method for evaluating the performance of RAID.

  • Implementing a software virtual shared memory on PVM

    Page(s): 190 - 195

    This paper introduces GKD-VSM, a software virtual shared memory on PVM. It provides a shared-memory parallel programming model in FORTRAN for distributed-memory environments. To reduce software overhead, GKD-VSM takes several approaches, including a special-purpose user-level multithreading scheme and a Prefetch&Poststore scheme applied at synchronization points. The latencies of basic operations are presented.

  • An improvement on data dependence analysis supporting software pipelining technique

    Page(s): 378 - 382

    The accuracy of the data dependence analysis of a client program determines the extent to which the compiler can exploit the program's potential parallelism. Most current work on dependence analysis is based on the dependence equation and constraint inequalities on loop variable bounds (sometimes augmented with the direction vector). Unfortunately, these methods cannot detect dependences exactly, which may greatly affect the parallel optimization of the client program when software pipelining is employed. In this paper, we give a more effective constraint inequality that reflects the characteristics of software pipelining and improves the power of most current dependence analysis algorithms when they are applied to software pipelining.

  • Adaptive hybrid scheduling of nonuniform loops on UMA models

    Page(s): 383 - 387

    It is very difficult to keep the load balanced among processors for nonuniform loops at compile time, and dynamic methods come at the price of extra overhead. This paper proposes an adaptive hybrid scheduling method in which the process of distributing the loop is divided into a few rounds and the block size in each round is determined adaptively according to the average overhead due to dynamic scheduling. Several experimental results also show the effect of the scheduling parameter, which programmers can select according to the probability that a fetching processor may not perform an additional task fetch.

  • Parallel VLSI neural system design for time-delay speech recognition computing

    Page(s): 12 - 17

    Neural systems, as processors of time-sequence patterns, have been successfully applied to several speaker-dependent speech recognition computing tasks. They can be efficiently implemented by a pipelined architecture. In this paper, parallel time-delay speech recognition computing for VLSI neural systems is presented. The system design methodology emphasizes coordination between the computational model, the architectural description, and the VLSI systolic implementation. Examples of time-delay speech recognition applications to VLSI neural system design and performance analysis are given to illustrate the effectiveness of the parallel computation.

  • Solving sparse least squares problems on massively distributed memory computers

    Page(s): 170 - 177

    In this paper we study the parallel aspects of PCGLS, a basic iterative method whose main idea is to organize the computation of the conjugate gradient method with a preconditioner applied to the normal equations, and of the incomplete modified Gram-Schmidt (IMGS) preconditioner, for solving sparse least squares problems on massively parallel distributed memory computers. The performance of these methods on this kind of architecture is always limited by the global communication required for the inner products. We describe the parallelization of PCGLS and the IMGS preconditioner with two improvements. One is to assemble the results of a number of inner products collectively; the other is to create situations in which communication can be overlapped with computation. A theoretical model of the computation and communication phases is presented which allows us to decide the number of processors that minimizes the runtime. Several numerical experiments on the Parsytec GC/PowerPlus are presented.

  • A new architecture for branch-intensive loops

    Page(s): 241 - 246

    A new VLIW architecture, called GPMB (Global Pipelining of Multi-Branch), is discussed in this paper. The GPMB architecture can handle branch-intensive programs efficiently. Using the concept of a next-address function, GPMB treats branching as correctly calculating the next address. The next-address function is implemented in hardware and software in GPMB. A brief description of GPMB and a detailed example are included. A comparison with other architectures is also presented.

  • An efficient parallel texture classification for image retrieval

    Page(s): 18 - 25

    This paper proposes an efficient parallel approach to texture classification for image retrieval. The idea behind this method is to pre-extract texture features in terms of texture energy measurements associated with a 'tuned' mask and store them in a multi-scale and multi-orientation texture class database via a two-dimensional linked list for query. Thus each texture class sample in the database can be traced by its texture energy in a two-dimensional row-sorted matrix. Parallel searching strategies are introduced for quickly identifying the entities closest to the input texture throughout the given texture energy matrix. In contrast to traditional search methods, our approach incorporates different computation patterns for different numbers of available processors and is concerned with robust and work-optimal parallel algorithms for row-search and minimum-find based on the accelerated cascading technique and the dynamic processor allocation scheme. Applications of the proposed parallel search and multisearch algorithms to both single image classification and multiple image classification are discussed. The time complexity analysis shows that our proposal will speed up the classification tasks in a simple but dynamic manner. Examples are presented of the texture classification task applied to image retrieval of Brodatz textures, comprising various orientations and scales.

  • Construction of multimedia server in a distributed multimedia system

    Page(s): 248 - 252

    A framework for constructing a distributed multimedia system based on the client/server architecture is described in this paper. We focus on realizing the synchronized presentation of different media in a multimedia application, and a set of QoS (quality of service) parameters is given as a criterion for trading off the overall performance of the system against the synchronized presentation in each multimedia application.

  • Dependence analysis of parallel and distributed programs and its applications

    Page(s): 370 - 377

    This paper surveys the program dependence analysis technique for parallel and/or distributed programs and its applications from the viewpoint of software engineering. We present the primary program dependences which may exist in a parallel and/or distributed program, a general approach to defining, analyzing, and representing these program dependences formally, and applications of an explicit program-dependence-based representation for parallel and/or distributed programs in various software engineering activities. We also suggest some research problems in this direction.

  • Parallel design and implementation of SOM neural computing model in PVM environment of a distributed system

    Page(s): 26 - 31

    A parallel design and implementation of the Self-Organizing Map (SOM) neural computing model is proposed. The parallel design of SOM is implemented in a parallel virtual machine (PVM) environment on a distributed system. A practical realization of the SOM algorithm is investigated, and the construction of the computing module in the parallel virtual machine is discussed. Communication methods and an optimization of message passing between multiple processes are proposed, and the parallel programming technique and a PVM implementation of the SOM neural computing model are discussed in detail.

  • An improved parallel algorithm for Delaunay triangulation on distributed memory parallel computers

    Page(s): 131 - 138

    Delaunay triangulation has been widely used in applications such as volume rendering, shape representation, and terrain modeling. The main disadvantage of Delaunay triangulation is the large computation time required to obtain the triangulation of an input point set. This time can be reduced by using more than one processor, and several parallel algorithms for Delaunay triangulation have been proposed. In this paper, we propose an improved parallel algorithm for Delaunay triangulation, which partitions the bounding convex region of the input point set into a number of regions by using Delaunay edges and generates Delaunay triangles in each region by applying an incremental construction approach. Partitioning by Delaunay edges makes it possible to eliminate the merging step required for integrating subresults. Experiments show that the proposed algorithm has good load balance and is more efficient than Cignoni et al.'s algorithm (1993) and our previous algorithm (1996).

  • Consistent state restoration in shared memory systems

    Page(s): 330 - 337

    In many systems, backward recovery is a classical technique for ensuring fault tolerance. It consists of restoring a computation to a consistent global state, saved in a global checkpoint, from which the computation can be resumed. A global checkpoint includes a set of local checkpoints, one from each process, which correspond to local states dumped onto stable storage. In this paper we are interested in formally defining the domino effect for shared memory systems, whether the shared memory is physical (as in multiprocessor systems) or virtual (as in distributed shared memory systems), and in designing a domino-free adaptive algorithm. These results rest on a necessary and sufficient condition which shows when a set of local checkpoints can belong to some consistent global checkpoint.

  • Performance of buffered multistage interconnection networks in case of packet multicasting

    Page(s): 50 - 57

    Multistage Banyan networks are frequently proposed as interconnections in multiprocessor systems. Several studies exist that determine the performance of networks in which messages are unicast (one processor sends a message to one and only one other processor). In this paper, a timed Petri net model is used to derive the performance of buffered Banyan networks in which messages may also be multicast (one processor can send a message to more than one other processor). We consider a Banyan network with 2×2 switches and the two cases of complete and partial broadcasting within the switching elements. An algorithm is presented to calculate the destination distribution in all network stages for arbitrary destination patterns of incoming uniform packet traffic. Thus, the automatic generation of timed Petri net models is possible for arbitrary destination patterns of the packets. The dependency upon the network size is also considered.

  • Reduced communication protocol for clusters

    Page(s): 314 - 319

    With the development of CPUs and communication networks, workstation clusters using message-passing mechanisms have come to play a crucial role in the field of network computing. Today's clusters are mainly connected by networks running traditional communication protocols (such as TCP/IP). The high overheads of these protocols make many parallel applications running on clusters inefficient in using the potential computation power provided by the workstations and the networks. One way to solve this problem is to construct a reduced communication protocol. This paper gives a detailed analysis of the overheads produced by traditional protocols and provides some global strategies for designing a reduced communication protocol. Our method of implementing such a protocol is described here, together with some core algorithms and the testing results.

  • Precise dependence test for scalars within nested loops

    Page(s): 356 - 361

    Exact direction and distance vectors are essential for detecting hierarchical parallelism and examining the legality of loop transformations for a multiple-level loop nest. Much of this work has concentrated on array references. Little has been done to address the problem of finding precise dependences between scalar references, except to use extended SSA form with factored use-def links. In this paper, we present a technique for calculating precise direction and distance vectors for scalar references within nested loops without using any form of SSA. To do this, we use conventional use-def links in combination with joint dominator and joint postdominator relationships, which are extended from the dominator and postdominator relationships, respectively, in standard data flow analysis. The precision of the dependence information gathered by our algorithm cannot be achieved by traditional analysis of dominators or reaching definitions.

  • The study of parallel simulation processing based on MPP technology

    Page(s): 34 - 40

    Computer numerical simulation is widely applied in engineering and social fields, where it has shown great value. Small-scale simulation applications can be processed on a traditional simulation computer, but as problem size increases, sequential processing cannot meet the requirements. Dynamic real-time simulation and super real-time simulation require high-performance simulation computers. In this paper we first analyse the structure of a classical simulation computer, the AD-100, which was developed by ADI Inc.; then a novel structure for a simulation computer which adopts MPP technology is proposed. At the end of the paper an experimental result is given to test the feasibility of parallel simulation processing.

  • Control mechanism for software pipelining on nested loop

    Page(s): 345 - 350

    ILSP (Interlaced inner and outer Loop Software Pipelining) is an efficient algorithm for optimizing operations in nested loops. To ensure that ILSP has good time efficiency and good space efficiency, there must be an efficient nested control mechanism to support the algorithm. Our control mechanism is realized in hardware; it avoids adding many extra instructions and minimises the II (Initialization Interval) of each loop in the nested loop. In cooperation with the compiler, our nested loop control mechanism can efficiently support the software pipelining of the nested loop and can ensure that ILSP has a high speedup and a low space cost.

  • Definition of control variables for automatic performance modeling

    Page(s): 42 - 49

    Automatic model generation is studied as part of a hybrid modeling strategy using simulation for performance analysis. Two major steps have to be carried out in this context. First, the program under investigation has to be translated into a model. During the translation, runtime has to be estimated for numerous computational blocks of statements, which are replaced by simple delays. Second, for performance estimation, the model has to be analyzed by an evaluation tool. Model evaluation as well as runtime estimation of computational blocks requires values of some variables, the control variables. We discuss the problem of automatic definition of control variables in general and consider some important cases. For the implementation of a model-generating tool, we concentrate on parallel Fortran programs using message-passing primitives for process communication.

  • Parallel replacement mechanism for multithread

    Page(s): 338 - 344

    This paper presents a new rapid thread replacement mechanism, which is important in multithreading technology. Analysis of the memory system indicates that memory utilization decreases as the cache hit ratio increases. The parallelism between thread computation and thread replacement is found by analyzing their working processes. Based on these observations, we propose a rapid multithread replacement mechanism which overlaps thread replacement with thread computation. In particular, with finite hardware contexts, this mechanism can play the same role as infinite contexts by tolerating the replacement overhead. By modifying the general thread switching model, we build a thread replacement model and evaluate the mechanism both theoretically and experimentally. Finally, we discuss the hardware implementation and identify problems to be resolved in the future.

  • Analysis of multidimensional loops with non-uniform dependences

    Page(s): 362 - 369

    A parallelizing compiler based mainly on loop transformations requires dependence information that is as complete and precise as possible. In this paper, we propose a generalized method for computing, in any multi-dimensional loop, information which has proved to be useful in the case of irregular dependences. First, we solve the basic problem of the existence of a dependence with an algorithm composed of a preprocessing reduction phase and an integer simplex resolution. If a solution exists, we compute by integer simplex the bounds of the distances associated with the loop indices. Depending on the values of these bounds, we finally define problems that consist of evaluating the bounds of the slopes of dependence vectors, which we solve by integer linear fractional programming. The amount of computation for each new problem is very low. This algorithm has been implemented as an extension of the Janus Test, which was presented in a previous work.

  • A scalable parallel workstation cluster system

    Page(s): 307 - 313

    In this paper, we argue that, because of recent advances in network and CPU technologies, workstation clusters are poised to become the primary parallel computing infrastructure for science and engineering computing. After analyzing and comparing the communication performance of three popular networks (10 Mbps Ethernet, 100 Mbps Ethernet, and 640 Mbps Myrinet) on an experimental workstation cluster, we point out that two main factors hinder the wider application of workstation clusters: the low efficiency of the communication system (both hardware and software) and the lack of a friendly parallel program development environment with accessory tools. To address these two problems, we implemented two workstation cluster systems for different performance/price ratio requirements: one with 8 PowerPCs on a shared-media network, the other with 8 Sun SPARCstations on a switched network. By using the Reduced Communication Protocol (RCP), we dramatically improved the performance of the communication system; by expanding the language support of PVM and adding several useful tools, we built IPCE, a visual integrated parallel program development environment. On our platform, we also analyzed several massive applications, such as the GRI benchmark, an earthquake simulator, weather forecasting, and some NAS benchmarks, and we obtained very good results for these coarse-grained to medium-grained applications. The speedup ranges from 5.83 to 7.98, and the parallel efficiency ranges from 72.88% to 99.7%.
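
    The speedup and efficiency figures in this abstract can be related by the usual definition of parallel efficiency, speedup divided by processor count. The sketch below is only an illustration: the function name is hypothetical, and the node count of 8 is an assumption taken from the two 8-machine clusters the abstract describes. Under that assumption, a speedup of 5.83 corresponds to about 72.9% efficiency and 7.98 to 99.75%, which agrees with the reported range up to rounding.

```python
def parallel_efficiency(speedup: float, processors: int) -> float:
    """Return parallel efficiency (speedup / processor count) as a percentage."""
    if processors <= 0:
        raise ValueError("processors must be positive")
    return 100.0 * speedup / processors


# Assumption: both clusters in the abstract have 8 nodes, so 8 is used here.
NODES = 8

# Speedup endpoints as reported in the abstract.
for s in (5.83, 7.98):
    eff = parallel_efficiency(s, NODES)
    print(f"speedup {s} on {NODES} nodes: efficiency {eff:.2f}%")
```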
