
Proceedings of the 3rd International Conference on High Performance Computing, 1996

Date: 19-22 Dec. 1996


Displaying Results 1 - 25 of 76
  • 3rd International Conference on High Performance Computing [front matter]

    Page(s): i - x
  • Index of authors

    Page(s): 475 - 476
  • Fault tolerant networks of workstations

    Page(s): 271 - 276

    Networks of workstations make cheap and powerful parallel processors, and many systems have been demonstrated using Unix workstations. Personal computers, with their cheap, high-performance CPUs, can also be used in a similar way. However, if we are to exploit the multitudes of PCs that sit on desks idle for much of the time, we need robust systems. The uncontrolled environment and physical separation between elements of the parallel system make them prone to a variety of faults that would not normally plague a single processor in a machine room. We describe extensions made to the run-time system for Old MacDonald, a network of PowerPC-based Macintoshes, to enable programs written in a functional style using Cilk, a threaded extension of C, to continue in the face of processor and network failures and the unpredictable delays likely to be encountered when using widely dispersed machines as components of a parallel system.

  • A distributed directory scheme for information access in mobile computers

    Page(s): 138 - 143

    In this paper, we discuss the design aspects of a dynamic distributed directory scheme (DDS) to facilitate efficient and transparent access to information files in mobile environments. The proposed directory interface enables users of mobile computers to view a distributed file system on a network of computers as a globally shared file system. In order to counter some of the limitations of wireless communications, we propose improvised invalidation schemes that avoid false sharing and ensure uninterrupted usage under disconnected and low-bandwidth conditions.

  • Compilation to parallel programs from constraints

    Page(s): 73 - 79

    This paper describes the first results from research on the compilation of constraint systems into task-level parallel programs in a procedural language. This is the only research of which we are aware that attempts to generate efficient parallel programs for numerical computation from constraint systems. Computations are expressed as constraint systems. A dependence graph is derived from the constraint system and a set of input variables. The dependence graph, which exploits the parallelism in the constraints, is mapped to the language CODE, which represents parallel computation structures as generalized dependence graphs. Finally, parallel C programs are generated. To extract parallel programs of appropriate granularity, the following features are included: (i) modularity, (ii) operations over structured types as primitives, and (iii) sequential C functions. A prototype of the compiler has been implemented. The domain of matrix computations is targeted for applications. Initial results are very encouraging.
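
    As a hand-written illustration of the idea (not output of the compiler described above), consider the small constraint system {c = a + b; d = a * a; e = c + d} with inputs a and b. Its dependence graph has no edge between c and d, so a generated parallel C program may evaluate those two nodes concurrently; the sketch below uses POSIX threads and invented names.

    ```c
    /* Illustrative parallel C derived by hand from the constraint system
       {c = a + b; d = a * a; e = c + d} with inputs a, b (not compiler output). */
    #include <pthread.h>
    #include <stdio.h>

    static double a = 3.0, b = 4.0, c, d, e;

    static void *task_c(void *arg) { (void)arg; c = a + b; return NULL; }
    static void *task_d(void *arg) { (void)arg; d = a * a; return NULL; }

    int main(void)
    {
        pthread_t tc, td;
        pthread_create(&tc, NULL, task_c, NULL);  /* independent graph nodes */
        pthread_create(&td, NULL, task_d, NULL);  /* run concurrently        */
        pthread_join(tc, NULL);
        pthread_join(td, NULL);
        e = c + d;                                /* node depending on both  */
        printf("e = %.1f\n", e);                  /* prints e = 16.0         */
        return 0;
    }
    ```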

  • Finite element calculations at the parallel computers Intel/Paragon, IBM SP-2, and a cluster of DEC-ALPHA workstations

    Page(s): 82 - 87

    In many natural and engineering science applications, systems of partial differential equations have to be solved. The finite element method is an effective tool for solving these systems, but a disadvantage is the very high computation time on conventional serial computers. Therefore, the finite element code SMART has been parallelized and can now be executed on different parallel machines on which a message-passing system such as Intel/NX, MPI or PARMACS is available. In this paper a short overview of the parallelization strategy is given, but the main focus lies on the benchmarking of a given problem on different computer architectures, which shows the pros and cons of these machines as well as of parallel computing.
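
    The paper benchmarks the finite element code SMART itself; purely as a generic illustration of the message-passing style such parallelized solvers rely on, the following minimal C/MPI halo exchange between neighbouring subdomains is a sketch with an invented 1-D decomposition, not code from SMART.

    ```c
    /* Minimal 1-D domain-decomposition halo exchange in MPI (illustrative only). */
    #include <mpi.h>
    #include <stdio.h>

    #define LOCAL_N 1000              /* interior points owned by each rank */

    int main(int argc, char **argv)
    {
        double u[LOCAL_N + 2];        /* one ghost cell at each end */
        int rank, size, left, right, i;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        for (i = 0; i < LOCAL_N + 2; i++)
            u[i] = (double)rank;      /* dummy subdomain data */

        /* Exchange boundary values with the neighbouring subdomains. */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[LOCAL_N + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[LOCAL_N], 1, MPI_DOUBLE, right, 1,
                     &u[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        if (rank == 0)
            printf("halo exchange complete on %d ranks\n", size);
        MPI_Finalize();
        return 0;
    }
    ```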

  • Communicating data-parallel tasks: an MPI library for HPF

    Page(s): 433 - 438

    High Performance Fortran (HPF) has emerged as a standard dialect of Fortran for data-parallel computing. However, HPF does not support task parallelism or heterogeneous computing adequately. This paper presents a summary of our work on a library-based approach to support task parallelism, using MPI as a coordination layer for HPF. This library enables a wide variety of applications, such as multidisciplinary simulations and pipeline computations, to take advantage of combined task and data parallelism. An HPF binding for MPI raises several interface and communication issues. We discuss these issues and describe our implementation of an HPF/MPI library that operates with a commercial HPF compiler. We also evaluate the performance of our library using a synthetic communication benchmark and a multiblock application.

  • DiET: a distributed extended transaction processing framework

    Page(s): 114 - 119

    DiET provides a framework to experiment with extended transaction models and also to synthesize new models. As case studies, nested and split-join transaction types have been implemented. DiET is a framework loosely coupled with a distributed storage manager and PVM. Such a coupling enables DiET to cope with a wide variety of storage managers and distributed process managers without any difficulty. The performance measures indicate high speedup for complex applications.

  • Comparison of parallelization strategies for simulation of aerodynamics problem

    Page(s): 10 - 15

    A solver for the complex compressible Navier-Stokes equations, using a time-accurate, implicit difference scheme, is parallelized using domain decomposition (DD) and loop-parallelization methods. The numerical scheme used is an LU-ADI method with the Baldwin-Lomax turbulence model. The report discusses how the code is parallelized using these techniques. The convergence rate and computational speed-up of the two methods are illustrated in the numerical experiment described in the report. Generally, loop-parallelization shows better convergence and speed-up than the domain-decomposition method.

  • Novel parallel join algorithms for grid files

    Page(s): 144 - 149

    Recent advances in parallel and distributed processing and their application to database operations such as join have motivated the investigation of parallel join algorithms. Hash-based join algorithms involve a costly data-partitioning phase prior to the join operation. This paper presents new parallel join algorithms for relations based on grid files in which no costly partitioning phase is involved, and hence performance can improve.

  • Monte Carlo device simulation on PARAM

    Page(s): 33 - 35

    A self-consistent ensemble Monte Carlo technique has been employed to obtain the velocity profiles for the n+-n-n+ structure. The simulator solves the Poisson equation at regular time intervals to update the electric field distribution and uses the cloud-in-cell method for charge assignment. This report presents the parallel implementation of the device simulation on the 9000/AA platform of CDAC. A linear speedup has been realised on an 8-node machine.
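
    The cloud-in-cell charge-assignment step mentioned above is a standard particle-in-cell ingredient; a minimal one-dimensional version in C might look as follows (grid size, units and variable names are illustrative, not taken from the simulator described in the paper).

    ```c
    /* One-dimensional cloud-in-cell (CIC) charge assignment (illustrative sketch):
       each particle's charge is split linearly between the two nearest grid points. */
    #include <stdio.h>

    #define NGRID 64

    void cic_assign(const double *x, const double *q, int nparticles,
                    double cell_width, double *rho)
    {
        for (int i = 0; i < NGRID; i++)
            rho[i] = 0.0;

        for (int p = 0; p < nparticles; p++) {
            double s = x[p] / cell_width;   /* position in grid units */
            int j = (int)s;                 /* left grid point        */
            double w = s - j;               /* fractional offset      */
            if (j >= 0 && j + 1 < NGRID) {
                rho[j]     += q[p] * (1.0 - w);
                rho[j + 1] += q[p] * w;
            }
        }
    }

    int main(void)
    {
        double x[2] = { 3.25, 10.5 }, q[2] = { 1.0, 1.0 }, rho[NGRID];
        cic_assign(x, q, 2, 1.0, rho);
        printf("rho[3]=%.2f rho[4]=%.2f\n", rho[3], rho[4]);  /* 0.75 0.25 */
        return 0;
    }
    ```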

  • Exploiting parallelism in high performance embedded system scheduling

    Page(s): 400 - 405

    This paper defines a new paradigm for high performance embedded systems. We present a model of distributed embedded control system software to capture the real-time computing requirements of complex computer-based systems. The hierarchical software architecture defines the notion of a software path, a construct identified by studying embedded real-time applications. We present a technique for dynamic scheduling of sporadic paths. The novel feature of the approach is to enhance schedulability through high performance concurrent computing.

  • BAG real-time distributed operating system

    Page(s): 120 - 125

    In this paper, three main parts of the BAG real-time distributed operating system are introduced: task migration, load balancing and the distributed file system. Task migration, based on the EFSM programming model, is implemented as a load-balancing mechanism. A file system supporting the task migration mechanism is also designed and developed.

  • Efficient compilation of concurrent call/return communication in actor-based programming languages

    Page(s): 62 - 67

    Concurrent call/return communication (CCRC) allows programmers to conveniently express a communication pattern in which a sender invokes a remote operation and uses the result to continue its computation. The blocking semantics requires context switching for efficient utilization of computation resources. We present a compilation technique which allows programmers to use CCRC at the cost of non-blocking asynchronous communication plus a minimal context-switch cost. The technique transforms CCRCs into non-blocking asynchronous sends and encapsulates continuations into separate objects. A data-flow analysis is used to guarantee that only the necessary context is cached in continuation objects.
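
    As a rough illustration of the transformation (the paper targets actor-based languages; the C below uses invented names), a blocking call/return such as "r = remote_square(x); use(r + y);" can be rewritten into a non-blocking send plus a continuation object that caches only the context the continuation still needs.

    ```c
    /* Illustrative sketch of the CCRC transformation (hypothetical API,
       not the paper's compiler output). */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct cont {
        int y;                                     /* cached context           */
        void (*resume)(struct cont *self, int r);  /* continuation entry point */
    } cont;

    /* Stand-in for the runtime's asynchronous send: here the "remote" operation
       is performed immediately and its reply delivered to the continuation. */
    static void async_square(int x, cont *k)
    {
        int reply = x * x;
        k->resume(k, reply);
        free(k);
    }

    static void after_square(struct cont *self, int r)
    {
        printf("result = %d\n", r + self->y);      /* use(r + y) */
    }

    int main(void)
    {
        int x = 6, y = 10;
        cont *k = malloc(sizeof *k);
        k->y = y;                      /* only the necessary context is cached */
        k->resume = after_square;
        async_square(x, k);            /* non-blocking send, no blocking wait  */
        return 0;
    }
    ```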

  • A multithreaded architecture for the efficient execution of vector computations within a loop using status field

    Page(s): 343 - 350

    This paper presents the design of MULVEC (MULtithreaded architecture for VECtor computations), a high performance building block for massively parallel processing systems. MULVEC comes from the synthesis of the dataflow model and the extant superscalar RISC microprocessor. Using a vector wait queue and a status field for each vector datum, MULVEC reduces the amount of synchronization, context switching, network traffic, and so on in the case of repeated vector computations within the same thread segment. If a statement involves more than three vector operands, MULVEC can evaluate it using a non-strict method. After simulating programs on the SPARC V9 (a superscalar RISC microprocessor), the performance (execution time of example programs) of a uniprocessor and of MULVEC is analyzed for different numbers of nodes and for several programs.

  • A communication placement framework with unified dependence and data-flow analysis

    Page(s): 201 - 208

    Communication placement analysis is an important step in the compilation of data-parallel programs for multiprocessor systems. This paper presents a communication placement framework that minimizes the frequency of communication, eliminates redundant communication, and maximizes communication latency hiding. The paper shows how data dependence information can be combined with data-flow analysis to devise simpler and cleaner data-flow problems. It shows how to develop equations for balanced communication placement using a set of uni-directional analyses with an independent equation system for each placement criterion. This structure allows the framework to support vector message pipelining, an important optimization for programs with loop-carried dependences that was not supported by any previous data-flow framework. The paper also describes how other optimizations, such as partially redundant communication elimination and message coalescing, are supported by the framework. Finally, the paper presents experimental results to demonstrate the efficacy of our placement analysis.
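
    One optimization the framework targets, communication latency hiding, amounts to issuing communication as early as possible and completing it as late as possible. In C/MPI terms this is the familiar non-blocking pattern sketched below (illustrative only, not output of the framework).

    ```c
    /* Latency hiding with non-blocking MPI calls (illustrative sketch): start the
       communication early, overlap it with independent computation, and complete
       it just before the data is needed. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, partner, recvbuf = -1;
        MPI_Request reqs[2];
        double acc = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        partner = (rank + 1) % size;

        MPI_Irecv(&recvbuf, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(&rank,    1, MPI_INT, partner,        0, MPI_COMM_WORLD, &reqs[1]);

        for (int i = 0; i < 1000000; i++)   /* independent local computation   */
            acc += i * 1e-6;                /* overlaps the messages in flight */

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        printf("rank %d received %d (acc=%.1f)\n", rank, recvbuf, acc);
        MPI_Finalize();
        return 0;
    }
    ```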

  • Order-configurable programmable power-efficient FIR filters

    Page(s): 357 - 361

    We present a novel VLSI implementation of an order-configurable, coefficient-programmable, and power-efficient FIR filter architecture. This single-chip architecture contains 4 multiply-add functional units, and each functional unit can have up to 8 multiply-add operations time-multiplexed (or folded) onto it. Thus one chip can be used to realize FIR filters with lengths ranging from 1 to 32, and multiple chips can be cascaded for higher-order filters. To achieve power efficiency, an on-chip phase-locked loop (PLL) is used to automatically generate the minimum voltage level needed to achieve the required sample rate. Within the PLL, a novel programmable divider and a voltage level shifter are used in conjunction with the clock rate to control the internal supply voltage. Simulations show that this chip can be operated at a maximum clock rate of 100 MHz (folding factor of 1, or filter length of 4). When operated at 10 MHz, this chip consumes only 27.45 mW using an automatically set internal supply voltage of 2 V. For comparison, when the chip is operated at 10 MHz and 5 V, it consumes 109.24 mW. At 100 MHz, the chip consumes 891 mW with a 4.5 V supply that is automatically generated by the PLL. This design has been implemented using Mentor Graphics tools for an 8-bit word length and 1.2 μm CMOS technology.
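
    A behavioural C model of the folding idea (illustrative only, not the VLSI design itself): a 32-tap FIR output computed by 4 multiply-add units, each time-multiplexing 8 taps.

    ```c
    /* Behavioural model of a folded FIR filter: UNITS functional units each
       perform FOLD time-multiplexed multiply-add operations per output sample. */
    #include <stdio.h>

    #define UNITS 4
    #define FOLD  8
    #define TAPS  (UNITS * FOLD)

    double fir_folded(const double *coeff, const double *delay_line)
    {
        double unit_acc[UNITS] = { 0.0 };
        for (int slot = 0; slot < FOLD; slot++)       /* time slots     */
            for (int u = 0; u < UNITS; u++) {         /* parallel units */
                int tap = u * FOLD + slot;
                unit_acc[u] += coeff[tap] * delay_line[tap];
            }
        return unit_acc[0] + unit_acc[1] + unit_acc[2] + unit_acc[3];
    }

    int main(void)
    {
        double coeff[TAPS], x[TAPS];
        for (int i = 0; i < TAPS; i++) { coeff[i] = 1.0 / TAPS; x[i] = 1.0; }
        printf("y = %.3f\n", fir_folded(coeff, x));   /* moving average -> 1.000 */
        return 0;
    }
    ```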

  • Classification of handwritten alphanumeric characters: a fuzzy neural approach

    Page(s): 36 - 41

    An efficient supervised feedforward fuzzy neural classifier (SFFNN) and its associated training algorithm for the classification of handwritten English letters and Arabic numerals are proposed in this paper. The classifier is a five-layer network, and the minimum number of fuzzy neurons in the third layer is organized dynamically during training. The classifier learns the membership function values of each input image from the training set. Through extensive experimentation with noiseless and noisy binary images of the English alphabet and the ten Arabic numerals, it is found that the performance of the SFFNN is better than that of Yalings's fuzzy neural network (YFNN) and a multilayer perceptron (MLP) network. After training, the SFFNN recognizes character images with 98.7% accuracy.

  • Parallel SOLVE for direct circuit simulation on a transputer array

    Page(s): 27 - 32

    Sparse matrix solution (SOLVE) is a dominant part of the total execution time in a circuit simulation program such as SPICE. For simulation of modern VLSI circuit designs, it is important that this time be reduced. This paper presents a block-partitionable sparse matrix solution algorithm in which a matrix is divided into equal-size blocks, and the blocks are assigned to different processors for parallel execution. The algorithm developed in this work exploits sparsity at the block level as well as within a non-zero block. An efficient mapping scheme to assign matrix blocks to processors is developed which maximizes concurrency and minimizes communication between processors. Associated reordering and efficient sparse storage schemes are also developed. An implementation of this parallel algorithm is carried out on a Transputer processor array which plugs into a PC bus. The sparse matrix solver is tested on matrices generated from a transistor-level expansion of ISCAS-85 benchmark logic circuits. Good speedup is obtained for all benchmark matrices up to the number of Transputers available.

  • File allocation for a parallel Webserver

    Page(s): 16 - 21

    This paper considers the problem of allocating the files in a document tree among multiple processors in a parallel webserver. It is assumed that access patterns are characterized by branching probabilities for an access that starts at a node and progresses down the tree. A combinatorial optimization problem is formulated that includes load balancing and communication costs. The general problem is shown to be NP-complete, and a pseudo-polynomial-time algorithm is outlined. In addition, two fast heuristic algorithms are presented and evaluated using simulation.
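
    The abstract does not spell out the heuristics; the sketch below is an invented greedy allocation in C that derives expected access frequencies from the branching probabilities down the document tree and assigns each file to the currently least-loaded server, ignoring the communication-cost term of the full formulation.

    ```c
    /* Illustrative greedy file allocation over a document tree (not one of the
       paper's heuristics): expected access frequencies follow from branching
       probabilities; each file goes to the least-loaded server so far. */
    #include <stdio.h>

    #define NFILES   7
    #define NSERVERS 3

    /* parent[i] and branch_prob[i]: probability that an access at the parent
       continues to node i; node 0 is the root with access frequency 1.0. */
    static const int    parent[NFILES]      = { -1, 0, 0, 1, 1, 2, 2 };
    static const double branch_prob[NFILES] = { 1.0, 0.6, 0.4, 0.5, 0.3, 0.7, 0.2 };

    int main(void)
    {
        double freq[NFILES], load[NSERVERS] = { 0.0 };
        int assign[NFILES];

        /* Expected access frequency: product of probabilities along the path. */
        freq[0] = 1.0;
        for (int i = 1; i < NFILES; i++)
            freq[i] = freq[parent[i]] * branch_prob[i];

        /* Greedy load-balancing pass. */
        for (int i = 0; i < NFILES; i++) {
            int best = 0;
            for (int s = 1; s < NSERVERS; s++)
                if (load[s] < load[best]) best = s;
            assign[i] = best;
            load[best] += freq[i];
        }

        for (int i = 0; i < NFILES; i++)
            printf("file %d (freq %.2f) -> server %d\n", i, freq[i], assign[i]);
        return 0;
    }
    ```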

  • Program analysis for page size selection

    Page(s): 189 - 194

    To support high performance architectures with multiple page sizes, it is necessary to assign proper page sizes to array memory in order to improve TLB performance as well as reduce memory contention during program execution. Typically, while a smaller page size causes higher TLB contention, a larger page size causes higher memory contention and fragmentation but also has the effect of prefetching pages required in the future, thereby reducing the number of cold page faults. Each array in a program contributes to these costs and benefits depending upon how it is referenced in the program. The page size assignment analysis determines a proper page size for every array by analyzing memory reference patterns; this assignment problem is shown to be NP-hard. We discuss various policies that can be followed for page size assignment in order to maximize performance, along with cost models, and present algorithms for page size selection.
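
    A toy version of the per-array trade-off described above, with an invented cost model (the constants and formula are not the paper's): estimate a TLB-related cost that falls with page size and a fragmentation cost that grows with it, and pick the page size minimizing their sum.

    ```c
    /* Toy page-size selection for one array; the cost model is made up purely
       to illustrate the TLB-versus-fragmentation trade-off. */
    #include <stdio.h>

    int main(void)
    {
        const long page_sizes[] = { 4096, 16384, 65536, 262144 };
        const long array_bytes  = 3 * 1000 * 1000;  /* hypothetical array size */
        const double tlb_miss_cost = 50.0;          /* cycles per page touched */
        const double frag_cost     = 0.02;          /* cycles per wasted byte  */

        long best = page_sizes[0];
        double best_cost = 1e30;

        for (int i = 0; i < 4; i++) {
            long ps     = page_sizes[i];
            long npages = (array_bytes + ps - 1) / ps;
            long wasted = npages * ps - array_bytes;   /* internal fragmentation */
            double cost = npages * tlb_miss_cost + wasted * frag_cost;
            printf("page %7ld: pages=%5ld wasted=%7ld cost=%.1f\n",
                   ps, npages, wasted, cost);
            if (cost < best_cost) { best_cost = cost; best = ps; }
        }
        printf("selected page size: %ld bytes\n", best);
        return 0;
    }
    ```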

  • Program-level control of network delay for parallel asynchronous iterative applications

    Page(s): 88 - 93

    Software distributed shared memory (DSM) platforms on networks of workstations tolerate large network latencies by employing one of several weak memory consistency models. Fully asynchronous parallel iterative algorithms offer an additional degree of freedom to tolerate network latency: they behave correctly even when supplied with outdated shared data. However, these algorithms can flood the network with messages in the presence of large delays. We propose a method of controlling asynchronous iterative methods wherein the reader of a shared datum imposes an upper bound on its age via a blocking Global Read primitive. This reduces the overall number of iterations executed by the reader, thus controlling the amount of shared updates generated. Experiments with a fully asynchronous linear equation solver running on a network of 10 IBM RS/6000 workstations show that the proposed Global Read primitive provides significant performance improvement.
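
    A minimal sketch of such an age-bounded blocking read, written with POSIX threads and invented names (the paper's primitive operates over a software DSM on a workstation network, not within a single address space).

    ```c
    /* Illustrative age-bounded Global Read: the reader blocks until the shared
       datum is no more than max_age updates older than the reader expects,
       throttling how far the asynchronous iteration can run ahead. */
    #include <pthread.h>
    #include <stdio.h>

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  fresh;
        long            version;   /* incremented by every writer update */
        double          value;
    } shared_datum;

    double global_read(shared_datum *d, long expected_version, long max_age)
    {
        double v;
        pthread_mutex_lock(&d->lock);
        while (d->version + max_age < expected_version)
            pthread_cond_wait(&d->fresh, &d->lock);   /* datum too old: block */
        v = d->value;
        pthread_mutex_unlock(&d->lock);
        return v;
    }

    void global_write(shared_datum *d, double value)
    {
        pthread_mutex_lock(&d->lock);
        d->value = value;
        d->version++;
        pthread_cond_broadcast(&d->fresh);            /* wake blocked readers */
        pthread_mutex_unlock(&d->lock);
    }

    int main(void)
    {
        shared_datum d = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER,
                           0, 0.0 };
        global_write(&d, 3.14);
        /* Reader at iteration 1 accepts data at most 2 updates old. */
        printf("read %.2f\n", global_read(&d, 1, 2));
        return 0;
    }
    ```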

  • Finite element applications of parallel adaptive integration strategies

    Page(s): 100 - 105

    We present strategies for the parallel computation of the integrals typically arising in finite element problems. In order to deal efficiently with difficulties in the integrands whose exact locations in the domain may be unknown, a parallel adaptive strategy is applied to the entire domain. The elements comprising the domain form the initial pool of subregions in the adaptive strategy and are refined further where needed, thereby allowing load sharing or a load-balanced global priority queue within each participating process group.
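
    A serial toy version of the adaptive strategy, with an invented integrand and error estimator (the paper's version is parallel and finite-element driven): the initial elements seed a pool of subregions, and the region with the largest error estimate is repeatedly split, mimicking a global priority queue.

    ```c
    /* Serial toy adaptive integration: refine the subregion with the largest
       estimated error until a tolerance or the region limit is reached. */
    #include <stdio.h>
    #include <math.h>

    #define MAXREG 256

    typedef struct { double a, b, est, err; } region;

    /* Simpson estimate plus a crude error estimate from one level of refinement. */
    static void eval(region *r, double (*f)(double))
    {
        double m = 0.5 * (r->a + r->b), h = r->b - r->a;
        double coarse = h / 6.0 * (f(r->a) + 4.0 * f(m) + f(r->b));
        double fine   = h / 12.0 * (f(r->a) + 4.0 * f(0.5 * (r->a + m)) + 2.0 * f(m)
                                    + 4.0 * f(0.5 * (m + r->b)) + f(r->b));
        r->est = fine;
        r->err = fabs(fine - coarse);
    }

    static double integrand(double x) { return 1.0 / sqrt(x + 1e-3); }

    int main(void)
    {
        region pool[MAXREG];
        int n = 4;                                /* initial "elements"        */
        for (int i = 0; i < n; i++) {
            pool[i].a = i * 0.25;
            pool[i].b = (i + 1) * 0.25;
            eval(&pool[i], integrand);
        }
        while (n < MAXREG) {                      /* refine worst region first */
            int worst = 0;
            for (int i = 1; i < n; i++)
                if (pool[i].err > pool[worst].err) worst = i;
            if (pool[worst].err < 1e-8) break;
            double m = 0.5 * (pool[worst].a + pool[worst].b);
            pool[n].a = m;                        /* right half becomes a new  */
            pool[n].b = pool[worst].b;            /* subregion in the pool     */
            pool[worst].b = m;                    /* left half stays in place  */
            eval(&pool[worst], integrand);
            eval(&pool[n], integrand);
            n++;
        }
        double total = 0.0;
        for (int i = 0; i < n; i++) total += pool[i].est;
        printf("integral ~= %.6f over %d subregions\n", total, n);
        return 0;
    }
    ```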

  • Integrating concurrency and object-orientation using boolean, access and path guards

    Page(s): 68 - 72

    Inheritance anomaly is considered a major problem in integrating object-orientation and concurrency. The anomaly forces redefinitions of inherited methods to maintain the integrity of concurrent objects. In this paper we discuss how the use of boolean, access and path guards attached to methods solves the problem of inheritance anomaly. Synchronization using boolean guards is known to be free of the inheritance anomaly caused by partitioning of acceptable states. However, boolean guards cause anomalies arising from history sensitivity. We solve this using path guards. Path guards are similar to path expressions in the sense that both express the execution pattern of methods. However, while a path expression independently specifies the synchronization of a collection of methods, a path guard is attached to a method and specifies the history of execution sequence(s) acceptable for executing the current method. Path expressions have been shown to cause inheritance anomaly; we will show, on the other hand, that path guards form a solution to the problem.
