Date: 12-17 Nov. 1989

Vectorization of Monte Carlo particle transport: an architectural study using the LANL benchmark "GAMTEB"
Page(s): 10-20
Fully vectorized versions of the Los Alamos National Laboratory benchmark code Gamteb, a Monte Carlo photon transport algorithm, were developed for the Cyber 205/ETA10 and Cray X-MP/Y-MP architectures. Single-processor performance measurements of the vector and scalar implementations were modeled with a modified Amdahl's Law that accounts for additional data motion in the vector code. The performance and implementation strategy of the vector codes are related to architectural features of each machine. Speedups between fifteen and eighteen were observed for the Cyber 205/ETA10 architectures, and about nine for the Cray X-MP/Y-MP architectures. The best single-processor execution time for the problem was 0.33 seconds on the ETA10-G and 0.42 seconds on the Cray Y-MP.
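
The modified Amdahl's Law the abstract mentions can be sketched as a simple speedup model (a minimal illustration; the parameter names and the form of the data-motion term are our assumptions, not the paper's exact formulation):

```python
def modeled_speedup(f_vec, r, d=0.0):
    """Speedup of a partly vectorized code over its scalar version.

    f_vec : fraction of scalar work that vectorizes
    r     : vector-to-scalar processing-rate ratio
    d     : extra data-motion time in the vector code, expressed as a
            fraction of total scalar time (the 'modified' term; d = 0
            recovers classic Amdahl's Law)
    """
    return 1.0 / ((1.0 - f_vec) + f_vec / r + d)
```

With a highly vectorizable Monte Carlo kernel (say f_vec = 0.95, r = 20) the model shows how even a small data-motion penalty d caps the attainable speedup.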

Parallelizing a large scientific code: methods, issues, and concerns
Page(s): 21-31
The objectives of this study were to develop techniques and methods for effective analysis of large codes; to determine the feasibility of parallelizing an existing large scientific code; and to estimate the potential speedups attainable, and the associated tradeoffs in design complexity and work effort, if the code were parallelized either by redesign for a distributed-memory system (NCUBE, iPSC hypercube) or by straight serial translation targeting a shared-memory system (Cray-2, Sequent). MACH2, the code under study, is a 2-D magnetohydrodynamic (MHD) finite-difference code used to simulate plasma flow switches and nuclear radiation. A taxonomy relating functional levels of a code to levels of parallelism is presented and used as a model for analyzing existing large codes. It is shown that although parallelizing lower-level code segments (e.g. algorithms and loops) on shared-memory systems is generally easier to accomplish, in some cases an entire large code is most easily parallelized at a high level, via domain and functional decomposition. A multidecomposition scheme is also introduced in which acceptable load balances can be achieved for functional decompositions and heterogeneous data partitionings.

Benchmark calculations with an unstructured grid flow solver on a SIMD computer
Page(s): 32-41
An unstructured-grid flow solver was implemented on a massively parallel computer, and benchmark computations were performed. The solver is a two-dimensional computational fluid dynamics (CFD) code that computes first-order, steady-state solutions of the Euler equations. The parallel computer employed was the Connection Machine made by Thinking Machines Corp. The CFD code was programmed in *Lisp, the accuracy of the code was verified, and numerous optimizations were implemented. Several benchmark runs were then made to assess and understand the impact of the code modifications and to obtain meaningful performance comparisons with other advanced computers.

Implementation of a hypersonic rarefied flow particle simulation on the Connection Machine
Page(s): 42-49
A very efficient direct particle simulation algorithm for hypersonic rarefied flows is presented and its implementation on a Connection Machine is described. The implementation is capable of simulating up to 4 × 10^6 hard-sphere diatomic molecules using 64K processors, with performance better than that of a similar, fully vectorized implementation using a single processor of the Cray-2. Results from flow calculations are presented to demonstrate both the validity of the implementation and the range afforded by the method in solving hypersonic rarefied flow problems. Finally, a breakdown of the calculation time is given to identify a bottleneck which must be resolved for further improvement in performance.

Computational aerothermodynamics
Page(s): 51-57
Aerothermodynamics is defined [1] as "the study of the relationship of heat and mechanical energy in gases, especially air". To those familiar with fluid dynamics (the study of the flow properties of liquids and gases) this means that we must consider thermodynamic and chemical processes as they are coupled to the fluid motion. Computational fluid dynamics involves the numerical simulation of the equations of motion for an ideal gas; these equations are the conservation of mass, momentum and energy, and in their most general form are the compressible Navier-Stokes equations. Computational aerothermodynamics concerns the coupling of real-gas effects with these equations of motion to include thermochemical rate processes for chemical and energy exchange phenomena. These processes concern the creation and destruction of gas species by chemical reactions and the transfer of energy between the various species and between the various energy modes (e.g. translation, rotation, vibration, ionization, dissociation/recombination, etc.) of the species. To gain some insight into when such phenomena occur for current and future aerospace flight vehicles, Fig. 1 shows the flight regimes of some typical vehicles (e.g. Concorde, aerospace plane, Space Shuttle, aeroassisted space transfer vehicles, Apollo entry vehicle, etc.) in terms of flight altitude and flight speed. Also indicated in the figure are regimes where chemical reactions such as dissociation and ionization are important and where nonequilibrium thermochemical phenomena are important. To account for chemical reactions, equations for the conservation of each chemical species must be added to the flow field equation set. There are 5 flow field equations: one continuity, three momentum and one energy equation. For a simple model of dissociating and ionizing air there are typically 11 major species (N2, O2, N, O, NO, N2+, O2+, O+, N+, NO+, e−).
The inclusion of conservation equations for each of these species nearly triples the number of equations to be solved. When there are combustion processes, gas/surface interactions or ablation products, the number of species increases dramatically. To account for thermal nonequilibrium there are additional energy conservation equations to describe the energy exchange between the various energy modes (translational, rotational, vibrational, electronic, etc.). The full set of these conservation equations has been derived by Lee [2]. Under the following assumptions:
- continuum flow and no slip at solid boundaries (Kn ≪ 1);
- the thermal state of the gas can be described by separate, independent temperatures;
- the rotational state of the gas is in equilibrium with the translational state;
- weak ionization (≪ 1%);
the governing equations are written in Cartesian coordinates as [3]:

n mass conservation equations, where s is the chemical species:
$$\frac{\partial \rho_s}{\partial t} + \frac{\partial}{\partial x_j}(\rho_s u_j) = -\frac{\partial}{\partial x_j}(\rho_s v_{sj}) + w_s \quad (1)$$

d momentum conservation equations, where d is the number of spatial dimensions:
$$\frac{\partial}{\partial t}(\rho u_i) + \frac{\partial}{\partial x_j}(\rho u_i u_j + p\,\delta_{ij}) = -\frac{\partial \tau_{ij}}{\partial x_j} - \sum_s \frac{N_s}{N_e} Z_s \frac{\partial p_e}{\partial x_i} \quad (2)$$

conservation of vibrational energy for each of the m diatomic species:
$$\frac{\partial E_{vs}}{\partial t} + \frac{\partial}{\partial x_j}(E_{vs} u_j) = -\frac{\partial}{\partial x_j}(E_{vs} v_{sj}) - \frac{\partial q_{vsj}}{\partial x_j} + Q_{T-vs} + Q_{v-vs} + Q_{e-vs} + w_s e_{vs} \quad (3)$$

conservation of electron energy:
$$\frac{\partial E_e}{\partial t} + \frac{\partial}{\partial x_j}(E_e u_j) = -\frac{\partial}{\partial x_j}(E_e v_{ej}) - p_e \frac{\partial u_j}{\partial x_j} - \frac{\partial q_{ej}}{\partial x_j} + Q_{T-e} - \sum_s Q_{e-vs} + w_e e_e \quad (4)$$

and conservation of total energy:
$$\frac{\partial E}{\partial t} + \frac{\partial}{\partial x_j}\big((E + p)u_j\big) = -\frac{\partial}{\partial x_j}(q_j + q_{vj} + q_{ej}) - \frac{\partial}{\partial x_j}(u_i \tau_{ij}) - \sum_s \frac{\partial}{\partial x_j}(\rho_s v_{sj} h_s) - \sum_s \frac{N_s}{N_e} Z_s \frac{\partial p_e}{\partial x_i} u_i \quad (5)$$

Practical parallel supercomputing: examples from chemistry and physics
Page(s): 58-69
We use two large simulations, the chemical reaction dynamics of H + H2 and the collision of two galaxies, to show that current parallel machines are capable of large, supercomputer-level calculations. We contrast the different architectural tradeoffs for these problems and draw some implications for future production parallel supercomputers.

Capability of current supercomputers for computational fluid dynamics
Page(s): 71-80
The computer code named LANS3D, one of the representative Navier-Stokes codes in Japan, is taken as an example, and the capability of current CFD technology is discussed. This code was developed for the numerical simulation of high-Reynolds-number compressible flows. The algorithm used in this code, and how it has been improved so far, illustrates two important aspects of computational fluid dynamics (CFD) codes: efficiency and accuracy. Application examples show the capability of the code for engineering as well as physical problems. A benchmark test of the newest version of the code on supercomputers indicates that recent supercomputer improvements make the code a strong engineering tool for design purposes. At the same time, it is concluded that further compiler improvement may lead to the best use of supercomputers for computational fluid dynamics.

Computations of soil temperature rise due to HVDC ground return
Page(s): 86-95
The purpose of this paper is to present an application which, historically, did not make use of computing methodology in the solution of design problems. The design of High Voltage Direct Current (HVDC) ground electrodes involves the careful selection of several parameters in order to meet strict operating constraints. In particular, the operation of HVDC ground electrodes results in a rise of the surrounding soil temperature, which must be computed reliably so as to permit safe operation of the grounding network. The classical approach to this design issue was to make several simplifying assumptions and thereby obtain a closed-form solution which approximated the situation. Unfortunately, this technique is extremely limited in scope, and the accuracy of the solution depends on the simplifications made to obtain it. The technique presented in this paper uses a numerical model to represent the system to be studied. This approach provides extreme flexibility, so a minimum of assumptions is needed in order to obtain realistic solutions to the problem. The complexity of the numerical model, however, requires substantial computing capabilities.
Symbols:
g = heat generated, W/m³
ρ = soil electrical resistivity, Ω·m
J = current density, A/m²
T = temperature of the medium, °C
k = thermal conductivity, W/(°C·m)
α = thermal diffusivity, m²/s
Cp = specific heat, J/(kg·°C)
d = mass density, kg/m³
V = electrical potential, V
λ = over-relaxation factor
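
The kind of computation the symbol list implies (Joule heating g = ρJ² driving a steady heat-conduction problem, solved with an over-relaxation factor λ) can be sketched in one dimension; this is an illustrative stand-in for the paper's model, not its actual code, and the grid and boundary conditions are our assumptions:

```python
def soil_temperature_1d(n, L, k, rho, J, lam=1.8, tol=1e-10, max_iter=20000):
    """Steady temperature between two boundaries held at 0 degC for
    uniform Joule heating g = rho * J**2, from k * T'' + g = 0,
    solved by successive over-relaxation (SOR) with factor lam.
    Parameter names follow the paper's symbol list."""
    h = L / (n - 1)
    g = rho * J * J              # volumetric heat generation, W/m^3
    T = [0.0] * n                # boundary temperatures fixed at 0
    for _ in range(max_iter):
        delta = 0.0
        for i in range(1, n - 1):
            # Gauss-Seidel value for node i, then over-relax toward it
            gs = 0.5 * (T[i - 1] + T[i + 1]) + g * h * h / (2.0 * k)
            change = lam * (gs - T[i])
            T[i] += change
            delta = max(delta, abs(change))
        if delta < tol:
            break
    return T
```

For constant g the analytic solution is T(x) = g·x·(L − x)/(2k), so the sketch can be checked directly against a closed form, mirroring the paper's comparison of numerical and classical approaches.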

A radar simulation program for a 1024processor hypercube
Page(s): 96-105
We have developed a fast parallel version of an existing synthetic aperture radar (SAR) simulation program, SRIM. On a 1024-processor NCUBE hypercube it runs an order of magnitude faster than on a CRAY X-MP or CRAY Y-MP processor. This speed advantage is coupled with an order-of-magnitude advantage in machine acquisition cost. SRIM is a fairly large (30,000 lines of Fortran 77) program designed for uniprocessors; its restructuring for a hypercube provides new lessons in the task of altering older serial programs to run well on modern parallel architectures. We describe the techniques used for parallelization and the performance obtained. Several novel parallel approaches to problems of task distribution, data distribution, and direct output were required. These techniques increase performance and appear to have general applicability for massive parallelism. We describe the hierarchy necessary to dynamically manage (i.e., load balance) a large ensemble. The ensemble is used in a heterogeneous manner, with different programs on different parts of the hypercube. The heterogeneous approach takes advantage of the independent instruction streams possible on MIMD machines.

Parallel MIMD programming for global models of atmospheric flow
Page(s): 106-112
Modeling atmospheric flow is one application of supercomputers. In this paper we present some concepts for implementing global flow algorithms on shared-memory multiprocessors. We describe how an analysis of the algorithms, combined with appropriate parallel programming language support, allows an efficient and computationally correct implementation that minimizes synchronization difficulties. Performance measurements on the Encore multiprocessor support this assertion.

Computational fluid dynamics: current capabilities and directions for the future
Page(s): 113-122
Computational fluid dynamics (CFD) has made great strides in the detailed simulation of complex fluid flows, including some not previously understood. It is now being routinely applied to some rather complicated problems, and is starting to impact the design cycle of aerospace flight vehicles and their components. It is being used to complement, and is being complemented by, experimental studies. Several examples are presented in the paper to illustrate the current state of the art. Included in this paper is a discussion of the barriers to accomplishing the basic objective of numerical simulation. In addition, directions for the future of the discipline of computational fluid dynamics are addressed.

Parallel algorithm and VLSI architecture for a robot's inverse kinematics
Page(s): 123-132
The inverse solutions of a robotic system are generally produced by a serial process. Because the computing time for processing geometry data and generating an inverse solution corresponding to a specified point on a Cartesian trajectory is longer than the sampling period, the missing points in joint space are generated by interpolation schemes (linear or cubic-spline interpolation) between two inverse solutions; dynamic errors are therefore introduced. Obviously, this kind of dynamic error can be eliminated if the computational time for generating the inverse solutions and processing geometric information can be made less than the sampling period. With the available schemes, the dynamic errors increase when a robot's speed is increased; for a high-speed, high-performance robot the dynamic errors can be significant. In this paper, a parallel algorithm for a robot's inverse kinematics is derived and corresponding VLSI architectures are presented. The algorithm can also be implemented using multiprocessors. By using the proposed parallel algorithm, it is believed that the dynamic errors can be reduced significantly, or even eliminated, if the computing time for processing geometric data is reduced significantly through parallel processing.
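
The per-sample-period computation at issue can be illustrated with the simplest closed-form case, a planar two-link arm (a hypothetical 2-DOF example; the paper treats general manipulators and a parallel VLSI realization, neither of which is modeled here):

```python
import math

def two_link_ik(x, y, l1, l2, elbow_up=True):
    """Closed-form inverse kinematics of a planar two-link arm with
    link lengths l1, l2: returns joint angles (theta1, theta2) that
    place the end effector at Cartesian (x, y)."""
    d2 = x * x + y * y
    c2 = (d2 - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)   # cos(theta2)
    if not -1.0 <= c2 <= 1.0:
        raise ValueError("target out of reach")
    s2 = math.sqrt(1.0 - c2 * c2)
    if not elbow_up:
        s2 = -s2
    theta2 = math.atan2(s2, c2)
    theta1 = math.atan2(y, x) - math.atan2(l2 * s2, l1 + l2 * c2)
    return theta1, theta2

def forward(theta1, theta2, l1, l2):
    """Forward kinematics, used to verify an inverse solution."""
    x = l1 * math.cos(theta1) + l2 * math.cos(theta1 + theta2)
    y = l1 * math.sin(theta1) + l2 * math.sin(theta1 + theta2)
    return x, y
```

Even this small case shows why interpolation in joint space is tempting: each trajectory point costs trigonometric evaluations that must finish within the sampling period, which is exactly the latency the paper attacks with parallelism.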

Supercomputers in computational ocean acoustics
Page(s): 133-140
In this paper, we report on some computational experience in solving ocean acoustic propagation problems in three dimensions on supercomputers. The underlying Helmholtz equation is transformed into a parabolic-type equation in the Lee-Saad-Schultz model [5], which has a natural alternating direction implicit (ADI) implementation. We give estimates of the computing power required to solve problems with realistic sound velocity profiles. We then give performance results for the CRAY X-MP and for the computational kernel on the Intel hypercube (iPSC/2). We conclude with some remarks about architectural enhancements that would be beneficial to our application.
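
The computational kernel of an ADI scheme is many independent tridiagonal solves, one per grid line and per direction; a minimal sketch of that inner solver (the Thomas algorithm) is shown below. This is a generic illustration, not the paper's implementation:

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system with sub-diagonal a, diagonal b,
    super-diagonal c and right-hand side d (Thomas algorithm).
    a[0] and c[-1] are unused. Each ADI half-step reduces to many
    independent solves like this, which is what vectorizes and
    parallelizes well."""
    n = len(b)
    cp = [0.0] * n   # modified super-diagonal
    dp = [0.0] * n   # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                       # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):              # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

The recurrence is serial along one line, so parallelism comes from solving the many lines of the 3-D grid concurrently, which suits both the CRAY's vector units and a hypercube decomposition.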

A study of dissipation operators for the Euler equations and a three-dimensional channel flow
Page(s): 141-151
Explicit methods for the solution of fluid flow problems are of considerable interest in supercomputing, since these methods parallelize well. The treatment of the boundaries is of particular interest, both with respect to the numerical behavior of the solution and to computational efficiency. We have solved the three-dimensional Euler equations for a twisted channel using second-order centered difference operators and a three-stage Runge-Kutta method for the integration. Three different fourth-order dissipation operators were studied for numerical stabilization: one positive definite [8], one positive semidefinite [3], and one indefinite. The operators differ only in the treatment of the boundary. For computational efficiency, all dissipation operators were designed with a constant bandwidth in matrix representation, with the bandwidth determined by the operator in the interior. The positive definite dissipation operator results in a significant growth in entropy close to the channel walls; the other operators maintain constant entropy. Several different implementations of the semidefinite operator, obtained through factoring of the operator, were also studied. We show the difference in both convergence rate and robustness for the different dissipation operators and for the factorizations of the operator due to Eriksson. For the simulations in this study, one of the factorizations of the semidefinite operator required 70-90% of the number of iterations required by the positive definite operator. The indefinite operator was sensitive to perturbations in the inflow boundary conditions. The simulations were performed on an 8,192-processor Connection Machine system model CM-2. Full processor utilization was achieved, and a performance of 135 Mflops in single precision was obtained. A performance of 1.1 Gflops for a fully configured system with 65,536 processors was demonstrated.
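
In the interior, a fourth-order dissipation operator is the standard five-point stencil shown below; the three operators the paper compares coincide there and differ only in how the stencil is closed at the boundary, which this sketch sidesteps by leaving the two points nearest each wall untouched (an assumption for illustration, not one of the paper's boundary closures):

```python
def fourth_order_dissipation(u, eps):
    """Interior part of a fourth-order artificial-dissipation operator:
    d_j = -eps * (u[j-2] - 4*u[j-1] + 6*u[j] - 4*u[j+1] + u[j+2]).
    It vanishes on constant and linear fields and damps the odd-even
    (highest-frequency) mode most strongly."""
    n = len(u)
    d = [0.0] * n
    for j in range(2, n - 2):
        d[j] = -eps * (u[j - 2] - 4 * u[j - 1] + 6 * u[j]
                       - 4 * u[j + 1] + u[j + 2])
    return d

def apply_dissipation(u, eps):
    """One damping step u <- u + d."""
    d = fourth_order_dissipation(u, eps)
    return [ui + di for ui, di in zip(u, d)]
```

With eps = 1/16 a pure odd-even oscillation is annihilated in one step in the interior, while smooth (constant or linear) fields pass through unchanged, which is the behavior that makes such operators attractive for stabilizing centered schemes.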

A computer-assisted optimal depth lower bound for sorting networks with nine inputs
Page(s): 152-161
It is demonstrated that there is no nine-input sorting network of depth six. The proof was obtained by executing on a supercomputer a branch-and-bound algorithm which constructs and tests a critical subset of all possible candidates. Such proofs can be classified as experimental science rather than mathematics. In keeping with the paradigms of experimental science, a high-level description of the experiment and an analysis of the result are given.
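
Exhaustive testing of candidate networks is feasible because of the 0-1 principle: a comparator network sorts every input iff it sorts all 2^n binary inputs. A minimal checker built on that principle (a generic sketch, not the paper's branch-and-bound code):

```python
from itertools import product

def sorts_all(n, comparators):
    """Check a comparator network on n wires by the 0-1 principle:
    it sorts every input iff it sorts all 2**n binary inputs, so each
    candidate needs only 2**n test vectors rather than n! orderings."""
    for bits in product((0, 1), repeat=n):
        v = list(bits)
        for i, j in comparators:      # each comparator orders wires i < j
            if v[i] > v[j]:
                v[i], v[j] = v[j], v[i]
        if any(v[k] > v[k + 1] for k in range(n - 1)):
            return False
    return True
```

For n = 9 this is 512 vectors per candidate; the hard part the paper addresses is pruning the astronomically many depth-six candidates down to a critical subset worth testing.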

Realities associated with parallel processing
Page(s): 162-174
At the T. J. Watson Research Center there is a very active Condensed Matter Physics Group engaged in the study of semiconductors such as silicon (Si) and gallium arsenide (GaAs) [1]. One of the most important computer codes developed at Watson is a density functional program used to study the electronic structure of semiconductors. This program also consumes the most CPU time of all production applications at Watson. Thus, it was decided to undertake the parallelization of this code, not only in the hope of reducing elapsed time, improving turnaround on the IBM 3090 [2], and conserving other system resources such as memory and disk space, but also to test the IBM Parallel FORTRAN compiler [3] on a production program while developing an understanding of the impact of parallel jobs from a system standpoint. A speedup of 4.4 using 6 processors was achieved for this important scientific production code.

How can a SIMD machine implement a complex cellular automaton? A case study: von Neumann's 29-state cellular automaton
Page(s): 175-186
This study is part of an effort to simulate the 29-state self-reproducing cellular automaton described by John von Neumann in a manuscript that dates back to 1952. We are interested in the programming of very large SIMD arrays which, as a consequence of scaling them up, incorporate some features of cellular automata. Designing tools for programming them requires an experimental ground: considering that von Neumann's 29-state automaton is the only known very large and complex cellular automaton, its simulation is a necessary first step. Embedded in a two-dimensional cellular array, using 29 states per cell and a 5-cell neighborhood, this automaton exhibits the capabilities of universal computation and universal construction. This paper concentrates on the transition rule that governs the complex behavior of the 29-state automaton. We give a detailed presentation of the transition rule, with illustrative examples to ease its comprehension. We then discuss its implementation on a SIMD machine, using only 13 bits per processing element to encode the rule, each processing element corresponding to a cell. Finally, we present experimental results based upon the simulation of general-purpose components of the automaton on the SIMD machine: pulser, decoder, and periodic pulser.

Automatic vectorization of character string manipulation and relational operations in Pascal
Page(s): 187-196
In our paper at Supercomputing '88, an overview of V-Pascal, an automatic vectorizing compiler for Pascal, was presented with a focus on its Version 1. In that paper, vector-mode execution of non-numeric operations such as relational database operations and non-numeric data manipulations was considered as one of the higher functions to be added to Version 2 of V-Pascal. This paper describes the actual results we have obtained. These results are important in that a new vista has been opened up for vector supercomputers, which were originally designed solely for high-speed manipulation of scientific numerical data. More concretely, the V-Pascal compiler has acquired the ability to automatically vectorize Pascal programs that compare and assign massive amounts of character-string data, as well as programs that prescribe time-consuming relational operations, such as 'join', for relational database manipulation. Timing results demonstrate that these non-numeric operations are performed in the regime of vector performance.

Neural network simulation on shared-memory vector multiprocessors
Page(s): 197-204
We simulate three neural networks on a vector multiprocessor. The training time can be reduced significantly, especially when the training data set is large. The three neural networks are: 1) the feedforward network, 2) the recurrent network, and 3) the Hopfield network. The training algorithms are programmed so as to best utilize 1) the inherent parallelism in neural computing, and 2) the vector and concurrent operations available on the parallel machine. To verify the correctness of the parallelized training algorithms, each neural network is trained to perform a specific function: the feedforward network is trained to perform the Fourier transform, the recurrent network is trained to predict the solution of a delay differential equation, and the Hopfield network is trained to solve the traveling salesman problem. The machine we experiment with is the Alliant FX/80.

Concurrent and vectorized Monte Carlo simulation of the evolution of an assembly of particles increasing in number
Page(s): 205-214
Parallel Monte Carlo techniques are presented for simulating the evolution of an assembly of charged particles interacting with a background gas medium under the influence of an electric field. This simulation problem is inherently parallel: all particles can be traced independently over a short time interval. We have overcome three major difficulties: 1) the number of particles to be simulated increases over time due to the ionization process; 2) conditional branching statements can inhibit multiprocessing, which we avoid through proper manual program tuning; 3) concurrency and vectorization are fully utilized through the new parallelized Monte Carlo method. The shared-memory vector multiprocessor Alliant FX/80 has been used for performance measurements, and significant speedup has been achieved.

Protein structure prediction by a datalevel parallel algorithm
Page(s): 215-223
We have developed a software system, PHIPSI, on the Connection Machine that uses a parallel algorithm to retrieve and use information from a database of 112 known protein structures (selected from the Brookhaven Protein Databank) to predict the structures of other proteins. The φ and ψ angles of each amino acid (the angles each amino acid forms with its immediate neighbors) in a protein are used to represent its 3-D structure. PHIPSI's algorithm is based on the idea of memory-based reasoning (MBR) [10] and extends it to include a recursive procedure to refine its initial prediction and a "window" of varying sizes to look at different contexts of an input. PHIPSI has been tested with all the available data. Initial results show that it performs better than distribution-based guesses for most of the φ and ψ angle values.

Vector and parallel algorithms for Cholesky factorization on IBM 3090
Page(s): 225-233
In many engineering applications, a solution of Fx = b is required, where F is a positive definite symmetric matrix. This is usually obtained by the Cholesky factorization F = RR^T, where R is the lower triangular Cholesky factor. This is a compute-intensive problem; however, in order to achieve the best possible performance on the IBM 3090 Vector Facility, the problem requires blocking at various levels to match the 3090 memory hierarchy. A large problem which does not fit in a particular level of memory is blocked so that each block fits in that level, which minimizes data transfers between the various levels of memory. In this paper, various blocking schemes are described for vector and parallel implementation on the 3090 VF. Some of these algorithms have been included in the Engineering and Scientific Subroutine Library (ESSL). Performance numbers are also included; these algorithms achieve close to the peak performance of the 3090 uniprocessor and multiprocessors.
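
The blocking idea can be sketched with a right-looking blocked Cholesky: factor a narrow column panel, then update the trailing submatrix with a matrix-matrix operation that dominates the work. This is a generic illustration of the technique; the paper's ESSL kernels and the 3090's specific cache blocking are not modeled:

```python
def cholesky_blocked(A, nb):
    """Right-looking blocked Cholesky: F = L * L^T with L lower
    triangular, processed in nb-wide column panels so the bulk of the
    flops land in the trailing matrix-matrix update."""
    n = len(A)
    L = [row[:] for row in A]
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # factor the current panel (unblocked, within-panel updates only)
        for j in range(k, k + kb):
            for i in range(j, n):
                s = sum(L[i][p] * L[j][p] for p in range(k, j))
                if i == j:
                    L[j][j] = (L[j][j] - s) ** 0.5
                else:
                    L[i][j] = (L[i][j] - s) / L[j][j]
        # matrix-matrix update of the trailing submatrix
        for i in range(k + kb, n):
            for j in range(k + kb, i + 1):
                L[i][j] -= sum(L[i][p] * L[j][p] for p in range(k, k + kb))
    for i in range(n):               # zero the strict upper triangle
        for j in range(i + 1, n):
            L[i][j] = 0.0
    return L
```

The panel width nb is the tuning knob: it is chosen so a panel plus a slab of the trailing matrix fits in the fast level of memory, which is exactly the multi-level blocking the abstract describes.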

FFTs in external or hierarchical memory
Page(s): 234-242
Conventional algorithms for computing large one-dimensional fast Fourier transforms (FFTs), even those recently developed for vector and parallel computers, are largely unsuitable for systems with external or hierarchical memory. The principal reason is that most FFT algorithms require at least m complete passes through the data set to compute a 2^m-point FFT. This paper describes some advanced techniques for computing an ordered FFT on a computer with external or hierarchical memory. These algorithms (1) require as few as two passes through the external data set, (2) employ strictly unit-stride, long vector transfers between main memory and external storage, (3) require only a modest amount of scratch space in main memory, and (4) are well suited for vector and parallel computation. Performance figures are included for implementations of some of these algorithms on Cray supercomputers. Of interest is the fact that a main-memory version outperforms the current Cray library FFT routines on the Cray-2, the Cray X-MP, and the Cray Y-MP systems. Using all eight processors on the Cray Y-MP, this main-memory routine runs at nearly two gigaflops.
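
The two-pass structure rests on factoring a length n1·n2 transform into n1 transforms of length n2, a twiddle multiplication, a transpose, and n2 transforms of length n1. A small in-memory sketch of that decomposition follows (the naive `dft` kernel and the function names are ours; the paper's out-of-core versions additionally arrange the data so every external transfer is unit stride):

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT, standing in for an in-memory FFT kernel."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def four_step_fft(x, n1, n2):
    """Ordered DFT of length n1*n2 in two sweeps over the data:
    sweep 1 does n1 independent length-n2 transforms plus twiddle
    factors; sweep 2, after a transpose, does n2 independent
    length-n1 transforms."""
    n = n1 * n2
    # sweep 1: transform x[j1], x[j1+n1], ... for each residue j1,
    # then scale by the twiddle factor w^(j1*k2)
    rows = []
    for j1 in range(n1):
        row = dft(x[j1::n1])
        rows.append([row[k2] * cmath.exp(-2j * cmath.pi * j1 * k2 / n)
                     for k2 in range(n2)])
    # sweep 2: transpose, then transform across the j1 index
    out = [0j] * n
    for k2 in range(n2):
        col = dft([rows[j1][k2] for j1 in range(n1)])
        for k1 in range(n1):
            out[k2 + n2 * k1] = col[k1]
    return out
```

Because each sweep touches every element exactly once, a disk-resident data set needs only two passes, versus the m passes of a straight radix-2 FFT.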

Macrotasking the singular value decomposition of block circulant matrices on the Cray-2
Page(s): 243-247
A parallel algorithm to compute the singular value decomposition (SVD) of block circulant matrices on the Cray-2 is described. For a block circulant form described by M blocks with m × n elements in each block, the computation time using an SVD algorithm for general matrices has a lower bound Ω(M³ min(m, n)mn). Using a combination of fast Fourier transform (FFT) and SVD steps, the computation time for block circulant singular value decomposition (BCSVD) has a lower bound Ω(M min(m, n)mn), a relative savings of ~M². Memory usage bounds are reduced from O(M²mn) to O(Mmn), a relative savings of ~M. For M = m = n = 64, this decreases the computation time from approximately 12 hours to 30 seconds, and memory usage is reduced from 768 megabytes to 12 megabytes. The BCSVD algorithm partitions well into n macrotasks with a granularity of O(mM log M) for the FFT portion of the algorithm. The SVD portion of the algorithm partitions into M macrotasks with a granularity of O(min(m, n)mn). Again, for the case where M = m = n = 64, the FFT granularity is 29 ms and the SVD granularity is 428 ms. A speedup of 3.06 was achieved by using a prescheduled partitioning of tasks; the process creation overhead was 2.63 ms. Using a more elaborate self-scheduling method with four synchronizing server processes, a speedup of 3.25 was observed with four processors available; the server synchronization overhead was 0.32 ms. Relative memory overhead in both cases was about 4% for data space and 40% for code space.
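
The FFT+SVD combination works because a DFT block-diagonalizes any (block) circulant form, so only M small SVDs remain. In the scalar m = n = 1 case the idea collapses to one line of mathematics: the singular values of a circulant matrix are the moduli of the DFT of its first column. A sketch of that special case (illustrative only; the paper handles general m × n blocks):

```python
import cmath

def circulant_singular_values(c):
    """Singular values of the circulant matrix whose first column is c.
    The DFT diagonalizes the circulant, and for a normal matrix the
    singular values are the moduli of the eigenvalues, so one length-M
    transform replaces a dense O(M^3) SVD."""
    M = len(c)
    lam = [sum(c[j] * cmath.exp(-2j * cmath.pi * j * k / M) for j in range(M))
           for k in range(M)]
    return sorted((abs(v) for v in lam), reverse=True)
```

In the block case each DFT coefficient is an m × n block whose own small SVD is computed independently, which is exactly what yields the M-way macrotask partition described above.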

A block QR factorization algorithm using restricted pivoting
Page(s): 248-256
This paper presents a new algorithm for computing the QR factorization of a rank-deficient matrix on high-performance machines. The algorithm is based on the Householder QR factorization algorithm with column pivoting. The traditional pivoting strategy is not well suited for machines with a memory hierarchy, since it precludes the use of matrix-matrix operations. However, matrix-matrix operations perform better on those machines than matrix-vector or vector-vector operations, since they involve significantly less data movement per floating-point operation. We suggest a restricted pivoting strategy which allows us to formulate a block QR factorization algorithm in which the bulk of the work is in matrix-matrix operations. Incremental condition estimation is used to ensure the reliability of the restricted pivoting scheme. Implementation results on the Cray-2, Cray X-MP and Cray Y-MP show that the new algorithm performs significantly better than the traditional scheme and can more than halve the cost of computing the QR factorization.
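
The traditional scheme the paper improves on can be sketched as follows: at each step the remaining column of largest norm is swapped into the pivot position before its Householder reflector is formed. The global column swap at every step is what blocks matrix-matrix formulations. This is a generic sketch of the classic algorithm, not the paper's restricted-pivoting method:

```python
import math

def qr_column_pivoting(A):
    """Householder QR with traditional column pivoting on an m x n
    matrix A (list of rows). Returns (R, perm) with A[:, perm] = Q R;
    Q is applied but not accumulated. The per-step global pivot search
    and swap is exactly what a blocked formulation cannot tolerate."""
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    perm = list(range(n))
    for k in range(min(m, n)):
        # pivot: bring the column of largest trailing norm to position k
        norms = [sum(R[i][j] ** 2 for i in range(k, m)) for j in range(k, n)]
        p = k + norms.index(max(norms))
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(m):
            R[i][k], R[i][p] = R[i][p], R[i][k]
        # Householder reflector annihilating R[k+1:, k]
        alpha = math.sqrt(sum(R[i][k] ** 2 for i in range(k, m)))
        if alpha == 0.0:
            continue
        if R[k][k] > 0:
            alpha = -alpha
        v = [0.0] * m
        for i in range(k, m):
            v[i] = R[i][k]
        v[k] -= alpha
        vtv = sum(vi * vi for vi in v)
        for j in range(k, n):
            s = 2.0 * sum(v[i] * R[i][j] for i in range(k, m)) / vtv
            for i in range(k, m):
                R[i][j] -= s * v[i]
    return R, perm
```

A vanishing trailing diagonal entry of R reveals rank deficiency, which is the property the restricted strategy must preserve while confining its pivot search so that reflectors can be aggregated into matrix-matrix updates.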