
Supercomputing '89: Proceedings of the 1989 ACM/IEEE Conference on Supercomputing

Date: 12-17 Nov. 1989


Displaying Results 1 - 25 of 94
  • Vectorization of Monte Carlo particle transport: an architectural study using the LANL benchmark “GAMTEB”

    Page(s): 10 - 20

    Fully vectorized versions of the Los Alamos National Laboratory benchmark code GAMTEB, a Monte Carlo photon transport algorithm, were developed for the Cyber 205/ETA-10 and Cray X-MP/Y-MP architectures. Single-processor performance measurements of the vector and scalar implementations were modeled with a modified Amdahl's Law that accounts for the additional data motion in the vector code. The performance and implementation strategy of the vector codes are related to architectural features of each machine. Speedups of between fifteen and eighteen were observed on the Cyber 205/ETA-10 architectures, and about nine on the Cray X-MP/Y-MP architectures. The best single-processor execution time for the problem was 0.33 seconds on the ETA-10G and 0.42 seconds on the Cray Y-MP.
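
A modified Amdahl's Law of the kind described can be sketched as follows; this is a minimal illustration with all numerical values hypothetical (the paper's measured fractions and model details are not reproduced here):

```python
def amdahl_speedup(f_vec, r_vec):
    """Classic Amdahl's Law: a fraction f_vec of the scalar run time
    is sped up by a factor r_vec; the remainder stays scalar."""
    return 1.0 / ((1.0 - f_vec) + f_vec / r_vec)

def modified_speedup(f_vec, r_vec, f_move):
    """Modified form: additional data motion in the vector code adds
    a term f_move, expressed as a fraction of the scalar run time."""
    return 1.0 / ((1.0 - f_vec) + f_vec / r_vec + f_move)

# Hypothetical numbers: 95% vectorizable work, a 20x vector rate,
# and 2% overhead from extra gathers/scatters in the vector code.
base = amdahl_speedup(0.95, 20.0)             # ~10.3
adjusted = modified_speedup(0.95, 20.0, 0.02)  # ~8.5
```

The data-motion term caps the achievable speedup even when the vectorizable fraction is high, which is the effect the paper's model captures.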

  • Parallelizing a large scientific code - methods, issues, and concerns

    Page(s): 21 - 31

    The objectives of this study were to develop techniques and methods for the effective analysis of large codes; to determine the feasibility of parallelizing an existing large scientific code; and to estimate the potential speedups attainable, and the associated tradeoffs in design complexity and work effort, if the code were parallelized either by redesign for a distributed-memory system (NCube, iPSC hypercube) or by straight serial translation targeting a shared-memory system (Cray-2, Sequent). MACH2, the code under study, is a 2-D magnetohydrodynamic (MHD) finite-difference code used to simulate plasma flow switches and nuclear radiation. A taxonomy relating the functional levels of a code to levels of parallelism is presented and used as a model for analyzing existing large codes. It is shown that although parallelizing lower-level code segments (e.g. algorithms and loops) on shared-memory systems is generally easier to accomplish, in some cases an entire large code is most easily parallelized at a high level, via domain and functional decomposition. A multi-decomposition scheme is also introduced in which acceptable load balances can be achieved for functional decompositions and heterogeneous data partitionings.

  • Benchmark calculations with an unstructured grid flow solver on a SIMD computer

    Page(s): 32 - 41

    An unstructured grid flow solver was implemented on a massively parallel computer, and benchmark computations were performed. The solver was a two-dimensional computational fluid dynamics (CFD) code that performs first-order, steady-state solutions of the Euler equations. The parallel computer employed was the Connection Machine, made by Thinking Machines Corp. The CFD code was programmed in *Lisp, the accuracy of the code was verified, and numerous optimizations were implemented. Several benchmark runs were then made to assess and understand the impact of the code modifications and to obtain meaningful performance comparisons with other advanced computers.

  • Implementation of a hypersonic rarefied flow particle simulation on the Connection Machine

    Page(s): 42 - 49

    A very efficient direct particle simulation algorithm for hypersonic rarefied flows is presented and its implementation on a Connection Machine is described. The implementation is capable of simulating up to 4 × 10⁶ hard-sphere diatomic molecules using 64k processors, with performance better than that of a similar, fully vectorized implementation using a single processor of the Cray-2. Results from flow calculations are presented to demonstrate both the validity of the implementation and the range afforded by the method in solving hypersonic rarefied flow problems. Finally, a breakdown of the calculation time is given to identify a bottleneck which must be resolved for further improvement in performance.

  • Computational aerothermodynamics

    Page(s): 51 - 57

    Aerothermodynamics is defined [1] as “the study of the relationship of heat and mechanical energy in gases, especially air”. To those familiar with fluid dynamics (the study of the flow properties of liquids and gases) this means that we must consider thermodynamic and chemical processes as they are coupled to the fluid motion. Computational fluid dynamics involves the numerical simulation of the equations of motion for an ideal gas; these equations are the conservation of mass, momentum and energy, and in their most general form are the compressible Navier-Stokes equations. Computational aerothermodynamics concerns the coupling of real-gas effects with these equations of motion to include thermochemical rate processes for chemical and energy exchange phenomena. These processes concern the creation and destruction of gas species by chemical reactions and the transfer of energy between the various species and between the various energy modes (e.g. translation, rotation, vibration, ionization, dissociation/recombination, etc.) of the species. To gain some insight into when such phenomena occur for current and future aerospace flight vehicles, Fig. 1 shows the flight regimes of some typical vehicles (e.g. Concorde, aerospace plane, Space Shuttle, aeroassisted space transfer vehicles, Apollo entry vehicle, etc.) in terms of flight altitude and flight speed. Also indicated in the figure are the regimes where chemical reactions such as dissociation and ionization are important and where nonequilibrium thermochemical phenomena are important. To account for chemical reactions, equations for the conservation of each chemical species must be added to the flow field equation set. There are 5 flow field equations: one continuity, three momentum and one energy equation. For a simple model of dissociating and ionizing air there are typically 11 major species (N2, O2, N, O, NO, N2+, O2+, O+, N+, NO+, e-).
    The inclusion of conservation equations for each of these species nearly triples the number of equations to be solved. When there are combustion processes or gas/surface interactions or ablation products, the number of species increases dramatically. To account for thermal non-equilibrium there are additional energy conservation equations to describe the energy exchange between the various energy modes (translational, rotational, vibrational, electronic, etc.). The full set of these conservation equations has been derived by Lee [2]. Under the following assumptions:
      • continuum flow with no slip at solid boundaries (Kn ≪ 1),
      • the thermal state of the gas can be described by separate, independent temperatures,
      • the rotational state of the gas is in equilibrium with the translational state,
      • weak ionization (≪ 1%),
    the governing equations are written in Cartesian coordinates as [3]:

    n mass conservation equations, where s is the chemical species:
      ∂ρs/∂t + ∂(ρs uj)/∂xj = −∂(ρs vsj)/∂xj + ws   (1)

    d momentum conservation equations, where d is the number of spatial dimensions:
      ∂(ρ ui)/∂t + ∂(ρ ui uj + p δij)/∂xj = −∂τij/∂xj − Σs (Ns/Ne) Zs ∂pe/∂xi   (2)

    conservation of vibrational energy for each of the m diatomic species:
      ∂Evs/∂t + ∂(Evs uj)/∂xj = −∂(Evs vsj)/∂xj − ∂qvsj/∂xj + QT−vs + Qv−vs + Qe−vs + ws evs   (3)

    conservation of electron energy:
      ∂Ee/∂t + ∂(Ee uj)/∂xj = −∂(Ee vej)/∂xj − pe ∂uj/∂xj − ∂qej/∂xj + QT−e − Σs Qe−vs + we ee   (4)

    and conservation of total energy:
      ∂E/∂t + ∂((E + p) uj)/∂xj = −∂(qj + qvj + qej)/∂xj − ∂(ui τij)/∂xj − Σs ∂(ρs vsj hs)/∂xj − Σs (Ns/Ne) Zs (∂pe/∂xi) ui   (5)

  • Practical parallel supercomputing: examples from chemistry and physics

    Page(s): 58 - 69

    We use two large simulations, the chemical reaction dynamics of H + H2 and the collision of two galaxies, to show that current parallel machines are capable of large supercomputer-level calculations. We contrast the different architectural tradeoffs for these problems and draw some implications for future production parallel supercomputers.

  • Capability of current supercomputers for the computational fluid dynamics

    Page(s): 71 - 80

    The computer code LANS3D, one of the representative Navier-Stokes codes in Japan, is taken as an example, and the capability of current CFD technology is discussed. This code was developed for the numerical simulation of high-Reynolds-number compressible flows. The algorithm used in this code, and how it has been improved so far, illustrates two important aspects of computational fluid dynamics (CFD) codes: efficiency and accuracy. Some application examples show the capability of the code for engineering problems as well as physical problems. A benchmark test of the newest version of the code on supercomputers indicates that recent supercomputer improvements enable the code to serve as a strong engineering tool for design purposes. At the same time, it is concluded that further compiler improvement may lead to the best use of supercomputers in computational fluid dynamics.

  • Computations of soil temperature rise due to HVDC ground return

    Page(s): 86 - 95

    The purpose of this paper is to present an application which, historically, did not make use of computing methodology in the solution of design problems. The design of High Voltage Direct Current (HVDC) ground electrodes involves the careful selection of several parameters in order to meet strict operating constraints. In particular, the operation of HVDC ground electrodes results in a rise of the surrounding soil temperature, which must be computed reasonably accurately so as to permit safe operation of the grounding network. The classical approach to this design issue was to make several simplifying assumptions and thereby obtain a closed-form solution which approximated the situation. Unfortunately, this technique is extremely limited in its scope, and the accuracy of the solution depends on the simplifications made in order to obtain it. The technique presented in this paper uses a numerical model to represent the system to be studied. This approach provides extreme flexibility; thus a minimum of assumptions is needed in order to obtain realistic solutions to the problem. The complexity of the numerical model, however, requires substantial computing capabilities.
    Symbols:
      g = heat generated, W/m³
      ρ = soil electrical resistivity, Ω·m
      J = current density, A/m²
      T = temperature of the medium, °C
      k = thermal conductivity, W/°C·m
      α = thermal diffusivity, m²/s
      Cp = specific heat, J/kg·°C
      d = mass density, kg/m³
      V = electrical potential, V
      λ = over-relaxation factor
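
As a sketch of the kind of numerical model involved, the following solves a steady heat-conduction equation with Joule heating g = ρJ² by successive over-relaxation, using the symbol list above (λ as the over-relaxation factor). The grid, boundary condition, and all parameter values are illustrative assumptions, not the paper's model:

```python
import numpy as np

def soil_temperature(g, k, h, lam=1.8, tol=1e-8, max_iter=20000):
    """Solve  k * laplacian(T) + g = 0  on a square grid of spacing h
    with T = 0 on the boundary, by successive over-relaxation (SOR)
    with relaxation factor lam.  g[i, j] is heat generation in W/m^3."""
    T = np.zeros_like(g, dtype=float)
    n, m = g.shape
    for _ in range(max_iter):
        delta = 0.0
        for i in range(1, n - 1):
            for j in range(1, m - 1):
                # Gauss-Seidel value from the 5-point stencil
                gs = 0.25 * (T[i-1, j] + T[i+1, j] + T[i, j-1] + T[i, j+1]
                             + h * h * g[i, j] / k)
                new = T[i, j] + lam * (gs - T[i, j])  # over-relax
                delta = max(delta, abs(new - T[i, j]))
                T[i, j] = new
        if delta < tol:
            break
    return T

# Uniform Joule heating g = rho * J^2 (illustrative values):
rho, J = 100.0, 0.5          # ohm-m, A/m^2
g = np.full((17, 17), rho * J * J)
T = soil_temperature(g, k=1.0, h=0.1)
```

The over-relaxation factor λ trades iteration count against stability, which is why it appears as an explicit design parameter in the symbol list.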

  • A radar simulation program for a 1024-processor hypercube

    Page(s): 96 - 105

    We have developed a fast parallel version of an existing synthetic aperture radar (SAR) simulation program, SRIM. On a 1024-processor NCUBE hypercube it runs an order of magnitude faster than on a CRAY X-MP or CRAY Y-MP processor. This speed advantage is coupled with an order of magnitude advantage in machine acquisition cost. SRIM is a somewhat large (30,000 lines of Fortran 77) program designed for uniprocessors; its restructuring for a hypercube provides new lessons in the task of altering older serial programs to run well on modern parallel architectures. We describe the techniques used for parallelization, and the performance obtained. Several novel parallel approaches to problems of task distribution, data distribution, and direct output were required. These techniques increase performance and appear to have general applicability for massive parallelism. We describe the hierarchy necessary to dynamically manage (i.e., load balance) a large ensemble. The ensemble is used in a heterogeneous manner, with different programs on different parts of the hypercube. The heterogeneous approach takes advantage of the independent instruction streams possible on MIMD machines.

  • Parallel MIMD programming for global models of atmospheric flow

    Page(s): 106 - 112

    Modeling atmospheric flow is one application of supercomputers. In this paper we present some concepts for implementing global flow algorithms on shared memory multiprocessors. We describe how an analysis of the algorithms combined with the appropriate parallel programming language support allows an efficient and computationally correct implementation, which minimizes the synchronization difficulties. Some performance measurements on the Encore multiprocessor serve to support this assertion.

  • Computational fluid dynamics - current capabilities and directions for the future

    Page(s): 113 - 122

    Computational fluid dynamics (CFD) has made great strides in the detailed simulation of complex fluid flows, including some not previously understood. It is now being routinely applied to some rather complicated problems, and is starting to impact the design cycle of aerospace flight vehicles and their components. It is being used to complement, and is being complemented by, experimental studies. Several examples are presented in the paper to illustrate the current state of the art. Also included is a discussion of the barriers to accomplishing the basic objective of numerical simulation. In addition, directions for the future of the discipline of computational fluid dynamics are addressed.

  • Parallel algorithm and VLSI architecture for a robot's inverse kinematics

    Page(s): 123 - 132

    The inverse solutions of a robotic system are generally produced by a serial process. Because the computing time for processing geometry data and generating an inverse solution corresponding to a specified point on a Cartesian trajectory is larger than the sampling period, the missing points in the joint space are generated by interpolation schemes (linear or cubic-spline interpolation) between two inverse solutions. Dynamic errors are therefore introduced. Obviously this kind of dynamic error can be eliminated if the computational time for generating the inverse solutions and processing the geometric information can be made less than the sampling period. With the available schemes, the dynamic errors increase when a robot's speed is increased; for a high-speed, high-performance robot, the dynamic errors can be significant. In this paper, a parallel algorithm for a robot's inverse kinematics is derived and corresponding VLSI architectures are presented. The algorithm can also be implemented using multiprocessors. By using the proposed parallel algorithm, it is believed that the dynamic errors can be reduced significantly or even eliminated if the computing time for processing the geometric data is reduced sufficiently through parallel processing.

  • Supercomputers in computational ocean acoustics

    Page(s): 133 - 140

    In this paper, we report on some computational experience in solving ocean acoustic propagation problems in three dimensions on supercomputers. The underlying Helmholtz equation is transformed into a parabolic-type equation in the Lee-Saad-Schultz model [5], which has a natural alternating direction implicit (ADI) implementation. We give estimates of the computing power required to solve problems with realistic sound velocity profiles. We then give performance results for the CRAY X-MP and for the computational kernel on the Intel hypercube (iPSC/2). We conclude with some remarks about architectural enhancements that would be beneficial to our application.

  • A study of dissipation operators for the Euler equations and a three-dimensional channel flow

    Page(s): 141 - 151

    Explicit methods for the solution of fluid flow problems are of considerable interest in supercomputing, as these methods parallelize well. The treatment of the boundaries is of particular interest, both with respect to the numerical behavior of the solution and to computational efficiency. We have solved the three-dimensional Euler equations for a twisted channel using second-order centered difference operators and a three-stage Runge-Kutta method for the integration. Three different fourth-order dissipation operators were studied for numerical stabilization: one positive definite [8], one positive semidefinite [3], and one indefinite. The operators differ only in the treatment of the boundary. For computational efficiency, all dissipation operators were designed with a constant bandwidth in matrix representation, with the bandwidth determined by the operator in the interior. The positive definite dissipation operator results in a significant growth in entropy close to the channel walls; the other operators maintain constant entropy. Several different implementations of the semidefinite operator, obtained through factoring of the operator, were also studied. We show the differences in both convergence rate and robustness for the different dissipation operators and for the factorizations of the operator due to Eriksson. For the simulations in this study, one of the factorizations of the semidefinite operator required 70-90% of the number of iterations required by the positive definite operator. The indefinite operator was sensitive to perturbations in the inflow boundary conditions. The simulations were performed on an 8,192-processor Connection Machine model CM-2 system. Full processor utilization was achieved, and a performance of 135 Mflops in single precision was obtained. A performance of 1.1 Gflops for a fully configured system with 65,536 processors was demonstrated.
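
For illustration, the interior stencil of a fourth-difference artificial dissipation operator of the kind compared in the paper can be written as follows (a 1-D sketch with a hypothetical coefficient eps; the paper's operators differ only in the boundary treatment, which is omitted here):

```python
import numpy as np

def fourth_order_dissipation(u, eps):
    """Interior part of a fourth-difference dissipation term,
    d[i] = -eps * (u[i-2] - 4*u[i-1] + 6*u[i] - 4*u[i+1] + u[i+2]).
    It vanishes on constant and linear data, so it damps only
    high-frequency (odd-even) oscillations."""
    d = np.zeros_like(u, dtype=float)
    d[2:-2] = -eps * (u[:-4] - 4*u[1:-3] + 6*u[2:-2] - 4*u[3:-1] + u[4:])
    return d

x = np.linspace(0.0, 1.0, 21)
smooth = fourth_order_dissipation(x, eps=0.01)                  # ~0 in the interior
wiggly = fourth_order_dissipation((-1.0) ** np.arange(21), eps=0.01)
```

Because the stencil has a fixed five-point width in the interior, the operator's matrix representation has constant bandwidth, which is the efficiency property the abstract mentions.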

  • A computer assisted optimal depth lower bound for sorting networks with nine inputs

    Page(s): 152 - 161

    It is demonstrated that there is no nine-input sorting network of depth six. The proof was obtained by executing on a supercomputer a branch-and-bound algorithm which constructs and tests a critical subset of all possible candidates. Such proofs can be classified as experimental science, rather than mathematics. In keeping with the paradigms of experimental science, a high-level description of the experiment and analysis of the result are given.
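
The exhaustive testing such a proof rests on is usually based on the classical zero-one principle: a comparator network sorts all inputs if and only if it sorts all 2^n inputs of 0s and 1s. A minimal checker along those lines (not the paper's branch-and-bound code) might look like:

```python
from itertools import product

def sorts_all(n, layers):
    """Return True iff the comparator network sorts every input.
    layers: a list of layers; each layer is a list of comparators
    (i, j) with i < j that compare-and-swap positions i and j in
    parallel.  By the 0-1 principle it suffices to test all 2^n
    binary vectors."""
    for bits in product((0, 1), repeat=n):
        v = list(bits)
        for layer in layers:
            for i, j in layer:
                if v[i] > v[j]:
                    v[i], v[j] = v[j], v[i]
        if any(v[k] > v[k + 1] for k in range(n - 1)):
            return False
    return True

# Batcher's depth-3 odd-even merge network for n = 4:
net4 = [[(0, 1), (2, 3)], [(0, 2), (1, 3)], [(1, 2)]]
```

The depth of a network is its number of layers; the paper's result says that for nine inputs no choice of six layers passes such a check.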

  • Realities associated with parallel processing

    Page(s): 162 - 174

    At the T. J. Watson Research Center, there is a very active Condensed Matter Physics Group engaged in the study of semiconductors such as silicon (Si) and gallium arsenide (GaAs) [1]. One of the most important computer codes developed at Watson is a density functional program which is used to study the electronic structure of semiconductors. This program also consumes the most CPU time of all production applications at Watson. Thus, it was decided to undertake the parallelization of this code, not only in the hope of reducing elapsed time, improving turnaround on the IBM 3090 [2], and conserving other system resources such as memory and disk space, but also in an attempt to test the IBM Parallel FORTRAN Compiler [3] on a production program while developing an understanding of the impact of parallel jobs from a system standpoint. A speedup of 4.4 using 6 processors was achieved for this important scientific production code.

  • How a SIMD machine can implement a complex cellular automaton? A case study: von Neumann's 29-state cellular automaton

    Page(s): 175 - 186

    This study is part of an effort to simulate the 29-state self-reproducing cellular automaton described by John von Neumann in a manuscript that dates back to 1952. We are interested in the programming of very large SIMD arrays which, as a consequence of scaling them up, incorporate some features of cellular automata. Designing tools for programming them requires an experimental ground: considering that von Neumann's 29-state automaton is the only known very large and complex cellular automaton, its simulation is a necessary first step. Embedded in a two-dimensional cellular array, using 29 states per cell and a 5-cell neighborhood, this automaton exhibits the capabilities of universal computation and universal construction. This paper concentrates on the transition rule that governs the complex behavior of the 29-state automaton. We give a detailed presentation of its transition rule, with illustrative examples to ease its comprehension. We then discuss its implementation on a SIMD machine, using only 13 bits per processing element to encode the rule, each processing element corresponding to a cell. Finally, we present experimental results based upon the simulation of general-purpose components of the automaton (pulser, decoder, periodic pulser) on the SIMD machine.

  • Automatic vectorization of character string manipulation and relational operations in Pascal

    Page(s): 187 - 196

    In our paper at Supercomputing '88, an overview of V-Pascal, an automatic vectorizing compiler for Pascal, was presented with a focus on its Version 1. In that paper, vector-mode execution of nonnumeric operations such as relational database operations and nonnumeric data manipulations was considered as one of the higher functions to be added to Version 2 of V-Pascal. This paper describes the actual results we have obtained. These results are important in that a new vista has been opened up for vector supercomputers, which were originally designed solely for high-speed manipulation of scientific numerical data. More concretely, the compiler V-Pascal has acquired the ability to automatically vectorize Pascal programs that compare and assign massive character-string data, as well as programs that prescribe time-consuming relational operations such as 'join' for relational database manipulation. Timing results demonstrate that these nonnumeric operations are performed in the regime of vector performance.

  • Neural network simulation on shared-memory vector multiprocessors

    Page(s): 197 - 204

    We simulate three neural networks on a vector multiprocessor. The training time can be reduced significantly, especially when the training data size is large. The three neural networks are: 1) the feedforward network, 2) the recurrent network, and 3) the Hopfield network. The training algorithms are programmed in such a way as to best utilize 1) the inherent parallelism in neural computing, and 2) the vector and concurrent operations available on the parallel machine. To prove the correctness of the parallelized training algorithms, each neural network is trained to perform a specific function: the feedforward network is trained to perform the Fourier transform, the recurrent network is trained to predict the solution of a delay differential equation, and the Hopfield network is trained to solve the traveling salesman problem. The machine we experiment with is the Alliant FX/80.

  • Concurrent and vectorized Monte Carlo simulation of the evolution of an assembly of particles increasing in number

    Page(s): 205 - 214

    Parallel Monte Carlo techniques for simulating the evolution of an assembly of charged particles interacting with a background gas medium under the influence of an electric field are presented. This simulation problem is inherently parallel: all the particles can be traced independently over a short time interval. We have overcome three major difficulties: 1) the number of particles to be simulated increases over time due to the ionization process; 2) with proper manual program tuning, the conditional branching statements do not inhibit multiprocessing; and 3) concurrency and vectorization are fully utilized through the new parallelized Monte Carlo method. The shared-memory vector multiprocessor Alliant FX/80 has been used for performance measurements. Significant speedup has been achieved.

  • Protein structure prediction by a data-level parallel algorithm

    Page(s): 215 - 223

    We have developed a software system, PHI-PSI, on the Connection Machine that uses a parallel algorithm to retrieve and use information from a database of 112 known protein structures (selected from the Brookhaven Protein Databank) to predict the structures of other proteins. The φ and ψ angles of each amino acid (the angles each amino acid forms with its immediate neighbors) in a protein are used to represent its 3-D structure. PHI-PSI's algorithm is based on the idea of memory-based reasoning (MBR) [10] and extends it to include a recursive procedure to refine its initial prediction and a “window” of varying sizes to look at different contexts of an input. PHI-PSI has been tested with all the available data. Initial results show that it performs better than distribution-based guesses for most of the φ and ψ angle values.

  • Vector and parallel algorithms for Cholesky factorization on IBM 3090

    Page(s): 225 - 233

    In many engineering applications, a solution of Fx = b is required, where F is a positive definite symmetric matrix. This is usually done by the Cholesky factorization, F = RR^T, where R is the lower triangular Cholesky factor. This is a compute-intensive problem. However, in order to achieve the best possible performance on the IBM 3090 Vector Facility, the problem requires blocking at various levels to match the 3090 memory hierarchy. A large problem which does not fit in a particular level of memory is blocked so that each block fits in that level; this minimizes data transfers between the various levels of memory. In this paper, various blocking schemes are described for vector and parallel implementation on the 3090 VF. Some of these algorithms have been included in the Engineering and Scientific Subroutine Library (ESSL). Performance numbers are also included. These algorithms achieve close to the peak performance of the 3090 uniprocessor and multiprocessors.
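
The blocking idea can be sketched as follows: a right-looking blocked Cholesky in which the trailing update for each panel is a matrix-matrix operation. This is a minimal illustration, not the ESSL implementation; the block size nb is a hypothetical tuning parameter matched to the memory hierarchy:

```python
import numpy as np

def blocked_cholesky(F, nb=2):
    """Right-looking blocked Cholesky F = R @ R.T, R lower triangular.
    Per panel: factor the nb x nb diagonal block, triangular-solve the
    panel below it, then apply a rank-nb matrix-matrix update to the
    trailing submatrix (where the bulk of the flops lives)."""
    A = np.asarray(F, dtype=float).copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        A[k:e, k:e] = np.linalg.cholesky(A[k:e, k:e])   # diagonal block
        if e < n:
            # panel: solve R21 @ R11^T = A21 for R21
            A[e:, k:e] = np.linalg.solve(A[k:e, k:e], A[e:, k:e].T).T
            # trailing update: A22 -= R21 @ R21^T  (matrix-matrix work)
            A[e:, e:] -= A[e:, k:e] @ A[e:, k:e].T
    return np.tril(A)
```

Larger nb shifts more of the work into the matrix-matrix update, which is the operation that best amortizes data movement across the memory hierarchy.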

  • FFTs in external or hierarchical memory

    Page(s): 234 - 242

    Conventional algorithms for computing large one-dimensional fast Fourier transforms (FFTs), even those recently developed for vector and parallel computers, are largely unsuitable for systems with external or hierarchical memory. The principal reason is that most FFT algorithms require at least m complete passes through the data set to compute a 2^m-point FFT. This paper describes some advanced techniques for computing an ordered FFT on a computer with external or hierarchical memory. These algorithms (1) require as few as two passes through the external data set, (2) employ strictly unit-stride, long vector transfers between main memory and external storage, (3) require only a modest amount of scratch space in main memory, and (4) are well suited to vector and parallel computation. Performance figures are included for implementations of some of these algorithms on Cray supercomputers. Of interest is the fact that a main-memory version outperforms the current Cray library FFT routines on the Cray-2, Cray X-MP, and Cray Y-MP systems. Using all eight processors on the Cray Y-MP, this main-memory routine runs at nearly two gigaflops.
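
The structure that makes a two-pass, unit-stride out-of-core FFT possible is the classic four-step factorization of an n = n1·n2 transform. A minimal in-memory sketch of that factorization (not the paper's algorithm) is:

```python
import numpy as np

def four_step_fft(x, n1, n2):
    """Four-step FFT of a length n1*n2 vector: (1) n1 FFTs of length
    n2, (2) a twiddle-factor multiply, (3) a transpose, (4) n2 FFTs
    of length n1.  Out of core, each FFT phase is one pass over the
    data and the transpose supplies the data reordering."""
    n = n1 * n2
    A = np.asarray(x, dtype=complex).reshape(n2, n1).T  # A[j1, j2] = x[j1 + n1*j2]
    B = np.fft.fft(A, axis=1)                           # step 1: length-n2 FFTs
    j1 = np.arange(n1).reshape(n1, 1)
    k2 = np.arange(n2).reshape(1, n2)
    B *= np.exp(-2j * np.pi * j1 * k2 / n)              # step 2: twiddle factors
    C = B.T                                             # step 3: transpose
    D = np.fft.fft(C, axis=1)                           # step 4: length-n1 FFTs
    return D.flatten(order='F')                         # X[k2 + n2*k1] = D[k2, k1]
```

Only the two FFT phases touch every element, which is how the pass count drops to roughly two regardless of the transform length.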

  • Macrotasking the singular value decomposition of block circulant matrices on the Cray-2

    Page(s): 243 - 247

    A parallel algorithm to compute the singular value decomposition (SVD) of block circulant matrices on the Cray-2 is described. For a block circulant form described by M blocks with m x n elements in each block, the computation time using an SVD algorithm for general matrices has a lower bound Ω(M³ min(m, n) mn). Using a combination of fast Fourier transform (FFT) and SVD steps, the computation time for the block circulant singular value decomposition (BCSVD) has a lower bound Ω(M min(m, n) mn), a relative savings of ~M². Memory usage bounds are reduced from O(M²mn) to O(Mmn), a relative savings of ~M. For M = m = n = 64, this decreases the computation time from approximately 12 hours to 30 seconds, and memory usage is reduced from 768 megabytes to 12 megabytes. The BCSVD algorithm partitions well into n macrotasks with a granularity of O(mM log M) for the FFT portion of the algorithm. The SVD portion of the algorithm partitions into M macrotasks with a granularity of O(min(m, n) mn). Again, for the case where M = m = n = 64, the FFT granularity is 29 ms and the SVD granularity is 428 ms. A speedup of 3.06 was achieved by using a prescheduled partitioning of tasks. The process creation overhead was 2.63 ms. Using a more elaborate self-scheduling method with four synchronizing server processes, a speedup of 3.25 was observed with four processors available. The server synchronization overhead was 0.32 ms. Relative memory overhead in both cases was about 4% for data space and 40% for code space.
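
The FFT+SVD combination rests on the fact that a block circulant matrix is block-diagonalized by a DFT across the block index, so its singular values are the union of the singular values of M small blocks. A minimal sketch of that reduction (an illustrative helper, not the paper's macrotasked Cray-2 code):

```python
import numpy as np

def bcsvd_singular_values(blocks):
    """Singular values of the block circulant matrix C with
    C[i, j] = blocks[(j - i) % M]  (M blocks, each m x n).
    Writing C = sum_r P^r (x) B_r, with P the cyclic shift, a DFT
    across the block index block-diagonalizes C, leaving M small
    m x n SVDs instead of one (M*m) x (M*n) SVD."""
    B = np.stack(blocks)                  # shape (M, m, n)
    Bhat = np.fft.fft(B, axis=0)          # DFT over the block index
    svals = [np.linalg.svd(Bhat[k], compute_uv=False)
             for k in range(len(blocks))]
    return np.sort(np.concatenate(svals))[::-1]
```

The FFT step also explains the macrotask structure in the abstract: the DFTs across the block index and the M small SVDs are all independent of one another.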

  • A block QR factorization algorithm using restricted pivoting

    Page(s): 248 - 256

    This paper presents a new algorithm for computing the QR factorization of a rank-deficient matrix on high-performance machines. The algorithm is based on the Householder QR factorization algorithm with column pivoting. The traditional pivoting strategy is not well suited to machines with a memory hierarchy, since it precludes the use of matrix-matrix operations. However, matrix-matrix operations perform better on those machines than matrix-vector or vector-vector operations, since they involve significantly less data movement per floating-point operation. We suggest a restricted pivoting strategy which allows us to formulate a block QR factorization algorithm in which the bulk of the work is in matrix-matrix operations. Incremental condition estimation is used to ensure the reliability of the restricted pivoting scheme. Implementation results on the Cray-2, Cray X-MP and Cray Y-MP show that the new algorithm performs significantly better than the traditional scheme and can more than halve the cost of computing the QR factorization.
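
For reference, the traditional column-pivoting scheme that the restricted strategy relaxes can be sketched as follows (a plain Householder QR with full column pivoting; illustrative only, not the paper's blocked algorithm):

```python
import numpy as np

def qr_column_pivoting(A):
    """Householder QR with traditional column pivoting: at each step
    the remaining column of largest norm is swapped into place before
    the reflection.  Returns R (upper triangular) and the pivot order.
    Because each pivot choice depends on the freshly updated trailing
    matrix, the updates stay matrix-vector (rank-1), which is what
    blocks the use of matrix-matrix kernels."""
    A = np.asarray(A, dtype=float).copy()
    m, n = A.shape
    piv = np.arange(n)
    for k in range(min(m, n)):
        # pivot: bring the largest remaining column to position k
        j = k + int(np.argmax(np.sum(A[k:, k:] ** 2, axis=0)))
        A[:, [k, j]] = A[:, [j, k]]
        piv[[k, j]] = piv[[j, k]]
        # Householder reflection zeroing A[k+1:, k]
        x = A[k:, k]
        v = x.copy()
        v[0] += np.copysign(np.linalg.norm(x), x[0])
        nv = np.linalg.norm(v)
        if nv > 0:
            v /= nv
            A[k:, k:] -= 2.0 * np.outer(v, v @ A[k:, k:])
    return np.triu(A[:min(m, n), :]), piv

```

The pivoting keeps the diagonal of R nonincreasing in magnitude, which is what exposes rank deficiency; the paper's restricted strategy confines the pivot search so the same reflections can be applied in blocks.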
