
Supercomputing, ACM/IEEE 1997 Conference

Date: 15-21 Nov. 1997


Displaying Results 1 - 25 of 62
  • The Effects of Communication Parameters on End Performance of Shared Virtual Memory Clusters

    Page(s): 1

    Recently there has been considerable effort to provide cost-effective shared virtual memory (SVM) systems by employing software-only solutions on clusters of high-end workstations coupled with high-bandwidth, low-latency commodity networks. Much of the work so far has focused on improving protocols, and there has been some work on restructuring applications to perform better on SVM systems. The result of this progress has been the promise of good performance on a range of applications, at least in the 16-32 processor range. New system area networks and network interfaces provide significantly lower overhead, lower latency, and higher bandwidth communication in clusters; inexpensive SMPs have become common as the nodes of these clusters; and SVM protocols are now quite mature. With this progress, it is now useful to examine which system bottlenecks stand in the way of effective parallel performance: in particular, which parameters of the communication architecture are most important to improve further relative to processor speed, which are already adequate on modern systems for most applications, and how this will change with technology in the future. Such information can assist system designers in determining where to focus their energies in improving performance, and users in determining what system characteristics are appropriate for their applications. We find that the most important system cost to improve is the overhead of generating and delivering interrupts. Improving network interface (and I/O bus) bandwidth relative to processor speed helps some bandwidth-bound applications, but currently available ratios of bandwidth to processor speed are already adequate for many others. Surprisingly, neither the processor overhead for handling messages nor the occupancy of the communication interface in preparing and pushing packets through the network appears to require much improvement.
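
    The parameters under study (host overhead, interface occupancy, network latency, bandwidth, and interrupt cost) can be combined into a simple additive cost model. The sketch below is a minimal illustration of such a model, not the paper's methodology; all numeric values are assumed defaults, not measurements from the paper.

```python
# Hypothetical additive cost model for one SVM protocol message.  Parameter
# values are illustrative defaults, not measurements from the paper.

def message_time_us(nbytes, overhead_us=5.0, occupancy_us=3.0, latency_us=10.0,
                    bandwidth_mb_s=100.0, interrupt_us=50.0, use_interrupt=True):
    """Host overhead + NI occupancy + wire latency + transfer time (+ interrupt)."""
    transfer_us = nbytes / bandwidth_mb_s          # 1 MB/s == 1 byte/us
    t = overhead_us + occupancy_us + latency_us + transfer_us
    if use_interrupt:
        t += interrupt_us   # the paper identifies interrupt delivery as the key cost
    return t

# A 4 KB page fetch with and without an interrupt on the remote node:
print(message_time_us(4096), message_time_us(4096, use_interrupt=False))
```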

  • FM-QoS: Real-time Communication using Self-synchronizing Schedules

    Page(s): 2

    FM-QoS employs a novel communication architecture based on network feedback to provide predictable communication performance (e.g. deterministic latencies and guaranteed bandwidths) for high-speed cluster interconnects. Network feedback is combined with self-synchronizing communication schedules to achieve synchrony in the network interfaces (NIs). Based on this synchrony, the network can be scheduled to provide predictable performance without special network QoS hardware. We describe the key element of the FM-QoS approach, feedback-based synchronization (FBS), which exploits network feedback to synchronize senders. We use Petri nets to characterize the set of self-synchronizing communication schedules for which FBS is effective and to describe the resulting synchronization overhead as a function of the clock drift across the network nodes. Analytic modeling suggests that for clocks of quality 300 ppm (such as those found in the Myrinet NI), a synchronization overhead of less than 1% of the total communication traffic is achievable -- significantly better than previous software-based schemes and comparable to hardware-intensive approaches such as virtual circuits (e.g. ATM). We have built a prototype of FBS for Myricom's Myrinet network (a 1.28 Gbps cluster network) which demonstrates the viability of the approach by sharing network resources with predictable performance. The prototype, which implements the local node schedule in software, achieves predictable latencies of 23 µs for a single-switch, 8-node network and 2 KB packets. In comparison, the best-effort scheme achieves 104 µs for the same network without FBS. While this ratio of over four to one already demonstrates the viability of the approach, it includes nearly 10 µs of overhead due to the software implementation. For hardware implementations of local node scheduling, and for networks with cascaded switches, these ratios should be much larger.
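
    The overhead claim rests on simple arithmetic: two free-running 300 ppm clocks can drift apart by up to 600 µs per second, so senders only need to resynchronize often enough to keep the accumulated skew inside a guard band. A back-of-the-envelope sketch follows; the guard-band and per-sync cost values are assumptions, and only the 300 ppm figure comes from the abstract.

```python
# Back-of-the-envelope synchronization overhead for feedback-based
# synchronization.  Only the 300 ppm clock quality comes from the abstract;
# the guard band and per-resync cost are illustrative assumptions.

drift_ppm = 300.0                       # per-clock frequency error
relative_drift = 2 * drift_ppm * 1e-6   # worst case: two clocks drift apart
guard_band_us = 100.0                   # tolerated skew before schedules slip (assumed)
sync_cost_us = 25.0                     # cost of one resynchronizing exchange (assumed)

resync_interval_us = guard_band_us / relative_drift
overhead_fraction = sync_cost_us / resync_interval_us

print(f"resync every {resync_interval_us / 1e3:.0f} ms, "
      f"overhead {100 * overhead_fraction:.3f}% of traffic")
```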

  • Multi-Protocol Active Messages on a Cluster of SMPs

    Page(s): 3

    Clusters of multiprocessors, or Clumps, promise to be the supercomputers of the future, but obtaining high performance on these architectures requires an understanding of the interactions between the multiple levels of interconnection. In this paper, we present the first multi-protocol implementation of a lightweight message layer, a version of Active Messages-II, running on a cluster of Sun Enterprise 5000 servers connected with Myrinet. This research brings together several pieces of high-performance interconnection technology: bus backplanes for symmetric multiprocessors, low-latency networks for connections between machines, and simple, user-level primitives for communication. The paper describes the shared memory message-passing protocol and analyzes the multi-protocol implementation with both microbenchmarks and Split-C applications. Three aspects of the communication layer are critical to performance: the overhead of cache-coherence mechanisms, the method of managing concurrent access, and the cost of accessing state with the slower protocol. Through the use of an adaptive polling strategy, the multi-protocol implementation limits performance interactions between the protocols, delivering up to 160 MB/s of bandwidth with 3.6 microsecond end-to-end latency. Applications within an SMP benefit from this fast communication, running up to 75% faster than on a network of uniprocessor workstations. Applications running on the entire Clump are limited by the balance of NICs to processors in our system, and are typically slower than on the NOW. These results illustrate several potential pitfalls for the Clumps architecture.
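
    The adaptive polling strategy mentioned above can be illustrated with a small sketch: poll the cheap shared-memory protocol on every call, and poll the more expensive network protocol at a frequency that backs off while it stays idle. The class and method names here are hypothetical, not the Active Messages-II interface.

```python
# Sketch of adaptive polling across two protocols: always poll the cheap
# shared-memory queue, and poll the more expensive network protocol at a rate
# that backs off while it stays idle.  The protocol objects are hypothetical.

class AdaptivePoller:
    def __init__(self, shmem, network, max_skip=16):
        self.fast, self.slow = shmem, network
        self.max_skip = max_skip
        self.skip = 1          # poll the slow protocol every 'skip' calls
        self.calls = 0

    def poll(self):
        handled = self.fast.poll()             # cheap poll on every call
        self.calls += 1
        if self.calls >= self.skip:
            self.calls = 0
            if self.slow.poll():               # traffic seen: poll eagerly again
                self.skip = 1
            else:                              # idle: back off exponentially
                self.skip = min(self.skip * 2, self.max_skip)
        return handled
```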

  • Highly Portable and Efficient Implementations of Parallel Adaptive N-Body Methods

    Page(s): 4

    We describe the design of several portable and efficient parallel implementations of adaptive N-body methods, including the adaptive Fast Multipole Method, the adaptive version of Anderson’s Method, and the Barnes-Hut algorithm. Our codes are based on a communication and work partitioning scheme that allows an efficient implementation of adaptive multipole methods even on high-latency systems. Our test runs demonstrate high performance and speed-up on several parallel architectures, including traditional MPPs, shared-memory machines, and networks of workstations connected by Ethernet.
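
    For readers unfamiliar with these methods, the work-saving idea common to Barnes-Hut and the multipole methods is to approximate the effect of a distant cluster of bodies by a single expansion about its center. The sketch below shows the standard sequential Barnes-Hut opening criterion; it is illustrative only, and the tree-node attributes are assumptions, not taken from the paper's codes.

```python
# Minimal sequential Barnes-Hut force walk; the tree-node attributes (cx, cy,
# cz, mass, size, children, is_leaf) are assumed for illustration.
import math

def accel(node, body, theta=0.5, G=1.0, eps=1e-3):
    """Accumulate the acceleration exerted on 'body' by the octree cell 'node'."""
    dx, dy, dz = node.cx - body.x, node.cy - body.y, node.cz - body.z
    d = math.sqrt(dx * dx + dy * dy + dz * dz) + eps
    if node.is_leaf or node.size / d < theta:   # far enough: use the cell's center of mass
        f = G * node.mass / d**3
        return f * dx, f * dy, f * dz
    ax = ay = az = 0.0                          # otherwise open the cell and recurse
    for child in node.children:
        cax, cay, caz = accel(child, body, theta, G, eps)
        ax, ay, az = ax + cax, ay + cay, az + caz
    return ax, ay, az
```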

  • Optimization and Scaling of Shared-Memory and Message-Passing Implementations of the Zeus Hydrodynamics Algorithm

    Page(s): 5

    We compare the performance of shared-memory and message-passing versions of the ZEUS algorithm for astrophysical fluid dynamics on a 64-processor HP/Convex Exemplar SPP-2000. Single-processor optimization is guided by timing several versions of simple loops whose structure typifies the main performance bottlenecks. Overhead is minimized in the message-passing implementation through the use of non-blocking communication operations. Our benchmark results agree reasonably well with the predictions of a simple performance model. The message-passing version of ZEUS scales better than the shared-memory one primarily because, under shared memory (unless data-layout directives are used), the domain decomposition is effectively one-dimensional.
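
    The overhead-hiding technique named above, non-blocking communication, amounts to posting sends and receives for ghost cells, computing on the interior while the messages are in flight, and completing the boundary afterwards. The sketch below illustrates the idea with mpi4py; ZEUS itself is Fortran, and the 1-D decomposition and array sizes are illustrative assumptions.

```python
# Overlapping a ghost-cell exchange with interior computation using
# non-blocking MPI (mpi4py).  Illustrative 1-D decomposition; ZEUS is Fortran.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

u = np.random.rand(1000)                       # local slab; u[0], u[-1] are ghost cells
send_l, send_r = np.array([u[1]]), np.array([u[-2]])
recv_l, recv_r = np.empty(1), np.empty(1)

reqs = [comm.Isend(send_l, dest=left,  tag=0), comm.Isend(send_r, dest=right, tag=1),
        comm.Irecv(recv_l, source=left, tag=1), comm.Irecv(recv_r, source=right, tag=0)]

unew = np.empty_like(u)
unew[2:-2] = 0.5 * (u[1:-3] + u[3:-1])         # interior stencil while messages are in flight
MPI.Request.Waitall(reqs)                      # complete the exchange, then do the boundary
u[0], u[-1] = recv_l[0], recv_r[0]
unew[1], unew[-2] = 0.5 * (u[0] + u[2]), 0.5 * (u[-3] + u[-1])
```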

  • Scalable, Hydrodynamic and Radiation-Hydrodynamic Studies of Neutron Star Mergers

    Page(s): 6

    We discuss the high performance computing issues involved in the numerical simulation of binary neutron star mergers and supernovae. These phenomena, which are of great interest to astronomers and physicists, can only be described by modeling the gravitational field of the objects along with the flow of matter and radiation in a self-consistent manner. In turn, such models require the solution of the gravitational field equations, Eulerian hydrodynamic equations, and radiation transport equations. This necessitates the use of scalable, high performance computing assets to conduct the simulations. We discuss some of the parallel computing aspects of this challenging task in this paper.

  • Implementing a Performance Forecasting System for Metacomputing: The Network Weather Service

    Page(s): 7

    In this paper we describe the design and implementation of a system called the Network Weather Service (NWS) that takes periodic measurements of deliverable resource performance from distributed networked resources, and uses numerical models to dynamically generate forecasts of future performance levels. These performance forecasts, along with measures of performance fluctuation (e.g. the mean square prediction error) and forecast lifetime that the NWS generates, are made available to schedulers and other resource management mechanisms at runtime so that they may determine the quality of service that will be available from each resource. We describe the architecture of the NWS and implementations that we have developed and are currently deploying for the Legion [13] and Globus/Nexus [7] metacomputing infrastructures. We also detail NWS forecasts of resource performance using both the Legion and Globus/Nexus implementations. Our results show that simple forecasting techniques substantially outperform measurements of current conditions (commonly used to gauge resource availability and load) in terms of prediction accuracy. In addition, the techniques we have employed are almost as accurate as substantially more complex modeling methods. We compare our techniques to a sophisticated time-series analysis system in terms of forecasting accuracy and computational complexity.
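
    The central claim, that simple forecasts beat using the latest measurement alone, is easy to reproduce in miniature. The sketch below compares a last-value predictor with a sliding-window mean on a synthetic series; the predictor mix, window length, and series are illustrative assumptions, not the NWS forecaster set.

```python
# Compare a "current conditions" (last-value) predictor with a sliding-window
# mean on a synthetic load series, in the spirit of the NWS result.
import random

def mse(pred, actual):
    return sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual)

series = [10 + random.gauss(0, 2) for _ in range(500)]    # e.g. available bandwidth

last_value = series[:-1]                                  # predict x[t+1] = x[t]
window = 20
window_mean = [sum(series[max(0, t - window):t + 1]) / (t + 1 - max(0, t - window))
               for t in range(len(series) - 1)]           # predict x[t+1] = recent mean

actual = series[1:]
print("last-value  MSE:", mse(last_value, actual))
print("window-mean MSE:", mse(window_mean, actual))
```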

  • Experiment Management Support for Performance Tuning

    Page(s): 8

    The development of a high performance parallel system or application is an evolutionary process -- both the code and the environment go through many changes during a program's lifetime -- and at each change, a key question for developers is: how and how much did the performance change? No existing performance tool provides the necessary functionality to answer this question. We report on the design and preliminary implementation of a tool that views each execution as a scientific experiment and provides the functionality to answer questions about a program's performance that span more than a single execution or environment.

  • Exploiting Global Input/Output Access Pattern Classification

    Page(s): 9

    Parallel input/output systems attempt to alleviate the performance bottleneck that affects many input/output intensive applications. In such systems, an understanding of the application access pattern, especially how requests from multiple processors for different file regions are logically related, is important for optimizing file system performance. We propose a method for automatically classifying these global access patterns and using these global classifications to select and tune file system policies to improve input/output performance. We demonstrate this approach on benchmarks and scientific applications using global classification to automatically select appropriate underlying Intel PFS input/output modes and server buffering strategies.
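
    As a rough illustration of global classification, the sketch below inspects the byte offsets requested by each process and labels the aggregate pattern; the categories and decision rules are illustrative assumptions, not the paper's classifier or the Intel PFS mode names.

```python
# Toy global access-pattern classifier: given the (offset, size) requests each
# process issues, label the aggregate pattern.  Categories and rules are
# illustrative only, not the paper's classifier or the Intel PFS mode names.

def is_contiguous(offs):
    return all(a + s == b for (a, s), (b, _) in zip(offs, offs[1:]))

def classify_global(requests):
    """requests: {rank: list of (offset, size) in issue order}"""
    if all(is_contiguous(offs) for offs in requests.values()):
        return "partitioned (each rank reads one contiguous block)"
    strides = {offs[1][0] - offs[0][0] for offs in requests.values() if len(offs) > 1}
    if len(strides) == 1:
        return "interleaved (constant stride %d)" % strides.pop()
    return "irregular"

reqs = {0: [(0, 4096), (8192, 4096)], 1: [(4096, 4096), (12288, 4096)]}
print(classify_global(reqs))      # -> interleaved (constant stride 8192)
```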

  • Compiling Parallel Code for Sparse Matrix Applications

    Page(s): 10

    We have developed a framework based on relational algebra for compiling efficient sparse matrix code from dense DO-ANY loops and a specification of the representation of the sparse matrix. In this paper, we show how this framework can be used to generate parallel code, and present experimental data that demonstrates that the code generated by our Bernoulli compiler achieves performance competitive with that of hand-written codes for important computational kernels.
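
    To make the idea concrete, the following shows the kind of loop nest a sparse compiler would synthesize for y = A*x from a dense loop plus a compressed sparse row (CSR) storage specification. It is written in Python purely for illustration; this is not the Bernoulli compiler's actual output.

```python
# CSR matrix-vector product of the sort a sparse compiler would synthesize
# from a dense "DO-ANY" loop plus a CSR storage specification.  Illustrative
# only; the Bernoulli compiler does not emit Python.

def spmv_csr(rowptr, colind, values, x):
    y = [0.0] * (len(rowptr) - 1)
    for i in range(len(rowptr) - 1):              # dense row loop survives
        for k in range(rowptr[i], rowptr[i + 1]): # column loop becomes a loop
            y[i] += values[k] * x[colind[k]]      # over stored nonzeros only
    return y

# 2x2 example: A = [[4, 0], [1, 3]], x = [1, 2]  ->  y = [4, 7]
print(spmv_csr([0, 1, 3], [0, 0, 1], [4.0, 1.0, 3.0], [1.0, 2.0]))
```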

  • Evaluating the Performance Limitations of MPMD Communication

    Page(s): 11

    The MPMD approach to parallel computing is attractive for programmers who seek fast development cycles, high code re-use, and modular programming, or whose applications exhibit irregular computation loads and communication patterns. RPC is widely adopted as the communication abstraction for crossing address space boundaries. However, the communication overheads of existing RPC-based systems are usually an order of magnitude higher than those found in highly tuned SPMD systems. This problem has thus far limited the appeal of high-level programming languages based on MPMD models in the parallel computing community. This paper investigates the fundamental limitations of MPMD communication using a case study of two parallel programming languages, Compositional C++ (CC++) and Split-C, that provide support for a global name space. To establish a common basis for comparison, our implementation of CC++ was developed to use MRPC, an RPC system optimized for MPMD parallel computing and based on Active Messages. Basic RPC performance in CC++ is within a factor of two of that of Split-C and other messaging layers. CC++ applications perform within a factor of two to six of comparable Split-C versions, which represents an order of magnitude improvement over previous CC++ implementations. The results suggest that RPC-based communication can be used effectively in many high-performance MPMD parallel applications.

  • Compiling Stencils in High Performance Fortran

    Page(s): 12

    In many Fortran90 and HPF programs performing dense matrix computations, the main computational work is performed by a class of kernels known as stencils. We present a general-purpose compiler optimization strategy that generates efficient code for a wide class of stencil computations expressed using Fortran90 array constructs. This strategy optimizes both single and multi-statement stencils by orchestrating a set of program transformations that minimize both intraprocessor and interprocessor data movement implied by Fortran90 array operations. Our experimental results show that our approach can produce highly optimized code in situations where other compilers do not. In some cases the improvements are as much as several orders of magnitude.
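
    The kernel class in question is easy to picture: a five-point stencil written with whole-array operations, where each shifted operand implies data movement that the compiler must minimize. The numpy sketch below is an illustrative analogue of the Fortran90 array form, not code from the paper.

```python
# A five-point stencil written as whole-array (Fortran90-style) operations.
# Each shifted slice corresponds to data movement the compiler must map onto
# nearest-neighbor communication; numpy stands in for the array syntax here.
import numpy as np

def five_point(u):
    v = u.copy()
    v[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +   # north + south neighbors
                            u[1:-1, :-2] + u[1:-1, 2:])    # west  + east  neighbors
    return v

u = np.zeros((8, 8))
u[4, 4] = 1.0
print(five_point(u)[3:6, 3:6])
```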

  • MPP Solution of Rayleigh-Bénard-Marangoni Flows

    Page(s): 13

    A domain decomposition strategy and parallel gradient-type iterative solution scheme have been developed and implemented for computation of complex 3D viscous flow problems involving heat transfer and surface tension effects. Special attention has been paid to the kernels for the computationally intensive matrix-vector products and dot products, to memory management, and to overlapping communication and computation. Details of these implementation issues are described together with associated performance and scalability studies. Representative Rayleigh-Bénard and microgravity Marangoni flow calculations on the Cray T3D are presented, and performance results verifying a sustained rate in excess of 16 gigaflops on 512 nodes of the T3D have been obtained. The work is currently being extended to the T3E, where further performance benchmarks and scalability studies are under way; preliminary studies have achieved sustained rates above 50 gigaflops and 100 gigaflops on 512-node T3E-600 and 1024-node T3E-900 configurations, respectively.
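
    One of the kernels singled out above, the distributed dot product, reduces to local arithmetic followed by a single global reduction. The mpi4py sketch below illustrates this pattern; the production code is Fortran on the T3D/T3E, and the vector length here is an arbitrary assumption.

```python
# Distributed dot product at the heart of gradient-type iterative solvers:
# purely local arithmetic followed by one global reduction (mpi4py sketch).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
x = np.random.rand(100_000)                        # local pieces of distributed vectors
y = np.random.rand(100_000)

local = float(np.dot(x, y))                        # local work, no communication
global_dot = comm.allreduce(local, op=MPI.SUM)     # one collective per dot product
if comm.Get_rank() == 0:
    print("global dot product:", global_dot)
```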

  • A multi-level parallelization concept for high-fidelity multi-block solvers

    Page(s): 14

    The integration of high-fidelity Computational Fluid Dynamics (CFD) analysis tools into the industrial design process benefits greatly from robust implementations that are portable across a wide range of computer architectures. In the present work, a hybrid domain-decomposition and parallelization concept was developed and implemented in the widely used NASA multi-block CFD solvers employed in the ENSAERO and OVERFLOW advanced flow analysis packages. These advanced engineering and scientific analysis packages comprise more than 300,000 lines of FORTRAN 77 in more than 1300 individual subprograms. The new parallel solver concept, PENS (Parallel Euler Navier-Stokes Solver), employs both fine and coarse granularity, with data partitioning as well as data coalescing, to obtain the desired load-balance characteristics on the available computer platforms for these legacy packages. The multi-level parallelism implementation itself introduces no changes to the numerical results, so the original fidelity of the packages is identically preserved. The present implementation uses the Message Passing Interface (MPI) library for interprocessor message passing and memory access. By choosing an appropriate combination of the available partitioning and coalescing options at execution time, the PENS solver can be used on computer architectures ranging from shared-memory to distributed-memory platforms with varying degrees of parallelism. Improvements in computational load balance and speed are crucial for realistic problems in the design of aerospace vehicles. The PENS implementation in the IBM SP2 distributed-memory environment at the NASA Ames Research Center obtains 85 percent scalable parallel performance using fine-grain partitioning of single-block CFD domains on up to 128 wide nodes. Multi-block CFD simulations of complete aircraft geometries achieve 85 percent of perfectly load-balanced execution using data coalescing and the two levels of parallelism. The SGI PowerChallenge, SGI Onyx2, and Cray T3E are the other platforms on which the robustness, performance behavior, and parallel scalability of the implementation have been tested and fine-tuned for actual production run environments.
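
    The coalescing side of the load-balancing scheme can be pictured as assigning whole grid blocks to processors so that per-processor work evens out. The greedy sketch below is an illustrative toy, not the PENS solver's actual algorithm; the block sizes and rank count are made up.

```python
# Toy "coalescing" load balance: greedily assign multi-block zones to the
# least-loaded rank.  Block sizes and rank count are made up; this is not the
# PENS solver's actual scheme.
import heapq

def coalesce(block_sizes, nranks):
    heap = [(0, r, []) for r in range(nranks)]           # (load, rank, assigned blocks)
    heapq.heapify(heap)
    for b, size in sorted(enumerate(block_sizes), key=lambda t: -t[1]):
        load, r, blocks = heapq.heappop(heap)            # current least-loaded rank
        blocks.append(b)
        heapq.heappush(heap, (load + size, r, blocks))
    return sorted(heap)

for load, rank, blocks in coalesce([90, 60, 55, 30, 25, 20], nranks=3):
    print(f"rank {rank}: blocks {blocks}, load {load}")
```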

  • On Parallel Implementations of Dynamic Overset Grid Methods

    Page(s): 15

    This paper explores the parallel performance of structured overset CFD computations for multi-component bodies in which there is relative motion between component parts. The two processes that dominate the cost of such problems are the flow solution on each component and the intergrid connectivity solution. A two-part static-dynamic load balancing scheme is proposed in which the static part balances the load for the flow solution and the dynamic part re-balances, if necessary, the load for the connectivity solution. This scheme is coupled with existing parallel implementations of the OVERFLOW flow solver and DCF3D connectivity routine and used for unsteady calculations about aerodynamic bodies on the IBM SP2 and IBM SP multi-processors. This paper also describes the parallel implementation of a new solution-adaption scheme based on structured Cartesian overset grids.

  • New Life in Dusty Decks: Results of Porting a CM Fortran Based Aeroacoustic Model to High Performance Fortran

    Page(s): 16

    The High Performance Fortran language is a 'standard by consensus', developed by individuals and vendors in the high performance computing industry to provide a low barrier to entry for parallel computing. It promises to be an easier-to-use development environment for distributed memory computing platforms than message passing libraries such as PVM and MPI. HPF promises much and is still in its infancy. Since HPF was developed in part from experience gained with early parallel Fortran compilers such as Thinking Machines' CM Fortran, we decided to test the effectiveness of HPF with today's generation of HPF compilers by porting a complex existing model code originally developed using CM Fortran. The model code that we selected is a hybrid computational aeroacoustics code that solves the 3D, time-dependent Euler equations in the near flow field and uses a moving surface Kirchhoff's formula to predict the far field sound radiating from turbofan engine inlets. The original CM Fortran model code was developed on a Thinking Machines CM-5. The extensive production research use of this model, with varying grid sizes, provides excellent benchmarks against which to compare the HPF port. Two HPF compilers were selected for the porting effort -- The Portland Group's (PGI) pghpf and the xlhpf compiler from IBM. IBM's xlhpf does not implement some elements of the HPF subset, while PGI's offering provides several full-HPF extensions. Porting efforts using each compiler exposed the strengths and weaknesses of each. Porting this complex code exposed many of the growing pains associated with the current generation of compilers. Critical sections of the code are explained and the corresponding areas of the conversion effort are discussed. Where necessary we demonstrate how different porting strategies affected the performance of the code. Finally we present how the ported code ran, using a varying number of processors, on an IBM SP2 and an SGI Origin 2000. While the current simple port does not match the speed of a CM-5, we hope further porting efforts and improved compiler technology will enable us to eventually match and then surpass CM-5 performance levels. With the shutdown of the NCSA CM-5 and the eventual removal and failure of the remaining Thinking Machines hardware currently installed, this effort demonstrates that investments in CM Fortran code need not be abandoned and that new life can be breathed into those dusty decks.

  • An Evaluation of HPF Compilers and the Implementation of a Parallel Linear Equation Solver Using HPF and MPI

    Page(s): 17

    In this work, we evaluated the capabilities and performance of two commercially available HPF compilers, xlhpf from IBM and pghpf from the Portland Group. In particular, we examined the suitability of the two compilers for the development of a reservoir simulator. Because of the nature of reservoir simulation, multiple data distributions and data transfer between arrays with different data layouts are of great importance. An HPF compiler that does not provide these capabilities is unsuitable for the development of a parallel reservoir simulator. A detailed comparison of the functionality of the two compilers and their suitability for reservoir simulator development is presented. To test the performance of the compilers, we used several reservoir simulator kernels and a parallel linear equation solver. The solver is based on preconditioned Orthomin and a truncated Neumann series preconditioner. It was first implemented in HPF and later in MPI using hpf_local to improve performance. Its computational performance was compared on several MPP machines: the CM-5, IBM SP2, Cray T3E, and Cray Origin.
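
    The solver component is concrete enough to sketch: a truncated Neumann series approximates A^{-1} by (I + N + N^2 + ... + N^m) D^{-1}, where D is the diagonal of A and N = I - D^{-1}A. The numpy sketch below applies such a preconditioner without forming matrix powers; it is illustrative only, not the paper's HPF or MPI code.

```python
# Apply a truncated Neumann-series preconditioner without forming any matrix
# powers: z = (I + N + ... + N^m) D^{-1} r, with N = I - D^{-1} A.
# Illustrative numpy version, not the paper's HPF/MPI code.
import numpy as np

def neumann_apply(A, r, terms=3):
    dinv = 1.0 / np.diag(A)
    z = dinv * r                   # k = 0 term
    acc = z.copy()
    for _ in range(terms):
        z = z - dinv * (A @ z)     # multiply the previous term by N
        acc += z
    return acc

A = np.array([[4.0, 1.0], [1.0, 3.0]])
r = np.array([1.0, 2.0])
print(neumann_apply(A, r), np.linalg.solve(A, r))   # approximation vs. exact solve
```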

  • Portable Performance of Data Parallel Languages

    Page(s): 18

    A portable program yields consistent performance on different platforms. We study the portable performance of three NAS benchmarks compiled with three commercial HPF compilers on the IBM SP2. Each benchmark is evaluated using DO loops and F90 constructs. A baseline comparison is provided by Fortran/MPI and ZPL. The HPF results show some scalable performance but indicate a considerable portability problem. First, relying on the compiler alone for extensive analysis and optimization leads to unpredictable performance. Second, differences in the parallelization strategies often require compiler-specific customization. The results suggest that the foremost criterion for portability is a concise performance model.

  • Divide and Conquer Spot Noise

    Page(s): 19

    The design and implementation of an interactive spot noise algorithm is presented. Spot noise is a technique that uses texture for the visualization of flow fields. Various design tradeoffs are discussed that allow an optimal implementation on a range of high-end graphical workstations. Two applications are given: the steering of a smog prediction simulation and browsing a very large data set resulting from a direct numerical simulation of turbulence. These applications motivate the need for interactive visualization techniques.
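
    Spot noise builds a texture by summing many randomly placed, randomly weighted spots, each stretched along the local flow direction. The sketch below is a minimal sequential numpy rendition of that idea under assumed parameters; the paper's contribution is the interactive, workstation-tuned implementation, which this does not attempt to reproduce.

```python
# Minimal spot-noise texture: sum many randomly placed, randomly weighted
# Gaussian spots, each elongated along the local velocity direction.
# Parameters are arbitrary; this ignores all the paper's interactivity work.
import numpy as np

def spot_noise(vx, vy, nspots=2000, size=3.0, stretch=3.0, seed=0):
    rng = np.random.default_rng(seed)
    h, w = vx.shape
    yy, xx = np.mgrid[0:h, 0:w]
    tex = np.zeros((h, w))
    for _ in range(nspots):
        x0, y0 = rng.uniform(0, w), rng.uniform(0, h)
        a = rng.uniform(-1, 1)                            # random spot intensity
        ang = np.arctan2(vy[int(y0), int(x0)], vx[int(y0), int(x0)])
        u = (xx - x0) * np.cos(ang) + (yy - y0) * np.sin(ang)   # flow-aligned frame
        v = -(xx - x0) * np.sin(ang) + (yy - y0) * np.cos(ang)
        tex += a * np.exp(-(u / (stretch * size)) ** 2 - (v / size) ** 2)
    return tex

vx, vy = np.ones((64, 64)), np.zeros((64, 64))            # uniform horizontal flow
print(spot_noise(vx, vy).shape)
```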

  • Wavelet-Based Image Registration on Parallel Computers

    Page(s): 20

    Digital image registration is very important in many applications such as medical imagery, robotics, and remote sensing. Image registration determines the relative orientation between two or more images. NASA's Mission To Planet Earth (MTPE) program will soon produce enormous volumes of Earth data, reaching hundreds of gigabytes per day. Analysis of such data requires accurate and fast registration. We describe a fast algorithm for parallel image registration. Performance of the algorithm is analyzed and experimentally evaluated. We show that the algorithm provides substantial computational savings, and the measurements reveal many important characteristics of contemporary high performance computers.
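
    A common way to make registration fast is coarse-to-fine search over a multiresolution decomposition: estimate the shift on a heavily decimated approximation, then refine it at each finer level. The sketch below uses simple averaging pyramids rather than a true wavelet transform and is sequential, so it illustrates only the general idea, not the paper's parallel algorithm.

```python
# Coarse-to-fine translation estimation: exhaustive search only at the coarsest
# level, then a small refinement at each finer level.  Averaging pyramids stand
# in for wavelet approximations; sequential and illustrative only.
import numpy as np

def downsample(img):
    return 0.25 * (img[::2, ::2] + img[1::2, ::2] + img[::2, 1::2] + img[1::2, 1::2])

def best_shift(ref, tgt, center=(0, 0), radius=2):
    best, score = center, np.inf
    for dy in range(center[0] - radius, center[0] + radius + 1):
        for dx in range(center[1] - radius, center[1] + radius + 1):
            diff = ref - np.roll(tgt, (dy, dx), axis=(0, 1))
            s = float(np.mean(diff * diff))
            if s < score:
                best, score = (dy, dx), s
    return best

def register(ref, tgt, levels=3):
    if levels == 0:
        return best_shift(ref, tgt, radius=3)             # exhaustive only when tiny
    dy, dx = register(downsample(ref), downsample(tgt), levels - 1)
    return best_shift(ref, tgt, center=(2 * dy, 2 * dx), radius=1)

ref = np.random.default_rng(1).random((64, 64))
tgt = np.roll(ref, (8, -8), axis=(0, 1))
print(register(ref, tgt))        # -> (-8, 8), the shift that maps tgt back onto ref
```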

  • Issues in the Design of a Flexible Distributed Architecture for Supporting Persistence and Interoperability in Collaborative Virtual Environments

    Page(s): 21

    CAVERN, the CAVE Research Network, is an alliance of industrial and research institutions equipped with CAVE-based virtual reality hardware and high performance computing resources, interconnected by high-speed networks, to support collaboration in design, education, engineering, and scientific visualization. CAVERNsoft is the collaborative software backbone for CAVERN. CAVERNsoft uses distributed data stores and multiple networking interfaces to provide persistence, customizable latency, data consistency, and scalability that are typically needed to support collaborative virtual reality.

  • Multi-client LAN/WAN Performance Analysis of Ninf: a High-Performance Global Computing System

    Page(s): 22

    The rapid increase in the speed and availability of networks of supercomputers is making high performance global computing possible, in which computational and data resources in the network are collectively employed to solve large-scale problems. There have been several recent proposals for global computing systems, including our Ninf system. However, critical issues regarding system performance characteristics in global computing have been little investigated, especially in multi-client, multi-site WAN settings. In order to investigate the feasibility of Ninf and similar systems, we conducted benchmarks with different communication/computation characteristics on a variety of combinations of clients and servers, differing in performance, architecture, etc., under LAN, single-site WAN, and multi-site WAN conditions.

  • PARDIS: CORBA-based Architecture for Application-Level Parallel Distributed Computation

    Page(s): 23

    We describe the architecture and programming abstractions of PARDIS, a system based on ideas underlying the Common Object Request Broker Architecture (CORBA). PARDIS provides a distributed environment in which objects representing data-parallel computation, as well as non-parallel objects present in parallel programs, can interact across platforms and software systems. Each of these objects represents a small, encapsulated application that can be used as a building block in the construction of distributed metaapplications. We will present examples of building such metaapplications with PARDIS, and show their performance in distributed systems combining the computational power of different multi-processor architectures.

  • Scalable Networked Information Processing Environment (SNIPE)

    Page(s): 24

    SNIPE is a metacomputing system that aims to provide a reliable, secure, fault-tolerant environment for long-term distributed computing applications and data stores across the global Internet. The system combines global naming and replication of both processing and data to support large-scale information processing applications, leading to better availability and reliability than is currently offered by typical cluster computing and/or distributed computing environments. To facilitate this, the system supports distributed data collection, distributed computation, distributed control and resource management, distributed output, and process migration. The underlying system supports multiple communication paths, media, and routing methods to aid performance and robustness across both local and global networks.

  • Optimization of a Parallel Ocean General Circulation Model

    Page(s): 25

    We describe our efforts to optimize a parallel ocean general circulation model. We have developed several general strategies to optimize the ocean general circulation model on the Cray T3D. These strategies include memory optimization, effective use of arithmetic pipelines, and use of optimized libraries. Nearly linear scaling is obtained for the optimized code, and its speed-up data also shows excellent improvement over the original code. More importantly, the single-node performance is greatly improved. The optimized code runs about 2.5 times faster than the original code, which corresponds to 3.63 Gflops on the 256-PE Cray T3D. This improvement allows one to perform ocean modeling at increasingly higher spatial resolutions for climate studies.
