
2010 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)

Date: 13-19 Nov. 2010


Displaying Results 1 - 25 of 57
  • 190 TFlops Astrophysical N-body Simulation on a Cluster of GPUs

    Page(s): 1 - 9

    We present the results of a hierarchical N-body simulation on DEGIMA, a cluster of PCs with 576 graphics processing units (GPUs) and an InfiniBand interconnect. DEGIMA stands for DEstination for GPU Intensive MAchine, and is located at the Nagasaki Advanced Computing Center (NACC), Nagasaki University. In this work, we have upgraded DEGIMA's interconnect using InfiniBand. DEGIMA is composed of 144 nodes with 576 GT200 GPUs. An astrophysical N-body simulation with 3,278,982,596 particles using a treecode algorithm shows a sustained performance of 190.5 Tflops on DEGIMA. The overall cost of the hardware was $411,921. The maximum corrected performance is 104.8 Tflops for the simulation, resulting in a cost performance of 254.4 MFlops/$. This correction is performed by counting the FLOPS based on the most efficient CPU algorithm; any extra FLOPS that arise from the GPU implementation and parameter differences are not included in the 254.4 MFlops/$ figure.
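
    The cost-performance figure above follows directly from the numbers quoted in the abstract. As a quick, illustrative check (not part of the paper), dividing the corrected 104.8 Tflops by the $411,921 hardware cost reproduces the 254.4 MFlops/$ value:

        # Back-of-the-envelope check of the cost-performance figure quoted above.
        # All constants are taken from the abstract text.
        corrected_tflops = 104.8          # performance counted against the best CPU algorithm
        hardware_cost_usd = 411_921       # total hardware cost of DEGIMA
        mflops_per_dollar = corrected_tflops * 1e6 / hardware_cost_usd   # 1 Tflops = 1e6 Mflops
        print(f"{mflops_per_dollar:.1f} MFlops/$")   # prints ~254.4, matching the abstract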

  • Extreme-Scale AMR

    Page(s): 1 - 12

    Many problems are characterized by dynamics occurring on a wide range of length and time scales. One approach to overcoming the tyranny of scales is adaptive mesh refinement/coarsening (AMR), which dynamically adapts the mesh to resolve features of interest. However, the benefits of AMR are difficult to achieve in practice, particularly on the petascale computers that are essential for difficult problems. Due to the complex dynamic data structures and frequent load balancing, scaling dynamic AMR to hundreds of thousands of cores has long been considered a challenge. Another difficulty is extending parallel AMR techniques to high-order-accurate, complex-geometry-respecting methods that are favored for many classes of problems. Here we present new parallel algorithms for parallel dynamic AMR on forest-of-octrees geometries with arbitrary-order continuous and discontinuous finite/spectral element discretizations. The implementations of these algorithms exhibit excellent weak and strong scaling to over 224,000 Cray XT5 cores for multiscale geophysics problems.

  • Scalable Earthquake Simulation on Petascale Supercomputers

    Page(s): 1 - 20

    Petascale simulations are needed to understand the rupture and wave dynamics of the largest earthquakes at shaking frequencies required to engineer safe structures (> 1 Hz). Toward this goal, we have developed a highly scalable, parallel application (AWP-ODC) that has achieved “M8”: a full dynamical simulation of a magnitude-8 earthquake on the southern San Andreas fault up to 2 Hz. M8 was calculated using a uniform mesh of 436 billion 40-m³ cubes to represent the three-dimensional crustal structure of Southern California, in an 800 km by 400 km area, home to over 20 million people. This production run, producing 360 s of wave propagation, sustained 220 Tflop/s for 24 hours on NCCS Jaguar using 223,074 cores. As the largest-ever earthquake simulation, M8 opens new territory for earthquake science and engineering - the physics-based modeling of the largest seismic hazards with the goal of reducing their potential for loss of life and property.
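
    As a rough, illustrative consistency check (an inference, not a figure from the abstract), the quoted 436 billion 40-m cells over the stated 800 km by 400 km footprint imply a modeled depth of roughly 87 km:

        # Implied depth of the M8 mesh, assuming the 436 billion cells uniformly
        # tile the stated 800 km x 400 km footprint at 40 m spacing.
        dx = 40.0                                  # grid spacing in meters
        nx, ny = 800_000 / dx, 400_000 / dx        # 20,000 x 10,000 cells per horizontal layer
        nz = 436e9 / (nx * ny)                     # ~2,180 layers in depth
        print(f"implied depth ~ {nz * dx / 1e3:.0f} km")   # roughly 87 km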

  • Multiscale Simulation of Cardiovascular flows on the IBM Bluegene/P: Full Heart-Circulation System at Red-Blood Cell Resolution

    Page(s): 1 - 10

    We present the first large-scale simulation of blood flow in the coronary arteries and other vessels supplying blood to the heart muscle, with a realistic description of human arterial geometry at spatial resolutions from centimeters down to 10 microns (near the size of red blood cells). This multiscale simulation resolves the fluid into a billion volume units, embedded in a bounding space of 300 billion voxels, coupled with the concurrent motion of 300 million red blood cells, which interact with one another and with the surrounding fluid. The level of detail is sufficient to describe phenomena of potential physiological and clinical significance, such as the development of atherosclerotic plaques. The simulation achieves excellent scalability on up to 294,912 Blue Gene/P computational cores.

  • Petascale Direct Numerical Simulation of Blood Flow on 200K Cores and Heterogeneous Architectures

    Page(s): 1 - 11

    We present a fast, petaflop-scalable algorithm for Stokesian particulate flows. Our goal is the direct simulation of blood, which we model as a mixture of a Stokesian fluid (plasma) and red blood cells (RBCs). Directly simulating blood is a challenging multiscale, multiphysics problem. We report simulations with up to 200 million deformable RBCs. The largest simulation amounts to 90 billion unknowns in space. In terms of the number of cells, we improve the state of the art by several orders of magnitude: the previous largest simulation, at the same physical fidelity as ours, resolved the flow of O(1,000-10,000) RBCs. Our approach has three distinct characteristics: (1) we faithfully represent the physics of RBCs by using nonlinear solid mechanics to capture the deformations of each cell; (2) we accurately resolve the long-range, N-body, hydrodynamic interactions between RBCs (which are caused by the surrounding plasma); and (3) we allow for the highly non-uniform distribution of RBCs in space. The new method has been implemented in the software library MOBO (for “Moving Boundaries”). We designed MOBO to support parallelism at all levels, including inter-node distributed memory parallelism, intra-node shared memory parallelism, data parallelism (vectorization), and fine-grained multithreading for GPUs. We have implemented and optimized the majority of the computation kernels on both Intel/AMD x86 and NVIDIA Tesla/Fermi platforms in single and double floating-point precision. Overall, the code has scaled on 256 CPU-GPUs on the TeraGrid's Lincoln cluster and on 200,000 AMD cores of the Oak Ridge National Laboratory's Jaguar PF system. In our largest simulation, we have achieved 0.7 Petaflop/s of sustained performance on Jaguar.

  • Toward First Principles Electronic Structure Simulations of Excited States and Strong Correlations in Nano- and Materials Science

    Page(s): 1 - 10

    Methods based on the many-body Green's function are generally accepted as the path forward beyond Kohn-Sham-based density functional theory for computing from first principles the electronic structure of materials with strong correlations and excited-state properties in nano- and materials science. Here we present an efficient method to compute the screened Coulomb interaction W, the crucial and computationally most demanding ingredient in the GW method, within the framework of the all-electron Linearized Augmented Plane Wave method. We use the method to compute from first principles, within the constrained random phase approximation (c-RPA), the frequency-dependent screened Hubbard U-matrix defined for a Wannier basis in which we downfold the many-body Hamiltonian for La2CuO4, the canonical parent compound of several cuprate high-temperature superconductors. These results were computed at scale on the Cray XT5 at ORNL, sustaining 1.30 petaflops. We discuss the details of the algorithm and its implementation that allowed us to reach high efficiency and short time to solution on today's petaflop supercomputers.

  • Scalable Tile Communication-Avoiding QR Factorization on Multicore Cluster Systems

    Page(s): 1 - 11

    As tile linear algebra algorithms continue achieving high performance on shared-memory multicore architectures, it is a challenging task to make them scalable on distributed-memory multicore cluster machines. The main contribution of this paper is the extension to the distributed-memory environment of the previous work done by Hadri et al. on Communication-Avoiding QR (CA-QR) factorizations for tall and skinny matrices (initially done on shared-memory multicore systems). The fine granularity of tile algorithms associated with communication-avoiding techniques for the QR factorization presents a high degree of parallelism, where multiple tasks can be concurrently executed, computation and communication largely overlapped, and computation steps fully pipelined. A decentralized dynamic scheduler has then been integrated as a runtime system to efficiently schedule tasks across the distributed resources. Our experimental results, performed on two clusters (with dual-core and 8-core nodes, respectively) and a Cray XT5 system with 12-core nodes, show that the tile CA-QR factorization is able to outperform the de facto standard ScaLAPACK library by up to 4 times for tall and skinny matrices, and has good scalability on up to 3,072 cores.
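
    The abstract gives no code, but the tall-and-skinny QR (TSQR) idea underlying CA-QR can be sketched briefly: factor independent row blocks, then factor the stacked small R factors once more. The NumPy fragment below is an illustrative single-level sketch (the function name and block count are ours, not the paper's), not the paper's tiled, distributed implementation:

        import numpy as np

        def tsqr(A, num_blocks):
            # One-level (flat-tree) communication-avoiding QR of a tall, skinny matrix:
            # each row block is factored independently, then the stacked R factors are
            # factored once more to obtain the global R.
            m, n = A.shape
            blocks = np.array_split(A, num_blocks, axis=0)
            local = [np.linalg.qr(B) for B in blocks]          # (Q_i, R_i) per block
            R_stack = np.vstack([R for _, R in local])         # (num_blocks*n) x n
            Q_small, R = np.linalg.qr(R_stack)                 # second-stage QR
            # Global Q = each local Q_i applied to the corresponding block rows of Q_small.
            Q = np.vstack([Q_i @ Q_small[i * n:(i + 1) * n, :]
                           for i, (Q_i, _) in enumerate(local)])
            return Q, R

        A = np.random.rand(10_000, 32)     # tall and skinny
        Q, R = tsqr(A, num_blocks=8)
        print(np.allclose(Q @ R, A))       # True, up to floating-point error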

  • Parallel Fast Gauss Transform

    Page(s): 1 - 10

    We present fast adaptive parallel algorithms to compute the sum of N Gaussians at N points. Direct sequential computation of this sum would take O(N²) time. The parallel time complexity estimates for our algorithms are O(N/n_p) for uniform point distributions and O((N/n_p) log(N/n_p) + n_p log n_p) for nonuniform distributions using n_p CPUs. We incorporate a plane-wave representation of the Gaussian kernel which permits "diagonal translation". We use parallel octrees and a new scheme for translating the plane-waves to efficiently handle nonuniform distributions. Computing the transform to six-digit accuracy at 120 billion points took approximately 140 seconds using 4096 cores on the Jaguar supercomputer at the Oak Ridge National Laboratory. Our implementation is kernel-independent and can handle other "Gaussian-type" kernels even when an explicit analytic expression for the kernel is not known. These algorithms form a new class of core computational machinery for solving parabolic PDEs on massively parallel architectures.
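
    For reference, the direct O(N²) sum that the fast Gauss transform accelerates can be written in a few lines; a small NumPy sketch like the one below (names and bandwidth are illustrative, not from the paper) is useful as an accuracy baseline on tiny problems:

        import numpy as np

        def direct_gauss_transform(sources, targets, weights, h):
            # Naive O(N*M) evaluation of G(y_i) = sum_j w_j * exp(-|y_i - x_j|^2 / h^2),
            # i.e. the direct summation that the fast algorithm replaces.
            d2 = ((targets[:, None, :] - sources[None, :, :]) ** 2).sum(axis=-1)
            return np.exp(-d2 / h**2) @ weights

        rng = np.random.default_rng(0)
        x = rng.random((500, 3))           # source points in 3D
        w = rng.random(500)                # Gaussian weights
        print(direct_gauss_transform(x, x, w, h=0.2)[:3])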

  • Overlapping Methods of All-to-All Communication and FFT Algorithms for Torus-Connected Massively Parallel Supercomputers

    Page(s): 1 - 9

    Torus networks are commonly used in massively parallel computers, and their performance often becomes the constraint on total application performance. Especially in an asymmetric torus network, network traffic along the longest axis is the performance bottleneck for all-to-all communication, so it is important to schedule the longest-axis traffic smoothly. In this paper, we propose a new algorithm based on an indirect method for pipelining the all-to-all procedures using shared-memory parallel threads, which (1) isolates the longest-axis traffic from other traffic, (2) schedules it smoothly, and (3) overlaps all of the other traffic and overhead for the all-to-all communication behind the longest-axis traffic. The proposed method achieves up to 95% of the theoretical peak. We integrated the overlapped all-to-all method with parallel FFT algorithms, and local FFT calculations are also overlapped behind the longest-axis traffic. The FFT performance achieves up to 90% of the theoretical peak for the parallel 1D FFT.

  • On-Chip Network Evaluation Framework

    Page(s): 10

    With the number of cores on a chip continuing to increase, proper evaluation of the on-chip network is critical not only for network performance but also for overall system performance. In this paper, we show how a network-only simulation can be limited, as it does not provide an accurate representation of system performance. We evaluate traditionally used open-loop simulations and compare them to closed-loop simulations. Although they use different methodologies, measurements, and metrics, we identify how they can provide very similar results. However, we show that the results of closed-loop simulations do not correlate well with execution-driven simulations. We then add simple extensions to the closed-loop simulation to model the impact of the processor and the memory system and show how the correlation with execution-driven simulations can be improved. The proposed framework/methodology provides fast simulation time while providing better insights into the impact of network parameters on overall system performance.

  • Circuit-Switched Memory Access in Photonic Interconnection Networks for High-Performance Embedded Computing

    Page(s): 1 - 12

    As advancements in CMOS technology trend toward ever increasing core counts in chip multiprocessors for high-performance embedded computing, the discrepancy between on- and off-chip communication bandwidth continues to widen due to the power and spatial constraints of electronic off-chip signaling. Silicon photonics-based communication offers many advantages over electronics for network-on-chip design, namely power consumption that is effectively agnostic to distance traveled at the chip and board scale, even across chip boundaries. In this work we develop a design for a photonic network-on-chip with integrated DRAM I/O interfaces and compare its performance to similar electronic solutions using a detailed network-on-chip simulation. When used in a circuit-switched network, silicon nanophotonic switches offer higher bandwidth density and low-power transmission, delivering over 10x better performance and 3-5x lower power than the baseline for projective transform, matrix multiply, and Fast Fourier Transform (FFT), all key algorithms in embedded real-time signal and image processing.

  • CPM in CMPs: Coordinated Power Management in Chip-Multiprocessors

    Page(s): 1 - 12

    Multiple clock domain architectures have recently been proposed to alleviate the power problem in CMPs by having different frequency/voltage values assigned to each domain based on workload requirements. However, accurately allocating power to these voltage/frequency islands based on time-varying workload characteristics, as well as controlling the power consumption at the provisioned power level, is quite non-trivial. Toward this end, we propose a two-tier feedback-based control-theoretic solution. Our first tier consists of a global power manager that allocates power targets to individual islands based on workload dynamics. The power consumptions of these islands are in turn controlled by a second tier, consisting of local controllers that regulate island power using dynamic voltage and frequency scaling in response to workload requirements.
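
    The abstract does not give controller equations, so the following is only a minimal sketch of the feedback structure it describes: a local, per-island controller nudging frequency toward a power target handed down by the global manager. The proportional law, cubic power model, and all constants are illustrative assumptions, not the paper's design:

        def local_dvfs_controller(power_budget_w, freq_ghz, measured_power_w,
                                  gain=0.05, f_min=0.8, f_max=3.0):
            # Proportional feedback: the power error drives a frequency adjustment,
            # clamped to the island's DVFS range.
            error = power_budget_w - measured_power_w
            return min(max(freq_ghz + gain * error, f_min), f_max)

        # Toy closed loop for one island with an assumed cubic power model P = k * f^3.
        k, freq = 2.0, 2.5
        for step in range(5):
            power = k * freq ** 3
            freq = local_dvfs_controller(20.0, freq, power)
            print(f"step {step}: power = {power:5.1f} W, next frequency = {freq:.2f} GHz")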

  • A Multi-Scale Heart Simulation on Massively Parallel Computers

    Page(s): 1 - 11

    To understand the macroscopic function of the human heart based on sub-cellular microscopic events, multiscale analysis is indispensable. Our heart simulator uses the so-called homogenization method, in which both the human heart and the myocardial cells are modeled and solved simultaneously by the finite element method. Because the contraction and deformation of each finite element in the heart model are governed by their respective cell models, the number of degrees of freedom (NDOF) of all the cells becomes prohibitively large. Furthermore, the phenomena are highly nonlinear and transient. This challenging problem has been tackled by our group for many years, and a novel algorithm to accelerate the computation was implemented in the code. We have recently tested its performance using the T2K Open Supercomputer (Tokyo). A pulsation of the heart with a total NDOF of 160 million was successfully simulated using 6144 CPU cores in ten hours. Scalability and other aspects of computational performance were measured and are discussed.

  • 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs

    Page(s): 1 - 13

    A stencil computation sweeps over a spatial grid over multiple time steps to perform nearest-neighbor computations. The bandwidth-to-compute requirement for a large class of stencil kernels is very high, and their performance is bounded by the available memory bandwidth. Since memory bandwidth grows slower than compute, the performance of stencil kernels will not scale with increasing compute density. We present a novel 3.5D blocking algorithm that performs 2.5D spatial blocking and temporal blocking of the input grid into on-chip memory for both CPUs and GPUs. The resultant algorithm is amenable to both thread-level and data-level parallelism, and scales near-linearly with SIMD width and core count. Our performance is faster than or comparable to state-of-the-art stencil implementations on CPUs and GPUs. Our 7-point stencil implementation is 1.5X faster on CPUs and 1.8X faster on GPUs than previously reported numbers for single-precision floating point inputs. For Lattice Boltzmann methods, the corresponding speedup number on CPUs is 2.1X.
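
    As a hedged illustration of the spatial half of the scheme (temporal blocking, the extra half-dimension in "3.5D", is omitted), the NumPy sketch below tiles the x-y plane and processes the z extent of each tile, which is the 2.5D spatial-blocking pattern the abstract refers to; the function name and tile sizes are ours, not the paper's:

        import numpy as np

        def stencil_7pt_blocked(a, bx=32, by=32):
            # One Jacobi-style sweep of a 7-point stencil with 2.5D spatial blocking:
            # the x-y plane is tiled so a tile's working set fits in cache, and the z
            # extent of each tile is processed in turn (a real kernel streams plane by plane).
            out = a.copy()
            nz, ny, nx = a.shape
            for y0 in range(1, ny - 1, by):
                for x0 in range(1, nx - 1, bx):
                    y1, x1 = min(y0 + by, ny - 1), min(x0 + bx, nx - 1)
                    out[1:nz-1, y0:y1, x0:x1] = (
                        a[1:nz-1, y0:y1, x0:x1]
                        + a[0:nz-2, y0:y1, x0:x1] + a[2:nz, y0:y1, x0:x1]
                        + a[1:nz-1, y0-1:y1-1, x0:x1] + a[1:nz-1, y0+1:y1+1, x0:x1]
                        + a[1:nz-1, y0:y1, x0-1:x1-1] + a[1:nz-1, y0:y1, x0+1:x1+1]
                    ) / 7.0
            return out

        grid = np.random.rand(64, 64, 64)
        print(stencil_7pt_blocked(grid).shape)   # (64, 64, 64)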

  • An 80-Fold Speedup, 15.0 TFlops Full GPU Acceleration of Non-Hydrostatic Weather Model ASUCA Production Code

    Page(s): 1 - 11

    Regional weather forecasting demands fast simulation over fine-grained grids, resulting in extremely memory-bottlenecked computation, a difficult problem on conventional supercomputers. However, early work on accelerating the mainstream weather code WRF using GPUs, with their high memory performance, resulted in only minor speedup because only part of the huge code was ported to the GPU. Our full CUDA port of the high-resolution weather prediction model ASUCA is, to our knowledge, the first of its kind to date; ASUCA is a next-generation, production weather code developed by the Japan Meteorological Agency, similar to WRF in the underlying physics (non-hydrostatic model). Benchmarks on the TSUBAME Supercomputer at the Tokyo Institute of Technology, using 528 NVIDIA GT200 Tesla GPUs, demonstrated over 80-fold speedup and good weak scaling, achieving 15.0 TFlops in single precision on a 6956 x 6052 x 48 mesh. Further benchmarks on TSUBAME 2.0, which will embody over 4000 NVIDIA Fermi GPUs and be deployed in October 2010, will be presented.

  • Understanding the Impact of Emerging Non-Volatile Memories on High-Performance, IO-Intensive Computing

    Page(s): 1 - 11

    Emerging storage technologies such as flash memories, phase-change memories, and spin-transfer torque memories are poised to close the enormous performance gap between disk-based storage and main memory. We evaluate several approaches to integrating these memories into computer systems by measuring their impact on IO-intensive, database, and memory-intensive applications. We explore several options for connecting solid-state storage to the host system and find that the memories deliver large gains in sequential and random access performance, but that different system organizations lead to different performance trade-offs. The memories provide substantial application-level gains as well, but overheads in the OS, file system, and application can limit performance. As a result, fully exploiting these memories' potential will require substantial changes to application and system software. Finally, paging to fast non-volatile memories is a viable option for some applications, providing an alternative to expensive, power-hungry DRAM for supporting scientific applications with large memory footprints.

  • DASH: a Recipe for a Flash-based Data Intensive Supercomputer

    Page(s): 1 - 11

    Data intensive computing can be defined as computation involving large datasets and complicated I/O patterns. Data intensive computing is challenging because there is a five-orders-of-magnitude latency gap between main memory DRAM and spinning hard disks; the result is that an inordinate amount of time in data intensive computing is spent accessing data on disk. To address this problem, we designed and built a prototype data intensive supercomputer named DASH that exploits flash-based Solid State Drive (SSD) technology and also virtually aggregated DRAM to fill the latency gap. DASH uses commodity parts, including Intel® X25-E flash drives and distributed shared memory (DSM) software from ScaleMP®. The system is highly competitive with several commercial offerings by several metrics, including achieved IOPS (input/output operations per second), IOPS per dollar of system acquisition cost, IOPS per watt during operation, and IOPS per gigabyte (GB) of available storage. We present here an overview of the design of DASH, an analysis of its cost efficiency, and a detailed recipe for how we designed and tuned it for high data performance; lastly, we show that, running data-intensive scientific applications from graph theory, biology, and astronomy, we achieved as much as two orders of magnitude speedup compared to the same applications run on traditional architectures.
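
    The five-orders-of-magnitude figure is easy to see with typical (assumed, not measured) access latencies; the short sketch below also shows where flash sits in that gap, which is the niche DASH exploits:

        # Rough illustration of the storage latency gap, using assumed typical latencies.
        dram_latency_s  = 100e-9     # ~100 ns main-memory access
        flash_latency_s = 100e-6     # ~100 us flash SSD access (the gap-filler DASH uses)
        disk_latency_s  = 10e-3      # ~10 ms random access on a spinning disk
        print(f"disk  / DRAM gap: {disk_latency_s  / dram_latency_s:10,.0f}x")   # ~100,000x
        print(f"flash / DRAM gap: {flash_latency_s / dram_latency_s:10,.0f}x")   # ~1,000x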

  • Simple but Effective Heterogeneous Main Memory with On-Chip Memory Controller Support

    Page(s): 1 - 11

    System-in-Package (SiP) and 3D integration are promising technologies for bringing more memory onto a microprocessor package to mitigate the "memory wall" problem. In this paper, instead of using them to build caches, we study a heterogeneous main memory that uses both on- and off-package memories, providing fast, high-bandwidth on-package accesses as well as expandable, low-cost commodity off-package memory capacity. We introduce another layer of address translation, coupled with an on-chip memory controller, that can dynamically migrate data between on-package and off-package memory either in hardware or with operating system assistance, depending on the migration granularity. Our experimental results demonstrate that such a design can achieve, on average, 83% of the effectiveness of the ideal case where all memory can be placed in high-speed on-package memory for our simulated benchmarks.

  • Exploiting 162-Nanosecond End-to-End Communication Latency on Anton

    Page(s): 1 - 12

    Strong scaling of scientific applications on parallel architectures is increasingly limited by communication latency. This paper describes the techniques used to mitigate latency in Anton, a massively parallel special-purpose machine that accelerates molecular dynamics (MD) simulations by orders of magnitude compared with the previous state of the art. Achieving this speedup required a combination of hardware mechanisms and software constructs to reduce network latency, sender and receiver overhead, and synchronization costs. Key elements of Anton's approach, in addition to tightly integrated communication hardware, include formulating data transfer in terms of counted remote writes, leveraging fine-grained communication, and establishing fixed, optimized communication patterns. Anton delivers software-to-software inter-node latency significantly lower than any other large-scale parallel machine, and the total critical-path communication time for an Anton MD simulation is less than 4% that of the next fastest MD platform.
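
    Anton's counted remote writes are implemented in hardware; purely as an illustration of the synchronization pattern (data lands directly in place and the receiver waits only for a count, with no per-message acknowledgements), here is a toy Python model with hypothetical names:

        import threading

        class CountedCompletion:
            # Toy model of counted remote writes: the receiver knows how many incoming
            # writes to expect and blocks until that many have landed.
            def __init__(self, expected):
                self.expected, self.count = expected, 0
                self.cond = threading.Condition()

            def remote_write(self, buffer, index, value):
                with self.cond:
                    buffer[index] = value            # data is deposited directly in place...
                    self.count += 1                  # ...and bumps the completion counter
                    if self.count == self.expected:
                        self.cond.notify_all()

            def wait_all(self):
                with self.cond:
                    self.cond.wait_for(lambda: self.count >= self.expected)

        buf = [None] * 4
        cc = CountedCompletion(expected=4)
        for i in range(4):
            threading.Thread(target=cc.remote_write, args=(buf, i, i * i)).start()
        cc.wait_all()
        print(buf)   # [0, 1, 4, 9]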

  • vSnoop: Improving TCP Throughput in Virtualized Environments via Acknowledgement Offload

    Page(s): 1 - 11

    Virtual machine (VM) consolidation has become a common practice in clouds, Grids, and datacenters. While this practice leads to higher CPU utilization, we observe its negative impact on the TCP throughput of the consolidated VMs: as more VMs share the same core/CPU, the CPU scheduling latency for each VM increases significantly. Such an increase leads to slower progress of TCP transmissions to the VMs. To address this problem, we propose an approach called vSnoop, where the driver domain of a host acknowledges TCP packets on behalf of the guest VMs - whenever it is safe to do so. Our evaluation of a Xen-based prototype indicates that vSnoop consistently achieves TCP throughput improvement for VMs (of orders of magnitude in some scenarios). We further show that the higher TCP throughput leads to improvement in application-level performance, via experiments with a two-tier online auction application and two suites of MPI benchmarks.

  • A Flexible Reservation Algorithm for Advance Network Provisioning

    Page(s): 1 - 11

    Many scientific applications need support from a communication infrastructure that provides predictable performance, which requires effective algorithms for bandwidth reservation. Network reservation systems such as ESnet's OSCARS establish secure virtual circuits at a requested bandwidth for a given length of time. However, users currently cannot inquire about bandwidth availability, nor do they receive alternative suggestions when reservation requests fail. In general, the number of reservation options grows exponentially with the number of nodes n and the current reservation commitments. We present a novel approach for path finding in time-dependent networks that takes advantage of user-provided parameters of total volume and time constraints and produces options for earliest completion and shortest duration. The theoretical complexity is only O(n²r²) in the worst case, where r is the number of reservations in the desired time interval. We have implemented our algorithm and developed efficient methodologies for incorporation into network reservation frameworks. Performance measurements confirm the theoretical predictions.

  • Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System

    Page(s): 1 - 11

    High-performance computing (HPC) systems are growing more powerful by utilizing more hardware components. As the system mean time before failure correspondingly drops, applications must checkpoint more frequently to make progress. However, as system memory sizes grow faster than the bandwidth to the parallel file system, the cost of checkpointing begins to dominate application run times. Multi-level checkpointing potentially solves this problem through multiple types of checkpoints with different costs and different levels of resiliency in a single run. This solution employs lightweight checkpoints to handle the most common failure modes and relies on more expensive checkpoints for less common, but more severe, failures. This theoretically promising approach has not been fully evaluated in a large-scale, production system context. We have designed the Scalable Checkpoint/Restart (SCR) library, a multi-level checkpoint system that writes checkpoints to RAM, Flash, or disk on the compute nodes in addition to the parallel file system. We present the performance and reliability properties of SCR as well as a probabilistic Markov model that predicts its performance on current and future systems. We show that multi-level checkpointing improves efficiency on existing large-scale systems and that this benefit increases as the system size grows. In particular, we developed low-cost checkpoint schemes that are 100x-1000x faster than the parallel file system and effective against 85% of our system failures. This leads to a gain in machine efficiency of up to 35%, and it reduces the load on the parallel file system by a factor of two on current and future systems.
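
    The SCR paper uses a probabilistic Markov model; as a simpler, classical point of reference (not the paper's model), Young's approximation for the optimal checkpoint interval already shows why checkpoints that are 100x-1000x cheaper can be taken far more often:

        from math import sqrt

        def young_interval(checkpoint_cost_s, mtbf_s):
            # Young's rule of thumb: optimal compute time between checkpoints is
            # roughly sqrt(2 * checkpoint_cost * MTBF).
            return sqrt(2.0 * checkpoint_cost_s * mtbf_s)

        mtbf = 24 * 3600.0                       # assume a one-day system MTBF
        for cost in (600.0, 6.0):                # assumed parallel-file-system vs. node-local costs
            print(f"checkpoint cost {cost:6.1f} s -> interval ~ {young_interval(cost, mtbf) / 60:.0f} min")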

  • Experiences with a Lightweight Supercomputer Kernel: Lessons Learned from Blue Gene's CNK

    Page(s): 1 - 10

    The Petascale era has recently been ushered in, and many researchers have already turned their attention to the challenges of exascale computing. To achieve petascale computing, two broad approaches to kernels were taken: a lightweight approach embodied by IBM Blue Gene's CNK, and a more full-weight approach embodied by Cray's CNL. There are strengths and weaknesses to each approach. Examining the current generation can provide insight into what mechanisms may be needed for the exascale generation. The contributions of this paper are the experiences we had with CNK on Blue Gene/P. We demonstrate that it is possible to implement a small lightweight kernel that scales well but still provides a Linux environment and the functionality desired by HPC programmers. Such an approach provides the benefits of reproducibility, low noise, high and stable performance, reliability, and ease of effectively exploiting unique hardware features. We describe the strengths and weaknesses of this approach.

  • Characterizing the Influence of System Noise on Large-Scale Applications by Simulation

    Page(s): 1 - 11

    This paper presents an in-depth analysis of the impact of system noise on large-scale parallel application performance in realistic settings. Our analytical model shows that not only collective operations but also point-to-point communications influence the application's sensitivity to noise. We present a simulation toolchain that injects noise delays from traces gathered on common large-scale architectures into a LogGPS simulation and allows new insights into the scaling of applications in noisy environments. We investigate collective operations with up to 1 million processes and three applications (Sweep3D, AMG, and POP) with up to 32,000 processes. We show that the scale at which noise becomes a bottleneck is system-specific and depends on the structure of the noise. Simulations with different network speeds show that a 10x faster network does not improve application scalability. We quantify noise and conclude that our tools can be utilized to tune the noise signatures of a specific system.

  • Fast PGAS Implementation of Distributed Graph Algorithms

    Page(s): 1 - 11

    Due to their memory-intensive workload and erratic access patterns, irregular graph algorithms are notoriously hard to implement and optimize for high performance on distributed-memory systems. Although the recently proposed PGAS paradigm improves ease of programming, no high-performance PGAS implementation of large-scale graph analysis is known. We present the first fast PGAS implementation of graph algorithms for the connected components and minimum spanning tree problems. By improving memory access locality, our implementation exhibits much better communication efficiency and cache performance than the naive implementation on a cluster of SMPs. With additional algorithmic and PGAS-specific optimizations, our implementation achieves significant speedups over both the best sequential implementation and the best single-node SMP implementation for large, sparse graphs with more than a billion edges.
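
    The paper's contribution is the distributed PGAS implementation; the sequential union-find sketch below only defines the connected-components problem it targets and is handy as a correctness reference on small graphs:

        def connected_components(num_vertices, edges):
            # Union-find with path halving; returns a representative label per vertex.
            parent = list(range(num_vertices))

            def find(v):
                while parent[v] != v:
                    parent[v] = parent[parent[v]]    # path halving
                    v = parent[v]
                return v

            for u, v in edges:
                ru, rv = find(u), find(v)
                if ru != rv:
                    parent[ru] = rv                  # merge the two components
            return [find(v) for v in range(num_vertices)]

        print(connected_components(6, [(0, 1), (1, 2), (3, 4)]))   # [2, 2, 2, 4, 4, 5]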
