
SC 2008: International Conference for High Performance Computing, Networking, Storage and Analysis

Date: 15-21 Nov. 2008


Displaying Results 1 - 25 of 65
  • Entering the petaflop era: The architecture and performance of Roadrunner

    Page(s): 1 - 11

    Roadrunner is a 1.38 Pflop/s-peak (double precision) hybrid-architecture supercomputer developed by LANL and IBM. It contains 12,240 IBM PowerXCell 8i processors and 12,240 AMD Opteron cores in 3,060 compute nodes. Roadrunner is the first supercomputer to run Linpack at a sustained speed in excess of 1 Pflop/s. In this paper we present a detailed architectural description of Roadrunner and a detailed performance analysis of the system. A case study of optimizing the MPI-based application Sweep3D to exploit Roadrunner's hybrid architecture is also included. The performance of Sweep3D is compared to that of the code on a previous implementation of the Cell Broadband Engine architecture (the Cell BE) and on multi-core processors. Using validated performance models combined with Roadrunner-specific microbenchmarks, we identify performance issues in the early pre-delivery system and infer how well the final Roadrunner configuration will perform once the system software stack has matured.

  • High performance discrete Fourier transforms on graphics processors

    Page(s): 1 - 12

    We present novel algorithms for computing discrete Fourier transforms with high performance on GPUs. We present hierarchical, mixed-radix FFT algorithms for both power-of-two and non-power-of-two sizes. Our hierarchical FFT algorithms efficiently exploit shared memory on GPUs using a Stockham formulation. We reduce the memory transpose overheads in hierarchical algorithms by combining the transposes into a block-based multi-FFT algorithm. For non-power-of-two sizes, we use a combination of mixed-radix FFTs of small primes and Bluestein's algorithm. We use modular arithmetic in Bluestein's algorithm to improve the accuracy. We implemented our algorithms using the NVIDIA CUDA API and compared their performance with NVIDIA's CUFFT library and an optimized CPU implementation (Intel's MKL) on a high-end quad-core CPU. On an NVIDIA GPU, we obtained performance of up to 300 GFlops, with typical performance improvements of 2-4× over CUFFT and 8-40× over MKL for large sizes.

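
The Stockham formulation mentioned in the abstract above avoids the separate bit-reversal pass of classic Cooley-Tukey FFTs by writing each pass's butterflies out in already-permuted order, which is what makes it attractive for staging through GPU shared memory. A minimal radix-2 CPU sketch of the idea (illustrative only; the paper's GPU kernels are far more elaborate):

```python
import numpy as np

def stockham_fft(x):
    """Radix-2 Stockham autosort FFT for power-of-two lengths.

    Each pass writes its butterflies into a second buffer in permuted
    order, so no explicit bit-reversal step is ever needed.
    """
    n = len(x)
    assert n and (n & (n - 1)) == 0, "length must be a power of two"
    a = np.asarray(x, dtype=complex).copy()
    b = np.empty_like(a)
    l, m = n // 2, 1              # l: butterfly span, m: current sub-FFT size
    while l >= 1:
        for j in range(l):
            w = np.exp(-2j * np.pi * j / (2 * l))   # twiddle factor
            for k in range(m):
                c0 = a[k + j * m]
                c1 = a[k + j * m + l * m]
                b[k + 2 * j * m] = c0 + c1
                b[k + 2 * j * m + m] = (c0 - c1) * w
        a, b = b, a               # ping-pong buffers between passes
        l //= 2
        m *= 2
    return a
```

A GPU version would map the `j`/`k` loops onto threads and keep the ping-pong buffers in on-chip shared memory; the output matches NumPy's FFT on power-of-two inputs.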
  • Dynamically adapting file domain partitioning methods for collective I/O based on underlying parallel file system locking protocols

    Page(s): 1 - 12

    Collective I/O, such as that provided in MPI-IO, enables process collaboration among a group of processes for greater I/O parallelism. Its implementation involves file domain partitioning, and having the right partitioning is a key to achieving high-performance I/O. As modern parallel file systems maintain data consistency by adopting a distributed file locking mechanism to avoid centralized lock management, different locking protocols can have a significant impact on the degree of parallelism of a given file domain partitioning method. In this paper, we propose dynamic file partitioning methods that adapt to the underlying locking protocols of the parallel file systems, and we evaluate the performance of four partitioning methods under two locking protocols. By running multiple I/O benchmarks, our experiments demonstrate that no single partitioning method guarantees the best performance. Using MPI-IO as an implementation platform, we provide guidelines to select the most appropriate partitioning methods for various I/O patterns and file systems.

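
To illustrate the kind of partitioning choice at stake, here is a hedged sketch (the function names and the even-split baseline are ours, not the paper's) of aligning file-domain boundaries to a file system's lock granularity so that no lock unit is shared by two processes:

```python
def even_domains(start, end, nprocs):
    """Baseline: split the byte range [start, end) evenly,
    ignoring lock boundaries (two domains may share a lock unit)."""
    size = end - start
    cuts = [start + size * i // nprocs for i in range(nprocs + 1)]
    return list(zip(cuts[:-1], cuts[1:]))

def lock_aligned_domains(start, end, nprocs, lock_unit):
    """Round each internal cut down to a lock-unit boundary so a lock
    unit is never split across two file domains, avoiding lock-contention
    serialization under block-granular locking protocols."""
    domains = []
    prev = start
    for i in range(1, nprocs + 1):
        cut = start + (end - start) * i // nprocs
        if i < nprocs:
            cut = (cut // lock_unit) * lock_unit  # align to lock boundary
            cut = max(cut, prev)                  # keep domains ordered
        domains.append((prev, cut))
        prev = cut
    return domains
```

The trade-off the paper studies is real: the aligned split removes lock conflicts but makes domain sizes uneven, so neither method wins for every locking protocol and access pattern.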
  • Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures

    Page(s): 1 - 12

    Understanding the most efficient design and utilization of emerging multicore systems is one of the most challenging questions the mainstream and scientific computing industries have faced in decades. Our work explores multicore stencil (nearest-neighbor) computations - a class of algorithms at the heart of many structured grid codes, including PDE solvers. We develop a number of effective optimization strategies, and build an auto-tuning environment that searches over our optimizations and their parameters to minimize runtime, while maximizing performance portability. To evaluate the effectiveness of these strategies we explore the broadest set of multicore architectures in the current HPC literature, including the Intel Clovertown, AMD Barcelona, Sun Victoria Falls, IBM QS22 PowerXCell 8i, and NVIDIA GTX280. Overall, our auto-tuning optimization methodology results in the fastest multicore stencil performance to date. Finally, we present several key insights into the architectural tradeoffs of emerging multicore designs and their implications on scientific algorithm development.

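
As a toy illustration of the auto-tuning idea (our own sketch, not the paper's framework), one can time a stencil sweep over a small set of candidate blocking parameters and keep the fastest, exactly as an auto-tuner does over a much richer search space:

```python
import time
import numpy as np

def sweep_blocked(grid, out, bx):
    """One 7-point Laplacian-style stencil sweep over the interior of a
    3-D grid, processed in slabs of bx planes (a crude cache-blocking knob).
    The result is independent of bx; only performance changes."""
    n = grid.shape[0]
    for i0 in range(1, n - 1, bx):
        i1 = min(i0 + bx, n - 1)
        out[i0:i1, 1:-1, 1:-1] = (
            grid[i0 - 1:i1 - 1, 1:-1, 1:-1] + grid[i0 + 1:i1 + 1, 1:-1, 1:-1]
            + grid[i0:i1, :-2, 1:-1] + grid[i0:i1, 2:, 1:-1]
            + grid[i0:i1, 1:-1, :-2] + grid[i0:i1, 1:-1, 2:]
            - 6.0 * grid[i0:i1, 1:-1, 1:-1])

def autotune(n=64, candidates=(1, 2, 4, 8, 16)):
    """Time each candidate block size on a trial grid; return the fastest."""
    grid = np.random.rand(n, n, n)
    out = np.zeros_like(grid)
    timings = {}
    for bx in candidates:
        t0 = time.perf_counter()
        sweep_blocked(grid, out, bx)
        timings[bx] = time.perf_counter() - t0
    return min(timings, key=timings.get)
```

A production auto-tuner such as the paper's also searches over register blocking, prefetching, NUMA placement, and SIMDization, and re-tunes per architecture.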
  • Bandwidth intensive 3-D FFT kernel for GPUs using CUDA

    Page(s): 1 - 11

    Most GPU performance "hype" has focused on tightly-coupled applications with small memory bandwidth requirements (e.g., N-body), but GPUs are also commodity vector machines sporting substantial memory bandwidth; effective programming methodologies for exploiting it, however, have been poorly studied. Our new 3-D FFT kernel, written in NVIDIA CUDA, achieves nearly 80 GFLOPS on a top-end GPU, more than three times faster than any existing FFT implementation on GPUs, including CUFFT. Careful programming techniques are employed to fully exploit modern GPU hardware characteristics while overcoming their limitations, including on-chip shared memory utilization, optimization of the number of threads and registers through appropriate localization, and avoidance of low-speed strided memory accesses. Applied to real applications, our kernel achieves orders-of-magnitude improvement in power and cost versus performance metrics. The off-card bandwidth limitation remains an issue; it could be alleviated somewhat by confining application kernels within the card, though the ideal solution is faster GPU interfaces.

  • Using server-to-server communication in parallel file systems to simplify consistency and improve performance

    Page(s): 1 - 8

    The trend in parallel computing toward clusters running thousands of cooperating processes per application has led to an I/O bottleneck that has only gotten more severe as the CPU density of clusters has increased. Current parallel file systems provide large amounts of aggregate I/O bandwidth; however, they do not achieve the high degrees of metadata scalability required to manage files distributed across hundreds or thousands of storage nodes. In this paper we examine the use of collective communication between the storage servers to improve the scalability of file metadata operations. In particular, we apply server-to-server communication to simplify consistency checking and improve the performance of file creation, file removal, and file stat. Our results indicate that collective communication is an effective scheme for simplifying consistency checks and significantly improving the performance for several real metadata intensive workloads.

  • Scientific application-based performance comparison of SGI Altix 4700, IBM POWER5+, and SGI ICE 8200 supercomputers

    Page(s): 1 - 12

    The suitability of next-generation high-performance computing systems for petascale simulations will depend on various performance factors attributable to processor, memory, local and global network, and input/output characteristics. In this paper, we evaluate the performance of the new dual-core SGI Altix 4700, quad-core SGI Altix ICE 8200, and dual-core IBM POWER5+ systems. To measure performance, we used micro-benchmarks from the High Performance Computing Challenge (HPCC), the NAS Parallel Benchmarks (NPB), and four real-world applications: three from computational fluid dynamics (CFD) and one from climate modeling. We used the micro-benchmarks to develop a controlled understanding of individual system components, then analyzed and interpreted performance of the NPBs and applications. We also explored the hybrid programming model (MPI+OpenMP) using multi-zone NPBs and the CFD application OVERFLOW-2. Achievable application performance is compared across the systems. For the ICE platform, we also investigated the effect of memory bandwidth on performance by testing 1, 2, 4, and 8 cores per node.

  • Adapting a message-driven parallel application to GPU-accelerated clusters

    Page(s): 1 - 9

    Graphics processing units (GPUs) have become an attractive option for accelerating scientific computations as a result of advances in the performance and flexibility of GPU hardware, and due to the availability of GPU software development tools targeting general purpose and scientific computation. However, effective use of GPUs in clusters presents a number of application development and system integration challenges. We describe strategies for the decomposition and scheduling of computation among CPU cores and GPUs, and techniques for overlapping communication and CPU computation with GPU kernel execution. We report the adaptation of these techniques to NAMD, a widely-used parallel molecular dynamics simulation package, and present performance results for a 64-core 64-GPU cluster.

  • Scaling parallel I/O performance through I/O delegate and caching system

    Page(s): 1 - 12

    Increasingly complex scientific applications require massive parallelism to achieve the goals of fidelity and high computational performance. Such applications periodically offload checkpointing data to the file system for post-processing and program resumption. As a side effect of the high degree of parallelism, I/O contention at the servers prevents overall performance from scaling with an increasing number of processors. To bridge the gap between parallel computational and I/O performance, we propose a portable MPI-IO layer in which certain tasks, such as file caching, consistency control, and collective I/O optimization, are delegated to a small set of compute nodes, collectively termed I/O Delegate nodes. A collective cache design is incorporated to resolve cache coherence and thereby alleviate lock contention at the I/O servers. Using popular parallel I/O benchmarks and application I/O kernels, our experimental evaluation indicates considerable performance improvement with a small percentage of compute resources reserved for I/O.

  • Efficient management of data center resources for Massively Multiplayer Online Games

    Page(s): 1 - 12

    Today's massively multiplayer online games (MMOGs) can include millions of concurrent players spread across the world. To keep these highly-interactive virtual environments online, an MMOG operator may need to provision tens of thousands of computing resources from various data centers. Faced with large resource demand variability, and with misfit resource renting policies, the current industry practice is to maintain for each game tens of self-owned data centers. In this work we investigate dynamic resource provisioning from external data centers for MMOG operation. We introduce a novel MMOG workload model that represents the dynamics of both the player population and the player interactions. We evaluate several algorithms, including a novel neural network predictor, for predicting the resource demand. Using trace-based simulation, we evaluate the impact of the data center policies on the resource provisioning efficiency; we show that dynamic provisioning can be much more efficient than its static alternative.

  • Performance optimization of TCP/IP over 10 Gigabit Ethernet by precise instrumentation

    Page(s): 1 - 12

    End-to-end communication over 10 Gigabit Ethernet (10 GbE) WANs has become popular. However, difficulties must be overcome before Long Fat-pipe Networks (LFNs) can be fully utilized with TCP. We observed that the following caused performance degradation: short-term bursty data transfer, mismatch between TCP and hardware support, and excess CPU load. In this research, we have established systematic methodologies to optimize TCP on LFNs. To pinpoint the causes of performance degradation, we analyzed real networks precisely using our hardware-based wire-rate analyzer with 100-ns time resolution. We took the following actions on the basis of the observations: (1) utilizing hardware-based pacing to avoid unnecessary packet losses due to collisions at bottlenecks, (2) modifying TCP to accommodate the packet coalescing mechanism, and (3) modifying programs to reduce memory copies. We achieved a constant throughput of 9.08 Gbps on a 500 ms RTT network for 5 hours. Our approach has overcome the difficulties on single-end 10 GbE LFNs.

  • A multi-level parallel simulation approach to electron transport in nano-scale transistors

    Page(s): 1 - 10

    Physics-based simulation of electron transport in nanoelectronic devices requires the solution of thousands of highly complex equations to obtain the output characteristics of a single input voltage. The only way to obtain a complete set of bias points within a reasonable amount of time is to resort to supercomputers offering several hundreds to thousands of cores. To profit from the rapidly increasing availability of such machines, we have developed a state-of-the-art quantum mechanical transport simulator dedicated to nanodevices that works with four levels of parallelism. Using these four levels, we demonstrate almost ideal scaling of the walltime up to 32,768 processors, with a parallel efficiency of 86%, in the simulation of realistically extended and gated field-effect transistors. Obtaining the current characteristics of these devices is reduced to some hundreds of seconds, instead of days on a small cluster or months on a single CPU.

  • Feedback-controlled resource sharing for predictable eScience

    Page(s): 1 - 12

    The emerging class of adaptive, real-time, data-driven applications is a significant problem for today's HPC systems. In general, it is extremely difficult for queuing-system-controlled HPC resources to make and guarantee a tightly-bounded prediction regarding the time at which a newly-submitted application will execute. While a reservation-based approach partially addresses the problem, it can create severe resource under-utilization (unused reservations, necessary scheduled idle slots, underutilized reservations, etc.) that resource providers are eager to avoid. In contrast, this paper presents a fundamentally different approach to guarantee predictable execution. By creating a virtualized application layer called the performance container, and opportunistically multiplexing concurrent performance containers through the application of formal feedback control theory, we regulate the job's progress such that the job meets its deadline without requiring exclusive access to resources, even in the presence of a wide class of unexpected disturbances. Our evaluation using two widely-used applications, WRF and BLAST, on an 8-core server shows that our approach is predictable and meets deadlines with 3.4% error on average while achieving high overall utilization.

  • Wide-area performance profiling of 10GigE and InfiniBand technologies

    Page(s): 1 - 12

    For wide-area high-performance applications, light-paths provide 10Gbps connectivity, and multi-core hosts with PCI-Express can drive such data rates. However, sustaining such end-to-end application throughputs across connections of thousands of miles remains challenging, and the current performance studies of such solutions are very limited. We present an experimental study of two solutions to achieve such throughputs based on: (a) 10Gbps Ethernet with TCP/IP transport protocols, and (b) InfiniBand and its wide-area extensions. For both, we generate performance profiles over 10Gbps connections of lengths up to 8600 miles, and discuss the components, complexity, and limitations of sustaining such throughputs, using different connections and host configurations. Our results indicate that the IB solution is better suited for applications with a single large flow, while the 10GigE solution is better for those with multiple competing flows.

  • Accelerating configuration interaction calculations for nuclear structure

    Page(s): 1 - 12

    One of the emerging computational approaches in nuclear physics is the configuration interaction (CI) method for solving the many-body nuclear Hamiltonian in a sufficiently large single-particle basis space to obtain exact answers - either directly or by extrapolation. The lowest eigenvalues and corresponding eigenvectors for very large, sparse and unstructured nuclear Hamiltonian matrices are obtained and used to evaluate additional experimental quantities. These matrices pose a significant challenge to the design and implementation of efficient and scalable algorithms for obtaining solutions on massively parallel computer systems. In this paper, we describe the computational strategies employed in a state-of-the-art CI code MFDn (Many Fermion Dynamics - nuclear) as well as techniques we recently developed to enhance the computational efficiency of MFDn. We will demonstrate the current capability of MFDn and report the latest performance improvement we have achieved. We will also outline our future research directions.

  • Efficient auction-based grid reservations using dynamic programming

    Page(s): 1 - 8

    Auction mechanisms have been proposed as a means to efficiently and fairly schedule jobs in high-performance computing environments. The generalized Vickrey auction has long been known to produce efficient allocations while exposing users to truth-revealing incentives, but the algorithms used to compute its payments can be computationally intractable. In this paper we present a novel implementation of the generalized Vickrey auction that uses dynamic programming to schedule jobs and compute payments in pseudo-polynomial time. Additionally, we have built a version of the PBS scheduler that uses this algorithm to schedule jobs, and in this paper we present the results of our tests using this scheduler.

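
To make the mechanism concrete, here is an illustrative sketch (ours, not the paper's PBS implementation) of a single-resource generalized Vickrey auction: a knapsack-style dynamic program picks the welfare-maximizing set of jobs in pseudo-polynomial time, and each winner pays the externality it imposes on the others:

```python
def solve(jobs, capacity):
    """0/1 knapsack DP with traceback over (size, value) jobs.
    Returns (max total value, indices of one optimal job set).
    Runs in O(len(jobs) * capacity): pseudo-polynomial time."""
    n = len(jobs)
    dp = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i, (size, value) in enumerate(jobs, 1):
        for c in range(capacity + 1):
            dp[i][c] = dp[i - 1][c]
            if size <= c:
                dp[i][c] = max(dp[i][c], dp[i - 1][c - size] + value)
    chosen, c = [], capacity            # trace back one optimal set
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            chosen.append(i - 1)
            c -= jobs[i - 1][0]
    return dp[n][capacity], sorted(chosen)

def vcg_payments(jobs, capacity):
    """VCG rule: winner i pays the welfare the others lose because
    i participates (their best without i, minus their share with i)."""
    welfare, chosen = solve(jobs, capacity)
    payments = {}
    for i in chosen:
        others = jobs[:i] + jobs[i + 1:]
        welfare_without_i, _ = solve(others, capacity)
        payments[i] = welfare_without_i - (welfare - jobs[i][1])
    return payments
```

Because each payment never exceeds the winner's declared value, truthful bidding remains a dominant strategy; the DP is what makes the otherwise intractable payment computation affordable.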
  • Asymmetric interactions in symmetric multi-core systems: Analysis, enhancements and evaluation

    Page(s): 1 - 12

    Multi-core architectures have spurred the rapid growth in high-end computing systems. While the vast majority of such multi-core processors contain symmetric hardware components, their interaction with systems software, in particular the communication stack, results in a remarkable amount of asymmetry in the effective capability of the different cores. In this paper, we analyze such interactions and propose a novel management library called SyMMer (Systems Mapping Manager) that monitors these interactions and dynamically manages the mapping of processes on processor cores to transparently improve application performance. Together with a detailed description of the SyMMer library, we also present a performance evaluation comparing SyMMer to a vanilla communication library using various micro-benchmarks as well as popular applications and scientific libraries. Experimental results demonstrate more than a two-fold improvement in communication time and a 10-15% improvement in overall application performance.

  • Dendro: Parallel algorithms for multigrid and AMR methods on 2:1 balanced octrees

    Page(s): 1 - 12

    In this article, we present Dendro, a suite of parallel algorithms for the discretization and solution of partial differential equations (PDEs) involving second-order elliptic operators. Dendro uses trilinear finite element discretizations constructed using octrees. Dendro comprises four main modules: a bottom-up octree generation and 2:1 balancing module, a meshing module, a geometric multiplicative multigrid module, and a module for adaptive mesh refinement (AMR). Here, we focus on the multigrid and AMR modules. The key features of Dendro are coarsening/refinement, inter-octree transfers of scalar and vector fields, and parallel partition of multilevel octree forests. We describe a bottom-up algorithm for constructing the coarser multigrid levels. The input is an arbitrary 2:1 balanced octree-based mesh, representing the fine level mesh. The output is a set of octrees and meshes that are used in the multigrid sweeps. Also, we describe matrix-free implementations for the discretized PDE operators and the intergrid transfer operations. We present results on up to 4096 CPUs on the Cray XT3 ("BigBen"), the Intel 64 system ("Abe"), and the Sun Constellation Linux cluster ("Ranger").

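
Octree codes like Dendro typically linearize octants with a space-filling Morton (Z-order) key, which turns bottom-up construction and parallel partitioning into sorting and splitting a key range. A small, hedged sketch of the encoding (our illustration; Dendro's actual keys also carry an octant level and more bookkeeping):

```python
def morton3(x, y, z, bits=10):
    """Interleave the low `bits` bits of integer coordinates (x, y, z)
    into a single Z-order key; nearby octants get nearby keys."""
    key = 0
    for b in range(bits):
        key |= (((x >> b) & 1) << (3 * b)
                | ((y >> b) & 1) << (3 * b + 1)
                | ((z >> b) & 1) << (3 * b + 2))
    return key

def partition(octants, nparts):
    """Sort octants by Morton key and split into equal contiguous chunks,
    giving each process a spatially compact piece of the octree."""
    ordered = sorted(octants, key=lambda o: morton3(*o))
    n = len(ordered)
    return [ordered[n * i // nparts:n * (i + 1) // nparts]
            for i in range(nparts)]
```

Because the sorted order is global, coarsening, 2:1 balancing, and intergrid transfers can all be phrased as merges and searches over these keys.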
  • Characterizing application sensitivity to OS interference using kernel-level noise injection

    Page(s): 1 - 12

    Operating system noise has been shown to be a key limiter of application scalability in high-end systems. While several studies have attempted to quantify the sources and effects of system interference using user-level mechanisms, there are few published studies on the effect of different kinds of kernel-generated noise on application performance at scale. In this paper, we examine the sensitivity of real-world, large-scale applications to a range of OS noise patterns using a kernel-based noise injection mechanism implemented in the Catamount lightweight kernel. Our results demonstrate the importance of how noise is generated, in terms of frequency and duration, and how noise's impact changes with application scale. For example, our results show that 2.5% net processor noise at 10,000 nodes can have no impact or can result in over a factor of 20 slowdown for the same application, depending solely on how the noise is generated. We also discuss how the characteristics of the applications we studied, for example computation/communication ratios, collective communication sizes, and other characteristics, relate to their tendency to amplify or absorb noise. Finally, we discuss the implications of our findings on the design of new operating systems, middleware, and other system services for high-end parallel systems.

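
The frequency-versus-duration effect the abstract describes can be reproduced with a toy bulk-synchronous model (our sketch, not the paper's kernel injection mechanism): every step costs the maximum over all nodes of compute time plus noise, so rare-but-long noise that strikes some node on almost every step at scale hurts far more than frequent-but-short noise with the same 2.5% average.

```python
import random

def slowdown(nodes, steps, work, hit_prob, hit_len, seed=0):
    """Mean per-step slowdown of a bulk-synchronous application where each
    node independently loses `hit_len` time with probability `hit_prob`
    per step; a step ends only when the slowest node finishes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        step = work
        for _ in range(nodes):
            if rng.random() < hit_prob:
                step = max(step, work + hit_len)
        total += step
    return total / (steps * work)

# Both patterns inject the same 2.5% average noise per node:
short = slowdown(nodes=1000, steps=200, work=1.0,
                 hit_prob=1.0, hit_len=0.025)    # every step, tiny detour
long_ = slowdown(nodes=1000, steps=200, work=1.0,
                 hit_prob=0.0025, hit_len=10.0)  # rare, long detour
```

With these numbers the short pattern costs exactly 2.5%, while the long pattern inflates runtime by roughly an order of magnitude, mirroring the no-impact-versus-20x contrast the abstract reports.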
  • Performance prediction of large-scale parallel system and application using macro-level simulation

    Page(s): 1 - 9

    Predicting application performance on an HPC system is an important technology for designing computing systems and developing applications. However, accurate prediction is challenging, particularly for a future system with higher performance. In this paper, we present a new method for predicting application performance on HPC systems. This method combines modeling of sequential performance on a single processor with macro-level simulations of applications for parallel performance on the entire system. In the simulation, the execution flow is traced, but kernel computations are omitted to reduce the execution time. Validation on a real terascale system showed that the predicted and measured performance agreed within 10% to 20%. We employed the method in designing a hypothetical petascale system of 32,768 SIMD-extended processor cores. For predicting application performance on the petascale system, the macro-level simulation required several hours.

  • A novel domain oriented approach for scientific Grid workflow composition

    Page(s): 1 - 12

    Existing knowledge-based grid workflow languages and composition tools require sophisticated expertise of domain scientists to automate the process of managing workflows and their components (activities). So far, semantic workflow specification and management has not been addressed from a general and integrated perspective. This paper presents a novel domain-oriented approach which features separation of concerns between data meaning and data representation, and between activity function (semantic description of workflow activities) and activity type (syntactic description of workflow activities). These separations are implemented as part of the Abstract Grid Workflow Language (AGWL), which supports the development of grid workflows at a high (semantic) level of abstraction. The corresponding workflow composition tool simplifies grid workflow composition by (i) enabling users to compose grid workflows at the level of data meaning and activity function, which shields the complexity of the grid, any specific implementation technology (e.g. Web or Grid services), and any specific data representation, (ii) supporting semi-automatic data flow composition, and (iii) performing automatic data conversions. We have implemented our approach as part of the ASKALON grid application development and computing environment. We demonstrate the effectiveness of our approach by applying it to a real-world meteorology workflow application and report some preliminary results. Our approach can also be adapted to other scientific domains by developing the corresponding ontologies for those domains.

  • Toward loosely coupled programming on petascale systems

    Page(s): 1 - 12

    We have extended the Falkon lightweight task execution framework to make loosely coupled programming on petascale systems a practical and useful programming model. This work studies and measures the performance factors involved in applying this approach to enable the use of petascale systems by a broader user community, and with greater ease. Our work enables the execution of highly parallel computations composed of loosely coupled serial jobs with no modifications to the respective applications. This approach allows a new, and potentially far larger, class of applications to leverage petascale systems, such as the IBM Blue Gene/P supercomputer. We present the challenges of I/O performance encountered in making this model practical, and show results using both microbenchmarks and real applications from two domains: economic energy modeling and molecular dynamics. Our benchmarks show that we can scale up to 160K processor cores with high efficiency, and can achieve sustained execution rates of thousands of tasks per second.

  • Early evaluation of IBM BlueGene/P

    Page(s): 1 - 12

    BlueGene/P (BG/P) is the second generation BlueGene architecture from IBM, succeeding BlueGene/L (BG/L). BG/P is a system-on-a-chip (SoC) design that uses four PowerPC 450 cores operating at 850 MHz with a double precision, dual pipe floating point unit per core. These chips are connected with multiple interconnection networks including a 3-D torus, a global collective network, and a global barrier network. The design is intended to provide a highly scalable, physically dense system with relatively low power requirements per flop. In this paper, we report on our examination of BG/P, presented in the context of a set of important scientific applications, and as compared to other major large scale supercomputers in use today. Our investigation confirms that BG/P has good scalability with an expected lower performance per processor when compared to the Cray XT4's Opteron. We also find that BG/P uses very low power per floating point operation for certain kernels, yet it has less of a power advantage when considering science driven metrics for mission applications.

  • Nimrod/K: Towards massively parallel dynamic Grid workflows

    Page(s): 1 - 11

    A challenge for Grid computing is the difficulty in developing software that is parallel, distributed and highly dynamic. Whilst there have been many general-purpose mechanisms developed over the years, Grid programming still remains a low-level, error-prone task. Scientific workflow engines can double as programming environments, and allow a user to compose 'virtual' Grid applications from pre-existing components. Whilst existing workflow engines can specify arbitrary parallel programs (where components use message passing), they are typically not effective with large and variable parallelism. Here we discuss dynamic dataflow, originally developed for parallel tagged dataflow architectures (TDAs), and show that these can be used for implementing Grid workflows. TDAs spawn parallel threads dynamically without additional programming. We have added TDAs to Kepler, and show that the system can orchestrate workflows that have large amounts of variable parallelism. We demonstrate the system using case studies in chemistry and in cardiac modelling.

  • SMARTMAP: Operating system support for efficient data sharing among processes on a multi-core processor

    Page(s): 1 - 12

    This paper describes SMARTMAP, an operating system technique that implements fixed offset virtual memory addressing. SMARTMAP allows the application processes on a multi-core processor to directly access each other's memory without the overhead of kernel involvement. When used to implement MPI, SMARTMAP eliminates all extraneous memory-to-memory copies imposed by UNIX-based shared memory strategies. In addition, SMARTMAP can easily support operations that UNIX-based shared memory cannot, such as direct, in-place MPI reduction operations and one-sided get/put operations. We have implemented SMARTMAP in the Catamount lightweight kernel for the Cray XT and modified MPI and Cray SHMEM libraries to use it. Micro-benchmark performance results show that SMARTMAP allows for significant improvements in latency, bandwidth, and small message rate on a quad-core processor.

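
The fixed-offset idea behind SMARTMAP can be pictured as pure address arithmetic: if every core's address space maps core n's memory at slot n of a fixed-size region grid, then a pointer into a peer's memory is just the local pointer with the slot bits replaced. A hedged sketch (the field widths and slot layout here are invented for illustration; the Catamount implementation manipulates real page-table entries):

```python
SLOT_SHIFT = 39                      # hypothetical: address bits per region
SLOT_MASK = (1 << SLOT_SHIFT) - 1    # offset of a byte within its region

def remote_ptr(local_ptr, peer_slot):
    """Translate a virtual address valid on this core into the address at
    which the same physical memory appears via peer_slot's fixed-offset
    window: keep the in-region offset, replace the slot number."""
    return (peer_slot << SLOT_SHIFT) | (local_ptr & SLOT_MASK)
```

Under such a mapping a one-sided put degenerates to an ordinary memcpy to `remote_ptr(dst, peer)`, with no kernel call and no intermediate shared-memory bounce buffer, which is the copy elimination the abstract describes.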