2010 IEEE International Conference on Cluster Computing (CLUSTER)

Date: 20-24 September 2010

Displaying Results 1 - 25 of 46
  • [Front cover]

    Publication Year: 2010 , Page(s): C1
  • [Title page i]

    Publication Year: 2010 , Page(s): i
  • [Title page iii]

    Publication Year: 2010 , Page(s): iii
  • [Copyright notice]

    Publication Year: 2010 , Page(s): iv
  • Table of contents

    Publication Year: 2010 , Page(s): v - viii
  • Foreword

    Publication Year: 2010 , Page(s): ix
  • Conference organization

    Publication Year: 2010 , Page(s): x - xi
  • Program Committee

    Publication Year: 2010 , Page(s): xii - xiii
  • External reviewers

    Publication Year: 2010 , Page(s): xiv - xv
  • Sponsors and Supporters

    Publication Year: 2010 , Page(s): xvi
  • Keynotes

    Publication Year: 2010 , Page(s): xvii - xix

    Provides an abstract for each of the keynote presentations and a brief professional biography of each presenter. The complete presentations were not made available for publication as part of the conference proceedings.

  • Minimizing MPI Resource Contention in Multithreaded Multicore Environments

    Publication Year: 2010 , Page(s): 1 - 8
    Cited by:  Papers (2)

    With the ever-increasing numbers of cores per node in high-performance computing systems, a growing number of applications are using threads to exploit shared memory within a node and MPI across nodes. This hybrid programming model needs efficient support for multithreaded MPI communication. In this paper, we describe the optimization of one aspect of a multithreaded MPI implementation: concurrent accesses from multiple threads to various MPI objects, such as communicators, datatypes, and requests. The semantics of the creation, usage, and destruction of these objects imply, but do not strictly require, the use of reference counting to prevent memory leaks and premature object destruction. We demonstrate how a naive multithreaded implementation of MPI object management via reference counting incurs a significant performance penalty. We then detail two solutions that we have implemented in MPICH2 to mitigate this problem almost entirely, including one based on a novel garbage collection scheme. In our performance experiments, this new scheme improved the multithreaded messaging rate by up to 31% over the naive reference counting method.

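    The contention problem is easy to picture. Below is a minimal C sketch, with hypothetical names rather than MPICH2 internals, of the naive scheme the paper criticizes: every per-message use of a shared MPI object bumps one atomic counter, so all threads serialize on a single cache line.

```c
/* Minimal sketch of naive reference counting on a shared MPI-like object.
 * The mpi_object type and function names are hypothetical, not MPICH2's. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

typedef struct {
    atomic_int refcount;
    /* ... communicator/datatype/request state would live here ... */
} mpi_object;

static mpi_object comm = { .refcount = 1 };

static void obj_addref(mpi_object *o)  { atomic_fetch_add(&o->refcount, 1); }
static void obj_release(mpi_object *o) {
    if (atomic_fetch_sub(&o->refcount, 1) == 1) {
        /* last reference dropped: object would be destroyed here */
    }
}

/* Every "message" operation touches the shared counter, so all threads
 * contend on one cache line -- the overhead the paper measures. */
static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        obj_addref(&comm);   /* using the communicator pins it */
        obj_release(&comm);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("final refcount: %d\n", atomic_load(&comm.refcount));
    return 0;
}
```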
  • TCCluster: A Cluster Architecture Utilizing the Processor Host Interface as a Network Interconnect

    Publication Year: 2010 , Page(s): 9 - 18

    So far, large computing clusters consisting of several thousand machines have been constructed by connecting nodes with interconnect technologies such as Ethernet, InfiniBand, or Myrinet. We propose an entirely new architecture, called Tightly Coupled Cluster (TCCluster), that instead uses the native host interface of the processors as a direct network interconnect. By virtually integrating the network interface adapter into the processor, this approach offers higher bandwidth and much lower communication latencies than traditional approaches. Our technique is purely software based: we use commodity off-the-shelf AMD processors and exploit the HyperTransport host interface as a cluster interconnect, with no modifications to the processors and no additional hardware. In this paper, we explain the addressing of nodes in such a cluster, the routing within such a system, and the programming model that can be applied. We present a detailed description of the tasks that need to be addressed and provide a proof-of-concept implementation. To evaluate our technique, we present a two-node TCCluster prototype, for which we developed BIOS firmware, a custom Linux kernel, and a small message library. We present microbenchmarks that show a sustained bandwidth of up to 2500 MB/s for messages as small as 64 bytes and a communication latency of 227 ns between two nodes, outperforming other high-performance networks by an order of magnitude.

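    A hedged sketch of the core idea: if a remote node's memory is exposed as a window in the local physical address space, a send is just a store into a mapped aperture. An anonymous mmap stands in for the HyperTransport window here; the 64-byte message layout echoes the paper's benchmark, everything else is invented.

```c
/* Sketch of send-by-store: an mmap'd region stands in for the remote
 * node's HyperTransport address window set up by the custom BIOS. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define APERTURE_SIZE (1 << 20)

struct msg { uint64_t seq; char payload[56]; };  /* 64-byte message */

int main(void) {
    /* In the real system this mapping would target the remote node's
     * physical address range; here it is ordinary local memory. */
    volatile struct msg *remote = mmap(NULL, APERTURE_SIZE,
                                       PROT_READ | PROT_WRITE,
                                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (remote == MAP_FAILED) return 1;

    struct msg m = { .seq = 1 };
    memcpy(m.payload, "hello", 6);

    /* A "send" is a plain store; no NIC is involved. */
    remote[0] = m;
    __sync_synchronize();      /* order the payload before the flag     */
    remote[1].seq = m.seq;     /* doorbell/flag write (illustrative)    */

    printf("wrote message seq=%llu into the aperture\n",
           (unsigned long long)remote[0].seq);
    return 0;
}
```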
  • Adaptive Optimization for Petascale Heterogeneous CPU/GPU Computing

    Publication Year: 2010 , Page(s): 19 - 28
    Cited by:  Papers (10)

    In this paper, we describe our experience developing an implementation of the Linpack benchmark for TianHe-1, a petascale CPU/GPU supercomputer and the largest GPU-accelerated system attempted at the time. We present an adaptive optimization framework that balances the workload distribution across the GPUs and CPUs with negligible runtime overhead, resulting in better performance than static or training-based partitioning methods. The CPU-GPU communication overhead is effectively hidden by a software pipelining technique, which is particularly useful for large memory-bound applications. Combined with other traditional optimizations, the Linpack implementation we optimized using the adaptive framework achieved 196.7 GFLOPS on a single compute element of TianHe-1. This result is 70.1% of the peak compute capability and 3.3 times faster than the result using the vendor's library. On the full configuration of TianHe-1, our optimizations resulted in a Linpack performance of 0.563 PFLOPS, which made TianHe-1 the 5th fastest supercomputer on the Top500 list released in November 2009.

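    The adaptive balancing step can be illustrated in a few lines of C: measure how long the CPU and GPU shares took, then resize each side's fraction in proportion to its observed speed so both finish together. The speeds below are simulated constants, not measurements from TianHe-1.

```c
/* Rebalance the GPU's share of each iteration from observed timings. */
#include <stdio.h>

int main(void) {
    double f = 0.5;           /* fraction of work given to the GPU     */
    double gpu_speed = 8.0;   /* work units per second (assumed)       */
    double cpu_speed = 1.0;

    for (int iter = 0; iter < 4; iter++) {
        double t_gpu = f / gpu_speed;          /* time for GPU share   */
        double t_cpu = (1.0 - f) / cpu_speed;  /* time for CPU share   */

        /* Observed throughputs; the balanced split gives each side
         * work proportional to its measured speed. */
        double s_gpu = f / t_gpu, s_cpu = (1.0 - f) / t_cpu;
        f = s_gpu / (s_gpu + s_cpu);

        printf("iter %d: t_gpu=%.3fs t_cpu=%.3fs -> gpu fraction %.3f\n",
               iter, t_gpu, t_cpu, f);
    }
    return 0;
}
```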
  • How to Scale Nested OpenMP Applications on the ScaleMP vSMP Architecture

    Publication Year: 2010 , Page(s): 29 - 37

    The novel ScaleMP vSMP architecture employs commodity x86-based servers and an InfiniBand network to assemble a large shared-memory system at an attractive price point. We examine this combined hardware and software approach to a DSM system using both system-level kernel benchmarks and real-world application codes. We compare this architecture with traditional shared-memory machines and elaborate on strategies for tuning application codes parallelized with OpenMP on multiple levels. Finally, we summarize the conditions a scalable application has to fulfill in order to profit from the full potential of the ScaleMP approach.

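    A minimal nested-OpenMP sketch of the tuning pattern discussed: an outer team spanning boards and an inner team spanning the cores of one board. Team sizes are illustrative; on vSMP, pinning the outer threads to boards is what makes this pay off.

```c
/* Two-level OpenMP parallelism; compile with -fopenmp. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_set_max_active_levels(2);           /* enable nesting */

    #pragma omp parallel num_threads(2)     /* outer: one thread per board */
    {
        int board = omp_get_thread_num();
        #pragma omp parallel num_threads(4) /* inner: cores within a board */
        {
            printf("board %d, core thread %d\n",
                   board, omp_get_thread_num());
        }
    }
    return 0;
}
```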
  • Synchronizing the Timestamps of Concurrent Events in Traces of Hybrid MPI/OpenMP Applications

    Publication Year: 2010 , Page(s): 38 - 47

    Event traces are helpful in understanding the performance behavior of parallel applications since they allow the in-depth analysis of communication and synchronization patterns. However, the absence of synchronized clocks on most cluster systems may render the analysis ineffective because inaccurate relative event timings may misrepresent the logical event order and lead to errors when quantifying the impact of certain behaviors or confuse the users of time-line visualization tools by showing messages flowing backward in time. In our earlier work, we have developed a scalable algorithm that eliminates inconsistent inter-process timings postmortem in traces of pure MPI applications. Since hybrid programming, the combination of MPI and OpenMP in a single application, is becoming more popular on clusters in response to rising numbers of cores per chip and widening shared-memory nodes, we present an extended version of the algorithm that in addition to message-passing event semantics also preserves and restores shared-memory event semantics.

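    The clock condition being enforced can be shown in a few lines: a receive must not appear earlier than its matching send plus a minimal latency, and any correction must be propagated to later events. The trace and epsilon below are invented; the actual algorithm (a controlled logical clock) also amortizes shifts to avoid introducing new violations.

```c
/* Enforce recv >= send + eps over a tiny synthetic message trace. */
#include <stdio.h>

struct msg_event { double send_ts, recv_ts; };

int main(void) {
    const double eps = 0.000001;  /* minimal network latency, 1 us    */
    struct msg_event trace[] = {
        { 1.000000, 1.000050 },   /* consistent                       */
        { 2.000000, 1.999990 },   /* message flows "backward"         */
        { 3.000000, 3.000120 },
    };
    double drift = 0.0;           /* forward-propagated correction    */

    for (int i = 0; i < 3; i++) {
        double recv = trace[i].recv_ts + drift;
        if (recv < trace[i].send_ts + eps) {
            /* violation: shift this and all later receiver events */
            drift += trace[i].send_ts + eps - recv;
            recv = trace[i].send_ts + eps;
        }
        printf("msg %d: send=%.6f recv=%.6f%s\n", i, trace[i].send_ts,
               recv, drift > 0 ? "  (shifted)" : "");
    }
    return 0;
}
```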
  • Getting Rid of Coherency Overhead for Memory-Hungry Applications

    Publication Year: 2010 , Page(s): 48 - 57
    Cited by:  Papers (2)

    Current commercial solutions intended to provide additional resources to an application executing in a cluster usually aggregate processors and memory from different nodes. In this paper we present a 16-node prototype of a shared-memory cluster architecture that follows a different approach by decoupling the amount of memory available to an application from the processing resources assigned to it. In this way, we provide a new degree of freedom: the memory granted to a process can be expanded with memory from other nodes in the cluster without increasing the number of processors used by the program. This feature is especially suitable for memory-hungry applications that demand large amounts of memory but whose level of parallelization prevents them from using more cores than are available in a single node. The main advantage of this approach is that an application can use memory from other nodes without involving the processors, and caches, of those nodes. As a result, using more memory no longer implies increasing the coherence protocol overhead, because the number of caches in the coherence domain becomes independent of the amount of available memory. The prototype we present in this paper leverages this idea by sharing 128 GB of memory across the cluster. Real executions show the feasibility of our prototype and its scalability.

  • Energy-Aware Scheduling in Virtualized Datacenters

    Publication Year: 2010 , Page(s): 58 - 67
    Cited by:  Papers (20)

    The reduction of energy consumption in large-scale datacenters is being accomplished through extensive use of virtualization, which enables the consolidation of multiple workloads onto a smaller number of machines. Nevertheless, virtualization also incurs additional overheads (e.g., virtual machine creation and migration) that can influence which consolidated configuration is best, and thus they must be taken into account. In this paper, we present a dynamic job scheduling policy for power-aware resource allocation in a virtualized datacenter. Our policy consolidates workloads from separate machines onto a smaller number of nodes while fulfilling the amount of hardware resources needed to preserve the quality of service of each job. This allows spare servers to be turned off, reducing the overall datacenter power consumption. As a novelty, this policy incorporates all the virtualization overheads into the decision process. In addition, our policy is prepared to consider other important datacenter parameters, such as reliability or dynamic SLA enforcement, in a synergistic way with power consumption. We evaluate the policy against common policies in a simulated environment that accurately models the execution of HPC jobs in a virtualized datacenter, including power consumption modeling, and obtain a 15% reduction in power consumption with respect to typical policies.

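    The consolidation decision can be sketched as a packing problem: fill already-busy nodes first so spare servers can be switched off, then cost the result with a power model. Capacities, demands, and wattages below are invented, and the sketch omits the virtualization overheads that the paper's policy additionally charges to each decision.

```c
/* Greedy consolidation: pack jobs onto the busiest node that fits them. */
#include <stdio.h>

#define NODES 4
#define JOBS  6

int main(void) {
    double cap[NODES]   = { 8, 8, 8, 8 };       /* CPUs per node        */
    double used[NODES]  = { 0 };
    double demand[JOBS] = { 3, 2, 2, 4, 1, 3 }; /* CPUs each job needs  */
    double idle_w = 150, per_cpu_w = 25;        /* simple power model   */

    for (int j = 0; j < JOBS; j++) {
        int best = -1;
        for (int n = 0; n < NODES; n++)
            if (used[n] + demand[j] <= cap[n] &&
                (best < 0 || used[n] > used[best]))
                best = n;                       /* busiest node that fits */
        used[best] += demand[j];
        printf("job %d (%g CPUs) -> node %d\n", j, demand[j], best);
    }

    double power = 0;
    for (int n = 0; n < NODES; n++)
        if (used[n] > 0) power += idle_w + per_cpu_w * used[n];
    printf("total power: %.0f W (empty nodes turned off)\n", power);
    return 0;
}
```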
  • TRACER: A Trace Replay Tool to Evaluate Energy-Efficiency of Mass Storage Systems

    Publication Year: 2010 , Page(s): 68 - 77
    Cited by:  Papers (2)

    Improving the energy efficiency of mass storage systems has become an important and pressing research issue in large HPC centers and data centers. New energy conservation techniques for storage systems constantly spring up; however, there is no systematic and uniform way of accurately evaluating energy-efficient storage systems and objectively comparing the wide range of energy-saving techniques. This paper presents a new integrated scheme, called TRACER, for evaluating the energy efficiency of mass storage systems and judging energy-saving techniques. TRACER consists of a toolkit used to measure the energy efficiency of storage systems, along with performance and energy metrics. In addition, TRACER contains a novel and accurate workload-control module that measures how power varies with workload mode and I/O load intensity. The workload generator in TRACER facilitates a block-level trace replay mechanism. The main goal of the workload-control module is to select a certain percentage (e.g., anywhere from 10% to 100%) of trace entries from a real-world I/O trace file uniformly and to replay the filtered trace entries so as to reach any level of I/O load intensity. TRACER is experimentally validated on a general RAID-5 enterprise disk array. Our experiments demonstrate that TRACER can accurately evaluate energy-efficient mass storage systems at full scale. We applied TRACER to investigate the impact of workload mode and load intensity on the energy efficiency of storage devices. This work shows that TRACER can enable storage system developers to evaluate energy-efficiency designs for storage systems.

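    The load-intensity control described in the abstract reduces to uniform filtering plus timed replay, roughly as below. The trace is synthetic, and a real replay would issue block I/O rather than print.

```c
/* Keep a uniform fraction of trace entries; replay with original timing. */
#include <stdio.h>
#include <unistd.h>

struct io_entry { double ts; long lba; int bytes; };

int main(void) {
    struct io_entry trace[] = {
        { 0.00, 1024, 4096 }, { 0.02, 2048, 8192 }, { 0.05, 4096, 4096 },
        { 0.07,  512, 4096 }, { 0.10, 8192, 8192 }, { 0.12,  256, 4096 },
    };
    int n = 6, percent = 50;   /* target I/O load intensity */
    double clock = 0.0;

    for (int i = 0; i < n; i++) {
        /* uniform filtering: keep entry i iff it crosses a percent step */
        if ((i * percent) / 100 == ((i + 1) * percent) / 100)
            continue;                            /* dropped by the filter */
        usleep((useconds_t)((trace[i].ts - clock) * 1e6)); /* keep timing */
        clock = trace[i].ts;
        printf("replay: t=%.2fs lba=%ld bytes=%d\n",
               trace[i].ts, trace[i].lba, trace[i].bytes);
    }
    return 0;
}
```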
  • Designing OS for HPC Applications: Scheduling

    Publication Year: 2010 , Page(s): 78 - 87
    Cited by:  Papers (2)

    Operating systems have historically been implemented as independent layers between hardware and applications. User programs communicate with the OS through a set of well-defined system calls and do not have direct access to the hardware. The OS, in turn, communicates with the underlying architecture via control registers. Except for these interfaces, the three layers are practically oblivious to each other. While this structure improves portability and transparency, it may not deliver optimal performance. This is especially true for High Performance Computing (HPC) systems, where modern parallel applications and multi-core architectures pose new challenges in terms of performance, power consumption, and system utilization. The hardware, the OS, and the applications can no longer remain isolated and instead should cooperate to deliver high performance with minimal power consumption. In this paper we present our experience with the design and implementation of High Performance Linux (HPL), an operating system designed to optimize the performance of HPC applications running on a state-of-the-art compute cluster. We show how characterizing parallel applications through hardware and software performance counters drives the design of the OS and how incorporating knowledge about the architecture improves performance and efficiency. We perform experiments on a dual-socket IBM POWER6 machine, showing performance improvements and stability (performance variation of 2.11% on average) for NAS, a widely used parallel benchmark suite.

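    Characterizing an application through hardware performance counters, the kind of input the paper feeds into OS design, looks roughly like this on Linux using the standard perf_event_open interface (error handling trimmed; the counted loop merely stands in for an application).

```c
/* Count retired instructions for a region of code via perf_event_open. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile double x = 0;                /* the "application" under study */
    for (int i = 0; i < 1000000; i++) x += i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count;
    if (read(fd, &count, sizeof(count)) != sizeof(count)) return 1;
    printf("instructions retired: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}
```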
  • Exploiting Data Deduplication to Accelerate Live Virtual Machine Migration

    Publication Year: 2010 , Page(s): 88 - 96
    Cited by:  Papers (21)  |  Patents (1)

    As one of the key capabilities of virtualization, live virtual machine (VM) migration provides great benefits for load balancing, power management, fault tolerance, and other system maintenance issues in modern clusters and data centers. Although pre-copy is a widely used migration algorithm, it transfers a large amount of duplicated memory image data from source to destination, which results in longer migration time and downtime. This paper proposes a novel VM migration approach, named Migration with Data Deduplication (MDD), which introduces data deduplication into migration. MDD utilizes the self-similarity of the run-time memory image, uses hash-based fingerprints to find identical and similar memory pages, and employs run-length encoding (RLE) to eliminate redundant memory data during migration. Experiments demonstrate that, compared with Xen's default pre-copy migration algorithm, MDD reduces the total data transferred during migration by 56.60%, total migration time by 34.93%, and downtime by 26.16% on average.

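    The two ingredients named in the abstract, hash fingerprints to detect pages already present at the destination and run-length encoding for the rest, can be sketched as follows. FNV-1a and the 4 KB page size are ordinary choices for illustration, not details taken from the paper.

```c
/* Fingerprint pages; send a hash for duplicates, RLE-compress the rest. */
#include <stdint.h>
#include <stdio.h>

#define PAGE 4096

static uint64_t fnv1a(const uint8_t *p, size_t n) {
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 1099511628211ULL; }
    return h;
}

/* RLE size for a page as (run length, byte) pairs; wins on zero pages. */
static size_t rle_size(const uint8_t *p, size_t n) {
    size_t out = 0;
    for (size_t i = 0; i < n; ) {
        size_t run = 1;
        while (i + run < n && p[i + run] == p[i] && run < 255) run++;
        out += 2;
        i += run;
    }
    return out;
}

int main(void) {
    static uint8_t zero_page[PAGE];          /* all zeroes */
    static uint8_t data_page[PAGE];
    for (int i = 0; i < PAGE; i++) data_page[i] = (uint8_t)(i * 31);

    uint64_t seen = fnv1a(zero_page, PAGE);  /* already at destination */

    uint8_t *pages[2] = { zero_page, data_page };
    for (int i = 0; i < 2; i++) {
        uint64_t fp = fnv1a(pages[i], PAGE);
        if (fp == seen)
            printf("page %d: duplicate, send 8-byte fingerprint only\n", i);
        else
            printf("page %d: RLE would need %zu bytes (raw %d)\n",
                   i, rle_size(pages[i], PAGE), PAGE);
    }
    return 0;
}
```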
  • SHelp: Automatic Self-Healing for Multiple Application Instances in a Virtual Machine Environment

    Publication Year: 2010 , Page(s): 97 - 106
    Cited by:  Papers (3)

    When multiple instances of an application run on multiple virtual machines, an interesting problem is how to utilize the fault-handling result from one application instance to heal the same fault occurring in sibling instances, and hence to ensure high service availability in a cloud computing environment. This paper presents SHelp, a lightweight runtime system that can survive software failures in a virtual machine framework. It applies weighted rescue points and error virtualization techniques to effectively make applications bypass faulty paths. A two-level storage hierarchy is adopted in the rescue-point database so that applications running on different virtual machines can share error-handling information, reducing redundancy and recovering more effectively and quickly from future faults caused by the same bugs. A Linux prototype is implemented and evaluated using four web server applications that contain various types of bugs. Our experimental results show that SHelp can make server applications recover from these bugs in just a few seconds with modest performance overhead.

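    The rescue-point control flow can be sketched with sigsetjmp: checkpoint at a function entry, and when a fault fires inside, jump back and "virtualize" the error into a return code the caller already handles. SHelp's real mechanism operates on unmodified binaries; this only illustrates the idea.

```c
/* Rescue point + error virtualization, in miniature. */
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

static sigjmp_buf rescue;

static void on_fault(int sig) {
    (void)sig;
    siglongjmp(rescue, 1);               /* unwind to the rescue point */
}

/* A request handler with a latent bug. */
static int handle_request(int bad) {
    if (sigsetjmp(rescue, 1))            /* rescue point at entry      */
        return -1;                       /* virtualized error code     */
    if (bad) { volatile int *p = NULL; *p = 42; }  /* the bug          */
    return 0;
}

int main(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_fault;
    sigaction(SIGSEGV, &sa, NULL);

    printf("good request -> %d\n", handle_request(0));
    printf("bad request  -> %d (server keeps running)\n", handle_request(1));
    return 0;
}
```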
  • Virtualizing Modern High-Speed Interconnection Networks with Performance and Scalability

    Publication Year: 2010 , Page(s): 107 - 115
    Cited by:  Papers (1)

    As one of the most important enabling technologies of cloud computing, virtualization brings to HPC good manageability, online system maintenance, performance isolation, and fault isolation. Furthermore, previous work on VMM-bypass I/O, which virtualizes OS-bypass networks (e.g., InfiniBand), alleviated concerns about the performance degradation that comes along with virtualization. In this paper, we address the scalability challenges imposed on OS-bypass networks in virtualized environments. The eXtended Reliable Connection (XRC) transport, introduced in modern high-speed interconnection networks to address the scalability problem in large-scale applications, does not work in virtualized environments. To solve this problem, we propose a VM-proof XRC design that eliminates the scalability gap between virtualized and native environments. Prototype evaluation shows that, with our VM-proof XRC design, virtualized modern high-speed interconnection networks can achieve the same raw performance and scalability as in a native, non-virtualized environment. The connection memory scalability shows a potential 16-fold improvement on virtualized clusters composed of 16-core nodes.

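    The scalability gap XRC closes is simple arithmetic: with reliable connections (RC) every process pairs with every remote process, while XRC needs one connection per process per remote node. The cluster size below is invented, but the 16-core case reproduces the 16x figure quoted in the abstract.

```c
/* Per-node connection counts: RC vs. XRC. */
#include <stdio.h>

int main(void) {
    long nodes = 64, cores = 16;            /* illustrative cluster size */
    long rc  = cores * cores * (nodes - 1); /* QPs per node with RC      */
    long xrc = cores * (nodes - 1);         /* QPs per node with XRC     */
    printf("per-node connections: RC=%ld  XRC=%ld  ratio=%ldx\n",
           rc, xrc, rc / xrc);
    return 0;
}
```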
  • RDMA-Based Job Migration Framework for MPI over InfiniBand

    Publication Year: 2010 , Page(s): 116 - 125
    Cited by:  Papers (4)

    Coordinated checkpoint and recovery is a common approach to achieving fault tolerance on large-scale systems. The traditional mechanism dumps the process images of all the processes involved in the parallel job to a local disk or a central storage area. When a failure occurs, the processes are restarted and restored from the latest checkpoint image. However, this kind of approach cannot provide the scalability required by increasingly large jobs, since it puts a heavy I/O burden on the storage subsystem, and resubmitting a job during the restart phase incurs lengthy queuing delay. In this paper, we enhance the fault tolerance of MVAPICH2, an open-source high-performance MPI-2 implementation, with a proactive job migration scheme. Instead of checkpointing all the processes of the job and saving their process images to stable storage, we transfer the processes running on a health-deteriorating node to a healthy spare node and resume them there. RDMA-based process image transmission is designed to take advantage of high-performance communication in InfiniBand. Experimental results show that the job migration scheme achieves a speedup of 4.49 times over the checkpoint/restart scheme in handling a node failure for a 64-process application running on 8 compute nodes. To the best of our knowledge, this is the first such job migration design for InfiniBand-based clusters.

  • Host Side Dynamic Reconfiguration with InfiniBand

    Publication Year: 2010 , Page(s): 126 - 135
    Cited by:  Papers (3)

    Rerouting around faulty components and migration of jobs both require reconfiguration of data structures in the Queue Pairs residing in the hosts of an InfiniBand cluster. In this paper we report an implementation of dynamic reconfiguration of such host-side data structures. Our implementation preserves the Queue Pairs and lets the application run without interruption. With this implementation, we demonstrate a complete solution to fault tolerance in an InfiniBand network, where dynamic network reconfiguration to a topology-agnostic routing function is used to avoid malfunctioning components. This solution is in principle able to let applications run uninterrupted on the cluster as long as the topology remains physically connected. Through measurements on our test cluster we show that the added setup latency of our method is negligible and that there is only a minor reduction in throughput during reconfiguration.
