
2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGrid)

Date: 17-20 May 2010


Displaying Results 1 - 25 of 137
  • [Front cover]

    Publication Year: 2010, Page(s): C1
    PDF (356 KB) | Freely Available from IEEE
  • [Title page i]

    Publication Year: 2010, Page(s): i
    PDF (85 KB) | Freely Available from IEEE
  • [Title page iii]

    Publication Year: 2010, Page(s): iii
    PDF (155 KB) | Freely Available from IEEE
  • [Copyright notice]

    Publication Year: 2010, Page(s): iv
    PDF (115 KB) | Freely Available from IEEE
  • Table of contents

    Publication Year: 2010, Page(s): v - xv
    PDF (189 KB) | Freely Available from IEEE
  • Message from the General Chair

    Publication Year: 2010, Page(s): xvi - xvii
    PDF (110 KB) | HTML | Freely Available from IEEE
  • Message from the Program Chair

    Publication Year: 2010, Page(s): xviii - xix
    PDF (95 KB) | HTML | Freely Available from IEEE
  • Message from the Workshops Chair

    Publication Year: 2010, Page(s): xx - xxi
    PDF (126 KB) | HTML | Freely Available from IEEE
  • Organising Committee

    Publication Year: 2010, Page(s): xxii - xxiv
    PDF (127 KB) | Freely Available from IEEE
  • Program Committee Members

    Publication Year: 2010, Page(s): xxv - xxvi
    PDF (105 KB) | Freely Available from IEEE
  • CCGrid 2010 Sponsors

    Publication Year: 2010, Page(s): xxvii
    PDF (127 KB) | Freely Available from IEEE
  • Enabling the Next Generation of Scalable Clusters

    Publication Year: 2010, Page(s): 3
    PDF (141 KB)

    First Page of the Article
  • Sky Computing: When Multiple Clouds Become One

    Publication Year: 2010, Page(s): 4
    PDF (141 KB)

    First Page of the Article
  • Dynamic Load-Balanced Multicast for Data-Intensive Applications on Clouds

    Publication Year: 2010, Page(s): 5 - 14
    Cited by: Papers (5)
    PDF (552 KB) | HTML

    Data-intensive parallel applications on clouds need to deploy large data sets from the cloud's storage facility to all compute nodes as fast as possible. Many multicast algorithms have been proposed for clusters and grid environments. The most common approach is to construct one or more spanning trees based on the network topology and network monitoring data in order to maximize available bandwidth and avoid bottleneck links. However, delivering optimal performance becomes difficult once the available bandwidth changes dynamically. In this paper, we focus on Amazon EC2/S3 (the most commonly used cloud platform today) and propose two high-performance multicast algorithms. These algorithms make it possible to efficiently transfer large amounts of data stored in Amazon S3 to multiple Amazon EC2 nodes. The three salient features of our algorithms are (1) constructing an overlay network on clouds without network topology information, (2) optimizing the total throughput dynamically, and (3) increasing the download throughput by letting nodes cooperate with each other. The two algorithms differ in the way nodes cooperate: the first `non-steal' algorithm lets each node download an equal share of all data, while the second `steal' algorithm uses work stealing to counter the effect of heterogeneous download bandwidth. As a result, all nodes can download files from S3 quickly, even when the network performance changes while the algorithm is running. We evaluate our algorithms on EC2/S3 and show that they are scalable and consistently achieve high throughput. Both algorithms perform much better than having each node download all data directly from S3.

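    A short sketch can make the `steal' variant concrete: each node drains its own queue of chunk indices and, once idle, steals half of the largest remaining queue. This is a minimal toy model assuming chunked, ranged downloads; the names (Worker, steal_half, fetch) are illustrative, not from the paper.

        import threading
        from collections import deque

        class Worker:
            """Toy model of one EC2 node downloading S3 chunks cooperatively."""
            def __init__(self, chunks):
                self.queue = deque(chunks)   # chunk indices this node still owns
                self.lock = threading.Lock()
                self.done = []

            def steal_half(self):
                """Hand half of the remaining queue to an idle peer."""
                with self.lock:
                    return [self.queue.pop() for _ in range(len(self.queue) // 2)]

            def run(self, peers, fetch):
                while True:
                    with self.lock:
                        chunk = self.queue.popleft() if self.queue else None
                    if chunk is not None:
                        self.done.append(fetch(chunk))   # e.g. one ranged S3 GET
                        continue
                    victim = max(peers, key=lambda p: len(p.queue))
                    stolen = victim.steal_half()
                    if not stolen:
                        return                           # nothing left anywhere
                    with self.lock:
                        self.queue.extend(stolen)

    Fast nodes keep downloading as long as any peer has chunks left, which is how heterogeneous per-node bandwidth gets absorbed.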
  • Profit-Driven Service Request Scheduling in Clouds

    Publication Year: 2010, Page(s): 15 - 24
    Cited by: Papers (32)
    PDF (643 KB) | HTML

    A primary driving force of the recent cloud computing paradigm is its inherent cost effectiveness. As in many basic utilities, such as electricity and water, consumers/clients in cloud computing environments are charged based on their service usage, hence the term ‘pay-per-use’. While this pricing model is very appealing for both service providers and consumers, fluctuating service request volume and conflicting objectives (e.g., profit vs. response time) between providers and consumers hinder its effective application to cloud computing environments. In this paper, we address the problem of service request scheduling in cloud computing systems. We consider a three-tier cloud structure consisting of infrastructure vendors, service providers, and consumers; the latter two parties are of particular interest to us. Clearly, scheduling strategies in this scenario should satisfy the objectives of both parties. Our contributions include the development of a pricing model for clouds based on processor sharing, the application of this pricing model to composite services with dependency consideration (to the best of our knowledge, this study is the first such attempt), and the development of two sets of profit-driven scheduling algorithms.

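    As a rough illustration of processor-sharing pricing, the sketch below charges for service demand while response time stretches with concurrency. The linear lateness penalty and all names are assumptions for illustration, not the paper's actual model.

        def ps_response_time(service_demand, concurrency):
            # Under processor sharing, k concurrent requests each receive 1/k
            # of the server, so a request needing s seconds alone takes ~s*k.
            return service_demand * concurrency

        def provider_profit(price_per_sec, service_demand, concurrency,
                            penalty_per_sec, deadline):
            # Hypothetical profit: usage-based revenue minus a lateness penalty
            # standing in for the consumer's conflicting response-time objective.
            revenue = price_per_sec * service_demand
            finish = ps_response_time(service_demand, concurrency)
            lateness = max(0.0, finish - deadline)
            return revenue - penalty_per_sec * lateness

    A profit-driven scheduler could then place each request on the server whose current concurrency maximizes this quantity.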
  • Availability Prediction Based Replication Strategies for Grid Environments

    Publication Year: 2010, Page(s): 25 - 33
    Cited by: Papers (2)
    PDF (318 KB) | HTML

    Volunteer-based grid computing resources are characteristically volatile and frequently become unavailable due to the autonomy that owners maintain over them. This resource volatility has significant influence on the applications the resources host. Availability predictors can forecast unavailability and can provide schedulers with information about reliability, which helps them make better scheduling decisions when combined with information about speed and load. This paper studies using this prediction information to decide when to replicate jobs. In particular, our predictors forecast the probability that a job will complete uninterrupted, and our schedulers replicate those jobs that are least likely to do so. Our strategies outperform other comparable replication strategies, as measured by improved makespan and fewer redundant operations. We define a new "replication efficiency" metric, and demonstrate that our availability predictor can provide information that allows our schedulers to be more efficient than the most closely related replication strategy for a variety of loads in a trace-based grid simulation. We demonstrate that under low load conditions, our techniques come within 6% of the makespan improvement of a previously proposed replication technique while creating 76.8% fewer replicas, and under higher loads can improve makespan marginally while creating 72.5% fewer replicas.

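    A minimal sketch of the scheduling rule described above: replicate the jobs least likely to complete uninterrupted, and judge strategies by improvement bought per replica. The threshold rule and the efficiency ratio below are plausible readings for illustration, not the paper's exact definitions.

        def should_replicate(p_uninterrupted, cutoff=0.5):
            # Replicate jobs whose predicted probability of finishing
            # without interruption falls below a cutoff.
            return p_uninterrupted < cutoff

        def replication_efficiency(makespan_gain, replicas_created):
            # One plausible "replication efficiency": makespan improvement
            # per replica created (higher is better).
            return makespan_gain / max(replicas_created, 1)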
  • EGSI: TGKA Based Security Architecture for Group Communication in Grid

    Publication Year: 2010, Page(s): 34 - 42
    Cited by: Papers (1)
    PDF (219 KB) | HTML

    Security is one of the most important issues in the grid. Several processes generated by a user need to communicate securely. The existing architectures do not provide support for secure group communication in the grid. Grid applications differ in various parameters such as lifetime, arrival rate, departure rate, staying period, and tolerance times for join and leave. These parameters are critical for optimizing performance and meeting the security requirements of the application. Existing grid security architectures do not harness knowledge of these parameters. In this paper, we propose an extended grid security infrastructure (EGSI) to support secure group communication in the grid. We also present an authentication and access control scheme at the virtual organization level.

  • Elastic Site: Using Clouds to Elastically Extend Site Resources

    Publication Year: 2010, Page(s): 43 - 52
    Cited by: Papers (43) | Patents (1)
    PDF (762 KB) | HTML

    Infrastructure-as-a-Service (IaaS) cloud computing offers new possibilities to scientific communities. One of the most significant is the ability to elastically provision and relinquish new resources in response to changes in demand. In our work, we develop a model of an “elastic site” that efficiently adapts services provided within a site, such as batch schedulers, storage archives, or Web services, to take advantage of elastically provisioned resources. We describe the system architecture along with the issues involved in elastic provisioning, such as security, privacy, and various logistical considerations. To avoid over- or under-provisioning the resources, we propose three different policies to efficiently schedule resource deployment based on demand. We have implemented a resource manager, built on the Nimbus toolkit, to dynamically and securely extend existing physical clusters into the cloud. Our elastic site manager interfaces directly with local resource managers, such as Torque. We have developed and evaluated policies for resource provisioning on a Nimbus-based cloud at the University of Chicago, another at Indiana University, and on Amazon EC2. We demonstrate a dynamic and responsive elastic cluster, capable of responding effectively to a variety of job submission patterns. We also demonstrate that we can process jobs 10 times faster by expanding our cluster up to 150 EC2 nodes.

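    The demand-driven provisioning loop lends itself to a small sketch. Below is one toy policy in the spirit described above (boot cloud nodes while jobs queue, release idle ones); the function and its thresholds are assumptions, not one of the paper's three evaluated policies.

        def elastic_site_step(queued_jobs, idle_cloud_nodes,
                              cloud_nodes, max_cloud_nodes):
            # One scheduling step of a toy "on demand" policy: grow the site
            # while work is queued, shrink it when cloud nodes sit idle.
            if queued_jobs > 0 and cloud_nodes < max_cloud_nodes:
                return ("boot", min(queued_jobs, max_cloud_nodes - cloud_nodes))
            if queued_jobs == 0 and idle_cloud_nodes > 0:
                return ("terminate", idle_cloud_nodes)
            return ("hold", 0)

    A manager would run such a step periodically against the local resource manager's (e.g. Torque's) queue length.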
  • ConnectX-2 InfiniBand Management Queues: First Investigation of the New Support for Network Offloaded Collective Operations

    Publication Year: 2010, Page(s): 53 - 62
    Cited by: Papers (7)
    PDF (520 KB) | HTML

    This paper introduces the newly developed InfiniBand (IB) Management Queue capability, used by the Host Channel Adapter (HCA) to manage network task data flow dependencies and progress the communications associated with such flows. These tasks include sends, receives, and the newly supported wait task, and are scheduled by the HCA based on a data dependency description provided by the user. This functionality is supported by the ConnectX-2 HCA, and provides the means for delegating collective communication management and progress to the HCA, also known as collective communication offload. This provides a means for overlapping collective communications managed by the HCA with computation on the Central Processing Unit (CPU), thus making it possible to reduce the impact of system noise on parallel applications using collective operations. This paper further describes how this new capability can be used to implement scalable Message Passing Interface (MPI) collective operations, presenting the high-level details of how it is used to implement the MPI Barrier collective operation and focusing on the latency-sensitive performance aspects. The paper concludes with small-scale benchmark experiments comparing implementations of the barrier collective operation that use the new network offload capabilities with established point-to-point based implementations of the same algorithms, which manage the data flow using the central processing unit. These early results demonstrate the promise this new capability holds for improving the scalability of high-performance applications using collective communications. The latency of the HCA-based barrier implementation is similar to that of the best-performing point-to-point based implementation managed by the central processing unit, and starts to outperform it as the number of processes involved in the collective operation increases.

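    To make the send/receive/wait task idea concrete, here is a schematic recursive-doubling barrier expressed as the task list one rank might hand to the HCA for autonomous progression. This is purely a model of the dependency structure, assuming power-of-two process counts; it is not the ConnectX-2 verbs API.

        import math

        def barrier_task_list(rank, nprocs):
            # At round r, exchange a message with peer = rank XOR 2^r; a WAIT
            # task blocks the next round's send until the matching receive
            # completes, so the HCA can progress the chain without the CPU.
            tasks = []
            for r in range(math.ceil(math.log2(nprocs))):
                peer = rank ^ (1 << r)
                if peer < nprocs:
                    tasks.append(("send", peer))
                    tasks.append(("wait_recv", peer))
            return tasks

        # barrier_task_list(0, 8) -> sends to, then waits on, ranks 1, 2, 4.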
  • Distributed Diskless Checkpoint for Large Scale Systems

    Publication Year: 2010, Page(s): 63 - 72
    Cited by: Papers (15)
    PDF (579 KB) | HTML

    In high performance computing (HPC), applications are periodically checkpointed to stable storage to increase the success rate of long executions. Nowadays, the overhead imposed by disk-based checkpointing is about 20% of execution time, and in the coming years it will exceed 50% if checkpoint frequency increases along with fault frequency. Diskless checkpointing has been introduced as a solution to avoid the I/O bottleneck of disk-based checkpointing. However, the encoding time, the dedicated resources (the spares), and the memory overhead imposed by diskless checkpointing are significant obstacles to its adoption. In this work, we address these three limitations: 1) we propose a fault-tolerance model able to tolerate up to 50% of process failures with a low checkpointing overhead; 2) our fault-tolerance model works without spare nodes, while still guaranteeing high reliability; 3) we use solid-state drives to significantly increase checkpoint performance and avoid the memory overhead of classic diskless checkpointing.

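    The classic encoding behind diskless checkpointing is easy to sketch: processes in a group compute a parity block over their checkpoints, so any single loss in the group is recoverable without touching disk. The XOR toy below shows only this base idea; the paper's scheme tolerates up to 50% failures and stages data on SSDs.

        def xor_blocks(blocks):
            # Bytewise XOR of equally sized checkpoint blocks.
            out = bytearray(len(blocks[0]))
            for block in blocks:
                for i, b in enumerate(block):
                    out[i] ^= b
            return bytes(out)

        def encode(checkpoints):
            # Parity held within the group, not written to a parallel FS.
            return xor_blocks(checkpoints)

        def recover_missing(survivors, parity):
            # XOR of the survivors and the parity rebuilds the lost block.
            return xor_blocks(survivors + [parity])

        ckpts = [b"aaaa", b"bbbb", b"cccc"]
        parity = encode(ckpts)
        assert recover_missing([ckpts[0], ckpts[2]], parity) == ckpts[1]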
  • Enabling Instantaneous Relocation of Virtual Machines with a Lightweight VMM Extension

    Publication Year: 2010, Page(s): 73 - 83
    Cited by: Papers (13)
    PDF (602 KB) | HTML

    We are developing an efficient resource management system with aggressive virtual machine (VM) relocation among physical nodes in a data center. Existing live migration technology, however, requires a long time to change the execution host of a VM, making it difficult to optimize VM packing on physical nodes dynamically in response to ever-changing resource usage. In this paper, we propose an advanced live migration mechanism enabling instantaneous relocation of VMs. To minimize the time needed for switching the execution host, memory pages are transferred after a VM resumes at the destination host. A special character device driver allows transparent memory page retrievals from the source host for the running VM at the destination. In comparison with related work, the proposed mechanism supports guest operating systems without any modifications to them (i.e., no special device drivers or programs are needed in VMs). It is implemented as a lightweight extension to KVM (Kernel-based Virtual Machine) and does not require modifying critical parts of the VMM code. Experiments were conducted using the SPECweb2005 benchmark. A running VM with heavily loaded web servers was successfully relocated to a destination within one second. Temporary performance degradation after relocation was resolved by means of a precaching mechanism for memory pages. In addition, for memory-intensive workloads, our migration mechanism moved all the state of a VM faster than existing migration technology.

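    The core idea (resume first, fetch memory on demand) can be modeled in a few lines. The class below is an illustrative stand-in for the paper's character-device driver; the names and the RPC-style fetch callback are assumptions.

        class PostCopyMemory:
            """Toy model: the VM resumes at the destination immediately and
            pulls each memory page from the source host on first touch."""
            def __init__(self, n_pages, fetch_from_source):
                self.pages = [None] * n_pages    # None = still on source host
                self.fetch = fetch_from_source   # network read of one page

            def read(self, page_no):
                # First touch of a non-resident page triggers a transparent
                # retrieval, like the driver's fault handler.
                if self.pages[page_no] is None:
                    self.pages[page_no] = self.fetch(page_no)
                return self.pages[page_no]

            def precache_all(self):
                # Background transfer that removes the temporary slowdown
                # observed right after relocation.
                for i in range(len(self.pages)):
                    if self.pages[i] is None:
                        self.pages[i] = self.fetch(i)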
  • A Map-Reduce System with an Alternate API for Multi-core Environments

    Publication Year: 2010, Page(s): 84 - 93
    Cited by: Papers (13)
    PDF (333 KB) | HTML

    The map-reduce framework has received significant attention and is being used for programming both large-scale clusters and multi-core systems. While the high-productivity aspect of map-reduce has been well accepted, it is not clear whether the API results in efficient implementations for different subclasses of data-intensive applications. In this paper, we present MATE (Map-reduce with an Alternate API), a system that provides a high-level but distinct API. In particular, our API includes a programmer-managed reduction object, which results in lower memory requirements at runtime for many data-intensive applications. MATE implements this API on top of the Phoenix system, a multi-core map-reduce implementation from Stanford. We evaluate our system using three data mining applications and compare its performance to that of both Phoenix and Hadoop. Our results show that for all three applications, MATE outperforms Phoenix and Hadoop. While achieving good scalability, MATE also maintains the easy-to-use API of map-reduce. Overall, we argue that our approach, which is based on the generalized reduction structure, provides an alternate high-level API that leads to more efficient and scalable implementations.

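    The "programmer-managed reduction object" contrasts with map-reduce's emit model: instead of emitting (key, value) pairs that must be stored, grouped, and shuffled, each input element is folded straight into a fixed-size accumulator. The k-means-style accumulator below illustrates the idea; it is not MATE's actual API.

        class ReductionObject:
            # Fixed-size accumulator shared by the reduction phase: memory
            # stays O(k) instead of growing with the number of emitted pairs.
            def __init__(self, k):
                self.sums = [0.0] * k
                self.counts = [0] * k

            def accumulate(self, cluster_id, value):
                # Called once per input element, in place of emit(key, value).
                self.sums[cluster_id] += value
                self.counts[cluster_id] += 1

            def merge(self, other):
                # Combines per-thread objects after local processing finishes.
                for i in range(len(self.sums)):
                    self.sums[i] += other.sums[i]
                    self.counts[i] += other.counts[i]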
  • An Analysis of Traces from a Production MapReduce Cluster

    Publication Year: 2010, Page(s): 94 - 103
    Cited by: Papers (42)
    PDF (923 KB) | HTML

    MapReduce is a programming paradigm for parallel processing that is increasingly being used for data-intensive applications in cloud computing environments. An understanding of the characteristics of workloads running in MapReduce environments benefits both the service providers in the cloud and their users: the service provider can use this knowledge to make better scheduling decisions, while users can learn what aspects of their jobs impact performance. This paper analyzes 10 months of MapReduce logs from the M45 supercomputing cluster, which Yahoo! made freely available to select universities for academic research. We characterize resource utilization patterns, job patterns, and sources of failures. We use an instance-based learning technique that exploits temporal locality to predict job completion times from historical data and identify potential performance problems in our dataset.

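    An instance-based predictor of the kind described is straightforward to sketch: find the k most similar historical jobs, favoring recent ones for temporal locality, and average their runtimes. The features, distance, and field names are assumptions, not the paper's exact learner.

        def predict_runtime(job, history, k=3, window=100):
            # Distance over simple job features; real traces would use more.
            def distance(a, b):
                return (abs(a["maps"] - b["maps"])
                        + abs(a["input_mb"] - b["input_mb"]))
            # Temporal locality: consider only the `window` most recent jobs.
            recent = sorted(history, key=lambda h: h["submitted"])[-window:]
            nearest = sorted(recent, key=lambda h: distance(job, h))[:k]
            return sum(h["runtime"] for h in nearest) / len(nearest)

    A large gap between predicted and observed runtime then flags a potential performance problem.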
  • An Effective Architecture for Automated Appliance Management System Applying Ontology-Based Cloud Discovery

    Publication Year: 2010, Page(s): 104 - 112
    Cited by: Papers (19)
    PDF (959 KB) | HTML

    Cloud computing is a computing paradigm that allows on-demand access to computing elements and storage over the Internet. Virtual appliances (pre-configured, ready-to-run applications) are emerging as a breakthrough technology to solve the complexities of service deployment on Cloud infrastructure. However, an automated approach to deploying the required appliances on the most suitable Cloud infrastructure has been neglected by previous works, and it is the focus of this one. In this paper, we propose an effective architecture using ontology-based discovery to provide QoS-aware deployment of appliances on Cloud service providers. In addition, we test our approach on a case study, and the results show the efficiency and effectiveness of the proposed work.

  • Region-Based Prefetch Techniques for Software Distributed Shared Memory Systems

    Publication Year: 2010, Page(s): 113 - 122
    Cited by: Papers (1)
    PDF (1243 KB) | HTML

    Although shared memory programming models offer better programmability than message passing programming models, their implementation by page-based software distributed shared memory systems usually suffers from high memory consistency costs. The major part of these costs is inter-node data transfer for keeping virtual shared memory consistent. A good prefetch strategy can reduce this cost. We develop two prefetch techniques, TReP and HReP, which are based on the execution history of each parallel region. These techniques are evaluated using offline simulations with the NAS Parallel Benchmarks and the LINPACK benchmark. On average, TReP achieves an efficiency (ratio of pages prefetched that were subsequently accessed) of 96% and a coverage (ratio of access faults avoided by prefetches) of 65%. HReP achieves an efficiency of 91% but has a coverage of 79%. Treating the cost of an incorrectly prefetched page as equivalent to that of a miss, these techniques have an effective page miss rate of 63% and 71% respectively. Additionally, these two techniques are compared with two well-known software distributed shared memory (sDSM) prefetch techniques, Adaptive++ and TODFCM. TReP effectively reduces the page miss rate by 53% and 34% more, and HReP by 62% and 43% more, than Adaptive++ and TODFCM respectively. As with Adaptive++, these techniques also permit bulk prefetching for pages predicted using temporal locality, amortizing network communication costs and permitting bandwidth improvement from multi-rail network interfaces.

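    The efficiency and coverage metrics quoted above follow directly from their parenthetical definitions. The set-based accounting below is a minimal sketch of computing them from a simulation trace; representing pages as ID sets is an assumed convention.

        def prefetch_metrics(prefetched_pages, accessed_pages, baseline_faults):
            # Efficiency: fraction of prefetched pages later accessed.
            # Coverage: fraction of the baseline access faults avoided
            # (each useful prefetch avoids one fault).
            useful = prefetched_pages & accessed_pages
            efficiency = len(useful) / len(prefetched_pages)
            coverage = len(useful) / baseline_faults
            return efficiency, coverage

        # TReP's reported averages correspond to efficiency ~0.96, coverage ~0.65.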