2011 IEEE 17th International Conference on Parallel and Distributed Systems (ICPADS)

7-9 December 2011

Results 1-25 of 164
  • [Front cover]

    Page(s): C1
  • [Title page i]

    Page(s): i
  • [Title page iii]

    Page(s): iii
  • [Copyright notice]

    Page(s): iv
  • Table of contents

    Page(s): v - xviii
  • Message from General and Program Co-chairs

    Page(s): xix
  • Organizing Committee

    Page(s): xx - xxi
  • Program Committee

    Page(s): xxii - xxvi
  • Reviewers

    Page(s): xxvii - xxviii
  • Extending Lifetime and Reducing Garbage Collection Overhead of Solid State Disks with Virtual Machine Aware Journaling

    Page(s): 1 - 8

    Virtualization is becoming widely deployed in commercial servers. In our previous study, we proposed Virtual Machine Aware journaling (VMA journaling), a file system journaling approach for virtual server environments. With a reliable VMM and hardware subsystems, VMA journaling eliminates journal writes while ensuring file system consistency and data integrity, making it an effective alternative to traditional journaling approaches in these environments. In recent years, solid-state disks (SSDs) have shown their potential as a replacement for traditional magnetic disks in commercial servers. In this paper, we demonstrate the benefits of VMA journaling on SSDs. We compare the performance of VMA journaling and three traditional journaling approaches (i.e., the three journaling modes of the ext3 journaling file system) in terms of lifetime and garbage collection (GC) overhead, two key performance metrics for SSDs. Since a Flash Translation Layer (FTL) is used in an SSD to emulate a traditional disk interface, and SSD performance depends heavily on the FTL being used, three state-of-the-art FTLs (i.e., FAST, Super Block FTL, and DFTL) are implemented for the performance evaluation. The results show that traditional full data journaling can reduce the lifetime of an SSD significantly. VMA journaling extends SSD lifetime relative to full data journaling by up to 86.5%, and GC overhead is reduced by up to 80.6% compared with the full data journaling mode of ext3. Finally, VMA journaling is effective under all three FTLs. These results demonstrate that VMA journaling is effective in SSD-based virtual server environments.

  • Scheduling Mixed Real-Time and Non-real-Time Applications in MapReduce Environment

    Page(s): 9 - 16

    MapReduce scheduling is becoming a hot topic as MapReduce attracts increasing attention from both industry and academia. In this paper, we focus on scheduling mixed real-time and non-real-time applications in a MapReduce environment, a challenging problem that has received only limited attention. To solve it, we present a two-level MapReduce scheduler built on previous techniques and make two key contributions. First, to meet the performance goals of real-time applications, we propose a deadline scheduler that adopts (1) a sampling-based approach, Tasks Forward Scheduling (TFS), to predict map/reduce task execution time (unlike prior work, which requires users to input an estimated value), and (2) a resource allocation model, Approximately Uniform Minimum Degree of parallelism (AUMD), to dynamically control each real-time job so that it executes with the minimum task assignment at any time, maximizing the number of concurrent real-time jobs. Second, by integrating this deadline scheduler into an existing MapReduce scheduler, we develop a two-level scheduler with resource preemption support that can schedule mixed real-time and non-real-time jobs according to their respective performance demands. We implement our scheduler in Hadoop, and experiments on a real, small-scale cluster demonstrate that it schedules mixed real-time and non-real-time jobs to meet their different quality-of-service (QoS) demands.

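    A back-of-the-envelope sketch of the minimum-parallelism idea behind AUMD (the formula and names below are illustrative assumptions, not the paper's exact model): given a per-task execution time estimated by sampling, a job needs only as many concurrent slots as it takes to drain its remaining tasks before the deadline.

        import math

        def min_slots(remaining_tasks, est_task_time, time_to_deadline):
            """Fewest concurrent task slots that still finish all remaining
            tasks before the deadline (illustrative AUMD-style bound)."""
            if time_to_deadline <= 0:
                raise ValueError("deadline already passed")
            # Each slot can run this many tasks back to back before the deadline.
            tasks_per_slot = max(1, math.floor(time_to_deadline / est_task_time))
            return math.ceil(remaining_tasks / tasks_per_slot)

        # Example: 120 map tasks left, sampled task time 30 s, 600 s to deadline.
        print(min_slots(120, 30.0, 600.0))  # -> 6 slots

    Keeping each job at this minimum frees the remaining slots for other concurrent real-time jobs, which is the effect the abstract describes.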
  • An In-Memory Framework for Extended MapReduce

    Page(s): 17 - 24

    The MapReduce programming model simplifies the design and implementation of certain parallel algorithms. Recently, several research groups have extended MapReduce's application domain to iterative and online data processing. Despite having different data access characteristics, these extensions rely on the same storage facility as the original model and propagate data updates using additional techniques. To benefit from large main memories, fast data access, and stronger data consistency, we propose employing in-memory storage for extended MapReduce. In this paper, we describe the design and implementation of EMR, an in-memory framework for extended MapReduce. To illustrate the usage and performance of our framework, we present measurements of typical MapReduce applications.

  • Physical Machine State Migration

    Page(s): 25 - 32

    A powerful capability enabled by modern virtualization technologies is the ability to move a virtual machine (VM) from one physical machine to another, which provides unprecedented flexibility for system fault tolerance and load balancing. However, no similar capability exists for physical machines. This paper describes the first known successful implementation of migrating a physical machine's state from one physical Linux machine to another. This physical machine state migration (PMSM) capability greatly decreases the disruption caused by scheduled shutdowns of non-virtualized physical machines, and it is more challenging than VM migration because it cannot rely on a separate piece of software, such as the hypervisor in the case of VM migration, to perform the state transfer. The PMSM prototype described in this paper is adapted from Linux's hibernation facility and can migrate a physical machine running the MySQL DBMS server in under 7 seconds.

  • Hypervisor Support for Efficient Memory De-duplication

    Page(s): 33 - 39

    Memory de-duplication removes memory state redundancy among virtual machines running on the same physical machine by identifying common memory pages shared by these virtual machines and storing only one copy of each common page. A standard approach to identifying common memory pages is to hash the content of each memory page and compare the resulting hash values. In a virtualized server, only the hypervisor is in a position to compute the hash value of every physical memory page on the server, but the memory de-duplication engine is best implemented outside the hypervisor for flexibility and simplicity. A key design issue in the memory de-duplication engine is therefore to minimize the performance impact of the hashing computations on the running VMs: memory page hashing should be performed with low overhead and when the CPU is idle. This paper describes why existing hypervisors do not provide adequate support for our memory de-duplication engine, how a new primitive called the deferrable aggregate hypercall (DAH) fills the need, and what the resulting performance improvement is.

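    A minimal sketch of the identify-and-share step of hash-based page de-duplication, assuming 4 KiB pages and SHA-256 as the content hash (the paper's contribution, the DAH primitive, concerns when and how the hypervisor computes these hashes, which is not modeled here):

        import hashlib

        PAGE_SIZE = 4096

        def deduplicate(pages):
            """Back logical pages by one shared copy per distinct content.

            Returns (unique_pages, mapping) where mapping[i] is the index of
            the shared copy backing logical page i. A real engine would also
            compare bytes before sharing to rule out hash collisions, and
            mark shared pages copy-on-write.
            """
            shared = {}          # content hash -> index into unique_pages
            unique_pages, mapping = [], []
            for page in pages:
                digest = hashlib.sha256(page).digest()
                if digest not in shared:
                    shared[digest] = len(unique_pages)
                    unique_pages.append(page)
                mapping.append(shared[digest])
            return unique_pages, mapping

        pages = [b"\x00" * PAGE_SIZE, b"\x01" * PAGE_SIZE, b"\x00" * PAGE_SIZE]
        unique, mapping = deduplicate(pages)
        print(len(unique), mapping)  # -> 2 [0, 1, 0]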
  • A Hierarchical Memory Service Mechanism in Server Consolidation Environment

    Page(s): 40 - 47

    Growing Internet business and computing footprints motivate server consolidation in data centers. Through virtualization technology, server consolidation can reduce the number of physical hosts and provide scalable services. However, ineffective memory usage among multiple virtual machines (VMs) becomes the bottleneck in server consolidation environments. Because of inaccurate memory usage estimates and the lack of memory resource management, services in data centers suffer significant performance degradation even though they occupy large amounts of memory. To improve this situation, we first introduce a VM memory division view and a VM free-memory division view. Based on these, we propose a hierarchical memory service mechanism, and we have designed and implemented the corresponding memory scheduling algorithm to enhance memory efficiency and meet service-level agreements. Benchmark results show that our implementation can save 30% of physical memory with 1% to 5% performance degradation. Built on the Xen virtualization platform and balloon driver technology, our work brings substantial benefits to a commercial cloud computing center that provides services on more than 2,000 VMs.

  • A Method for Improving Concurrent Write Performance by Dynamic Mapping Virtual Storage System Combined with Cache Management

    Page(s): 48 - 55

    This paper presents a new dynamic mapping virtual storage system characterized by: (a) allocating storage resources on demand, which improves storage resource utilization, and (b) more fundamentally, sequentializing concurrent write requests through a write-anywhere pattern backed by a dynamic address mapping mechanism, which improves concurrent write performance for applications. To compensate for imperfect handling of small write requests and the negative impact on read performance, we combine the system with a CBD system responsible for optimizing cache management. We have implemented our virtual storage system in Linux kernel 2.6.18 as a pseudo device driver. The experimental results show that concurrent write bandwidth approaches raw disk write performance. In the 64-stream concurrent write case, E-ASD outperforms the concurrent write bandwidth of LVM by up to 160%, and its maximum single-stream read loss is less than 25% compared to LVM, with concurrent read performance close to LVM's. Read performance still has room for improvement given a larger cache and proper prefetching.

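    A toy sketch of the write-anywhere idea: writes from any stream are appended at the next sequential physical block, and a dynamic mapping table records where each logical block currently lives (structure and names are illustrative; garbage collection of superseded blocks is omitted):

        class WriteAnywhereDevice:
            """Sequentialize arbitrary logical writes via dynamic address mapping."""

            def __init__(self, num_blocks):
                self.physical = [None] * num_blocks
                self.mapping = {}       # logical block -> physical block
                self.next_free = 0      # append point: all writes land sequentially

            def write(self, logical_block, data):
                self.physical[self.next_free] = data
                self.mapping[logical_block] = self.next_free  # old copy is now garbage
                self.next_free += 1

            def read(self, logical_block):
                return self.physical[self.mapping[logical_block]]

        dev = WriteAnywhereDevice(8)
        for lb in (5, 1, 5):                # interleaved writes from "concurrent" streams
            dev.write(lb, f"data-{lb}")
        print(dev.read(5), dev.mapping)     # latest copy of block 5; {5: 2, 1: 1}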
  • Source Code Partitioning in Program Optimization

    Page(s): 56 - 63

    Program analysis and program optimization seek to improve program performance. Optimization techniques are applied at various scopes, such as a source file, a function, or a basic block. Inter-procedural program optimization techniques operate at the scope of a source file and analyze the interaction and relationships between different program functions: they analyze an entire translation unit (typically a source file) and optimize it globally instead of just optimizing inside a function. Analyzing and optimizing an entire translation unit increases compilation time drastically, because many factors must be considered during analysis and optimization, and the translation unit can be quite large, containing many functions. Another issue is that functions in different translation units can be more closely related to each other than to the functions within their own translation unit. The main goal of this research is to group, or partition, closely related program functions into the same translation unit. Our method profiles an application, determines relationship information between program functions, and groups closely related functions together. The source code partitioner improves the processing time of inter-procedural optimization techniques by applying them to a subset of program functions. Partitioning program functions by analyzing profiling output shows a dramatic decrease in program compilation time, and our results show improved compilation time on all tested real-world benchmarks.

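    A sketch of profile-driven grouping, assuming the profiler yields call counts between function pairs; a greedy pass merges the most strongly coupled functions into the same translation unit, subject to a size cap (all names and the size cap are illustrative):

        def partition(call_counts, max_unit_size):
            """Greedily group closely related functions into translation units.

            call_counts: dict mapping (caller, callee) -> observed call count.
            """
            unit_of, units, next_id = {}, {}, 0
            # Visit the hottest call edges first so tight pairs end up together.
            for (f, g), _ in sorted(call_counts.items(), key=lambda kv: -kv[1]):
                for fn in (f, g):
                    if fn not in unit_of:
                        unit_of[fn] = next_id
                        units[next_id] = {fn}
                        next_id += 1
                a, b = unit_of[f], unit_of[g]
                if a != b and len(units[a]) + len(units[b]) <= max_unit_size:
                    for fn in units[b]:
                        unit_of[fn] = a
                    units[a] |= units.pop(b)
            return list(units.values())

        profile = {("parse", "lex"): 900, ("main", "parse"): 10, ("main", "report"): 8}
        print(partition(profile, max_unit_size=2))  # parse+lex end up together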
  • Fair and Efficient Online Adaptive Scheduling for Multiple Sets of Parallel Applications

    Page(s): 64 - 71

    Both fairness and efficiency are crucial measures of the performance of parallel applications on multiprocessor systems. In this paper, we study online adaptive scheduling for multiple sets of such applications, where each set may contain one or more jobs with time-varying parallelism profiles. This scenario arises naturally when several applications are submitted simultaneously by different users in a large parallel system, where both user-level fairness and system-wide efficiency are important concerns. To achieve fairness, we use the equipartitioning algorithm, which evenly splits the available processors among the active job sets at any time. For efficiency, we apply a feedback-driven adaptive scheduler that periodically adjusts the processor allocations within each set by exploiting the jobs' execution history. We show that our algorithm is competitive for the objective of minimizing the set response time. For sufficiently large jobs, this theoretical result improves upon an existing algorithm that provides only fairness but lacks efficiency. Furthermore, we conduct simulations to empirically evaluate our algorithm, and the results confirm its improved performance on malleable workloads covering a wide range of parallelism variation structures.

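    A minimal sketch of the fairness half of the scheduler: equipartitioning splits the available processors evenly across active job sets, spreading any remainder (the paper's second ingredient, the feedback-driven adjustment of allocations within each set, is not modeled):

        def equipartition(num_processors, job_sets):
            """Evenly split processors among active job sets."""
            share, extra = divmod(num_processors, len(job_sets))
            return {s: share + (1 if i < extra else 0)
                    for i, s in enumerate(job_sets)}

        print(equipartition(10, ["alice", "bob", "carol"]))
        # -> {'alice': 4, 'bob': 3, 'carol': 3}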
  • Combining Multiple Metrics to Control BSP Process Rescheduling in Response to Resource and Application Dynamics

    Page(s): 72 - 79

    This article discusses MigBSP, a rescheduling model that acts on Bulk Synchronous Parallel applications running over computational Grids. It combines the metrics Computation, Communication, and Memory to make migration decisions, and it offers efficient adaptations to reduce its overhead. Additionally, MigBSP is infrastructure- and application-independent and tries to handle dynamicity at both levels. MigBSP's results show application performance improvements of up to 16% in dynamic environments while maintaining a small overhead when migrations do not take place.

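    An illustrative decision function in the spirit of MigBSP's combined metrics (the paper's actual formula and weights may differ): a process with strong computation and communication affinity to a candidate site, and a cheap memory footprint to move, is the best migration candidate.

        def migration_gain(comp, comm, mem_cost):
            """Favor compute/communication affinity; penalize transfer cost."""
            return comp + comm - mem_cost

        # Normalized metric values per candidate process (illustrative numbers).
        candidates = {"p0": (0.9, 0.7, 0.3), "p1": (0.4, 0.2, 0.8)}
        best = max(candidates, key=lambda p: migration_gain(*candidates[p]))
        print(best)  # -> 'p0'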
  • Roystonea: A Cloud Computing System with Pluggable Component Architecture

    Page(s): 80 - 87

    A cloud computing system provides infrastructure-layer services to users by managing virtualized infrastructure resources: CPU, hypervisor, storage, and networking. Each category of infrastructure resource is a subsystem in a cloud computing system, and the cloud computing system coordinates the infrastructure subsystems to provide services to users. Most current cloud computing systems lack pluggability in their infrastructure subsystems and decision algorithms, which restricts the development of both. A cloud computing system should have the flexibility to switch from one infrastructure subsystem to another, and from one decision algorithm to another, with ease. This paper describes Roystonea, a hierarchical distributed cloud computing system with a pluggable component architecture. Component pluggability gives administrators the flexibility to use the most appropriate subsystem as they wish; it is based on specifically designed interfaces between the Roystonea controlling system and the infrastructure subsystem components, and it encourages the development of infrastructure subsystems in cloud computing. Roystonea also provides a test bed for designing decision algorithms for cloud computing systems: the decision algorithms are totally isolated from the other components in the Roystonea architecture, so designers can focus on algorithm design without worrying about how their algorithms will interact with other Roystonea components. We believe that component pluggability will be one of the most important issues in cloud computing systems research.

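    A sketch of what component pluggability can look like (the interface and names are hypothetical, not Roystonea's actual API): subsystems implement a narrow interface and register under a name, so the controller can switch, say, storage backends from configuration alone.

        from abc import ABC, abstractmethod

        class StorageSubsystem(ABC):
            """Narrow interface every pluggable storage backend implements."""
            @abstractmethod
            def allocate_volume(self, size_gb): ...

        REGISTRY = {}

        def register(name):
            def deco(cls):
                REGISTRY[name] = cls
                return cls
            return deco

        @register("local")
        class LocalStorage(StorageSubsystem):
            def allocate_volume(self, size_gb):
                return f"local volume, {size_gb} GB"

        @register("nfs")
        class NfsStorage(StorageSubsystem):
            def allocate_volume(self, size_gb):
                return f"NFS volume, {size_gb} GB"

        backend = REGISTRY["nfs"]()      # backend chosen from configuration
        print(backend.allocate_volume(20))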
  • A Static Task Scheduling Framework for Independent Tasks Accelerated Using a Shared Graphics Processing Unit

    Page(s): 88 - 95

    The High Performance Computing (HPC) field is witnessing the increasing use of Graphics Processing Units (GPUs) as application accelerators, due to their massively data-parallel computing architectures and exceptional floating-point capabilities. The performance advantage of GPU-based acceleration is primarily derived from GPU computational kernels that operate on large amounts of data, consuming all of the available GPU resources. For applications consisting of several independent computational tasks that do not occupy the entire GPU, using the GPU sequentially, one task at a time, leads to performance inefficiencies. It is therefore important for the programmer to cluster small tasks together to share the GPU; however, the best performance cannot be achieved through ad-hoc grouping and execution of these tasks. In this paper, we explore the problem of GPU task scheduling, allowing multiple tasks to efficiently share the GPU and execute on it in parallel. We analyze the factors affecting multi-tasking parallelism and performance, and then develop a multi-tasking execution model as a performance prediction approach. The model is validated by comparison with actual execution scenarios for GPU sharing. We then present a scheduling technique and algorithm based on the proposed model, followed by experimental verification of the approach on an NVIDIA Fermi GPU computing node. Our results demonstrate significant performance improvements using the proposed scheduling approach compared with sequential execution of the tasks under the conventional multi-tasking execution scenario.

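    A sketch of the clustering intuition using first-fit bin packing (the paper goes further and drives grouping with a calibrated performance-prediction model; expressing demands as GPU fractions is an illustrative simplification):

        def batch_tasks(tasks, gpu_capacity=1.0):
            """First-fit grouping of small tasks into batches that share the GPU.

            tasks: list of (name, resource_demand), demand as a GPU fraction.
            """
            batches = []                 # each batch: [used_capacity, [names]]
            for name, demand in sorted(tasks, key=lambda t: -t[1]):
                for batch in batches:
                    if batch[0] + demand <= gpu_capacity:
                        batch[0] += demand
                        batch[1].append(name)
                        break
                else:
                    batches.append([demand, [name]])
            return [names for _, names in batches]

        tasks = [("fft", 0.5), ("blur", 0.3), ("scan", 0.4), ("sum", 0.2)]
        print(batch_tasks(tasks))
        # -> [['fft', 'scan'], ['blur', 'sum']]: 2 concurrent batches, not 4 serial runs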
  • Optimizing Dynamic Programming on Graphics Processing Units via Adaptive Thread-Level Parallelism

    Page(s): 96 - 103

    Dynamic programming (DP) is an important computational method for solving a wide variety of discrete optimization problems such as scheduling, string editing, packaging, and inventory management. In general, DP is classified into four categories based on the characteristics of the optimization equation. Because applications in the same category of DP have similar program behavior, the research community has sought general solutions for parallelizing each category. However, most existing studies focus on running DP on CPU-based parallel systems rather than on accelerating DP algorithms on the graphics processing unit (GPU). This paper presents GPU acceleration of an important category of DP problems called nonserial polyadic dynamic programming (NPDP). In NPDP applications, the degree of parallelism varies significantly across different stages of the computation, making it difficult to fully utilize the compute power of the hundreds of processing cores in a GPU. To address this challenge, we propose a methodology that adaptively adjusts the thread-level parallelism when mapping an NPDP problem onto the GPU, thus providing sufficient and steady degrees of parallelism across the compute stages. We realize our approach in a real-world NPDP application: the optimal matrix parenthesization problem. Experimental results demonstrate that our method achieves a speedup of 13.40 over the previously published GPU algorithm.

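    For reference, the optimal matrix parenthesization recurrence the paper accelerates: all cells on one diagonal (one chain length) are independent, and the width of that diagonal shrinks stage by stage, which is exactly the varying parallelism that motivates adaptive thread mapping. A sequential sketch:

        def matrix_chain(dims):
            """Minimum scalar multiplications to parenthesize a matrix chain.

            dims: matrix i has shape dims[i] x dims[i+1]. Cells with the same
            chain length are independent, so on a GPU each stage could be one
            batch of threads whose count is adapted to the stage's width.
            """
            n = len(dims) - 1
            cost = [[0] * n for _ in range(n)]
            for length in range(2, n + 1):           # stage = diagonal
                for i in range(n - length + 1):      # independent within a stage
                    j = i + length - 1
                    cost[i][j] = min(cost[i][k] + cost[k + 1][j]
                                     + dims[i] * dims[k + 1] * dims[j + 1]
                                     for k in range(i, j))
            return cost[0][n - 1]

        print(matrix_chain([10, 30, 5, 60]))  # -> 4500, i.e., (A x B) x C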
  • SAW: Java Synchronization Selection from Lock or Software Transactional Memory

    Page(s): 104 - 111

    To rewrite a sequential program into a concurrent one, the programmer has to enforce atomic execution of sequences of accesses to shared memory to avoid unexpected inconsistency. There are two means of enforcing this atomicity: lock-based synchronization and software transactional memory (STM). However, it is difficult to predict which is more suitable for an application without trying both, because their performance depends heavily on the application. We have developed a system named SAW that decouples the synchronization mechanism from the application logic of a Java program and enables the programmer to statically select a suitable synchronization mechanism, either a lock or an STM. We introduce annotations to specify critical sections and shared objects. In accordance with the annotated source program and the programmer's choice of synchronization mechanism, SAW generates aspects representing the synchronization processing. By comparing the rewriting cost using SAW with that of using each synchronization mechanism directly, we show that SAW relieves the programmer's burden. Through several benchmarks, we demonstrate that SAW is an effective way of switching synchronization mechanisms according to the characteristics of each application.

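    A toy illustration of selecting the synchronization mechanism outside the application logic (SAW does this for Java via annotations and generated aspects; here a Python decorator picks either a global lock or a coarse optimistic retry loop standing in for an STM, which is a deliberate simplification):

        import copy
        import threading

        SYNC_MODE = "stm"            # selected statically, as with SAW's choice
        _lock = threading.RLock()
        _commit_lock = threading.Lock()
        _version = 0

        def atomic(func):
            """Run func(state) atomically with the configured mechanism."""
            def with_lock(state):
                with _lock:
                    return func(state)

            def with_stm(state):     # coarse optimistic retry, not a real STM
                global _version
                while True:
                    start = _version
                    snapshot = copy.deepcopy(state)   # work on a private copy
                    result = func(snapshot)
                    with _commit_lock:
                        if _version == start:         # no concurrent commit
                            state.clear()
                            state.update(snapshot)
                            _version += 1
                            return result
                    # someone committed first: retry from a fresh snapshot

            return with_lock if SYNC_MODE == "lock" else with_stm

        @atomic
        def deposit(account):
            account["balance"] = account.get("balance", 0) + 10

        acct = {}
        deposit(acct)
        print(acct)  # -> {'balance': 10}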
  • Mercury: A Negotiation-Based Resource Management System for Grids

    Page(s): 112 - 119

    In this paper, we present the design and implementation of a negotiation-based resource management system called Mercury. As the number of resources in Grids increases rapidly, selecting appropriate resources to execute tasks has become a crucial issue. The proposed system implements the Extended Contract Net Protocol (ECNP), which improves the standard bidding model by integrating a matchmaking technique. Our model addresses the issues of matchmaker overload and the lack of up-to-date resource state information in the original matchmaking model. To make the system user-friendly, we provide a web interface: using the provided job templates, users can describe their jobs more easily, and service providers can deploy their services in the system in a simple manner. Furthermore, to improve the interoperability of the system, we adopt open standards such as WS-Agreement and JSDL. Our experimental results demonstrate the scalability and efficiency of the system.

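    A minimal sketch of one contract-net-style round, the protocol family that ECNP extends: the manager announces a task, capable resources bid, and the task is awarded to the best bid (ECNP's matchmaking integration and fresher resource state are precisely what this naive version lacks; names and the load-based bid are illustrative):

        def contract_net(task, resources):
            """Announce a task, collect bids, award to the lowest-load bidder.

            resources: name -> (free_cpus, current_load).
            """
            bids = {name: load
                    for name, (free_cpus, load) in resources.items()
                    if free_cpus >= task["cpus"]}   # only capable nodes bid
            return min(bids, key=bids.get) if bids else None

        resources = {"nodeA": (8, 0.7), "nodeB": (4, 0.2), "nodeC": (2, 0.1)}
        print(contract_net({"cpus": 4}, resources))  # -> 'nodeB' wins the award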
  • Catwalk-ROMIO: A Cost-Effective MPI-IO

    Page(s): 120 - 126

    Highly parallel file access, which often consists of many fine-grained, non-contiguous I/O requests, can severely degrade I/O performance. To tackle this problem, we propose a novel technique that maximizes the bandwidth of MPI-IO by utilizing a ring communication topology. The technique is implemented and evaluated as an ADIO device of ROMIO, named Catwalk-ROMIO. The evaluation shows that Catwalk-ROMIO, utilizing only one disk, can exhibit performance comparable to the parallel file systems PVFS2 and Lustre utilizing several file servers and disks. The evaluation also shows that Catwalk-ROMIO's performance is almost independent of the file access pattern, in contrast to the parallel file systems, which perform well only with collective I/O. Catwalk-ROMIO requires only a TCP/IP network for the ring communication topology and one file server, both of which are common in HPC clusters, incurring no additional cost. Thus, Catwalk-ROMIO is a very cost-effective MPI-IO implementation.

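    A toy model of why the ring helps: requests gathered from all processes can be coalesced into a few contiguous runs before a single server issues them (pure simulation; the real system is an ADIO driver passing requests over TCP/IP, which is not modeled here):

        def coalesce(requests):
            """Merge (offset, length) requests gathered around the ring into
            contiguous runs, turning many fine-grained I/Os into large ones."""
            merged = []
            for off, length in sorted(requests):
                if merged and off <= merged[-1][0] + merged[-1][1]:
                    last_off, last_len = merged[-1]
                    merged[-1] = (last_off, max(last_len, off + length - last_off))
                else:
                    merged.append((off, length))
            return merged

        # Non-contiguous 4-byte requests from 3 ranks, as collected on the ring:
        ranks = [[(0, 4), (12, 4)], [(4, 4), (16, 4)], [(8, 4), (20, 4)]]
        gathered = [req for rank in ranks for req in rank]
        print(coalesce(gathered))  # -> [(0, 24)]: one contiguous 24-byte access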