2008 37th International Conference on Parallel Processing (ICPP '08)

9-12 September 2008

  • [Front cover]

    Publication Year: 2008, Page(s): C1
    PDF (133 KB) | Freely Available from IEEE
  • [Title page i]

    Publication Year: 2008, Page(s): i
    PDF (19 KB) | Freely Available from IEEE
  • [Title page iii]

    Publication Year: 2008, Page(s): iii
    PDF (49 KB) | Freely Available from IEEE
  • [Copyright notice]

    Publication Year: 2008, Page(s): iv
    PDF (46 KB) | Freely Available from IEEE
  • Table of contents

    Publication Year: 2008, Page(s): v - xi
    PDF (132 KB) | Freely Available from IEEE
  • Message from the General Chair

    Publication Year: 2008, Page(s): xii
    PDF (121 KB) | HTML | Freely Available from IEEE
  • Message from the Program Chair

    Publication Year: 2008, Page(s): xiii
    PDF (122 KB) | HTML | Freely Available from IEEE
  • Organizing Committee

    Publication Year: 2008, Page(s): xiv
    PDF (118 KB) | Freely Available from IEEE
  • Program Vice-Chairs and Program Committee Members

    Publication Year: 2008, Page(s): xv - xix
    PDF (133 KB) | Freely Available from IEEE
  • List of reviewers

    Publication Year: 2008, Page(s): xx - xxi
    PDF (99 KB) | Freely Available from IEEE
  • Towards Minimum Traffic Cost and Minimum Response Latency: A Novel Dynamic Query Protocol in Unstructured P2P Networks

    Publication Year: 2008, Page(s): 1 - 8
    Cited by: Papers (1)
    PDF (322 KB) | HTML

    Controlled-flooding algorithms are widely used in unstructured networks. Expanding ring (ER) achieves low response delay, but its traffic cost is huge; dynamic querying (DQ) is known for its desirable traffic control, but it achieves its lower search cost at the price of undesirable latency; enhanced dynamic querying (DQ+) can also reduce search latency, but it is hard to determine a generally optimal parameter set for it. In this paper, a novel algorithm named selective dynamic query (SDQ) is proposed. Unlike previous works, which awkwardly process floating-point TTL values, SDQ selects an integer TTL value and a set of neighbors to narrow the scope of the next query. Our experiments demonstrate that SDQ provides finer-grained control than other algorithms: its latency is close to the well-known minimum achieved by ER, while its traffic cost is also close to the minimum. To the best of our knowledge, this is the first work capable of achieving the best performance in terms of both response latency and traffic cost. In addition, our experiments demonstrate that SDQ works well in various network topologies.

  • Flash Data Dissemination in Unstructured Peer-to-Peer Networks

    Publication Year: 2008, Page(s): 9 - 16
    PDF (213 KB) | HTML

    The problem of flash data dissemination refers to spreading dynamically-created medium-sized data to all members of a large group of users. In this paper, we explore a solution to the problem of flash data dissemination in unstructured P2P networks and propose a gossip-based protocol, termed catalogue-gossip. Our protocol alleviates the shortcomings of prior gossip-based dissemination approaches through the introduction of an efficient catalogue exchange scheme that helps reduce unnecessary interactions among nodes in the unstructured network. We provide deterministic guarantees for the termination of the protocol and suggest optimizations concerning the order with which pieces of flash data are assembled at receiving peers. Experimental results show that catalogue-gossip is significantly more efficient than existing solutions when it comes to delivery of flash data.

  • Fast Source Switching for Gossip-Based Peer-to-Peer Streaming

    Publication Year: 2008, Page(s): 17 - 24
    PDF (245 KB) | HTML

    In this paper we consider gossip-based peer-to-peer streaming applications where multiple sources exist and they work serially. More specifically, we tackle the problem of fast source switching to minimize the startup delay of the new source. We model the source switch process and formulate it into an optimization problem. Then we propose a practical greedy algorithm that can approximate the optimal solution by properly interleaving the data delivery of the old source and the new source. We perform simulations on various real-trace overlay topologies to demonstrate the effectiveness of our algorithm. The simulation results show that our proposed algorithm outperforms the normal source switch algorithm by reducing the source switch time by 20%-30% without bringing extra communication overhead, and the reduction ratio tends to increase when the network scale expands.

  • TFlux: A Portable Platform for Data-Driven Multithreading on Commodity Multicore Systems

    Publication Year: 2008, Page(s): 25 - 34
    Cited by: Papers (11)
    PDF (425 KB) | HTML

    In this paper we present thread flux (TFlux), a complete system that supports the data-driven multithreading (DDM) model of execution. TFlux abstracts away the details of the underlying system, therefore offering the same programming model independently of the architecture. To achieve this goal, TFlux has runtime support that is built on top of a commodity operating system. Scheduling of threads is performed by the thread synchronization unit (TSU), which can be implemented as either a hardware or a software module. In addition, TFlux includes a preprocessor that, along with a set of simple compiler directives, allows the user to easily develop DDM programs. The preprocessor then automatically produces the TFlux code, which can be compiled with any commodity C compiler, therefore targeting any ISA. TFlux has been validated on three platforms: a Simics-based multicore system with a TSU hardware module (TFluxHard), a commodity 8-core Intel Core2 QuadCore-based system with a software TSU module (TFluxSoft), and a Cell/BE system with a software TSU module (TFluxCell). The experimental results show that the performance achieved is close to linear speedup: on average 21x for the 27-node TFluxHard, and 4.4x on the 6-node TFluxSoft and TFluxCell. Most importantly, the observed speedup is stable across the different platforms, thus allowing the benefits of DDM to be exploited on different commodity systems.

  • Enabling Streaming Remoting on Embedded Dual-Core Processors

    Publication Year: 2008, Page(s): 35 - 42
    Cited by: Papers (5) | Patents (1)
    PDF (689 KB) | HTML

    Dual-core processors (and, to an extent, multicore processors) have been adopted in recent years to provide platforms that satisfy the performance requirements of popular multimedia applications. This architecture comprises groups of processing units connected by various interprocess communication mechanisms such as shared memory, memory mapping, interrupts, mailboxes, and channel-based protocols. The associated challenges include how to provide programming models and environments for developing streaming applications for such platforms. In this paper, we present middleware called streaming RPC that supports a streaming-function remoting mechanism on asymmetric dual-core architectures. This middleware has been implemented both on an experimental platform known as the PAC dual-core platform and in TI OMAP dual-core environments. We also present an analytic model of streaming equations to optimize the internal handshaking of our proposed streaming RPC. The usage and efficiency of the proposed methodology are demonstrated in a JPEG decoder, an MP3 decoder, and a QCIF H.264 decoder. The experimental results show that our approach improves the performance of the JPEG, MP3, and H.264 decoders by 24%, 38%, and 32% on PAC, respectively. The communication load of internal handshaking is also reduced compared to the naive use of RPC on embedded dual-core systems. The experiments further show that the performance improvement can also be achieved on OMAP dual-core platforms.

  • Scalability Evaluation and Optimization of Multi-Core SIP Proxy Server

    Publication Year: 2008, Page(s): 43 - 50
    Cited by: Papers (2)
    PDF (346 KB) | HTML

    The session initiation protocol (SIP) is a popular signaling protocol used in many collaborative applications such as VoIP, instant messaging, and presence. In this paper, we evaluate one well-known SIP proxy server (OpenSER) on two multi-core platforms, SUN Niagara and Intel Clovertown, running Solaris and Linux, respectively. Through the evaluation, we identify three factors that determine the performance scalability of the OpenSER server. One lies inside the OSes: overhead from the coarse-grained locks used in the UDP socket layer. The others are specific to the multi-process programming model: (1) overhead caused by passing socket descriptors among processes; (2) overhead from sharing transaction objects among processes. To remedy these problems, we propose several incremental optimizations, including an out-of-box dispatcher, a light-weight connection dispatcher, and dataset partitioning, and achieve significant improvements: for UDP and TCP transport, on SUN Niagara, speedups (ideal is 8) improve from 1.5 to 5.8 and from 2.2 to 6.2, respectively; on Intel Clovertown, speedups improve from 1.2 to 3.1 and from 2.6 to 4.8, respectively.

  • DiSTM: A Software Transactional Memory Framework for Clusters

    Publication Year: 2008, Page(s): 51 - 58
    Cited by: Papers (16)
    PDF (239 KB) | HTML

    While transactional memory (TM) research on shared-memory chip multiprocessors has been flourishing over the last years, limited research has been conducted in the cluster domain. In this paper, we introduce a research platform for exploiting software TM on clusters. The distributed software transactional memory (DiSTM) system has been designed for easy prototyping of TM coherence protocols, and it does not rely on a software or hardware implementation of distributed shared memory. Three TM coherence protocols have been implemented and evaluated with established TM benchmarks. The decentralized transactional coherence and consistency protocol has been compared against two centralized protocols that utilize leases. Results indicate that, depending on network congestion and the amount of contention, different protocols perform best.

  • Implementing and Exploiting Inevitability in Software Transactional Memory

    Publication Year: 2008, Page(s): 59 - 66
    Cited by: Papers (1) | Patents (11)
    PDF (301 KB) | HTML

    Transactional Memory (TM) takes responsibility for concurrent, atomic execution of labeled regions of code, freeing the programmer from the need to manage locks. Typical implementations rely on speculation and rollback, but this creates problems for irreversible operations like interactive I/O. A widely assumed solution allows a transaction to operate in an inevitable mode that excludes all other transactions and is guaranteed to complete, but this approach does not scale. This paper explores a richer set of alternatives for software TM, and demonstrates that it is possible for an inevitable transaction to run in parallel with (non-conflicting) non-inevitable transactions, without introducing significant overhead in the non-inevitable case. We report experience with these alternatives in a graphical game application. We also consider the use of inevitability to accelerate certain common-case transactions.

  • Scalable Techniques for Transparent Privatization in Software Transactional Memory

    Publication Year: 2008, Page(s): 67 - 74
    Cited by: Papers (3) | Patents (3)
    PDF (197 KB) | HTML

    We address the recently recognized privatization problem in software transactional memory (STM) runtimes, and introduce the notion of partially visible reads (PVRs) to heuristically reduce the overhead of transparent privatization. Specifically, PVRs avoid the need for a "privatization fence" in the absence of conflict with concurrent readers. We present several techniques to trade off the cost of enforcing partial visibility with the precision of conflict detection. We also consider certain special-case variants of our approach, e.g., for predominantly read-only workloads. We compare our implementations to prior techniques on a multicore Niagara1 system using a variety of artificial workloads. Our results suggest that while no one technique performs best in all cases, a dynamic hybrid of PVRs and strict in-order commits is stable and reasonably fast across a wide range of load parameters. At the same time, the remaining overheads are high enough to suggest the need for programming model or architectural support.

  • Parallel Inferencing for OWL Knowledge Bases

    Publication Year: 2008, Page(s): 75 - 82
    Cited by: Papers (2) | Patents (6)
    PDF (173 KB) | HTML

    We examine the problem of parallelizing the inferencing process for OWL knowledge bases. A key challenge in this problem is partitioning the computational workload of this process to minimize duplication of computation and the amount of data communicated among processors. We investigate two approaches to address this challenge. In the data partitioning approach, the data set is partitioned into smaller units, which are then processed independently. In the rule partitioning approach, the rule base is partitioned and the smaller rule bases are applied to the complete data set. We present various algorithms for the partitioning and analyze their advantages and disadvantages. A parallel inferencing algorithm is presented which uses the partitions that are created by the two approaches. We then present an implementation based on a popular open source OWL reasoner and on a networked cluster. Our experimental results show significant speedups for some popular benchmarks, thus making this a promising approach.

  • Optimizing JPEG2000 Still Image Encoding on the Cell Broadband Engine

    Publication Year: 2008, Page(s): 83 - 90
    PDF (524 KB) | HTML

    JPEG2000 is the latest still image coding standard from the JPEG committee, which adopts new algorithms such as embedded block coding with optimized truncation (EBCOT) and discrete wavelet transform (DWT). These algorithms enable superior coding performance over JPEG and support various new features at the cost of increased computational complexity. The Sony-Toshiba-IBM cell broadband engine (or the Cell/B.E.) is a heterogeneous multicore architecture with SIMD accelerators. In this work, we optimize the computationally intensive algorithmic kernels of JPEG2000 for the Cell/B.E. and also introduce a novel data decomposition scheme to achieve high performance with low programming complexity. We compare the Cell/B.E.'s performance to the performance of the Intel Pentium IV 3.2 GHz processor. The Cell/B.E. demonstrates 3.2 times higher performance for lossless encoding and 2.7 times higher performance for lossy encoding. For the DWT, the Cell/B.E. outperforms the Pentium IV processor by 9.1 times for the lossless case and 15 times for the lossy case. We also provide experimental results on one IBM QS20 blade with two Cell/B.E. chips and a performance comparison with the existing JPEG2000 encoder for the Cell/B.E.

  • Bandwidth-Efficient Continuous Query Processing over DHTs

    Publication Year: 2008, Page(s): 91 - 98
    PDF (292 KB) | HTML

    In this paper, we propose novel techniques to reduce bandwidth cost in a continuous keyword query processing system that is based on a distributed hash table. We argue that query indexing and document announcement are of significant importance towards this goal. Our detailed simulations show that our proposed techniques, combined together, effectively and greatly reduce bandwidth cost.

  • Improving Priority Enforcement via Non-Work-Conserving Scheduling

    Publication Year: 2008, Page(s): 99 - 106
    PDF (245 KB) | HTML

    Current operating system schedulers are not fully aware of multi-core and multi-threaded architectures, and as a result schedule threads in a way that may cause contention for critical resources such as the last level of the cache memory hierarchy or the memory access bandwidth. This contention has a significant impact on system productivity and on the quality of service that each individual thread gets from the platform, which can vary widely depending on the behavior of its simultaneous co-runners. In this paper we describe the design and implementation of a non-work-conserving thread-scheduling framework that tries to improve priority enforcement, based on online statistics collected through hardware performance counters. We have implemented our scheme in Linux running on both multicore and SMT processors. For synthetic workloads based on the latest SPEC CPU2006 benchmarks, our framework speeds up high-priority threads by up to 50%, while keeping or even slightly improving the overall system throughput.

  • An Incentive-Compatible Mechanism for Scheduling Non-Malleable Parallel Jobs with Individual Deadlines

    Publication Year: 2008, Page(s): 107 - 114
    Cited by: Papers (2)
    PDF (192 KB) | HTML

    We design an incentive-compatible mechanism for scheduling n non-malleable parallel jobs on a parallel system comprising m identical processors. Each job is owned by a selfish user who is rational: she performs actions that maximize her welfare even though doing so may cause system-wide suboptimal performance. Each job is characterized by four parameters: value, deadline, number of processors, and execution time. The user's welfare increases by the amount indicated by the value if her job is completed by the deadline. The user declares the parameters to the mechanism, which uses them to compute the schedule and the payments. The user can misreport the parameters, but since the mechanism is incentive-compatible, she chooses to declare them truthfully. We prove the properties of the mechanism and evaluate it by simulation.

  • Thermal Management for 3D Processors via Task Scheduling

    Publication Year: 2008, Page(s): 115 - 122
    Cited by: Papers (14) | Patents (2)
    PDF (546 KB) | HTML

    A rising horizon in chip fabrication is 3D integration technology. It stacks two or more dies vertically with a dense, high-speed interface to increase device density and reduce the delay of interconnects across the dies. However, a major challenge in 3D technology is the increased power density, which raises the concern of heat dissipation within the processor. High temperatures trigger voltage and frequency throttling in hardware, which degrades chip performance. Moreover, high temperatures impair the processor's reliability and reduce its lifetime. To alleviate this problem, we propose in this paper an OS-level scheduling algorithm that performs thermal-aware task scheduling on a 3D chip. Our algorithm leverages the inherent thermal variations within and across different tasks, and schedules them to keep the chip temperature low. We observed that vertically adjacent dies have strong thermal correlations, and the scheduler should consider them jointly. Our proposed algorithm can remove on average 54% of hardware DTMs and results in a 7.2% performance improvement over the base case.
