IEEE Transactions on Parallel and Distributed Systems - new TOC
http://ieeexplore.ieee.org
TOC Alert for Publication #71 - 16 November 2017
Volume 28, Issue 12

Editor's Note (pp. 3328-3329)

A GPU-Architecture Optimized Hierarchical Decomposition Algorithm for Support Vector Machine Training (pp. 3330-3343)
   Abstract (excerpt): ... https://github.com/OrcusCZ/OHD-SVM.

An Energy-Efficient Storage Strategy for Cloud Datacenters Based on Variable K-Coverage of a Hypergraph (pp. 3344-3355)

Analysis, Classification and Comparison of Scheduling Techniques for Software Transactional Memories (pp. 3356-3373)

Combining Vertex-Centric Graph Processing with SPARQL for Large-Scale RDF Data Analytics (pp. 3374-3388)

CRED: Cloud Right-Sizing with Execution Deadlines and Data Locality (pp. 3389-3400)

Deadline-Constrained Cost Optimization Approaches for Workflow Scheduling in Clouds (pp. 3401-3412)

Efficient Self-Invalidation/Self-Downgrade for Critical Sections with Relaxed Semantics (pp. 3413-3425)
   Abstract (excerpt): ... provided that self-invalidation and self-downgrade are performed prudently. In this work we examine how self-invalidation and self-downgrade relate to atomicity and ordering. We show that self-invalidation and self-downgrade need not be applied as conservatively as in existing implementations. Our key observation is that critical sections which are not ordered in time are often intended to provide only atomicity, not thread synchronization. We thus propose a new type of self-invalidation, forward self-invalidation (FSI), which invalidates only data that will be accessed inside a critical section. Based on the same reasoning, we propose a new type of self-downgrade, forward self-downgrade (FSD), likewise restricted to writes in critical sections. Finally, we define the semantics of locks using FSI and FSD, which resemble the semantics of relaxed atomic operations in C++. Our evaluation for 64-core multiprocessors shows significant improvements from the proposed FSI and FSD (where applicable) in the Splash-3 and PARSEC benchmarks: over a directory-based protocol, 17.1 percent in execution time and 33.9 percent in energy consumption; over a state-of-the-art self-invalidation/self-downgrade protocol, 7.6 percent in execution time and 9.1 percent in energy consumption, while still retaining the design simplicity of the protocol.

Energy-Efficient Scheduling Algorithms for Real-Time Parallel Applications on Heterogeneous Distributed Embedded Systems (pp. 3426-3442)

Exploiting the Parallelism Between Conflicting Critical Sections with Partial Reversion (pp. 3443-3457)
   Abstract (excerpt): ... ($>$73.4 percent) of the parallelism between conflicting critical sections (CCSs) can be exploited simply by allowing the parallel execution of their first conflict-free code fragment at runtime. We therefore present BSOptimizer, a new microarchitecture that performs partial reversion, integrated with a series of hardware and software strategies for CCS parallelization. We extend the off-the-shelf cache coherence protocol to detect the conflict location of a CCS, present a predictive checkpoint mechanism that registers and predicts the conflict point in a lightweight and accurate fashion, and redefine the traditional mutual-exclusion semantics with a binary relationship. With these collaborative techniques, CCSs can be scheduled in parallel. Experimental results on a wide variety of real programs and the PARSEC benchmarks show that, compared to native execution and two state-of-the-art lock elision techniques (SLE and SLR), BSOptimizer dramatically improves program performance with slight energy consumption ($<$0.8 percent) and extra runtime overhead ($<$3.9 percent). Our evaluation on a micro-benchmark with software-based optimization also verifies that BSOptimizer can accurately exploit CCS parallelism as promised.

Eyes in the Dark: Distributed Scene Understanding for Disaster Management (pp. 3458-3471)

FairGV: Fair and Fast GPU Virtualization (pp. 3472-3485)
   Abstract (excerpt): ... ($\geq 0.97$ Min-Max Ratio) with little performance degradation ($\leq 1.02$ aggregated overhead) in a range of mixed HPC workloads that leverage GPUs.

Fence-Free Synchronization with Dynamically Serialized Synchronization Variables (pp. 3486-3500)
   Abstract (excerpt): ... (sync-vars) from normal ones. Sync-Order reduces hardware complexity such that the processor only needs to serialize the ordering among sync-vars. Its simplicity makes it easy to integrate into the directory controller, and it supports a distributed directory, a feature missing in prior designs. We show that Sync-Order eliminates traditional fences on all sides of synchronization constructs (instead of only one side, as in prior work) and requires little effort from a programmer or compiler to annotate sync-vars. Experimental results show that Sync-Order significantly reduces CPU stalls and boosts the performance of a set of synchronization constructs and concurrent data structures by 10 percent; meanwhile, the fence overhead of full applications from SPLASH-2 and PARSEC is reduced from 42 to 3 percent.

High Performance Exact Triangle Counting on GPUs (pp. 3501-3510)

Efficient Approximation Algorithms for the Bounded Flexible Scheduling Problem in Clouds (pp. 3511-3520)
   Abstract (excerpt): ... a $\frac{C-k}{C}$ approximation algorithm for BFS, where $k$ is the maximum parallelism degree and $C$ is the capacity of the system (i.e., the number of machines). Since $C \gg k$ in BFS, our result significantly improves the best known approximation ratios of $\frac{C-k}{2C-k}(1-\epsilon)$ for tight deadlines [17] and $\frac{C-k}{C}\cdot\frac{s-1}{s}$ for loose deadlines [18], where the slackness ratio $s \geq 1$ is the maximum ratio between a job's earliest actual finish time and its deadline. We first propose a feasibility condition that determines whether an instance of BFS is feasible, i.e., whether there exists a schedule under which all jobs finish before their deadlines; this condition is the key to the ratio improvement of our algorithm. To prove its correctness, we give a simple linear program (LP) for a weaker version of BFS and show that its polyhedron is integral, so that version of BFS is polynomial-time solvable. We then present a greedy algorithm and its equivalent primal-dual algorithm for the complementary problem of BFS. Both algorithms have an approximation ratio of $\frac{C-k}{C}$ and time complexity $O(n^{2}+nT)$, where $n$ is the number of jobs and $T$ is the number of time slots. As a by-product, we show that BFS admits a polynomial-time approximation scheme (PTAS) when $T$ ...

Managing Battery Aging for High Energy Availability in Green Datacenters (pp. 3521-3536)
   Abstract (excerpt): ... (BAAT-P), a novel power delivery architecture that incorporates aging-management algorithms, from the perspective of the computing system, to hide, reduce, mitigate, and plan for battery aging effects for high energy availability in datacenters. Our techniques exploit diverse battery aging mechanisms and dynamic aging-management algorithms to provide system-level availability guarantees for datacenters. We evaluate the BAAT-P design with a real prototype. Compared with a battery-powered datacenter without aging-management policies, the results show that BAAT-P can extend battery lifetime by 72 percent, reduce battery cost by 33 percent, and effectively improve energy availability for datacenter servers while maintaining performance for performance-critical workloads.

Nessie: A Decoupled, Client-Driven Key-Value Store Using RDMA (pp. 3537-3552)
   Abstract (excerpt): ... PUT-oriented workloads when data value sizes are 128 KB or larger, and reduces power consumption by 18 percent at 80 percent system utilization and 41 percent at 20 percent system utilization, compared with idle power consumption.

Online Scheduling and Interference Alleviation for Low-Latency, High-Throughput Processing of Data Streams (pp. 3553-3569)

Operation-Level Wait-Free Transactional Memory with Support for Irrevocable Operations (pp. 3570-3583)

Opportunistic Mobile Data Offloading with Deadline Constraints (pp. 3584-3599)

Optimistic Transactional Boosting (pp. 3600-3614)
   Abstract (excerpt): ... a read-only traversal phase, which scans the data structure without locking or monitoring, and a read-write commit phase, which atomically validates the output of the traversal phase and applies the needed modifications to the data structure. In this paper we introduce Optimistic Transactional Boosting (OTB), an optimistic methodology for extending those designs to support the composition of multiple operations into one atomic execution by building a single traversal phase and a single commit phase for the whole atomic execution. As a result, OTB-based data structures are optimistic and composable: the former because they defer any locking and/or monitoring to the commit phase of the entire atomic execution, the latter because they allow multiple operations to execute atomically. Additionally, we provide a theoretical model for analyzing OTB-based data structures and proving their correctness. In particular, we extend a recent approach that models concurrent data structures by adding the two notions of optimism and composition of operations.

Parallel Nonuniform Discrete Fourier Transform (P-NDFT) Over a Random Wireless Sensor Network (pp. 3615-3625)

Resource Sharing in Multicore Mixed-Criticality Systems: Utilization Bound and Blocking Overhead (pp. 3626-3641)
   Abstract (excerpt): ... a criticality-aware utilization bound under partitioned Earliest Deadline First (EDF) and MSRP, taking the worst-case synchronization overheads of tasks into account. We identify the non-monotonicity of the bound, which may decrease when more cores are deployed and can therefore cause anomalies in feasibility tests. To improve system schedulability, we further study a novel criticality-cognizant, resource-oriented analysis approach that tightens the bound on synchronization overheads for MC tasks scheduled under partitioned EDF and MSRP. Simulation results show that the new analysis approach can effectively reduce task blocking times (by up to 30 percent) and thus improve the schedulability ratio (e.g., by 10 percent more). An implementation in the Linux kernel further demonstrates the practicality of partitioned EDF with MSRP (with runtime overhead of about 3 to 7 percent of overall execution time) for MC tasks running on multicores with shared resources.

Toward General Software Level Silent Data Corruption Detection for Parallel Applications (pp. 3642-3655)

Transport-Support Workflow Composition and Optimization for Big Data Movement in High-Performance Networks (pp. 3656-3670)

Using Imbalance Characteristic for Fault-Tolerant Workflow Scheduling in Cloud Systems (pp. 3671-3683)

2017 Reviewers List (pp. 3684-3689)