2010 IEEE International Symposium on Workload Characterization (IISWC)

Date: 2-4 December 2010

  • Analysis on semantic transactional memory footprint for hardware transactional memory

    Page(s): 1 - 11

    We analyze various characteristics of the semantic transactional memory footprint (STMF), which consists of only those memory accesses the underlying hardware transactional memory (HTM) system has to manage for the correct execution of transactional programs. Our analysis shows that the STMF can be significantly smaller than the declarative transactional memory footprint (DTMF), which contains all memory accesses within transaction boundaries (only 8.3% of the DTMF in the applications examined). This result encourages processor designers and software toolchain developers to explore new design points for low-cost HTM systems and intelligent software toolchains that find and leverage the STMF efficiently. We identify seven code patterns that belong to the DTMF but not to the STMF, and show that they account for 91.7% of all memory accesses within transaction boundaries, on average, for the transactional programs examined. A new instruction prefix is proposed to express the STMF efficiently, and existing compiler techniques are examined for their applicability to deducing the STMF from the DTMF. Our trace analysis shows that using the STMF significantly reduces the ratio of transactions overflowing a 32KB L1 cache, from 12.80% to 2.00%, and substantially lowers the false-positive probability of the Bloom filters used for transaction signature management, from 23.60% to less than 0.001%. The simulation results show that the STAMP applications with the STMF expression run 40% faster on average than those with the DTMF expression.
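
The Bloom-filter result follows the standard false-positive approximation p = (1 - e^(-kn/m))^k: shrinking the number of tracked addresses n shrinks p sharply. A minimal sketch with invented signature parameters (the listing above does not give them; the two address counts merely echo the paper's ~8.3% footprint ratio):

```python
import math

def bloom_fp_rate(n_items, m_bits, k_hashes):
    """Standard approximation of a Bloom filter's false-positive rate."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

# Hypothetical signature: 2048 bits, 4 hash functions.
m, k = 2048, 4
for n in (400, 33):  # tracking all accesses vs. only the ~8.3% semantic subset
    print(f"{n:4d} tracked addresses -> FP rate {bloom_fp_rate(n, m, k):.6f}")
```

With these toy numbers the false-positive rate drops from roughly 8.6% to about 0.0015%, the same direction as the paper's 23.60% to <0.001% result.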

  • Exploring GPGPU workloads: Characterization methodology, analysis and microarchitecture evaluation implications

    Page(s): 1 - 10

    GPUs are emerging as general-purpose high-performance computing devices, and growing GPGPU research has made numerous GPGPU workloads available. However, a systematic approach to characterizing these benchmarks and analyzing their implications for GPU microarchitecture design evaluation is still lacking. In this research, we propose a set of microarchitecture-agnostic GPGPU workload characteristics to represent the workloads in a microarchitecture-independent space. Correlation-based dimensionality reduction and clustering analysis are used to understand these workloads. In addition, we propose a set of evaluation metrics to accurately evaluate the GPGPU design space. With a growing number of GPGPU workloads, this analysis approach supports meaningful, accurate, and thorough simulation of a proposed GPU architecture design choice. Architects also benefit by choosing a set of workloads that stress the intended functional block of the GPU microarchitecture. We present a diversity analysis of GPU benchmark suites such as the Nvidia CUDA SDK, Parboil, and Rodinia. Our results show that, with a large number of diverse kernels, workloads such as Similarity Score, Parallel Reduction, and Scan of Large Arrays show diverse characteristics in different workload spaces. We have also explored diversity in different workload subspaces (e.g., memory coalescing and branch divergence). The Similarity Score, Scan of Large Arrays, MUMmerGPU, Hybrid Sort, and Nearest Neighbor workloads exhibit relatively large variation in branch-divergence characteristics compared to the others. Memory-coalescing behavior is diverse in Scan of Large Arrays, K-Means, Similarity Score, and Parallel Reduction.
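
A minimal sketch of this kind of characterization pipeline, using scikit-learn and random stand-in numbers in place of the paper's measured, microarchitecture-independent characteristics:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Rows: kernels; columns: microarchitecture-independent characteristics
# (e.g., instruction mix, branch-divergence rate, coalescing ratio) -- synthetic here.
features = rng.random((30, 8))

# Project the workloads into a low-dimensional, decorrelated space...
reduced = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(features))
# ...then cluster to find groups of similar kernels; one representative per
# cluster gives a small but diverse simulation set.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(reduced)
print(labels)
```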

  • Improving virtualization performance and scalability with advanced hardware accelerations

    Page(s): 1 - 10

    Many advanced hardware accelerations for virtualization, such as Pause Loop Exit (PLE), Extended Page Table (EPT), and Single Root I/O Virtualization (SR-IOV), have been introduced recently to improve virtualization performance and scalability. In this paper, we share our experience with the performance and scalability issues of virtualization, especially those raised by modern multi-core and/or overcommitted systems. We then describe our work on the implementation and optimization of the advanced hardware acceleration support in the latest version of Xen. Finally, we present performance evaluations and characterizations of these hardware accelerations, using both micro-benchmarks and a server consolidation benchmark (vConsolidate). The experimental results demonstrate up to a 77% improvement with these hardware accelerations, of which 49% is due to EPT and another 28% to SR-IOV.

  • Fidelity and scaling of the PARSEC benchmark inputs

    Page(s): 1 - 10

    A good benchmark suite should provide users with inputs that have multiple levels of fidelity for different use cases, such as running on real machines, register-level simulations, or gate-level simulations. Although input reduction has been explored in the past, there is a lack of understanding of how to systematically scale input sets for a benchmark suite. This paper presents a framework that takes the novel view that benchmark inputs should be considered approximations of their original, full-sized inputs. It formulates the input-selection problem for a benchmark as an optimization problem that maximizes the accuracy of the benchmark subject to a time constraint. The paper demonstrates how to use the proposed methodology to create several simulation input sets for the PARSEC benchmarks and how to quantify and measure their approximation error. The paper also shows which parts of the inputs are more likely to distort their original characteristics. Finally, the paper provides guidelines for users to create their own customized input sets.
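
The optimization view of input selection can be illustrated with a toy selector: among candidate reduced inputs, pick the one with the smallest approximation error that still fits the simulation-time budget. The candidate names and numbers below are hypothetical, not the paper's:

```python
# Candidate reduced inputs: (name, simulated runtime in hours, approximation error).
candidates = [("tiny", 0.5, 0.30), ("small", 2.0, 0.12),
              ("medium", 8.0, 0.05), ("large", 24.0, 0.01)]

def best_input(budget_hours):
    """Most accurate candidate input that fits the simulation-time budget."""
    feasible = [c for c in candidates if c[1] <= budget_hours]
    return min(feasible, key=lambda c: c[2]) if feasible else None

print(best_input(4.0))   # -> ('small', 2.0, 0.12)
```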

  • Eigenbench: A simple exploration tool for orthogonal TM characteristics

    Page(s): 1 - 11

    There are a significant number of Transactional Memory (TM) proposals, varying in almost all aspects of the design space. Although several transactional benchmarks have been suggested, a simple yet thorough evaluation framework is still needed to completely characterize a TM system and allow comparison among the various proposals. Unfortunately, TM system evaluation is difficult because the application characteristics that affect performance are often difficult to isolate from each other. We propose a set of orthogonal application characteristics that form a basis for transactional behavior and are useful in fully understanding the performance of a TM system. In this paper, we present EigenBench, a lightweight yet powerful microbenchmark for fully evaluating a transactional memory system. We show that EigenBench is useful for thoroughly exploring the orthogonal space of TM application characteristics. Because of its flexibility, our microbenchmark is also capable of reproducing a representative set of TM performance pathologies. We use EigenBench to evaluate two well-known TM systems and provide significant insight into their strengths and weaknesses. We also demonstrate how EigenBench can be used to mimic the evaluation coverage of a popular TM benchmark suite called STAMP.
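
A rough sketch of the orthogonal-characteristics idea: a transaction generator driven by independent knobs (transaction length, read fraction, and the fraction of accesses hitting a shared, conflict-prone array). This mirrors the spirit of such a microbenchmark, not EigenBench's actual implementation:

```python
import random

def synth_transaction(tx_len, read_frac, hot_frac, hot_size, cold_size, rng):
    """Generate one transaction's access list from orthogonal knobs:
    tx_len    - number of accesses in the transaction
    read_frac - fraction of reads vs. writes
    hot_frac  - fraction of accesses to the shared, conflict-prone array
    """
    ops = []
    for _ in range(tx_len):
        kind = "R" if rng.random() < read_frac else "W"
        if rng.random() < hot_frac:
            ops.append((kind, "hot", rng.randrange(hot_size)))    # shared data
        else:
            ops.append((kind, "cold", rng.randrange(cold_size)))  # private data
    return ops

rng = random.Random(42)
print(synth_transaction(tx_len=8, read_frac=0.75, hot_frac=0.25,
                        hot_size=64, cold_size=4096, rng=rng))
```

Because each knob varies independently, sweeping one axis while holding the others fixed isolates its effect on abort rate and throughput.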

  • A limit study of JavaScript parallelism

    Page(s): 1 - 10

    JavaScript is ubiquitous on the web. At the same time, the language's dynamic behavior makes optimization challenging, leading to poor performance. In this paper we conduct a limit study on the potential parallelism of JavaScript applications, including popular web pages and standard JavaScript benchmarks. We examine dependency types and looping behavior to better understand the potential for JavaScript parallelization. Our results show that the potential speedup is very encouraging, averaging 8.9x and reaching as high as 45.5x. Parallelizing functions themselves, rather than just loop bodies, proves more fruitful in increasing JavaScript execution speed. The results also indicate that, in our JavaScript engine, most dependencies manifest via virtual registers rather than hash-table lookups.
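
The limit-study arithmetic is the classic work/critical-path bound: build the dynamic dependence graph of the execution and divide total work by the longest dependence chain. A tiny sketch over a hypothetical five-operation trace:

```python
# Limit-study idea: potential speedup = total work / critical-path length
# over the dynamic dependence graph. Ops and deps below are hypothetical.
ops = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 1}        # op -> latency
deps = {"c": ["a", "b"], "d": ["c"], "e": ["c"]}      # op -> predecessors

finish = {}
def finish_time(op):
    if op not in finish:
        finish[op] = ops[op] + max((finish_time(p) for p in deps.get(op, [])),
                                   default=0)
    return finish[op]

total_work = sum(ops.values())
critical_path = max(finish_time(op) for op in ops)
print(f"potential speedup: {total_work / critical_path:.2f}x")  # 5 / 3 = 1.67x
```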

  • Performance characterization and acceleration of Optical Character Recognition on handheld platforms

    Page(s): 1 - 10

    Optical Character Recognition (OCR) converts images of handwritten or printed text, captured by camera or scanner, into editable text. OCR has seen limited adoption on mobile platforms due to the performance constraints of these systems. Intel® Atom™ processors have enabled general-purpose applications to be executed on handheld devices. In this paper, we analyze a reference implementation of the OCR workload on a low-power general-purpose processor and identify the primary hotspot functions that incur a large fraction of the overall response time. We also present a detailed architectural characterization of the hotspot functions in terms of CPI, MPI, etc. We then implement and analyze several software/algorithmic optimizations: i) multi-threading, ii) image sampling for a hotspot function, and iii) miscellaneous code optimizations. Our results show that up to a 2X improvement in application execution time, and almost a 9X improvement for one hotspot, can be achieved using these software optimizations. We also designed and implemented a hardware accelerator for one of the hotspots to further reduce execution time and power. Overall, we believe our analysis provides a detailed understanding of the processing overheads for OCR running on a new class of low-power compute platforms.
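
Image sampling, one of the optimizations listed above, can be sketched as follows: run an expensive per-pixel stage on a subsampled image and scale the result, trading a little accuracy for roughly 4x less work. The hotspot function here is a stand-in, not the paper's OCR code:

```python
import numpy as np

def hotspot(image):
    """Stand-in for an expensive per-pixel OCR stage (hypothetical)."""
    return (image > image.mean()).sum()

page = np.random.default_rng(1).integers(0, 256, size=(2048, 1536),
                                         dtype=np.uint8)

full = hotspot(page)
# Image-sampling optimization: process every other row and column,
# then scale the count back up to estimate the full-resolution answer.
sampled = hotspot(page[::2, ::2]) * 4
print(full, sampled)
```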

  • Data handling inefficiencies between CUDA, 3D rendering, and system memory

    Page(s): 1 - 10

    While GPGPU programming offers faster computation of highly parallelized code, the memory bandwidth between the system and the GPU can create a bottleneck that reduces the potential gains. CUDA is a prominent GPGPU API that can transfer data to and from system code and can also access data used by 3D rendering APIs. In an application that relies on both GPU programming APIs to accelerate 3D modeling and an easily parallelized algorithm, the hidden inefficiencies of nVidia's data handling with CUDA become apparent. First, CUDA uses the CPU's store units to copy data between the graphics card and system memory instead of using a more efficient method like DMA. Second, data exchanged between the two GPU-based APIs travels through the main processor instead of staying on the GPU. As a result, a non-GPGPU implementation of the program runs faster than the same program using GPGPU.

  • Toward a more accurate understanding of the limits of the TLS execution paradigm

    Page(s): 1 - 12

    Thread-Level Speculation (TLS) facilitates the extraction of parallel threads from sequential applications. Most prior work has focused on developing the compiler and architecture for this execution paradigm, and such studies have often concentrated narrowly on a specific design point. Other studies have attempted to assess how well TLS performs if some architectural or compiler constraint is relaxed. Unfortunately, these previous studies have failed to truly assess the performance potential of TLS, because they have been bound to some specific TLS architecture and have ignored one or another important TLS design choice, such as support for out-of-order task spawn or for intermediate checkpointing. In this paper we attempt to remedy some of the shortcomings of previous TLS limit studies. To this end, a characterization approach is pursued that is, as much as possible, independent of specific architecture configurations. High-level TLS architectural support is explored in one common framework. In this way, a more accurate upper bound on the performance potential of the TLS execution paradigm is obtained (as opposed to that of some particular architecture design point), and relative performance gains can be related to specific high-level architectural support. Finally, in the spirit of a comprehensive study, applications from a variety of domains and programming styles are evaluated. Experimental results suggest that TLS performance varies significantly depending on the features provided by the architecture. Additionally, the performance of these systems is hindered not only by data dependences, but also by load imbalance and limited coverage.

  • Parallelization and characterization of GARCH option pricing on GPUs

    Page(s): 1 - 10

    Option pricing is an important problem in computational finance due to the fast-growing market and the increasing complexity of options. Option pricing requires a model that describes the price process of the underlying asset. The GARCH model is one of the prominent option pricing models since it can model the stochastic volatility of the underlying asset. To derive expected profit based on the GARCH model, tree-based simulation is one of the commonly used approaches. Tree-based GARCH option pricing is compute-intensive since the tree grows exponentially and requires enormous numbers of floating-point arithmetic operations. In this paper, we present the first work on accelerating tree-based GARCH option pricing on GPUs with CUDA. As the conventional tree data structure is not friendly to GPU memory accesses, we propose a new family of tree data structures that place concurrently accessed nodes in contiguous, aligned memory locations. Moreover, to reduce the memory bandwidth requirement, we apply a fusion optimization, which combines two threads into one to keep data with temporal locality in register files. Our results show a 50× speedup compared to a multi-threaded program on a 4-core CPU.
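
The layout idea can be sketched with a flat, level-ordered array: all nodes of one tree level, which GPU threads process concurrently, occupy one contiguous slab, enabling coalesced accesses. This illustrates the principle only, not the paper's exact data structure:

```python
import numpy as np

# Lay a complete tree out level by level so that all nodes of one level
# (processed concurrently by GPU threads) sit in one contiguous, aligned slab.
depth, fanout = 4, 3                  # fanout 3 echoes a trinomial-style tree
level_offsets, total = [], 0
for d in range(depth):
    level_offsets.append(total)
    total += fanout ** d              # number of nodes in level d

nodes = np.zeros(total, dtype=np.float32)   # one flat, aligned array

def node_index(level, i):
    """Flat index of node i within `level`; level neighbors are adjacent."""
    return level_offsets[level] + i

print(total, [node_index(2, i) for i in range(fanout ** 2)])
# -> 40, level-2 nodes occupy the contiguous indices 4..12
```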

  • Tackling the challenges of server consolidation on multi-core systems

    Page(s): 1 - 10

    With increasing demand to reduce system operation costs amid the growing adoption of virtualization technologies, consolidating multiple servers onto a single physical system is fast becoming typical in modern data centers. This trend has become more pronounced with advances in system design that put more and more multi-core CPUs into one system. It is therefore interesting to investigate the challenges of virtualization on top of a multi-core system and the scalability of consolidation workloads running on it. With this objective, we conduct a comprehensive analysis of an industry consolidation workload on a state-of-the-art Intel system with 64 physical threads. Our findings reveal that careful consideration must be given to I/O virtualization and VM scheduling to achieve high scalability in a consolidated environment on a multi-core system. Against this backdrop, we carried out preliminary work to demonstrate optimization opportunities. In particular, applying an interrupt-balancing method brings a 16% performance boost, while a finer-grained scheduler prototype adds another 6%. Based on our analyses, existing VMMs should be enhanced with a) a well-designed multi-threaded I/O virtualization implementation and b) a scheduler with finer-grained control over priorities.
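
Interrupt balancing of the kind mentioned above is typically done on Linux by writing CPU masks to /proc/irq/*/smp_affinity. A hedged sketch (Linux-specific, root-only, and far cruder than the paper's method or a production irqbalance policy):

```python
import glob, os

def balance_irqs(num_cpus):
    """Spread IRQs round-robin across cores by rewriting their CPU masks."""
    for i, path in enumerate(sorted(glob.glob("/proc/irq/*/smp_affinity"))):
        mask = 1 << (i % num_cpus)        # pin IRQ i to one core, round-robin
        try:
            with open(path, "w") as f:
                f.write(f"{mask:x}")
        except OSError:
            pass                          # some IRQs cannot be re-routed

balance_irqs(os.cpu_count())
```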

  • Characterization of workload and resource consumption for an online travel and booking site

    Page(s): 1 - 10

    Online travel and ticket booking is one of the top e-commerce industries. Because travel sites present a mix of products (flights, hotels, tickets, restaurants, activities, and vacation packages), they rely on a wide range of technologies to support them (JavaScript, AJAX, XML, B2B Web services, caching, search algorithms, and affiliation), resulting in a very rich and heterogeneous workload. Moreover, visits to travel sites vary greatly with the time of day, season, promotions, events, and linking, creating bursty traffic and making capacity planning a challenge. It is therefore of great importance to understand how users and crawlers interact with travel sites and their effect on server resources, both for devising cost-effective infrastructures and for improving the quality of service for users. In this paper we present a detailed workload and resource-consumption characterization of the web site of a top national Online Travel Agency. The characterization is performed on server logs that include both HTTP data and the resource consumption of the requests, as well as the server load during execution. From this dataset we characterize user sessions and their patterns, and how response time is affected as load on the Web servers increases. We provide a fine-grained analysis through experiments that differentiate request types, time of day, products, and the resource requirements of each. Results show, among other findings, that the workload is bursty as expected, that day and night traffic differ in request-type mix, that user session lengths cover a wide range of durations, that response time grows proportionally to server load, and that the response times of external data providers also increase at peak hours. Such results can be useful for optimizing infrastructure costs, improving QoS for users, and developing realistic workload generators for similar applications.
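
A minimal flavor of this kind of log-driven characterization, assuming a common-log-format file at a hypothetical path and using per-client request counts as a crude session proxy (the paper's session reconstruction is far more careful):

```python
import re
from collections import Counter

LOG_RE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)')

requests_per_hour = Counter()
requests_per_client = Counter()
with open("access.log") as f:          # hypothetical log path
    for line in f:
        m = LOG_RE.match(line)
        if not m:
            continue
        ip, ts, method, url = m.groups()
        hour = ts.split(":")[1]        # "dd/Mon/yyyy:HH:MM:SS ..." -> HH
        requests_per_hour[hour] += 1
        requests_per_client[ip] += 1

print("peak hour:", requests_per_hour.most_common(1))
print("mean requests per client:",
      sum(requests_per_client.values()) / max(len(requests_per_client), 1))
```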

  • Performance of multi-process and multi-thread processing on multi-core SMT processors

    Page(s): 1 - 10

    Many modern high-performance processors support multiple hardware threads in the form of multiple cores and SMT (Simultaneous Multi-Threading). Hence, achieving good performance scalability of programs with respect to the number of cores (core scalability) and the number of SMT threads in one core (SMT scalability) is critical. To identify a way to achieve higher performance on multi-core SMT processors, this paper compares the performance scalability of two parallelization models (multiple processes versus multiple threads in one process) on two types of hardware parallelism (core scalability and SMT scalability). We tested standard Java benchmarks and a real-world server program written in PHP on two platforms, Sun's UltraSPARC T1 (Niagara) processor and Intel's Xeon (Nehalem) processor. We show that the multi-thread model achieves better SMT scalability than the multi-process model by reducing the number of cache misses and DTLB misses, while both models achieve roughly equal core scalability. We also show that the multi-thread model generates up to 7.4 times more DTLB misses than the multi-process model when multiple cores are used. To take advantage of both models, we implemented a memory allocator for a PHP runtime to reduce DTLB misses on multi-core SMT processors. The allocator is aware of the core that is running each software thread and allocates memory blocks from the same memory page for each processor core. When using all of the hardware threads on a Niagara, the core-aware allocator reduces DTLB misses by 46.7% compared to the default allocator and improves performance by 3.0%.
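
The core-aware allocation policy can be simulated in a few lines: give each core its own bump-pointer region so that one core's blocks cluster onto few pages (and thus few DTLB entries). Addresses below are simulated integers, not real memory, and the class is an illustration rather than the paper's allocator:

```python
from collections import defaultdict

PAGE = 4096

class CoreAwarePool:
    """Each core bump-allocates within its own pages, so blocks touched by
    one core share few pages. Addresses are simulated integers."""
    def __init__(self):
        self.next_page = 0
        self.cursor = defaultdict(lambda: (None, PAGE))  # core -> (page, used)

    def alloc(self, core, size):
        page, used = self.cursor[core]
        if page is None or used + size > PAGE:           # start a fresh page
            page, used = self.next_page, 0
            self.next_page += 1
        self.cursor[core] = (page, used + size)
        return page * PAGE + used

pool = CoreAwarePool()
addrs = [pool.alloc(core=0, size=64) for _ in range(3)] + [pool.alloc(core=1, size=64)]
print([hex(a) for a in addrs])  # core 0's blocks share page 0; core 1 gets page 1
```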

  • Real Java applications in software transactional memory

    Page(s): 1 - 10

    Transactional Memory (TM) shows promise as a new concurrency-control mechanism to replace lock-based synchronization. However, there have been few studies of TM systems with real applications, and the real-world benefits and barriers of TM remain unknown. In this paper, we present a detailed analysis of the behavior of real applications on a software transactional memory system. Based on this analysis, we aim to clarify what programming work is required to achieve reasonable performance in TM-based applications. We selected three existing Java applications: (1) HSQLDB, (2) the Geronimo application server, and (3) the GlassFish application server, because each has a scalability problem caused by lock contention. We identified the critical sections where lock contention frequently occurs and modified the source code so that those critical sections execute transactionally. However, this simple modification proved insufficient to achieve reasonable performance because of excessive data conflicts. We found that most of the data conflicts were caused by application-level optimizations, such as reusing objects to reduce memory usage. After modifying the source code to disable those optimizations, the TM-based applications showed higher or competitive performance compared to the lock-based applications. Another finding is that the number of variables that actually cause data conflicts is much smaller than the number of variables that can be accessed in critical sections. This implies that performance tuning of TM-based applications may be easier than that of lock-based applications, where we need to take care of all of the variables that can be accessed in the critical sections.
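
The object-reuse problem described above is easy to see with a toy read/write-set conflict check: a shared object pool makes otherwise-independent transactions collide on the pool's head pointer. The location names here are hypothetical, and this is not a real STM:

```python
def conflicts(tx_a, tx_b):
    """Two transactions conflict if either one writes what the other touches."""
    ra, wa = tx_a
    rb, wb = tx_b
    return bool(wa & (rb | wb) or wb & (ra | wa))

# With a pooled-object optimization, both transactions write "pool.head":
pooled_1 = ({"row42"}, {"pool.head", "row42"})
pooled_2 = ({"row99"}, {"pool.head", "row99"})
# With plain allocation, their footprints are disjoint:
fresh_1 = ({"row42"}, {"row42"})
fresh_2 = ({"row99"}, {"row99"})

print(conflicts(pooled_1, pooled_2))  # True  -> one transaction aborts
print(conflicts(fresh_1, fresh_2))    # False -> both commit
```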

  • A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads

    Page(s): 1 - 11

    The recently released Rodinia benchmark suite enables users to evaluate heterogeneous systems including both accelerators, such as GPUs, and multicore CPUs. As Rodinia sees higher levels of acceptance, it becomes important that researchers understand this new set of benchmarks, especially how it differs from previous work. In this paper, we present recent extensions to Rodinia and conduct a detailed characterization of the Rodinia benchmarks (including performance results on an NVIDIA GeForce GTX480, the first product based on the Fermi architecture). We also compare and contrast Rodinia with PARSEC to gain insight into the similarities and differences of the two benchmark collections, applying principal component analysis to analyze the application-space coverage of the two suites. Our analysis shows that many of the workloads in Rodinia and PARSEC are complementary, capturing different aspects of certain performance metrics.

  • Benchmark synthesis for architecture and compiler exploration

    Page(s): 1 - 11

    This paper presents a novel benchmark synthesis framework with three key features. First, it generates synthetic benchmarks in a high-level programming language (C in our case), in contrast to prior work in benchmark synthesis, which generates synthetic benchmarks in assembly. Second, the synthetic benchmarks hide proprietary information from the original workloads they are built after. Hence, companies may want to distribute synthetic benchmark clones to third parties as proxies for their proprietary codes; third parties can then optimize the target system without having access to the original codes. Third, the synthetic benchmarks are shorter running than the original workloads they are modeled after, yet they are representative. In summary, the proposed framework generates small (thus quick-to-simulate) and representative benchmarks that can serve as proxies for other workloads without revealing proprietary information; and because the benchmarks are generated in a high-level programming language, they can be used to explore both the architecture and compiler spaces. The results obtained with our initial framework are promising. We demonstrate that we can generate synthetic proxy benchmarks for the MiBench benchmarks, and we show that they are representative across a range of machines with different instruction-set architectures, microarchitectures, compilers, and optimization levels, while being 30 times shorter running on average. We also verify, using software plagiarism detection tools, that the synthetic benchmark clones hide proprietary information from the original workloads.
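
To make the "synthesis in C" idea concrete, here is a toy generator that emits a C loop from a two-number profile (trip count and arithmetic-op mix). Real frameworks capture far richer statistics (branch behavior, memory strides, and so on), so this only shows the flavor of emitting high-level source from a profile:

```python
# Hypothetical statistical profile of the original workload.
profile = {"trip_count": 1_000_000, "adds": 3, "muls": 1}

# Emit a loop body whose operation mix matches the profile.
body = "\n".join(["        acc += a[i & 1023];"] * profile["adds"] +
                 ["        acc *= 1.0001;"] * profile["muls"])
print(f"""#include <stdio.h>
double a[1024];
int main(void) {{
    double acc = 0.0;
    for (long i = 0; i < {profile['trip_count']}; i++) {{
{body}
    }}
    printf("%f\\n", acc);
    return 0;
}}""")
```

Because the output is ordinary C rather than assembly, the same clone can be recompiled at different optimization levels and for different ISAs, which is exactly the property the paper exploits.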

  • Analyzing and scaling parallelism for network routing protocols

    Page(s): 1 - 10

    The serial nature of legacy code in routing-protocol implementations has inhibited a shift to multicore processing in the control plane, even though there is much inherent parallelism. In this paper, we investigate the use of multicore as the compute platform for routing applications, using BGP, the ubiquitous protocol for routing in the Internet backbone, as a representative application. We develop a scalable multithreaded implementation of BGP and evaluate its performance on several multicore configurations using a fully configurable multicore simulation environment. We implement several optimizations at the software and architecture levels, achieving a speedup of 6.5 times over the sequential implementation, which translates to a throughput of ~170K updates per second. Subsequently, we propose a generic architecture and parallelization methodology that can be applied to all routing-protocol implementations to achieve significant performance improvement.
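 
One standard way to parallelize update processing, sketched below, is to shard updates by prefix so each worker owns a partition of the routing table and needs no locks on per-prefix state. This illustrates the general approach, not the paper's exact design:

```python
from concurrent.futures import ThreadPoolExecutor
from collections import defaultdict

NUM_WORKERS = 4
rib_shards = [defaultdict(dict) for _ in range(NUM_WORKERS)]  # per-worker RIB

def apply_update(shard_id, prefix, path):
    # Each prefix always hashes to the same shard, so no locking is needed
    # on the per-prefix state.
    rib_shards[shard_id][prefix]["best_path"] = path

updates = [("10.0.0.0/8", "AS1 AS2"), ("192.168.0.0/16", "AS3"),
           ("172.16.0.0/12", "AS1 AS4")]
with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    for prefix, path in updates:
        pool.submit(apply_update, hash(prefix) % NUM_WORKERS, prefix, path)

print(sum(len(shard) for shard in rib_shards))  # 3 prefixes installed
```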

  • Performance variations of two open-source cloud platforms

    Page(s): 1 - 10

    The performance of workloads running on cloud platforms varies significantly depending on the platform configuration. We evaluated these performance variations using two open-source cloud platforms, OpenNebula and Eucalyptus. To assess the variations, we created a representative workload from Wikipedia software and data. Performance with this workload was quite sensitive to two key configuration choices: the physical location of the virtual machine disk images (local disk or NFS), and eager or lazy allocation of the virtual machine disk images. Our performance metrics included (1) the provisioning times for the virtual machines, (2) the elapsed times for two types of batch processing, and (3) the throughputs of two types of Web transactions. The local-disk configuration was 75% slower for provisioning, 2.9 times faster for batch processing, and 50% faster for Web transactions compared to the NFS configuration. Relative to lazy allocation, eager allocation took 2.7 times longer for provisioning and was 43% faster for batch processing, but only 1.5% faster for Web transactions. Our results indicate that no configuration offers the best performance on all three metrics at the same time. If batch processing is more important than provisioning, the local-disk configuration with eager disk allocation should be used; otherwise, the local-disk configuration with lazy allocation should be used. We also evaluated a multi-tenancy scenario using the Apache Day Trader benchmark on Eucalyptus. The results show that VM provisioning significantly affected the throughput of Day Trader due to the lack of any disk I/O throttling mechanism.

  • Runtime workload behavior prediction using statistical metric modeling with application to dynamic power management

    Page(s): 1 - 10

    Adaptive computing systems rely on accurate predictions of workload behavior to understand and respond to dynamically varying application characteristics. In this study, we propose a Statistical Metric Model (SMM) that is system- and metric-independent for predicting workload behavior. The SMM is a probability distribution over workload patterns that models how frequently each specific behavior occurs. The Maximum Likelihood Estimation (MLE) criterion is used to estimate the parameters of the SMM, and the model parameters are further refined with a smoothing method to improve prediction robustness. The SMM learns application patterns at runtime and simultaneously predicts upcoming program phases based on what it has learned so far. An extensive and rigorous series of prediction experiments demonstrates the superior performance of the SMM predictor over existing predictors on a wide range of benchmarks. For some of the benchmarks, the SMM improves prediction accuracy by 10X and 3X compared to the existing last-value and table-based prediction approaches, respectively. The SMM's improved prediction accuracy results in superior power-performance trade-offs when applied to dynamic power management.
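
The MLE-plus-smoothing recipe can be shown with a toy bigram model over phase IDs: count transitions online, then predict the next phase from add-one-smoothed probabilities. The paper's SMM is considerably richer; this sketch only demonstrates the three ingredients named in the abstract:

```python
from collections import Counter, defaultdict

ALPHABET = {"A", "B", "C"}              # hypothetical phase IDs
counts = defaultdict(Counter)

def observe(prev, cur):
    counts[prev][cur] += 1              # online MLE transition counts

def predict(prev):
    c = counts[prev]
    # Add-one (Laplace) smoothing: (count + 1) / (total + |alphabet|),
    # so unseen transitions keep nonzero probability.
    total = sum(c.values()) + len(ALPHABET)
    return max(ALPHABET, key=lambda s: (c[s] + 1) / total)

history = "AABABAABAB"
for p, c in zip(history, history[1:]):
    observe(p, c)
print(predict("A"), predict("B"))       # -> B A (learned alternation)
```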

  • Exploiting approximate value locality for data synchronization on multi-core processors

    Page(s): 1 - 10

    This paper shows that for a variety of parallel “soft computing” programs that use optimistic synchronization, the approximate nature of the values produced during execution can be exploited to improve performance significantly. Specifically, through mechanisms for imprecise sharing of values between threads, the amount of contention in these programs can be reduced, thereby avoiding expensive aborts and improving parallel performance while keeping the results produced by the program within the bounds of an acceptable approximation. This is made possible by our observation that, for many such programs, a large fraction of the values produced during execution exhibit substantial value locality. We describe how this locality can be exploited using extensions to C/C++ language types that allow specification of limits on the precision and accuracy required, together with a novel value-aware conflict detection scheme that minimizes the number of conflicts while respecting these limits. Our experiments indicate that for the programs studied, substantial speedups can be achieved: up to 5.7x over the original program for the same number of threads. We also present experimental evidence that, for the programs studied, the amount of error introduced often grows relatively slowly.
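
Value-aware conflict detection reduces, at its core, to comparing the drift of a value against a programmer-declared tolerance. A miniature sketch (the paper attaches such limits to C/C++ types; here the tolerance is just a plain float):

```python
def conflicts(read_value, current_value, tolerance):
    """A concurrent write counts as a conflict only if it moved the value
    beyond the declared tolerance."""
    return abs(current_value - read_value) > tolerance

snapshot = 0.5000   # value a thread read inside its optimistic region
print(conflicts(snapshot, 0.5004, tolerance=1e-3))  # False: commit anyway
print(conflicts(snapshot, 0.7000, tolerance=1e-3))  # True: abort and retry
```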

  • Characterizing datasets for data deduplication in backup applications

    Page(s): 1 - 10

    The compression and throughput performance of a data deduplication system is directly affected by its input dataset. We propose two sets of evaluation metrics, and the means to extract them, for deduplication systems. The first set of metrics represents how the composition of segments changes within the deduplication system over five full backups, which in turn gives insight into how the compression ratio will change as data accumulate. The second set of metrics represents the index-table fragmentation caused by duplicate elimination and the arrival rate at the underlying storage system. We show that, while shorter sequences of unique data may be bad for index caching, they provide a more uniform arrival rate, which improves overall throughput. Finally, we compute the metrics derived from the datasets under evaluation and show how the datasets perform with different metrics. Our evaluation shows that backup datasets typically exhibit patterns in how they change over time and that these patterns are quantifiable in terms of how they affect the deduplication process. This quantification allows us to: 1) decide whether deduplication is applicable, 2) provision resources, 3) tune the deduplication parameters, and 4) potentially decide which portion of the dataset is best suited for deduplication.
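
The kind of dataset metric discussed here can be approximated with fixed-size chunking and a hash index (real systems typically use content-defined chunking). A toy dedup-ratio computation over two synthetic "backups":

```python
import hashlib, os

SEGMENT = 4096   # arbitrary fixed segment size for this sketch

def dedup_ratio(backups):
    """Hash each segment and count how many are new vs. already stored."""
    index, total, unique = set(), 0, 0
    for data in backups:                              # one bytes object per backup
        for off in range(0, len(data), SEGMENT):
            digest = hashlib.sha1(data[off:off + SEGMENT]).digest()
            total += 1
            if digest not in index:
                index.add(digest)
                unique += 1
    return total / unique                             # 2.0 means 2:1 compression

day1 = os.urandom(50 * SEGMENT)                       # baseline full backup
day2 = day1[:40 * SEGMENT] + os.urandom(10 * SEGMENT) # 20% of segments changed
print(f"{dedup_ratio([day1, day2]):.2f}:1")           # ~1.67:1 for this toy pair
```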
