IEEE 10th International Symposium on Workload Characterization (IISWC 2007)

Date: 27-29 Sept. 2007

Displaying Results 1 - 25 of 29
  • IEEE International Symposium on Workload Characterization

    Publication Year: 2007 , Page(s): i
    Freely Available from IEEE
  • Message from the General Chair

    Publication Year: 2007 , Page(s): ii
    Freely Available from IEEE
  • Message from the Program Co-Chairs

    Publication Year: 2007 , Page(s): iii
    Freely Available from IEEE
  • IISWC-2007 People

    Publication Year: 2007 , Page(s): iv
    Freely Available from IEEE
  • IISWC-2007 Reviewers

    Publication Year: 2007 , Page(s): v
    Freely Available from IEEE
  • The SPEC Gorilla Turns One. So What?

    Publication Year: 2007 , Page(s): 1

    SPEC CPU2006 is a 500-pound gorilla of benchmarking, with 1300 results published since its release one year ago (24 August 2006), despite consuming vastly more time and computational resources than its predecessor suites. What have we learned about its workloads during its first year of life? Are there surprises lurking in the code, workloads, or run rules that are difficult to simulate? What characteristics of CPU2006 have proven successful? What does SPEC need to improve in successor suites? Some proposed answers will be provided and time will be reserved for an open microphone. The presenter will also be available during breaks to listen to feedback about the suite.
  • Taking Concurrency Seriously: the Multicore Challenge

    Publication Year: 2007 , Page(s): 2

    Computer architecture is undergoing, if not another revolution, then a vigorous shaking-up. The major chip manufacturers have, for the time being, simply given up trying to make processors run faster. Instead, they have recently started shipping "multicore" architectures, in which multiple processors (cores) communicate directly through shared hardware caches, providing increased concurrency instead of increased clock speed. As a result, system designers and software engineers can no longer rely on increasing clock speed to hide software bloat. Instead, they must somehow learn to make effective use of increasing parallelism. This adaptation will not be easy. Conventional synchronization techniques based on locks and conditions are unlikely to be effective in such a demanding environment. Transactional memory is a computational model in which threads synchronize via transactions. This synchronization model promises to alleviate many (perhaps not all) of the problems associated with locking, and there is a growing community of researchers working on both software and hardware support for this approach. This talk will survey the area, with a focus on open research problems.
  • Characterizing the Effect of Microarchitecture Design Parameters on Workload Dynamic Behavior

    Publication Year: 2007 , Page(s): 5 - 14

    Program runtime behavior exhibits significant variations across multiple scales. Increasing design complexity and technology scaling make microprocessor performance and efficiency depend increasingly on runtime workload dynamics. Therefore, understanding the effect of design parameters on workload dynamics at the early microarchitecture-exploration stage is crucial for high-performance and complexity-efficient designs. In this study, we apply wavelet-based analysis to decompose workload dynamics into a series of wavelet coefficients, which represent program behavior ranging from low-resolution approximation to high-resolution detail. We then construct error-bounded linear regression models that relate microarchitecture design parameters to the wavelet coefficients that capture workload dynamics at multiple resolution levels. The most significant factors affecting program dynamics at different scales are obtained. To our knowledge, this paper presents the first work on microarchitecture design space exploration focusing on workload dynamics.
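
    The approach above lends itself to a compact illustration. The following is a minimal sketch, not the authors' code: it decomposes a synthetic per-interval IPC trace into wavelet coefficients with PyWavelets and fits an ordinary least-squares model from two hypothetical design parameters (issue width, L2 size) to one coarse-scale coefficient.

        # Illustrative sketch only: wavelet decomposition of a synthetic IPC trace
        # per design point, then a least-squares fit from hypothetical design
        # parameters to one wavelet coefficient. Requires numpy and PyWavelets.
        import numpy as np
        import pywt

        rng = np.random.default_rng(0)
        designs = [(w, l2) for w in (2, 4, 8) for l2 in (1, 2, 4)]   # issue width, L2 size (MB)

        X, y = [], []
        for width, l2_mb in designs:
            t = np.arange(256)
            # Synthetic stand-in for a measured per-interval IPC trace.
            ipc = 0.3 * width * (1 + 0.1 * np.sin(t / (8 * l2_mb))) + rng.normal(0, 0.05, t.size)
            coeffs = pywt.wavedec(ipc, "haar", level=4)   # approximation + detail coefficients
            X.append([1.0, width, l2_mb])                 # regression features with an intercept
            y.append(coeffs[0].mean())                    # one coarse-scale (approximation) term

        beta, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
        print("fitted model (intercept, issue width, L2 MB):", beta)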
  • Implications of False Conflict Rate Trends for Robust Software Transactional Memory

    Publication Year: 2007 , Page(s): 15 - 24
    Cited by:  Papers (2)

    We demonstrate that a common optimization for reducing the single-thread overhead of word-based Software Transactional Memory (STM) systems can have a significant negative impact on their scalability. Specifically, we find that the use of a tagless ownership table incurs false conflicts at a rate that grows superlinearly with both the TM data footprint and concurrency, and that increasing the size of the ownership table results in only a sub-linear reduction in conflict rate. These empirically observed trends are shown to result from the same statistical principles responsible for the so-called "Birthday Paradox," as we demonstrate with an analytical model based on random population of an ownership table by concurrently executing transactions. From this study, we conclude that tagless ownership tables are not a robust approach to supporting transactional memories. Even large tables (> 64K entries) are only somewhat effective at mitigating false conflicts in the presence of modestly sized transactions (e.g., 20 cache blocks) and modest degrees of concurrency (e.g., 4 simultaneous transactions). The practical implications of these results are particularly acute for hybrid TMs, where small transactions are likely handled in hardware, leaving only the large ones for the STM. For reasonably sized tables, a tagless organization will almost guarantee a maximum concurrency of 1 for these overflowed transactions. Using a tagged ownership table completely avoids these false conflict problems.
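
    The birthday-paradox effect the paper appeals to is easy to reproduce numerically. The sketch below is an illustration rather than the paper's analytical model: a Monte Carlo estimate of how often two concurrent transactions touching disjoint cache blocks nonetheless collide in a tagless ownership table of a given size.

        # Illustrative sketch only: estimate the false-conflict probability between
        # two transactions whose blocks hash to random slots of an ownership table.
        import random

        def false_conflict_rate(table_entries, blocks_per_txn, trials=20000, seed=1):
            rng = random.Random(seed)
            hits = 0
            for _ in range(trials):
                a = {rng.randrange(table_entries) for _ in range(blocks_per_txn)}
                b = {rng.randrange(table_entries) for _ in range(blocks_per_txn)}
                hits += bool(a & b)          # any shared slot is a false conflict
            return hits / trials

        for entries in (4096, 65536):
            print(entries, "entries:", false_conflict_rate(entries, blocks_per_txn=20))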
  • Predicting Program Behavior Based On Objective Function Minimization

    Publication Year: 2007 , Page(s): 25 - 34
    Cited by:  Papers (3)

    Computer systems increasingly rely on dynamic management of their operations with the goal of optimizing an individual or joint metric involving performance, power, temperature, reliability, and so on. Such an adaptive system requires accurate, reliable, and practically viable metric predictors to invoke the dynamic management actions in a timely and efficient manner. Unlike the ad-hoc predictors proposed in the past, we propose a unified prediction method in which the optimal metric prediction problem is treated as the minimization of an objective function. The choice of objective function and model type determines the form of the solution, whether it is a closed form or one determined numerically through optimization. We formulate two particular realizations of the unified prediction method by using the total squared error and the accumulated squared error as the objective functions in conjunction with autoregressive models. Under this scenario, the unified prediction method becomes linear prediction and predictive least squares (PLS) prediction, respectively. For both of these predictors there is an analytical closed-form solution that determines the model parameters. Experimental results with prediction of instructions per cycle (IPC) and L1 cache miss rate demonstrate superior performance for the proposed predictors over the last-value predictor on SPEC CPU2000 benchmarks, where in some cases the mean absolute prediction error is reduced by as much as 10-fold.
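
    To make the least-squares formulation concrete, here is a minimal sketch, not the paper's implementation: it fits an order-3 autoregressive predictor to a synthetic IPC trace by ordinary least squares (minimizing total squared error) and compares its error against a last-value predictor.

        # Illustrative sketch only: closed-form AR predictor vs. last-value predictor.
        import numpy as np

        rng = np.random.default_rng(0)
        t = np.arange(400)
        ipc = 1.0 + 0.2 * np.sin(t / 15) + rng.normal(0, 0.02, t.size)   # synthetic metric

        order = 3
        X = np.column_stack([ipc[i:len(ipc) - order + i] for i in range(order)])
        y = ipc[order:]
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)        # closed-form AR coefficients

        ar_pred = X @ coef
        last_value_pred = ipc[order - 1:-1]                 # "same as previous interval"
        print("AR mean abs error:        ", np.mean(np.abs(y - ar_pred)))
        print("last-value mean abs error:", np.mean(np.abs(y - last_value_pred)))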
  • On the Effects of Memory Latency and Bandwidth on Supercomputer Application Performance

    Publication Year: 2007 , Page(s): 35 - 43
    Cited by:  Papers (8)

    Since the first vector supercomputers in the mid-1970s, the largest-scale applications have traditionally been floating-point-oriented numerical codes, which can be broadly characterized as the simulation of physics on a computer. Supercomputer architectures have evolved to meet the needs of those applications. Specifically, the computational work of the application tends to be floating-point oriented, and the decomposition of the problem is two- or three-dimensional. Today, an emerging class of critical applications may change those assumptions: they are combinatorial in nature, integer oriented, and irregular. The performance of both classes of applications is dominated by the performance of the memory system. This paper compares the memory performance sensitivity of both traditional and emerging HPC applications, and shows that the new codes are significantly more sensitive to memory latency and bandwidth than their traditional counterparts. Additionally, these codes exhibit lower baseline performance, which only exacerbates the problem. As a result, the construction of future supercomputer architectures to support these applications will most likely differ from those used to support traditional codes. Quantitatively understanding the difference between the two workloads will form the basis for future design choices.
  • An Evaluation of Server Consolidation Workloads for Multi-Core Designs

    Publication Year: 2007 , Page(s): 47 - 56
    Cited by:  Papers (12)

    While chip multiprocessors with ten or more cores will be feasible within a few years, the search for applications that fully exploit their attributes continues. In the meantime, one sure-fire application for such machines will be to serve as consolidation platforms for sets of workloads that previously occupied multiple discrete systems. Such server consolidation scenarios will simplify system administration and lead to savings in power, cost, and physical infrastructure. This paper studies the behavior of server consolidation workloads, focusing particularly on sharing of caches across a variety of configurations. Noteworthy interactions emerge within a workload, and notably across workloads, when multiple server workloads are scheduled on the same chip. These workloads present an interesting design point and will help designers better evaluate trade-offs as we push forward into the many-core era.
  • Performance Studies of Commercial Workloads on a Multi-core System

    Publication Year: 2007 , Page(s): 57 - 65
    Cited by:  Papers (6)

    The multi-threaded nature of many commercial applications makes them seemingly a good fit for the increasing number of available multi-core architectures. This paper presents our performance studies of a collection of commercial workloads on a multi-core system that is designed for total throughput. The selected workloads include full operational applications such as SAP-SD and IBM Trade, and popular synthetic benchmarks such as SPECjbb2005, SPEC SDET, Dbench, and Tbench. To evaluate performance scalability and thread-placement sensitivity, we monitor the application throughput, processor performance, and the memory subsystem at 8, 16, 24, and 32 hardware threads with (a) an increasing number of cores and (b) an increasing number of threads per core. We observe that these workloads scale close to linearly (with efficiencies ranging from 86% to 99%) with an increasing number of cores. For scaling with hardware threads per core, the efficiencies are between 50% and 70%. Furthermore, among other observations, our data show that the ability to hide long-latency memory operations (i.e., L2 misses) in a multi-core system enables the performance scaling.
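
    The scaling efficiencies quoted above follow from a simple calculation: efficiency is the measured speedup divided by the ideal speedup. A minimal sketch with hypothetical throughput numbers (not the paper's data):

        # Illustrative sketch only: scaling efficiency from throughput measurements.
        baseline_threads, baseline_tput = 8, 1000.0      # e.g., transactions/sec at 8 threads
        measurements = {16: 1950.0, 24: 2820.0, 32: 3560.0}

        for threads, tput in measurements.items():
            speedup = tput / baseline_tput
            ideal = threads / baseline_threads
            print(f"{threads} threads: speedup {speedup:.2f}x, efficiency {speedup / ideal:.0%}")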
  • Addressing Cache/Memory Overheads in Enterprise Java CMP Servers

    Publication Year: 2007 , Page(s): 66 - 75
    Cited by:  Papers (1)

    As we enter the era of chip multiprocessor (CMP) architectures, it is important that we explore the scaling characteristics of mainstream server workloads on these platforms. In this paper, we analyze the performance of two significant enterprise Java workloads (SPECjAppServer2004 and SPECjbb2005) on CMP platforms, present and future. We start by characterizing the core, cache, and memory behavior of these workloads on the newly released Intel Core 2 Duo Xeon platform (dual-core, dual-socket). Our findings from these measurements indicate that these workloads have a significant performance dependence on the cache and memory subsystems. In order to guide the evolution of future CMP platforms, we perform a detailed investigation of potential cache and memory architecture choices. This includes analyzing the effects of thread sharing and migration, object allocation, and garbage collection. Based on the observed behavior, we propose architectural optimizations along three dimensions: (a) data-less cache line initialization (DCLI), (b) hardware-guided thread collocation (HGTC), and (c) on-socket DRAM caches (OSDC). In this paper, we describe these optimizations in detail and validate their performance potential based on trace-driven simulations and execution-driven emulation. Overall, we expect that the findings in this paper will guide future CMP architectures for enterprise Java servers.
  • Benchmarking BGP Routers

    Publication Year: 2007 , Page(s): 79 - 88
    Cited by:  Papers (2)

    Determining which routes to use when forwarding traffic is one of the major processing tasks in the control plane of computer networks. We present a novel benchmark that evaluates the performance of the most commonly used Internet-wide routing protocol, the Border Gateway Protocol (BGP). Using this benchmark, we evaluate four different systems that implement BGP, including a uni-core and a dual-core workstation, an embedded network processor, and a commercial router. We present performance results for these systems under various loads of cross-traffic and explore the tradeoffs between different system architectures. Our observations help identify bottlenecks and limitations in current systems and can lead to next-generation router architectures that are better optimized for this important workload.
  • Characterizing and Improving the Performance of Bioinformatics Workloads on the POWER5 Architecture

    Publication Year: 2007 , Page(s): 89 - 97
    Cited by:  Papers (1)  |  Patents (1)

    This paper examines several mechanisms to improve the performance of life science applications on high-performance computer architectures typically designed for more traditional supercomputing tasks. In particular, we look at the detailed performance characteristics of some of the most popular sequence alignment and homology applications on the POWER5 architecture offering from IBM. Through detailed analysis of performance counter information collected from the hardware, we identify that the main performance bottleneck in the current POWER5 architecture for these applications is the high branch-misprediction penalty of the most time-consuming kernels of these codes. Utilizing our PowerPC full-system simulation environment, we show the performance improvement afforded by adding conditional assignments to the PowerPC ISA. We also show the impact of changing the number of functional units to a mix more appropriate for the characteristics of bioinformatics applications. Finally, we examine the benefit of removing the two-cycle penalty currently incurred in the POWER5 architecture for taken branches due to the lack of a branch target buffer. Addressing these three performance-limiting aspects provides an average 64% improvement in application performance.
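
    The conditional-assignment idea can be illustrated without any POWER5 specifics. The sketch below is a hypothetical example rather than code from the paper: it writes the same alignment-cell update two ways, with data-dependent branches and as a branch-free select that an ISA with conditional assignment can execute without mispredictions.

        # Illustrative sketch only: branchy vs. select-style alignment-cell update.
        def cell_branchy(diag, up, left, match_score, gap):
            best = diag + match_score
            if up - gap > best:          # data-dependent branch
                best = up - gap
            if left - gap > best:        # data-dependent branch
                best = left - gap
            if best < 0:
                best = 0
            return best

        def cell_select(diag, up, left, match_score, gap):
            # max() maps naturally onto conditional-move / select instructions.
            return max(0, diag + match_score, up - gap, left - gap)

        assert cell_branchy(5, 7, 3, 2, 1) == cell_select(5, 7, 3, 2, 1) == 7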
  • Pynamic: the Python Dynamic Benchmark

    Publication Year: 2007 , Page(s): 101 - 106
    Cited by:  Papers (2)

    Python is widely used in scientific computing to facilitate application development and to support features such as computational steering. Making full use of some of Python's popular features, which improve programmer productivity, leads to applications that access extremely high numbers of dynamically linked libraries (DLLs). As a result, some important Python-based applications severely stress a system's dynamic linking and loading capabilities and also cause significant difficulties for most development-environment tools, such as debuggers. Furthermore, using the Python paradigm for large-scale MPI-based applications can create significant file I/O and further stress tools and operating systems. In this paper, we present Pynamic, the first benchmark program to support configurable emulation of a wide range of the DLL usage of Python-based applications for large-scale systems. Pynamic has already accurately reproduced system software and tool issues encountered by important large Python-based scientific applications on our supercomputers. Pynamic provided insight for our system software and tool vendors, and our application developers, into the impact of several design decisions. As we describe the Pynamic benchmark, we highlight some of the issues discovered in our large-scale system software and tools using Pynamic.
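
    The kind of dynamic-loading stress Pynamic emulates can be approximated in a few lines. This is a rough, hypothetical sketch (not Pynamic itself), and it assumes a Unix-like system with a C compiler available as "cc":

        # Illustrative sketch only: build many tiny shared libraries, then time how
        # long the dynamic linker takes to load them all via ctypes (one dlopen each).
        import ctypes, os, subprocess, tempfile, time

        def build_libs(n, workdir):
            paths = []
            for i in range(n):
                src = os.path.join(workdir, f"m{i}.c")
                lib = os.path.join(workdir, f"libm{i}.so")
                with open(src, "w") as f:
                    f.write(f"int entry_{i}(void) {{ return {i}; }}\n")
                subprocess.check_call(["cc", "-shared", "-fPIC", "-o", lib, src])
                paths.append(lib)
            return paths

        with tempfile.TemporaryDirectory() as d:
            libs = build_libs(50, d)
            start = time.perf_counter()
            handles = [ctypes.CDLL(p) for p in libs]
            print(f"loaded {len(handles)} libraries in {time.perf_counter() - start:.3f}s")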
  • Delaunay Triangulation with Transactions and Barriers

    Publication Year: 2007 , Page(s): 107 - 113
    Cited by:  Papers (5)  |  Patents (2)

    Transactional memory has been widely hailed as a simpler alternative to locks in multithreaded programs, but few nontrivial transactional programs are currently available. We describe an open-source implementation of Delaunay triangulation that uses transactions as one component of a larger parallelization strategy. The code is written in C++, for use with the RSTM software transactional memory library (also open source). It employs one of the fastest known sequential algorithms to triangulate geometrically partitioned regions in parallel; it then employs alternating, barrier-separated phases of transactional and partitioned work to stitch those regions together. Experiments on multiprocessor and multicore machines confirm excellent single-thread performance and good speedup with increasing thread count. Since execution time is dominated by geometrically partitioned computation, performance is largely insensitive to the overhead of transactions, but highly sensitive to any costs imposed on sharable data that are currently "privatized".
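
    The alternation between partitioned and synchronized phases described above has a simple skeleton. The following sketch only shows that structure, using threading.Barrier and a lock as a stand-in for the transactional stitch phase; it is not the paper's C++/RSTM code:

        # Illustrative sketch only: barrier-separated partitioned and "stitch" phases.
        import threading

        NUM_THREADS = 4
        barrier = threading.Barrier(NUM_THREADS)
        stitch_lock = threading.Lock()
        regions = [list(range(i * 10, (i + 1) * 10)) for i in range(NUM_THREADS)]
        seam = []

        def worker(tid):
            local_result = sum(regions[tid])      # phase 1: purely local, partitioned work
            barrier.wait()
            with stitch_lock:                     # phase 2: shared work (transactions in the paper)
                seam.append((tid, local_result))
            barrier.wait()

        threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_THREADS)]
        for t in threads: t.start()
        for t in threads: t.join()
        print(sorted(seam))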
  • FacePerf: Benchmarks for Face Recognition Algorithms

    Publication Year: 2007 , Page(s): 114 - 119
    Cited by:  Papers (1)

    In this paper we present a collection of C and C++ biometric performance benchmark algorithms called FacePerf. The benchmark includes three different face recognition algorithms that are historically important to the face recognition community: Haar-based face detection, Principal Components Analysis, and Elastic Bunch Graph Matching. The algorithms are fast enough to be useful in realtime systems; however, improving performance would allow the algorithms to process more images or search larger face databases. Bottlenecks for each phase in the algorithms have been identified. A cosine approximation was able to reduce the execution time of the Elastic Bunch Graph Matching implementation by 32%.
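
    The cosine approximation mentioned above is a classic table-plus-interpolation trick. A minimal sketch follows; the table size and interpolation scheme are hypothetical, not taken from FacePerf:

        # Illustrative sketch only: table-driven cosine approximation with a quick
        # check of its worst-case error against math.cos.
        import math

        TABLE_SIZE = 4096
        STEP = 2 * math.pi / TABLE_SIZE
        TABLE = [math.cos(i * STEP) for i in range(TABLE_SIZE + 1)]

        def fast_cos(x):
            x = x % (2 * math.pi)                 # reduce to [0, 2*pi)
            pos = x / STEP
            i = int(pos)
            frac = pos - i
            return TABLE[i] + frac * (TABLE[i + 1] - TABLE[i])   # linear interpolation

        worst = max(abs(fast_cos(k * 0.001) - math.cos(k * 0.001)) for k in range(20000))
        print(f"max error over [0, 20): {worst:.2e}")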
  • HD-VideoBench. A Benchmark for Evaluating High Definition Digital Video Applications

    Publication Year: 2007 , Page(s): 120 - 125
    Cited by:  Papers (8)

    HD-VideoBench is a benchmark devoted to high-definition (HD) digital video processing. It includes a set of video encoders and decoders (codecs) for the MPEG-2, MPEG-4, and H.264 video standards. The applications were carefully selected, taking into account the quality and portability of the code, the representativeness of the video application domain, the availability of high-performance optimizations, and distribution under a free license. Additionally, HD-VideoBench defines a set of input sequences and configuration parameters for the video codecs which are appropriate for the HD video domain.
  • Seekable Compressed Traces

    Publication Year: 2007 , Page(s): 129 - 138
    Cited by:  Papers (2)

    Program traces are commonly used for purposes such as profiling, processor simulation, and program slicing. Uncompressed, these traces are often too large to exist on disk. Although existing trace compression algorithms achieve high compression rates, they sacrifice the accessibility of uncompressed traces; typical compressed traces must be traversed linearly to reach a desired position in the stream. This paper describes seekable compressed traces that allow arbitrary positioning in the compressed data stream. Furthermore, we enhance existing value-prediction-based techniques to achieve higher compression rates, particularly for difficult-to-compress traces. Our base algorithm achieves a harmonic mean compression rate for SPEC2000 memory address traces that is 3.47 times better than existing methods. We introduce the concept of seekpoints that enable fast seeking to positions evenly distributed throughout a compressed trace. Adding seekpoints enables rapid sampling and backwards traversal of compressed traces. At a granularity of every 10 M instructions, seekpoints only increase trace sizes by an average factor of 2.65.
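
    The seekpoint idea, decoupled from the paper's value-prediction compressor, can be demonstrated with an ordinary deflate stream: emit a full flush every N records, remember the byte offset, and later start decompressing from any remembered offset instead of from the beginning. A minimal sketch, using raw deflate via zlib and a toy address trace:

        # Illustrative sketch only: seekpoints in a compressed trace via full flushes.
        import zlib

        BLOCK = 1000
        records = [f"addr {i * 64:#x}\n".encode() for i in range(10_000)]   # toy trace

        comp = zlib.compressobj(6, zlib.DEFLATED, -15)      # raw deflate, no header
        out = bytearray()
        seekpoints = [(0, 0)]                               # (first record index, byte offset)
        for i, rec in enumerate(records, 1):
            out += comp.compress(rec)
            if i % BLOCK == 0 and i < len(records):
                out += comp.flush(zlib.Z_FULL_FLUSH)        # reset state, byte-align output
                seekpoints.append((i, len(out)))
        out += comp.flush()

        # Jump to the block containing record 4321 without decompressing earlier data.
        start_rec, offset = max(sp for sp in seekpoints if sp[0] <= 4321)
        text = zlib.decompressobj(-15).decompress(bytes(out[offset:]))
        print(text.splitlines()[4321 - start_rec].decode())   # -> "addr 0x43840"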
  • Analysis of Statistical Sampling in Microarchitecture Simulation: Metric, Methodology and Program Characterization

    Publication Year: 2007 , Page(s): 139 - 148

    Statistical sampling, especially stratified random sampling, is a promising technique for estimating the performance of a benchmark program without executing the complete program on microarchitecture simulators or real machines. The accuracy of the performance estimate and the simulation cost depend on three parameters, namely the interval size, the sample size, and the number of phases (or strata). Optimum values for these three parameters depend on the performance behavior of the program and the microarchitecture configuration being evaluated. In this paper, we quantify the effect of these three parameters and their interactions on the accuracy of the performance estimate and the simulation cost. We use the Confidence Interval of estimated Mean (CIM), a metric derived from statistical sampling theory, to measure the accuracy of the performance estimate; we also discuss why CIM is an appropriate metric for this analysis. We use the total number of instructions simulated and the total number of samples measured as cost parameters. Finally, we characterize 21 SPEC CPU2000 benchmarks based on our analysis.
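
    The underlying calculation is the familiar confidence interval of a sample mean. A minimal sketch for simple random sampling (the paper's CIM additionally accounts for stratification), using a synthetic per-interval CPI trace:

        # Illustrative sketch only: 95% confidence interval of the estimated mean CPI.
        import math, random, statistics

        random.seed(0)
        population = [1.0 + 0.3 * math.sin(i / 50) for i in range(100_000)]   # synthetic per-interval CPI
        sample = random.sample(population, 400)

        mean = statistics.fmean(sample)
        sem = statistics.stdev(sample) / math.sqrt(len(sample))   # standard error of the mean
        print(f"estimated mean CPI = {mean:.4f} +/- {1.96 * sem:.4f} (95% CI)")
        print(f"true mean CPI      = {statistics.fmean(population):.4f}")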
  • Easy and Efficient Disk I/O Workload Characterization in VMware ESX Server

    Publication Year: 2007 , Page(s): 149 - 158
    Cited by:  Papers (9)

    Collection of detailed disk I/O characteristics for workloads is the first step in tuning disk subsystem performance. This paper presents an efficient implementation of disk I/O workload characterization using online histograms in the VMware ESX Server virtual machine hypervisor. This technique allows transparent and online collection of essential workload characteristics for arbitrary, unmodified operating system instances running in virtual machines. For analysis that cannot be done efficiently online, we provide a virtual SCSI command tracing framework. Our online histograms encompass essential disk I/O performance metrics, including I/O block size, latency, spatial locality, I/O interarrival period, and active queue depth. We demonstrate our technique on workloads of Filebench, DBT-2, and large file copy running in virtual machines, and provide an analysis of the differences between the ZFS and UFS filesystems on Solaris. We show that our implementation introduces negligible overheads in CPU, memory, and latency and yet is able to capture essential workload characteristics.
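
    An online histogram of the kind described above only needs fixed bucket boundaries and one counter per bucket, so it can summarize millions of requests in constant space. A minimal sketch with hypothetical latency buckets (not the hypervisor's implementation):

        # Illustrative sketch only: constant-space online histogram of I/O latencies.
        import bisect, random

        class OnlineHistogram:
            def __init__(self, bucket_limits_us):
                self.limits = sorted(bucket_limits_us)        # upper edges; last bucket catches overflow
                self.counts = [0] * (len(self.limits) + 1)

            def record(self, latency_us):
                self.counts[bisect.bisect_left(self.limits, latency_us)] += 1

            def dump(self):
                lo = 0
                for limit, count in zip(self.limits + [float("inf")], self.counts):
                    print(f"  {lo:>8} - {limit:>8} us: {count}")
                    lo = limit

        hist = OnlineHistogram([100, 250, 500, 1000, 5000, 20000])
        random.seed(0)
        for _ in range(10_000):
            hist.record(random.lognormvariate(6, 1))          # synthetic latencies (microseconds)
        hist.dump()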
  • An Observation-Based Approach to Performance Characterization of Distributed n-Tier Applications

    Publication Year: 2007 , Page(s): 161 - 170
    Cited by:  Papers (5)  |  Patents (1)

    The characterization of distributed n-tier application performance is an important and challenging problem due to the complex structure of these applications and the significant variations in their workloads. Theoretical models have difficulties with such a wide range of environmental and workload settings. Experimental approaches using manual scripts are error-prone, time-consuming, and expensive. We use code generation techniques and tools to create and run the scripts for large-scale experimental observation of n-tier benchmark application performance over a wide range of parameter settings and software/hardware combinations. Our experiments show the feasibility of experimental observation as a sound basis for performance characterization, by studying in detail the performance achieved by (up to 3) database servers and (up to 12) application servers in the RUBiS benchmark with a workload of up to 2700 concurrent users.
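
    The experiment-generation step can be pictured as enumerating a parameter grid and emitting one run command per point. The following sketch is purely hypothetical (the command name and flags are invented for illustration), but it shows why generated scripts scale better than hand-written ones:

        # Illustrative sketch only: generate a grid of n-tier experiment configurations.
        from itertools import product

        app_servers = [2, 4, 8, 12]
        db_servers = [1, 2, 3]
        users = [600, 1500, 2700]

        for a, d, u in product(app_servers, db_servers, users):
            # Each line could be written into a generated run script for the testbed.
            print(f"run_benchmark --app-servers {a} --db-servers {d} --users {u}")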