
2009 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2009)

Date: 26-28 April 2009


Displaying Results 1 - 25 of 34
  • IEEE International Symposium on Performance Analysis of Systems and Software

    Page(s): i
  • [Copyright notice]

    Page(s): ii
  • Message from the Program Chair

    Page(s): iii - iv
  • ISPASS 2009 people

    Page(s): v - vi
  • ISPASS 2009 reviewers

    Page(s): vii
  • Accelerating architecture research

    Page(s): viii

    With the recent demonstration of 32nm processors, we have seen Moore's law provide another large increase in the number of transistors. While more transistors provide architects with a great opportunity, I believe we have been observing increasing challenges in finding the most effective uses for these transistors. Design team size, mask costs and fabrication costs are all increasing, so there is growing pressure to make the right decisions about which research ideas to bring forward to design. Unfortunately, our existing evaluation methodologies are proving increasingly ineffective at providing compelling evidence that a new idea warrants inclusion in future designs. In this talk, I will elaborate on these challenges and discuss some approaches to improve our ability to prove the merit of architectural ideas. In particular, there is a recent movement toward using field-programmable gate arrays (FPGAs) as the basis for evaluating future systems. I will therefore outline the alternative approaches to using FPGAs, with an emphasis on using FPGAs for performance modeling. Because designing hardware models is far more complicated than writing software models, the discussion will also cover techniques to reduce that complexity, including a practical approach to modularizing the model, separation of the functional and timing aspects of the simulation, and additional infrastructure important for performance modeling.
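    One concrete technique mentioned above is the separation of functional and timing concerns in a performance model. The toy C sketch below is purely illustrative (it is not taken from the talk): one routine computes architecturally correct results with no notion of time, while a separate routine charges cycles for the same instruction stream; the opcodes and latencies are invented.

    ```c
    /* Illustrative only: a toy split between a functional model, which computes
     * correct results, and a timing model, which only accounts for cycles. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct { int opcode; int dst, src; } inst_t;   /* 0 = ADD, 1 = MOV */

    /* Functional model: what happens (architectural state update, no time). */
    static void functional_step(const inst_t *i, int64_t regs[16]) {
        if (i->opcode == 0) regs[i->dst] += regs[i->src];
        else                regs[i->dst]  = regs[i->src];
    }

    /* Timing model: when it happens (hypothetical per-opcode latencies). */
    static uint64_t timing_step(const inst_t *i, uint64_t now) {
        return now + (i->opcode == 0 ? 3 : 1);
    }

    int main(void) {
        inst_t prog[] = { {1, 1, 0}, {0, 1, 1} };   /* MOV r1,r0 ; ADD r1,r1 */
        int64_t regs[16] = { 5 };
        uint64_t cycle = 0;
        for (int k = 0; k < 2; k++) {
            functional_step(&prog[k], regs);
            cycle = timing_step(&prog[k], cycle);
        }
        printf("r1 = %lld after %llu cycles\n",
               (long long)regs[1], (unsigned long long)cycle);
        return 0;
    }
    ```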

  • Performance analysis in the real world of on line services

    Page(s): ix

    Performance analysis has always been an integral part of a computer architect's agenda. However, the term performance is largely used to mean “speed”. The dictionary defines performance more broadly as “the manner in which or the efficiency with which something reacts or fulfills its intended purpose”. In today's Internet-based online computing environment, performance has taken on this broader meaning. For example, power and energy efficiency are becoming as important as, or more important than, speed of computation as measures of system performance. The industry's ability to deliver speed has outpaced the ability of most applications to consume it effectively. This talk will discuss how performance is viewed in the world of online Web services from an end user's point of view.

  • Table of contents

  • [Blank page]

    Page(s): xii
  • Differentiating the roles of IR measurement and simulation for power and temperature-aware design

    Page(s): 1 - 10

    In temperature-aware design, the presence or absence of a heatsink fundamentally changes the thermal behavior, with important design implications. In recent years, chip-level infrared (IR) thermal imaging has been gaining popularity for studying thermal phenomena and thermal management, as well as for reverse-engineering chip power consumption. Unfortunately, IR thermal imaging requires an unusual cooling solution, which removes the heatsink and applies an IR-transparent liquid flow over the exposed bare die to carry away the dissipated heat. Because this cooling solution is drastically different from a normal thermal package, its thermal characteristics need to be closely examined. In this paper, we characterize the differences between two cooling configurations: forced air flow over a copper heatsink (AIR-SINK) and laminar oil flow over bare silicon (OIL-SILICON). For the comparison, we modify the HotSpot thermal model by adding the IR-transparent oil flow and the secondary heat transfer path through the package pins, hence modeling what the IR camera actually sees at runtime. We show that OIL-SILICON and AIR-SINK differ significantly in both transient and steady-state thermal responses. OIL-SILICON has a much slower short-term transient response, which makes dynamic thermal management less efficient. In addition, for OIL-SILICON, the direction of oil flow plays an important role by changing hot spot location, thus impacting hot spot identification and thermal sensor placement. These results imply that the power- and temperature-aware design process cannot rely on IR measurements alone. Simulation and IR measurement are both needed and are complementary techniques.
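    A single-node lumped RC model is a common way to reason about the kind of transient and steady-state differences described above. The sketch below only illustrates that generic formula, T(t) = T_amb + P*R*(1 - exp(-t/(R*C))); the R and C values are invented and are not the paper's calibrated parameters for AIR-SINK or OIL-SILICON.

    ```c
    /* Generic single-node RC thermal model (not the paper's modified HotSpot model).
     * P*R sets the steady-state temperature rise; the time constant R*C sets how
     * quickly the transient settles. Compile with -lm. */
    #include <math.h>
    #include <stdio.h>

    static double temp(double t, double T_amb, double P, double R, double C) {
        return T_amb + P * R * (1.0 - exp(-t / (R * C)));
    }

    int main(void) {
        double P = 30.0, T_amb = 45.0;      /* watts, degrees C: invented values   */
        double R_a = 0.5, C_a = 0.1;        /* cooling setup A: K/W, J/K (invented)*/
        double R_b = 0.8, C_b = 2.0;        /* cooling setup B: larger tau = R*C   */
        for (double t = 0.0; t <= 4.0; t += 1.0)
            printf("t=%.0fs  A=%.1fC  B=%.1fC\n", t,
                   temp(t, T_amb, P, R_a, C_a), temp(t, T_amb, P, R_b, C_b));
        return 0;
    }
    ```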

  • User- and process-driven dynamic voltage and frequency scaling

    Page(s): 11 - 22

    We describe and evaluate two new, independently applicable power reduction techniques for power management on processors that support dynamic voltage and frequency scaling (DVFS): user-driven frequency scaling (UDFS) and process-driven voltage scaling (PDVS). In PDVS, a CPU-customized profile is derived offline that encodes the minimum voltage needed to achieve stability at each combination of CPU frequency and temperature. On a typical processor, PDVS reduces the voltage below the worst-case minimum operating voltages given in datasheets. UDFS, on the other hand, dynamically adapts CPU frequency to the individual user and the workload through direct user feedback. Our UDFS algorithms dramatically reduce typical operating frequencies and voltages while maintaining performance at a satisfactory level for each user. We evaluate our techniques independently and together through user studies conducted on a Pentium M laptop running Windows applications. We measure the overall system power and temperature reduction achieved by our methods. Combining PDVS and the best UDFS scheme reduces measured system power by 49.9% (27.8% PDVS, 22.1% UDFS), averaged across all our users and applications, compared to Windows XP DVFS. The average temperature of the CPU is decreased by 13.2 °C. User trace-driven simulation to evaluate the CPU alone indicates average CPU dynamic power savings of 57.3% (32.4% PDVS, 24.9% UDFS), with a maximum reduction of 83.4%. In a multitasking environment, the same UDFS+PDVS technique reduces the CPU dynamic power by 75.7% on average.
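    As a rough illustration of what a PDVS-style profile could look like, the sketch below keeps a small (frequency, temperature) table of minimum stable voltages and applies the standard dynamic-power relation P = C*V^2*f. All table entries, the datasheet voltage and the capacitance are invented for illustration; they are not the paper's measurements.

    ```c
    /* Hypothetical minimum-voltage profile and dynamic-power comparison. */
    #include <stdio.h>

    #define NFREQ 3
    #define NTEMP 2

    static const double freq_mhz[NFREQ] = { 600, 1000, 1400 };
    /* min_volt[f][t]: invented minimum stable voltage (V) per frequency/temperature bin */
    static const double min_volt[NFREQ][NTEMP] = {
        { 0.80, 0.82 },
        { 0.95, 0.98 },
        { 1.10, 1.14 },
    };

    /* Classic dynamic-power relation P = C * V^2 * f (effective capacitance in nF). */
    static double dyn_power(double cap_nF, double volt, double f_mhz) {
        return cap_nF * 1e-9 * volt * volt * f_mhz * 1e6;   /* watts */
    }

    int main(void) {
        int f = 1, temp_bin = 0;             /* run at 1000 MHz in the cooler bin        */
        double v_sheet = 1.05;               /* hypothetical datasheet worst-case voltage */
        double v_prof  = min_volt[f][temp_bin];
        double cap = 1.0;                    /* effective switched capacitance (invented) */
        printf("at %.0f MHz: datasheet voltage -> %.2f W, profiled voltage -> %.2f W\n",
               freq_mhz[f], dyn_power(cap, v_sheet, freq_mhz[f]),
               dyn_power(cap, v_prof, freq_mhz[f]));
        return 0;
    }
    ```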

  • Accuracy of performance counter measurements

    Page(s): 23 - 32

    Many experimental performance evaluations depend on accurate measurements of the cost of executing a piece of code. Often these measurements are conducted using infrastructures that access hardware performance counters. Most modern processors provide such counters to count microarchitectural events such as retired instructions or clock cycles. These counters can be difficult to configure, may not be programmable or readable from user-level code, and cannot discriminate between events caused by different software threads. Various software infrastructures address this problem by providing access to per-thread counters from application code. This paper constitutes the first comparative study of the accuracy of three commonly used measurement infrastructures (perfctr, perfmon2, and PAPI) on three common processors (Pentium D, Core 2 Duo, and AMD Athlon 64 X2).
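    For readers unfamiliar with these infrastructures, the snippet below shows a minimal PAPI measurement of the two events mentioned in the abstract (retired instructions and clock cycles). It assumes PAPI is installed and that the PAPI_TOT_INS and PAPI_TOT_CYC presets are available on the machine; the measured loop is just a placeholder.

    ```c
    /* Minimal PAPI sketch: count retired instructions and cycles around a code region. */
    #include <papi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        int evset = PAPI_NULL;
        long long counts[2];

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) exit(1);
        if (PAPI_create_eventset(&evset) != PAPI_OK) exit(1);
        if (PAPI_add_event(evset, PAPI_TOT_INS) != PAPI_OK) exit(1);  /* retired instructions */
        if (PAPI_add_event(evset, PAPI_TOT_CYC) != PAPI_OK) exit(1);  /* total cycles         */

        PAPI_start(evset);
        volatile double x = 0.0;
        for (int i = 0; i < 1000000; i++) x += i * 0.5;   /* code under measurement */
        PAPI_stop(evset, counts);

        printf("instructions=%lld cycles=%lld\n", counts[0], counts[1]);
        return 0;
    }
    ```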

  • GARNET: A detailed on-chip network model inside a full-system simulator

    Page(s): 33 - 42

    Until very recently, microprocessor designs were computation-centric. On-chip communication was frequently ignored because it was fast and single-cycle, and interconnect power was insignificant compared to transistor power. With uniprocessor designs providing diminishing returns and the advent of chip multiprocessors (CMPs) in mainstream systems, the on-chip network that connects different processing cores has become a critical part of the design. Transistor miniaturization has led to high global wire delay and interconnect power comparable to transistor power. CMP design proposals can no longer ignore the interaction between the memory hierarchy and the interconnection network that connects the various elements. This necessitates a detailed and accurate interconnection network model within a full-system evaluation framework; ignoring the interconnect details might lead to inaccurate results when simulating a CMP architecture. It also becomes important to analyze the impact of interconnection network optimization techniques on full-system behavior. In this light, we developed GARNET, a detailed cycle-accurate interconnection network model inside the GEMS full-system simulation framework. GARNET models a classic five-stage pipelined router with virtual channel (VC) flow control. Microarchitectural details, such as flit-level input buffers, routing logic, allocators and the crossbar switch, are modeled. GARNET, along with GEMS, provides a detailed and accurate memory system timing model. To demonstrate the importance and potential impact of GARNET, we evaluate shared and private L2 CMP designs with a realistic state-of-the-art interconnection network against the original GEMS simple network, with the objective of determining which configuration is better for a particular workload. We show that not modeling the interconnect in detail might lead to an incorrect outcome. We also evaluate Express Virtual Channels (EVCs), an on-chip network flow control proposal, in a full-system fashion. We show that by improving on-chip network latency and throughput, EVCs do lead to better overall system runtime; however, the impact varies widely across applications.
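    The sketch below is not GARNET code; it only illustrates the per-flit bookkeeping a classic five-stage virtual-channel router pipeline tracks (buffer write, route computation, VC allocation, switch allocation, switch traversal), with arbitration and stalls omitted and placeholder routing and allocation decisions.

    ```c
    /* Toy per-flit state for a five-stage VC router pipeline (not GARNET itself). */
    #include <stdio.h>
    #include <stdint.h>

    enum stage { BUF_WRITE, ROUTE_COMPUTE, VC_ALLOC, SWITCH_ALLOC, SWITCH_TRAVERSAL, DONE };

    typedef struct {
        uint64_t payload;
        int is_head, is_tail;   /* packets are split into head/body/tail flits */
        int out_port;           /* chosen at route computation                 */
        int vc;                 /* virtual channel granted at VC allocation    */
        enum stage stage;
    } flit_t;

    /* One router cycle for one flit; a real model arbitrates among competing
     * flits at the allocation stages and may stall here. */
    static void router_tick(flit_t *f) {
        if (f->stage == ROUTE_COMPUTE) f->out_port = 1;   /* placeholder decision */
        if (f->stage == VC_ALLOC)      f->vc = 0;         /* placeholder grant    */
        if (f->stage != DONE) f->stage = (enum stage)(f->stage + 1);
    }

    int main(void) {
        flit_t f = { .payload = 0x1234, .is_head = 1, .stage = BUF_WRITE };
        for (int cycle = 1; f.stage != DONE; cycle++) {
            router_tick(&f);
            printf("cycle %d: stage -> %d\n", cycle, (int)f.stage);
        }
        return 0;
    }
    ```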

  • Cetra: A trace and analysis framework for the evaluation of Cell BE systems

    Page(s): 43 - 52

    The Cell Broadband Engine Architecture (CBEA) is a heterogeneous multiprocessor architecture developed by Sony, Toshiba and IBM. The major implementation of this architecture is the Cell Broadband Engine (Cell for short), a processor that contains one generic PowerPC core and eight accelerators. The Cell is targeted at high-performance computing systems and consumer-level devices that have high computational requirements. The workloads for the former are generally run in a queue-based environment, while those for the latter are multiprogrammed. Applications for the Cell are composed of multiple parallel tasks: one runs on the PowerPC core and one or more run on the accelerators. The operating system (OS) is in charge of scheduling these tasks onto the physical processors, and such scheduling decisions become critical in multiprogrammed environments. System developers need a way to analyze how user applications behave in these conditions in order to tune the OS internal algorithms. This article presents Cetra, a new tool-set that allows system developers to study how Cell workloads interact with the Linux OS kernel. First, we outline the major features of Cetra and provide a detailed description of its internals. Then, we demonstrate the usefulness of Cetra by presenting a case study that shows the features of the tool-set and allows us to compare the results with those provided by other performance analysis tools available on the market. Finally, we describe another case study in which we discovered a scheduling starvation bug using Cetra.

  • Zesto: A cycle-level simulator for highly detailed microarchitecture exploration

    Page(s): 53 - 64

    For academic computer architecture research, a large number of publicly available simulators make use of relatively simple abstractions for the microarchitecture of the processor pipeline. For some types of studies, such as those for multi-core cache coherence designs, a simple pipeline model may suffice. For detailed microarchitecture research, such as studies that are sensitive to the exact behavior of out-of-order scheduling, ALU and bypass network contention, and resource management (e.g., RS and ROB entries), an over-simplified model is not representative of modern processor organizations. We present a new timing simulator that models a modern x86 microarchitecture at a very low level, including out-of-order scheduling and execution that much more closely mirrors current implementations, a detailed cache/memory hierarchy, as well as many x86-specific microarchitecture features (e.g., simple vs. complex decoders, micro-op decomposition and fusion, and microcode lookup overhead for long/complex x86 instructions).

  • Lonestar: A suite of parallel irregular programs

    Page(s): 65 - 76

    Until recently, parallel programming has largely focused on the exploitation of data-parallelism in dense matrix programs. However, many important application domains, including meshing, clustering, simulation, and machine learning, have very different algorithmic foundations: they require building, computing with, and modifying large sparse graphs. In the parallel programming literature, these types of applications are usually classified as irregular applications, and relatively little attention has been paid to them. To better study and understand the patterns of parallelism and locality in sparse graph computations, we are building the Lonestar benchmark suite. In this paper, we characterize the first five programs from this suite, which target domains such as data mining, survey propagation, and design automation. We show that even such irregular applications often expose large amounts of parallelism in the form of amorphous data-parallelism. Our speedup numbers demonstrate that this new type of parallelism can be successfully exploited on modern multi-core machines.
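    The following toy program illustrates the worklist pattern that underlies amorphous data-parallelism: processing one graph node can activate its neighbours and create new work. It is a generic sequential sketch (a simple relaxation over a three-node graph), not code from the Lonestar suite.

    ```c
    /* Worklist-style irregular computation: work items may create more work. */
    #include <stdio.h>

    #define MAXN 16
    typedef struct { int nbr[4]; int nnbr; int value; } node_t;

    static node_t graph[MAXN];
    static int worklist[64], wl_top;

    static void push(int v) { worklist[wl_top++] = v; }

    /* Hypothetical operator: relax a node's neighbours and report whether
     * anything changed (so the neighbours must be revisited). */
    static int process(int v) {
        int changed = 0;
        for (int i = 0; i < graph[v].nnbr; i++) {
            int u = graph[v].nbr[i];
            if (graph[u].value > graph[v].value + 1) {
                graph[u].value = graph[v].value + 1;
                changed = 1;
            }
        }
        return changed;
    }

    int main(void) {
        /* Tiny three-node chain: 0 - 1 - 2. */
        graph[0] = (node_t){ {1},    1, 0   };
        graph[1] = (node_t){ {0, 2}, 2, 100 };
        graph[2] = (node_t){ {1},    1, 100 };
        push(0);
        while (wl_top > 0) {                 /* worklist loop */
            int v = worklist[--wl_top];
            if (process(v))
                for (int i = 0; i < graph[v].nnbr; i++) push(graph[v].nbr[i]);
        }
        for (int v = 0; v < 3; v++) printf("node %d: %d\n", v, graph[v].value);
        return 0;
    }
    ```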

  • Exploring speculative parallelism in SPEC2006

    Page(s): 77 - 88

    The computer industry adopted multi-threaded and multi-core architectures when the increase in clock rates stalled in the early 2000s, in the hope that these architectures could sustain the continuous improvement of single-program performance. However, traditional parallelizing compilers often fail to effectively parallelize general-purpose applications, which typically have complex control flow and excessive pointer usage. Recently, hardware techniques such as Transactional Memory (TM) and Thread-Level Speculation (TLS) have been proposed to simplify the task of parallelization by using speculative threads. The potential of speculative parallelism in general-purpose applications such as SPEC CPU 2000 has been well studied and shown to be moderate. Preliminary work examining the potential parallelism in SPEC 2006 deployed parallel threads with a restrictive TLS execution model and limited compiler support, and thus showed only limited performance potential. In this paper, we first analyze the cross-iteration dependence behavior of the SPEC 2006 benchmarks and show that more parallelism potential is available in SPEC 2006 than in SPEC 2000. We then use a state-of-the-art profile-driven TLS compiler to identify loops that can be speculatively parallelized. Overall, we find that with optimal loop selection we can potentially achieve an average speedup of 60% on four cores over what could be achieved by a traditional parallelizing compiler such as Intel's ICC compiler. We also find that an additional 11% improvement can potentially be obtained on selected benchmarks using 8 cores when we extend TLS to multiple loop levels instead of restricting it to a single loop level.
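    The toy loop below illustrates the kind of ambiguous cross-iteration dependence such an analysis looks for: iterations conflict only when the (hypothetical) hash collides, so most iterations could be run speculatively in parallel and squashed only on an actual conflict. It is not taken from SPEC or from the paper's compiler.

    ```c
    /* A loop whose iterations *may* conflict through a[], but usually do not. */
    #include <stdio.h>

    #define N 1024
    #define TABLE 97

    static int a[TABLE];

    static int hash(int i) { return (i * 31) % TABLE; }   /* illustrative hash */

    int main(void) {
        for (int i = 0; i < N; i++) {
            int k = hash(i);
            a[k] = a[k] + i;   /* cross-iteration dependence only when hash(i) repeats */
        }
        long sum = 0;
        for (int k = 0; k < TABLE; k++) sum += a[k];
        printf("%ld\n", sum);  /* N*(N-1)/2 = 523776 regardless of execution order */
        return 0;
    }
    ```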

  • Machine learning based online performance prediction for runtime parallelization and task scheduling

    Page(s): 89 - 100

    With the emerging many-core paradigm, parallel programming must extend beyond its traditional realm of scientific applications. Converting existing sequential applications as well as developing next-generation software requires assistance from hardware, compilers and runtime systems to exploit parallelism transparently within applications. These systems must decompose applications into tasks that can be executed in parallel and then schedule those tasks to minimize load imbalance. However, many systems lack a priori knowledge about the execution time of all tasks, which is needed to perform effective load balancing with low scheduling overhead. In this paper, we approach this fundamental problem using machine learning techniques, first to generate performance models for all tasks and then to apply those models to perform automatic performance prediction across program executions. We also extend an existing scheduling algorithm to use the generated task cost estimates for online task partitioning and scheduling. We implement the above techniques in the pR framework, which transparently parallelizes scripts in the popular R language, and evaluate their performance and overhead with both a real-world application and a large number of synthetic representative test scripts. Our experimental results show that our proposed approach significantly improves task partitioning and scheduling, with maximum improvements of 21.8%, 40.3% and 22.1% and average improvements of 15.9%, 16.9% and 4.2% for LMM (a real R application) and for synthetic test cases with independent and dependent tasks, respectively.
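    As a generic illustration of the idea (not the pR implementation), the sketch below fits a linear task-cost model from a few hypothetical past runs and then assigns pending tasks to the least-loaded worker using the predicted costs. All sizes and runtimes are invented.

    ```c
    /* Predicted-cost task assignment from a least-squares runtime model. */
    #include <stdio.h>

    /* Least-squares fit y = b0 + b1*x over n observations. */
    static void fit(const double *x, const double *y, int n, double *b0, double *b1) {
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i]; sxx += x[i] * x[i]; sxy += x[i] * y[i];
        }
        *b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        *b0 = (sy - *b1 * sx) / n;
    }

    int main(void) {
        /* Hypothetical history: (input size, observed runtime in seconds). */
        double size[] = { 100, 200, 400, 800 }, time[] = { 0.9, 2.1, 4.0, 8.2 };
        double b0, b1;
        fit(size, time, 4, &b0, &b1);

        /* Assign each pending task to the currently least-loaded of two workers. */
        double pending[] = { 600, 300, 700, 150, 500 };
        double load[2] = { 0, 0 };
        for (int t = 0; t < 5; t++) {
            double cost = b0 + b1 * pending[t];
            int w = (load[0] <= load[1]) ? 0 : 1;
            load[w] += cost;
            printf("task %d (size %.0f) -> worker %d, predicted %.2fs\n",
                   t, pending[t], w, cost);
        }
        printf("worker loads: %.2fs %.2fs\n", load[0], load[1]);
        return 0;
    }
    ```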

  • WARP: Enabling fast CPU scheduler development and evaluation

    Page(s): 101 - 112

    Developing CPU scheduling algorithms and understanding their impact in practice can be difficult and time consuming due to the need to modify and test operating system kernel code and measure the resulting performance on a consistent workload of real applications. To address this problem, we have developed WARP, a trace-driven virtualized scheduler execution environment that can dramatically simplify and speed up the development of CPU schedulers. WARP is easy to use, as it can run unmodified kernel scheduling code and can be used with standard user-space debugging and performance monitoring tools. It accomplishes this by virtualizing operating system and hardware events to decouple kernel scheduling code from its native operating system and hardware environment. A simple kernel tracing toolkit can be used with WARP to capture traces of all CPU scheduling related events from a real system; WARP can then replay these traces in its virtualized environment with the same timing characteristics as in the real system. Traces can be used with different schedulers to provide accurate comparisons of scheduling performance for a given application workload. We have implemented a WARP Linux prototype. Our results show that WARP can use application traces captured by its toolkit to accurately reflect the scheduling behavior of the real Linux operating system. Furthermore, testing scheduler behavior using WARP with application traces can be two orders of magnitude faster than running the applications on Linux.
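    A minimal sketch of trace-driven replay in the spirit of what the abstract describes: timestamped scheduling events recorded on a real system are fed to scheduler hooks in virtual time, so no real sleeping or kernel modification is needed. The event types and hook names below are invented, not WARP's interfaces.

    ```c
    /* Replay a recorded event trace against scheduler callbacks in virtual time. */
    #include <stdio.h>

    typedef enum { EV_WAKEUP, EV_BLOCK, EV_TICK } ev_type_t;
    typedef struct { unsigned long long ns; ev_type_t type; int pid; } event_t;

    /* Hypothetical hooks; a real harness would call the unmodified kernel code. */
    static void sched_wakeup(int pid) { printf("  wake pid %d\n", pid); }
    static void sched_block(int pid)  { printf("  block pid %d\n", pid); }
    static void sched_tick(void)      { printf("  tick\n"); }

    int main(void) {
        event_t trace[] = {
            { 1000, EV_WAKEUP, 42 }, { 2000, EV_TICK, 0 },
            { 2500, EV_BLOCK, 42 },  { 3000, EV_TICK, 0 },
        };
        unsigned long long vclock = 0;
        for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++) {
            vclock = trace[i].ns;               /* advance virtual time, no real waiting */
            printf("t=%llu ns\n", vclock);
            switch (trace[i].type) {
            case EV_WAKEUP: sched_wakeup(trace[i].pid); break;
            case EV_BLOCK:  sched_block(trace[i].pid);  break;
            case EV_TICK:   sched_tick();               break;
            }
        }
        return 0;
    }
    ```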

  • CMPSched$im: Evaluating OS/CMP interaction on shared cache management

    Page(s): 113 - 122

    CMPs have now become mainstream and are growing in complexity, with more cores, several shared resources (cache, memory, etc.) and the potential for additional heterogeneous elements. In order to manage these resources, it is becoming critical to optimize the interaction between the execution environment (operating systems, virtual machine monitors, etc.) and the CMP platform. Performance analysis of such OS and CMP interactions is challenging because it requires long-running full-system execution-driven simulations. In this paper, we explore an alternative approach (CMPSched$im) to evaluating the interaction of OS and CMP architectures. In particular, CMPSched$im is focused on evaluating techniques that address the shared cache management problem through better interaction between CMP hardware and operating system scheduling. CMPSched$im enables fast and flexible exploration of this interaction by combining the benefits of (a) binary instrumentation tools (Pin), (b) user-level scheduling tools (Linsched) and (c) simple core/cache simulators. In this paper, we describe CMPSched$im in detail and present case studies showing how it can be used to optimize OS scheduling by taking advantage of novel shared cache monitoring capabilities in the hardware. We also describe OS scheduling heuristics that improve overall system performance through resource monitoring and application classification, achieving near-optimal scheduling that minimizes the effects of contention in the shared cache of a CMP platform.

  • Understanding the cost of thread migration for multi-threaded Java applications running on a multicore platform

    Page(s): 123 - 132

    Multicore systems increase the complexity of performance analysis by introducing a new source of additional cost: thread migration between cores. This paper explores the cost of thread migration for Java applications. We first present a detailed analysis of the sources of migration overhead and show that they result from a combination of several factors, including application behavior (working set size), OS behavior (migration frequency) and hardware characteristics (nonuniform cache sharing among cores). We also present a performance characterization of several multi-threaded Java applications. Surprisingly, our analysis shows that, although significant migration penalties can be produced in controlled environments, the Java applications that we examined do not suffer noticeably from migration overhead when run in a realistic operating environment on an actual multicore platform.

  • The data-centricity of Web 2.0 workloads and its impact on server performance

    Page(s): 133 - 142

    Advances in network performance and browser technologies, coupled with the ubiquity of Internet access and the proliferation of users, have led to the emergence of a new class of Web applications, called Web 2.0. Web 2.0 technologies enable easy collaboration and sharing by allowing users to contribute, modify, and aggregate content using applications like wikis, blogs, social networking communities, and mashups. Web 2.0 applications also make heavy use of Ajax, which allows asynchronous communication between client and server, to provide a richer user experience. In this paper, we analyze the effect of these new features on the infrastructure that hosts these workloads. In particular, we focus on the data-centricity inherent in many Web 2.0 applications and study its impact on the persistence layer in an application server context. Our experimental results reveal some important performance characteristics: we show that frequent Ajax requests, and other requests arising from the participatory nature of Web 2.0, often retrieve and update persistent data. This can lead to frequent database accesses, lock contention, and reduced performance. We also show that problems in the persistence layer, arising from the data-intensive nature of Web 2.0 applications, can lead to poor scalability that can prevent us from exploiting current and future multicore architectures.

  • Characterizing and optimizing the memory footprint of de novo short read DNA sequence assembly

    Page(s): 143 - 152

    In this work, we analyze the memory-intensive bioinformatics problem of “de novo” DNA sequence assembly, which is the process of assembling short DNA sequences obtained by experiment into larger contiguous sequences. In particular, we analyze the performance scaling challenges inherent to de Bruijn graph-based assembly, which is particularly well suited to the data produced by “next generation” sequencing machines. Unlike many bioinformatics codes, which are computation-intensive or control-intensive, we find the memory footprint to be the primary performance issue for de novo sequence assembly. Specifically, we make four main contributions: 1) we demonstrate analytically that performing error correction before sequence assembly enables larger genomes to be assembled in a given amount of memory, 2) we identify that the use of this technique provides the key performance advantage to the leading assembly code, Velvet, 3) we demonstrate how this pre-assembly error correction technique can be subdivided into multiple passes to enable de Bruijn graph-based assembly to scale to even larger genomes, and 4) we demonstrate how Velvet's in-core performance can be improved using memory-centric optimizations.
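    As background for the pre-assembly error-correction point, the toy program below extracts k-mers from a few reads and drops those seen only once, which are likely to contain sequencing errors. It is a deliberately naive counting scheme (assuming reads contain only A, C, G and T), not Velvet's data structures.

    ```c
    /* Toy pre-assembly k-mer filtering: count all 5-mers and drop singletons. */
    #include <stdio.h>
    #include <string.h>

    #define K 5

    static const char BASES[] = "ACGT";

    /* 2-bit-pack a k-mer (A=0, C=1, G=2, T=3) into a table index. */
    static int kmer_index(const char *s) {
        int idx = 0;
        for (int i = 0; i < K; i++)
            idx = idx * 4 + (int)(strchr(BASES, s[i]) - BASES);
        return idx;
    }

    int main(void) {
        const char *reads[] = { "ACGTACGTAC", "CGTACGTACG", "ACGTACGAAC" /* one error */ };
        static int count[1 << (2 * K)];
        for (int r = 0; r < 3; r++)
            for (size_t i = 0; i + K <= strlen(reads[r]); i++)
                count[kmer_index(reads[r] + i)]++;
        int kept = 0, dropped = 0;
        for (int i = 0; i < (1 << (2 * K)); i++) {
            if (count[i] == 0) continue;
            if (count[i] >= 2) kept++;    /* keep k-mers with coverage >= 2 */
            else               dropped++; /* coverage-1 k-mers are suspect  */
        }
        printf("kept %d distinct k-mers, dropped %d as likely errors\n", kept, dropped);
        return 0;
    }
    ```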

  • An analytic model of optimistic Software Transactional Memory

    Page(s): 153 - 162

    An analytic model is proposed to assess the performance of optimistic software transactional memory (STM) systems with in-place memory updates for write operations. Based on an absorbing discrete-time Markov chain, closed-form analytic expressions are developed, which are quickly solved iteratively to determine key parameters of the STM system. The model covers complex implementation details such as read/write locking, data consistency checks and conflict management. It provides fundamental insight into the system behavior as we vary input parameters such as the number and size of concurrent transactions or the number of data objects. Numerical results are validated by comparison with a discrete-event simulation.
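    A much-simplified absorbing-chain calculation in the same spirit, far coarser than the paper's model: a transaction makes L accesses, each conflicting independently with probability p, and restarts from scratch on a conflict. The sketch iterates the expected number of accesses to commit and compares it against the closed-form per-attempt commit probability; L and p are arbitrary example values.

    ```c
    /* Toy absorbing Markov chain: state k = accesses completed, restart on conflict. */
    #include <math.h>
    #include <stdio.h>

    #define L 10

    int main(void) {
        double p = 0.05;                 /* per-access conflict probability (made up) */
        double E[L + 1] = { 0 };         /* E[k] = expected accesses remaining from state k */
        /* Fixed-point iteration of E[k] = 1 + (1-p)*E[k+1] + p*E[0], with E[L] = 0. */
        for (int it = 0; it < 10000; it++)
            for (int k = L - 1; k >= 0; k--)
                E[k] = 1.0 + (1.0 - p) * E[k + 1] + p * E[0];

        double q = pow(1.0 - p, L);      /* probability that one attempt commits */
        printf("expected accesses until commit: %.2f\n", E[0]);
        printf("expected attempts (closed form): %.2f\n", 1.0 / q);
        return 0;
    }
    ```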

  • Analyzing CUDA workloads using a detailed GPU simulator

    Page(s): 163 - 174

    Modern graphics processing units (GPUs) provide sufficiently flexible programming models that understanding their performance can provide insight into designing tomorrow's manycore processors, whether those are GPUs or otherwise. The combination of multiple, multithreaded, SIMD cores makes studying these GPUs useful in understanding tradeoffs among memory, data, and thread level parallelism. While modern GPUs offer orders of magnitude more raw computing power than contemporary CPUs, many important applications, even those with abundant data level parallelism, do not achieve peak performance. This paper characterizes several non-graphics applications written in NVIDIA's CUDA programming model by running them on a novel detailed microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set. For this study, we selected twelve non-trivial CUDA applications demonstrating varying levels of performance improvement on GPU hardware (versus a CPU-only sequential version of the application). We study the performance of these applications on our GPU performance simulator with configurations comparable to contemporary high-end graphics cards. We characterize the performance impact of several microarchitecture design choices, including the choice of interconnect topology, use of caches, design of the memory controller, parallel workload distribution mechanisms, and memory request coalescing hardware. Two observations we make are (1) that for the applications we study, performance is more sensitive to interconnect bisection bandwidth than to latency, and (2) that, for some applications, running fewer threads concurrently than on-chip resources might otherwise allow can improve performance by reducing contention in the memory system.
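    To make the coalescing point concrete, the sketch below models memory request coalescing in the simplest possible way: for one 32-thread warp, it counts how many aligned 128-byte segments the threads' addresses touch, which approximates the number of memory transactions generated. The segment size and addresses are illustrative assumptions, not simulator parameters.

    ```c
    /* Toy memory-coalescing model: distinct 128-byte segments per warp access. */
    #include <stdint.h>
    #include <stdio.h>

    #define WARP 32
    #define SEGMENT 128u

    static int transactions(const uint64_t addr[WARP]) {
        uint64_t segs[WARP];
        int nsegs = 0;
        for (int t = 0; t < WARP; t++) {
            uint64_t s = addr[t] / SEGMENT;
            int seen = 0;
            for (int i = 0; i < nsegs; i++)
                if (segs[i] == s) { seen = 1; break; }
            if (!seen) segs[nsegs++] = s;
        }
        return nsegs;
    }

    int main(void) {
        uint64_t unit_stride[WARP], big_stride[WARP];
        for (int t = 0; t < WARP; t++) {
            unit_stride[t] = 0x1000 + 4u * t;     /* consecutive 4-byte accesses */
            big_stride[t]  = 0x1000 + 512u * t;   /* 512-byte stride, no reuse   */
        }
        printf("unit stride: %d transaction(s)\n", transactions(unit_stride));  /* 1  */
        printf("512B stride: %d transaction(s)\n", transactions(big_stride));   /* 32 */
        return 0;
    }
    ```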
