
2003 IEEE International Workshop on Workload Characterization (WWC-6)

Date: 27 Oct. 2003


  • Identifying program power phase behavior using power vectors

    Page(s): 108 - 118

    Characterizing program behavior is important for both hardware and software research. Most modern applications exhibit distinctly different behavior throughout their runtimes, which constitute several phases of execution that share a greater amount of resemblance within themselves compared to other regions of execution. These execution phases can occur at very large scales, necessitating prohibitively long simulation times for characterization. Due to the implementation of extensive clock gating and additional power and thermal management techniques in modern processors, these program phases are also reflected in program power behavior, which can be used as an alternative means of program behavior characterization for power-oriented research. In this paper, we present our methodology for identifying phases in program power behavior and determining execution points that correspond to these phases, as well as defining a small set of power signatures representative of overall program power behavior. We define a power similarity metric as an intersection of both magnitude based and ratio-wise similarities in the power dissipation of processor components. We then develop a thresholding algorithm in order to partition the power behavior into similarity groups. We illustrate our methodology with the gzip benchmark for its whole runtime and characterize gzip power behavior with both the selected execution points and defined signature vectors.
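
The abstract above names a similarity metric that combines magnitude-based and ratio-wise comparisons, followed by a thresholding step that partitions samples into groups, but it does not give the formulas. The following is only a minimal sketch of that general idea under stated assumptions: a power vector is a list of per-component power values, both distances are plain sums of absolute differences, and grouping is a greedy first-match pass. The function names and both thresholds are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def magnitude_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute per-component power differences (assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def ratio_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Same distance, computed on each component's share of total power."""
    total_a, total_b = sum(a), sum(b)
    if total_a == 0 or total_b == 0:
        return 0.0 if total_a == total_b else float("inf")
    return sum(abs(x / total_a - y / total_b) for x, y in zip(a, b))


def similar(a: Sequence[float], b: Sequence[float],
            mag_limit: float, ratio_limit: float) -> bool:
    """Two vectors are similar only if BOTH distances are under their limits."""
    return (magnitude_distance(a, b) <= mag_limit
            and ratio_distance(a, b) <= ratio_limit)


def group_power_vectors(samples: Sequence[Sequence[float]],
                        mag_limit: float, ratio_limit: float) -> List[int]:
    """Greedy thresholding: join the first similar group, else open a new one."""
    representatives: List[Sequence[float]] = []
    labels: List[int] = []
    for vector in samples:
        for group_id, rep in enumerate(representatives):
            if similar(vector, rep, mag_limit, ratio_limit):
                labels.append(group_id)
                break
        else:
            # No existing group matched, so this vector starts a new group.
            representatives.append(vector)
            labels.append(len(representatives) - 1)
    return labels
```

Requiring both tests to pass is what keeps two samples with identical component proportions but very different total power in separate groups. In this sketch, the first vector of each group serves as its representative signature.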

  • Improving the performance of OLTP workloads on SMP computer systems by limiting modified cache lines

    Page(s): 21 - 29

    Symmetric multiprocessor (SMP) computer systems with more than four CPUs often exhibit significantly lower overall performance than would be expected from the sum of the performance of the individual CPUs. One of the causes of this degradation is the increased average memory latency due to cache-to-cache migration of modified cache lines. Such transfers often incur significantly longer latencies than a simple cache miss, which can be satisfied from main memory. By setting an upper bound on the number of modified cache lines that are allowed to exist in a cache, and writing lines back to main memory when this limit is exceeded, the average memory latency and overall system performance on an online transaction processing (OLTP) workload can be improved. This paper presents an investigation of this concept, called original limiting, on a commercial SMP system. The original limiting concept was implemented in the second level cache (SLC) of the Unisys NX6830 series of SMP systems, which support up to eight CPUs. An original limiting queue (OLQ) was added to limit the number of exclusive or modified lines in the SLC, yielding a 5% improvement in the number of transactions processed per minute by reducing the average memory latency. A variety of experiments indicate that the OLQ is a simple but effective mechanism to enhance the performance of OLTP applications on SMP systems.
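
The idea of bounding the number of modified lines with a queue can be illustrated with a small model. This is a sketch only: it assumes a FIFO replacement order and tracks modified lines alone, neither of which the abstract confirms, and the class name `ModifiedLineLimiter` is hypothetical rather than anything from the Unisys implementation.

```python
from collections import OrderedDict
from typing import List


class ModifiedLineLimiter:
    """Toy model of a bounded queue of modified cache lines (FIFO assumed)."""

    def __init__(self, limit: int) -> None:
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self.limit = limit
        # Keys are line addresses in the order they first became modified.
        self._modified: "OrderedDict[int, None]" = OrderedDict()
        # Lines forced back to main memory, in write-back order.
        self.writebacks: List[int] = []

    def write(self, line_addr: int) -> None:
        """Record a store to a line; write back the oldest line if over limit."""
        if line_addr in self._modified:
            return  # already modified, keeps its original queue position
        self._modified[line_addr] = None
        if len(self._modified) > self.limit:
            oldest, _ = self._modified.popitem(last=False)
            self.writebacks.append(oldest)

    def modified_lines(self) -> List[int]:
        """Addresses currently held in the modified state, oldest first."""
        return list(self._modified)
```

After a forced write-back, a later miss on that line from another CPU can be served from main memory instead of requiring a slower cache-to-cache transfer, which is the latency effect the abstract describes.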

  • Intrinsic data locality of modern scientific workloads

    Page(s): 77 - 85

    Understanding the intrinsic data locality of a workload is essential to understanding and predicting cache performance. The intrinsic data locality of a particular application or workload can be measured in a microarchitecture-independent manner. The data resulting from these measurements ideally can be used to develop an analytic model for predicting memory performance on different cache sizes and configurations. Many studies on data locality use cache hit ratios, a microarchitecture-dependent metric, to examine locality. In this paper, we present a microarchitecture-dependent and a microarchitecture-independent characterization of the SPEC2000 workloads. We present quantitative statistics on the different types of data locality (e.g. spatial and temporal) exhibited by these workloads and we show that the composite intrinsic locality can be correlated to locality measured by cache hit ratio.
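
One widely used microarchitecture-independent measure of temporal locality is the LRU stack (reuse) distance: the number of distinct addresses touched between two accesses to the same address. The abstract does not say which measures the paper uses, so the sketch below is an illustration of that general technique, not the authors' method. The function names are hypothetical, and the list-based stack is quadratic in the worst case, which is acceptable only for short traces.

```python
from typing import Hashable, Iterable, List, Optional


def reuse_distances(trace: Iterable[Hashable]) -> List[Optional[int]]:
    """Return the LRU stack distance of every access (None for a first use)."""
    stack: List[Hashable] = []  # most recently used address is last
    distances: List[Optional[int]] = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            # Everything above pos was touched since the previous use of addr.
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(addr)
    return distances


def lru_hit_ratio(distances: List[Optional[int]], capacity: int) -> float:
    """Hit ratio of a fully associative LRU cache holding `capacity` lines."""
    if not distances:
        return 0.0
    hits = sum(1 for d in distances if d is not None and d < capacity)
    return hits / len(distances)
```

The second function shows how an intrinsic measure can be related to a cache-dependent one: in a fully associative LRU cache, an access hits exactly when its stack distance is smaller than the number of lines, so one pass over the trace gives the hit ratio for every cache size.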

  • Real-time L3 cache simulations using the Programmable Hardware-Assisted Cache Emulator (PHA$E)

    Page(s): 86 - 95

    As the gap between CPU and memory speeds increases, there has been an increasingly important need to study memory-hierarchy designs. Investigations of memory performance have typically been conducted using trace-driven simulation, which can take tremendous resources (e.g. long simulation time, large storage requirements for traces, and high overall cost). Recent works have proposed the use of hardware for performing cache simulations. Such an approach is advantageous as it can be done in real time, which eliminates the need for large storage for traces, reduces the simulation time, and improves the accuracy of the results. This paper discusses our preliminary work with the Programmable Hardware-Assisted Cache Emulator (PHA$E), a system for emulating caches in real time. We discuss the design and implementation of our system. Furthermore, the results of simulating varying sizes of off-chip L3 caches on various workloads (SPECcpu2000, SPECjbb2000, SPECjAppServer2002, and a large vocabulary continuous speech recognition system) are presented and analyzed. Lastly, future research directions are discussed.

  • PacketBench: a tool for workload characterization of network processing

    Page(s): 42 - 50

    Network processing is becoming an increasingly important paradigm as the Internet moves towards an architecture with more complex functionality inside the network. Modern routers not only forward packets, but also process headers and payloads to implement a variety of functions related to security, performance, and customization. It is important to get a detailed understanding of the workloads associated with this processing in order to be able to develop efficient network processing engines. We present a tool called PacketBench, which provides a framework for implementing network processing applications and obtaining an extensive set of workload characteristics. PacketBench provides the support functions to handle various packet traces and manage packet memory. For statistics collection, PacketBench provides the ability to derive a number of microarchitectural and networking related metrics. We present the results of such measurements for four different networking applications ranging from simple packet forwarding to complex packet payload encryption. The results show that such workload analysis has a range of uses from network processor design to application optimization.

  • An analysis of disk performance in VMware ESX server virtual machines

    Page(s): 65 - 76

    VMware ESX Server is a software platform that efficiently multiplexes the hardware resources of a server among virtual machines. This paper studies the performance of a key component of the ESX Server architecture: its storage subsystem. We characterize the performance of native systems and virtual machines using a series of disk microbenchmarks on several different storage systems. We show that the virtual machines perform well compared to native, and that the I/O behavior of virtual machines closely matches that of the native server. We then discuss how the microbenchmarks can be used to estimate virtual machine performance for disk-intensive applications by studying two workloads: a simple file server and a commercial mail server.

  • Performance characterization of TCP/IP packet processing in commercial server workloads

    Page(s): 33 - 41

    TCP/IP is the communication protocol of choice for many current and next generation server applications (Web services, e-commerce, storage, etc.). As a result, the performance of these applications can be heavily dependent on efficient TCP/IP packet processing within the termination nodes. Motivated by this, our work presented in this paper focuses on analyzing the underlying architectural characteristics of the TCP/IP packet processing component within server workloads. Our analysis and characterization methodology is based on in-depth measurement experiments of TCP/IP packet processing performance on Intel's state-of-the-art low-power Pentium® M microprocessor running the Microsoft Windows Server 2003 operating system. We start by analyzing the impact of NIC features such as Large Segment Offload and the use of Jumbo frames on TCP/IP packet processing performance. We then show that the architectural characteristics of transmit-side processing (largely compute-bound) are significantly different from those of receive-side processing (mostly memory-bound). Finally, we quantify the computational requirements for sending and receiving packets within commercial workloads (SPECweb99, TPC-C and TPC-W) and show that they can form a substantial component.

  • Towards workload characterization of auction sites

    Page(s): 12 - 20

    The popularity of online auctions is growing with the participation of businesses and individual customers in various forms of auctions to buy and sell goods and services. This form of electronic commerce is expected to grow and become a significant form of exchange of goods and services, competing on a global scale with traditional fixed-price commerce. A good understanding of the workload of auction sites should provide insights about their activities and help in the process of designing business-oriented metrics and designing novel resource management policies based on these metrics. This paper provides a workload characterization of auction sites including i) a multi-scale analysis of auction traffic and bid activity within auctions, ii) a closing time analysis in terms of number of bids and price variation within auctions, iii) the characteristics of the auction winner in terms of entry time, entry price, and bidding activity, and iv) unique bidder analysis.

  • A characterization of visual feature recognition

    Page(s): 3 - 11

    Natural human interfaces are a key to realizing the dream of ubiquitous computing. This implies that embedded systems must be capable of sophisticated perception tasks. This paper analyzes the nature of a visual feature recognition workload. Visual feature recognition is a key component of a number of important applications, e.g. gesture based interfaces, lip tracking to augment speech recognition, smart cameras, automated surveillance systems, robotic vision, etc. Given the power sensitive nature of the embedded space and the natural conflict between low-power and high-performance implementations, a precise understanding of these algorithms is an important step in developing efficient visual feature recognition applications for the embedded space. In particular, this work analyzes the performance characteristics of flesh toning, face detection and face recognition codes based on well known algorithms. We show that the problem can be decomposed into a pipeline of filters which could lead to efficient implementations as stream processors. With better than 92% hit rate for a modest 16KB L1 data cache, the algorithms have memory system behavior commensurate with embedded processors. However, our results indicate that their execution requirements strain the performance available on current embedded systems.

  • Evaluating and modeling window synchronization in highly multiplexed flows

    Page(s): 51 - 61

    In this paper, we investigate issues of synchronization in highly aggregated flows such as would be found in the Internet backbone. Our hypothesis is that regularly spaced loss events lead to window synchronization in long lived flows. We argue that window synchronization is likely to be more common in the Internet than previously reported. We support our argument with evidence of the existence and evaluation of the characteristics of periodic discrete congestion events using active probe data gathered in the Surveyor infrastructure. When connections experience loss events which are periodic, the aggregate offered load to neighboring links rises and falls in cadence with the loss events. Connections whose cWnd values grow from W/2 to W at approximately the same rate as the loss event period soon synchronize their cWnd additive increases and multiplicative decreases. We find that this window synchronization can scale to large numbers of connections depending on the diversity of roundtrip times of individual flows. A model is presented that predicts important characteristics of the loss events in window synchronized flows including the quantity, intensity, and duration. The model effectively explains the prevalence of discrete loss events in fast links with high multiplexing factors as well as the queue buildup and queue draining phases of congestion.
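
The synchronizing effect of periodic loss on additive-increase, multiplicative-decrease windows can be seen in a toy simulation. This is a sketch under strong simplifying assumptions that are not taken from the paper: all flows share one round-trip time, every loss event hits every flow, the window grows by one segment per round, and a loss round halves the window instead of growing it. The function name `simulate_aimd` is hypothetical.

```python
from typing import List, Sequence


def simulate_aimd(initial_windows: Sequence[float],
                  loss_period: int,
                  rounds: int) -> List[List[float]]:
    """Return the congestion window of every flow after each round trip."""
    if loss_period < 1:
        raise ValueError("loss_period must be at least 1")
    windows = [float(w) for w in initial_windows]
    history: List[List[float]] = []
    for t in range(1, rounds + 1):
        if t % loss_period == 0:
            # Shared periodic loss event: multiplicative decrease for all flows.
            windows = [w / 2.0 for w in windows]
        else:
            # Loss-free round trip: additive increase of one segment.
            windows = [w + 1.0 for w in windows]
        history.append(list(windows))
    return history
```

Under these assumptions each loss period maps a window w to (w + P - 1) / 2 for period P, so the gap between any two flows halves at every loss event and all windows settle into the same sawtooth between P - 1 and 2(P - 1). That matches the abstract's W/2 to W growth with W = 2(P - 1). Flows with differing round-trip times would need a per-flow increase rate, which this sketch omits.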

  • Characterization of embedded applications for decoupled processor architecture

    Page(s): 119 - 127

    The need for performance in embedded applications will lead to the use of dynamic execution on embedded processors in the next few years. However, complete out-of-order superscalar cores are still expensive in terms of silicon area and power dissipation. In this paper, we study the adequacy of a more limited form of dynamic execution, namely decoupled architecture, for embedded applications. Decoupled architecture is known to work very efficiently whenever the execution does not suffer from inter-processor dependencies causing some loss of decoupling, called LOD events. In this study, we address regularity of codes in terms of the LOD events that may occur. We address three aspects of regularity: control regularity, control/memory dependency, and patterns of referencing memory data. Most of the kernels in MiBench will be amenable to efficient performance on a decoupled architecture.

  • Exploiting streams in instruction and data address trace compression

    Page(s): 99 - 107

    Novel research ideas in computer architecture are frequently evaluated using trace-driven simulation. The large size of traces has motivated a variety of techniques for trace reduction. These techniques often combine standard compression algorithms with trace-specific solutions, taking into account the tradeoff between reduction in the trace size and simulation slowdown due to compression. This paper introduces SBC, a new algorithm for instruction and data address trace compression based on instruction streams. The proposed technique significantly reduces trace size and simulation time, and can be successfully combined with general compression algorithms. The SBC technique combined with gzip reduces the size of SPEC CPU2000 traces by a factor of 59 to 97,930, and combined with Sequitur by a factor of 65 to 185,599.
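
The general idea of compressing an instruction address trace by instruction streams can be sketched as follows. This is a simplified illustration, not the SBC algorithm itself: it assumes a stream is a maximal run of sequential addresses with a fixed instruction size, stores each distinct stream once in a table, and replaces the trace with table indices. Data address handling is omitted, and the names `compress` and `decompress` are hypothetical.

```python
from typing import Dict, List, Tuple

Stream = Tuple[int, int]  # (start address, number of instructions)


def compress(trace: List[int],
             insn_size: int = 4) -> Tuple[List[Stream], List[int]]:
    """Split a trace into sequential runs and encode each run as a table index."""
    table: List[Stream] = []
    index_of: Dict[Stream, int] = {}
    indices: List[int] = []
    i = 0
    while i < len(trace):
        start = trace[i]
        length = 1
        # Extend the run while the next address is the sequential successor.
        while (i + length < len(trace)
               and trace[i + length] == start + length * insn_size):
            length += 1
        stream = (start, length)
        if stream not in index_of:
            index_of[stream] = len(table)
            table.append(stream)
        indices.append(index_of[stream])
        i += length
    return table, indices


def decompress(table: List[Stream], indices: List[int],
               insn_size: int = 4) -> List[int]:
    """Expand table indices back into the original address trace."""
    trace: List[int] = []
    for idx in indices:
        start, length = table[idx]
        trace.extend(start + k * insn_size for k in range(length))
    return trace
```

Because loops replay the same few streams many times, the index sequence is short and highly repetitive, which is why a general-purpose compressor applied afterwards tends to do well on it.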

  • 2003 IEEE International Workshop on Workload Characterization (IEEE Cat. No.03EX775)


    The following topics are dealt with: workload characterization; visual feature recognition; OLTP workloads; network characterization; memory characterization; disk characterization; execution phases; stream-based compression and decoupled processor architecture.

  • Author index

    Page(s): 129 - 130
    Freely Available from IEEE