2015 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)

7-11 Feb. 2015

  • [Front cover]

    Publication Year: 2015, Page(s): c1
  • [Title page]

    Publication Year: 2015, Page(s): 1
  • [Copyright notice]

    Publication Year: 2015, Page(s): 1
  • CGO 2015 sponsors & supporters

    Publication Year: 2015, Page(s):1 - 2
  • Message from the general chairs

    Publication Year: 2015, Page(s):1 - 2
  • CGO'15 organizing committee

    Publication Year: 2015, Page(s): 1
  • External reviewers

    Publication Year: 2015, Page(s): 1
  • Table of contents

    Publication Year: 2015, Page(s):1 - 3
  • Improving GPGPU energy-efficiency through concurrent kernel execution and DVFS

    Publication Year: 2015, Page(s):1 - 11
    Cited by:  Papers (15)

    Current generation GPUs can accelerate high-performance, compute-intensive applications by exploiting massive thread-level parallelism. The high performance, however, comes at the cost of increased power consumption. Recently, commercial GPGPU architectures have introduced support for concurrent kernel execution to better utilize the computational/memory resources and thereby improve overall throu...

  • Characterizing and enhancing global memory data coalescing on GPUs

    Publication Year: 2015, Page(s):12 - 22
    Cited by:  Papers (11)

    Effective parallel programming for GPUs requires careful attention to several factors, including ensuring coalesced access of data from global memory. There is a need for tools that can provide feedback to users about statements in a GPU kernel where non-coalesced data access occurs, and assistance in fixing the problem. In this paper, we address both these needs. We develop a two-stage framework ...

  • Automatic data placement into GPU on-chip memory resources

    Publication Year: 2015, Page(s):23 - 33
    Cited by:  Papers (16)

    Although graphics processing units (GPUs) rely on thread-level parallelism to hide long off-chip memory access latency, judicious utilization of on-chip memory resources, including register files, shared memory, and data caches, is critical to application performance. However, explicitly managing GPU on-chip memory resources is a non-trivial task for application developers. More importantly, as on...

  • A parallel abstract interpreter for JavaScript

    Publication Year: 2015, Page(s):34 - 45
    Cited by:  Papers (1)

    We investigate parallelizing flow- and context-sensitive static analysis for JavaScript. Previous attempts to parallelize such analyses for other languages typically start with the traditional framework of sequential dataflow analysis, and then propose methods to parallelize the existing sequential algorithms within this framework. However, we show that this approach is non-optimal and propose a n...

  • MemorySanitizer: Fast detector of uninitialized memory use in C++

    Publication Year: 2015, Page(s):46 - 55
    Cited by:  Papers (3)

    This paper presents MemorySanitizer, a dynamic tool that detects uses of uninitialized memory in C and C++. The tool is based on compile-time instrumentation and relies on bit-precise shadow memory at run-time. A shadow propagation technique is used to avoid false positive reports on copying of uninitialized memory. MemorySanitizer finds bugs at a modest cost of 2.5× in execution time and 2× in memo...

  • On performance debugging of unnecessary lock contentions on multicore processors: A replay-based approach

    Publication Year: 2015, Page(s):56 - 67
    Cited by:  Papers (4)

    Locks have been widely used as an effective synchronization mechanism among processes and threads. However, we observe that a large number of false inter-thread dependencies (i.e., unnecessary lock contentions) exist during the program execution on multicore processors, thereby incurring significant performance overhead. This paper presents a performance debugging framework, PerfPlay, to facilitat...

  • Optimizing binary translation of dynamically generated code

    Publication Year: 2015, Page(s):68 - 78
    Cited by:  Papers (8)

    Dynamic binary translation serves as a core technology that enables a wide range of important tools such as profiling, bug detection, program analysis, and security. Many of the target applications often include large amounts of dynamically generated code, which poses a special performance challenge in maintaining consistency between the source application and the translated application. This pape...

  • Getting in control of your control flow with control-data isolation

    Publication Year: 2015, Page(s):79 - 90
    Cited by:  Papers (4)

    Computer security has become a central focus in the information age. Though enormous effort has been expended on ensuring secure computation, software exploitation remains a serious threat. The software attack surface provides many avenues for hijacking; however, most exploits ultimately rely on the successful execution of a control-flow attack. This pervasive diversion of control flow is made pos...

  • Reactive tiling

    Publication Year: 2015, Page(s):91 - 102
    Cited by:  Papers (5)

    To fully exploit the power of emerging multicore architectures, managing shared resources (i.e., caches) across applications and over time is critical. However, to our knowledge, most prior efforts view this problem from the OS/hardware side, and do not consider whether applications themselves can also participate in this process of managing shared resources. In this paper, we show how an applicat...

  • Branch prediction and the performance of interpreters — Don't trust folklore

    Publication Year: 2015, Page(s):103 - 114
    Cited by:  Papers (9)

    Interpreters have been used in many contexts. They provide portability and ease of development at the expense of performance. The literature of the past decade covers analysis of why interpreters are slow, and many software techniques to improve them. A large proportion of these works focuses on the dispatch loop, and in particular on the implementation of the switch statement: typically an indire...

  • Optimizing the flash-RAM energy trade-off in deeply embedded systems

    Publication Year: 2015, Page(s):115 - 124
    Cited by:  Papers (1)

    Deeply embedded systems often have the tightest constraints on energy consumption, requiring that they consume tiny amounts of current and run on batteries for years. However, they typically execute code directly from flash, instead of the more energy efficient RAM. We implement a novel compiler optimization that exploits the relative efficiency of RAM by statically moving ...

  • EMEURO: A framework for generating multi-purpose accelerators via deep learning

    Publication Year: 2015, Page(s):125 - 135
    Cited by:  Papers (4)

    Approximate computing is a very promising design paradigm for crossing the CPU power wall, primarily driven by the potential to sacrifice output quality for significant gains in performance, energy, and fault tolerance. Unfortunately, existing solutions have primarily either focused on new programming models, or new hardware designs, leaving significant room between these two ends for software-bas...

  • Optimizing and auto-tuning scale-free sparse matrix-vector multiplication on Intel Xeon Phi

    Publication Year: 2015, Page(s):136 - 145
    Cited by:  Papers (8)

    Recently, the Intel Xeon Phi coprocessor has received increasing attention in high performance computing due to its simple programming model and highly parallel architecture. In this paper, we implement sparse matrix vector multiplication (SpMV) for scale-free matrices on the Xeon Phi architecture and optimize its performance. Scale-free sparse matrices are widely used in various application domai...

  • Data provenance tracking for concurrent programs

    Publication Year: 2015, Page(s):146 - 156
    Cited by:  Papers (4)

    We propose Last Writer Slicing (LWS), a mechanism for tracking data provenance information in multithreaded code in a production setting. Last writer slices dynamically track provenance of values by recording the thread and operation that last wrote each variable. We show that this information complements core dumps and greatly improves debuggability. We also propose communication traps (CTraps), a...

  • Locality aware concurrent start for stencil applications

    Publication Year: 2015, Page(s):157 - 166
    Cited by:  Papers (5)

    Stencil computations are at the heart of many physical simulations used in scientific codes. Thus, there exists a plethora of optimization efforts for this family of computations. Among these techniques, tiling techniques that allow concurrent start have proven to be very efficient in providing better performance for these critical kernels. Nevertheless, with many core designs being the norm, thes...

  • Checking correctness of code generator architecture specifications

    Publication Year: 2015, Page(s):167 - 178
    Cited by:  Papers (5)

    Modern instruction sets are complex, and extensions are proposed to them frequently. This makes the task of modelling architecture specifications used by the code generators of modern compilers complex and error-prone. Given the important role played by the compilers, it is necessary that they are tested thoroughly, so that most of the bugs are detected early on. Unfortunately, modern compilers su...

  • Snapshot-based loading-time acceleration for web applications

    Publication Year: 2015, Page(s):179 - 189
    Cited by:  Papers (9)

    Web applications (apps) are programmed using HTML, CSS, and JavaScript. Web apps allow faster app development based on existing web technology and better portability, since they run on any device where a web browser is installed. Unfortunately, web apps suffer from a performance issue due to JavaScript, because its dynamic typing, function object, and prototype are difficult to e...
