2013 25th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)

Date: 23-26 Oct. 2013

  • [Front cover]

    Page(s): C4
  • [Title page i]

    Page(s): i
  • [Title page iii]

    Page(s): iii
  • [Copyright notice]

    Page(s): iv
  • Table of contents

    Page(s): v - vii
  • Message from the General Chairs

    Page(s): viii
  • Message from the Program Chairs

    Page(s): ix
  • Committees

    Page(s): x - xi
  • Program Committee and External Reviewers

    Page(s): xii - xiv
  • Cluster Cache Monitor

    Page(s): 1 - 8

    As the number of cores and the working sets of parallel workloads increase, shared L2 caches exhibit fewer misses than private L2 caches by making better use of the total available cache capacity, but they also induce higher overall L1 miss latencies because of the longer average distance between two nodes and the potential congestion at certain nodes. One of the main causes of long L1 miss latencies is accesses to the home nodes of the directory. However, we have observed that there is a high probability that the target data of an L1 miss resides in the L1 cache of a neighbor node. In such cases, these long-distance accesses to the home nodes can potentially be avoided. We organize the multi-core into clusters of 2×2 nodes and, to leverage the aforementioned property, we introduce the Cluster Cache Monitor (CCM). The CCM is a hardware structure that detects whether an L1 miss can be served by one of the cluster's L1 caches, supported by two cluster-related states in the coherence protocol that avoid long-distance accesses to home nodes upon hits in the cluster L1 caches. We evaluate this approach on a 64-node multi-core using SPLASH-2 and PARSEC benchmarks, and we find that the CCM can reduce execution time by 15% and energy by 14%, while saving 28% of the directory storage area compared to a standard multi-core with a shared L2. We also show that the CCM outperforms recent mechanisms such as ASR, DCC, and R-NUCA.

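    For illustration only, a minimal sketch of the cluster-first miss handling described above (the mapping functions and structures here are hypothetical, not the paper's design):

      # Hypothetical sketch: probe the 2x2 cluster before the home node.
      CLUSTER_DIM = 2

      def cluster_of(node, mesh_width):
          x, y = node % mesh_width, node // mesh_width
          return (x // CLUSTER_DIM, y // CLUSTER_DIM)

      def handle_l1_miss(node, addr, ccm_presence, mesh_width):
          # ccm_presence: (cluster, line address) -> nodes caching the line,
          # the kind of tracking a cluster cache monitor could maintain.
          holders = ccm_presence.get((cluster_of(node, mesh_width), addr), ())
          if holders:
              return ("cluster_hit", next(iter(holders)))  # short hop, no home trip
          home = (addr // 64) % (mesh_width * mesh_width)   # assumed interleaved mapping
          return ("home_access", home)
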
  • Experiences with Disjoint Data Structures in a New Hardware Transactional Memory System

    Page(s): 9 - 16

    In this paper we present our experiences constructing and testing in-memory data structures designed to be disjoint enough for transactional memory to be profitable as a serialization mechanism with no fallback to traditional locking. Our goal was to restrict memory conflicts to actual contention situations so that transactional memory techniques could be used as efficiently as possible. We describe the hardware transactional execution facility in the IBM zEnterprise EC12 server. We present an order-preserving hashed structure that permits the insertion, deletion, and traversal operations typically supported by a sorted linked list. We measure the performance and scalability of these operations on the IBM zEnterprise EC12 server. Our results show near-linear scalability of the insertion and deletion operations for up to 96 CPUs. We also discuss transaction abort frequency and hardware/software interactions.

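    As a sketch of what "disjoint enough" can mean here (my illustration, not the paper's structure): range-hashing keys to buckets keeps bucket order aligned with key order, so sorted-list operations on different key ranges touch disjoint memory and would rarely conflict under transactions:

      import bisect

      class OrderedHash:
          """Hypothetical order-preserving hashed structure (illustrative)."""
          def __init__(self, n_buckets, key_space):
              self.width = max(1, key_space // n_buckets)  # range-based hash
              self.buckets = [[] for _ in range(n_buckets)]

          def _bucket(self, key):
              return self.buckets[min(key // self.width, len(self.buckets) - 1)]

          def insert(self, key):          # would run as one transaction
              b = self._bucket(key)
              i = bisect.bisect_left(b, key)
              if i == len(b) or b[i] != key:
                  b.insert(i, key)

          def delete(self, key):          # would run as one transaction
              b = self._bucket(key)
              i = bisect.bisect_left(b, key)
              if i < len(b) and b[i] == key:
                  del b[i]

          def traverse(self):             # keys come out in sorted order
              for b in self.buckets:
                  yield from b
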
  • HotStream: Efficient Data Streaming of Complex Patterns to Multiple Accelerating Kernels

    Page(s): 17 - 24

    Designing accelerating kernels is a demanding task that requires efficient coupling of hardware and software. In particular, the structures responsible for handling data transfers in multi-core accelerator-based systems play a crucial role in the resulting performance. This paper proposes a data streaming accelerator framework that provides efficient data management facilities that are easily tailored to any application and data pattern. This is achieved through an innovative and fully programmable data management structure, implemented with two granularity levels. The obtained results show that the proposed framework is capable of efficient address generation and data fetch for complex streaming data patterns, while significantly reducing the size occupied by the pattern description. A large-matrix multiplication case study, based on a streaming architecture with four sub-block multiplication cores, demonstrates that, by enabling data re-use, the proposed framework increases the available bandwidth by 4.2x, resulting in a performance speedup of 2.1x. Furthermore, it reduces the host memory requirements and host intervention by more than 40x.

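    A toy version of a two-level pattern descriptor (hypothetical; the actual HotStream descriptor format is richer): a handful of integers describes a whole block-structured access stream, which is the kind of compact pattern description the abstract refers to:

      def stream_addresses(base, outer_count, outer_stride, inner_count, inner_stride):
          """Two granularity levels: an outer step between blocks,
          an inner step within each block."""
          for o in range(outer_count):
              block = base + o * outer_stride
              for i in range(inner_count):
                  yield block + i * inner_stride

      # Example: the top-left 4x4 sub-block of a row-major 16x16 matrix of
      # 8-byte words -- five integers instead of an explicit 16-entry list.
      addrs = list(stream_addresses(base=0, outer_count=4, outer_stride=16 * 8,
                                    inner_count=4, inner_stride=8))
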
  • Large Payload Streaming Database Sort and Projection on FPGAs

    Page(s): 25 - 32

    In recent years, real-time analytics has seen widespread adoption in the business world. While it provides useful business insights and improved market responsiveness, it also adds a computational burden to traditional online transaction processing (OLTP) systems. Analytics queries involve complex database operations such as sort, aggregation, and join that consume significant computational resources and, when executed on the same system, may affect the performance of OLTP queries. In this paper, we address this issue by accelerating two such database operations, namely projection and sort, using a field-programmable gate array (FPGA). Our prototype is implemented on an Altera Stratix V FPGA and achieves an order of magnitude speedup in the sort operation compared to baseline software. Furthermore, our prototype implements projection in parallel with other query operations on the FPGA, thus completely eliminating the cost of projection without consuming any extra cycles on the FPGA. FPGA-accelerated sort and projection have been integrated with our previous work on accelerating other query operations [1], making our analytics acceleration prototype on FPGA applicable to a wider variety of queries.

  • Scalable Many-Field Packet Classification on Multi-core Processors

    Page(s): 33 - 40

    Packet classification matches a packet header against the predefined rules in a rule set; it is a kernel function that has been studied for decades. A recent trend in packet classification is to match a large number of packet header fields. For example, the flow table lookup in Software Defined Networking (SDN) requires 15 fields of the packet header to be examined. Another trend is to use software-based solutions employing multi-core general-purpose processors and virtual machines. Although packet classification has been widely studied, most existing solutions on multi-core systems target the classic 5-field packet classification, and their performance cannot easily be scaled up to a larger number of packet header fields. In this paper, we propose a decomposition-based packet classification approach that supports large rule sets consisting of a large number of packet header fields. We first use range trees and hashing to search each field of the input packet header individually, in parallel. The partial results from all the fields are represented by bit vectors, which are merged in parallel to produce the final packet header match. We also balance the search and merge latencies, and employ software pipelining to further enhance the overall performance. We implement our approach on state-of-the-art multi-core processors and evaluate its performance with respect to throughput and latency for rule set sizes ranging from 1K to 32K. Experimental results show that, for a 32K rule set, our algorithms can achieve an average processing latency of 2000 ns per packet and an overall throughput of 30 million packets per second on a state-of-the-art 16-core platform.

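    A minimal sketch of the two phases described above (simplified to exact-match lookups; the paper uses range trees and hashing per field, and the names here are hypothetical):

      def classify(packet, field_tables, n_rules):
          # Phase 1: look up each field independently; each table maps a
          # field value to a bit vector with bit r set if rule r matches.
          vectors = []
          for field, table in field_tables.items():
              wildcard = table.get("*", 0)       # rules that ignore this field
              vectors.append(table.get(packet[field], 0) | wildcard)
          # Phase 2: merge -- a rule matches only if it matched every field.
          merged = (1 << n_rules) - 1
          for v in vectors:
              merged &= v
          # Return the lowest-numbered (highest-priority) matching rule.
          return (merged & -merged).bit_length() - 1 if merged else None
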
  • Extending Summation Precision for Network Reduction Operations

    Page(s): 41 - 48

    Double precision summation is at the core of numerous important algorithms such as Newton-Krylov methods and other operations involving inner products, but the effectiveness of summation is limited by the accumulation of rounding errors, which are an increasing problem with the scaling of modern HPC systems and data sets. To reduce the impact of precision loss, researchers have proposed increased- and arbitrary-precision libraries that provide reproducible error or even bounded error accumulation for large sums, but do not guarantee an exact result. Such libraries can also increase computation time significantly. We propose big integer (BigInt) expansions of double precision variables that enable arbitrarily large summations without error and provide exact and reproducible results. This is feasible with performance comparable to that of double-precision floating point summation, by the inclusion of simple and inexpensive logic into modern NICs to accelerate performance on large-scale systems.

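    The core idea is easy to sketch in software (a minimal illustration; the paper's contribution is doing this at line rate with NIC logic): every finite double is an integer multiple of 2^-1074, so scaling by 2^1074 turns each addend into an exact big integer, and integer addition is error-free and order-independent, with a single rounding at the very end:

      from fractions import Fraction

      SCALE = 1074   # -log2 of the smallest positive subnormal double

      def to_fixed(x: float) -> int:
          return int(Fraction(x) * 2**SCALE)      # exact integer expansion

      def exact_sum(values):
          acc = sum(to_fixed(x) for x in values)  # error-free integer adds
          return acc / 2**SCALE                   # one rounding at the end

      vals = [1e16, 1.0, -1e16]
      print(sum(vals), exact_sum(vals))           # 0.0 (naive) vs. 1.0 (exact)
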
  • Tackling Permanent Faults in the Network-on-Chip Router Pipeline

    Page(s): 49 - 56

    The proliferation of multi-core and many-core chips for performance scaling is making the Network-on-Chip (NoC) occupy a growing amount of silicon area spanning several metal layers. The NoC is not immune to hard and transient faults, nor is it unaffected by the increase in hard faults caused by technology scaling. The ramifications for the NoC are immense: a single fault in the NoC may paralyze the working of the entire chip. To this end, we propose a Permanent Fault Tolerant Router (PFTR) that is capable of tolerating multiple permanent faults in the pipeline. PFTR is designed by making architectural modifications to individual pipeline stages of the baseline NoC router. These modifications add minimal extra circuitry and exploit temporal parallelism to accomplish fault tolerance. Tolerance of multiple faults is achieved by striking a balance between three important design factors, namely area overhead, power overhead, and reliability. We use the Silicon Protection Factor (SPF) as the reliability metric to assess the reliability improvement of the proposed architecture; SPF takes into account the number of faults required to cause failure and the area overhead of the additional circuitry. SPF calculation reveals that the proposed PFTR is 11 times more reliable than the baseline NoC router. Synthesis results using Cadence Encounter RTL Compiler at 45nm technology show that the additional circuitry adds an area overhead of 31% and a power overhead of 30% with respect to the baseline NoC router. PFTR provides much better reliability with much less overhead compared to other fault-tolerant routers such as BulletProof, Vicis, and RoCo [15].

  • Fast LH*

    Page(s): 57 - 64

    Linear Hashing is a widely used and efficient version of extensible hashing. LH* is a distributed version of Linear Hashing that stores key-indexed records on up to hundreds of thousands of sites in a distributed system. LH* implements the dictionary data structure efficiently, since it does not use a central component for the key-based operations of insertion, deletion, update, and retrieval, nor for the scan operation. LH* allows a client or a server to commit an addressing error by sending a request to the wrong server. In this case, the server forwards the request to the correct server, either directly or with one more forwarding step. We discuss here methods to avoid the double forward, which is rare but might breach quality-of-service guarantees. We compare our methods with LH* P2P, which pushes information about changes in the file structure to clients, whether they are active or not.

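    For context, a sketch of classic LH* addressing (following the standard LH* scheme as I understand it, simplified): the client guesses an address from a possibly stale image of the file, and a wrongly addressed server forwards the request, at most twice in total:

      def h(i, key):                     # LH hash family: key mod 2**i
          return key % (2 ** i)

      def client_address(key, i_img, n_img):
          """Client guess from its image (level i', split pointer n')."""
          a = h(i_img, key)
          if a < n_img:                  # bucket already split at this level
              a = h(i_img + 1, key)
          return a

      def server_forward(key, my_addr, my_level):
          """Return None to serve locally, else the address to forward to."""
          a = h(my_level, key)
          if a == my_addr:
              return None
          a2 = h(my_level - 1, key)      # classic LH* second guess
          return a2 if my_addr < a2 < a else a
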
  • Attaining Strictly Increasing and Precise Time Count in Energy-Efficient Computer Systems

    Page(s): 65 - 72

    Energy-efficient computer systems are making increasing use of processors that have multiple core units, DVFS, and virtualization support. However, current system clocks were usually not designed to cope with the capacity of such mechanisms to decelerate or accelerate the passage of time, which increases time drift in the system and produces two adverse side effects. First, a reduction in the precision of the system clocks, which makes it infeasible to run applications that depend on precise time measurements. Second, an increase in the rate of system resynchronization with an external global clock, which adds more noise to the system and counteracts the attainment of the desired energy efficiency. As an alternative to the system clock, we propose an original virtual clock, named RVEC, with the property that the time count is strictly increasing and precise (SIP). A preliminary experimental evaluation of an implementation of RVEC in Linux, using a Beowulf cluster of four energy-efficient computer systems, showed that RVEC exhibited the SIP property while remaining highly precise and imposing negligible overhead in comparison with representative Linux system clocks. Furthermore, we used RVEC to build a High-Precision Global Clock (HPGC), which is free from resynchronization, and implemented HPGC in the OpenMPI library as a time synchronization service for the MPI_Wtime() function, improving its timekeeping and lowering system noise. Our preliminary results from microbenchmarks executing on the same cluster indicate that HPGC is a highly scalable and precise solution that allowed the microbenchmarks to stay globally synchronized using only 30 messages per node to initially synchronize the cluster nodes, thanks to RVEC's SIP property. These results suggest that RVEC and HPGC can be effective alternatives to the system clock and the global clock, respectively, in energy-efficient computer systems, especially for MPI applications running on Beowulf clusters.

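    A minimal sketch of a DVFS-aware, strictly increasing virtual clock in the spirit of RVEC (the names and epoch bookkeeping are my illustration, not the paper's implementation): cycles are integrated per frequency epoch, so a frequency change never makes the reported time jump backward:

      class VirtualClock:
          def __init__(self, read_cycles, freq_hz):
              self.read_cycles = read_cycles   # per-core cycle counter source
              self.freq = freq_hz
              self.base = read_cycles()
              self.elapsed_ns = 0              # time closed out in past epochs

          def on_frequency_change(self, new_freq_hz):
              """Close the epoch at the old frequency, open a new one."""
              now = self.read_cycles()
              self.elapsed_ns += (now - self.base) * 1_000_000_000 // self.freq
              self.base, self.freq = now, new_freq_hz

          def time_ns(self):
              cycles = self.read_cycles() - self.base
              return self.elapsed_ns + cycles * 1_000_000_000 // self.freq
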
  • Energy Efficient Last Level Caches via Last Read/Write Prediction

    Page(s): 73 - 80

    The size of last-level caches (LLCs) in multi-core architectures is increasing, and so is their power consumption. However, most of this power is wasted on unused or invalid cache lines. For dirty cache lines, the LLC waits until a line is evicted before writing it back to memory. Hence, dirty lines compete for memory bandwidth with read requests (prefetch and demand), increasing pressure on the memory controller. This paper proposes a Dead Line and Early Write-Back Predictor (DEWP) to improve the energy efficiency of the LLC. DEWP evicts dead cache lines early, with an average accuracy of 94% and only 2% false positives. DEWP also allows dirty lines to be scheduled for early eviction, enabling earlier write-backs. Using DEWP over a set of single- and multi-threaded benchmarks, we obtain an average of 61% static energy savings, while maintaining performance, for both inclusive and non-inclusive LLCs.

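    One plausible shape for such a predictor (a hypothetical sketch; DEWP's actual indexing and state machines are described in the paper): learn which access signatures tend to be a line's final read or write, then flag matching lines as dead so they can be evicted or written back early:

      from collections import defaultdict

      class LastTouchPredictor:
          def __init__(self, threshold=2, ceiling=3):
              self.counters = defaultdict(int)  # signature -> saturating counter
              self.threshold, self.ceiling = threshold, ceiling

          def predict_last_touch(self, signature):
              """True if an access with this signature is likely the final one."""
              return self.counters[signature] >= self.threshold

          def train_on_eviction(self, last_signature, touched_again):
              """On eviction, reward signatures that really were last touches."""
              c = self.counters[last_signature]
              self.counters[last_signature] = (max(0, c - 1) if touched_again
                                               else min(self.ceiling, c + 1))
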
  • Dynamic Selective Devectorization for Efficient Power Gating of SIMD Units in a HW/SW Co-Designed Environment

    Page(s): 81 - 88

    Leakage power is a growing concern in current and future microprocessors, and the functional units of a microprocessor are responsible for a major fraction of this power. Therefore, reducing functional unit leakage has received much attention in recent years. Power gating is one of the most widely used techniques to minimize leakage energy: it turns off functional units during idle periods, so the amount of leakage energy saved is directly proportional to the duration of the idle interval. This paper focuses on increasing the idle interval of the higher SIMD lanes. Applications are profiled dynamically, in a HW/SW co-designed environment, to find the usage pattern of the higher SIMD lanes. If the higher lanes would need to be turned on only for short periods, the corresponding portion of the code is devectorized to keep the higher lanes off, and the devectorized code is executed on the lowest SIMD lane. Our experimental results show average SIMD accelerator energy savings of 12% and 24% relative to power gating, for SPECFP2006 and PhysicsBench respectively. Moreover, the slowdown caused by devectorization is less than 1%.

  • HPC Performance and Energy-Efficiency of Xen, KVM and VMware Hypervisors

    Page(s): 89 - 96

    With growing concern over the considerable energy consumed by HPC platforms and data centers, research efforts are targeting green approaches with higher energy efficiency. In particular, virtualization is emerging as the prominent approach to share the energy consumed by a single server across multiple VM instances. Even today, it remains unclear whether the overhead induced by virtualization and the corresponding hypervisor middleware suits an environment as demanding as an HPC platform. In this paper, we analyze from an HPC perspective the three most widespread virtualization frameworks, namely Xen, KVM, and VMware ESXi, and compare them with a baseline environment running in native mode. We performed our experiments on the Grid'5000 platform, measuring the results of the reference HPL benchmark. Power measurements were performed in parallel to quantify the potential energy efficiency of the virtualized environments. Overall, our study offers novel incentives toward in-house HPC platforms running without any virtualization framework.

  • Optimizing a 3D-FWT Code in a Heterogeneous Cluster of Multicore CPUs and Manycore GPUs

    Page(s): 97 - 104

    Clusters of nodes composed of manycore GPUs and multicore CPUs are used to solve scientific problems with high computational requirements. The development and optimization of parallel heterogeneous codes for these systems is a complex task that requires deep knowledge of the different components of the hybrid, heterogeneous, and hierarchical computational system, of the scientific problem to be solved, and of the different programming paradigms to be used for its efficient solution. Techniques for the efficient development and optimization of scientific codes for these systems are needed. This paper presents an analysis of the development and optimization of the 3D Fast Wavelet Transform (3D-FWT) for a heterogeneous cluster of multicores+GPUs. Different parallel programming paradigms (message passing, shared memory, and SIMD GPU) are combined to fully exploit the computing capacity of the different computational elements of the cluster, resulting in an efficient combination of basic codes previously developed for individual components (individual nodes, multicore or GPU) and an important reduction in the compression time of long video sequences.

  • Preliminary Experiments with XKaapi on Intel Xeon Phi Coprocessor

    Page(s): 105 - 112

    This paper presents preliminary performance comparisons of parallel applications developed natively for the Intel Xeon Phi accelerator using three different parallel programming environments and their associated runtime systems. We compare Intel OpenMP, Intel CilkPlus, and XKaapi on the same benchmark suite, and we provide comparisons between an Intel Xeon Phi coprocessor and a Sandy Bridge Xeon-based machine. Our benchmark suite is composed of three computing kernels: a Fibonacci computation, which allows us to study the overhead and scalability of the runtime system; an NQueens application, which generates irregular and dynamic tasks; and a Cholesky factorization algorithm. We also compare the Cholesky factorization with the parallel algorithm provided by the Intel MKL library for the Intel Xeon Phi. Performance evaluation shows that our XKaapi data-flow parallel programming environment has the lowest overhead of the three and is highly competitive with the native OpenMP and CilkPlus environments on the Xeon Phi. Moreover, the efficient handling of data-flow dependencies between tasks makes our XKaapi environment exhibit more parallelism for some applications, such as the Cholesky factorization. In that case, we observe substantial gains with up to 180 hardware threads over the state-of-the-art MKL, with a 47% performance increase for 60 hardware threads.

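    To illustrate what data-flow dependency handling buys (a deliberately tiny, sequential sketch, not XKaapi's API): tasks declare what they read and write, and a task becomes ready as soon as the last writer of each of its inputs has completed, which exposes exactly the dependency structure a tiled Cholesky needs:

      def dataflow_order(tasks):
          """tasks: list of (name, reads, writes) in program order."""
          last_writer, deps = {}, {}
          for name, reads, writes in tasks:
              deps[name] = {last_writer[r] for r in reads if r in last_writer}
              for w in writes:
                  last_writer[w] = name
          done, order = set(), []
          while len(done) < len(tasks):
              for name, _, _ in tasks:   # a real runtime runs ready tasks concurrently
                  if name not in done and deps[name] <= done:
                      order.append(name); done.add(name)
          return order

      # Tiled-Cholesky-style fragment: potrf(A00) -> trsm(A10) -> syrk(A11)
      print(dataflow_order([("potrf", ["A00"], ["A00"]),
                            ("trsm",  ["A00", "A10"], ["A10"]),
                            ("syrk",  ["A10", "A11"], ["A11"])]))
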
  • A CPU, GPU, FPGA System for X-Ray Image Processing Using High-Speed Scientific Cameras

    Page(s): 113 - 119

    Currently, computers can be composed of different Processing Units (PUs): general-purpose, as well as programmable and special-purpose. One of the goals of such heterogeneity is to improve application performance, and scientific applications in particular can benefit greatly from this kind of platform: they produce large amounts of data within several types of algorithms, and distinct PUs are an alternative for executing such tasks more effectively. This work presents a new system box, composed of CPU, GPU, and FPGA, to carry out on-site X-ray image evaluations. It was first tested by evaluating the performance of a Linear Integration (LI) algorithm on the different PUs. This algorithm is widely used in synchrotron experiments, in which high-speed X-ray cameras produce extremely large amounts of data for post-processing analysis, including LI. In our experiments, LI execution was around 30x faster on the FPGA than on the CPU, achieving performance similar to the GPU. Considering the end-to-end application, i.e., including image transfer into memory, this rate increases to the hundreds. Issues in using FPGAs as co-processors under our dynamic scheduling framework are also discussed: synthesis times for LI when assigned to the FPGA are still too long for dynamic scheduling, preventing online synthesis of functions not designed a priori.

  • A Parallel IRAM Algorithm to Compute PageRank for Modeling Epidemic Spread

    Page(s): 120 - 127

    The eigenvalue equation appears in models of infectious disease propagation and could be used as an ally of vaccination campaigns in the actions carried out by health care organizations. A stochastic model based on PageRank allows the epidemic spread to be simulated, where a PageRank-like infection vector is calculated to help establish an efficient vaccination strategy. In the context of epidemic spread, the damping factor is generally high, because the probability that an infected individual contaminates some other individual through an unusual contact is low. One consequence is that the second-largest eigenvalue of the PageRank matrix can be very close to its dominant eigenvalue. Another difficulty arises from the growing size of real networks: handling very large graphs becomes a challenge for computing PageRank, and the high damping factor makes many existing algorithms less efficient. In this paper, we explore computational methods for PageRank to address these issues. Specifically, we study the implicitly restarted Arnoldi method (IRAM) and discuss some possible improvements to it. We also present a parallel implementation of IRAM, targeting big data and sparse matrices representing scale-free networks (also known as power-law networks). The algorithm is tested on the nationwide cluster of clusters Grid'5000. Experiments are conducted on very large networks such as Twitter and Yahoo (over 1 billion nodes).

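    For context, the baseline that IRAM improves on is plain power iteration, whose convergence degrades exactly when the damping factor d is high and the second eigenvalue approaches the first; a minimal sketch (dangling-node mass is ignored for brevity):

      def pagerank(links, d=0.97, tol=1e-10):
          """links: {node: [out-neighbors]}; a high d models rare random contact."""
          nodes = list(links)
          n = len(nodes)
          rank = {u: 1.0 / n for u in nodes}
          while True:
              new = {u: (1.0 - d) / n for u in nodes}
              for u, outs in links.items():
                  share = d * rank[u] / len(outs) if outs else 0.0
                  for v in outs:
                      new[v] += share
              if sum(abs(new[u] - rank[u]) for u in nodes) < tol:
                  return new
              rank = new

      print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
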