Computer Architecture and High Performance Computing (SBAC-PAD), 2010 22nd International Symposium on

Date 27-30 Oct. 2010

Displaying Results 1 - 25 of 43
  • [Front cover]

    Publication Year: 2010 , Page(s): C1
    Freely Available from IEEE
  • [Title page i]

    Publication Year: 2010 , Page(s): i
    Freely Available from IEEE
  • [Title page iii]

    Publication Year: 2010 , Page(s): iii
    Freely Available from IEEE
  • [Copyright notice]

    Publication Year: 2010 , Page(s): iv
    Freely Available from IEEE
  • Table of contents

    Publication Year: 2010 , Page(s): v - vii
    Freely Available from IEEE
  • Message from the General Chairs

    Publication Year: 2010 , Page(s): viii
    Freely Available from IEEE
  • Message from the Program Committee Co-chairs

    Publication Year: 2010 , Page(s): ix - x
    Freely Available from IEEE
  • Conference organizers

    Publication Year: 2010 , Page(s): xi - xii
    Freely Available from IEEE
  • Program Committee

    Publication Year: 2010 , Page(s): xiii - xiv
    Freely Available from IEEE
  • List of reviewers

    Publication Year: 2010 , Page(s): xv
    Freely Available from IEEE
  • Brazilian Computer Society (SBC)

    Publication Year: 2010 , Page(s): xvi - xviii
    Freely Available from IEEE
  • Flexible Error Protection for Energy Efficient Reliable Architectures

    Publication Year: 2010 , Page(s): 1 - 8
    Cited by:  Patents (1)

    Technology scaling is having an increasingly detrimental effect on microprocessor reliability, with increased variability and higher susceptibility to errors. At the same time, as integration of chip multiprocessors increases, power consumption is becoming a significant bottleneck that could threaten their growth. To address these competing trends, energy-efficient solutions to reliability problems are needed. This paper presents a reliable multicore architecture that provides targeted error protection by adapting to the characteristics of individual cores and workloads, with the goal of providing reliability with minimum energy. The user can specify an acceptable reliability target for each chip, core, or application. The system then adjusts a range of parameters, including replication and supply voltage, to meet that reliability goal. In this multicore architecture, each core consists of a pair of pipelines that can run independently (running separate threads) or in concert (running the same thread and verifying results). Redundancy is enabled selectively, at functional unit granularity. The architecture also employs timing speculation to mitigate variation-induced timing errors and to reduce the power overhead of error protection. On-line control based on machine learning dynamically adjusts multiple parameters to minimize energy consumption. Evaluation shows that dynamic adaptation of voltage and redundancy can reduce the energy-delay product of a CMP by 30-60% compared to static dual modular redundancy.
  • Characterizing Energy Consumption in Hardware Transactional Memory Systems

    Publication Year: 2010 , Page(s): 9 - 16
    Cited by:  Papers (4)

    Transactional Memory is currently being advocated as a promising alternative to lock-based synchronization because it simplifies multithreaded programming. Consequently, future many-core CMP architectures may need to provide hardware support for transactional memory. On the other hand, power dissipation constitutes a first-class consideration in multicore processor design. In this work, we characterize the performance and energy consumption of two well-known Hardware Transactional Memory systems that employ opposite policies for data versioning and conflict management. More specifically, we compare the LogTM-SE Eager-Eager system and a version of the Scalable TCC Lazy-Lazy system that enables parallel commits. To the best of our knowledge, this is the first characterization in terms of energy consumption of hardware transactional memory systems. To do that, we extended the GEMS simulator to estimate the energy consumed in the on-chip caches according to CACTI, and used the interconnection network energy model given by Orion 2. Results show that the energy consumption of the Eager-Eager system is 60% higher on average than in the Lazy-Lazy case, whereas performance differences between the two systems are 42% on average. Finally, we found that although Lazy-Lazy beats Eager-Eager on average, there are considerable deviations in performance depending on the particular characteristics of each application.
  • Control Scheme for a CGRA

    Publication Year: 2010 , Page(s): 17 - 24
    Cited by:  Papers (4)

    The ability to instantiate low-cost and agile FSMs that can implement arbitrary parallelism, and to combine such FSMs in a chain and in a hierarchy, is one of the key differentiating factors between ASICs and MPSoCs. CGRAs that have been reported in the literature, like MPSoCs, also lack this ASIC-like ability. The downside of ASICs is their lack of reuse and high engineering cost. We present a CGRA architecture that retains the programmability of a CGRA and yet has the ASIC-like ability to a) construct arbitrarily parallel data-path/FSM combinations, b) chain an arbitrary number of such FSMs, and c) create a hierarchy of such chains. We present in detail the architecture of such a control scheme and illustrate its use for an example composed of FFT and FIRs. We quantify the benefits of our approach by benchmarking for energy-delay product against a) ASICs (4.8X worse), b) a state-of-the-art CGRA (4.58X better), and c) FPGAs (63.95X better).
  • High Level Power and Energy Exploration Using ArchC

    Publication Year: 2010 , Page(s): 25 - 32
    Cited by:  Papers (1)

    With the increase in the design complexity of MPSoC architectures, estimating power consumption at lower levels of abstraction is very complex and time-consuming. We propose a methodology using ArchC, named Power-ArchC, for fast high-level estimation of processor power consumption. Power values are obtained by an instruction-level power characterization at gate level. The requirements for the power evaluation infrastructure are compatible processor models written in ArchC and RTL, and the technology library. We show power results for a 32-bit MIPS processor with different benchmarks, based on 45nm technology.
  • Performance Debugging of GPGPU Applications with the Divergence Map

    Publication Year: 2010 , Page(s): 33 - 40
    Cited by:  Papers (1)

    The increasing programmability and the high computational power of Graphical Processing Units (GPU) make them attractive to general purpose programming. However, taking full benefit of this execution environment is a challenging task. One of these challenges stems from divergences, a phenomenon that occurs when threads that execute in lock-step are forced to take different program paths due to branches in the code. In the face of divergences, some threads will have to wait, idly, while their diverging siblings execute. Optimizing the code to avoid divergences is difficult, because this task demands a deep understanding of programs that might be large and convoluted. In order to facilitate the detection of divergences, this paper introduces the divergence map, a data structure that indicates the location and the volume of divergences in a program. We build this map via dynamic profiling techniques, which we have implemented on top of an open source CUDA compiler. To illustrate the importance of the divergence map, we have used it to pinpoint the core regions that must be optimized in well-known public applications. By hand-optimizing some applications, we have added 9-11% speedups onto kernels that have already gone through the sieve of many programmers.
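    As a rough illustration of the idea of a divergence map, the sketch below counts, per branch location, how often a group of lock-step threads disagreed on a branch outcome. It is not the paper's implementation: the trace format, the function name `build_divergence_map`, and the branch labels are all hypothetical, chosen only to make the concept concrete.

```python
from collections import Counter


def build_divergence_map(branch_trace):
    """Summarize divergences per branch location (hypothetical trace format).

    branch_trace: iterable of (branch_id, outcomes) pairs, where `outcomes`
    holds one boolean per lock-step thread for a single visit to the branch.
    Returns {branch_id: (visits, divergent_visits)}.
    """
    visits = Counter()
    divergent = Counter()
    for branch_id, outcomes in branch_trace:
        visits[branch_id] += 1
        # The thread group diverges when its threads disagree on the outcome.
        if len(set(outcomes)) > 1:
            divergent[branch_id] += 1
    return {b: (visits[b], divergent[b]) for b in visits}


# Hypothetical profile: two visits to one branch, one visit to another.
trace = [
    ("kernel.cu:12", [True, True, True, True]),
    ("kernel.cu:12", [True, False, True, True]),
    ("kernel.cu:30", [False, False, False, False]),
]
divergence_map = build_divergence_map(trace)
```

    Branches with a high ratio of divergent visits to total visits would be the first candidates for manual optimization.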
  • Mixed-Precision Parallel Linear Programming Solver

    Publication Year: 2010 , Page(s): 41 - 46
    Cited by:  Papers (1)

    We use the mixed-precision technique, which exploits the high single-precision performance of modern processors, to build the first sparse mixed-precision linear programming solver on the Cell BE processor. The technique is used to enhance the performance of an LP IPM-based solver by implementing mixed-precision sparse Cholesky factorization, the most time-consuming part of LP solvers. Moreover, we implemented sparse matrix multiplication of the form required by the solver, as it is also very time-consuming for some LP problems. Implemented on the Cell BE processor (Playstation 3) and tested using Netlib data sets, our LP solver achieved a maximum speedup of 2.9 just by using the mixed-precision technique. Moreover, we found that some problems, especially in final iterations, result in ill-conditioned matrices where mixed precision cannot be used. As a result, the solver needs to switch to double precision if a more accurate solution of an LP problem is required.
  • Tree Projection-Based Frequent Itemset Mining on Multicore CPUs and GPUs

    Publication Year: 2010 , Page(s): 47 - 54
    Cited by:  Papers (1)

    Frequent itemset mining (FIM), which has been extensively studied over the last decades, is a core operation for several data mining applications such as association rule computation, correlations, document classification, and many others. Moreover, databases are becoming increasingly larger, thus requiring higher computing power to mine them in reasonable time. At the same time, the advances in high performance computing platforms are transforming them into hierarchical parallel environments equipped with multi-core processors and many-core accelerators, such as GPUs. Thus, fully exploiting these systems to perform FIM tasks poses a challenging and critical problem that we address in this paper. We present efficient multi-core and GPU-accelerated parallelizations of Tree Projection, one of the most competitive FIM algorithms. The experimental results show that our Tree Projection implementation scales almost linearly in a CPU shared-memory environment after careful optimizations, while the GPU versions are up to 173 times faster than the standard CPU version.
  • Mapping Pipelined Applications with Replication to Increase Throughput and Reliability

    Publication Year: 2010 , Page(s): 55 - 62
    Cited by:  Papers (2)

    Mapping and scheduling an application onto the processors of a parallel system is a difficult problem. This is true when performance is the only objective, but becomes worse when a second optimization criterion like reliability is involved. In this paper we investigate the problem of mapping an application consisting of several consecutive stages, i.e., a pipeline, onto heterogeneous processors, while considering both the performance, measured as throughput, and the reliability. The mechanism of replication, which refers to the mapping of an application stage onto more than one processor, can be used to increase throughput but also to increase reliability. Finding the right replication trade-off plays a pivotal role for this bi-criteria optimization problem. Our formal model includes processors that are heterogeneous in terms of both execution speed and reliability. We study the complexity of the various subproblems and show how a solution can be obtained for the polynomial cases. For the general NP-hard problem, heuristics are presented and experimentally evaluated. We further propose the design of an exact algorithm based on A* state space search which allows us to evaluate the performance of our heuristics for small problem instances.
  • Improving In-memory Column-Store Database Predicate Evaluation Performance on Multi-core Systems

    Publication Year: 2010 , Page(s): 63 - 70

    The ability to analyze a large volume of data for the purpose of business intelligence has led to various innovations in database technology. One example is the increased interest in using column-oriented data layout to address query performance in analytical and warehousing workloads. As system architectures move towards multi-core designs, it is important to optimize the performance of these workloads on these platforms. In this paper we present SPHINX, an architecture that utilizes multi-core systems for search-based predicate evaluation operations in analytical query workloads against an in-memory column store. We discuss the natural parallelism of predicate evaluations and various bottlenecks that impact search performance. We present several performance improvement techniques and apply a scan sharing technique based on cache reuse efficiency to further improve the performance. We demonstrate the performance benefits of our scan sharing scheduler over other scheduling approaches in a workload of mixed search queries.
  • A Comparative Analysis of Load Balancing Algorithms Applied to a Weather Forecast Model

    Publication Year: 2010 , Page(s): 71 - 78
    Cited by:  Papers (6)

    Among the many reasons for load imbalance in weather forecasting models, the dynamic imbalance caused by localized variations in the state of the atmosphere is the hardest one to handle. As an example, active thunderstorms may substantially increase load at a certain time step with respect to previous time steps in an unpredictable manner - after all, tracking storms is one of the reasons for running a weather forecasting model. In this paper, we present a comparative analysis of different load balancing algorithms to deal with this kind of load imbalance. We analyze the impact of these strategies on computation and communication and the effects caused by the frequency at which the load balancer is invoked on execution time. This is done without any code modification, employing the concept of processor virtualization, which basically means that the domain is over-decomposed and the unit of rebalance is a sub-domain. With this approach, we were able to reduce the execution time of a full, real-world weather model.
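    To make the over-decomposition idea concrete, the sketch below rebalances measured sub-domain loads across processors with a generic greedy heuristic (heaviest sub-domain first, onto the currently least-loaded processor). This is an illustrative assumption, not necessarily one of the algorithms the paper compares; the function name `greedy_rebalance` and the sub-domain labels are hypothetical.

```python
import heapq


def greedy_rebalance(subdomain_loads, num_processors):
    """Assign sub-domains to processors, heaviest first (generic greedy sketch).

    subdomain_loads: {subdomain_id: measured load for the last time steps}
    Returns {subdomain_id: processor index}.
    """
    # Min-heap of (accumulated load, processor index).
    heap = [(0.0, p) for p in range(num_processors)]
    heapq.heapify(heap)
    assignment = {}
    by_load = sorted(subdomain_loads.items(), key=lambda kv: kv[1], reverse=True)
    for sub_id, load in by_load:
        total, proc = heapq.heappop(heap)  # least-loaded processor so far
        assignment[sub_id] = proc
        heapq.heappush(heap, (total + load, proc))
    return assignment


# Hypothetical loads: sub-domain "s0" contains an active storm cell.
loads = {"s0": 5.0, "s1": 3.0, "s2": 2.0, "s3": 2.0}
assignment = greedy_rebalance(loads, num_processors=2)
```

    Because there are more sub-domains than processors, the balancer can move whole sub-domains between processors without changing the application code, which is the property the abstract attributes to processor virtualization.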
  • Sharing Resources for Performance and Energy Optimization of Concurrent Streaming Applications

    Publication Year: 2010 , Page(s): 79 - 86
    Cited by:  Papers (1)

    We aim at finding optimal mappings for concurrent streaming applications. Each application consists of a linear chain with several stages, and processes successive data sets in pipeline mode. The objective is to minimize the energy consumption of the whole platform, while satisfying given performance-related bounds on the period and latency of each application. The problem is to decide which processors to enroll, at which speed (or mode) to use them, and which stages they should execute. We distinguish two mapping categories, interval mappings without reuse, and fully arbitrary general mappings. On the theoretical side, we establish complexity results for this tri-criteria mapping problem (energy, period, latency). Furthermore, we derive an integer linear program that provides the optimal solution in the most general case. On the experimental side, we design polynomial-time heuristics, and assess their absolute performance thanks to the linear program. One main goal is to evaluate the impact of processor sharing on the quality of the solution.
  • Feedback-Driven Restructuring of Multi-threaded Applications for NUCA Cache Performance in CMPs

    Publication Year: 2010 , Page(s): 87 - 94
    Cited by:  Papers (4)

    This paper addresses feedback-directed restructuring techniques tuned to Non Uniform Cache Architectures (NUCA) in CMPs running multi-threaded applications. Access time to NUCA caches depends on the location of the referred block, so the locality and cache mapping of the application influence the overall performance. We show techniques for altering the distribution of applications into the cache space so as to achieve improved average memory access time. In CMPs running multi-threaded applications, the aggregated accesses (and locality) of the processors form the actual cache load and pose specific issues. We consider a number of Splash-2 and Parsec benchmarks on an 8-processor system and we show that a relatively simple remapping algorithm is able to improve the average Static-NUCA (SNUCA) cache access time by 5.5% and allows an SNUCA cache to surpass the performance of a more complex dynamic-NUCA (DNUCA) for most benchmarks. Then, we present a more sophisticated remapping algorithm, relying on cache geometry information and on the access distribution statistics from individual processors, that reduces the average cache access time by 10.2% and is very stable across all benchmarks.
  • A Cache Replacement Policy Using Adaptive Insertion and Re-reference Prediction

    Publication Year: 2010 , Page(s): 95 - 102
    Cited by:  Papers (1)

    Previous research shows that the LRU replacement policy is not efficient when applications exhibit a distant re-reference interval. The recently proposed RRIP policy improves performance for such workloads. However, RRIP lacks access recency information, which may prevent the replacement policy from making accurate predictions. Consequently, RRIP is not robust for recency-friendly workloads. This paper proposes an Adaptive Insertion and Re-reference Prediction (AI-RRP) policy which evicts data based on both the re-reference prediction value and the access recency information. To make the replacement policy more adaptive across different workloads and different phases during execution, Dynamic AI-RRP (DAI-RRP) is proposed, which adjusts the insertion position and prediction value for different access patterns. Simulation results show DAI-RRP reduces CPI over LRU and Dynamic RRIP by an average of 8.3% and 4.1% respectively on a single-core processor with a 1MB 16-way set-associative last-level cache (LLC). Evaluations on a quad-core CMP with a 4MB shared LLC show that DAI-RRP outperforms LRU and Dynamic RRIP (DRRIP) on the weighted speedup metric by an average of 13.2% and 26.7% respectively. Furthermore, compared to LRU, DAI-RRP requires similar hardware, or even less hardware for a high-associativity cache.
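    For orientation, the sketch below models one cache set under the baseline static RRIP scheme that the abstract builds on, as commonly described: each block carries a small re-reference prediction value (RRPV), new blocks are inserted with a "long" prediction, hits reset the prediction to "near-immediate", and the victim is a block predicted "distant". It does not implement AI-RRP or DAI-RRP, whose details are not given here; the class name `SrripSet` and the 2-bit RRPV width are assumptions.

```python
class SrripSet:
    """One cache set under static RRIP with hit promotion (baseline sketch)."""

    def __init__(self, ways, rrpv_bits=2):
        self.max_rrpv = (1 << rrpv_bits) - 1   # "distant" re-reference
        self.tags = [None] * ways              # None marks an empty way
        self.rrpv = [self.max_rrpv] * ways     # empty ways are evicted first

    def access(self, tag):
        """Return True on a hit, False on a miss (the block is then filled)."""
        if tag in self.tags:
            # Hit: predict a near-immediate re-reference.
            self.rrpv[self.tags.index(tag)] = 0
            return True
        victim = self._find_victim()
        self.tags[victim] = tag
        # Insert with a "long" (not yet "distant") re-reference prediction.
        self.rrpv[victim] = self.max_rrpv - 1
        return False

    def _find_victim(self):
        while True:
            for way, value in enumerate(self.rrpv):
                if value == self.max_rrpv:
                    return way
            # No block is predicted distant yet: age every block and retry.
            self.rrpv = [v + 1 for v in self.rrpv]


# A re-referenced block ("A") survives a one-off block ("C") in a 2-way set.
cache_set = SrripSet(ways=2)
results = [cache_set.access(t) for t in ["A", "B", "A", "C", "A"]]
```

    Note that the victim search uses only the prediction values, not the order of accesses, which is the missing recency information the abstract refers to.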
  • MOPSO Applied to Architecture Tuning with Unified Second-Level Cache for Energy and Performance Optimization

    Publication Year: 2010 , Page(s): 103 - 110
    Cited by:  Papers (1)

    Design Space Exploration (DSE) has been a suitable strategy to configure a parameterized SoC platform in terms of system requirements such as energy and performance. In this work, a multi-objective approach (MOPSO) based on Particle Swarm Optimization was applied to DSE problems to support architecture tuning in a memory hierarchy with a unified second-level cache. The proposed approach considers two objectives to be optimized, energy consumption and application performance, and reduces the search effort by exploring only 2.64% of the exploration space. With regard to the cost function, MOPSO found solutions approaching the Pareto optimum in terms of energy consumption and performance in the majority of cases, about 66% of the studied cases. Experiments based on simulations were carried out on 18 applications from the MiBench and PowerStone benchmark suites.