Computer Architecture and High Performance Computing Workshops (SBAC-PADW), 2010 22nd International Symposium on

Date 27-30 Oct. 2010

  • [Front cover]

    Page(s): C1
    Freely Available from IEEE
  • [Title page i]

    Page(s): i
    Freely Available from IEEE
  • [Title page iii]

    Page(s): iii
    Freely Available from IEEE
  • [Copyright notice]

    Page(s): iv
    Freely Available from IEEE
  • Table of contents

    Page(s): v - vi
    Freely Available from IEEE
  • Conference Organization

    Page(s): vii
    Freely Available from IEEE
  • Message from the Workshop Organizers

    Page(s): viii
    Freely Available from IEEE
  • Program Committee and Reviewers

    Page(s): ix
    Freely Available from IEEE
  • Brazilian Computer Society (SBC)

    Page(s): x - xii
    Freely Available from IEEE
  • Ring Pipelined Algorithm for the Algebraic Path Problem on the CELL Broadband Engine

    Page(s): 1 - 6

    The algebraic path problem (APP) unifies a number of related combinatorial or numerical problems into one that can be solved by a generic algorithmic schema. In this paper, we propose a linear SPMD model based on the Warshall-Floyd procedure coupled with a systematic toroidal shift. Our scheduling requires a number of processors equal to the size of the input matrix. With fewer processors, we exploit the modularity revealed by our linear array to achieve the task using a locally parallel and globally sequential (LPGS) partitioning. In either case, each processor only needs a local memory large enough to hold one column (possibly a block column) of the matrix. These two characteristics clearly justify an implementation on the CELL Broadband Engine, because of the efficient SPE-to-SPE communication bandwidth and the raw power of each SPE. We report our experiments on a QS22 CELL blade with various input configurations and demonstrate the efficiency and scalability of our implementation. We show that, with a highly optimized Warshall-Floyd kernel, we can get close to 80 GFLOPS in single precision with 8 SPEs, which represents 80% of the peak performance for the APP on the CELL.
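
For readers unfamiliar with the APP, the following is a minimal sequential sketch of the Warshall-Floyd procedure parameterized by a semiring. It is not the paper's ring-pipelined SPMD schedule. The function name `algebraic_path` and the example matrix are illustrative assumptions, and the sketch assumes an idempotent semiring whose closure ("star") operation is trivial, as with (min, +) on non-negative weights.

```python
INF = float("inf")


def algebraic_path(matrix, add, mul):
    """Sequential Warshall-Floyd over a semiring (add, mul).

    Assumption: the closure (star) of every diagonal element is the
    multiplicative identity, which holds for (min, +) with non-negative
    weights and for (or, and). A general APP solver needs an explicit star.
    """
    n = len(matrix)
    d = [row[:] for row in matrix]  # work on a copy of the input
    for k in range(n):              # allow vertex k as an intermediate
        for i in range(n):
            for j in range(n):
                d[i][j] = add(d[i][j], mul(d[i][k], d[k][j]))
    return d


# Shortest paths: the (min, +) semiring on a 3-vertex cycle 0 -> 1 -> 2 -> 0.
weights = [
    [0, 3, INF],
    [INF, 0, 1],
    [2, INF, 0],
]
shortest = algebraic_path(weights, min, lambda a, b: a + b)

# Transitive closure: the (or, and) semiring on the same graph.
reach = [[w != INF for w in row] for row in weights]
closure = algebraic_path(reach, lambda a, b: a or b, lambda a, b: a and b)
```

Swapping the two operators is all it takes to move between problem instances, which is the sense in which the APP unifies them.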

  • Performance Evaluation of Optimized Implementations of Finite Difference Method for Wave Propagation Problems on GPU Architecture

    Page(s): 7 - 12

    The scattering of acoustic waves in non-homogeneous media has been of practical interest for the petroleum industry, mainly in the determination of new oil deposits. A family of computational models that represent this phenomenon is based on finite difference methods. The simulation of these phenomena has a high computational cost. In this work we employ GPUs for the development of solvers for a 2D wave propagation problem with finite difference methods. Although many related works use the same implementation presented in this paper, we propose a detailed and novel analysis of performance and memory bottlenecks for this hardware architecture.
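
As a rough illustration of the kind of kernel such solvers are built around, here is a sketch of one explicit time step of the 2D wave equation using second-order central differences. The abstract does not specify the paper's discretization, so the scheme, the fixed zero boundary, and the names `wave_step` and `courant_sq` are all assumptions made for this example.

```python
def wave_step(u_prev, u_curr, courant_sq):
    """Advance a 2D wave field by one time step (explicit, 5-point stencil).

    courant_sq is (c * dt / dx) ** 2. For this scheme on a uniform 2D grid
    it should not exceed 0.5, otherwise the iteration is unstable.
    Boundary cells are held at zero (an assumption of this sketch).
    """
    rows, cols = len(u_curr), len(u_curr[0])
    u_next = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            laplacian = (u_curr[i + 1][j] + u_curr[i - 1][j]
                         + u_curr[i][j + 1] + u_curr[i][j - 1]
                         - 4.0 * u_curr[i][j])
            u_next[i][j] = (2.0 * u_curr[i][j] - u_prev[i][j]
                            + courant_sq * laplacian)
    return u_next


# A unit pulse at the centre of a 5x5 grid, starting at rest.
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 1.0
after_one_step = wave_step(grid, grid, 0.25)
```

Each output cell reads five neighbouring input cells and performs only a handful of arithmetic operations, which is why stencils of this shape tend to be limited by memory traffic rather than by computation.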

  • Exploring Data Streaming to Improve 3D FFT Implementation on Multiple GPUs

    Page(s): 13 - 18

    FFT is a well-known and widely used algorithm in many scientific and engineering applications. However, FFT is a memory-bound problem that still presents performance challenges to new generations of computer architectures due to its relatively low ratio of computation per memory access. For GPU architectures, where data transfers between the host CPU memory and the device memory are very expensive, the memory overhead can become a huge bottleneck for large problems. In this work, we propose an efficient parallel implementation of FFT on multiple GPUs that tackles the overhead of host memory access by implementing a streaming scheme that hides the data transfer latency. The idea is to divide the problem into smaller ones, generating several lighter, asynchronous memory transfers from host to device so that computation on data already transferred can overlap with the remaining transfers. We obtained an acceleration of approximately 60% over the non-streamed GPU implementation.

  • LU Decomposition on GPUs: The Impact of Memory Access

    Page(s): 19 - 24

    Graphics Processing Units (GPUs) are emerging as an attractive computing platform for general purpose computations due to their extremely high floating-point processing performance and their comparatively low cost. In the context of dense linear algebra, the LU decomposition represents a fundamental step in many computationally intensive scientific applications. The use of GPUs can accelerate the computation to many times the speed of a single CPU. In this work, we investigate different implementations of the LU decomposition algorithm on a GPU. Our main goal is to parallelize the LU decomposition to fit the highly parallel architecture of modern GPUs, and to evaluate different types of memory access and their impact on the execution time of the algorithm. The results demonstrate that the memory access pattern can significantly impact the performance of the GPU implementation.
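
For reference, a minimal sequential sketch of LU decomposition (Doolittle form, no pivoting) is shown below. It is not any of the GPU variants the paper evaluates; the function name `lu_decompose` and the sample matrix are assumptions made for illustration.

```python
def lu_decompose(a):
    """Return (lower, upper) with a == lower * upper, without pivoting.

    Assumption: every pivot encountered is non-zero. Production codes use
    partial pivoting for numerical stability; this sketch omits it.
    """
    n = len(a)
    lower = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    upper = [[float(x) for x in row] for row in a]
    for k in range(n):
        pivot = upper[k][k]
        if pivot == 0.0:
            raise ZeroDivisionError("zero pivot; this sketch does no pivoting")
        for i in range(k + 1, n):
            factor = upper[i][k] / pivot
            lower[i][k] = factor
            # The row updates for different i are independent of each other,
            # which is the parallelism a GPU version can exploit.
            for j in range(k, n):
                upper[i][j] -= factor * upper[k][j]
    return lower, upper


lower, upper = lu_decompose([[4, 3], [6, 3]])
```

The innermost loop walks along rows of `upper`. Whether that order, or a column-wise one, suits a given memory layout is the kind of access-pattern question the abstract refers to.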

  • On Modelling Multicore Clusters

    Page(s): 25 - 30

    Multicore architectures are an important contribution to computing technology since they are capable of providing more processing power with better cost-benefit than single-core processors. Cores execute instructions independently but share critical resources such as L2 cache memory and data channels. Clusters using multicore architectures or multiprocessor chips (MPCs) suggest a hierarchical memory environment. Parallel applications should take advantage of such a memory hierarchy to achieve high performance. This paper presents a performance analysis of a synthetic application in a multicore cluster and introduces a preliminary architecture model that considers communication through both shared memory and data channels and its impact on application performance.

  • TALM: A Hybrid Execution Model with Distributed Speculation Support

    Page(s): 31 - 36

    Parallel programming has become mandatory to fully exploit the potential of modern CPUs. The data-flow model provides a natural way to exploit parallelism. However, traditional data-flow programming is not trivial: specifying dependencies and control using fine-grained tasks (such as instructions) can be complex and introduce unwanted overheads. To address this issue we have built a coarse-grained data-flow model with speculative execution support to be used on top of widespread architectures, implemented as a hybrid von Neumann/data-flow execution system. We argue that speculative execution fits naturally with the data-flow model. Using speculative execution frees the programmer to consider only the main dependencies, and still allows correct data-flow execution of coarse-grained tasks. Moreover, our speculation mechanism does not demand centralised control, which is a key feature for upcoming many-core systems, where scalability has become an important concern. An initial study on an artificial bank server application suggests that there is a wide range of scenarios where speculation can be very effective.

  • Effective Dynamic Scheduling on Heterogeneous Multi/Manycore Desktop Platforms

    Page(s): 37 - 42

    GPUs (Graphics Processing Units) have become one of the main co-processors moving desktops towards high-performance computing. Together with multicore CPUs and other co-processors, they turn a desktop into a powerful heterogeneous execution platform for data-intensive calculations. We see the modern desktop as a heterogeneous cluster that can deal with several applications' tasks at the same time. To improve application performance and exploit such heterogeneity, the distribution of workload over the asymmetric PUs (Processing Units) plays an important role for the system. However, this problem is challenging since the cost of a task on a PU is non-deterministic and can be influenced by several parameters not known a priori, such as the problem size. We present a context-aware architecture that maximizes application performance on such platforms. This approach combines a model for an initial scheduling based on an offline performance benchmark with a runtime model that keeps track of tasks' real performance. We carried out a demonstration using a CPU-GPU platform for computing iterative SLE (Systems of Linear Equations) solvers, using the number of unknowns as the main parameter for the assignment decision. We achieved a gain of 38.3% in comparison to the static assignment of all tasks to the GPU (which is done by current programming models, such as OpenCL and CUDA for Nvidia).

  • Towards a Power-Aware Application Level Scheduler for a Multithreaded Runtime Environment

    Page(s): 43 - 48

    As modern society becomes more dependent on computing power, people become more concerned about the environment and, in consequence, about energy consumption. In the high performance computing field, most works only take into account performance aspects such as throughput when evaluating schedulers. In this paper, we introduce and evaluate an energy-aware list scheduler that uses heuristics based on the critical path to determine processor affinity and the clock rate of each core. We have observed that it is possible to implement execution support that offers acceptable performance while also providing a strategy to save energy. Two case studies discussed in the paper support this conclusion.

  • I/O Performance Evaluation on Multicore Clusters with Atmospheric Model Environment

    Page(s): 49 - 54

    This work evaluates the I/O performance in a multicore cluster environment for an atmospheric model used in weather and climate simulations. The model involves large I/O data sets, as found in scientific applications. The analysis demonstrates that the scalability of the system gets worse as we increase the number of cores per machine, with greater impact on output operations. We also demonstrate the poor capacity of the multicore system for providing high aggregate I/O bandwidth, and that scalability is not improved whether I/O operations run through a parallel file system or on local disk.

  • OpenMP-based Parallel Algorithms for Solving Kronecker Descriptors

    Page(s): 55 - 60

    Numerical analysis of Markovian models is relevant for performance evaluation and probabilistic analysis of systems' behavior in several fields such as Bioinformatics, Economics, and Engineering. These models can be represented in a compact fashion using Kronecker algebra. The Vector-Descriptor Product is the key operation to obtain stationary solutions of Kronecker-based descriptors. Due to its complexity, the numerical algorithms are usually CPU intensive, requiring alternatives such as data partitioning in order to produce results in less time. This paper proposes three OpenMP-based parallel implementations for solving descriptors, to be deployed on shared-memory machines. We evaluated the implementations on a multi-core machine and obtained a speed-up close to eight when using eight cores with Intel Hyper-Threading technology.
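
To make the Vector-Descriptor Product concrete, the sketch below multiplies a row vector by a single Kronecker product of two square matrices without ever building the large matrix. A full descriptor is a sum of such terms, often with more than two factors, and the abstract does not describe the paper's three OpenMP algorithms, so this is only a simplified illustration. The names `vec_kron_product` and `kron` are assumptions.

```python
def kron(a, b):
    """Explicit Kronecker product of two square matrices (reference only)."""
    na, nb = len(a), len(b)
    return [[a[i][k] * b[j][l] for k in range(na) for l in range(nb)]
            for i in range(na) for j in range(nb)]


def vec_kron_product(v, a, b):
    """Compute v * (a kron b) for a row vector v, one factor at a time.

    Viewing v as an na x nb matrix V, the result is A^T * V * B, flattened
    row by row. The (na*nb) x (na*nb) matrix is never materialized.
    """
    na, nb = len(a), len(b)
    # Step 1: multiply each length-nb slice of v by b.
    t = [[sum(v[i * nb + j] * b[j][l] for j in range(nb)) for l in range(nb)]
         for i in range(na)]
    # Step 2: combine the slices according to a.
    return [sum(a[i][k] * t[i][l] for i in range(na))
            for k in range(na) for l in range(nb)]


v = [1, 2, 3, 4]
a = [[1, 2], [3, 4]]
b = [[0, 1], [1, 0]]
fast = vec_kron_product(v, a, b)

# Reference result using the explicit Kronecker product.
big = kron(a, b)
slow = [sum(v[r] * big[r][c] for r in range(len(v))) for c in range(len(v))]
```

In step 1 each slice of `v` is processed independently, so the slices are a natural unit for data partitioning across threads.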

  • Parallel Implementations of an Immune Network Model Using POSIX Threads and OpenMP

    Page(s): 61 - 66

    In the last few years, there has been an increasing interest in the mathematical and computational modeling of the human immune system (HIS). In particular, the use of computational models is fundamental to understanding the HIS dynamics. The availability of computational models on which to run experiments and test new hypotheses would accelerate our understanding of the HIS, allowing us to develop new drugs against many diseases. In this scenario we extended, in a previous work, a computational model of the HIS that represents the behavior of two of its cells, the B and T lymphocytes, in distinct situations. In this paper we present a parallel implementation of the model, developed in order to reduce the computational time needed in the simulations. We also describe the techniques used in the implementation of the parallel model, evaluate its performance, and discuss the initial results that have been obtained.

  • Parallel Implementation of a Computational Model of the HIS Using OpenMP and MPI

    Page(s): 67 - 72

    Primary immune responses are initiated when foreign pathogenic microorganisms move past the front-line defense system of the body. The primary immune responses consist of a) the production of antibody molecules that are specific for the pathogenic microorganism and b) the expansion and differentiation of special-purpose defensive cells, the lymphocytes. In this scenario, our work aims to develop and implement a mathematical and computational model for this primary immune response in a microscopic section of a tissue. However, solving the set of equations related to the mathematical model requires a large amount of computation. Therefore, in this work we present an initial attempt to improve the performance of the computational implementation via the use of parallel computing.

  • Author index

    Page(s): 73
    Freely Available from IEEE
  • [Publisher's information]

    Page(s): 74
    Freely Available from IEEE