
Architecture and Multi-Core Applications (WAMCA), 2011 Second Workshop on

Date 26-27 Oct. 2011

  • [Front cover]

    Publication Year: 2011 , Page(s): C1
    Freely Available from IEEE
  • [Title page i]

    Publication Year: 2011 , Page(s): i
    Freely Available from IEEE
  • [Title page iii]

    Publication Year: 2011 , Page(s): iii
    Freely Available from IEEE
  • [Copyright notice]

    Publication Year: 2011 , Page(s): iv
    Freely Available from IEEE
  • Table of contents

    Publication Year: 2011 , Page(s): v
    Freely Available from IEEE
  • Message from the Program Chairs

    Publication Year: 2011 , Page(s): vi
    Freely Available from IEEE
  • Committees

    Publication Year: 2011 , Page(s): vii
    Freely Available from IEEE
  • Keynote

    Publication Year: 2011 , Page(s): viii

    Summary form only given. In this talk we examine how high performance computing has changed over the last 10 years and look toward the future in terms of trends. These changes have had and will continue to have a major impact on our software. Some of the software and algorithm challenges have already been encountered, such as management of communication and memory hierarchies through a combination of compile-time and run-time techniques, but the increased scale of computation, depth of memory hierarchies, range of latencies, and increased run-time environment variability will make these problems much harder.

  • Industrial Talks

    Publication Year: 2011 , Page(s): ix - x
    Freely Available from IEEE
  • Tutorial

    Publication Year: 2011 , Page(s): xi

    Provides an abstract of the tutorial presentation and a brief professional biography of the presenter. The complete presentation was not made available for publication as part of the conference proceedings.

  • Large Scale Kronecker Product on Supercomputers

    Publication Year: 2011 , Page(s): 1 - 4

    The Kronecker product, also called the tensor product, is a fundamental matrix algebra operation that is widely used as a natural formalism to express a convolution of many interactions or representations. Given a set of matrices, we need to multiply their Kronecker product by a vector. This operation is a critical kernel for iterative algorithms and thus needs to be computed efficiently. In previous work, we proposed a cost-optimal parallel algorithm for the problem, both in terms of floating-point computation time and interprocessor communication steps. However, the lower bound on data transfers can only be achieved if we actually use (local) logarithmic broadcasts. In practice, we use a communication loop instead, so the real cost of each broadcast becomes important. As this local broadcast is performed simultaneously by each processor, the situation gets worse on a large number of processors (supercomputers). We address the problem in this paper in two ways. On the one hand, we propose a way to build a virtual topology that has the smallest gap to the theoretical lower bound. On the other hand, we consider a hybrid implementation, which has the advantage of reducing the number of communicating nodes. We illustrate our work with some benchmarks on a large SMP 8-core supercomputer.

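The kernel this abstract refers to, multiplying a Kronecker product of several matrices by a vector, can be illustrated serially. The sketch below is not the paper's parallel algorithm; it is a minimal single-process version, assuming square factors and row-major ordering of the vector, that applies one factor at a time and never forms the full product. The name `kron_matvec` is hypothetical.

```python
def kron_matvec(matrices, x):
    """Return (A_1 kron A_2 kron ... kron A_k) @ x without building the product.

    Assumes every A_i is a square list-of-lists and that x is stored in
    row-major order over the index tuple (i_1, ..., i_k).
    """
    sizes = [len(a) for a in matrices]
    total = 1
    for n in sizes:
        total *= n
    if len(x) != total:
        raise ValueError("vector length must equal the product of the sizes")

    y = list(x)
    left = 1        # product of the sizes of the factors already applied
    right = total   # becomes the product of the sizes of the later factors
    for a, n in zip(matrices, sizes):
        right //= n
        out = [0.0] * total
        # Apply the n-by-n matrix `a` along the current mode of the tensor.
        for l in range(left):
            for r in range(right):
                base = l * n * right + r
                for i in range(n):
                    acc = 0.0
                    for j in range(n):
                        acc += a[i][j] * y[base + j * right]
                    out[base + i * right] = acc
        y = out
        left *= n
    return y


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
print(kron_matvec([A, B], [1, 2, 3, 4]))
```

For k factors of size n each, this costs on the order of k * n^(k+1) multiplications instead of the n^(2k) needed with the explicit product, which is why the factor-by-factor formulation is the usual starting point for distributed versions.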
  • Trace-Based Visualization as a Tool to Understand Applications' I/O Performance in Multi-core Machines

    Publication Year: 2011 , Page(s): 5 - 11
    Cited by:  Papers (1)

    This paper presents the use of trace-based performance visualization of a large-scale atmospheric model, the Ocean-Land-Atmosphere Model (OLAM). The trace was obtained with the libRastro library, and the visualization was done with Paje. The visualization was used to analyze OLAM's performance and to identify its bottlenecks. In particular, we are interested in the model's I/O operations, since they were shown to be the main issue for the model's performance. We show that most of the time spent in the output routine is spent in the close operation. With this information, we delayed this operation until the next output phase, obtaining improved I/O performance.

  • Adaptive Power Optimization of On-chip SNUCA Cache on Tiled Chip Multicore Architecture Using Remap Policy

    Publication Year: 2011 , Page(s): 12 - 17
    Cited by:  Papers (3)

    Advances in technology have increased the number of cores and the size of caches present on chip multicore platforms (CMPs). As a result, leakage power consumption of on-chip caches has already become a major power-consuming component of the memory subsystem. We propose to reduce leakage power consumption in static nonuniform cache architecture (SNUCA) on a tiled CMP by dynamically varying the number of cache slices used and switching off unused cache slices. A cache slice in a tile includes all cache banks present in that tile. Switched-off cache slices are remapped considering the communication costs to reduce cache usage with minimal impact on execution time. This saves leakage power consumption in switched-off L2 cache slices. On average, the remap policy achieves 41% and 49% higher EDP savings compared to static and dynamic NUCA (DNUCA) cache policies on a scalable tiled CMP, respectively.

  • Evaluating the Problem of Process Mapping on Network-on-Chip for Parallel Applications

    Publication Year: 2011 , Page(s): 18 - 23
    Cited by:  Papers (1)

    Process mapping on Networks-on-Chip (NoC) is an important issue for future many-core processors. Mapping strategies can increase performance and scalability by optimizing the communication cost. However, parallel applications use a large set of collective communications that generate high traffic on the Network-on-Chip. Therefore, our goal in this paper is to evaluate the problem of process mapping for parallel applications. The results show that performance is similar across different mappings. This can be explained by collective communication, which causes a high number of packets to be exchanged by all routers. Our evaluation shows that topology and routing protocol can influence the process mapping. Consequently, different mapping strategies must be evaluated for different NoC architectures.

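One way to see why collective traffic can make mappings look alike is a toy cost model. The sketch below is an illustration only, not the evaluation method of the paper: it assumes a 2D mesh with dimension-ordered (XY) routing, so the hop count between two tiles is their Manhattan distance, and it scores a mapping as the sum of packets times hops. The names `mesh_hops` and `mapping_cost` and the traffic figures are hypothetical.

```python
def mesh_hops(tile_a, tile_b, width):
    """Hop count between two tiles of a 2D mesh, assuming XY routing."""
    ax, ay = tile_a % width, tile_a // width
    bx, by = tile_b % width, tile_b // width
    return abs(ax - bx) + abs(ay - by)


def mapping_cost(traffic, mapping, width):
    """Sum of packets * hops over all communicating process pairs.

    traffic: dict {(src_process, dst_process): packets}
    mapping: list where mapping[p] is the tile that hosts process p
    """
    return sum(
        packets * mesh_hops(mapping[src], mapping[dst], width)
        for (src, dst), packets in traffic.items()
    )


# Point-to-point traffic between two process pairs on a 2x2 mesh.
neighbor_traffic = {(0, 1): 10, (2, 3): 10}
# All-to-all traffic, as a collective operation might generate.
all_to_all = {(s, d): 1 for s in range(4) for d in range(4) if s != d}

good, bad = [0, 1, 2, 3], [0, 3, 1, 2]
print(mapping_cost(neighbor_traffic, good, 2), mapping_cost(neighbor_traffic, bad, 2))
print(mapping_cost(all_to_all, good, 2), mapping_cost(all_to_all, bad, 2))
```

Under this model the point-to-point pattern is sensitive to placement, while the all-to-all pattern yields the same cost for every one-to-one mapping, because every pair of tiles carries the same load regardless of which process sits where.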
  • Economical Two-fold Working Precision Matrix Multiplication on Consumer-Level CUDA GPUs

    Publication Year: 2011 , Page(s): 24 - 29

    A dot product that is faithfully rounded after being computed "as if" in K-fold working precision (K ≥ 2) is known to be computable using only floating-point numbers defined in the IEEE 754 floating-point standard. This paper presents a CUDA GPU implementation of two-fold working precision matrix multiplication based on this dot product computation method. Experimental results on a GeForce GTX580 and a GTX560Ti show that the proposed implementation has 1.84 to 1.95 times higher GFLOPS performance in two-fold working precision compared to the performance of CUBLAS dgemm in double precision on a Tesla C2070 high-end GPU. The proposed implementation can be used to obtain higher performance in pseudo double precision with low-cost consumer-level GPUs whose native double-precision performance is limited.

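The idea of a dot product computed "as if" in two-fold working precision can be sketched with error-free transformations. The code below follows the well-known TwoSum, Dekker splitting, and Dot2 scheme; that the paper uses exactly this variant is an assumption, and the sketch runs on the CPU with Python floats (IEEE 754 binary64) as the working precision rather than in CUDA. All function names are hypothetical.

```python
def two_sum(a, b):
    """Return (s, e) with s = fl(a + b) and a + b = s + e exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e


def split(a):
    """Dekker split of a binary64 value into high and low halves."""
    c = 134217729.0 * a  # 2**27 + 1
    hi = c - (c - a)
    return hi, a - hi


def two_product(a, b):
    """Return (p, e) with p = fl(a * b) and a * b = p + e (barring overflow)."""
    p = a * b
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    e = a_lo * b_lo - (((p - a_hi * b_hi) - a_lo * b_hi) - a_hi * b_lo)
    return p, e


def dot2(x, y):
    """Dot product accumulated as if in twice the working precision."""
    p, s = 0.0, 0.0
    for xi, yi in zip(x, y):
        h, r = two_product(xi, yi)   # product and its rounding error
        p, q = two_sum(p, h)         # running sum and its rounding error
        s += q + r                   # collect all rounding errors
    return p + s


x = [1e16, 1.0, -1e16]
y = [1.0, 1.0, 1.0]
print(dot2(x, y))
```

With these inputs a plain left-to-right accumulation in binary64 loses the middle term and returns 0.0, whereas the compensated version recovers 1.0. The price is roughly an order of magnitude more floating-point operations per element, all of them ordinary additions and multiplications in the working precision.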
  • Author index

    Publication Year: 2011 , Page(s): 30
    Freely Available from IEEE
  • [Publishers information]

    Publication Year: 2011 , Page(s): 32
    Freely Available from IEEE