17th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2005)

24-27 October 2005

Displaying results 1-25 of 42
  • Proceedings. 17th International Symposium on Computer Architecture and High Performance Computing

    Publication Year: 2005
    PDF (320 KB) | Freely Available from IEEE
  • 17th International Symposium on Computer Architecture and High Performance Computing - Title Page

    Publication Year: 2005 , Page(s): i - iii
    PDF (121 KB) | Freely Available from IEEE
  • 17th International Symposium on Computer Architecture and High Performance Computing - Copyright

    Publication Year: 2005 , Page(s): iv
    PDF (98 KB) | Freely Available from IEEE
  • 17th International Symposium on Computer Architecture and High Performance Computing - Table of contents

    Publication Year: 2005 , Page(s): v - viii
    PDF (161 KB) | Freely Available from IEEE
  • Message from the General Chair and Vice-Chair

    Publication Year: 2005 , Page(s): ix
    PDF (142 KB) | HTML | Freely Available from IEEE
  • Welcome from the Program Chairs

    Publication Year: 2005 , Page(s): x
    PDF (134 KB) | HTML | Freely Available from IEEE
  • Conference organizers

    Publication Year: 2005 , Page(s): xi
    PDF (130 KB) | Freely Available from IEEE
  • Program Committee

    Publication Year: 2005 , Page(s): xii
    PDF (133 KB) | Freely Available from IEEE
  • List of Reviewers

    Publication Year: 2005 , Page(s): xiii
    PDF (105 KB) | Freely Available from IEEE
  • Brazilian Computer Society (SBC)

    Publication Year: 2005 , Page(s): xiv
    PDF (131 KB) | Freely Available from IEEE
  • e-Science and its application: life sciences and finance

    Publication Year: 2005 , Page(s): 2 - 9
    PDF (248 KB) | HTML

    In 2001, the UK government announced new funding for an initiative known as e-Science. This initiative was charged with developing a new way of collaborative research using the grid as the information utility. The initiative was structured to provide funding to support a national grid infrastructure, tackle large-scale scientific research problems, and engage in projects with industry. This paper considers the impact of e-Science in the UK and, in particular, in the finance and life sciences sectors.

  • High performance computing in science and engineering

    Publication Year: 2005 , Page(s): 10 - 17
    Cited by:  Papers (1)
    PDF (232 KB) | HTML

    High performance computing has gradually shifted from the realm of research into development and partially even into the production cycles of industry. High performance computers therefore have to be integrated into production environments that demand the simultaneous solution of multidisciplinary physics problems. Supercomputer centers can learn from these new challenges imposed by industry. The concepts of workflow and production cycle open up a new horizon for integrating systems and software into what is called a distributed 'Teraflop-workbench' approach. Terascale storage and communication infrastructures are eventually needed to support such an environment.

  • Towards grid implementations of metaheuristics for hard combinatorial optimization problems

    Publication Year: 2005 , Page(s): 19 - 26
    PDF (320 KB) | HTML

    Metaheuristics are approximate algorithms that are able to find very good solutions to hard combinatorial optimization problems. They do, however, offer a wide range of possibilities for implementations of effective robust parallel algorithms which run in much smaller computation times than their sequential counterparts. We present four slightly differing strategies for the parallelization of an extended GRASP with ILS heuristic for the mirrored traveling tournament problem. Computational results on widely used benchmark instances, using a varying number of processors, illustrate the effectiveness and the scalability of the different strategies. These low communication cost parallel heuristics not only find solutions faster, but also produce better quality solutions than the best known sequential algorithm.

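    The multi-walk parallelization that the abstract describes (independent randomized searches whose best result is kept) can be sketched as follows. This is an illustrative Python toy with an invented objective and invented names, not the authors' GRASP with ILS for the traveling tournament problem:

```python
# Minimal multi-walk parallel metaheuristic: several independent
# randomized searches run in parallel and the best solution found by
# any walk is kept. The objective here is a toy (minimize the sum of
# adjacent gaps in a permutation); all names are illustrative.
import random
from multiprocessing import Pool

def local_search(seed: int) -> tuple[float, list[int]]:
    """One randomized walk: random start plus simple iterated improvement."""
    rng = random.Random(seed)
    perm = list(range(10))
    rng.shuffle(perm)
    def cost(p):
        return sum(abs(a - b) for a, b in zip(p, p[1:]))
    best = perm[:]
    for _ in range(200):
        i, j = rng.sample(range(10), 2)
        perm[i], perm[j] = perm[j], perm[i]   # random 2-swap perturbation
        if cost(perm) < cost(best):
            best = perm[:]
        else:
            perm = best[:]                    # restart from the incumbent
    return cost(best), best

def parallel_grasp(n_workers: int = 4) -> tuple[float, list[int]]:
    """Run n_workers independent walks in parallel; return the best."""
    with Pool(n_workers) as pool:
        results = pool.map(local_search, range(n_workers))
    return min(results)
```

    Calling `parallel_grasp(4)` returns the best (cost, permutation) pair over four walks; since the walks never communicate, this corresponds to the lowest-communication end of the strategy spectrum the paper explores.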
  • Scheduling collective communications on wormhole fat cubes

    Publication Year: 2005 , Page(s): 27 - 34
    Cited by:  Papers (1)
    PDF (288 KB) | HTML

    Recent renewed interest in the hypercube interconnection network has concentrated on its more scalable and mostly cheaper version known as the fat cube. This paper generalizes the known results on the time complexity of collective communications on a hypercube to the wormhole fat cube. Examples of particular communication algorithms on the 2D-fat-cube topology with 8 processors are summarized and given in detail. The performed study shows that a large variety of fat cubes can provide lower cost, better scalability and manufacturability without compromising communication performance.

  • A new multi-processor architecture for parallel lazy cyclic reference counting

    Publication Year: 2005 , Page(s): 35 - 42
    PDF (256 KB) | HTML

    Reference counting is the memory management technique of most widespread use today. This paper presents a new multi-processor architecture for parallel cyclic reference counting. In this architecture, there is no direct mutator-collector communication and synchronization is kept minimal. View full abstract»

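    The cycle problem that motivates cyclic reference counting can be seen in a few lines. The sketch below (illustrative Python, not the paper's architecture or lazy algorithm) shows plain reference counting reclaiming an acyclic object immediately but leaking a two-object cycle:

```python
# Toy reference counter illustrating why cycles need extra machinery.
# All names are illustrative; the paper's lazy cyclic algorithm and
# its multi-processor architecture are not reproduced here.
class Obj:
    def __init__(self):
        self.rc = 0          # reference count
        self.refs = []       # outgoing references

heap = set()

def new_obj():
    o = Obj(); heap.add(o); return o

def add_ref(src, dst):
    src.refs.append(dst); dst.rc += 1

def del_ref(src, dst):
    src.refs.remove(dst); dst.rc -= 1
    if dst.rc == 0:          # plain reference counting reclaims here
        for child in list(dst.refs):
            del_ref(dst, child)
        heap.discard(dst)

root = new_obj(); root.rc = 1            # pinned by an external root

# acyclic case: reclaimed as soon as the last reference is dropped
a = new_obj(); add_ref(root, a); del_ref(root, a)
assert a not in heap

# cyclic case: never reclaimed by counts alone
b, c = new_obj(), new_obj()
add_ref(root, b); add_ref(b, c); add_ref(c, b)   # b <-> c cycle
del_ref(root, b)
assert b in heap and c in heap           # leaked: counts stay above zero
```

    Cyclic collectors detect such leaked subgraphs by trial deletion or mark scans; doing that lazily and in parallel without mutator-collector communication is the paper's contribution.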
  • Reconfigurable optical interconnection system supporting concurrent application-specific parallel computing

    Publication Year: 2005 , Page(s): 44 - 51
    PDF (328 KB) | HTML

    Application-specific architectures are highly desirable in embedded parallel computing systems, even as designers strive to use one embedded parallel computing platform for several applications. If this can be achieved, the cost can be decreased in comparison to using several different embedded parallel computing systems. This paper presents a novel approach of running several high-performance applications concurrently on one single parallel computing system. By using a reconfigurable backplane interconnection system, the applications can be run efficiently with high network flexibility, since the interconnect network can be adapted to fit the application currently being processed. More precisely, this paper investigates how the space-time adaptive processing (STAP) radar algorithm and the stripmap synthetic aperture radar (SAR) algorithm can be mapped on a multi-cluster processing system with a reconfigurable optical interconnection system realized by micro-opto-electro-mechanical system (MOEMS) crossbars. The paper describes the reconfigurable platform, the two algorithms, and how each can be mapped on the targeted multiprocessor system, as well as how the two applications can be mapped simultaneously on the optical reconfigurable platform. Implications and requirements on communication bandwidth and processor performance at different critical points of the two applications are presented. The results of the analysis show that an implementation is feasible with today's MOEMS technology, and that the two applications can be successfully run in a time-sharing scheme, both on the processing side and in access to interconnection bandwidth.

  • High-performance and area-efficient reduction circuits on FPGAs

    Publication Year: 2005 , Page(s): 52 - 59
    Cited by:  Papers (5)
    PDF (312 KB) | HTML

    Field-programmable gate arrays (FPGAs) have become an attractive option for scientific applications. However, due to the pipelining in the FPGA-based floating-point units, data hazards may occur during reduction of series of values. A typical example of reduction is the accumulation of sets of floating-point values, which is needed in many scientific operations such as dot product and matrix-vector multiplication. Reduction circuits can significantly impact the overall performance, impose unrealistic buffer requirements, or occupy large area on the FPGA. In this paper, we introduce a high-performance and area-efficient FPGA-based reduction circuit. It can reduce multiple sets of sequentially delivered floating-point values without stalling the pipeline. In contrast with previous works, the proposed circuit uses one floating-point adder, and can handle input sets of arbitrary size. The buffer size needed by the circuit is independent of the size of the individual sets and the number of input sets. Using a Xilinx Virtex-II Pro FPGA as the target device, we implement the proposed reduction circuit and present performance and area results.

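    The data hazard the abstract mentions, and the classic workaround of interleaving partial sums across the adder's pipeline slots, can be modeled in software. The sketch below is a toy with an assumed latency value and invented names; it shows the idea only, not the paper's stall-free multi-set circuit:

```python
# A pipelined adder with latency L can accept one operation per cycle,
# but its result is only available L cycles later, so naively adding
# each new value to a single running sum stalls the pipeline. Keeping
# L interleaved partial sums lets one adder stay busy every cycle; the
# partials are combined at the end. LATENCY is an assumed figure.
LATENCY = 4   # adder pipeline depth in cycles (illustrative)

def reduce_interleaved(values):
    partials = [0.0] * LATENCY          # one partial sum per pipeline slot
    for i, v in enumerate(values):
        partials[i % LATENCY] += v      # each slot is reused every L cycles
    total = 0.0                         # final combine of the L partials
    for p in partials:
        total += p
    return total

vals = [float(i) for i in range(100)]
assert reduce_interleaved(vals) == sum(vals)
```

    The hard part the paper addresses is doing this for multiple back-to-back input sets of arbitrary size with bounded buffering, which this sketch does not attempt.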
  • Extending the ArchC language for automatic generation of assemblers

    Publication Year: 2005 , Page(s): 60 - 67
    Cited by:  Papers (2)
    PDF (248 KB) | HTML

    In this paper, we extend the ArchC language with new constructs to describe the assembly language syntax and operand encoding of an instruction set architecture. Based on the extended language we have created a tool which can automatically generate assemblers. Our tool uses the GNU Binutils framework in order to produce the assembler, generating the architecture dependent files necessary to retarget the GNU assembler and the Binutils libraries. We have generated assemblers for the MIPS-I and SPARC-V8 architectures based on ArchC models using our tool. The assemblers generated for both architectures were compared with the default gas assemblers for a set of files taken from the MiBench benchmark, and the ELF object files generated by each pair of assemblers were equivalent in both cases.

  • Managing the execution of large scale MPI applications on computational grids

    Publication Year: 2005 , Page(s): 69 - 76
    Cited by:  Papers (3)
    PDF (288 KB) | HTML

    Computational grids aim to aggregate significant numbers of resources to provide sufficient, but low-cost, computational power to an ever-growing variety of applications. Writing applications capable of executing efficiently in these grid environments is, however, extremely difficult for inexperienced users. The grid's geographically distributed resources are typically heterogeneous, non-dedicated, and offered without any performance or availability guarantees. This work investigates an alternative approach, based on smarter system-aware applications, to the problem of developing grid applications and managing their execution efficiently. Results show that these system-aware MPI applications are indeed faster than their conventional implementations and are easily grid-enabled.

  • A scalable dissemination service for the ISAM architecture

    Publication Year: 2005 , Page(s): 77 - 84
    PDF (552 KB) | HTML

    The ISAM architecture presents a platform for the development and execution of pervasive applications. DIMI (Disseminador Multicast de Informacoes, information multicast disseminator) is an information dissemination service designed for the ISAM architecture. DIMI mainly seeks scalability, avoiding the formation of bottlenecks as new consumers are accepted on a channel. Besides scalability, the service provides other characteristics required by the computational environment proposed by the ISAM architecture, such as support for planned disconnection and for user mobility.

  • Remote data service installation on a grid-enabled Java platform

    Publication Year: 2005 , Page(s): 85 - 91
    PDF (616 KB) | HTML

    We define a framework for remote service installation, called SIMG (service installation metaservice on grids). Its main goal is to allow grid users to define and safely install their own data filters on remote data servers. Once installed, these filters may be used like the predefined services and may be available for a broader community of users. SIMG is implemented in SUMA/G, a Globus-enabled platform for remote execution of Java bytecode. We also present our experiences using SIMG for defining and remotely installing two complementary services in a content-based image retrieval system.

  • Automatic data-flow graph generation of MPI programs

    Publication Year: 2005 , Page(s): 93 - 100
    PDF (328 KB) | HTML

    The data-flow graph (DFG) of a parallel application is frequently used to make scheduling decisions, based on the information that it models (dependencies among the tasks and the volume of exchanged data). In the case of MPI-based programs, the DFG may be built at run-time by overloading the data exchange primitives. This article presents a library that enables the generation of the DFG of an MPI program, and its use to analyze network contention on a test application: the Linpack benchmark. It is the first step towards automatic mapping of an MPI program onto a distributed architecture.

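    Building a DFG by overloading data exchange primitives can be sketched without real MPI. In the toy below, the class and method names are invented stand-ins; each send records an edge (source rank, destination rank, bytes) so dependencies and exchanged volume can be analyzed afterwards:

```python
# Sketch of DFG construction by intercepting message-passing calls.
# No real MPI library is used; TracingComm is an invented stand-in.
from collections import defaultdict

class TracingComm:
    def __init__(self):
        # (src_rank, dst_rank) -> total bytes exchanged on that edge
        self.edges = defaultdict(int)

    def send(self, src: int, dst: int, payload: bytes):
        self.edges[(src, dst)] += len(payload)   # record the DFG edge
        # ... a real wrapper would forward to MPI_Send here ...

comm = TracingComm()
comm.send(0, 1, b"x" * 1024)   # rank 0 -> rank 1, 1 KiB
comm.send(0, 1, b"y" * 512)    # same edge, volume accumulates
comm.send(1, 2, b"z" * 256)

assert comm.edges[(0, 1)] == 1536
assert comm.edges[(1, 2)] == 256
```

    The MPI profiling interface (PMPI) allows exactly this kind of interception without modifying the application source, which is presumably why run-time DFG capture is practical for programs like Linpack.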
  • Function outlining and partial inlining

    Publication Year: 2005 , Page(s): 101 - 108
    PDF (280 KB) | HTML

    Frequently invoked large functions are common in non-numeric applications. These large functions present challenges to modern compilers not only because they require more time and resources at compilation time, but also because they may prevent optimizations such as function inlining. Usually, however, large portions of the code in a hot function f_host are executed much less frequently than f_host itself. Partial inlining is a natural solution to the problems caused by including cold code segments that are seldom executed in hot functions that are frequently invoked. When applying partial inlining, a compiler outlines cold statements from a hot function f_host. After outlining, f_host becomes smaller and thus can be easily inlined. This paper presents a framework for function outlining and partial inlining that includes several innovations: (1) an abstract-syntax-tree-based analysis and transformation to form cold regions for outlining; (2) a set of flexible heuristics to control the aggressiveness of function outlining; (3) several possible function outlining strategies; (4) the alias agent, a new technique that overcomes negative side-effects of function outlining. With the proper strategy, partial inlining improves performance by up to 5.75%. A performance study also suggests that partial inlining is not effective at enabling more aggressive inlining; the performance improvement from partial inlining actually comes from better code placement and better code generation.

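    At source level, outlining can be illustrated by moving a cold error path out of a hot function. The paper performs this automatically on abstract syntax trees inside a compiler; the functions below are invustrated by hand and all names are invented:

```python
# Before outlining, hot_function would contain the whole cold
# error-handling body, making it large and hard to inline. After
# outlining, the cold statements live in _cold_path and the hot body
# is a few instructions, so a compiler could inline it at call sites.
def _cold_path(x):
    """Outlined cold region: rarely executed, deliberately bulky."""
    details = [f"diagnostic {i}" for i in range(100)]
    raise ValueError(f"bad input {x}: {len(details)} notes")

def hot_function(x):
    """Small after outlining, and therefore a good inlining candidate."""
    if x < 0:            # cold condition
        _cold_path(x)
    return x * 2         # hot path
```

    A usage example: `hot_function(21)` runs only the two-line hot path, while `hot_function(-1)` pays the cost of the outlined call only on the rare error.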
  • A flexible specification model based on XML for parallel applications

    Publication Year: 2005 , Page(s): 109 - 116
    PDF (256 KB) | HTML

    This article introduces a new workflow specification model for parallel applications. Many parallel systems allow only the following basic flow control: (1) a coordinator task distributes the data among the worker tasks, (2) the worker tasks process the data in parallel, and (3) the coordinator task gathers all the results. In some systems the application execution flow is based on a dependence relationship among tasks, representing a directed acyclic graph (DAG). Even with this model, it is not possible to express some important parallel applications; therefore, there is a need for a new specification model with more sophisticated flow controls that allow, for example, conditional branching and iteration at the level of task management. The purpose of this article is to present a proposal for a new parallel application specification model that provides new types of control structures and allows the implementation of a broader range of applications.

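    A workflow with conditional branching and iteration, beyond the plain scatter/process/gather pattern, might be expressed in XML roughly as below. The element and attribute names are invented for illustration and are not the paper's actual schema:

```python
# Parse a hypothetical XML workflow spec featuring a parallel task,
# a conditional branch, and an iteration edge back to a task.
import xml.etree.ElementTree as ET

SPEC = """
<workflow>
  <task id="split"/>
  <task id="work" after="split" parallel="4"/>
  <branch after="work" cond="converged">
    <task id="gather"/>
  </branch>
  <iterate after="work" cond="not-converged" target="work"/>
</workflow>
"""

root = ET.fromstring(SPEC)
tasks = {t.get("id"): t for t in root.iter("task")}  # iter() recurses

assert set(tasks) == {"split", "work", "gather"}
assert tasks["work"].get("after") == "split"
assert root.find("iterate").get("target") == "work"  # the loop edge
```

    The `iterate` edge is what a pure DAG model cannot express; a scheduler consuming such a spec would re-dispatch `work` until the condition holds.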
  • Branch prediction topologies for SMT architectures

    Publication Year: 2005 , Page(s): 118 - 125
    Cited by:  Patents (1)
    PDF (688 KB) | HTML

    The exploitation of instruction-level parallelism in superscalar architectures is limited by data and control dependencies. Simultaneous multithreaded (SMT) architectures can exploit another level of parallelism, called thread-level parallelism, to fetch and execute instructions from different tasks at the same time. While a task is blocked by control or data dependencies, other tasks may continue executing, thus masking latencies caused by mispredicted branches and memory accesses, and increasing the occupation of functional units. However, the design of SMT architectures brings new challenges, such as determining the most efficient way to share resources among different threads. In this paper, we present different branch prediction topologies for SMT architectures. We show that the best results are obtained by matching the number of i-cache modules (fetch width) with the number of branch prediction modules (number of lookups and updates), while increasing the number of modules also helps increase clock rates. Moreover, contention on the branch prediction lookup and update buses cannot be ignored in such architectures.

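    The resource-sharing question the abstract raises can be illustrated with a toy 2-bit bimodal predictor: two threads whose branches alias into the same entry of a shared table destroy each other's state, while private per-thread tables do not. The table size and branch streams below are invented:

```python
# Toy 2-bit saturating-counter (bimodal) branch predictor, used to
# contrast a shared prediction table with per-thread private tables.
class Bimodal:
    def __init__(self, size=16):
        self.ctr = [1] * size            # 2-bit counters, weakly not-taken

    def predict(self, pc):
        return self.ctr[pc % len(self.ctr)] >= 2   # True = predict taken

    def update(self, pc, taken):
        i = pc % len(self.ctr)
        self.ctr[i] = min(3, self.ctr[i] + 1) if taken else max(0, self.ctr[i] - 1)

def accuracy(pred, stream):
    hits = 0
    for pc, taken in stream:
        hits += (pred.predict(pc) == taken)
        pred.update(pc, taken)
    return hits / len(stream)

# two threads hammer the same table index with opposite outcomes
t0 = [(0, True)] * 100
t1 = [(16, False)] * 100   # 16 % 16 == 0: aliases with t0 in a shared table
interleaved = [x for pair in zip(t0, t1) for x in pair]

shared = accuracy(Bimodal(), interleaved)                    # one table, both threads
private = (accuracy(Bimodal(), t0) + accuracy(Bimodal(), t1)) / 2  # one table each

assert private > shared   # destructive aliasing hurts the shared topology
```

    In this adversarial toy, the shared table mispredicts every branch while private tables are nearly perfect; real workloads interfere less dramatically, which is why the paper measures whole topologies rather than single entries.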