Proceedings of the 14th Symposium on Computer Architecture and High Performance Computing, 2002

Date: 28-30 Oct. 2002

  • Proceedings 14th Symposium on Computer Architecture and High Performance Computing

    Publication Year: 2002
  • Author index

    Publication Year: 2002, Page(s): 221
  • Minimally-skewed-associative caches

    Publication Year: 2002, Page(s): 100-107

    Skewed associativity is a technique that reduces the miss ratios of CPU caches by applying a different indexing function to each way of an associative cache. Even though it showed impressive hit/miss statistics, the scheme has not been welcomed by industry, presumably because the original version is complex to implement and might involve access-time penalties among other costs. This paper presents a simplified, easy-to-implement variant that we call "minimally-skewed-associativity" (MSkA). We show that, in many cases, MSkA caches should incur no penalty in either access time or power consumption when compared to set-associative caches of the same associativity. Hit/miss statistics were obtained by means of trace-driven simulations. Miss ratios are not as good as those for full skewing, but they are still advantageous. Minimal skewing is thus proposed as a way to improve the hit/miss performance of caches, often without introducing the access-time delays or increases in power consumption that other techniques (for example, higher associativities) incur.

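    The core idea, applying a different index function to each way, can be illustrated with a tiny lookup model. The sketch below is a generic illustration under assumed parameters (XOR-based index functions, two ways, 64-byte blocks); it is not the MSkA organization evaluated in the paper.

        # Sketch of a skewed-associative lookup; the per-way index functions are assumptions.
        NUM_SETS = 256          # frames per way
        NUM_WAYS = 2
        BLOCK_BITS = 6          # 64-byte blocks

        def way_index(block, way):
            """Each way hashes the block address differently: way 0 uses the plain
            modulo index, way 1 XORs in higher address bits to skew conflicts."""
            if way == 0:
                return block % NUM_SETS
            return (block ^ (block >> 8)) % NUM_SETS

        tags = [[None] * NUM_SETS for _ in range(NUM_WAYS)]

        def access(addr):
            """Return True on a hit; on a miss, naively fill way 0 (no replacement policy)."""
            block = addr >> BLOCK_BITS
            for w in range(NUM_WAYS):
                if tags[w][way_index(block, w)] == block:
                    return True
            tags[0][way_index(block, 0)] = block
            return False

    Two blocks that collide in way 0 (same low index bits) are unlikely to also collide in way 1, which is where the miss-ratio benefit comes from.
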
  • Instruction set extension for long integer modulo arithmetic on RISC-based smart cards

    Publication Year: 2002, Page(s): 13-19
    Cited by:  Papers (4)

    Modulo multiplication of long integers (≥ 1024 bits) is the major operation of many public-key cryptosystems like RSA or Diffie-Hellman. The efficient implementation of modulo arithmetic is a challenging task, in particular on smart cards, due to their constrained resources and relatively slow clock frequency. We present the concept of an application-specific instruction set extension (ISE) for long integer arithmetic. We introduce an optimized multiply-and-accumulate (MAC) unit that makes it possible to compute a×b+c+d with only one instruction, where a, b, c, d are single-precision words (unsigned integers). This additional instruction is simple to incorporate into common RISC architectures like the MIPS32. Experimental results show that the inner-product operation of a multiple-precision multiplication can be accelerated by a factor of two without increasing the processor's clock frequency. We also estimate the execution time of a 1024-bit modulo exponentiation assuming that this special MAC instruction were available. The proposed ISE is an alternative to a crypto co-processor, especially for multi-application smart cards (e.g., Java cards) with an embedded 32-bit RISC core.

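    A single a×b+c+d instruction is safe for 32-bit single-precision words because the result always fits in a 64-bit double-precision word: (2^32-1)^2 + 2·(2^32-1) = 2^64-1. The sketch below shows how such a MAC maps onto the inner product of a schoolbook multiple-precision multiplication; the word size and loop structure are generic assumptions, not the paper's MIPS32 encoding.

        MASK = (1 << 32) - 1   # 32-bit single-precision words

        def mac(a, b, c, d):
            """The MAC primitive: a*b + c + d never overflows 64 bits, since
            (2**32 - 1)**2 + 2*(2**32 - 1) == 2**64 - 1."""
            return a * b + c + d

        def mp_multiply(x_words, y_words):
            """Schoolbook multiple-precision multiply over little-endian 32-bit words;
            each inner-loop step is exactly one MAC: x_word*y_word + column + carry."""
            result = [0] * (len(x_words) + len(y_words))
            for i, xi in enumerate(x_words):
                carry = 0
                for j, yj in enumerate(y_words):
                    t = mac(xi, yj, result[i + j], carry)
                    result[i + j] = t & MASK
                    carry = t >> 32
                result[i + len(y_words)] = carry
            return result

        # self-check of two 1024-bit operands against Python's big integers
        to_words = lambda n: [(n >> (32 * i)) & MASK for i in range(32)]
        x, y = (1 << 1024) - 12345, (1 << 1024) - 67890
        p = mp_multiply(to_words(x), to_words(y))
        assert sum(w << (32 * i) for i, w in enumerate(p)) == x * y
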
  • Simulating L3 caches in real time using hardware accelerated cache simulation (HACS): a case study with SPECint 2000

    Publication Year: 2002, Page(s): 108-114
    Cited by:  Papers (1)  |  Patents (1)

    Trace-driven simulation is a commonly used tool for evaluating memory-hierarchy designs. Unfortunately, trace collection is very expensive, and the storage requirements for traces are very large. In this paper, we introduce HACS (Hardware Accelerated Cache Simulator) and describe the validation methods we used to demonstrate its functionality. We also present some initial cache simulation results for SPECint 2000, and we then propose future directions for research with HACS.

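    For context, the software baseline that a hardware-accelerated simulator replaces amounts to replaying a recorded address trace through a cache model. The sketch below is a generic set-associative, LRU model with assumed geometry, not the HACS configuration.

        from collections import OrderedDict

        def simulate_cache(trace, line_bits=6, num_sets=1024, ways=8):
            """Replay an address trace through a set-associative LRU cache model
            and return (hits, misses)."""
            sets = [OrderedDict() for _ in range(num_sets)]   # tag -> None, ordered by recency
            hits = misses = 0
            for addr in trace:
                block = addr >> line_bits
                s, tag = sets[block % num_sets], block // num_sets
                if tag in s:
                    hits += 1
                    s.move_to_end(tag)            # refresh LRU position
                else:
                    misses += 1
                    s[tag] = None
                    if len(s) > ways:
                        s.popitem(last=False)     # evict the least recently used tag
            return hits, misses

    Replaying long traces through this loop is the costly step that a hardware cache simulator accelerates.
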
  • Interactive ray tracing using a SIMD reconfigurable architecture

    Publication Year: 2002, Page(s): 20-28
    Cited by:  Papers (2)  |  Patents (9)

    This paper presents an architecture for running interactive ray tracing applications on portable devices such as cell phones, PDAs and head-mounted displays, and discusses the main issues in mapping this graphics algorithm to fixed-point arithmetic. The paper shows that a floating-point arithmetic unit, with its associated power and area consumption, can be avoided by using appropriate fixed-point arithmetic and block floating-point operations. It is also shown that a computation-intensive graphics method like ray tracing can be used to generate simple images at interactive rates on portable devices. This can be achieved by employing a reconfigurable SIMD architecture on a chip, which trades parallelism for frequency of operation, thus providing significant benefits in power saving, which is essential in portable devices.

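    The fixed-point substitution described above can be seen in miniature below: values become scaled integers and multiplication needs a corrective shift. This is a generic Q16.16 sketch, not the formats or block floating-point scheme chosen in the paper.

        FRAC_BITS = 16                                     # Q16.16 fixed point
        def to_fx(x):     return int(round(x * (1 << FRAC_BITS)))
        def from_fx(v):   return v / (1 << FRAC_BITS)
        def fx_mul(a, b): return (a * b) >> FRAC_BITS      # drop the extra 16 fraction bits
        def fx_div(a, b): return (a << FRAC_BITS) // b     # pre-shift to keep precision
        def fx_dot(u, v): return sum(fx_mul(a, b) for a, b in zip(u, v))

        # ray/plane intersection distance t = (d - dot(n, o)) / dot(n, dir), all in integers
        o  = [to_fx(c) for c in (0.0, 0.0, 0.0)]           # ray origin
        dr = [to_fx(c) for c in (0.0, 0.0, 1.0)]           # unit ray direction
        n  = [to_fx(c) for c in (0.0, 0.0, 1.0)]           # plane normal
        d  = to_fx(5.0)                                    # plane offset
        t  = fx_div(d - fx_dot(n, o), fx_dot(n, dr))
        print(from_fx(t))                                  # 5.0, computed with integer arithmetic only
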
  • An advanced filtering TLB for low power consumption

    Publication Year: 2002, Page(s): 93-99
    Cited by:  Papers (1)

    This research designs a new two-level TLB (translation look-aside buffer) architecture that integrates a 2-way banked filter TLB with a 2-way banked main TLB. One of the main objectives is to reduce power consumption in embedded processors by distributing accesses to the TLB entries across several banks in a balanced manner. To this end, an advanced filtering technique is devised that reduces power dissipation by adopting a sub-bank structure in the filter TLB, and a bank-associative structure is applied to each level of the TLB hierarchy. Simulation results show that the miss ratio and Energy*Delay product can be improved by 59.26% and 24.9%, respectively, compared with a micro TLB with 4-32 entries, and by 40.81% and 12.18% compared with a micro TLB with 16-32 entries.

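    The energy argument is that a tiny filter TLB catches most translations, so the larger main TLB is rarely activated, and banking means only one bank is powered up per access. The following behavioural sketch of the lookup path uses assumed entry counts and a low-bit bank select; it is not the paper's exact organization or replacement policy.

        class Bank:
            def __init__(self, capacity):
                self.capacity = capacity
                self.map = {}                                  # virtual page -> physical frame

            def lookup(self, vpn):
                return self.map.get(vpn)

            def insert(self, vpn, pfn):
                if len(self.map) >= self.capacity:             # FIFO eviction, for brevity
                    self.map.pop(next(iter(self.map)))
                self.map[vpn] = pfn

        class TwoLevelTLB:
            """A 2-way banked filter TLB in front of a 2-way banked main TLB: only the
            bank selected by the low page-number bit is consulted at each level."""
            def __init__(self):
                self.filter = [Bank(4) for _ in range(2)]      # small, low-energy first level
                self.main   = [Bank(16) for _ in range(2)]

            def translate(self, vpn, page_table):
                bank = vpn & 1                                 # assumed bank-select bit
                pfn = self.filter[bank].lookup(vpn)
                if pfn is not None:
                    return pfn                                 # filter hit: main TLB untouched
                pfn = self.main[bank].lookup(vpn)
                if pfn is None:
                    pfn = page_table[vpn]                      # both levels missed: walk the page table
                    self.main[bank].insert(vpn, pfn)
                self.filter[bank].insert(vpn, pfn)             # promote into the filter
                return pfn
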
  • Distributed shared memory in kernel mode

    Publication Year: 2002, Page(s): 159-166
    Cited by:  Papers (1)

    In this paper we introduce MOMEMTO (MOre MEMory Than Others), a new set of kernel mechanisms that give users full control of the distributed shared memory on a cluster of personal computers. In contrast to many existing software DSM systems, MOMEMTO supports global shared memory efficiently and flexibly, allowing applications to address a larger memory space than is available in a single node. MOMEMTO has been implemented in the Linux 2.4 kernel, and preliminary performance results show that it has low memory-management and communication overheads and that it can indeed perform very well for large memory configurations.

  • Simulating semiconductor spectra emissions in a PC cluster

    Publication Year: 2002, Page(s): 39-43

    Computer simulation is nowadays one of the most important tools for the correct understanding of physical phenomena. We analyse the performance improvement obtained by parallelizing an algorithm used to simulate the electronic properties of semiconductor systems.

  • A framework for exploiting adaptation in high heterogeneous distributed processing

    Publication Year: 2002, Page(s): 125-132

    ISAM is a proposal for resource management in heterogeneous networks that supports physical and logical mobility, dynamic adaptation and the execution of component-based distributed applications. In order to achieve its goals, ISAM adopts as its strategy an integrated environment that (a) provides a programming paradigm and its execution environment, and (b) handles the adaptation process through a multilevel collaborative model in which both the system and the application contribute. In this paper we discuss the main mechanisms used to implement the ISAM features, and we also present a parallel application that exploits some of these features.

  • Exploiting loop-level parallelism with the Shift Architecture

    Publication Year: 2002, Page(s): 184-191

    The limited amount of instruction-level parallelism inherent in applications restricts the performance improvements attainable by most conventional microprocessors. A promising way to overcome this problem is to exploit coarser granularities of parallelism. In this paper, we propose exploiting loop-level parallelism in a multithreaded fashion. We use the Shift Architecture as a baseline architecture, with improved compiler support and register file. The compiler converts the iterations of a loop into threads to be executed by multiple processing elements. The hardware provides a selective register-shifting mechanism that allows the execution of loops containing loop-carried data dependences, which are very difficult to execute on conventional architectures. In this paper, we simulate and discuss the parameters of major importance for the implementation of this architectural approach. Our initial results show that, on two simple numerical benchmarks, a considerable amount of iteration overlapping can potentially be achieved by an implementation of the Shift Architecture, in comparison with a multiprocessor machine.

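    To see why loop-carried dependences resist naive thread-level parallelization, and how forwarding the carried value between iteration slots restores overlap (the role selective register shifting plays in hardware), here is a software analogue built on queues. It is a generic DOACROSS illustration under assumed parameters, not the Shift Architecture mechanism itself.

        import threading, queue

        def doacross(n_iters, n_workers, init):
            """Run a loop with a loop-carried dependence on n_workers threads:
            iteration i receives the carried value from iteration i-1 through a
            queue and forwards its own result to i+1, a software stand-in for
            shifting a register between processing elements."""
            links = [queue.Queue(maxsize=1) for _ in range(n_iters + 1)]
            links[0].put(init)
            results = [None] * n_iters

            def worker(wid):
                for i in range(wid, n_iters, n_workers):        # round-robin iteration assignment
                    independent = i * i                         # work with no carried dependence: runs in parallel
                    carried = links[i].get() + independent      # wait only for the carried value, then update it
                    results[i] = carried
                    links[i + 1].put(carried)                   # hand the value to iteration i+1

            threads = [threading.Thread(target=worker, args=(w,)) for w in range(n_workers)]
            for t in threads:
                t.start()
            for t in threads:
                t.join()
            return results

        assert doacross(10, 4, 0) == [sum(j * j for j in range(i + 1)) for i in range(10)]

    The independent part of each iteration can proceed in parallel; only the short dependent update is serialized along the chain of queues.
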
  • T&D-Bench: an environment for modeling and simulating complex processor architectures

    Publication Year: 2002, Page(s): 176-183

    This paper presents T&D-Bench, an integrated suite of tools for modeling and simulating state-of-the-art processors, which is composed of two main parts. SimPL is an object-oriented methodology for modeling the behavior of an instruction set, with precise information on the timing of basic instruction steps. The methodology is general and allows easy modeling of various architecture types. The second part of the suite is CSPSim, an open set of visualization tools that communicate with any number of SimPL models through a client-server architecture. T&D-Bench combines the main advantages of teaching environments, with their rich user interfaces, and of design environments, with their resources for modeling complex processor architectures.

  • Infrastructure, requirements and applications for e-Science

    Publication Year: 2002, Page(s): 3-10

    Recent developments in the international arena have meant that the technology is now mature enough to bring together the elements required for the implementation of a grid computing facility. This paper examines the requirements and applications for an e-Science infrastructure, with particular reference to developments in Europe.

  • Implementing declarative parallel bottom-avoiding choice

    Publication Year: 2002, Page(s): 82-89

    Non-deterministic choice supports efficient parallel speculation, but unrestricted non-determinism destroys the referential transparency of purely declarative languages by removing unfoldability, and it carries the danger of wasting resources on unnecessary computations. While numerous choice mechanisms have been proposed that preserve unfoldability, and some concurrent implementations exist, we believe that no compiled parallel implementation has previously been constructed. This paper presents the design, semantics, implementation and use of a family of bottom-avoiding choice operators for Glasgow parallel Haskell. The subtle semantic properties of our choice operations are described, including a careful classification using an existing framework, together with a discussion of operational-semantics issues and the pragmatics of a distributed-memory implementation. The expressiveness of our choice operators is demonstrated by constructing a branch-and-bound search, a merge and a speculative conditional. Their effectiveness is demonstrated by comparing the parallel performance of the speculative search with naive and 'perfect' implementations. Their efficiency is assessed by measuring runtime overhead and heap consumption.

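    As an operational intuition only (not the GpH semantics or the compiled implementation described in the paper), bottom-avoiding choice can be read as: evaluate the alternatives speculatively in parallel and commit to whichever yields a value, so a diverging alternative cannot drag the whole expression to bottom. A small Python analogue follows; the function names are invented for illustration.

        import time
        from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

        def bottom_avoiding_choice(*alternatives):
            """Speculatively evaluate all alternative thunks in parallel and return
            the value of the first one to finish; a slower (or diverging) alternative
            is simply never chosen.  A real implementation must also reclaim the
            losing computations, which plain Python threads cannot be forced to do."""
            pool = ThreadPoolExecutor(max_workers=len(alternatives))
            futures = [pool.submit(alt) for alt in alternatives]
            done, _ = wait(futures, return_when=FIRST_COMPLETED)
            pool.shutdown(wait=False)                            # do not wait for the losers
            return next(iter(done)).result()

        slow  = lambda: (time.sleep(5), "slow alternative")[1]   # stands in for a costly branch
        quick = lambda: "quick alternative"
        print(bottom_avoiding_choice(slow, quick))               # prints "quick alternative" right away
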
  • Improving the communication network of a cluster of PC's to solve implicit CFD problems

    Publication Year: 2002, Page(s): 44-50

    In this work we investigate the feasibility of using a cluster of PCs built with mass-market networks to meet the needs of the CFD community, in particular for unstructured implicit CFD solvers that require a very irregular pattern of communication. We report the initial findings from a series of experiments with some well-known benchmarks to determine the sensitivity of CFD applications to machine communication parameters. This is done by running these benchmarks on a cluster in which the communication network has been modified to increase bandwidth, by adding multiple channels, and to reduce latency, by using a lightweight protocol such as M-VIA.

  • Parallel boundary elements using LAPACK and ScaLAPACK

    Publication Year: 2002, Page(s): 51-58

    This work introduces the main steps towards the parallelization of existing boundary element method (BEM) codes using standard, portable libraries for writing parallel programs, such as LAPACK and ScaLAPACK. A well-known BEM Fortran implementation is reviewed and rewritten to run on shared- and distributed-memory systems. This effort is the initial step towards a new generation of parallel BEM codes to be used in many important engineering problems. Numerical experiments on an SGI Origin 2000 show the effectiveness of the proposed approach.

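    For context, collocation-style BEM discretizations produce dense linear systems, which is why LAPACK (and its distributed-memory counterpart ScaLAPACK) is the natural building block. The sketch below shows the single-node step using SciPy's LAPACK wrappers as a stand-in for calling LAPACK from Fortran; the matrix is a random placeholder, not an actual BEM influence matrix.

        import numpy as np
        from scipy.linalg import lu_factor, lu_solve   # thin wrappers over LAPACK getrf/getrs

        n = 500
        rng = np.random.default_rng(0)
        A = rng.standard_normal((n, n)) + n * np.eye(n)   # dense, well-conditioned placeholder
        b = rng.standard_normal(n)

        lu, piv = lu_factor(A)        # LU factorization: the O(n^3) kernel that ScaLAPACK distributes
        x = lu_solve((lu, piv), b)    # forward/backward triangular solves
        print(np.allclose(A @ x, b))  # True
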
  • The virtual cluster: a dynamic network environment for exploitation of idle resources

    Publication Year: 2002, Page(s): 141-148
    Cited by:  Papers (2)

    Standard environments for exploiting the idle time of workstations are based on some kind of spying process that detects low CPU usage and informs a scheduler so that work can be dispatched. This approach generates local interference and, since the same local environment is used, can lead to security problems. We are investigating the exploitation of idle time in network resources based on a complete mode change in a candidate node. After detecting that a node is idle, a mode switcher boots a new operating system that works on a separate disk partition. After the boot phase, the node is linked to a logical network topology and is available to receive jobs. Users can allocate nodes from this virtual cluster through a standard front end, as they would in a "conventional" cluster. Because nodes may leave and join this virtual machine, we use distributed processor management to allow user applications to cope with this dynamic resource behavior. In this paper we describe the architecture of the virtual cluster and present the results obtained with a mode switcher and a prototype application under real conditions of use.

  • Architecture of oscillatory neural network for image segmentation

    Publication Year: 2002, Page(s): 29-36
    Cited by:  Papers (1)

    Oscillatory neural networks are a recent approach to image segmentation. In this context, LEGION (Locally Excitatory Globally Inhibitory Oscillator Network) is the most consistent proposal. On the positive side, the network has a parallel architecture and the capacity to separate segments in time. On the other hand, its structure, based on differential equations, has high computational complexity and limited segmentation capacity, which restricts practical applications. This paper presents a parallel architecture for implementing an oscillatory neural network suitable for image segmentation. The proposed network keeps the positive features of the LEGION network while offering lower complexity for implementation in digital hardware, unlimited segmentation capacity, and few parameters with intuitive settings. Preliminary results confirm the successful operation of the proposed network in image segmentation applications.

  • Instruction usage and the memory gap problem

    Publication Year: 2002, Page(s): 169-175

    The gap between memory and processor speeds is responsible for a substantial amount of idle time in current processors. To reduce the impact of the so-called "memory gap problem," many software techniques (e.g., code layout reorganization) together with hardware mechanisms (cache memory, translation look-aside buffer, branch prediction, speculative execution, trace cache, instruction reuse, and so on) have been successfully implemented. In this paper we present some experiments that explain why these mechanisms and techniques are so efficient. We found that only a small fraction of the object code is actually executed: our experiments disclosed that more than 50% of the instructions remain untouched during the whole execution, and the percentage of basic blocks that remain unused is slightly greater. In addition to the usage of instructions and blocks, the paper provides further insights into the behavior of application programs and gives some suggestions for extra performance gains.

  • Generating Java code for TINA systems

    Publication Year: 2002, Page(s): 68-74

    The work presented in this paper is a tool developed to help in the process of prototyping a TINA system. The tool automatically generates Java code for a general TINA system whose objects were previously described in the SDL language. The generated code is a completely functional distributed system that uses CORBA as the distribution environment.

  • Cluster-based static scheduling: theory and practice

    Publication Year: 2002, Page(s): 133-140
    Cited by:  Papers (4)

    Task scheduling is a key element in achieving high performance from multicomputer systems. To be efficient, scheduling algorithms must be based on a cost model appropriate for the computing systems in use. Optimal task scheduling is NP-hard, and a large number of heuristic algorithms have been proposed for a variety of scheduling conditions (graph types, granularities or cost models). This paper studies the problem of task scheduling under the LogP model and presents both theoretical and experimental results for a cluster-based, task-duplication methodology.

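    As a reminder of what scheduling "under the LogP model" involves: the model charges a per-message send/receive overhead o, a network latency L and a minimum gap g between message injections (plus the processor count P). One common reading of the arrival time of the k-th message in a back-to-back burst is sketched below; the parameter values are made up.

        def logp_arrival_time(k, L, o, g):
            """Time at which the k-th message of a back-to-back burst is available at
            the receiver: the sender injects a message every max(g, o) cycles, and each
            message then costs o (send overhead) + L (latency) + o (receive overhead)."""
            return (k - 1) * max(g, o) + o + L + o

        print(logp_arrival_time(5, L=10, o=2, g=4))   # 4*4 + 2 + 10 + 2 = 30 cycles

    A cluster-based scheduler uses costs of this kind to decide when grouping or duplicating tasks on the same processor is cheaper than paying for the messages between them.
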
  • Efficient cyclic weighted reference counting

    Publication Year: 2002, Page(s): 61-67

    Weighted reference counting is a very simple and efficient memory management system for multiprocessor architectures. This paper extends the weighted reference counting algorithm to work efficiently with cyclic data structures.

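    For readers unfamiliar with the base scheme: weighted reference counting lets a processor copy a reference without contacting the object's owner, because the copy simply takes half of the reference's weight; only deletion sends weight back. Below is a minimal acyclic sketch of that base scheme; it does not include the cyclic extension that is the paper's contribution.

        class Obj:
            def __init__(self):
                self.total_weight = 0
                self.live = True

        class Ref:
            def __init__(self, obj, weight):
                self.obj, self.weight = obj, weight

        INITIAL = 1 << 16

        def first_reference(obj):
            obj.total_weight += INITIAL          # the owner records the weight handed out
            return Ref(obj, INITIAL)

        def copy_reference(ref):
            """Copying splits the weight locally; no message to the owner is needed."""
            assert ref.weight > 1                # weight 1 needs an indirection cell (omitted)
            half = ref.weight // 2
            ref.weight -= half
            return Ref(ref.obj, half)

        def delete_reference(ref):
            """Deletion returns the weight; the object dies when its total reaches zero."""
            ref.obj.total_weight -= ref.weight
            if ref.obj.total_weight == 0:
                ref.obj.live = False             # reclaim storage

        o = Obj()
        r1 = first_reference(o)
        r2 = copy_reference(r1)
        delete_reference(r1)
        delete_reference(r2)
        print(o.live)                            # False: all weight returned, object reclaimed

    With plain reference counting, weighted or not, a cycle of references never returns all of its weight, which is the problem the paper addresses.
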
  • GloVE: a distributed environment for low cost scalable VoD systems

    Publication Year: 2002, Page(s): 117-124
    Cited by:  Papers (2)  |  Patents (2)

    In this paper we introduce a scalable video-on-demand (VoD) system called GloVE (Global Video Environment), in which active clients cooperate to create a shareable video cache that is used as the primary source of video content for subsequent client requests. In this way, the GloVE server's bandwidth does not limit the number of simultaneous clients that can watch a video: once its content is in the cooperative video cache (CVC), it can be transmitted directly from the cache rather than from the VoD server. Also, GloVE follows the peer-to-peer approach, allowing the use of low-cost PCs as video servers. In addition, GloVE supports video servers without multicast capability and videos in any stored format. We analyze preliminary performance results of GloVE implemented on a PC server with a Fast Ethernet interconnect and small video buffers at the clients. Our results confirm that while the GloVE-based server uses only a single video channel to deliver a highly popular video simultaneously to N clients, conventional VoD servers require as many as N times more channels.

  • A parallel approximation hitting set algorithm for gene expression analysis

    Publication Year: 2002, Page(s): 75-81

    With recent DNA-microarray technology, it is possible to measure the expression levels of thousands of genes simultaneously in the same experiment. A genetic network is a model that describes how the expression level of each gene is affected by the expression levels of the other genes in the network. Given the results of an experiment with n genes and m measures over time (m << n), we consider the problem of finding a subset of k genes (k << n) that explains the expression level of a given target gene under study. We consider the coarse-grained multicomputer (CGM) model with p processors. In this paper we first present a sequential approximation algorithm with O(m⁴n) time and O(m²n) space. The main result is a new parallel approximation algorithm that determines the k genes in O(m⁴n/p) local computing time plus O(k) communication rounds, with a space requirement of O(m²n/p). The p factor in the parallel time and space complexities indicates a good parallelization. We also show preliminary, promising experimental results on a Beowulf machine. To our knowledge there are no other CGM algorithms for the problem considered in this paper.

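    One way to see the combinatorial core, independently of the CGM machinery (the precise reduction is in the paper): each observation in which the target gene's level changes yields a set of candidate genes, and a subset that explains the target must hit every such set. A sequential greedy sketch of the approximation step is shown below; the parallel algorithm distributes the sets over the p processors.

        def greedy_hitting_set(sets):
            """Greedy logarithmic-factor approximation for minimum hitting set:
            repeatedly pick the element that hits the most not-yet-hit sets."""
            remaining = [set(s) for s in sets if s]
            chosen = []
            while remaining:
                counts = {}
                for s in remaining:
                    for e in s:
                        counts[e] = counts.get(e, 0) + 1
                best = max(counts, key=counts.get)                 # most frequent element
                chosen.append(best)
                remaining = [s for s in remaining if best not in s]
            return chosen

        # each set lists candidate genes compatible with one observed change in the target
        print(greedy_hitting_set([{'g1', 'g2'}, {'g2', 'g3'}, {'g4'}]))   # ['g2', 'g4']
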
  • Design and evaluation of data access prediction strategies in SDSM systems

    Publication Year: 2002, Page(s): 151-158

    Software distributed shared memory (SDSM) systems provide the shared-memory abstraction on top of message-passing hardware, simplifying application programming on these architectures. However, some memory references exhibit long latencies due to remotely cached data. In order to hide this latency, many techniques that propagate data speculatively have been developed; these require that the data access behavior of the applications be determined. Traditionally, many of these techniques were directed at specific data-sharing patterns such as producer-consumer and migratory. In this paper, we propose and evaluate generic data access prediction techniques for SDSM systems. By generic we mean that our strategies do not try to detect specific sharing patterns known a priori. The proposed prediction strategies can be divided into two classes: local information predictors (LIP), which are guided only by local information in each processor, and global information predictors (GIP), which use the data access pattern of all processors in order to make predictions. Our experimental results show that techniques within both classes can attain high hit ratios in most of the applications evaluated. Overall, the results allow us to conclude that the data access prediction strategies we propose can contribute significantly to increasing the performance of current page-based SDSMs.

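    A concrete example of the kind of predictor meant here (invented for illustration; it is not one of the paper's LIP or GIP designs): a per-page last-successor table that, after processor P touches a page, predicts the processor that followed P on that page last time, so the page can be pushed there speculatively.

        from collections import defaultdict

        class LastSuccessorPredictor:
            """Per-page table remembering which processor followed each processor's
            access to the page last time, and predicting that successor again."""
            def __init__(self):
                self.successor = defaultdict(dict)    # page -> {processor -> next processor}
                self.last_accessor = {}               # page -> most recent processor

            def record_access(self, page, proc):
                prev = self.last_accessor.get(page)
                if prev is not None and prev != proc:
                    self.successor[page][prev] = proc  # learn the observed pattern
                self.last_accessor[page] = proc

            def predict_next(self, page, proc):
                """Who is likely to need this page after proc?  None means no prediction,
                so the SDSM falls back to fetching on demand."""
                return self.successor[page].get(proc)

        p = LastSuccessorPredictor()
        for proc in [0, 1, 0, 1, 0]:          # a producer-consumer-like pattern on page 7
            p.record_access(7, proc)
        print(p.predict_next(7, 0))           # 1: push the page to processor 1 speculatively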