High-Performance Computer Architecture, 1998. Proceedings., 1998 Fourth International Symposium on

Date 1-4 Feb. 1998

Displaying Results 1 - 25 of 34
  • Proceedings 1998 Fourth International Symposium on High-Performance Computer Architecture

    Publication Year: 1998
    Freely Available from IEEE
  • The emergence of workstation clusters: Should we continue to build MPPs? [panel session]

    Publication Year: 1998, Page(s): 138
    Freely Available from IEEE
  • Author index

    Publication Year: 1998, Page(s): 352
    Freely Available from IEEE
  • Efficiently adapting to sharing patterns in software DSMs

    Publication Year: 1998, Page(s):289 - 299
    Cited by:  Papers (24)

    In this paper we introduce a page-based Lazy Release Consistency protocol called ADSM that constantly and efficiently adapts to the applications' sharing patterns. Adaptation in ADSM is based on our dynamic categorization of the type of sharing experienced by each page. Pages can be categorized as falsely-shared, migratory, or producer/consumer(s). Migratory and producer/consumer(s) pages are mana...
  • Communication across fault-containment firewalls on the SGI Origin

    Publication Year: 1998, Page(s):277 - 287
    Cited by:  Patents (1)

    Scalability and reliability are inseparable in high-performance computing. Fault-isolation through hardware is a popular means of providing reliability. Unfortunately, such isolation also increases communication latencies: typically, one has to drop into and out of the kernel to communicate between failure domains. On the other hand, relaxing fault isolation domains allows efficient communication,...
  • Treegion scheduling for wide issue processors

    Publication Year: 1998, Page(s):266 - 276
    Cited by:  Papers (19)  |  Patents (2)

    Instruction scheduling is one of the most important phases of compilation for high-performance processors. A compiler typically divides a program into multiple regions of code and then schedules each region. Many past efforts have focused on linear regions such as traces and superblocks. The linearity of these regions can limit speculation, leading to under-utilization of processor resources, espe...
  • Fine-grain software distributed shared memory on SMP clusters

    Publication Year: 1998, Page(s):125 - 136
    Cited by:  Papers (16)

    Commercial SMP nodes are an attractive building block for software distributed shared memory systems. The advantages of using SMP nodes include fast communication among processors within the same node and potential gains from clustering where remote data fetched by one processor is used by other processors on the same node. This paper describes a major extension to the Shasta distributed shared me...
  • Enhancing memory use in Simple Coma: Multiplexed Simple Coma

    Publication Year: 1998, Page(s):152 - 161
    Cited by:  Papers (2)

    Scalable shared-memory multiprocessors that are designed as cache-only memory architectures (Coma) allow automatic replication and migration of data in the main memory. This enhances programmability by hopefully eliminating the need for data distribution strategies and page migration schemes. A variant of Coma called Simple Coma has been proposed as a lower-cost alternative to hardware-intensive s...
  • Performance evaluation of tiling for the register level

    Publication Year: 1998, Page(s):254 - 265
    Cited by:  Papers (1)

    Tiling is a well-known loop transformation, which is basically used to expose coarse-grain parallelism and to exploit data reuse at the cache level. However, it can also be used to exploit data reuse at the register level and to improve programs' ILP. Previous work on tiling and also commercial compilers are able to perform tiling for the register level in more than one dimension when the iterati...
  • Challenging applications on fast networks

    Publication Year: 1998, Page(s):68 - 79

    Parallel computing on clusters of workstations is attractive because of the low costs in comparison to MPPs, but the speed of the local area network limits the class of applications that can be run efficiently. Fortunately, faster network technology is becoming available for the next generation of workstation clusters. This paper studies the effect of running challenging applications that communic...
  • Home-based SVM protocols for SMP clusters: Design and performance

    Publication Year: 1998, Page(s):113 - 124
    Cited by:  Papers (24)

    As small-scale shared memory multiprocessors proliferate in the market, it is very attractive to construct large-scale systems by connecting smaller multiprocessors together in software using efficient, commodity network interfaces and networks. Using a shared virtual memory (SVM) layer for this purpose preserves the attractive shared memory programming abstraction across nodes. In this paper: We ...
  • PRISM: an integrated architecture for scalable shared memory

    Publication Year: 1998, Page(s):140 - 151
    Cited by:  Papers (8)  |  Patents (5)

    This paper describes PRISM, a distributed shared memory architecture that relies on a tightly integrated hardware and operating system design for scalable and reliable performance. PRISM's hardware provides mechanisms for flexible management and dynamic configuration of shared memory pages with different behaviors. As an example, PRISM can configure individual shared memory pages in both CC-NUMA a...
  • Virtual-physical registers

    Publication Year: 1998, Page(s):175 - 184
    Cited by:  Papers (42)  |  Patents (7)

    A novel dynamic register renaming approach is proposed in this work. The key idea of the novel scheme is to delay the allocation of physical registers until a late stage in the pipeline, instead of doing it in the decode stage as conventional schemes do. In this way, the register pressure is reduced and the processor can exploit more instruction-level parallelism. Delaying the allocation of physic...
  • Speculative versioning cache

    Publication Year: 1998, Page(s):195 - 205
    Cited by:  Papers (72)  |  Patents (24)

    Dependences among loads and stores whose addresses are unknown hinder the extraction of instruction level parallelism during the execution of a sequential program. Such ambiguous memory dependences can be overcome by memory dependence speculation which enables a load or store to be speculatively executed before the addresses of all preceding loads and stores are known. Furthermore, multiple specul...
  • Address translation mechanisms in network interfaces

    Publication Year: 1998, Page(s):219 - 230
    Cited by:  Papers (9)  |  Patents (5)

    Good network hardware performance is often squandered by overheads for accessing the network interface (NI) within a host. NIs that support user-level messaging avoid frequent operating system (OS) action yet unnecessary copying can still result in low performance. We explore improving application messaging performance by eliminating all unnecessary copies (minimal messaging). For minimal messagin...
  • Temporal-based procedure reordering for improved instruction cache performance

    Publication Year: 1998, Page(s):244 - 253
    Cited by:  Papers (9)  |  Patents (4)

    As the gap between memory and processor performance continues to grow, it becomes increasingly important to exploit cache memory effectively. Both hardware and software techniques can be used to better utilize the cache. Hardware solutions focus on organization, while most software solutions investigate how to best lay out a program on the available memory space. We present a new link-time code reo...
  • A very efficient distributed deadlock detection mechanism for wormhole networks

    Publication Year: 1998, Page(s):57 - 66
    Cited by:  Papers (17)

    Networks using wormhole switching have traditionally relied upon deadlock avoidance strategies for the design of routing algorithms. More recently, deadlock recovery strategies have begun to gain acceptance. Progressive deadlock recovery techniques are very attractive because they allocate a few dedicated resources to quickly deliver deadlocked messages, instead of killing them. However, the distr...
  • FPGA based custom computing machines for irregular problems

    Publication Year: 1998, Page(s):324 - 333
    Cited by:  Papers (6)

    Over the past few years there has been increased interest in building custom computing machines (CCMs) as a way of achieving very high performance on specific problems. The advent of high density field programmable gate arrays (FPGAs), in combination with new synthesis tools, has made it relatively easy to produce programmable custom machines without building specific hardware. In many cases, the...
  • The effectiveness of SRAM network caches in clustered DSMs

    Publication Year: 1998, Page(s):103 - 112
    Cited by:  Papers (16)

    The frequency of accesses to remote data is a key factor affecting the performance of all Distributed Shared Memory (DSM) systems. Remote data caching is one of the most effective and general techniques to fight processor stalls due to remote capacity misses in the processor caches. The design space of remote data caches (RDC) has many dimensions and one essential performance trade-off hit ratio v...
  • The sensitivity of communication mechanisms to bandwidth and latency

    Publication Year: 1998, Page(s):37 - 46
    Cited by:  Papers (5)

    The goal of this paper is to gain insight into the relative performance of communication mechanisms as bisection bandwidth and network latency vary. We compare shared memory with and without prefetching, message passing with interrupts and with polling, and bulk transfer via DMA. We present two sets of experiments involving four irregular applications on the MIT Alewife multiprocessor. First, we i...
  • Comparative evaluation of latency tolerance techniques for software distributed shared memory

    Publication Year: 1998, Page(s):300 - 311
    Cited by:  Papers (9)  |  Patents (4)

    A key challenge in achieving high performance on software DSMs is overcoming their relatively large communication latencies. In this paper, we consider two techniques which address this problem: prefetching and multithreading. While previous studies have examined each of these techniques in isolation, this paper is the first to evaluate both techniques using a consistent hardware platform and set ...
  • Hardware for speculative run-time parallelization in distributed shared-memory multiprocessors

    Publication Year: 1998, Page(s):162 - 173
    Cited by:  Papers (16)  |  Patents (3)

    Run-time parallelization is often the only way to execute the code in parallel when data dependence information is incomplete at compile time. This situation is common in many important applications. Unfortunately, known techniques for run-time parallelization are often computationally expensive or not general enough. To address this problem, we propose new hardware support for efficient run-time ...
  • Architectural implications of a family of irregular applications

    Publication Year: 1998, Page(s):80 - 89
    Cited by:  Papers (7)

    Irregular applications based on sparse matrices are at the core of many important scientific computations. Since the importance of such applications is likely to increase in the future, high-performance parallel and distributed systems must provide adequate support for such applications. We characterize a family of irregular scientific applications and derive the demands they will place on the com...
  • Supporting highly speculative execution via adaptive branch trees

    Publication Year: 1998, Page(s):185 - 194
    Cited by:  Papers (2)  |  Patents (3)

    Most prediction mechanisms predict a single path along which to continue execution at a branch. Alternatively, we may exploit parallelism from both possible paths of a branch, discarding wrong paths once the branch is resolved. This paper proposes a concept of adaptive branch trees (ABT) to support highly speculative execution for processors with deeper pipelines and wide issue widths. The basic ...
  • The impact of data transfer and buffering alternatives on network interface design

    Publication Year: 1998, Page(s):207 - 218
    Cited by:  Papers (3)  |  Patents (3)

    The explosive growth in the performance of microprocessors and networks has created a new opportunity to reduce the latency of fine-grain communication. Microprocessor clock speeds are now approaching the gigahertz range. Network switch latencies have dropped to tens of nanoseconds. Unfortunately, this explosive growth also exposes processor accesses to the network interface (NI) as a critical bot...