
Proceedings of the 13th Symposium on High Performance Interconnects, 2005

Date: 17-19 Aug. 2005


Displaying Results 1 - 25 of 35
  • Proceedings. 13th Symposium on High Performance Interconnects

  • 13th Symposium on High Performance Interconnects - Title Page

    Page(s): i - iii
  • 13th Symposium on High Performance Interconnects - Copyright Page

    Page(s): iv
  • 13th Symposium on High Performance Interconnects - Table of contents

    Page(s): v - vii
  • General Chairs’ Message

    Page(s): viii
  • Message from the Program Co-Chairs

    Page(s): ix - x
  • Committees

    Page(s): xi - xii
  • Using the open network lab

    Page(s): 2 - 3

    The Open Network Laboratory is a resource designed to enable experimental evaluation of advanced networking concepts in a realistic operating environment. The laboratory is built around a set of open-source, extensible, high-performance routers that can be accessed by remote users through a remote laboratory interface (RLI). The RLI allows users to configure the testbed network, run applications, and monitor those applications using built-in data-gathering mechanisms. Support for data visualization and real-time remote display is provided. The RLI also allows users to extend, modify, or replace the software running in the routers' embedded processors, and to similarly extend, modify, or replace the routers' packet-processing hardware, which is implemented largely using field-programmable gate arrays. The routers in the testbed are architecturally similar to high-performance commercial routers, enabling researchers to evaluate their ideas in a much more realistic context than PC-based routers can provide. The Open Network Laboratory is designed to provide a setting in which systems researchers can evaluate and refine their ideas, then demonstrate them to those interested in moving the technology into new products and services. This tutorial teaches users how to use the ONL. It includes detailed presentations on the system architecture and principles of operation, as well as live demonstrations, and participants will have an opportunity for hands-on experience setting up and running experiments themselves.

  • Quality of service in global grid computing

    Page(s): 4 - 5

    This tutorial addresses some of the issues raised by the keywords in Foster's definition of grid computing. Specifically, it tackles the problem of providing global grid computing applications with a network infrastructure able to guarantee quality of service. After reviewing the basics of grid computing, the tutorial focuses on specific network-infrastructure issues. Quality-of-service (QoS) parameters such as throughput, delay, and resilience are considered. It is shown how integrating the grid programming environment with an intelligent grid network infrastructure makes it possible to adapt the computational and network resources in use dynamically, meeting the application's QoS requirements transparently to the user. Finally, a performance evaluation of a specific implementation of an integrated application- and network-layer resilience scheme is presented.

  • Internet infrastructure security

    Page(s): 6 - 7

    The goal of this tutorial is to give its audience a comprehensive understanding of the state of the art in Internet infrastructure security, in both research and practice. In addition to attacks and countermeasures, issues such as performance, scalability, deployability, and high-speed implementation are also discussed.

  • High-speed networking: a systematic approach to high-bandwidth low-latency communications

    Page(s): 8 - 9

    This tutorial presents a comprehensive introduction to all aspects of high-speed networking, based on the book High-Speed Networking: A Systematic Approach to High-Bandwidth Low-Latency Communication. The target audience includes computer scientists and engineers who have expertise in one narrow aspect of high-speed networking but want a broader understanding of the field and of the impact their designs have on overall network performance. The tutorial presents a systematic approach to high-speed networks, where the goal is to provide high bandwidth and low latency to distributed applications, and to deal with the large bandwidth × delay product that results from high-speed networking over long distances.

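    To make the bandwidth × delay product concrete, here is a small worked example in Python; the 10 Gb/s rate and 100 ms round-trip time are illustrative values, not figures from the tutorial:

        # Bandwidth x delay product (BDP): the volume of data "in flight" on a
        # path, and hence the buffering a sender needs to keep a long link full.
        link_rate_bps = 10e9   # assumed: 10 Gb/s long-haul link
        rtt_s = 0.1            # assumed: 100 ms round-trip time

        bdp_bytes = link_rate_bps * rtt_s / 8
        print(f"BDP = {bdp_bytes / 2**20:.0f} MiB")   # ~119 MiB unacknowledged

    A sender on such a path needs a window (and the receiver a buffer) of this size before the link can run at full rate, which is why long-distance high-speed networking breaks designs tuned for LAN-scale windows.
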
  • Challenges in building a flat-bandwidth memory hierarchy for a large-scale computer with proximity communication

    Page(s): 13 - 22

    Memory systems for conventional large-scale computers provide only limited bytes/s of data bandwidth when compared to their flop/s of instruction execution rate. The resulting bottleneck limits the bytes/flop that a processor may access from the full memory footprint of a machine and can hinder overall performance. This paper discusses physical and functional views of memory hierarchies and examines existing ratios of bandwidth to execution rate versus memory capacity (bytes/flop versus capacity) in a number of large-scale computers. The paper then explores a set of technologies (proximity communication, low-power on-chip networks, dense optical communication, and sea-of-anything interconnect) that can flatten this bandwidth hierarchy to relieve the memory bottleneck in a large-scale computer that we call "Hero".

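    As a rough illustration of the bytes/flop metric the paper examines, a back-of-the-envelope sketch; every machine parameter below is invented for illustration:

        # Bytes/flop at successive levels of a hypothetical memory hierarchy.
        # All numbers are made up to illustrate the metric, not from the paper.
        peak_flops = 40e12                 # 40 Tflop/s machine
        bandwidth = {
            "local DRAM":   20e12,         # bytes/s from a node's own memory
            "remote node":    2e12,        # bytes/s over the interconnect
            "full machine":   0.2e12,      # bytes/s to the full memory footprint
        }
        for level, bw in bandwidth.items():
            print(f"{level:12s}: {bw / peak_flops:.3f} bytes/flop")

    The steep fall-off in bytes/flop with distance is the "bandwidth hierarchy" that the Hero technologies aim to flatten.
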
  • Optimised global reduction on QsNetII

    Page(s): 23 - 28

    In this paper we describe how QsNetII supports reduction, a key collective operation for massively parallel applications. Results from jobs run on a 512-node, quad-CPU cluster show excellent scaling, with an average time of 22 μs to execute a 2048-process global sum.

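    A minimal sketch of the collective being timed, written with mpi4py; the paper's numbers come from QsNetII's hardware-assisted implementation, so this shows only the equivalent MPI-level operation:

        # Global sum across all ranks: the reduction whose latency is reported.
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        comm.Barrier()
        t0 = MPI.Wtime()
        total = comm.allreduce(comm.Get_rank(), op=MPI.SUM)  # sum on every rank
        elapsed = MPI.Wtime() - t0
        if comm.Get_rank() == 0:
            print(f"{comm.Get_size()}-process global sum = {total}, "
                  f"{elapsed * 1e6:.1f} us")
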
  • Control path implementation for a low-latency optical HPC switch

    Page(s): 29 - 35

    A crucial part of any high-performance computing system is its interconnection network. In the OSMOSIS project, Corning and IBM are jointly developing a demonstrator interconnect based on optical cell switching with electronic control. Starting from the core set of requirements, we present the system design rationale and show how it impacts the practical implementation. Our focus is on solving the technical issues related to the electronic control path, and we show that it is feasible at the targeted design point.

  • Breaking the connection: RDMA deconstructed

    Page(s): 36 - 42

    The architecture, design, and performance of RDMA (remote direct memory access) over the IBM HPS (High Performance Switch and adapter) are described. Unlike conventional implementations such as InfiniBand, our RDMA transport model is layered on top of an unreliable datagram interface, leaving the task of enforcing reliability to the ULP (upper-layer protocol). We demonstrate that our model allows a single MPI task to deliver bidirectional bandwidth of close to 3.0 GB/s across a single link, and 24.0 GB/s when striped across 8 links. In addition, we show that this transport protocol has superior attributes in its ability to (a) handle RDMA packets that arrive out of order, (b) use multiple routes between a source-destination pair, and (c) reduce the size of adapter caches.

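    A toy sketch of why this transport tolerates out-of-order arrival once reliability moves to the ULP: each packet is self-describing (it carries its own placement offset), so it can be written to memory in any order, with a completion record replacing a reorder buffer. The packet layout here is invented for illustration and is not IBM's wire format:

        # Out-of-order RDMA placement: packets land wherever they say they
        # belong; a set of received indices detects message completion.
        PKT = 4096  # assumed payload size per packet, in bytes

        def deliver(buf, received, idx, payload, total):
            buf[idx * PKT : idx * PKT + len(payload)] = payload  # direct placement
            received.add(idx)
            return len(received) == total    # complete once all packets landed

        buf, seen = bytearray(3 * PKT), set()
        for idx in (2, 0, 1):                # packets arrive out of order
            done = deliver(buf, seen, idx, bytes([idx]) * PKT, total=3)
        print("message complete:", done)
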
  • Can memory-less network adapters benefit next-generation InfiniBand systems?

    Page(s): 45 - 50

    InfiniBand is emerging as a high-performance interconnect, gaining popularity because of its performance and open standard. Recently, PCI-Express, the third-generation high-performance I/O bus used to interconnect peripheral devices, has been released, and the third generation of InfiniBand adapters allows applications to take advantage of it. PCI-Express offers very low-latency access to host memory by network interface cards (NICs). Earlier-generation InfiniBand adapters had an external DIMM attached as local NIC memory, used to store internal information; this memory increases the overall cost of the NIC. In this paper we design experiments and analyze the performance of various communication patterns and end applications on PCI-Express-based systems whose adapters can be run with or without local NIC memory. Our investigations reveal that on these systems the memory-fetch latency is the same for local NIC memory and host memory. Under heavy I/O bus usage, the latency of a scatter operation increased by only 10%, and only for message sizes of 1 B-4 KB. Memory-less adapters allow more efficient use of overall system memory and show practically no performance impact (less than 0.1%) on the NAS parallel benchmarks with 8 processes. These results indicate that memory-less network adapters can benefit next-generation InfiniBand systems.

  • Initial performance evaluation of the Cray SeaStar interconnect

    Page(s): 51 - 57

    The Cray SeaStar is a new network interface and router for the Cray Red Storm and XT3 supercomputers. The SeaStar was designed specifically to meet the performance and reliability needs of a large-scale, distributed-memory scientific computing platform. In this paper, we present an initial performance evaluation of the SeaStar. We first provide a detailed overview of its hardware and software features, followed by the results of several low-level micro-benchmarks. These initial results indicate that the SeaStar is on a path to achieving its performance targets.

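    The canonical low-level micro-benchmark for an interconnect like SeaStar is the ping-pong latency test. A minimal mpi4py version follows; the paper's measurements use the native stack on real hardware, so this shows only the measurement pattern:

        # Ping-pong micro-benchmark: half the averaged round-trip time between
        # two ranks approximates the one-way small-message latency.
        from mpi4py import MPI

        comm, rank = MPI.COMM_WORLD, MPI.COMM_WORLD.Get_rank()
        msg, iters = bytearray(8), 1000
        comm.Barrier()
        t0 = MPI.Wtime()
        for _ in range(iters):
            if rank == 0:
                comm.Send(msg, dest=1); comm.Recv(msg, source=1)
            elif rank == 1:
                comm.Recv(msg, source=0); comm.Send(msg, dest=0)
        if rank == 0:
            print(f"one-way latency: {(MPI.Wtime() - t0) / iters / 2 * 1e6:.2f} us")
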
  • Performance characterization of a 10-Gigabit Ethernet TOE

    Page(s): 58 - 63

    Though traditional Ethernet-based network architectures such as Gigabit Ethernet have suffered from a huge performance gap compared to other high-performance networks (e.g., InfiniBand, Quadrics, Myrinet), Ethernet continues to be the most widely used network architecture today. This trend is mainly attributed to the low cost of the network components and their backward compatibility with existing Ethernet infrastructure. With the advent of 10-Gigabit Ethernet and TCP offload engines (TOEs), whether this performance gap can be bridged is an open question. In this paper, we present a detailed performance evaluation of the Chelsio T110 10-Gigabit Ethernet adapter with TOE. We evaluate performance in three broad categories: (i) detailed micro-benchmarks at the sockets layer, (ii) the message passing interface (MPI) stack atop the sockets interface, and (iii) application-level evaluations using the Apache Web server. Our experimental results demonstrate latency as low as 8.9 μs and throughput of nearly 7.6 Gbps for these adapters. Further, we see an order-of-magnitude improvement in the performance of the Apache Web server when using the TOE, compared to the basic 10-Gigabit Ethernet adapter without TOE.

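    A bare-bones version of the category (i) sockets-layer streaming test; buffer size, transfer size, and port are arbitrary choices. The point is that a TOE is driven through this completely standard sockets API:

        # Minimal single-connection TCP throughput test.
        # Usage: "python bench.py server" on one host,
        #        "python bench.py client <server-host>" on the other.
        import socket, sys, time

        BUF, TOTAL, PORT = 64 * 1024, 1 << 30, 5001   # 64 KiB sends, 1 GiB total

        if sys.argv[1] == "server":
            conn, _ = socket.create_server(("", PORT)).accept()
            while conn.recv(BUF):          # drain until the client closes
                pass
        else:
            sock = socket.create_connection((sys.argv[2], PORT))
            data, sent, t0 = b"x" * BUF, 0, time.time()
            while sent < TOTAL:
                sent += sock.send(data)
            sock.close()
            print(f"{sent * 8 / (time.time() - t0) / 1e9:.2f} Gb/s")
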
  • Hybrid cache architecture for high speed packet processing

    Page(s): 67 - 72

    The exposed memory hierarchies employed in many network processors (NPs) are expensive and hard to utilize effectively. On the other hand, a conventional cache cannot be directly incorporated into an NP either, because it exploits the locality of network applications poorly. In this paper, a novel memory-hierarchy component called a split control cache is presented. The proposed scheme employs two independent low-latency memory stores to temporarily hold flow-based and application-relevant information, exploiting the different locality behaviors exhibited by these two types of data. Data movement is handled by specially designed hardware that relieves programmers of the details of memory management. Performance evaluation shows that this component can achieve a hit rate of over 90% with only 16 KB of memory for route lookup at an OC-3c link rate, while providing enough flexibility to implement most network applications.

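    The split can be pictured as two small independent stores, so that one data type's traffic cannot evict the other's working set. A schematic simulation, with sizes and the LRU policy chosen for illustration rather than taken from the paper's hardware design:

        # Two independent LRU stores: one for per-flow state, one for
        # application data, mirroring the "split control cache" idea.
        from collections import OrderedDict

        class LRUStore:
            def __init__(self, capacity):
                self.cap, self.d = capacity, OrderedDict()
                self.hits = self.refs = 0

            def access(self, key):
                self.refs += 1
                if key in self.d:
                    self.hits += 1
                    self.d.move_to_end(key)          # refresh recency
                else:
                    if len(self.d) >= self.cap:
                        self.d.popitem(last=False)   # evict least recent
                    self.d[key] = True

        flow_store, app_store = LRUStore(512), LRUStore(512)   # assumed sizes
        for flow_id in (1, 2, 1, 3, 1, 2):     # toy packet trace
            flow_store.access(flow_id)          # flow state hits its own store
            app_store.access("trie_node_7")     # app data cannot evict flows
        print(f"flow hit rate: {flow_store.hits / flow_store.refs:.0%}")
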
  • High-speed and low-power network search engine using adaptive block-selection scheme

    Page(s): 73 - 78

    A new approach that uses a block-selection scheme to increase the search throughput of multi-block TCAM-based network search engines is proposed. While existing methods try to counter and forcibly balance the inherent bias of Internet traffic, our method takes advantage of it. The method improves the flexibility of table management and scales to high rates of change in traffic bias. It offers higher throughput than the current state of the art, with very low average power consumption. One embodiment of the proposed model, using four TCAM chips, can deliver over six times the throughput of a conventional configuration of the same TCAM chips.

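    The underlying mechanism can be sketched briefly: the table is partitioned across TCAM blocks, a cheap index on a few key bits names the single block that could match, and the remaining blocks either stay idle (saving power) or serve other lookups in the same cycle (raising throughput). The partitioning rule below is invented for illustration:

        # Block selection for a multi-block TCAM: each lookup is steered to
        # the one block that can contain its match, so distinct blocks can
        # search distinct keys in parallel.
        NUM_BLOCKS, SELECT_BITS = 4, 2    # assumed: index on top 2 address bits

        def block_of(addr32):
            return addr32 >> (32 - SELECT_BITS)

        lookups = [0x0A000001, 0x4A000001, 0x8A000001, 0xCA000001]
        by_block = {}
        for addr in lookups:
            by_block.setdefault(block_of(addr), []).append(addr)
        for blk, keys in sorted(by_block.items()):
            print(f"block {blk}: {len(keys)} lookup(s) this cycle")

    The paper's contribution is an adaptive version of this selection that exploits traffic bias instead of fighting it.
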
  • Design and implementation of a content-aware switch using a network processor

    Page(s): 79 - 85

    Cluster-based server architectures have been widely used as a solution to overload in Web servers because of their cost effectiveness, scalability, and reliability. A content-aware switch can examine Web requests and distribute them to servers based on application-level information. In this paper, we present the analysis, design, and implementation of such a content-aware switch based on an IXP2400 network processor (NP). We first analyze the mechanisms for implementing a content-aware switch and argue for an NP-based solution. We then present various ways of allocating the workload among the different computation resources in an NP and discuss the design tradeoffs. Measurement results on an IXP2400 demonstrate that our NP-based switch can reduce HTTP processing latency by an average of 83.3% for a 1 KB Web page compared to a Linux-based switch, with the reduction growing for larger file sizes. We also show that packet throughput can be improved by up to 5.7x across a range of file sizes by taking advantage of the multithreading and multiprocessing available in the NP.

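    The essence of content-aware switching is parsing just enough of the HTTP request to choose a backend before forwarding. A tiny user-space illustration; the backend pools and the URL rule are invented, and the paper performs this logic in the IXP2400's microengines rather than in software like this:

        # Content-aware dispatch: pick a backend from the HTTP request line,
        # not just from the connection's address/port 4-tuple.
        BACKENDS = {"static": "10.0.0.2", "dynamic": "10.0.0.3"}  # assumed pools

        def pick_backend(request: bytes) -> str:
            method, url, version = request.split(b"\r\n", 1)[0].split(b" ", 2)
            pool = "dynamic" if url.startswith(b"/cgi-bin/") else "static"
            return BACKENDS[pool]

        req = b"GET /cgi-bin/search?q=np HTTP/1.1\r\nHost: example.com\r\n\r\n"
        print(pick_backend(req))    # -> 10.0.0.3
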
  • Impact of grid computing on network operators and HW vendors

    Page(s): 89 - 90

    Grid computing is an attempt to make computing work like the power grid: when you run a job, you shouldn't know or care where it runs, so long as it gets done within your constraints (including security). However, in attempting to accomplish this, grid researchers are presenting network access patterns and loads different from what has been typical of Internet traffic. MPI applications generate latency-critical, bursty, small-message traffic; some applications produce data sets in the hundreds of gigabytes, or even terabytes, that need to be moved quickly and efficiently; and remote control of equipment such as earthquake shake tables demands constant jitter. Grid researchers are asking for finer-grained control of the network: dynamic optical routes, the ability for user applications (via middleware) to alter router configurations, and so on. For some network operators, this sounds like their worst nightmare come true; for the network hardware vendors, it presents challenges, to say the least. This panel brings together grid researchers, network operators, and network hardware vendors to discuss what the grid researchers want and why, what impact that will have on network operations, and what challenges it will bring for future hardware designs.

  • A scalable switch for service guarantees

    Page(s): 93 - 99

    Operators need routers to provide service guarantees, such as guaranteed flow rates and fairness among flows, in order to support real-time traffic and traffic engineering. However, current centralized input-queued router architectures cannot scale to fast line rates while providing these guarantees. On the other hand, while load-balanced switch architectures that rely on two identical stages of fixed-configuration switches appear to be an effective way to scale Internet routers to very high capacities, there is currently no practical and scalable way to provide service guarantees in them. In this paper, we introduce the interleaved matching switch (IMS) architecture, which takes a novel approach to providing service guarantees with load-balanced switches. The approach is based on emulating a Birkhoff-von Neumann switch with a load-balanced switch architecture and is applicable to any admissible traffic. In cases where fixed frame sizes are applicable, we also present an efficient frame-based decomposition method. More generally, we show that the IMS architecture can be used to emulate any input-queued or combined input-output queued switch.

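    For context, a Birkhoff-von Neumann switch schedules traffic by decomposing the admissible rate matrix into a weighted sum of permutation matrices, each of which is a feasible crossbar configuration. A rough sketch of that decomposition, using scipy's assignment solver as the matching step (production schedulers use exact matchings on the positive support, which Birkhoff's theorem guarantees to exist):

        # Birkhoff-von Neumann decomposition: a doubly stochastic rate matrix
        # becomes a convex combination of permutation matrices (schedules).
        import numpy as np
        from scipy.optimize import linear_sum_assignment

        def bvn_decompose(rates, eps=1e-9):
            r, schedule = rates.astype(float).copy(), []
            while r.sum() > eps:
                rows, cols = linear_sum_assignment(r, maximize=True)
                weight = r[rows, cols].min()    # largest feasible coefficient
                if weight <= eps:               # matching touched a zero entry
                    break
                schedule.append((weight, list(zip(rows, cols))))
                r[rows, cols] -= weight
            return schedule

        rates = np.array([[0.5, 0.5, 0.0],
                          [0.5, 0.0, 0.5],
                          [0.0, 0.5, 0.5]])
        for w, perm in bvn_decompose(rates):
            print(f"serve matching {perm} for fraction {w:.2f} of time slots")

    The IMS contribution is emulating this behavior with a load-balanced switch whose two stages keep fixed configurations, avoiding a centralized scheduler.
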
  • Design of randomized multichannel packet storage for high performance routers

    Page(s): 100 - 106

    High-performance routers require substantial amounts of memory to store packets awaiting transmission, which calls for dedicated memory devices with the density and capacity to provide that storage economically. The memory bandwidth required by packet storage subsystems often exceeds the bandwidth of individual memory devices, making it necessary to implement packet storage using multiple memory channels. This raises the question of how to design multichannel storage systems that make effective use of the available memory and memory bandwidth while forwarding packets at link rate in the presence of arbitrary packet retrieval patterns. A recent series of papers has demonstrated an architecture that uses on-chip SRAM to buffer packets going to and from a multichannel storage system while maintaining high performance under worst-case traffic patterns. Unfortunately, the amount of on-chip storage required grows as the product of the number of channels and the number of separate queues served by the packet storage system, making it too expensive for systems with large numbers of queues. We show how to design a practical randomized packet storage system that sustains high performance using an amount of on-chip storage that is independent of the number of queues.

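    Reduced to its core, the randomized approach spreads writes across channels at random, so no retrieval pattern can systematically concentrate load on one channel, and a small per-channel on-chip buffer absorbs short-term imbalance. A toy demonstration of the balancing effect (channel count and write count are arbitrary):

        # Randomized channel assignment: expected load per channel is N/C for
        # any workload, so the on-chip buffering that hides the variance can
        # be sized independently of the number of queues.
        import random
        from collections import Counter

        CHANNELS, N = 8, 100_000
        load = Counter(random.randrange(CHANNELS) for _ in range(N))
        print(f"mean {N // CHANNELS}, busiest channel {max(load.values())}")
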
  • Addressing queuing bottlenecks at high speeds

    Page(s): 209 - 224

    Modern routers and switch fabrics can have hundreds of input and output ports running at up to 10 Gb/s, and 40 Gb/s systems are starting to appear. At these rates, the performance of the buffering and queuing subsystem becomes a significant bottleneck. In high-performance routers with more than a few queues, packet buffering is typically implemented using DRAM for data storage and a combination of off-chip and on-chip SRAM for storing the linked-list nodes and packet lengths, and the queue headers, respectively. This paper focuses on the performance bottlenecks associated with the use of off-chip SRAM. We show how the combination of implicit buffer pointers and multi-buffer list nodes can dramatically reduce the impact of the buffering and queuing subsystem on queuing performance. We also show how combining these techniques with coarse-grained scheduling can improve the performance of fair queuing algorithms while reducing the amount of off-chip memory and bandwidth needed. These techniques can reduce the amount of SRAM needed to hold the list nodes by a factor of 10, at the cost of about 10% wasted DRAM space, assuming an aggregation degree of 16.

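    The two techniques named in the abstract can be sketched with invented field sizes. An implicit buffer pointer means a list node's own index determines which DRAM buffers it describes, so buffer addresses are never stored; a multi-buffer node then shares one next-node link among many buffers:

        # Multi-buffer list nodes with implicit buffer pointers: node i owns
        # DRAM buffers [i*K, (i+1)*K) by construction, so the only stored
        # per-node state is a next link (plus per-buffer lengths).
        K = 16                       # aggregation degree: buffers per node

        def buffers_of(node_index):
            # "Implicit pointer": derived from the index, never stored.
            return range(node_index * K, (node_index + 1) * K)

        print(f"node 3 owns buffers {buffers_of(3).start}-{buffers_of(3).stop - 1}")

        ptr_bits = 32                            # assumed pointer width
        per_buffer_classic = ptr_bits            # one stored pointer per buffer
        per_buffer_aggregated = ptr_bits / K     # one link shared by K buffers
        print(f"list-node SRAM per buffer: {per_buffer_classic} -> "
              f"{per_buffer_aggregated:.0f} bits")

    Remaining per-node bookkeeping eats into the ideal K-fold saving, which is consistent with the factor of 10 the paper reports for an aggregation degree of 16; the price is that a partially filled node still reserves all K of its buffers, the roughly 10% DRAM wastage quoted above.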