11th Symposium on High Performance Interconnects, 2003. Proceedings.

Date: 20-22 Aug. 2003

Displaying Results 1-21 of 21
  • Proceedings 11th Symposium on High Performance Interconnects

    Publication Year: 2003

    The following topics are dealt with: interconnect technologies; associative technologies; network and clustering technologies; packet processors; optical networks; hardware accelerators and network interfaces.

  • Author index

    Publication Year: 2003, Page(s): 143
  • Initial end-to-end performance evaluation of 10-Gigabit Ethernet

    Publication Year: 2003, Page(s): 116-121
    Cited by: Papers (6) | Patents (12)

    We present an initial end-to-end performance evaluation of Intel's 10 Gigabit Ethernet (10 GbE) network interface card (or adapter). With appropriate optimizations to the configurations of Linux, TCP, and the 10 GbE adapter, we achieve over 4 Gb/s throughput and 21 μs end-to-end latency between applications in a local-area network, despite using less capable, lower-end PCs. These results indicate that 10 GbE may also be a cost-effective solution for system-area networks in commodity clusters, data centers, and web-server farms, as well as for wide-area networks in support of computational and data grids.

  • Dynamic power management for power optimization of interconnection networks using on/off links

    Publication Year: 2003, Page(s): 15-20
    Cited by: Papers (17)

    Power consumption in interconnection networks has become an increasingly important architectural issue. The links that interconnect network node routers are a major consumer of power and devour an ever-increasing portion of the total available power as network bandwidth and operating frequencies scale up. In this paper, we propose a dynamic power management policy in which network links are turned off and switched back on, in a distributed fashion, depending on network utilization. We have devised a systematic approach, based on the derivation of a connectivity graph, that balances power and performance for a 2D mesh topology. This, coupled with a deadlock-free, fully adaptive routing algorithm, guarantees packet delivery. Our approach realizes up to a 37.5% reduction in overall network link power for an 8-ary 2-mesh topology with a moderate increase in network latency.
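A minimal sketch of the kind of utilization-driven on/off link policy the abstract describes, with hypothetical thresholds (the paper's actual policy is distributed and additionally consults a connectivity graph; nothing below is its exact algorithm):

```python
def update_link_state(link_on: bool, utilization: float,
                      off_threshold: float = 0.2,
                      on_threshold: float = 0.6) -> bool:
    """Return the new on/off state for a link given recent utilization.

    Hysteresis between the two (hypothetical) thresholds avoids rapid
    on/off oscillation; a distributed implementation would evaluate
    this locally at each router.
    """
    if link_on and utilization < off_threshold:
        return False  # demand is low: power the link down
    if not link_on and utilization > on_threshold:
        return True   # demand is back: switch the link on
    return link_on    # otherwise keep the current state

print(update_link_state(True, 0.05))   # -> False: lightly used link turned off
print(update_link_state(False, 0.70))  # -> True: rising demand wakes link up
```

A real implementation would also check the connectivity graph so that powering a link down never disconnects the mesh, as the abstract notes.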

  • Architecture for a hardware based, TCP/IP content scanning system [intrusion detection system applications]

    Publication Year: 2003, Page(s): 89-94
    Cited by: Papers (7) | Patents (2)

    Hardware-assisted intrusion detection systems (IDSs) and content scanning engines are needed to process data at multi-Gigabit line rates. These systems, when placed within the core of the Internet, are subject to millions of simultaneous flows, with each flow potentially containing data of interest. Existing IDSs are not capable of processing millions of flows at Gigabit-per-second data rates. This paper describes an architecture that is capable of performing complete, stateful payload inspection on 8 million TCP flows at 2.5 Gigabits per second. To accomplish this task, a hardware circuit combines a TCP protocol processing engine, a per-flow state store, and a content scanning engine.

  • On slotted WDM switching in bufferless all-optical networks

    Publication Year: 2003, Page(s): 96-101
    Cited by: Papers (9)

    The current λ-switching technology requires each lightpath to occupy the full bandwidth of a wavelength throughout the bufferless all-optical domain. With such a constraint, the granularity of bandwidth is so coarse that the utilization of optical fibers is considerably low. To resolve this granularity problem, a simple approach is to incorporate time division multiplexing (TDM) into wavelength division multiplexing (WDM) so as to divide the entire λ-bandwidth into smaller base bandwidths. This approach is referred to as slotted WDM (sWDM). In this paper, we define the slot labeling, mapping, and assignment problems, and derive the necessary and sufficient condition that validates sWDM. We also provide simulation results of the call blocking rate of sWDM in comparison with λ-switching.

  • A rule grouping technique for weight-based TCAM coprocessors [packet classification application]

    Publication Year: 2003, Page(s): 32-37
    Cited by: Papers (2)

    A crucial issue associated with a TCAM (ternary content addressable memory) coprocessor with weights is that no more rules can be enforced once the weights are exhausted. In this paper, the problem is identified and a rule grouping technique is proposed to solve it. The technique allows a virtually unlimited number of rules with arbitrary rule structures to be enforced. It requires no special hardware support and can be readily implemented with a fully programmable network processor and a weight-based TCAM coprocessor.

  • Nexus: an asynchronous crossbar interconnect for synchronous system-on-chip designs

    Publication Year: 2003, Page(s): 2-9
    Cited by: Papers (9)

    Asynchronous circuits can provide an elegant and high-performance interconnect solution for synchronous system-on-chip (SoC) designs with multiple clock domains. This 'globally asynchronous, locally synchronous' (GALS) approach simplifies global timing and synchronization problems, improving performance, reliability, and development time. Fulcrum Microsystems' SoC interconnect, 'Nexus', includes a 16-port, 36-bit asynchronous crossbar which connects via asynchronous channels to clock domain converters for each synchronous module. Each synchronous module has its own local clock domain and can send a variable-length burst of data to any other module. In TSMC's 130 nm LV low-K process, the system achieves 1.35 GHz at 1.2 V in less than 5 mm² of area. Power scales linearly with bandwidth, from a few mW of leakage to 8 W at the peak 780 Gb/s cross-section bandwidth. Latency through the interconnect is 2 ns plus 1/2 to 2/3 of a clock cycle of the receiving module. This compares favorably with other SoC interconnect solutions, which have less bandwidth, higher energy per transfer, and longer latencies. Nexus is an innovative and comprehensive solution to the challenge of SoC interconnect.

  • PCI express and advanced switching: evolutionary path to building next generation interconnects

    Publication Year: 2003, Page(s): 21-29
    Cited by: Papers (17) | Patents (9)

    With processor and memory technologies pushing the performance limit, the bottleneck is clearly shifting towards the system interconnect. Any solution that addresses the PCI bus-based interconnect, which has serious scalability problems, must also protect the huge legacy infrastructure. PCI Express provides such an evolutionary approach and allows a smooth migration towards building a highly scalable next-generation interconnect. Advanced switching further boosts the capabilities of PCI Express by allowing a rich application space to be covered, including multiprocessing and peer-to-peer communication. Indeed, the synergy between PCI Express and advanced switching permits the adoption of an evolutionary yet revolutionary approach for building the interconnect of the future.

  • Design of optical burst switches based on dual shuffle-exchange network and deflection routing

    Publication Year: 2003, Page(s): 102-107

    In this paper, we propose a novel approach to implementing optical burst switching (OBS) using the dual shuffle-exchange network (DSN) as the core switching fabric. The DSN possesses a self-routing property which allows major simplification of the complex crossbar setup mechanisms. In addition, its asynchronous and bufferless nature is highly preferable in the optical environment. We also show that, with an appropriate error-correcting routing algorithm, output wavelength contention can be reduced by means of internal deflection routing.

  • A dual-level matching algorithm for 3-stage Clos-network packet switches

    Publication Year: 2003, Page(s): 38-43
    Cited by: Papers (4)

    In this paper, we present a new dual-level matching algorithm for 3-stage Clos-network packet switches, called d-MAC. Using two matching levels, namely module-level matching and port-level matching, d-MAC is highly scalable and maintains high system performance. The module-level matching is responsible for finding the module-to-module matching according to the queue status of the switch, while the port-level matching determines the port-to-port matching and the route assignment simultaneously. The two matching levels are computed in a pipelined, parallel manner to speed up packet scheduling.

  • A case for network-centric buffer cache organization

    Publication Year: 2003, Page(s): 66-71
    Cited by: Papers (5) | Patents (1)

    The emergence of clustered and networked storage architectures gives rise to a new type of server which acts as a data conduit over the network for remotely stored data. These servers, which we call pass-through servers, are mainly responsible for passing data through without interpreting it in any way. In this paper, we put forward a scheme of network-centric buffer cache management. This scheme facilitates data transmission through pass-through servers by avoiding redundant data copying and by caching the data in a network-ready form, while requiring no modifications to the existing buffer cache organization. Performance measurements on an NFS server with this scheme, using iSCSI storage and running on Linux, show a throughput improvement of more than 50% compared to an NFS server on unmodified Linux, while consuming about 40% less CPU resource.

  • A wave-pipelined on-chip interconnect structure for networks-on-chips

    Publication Year: 2003, Page(s): 10-14
    Cited by: Papers (12)

    The paper describes a structured communication link design technique, wave-pipelined interconnect, for networks-on-chip. We achieved a 3.45 GHz clock rate and 55.2 Gbps throughput on a 10 mm, 16-bit interconnect in a 0.25 μm technology. It uses 0.079 mm² of area and needs only 18.8 pJ to transmit one bit. We reduce crosstalk delay by 79% using two techniques: interleaved lines and misaligned repeaters. This paper shows, in detail, the various techniques we used to save power and area and achieve high performance in a relatively old technology. Wave-pipelined interconnect design is relatively easy, but its many features give a large and flexible design space for high-performance chips.
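The headline numbers are self-consistent: throughput is just link width times clock rate, and energy per bit times bit rate gives power at full utilization. A quick check, using only figures quoted in the abstract:

```python
# Figures quoted in the abstract for the 16-bit wave-pipelined link.
clock_hz = 3.45e9        # wave-pipelined clock rate
width_bits = 16          # link width

throughput_gbps = clock_hz * width_bits / 1e9
print(throughput_gbps)   # -> 55.2, matching the reported 55.2 Gbps

# Power at full utilization from the reported 18.8 pJ/bit.
energy_per_bit_j = 18.8e-12
power_w = energy_per_bit_j * throughput_gbps * 1e9
print(round(power_w, 2)) # -> 1.04, i.e. about 1 W at peak throughput
```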

  • Micro-benchmark level performance comparison of high-speed cluster interconnects

    Publication Year: 2003, Page(s): 60-65
    Cited by: Papers (9)

    In this paper, we present a comprehensive performance evaluation of three high-speed cluster interconnects: InfiniBand, Myrinet, and Quadrics. We propose a set of micro-benchmarks to characterize different performance aspects of these interconnects. Our micro-benchmark suite includes not only traditional tests and performance parameters, but also those specifically tailored to the interconnects' advanced features, such as user-level access for performing communication and remote direct memory access. In order to explore the full communication capability of the interconnects, we have implemented the micro-benchmark suite at the low-level messaging layer provided by each interconnect. Our performance results show that all three interconnects achieve low latency, high bandwidth, and low host overhead. However, they show quite different performance behaviors when handling completion notification, unbalanced communication patterns, and different communication buffer reuse patterns.
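The traditional tests in such a suite usually start with a ping-pong latency measurement. A self-contained sketch over a local socket pair (purely illustrative; the paper's suite targets each interconnect's native messaging layer, not sockets):

```python
import socket
import time

def pingpong_latency_us(iters: int = 1000, msg_size: int = 4) -> float:
    """One-way latency estimate in microseconds: half the mean round-trip.

    A real micro-benchmark would place the two endpoints on separate
    hosts over the interconnect's low-level API; a socketpair keeps
    this sketch runnable anywhere.
    """
    a, b = socket.socketpair()
    msg = b"x" * msg_size
    for _ in range(10):  # warm-up iterations
        a.send(msg); b.send(b.recv(msg_size)); a.recv(msg_size)
    t0 = time.perf_counter()
    for _ in range(iters):
        a.send(msg)                # ping
        b.send(b.recv(msg_size))   # echo
        a.recv(msg_size)           # pong
    elapsed = time.perf_counter() - t0
    a.close(); b.close()
    return elapsed / iters / 2 * 1e6

print(f"one-way latency: {pingpong_latency_us():.1f} us")
```

Bandwidth tests follow the same shape with large messages and a one-way stream instead of an echo.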

  • Network processors as building blocks in overlay networks

    Publication Year: 2003, Page(s): 83-88
    Cited by: Papers (7)

    This paper proposes an architecture that permits selected application- and middleware-level functionality to be 'pushed into' network processors. Such functionality is represented as stream handlers that run on the network processors (NPs) attached to the host nodes participating in overlay networks. When using stream handlers, application- and middleware-level functionality is 'split' into multiple components that are jointly executed by the host and the attached NP (ANP). The resulting improvements in application performance are due to the network-near nature of ANPs and to the ability to dynamically customize stream handlers to meet current application needs or match current network resources. The evaluation of our current prototype implementation indicates that the use of ANP-level handlers can reduce delays on the application data path by more than 25%, and can sustain higher throughput for the application services provided by stream handlers. In addition, stream handlers are a suitable basis for scalable implementations of data-increasing services such as destination-customized multicast.

  • ETA: experience with an Intel® Xeon™ processor as a packet processing engine

    Publication Year: 2003, Page(s): 76-82
    Cited by: Papers (1) | Patents (4)

    The ETA (embedded transport acceleration) project at Intel Research and Development has developed a software prototype that uses one of the Intel® Xeon™ processors in a multi-processor server as a packet processing engine. The prototype is used as a vehicle for empirical measurement and analysis of a highly programmable packet processing engine that is closely tied to the server's core CPU and memory complex. The usage model for the prototype is the acceleration of server TCP/IP networking. The ETA prototype runs in an asymmetric multiprocessing mode, in that the packet processing engine does not serve as a general computing resource for the host operating system. We show an effective method of interfacing the packet processing engine to the host processors using efficient asynchronous queuing mechanisms. This paper describes the ETA software architecture and prototype, and details the measurement and analysis performed to date. Test results include running the packet processing engine in single-threaded mode, as well as in multi-threaded mode using Intel's hyper-threading technology (HT). Performance data gathered for network throughput and host CPU utilization show a significant improvement over the standard TCP/IP networking stack.

  • Efficient exploitation of kernel access to Infiniband: a software DSM example

    Publication Year: 2003, Page(s): 130-135
    Cited by: Papers (3)

    The InfiniBand (IB) system area network (SAN) enables applications to access hardware directly from the user level, reducing the overhead of user-kernel crossings during data transfer. However, distributed applications that exhibit close coupling between network and OS services may benefit from accessing IB from the kernel through IB's native verbs interface, which permits tight integration of these services. We assess this approach using a sequentially consistent distributed shared memory (DSM) system as an example. We first develop primitives that abstract the low-level communication and kernel details and efficiently serve the application's communication, memory, and scheduling needs. Next, we combine the primitives to form a kernel DSM protocol. The approach is evaluated using our full-fledged Linux kernel DSM implementation over InfiniBand.

  • Scalable collective communication on the ASCI Q machine

    Publication Year: 2003, Page(s): 54-59
    Cited by: Papers (8)

    Scientific codes spend a considerable part of their run time executing collective communication operations. Such operations can also be critical for efficient resource management in large-scale machines. Therefore, scalable collective communication is a key factor in achieving good performance on large-scale parallel computers. In this paper, we describe the performance and scalability of some common collective communication patterns on the ASCI Q machine. Experimental results conducted on a 1024-node/4096-processor segment show that the network is fast and scalable: it is able to barrier-synchronize in a few tens of μs, perform a broadcast with an aggregate bandwidth of more than 100 GB/s, and sustain heavy hot-spot traffic with limited performance degradation.

  • FPsed: a streaming content search-and-replace module for an Internet firewall

    Publication Year: 2003, Page(s): 122-129
    Cited by: Papers (2) | Patents (39)

    A module has been implemented in field programmable gate array (FPGA) hardware that is able to perform regular expression search-and-replace operations on the content of Internet packets at Gigabit-per-second rates. All of the packet processing operations are performed using reconfigurable hardware within a single Xilinx Virtex XCV2000E FPGA. A set of layered protocol wrappers is used to parse the headers and payloads of packets for Internet protocol data. A content matching server automatically generates, compiles, synthesizes, and programs the module into the field-programmable port extender (FPX) platform.

  • Deep packet inspection using parallel Bloom filters

    Publication Year: 2003, Page(s): 44-51
    Cited by: Papers (71) | Patents (55)

    Recent advances in network packet processing focus on payload inspection for applications that include content-based billing, layer-7 switching, and Internet security. Most of the applications in this family need to search for predefined signatures in the packet payload; hence, an important building block of these processors is the string matching infrastructure. Since conventional software-based algorithms for string matching have not kept pace with high network speeds, specialized high-speed, hardware-based solutions are needed. We describe a technique based on Bloom filters for detecting predefined signatures (strings of bytes) in the packet payload. A Bloom filter is a data structure for representing a set of strings in order to support membership queries. We use hardware Bloom filters to isolate all packets that potentially contain predefined signatures; another independent process eliminates the false positives they produce. We outline our approach to string matching at line speeds and present a performance analysis. Finally, we report results for a prototype implementation of this system on the FPX platform. Our analysis shows that, with state-of-the-art FPGAs, a set of 10,000 strings can be scanned in the network data at the OC-48 line speed of 2.4 Gbps.
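The core data structure is easy to sketch in software. A minimal Bloom filter with illustrative hash choices (the paper uses parallel hardware filters; nothing below is from its implementation):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for membership queries over byte strings.

    k hash functions set k bits per inserted item; a query reports
    membership only if all k bits are set. False positives can occur
    (the paper removes them with a separate exact-match process), but
    false negatives cannot.
    """

    def __init__(self, m_bits: int = 1 << 16, k: int = 4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: bytes):
        # Derive k bit indices from salted SHA-256 digests (illustrative;
        # hardware designs use cheap hashes computed in parallel).
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._positions(item))

signatures = BloomFilter()
signatures.add(b"predefined-signature")
print(signatures.might_contain(b"predefined-signature"))  # -> True
print(signatures.might_contain(b"benign payload"))        # -> False (w.h.p.)
```

Scanning a payload then amounts to querying every substring of each signature length, which is what the parallel hardware filters make cheap at line rate.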

  • Dynamic scheduling of optical data bursts in time-domain wavelength interleaved networks

    Publication Year: 2003, Page(s): 108-113
    Cited by: Papers (11)

    We consider the problem of scheduling bursts of data in an optical network with an ultra-fast tunable laser and a fixed receiver at each node. In (K. Ross et al., Technical Report SU NETLAB-2002-12/1, Eng. Lib., Stanford Univ., Stanford, CA, 2002) we considered the static scheduling problem of meeting demand in minimal time. Here we substantially extend these results to the case of online, dynamic scheduling. Due to the high data rates employed on the optical links, burst transmissions typically last for very short times compared to the round-trip propagation times between source-destination pairs. A good schedule ensures that (i) there are no transmit/receive conflicts, (ii) throughput is maximized, and (iii) propagation delays are observed. We formulate the scheduling problem as a generalization of the well-known crossbar switch scheduling problem. We show that the algorithms presented in the previous work can be implemented in dynamic form to give 100% throughput. Further, we show that one of the more intuitive solutions does not lead to maximal throughput. In particular, we show the advantages of adaptive batch sizes over fixed batch sizes for both throughput and performance.
