
High-Performance Interconnects (HOTI), 2012 IEEE 20th Annual Symposium on

Date 22-24 Aug. 2012

  • [Cover art]

    Page(s): C4
  • [Title page i]

    Page(s): i
  • [Title page iii]

    Page(s): iii
  • [Copyright notice]

    Page(s): iv
  • Table of contents

    Page(s): v - vi
  • Message from the General Co-chairs

    Page(s): vii - viii
  • Message from the Technical Program Committee Chair

    Page(s): ix - x
  • Organizing Committee

    Page(s): xi
  • Technical Program Committee Members

    Page(s): xii
  • Steering Committee

    Page(s): xiii
  • Reviewers

    Page(s): xiv
  • Keynotes - HOTI 2012 [four abstracts]

    Page(s): xv - xix

    Provides an abstract for each of the four keynote presentations and a brief professional biography of each presenter.

  • Invited Talks - HOTI 2012 [three abstracts]

    Page(s): xx

    Summary form only given. Neither the talks nor their abstracts are provided; only the speakers' professional biographies are given.

  • Panels - HOTI 2012 [two panel session listings]

    Page(s): xxi - xxii
  • Tutorials - HOTI 2012

    Page(s): xxiii - xxviii

    These tutorials cover the following: Hands-on Tutorial on Software-Defined Networking; Interconnection Networks for Cloud Data Centers; Designing Scientific, Enterprise, and Cloud Computing Systems with InfiniBand and High-Speed Ethernet: Current Status and Trends; and The Evolution of Network Architecture towards CloudCentric Applications.

  • ParaSplit: A Scalable Architecture on FPGA for Terabit Packet Classification

    Page(s): 1 - 8

    Packet classification is a fundamental enabling function for various applications in switches, routers and firewalls. Due to their performance and scalability limitations, current packet classification solutions are insufficient in addressing the challenges from the growing network bandwidth and the increasing number of new applications. This paper presents a scalable parallel architecture, named ParaSplit, for high-performance packet classification. We propose a rule set partitioning algorithm based on range-point conversion to reduce the overall memory requirement. We further optimize the partitioning by applying the Simulated Annealing technique. We implement the architecture on a Field Programmable Gate Array (FPGA) to achieve high throughput by exploiting the abundant parallelism in the hardware. Evaluation using real-life data sets including OpenFlow-like 11-tuple rules shows that ParaSplit achieves significant reduction in memory requirement compared with state-of-the-art algorithms such as HyperSplit [6] and EffiCuts [8]. Because of the memory efficiency of ParaSplit, our FPGA design can support multiple engines in on-chip memory, each of which contains up to 10K complex rules. As a result, the architecture with multiple ParaSplit engines in parallel can achieve up to Terabit-per-second throughput for large and complex rule sets on a single FPGA device.

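    The simulated-annealing refinement mentioned in the abstract can be sketched generically. The function below anneals a k-way rule-set partition against a caller-supplied cost function standing in for ParaSplit's memory estimate; the move rule, cooling schedule, and cost are illustrative assumptions, not the paper's actual algorithm.

    ```python
    import math
    import random

    def anneal_partition(rules, k, cost, steps=2000, t0=1.0, alpha=0.999):
        """Refine a k-way partition of `rules` by simulated annealing.

        `cost(partition)` returns the quantity to minimize (e.g. a memory
        estimate). Returns the final partition and the best cost seen.
        """
        rng = random.Random(0)                 # deterministic for illustration
        part = [set() for _ in range(k)]
        for i, r in enumerate(rules):          # round-robin initial partition
            part[i % k].add(r)
        cur = best = cost(part)
        t = t0
        for _ in range(steps):
            src, dst = rng.sample(range(k), 2)  # propose moving one rule
            if not part[src]:
                t *= alpha
                continue
            r = rng.choice(sorted(part[src]))
            part[src].remove(r)
            part[dst].add(r)
            new = cost(part)
            # accept improvements always, worsenings with Boltzmann probability
            if new <= cur or rng.random() < math.exp((cur - new) / t):
                cur = new
                best = min(best, cur)
            else:                               # undo the rejected move
                part[dst].remove(r)
                part[src].add(r)
            t *= alpha                          # geometric cooling
        return part, best

    # Toy usage: balance 10 rules across 2 engines, cost = sum of squared sizes.
    partition, best = anneal_partition(list(range(10)), 2,
                                       lambda p: sum(len(s) ** 2 for s in p))
    ```
    
    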
  • A Low-Latency Library in FPGA Hardware for High-Frequency Trading (HFT)

    Page(s): 9 - 16

    Current High-Frequency Trading (HFT) platforms are typically implemented in software on computers with high-performance network adapters. The high and unpredictable latency of these systems has led the trading world to explore alternative "hybrid" architectures with hardware acceleration. In this paper, we survey existing solutions and describe how FPGAs are being used in electronic trading to approach the goal of zero latency. We present an FPGA IP library which implements networking, I/O, memory interfaces and financial protocol parsers. The library provides pre-built infrastructure which accelerates the development and verification of new financial applications. We have developed an example financial application using the IP library on a custom 1U FPGA appliance. The application sustains 10 Gb/s Ethernet line rate with a fixed end-to-end latency of 1 μs - up to two orders of magnitude lower than comparable software implementations.

  • Rx Stack Accelerator for 10 GbE Integrated NIC

    Page(s): 17 - 24

    The miniaturization of CMOS technology has reached a scale at which server processors are starting to integrate multi-gigabit network interface controllers (NICs). While transistors are becoming cheap and abundant in solid-state circuits, they remain at a premium on a processor die if they do not contribute to increasing the number of cores and caches. Therefore, an integrated NIC (iNIC) must provide high networking performance under high logic density and low power dissipation. This paper describes the design of an integrated accelerator to offload computation-intensive protocol-processing tasks. The accelerator combines the concepts of the transport-triggered architecture with a programmable finite-state machine to deliver high instruction-level parallelism, efficient multiway branching and flexibility. This flexibility is key to adapting to protocol changes and addressing new applications. The accelerator was used in the construction of a 10 GbE iNIC in 45-nm CMOS technology. The ratio of performance (15 Mfps, 20 Gb/s throughput per port) to area (0.7 mm2) and the power consumption (0.15 W) of this accelerator were core enablers for constructing a processor compute complex with four iNICs.

  • Caliper: Precise and Responsive Traffic Generator

    Page(s): 25 - 32

    This paper presents Caliper, a highly accurate packet injection tool that generates precise and responsive traffic. Caliper takes live packets generated on a host computer and transmits them onto a gigabit Ethernet network with precise inter-transmission times. Existing software traffic generators rely on generic Network Interface Cards which, as we demonstrate, do not provide high-precision timing guarantees. Hence, performing valid and convincing experiments becomes difficult or impossible in the context of time-sensitive network experiments. Our evaluations show that Caliper is able to reproduce packet inter-transmission times from a given arbitrary distribution while capturing the closed-loop feedback of TCP sources. Specifically, we demonstrate that Caliper provides three orders of magnitude better precision compared to a commodity NIC: with requested traffic rates up to the line rate, Caliper incurs an error of 8 ns or less in packet transmission times. Furthermore, we explore Caliper's ability to integrate with existing network simulators to project simulated traffic characteristics into a real network environment. Caliper is freely available online.

  • Weighted Differential Scheduler

    Page(s): 33 - 39

    The Weighted Differential Scheduler (WDS) is a new scheduling discipline for accessing shared resources. The work described here was motivated by the need for a simple weighted scheduler for a network switch where multiple packet flows compete for an output port. The scheme can be implemented with simple arithmetic logic and finite state machines. We describe several versions of WDS that can merge two or more flows. An analysis reveals that WDS has lower jitter than any other weighted scheduler known to us.

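    The kind of "simple arithmetic logic" weighted merge the abstract describes can be illustrated with a toy sketch. The counter-update rule below is a generic credit-differential scheme assumed for illustration, not the WDS algorithm from the paper: a single signed counter decides which of two flows is granted next, and the weights determine the long-run service ratio.

    ```python
    def weighted_merge(w0, w1, n):
        """Produce n grant decisions between two flows with weights w0:w1.

        A single signed differential counter tracks accumulated imbalance;
        granting one flow charges it the other flow's weight. This is an
        illustrative stand-in for a hardware weighted scheduler.
        """
        diff = 0          # >= 0 means flow 0 is next in line
        grants = []
        for _ in range(n):
            if diff >= 0:
                grants.append(0)
                diff -= w1   # serving flow 0 costs w1 credit units
            else:
                grants.append(1)
                diff += w0   # serving flow 1 refunds w0 credit units
        return grants

    # Weights 2:1 interleave grants finely rather than in long bursts,
    # which is what keeps jitter low in schedulers of this style.
    sequence = weighted_merge(2, 1, 9)
    ```
    
    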
  • Performance Evaluation of Open MPI on Cray XE/XK Systems

    Page(s): 40 - 47

    Open MPI is a widely used open-source implementation of the MPI-2 standard that supports a variety of platforms and interconnects. Current versions of Open MPI, however, lack support for the Cray XE6 and XK6 architectures -- both of which use the Gemini System Interconnect. In this paper, we present extensions to natively support these architectures within Open MPI, describe and propose solutions for performance and scalability bottlenecks, and provide an extensive evaluation of our implementation, which is the first completely open-source MPI implementation for the Cray XE/XK system families to be run at 49,152 processes. Application and micro-benchmark results show that the performance and scaling characteristics of our implementation are similar to the vendor-supplied MPI's. Micro-benchmark results show 1-byte and 1,024-byte short-message latencies of 1.20 μs and 4.13 μs, which are 10.00% and 39.71% better than the vendor-supplied MPI's, respectively. Our implementation achieves a bandwidth of 5.32 GB/s at 8 MB, which is similar to the vendor-supplied MPI's bandwidth at the same message size. Two Sequoia benchmark applications, LAMMPS and AMG2006, were also chosen to evaluate our implementation at scales up to 49,152 cores -- where we exhibited similar performance and scaling characteristics when compared to the vendor-supplied MPI implementation. LAMMPS achieved a parallel efficiency of 88.20% at 49,152 cores using Open MPI, which is on par with the vendor-supplied MPI's achieved parallel efficiency.

  • Performance Analysis and Evaluation of InfiniBand FDR and 40GigE RoCE on HPC and Cloud Computing Systems

    Page(s): 48 - 55

    Communication interfaces of high performance computing (HPC) systems and clouds have been continually evolving to meet the ever-increasing communication demands being placed on them by HPC applications and cloud computing middleware (e.g., Hadoop). The PCIe interfaces can now deliver speeds up to 128 Gbps (Gen3), and high performance interconnects (10/40 GigE, InfiniBand 32 Gbps QDR, InfiniBand 54 Gbps FDR, 10/40 GigE RDMA over Converged Ethernet) are capable of delivering speeds from 10 to 54 Gbps. However, no previous study has demonstrated how much benefit an end user in the HPC / cloud computing domain can expect by utilizing newer generations of these interconnects over older ones, or how one type of interconnect (such as IB) performs in comparison to another (such as RoCE). In this paper we evaluate various high performance interconnects over the new PCIe Gen3 interface with HPC as well as cloud computing workloads. Our comprehensive analysis, done at different levels, provides a global scope of the impact these modern interconnects have on the performance of HPC applications and cloud computing middleware. The results of our experiments show that the latest InfiniBand FDR interconnect gives the best performance for HPC as well as cloud computing applications.

  • Bufferless Routing in Optical Gaussian Macrochip Interconnect

    Page(s): 56 - 63

    We consider an optical multichip system, called the Gaussian macrochip, in which embedded chips are interconnected by an optical Gaussian network. By taking advantage of the underlying Hamiltonian cycles in the Gaussian network, we design a bufferless routing algorithm for the Gaussian macrochip which routes packets along the shortest path in the absence of deflection and guarantees that deflected packets reach their destinations along a segment of a Hamiltonian cycle. Our extensive simulation results demonstrate that, by adopting the proposed routing algorithm, the Gaussian macrochip can support much higher inter-chip communication bandwidth, has much shorter average packet delay, and is more power efficient than previously proposed architectures for optical multichip systems.

  • Occupancy Sampling for Terabit CEE Switches

    Page(s): 64 - 71

    One consequential feature of Converged Enhanced Ethernet (CEE) is losslessness, achieved through L2 Priority Flow Control (PFC) and Quantized Congestion Notification (QCN). We focus on QCN and its effectiveness in identifying congestive flows in input-buffered CEE switches. QCN assumes an idealized, output-queued switch; however, as future switches scale to higher port counts and link speeds, purely output-queued or shared-memory architectures lead to excessive memory bandwidth requirements. Moreover, PFC typically requires dedicated buffers per input. Our objective is to complement PFC's coarse per-port/priority granularity with QCN's per-flow control. By detecting buffer overload early, QCN can drastically reduce PFC's side effects. We install QCN congestion points (CPs) at input buffers with virtual output queues and demonstrate that arrival-based marking cannot correctly discriminate between culprits and victims. Our main contribution is occupancy sampling (QCN-OS), a novel, QCN-compatible marking scheme. We focus on random occupancy sampling, a practical method not requiring any per-flow state. For CPs with arbitrarily scheduled buffers, QCN-OS is shown to correctly identify congestive flows, improving buffer utilization, switch efficiency, and fairness.

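    The intuition behind random occupancy sampling can be sketched in a few lines: instead of marking the flow of an arriving packet, sample a uniformly random occupied buffer slot, so a flow's chance of being marked tracks its current share of the buffer rather than its arrival rate. The model below is a minimal illustrative simulation, not the paper's hardware scheme, and the flow names are hypothetical.

    ```python
    import random
    from collections import Counter

    def occupancy_mark_rates(buffer, samples, seed=0):
        """Estimate per-flow marking probability under occupancy sampling.

        `buffer` is a list with one flow id per occupied packet slot.
        Each sample picks a uniformly random slot and charges its flow,
        so marking probability is proportional to buffer occupancy and
        no per-flow state is needed.
        """
        rng = random.Random(seed)
        hits = Counter(rng.choice(buffer) for _ in range(samples))
        return {flow: count / samples for flow, count in hits.items()}

    # A flow holding 90% of the buffer ("culprit") is marked ~9x more
    # often than one holding 10% ("victim"), regardless of arrival order.
    snapshot = ["culprit"] * 90 + ["victim"] * 10
    rates = occupancy_mark_rates(snapshot, 5000)
    ```
    
    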
  • Author index

    Page(s): 72