I. Introduction
RDMA-based transport is being adopted by a growing number of data center applications [1]–[3], whose increasingly demanding throughput and latency requirements can no longer be met by traditional TCP/IP-based networking. RDMA over Converged Ethernet (RoCE) allows InfiniBand (IB) RDMA packets to be encapsulated in Ethernet/IP/UDP and carried over commodity Ethernet fabric, facilitating the adoption of RDMA in existing data center networks. Given RDMA’s growing popularity, it is increasingly important to monitor RDMA traffic and the associated applications at the RDMA protocol level. End-to-end RDMA telemetry allows data center operators to quickly detect network congestion and migrate some of the associated RDMA applications to mitigate it. It can also help operators monitor the use or misuse of shared RDMA hardware resources by tenants. RDMA transaction monitoring can be a valuable input not only for optimizing application performance and troubleshooting performance anomalies such as tail latencies [4], but also for uncovering RDMA-based security attacks [5]. Despite these potential benefits, the monitoring facilities of modern data centers have not kept pace with RDMA’s popularity, and realizing these benefits faces several unique challenges, as described below.