High Performance Delay Monitoring for SRv6 Based SD-WANs

Software-Defined Wide Area Networks (SD-WANs) are used to provide services to enterprises with geographically dispersed locations in a flexible and efficient way. We focus on SD-WAN services based on the Segment Routing over IPv6 (SRv6) technology. Performance Monitoring solutions are needed in SD-WANs to detect performance degradation and outages, and optimize network operations. In this paper, we describe a high performance solution for end-to-end delay monitoring for SRv6 based SD-WAN services. The proposed solution leverages the Simple Two-way Active Measurement Protocol (STAMP) to monitor the delay of an SRv6 path between two nodes called STAMP Session-Sender and Session-Reflector. We describe three implementations of the STAMP Session-Sender and Session-Reflector for a Linux software router and compare their performance. In particular, two implementations are based on user space processing and one is based on eBPF. The results show that the eBPF-based implementation outperforms the user space implementations and has a negligible impact on the forwarding capacity of the Linux software router.


I. INTRODUCTION
It is common for enterprises to have multiple data centers and branch offices spread over large geographical areas. The reference scenario is shown in Fig. 1. Traditional Wide Area Networks (WANs) for enterprises were based on static interconnections of remote sites. With the advent of cloud computing, many enterprises moved their applications to cloud systems. Traditional WANs started to exhibit limitations because they were not designed for cloud systems. First, traditional WANs do not provide the desired level of flexibility to users. Extending traditional WANs and adding new services require human intervention and are time consuming. Moreover, traditional WANs do not support cloud ecosystems natively. To provide access to cloud applications, traditional WANs typically require backhauling all traffic to a data center. Then, from the data center, the traffic is sent to the cloud.
Software-Defined Wide Area Networking (SD-WAN) is a paradigm that aims at overcoming the limitations of traditional WANs. SD-WAN uses a software-defined approach to control the network and build the interconnections among the different locations. An SD-WAN builds interconnections among users and applications hosted on clouds or remote branches by leveraging any combination of transport services.
Over the years, many SD-WAN solutions have been proposed. Most SD-WAN solutions are commercial, such as Cisco SD-WAN [1]. The Google B4 WAN [2], [3] is a proprietary SD-WAN solution that connects Google's data centers across the world. B4 relies on a hybrid SDN approach: the WAN sites are interconnected using traditional routing protocols, an SDN-based Traffic Engineering service runs on top of the network to maximize link utilization and perform load balancing, and OpenFlow is used to control and program the switches. FlexiWAN [4] was the first open-source solution. It uses VXLAN tunnels to establish the SD-WAN interconnections. In our previous work [5], we presented an open-source SD-WAN solution called EveryWAN, which is capable of using SRv6 to establish the SD-WAN interconnections. To the best of our knowledge, EveryWAN is the first open-source solution to leverage SRv6 technology to create SD-WAN services. EveryWAN is based on Linux networking and can be deployed on software routers located at the edge of an SD-WAN.
In fact, software routers can play a role in SD-WAN scenarios, thanks to their flexibility, complementing hardware-based solutions. For example, they can be easily deployed in virtualized environments in cloud and data center scenarios. For this reason, we believe it is fundamental to work on the design and implementation of open-source SD-WAN solutions suitable for software routers.
An important function to be executed in wide area networks is Performance Monitoring (PM). PM allows network operators to detect failures and outages and assess network performance. Effective network monitoring is essential, and new tools and protocols have been designed accordingly for SDN-based networks [6]. Important application scenarios in which we can benefit from network monitoring are IoT [7] and security [8].
Several metrics exist to assess performance. Throughput measures the amount of traffic received per unit of time, and it is usually expressed in bit/s. Packet loss is a measure of how many packets fail to reach the destination due to network congestion or transmission errors. Delay (or latency) is a measure of the time required by a packet to reach the destination, and it is usually expressed in milliseconds. There are two types of delay: i) the one-way delay is the delay measured from the source to the destination; ii) the two-way delay is the delay measured from the source to the destination plus the delay measured from the destination to the source. Finally, jitter measures the variation of the delay.
Moreover, measurement methods can be classified as passive, active, or hybrid. Passive methods typically use a sniffer to analyze the traffic flows in real time and extract the relevant statistics. Passive methods do not alter the traffic flows, but are usually less effective than active methods. In contrast, active methods typically inject probe packets into the network that carry useful information for monitoring purposes. Finally, hybrid methods try to combine the advantages of active and passive methods.
In this paper, we focus on the delay monitoring of SRv6 networks. We consider a number of research and technological questions:
• Is it possible to design an effective solution for delay monitoring of SRv6 networks based on current IETF standards and work-in-progress Internet drafts?
• Can we implement the solution in a working open-source prototype based on Linux software routers?
• What is the impact of the delay monitoring solutions on the forwarding capacity of software routers? Can we implement delay monitoring with negligible impact on forwarding performance?
The main novel contributions are as follows:
• Realization of a High Performance End-to-End Delay Monitoring solution for SRv6 networks compliant with available standards and Internet drafts;
• Design and implementation of a gRPC Southbound interface to control the SRv6 nodes;
• Implementation of two user space solutions and of a kernel solution based on eBPF;
• Evaluation of the performance degradation introduced by the Delay Monitoring solution and comparison between the two user space and the eBPF-based implementations.
This paper is organized as follows. In Section II, we present an introduction to the SRv6 technology and its main use cases. Section III presents how the SRv6 technology can be used to realize SD-WAN services. In Section IV, we introduce EveryWAN, the SD-WAN prototype that we have extended. Section V presents our Delay Monitoring solution. The implementations are discussed in Section VI. In Section VII, we show how we integrated our performance measurement solution in EveryWAN to measure the delay of VPN services. In Section VIII, we present a performance evaluation and comparison of the implementations. In Section IX, we present the related work. Finally, Section X concludes the paper.

II. SRV6 TECHNOLOGY
Segment Routing (SR) is a routing technology based on the loose source routing paradigm ([9], [10]). It allows a source node to steer a packet through a list of instructions called segments. A segment can represent a topological instruction (e.g., forward the packet via a specific next hop) or a function to be applied to the packet (e.g., execute an operation on the packet). A segment is identified by an identifier known as Segment ID (SID). The list of SIDs of a packet, called Segment List or SID List, is carried in the packet header. SR can be implemented using either MPLS or IPv6 as data plane technology. In MPLS Segment Routing (SR-MPLS) [11], the SIDs are encoded as MPLS labels. The Segment List is encoded as a stack of labels. In Segment Routing over IPv6 (SRv6), the SIDs are encoded as IPv6 addresses. The Segment List is carried in an IPv6 Extension Header called Segment Routing Header (SRH) [12]. A set of standardized SRv6 functions is presented in [13]. In this paper, we focus on SRv6.
Fig. 2 shows an example of an SRv6 network scenario. The gray cloud represents an SRv6 domain. An ingress node processes the packets entering the SRv6 domain and encapsulates each received packet in an outer IPv6 header with an SRH. In the example, the SRH carries a SID List containing three SIDs. The first two SIDs represent the two waypoints that the packets should traverse before reaching the destination. The ingress node forwards the encapsulated packets towards the first waypoint. The path to reach the waypoint is decided by the traditional routing protocol (e.g., IS-IS or OSPF). The first waypoint forwards the packet toward the second waypoint, which in turn forwards the packet toward the egress node, identified by the third SID in the Segment List. The third SID is also used in the egress node to determine the operation to be performed. In this case, the egress node performs a decapsulation operation (i.e., removes the outer IPv6 header which contains the SRH) and forwards the packets to the destination. It is also possible to use two different SIDs instead of the single third one: a SID to reach the egress node and another SID to identify the operation to be performed, but this is less efficient as four SIDs instead of three would be carried in the Segment List.
Fig. 3: SD-WAN Network Slicing scenario.
The SRv6 technology was proposed in the recent past and has raised great interest in academia and industry. Since then, its development has progressed very rapidly. Today, SRv6 is supported in many hardware deployments [14] and software routers such as the Linux kernel and the Vector Packet Processor (VPP) [15]. The Linux kernel has supported SRv6 packet generation and forwarding capabilities since version 4.10 (released in February 2017). Later, it was extended to support many of the SRv6 behaviors described in [13].
SRv6 enables many use cases such as overlay VPNs [16], Traffic Engineering [17], Fast Rerouting [17], and Service Function Chaining (SFC) [18]. An overview of SRv6 implementation and deployment status is available at [19] and [14]. The ROSE project [20] aims to build a Linux-based Open Ecosystem for SRv6. It tackles multiple aspects of the SRv6 technology, including the Data Plane, Control Plane, SRv6 host networking stack, integration with applications, and integration with Cloud/Data Center Infrastructures. ROSE comprises several sub-projects, including SRPerf [21] (a performance evaluation framework for SRv6 implementations), SRv6-PM [22] (a loss monitoring solution for SRv6 networks), and HIKe [23] (a solution that combines the advantages of Linux kernel networking and custom-designed eBPF programs to speed up the performance of SRv6 software routers). In [24], the authors proposed a solution to efficiently represent SIDs, called Micro SID. This solution reduces the length of the SID List and facilitates using SRv6 on devices with limited processing capabilities.

III. SD-WAN SERVICES BASED ON SRV6
In the UCSS project (User Controlled SD-WAN Services with Performance Monitoring over GÉANT) [26], part of the GÉANT Innovation Programme [25], we designed, implemented and deployed SD-WAN services based on the SRv6 technology.
An SD-WAN can offer different services. We focus on the Network Slicing service. The reference scenario is shown in Fig. 3. Network Slicing allows customers to create different logical instances of virtual networks over the same WAN connection. It allows multiple applications to run in isolation over the same WAN. Among the different types of slicing, we focus on Routed End-to-End Slices. A Routed End-to-End Slice is an implementation of a Layer 3 VPN (L3VPN), in which the devices attached to the SD-WAN Edges belong to different broadcast domains. The SD-WAN Edge routers act as gateways to route the traffic between these broadcast domains.
In our terminology, a Slice (or Local Slice) is a portion of the customer network where users or applications are located. Each Local Slice is terminated in an SD-WAN Edge router. The SD-WAN Edge router forwards the traffic of the connected Local Slice to an egress SD-WAN Edge router. The interconnections between two different SD-WAN Edge routers are realized by using a set of Tunnels (also called Overlays). The Overlay, together with the Local Slices, forms the so-called End-to-End Slice (E2E Slice). Several technologies can be used to realize an Overlay. We focus on SRv6-based Overlays. Fig. 4 shows the reference scenario for the Network Slicing service based on SRv6 technology. An ingress SD-WAN Edge router receives IP packets from a customer source host. It classifies and associates each incoming packet with a specific End-to-End Network Slice according to various criteria, such as the incoming interface, the source IP address, or the protocol. After the classification, the ingress SD-WAN Edge router performs a lookup in its Forwarding Information Base (FIB) to discover the egress SD-WAN Edge router attached to the destination host. Then, the ingress SD-WAN Edge router applies the H.Encaps behavior described in [13] to the packet. This behavior steers the packet into an SRv6 Policy. Steering is realized by encapsulating the IP packet into an outer IPv6 header containing an SRH. The SRH carries two SIDs. The first SID represents an instruction to deliver the packet to the egress SD-WAN Edge router. The second SID is an End.DT6 instruction. End.DT6 forces the egress router to strip the outer IPv6+SRH header and deliver the original packet to the correct Slice.
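As an illustration of the encapsulation step, the SRH carried in the outer IPv6 header can be serialized by hand. The following is a minimal sketch using only the Python standard library, assuming the SRH layout of RFC 8754 without optional TLVs. The function name `build_srh` and the SID values are hypothetical and are not taken from our implementation.

```python
import ipaddress
import struct

SRH_ROUTING_TYPE = 4   # Routing Type value assigned to the SRH (RFC 8754)
NEXT_HEADER_IPV6 = 41  # the encapsulated packet is IPv6 (IPv6-in-IPv6)


def build_srh(sid_list, next_header=NEXT_HEADER_IPV6):
    """Serialize an SRH without TLVs; sid_list is given in traversal order."""
    sids = [ipaddress.IPv6Address(sid) for sid in sid_list]
    n = len(sids)
    fixed_part = struct.pack(
        "!BBBBBBH",
        next_header,
        2 * n,             # Hdr Ext Len: 8-octet units after the first 8 octets
        SRH_ROUTING_TYPE,
        n - 1,             # Segments Left: index of the active segment
        n - 1,             # Last Entry: index of the last element of the list
        0,                 # Flags
        0,                 # Tag
    )
    # The Segment List is encoded in reverse order: entry 0 is the last segment.
    return fixed_part + b"".join(sid.packed for sid in reversed(sids))


# Hypothetical SIDs: first the egress Edge router, then the End.DT6 SID.
srh = build_srh(["fc00:2::1", "fc00:2::d6"])
```

With two SIDs, the header is 40 octets long (8 fixed octets plus two 16-octet SIDs), and the End.DT6 SID is stored first because the list is reversed on the wire.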
In the SD-WAN solutions, the SD-WAN Edge routers are deployed in all the locations where the SD-WAN interconnections need to be established. An SD-WAN controller manages and programs the SD-WAN Edge routers. Depending on the location and the characteristics of the SD-WAN Edge routers, three scenarios are possible: 1) the SD-WAN Edge routers are located within the provider network and are under the network operator's control; 2) the SD-WAN Edge routers are outside the provider network, and they have no control over the transport services; 3) the SD-WAN Edge routers are outside the provider network but can interact with the provider network to deploy the SD-WAN services.
We focused on scenario 2. The SRv6-based SD-WAN services were deployed in scenarios where the SD-WAN Edge routers have no interaction with the provider network. We deployed several SD-WAN Edge routers as Virtual Machines (VMs) across Europe. These SD-WAN Edge routers were located in different kinds of networks, like university campus networks, NRENs (National Research and Education Networks), and commercial provider networks. We analyzed and classified the IPv6/SRv6 connectivity between these VMs and introduced the concept of SRv6 Transparency. SRv6 Transparency is the ability of an IPv6 network to carry SRv6 traffic. Several factors can reduce the SRv6 Transparency of a network, such as firewalls that block IPv6 packets carrying an SRH. We found different SRv6 Transparency levels in the networks that we considered. We have shown that it is possible to configure the SRv6-based SD-WAN services, taking into account the SRv6 Transparency level of the network providing IPv6 connectivity, and we have practically deployed SD-WAN services across operational networks over the Internet. An in-depth discussion of the SRv6 Transparency problem and the configuration of SRv6-based SD-WAN services can be found in the UCSS report [26].

IV. THE EVERYWAN ARCHITECTURE
EveryWAN [5] is an open-source SD-WAN prototype based on Linux networking. Fig. 5 shows the EveryWAN architecture. At the lowest level, we have the SD-WAN Edge routers called EveryEdge routers. The EveryEdge routers take care of the interconnections among all the sites. EveryEdge routers can be deployed as Virtual Network Functions (VNFs) over a Linux OS in the sites to be interconnected. An SD-WAN Controller, called EveryEdgeOS, manages all the EveryEdge routers through an API based on the gRPC protocol. It deals with many configuration and management aspects of the EveryEdge routers, ranging from their initial registration, authentication, and configuration to the activation of the policies that implement the SD-WAN services. On top of the controller, there is an SD-WAN Orchestrator named EveryBOSS, which automates the deployment of the EveryEdge routers and SD-WAN services. The orchestrator also offers a GUI.
The EveryEdge router comprises several open-source components installed on a general-purpose Linux distribution (e.g., Ubuntu Server). It uses Linux networking capabilities to forward the traffic. A component called EveryEdgeManager offers a Southbound API that allows the EveryEdgeOS controller to program and configure the router. Through the Southbound API, the controller can send commands to the EveryEdge router (e.g., install a specific route or set the IP address of a network interface). The EveryEdgeManager translates the received commands into lower-level actions. Then, it sends these actions to the Linux kernel using the open-source project pyroute2 [27], a pure Python Netlink library. A detailed description of the EveryEdge router architecture can be found in [5].
The main service offered by EveryWAN is Network Slicing (described in Section III), which allows customers to create End-to-End Slices among the remote sites. To create End-to-End Slices, the EveryEdge routers must be configured properly. The configuration depends on the SRv6 Transparency of the network. The EveryEdge router receives ingress IP packets over the customer-facing interfaces, i.e., the Local interfaces (LAN). It classifies and associates each packet with a particular End-to-End Network Slice. To perform the classification, the EveryEdge router leverages the Virtual Routing and Forwarding (VRF) technology offered by the Linux kernel. VRFs provide the ability to create isolated virtual routing and forwarding domains. Each VRF serves a particular slice. Each customer-facing interface in the EveryEdge router is mapped to a slice and enslaved to the VRF that serves that slice. Based on the destination IP address, the EveryEdge router forwards the packets associated with a slice to the remote EveryEdge routers over the WAN interfaces.
A transport technology ensures that the network delivers the packets to the remote EveryEdge router. EveryWAN supports two transport technologies: VXLAN [28] and SRv6. In this work, we only consider SRv6. To transmit the packets to a remote router, the EveryEdge router (which is an ingress EveryEdge router from the SRv6-domain point of view) applies the H.Encaps behavior described in [13] to the incoming packets. This behavior steers the packets into an SRv6 Policy. Steering is realized by encapsulating the IP packets into an outer IPv6 packet containing an SRH. The SRH carries two SIDs. The first SID represents an instruction to deliver the packet to the remote EveryEdge router. The second SID is an End.DT4/End.DT6 instruction that strips the outer IPv6+SRH header and delivers the original packet to the VRF that serves the slice. The encapsulated packets traverse the network and reach the remote EveryEdge router (which acts as an egress EveryEdge router). The egress EveryEdge router receives the SRv6 packets over the WAN interfaces. It applies the End.DT4/End.DT6 behavior to decapsulate the SRv6 packet and forward the original IP packets over the LAN interfaces to the destination.
A detailed description of the EveryWAN architecture can be found in the white paper [29].

V. STAMP DELAY MONITORING FOR SRV6
In this section, we present the proposed End-to-End Delay Monitoring solution for SRv6 networks based on the Simple Two-Way Active Measurement Protocol (STAMP) [30]. STAMP enables the measurement of several performance metrics, including packet loss, delay, and jitter. It supports both one-way and round-trip measurements in IP networks. RFC 8762 [30] defines the base functionalities of STAMP and describes the format of the packets that collect and carry the measurement data. RFC 8972 [31] introduces the STAMP Session IDentifier (SSID) and defines optional STAMP extensions that enhance the STAMP base functions. The drafts [32] and [33] present general guidelines for measuring various performance metrics in SR networks using STAMP. In the following subsections, we present a solution based on STAMP to measure the end-to-end delay of SRv6 paths.
Fig. 6 shows our STAMP reference scenario. We use a STAMP Session to measure the end-to-end delay on an SRv6 path between two nodes called STAMP Session-Sender and Session-Reflector. For delay measurements to be meaningful, the Session-Sender and Session-Reflector clocks must be synchronized. RFC 8762 does not mandate any particular approach for configuring and managing the STAMP Session-Sender, the Session-Reflector, and the STAMP Session; this can be achieved in different ways, such as using a Command Line Interface (CLI) or an SDN controller. The proposed solution leverages an SDN controller to manage the STAMP Session and configure the STAMP Session-Sender and Session-Reflector. The public documentation of our delay monitoring solution with links to code repositories is available in [34].

A. Data Plane Protocol
A STAMP session measures the end-to-end delay on a given SRv6 path between two nodes, the STAMP Session-Sender and Session-Reflector. A STAMP session consists of a bidirectional packet exchange between the STAMP Session-Sender and the Session-Reflector. Each STAMP session is identified by a unique 16-bit nonzero unsigned integer called STAMP Session IDentifier (SSID).
The STAMP Session-Sender transmits a STAMP Session-Sender test packet to the STAMP Session-Reflector. The test packet is an IPv6/UDP packet sent to the STAMP UDP port of the Session-Reflector. By default, the STAMP Session-Reflector uses the UDP port 862. The SDN controller can set a different port during the configuration of the STAMP [...] In the rest of this section, we refer to this timestamp as $T_1$ (see Fig. 8). RFC 8762 specifies two different timestamp formats: Network Time Protocol (NTP) [35] and the IEEE 1588v2 Precision Time Protocol (PTP) [36], both using 64 bits. By default, the STAMP Session-Sender uses NTP as timestamp format, as specified in RFC 8762. The SDN controller can select a different timestamp format during the STAMP Session-Sender or STAMP Session configuration.
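To make the default timestamp format concrete, the conversion between a Unix time and a 64-bit NTP timestamp can be sketched as follows. This is a minimal illustration, assuming the usual NTP layout (32 bits of seconds since 1 January 1900 followed by 32 bits of fractional seconds); the function names are hypothetical, and the sketch ignores the NTP era rollover of 2036.

```python
NTP_UNIX_OFFSET = 2_208_988_800  # seconds between 1900-01-01 and 1970-01-01


def unix_to_ntp(unix_seconds: float) -> int:
    """Encode a Unix time as a 64-bit NTP timestamp (seconds | fraction)."""
    seconds = int(unix_seconds)
    fraction = int((unix_seconds - seconds) * (1 << 32))
    # Keep only the low 32 bits of the seconds field (current NTP era).
    return (((seconds + NTP_UNIX_OFFSET) & 0xFFFFFFFF) << 32) | fraction


def ntp_to_unix(ntp_timestamp: int) -> float:
    """Decode a 64-bit NTP timestamp of the current era into a Unix time."""
    seconds = (ntp_timestamp >> 32) - NTP_UNIX_OFFSET
    fraction = (ntp_timestamp & 0xFFFFFFFF) / (1 << 32)
    return seconds + fraction
```

The fractional part has a resolution of $2^{-32}$ seconds, which is far finer than the accuracy that clock synchronization typically provides.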
The Error Estimate field comprises four sub-fields: i) the S field; ii) the Z field; iii) the Scale field; iv) the Multiplier field. S is a one-bit field that indicates whether the STAMP Session-Sender has a clock that is synchronized to UTC using an external source like GPS hardware or not. The one-bit Z field indicates the timestamp format used for the test packet [...] time when the Session-Sender received the Session-Reflector test packet. The Session-Sender collects the three timestamps from the Session-Reflector test packet and adds $T_4$, creating a measurement record $(T_1, T_2, T_3, T_4)$ that is stored locally. The generated records need to be sent to the SDN controller for post-processing, as discussed later. Considering its role in the processing of the STAMP test packets coming back from the Session-Reflector, we can refer to the Session-Sender as the final Collector of the STAMP test packets.
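The fields discussed above can be illustrated by packing a Session-Sender test packet in unauthenticated mode. The following is a minimal sketch using the Python standard library, assuming the 44-octet layout of RFC 8972 (Sequence Number, Timestamp, Error Estimate, SSID, and 28 zero octets) and the Error Estimate sub-field widths inherited from RFC 4656 (1-bit S, 1-bit Z, 6-bit Scale, 8-bit Multiplier). The function names are hypothetical and this is not the code of our implementations.

```python
import struct

# Sequence Number (32 bits), Timestamp (64 bits), Error Estimate (16 bits),
# SSID (16 bits), followed by 28 octets that must be zero (MBZ).
SENDER_PACKET = struct.Struct("!IQHH28x")


def pack_error_estimate(synchronized, ptp_format, scale, multiplier):
    """Pack S (1 bit), Z (1 bit), Scale (6 bits), and Multiplier (8 bits)."""
    if not 0 <= scale < 64 or not 0 <= multiplier < 256:
        raise ValueError("Scale must fit in 6 bits and Multiplier in 8 bits")
    # Z = 0 selects the NTP timestamp format, Z = 1 the PTPv2 format.
    return (int(synchronized) << 15) | (int(ptp_format) << 14) | (scale << 8) | multiplier


def build_sender_packet(sequence_number, timestamp, error_estimate, ssid):
    """Serialize an unauthenticated Session-Sender test packet (44 octets)."""
    return SENDER_PACKET.pack(sequence_number, timestamp, error_estimate, ssid)


# Example: synchronized clock, NTP format, Scale 0, Multiplier 1, SSID 42.
error_estimate = pack_error_estimate(True, False, 0, 1)
packet = build_sender_packet(5, 0x0102030405060708, error_estimate, 42)
```

The resulting bytes form the UDP payload; the UDP and IPv6+SRH headers are added separately before transmission.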

B. Configuration and Management
We have defined the API offered by the STAMP Session-Sender and by the STAMP Session-Reflector to the SDN controller for the configuration of the STAMP measurement service. The configuration involves setting various parameters, including the STAMP UDP port, the network interfaces on which the STAMP Session-Sender/Session-Reflector expects to receive the STAMP Test packets, and the source IPv6 address to be used in the STAMP Test packets. The controller can also create and manage the STAMP Sessions using the API exposed by the STAMP Session-Sender/Session-Reflector. In particular, to create a STAMP Session, the SDN controller must provide the following parameters:
1) the SSID of the STAMP Session;
2) the SID List of the path under measurement;
3) the interval between two consecutive STAMP Test packets;
4) the source IPv6 address of the STAMP Test packets;
5) the authentication mode (i.e., unauthenticated or authenticated);
6) the timestamp format (i.e., NTP or PTPv2);
7) the delay measurement mode (i.e., one-way or two-way);
8) the IP address of the STAMP Session-Reflector;
9) the STAMP UDP port of the STAMP Session-Sender and Session-Reflector;
10) the Session-Reflector mode (i.e., stateful or stateless).
Fig. 9 shows the interaction of the SDN controller with the STAMP Session-Sender and Session-Reflector required to create a STAMP Session.
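The ten session parameters listed above can be grouped in a single validated record. The following is a minimal sketch of such a record; the class name, the field names, and all default values other than the UDP port 862 of the Session-Reflector are assumptions made for illustration and do not reproduce the actual message definitions of our API.

```python
from dataclasses import dataclass
from enum import Enum
from ipaddress import IPv6Address
from typing import Tuple


class AuthMode(Enum):
    UNAUTHENTICATED = "unauthenticated"
    AUTHENTICATED = "authenticated"


class TimestampFormat(Enum):
    NTP = "ntp"
    PTPV2 = "ptpv2"


class DelayMode(Enum):
    ONE_WAY = "one-way"
    TWO_WAY = "two-way"


class ReflectorMode(Enum):
    STATEFUL = "stateful"
    STATELESS = "stateless"


@dataclass(frozen=True)
class StampSessionConfig:
    ssid: int                      # 1) STAMP Session IDentifier
    sid_list: Tuple[str, ...]      # 2) SID List of the path under measurement
    interval_s: float              # 3) interval between Test packets (seconds)
    source_address: str            # 4) source IPv6 address of the Test packets
    reflector_address: str         # 8) address of the Session-Reflector (IPv6 assumed)
    auth_mode: AuthMode = AuthMode.UNAUTHENTICATED              # 5)
    timestamp_format: TimestampFormat = TimestampFormat.NTP     # 6)
    delay_mode: DelayMode = DelayMode.TWO_WAY                   # 7) assumed default
    sender_udp_port: int = 862                                  # 9) assumed default
    reflector_udp_port: int = 862                               # 9)
    reflector_mode: ReflectorMode = ReflectorMode.STATELESS     # 10) assumed default

    def __post_init__(self):
        if not 1 <= self.ssid <= 0xFFFF:
            raise ValueError("SSID must be a nonzero 16-bit unsigned integer")
        if not self.sid_list:
            raise ValueError("the SID List must contain at least one SID")
        for address in (*self.sid_list, self.source_address, self.reflector_address):
            IPv6Address(address)  # raises ValueError on a malformed address
        if self.interval_s <= 0:
            raise ValueError("the interval must be positive")
        for port in (self.sender_udp_port, self.reflector_udp_port):
            if not 1 <= port <= 0xFFFF:
                raise ValueError("UDP ports must be in the range 1-65535")
```

Validating the parameters when the record is built lets a controller reject a malformed request before any message is sent to the Session-Sender or Session-Reflector.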

C. Data Collection
The STAMP Session-Sender and STAMP Session-Reflector exchange STAMP Test packets containing the timestamps required to compute the delay. The STAMP Session-Sender collects all the timestamps. The SDN controller can interact with the Session-Sender to fetch the timestamps. In general, there are two approaches the SDN controller can use to fetch the timestamps: polling mode and notification mode. In polling mode, the controller periodically polls the Session-Sender to gather the collected timestamps. In notification mode, the Session-Sender "pushes" the information toward the SDN controller, either by sending the single measurement records or by aggregating a set of measurement records in a single notification. In our solution, we have implemented the polling mode.
When the measurement records are available to the SDN controller, it can compute the delay of the direct path $d_d$ (i.e., the path from the Session-Sender to the Session-Reflector) and of the return path $d_r$ (i.e., the path from the Session-Reflector to the Session-Sender):
$$d_d = T_2 - T_1 \tag{1}$$
$$d_r = T_4 - T_3 \tag{2}$$
where $T_1$, $T_2$, $T_3$, and $T_4$ are the four timestamps defined in Section V-A, and $d_d$ and $d_r$ are the delays of the direct path and return path, respectively. Of course, the clocks of the Session-Sender and of the Session-Reflector must be synchronized, and the accuracy of the delay estimates $d_d$ and $d_r$ depends on the accuracy of the clock synchronization.
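The post-processing performed by the controller reduces to a few subtractions on each measurement record. The following minimal sketch assumes that the timestamps have already been converted to seconds and that, as in STAMP, $T_2$ and $T_3$ are taken by the Session-Reflector on reception and on transmission, respectively; the type and function names are hypothetical.

```python
from typing import NamedTuple


class MeasurementRecord(NamedTuple):
    t1: float  # Session-Sender transmit time
    t2: float  # Session-Reflector receive time
    t3: float  # Session-Reflector transmit time
    t4: float  # Session-Sender receive time


def direct_delay(record: MeasurementRecord) -> float:
    """Delay of the direct path: d_d = T2 - T1."""
    return record.t2 - record.t1


def return_delay(record: MeasurementRecord) -> float:
    """Delay of the return path: d_r = T4 - T3."""
    return record.t4 - record.t3


def round_trip_delay(record: MeasurementRecord) -> float:
    """Two-way delay, excluding the processing time in the Session-Reflector."""
    return (record.t4 - record.t1) - (record.t3 - record.t2)


record = MeasurementRecord(t1=10.0, t2=10.25, t3=10.5, t4=11.0)
print(direct_delay(record), return_delay(record), round_trip_delay(record))
```

Note that the two-way delay equals $d_d + d_r$ but is computed from differences taken on the same clock, so a constant offset between the two clocks cancels out; only the one-way estimates depend on clock synchronization.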
As for the Control Plane, both the Session-Sender and the Session-Reflector interact with the SDN controller by offering an API (see subsection VI-A).
As for the Data Plane, we have implemented three versions of the Session-Sender and Session-Reflector with the goal of improving their performance: two User Space implementations (referred to as basic and optimized, see subsection VI-B) and a Kernel Space implementation based on the eBPF framework [38] (see subsection VI-C). We evaluate and compare the performance of the different implementations in Section VIII.

A. Control Plane functionalities
Both the STAMP Session-Sender and Session-Reflector expose a Southbound API that allows an SDN controller to create/start/stop/destroy a STAMP Session and fetch the results of a STAMP Session. This API follows the design ideas discussed in Section V. We decided to extend the Southbound API proposed in [39], based on the gRPC protocol [40]. The implementation of our Southbound interface is open-source and available at [37].
The Southbound API supports the following operations. The Init operation is used to provide the global configuration parameters (i.e., the parameters common to all the STAMP Sessions) to the STAMP Session-Sender and Session-Reflector. These parameters include the UDP port of the Session-Sender/Session-Reflector, the interface used to send/receive the STAMP Test packets, and the source IPv6 address of the STAMP Test packets. Init also starts a packet sniffer to intercept all the incoming STAMP Test packets and creates an ip6tables rule to drop the STAMP Test packets after they have been processed by STAMP. The STAMP Test packets must be consumed by STAMP and not handled by the kernel. The ip6tables rule prevents STAMP Test packets from being processed by the kernel. The Reset operation resets the configuration parameters and stops the packet sniffer.
CreateStampSession prepares the STAMP Session-Sender/Session-Reflector to run a STAMP Session and send/receive the STAMP Test packets. As explained in Section V, a STAMP Session is a bidirectional packet exchange between the STAMP Session-Sender and Session-Reflector on a given SRv6 path. The results of a STAMP Session are timestamps that can be used to compute the delay of the SRv6 path under measurement. CreateStampSession allocates all the necessary data structures and provides the configuration parameters specific to the STAMP Session, including the SSID, the Segment List of the path to test, the source IPv6 address of the STAMP Test packets, and the other STAMP parameters described in [30]. StartStampSession and StopStampSession take care of starting and stopping a STAMP Session, respectively. All the STAMP Sessions are asynchronous. Thus, after starting a STAMP Session, the controller does not need to wait for its completion. Optionally, the controller can also specify the duration of the STAMP Session. Alternatively, the controller can stop the STAMP Session using the StopStampSession RPC. The DestroyStampSession operation removes a STAMP Session and deallocates all the related data structures.
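The session lifecycle driven by these operations can be summarized as a small state machine. The following is a minimal in-memory sketch; the class `SessionTable`, its method names, and the rule that a running session must be stopped before it is destroyed are assumptions made for illustration, and the sketch does not model the gRPC transport or the optional session duration.

```python
from enum import Enum


class SessionState(Enum):
    CREATED = "created"
    RUNNING = "running"
    STOPPED = "stopped"


class SessionTable:
    """Hypothetical in-memory table of STAMP Sessions, keyed by SSID."""

    def __init__(self):
        self._states = {}

    def create(self, ssid):
        # Mirrors CreateStampSession: allocate the per-session state.
        if ssid in self._states:
            raise ValueError(f"SSID {ssid} already exists")
        self._states[ssid] = SessionState.CREATED

    def start(self, ssid):
        # Mirrors StartStampSession: returns immediately (asynchronous session).
        if self._states[ssid] is SessionState.RUNNING:
            raise RuntimeError(f"SSID {ssid} is already running")
        self._states[ssid] = SessionState.RUNNING

    def stop(self, ssid):
        # Mirrors StopStampSession.
        if self._states[ssid] is not SessionState.RUNNING:
            raise RuntimeError(f"SSID {ssid} is not running")
        self._states[ssid] = SessionState.STOPPED

    def destroy(self, ssid):
        # Mirrors DestroyStampSession. Assumption: stop before destroying.
        if self._states[ssid] is SessionState.RUNNING:
            raise RuntimeError(f"SSID {ssid} must be stopped first")
        del self._states[ssid]

    def state(self, ssid):
        # Returns None for an unknown SSID.
        return self._states.get(ssid)
```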
We also implemented a GetStampSessionResults RPC that allows the controller to fetch the measurement results (i.e., the timestamps) collected by the STAMP Session-Sender. This operation is supported only by the Session-Sender, as the Session-Reflector does not collect any information during the STAMP Session. Our implementation supports asynchronous fetching. The collected timestamps are stored in the Session-Sender until the controller fetches them. As discussed in Section V-C, the controller can compute the delay of the SRv6 path under measurement (see equations 1 and 2) based on the collected timestamps.

B. User Space Implementations for Data Plane
In this subsection, we describe our user space implementations of the STAMP Session-Sender and Session-Reflector, compliant with RFC 8762 [30], RFC 8972 [31], and draft [33]. The implementations are based on the Scapy Python library [41] and are available as open-source at [37]. We have developed a first implementation (referred to as basic) and then designed an improved version (referred to as optimized). Hereafter, we first describe the basic Scapy user space implementation and then we discuss how we have tackled its performance issues with the optimized implementation.
The Session-Sender and Session-Reflector leverage the Scapy library to generate the STAMP Test packets. When we started our work, the latest release of Scapy (version 2.4.5) did not implement RFC 8762 (STAMP). Scapy's modular design allows developers to define new protocol layers easily. We have added the support for both STAMP Session-Sender and STAMP Session-Reflector Test packets in unauthenticated mode. Our contribution has been accepted and merged in the mainstream distribution of Scapy, adding the support of the STAMP protocol. Both the Session-Sender Test packet and Session-Reflector Test packet are compliant with the formats defined in RFC 8762 and described in Section V. The STAMP Test packets contain the timestamps used to compute the delay. As discussed in Section V, STAMP can support two timestamp formats: NTP and PTPv2. Our current implementation only supports NTP timestamps.
After generating the STAMP Test packets, the Session-Sender and the Session-Reflector use the Scapy library to send the packets on the outgoing network interface. In particular, before sending a STAMP Test packet, the Session-Sender adds a UDP header and an IPv6+SRH header to the packet. The UDP header contains the STAMP port of the Session-Reflector as destination port. The SRH contains the Segment List of the path under measurement (i.e., the path from the Session-Sender to the Session-Reflector). The Session-Reflector performs the mirror operations, adding the proper UDP header and IPv6+SRH header to send the packet to the Session-Sender. Then, the Session-Sender and the Session-Reflector pass the packet to an L3RawSocket6. The L3RawSocket6 is a Scapy socket built on top of an AF_INET6/SOCK_RAW Linux socket. The Linux kernel adds a Layer 2 header and sends the packet to the destination (i.e., the Session-Reflector or the Session-Sender) according to the usual L2/L3 rules.
Both the Session-Sender and the Session-Reflector need to process the incoming STAMP Test packets. The Session-Reflector receives the STAMP Session-Sender Test packets from the Session-Sender and has to reply to these packets by adding the proper timestamps. The STAMP Session-Sender receives STAMP Session-Reflector Test packets from the Session-Reflector and processes them, acting as a measurement data collector.
The Session-Sender and the Session-Reflector run a dedicated thread to capture, validate and process the STAMP Session Test packets. To capture the incoming STAMP Test packets, the basic implementation of the Session-Sender uses a Scapy AsyncSniffer. The AsyncSniffer captures all the incoming packets received on a given interface and passes the captured packets to a user space callback named stamp_reply_packet_received. This callback drops any non-STAMP Test packet and processes only the valid STAMP Test packets. Since stamp_reply_packet_received operates in user space, calling it for each received packet can have a significant impact on the CPU usage. To reduce this impact, it is important to reduce the number of packets processed by stamp_reply_packet_received. In our implementation, we attach a BPF filter to the AsyncSniffer. This filter allows the AsyncSniffer to capture only the STAMP Test packets by filtering out non-STAMP Test packets at kernel level. Thus, stamp_reply_packet_received is invoked only when a STAMP Test packet is received. For each captured STAMP Test packet, stamp_reply_packet_received performs several validation checks. If the packet passes all the validation checks, the Session-Sender extracts the timestamps and collects them in a FIFO queue. The controller can periodically send a GetStampSessionResults command to fetch the latest results from the Session-Sender. The results are kept in the FIFO queue until they are fetched, then they are permanently removed from the queue.
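The fetch-and-remove semantics of the results queue described above can be sketched as follows. This is a minimal illustration under stated assumptions: the class name ResultsQueue and its methods are hypothetical, and the lock is assumed to be needed because the capture thread and the handler of GetStampSessionResults would run concurrently.

```python
import threading
from collections import deque


class ResultsQueue:
    """FIFO store for timestamp records, drained on each fetch (hypothetical)."""

    def __init__(self):
        self._records = deque()
        self._lock = threading.Lock()

    def push(self, record):
        # Called by the capture thread for each validated STAMP Test packet.
        with self._lock:
            self._records.append(record)

    def fetch_all(self):
        # Called when the controller requests the results: return the records
        # in arrival order and remove them permanently from the queue.
        with self._lock:
            records = list(self._records)
            self._records.clear()
        return records
```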
The basic implementation of the Session-Reflector performs similar operations to capture STAMP Session-Sender Test packets and send STAMP Session-Reflector Test packets.
During our performance evaluation, we found that the basic Scapy solution exhibited very poor performance.
As explained previously, the basic implementation relies on the Scapy AsyncSniffer to capture the STAMP Test packets. AsyncSniffer is implemented using a Linux AF_PACKET/SOCK_RAW socket. An AF_PACKET/SOCK_RAW socket captures all the packets received on a given interface. The capture process of a plain AF_PACKET socket is very inefficient, because it uses very limited buffers and requires a system call to capture each packet.
The second bottleneck of the basic implementation is related to the process of building and dissecting the STAMP Test packets. The Session-Sender periodically generates and sends STAMP Test packets to the Session-Reflector. Generating a STAMP Test packet involves several operations, such as building each layer, filling each header with the proper information, stacking all the layers together, and computing the checksum. We found that repeating this sequence of operations for each packet to be transmitted is very expensive.
Therefore, we designed an improved implementation of the STAMP Session-Sender that mitigates the performance issues described above. We refer to this improved version as optimized. This implementation uses the PACKET_MMAP [42] socket option. PACKET_MMAP improves the capture process by using a circular buffer mapped in user space that can be used to send and receive packets. This buffer is shared between the kernel and our user space application. A buffer shared between the kernel and the user space also has the advantage of minimizing packet copies. When a packet arrives, the kernel stores the packet in the buffer. Since the buffer is shared between the kernel and our user space STAMP application, the application can read the packet without issuing any system call.
To fix the inefficiencies in the sending procedures, we observed that packets sent in the context of a STAMP Session are very similar to each other. Most of the packet fields are equal for each packet in a STAMP Session. These fields include the SSID, the Segment List, the source and destination IP addresses, and the UDP ports. Only a few fields need to be changed, such as the timestamp fields and the sequence number contained in the STAMP Test packets. Instead of generating a new packet for each STAMP packet to be sent, the optimized implementation of the Session-Sender allocates a STAMP Session-Sender Test packet when the STAMP Session is created (CreateStampSession operation). When a new packet needs to be sent, the Session-Sender only changes the variable fields of the packet (e.g., the timestamps and the sequence number). Then it computes the UDP checksum and sends the packet to the Session-Reflector. In this way, we avoid the overhead related to generating a new STAMP Test packet from scratch. To further improve performance, we save the STAMP Test packet as a bytes array instead of a Python object. In this way, we avoid the overhead due to converting the packet from its Python representation to a bytes array before sending it on the network. We also optimized the logic used to parse the received packets. For each received STAMP Session-Reflector Test packet, Scapy performs the so-called packet dissection, i.e., it reads the bytes of the packet and builds a Python object to represent the packet. Then, it collects the timestamps from the packet. In the optimized solution, we bypass the Scapy dissector and extract the timestamps directly from the bytes representation of the packets.
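The template approach can be illustrated with the following sketch, which builds the UDP header and STAMP payload once and then patches only the sequence number, the timestamp, and the UDP checksum for each transmission. It is not the code of [37]: the function names are hypothetical, the outer IPv6+SRH header is omitted, and the 44-byte unauthenticated Session-Sender layout (sequence number, timestamp, error estimate, SSID, MBZ) is assumed from RFC 8762 and RFC 8972.

```python
import struct

UDP_HDR_LEN = 8
STAMP_LEN = 44  # assumed size of the unauthenticated STAMP test payload


def make_template(src_port, dst_port, ssid):
    """Build UDP header + STAMP payload once, with the variable fields zeroed."""
    udp = struct.pack("!HHHH", src_port, dst_port, UDP_HDR_LEN + STAMP_LEN, 0)
    # seq (4) | timestamp (8) | error estimate (2) | SSID (2) | MBZ (28)
    stamp = struct.pack("!IQHH28x", 0, 0, 0, ssid)
    return bytearray(udp + stamp)


def internet_checksum(data):
    """16-bit ones' complement checksum used by UDP."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def patch_and_checksum(template, seq, ntp_timestamp, src_addr, dst_addr):
    """Update only the variable fields in place, then recompute the checksum.

    src_addr and dst_addr are 16-byte IPv6 addresses for the pseudo-header;
    with an SRH, dst_addr must be the final destination of the packet.
    """
    struct.pack_into("!IQ", template, UDP_HDR_LEN, seq, ntp_timestamp)
    struct.pack_into("!H", template, 6, 0)  # clear the previous checksum
    pseudo = src_addr + dst_addr + struct.pack("!I3xB", len(template), 17)
    # A computed value of zero is transmitted as 0xFFFF in UDP.
    csum = internet_checksum(pseudo + bytes(template)) or 0xFFFF
    struct.pack_into("!H", template, 6, csum)
    return template
```

Because the template is a bytearray, each transmission touches 14 bytes plus the checksum computation, instead of rebuilding and serializing every protocol layer.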
As for the Session-Reflector, its optimized implementation improves the efficiency of the basic version using the same approaches that we have discussed for the Session-Sender.
The optimized versions of the Session-Sender and Session-Reflector STAMP implementation have been integrated in the EveryWAN prototype as described in Sec. VII.

C. eBPF Implementation for Data Plane
eBPF [38] is a Linux technology that enables running programs in kernel space and in a sandboxed environment, without having to deploy ad-hoc kernel modules or change the kernel code. eBPF can offer high performance for specific packet processing tasks. We designed and implemented a proof-of-concept eBPF implementation with the goal of assessing its performance.
Our eBPF deployment is based on the HIKe / eCLAT [43] [44] framework. HIKe (Heal, Improve and desKill eBPF) is a virtual machine abstraction for eBPF. It makes it possible to chain multiple simpler eBPF programs into a larger and more complex program. eCLAT (eBPF Chains Language And Toolset) is a Python-like language and programming framework. Its scripts compile to HIKe chains, providing a high-level, simpler language that can be used to compose complex eBPF programs in a modular fashion.
Algorithm 1 HIKe chain high level structure for STAMP Session-Reflector.
  if packet is STAMP then
    process headers for layers 2, 3, 4
    compute UDP checksum
    cross connect to layer 2 interface
  else
    pass packet to kernel
  end if

The high-level pseudocode in Algorithm 1 shows the structure of the HIKe chain for the STAMP Session-Reflector. The chain is attached to the XDP hook on the desired interface and the entire processing is performed without letting the packet enter the Linux kernel networking stack. The first eBPF program filters only STAMP Test packets; everything else is passed to the kernel without further processing. The chain then manipulates the STAMP fields, adding the new timestamps. Then, the address/port fields in the MAC, IPv6 and UDP headers are changed before forwarding the packet. Lastly, the UDP checksum is recalculated and the packet is forwarded on the desired interface.
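As a purely illustrative companion to Algorithm 1, the following Python sketch mimics the payload rewrite performed by a stateless Session-Reflector. It is not the eBPF code of [45]: the function reflect and its parameters are hypothetical, and the field layouts are assumed from the unauthenticated formats of RFC 8762 and RFC 8972 and should be checked against those documents.

```python
import struct

# Assumed Session-Sender layout: seq | timestamp | error estimate | SSID | MBZ
SENDER_FMT = "!IQHH28x"
# Assumed Session-Reflector layout: seq | timestamp (T3) | error estimate |
# SSID | receive timestamp (T2) | sender seq | sender timestamp (T1) |
# sender error estimate | MBZ | sender TTL | MBZ
REFLECTOR_FMT = "!IQHHQIQH2xB3x"


def reflect(sender_payload, t2, t3, error_estimate, ttl):
    """Build the reflected payload in stateless mode (hypothetical helper).

    t2 is the receive timestamp and t3 the transmit timestamp of the
    Session-Reflector, both as 64-bit integers.
    """
    seq, t1, sender_err, ssid = struct.unpack(SENDER_FMT, sender_payload)
    # Stateless mode: the reflected packet reuses the sender sequence number.
    return struct.pack(REFLECTOR_FMT, seq, t3, error_estimate, ssid,
                       t2, seq, t1, sender_err, ttl)
```

Both payloads are 44 bytes long under these assumptions, which is why the rewrite can be done in place without resizing the packet.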
The Collector implementation is simpler because the packet does not need to be forwarded. The chain comprises a filter so that only STAMP packets are processed, while other packets are sent to the kernel networking stack. Then we have the actual Collector eBPF program. It parses the STAMP payload of the packet and extracts the timestamps. The extracted timestamp records are written to an eBPF map, accessible from user space, so that it is possible to read the measurements.
The code for the eBPF implementation can be found in the repository [45]. The deployment and configuration of the eBPF implementation is not integrated in the EveryWAN prototype. The configuration is performed manually, as the eBPF proof-of-concept implementation is only used for the performance experiments described in Section VIII.

VII. DELAY MONITORING THROUGH EVERYWAN CONTROLLER
We integrated the delay monitoring in the EveryWAN prototype. As explained in the EveryWAN white paper [29], the EveryEdgeOS controller exposes a Northbound API that We also extended the Northbound API to offer the basic operations to create, control, and destroy the STAMP Sessions. Furthermore, we added a section to EveryGUI where users can monitor in real time the delay of the deployed SRv6-based VPNs. The result of a measurement session presented on EveryGUI is shown in Figure 10. The x-axis reports the time at which each measurement is performed. Delays are reported on the y-axis. A walkthrough documentation showing the use of delay monitoring in EveryWAN is available in [34].
In addition to the instant delays, the controller also computes the average delay for both the direct and return paths.
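The averaging step can be sketched as follows. The sketch assumes that each measurement record carries four timestamps, with T1 and T4 taken by the Session-Sender (transmit and receive) and T2 and T3 taken by the Session-Reflector (receive and transmit), and that the clocks of the two nodes are synchronized so that one-way delays are meaningful. The function name path_delays is hypothetical.

```python
from statistics import mean


def path_delays(records):
    """Return (average direct delay, average return delay).

    Each record is a tuple (t1, t2, t3, t4) of timestamps in seconds.
    """
    direct = [t2 - t1 for (t1, t2, t3, t4) in records]
    ret = [t4 - t3 for (t1, t2, t3, t4) in records]
    return mean(direct), mean(ret)
```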

VIII. EXPERIMENTS AND RESULTS
In this section, we describe the testbed and the methodology used to assess the performance of our STAMP implementations, and we present a comparison between the different implementations.

A. Testbed and Performance Evaluation Methodology
To evaluate the performance of our three implementations, we have deployed a testbed according to RFC 2544 [46], which provides a methodology for benchmarking network devices. The testbed (shown in Figure 11) includes two nodes: Traffic Generator (TG) and System Under Test (SUT). We have deployed our testbed in the Wisconsin cluster of CloudLab [47], a platform dedicated to scientific research on the future of cloud computing. The testbed nodes (TG and SUT) are bare metal servers equipped with two Intel E5-2630 v3 processors with 16 cores (hyper-threaded) clocked at 2.40 GHz, 128 GB To control Linux networking capabilities (e.g., network interfaces, routing, and SRv6 behaviors), we installed the 5.13 release of the iproute2 [50] suite. We also installed ethtool 5.13 to configure the hardware capabilities of the NIC, such as offloading [51].
To perform the experiments, we used SRPerf [21], a performance evaluation framework for software and hardware implementations of SRv6. SRPerf orchestrates and automates the execution of the experiments using the TRex Python automation libraries [52]. It interacts with the TRex generator installed on the TG. The TG generates packets using the TRex traffic generator and sends them to the SUT. The SUT processes the received packets. The TG evaluates the maximum throughput that can be processed by the SUT. SRPerf supports different throughput measurements, such as No-Drop Rate (NDR), Partial Drop Rate (PDR), and Maximum Receive Rate (MRR). In our experiments, we used the Partial Drop Rate at a 0.5% drop ratio (in short, PDR@0.5%) as throughput measurement, which is defined as the maximum packet rate at which the packet drop ratio is less than or equal to 0.5%. For further details on this metric and how it is evaluated by the SRPerf tools, we refer to [21].
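To illustrate the definition of PDR@0.5%, the sketch below searches for the highest rate whose measured drop ratio stays within the threshold. It is only an assumption that a binary search is adequate here; the actual search procedure of SRPerf is described in [21]. The function find_pdr and the callback measure_drop_ratio (which would run one trial at the given rate and return the observed drop ratio) are hypothetical names.

```python
def find_pdr(measure_drop_ratio, max_rate, threshold=0.005, resolution=1000):
    """Return a rate (pps) within `resolution` of the partial drop rate.

    Assumes the drop ratio does not decrease as the offered rate grows.
    """
    low, high = 0, max_rate
    while high - low > resolution:
        mid = (low + high) // 2
        if measure_drop_ratio(mid) <= threshold:
            low = mid   # the SUT sustains this rate: search higher
        else:
            high = mid  # too many drops: search lower
    return low
```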
Our goal is to evaluate the impact of STAMP measurement procedures on the packet processing capabilities of a Linux software router. As a reference, we consider the scenario in which the router is only processing regular data packets, then we intermix regular data packets with STAMP measurement packets in different percentages.
For the processing of regular data packets, we consider an SRv6 ingress node that performs packet encapsulation: it receives IPv6 packets and applies the H.Encaps behavior to encapsulate the packets in an outer IPv6+SRH packet. Therefore, in our baseline scenario the TG generates IPv6 packets, the SUT receives the packets on one interface, performs the encapsulation, and forwards the packets on the second interface.
For the processing of the STAMP measurement packets, we have considered two cases: 1) the SUT is configured as a STAMP Session-Reflector: it receives STAMP Session-Sender Test packets, processes them, and for each STAMP Test packet it sends a STAMP Session-Reflector Test packet to the TG; 2) the SUT is configured as a STAMP Session-Sender: it receives STAMP Session-Reflector Test packets, then extracts and collects the timestamps from the packets, performing the role of the Collector. The impact of STAMP measurements is evaluated by changing the fraction of STAMP packets and measuring the packet processing capacity using the PDR@0.5% metric. When the SUT acts as a Session-Reflector (case 1), the methodology to evaluate the packet drop ratio described above can be applied easily, as both the data packets and the STAMP test packets are forwarded back by the SUT towards the TG (the data packets are encapsulated, the STAMP packets are processed and properly updated). To evaluate the packet drop ratio, the TG simply compares the number of transmitted and received packets in an experiment session (summing up the data and STAMP test packets). On the other hand, when the SUT acts as a Session-Sender/Collector (case 2), it does not forward the received STAMP test packets back to the TG, because it receives the STAMP packets and produces the measurement records. Therefore, the TG cannot simply count the packets transmitted back by the SUT to evaluate the packet drop ratio. In fact, the number of packets correctly processed by the SUT corresponds to the sum of the data packets that are forwarded back and the STAMP test packets that are properly processed by the SUT (i.e., by collecting the STAMP measurement metrics). A STAMP packet that is not processed by the SUT must count as a dropped packet. Therefore, the TG must retrieve the counter of processed STAMP packets from the router under test after each experiment session. To solve this problem, we have designed and implemented a gRPC-based API. The SUT/router acts as a gRPC server, whereas a gRPC client in the TG queries the server after each experiment session and retrieves the number of processed STAMP packets. In this way, the TG can sum up this number with the number of received data packets and can properly evaluate the packet drop ratio.
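The drop ratio computation for case 2 can be written as a short sketch. The function and parameter names are hypothetical; stamp_processed stands for the counter that the TG retrieves from the SUT through the gRPC API after the experiment session.

```python
def drop_ratio(tx_total, rx_data, stamp_processed):
    """Packet drop ratio when the SUT acts as Session-Sender/Collector.

    tx_total:        data + STAMP packets transmitted by the TG
    rx_data:         data packets forwarded back by the SUT to the TG
    stamp_processed: STAMP packets processed by the SUT (retrieved counter)
    """
    processed = rx_data + stamp_processed
    return (tx_total - processed) / tx_total
```

For example, with 1000 transmitted packets, 900 data packets received back and 97 STAMP packets processed, the drop ratio is 0.3%, which is within the 0.5% threshold.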
To run the performance experiments, a careful configuration of the SUT node is needed because we need to saturate the capacity of a CPU to measure the PDR@0.5% metric. Therefore, all the tasks of interest must be executed by the selected CPU and no other task may be executed on the same CPU. A detailed discussion on these aspects can be found in the Appendix. A walkthrough documentation of how to set up the testbed and run the experiments is available in [34].

B. Performance analysis
We report several experiments to evaluate the impact of our Session-Sender and Session-Reflector implementations on the user traffic. First, we evaluate the forwarding capability in the scenario with only data traffic (no STAMP test packets) without running any STAMP implementation. We consider this throughput as our baseline. Then, we run the Session-Sender or the Session-Reflector on the SUT and we evaluate the maximum achievable throughput for different combinations of data and STAMP test packets using our three different STAMP implementations.
The forwarding capacity of the node is measured using the PDR@0.5% metric as discussed in the previous subsection. The results reported in Figs. 12-16 are always the average of 10 evaluations (every single evaluation is carried out using the SRPerf tool [21]). We do not report error bars with confidence intervals in our figures, as we obtained stable results and the 95% confidence intervals are so close to the average that they are not noticeable. The tables with the detailed results are reported in the Appendix.
The comparison among the STAMP Session-Sender/Collector implementations is shown in Fig. 12. The Scapy implementations suffer a 10.4% performance degradation compared to the baseline performance. This performance degradation is due to the fact that, even if there are no STAMP Test packets to be processed, the Session-Sender still has to look at all the incoming packets to capture the STAMP Test packets. This operation is efficient, as it is executed in kernel mode, but it still adds a per-packet cost. Both user space implementations have the same performance (≈925 kpps). The reason is that, even if the two implementations differ greatly in the processing of STAMP Test packets, the mechanisms used to filter the STAMP Test packets are the same. Thus, when there is only data traffic, the two implementations exhibit the same performance degradation. The packet rate of the eBPF-based implementation (≈1016 kpps) is higher than that of the two user space implementations. This is due to the fact that the HIKe eBPF chain contains a more efficient eBPF filter than the filter of the user space implementations. Since this test is performed without STAMP packets, the performance is only affected by the filter that the packet traverses before being sent to the kernel networking stack.
The performance drop of the eBPF-based implementation with respect to the baseline is 1.6%. The STAMP Session-Reflector implementations exhibit the same behavior when processing only data traffic. A comparison among the Session-Reflector implementations is shown in Fig. 13.
We evaluated the PDR@0.5% in the opposite scenario in which there is only measurement traffic (i.e., only STAMP Test packets). The results are shown in Fig. 14.
Regarding the Session-Sender (shown in Fig. 14a), the basic implementation reaches a packet rate of ≈1.06 kpps, which is much lower than the other two implementations. As discussed in Section VI, the reasons for this poor performance are related to the inefficiency of the Scapy AsyncSniffer and the high overhead of the Scapy builder and dissector. In the optimized implementation, we mitigated these issues. This allows the Session-Sender to reach a higher packet rate, ≈85.8 kpps. The performance of the eBPF-based implementation is much higher (≈2995 kpps). The reason is that eBPF performs all the processing in kernel space, while optimized is a user space solution.
Concerning the performance of the Session-Reflector (shown in Fig. 14b), we observe the same trend. The basic implementation reaches a packet rate of ≈470 pps, which is lower than the packet rates of the optimized (≈35.3 kpps) and eBPF-based (≈2179 kpps) implementations. The performance of the Session-Sender is always better than that of the Session-Reflector. The reason is that the Session-Sender processing is less expensive than the Session-Reflector processing. For each received STAMP Session-Reflector packet, the Session-Sender must collect and store the timestamps. Instead, when the Session-Reflector receives a STAMP Session-Sender Test packet, it must generate a STAMP Session-Reflector Test packet and forward the packet towards the Session-Sender. These operations are much more expensive than storing the timestamps.
Clearly, the scenario described above with only measurement traffic is unrealistic. We only use it to assess and compare the performance of the different implementations. In real scenarios, the measurement traffic (i.e., STAMP) is a small fraction of the overall traffic and will never reach 100% of the link capacity. For this reason, we analysed the performance considering different fractions of STAMP measurement packets.
Fig. 15 shows the maximum achievable throughput for the Session-Sender, varying the fraction of STAMP measurement packets. The basic implementation starts at ≈927.8 kpps at 0% STAMP, drops to ≈641.3 kpps (at 0.05% STAMP) and ≈20.3 kpps (at 5% STAMP), and then it continues to slowly drop to ≈1.06 kpps (100% STAMP). The throughput of the optimized implementation starts at ≈924.5 kpps and remains stable until the measurement traffic is 0.1% of the total traffic. The packet rate of the eBPF-based implementation starts at ≈1015.9 kpps when there is no measurement traffic (i.e., no STAMP packets) and it remains almost stable until the measurement traffic is 10% of the total traffic. Then, we observe a trend in contrast with the two user space implementations. The performance goes up to ≈1152.1 kpps when the measurement traffic is 20% of the total traffic and reaches ≈2994.9 kpps when the measurement traffic is 100%.
The reason why the eBPF implementation starts with a higher throughput (PDR@0.5%) when the STAMP traffic is low is that its BPF filter used to select the STAMP traffic is lighter than the one used by the Scapy implementations. When the percentage of STAMP traffic is very low, it does not affect the overall performance and the filtering is the only factor that plays a role. When the STAMP traffic increases, the throughput of the eBPF implementation increases because the STAMP packets are not sent to the kernel networking stack and they are processed faster by our eBPF program than the SRv6 packets that the kernel is encapsulating. On the other hand, the Scapy implementations process the STAMP packets in user space, hence the performance is reduced when the fraction of STAMP packets increases.
The Session-Reflector throughput for different values of the percentage of STAMP measurement packets is shown in Fig. 16. The results for the three implementations are consistent with what we have discussed for the Session-Sender/Collector implementation. For high values of the percentage of STAMP traffic, the performance is slightly lower, because the Session-Reflector sends back the STAMP measurement packets. Apparently, this is heavier than storing the STAMP measurement records as done by the Session-Sender/Collector.

IX. RELATED WORKS
Several solutions have been proposed for performance monitoring in a network. Some of them, like Nagios [53] and Zabbix [54], focus on the monitoring of network devices. Other solutions, like Ceilometer [55], target cloud environments. Concerning SDN, several solutions have been proposed. OpenNetMon [56] is a framework to measure throughput, delay, and packet loss in OpenFlow networks. A monitoring framework for SDN Virtual Networks is proposed in [57]. Other solutions for OpenFlow networks can be found in [58] and [59]. A review of the monitoring techniques used in SDN is provided in [6].
The IETF worked on the standardization of a protocol to measure the performance of IP and MPLS networks. This protocol is defined in RFC 4656 [60] and is called One-Way Active Measurement Protocol (OWAMP). OWAMP only focused on the one-way performance metrics, such as one-way delay and one-way packet loss. Another protocol was defined later, called Two-Way Active Measurement Protocol (TWAMP). TWAMP (defined in RFC 5357 [61]) introduced the two-way measurements. RFC 5357 defines both the test protocol (i.e., the format of the messages exchanged to collect the measures) and the control protocol (i.e., the protocol used to set up the parameters required by the measurement session). RFC 8762 [30] introduces a new protocol, known as Simple Two-Way Active Measurement Protocol (STAMP). RFC 8972 [31] proposes optional extensions, such as TLV (Type-Length-Value) coding to specify the Return Path. Later on, the STAMP protocol was extended to support SR networks (both SR-MPLS and SRv6) [33]. This solution can measure metrics like delay or packet loss of an SRv6 path. The measurement mechanism is based on packets exchanged on the SRv6 path under measurement. These packets carry information used to compute the performance.
In [62], the authors described a per-flow packet loss measurement solution based on the alternate marking method called PF-PLM. They also proposed and compared two different implementations of the proposed solution, realized by extending the Netfilter/Xtables and IP set Linux frameworks, respectively. In our previous work [22], we proposed an open source solution for Performance Monitoring of SRv6 networks, called SRv6-PM. SRv6-PM includes a cloud-native infrastructure that supports ingestion, processing, storage and visualization of PM data. We also provided an implementation based on the eBPF framework. Both works focused on packet loss monitoring.
In [63], the authors described SRA, a user space implementation of the SRv6 data plane based on AF_XDP. The proposed solution supports a custom SRv6 behavior called End.DM, which enables the measurement of the delay in SRv6 networks. SRA collects the timestamps in each node of the SRv6 path. Our solution does not implement an SRv6 data plane; it only implements the STAMP protocol and leaves the SRv6 packets to the Linux kernel. Moreover, STAMP is focused on the end-to-end delay, so there is no need to record the timestamps of all the intermediate nodes.

X. CONCLUSION
In this paper, we proposed a solution to support the delay monitoring of SRv6 SD-WAN services. Our solution is based on the STAMP protocol and its extensions to support performance measurements in SRv6 networks, currently under discussion in the IETF. The main components of the solution are the STAMP Session-Sender and Session-Reflector, which run in the SRv6 routers and perform the delay monitoring operations in the data plane. These data plane components need to be configured to execute the monitoring procedures. We defined and implemented an API that allows an SDN controller to interact with the Session-Sender and Session-Reflector. We integrated the proposed solution in EveryWAN, an SD-WAN open source prototype. Therefore, we deployed and tested a complete open source framework for delay monitoring of SRv6 based SD-WANs. In this respect, we have given a positive answer to the first two research and technological questions outlined in the introduction: i) the proposed approach based on IETF standards and current Internet drafts is an effective solution for delay monitoring of SRv6 networks; ii) we were able to implement the Delay Monitoring in an open source prototype based on Linux software routers, covering both the data plane aspects and the control plane aspects.
Then, we have addressed the research questions related to the performance impact of delay monitoring procedures on a Linux software router. We have implemented the proposed solution in three different versions and executed a number of performance experiments to evaluate and compare the three implementations. We started with a naive user space implementation of STAMP based delay monitoring, but we realized that its performance was poor, with a high reduction of the forwarding capacity of the software router. We then optimized the user space implementation, achieving an acceptable performance impact. In particular, with the optimized user space implementation the impact is acceptable when the fraction of measurement packets is kept within reasonable limits (e.g., less than 0.1%). We think that these limits will not be exceeded under practical operational conditions, as the number of measurement packets will always be a small fraction of the data traffic. Therefore, we have integrated the optimized user space implementation in our open source SD-WAN framework, which now offers a running prototype of the delay monitoring solution. We further considered a third implementation, based on the Linux eBPF technology. This proof-of-concept implementation provides a positive answer to the question about the feasibility of delay monitoring in SD-WANs with negligible impact on the forwarding capability of a Linux software router.

ACKNOWLEDGMENT
This work has received funding from the Cisco University Research Program and from the GÉANT Innovation Programme.

A. Configuration of the SUT node
We configured the SUT node according to the guidelines described in [21]. In particular, we disabled the hyper-threading feature of the SUT node using the sysctl Linux utility. We used the SMP IRQ affinity features to bind all the NIC receive queues to a single CPU core. This prevents the kernel from distributing the received packets across multiple CPU cores. In this way, we force all the incoming traffic (i.e., STAMP traffic and user traffic) to be processed by a single CPU core. For the user space implementation, we used the taskset utility to bind the user space process to the same CPU. In order to make the experiments independent of the NIC hardware capabilities, we also disabled all the NIC hardware offloading capabilities, including checksum offload, Large Receive Offload (LRO), Generic Receive Offload (GRO), and Generic Segmentation Offload (GSO).

B. eBPF Implementation of the Session-Sender/Collector
To evaluate the performance of the Session-Sender/Collector, we need to count the number of correctly processed STAMP test packets during a test session.
For testing the eBPF implementation, it is not possible to count this number after the test session, because the records that are written in the eBPF maps exceed the capacity of the map, so they are overwritten. Therefore, we have developed a modified version of the eBPF implementation of the Session-Sender/Collector, which counts the correctly processed packets and writes the count in a map record. This counting operation can impact performance because it requires additional processing for each received packet. Therefore, we modified the implementation so that the same number of operations is executed by the regular implementation and by the version that does the counting. This obviously means that the version that does the counting is not actually storing the measurement records, but this is fine for our purposes.
The regular implementation performs a read operation and a write operation for each received packet, using these two maps:
• A map containing a single timestamp (64 bits), which holds data used to reconstruct the local timestamp upon receiving the packet. This is accessed with a read operation for each received packet.
• A map containing 4 timestamps (64 bits each), which is used to store the STAMP data record after the packet is received. This is accessed with a write operation for each received packet.

The modified implementation for counting purposes performs one read operation to read the counter with the number of packets processed and a write operation to write the updated counter. To keep the complexity identical to the regular implementation, we would need to read a 64 bit counter from the first map, and write the incremented counter to a map with 4*64 bits values, which of course is not possible. We carried out the test twice, the first time reading and writing to the map with 64 bits values, and the second time using the map with 4*64 bits values. With the first test, we get an overestimation of the processing capability, because we are performing a shorter write operation (64 bits instead of 4*64 bits). With the second test, we get an underestimation of the processing capability, because we are performing a longer read operation (4*64 bits instead of 64 bits). In our results shown in Fig. 15 we report the average between the two tests. The difference between the two tests is only noticeable in the two experiments with a high percentage of STAMP traffic (50% and 100% in Fig. 15). In particular, the PDR@0.5% metric at 100% STAMP traffic is 3.3M pps for the first test and 2.6M pps for the second test (their average, about 2.95M pps, is consistent with the ≈2994.9 kpps reported in Section VIII for 100% STAMP traffic). The actual achievable throughput is in between these values, but we are not interested in its precise evaluation, because the interesting part of Fig. 15 is the left part, where the percentage of STAMP measurement packets is below 1%.

C. Implementation of gRPC server
In order to count the STAMP packets processed by the Collector, we implemented a gRPC API GetResultsCounter() that allows the TG to retrieve the number of STAMP Test packets received and processed by the SUT. In our first implementation, we ran the gRPC server on a dedicated thread running on the same CPU as the Collector. During our experiments, we discovered that the gRPC server affected the packet processing capability of the router, which resulted in inaccurate measurements. To fix this issue, we had to move the gRPC server to a different process and bind the process to a different CPU.

D. Results
In the following tables, we report the details of the experiment results shown in Figs. 12-16. Each value of PDR@0.5% is averaged over 10 evaluations performed with the SRPerf tool [21] on the Cloudlab testbed shown in Fig. 11. We report the average (Avg) and standard deviation (SD) in kpps ($10^3$ packets/s) and the Coefficient of Variation (CV), i.e., the ratio between the standard deviation and the average. We evaluate the 95% confidence interval, denoted as CI95, and report in the tables the ratio between CI95 and the average. This ratio is always very low; for this reason, we have not plotted error bars with confidence intervals in Figs. 12-16. Note that in Table I the values of SD, CV and CI95 are zero for high percentages of STAMP traffic. This happens because the throughputs are very low and the resolution of SRPerf is also low, with the consequence that all the measurements are equal. This is not a problem, because we are not interested in measuring throughputs on the order of a few kpps with high precision. We are simply establishing that the performance is very poor (not acceptable) when the maximum rate of STAMP test packets that can be processed is on the order of one thousand packets per second.
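As a minimal sketch of how the statistics in one table row can be derived from the 10 evaluations, the following assumes the sample standard deviation (n - 1 in the denominator) and a CI95 half-width based on Student's t distribution; the text above does not state which estimators were used, so both are assumptions. The function name `summarize` and the input values in the usage note are hypothetical.

```python
import math
import statistics

# Two-sided 95% Student's t quantile for 10 samples (9 degrees of freedom).
T_975_DOF9 = 2.262


def summarize(samples_kpps):
    """Return Avg, SD, CV and CI95/Avg for 10 PDR@0.5% evaluations (kpps)."""
    n = len(samples_kpps)
    if n != 10:
        raise ValueError("the t quantile above is only valid for 10 samples")
    avg = statistics.mean(samples_kpps)
    sd = statistics.stdev(samples_kpps)      # sample SD (assumption)
    ci95 = T_975_DOF9 * sd / math.sqrt(n)    # CI half-width (assumption)
    return {"avg": avg, "sd": sd, "cv": sd / avg, "ci95_over_avg": ci95 / avg}
```

For instance, `summarize([1.0] * 10)` yields zero for SD, CV and CI95/Avg, which mirrors the case noted above where all evaluations return the same value because of the limited resolution of the tool.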
(i.e., NTP or PTPv2). Scale and Multiplier provide an error estimate. The SSID (STAMP Session Identifier) field contains the SSID of the STAMP Session to which the test packet belongs. It associates the STAMP Session-Sender test packet with the corresponding STAMP Session. The remaining 28 bytes (224 bits) are set to zero (Must-Be-Zero or MBZ field). The content of a STAMP Session-Reflector test packet is larger than the content of a STAMP Session-Sender test packet. The MBZ field makes the size of the Session-Sender test packet equal to the size of the Session-Reflector test packet.

Following the SRv6 path under measurement, the test packet is delivered to the Session-Reflector. The Session-Reflector receives the STAMP Session-Sender test packet and verifies it. If the packet is valid and the SSID corresponds to an active STAMP Session, the Session-Reflector creates and sends a STAMP Session-Reflector test packet to the STAMP UDP port of the Session-Sender. The STAMP Session-Reflector test packet carries the payload depicted in Fig. 7b. Bytes 24-33 contain an exact copy of the STAMP Session-Sender test packet. The Sequence Number field contains a 32-bit unsigned integer. The STAMP Session-Reflector can work in two modes: i) stateless mode; ii) stateful mode. In the stateless mode, the STAMP Session-Reflector reuses the Sequence Number value contained in the STAMP Session-Sender test packet. In the stateful mode, the STAMP Session-Reflector maintains a counter of the transmitted packets. This counter starts at zero, is incremented for each packet transmitted in the context of the STAMP Session, and is used as the Sequence Number field in the packets transmitted by the Session-Reflector. The Receive Timestamp field contains the time when the Session-Reflector received the Session-Sender test packet, denoted as $T_2$ (see Fig. 8). The Timestamp field contains the time when the Session-Reflector starts transmitting the Session-Reflector test packet, denoted as $T_3$. As with the Timestamp field of the STAMP Session-Sender test packet, the Timestamp and Receive Timestamp can be encoded using either NTP or PTPv2. The Error Estimate field indicates the synchronization bit, timestamp format and error estimate of the Timestamp and Receive Timestamp, and has the same structure as the Error Estimate field in the Session-Sender test packet. The 16-bit SSID field contains the STAMP Session Identifier and allows the STAMP Session-Sender to associate the received STAMP Session-Reflector packets with the correct STAMP Session. The Session-Sender TTL is a copy of the Hop Limit field of the IPv6 header contained in the received STAMP Session-Sender test packet. The MBZ fields are used to achieve alignment on a four-byte boundary.

The Session-Reflector test packet is transmitted on the same path as the data traffic flow under measurement, in order to measure the delay experienced by that flow. This can be the same path as the Session-Sender test packet or a different path. The draft [32] defines a TLV called Return Path TLV that allows the Session-Sender to request the Session-Reflector to transmit the Session-Reflector test packet on a specific path. However, we do not use the Return Path TLV in our solution. We leverage the SDN controller to set up the return path as part of the STAMP Session configuration. Before sending the STAMP Session-Reflector test packet, the Session-Reflector adds an SRH to the IPv6 header to enforce the return path. The SRH contains a SID List that encodes the path under measurement from the STAMP Session-Reflector to the Session-Sender. Following the path specified in the SID List, the STAMP Session-Reflector test packet is delivered to the Session-Sender. The Session-Sender verifies the packet and validates the SSID. If the SSID corresponds to an active STAMP Session, it generates a new timestamp $T_4$, which is the
The average delay is updated using the Welford online algorithm whenever new $d_{d,\mathrm{new}}$ and $d_{r,\mathrm{new}}$ values are available:

$$d_{d,\mathrm{avg}} = d_{d,\mathrm{avg}} + \frac{d_{d,\mathrm{new}} - d_{d,\mathrm{avg}}}{N} \tag{3}$$

$$d_{r,\mathrm{avg}} = d_{r,\mathrm{avg}} + \frac{d_{r,\mathrm{new}} - d_{r,\mathrm{avg}}}{N} \tag{4}$$

where $d_{d,\mathrm{avg}}$ is the average delay of the direct path, $d_{r,\mathrm{avg}}$ is the average delay of the return path, $N$ is the number of collected delays, and $d_{d,\mathrm{new}}$ and $d_{r,\mathrm{new}}$ are the new delay values of the direct path and return path, respectively.
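The update rule of Equations (3) and (4) can be sketched in a few lines. This is an illustrative sketch rather than the code of the implementations discussed in the paper: the class name `RunningAverage` is hypothetical, and it assumes that N counts the collected delays including the new one.

```python
class RunningAverage:
    """Online mean of delay samples, updated as in Eqs. (3) and (4)."""

    def __init__(self):
        self.n = 0      # N: number of collected delays so far
        self.avg = 0.0  # current average delay

    def update(self, new_delay):
        # N is incremented first, so it includes the new sample (assumption).
        self.n += 1
        # avg = avg + (new - avg) / N
        self.avg += (new_delay - self.avg) / self.n
        return self.avg


# One instance per direction: direct path and return path.
direct, ret = RunningAverage(), RunningAverage()
for d_direct, d_return in [(10.0, 20.0), (12.0, 22.0), (14.0, 24.0)]:
    direct.update(d_direct)
    ret.update(d_return)
```

The advantage of this form is that only the current average and the counter N need to be stored per STAMP Session; the individual delay samples do not have to be kept in order to update the mean.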

TABLE I: Scapy Basic Reflector.

TABLE IV: Scapy Basic Collector.

TABLE V: Scapy Opt Collector.