Enabling P4 Network Telemetry in Edge Micro Data Centers With Kubernetes Orchestration

Integrating computation resources with networking technologies is a hot research topic targeting the optimization of container deployment on a set of host machines interconnected by a network infrastructure. In particular, next-generation edge nodes will offer significant advantages by leveraging integrated computation resources and network awareness, enabling configurable, granular and monitorable quality of service for different micro-services, applications and tenants, especially in terms of bounded end-to-end latency. In this regard, SDN is a key technology enabling network telemetry and traffic switching with the granularity of the single traffic flow. However, currently available solutions are based on legacy SDN techniques that cannot match tunneled traffic, and thus require a tricky integration inside the hosts where containers are deployed. This work considers Kubernetes clusters deployed on next-generation edge micro data center platforms and proposes an innovative SDN solution exploiting the P4 technology to gain visibility inside tunneled traffic exchanged among pods. This way, the integration is achieved at the control plane level through the communication between Kubernetes and the SDN controller. The proposed solution is experimentally validated, including a comprehensive framework enabling effective traffic switching and in-band telemetry at pod level. The major contributions of the paper consist of the design and the development of: (i) the networking applications at the SDN control plane level; (ii) the P4 switch pipeline at the data plane level; (iii) the monitoring system used to collect, aggregate and elaborate the telemetry data.

containers running on the same host machine. Only high-level indications are provided by the Kubernetes community about the way in which pod networking should be implemented, i.e., regarding pod-to-pod communication the only specified requirement is that each pod can communicate with all other pods on any other node without Network Address Translation (NAT) [7]. However, the actual networking implementation is delegated to third-party plugins, i.e., the Container Network Interface (CNI) plugins, which may adopt different configurations of the pod network interfaces for providing pod-to-pod connectivity (e.g., adopting different tunneling technologies). Moreover, CNI plugins typically do not account for network constraints in terms, for example, of required bandwidth or bounded latency. For this reason, deploying Kubernetes in edge computing environments requires specifically designed and comprehensive solutions able to provide flexible network control and traffic telemetry. This is especially true when Quality of Service (QoS)-critical applications have to be supported and the cluster machines are distributed at different locations, e.g., over a metropolitan area network [8], [9].
The introduction of Software Defined Networking (SDN) provided the opportunity to enable effective control of networking resources. Nowadays P4 (i.e., Programming Protocol-Independent Packet Processors) is the SDN solution enabling the most advanced programmability of the forwarding plane [10], also enabling innovative in-network function offloading [11]. P4 provides many additional features with respect to previous SDN solutions (e.g., OpenFlow [12]): (1) it defines a standard language to specify custom data plane pipelines; (2) it allows custom packet headers to be defined and manipulated through the definition of dedicated packet parsers; (3) it supports the utilization of registers, enabling the development of stateful functionalities; (4) it provides improved network visibility, e.g., using In-band Network Telemetry (INT) [13]. Moreover, P4 is supported by several kinds of targets, e.g., bare metal or software switches, smart-NICs, NetFPGAs. Thus, P4 enables advanced traffic telemetry in passive mode, with no need for dedicated probing packets. Three different INT techniques are proposed in the specification [14], highlighting the main technical details. In the three schemes, report packets are directly built in the data plane and sent toward a telemetry collection point, which can be used to perform the per-flow traffic analysis.
Considering all the aforementioned features, we believe that P4 offers a flexible framework for enabling effective integration of computational resources, orchestration tools and networking resources. Recently, OpenFlow has been considered for such integration [15], [16], [17]. However, since OpenFlow cannot have visibility inside the tunnels created by the CNI plugins, the proposed solutions require a deep integration with the plugins, deploying dedicated software applications in all host machines composing the Kubernetes cluster. Conversely, by exploiting P4 it is possible to define a dedicated packet parser, gaining visibility on pod-to-pod traffic flows inside the tunnels. This enables the deployment of fine-granularity traffic engineering and telemetry techniques without direct interaction with the data plane of the Kubernetes pod network. Thus, and this is our proposal, the orchestration can be implemented through interactions at the control plane level exploiting proper interfaces between Kubernetes and the SDN controller. To enable this, we propose a specific P4 pipeline to match and monitor each pod-to-pod traffic flow encapsulated in a tunnel by a CNI plugin. The design of the pipeline is generic, while the parser is specific to the Flannel CNI plugin; however, it could be easily extended to support other plugins.
To achieve the aforementioned targets, P4 devices should be effectively controlled by the SDN controller. Regarding this point, the P4 consortium defined an interface called P4Runtime [18], suitable to configure and control P4 devices (e.g., to install pipelines and flow rules). However, in the real world, P4-compatible devices do not necessarily include a P4Runtime agent. Indeed, several P4 use cases do not rely on the utilization of an SDN controller. Therefore, another important contribution of this work is the implementation of an application, at the SDN controller side, to configure and control P4 devices, including the specific parsers and pipelines that we have designed for enabling visibility inside tunnels established by the CNI plugins. Specifically, we have considered the SDN controller developed by the Open Networking Foundation (ONF), i.e., the Open Network Operating System (ONOS) [19], which is characterized by an active development community.
Finally, the work goes beyond the integration of Kubernetes with the network connecting the worker nodes, implementing also a closed-loop control where telemetry data is used to detect possible Service Level Agreement (SLA) degradation that could be recovered providing a feedback to the SDN controller or directly to the applications generating the traffic. Specifically, this last contribution consists of a P4 pipeline supporting in-band telemetry, a Telemetry Collector used to aggregate generated telemetry data and a dedicated Telemetry and Monitoring Platform to elaborate collected telemetry data.
A preliminary version of this work has been presented as a practical demonstration in [20]. In addition, this work includes: (i) the detailed research background; (ii) the overall architecture of Edge Micro Data Center nodes as designed within the BRAINE project; (iii) the extended version of the ONOS NetApps to support matching of traffic generated inside/outside Kubernetes pods; (iv) a wide set of experimental results.

II. BACKGROUND AND RELATED WORK
The joint optimization of IT and networking resources is a well-established research topic, as demonstrated by the wide survey reported in [5], where both centralized and distributed solutions are summarized. The former approach exploits a central element collecting resource information (e.g., interacting with the cloud orchestrator and the network controller) and typically provides better solutions at the cost of a coordination overhead. Conversely, the distributed approach is more flexible and adaptive to dynamic environments but typically provides less effective solutions. For instance, the work in [21] proposes a distributed approach to partition a pool of computational resources among multiple applications using a dynamic agreement. On the other hand, centralized approaches have recently gained attention because they can more easily leverage Machine Learning (ML) techniques. In [22] an ML platform is developed for effective management of both computational and networking resources in a 5G mobile environment, where data are collected from both the Kubernetes orchestrator and the SDN controller.
The actual networking implementation in Kubernetes clusters is delegated to third-party CNI plugins. The most commonly considered ones are Flannel, Calico, and Kube-router [23]. As explained in [24], Flannel provides a layer-3 IPv4 network among multiple nodes within the cluster, i.e., it does not control how pods are networked to the host machine. Several back-end mechanisms are supported (i.e., VXLAN, UDP, host-gw and additional experimental mechanisms) but VXLAN is recommended (see Sec. III for details). With respect to Flannel, Calico [25] also provides security and policy enforcement between pods, supporting a wide range of deployment options. Kube-router [26] is a specific solution for Kubernetes pod networking with the aim of providing operational simplicity and high performance. Recent research has focused on comparing the performance of CNI plugins in different scenarios. In [27] the performance of the most popular plugins is compared in terms of latency and average TCP throughput. The results show that Flannel and Kube-router outperform the other plugins in terms of latency, while for all the plugins the average TCP throughput is close to bare-metal capacity.
Regarding network programmability, the work in [28] reports a comparison using OpenFlow and P4 for the implementation of equivalent functionalities (e.g., packet header manipulation), showing similar results in terms of throughput. However, the major potential of P4 is its suitability for a variety of use cases. In [29], a new pipeline is deployed for providing stateful traffic engineering and cyber-security on an edge node designed for a multi-layer IP over optical network. Moreover, augmented firewalling capabilities are envisioned for mitigating Distributed Denial-of-Service (DDoS) cyber attacks. Additional P4 use cases for multi-layer networks are reported in [30], including the telemetry of end-to-end optical performance indicators exchanged between packet-optical nodes and P4-defined neural networks targeting online cyber-security. In [31] programmable switches are leveraged for deploying a multi-purpose ML-based security application. It collects the packet length/inter-packet timing frequency distributions, classifying the traffic flows directly on the switches. P4 can also be used for deploying an open source framework that combines the flexibility of software-based traffic generation with the accuracy of hardware packet time-stamping, as presented in [32]. In [33], P4 is leveraged for providing Bit Index Explicit Replication (BIER), proposed by the Internet Engineering Task Force (IETF) for efficient transport of Internet Protocol (IP) multicast traffic [34]. The work in [35] presents a use case of P4 registers to store stateful information, achieving autonomous forwarding and low-latency path discovery. Finally, [36] proposes solutions for providing network slicing in different networking environments.
In addition to the previous use cases, in-band network telemetry is a key feature enabled by P4, achieved thanks to the improved visibility on network events provided by the P4 language [13]. The specification [14] proposes three different techniques, i.e., INT-XD, INT-MX, and INT-MD. In the INT-XD technique (also known as Postcard-based Telemetry (PBT)), the node directly exports, for each monitored packet, metadata from the data plane to the monitoring system, based on the instructions configured in local flow tables. Collected metadata is inserted in a new packet called report, which is forwarded to the monitoring system. No modification is applied to the traffic packets. In INT-MX (INT-MD), instructions (and metadata) are written into traffic packets, adding a specific header (i.e., the INT header). The report packets are generated at each traversed node in the INT-MX technique, while in the INT-MD technique metadata are accumulated in the INT header while the packet is travelling in the network and report packets are generated only by sink nodes. The work in [37] provides a survey of several INT implementations using different target devices and INT header encapsulations. Among them, it is worth mentioning [38], the first implementation including the dynamic control of the monitored network using the ONOS controller. Specifically, that work proposed extensions to ONOS for supporting INT that have later been included in the official ONOS distribution. More recently, the work in [39], focused on 5G networks, proposes the extension of the INT-MD technique up to the user equipment to enable the evaluation of fully end-to-end (e2e) latency. In general, INT telemetry features the generation of a report packet for each data plane packet. However, this approach may needlessly overload the telemetry system. Thus, the works in [40] and [41] propose two different approaches for reducing the amount of generated telemetry data.
In [40], an event detection framework is used to generate report packets only when certain events are detected in the network. In [41] a flexible sampling mechanism is implemented so that only a configurable fraction of data plane packets actually generates a report packet.
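As a purely illustrative sketch of these two report-reduction ideas, the following Python fragment combines a 1-in-N sampling rule with a latency-threshold event rule. It is not the implementation of [40] or [41]: the class name `ReportFilter`, its parameters and the default values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ReportFilter:
    """Decides whether a monitored packet should trigger a report packet."""
    sample_every: int = 10               # hypothetical: 1 report every N packets
    latency_threshold_us: float = 500.0  # hypothetical event threshold
    _packet_count: Dict[int, int] = field(default_factory=dict, repr=False)

    def should_report(self, flow_id: int, latency_us: float) -> bool:
        # Per-flow packet counter used by the sampling rule.
        count = self._packet_count.get(flow_id, 0) + 1
        self._packet_count[flow_id] = count
        sampled = count % self.sample_every == 0       # sampling-style rule
        event = latency_us > self.latency_threshold_us  # event-style rule
        return sampled or event
```

With `sample_every=4`, only every fourth packet of a flow produces a report under normal conditions, while any packet exceeding the latency threshold is always reported.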
VOLUME 11, 2023
The use of an SDN controller over a P4-based network (e.g., exploiting the P4Runtime interface) makes it possible to take full advantage of P4 capabilities. Several solutions are currently available to deploy a P4Runtime interface in a physical P4 device. The Bmv2 software switch [42] is a tool for emulating P4-based devices that also implements the P4Runtime interface. It is typically used for developing, testing and debugging the P4 data plane and the related network applications to be used at the SDN controller. However, many commercially available P4 devices do not provide a P4Runtime interface. To deal with this issue, the Stratum [43], [44] and PINS (P4 Integrated Network Stack) [45] open-source projects are currently on-going to provide a P4Runtime interface deployable on P4-based devices equipped with a Network Operating System (NOS). In particular, Stratum can run on top of a Debian-based NOS (and on top of the Bmv2 software switch), whereas PINS can run on top of the SONiC (i.e., Software for Open Networking in the Cloud [46]) NOS. Few works in the literature use P4Runtime to control P4-based devices; most of them leverage Bmv2 and the ONOS SDN controller. The work in [47] demonstrates the compatibility of P4Runtime and OpenFlow devices operating in the same network under the control of a single SDN controller, guaranteeing performance isolation among multiple network slices. The work in [48] implements a benchmarking tool for P4Runtime-based controllers and applies the tool to evaluate the performance of the ONOS controller running in both OpenFlow and P4Runtime mode. The work in [49] proposes the extension of P4Runtime to support multi-tenant service on a switch and evaluates the solution in an international experimental P4 network. Finally, within the ONF community, the SD-Fabric [50] project is a full stack application, i.e., involving both data and control planes, that implements a P4 pipeline for Industry 4.0.
The proposed P4 pipeline supports basic L2/L3 forwarding capabilities, 4G/5G mobile user plane, and in-band telemetry, whereas at the control side an ONOS application is provided for managing and controlling the P4 devices using P4Runtime.
Regarding the integration of Kubernetes with advanced networking, to the best of our knowledge there are no solutions exploiting P4. Some research studies propose OpenFlow-based SDN solutions integrated with Kubernetes, providing basic networking features. In [15] a framework is proposed to create network slices on-demand among containers, connected by an OpenFlow-based software switch (i.e., Open vSwitch, OVS) in the host machine, which is programmed by the SDN controller. The work in [16] proposes to simultaneously deploy two CNI plugins, i.e., Calico to maintain the connectivity between pods and the Kubernetes master, and Multus, which defines additional interfaces for attaching each pod to the network avoiding tunneling techniques, i.e., it maps each pod on a specific IP/MAC address pair so that the SDN controller can locate each pod. In [17] a CNI plugin is developed to expose a virtual network to pods and configure network tunnels among them using the SDN controller. Finally, the work in [51] proposes a tool (i.e., Host-INT) enabling end-to-end monitoring of traffic flows within a Kubernetes cluster. Host-INT leverages the extended Berkeley Packet Filter (eBPF) [52] to extend the Linux network stack of host machines, introducing support for the INT header, which is then used to collect information related to the traffic flows (e.g., packet loss and latency). The work in [53] proposes the implementation of a load balancer for P4-based Network Interface Controllers (NICs) towards services deployed with Kubernetes; however, it does not consider the fact that pod-to-pod traffic may be encapsulated in a tunnel.
The aforementioned works confirm that end-to-end monitoring is required in Kubernetes clusters, especially if host machines are deployed in different locations in a fog environment. However, with OpenFlow it is impossible to match on packet fields encapsulated within a tunnel, thus all the proposed solutions require a deep integration with the Kubernetes cluster at the data plane level (e.g., installation of dedicated software in the host machines). Conversely, our proposal considers the utilization of P4, enabling the matching of pod-to-pod traffic throughout the traversed network without modifying the Kubernetes deployment.

III. BRAINE ARCHITECTURE
This work has been conducted in the context of the Big data pRocessing and Artificial Intelligence at the Network Edge (BRAINE) project. Thus, this section provides an overview of the BRAINE architecture to better contextualize the proposed integration between Kubernetes and the P4-based programmable network.
The BRAINE project targets the development of an energy-efficient Edge Micro Data Center (EMDC) exploiting a modular architecture (e.g., including heterogeneous hardware such as Central Processing Units (CPUs), Graphics Processing Units (GPUs), and Field-Programmable Gate Arrays (FPGAs) to offer computing, acceleration, storage, and 5G Virtual Network Functions (VNFs)) at the network edge. The project involves many industrial partners and works in several fields, including design and fabrication of hardware boards and development of the software framework to be deployed on top of the EMDC for cluster resource orchestration.
Within the scope of this paper, the BRAINE EMDC node includes a set of CPU boards, providing the cluster computational resources (with one of these boards dedicated to the hosting of orchestration tools), and dedicated boards exploiting the Spectrum chipset, made by Mellanox/Nvidia, implementing the SDN programmable P4-based switches to provide the connectivity among CPU boards. Besides a representation of the EMDC physical infrastructure, Fig. 1 reports the main components of the software framework as currently designed by BRAINE, where Kubernetes and ONOS have been respectively selected to orchestrate the computational resources and to control the programmable switches aiming to connect pods deployed on different CPU boards.
The following sections detail the BRAINE components that have been integrated in this work to implement a closed-loop automation where pods are deployed on different CPU boards of the same EMDC node or even on different EMDC nodes interconnected by SDN-enabled network devices.

A. SERVICE MANAGER USER INTERFACE
The Service Manager User Interface (SMUI) is the north-bound interface of the EMDC node and provides a set of features to the users such as the unified view of services execution state through semantic knowledge graphs.
The SMUI is a containerized web-based application that can be deployed in the cloud or run locally on the EMDC [54]. It is built upon Kubernetes and Docker concepts such as images, containers, pods, worker nodes, services, workflows and their metadata. For instance, SMUI allows the user to identify the best node for running, training, or testing an Artificial Intelligence (AI) service, with the possibility to choose the desired execution architecture (e.g., CPU or GPU). Moreover, it enables the evaluation of resource availability across the system. With the collected running metadata, it is possible to check the presence of failures, the data accesses and further execution metadata.
More in detail, in the BRAINE data model, pods, services, and workflows are defined in a declarative way through Kubernetes definition language, using manifest files in YAML format, with the addition of the workflow data-type. The SMUI front-end enables service and workflow measurement/monitoring by accessing global and individual views on the multiple agents involved in the execution, while promptly taking corrective actions in case of failures such as redeploying the service to another node.

B. KUBERNETES SCHEDULER
The workloads defined using the SMUI are submitted to the EMDC by means of Kubernetes pods. A pod is the smallest execution unit in Kubernetes. A pod can contain one or more containers that run on a target worker node; all of the containers in a pod share the same IP address. Each service is a composition of pods that can claim different life-cycles as well as resources. For example, a service is provided by a number of pods identified by a cluster-wide Domain Name System (DNS) name; while the actual pods that compose the service may change during the execution, the clients of the service will still refer to the same endpoint.
Kubernetes is in charge of managing the pods. Upon admission of a pod, it runs mutation hooks, providing opportunities to validate, complete, and/or manipulate the pod according to the cluster's policies, e.g., replacing all container images with their latest version counterpart, or checking if they are hosted on a trusted image repository. Afterward, Kubernetes sends the pods into the scheduling pipeline, which consists of sorting, filtering, and scheduling. All these steps are plugin-based and can be extended or customized.
At the sorting stage, Kubernetes provides options to prioritize the pods. At the filtering step, Kubernetes checks whether the pod requirements can be fulfilled by the available worker nodes, and if so, it lists the matching nodes as feasible nodes. During the scheduling phase, Kubernetes runs the pod through a set of plugins, asking them to score the feasible nodes. The node with the highest score is nominated for binding, that is, the process of shipping the pod to the selected worker node and asking it to accept and run the workload. This is organized via communication between Kubernetes and an agent (i.e., Kubelet) running on every worker node that updates and reports the pod status after each event (e.g., pod admission, termination, resource change). The status updates are received, aggregated, and collected by Kubernetes and maintained in a distributed key/value database called etcd. This information is used during the filtering and scheduling steps, and by any other plugin requiring information about deployed pods.
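The sorting, filtering, scoring and binding steps described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the Kubernetes scheduler code: the `Node` and `Pod` classes, the resource fields and the `score_least_allocated` scoring function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    name: str
    free_cpu: float
    free_mem_gb: float


@dataclass
class Pod:
    name: str
    cpu: float
    mem_gb: float
    priority: int = 0


def filter_nodes(pod: Pod, nodes: List[Node]) -> List[Node]:
    """Filtering: keep only the nodes able to fulfil the pod requirements."""
    return [n for n in nodes if n.free_cpu >= pod.cpu and n.free_mem_gb >= pod.mem_gb]


def score_least_allocated(pod: Pod, node: Node) -> float:
    """Hypothetical scoring plugin: favour the node with most resources left."""
    return (node.free_cpu - pod.cpu) + (node.free_mem_gb - pod.mem_gb)


def schedule(pods: List[Pod], nodes: List[Node],
             scorers: List[Callable[[Pod, Node], float]]) -> Dict[str, Optional[str]]:
    bindings: Dict[str, Optional[str]] = {}
    # Sorting: higher-priority pods are handled first.
    for pod in sorted(pods, key=lambda p: p.priority, reverse=True):
        feasible = filter_nodes(pod, nodes)
        if not feasible:
            bindings[pod.name] = None  # no feasible node, pod stays pending
            continue
        # Scoring: sum the scores returned by all plugins, pick the best node.
        best = max(feasible, key=lambda n: sum(s(pod, n) for s in scorers))
        # Binding: reserve the resources on the selected node.
        best.free_cpu -= pod.cpu
        best.free_mem_gb -= pod.mem_gb
        bindings[pod.name] = best.name
    return bindings
```

Keeping the scorers as a list of plain functions mirrors the plugin-based nature of the pipeline: additional criteria (e.g., network-aware ones) could be added without touching the loop.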

C. FLANNEL-BASED KUBERNETES NETWORKING
In BRAINE, Kubernetes works with the Flannel CNI plugin running in VXLAN mode. Within the cluster, Flannel essentially solves two problems: duplication of pod IP addresses and inter-node pod networking (i.e., inter-board pod networking inside a single EMDC). The VXLAN method is the most used, due to the low introduced latency (i.e., traffic encapsulation and forwarding operations are performed in the Linux kernel) and because, exploiting tunneling based on IP reachability, it can be used to assure communication also among worker nodes interconnected by a routed IP network (e.g., among two separate EMDC nodes).
With reference to Fig. 2, Flannel creates a VXLAN network card named flannel.1 on each node that acts as VXLAN Tunnel End Point (VTEP). This interface is attached to the bridge cni0, which works as IP gateway for all the pods in the node. The daemon flanneld configures UDP port 8472 as the default for VXLAN on the flannel.1 interface. When a new node joins the cluster, flanneld exploits the information stored in etcd to: i) create a routing entry in the local node to route the traffic addressed to pods running in the newly detected node toward the flannel.1 interface; ii) add the IP of the new node to the ARP cache, mapping it to the MAC address of the flannel.1 interface of the detected node.
Thus, for instance, once an IP packet is generated in pod A1 (10.244.1.2), located at node A, and is destined to pod B2 (10.244.4.3), located at node B, the packet is sent to cni0 in node A (i.e., 10.244.1.1) through the routing table of pod A1. Then at cni0 the packet is sent to the flannel.1 interface. As a VTEP device, flannel.1 receives the packet. According to the VTEP configuration, the flanneld daemon knows that the destination pod 10.244.4.3 belongs to node B and knows the IP address of the flannel.1 interface on node B from etcd; moreover, through the forwarding table in node A, it knows the MAC address of the VTEP of node B. Thus, it performs VXLAN packet encapsulation according to the configured parameters (e.g., local IP, port) and sends the packet through the physical interface ens160. The resulting protocol stack for packets traveling from pod A1 to pod B2 when exiting the interface ens160 is illustrated in Fig. 3.
At node B, the VXLAN packet reaches the interface ens160 via port 8472 and is forwarded to the VTEP device flannel.1 for decapsulation. The decapsulated IP packet matches the routing table entry (10.244.4.0) in node B, and is therefore forwarded to cni0 that, in turn, forwards it to pod B2.
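To make the encapsulation above concrete, the following Python sketch builds a VXLAN-encapsulated frame and then walks the outer headers to recover the inner pod addresses, which is conceptually the job of a tunnel-aware packet parser. It is only a byte-level illustration under stated assumptions: the node addresses 192.168.0.10 and 192.168.0.20, the MAC addresses, the VNI value and the UDP source port are hypothetical placeholders, and checksums are left at zero.

```python
import socket
import struct

VXLAN_PORT = 8472          # Flannel's default VXLAN UDP port, as stated above
ETH_LEN, UDP_LEN, VXLAN_LEN = 14, 8, 8
IPPROTO_UDP = 17


def eth_header(dst_mac: bytes, src_mac: bytes, ethertype: int = 0x0800) -> bytes:
    return struct.pack("!6s6sH", dst_mac, src_mac, ethertype)


def ipv4_header(src: str, dst: str, proto: int, payload_len: int) -> bytes:
    # 20-byte header without options; checksum left at 0 for illustration only.
    return struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20 + payload_len, 0, 0, 64,
                       proto, 0, socket.inet_aton(src), socket.inet_aton(dst))


def vxlan_encapsulate(inner_frame: bytes, vni: int, outer_src: str, outer_dst: str) -> bytes:
    vxlan = struct.pack("!II", 0x08 << 24, vni << 8)   # I flag set, 24-bit VNI
    udp_len = UDP_LEN + len(vxlan) + len(inner_frame)
    udp = struct.pack("!HHHH", 49152, VXLAN_PORT, udp_len, 0)  # placeholder source port
    ip = ipv4_header(outer_src, outer_dst, IPPROTO_UDP, udp_len)
    eth = eth_header(b"\x02" * 6, b"\x04" * 6)                 # placeholder MACs
    return eth + ip + udp + vxlan + inner_frame


def parse_inner_flow(frame: bytes):
    """Return (vni, inner_src_ip, inner_dst_ip), or None if not Flannel VXLAN."""
    off = ETH_LEN
    ihl = (frame[off] & 0x0F) * 4             # outer IPv4 header length
    if frame[off + 9] != IPPROTO_UDP:
        return None
    off += ihl
    (dst_port,) = struct.unpack_from("!H", frame, off + 2)
    if dst_port != VXLAN_PORT:
        return None
    off += UDP_LEN
    vni = struct.unpack_from("!I", frame, off + 4)[0] >> 8
    off += VXLAN_LEN + ETH_LEN                # skip VXLAN and inner Ethernet
    src, dst = struct.unpack_from("!4s4s", frame, off + 12)
    return vni, socket.inet_ntoa(src), socket.inet_ntoa(dst)


if __name__ == "__main__":
    inner = eth_header(b"\x0a" * 6, b"\x0b" * 6) + ipv4_header("10.244.1.2", "10.244.4.3", 6, 0)
    packet = vxlan_encapsulate(inner, 1, "192.168.0.10", "192.168.0.20")
    print(parse_inner_flow(packet))
```

The parser skips a fixed sequence of outer headers before reading the inner IPv4 addresses; a switch that stops at the outer UDP header would only see node-to-node traffic on port 8472 and could not distinguish individual pod-to-pod flows.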

D. SDN CONTROLLER
The BRAINE SDN network controller is based on ONOS [19]. Fig. 4 represents the components specifically developed for BRAINE and utilized in this work to implement traffic forwarding and in-band telemetry, i.e., the BRAINE app and the BRAINE P4 app.
ONOS implements the concept of intent-based networking [55], where intents generalize the concept of connectivity ensuring that target policies are met by enabling automatic reconfiguration as a consequence of network changes (i.e., reacting to network events following a Finite State Machine (FSM)-based implementation). Since the intent is expected to be the base connectivity request submitted to the SDN controller, we developed our SDN applications on top of the ONOS intent service. This way all the implemented solutions are automatically managed by the intent FSM. For instance, in case of network failures all affected traffic flows are automatically moved to an alternate path.

1) THE BRAINE APP
This application implements a set of functionalities exposed through REST APIs, enabling the interaction with Kubernetes, and the SMUI. Also, the same functionalities can be manually accessed through a set of CLI commands. Moreover, the application utilizes the ONOS core services to enable the deployment of point-to-point connections between pods running in different worker nodes of the cluster. The two main functionalities supported at the data plane by the BRAINE app are: i) connection management (i.e., add/delete/modify), where each created connection can be specified up to the transport level (i.e., TCP/UDP ports); ii) activation of telemetry on selected active connection(s).
To support the aforementioned features, the BRAINE app is composed of several components (see left side of Fig. 4). In particular, the application includes: i) two databases where connection and link state information is stored; ii) a routing module that performs redundant routing of requested connections and interacts with the ONOS intent service; iii) an intent listener that allows the application to react in case of network events affecting established connections; iv) a logger for tracing and debugging. Moreover, the BRAINE app supports a set of accessory features to facilitate the interaction with the network and the gathering of network state information. Specifically, the features supported by the app can be grouped in four categories: connection-related commands, device-related commands, host-related commands and link-related commands.

2) THE BRAINE P4 APP
The companion BRAINE P4 application has been developed to program the specific P4 pipeline to be used in the data plane switches. This application has two main roles: i) enabling the matching of header fields encapsulated within VXLAN tunnels; ii) activating the postcard telemetry on specific traffic flows.
The first objective is achieved through the implementation of a dedicated pipeline (described in Sec. V). For the latter objective, the application exposes a REST API that is dynamically consumed by the BRAINE app when a telemetry activation request is received from the orchestrator.
The internal architecture of the BRAINE P4 application is represented on the right side of Fig. 4. It includes the pipeline loader component which loads the P4 pipeline description via the P4Runtime protocol upon the discovery of P4-based switches. Once the request to activate a new postcard telemetry on a specific traffic is received through the REST interface, the Postcard telemetry manager identifies the devices traversed by the flow and sends them the flow rules to enable the postcard via the pipeline interpreter. Since the pipeline interpreter is the only component that is aware of the pipeline structure (e.g., number of tables and supported matching fields per table) it is also used for translating into flow rules the output of the intent service created to forward traffic. The statistic discovery component collects traffic related information from the P4-based devices to be visualized in the ONOS GUI (e.g., counters associated to flow rules). Finally, the logger component facilitates tracing and debug.
Both applications then rely on the Bmv2 P4 driver included in the ONOS master distribution, which has been demonstrated to be fully functional for connecting to P4 devices and installing all the required flow rules using the P4Runtime protocol.

E. TELEMETRY COLLECTOR
The telemetry collector is the module in charge of receiving the telemetry report packets generated by the P4 switches. Typically, report packets are generated in a 1:1 ratio with respect to the traffic packets belonging to telemetry-enabled flows, where each report provides metadata information (e.g., the latency experienced in the switch) related to a specific traffic packet. However, when the traffic rate increases, it is not possible for scalability reasons to populate the telemetry databases through direct processing of the report packets. Therefore, the telemetry collector module has been introduced for sampling the reports and providing low-rate telemetry statistics to the telemetry database. This module classifies and aggregates the telemetry information at rates sustainable by the database layer. Specific rates can be configured for different traffic flows through direct interaction with the SDN controller, thus considering the specific requirements of each flow (see Fig. 1). The aggregated per-flow and per-switch data samples are then sent to the telemetry time series database for monitoring purposes. Details regarding the implementation of a P4-based telemetry collector enabling full collector functionality at wire speed can be found in our previous work [56]. Fig. 5 shows the telemetry collector operation and its internal architecture. The different monitored traffic flows (i.e., F1 and F2 in the figure) generate the related telemetry report packets. The figure also details the report packet format, showing the key fields utilized by the collector. In particular, the switch_id (Sx) field identifies the physical P4 switch x generating the report, while the flow_id (Fy) field discriminates the traffic flow y.
The former field is defined in the P4 INT specifications [14], while the latter has been proposed as protocol extension in our previous work [57], in which the SDN controller computes and assigns the flow_id univocally and provides it along with the flow rules in the telemetry activation flow entry.
In this work, the considered metadata information retrieved by the programmable P4 switch is the intra-switch packet latency (i.e., the time spent by a packet in the switch queue).
Each switch generates one report packet for each monitored flow packet; thus, the report rate is equal to the flow rate. The collector receives and processes the report packets using specific internal modules. The dissector module extracts the metadata of each report and stores it in different memory buffer arrays. The statistics extractor aggregates the metadata samples over a pre-defined window, computing the average (avg) and the maximum (max) latency experienced by packets belonging to flow y when crossing switch x. Specifically, the telemetry collector deployed in the experiments computes the average and max latency values over the last 1000 report packets on a per-flow and per-switch basis.
The result is passed to the InfluxDB client API, which computes the overall statistics timestamp and sends a REST POST message to the central InfluxDB database with the new time series sample. This way, specific latency analyses are possible at the BRAINE Telemetry and Monitoring Platform for each monitored flow at different switches.
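The aggregation step described above can be illustrated with a minimal Python sketch. It is not the collector implementation (which is P4-based, see [56]); the class name, method names and the latency unit are assumptions introduced here. Only the window of the last 1000 reports and the per-flow, per-switch avg/max computation come from the text.

```python
from collections import defaultdict, deque

WINDOW = 1000  # most recent reports kept per (switch, flow), as in the experiments


class LatencyAggregator:
    """Hypothetical helper: keeps the last `window` latency samples per (switch_id, flow_id)."""

    def __init__(self, window=WINDOW):
        # A bounded deque silently drops the oldest sample once it is full.
        self._samples = defaultdict(lambda: deque(maxlen=window))

    def add_report(self, switch_id, flow_id, latency):
        """Store the latency carried by one report packet."""
        self._samples[(switch_id, flow_id)].append(latency)

    def stats(self, switch_id, flow_id):
        """Return (avg, max) over the current window, or None if no report was seen."""
        window = self._samples.get((switch_id, flow_id))
        if not window:
            return None
        return sum(window) / len(window), max(window)
```

In such a sketch, one (avg, max) pair per flow and switch would be what is handed to the database client, so the write rate no longer depends on the traffic rate.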
Specifically, the application collects from the worker nodes a set of metrics (e.g., CPU and RAM utilization) via Node Exporter and other custom-built exporters (e.g., the Telemetry Collector acts as a network resources exporter). Collected metrics are then harvested by Prometheus, which, in turn, pushes them to InfluxDB for storage. InfluxDB is an open-source time series database management system that stores the metrics during the retention period. Moreover, it makes the collected metrics available for querying by other components of the system. InfluxDB can be queried via external tools or its own REST APIs. Grafana is used as the default client of the database to provide a visual and interactive representation of the metric time series. However, other clients in the system can interact with InfluxDB to obtain metric data. For instance, in BRAINE an important role is played by the SLA Broker, which monitors the incoming metrics and validates them against the agreed SLA terms in order to take proper corrective action in case of violations, i.e., activating the responsible actuators (e.g., the SDN controller).
In the BRAINE project, InfluxDB is used as the single-point-of-truth metric database that persists the recorded data via Kubernetes volumes and provides a service endpoint for interacting with other components of the infrastructure. For instance, as described above, the telemetry collector pushes data directly to InfluxDB via the built-in APIs.

IV. TELEMETRY WORKFLOW
This work integrates the aforementioned BRAINE components in a closed-loop telemetry workflow. Specifically, the idea is to establish connectivity between a pair of pods deployed on different worker nodes belonging to the same Kubernetes cluster, thus passing through a network composed of P4-based switches. The traffic flow exchanged between the two pods is then monitored by activating in-band telemetry. When a latency degradation is detected along the path, the SDN controller is notified to find an alternative path, e.g., avoiding the switch that is introducing excessive delay.
The steps of the implemented telemetry workflow are represented in Fig. 1.
Step 1: Upon the trigger from the SMUI, Kubernetes places a number of pods with their own requirements on different worker nodes.
Step 2: Kubernetes retrieves the network parameters of the deployed pods from etcd.
Step 3: Kubernetes submits a connectivity request to the SDN controller using the REST APIs provided by the ONOS BRAINE app including the network parameters of the deployed pods (i.e., the request typically contains MAC and IP addresses and TCP/UDP ports).
Step 4: The SDN controller configures the connectivity, sending the required flow rules to the involved P4-based devices (using the P4 Runtime protocol). In the same step, the SDN controller activates the postcard telemetry for the specific traffic flow, relying on the BRAINE P4 app that is dynamically queried by the BRAINE app (the telemetry could also be started/stopped in a subsequent step).
Step 5: Once the connectivity is configured, the traffic starts to flow into the network.
Step 6: The related postcard telemetry is generated toward the Telemetry Collector.
Step 7: When the Telemetry and Monitoring Platform detects a service level degradation (e.g., increased latency in a specific P4-based switch), it triggers a service upgrade request to the SDN controller using a dedicated method of the BRAINE app REST APIs.
Step 8: The SDN controller modifies the network connectivity parameters in accordance with the received request (e.g., modify the traffic routing avoiding the degraded switch).
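The effect of Steps 7 and 8 can be sketched as a path recomputation that excludes the degraded switch. The following Python fragment is only an illustration: the topology is an assumption inferred from the two paths used in Sec. VI (S1, S3, S4, S2 and S1, S5, S2), the function name is hypothetical, and the hop-count criterion is not necessarily the one applied by ONOS.

```python
from collections import deque

# Hypothetical adjacency list, inferred only from the paths reported in Sec. VI.
TOPOLOGY = {
    "S1": ["S3", "S5"],
    "S2": ["S4", "S5"],
    "S3": ["S1", "S4"],
    "S4": ["S3", "S2"],
    "S5": ["S1", "S2"],
}


def shortest_path(topology, src, dst, avoid=frozenset()):
    """Breadth-first search for a minimum-hop path that skips the switches in `avoid`."""
    if src in avoid or dst in avoid:
        return None
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk back through the parents to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbour in topology.get(node, []):
            if neighbour not in parent and neighbour not in avoid:
                parent[neighbour] = node
                queue.append(neighbour)
    return None  # no path left once the avoided switches are removed
```

With this assumed topology, excluding S3 leaves S1, S5, S2 as the only path between S1 and S2, which matches the rerouting observed in the experiments.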

V. P4 PIPELINE IMPLEMENTATION
The developed P4 program is written in P4_16 for the v1model target architecture [62], which includes a parser and two pipelines (ingress and egress). With the proposed approach the P4 device can be programmed by the SDN controller to forward both traffic exchanged among pods (i.e., encapsulated using VXLAN) and traffic exchanged among host machines (i.e., not encapsulated). Moreover, the controller can activate in-band telemetry (i.e., postcard telemetry, INT-XD) on selected traffic flows, which can be specified up to transport layer details (i.e., TCP/UDP ports).
The proposed architecture currently works only in conjunction with the Flannel plugin operating in VXLAN mode. However, it can be easily extended to other tunneling techniques applied by different CNI plugins, requiring only an upgrade of the parser module. For instance, the parser can be extended to support Calico operating with IP-in-IP overlay networking by adding a specific check during the parsing of the IP header, i.e., recognizing the IP protocol number 0x04. Thus, a parser supporting multiple tunneling techniques could be deployed on the same network infrastructure.
Each pipeline is composed of a number of tables operating with a match/action policy. Each table supports a specific set of keys and actions. In each table, a ternary match policy is used, where the selected mask makes it possible to ignore a key (i.e., 0x0000) or to apply an exact match (i.e., 0xffff). All keys are defined using custom metadata (i.e., local_metadata.*) that are initialized by loading the proper packet header fields during the parsing procedure. This way, depending on the detected tunneling technique, different packet fields can be copied into the metadata, enabling the support of multiple CNI plugins. For traffic not exploiting a tunnel, e.g., traffic among host machines or pod traffic generated in a cluster adopting flat networking (e.g., using the Calico default behaviour), metadata are filled considering the outermost packet header.
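The role of the mask in a ternary match can be made concrete with a few lines of Python. This is a sketch of the general ternary matching rule, not of the Bmv2 implementation; the function names and the dictionary-based rule layout are assumptions made for illustration.

```python
def ternary_match(key, value, mask):
    """True when `key` equals `value` on every bit that is set in `mask`."""
    return (key & mask) == (value & mask)


def rule_matches(packet_meta, rule):
    """`rule` maps a metadata field name to a (value, mask) pair; all fields must match."""
    return all(
        ternary_match(packet_meta.get(field, 0), value, mask)
        for field, (value, mask) in rule.items()
    )
```

For a 16-bit field such as a TCP/UDP port, a mask of 0x0000 makes the comparison trivially true (the key is ignored), while 0xffff compares every bit (exact match).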

A. P4-BASED MATCHING OF POD-TO-POD TRAFFIC
The parser, detailed in Fig. 6(a), is the first module of the ingress pipeline, as shown in Fig. 6(b). While the packet passes through the parser stages, the metadata fields are gradually filled. The first stage of the parser writes the ingress port index into the specific metadata field. Then, the Parse Packet IO stage is executed only for packets received from the CPU port (i.e., P4 Runtime packet_out messages received from the controller) to retrieve the packet_out header. The Parse Eth stage extracts the Ethernet header and fills the corresponding metadata fields with the values of the MAC source, MAC destination and Ethernet type fields. Then, in case of IP packets, the Parse IPv4 stage parses the IPv4 header and fills the corresponding metadata fields with the values of the IP source, IP destination and IP protocol fields. Subsequently, the packet is sent to one of the Parse TCP/UDP stages, where the metadata fields local_metadata.l4_src_port and local_metadata.l4_dst_port are filled.
If the UDP destination port is 8472, the packet belongs to a pod-to-pod traffic flow encapsulated within a VXLAN tunnel by Flannel (see Fig. 3). In this case, the Parse VXLAN stage is executed to parse the VXLAN header; subsequently, the inner IP and TCP/UDP headers are parsed by the Parse Internal stages. During these stages, the aforementioned local_metadata.* fields are overwritten with the corresponding fields enclosed in the internal headers. This way, if the packet is encapsulated in a VXLAN tunnel, the ingress pipeline will match the internal header fields, thus enabling pod-to-pod traffic forwarding.
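The parsing logic described above can be mimicked in Python on raw bytes. The sketch below is not the P4 parser of Fig. 6(a): the function names and the dictionary used in place of local_metadata.* are assumptions, the packet_out stage is omitted, and only the standard Ethernet, IPv4, TCP/UDP and VXLAN header layouts together with the Flannel UDP port 8472 are relied upon.

```python
import socket
import struct

FLANNEL_VXLAN_UDP_PORT = 8472  # value given in the text
ETH_TYPE_IPV4 = 0x0800
PROTO_TCP, PROTO_UDP = 6, 17


def parse_headers(buf, meta):
    """Fill `meta` from the Ethernet/IPv4/L4 headers found at the start of `buf`.

    Returns the bytes following the UDP header when the packet is UDP, else None.
    """
    if len(buf) < 14:
        return None
    meta["eth_dst"], meta["eth_src"] = buf[0:6], buf[6:12]
    (meta["eth_type"],) = struct.unpack("!H", buf[12:14])
    if meta["eth_type"] != ETH_TYPE_IPV4 or len(buf) < 34:
        return None
    ihl = (buf[14] & 0x0F) * 4          # IPv4 header length in bytes
    meta["ip_proto"] = buf[14 + 9]
    meta["ip_src"] = socket.inet_ntoa(buf[26:30])
    meta["ip_dst"] = socket.inet_ntoa(buf[30:34])
    l4 = 14 + ihl
    if meta["ip_proto"] not in (PROTO_TCP, PROTO_UDP) or len(buf) < l4 + 4:
        return None
    # Source and destination ports sit in the first 4 bytes of both TCP and UDP.
    meta["l4_src_port"], meta["l4_dst_port"] = struct.unpack("!HH", buf[l4:l4 + 4])
    if meta["ip_proto"] == PROTO_UDP:
        return buf[l4 + 8:]
    return None


def parse_packet(buf):
    """Return the match metadata, taken from the inner headers for VXLAN traffic."""
    meta = {}
    payload = parse_headers(buf, meta)
    if (payload is not None
            and meta["l4_dst_port"] == FLANNEL_VXLAN_UDP_PORT
            and len(payload) >= 8):
        meta["vxlan_vni"] = int.from_bytes(payload[4:7], "big")
        # Overwrite the outer fields with the inner (pod-to-pod) ones.
        parse_headers(payload[8:], meta)
    return meta
```

For a non-encapsulated packet the second call is never made, so the metadata keep the values of the outermost headers, as described in the text.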
As illustrated in Fig. 6(b), after parsing, the packets are forwarded to the ingress pipeline and processed by table0 where the egress port is assigned based on the flow rules installed by the SDN controller. The actions supported in this table are: i) set_egress_port, ii) send_to_cpu and iii) drop. The set_egress_port action is typically applied to packets matching a forwarding flow rule and assigns the output port on which the packet will be transmitted. Action send_to_cpu is used to forward packets to the SDN controller through the control plane port (e.g., it is used for LLDP packets matching specific flow rules); finally, the default drop action is applied to packets not matching any flow rules.

B. P4-BASED POSTCARD TELEMETRY IMPLEMENTATION
The subsequent tables in both the ingress and the egress pipelines are used to implement the postcard telemetry. The Postcard_Telemetry table, see Fig. 6(b), matches on metadata fields and is intended to contain the flow rules matching each traffic flow that requires postcard telemetry. Two actions are supported: activate_postcard and nop. The action activate_postcard is executed for each matching packet (i.e., for packets belonging to traffic flows for which the SDN controller has activated the telemetry), setting a specific metadata field (i.e., postcard_meta_activate_postcard) that is later evaluated by an if condition to clone the packet using the cloneI2E extern. If a packet is not matched, the default action nop is executed, and the packet is forwarded to the egress pipeline without cloning. The cloned packet is manipulated in the egress pipeline to generate a report packet. Cloning the packet is mandatory because P4 devices cannot create packets from scratch [10].
The egress pipeline is illustrated in Fig. 6(c). All the local_metadata.* fields must be re-initialized because P4 does not allow the propagation of custom metadata from the ingress pipeline to the egress pipeline. No actions are applied to the original packet, which leaves the switch through the port assigned in table0. Instead, the cloned packet is processed by two tables: int_insert and generate_report. The former table, with a null default action (i.e., nop), applies the action init_metadata to matching packets. This action is the one that actually retrieves the information to be included in the report message, which is written in the local_metadata.postcard_* fields.
The latter table generates the in-band telemetry report message using the action do_report_encapsulation, which manipulates the cloned packet. In more detail, the header of the cloned packet is modified as follows. The Ethernet and IP source addresses are set to the local switch values, while the destination addresses are set to the telemetry collector values. The UDP source and destination ports are set to specific values so that report packets can be easily recognized at the telemetry collector. Finally, the report header is added as UDP payload, including the metadata retrieved in the previous table, i.e., switch_id, flow_id and all other metadata required by the SDN controller using the instruction_mask as defined in [14].

C. APPLICABILITY TO HARDWARE P4-BASED DEVICES
The proposed P4 pipeline is suitable for switches adopting the Protocol Independent Switch Architecture (PISA), e.g., the Tofino chipset is based on PISA. However, different PISA-based switches can be characterized by different accessory features, e.g., externs. Specifically, the key P4 capabilities needed to run our proposed pipeline are extra header processing, timestamp metadata support, and the extern used to clone packets from the ingress to egress pipeline. All of them are available on most of the currently commercialized PISA-based switches.
In addition, the maximum number of installable flow rules can be estimated considering the typical size of the memory modules in commercial switches and the flow rule structure defined in Sec. V-A. Specifically, hardware switches are typically composed of the following modules: Static Random Access Memory (SRAM), Ternary Content Addressable Memory (TCAM), Hash, Arithmetic Logical Units (ALU) and stateful ALUs. The TCAM, used to implement ternary match, is the most expensive and flexible module and is therefore the scarcest resource. The P4 pipeline implemented in our work uses ternary match only for Table0, while the other tables, applying exact match, can be mapped on the less critical SRAM. Thus, considering the fields to be matched (illustrated in Fig. 6(b)), each flow rule occupies about 300 bits. A Table0 with 1000 rules occupies 300 Kbits (i.e., about 40 KBytes). Since modern switches have a TCAM size in the order of 1 MByte, they can typically contain about 25,000 flow rules.
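The estimate above can be reproduced with a short calculation. The sketch below uses the 300 bits per rule stated in the text and, as an assumption, takes "in the order of 1 MByte" to mean exactly 10^6 bytes; the function names are introduced here for illustration.

```python
BITS_PER_RULE = 300      # per-rule footprint stated in the text
TCAM_BYTES = 1_000_000   # assumption: "in the order of 1 MByte" read as 10**6 bytes


def table_bytes(rules, bits_per_rule=BITS_PER_RULE):
    """Memory occupied by `rules` flow rules, in bytes."""
    return rules * bits_per_rule / 8


def max_rules(tcam_bytes=TCAM_BYTES, bits_per_rule=BITS_PER_RULE):
    """Number of whole rules that fit in a TCAM of `tcam_bytes` bytes."""
    return (tcam_bytes * 8) // bits_per_rule
```

Without rounding, 1000 rules take 37,500 bytes and a 10^6-byte TCAM holds 26,666 rules. Rounding the table size up to about 40 KBytes, as done in the text, gives the more conservative figure of about 25,000 rules.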

VI. EXPERIMENTS
A. EXPERIMENTAL SETUP
The experimental testbed encompasses both computing and networking resources. Computing resources are deployed on two dedicated servers, i.e., EMDC 1 and EMDC 2 in Fig. 7. Each server is a DELL PowerEdge R740 with 56 Intel Xeon Gold 6238R CPUs @ 2.20 GHz and 256 GB RAM. Three virtual machines (VMs) are deployed in EMDC 1, while two VMs are deployed in EMDC 2. One of the VMs deployed on EMDC 1 hosts the management and control software, including the Kubernetes master, the ONOS SDN controller, the telemetry collector and the telemetry and monitoring platform. The other VMs act as Kubernetes worker nodes, where each node runs a number of pods (i.e., each pod encompasses a plain Ubuntu 20.04 distribution with basic networking tools). The Telemetry and Monitoring platform includes the telemetry database, deployed in an InfluxDB container, and the SLA Broker, implemented as a set of configurable queries and threshold-based alarms through dedicated Grafana panels.
Networking resources encompass five P4-based switches, all of them emulated using Bmv2. Since the devices are emulated, the measured latency values are expected to be significantly lower with hardware devices. However, this is not relevant for our purpose: our work does not aim to improve the achievable latency, but only to measure it and to trigger a network reconfiguration when a certain threshold is crossed. Switches S1, S2, S3 and S4 are emulated on a dedicated DELL server (Intel Xeon E5-2643 v3, 6 cores, 3.40 GHz clock, 32 GB RAM) using physical Ethernet interfaces. Switch S5 is emulated by deploying a dockerized Bmv2 on a Mellanox/Nvidia SN2010 running SONiC. In particular, the Mellanox/Nvidia SN2010 is a switch based on the Spectrum chipset, on which the SONiC NOS can be installed. In turn, SONiC enables the deployment of the dockerized Bmv2 switch within the SN2010.
The traffic report generated by the network nodes is received by the Telemetry Collector, hosted by the Kubernetes master node. As depicted in Fig. 5, the report packet contains the switch_id field that identifies the switch, the flow_id field that discriminates traffic flows, and the Ingress_Timestamp and Egress_Timestamp fields needed to evaluate the hop latency.
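How the collector could derive the hop latency from these four fields is sketched below. The byte layout is purely hypothetical (two 32-bit identifiers followed by two 64-bit timestamps) and is not the report format of the INT specification [14] nor the one shown in Fig. 5; the names Report, encode_report and decode_report are likewise introduced only for illustration.

```python
import struct
from typing import NamedTuple

# Hypothetical layout: switch_id (32 bit), flow_id (32 bit),
# ingress and egress timestamps (64 bit each), network byte order.
REPORT_FORMAT = "!IIQQ"
REPORT_SIZE = struct.calcsize(REPORT_FORMAT)


class Report(NamedTuple):
    switch_id: int
    flow_id: int
    ingress_ts: int
    egress_ts: int

    @property
    def hop_latency(self):
        """Time spent in the switch, in the same unit as the timestamps."""
        return self.egress_ts - self.ingress_ts


def encode_report(report):
    """Serialize a Report into the hypothetical payload layout."""
    return struct.pack(REPORT_FORMAT, *report)


def decode_report(payload):
    """Parse the first REPORT_SIZE bytes of a UDP payload into a Report."""
    if len(payload) < REPORT_SIZE:
        raise ValueError("payload too short for a report")
    return Report(*struct.unpack(REPORT_FORMAT, payload[:REPORT_SIZE]))
```

The per-packet hop latency obtained this way is the sample that the statistics extractor would then aggregate per flow and per switch.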

1) POD TRAFFIC FORWARDING VALIDATION
This section functionally validates the proposed solution for processing the traffic exchanged between a pair of Kubernetes pods. Specifically, the traffic is generated between two pods deployed on nodes EMDC 1 and EMDC 2, respectively, thus traversing the P4-based network. Fig. 8 illustrates the Wireshark capture, including the VXLAN encapsulation and the protocol stacking as shown in Fig. 3. Fig. 9 shows a screenshot of the ONOS web GUI illustrating the flow rules installed in switch S1, where the rule counters show that the traffic exchanged between the two pods is correctly matched.

2) POD TRAFFIC TELEMETRY VALIDATION
This section functionally validates the whole telemetry workflow as described in Fig. 1. Specifically, two separate traffic flows are activated between two different pairs of pods: flow IDs 250 and 123. The two flows consist of five parallel TCP sessions generated with the iperf3 application. Telemetry is active for both flows; however, the SLA Broker is configured to generate the feedback to ONOS (step 8 in Fig. 1) only for flow 250. Fig. 10 reports the latency data as collected by the SLA Broker panels during the network reconfiguration. Both flows are initially routed along the path S1, S3, S4, S2; thus, both plots report four latency lines, one per traversed switch. At time t0, the transmission rate of switch S3 is manually degraded, thus increasing the switch latency for both flows. The SLA Broker performs a threshold-based control over the per-switch latency of flow 250 and triggers an alert if the degradation persists for 4 seconds. This behaviour is reflected in the SLA Broker panel as depicted in Fig. 10.
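The alerting rule described above (a threshold crossing that must persist for 4 seconds) can be sketched as a small state machine. This is not the Grafana alerting logic used by the SLA Broker; the class name, the threshold value used in any example and the sample timestamps are assumptions, while the 4-second persistence comes from the text.

```python
class PersistenceAlarm:
    """Hypothetical alarm: fires once a value has stayed above `threshold` for `hold_s` seconds."""

    def __init__(self, threshold, hold_s=4.0):
        self.threshold = threshold
        self.hold_s = hold_s
        self._above_since = None  # time of the first sample of the current violation

    def update(self, timestamp_s, value):
        """Feed one latency sample; return True when the alert should be raised."""
        if value <= self.threshold:
            # Back within the SLA: forget the ongoing violation.
            self._above_since = None
            return False
        if self._above_since is None:
            self._above_since = timestamp_s
        return timestamp_s - self._above_since >= self.hold_s
```

Requiring persistence avoids rerouting on isolated latency spikes, at the cost of adding the hold time to the overall reaction time of the closed loop.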
In the actual experiment, degradation is detected at t1 and the alert is sent back to ONOS at t2. As described in the previous sections, ONOS reacts by rerouting the affected flow (i.e., flow 250) on path S1, S5, S2; thus, after t2, Fig. 10 reports the latency of those switches. It is worth noting that S5 is characterized by a higher latency compared to the other switches; indeed, S5 is emulated on less performant hardware. Conversely, flow 123 is not involved in the reconfiguration, showing that the implemented framework is able to act on a single traffic flow.
The telemetry workflow experiment has been repeated 10 times, also collecting the achieved end-to-end bit-rate of both flows. The results are illustrated in Fig. 11, including ten cyan lines reporting the specific result of each experiment and a single red line reporting the average trend. Specifically, Fig. 11(a) is related to traffic flow 250: it shows that after t0 the rate is degraded, and then it is partially recovered at time t2 when the traffic is switched to the alternate path. It is worth noting that rerouting the traffic does not guarantee the recovery of the overall bit-rate. In fact, the recovery path includes switch S5, emulated on less performant hardware with limited traffic capabilities. Fig. 11(b) is related to traffic flow 123, which is not involved in the reconfiguration; thus, after t0 the bit-rate is degraded and never recovers. Fig. 11(a) shows that the whole workflow takes about 6 seconds to be performed (i.e., t2 − t0). However, most of this time is spent within the telemetry and monitoring platform (i.e., the SLA Broker) as a result of our configuration for triggering the alert. This time could be reduced by configuring the SLA Broker with higher SLA checking rates on the InfluxDB filled by the Telemetry Collector.
Therefore, to better evaluate the achievable performance of the system, we have measured the reconfiguration time excluding the telemetry and monitoring platform from the workflow, i.e., the feedback to the ONOS controller is directly generated by the Telemetry Collector. Fig. 12 reports the latency data collected by an auxiliary Grafana panel during the network reconfiguration, when the reconfiguration is triggered directly by the Telemetry Collector (i.e., excluding InfluxDB and the SLA Broker). The experiment has been repeated 10 times and the average reconfiguration time is 1.95 seconds, which includes: the detection of the latency degradation at the Telemetry Collector, all control plane procedures performed in ONOS (e.g., computation of an alternate path), and the P4 Runtime message exchange towards the involved switches.

VII. CONCLUSION AND FUTURE DIRECTIONS
The integration between IT and networking technologies is fundamental for effective micro-services deployment on next generation edge nodes interconnected by a network infrastructure. However, currently available solutions considering Kubernetes orchestrated clusters and programmable networks are based on legacy SDN techniques and thus require deep integration at the data plane level.
In this work we proposed a P4-based solution able to gain visibility inside tunnelled traffic, thus enabling such integration at the control plane level through communication between the Kubernetes orchestrator and the ONOS SDN controller. We also experimentally demonstrated the first comprehensive framework enabling effective traffic telemetry at pod level, building upon a closed-loop workflow among (i) the Kubernetes orchestrator, (ii) the ONOS SDN Controller, (iii) the enhanced P4-based data plane, and (iv) the telemetry system. The integrated system is able to orchestrate Kubernetes micro-service chains and to automatically configure the P4 switches, including configurable telemetry. Moreover, the closed-loop BRAINE telemetry and monitoring system is able to enforce automatic network recovery of specific flows violating the latency SLA in less than 2 seconds.
The proposed framework paves the way toward even more advanced closed-loop strategies for the dynamic reconfiguration of flows (e.g., traffic prioritization, control of generated traffic rates at the application level, etc.) depending on the performance measured on the network.

(Davide Scano and Alessio Giorgetti contributed equally to this work.)
MUSTAFA AL-BADO is currently a Senior Research Scientist with Dell Technologies with a demonstrated history of working in industries and higher education institutes. He has contributed to several granted patents and standards about 5G technologies. His research interests include virtualization, 5G technologies, network slicing, as-a-service, and cloud and edge computing.
EDGARD MARX is currently the Project Manager with the Leipzig University of Applied Science and a Linked Data Expert with eccenca GmbH, Germany. His main research interests include theoretical and experimental studies in information retrieval, databases, and knowledge graphs. He has been involved in various international research projects such as DFG DINOBBIO, BMWK COYPU, ECSEL BRAINE, and CLEVER. He has coauthored over 40 peer-reviewed international publications.

SEAN AHEARNE is currently a Senior Research Scientist with Dell Technologies and a Technical Lead of the ECSEL BRAINE Project. His background includes extensive work on Software Defined Networking for both optical and wireless networks including 5G and THz with several published works. He has a further background and interest in system virtualization both hypervisor and container-based, hardware acceleration with GPUs and FPGAs, and hardware and software architectures and design.
FILIPPO CUGINI (Member, IEEE) is currently the Head of Research Area with CNIT, Pisa, Italy. He is the coauthor of 14 patents and more than 300 international publications. His main research interests include theoretical and experimental studies in communications and networking. He serves as a Coordinator of the ECSEL BRAINE Project, an EU-Funded Project aiming at boosting Artificial Intelligence at the Network Edge (www.braine-project.eu).
Open Access funding provided by 'Scuola Superiore "S.Anna" di Studi Universitari e di Perfezionamento' within the CRUI CARE Agreement VOLUME 11, 2023