Live Migration of Virtual Machine and Container Based Mobile Core Network Components: A Comprehensive Study

With the increasing demand for openness, flexibility, and monetization, Network Function Virtualization (NFV) of mobile network functions has been embraced by most mobile network operators. Early reported field deployments of the virtualized Evolved Packet Core (EPC), the core network (CN) component of 4G LTE and 5G non-standalone mobile networks, reflect this growing trend. To best meet the requirements of power management, load balancing, and fault tolerance in the cloud environment, live migration of these virtualized components cannot be avoided. Virtualization platforms of interest include both Virtual Machines (VMs) and Containers, with the latter being more lightweight. This paper's first contribution is a framework that enables migration of containerized virtual EPC components using an open-source migration solution that does not yet fully support the mobile network protocol stack. The second contribution is a comprehensive experimental analysis of live migration in two virtualization technologies, VM and Container, with additional scrutiny of the container migration approach. The presented experimental comparison accounts for several system parameters and configurations: flavor (image) size, network characteristics, processor hardware architecture model, and the CPU load of the backhaul network components. The comparison reveals that the live migration completion time and the end-user service interruption time of the virtualized EPC components are reduced by approximately 70% in the container platform when using the proposed framework.


I. INTRODUCTION
THE 3GPP standards for the 5G mobile communication and the ETSI NFV [1] communication comprises the New Radio (NR) and the Next Generation Radio Access Network (NG-RAN), including the gNodeB (gNB), connected to the 4G EPC. In the NG-RAN system, the gNB is disaggregated into three components, namely the Remote Radio Unit (RRU), the Distributed Unit (DU), and the Central Unit (CU) [2]. The RRU mainly comprises the RF components, while the functions performed by the DU and the CU vary based on the split option chosen from those available in the 3GPP standards [3]. The use of both commercial off-the-shelf (COTS) hardware and Network Function Virtualization (NFV) helps Mobile Network Operators (MNOs) reduce their operational cost and the need for excessive over-provisioning of network capacity in order to achieve the much needed support for service redundancy [4], [5]. Well-established virtualization platforms, such as OpenStack and VMware, support both Virtual Machine (VM) and Container based virtualization [6], [7]. VMs can run concurrently and independently on the same host compute hardware, each providing a distinct OS to its guest application, namely each Virtual Network Function (VNF). Docker makes use of OS-level virtualization to produce VNFs that run in packages called Containers. Container-based solutions have been gaining traction in recent years due to their reduced overhead. While these platforms are widely used, some open challenges remain to be addressed [8]. One of these challenges is to achieve the required carrier-grade Service Level Agreement (SLA) in the virtualization platform that supports NFV [9]. It is believed that the NFV compute platform must make full use of features such as (live) migration, snapshot, and rebirth in order to ensure that the SLA requirements are met in terms of security, reliability, and total cost of ownership.
Focusing on the first feature, live migration is the process of migrating the VNFs from one host to another while guaranteeing zero or minimal impact on the connectivity service offered to the mobile network users. Being able to live migrate VNFs offers a number of significant advantages. VNFs can be moved away from overloaded servers/hosts and reallocated to less loaded compute nodes. Load balancing in the compute nodes [10] can be achieved in a timely manner by redistributing VNFs to sparsely loaded servers. Live migration of VNFs is often required to effectively perform maintenance (such as upgrading OS versions and changing network configurations) or fault management. Last but not least, cost savings in terms of power consumption management may be facilitated through VNF migration. For example, when some servers are underutilized for a prolonged period of time, their VNFs may be relocated elsewhere to allow these computing elements to be shut down.
Recognizing the importance of offering virtualized EPC solutions with built-in capability for VNF live migration in a timely manner, this paper describes a few experimental settings designed to achieve this goal. These experimental settings are obtained by leveraging open software and standard solutions whenever possible, and by implementing additional custom software packages when necessary to complete the required NFV/SDN architecture. All hardware components are commercially available. The ultimate objective is to validate the feasibility and compare the performance of a few plausible NFV/SDN architectures that provide live migration of virtualized EPC functions with minimal connectivity disruption to the mobile user. Specifically, the VNFs for which live migration is tested are three core network components, namely the Home Subscriber Server (HSS), the Mobility Management Entity (MME), and the Serving and Packet Gateway (SPGW). These virtualized EPC components are implemented using the OpenAirInterface (OAI) software package.
Two virtualization technologies are considered, one based on VMs and the other based on Docker Containers. In the former platform, live migration of VNFs running as VMs is achieved through Kernel-based Virtual Machine/Quick EMUlator (KVM/QEMU) with the libvirt API [11]. In the latter platform, live migration of VNFs running as Docker Containers is achieved through Checkpoint and Restore In Userspace (CRIU) [12].
It must be noted that the currently available CRIU software package does not offer two key functionalities [13] that are required to support Container-based VNF migration in the C-RAN backhaul network: i) support for the Stream Control Transmission Protocol (SCTP), used in the LTE network to guarantee message delivery between MME and CU; and ii) handling of the GTP (GPRS Tunnelling Protocol) device-specific information needed by the SPGW software to provide tunnelling of the user data traffic. To overcome these CRIU limitations, two custom solutions, described in Section III-A, have been implemented and integrated in the experimental settings: i) support for SCTP in the CRIU software and ii) a utility software to handle the GTP device-specific information.
The two virtualization technologies are tested in two distinct experimental settings. In the first setting, compute hosts are realized using repurposed Stampede [14] servers to form two geographically distinct edge compute sites. The two sites are connected through a backhaul fiber-optic network that is realized using reconfigurable optical add-drop multiplexing (ROADM) equipment, optical transponders, and switchponders from a number of equipment manufacturers. The optical equipment is OpenROADM compliant [15] and controlled by the open source TransportPCE controller [16]. Orchestration of resource allocation in the optical network, Ethernet switches, and Stampede compute nodes is provided by a custom PROnet Orchestrator software [15]. Both virtualization technologies are also tested using the CloudLab federated testbed [17], which provides an additional benchmark to validate the newly added custom code in an open environment. This second setting offers a higher-performing and more diverse compute hardware platform than the first one but does not provide an optical backhaul network that can be controlled by the experimenter.

II. RELATED WORK
In [18], the 3rd Generation Partnership Project (3GPP) specifies different resiliency mechanisms for EPC components, as well as the handling of failures with the help of Echo Request/Response timer messages. In addition, [19] presents approaches for recovering VNFs through replication and migration of network functions when outages affect compute resources. Moreover, infrastructure network failures can be recovered directly at the network level, for example by resorting to a Software Defined Network (SDN) controller [20], or by combining replication/migration with connection rerouting. In [21], a two-step resiliency scheme is proposed for RAN functional split reconfiguration by orchestrating lightpath transmission adaptation.
Ref. [22] discusses the advantages of VNFs by conducting a survey and collecting technical inputs from the 5G-PPP projects. Most of the project prototypes evolved from ETSI MANO resort to the OpenStack Virtualized Infrastructure Manager (VIM) with the addition of Kubernetes orchestration to host both containerized network functions and the classical VM-based VNFs.
In [23], the authors evaluate the performance of the Virtualized RAN using both VMs and Docker Containers in an SDN enabled network. Using their analytical model and experimentation, they report that Docker Container performance is superior to that of VMs in terms of I/O performance and communication latency. The authors analyse the service rate, average waiting time, and inter-arrival time for both the VMs and Docker Containers using a queueing model. However, the aspects concerning migration techniques and related implementation challenges in different virtualization technologies are not addressed in [23].
The VNF migration of virtualized CU/virtualized DU (vCU/vDU) over a WDM network using CRIU is briefly discussed in [24]. Here, the authors mention checkpointing the vCU by collecting the CPU state and memory page information and storing them on disk. The collected metadata is restored at the destination host, and lightpath reconfiguration ensures the connectivity of the end-user.
So far, there are no papers that address implementation and provide detailed evaluation of NFV-SDN systems performing live migration of VM and Container supporting core network functions.

III. LIVE MIGRATION OF VIRTUALIZED CORE NETWORK FUNCTIONS: TECHNIQUES, LIMITATIONS, AND SOLUTIONS
This section describes the VNF migration strategies exploited, the key implementation challenges faced, and the custom software solutions developed for successfully performing core network components live migration.

A. Docker Container Migration with CRIU
With the Container technology gaining increasing attention due to its smaller footprint, many recent research efforts focus on designing the optimal Container migration approach, where the VNF memory, file system, and network connectivity state need to be transferred to the destination host without disrupting the running application [25], [26]. A number of prominent Container run-time software packages handle the migration in user space through a checkpoint and restore technique.
As shown in Fig. 1, during checkpoint, the CRIU method freezes the running Container at the source node (host A) and collects metadata about the CPU state, memory content, and process tree [27] associated with the running Container service. The collected metadata is passed on to the destination node (host B), and the restore process resumes the Container service from the frozen point with the copied metadata at the destination node. The total time required to perform the checkpoint, the metadata copy and transmission, and the Container restoration constitutes the frozen time of the application. The frozen time can be reduced using the lazy migration method [28]. This method is, however, outside the scope of this study. The following subsections describe the current CRIU limitations that are encountered during the migration of certain core network components [13] and how these limitations are circumvented in order to successfully perform live migration of such components.
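The frozen-time breakdown described above can be illustrated with a minimal timing sketch. It assumes that the three phases (checkpoint, metadata transfer, restore) are available as Python callables; the function name `timed_phases` and the phase labels are hypothetical and are not part of CRIU or of the authors' scripts.

```python
import time
from typing import Callable, Dict


def timed_phases(phases: Dict[str, Callable[[], None]]) -> Dict[str, float]:
    """Run each phase in insertion order and return its duration in milliseconds.

    The extra "frozen_total" entry is the sum of all phase durations, which
    mirrors the definition of frozen time as checkpoint + transfer + restore.
    """
    durations: Dict[str, float] = {}
    for name, action in phases.items():
        start = time.monotonic()
        action()
        durations[name] = (time.monotonic() - start) * 1000.0
    durations["frozen_total"] = sum(durations.values())
    return durations


if __name__ == "__main__":
    # Placeholder actions; a real script would invoke the checkpoint,
    # copy the image files, and invoke the restore here.
    report = timed_phases({
        "checkpoint": lambda: time.sleep(0.01),
        "transfer": lambda: time.sleep(0.02),
        "restore": lambda: time.sleep(0.01),
    })
    for phase, ms in report.items():
        print(f"{phase}: {ms:.1f} ms")
```

Because the Container is frozen for the whole sequence in this migration type, the total reported by such a script is both the migration time and the downtime.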
1) HSS Live Migration with CRIU Support: A TCP connection is required between HSS and MME. The HSS component establishes the TCP socket at start time and stores relevant user information in a MySQL database. Upon performing migration of the HSS component, the CRIU application needs to copy the database information into the memory page and restore the TCP connectivity at the destination host without disturbing the peer end connection state at the MME side. The tcp-established mode [29] must be set in the CRIU configuration in order to ensure TCP connection re-establishment at the destination host. This TCP repair mode is supported starting from version 3.5 of the mainline Linux Kernel, which provides support for socket re-establishment without requiring the exchange of the initial TCP connection setup messages. No additional custom software is required in CRIU.
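As an illustration of the tcp-established mode, the sketch below assembles CRIU command lines for checkpointing and restoring a process tree that holds an established TCP connection. The `--tree`, `--images-dir`, `--tcp-established`, and `--restore-detached` options are standard CRIU command-line options; the process ID and image directory are placeholders, and the exact invocation used by the authors is not given in the text.

```python
from typing import List


def criu_dump_cmd(pid: int, images_dir: str) -> List[str]:
    """Build a 'criu dump' command that also checkpoints established TCP sockets."""
    return [
        "criu", "dump",
        "--tree", str(pid),          # root of the process tree to checkpoint
        "--images-dir", images_dir,  # where the image (metadata) files are written
        "--tcp-established",         # checkpoint established TCP connections
    ]


def criu_restore_cmd(images_dir: str) -> List[str]:
    """Build the matching 'criu restore' command for the destination host."""
    return [
        "criu", "restore",
        "--images-dir", images_dir,
        "--tcp-established",   # must match the option used at dump time
        "--restore-detached",  # detach from the restored process tree
    ]


if __name__ == "__main__":
    # Hypothetical PID and directory, for illustration only.
    print(" ".join(criu_dump_cmd(1234, "/tmp/hss-checkpoint")))
    print(" ".join(criu_restore_cmd("/tmp/hss-checkpoint")))
```

The commands are only built, not executed, since running them requires root privileges and a CRIU installation on both hosts.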
2) MME Live Migration with CRIU Support: The MME component makes use of SCTP (Stream Control Transmission Protocol) to exchange S1-MME messages with multiple gNBs. One of the main differences between TCP and SCTP is that a TCP socket holds a single connection, whereas a single SCTP socket can hold multiple associations, each of which may carry multiple streams. The SCTP protocol is not supported in the currently available CRIU release version, and consequently migration of the MME component cannot be executed successfully unless a new SCTP connection is re-established with a new start message handshake. To overcome this CRIU limitation it was necessary to design and develop the additional custom software described next. When the SCTP socket is in listen mode, adding support in CRIU for SCTP is relatively simple because only user-space software changes are required. When the endpoint association is in established mode, the association endpoint details are also needed, along with kernel code changes, thus adding complexity to the required procedure. The authors developed a procedure in CRIU to support migration of the one-to-one style of SCTP socket, along with the required kernel changes for achieving automatic SCTP socket re-establishment at the destination host. The kernel code was modified in such a way that when the MME metadata is passed on to the destination host, the kernel is able to re-establish the SCTP socket at the destination host without re-instantiating the SCTP connection. This feature is available when the SCTP repair mode is turned on.
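The distinction between the one-to-one and one-to-many SCTP socket styles can be sketched with the standard sockets API, in which a one-to-one socket is created with `SOCK_STREAM` and a one-to-many socket with `SOCK_SEQPACKET`. The helper names below are hypothetical, the sketch assumes a Linux host, and it uses a stock kernel only: it does not exercise the authors' SCTP repair mode, whose interface is not described in the text.

```python
import socket
from typing import Optional, Tuple

# 132 is the IANA-assigned protocol number for SCTP; the fallback is used
# when the Python build does not export the constant.
IPPROTO_SCTP = getattr(socket, "IPPROTO_SCTP", 132)


def sctp_socket_args(style: str) -> Tuple[int, int, int]:
    """Return (family, type, protocol) for the requested SCTP socket style."""
    if style == "one-to-one":
        # One association per socket, TCP-like semantics.
        return (socket.AF_INET, socket.SOCK_STREAM, IPPROTO_SCTP)
    if style == "one-to-many":
        # Many associations multiplexed on a single socket.
        return (socket.AF_INET, socket.SOCK_SEQPACKET, IPPROTO_SCTP)
    raise ValueError(f"unknown SCTP socket style: {style}")


def open_sctp_socket(style: str) -> Optional[socket.socket]:
    """Try to open an SCTP socket; return None if the kernel lacks SCTP support."""
    try:
        return socket.socket(*sctp_socket_args(style))
    except OSError:
        return None


if __name__ == "__main__":
    sock = open_sctp_socket("one-to-one")
    print("SCTP available" if sock is not None else "SCTP not available on this host")
    if sock is not None:
        sock.close()
```

Restricting checkpoint support to the one-to-one style means that each socket maps to exactly one association whose state has to be saved and repaired.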
3) SPGW Live Migration with CRIU Support: The SPGW component makes use of a GTP interface for handling the User Equipment (UE) connectivity and keeps the GTP tunnel list up to date with the relevant UE and base station (gNB) information. With the currently available open software platforms (OAI, CRIU), these critical pieces of information are not carried over to the destination host during the SPGW migration. Consequently, the end-user connectivity is lost and the entire UE connection re-establishment has to take place again starting from the base station. For the reader's convenience, a short overview of the GTP tunnel mechanism is first provided, followed by the description of the custom software that was developed to overcome this severe limitation.
To provide mobility to the UE and cope with the resulting network topology dependencies, the UE uplink and downlink IP packets are routed through a GTP tunnel that is previously established between the base station and the SPGW. Tunnel Endpoint Identifier (TEID) values are mutually exchanged between the base station and the SPGW to ensure correct flow of data traffic. For example, considering the UE uplink communication, the IP data packet is first encapsulated at the base station by adding its IP/UDP/GTP header and transmitted in the GTP tunnel to reach the Serving Gateway (SGW). The SGW replaces the outer header with its own IP/UDP/GTP header and sends the packet to the Packet Gateway (PGW). The PGW removes the outer header and passes the original UE IP data packet to the Internet/Packet Data Network (PDN). In this solution the base station acts as the Serving GPRS Support Node (SGSN) and the SPGW acts as the Gateway GPRS Support Node (GGSN). The GTP-U communication along with the UDP/GTP header addition is illustrated in Fig. 2.
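The GTP header addition and removal described above can be sketched as follows. The example builds the mandatory 8-byte GTPv1-U header (flags, message type, length, TEID) for a G-PDU, without optional fields such as sequence numbers or extension headers; the outer IP/UDP headers (GTP-U uses UDP port 2152) are left to the operating system. The function names and the TEID value are illustrative only and are not taken from the OAI code.

```python
import struct
from typing import Tuple

GTPU_PORT = 2152      # registered UDP port for GTP-U
GTPU_FLAGS = 0x30     # version 1, protocol type GTP, no optional fields
GTPU_GPDU = 0xFF      # message type: G-PDU (carries a user IP packet)
GTPU_HEADER_LEN = 8   # mandatory header size in bytes


def gtpu_encapsulate(teid: int, inner_packet: bytes) -> bytes:
    """Prepend a minimal GTPv1-U header carrying the receiver's TEID."""
    header = struct.pack("!BBHI", GTPU_FLAGS, GTPU_GPDU, len(inner_packet), teid)
    return header + inner_packet


def gtpu_decapsulate(datagram: bytes) -> Tuple[int, bytes]:
    """Strip the GTPv1-U header and return (TEID, inner packet)."""
    if len(datagram) < GTPU_HEADER_LEN:
        raise ValueError("datagram shorter than a GTP-U header")
    flags, msg_type, length, teid = struct.unpack("!BBHI", datagram[:GTPU_HEADER_LEN])
    if flags != GTPU_FLAGS or msg_type != GTPU_GPDU:
        raise ValueError("not a plain GTPv1-U G-PDU")
    return teid, datagram[GTPU_HEADER_LEN:GTPU_HEADER_LEN + length]


if __name__ == "__main__":
    frame = gtpu_encapsulate(0x1, b"user-ip-packet")
    print(frame.hex())
    print(gtpu_decapsulate(frame))
```

The TEID in the header is what the receiving endpoint uses to look up the tunnel, which is why losing the tunnel list during migration breaks the user data path.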
The OAI SPGW software implements the above mentioned data plane connectivity by using the Linux Kernel GTP tunnelling module. This kernel module creates the GTP device interface (gtp0) for tunnelling the user data traffic to the PDN, and the SPGW software maintains the SGSN TEID information in the GPRS tunnelling list. A procedure had to be identified to overcome the inability of the current CRIU software to store device interface-specific information. Additionally, with the CRIU software operating in user space, the kernel dependent gtp0 device interface information cannot be checkpointed. Finally, the tunnelling list associated with that interface (gtp0) is also completely lost when the SPGW metadata is moved to the destination host. In essence, the restored SPGW application at the destination host holds the UDP socket connectivity with the gNB endpoint, but lacks the required gtp0 interface and tunnel list. The authors had to upgrade the OAI software package with a software utility program that overcomes these current limitations.
The software utility program is implemented to run along with the SPGW container and takes care of: i) creating the gtp0 interface by managing the rtnetlink socket [30]; ii) reading and configuring the gtp0 interface-specific information from the SPGW configuration file (e.g., IP address, MTU size, mask, and routing information); iii) enabling masquerading for the SGi interface (the point of interconnection between the PGW and the external network, PDN) by adding POSTROUTING iptables commands, which alter the IP packets after routing is completed; and iv) maintaining the GTP tunnel list information of the running SPGW Container application after migration. Thus, when the SPGW Container is checkpointed, the utility program adds its gtp0 interface-specific information and the tunnel list data to the metadata collected by CRIU. During restore time, the SPGW application is restored successfully at the destination host with both the socket connectivity and the GTP related information needed to reinstate the end-user communication. Additional details about the OAI design changes that are applied to handle SPGW CRIU migration can be found in [31].
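The role of the utility program can be illustrated with a minimal sketch of how the gtp0 configuration and the tunnel list might be serialized next to the CRIU image files at checkpoint time and read back at restore time. The data layout, field names, and function names below are assumptions made for illustration and do not reflect the authors' implementation; only the iptables MASQUERADE rule syntax is standard.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class GtpTunnel:
    """One entry of the GTP tunnel list (hypothetical layout)."""
    ue_ip: str
    local_teid: int
    peer_teid: int
    peer_ip: str


@dataclass
class GtpState:
    """Interface-specific information that CRIU itself does not checkpoint."""
    ifname: str
    address: str  # interface address in CIDR notation
    mtu: int
    tunnels: List[GtpTunnel] = field(default_factory=list)


def save_state(state: GtpState) -> str:
    """Serialize the state so it can be stored alongside the CRIU images."""
    return json.dumps(asdict(state))


def load_state(text: str) -> GtpState:
    """Rebuild the state at the destination host before the SPGW resumes."""
    raw = json.loads(text)
    tunnels = [GtpTunnel(**entry) for entry in raw.pop("tunnels")]
    return GtpState(tunnels=tunnels, **raw)


def masquerade_cmd(sgi_ifname: str) -> List[str]:
    """Standard iptables rule enabling masquerading on the SGi-facing interface."""
    return ["iptables", "-t", "nat", "-A", "POSTROUTING",
            "-o", sgi_ifname, "-j", "MASQUERADE"]


if __name__ == "__main__":
    state = GtpState("gtp0", "192.168.0.1/24", 1500,
                     [GtpTunnel("10.0.0.2", 1, 2, "192.168.1.10")])
    print(save_state(state))
    print(" ".join(masquerade_cmd("eth0")))
```

At restore time, a utility of this kind would recreate the interface from the loaded state and re-add each tunnel before user traffic resumes.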

B. VM Migration with the KVM/QEMU Hypervisor
A hypervisor is a software-based virtualization layer between the physical machine (host) and the VM guests running on it. The hypervisor takes care of scheduling and allocating compute resources to the VM guests. The KVM hypervisor is a kernel module, integrated into the mainline Linux Kernel since version 2.6.20, that is used in OpenStack [11] for providing the virtualization infrastructure. The QEMU-KVM module provides VM management functions, such as spawning and migrating VMs, using the guest execution mode. Interaction with the KVM/QEMU hypervisor is made possible through the libvirt library and its set of API calls.
During VM migration, the CPU state, memory state, network state, and disk image of the entire VM are migrated from the source to the destination host. During the memory page copying process, the dirty pages (i.e., modified memory pages) are iteratively transferred while the VM is still running at the source host; this is referred to as the push phase. Once the maximum iteration count is reached, the VM is temporarily stopped at the source host, the remaining dirty memory pages are copied to the destination, and then the VM is resumed at the destination host. This memory page copying process, referred to as the Pre-copy method, is illustrated in Fig. 3.
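The iterative behaviour of the Pre-copy method can be illustrated with a simplified numerical model. The sketch assumes a constant page dirtying rate and a constant transfer bandwidth, which real workloads and networks do not guarantee; the function name and all parameters are hypothetical and are not taken from KVM/QEMU or libvirt.

```python
from typing import Tuple


def simulate_precopy(memory_mb: float,
                     dirty_rate_mb_s: float,
                     bandwidth_mb_s: float,
                     max_iterations: int,
                     stop_threshold_mb: float = 0.0) -> Tuple[float, float, int]:
    """Return (total_time_s, downtime_s, push_rounds) for a simplified pre-copy.

    Push phase: each round sends the pages dirtied during the previous round
    while the VM keeps running. Stop-and-copy: the VM is paused and the
    remaining dirty pages are sent, which determines the downtime.
    """
    if bandwidth_mb_s <= 0:
        raise ValueError("bandwidth must be positive")
    to_send = memory_mb  # the first round transfers the whole memory
    push_time = 0.0
    rounds = 0
    for _ in range(max_iterations):
        round_time = to_send / bandwidth_mb_s
        push_time += round_time
        rounds += 1
        # Pages dirtied while this round was in flight (capped at memory size).
        to_send = min(memory_mb, dirty_rate_mb_s * round_time)
        if to_send <= stop_threshold_mb:
            break
    downtime = to_send / bandwidth_mb_s
    return push_time + downtime, downtime, rounds


if __name__ == "__main__":
    total, down, rounds = simulate_precopy(1000.0, 100.0, 1000.0, max_iterations=3)
    print(f"total={total:.3f} s, downtime={down:.3f} s, rounds={rounds}")
```

In this model the downtime shrinks with each push round as long as the dirtying rate stays below the available bandwidth, which is consistent with the Pre-copy method suiting read-intensive workloads.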
There exists another copying strategy named Post-copy [32], where the stop and copy phase happens first so that the VM is started earlier at the destination host. The remaining dirty pages are then copied when page faults occur, a technique that is referred to as the pull phase. Only the Pre-copy migration method is used in the experiments discussed in this paper, since it is well suited to memory-read-intensive applications. As shown in Fig. 3, the Pre-live phase covers the preselect and reservation process, such as preparing the destination host with the VNF instance details of keypair association and network information. The Live phase carries out the memory page copy process to the destination host (the push phase for the Pre-copy method). The Post-live phase performs post-migration operations, such as updating the running VM state in the MySQL database and the Neutron database with the host information and port details.
1) VM Migration Limitations Handling: Due to the maturity of the KVM/QEMU hypervisor as a virtualization technology [33], the VM components running core network elements can be more easily migrated when compared to Docker Containers. Only a few precautions are necessary to ensure correct migration execution in the OpenStack environment, where layer-2 network connectivity is provided by default using the Open vSwitch (OVS) integration bridge [34]. Even though the OpenStack security rule permits the use of the SCTP protocol, the OVS firewall blocks the SCTP packets, preventing them from reaching the hosts. To circumvent this drawback, SCTP messages are encapsulated inside UDP frames using Open Virtual Private Network/Virtual Extensible LAN (OpenVPN/VXLAN) connectivity. The resulting UDP frames are therefore not blocked by the OVS firewall. In summary, the SCTP protocol communication between MME and CU is made possible through an OpenVPN service enabled in the S1-MME communication interface [35].

IV. EXPERIMENT TESTBEDS
Two testbeds are used in this study, which provide all of the required C-RAN system key components including radio hardware units, compute nodes, Ethernet switches, and optical transport network equipment. The first testbed is implemented at the University of Texas at Dallas (UTD) and makes use of optical transport network equipment that is OpenROADM compliant [36]. The second testbed consists of CloudLab [17] compute resources connected to the radio units hosted at UTD through the Internet. Combined, the two testbeds provide an opportunity to test the described procedures to live migrate virtualized core network components in the presence of state-of-the-art programmable optical network equipment (in the former testbed) while also ensuring compliance of the proposed software implementation in an open federated environment (in the latter testbed).
In both testbeds, the C-RAN software modules are implemented using OAI [37], while the radio hardware units are implemented using NI B210 radio prototyping boards [38], as shown in Figs. 4 and 5. The OAI software version considered for the CU and DU is v2019.w25, and for the core network v0.5.0-4-g724542d is used. The interface between the radio hardware unit and the DU is realized using the USB 3.0 (B210 radio) interface, with the DU running on a dedicated physical machine. All of the experiments make use of the option 2 split [39] between DU and CU, according to which both Packet Data Convergence Protocol (PDCP) and Radio Resource Control (RRC) services run on the CU.
A. Testbed 1: OpenROADM
Fig. 4 shows the block diagram of the OpenROADM testbed configuration used to investigate both the KVM and CRIU based migration procedures. Two racks of Stampede compute nodes are connected through an optical transport (backhaul) network comprising only OpenROADM compliant equipment. The virtualized EPC software components (HSS, MME, SPGW) are first executed on the left rack (Rack 1). Once triggered, the live migration of either the VM or Container that supports one of these EPC components takes place over a dedicated optical circuit (lightpath) that is dynamically created between the two racks to form a temporary high-speed connection in the management network and expedite the migration procedure between racks. The optical transport (backhaul) network consists of two OpenROADM nodes provided by Ciena (6500) and Fujitsu (1FINITY) for routing lightpaths between the two racks or compute sites. The TransportPCE controller is an application running on OpenDaylight version 6.0.9. Also shown in Fig. 4, the programmable optical network (PROnet) Orchestrator is a software module developed at UTD to coordinate automatic resource provisioning in an Ethernet-over-WDM network [41]. OpenFlow [42], [43] enabled switches (Juniper QFX5120 and Dell N3048p), controlled by the PROnet Orchestrator, are used to interconnect compute nodes in the two racks and also to route packets (in both management and tenant networks) to the assigned transport optical equipment. The PROnet Orchestrator was recently upgraded with two additional features [44]: a RESTCONF interface to work with the TransportPCE northbound API, which relies on the OpenROADM Service Model, and a REST API to work with OpenStack. With these two upgrades, the PROnet Orchestrator offers a single point of control and coordination of the compute and network resources in the described experimental setting. For example, to enable experimentation with varying backhaul network round trip delays, the PROnet Orchestrator is instructed to create lightpaths in the OpenROADM network with varying end-to-end propagation distances, i.e., a few meters (considered as short distance), 25 km, and 50 km. During the migration process, the PROnet Orchestrator first triggers the creation of the management lightpath between the two racks and then initiates the migration of one of the EPC virtual components. The migration procedure is carried out through the OpenStack dashboard when using VMs and through shell script commands when using Containers.

B. Testbed 2: CloudLab
Fig. 5 shows the block diagram of the CloudLab testbed, in which some of the RAN components (UE, DU, and CU) reside in the UTD Lab while the virtualized core network components (HSS, MME, and SPGW) run in the CloudLab environment in the Utah lab [17]. The CloudLab and the UTD Lab are connected via Internet2. Due to the firewall restrictions in the UTD campus network, connectivity between the UTD Lab and the CloudLab environment is established through OpenVPN. Compute nodes in the CloudLab testbed are co-located and can therefore be used to test virtualized EPC component migration within the same datacenter. Both the VM and Docker Container components run in the CloudLab compute nodes using the same procedures already described for the OpenROADM testbed scenario, with the only exceptions that in the CloudLab testbed there is no optical transport (backhaul) network and the OpenVPN service runs on all the core network VNFs. The CloudLab testbed provides the opportunity to validate and test the robustness of the custom software proposed and developed in Section III-A (software configuration changes, the new utility program, and CRIU code changes) in an open environment outside the UTD in-house lab setting.

C. System Configuration
The system configuration details of both the OpenROADM and CloudLab testbeds are reported in Table I. Compute nodes are Intel Xeon models in both testbeds. However, the underlying compute hardware in the CloudLab testbed is more advanced. In the OpenStack VM, the communication between the services is carried in the management network through the advanced message queuing protocol (AMQP). Since the OpenStack VM migration traffic flows through the management network, the key parameter of interest for the VM migration is the management network data rate, which is higher in the CloudLab testbed as the compute nodes involved in the live migration procedure are all co-located. For the Container migration, since the Checkpoint and Restore services are executed in the individual node, the parameters of interest are the hardware architecture, cache size, and number of cores used. The same versions of the libvirt, KVM/QEMU, and CRIU packages are installed in both testbeds.

D. Additional Observations
The migration completion time is affected by a few key system parameters, which must be taken into account carefully. First, flavors of computing instances (compute, memory, and storage capacity) may affect the time that is required to migrate both VMs and Containers to a new host. Table II reports the flavors that are applied in this study. Second, the backhaul network round trip time may affect the completion time of the live migration. The backhaul round trip time is affected by the Ethernet switch latency, the optical transponder and switchponder latency, and the optical signal propagation time across the network fiber. To test the effect of the signal propagation time on the migration completion time, multiple experiments are carried out while varying the route of the lightpath that is established between the two compute sites (racks) in the OpenROADM testbed (Fig. 4). The lightpath length is set to a few meters, 25 km, and 50 km, respectively. Third, the use of OpenVPN may also affect the migration completion time. While OpenVPN must be used for the reasons discussed in previous sections (e.g., to overcome the OVS firewall driver configured in Neutron that blocks SCTP packets), some of the virtualized EPC components may still use floating IP connections in the OpenROADM testbed. To estimate the effect of these network interfaces on the migration completion time, two configurations are investigated. In the OpenVPN configuration all EPC components make use of OpenVPN. In the Floating IP configuration the HSS and SPGW components make use of floating IP, while OpenVPN is still applied to the CU-MME SCTP connection.
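The contribution of fiber propagation to the backhaul round trip time can be estimated with a short calculation. The sketch below assumes a group refractive index of 1.468, a typical value for standard single-mode fiber that is not stated in the text; the switch and transponder latencies are left as caller-supplied parameters because their values are not reported here, and the function names are hypothetical.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in vacuum
FIBER_REFRACTIVE_INDEX = 1.468     # assumed typical value for single-mode fiber


def propagation_rtt_us(length_km: float,
                       refractive_index: float = FIBER_REFRACTIVE_INDEX) -> float:
    """Round trip propagation delay over a fiber of the given length, in microseconds."""
    one_way_s = length_km * refractive_index / SPEED_OF_LIGHT_KM_S
    return 2.0 * one_way_s * 1e6


def backhaul_rtt_us(length_km: float,
                    switch_latency_us: float = 0.0,
                    transponder_latency_us: float = 0.0) -> float:
    """Total round trip time: equipment latencies (round trip totals) plus propagation."""
    return switch_latency_us + transponder_latency_us + propagation_rtt_us(length_km)


if __name__ == "__main__":
    for km in (0.005, 25.0, 50.0):  # a few meters, 25 km, and 50 km
        print(f"{km:>6} km: {propagation_rtt_us(km):8.2f} us propagation RTT")
```

Under this assumption the propagation round trip time grows by roughly 10 microseconds per kilometer of fiber, so the 25 km and 50 km lightpaths add a fraction of a millisecond per round trip.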
V. RESULT ANALYSIS
In this section performance indicators related to the migration procedure are first defined, followed by the live migration results. Migration of the chosen EPC component is performed once the UE connectivity is established and UE data is being transmitted and received over the RAN uplink and downlink. Each experiment is repeated five times and the average is reported for each performance indicator to mitigate stochastic variations of network, I/O, and process delays. When migrating VMs, it was noted that when the VNF is running for a long period of time, its internal data storage increases (in terms of MB) due to the logs collected by the application. This data storage increase may cause variation of the network data traffic generated during the VM migration.

A. Performance Indicators
The main performance indicators considered in this study are migration time, downtime, network load at the time of migration, and the UE service recovery time [25], [26], [46]. Their definitions are given next.
• Migration time is defined as the amount of time required to migrate a VNF from one host to another host.In the VM migration experiment, the migration time is the sum of the Pre-live, Live and Post-live execution times, evaluated from the Nova log file.In the Container migration experiment, the migration time is the sum of the time intervals that are required to freeze the process, dump and transfer the metadata to the destination host, and restore the process at the destination host.The evaluation is carried out in millisecond resolution with the Shell Script that automates the migration process.• Downtime is defined as the amount of time the VNF functionality is paused and unavailable.In the VM migration experiment, the downtime is associated to the execution of the final dirty page copy and the reconfiguration of the virtual interface bridge connectivity with the port settings at the destination host.Due to the chosen CRIU migration type, the downtime in the Container migration experiment is the same as the migration time.For both the VM and the Container, the downtime is measured using the ICMP ping of the VNF IP in millisecond resolution.• Network load is defined as the amount of data transferred from the source host to the destination host.In the VM migration experiment, the network load accounts for the transfer of the CPU state, memory state, network state, and disk state and it is measured using the network bandwidth monitoring tool named "iftop".For the Container migration experiment, the network load accounts for the process tree, the CPU state, memory page, namespace, and control group (cgroup) information and it is the size of metadata files checkpointed.• UE service recovery time is defined as the time that is required to regain UE connectivity from the moment the UE is temporarily disconnected from the mobile network due to the migration of one of the EPC components.When using the OAI software modules the migration of either HSS or MME does not 
result in UE connectivity interruption.However, during the migration of SPGW, the UE uplink and downlink are temporarily disrupted and the UE service recovery time needs to be assessed.For both the VM and the Container, the UE Service Recovery Time is measured using the ICMP ping from the UE gtp interface IP to the SPGW gtp interface IP with one hundred milliseconds resolution.Only a modest extra time is required to complete all four migration types when using a longer lightpath, thus proving that these solutions can scale geographically.VM and Container migration times are differently affected by the flavor type.The VM Medium flavor requires a modest extra migration time (more noticeable when using the 50 km lightpath) compared to the VM Small flavor because of its increased image size.Transferring a larger image from the source host to the destination host takes extra time (magnified when the network round trip time is large).Conversely, the Container Medium flavor requires less migration time (more noticeable when using the short lightpath) compared to the Container Small flavor as its CPU core configuration enables  Fig. 
7 compares the HSS migration times collected using both the OpenROADM (lightpath length of few meters) and CloudLab testbed.The latter testbed offers shorter migration times due to two system factors.First, as specified in Sec.IV-C, the management network in the CloudLab testbed operates at 10G compared to that of 1G-10G-1G used in the OpenROADM configuration.The faster management network helps reduce the migration time in the VM-based experiments.Second, both checkpoint and restoration executions in the CloudLab testbed are more performing due to the Broadwell architecture running in the Intel Xeon servers with better L1 and L2 cache sizes when compared to the Sandy Bridge architecture used in the OpenROADM testbed.The more performing compute architecture of the CloudLab testbed helps reduce the migration time in the Container-based experiments.In addition, the migration time is reduced a bit in the OpenROADM testbed, when the HSS application is configured with Floating IP instead of OpenVPN.By not using the OpenVPN client service package the VM image size is reduced, which in turn reduces its migration time.
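The ping-based indicators defined above (downtime and UE service recovery time) both amount to finding the longest gap in a series of echo replies, while the migration time is a plain sum of phase durations. The following is a minimal sketch of that post-processing, not the authors' measurement script: the `Probe` record, the function names, and the example trace are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Probe:
    sent_ms: int    # send time of the echo request, in milliseconds
    replied: bool   # True if the matching echo reply arrived


def total_migration_time_ms(phases_ms: Dict[str, int]) -> int:
    """Migration time as the sum of its phase durations
    (e.g. Pre-live/Live/Post-live, or freeze/dump/transfer/restore)."""
    return sum(phases_ms.values())


def longest_outage_ms(probes: List[Probe]) -> int:
    """Longest interval between the last reply before a loss and the
    first reply after it. Returns 0 if no probe was lost, and ignores
    losses that are not bracketed by replies on both sides."""
    longest = 0
    last_ok = None          # send time of the most recent answered probe
    lost_since_ok = False   # True once a probe is lost after last_ok
    for p in probes:
        if p.replied:
            if last_ok is not None and lost_since_ok:
                longest = max(longest, p.sent_ms - last_ok)
            last_ok = p.sent_ms
            lost_since_ok = False
        else:
            lost_since_ok = True
    return longest


# Hypothetical trace: one probe every 100 ms, replies lost from 300 ms to 700 ms.
trace = [Probe(t, not (300 <= t <= 700)) for t in range(0, 1000, 100)]
print(longest_outage_ms(trace))  # 600
```

Note that the result can only be as fine as the probe interval: with one probe every 100 ms, as in the UE service recovery time measurement, the estimate is quantized to multiples of 100 ms.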

B. Migration Analysis of the HSS Component
Tables III and IV show the time taken by each phase of the VM and Container migration, respectively. For the VM, the network loads for the HSS Small and Medium flavors at the time of migration are 3.47 GB and 3.68 GB, respectively. For the HSS Docker Container, the metadata size is 173 MB regardless of the flavor type used.
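To put these figures in perspective, the short sketch below computes how much more data the VM migration moves than the Container migration for the HSS, using only the sizes quoted above. The helper name `load_ratio` is hypothetical, and the conversion of 1 GB to 1000 MB is an assumption; using 1024 MB per GB changes the result only marginally.

```python
def load_ratio(vm_load_gb: float, container_metadata_mb: float,
               mb_per_gb: float = 1000.0) -> float:
    """Ratio of VM network load to Container metadata size.
    mb_per_gb is an assumed conversion factor (1000 by default)."""
    return vm_load_gb * mb_per_gb / container_metadata_mb


# Sizes reported in the text for the HSS component.
hss_small = load_ratio(3.47, 173)
hss_medium = load_ratio(3.68, 173)

print(f"HSS Small:  VM moves about {hss_small:.1f}x the Container data")
print(f"HSS Medium: VM moves about {hss_medium:.1f}x the Container data")
```

Under these assumptions the VM migration transfers roughly 20 times more data than the Container migration for either flavor.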
As shown in Fig. 8, and in contrast to the previously presented migration time analysis, the downtime value for the Container running the HSS service is higher than that of the VM. The HSS VM downtime is mainly due to the virtual interface bridge and port reconfiguration. Other than that, the VNF service is not additionally paused during the VM live migration process. On the contrary, the HSS application running in the Container is paused once the checkpoint is initiated and it resumes only after restoration is complete at the destination host. Neither lightpath length nor flavor type has a significant impact on the HSS VM downtime. However, the flavor size impacts the HSS Container downtime, as already noted for the migration time of the same experiments.
Fig. 9 confirms that the downtime of the Docker Container in the CloudLab testbed is shorter than that in the OpenROADM testbed, as previously noted for the migration times of these experiments. As already mentioned, this outcome is mainly due to the superior processor architecture and cache level of the CloudLab testbed. More interestingly, the VM downtime is longer in the CloudLab testbed compared to the OpenROADM testbed. The significant geographical distance between the CloudLab in Utah (hosting the core network components) and the UTD Lab in Texas (hosting the RAN components) slows down the update procedures for the Reverse Address Resolution Protocol (RARP) to determine the new host's IP address and reroute the OpenVPN client traffic. Additionally, when Floating IP is used to replace OpenVPN for the HSS connectivity in the OpenROADM testbed, there is a slight reduction of the HSS downtime because in this configuration OpenVPN does not need to establish a new route for the client-to-client communication.
For an already attached UE, the mobile network service is not impacted by the temporary pause of the HSS component.However, any new UE attempting to perform attachment to the mobile network during the HSS migration procedure would be affected by the HSS downtime.

C. Migration Analysis of the MME Component
As reported in Fig. 10, the migration time of the MME VM is six and seven times larger than that of the MME Container when using the Small and Medium flavor, respectively. The MME VM network loads for the Small and Medium flavors at the time of migration are 3.53 GB and 4.16 GB, respectively. Both larger image size and longer lightpath length tend to increase the migration time of the MME VM. In comparison, the Docker Container metadata size is 42 MB in the Small flavor and only grows by 0.3 MB in the Medium flavor. The migration time of the MME Container is almost unaffected by the flavor type and lightpath length, with reported variations in the subsecond range.
The MME migration times in the OpenROADM testbed and CloudLab testbed are reported in Fig. 11 for a few configurations. As already noted for the HSS migration, the superior compute architecture available in the CloudLab testbed achieves shorter MME migration times compared to those in the OpenROADM testbed. For the same reason, the VM migration time is only about three times longer than the Container migration time in the CloudLab testbed. For the MME VM Floating IP scenario, the MME configuration uses the Floating IP to communicate with the HSS and the SPGW software and the OpenVPN IP to communicate with the CU software (to avoid SCTP packet blocking at the OVS). The migration time of the Small flavor VM in the OpenROADM testbed is not significantly affected by the use of Floating IP in place of OpenVPN. This outcome is not surprising, as the Floating IP network configuration can only be applied to the HSS and SPGW components and cannot be applied completely to the MME component, which requires SCTP to run over OpenVPN to overcome the OVS firewall in Neutron.
Tables V and VI report the time taken during each phase of the VM and the Container migration of the MME component. The impact due to lightpath length and flavor type variations is more noticeable in the Live phase (disk and memory copy phase) of the VM migration. Fig. 12 shows that, for the virtualized MME, the Container downtime is almost double that of the VM. As specified earlier, for the Docker Container, the MME service is paused at the Checkpoint initiation and it is resumed after the Restore procedure, which contributes to the longer downtime value. As mentioned earlier, for the MME VM, the downtime is primarily influenced by the virtual interface bridge and port reconfiguration.
The comparison between the OpenROADM testbed and the CloudLab testbed confirms the influence of system configuration characteristics when tuning the migration parameters. As shown in Fig. 13, the VM downtime is longer than the Container downtime in the CloudLab testbed, in contrast with the opposite trend observed in the OpenROADM testbed. The geographical location combined with the OpenVPN configuration is the prime factor behind this observation for the VM in the CloudLab. Considering the MME Docker Container service, the CloudLab testbed, with its better underlying hardware architecture, achieves a shorter downtime than the OpenROADM testbed. In the OpenROADM environment, the MME service downtime is reduced by a thin margin when the Floating IP is configured in the MME application instead of the OpenVPN IP.
For an attached UE, the OAI-based UE service is not disrupted by the MME migration as long as no handover or tracking area update signaling is required.

D. Migration Analysis of the SPGW Component
In this section, in addition to the migration time and downtime of the SPGW component, the end-user service interruption is assessed through the UE service recovery time performance indicator.
Fig. 14 presents the migration time of both the VM and Container running a virtualized SPGW in the OpenROADM testbed. The migration time of the Container is significantly less than that of the VM. For the VM, the image sizes for the SPGW Small and Medium flavors at the time of migration are 3.53 GB and 3.87 GB, respectively. For the SPGW Docker Container, the metadata size is on the order of 100 MB, independent of the flavor type used. Tables VII and VIII report the time taken by each phase of the SPGW VM and Container migration. For the VM migration, the flavor size and the lightpath influence the Pre-live and Live migration phases; however, no change is observed for the Post-live migration phase. For the Container migration, the Checkpoint and Restoration time for the SPGW application is lower than that of the HSS service (metadata size 173 MB) and higher than that of the MME service (metadata size 42 MB), in line with its intermediate metadata size (100 MB). This shows that the time to Checkpoint and Restore an application depends on various factors, such as dumping the opened files, pages, core and task information, etc.
Fig. 16 shows the downtime values during the SPGW migration in the OpenROADM testbed. More interestingly, and in contrast to the HSS and MME downtime analysis, a modest downtime variation can be observed for the SPGW VM migration as the flavor size increases. This is mainly due to the combined influence of the dirty page copy time and the virtual network interface reconfiguration. For the SPGW application, the main memory usage is intensified by the continuous uplink data requests generated by the UE ping. As the number of CPU cores increases with the flavor size upgrade, the dirty page copy time decreases, which in turn helps reduce the SPGW downtime for the larger VM flavor.
Fig. 17 compares the SPGW downtime collected in both the OpenROADM and CloudLab testbeds. During the SPGW migration, the UE service is temporarily paused. In this case, the SPGW Container downtime is presented for two cases: i) downtime for an existing UE (which has ongoing user data traffic); and ii) downtime for a new UE connection (which initiates the attach procedure while the SPGW service is unavailable). The downtime for the existing UE is significantly less than that of the new UE connection. This is because the existing UE connectivity is re-established (with the support of the newly contributed GTP utility program specified in Sec. III-A) once the GTP tunneling information along with the network is restored at the destination node. However, in the case of the new UE connection, the SPGW needs to be restored successfully with the control plane socket update, which increases the downtime value for the Container.
Regarding the UE service recovery time (SRT), as the flavor size increases, the application performance improves and the UE connectivity is regained faster for both the VMs and the Containers. Moreover, for the Container, no significant impact on the UE SRT is observed due to the lightpath length change. This is because the round trip time between the physical server and the VNFs in the compute node is less than one millisecond. Fig. 19 compares the UE SRT measured in the two testbeds. The SRT is 10 seconds when OpenVPN is used in both testbeds. The reason is that the network connectivity restoration with OpenVPN takes about 8 seconds, as depicted in Fig. 17. After restoration, about 2 extra seconds are required to route the UDP encapsulated GTP traffic through the OpenVPN server. The SRT reduces to 6 seconds when Floating IP is used in the OpenROADM testbed.
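The SRT figures quoted above can be summarized as a simple decomposition. The symbols below are introduced here for illustration only and do not appear in the original text: $T_{\mathrm{restore}}$ denotes the network connectivity restoration time with OpenVPN and $T_{\mathrm{route}}$ the extra time needed to route the GTP traffic through the OpenVPN server.

```latex
T_{\mathrm{SRT}}^{\mathrm{OpenVPN}} \approx T_{\mathrm{restore}} + T_{\mathrm{route}}
  \approx 8\,\mathrm{s} + 2\,\mathrm{s} = 10\,\mathrm{s},
\qquad
T_{\mathrm{SRT}}^{\mathrm{Floating\,IP}} \approx 6\,\mathrm{s}
```

With these values, using Floating IP in the OpenROADM testbed shortens the SRT by about 4 seconds, i.e., roughly 40% of the OpenVPN value.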

VI. CONCLUSIONS AND FUTURE STUDY
This paper reports the first set of public experiments on an NFV-enabled mobile network comprising a backhaul fiber optics transport network that is entirely built with the latest OpenROADM compliant equipment and SDN control technology. Through the single point of coordination provided by the PROnet Orchestrator module (for joint control of the backhaul optical layer, the Ethernet layer, and the compute resources), live migration of three EPC components (HSS, MME, and SPGW), virtualized through either VM or Container technology, is experimentally achieved without permanent loss of UE connectivity. To successfully carry out the Container live migration of the three EPC components, the authors designed and developed a number of custom functions that overcome the limitations of the current OAI and CRIU open software packages. These functions support migration of the end-points for both the GTP and SCTP connections that are employed by the SPGW and CU-MME, respectively. The newly added software packages and upgrades are also tested on the federated CloudLab testbed, which provides a third-party and open platform for independent compliance validation of the said software.
Experimental results obtained using both testbeds (the OpenROADM and the CloudLab) are reported and discussed.Specifically, migration time and service downtime performance indicators for the two virtualization technologies (VM and Container) are compared, while accounting for a number of system factors like flavor type of the computing instances (compute, memory, and storage capacity), length of the temporary lightpath created between two compute sites and used to expedite the migration of the EPC component of interest, and type of network interface (OpenVPN and Floating IP) applied.It is shown that fine tuning of these factors may be required to achieve optimal performance.
Outside the scope of this paper and a possible subject of future studies, live migration of virtualized CU/DU (vCU/vDU) using both the VM and Container technologies is an additional critical functionality required in C-RAN. With the 3GPP recommended functionality split options, the (OAI) vCU/vDU modules must cope with the backhaul transport network dependency and meet the desired mobile network service latency and throughput. For example, the CU interacts with the MME using the SCTP transport layer protocol, for which support can be provided in the CRIU code as described in this paper. However, the CRIU code and related kernel level changes must be handled carefully to specifically account for the network requirements (e.g., fronthaul latency) as dictated by the split option chosen for the vCU/vDU pair. Another aspect that remains to be investigated is the possible reduction of Container service downtime by means of the lazy migration technique.
While a number of open challenges remain to be addressed before achieving a completely flexible virtualized C-RAN solution that is capable of supporting live migration of all of its components, the contribution of this paper takes C-RAN a step closer to the ultimate goal of enabling power management, load balancing, and fault tolerance in the cloud environment assigned to support the needed NFV.

Fig. 2: GTP uplink IP traffic from the UE to the Internet
Fig. 4 shows the block diagram of the OpenROADM testbed configuration used to investigate both the KVM and CRIU based migration procedures. Two racks of Stampede compute nodes are connected through an optical transport (backhaul) network comprising only OpenROADM compliant equipment. The virtualized EPC software components (HSS, MME, SPGW) are first executed on the left rack (Rack 1). Once triggered, the live migration of either the VM or Container that supports one of these EPC components takes place over a dedicated optical circuit (lightpath) that is dynamically created between the two racks to form a temporary high-speed connection in the management network to expedite the migration procedure between racks. The optical transport (backhaul) network consists of two OpenROADM nodes provided by Ciena (6500) and Fujitsu (1FINITY) for routing lightpaths between the two racks or compute sites. Transmission and reception of Ethernet client signals across the optical transport network are realized by deploying OpenROADM compliant Fujitsu (1FINITY) T300 100G Transponder and Juniper ACX6160-SF Transponder for the tenant network, and Fujitsu (1FINITY) F200 1G/10G/100G Switchponder and ECI Apollo OTN OpenROADM switchponder for the management network. The optical equipment is controlled by the open source optical network controller TransportPCE version 2.0.0, which

Fig. 6 reports the migration time of both VM and Container running a virtualized HSS instance in the OpenROADM testbed for three lengths of the lightpath connecting the two racks (compute sites) and the two image flavor types in Table II, respectively.

Fig. 8: HSS - VM and Container downtime for three lightpath lengths and two flavor types.

Fig. 12: MME - VM and Container downtime for three lightpath lengths and two flavor types.

Fig. 14: SPGW - VM and Container migration time for three lightpath lengths and two flavor types.

Fig. 15 compares the SPGW migration time collected in both the OpenROADM and CloudLab testbeds. This result confirms that, as for the HSS and MME migration times, the CloudLab testbed migration time is considerably shorter than the OpenROADM migration time due to the system factors. In addition, from the CloudLab results, the migration time of the SPGW VM is almost double that of the SPGW

Fig. 18: UE SRT for three lightpath lengths and two flavor types.
Fig. 18 reports the UE Service Recovery Time (SRT) of both the VM and Container during the SPGW migration in the OpenROADM testbed. The UE SRT value is less for the Container than that of the VM for all lightpath lengths and flavor types due to the newly contributed GTP utility program at the Container.
software are two key enablers for 5G virtualization. The Non-Standalone version of the 5G mobile
S. Ramanathan, T. Zhang, B. Mirkhanzadeh, M. Razo, M. Tacca, and A. Fumagalli are with the Erik Jonsson School of Engineering and Computer Science, Open Networking Advanced Research (OpNeAR) Lab, The University of Texas at Dallas, Richardson, TX, USA.
K. Kondepu is with the Indian Institute of Technology Dharwad, Dharwad, India. L. Valcarenghi is with Scuola Superiore Sant'Anna, Pisa, Italy. The updated version is published in IEEE Access.

TABLE I: System configuration details.

TABLE II: OpenStack flavors for the experimentation

TABLE III: HSS - VM migration time breakdown

TABLE IV: HSS - Container migration time breakdown

TABLE V: MME - VM migration time breakdown

TABLE VI: MME - Container migration time breakdown
" stored in its metadata. In contrast, the HSS metadata stores the user database information and, as the amount of UE client information increases, the HSS metadata size increases further.

TABLE VII: SPGW - VM migration time breakdown

TABLE VIII: SPGW - Container migration time breakdown