Proactive Stateful Fault-Tolerant System for Kubernetes Containerized Services

Recently, the development of the Kubernetes (K8s) containerization platform has enabled cloud-based, lightweight, highly scalable, and agile services in both general and telco use-cases. Ensuring highly available, reliable, and continuous containerized services is a major requirement for service providers to deliver a fault-tolerant, transparent service experience to end-users. To satisfy this requirement, fault prediction and proactive stateful service recovery features must be applied in cloud systems. Prior proactive failure recovery approaches mostly focused on either improving fault prediction performance with different machine learning time-series forecasting techniques or optimizing recovery service placement after fault prediction. However, a mechanism that enables stateful containerized service migration from the predicted faulty node to a healthy destination node has not been studied. Service migration in previous proactive works is only simulated or is performed by virtual machine (VM) migration techniques. In this paper, we propose a proactive stateful fault-tolerant system for K8s containerized services that pipelines a Bidirectional Long Short-Term Memory (Bi-LSTM) fault prediction framework and a novel K8s stateful service migration mechanism for service recovery. Experimental results show how the Bi-LSTM model improves prediction performance compared with other time-series forecasting models used in prior proactive works. We then combined the Bi-LSTM fault prediction framework with both the default K8s migration mechanism and our stateful migration mechanism. The comparison between these two proactive systems demonstrates our system's efficiency in reducing the Quality of Service (QoS) violation percentage and the service recovery time.

However, most of these techniques only support the Docker container runtime, which is not enough because container services are normally deployed and managed by a container orchestration platform. Considering that K8s is the most popular container platform nowadays, stateful migration techniques need to be integrated into it. K8s has its own StatefulSet feature to support stateful services. However, this feature only retains the state of the services' data storage. To avoid long startup times and services restarting from scratch when recovering a service at a new node, a mechanism that can checkpoint and restore the in-memory booting configuration state and running task execution state should be integrated into K8s.

Apart from the default K8s solution, there are only two studies that integrated the stateful migration feature into K8s. The first work [12] only snapshots and transfers the volume that stores the service data. To retain the in-memory state, the application might need to be redesigned to store this in-memory state on this volume. This design might cost application developers considerable effort to redesign their applications; therefore, it is not practical. The second work [13] proposed MyceDrive, which creates an agent in each container and an additional container per application pod. The more pods/containers are deployed in the system, the more resource overhead this framework will cause. Besides, the scope of these two works is limited to single-cluster scenarios. In real cloud-edge deployments, a multi-cluster solution is required.

For these reasons, in this paper, we propose a proactive stateful fault-tolerant system for K8s containerized services. The contributions of this work are as follows:

• A novel K8s-integrated stateful service migration mechanism that adds in-memory booting and running state support besides the default storage state support.

• An architecture of a proactive stateful containerized service recovery system that pipelines a Bi-LSTM fault prediction framework (with the resource overload fault as the example use-case) and the K8s stateful migration framework to avoid service QoS latency violations.

• Our experimental results confirm the effectiveness of combining these two frameworks in avoiding QoS latency violations compared with previous machine learning techniques and the default K8s migration method.

The rest of this paper is organized as follows: Section II discusses the related work. Section III describes the proactive stateful fault-tolerant system. Section IV shows our system implementation and evaluation. Finally, Section V concludes the paper.

For the preemptive service recovery stage, multi-objective decision-making algorithms were utilized to select the optimal placement for the migrated services. In [14], after getting the overheating node prediction from the fault prediction stage, integer linear programming is used to choose the destination node that maximizes service providers' profit and minimizes migration cost. In [4], after user mobility prediction, another integer linear programming model is used to choose a destination VM that maximizes the accepted service requests and minimizes user latency. Another decision-making algorithm popularly used by prior approaches is particle swarm optimization. It was used in [15] and [7] to migrate the service from the overheated/overloaded node to a new node that minimizes migration cost and maximizes resource utilization. Besides, reinforcement learning is also a well-known method that combines both proactive fault-tolerant stages into one model [8], [9]. Other prior proactive approaches either use fault prediction models to predict healthy nodes for migration [2], [3] or use a greedy-based algorithm [16]. Based on this literature review, we notice that all these previous works focused on service placement algorithms and did not consider the service migration mechanism that executes these placement decisions. The service migration process is either simulated or simply stated to be based on VM live migration techniques. Hence, in this paper, instead of addressing the well-studied service placement problem, we focus on a stateful service migration mechanism for containerized systems that aligns with the containerization trend for cloud applications.

Since no stateful migration techniques for container applications were mentioned in prior proactive fault-tolerant works, in this part, we discuss some current standalone container stateful migration approaches. Machen et al. [18] proposed a multi-layer framework for live container service migration. The framework copies the base layer, which contains the operating system (OS) and kernel, to all nodes. When a container service needs to be migrated, the application layer, which contains the idle version of the service, is migrated first during runtime. Then the service is suspended, and only the instance layer needs to be transferred to the destination node. This splitting method reduces migration downtime because only the state needs to be transferred. However, this framework only supports the LXC container runtime. Checkpoint and Restore In Userspace (CRIU) [19] is another approach that addresses the container stateful migration problem by performing the checkpoint and restore processes in user space via available kernel interfaces. The checkpoint process uses the ptrace system call [20] to control the execution of a process. Then, it injects parasite code to dump the memory pages of the process into image files from within the process's address space. The container can be restored on another node with its previous state using these dump files, and it keeps the same process identifier it had before checkpointing. Thanks to CRIU's support for many popular container runtimes such as Docker, Containerd, and runC, it is widely adopted by many containerized service migration works.
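For readers unfamiliar with CRIU, the dump/restore cycle can be driven from its command line; the Go sketch below simply shells out to the criu binary. This is a conceptual illustration only: container runtimes integrate CRIU through its RPC library rather than the CLI, and the PID and image directory here are placeholders.

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
)

func main() {
	pid := 1234               // PID of the process tree to checkpoint (placeholder)
	dir := "/tmp/criu-images" // directory that receives the dump image files

	// Freeze the process tree and dump its memory pages and kernel state
	// into image files. --shell-job is required for terminal-attached jobs.
	dump := exec.Command("criu", "dump", "-t", fmt.Sprint(pid), "-D", dir, "--shell-job")
	if out, err := dump.CombinedOutput(); err != nil {
		log.Fatalf("dump failed: %v\n%s", err, out)
	}

	// Recreate the process tree from the image files, resuming it with the
	// same PIDs and execution state it had at dump time.
	restore := exec.Command("criu", "restore", "-D", dir, "--shell-job")
	if out, err := restore.CombinedOutput(); err != nil {
		log.Fatalf("restore failed: %v\n%s", err, out)
	}
}
```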
The ARNAB system in [21] is one such CRIU-based migration work. If the in-memory booting and running states are not retained during migration, the service will be restarted from scratch. Examples are live-stream video services, multimedia processing services (video and image converters or analyzers), deep learning services, etc. These booting and running states have not been supported by K8s yet but can be retained by CRIU.

To the best of our knowledge, at the time of writing this paper, there were only two works that integrated stateful service migration into K8s. In the first work [12], the authors only focused on the storage state. They proposed to use the OverlayFS file system to snapshot the persistent volume to retain the state of the data inside it. To utilize this solution for the booting and running states, application developers might be required to redesign their services' application layer to dump these in-memory states into the persistent volume, which is not practical. The second work [13] proposed MyceDrive, which builds on DMTCP, a technique that also supports retaining the in-memory booting and running states. This work's solution requires a DMTCP container running in each pod and an execution agent in each container to perform the stateful migration process. This design might create significant resource overhead when many pods and containers are running in a large containerized system. Besides, these two prior approaches only consider single-cluster scenarios. Stateful migration support for multi-cluster scenarios should also be considered, especially in geo-distributed environments where applications are normally deployed over different clusters.

Therefore, we propose our own K8s-integrated stateful migration mechanism that supports both single- and multi-cluster scenarios. Our solution utilizes the default K8s StatefulSet feature to retain the storage state and integrates CRIU into K8s to retain the booting and running states. Moreover, this paper is the first work that integrates a K8s stateful migration mechanism into a proactive fault-tolerant system. We evaluate the benefits of the K8s stateful migration technique to a proactive fault-tolerant system by comparing it with the default K8s migration system.

Unlike the LSTM unit, the GRU unit has only two gates: a reset gate and an update gate. The update gate of the GRU unit aggregates the LSTM unit's forget gate and input gate. While the update gate decides how much previous data can be used in the future, the reset gate decides how much data can be removed. Unlike the LSTM, the GRU unit does not use a cell state; it processes the input data and the previous hidden state through its two gates to generate the output. CNN-LSTM, on the other hand, stacks an LSTM network on top of a CNN network. The CNN network consists of three layers, namely convolutional, pooling, and fully connected layers. It is used to discover the ordered relationships in time-series data before feeding the processed input into the LSTM network [17]. As our work only aims to apply current state-of-the-art time-series prediction models and does not focus on deep-learning algorithms, we only briefly introduce the architecture of these models. More details about these models can be found in the related papers.
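For reference, the GRU unit just described can be summarized with the standard textbook equations (this is the general formulation, not reproduced from the cited works):

$$
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) && \text{(update gate)}\\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) && \text{(reset gate)}\\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h\,(r_t \odot h_{t-1}) + b_h\right) && \text{(candidate state)}\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(hidden state/output)}
\end{aligned}
$$

where $x_t$ is the input at time $t$, $h_{t-1}$ is the previous hidden state, $\sigma$ is the sigmoid function, and $\odot$ denotes element-wise multiplication. There is no separate cell state, matching the description above.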

The Bi-LSTM model can be used to solve any time-series prediction problem. Since the data used to predict different kinds of faults, such as CPU, memory, disk input/output, or network bandwidth faults, can be collected and modeled as time-series data, as shown in several previous works [2], [3], [14], [15], the prediction problems for these kinds of faults are similar. Hence, we applied the Bi-LSTM model only to CPU overloading fault prediction as the representative use-case for other faults.
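To make the one-step-ahead setup concrete before the training details below, here is a minimal Go sketch of the sliding-window data preparation; the helper and its names are illustrative, not the authors' published code.

```go
package main

import "fmt"

// makeWindows turns a (normalized) CPU-usage series into supervised
// one-step-ahead samples: each input is a window of n past time steps and
// each target is the value at the step that follows the window.
func makeWindows(series []float64, n int) (inputs [][]float64, targets []float64) {
	for i := 0; i+n < len(series); i++ {
		inputs = append(inputs, series[i:i+n])
		targets = append(targets, series[i+n])
	}
	return inputs, targets
}

func main() {
	// Toy series of CPU usage fractions; the paper uses n = 12 past steps.
	cpu := []float64{0.31, 0.35, 0.40, 0.52, 0.61, 0.58, 0.66}
	inputs, targets := makeWindows(cpu, 3)
	fmt.Println(len(inputs), targets) // 4 samples: [0.52 0.61 0.58 0.66]
}
```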

2) MODEL TRAINING
We used the VM workload dataset from the Bitbrains cloud [28] to train our model. The Bitbrains dataset records CPU, memory, network, and disk input/output values over a two-month duration. Since our goal is to predict CPU faults, we chose the CPU metrics, in terms of percentage, from the dataset. We split the dataset into a training set and a test set with ratios of 80% and 20%, respectively, then normalized them to the range (0, 1) before training. After that, we prepared historic sub-sequences from the normalized dataset using the sliding window method as input sequences for our models, which use n past time steps as inputs to predict the next time step in one-step-ahead prediction. For this dataset, we used a sliding window of 12 past time steps to predict the next time step. The number of past time steps and many other hyperparameters are fine-tuned.

The fault prediction framework acts as a trigger mechanism that defines when the system needs to migrate the container service from the node that was predicted as the faulty node to a healthy destination node. In our framework design, the real-time data used for prediction is also stored in an external database.

The application state in this paper is the process information that is dumped into an "image" file and can be restored so that the service can be resumed at the exact moment before reallocation. In general, the pod migration process is performed via the following basic steps: 1) a request to migrate a pod is sent to the system; 2) the request is verified by the cluster control plane; 3) the pod state is checkpointed at the source node and saved to shared storage; and 4) the pod is restored from the checkpoint at the destination node (the detailed workflow is described below).

Based on the K8s architecture, we designed and developed the novel stateful migration as an extended K8s feature to achieve the goal mentioned above. Our changes to the default K8s platform are as follows:

• We developed the migration API converter, which lets the K8s cluster listen to an external computing engine, such as the fault prediction framework in our case, to execute the migration process.

• We developed the Pod migration operator, which helps the K8s cluster control plane verify the migration request. The request describes which application should be migrated and from which node to which node it will be moved, inside or outside of the cluster.

• We developed the Pod migration executor at every worker node. We did this by extending the K8s node agent (kubelet).

At the destination node, where the pod is requested to be migrated to, the executor is invoked to perform the initial steps and then waits to restore the pod from the checkpoint information received from the source node.

The migration API converter translates an external migration request into the K8s API server language. Every time a migration request is sent to the migration API converter, it adjusts the pod-migration custom resource definition (CRD). The pod-migration CRD controller watches this custom resource type and takes application-specific actions to make the current state match the desired state in that resource. For example, in the case of checkpointing, the controller monitors the state of the pod. If the application has not been checkpointed, the controller sends a trigger to the migration executor to checkpoint the pod by changing the pod metadata.

K8s uses kubelet as an agent that runs on each node to manage pods. It is responsible for creating, terminating, and updating pods. However, it currently cannot capture and resume pod state. Hence, to enable these features, a plugin that can checkpoint and restore containers should be installed and integrated into kubelet at every worker node.
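To illustrate what the pod-migration custom resource might carry, here is a hypothetical Go sketch of its spec, pieced together from the description above; the field names are ours, not the published schema.

```go
package v1alpha1

// PodMigrationSpec sketches the desired state carried by the pod-migration
// custom resource: which pod to move, between which nodes/clusters, and
// which action the executor should take next. Field names are illustrative.
type PodMigrationSpec struct {
	PodName     string `json:"podName"`               // application pod to migrate
	Namespace   string `json:"namespace"`             // namespace of that pod
	SourceNode  string `json:"sourceNode"`            // node predicted to fail
	DestNode    string `json:"destNode"`              // healthy destination node
	DestCluster string `json:"destCluster,omitempty"` // set for multi-cluster moves
	Action      string `json:"action"`                // "checkpoint", "restore", or "status"
}

// PodMigrationStatus is what the CRD controller reconciles against the spec.
type PodMigrationStatus struct {
	Phase string `json:"phase"` // e.g. "Pending", "Checkpointed", "Restored"
}
```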

In this work, we created this kubelet plugin by leveraging the CRIU project and the Container Runtime Interface (CRI) extension in [30]. To support multiple container runtimes, K8s defines the CRI, an interface that any container runtime can implement to be compatible with K8s. The CRI contains two interface definitions as gRPC [31] services. The first is RuntimeService, which is used for managing pod sandboxes and containers. The second is ImageService, which is used for pulling images from storage. Currently, the container management methods defined in RuntimeService include CreateContainer, StartContainer, StopContainer, RemoveContainer, ListContainers, ContainerStatus, etc. There is no method for checkpointing and restoring containers. Therefore, we extended the CRI by defining two new methods in the RuntimeService definition: CheckpointContainer and RestoreContainer. The CheckpointContainer method snapshots the container's running state, and the RestoreContainer method enables container restoration. With these two new extended methods defined in the CRI as gRPC services, kubelet can request the container runtime side to checkpoint and restore containers.
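A minimal Go sketch of the extended runtime interface as seen from the kubelet side follows; the signatures are simplified stand-ins for the generated gRPC API, not the upstream CRI definitions.

```go
package cri

import "context"

// RuntimeService mirrors a slice of the CRI runtime interface, extended with
// the two methods described above. Signatures are illustrative only.
type RuntimeService interface {
	CreateContainer(ctx context.Context, podSandboxID string, cfg *ContainerConfig) (string, error)
	StartContainer(ctx context.Context, containerID string) error
	StopContainer(ctx context.Context, containerID string, timeoutSec int64) error
	RemoveContainer(ctx context.Context, containerID string) error

	// Extensions: dump a running container's state into checkpointDir, and
	// create a new container from a previously saved checkpoint.
	CheckpointContainer(ctx context.Context, containerID, checkpointDir string) error
	RestoreContainer(ctx context.Context, podSandboxID, checkpointDir string) (string, error)
}

// ContainerConfig stands in for the full CRI container configuration message.
type ContainerConfig struct {
	Name  string
	Image string
}
```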

To call these two new RuntimeService checkpoint and restore methods from the kubelet, we designed a method to handle the pod migration requests sent from the API server. We extended the kubelet's syncPod function with two new handler functions: the Checkpoint handler and the Restore handler. With these functions, the kubelet agent at each node can read the requested migration action type (restore, checkpoint, or check pod status) from the pod annotation [32] and ask the remote CRI to perform the corresponding actions. Note that by using the pod annotation to provide the migration action information, there is no need to define any additional API objects or pod specifications, thereby reducing complexity.

Figure 4 shows the components of the migration executor: the extended kubelet at a worker node, which performs the pod migration tasks that the migration operator from the control plane assigns to the corresponding node.

Transferring the checkpoint directly between two nodes requires file-sharing connections: one node must be configured as a file server to perform a migration, and the other needs to mount the shared folder. Therefore, a dedicated file server might be a better solution. Both the source node and the destination node mount the same volume from the file server to exchange the checkpoint. Furthermore, the checkpoint can then be shared between every node in the cluster or even between multiple clusters. For simplicity, in our implementation, we use a Network File System (NFS) server to store the checkpoint information.
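As a sketch of the annotation-driven dispatch just described, consider the following; the annotation key and the types are hypothetical, since the paper does not publish the exact names.

```go
package kubelet

import (
	"context"
	"fmt"
)

// Hypothetical annotation key carrying the requested migration action.
const migrationActionKey = "podmigration/action"

// checkpointRestorer is the minimal slice of the extended CRI the handlers need.
type checkpointRestorer interface {
	CheckpointContainer(ctx context.Context, containerID, dir string) error
	RestoreContainer(ctx context.Context, podSandboxID, dir string) (string, error)
}

// handleMigrationAction mimics the extended syncPod logic: read the action
// from the pod annotations and invoke the matching CRI extension method.
func handleMigrationAction(ctx context.Context, rt checkpointRestorer,
	annotations map[string]string, id, dir string) error {
	switch annotations[migrationActionKey] {
	case "checkpoint":
		return rt.CheckpointContainer(ctx, id, dir)
	case "restore":
		_, err := rt.RestoreContainer(ctx, id, dir)
		return err
	case "", "status":
		return nil // nothing to do, or just report the pod status
	default:
		return fmt.Errorf("unknown migration action %q", annotations[migrationActionKey])
	}
}
```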

2) STATEFUL MIGRATION WORKFLOW
Figure 5 illustrates the pod migration workflow in detail, showing how the existing and extended components work together to perform the service migration process. If an application is deemed necessary to migrate to another location, the request is sent to the migration API converter, which translates it into a K8s API object. The K8s API server can then understand and extract this request to change the related migration CRD. The migration operator watches the migration CRD information to verify which nodes and which actions are requested. The requirement translated by the migration operator then goes through the API server again, and the migration executor at the appropriate node watches this requirement via the API server to perform the checkpoint or restore process. In the checkpoint process, the worker node agent checkpoints the container state and finally stores this data in a shared DB for transfer to the destination node. In the restore process, the worker node agent, as the extended kubelet, first initializes a pod with the same metadata as the source pod. Then, it waits until the container state is fully created and saved to the shared DB. Finally, it pulls the container state from the shared DB and restores the application pod on this worker node. Additionally, migration between multiple clusters is supported by exposing the migration API converter as a RESTful API. For example, if the fault prediction framework decides that the application needs to be migrated to another cluster, it simply sends a checkpoint request to the source cluster and a restore request to the destination cluster.
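A sketch of that cross-cluster call sequence from the prediction framework's point of view is shown below; the endpoint paths and the request body are hypothetical, as the converter's actual REST API is not reproduced here.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
)

// migrateAcrossClusters sketches the multi-cluster flow described above:
// ask the source cluster's converter to checkpoint the pod, then ask the
// destination cluster's converter to restore it from the shared checkpoint.
func migrateAcrossClusters(srcConverter, dstConverter, pod string) error {
	body := fmt.Sprintf(`{"pod":%q}`, pod)
	for _, step := range []struct{ base, action string }{
		{srcConverter, "checkpoint"},
		{dstConverter, "restore"},
	} {
		resp, err := http.Post(step.base+"/migrations/"+step.action,
			"application/json", bytes.NewBufferString(body))
		if err != nil {
			return err
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return fmt.Errorf("%s failed: %s", step.action, resp.Status)
		}
	}
	return nil
}

func main() {
	if err := migrateAcrossClusters("http://cluster-a:8080", "http://cluster-b:8080", "video-0"); err != nil {
		log.Fatal(err)
	}
}
```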

Our K8s stateful migration framework is made available on GitHub [33].

We define the QoS violation percentage as the ratio of the time during which the service has a higher response time than the accepted QoS to the total running time of the service. We evaluated this fraction of QoS violation over different prediction time step lengths t (the interval between two consecutive predictions made by the system). We also evaluated the performance of our stateful K8s system when enlarging the containerized system by increasing the number of pods in each node.
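Written out, the violation metric above is simply (our notation):

$$
V_{\text{QoS}} = \frac{T_{\text{violation}}}{T_{\text{total}}} \times 100\%,
$$

where $T_{\text{violation}}$ is the cumulative time the service's response time exceeds the accepted QoS latency and $T_{\text{total}}$ is the total running time of the service.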

This second experiment is conducted separately for each type of stateful service, booting-state-dependent and running-state-dependent, due to the different characteristics of these two service types, which will be explained further in the next part.

For the fault prediction model comparison, we took the average values over 100 runs. Figure 7 shows the evaluation score comparison between our chosen Bi-LSTM model and the other baseline models. On average, the RMSE and MAE scores of the Bi-LSTM model are the lowest. This is due to the Bi-LSTM model's ability to learn data dependencies in both the backward and forward directions.
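For completeness, the two scores are the standard definitions:

$$
\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}, \qquad
\text{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|,
$$

where $y_i$ is the observed CPU usage, $\hat{y}_i$ the predicted value, and $N$ the number of test samples; lower is better for both.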

For the comparison between the stateful and default proactive systems, we used our Bi-LSTM model for the fault prediction framework, as it had the best performance results. We analyzed the comparison results for the two service types mentioned above.

It must be noted that only the booting states and running states of these corresponding service types are transferred in our stateful service recovery migration process. The storage states, which are the states of the data stored in each pod's local Persistent Volume, do not need to be transferred. These Persistent Volumes at all nodes are never deleted, and they always synchronize with the main remote storage of each service at the NFS servers, as explained in our Experiment Setup part. We set it up this way for simplicity of the experiment; in production, a framework such as Longhorn [38] can be used for quick storage recovery between clusters. When a pod is migrated to a new node, it can simply re-attach to its corresponding local Persistent Volume thanks to the K8s StatefulSet functionality. In the case of the MongoDB and Redis services, this storage state is the current data in these services' databases. In the case of the FFMPEG and CNN training model services, there is no storage state since these services do not need to store any data.

Second, we compared the QoS violation percentage of the two proactive systems, both of which use the same Bi-LSTM model to predict node overloading in the next time step. Two violation types can occur: the first when overloading happens before the migration completes, and the second when the model fails to predict the overloading. For the first case, because the service is migrated at the start of each time step t, considering the service recovery time is m seconds, if overloading happens in the first m seconds of t, the migration process is not yet complete. A QoS latency violation occurs from the overloaded moment until the service is available at the new node. The faster the service recovery, the less QoS latency violation the system suffers. Hence, the stateful system, with its shorter recovery time, has a lower violation rate than the default one at every time step. However, when the prediction time step t becomes too long, the model accuracy decreases significantly, as shown in Figure 8(a) (below 90% for a 2-minute time step and only 80% for a 4-minute time step). This happens because the model predicts the average resource usage value in the next step; if the step is too long, the predicted average value is likely to fall below the overload threshold. For example, if in a 4-minute time interval overloading happens 3 times with a total overloading duration of only 20 seconds, the predicted average value is likely to be less than the threshold. Therefore, no migration is triggered, and because both the stateful and default systems use the same prediction model, they mostly suffer the same QoS violations during overloading periods. This explains why, as the time step increases, fewer violations of the first type and more of the second type occur; hence, both systems' violation rates increase and slowly converge, as can be seen at the 4-minute time step in the figure. The exceptional decrease in the violation rate of the default system between the 30-second and 1-minute time steps is caused by its migration time m (on average 23 s for these two applications) being too close to the 30-second time step t, which makes incomplete migration the dominant violation type. With reasonable time steps (1 or 2 minutes in our experiment), the QoS violation percentage of the stateful system was 2 to 3% lower than that of the default one.

Third, we evaluated whether the stateful K8s proactive system can maintain its performance when the containerized environment becomes larger.
Considering that all nodes in the same cluster share the same dedicated bandwidth link to the NFS servers, we raised this concern because when the number of pods inside each node or the number of nodes inside each cluster increases, more snapshots are transferred over the same link. This causes slower transfer speeds and thus longer transfer times as well as longer service recovery times. As analyzed with the previous figures, a longer service recovery time decreases the QoS violation avoidance capability. Therefore, we propose a solution to this issue: increasing the number of NFS instances and load-balancing the checkpoint snapshots between them from the migration operator at each cluster. Each NFS server-to-cluster connection has a separate bandwidth link. We evaluated this solution's efficiency by sequentially increasing the number of pods in each node and then increasing the number of NFS servers. The experiment was conducted using the MongoDB application with the best configuration found in the previous evaluation, which is a 1-minute prediction time step length and 100 Mbps bandwidth. The results are shown in Figure 10 and Figure 11.

The results show that the service recovery time decreased, and the QoS avoidance capability was regained, as more NFS servers were made available. If each node had a separate connection link to the NFS servers instead of each cluster, the performance would be even better. Therefore, we conclude that the stateful K8s proactive system can retain its performance in a large containerized system with an appropriate number of NFS servers and a dedicated bandwidth link setup.
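The load-balancing step itself can be as simple as round-robin over the configured NFS endpoints; a minimal sketch of the idea follows (ours, not the paper's implementation).

```go
package operator

import "sync/atomic"

// nfsBalancer spreads checkpoint uploads across several NFS servers so that
// concurrent migrations in one cluster do not contend on a single link.
// It assumes servers is non-empty and fixed at construction time.
type nfsBalancer struct {
	servers []string // NFS export addresses, one dedicated link each
	next    uint64
}

// pick returns the next NFS server in round-robin order; safe for
// concurrent use by multiple migration requests.
func (b *nfsBalancer) pick() string {
	n := atomic.AddUint64(&b.next, 1)
	return b.servers[(n-1)%uint64(len(b.servers))]
}
```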

There is one more notable observation in Figure 10. The checkpoint time increase is just 1-2 seconds and is not significant as we increased the number of pods in each node from 4 to 32 (equal to 16 to 128 pods in each cluster). This result shows that our stateful K8s proactive system does not create significant resource overhead when more concurrent pod checkpointing processes happen. The reason is that our system integrates the checkpoint process into K8s itself, which is better than the MyceDrive solution proposed in [13]. The MyceDrive solution requires an additional DMTCP sidecar container running in each pod and an extra Execution Agent running inside each service container. Hence, the more pods/containers in the system, the more resource overhead this solution might incur. We cannot show MyceDrive's performance here because its code is unavailable. Meanwhile, our CRIU-based migration execution agent is integrated inside the K8s kubelet at each node. Therefore, it does not create significant extra overhead when increasing the number of pods.

In contrast to booting-state-dependent services, running-state-dependent services, such as the FFMPEG and CNN model training services in our experiment, have a quick booting time but a much longer service completion time than the read-write data operations of databases. Because of their short booting time when using the default K8s system, and their short restore and transfer times due to the small application size when using the stateful K8s system, there is no significant difference in service recovery time between the two systems for this kind of service. However, the stateful K8s system matters for the QoS avoidance capability, as shown in Figure 12 (QoS avoidance capability comparison between the stateful and default proactive K8s systems with running-state-dependent services) and Figure 13. The stateful system reduced the QoS violation rate by 20 to 22%. This happened because, with such a long service completion time, the migrated service using the default system always violated the QoS latency, even when the fault prediction model correctly forecasted the overloading fault. Every time a migration happened, the default system's services violated QoS. Meanwhile, when using the stateful system, QoS was violated only when the fault prediction made a wrong call or the migration time was high due to the bandwidth bottleneck between the cluster and the NFS server. We already mentioned the solution for this issue in the previous part when evaluating the stateful system's performance in a large containerized environment.

Based on the above results, we conclude that, with a fine-tuned time step length, the proactive stateful system achieves a better QoS violation avoidance rate and a lower migration cost than the default one for different kinds of stateful services.

This paper presented the architecture of a proactive stateful failure recovery system for containerized services, with K8s as the containerization platform, in multi-cluster scenarios. It integrates two stages: fault prediction using a Bi-LSTM model and a novel stateful service migration scheme for K8s services. Our system allows K8s to proactively retain booting and running states to recover services from the moment they were interrupted by a cloud infrastructure failure. Our experimental results showed the proposed system's efficiency against baseline methods over different kinds of state-dependent services. For booting-state-dependent services, the service recovery time is reduced by 50% and the QoS violation percentage by 2-3%. For running-state-dependent services, the service recovery time is reduced by 40-60% and the QoS violation percentage by 20-22%.

For future work, we plan to integrate this approach into specific telecommunication use-cases, in which maintaining the state of user sessions is a vital requirement.