Dynamic Routing and Failure Recovery Approaches for Efficient Resource Utilization in OpenFlow-SDN: A Survey

Software Defined Networks (SDN) is a network paradigm that emerged to offer better network management by separating the network control logic from the data forwarding elements. This separation speeds up network innovation because configuring network elements to forward packets no longer relies on vendor-proprietary interfaces. However, SDN is a flow-driven network: for each arriving flow, a feasible path is computed to deliver the flow to its destination. Afterwards, the SDN control logic generates the corresponding routing rules and instructs the data forwarding elements to install them in their Flowtables to guide the routing process. Unfortunately, the network state changes frequently in dynamic large-scale networks, and the Flowtable has limited space. These challenges require the SDN controller to compute paths more often, which in turn may require many flow routing rules. In addition, the frequency of communication link failures has increased lately. The successful deployment of SDN heavily depends on how well it satisfies reliability requirements with uninterrupted services. Several studies compute the optimal path for forwarding data while meeting Quality-of-Service demands. Other studies focus on reducing the frequency of link failures, and some manage the constrained Flowtable resources. This survey focuses on routing rules placement, unoptimized routing, link and switch load balancing, and failure detection and recovery. The paper extensively discusses each issue and analyses the weaknesses of the current solutions. Finally, it highlights potential challenges that need future research attention.


I. INTRODUCTION
The rapid growth of data centres and the emergence of the Internet of Things (IoT) have increased the number of network-connected devices. Integrating these heterogeneous devices enables humans to interact easily with their surrounding physical world, boosting business growth. Network traffic control in these modern networks is a very complex task that requires accounting for time-varying changes in the network environment over a heterogeneous network. Traditionally, network operators configured the network devices through a command line interface using vendor software. For even subtle network changes, operators struggle with manual reconfiguration, which could account for around 40% of the most typical network operational issues [1]. Unfortunately, the traditional architecture is not designed to enable fine-grained, Quality of Service (QoS) aware traffic engineering over the network. Integrating the network control element with the data forwarding entities complicates network traffic monitoring, leading to less QoS-aware flow control. As a result, inefficient network resource utilization is unavoidable, and network management is quite challenging, hindering network innovation. Software Defined Networks (SDN) emerged to simplify network management and speed up innovation by separating the control plane from the data plane. SDN moved the network state and intelligence to a logically centralized controller. The data plane becomes a simpler forwarding entity that forwards packets to the desired destination.
The associate editor coordinating the review of this manuscript and approving it for publication was Adamu Murtala Zungeru.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The controller interacts with high-level network applications through the northbound interface and manages the network devices through the southbound protocol [2]. This way, network managers leverage SDN controllers to implement new dynamic routing strategies, customized traffic management, dynamic allocation of network resources, and many other programmable functionalities. OpenFlow is the most popular southbound protocol; it hides the complexity of the network devices and exposes a simple Application Programming Interface (API) to the network operators. This way, operators are relieved from the manual configuration of devices and resource management, which accelerates network management and innovation and reduces the risk of errors that may arise during manual network configuration. The merits of SDN have led it to be considered for deployment in Carrier-Grade Networks (CGNs). It has attracted the attention of academia and commercial industries, such as Google, Microsoft, Deutsche Telekom, and Verizon [3].
Despite the advantages of separating the network control logic from the data forwarding entities, SDN raises various concerns about network resource utilization and performance. The communication channel between the control logic and the data forwarding elements introduces extra processing delay and increases the amount of control traffic managed by the network controller. As the network grows, a high traffic volume is exchanged between the two entities. The data forwarding entities introduce another concern: a shortage of storage to accommodate the required traffic flows. QoS flows have stringent routing requirements, which must be met for optimal performance. Fault tolerance is a crucial property for computer network availability. Unfortunately, an increase in communication link failures at the SDN data plane was recently reported.
Many studies have been proposed to compute feasible paths while meeting flow QoS demands and preserving network resources [4], [5], [6]. Similarly, several works address failure recovery in SDN when a failure occurs on the selected path [7], [8], [9]. Other studies survey the existing failure recovery management approaches. Fonseca et al. [10] presented an overview of fault management threats, focusing on each layer in SDN and on the threats posed by each interface between the layers. They further discussed trade-offs between approaches and their suitability for different SDN applications. To ensure SDN reliability, Yu et al. [2] also presented a systematic survey of SDN fault management that evaluates the existing works and notes their weaknesses. They provided in-depth analyses of detecting, localizing, correcting, and preventing faults, and also touched upon fault monitoring, diagnosis, recovery, and repair.
On the other hand, Rehman et al. [11] focus on SDN fault tolerance and classify it into four phases: error detection, damage confinement and assessment, error recovery, and fault treatment and service continuation. These phases were extensively discussed. Error detection highlights how a failure is detected. Damage confinement and assessment presents the extent of the damage caused by faulty components. The error recovery phase explains how to restore the system and maintain a failure-free network. Fault treatment and service continuation discusses how faults are either repaired or the system is reconfigured to avert further fault damage. Finally, additional research gaps were identified. Ali et al. [12] review schemes for link failure recovery and investigate open research questions posed by the centralized network architecture. They analyze restoration and protection recovery, highlight the performance gains of each, and present research directions. In a similar effort, Malik [3] also surveys fault tolerance related work and emphasizes the need to investigate restoration mechanisms.
However, research matures over time, and work on SDN fault tolerance is still in its infancy [11]. There is still a wide range of research opportunities, with questions left unanswered by the research community. This survey investigates routing and failure recovery issues. Failure recovery is greatly affected by several design factors: computing a reliable primary path while meeting different flow demands with efficient resource utilization, the controller's operation mode, and the recovery range of the failed path. Overlooking these factors may lead to poor network resource utilization, high communication overhead, and poor network performance. Moreover, after a network failure occurs, Quality of Service (QoS), load balancing, and post-recovery congestion are other crucial issues that may affect performance in a large-scale network. Dynamic load balancing and multipath forwarding for congestion control are also discussed in this survey.
The transmission medium requires high throughput, resilience, and reliability for efficient data transmission. Multimedia applications such as live video streaming, video gaming, and other delay- and throughput-sensitive flows require an optimal path for routing. To cope with the demands of these multimedia services, optimal dynamic routing is usually computed to achieve better network resource utilization, leading to higher network throughput and better QoS. Unfortunately, communication networks are prone to failure, and the selected path may not always guarantee service availability. Furthermore, a link failure along the chosen route may negatively influence network performance because it may interfere with tasks such as routing. Therefore, dynamic routing decisions should incorporate link quality to minimize the occurrence of link failures. Unfortunately, there is a lack of literature that treats routing problems together with failure recovery in SDN for efficient resource utilization and better QoS. Table 1 summarizes the scope of the existing related surveys. This further motivates us to investigate SDN dynamic routing schemes and failure recovery approaches. This paper surveys routing problems and failure recovery management for efficient resource utilization.
The rest of the paper is organized as follows: Section II presents an overview of Software Defined Networks and discusses their widely used standard protocols. Section III explains the research challenges and problem background related to routing and failure recovery, and discusses routing problems and related works. Section IV surveys failure detection and recovery approaches. Future research directions are discussed in Section V. Finally, Section VI concludes the paper.

II. OVERVIEW OF SOFTWARE DEFINED NETWORKS
SDN is composed of different planes: the Application Plane (AP), the Control Plane (CP), and the Data Plane (DP). The other layers are the Northbound Interface (NBI) and the Southbound Interface (SBI), as shown in Figure 1. Each of the planes and layers has its specific functions. The AP is a set of applications such as routing policy, Quality of Service (QoS), access control lists, load balancing, and others. These applications leverage the NBI to implement network control and operational logic at the CP. The CP mediates between the AP and the DP, thereby managing the entire network through programmability features using the plug-in. This way, it translates the network control functions that the network operator defines as policies at the AP into a set of instructions, in the form of flow entries, for the DP to guide network management. A single CP can manage significant traffic in a small network setting, while distributed controllers can also manage the network when a single controller fails. Details about distributed controllers can be found in [1]. The DP is a set of networking equipment, such as switches and routers, similar to that used in conventional networks. The main difference is that the DP devices are now simple forwarding entities without any embedded control logic to take independent decisions. The network intelligence has been removed so that they focus only on delivering traffic flows. Therefore, for any control function, the DP consults the CP for further action. In return, the CP generates a set of flow entries and instructs the DP to install them in its Flowtable through the SBI. The SBI serves as a critical component for separating the CP from the DP through programmability. Typically, the manufacturing of a new hardware switch can take an average of two years to be ready for commercialization, with upgrade cycles of up to nine months [13]. Developing a new software product may take six months to one year [14]. Therefore, this process takes a lot of time with high investment risk, which may slow network innovation.
As such, the SDN SBI API emerged to offer programmability and provide flexible network management. Forwarding and Control Element Separation (ForCES), Open vSwitch Database (OVSDB), and Protocol-Oblivious Forwarding (POF) were among the early SBI API proposals [1]. ForCES provides flexibility compared to conventional network management without changing the network architecture. However, the CP is potentially maintained in the same network element. In contrast, OVSDB provides more advanced management features for Open vSwitches. POF is among the SBI APIs aimed at enhancing the forwarding performance of the DP. However, these SBI APIs rely on modifying the DP to support Flowtables [1]. This way, the remote entity can flexibly configure the DP through operations such as adding, removing, or modifying flow entries in the Flowtable at the DP. OpenFlow is another SBI API that emerged to facilitate network and control management without altering the SDN architecture. The SDN controller has direct access to and control of the DP through a logical data structure called the Flowtable. The flexibility of OpenFlow attracted not only the research community but also industry [13]. As a result, OpenFlow is widely considered the most popular SBI API for achieving the benefits of SDN. Although OpenState [15] was proposed as an extended version of OpenFlow with more programmable features, aimed at letting the DP devices act independently without consulting the controller, it has not been standardized as an accepted SDN SBI. As such, OpenFlow implements the concept of SDN by abstracting network communications in the form of flows to be processed by the OpenFlow switches with a set of instructions, in other words, flow entries.
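To make the Flowtable interaction described above concrete, the following minimal Python sketch models a switch that forwards on a matching flow entry and consults the controller on a table miss. It is an illustration under simplifying assumptions, not the OpenFlow wire protocol: the names FlowEntry, Switch, Controller, packet_in, and forward, the single "dst" match field, and the static routes dictionary are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FlowEntry:
    match: Dict[str, str]   # header fields that must match, e.g. {"dst": "10.0.0.2"}
    action: str             # e.g. "output:2"
    priority: int = 0


@dataclass
class Switch:
    flowtable: List[FlowEntry] = field(default_factory=list)

    def lookup(self, packet: Dict[str, str]) -> Optional[FlowEntry]:
        # Return the highest-priority entry whose match fields all agree with the packet.
        hits = [e for e in self.flowtable
                if all(packet.get(k) == v for k, v in e.match.items())]
        return max(hits, key=lambda e: e.priority, default=None)


class Controller:
    def __init__(self, routes: Dict[str, int]):
        self.routes = routes      # hypothetical global view: destination -> output port
        self.packet_ins = 0       # how often the data plane had to consult the controller

    def packet_in(self, switch: Switch, packet: Dict[str, str]) -> FlowEntry:
        # Table miss: compute the rule and install it in the switch Flowtable.
        self.packet_ins += 1
        port = self.routes[packet["dst"]]
        entry = FlowEntry(match={"dst": packet["dst"]}, action=f"output:{port}")
        switch.flowtable.append(entry)
        return entry


def forward(switch: Switch, controller: Controller, packet: Dict[str, str]) -> str:
    entry = switch.lookup(packet)
    if entry is None:
        entry = controller.packet_in(switch, packet)
    return entry.action
```

In this sketch only the first packet of a flow reaches the controller; later packets of the same flow hit the installed entry and are forwarded by the switch alone.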

III. CHALLENGES AND BACKGROUND
Various types of SDN applications in large-scale networks generate different traffic flows, such as throughput-sensitive, delay-sensitive, and bandwidth-sensitive flows. Some flows are generated by network protocols (such as ARP and DNS), and interactive applications usually generate latency-sensitive flows. Conversely, scientific computing, MapReduce, and machine migration applications generate throughput-sensitive flows. The recent proliferation of Internet of Things (IoT) devices requires the network architecture to react in real time and to scale to many traffic flows. Forwarding devices are also required to deliver data efficiently for both types of traffic flows with an acceptable delay to meet the demands of CGNs. This way, for every arriving flow, a feasible primary path is computed considering the available network resources. The Shortest Path Algorithm (SPA) is widely used to select feasible paths [16] and achieve efficient network resource usage. The most common way for an SPA to achieve resource efficiency is to limit resource consumption and keep the network load balanced. However, traffic flows differ in their Quality of Service (QoS) requirements, and SPA may not meet the demands of some flows. As such, additional constraints must be imposed to meet the needs of different flows while achieving efficient network resource utilization. To achieve this, the controller needs to i) gain and maintain accurate global network knowledge, ii) compute a feasible path that meets the demands of the given flows while efficiently utilizing the underlying network resources, and finally, iii) install the corresponding rules in the switch Flowtable to guide the routing process. Therefore, the efficiency of path selection heavily depends on how the controller gains and maintains accurate Link State Information (LSI). The existing SPAs used static LSI (i.e., hop count, distance, link capacity) to compute a path [17], [18].
Others considered dynamic LSI (e.g., link utilization, available link capacity) [19], [20]. This way, dynamic routing periodically requests LSI information from underlying network devices. Static LSI significantly reduces communication overhead because the controller does not have to query the network state regularly. However, as the network increases in size, it will cause congestion because it always calculates the same path. In contrast, dynamic LSI yields better performance but at the cost of periodic network state overhead.
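The contrast between static and dynamic LSI can be sketched with a single shortest-path routine that accepts different link cost functions. This is a minimal illustration, not the algorithm of any surveyed work: the topology, the function names, and in particular the cost 1 / (1 - utilization) are assumptions chosen only to show how a load-aware cost steers flows away from busy links.

```python
import heapq


def shortest_path(links, src, dst, cost):
    """Dijkstra over an undirected graph; links maps (u, v) -> attribute dict."""
    adj = {}
    for (u, v), attrs in links.items():
        w = cost(attrs)
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def hop_cost(attrs):
    # Static LSI: every link costs one hop, regardless of load.
    return 1.0


def utilization_cost(attrs):
    # Dynamic LSI (assumed cost): links become expensive as they fill up.
    return 1.0 / max(1.0 - attrs["utilization"], 1e-6)


# Hypothetical snapshot polled from the switches: the two-hop route is heavily loaded.
links = {
    ("s1", "s2"): {"utilization": 0.9},
    ("s2", "s4"): {"utilization": 0.9},
    ("s1", "s3"): {"utilization": 0.1},
    ("s3", "s5"): {"utilization": 0.1},
    ("s5", "s4"): {"utilization": 0.1},
}

static_route = shortest_path(links, "s1", "s4", hop_cost)
dynamic_route = shortest_path(links, "s1", "s4", utilization_cost)
```

With hop count, every flow between s1 and s4 takes the same two-hop route even though it is congested; with the utilization-based cost, the longer but lightly loaded route is selected, at the price of having to refresh the utilization values periodically.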
Reliability is another issue affecting smooth data transmission after the feasible path has been computed. Link or switch failures are reported to occur frequently, on average every 30 min [7]. In addition, some links are likely to fail due to software bugs or hardware malfunctions. Upon failure, all affected traffic flows must be rerouted through backup paths as soon as possible for continued service provisioning. For CGNs, the failure recovery process must be completed within a predefined time interval [18]. The recovery mechanisms in SDN are divided into two categories: restoration and protection. In the case of restoration, the backup path is deployed on demand. As for protection, the backup path is always planned in advance, before failures occur. As a result, disrupted traffic flows can immediately be redirected without incurring extra signaling overhead. Therefore, protection approaches can meet the requirements of the delay-sensitive applications of CGNs. However, the switch Flowtable is stored in Ternary Content Addressable Memory (TCAM), which has limited space [21]. Typically, a Flowtable is populated with correlated primary and backup path rules for fast forwarding. Unfortunately, the number of corresponding forwarding rules can exceed the TCAM capacity.
In contrast, the restoration approach reduces the number of forwarding rules and can adapt to frequent network changes. However, this merit comes at the cost of communication overhead and sometimes longer path computation time. Finding the fastest path could be expensive in terms of the time it takes to compute a new route and modify the Flowtables of the relevant switches [22], [23]. The number of forwarding rules in the Flowtable also contributes to the cost of update operations: switches with many forwarding rules in their Flowtable experience more update operations, which may extend the routing convergence time. Latency is critical for some applications such as VoIP (Voice over IP). Traffic from delay-sensitive application flows must arrive at its destination without considerable flow setup latency, typically within 25 milliseconds [24]. Therefore, path computation and failure recovery approaches using either restoration or protection still face various challenges in the static or dynamic routing and failure recovery process [25].
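The trade-off between protection and restoration can be illustrated with a small sketch. It assumes an unweighted topology, breadth-first search for path selection, and a link-disjoint backup for protection; the function names and the four-switch topology are hypothetical and do not reproduce any specific scheme surveyed here.

```python
from collections import deque


def bfs_path(edges, src, dst):
    """Fewest-hop path in an undirected graph given as a list of (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    prev = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in adj.get(u, []):
            if v not in prev:
                prev[v] = u
                queue.append(v)
    return None


def path_links(path):
    return {frozenset(pair) for pair in zip(path, path[1:])}


def without(edges, removed):
    return [e for e in edges if frozenset(e) not in removed]


def protect(edges, src, dst):
    # Protection: compute primary and a link-disjoint backup BEFORE any failure,
    # so rules for both paths occupy Flowtable space from the start.
    primary = bfs_path(edges, src, dst)
    if primary is None:
        return None, None
    backup = bfs_path(without(edges, path_links(primary)), src, dst)
    return primary, backup


def restore(edges, src, dst, failed_link):
    # Restoration: compute a new path only AFTER the failure is reported,
    # which saves rules but adds controller computation to the recovery time.
    return bfs_path(without(edges, {frozenset(failed_link)}), src, dst)


edges = [("a", "b"), ("b", "d"), ("a", "c"), ("c", "d")]
primary, backup = protect(edges, "a", "d")
restored = restore(edges, "a", "d", ("b", "d"))
```

Here protection pre-installs rules along both paths (six switch entries for this flow, counting one per switch on each path), while restoration keeps only the three primary entries and pays for the path computation and rule installation at failure time.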
For the routing issues, network imbalance and congestion are among the problems that affect static routing, although they can be overcome by dynamic state routing. However, dynamic routing requires substantial overhead to obtain accurate dynamic link state information. In addition, link costs need to be regularly recomputed based on the new network state, while a prior study has shown that the network state changes every 1.5 sec [26]. This could lead to extra pre-computation costs before the actual path computation. As for protection coverage, network topology density differs, which makes full protection very hard when multiple components fail. Typically, failures in SDN can be categorized into two parts: 1) at the controller and 2) at the data plane. At the controller, master and slave controller failure problems occur when one of the controllers runs out of its assigned capacity [27]. Several solutions to the controller failure problem have been introduced over the years [28], [29], [30]. However, multiple controller failure problems are out of the scope of this survey. We focus on failures at the data plane, where component failures can be categorized into four types: single-link failure, single-node failure, multiple-link failure, and multiple-node failure. In a highly dense network topology, some flows may not always have a valid alternative path under multiple component failures, and forcing the controller to reroute the affected flows may lead to routing loops. Hence, such flows may not have full protection coverage, and recovery can be quite challenging when the alternative path is not provisioned efficiently.
Unfortunately, an alternative path may not always be available when needed and could fail earlier than the primary path [26]. The performance of the alternative path is usually affected by many factors. The most notable ones are bandwidth utilization, delay, and packet loss rate, and the worst case is a chosen path that is correlated with the primary path. A failure on such a path may significantly affect many paths in a dense network topology. Therefore, when selecting an alternative path, it is essential to consider the current performance of links and the service availability as the traffic evolves.
Additionally, the switch Flowtable storage constraint is another factor that affects rerouting and the failure recovery convergence time. Most recovery approaches provision flow entries to protect each flow; however, such decisions lead to many entries, which consume Flowtable storage space. As a result, some Flowtable entries will be removed when the storage usage crosses a threshold or is exhausted, leading to a Flowtable scalability concern, especially in large-scale networks. This survey organizes these challenges into four categories, as illustrated in Fig. 2: routing rules placement, unoptimized routing, link and switch load balancing, and failure detection and recovery.

A. ROUTING RULES PLACEMENT
SDN uses flow-based routing instead of the destination-based routing that is widely used in traditional networks. For each arriving traffic flow, the corresponding flow rule is installed in the switch Flowtable memory. The Flowtable is a logical data structure used in OpenFlow switches to handle flows. It is stored in a special high-speed memory on switches called Ternary Content Addressable Memory (TCAM) to ensure matching flexibility and high lookup performance in constant time [31]. However, although TCAM searches flow rules at high speed, it is 400 times more expensive than RAM-based storage and has limited storage space. Besides, the size of each flow rule is 356 bits, much larger than the 60-bit entries used in conventional switches [31].
Moreover, per-flow routing requires many dedicated rules to efficiently route flows to various destinations. As a result, the data forwarding entities need a large TCAM to accommodate many flow entries. To overcome this shortcoming, the SDN controller can reactively install the forwarding rules on demand upon arrival of each traffic flow. Unfortunately, in dynamic large-scale networks where traffic flows frequently change, the number of flow setup requests grows rapidly, increasing the processing load on the SDN controller, which causes extra packet processing delay [32]. This overhead and delay can significantly affect real-time applications and degrade network performance.
Software switches based on commodity servers have recently gained popularity. These switches have a large storage capacity and can handle packets quickly (e.g., 40 Gbps on a quad-core machine) [33]. However, software switches are more constrained in forwarding and lookup rate than commodity switches [34], because they employ general-purpose CPUs for forwarding. In contrast, commodity switches use Application-Specific Integrated Circuits (ASICs) designed for high-speed throughput. In addition, software switches store routing rules in Random Access Memory (RAM), which has a larger capacity; unfortunately, the lookup speed in such a Flowtable is slow. In contrast, the lookup rate in a TCAM-based Flowtable is faster at the cost of limited space. Furthermore, the proliferation of IoT devices further aggravates the TCAM storage problem because of the large number of traffic flows generated by different devices.
To overcome the storage limitation, the SDN controller configures flow rules with a small idle timeout value [35]. The idle timeout is the period of inactivity after which the flow rule is evicted from the switch Flowtable. Unfortunately, traffic flows exhibit variability, with different inter-arrival times and durations, which makes a single idle timeout value inefficient. Alternatively, the controller configures flows with a hard timeout value [36]. The hard timeout is the total life span of a flow rule, after which the rule is declared invalid and removed from the Flowtable. However, a hard timeout may be too long for some flows, mainly those with few or no packets to match. Such flows stay in the Flowtable longer than necessary, occupying precious space. Therefore, the diversity and heterogeneity of IoT devices and the variability of the traffic flows they generate can easily overwhelm timeout-based flow rule placement mechanisms [1]. Various schemes have been proposed to address these challenges. For example, the work in [37] proposed a timeout scheme based on flow inter-arrival time and duration to manage the switch Flowtable efficiently. Another proposal explores the impact of timeout settings on the Flowtable [38]. Therefore, we extensively survey the various schemes proposed in the literature to address the Flowtable memory limitation in the context of the routing rules placement problem.
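The effect of the two timeout types can be sketched with a small single-flow simulation. This is a hedged illustration rather than a model of any surveyed scheme: TimedRule and simulate are hypothetical names, time is in arbitrary units, a value of 0 disables a timeout in this sketch, and the packet arrival times are made up.

```python
from dataclasses import dataclass


@dataclass
class TimedRule:
    installed_at: float
    last_matched: float
    idle_timeout: float = 0.0   # 0 disables the idle timeout in this sketch
    hard_timeout: float = 0.0   # 0 disables the hard timeout in this sketch

    def expired(self, now: float) -> bool:
        idle = self.idle_timeout > 0 and now - self.last_matched >= self.idle_timeout
        hard = self.hard_timeout > 0 and now - self.installed_at >= self.hard_timeout
        return idle or hard


def simulate(packet_times, idle_timeout=0.0, hard_timeout=0.0):
    """Count how many times the rule for ONE flow must be (re)installed."""
    rule = None
    installs = 0
    for t in packet_times:
        if rule is not None and rule.expired(t):
            rule = None                      # evicted before this packet arrived
        if rule is None:
            rule = TimedRule(t, t, idle_timeout, hard_timeout)
            installs += 1                    # table miss -> controller installs the rule
        else:
            rule.last_matched = t            # a hit refreshes the idle timer only
    return installs


# A bursty flow: three packets, a long gap, then three more packets.
arrivals = [0, 1, 2, 10, 11, 12]
installs_idle_5 = simulate(arrivals, idle_timeout=5)
installs_idle_1 = simulate(arrivals, idle_timeout=1)
installs_hard_20 = simulate(arrivals, hard_timeout=20)
```

For this arrival pattern, an idle timeout of 5 frees the entry during the gap at the cost of one reinstall, an idle timeout of 1 causes a controller round trip for every packet, and a hard timeout of 20 needs a single install but keeps the entry in the table throughout the idle gap.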

B. UNOPTIMIZED ROUTING PROBLEM
SDN controllers make routing decisions on behalf of the data forwarding entities using a set of policies. These policies are converted to instructions and sent to forwarding elements such as switches and routers. The endpoint policy determines each flow's source and destination nodes based on the supplied high-level design requirements, while the routing policy determines the flow path. For example, the shortest-path routing policy asks the network forwarding elements to forward packets along the shortest path between two given nodes. To efficiently route flows, the SDN controller must regularly monitor and gain accurate knowledge of the Network State Information (NSI). Such knowledge is used to compute the best path that meets the requirements of the given flows while efficiently utilizing the network resources. Afterwards, the controller places the routing rules in the switches' Flowtables to guide the routing process. The ability to compute a feasible path heavily depends on how fast the controller obtains the NSI. To obtain the NSI, the existing routing algorithms use either static or dynamic link state information [39]. The former calculates the link state information once during the topology discovery stage and computes a feasible path based on hop count, distance, and link capacity, while the latter considers available link capacity, link utilization, or the number of processed flows. With dynamic link state information, the controller periodically queries the underlying network devices to obtain updated network information. Static routing has less overhead because it does not regularly request updated information from switches. However, it always finds the same path for all flows, making it infeasible for some flows with special QoS demands. Besides, it causes congestion on some links while others are underutilized. As a result, researchers improved the routing algorithms using dynamic routing.
Although dynamic routing may adapt to real-time traffic having different QoS demands, there is significant overhead in frequently querying switches to obtain updated network information. Besides, it is very challenging to get an accurate NSI because of the delay and message exchange overhead. An inaccurate NSI may lead to accepting more flows than the network resource can handle, and significant packet loss is unavoidable.
Over the past years, several routing schemes have been introduced to improve dynamic routing while efficiently utilizing network resources. Unicast and multicast are widely used as the two classes of routing. The unicast routing problem is defined as follows: given a source s, a destination t, a set C of best-effort or QoS constraints, and possibly an optimization goal, find a feasible path from s to t that satisfies C while optimizing the goal. The multicast routing problem is defined as follows: given a source s, a set R of destinations, and a set X of constraints, find routes from s to all destinations in R that satisfy X. Residual bandwidth is considered one of the main concerns in QoS route computation. We classified the existing routing problems into three categories, as shown in Fig. 3.
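One common way to write the unicast problem above as a constrained optimization is sketched below. The symbols are assumptions introduced for illustration and are not taken from the surveyed works: G = (V, E) is the network graph, P(s, t) the set of paths from s to t, c(e), b(e), and d(e) the cost, residual bandwidth, and delay of link e, and B and D a flow's bandwidth and delay requirements (one possible instance of the constraint set C).

```latex
\begin{aligned}
\min_{p \in P(s,t)} \quad & \sum_{e \in p} c(e) \\
\text{subject to} \quad & \min_{e \in p} b(e) \ge B, \\
                        & \sum_{e \in p} d(e) \le D .
\end{aligned}
```

Under the same assumed notation, the multicast variant replaces the single path by a tree T rooted at s that reaches every r in R, and requires the constraints in X to hold on the tree path from s to each r. Bandwidth is a bottleneck (min) metric while delay is additive, which is why residual bandwidth plays a central role in pruning infeasible links before a path is selected.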
Other researchers leverage machine and deep learning models to devise routing schemes [40]. Several proposals were introduced to incorporate a deep learning model on SDN [41], [42], [43], [44], [45]. Unfortunately, this paper does not cover machine and deep learning approaches.

1) ROUTING WITH LINK COST AWARE
The Minimum Hop Count Algorithm (MHCA) [46] selects the path with the shortest distance between the given source and destination. MHCA maintains information about each link; only those with enough resources to meet the user's needs are considered for routing. MHCA is a straightforward algorithm and can easily be implemented. However, it might easily lead to a bottleneck for upcoming requests due to inefficient use of network resources. The Widest Shortest Path Algorithm (WSPA) was presented in [47]; it is a modified version of MHCA which chooses the path with the highest amount of bandwidth from the set of available paths. MHCA exhibits a trade-off between load balancing and resource consumption. The Shortest Widest Path Algorithm (SWPA) [48] is another variant of MHCA; it computes the widest possible paths among N paths, and the one with the smallest number of hops is considered optimal. However, most static routing assumes the controller has full knowledge of the network topology and overlooks the dynamic link state information. The controller can calculate the shortest path based on distance or hop count using the topology information discovered at time t1. Unfortunately, such an assumption may not yield optimal performance, especially in dynamic large-scale networks. Dynamic link state routing was devised to overcome the challenges of static routing, and several schemes were introduced to optimize the Quality of Service (QoS) parameters [39], [49], [50], [51]. State-of-the-art dynamic routing focused on minimizing the interference among flows [16]. The authors developed a heuristic path selection algorithm to avoid routing flows through critical links. Critical links are heavily loaded links that consume a significant amount of bandwidth, and routing flows through a path with such links would make it quite challenging to satisfy the future demand of certain flows. The scheme outperforms the minimum hop and widest routing algorithms.
However, traffic flows are often unevenly distributed, so bypassing critical links at time t1 and routing flows through another path may not always guarantee network load balance. The selected path may be overloaded at time t2 and consequently be problematic for multimedia applications that require certain QoS guarantees. As such, it is desirable to incorporate more routing metrics to maximize network resource utilization, since it is the primary concern of the infrastructure provider.
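The differences among the three selection rules discussed above can be sketched over a fixed set of candidate paths. The sketch follows the textbook reading of the algorithms (fewest hops; fewest hops with ties broken by bottleneck bandwidth; largest bottleneck bandwidth with ties broken by hops), which is an assumption about the cited works, and the candidate paths, bandwidth figures, and function names are hypothetical.

```python
def feasible(paths, demand):
    # Keep only paths whose bottleneck (minimum residual bandwidth) meets the demand.
    return [name for name, bw in paths.items() if min(bw) >= demand]


def mhca(paths, demand):
    # Minimum hop count: fewest links among feasible paths.
    return min(feasible(paths, demand), key=lambda n: len(paths[n]), default=None)


def wspa(paths, demand):
    # Widest-shortest: fewest hops first, ties broken by the larger bottleneck.
    return min(feasible(paths, demand),
               key=lambda n: (len(paths[n]), -min(paths[n])), default=None)


def swpa(paths, demand):
    # Shortest-widest: largest bottleneck first, ties broken by fewer hops.
    return min(feasible(paths, demand),
               key=lambda n: (-min(paths[n]), len(paths[n])), default=None)


# Each candidate path is a list of per-link residual bandwidths (e.g. in Mbps).
candidates = {
    "A": [40, 40],          # 2 hops, bottleneck 40
    "B": [80, 60],          # 2 hops, bottleneck 60
    "C": [100, 100, 90],    # 3 hops, bottleneck 90
}

choice_mhca = mhca(candidates, demand=30)
choice_wspa = wspa(candidates, demand=30)
choice_swpa = swpa(candidates, demand=30)
```

For a 30 Mbps request, the minimum hop rule takes the first two-hop path it finds, the widest-shortest rule prefers the two-hop path with more headroom, and the shortest-widest rule accepts an extra hop to preserve the most residual bandwidth, which illustrates the trade-off between resource consumption and load balancing.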
The works in [19], [52], and [52] examined the effectiveness of various routing algorithms for dynamically establishing performance-guaranteed traffic while considering bandwidth and path latency besides the bandwidth rejection ratio. The scheme proved that bandwidth-constrained algorithms using source-to-destination pair information could considerably improve network performance. However, it chooses long routes regardless of network load, which could affect delay-sensitive applications. Their extended work in [53] classified flows based on their level of delay sensitivity, aiming to optimize resource utilization while providing absolute bandwidth and delay guarantees. However, multicast applications require high bandwidth and low delay while maintaining low controller computational overhead. Unfortunately, their solution exhibits high complexity as the number of delay-sensitive application flows increases, which consequently reduces the system throughput. If the network fails to distribute bandwidth among flows correctly, the performance of delivering and processing data will be degraded due to excessive network congestion. To fulfil the bandwidth and delay requirements of multicast applications and apply QoS parameters based on the current availability of network resources, the works in [51] and [54] handle unicast and multicast requests to increase network throughput under critical and user-requested bandwidth. Application-aware routing was presented in [55], [56], and [57]. These schemes compute the link load at a regular time interval. After the arrival of flows, they examine each flow's connection type and bandwidth requirement and allocate paths accordingly. Dynamic and adaptive multi-path routing [20] is another application-based routing scheme that computes paths based on packet loss, time delay, and bandwidth for multimedia applications.
In contrast, the work in [58] argues that the composite routing metrics may not yield better forwarding performance, the greater the probability that the device will send flows over the most optimal path. This way, they devised an adaptive routing scheme considering delay, packet loss, and jitter.
The model adaptively routes flows based on their QoS demand. Although there was forwarding performance gain, calculating the metric values may induce another overhead as the network increases in size. Other solutions [59] compute paths based on link load to avoid congestion. A critical switch and link routing scheme was presented in [60] and [61] to compute the path based on link and switch features. This way, flows are routed through a path with minimum critical switches and links to improve the packet delivery ratio and throughput.
Although the aforementioned literature has made a significant difference compared to static routing, inaccurate network information and protocol overhead are among the pressing issues that affect dynamic routing [62], [63]. The references [19] and [64] evaluate static and dynamic routing under two scenarios: 1) accurate network state information is available, as assumed in the literature, and 2) the practical case in which the controller periodically gathers network state information with some inaccuracy. Although the authors noted that the former is impossible in reality, dynamic link routing outperforms static routing in the number of accepted flows and total throughput. In addition, they observed that the performance of every algorithm is adversely affected by inaccurate network state information. Other researchers argued that characterizing flows only by their QoS requirements is inadequate. The common approach of routing flows based on bandwidth, delay, throughput, loss, and other routing metrics may not always be realistic, because flows on the Internet exhibit variability: some network applications generate flows with a large number of packets, while others generate flows with only a few packets. Flows are typically classified into two types: elephant flows and mice flows. The former are few but carry high traffic volumes and contribute significantly to link utilization. The latter are numerous and dominate the switch Flowtable entries.
Therefore, managing the influence of both flow types could go a long way toward balancing Flowtable or link resource utilization. Hedera [65] introduced a dynamic flow scheduling approach to minimize conflict between elephant and mice flows while improving resource utilization. An OpenFlow-based architecture was devised to dynamically modify flows according to their traffic load. Similarly, MiceDCER was introduced in [66] to promote mice flows by assigning internal Pseudo-MAC (PMAC) addresses to the edge switches and hosts. Mice flows are typically associated with latency-sensitive and bursty applications like VoIP and search results. Each mice flow contains a small number of packets and lives for a short time, but such flows are numerous. The authors leverage wildcard features to aggregate flows and reduce the number of corresponding forwarding rules. This way, the number of rules was reduced while high throughput was maintained.
Similarly, DIFFERENCE was presented in [67] to dynamically set up paths for elephant and mice flows separately. DIFFERENCE estimates residual link utilization at regular intervals. Mice flows are routed proactively, while elephant flows are forwarded through the least congested path. Although the path search space was improved while guaranteeing bandwidth requirements, aggregated routing metrics were not considered for optimal network performance. In addition, switch Flowtable resources and link utilization were overlooked. Imbalances in these resources could harm network performance [68]. DIFF was introduced in [69] to differentiate flows based on their impact on network resources, with the aim of balancing switch Flowtable occupancy and link utilization. DIFF adaptively selects routes for elephant flows based on current link utilization in the network to achieve high network throughput.
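The differentiated treatment of elephant and mice flows can be sketched as follows. This is an illustrative example under stated assumptions, not the implementation of DIFFERENCE or DIFF: the byte threshold `ELEPHANT_BYTES`, the function names, and the convention that the first candidate path is the proactively installed default are all hypothetical.

```python
# Hypothetical size threshold separating elephant from mice flows.
ELEPHANT_BYTES = 10 * 1024 * 1024

def classify(flow_bytes, threshold=ELEPHANT_BYTES):
    """Label a flow by the number of bytes observed so far."""
    return "elephant" if flow_bytes >= threshold else "mice"

def bottleneck(path, utilization):
    """Utilization of the most loaded link along a path."""
    return max(utilization[(u, v)] for u, v in zip(path, path[1:]))

def select_path(flow_bytes, candidates, utilization):
    """Mice keep the proactive default path; elephants take the
    candidate whose most loaded link is least utilized."""
    if classify(flow_bytes) == "mice":
        return candidates[0]  # assumed to be pre-installed proactively
    return min(candidates, key=lambda p: bottleneck(p, utilization))
```

Keeping mice flows on a pre-installed path avoids a controller round trip for short-lived traffic, while the per-flow decision is reserved for the few flows that dominate link load.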
However, these approaches are Dijkstra-based routing solutions. The efficiency of the Dijkstra algorithm depends on the number of nodes and edges in the network [70]. Alternatively, researchers apply meta-heuristic techniques within routing optimization algorithms in SDNs to address this complexity. Ant Colony Optimization (ACO) is among the most well-known meta-heuristics and is widely used for routing optimization. In reported tests, it outperformed other traditional routing techniques. ACO methodologies also support flow-based routing strategies as used in SDNs. The schemes in [71], [72], and [73] were introduced to manage elephant and mice flows based on ACO. Link delay and bandwidth are employed as reference indicators of the transmission path. Mice and elephant flows are routed through the best path that meets their flow demand.
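To make the ACO idea concrete, the following is a minimal textbook-style sketch of ant-based path search on a weighted graph. It is not the method of [71], [72], or [73]; the single scalar link cost, the parameter values, and the function name `aco_route` are assumptions for illustration. Each ant picks its next hop with probability proportional to pheromone and inverse cost, pheromone evaporates every iteration, and successful ants deposit pheromone in proportion to path quality.

```python
import random

def aco_route(graph, src, dst, ants=20, iterations=30,
              alpha=1.0, beta=2.0, rho=0.5, seed=0):
    """graph: {u: {v: positive_cost}}. Returns (best_path, best_cost)."""
    rng = random.Random(seed)
    # Pheromone level per directed link, initialised uniformly.
    tau = {(u, v): 1.0 for u in graph for v in graph[u]}
    best_path, best_cost = None, float("inf")

    for _ in range(iterations):
        tours = []
        for _ in range(ants):
            node, path = src, [src]
            while node != dst:
                # Only unvisited neighbours are eligible (loop-free walk).
                options = [v for v in graph.get(node, {}) if v not in path]
                if not options:
                    break  # dead end: this ant is discarded
                weights = [tau[(node, v)] ** alpha
                           * (1.0 / graph[node][v]) ** beta
                           for v in options]
                node = rng.choices(options, weights=weights)[0]
                path.append(node)
            if node == dst:
                cost = sum(graph[u][v] for u, v in zip(path, path[1:]))
                tours.append((path, cost))
                if cost < best_cost:
                    best_path, best_cost = path, cost

        # Evaporation, then deposit proportional to path quality.
        for edge in tau:
            tau[edge] *= (1.0 - rho)
        for path, cost in tours:
            for u, v in zip(path, path[1:]):
                tau[(u, v)] += 1.0 / cost

    return best_path, best_cost
```

The number of path constructions is `ants * iterations` per routing request, which indicates why the controller load of such heuristics becomes a concern when they must be re-run after every network change.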
Similarly, the work in [73] presents QoE-centric flow routing. The work focuses on identifying the best available routes for various multimedia services. Because QoE depends on QoS, link resource constraints such as delay, jitter, and packet loss were incorporated in path computation for audio, video, and file transfer. Each flow was routed according to its requirements while considering network limitations. The works in [71], [72], and [73] may perform well in a small network. However, due to time and space complexity, they may not guarantee the same performance in a dynamic large-scale network.
In contrast, the Hybrid Ant Colony Optimization (HACO) algorithm [50] was proposed to address the issue of time and space complexity. Box-covering and k-means clustering methods were used to divide the network into small subnets. As a result, flows are routed over the best path while the computation time is reduced. Although there was a performance gain in terms of loss ratio and delay, this solution and those in [71], [72], and [73] may impose extra processing load on the controller because the network changes often. For every subtle change, the heuristic algorithm has to be triggered, which may not only affect the performance of the SDN controller but also throttle the switch update operations.
Therefore, different approaches were proposed to improve routing efficiency in SDN, as summarized in Table 2. Some schemes generalized flows to achieve QoS requirements, while others leverage heuristic algorithms to improve time and space complexity. However, these approaches do not take switch resource constraints, such as switch update time, into account. Switch update time is an important parameter that directly impacts the routing algorithm, because the corresponding forwarding rules need to be placed in the switch routing table to guide the routing process. Overlooking such metrics may therefore directly affect the routing convergence time. In the following, we discuss other related works that consider switch resources.

2) ROUTING WITH SWITCH COST AWARE
Routing policies require corresponding forwarding rules to be installed in the SDN switch memory to route flows efficiently. However, the number of routing rules in dynamic large-scale networks keeps increasing. This leads to poor resource utilization, which needs urgent attention to meet user demand. Several solutions have been proposed over the years to improve the utilization of this precious resource. In the following sections, we divide the existing literature on enhancing switch memory utilization into two categories: routing with flow operation awareness and routing with energy awareness. In addition, we summarize the most related work in Table 3.

A. ROUTING WITH FLOWS OPERATIONS AWARE
Since SDN is a flow-driven network, forwarding devices regularly carry out update operations in their routing tables to guide the routing process. Several approaches were introduced over the years to incorporate the switch cost into routing decisions. In a dynamic large-scale network, thousands of flows could easily be disrupted. Such flows must be restored within the shortest possible time; the time required to perform this operation is significant and must therefore be optimized to maintain carrier-grade network status. STAR [74] is an online routing scheme that efficiently utilizes limited Flowtable resources while maximizing network performance. STAR frequently detects the real-time Flowtable utilization of switches and intelligently removes inactive forwarding rules to accommodate new flows. This way, routing paths for new flows are computed based on the real-time Flowtable usage of switches. The references [75] and [76] formulate the problem of finding an optimal path with the lowest path cost and the fewest update operations. They compute N paths and compare their operation costs. The path with fewer operations that satisfies a threshold value is chosen to forward flows. Although the selected path may speed up routing convergence, there is a lack of information on how the operation cost estimate was obtained. In addition, computing threshold values frequently may not guarantee the feasibility of an end-to-end path in a large-scale network [22]. The work in [23] finds a reliable path based on disjoint paths, aiming to reduce the number of update operations. It selects the path whose nodes are frequently shared by many paths. This way, the number of forwarding rules will be reduced. However, such a design may not guarantee link reliability as the network evolves; some switches could easily be overloaded, leading to high congestion.
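The idea of comparing candidate paths by their update operation cost can be sketched as follows. Because the cited works do not detail how the operation cost is estimated, the cost model here is purely an assumption for illustration: one forwarding rule per hop, with the cost of moving a flow equal to the number of rules installed plus the number of rules removed. The function names are hypothetical.

```python
def rule_hops(path):
    """Each hop (u, v) stands for one forwarding rule on switch u
    that points to next hop v (assumed one rule per hop)."""
    return set(zip(path, path[1:]))

def update_cost(old_path, new_path):
    """Rules to install on the new path plus stale rules to remove."""
    old_rules, new_rules = rule_hops(old_path), rule_hops(new_path)
    installs = len(new_rules - old_rules)
    removals = len(old_rules - new_rules)
    return installs + removals

def pick_path(old_path, candidates):
    """Choose the candidate with the fewest update operations,
    breaking ties in favour of the shorter path."""
    return min(candidates, key=lambda p: (update_cost(old_path, p), len(p)))
```

Under this model, a backup path that reuses the upstream part of the disrupted path is cheaper to activate than a fully disjoint one, since rules on shared hops stay untouched.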
RAF [77] calculates link reliability and installs minimum flow rules for multiple paths based on the reliability value of the path. Such value is appended to N paths, and the path with a higher value is chosen. However, this approach may involve higher reliability value computation costs in large-scale networks.
In contrast, the approach in [22] noted that the switch update operation cost is proportional to the length of the selected path: the longer the path, the higher the operation cost. The authors devised a graph-theory-based approach to E2E path computation that divides the network into communities. If any part of the network is disrupted due to link failure, only the forwarding rules of the affected community are updated, and the rest of the forwarding rules remain intact. This minimizes the flow operation cost. However, their solution does not guarantee the shortest path from source to destination, and the process of detecting the affected community and the failed link may incur extra processing delay. A multicast routing model that reduces the number of rules for multiple multicast requests was presented in [78]. The authors formulate an ILP model to concurrently determine several multicast tree paths that share a single flow entry stored in a forwarding element set up for a single receiver. Although the approach can reduce the number of operations, sharing a single entry among many flows may introduce congestion, and changes in flow behaviour may affect many paths. To address this challenge, the reference [79] formulates a problem to minimize the number of rules needed to redirect flows. It examines link capacity so that congested links are bypassed, and it groups multiple flows and merges their forwarding rules. This way, flows are routed without incurring many flow operations. These solutions achieved a performance gain in terms of reducing the number of flow operations. However, it is quite challenging for SDN controllers to maintain global network knowledge under flow aggregation techniques, because aggregated flows are handled with wildcard rules, an additional OpenFlow feature that some switches may not support [13], [31].
Therefore, most current solutions either take a long time to minimize update operations or introduce controller overhead, making them impractical for large-scale, highly dynamic networks. To overcome these challenges, FLUS [80] introduced a Segment Routing (SR) based strategy for fast and lightweight path update operations. For any change in the network, FLUS immediately uses SR to build a new path by joining parts of existing and new paths. Afterwards, the actual path for data transmission is established, and flows are shifted to the newly computed path. Unfortunately, this strategy is constrained by the bandwidth limit function. The references [81] and [82] proposed an ILP model for a real-time delay-optimized flow routing algorithm. The approach separates the forwarding strategies of various flows based on their sensitivity to delay. The authors claimed to reduce route update operations by 60% compared to the benchmark work. However, the solution may require a large solver to converge in a large-scale network.
Other solutions [83], [84], and [85] focus on aggregating flows with similar features going to the same destination to reduce the number of flow operations. For example, an Integer Linear Programming (ILP) model was introduced in [84] to minimize the total cost. It first assigns predefined weights to differentiate flows and then aggregates flows with similar features. Afterwards, flows are dynamically routed to their destination. However, an ILP solution is not scalable for dynamic large-scale networks.
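Rule aggregation can be illustrated with a small example that merges exact-match destination rules sharing the same output port into wildcard (prefix) rules. This is a simplified sketch, not the ILP formulation of [84]: it assumes rules are (destination IPv4 address, output port) pairs and relies on Python's standard `ipaddress.collapse_addresses` to merge adjacent addresses into the shortest covering prefixes. The function name `aggregate_rules` is hypothetical.

```python
import ipaddress
from collections import defaultdict

def aggregate_rules(rules):
    """rules: iterable of (dst_ip, out_port) exact-match entries.
    Returns a list of (prefix, out_port) wildcard entries."""
    # Group destinations by the action they share (the output port).
    by_port = defaultdict(list)
    for dst, port in rules:
        by_port[port].append(ipaddress.ip_network(dst + "/32"))

    merged = []
    for port, nets in sorted(by_port.items()):
        # Adjacent host routes collapse into one covering prefix.
        for net in ipaddress.collapse_addresses(nets):
            merged.append((str(net), port))
    return merged
```

For instance, four exact-match entries for 10.0.0.0 to 10.0.0.3 with the same output port collapse into a single 10.0.0.0/30 entry, reducing Flowtable occupancy at the price of coarser per-flow visibility.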

B. ROUTING WITH SWITCH ENERGY SAVING AWARE
It is worth noting that routing rules are stored in the switch memory, usually implemented in TCAM. TCAM performs well in terms of look-up rate and forwarding performance. However, it is known for its high cost and high power consumption. Moreover, the growing traffic load on communication links influences the energy consumption of links. Similarly, the hardware constraints of TCAM further increase the power consumption of forwarding elements in SDN. To overcome these challenges, various Energy Aware Routing (EAR) schemes were introduced over the years to minimize the energy consumption of these resources while preserving connectivity and QoS, as summarized in Table 4. Energy Aware Routing with Compression (EARC) was proposed in [86] and [87]. It is an ILP formulation that uses a greedy heuristic to optimize switch resource energy consumption using a software switch.
The high cost was eliminated, and Flowtable update operations were up to 10 times faster than on hardware switches. However, the scheme is associated with packet processing delay.
MINNIE [83] was introduced to reduce the extra delays for routing lookup rules while reducing TCAM energy. It provides a two-phase solution: the compression and the routing phase. The former uses a heuristic method to compress forwarding rules with the same feature, while the latter provides a heuristic based on the shortest path algorithm with adaptive metrics. This way, it reduced the number of routing rules by 50%. However, MINNIE will experience the same weakness as the works in [86] and [87] due to their architectural design of hardware and software switches. The processing of exchanging packets between the switches will no doubt introduce packet processing delay. Reducing routing rules reconfiguration cost is another way to optimize switch TCAM power consumption. Reference [88] presented an LP optimization model to reduce the cost of re-configuring flow tables when traffic demand changes. An LP model was used to formulate the problem while considering the obsolete flow entries that must be removed and the new flow rules that must be installed. Furthermore, a Genetic Algorithm has been presented for decreasing network power usage while minimizing the number of updated forwarding rules in a Flowtable.
Since switch TCAM power consumption is correlated with flow arrivals, other solutions focus on devising traffic-aware energy saving. The key idea is to turn network components (for instance, some switches) on or off based on the traffic load. For example, when traffic loads are low (especially at night), this strategy can save up to 50% of overall energy utilization, as in the work in [89]. Typically, an elastic structure is used to depict network components that can expand and contract in response to dynamic traffic loads. The main problem is deciding which components to switch on and which to turn off without compromising the desired level of QoS [90]. Elasticity, topology awareness, queue engineering, and smart sleep on and off are desirable qualities of a traffic-aware controller. Elasticity refers to the ability to dynamically increase or decrease the number of network components employed in response to traffic.
Topology awareness adds the advantage of using formulations and solvers that can be customized to any given topology. The hierarchically arranged fat-tree architecture is the most commonly employed in data centres. Knowing how the components are grouped and what their capacities are allows alternative paths to be chosen that avoid energy-intensive paths. The work in [91] presents an EAR scheme with four modules on the SDN controller, using a fat-tree topology: optimizer, routing, flow monitoring, and power control. The optimizer periodically receives input from flow monitoring and finds the most energy-efficient subnet that satisfies the current traffic demand. It then provides the active topology to the routing and power control modules. The power control module changes the power state of the switches and link cards, while the routing module selects the optimal path for flows. This strategy improved the power-saving level with efficient network resource usage.
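The search for an energy-efficient active subnet can be sketched with a simple greedy heuristic. This is an illustration only and not the optimizer of [91]: it ignores link capacity and QoS, treats every switch as consuming the same power, and simply tries to power off switches one at a time as long as every traffic demand pair remains connected. The names `connected`, `minimal_active_set`, and the `protected` set of switches that must stay on (for example, edge switches with attached hosts) are assumptions.

```python
from collections import deque

def connected(adj, active, src, dst):
    """Breadth-first search restricted to powered-on switches."""
    if src not in active or dst not in active:
        return False
    seen, queue = {src}, deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return True
        for v in adj[u]:
            if v in active and v not in seen:
                seen.add(v)
                queue.append(v)
    return False

def minimal_active_set(adj, demands, protected):
    """Greedily power off switches while all demands stay connected.

    adj: {switch: [neighbours]}, demands: [(src, dst)],
    protected: switches that must never be turned off.
    """
    active = set(adj)
    # Deterministic order: low-degree switches first, then by name.
    for sw in sorted(adj, key=lambda s: (len(adj[s]), s)):
        if sw in protected:
            continue
        trial = active - {sw}
        if all(connected(adj, trial, s, d) for s, d in demands):
            active = trial  # switch sw can sleep
    return active
```

The returned set plays the role of the active topology handed to the routing and power control modules; a practical optimizer would additionally check that the remaining links can carry the measured demand.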
Queue engineering approaches provide extra port-level traffic monitoring functionality. Flow size and link bandwidth can be obtained at regular time intervals. This way, the ending time of a flow can be deterministically calculated to schedule unallocated flows in queues, increasing the flexibility of the flow scheduling method. The references [92] and [93] presented a routing strategy that combines exclusive routing and flow scheduling to achieve efficient energy saving. The scheme expands flow scheduling to the time dimension and considers energy consumption during flow transfer time. The active and suspended flow sets are used to schedule and transfer flows as part of a heuristic search for the flow group solutions that use the least energy. Another solution in [94] reduced network power by taking Flowtable size and link bandwidth into account for a single flow per user. This study employs Dijkstra's method to minimise network power upon flow arrival while considering link bandwidth and Flowtable size. It then employs ACO to determine the optimum path and minimize network power for all flows. However, flow arrivals in a dynamic large-scale network go beyond a single flow per user. Hence, such a solution may not be applicable in different network settings.
Moreover, link bandwidth and Flowtable size are not adequately considered while employing the ACO. To overcome this limitation, reference [95] modified the approach and considered link bandwidth and Flowtable size while routing every flow. Instead of relying on ACO, it selects several routes per flow and computes network power usage while considering bandwidth and Flowtable size. Although a performance gain was observed compared to [94], the routing decisions are subject to bandwidth and delay only. A dynamic large-scale network hosts different applications with various flow QoS requirements, and diverse routing metrics make it more flexible to meet the demands of other flows. The Minimum Criticality Routing Algorithm (MCRA) is proposed in [96], along with an energy-efficient optimization. MCRA determines the available paths based on the end-to-end request. If the discovered paths do not satisfy specific requirements, such as the maximum latency and link utilization ratio, the rerouting process is initiated using the Energy-Efficient Multi-constraint ReRouting (E2MR2) protocol. However, because the network changes often, the model needs to run each time a change occurs. This may introduce unnecessary delay, especially for delay-sensitive applications.
However, topology-aware and queue engineering approaches are mostly tailored toward a specific topology, such as a fat tree. Unfortunately, such solutions cannot be applied to other topological structures. While individual heuristics work well for a specific case but cannot be generalized, broad heuristics do not fully grasp energy capacities [97]. Other solutions focus on a smart sleep approach to improve energy-aware routing. Smart sleep on and off refers to the capacity to turn on/off switch ports, links, or the complete switch in response to traffic. In [98], a routing method based on the shortest path is proposed for incoming flows to reduce the number of active OpenFlow switches in an SDN. This technique focuses on meeting the demand of throughput-sensitive flows while considering the implicit link capacity limits. The authors analyzed the network's night-time traffic, during which the number of flows is considered negligible. This technique assumes no QoS constraints other than bandwidth for path selection. Even though energy usage was improved, limiting QoS routing decisions to bandwidth alone may not be a valid assumption.
In [99], an energy-aware routing algorithm for SDN-based carrier Ethernet networks was proposed. It presents a binary linear programming approach for the EAR problem that optimizes the number of network switches that can be turned off while considering traffic demand and rule space limits. Because this method is time-consuming due to the use of the Integer Linear Programming (ILP) model, a set of first-fit heuristic algorithms is also offered to reduce computation time. The scheme balances energy savings and connection utilization but ignores the QoS requirements of application flows. Moreover, the method is less promising for large real-world networks with variable traffic patterns due to its high time complexity. The work in [100] presents a two-phase SDN-based routing technique that minimises energy usage while maintaining a given degree of QoS for user flows. A minimal graph-based Ant Colony Optimization (ACO) technique reduces network energy usage. It prunes and optimizes the network tree by shutting off unneeded switches and providing an energy-efficient subgraph in charge of routing the current flows.

3) LOAD BALANCING SCHEMES
The heterogeneity of current network devices and the growing number of Internet-connected users increase the generation of traffic flows. Since SDN is a flow-driven network, many routing rules will be placed on the set of switches to guide the routing process. However, path roles differ: some paths are frequently used, while others are underutilized. As a result, forwarding entities on frequently used paths may be overloaded with many routing rules, while others hold only a few, causing a load imbalance. Similarly, the links connecting these forwarding entities have limited bandwidth, while bandwidth demand is increasing due to the surge in traffic. As a result, frequently used links consume a significant amount of bandwidth. Therefore, effective load balancing is required to improve network resource usage. A load balancing technique manages incoming traffic flows, distributing and sharing the traffic load fairly among the network forwarding elements to improve network service availability. This way, traffic bottlenecks on data forwarding entities and communication links can be avoided. Several efforts have been made lately to balance the network on either link load or switch Flowtable [101], [102]. For example, the approach in [103] devised an Automatic Re-routing with Loss Detection architecture. Packet loss is detected with the queue statistics message of the OpenFlow protocol, and the re-routing module then tries to find a bypass path and applies it to the Flowtables to better balance the load on the switches. However, the approach may require significant bandwidth resources in a network with varying traffic patterns.
Moreover, rerouting flows without considering the current network statistics may result in congestion propagation [104]. In contrast, the work in [105] proposed a scalable congestion control protocol to reduce Flowtable overflow, thereby reducing queuing delay under bursty traffic. The number of TCP flows passing through each switch port was monitored to ensure total bandwidth utilization does not exceed the bandwidth-delay product. This information was eventually passed to each TCP source by updating the advertisement window field in the TCP header. The proposed algorithm transfers the minimum number of flows from the congested link to the backup path, resulting in improved QoS and congestion control. To alleviate congestion and improve application performance, multi-path routing algorithms can distribute traffic over diverse paths more effectively than simple solutions like ECMP. Along these lines, Kanagavelu et al. [106] proposed a local rerouting mechanism in SDN-based data centre networks to effectively manage congestion in the event of link congestion or failure. The SDN controller periodically gathers port and flow statistics from all switches at a fixed time interval. Afterwards, the routing engine computes a less loaded path between demand pairs. Link congestion was checked periodically against a threshold value T; once the load on a link exceeded T, the SDN controller re-routed the affected flows along an alternative path.
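The periodic threshold check can be sketched as follows. This is a minimal example under stated assumptions rather than the mechanism of [106]: it supposes the controller keeps two consecutive byte counters per link (as obtained from port statistics), a known link capacity, and a hypothetical threshold T of 0.8. Function and parameter names are illustrative.

```python
def link_utilization(bytes_prev, bytes_now, interval_s, capacity_bps):
    """Fraction of link capacity used during one polling interval."""
    return (bytes_now - bytes_prev) * 8 / (interval_s * capacity_bps)

def congested_links(samples, interval_s, threshold=0.8):
    """samples: {link: (bytes_prev, bytes_now, capacity_bps)}.
    Returns the links whose utilization exceeds the threshold T."""
    return sorted(
        link
        for link, (b0, b1, cap) in samples.items()
        if link_utilization(b0, b1, interval_s, cap) > threshold
    )
```

As a worked case, a 1 Gbit/s link that carried 1,125,000,000 bytes in a 10 s interval has utilization 9e9 / 1e10 = 0.9 and would be flagged, after which the flows crossing it become candidates for rerouting.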
Similarly, the work in [107] proposes an effective routing mechanism for link congestion avoidance in SDN. The controller observes the current traffic of switches and updates the topology according to weights assigned based on computed bandwidth usage. Traffic from congested links is instantly redirected to the available links to enhance the efficiency of link usage. The method effectively allocates and utilizes the network bandwidth, but at the cost of monitoring overhead. Another solution in [108] leverages flow parameters to devise an algorithm for avoiding congestion. The network state is monitored by calculating the link utilization periodically and redirecting newly arrived flows to a congestion-free path. Congested paths are identified through threshold values. If the average link utilization exceeds a threshold, the controller predicts congestion on the link and calculates the load to be redirected to other backup paths. This way, flows are rerouted to a suitable backup path without much congestion. However, Sminesh et al. [104] argue that congestion goes far beyond balancing the network on either link or path, and that most load-balancing schemes overlook congestion propagation. The authors proposed a method that categorizes links for efficient load balancing. Toward this goal, the scheme monitors the utilization of each link, and over-utilized links that cause network congestion and packet loss are identified as bottleneck links. Afterwards, traffic on the identified largest and bottleneck links is redirected through lightly loaded paths to achieve better network load balancing. However, most existing load-balancing schemes focus on reducing network congestion and therefore rely heavily on a monitoring mechanism to obtain flow statistics. Thus, the effectiveness of these schemes depends on the polling interval used to collect the statistics.
If a flow lasts longer than the statistics polling interval, obtaining such statistics at regular intervals requires extra processing load. This process can be resource-intensive for the SDN controller and introduces another scalability concern, mainly when the traffic flows and the number of switches to be monitored increase. Conversely, when the polling interval is short, the switch-controller communication overhead increases. Therefore, depending entirely on the controller to pull the data of all active flows to obtain Flowtable usage may not be the best solution [36].

IV. FAILURE RECOVERY APPROACHES
Generally, failure management approaches can be divided into failure detection and recovery phases, each with its own implementation mechanism and execution time. Failure detection must be in place first; immediately after a link or switch failure is detected, the recovery process is initiated. Most of the existing failure recovery proposals can be categorized based on their recovery scope: link, switch, or path. Accordingly, the backup path can be configured per switch, per link, or for every disrupted flow using a local or path protection or restoration mechanism. Other works focus on the hybridization of protection and restoration to achieve the benefits of both. The recovery schemes are grouped into four categories: TCAM memory aware, load balancing aware, Quality of Service (QoS) aware, and rule update operation aware, each based on restoration, protection, or a hybrid of the two, as presented in Fig. 4. This section discusses and analyses these approaches and highlights their weaknesses.

A. FAILURE DETECTION APPROACHES
The most crucial role of the SDN controller is to maintain a real-time, consistent view of the network topology. Due to frequent network changes, link states change often. In this regard, the link discovery process is initiated to detect link failures between connected OpenFlow switches and to efficiently detect changes to the network topology. Delay in link failure detection increases packet loss and failure recovery convergence time. Table 5 summarizes some of the existing detection schemes. Inefficient link failure detection can significantly affect network operation, which depends entirely on the SDN controller [109]. The fast failover group type was introduced in OpenFlow 1.1 to handle link failures locally without controller intervention. This is achieved by preconfiguring failure recovery policies on the OpenFlow devices so that forwarding behaviour depends on the local state of the OpenFlow switches. Toward this goal, references [110], [111], and [112] leverage the failover scheme and use Bidirectional Forwarding Detection (BFD) to detect individual link failures. Each switch establishes a BFD session with its neighbor using a three-way handshake. Afterwards, the switches exchange echo messages to monitor the link state. If a link failure occurs, the switch preceding the affected switches detects the failure through the echo messages and immediately notifies the controller. In return, the controller instructs the affected switches to remove the link from their Flowtables.
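The local behaviour of a fast failover group can be modelled in a few lines. In the OpenFlow specification, such a group executes the first bucket whose watched port is live and drops the packet if no bucket is live; the sketch below models only that selection logic in plain Python and does not use any controller library. The helper `bfd_detection_time_ms` reflects the usual BFD rule that the detection time is the detect multiplier times the agreed transmit interval. All function names and the data layout are assumptions made for this example.

```python
def select_bucket(buckets, port_live):
    """buckets: ordered list of (watch_port, out_port) pairs.
    port_live: {port: bool} liveness as seen locally by the switch.
    Returns the out_port of the first live bucket, or None (drop)."""
    for watch_port, out_port in buckets:
        if port_live.get(watch_port, False):
            return out_port
    return None  # no live bucket: the packet is dropped

def bfd_detection_time_ms(tx_interval_ms, detect_mult):
    """Time after which a silent BFD peer is declared down."""
    return tx_interval_ms * detect_mult
```

For example, with a primary bucket on port 1 and a backup bucket on port 2, traffic moves to port 2 as soon as port 1 is marked down, without waiting for the controller. With a 50 ms transmit interval and a multiplier of 3, the failure would be declared after 150 ms.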
In contrast, the work in [113] presents a link failure detection mechanism based on outgoing packets. The flow rules installed on the link were tagged and monitored, and packets were counted at the destination. Sent packets are recorded, and the difference between sent and received packets is used to calculate the error rate, which is compared with a threshold value. If the error rate of a given link is greater than the threshold value, the link is assumed to have failed. However, frequent threshold value computation may introduce extra processing load on the controller.
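The counter-based check described above amounts to a short computation. The following sketch assumes per-link counters of sent and received packets over one measurement window and a hypothetical threshold of 0.5; it is an illustration, not the implementation of [113].

```python
def error_rate(sent, received):
    """Fraction of packets lost on a link in one measurement window."""
    if sent == 0:
        return 0.0  # nothing was sent, so nothing can be inferred
    return (sent - received) / sent

def link_failed(sent, received, threshold=0.5):
    """Declare the link failed when the loss exceeds the threshold."""
    return error_rate(sent, received) > threshold
```

For instance, 1000 packets sent and 100 received gives an error rate of 0.9, which exceeds the assumed threshold, so the link would be treated as failed.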
Other solutions used the concept of a monitoring cycle to detect the failure location. The reference [114] proposes a lightweight software-based failure detection scheme that exchanges alive packets between neighboring switches and the controller. Each switch that receives the packet creates two copies, one sent to the controller and the other to the next switch. When the controller does not receive the packet from the expected switch for some time, that switch is declared down. A similar concept was introduced in [115].
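On the controller side, this kind of alive-packet monitoring reduces to a timeout check. The sketch below assumes the controller records the arrival time of the latest alive packet copy from each switch; the timeout value and the names `last_seen` and `down_switches` are hypothetical and not taken from [114].

```python
def down_switches(last_seen, now, timeout_s):
    """last_seen: {switch: time of last alive packet, in seconds}.
    Returns the switches that have been silent for more than timeout_s."""
    return sorted(sw for sw, t in last_seen.items() if now - t > timeout_s)
```

The choice of timeout trades detection speed against false positives: a short timeout reacts quickly but may flag a switch whose alive packet was merely delayed by congestion.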
To some extent, BFD can help to detect failures quickly; however, it may be more applicable to small-scale networks. The presence of large applications may result in traffic congestion, especially in a large-scale network. An alternative path needs to be established in time when a failure occurs, yet backup entries and BFD packets may easily congest the network and overflow the switch Flowtable. The proposal in [116] devised a centralized probe mechanism to detect link failures in the network, and the minimum interval between probe packets was shown to affect the failure recovery time. However, centralizing the failure detection may flood the network with probe packets, which can further exhaust the limited storage. Unlike the previously mentioned works, [117] places detection logic at the switch, known as Switch Failure Detection (SFD). SFD detects failures by discovering the hosts connected to the switch and computing the packet loss ratio. If the loss ratio is 100%, the switch is assumed to have failed; otherwise, no failure is reported. However, this scheme is limited to switch failures, whereas link failures occur more frequently than switch failures [118].

B. LOCAL AND PATH RESTORATION APPROACHES
Several proposals were made to configure the backup path for every disrupted flow [119], [121], as summarized in Table 6.
The works in [112] and [122] argued that periodic link monitoring to detect failure before establishing a backup path might introduce considerable controller overhead. To overcome this challenge, [112] offloads the Operation, Administration and Maintenance (OAM) link monitoring capability from the controller to the OpenFlow switch. OAM leverages a general message generator and processing function in the switches, and an extension of the OpenFlow 1.1 protocol, to support the monitoring function. However, offloading some of the control functions to switches violates SDN's promise [123]. Alternatively, the approach in [124] presents a fault management scheme that does not modify the SDN architecture. A topology discovery module periodically collects link state events. Afterwards, the route planning module uses the gathered topology information to calculate multiple routing paths. Upon failure, the VLAN switch configuration module configures multiple switch ports with the relevant VLAN IDs to enforce each routing path. However, the scheme only focuses on recovering from failure; consequences after a failure, such as potential further failure or post-recovery congestion, were overlooked. A local reroute congestion-aware failure scheme was proposed in [106]. The scheme considers the flow type and establishes a path based on its requirement. When congestion occurs, rerouting is applied to the elephant flows while packet forwarding of the mice flows continues. In this case, rerouting is performed locally at the point of congestion instead of diverting the affected flows through available paths. However, triggering rerouting at a point of congestion can increase packet loss after failure. In addition, the round trip time may also increase, affecting the failure convergence time. To overcome this limitation, the work in [125] devised a local fast reroute (LFR) to achieve faster recovery with fewer controller operations. After a failure, all disrupted flows are aggregated into a new ''big'' flow. A VLAN ID value is assigned to the aggregate flow; every packet is stamped with a new label stored in the VLAN ID field. Thereafter, the local reroute path is dynamically deployed by the controller for the aggregated flows. In their follow-up work, Cheng et al. [127] introduced an integer linear programming model with a congestion-aware heuristic to avert post-recovery link congestion.
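The aggregation step can be sketched in a few lines. The sketch assumes that each flow is identified by a (source, destination) pair and that its path is known as a list of directed links; the data model, the function names, and the VLAN ID value are hypothetical and only illustrate why aggregation reduces the number of rerouting rules.

```python
from typing import Dict, List, Tuple

Flow = Tuple[str, str]   # simplified flow identifier: (source host, destination host)
Link = Tuple[str, str]   # directed link between two switches


def aggregate_disrupted_flows(
    flow_links: Dict[Flow, List[Link]], failed_link: Link, vlan_id: int
) -> Dict[Flow, int]:
    """Tag every flow whose path crosses the failed link with one shared VLAN ID."""
    return {
        flow: vlan_id
        for flow, links in flow_links.items()
        if failed_link in links
    }


def reroute_rule_count(num_flows: int, detour_switches: int, aggregated: bool) -> int:
    """Rules needed on the detour: one per switch if aggregated, else one per flow per switch."""
    return detour_switches if aggregated else num_flows * detour_switches


# Hypothetical flows and paths.
flows = {
    ("h1", "h2"): [("s1", "s2"), ("s2", "s3")],
    ("h3", "h4"): [("s2", "s3")],
    ("h5", "h6"): [("s1", "s4")],
}
tagged = aggregate_disrupted_flows(flows, failed_link=("s2", "s3"), vlan_id=100)
```

With two disrupted flows and a detour of three switches, matching on the shared VLAN ID needs three rules instead of six, at the cost of per-flow visibility on the detour.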
Similarly, the scheme in [128] introduced CAFFE to detour the affected flows as soon as the failure is detected to avoid potential congestion. Furthermore, CAFFE jointly considers knowledge of the network topology, failure states, and network load distribution to formulate an Integer Linear Program (ILP) model that protects against potential future failures. However, these solutions rely heavily on aggregation and VLAN tagging to perform fast rerouting, which makes computation expensive and may preclude the regular use of VLANs [3].
Unlike the previously mentioned works, [120] argues that a reliable and scalable failure recovery scheme should minimise the controller's processing load and react even when the controller is not reachable. Toward these goals, the authors formulate a Mixed Integer Linear Programming (MILP) model for precomputed backup recovery paths that considers Quality of Service (QoS) metrics. The scheme leverages a crankback signal to ensure instantaneous recovery and aims at zero packet loss after failure detection, regardless of controller reachability, even when OpenFlow's ''fast-failover'' feature cannot be used. However, heavy reliance on crankback routing may result in long backup paths and extra link usage. Their follow-up work, SPIDER [115], implements the failure rerouting mechanism using MPLS tags. Furthermore, the scheme depends heavily on an extension of OpenFlow (OpenState) to perform customized failure detection and data plane switching, making it incompatible with existing networks and available hardware switches [111].
The previously mentioned works focus either on reducing congestion or on reducing the controller's processing load. In contrast, the authors of [125] consider the correlation between switch Flowtable update operations and the recovery speed, which impacts the network convergence time. They presented Local Fast Reroute (LFR), a flow aggregation technique that minimizes the number of flow operations. Once a link failure occurs, the affected flows are aggregated into a single flow. Afterwards, the aggregated flow is dynamically rerouted along the local reroute path using fast failover. Therefore, LFR can achieve faster failure recovery while minimizing flow operations. However, this merit depends on the availability of a fast failover local reroute, which can therefore affect the convergence time of the scheme. Moreover, LFR may also lead to larger packet drops when an end-to-end path is applied.
In contrast, [76] presented a failure restoration technique for minimizing the recovery cost. An ILP model was formulated to find the lowest-cost path that requires at most a given number of operations. In addition, a Dijkstra-like algorithm finds the path that requires the fewest possible operations, which optimizes the failure recovery time. Over the set of candidate paths, a threshold property was used to reduce the set of resulting paths. However, such a property may not always guarantee the feasibility of an end-to-end path [22]. An end-to-end fast link failure recovery approach based on the shortest path was introduced in [117]. Packets are categorized into high-priority and low-priority packets. When a failure occurs, high-priority packets are rerouted through the minimum-delay path. Other packets are forwarded through an alternative path; thus, the two classes of traffic are distributed over the available paths to avoid congestion. However, the solution was tested in a small network setting and may not be feasible in a large-scale network, because the algorithm's complexity grows as the network size increases [12]. The work of [129] proposed a mechanism to avoid frequently contacting the controller and to take local corrective measures. In this regard, two methods were devised: one stores bypass paths for all pairs of nodes, and the other for some selected nodes. When a failure occurs, the switch can act locally. However, installing two sets of bypass flow rules in the switch Flowtable besides the primary path rules will lead to load imbalance and cause congestion. Besides, this will further slow down network reconfiguration because of the large number of updates.
Although restoration schemes offer a more flexible way to handle diverse flows in real time, the cost of end-to-end computation is very high [26]. Restoration is time-consuming because the SDN controller needs to calculate a new end-to-end path for each affected flow and reconfigure the network. Therefore, the time to reconfigure the network includes the new path calculation time and the switch update time. To reduce these costs, [130] introduced a scheme based on the principle of community detection. If a failure occurs, the affected community is detected, and a backup path is established within it without disturbing packet forwarding in other communities. This recovery is faster, thereby improving the fault tolerance of the network. However, removing the old flow entries on the affected path and installing the new entries for the alternative path can be costly, especially when the path is long. This concern has been addressed in their follow-up works [130], [26]. When a failure occurs, the scheme only searches for a new path from the point of failure to the destination switch and removes the old flow entries of the affected switches only; the remaining flow entries on the path are preserved. Therefore, the scheme significantly reduces the number of update operations and the end-to-end computation time. However, it neither guarantees the shortest path nor considers congestion.
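The idea of searching only from the point of failure to the destination can be sketched as follows. The sketch assumes an unweighted topology given as an adjacency list and uses a breadth-first search for the detour; this is an illustrative choice, not the algorithm of [130] or [26], and all names and the example topology are hypothetical. Loop freedom between the preserved prefix and the detour is not checked here.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple


def bfs_path(
    adj: Dict[str, List[str]], src: str, dst: str, failed: Tuple[str, str]
) -> Optional[List[str]]:
    """Breadth-first search from src to dst that never uses the failed link."""
    blocked = {failed, (failed[1], failed[0])}
    parent: Dict[str, Optional[str]] = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path: List[str] = []
            cursor: Optional[str] = node
            while cursor is not None:
                path.append(cursor)
                cursor = parent[cursor]
            return path[::-1]
        for nxt in adj.get(node, []):
            if (node, nxt) in blocked or nxt in parent:
                continue
            parent[nxt] = node
            queue.append(nxt)
    return None


def local_repair(
    adj: Dict[str, List[str]], primary: List[str], failed: Tuple[str, str]
) -> Optional[List[str]]:
    """Keep the primary path up to the failure point and re-route from there."""
    failure_point = failed[0]
    idx = primary.index(failure_point)
    detour = bfs_path(adj, failure_point, primary[-1], failed)
    if detour is None:
        return None
    return primary[:idx] + detour


# Hypothetical topology: s2 reaches s5 through either s3 or s4.
topology = {
    "s1": ["s2"],
    "s2": ["s1", "s3", "s4"],
    "s3": ["s2", "s5"],
    "s4": ["s2", "s5"],
    "s5": ["s3", "s4"],
}
repaired = local_repair(topology, ["s1", "s2", "s3", "s5"], failed=("s2", "s3"))
```

In this example only the entries from s2 onward change, while the entry on s1 is preserved, which is where the saving in update operations comes from. The resulting end-to-end path is not necessarily the shortest one, in line with the limitation noted above.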

C. LOCAL AND PATH PROTECTION APPROACHES
The failure recovery range can be divided into path recovery and local recovery. In the protection mechanism, rules must be preinstalled for both path and local recovery approaches. Due to the proliferation of network flows, the number of forwarding rules in the switch flow table may quickly escalate, leading to large consumption of switch TCAM memory. To minimize TCAM consumption, the works in [7] and [132] proposed a set of algorithms, Forward Local Rerouting (FLR) and Backward Local Rerouting (BLR), to compute backup paths for a primary path for faster failure recovery. Local rerouting of the failed traffic from the point of failure enables speedy recovery. FLR and BLR backup paths improve the sharing of forwarding rules at the switches by choosing a backup path with the least number of additional switches. However, the solution considers neither post-recovery congestion nor the potential future failure of the selected link. The works in [133], [134], and [135] leverage VLAN tagging to present a new protection method that aggregates the disrupted flows. This way, the number of flow rules for rerouting the affected flows is reduced, thereby improving the recovery time toward carrier-grade recovery requirements. However, the aggregation technique may reduce flow visibility, which in turn may make it challenging for the controller to maintain the global view of the network. Their follow-up work implements two algorithms. In the Local Immediate (LIm) recovery strategy, the controller utilizes fast failover to switch locally to an alternative path, using the VLAN tagging feature to tag the arriving packets with the outgoing link ID. In contrast, Immediate Controller Dependent (ICoD) recovery requires controller intervention to establish an alternative path. With LIm, recovery may be faster because fast failover allows a quick, local reaction to failures without resorting to the central controller.
Several attempts have been made to achieve faster recovery by using fast failover to avoid controller involvement in detouring the affected flows. Table 7 summarizes the related works. The work in [136] proposed a fast failover link failure recovery scheme with a congestion-aware mechanism. Fast failover is preconfigured with multiple paths to redirect the disrupted flows when a failure occurs. Based on the mechanism configuration, the controller periodically monitors the status of the switch ports to perceive failures in time. The protection scheme resulted in an average recovery time of around 40 ms. However, constant controller monitoring can introduce extra processing load on the controller. A scalable multi-failure fast failover scheme was presented in [137]. It dynamically compresses the alternate-path flow entries of incoming flows with the existing flow entries on the backup path. In this way, the total number of rules is significantly reduced. However, such a dynamic procedure may lead to extra processing load on the controller, which must configure the primary and backup paths for every newly arriving flow.
Moreover, the number of backup path rules grows as the flow arrival rate increases [138]. An efficient fault-tolerant, memory-management-aware approach was proposed in [133]. The scheme computes path protection per link instead of rules per flow by configuring a VLAN tag for each link failure. This way, a VLAN tag identifies each link when the backup path rules are defined. Therefore, the number of rules is proportional to the network setting. However, in a large-scale network, the overhead would be non-trivial [9]. Another solution [139] considered the switch Flowtable storage constraints to devise a recovery scheme. A Fault-Tolerant Forwarding Table Design (FFTD) was introduced that groups flows using group entries and aggregates them using a tagging mechanism for rapid recovery from dual failures. This way, FFTD satisfies the carrier-grade 50 ms recovery requirement while reducing the storage required for backup path flows. However, neither [133] nor [139] considered post-recovery congestion after the localized recovery. A shared ring was proposed as yet another solution to reduce the consumption of backup resources [140]. The authors devised a ring-based single-failure recovery approach to reduce the number of entries.
A ring in the network topology is selected to act as a shared backup path for all flows, which improves Flowtable utilization. Although recovery time and backup resource consumption can be improved, post-recovery congestion may occur. Therefore, an efficient congestion- and memory-aware failure recovery scheme (SafeGuard) was presented in [118]. SafeGuard iterates through the backup paths of the impacted flows to ensure that the rerouting switches have enough space to accommodate the backup path rules. In addition, the residual link capacity is checked to avoid post-recovery congestion. This way, impacted flows are redeployed efficiently. However, SafeGuard may be expensive because it requires two paths to be installed for each flow, which could overwhelm critical network resources such as switch TCAM [26]. Most of the previously mentioned works depend heavily on fast failover local reroute to achieve faster failure recovery, while others consider VLAN tagging with aggregation. Fast Failover (FF) local reroute approaches provide an efficient means of achieving fast failure recovery. FF handles failures locally, detouring the affected flows around the failed link over a preconfigured alternative without consulting the controller. However, FF local reroute can only define local detours when alternative paths are available from the node that detects the failure. Therefore, the unavailability of such a path may still require controller intervention, increasing the processing load.
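The two checks attributed to SafeGuard above, Flowtable space on the rerouting switches and residual capacity on the backup links, can be sketched as a simple feasibility test. This is a hedged illustration only: the function name, the data structures, and the numbers are hypothetical, and the actual scheme in [118] may organise these checks differently.

```python
from typing import Dict, List, Tuple


def backup_path_feasible(
    path: List[str],
    free_entries: Dict[str, int],
    residual_capacity: Dict[Tuple[str, str], float],
    rules_needed: int,
    demand: float,
) -> bool:
    """Check Flowtable space on every switch and spare bandwidth on every link of the path."""
    for switch in path:
        if free_entries.get(switch, 0) < rules_needed:
            return False  # this switch cannot hold the backup rules
    for link in zip(path, path[1:]):
        if residual_capacity.get(link, 0.0) < demand:
            return False  # this link would be congested after recovery
    return True


# Hypothetical state: free Flowtable entries per switch and spare Mbit/s per link.
free = {"s1": 10, "s2": 2, "s3": 8}
spare = {("s1", "s2"): 40.0, ("s2", "s3"): 15.0}
ok = backup_path_feasible(["s1", "s2", "s3"], free, spare, rules_needed=1, demand=10.0)
```

A backup path that fails either check would be skipped in favour of another candidate, which is how both Flowtable overflow and post-recovery congestion can be avoided in the same pass.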
Moreover, the FF group feature is optional in OpenFlow, so the applicability of these solutions depends on actual hardware support for the FF group [138]. Thus, several works have devised various recovery approaches to meet the recovery requirement. Other works attempt to reduce the number of flow entries, because FF configures many flow rules per switch. However, most existing works overlook reducing service disruption and unavailability to speed up the end-to-end convergence process. An exception is the work in [141], which proposed Smart Routing (SR) to enable the controller to receive early signs of failure and avoid risky paths before the failure occurs. SR predicts link failure events and formulates an alternative path for some links. The risky path is bypassed to avoid a future failure that would lead to higher packet loss and decreased service availability. The mean time between failures and the mean time to recover are considered for evaluating availability and reliability. However, SR overlooks post-recovery congestion on the selected path, which may lead to more packet loss, and it does not consider the system throughput or round trip time after failure convergence.
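For reference, the relation between these two metrics and availability can be written down with the common steady-state definition, availability = MTBF / (MTBF + MTTR). The sketch below assumes this textbook definition; it is not confirmed to be the exact formulation used in [141], and the function name and the figures are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to recover."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical link that fails on average every 99 hours and takes 1 hour to recover.
link_availability = availability(mtbf_hours=99.0, mttr_hours=1.0)
```

Under this definition, shortening the recovery time improves availability even when the failure rate itself is unchanged, which is why faster convergence matters for service availability.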

D. HYBRID RESTORATION AND PROTECTION APPROACHES
Several attempts have been made lately to address failures through either restoration or protection. However, both approaches evidently have inherent defects in terms of recovery time and storage space, which need further investigation. Therefore, some solutions have been proposed to allocate backup rules flexibly by combining the restoration and protection modes to fulfil different application requirements without paying for their drawbacks, as summarized in Table 8. The work in [145] divided flows into Gold, Silver, and Bronze classes and provisioned different backup path strategies. Bronze flows are provisioned with backup paths reactively, while Silver and Gold flows enjoy proactive backup paths. The backup and primary paths of Gold flows have the same priority, and Gold flows are sent to the destination along both paths simultaneously. This way, better load balancing could be achieved. However, such a procedure may reduce the system throughput and increase the chances of congestion, leading to switch Flowtable memory overflow. To optimize Flowtable storage, [146] considered the switch Flowtable storage constraints to devise a recovery scheme. The authors derived link importance metrics and classified links based on the number of flows passing through them and their bandwidth utilization ratio. Three backup strategies are introduced: double-path protection, single-path protection, and the reactive mode. Double-path protection is deployed for links of high importance, and other links are protected through a single path. In this way, the limited switch storage resources are used to protect the more essential links. However, provisioning a double path may affect the system throughput. In [147] and [146], delay was considered during backup path selection, whereas flow classification received little attention. The impact of a failure varies across flows, and the losses caused by a failure differ from flow to flow. In this regard, efficient failure recovery is required to take the flows into account. To this end, the authors devise a formula to derive link importance and provision backup paths based on flow importance. Neither [147] nor [146] considers post-recovery congestion or the potential failure of the selected path.
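The mapping from link importance to one of the three backup strategies can be sketched as follows. The importance formula (a weighted sum of the normalised flow count and the bandwidth utilization ratio), the weight, and the two thresholds are assumptions made purely for illustration; they are not the metric or the values defined in [146] or [147].

```python
def link_importance(
    num_flows: int, max_flows: int, utilization: float, weight: float = 0.5
) -> float:
    """Hypothetical importance score in [0, 1] from flow count and bandwidth utilization."""
    flow_share = num_flows / max_flows if max_flows > 0 else 0.0
    return weight * flow_share + (1.0 - weight) * utilization


def backup_strategy(importance: float, high: float = 0.7, low: float = 0.3) -> str:
    """Pick a backup strategy for a link from its importance score."""
    if importance >= high:
        return "double-path protection"
    if importance >= low:
        return "single-path protection"
    return "reactive"


# Hypothetical link carrying 80 of at most 100 flows at 90% utilization.
strategy = backup_strategy(link_importance(num_flows=80, max_flows=100, utilization=0.9))
```

Only the most important links consume Flowtable space for two preinstalled backup paths, while lightly used links fall back to reactive restoration.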
In contrast, Revive [148] was introduced to proactively install alternative routes on a subset of switches between a given source-destination pair. On the other switches, backup rules are installed reactively, on demand, along with the primary rules. However, this approach requires additional computation of backup paths along the primary path, and the implications of the selected path were overlooked.
A flexible recovery mechanism (BOND) was presented in [9] to deal with link failures. BOND preconfigures backup path rules in the switch Flowtable to establish an alternative path in case of failure. A hash table is used to recover quickly from link failure. However, if a switch fails, the system loses all the routing path information of the network. The works in [149] and [150] present another solution for efficient and flexible link fault tolerance (FTLink). A collection of backup paths is generated for the primary links, and FTLink maintains a matching table of alternative path entries at the controller. When the system detects a failure, it enables the backup link as the new primary link for the affected link. Flow rules are installed in the switch after the controller matches them against the generated table. However, FTLink overlooks a link's criticality before enabling the backup link as the new primary link for the affected link [150]. Such a procedure may lead to future failure, since the backup path may fail earlier than the primary path.

V. RESEARCH CHALLENGES AND FUTURE DIRECTION FOR DYNAMIC ROUTING AND FAILURE RECOVERY
Existing studies have proposed different routing and failure recovery solutions for various use cases. However, dynamic routing based on application requirements, together with efficient failure recovery that manages resource utilization, is still a challenging research area with several unanswered questions. This section discusses some of the unaddressed challenges and suggests future research directions.

A. OPTIMIZED ENERGY-AWARE ROUTING WITH SWITCH UPDATE OPERATION
Most existing energy-aware routing schemes focus on topology awareness, queue engineering, or smart sleep-on strategies to optimize network energy consumption during the routing process. However, update operations in the SDN switch TCAM are very slow, while traffic changes often as large numbers of flows arrive in a dynamic large-scale network. This affects the switch update time, adding significant processing delay on top of power consumption. It would be an interesting research direction to devise dynamic energy-aware routing that incorporates the switch TCAM update operation and link processing power.

B. DYNAMIC ROUTING WITH QOS AWARE
Traffic flows exhibit variability, with different Quality of Service requirements. Some flows are delay-sensitive, others require sufficient link bandwidth, and throughput-sensitive flows are also among the flows with special needs. Some flows require multiple routing constraints for optimal performance. Similarly, routing paths differ: some paths are very critical with limited bandwidth, and others have smaller path latency. It is challenging to find a path that satisfies both properties. Although several dynamic routing schemes have been presented, they overlooked incorporating multiple routing metrics for optimal routing. Therefore, how to aggregate various routing metrics using fuzzy logic while efficiently utilizing the limited Flowtable resource is still an open question.

C. POST-RECOVERY CONGESTION AND INCREASE IN SERVICE DISRUPTION
Restoration recovery schemes offer more flexibility to cope with frequent traffic changes. However, frequent path computation may impose extra processing load on the controller, and the time to update switches may introduce another bottleneck. Several solutions were proposed to address these issues. Some focus on reducing the path computation time, and others introduced path selection bounded by a certain number of operations. Failure recovery with congestion awareness was also proposed. However, these solutions may not always ensure full service availability or reduce disruption. In large-scale networks, path importance differs; some links are critical because of the shortest paths that pass through them. Other links are frequently shared between the primary and backup paths. Therefore, a failure on such a link may lead to failure on multiple paths. Consequently, service disruption leads to significant packet loss and a decline in throughput. Therefore, an efficient recovery scheme with post-recovery congestion awareness and path reliability, considering real traffic flows, is required to improve service availability and decrease system disruption.

D. HYBRID RESTORATION AND PROTECTION RECOVERY APPROACH
Both restoration and protection have their pros and cons. Therefore, combining the two approaches for efficient failure recovery while meeting the requirements of different flows would be another interesting research direction. Different flows have different Quality of Service requirements. Upon failure, packets from video traffic need to be redirected through a path with sufficient bandwidth to accommodate the many affected flows, while delay-sensitive flows such as VoIP may require paths with small delay toward the destination. It would therefore be interesting to classify flows based on their Quality of Service requirements and flexibly apply restoration and protection without paying for their drawbacks.

VI. CONCLUSION
Software Defined Network (SDN) is an emerging network paradigm that offers better network management and speeds up network innovation. However, it has some weaknesses that require urgent attention to speed up the adoption of SDN. Several studies have been conducted over the years to address different challenges introduced by SDN. Route path selection and failure recovery are among the challenges affecting SDN. This paper presents a survey of dynamic routing and failure recovery approaches for efficient resource utilization. The paper elaborates on the concepts and fundamental knowledge of Software Defined Networks required to efficiently design and implement reliable failure recovery. A comprehensive review of route path selection and failure recovery schemes was presented. Tables were provided to support the study, and the existing schemes were critically evaluated to highlight their weaknesses. To overcome the limitations of the existing schemes, this paper suggests future directions to enhance network performance with efficient usage of the switch Flowtable and less overhead.
KAMALRULNIZAM BIN ABU BAKAR (Member, IEEE) received the B.Sc. degree in computer science from Universiti Teknologi Malaysia, Malaysia, in 1996, the M.Sc. degree in computer communications and networks from Leeds Metropolitan University, U.K., in 1998, and the Ph.D. degree in computer science from Aston University, U.K., in 2004. He is currently a Professor with the Department of Computer Science, Universiti Teknologi Malaysia, and a member of the Pervasive Computing Research Group. His research interests include mobile and wireless computing, ad-hoc and sensor networks, information security, and grid computing. He is involved in many research projects and also a referee of several scientific journals and conferences. He is a member of ACM, the Internet Society, and the International Association of Engineering. His research interests include vehicular network security, cyber security, intrusion detection, data science, data mining, and artificial intelligence. He was a recipient of many awards and recognitions, such as the Postdoctoral Fellowship Award, the Best Postgraduate Student Award, the Excellence Awards, and the Best Presenter Award from the School of Computing, Faculty of Engineering, UTM, as well as the best paper awards from many international conferences.
ABDULAZIZ AL-NAHARI received the B.Sc. degree in information technology from Al-Balqa Applied University, in 2005, the M.Sc. degree in computer science from The University of Jordan, in 2009, and the Ph.D. degree in computer science from the School of Computing, Faculty of Engineering, Universiti Teknologi Malaysia (UTM), in 2018. He has been working at the Programming Unit, Sana'a Community College, Sana'a, since 2009. He has been working as a Senior Lecturer at the UNITAR Graduate School, UNITAR International University, Malaysia, since June 2021. His research interests include computer networks, routing protocols in ad-hoc networks, machine learning, and data analytics.