Green Edge Intelligence for Smart Management of a FANET in Disaster-Recovery Scenarios

Disruption of the ground communication infrastructure in emergency scenarios makes post-disaster rescue operations very complicated. This paper proposes to use edge intelligence to support rescue operators in managing these emergency scenarios. A set of Unmanned Aerial Vehicles (UAVs), organized as a Flying Ad-Hoc Network (FANET), autonomously take off and land to provide emergency operators with edge-computing services. A charging station for batteries is supplied by a renewable-energy generator. The FANET Controller applies model-based Reinforcement Learning to decide how many UAVs have to take off, according to the current edge-computing service requests and power availability, and a forecast of both. The optimal management policy has to provide the necessary level of edge computing while avoiding heavy use of satellite channels over a short time horizon during periods of low green-energy generation and high service request. Results highlight that the optimal policy is an efficient modification of the greedy one, i.e., the policy enabling the takeoff of all the necessary UAVs without taking care of challenging events in the future. An in-depth analysis has revealed that the level of modification depends on the combination of the edge-computing service request and the green power availability.

services only, but with very limited data transmission capability and interoperability [7]. In the last decade, thanks to the great advancement in personal communication technologies, from 3G to 4G [8], [9], broadband wireless technologies have begun to be used in emergency communications, allowing users to send and receive crucial voice, video, and other kinds of data [10], [11]. However, conventional emergency communication systems are mostly infrastructure-based, and therefore struggle to maintain communication services when the communication infrastructure is severely damaged [12].
Unmanned Aerial Vehicles (UAVs) have been demonstrated to address this problem by providing a quick turn-around option. During the critical first 72 hours, UAVs can be used in emergency scenarios for tasks such as situational awareness, deploying communication systems, or search and rescue missions [13], [14], [15]. Moreover, small base stations can be mounted onboard them to rapidly deploy an easy-to-operate and responsive emergency ad-hoc network [16], providing broadband communication services anytime and anywhere [8], [17], [18], [19], [20], [21], [22], [23]. This way, UAVs have gained importance as a low-cost tool for post-disaster monitoring and reconstruction management for first responders, planners, and elected officials. Now, the 5th generation broadband cellular networks (5G) have a strong potential to meet the high demands of up-to-date emergency communications for reliability and resilience. To this purpose, in the last few years, some papers have proposed to integrate edge-computing facilities onboard UAVs, making them Multi-Access Edge Computing (MEC) UAVs [24], [25], [26], in order to provide users with edge computing in areas not covered by the structured 5G Internet. However, as observed in some of these works, the problem of limited UAV flight lifetime due to battery charge duration is further exacerbated by the power consumption of the computing elements, which is added to the engine power consumption. For this reason, a widely adopted solution is the use of a storm of UAVs [27] connected to create a Flying Ad-Hoc Network (FANET) [28]. Further insights regarding FANETs can be found in [29], [30], while the reader can refer to [31], [32], [33] for issues related to the management of FANET topology.
The use of a FANET means that, when the battery charge level of a UAV is low, the UAV can land to charge or replace its battery while the remaining UAVs keep the service active. Of course, the higher the number of UAVs that are on the ground for charging, the worse the service provided to the users. This problem is of crucial importance in the considered emergency scenarios, since the scarcity of edge computing and connectivity can strongly deteriorate rescue missions and emergency management activities. A possible solution to this problem is the use of a satellite channel to offload jobs to a remote data centre. However, this should be considered a non-optimal backup solution, aimed at not losing jobs that cannot be managed by the FANET, but with evident disadvantages regarding costs, high delay and privacy. Another important element that must be considered in emergency scenarios is the lack of connection to the power grid. A solution is the use of a mobile renewable generator [34] to charge the UAV batteries, but the available energy is limited and time-variant.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Therefore, the target of this paper is to apply artificial intelligence to the management of FANETs in order to maximize the edge-computing services they provide in post-disaster environments. This is compliant with the new paradigm of Edge Intelligence (EI), which aims at using Artificial Intelligence (AI) at the network edge to manage edge-computing facilities, in this case provided by the UAVs of a FANET [35]. To this purpose, Reinforcement Learning (RL) is applied to decide the takeoff of the UAVs with a full battery, taking into account the time-variant behavior of both the edge-computing service requirements and the power availability from a wind generator (WG). More specifically, the contribution of this paper is three-fold.
1) Proposing an unmanned UAV-based platform to provide emergency operators with edge-computing service capabilities in emergency scenarios. The platform is constituted by a WG-powered charging station (CS), a set of UAVs and backup batteries for immediate swapping and takeoff of landing UAVs [36]. A FANET Controller (FC) decides how many UAVs have to autonomously take off among the necessary and ready ones.
2) Proposing a decision-making strategy maximizing a reward function that depends on the ability to avoid the use of backup satellite transmission to remote data centres for job offload, especially in a short-time horizon.
3) Defining a discrete-time analytical model of the system to provide the RL-based FC with a Markov Decision Process (MDP) to support its actions. The RL uses the model to forecast power availability and service requests in order to find the optimal management policy that the FC must apply to maximize the reward function.
To the best of our knowledge, this is the first paper that proposes a framework to provide edge intelligence in a disaster area where neither computing nor networking facilities nor the power grid are available.
An extensive performance analysis is presented to evaluate the performance of the proposed system. A comparison among the proposed optimal policy, the greedy one (i.e., the policy enabling the take-off of all the necessary UAVs without taking care of more challenging events in the future) and other policies is carried out in some case studies. The comparison has highlighted the ability of the optimal policy to avoid a wide use of satellite channels in a short time horizon.
The paper is organized as follows. Section II describes the reference system, constituted by an emergency area covered by a FANET that provides it with connectivity and edge-computing services. Section III defines the Markov model of the considered system and derives the transition probability and the reward matrices of the MDP used to support the decision-making process of the FC. Section IV analytically derives the main performance parameters characterizing the system behaviour. Section V presents a use case with some numerical analysis. Finally, Section VI concludes the paper.

II. REFERENCE SYSTEM DESCRIPTION AND ASSUMPTIONS
The system considered in this paper is a post-disaster emergency scenario. As sketched in Fig. 1, it is made up of a large number of user devices (sensors, actuators and user equipment) deployed on the ground in a wide emergency area. They require connectivity and edge-computing services to support the monitoring and reconstruction activities of first responders, planners, elected officials and other people involved in emergency management. It is assumed that the considered area is completely isolated, that is, not connected to the structured Internet and not supplied by the power grid. For this reason, a Flying Ad-Hoc Network (FANET) is used to provide the ground devices with the required edge-computing services. The FANET is realized by a fleet of N UAVs, each equipped with an edge-computing facility and a TX/RX element that allows UAVs to communicate with each other and with the devices on the ground. Communications between ground devices and UAVs and among UAVs are handled exclusively by the UAVs, with no need for terrestrial facilities.
Since the focus of the paper is the takeoff management of charged UAVs of the same FANET, communication problems among UAVs and between UAVs and ground devices are not considered. The literature presents various solutions to these problems; for example, [37] reports some details.
Moreover, UAVs are assumed equal to each other in flying, connectivity and computing characteristics.
In the periods when the FANET is not able to satisfy the whole computation load required by the ground devices, the jobs to be processed are sent to a mobile Satellite Transmission Station (STxS), which offloads them through a backup satellite connection to a remote data centre. This solution is necessary to avoid job loss during overload periods. On the other hand, it has to be considered a second-rate solution because it adds costs (due to the use of both a satellite channel, which is a precious resource, and the computation resources in the remote data centre) and because it goes against the principles of edge computing, which is applied to achieve low-latency reaction and privacy preservation. For this reason, the main objective of the proposed framework is to maximize the match between FANET resource availability and time-variant edge-computing service requests, while offloading through the STxS is to be avoided as much as possible.
In the sequel, for the sake of simplicity, the amount of edge-computing service provided by each UAV is defined as the unit of service. Therefore, the time-variant service requested by ground devices and the service provided by the FANET will be expressed in terms of number of UAVs. Likewise, the residual service, i.e., the service requested from the remote data centre through the backup satellite channel because it is not supported by the FANET, will also be expressed in UAV service units.
The main issue of a UAV is the short flying time due to the limited battery charge duration, although the technology trend is moving fast towards a longer duration thanks to high energy density batteries. For this reason, a CS equipped with $C_p$ charging points (CPs) is installed on the ground inside or very close to the disaster area that has to be serviced by the FANET (see footnote 1). Each CP can charge one battery at a time. The CS is supplied by a mobile WG. The power available from this WG varies over time due to the variability of the wind in the area. In order to reduce the number of UAVs waiting for charging on the ground, $B$ high energy density and extremely lightweight backup batteries are also made available [38]. To avoid any human intervention, an automatic battery swap mechanism has been considered [36]. When there is no CP with a full battery, the landed UAV must wait for battery charging to take off again. The average waiting time depends on the power available from the WG and the number of UAVs waiting for a charged battery. When the power available from the WG is sufficient to charge the CPs with a waiting UAV, the remaining power is delivered to other CPs in order to charge batteries that will be used by future landing UAVs. Finally, when the power available from the WG is greater than what is needed by the CPs currently to be supplied, the green power surplus can be used to supply external appliances installed in the same area.
Footnote 1: How close the CS can be placed to the disaster area strongly depends on the specific disaster scenario, its location, and the availability of space for installation. However, all that is needed to realize the proposed system, including a small wind power generator (i.e., a micro wind turbine), can be transported on site by a truck or, if access roads to the disaster area are interrupted, by a helicopter, which will certainly be available because it is used to transport other materials needed in the same area for other purposes. An area of a few hundred square meters is enough for installing the entire framework. For this reason, it is expected that the CS will be installed inside or very close to the disaster area. This will enable the UAVs to provide edge-computing services also in a subinterval while they are leaving or moving toward the CS. Regarding the battery duration problem, there are currently very few solutions that enable 2-3 hours of flight time, and they are very expensive. On the other hand, it is expected that the problem will be mitigated in the medium term by the development of more efficient UAVs that will allow lower power consumption during hovering [39], [40] and by the improvement of battery technologies, which is also pushed by the strong interest in the electrification trend of the automotive market.
Since the devices on the ground need a time-variant edge-computing service, the number of UAVs necessary to guarantee this service changes over time. Moreover, the fact that the WG produces a time-variant power has to be taken into account, since this determines the number of charged batteries available for UAVs that land. In the periods when the number of UAVs in the air is less than the required edge-computing service, the Transmission Station has to use the expensive backup satellite channel to offload jobs to a remote cloud. Thus, managing takeoff among the UAVs that are ready on the ground with a charged battery is a challenging task, since it has to anticipate potential critical situations in the future that would cause high costs for using the backup satellite channel. This decision must be taken by adopting a policy maximizing the cumulative reward, that is, a long-term reward, $Z(n)$, defined as the total discounted reward from time slot $n$:

$$Z(n) = \sum_{k=0}^{\infty} \gamma^{k} R(n+k) \qquad (1)$$

where $R(n)$ is the immediate reward achieved in the time slot $n$, while $\gamma$ is the discount factor, with $\gamma \in [0, 1]$. This is an input parameter that informs the decision-maker about how much it should care about the immediate reward as compared to future rewards. Values of $\gamma$ close to zero mean that the decision-maker only has to take care of the immediate reward. On the contrary, a value of $\gamma$ close to 1 means that the decision-maker is far-sighted, i.e., it has to take care of all future rewards. For this reason, the decision-maker uses the cumulative reward instead of the immediate one: rather than being greedy, taking the action associated with the maximum reward at the current time only, it has to plan for the future. It can be better to sacrifice immediate reward to gain a high cumulative long-term reward.
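As a minimal sketch of how the discount factor weights future slots, the snippet below evaluates a truncated discounted sum over a short reward trace. The function name `discounted_return`, the reward values, and the convention that the sum starts at the current slot with weight γ^0 are illustrative assumptions, not taken from the paper.

```python
def discounted_return(rewards, gamma):
    """Truncated discounted sum: rewards[k] is weighted by gamma**k.

    `rewards` is a finite, hypothetical trace of immediate rewards starting
    at the current slot (an assumption made only for this illustration).
    """
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total


# Hypothetical trace: no penalty now, offloading penalties two and three slots ahead.
trace = [0.0, 0.0, -4.0, -1.0]
myopic = discounted_return(trace, 0.0)       # only the current slot counts
far_sighted = discounted_return(trace, 1.0)  # all slots count equally
```

With γ = 0 the future penalties are invisible to the decision-maker, whereas with γ close to 1 they weigh almost as much as an immediate penalty, which is what pushes a far-sighted policy to keep some charged UAVs in reserve.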
In the considered scenario, the immediate reward R(n) in (1) is a measure of the FANET ability to satisfy the edge-computing request while avoiding satellite usage. Therefore, it is defined as a function of the difference, d, between the amount of edge-computing service required by the ground devices and the edge-computing service that the flying UAVs provide them. More specifically, the goal is to avoid that the edge-computing service provided by the FANET becomes, at some instants, much lower than the required service, thus requiring offloading through the backup satellite channel, which is seen as a penalty because of additional costs, increased delay and reduced privacy. Moreover, it is assumed that the satellite owner applies an incremental price for the amount of traffic to be transmitted. Therefore, for a given number of jobs to be offloaded over a long period, the jobs should not be concentrated in a small interval. Finally, from a privacy point of view, it is also better to avoid offloading many jobs produced in a small interval, since they are correlated with each other, while the correlation of information related to different periods is lower. Consequently, to avoid high satellite usage in the short term, the immediate reward function is defined as in (2). This way, when d is positive, that is, when the FANET is not able to satisfy the request, the reward is negative. On the contrary, a null reward is considered when the FANET satisfies the demand (d ≤ 0).
In the system, the role of decision-maker is played by the FC, an entity running in the CS to coordinate the takeoff of the ready UAVs. To this purpose, the FC takes its decisions by using the best management policy found through RL [41].
It is worth noting that a centralized solution, that is, a single-agent RL centralized on the FC, has been adopted for two main reasons. On one side, using multi-agent RL implies running one agent on each UAV, which may present coordination problems. Moreover, it could require dedicated hardware (e.g., with a GPU board) on each UAV to run computationally hungry training applications. Besides its cost, this hardware could be too heavy to be put on board and could reduce flight autonomy because of its power consumption. On the other side, the FC can easily access all the information needed to train the model and centrally run the RL algorithm since, in this specific scenario where the number of UAVs is not very large, it does not incur the curse-of-dimensionality problems that often motivate a multi-agent approach.
More specifically, the FC uses a model-based RL formulated as a Markov Decision Process (MDP) in Section III. Thanks to this model, the long-term cumulative reward can be computed. Therefore, at the planning stage (off-line mode) of the FC, the best policy is found by selecting the one whose decisions maximize the cumulative reward. After that, at the operation stage, the FC works online. It observes the system conditions (WG power, edge-computing service request, state of the various UAVs and batteries), then identifies the most similar state of the discrete model and, finally, chooses its action, which consists in deciding the number of UAVs to leave on the ground among the ones that are ready for takeoff, according to the best policy found at the planning stage.
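The online stage described above amounts to a table lookup once the policy has been computed offline. The sketch below illustrates the idea under stated assumptions: the similarity measure (squared Euclidean distance), the three-component state tuples and the policy entries are all hypothetical, since the text does not specify how the most similar state is identified.

```python
def nearest_state(observation, states):
    """Return the model state closest to the observation.

    Squared Euclidean distance is an assumption made for this sketch;
    the similarity measure is not specified in the text.
    """
    return min(
        states,
        key=lambda s: sum((o - x) ** 2 for o, x in zip(observation, s)),
    )


def choose_action(observation, policy):
    """Look up the precomputed action for the closest discrete state."""
    return policy[nearest_state(observation, policy.keys())]


# Hypothetical policy table: (WG level, requested UAVs, ready UAVs) -> UAVs left on ground.
policy = {(0, 9, 2): 0, (1, 12, 3): 1, (2, 18, 5): 0}
action = choose_action((1.2, 11.6, 3), policy)
```

Because the expensive optimization is done offline, the online step costs only one pass over the state table (or less, if the states are indexed).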

III. SYSTEM MODEL
In this section, a discrete-time Markov model representing the MDP necessary to solve the RL problem described in the previous section has been defined. Table I summarizes the main notation.
The environment is described by the state of the WG, the edge-computing request and the FANET (number of UAVs in the air, and on the ground with empty and full battery), together with the number of full and empty batteries. Since the FC always knows the state of the considered environment, the environment is said to be fully observable. An MDP $\Sigma$ that describes the environment also depends on the actions associated with the states of the system. A policy is a specific set of actions. In more detail, for a given policy $\rho$ specifying an action $a$ for each state $s_\Sigma$, an MDP is completely defined by a tuple consisting of a finite set of states, a finite set of actions, the state transition probability matrix $P^{(\Sigma|\rho)}$, the immediate reward matrix $\Psi^{(\Sigma|\rho)}$, and the discount factor $\gamma$. Moreover, the value of a state $s_\Sigma$ is defined as the expected return when starting from $s_\Sigma$ and following $\rho$ thereafter, that is:

$$v(s_\Sigma) = E\{ Z(n) \mid S^{(\Sigma)}(n-1) = s_\Sigma \}$$

The matrix $P^{(\Sigma|\rho)}$ depends on the policy $\rho$ that specifies the action for each starting state. Therefore, its generic element, representing the transition probability from the generic state $s'_\Sigma$ to the generic state $s''_\Sigma$, provided that, according to the policy $\rho$, the action $a$ is performed at the beginning of the time slot $n$ according to the starting state $s'_\Sigma$, is:

$$P^{(\Sigma|\rho)}_{[s'_\Sigma, s''_\Sigma]} = \Pr\{ S^{(\Sigma)}(n) = s''_\Sigma \mid S^{(\Sigma)}(n-1) = s'_\Sigma,\ A(n) = a \}$$

Likewise, the generic element of the reward matrix, $\Psi^{(\Sigma|\rho)}_{[s'_\Sigma, s''_\Sigma]}$, is the immediate reward received by performing the action $a$ at the slot $n$ when the system transits from $s'_\Sigma$ to $s''_\Sigma$. Section III-A will define some mathematical notation to model each component of the whole system. Then, Section III-B will describe the MDP of the whole system.
The optimal policy can be derived by means of a dynamic programming algorithm called value iteration (see Algorithm 1), where the term $v(s_\Sigma)$ indicates the value of the generic state $s_\Sigma$, while $v^{*}(s_\Sigma)$ indicates the value of the same state under the optimal policy $\rho^{*}$.
Algorithm 1: Value iteration to derive the optimal policy.
Initialize the value-function array v arbitrarily
Repeat
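For readers who want to experiment, a generic value-iteration routine can be sketched as follows. This is a textbook formulation with in-place updates, not a reproduction of the paper's Algorithm 1; the data layout (`P[a][s][t]`, `R[a][s][t]`, per-state action lists), the stopping tolerance and the two-state toy model are assumptions made for illustration.

```python
def value_iteration(P, R, actions, gamma=0.9, tol=1e-9):
    """Generic value iteration.

    P[a][s][t]: probability of moving from state s to t under action a.
    R[a][s][t]: immediate reward of that transition.
    actions[s]: list of actions allowed in state s.
    Returns the value array and a greedy policy with respect to it.
    """
    n = len(actions)
    v = [0.0] * n  # arbitrary initialization

    def q(s, a):
        # Expected immediate reward plus discounted value of the next state.
        return sum(P[a][s][t] * (R[a][s][t] + gamma * v[t]) for t in range(n))

    while True:
        delta = 0.0
        for s in range(n):
            best = max(q(s, a) for a in actions[s])
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            break
    policy = [max(actions[s], key=lambda a: q(s, a)) for s in range(n)]
    return v, policy


# Toy model (hypothetical): action 0 keeps the state, action 1 switches it.
P = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
# Staying in state 1 costs 1; entering it from state 0 costs 1; leaving it costs 0.5.
R = [[[0, 0], [0, -1]], [[0, -1], [-0.5, 0]]]
values, policy = value_iteration(P, R, actions=[[0, 1], [0, 1]])
```

In the toy model the routine learns to stay in the penalty-free state and to pay the one-off switching cost when it starts in the penalized one.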

A. Markov Model of the System
In this section, the environment of the MDP is modelled with a discrete-time multi-dimensional Markov model. As already said, the time slot is defined as the UAV battery charging time, Δ. For the sake of simplicity, it is assumed that the mean flight time, defined as the time between the UAV takeoff instant and the landing instant, and therefore including the climb, cruise, and descent phases, is, on average, a multiple, H, of the charging time, where H is an integer. Therefore, the flight time has a duration of H · Δ.
The behaviour of the overall system in the generic time slot $n$ is represented by the state process $S^{(\Sigma)}(n)$, whose component processes are defined below. The first two elements of $S^{(\Sigma)}(n)$ are the states of the two independent processes that influence the behaviour of the system: $G(n)$, characterizing the renewable power availability, is defined as the number of UAVs that the WG is able to charge at the time slot $n$; $M(n)$, characterizing the time-variant edge-computing service request coming from the ground devices, is defined as the number of UAVs that are required to satisfy the request.
The process $M(n)$ is modelled as a Markov chain characterized by its state, $S^{(M)}(n)$, and its transition probability matrix, $P^{(M)}$, whose generic element is defined as follows:

$$P^{(M)}_{[s'_M, s''_M]} = \Pr\{ S^{(M)}(n) = s''_M \mid S^{(M)}(n-1) = s'_M \} \qquad (7)$$

for each $s'_M$ and $s''_M$ belonging to the state space of the process. In order to model the renewable-energy generation process $G(n)$, as in [42], the Switched Batch Bernoulli Process (SBBP) model is used. This is the most general Markov-modulated process in the discrete-time domain, being able to capture both first- and second-order statistics of a process. According to [43], it is defined by: $S^{(G)}(n)$, representing the state of the underlying Markov chain at the generic time slot $n$; the state space of the underlying Markov chain, that is, the set of all the possible states that the process $S^{(G)}(n)$ can assume; $\Psi^{(G)}$, representing the space of values of the process $G(n)$, that is, the set of all the possible values that the process $G(n)$ can assume; $P^{(G)}$, representing the state transition probability matrix of the underlying Markov chain, whose generic element is defined as done in (7) for the process $M(n)$; $B^{(G)}$, representing the value occurrence probability matrix, whose generic element is defined as follows:

$$B^{(G)}_{[s_G, g]} = \Pr\{ G(n) = g \mid S^{(G)}(n) = s_G \}$$

Before modeling the FANET behavior, the behavior of each UAV is modelled as a 3-state Markov model, as depicted in Fig. 2. When a flying UAV (state A, on air) reaches a low state of charge (SoC), it lands on a CP of the Charging Station and waits for charging. This is represented by the permanence in the state "Empty Battery" (E), where it remains for as many time slots as the CP is not supplied. When the CP where the UAV is placed is supplied, battery charging occurs in a single time slot, given that the time slot has been set equal to the battery charging time. Therefore, in the successive time slot, the UAV enters the state "Ready to takeoff" (R). On the other hand, it immediately enters this state when there is a full battery in the CS.
The UAV stays in this state until the FC authorizes its takeoff, after which the cycle restarts. Therefore, the transitions E ⇒ E and E ⇒ R depend on the available power from the WG, while the transitions R ⇒ R and R ⇒ A depend on the FC decision. The model of the whole FANET behavior can thus be represented by the state process $S^{(D)}(n)$, a three-dimensional array counting the number of UAVs whose battery is empty, $S^{(D)}_E(n)$, the number of UAVs ready for takeoff, $S^{(D)}_R(n)$, and the number of UAVs in the air, $S^{(D)}_A(n)$. Of course, the sum of the numbers of UAVs in each state gives the total number of UAVs in the FANET:

$$S^{(D)}_E(n) + S^{(D)}_R(n) + S^{(D)}_A(n) = N$$

Let us highlight that the value of $S^{(D)}_R(n)$ represents the upper bound of the action space in the slot $n$. Therefore, the whole action space is the set of integers $a \in \{0, 1, \ldots, S^{(D)}_R(n)\}$. Finally, the last process to be modelled is $B(n)$, representing the number of backup batteries that are fully charged and, consequently, ready to be mounted on a UAV for takeoff. Its state dynamics, indicated as $S^{(B)}(n)$, depend on the state of the other processes, as described later. Recalling that the total number of available backup batteries is $B$, the state space of the process is $\{0, 1, \ldots, B\}$. It is worth noticing that the complexity is given by the size of the state space, which is calculated as the Cartesian product of the state spaces of the component processes. However, since the model solution is carried out offline, it does not create any convergence or time resolution problems.
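To make the SBBP used for the WG more concrete, the following sketch draws a sample path of a Markov-modulated process: an underlying chain evolves according to a transition matrix, and in each slot the emitted value is drawn from the row of a value-occurrence matrix selected by the current state. The function name `sample_sbbp`, the two-state matrices and the ordering of the two draws within a slot are assumptions made for illustration only.

```python
import random


def sample_sbbp(P, B, values, n_slots, state=0, seed=0):
    """Draw (state, value) pairs from a Markov-modulated process.

    P[i][j]: transition probability of the underlying chain.
    B[i][k]: probability of emitting values[k] while in state i.
    The chain moves first, then the value is drawn (an assumed ordering).
    """
    rng = random.Random(seed)
    trace = []
    for _ in range(n_slots):
        state = rng.choices(range(len(P)), weights=P[state])[0]
        value = rng.choices(values, weights=B[state])[0]
        trace.append((state, value))
    return trace


# Hypothetical two-state wind model: state 0 = calm, state 1 = windy.
P_G = [[0.8, 0.2], [0.3, 0.7]]
B_G = [[0.6, 0.4, 0.0], [0.0, 0.3, 0.7]]  # chargeable UAVs: 0, 1 or 2
wind_trace = sample_sbbp(P_G, B_G, values=[0, 1, 2], n_slots=24)
```

A trace of this kind can be used to sanity-check a fitted model against measured wind data before the matrices are fed to the MDP.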

B. Transition Probability and Reward
The system state evolution depends on the edge-computing service amount requested by the ground devices and the power availability from the WG, assumed independent of each other, as well as on the decisions taken by the FC, which applies RL to choose how many UAVs have to remain on the ground among the ones with a full battery. In order to define the transition probability matrix of the system as a whole, the event sequence for each time slot is defined. In time slot n, it is described as shown in Fig. 3.
At the very beginning of the time slot:
1) Decision of the action $A(n) = a$, representing the number of UAVs that the FC will leave on the ground (they will not take off) among the ones that are ready for takeoff. This decision is taken by the FC based on the initial state $S^{(\Sigma)}(n-1) = s'_\Sigma$; of course, it cannot be greater than the number of UAVs that are ready for takeoff, that is, $a \le s'_R$.
2) Takeoff of a number of UAVs whose value follows from the action $a$.
The generic element of the transition probability matrix for a given action $a$ can be defined as the product of three terms: the generic element of the transition probability matrix $P^{(G)}$, the generic element of the transition probability matrix $P^{(M)}$, and the generic element of the transition probability matrix of the joint process $(S^{(B)}(n), S^{(D)}(n))$, conditioned on the WG state $s_G$ and the action $a$. The first two matrices are known as the input of the problem, while the last one depends on the state of the WG and the number of UAVs that the FC has decided to leave on the ground. To simplify reading, the derivation of the last matrix is reported in the Appendix.
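Since the WG and the service-request processes are assumed independent of each other, the transition matrix of the pair formed by their two underlying chains is the Kronecker product of the individual matrices. The short sketch below illustrates this with hypothetical two-state matrices; the helper name `kron` and the numerical values are assumptions, and the third, action-dependent factor of the full model is not covered here.

```python
def kron(A, B):
    """Kronecker product of two row-stochastic matrices given as nested lists.

    Row index of the result is i_A * len(B) + i_B; the column index is built
    the same way, so entry values are A[i_A][j_A] * B[i_B][j_B].
    """
    return [
        [a * b for a in row_a for b in row_b]
        for row_a in A
        for row_b in B
    ]


# Hypothetical two-state chains for the WG and the service request.
P_wind = [[0.9, 0.1], [0.2, 0.8]]
P_request = [[0.5, 0.5], [0.3, 0.7]]
P_joint = kron(P_wind, P_request)
```

Each row of `P_joint` still sums to one, which is a quick consistency check when assembling larger models.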
The generic element of the immediate reward matrix can be calculated in accordance with the immediate reward function defined in (2). It is the immediate reward for a given transition from the time slot $n-1$, when the system is in the generic state $s'_\Sigma$, to the time slot $n$, when the system is in the generic state $s''_\Sigma$, and for a given action $a$ taken according to the state $s'_\Sigma$. According to (2), it is a function of the difference between the edge-computing service required by the ground devices and the one provided by the flying UAVs.

IV. SYSTEM PERFORMANCE
In order to calculate the main system performance indices, the steady-state probability array achieved when the optimal policy $\rho^{*}$ is applied is needed. Its generic element is defined as follows:

$$\pi^{(\Sigma)}_{[s_\Sigma]} = \lim_{n \to \infty} \Pr\{ S^{(\Sigma)}(n) = s_\Sigma \}$$

where the dependence on $\rho^{*}$ is omitted for the sake of conciseness. The same will be done in the sequel. The state probability array $\pi^{(\Sigma)}$ can be derived by solving the following steady-state equation system:

$$\pi^{(\Sigma)} P^{(\Sigma)} = \pi^{(\Sigma)}, \qquad \pi^{(\Sigma)} \mathbf{1}^{T} = 1$$

where $P^{(\Sigma)}$ is the transition probability matrix calculated as in Section III with the optimal policy $\rho^{*}$ derived by solving the Bellman optimality equation system [43], and $\mathbf{1}$ is a row array of ones. Three quantities have been considered as performance parameters. The first performance parameter is the mean immediate reward. It represents the FANET edge-computing satisfaction level and is the target to be optimized. In order to evaluate how well the proposed framework is able to match the edge-computing service requests coming from the ground devices, the random variable $\delta$ is defined as the difference between the amount of edge-computing service request and the edge-computing service actually provided by the flying UAVs during a generic time slot, i.e., $\delta = s_M - s_A$.
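A steady-state array of the kind used above can be approximated numerically by repeatedly multiplying a probability array by the transition matrix until it stops changing. The sketch below uses this power-iteration approach, which is one common choice and not necessarily the solver used by the authors; it assumes an irreducible, aperiodic chain, and the two-state matrix is a hypothetical example.

```python
def steady_state(P, tol=1e-12, max_iter=100000):
    """Approximate the stationary array pi satisfying pi P = pi, sum(pi) = 1.

    Plain power iteration from the uniform distribution; convergence is
    assumed (irreducible, aperiodic chain).
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


# Hypothetical two-state chain; its exact stationary array is (5/6, 1/6).
pi = steady_state([[0.9, 0.1], [0.5, 0.5]])
```

For models with hundreds of thousands of states, a sparse representation of the matrix would be needed, but the iteration itself is unchanged.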
When the edge-computing request suddenly decreases below the number of UAVs that are currently flying, the number of flying UAVs is greater than the actual edge-computing service request, thus the variable $\delta$ can also assume negative values. Of course, values of $\delta$ greater than zero give penalties due to the need for remote offloading through the backup satellite link, because the amount of edge-computing service provided by the FANET is less than the required one. The probability density function (pdf) of the random variable $\delta$ can be calculated by summing the steady-state probabilities of all the states $s_\Sigma$ for which $s_M - s_A$ equals the considered value. The mean immediate reward can then be calculated by averaging the immediate reward in (2) over this pdf. The second performance parameter is the k-level penalty for remote offloading, $\wp_k$. It is expressed with a polynomial law (k being the polynomial function degree) of the distance between the amount of edge-computing service request and the edge-computing service actually provided by the flying UAVs. In other terms, the higher the importance of avoiding satellite usage in the short term, the higher the value to be used for k. This performance parameter is defined in (19). Finally, the third performance parameter is the mean residual power. It represents the mean value of the power generated by the WG that is not used by the CS. This amount of power is available to supply other loads. For example, the residual power can supply other appliances needed to manage the emergency in the same area. To this purpose, the saved power process, $P(n)$, is defined as the difference between the power generated by the WG, $G(n)$, and the one absorbed by the CS, $C(n)$, that is, $P(n) = G(n) - C(n)$, both expressed in terms of the number of equivalent UAVs, which are used as the unit of measure of the generated and the absorbed power. Specifically, $C(n)$ is the number of CPs that are supplied during the time slot $n$. Therefore, $C(n)$ is given by the sum of the UAV batteries being charged and the other batteries being charged to be ready for UAVs that will land in the future.
Its first-order statistics can be represented by the pdf of its values. It is worth noting that, due to (27) and (28), $C(n) \le G(n)$, and so $P(n) \ge 0$. This pdf follows from the definition of the SBBP process $G(n)$ and the joint process $(S^{(B)}(n), S^{(D)}(n))$ defined so far. In its expression, the term $I_{s_\Sigma, g}(p)$ is a Boolean function indicating whether an amount of power equal to $p$ is not used when the system state is $s_\Sigma$ and the WG has generated an amount of power equal to $g$. It can be derived considering that, during the time slot $n$, given the system state $s_\Sigma$, the CS charges a number of non-mounted batteries given by the variation of $s_B$ and a number of mounted batteries given by the variation of $s_E$. The mean value of the residual power, $E\{P\}$, will be applied in the numerical analysis to evaluate the mean power that can be used to supply other electrical loads.
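The pdf of δ and a polynomial offloading penalty, both introduced earlier in this section, can be sketched in a few lines once a steady-state distribution is available. In the snippet below the distribution is a small hypothetical dictionary keyed by (requested UAVs, flying UAVs), and the penalty is assumed to be the mean of δ^k over the positive values of δ; the exact expression used in the paper is not reproduced here, so this form is only an illustration of a polynomial law.

```python
from collections import defaultdict


def delta_pdf(pi):
    """Aggregate steady-state probabilities by delta = s_M - s_A.

    `pi` maps (s_M, s_A) pairs to probabilities (hypothetical layout).
    """
    pdf = defaultdict(float)
    for (s_m, s_a), prob in pi.items():
        pdf[s_m - s_a] += prob
    return dict(pdf)


def k_level_penalty(pdf, k):
    """ASSUMED polynomial penalty: sum of d**k * Pr{delta = d} over d > 0."""
    return sum((d ** k) * prob for d, prob in pdf.items() if d > 0)


# Hypothetical steady-state probabilities.
pi_example = {(9, 9): 0.5, (12, 10): 0.25, (18, 15): 0.125, (9, 10): 0.125}
pdf = delta_pdf(pi_example)
penalties = {k: k_level_penalty(pdf, k) for k in (1, 2)}
```

Raising k makes large shortfalls dominate the penalty, which matches the stated intent of discouraging concentrated satellite usage.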

V. SYSTEM SIMULATION
In this section, the proposed optimal FANET management has been applied in some different scenarios. The main aim is to show the performance achieved by RL in terms of the ability to satisfy the edge-computing service request, as well as how to exploit the variability of the green power availability.
These analyses are firstly performed in a specific scenario in Section V-A. Then, in Section V-B, other comparisons are presented by varying some parameters of the edge-computing service request process, but keeping its steady-state probability distribution.

A. Reference Scenario for Numerical Results
The reference scenario considered in this section is a post-disaster area whose edge-computing service requests are due to the need for computation to support rescue operations and the maintenance of a temporary field hospital built on site. These generate jobs each requiring a number of CPU operations with an average value γ = 1.84 · 10^6. In order to process these jobs, a FANET has been considered where each UAV is equipped with an edge-computing facility constituted by an Intel Core™ i7 CPU (8 MB cache, 2.7 GHz) and 32 GB of DDR4 SO-DIMM RAM. Therefore, the above CPU is able to process jobs with a rate $\mu_P$ = 1.47 kjob/s. The edge-computing service request process is characterized by two main macro-states: a low-activity macro-state, with a job generation process ranging in the interval [13.2, 19.1] kjob/s, and a high-activity macro-state, with a job generation process ranging in the interval [26.5, 29.4] kjob/s. Quantizing the overall job-generation process with levels of $\mu_P$ = 1.47 kjob/s to express it in units of required UAVs, and applying the inverse-eigenvalue technique [42] to derive the SBBP modelling it, an eight-state SBBP is obtained that assumes values in the set {9, 10, 11, 12, 13, 18, 19, 20}. Its transition probability matrix is shown in Fig. 4. Therefore, a FANET constituted of 20 UAVs has been considered, so as to be able to cope with the highest edge-computing service request.
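The figures in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below assumes roughly one CPU operation per clock cycle (which reproduces the stated processing rate) and assumes that quantization means rounding to the nearest integer number of UAVs; both are assumptions made for this check, not statements from the paper.

```python
CLOCK_HZ = 2.7e9        # CPU clock from the scenario description
OPS_PER_JOB = 1.84e6    # average CPU operations per job
MU_P = 1470.0           # job/s per UAV, as stated in the text


def jobs_per_second(clock_hz=CLOCK_HZ, ops_per_job=OPS_PER_JOB):
    """Processing rate, ASSUMING one operation per clock cycle."""
    return clock_hz / ops_per_job


def required_uavs(rate_jobs_per_s, mu=MU_P):
    """Quantize a job rate into UAV units, ASSUMING nearest-integer rounding."""
    return round(rate_jobs_per_s / mu)


low = (required_uavs(13200.0), required_uavs(19100.0))
high = (required_uavs(26500.0), required_uavs(29400.0))
```

Under these assumptions the two macro-state intervals map onto the ranges 9 to 13 and 18 to 20 UAVs, in agreement with the value set reported for the eight-state process.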
A UAV provides edge-computing service during its flight. Therefore, the average time for which a UAV provides edge-computing services is equivalent to its average flight autonomy, whilst it does not provide any edge-computing service while it stays in the CS. In order to improve the service provided by the FANET, B = 20 additional batteries have also been considered for battery swapping, so that the overall number of batteries is 40. If a landed UAV finds a charged battery available in the CS, it is assumed that it immediately swaps the battery and takes off. Therefore, when a full battery is available on the ground, the UAV provides the edge-computing service with no interruption, i.e., both in the time slot in which it lands and in the next one, provided that the FC enables it to take off.
Considering that the UAV engines have a power consumption of about 510 W while the computing facility mounted onboard each UAV has an average power consumption of 35 W, and assuming that each UAV uses a 48 V lithium battery with a capacity of 34000 mAh, the flight autonomy of each UAV is about 3 hours, while the charging process takes 1 hour. Assuming highly efficient power converters, with an efficiency over 90%, the WG power needed to charge a battery is about 1.8 kW.
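The flight autonomy and the charging power follow from the battery energy. The short sketch below reproduces them, assuming a constant power draw, a full discharge of the battery, and a converter efficiency of exactly 0.9 (the text only states "over 90%"); the variable names are illustrative.

```python
BATTERY_VOLTAGE_V = 48.0
BATTERY_CAPACITY_AH = 34.0    # 34000 mAh
ENGINE_POWER_W = 510.0
COMPUTE_POWER_W = 35.0
CHARGE_TIME_H = 1.0
CONVERTER_EFFICIENCY = 0.9    # assumed value; the text states "over 90%"

# Energy stored in a full battery, in Wh.
battery_energy_wh = BATTERY_VOLTAGE_V * BATTERY_CAPACITY_AH

# Autonomy under a constant draw of engines plus computing facility.
flight_autonomy_h = battery_energy_wh / (ENGINE_POWER_W + COMPUTE_POWER_W)

# Generator-side power needed to refill one battery in CHARGE_TIME_H hours.
charge_power_kw = battery_energy_wh / CHARGE_TIME_H / CONVERTER_EFFICIENCY / 1e3

print(f"energy = {battery_energy_wh:.0f} Wh")
print(f"autonomy = {flight_autonomy_h:.2f} h")
print(f"charging power = {charge_power_kw:.2f} kW")
```

With these inputs the battery stores 1632 Wh, which gives an autonomy just under 3 hours and a charging power of about 1.8 kW, matching the values used in the scenario.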
According to the setting of the evaluation scenario described so far, the MDP is constituted by 155,232 states and a total of 1,190,112 actions, considering that in each state the set of possible actions ranges between 0 and the number of UAVs that are on the ground ready for takeoff. In the considered case, computing the model solution on a computer with an Intel i7 4.7 GHz CPU and 16 GB of RAM took about 2 hours. However, this is not a problem for a real-time application since, as already said, the RL approach used in this paper is model based, and therefore solutions can be derived offline.
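To illustrate what deriving a policy offline from a known model means, the sketch below runs value iteration on a hypothetical three-state toy MDP. It is not the model of this paper: the states, rewards, and the discounted formulation are assumptions chosen only to show that a state-dependent action set can be handled and that the immediately best action is not necessarily the optimal one.

```python
def value_iteration(transitions, rewards, discount=0.9, tol=1e-9):
    """Solve a small MDP offline.

    transitions[s][a] is a list of (probability, next_state) pairs and
    rewards[s][a] is the immediate reward; action sets may differ per state.
    """
    values = [0.0] * len(transitions)
    while True:
        new_values, policy = [], []
        for s, actions in enumerate(transitions):
            q_values = [
                rewards[s][a] + discount * sum(p * values[s2] for p, s2 in outcomes)
                for a, outcomes in enumerate(actions)
            ]
            best_action = max(range(len(q_values)), key=q_values.__getitem__)
            policy.append(best_action)
            new_values.append(q_values[best_action])
        delta = max(abs(new - old) for new, old in zip(new_values, values))
        values = new_values
        if delta < tol:
            return values, policy


# Hypothetical toy model (not from the paper).
# State 0: calm period, one UAV ready. Action 0 = hold it, action 1 = take off now.
# State 1: peak period faced with the stocked UAV. State 2: peak period without it.
TRANSITIONS = [
    [[(1.0, 1)], [(1.0, 2)]],
    [[(1.0, 0)]],
    [[(1.0, 0)]],
]
REWARDS = [
    [-1.0, 0.0],   # holding costs a small immediate penalty
    [0.0],         # peak fully served
    [-4.0],        # peak served poorly
]

values, policy = value_iteration(TRANSITIONS, REWARDS)
print("optimal action in state 0:", policy[0])
```

In this toy case the optimal action in state 0 is to hold the UAV, even though taking off gives the better immediate reward.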
In order to evaluate the performance achieved with the optimal policy obtained by RL according to (2), it has been compared with other policies. In particular, the greedy policy has been considered, which is equivalent to always leaving a = 0 UAVs on the ground. Similarly, the policies that always leave a equal to 1, 2 and 3 UAVs, respectively, on the ground have also been analyzed. Finally, two random policies have been considered, in which the value of a follows an exponential and a linear probability distribution, respectively: the lower the value of a, the greater the probability that a ready-for-takeoff UAVs are left on the ground. In the following figures, these strategies are labeled as "Exp" and "Lin". Fig. 5 shows the k-level penalty for remote offloading, $\wp_k$, defined in (19), for k ∈ {1, 2, 3, 4}. Each policy is represented by a circle whose radius is inversely proportional to the penalty and normalized with respect to the best one. Therefore, the greater the radius, the better the performance of a policy. In Fig. 5(a) (k = 1), the greedy policy is the best one. This means that, in this case, there is no advantage in leaving a > 0 ready-for-takeoff UAVs in the CS to be used for future high-request events. This is because the penalty is linearly proportional to the difference, d, between the amount of edge-computing service required by the ground devices and the edge-computing service that the flying UAVs provide to them. In this case, the goal of avoiding high satellite usage within a short period is neglected, since no additional full batteries are kept to face hard future situations. Nevertheless, the policy discovered by RL also provides good results, very close to those of the greedy one.
Instead, starting from k = 2, the policy discovered by RL is the best one, as shown in Fig. 5(b)-(d). In fact, when k = 2, the RL policy is optimal by construction, since it has been found by using (2), which is equivalent to (19) for this value of k. When k > 2, the RL policy also performs better than the other ones thanks to its ability to stock ready-for-takeoff UAVs for periods in which there will be higher requests of edge-computing service and/or lower power availability. This is confirmed by the increasing ability of the RL policy to outperform the greedy one as k increases, as demonstrated by the radius reduction of the red circle (a = 0) when passing from Fig. 5(b) to Fig. 5(c) and, finally, to Fig. 5(d).
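The dependence of the ranking on k can be illustrated with a small numerical sketch. Since (19) is not reproduced in this section, the sketch assumes that the k-level penalty is the average of $d^k$ over the time slots; the two deficit sequences are hypothetical and are not taken from the results of the paper.

```python
def k_level_penalty(deficits, k):
    """Average of d**k over the slots (assumed form of the k-level penalty)."""
    return sum(d ** k for d in deficits) / len(deficits)


# Hypothetical per-slot deficits d (service offloaded to the satellite).
# Greedy: nothing is offloaded until a peak arrives with no stocked UAVs.
GREEDY_DEFICITS = [0, 0, 0, 4]
# Stocking: a small deficit is accepted early to soften the peak.
STOCKING_DEFICITS = [1, 1, 1, 2]

for k in (1, 2, 3, 4):
    greedy = k_level_penalty(GREEDY_DEFICITS, k)
    stocking = k_level_penalty(STOCKING_DEFICITS, k)
    print(f"k = {k}: greedy = {greedy:.2f}, stocking = {stocking:.2f}")
```

With these illustrative numbers, the greedy sequence has the lower penalty for k = 1, because its total deficit is smaller, while the stocking sequence is better for every k ≥ 2, and the gap widens as k grows. This is the same qualitative trend observed in Fig. 5.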

B. Scenarios With Different Edge-Computing Service Requests
This section analyzes the behaviour of the proposed strategy for different edge-computing scenarios. Moreover, it investigates the features of the RL policy that enable it to outperform the other policies. The study highlights that the key to its success is the adoption of different strategies under different levels of available green power.
Three scenarios with lower and six scenarios with higher edge-computing service requests are analyzed here. More specifically, each new scenario is derived by adding a value x to the elements of (M), the set of all the possible numbers of UAVs required to satisfy the time-variant edge-computing service requests coming from the devices on the ground in the reference scenario used in the above analysis. In other words, the generic scenario is characterized by the set {9 + x, 10 + x, 11 + x, 12 + x, 13 + x, 18 + x, 19 + x, 20 + x}. The first three scenarios are derived with x ∈ {−9, −6, −3}, while the other six scenarios are derived with x ∈ {3, 6, 9, 12, 15, 18}. In this way, for example, the lowest-load scenario, obtained with x = −9, is characterized by a job generation process ranging in the interval [0, 5.9] kjob/s during the low-activity macro-state, and in the interval [13.2, 16.2] kjob/s during the high-activity macro-state. As for the reference scenario, the number of UAVs is always chosen equal to the highest edge-computing service request, i.e., N = 20 + x, in order to be able to cope with this request as well. Consequently, the number of backup batteries has been set to B = 20 − x in order to keep the overall number of batteries constant (equal to 40, as mentioned above). All the other quantities of the reference scenario have been maintained. In each new scenario, the optimal policy has been found by using RL. Fig. 6 reports the rewards obtained as in (18) with the previously mentioned policies, normalized with respect to the one obtained by RL. As expected, the values at x = 0 are equal to the ones obtained in the previous section for k = 2 and reported in Fig. 5(b). The main outcome of Fig. 6 is that the performance of the greedy policy (a = 0) tends towards the best one (the RL one) in the extreme scenarios (i.e., very low or very high edge-computing requests).
Instead, for intermediate values of x, the greedy policy is far from the optimal one; this means that, in most cases, it is more convenient to leave one or more ready-for-takeoff UAVs in the CS, although this implies not satisfying the immediate edge-computing request, in order to avoid a too poor quality of service in the future. The performance of the other policies is very poor when the edge-computing request is low (on the left side of the figure) and improves as the average request increases, although it remains far from that of the best policy, achieved by RL.
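The construction of the scenarios described above can be summarized in a few lines. The sketch below derives, for a given x, the shifted set of required UAVs, the fleet size N, the number of backup batteries B, and the job-rate intervals of the two macro-states. It assumes that the first five levels of the reference set belong to the low-activity macro-state and the last three to the high-activity one; the function and key names are illustrative.

```python
REFERENCE_LEVELS = [9, 10, 11, 12, 13, 18, 19, 20]   # required UAVs, reference scenario
MU_P_KJOB_S = 1.47                                    # per-UAV service rate
LOW_ACTIVITY_STATES = 5                               # assumed split: first five levels are low activity


def derive_scenario(x):
    """Build the scenario obtained by shifting the reference levels by x."""
    levels = [m + x for m in REFERENCE_LEVELS]
    low = levels[:LOW_ACTIVITY_STATES]
    high = levels[LOW_ACTIVITY_STATES:]
    return {
        "levels": levels,
        "n_uavs": 20 + x,              # N = 20 + x
        "backup_batteries": 20 - x,    # B = 20 - x
        "low_range_kjob_s": (low[0] * MU_P_KJOB_S, low[-1] * MU_P_KJOB_S),
        "high_range_kjob_s": (high[0] * MU_P_KJOB_S, high[-1] * MU_P_KJOB_S),
    }


for x in (-9, -6, -3, 0, 3, 6, 9, 12, 15, 18):
    scenario = derive_scenario(x)
    total_batteries = scenario["n_uavs"] + scenario["backup_batteries"]
    print(x, scenario["n_uavs"], scenario["backup_batteries"], total_batteries)

lowest = derive_scenario(-9)
```

For x = −9 the low-activity interval is approximately [0, 5.9] kjob/s and the high-activity one approximately [13.2, 16.2] kjob/s, and the total number of batteries stays at 40 for every x.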
In order to better understand the reasons behind these results, a parameter named Additionally Stocked UAVs (AS-UAVs) has been defined. The AS-UAVs are the UAVs that the FC leaves on the ground although they are ready for takeoff (i.e., their battery is charged) and although they would be useful on air because the flying ones are not sufficient to provide the required service at that moment. The FC decides to keep them on the ground for future takeoff when it forecasts a service request higher than the current one, which could deteriorate the reward at that time, since the reward accounts for occurrences of too poor a quality of service. Of course, when there are more ready-for-takeoff UAVs than the ones that need to take off, the unnecessary ones remain in the CS even when the greedy policy is applied, but these UAVs are not AS-UAVs. It is therefore worth underlining that the AS-UAVs are additional with respect to the UAVs that remain in the CS when the greedy policy is applied. Conversely, when the ready-for-takeoff UAVs are fewer than the necessary ones, each ready-for-takeoff UAV forced to stay in the CS is an AS-UAV, since it is expressly stocked for future use.
The number of AS-UAVs in the generic slot n is derived from the following quantities, and its expected value, Γ, is computed as in (25): $S^{(M)}(n) - S^{(A)}(n) = s_M - s_A$ is the number of UAVs that are necessary on air to satisfy the service request, while $S^{(R)}(n) = s_R$ is the total number of available UAVs on the ground, ready for takeoff. Instead, $S^{(D)}(n) = t_D$ is the actual number of UAVs that will take off according to the FC decision, as defined in (12). Fig. 7 shows the values of Γ computed as in (25). When the edge-computing request is low, the best policy operates similarly to the greedy one, as apparent from the magnification inside Fig. 7. The best policy behaves differently in very few cases for the scenario x = −9, but such a small difference is sufficient to obtain an improvement. As x increases, the number of cases where the RL policy operates differently from the greedy one slightly increases (see the magnification of Fig. 7), but this small difference enables a good improvement in terms of reward (see Fig. 6). This is mainly because $\wp_2$ is close to 0 for the best policy, so that a small difference with respect to the value reached by the greedy policy is sufficient to yield this improvement. Fig. 7 also shows that, in these scenarios, the other policies behave differently from both the greedy and the optimal ones. Therefore, their performance is very poor with respect to the best one.
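The verbal definition of the AS-UAVs given above can be turned into a small per-slot computation. The sketch below is one reading of that definition, not the formula of the paper: it assumes that the greedy policy would launch min(s_R, max(s_M − s_A, 0)) UAVs and counts as AS-UAVs the difference between that number and the actual takeoffs t_D. The function name is illustrative.

```python
def additionally_stocked_uavs(s_m, s_a, s_r, t_d):
    """AS-UAVs in one slot, following the verbal definition (assumed form).

    s_m: UAVs required by the current service request
    s_a: UAVs currently flying
    s_r: UAVs on the ground that are ready for takeoff
    t_d: UAVs that the FC actually lets take off
    """
    if not 0 <= t_d <= s_r:
        raise ValueError("t_d must lie between 0 and the number of ready UAVs")
    still_needed = max(s_m - s_a, 0)
    greedy_takeoffs = min(s_r, still_needed)
    # Ready UAVs beyond the current need stay on the ground under any policy,
    # so they are not counted as additionally stocked.
    return max(greedy_takeoffs - t_d, 0)


# Four more UAVs are needed and six are ready: launching three stocks one UAV.
print(additionally_stocked_uavs(s_m=12, s_a=8, s_r=6, t_d=3))
# Four more UAVs are needed but only three are ready: each one held back counts.
print(additionally_stocked_uavs(s_m=12, s_a=8, s_r=3, t_d=1))
```

Averaging this quantity over the slots visited under a given policy would give an empirical counterpart of the expected value shown in Fig. 7.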
In the scenarios with very high edge-computing requests, i.e., when x > 9, the difference between the behaviour of the best and greedy policies increases, although they remain quite similar on average. On the other hand, the number of UAVs also increases, becoming more than three times greater than in the lowest-load scenario. Therefore, the difference between the number of UAVs that take off according to the best policy and the others is proportionally lower than in the previous scenarios. In other terms, although the difference between the best policy and the others increases in terms of the average number of AS-UAVs, the difference in terms of the overall behaviour decreases. Moreover, as the edge-computing service request increases, the performance of any policy worsens, and thus the reward moves further and further from 0 even for the best policy. Consequently, for high values of x, the normalized reward of the other policies increases, as shown in Fig. 6, although it remains far from that of the best policy.
Fig. 8 reports the same quantity as Fig. 7 for the RL policy only (that is, the curve labelled "Overall" in Fig. 8 is the same as the one labelled "RL" in Fig. 7). Moreover, this quantity is also reported by grouping the results according to the states characterized by the same level of power generated by the WG. For example, the red curve, labelled "Power Level 1", represents the average number of AS-UAVs when the WG is in the lowest power generation state. These curves reveal some interesting aspects of the choices made by the best policy in relation to the power available from the green generator. When the available power is low, the best policy keeps some AS-UAVs to cope with more challenging future events. As apparent from the magnification in Fig. 8, the small difference observed so far between the best and the greedy policies is mainly due to their behaviour during the periods of very low green power availability.
In fact, at all the other, higher power levels, the average number of AS-UAVs is zero, as for the greedy policy, when the average edge-computing request is not high. Such behaviour is mainly due to the combination of two reasons. In such scenarios, since the average edge-computing request is not too high (i.e., x is small or negative), there is a higher number of available batteries in the CS, given that B = 20 − x, and the high power availability makes it possible to keep a large number of full batteries on the ground. Therefore, it is very likely that a landing UAV finds a full battery, so that it is immediately ready to take off. According to this reasoning, it is not necessary to leave UAVs in the CS for future challenging events, since these events will be easily faced thanks to the large number of full batteries.
This strategy cannot be used at the lowest power generation level, since a power surplus to charge the batteries on the ground is less likely, given that almost all the power is used to charge the batteries of the landed UAVs. Moreover, at the lowest power generation level, the average number of AS-UAVs increases as the average request increases across the scenarios. However, another interesting aspect that arises from the figure is that, when the average edge-computing service request is high (x > 9), this increase does not continue. On the contrary, the average number of AS-UAVs decreases as the average request increases. In these cases, where there is a large edge-computing request but the lowest power generation availability (Power Level 1), all the events can be considered challenging, and it therefore becomes more difficult to stock AS-UAVs. Conversely, in these scenarios with a high service request, at the other power levels an increase in the average request leads to an increase in the average number of AS-UAVs. To explain the possible causes of this behaviour, it is necessary to recall the previous reasoning about the advantage of having full batteries on the ground, which makes it unnecessary to keep UAVs in the CS. In the scenarios with a high service request, the number of batteries in the CS is low, regardless of whether they are empty or not. Moreover, there are more mounted batteries to charge, due to the greater number of UAVs (N = 20 + x) and to the need for more UAVs on air. These aspects are more important in the case of a medium/low level of available power (Power Level 2). More specifically, in this case, the number of batteries that can be charged in addition to the UAV ones is very limited, and thus it is necessary to stock some additional UAVs to face more challenging future events. Since the possibility of charging some batteries in the CS increases with the level of available green power, the number of AS-UAVs decreases as the power level increases (the Power Level 3 curve lies below the Power Level 2 one, and the Power Level 4 curve below the Power Level 3 one).
Fig. 9. Average edge-computing service provided by FANET and Satellite when the RL policy is adopted for different scenarios. Average WG power that can be delivered to the local load.
Finally, Fig. 9 shows both the average edge-computing service provided and the remote computing request when the FANET is managed by using the RL policy in the various scenarios, together with the related average saved WG power. These quantities are almost linearly dependent on the average edge-computing service request up to x = 12, after which they start to level off towards constant values. An important aspect is that the slope of the provided-service curve is about equal to 1 up to the scenario x = 12. This means that the increment of the average service request is satisfied by an increment of the same magnitude in terms of provided service. This important result is obtained only by increasing the number of UAVs (while the number of batteries is kept constant) and by finding the best policy in each scenario.

VI. CONCLUSION AND FUTURE WORKS
This paper proposes to use artificial intelligence for the automatic management of a FANET providing edge computing in post-disaster scenarios. A CS for battery charging, supplied by a WG, has also been considered. The FC applies model-based RL to decide how many UAVs have to take off, taking into account the current power generation availability and edge-computing service requests, as well as a forecast of them. An additional novelty is the discrete-time analytical model of the system, defined to provide the FC with a Markov Decision Process that supports its decisions.
It is worth stressing that the proposed RL approach does not suffer from convergence problems because no online training is needed. Indeed, since an exhaustive model of the environment has been defined and the number of UAVs composing a FANET is usually not very large, the optimal policy can be found offline in closed form by means of the Bellman equation system.
The optimal management policy dynamically adapts its behaviour to avoid wide use of backup satellite channels in a short-time horizon during periods of low green-energy generation and high service request. The optimal policy is an efficient modification of the greedy one, with a level of variation that depends on the combination of the service request and the green power availability. More specifically, their behaviour mainly differs during the periods of low green power availability since, in this case, the optimal policy frequently stocks UAVs to face future edge-computing requests, although these UAVs would be necessary to satisfy the current request. The comparisons with other policies have demonstrated the achieved gain, proving that the proposed framework, based on a FANET supplied by green generation, and the proposed management strategy are suitable to face emergency scenarios that need computing resources but lack connection with the power grid and the core network infrastructure.
In future work, the FANET management framework proposed in this paper could be extended to support the paradigm of Network Function Virtualization by including the placement of service chains in the FANET. This would be done by taking into account the state of charge of the battery of each flying UAV, in order to maximize the amount of traffic flows that the FANET is able to manage.

APPENDIX TRANSITION PROBABILITY MATRIX OF THE PROCESS $(S^{(B)}(n), S^{(D)}(n))$
In order to derive this last matrix, it is necessary to describe in detail the behaviour of UAVs and batteries during the time slot. As represented in Fig. 3, the state evolution during the time slot has been divided into two phases. First, the intermediate values (point 3 of the previous list) of the UAV and battery charge states, indicated with $\tilde{s}_D$ and $\tilde{s}_B$, respectively, have been evaluated; then, the final state $s_\Sigma$ has been calculated. With the aim of calculating the intermediate state, it is considered that the intermediate number of flying UAVs is increased by the taken-off UAVs, $t_D$, as in (12), that is, $\tilde{s}_A = s_A + t_D$. Actually, this is also the number of flying UAVs at the beginning state (point 2), differently from the other intermediate quantities computed in the following.
The number of UAVs that can potentially be charged by the WG in the time slot n is denoted by g.
The intermediate number of UAVs with an empty battery, $\tilde{s}_E$, is the initial number, $s_E$, decreased by the number of UAVs that have been charged by means of the available supplied CPs, min{g, M}, and which thus move to the state of UAVs ready for takeoff. The remaining supplied CPs, i.e., the ones that have not been used for UAVs, are used to charge some empty batteries. In this case, the number of charged batteries can increase by a quantity complementary to (27), but limited by the number of batteries, L. Otherwise, the number of charged batteries remains the same. Now, the final state (point 6 of the previous list), $s_\Sigma$, is derived by updating the intermediate states. This is done by considering the transitions of $S^{(M)}(n)$ and $S^{(G)}(n)$, and the number of landing UAVs, denoted here by $\ell$. The final number of UAVs on air is obtained by subtracting $\ell$ from $\tilde{s}_A$. Since some landed UAVs could swap their battery with a charged one, thus becoming immediately ready to take off, the number of UAVs that are ready for takeoff and the number of remaining charged batteries are updated accordingly, as in (31). At the same time, the number of UAVs waiting for charging is increased by the landed UAVs that do not find a charged battery. With all this in mind, the transition probability matrix of the joint process $(S^{(B)}(n), S^{(D)}(n))$ can be computed. To this purpose, the total probability theorem is applied to g, representing the number of CPs that can be supplied simultaneously by the renewable generator among the M CPs that are available in the charging station, and to $\ell$, representing the number of UAVs that land at the end of the time slot n among the $\tilde{s}_A$ that are flying. Here, $p^{(Land)}_{[\ell]}(\tilde{s}_A)$ is the probability that $\ell$ UAVs, among the $\tilde{s}_A$ UAVs that are flying, land in a time slot because they need to be charged.
In order to calculate it, since the UAV battery SoCs are independent of each other, the landing of a UAV in a time slot is modelled as a Bernoulli trial with probability $\rho_{Down}$. This parameter depends on the average time on air. Assuming that a UAV does not land in the same time slot in which it takes off, the per-slot landing probability starting from the second time slot on air is $\rho_{Down} = 1/(H - 1)$. Therefore, considering that $\tilde{s}_A$ represents the number of UAVs that are already on air at the beginning of the time slot (beginning state), the number of UAVs that land follows a Binomial distribution: $p^{(Land)}_{[\ell]}(\tilde{s}_A) = \binom{\tilde{s}_A}{\ell} \rho_{Down}^{\ell} (1 - \rho_{Down})^{\tilde{s}_A - \ell}$. (34)
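The Binomial landing model can be evaluated directly. The sketch below computes the per-slot landing probability from H and the resulting distribution of the number of landing UAVs; it assumes that H is the average time on air expressed in time slots, and the values H = 7 and ten flying UAVs are hypothetical, chosen only for illustration.

```python
from math import comb


def landing_probability(h_slots):
    """Per-slot landing probability, rho_Down = 1 / (H - 1), valid for H > 1."""
    if h_slots <= 1:
        raise ValueError("H must be greater than 1 slot")
    return 1.0 / (h_slots - 1)


def landing_pmf(n_flying, rho_down):
    """Binomial probabilities that 0, 1, ..., n_flying UAVs land in one slot."""
    return [
        comb(n_flying, landed) * rho_down ** landed * (1.0 - rho_down) ** (n_flying - landed)
        for landed in range(n_flying + 1)
    ]


# Hypothetical values: average time on air of 7 slots and 10 flying UAVs.
rho_down = landing_probability(7)
pmf = landing_pmf(10, rho_down)
expected_landings = sum(landed * p for landed, p in enumerate(pmf))

print(f"rho_Down = {rho_down:.4f}")
print(f"P(no UAV lands) = {pmf[0]:.4f}")
print(f"expected landings per slot = {expected_landings:.4f}")
```

The expected number of landings per slot equals the number of flying UAVs multiplied by the per-slot landing probability, as for any Binomial distribution.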