Constrained Federated Learning for AoI-Limited SFC in UAV-Aided MEC for Smart Agriculture

For a wide range of smart agriculture use cases, the prospects of utilizing the Internet of Things (IoT) are immense. Many IoT devices can be deployed for precision farming, soil management, automated irrigation, and information gathering, or for performing some local processing to provide various services. Because IoT devices have limited computational capacity and limited power, UAV-aided mobile-edge computing (MEC) has been proposed to provide IoT nodes with additional resources by hosting their computation functions, making smart agriculture use cases more operational. From the implementation viewpoint, Network Function Virtualization (NFV) is an emerging approach for flexible management of such computation functions in the UAVs and the MEC server. However, efficient orchestration of the virtualized functions is a challenge. In this paper, we consider a decentralized UAV-aided MEC system in which NFV-enabled processing nodes manage the computational tasks. More specifically, we consider smart agriculture use cases that need live streaming and analysis, such as surveillance or environmental monitoring. For such a network, we propose a method for orchestrating the virtualized functions efficiently while minimizing the total network energy consumption. This problem becomes even more crucial when a strict condition is imposed on the instantaneous Age of Information (AoI) values. Therefore, the problem is first formulated as a Decentralized Constrained Multi-agent Markov Decision Process (Dec-CMMDP). As the formulated problem is NEXP-complete, in the next step we exploit some structural features of the considered network and introduce the concept of symmetry to simplify the problem. Then, inspired by Augmented Lagrangian dual optimization, a novel decentralized, federated learning-based solution is proposed to solve the problem.
Simulation results show the effectiveness of the proposed approach in minimizing the total network energy consumption, minimizing the average AoI, and satisfying the strict condition of AoI $ < 100$ msec for supporting real-time applications in our network parameter settings.


I. INTRODUCTION
AGRICULTURE, as the main source of food that faces ever-increasing global demand, requires a major step forward in quality and productivity. The introduction of the Internet of Things (IoT) and its applications in smart agriculture provides this industry with effective tools to support farmers, not only for better productivity but also for greater profitability [1]. IoT provides connections among numerous devices that, in harmony with each other, are deployed to provide a specified service. The use cases cover a wide range of services such as crop field monitoring, pest control, smart autonomous irrigation, and soil management [1], [2]. The applicability of IoT is not limited to the agricultural land itself but, on a much wider scale, includes the supply chain as well [3], [4]. Therefore, a huge amount of data provided by IoT nodes spread throughout the agriculture and supply chain needs to be processed and analyzed to support real-time and non-real-time decision-making processes [1], [5]. Meanwhile, volatile weather conditions and the vastness of agricultural fields increase the risk and maintenance costs [2]. Such a huge data computation and analysis demand puts a strain on resource-limited IoT devices. To improve the capability of resource-constrained IoT devices and the network coverage, UAV-aided MEC [6], [7] has been proposed. Their flexibility and mobility in changing weather situations, as well as their ease of deployment and reasonable maintenance costs, make UAVs an effective solution for providing IoT devices with the required resources. This goal is realized by enabling the IoT nodes to offload their processing tasks to hovering UAVs, which are equipped with the required storage, processing, and communication resources [2].
Therefore, IoT in conjunction with UAVs has attracted much attention and is widely utilized in smart agriculture and smart farming [1], [2], [8], [9]. For instance, in environment monitoring applications or precision farming [10], the IoT devices are distributed throughout the intended area and collect real-time data from their surroundings. Next, the collected data is forwarded to the UAVs. Finally, the pre-processed data is forwarded to the local server for further processing and information extraction to support timely farming-related decisions and actions. The purpose of precision farming is to improve the accuracy of operations and maximize the overall performance while reducing cost by taking into account the field's variables, such as weather conditions and information on moisture level and soil water requirements (for example, for smart autonomous irrigation) [1], [5], [8]. This extends the scope of smart agriculture to guaranteeing a secure and sustainable food supply chain, provided by context and situational awareness through processing real-time events where rapid protective and/or recovery actions are needed [11]. Such applications are computationally intensive and delay-sensitive [1], [2], [8], [10], [12], and freshness of information is an important aspect that needs to be considered.
To emphasize information freshness, AoI has recently been proposed as a metric to quantify the timeliness and freshness of the collected data [13], [14]. This metric is widely used in the context of IoT networks [15], [16] to evaluate the freshness of the data at the destination node. AoI is the time elapsed since the generation of the last-received update packet [14]. AoI increases linearly over time until the next fresh packet arrives, and this is the main difference between AoI and other traditional metrics that quantify the timeliness of a designed system.
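As an illustration of the definition above, the following minimal sketch traces the AoI over discrete time steps: the age grows by one step per step and drops to the age of the delivered packet whenever an update arrives. The function name `aoi_trace`, the unit time step, and the `deliveries` mapping are hypothetical choices made for this example and are not part of the system model.

```python
def aoi_trace(horizon, deliveries, initial_age=0):
    """Trace the Age of Information over `horizon` discrete steps.

    deliveries maps a delivery step to the generation step of the
    packet delivered at that step (assumed illustrative input format).
    """
    ages = []
    age = initial_age
    for t in range(horizon):
        if t > 0:
            age += 1  # age grows linearly while no fresh packet arrives
        if t in deliveries:
            # a delivered packet resets the age to the time since its generation
            age = min(age, t - deliveries[t])
        ages.append(age)
    return ages


# Packets generated at steps 2 and 4 are delivered at steps 3 and 6.
trace = aoi_trace(8, {3: 2, 6: 4})
print(trace)
```

The resulting sawtooth shape is what distinguishes AoI from a per-packet delay metric: the age keeps growing between deliveries even when each individual packet experiences a small delay.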
Before diving into another important aspect of smart farming realization, we provide two use cases of real-world realization of smart agriculture. The first use case is the work presented in [17], which exploits cloud and edge computing in conjunction with NFV technology to develop a system covering the demanding requirements of smart precision farming. A platform of three layers is developed. The first layer is a local cyber-physical system that mainly gathers data by interacting with crop devices in real time. The second layer is an edge computing plane composed of VNFs where task offloading is accomplished. Finally, the third layer is a cloud platform to collect and record data. This 3-tier platform is able to cope with the requirements of soilless agriculture in full recirculation greenhouses [17]. The second case is the Flourish research project [18], an adaptable robotic solution that combines the capabilities of a small UAV network with another small network of autonomous unmanned multipurpose ground vehicles. Flourish is aimed at monitoring crop density, crop nitrogen level, and weed pressure to precisely classify weeds by developing multi-spectral perception algorithms. In the next step, the developed navigation and mapping system is able to locate weeds and perform selective spraying. All the above multi-function processes are performed without human intervention [18].
Therefore, in a UAV-aided smart agriculture scheme (which is the case of interest in this paper), the UAVs are exposed to a huge amount of data to process in a timely manner. In other words, we are faced with a dynamic computing environment where different processing/computing functions need to be implemented on the UAVs and the local MEC server in a scalable, flexible, easy-to-launch, and cost-effective manner [11], [17], [18]. From this point of view, virtualization of the network element functions (NFs), called NFV, is a key technology for reliably implementing and intelligently managing the NFs [19]. NFV virtualizes the NFs and abstracts them from the physical hardware, which enables rapid service function chaining (SFC) and service provisioning in UAV-aided MEC applications [20]. Considering the data-intensive and computation-based applications of smart agriculture, multiple computing functions in the form of virtual network functions (VNFs) should be deployed sequentially and in order to provide the processed data for the final decision-making at the local MEC server. Among the most important problems that should be solved optimally and efficiently are the placement of VNFs, the management of resources among different VNFs, and the routing of information packets among VNF components over the available NFV infrastructure. The VNF Orchestrator (VNFO) performs this operation [21]. As the network traffic and the VNFs' load change over time, the placement needs to be dynamically adjusted to the new conditions as well [22]. Utilizing NFV enhances the agility in deploying and managing network components and significantly improves the robustness and scalability of networks [20], [22].
In this paper, we focus on the smart agriculture use cases that require live streaming and analysis, where there is a strict condition on the AoI values. Surveillance and environmental monitoring, in general, are two examples of such use cases. The VNFO should therefore decide on the placement and scheduling of the chain of VNFs so as to minimize the total network energy consumption while keeping the resulting AoI values below a predefined threshold. To mathematically model such a constrained decision-making problem, an extended version of the Markov Decision Process (MDP), i.e., the constrained MDP (CMDP) [23], is utilized. More specifically, we consider a decentralized constrained multi-agent MDP (Dec-CMMDP) where each UAV is responsible for placing and scheduling the VNFs belonging to the services it supports. Nevertheless, except for the particular case in which the agents are transition- and reward-independent, finding the optimal policy for this problem is NEXP-complete, with no standard polynomial-time solution [24]. Recently, machine learning algorithms and artificial intelligence (AI) based solutions have appeared as viable ways to tackle such complex problems in polynomial time [25], [26], [27]. Since its inception in 2017 [28], Federated Learning (FL) has reshaped many emerging intelligent IoT systems toward advanced FL architectures. The distributed nature of FL, where some clients cooperatively train a global ML model without directly sharing their local data, makes FL an attractive alternative to traditional centralized ML schemes. In particular, FL enhances the privacy and scalability of IoT applications and networks by pushing intelligent ML functions to the network edge [27].
In the context of delay-sensitive smart agriculture applications, we address the problem of robust and flexible management of virtualized computing functions by distributing them among the processing nodes: the UAVs and the local MEC server. The purpose is to perform this function chaining in such a way that the total energy consumption of the UAVs and IoT nodes is minimized. At the same time, a strict condition on the instantaneous values of AoI must be satisfied. We formulate the problem above as a distributed multi-agent CMDP model and propose a novel energy-efficient FL-based solution to solve it. The main contributions of our paper are summarized as follows:
• To the best of our knowledge, this is the first time that the problem of constrained dynamic orchestration of NFV-enabled SFCs in a UAV-aided MEC network is considered under instantaneous strict conditions.
• We formulate this joint optimization problem as a Dec-CMMDP, where a strict condition on the instantaneous value of AoI must be satisfied.
• We developed an extended version of the Augmented Lagrangian dual optimization method in conjunction with Federated Learning to obtain the optimal policy for the formulated Dec-CMMDP problem.
• As the formulated problem is NEXP-complete, we adopted the inherent symmetry in the structure of the problem and proposed a novel Iterative Federated-based algorithm in which a set of distributed parties learns in parallel and aggregates their own experience through a coordinator.
The rest of the paper is organized as follows. Section II introduces the related works. Section III describes the system model. Section IV presents the problem definition and formulation. Section V explains how the problem can be modeled as a Dec-CMMDP. The proposed iterative federated-based solution and the analytical results that support our proposed algorithm are presented in Section VI. The effectiveness and performance of the proposed scheme are demonstrated in Section VII. Finally, Section VIII concludes the paper.

II. RELATED WORKS
To enable different operations, such as environmental monitoring and automation, numerous IoT devices are used in IoT-based smart agriculture [1], [12], [15], [29]. For a comprehensive review of emerging technologies for IoT-based smart agriculture, refer to [1]. For a UAV-aided farm monitoring IoT scheme, Nguyen et al. [12] considered the problem of processing deadline-critical tasks that are fed by IoT devices deployed on the field. Assuming that a Multi-access MEC infrastructure is available, the energy-efficient monitoring problem is modeled as a multi-objective maximization problem, and a Q-Learning-based solution is proposed which aims to process the tasks before their deadline. The same authors in [29] extended the scheme proposed in [12] to a multi-actor-based risk-sensitive RL approach.
In the context of UAV-aided IoT networks, AoI has been widely used in some recent works to quantify the freshness of information [15], [16]. In [15], Han et al. considered a UAV-aided IoT system in which the performance of data gathering is analyzed in terms of packet loss rate and data quantity using a Markov chain. To define the freshness of data packets, they analyzed the AoI of devices using a first-come-first-served (FCFS) model and M/M/1 queuing. In [16], the authors considered a UAV-aided wireless powered IoT system, where a UAV takes off from a data center, flies toward sensor nodes to transfer energy, collects their information, and then returns to the data center. To minimize the average AoI of the data gathered from all ground sensors in such a system, an optimization problem is defined. Then, a suboptimal method is proposed to decompose the problem into two subproblems, where the solution to the first subproblem is the input for the second. Zhu et al. in [30] investigated age-sensitive MEC systems that benefit from UAVs. They proposed a multi-agent RL scheme for intelligent control of the UAVs' trajectory planning, data scheduling, and bandwidth allocation. The problem is modeled as an average AoI minimization. Then, an actor-critic-based multi-agent RL framework is proposed, where edge devices and a center controller cooperatively learn the interactive strategies through their observations. To enhance system performance in terms of convergence, an FL mode is introduced into the multi-agent collaboration.
Considering NFV-enabled UAV-aided MEC IoT networks, each IoT service can be expressed as a service function chain (SFC) defined as several strictly ordered VNFs. The VNFs can be geographically placed into the local MEC server close to the IoT terminals or into the UAVs. However, optimally and efficiently placing VNFs and routing service paths through the VNF instances are challenging problems. This problem is also known as SFC dynamic orchestration (SFC-DOP) [20]. The authors of [20] dealt with SFC embedding and dynamic VNF placement in a geo-distributed cloud system. The problem is formulated as a Binary Integer Programming (BIP) problem to embed the SFC requests at the lowest possible cost.
A summary of related works, along with the main topics they focus on, is provided in Table 1. All problems mentioned above are modeled as conventional MDPs, as these works did not consider the case in which there is a strict condition on the AoI values. Liu et al. in [31] discuss the issue that, in practice, RL techniques often cannot be directly applied to physical systems, especially in cases where there are some constraints to satisfy, e.g., limits on resource consumption. That paper surveys the existing approaches addressing CMDPs using RL. Two main types of constraints, i.e., cumulative and instantaneous, are considered, and their pros and cons are discussed. Among all approaches dealing with CMDPs, there are two popular families of methods: Lagrangian relaxation based algorithms [31], [32], [33] and Lyapunov function based safe policy determination algorithms [34], [35], [36]. Most of these studies do not consider CMDPs with instantaneous constraints. Li et al. in [33] dealt with this problem as a reward-maximizing policy determination while satisfying certain constraints at each time step. They first treated the conditions under which strong duality of the CMDP holds and then, inspired by the Augmented Lagrangian Method [32], proposed a policy-gradient-based RL algorithm for instantaneously-constrained RL problems.
As evident from Table 1, there are a few works that consider AoI in smart agriculture networks; however, none of these studies consider CMDPs with instantaneous constraints. In addition, none of them consider the problem of SFC in such networks while benefiting from NFV technology. Therefore, the techniques proposed in these works cannot be deployed as a solution for the smart agriculture framework formulated in this paper. Our paper focuses on SFC in NFV-enabled MEC, where the problem is formulated as a Dec-CMMDP. As opposed to other SFC studies, we consider an instantaneously-constrained Dec-CMMDP, where a strict condition on the instantaneous value of AoI must be satisfied. In Section VI, inspired by [33], we propose our novel solution, which is an extended version of the Augmented Lagrangian dual optimization method in conjunction with Federated Learning.

III. SYSTEM MODEL
The notations used in this section and the rest of the paper are as follows. Matrices and sets are denoted by bold uppercase characters, and vectors are denoted by bold lowercase characters. $\mathbb{Z}^+$ and $\mathbb{Z}^+_0$ denote the positive integers and the non-negative integers, respectively. $|\mathcal{A}|$ is the cardinality of a set $\mathcal{A}$.
In the context of smart agriculture, a real-time IoT network provides different types of real-time services to the network operator. These services can cover various applications, from environmental monitoring to automation. More specifically, as depicted in Fig. 1, there is a set $\mathcal{N}$ of $N$ IoT nodes that collect delay-sensitive real-time information and then send it, in the form of packets, to a local server through an Aerial Network. The Aerial Network consists of $U$ UAVs denoted by the set $\mathcal{U}$. Each IoT node is connected to a UAV in its range. A local server is located near the network and is equipped with a MEC server indexed by $M$.

The IoT network provides a set $\mathcal{S}$ of $K$ services. For each service $k \in \mathcal{S}$, a set $\mathcal{F}_k$ of processing functions should be performed on the packets of that service by following a logical order. Hence, it is assumed that the MEC server $M$ and each UAV $u$, as processing nodes, can run $F = \sum_{k \in \mathcal{S}} |\mathcal{F}_k|$ different VNF types on their physical computing machine. The total available resources at the physical machine (processing node) $p \in \mathcal{U} \cup M$ are indicated by $C_p$ (in Hz) and $B_p$ (in bytes), where $C_p$ and $B_p$ denote the computing and memory capacity at node $p$, respectively.
Each IoT node supports only one of the $K$ different services provided by the network. There is, however, a strict condition on the AoI of the service provided by each group of IoT devices. We will get back to this point later.
We consider a discrete-time system with two hierarchical timing levels, as illustrated in Fig. 2. At the first and finer level, time is divided into equal time slots indexed by $t = 1, 2, \ldots$, each with duration $T$. All communications in the uplink and downlink directions follow this time schedule. On top of that, we have the VNF-scheduling time slots, each of which is a single round of VNF placement and scheduling updates. The duration of a VNF-scheduling time slot is an integer multiple of the communication slot duration, i.e., $O \times T$ with $O \in \mathbb{Z}^+$. At the beginning of each VNF-scheduling time slot, the VNFO updates the VNF placements. Each UAV $u$ combines all the data packets of the IoT nodes with the same service type, say $k$, and forwards the combined packet-flow $\Upsilon_u^k(t)$ through the aerial network toward the local server. The set $\mathcal{F}_k$ of VNFs (processing functions) must be performed on the data packets of the IoT nodes with service type $k$.
At the beginning of each VNF-scheduling time slot, for each service packet-flow $\Upsilon_u^k(t)$, the subset of the processing functions in $\mathcal{F}_k$ that will be performed by UAV $u$ is determined. The remaining processing functions for that service, i.e., the complement of this subset in $\mathcal{F}_k$, will be performed by the MEC server. An empty subset means that all the VNFs of that service will be performed at the MEC server. During a VNF-scheduling time slot, the network follows a fixed placement rule determined by the VNFO until the next placement at the next VNF-scheduling time slot. It is assumed that the VNF orchestrator (VNFO) located in the local server decides the placement of the processing functions (VNF chains) of each service packet-flow. Although the VNFO module is located on the local server, the orchestration process is performed by coordinating the UAVs as distributed agents. We will explain this process in detail in the following sections.
Let $v_p^{fk}(t) \in \mathbb{Z}^+_0$ denote the number of VNF instances of type $V_{fk}$ that are selected to run on processing node $p$ at VNF-scheduling time slot $t$. The VNFs assigned to a processing node $p \in \mathcal{U} \cup M$ at that time slot form a per-node set.
There are two types of communication links among the network nodes: the wireless links between the IoT nodes and the aerial network consisting of UAVs, and the wireless links between the UAVs and the local terrestrial server. Let $l_{nu}(t)$ denote the channel loss between IoT node $n \in \mathcal{N}$ and UAV $u \in \mathcal{U}$; then the achievable bit rate of node $n$ in the uplink direction will be $R_{nu}(t) = w_n \log_2\left(1 + \frac{p_{nu}}{l_{nu}(t)\,\sigma^2}\right)$, $\forall n \in \mathcal{N}, u \in \mathcal{U}$, where $w_n$ and $\sigma^2$ denote the channel bandwidth of IoT device $n$ and the noise variance, respectively, and $p_{nu}$ is the transmission power level. The channel between the IoT nodes and the UAVs, and between the UAVs and the local server, can be modeled as an air-to-ground channel [37]. According to this model, the path loss $l_{nu}$ can be calculated as in [30] from $f$, $c$, and $d$, which are the frequency of operation, the speed of light, and the distance between the transmitter and the receiver, respectively, and from $\eta_e$, the excessive path loss averaged over the Line-of-Sight (LoS) case, $\eta_e^{LoS}$, and the non-LoS case, $\eta_e^{nLoS}$. The averaging weight $p_{LoS}$ is the probability that an LoS path exists and can be closely approximated as in [30] using the environment-related parameters $a$ and $b$.
Similarly, in the downlink direction, the achievable bit rate of the link between UAV $u \in \mathcal{U}$ and the MEC server $M$ will be $R_{uM}(t) = w_{uM} \log_2\left(1 + \frac{p_{uM}}{l_{uM}(t)\,\sigma^2}\right)$, $\forall u \in \mathcal{U}$, where $w_{uM}$ denotes the channel bandwidth, $\sigma^2$ is the noise variance, $p_{uM}$ is the transmission power level, and $l_{uM}(t)$ denotes the channel loss at time $t$. In the following sections, we focus on the VNFO's functionality; resource allocation of the radio access part is beyond the scope of this paper. Hence, without loss of generality, we assume a fixed power and bandwidth allocated to all the participating nodes.
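The link model of this section can be sketched numerically as follows. Since the path-loss and LoS-probability expressions themselves are not reproduced in the text, this sketch assumes the widely used air-to-ground form: free-space loss plus an averaged excess loss, with a sigmoid LoS probability in the elevation angle. It also assumes that the SNR is the transmit power divided by the product of the linear loss and the noise power. All function names and the numeric parameters in the usage lines are illustrative, not values from the paper.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def los_probability(elevation_deg, a, b):
    """Assumed sigmoid LoS probability; a and b are environment parameters."""
    return 1.0 / (1.0 + a * math.exp(-b * (elevation_deg - a)))


def path_loss_db(distance_m, freq_hz, eta_los_db, eta_nlos_db, p_los):
    """Free-space loss plus the excess loss averaged over LoS and non-LoS."""
    fspl = 20.0 * math.log10(4.0 * math.pi * freq_hz * distance_m / SPEED_OF_LIGHT)
    eta = p_los * eta_los_db + (1.0 - p_los) * eta_nlos_db
    return fspl + eta


def achievable_rate(bandwidth_hz, tx_power_w, loss_db, noise_w):
    """Shannon rate with SNR = p / (l * sigma^2), loss converted to linear scale."""
    loss_lin = 10.0 ** (loss_db / 10.0)
    snr = tx_power_w / (loss_lin * noise_w)
    return bandwidth_hz * math.log2(1.0 + snr)


# Illustrative numbers only (hypothetical urban-like parameters).
p_los = los_probability(elevation_deg=45.0, a=9.61, b=0.16)
loss = path_loss_db(distance_m=200.0, freq_hz=2.4e9,
                    eta_los_db=1.0, eta_nlos_db=20.0, p_los=p_los)
print(achievable_rate(bandwidth_hz=1e6, tx_power_w=0.1, loss_db=loss, noise_w=1e-13))
```

Because power and bandwidth are fixed in the system model, the rate in this sketch changes only through the distance and elevation-dependent loss.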

IV. PROBLEM FORMULATION
As explained in the previous section, at the beginning of each VNF-Scheduling time slot t, the VNFO decides on the placement of VNF chains.

Definition 1 (Placement Index): For the chain of VNFs corresponding to service packet-flow $\Upsilon_u^k(t)$, the placement index $\delta_u^k \in \{0, 1, \ldots, |\mathcal{F}_k|\}$ is defined as the point where the chain is broken into two parts. The first part will be placed in UAV $u$ and the second part in the MEC server $M$. $\delta_u^k = 0$ (or $\delta_u^k = |\mathcal{F}_k|$) means that all VNFs are performed in the MEC server $M$ (or UAV $u$).
Definition 1 implies that the packets travel a loop-free route. Each packet belonging to $\Upsilon_u^k(t)$ needs a specific computational capacity $c_{fk}$ in terms of CPU cycles. Assuming that the total processing capacity of processing node $p$ in a single time slot with duration $T$ is $C_p T$, the total number of CPU cycles required by the VNFs assigned to node $p$ at each VNF-scheduling time slot $t$ must not exceed $C_p T$. This condition ensures that the computing capacity of the selected processing node is enough to serve the assigned VNFs. The same condition also needs to be fulfilled regarding the storage capacity requirement $b_{fk}$ (in bytes): the total storage required by the VNFs assigned to node $p$ must not exceed $B_p$, the total amount of available storage capacity of processing node $p$.
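The placement index and the two capacity conditions can be sketched as follows. This is a minimal illustration under simplifying assumptions: one instance per VNF, per-VNF CPU-cycle and storage demands given as dictionaries, and hypothetical names (`split_chain`, `fits_node`) that do not come from the paper.

```python
def split_chain(chain, delta):
    """Split an ordered VNF chain at placement index delta.

    The first delta VNFs run on the UAV, the rest on the MEC server.
    delta = 0 puts the whole chain on the MEC server,
    delta = len(chain) puts the whole chain on the UAV.
    """
    if not 0 <= delta <= len(chain):
        raise ValueError("placement index out of range")
    return chain[:delta], chain[delta:]


def fits_node(assigned, cycles, storage, cap_hz, slot_s, cap_bytes):
    """Check the computing and storage conditions for one processing node."""
    cpu_needed = sum(cycles[v] for v in assigned)      # CPU cycles
    mem_needed = sum(storage[v] for v in assigned)     # bytes
    return cpu_needed <= cap_hz * slot_s and mem_needed <= cap_bytes


chain = ["filter", "detect", "compress"]               # hypothetical VNF names
cycles = {"filter": 2e6, "detect": 8e6, "compress": 3e6}
storage = {"filter": 1e6, "detect": 4e6, "compress": 2e6}

on_uav, on_mec = split_chain(chain, 1)
print(on_uav, on_mec)
print(fits_node(on_uav, cycles, storage, cap_hz=1e9, slot_s=0.01, cap_bytes=8e6))
```

Because the chain is cut at a single point, a packet visits the UAV part first and the MEC part second, which is why the route is loop-free.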

A. AGE OF INFORMATION
The AoI metric is adopted to quantify the freshness of the received packet at the destination. As mentioned, AoI is defined as the time elapsed since the generation of the last received packet of that service. This subsection shows how to formulate and quantify this metric in terms of the network parameters. For each service packet-flow $\Upsilon_u^k(t)$, let $\tau_u^k(t)$ denote the expectation of the IoT access network delay with respect to the transmission rate $R_{nu}(t)$. This delay depends on $D_{nu}$, the distance between IoT node $n \in \mathcal{N}$ and UAV $u \in \mathcal{U}$, and on the packet length of service type $k \in \mathcal{S}$.
If $t_u^k$ represents the time elapsed from the beginning of the time slot $t$ in which a packet from service packet-flow $\Upsilon_u^k(t)$ has arrived, the AoI of the service packet-flows at the UAV nodes can be calculated as in (8), where the binary variable $\alpha_u^k(t) \in \{0, 1\}$ indicates whether a new packet of service flow $k$ is received at time slot $t$ ($\alpha_u^k(t) = 1$) or not ($\alpha_u^k(t) = 0$). Referring to the first equation of (8), $T - t_u^k$ is the time passed since receiving the last packet of packet-flow $\Upsilon_u^k(t)$ in the case that a new packet from this packet-flow is received in time slot $t$. As a result, $\tau_u^k + T - t_u^k$ is the time elapsed since the generation of the last received packet from packet-flow $\Upsilon_u^k(t)$ at time slot $t$, i.e., the AoI of this packet-flow at time slot $t$. By definition, for every time slot $t$ in which the UAV does not receive a new packet from a service packet-flow, the AoI of that service packet-flow increases by $T$, which leads to the second equation of (8).
On each packet of service packet-flow $\Upsilon_u^k(t)$, the set $\mathcal{F}_k$ of VNFs (processing functions) should be performed: one subset in the UAV and the remaining VNFs at the MEC server. The processing time of every packet of this flow is expressed in terms of the following quantities: $\theta_u^V$ and $\theta_M^V$, the run times of VNF $V$ when it is performed on UAV $u$ and on the MEC server $M$, respectively; the packet length of service type $k$ after performing the $\delta_u^k$-th VNF of the chain; and $\tau_{uM}^k(t)$, the total delay between UAV $u$ and the MEC server $M$ for a single packet, which consists of the propagation delay and the transmission delay.
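The two-case AoI update at a UAV described above can be sketched as a single function: when a new packet arrives in the slot, the AoI becomes the access delay plus the remaining part of the slot; otherwise it grows by one slot duration. The function and argument names are hypothetical, and the threshold check at the end only illustrates how a strict AoI limit would be tested, using the 100 ms figure from the abstract.

```python
def uav_aoi_update(prev_aoi, slot_T, new_packet, access_delay=0.0, arrival_offset=0.0):
    """One-slot AoI update at a UAV for a single service packet-flow.

    new_packet     : True if a fresh packet arrived in this slot (alpha = 1)
    access_delay   : expected IoT access network delay (tau)
    arrival_offset : time from the start of the slot to the arrival (t_u^k)
    """
    if new_packet:
        # time since generation = access delay + rest of the slot
        return access_delay + slot_T - arrival_offset
    # no fresh packet: the information simply gets one slot older
    return prev_aoi + slot_T


# Illustrative values in seconds (assumptions, not taken from the paper).
aoi = 0.030
aoi = uav_aoi_update(aoi, slot_T=0.010, new_packet=False)
aoi = uav_aoi_update(aoi, slot_T=0.010, new_packet=True,
                     access_delay=0.002, arrival_offset=0.004)
threshold = 0.100  # strict AoI limit of 100 ms
print(aoi, aoi < threshold)
```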
Let the binary variable $\beta_u^k(t) \in \{0, 1\}$ indicate whether the VNF scheduling for service packet-flow $\Upsilon_u^k(t)$ in a single round of VNF scheduling $t$ was successful; then the AoI at the local server is given by (10).
For every unsuccessful VNF scheduling, the AoI of that service packet flow increases by T .
To support our time-sensitive application, we need to guarantee that the AoI for each service packet flow is less than a predefined threshold all the time.This point is reflected in the following remark.
Remark 1: For each service packet-flow $\Upsilon_u^k(t)$, we have the strict condition that the age of information at the local server, given by relation (10), must be less than a predefined threshold $\xi$ at all times.

B. ENERGY CONSUMPTION
In this subsection, the total energy consumption for processing and delivering a single packet of each service packet-flow to the local server is formulated. For the uplink direction, the energy consumption is determined by the transmission power and by $\tau_{nu}^k(t)$, the transmission time between IoT node $n$, with service type $k$, and UAV $u$ at time slot $t$, which in turn depends on the packet length of service type $k$ before performing any processing functions. Similarly, the energy consumption in the downlink direction is determined by $\tau_{uM}^k(t)$, the transmission time of packets belonging to service type $k$ from UAV $u$ to the MEC server $M$, which depends on the packet length of service type $k$ after performing the $\delta_u^k$-th VNF of the chain. Finally, if $\psi_u^V$ and $\psi_M^V$ denote the power that UAV $u$ and the MEC server $M$, respectively, consume to run VNF $V$ on each packet of this service packet-flow, then the total energy required for performing the VNFs on a single packet of each service type follows from these powers and the corresponding run times. The exact values of $\psi_u^V$ and $\psi_M^V$ depend on the processing characteristics of the host, including CPU power, efficiency, and so on. Here, for the sake of brevity, the mathematical formulation captures these factors in a single parameter. In the deployment phase, they should be determined and taken into account according to the exact specifications of the hosting machine.
Using (13)-(14), the total energy consumption of the network to process a single packet of all service types for all UAVs is obtained by summing the communication and processing energy terms over all service types and UAVs.
IoT nodes are energy-limited and UAVs are also battery-operated; hence, their available energy to compute and communicate is limited. It is important to note that, in certain circumstances, the VNFO may place some of the VNFs to be processed locally by IoT nodes. Thus, the total long-term cumulative value of their energy consumption should be minimized as well. On the other hand, as declared in Remark 1, there is a strict condition (constraint) on the AoI value of the service packet-flows. Therefore, the following constrained optimization problem should be solved.
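A per-packet energy budget of the kind described in this subsection can be sketched as follows. The sketch assumes that transmission energy equals transmit power times transmission time (packet length divided by rate) and that processing energy equals VNF power times VNF run time; these are common modeling choices stated here as assumptions, and all names and numbers are illustrative.

```python
def tx_energy(power_w, packet_bits, rate_bps):
    """Energy to transmit one packet: power x (packet length / rate)."""
    return power_w * packet_bits / rate_bps


def processing_energy(vnfs, power_w, runtime_s):
    """Energy to run a list of VNFs on one packet: sum of power x run time."""
    return sum(power_w[v] * runtime_s[v] for v in vnfs)


def packet_energy(uplink, downlink, on_uav, on_mec, psi_uav, theta_uav, psi_mec, theta_mec):
    """Total energy for one packet: uplink + UAV processing + downlink + MEC processing."""
    return (tx_energy(*uplink)
            + processing_energy(on_uav, psi_uav, theta_uav)
            + tx_energy(*downlink)
            + processing_energy(on_mec, psi_mec, theta_mec))


# Hypothetical example: the first VNF shrinks the packet before the downlink hop.
total = packet_energy(
    uplink=(0.1, 8000, 1e6),      # (power in W, bits before processing, rate in bit/s)
    downlink=(0.5, 2000, 2e6),    # (power in W, bits after the split point, rate in bit/s)
    on_uav=["filter"], on_mec=["detect"],
    psi_uav={"filter": 2.0}, theta_uav={"filter": 0.001},
    psi_mec={"detect": 10.0}, theta_mec={"detect": 0.002},
)
print(total)
```

The example makes the trade-off behind the placement index visible: running more VNFs on the UAV costs UAV processing energy but can shorten the packet that has to be sent over the downlink.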
Problem 1 (Constrained Energy-Efficient VNFO): Considering the service packet-flow requirements, the available resources of the UAVs and the MEC server, and the condition of the access network, an energy-efficient VNFO solution is needed that guarantees the required instantaneous AoI value at the local server; this is the optimization problem (16).
It is worth mentioning that $\alpha_u^k(t)$ in (8) and $\beta_u^k(t)$ in (10) are the results of the policy $\delta_u^k$, which is the expected output of a solution to Problem 1. These two variables are defined to formulate the AoI mathematically, but they are not actually independent variables. $\delta_u^k$ is the VNF placement policy that will be deployed by the VNFO and that determines the actual values of $\alpha_u^k(t)$ and $\beta_u^k(t)$ in the deployment phase.
In each VNFO-level time slot, the orchestrator sequentially decides on the chain of VNFs of each service flow of the UAVs. The Markov Decision Process (MDP) is a powerful framework for mathematically formulating and studying this class of sequential decision-making problems. Depending on the environment state, the MDP output will be the best action (or at least the best given the history of observations and actions), i.e., the one that maximizes a specific utility function [38]. However, according to Remark 1, we have a strict condition on the AoI at the local server that should be guaranteed. To incorporate this condition, in the next section we reformulate Problem 1 under the framework of the constrained Markov decision process (CMDP). The CMDP is an extension of the standard MDP where the purpose is to optimize an objective while explicitly satisfying some constraints expressed in terms of auxiliary costs [23].

V. DEC-CMMDP FORMULATION
A CMDP is modeled as a tuple $\langle S, s_0, A, T, r, c, c_0 \rangle$, where $S$ is the state space, $s_0 \in S$ is the initial state, $A$ is the action space, $T(\acute{s} \mid s, a)$, $\forall (\acute{s}, s) \in S, a \in A$, is the transition probability function, $r(s, a)$ is the immediate reward function, $c(s, a)$ is the immediate cost, and $c_0$ is an upper bound on the expected cumulative cost. In a CMDP, the agent, based on the current state $s$, uses a policy $\pi$ to perform an action $a(s)$ that determines the next state $\acute{s}$ according to the transition probability $T(\acute{s} \mid s, a)$. The best policy $\pi$ among the feasible policies is the one that maximizes the expected discounted cumulative reward $\mathbb{E}\{\sum_{t=0}^{\infty} \gamma^t r(s, a) \mid \pi\}$ while keeping the discounted cumulative cost $\mathbb{E}\{\sum_{t=0}^{\infty} \gamma^t c(s, a) \mid \pi\}$ below the predefined threshold $c_0$, over a sequence of decision-making instances $t = 0, 1, 2, \ldots$, where $\gamma$ is the discounting factor.
In our scheme, the VNFO is the decision maker. Although a centralized VNFO may be optimal, it requires a large volume of communication transactions to share the local information with the centralized VNFO so that it is aware of the exact current state of the network. Another drawback of a fully centralized VNFO is that it is not scalable and would be a single point of failure from the processing and communication viewpoints. Therefore, in this paper, we consider the constrained multi-agent MDP (CMMDP) scheme [39], in which multiple agents or decision-makers exist. In the literature, this scheme is also referred to as a stochastic game [40]. In the CMMDP case, the VNFO is implemented as a multi-agent VNFO, where there is coordination among the agents (or actors), which are the UAVs in our system model. In the proposed multi-agent VNFO, a CMMDP is formally defined as a tuple as follows.

Definition 2 (CMMDP Model):
A CMMDP $G$ with a set $\mathcal{U}$ of $U$ agents is defined as a tuple $G = \langle \mathcal{U}, S, A, T, R, C, c^o \rangle$, where
• $S$ is the finite set of entire environment states,
• $A = \prod_{u \in \mathcal{U}} A_u$ is the joint action space, where $A_u$ is the set of actions available to agent $u$,
• $T$ is the transition function $T : \bigcup_{s \in S} (s \times A(s)) \times S \to [0, 1]$, where $T(\acute{s} \mid s, a(s))$ is the transition probability from state $s$ to $\acute{s}$ under the joint action $a(s)$,
• $R : \bigcup_{s \in S} (s \times A(s)) \to \mathbb{R}$ is the reward function, where $R(s, a(s))$ is the total reward received by all agents when the joint action $a(s)$ is executed at the entire environment state $s$,
• $C : \bigcup_{s \in S} (s \times A(s)) \to \mathbb{R}$ is the cost function, where $C(s, a(s))$ is the total cost incurred by all agents when the joint action $a(s)$ is executed at the entire environment state $s$,
• $c^o$ is a predefined upper bound on the expected cumulative cost, $\mathbb{E}\{\sum_{t=0}^{\infty} \gamma^t C(s, a(s)) \mid \pi\} \le c^o$.
By definition, in a CMMDP, the agents perform independent actions $a^{\pi}_u \in A_u$ based on the entire network state $s$ and the joint policy $\pi$. The joint action $a^{\pi(s)} = (a^{\pi(s)}_u)_{u \in \mathcal{U}}$ determines the next entire environment state $\acute{s}$ according to the transition probability $T(\acute{s} \mid s, a^{\pi(s)})$. The total reward $R(s, a^{\pi(s)})$ and the total cost $C(s, a^{\pi(s)})$ received by all agents also depend on the current entire state $s$ and the joint action $a^{\pi(s)}$ performed by all $U$ agents.
In a large-scale CMMDP, the requirement that all agents can observe everything is too restrictive [41]. As an alternative, the state space is factored into per-agent sets, $S = \prod_{u \in \mathcal{U}} S_u$. In a factored-state CMMDP, which is called a decentralized CMMDP (Dec-CMMDP) [39], each agent $u \in \mathcal{U}$ conditions its own policy $\pi_u$ only on its locally observed state $s_u \in S_u$, receives the reward $r_u(s, a(s))$, and incurs the cost $c_u(s, a(s))$ with the required upper bound $c^o_u$. It is worth noting that in a Dec-CMMDP, the reward $r_u$ and cost $c_u$ that each agent experiences depend on the entire network state $s$ and the joint action $a(s)$, whereas the agent chooses its action based only on the local factored state $s_u$.
As stated in Problem 1, the purpose is to find the best policy for VNF placement, i.e., for the UAVs, acting as distributed agents, to determine the best placement indices $\delta^k_u$ for all service packet flows at each state.
Let us define the extended state $s^e_{u,t} = (s_{-u,t}, a_{-u,t}, s_{u,t})$, $\forall u \in \mathcal{U}$, as a combination of the factored state of the UAV itself, $s_{u,t}$, the joint factored state of the other UAVs, $s_{-u,t}$, and the joint action of the other UAVs, $a_{-u,t}$. In the following, where it does not cause ambiguity, we omit the time index $t$. According to the Bellman expectation equation [42], the action-value function $Q^{\pi}_{u,t}(s^e_u, a_u)$, $\forall u \in \mathcal{U}$, is defined as the expected return starting from state $s^e_u$ and taking action $a_u$ according to policy $\pi_u$, while the other agents take the joint action $a_{-u}$ according to the joint policy $\pi_{-u}$,

$$Q^{\pi}_{u,t}(s^e_u, a_u) = \mathbb{E}_{\pi}\left[ r_u(s^e_u, a_u) + \gamma\, Q^{\pi}_{u,t+1}(\acute{s}^e_u, \acute{a}_u) \right], \quad (17)$$

where the action-value function is decomposed into the immediate reward $r_u(s^e_u, a_u)$ plus the discounted action-value of the successor state $\acute{s}^e_u$, in which agent $u$ will execute action $\acute{a}_u$, and $\gamma$ is the discount factor. With some mathematical manipulation, (17) can be rewritten in terms of the value function as (18), where $V^{\pi}_{u,t}(s^e_u)$, $\forall u \in \mathcal{U}$, is the value function at the extended state $s^e_u$, defined as the expected discounted cumulative reward starting from state $s^e_u$ and following the joint policy $\pi$. For a sufficiently large value of $t$ ($t \to \infty$), the goal is to find the optimal policy $\pi^*_u$ among the available policies $\pi_u$ that leads to the optimal Q-value (action-value) function while the constraint in (16) is satisfied, as stated in (19). Using (17) and (18) and some mathematical operations, (19) can be written in the recursive form (20).
Determining an optimal policy for a Dec-CMMDP is NEXP-complete [24], with no straightforward solution. A few efforts [43], [44], [45] have been made in the literature to capture and exploit structural properties of the system (application) under study in order to find, or at least simplify the problem of finding, the optimal policy (20). One of these, proposed for solving partially observable stochastic games (POSGs), is [44], where a class of POSGs is characterized by symmetry across players (agents) in terms of cost and state dynamics. Inspired by this work, we claim the following lemma, which helps to develop our proposed algorithm to solve Problem 1.
Lemma 1: Problem (16) is a symmetric Dec-CMMDP. Being a symmetric CMMDP, the problem of finding the best policy can be reduced to finding $\pi^*_u$, the best response to $\prod_{\acute{u} \neq u} \pi_{\acute{u}}$, while $\pi_{\acute{u}} = \pi_u$, $\forall \acute{u} \neq u$. We prove this lemma in Section VII.
Lemma 1 implies $\{\pi^*_u\}_{u=1}^{U} = \pi^*$. This result is promising, and we utilize it in developing our proposed method in Section VI; however, from the implementation viewpoint, finding the best policy in a distributed form is still challenging. On the other hand, the recursive form of (20) makes it possible to exploit iterative solutions such as dynamic programming; however, for our multi-agent CMDP case, they are inefficient [39] and require full information about the model, which makes them impractical. Therefore, we formulate the problem of finding the optimal policy (20) for Problem 1 as a model-free RL problem.
Considering the limited available resources of the processing nodes in (5)-(6), the objective of Problem 1 is to minimize the total energy consumption $E_{total}$ defined in (15). Therefore, the reward function $r_u(s^e_{u,t}, a_{u,t})$ is defined as in (21), where $\zeta_u(t) \in [0, K]$ is the number of packet flows for which the VNF scheduling result satisfies conditions (5)-(6), and $\delta_V$ and $\delta_E$ are the normalization coefficients for the VNF scheduling result and energy consumption terms, respectively. On the other hand, the policy should guarantee the required instantaneous AoI value at the local server; hence, from (16), the constraint function $c_{u,t}(s^e_u, a_u)$ is defined as in (22). According to the above discussion, Problem 1 can be reformulated as an RL problem as follows.

Problem 2 (Cumulatively Constrained RL Problem): Consider a Dec-CMMDP problem with $U$ agents, an unknown transition function, the reward functions $\{r_{u,t}(s^e_u, a_u)\}_{u=1}^{U}$ defined in (21), and the cost functions $\{c_{u,t}(s^e_u, a_u)\}_{u=1}^{U}$ defined in (22). The objective is to find the optimal policies $\{\pi^*_u\}_{u=1}^{U}$ for the infinite-horizon constrained optimization problem (23).
However, there is a subtlety in fully modeling our problem as a Dec-CMMDP. The AoI constraint in (16) is instantaneous and requires that the instantaneous value of the AoI (represented by the constraint function in (22)) satisfies the constraint at all time instances. We will return to this point in the next section, where we introduce our proposed framework for solving Problem 2.

VI. ITERATIVE CONSTRAINED FEDERATED-DQN FRAMEWORK
In this section, we introduce our proposed Iterative Constrained Federated DQN (IC-FDQN) algorithm as depicted in Fig. 3.

A. AUGMENTED LAGRANGIAN BASED SURROGATE OBJECTIVE FUNCTION
Even though the Dec-CMMDP, and its model-free form stated in Problem 2, is an appropriate choice for modeling Problem 1, a cumulatively constrained MDP does not guarantee the associated instantaneous constraint. In CMMDP terminology, according to (22), we require the instantaneous constraint (24) to hold at every time instance, whereas the constraint in the CMMDP model of Definition 2 is cumulative. Hence, there is a gap here. We base our proposed method on classical Lagrangian dual optimization. Although it has been shown that for infinite-horizon RL problems with cumulative constraints there is, under certain regularity conditions, no duality gap despite their non-convex nature [46], the same statement no longer holds for MDPs with instantaneous constraints [33]. Therefore, we adapt a recently proposed extended version of the Augmented Lagrangian (AugLag) method [33] to develop a scheme for obtaining the optimal policy for Problem 2 while satisfying the instantaneous constraint (24) in a decentralized manner.
Li et al. [33] consider conditions under which strong duality holds for Problem 2 with an instantaneous constraint, and then design a new rewarding mechanism for a new unconstrained surrogate objective function. It is proven analytically and demonstrated empirically [33] that this method results in a high-quality policy with smaller constraint violations than the primal-dual method. We will return to this subject in Section VII, where we discuss the convergence of the proposed algorithm. According to this method, Problem 2 subject to the instantaneous constraint (24) is replaced by an unconstrained surrogate problem, referred to as Problem 3, whose objective (26) is built on the non-negative function $X^+$ in (25). The non-negative function $X^+$ in (25) is crucial; otherwise, it is not generally true that an optimal policy for (26) is also optimal for (23) while satisfying the instantaneous constraint (24) [33]. In the following subsection, we propose our federated iterative learning algorithm for solving (26).
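The role of the non-negative function $X^+$ in a penalized reward can be illustrated with a minimal Python sketch. It assumes a textbook augmented-Lagrangian penalty (a linear term weighted by λ plus a quadratic term weighted by ρ/2) and a cost convention in which a non-positive cost means the constraint is satisfied; the function names `positive_part` and `surrogate_reward` are hypothetical, and the exact expressions in (25) and (26) may differ.

```python
def positive_part(x: float) -> float:
    """X^+: keep only the violating (positive) part of a constraint value."""
    return max(0.0, x)


def surrogate_reward(reward: float, cost: float, lam: float, rho: float) -> float:
    """Reward minus a linear and a quadratic penalty on the violation.

    Assumed textbook augmented-Lagrangian form, used here for illustration
    only; it is not claimed to be the exact surrogate of the paper.
    """
    violation = positive_part(cost)
    return reward - lam * violation - 0.5 * rho * violation ** 2


# A feasible step (cost <= 0) is left untouched; a violating step is penalized.
feasible = surrogate_reward(reward=1.0, cost=-0.2, lam=2.0, rho=4.0)
violating = surrogate_reward(reward=1.0, cost=0.5, lam=2.0, rho=4.0)
print(feasible, violating)
```

Because only the positive part of the cost enters the penalty, an agent gains nothing extra by over-satisfying the constraint, which is the property that a cumulative constraint alone does not provide.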

B. PROPOSED IC-FDQN METHOD
In Problem 3, besides the optimal policies $\{\pi^*_u\}_{u=1}^{U}$, $\lambda$ and $\rho$ are two design parameters that should also be determined optimally. For a fixed value of $(\lambda, \rho)$, Problem 3 is a multi-agent MDP. Hence, the optimal solution can be determined through two nested iterative loops: the inner loop determines the optimal policies $\{\pi^*_{u,i}\}_{u=1}^{U}$ for a fixed value of $(\lambda^{(i)}, \rho^{(i)})$, and the outer loop over $(\lambda^{(i)}, \rho^{(i)})$ determines the optimal values of $\lambda$ and $\rho$, starting from the initial points $\lambda^{(0)}$ and $\rho^{(0)}$, respectively, using the update rules (27) and (28) [33], where $l \in [1, \infty)$ and $c_\rho \in \mathbb{R}^+$ are the growth rate of the quadratic penalty coefficient and the dual ascent step size, respectively.
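The two nested loops can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the paper's update rules: it assumes the common pattern in which λ is increased in proportion to the measured constraint violation with step size `c_rho` and ρ is multiplied by `l`. The inner solver `toy_inner` is a hypothetical stand-in for training the policies at fixed (λ, ρ).

```python
from typing import Callable, List, Tuple


def outer_loop(
    solve_inner: Callable[[float, float], float],
    lam0: float = 0.0,
    rho0: float = 1.0,
    c_rho: float = 0.5,
    l: float = 1.5,
    iterations: int = 10,
) -> Tuple[float, float, List[float]]:
    """Run the outer loop over (lambda, rho).

    solve_inner(lam, rho) plays the role of the inner loop: it trains the
    policies for fixed (lam, rho) and returns the resulting average
    constraint violation (a non-negative number).
    """
    lam, rho = lam0, rho0
    history = []
    for _ in range(iterations):
        violation = solve_inner(lam, rho)   # inner loop at fixed (lam, rho)
        history.append(violation)
        lam = lam + c_rho * violation       # assumed dual-ascent step
        rho = l * rho                       # assumed geometric penalty growth
    return lam, rho, history


def toy_inner(lam: float, rho: float) -> float:
    """Hypothetical inner solver: violation shrinks as penalties grow."""
    return 1.0 / (1.0 + lam + rho)


lam, rho, history = outer_loop(toy_inner)
print(lam, rho, history[-1])
```

With `l = 1` the quadratic penalty stays constant and only the multiplier λ adapts; larger `l` pushes harder against violations at the price of a less well-conditioned inner problem.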
According to Lemma 1, the optimal policy is the same for all agents, which is a strong motivation for adopting deep FL in our proposed scheme. Among decentralized methods, the multi-agent solution requires a large communication overhead between the agents to share their local observations, and it does not fully exploit Lemma 1. FL does not have the communication overhead of the centralized techniques and does not require the agents to share all of their data and local observations in order to converge. Although this property is originally intended to provide privacy, in our problem it yields an energy-efficiency gain, because the agents (UAVs) do not need to share all of their observations.
As illustrated in Fig. 3, to estimate the Q-value functions (17), deep reinforcement learning (DRL) is deployed, where deep neural networks (DNNs) are used as function approximators to predict the Q-values. The Q-function estimated by the neural network (NN) of each agent $u$ is represented by $\{Q^{\pi}_{u,t}(s^e_{u,t}, a_{u,t}; \theta_{u,t})\}_{u=1}^{U}$, where the parameter $\theta_{u,t}$ represents the weights of the NN. The NN is trained by updating $\theta_{u,t}$ so that it approximates the actual values of $Q^{\pi}_{u,t}$ [47], [48]. Let us define the loss function $L(\theta_{u,t})$ as the expected squared error between the estimated Q-value $Q^{\pi}_{u,t}(s^e_{u,t}, a_{u,t}; \theta_{u,t})$ and the target value $y_{u,t}$ [47], i.e., $L(\theta_{u,t}) = \mathbb{E}[(y_{u,t} - Q^{\pi}_{u,t}(s^e_{u,t}, a_{u,t}; \theta_{u,t}))^2]$, where $y_{u,t} = r_{u,t} + \gamma \max_{a_{u,t+1}} Q^{\pi}_{u,t+1}(s^e_{u,t+1}, a_{u,t+1}; \theta_{u,t})$ and $a_{u,t+1}$ indicates the agent's action generated by the DNN at $t+1$, given the state $s^e_{u,t+1}$. At each iteration, the deep Q-function approximator is trained to learn the best estimate of the Q-function by minimizing the loss function $L(\theta_{u,t})$. To improve the stability of the algorithm and to cope with sample correlation, as depicted in Fig. 3, two well-known techniques are deployed, namely the Fixed Target Network [49] and the Experience Replay Buffer [50], respectively. With these two techniques, the loss function can be written as $L(\theta_{u,t}) = \mathbb{E}_D[(r_{u,t} + \gamma \max_{a_{u,t+1}} Q^{\pi}_{u,t+1}(s^e_{u,t+1}, a_{u,t+1}; \bar{\theta}_{u,t}) - Q^{\pi}_{u,t}(s^e_{u,t}, a_{u,t}; \theta_{u,t}))^2]$, where $\bar{\theta}_{u,t}$ denotes the target network parameters, and the expectation $\mathbb{E}_D$ is taken over randomly selected mini-batches of samples from the replay buffer $D$.
As illustrated in Fig. 3, we have two main entities: the set $\mathcal{U} = \{1, \ldots, U\}$ of UAVs, which are our distributed agents or, in FL terminology, the clients, and the coordinator, which in our model is a local server (MEC server). FL allows the UAVs (clients) to train a shared global model parameterized by $\theta_g$, of which each client's local model in $\{\theta_u\}_{u=1}^{U}$ is a copy, using their own local observations $\{D_u\}_{u=1}^{U}$, while the original data remain at the UAVs. After local training, the clients share their local model updates with the coordinator. The coordinator then aggregates the received updates to build the global model $\theta_g$. As a result, relying on distributed training at the clients, the local server can enhance the training performance without significant communication overhead, as it needs only the updated local model parameters and not the clients' local data. The federated learning procedure of our proposed method includes the following key steps.

1) DISTRIBUTED LOCAL TRAINING
First, the local server initializes the global model, $\theta_{g,0}$, and transmits it to the clients. Upon receiving $\theta_{g,0}$, during the VNF-scheduling time slots $t$ the clients interact with the environment and train their local models $\{\theta_{u,t}\}_{u=1}^{U}$ using their own local observations $\{D_{u,t}\}_{u=1}^{U}$ by minimizing the loss function $L(\theta_{u,t})$. Then, the clients upload their local updates $\{\theta_{u,t}\}_{u=1}^{U}$ to the coordinator for aggregation.
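The local loss that each client minimizes can be illustrated with a small NumPy sketch. A linear Q-function stands in for the DNN, and the buffer contents, dimensions, and the names `q_values` and `dqn_loss` are hypothetical choices for illustration; the sketch only shows how a mini-batch sampled from a replay buffer and a separate set of target parameters enter the squared-error loss.

```python
import numpy as np

rng = np.random.default_rng(0)


def q_values(theta: np.ndarray, states: np.ndarray) -> np.ndarray:
    """Linear Q-function, one row of theta per action (stand-in for the DNN)."""
    return states @ theta.T  # shape: (batch, n_actions)


def dqn_loss(theta, theta_target, batch, gamma=0.9) -> float:
    """Mean squared error between the target value and the predicted Q-value."""
    s, a, r, s_next = batch
    # Target uses the fixed target-network parameters and a max over actions.
    target = r + gamma * q_values(theta_target, s_next).max(axis=1)
    # Prediction uses the online parameters at the actions actually taken.
    predicted = q_values(theta, s)[np.arange(len(a)), a]
    return float(np.mean((target - predicted) ** 2))


# Hypothetical replay buffer of (state, action, reward, next state) samples.
state_dim, n_actions, buffer_size, batch_size = 4, 3, 100, 16
buffer = (
    rng.normal(size=(buffer_size, state_dim)),
    rng.integers(0, n_actions, size=buffer_size),
    rng.normal(size=buffer_size),
    rng.normal(size=(buffer_size, state_dim)),
)

# Uniformly sample a mini-batch to break the correlation between samples.
idx = rng.choice(buffer_size, size=batch_size, replace=False)
batch = tuple(x[idx] for x in buffer)

theta = rng.normal(size=(n_actions, state_dim))
theta_target = theta.copy()  # refreshed only periodically in practice
print(dqn_loss(theta, theta_target, batch))
```

In an actual training loop the online parameters would be updated by gradient descent on this loss, while the target parameters would be overwritten with the online ones only every fixed number of steps.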

2) MODEL AGGREGATION
After collecting the clients' local model updates, $\{\theta_{u,t}\}_{u=1}^{U}$, the next step is to aggregate them into a new version of the global model, $\theta_{g,t+1}$, which the coordinator computes as a weighted average of the agents' contributions, $\theta_{g,t+1} = \sum_{u=1}^{U} \omega_u \theta_{u,t}$, where $\omega_u$ represents the relative contribution of each agent to the global model. According to Lemma 1 and the symmetry among the UAVs, the best choice for $\omega_u$ is $\omega_u = \frac{1}{U}$. After deriving the new update $\theta_{g,t+1}$, the coordinator broadcasts it to all clients. Upon receiving the update from the coordinator, the clients update their local models accordingly.
Finally, for the locally running DQN model at the clients, the local state space $S_u$, the action space $A_u$, and the reward function $R_u$ are defined as follows:
• State: We define the client's state as a vector including 1) the computational, $\{c_{fk}\}_{f=1,k=1}^{F_k,K}$, and storage, $\{b_{fk}\}_{f=1,k=1}^{F_k,K}$, capacity requirements of the packet flows belonging to different services, 2) the available CPU, $\{C_p(t)\}_{p \in \mathcal{U} \cup M}$, and storage, $\{B_p(t)\}_{p \in \mathcal{U} \cup M}$, of the processing nodes (UAVs and the MEC server), 3) the uplink transmission rates $\{R_{nu}(t)\}_{n=1}^{N}$, and 4) the downlink transmission rate $R_{uM}(t)$, all at VNF-scheduling time $t$. The resulting state space $S_u$ is given in (33).
• Action: The action space consists of the possible choices of the placement index $\delta^k_u$, as given in (34).
• Reward: According to (26), the reward at VNF-scheduling time $t$ combines the reward (21) with the penalty on the instantaneous constraint (22).
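The aggregation step described above can be sketched in a few lines of Python. The sketch assumes that each local model is a dictionary mapping hypothetical layer names to NumPy arrays of identical shapes across clients; the function name `fed_avg` is illustrative. With no weights given it applies the uniform choice $\omega_u = 1/U$.

```python
import numpy as np


def fed_avg(local_models, weights=None):
    """Weighted average of the clients' parameters, layer by layer.

    local_models: list of dicts {layer_name: np.ndarray}, one per client.
    weights: relative contributions omega_u; uniform 1/U if omitted.
    """
    n_clients = len(local_models)
    if weights is None:
        weights = [1.0 / n_clients] * n_clients
    layer_names = local_models[0].keys()
    return {
        name: sum(w * model[name] for w, model in zip(weights, local_models))
        for name in layer_names
    }


# Three hypothetical clients whose parameters are constant arrays 1, 2 and 3.
clients = [
    {"w": np.full((2, 2), float(v)), "b": np.full(2, float(v))}
    for v in (1, 2, 3)
]
global_model = fed_avg(clients)
print(global_model["w"])
```

Only parameter arrays cross the network in this step; the replay buffers used to produce them never leave the clients.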

VII. CONVERGENCE AND COMPUTATION COMPLEXITY ANALYSIS
The convergence and computational complexity of the proposed algorithm are discussed here. The convergence analysis consists of two main parts, i.e., the convergence of the outer loop, which determines the optimal values $(\lambda^*, \rho^*)$, and the convergence of the inner loop, which determines the optimal policies $\{\pi^*_{u,i}\}_{u=1}^{U}$ for fixed intermediate values of $(\lambda^{(i)}, \rho^{(i)})$. The outer loop is the extended version of the augmented Lagrangian method, and its convergence is discussed in [32] and [33]. Therefore, in the following, we present the convergence of the inner loop and then determine the computational complexity of the whole algorithm.

A. INNER-LOOP CONVERGENCE
We prove the inner-loop convergence in two steps as follows.
In step one, we justify our method in subsection VI-B for aggregating the local models reported by the UAVs as local agents, which is averaging. Then, we discuss the convergence of the aggregation method itself. For step one, we first need to prove Lemma 1, introduced in Section V and repeated below for ease of reference. Therefore, it can be inferred that the agents conceptually have the same decision-process model. Accordingly, without loss of generality, we can assume that the discount factor and the predefined upper bound on the expected cumulative cost are the same for all agents: $\gamma_u = \gamma_{\acute{u}}$ and $c^o_u = c^o_{\acute{u}}$, $\forall u, \acute{u} \in \mathcal{U}$. As a result, it can be inferred that Problem 1 satisfies the conditions of Definition 3, so it is a symmetric Dec-CMMDP. For such a class of symmetric multi-agent MDPs, [44] demonstrated that for any $u, \acute{u} \in \mathcal{U}$, if $\pi_u = \pi_{\acute{u}}$, then $\pi_u$ is an $\epsilon$-best-response to $\pi_{-u}$ if and only if $\pi_{\acute{u}}$ is an $\epsilon$-best-response to $\pi_{-\acute{u}}$, where an $\epsilon$-best-response (for an arbitrary $\epsilon \ge 0$) is defined as a policy that achieves a reward (cost) within $\epsilon$ of the maximum (minimum) value. Being a symmetric CMMDP, the problem of finding the best policy can be reduced to finding $\pi^*_u$, the best response to $\prod_{\acute{u} \neq u} \pi_{\acute{u}}$, while $\pi_{\acute{u}} = \pi_u$, $\forall \acute{u} \neq u$. This proves Lemma 1.
Lemma 1 implies $\{\pi^*_u\}_{u=1}^{U} = \pi^*$, which means that, whatever the optimal policy is, it is the same for all agents. This justifies our model aggregation described in subsection VI-B, i.e., averaging over the models computed locally by the agents. This method, called FedAvg [28], is perhaps the most popular method in the literature for aggregating local models, and its convergence has been extensively discussed in several papers [51], [52], [53]. Hence, in step two, we only discuss whether our problem and the proposed solution satisfy the conditions required for convergence.
To guarantee FedAvg convergence, two conditions are typically assumed [51], [52], [53]: 1) the data among the agents are i.i.d., and 2) all agents participate in global model training. Although the latter condition holds in our case, the UAVs' observations are not independent. This case has also been addressed in recent works. In particular, Wan et al. [54] analyzed the convergence rate of FL training in a more general setting with non-i.i.d. local observations, considering the joint impact of communication and training limitations while relaxing the i.i.d. assumption on the agents' data.

B. COMPLEXITY ANALYSIS
In this section, the computational complexity of the proposed algorithm in Fig. 3 is calculated. We consider the complexity of the proposed algorithm in two phases, i.e., Model Training and Action Selection, the latter applying when the trained model is deployed.
For each iteration of the outer-loop, we have the training of the local models by the agents (UAVs), and then, the aggregation of the local model by the MEC server to achieve the global model.The complexity of local model training by the agents is given by the summation of the complexity of action selection and the complexity of the back-propagation algorithm for each sample of the replay buffer multiplied by the mini-batch size.For the last multiplication by minibatch size, note that in each iteration of the training process, the local agent randomly takes a mini-batch of the samples recorded by the agent in its local replay buffer.
For a fully connected NN with a fixed number of hidden layers and neurons in each hidden layer, the complexity of action selection is proportional to the sum of the input and output sizes of the NN [55], [56]. The input size of the NN equals the state space size, which from (33) is given by $2FK + 2(U + 1) + N + 1$, and the output size of the NN equals the action space size, which from (34) is given by $K(F + 1)$. Therefore, the complexity of action selection is $O(FK)$. On the other hand, from [55] and [56], for a given sample of the replay buffer, the computational complexity of back-propagation is proportional to the product of the input and output sizes of the NN. Therefore, for our case of interest, it is $O(O F^2 K^2)$, where $O$ is the batch size. As a result, assuming that the outer loop of the algorithm converges after a fixed number of iterations, $O(1)$, and that the cost of the aggregation process grows in proportion to the number of UAVs, $U$, the overall computational complexity of the training phase of the whole network is $O(U O F^2 K^2 + U)$, or simply $O(U O F^2 K^2)$. According to the above discussion, the complexity of action selection for all $U$ agents in the deployment phase is $O(UFK)$.
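The input and output sizes quoted above can be evaluated for a concrete configuration with the short Python sketch below. The parameter values (F = 3, K = 2, U = 4, N = 20, batch size 32) are hypothetical and are not taken from the simulation settings; the function names are illustrative.

```python
def nn_input_size(F: int, K: int, U: int, N: int) -> int:
    """State vector length: 2FK + 2(U + 1) + N + 1."""
    return 2 * F * K + 2 * (U + 1) + N + 1


def nn_output_size(F: int, K: int) -> int:
    """Number of actions: K(F + 1)."""
    return K * (F + 1)


# Hypothetical configuration, chosen only to make the arithmetic concrete.
F, K, U, N, batch = 3, 2, 4, 20, 32

n_in = nn_input_size(F, K, U, N)    # 12 + 10 + 20 + 1 = 43
n_out = nn_output_size(F, K)        # 2 * 4 = 8

# Proxies used in the text: sum for action selection, product per
# back-propagated sample, scaled by the mini-batch size and the agent count.
selection_cost = n_in + n_out
training_cost = U * batch * n_in * n_out
print(n_in, n_out, selection_cost, training_cost)
```

These counts are proportionality proxies rather than operation counts, since the hidden-layer sizes are treated as constants in the analysis.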

VIII. PERFORMANCE EVALUATION
In this section, the performance of the proposed algorithm is evaluated. The performance results are compared for four different methods: 1) the proposed IC-FDQN method, 2) the centralized version of the proposed method, Iterative Constrained Centralized DQN (IC2DQN), 3) the multi-agent version of the proposed method, Iterative Constrained Multi-agent DQN (IC-MDQN), and 4) the heuristic Minimum Delay method. In the IC2DQN method, it is assumed that the VNFO is the local server and that it is aware of the entire network state $s$. In the IC-MDQN case, there is no coordination between the UAVs as distributed agents. Finally, the Minimum Delay method aims, at each VNF scheduling time $t$, to select the action that leads to the minimum end-to-end delay between the IoT nodes and the local server.

A. SIMULATION SETUP
The simulation environment is implemented in Python using OpenAI Gym [57], a widely used tool for developing RL algorithms, and run on a computer with an Intel(R) Core(TM) i7-10700 CPU at 2.90 GHz and 64 GB of RAM. Through simulations, the impact of the UAVs' load variation on the performance is evaluated in terms of the Coefficient of Variation (CV). The CV, or Relative Standard Deviation, is formally defined as $C_v = \frac{\sigma}{\mu} \times 100$, where $\sigma$ and $\mu$ are the standard deviation and mean value of a statistical distribution [58]. A higher value of CV means a higher degree of variation relative to the mean. The CV is a standardized measure of the dispersion of a probability or frequency distribution and is widely used in engineering as a quality and reliability assurance measure [58], [59]. For a system under test, $20 < CV < 30$ is normal, so a reliable system is expected to tolerate this variation in terms of output performance, whereas $CV > 30$ is a strict condition [59], [60]. Hence, in the simulations, the performance of the proposed algorithm is investigated for $CV \in \{10, 20, 30, 40\}$ and compared with the baseline methods. In addition, the impact of increasing the number of IoT nodes on the performance of the proposed scheme is evaluated in terms of average AoI, AoI-condition violation, and network energy consumption. The simulation parameter settings are summarized in Table 2.
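The CV definition above can be checked numerically with a minimal Python sketch. It assumes the population standard deviation (`statistics.pstdev`); using the sample standard deviation instead would give slightly larger values for small samples. The load values are hypothetical.

```python
import statistics


def coefficient_of_variation(samples) -> float:
    """CV in percent: standard deviation divided by mean, times 100."""
    mu = statistics.mean(samples)
    sigma = statistics.pstdev(samples)  # population standard deviation (assumed)
    return sigma / mu * 100.0


# Hypothetical UAV load samples with mean 10.
loads = [8.0, 10.0, 12.0]
print(round(coefficient_of_variation(loads), 2))
```

Scaling all samples by a common factor leaves the CV unchanged, which is why it is convenient for comparing load variation across UAVs with different average loads.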

B. SIMULATION RESULTS
The effect of load variation on the AoI and energy efficiency of the proposed method and the baselines is presented in Fig. 4 to Fig. 6. In the implemented model, it is implicitly assumed that all the VNFs assigned to a processing node must be finished in a single round of VNF scheduling. Fig. 4 shows the average AoI versus different values of CV. As can be seen, the average AoI increases with the CV, i.e., an increase in load variation leads to an increase in AoI. This happens mainly because, for larger values of CV, the agents have to find the best policy for a state of larger dimension. However, compared to the baseline methods, the average AoI of the proposed IC-FDQN method is the smallest for all values of CV except CV = 40, and the Minimum Delay method has the worst performance. Generally speaking, similar results can be deduced from Fig. 5 and Fig. 6. Nevertheless, each method behaves differently with respect to the value of CV, which needs to be discussed. For CV = 10 and CV = 20, i.e., when the variance of the UAVs' load is small, IC2DQN, which has a centralized view of the network status, reaches a smaller AoI than IC-MDQN, in which the agents have only local observations. In the proposed IC-FDQN method, although the agents have only their local observations (as in IC-MDQN), they cooperatively contribute to a single global model, which reaches the minimum AoI among all methods. This result is consistent with Lemma 1 and confirms that for a symmetric Dec-CMMDP, which is the case in this paper, the best policy is the same for all agents.
When CV increases to 30 and beyond, IC-MDQN starts to surpass IC2DQN. In IC2DQN, the local server, as a central agent, tries to concurrently maximize the performance of all UAVs; for a fixed DNN size, reaching the best policy for all agents becomes more difficult as CV increases, until it becomes infeasible beyond a certain value of CV and the performance degrades significantly. For CV = 30, both the proposed IC-FDQN method and IC-MDQN reach a lower AoI than IC2DQN. The reason is as follows. In the IC-FDQN and IC-MDQN methods, the agents behave independently, so they have more degrees of freedom in determining the locally optimal action. That is why IC-MDQN provides better performance than IC2DQN as the CV increases. However, comparing IC-FDQN and IC-MDQN, the proposed FL-based IC-FDQN method still reaches the minimum AoI because the agents, while acting independently, have the same model (a copy of the global model) that is cooperatively built by aggregating the models trained on the local agents' observations. For CV = 40, the situation is different. As explained before, CV = 40 is not a normal case for a system under test and is a strict condition. In this case, the problem is not feasible from the global viewpoint of the IC-FDQN and IC2DQN methods; however, IC-MDQN achieves better performance because the agents follow different policies and are greedy. As a result, some of the agents are able to minimize their local AoI, which eventually reduces the average AoI. It should be mentioned that in calculating the AoI value of each agent, an upper limit of T is considered. Finally, the heuristic Minimum Delay method, which tries to minimize the point-to-point delay, yields the largest AoI, as it does not consider the end-to-end performance and has a local, short-term target. In summary, for normal values of load variation, the proposed IC-FDQN method achieves the minimum average AoI in comparison to the baselines for two reasons. First, the agents cooperate interactively to build the optimal global model based on their local observations. Second, although the model is the same, the agents act independently.
A similar discussion can be made for the results presented in Fig. 5 for the percentage of AoI violation and in Fig. 6 for the average energy consumption. The only point worth mentioning is the temporary reduction of the average energy consumption for IC-FDQN and IC2DQN when CV = 20. For a normal coefficient of variation and for the methods that have a global viewpoint (all methods except IC-MDQN), independent load variation of the UAVs gives the central agent the opportunity to manage the resources network-wide and reduce the energy consumption at times when the load is low in some UAVs. IC-MDQN cannot use this opportunity because its agents behave greedily, independently, and based on local observations. Therefore, this reduction in energy consumption when the load of some UAVs is low leads to an overall reduction in energy consumption.
The outer-loop convergence is depicted in Fig. 7, where the optimal values of $\lambda$ and $\rho$ are determined through an iteration over $k$ using (27) and (28). From (27), we define the average cumulative cost as $\mathbb{E}\{\sum_{t=0}^{\infty} \gamma^t X^+(c_{u,t}(s^e_u, a_u)) \mid \pi_{u,i}\}$.
[...] local server as a central agent tries to concurrently maximize the performance of all UAVs. With a fixed DNN size, this is not feasible for a large number of IoT nodes, and the performance degrades. However, in IC-MDQN and IC-FDQN, the agents behave independently and have more degrees of freedom in determining the locally optimal policy. The same discussion holds for the percentage of AoI violations. The effect of an increase in the number of IoT nodes on AoI violation is illustrated in Fig. 11(a) to Fig. 11(c). In this investigation, the available processing resources are assumed to be fixed when the number of IoT nodes increases. The same behavior as in Fig. 11(a) to Fig. 11(c), but this time for the percentage of AoI violation, can also be seen here. In scenario 2 and scenario 3, the processing nodes' total resources, $C_p$ and $B_p$, are increased by scale factors of 1.25 and 1.5, respectively. As is evident, only the proposed IC-FDQN method achieves a zero percentage of AoI violation. This result was expected given Fig. 7, which shows that IC-FDQN is the only method that converges to zero residual cost. This superiority holds over a significant range of IoT node numbers. For example, in Fig. 11(a), where the number of IoT nodes is 35, the percentage of AoI violation for the baseline methods is 20% or more, whereas for the proposed IC-FDQN method it is zero.

IX. CONCLUSION
We considered the problem of dynamic placement and scheduling of NFV-enabled SFCs in a smart agriculture application, with the aim of minimizing the total energy consumption throughout the network under a strict condition on the Age of Information (AoI). The problem is formulated as a Dec-CMMDP. Then, exploiting the symmetric structure of the network, a novel federated learning-based iterative method is proposed to solve the problem efficiently. The proposed method is distributed and energy efficient, since the local agents only need to share the parameters of their locally trained models with the coordinator. The privacy-supporting feature of FL significantly decreases the communication overhead, which in our problem leads to a significant reduction in the total energy consumption of the network. Regarding the satisfaction of the constraint on instantaneous AoI values, the proposed method meets the required constraint for a reasonable range of parameter settings and has the best performance in comparison to the baseline methods. In terms of freshness of information, and for a realistic parameter setting, the AoI is jointly minimized. Simulation results demonstrate that the achieved AoI value is appropriate for most near-real-time applications.

Lemma 1:
Problem (16) is a symmetric Dec-CMMDP. Being a symmetric CMMDP, the problem of finding the best policy can be reduced to finding $\pi^*_u$, the best response to $\prod_{\acute{u} \neq u} \pi_{\acute{u}}$, while $\pi_{\acute{u}} = \pi_u$, $\forall \acute{u} \neq u$.
Proof: Inspired by [44], let us define a symmetric Dec-CMMDP model as follows.
Definition 3 (Symmetric Dec-CMMDP): A Dec-CMMDP is called symmetric if the following conditions hold:
(i) $\forall u, \acute{u} \in \mathcal{U}$, $A_u = A_{\acute{u}}$, $c^o_u = c^o_{\acute{u}}$, and $\gamma_u = \gamma_{\acute{u}}$.
(ii) $\forall u \in \mathcal{U}$, $a \in A$, and an arbitrary permutation function $\sigma(\cdot)$ over all single actions $\{a_u\}_{u=1}^{U}$ chosen by the agents:
(a) $r_u(s^e_u, \sigma(a)) = r_{\sigma(u)}(s^e_u, a)$,
(b) $c_u(s^e_u, \sigma(a)) = c_{\sigma(u)}(s^e_u, a)$,
(c) $T(\cdot \mid s^e_u, \sigma(a)) = T(\cdot \mid s^e_u, a)$.
Condition (ii) of Definition 3 means that the agents have the same statistical and decision model and are independent of each other. In our VNF-enabled SFC problem, as depicted in Fig. 1, each distributed agent $u \in \mathcal{U}$ has $K$ packet flows $\{\Upsilon^k_u(t)\}_{k=1}^{K}$ belonging to different services, each of which requires running a different VNF chain on its packets. The statistical model of the packet flow of each service type is the same for all agents. The agents independently decide how to split each VNF chain between the local server and themselves, while they follow the same objective function stated in Problem 1.

FIGURE 7. Outer-loop convergence: Average cumulative cost vs iteration over k.

FIGURE 8. Probability distribution function of AoI values.

FIGURE 9. The effect of changing the constraint value ξ after training.