Joint Air-Ground Distributed Federated Learning for Intelligent Transportation Systems

Supported by major revolutionary technologies such as the Internet of Vehicles (IoV), Edge Computing, and Machine Learning (ML), traditional Vehicular Networks (VNs) are changing drastically and converging rapidly into one of the most complex, highly intelligent, and advanced networking systems, commonly known as the Intelligent Transportation System (ITS). Recently, distributed ML techniques such as Federated Learning (FL) have gained huge popularity, mainly for their advantages in terms of intelligence sharing and privacy preservation. VNs are a natural candidate for exploiting FL to solve challenging problems; however, their limited resources, dynamic nature, high speed, and low-latency requirements often become the bottleneck. V2X communication technologies allow vehicular terminals (VTs) to share their valuable local environment parameters and become aware of their surroundings. Such information can be utilized to build a more sustainable and affordable FL platform for serving VTs. Building on recently introduced 3D architectures that integrate terrestrial and aerial edge computing layers, we present a distributed FL platform able to distribute the FL process in a 3D fashion while reducing the overall communication cost of providing vehicular services. The framework is defined as a constrained optimization problem for reducing the overall FL process cost through a proper network selection among various nodes. We model the FL network selection problem as a sequential decision-making process through a Markov Decision Process (MDP) with time-dependent state transition probabilities. A computation-efficient value iteration algorithm is adapted for solving the MDP. Comparison with various benchmark methods shows an overall improvement in terms of latency, energy, and FL performance.


Swapnil Sadashiv Shinde, Student Member, IEEE, and Daniele Tarchi, Senior Member, IEEE
Index Terms: Vehicular edge computing, federated learning, aerial networks, Markov decision process.

I. INTRODUCTION
In the last decade, Multiaccess Edge Computing (MEC), one of the most recent revolutionary technologies, has enabled various latency-limited and data-intensive services and applications in wireless networking scenarios [1]. In the case of Vehicular Networks (VNs), edge computing can be enabled by deploying Roadside Units (RSUs) and integrating them with limited-capacity edge servers, an approach known as Vehicular Edge Computing (VEC) [2]. However, with growing service requirements, the limited RSU resources are no longer sufficient and are becoming a bottleneck for VEC performance. Recently, different aerial platforms, such as Low and High Altitude Platforms (LAPs and HAPs), Unmanned Aerial Vehicles (UAVs), drones, and balloons, have been integrated into terrestrial networks, creating a single Terrestrial/Non-terrestrial (T/NT) network [3]. These platforms can be further exploited for their reliable computation and communication capabilities through the edge computing paradigm. If integrated into VNs, the aerial platforms can boost the performance of traditional VEC facilities with additional resources. In recent times, the Internet of Vehicles (IoV) paradigm has been introduced, where VNs, through the integration of communication and sensing, become a large source of data [4]. These data can be analyzed through different Machine Learning (ML) techniques to provide higher Quality of Service (QoS) and Quality of Experience (QoE) in vehicular services at reduced cost to the end users. ML algorithms can find hidden patterns and underlying structures in the data collected by Vehicular Terminals (VTs) without human intervention. Therefore, ML techniques, including Deep Learning (DL), Reinforcement Learning (RL), and Federated Learning (FL), have been proposed in recent years to solve challenging research problems in VNs [5], [6], [7].
Compared with the traditional centralized ML process, FL has had enormous success in terms of reduced latency and energy consumption [8], [9]. In general, an FL process involves several training iterations, also known as FL iterations, each comprising a training process over distributed FL-clients (i.e., wireless nodes), the transmission of updated local ML model parameters towards a centralized server, an averaging operation performed at the centralized FL-server for creating a global model (i.e., Federated averaging (FedAvg)), and the transmission of the new global model parameters towards the FL-clients [10]. By training locally, wireless devices avoid sending raw data towards a centralized server, which saves energy and reduces data transmission delay. Additionally, through the aggregation process performed at the centralized FL-server, devices can learn from each other's training experience. Despite these advantages, implementing FL in distributed environments such as VNs can be challenging, mainly due to the resource scarcity of individual VTs, high mobility, and dynamically changing vehicular environments.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The main challenges when implementing the FL platform over distributed VNs are:
• The harsh wireless environment due to vehicular mobility can make it hard to implement the traditional centralized FL model in VNs.
• VTs' resource limitations can be a bottleneck for performing a large number of FL iterations.
• Uncertainty about the participation of a large number of VTs into the FL process, mainly due to the harsh wireless environment, can also increase the FL process cost.
• Continuously changing vehicular environment parameters and corresponding data can impact the FL model performance over time, and introduce model drift. This may require frequent evaluation and retraining of the FL model.
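To address these challenges, this work presents a distributed FL platform over a joint air-ground network. The main contributions are:
• A constrained optimization problem is formulated for minimizing the overall cost of the process (in terms of joint latency, energy, and FL training performance) through a proper assignment of VTs and FL servers.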
• An MDP framework is considered for modeling the problem as a sequential decision-making process, and a value iteration technique is used to solve it. The MDP environment is modeled through time-dependent state transition probabilities that take into account the local vehicular environment.
• The performance of the proposed scheme is analyzed by comparing it with different heuristic techniques and conclusions are drawn.

II. RELATED WORKS
Traditional VNs are converging into a more intelligent and advanced networking system with the integration of edge computing, machine learning, and big data applications [2]. Though VEC brings several advantages by enabling innovative services in the VN, the limited storage and computation capacity of edge servers is becoming a bottleneck for the innovative ML-based applications. The joint T/NT approach can solve the resource scarceness problem of traditional terrestrial edge computing techniques. Recently several authors have highlighted the importance of aerial edge computing facilities for boosting the VNs performance [18], [19].
The recently introduced IoV technology supports the transmission, storage, and computing of the huge amount of data generated by VTs, which can be used to improve VN performance [20]. Various ML-based approaches can be adopted for finding useful patterns in VT data and for bringing intelligence into VNs. In [21], the authors have surveyed various ML techniques, their applications, and the challenges faced when implementing them over dynamic vehicular networks. Recent works highlight the importance of FL for vehicular cases. In [11], the authors have surveyed the various opportunities and challenges of FL over a federated VN. Though FL brings advantages in terms of communication efficiency and privacy preservation, additional optimization is needed in terms of device selection, resource allocation, distribution of the learning process, etc., to adapt it to wireless environments. In [14], the authors have considered the FL device selection problem over a resource-constrained VN. A min-max optimization problem is formulated and solved through a greedy algorithm. In [5], the authors have considered the joint computation offloading and FL process optimization over an edge-computing-enabled VN. Various cluster-based and distributed approaches are considered for finding a proper resource sharing between the two phases, aimed at minimizing the latency and energy costs. The importance of using FL in VN scenarios is also reinforced by other aspects. Indeed, even if not strictly related to the system under consideration, FL is a good candidate, when jointly used with Blockchain, for implementing a collaborative intrusion detection solution for IoV scenarios [22]. Given the rise of different security threats over edge environments, e.g., edge device compromise, privacy leaks, and denial of service (DoS), additional security measures are needed when accessing services from the edge facilities.
In [23], the authors have proposed a lightweight anonymous mutual authentication scheme for n-times computation offloading in IoT environments. The proposed method can provide user anonymity, conditional message tracing, unlinkability to users' private data through messages collected from open channels, and resilience to edge compromise and DoS. Similarly, in [24], the authors have proposed a two-factor lightweight privacy-preserving authentication scheme for enhancing the security of vehicular communication systems by using decentralization of central authorities and biological-password-based two-factor authentication. For another important IoT use case, the authors in [25] proposed a cloud-based user authentication scheme for secure authentication of medical data in wearable healthcare monitoring systems. The proposed scheme also allows password change, smart card revocation, and new wearable sensor addition phases.
In the case of future intelligent networks, various AI-based services can be enabled through proper collaboration between different networking environments and cloud/edge facilities [26]. However, this gives rise to several data privacy and security challenges [27]. FL, being one of the potential distributed learning techniques, can be useful for providing such AI-driven services in vehicular networks. Though the main advantage of FL is the elevated data privacy compared to traditional centralized training approaches, various new privacy and security-related issues have recently arisen in traditional centralized FL models. Data/model poisoning, data modification, attacks on inference processes, backdoor attacks, Generative Adversarial Network-based attacks, malicious servers, free-riding attacks, and eavesdropping are some of the major security and/or privacy threats that can be seen when implementing the FL process over different IoT networks [28]. When considering the FL process over a vehicular system, these issues can become even more critical, mainly due to the dynamicity, the presence of a large number of VTs, multiple server nodes, the high sensitivity of vehicular data, and the fatal impacts of a data breach. Several techniques introduced over different IoT environments for enabling a secure and trustworthy FL process, such as reputation management, Blockchain-based systems, data privacy-based perturbation techniques, secure aggregation techniques, secure multi-party computation, homomorphic encryption, and back-door defenders, need further analysis for creating a highly reliable FL process over VNs [29], [30], [31].
In recent times, the edge intelligence paradigm along with various distributed learning frameworks have received lots of attention [32]. In [33], authors have proposed a federated RL approach for minimizing the communication delay of the traditional centralized training approach for finding proper offloading decisions and the resource allocation in a UAV enabled MEC environment. In another case, [13] highlights the importance of the distributed FL processing for solving the long-distance connectivity and energy efficiency challenges of traditional centralized FL. MEC-enabled aerial access networks and their benefits for FL are discussed in [34]. In [35], the authors have proposed a communication-efficient FL framework based on a customized local training strategy, partial client participation, and flexible aggregation strategies. The analysis is limited to the terrestrial RSU nodes along with the cloud facilities.
Although FL has received a lot of attention, a communication-efficient and sustainable FL platform for dynamic VN applications is still missing. In addition, when designing FL models, several authors have restricted their studies to terrestrial networks. Considering these shortcomings and the various challenges posed by FL, in this work we aim to design a communication-efficient, highly sustainable, distributed FL process for vehicular applications.

III. SYSTEM MODEL AND PROBLEM FORMULATION
In the following, an urban Internet of Vehicles (IoV) scenario for Intelligent Transportation Systems is considered, in which connected and intelligent VTs can request several intelligent services from the nearby edge computing facilities. In recent times, such urban IoV scenarios have gained a lot of attention from the vehicular research community [14], [36]. In particular, we consider a multi-layered joint air-ground network composed of HAPs, UAVs (i.e., LAP nodes), RSUs deployed along the road paths, and randomly distributed VTs traveling on a road in either direction, where $\mathcal{V} = \{v_1, \dots, v_m, \dots, v_M\}$, $\mathcal{R} = \{r_1, \dots, r_n, \dots, r_N\}$, and $\mathcal{U} = \{u_1, \dots, u_l, \dots, u_L\}$ denote the sets of $M$ VTs, $N$ RSUs, and $L$ UAVs, respectively. Each HAP node is denoted through the index $h$.
The system is modeled in a time-discrete manner, and the network parameters are constant in each time interval $\tau$, where $\tau_i$ identifies the ith time interval, i.e., $\tau_i = \{\forall t \mid t \in [i\tau, (i+1)\tau]\}$. The generic mth VT is characterized by a processing capacity equal to $c_{v,m}$ Floating Point Operations (FLOPs) per CPU cycle, while its CPU frequency is $f_{v,m}$ [5], [37]. VTs are supposed to be able to communicate over a bandwidth $B^{rsu}_{v,m}$ with the RSUs, over a bandwidth $B^{LAP}_{v,m}$ with the UAVs, and over a bandwidth $B^{HAP}_{v,m}$ with the HAPs. In addition, the mth VT is supposed to hold a set $\mathcal{D}_{v_m}$ with $|\mathcal{D}_{v_m}| = K_{v_m}$ data samples, produced during its operation by the embedded Advanced Driver-Assistance System (ADAS) and later used during the FL training process. FL is exploited here for assisting vehicle operations, e.g., computation offloading, path planning, and object detection. The nth RSU, supposed to be in a fixed position with a coverage radius $R_{r,n}$, is characterized by a processing capacity equal to $c_{r,n}$ FLOPs per CPU cycle, a CPU frequency $f_{r,n}$, and a communication technology able to cover the VTs on the ground with an overall bandwidth $B^{r \to v}_{r,n}$. The RSUs are also able to connect with the UAVs and the HAPs with bandwidths $B^{UAV}_{r,n}$ and $B^{HAP}_{r,n}$, respectively. The RSUs are connected to the electrical grid for their energy supply. Each RSU can provide edge computing services to the VTs in its coverage space. In addition, the area is supposed to be under the coverage of multiple UAVs, with the lth UAV at altitude $h_{u,l}$ and with coverage radius $R_{u,l}$. We assume that UAVs are charged by exploiting the available charging points in the service area. Based on VT requests, UAVs can move in different directions with optimal path planning, whose management is beyond the scope of this work.
While serving VTs, the lth UAV, supposed to move at a relatively slow speed compared with the highly mobile VTs, is characterized by a processing capability equal to $c_{u,l}$ FLOPs per CPU cycle, with CPU frequency $f_{u,l}$. In addition, it is supposed to be able to communicate over a bandwidth $B^{l \to (v,r)}_{u,l}$ and to cover an area with radius $R_{u,l}$, while it has a bandwidth $B^{HAP}_{u,l}$ when communicating with the HAP. Each UAV can serve a set of VTs and RSUs in its coverage space.
The generic hth HAP node is placed at an altitude $h_h$ above the ground and is characterized by a processing capability of $c_h$ FLOPs per CPU cycle, with CPU frequency $f_h$. Moreover, we consider multi-beam antenna forming techniques, where each antenna beam is supposed to cover a geographical area of radius $R_h$ and has a communication bandwidth $B^{h \to (v,r,l)}_h$. In the following, we will refer to a single beam as the coverage of the HAP. It should be noted that, although the HAP coverage is reduced to a single beam for notational simplicity, our approach can easily be scaled to the overall HAP coverage with multiple beams. Each RSU, UAV, and HAP provides edge computing services to the VTs, RSUs and UAVs within its coverage area. Fig. 1 shows the basic system elements and the various communication links between them.¹

A. VT Mobility Model
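We suppose that the mth VT moves in a freeway-like mobility scenario with a speed $\vec{v}_m(\tau_i)$ bounded by $\vec{v}_{\min}$ and $\vec{v}_{\max}$ [14], where the instantaneous speed is modeled through a truncated normal probability density function:

$$f(\vec{v}_m(\tau_i)) = \begin{cases} \dfrac{2 \exp\left(-\frac{(\vec{v}_m(\tau_i) - \mu)^2}{2\sigma^2}\right)}{\sqrt{2\pi\sigma^2}\left(\mathrm{erf}\left(\frac{\vec{v}_{\max} - \mu}{\sigma\sqrt{2}}\right) - \mathrm{erf}\left(\frac{\vec{v}_{\min} - \mu}{\sigma\sqrt{2}}\right)\right)}, & \vec{v}_{\min} \le \vec{v}_m(\tau_i) \le \vec{v}_{\max} \\ 0, & \text{otherwise} \end{cases}$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the vehicle's speed, and $\mathrm{erf}(x)$ is the Gauss error function over $x$. The path length within which the mth VT remains under the coverage of the jth node (i.e., any of the RSUs, UAVs or HAPs) is $D_{v_m,j}(\tau_i)$; it depends on the location of the mth VT at $\tau_i$ and on $(x_j, y_j)$, the projection over the ground of the generic jth edge computing node, which can be an RSU, UAV or HAP. The available sojourn time for the mth VT with respect to the generic jth node can be written as:

$$T^{soj}_{v_m,j}(\tau_i) = \frac{D_{v_m,j}(\tau_i)}{\left|\vec{v}_m(\tau_i)\right|}$$

¹ Although the system model and the analysis consider the general case of multiple HAPs, the performance is later evaluated for the simple case with only one HAP. The generic case can be seen as a simple extension of the one-HAP scenario.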
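The mobility quantities above can be illustrated with a short Python sketch. The truncated normal density follows directly from the description in the text. The path-length function, however, is only an assumption made for illustration: it takes a straight road parallel to the x axis and a circular coverage footprint, which the text does not state explicitly. All function and parameter names are hypothetical.

```python
import math
import random


def truncated_normal_pdf(v, mu, sigma, v_min, v_max):
    """Density of a normal(mu, sigma) speed truncated to [v_min, v_max]."""
    if v < v_min or v > v_max:
        return 0.0
    num = 2.0 * math.exp(-((v - mu) ** 2) / (2.0 * sigma ** 2))
    den = math.sqrt(2.0 * math.pi * sigma ** 2) * (
        math.erf((v_max - mu) / (sigma * math.sqrt(2.0)))
        - math.erf((v_min - mu) / (sigma * math.sqrt(2.0)))
    )
    return num / den


def sample_speed(mu, sigma, v_min, v_max, rng):
    """Rejection sampling: redraw until the speed falls inside the bounds."""
    while True:
        v = rng.gauss(mu, sigma)
        if v_min <= v <= v_max:
            return v


def remaining_path_length(x_m, y_m, x_j, y_j, radius, direction=1):
    """Distance the VT still travels inside the coverage of node j.

    ASSUMPTION (not from the source): the road is parallel to the x axis,
    the coverage footprint is a circle of the given radius centred on the
    ground projection (x_j, y_j), and direction is +1 or -1 along x.
    """
    dy = y_j - y_m
    if abs(dy) > radius:
        return 0.0  # the road never crosses the coverage circle
    half_chord = math.sqrt(radius ** 2 - dy ** 2)
    remaining = half_chord + direction * (x_j - x_m)
    # Clamp to [0, full chord] for VTs that already left or have not yet entered.
    return max(0.0, min(remaining, 2.0 * half_chord))


def sojourn_time(path_length, speed):
    """Sojourn time as path length divided by the speed magnitude."""
    return path_length / abs(speed)
```

For instance, a VT located exactly under a node with a 100 m coverage radius and moving at 20 m/s would, under these assumptions, have a sojourn time of 5 s.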

B. Distributed FL Platform for Vehicular Applications
To support VN management through the proposed air-ground network architecture, we propose a VT service-based distributed FL platform (Fig. 2).
The federated training operation depends on the service request $\nu$, where $\nu$ can be any vehicular service requested by VTs, such as computation offloading towards edge servers, path planning, streaming-related services, etc. Each service $\nu$ requires a unique FL model $F_\nu$. In the considered FL platform, VTs (i.e., FL client devices) with local datasets $\mathcal{D}_{v_m}$ can perform the local training of the FL model corresponding to the requested service $\nu$. Since different VTs can request different services over time, the group of VTs located in the coverage space of the HAP and requesting the same service $\nu$ collaboratively trains the FL model corresponding to that service; the number of VTs participating in the training process of the $\nu$th FL model is the size of this group. In each FL iteration $it$, after the local training operation, we assume that a data vector $w^{it,\nu}_{v,m}(\tau_i)$ (i.e., model updates embedded into an IP packet) is generated, to which an information header indicating the VT's service ($\chi_\nu \Leftrightarrow \nu$) is added. Here, $\chi_\nu$ can be a unique sequence of bits indicating the $\nu$th service.² The processed data are then sent towards an upper-layer edge computing node according to the network selection strategy adopted by the VTs, as discussed in the following.
After receiving the data from the lower-layer entities, each Edge Node (EN) performs the FedAvg process, creating the new set of updates $w^{it,\nu}_{r,n}$ / $w^{it,\nu}_{u,l}$ / $w^{it,\nu}_{h}$, where $w^{it,\nu}_{r,n}$ denotes the aggregated FL model updates associated with the $\nu$th service generated by the intermediate RSU node $n$, and $w^{it,\nu}_{u,l}$ and $w^{it,\nu}_{h}$ are the FL model updates after the averaging process (i.e., FedAvg) performed at the lth UAV node and the hth HAP, respectively. With a post-processing operation, the header information $\chi_\nu$ is again inserted into the aggregated data ($w^{it,\nu}_{r,n}$ / $w^{it,\nu}_{u,l}$ / $w^{it,\nu}_{h}$) for the next-layer processing. In the end, the data vector corresponding to $w^{it,\nu}_{r,n}$ / $w^{it,\nu}_{u,l}$ / $w^{it,\nu}_{h}$ is transmitted towards the next platform or the VTs, based upon the network selection strategy. Though it is beyond the scope of this work, the insertion and processing of the header information associated with a specific service request allows the proposed FL platform to train multiple service-based FL models simultaneously.
After receiving data from the ENs, VTs use them in the next iteration of the FL process. The process continues for several FL iterations until a certain confidence interval is reached. Fig. 3 shows the steps of each single iteration of the proposed distributed FL process.
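The layered aggregation described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it assumes that each update carries the number of data samples it represents, so that the standard sample-weighted FedAvg can be applied at every layer, and it models the header $\chi_\nu$ as a plain service identifier string. The names `fed_avg` and `aggregate_by_service` are hypothetical.

```python
from collections import defaultdict


def fed_avg(updates):
    """Sample-weighted average of (weights, n_samples) pairs.

    Returns the averaged weight vector and the total sample count, so that
    an upper layer can re-aggregate the result without bias.
    """
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    avg = [sum(w[k] * n for w, n in updates) / total for k in range(dim)]
    return avg, total


def aggregate_by_service(packets):
    """Group incoming packets by their service header, then run FedAvg.

    Each packet is (service_id, weights, n_samples); service_id plays the
    role of the header chi_nu (ASSUMPTION: modelled here as a string).
    """
    groups = defaultdict(list)
    for service_id, weights, n_samples in packets:
        groups[service_id].append((weights, n_samples))
    return {sid: fed_avg(ups) for sid, ups in groups.items()}


# Two VTs send updates to an RSU; a third VT sends directly to the HAP.
rsu_out = aggregate_by_service([
    ("offload", [1.0, 2.0], 10),
    ("offload", [3.0, 4.0], 30),
])
hap_out = aggregate_by_service([
    ("offload", rsu_out["offload"][0], rsu_out["offload"][1]),
    ("offload", [5.0, 6.0], 60),
])
```

Because the sample counts are propagated upwards, the model obtained at the HAP in this sketch coincides with the one a single flat FedAvg over the three VTs would produce, regardless of which intermediate node each VT selected.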

C. Network Selection Parameters
FL performance is a function of the number of participating VTs to the FL process, the number of FL iterations performed by VTs, the communication and computing latency, and the energy cost of each FL iteration. The FL process cost depends also on the network selection strategy adopted by different networking layers given their limited computing and communication resources.
To better clarify this point, if a VT selects the HAP node direct link for the FL data transmission, it can potentially save the processing latency and the cost required to perform the FedAvg process at the intermediate layers; however, it can increase the VTs data transmission cost in terms of transmission latency and energy, mainly due to the limited resources of VTs and the long-distance communication links between VT and HAP. Also, due to long-distance communication links, the link failure probability can be higher, resulting in a possibly high number of dropouts (i.e., VTs not participating in the FL training process). On the other hand, if VT decides to select RSU or UAV nodes for distributed FL data communication, it can potentially save communication time and energy. However, an additional burden of processing latency over these intermediate layers needs to be considered. Similar analysis can be applied to the RSU and UAV nodes when selecting the possible higher networking layers for the data communication. Therefore, there is a clear tradeoff between the different network selection strategies adapted by the VTs and the intermediate layers. A proper network selection strategy guaranteeing the optimal training latency and energy performance is required.
Based on their limited coverage ranges, each VT can be covered by a set of RSUs, a set of UAVs, and one HAP node.
1) VTs Network Selection Decision: Focusing on the mth VT, a selection vector with one entry per RSU, UAV, and HAP node covering the VT models the available nodes for selection. The VT can select either an RSU, a UAV, or the HAP for communicating the FL model parameter updates. If all entries of this vector are zero, the mth VT does not participate in the FL process. Also, to avoid additional complexity, we consider that each VT can be assigned to only one EN, which can be an RSU, a UAV, or the HAP, during the FL process. Thus, at most one entry of the selection vector of the mth VT is nonzero.
2) RSUs Network Selection Decision: For the case of the nth RSU, we define $b_{r,n}(\tau_i)$ with dimension $1 \times (N^{U}_{r,n} + N^{H}_{r,n})$, modeling the available nodes for selection. The RSU can select either a UAV or the HAP for communicating the FL model parameter updates. If $b_{r,n}(\tau_i) = \{0\}_{1 \times (N^{U}_{r,n} + N^{H}_{r,n})}$, the nth RSU does not communicate with higher layers and broadcasts the model parameters back towards the VTs. To avoid additional complexity, we consider that each RSU can be assigned to only one EN, which can be a UAV or the HAP, during the FL process. Thus, at most one entry of $b_{r,n}(\tau_i)$ is nonzero.
3) UAVs Network Selection Decision: For the case of lth UAV, we define

D. FL Process Cost Analysis
In general, FL is an iterative learning process where each FL iteration includes several steps that add latency and energy costs. The main steps involved in an FL iteration are the local on-device ML model training, the data communication between VTs and FL servers, the pre- and post-processing of FL model data, and the FedAvg process performed at the FL servers.
In the following, we analyze the latency and energy cost of each of these operations.
1) FL Local Training Model: The FL computation corresponds to the local training of the ML model on the on-device dataset. In local device training, the mth VT with service request $\nu$ has to compute the local parameter set $w^{it,\nu}_{v,m}$ over its dataset of $K_{v_m}$ data samples. If we assume that, for every iteration, the total number of FLOPs required for each data sample $d$ is $\psi_d$, the time and energy consumed during the FL training process by the mth device are [37]:

$$T^{c}_{v,m} = \frac{\sum_{d=1}^{K_{v_m}} \psi_d}{c_{v,m}\, f_{v,m}}, \qquad E^{c}_{v,m} = P^{c}_{v,m}\, T^{c}_{v,m}$$

where $P^{c}_{v,m}$ is the power consumed by the mth VT for data processing. We suppose for simplicity that the on-device FL processing time and energy are the same for every iteration.
2) FL Data Pre-/Post-Processing: For each FL iteration, pre- and post-processing operations are performed for detecting and adding the header information $\chi_\nu$ associated with the service ($\nu$) requested by the vehicular nodes. The latency and energy of these operations depend on the time required to detect and remove the header information from the FL data at the ith node, which is a function of its computation resources, of the number of FLOPs $\psi_{pre}$ required to process the FL data (i.e., model parameters embedded in the IP packets) of a single node, and of the number $N_i$ of VTs/servers sending updates towards server $i$.
3) FL FedAvg Process: In the proposed FL infrastructure, the intermediate FL servers (i.e., RSUs, UAVs, HAP) perform the FedAvg process on the data received from any of the lower layers. The latency and the energy required to perform the FedAvg process depend on $\psi_{FA}$, the number of FLOPs required to process an individual node's parameter vector at the ith server.
4) FL Data Communication Model: The data rate between the ith and the jth node is a function of their mutual distance, hence:

$$r_{i,j} = B_i \log_2\left(1 + \frac{P^{tx}_i\, h(d_{i,j})}{N_0}\right) \tag{5}$$

where $P^{tx}_i$ is the transmission power of the generic ith device, $h(d_{i,j})$ is the channel gain at a distance $d_{i,j}$ between the ith device and the jth device, and $N_0 = N_T B_i$ is the noise power, where $N_T$ and $B_i$ are the noise power spectral density and the bandwidth associated with the ith device during communication.
During the FL processing, at each iteration $it$, the ith FL device sends the parameter set $w^{it}_i$ to the higher layers. Supposing that $|w^{it}_i|$ represents the data size of the parameter set expressed in bits [10], the uplink transmission time and energy for the FL parameters in the $it$th iteration are:

$$T^{tx,it}_{i,j} = \frac{|w^{it}_i|}{r^{it}_{i,j}}, \qquad E^{tx,it}_{i,j} = P^{tx}_i\, T^{tx,it}_{i,j}$$

where $r^{it}_{i,j}$ is the uplink transmission rate between the ith and the jth FL node during the $it$th iteration, which is a function of the bandwidth ($B^{j}_{i,m}$) and the distance ($d_{i,j}$) between the two nodes, modeled through the Shannon capacity formula in (5), and $P^{tx}_i$ is the ith device transmission power. Since FL servers are accessed by multiple VTs/lower-layer nodes, we assume for simplicity that the jth node bandwidth is equally shared among the connected VTs and lower-layer nodes, i.e., if $u_j = u_l \in \mathcal{U}$, the bandwidth resources of $u_l$, $B^{l \to (v,r)}_{u,l}$, are shared among all VTs and RSUs connected to it. Similarly, the reception time required by the ith node to receive data from the jth node follows from the size of the received parameter set and the corresponding transmission rate. Each FL server needs to wait for the data from all the connected VTs and lower-layer nodes before performing the FedAvg process, and the data reception latency and energy at the jth FL server are defined accordingly. With these basic latency and energy elements in hand, we can now define the FL iteration cost in terms of total latency and energy requirements.
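The basic cost elements of this subsection can be put together in a small numerical sketch. It is only illustrative: the function names and the numeric values are hypothetical, and the server waiting time is taken as the slowest uplink, which is an assumption drawn from the statement that each FL server waits for all connected nodes.

```python
import math


def compute_time(flops, flops_per_cycle, cpu_hz):
    """Processing latency: required FLOPs over the node's FLOPs per second."""
    return flops / (flops_per_cycle * cpu_hz)


def shared_bandwidth(total_bandwidth_hz, n_connected):
    """Equal sharing of a server's bandwidth among its connected nodes."""
    return total_bandwidth_hz / n_connected


def shannon_rate(bandwidth_hz, tx_power_w, channel_gain, noise_psd_w_per_hz):
    """Shannon capacity with noise power N0 = N_T * B."""
    noise_power = noise_psd_w_per_hz * bandwidth_hz
    return bandwidth_hz * math.log2(1.0 + tx_power_w * channel_gain / noise_power)


def tx_time_energy(model_bits, rate_bps, tx_power_w):
    """Uplink time and energy for a parameter set of model_bits bits."""
    t = model_bits / rate_bps
    return t, tx_power_w * t


def server_wait_time(uplink_times):
    """ASSUMPTION: the server starts FedAvg once the slowest update arrives."""
    return max(uplink_times)


# Hypothetical numbers: a 20 MHz UAV link shared by 4 nodes, 1 Mbit model.
bw = shared_bandwidth(20e6, 4)
rate = shannon_rate(bw, tx_power_w=0.2, channel_gain=1e-9, noise_psd_w_per_hz=4e-21)
t_up, e_up = tx_time_energy(1e6, rate, tx_power_w=0.2)
t_train = compute_time(flops=5e8, flops_per_cycle=4.0, cpu_hz=1e9)
```

The sketch makes the tradeoff discussed earlier tangible: connecting more nodes to the same server lowers the per-link bandwidth, which lowers the rate and raises both the uplink time and the uplink energy.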

E. FL Iteration Cost
The FL process cost of the mth VT (in terms of latency and energy consumed) for a single iteration includes the local computation cost, the header processing cost at the VT, and an additional cost that depends on the network selection strategy. For the nth RSU, the FL process cost for a single iteration is a function of the time/energy required to receive model updates from the VTs, the header processing cost, the FedAvg process cost, and the additional cost based upon the network selection strategy it adopts. Similarly, for the lth UAV, the single-iteration FL process cost is based upon data reception, header processing, the FedAvg process, and the additional cost due to the network selection strategy. Finally, a single-iteration FL process cost is defined for the HAP node as well. Fig. 4 presents the different latency components considered when modeling the FL latency over the different nodes. The energy elements are omitted from the figure for simplicity.
In the end, for each FL iteration, the required latency and energy cost for the mth VT is given by³:

F. Number of FL Iterations Performed
Each FL iteration adds cost in terms of the latency and energy consumed over the different platforms. However, it is important to perform a sufficient number of FL iterations to generate an FL model with sufficient accuracy on real-world data. The number of FL iterations performed by VTs depends on the adopted network selection strategy and on the sojourn time within each EN coverage area. It is supposed that each VT can participate in the FL process as long as it remains within the coverage area of the considered EN. Thus, the number of FL iterations performed by the mth VT, $\rho(d(v_m, r_n, u_l, \tau_i))$, is upper bounded by the ratio between the sojourn time with respect to the jth EN, $T^{soj}_{v_m,r_n}(\tau_i)$, and the FL iteration time. Here, the jth node corresponds to any RSU, UAV or HAP, based upon the network selection strategy adopted by the mth VT. It should be noted that the jth node corresponds to the FL server node that transmits the global model parameters back towards the VT.
In general, the FL process can be stopped when it meets some predefined stopping criterion, such as the number of FL iterations performed, a predefined loss function value, etc. [5], [38]. Therefore, without loss of generality, we introduce $\epsilon_\nu$ as a convergence parameter in terms of the FL global model loss function value, for the FL model corresponding to the service $\nu$. In the past, it has been shown that, in certain environments, it is possible to bound the number of FL iterations required to achieve a predefined loss function value [5], [39], [40]. However, the maximum number of FL iterations required can depend upon several parameters, such as the local environment scenario, the number of VTs participating in the training process, the quality of VT data, etc. Here, we assume that the number of FL iterations required to achieve the FL performance is a function of the number of VTs participating in the training process of the FL model of the $\nu$th service, through a square-root dependence that models the reduced impact when a higher number of VTs participate in the FL training process. Here, $C$ is a constant representing the maximum number of iterations required for a single VT to achieve FL model convergence. If VTs have participated in a reduced number of FL iterations when using the FL model for their applications, the performance can be sub-optimal. In such cases, VTs might need to pay an additional penalty in terms of performance degradation or reduced quality of service.
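The two iteration counts discussed in this subsection can be contrasted with a small sketch. The sojourn-time bound follows the text directly. The required-iterations function assumes the specific form $C/\sqrt{N}$, which is one reading of the square-root dependence and of the role of $C$ as the single-VT value; the exact expression used by the authors is not recoverable from the text, so this is labeled as an assumption. All names are hypothetical.

```python
import math


def max_iterations_by_sojourn(sojourn_s, iteration_s):
    """Upper bound on iterations: sojourn time over single-iteration time."""
    return math.floor(sojourn_s / iteration_s)


def required_iterations(c_single_vt, n_participating_vts):
    """ASSUMPTION: required iterations scale as C / sqrt(N).

    With N = 1 this returns C, matching the description of C as the
    maximum number of iterations needed by a single VT.
    """
    return math.ceil(c_single_vt / math.sqrt(n_participating_vts))


def iteration_shortfall(performed, required):
    """Iterations still missing when the VT leaves the coverage area."""
    return max(0, required - performed)


performed = max_iterations_by_sojourn(sojourn_s=10.0, iteration_s=3.0)
required = required_iterations(c_single_vt=100, n_participating_vts=25)
missing = iteration_shortfall(performed, required)
```

A positive shortfall corresponds to the situation described above, in which the VT uses a model trained for fewer iterations than needed and pays a performance penalty.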
Here, we introduce a stochastic penalty function $P^{FL}_\nu(\rho(d(v_m, r_n, u_l, \tau_i)))$ for measuring the impact of the number of FL iterations performed on the FL model performance. This analysis is motivated by the work in [5] on joint computation offloading and FL process optimization over VNs. Suppose the $m$th VT requesting the service $\nu$ uses the FL process to estimate the parameter $x_\nu$, with $x_{min,\nu} \le x_\nu \le x_{max,\nu}$, and the estimated value is $\hat{x}_\nu(\rho(d(v_m, r_n, u_l, \tau_i)))$. The FL penalty in (8) is then computed from $\hat{x}_\nu$, which is generated by a stochastic function following a truncated normal distribution with probability density function $f_{\hat{x}_\nu}(\cdot)$. This density is expressed through $\xi(\cdot)$, the probability density function of the related standard normal distribution, and the corresponding cumulative distribution function. The mean value of the distribution, $\bar{\mu}$, and its variance, $\bar{\sigma}^2$, are model assumptions, with $\bar{\gamma}$ a numerical constant used for controlling the variance of the model. The interested reader can refer to [5], where the same authors considered the above model for estimating the FL iterations versus performance trade-off for the computation offloading application over VNs.
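For reference, the textbook density of a normal distribution with mean $\bar{\mu}$ and variance $\bar{\sigma}^2$ truncated to an interval is written out here. Taking the truncation interval to be $[x_{min,\nu}, x_{max,\nu}]$ and denoting the standard normal cumulative distribution function by $\Phi(\cdot)$ are assumptions of this sketch; the symbols used in the original expressions may differ.

```latex
\begin{align}
f_{\hat{x}_\nu}(x) &=
\frac{\dfrac{1}{\bar{\sigma}}\,\xi\!\left(\dfrac{x-\bar{\mu}}{\bar{\sigma}}\right)}
     {\Phi\!\left(\dfrac{x_{max,\nu}-\bar{\mu}}{\bar{\sigma}}\right)
      -\Phi\!\left(\dfrac{x_{min,\nu}-\bar{\mu}}{\bar{\sigma}}\right)},
\qquad x_{min,\nu}\le x\le x_{max,\nu}, \\
\xi(z) &= \frac{1}{\sqrt{2\pi}}\,e^{-z^{2}/2},
\qquad
\Phi(z) = \frac{1}{2}\left[1+\operatorname{erf}\!\left(\frac{z}{\sqrt{2}}\right)\right].
\end{align}
```

Outside the interval the density is zero, and the denominator renormalizes the probability mass that remains inside it.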
In the end the total FL latency and energy cost is:

G. Problem Formulation
In this work, we aim to perform a communication-efficient FL process over a joint air-ground network. By adopting a proper network selection strategy over different platforms ($A = \{d(v_m, r_n, u_l, \tau_i)\}, \forall m, n, l$), the aim is to maximize the FL process performance. The problem in (9) therefore minimizes the joint cost of latency, energy, and the penalty function value measuring the FL process performance, subject to the constraints (10a)-(10d). Here, $A = \{d(v_m, r_n, u_l, \tau_i)\}$ is the combined set of network selection decisions of all nodes involved in the FL process, $\eta_1$ and $\eta_2$ are weighting coefficients balancing latency and energy consumption, and $w_1$ is a weighting coefficient for the penalty function. According to (10a), each VT, RSU, and UAV can communicate with only one EN from the upper layers. Eq. (10b) limits the number of iterations performed by each VT depending on the available sojourn time, considering the limited HAP coverage. Eq. (10c) sets the upper limit on the bandwidth resources of any $j$th EN, among any RSU, UAV, or HAP: the total bandwidth available to the VTs and other nodes connected to the $j$th EN is upper bounded by the bandwidth of the $j$th node, i.e., $B_j$. According to (10d), the weighting coefficients $\eta_1$ and $\eta_2$ can take any value between zero and one, with their sum equal to one, while $w_1$ can take any positive value.
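The weighted objective and the conditions on its coefficients can be illustrated with a minimal sketch. The function name `joint_fl_cost` is hypothetical, and the sketch assumes that latency, energy, and penalty have already been normalized to comparable scales, which the text does not state.

```python
def joint_fl_cost(latency: float, energy: float, penalty: float,
                  eta1: float, eta2: float, w1: float) -> float:
    """Weighted joint cost eta1 * latency + eta2 * energy + w1 * penalty.

    Enforces the coefficient conditions described for (10d):
    eta1, eta2 in [0, 1] with eta1 + eta2 = 1, and w1 > 0.
    """
    if not (0.0 <= eta1 <= 1.0 and 0.0 <= eta2 <= 1.0):
        raise ValueError("eta1 and eta2 must lie in [0, 1]")
    if abs(eta1 + eta2 - 1.0) > 1e-9:
        raise ValueError("eta1 + eta2 must equal 1")
    if w1 <= 0.0:
        raise ValueError("w1 must be positive")
    return eta1 * latency + eta2 * energy + w1 * penalty


# Hypothetical, already-normalized values for a single network selection decision.
cost = joint_fl_cost(latency=2.0, energy=4.0, penalty=1.0, eta1=0.5, eta2=0.5, w1=2.0)
```

Raising $w_1$ makes decisions that leave VTs with too few FL iterations more expensive, while the ratio between $\eta_1$ and $\eta_2$ shifts the balance between latency and energy.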

IV. PROPOSED SOLUTIONS
For solving (9), we aim at finding a proper EN selection strategy for creating a highly reliable FL model with reduced latency and energy costs. With multiple edge computing layers and a large number of VTs along the road, the considered problem can be hard to solve. Here, we propose an MDP-based RL approach for finding proper assignment strategies for different nodes. The basic elements of the MDP model include the state space ($S$), the action space ($A$), the reward function ($R$), the discount factor ($\gamma$), and the environment dynamics, or state transition probability, model ($P$). Thus, the MDP process can be defined as the tuple $\{S, A, R, P, \gamma\}$. In order to analyze the performance of the proposed MDP model, we present two multi-dimensional MDP approaches based on the VTs' local environment, as well as several benchmark methods for comparison purposes.
A. MDP Based Approach
1) Local Environment Based Multi-Dimensional MDP Model: In the considered network architecture, each VT can be covered by one or more RSUs, UAVs, and one HAP. Thus, different VTs/ENs can have a different number of nodes available for communicating the FL updates. In order to properly select the EN for performing the FL, and to set up the MDP parameters, we assume that each VT is able to acquire the local environment parameters through V2X communication links, allowing more personalized MDP models with better accuracy. In particular, we classify the VTs into different groups, based on their local environments, where each group can have a separate state space and action space.
For the $m$th VT, the main parameters include the number of RSU ($N^R_{v,m}$), UAV ($N^U_{v,m}$), and HAP ($N^H_{v,m}$) nodes available for the FL process. In addition, we define three vectors, $V^R_{v_m}$, $V^U_{v_m}$, and $V^H_{v_m}$, with entries $V^R_{v_m,r_n} \le V^R_{max}$, $V^U_{v_m,u_l} \le V^U_{max}$, and $V^H_{v_m,h} \le V^H_{max}$, corresponding to the number of nodes (i.e., VTs, RSUs, and UAVs) already connected to each RSU, UAV, and HAP node, respectively, covering the $m$th VT. Here, $V^R_{max}$, $V^U_{max}$, and $V^H_{max}$ stand for the maximum number of devices that can be served by each RSU, UAV, and HAP node. Thus, a tuple $\kappa$ can represent the $m$th VT's local environment. The number of possible $\kappa$ values, i.e., $\bar{K}$, can depend upon $V^R_{max}$, $V^U_{max}$, and $V^H_{max}$. Through V2X communication links, VTs can determine the number of nodes around them. However, since all VTs participate in the FL process simultaneously, their assignments are not known in advance. Therefore, some assumptions are required. Here, we consider the following two approaches for generating the $V^R_{v_m}$, $V^U_{v_m}$, and $V^H_{v_m}$ vectors, which can be used to improve the accuracy of the MDP models.
a) Minimum distance based assignment approach: In the minimum distance-based approach, each node is assigned to the upper layer node at the minimum possible distance. For example, the $m$th VT is assigned to the nearest RSU, the $n$th RSU is assigned to the nearest UAV, and the $l$th UAV is assigned to the nearest HAP. Thus, in general, the assignments follow (11a)-(11c).
b) Random assignment approach: In this approach, each node is assigned to any of the higher layer nodes according to a probabilistic rule. We have considered the uniform assignment approach, where the probability of assigning the $i$th node to the $j$th upper layer node is given by $1/U^{max}_i$, as in (12), where $U^{max}_i$ indicates the total number of upper layer nodes covering the $i$th node, which can be a VT, RSU, or UAV.
With these two approaches in hand, different sets of V R v m , V U v m , and V H v m can be generated, helping to select proper ENs. The two different MDP approaches resulting from these methods are denoted as MDP with minimum distance based assignment approach (MDP-MD), and MDP with random assignment approach (MDP-RA). Later, the performance of these two schemes is compared in the simulation results section.
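The two assignment rules can be sketched in a few lines. The node identifiers, the 2D coordinates, and the function names are hypothetical; the sketch only illustrates the selection rules (nearest covering node versus a uniform draw among the covering nodes) and does not build the full $V^R_{v_m}$, $V^U_{v_m}$, and $V^H_{v_m}$ vectors.

```python
import math
import random


def nearest_node(position, candidates):
    """Minimum-distance rule: return the id of the closest covering node."""
    return min(candidates, key=lambda node_id: math.dist(position, candidates[node_id]))


def random_node(candidates, rng):
    """Uniform rule: each of the U covering nodes is chosen with probability 1/U."""
    return rng.choice(sorted(candidates))


# Hypothetical layout: three RSUs covering one VT (coordinates in meters).
rsus = {"r1": (0.0, 50.0), "r2": (120.0, 40.0), "r3": (300.0, 60.0)}
vt_position = (100.0, 0.0)

md_choice = nearest_node(vt_position, rsus)        # nearest RSU
ra_choice = random_node(rsus, random.Random(0))    # seeded uniform draw
```

The same two functions can be reused one layer up, for example with an RSU position and the UAVs covering it as candidates.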
2) State Space (S): In general, the MDP state space is constituted by all possible states in which MDP agents can find themselves during the exploration of the environment. Finding an appropriate network of ENs, i.e., vehicle to HAP ($v_m \to h$), vehicle to $n$th RSU to HAP ($v_m \to r_n \to h$), vehicle to $l$th UAV to HAP ($v_m \to u_l \to h$), or vehicle to $n$th RSU to $l$th UAV to HAP ($v_m \to r_n \to u_l \to h$), can potentially reduce the FL iteration latency and energy cost and allow VTs to participate in a large number of FL iterations, resulting in a better FL model. In this work, $S$ is constituted by multiple binary variables corresponding to all $n \in N^R_{v,m}$, $l \in N^U_{v,m}$, and $h$. In particular, we define $S^{r_n}_R$ as a binary variable related to $r_n \in N^R_{v,m}$, which takes value 1 if the $m$th VT is assigned to the $n$th RSU and is able to perform a sufficient number of FL iterations, where $0 < \zeta^R_\rho \le 1$ is a parameter indicating the FL accuracy level, which can be based on the service type requested by the users. For example, in the case of a critical safety-related service, the FL model accuracy should be high to avoid possible fatal car crashes due to the failure of the FL models. In this case, $\zeta^R_\rho$ should be close to 1 or even equal to 1. On the other hand, if the service is not high priority or safety related, a moderate FL accuracy can be sufficient to serve the user, and $\zeta^R_\rho$ can be smaller. Similarly, $S^{u_l}_U$ is a binary variable related to $u_l \in N^U_{v,m}$, which takes value 1 if the $m$th VT is assigned to the $l$th UAV and is able to perform a sufficient number of FL iterations. Here, $0 < \zeta^U_\rho \le 1$ is the parameter indicating the FL accuracy level as a function of the service type requested by the users. Finally, $S^h_H$ is a binary variable related to $h$, which takes value 1 if the $m$th VT is assigned to the HAP $h$ and is able to perform a sufficient number of FL iterations. Here, $0 < \zeta^H_\rho \le 1$ is the parameter indicating the FL accuracy level based upon the service type requested by the users.
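Finally, if the $m$th VT's local environment is modeled through the tuple $\kappa$, the complete state vector collects these binary variables, i.e., $s_\kappa(\tau_i) = \{S^{r_n}_R(\tau_i), S^{u_l}_U(\tau_i), S^h_H(\tau_i)\}$, $\forall n \in N^R_{v,m}$, $l \in N^U_{v,m}$.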

3) Action Space (A):
If the $m$th VT's local environment is modeled through the tuple $\kappa$, the action space ($A_\kappa = \{a_\kappa(\tau_i)\}$) includes all possible actions $a_\kappa$ that can be taken by the MDP agent corresponding to $\kappa$. In the considered FL network selection problem, agents can select ENs belonging to the different networking layers. Therefore, the generic action space is defined as,

4) Reward Function (R):
MDP agents can receive a positive or negative reward based upon the current state and the action taken. Here, we consider the weighted sum of the latency, energy, and penalty costs required to complete a single FL iteration as the reward received by the agent based upon its state and action. Thus,
$$R_{v,\kappa}(s, a) = \eta_1 T^{FL}(s_\kappa, a_\kappa) + \eta_2 E^{FL}(s_\kappa, a_\kappa) + w_1 P^{FL}_\nu(\rho(s_\kappa, a_\kappa)).$$
5) MDP Environment Dynamics (P):
MDP environment dynamics model the behavior of the MDP environment in terms of state transition probabilities based upon the agent's current state and the action performed. The probability of the MDP agent finding itself in state $s'$ when it performs the action $a$ from state $s$ is given as $P(s'|s, a)$. Modeling such state transition probabilities over dynamic vehicular environments can be challenging. We propose a time-dependent state transition probability equation based upon the MDP agent's local environment. In general, for scenario $\kappa$, the state transition probability at $\tau_i$ is given by
$$P(s_\kappa(\tau+\delta) \mid s_\kappa(\tau), a_\kappa(\tau)) = P\big(S^{r_n}_R(\tau+\delta), S^{u_l}_U(\tau+\delta), S^h_H(\tau+\delta) \mid S^{r_n}_R(\tau), S^{u_l}_U(\tau), S^h_H(\tau), a_\kappa(\tau)\big),$$
which represents the probability that the MDP agent reaches state $s_\kappa(\tau+\delta)$ from the current state $s_\kappa(\tau)$ by taking action $a_\kappa(\tau)$. Here, $\delta$ is the MDP time step. Since a VT can connect to only one node in a given time interval, the events $S^{r_n}_R$, $S^{u_l}_U$, and $S^h_H$ can be considered as independent events, which results in
$$P(s_\kappa(\tau+\delta) \mid s_\kappa(\tau), a_\kappa(\tau)) = P\big(S^{r_n}_R(\tau+\delta) \mid s_\kappa(\tau), a_\kappa(\tau)\big) \cdot P\big(S^{u_l}_U(\tau+\delta) \mid s_\kappa(\tau), a_\kappa(\tau)\big) \cdot P\big(S^h_H(\tau+\delta) \mid s_\kappa(\tau), a_\kappa(\tau)\big),$$
with $s_\kappa(\tau) = \{S^{r_n}_R(\tau), S^{u_l}_U(\tau), S^h_H(\tau)\}$. In particular, with various communication links, e.g., V2V, V2R, V2I, vehicular-based MDP agents can acquire useful information about the surrounding environment (i.e., tuple $\kappa$), which can be used to model the state transition probabilities. The transition probability expressions are modeled as exponential functions based upon various local environment parameters.
The state transition probability expressions for $S^{r_n}_R$, for the $\kappa$th MDP agent in current state $s_\kappa(\tau)$ and performing action $a_\kappa(\tau)$, correspond, respectively, to the probability that the MDP agent will be in a state with $S^{r_n}_R(\tau+\delta) = 1$ and with $S^{r_n}_R(\tau+\delta) = 0$ after taking action $a_\kappa(\tau)$ from the current state $s_\kappa(\tau)$. Here, $\lambda_n(\tau_i)$, which depends on $V^R_{v_m,r_n}$, $d_{v_m,r_n}(\tau_i)$, and $T^{soj}_{v_m,r_n}(\tau_i)$, models the impact of the VT's local environment on the state transition probability values. According to $\lambda_n(\tau_i)$, if the $m$th VT, through action $a_\kappa(\tau)$, selects RSU $n$ with high $V^R_{v_m,r_n}$ and $d_{v_m,r_n}(\tau_i)$, the VT might not be able to perform the required number of FL iterations. Conversely, if the VT selects an RSU with a high sojourn time value, it can perform a sufficient number of FL iterations, resulting in a higher probability that $S^{r_n}_R(\tau+\delta) = 1$ occurs. $\delta^R_1$, $\delta^R_2$, and $\delta^R_3$ are the weighting coefficients used to associate proper weights with each parameter. Next, for $S^{u_l}_U$, the probabilities $P(S^{u_l}_U(\tau+\delta) = 1 \mid s_\kappa(\tau), a_\kappa(\tau))$ and $P(S^{u_l}_U(\tau+\delta) = 0 \mid s_\kappa(\tau), a_\kappa(\tau))$ are modeled from the corresponding UAV-related environment parameters, while $\delta^U_1, \cdots, \delta^U_5$ correspond to the weighting coefficients for properly balancing the impact of the various environment parameters. Finally, for $S^h_H$, the parameters modeling the impact of the surrounding environment on the state transitions include the sojourn time $T^{soj}_{v_m,h}(\tau_i)$. In the end, by using (14), (18) can be used to find the state transition probability in any interval $\tau$.
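Under the independence assumption stated above, the joint transition probability is the product of three per-layer marginals. The sketch below takes the three probabilities of reaching value 1 as given inputs; how they are obtained from the local environment parameters is not reproduced here, and the function name and the example values are hypothetical.

```python
from itertools import product


def joint_transition_probs(p_rsu: float, p_uav: float, p_hap: float) -> dict:
    """P(s') for every binary next state (S_R, S_U, S_H), assuming independence.

    Each argument is the probability that the corresponding binary variable
    equals 1 at the next time step, given the current state and action.
    """
    marginals = (p_rsu, p_uav, p_hap)
    probs = {}
    for state in product((0, 1), repeat=3):
        prob = 1.0
        for bit, p_one in zip(state, marginals):
            prob *= p_one if bit == 1 else 1.0 - p_one
        probs[state] = prob
    return probs


# Hypothetical marginals for one state-action pair.
probs = joint_transition_probs(p_rsu=0.8, p_uav=0.5, p_hap=0.1)
```

Because each factor is a valid Bernoulli distribution, the eight joint probabilities sum to one without any further normalization.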

B. MDP-Based FL Network Selection Strategy
For the MDP model corresponding to the $\kappa$th agent, the solution set can be defined as a policy function $\pi_\kappa = \{\pi_\kappa(s_\kappa(\tau_i+\delta)), \forall \delta\}$ that maps every state $s_\kappa \in S$ to an action $a_\kappa \in A$. Selecting different actions can result in different policy functions, where the aim is to find an optimal policy that corresponds to the minimum cost in terms of delay, energy, and FL process penalty value. For every policy $\pi_\kappa$, a value function $V^{\pi_\kappa}(s_\kappa(\tau_i))$, corresponding to a state $s_\kappa(\tau_i)$, can be defined for analyzing its performance. In general, $V^{\pi_\kappa}(s_\kappa(\tau_i))$ corresponds to the expected value of the discounted sum of the total reward received by following the policy $\pi_\kappa$ from state $s_\kappa(\tau_i)$, and can be defined as:
$$V^{\pi_\kappa}(s_\kappa(\tau_i)) = E\left(\sum_{\delta=0}^{\Delta} \gamma^{\delta} R\big(s_\kappa(\tau_i+\delta), \pi_\kappa(s_\kappa(\tau_i+\delta))\big)\right)$$
where $\gamma \in [0, 1]$ is the discount factor, $R(s_\kappa(\tau_i+\delta), \pi_\kappa(s_\kappa(\tau_i+\delta)))$ is the immediate reward received for following the policy $\pi_\kappa$ at time $\tau_i+\delta$ from the state $s_\kappa(\tau_i+\delta)$, $\Delta$ is the maximum number of steps considered during the MDP evaluation, i.e., the episode length, and $E(\cdot)$ corresponds to the expected value. Thus, the value function evaluates a particular policy function by assigning a numeric value to each state and can be utilized to compare the performance of different policies. The optimization problem in (19) can then be formulated in order to find the best possible policy function associated with state $s_\kappa(\tau_i)$, where $\Pi_\kappa$ corresponds to the set of policy functions that can be explored. As shown by many works (e.g., [41], [42]), the problem defined in (19) can be reduced to the Bellman optimality equation in (20). Different approaches can be used to solve the problem in (20); however, the value iteration approach is widely known for its fast convergence and easy implementation. Therefore, below we present a value iteration approach aimed at solving the MDP designed in the previous section, for finding an optimal policy that corresponds to the minimization of the FL process time and energy over VNs.
The value iteration method allows finding an optimal policy and value function for the MDP models. Algorithm 1 describes the steps involved in the value iteration process. For every agent $\kappa$, the process begins by initializing the value of each state to $\infty$ and the iteration count ($it$) to 0 (Line 2). For each state-action pair, the state value is determined by using (21) (Line 5). The state value and the corresponding optimal policy ($\pi^*_\kappa(s_\kappa(\tau_i))$) associated with state $s_\kappa$ are determined by using (22) and (23) (Lines 7-8). The iterative process continues until the change in all state values becomes less than the predefined convergence parameter $\epsilon$ (Lines 10-13). In the end, the algorithm returns the set of optimal policy functions $\pi^*_\kappa$ associated with all possible scenarios in which VTs can find themselves on the road (Line 15).
The time complexity of the traditional value iteration process can be expressed as $O(\Delta |S| \cdot |A|)$, with $\Delta$ being the maximum number of time steps considered, $|S|$ the state space dimension, and $|A|$ the action space dimension. With the involvement of $\bar{K}$ scenarios, the time complexity becomes $O(\bar{K} \Delta |S| \cdot |A|)$. The scenario-based modeling can reduce the state and action space dimensions significantly. Additionally, time-dependent state transition probabilities can reduce the overall uncertainty in the MDP process.
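The value iteration procedure described in this subsection can be illustrated with a minimal, self-contained sketch for a cost-minimizing MDP. It is not the authors' implementation: the two-state toy model, its costs, and its transition probabilities are hypothetical, the transition probabilities are kept time-invariant for brevity, and state values are initialized to zero rather than to $\infty$ so that the first backup already yields finite values. The defaults for `gamma`, `eps`, and `max_iters` follow the simulation settings reported in the performance evaluation.

```python
def value_iteration(states, actions, cost, trans, gamma=0.9, eps=0.01, max_iters=200):
    """Cost-minimizing value iteration.

    cost[s][a]  -> immediate cost of taking action a in state s
    trans[s][a] -> dict mapping each next state s2 to P(s2 | s, a)
    """
    values = {s: 0.0 for s in states}          # assumption: zero initialization
    policy = {s: actions[0] for s in states}
    for _ in range(max_iters):
        new_values = {}
        for s in states:
            # One-step lookahead cost of every action from state s.
            q = {
                a: cost[s][a]
                + gamma * sum(p * values[s2] for s2, p in trans[s][a].items())
                for a in actions
            }
            best = min(q, key=q.get)           # greedy (minimum-cost) action
            policy[s] = best
            new_values[s] = q[best]
        delta = max(abs(new_values[s] - values[s]) for s in states)
        values = new_values
        if delta < eps:                        # convergence check
            break
    return values, policy


# Hypothetical toy model: "ok" = enough FL iterations possible, "fail" = not enough.
states = ["ok", "fail"]
actions = ["rsu", "hap"]
cost = {"ok": {"rsu": 1.0, "hap": 2.0}, "fail": {"rsu": 3.0, "hap": 2.5}}
trans = {
    "ok": {"rsu": {"ok": 0.9, "fail": 0.1}, "hap": {"ok": 0.6, "fail": 0.4}},
    "fail": {"rsu": {"ok": 0.5, "fail": 0.5}, "hap": {"ok": 0.7, "fail": 0.3}},
}
values, policy = value_iteration(states, actions, cost, trans)
```

Running one such instance per scenario $\kappa$ mirrors the outer loop of the scenario-based modeling; each instance only sees the reduced state and action spaces of its own scenario.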

C. Benchmark Methods
For analyzing the performance of the proposed MDP model, we have considered the following benchmark methods.

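Algorithm 1 MDP Value Iteration
Input: $\epsilon$, $\gamma$, $S_\kappa$, $A_\kappa$, Pr, $\bar{K}$, $\Delta$
Output: $\pi^*_\kappa$
1: for $\kappa \in \bar{K}$ do
2:   Initialize $it = 0$, $V_0(s_\kappa(\tau_i)) = \infty$, $\forall s_\kappa(\tau_i)$
3:   for $s_\kappa(\tau_i) \in S_\kappa$ do
4:     for $a_\kappa \in A_\kappa$ do
5:       Determine the state value for the pair $(s_\kappa, a_\kappa)$ by using (21)
6:     end for
7:     Determine the state value $v_{it+1}(s_\kappa(\tau_i))$ by using (22)
8:     Determine the optimal policy $\pi^*_\kappa(s_\kappa(\tau_i))$ by using (23)
9:   end for
10:  if any $|v_{it+1}(s_\kappa(\tau_i)) - v_{it}(s_\kappa(\tau_i))| > \epsilon$ then
11:    $it = it + 1$
12:    Repeat from Line 3
13:  else stop the iteration for agent $\kappa$
14:  end if
15: end for
16: return $\pi^*_\kappa$

1) Conventional Centralized FL Process (C-FL): In the case of a conventional centralized FL process, each VT transmits its model updates to the centralized HAP server. Thus, $v_m \to h$, $\forall v_m \in V$. This approach can reduce the overall processing costs in terms of the intermediate layer processing and averaging operations performed over RSUs and UAVs. However, the possibly long-distance communication links between VTs and the HAP can limit the performance through link failures, high energy costs, a limited number of users participating in the FL process, etc.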
2) Minimum Distance Based FL Process (MD-FL): In this case, the FL process assumes that each node communicates with the nearest node from the upper layer. Thus, each participating VT selects the nearest RSU node for transmitting its update, which then processes and transmits the aggregated update vectors to the nearest UAV for further processing. In the end, the HAP collects data from all the participating UAV terminals for generating the global model, which it then broadcasts back to the VTs. Eqs. (11a)-(11c) can be used to determine the minimum distance assignment vectors for different nodes.
3) Random Assignment Based FL Process (RA-FL): In this approach, nodes involved in the FL process (i.e., VTs, RSUs, UAVs, and HAP) follow the random assignment strategy in (12). Thus, each VT selects one EN from the set of RSUs, UAVs, and HAP covering it. Similarly, RSUs can either be connected to a UAV/HAP or communicate the results back to the VTs. UAVs follow the same strategy: they can either send their data to the HAP or return it to the VTs for the next round of the FL process.
4) FedCPF Inspired RSU-Based Benchmark Solution for the Considered Scenario (FedR-FL): In [35], the authors proposed the FedCPF approach, based upon a customized local training strategy, partial client participation, and flexible aggregation strategies. Here we consider a FedCPF-inspired, RSU-based benchmark approach where VTs perform the local training process and transmit the model parameters to the nearest RSU node. The client selection strategy of the FedCPF approach is adopted, where the participation of each VT in the FL process is based upon a probability $P_{sel}$ (i.e., $P_{sel}$ is the probability of the client being a part of the FL training, while $(1 - P_{sel})$ is the probability that the client will opt out from the FL training). The other two strategies, i.e., the customized local training strategy and the deadline-based server aggregation strategy, are based upon the RSU sojourn time constraint. In particular, each VT's participation in the FL process is limited by its mobility and the RSU coverage range.
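The probabilistic client participation used in this benchmark can be sketched as an independent Bernoulli draw per VT. This is only an illustration of the selection rule described above, not the FedCPF implementation of [35]; the function name and the VT identifiers are hypothetical.

```python
import random


def select_participants(vt_ids, p_sel: float, rng: random.Random) -> list:
    """Keep each VT independently with probability p_sel; the rest opt out."""
    if not 0.0 <= p_sel <= 1.0:
        raise ValueError("p_sel must lie in [0, 1]")
    return [vt for vt in vt_ids if rng.random() < p_sel]


# Ten hypothetical VTs, participation probability 0.7 as in the simulation setup.
rng = random.Random(42)
participants = select_participants(range(10), p_sel=0.7, rng=rng)
```

On average a fraction $P_{sel}$ of the VTs takes part in each round, so the expected number of participants scales linearly with the VT density.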

V. PERFORMANCE EVALUATION
The value iteration algorithm for solving the MDP model and the benchmark methods described previously are simulated in a Python-based simulator, using ML-related libraries such as NumPy, Pandas, and Matplotlib. The main simulation parameters for the considered network architecture are shown in Table I. The service area is under the coverage of one HAP, 20 UAVs, and 40 RSUs. A variable number of VTs between 200 and 700 is considered, assuming that each one requests service $\nu$ with a probability equal to 0.2. Each VT travels with a variable speed, as modeled in (1), with $\mu = 10$ m/s and $\sigma = 1$. The maximum number of FL iterations required to achieve the proper performance, as defined in (7), is computed with $C = 1000$. Also, $P_{sel} = 0.7$ is used for the FedCPF-inspired RSU-based benchmark solution. Each VT has an FL dataset of size $|D_{v_m}| = 5{,}000$ samples. During the FL training process, $\psi_d = 1500$ and $T^{FA}_{i,\nu} = 1$ ms. The maximum number of nodes covering any VT is given by  set to 0.7. During the value iteration process, $\gamma = 0.9$, $\epsilon = 0.01$, $\lambda = 0.1$, and an episode length of 200 are used.

A. Numerical Results
In the following, we present the main performance results including the FL cost, latency, energy, FL penalty, and the average number of FL iterations performed by VTs for different methods.
1) FL Process Cost: The main objective of this work is to jointly reduce the overall latency, energy, and FL penalty. Fig. 5 shows the performance in terms of FL overall cost for the MDP schemes and the benchmark methods previously presented. It can be observed that both MDP methods outperform the benchmark approaches as the number of VTs increases.
With a reduced number of VTs, the fully distributed MD-FL approach incurs a higher cost, mainly due to the several processing operations performed at different layers. On the other hand, the fully centralized C-FL method has reduced costs because only a limited number of VTs request resources from the centralized HAP node. However, when the number of VTs is higher, the overall cost of the C-FL approach grows fast, mainly due to the longer communication distances and the limited resources of the HAP node. As a result, the C-FL cost becomes higher than that of the other benchmark methods as the density of VTs increases. Similar effects can be seen later in the latency and energy plots shown in Figs. 6 and 7. The other two benchmark approaches (RA-FL and the FedCPF-inspired method) have a slightly better performance, mainly due to the reduced communication distances and reduced processing operations compared to the fully distributed MD-FL and the fully centralized C-FL approaches. However, because of the imperfect/static edge node selection, which does not consider the local environment parameters and the available resources, the performance of the benchmark approaches remains limited compared to the proposed MDP solutions. On the other hand, the proposed MDP solutions, with network selection based upon the VTs' local environments and the available resources of ENs, are able to keep the FL process cost contained. In particular, the MDP-RA method outperforms all other approaches. In the case of MDP-MD, the VTs' local environment is modeled through the assignment of the FL devices to the nearest nodes, with less flexibility, i.e., VTs can be assigned to the RSUs, RSUs can be assigned to the UAVs, and UAVs are assigned to the HAP node. On the other hand, the MDP-RA approach is more flexible, since each node can select any higher layer entity, i.e., VTs can be assigned to the RSUs, UAVs, or HAP. Therefore, the MDP-RA method outperforms MDP-MD in terms of overall cost, as well as FL latency and energy, as shown later in Figs. 6 and 7.
With the upcoming latency-constrained vehicular applications and services demanding ML models with high accuracy, it is important to perform the distributed learning process, such as FL, in a limited time and with reduced energy consumption. Thus, the proposed distributed learning framework with efficient network selection strategies allows a large number of VTs to participate in the training process with reduced cost and a huge advantage over the traditional methods.
2) FL Latency Performance: For each FL iteration, the FL process is impacted by communication, training, and processing latency. In Fig. 6, we present the average FL iteration latency for the various MDP and benchmark methods. In particular, the MDP-MD and MDP-RA methods, with proper node selection, can perform the FL process with reduced latency compared with the other methods. As in the previous case, with fewer nodes requesting resources, the C-FL method performs better than the MD-FL and RA-FL methods. However, as the number of VTs grows, the C-FL latency performance degrades drastically. On the other hand, with a distributed FL process, the MD-FL approach induces higher latency when the number of VTs is low. However, with a higher number of VTs its performance is better than that of the C-FL method, mainly due to the distribution of the FL process over multiple edge nodes. The FedCPF-inspired approach selects the RSU nodes for the averaging operation, limiting the latency costs in the beginning. However, with the limited sojourn time, VTs are unable to perform a sufficient number of iterations, resulting in the higher FL penalty shown later in Fig. 8. With a higher number of VTs, the performance of the FedCPF-inspired approach degrades, mainly due to the limited RSU resources and the imperfect node selection.
The MDP approaches, especially the MDP-RA method, jointly reduce both communication and processing latency by distributing the FedAvg process over a sufficient number of edge nodes. Therefore, the proposed methods can efficiently train the FL model over the distributed multi-layered VN environments.
3) FL Energy Performance: It is important to reduce the FL process energy cost given the involvement of different T/NT networking platforms (i.e., VTs, UAVs) with scarce energy resources. The FL process consumes energy for the local training operations, data communication, and FL data processing over servers. In Fig. 7, we present the performance in terms of energy spent by the different methods. Similar to the latency performance, the C-FL energy performance is better for a reduced number of VTs, due to the involvement of a limited number of VTs and the reduced FedAvg process cost. However, as the number of VTs grows, the VTs' energy requirements become high, mainly due to long-distance communication over limited bandwidth resources. Compared with the C-FL method, MD-FL has an advantage in terms of reduced communication distances/costs. However, with the repetition of the FedAvg process over each layer, the energy cost increases. On the other hand, the proposed MDP methods can reduce both the communication and the computation energy requirements simultaneously by properly distributing the FL process over the multi-layered VN. By utilizing the local environment knowledge, the MDP methods can select proper ENs with sufficient resources and, as a result, are able to perform the FL process with reduced energy requirements.
It should also be noted that the energy performance of the C-FL method degrades more quickly than its latency performance. This is mainly because every VT involved in the FL process of the C-FL approach requires communication with the centralized HAP node. As a result, the energy cost induced by the individual VTs can be higher than in the other benchmark methods. This trend can also be seen in the overall cost performance in Fig. 5. The results in Figs. 6 and 7 also highlight the well-known straggler effect in the FL framework with the traditional benchmark methods.
4) FL Penalty Performance: With the adopted network selection strategy, if VTs fail to perform a sufficient number of FL iterations, the FL model performance may not be adequate. We have modeled the impact of the number of FL iterations performed by VTs through the stochastic penalty function presented in (8). In Fig. 8, we show the average FL penalty value for different numbers of VTs for the proposed methods. The benchmarks, with an inadequate FL process, fail to perform the required number of FL iterations, resulting in higher FL penalties. The FedCPF-inspired approach selects the nearby RSU node to limit the FL communication cost, which in turn limits the number of FL iterations performed by VTs, inducing a heavy FL penalty. The other two benchmark methods, MD-FL and RA-FL, also suffer from a large penalty because of the reduced number of FL iterations performed, mainly due to the high latency per FL iteration under constrained sojourn times. The C-FL method benefits from the larger coverage range of the HAP node and, for a reduced number of VTs, from reduced FL latency, so it is able to perform a large number of FL iterations with a reduced penalty; however, its performance decreases as the number of VTs increases.
For a reduced number of VTs, the penalty value for the MDP-RA process is high, mainly because of the low number of VTs participating in the FL process and its decisions to select the nearby edge nodes for reducing the overall FL cost. However, with a growing number of VTs, its performance improves thanks to proper network selection strategies and an adequate number of VTs participating in the FL process. On the other hand, the MDP-MD method, which suffers slightly in terms of latency and energy costs in the beginning, can perform a high number of FL iterations, reducing the FL penalty. Notice that these behaviors of the MDP methods can also be impacted by the assumptions made about the competing VTs' decisions and can have different impacts in terms of individual costs. However, both MDP methods are able to reduce the joint costs of latency, energy, and penalties significantly compared to the traditional benchmark methods. Therefore, the proposed FL process with proper network selection can create reliable FL models with better performance.

5) Average Number of FL Iterations:
For the FL models to reach adequate performance, VTs should be able to perform a sufficient number of FL iterations ($\rho(d(v_m, r_n, u_l, \tau_i))$). The number of FL iterations performed by each VT is based upon the network selection strategy and the available sojourn time of the selected ENs, as given in (6). A proper network selection strategy can reduce the FL iteration time. Also, selecting proper ENs with a higher amount of communication/computation resources allows VTs to participate in a larger number of iterations. To shed more light on the results presented in the previous figures, here we present the average number of FL iterations performed by the different methods (Fig. 9). It can be seen that, with a lower number of VTs, C-FL is able to perform a higher number of FL iterations; however, its performance drops as more and more VTs participate in the process, mainly due to the longer FL iteration time. It should also be noticed that, although in the beginning the C-FL approach can outperform one of the MDP solutions (MDP-RA), its joint performance is still not optimized due to the static FL process (Fig. 5). On the other hand, as discussed for Fig. 8, the node selection strategies of the MDP-RA and MDP-MD methods are based upon a joint cost optimization and can be influenced by the competing VTs' decisions. With the FedCPF-inspired RSU-based benchmark solution, VTs can perform only a limited number of iterations, mainly due to the limited coverage range of the RSU nodes. This also highlights the importance of considering the distributed NTN layers of networking platforms for supporting the FL process. With imperfect edge node selection strategies, the other two benchmark solutions (MD-FL and RA-FL) also suffer from limited FL iterations, resulting in imperfect FL models with higher performance penalties (see Fig. 8).

VI. CONCLUSION
In this work, we have presented a communication-efficient, distributed FL platform over a joint T/NT-based VN. The proposed approach can be useful for creating cost-efficient, sustainable, and more reliable FL models for serving VTs' applications. With a proper analysis of the FL process cost, we formulated a constrained optimization problem for finding the optimal FL network selection strategy over multi-layered VNs. We further modeled the FL network selection problem as a sequential decision-making RL problem by adapting the MDP framework. A time-dependent environment dynamics model is created by utilizing the VTs' environment parameters acquired through V2X technology. Finally, the value iteration approach is used to solve the MDP model for finding suitable policies. The numerical results obtained with the Python-based simulation show the major advantages of the proposed FL approach over several other benchmark methods, including the conventional centralized FL process. In the future, we expect to extend this work by analyzing the performance of the proposed methods on realistic vehicular systems for enabling intelligent solutions at the edge.