A Layer Selection Optimizer for Communication-Efficient Decentralized Federated Deep Learning

Federated Learning (FL) systems orchestrate the cooperative training of a shared Machine Learning (ML) model across connected devices. Recently, decentralized FL architectures driven by consensus have been proposed to enable the devices to share and aggregate the ML model parameters via direct sidelink communications. The approach has the advantage of promoting the federation among the agents even in the absence of a server, but may require an intensive use of communication resources compared to vanilla FL methods. This paper proposes a communication-efficient design of consensus-driven FL optimized for training of Deep Neural Networks (DNNs). Devices independently select fragments of the DNN to be shared with neighbors on each training round. Selection is based on a local optimizer that trades model quality improvement with sidelink communication resource savings. The proposed technique is validated on a vehicular cooperative sensing use case characterized by challenging real-world datasets and complex DNNs typically employed in autonomous driving with up to 40 trainable layers. The impact of layer selection is analyzed under different distributed coordination configurations. The results show that it is better to prioritize the DNN layers possessing few parameters, while the selection policy should optimally balance gradient sorting and randomization. Latency, accuracy and communication tradeoffs are analyzed in detail targeting sustainable federation policies.


I. INTRODUCTION
Distributed learning methodologies based on consensus [1], [2], [3] have emerged over the last few years for solving complex processing [4], [5], [6], [7] and decision-making tasks [8], [9] over cooperative networks. In this paradigm, interconnected agents combine local processing procedures with mutual interactions over a mesh network to learn a shared model describing the task to be fulfilled [4], [6]. Centralized learning implementations that involve energy-intensive processing at data centers or servers can be avoided by promoting nodes' self-organization via consensus methods [1], [3]. This approach is expected to bring significant advantages in terms of latency, scalability, and robustness, especially within new-generation wireless networks. 6th Generation (6G) cellular systems are in fact moving towards dedicated infrastructures [10], [11], [12] to support decentralized, device-to-device communications [13], [14] tailored for specific industry verticals, ranging from robotics [15] to autonomous driving [16], [17].
The associate editor coordinating the review of this manuscript and approving it for publication was Ahmed Almradi.
VOLUME 11, 2023. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
[...] devices perform more local optimization steps before each communication round.
In this paper, we propose a new approach for improving the communication efficiency specific for decentralized FL [24], [26]. The method is based on a layer selection optimizer that selects a number of relevant layer parameters to be shared among cooperating devices. Recent works have in fact demonstrated that applying compression in a layer-wise manner provides benefits compared to standard techniques operating directly on the full model [40], [41]. Indeed, different layers have different impact during the training process, and neglecting their importance when applying compression strategies may result in longer convergence times. This is especially important when considering large models possessing a large variability of trainable parameters across layers. It has been shown that layers exhibiting a large number of parameters generally encode the information in a redundant manner and therefore should be highly compressed or shared less frequently. In contrast, layers comprising few parameters typically show strong connections with preceding and succeeding layers. Thus, compression operators should be designed to specifically operate on these layers so as to not impact the final model performance. Layer-wise compression methods have been introduced relying on sparsification [40], [42] and randomized selection [41], [43]. However, the former requires repeating the sparsification operations on all layers, increasing the computational overhead as the number of layers grows, while the latter cannot capture any interrelation among layer parameters as it uses a simple randomized selection. In this paper, we propose to overcome these limits by a new combined approach that selects dynamically the most informative layers to be exchanged among the nodes based on both randomized and gradient-based selection strategies. 
The proposed method neither constrains the communication frequency of the largest layers as in [35] nor preemptively divides the model into segments as in [36] and [37]; rather, it selects only the most informative layers, according to the squared norm of the local gradients. The detailed contributions are summarized below.

B. PAPER CONTRIBUTIONS AND ORGANIZATION
We consider the decentralized FL system in Fig. 1 where a set of networked peer devices (or learners) cooperate to train a deep NN model. Each learner independently selects a subset of the NN layers and transmits the related trainable parameters (i.e., weights and biases) to the neighbors. Layer selection is implemented on each FL training round: the devices run an optimizer that sorts the layers of the NN model according to their expected contributions to the learning performance (e.g., measured by the squared norm of the gradients). The trainable parameters of the selected layers are then encoded for sidelink communication. The goal is to avoid the transmission of model parameters that may contribute minimally to the global model quality.
The proposed strategy could also be extended to integrate quantization and pruning of the selected model parameters to further improve communication efficiency.
The proposed methods are first validated using the MNIST [44] dataset to assess latency, communication, and accuracy trade-offs with different connectivity patterns. Next, we consider the application of the developed FL policies to a cooperative sensing use case in vehicular scenarios. In the considered setup, vehicles rely on a complex NN, characterized by 40 trainable layers, to recognize road users/objects in their surroundings based on Lidar sensor readings. To extend the field of view of their ego-sensors, vehicles implement an FL optimization of the perception model via NN parameter sharing over Vehicle-to-Vehicle (V2V) links. Considering the large number of trainable layers and parameters of the ML model, the optimization of the information exchanged during the FL process is crucial to comply with limited sidelink resources. To summarize, the original contributions are as follows:
• A novel fully-decentralized FL system is proposed to target communication-constrained distributed ML implementations. The proposed architecture leverages average consensus and enables the agents to actively participate in the learning process by direct interactions via sidelinks.
• A parameter selection policy, referred to as Consensus-driven Federated Learning with Layer Selection (CFL-LS), is designed to select the most informative NN model parameters for transmission over the sidelinks. In this respect, we introduce a layer optimizer that selects a suitable population of the available model layers to be shared with neighbor nodes, based on local gradient observations.
• The impact of the CFL-LS policy on the consensus process is analyzed by considering different optimization and layer selection strategies.
• The approach is validated by extensive performance analysis in practical use cases, including connected automated driving.
Experimental results show that the proposed communication-efficient FL policy can reduce the communication resources by up to 80% compared to standard FL setups that exchange all model parameters on every communication round. Effects of quantization, link loss/unavailability in wireless fading channels, and bandwidth constraints are also considered.
The paper is organized as follows. Sec. II describes the model of the proposed decentralized FL system. Sec. III presents the algorithms employed for layer selection, while Sec. IV analyzes the related convergence performance. The validation of the proposed method in image classification and vehicular sensing use cases is described in Sec. V and Sec. VI, respectively. Finally, Sec. VII draws the conclusions.

II. FEDERATED SYSTEM MODEL
The proposed FL setup consists of $N$ interconnected agents that mutually exchange the parameters of their local ML model, optimized from local data samples via supervised learning methods. We assume that node $i$, with $i = 1, \ldots, N$, holds a local dataset $D_i$ of labeled examples $(x_h, y_h)$. Local node datasets $D_i$ are typically unbalanced, i.e., with varying sizes and/or a limited number of contained classes. A Deep Neural Network (DNN) is used to map the training data to the desired predictions. The DNN model is composed of $L$ layers, with outputs $h_\ell$ at layer $\ell$ computed by applying a non-linear (activation) function $f_\ell(\cdot)$ to the weighted sum of the outputs of the previous layer $h_{\ell-1}$ as
$$h_\ell = f_\ell\left(w_\ell^T h_{\ell-1} + b_\ell\right),$$
where $w_\ell$ and $b_\ell$ are the weights and biases of layer $\ell$, while we have $h_0 = x_h$ for $\ell = 0$ and $h_L = y_h$ for $\ell = L$. The weights and the biases of each layer can be conveniently aggregated into the matrix
$$W = [p_1 \cdots p_L]^T,$$
where $p_\ell$, $\ell = 1, \ldots, L$, is a compact vectorized representation of the weights and biases of layer $\ell$, with $P_\ell$ being the overall number of trainable parameters for the $\ell$-th layer.
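As an illustration of the layer model above, the following minimal Python sketch evaluates one fully connected layer and stacks its weights and biases into a single vector. The function names `dense_forward` and `vectorize_layer`, the tanh activation, the row-per-unit weight layout, and the toy numbers are assumptions made for this example and are not taken from the paper.

```python
import math

def dense_forward(h_prev, weights, biases, activation=math.tanh):
    """Output of one layer: activation applied to the weighted sum plus bias."""
    # weights[j] holds the weights feeding output unit j (hypothetical layout).
    return [activation(sum(w * h for w, h in zip(row, h_prev)) + b)
            for row, b in zip(weights, biases)]

def vectorize_layer(weights, biases):
    """Flatten the weights and biases of one layer into a single vector."""
    return [w for row in weights for w in row] + list(biases)

# Toy layer with 2 inputs and 3 output units.
weights = [[0.5, -0.5], [1.0, 1.0], [0.0, 2.0]]
biases = [0.0, -1.0, 0.5]

h1 = dense_forward([1.0, 1.0], weights, biases)
p1 = vectorize_layer(weights, biases)
num_params = len(p1)  # trainable parameters of this layer: 3 * 2 weights + 3 biases
```

The length of the flattened vector plays the role of the per-layer parameter count, which is the quantity that later determines how costly a layer is to transmit.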
The goal of FL is to learn a global model $W_\infty$, shared across all interconnected agents, for mapping the input data $x = [x_1 \cdots x_E]^T$ to the desired output predictions $y = [y_1 \cdots y_E]^T$ as accurately as possible. The global model parameters can be learned through the minimization of any finite-sum objective function $L(W)$ as
$$W_\infty = \arg\min_{W} L(W) = \arg\min_{W} \sum_{i=1}^{N} L_i(W),$$
where
$$L_i(W) = \sum_{(x_h, y_h) \in D_i} L_{i,h}(x_h, y_h; W)$$
is the local loss of node $i$, with $L_{i,h}(x_h, y_h; W)$ being the loss computed over the example $(x_h, y_h)$ when $W$ holds. The decentralized FL approach analyzed in the following relies on an average consensus policy to obtain $W_\infty$ by repeatedly alternating the mutual exchange of local representations of the ML model on consecutive communication rounds $t = 0, 1, \ldots$, with local model optimization steps for minimizing the local loss $L_i$.

A. COMMUNICATION MODEL
The model parameters $W_{t,i}$ are exchanged by the agents so as to satisfy the half-duplex constraints: on each round, the devices multiplex a digital representation of the selected model parameters into a frame slot of $T_F$ seconds and transmit such frame using orthogonal channels of bandwidth $B_W$. Connectivity among the agents is here represented as an undirected graph $G = (V, E)$, where $V$ and $E$ denote the set of nodes and edges, respectively. At round $t$, the wireless link between a pair of devices $(k, i) \in E$ at distance $d_{k,i}$ is assumed to be impaired by a frequency-flat time-varying fading channel with baseband complex-valued response $h_{k,i}(t) \sim \mathcal{CN}(0, 1)$, and instantaneous Signal-to-Noise Ratio (SNR) $\gamma_{k,i}(t)$ that accounts for the log-normal shadowing $S_0$, the path loss index $\upsilon$, and the average SNR $\gamma_0$ at reference distance $d_0$.
A link is assigned as potential edge $(k, i) \in E$, $\forall k \in N_i$, of the graph $G$ if $\gamma_{k,i}(t) > \beta$, where $\beta$ is the receiver-side sensitivity threshold [45]. In what follows, we declare a link as unavailable with probability $P_U = \Pr[\gamma_{k,i}(t) \le \beta]$. The impact of link unavailability on FL under communication constraints is investigated in Sec. VI-C.
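The link availability rule can be illustrated numerically. The sketch below assumes a simplified SNR model in which the instantaneous SNR equals a distance-dependent mean SNR multiplied by the squared magnitude of the fading response, with shadowing ignored. This model, the function names, and all numerical values are assumptions for illustration rather than the paper's exact expression. Since the squared magnitude of a CN(0, 1) variable is exponentially distributed with unit mean, the unavailability probability under this assumption is 1 - exp(-beta / mean SNR), which the Monte Carlo estimate should approach.

```python
import math
import random

def mean_snr(gamma0, distance, d0=1.0, path_loss_index=2.0):
    """Assumed distance-dependent mean SNR (linear scale, shadowing ignored)."""
    return gamma0 * (d0 / distance) ** path_loss_index

def unavailability_closed_form(beta, gamma_bar):
    """P[gamma <= beta] when gamma = gamma_bar * |h|^2 and |h|^2 ~ Exp(1)."""
    return 1.0 - math.exp(-beta / gamma_bar)

def unavailability_monte_carlo(beta, gamma_bar, trials=100_000, seed=0):
    """Fraction of simulated rounds in which the link falls below the threshold."""
    rng = random.Random(seed)
    below = sum(gamma_bar * rng.expovariate(1.0) <= beta for _ in range(trials))
    return below / trials

# Hypothetical link: reference SNR 100 (linear), distance 5 reference units.
gamma_bar = mean_snr(gamma0=100.0, distance=5.0)
p_exact = unavailability_closed_form(beta=1.0, gamma_bar=gamma_bar)
p_sim = unavailability_monte_carlo(beta=1.0, gamma_bar=gamma_bar)
```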
Device $k$ sends the model updates encoded by $b_{k,t}$ bits according to the quantization scheme of [46]. This scheme encodes the model parameters in a stochastic manner by applying a randomized rounding operation that discretizes the parameters into a fixed set of levels (here varying between 256 and 1024, corresponding to $b_{k,t} = 8$ and $b_{k,t} = 10$ bits). Considering a frame/slot of $T_F$ seconds, the number of bits chosen to encode the selected model parameters must satisfy the constraint (7), i.e., it cannot exceed $T_F B_W \log[1 + \gamma_{k,i}(t)]$, where $B_W \log[1 + \gamma_{k,i}(t)]$ is the link-layer spectral efficiency. Notice that the quantization process affects the time span ($T_F$) of each communication round (for an assigned efficiency) and thus the learning wall-clock time.
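The randomized rounding idea and the frame budget can be sketched as follows. The quantizer below is a generic uniform stochastic quantizer written for illustration; it is not claimed to be the exact scheme of [46]. The base-2 logarithm, the function names, and the SNR value used in the usage note are likewise assumptions.

```python
import math
import random

def stochastic_quantize(values, bits, rng):
    """Round each value to one of 2**bits uniform levels, up or down at random."""
    levels = 2 ** bits
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    step = (hi - lo) / (levels - 1)
    quantized = []
    for v in values:
        pos = (v - lo) / step
        base = math.floor(pos)
        # Round up with probability equal to the fractional part, so that the
        # quantized value equals v on average.
        idx = base + (1 if rng.random() < pos - base else 0)
        quantized.append(lo + min(idx, levels - 1) * step)
    return quantized

def fits_in_frame(num_params, bits, frame_s, bandwidth_hz, snr):
    """Check that the encoded payload does not exceed the frame capacity."""
    capacity_bits = frame_s * bandwidth_hz * math.log2(1.0 + snr)
    return num_params * bits <= capacity_bits
```

For example, with a hypothetical linear SNR of 3, `fits_in_frame(16490, 32, 0.045, 10e6, 3.0)` holds while `fits_in_frame(16490, 32, 0.045, 2e6, 3.0)` does not, which shows how a tighter bandwidth forces fewer parameters, or fewer bits per parameter, into each frame.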

B. CONSENSUS-DRIVEN DECENTRALIZED FEDERATED LEARNING
On each FL round, agent $i$ fuses the local ML model $W_{t,i}$ with the ones received from the set of available neighbors $N_i$ as
$$\psi_{t+1,i} = W_{t,i} + \varepsilon \sum_{k \in N_i} \Gamma_k(t)\,(W_{t,k} - W_{t,i}), \tag{8}$$
where $\varepsilon$ controls the stability of the consensus procedure [24]. The diagonal matrix
$$\Gamma_k(t) = \sigma_{k,i} \cdot \mathrm{diag}[a_k(t)], \tag{9}$$
with $a_k(t) = [a_{1,k}(t) \cdots a_{L,k}(t)]^T$, contains the information about the layers that are chosen by the neighbors for transmission over the sidelinks. In particular, $\sigma_{k,i}$ is the mixing weight while $a_k(t)$ is a binary vector encoding which layers have been transmitted by neighbor $k$. Each entry of $a_k(t)$ is defined as
$$a_{\ell,k}(t) = \begin{cases} 1 & \text{if layer } \ell \text{ is transmitted by neighbor } k \text{ at round } t, \\ 0 & \text{otherwise.} \end{cases}$$
Once the average consensus step is completed, the fused model $\psi_{t+1,i}$ is optimized locally using local data and a chosen optimizer. Considering the Adam optimizer, the update (12) yields the new local model $W_{t+1,i}$, where $m_{t+1,i}$ and $v_{t+1,i}$ are the first and second order moments of the gradients $\nabla L_i(\psi_{t+1,i}|M_i) = [\nabla p_{1,i}(t) \cdots \nabla p_{L,i}(t)]^T$ estimated with respect to the mean local loss $L_i(\psi_{t+1,i}|M_i)$, averaged over the mini-batch $M_i$. All parameters expressed in (12), i.e., $\beta_1, \beta_2 \in (0, 1]$, $\mu_t$ and $\delta$, are detailed in [47]. Finally, the updated model parameters $W_{t+1,i}$ are forwarded to the neighbors of node $i$ and a new communication round starts.
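A minimal sketch of the masked consensus step is given below. Each model is represented as a list of layers and each layer as a flat list of parameters; this representation, the function name `consensus_step`, and the use of one scalar mixing weight per neighbor are assumptions for illustration. Layers that a neighbor did not transmit are present only as placeholders and are skipped through the binary mask.

```python
def consensus_step(local, neighbors, masks, sigma, eps=1.0):
    """Fuse a local model with the layers actually received from neighbors.

    local:     list of layers, each a flat list of floats
    neighbors: list of neighbor models with the same structure
    masks:     masks[k][l] is 1 if neighbor k transmitted layer l, else 0
    sigma:     sigma[k] is the mixing weight applied to neighbor k
    """
    fused = [list(layer) for layer in local]
    for model, mask, weight in zip(neighbors, masks, sigma):
        for l, sent in enumerate(mask):
            if not sent:
                continue  # layer not received: the local parameters are kept
            for j, (w_k, w_i) in enumerate(zip(model[l], local[l])):
                fused[l][j] += eps * weight * (w_k - w_i)
    return fused

# Two-layer toy model, two neighbors; the first neighbor sends only layer 0.
local = [[0.0], [0.0]]
neighbors = [[[2.0], [4.0]], [[6.0], [8.0]]]
masks = [[1, 0], [1, 1]]
fused = consensus_step(local, neighbors, masks, sigma=[0.25, 0.25])
```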
The decentralized FL procedure is iterated until local models reach a pre-defined target loss/accuracy or when they converge to the same representation, namely W ∞ . The pseudo-code for the overall FL procedure is reported in Algorithm 1 which considers a more general gradient-based optimization.
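The alternation between consensus and local optimization can be sketched on a toy problem. In the example below each node holds a single scalar parameter and a quadratic local loss 0.5 (w - c_i)^2, the network is all-to-all with uniform mixing weights, all parameters are shared on every round, and a plain gradient step replaces Adam. All of these choices, together with the function name, are simplifying assumptions and do not reproduce Algorithm 1.

```python
def decentralized_training(targets, rounds=300, mu=0.05, eps=1.0):
    """Alternate a consensus step and one local gradient step per round."""
    n = len(targets)
    sigma = 1.0 / n            # uniform mixing weight (assumption)
    models = list(targets)     # each node starts from its own local optimum
    for _ in range(rounds):
        # Consensus step: every node fuses the models of all other nodes.
        fused = [w_i + eps * sum(sigma * (w_k - w_i) for w_k in models)
                 for w_i in models]
        # Local step: the gradient of 0.5 * (w - c_i)^2 is (w - c_i).
        models = [psi - mu * (psi - c) for psi, c in zip(fused, targets)]
    return models

targets = [0.0, 1.0, 2.0, 3.0]
models = decentralized_training(targets)
global_optimum = sum(targets) / len(targets)
```

In this toy setting the local models settle close to the minimizer of the sum of the local losses, with a residual offset proportional to the local step size.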

III. LAYER SELECTION STRATEGIES
In this section, we propose a communication-efficient FL design that allows the agents joining the federation to select $M < L$ layers of the ML model to be shared with neighbors. The goal of the selection process is to provide a more efficient utilization of the communication bandwidth, such that the number of bits $b_{k,t} = b_{k,t}(M)$ chosen on each FL round satisfies (7) without penalizing convergence performance. Model accuracy and communication efficiency trade-offs are analyzed in two case studies described in Sec. V and Sec. VI.
The proposed method for the selection of the M layers of the NN is based on a layer optimizer that takes into account the local model and the data quality observed on each training round. In particular, in what follows we devise a policy for sorting the gradients of the local loss ∇L i (ψ t+1,i |M i ) defined in (12), based on their squared running averages, as they provide an indicator about how informative each individual layer is, considering the local data/examples. Selected layers and corresponding model parameters are then exchanged in the consensus step.
Given a measure of the local gradient $\nabla L_i(\psi_{t+1,i}|M_i)$, estimated by node $i$ using a mini-batch $M_i$ of its local data, we first compute the average over the available mini-batches, $\overline{\nabla L}_i(t) = [\overline{\nabla p}_{1,i}(t) \cdots \overline{\nabla p}_{L,i}(t)]^T$, where $\overline{\nabla p}_{\ell,i}(t)$ collects the vectorized gradients of layer $\ell$, averaged over the available mini-batches. Next, we compute the squared norm of the gradients with respect to the trainable parameters characterizing each NN layer,
$$g_{\ell,i}(t) = \left\| \overline{\nabla p}_{\ell,i}(t) \right\|^2 .$$
Let $S = (\ell_1, \ldots, \ell_L)$ be an ordered set that contains the layer indices sorted in descending order according to the gradient measures contained in $g_{t,i} = [g_{1,i}(t) \cdots g_{L,i}(t)]^T$. We construct a subset $T_m = (\ell_k : 1 \le k \le M) \subseteq S$ of the $M$ chosen layers with the largest gradient metric $g_{\ell,i}(t)$. The elements of $a_i(t)$ are thus defined as
$$a_{\ell,i}(t) = \begin{cases} 1 & \text{if } \ell \in T_m, \\ 0 & \text{otherwise,} \end{cases}$$
subject to $\mathbf{1}^T a_i(t) \le M$. Once $a_i(t)$ has been constructed, it is transmitted along with the selected layers during the consensus step. Nodes receiving the model updates are then able to retrieve $\Gamma_i(t)$ from $a_i(t)$ and fuse the received parameters according to (8).
Note that the layer selection process is performed locally at each device and does not require any information from neighbors. In particular, we keep track of the gradients estimated during the local optimization step and sort them in descending order. Intuitively, layers exhibiting higher gradients convey more information about the local data and should therefore be selected and propagated to neighbors. A comparative analysis with other solutions is given in Sec. V-B.
Note that the gradient sorting methodology for layer selection might cause the parameters of some layers to be never, or rarely, exchanged. As analyzed in the following, this may negatively affect the overall convergence of the FL process. To overcome this limitation, we propose to alternate gradient sorting with a randomized layer selection policy: the goal is to let the nodes receive a fair share of the neighbor NN model layers over consecutive rounds.
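The gradient-based ranking can be sketched as follows: per-layer gradients are averaged over the mini-batches, the squared norm of each averaged layer gradient is used as score, and the M highest-scoring layers are flagged in the binary vector. The data layout and the function names `layer_scores` and `select_top_layers` are assumptions for illustration.

```python
def layer_scores(grads_per_batch):
    """Squared norm of the mini-batch-averaged gradient of each layer.

    grads_per_batch[b][l] is the flat gradient of layer l on mini-batch b.
    """
    num_batches = len(grads_per_batch)
    num_layers = len(grads_per_batch[0])
    scores = []
    for l in range(num_layers):
        size = len(grads_per_batch[0][l])
        avg = [sum(batch[l][j] for batch in grads_per_batch) / num_batches
               for j in range(size)]
        scores.append(sum(x * x for x in avg))
    return scores

def select_top_layers(scores, m):
    """Binary selection vector flagging the m layers with the largest score."""
    order = sorted(range(len(scores)), key=lambda l: scores[l], reverse=True)
    chosen = set(order[:m])
    return [1 if l in chosen else 0 for l in range(len(scores))]

# Three layers observed over two mini-batches (toy numbers).
grads = [[[1.0, 1.0], [0.1], [3.0]],
         [[1.0, -1.0], [0.3], [1.0]]]
scores = layer_scores(grads)
mask = select_top_layers(scores, m=2)
```

Note that a layer whose gradients change sign across mini-batches, such as the second coordinate of the first layer in this toy input, contributes less to the score than its individual gradients would suggest, because the averaging is applied before the norm.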
In particular, we consider $R$ out of $M$ layers as being chosen randomly on each FL round, with
$$R = \sum_{n=1}^{M} 1_{u_n < p_r},$$
where $1_{u_n < p_r}$ is the unit step function, $u_n \sim U(0, 1)$, and $p_r$ is the probability of selecting a layer randomly. We select $M - R$ elements from the ordered set $S$ to construct $T_{m-r} = (\ell_k : 1 \le k \le M - R)$, while the remaining $R$ layers, collected in the set $T_r$, are drawn at random from the layers not yet selected. The layers and the corresponding model parameters selected for sidelink transmission thus belong to the set $T_m = T_{m-r} \cup T_r$. The full layer selection policy, depicted in Fig. 2, integrating the gradient sorting and the randomized approach, is reported in Algorithm 2.
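The combined policy can be sketched in a few lines. The number of random slots is obtained from M independent uniform draws compared against the probability threshold, the remaining slots are filled by gradient ranking, and the random layers are drawn among those not already selected. The last point, the function name `select_layers`, and the toy scores are assumptions for illustration and do not reproduce Algorithm 2.

```python
import random

def select_layers(scores, m, p_r, rng):
    """Pick m layers: (m - r) by descending score and r uniformly at random."""
    num_layers = len(scores)
    order = sorted(range(num_layers), key=lambda l: scores[l], reverse=True)
    # Each of the m slots becomes a random slot with probability p_r.
    r = sum(rng.random() < p_r for _ in range(m))
    by_gradient = order[:m - r]
    remaining = order[m - r:]
    by_chance = rng.sample(remaining, r)   # drawn without replacement
    chosen = set(by_gradient) | set(by_chance)
    return [1 if l in chosen else 0 for l in range(num_layers)]

scores = [0.1, 5.0, 0.3, 2.0, 0.01, 1.0]   # toy gradient scores for 6 layers
rng = random.Random(3)
pure_sorting = select_layers(scores, m=2, p_r=0.0, rng=rng)
mixed = select_layers(scores, m=4, p_r=0.6, rng=rng)
```

With a zero threshold the policy reduces to pure gradient sorting, while a threshold of one makes every slot random; in both cases exactly M layers are flagged.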

IV. IMPACT OF LAYER SELECTION ON CONSENSUS
This section analyzes the impact of the proposed layer selection methods on the FL convergence. The goal is to study the minimal (and necessary) conditions for which the consensus process converges to the average of the models.
First, we assume that the local loss functions $L_i(W)$ are smooth with constant $L > 0$, namely their gradients $\nabla L_i(W)$ are Lipschitz continuous with constant $L$, and $\mu$-strongly convex [20], while the Adam optimizer step-size $\mu_t$ is properly chosen so that the objective functions $L_i(W)$ are decreasing with each Adam iteration (i.e., after some threshold). Considering the average consensus aggregation model of (8), we derive the conditions for convergence under the assumption that each client adopts the layer selection optimizer analyzed in Sec. III. To simplify the reasoning, we also assume that the number of model parameters does not vary across layers, i.e., $P_\ell = P$, and that the mixing weights are the same for each client, $\sigma_k = \sigma$, namely each client has the same number of examples $E$ and uses the same number of neighbors for model aggregation. (Notice that the number of neighbors is typically pre-determined during FL initialization and corresponds to a fixed-size subset of the neighbors within the communication range.) We rewrite (8) as
$$\psi_{t+1,i} = W_{t,i} + \varepsilon \sum_{k \in N_i} \Gamma_k(t)\,(W_{t,k} - W_{t,i}),$$
now with $\Gamma_k(t) = \sigma \cdot \mathrm{diag}[a_k(t)]$ as in (9). Next, considering the Adam optimization and the consensus process, we obtain (19), with the Adam update of $\psi_{t+1,i}$ given in (12). We collect all the $N$ local estimates (from all the $N$ clients) of the $L$ model layers into the stacked matrix $W_t$. The consensus-driven model aggregation of (19) can thus be further rewritten as (20), with the $LN \times P$ matrix obtained from $\mathbf{P} W_t$ containing the $N$ Adam updates of size $L \times P$. $\mathbf{P}$ and $L_{FL}$ in (20) are the Perron and the Laplacian matrices, respectively, with the Laplacian given in (22). It can be observed from (9) and (22) that the layer selections $a_i(t)$ made by individual clients $i = 1, \ldots, N$ modify the weight matrix $\Gamma_k(t)$ and thus the Laplacian $L_{FL}$.
Letting $\bar{a}(t) = [a_1^T(t) \cdots a_N^T(t)]^T$ represent the layer selections at time $t$, made independently by each client based on gradient sorting, the consensus equation at time $t$ is governed by the matrix $\mathbf{P}_{\bar{a}(t)}$, with $\bar{a}(t) \in \mathcal{P}$ and $\mathcal{P}$ collecting the finite set of all the possible layer choices. The consensus process at discrete times $q = t_0, \ldots, t$ can thus be written from (20) as (24), in which the product $\prod_{q=t_0}^{t} \mathbf{P}_{\bar{a}(q)}$ is applied to $W_{t_0}$, a set of local models obtained at time $t_0$ by the Adam optimizer. The vector $\bar{a}(q)$ can be regarded as a switching signal of the discrete-time system (24) since it can assume any value in the finite set $\mathcal{P}$. According to the convergence properties of switched consensus systems [48], [49], the average consensus process converges when the (infinite) product of the stochastic matrices $\{\mathbf{P}_{\bar{a}(q)}\}$, $q = t_0, \ldots, t$, for $t \to +\infty$ has a limit. In the following we exploit the above result to assess the convergence of the layer selection policies.

A. LAYER SELECTION POLICIES AND CONVERGENCE
Recalling that $\bar{a}(t)$ is a switching signal in the (finite) set $\mathcal{P}$, we analyze the convergence of two selected policies from Sec. III.

1) POLICY #1: SELECTION W/O COORDINATION
This strategy lets every device select its layers independently in each round, without any coordination. Under this policy with $p_r \in (0, 1]$, the matrix product (24) is ergodic; however, the sequence $\mathbf{P}_{\bar{a}(t)}, \ldots, \mathbf{P}_{\bar{a}(t_0)}$ is not composed of doubly stochastic matrices. As a result, the consensus process does not converge to the weighted average of the initial local models [50].

2) POLICY #2: SELECTION WITH COORDINATION
In this second scenario we constrain the devices to agree on a sequence of layers to be exchanged for each communication round. The sequence is random when $p_r = 1$. Coordination among devices can be done at the start of the training process via a coordinator device that selects the overall sequence of layers that will be exchanged by all member devices of the federation for the assigned communication rounds. By adopting this strategy, the Laplacian matrix $L_{FL}$ in (22) becomes symmetric and satisfies $L_{FL} \mathbf{1}_{LN} = \mathbf{0}$ and $\mathbf{1}_{LN}^T L_{FL} = \mathbf{0}$, making $\mathbf{P}_{\bar{a}(q)}$ doubly stochastic. This holds for all rounds, i.e., for $q = t_0, \ldots, t$ with $t \to +\infty$. As a result, there exists a finite vector $\alpha$ for which the limit (26) holds [48], [49]. As proved in Appendix A, the consensus process converges to the average values of the initial local models $W_{t_0}$ [50].
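The difference between the two policies can be checked on the mixing matrix of a single layer. In the sketch below, a device assigns a fixed weight to every neighbor that transmitted the layer and keeps the remaining weight for its own parameters. The all-to-all connectivity, the product of step size and mixing weight collapsed into one constant `eps_sigma`, and the function names are assumptions for illustration.

```python
def mixing_matrix(n, senders, eps_sigma):
    """Per-layer mixing matrix when only the devices in `senders` transmit it."""
    p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in senders:
            if k != i:
                p[i][k] = eps_sigma      # weight given to a received layer
        p[i][i] = 1.0 - sum(p[i])         # remaining weight stays local
    return p

def is_doubly_stochastic(p, tol=1e-12):
    """True when every row and every column sums to one."""
    n = len(p)
    rows_ok = all(abs(sum(row) - 1.0) < tol for row in p)
    cols_ok = all(abs(sum(p[i][j] for i in range(n)) - 1.0) < tol
                  for j in range(n))
    return rows_ok and cols_ok

# Coordinated choice: all 4 devices transmit the layer in this round.
coordinated = mixing_matrix(4, senders=range(4), eps_sigma=0.1)
# Uncoordinated choice: only device 0 happened to select the layer.
uncoordinated = mixing_matrix(4, senders=[0], eps_sigma=0.1)
```

In the uncoordinated case the rows still sum to one, so agreement can be reached, but the column of the only transmitting device sums to more than one: its parameters are over-represented and the average of the initial models is not preserved.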
Remark: Setting $p_r = 0$ produces layer selections that are approximately time-invariant over consecutive FL rounds, therefore $\bar{a}(t) = \bar{a}$, as experimentally verified in Sec. V.
In this case, (24) becomes a linear, time-invariant, discrete-time system. Therefore, the consensus process converges to (26) as shown in [4]. This holds for all aforementioned policies.

B. CONVERGENCE RESULTS
To experimentally verify the convergence of the consensus process derived in Sec. IV-A, here we analyze Policy #1 and Policy #2, both with $p_r = 1$ (i.e., for random selection performed on all layers). We employ a NN composed of $L = 6$ layers, each containing a single trainable parameter. $N = 10$ devices participate in the consensus process with an all-to-all connectivity. During the consensus stage, each device exchanges $M = 2$ layers out of the total $L$. Results are reported focusing on the effect of layer sharing policies on consensus. Fig. 3a reports the convergence results for Policy #1 while Fig. 3b refers to Policy #2. Both figures show how the NN parameters evolve during the consensus steps for all devices, while the average of the local models is depicted as a dashed black line. The analysis confirms that Policy #2 converges to the average of the initial models while Policy #1 does not, as each NN parameter $p_k$ with $k = 1, \ldots, 6$ converges to a different value that depends on the sequence of layers selected by the devices during the FL process. Furthermore, the coordinator-based selection strategy (Policy #2) is seen to require a larger number of communication rounds to converge, whereas Policy #1 converges quite quickly. This indicates that constraining the selected layers to be the same among all devices, as done by Policy #2, may result in longer training times. To conclude the analysis, Fig. 3c evaluates convergence for varying numbers of devices and values of $M$, considering a NN with $L = 20$ trainable layers, each containing a single parameter. These last results further confirm the faster convergence of Policy #1, especially when few devices participate in the consensus process and for low values of $M$.
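The qualitative behavior described above can be reproduced with a small simulation that uses the same sizes (10 devices, 6 single-parameter layers, 2 layers shared per round, fully random selection). Only the consensus step is simulated, without local optimization; the step constant `eps_sigma`, the uniform initialization, the number of rounds, and the function name are assumptions for illustration.

```python
import random

def run_consensus(policy, n=10, num_layers=6, m=2, rounds=300,
                  eps_sigma=0.05, seed=1):
    """Simulate consensus with random layer selection and return statistics."""
    rng = random.Random(seed)
    w = [[rng.random() for _ in range(num_layers)] for _ in range(n)]
    initial_mean = [sum(w[i][l] for i in range(n)) / n for l in range(num_layers)]
    for _ in range(rounds):
        if policy == "coordinated":      # Policy #2: one shared random choice
            shared = rng.sample(range(num_layers), m)
            picks = [shared] * n
        else:                            # Policy #1: independent random choices
            picks = [rng.sample(range(num_layers), m) for _ in range(n)]
        new_w = [row[:] for row in w]
        for i in range(n):
            for k in range(n):
                if k == i:
                    continue
                for l in picks[k]:       # layers that device k transmitted
                    new_w[i][l] += eps_sigma * (w[k][l] - w[i][l])
        w = new_w
    final_mean = [sum(w[i][l] for i in range(n)) / n for l in range(num_layers)]
    spread = max(max(w[i][l] for i in range(n)) - min(w[i][l] for i in range(n))
                 for l in range(num_layers))
    drift = max(abs(a - b) for a, b in zip(initial_mean, final_mean))
    return spread, drift

spread_coordinated, drift_coordinated = run_consensus("coordinated")
spread_independent, drift_independent = run_consensus("independent")
```

Here `spread` measures the residual disagreement among devices and `drift` the distance between the final consensus value and the average of the initial models. Under these assumptions both policies reach agreement, but only the coordinated one keeps the drift at numerical precision.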

V. VALIDATION WITH MNIST DATA
In this section, we validate the proposed CFL-LS approach (Sec. III) considering a classification task with the benchmark MNIST dataset [44]. Several baseline methods are used as comparison to show the benefits of the developed layer selection strategies. Sec. V-A details the main simulation parameters employed for assessing the performances of the developed techniques, while Sec. V-B shows a first validation of the proposed method, by comparing different gradient sorting approaches and studying how the training process is affected if (some) layers are never or rarely transmitted. Sec. V-C provides a more in-depth investigation of latency, accuracy, and communication cost trade-offs. More specifically, we analyze the performances of the proposed approach for varying number of transmitted layers, comparing them also against a centralized FL solution and a DML implementation. Then, we evaluate the differences between a decentralized and centralized FL tool for the case where both methods employ the layer selection strategies presented in Sec. III. Finally, Sec. V-D studies the effects of quantization procedures applied to the selected layers and how they affect the final performances.

A. SYSTEM PARAMETERS
The overall FL process is deployed into a virtual platform that allows to configure the devices as distributed local learners and to support device-to-device (D2D) interactions. In particular, in this initial example we consider a ring network of N = 10 agents each connected to a varying and configurable number of contiguous neighbors. We analyze the performance of three connectivity patterns, corresponding to agents connected to the 10%, 50% and 90% of all the possible devices, respectively. These patterns model sparse (10%) to dense (90%) networking scenarios.
For the considered FL setup, each agent is assigned 300 randomly drawn MNIST training examples for 6 classes out of 10, to simulate non-independent and identically distributed (non-iid) information across the devices. The ML model employed by each device and the size of the trainable parameters for each layer are reported in Table 1. In total, the number of parameters of the NN is 16490 and the number of trainable layers is L = 6. Mini-batch Adam optimization is used for updating the local model according to (12) with B = 30 examples, and with parameters µ t = 5 · 10 −4 , β 1 = 0.9, β 2 = 0.999 and δ = 10 −7 . At the end of each communication round, the performances for all agents are computed using the full MNIST validation dataset.
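The non-iid data assignment can be sketched as follows, using synthetic labels as a stand-in for the MNIST training labels so that the example is self-contained. The uniform choice of the 6 classes per agent, the possibility that different agents share examples, and the function name `partition_non_iid` are assumptions for illustration.

```python
import random

def partition_non_iid(labels, num_agents=10, per_agent=300,
                      classes_per_agent=6, seed=0):
    """Give each agent `per_agent` example indices from a random class subset."""
    rng = random.Random(seed)
    all_classes = sorted(set(labels))
    shards = []
    for _ in range(num_agents):
        allowed = set(rng.sample(all_classes, classes_per_agent))
        pool = [idx for idx, y in enumerate(labels) if y in allowed]
        shards.append(rng.sample(pool, per_agent))  # no repeats within an agent
    return shards

# Synthetic labels: 5000 examples evenly spread over 10 classes.
labels = [i % 10 for i in range(5000)]
shards = partition_non_iid(labels)
classes_seen = [sorted({labels[i] for i in shard}) for shard in shards]
```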
The proposed CFL-LS method is first evaluated by varying the communication bandwidth constraint, or equivalently the link-layer spectral efficiency of (7). More specifically, assuming a frame duration T F ≈ 45 ms and b k,t = 32 bits, we analyze the performances with B W ranging from 2 to 10 MHz. This corresponds to the exchange of M = 1 up to M = 4 layers. In these examples we also assume P U = 0, however, the effect of link unavailability P U > 0 is considered in the following cases. The performance of CFL-LS is assessed and compared against several baseline approaches: i) the classical FL solution, where a PS collects the updated models from all the available devices and layers of the NN; ii) a DML implementation that fuses the raw data at a data center, and iii) a conventional decentralized FL policy, referred to as Consensus-driven Federated Learning (CFL), where all the model parameters are shared (i.e., M = 6) by all the interconnected agents over an all-to-all connectivity network. Note that the aforementioned methods employed for comparison are not subject to bandwidth constraints.

B. ASCENDING VS. DESCENDING GRADIENT SORTING COMPARISON
In Fig. 4 we compare two different strategies for gradient sorting and layer selection. The first one sorts the layers of the NN in a descending order (descending gradient sorting, DGS) while the second one uses an ascending order (AGS). Descending ordering prevents the transmission of layers with low gradients g t,i , while ascending ordering favors the transmission of these layers. These strategies are denoted as AGS p r = 0 and DGS p r = 0, as relying only on gradient sorting operations. The performances are also studied for the case where AGS and DGS integrate a randomized layer selection by using the strategy presented in Sec. III with probability threshold p r = 0.2, here referred to as AGS p r = 0.2 and DGS p r = 0.2. All strategies are compared with devices sharing M = 4 layers. As concerns the D2D connection, we evaluate the layer selection policies over the 50% connectivity scenario. Validation loss and accuracy are used as performance metrics, averaged over all participating devices and over all runs. Fig. 4a reports the validation loss for the aforementioned layer selection strategies, while Fig. 4b shows the percentage of times each layer is transmitted during the FL process.
Comparing the results, DGS shows far superior performances in all cases when compared with AGS. With $p_r = 0$, AGS converges extremely slowly, while DGS reaches the minimum of its validation loss curve within 100 communication rounds. Nevertheless, even though DGS converges much more rapidly, it shows clear signs of overfitting after 100 communication rounds. This may be related to the number of times each layer is exchanged by the devices, as reported in Fig. 4b, where the 4th and 6th layers are selected quite rarely. By allowing a fairer layer exchange, i.e., when $p_r = 0.2$, performances can be heavily improved both for AGS and DGS while also avoiding overfitting problems. This indicates that not transmitting layers for long time periods, or excluding some of them entirely from being sent during training, heavily impacts the learning performances.

C. LAYER SELECTION POLICY ASSESSMENT
In this section, we show that a partial random selection of the layers, regardless of gradient sorting, allows training higher-quality models. The validation focuses on three different values of the threshold probability $p_r$, namely $p_r = 0.2$, $p_r = 0.6$ and $p_r = 1.0$, and considers the connectivity patterns defined in Sec. V-A. Fig. 5 reports the validation loss (top) and validation accuracy (bottom) obtained by sharing a number of layers of the NN per round equal to $M = 1$ (Fig. 5a and 5d), $M = 2$ (Fig. 5b and 5e) and $M = 4$ (Fig. 5c and 5f). The comparison also considers the centralized (FL), the consensus-driven (CFL) and the DML schemes. Focusing on [...] For example, this is verified in dense connectivity patterns (90% connectivity), as the validation loss obtained at the start of the training process is lower compared to CFL and FL tools.
Considering now the performances for varying threshold probabilities $p_r$, the results indicate that balancing gradient sorting and randomized selection operations is fundamental and should be performed according to the available communication resources, quantified here by $M$. With $M = 1$, high values of $p_r$ should be preferred since they yield far superior performances compared to selection policies relying entirely on gradient sorting functions. This is confirmed by the validation loss/accuracy gap between $p_r = 0.2$, $p_r = 0.6$ and $p_r = 1.0$ for all connectivity patterns. On the other hand, choosing layer selection policies that prioritize gradient information is beneficial as the number of parameters exchanged among the cooperating agents increases. Looking at the performances achieved when $M = 2$, the policy with $p_r = 0.6$ shows the best validation accuracy/loss, while, for $M = 4$, $p_r = 0.2$ should be preferred. These choices are especially important when considering 10% connectivity patterns, as the difference among $p_r$ values grows, whereas they are not so crucial for 50% and 90% connectivity cases apart from $M = 1$. Indeed, when $M = 4$, choosing $p_r = 1.0$, i.e., the fully randomized selection policy, greatly degrades the overall performances compared to $p_r = 0.2$. Table 2 summarizes the validation loss/accuracy obtained by all methods at the end of the training process, highlighting that the proposed approach reaches nearly the same performances as FL and CFL but with much lower communication overhead.
To visualize how the layer selection policy behaves for varying probability p r , we report in Fig. 6 the percentage of times that each layer has been transmitted during the FL process for M = 1 (Fig. 6a), M = 2 (Fig. 6b) and M = 4 (Fig. 6c). The layer numbering is the same as in Table 1, and the 50% connectivity pattern is considered. For M = 1 and p r = 0.2, the 6th layer, which also has the second lowest number of parameters, is selected for transmission almost 80% of the time. As M increases, setting p r = 0.2 provides a reasonable balance between the need to prioritize the most informative layers/parameters before transmission and the requirement of sharing all the parameters of the model during the FL process. An intermediate case is p r = 0.6, where the 6th layer is transmitted almost half of the time, as opposed to 80% for p r = 0.2. Finally, the policy with p r = 1, being the fairest of the three, shares all the layers equally often, as expected.
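As an illustration of how p r shifts the policy between gradient sorting and randomization, the following minimal sketch fills each of the M transmission slots at random with probability p r and otherwise with the best remaining layer by score. The exact selection rule is defined earlier in the paper and is not reproduced here; the slot-by-slot rule, the normalization of the squared gradient by the layer size, and the function name `select_layers` are all assumptions made for this example.

```python
import numpy as np

def select_layers(grad_sq_norms, num_params, M, p_r, rng):
    """Pick M distinct layer indices (illustrative rule, not the paper's exact one).

    Assumption: the score of a layer is its squared gradient norm divided by
    its number of parameters. Each slot is filled uniformly at random with
    probability p_r, otherwise with the highest-scoring remaining layer.
    """
    scores = np.asarray(grad_sq_norms, dtype=float) / np.asarray(num_params, dtype=float)
    remaining = list(range(len(scores)))
    chosen = []
    for _ in range(M):
        if rng.random() < p_r:
            # Randomized (fairness) branch: any remaining layer is equally likely.
            pick = remaining[rng.integers(len(remaining))]
        else:
            # Gradient-sorting branch: best remaining layer by score.
            pick = max(remaining, key=lambda i: scores[i])
        chosen.append(pick)
        remaining.remove(pick)
    return chosen
```

With p_r = 0 the function returns the top-M layers by score, while p_r = 1 yields a uniformly random subset, which matches the qualitative behavior reported in Fig. 6 for the two extremes.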
To conclude the analysis on the MNIST dataset, we compare the performance of CFL-LS with the vanilla FL policy based on PS orchestration under the same connectivity conditions. For the latter method, we assume that the devices participating in the federation employ the same gradient sorting and randomized operations to select the layers to forward to the PS for aggregation. Therefore, in line with CFL-LS, the FL method is here referred to as Federated Learning with Layer Selection (FL-LS). Layer selection is implemented with p r = {0.2, 0.6, 1.0}. As far as connectivity in FL-LS is concerned, the PS randomly chooses 50% of the devices for updating the global model at each training round, in line with the 50% connectivity scenario analyzed previously. Fig. 7 shows the validation loss (top) and validation accuracy (bottom) when M = 1 (Fig. 7a and Fig. 7d), M = 2 (Fig. 7b and Fig. 7e) and M = 4 (Fig. 7c and Fig. 7f). The numerical results show that the proposed method outperforms the vanilla FL scheme by a large margin in all cases. One major difference between FL-LS and CFL-LS lies in the optimal choice of p r as the number of transmitted layers M increases. Indeed, FL-LS performs best with p r = 1 regardless of how many layers can be shared across the network. On the other hand, CFL-LS requires a careful selection of p r that also takes into account the current value of M , as discussed previously.

D. IMPACT OF QUANTIZATION ON LAYER SELECTION PERFORMANCE
Compression schemes can be applied to the model updates exchanged among neighbors to further save communication resources. To study their impact on learning performance, we apply here the compression strategy adopted in [33] to the CFL-LS method. The analysis evaluates the effect of quantization with b k,t = {8, 10} bits on the validation loss/accuracy for p r = 0.2 in the 50% connectivity case. The performance is also compared with the CFL-LS method without quantization, i.e., with b k,t = 32 bits (a typical number employed to encode ML model parameters in most deep learning frameworks). Fig. 8 shows the validation loss (top) and validation accuracy (bottom) of the considered CFL-LS technique for M = {1, 2, 4}. The analysis shows that a sufficient number of bits should be devoted to encoding the transmitted layers to avoid performance degradation. This holds for all values of M , even though the performance loss is marginal, especially for M = 4. Interestingly, a slight improvement in the final performance is observed when b k,t = 10 bits, indicating that relying on uncompressed transmission schemes may yield models that generalize less well for the considered learning task [51].
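To give a feel for what encoding a layer with b k,t bits entails, the sketch below applies a plain uniform quantizer to a weight array and returns the dequantized values. This is only an illustration: the compression strategy of [33] is not reproduced in this section and may differ, and the function name `quantize_uniform` is hypothetical.

```python
import numpy as np

def quantize_uniform(w, bits):
    """Uniformly quantize w onto 2**bits levels spanning [min(w), max(w)].

    Assumption: a simple min-max uniform quantizer, used here only to
    illustrate the bits-vs-error tradeoff; not necessarily the scheme of [33].
    Returns the dequantized (reconstructed) values.
    """
    w = np.asarray(w, dtype=np.float64)
    lo, hi = w.min(), w.max()
    if hi == lo:
        # Constant array: nothing to quantize.
        return w.copy()
    levels = 2 ** bits - 1
    step = (hi - lo) / levels
    codes = np.round((w - lo) / step)  # integer codes in [0, levels]
    return lo + codes * step
```

Under this scheme the reconstruction error is bounded by half a quantization step, and sending a layer with 8 bits per parameter instead of 32 reduces its payload by a factor of 4.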

VI. COOPERATIVE SENSING IN VEHICULAR NETWORKS: A CASE STUDY
This section is dedicated to the assessment of the proposed methods on a more challenging vehicular sensing use case. We consider the cooperative sensing scenario in Fig. 9, where a number of connected vehicles use onboard Lidar sensors to detect road users and/or relevant objects in the surroundings for automated driving services. The vehicles employ a DNN model that processes the Lidar point clouds and outputs the category of the detected road entity. The FL model is continuously trained in a cooperative manner by exchanging model updates through V2V sidelink communications.
In the considered scenario, the vehicles aggregate 10 Lidar sweeps to densify each point cloud and process it through a local bounding box subsystem that provides object segmentation and position information, as depicted in the bottom part of Fig. 9. According to the bounding box position, extent and rotation, the points that fall within the boxes are extracted and fed to the classifier (depicted by the classification subsystem in Fig. 9). The FL process acts only on the point cloud classification, while the bounding box regression is implemented locally at each vehicle. As far as the decentralized FL training is concerned, vehicles collect point cloud data as they move in the environment and, upon gathering enough samples, communicate a single time with a Road Side Unit (RSU) that is tasked with labeling the examples provided by the vehicles. Vehicles belonging to different areas may communicate with different RSUs and provide rather different training categories that reflect location-dependent properties of road users/objects. For example, vehicles moving on highways will gather little data regarding pedestrians compared to vehicles moving in urban environments. Once all vehicles have acquired their training datasets, we propose to carry out a decentralized FL scheme to let the vehicles train on the overall collected data. Notice that decentralizing the FL process takes some load off the RSUs, thus saving communication resources. Furthermore, it avoids communication among the RSUs, which may come at an increased cost compared to V2V interactions and may also be sporadic, intermittent or unavailable depending on the current load of the RSUs.
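The extraction of the points falling within a bounding box can be sketched as follows. The example assumes that the box rotation reduces to a yaw angle about the vertical axis and that the extent is given as full side lengths; both are simplifying assumptions, and the function name `crop_points_in_box` is hypothetical rather than part of the actual pipeline.

```python
import numpy as np

def crop_points_in_box(points, center, extent, yaw):
    """Return the points (N, 3) that lie inside an oriented 3D box.

    Assumptions: `extent` holds the full side lengths along the box axes and
    the box orientation is a single yaw rotation about the z axis.
    """
    points = np.asarray(points, dtype=float)
    c, s = np.cos(yaw), np.sin(yaw)
    # Inverse yaw rotation: maps world coordinates into the box frame.
    rot_inv = np.array([[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]])
    local = (points - np.asarray(center, dtype=float)) @ rot_inv.T
    # A point is inside if it is within half the extent along every box axis.
    inside = np.all(np.abs(local) <= np.asarray(extent, dtype=float) / 2.0, axis=1)
    return points[inside]
```

The cropped points would then be resampled and normalized before being fed to the classifier.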
In what follows, Sec. VI-A presents the adopted ML model and the related datasets. Sec. VI-B presents the performances of the proposed method, again compared to several baselines: the vanilla FL tool, the conventional fully decentralized FL technique, and the DML approach. All baselines are implemented without introducing compression strategies. Sec. VI-C studies the CFL-LS algorithm considering link unavailability events. Finally, Sec. VI-D thoroughly analyzes the communication efficiency vs model quality tradeoff for different layer selection strategies.

A. MODEL AND FEDERATED DATASETS
The DNN model employed here is PointNet [52], widely used for 3D shape recognition and classification from point cloud data. The architecture is adapted from [28], with the parameters of each layer defined in Table 3. The modified architecture relies on L = 20 federated layers with 40855 potential trainable parameters that can be selected for transmission on each FL round. Note that batch normalization layers are updated opportunistically based on the local data available at each vehicle.
The proposed CFL-LS tool is assessed using the publicly available nuScenes Lidar dataset [53] for autonomous driving. These real-world data have been collected by a fully-equipped vehicle in challenging situations, such as diverse weather/lighting conditions and traffic densities. The training dataset is generated as in [28] and contains 9000 training example pairs (x h , y h ), where x h is the point cloud and y h is the corresponding road category chosen among 6 classes. Similarly, the validation set contains 2400 Lidar sets evenly distributed across the 6 classes and is used for performance assessment (accuracy/loss). To comply with the PointNet ML model, which accepts a fixed number of points, x h is upsampled/downsampled to contain exactly 2048 points. Furthermore, we normalize x h such that it is contained in a unit area sphere. The overall training process is simulated in a virtual environment, where N = 10 vehicles are deployed. The vehicles' dynamics have not been taken into account, as the main goal of the paper is to study the effect of the proposed layer-wise compression operators on the FL performance. Nevertheless, interested readers may refer to [28] for insights on how mobility affects the learning performance. At the start of the training, the overall data is partitioned across the vehicles participating in the distributed learning process. In particular, each vehicle holds 200 examples evenly partitioned into 5 of the 6 road categories to simulate non-iid local datasets.
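The resampling and normalization steps described above can be sketched as follows. The example assumes random sampling (with repetition when upsampling) and a normalization that centers the cloud at its centroid and scales it to fit within a sphere of unit radius; the actual preprocessing of [28] may differ, and both function names are hypothetical.

```python
import numpy as np

def resample_points(x, n_points, rng):
    """Randomly down-sample, or up-sample with repetition, to exactly n_points."""
    x = np.asarray(x, dtype=float)
    # Sampling with replacement is only needed when the cloud is too small.
    replace = x.shape[0] < n_points
    idx = rng.choice(x.shape[0], size=n_points, replace=replace)
    return x[idx]

def normalize_unit_sphere(x):
    """Center the cloud at its centroid and scale it to fit in a unit-radius sphere.

    Assumption: "unit sphere" is read here as a sphere of radius 1.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean(axis=0)
    radius = np.linalg.norm(x, axis=1).max()
    return x / radius if radius > 0 else x
```

For example, `normalize_unit_sphere(resample_points(cloud, 2048, rng))` turns a cloud of any size into a fixed-size input of shape (2048, 3) whose points all have norm at most 1.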
In the following, we evaluate the CFL-LS approach by varying the constraint on the spectral efficiency defined in (7) with P U = 0, allowing the sharing of M = 2 up to M = 12 layers. Given a frame slot duration T F = 45 ms and b k,t = 32 bits, the corresponding bandwidth B W ranges from 1.45 up to 8.7 MHz. The analysis also assesses the CFL-LS method for different threshold probabilities p r and connectivity patterns (from sparse to dense), defined in Sec. V-A. Unless stated otherwise, results are averaged over 5 independent runs and over all vehicles.
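The quoted bandwidth range can be cross-checked with a back-of-the-envelope calculation. Constraint (7) is not reproduced in this section, so the sketch below rests on two assumptions: every transmitted layer has the average size of 40855/20 parameters, and the spectral efficiency is 2 bit/s/Hz. These values were chosen because they reproduce the stated range and are not taken from the paper.

```python
def required_bandwidth_mhz(M, total_params=40855, num_layers=20, bits=32,
                           frame_s=0.045, spectral_eff=2.0):
    """Bandwidth (MHz) needed to send M average-size layers within one frame slot.

    Assumptions: average layer size and a spectral efficiency of 2 bit/s/Hz;
    both are illustrative and not taken from constraint (7).
    """
    avg_layer_params = total_params / num_layers
    bits_per_round = M * avg_layer_params * bits
    rate_bps = bits_per_round / frame_s      # bit rate needed within T_F
    return rate_bps / spectral_eff / 1e6     # Hz -> MHz

for M in (2, 6, 12):
    print(M, round(required_bandwidth_mhz(M), 2))
```

Under these assumptions, M = 2 and M = 12 give roughly 1.45 MHz and 8.7 MHz, respectively, in line with the range stated above; the bandwidth scales linearly with M.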

B. MODEL QUALITY ASSESSMENT
As done in Sec. IV, we evaluate the performance obtained by CFL-LS as compared to the CFL, FL and DML training schemes. Fig. 10 shows the validation loss (top) and validation accuracy (bottom) for M = 2 (Fig. 10a and Fig. 10d), M = 6 (Fig. 10b and Fig. 10e) and M = 12 (Fig. 10c and Fig. 10f). The numerical results confirm the findings of the analysis in Fig. 5. Carefully selecting the probability threshold p r as a function of M is beneficial for balancing the communication resources and the model quality. For example, choosing p r = 0.6 with M = 2 over p r = 0.2 with M = 6 substantially reduces the amount of transmitted data without introducing accuracy penalties. Likewise, opting for p r = 1 with M = 6 rather than p r = 0.6 with M = 12 heavily reduces the communication footprint without significant accuracy drops. CFL-LS approaches the performance of the conventional CFL and FL tools but with a much lower communication burden, as shown for p r = 1 and M = {6, 12}. On the other hand, the convergence rates of CFL and FL are now slightly superior to those of the CFL-LS schemes.
We recall that the probability threshold p r governs the fairness in sharing the layers among the devices. The architecture considered in this case study (see Table 3 for the parameters of each layer) has many layers with very few parameters, i.e., from 32 up to 144 for 35% of the layers, indicating that p r = 0.2 may be a superior choice for larger values of M . Furthermore, integrating batch normalization layers in the architecture might impact the learning performance, especially for non-iid data distributions across the vehicles participating in the federated process, as pointed out in [54]. The obtained results are summarized in Table 4, showing that the proposed selection strategies reach nearly the same performance as the other baselines.
As done in Sec. V-C, Fig. 11 reports the selection probabilities of the PointNet layers during the CFL-LS process, considering the probability thresholds p r = 0.2, p r = 0.6 and p r = 1.0, for M = 2 (Fig. 11a), M = 6 (Fig. 11b) and M = 12 (Fig. 11c). Results are again presented for the 50% connectivity scenario. For p r = 0.2, the policy favors the layers labeled from 6 to 8 in Table 3, positioned roughly in the middle of the PointNet structure, namely between the first and the second transformation mini-networks [52]. As M increases, the selected model parameters still belong to layers close to layers 6, 7 and 8. Initial and final layers are also chosen in some cases. This differs from the previous results for MNIST processing, where the policy prioritized the parameters close to the DNN output. Interestingly, in both cases the proposed selection strategy tends to prioritize the layers with fewer parameters, suggesting also that layers containing many trainable parameters can be shared less frequently.

C. IMPACT OF COMMUNICATION IMPAIRMENTS
Poor or intermittent communications may heavily impact the final quality of the trained models. In the following, we thus study the robustness of the CFL-LS method against link outage events. We set up a communication framework that simulates connection drops among vehicles according to a pre-defined probability, namely P U in (6), and on a per-layer basis, meaning that when a link is unavailable the layer(s) cannot be transmitted as no connection exists. Link unavailability events are assumed to be independent and identically distributed (iid) across all vehicles participating in the FL process and over all transmitted model layers.
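The per-layer outage model can be sketched as an iid Bernoulli mask over neighbors and transmitted layers. This is a minimal illustration of the stated assumptions rather than the simulation framework actually used, and the function name `simulate_layer_delivery` is hypothetical.

```python
import numpy as np

def simulate_layer_delivery(num_neighbors, num_layers_sent, p_u, rng):
    """Boolean mask of shape (neighbors, layers): True where a layer is delivered.

    Each (neighbor, layer) pair is dropped independently with probability p_u,
    mirroring the iid per-layer link unavailability described above.
    """
    return rng.random((num_neighbors, num_layers_sent)) >= p_u

rng = np.random.default_rng(0)
for p_u in (0.25, 0.50, 0.75):
    mask = simulate_layer_delivery(num_neighbors=5, num_layers_sent=2, p_u=p_u, rng=rng)
    print(p_u, int(mask.sum()), "of", mask.size, "layer transmissions delivered")
```

A receiving vehicle would then aggregate only the layers whose mask entry is True and keep its local copy of the others.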
In Fig. 12 the CFL-LS performance is assessed for M = 2 and 50% connectivity. We evaluate 3 highly challenging scenarios representing extremely poor link availability in the network, ranging from P U = 0.25 up to P U = 0.75, to show how CFL-LS responds to such detrimental effects. The figure reports the validation loss (Fig. 12a) and validation accuracy (Fig. 12b) considering the three probabilities P U = {0.25, 0.50, 0.75} separately. Increasing P U results in worse performance, as expected. Thus, the selection of p r becomes extremely important when a large number of links in the network are unavailable: p r = 1.0 should always be chosen to counter such events, as it provides the highest performance among all choices. Furthermore, the accuracy drop between P U = 0.25 and P U = 0.5 when p r = 1.0 is relatively small compared with p r = 0.2 and p r = 0.6, indicating that fully randomized layer selection schemes may alleviate the effects of communication impairments on learning performance. Unfortunately, when available links are scarce, i.e., P U = 0.75, even p r = 1.0 suffers large performance drops, suggesting that special countermeasures should be put in place or that the FL process should be allowed to run for more learning rounds.

D. COMMUNICATION EFFICIENCY VS MODEL QUALITY TRADE-OFF
To conclude the case study analysis, we now quantify the communication resources, namely the communication footprint, needed by the CFL-LS policy to reach a target validation loss value. The goal is to evaluate the tradeoff between communication-efficiency and model quality. The communication footprint corresponds to the amount of data exchanged over the network during an assigned number of FL rounds. Footprint results are shown considering two different wall clock times that comprise up to 100 and 400 FL communication rounds, respectively. In line with [55], it is assumed that the selected model parameters can be shared among neighbors using broadcast messages without requiring each vehicle to forward one copy of the shared layers for every neighbor. Fig. 13 depicts the results of the validation procedure considering sparse to dense connectivity scenarios, namely 10% (Fig. 13a), 50% (Fig. 13b), and 90% (Fig. 13c) connected vehicles, respectively. Each point in the scatter plot represents the communication footprint and the corresponding validation loss obtained by a vehicle participating in the FL process. Each marker encodes the information regarding the probability p r used by the layer selection policy: diamond, circle and star symbols indicate p r = 0.2, p r = 0.6 and p r = 1.0, respectively. Colors refer to different choices for the number of transmitted layers M , with black, blue and green corresponding to M = 2, M = 6 and M = 12, respectively. For each vehicle participating in the FL process, we measure the validation loss observed after 100 and 400 training rounds and compute, for each case, how many parameters, in MB, have been exchanged until that point. The loss values obtained are also averaged over 5 independent runs.
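The footprint computation for a single vehicle can be sketched as below. In line with the broadcast assumption, each selected layer is counted once per round regardless of the number of neighbors. The function name `footprint_mb`, the toy layer sizes and the convention 1 MB = 10^6 bytes are assumptions made for this example.

```python
def footprint_mb(selected_per_round, layer_params, bits=32):
    """Data broadcast by one vehicle, in MB, over a sequence of FL rounds.

    selected_per_round: list of lists with the layer indices sent on each round.
    layer_params: number of parameters of each layer.
    Assumptions: broadcast transmission (one copy per round) and 1 MB = 1e6 bytes.
    """
    total_bits = sum(layer_params[l] * bits
                     for layers in selected_per_round
                     for l in layers)
    return total_bits / 8 / 1e6

# Toy example: two layers, three rounds.
print(footprint_mb([[0], [1], [0, 1]], layer_params=[1000, 2000]))
```

Pairing this quantity with the validation loss measured after 100 and 400 rounds gives one point of the scatter plots in Fig. 13.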
Focusing on the overall performance, the numerical results show that p r = 0.2 gives the best results in terms of communication efficiency for all the considered connectivity patterns and wall clock times, while p r = 1.0 is the least efficient. On the other hand, p r = 0.2 is also the worst performing in terms of model quality for all cases considered in the figures. Trade-offs between communication efficiency and model quality need to be considered depending on the available network resources, the desired accuracy and the training latency. For example, p r = 0.2 exhibits the lowest communication footprint of all methods, making it the most favorable choice provided that the target loss L i (W) is below 1.8. On the other hand, for L i (W) = 1.2, p r = 0.6 should be preferred, as it exhibits the lowest communication footprint among the remaining methods and a validation loss in line with p r = 1.0. Analyzing the results for different numbers of transmitted layers M , it can be noticed that M = 12 should be avoided, as it is responsible for a high communication footprint compared with the other setups. Instead, sending M = 6 layers per round provides the best tradeoff, as it reduces the required network resources compared with M = 12 in exchange for some (marginal) degradation of model accuracy. Finally, sending M = 2 layers further reduces the required footprint, but at the cost of a much lower accuracy, i.e., a 5%-8% increase in validation loss.
Concerning the connectivity patterns, results indicate that the number of sidelink connections has a significant impact on the consensus process. In particular, the validation loss across all vehicles is less dispersed as the number of connections increases. Sparse connectivity makes consensus converge slowly and is responsible for large variations of the validation loss across the vehicles. This effect is more evident after 100 training rounds, whereas the variability decreases in denser connectivity scenarios (50% and 90% patterns). Optimizing communication efficiency and model quality in sparse connectivity scenarios is thus fundamental, since choosing inappropriate values of M and p r might lead to high validation loss. On the other hand, dense sidelink communications allow the vehicles to keep the number of exchanged layers M as low as possible, thus maximizing the communication efficiency: for the considered study, setting p r = 0.2 and M = 2 provides a reasonable tradeoff between communication footprint and model accuracy.

VII. CONCLUSION
In this paper, we analyzed the communication efficiency of fully decentralized FL setups underpinned by consensus tools. We designed a novel communication-efficient FL framework that enables the agents to self-organize into a distributed training platform while optimizing the network resources used for the learning process. The proposed CFL-LS method employs a layer optimizer that selects the NN layers to be shared by sorting them according to their contribution to the model quality, measured by the normalized squared gradient of the local loss. The layer selection policy is integrated with a fairness scheme that randomly selects layers of the ML architecture so as to favor a balanced selection of the ML model parameters and optimize the performance.
The proposed layer selection optimizer is first analyzed to study its impact on the consensus process. The analysis shows that the proposed solution does not reach average consensus. Nevertheless, the convergence rate provided by the FL method is far superior to that of a coordinator-based strategy integrating the same layer selection policy, which instead achieves average consensus. Then, the communication-efficient FL technique is assessed on the benchmark MNIST dataset as well as on the more challenging nuScenes dataset, targeting a cooperative vehicular sensing use case. The proposed CFL-LS layer selection policy has been validated with a PointNet-compliant DNN architecture composed of 40 trainable layers, used to reliably and precisely recognize road users from Lidar point clouds. Latency, accuracy and communication efficiency trade-offs have been extensively analyzed to evaluate the performance. Results indicate that the proposed layer selection policy significantly reduces the communication overhead needed during the training process, in exchange for a negligible performance loss compared to classical centralized (FL) and decentralized (CFL) policies. The analysis also shows how balancing gradient sorting operations and randomization for layer selection helps to reduce the communication burden without penalizing accuracy or convergence rates. More specifically, the main takeaways can be summarized as: i) randomized selection policies (p r = 1) should be preferred when communication resources are scarce; ii) gradient-based selection (p r ∈ [0, 0.2]) should instead be chosen when communication resources are not critical (10-100 MB); iii) prioritizing the exchange of layers possessing few trainable parameters is beneficial for improving communication efficiency. Finally, experimental tests show that the proposed approach is suitable for integration with device scheduling functions, as well as network quantization schemes.

APPENDIX A
Let us focus on the consensus equation (23) and assume that L FL satisfies the following conditions: 1 T L FL = 0 and L FL 1 = 0, so that the Perron matrix Pā (q) is doubly stochastic. We apply the eigenvalue decomposition to Pā (q) and obtain Pā (q) = U Λ V T , where U = [u 1 · · · u NL ] and V = [v 1 · · · v NL ] are the left and right eigenvectors, respectively, while Λ = diag[λ 1 , . . . , λ LN ] is a diagonal matrix containing the eigenvalues in non-ascending order. Recalling that 1 T L FL = 0 and L FL 1 = 0 and that L FL is block-partitioned into L blocks, it follows that L FL has L trivial eigenvalues with value 0. This implies that the first L eigenvalues of Pā (q) are λ 1 = . . . = λ L = 1, while U 0 = V 0 = 1 ⊗ I L are the associated left/right eigenvector matrices, as they satisfy the following conditions
[...]
Combining the above results, (23) can be rewritten as
[...]
where u n and v n are the left and right eigenvectors associated with the eigenvalues λ n , with n = L + 1, . . . , LN . Convergence to the average initial values contained in W t 0 is thus obtained when the summation of (30) approaches zero, i.e., for |λ n | = |1 − εµ n | ≤ 1, or equivalently
[...]
where µ max (.) denotes the largest eigenvalue of L FL . By applying the Gershgorin theorem, we obtain:
[...]
Therefore, convergence is guaranteed for:
[...]
where d max is the maximum connectivity degree. In the considered case, d max is the maximum number of times a layer is chosen by all devices.
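For reference, the step-size condition can be sketched with the standard consensus argument below. It is given as an illustration and is not a verbatim copy of the paper's displayed equations. It assumes that the Perron matrix takes the usual form I − εL FL (consistent with λ n = 1 − εµ n ), that the nontrivial eigenvalues µ n are real and positive, and that d max bounds the diagonal entries of L FL . A strict inequality is used, since the terms λ n raised to the number of iterations vanish only when |λ n | < 1.

```latex
\begin{align}
|\lambda_n| = |1 - \varepsilon \mu_n| < 1
  &\iff 0 < \varepsilon \mu_n < 2
  \iff 0 < \varepsilon < \frac{2}{\mu_n}, \\
\text{for all } n = L+1, \dots, LN
  &\iff 0 < \varepsilon < \frac{2}{\mu_{\max}(\mathbf{L}_{FL})}, \\
\mu_{\max}(\mathbf{L}_{FL}) &\le 2\, d_{\max}
  \quad \text{(Gershgorin disks centered at the diagonal entries)}, \\
&\Rightarrow \quad 0 < \varepsilon < \frac{1}{d_{\max}}
  \ \text{ is sufficient for convergence.}
\end{align}
```

The last line follows because ε < 1/d max implies ε < 2/(2 d max ) ≤ 2/µ max (L FL ), so the Gershgorin bound gives a sufficient, though generally not necessary, condition.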