Semi-Asynchronous Hierarchical Federated Learning Over Mobile Edge Networks

Mobile edge networks have been recognized as a promising technology for future wireless communications. However, mobile edge networks usually gather large amounts of data, which makes it difficult to exploit data science efficiently. Federated learning has been proposed as an appealing approach that allows users to cooperatively benefit from models trained by other participants. In this paper, we propose a novel Semi-Asynchronous Hierarchical Federated Learning (SAHFL) framework for mobile edge networks that enables elastic edge-to-cloud model aggregation from data sensing. We further formulate a joint edge node association and resource allocation problem under the proposed SAHFL framework to preserve the personalities of heterogeneous devices and achieve communication efficiency. To solve the resulting Mixed Integer NonLinear Programming (MINLP) problem, we introduce a distributed Alternating Direction Method of Multipliers (ADMM)-Block Coordinate Update (BCU) algorithm. With this algorithm, a tradeoff between training accuracy and transmission latency is derived. Numerical results demonstrate the advantages of the proposed algorithm in terms of training overhead and model performance.


I. INTRODUCTION
With the improvement of the sensing and computing capabilities of mobile edge networks, the explosive growth in the number of devices has generated a large amount of data [1]. Fully utilizing these data would greatly help the mobile edge network provide secure and efficient services to devices. However, since traditional centralized training increases the communication load and compromises data security, it is impractical for mobile edge networks with large amounts of data. Therefore, a new distributed machine learning paradigm named Federated Learning (FL) [2] has emerged, which allows devices to complete the training process without uploading their raw data to the central server.
The associate editor coordinating the review of this manuscript and approving it for publication was Tiago Cruz.
Currently, FL has been widely studied as a way to handle data science on terminal devices [3], [4] and to foster new applications such as medical diagnosis [5] and autonomous vehicles [6]. FL allows participating devices to collaboratively build a shared model while keeping private data local [7]. In particular, the prevalent FL algorithm, federated averaging, lets each device train a model locally on its own dataset and then transmit the model parameters to the central controller for global aggregation [2]. However, FL efficiency is severely degraded by limited communication resources. Furthermore, the participating devices in mobile edge networks usually have heterogeneous resources, which leads to non-independent and identically distributed (non-IID) private data [8], [9]. The existence of non-IID data creates the need for customized services for individual terminals. Learning a common model, as in the traditional FL algorithm, may produce mediocre performance on terminals with large data imbalances. Intuitively, FL has great potential to facilitate large-scale data management in mobile edge networks. However, directly applying FL to mobile edge networks still faces three major deficiencies: 1) limited wireless resources; 2) high latency; 3) obliterated data diversity. According to [10] and [11], Federated Learning training at Edge networks (FEL) is regarded as a solution that alleviates the above limitations by bringing model training closer to where the data are produced. Compared with conventional cloud-centric FL approaches, FEL can provide higher wireless resource utilization since less information needs to be transmitted to the cloud.
VOLUME 11, 2023. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In addition, FEL has a much lower transmission latency and higher privacy than conventional FL because decisions are made at the edge nodes. In [12], the authors develop an importance-aware joint data selection and resource allocation algorithm to maximize resource and learning efficiency. Meanwhile, the authors in [13] propose an adaptive federated learning mechanism for resource-constrained edge computing systems. Building on FEL, the authors in [14] propose a Hierarchical Federated Edge Learning (HFEL) framework, in which edge servers are deployed at fixed base stations and upload their edge-aggregated models to the cloud. HFEL shows great potential for low latency and high energy efficiency.
Besides, to cope with device heterogeneity, some authors propose to improve the efficiency of FL by changing the aggregation method. Existing federated learning methods mainly use a synchronous model aggregation mechanism, where the central server needs to wait for the slowest device to complete training in each communication round [15], [16]. In synchronous FL, the edge server aggregates the local models of all devices or of a subset of pre-selected devices. In [17], the authors proposed a joint device association and wireless resource allocation scheme under IID and non-IID datasets, respectively. The authors in [18] proposed a device selection and resource allocation scheme for resource-rich unlicensed spectrum (NR-U) networks. However, in this case, the computing resources of the unselected devices are wasted. Moreover, with heterogeneous data, the per-round latency of synchronous aggregation is unacceptable for time-sensitive devices. Several works have therefore proposed asynchronous model aggregation methods, where only one participating device updates the global model at a time [19], [20], [21]. While one device uploads its model, the others continue their training. The authors in [22] proposed an asynchronous FL mechanism to coordinate the heterogeneity of devices, communication environments, and learning tasks. Nevertheless, asynchronous methods require more training rounds than synchronous methods. Moreover, due to the asynchrony, gradient staleness may be difficult to control [20]. Therefore, the authors in [23] design an n-softsync aggregation model that can significantly reduce training time by combining the benefits of synchronous and asynchronous aggregation.
Inspired by the above analyses, we propose a novel Semi-Asynchronous Hierarchical Federated Learning (SAHFL) framework that can provide secure and efficient services to mobile edge networks. Specifically, the proposed SAHFL framework consists of edge and cloud layers, where each edge node aggregates all of its homogeneous local models and the cloud layer aggregates a subset of the heterogeneous edge models. The selected nodes update the global model once the slowest selected node finishes training, which combines the merits of synchronous and asynchronous aggregation. For further performance enhancement, we formulate a joint edge node association and resource allocation optimization problem to preserve the personalities of heterogeneous edge nodes and to ensure the communication efficiency of the whole system. The resulting optimization problem is a Mixed Integer NonLinear Programming (MINLP) problem, which we solve with a distributed Alternating Direction Method of Multipliers (ADMM)-Block Coordinate Update (BCU) algorithm. It is shown that the proposed algorithm achieves near-optimal performance with low computational complexity. In addition, to protect the data diversity contributed by edge nodes, we design an elastic edge update method that is applied before edge nodes broadcast the cloud model to devices.
Overall, the main contributions of this work can be listed as follows.
• We propose a novel SAHFL framework that applies a synchronous aggregation model between the local and edge layers and a semi-asynchronous aggregation model between the edge and cloud layers to provide secure and efficient services for mobile edge networks.
• To preserve the personalities of heterogeneous edge nodes, we introduce an elastic edge model update method based on the distance between the global model and the edge model.
• We formulate a joint edge node association and resource allocation problem that achieves communication efficiency by trading off training accuracy against transmission latency. A distributed ADMM-BCU algorithm is used to solve the resulting MINLP problem.
• On the CIFAR-10 dataset, our framework performs well in terms of training accuracy and loss. The proposed algorithm reduces device latency, and the elastic edge model update method effectively protects the personalization level of the edge models.
The rest of this paper is organized as follows. Section II introduces the system model and the SAHFL learning mechanism. In Section III, we formulate the communication-efficient problem. A joint edge node association and resource allocation strategy is presented in Section IV. Section V presents the numerical results, followed by the conclusions in Section VI.

II. SYSTEM MODEL
In this work, we aim to design a novel SAHFL framework for mobile edge networks that contains three layers, namely the cloud layer, the edge layer, and the local layer, as shown in Fig. 1. We consider devices with heterogeneous data structures, i.e., the local datasets are non-IID. Homogeneous devices with similar data size, network bandwidth, and QoS are gathered under the same edge node. Hence, the edge nodes are heterogeneous. A shared Deep Neural Network (DNN) model is distributed over the local devices and trained collaboratively across the devices on their own datasets. Different from conventional FL, the proposed SAHFL framework lets devices train on their data locally, lets homogeneous devices report their computed parameters to the same edge node synchronously, and lets heterogeneous edge nodes upload their models to the cloud node semi-asynchronously, which preserves data privacy and improves communication efficiency. In the proposed framework, we assume a set of $K$ edge nodes $\mathcal{K} = \{1, \ldots, K\}$. Edge node $k$ serves a set of $N_k$ local devices, denoted as $\mathcal{N}_k = \{L_{k,1}, \ldots, L_{k,N_k}\}$. Under edge node $k \in \mathcal{K}$, local device $n \in \mathcal{N}_k$ owns a local dataset $\mathcal{D}_{k,n} = \{(x_{j,k,n}, y_{j,k,n}) : j = 1, \ldots, |\mathcal{D}_{k,n}|\}$, where $x_{j,k,n}$ is the $j$-th input training data sample, $y_{j,k,n}$ is the corresponding $j$-th output, and $|\mathcal{D}_{k,n}|$ denotes the cardinality of the dataset $\mathcal{D}_{k,n}$. For simplicity, we assume an SAHFL algorithm with a single output; the work can be extended to the case of multiple outputs. In what follows, we introduce each part of the proposed SAHFL framework at the $t$-th iteration.

A. EDGE AGGREGATION
The edge aggregation stage contains three processes: local model computation, local model transmission, and edge model aggregation. In detail, each local model is first trained on local data, and the local models are then transmitted to their associated edge nodes for edge aggregation. The detailed processes are as follows.

1) LOCAL MODEL COMPUTATION
Without loss of generality, we consider a supervised machine learning task on device $n \in \mathcal{N}_k$ associated with edge node $k \in \mathcal{K}$, which has a learning model $w_{k,n}$. We further define $f_n(x_{j,k,n}, y_{j,k,n}, w_{k,n})$ as the loss function of data sample $j$, which quantifies the prediction error between data sample $x_{j,k,n}$ and output $y_{j,k,n}$. In this work, we mainly focus on the logistic regression model for the loss function, i.e., $f_n(x_{j,k,n}, y_{j,k,n}, w_{k,n}) = -\log\left(1 + \exp\left(-y_{j,k,n}\, x_{j,k,n}^{T} w_{k,n}\right)\right)$. The loss function $F_{k,n}(w_{k,n})$ of device $n \in \mathcal{N}_k$ on dataset $\mathcal{D}_{k,n}$ is defined over the per-sample losses of all $|\mathcal{D}_{k,n}|$ samples, and the local model update in Eq. (2) is a gradient step on $F_{k,n}$ with a predefined learning rate $\eta$.
Define $C_{k,n}$ as the number of CPU cycles that local device $n \in \mathcal{N}_k$ associated with edge node $k \in \mathcal{K}$ needs to process one data sample. Assuming all data samples have the same size, the total number of CPU cycles to run one local iteration is $C_{k,n}|\mathcal{D}_{k,n}|$.
We further let $f_{k,n}$ be the computation frequency of device $n \in \mathcal{N}_k$ in edge node $k \in \mathcal{K}$. The local gradient calculation latency in one round can then be formulated as
$$T^{c}_{k,n} = \frac{C_{k,n}\,|\mathcal{D}_{k,n}|}{f_{k,n}}.$$
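The computation latency described above is the total number of CPU cycles divided by the computation frequency, and a small sketch may make the units concrete. The function name `local_computation_latency` and the sample values (20,000 cycles per sample, 2,500 samples, a 2 GHz device) are illustrative assumptions, not values reported in this paper.

```python
def local_computation_latency(cycles_per_sample, num_samples, cpu_frequency_hz):
    """Return the time in seconds for one local iteration.

    cycles_per_sample: CPU cycles needed to process one data sample.
    num_samples:       number of samples in the local dataset.
    cpu_frequency_hz:  computation frequency of the device in cycles/second.
    """
    if cpu_frequency_hz <= 0:
        raise ValueError("cpu_frequency_hz must be positive")
    total_cycles = cycles_per_sample * num_samples
    return total_cycles / cpu_frequency_hz


# Hypothetical device: 20,000 cycles per sample, 2,500 samples, 2 GHz CPU.
latency_s = local_computation_latency(20_000, 2_500, 2e9)
print(f"local computation latency: {latency_s:.3f} s")
```

Doubling the computation frequency halves this latency, which is why faster devices finish a local round earlier.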

2) LOCAL MODEL TRANSMISSION
We adopt the Orthogonal Frequency-Division Multiple Access (OFDMA) technique for local uplink transmissions. Define $B_{k,n}$ as the bandwidth allocated to device $n \in \mathcal{N}_k$. The bandwidths of the devices under edge node $k$ are constrained by $B_k$, the bandwidth allocated to edge node $k \in \mathcal{K}$ for the transmission between edge node $k$ and its associated local devices.
The bandwidths of all edge nodes are in turn constrained by $B_e$, the total bandwidth allocated for the communication between the edge nodes and the local devices. Therefore, the achievable local uplink data rate from device $n \in \mathcal{N}_k$ to edge node $k \in \mathcal{K}$ can be formulated as
$$B_{k,n} \log_2\left(1 + \frac{P_{k,n}\, g_{k,n}}{B_{k,n} N_0}\right),$$
where $P_{k,n}$ is the uplink transmission power of device $n \in \mathcal{N}_k$ in edge node $k \in \mathcal{K}$, $g_{k,n}$ denotes the channel gain between local device $n$ and edge node $k$, and $N_0$ is the noise power spectral density. The achievable downlink data rate for device $n \in \mathcal{N}_k$ associated with edge node $k \in \mathcal{K}$ is defined analogously, with $P_k$ the downlink transmission power of edge node $k$.
In this work, we use the same training model for the whole communication system. Therefore, the model parameters transferred at each level have the same size. Denote $Z$ as the size of the model parameters in bits. The local gradient upload latency of device $n \in \mathcal{N}_k$ in edge node $k \in \mathcal{K}$ can be expressed as
$$T^{u}_{k,n} = \frac{Z}{B_{k,n} \log_2\left(1 + \frac{P_{k,n}\, g_{k,n}}{B_{k,n} N_0}\right)}.$$
Correspondingly, the edge model download latency of device $n \in \mathcal{N}_k$ in edge node $k \in \mathcal{K}$ is $Z$ divided by the downlink data rate.
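The uplink rate and upload latency described above can be sketched as follows. The sketch assumes the Shannon-capacity form of the rate over an OFDMA sub-band, with the noise power given by the allocated bandwidth times a noise power spectral density. The function names and numeric values are hypothetical and chosen only so that the signal-to-noise ratio equals 1.

```python
import math


def uplink_rate_bps(bandwidth_hz, tx_power_w, channel_gain, noise_psd_w_per_hz):
    """Achievable rate B * log2(1 + P * g / (B * N0)) in bits per second."""
    snr = tx_power_w * channel_gain / (bandwidth_hz * noise_psd_w_per_hz)
    return bandwidth_hz * math.log2(1.0 + snr)


def upload_latency_s(model_bits, bandwidth_hz, tx_power_w, channel_gain,
                     noise_psd_w_per_hz):
    """Time to upload a model of model_bits bits at the achievable rate."""
    rate = uplink_rate_bps(bandwidth_hz, tx_power_w, channel_gain,
                           noise_psd_w_per_hz)
    return model_bits / rate


# Hypothetical link: 1 MHz, 10 mW, gain 1e-6, N0 = 1e-14 W/Hz (SNR = 1).
rate = uplink_rate_bps(1e6, 1e-2, 1e-6, 1e-14)
delay = upload_latency_s(2e6, 1e6, 1e-2, 1e-6, 1e-14)
print(f"rate = {rate:.0f} bit/s, upload latency = {delay:.2f} s")
```

Because the rate grows with the allocated bandwidth, giving a device more bandwidth shortens its upload latency, which is the lever the resource allocation problem later adjusts.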

3) EDGE MODEL AGGREGATION
In this work, each edge node receives the updated model parameters from its associated homogeneous devices. Since the devices under one edge node are usually of a similar type, we adopt the synchronous aggregation method to average these updated models. That is, the edge node waits for the slowest device to complete training in each round and collects the updated model parameters of all connected devices. Therefore, the edge model aggregation for edge node $k \in \mathcal{K}$ can be formulated as
$$w_k = \frac{\sum_{n=1}^{N_k} |\mathcal{D}_{k,n}|\, w_{k,n}}{|\mathcal{D}_k|},$$
where $|\mathcal{D}_k| = \sum_{n=1}^{N_k} |\mathcal{D}_{k,n}|$ is the total number of data samples in edge node $k \in \mathcal{K}$. We omit the edge model aggregation time because of the strong computing capability of the edge node. Similarly, owing to the bandwidth and transmission power available when edge nodes broadcast, the edge model download latency can also be neglected. Hence, the computation and communication latency between edge node $k \in \mathcal{K}$ and its local devices can be derived as
$$\max_{n \in \mathcal{N}_k} \left(T^{c}_{k,n} + T^{u}_{k,n}\right).$$
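The synchronous edge step described above can be sketched in a few lines. The sketch assumes a data-size-weighted average of the local models, in the style of federated averaging, and an edge latency set by the slowest device. Models are plain lists of floats, and all names and values are hypothetical.

```python
def aggregate_edge_model(local_models, data_sizes):
    """Data-size-weighted average of local models (lists of equal length)."""
    total = sum(data_sizes)
    dim = len(local_models[0])
    return [
        sum(size * model[i] for model, size in zip(local_models, data_sizes)) / total
        for i in range(dim)
    ]


def edge_round_latency(compute_latencies, upload_latencies):
    """Synchronous aggregation waits for the slowest device."""
    return max(c + u for c, u in zip(compute_latencies, upload_latencies))


# Two hypothetical devices with 1 and 3 samples, respectively.
edge_model = aggregate_edge_model([[1.0, 2.0], [3.0, 4.0]], [1, 3])
latency = edge_round_latency([0.2, 0.5], [0.4, 0.05])
print(edge_model, latency)
```

The device with more data pulls the edge model toward its own parameters, while the round latency is set by the device whose computation plus upload takes longest.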

B. CLOUD AGGREGATION
Similarly, the cloud aggregation stage contains two processes, i.e., edge model transmission and cloud model aggregation. Particularly, the selected edge nodes upload their updated model parameters to the cloud for aggregation. The detailed processes are as follows.

1) EDGE MODEL TRANSMISSION
Edge nodes upload their model parameters to the cloud after edge model aggregation. To ensure uninterrupted transmission from edge to cloud, we also adopt the OFDMA technique. Hence, the uplink data rate of edge node $k \in \mathcal{K}$ can be expressed as
$$B_{c,k} \log_2\left(1 + \frac{P_{c,k}\, g_{c,k}}{B_{c,k} N_0}\right),$$
where $B_{c,k}$ is the bandwidth allocated to edge node $k \in \mathcal{K}$ for transmission to the cloud node, $P_{c,k}$ is the uplink transmission power of edge node $k$ to the cloud node, and $g_{c,k}$ denotes the channel gain between edge node $k$ and the cloud node. The downlink data rate from the cloud node to edge node $k \in \mathcal{K}$ is defined analogously, where $P_c$ is the downlink transmission power of the cloud node and $B_c$ is the total bandwidth for the transmission between the edge nodes and the cloud. As discussed later, only a subset of the edge nodes is selected in each round. The edge-cloud bandwidth allocation is therefore constrained through the selection indicators $\alpha_k \in \{0, 1\}$, where $\alpha_k = 1$ indicates that edge node $k$ has been selected and $\alpha_k = 0$ otherwise. The upload latency $T^{u}_{c,k}$ from edge node $k \in \mathcal{K}$ to the cloud node is the model size $Z$ divided by the uplink data rate, and the downlink latency from the cloud node to edge node $k$ is obtained similarly.

2) CLOUD MODEL AGGREGATION
Since the edge nodes correspond to heterogeneous local datasets, their model update periods vary. If we adopted the synchronous aggregation model, the waiting time imposed on the faster-training nodes would be unacceptable. In contrast, the asynchronous method has a shorter round latency but requires several times as many training rounds as the synchronous method. Therefore, in this work, we propose a flexible semi-asynchronous aggregation method that combines the merits of both. As shown in Fig. 2, the cloud node selects the $|S_t| = \sum_{k=1}^{K} \alpha_k$ edge nodes with the fastest training rounds for model aggregation, where the set of selected edge nodes is denoted as $S_t$. Slow nodes wait for the next communication round to upload their models.
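The selection of the fastest edge nodes described above can be sketched as follows. The data-size-weighted average used for the cloud model is an assumption made for illustration, as are the function names and the toy values. Models are plain lists of floats.

```python
def select_fastest_edge_nodes(round_latencies, num_selected):
    """Return the sorted indices of the num_selected fastest edge nodes."""
    order = sorted(range(len(round_latencies)), key=lambda k: round_latencies[k])
    return sorted(order[:num_selected])


def semi_async_aggregate(edge_models, data_sizes, selected):
    """Assumed aggregation: weighted average over the selected nodes only."""
    total = sum(data_sizes[k] for k in selected)
    dim = len(edge_models[0])
    return [
        sum(data_sizes[k] * edge_models[k][i] for k in selected) / total
        for i in range(dim)
    ]


def semi_async_round_latency(round_latencies, selected):
    """The round ends when the slowest selected node has finished."""
    return max(round_latencies[k] for k in selected)


# Four hypothetical edge nodes; the cloud aggregates the two fastest.
latencies = [3.0, 1.0, 2.0, 5.0]
selected = select_fastest_edge_nodes(latencies, num_selected=2)
cloud_model = semi_async_aggregate([[0.0], [2.0], [4.0], [8.0]],
                                   [10, 10, 30, 10], selected)
print(selected, cloud_model, semi_async_round_latency(latencies, selected))
```

Compared with waiting for all four nodes (5.0 time units), the round here ends after 2.0 time units, at the cost of leaving two edge models out of this aggregation.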
Hence, under the semi-asynchronous aggregation method, we can achieve a balance between training accuracy and communication latency. The cloud model is obtained by aggregating the edge models of the selected nodes in $S_t$. We also ignore the cloud model aggregation latency because of the strong computing capability of the cloud node. The cloud-edge communication latency then follows from the edge-cloud transmission latencies, and the one-round latency of edge node $k \in \mathcal{K}$ combines the edge-level latency with the cloud-edge communication latency.

C. EDGE UPDATE MODEL
From Eq. (2), the locally updated models are determined by the characteristics of their own data. Since the non-IID devices connected to one edge node have similar characteristics, the edge aggregation models are heterogeneous. Therefore, if we directly used the cloud model to update the edge models, the personalities of the edge models would be eliminated. Meanwhile, the accuracy of the cloud model would decrease. Hence, we introduce a new edge update model based on [25], which defines a weight distance $\mathrm{dist}(w_k, w_c)$ to represent the difference between the weights of two models. Intuitively, the larger $\mathrm{dist}(w_k, w_c)$ is, the greater the model difference.
Typically, deep learning networks that consist of multiple layers, each containing a different number of weights, can be adopted here. For simplicity, we use a small dataset to identify the layers with the most distinctive characteristics, denoted as $\mathcal{L} = \{\ell_1, \ell_2, \cdots\}$. Thereafter, we introduce a parameter $\varepsilon_k$ to measure the difference between the cloud model and edge model $k$. It is computed from the layer-wise distances $\mathrm{dist}(w^{t,\ell}_k, w^{t,\ell}_c)$ over the $|\mathcal{L}|$ layers in $\mathcal{L}$, where $w^{t,\ell}_k$ and $w^{t,\ell}_c$ represent the weights of the $\ell$-th layer of edge model $w^{t}_k$ and cloud model $w^{t}_c$, and $|\mathcal{L}|$ is the cardinality of $\mathcal{L}$. To keep the personalities, the updated edge model is derived from $\varepsilon_k$, the edge model, and the cloud model.

D. LEARNING PROCEDURE OF THE SAHFL MODEL
Based on the definition of the SAHFL model, the training procedure at the $t$-th iteration proceeds as follows, as also shown in Fig. 3.
1) Local model training and update: Devices in the mobile edge network train their learning models and calculate their local gradients $\nabla F_{k,n}(w^{t}_{k,n})$, $\forall k \in \mathcal{K}, n \in \mathcal{N}_k$. After receiving $w^{t}_k$, $\forall k \in S_t$, devices in the selected edge nodes update their learning models based on Eq.
The procedure starts from $t = 1$ and repeats the above steps until convergence.

E. CONVERGENCE ANALYSIS
Before delving into the convergence analysis, we introduce the following assumptions on loss functions and gradient estimates according to [24]. We first assume that F(w * ) is the optimal global FL model obtained by collecting the local models of all selected devices in each iteration by the SAHFL algorithm. Other assumptions are as follows.
Theorem 1: Given the optimal global FL model $F(w^{*})$ and the learning rate $0 \le \eta \le \frac{1}{L}$, the upper bound of $\mathbb{E}[F(w^{t+1}_c) - F(w^{*})]$ can be given by

III. PROBLEM FORMULATION
As discussed earlier, there exists a tradeoff between training accuracy and transmission latency. Our goal in this work is therefore to balance the two in order to provide secure and communication-efficient services for the SAHFL-based mobile edge network. According to Eq. (2), the local model update is driven by the local Gradient-Norm-Value (GNV), which measures the data importance. The GNV of local device $n \in \mathcal{N}_k$ in edge node $k \in \mathcal{K}$ can be expressed as
$$g^{w,t}_{k,n} = \nabla F_{k,n}(w^{t}_{k,n}) = \sum_{\mathcal{D}_{k,n}} \frac{\partial f_{k,n}(x_{j,k,n}, y_{j,k,n}, w^{t}_{k,n})}{\partial w^{t}_{k,n}}, \quad \forall k \in \mathcal{K},\ n \in \mathcal{N}_k.$$
Without loss of generality, we use the norm of the GNV to represent the importance. Since an edge node connects homogeneous local devices, the GNVs of these local devices are approximately equal. Moreover, local devices under one edge node also have similar training durations; hence, all of their trained models (GNVs) are uploaded, and the GNV of edge node $k \in \mathcal{K}$ is defined from the GNVs of its devices. In contrast, the cloud node is associated with heterogeneous edge nodes, whose GNVs vary. Intuitively, edge nodes with significant gradients contribute more to model updating and convergence. Therefore, the cloud preferentially selects impactful edge nodes to upload their information for cloud model aggregation, and the GNV of the cloud model is defined over the selected edge nodes. For ease of expression, we omit the iteration index $t$ in the following.
Now, we are ready to describe the problem formulation. The goal of this work is to maximize communication efficiency via joint edge node selection and resource allocation scheduling for an SAHFL-based mobile edge network. To accelerate the learning process, it is desirable to select more edge nodes with larger data importance. However, to shorten the communication and computation latency, it is better to upload as few edge nodes as possible. As a result, the objective function in (26) represents the tradeoff between GNVs and transmission latency, where $B_c = B - B_e$, $B$ is the total bandwidth, $\boldsymbol{\alpha} = [\alpha_1, \alpha_2, \cdots, \alpha_K]^{T}$, $\mathbf{B}_{k,n} = [B_{k,1}, B_{k,2}, \cdots, B_{k,N_k}]^{T}$, $\mathbf{B}_{c,k} = [B_{c,1}, B_{c,2}, \cdots, B_{c,K}]^{T}$, and $\rho \in [0, 1]$ is the weight factor that controls the tradeoff between data importance and transmission latency. Problem (26) is an MINLP problem, which is NP-hard. In the following, we introduce an ADMM-BCU method to find the joint edge node selection and resource allocation strategy.
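As a rough illustration of the tradeoff controlled by ρ, the sketch below assumes a simplified objective: ρ times the summed importance of the selected nodes minus (1 − ρ) times the latency of the slowest selected node. This form, the function names, and the toy values are assumptions for illustration and are not the paper's problem (26). Exhaustive search is used only because the toy instance is tiny.

```python
from itertools import product


def tradeoff_objective(selection, importance, latency, rho):
    """Assumed objective: rho * total importance - (1 - rho) * round latency."""
    chosen = [k for k, a in enumerate(selection) if a == 1]
    if not chosen:
        return float("-inf")  # at least one edge node must be selected
    total_importance = sum(importance[k] for k in chosen)
    round_latency = max(latency[k] for k in chosen)
    return rho * total_importance - (1.0 - rho) * round_latency


def best_selection(importance, latency, rho):
    """Brute-force search over all 0/1 selection vectors (small K only)."""
    candidates = product((0, 1), repeat=len(importance))
    return max(candidates,
               key=lambda s: tradeoff_objective(s, importance, latency, rho))


# Three hypothetical edge nodes: the most important one is also the fastest.
importance = [1.0, 0.5, 0.2]
latency = [1.0, 2.0, 5.0]
for rho in (0.5, 0.8, 0.95):
    print(rho, best_selection(importance, latency, rho))
```

In this toy instance a larger ρ admits more edge nodes, since the gain in importance increasingly outweighs the extra waiting time. Exhaustive search scales as 2^K, which is why a decomposition method is needed for realistic K.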

IV. JOINT EDGE NODE SELECTION AND RESOURCE ALLOCATION
Note that all of the steps in the learning procedure are independent of the optimal scheduling decision. Denoting $T_k = \max_{n \in \mathcal{N}_k} \left(T^{c}_{k,n} + T^{u}_{k,n}\right) + T^{u}_{c,k}$, the original problem (26) can be rewritten subject to (26a), (26b), and (26c).
Nevertheless, (30) is still a coupled problem because of the multiple coupled variables $Y$, $B_{k,n}$, and $B_{c,k}$. Therefore, in this work, we propose a distributed ADMM-BCU algorithm that iteratively approaches a near-optimal stable solution with low computational complexity. Specifically, during each iteration, (30) is decomposed into edge node selection and resource allocation subproblems, which solve for the blocks $\{\alpha, X, Y\}$ and $\{\alpha, B_{k,n}, B_{c,k}\}$, respectively. Under the fixed resource allocation block $\{\alpha, B_{k,n}, B_{c,k}\}$, the edge node selection optimization subproblem over the variable block $\{\alpha, X, Y\}$ can be rewritten as (31), subject to (26b), (28a), (28b), and (29c).
Obviously, (31) is a convex problem, which can be solved by standard tools, such as CVX.
In what follows, we provide the closed form expression of the optimal edge node selection by introducing Lemma 1.
Lemma 1: The optimal edge node selection $\alpha^{*}$ is given by (35).
Proof: To find the optimal $\alpha_k$, $\forall k \in \mathcal{K}$, we apply the Lagrangian dual method and rearrange Eq. (31) with respect to $\alpha_k$, $\forall k \in \mathcal{K}$. Calculating the first-order partial derivatives with respect to $\alpha_k$, $\forall k \in \mathcal{K}$, then yields the result. This ends the proof.
From Lemma 1, we find that the edge node selection is mainly determined by the edge node importance $\sigma_k$ and the uplink transmission latency $T^{u}_{c,k}$ from edge node $k$ to the cloud. Intuitively, the cloud preferentially selects edge nodes with either a larger importance or a smaller uplink transmission latency, which improves the communication efficiency.
b: THE OPTIMAL BANDWIDTH ALLOCATION $\{\alpha, B_{k,n}, B_{c,k}\}$
Similarly, under the fixed edge node selection block $\{\alpha, X, Y\}$, the resource allocation optimization subproblem over the block $\{\alpha, B_{k,n}, B_{c,k}\}$ can be rearranged as (36), subject to (26a), (26b), and the constraints (36a) and (36b), the first of which involves the local upload latency $\frac{Z}{B_{k,n} \log_2\left(1 + \frac{P_{k,n} g_{k,n}}{B_{k,n} N_0}\right)}$. It is also easy to observe that (36) is a convex problem. For ease of analysis, we write this problem in its Lagrangian dual form (37), where $\beta$, $\phi$, and $\tau$ are the Lagrangian multipliers corresponding to constraints (26a), (36a), and (36b), respectively, and a further multiplier corresponds to (26b).
By setting $\partial L / \partial B_{k,n} = 0$, the optimal local-edge uplink bandwidth allocation $B^{*}_{k,n}$ can be derived as in (38). From Eq. (38), $B^{*}_{k,n}$ is mainly influenced by the related channel condition $\frac{P_{k,n} g_{k,n}}{N_0}$. Likewise, by setting $\partial L / \partial B_{c,k} = 0$, the optimal edge-cloud uplink bandwidth allocation $B^{*}_{c,k}$ can be obtained as in (39), and it follows a rule similar to that of $B^{*}_{k,n}$. Thereafter, by setting $\partial L / \partial \alpha_k = 0$, we obtain the optimal auxiliary variable $\alpha^{*}_k$. The detailed procedure for the joint edge node selection and resource allocation scheduling is presented in Algorithm 1.
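To see how the channel condition enters the stationarity condition for the local-edge bandwidth, it may help to differentiate the upload latency with respect to the allocated bandwidth. The derivation below is a sketch that uses only the upload latency $Z / (B_{k,n}\log_2(1 + P_{k,n} g_{k,n}/(B_{k,n} N_0)))$. The shorthand $a$ and $h$ is introduced here for illustration and is not the paper's notation.

```latex
% Shorthand (assumed): a = P_{k,n} g_{k,n} / N_0, \quad h(B) = B \log_2(1 + a/B).
\begin{align}
  h'(B) &= \log_2\!\left(1 + \frac{a}{B}\right)
           + B \cdot \frac{1}{\left(1 + \frac{a}{B}\right)\ln 2}
             \cdot \left(-\frac{a}{B^{2}}\right)
         = \log_2\!\left(1 + \frac{a}{B}\right) - \frac{a}{(B + a)\ln 2}, \\
  \frac{\partial}{\partial B}\,\frac{Z}{h(B)}
        &= -\,\frac{Z\, h'(B)}{h(B)^{2}}
         = -\,\frac{Z\left[\log_2\!\left(1 + \frac{P_{k,n} g_{k,n}}{B N_0}\right)
             - \frac{P_{k,n} g_{k,n}}{\left(P_{k,n} g_{k,n} + B N_0\right)\ln 2}\right]}
             {\left(B \log_2\!\left(1 + \frac{P_{k,n} g_{k,n}}{B N_0}\right)\right)^{2}}.
\end{align}
```

Since $\ln(1 + x) > x/(1 + x)$ for $x > 0$, the bracketed term is positive, so the upload latency strictly decreases as the bandwidth grows. The optimal allocation therefore balances this latency reduction against the shared bandwidth budget.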
In Algorithm 1, the successive divergence of the objective function at each iteration [27] is defined in (41).

Algorithm 1 Joint Edge Node Selection and Resource Allocation Strategy
1: Initialize the gradient norm value $\sigma_k$, $\forall k \in \mathcal{K}$.
2: Set the minimum successive divergence threshold of the objective function and the maximum iteration number $R_{\max}$.
3: Set the iteration number to 0.
4: Initialize the auxiliary variables, including the multipliers $\lambda$.
5: While the successive divergence is no less than the threshold and the iteration number is no greater than $R_{\max}$ do
6: Calculate the optimal device selection decision $\alpha^{*}_k$ according to (35).
9: Update the multipliers $\lambda$ according to
10: Update the successive divergence according to (41).
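The outer loop of Algorithm 1 alternates between variable blocks until the objective stops changing or an iteration cap is reached. The sketch below shows only that alternating structure and stopping rule on a toy two-variable quadratic. It omits the ADMM multiplier updates, takes the successive divergence to be the absolute change of the objective (an assumption), and uses hypothetical names throughout.

```python
def block_coordinate_update(update_blocks, objective, state,
                            min_divergence=1e-4, max_iterations=200):
    """Alternate over block updates until the objective change is small."""
    previous = objective(state)
    divergence = float("inf")
    iteration = 0
    while divergence >= min_divergence and iteration <= max_iterations:
        for update in update_blocks:
            state = update(state)  # optimize one block, others held fixed
        current = objective(state)
        divergence = abs(current - previous)  # assumed divergence measure
        previous = current
        iteration += 1
    return state, iteration


# Toy problem: minimize (x - 1)^2 + (y - 2)^2 + (x - y)^2 over (x, y).
def toy_objective(s):
    x, y = s
    return (x - 1) ** 2 + (y - 2) ** 2 + (x - y) ** 2


def update_x(s):
    return ((1 + s[1]) / 2, s[1])  # closed-form minimizer in x for fixed y


def update_y(s):
    return (s[0], (2 + s[0]) / 2)  # closed-form minimizer in y for fixed x


solution, iterations = block_coordinate_update(
    [update_x, update_y], toy_objective, (0.0, 0.0))
print(solution, iterations)
```

The toy problem has its minimum at x = 4/3, y = 5/3, and each block update has a closed form, mirroring how the selection and bandwidth blocks are each solved with the other block fixed. The default threshold and iteration cap follow the values given in the experiment settings.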

V. NUMERICAL RESULTS
In this section, we conduct experiments to evaluate the theoretical analyses and test the performance of the proposed algorithm.

A. EXPERIMENT SETTINGS
CNN model settings: For exposition, we consider the learning task of training image classifiers, implemented with a Convolutional Neural Network (CNN) model, namely VGGNet 16 [28]. The training dataset is CIFAR-10, which contains 50000 training images and 10000 testing images in 10 categories. To simulate the heterogeneous data distribution across mobile devices, all data samples are first sorted by label and then divided into 100 shards of size 500, and each local device is assigned 5 shards. The batch size of each local device is set to 50, and the average number of quantization bits per parameter is set to 16. In addition, we adopt the Stochastic Gradient Descent (SGD) optimizer, and the learning rate for the CNN model is set to 0.1. The computation frequency of each local device is set randomly between 2 GHz and 4 GHz.
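The shard-based non-IID split described above can be sketched as follows. The function name, the random assignment of shards to devices, and the synthetic label list standing in for the CIFAR-10 training labels are assumptions for illustration.

```python
import random


def partition_by_shards(labels, num_shards, shards_per_device, seed=0):
    """Sort sample indices by label, cut them into shards, deal out shards."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    shard_size = len(order) // num_shards
    shards = [order[s * shard_size:(s + 1) * shard_size]
              for s in range(num_shards)]
    rng = random.Random(seed)
    rng.shuffle(shards)  # assumed: shards are assigned to devices at random
    num_devices = num_shards // shards_per_device
    return [
        [idx
         for shard in shards[d * shards_per_device:(d + 1) * shards_per_device]
         for idx in shard]
        for d in range(num_devices)
    ]


# Synthetic stand-in for 50000 training labels over 10 classes.
labels = [i % 10 for i in range(50000)]
devices = partition_by_shards(labels, num_shards=100, shards_per_device=5)
print(len(devices), len(devices[0]))
```

With 100 shards and 5 shards per device this yields 20 devices of 2500 samples each, matching the 10 edge nodes with two devices each used in the experiments. Because each shard comes from a label-sorted list, a device sees only a few classes, which is what makes the local datasets non-IID.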
Wireless communication settings: We consider a hierarchical SAHFL communication network consisting of one cloud node and 10 edge nodes. Each edge node connects to two local devices. Both edge nodes and local devices are uniformly distributed within the coverage of the cloud node. The total bandwidth is set to 20 MHz. Moreover, the uplink transmission powers of each local device and each edge node are set to 10 dBm and 24 dBm, respectively. The downlink transmission powers of each edge node and the cloud node are also set to 10 dBm and 24 dBm, respectively. Furthermore, we use the path loss model 128.1 + 37.6 log(d [km]). Meanwhile, the noise power spectral density is set to N 0 = −174 dBm/Hz.
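The wireless parameters above are given in logarithmic units, so they must be converted to linear scale before they can be used in a rate expression. The sketch below assumes the logarithm in the path loss model is base 10 and that the channel gain is the inverse of the path loss. The function names are hypothetical.

```python
import math


def dbm_to_watts(power_dbm):
    """Convert a power in dBm to watts."""
    return 10 ** ((power_dbm - 30) / 10)


def path_loss_db(distance_km):
    """Path loss 128.1 + 37.6 * log10(d), with d in kilometres (assumed base 10)."""
    return 128.1 + 37.6 * math.log10(distance_km)


def channel_gain(distance_km):
    """Linear channel gain, taken as the inverse of the path loss."""
    return 10 ** (-path_loss_db(distance_km) / 10)


device_power_w = dbm_to_watts(10.0)   # local device uplink power
edge_power_w = dbm_to_watts(24.0)     # edge node uplink power
noise_psd_w = dbm_to_watts(-174.0)    # noise power spectral density per Hz
print(device_power_w, edge_power_w, noise_psd_w, channel_gain(0.5))
```

These linear-scale values can be passed directly to a rate formula of the form B log2(1 + P g / (B N0)).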
In the ADMM-BCU algorithm, we set the non-negative penalty parameter $\nu$ to 1. The minimum successive divergence threshold is set to $10^{-4}$. In addition, the maximum iteration number of the ADMM-BCU algorithm is set to 200.

B. SAHFL PERFORMANCE
In this subsection, we present the convergence performance of the proposed SAHFL model. We first introduce the following baselines.
• Random selection: The CNN is trained with randomly selected edge nodes, where we consider two settings with 5 and 8 randomly selected edge nodes, respectively.
• Full selection: Under this circumstance, CNN is implemented by selecting all of the edge nodes.
• Normal edge update: Edge nodes directly use the broadcast cloud model as their updated model.
For simplicity, we assume that the bandwidth for transmissions from the selected edge nodes to the cloud node is uniformly allocated, 5 MHz in total. Likewise, the bandwidth for transmissions from local devices to edge nodes is uniformly allocated, 15 MHz in total. Moreover, we set the weight factor ρ of the proposed algorithm to 0.8. Fig. 4 shows the convergence performance of the proposed CNN-based SAHFL model. From this figure, we find that the VGG-16 network starts to converge at about 70 communication rounds for the random selection scheme with 8 edge nodes, the full selection scheme, and the proposed scheme. However, the random selection scheme with 5 edge nodes shows the worst convergence performance. Intuitively, the more devices are selected, the more data information is provided to the neural network, and thus the faster the convergence. Moreover, due to the non-IID datasets, each node contributes differently. Therefore, random selection may have an adverse effect on the whole model, leading to decreased model accuracy. Overall, the proposed algorithm achieves convergence and accuracy close to those of the full selection scheme, while outperforming the baselines in the scheduling results discussed later. Fig. 5 presents the influence of the edge update model on performance. From this figure, we find that both the training accuracy and the training loss under the elastic edge update model are better than those under the normal edge update model. The fluctuations of these curves are mainly due to the non-IID data. Therefore, we conclude that the elastic edge update model is important for keeping the personalities of the edge nodes.
Term of Eq. (38):
$$\phi_k \frac{Z}{\left(B^{*}_{k,n} \log_2\left(1 + \frac{P_{k,n} g_{k,n}}{B^{*}_{k,n} N_0}\right)\right)^{2}} \left[\log_2\left(1 + \frac{P_{k,n} g_{k,n}}{B^{*}_{k,n} N_0}\right) - \frac{P_{k,n} g_{k,n}}{\left(P_{k,n} g_{k,n} + B^{*}_{k,n} N_0\right) \ln 2}\right]$$

C. THE SCHEDULING PERFORMANCE
In this subsection, we mainly verify the scheduling performance of the proposed algorithm. Fig. 6 shows that the proposed ADMM-BCU algorithm converges quickly and has low computational complexity. Fig. 7 illustrates that a tradeoff exists between data importance and transmission latency. The value of ρ ranges from 0.4 to 0.8 in steps of 0.05. This figure shows that a large value of ρ leads to higher data importance and longer transmission latency, and vice versa. Thus, operators can select a suitable value of ρ according to their specific requirements.
In Fig. 8, we present the relationship among the number of selected edge nodes, data importance, and latency under various weight factors. Fig. 8(a) shows the number of selected edge nodes and the total data importance for different weight factors under various algorithms. From this subfigure, the number of selected edge nodes increases with the weight factor ρ. When ρ is small, i.e., when few edge nodes are associated, the proposed algorithm has a lower data importance than the random selection scheme. As the number of associated edge nodes increases, the situation reverses, as explained for Fig. 4. However, the full selection scheme always has the highest data importance at the cost of higher latency, as shown in Fig. 8(b). Fig. 8(b) shows that the full selection scheme suffers the highest latency, while the proposed algorithm has the lowest latency after scheduling. The transmission latency is much lower than the total latency, which means that the training time dominates. Moreover, the transmission latency may still not meet the requirements of ultra-low-latency mobile edge network devices. In this case, the wireless bandwidth can be enlarged through resource management technologies.

VI. CONCLUSION
This work proposes a novel SAHFL framework that consists of local, edge, and cloud nodes to provide communication-efficient services for mobile edge networks. Specifically, homogeneous devices are associated with the same edge node, so we adopt a synchronous aggregation model at the edge nodes. In contrast, for the heterogeneous edge aggregation models, we introduce a semi-asynchronous aggregation model at the cloud node, where a subset of the fastest-training edge models is uploaded at each iteration. Moreover, we investigate an edge-cloud update method to keep the personalities of the edge nodes. We propose a joint edge node association and resource allocation strategy, which exhibits a tradeoff between training accuracy and transmission latency. A distributed ADMM-BCU algorithm is adopted to solve the resulting MINLP problem. Numerical results show that the proposed scheme can accelerate the training process and improve performance for mobile edge networks.

APPENDIX A PROOF OF THEOREM 1
According to (2), (8), and (15), the global aggregation model of the cloud server can be rearranged in terms of the local models. Since the edge model undergoes an elastic edge update when it is broadcast to devices, the edge model update follows from (18), (19), and (20). According to Assumption 1 and Assumption 2, the twice continuously differentiable function $F(w)$ satisfies $\delta I \preceq \nabla^{2} F(w) \preceq L I$.
Considering the second-order Taylor expansion of $F(w^{t+1}_c)$, step (a) stems from the fact that $\nabla^{2} F(w) \preceq L I$. By setting $0 \le \eta \le \frac{1}{L}$, we obtain (48), where step (b) stems from equation (45).
Step (c) obtains from the fact that 0 ≤ = (∇F(w t c )) T w t c + k∈S t+1 |D k,n |(w t k − η∇F k,n (w t k,n )) By applying Assumption 2, it follows that This ends the proof.