
Decentralized Federated Learning Over Slotted ALOHA Wireless Mesh Networking


The GA presents the research objectives, methodologies, and findings of our work, which involves the implementation of Federated Learning without a central server using slotted ALOHA wireless mesh networking.


Abstract:

Federated Learning (FL) presents a mechanism to allow decentralized training for machine learning (ML) models, inherently enabling privacy preservation. Classical FL is implemented as a client-server system, known as Centralised Federated Learning (CFL). CFL has inherent challenges, since all participants need to interact with a central server, resulting in a potential communication bottleneck and a single point of failure. In addition, it is difficult to have a central server in some scenarios due to implementation cost and complexity. This study aims to use Decentralized Federated Learning (DFL) without a central server, through one-hop neighbours. Such collaboration depends on the dynamics of communication networks, e.g., the topology of the network, the MAC protocol, and both large-scale and small-scale fading on links. In this paper, we employ stochastic geometry to model these dynamics explicitly, allowing us to quantify the performance of the DFL. The core objective is to achieve better classification without sacrificing privacy while accommodating networking dynamics. In this paper, we are interested in how such topologies impact the performance of ML when deployed in practice. The proposed system is trained on the well-known MNIST dataset for benchmarking, which contains 60K labelled images, each of size 28\times 28 pixels, and 1000 random samples of the MNIST dataset are assigned to each participant’s device. The participants’ devices implement a CNN model as a classifier. To evaluate the performance of the model, a number of participants are randomly selected from the network. Due to randomness in the communication process, these participants interact with a random number of nodes in the neighbourhood to exchange model parameters, which are subsequently used to update the participants’ individual models. These participants connected successfully with a varying number of neighbours to exchange parameters…
Published in: IEEE Access ( Volume: 11)
Page(s): 18326 - 18342
Date of Publication: 20 February 2023
Electronic ISSN: 2169-3536

SECTION I.

Introduction

In recent years, the number of Internet of Things (IoT) and smart wearable devices has increased rapidly across several vertical domains (e.g., wearables, home automation systems, smart glasses, health monitors, health-fitness trackers, smart grids, etc.). These devices use a variety of wireless technologies and thus manifest different topological properties. For instance, LoRa deployments support a one-hop connection to the gateway with a distance-dependent spreading factor, manifesting a star topology. In contrast, Thread, Zigbee, and BLE Mesh manifest a mesh topology. Mesh deployment for local connectivity is gaining extra momentum with the rapid evolution of standards.

The huge amount of data that sensors and edge devices collect is critical. It can be gathered and analyzed using modern ML techniques to build a classification model that can learn, predict, and meet the end user’s requirements optimally. ML algorithms can enable various applications, including controlling and monitoring home appliances, controlling autonomous vehicles, and monitoring health parameters of the elderly such as heart rate and fall detection. However, the data from an edge device may carry very personal information. Thus, data privacy and security are significant challenges, as users usually do not allow sensitive information and data to be shared by placing it all in one central location.

Federated Learning (FL) [1] has recently emerged to address this issue and meet participants’ essential requirements for privacy and data security. FL is a new paradigm for building models in distributed ML setups that offers the opportunity to learn a model from multiple disjoint sensitive local datasets while keeping user data private through distributed training [1], [2].

FL has attracted much attention because of its ability to preserve the privacy of clients’ data by sharing only the locally trained model parameters instead of the local data itself. Figure 1 shows a reference deployment scenario for a healthcare application, in which the Community Hospital, the Research Medical Centre, and the Cancer Treatment Centre each train local models on their private data and share only the model parameters to create a global model that improves system performance.

FIGURE 1. A Federated learning approach.

Thus, FL performs the aggregation and analysis of the local models and updates the models on the participants’ devices or servers without sharing any device’s data, keeping private data protected. Each participating device trains a model on its local data, usually using classification or regression algorithms (e.g., a Deep Neural Network (DNN)), and shares only the model parameters with a central server. The FL central server then aggregates the parameters of these local models to create a global model, and broadcasts the global model to update the local models. The system iterates this procedure until it achieves a convergence state [3]. This paradigm of FL is also known as Centralized Federated Learning (CFL).

A. Motivation

The data generated by IoT devices and smartphone terminals has become essential for driving intelligent services through Machine Learning (ML) applications. Various applications ranging from healthcare to autonomous vehicles are rapidly deploying IoT solutions. FL is expected to offer more secure, shared services for a wider range of applications, helping to support the steady growth of distributed ML [4]. Although CFL systems are indeed promising, they face several limitations and challenges because they require a central FL server. Efficient communication is a critical challenge that needs to be addressed in FL to ensure that all participating devices are connected and that performance is not compromised [5]. In particular, CFL is hard to implement in scenarios where a reliable and robust central server is difficult to find (e.g., a fully self-driving and autonomous network). Moreover, the CFL network faces a single point of failure and a communication bottleneck at the central server. Therefore, this research studies a Decentralized Federated Learning (DFL) approach to overcome these challenges. In the DFL model, all neighbouring devices share their model parameters directly in a peer-to-peer manner. Each device then acts as a server by aggregating the parameters from its neighbours and averaging them to update a sub-global model that can be shared with others within one hop of a communication link.

The aim of this paper is to evaluate the feasibility of implementing the DFL approach under spatio-temporal dynamics. DFL algorithms are implemented using clusters of mesh networking groups located in different environments. Our system model translates into two practical deployment modalities: 1) autonomous and decentralized IoT networks, formed by limited-capability IoT and edge devices where decentralized FL is intrinsically required, for instance networks of wearable IoT devices on battlefields or in remote hospitals; and 2) edge/fog-assisted IoT networks, hierarchical networks where the data from IoT devices is collected and processed by edge/fog gateways [6], which then form a mesh network where DFL must be implemented; typical examples are agricultural IoT setups [7] and other smart-city IoT applications. Both cases map onto the system model considered in this paper. In addition, this study implements an FL network based on the slotted ALOHA protocol, as a sub-optimal MAC protocol, to evaluate the network’s performance and optimal configuration under the proposed setup.

The combination of DFL with slotted ALOHA mesh networking is proposed to preserve users’ privacy, increase protection for confidential data, improve the prediction accuracy of the implemented algorithms, and build a robust system. This system can help us exploit a greater share of the users’ data in training a global model that improves the local model on each participating device, without sharing anything with other participants except the local model parameters.

B. Key Contributions

Only limited studies have been conducted on FL beyond the central-server setting. The objectives of this study are to introduce and simulate the wireless communication stage between the devices in both Centralized and Decentralized FL, based on a qualitative examination of the CFL approach with a central server and the DFL approach over a Wireless Mesh Network (WMN) without one. Furthermore, this study simulates the communication network for the CFL and DFL models in order to design a robust wireless communication network during training, which helps evaluate how the model performs under different network conditions, such as congestion and interference.

C. Paper Organization

The rest of the paper is organized as follows. In Section II, related work on the FL approach is provided. Background and challenges for DFL over WMN are presented in Section III. Section IV introduces the system model in terms of a theoretical analysis of device communication in the network and learning metrics for DFL over WMN. The learning criteria for DFL are demonstrated in Section V. The proposed framework’s simulation and results are summarized in Section VI. Finally, a summary of this research and future work is presented in Section VII.

SECTION II.

Related Work

Many recent papers have investigated the fundamentals of CFL algorithms [8], [9], [10], [11], [12]. CFL and its central-server algorithm, Federated Averaging (FedAvg), were first proposed and implemented in [1]. The FedAvg algorithm creates a global model by averaging the parameters aggregated from the participants [1].

In [8], a comprehensive survey of CFL for mobile-edge networks was presented. The authors examined critical implementation issues, existing solutions, and potential applications of CFL in IoT and mobile edge networks. In addition, some existing limitations and challenges of CFL were highlighted, such as the difficulty of aggregating sufficient data, the heterogeneous data distributions of real applications, and the theoretical analysis of device communication and convergence. The work in [9] reviewed the existing CFL approaches, the challenges in implementing CFL, and future research directions. In [10] and [12], the authors surveyed CFL implementations, devised a taxonomy, and reviewed the currently proposed solutions and their challenges in the CFL framework, presenting the essentials of preserving privacy and checking fairness in CFL.

The study in [13] conducted a thorough and comprehensive examination of the architecture, design, and deployment of FL, comparing it to centralized and distributed on-site ML-based systems. It also discussed the challenges and potential future research directions in FL and, based on a thorough literature review, presented classifications of FL topics and research fields, including taxonomies for its important technical and emerging aspects, such as the core system model and design, application areas, privacy and security, and resource management.

The authors in [11] examined the “In-Edge-AI” model for edge networks, which allows efficient collaboration between terminal devices and servers to exchange learning-model parameter updates. They explored two use scenarios, edge caching and computation offloading, and, to efficiently support these scenarios, trained a double deep Q-learning (DDQN) model via CFL. Lastly, the authors in [8], [9], [10], [11], and [12] addressed several practical issues in CFL, such as the ability of mobile devices to handle heavy computation, and the power consumption and battery life needed to stay connected to a central server. Furthermore, CFL raises reliability concerns because it may cease to function if the aggregation server fails (e.g., due to a malicious attack or physical fault). Moreover, training CFL models over IoT networks requires many communication resources to allow participants to communicate with a central server [14].

Most existing DFL systems are based on gossiping schemes, and the number of neighbours in the learning process is chosen regardless of communication challenges, end-user capability, and network capacity. For instance, the works in [15] and [16] implement a classic DFL algorithm that allows a user to aggregate model parameters from an estimated number of neighbouring devices, and Ramanan and Nakayama [17] propose an alternative approach that uses a blockchain-based FL scheme to aggregate updates across the participants’ devices. However, these approaches suffer from several limitations related to the communication constraints of real environments, the data size, and the terminal capability (i.e., energy consumption and computation cost) of blockchain-based transactions. In summary, we list existing works on FL-related topics alongside our paper’s contribution in Table 1.

TABLE 1. Existing Works on FL-Related Topics With Our Paper’s Contribution.

SECTION III.

Background and Challenges for DFL Over WMN

The implementation of the DFL approach over a WMN is expected to reduce communication costs, avoid the single-point-of-failure issue of CFL, and provide innovative capabilities in a range of domains, including healthcare systems (e.g., monitoring physiological data like heart-rate variability [18] to classify various cardiac pathologies), industry, and smart homes [19], [20], [21].

In addition, the combination of DFL and WMN using mesh protocols will likely help preserve privacy, guarantee a robust network, and improve device battery life by reducing communication costs. Instead of transferring data from the device to a central server, which requires large bandwidth and consumes significant power on the edge device, the decentralized approach minimizes these back-and-forth journeys of data. This decentralized fashion is implemented by processing the data on the edge device and communicating with the other neighbours over the mesh networking links to exchange and update the model parameters.

This paper simulates and implements the DFL model over WMN system protocols. The model and system performance are evaluated by training on a dataset divided into training, validation, and test data, with prediction accuracy, communication cost, and latency as the performance metrics. To date, DFL and WMN protocols have been implemented in applications only separately.

This research proposes integrating DFL and WMN into one intelligent system optimized for robust communication networks that could be applied in many IoT applications to ensure participant privacy preservation and data security. Although DFL and WMN have impressive characteristics and features, they present several challenges that can influence the model’s accuracy. The following is a short summary of those limitations:

  1. Coexistence: the network is designed for low-power IoT devices under IEEE 802.15.4-2006, and the WMN would need to coexist with other technologies in IoT systems rather than replace them [32].

  2. Convergence speed: DFL algorithms usually adopt a peer-to-peer single-layer architecture in which each participant collects and aggregates the local model updates of its one-hop neighbours, much as the server does in a CFL architecture. With such a multi-hop architecture, the wireless routing paths between participants can easily become saturated, resulting in slower convergence [33].

  3. Imbalance: the amount of data varies across participants, resulting in different local training data quality.

  4. Lack of stability and flexibility in communication networks with a massive number of devices in real-time applications.

The communication stage is one of the four main steps in the learning process; it cannot be neglected, yet most researchers do not treat it analytically. In this paper, this stage is addressed in detail. We propose using mesh networking to maximize the communication stage’s flexibility and the channel’s capacity during the learning process. The motivation is that DFL algorithms can efficiently update the terminal edge devices with parameters through the Thread protocol or other mesh networking protocols (e.g., ZigBee or Bluetooth). This allows us to design and develop a global model that can precisely analyze end-users’ data without sharing the data with a central server or any other device in the network. In other words, the data stays protected locally and never leaves the device itself, which achieves personalization, guarantees high Quality of Service (QoS), and enhances the performance of devices in IoT applications.

To the best of our knowledge, this is the first work that combines DFL and WMN using the slotted ALOHA protocol. The results verify the intuition that implementing DFL over mesh networks offers more flexibility, since no central server is required and more communication channels are available, so more participants can be involved in the learning process in the form of neighbour groups. The rest of this paper introduces the wireless communication characteristics of mesh networking and the DFL criteria; the wireless communication constraints are considered, and the CFL and DFL models are implemented in simulation.

A. Traditional Machine Learning (ML) on Edge Devices

Different kinds of ML and deep-learning algorithms are used in various proposals. For instance, Convolutional Neural Network (CNN) algorithms are powerful tools widely used in image classification and other classification problems [34], since CNNs have a proven ability to achieve high accuracy and can efficiently learn from thousands of images. To implement CFL and DFL, the local ML algorithms must be embedded in the terminal devices (participants) so that each trains on its local data before sharing the parameters with the server (CFL) or with the neighbours (DFL). The challenges and opportunities of ML-enabled edge devices have been addressed in many recent papers [35], [36] and are out of the scope of this paper. In this research, the CFL and DFL models are implemented for a classification problem, and a CNN is the main algorithm used to train the local models in the proposed system.

B. Centralized Federated Learning (CFL)

The main objective of CFL systems is to train, in coordination with a central server for model aggregation, a shared global model from participating devices that act as local learners. Figure 2 shows the fundamental CFL architecture and the four main steps of training a CFL network, which are iterated until convergence is reached [12]:

  1. Local learning where each edge device uses its local dataset to train the model locally and update the parameters of the ML model (e.g., neural biases and weights).

  2. Upload model parameters to the central server: participants upload (transmit) their parameters to the central server via communication channels.

  3. Global aggregation: the central server aggregates the local models’ parameters from those successfully received to update a new version of the global model on the server.

  4. Download (broadcast) and synchronize the devices with the latest global model updates [14], then return to step 1 and repeat until the system converges.
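For illustration, the following minimal sketch shows steps 2–4 in Python with NumPy: each client’s model is a list of per-layer weight arrays, and the server averages them to form the new global model. The client count, layer shapes, and the fedavg helper name are illustrative assumptions, not the authors’ implementation.

```python
import numpy as np

def fedavg(client_weights):
    """Step 3 (global aggregation): average the per-layer parameter
    arrays received from the participating clients."""
    num_clients = len(client_weights)
    return [sum(layers) / num_clients for layers in zip(*client_weights)]

# Toy example: three clients, each holding one 3x3 kernel and a bias vector.
rng = np.random.default_rng(0)
clients = [[rng.standard_normal((3, 3)), rng.standard_normal(10)]
           for _ in range(3)]
global_model = fedavg(clients)                                   # step 3
downloads = [[w.copy() for w in global_model] for _ in clients]  # step 4
```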

FIGURE 2. CFL framework over an IoT network.

In the CFL process, each device’s local algorithm uses an optimization technique, such as Stochastic Gradient Descent (SGD), to update the model iteratively. The global model then emerges from aggregating the local parameters from participants’ devices, which can be weighted according to the perceived quality of the devices’ updates [34]. One crucial property of CFL is that participant user data never transfers between the devices and the server, which reduces communication costs and data-sharing privacy concerns. However, due to the central server node, the CFL system runs into scalability issues: even if the server node’s hardware and software capabilities are optimized, its performance will not keep up when thousands of client nodes join [37]. Communication bottlenecks may appear as traffic grows exponentially and the system becomes overburdened. Furthermore, accessing a central node may not be possible in some scenarios, for instance self-driving vehicles and high-mobility sensor systems. Decentralized architectures have recently been proposed to avoid communication bottlenecks and protect data privacy [38], [39]: the centralized server is removed, and each participant communicates only with its one-hop neighbours in its local area, exchanging local models and updating parameters [40].

C. Decentralized Federated Learning (DFL)

As shown in Figure 3, the DFL framework does not need a central server to coordinate the training tasks and contains only terminal participants (nodes) [37]. The idea is that each participant exchanges parameter updates with its neighbours in a peer-to-peer manner. Besides the local model algorithm, the FedAvg algorithm is employed on each terminal participant to create its own global model in the server-free DFL approach. Each participant trains on its local data and averages its model with the parameters aggregated from the selected neighbours using FedAvg, then broadcasts the updated global model to the neighbours at each iteration. The same procedure is applied at all other participants until the system converges.

FIGURE 3. DFL framework over an IoT WMN.

In [41], the authors proposed the Combo algorithm, in which participants send and average segments of the models’ parameters to reduce the required communication bandwidth without affecting system performance or convergence rate. Even though the proposed DFL algorithms overcome some of the problems of FL systems that require a central server, they use model averaging to fuse models at the local clients, which is not always efficient in heterogeneous data scenarios. Each client’s local model parameters are updated toward its local optimum, so averaging the parameters from different clients yields the average of the clients’ local optima. The optimum of each participant’s loss function may be far from the others’, and also far from the global optimum, because different participants own different sets of training data. Those datasets typically have different distributions or even no overlap, which is defined as data heterogeneity [42], [43], [44].

Furthermore, most prior works do not consider the wireless environment of the communication network, so they do not account for wireless impairments caused by channel fading, link blockages, and wireless interference. Most researchers choose the number of participants by estimation without considering the constraints of the communication medium, which is not always efficient in terms of reliability and flexibility for real-time applications (e.g., AV and UAV networks).

Therefore, this study focuses on designing a model based on the gossip method [45] that can deal with decentralized approaches and heterogeneous datasets, and it precisely analyzes the communication process between the participants in the network, considering the interference and noise in the transmission medium, to simulate a realistic scenario of the communication between terminal devices during the learning process.

D. Wireless Mesh Networking (WMN)

Nowadays, many access points have overlapping coverage areas, and almost every traditional wireless network has to be connected to the wired network. In this scenario, installing IoT devices is costly and extremely difficult. A WMN benefits from its flexibility in connecting the devices within the network and offers a different perspective from non-mesh networks. The connectivity requirements in wireless mesh networks are reduced mainly because the devices within the network have multi-route capability to send and receive packets. The WMN also has a range of advantages, such as self-healing and self-organization, attracting a vast number of investigations and research developments [46]. Furthermore, WMN can reduce the networking cost of innovative home applications using low-profile hardware.

Routing protocols contribute significantly to WMNs, as they help find the best path across multi-hop networks over unreliable wireless media. WMN protocols have been widely investigated to achieve higher throughput, lower latency, and lower power consumption [46], [47]. According to related research, ZigBee, Bluetooth Low Energy (BLE), Z-Wave, and Thread are the protocols most commonly used in wireless mesh-networking applications. Each protocol has its own specification for particular implementations depending on user requirements. For instance, the Z-Wave protocol is advantageous for long-range coverage because of its low-frequency band (900 MHz) compared with the 2.4 GHz band of the others (i.e., Bluetooth and Thread). According to [48], all protocols achieve similar performance (i.e., latency and throughput) for small networks and small payloads; by contrast, for large mesh networks with multiple hops between transmitter and receiver, the Thread protocol achieves better latency and efficiency.

SECTION IV.

System Model

A. Communication in WMN

In this paper, a typical receiver is considered that is connected to a corresponding desired transmitter. A Rayleigh fading channel is adopted for the small-scale path-loss model, complemented with a single-slope large-scale path loss. Hence, the received power at the typical receiver from the desired transmitter is P_{k}h_{k0}d_{k0}^{-\alpha } [49], where P_{k} is the transmit power of the desired transmitter device k , h_{k0} is the fading coefficient of the channel from device k to the target destination, d_{k0} is the distance between the desired transmitter k and the corresponding receiver, and \alpha \ge 2 is the path-loss exponent.

The signal can be correctly decoded at the typical receiver if the corresponding SINR (Signal to Interference plus Noise Ratio) is higher than a certain threshold T_{k} . Therefore, the probability that SINR\ge T_{k} is defined as:\begin{equation*} P(SINR\ge T_{k}) =P\left ({\frac {P_{k}h_{k0}d_{k0}^{-\alpha }}{\sum _{i\in \varphi }I_{i}+N_{0}}\ge T_{k}}\right ) \tag{1}\end{equation*}

where I_{i} is the interference from device i in the network (I_{i}=P_{i}h_{i0}d_{i0}^{-\alpha }a_{i} ), with i=1,2,\ldots,M , where M is the total number of active devices within the desired receiver’s coverage area and i\in \varphi , with \varphi representing the set of all participant devices within the whole target area. Here a_{i} is a binary random variable that defines the state of the device, whether transmitting or ready to receive, such that P_{A}=P\{a_{i}=1\} is the probability that the device is transmitting and 1-P_{A}=P\{a_{i}=0\} the probability that it is ready to receive and not transmitting.

The proposed IoT network is assumed to have a small thermal noise power variance N_{0} in comparison with the cumulative interference power (i.e., interference-limited network). Therefore, the noise effect in the network is negligible, and so (1) can be re-written as:\begin{align*} P(SIR\ge T_{k}) & \cong P\left ({\frac {P_{k}h_{k0}d_{k0}^{-\alpha }}{\sum _{i\in \varphi }I_{i}}\ge T_{k}}\right ) \tag{2}\\ & \cong P\left ({h_{k0}\ge \frac {T_{k}d_{k0}^{\alpha }\sum _{i\in \varphi }I_{i}}{P_{k}}}\right ). \tag{3}\end{align*}


Since the proposed IoT device network has Rayleigh fading channels, for the sake of simplicity the h_{k0} term is assumed to be an exponentially distributed random variable with unit mean. The probability density function (PDF) of an exponential random variable X with unit mean is f_{X}(x)=\text {exp}(-x) , and thus (3) can be reformulated as:\begin{align*} P(SIR\ge T_{k}) & \cong \mathbb {E}_{a_{i}}\left(\mathbb {E}_{h_{i0}}\left(\int _{\frac {T_{k}d_{k0}^{\alpha }\sum _{i\in \varphi }I_{i}}{P_{k}}}^{+\infty }e^{-x}dx\right)\right) \tag{4}\\ & \cong \mathbb {E}_{a_{i}}\left(\mathbb {E}_{h_{i0}}\left(\text {exp}\left(-\frac {T_{k}d_{k0}^{\alpha }\sum _{i\in \varphi }I_{i}}{P_{k}}\right)\right)\right) \tag{5}\\ & \cong \mathbb {E}_{a_{i}}\left(\mathbb {E}_{h_{i0}}\left(\prod _{i\in \varphi }\text {exp}\left(-\frac {T_{k}d_{k0}^{\alpha }I_{i}}{P_{k}}\right)\right)\right) \tag{6}\\ & \cong \mathbb {E}_{a_{i}}\left(\mathbb {E}_{h_{i0}}\left(\prod _{i\in \varphi }\text {exp}\left(-\frac {T_{k}d_{k0}^{\alpha }P_{i}h_{i0}d_{i0}^{-\alpha }a_{i}}{P_{k}}\right)\right)\right) \tag{7}\end{align*}

where \mathbb {E}_{a_{i}} is the expectation with respect to the random variable a_{i} and \mathbb {E}_{h_{i0}} denotes the expectation over the fading coefficients from the devices i to the desired receiver.

The Bernoulli distribution property can be used to simplify (7): \mathbb {E}_{a_{i}}\left(\text {exp}(a_{i}x)\right)=1-P_{A}+P_{A}\,\text {exp}(x) , and, according to [49], \mathbb {E}_{h_{i0}}\left(\text {exp}(-h_{i0}y)\right)\cong 1/(1+y) .

Consequently, the probability of successful transmission for the participant devices within the network area can be explicitly obtained as:\begin{equation*} P(SIR\ge T_{k}) \cong \prod _{i\in \varphi }\left(1-P_{A}+\frac {P_{A}}{1+T_{k}\left(\frac {d_{k0}^{\alpha }}{d_{i0}^{\alpha }}\right)\gamma _{ki}}\right) \tag{8}\end{equation*}

where \gamma _{ki}=P_{i}/P_{k} is the power ratio.
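A direct numerical reading of (8) is sketched below; the interferer geometry, path-loss exponent, and function name are illustrative assumptions used only to show how the closed form is evaluated.

```python
import numpy as np

def success_probability(T_k, P_A, d_k0, d_i0, gamma_ki, alpha=4.0):
    """Evaluate (8): the probability that the typical receiver decodes
    its desired transmitter despite slotted-ALOHA interferers.
    d_i0 and gamma_ki are arrays over the interferers i in phi."""
    ratio = (d_k0 / np.asarray(d_i0)) ** alpha       # (d_k0 / d_i0)^alpha
    terms = 1 - P_A + P_A / (1 + T_k * ratio * np.asarray(gamma_ki))
    return float(np.prod(terms))

# Illustrative geometry: 20 interferers 50-500 m away, equal powers.
rng = np.random.default_rng(0)
p = success_probability(T_k=10 ** (-10 / 10),        # -10 dB threshold
                        P_A=0.5, d_k0=30.0,
                        d_i0=rng.uniform(50.0, 500.0, 20),
                        gamma_ki=np.ones(20))
print(f"P(SIR >= T_k) ~ {p:.3f}")
```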

B. Achievable Transmission Capacity Over Slotted-ALOHA

The ALOHA protocol is a class of fully decentralized MAC protocols [50] that performs no carrier sensing and makes no attempt to avoid packet collisions. The slotted-ALOHA protocol was introduced to improve utilization of the shared communication medium and reduce the chance of collisions among multiple transmitting devices by synchronizing transmissions to the beginning of discrete timeslots.

In the WMN, the probability that each device sends a packet to its neighbours is P_{A} and the probability that it is ready to receive a packet is (1-P_{A}) ; therefore, the probability that a mesh node receives a packet from one of its neighbours is P_{A}(1-P_{A})P(SIR\ge T_{k}) , where T_{k} is the predefined SINR threshold and P(SIR\ge T_{k}) represents the successful transmission probability. Based on Shannon’s theorem, the achievable transmission capacity increases with transmit power, and a packet can carry up to \text {log}(1+SIR)\,\text {(bits/second/hertz)} of data [49]. Thus, the capacity C_{ALOHA} of the mesh network can be written as:\begin{equation*} C_{ALOHA} =P_{A}(1-P_{A})\text {log}(1+ SIR)P(SIR\ge T_{k}). \tag{9}\end{equation*}


From (8), the outage probability (\theta _{k} ) constraint with respect to P_{A} , T_{k} and \gamma _{ki} can be obtained as follows:\begin{equation*} 1-\prod _{i\in \varphi }\left(1-P_{A}+\frac {P_{A}}{1+T_{k}\left(\frac {d_{k0}^{\alpha }}{d_{i0}^{\alpha }}\right)\gamma _{ki}}\right)\le \theta _{k}. \tag{10}\end{equation*}


To simplify (9), the natural logarithm is applied to find the maximum achievable transmission capacity with respect to P_{A} . Hence, we use the auxiliary function f(P_{A})=\text {ln}\left(C_{ALOHA}(P_{A})\right) to simplify the search for the optimal P_{A} that maximizes C_{ALOHA} . The problem formulation can then be written as:\begin{align*} \mathop {\mathrm {arg\,max}}_{P_{A}}f(P_{A}) &=\mathop {\mathrm {arg\,max}}_{P_{A}}\biggl (\text {ln}(P_{A})+\text {ln}(1-P_{A}) \\ &\quad +\,\text {ln}(\text {log}(1+T_{k})) \\ &\quad +\sum _{i\in \varphi }\text {ln}\left(1-P_{A}+\frac {P_{A}}{1+T_{k}\left(\frac {d_{k0}^{\alpha }}{d_{i0}^{\alpha }}\right)\gamma _{ki}}\right)\biggr ) \tag{11}\\ \textrm {s.t.}~\varepsilon _{k} &\le \sum _{i\in \varphi }\text {ln}\left(1-P_{A}+\frac {P_{A}}{1+T_{k}\left(\frac {d_{k0}^{\alpha }}{d_{i0}^{\alpha }}\right)\gamma _{ki}}\right). \tag{12}\end{align*}


With the Taylor series expansion \text {ln}(1-x)=-x-x^{2}/2-x^{3}/3-\ldots \cong -x for |x|<1 , (12) can be simplified to:\begin{equation*} \varepsilon _{k}\le \sum _{i\in \varphi }\left(-P_{A}\left(1-\frac {1}{1+T_{k}\left(\frac {d_{k0}^{\alpha }}{d_{i0}^{\alpha }}\right)\gamma _{ki}}\right)\right). \tag{13}\end{equation*}

Thus, (11) can be reformulated as:\begin{align*} \mathop {\mathrm {arg\,max}}_{P_{A}}f(P_{A}) &=\mathop {\mathrm {arg\,max}}_{P_{A}}\Bigl (\text {ln}(P_{A})+\text {ln}(1-P_{A}) \\ &\quad +\,\text {ln}(\text {log}(1+T_{k}))\Bigr ) \\ &\quad +\sum _{i\in \varphi }\left(-P_{A}+\frac {P_{A}}{1+T_{k}\left(\frac {d_{k0}^{\alpha }}{d_{i0}^{\alpha }}\right)\gamma _{ki}}\right). \tag{14}\end{align*}

Hence, (14) can be written as:\begin{align*} \mathop {\mathrm {arg\,max}}_{P_{A}}f(P_{A}) &= \mathop {\mathrm {arg\,max}}_{P_{A}}\bigl (\text {ln}(P_{A})+\text {ln}(1-P_{A}) \\ &\quad +\,\text {ln}(\text {log}(1+T_{k}))-P_{A}f_{2}\bigr ) \tag{15}\end{align*}

where f_{2}=\sum _{i\in \varphi }\left[1-\frac {1}{1+T_{k}\left(\frac {d_{k0}^{\alpha }}{d_{i0}^{\alpha }}\right)\gamma _{ki}}\right] . A partial derivative of f(P_{A}) is then taken to find the P_{A} that maximizes f(P_{A}) in (15):\begin{equation*} \frac {\partial f(P_{A})}{\partial P_{A}} =\frac {1}{P_{A}}-\frac {1}{1-P_{A}}-f_{2}.\end{equation*}

The transmit probability under the slotted ALOHA protocol at \frac {\partial f(P_{A})}{\partial P_{A}}=0 , denoted P_{A}(0) , is obtained as:\begin{equation*} P_{A}(0) =\frac {f_{2}+2-\sqrt {f_{2}^{2}+4}}{2f_{2}}\end{equation*}

so \frac {\partial f(P_{A})}{\partial P_{A}}>0 when 0\le P_{A}<P_{A}(0) , and \frac {\partial f(P_{A})}{\partial P_{A}}<0 when P_{A}(0)<P_{A}<1 . Consequently, for a constant threshold T_{k} and power ratio \gamma _{ki} , f(P_{A}) increases as P_{A} grows from 0 to P_{A}(0) and decreases when P_{A}>P_{A}(0) .

A special case is adopted by setting the system’s outage probabilities equal to their SINR threshold values to find the maximum transmission capacity based on the slotted ALOHA protocol for wireless mesh networks. Then (9) can be reformulated to state the maximum achievable transmission capacity as follows:\begin{align*} \text {max}(C_{ALOHA}) &=P_{A}(1-P_{A})\text {log}(1+T_{k})(1-\theta _{k}) \\ &=P_{A}(1-P_{A})\text {log}(1+T_{k}) \\ &\quad \times \prod _{i\in \varphi }\left(1-P_{A}+\frac {P_{A}}{1+T_{k}\left(d_{k0}^{\alpha }/d_{i0}^{\alpha }\right)\gamma _{ki}}\right). \tag{16}\end{align*}


From this analysis, intuition is provided as to what extent the network parameters T_{k} , P_{A} and \gamma _{ki} affect the maximum achievable transmission capacity of the slotted-ALOHA mesh network.
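The derivation above reduces to two short computations, sketched here under the same illustrative geometry as the earlier snippet; the helper names are assumptions. f2_term summarizes the interference term defined below (15), optimal_p_a implements the closed-form maximizer P_{A}(0) , and capacity evaluates (9) in bits/s/Hz (using log base 2 for Shannon capacity).

```python
import numpy as np

def f2_term(T_k, d_k0, d_i0, gamma_ki, alpha=4.0):
    """The interference summary f_2 defined below (15)."""
    ratio = (d_k0 / np.asarray(d_i0)) ** alpha
    return float(np.sum(1 - 1 / (1 + T_k * ratio * np.asarray(gamma_ki))))

def optimal_p_a(f2_val):
    """Closed-form maximizer of f(P_A): the root of
    1/P_A - 1/(1 - P_A) - f_2 = 0 that lies in (0, 1)."""
    return (f2_val + 2 - np.sqrt(f2_val ** 2 + 4)) / (2 * f2_val)

def capacity(P_A, T_k, p_success):
    """Slotted-ALOHA transmission capacity per (9), in bits/s/Hz."""
    return P_A * (1 - P_A) * np.log2(1 + T_k) * p_success
```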

C. Latency

The time required to reach a convergence state in any FL model plays a key role in evaluating system performance. Therefore, the average learning latency of the participants will be one of the performance metrics. Latency in the proposed centralized and decentralized FL is defined as the expected time (in seconds) required for the model to complete learning in a typical one-hop mesh communication network. Let R denote the smallest number of iterations needed to meet the convergence criterion. The expected learning latency (T_{total} ) is a function of the computation time (T_{computation} ), which is the time a device needs to run and update the local model, the server-to-participants broadcast communication time (T_{broadcast} ), and the device-to-server communication time (T_{cmm} ) over R iterations for the A_{i} successfully transmitting participants in the learning process:\begin{equation*} T_{total} =R\sum _{s_{i}=1}^{A_{i}}T_{cmm}^{(s_{i})}+R(T_{computation}+T_{broadcast}). \tag{17}\end{equation*}


The total communication time T_{cmm} in real applications varies slightly with the number of participants in each iteration. To simplify the calculation, the network is assumed to have a fixed average number of successfully transmitting participants A_{i} throughout the process in both centralized and decentralized FL.
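Under that fixed-average assumption, (17) is a one-line computation; the timing values below are purely illustrative, not measurements from the paper.

```python
def expected_latency(R, t_cmm, t_computation, t_broadcast):
    """Expected learning latency per (17): t_cmm lists the per-participant
    communication times for the A_i successful transmitters in one round."""
    return R * sum(t_cmm) + R * (t_computation + t_broadcast)

# Assumed example: 65 participants at 20 ms each, 1.5 s local training,
# 0.1 s broadcast, R = 50 rounds.
print(expected_latency(R=50, t_cmm=[0.02] * 65,
                       t_computation=1.5, t_broadcast=0.1))
```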

D. Accuracy and Loss

Accuracy and loss are the two main model metrics applied to adjust the model weights during training and to measure system performance when optimizing a model (e.g., a convolutional neural network) to solve, for example, a face-recognition problem. Accuracy is calculated as follows:\begin{equation*} \text {Accuracy} =\frac {\mathrm {Number~of~Correct~Predictions }}{\mathrm {Total~Number~of~Predictions}}. \tag{18}\end{equation*}


Loss measures the difference between the actual output value and the value predicted by the implemented model. In classification models whose outputs are arrays of probabilities between 0 and 1, the most common loss function is cross-entropy, also known as logistic loss, log loss or logarithmic loss. The probability of each predicted value is weighed against the actual desired output (0 or 1), and the loss reflects how far the prediction is from the expected value for each sample: large differences incur a large loss, small differences a slight one, so an overall cross-entropy loss of 0 means the model is perfect. The cross-entropy loss function is defined as:\begin{align*} Loss & =-\sum _{j=1}^{m}\sum _{i=1}^{n}y_{(i,j)}\text {log}(p_{(i,j)}), \tag{19}\\ & \text {for $n$ classes and $m$ samples}\end{align*}

where y_{(i,j)} is the actual output and p_{(i,j)} is the softmax probability of the model output for the i^{th} class and the j^{th} sample in the classification problem.

Therefore, the objective is almost always to increase the accuracy and minimize the loss of the FL models or any other implemented models.
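Both metrics are simple to compute from one-hot labels and softmax outputs, as the sketch below shows (the clipping constant is an implementation convenience to avoid log(0), not part of (19); the function names are assumptions).

```python
import numpy as np

def accuracy(y_true, p_pred):
    """Accuracy per (18): fraction of argmax predictions matching labels."""
    return float(np.mean(np.argmax(p_pred, 1) == np.argmax(y_true, 1)))

def cross_entropy_loss(y_true, p_pred, eps=1e-12):
    """Cross-entropy per (19) for m samples and n classes."""
    return float(-np.sum(y_true * np.log(np.clip(p_pred, eps, 1.0))))

# Two samples, three classes: one confident, one hesitant prediction.
y = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
p = np.array([[0.90, 0.05, 0.05], [0.40, 0.50, 0.10]])
print(accuracy(y, p), cross_entropy_loss(y, p))
```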

SECTION V.

The Learning Criterion for DFL

The designed system considers a group of individual participant nodes (i.e., edge devices), where the set \varphi =\{1,\ldots,K\} is randomly located as a spatial point process following a stationary Poisson Point Process (PPP) [51] with intensity \lambda , uniformly distributed over disk cells in a large-scale network. In supervised learning, each node i\in \varphi has access to a dataset \mathfrak {O}_{(i)} consisting of n instance-label pairs of samples (X_{i}^{n},Y_{i}^{n}) , where X_{i}^{n} and Y_{i}^{n} represent the labelled sample input and output for device i , respectively, and n=1,\ldots,k . The samples are subsets of heterogeneous or homogeneous datasets that follow an unknown probability distribution p(x,y) and possibly have non-empty intersections across different sets (i.e., \mathfrak {O}_{(i)} and \mathfrak {O}_{(j)} where i\neq j ). Each instance X_{i}^{n}\in X_{i}\subseteq X , where X_{i} denotes the local instance space of node i and X denotes the global instance space, which satisfies X\subseteq \bigcup _{i\in \varphi }X_{i} .

Similarly, let y denote the set of all possible labels over all the nodes. Examples include y=\{0,1\} for binary classification and y=\mathbb {R} for regression learning.

There are \{X_{i}^{(1)},X_{i}^{(2)},\ldots,X_{i}^{(k)}\} i.i.d. samples at each participant node, and the total number of samples k varies depending on the size of each participant’s local dataset.

All participants’ devices must have the same machine-learning model (e.g., a CNN) with a common weight-parameter matrix \textbf {W} . The main aim of the designed model is to minimize the cross-entropy loss between the expected (actual) output and the predicted output:\begin{equation*} \mathop {\mathrm {arg\,min}}_{\textbf {W}}F(\textbf {W}) \triangleq \frac {1}{A_{i}}\sum _{i\in \varphi }f_{i}(\textbf {W}) \tag{20}\end{equation*}

where F(\textbf {W}) denotes the global loss function, f_{i}(\textbf {W}) represents the local loss for device i , s_{i} represents the set of active devices that transmit successfully to the i^{th} participant at each iteration of the training process, and A_{i} is the cardinality of the set s_{i} :\begin{equation*} s_{i} =1,2,\ldots,A_{i},\quad \forall \,SINR\ge T_{k}.\end{equation*}

The local loss function for the i^{th} device is calculated as the model cross-entropy loss over the local training dataset \mathfrak {O}_{(i)} as follows:\begin{align*} f_{i}(\textbf {W}) & =\frac {1}{M}\sum _{m=1}^{M}l(h_{\textbf {W}}(X_{i}^{m}),y_{i}^{m}), \\ m&=1,2,\ldots,M \tag{21}\end{align*}

where m indexes the mini-batches drawn from the local training dataset, M is the length of the participant’s dataset divided by the batch size \left(\frac {|\mathfrak {O}_{(i)}|}{B_{i}}\right) , and l(h_{\textbf {W}}(X_{i}^{m}),y_{i}^{m}) denotes the cost function for the weight matrix \textbf {W} evaluated on the hypothesis h_{\textbf {W}}(X_{i}^{m}) with data samples X_{i}^{m} . For instance, the hypothesis for simple linear regression is h_{\textbf {W}}(X)=\textbf {W}_{0}+\textbf {W}_{1}X .

At the t^{th} iteration of the DFL process, each device i\in \varphi has a local weight-parameter matrix \textbf {W}_{i}^{(t)} that is updated to maximize the model accuracy and to find an optimal solution to the target problem.

In CFL models, there are two optimizer levels. First, at the local model level, the optimizer in each device updates the local parameters based on the local dataset, typically using the common Stochastic Gradient Descent (SGD) algorithm. Second, at the global model level, a global optimizer uses the parameters aggregated from the participants (i.e., the Federated Averaging (FedAvg) algorithm) to create and update the global model.

Each participant i makes M training passes over its local dataset \mathfrak {O}_{(i)} to update its local model weights via SGD with learning rate (\eta ) and batch size B_{i} , as follows:\begin{align*} {\nabla f}_{i}(\textbf {W}_{i}^{t})&=\frac {1}{M}\sum _{m=1}^{M}\nabla l(h_{\textbf {W}_{i}^{t}}(X_{i}^{m}),y_{i}^{m}), \quad \forall m=1,\ldots,M \text { and } i\in \varphi \tag{22}\\ \textbf {W}_{i}^{t}&:=\textbf {W}_{i}^{t}-\eta _{i}\nabla f_{i}(\textbf {W}_{i}^{t}). \tag{23}\end{align*}


Here \nabla f_{i}(\textbf {W}_{i}^{t}) denotes the gradient matrix obtained from batch B_{i} of device i ’s local samples after M local training steps on the local data at each iteration t . The gradient \nabla f_{i}(\textbf {W}_{i}^{t}) expresses the rate of change of f_{i} with respect to the model parameters \textbf {W} at iteration t .

The total number of local training steps M is the length of the participant’s dataset divided by the batch size (M=|\mathfrak {O}_{(i)}|/B_{i}) , so that the model is trained locally on its dataset before the parameters are shared with neighbours. After each participant has trained its model on its local dataset with suitable hyper-parameters for a certain number of iterations, the model parameters are shared with the other neighbours via a one-hop communication process following the mesh topology.
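The local phase of (22)–(23) can be sketched as follows; grad_fn stands in for backpropagation through the participant’s CNN and is an assumed callable, as are the other names.

```python
def local_sgd_epoch(W, X, y, batch_size, lr, grad_fn):
    """One local pass per (22)-(23): M = len(X) // batch_size mini-batch
    SGD steps on the participant's own data before any sharing.
    grad_fn(W, X_batch, y_batch) must return the gradient of the local
    loss f_i at W."""
    M = len(X) // batch_size
    for m in range(M):
        sl = slice(m * batch_size, (m + 1) * batch_size)
        W = W - lr * grad_fn(W, X[sl], y[sl])  # (23): W := W - eta * grad
    return W
```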

For all participants, the updates are performed simultaneously: each participant receives its neighbours’ weights and gradients and averages them with its local weights and gradients.

In the next step, each participant in the network aggregates the local parameters \textbf {W}_{i}^{t} from the trusted neighbours that satisfied the wireless communication constraints and executes the Federated Averaging (FedAvg) algorithm to obtain a new global model at each iteration t . Then each authorized participant broadcasts the latest global model weight matrix to all participants located within the one-hop mesh network and starts the next iteration. Figure 3 illustrates the proposed DFL network architecture and the learning process steps. The global model, denoted \hat {\textbf {W}}_{i}^{t} , is computed by averaging the local model with the model weights aggregated from the neighbours s_{i} (the set of active devices whose weights were received successfully) at iteration t .

These weights are then applied to the local model using the device dataset \mathfrak {O}_{(i)} and batch size B_{i} to create an updated weight matrix \textbf {W}_{i}^{(t+1)} , which is broadcast to the neighbours in the next iteration, as follows:\begin{align*} \hat {\textbf {W}}_{i}^{t} & =\frac {1}{A_{i}+1}\left(\textbf {W}_{i}^{t}+\sum _{s_{i}=1}^{A_{i}}\hat {\textbf {W}}_{s_{i}}^{t}\right) \tag{24}\\ \hat {\nabla }f(\hat {\textbf {W}}_{i}^{t}) & =\frac {1}{M}\sum _{m=1}^{M}\hat {\nabla }l\left(h_{\hat {\textbf {W}}_{i}^{t}}(X_{i}^{m}),y_{i}^{m}\right); \tag{25}\\ \textbf {W}_{i}^{(t+1)} & =\hat {\textbf {W}}_{i}^{t}. \tag{26}\end{align*}


Then these new parameters \textbf {W}_{i}^{(t+1)} at device i are used to train the local model in the next iteration and are shared with all participants within one-hop communication range; the same procedure repeats at each device until the predefined convergence bound \varepsilon _{k} of the implemented model is achieved. Convergence is determined by measuring the average loss gradient over all active devices (s_{i}=1,\ldots,A_{i} ) that transmitted successfully to the i^{th} device at each iteration, as follows:\begin{align*} \mathop {\mathrm {arg\,min}}_{\textbf {W}}\nabla F_{i}(\textbf {W})&\triangleq \left[\frac {1}{A_{i}+1}\left(\hat {\nabla }f_{i}(\hat {\textbf {W}}_{i}^{t})+\sum _{s_{i}=1}^{A_{i}}\hat {\nabla }f_{s_{i}}(\hat {\textbf {W}}_{s_{i}}^{t})\right)\,\middle |\,A_{i}\ge 1\right]\le \varepsilon _{k}. \tag{27}\end{align*}


The participants communicate and exchange the parameters and the loss-function values over the wireless mesh network in a peer-to-peer manner. The number of successfully transmitting devices (participants) is subject to the wireless communication constraints. Consequently, after receiving the latest global model update, every device i in the network can train its model using the latest update and then evaluate how the model performance improved in terms of loss and accuracy.

To sum up, the DFL process is divided into t iterations. At each iteration, each participant i performs local learning and exchanges its parameters \hat {\textbf {W}}_{i}^{t} in parallel with the cooperating neighbours s_{i} to create a global model at each device, without a central server. The symbol R in the pseudocode represents the total number of iterations the system needs to reach consensus and meet the gradient-divergence condition in (27). Finally, the DFL learning steps are presented in Algorithm 1.

SECTION Algorithm 1

Decentralized Federated Learning

1) All participants are initialized with weights \textbf {W}(0)

2) for each iteration t=1,2,3,\ldots,R do

3) for each device i=1,2,3,\ldots,K do {in parallel}

4) for m=1,2,3,\ldots,M , where M=\frac {|\mathfrak {O}_{(i)}|}{B_{i}} (local training steps)

5) \nabla f_{i}(\textbf {W}_{i}^{t})=\frac {1}{M}\sum _{m=1}^{M}\nabla l(h_{\textbf {W}_{i}^{t}}(X_{i}^{m}),y_{i}^{m})

6) \textbf {W}_{i}^{t}\longleftarrow \textbf {W}_{i}^{t}-\eta _{i}^{t}\nabla f_{i}(\textbf {W}_{i}^{t};B_{i})

7) end

8) from the i^{th} device’s neighbours s_{i}=1,2,\ldots,A_{i} do {in parallel}

9) receive \hat {\textbf {W}}_{s_{i}}^{t},\hat {\nabla }f_{s_{i}}(\hat {\textbf {W}}_{s_{i}}^{t})

10) \hat {\textbf {W}}_{i}^{t}\longleftarrow \frac {1}{A_{i}+1}\left(\textbf {W}_{i}^{t}+\sum _{s_{i}=1}^{A_{i}}\hat {\textbf {W}}_{s_{i}}^{t}\right)

11) \hat {\nabla }f_{i}(\hat {\textbf {W}}_{i}^{t})=\frac {1}{M}\sum _{m=1}^{M}\hat {\nabla }l(h_{\hat {\textbf {W}}_{i}^{t}}(X_{i}^{m}),y_{i}^{m})

12) if \frac {1}{A_{i}+1}\left(\hat {\nabla }f_{i}(\hat {\textbf {W}}_{i}^{t})+\sum _{s_{i}=1}^{A_{i}}\hat {\nabla }f_{s_{i}}(\hat {\textbf {W}}_{s_{i}}^{t})\right)\,\big |\,A_{i}\ge 1\,\le \,\varepsilon _{k}

13) Yes: end process (gradient convergence)

14) No: continue

15) \textbf {W}_{i}^{(t+1)}\longleftarrow \hat {\textbf {W}}_{i}^{t}

16) end

17) end

SECTION VI.

Simulations and Results

A. The Simulation of the Network Communication

The proposed network for the CFL and DFL approaches has many participants distributed over a two-dimensional bounded space. A large-scale circular area with radius R>0 is used for notational simplicity, but a hexagonal one could also be applied, as the outcomes would differ only by a very small constant. The number of participants follows a Poisson distribution with network intensity \lambda over an area A ; in other words, the number of participants within A is a Poisson random variable with mean \lambda A , where \lambda >0 is the network intensity and A=\pi R^{2} is the circular network area.

On the one hand, in the CFL model the central server is placed at the centre point of the two-dimensional bounded space, as illustrated in Figure 4(a), and the participants are randomly distributed within the target area. However, successful transmission and participation in the learning process are subject to the wireless communication constraints (see Section IV) to approximate real-world conditions.

FIGURE 4. (a) The distribution of random participants around the CFL central server. (b) Example of three participants communicating with neighbours in the DFL approach.

On the other hand, in DFL each participant i defines its neighbours through one-hop communication links within its signal coverage, a smaller circle of radius r_{i}<R . For instance, the wireless mesh connections of three DFL participants under Rayleigh fading are illustrated in Figure 4(b).

The positions of the participants are independently and randomly distributed within the circular network area, where the distance from each participant i to the centre point (central server) is d_{k0}\le R . In the simulation, the relationship between the SINR threshold T_{k} and the probability of successful transmission P(SINR\ge T_{k}) is found from the analysis in (1) of Section IV; the total number of participants is a random variable, with network radius R=1000 m and intensity \lambda =0.1 .

The participants are assumed to have an equal transmit power of 0.8 W. Only a finite number of participants can simultaneously exchange and update their parameters under the slotted ALOHA protocol. When the given parameters (i.e., signal power, distance, and interference) were applied to the SINR expression (8), the theoretical results matched the simulation results, and both show that the number of successful transmitter devices decreases as the SINR threshold increases, for different user intensities in the network, as shown in Figure 5.

FIGURE 5. The simulation (markers) and theoretical results (solid lines) of the relationship between the SINR threshold and the probability of success for participants at different network intensities.

In contrast, reducing the SINR threshold allows more participants to be involved in the learning process, but the required bandwidth and the system’s latency also increase. Therefore, the trade-off between the probability of success and the SINR threshold must be examined to achieve high throughput and capacity with acceptable latency.

Consequently, the outcomes of the proposed wireless communication model in Figure 5 confirm the conclusions of the theoretical analysis in (8) and (16): a trade-off between the probability of successful transmission and the SINR threshold is required to satisfy the FL network targets in terms of capacity and the number of users (participants) in the network during the learning process.

B. CFL Setup and Simulation

The simulation settings implement FL with a central server deployed at the centre of the target disk area of radius R . The network is assumed to be a large-scale massive IoT and edge-device network following a PPP with intensity \lambda ; the total number of participants is N, and the distance from any participant i to the server is r_{i}< R .

To simplify the simulation, all participants are assumed to have the same transmit power and are randomly distributed around the centre within the network area. The system is trained using the Python and TensorFlow API frameworks on the well-known MNIST dataset, with a local algorithm on the edge devices performing digit recognition from handwritten images. While this is a simple and well-understood problem, it is used here to illustrate the principle. The MNIST dataset contains 60K labelled images, each of size 28\times 28 pixels. The dataset was shuffled and distributed randomly among the participants, with each participant receiving 1000 random samples (a data-partitioning sketch follows the two cases below). The edge devices implement a CNN as the classifier model used to evaluate the outcomes and the system performance metrics. Large-scale massive IoT networks are used to validate the algorithm, and thus the network has two cases:

  1. Low-mobility scenarios, where the set of successfully transmitting participants can remain the same across all iterations until the system converges.

  2. High-mobility scenarios, where participants move over a wide geographic area and many participants join and leave the network during the learning process, as in autonomous vehicle and Unmanned Aerial Vehicle (UAV) networks. Thus, the number of successful transmissions usually differs in each learning iteration.

This simulation implements the first case, the low-mobility scenario; the second case is left for subsequent work. In both cases, the number of successfully transmitting participants is subject to the communication network constraints (described in Section IV) and to the network parameters, namely the device intensity and the participants’ distribution.
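To make the training setup concrete, the following is a minimal sketch of the data partitioning and local CNN training described above. The exact CNN architecture is our assumption, since the paper does not list the layers; the SGD settings (learning rate 0.01, batch size 32) are those stated later for the local models.

import numpy as np
import tensorflow as tf

(x, y), _ = tf.keras.datasets.mnist.load_data()
x = x[..., None].astype("float32") / 255.0       # 60K images of 28x28x1 in [0, 1]

rng = np.random.default_rng(2)
n_participants, shard_size = 65, 1000
shards = []
for _ in range(n_participants):
    idx = rng.choice(len(x), size=shard_size, replace=False)   # 1000 random samples
    shards.append((x[idx], y[idx]))

def make_cnn():
    # Illustrative architecture; the paper does not specify the layers.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

model = make_cnn()
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# Local training for one participant:
# model.fit(*shards[0], batch_size=32, epochs=1, verbose=0)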

Based on the proposed wireless communication model in VI-A, the system is allocated a target SINR threshold of −10 dB to achieve a higher transmission capacity, a transmit-mode probability of P_{A}= 0.5 per participant, and a device intensity of \lambda =0.1 to increase the number of participants. This yields a probability of successful transmission (P(SINR\ge T_{k} )) of around 80% in the network, as shown in Figure 5.

Consequently, the total number of participants within the target area is 80 devices, of which 65 successfully transmitted in the CFL process (consistent with the roughly 80% success probability, since 80\times 0.8\approx 64 ).

In this study, the CFL system with a central server uses the FedAvg algorithm as the global model optimizer. The server’s main function is to aggregate and average the participants’ parameters to update the global model at each iteration, and then to measure the accuracy and loss of the model.
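A minimal sketch of this server-side FedAvg step, assuming Keras-style per-layer weight lists and uniform weighting, could look as follows:

import numpy as np

def fed_avg(client_weights):
    # client_weights: list over clients of lists of numpy arrays (layer weights).
    # The new global weights are the element-wise average across clients.
    return [np.mean(np.stack(layers), axis=0) for layers in zip(*client_weights)]

# Usage per round, assuming Keras models:
#   new_global = fed_avg([m.get_weights() for m in participant_models])
#   global_model.set_weights(new_global)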

The CFL network is evaluated in terms of accuracy and loss using the cross-entropy cost function. The simulated system of subsection VI-A has 65 successfully transmitting participants under the communication constraints of Section IV.

It can be noted from Figure 6 that the procedure moved rapidly towards the global minimum in the first 50 epochs, with the accuracy increasing from 17% to 90%. After 200 iterations, the model reached the convergence state with an accuracy of 98.1% and a loss of 0.15. The latency of the model was 1850 seconds, the time required to reach the predefined convergence bound. Despite the constraints applied to build a robust CFL network, accounting for communication constraints and realistic environment scenarios in the simulation, the model converged faster and achieved slightly higher accuracy and lower loss than FL baselines that do not consider communication constraints. In other words, such baselines assume a fixed number of participants in the learning process without considering the network challenges, which is unreliable in real applications. In contrast, our proposed CFL model selects participants based on the communication model and accounts for the constraints of real application environments, yielding a trustworthy and robust network while achieving high accuracy and low loss.

FIGURE 6. The accuracy and loss for the CFL network.

Although the system achieved high accuracy and reached convergence after around 200 iterations in the designed simulation, the central server remains a potential single point of failure. In practice, the system could go down entirely if the central server fails or the link to the server is blocked (a communication bottleneck).

C. DFL Setup and Simulation

The simulation setup uses the Python and TensorFlow API frameworks to implement and evaluate the DFL system over a WMN for IoT devices. For an accurate comparison, we retain the previous CFL design in terms of the communication link constraints, the model optimizers and their parameters, low (fixed) mobility, and the same total number of participants in the learning process, but without a central server. The system was evaluated on the following performance metrics: validation accuracy, latency, and convergence speed. The simulation settings are as follows:

  • Training settings

    The participants trained a classification CNN model on the MNIST dataset. As in CFL, each participant has 1000 random samples to train its local model, but in this DFL approach the parameters are shared directly with neighbours to create and update the global model at each device, without the need for a central server.

    The local model of each participant was trained and updated using a stochastic gradient descent (SGD) optimizer with the same hyper-parameters as in the CFL simulation: a learning rate of 0.01 and a batch size of 32.

    Furthermore, the designed system also used two other local model optimizers, the Adam algorithm [52] and the Root Mean Square Propagation (RMSProp) optimizer [53], to observe how the performance of the DFL model is affected by different optimization methods.

  • Network Settings

    In the simulation, the network models the communication stage between the participants’ devices in a peer-to-peer manner over wireless mesh networking, avoiding the communication bottleneck at the central server in the CFL scenario. Each participant successfully communicates and exchanges parameters with a neighbour only if the desired transmitter’s SINR exceeds the target threshold and no collision occurs. The set of successful links varies between participants in an asymmetric manner (see the sketch after this list). For instance, for two participants A and B within the network, participant A may receive parameter updates from participant B, while participant B does not necessarily obtain parameter updates from participant A unless it satisfies the network communication constraints.

  • Comparison Setting

    The DFL outcomes are compared with the CFL outcomes in terms of system performance. Both DFL and CFL trained their models under the same factors (i.e., the dataset, the total number of participants within the network, and the network geographic area) to make them comparable. The accuracy, loss, latency, and convergence speed were measured.

  • Simulation Results

    Based on the successful-transmission conditions and the network capacity, every device within the network formed its one-hop neighbour group to exchange parameters and performed the model training. The results for some random participants in the DFL simulation were recorded as examples to evaluate the system behaviour.

The designed network shows that these recorded participants connected successfully with varying numbers of neighbours (those that meet the communication constraints and are able to exchange parameters and update their global models).
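A minimal sketch of this neighbour formation and the per-device averaging step is given below; succeeded(j, i) is a hypothetical predicate standing in for the SINR-threshold-and-no-collision test of the communication model in Section IV, and the uniform averaging rule is our assumption.

import numpy as np

def in_neighbours(i, nodes, succeeded):
    # Devices whose transmissions node i decoded in this slot. Links are
    # directed: succeeded(j, i) does not imply succeeded(i, j).
    return [j for j in nodes if j != i and succeeded(j, i)]

def dfl_update(weights, i, neighbours):
    # weights: dict device id -> list of numpy arrays (layer weights).
    # Node i averages its own weights with those received from its in-neighbours.
    group = [weights[i]] + [weights[j] for j in neighbours]
    return [np.mean(np.stack(layers), axis=0) for layers in zip(*group)]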

The simulated DFL system shows that the classification prediction reached about 90% accuracy within the first 20 iterations for all recorded participants, with any of the three model optimizers mentioned in the training settings (SGD, Adam, and RMSProp).

The results and statistics of some participants within the network are shown in Figure 7 and Table 2. The results for the DFL approach without a central server show high accuracy and low cross-entropy loss in predicting the digit from the handwritten samples after 135 iterations.

TABLE 2. Simulation Results for Some Random Participants of the Designed DFL Model.

FIGURE 7. The DFL model outcomes (accuracy and loss) for four random participants.

In this study, the wireless mesh networking model of Section VI-A is integrated with the DFL model to determine the number of successful participants during the learning process. The system shows that each participant can communicate with particular neighbours depending on the neighbours’ locations, the desired participant’s transmit power, and the interference from other devices.

The latency for each participant varies slightly, depending mainly on the number of iterations that participant requires to reach the system convergence state. The results show that the average latency for the participants was 420 seconds, under the assumption that the computation and broadcast times are equal for all participants. The expected system convergence in the DFL model is around 130 iterations on average, i.e., roughly 3.2 seconds per iteration.

As shown in Figure 7, the system achieved sufficiently high accuracy and low loss in its predictions. The records of the randomly chosen participants show that the participants learn in parallel and follow similar progress toward convergence in terms of accuracy and loss. The designed system offers the benefits of the DFL framework, where the data never leaves the participants’ devices and privacy restrictions are respected. Compared with the centralized model, the decentralized model achieved competitive results without needing a central server. For instance, participant 1 reached 98.2% accuracy and 0.088 cross-entropy loss after only 135 iterations.

In contrast, the centralized model reached the convergence state, with an accuracy of 98.1% and a loss of 0.15, only after more than 200 iterations. Based on the latency criteria in subsection IV-C, CFL has higher latency than DFL, as the system outcomes show that the communication cost and the number of iterations for CFL were higher than for the DFL approach.
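For convenience, the headline figures reported above can be tabulated as follows (the DFL accuracy and loss are those of participant 1, and the DFL latency is the network average):

Metric                        CFL       DFL
Accuracy                      98.1%     98.2%
Cross-entropy loss            0.15      0.088
Iterations to convergence     ~200      ~135 (≈130 on average)
Latency (seconds)             1850      420 (average)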

Thus, DFL can reduce latency and loss and increase convergence speed, allowing it to outperform CFL.

From these results, we conclude that DFL models over a WMN deliver strong classification performance given sufficient data. In this study, the DFL framework is combined with wireless mesh networking using the slotted-ALOHA protocol to improve communication between the participants during the learning process.

The DFL approach over a WMN using the slotted-ALOHA protocol can be highly competitive with CFL. The simulated models show that the designed DFL system can achieve better latency, more flexibility, and similar accuracy without installing a central server in the network.

SECTION VII.

Conclusion

To sum up, DFL reduced the communication cost compared to CFL, since the participants’ devices communicate directly, sending their parameter and update packets only to one-hop neighbours using slotted ALOHA as the devices’ MAC protocol over a mesh network topology. In this study, the network communication model simulated a realistic mesh networking scenario, accounting for interference in the network environment, and was then combined with the DFL model to train the network efficiently and reliably. Combining the DFL framework with the mesh networking protocol yields a comprehensive improvement in model performance, making the DFL approach highly competitive with CFL.

The following is a summary of this research and future work:

  1. Analysis of the wireless communication stage between the participants during the learning process, in order to simulate realistic interaction between IoT devices, reduce the communication resources required, and increase system flexibility.

  2. Implementation of the DFL system using the FedAvg algorithm to enforce consensus by sharing local model updates, extending established gossip methods with consensus.

  3. Emphasis on an experimental IoT setup, considering convergence speed, complexity, communication cost, and average prediction accuracy on the DFL embedded devices.

  4. Implementation of the CFL and DFL algorithms with the communication-stage challenges included in the simulation, so as to reflect real environment scenarios and real applications.

  5. In the future, DFL models should be designed for high-mobility sensors and devices in wireless mesh networking systems, to build a robust system, increase flexibility and scalability, and enhance performance for applications such as a Driverless Transport System.
