Decentralized Federated Learning on the Edge Over Wireless Mesh Networks

The rapid growth of Internet of Things (IoT) devices has generated vast amounts of data, leading to the emergence of federated learning as a novel distributed machine learning paradigm. Federated learning enables model training at the edge, leveraging the processing capacity of edge devices while preserving privacy and mitigating data transfer bottlenecks. However, the conventional centralized federated learning architecture suffers from a single point of failure and susceptibility to malicious attacks. In this study, we delve into an alternative approach called decentralized federated learning (DFL) conducted over a wireless mesh network as the communication backbone. We perform a comprehensive network performance analysis using stochastic geometry theory and physical interference models, offering fresh insights into the convergence analysis of DFL. Additionally, we conduct system simulations to assess the proposed decentralized architecture under various network parameters and different aggregator methods such as FedAvg, Krum and Median methods. Our model is trained on the widely recognized EMNIST dataset for benchmarking handwritten digit classification. To minimize the model’s size at the edge and reduce communication overhead, we employ a cutting-edge compression technique based on genetic algorithms. Our simulation results reveal that the compressed decentralized architecture achieves performance comparable to the baseline centralized architecture and traditional DFL in terms of accuracy and average loss for our classification task. Moreover, it significantly reduces the size of shared models over the wireless channel by compressing participants’ local model sizes to nearly half of their original size compared to the baselines, effectively reducing complexity and communication overhead.


I. INTRODUCTION
In recent years, the proliferation of Internet of Things (IoT) devices has been remarkable, largely driven by advancements in 5G technology and beyond. The global IoT market is projected to grow to a market value of 1.567 trillion US dollars by 2025 [1]. The architecture of IoT, which incorporates spectrum and energy management mechanisms, is designed to optimize both spectral and energy efficiencies to their maximum potential [2]. Alongside this, breakthroughs in semiconductor technology have enabled the fabrication of transistors with sub-10 nanometer gate lengths, drastically improving processing capacity while reducing power consumption [3]. These developments play a crucial role in the extensive adoption of embedded devices, including smartphones, sensors and tablets. These devices require powerful microprocessors capable of delivering high performance while adhering to stringent energy consumption limitations.
The substantial volume of data generated by embedded sensor devices at the periphery of IoT systems has given rise to the field of Big Data, which focuses on devising effective methods for processing, disseminating, and analyzing extensive datasets. In typical IoT setups, data initially collected by sensors at the edge are transmitted through a central network to a cloud server as shown in Figure (1), where various data preprocessing techniques (such as data cleansing, feature extraction, denoising, etc.) are applied. Subsequently, this processed data is employed for training machine learning models tailored to the specific application domain [4]. These trained models serve various machine learning tasks, including classification, clustering, anomaly detection, and regression, facilitating precise and optimal decision-making. The data flow structure described above in IoT systems has benefited from advancements in High-Performance Computing (HPC) systems, enabling real-time and low-latency solutions across diverse industry sectors.
Despite its widespread adoption, the conventional centralized cloud processing architecture has several limitations when considering practical scenarios.
Firstly, the reliance on transferring a large volume of data from edge devices to the cloud for analysis introduces significant challenges related to communication channel capacity and system latency. This becomes particularly problematic for IoT systems that utilize low-power, low-data-rate communication protocols such as Zigbee and LoRa [5].
Secondly, the nature of the data collected by embedded sensor devices often involves privacy-sensitive information, making the transmission of raw data over potentially insecure network connections a significant concern. Data breaches can have severe consequences for the data owners and are strictly regulated by international legislation such as the European General Data Protection Regulation (GDPR) [6].
Furthermore, the increased processing capacity of edge devices remains underutilized as they are primarily used for data gathering and transmission purposes, neglecting their potential for local computation and analysis.

A. MOTIVATION
In 2017, Google researchers introduced an enhancement to the traditional centralized data flow architecture by proposing Federated Learning (FL) as an innovative distributed model learning framework that operates across the entire system [7], [8]. FL revolutionizes the model training process by ensuring that raw data remains localized on the device where it is generated, thereby preserving privacy and enhancing data security. In this approach, a global model is periodically updated by a centralized server in the cloud and shared with edge devices during each training iteration. Each device independently refines the global model using its local dataset and transmits only the updated model parameters back to the cloud server. The server then aggregates these model parameter updates from all participating devices to perform a global model update. This iterative process continues until the global model converges based on a predefined threshold value. Figure 2 illustrates the structure of a typical FL system with a central coordinating server.
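The training loop described above can be sketched in a few lines. The example below is only an illustration under stated assumptions: a linear model with a squared loss, noiseless synthetic data, and the hypothetical helper names `local_update`, `federated_round` and `train`; none of these come from [7], [8].

```python
import numpy as np

def local_update(w, X, y, lr=0.1, epochs=5):
    """Refine a copy of the global weights on one device's private data."""
    w = w.copy()
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)  # gradient of the mean squared error
        w -= lr * grad
    return w

def federated_round(w_global, datasets):
    """Broadcast the global model, train locally, then average on the server."""
    updates = [local_update(w_global, X, y) for X, y in datasets]
    sizes = np.array([len(y) for _, y in datasets], dtype=float)
    return np.average(updates, axis=0, weights=sizes)  # weighted by dataset size

def train(datasets, dim, tol=1e-6, max_rounds=500):
    """Repeat rounds until the global model moves less than the threshold tol."""
    w = np.zeros(dim)
    for rounds in range(1, max_rounds + 1):
        w_new = federated_round(w, datasets)
        if np.linalg.norm(w_new - w) < tol:
            return w_new, rounds
        w = w_new
    return w, max_rounds

# Toy setup (assumed): three devices holding private samples of one linear relation.
rng = np.random.default_rng(0)
w_true = np.array([1.5, -2.0])
datasets = []
for n in (40, 80, 120):
    X = rng.normal(size=(n, 2))
    datasets.append((X, X @ w_true))

w_hat, rounds = train(datasets, dim=2)
```

Only model parameters cross the network in this sketch; the arrays `X` and `y` never leave the device-side function.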
Although the initial FL architecture represents a substantial advancement, particularly in terms of privacy and overall system efficiency, it has notable limitations, primarily stemming from its centralized design. One significant drawback is the central server, which serves as a single point of failure, compromising the system's resilience. This vulnerability is especially concerning for time-sensitive IoT applications, as any temporary server downtime can lead to severe disruptions. Additionally, the centralization of data, even if limited to transmitting model updates, escalates the risk of malicious attacks [9]. Sending encrypted gradient
In this work, a fully decentralized FL architecture is proposed and analysed to overcome the limitations of centralized FL. In our system, there exists no central coordinating entity, such as a cloud server, and the model learning process is fully distributed among network participants, namely the edge devices. More specifically, each individual edge device is responsible for maintaining and updating its local model. Unlike the traditional centralized FL architecture, there is no overarching global model in our approach. During each training iteration, every device shares its current local model parameters with neighbouring devices and receives their respective updates in return. Subsequently, it performs a local model update using its unique dataset. The device then aggregates the received model parameters with its own, calculating the local model parameters for the next iteration. This decentralized learning architecture has been empirically demonstrated to converge to an equivalent central model through the application of gossip training theory in distributed network learning [10], [11], as illustrated in Figure 3.
In the realm of Decentralized Federated Learning (DFL) on edge devices within wireless mesh networks (WMNs), we believe that our paper makes significant contributions on several fronts, spanning both Federated Learning and network performance analysis. Our primary focus is on reducing communication overhead through the application of cutting-edge compression techniques.

B. KEY CONTRIBUTION
In this paper, we present a comprehensive description of the DFL architecture within the context of the Internet of Things (IoT) and edge devices while emphasizing the significance of the underlying core communication network. Our approach is distinct from much of the existing literature, as we aim to provide an all-encompassing perspective on the decentralized FL paradigm while considering the performance implications of the communication network that facilitates data transfer among IoT edge devices. We have chosen wireless one-hop mesh networking as the foundational network infrastructure, aligning with the distributed nature of IoT network learning. The rationale behind this choice is twofold.
Firstly, mesh networks possess the adaptive capability to reconfigure their routing graph in real time, particularly when faced with link interruptions [12], [13]. This adaptive characteristic proves invaluable in real-world scenarios where edge devices are often deployed in challenging environments that are prone to link failures. By harnessing the dynamic nature of mesh networks, we ensure consistent and reliable data transfer, even in the presence of intermittent connectivity or link disruptions. This resilience is critical to maintaining the integrity of the decentralized FL process, enabling the system to gracefully handle disruptions without compromising learning.
This work combines the DFL architecture with wireless mesh networking, establishing a robust and efficient framework for collaborative learning within IoT systems. Our analysis delves beyond the FL architecture's performance, encompassing the unique attributes of the communication networks. We use one-hop communication mesh networking to ensure efficient data transfer among edge devices. Additionally, we employ different aggregator methods to update the global model and achieve optimum performance. This comprehensive perspective enhances our understanding of the overall system behaviour. Our research aims to contribute valuable insights into the design and optimisation of DFL systems in IoT environments, promising improved performance and scalability in real-world applications. Moreover, we harness the innate strengths of mesh networks, such as resilience to single-point failures and heightened data privacy, fortifying the security and reliability of FL within IoT systems. This aspect of our research addresses the mounting complexity and risk factors in large-scale distributed learning applications.
To further enhance our model and minimize communication overhead, we have implemented a state-of-the-art compression technique, the genetic algorithm-based approach of [27], to reduce the dimensions of terminal models at the edge. In the case of the Convolutional Neural Network (CNN) model, this reduction entails selecting a subset of convolutional filters and nodes within the dense layers while ensuring that the original models' accuracy levels remain intact.

C. ORGANISATION
The rest of this paper is organised as follows. Section II presents background and related work from the literature. Section III carries out a theoretical analysis of the performance of wireless mesh networks. Section IV introduces the methodology for the learning process, network topology, and communication methods, the convergence criterion of the proposed DFL system architecture, and the approach for selecting highly reliable devices to update the global model, which is updated using a compressed model at the edge and a range of aggregator methods. In Section V, the simulation setup is described, and the corresponding results are presented along with their interpretation. Section VI presents the results of the simulations and in-depth discussions, offering insights and analysis of the findings. Finally, Section VII provides a summary of the work and proposes future work directions and milestones.

II. BACKGROUND AND RELATED WORK
Ever since its introduction in [7], Federated Learning (FL) has gained considerable attention and has become one of the most extensively researched machine learning paradigms. The literature surrounding FL is expansive, encompassing diverse architectures, analyses of learning performance, and investigations into data privacy concerns. While the majority of research initially gravitated towards the conventional centralized architecture of FL, recent efforts have increasingly shifted their focus towards decentralized alternatives, called decentralized FL (DFL), in which no central server is needed [28].
Related work on DFL can be distinguished into two main design philosophies. The first approach to achieving decentralization involves harnessing blockchain technology, a highly promising avenue. In these blockchain-based systems, participants fall into two categories: standard edge devices and miners. Each edge device establishes communication with nearby miners, who assume the role of model aggregators during particular training rounds. Depending on the employed consensus algorithm, the miner that successfully solves the hashing problem earns the privilege of atomically updating the distributed ledger with the new global model update.
In the study [29], the limitations of centralized federated learning (CFL) are addressed by proposing a decentralized federated learning (DFL) approach that eliminates the need for a central server and instead relies on one-hop neighbours for collaboration in the communication network. The authors use stochastic geometry to model the dynamics of the network topology, MAC protocol, and fading on links, allowing them to evaluate the performance of DFL while preserving privacy and accommodating networking dynamics. However, this study primarily focuses on the evaluation of DFL without considering its application on the edge or evaluating the network intensities for multi-hop wireless mesh networks.
Furthermore, in our research efforts, we have undertaken an extensive assessment of relevant literature and surveys pertaining to both CFL and DFL. This evaluation is presented in table (1), which delves into several facets of CFL and DFL, including global model aggregation methods, foundational frameworks, application domains, and categorizations based on proposed solutions. To facilitate easy interpretation, symbols have been employed within the table to signify the status of each aspect: a checkmark (✓) indicates full coverage, a question mark (?) denotes partial coverage, and a multiplication symbol (x) signifies that the particular aspect has not been addressed.
In [28], the authors aimed at optimizing the overall average number of parameter transmissions in the CFL approach only, including shallow and complete transmissions, while maintaining a predefined ratio between them. To offer a thorough analysis of DFL within the context of core communication network performance, table (2) has been compiled to provide an overview of recent developments in CFL and DFL and their primary focus areas. This table provides a critical evaluation of contemporary advancements in FL network design, spanning multiple dimensions such as resource management, system cost, security, privacy, user distribution analysis, communication network characteristics, FL network intensity, performance, and central server-free approaches.
The works in [30] and [31] investigate the challenges posed by the widespread deployment of Internet of Things (IoT) devices in the 5G era, particularly in the context of software-defined networks (SDNs). They highlight the importance of cache management at the edge of the network and explore emerging edge resources like mobile device clouds and micro-edge data centres. The goal is to optimize content placement based on user demand and cost considerations. These studies also address security and seamless data delivery in mobile IoT networks and introduce federated learning (FL) as a key framework to harness data and computational resources from end-user devices for training machine learning models. Their main focus is on centralized federated learning in the 5G network, leaving potential opportunities in decentralized learning methods, particularly in ad-hoc networks, relatively unexplored.

III. WIRELESS MESH NETWORK PERFORMANCE ANALYSIS
In this section, an analytical approach is employed to assess the performance of a wireless mesh network, utilizing the physical interference model [39] to quantify the likelihood of successful data transmission between a transmitting node and a receiving node within the network. Our theoretical analysis draws heavily from principles of stochastic geometry and random point processes [40], [41]. According to the physical interference model, the probability of successful transmission hinges on the Signal-to-Interference-and-Noise Ratio (SINR) observed at the receiver. A transmission is classified as successful if the SINR meets or exceeds a predefined threshold value. A notable advantage of the physical interference model is its comprehensive evaluation of total interference emanating from all nodes except the transmitter [42]. Subsequently, we establish the network topology employing the Poisson point process (PPP) theory.

1) Poisson Point Process
In our analysis, we assume that the edge devices within the network are distributed based on a stationary homogeneous Poisson Point Process (PPP) with an intensity of $\lambda$. These devices are distributed within a disk $D \subset \mathbb{R}^2$ with a radius of $R$, centred at the origin of the two-dimensional plane $\mathbb{R}^2$. According to the properties of a Poisson point process, the expected number $N$ of devices falling within the disk $D$ can be calculated as $N = \lambda |D|$, where $|D| = \pi R^2$ [40].
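As a quick sanity check of the relation N = λπR², one realization of the process on the disk can be drawn by sampling a Poisson-distributed count and then placing that many points uniformly over the area. The following is a minimal sketch; the function name `sample_ppp_disk` and the numerical values of λ and R are assumptions chosen for illustration.

```python
import numpy as np

def sample_ppp_disk(lam, R, rng):
    """Draw one realization of a homogeneous PPP of intensity lam on a disk of radius R."""
    n = rng.poisson(lam * np.pi * R**2)         # point count ~ Poisson(lam * |D|)
    radius = R * np.sqrt(rng.uniform(size=n))   # sqrt makes the points uniform over the area
    angle = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return np.column_stack((radius * np.cos(angle), radius * np.sin(angle)))

rng = np.random.default_rng(42)
lam, R = 0.05, 20.0                             # illustrative values only
expected_n = lam * np.pi * R**2                 # N = lam * pi * R^2

points = sample_ppp_disk(lam, R, rng)
counts = [len(sample_ppp_disk(lam, R, rng)) for _ in range(2000)]
mean_count = float(np.mean(counts))             # empirical average device count
```

Averaged over many realizations, `mean_count` should approach `expected_n`, while any single realization fluctuates around it.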
An important characteristic of the homogeneous PPP, as per Palm probability theory, is that adding an extra point at the origin in a specific realization of the process does not affect the distribution of the remaining points in the process (Slivnyak's theorem). Consequently, interference statistics can be measured equivalently by assuming that the typical receiver is a point within the process located at the origin [43]. In the network, each device is denoted by $i$ with $i \in \{1, 2, \ldots, N\}$, where $N$ represents the total number of active devices within the desired receiver coverage area. Additionally, $i \in \varphi$, where $\varphi$ encompasses all participants within the entire target area.

2) Successful Transmission Probability
In the subsequent analysis of the probability of successful transmission, a slotted ALOHA medium access control scheme is considered. In this scheme, each device independently decides to transmit with a probability of $p$, without coordination with other devices. Additionally, we assume Rayleigh fading for the propagation channel, where the channel coefficient of each link is a zero-mean complex Gaussian random variable. The Signal-to-Interference-and-Noise Ratio (SINR) measured at the receiver located at the origin is calculated using the following equation:
$$\mathrm{SINR} = \frac{S}{N + I} \tag{1}$$
Here, $S$ represents the signal power received from the intended transmitter, $N$ stands for the noise power, and $I$ corresponds to the cumulative interference power stemming from other transmitters.
For simplicity, we assume that the noise power $N$ is significantly lower than the total interference power. Therefore, we will employ the Signal-to-Interference Ratio (SIR) for the remainder of our analysis, as defined below:
$$\mathrm{SIR} = \frac{S}{I} \tag{2}$$
The received signal power $S_i$ at the receiver from a transmitter $i$ is [40], [42]:
$$S_i = P_i \, h \, r_i^{-\alpha} \tag{3}$$
In this context, $P_i$ represents the transmission power of transmitter $i$, $h$ denotes the fading factor in accordance with the Rayleigh fading model, $r_i$ stands for the distance from transmitter $i$ to the origin, and $\alpha$ characterizes the path loss parameter, which reflects the attenuation of signal power with distance. The total interference power $I$ observed at the receiver results from the summation of all $S_i$ values, where $i$ corresponds to all transmitting devices except the intended transmitter. According to the Rayleigh fading model, the received signal power $S_i$ follows an exponential distribution [44], and with the assumption of unit transmission power and a unit-mean fading factor, its distribution is defined by equation (4) [40]:
$$f_{S_i}(x) = r_i^{\alpha} \exp\left(-r_i^{\alpha} x\right), \quad x \ge 0 \tag{4}$$
The successful transmission probability, as per the physical interference model, can be described as the likelihood that the Signal-to-Interference Ratio (SIR) exceeds or equals a predefined threshold $\theta$:
$$p_{\mathrm{succ}} = \mathbb{P}\left(\mathrm{SIR} \ge \theta\right) = \mathbb{P}\left(\frac{h \, r^{-\alpha}}{I} \ge \theta\right) \tag{5}$$
where $r$ is the distance between the desired transmitter and the receiver.

[Table: Assessment of recent developments in the design of FL networks. Columns: Related research; Main research area; Allocation of resources and cost management; Analyzing the distribution of users; The network communication; FL network intensity and performance; Central server-free. Recoverable rows: [32], DFL concept; Security and Privacy in FL; FL for Health Informatics; This work, DFL on the Edge.]

3) Laplace Transform of Interference
The successful transmission probability, as expressed in equation (5), is equivalent to the Laplace transform of the cumulative interference observed at the receiver when evaluated at $s = \theta r^{\alpha}$ [40].
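This equivalence can be seen in a short derivation. It relies on the assumptions already stated (unit transmission power, noise neglected) and, additionally, on the fading factor $h$ being exponentially distributed with unit mean, so that $\mathbb{P}(h \ge x) = e^{-x}$; the unit-mean normalization is taken here as an assumption.

```latex
\begin{align}
p_{\mathrm{succ}}
  &= \mathbb{P}\left(\frac{h \, r^{-\alpha}}{I} \ge \theta\right)
   = \mathbb{P}\left(h \ge \theta r^{\alpha} I\right) \\
  &= \mathbb{E}_{I}\left[\exp\left(-\theta r^{\alpha} I\right)\right]
   = \mathcal{L}_{I}(s)\,\Big|_{s = \theta r^{\alpha}}
\end{align}
```

The second line conditions on the interference $I$ and then averages over it, which is precisely the definition of the Laplace transform $\mathcal{L}_I(s) = \mathbb{E}[e^{-sI}]$.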
Following established analytical methods from stochastic geometry, in particular the probability-generating functional, as outlined in references [40], [41], we can deduce a closed-form expression for the successful transmission probability, given in equations (6) and (7). In these expressions, $p$ represents the probability of an individual transmitter deciding to transmit independently (ALOHA), $\lambda$ stands for the intensity of the Poisson Point Process (PPP), $R$ signifies the radius of the PPP disk, $r$ denotes the distance between the transmitter and the receiver, and $\alpha$ represents the path loss parameter.
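To make the role of each parameter concrete, the success probability can be evaluated numerically and cross-checked by simulation. Under the assumptions of this section (interferers thinned by ALOHA to intensity λp on the disk of radius R, unit transmission power, noise neglected) together with unit-mean exponential fading, which is assumed here, the probability-generating functional gives $p_{\mathrm{succ}} = \exp\left(-2\pi\lambda p \int_0^R \frac{\rho}{1 + \rho^{\alpha}/(\theta r^{\alpha})}\, d\rho\right)$. This integral form is offered as a sketch; it is not claimed to be the exact expression of equations (6) and (7). The function names and parameter values below are illustrative.

```python
import numpy as np

def success_prob_integral(lam, p, R, r, alpha, theta, steps=100_000):
    """Evaluate exp(-2*pi*lam*p * integral) with a midpoint rule on [0, R]."""
    s = theta * r**alpha
    drho = R / steps
    rho = (np.arange(steps) + 0.5) * drho
    integrand = rho / (1.0 + rho**alpha / s)
    return float(np.exp(-2.0 * np.pi * lam * p * np.sum(integrand) * drho))

def success_prob_monte_carlo(lam, p, R, r, alpha, theta, trials, rng):
    """Simulate slotted ALOHA interferers on the disk and count SIR >= theta events."""
    successes = 0
    for _ in range(trials):
        n = rng.poisson(lam * p * np.pi * R**2)      # active interferers in this slot
        rho = R * np.sqrt(rng.uniform(size=n))       # their distances to the receiver
        interference = np.sum(rng.exponential(size=n) * rho ** (-alpha))
        signal = rng.exponential() * r ** (-alpha)   # unit power, unit-mean fading
        if signal >= theta * interference:
            successes += 1
    return successes / trials

# Illustrative parameter values (assumed, not taken from the paper).
lam, p, R, r, alpha, theta = 0.01, 0.3, 50.0, 5.0, 4.0, 1.0
rng = np.random.default_rng(7)
p_integral = success_prob_integral(lam, p, R, r, alpha, theta)
p_simulated = success_prob_monte_carlo(lam, p, R, r, alpha, theta, 20_000, rng)
```

Increasing λ or p raises the interferer density and lowers both estimates, while a larger path loss exponent α shields the receiver from distant interferers.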
Consequently, from equations (6) and (7), we can derive the expression for the successful transmission probability given in equation (8).

IV. METHODOLOGY
Our system model encompasses various critical perspectives to optimize decentralized learning. We start from a system perspective, optimizing a decentralized model through collaboration among several distributed edge devices without direct access to their local data. Communication among devices occurs in a peer-to-peer manner, eliminating the need for a central server. From a spatial perspective, we leverage geometric patterns to efficiently manage multi-user communication.
Each communication round identifies successful transmitter devices based on interactions with neighbours. Considering convergence, our approach incorporates theoretical analysis to define the target convergence state of the model. Furthermore, we introduce novel aggregator methods and employ Hidden Markov Models (HMM) for device evaluation. Historical performance guides the selection and weighting of edge devices in the learning process. In contrast to traditional Centralized Federated Learning (CFL) systems, the proposed Decentralized Federated Learning (DFL) architecture distinguishes itself by eliminating the need for a central aggregating server. Our target model revolves around a network of edge devices communicating through a wireless mesh infrastructure. The primary objective of this system is the collaborative optimization of parameters $W$ for a global model represented as $\hat{y} = f(W, x)$, where $\hat{y}$ represents the model's predicted output, and $x$ denotes input data. Each individual device possesses its distinct dataset $D_i$ consisting of input data $x_i$, and this dataset remains private, not shared with other devices within the network.
The local loss function at each device can be defined as:
$$F_i(W) = \frac{1}{|D_i|} \sum_{(x, y) \in D_i} l(\hat{y} - y) \tag{9}$$
In this equation, $|D_i|$ represents the size of the local dataset, and $l(\hat{y} - y)$ is the loss function that quantifies the disparity between the model's predicted output $\hat{y}$ and the actual output $y$ corresponding to input $x$. It is important to note that we assume the local loss function to be both convex and smooth.
At the outset of each training iteration $t$, let $W_i^t$ represent the local model weights for each device $i$. Employing its local dataset, each device engages in Stochastic Gradient Descent (SGD) [7] on the local loss function. The device subsequently updates its local model weights using the following equation:
$$W_i^{t+1} = W_i^t - \mu \nabla F_i(W_i^t) \tag{10}$$
Here, $\mu$ denotes the learning rate, carefully selected to ensure the convergence of the SGD algorithm to a minimum.
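A single local update of this kind can be sketched as follows. The sketch assumes a linear model with a squared loss (one convex and smooth choice) and a mini-batch estimate of the gradient; the names `local_loss` and `sgd_step`, the batch size, and the learning rate are hypothetical choices made for illustration.

```python
import numpy as np

def local_loss(W, X, y):
    """Average squared error of the linear model X @ W over the local dataset."""
    return float(np.mean((X @ W - y) ** 2))

def sgd_step(W, X, y, mu, batch_size, rng):
    """One SGD update W <- W - mu * grad, with the gradient taken on a random mini-batch."""
    idx = rng.choice(len(y), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = 2.0 * Xb.T @ (Xb @ W - yb) / batch_size
    return W - mu * grad

# Toy local dataset (assumed): 200 noiseless samples of a linear relation.
rng = np.random.default_rng(1)
w_true = np.array([0.5, -1.0, 2.0])
X = rng.normal(size=(200, 3))
y = X @ w_true

W = np.zeros(3)
loss_before = local_loss(W, X, y)
for _ in range(300):
    W = sgd_step(W, X, y, mu=0.05, batch_size=16, rng=rng)
loss_after = local_loss(W, X, y)
```

A learning rate that is too large would make the iterates diverge, which is why µ has to be chosen with the curvature of the loss in mind.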

TABLE 3. Loss functions for different models
In the subsequent phase, each device transmits its recently updated local model weights $W_i^{t+1}$ to its immediate one-hop neighbours within the wireless mesh network. Simultaneously, it receives updated local model weights from its corresponding one-hop neighbours. Subsequently, each device executes a local aggregation process on these received local model weights, typically involving straightforward averaging. This results in the creation of the initial updated local model weights that will be utilized in the subsequent iteration [49].
The process outlined above facilitates the "diffusion" of each device's local model weight parameters throughout the network during each iteration. Essentially, this diffusion mechanism involves the dissemination of the impact of each device's local training dataset across the network through the transmission of local model weights. Notably, prior research, such as [11], has demonstrated that this collaborative learning network converges to the same global optimum and at a similar convergence rate when compared to a conventional centralized cloud-server approach.
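The diffusion effect can be illustrated with a small experiment. The sketch below makes several simplifying assumptions that are not part of the described system: six devices arranged in a ring, every one-hop transmission succeeding, a linear model with squared loss, full-batch local gradient steps, and plain averaging as the aggregation rule. The names `local_step` and `dfl_iteration` are hypothetical.

```python
import numpy as np

def local_step(W, X, y, mu):
    """One gradient step on the device's local squared loss."""
    grad = 2.0 * X.T @ (X @ W - y) / len(y)
    return W - mu * grad

def dfl_iteration(weights, data, neighbours, mu):
    """Local update on every device, then averaging with one-hop neighbours."""
    updated = [local_step(W, X, y, mu) for W, (X, y) in zip(weights, data)]
    return [
        np.mean([updated[i]] + [updated[j] for j in neighbours[i]], axis=0)
        for i in range(len(updated))
    ]

rng = np.random.default_rng(3)
w_true = np.array([1.0, -0.5, 2.0])
n_devices = 6
data = []
for _ in range(n_devices):
    X = rng.normal(size=(30, 3))       # private samples, never exchanged
    data.append((X, X @ w_true))

# Ring topology (assumed): each device talks to its two adjacent devices.
neighbours = {i: [(i - 1) % n_devices, (i + 1) % n_devices] for i in range(n_devices)}

weights = [np.zeros(3) for _ in range(n_devices)]
for _ in range(200):
    weights = dfl_iteration(weights, data, neighbours, mu=0.1)

max_error = max(float(np.linalg.norm(W - w_true)) for W in weights)
```

Although no device ever sees another device's samples, every local model ends up close to the same solution, because information travels one hop per iteration through the exchanged weights.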
The main objective of DFL is to discover model parameters $W$ that minimize the average loss function (also known as an objective function or cost function) across all neighbouring participating devices as follows:
$$\min_{W} F(W) = \sum_{i=1}^{N} \zeta_i F_i(W) \tag{11}$$
In equation (11), $F_i(W)$ is the local loss function of device $i$, and each device indexed as $i$ is assigned a weight denoted by $\zeta_i > 0$. In practical scenarios, these weights $\zeta_i$ are typically determined in proportion to the amount of data residing on each respective device. This means that devices with more data contribute more significantly to the overall objective, as represented by the optimization problem expressed in equation (11). Furthermore, we assume that the edge devices are capable of running the model within a certain time slot at each epoch.
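As a small worked example of data-proportional weighting, the snippet below computes the weights from dataset sizes and combines per-device losses into the weighted objective. Normalizing the weights so that they sum to one is an assumption made here for illustration, as are the function names and the numbers.

```python
def data_proportional_weights(sizes):
    """Weights proportional to local dataset sizes, normalized to sum to one."""
    total = sum(sizes)
    return [n / total for n in sizes]

def global_objective(local_losses, zetas):
    """Weighted sum of the local losses."""
    return sum(z * f for z, f in zip(zetas, local_losses))

sizes = [100, 300, 600]               # samples held by three devices (illustrative)
local_losses = [0.9, 0.5, 0.2]        # current local loss on each device (illustrative)

zetas = data_proportional_weights(sizes)
objective = global_objective(local_losses, zetas)
```

With these numbers the device holding 600 samples carries a weight of 0.6, so its comparatively low loss dominates the weighted objective.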

B. COMMUNICATION SYSTEM
In our model, the Wireless Mesh Network (WMN) approach is proposed as the fundamental communication technique that allows devices to share their model parameters in a peer-to-peer manner. Our proposal is motivated by the popularity that WMNs have gained due to their cost-effectiveness, which makes them an attractive option for wireless connectivity in the DFL network. WMNs exhibit dynamic self-organization and self-configuration, allowing network nodes to establish and maintain mesh connections autonomously. This characteristic bestows several advantages upon WMNs, including low initial expenses, efficient network upkeep, resilience, and consistent service coverage [50]. In addition, this low-cost WMN infrastructure is well-suited for establishing a DFL network that can span across community networks, metropolitan areas, municipalities, and enterprise networks.

C. LEARNING CONVERGENCE CRITERION
Consider a scenario with $N$ edge devices participating in the learning network. For this analysis, we make the assumption that these edge devices are distributed according to a homogeneous PPP (Poisson Point Process) with an intensity measure denoted as $\lambda$. Furthermore, these devices are confined within a circular region $D$ centred at the origin and having a radius of $R$. Within the scope of a typical receiver positioned at the origin, the probability of successful data transmission from a transmitter located at a distance $r$ from the origin can be determined using Equation (8).
The devices that successfully transmit data to the receiver also follow a homogeneous PPP, but with an intensity measure of $\lambda p_{\mathrm{succ}}$ [45]. Consequently, the number of devices that succeed in transmitting their data to the receiver, denoted as $\tilde{N}$, can be expressed as:
$$\tilde{N} = \lambda \, p_{\mathrm{succ}} \, \pi R^2 \tag{12}$$
The distribution of the number $m$ of devices successfully transmitting can be described as [45]:
$$\mathbb{P}(m) = \frac{\tilde{N}^{m}}{m!} \, e^{-\tilde{N}} \tag{13}$$
Let $m_{t,j}$ represent the number of devices successfully transmitting their updated local model weights to the receiving device $j$ during training iteration $t$, resulting in the global model $W_j$. Additionally, let $N_j$ denote the total number of training iterations out of $t$ in which at least one device successfully transmitted local model weights to receiver $j$. We can now introduce the convergence condition for the model training procedure within the DFL network, adapted from [45] and configured to suit the described decentralized architecture.
The learning convergence condition can be expressed as follows: the network achieves convergence after $R$ training iterations when the maximum expectation of the average gradient, taken across all participating devices denoted as $N_j$, does not exceed a predefined convergence threshold $\epsilon_0$. This expectation is calculated with respect to the distribution of the input dataset. In essence, this convergence criterion ensures that if even the device with the poorest performance, in terms of the expected average gradient after $R$ rounds, meets the convergence threshold, then all other devices should also meet it. In such a scenario, the DFL network is considered to have converged to the optimal model weight parameters.
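The count statistics of successful transmitters follow directly from the Poisson property of the thinned process. The sketch below treats the success probability as a single constant over the disk, which is a simplification, and uses hypothetical function names and illustrative parameter values.

```python
import math

def expected_successes(lam, p_succ, R):
    """Mean number of successful transmitters on a disk of radius R."""
    return lam * p_succ * math.pi * R**2

def prob_m_successes(m, n_tilde):
    """Poisson probability that exactly m devices deliver their weights."""
    return n_tilde**m * math.exp(-n_tilde) / math.factorial(m)

def prob_at_least_one(n_tilde):
    """Probability that a round delivers at least one model to the receiver."""
    return 1.0 - math.exp(-n_tilde)

lam, p_succ, R = 0.01, 0.5, 10.0      # illustrative values only
n_tilde = expected_successes(lam, p_succ, R)
pmf = [prob_m_successes(m, n_tilde) for m in range(51)]
useful_round_prob = prob_at_least_one(n_tilde)
```

Under this simplification, `useful_round_prob` is the chance that a given training iteration delivers at least one model to the receiver, which is the kind of iteration that the count $N_j$ tallies.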

D. DEVICE SELECTION AND MODELS AGGREGATOR METHOD
In the context of FL, participant selection is a crucial aspect, as it determines which edge clients or devices in the network will contribute to the collaborative model training process. In the literature, different methodologies are used to evaluate the distributed devices and choose the most appropriate group for the required purpose. Hidden Markov Models (HMMs) [51], probabilistic models widely used in various fields, can be utilized in FL to make well-informed choices concerning participant selection: by modelling the past behaviour and performance of devices, they allow the weights $\zeta_i$ in equation (11) to be assigned regularly for each connected device.
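One way such an HMM-based evaluation could look is sketched below. Everything in it is an assumption made for illustration: a two-state model (reliable, unreliable), binary observations (whether a device's transmission succeeded in a round), hand-picked transition and emission probabilities, and weights obtained by normalizing the filtered probability of the reliable state. It is not the selection procedure of this work.

```python
import numpy as np

# Hypothetical 2-state HMM: state 0 = reliable, state 1 = unreliable.
TRANSITION = np.array([[0.9, 0.1],
                       [0.2, 0.8]])
# EMISSION[s, o]: probability of observation o in state s
# (o = 0: failed round, o = 1: successful round).
EMISSION = np.array([[0.1, 0.9],
                     [0.7, 0.3]])
PRIOR = np.array([0.5, 0.5])

def reliability_posterior(observations):
    """Forward filtering: probability of the reliable state given the history."""
    belief = PRIOR * EMISSION[:, observations[0]]
    belief /= belief.sum()
    for obs in observations[1:]:
        belief = (belief @ TRANSITION) * EMISSION[:, obs]  # predict, then correct
        belief /= belief.sum()
    return float(belief[0])

def selection_weights(histories):
    """Normalize the per-device reliability scores into aggregation weights."""
    scores = np.array([reliability_posterior(h) for h in histories])
    return scores / scores.sum()

# Per-device success/failure histories over six rounds (illustrative).
histories = [
    [1, 1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 0, 1, 1, 1, 1],
]
weights = selection_weights(histories)
```

Devices with a record of mostly successful rounds receive a larger share of the weight, and a single failure lowers the score without discarding the device outright.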
In the DFL approach, the master devices responsible for aggregating models from other neighbours at iteration t must meet specific specifications and requirements to efficiently manage the learning process during that iteration.
Initially, the central server in CFL or the master device in the DFL approach initializes a global model $w^0$ randomly. Subsequently, in each communication round, a sequence of steps is executed to achieve the learning objective, as illustrated in Figure (2), in which the submitted local models are combined through an aggregation rule $A$: $w^{t+1} \leftarrow A(\{w_i^{t+1} : i \in \varphi\})$.
The most widely used aggregation rule for communication-efficient Federated Learning (FL) is Federated Averaging (FedAvg) [7], which aggregates the client updates by computing a weighted average. However, FedAvg is not fault-tolerant, and even a single faulty or malicious client can prevent the global model from converging [52]. Although this work does not specifically address malicious attacks and model protection, it is important to mention that there are existing robust aggregation techniques designed for these purposes:
Krum [53]: In each communication round, Krum selects $m$ local model updates out of the total available $N_t$ updates to compute the global model update. This selection is based on comparing the similarity between these local updates. Assuming that $f$ of the $N_t$ clients are malicious, Krum assigns a score to each local model update $w_i$ by calculating the sum of squared Euclidean distances between $w_i$ and its $N_t - f - 2$ nearest neighbouring local updates. The $m$ local model updates with the smallest scores are then chosen, and their average is calculated to determine the global model update.
Median [52]: The Median aggregation method is a coordinate-wise aggregation rule that operates independently on each model parameter. To determine the $i$th parameter of the global model update, the server arranges the $i$th parameter values from the $N_t$ submitted local model updates in ascending order and selects the median value. Median aggregation can achieve an order-optimal statistical error rate when dealing with strongly convex loss functions.
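The three aggregation rules described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the simulations: the function names, the representation of a model as a flat list of floats, and the default `m=1` are assumptions made for the example. The Krum score follows the description in the text (sum of Euclidean distances to the $N_t - f - 2$ nearest neighbours); formulations based on squared distances are also common.

```python
import math
import statistics


def fedavg(updates, weights):
    """Weighted average of client updates (weights need not be normalised)."""
    total = sum(weights)
    return [
        sum(w * u[j] for u, w in zip(updates, weights)) / total
        for j in range(len(updates[0]))
    ]


def coordinate_median(updates):
    """Coordinate-wise median: every parameter is aggregated independently."""
    return [statistics.median(column) for column in zip(*updates)]


def krum(updates, f, m=1):
    """Average the m updates with the smallest Krum scores.

    The score of update i is the sum of Euclidean distances to its
    N_t - f - 2 nearest neighbours, where f is the assumed number of
    malicious clients.
    """
    n = len(updates)
    k = n - f - 2
    if k < 1:
        raise ValueError("Krum needs at least f + 3 updates")
    scores = []
    for i, u in enumerate(updates):
        dists = sorted(math.dist(u, v) for j, v in enumerate(updates) if j != i)
        scores.append(sum(dists[:k]))
    chosen = sorted(range(n), key=lambda i: scores[i])[:m]
    return [sum(updates[i][j] for i in chosen) / m for j in range(len(updates[0]))]


# Four similar updates and one outlier (hypothetical values).
updates = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.05], [100.0, -100.0]]
avg = fedavg(updates, [1, 1, 1, 1, 1])
med = coordinate_median(updates)
kr = krum(updates, f=1)
```

In this toy input the single outlier pulls the plain average far from the other updates, whereas the median and Krum outputs stay close to them, which is the behaviour the robust rules are designed for.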
Various other aggregation methods have been proposed apart from the previously mentioned ones. For example, Bulyan [54] employs an iterative approach with aggregation rules like Krum for enhanced robustness, but it suffers from computational inefficiency and lacks scalability. Zeno [55] assigns scores to updates and aggregates the top $N_t - b$ updates with the highest scores, where $N_t$ represents the total number of clients and $b$ is a predefined hyperparameter, typically set equal to or greater than the number of malicious clients.
Another recent approach uses variational autoencoders to project client updates into a latent space for malicious update detection, but it relies on the unrealistic assumption of having access to data matching the client's private data distribution for training. Some studies focus on achieving robust federated learning by identifying and blocking malicious clients through adaptive model quality estimation [56] or clustered federated learning [57].
However, it is important to note that some of these methods have predominantly been tested in the context of centralized federated learning, specifically in the straightforward cross-silo scenario. Their applicability to the more complex and dynamic cross-device scenarios, which are characteristic of DFL approaches, has seen limited exploration. Investigating this limitation is part of our work.

V. MODEL SIMULATION
The Simulation section of this work encompasses the following components:
(i) Simulation of a Wireless Mesh Network: We simulate a wireless mesh network in which participants are distributed according to a Poisson Point Process (PPP), as discussed in our theoretical analysis of wireless mesh network performance. This simulation allows us to model communication dynamics and evaluate network performance across various scenarios.
(ii) Simulation of the baseline Centralized Federated Learning (CFL) architecture: Our simulations replicate the baseline centralized FL architecture. In this setup, a multi-layered Convolutional Neural Network (CNN) is compressed using the state-of-the-art (SOTA) genetic algorithm-based approach. The EMNIST benchmarking dataset [58] is employed for training and validation. The primary objective is to minimize communication overhead while enhancing the performance of the compressed CNN model for categorical digit classification in the presence of a central server. This improvement is achieved by enabling collaborative learning among multiple participants within the CFL framework.
(iii) Simulation of the proposed DFL architecture: The simulations evaluate the proposed DFL architecture, which incorporates the same state-of-the-art compression method. Specifically, a multi-layered CNN is compressed using a genetic algorithm-based approach at the edge. Additionally, various aggregation methods, such as Krum and Median, are employed, mirroring the approach used in the centralized setup. These simulations facilitate a performance comparison between the centralized and decentralized architectures, allowing us to assess the effectiveness and potential advantages of the DFL approach, which needs neither a central server nor network infrastructure, taking into account the communication overhead and the complexity. Importantly, this work introduces the novel application of Krum and Median aggregation methods within the realm of DFL.
Through these simulations, we aim to gain insights into the performance, accuracy, and efficiency of both CFL and DFL architectures within the context of wireless mesh networks. The simulation results will provide valuable information for understanding the feasibility and practical implications of implementing DFL in real-world scenarios.
Moreover, the primary objective of the simulation component in this project was to explore the relative performance of two FL architectures, (a) conventional centralized (CFL) and (b) fully decentralized (DFL), while considering the success probability of each data transmission between any two network participants. Notably, this work represents the first instance of providing system simulations where each communication step is completed with a specific success probability, accounting for the underlying communication network's performance within the physical interference model. These results measure the realistic performance of practical FL systems, avoiding the assumption of faultless communications.
Incorporating the communication success probability parameter into the algorithms for both FL architectures posed challenges. Existing FL software libraries, such as TensorFlow Federated [59] and PySyft [60], do not offer the flexibility to define precisely how network participants communicate during the learning process. Therefore, we developed a fully customized software solution based on the FedML and PyTorch frameworks, which provided the most flexibility in defining and implementing custom FL network designs [61]. Specifically, after acquiring average successful transmission probabilities for various wireless mesh networking configurations through simulations, we integrated this parameter into the simulation of the FL systems (i.e., the CFL and DFL systems). This approach effectively incorporates the communication network aspect into the learning procedure.
In the following subsections, we provide further detailed descriptions of the simulation setups in other related aspects and present the corresponding results.

A. WIRELESS MESH NETWORK SIMULATION
To simulate the wireless mesh network that constitutes the backbone for communications across both the centralized and DFL architectures, we assume that the participating edge devices are placed randomly in a bounded circular two-dimensional area according to a PPP. The communication medium is characterised by an inverse square path loss law and by the presence of Rayleigh fading modelled by an exponential random variable.
In the case of the proposed decentralized architecture, it is assumed that each edge device transmits local model updates to, and receives them from, only its one-hop neighbours in the wireless mesh network connectivity graph. The one-hop neighbourhood is simulated by considering a disk $D$ of radius $r_{oh}$ around the typical receiver and including a communication link for every transmitting device that falls inside this area. In other words, each edge device transmits to other devices at a distance of less than $r_{oh}$. Moreover, the medium access control (MAC) scheme is slotted Aloha with transmission probability $p$ in each slot. It is further assumed that there are $k$ available frequency levels spanning the allocated bandwidth, which are used for concurrent transmissions from transmitters to receivers, with $k = \lambda p \pi r_{oh}^2$ equal to the mean number of transmitting devices falling inside the one-hop neighbourhood disk of the typical receiver. Therefore, the desired one-hop neighbouring transmitting devices do not interfere with each other at the typical receiver.
As previously stated, the aim of the network simulation is to find the mean value of the successful transmission probability between a transmitting node placed inside the area of the circle and the typical receiver situated, without loss of generality, at the origin of the two-dimensional plane. To calculate the target probability, the following procedure is followed:
1) Generate a Poisson-distributed random number of edge devices inside a disk of radius $R$, with a mean value equal to the process intensity $\lambda$ times the area of the disk $A = \pi R^2$.
2) Split all generated edge devices into transmitters and receivers for a specific round of communications, i.e. perform thinning of the main PPP with parameter $p$.
3) For a typical receiver placed at the origin, find the transmitters that are less than $r_{oh}$ away.
4) Calculate the Signal-to-Interference-plus-Noise Ratio (SINR) for each of these transmitting devices, considering as interference all transmitters that are outside the one-hop neighbourhood radius. If the SINR is above the threshold $\gamma$, the transmission is considered successful.
5) Repeat the above steps for $N$ rounds and calculate the successful transmission probability for a specific threshold $\gamma$ as the total number of successful transmissions over the total number of transmissions.
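The five-step procedure above can be sketched as a short Monte Carlo routine. This is an illustrative sketch only, not the code used for the reported results: the function names, the noise power `noise`, and the seed handling are assumptions introduced for the example, since the text does not specify them. Only the distance of each device to the origin is sampled, because the typical receiver sits at the origin and the path loss depends on distance alone.

```python
import math
import random


def poisson_sample(mean, rng):
    """Poisson variate: count unit-rate exponential arrivals before time `mean`."""
    count = 0
    t = rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def success_probability(lam, p, R, r_oh, gamma_db, rounds,
                        p_tx=1.0, noise=1e-9, seed=0):
    """Estimate the successful transmission probability at the origin."""
    rng = random.Random(seed)
    gamma = 10 ** (gamma_db / 10)  # dB to linear SINR threshold
    successes = attempts = 0
    for _ in range(rounds):
        # Step 1: Poisson number of devices in the disk of radius R.
        n_devices = poisson_sample(lam * math.pi * R ** 2, rng)
        desired, interference = [], 0.0
        for _ in range(n_devices):
            # Step 2: independent thinning, keep transmitters only.
            if rng.random() >= p:
                continue
            # Distance to the origin of a point drawn uniformly in the disk.
            d = R * math.sqrt(1.0 - rng.random())
            # Inverse-square path loss with exponential (Rayleigh) power fading.
            rx = p_tx * rng.expovariate(1.0) / d ** 2
            # Step 3: one-hop transmitters are desired, all others interfere.
            if d < r_oh:
                desired.append(rx)
            else:
                interference += rx
        # Step 4: SINR test for every one-hop transmitter.
        attempts += len(desired)
        successes += sum(rx / (noise + interference) > gamma for rx in desired)
    # Step 5: ratio of successful transmissions to all transmissions.
    return successes / attempts if attempts else float("nan")


p_low = success_probability(1e-3, 0.3, 500.0, 200.0, gamma_db=-20, rounds=20)
p_high = success_probability(1e-3, 0.3, 500.0, 200.0, gamma_db=20, rounds=20)
```

Because both calls reuse the same seed, they evaluate the same random network realisations, so the estimate for the higher threshold can never exceed the one for the lower threshold.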

B. CENTRALIZED FEDERATED LEARNING (CFL)
In simulating the fundamental CFL architecture as proposed in [7], the central server node is assumed to be positioned at the origin of the two-dimensional plane, as previously described in our wireless mesh network simulation setup. Within this setup, the edge devices participating in the learning process are located within the one-hop neighbourhood disk and use distinct frequency bands for simultaneous transmission. Each training iteration involves the following sequence of actions:
• The central server disseminates the global model to all participating devices.
• Every edge device employs Stochastic Gradient Descent with its own local dataset to adjust the weights of the convolutional neural network and subsequently uploads these updated weights to the central server.
• Finally, the server consolidates the local models received from the edge devices by applying one of the Federated Averaging, Krum or Median aggregators to update the global model.
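The training iteration above, combined with a per-transmission success probability, can be sketched as follows. This is a toy sketch under stated assumptions, not the FedML-based implementation: the names `cfl_round`, `sgd_mean` and `mean_aggregate` are hypothetical, the "model" is a single scalar fitted by SGD on a squared error, and the handling of a lost transmission (the client simply does not contribute in that round) is an assumption.

```python
import random


def cfl_round(global_w, client_data, local_update, aggregate, p_success, rng):
    """One CFL round in which every downlink and uplink succeeds with p_success."""
    received = []
    for data in client_data:
        if rng.random() >= p_success:
            continue  # downlink lost: this client sits out the round
        local_w = local_update(list(global_w), data)
        if rng.random() < p_success:
            received.append(local_w)  # uplink delivered to the server
    # Keep the previous global model if nothing reached the server.
    return aggregate(received) if received else list(global_w)


def sgd_mean(w, samples, lr=0.1):
    """Toy local training: SGD on the squared error (w - x)^2."""
    for x in samples:
        w[0] -= lr * 2 * (w[0] - x)
    return w


def mean_aggregate(models):
    """Unweighted average of the received models."""
    return [sum(ws) / len(models) for ws in zip(*models)]


rng = random.Random(0)
clients = [[rng.gauss(3.0, 0.1) for _ in range(20)] for _ in range(10)]
w = [0.0]
for _ in range(30):
    w = cfl_round(w, clients, sgd_mean, mean_aggregate, p_success=0.8, rng=rng)
```

Any aggregator with the same signature, for example a Krum or Median function, could be passed in place of `mean_aggregate`.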
The following is a description of the convolutional neural network used for classifying handwritten digits trained on the EMNIST dataset. The input of the neural network is a 28×28 pixel grayscale image depicting a handwritten digit. The output layer consists of 10 outputs, one for each handwritten digit, and the genetic algorithm is used to minimize the overall model size. The EMNIST benchmarking dataset [58] contains 60K grayscale images of handwritten digits. For simulation purposes, the dataset is split into random-sized parts and allocated among the participating edge devices.
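One simple way to split a dataset into random-sized parts, as mentioned above, is to shuffle the sample indices and cut them at randomly chosen positions. The sketch below is only an assumption about how such a split could be done; the function name `random_partition` and the client count are hypothetical and the text does not state the actual procedure.

```python
import random


def random_partition(n_samples, n_clients, rng):
    """Split sample indices into n_clients non-empty shards of random size."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Distinct cut points guarantee that no shard is empty.
    cuts = sorted(rng.sample(range(1, n_samples), n_clients - 1))
    bounds = [0] + cuts + [n_samples]
    return [indices[a:b] for a, b in zip(bounds, bounds[1:])]


shards = random_partition(60000, 10, random.Random(0))
```

Each shard is a list of sample indices that one edge device would use as its local dataset.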

C. DECENTRALIZED FEDERATED LEARNING (DFL)
In order to facilitate a meaningful comparison of results, the simulation setup for the fully decentralized FL system mirrors the configuration of the baseline centralized system, maintaining consistent setup parameters such as learning rate, stochastic gradient descent batch size, and the density of participating devices. Analytically, for a given edge device density (the total number of participating devices in the learning network), the objective is to train the convolutional neural network with the EMNIST benchmark dataset evenly distributed among all participating devices. Below, we outline the typical training iteration process for the fully decentralized algorithm:
• Initialization: At the onset of each training round, each local edge device possesses the current convolutional neural network weights.
• Local Model Update: Each device independently performs stochastic gradient descent to update its local model weights using its local dataset.
• Aggregation with Robust Methods: Each edge device aggregates the received local models with its own, employing a straightforward non-weighted average of each model weight, akin to Federated Averaging for the centralized case. Additionally, our work extends beyond traditional aggregation methods and evaluates the model using various robust aggregator methods, such as Krum and Median. This multifaceted approach aims to enhance performance and resilience in the face of adversarial behaviour, contributing to the robustness of the DFL framework.
This training iteration cycle, orchestrated collaboratively by the participating edge devices, enables the fully DFL system to continually refine its global model. Importantly, it mirrors the core concept of FL, leveraging the collective knowledge of decentralized devices while preserving data privacy and security. The synchronization and aggregation of local models among neighbouring devices foster collaborative learning without the need for a centralized server, emphasizing the robustness and decentralized nature of this approach.
Furthermore, a key benefit of our DFL model, particularly in high-intensity networks, is that it is designed to leverage the collective computational power of edge devices efficiently. By distributing tasks, minimizing data transfer, and promoting parallel processing, this approach inherently leads to reduced latency, making it well-suited for scenarios where low-latency responses are essential.
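The decentralized training iteration (local update followed by aggregation with one-hop neighbours) can be sketched in the same toy setting. As before, this is a sketch under assumptions rather than the authors' implementation: `dfl_round`, the ring topology, the scalar model and the rule that an undelivered neighbour model is simply left out of the average are all hypothetical choices made for illustration.

```python
import random


def dfl_round(models, datasets, neighbours, local_update, aggregate, p_success, rng):
    """One DFL round: local SGD, then each device averages what it receives."""
    # Local model update on every device.
    updated = [local_update(list(w), d) for w, d in zip(models, datasets)]
    new_models = []
    for i, own in enumerate(updated):
        # Own model plus every one-hop neighbour model that is delivered.
        received = [own] + [
            updated[j] for j in neighbours[i] if rng.random() < p_success
        ]
        new_models.append(aggregate(received))
    return new_models


def sgd_mean(w, samples, lr=0.1):
    """Toy local training: SGD on the squared error (w - x)^2."""
    for x in samples:
        w[0] -= lr * 2 * (w[0] - x)
    return w


def mean_aggregate(models):
    """Non-weighted average of the device's own and received models."""
    return [sum(ws) / len(models) for ws in zip(*models)]


rng = random.Random(0)
n = 8
datasets = [[rng.gauss(3.0, 0.1) for _ in range(20)] for _ in range(n)]
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}  # hypothetical topology
models = [[0.0] for _ in range(n)]
for _ in range(30):
    models = dfl_round(models, datasets, ring, sgd_mean, mean_aggregate, 0.8, rng)
```

Unlike the centralized case, there is no single global model here: every device keeps its own copy, and information from distant devices spreads only through repeated neighbour exchanges.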

VI. RESULTS AND DISCUSSION
In this section, we provide a detailed overview of the simulation parameters and present the results of our wireless mesh network simulation.
For the wireless mesh network simulation, we selected a disk radius of $R = 500$ meters, defining the total simulation area. The intensity of the Poisson Point Process (PPP), representing the density of participating edge devices, was varied over the values $\lambda = 10^{-3}, 5 \times 10^{-3}, 10^{-2}, 5 \times 10^{-2}, 10^{-1}$, so that the average number of participating devices ($\tilde{N}$) varies accordingly.
To emulate the behaviour of devices following the slotted Aloha Medium Access Control (MAC) scheme, we set the probability of a transmitting device deciding to transmit in a specific slot to $p = 0.3$. The one-hop neighbourhood disk radius was set to $r_{oh} = 200$ meters.
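As a rough worked example of the parameter values above, and assuming that $\lambda$ is expressed in devices per square metre (the unit is not stated explicitly), the mean number of devices in the simulation disk and the mean number of one-hop transmitters, as defined in the simulation setup, evaluate to:

```latex
\begin{align*}
\tilde{N} &= \lambda \pi R^2 = \lambda \cdot \pi \cdot 500^2 \approx 7.85 \times 10^{5}\,\lambda, \\
k &= \lambda p \pi r_{oh}^2 = \lambda \cdot 0.3 \cdot \pi \cdot 200^2 \approx 3.77 \times 10^{4}\,\lambda, \\
\lambda = 10^{-3}: &\quad \tilde{N} \approx 785, \quad k \approx 38, \\
\lambda = 10^{-2}: &\quad \tilde{N} \approx 7854, \quad k \approx 377.
\end{align*}
```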
We conducted SINR calculations for several threshold ($\gamma$) values, spanning from -20 dB to +20 dB, with $N_{sim} = 10^5$ simulation rounds for each threshold setting. Additionally, we assumed that each device transmitted with the same power level of $P_{tx} = 1$ Watt.
The diverse set of simulation parameters employed here enables us to comprehensively explore the performance of our wireless mesh network under various conditions.By systematically varying λ and γ, we gain insights into the network's robustness, scalability, and reliability, shedding light on its behaviour across different device densities and signal-to-noise environments.
This detailed analysis serves as a strong foundation for our subsequent discussion and allows us to draw meaningful conclusions about the suitability of our proposed approach for real-world scenarios.
With regard to the FL network training parameters for both centralized and decentralized architectures, the EMNIST dataset was initially split into training and validation sets, comprising 50,000 and 10,000 samples, respectively. In configuring the stochastic gradient descent algorithm's hyperparameters, we selected a learning rate ($\mu$) of 0.015 and a batch size ($n_{batch}$) of 32 samples. The performance metrics under consideration include model accuracy and cross-entropy loss, specifically tailored for categorical data.
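For reference, the two performance metrics named above can be computed as in the following sketch. It assumes that the network outputs one raw score (logit) per class and that labels are integer class indices; these conventions and the function names are assumptions for illustration, not details taken from the simulation code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one vector of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(batch_logits, labels):
    """Mean categorical cross-entropy over a batch."""
    losses = [-math.log(softmax(z)[y]) for z, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


def accuracy(batch_logits, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    hits = sum(
        max(range(len(z)), key=lambda c: z[c]) == y
        for z, y in zip(batch_logits, labels)
    )
    return hits / len(labels)


# Two-sample, two-class toy batch (hypothetical values).
logits = [[2.0, 0.0], [0.0, 0.0]]
labels = [0, 1]
loss = cross_entropy(logits, labels)
acc = accuracy(logits, labels)
```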
In Figure 4, we present the calculated successful transmission probability as a function of varying γ threshold values, encompassing different PPP intensity values in both theoretical and simulation contexts.A discernible trend emerges, showcasing that as the participating device density increases, the resulting mean value of the successful transmission probability experiences a non-linear decrease.This relationship can be attributed to the escalating total interference power at the typical receiver, resulting from more devices transmitting data, driven by a constant slotted Aloha transmission probability (p).
It is noteworthy that, in both CFL and DFL systems, we adopted an SINR threshold ($\gamma$) value of -16 dB as the criterion for deeming a transmission successful. This choice of threshold ensures that our evaluation remains consistent across different scenarios, providing a clear basis for comparative analysis. Table 4 presents the convolutional neural network model accuracy and cross-entropy loss for the centralized and DFL architectures for increasing model training epochs at device density $\lambda = 0.01$.
The simulations were conducted in a 20-fold manner, and the average values for each metric are presented here.In the case of decentralization, performance metrics are derived by averaging the individual metrics of each participating device across the entire network.It is worth noting that in the centralized scenario, the model achieves a convergence threshold of over 95% model accuracy after an average of 152 training epochs.Conversely, in the decentralized scenario, the predefined threshold is reached after an average of 173 epochs.This represents a 13.8% increase in the required number of training iterations, consequently impacting the overall system latency to the same degree.
The increase in training epochs can be attributed to the diffusion delay introduced by the decentralized architecture. In the decentralized approach, the contributions of each device's local dataset to the local model update are not immediately transferred across the network. This is in contrast to the centralized approach, where a single aggregator collects all individual contributions in each round to update the global model. Thus, this trade-off becomes evident when transitioning to a fully decentralized solution with a realistic model of the core communication network. Despite the trade-off of increased training epochs, the advantages of data security and resilience make DFL a compelling choice for collaborative machine learning in decentralized settings. This phenomenon highlights the practical modelling of the core communication between devices in DFL scenarios, where the network topology and communication mechanisms play an important role in shaping the learning dynamics. Understanding and addressing these challenges are essential steps toward optimizing DFL frameworks for real-world applications.
Figures 5, 6, 7, and 8 illustrate the accuracy and cross-entropy loss of both architectures for varying device densities ($\lambda$). These charts clearly demonstrate an inverse relationship between model accuracy and network device density. This outcome emphasizes the significance of considering the communication network's performance in the analysis of an FL system. Interestingly, an increased number of devices participating in the learning process, while expected to enhance convergence, actually leads to reduced model accuracy due to elevated interference from additional transmitters.
Furthermore, it is noteworthy that the rate of increase in accuracy until the 90% threshold is nearly identical between the baseline centralized solution and the proposed DFL system.This observation is of particular significance, indicating that by slightly adjusting the convergence threshold criterion, there's no compromise in performance when transitioning to the proposed DFL framework.
Our model employs a genetic algorithm-based approach to optimize model size, making it suitable for resource-constrained IoT and wearable devices. This model compression contributes to reducing the complexity as the model size decreases, enhancing its applicability in such constrained environments. Despite this reduction and local model compression to minimize communication overhead and complexity, our approach excels in achieving high model performance. It competes effectively with traditional DFL models, as demonstrated in Figure 9. This highlights the efficiency and promise of our approach in striking a balance between model size and performance.
Our study conducts a comprehensive evaluation of the proposed model, considering various aggregator methods applicable across a spectrum of use cases.Notably, we integrate Median and Krum aggregator methods into the DFL framework, making our work a pioneering effort in this regard.Figure 9 presents a comparative analysis of DFL and CFL performance metrics over multiple training iterations, employing a diverse set of aggregator methods, including FedAvg, Median, and Krum.Subplot (a) provides insight into DFL's accuracy trends, while subplot (c) showcases CFL's accuracy trends.Subplots (b) and (d) delve into the corresponding loss trends for DFL and CFL, respectively.This in-depth analysis offers a window into the intricacies of training dynamics and convergence behaviour inherent to both approaches.
Lastly, our results demonstrate that the DFL model consistently achieves accuracy rates exceeding 93% and exhibits lower loss across all our proposed aggregator methods. This makes it a versatile and suitable model for various purposes. These findings emphasize the inherent advantages of DFL, especially in decentralized settings where concerns about data privacy and distribution are paramount. Moreover, they underscore DFL's potential as a robust and versatile framework well suited for collaborative applications across edge devices.

VII. CONCLUSION AND FUTURE WORK
In conclusion, our study explores a contemporary approach to distributed machine learning, providing an alternative to traditional Internet of Things (IoT) model learning systems.We prioritize data security by keeping raw edge device data local and sharing only model parameters.Recognizing limitations in CFL systems, such as single points of failure and susceptibility to malicious attacks, we embrace a DFL architecture.
Our work addresses DFL framework limitations through convergence analysis and comprehensive performance assessment of the communication network, utilizing wireless mesh networking.Our theoretical analysis employs stochastic geometry to derive a closed-form equation for successful transmission probability in wireless mesh networks.We present a detailed DFL architecture description and training procedure, establishing a learning convergence criterion.Simulations compare our DFL architecture to the conventional CFL model.Using practical slotted Aloha wireless mesh networks and the EMNIST dataset for handwritten digit classification, we extract the successful transmission probability, accounting for potential transmission failures.
Results demonstrate that our decentralized architecture closely matches centralized counterparts in terms of accuracy and average loss. Importantly, our study integrates geometric analysis and diverse aggregator methods (Krum and Median) over compressed models, achieving high performance while significantly reducing the communication overhead. This approach highlights the practicality of decentralized architectures and offers an efficient framework for future IoT systems, potentially scalable in real-world applications.
Future work for this project will focus on the following areas: (i) Investigation of suitable networking protocols: Further research will explore networking protocols specifically tailored for wireless mesh networks, such as the Thread protocol.The aim is to identify protocols that support low-power, low-data rate transmissions, which are particularly well-suited for IoT system solutions.This research will contribute to the development of more efficient and optimised communication mechanisms within wireless mesh networks.(ii) Exploration of blockchain-enabled solutions: Future research could investigate the integration of blockchain technology into the compressed DFL architecture.As blockchain technology matures, it presents opportunities for enhancing data security, privacy, and trust in the context of FL.The exploration of blockchain-enabled solutions can contribute to the development of more robust and resilient DFL systems, providing additional layers of data integrity and transparency.

FIGURE 1. Typical data flow and analysis in IoT systems.
A. SYSTEM ARCHITECTURE
In a typical Federated Learning (FL) training process, three key steps are involved. It is important to distinguish between the local model, which is the model trained on each participating device, and the global model, which is the model aggregated by the FL server. The main learning steps are as follows:
1) Task Initialization (Step 1): At the beginning, the server determines the training task, which refers to the specific application and its data requirements. The server also sets certain hyperparameters for the global model, such as the learning rate. Afterwards, the server shares the initial global model $W_G^0$ and the task details with selected participants.
2) Local Model Training and Update (Step 2): Building on the current global model $W_G^t$ (with $t$ as the iteration index), each participant uses its local data and device to update its local model parameters, denoted $W_i^t$. In each iteration, participant $i$ seeks the optimal parameters $W_i^{t*}$ that minimize the loss function $L(\cdot)$. The loss function varies depending on the problem and the model employed, as illustrated in Table 3. After updating, the local model parameters are transmitted back to the server.
3) Global Model Aggregation and Update (Step 3): In this final step, the server aggregates the local models received from all participants. After aggregating, the server generates the updated global model parameters $W_G^{t+1}$ and sends them back to the respective data owners.

FIGURE 4. The theoretical (solid lines) and simulation (dashed with markers) successful transmission probability as a function of the threshold value γ in dB.

FIGURE 5. Model accuracy as a function of training iterations for CFL.

FIGURE 6. Model cross-entropy loss as a function of training iterations for CFL.

FIGURE 7. Model accuracy as a function of training iterations for DFL.

FIGURE 8. Model cross-entropy loss as a function of training iterations for DFL.

FIGURE 9. Comparative analysis of DFL and CFL performance metrics using various aggregator methods.

TABLE 1. An assessment of surveys exploring CFL and DFL.

TABLE 2. Summary on FL-related topics with our paper's contribution.
• Local Model Update: Each device independently performs stochastic gradient descent to update its local model weights using its local dataset.
• Aggregation with Robust Methods: Each edge device

TABLE 4. FL Simulation Results.