Communication-Efficient Federated Learning for Hybrid VLC/RF Indoor Systems

Federated Learning (FL) enables smart devices to collaboratively train Machine Learning (ML) models in a distributed manner without sharing their private data with a central server. However, the disparity in the communication and computation capabilities of smart devices and the heterogeneity of their local datasets degrade the performance of FL in terms of latency and accuracy. To mitigate this effect, we address the problems of device selection and resource allocation in an indoor environment where multiple smart devices participate in the FL process. To further reduce the communication latency, we use Visible Light Communication (VLC) for the downlink transmission while a Radio Frequency (RF) access point supports the uplink transmission in the proposed system. Accordingly, we formulate a multi-objective optimization problem for joint device selection and resource allocation in a hybrid VLC/RF system. Then, using the weighted method, the problem is converted to a single-objective optimization problem, which is solved by incrementally selecting devices in each iteration. The embedded device selection scheme in the proposed algorithm is based on the significance of the candidate devices' local gradients and their alignment with the global tendency, so that candidates are intelligently prioritized in the training procedure. Simulation results show that the joint device selection and resource allocation scheme improves the accuracy of the ML model and reduces the average delay in the presence of both system and data heterogeneity. Additionally, the proposed hybrid VLC/RF system decreases the latency of the FL process in the downlink mode compared to conventional RF systems.


I. INTRODUCTION
Machine Learning (ML) has become an integral part of Internet of Things (IoT) networks, which employ the massive amount of data generated by IoT smart devices to enable indoor services, such as localization and activity recognition. Recent ML models require a large dataset to generalize well before they are deployed in real-world applications. However, such a dataset is often distributed among smart devices in a decentralized manner. On the other hand, conventional ML methods only perform the training procedure when the local datasets are collected in one location; as a result, they cannot guarantee the privacy of clients. Additionally, the transmission of datasets from smart devices to data centers in wireless networks is not feasible due to communication limitations. In this regard, Federated Learning (FL) has been introduced as an ML-based solution that can preserve the privacy of clients and reduce the communication burden on the network by applying distributed machine learning [1]. Typically, FL involves several iterations to train an ML model. In each iteration, referred to as a communication round, the selected devices update their local models using the local datasets available at the edge. The updated models are then transmitted to a central server where they are aggregated to construct a new global model. Finally, the aggregated model is distributed among all the smart devices. This process is repeated until the model converges.
The associate editor coordinating the review of this manuscript and approving it for publication was Li Zhang.
Implementation of FL in 5G wireless networks and beyond comes with various challenges including delay, limited communication resources, restricted energy consumption in smart devices, and particularly the heterogeneity challenge. Generally speaking, there exist two types of heterogeneity for the smart devices in an indoor environment: i) data heterogeneity, and ii) system heterogeneity. The first one refers to the statistical difference among the local datasets stored at the edge side. The latter is due to the performance variability of smart devices in computation and communication. Both statistical and system heterogeneity can severely affect the performance of the FL algorithm. Statistical heterogeneity has more impact on the convergence and accuracy of the trained ML model, while the problems caused by system heterogeneity are primarily concerned with the delay and energy consumption in each communication round. Efficient device selection and resource allocation are regarded as essential techniques that can improve the performance of FL in the current communication systems [2]. Accordingly, it is imperative to select devices that can not only contribute more to the FL process but also have adequate communication channels. Typically, resource allocation techniques in wireless networks are designed to maximize the average throughput of the network. However, in an FL process, the objective of resource allocation problems is to minimize the delay in each communication round while meeting energy constraints.
Although applying resource allocation techniques in Radio Frequency (RF) systems can improve the FL process in terms of latency and energy consumption, RF alone cannot easily handle the transmission of huge ML models between smart devices and the central server. In recent years, Visible Light Communication (VLC) has been recognized as a complementary technology for indoor RF systems. Compared to RF, VLC provides higher data rates, energy efficiency, and security; however, it requires a Line-of-Sight (LoS) link for reliable transmission. In most indoor applications, hybrid VLC/RF systems are used to improve communication performance, where VLC is utilized for downlink transmission and RF is employed for the uplink. Motivated by the aforementioned challenges, in this paper, we study the problem of device selection and bandwidth allocation for federated learning in a hybrid VLC/RF indoor system. We formulate an optimization problem for joint device selection and resource allocation under delay and energy constraints.

A. RELATED WORKS
The problems of client selection [3], [4], [5], [6], [7], [8] and resource allocation [9], [10], [11], [12], [13], [14] for federated learning in wireless networks have been widely studied in the literature. While some works have addressed the two problems separately, others consider joint client selection and resource allocation [15], [16]. In [9], an optimization problem was formulated for joint learning and communication in an FL process, whose goal was to minimize the total energy consumption under delay constraints. The authors in [10] proposed a joint device scheduling and resource allocation scheme to maximize the model accuracy in a given training time with latency constraints. Reference [11] presented a Hierarchical Federated Edge Learning (HFEL) framework in which model aggregation is partially migrated to edge servers from the cloud. Additionally, it formulated a joint computation and communication resource allocation and edge association problem under the proposed HFEL framework. Reference [12] developed an importance-aware joint data selection and resource allocation algorithm to maximize learning efficiency, and derived closed-form results for both the optimal communication resource allocation and data selection. Reference [13] studied adaptive power allocation for distributed gradient descent in wireless FL under both Orthogonal Multiple Access (OMA) and Non-Orthogonal Multiple Access (NOMA) transmission with "over-the-air computing". The work in [14] aimed to accelerate the Deep Neural Network (DNN) training task by jointly optimizing the local training batch size and communication resource allocation to achieve fast training speed while maintaining learning accuracy for both Central Processing Unit (CPU) and Graphics Processing Unit (GPU) scenarios. Reference [17] proposes a probabilistic device selection framework where the candidates with a larger gradient magnitude have a higher chance of being selected.
Authors in [18] develop a device selection scheme that identifies irrelevant updates made by the clients and stops them from being uploaded to the server. To this end, each client checks whether its update is in alignment with the global model. Authors in [3] considered a multi-criteria FL approach for client selection and maximized the number of users in each round. An online heuristic FL approach was presented in [4] to choose the best candidates based on test accuracy. Authors in [5] proposed a client selection for the FL algorithm, named Client Selection Federated Averaging (CSFedAvg), to mitigate the biases in model training caused by non-independent identically distributed clients. The work in [6] presented a dynamic client selection scheme in a power grid mobile edge computing environment for the FL problem. In [7] an algorithm based on reinforcement learning was proposed for client selection to minimize the energy consumption and the training delay such that users are encouraged to participate in the FL process. Reference [8] modeled the client selection for an FL algorithm with a fairness guarantee as a Lyapunov optimization problem. Reference [15] formulated an optimization problem to minimize the FL loss by considering joint learning, resource allocation, and user selection. Reference [16] follows a later-is-better policy to maximize the weighted sum of selected clients in the FL algorithm for a fixed number of communication rounds in the FL process, while satisfying a long-term energy budget.
There has been limited work on the application of VLC indoor systems for federated learning. Reference [19] formulated a problem for user selection and bandwidth allocation in a hybrid VLC/RF system. The problem was first separated into two sub-problems: a user selection problem with a known bandwidth allocation solved by a traversal algorithm, and a bandwidth allocation problem with a given user selection, which was solved by a numerical method. The final user selection and bandwidth allocation were obtained by iteratively solving the two sub-problems. The proposed algorithm in [20] first used a model compression method to reduce the size of the FL model in a hybrid VLC/RF system. Then, similar to [19], it solved the user selection and bandwidth allocation problems as two separate problems. In addition, Table 1 compares the existing works with the proposed algorithm in this paper.

B. MAIN CONTRIBUTIONS
In this paper, we study the performance of FL in a hybrid VLC/RF system. In this regard, we consider an indoor environment with multiple users, each of whom has several smart devices that generate their own local datasets used for training an ML model. Regarding the communications, we consider multiple VLC transmitters on the ceiling used for illumination and downlink transmission. We assume that smart devices are equipped with Photodiodes (PDs) to receive data from VLC transmitters. For the uplink transmission, it is presumed that each smart device has RF antennas to upload its data. The main contributions of this paper are summarized as follows:
• In order to improve the performance of the FL algorithm in training DNN models, we formulate an optimization problem for joint device selection and bandwidth allocation in the presence of multiple VLC transmitters for downlink and RF transmission for uplink. We also consider both data and system heterogeneity among devices in order to create a more realistic scenario.
• Due to the limited bandwidth in the communication system, only a subset of devices can be selected in each communication round. In addition, due to the data heterogeneity in the local datasets, different devices contribute non-identically to the training process. To improve the convergence of the FL process under the communication limitations, we propose a novel device selection technique in which the smart devices that contribute most to the training process have a higher chance of being selected in each communication round. To this end, different from [17] and [18], which consider only the magnitude or the angle of local gradients, the proposed device selection scheme takes into account both the magnitude and the angle of the local gradients based on the inner product of gradients (GradInn). Accordingly, GradInn prioritizes devices based on the significance and mutuality of local gradients while allowing all devices to participate in the training procedure, which potentially increases the accuracy and decreases the convergence time.
• To evaluate the performance of the proposed FL algorithm, we define separate metrics for accuracy and latency in the FL process. In addition, we demonstrate the overall performance based on a unified parameter from the mentioned metrics. Simulation results show that the proposed GradInn selection scheme increases the convergence speed of the training process compared to the alternatives where the absolute value and angle of gradients are used. In addition, using VLC for downlink in a realistic environment reduces the average delay of the FL algorithm in presence of different levels of system heterogeneity.

II. SYSTEM MODEL AND PROBLEM DESCRIPTION
In this section, we first describe the hybrid VLC/RF indoor environment where several clients possess multiple smart devices equipped with both VLC and RF capabilities. Then, a brief background of FL is presented to give more insight into the proposed framework. Moreover, we elaborate on the communication and computation models of smart devices in the FL process, which we later use to formulate the proposed joint device selection and resource allocation scheme. Table 2 summarizes the notation used in this paper.
VOLUME 10, 2022

A. INDOOR ENVIRONMENT CONFIGURATION
This article considers a hybrid VLC/RF system that employs FL in an indoor environment, such as a smart living room, with multiple RF transceivers, e.g., Wi-Fi modems. It is assumed that $N$ LEDs are installed on the ceiling of the indoor area and act as VLC transmitters in the downlink communication mode, indexed by $T_v = \{T_1, \ldots, T_N\}$. It is worth mentioning that the number and position of the VLC transmitters are set such that the illumination requirements of the indoor environment are satisfied. In this work, we differentiate between users (clients or persons) and devices in the indoor environment. In this regard, we consider $N_u$ users in the set $U = \{u_1, \ldots, u_{N_u}\}$, where each user $u_i$ has $N_d^u$ smart devices, e.g., smart TV, laptop, and smartphone, indexed by $U_i = \{u_i^1, \ldots, u_i^{N_d^u}\}$. Moreover, each smart device $u_i^j \in U_i$ has access to a dataset $D_{u_i^j} = (x_{u_i^j}, y_{u_i^j})$, where $x_{u_i^j}$ and $y_{u_i^j}$ denote the inputs and labels of the dataset, respectively. We also presume that each device is equipped with Photodiodes (PDs) as VLC receivers, and mmWave antennas to communicate RF signals. The LoS channel coefficient between smart device $u_i^j$ and VLC transmitter $T_v$, $v = 1, \ldots, N$, is calculated as in (1), where $m$, $A_r$, $\theta_{T_v,u_i^j}$, and $\phi_{T_v,u_i^j}$ denote the Lambertian factor, the area of the receiver, the irradiance angle, and the incident angle, respectively [21]. In addition, $g(\phi_{T_v,u_i^j}) = \left(\frac{n_r}{\sin \Phi_c}\right)^2 \mathrm{rect}\left(\frac{\phi_{T_v,u_i^j}}{2\Phi_c}\right)$ represents the gain of the optical concentrator [21], $n_r$ denotes the refractive index, and $\Phi_c$ indicates the receiver Field of View (FoV) semi-angle. The VLC transmitters are connected to a Central Management System (CMS) using fast communication links, such as fiber optics. The CMS can be any high-performance device in the environment, such as a smart TV, or a separate device with sufficient processing capabilities, e.g., a Raspberry Pi. It should be noted that the CMS can also act as the central server in the FL algorithm, which is responsible for selecting devices, distributing the updated global model among the selected devices, and performing communication tasks such as horizontal/vertical handovers.
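As a rough illustration of how a LoS channel coefficient of this kind could be evaluated numerically, the sketch below assumes the widely used Lambertian form $h = \frac{(m+1)A_r}{2\pi d^2}\cos^m(\theta)\,g(\phi)\cos(\phi)$. The link distance `distance_m`, the omission of any optical filter gain, and all function names are assumptions made for this example and are not taken from the text.

```python
import math


def lambertian_order(half_power_semiangle_deg: float) -> float:
    """Lambertian factor m = -ln(2) / ln(cos(theta_half)) (standard definition)."""
    return -math.log(2.0) / math.log(math.cos(math.radians(half_power_semiangle_deg)))


def concentrator_gain(incidence_deg: float, n_r: float, fov_deg: float) -> float:
    """Concentrator gain (n_r / sin(Phi_c))^2 inside the FoV, 0 outside it."""
    if abs(incidence_deg) > fov_deg:
        return 0.0
    return (n_r / math.sin(math.radians(fov_deg))) ** 2


def los_gain(distance_m: float, irradiance_deg: float, incidence_deg: float,
             m: float, area_m2: float, n_r: float, fov_deg: float) -> float:
    """LoS gain under the assumed Lambertian model (distance_m is an assumption)."""
    g = concentrator_gain(incidence_deg, n_r, fov_deg)
    if g == 0.0:
        return 0.0  # receiver does not see the transmitter
    theta = math.radians(irradiance_deg)
    phi = math.radians(incidence_deg)
    geometric = (m + 1.0) * area_m2 / (2.0 * math.pi * distance_m ** 2)
    return geometric * math.cos(theta) ** m * g * math.cos(phi)


# Example: a device 2 m directly below an LED with a 60-degree half-power semi-angle.
m = lambertian_order(60.0)
h = los_gain(2.0, 0.0, 0.0, m, area_m2=1e-4, n_r=1.5, fov_deg=60.0)
```

The hard cut-off outside the FoV mirrors the rectangular window in the concentrator gain, which is why a blocked or badly oriented photodiode contributes no downlink rate at all in such a model.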

B. FL PRELIMINARY
For the sake of completeness and to ease the understanding of the proposed Fed-VL process, we briefly explain the basics of FL. The aim of federated learning is to train a global model using the local data distributed among smart devices while maintaining the privacy of clients. In the FL process, an ML model (e.g., a deep neural network) is first initialized in the central server. The rest of the FL procedure consists of several rounds of updating the ML model and communicating it between smart devices and the CMS. In each round, a pre-defined loss function is first minimized over the local datasets at each selected device. Then, the local models are combined in the CMS to minimize the loss function over the collection of all local datasets. This process continues until the global model converges to a model that reaches sufficient accuracy for all devices. It is worth mentioning that we consider label heterogeneity among the devices of different users (e.g., $D_{u_1^1}$, $D_{u_2^1}$, and $D_{u_3^2}$ cover different classes), while the datasets on the devices of the same user (e.g., $D_{u_1^1}$, $D_{u_2^1}$, $D_{u_3^2}$) have distribution heterogeneity, i.e., they have the same classes with different distributions. We describe the six stages of a single communication round in a typical FL algorithm as follows:

1) DEVICE REQUEST
In this step, a set $U_c$ of smart devices volunteer as candidates to join the FL process based on their remaining energy, CPU usage, and data availability. To this end, they transmit the requested information, e.g., channel state information, energy capabilities, and their locations, to the CMS and volunteer to participate in the next communication round of the FL process.

2) DEVICE SELECTION
The CMS chooses a subset $U_s \subset U_c$ of the candidates based on the received information to actively participate in the FL process, while the other devices remain silent in the current round. Note that selecting all the candidates may increase the convergence speed of the FL algorithm because more data samples contribute to the learning process, assuming there are no malicious users in the network. However, due to the limited bandwidth and energy capabilities, selecting all the clients is not feasible. Thus, it is important to maximize the number of selected devices in each communication round such that the delay and energy constraints are satisfied. Moreover, based on their datasets, smart devices have different contributions to the learning process. Thus, selecting devices that follow the global tendency of the network can improve the convergence of the ML model.

3) LOCAL UPDATE
The objective of each device $u_i^j \in U_s$ is to update the local model parameters, represented by a vector $\omega_{t,u_i^j}$, by minimizing the local objective function $f(\omega_{t,u_i^j}, D_{u_i^j})$ over all the samples in its local dataset. Accordingly, the new local model parameters, $\omega_{t+1,u_i^j}$, at client $u_i^j$ are obtained as in (2).
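To make the local update step concrete, the following is a minimal sketch that assumes plain gradient descent on a mean-squared-error loss for a linear model. The loss, the learning rate `lr`, the iteration count, and the function names are illustrative assumptions; the update rule actually used in (2) may differ.

```python
from typing import List, Sequence


def mse_gradient(w: Sequence[float], xs: Sequence[Sequence[float]],
                 ys: Sequence[float]) -> List[float]:
    """Gradient of (1/n) * sum_i (w . x_i - y_i)^2 with respect to w."""
    n = len(xs)
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for k, xk in enumerate(x):
            grad[k] += 2.0 * err * xk / n
    return grad


def local_update(w: Sequence[float], xs: Sequence[Sequence[float]],
                 ys: Sequence[float], lr: float = 0.1,
                 iterations: int = 50) -> List[float]:
    """Run a fixed number of local gradient-descent iterations (assumed rule)."""
    w = list(w)
    for _ in range(iterations):
        g = mse_gradient(w, xs, ys)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


# Example: a device whose local data follows y = 2x starts from the global weight 0.
w_new = local_update([0.0], xs=[[1.0], [2.0]], ys=[2.0, 4.0])
```

In this toy case the error to the true weight halves at every iteration, so the local model is essentially converged long before the 50 iterations are used up.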

4) UPLINK TRANSMISSION
The selected smart devices should transmit their updated models, $\omega_{t+1,u_i^j}$, to the CMS through uplink RF transmission. Since gradients can leak information, usually the difference between the gradients of two consecutive communication rounds is shared with the CMS.

5) MODEL AGGREGATION
In the communication round $t$, the CMS is responsible for aggregating the local models and generating a global model. Denoting $\omega_t$ as the parameter weights of the aggregated model in communication round $t$, the global loss, denoted by $F(\omega_t)$, is obtained as follows:

$$F(\omega_t) = \sum_{u_i^j \in U_s} \pi_{u_i^j}\, f(\omega_t, D_{u_i^j}), \tag{3}$$

where $\pi_{u_i^j} > 0$ denotes the global aggregation weight for device $u_i^j \in U_s$, which satisfies $\sum_{u_i^j \in U_s} \pi_{u_i^j} = 1$. In order to minimize the objective function in (3), different techniques have been proposed. In this paper, we apply the well-known FedAvg approach, which uses the weighted average of the parameters in the local models to obtain the global model as in (4), where $n$ is the total number of data samples in the collection of local datasets $D = \cup_{u_i^j \in U_p} D_{u_i^j}$.
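The aggregation step can be sketched as follows, assuming the usual FedAvg choice in which each local model is weighted by its share of the total number of samples. The function name `fedavg` and the list-based representation of the weight vectors are assumptions made for this example.

```python
from typing import List, Sequence


def fedavg(local_weights: Sequence[Sequence[float]],
           sample_counts: Sequence[int]) -> List[float]:
    """Weighted average of local weight vectors, weights = n_k / n (assumed form)."""
    n = sum(sample_counts)
    dim = len(local_weights[0])
    aggregated = [0.0] * dim
    for w, n_k in zip(local_weights, sample_counts):
        share = n_k / n  # aggregation weight of this device; shares sum to 1
        for i in range(dim):
            aggregated[i] += share * w[i]
    return aggregated


# Example: the device holding three times more samples pulls the average towards it.
global_w = fedavg([[1.0, 0.0], [3.0, 4.0]], sample_counts=[1, 3])
```

Because the shares sum to one, they play the role of the aggregation weights $\pi_{u_i^j}$ in the global loss.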

6) MODEL DISTRIBUTION
Finally, after aggregating the local updates in the communication round $t$, the CMS sends the vector of the model weights, denoted by $\omega_t$, to all the smart devices via downlink VLC transmission.

C. COMMUNICATION AND COMPUTATION MODEL
To get more insight into the FL process, we formulate the latency and energy models for each communication round.

1) DELAY
The delay of each smart device $u_i^j \in U_i$ in communication round $t$ is calculated as follows:

$$d_{t,u_i^j} = d^{dl}_{t,u_i^j} + d^{comp}_{t,u_i^j} + d^{ul}_{t,u_i^j}, \tag{5}$$

where $d^{dl}_{t,u_i^j}$, $d^{comp}_{t,u_i^j}$, and $d^{ul}_{t,u_i^j}$ represent the delay associated with the downlink VLC transmission, local updates, and uplink RF transmission, respectively. Note that the total delay of communication round $t$ is obtained as follows:

$$d_t = \max_{u_i^j \in U_s} d_{t,u_i^j}. \tag{6}$$

The achievable uplink rate of device $u_i^j$, written $r^{ul}_{t,u_i^j}$, is given by (7), where $N_0^{ul}$ and $I^{ul}_{t,u_i^j}$ denote the powers of the white Gaussian noise and the interference related to user $u_i^j \in U$ at round $t$, respectively. Let $S_{t,u_i^j}$ denote the number of bits required for device $u_i^j$ to transmit its local model to the CMS. The delay of uplink transmission, $d^{ul}_{t,u_i^j}$, is then

$$d^{ul}_{t,u_i^j} = \frac{S_{t,u_i^j}}{r^{ul}_{t,u_i^j}}. \tag{8}$$

Denoting $b^{dl}_{t,u_i^j}$, $p^{dl}_{t,u_i^j}$, and $N_0^{dl}$ as the bandwidth, transmission power, and the noise power related to device $u_i^j$ in the downlink transmission mode, the rate and delay of downlink transmission can be calculated similarly to (7) and (8). The delay of computational tasks in communication round $t$ at device $u_i^j$ is obtained as follows:

$$d^{comp}_{t,u_i^j} = \frac{c_{u_i^j} L_{t,u_i^j}}{f_{u_i^j}}, \tag{9}$$

where $c_{u_i^j}$ is the number of required CPU cycles for processing one bit of data, and $L_{t,u_i^j}$ is the number of bits required for the local update at device $u_i^j$ and round $t$. In addition, $f_{u_i^j}$ is the frequency of the processing unit for client $u_i^j$ allocated to the FL task.
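The per-round delay bookkeeping can be sketched in a few lines. The rate expression below assumes a Shannon-type form $b \log_2(1 + p h / (I + N_0))$ with $N_0$ treated as a noise power; this form, the channel gain argument, and all function names are assumptions for illustration and may not match (7) exactly.

```python
import math
from typing import Sequence


def uplink_rate(bandwidth_hz: float, tx_power_w: float, channel_gain: float,
                noise_w: float, interference_w: float = 0.0) -> float:
    """Assumed Shannon-type rate in bit/s: b * log2(1 + p*h / (I + N0))."""
    sinr = tx_power_w * channel_gain / (interference_w + noise_w)
    return bandwidth_hz * math.log2(1.0 + sinr)


def uplink_delay(model_bits: float, rate_bps: float) -> float:
    """Time to upload the local model: S / r."""
    return model_bits / rate_bps


def computation_delay(cycles_per_bit: float, update_bits: float, cpu_hz: float) -> float:
    """Local update time: c * L / f."""
    return cycles_per_bit * update_bits / cpu_hz


def device_delay(d_dl: float, d_comp: float, d_ul: float) -> float:
    """Total delay of one device: downlink + computation + uplink."""
    return d_dl + d_comp + d_ul


def round_delay(device_delays: Sequence[float]) -> float:
    """A synchronous round waits for the slowest selected device."""
    return max(device_delays)


# Example: 1 MHz of uplink bandwidth at an SINR of 3 gives 2 Mbit/s.
r = uplink_rate(1e6, tx_power_w=1.0, channel_gain=3.0, noise_w=1.0)
d = device_delay(0.1, computation_delay(10.0, 1e6, 1e9), uplink_delay(4e6, r))
```

Since the round delay is a maximum, giving extra bandwidth to a device that is already fast does nothing for the round; only the slowest device matters, which is what motivates the min-max formulation used later.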

2) ENERGY CONSUMPTION
The total energy consumption of each participating device $u_i^j \in U$ at communication round $t$ is the sum of the energy consumed for computational updates and for the uplink RF transmission, which is calculated as follows:

$$e_{t,u_i^j} = e^{comp}_{t,u_i^j} + e^{ul}_{t,u_i^j}, \tag{10}$$

where $e^{comp}_{t,u_i^j}$ and $e^{ul}_{t,u_i^j}$ represent the energy consumption associated with the local updates and uplink RF transmission of device $u_i^j$ at communication round $t$. Note that $e^{comp}_{t,u_i^j}$ depends on the processing capabilities of client $u_i^j$, and can be obtained as follows:

$$e^{comp}_{t,u_i^j} = \epsilon_{u_i^j}\, c_{u_i^j} L_{t,u_i^j}, \tag{11}$$

where $\epsilon_{u_i^j}$ is the energy consumption for one CPU cycle. The energy consumption for the uplink RF transmission, $e^{ul}_{t,u_i^j}$, is given by (12). Note that the energy consumption of each client $u_i^j \in U$ is bounded by $e^{max}_{u_i^j}$, i.e., $e_{t,u_i^j} \le e^{max}_{u_i^j}$. In the FL process, we aim to reduce the energy consumption of smart devices with limited energy, while reaching a high accuracy for the trained model.
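A small sketch of the energy check for one device follows. It assumes that the computation energy is the energy per CPU cycle times the number of cycles, and that the uplink energy is the transmit power multiplied by the uplink delay; the second form is a common modelling choice and is an assumption here, as are the function names.

```python
def computation_energy(energy_per_cycle_j: float, cycles_per_bit: float,
                       update_bits: float) -> float:
    """Energy of the local update: energy per cycle * cycles per bit * bits."""
    return energy_per_cycle_j * cycles_per_bit * update_bits


def uplink_energy(tx_power_w: float, uplink_delay_s: float) -> float:
    """Assumed uplink energy: transmit power * transmission time."""
    return tx_power_w * uplink_delay_s


def total_energy(e_comp_j: float, e_ul_j: float) -> float:
    """Per-round energy of a device: computation plus uplink."""
    return e_comp_j + e_ul_j


def within_budget(energy_j: float, e_max_j: float) -> bool:
    """True if the device respects its per-round energy bound."""
    return energy_j <= e_max_j


# Example: 1 nJ per cycle, 10 cycles per bit, 1 Mbit update, 0.2 W for 2 s.
e = total_energy(computation_energy(1e-9, 10.0, 1e6), uplink_energy(0.2, 2.0))
ok = within_budget(e, e_max_j=0.5)
```

Under this assumed model a smaller bandwidth share lengthens the upload and therefore raises the uplink energy, so the energy bound indirectly imposes a minimum bandwidth on each selected device.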

III. PROPOSED DEVICE SELECTION AND RESOURCE ALLOCATION
This section addresses two crucial problems in the FL process, namely, device selection and resource allocation in the proposed hybrid VLC/RF environment. From a machine learning perspective, device selection has a significant effect on the convergence of the FL process. Accordingly, increasing the number of data samples, or equivalently, the number of selected devices, can improve the performance of the FL algorithm [22]. However, from a communication aspect, a larger number of participants in each communication round leads to a higher delay for a fixed bandwidth. Additionally, despite having a major learning contribution, some smart devices might have a poor communication channel with the CMS which can severely degrade the performance of the FL process by slowing down the other participants. It is, thus, necessary to make a trade-off between the selected devices and the bandwidth assigned to each participating device by jointly considering both the problems of device selection and resource allocation. Limited by the communication constraints, the objective of the FL algorithm is to select smart devices which make a higher contribution to training the shared model. In the meantime, it is aimed to maximize the number of selected devices, while minimizing the delay in each communication round. To this end, we propose a device selection scheme named GradInn, which is based on the local gradient of the candidates and the global gradient. Furthermore, we define an optimization problem to jointly optimize the selected devices and the delay of each communication round.
In the following, we explicitly describe our model and formulation for the device selection and bandwidth allocation problems. Afterward, a multi-objective optimization problem is formulated for joint device selection and resource allocation with delay and energy constraints. The problem is then converted to a single-objective optimization problem which is shown to be non-convex. Finally, we propose an algorithm for solving the problem by incrementally adding devices to the set of participating devices, $U_s$, in the FL algorithm.

A. PROBLEM FORMULATION
The main objective of the proposed FL algorithm is twofold: i) maximizing the number of participating devices under the limited communication resources, such that the devices with a higher contribution to the training process and with better communication conditions are selected, and ii) synchronizing the selected smart devices and minimizing the delay at each communication round by performing bandwidth allocation.

1) PROPOSED GradInn METHOD
The convergence of the global model in the FL framework is heavily dependent upon the data on which the smart devices are trained. Consequently, the greater the statistical heterogeneity among $D_{u_i^j}$, $\forall u_i^j \in U_p$, the slower the global model converges. As a result, the whole training procedure takes longer to complete. On the other hand, since only a limited number of devices are allowed to participate in the aggregation phase, it is imperative to select the ones that could contribute the most to reducing the global loss function $F(\omega_t)$. In this regard, we propose a device selection scheme based on the information extracted from the local gradient updates of each device $u_i^j \in U_c$ in the framework.
For each candidate $u_i^j \in U_c$, we define an indicator variable $v_{t,u_i^j}$ to indicate whether the smart device is selected in the communication round $t$ or not, given as

$$v_{t,u_i^j} = \begin{cases} 1, & \text{if device } u_i^j \text{ is selected in round } t, \\ 0, & \text{otherwise.} \end{cases} \tag{13}$$

Although increasing the number of participants can improve the performance of the FL algorithm, selecting all candidates is not always feasible due to communication limitations. In this regard, some devices might make a higher contribution to the FL algorithm due to their dataset characteristics, e.g., the number of samples or the mutuality of their data distributions. Thus, we define the weighted sum of the selected users as the objective of the device selection problem as follows:

$$F_{DS} = \sum_{u_i^j \in U_c} \alpha_{t,u_i^j}\, v_{t,u_i^j}, \tag{14}$$

where $\alpha_{t,u_i^j}$ determines the importance of smart device $u_i^j$ in communication round $t$, and is adjusted based on the local updates in each round.
To obtain a suitable value for the coefficients $\alpha_{t,u_i^j}$, we assume that each device $u_i^j$ runs $\kappa$ local update iterations at each round $t$, where $\kappa \le K$ and $K$ represents the number of local iterations at each round. We denote $\omega_{t^+,u_i^j}$ as the updated weights of smart device $u_i^j$ after $\kappa$ local iterations, where $t \le t^+ \le t+1$. The corresponding gradient at each device $u_i^j$ at time step $t^+$ with respect to $\omega_t$, denoted by $g_{t^+,u_i^j}$, is given by (15). The main idea of the proposed device selection scheme is to choose candidates which contribute more to the training process. In this regard, we prioritize the candidates which have a larger gradient and follow the global tendency of all devices. To measure the similarity of the local updates at round $t$ with the global tendency, the local gradients $g_{t,u_i^j}$ are compared with the global gradient $g_t$. However, since $g_t$ is not available at the communication round $t$, we approximate it with the previously aggregated gradient, i.e., $g_{t-1}$, given in (16). Additionally, to avoid redundant computations in the smart devices, we use the local gradients obtained after $\kappa \le K$ local iterations, $g_{t^+,u_i^j}$. We thus define the score in (17), based on the inner product of $g_{t-1}$ and $g_{t^+,u_i^j}$, for each device $u_i^j \in U_c$, such that the devices with higher scores have a higher chance of participating in the aggregation at communication round $t$.
The proposed GradInn criterion in (17) implies that the local gradients that are more aligned to the previous global gradient should be allowed to engage more in the training procedure. This is because the mutuality of the selected gradients results in faster convergence. In addition, the gradients' amplitude is also informative since larger gradients tend to contribute more to changing the loss function. Note that the alignment and the impact of the local gradients' amplitude are both embedded in the inner product in (17).
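The following minimal sketch shows why an inner-product score captures both alignment and magnitude: $\langle a, b \rangle = \|a\|\,\|b\|\cos\angle(a, b)$. It uses the raw inner product with the previous global gradient as the score; any normalization or mapping applied in (17) is not shown, and the device labels and function names are hypothetical.

```python
from typing import Dict, List, Sequence


def inner(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean inner product of two flattened gradient vectors."""
    return sum(x * y for x, y in zip(a, b))


def gradinn_scores(local_grads: Dict[str, Sequence[float]],
                   prev_global_grad: Sequence[float]) -> Dict[str, float]:
    """Score each candidate by <local gradient, previous global gradient>."""
    return {dev: inner(g, prev_global_grad) for dev, g in local_grads.items()}


def rank_devices(scores: Dict[str, float]) -> List[str]:
    """Candidates ordered from highest to lowest score."""
    return sorted(scores, key=lambda dev: scores[dev], reverse=True)


# Example with a previous global gradient pointing along the first axis.
prev_g = [1.0, 0.0]
candidates = {
    "aligned_large": [2.0, 0.0],   # aligned and large: highest score
    "aligned_small": [0.5, 0.0],   # aligned but small
    "orthogonal": [0.0, 5.0],      # large but unrelated to the global tendency
    "opposed": [-1.0, 0.0],        # points against the global tendency
}
order = rank_devices(gradinn_scores(candidates, prev_g))
```

In this toy case the large but orthogonal gradient scores zero, which a magnitude-only rule would have ranked first, while an angle-only rule could not separate the two aligned devices.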

2) RESOURCE ALLOCATION
The performance of the FL algorithm relies on fast and robust communication between the devices and the CMS. Due to the system heterogeneity, i.e., the difference in the communication and computation characteristics of smart devices, devices take different amounts of time to perform local updates and upload the new models to the CMS. In a synchronized FL algorithm, the model aggregation step in each round, i.e., equation (4), requires gathering the local updates from all the selected devices. Consequently, the speed of each communication round is limited by the slowest participating device. The aim is therefore to minimize the delay of each communication round, or equivalently, the maximum delay of the participating devices, while satisfying the energy constraints. Efficient bandwidth allocation in FL enables the CMS to synchronize the selected candidates and minimize the delay at each communication round. In this regard, we define the maximum delay of the participating devices at the communication round $t$ as the objective of the resource allocation problem as follows:

$$F_{RA} = \max_{u_i^j \in U_s} d_{t,u_i^j}. \tag{18}$$

In such a communication system, the total available bandwidth $B$ should be allocated among the devices in each communication round such that none of the devices has a delay or an energy consumption higher than the thresholds $d_{th}$ and $e^{max}_{u_i^j}$, respectively. Based on the above analysis, we define the multi-objective optimization problem P1 for each communication round of the FL process, with the objective in (19) and the constraints C1 to C5 in (20)-(24), among them C3: $e_{t,u_i^j} \le e^{max}_{u_i^j}$,
where (21) indicates the delay constraint for the current communication round, while (22) guarantees that the energy consumption of no smart device exceeds its limit. In addition, (23) determines the lower and upper bounds for the allocated bandwidth $b^{ul}_{t,u_i^j}$, and (24) guarantees that the total available bandwidth, $B^{ul}$, is utilized. Note that the downlink VLC bandwidth is also limited by $B^{dl}$. However, the above joint device selection and resource allocation problem is solved after the distribution of the updated global model among smart devices in downlink VLC transmission, and thus, the optimization variables of P1 are the uplink bandwidths, $b^{ul}_{t,u_i^j}$, and the selection variables $v_{t,u_i^j}$.

B. SOLUTION OF PROBLEM P1
The optimization problem P1 has two objective functions, namely, F RA and 1 F DS , which correspond to the resource allocation and device selection objectives, respectively. Additionally, the optimization variables include the integer device selection parameter, v t,u i j , and the bandwidth allocated to each device, b ul t,u i j . Hence, according to [23], the optimization problem P1 can be considered a multi-objective optimization problem which is non-convex. The weighted method is one of the popular algorithms to solve multi-objective optimization problems. In this technique, the first step is to make a dimensionless objective function. Since the objective function in P1 has two parameters with different dimensionalities, we divide these parameters by their nominal values. In this paper, we denote N c and d th as the nominal value for the number of selected devices and the latency term. To show the trade-off, or equivalently, the relative importance between the two parameters, we present w DS and w RA as weight factors for the number of selected devices, and the maximum delay, such that w US + w RA = 1. Therefore, the optimization problem P1 is transformed to a single-objective optimization problem as follows: It can be shown that the objective function of problem P2 is non-convex. More precisely, the second term of the objective function is an integer programming, while the first term is integer and non-convex. Overall, problem P2 is a mixedinteger non-linear problem, and finding its optimal solution is difficult. In order to solve P2, we present an iterative algorithm where in each iteration a new candidate device is added to the selection set based on a priority criterion until the total bandwidth B is filled for a specific latency threshold. Let U s,t denote the set of selected devices at communication round t which is empty at the beginning. We first obtain a minimum bandwidth that satisfies the delay and energy constraints of the candidate devices in U c . 
Substituting (5), (7), and (8) in constraint (21), we obtain

Additionally, substituting (10) and (12) in C3, and using (7), we obtain

Similarly, setting $C_{t,u_{ij}} = e^{comp}_{t,u_{ij}}$

Hence, constraints C2 and C3 are satisfied if (27) and (29) hold, or equivalently, if (30) holds. Without loss of generality, we assume that the right side of (30) satisfies C4, i.e., it lies in the range $[0, B^{ul}]$; otherwise, we remove the corresponding device from the selection process. We then define a priority metric in terms of $\bar{d}^{ul}_{t,u_{ij}}$, the maximum uplink delay of device $u_{ij}$, which is obtained with the minimum bandwidth from (30). In each iteration, the smart device with the highest priority metric, denoted by $u^*_{ij}$, is selected. Therefore, the set of selected devices, $U_s$, and the integer variable $v_{t,u^*_{ij}}$ in optimization problem P2 are known and updated in each iteration of Algorithm 1. Moreover, since constraint C1 is related to the selection of the candidate devices, we remove it from the optimization problem. The modified version of P2, denoted by P3, involves $n_s$, the number of devices selected up to the current iteration.
In order to solve optimization problem P3, we set $Z = \max_{u_{ij} \in N_s} d_{t,u_{ij}}$ and add the constraint $d_{t,u_{ij}} \leq Z$. Since the first term of the objective function of P3 is fixed and does not affect the solution, we remove it. Finally, the maximization problem is converted to the minimization problem P4, which minimizes $Z$ subject to the added constraint and the remaining constraints of P3. Problem P4 is a convex optimization problem and can be solved by standard convex optimization techniques, such as barrier methods [23].
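The min-max structure of P4 can be illustrated with a small numerical sketch. The delay model below is an assumption: the expressions (5)-(12) are not reproduced here, so a generic Shannon-rate uplink is used instead, and the barrier method mentioned above is replaced by a plain bisection on $Z$. The names `uplink_delay`, `min_bandwidth`, `minimize_max_delay`, and `snr_hz` are hypothetical and serve only this illustration.

```python
import math

def uplink_delay(bits, bw_hz, snr_hz):
    """Delay of sending `bits` over `bw_hz` under an assumed Shannon-rate model.

    snr_hz stands for (transmit power * channel gain) / noise PSD, in Hz.
    """
    rate = bw_hz * math.log2(1.0 + snr_hz / bw_hz)  # bits per second
    return bits / rate

def min_bandwidth(bits, snr_hz, budget_s, bw_max, tol=1.0):
    """Smallest bandwidth (within tol Hz) meeting the delay budget, else None."""
    if budget_s <= 0 or uplink_delay(bits, bw_max, snr_hz) > budget_s:
        return None
    lo, hi = 0.0, bw_max
    while hi - lo > tol:  # delay decreases with bandwidth, so bisection applies
        mid = 0.5 * (lo + hi)
        if uplink_delay(bits, mid, snr_hz) <= budget_s:
            hi = mid
        else:
            lo = mid
    return hi

def minimize_max_delay(devices, total_bw, iters=50):
    """Bisection on Z. devices: list of (d_comp_s, bits, snr_hz)."""
    n = len(devices)
    best = [total_bw / n] * n  # the equal split is always a feasible start
    z_hi = max(dc + uplink_delay(s, b, g) for (dc, s, g), b in zip(devices, best))
    z_lo = max(dc for dc, _, _ in devices)  # unreachable lower bound on Z
    for _ in range(iters):
        z = 0.5 * (z_lo + z_hi)
        bws = [min_bandwidth(s, g, z - dc, total_bw) for dc, s, g in devices]
        if all(b is not None for b in bws) and sum(bws) <= total_bw:
            z_hi, best = z, bws  # Z is achievable within the bandwidth budget
        else:
            z_lo = z
    return z_hi, best
```

For a given target $Z$, each device needs a minimum bandwidth, and $Z$ is feasible only if these minima fit in the total budget; the bisection shrinks $Z$ until the budget is exhausted. Devices with a long computation delay or a weak channel end up with a larger share, which is the behavior the resource allocation term is meant to produce.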

C. PROPOSED FEDERATED LEARNING FRAMEWORK
In the proposed FL framework, the joint device selection and resource allocation scheme has the following six steps: i) each device $u_{ij} \in U_c$ performs $\kappa$ local update iterations and calculates the gradient $g_{t^+,u_{ij}}$; ii) the inner product $\langle g_{t^+,u_{ij}}, g_{t-1} \rangle$ is calculated in each device and the results are sent to the CMS; iii) the CMS assigns $\alpha_{t,u_{ij}}$ to each device according to (17); iv) the CMS solves the optimization problem P1 using Algorithm 1 and determines which devices are qualified to participate in the current FL round; v) the selected devices update their local weights $\omega_{t^+,u_{ij}}$ for $K - \kappa$ more iterations; and vi) finally, the updated weights, $\omega_{t+1,u_{ij}}$, are sent to the CMS for aggregation.

D. COMPUTATIONAL COMPLEXITY OF ALGORITHM 1
In this subsection, we present the computational complexity analysis of Algorithm 1 by the following proposition.
Proposition 1: The computational complexity of Algorithm 1 is of order $O(N_c N_s)$, in which $N_c$ is the number of candidates and $N_s$ is the total number of selected devices.

Proof: ... is $N_c$. The loop is repeated for $N_s$ iterations; hence, the computational complexity of Algorithm 1 is of order $O(N_c N_s)$.

Algorithm 1 Proposed Algorithm to Solve P2
Input: The set of candidate clients, $U_c$; the number of bits required for each device, $S_{t,u_{ij}}$; the total bandwidth, $B^{ul}$.
...
Determine $d^{comp}_{t,u_{ij}}$ as in (9), $\forall u_{ij} \in N_c$, $v_{u_{ij}} = 1$.
...
13: Determine uplink delay, $\bar{d}^{ul}$
14: ...

Algorithm 2 (proposed FL framework)
...
Send the model weights $\omega_t$ to all the clients using VLC links.
Device selection and bandwidth allocation:
4: Obtain $\omega_{t^+}$ by performing $\kappa$ local update iterations.
5: Calculate scores $\alpha_{t,ij}$ according to (17).
...
7: Obtain $v_{t,u_{ij}}$ and bandwidth $b_{t,u_{ij}}$ by solving optimization problem P2 using Algorithm 1.
Local update:
8: Perform $K - \kappa$ more local iterations in the selected devices $v_{t,u_{ij}}$.
Uplink transmission:
9: Upload the updated local weights $\omega_{t,ij}$ to the central server.
Model aggregation:
10: Obtain $\omega_t$ according to (4).
11: ...
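The incremental selection loop behind Proposition 1 can be sketched as follows. This is a simplified illustration, not the authors' Algorithm 1: the priority metric and the per-device minimum bandwidth are assumed to be precomputed and passed in, and the function name `greedy_select` and its arguments are hypothetical.

```python
def greedy_select(priority, min_bw, total_bw):
    """Add the highest-priority candidate until the bandwidth budget is filled.

    priority, min_bw: dicts keyed by device id (assumed precomputed).
    """
    remaining = list(priority)
    selected, used = [], 0.0
    while remaining:                                # at most N_s + 1 passes
        best = max(remaining, key=priority.get)     # O(N_c) scan per pass
        if used + min_bw[best] > total_bw:          # budget would be exceeded
            break
        selected.append(best)
        used += min_bw[best]
        remaining.remove(best)
    return selected, used
```

Each pass scans the remaining candidates once, and one device is added per pass, which gives the $O(N_c N_s)$ order stated in the proposition.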

IV. EXPERIMENTATION
In this section, we conduct several experiments to evaluate the proposed joint device selection and resource allocation FL algorithm in a hybrid VLC/RF system and compare it with other alternatives proposed in [24] and [17]. First, the simulation setup of the smart indoor environment is described and summarized in Table 3 [25]. We then introduce multiple metrics in terms of the model classification accuracy and latency to evaluate the performance of the proposed algorithm against the baselines. Finally, the simulation results are reported.

A. SIMULATION SETUP

1) COMMUNICATION SETUP
We consider a typical indoor environment of size $9 \times 6 \times 3$ m$^3$ with $N_v = 6$ VLC transmitters placed on the ceiling. The total downlink bandwidth and the corresponding downlink transmission power are set to $B^{dl} = 20$ MHz and $p^{dl}_{t,u_{ij}} = 1.3$ watts, respectively. We consider $N_u = 5$ users in the smart environment, where each user owns $N^d_{u_i} = 6$ smart devices, each equipped with $P = 4$ PDs as VLC receivers. It is assumed that $N_c = 15$ randomly chosen devices are candidates to participate in the FL process at each communication round. For the uplink, we use RF transmission with total bandwidth $B^{ul} = 10$ MHz and complex Gaussian noise with power $N^{ul}_0 = 3.89 \times 10^{-21}$ watts.

2) MACHINE LEARNING SETUP
We use the MNIST dataset and its extension, EMNIST, for the commonly used handwritten image classification task. MNIST contains 70,000 data samples in the form of $28 \times 28$ images. There are 10 classes in the dataset, corresponding to the handwritten digits from 0 to 9. We also evaluate the algorithms on the heterogeneous dataset EMNIST, where each device has $|D_{u_{ij}}| = 450$ data samples drawn from 90% of the classes.

B. PERFORMANCE METRICS
The performance of an FL framework can be examined from the perspectives of classification accuracy and latency. We thus define the following metrics to evaluate the proposed algorithm with respect to each of these aspects:

1) TEST ACCURACY
For an ML classification model, the test accuracy is defined as the ratio of the number of correctly labeled samples in the test set, denoted by $n_{corr}$, to the total number of test samples. In our work, we assume that a balanced test set, denoted by $D_{test}$, is available in the CMS and can be used to evaluate the trained model. In order to make a realistic evaluation of the model accuracy in the FL algorithm, we define the metric $\eta_1$ as the average accuracy over multiple runs, computed with the expectation operator $\mathbb{E}[\cdot]$.

2) LATENCY
To acquire a fair understanding of the latency between the devices and the CMS in each communication round of the FL algorithms, the cumulative latency of the FL process is defined as the sum of the per-round delays, $\eta_2 = \sum_{t} d_t$, where $d_t$ is the delay in communication round $t$.
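As a small worked illustration of the two metrics above, the sketch below computes an average test accuracy over several runs and a cumulative latency over communication rounds. Two assumptions are made: the accuracy is expressed as a percentage (in line with the nominal value of 100 used later for the test accuracy), and the cumulative latency is the plain sum of the per-round delays. The function names are hypothetical.

```python
def average_accuracy(runs):
    """Mean test accuracy in percent over runs given as (n_corr, n_test) pairs."""
    return 100.0 * sum(n_corr / n_test for n_corr, n_test in runs) / len(runs)

def cumulative_latency(round_delays):
    """Total delay accumulated over all communication rounds, in seconds."""
    return sum(round_delays)

# Example: two runs on a 100-sample test set and three communication rounds.
eta_1 = average_accuracy([(90, 100), (80, 100)])
eta_2 = cumulative_latency([0.5, 0.25, 0.25])
```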

3) UNIFIED METRIC
The goal of FL is to maximize the accuracy while minimizing the communication and computation costs. To reach an overall understanding of the FL algorithms, we define a unified metric $\eta$ that combines the aforementioned metrics, where $\eta_1^{\max} = 100$ and $\eta_2^{\max}$ indicate the nominal values of the test accuracy and the latency, respectively.

4) SYSTEM HETEROGENEITY
To better understand the effect of system heterogeneity on the average delay of each communication round, we define a system heterogeneity metric $H_{SYS}$ in terms of $\bar{d}_{u_{ij}}$, the delay of each device $u_{ij} \in U_s$ for a given bandwidth $b_{u_{ij}} < B^{ul}$ in each communication round. Note that $H_{SYS} \to 0$ when there is no system variability, i.e., when all devices have the same delay, and $H_{SYS} \to 1$ for high system heterogeneity.

C. PERFORMANCE EVALUATION
In this subsection, we evaluate the performance of the proposed FL algorithm in three scenarios. In the first one, we present the simulation results of the proposed GradInn device selection scheme and compare it with the alternatives [17] in terms of test accuracy $\eta_1$. In the second scenario, we analyze the effect of the proposed resource allocation algorithms in both hybrid VLC/RF and conventional RF systems in terms of delay for various levels of system heterogeneity. Finally, in the third scenario, the simulation results of the proposed joint device selection and resource allocation algorithm are presented for both hybrid VLC/RF and conventional RF systems.

1) SCENARIO I
In order to evaluate the performance of the proposed device selection algorithm, we set $w_{RA} = 0$ and $w_{DS} = 1$ to eliminate the effect of the resource allocation term in the objective function of problem P2, i.e., (25). We compare the proposed GradInn algorithm addressed in (17) with the following baselines in terms of accuracy: i) the angle between the local gradient and the previously aggregated gradient is used (GradAng) [18]; in this scheme, the parameter $\alpha_{t,u_{ij}}$, formulated in (17), is obtained from this angle; ii) the absolute value of the local gradients is used for device selection (GradAbs); we use this scheme similarly to the work proposed in [17], with a minor modification.

Fig. 2 compares the test accuracy of the proposed GradInn algorithm and the baselines with respect to communication rounds for different numbers of selected devices, namely $N_s = 6$, 9, and 12. Overall, the proposed GradInn algorithm performs better than the GradAng and GradAbs methods. In the GradAbs method [17], the devices with larger local gradients are selected. In this situation, the local gradients of the selected devices might not point in the same direction and may thus cancel each other's effects. Although the GradAng method selects devices whose gradients are aligned with the previously aggregated gradient, it does not take into account the absolute value of the local gradients. Thus, the selected devices do not necessarily have large local gradients, which leads to slow convergence of the algorithm. The proposed GradInn technique, however, takes into account both the absolute value and the direction of the local gradient. More precisely, it chooses the devices that have a larger component of the local gradient in the direction of the globally aggregated gradient. In addition, as the number of selected devices increases from $N_s = 6$ to $N_s = 12$, the convergence speed of the device selection schemes reduces. This is due to the fact that the device selection methods have to pick a larger portion of the randomly chosen candidate devices. Thus, they have a smaller degree of freedom in the selection process, and hence they behave more similarly than in the case where they have a larger degree of freedom.
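The qualitative difference between the three selection rules can be seen on a toy example. The sketch below is an illustration under assumptions: the exact expressions of (17) and of the baseline scores are not reproduced here, so GradInn is modeled as the inner product with the previous global gradient, GradAng as the cosine of the angle, and GradAbs as the gradient norm. All function names and the toy gradients are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def grad_inn(g, g_prev):
    """Assumed GradInn score: magnitude and direction via the inner product."""
    return dot(g, g_prev)

def grad_ang(g, g_prev):
    """Assumed GradAng score: cosine of the angle, ignoring the magnitude."""
    return dot(g, g_prev) / (norm(g) * norm(g_prev))

def grad_abs(g, g_prev):
    """Assumed GradAbs score: magnitude only, ignoring the direction."""
    return norm(g)

def top_k(local_grads, g_prev, score, k):
    """Ids of the k devices with the highest score."""
    ranked = sorted(local_grads, key=lambda u: score(local_grads[u], g_prev),
                    reverse=True)
    return ranked[:k]

# Toy local gradients; the previous global gradient points along the x-axis.
g_prev = [1.0, 0.0]
local_grads = {
    "big_opposed": [-5.0, 0.0],    # large but pointing the wrong way
    "small_aligned": [0.1, 0.0],   # perfectly aligned but tiny
    "big_aligned": [3.0, 1.0],     # large and mostly aligned
}
```

With these toy values, the magnitude-only rule picks `big_opposed`, the angle-only rule picks `small_aligned`, and the inner-product rule picks `big_aligned`, mirroring the argument made above for why combining magnitude and direction is preferable.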

2) SCENARIO II
We evaluate the performance of the proposed resource allocation algorithm in the hybrid VLC/RF system and compare it with the alternative in which only RF is used for both the downlink and uplink transmission modes [24]. To this end, we set $w_{DS} = 0$ and $w_{RA} = 1$ in the objective function of the optimization problem formulated in (25). Fig. 3 compares the average delay of the communication rounds in the FL algorithm for both hybrid VLC/RF and conventional RF systems, described in (6), while popular neural networks, namely, AlexNet, VGG-M, VGG-S, GoogleNet, and MobileNet, are used as the global model of the FL framework.
The system heterogeneity, imposed by the difference in the communication and computation capabilities of the devices, leads to a discrepancy in the delay of the participating devices. In this situation, the delay in each communication round, which is determined by the slowest device, can be reduced by optimally allocating bandwidth to the participating devices such that the slow participants are given more bandwidth than the fast ones. Fig. 4 compares the average delay per communication round for different values of $H_{SYS}$ in (39) for both the proposed hybrid VLC/RF system and the conventional RF model similar to the work in [24]. In addition, we consider two cases, where a) the proposed resource allocation algorithm is used and b) the bandwidth is equally partitioned among the participating devices. As the system heterogeneity increases, the proposed resource allocation algorithm achieves a lower average delay than the equally partitioned bandwidth allocation method. In the case of extreme system heterogeneity (i.e., $H_{SYS} > 0.7$), however, the four schemes start to perform similarly. This behavior stems from the fact that the variation between the delays of the participants is too high to be compensated by the proposed bandwidth allocation scheme. Moreover, a sudden increase is observed in the average delay after $H_{SYS} = 0.7$, which is due to the high variability of the communication and computation capabilities of the devices, or equivalently, the high variation in the delay of different devices. Furthermore, in both the proposed and equal bandwidth allocation methods, the VLC/RF system has a lower average delay per communication round, since it provides faster downlink communication than the conventional RF system.

3) SCENARIO III
As shown in Scenarios I and II, the proposed device selection and resource allocation techniques can improve the accuracy and latency of the FL algorithm. Accordingly, the weighting coefficients $w_{DS}$ and $w_{RA}$ in (25) have a direct impact on both the test accuracy and the delay of the proposed FL algorithm. To better understand this effect, we vary the importance weights and evaluate the proposed FL framework in Algorithm 2 in terms of test accuracy and latency. In this regard, Fig. 5 depicts the test accuracy for different values of $w_{DS}$ and $w_{RA}$. As observed, increasing the weight of the device selection term in the joint device selection and resource allocation algorithm raises the test accuracy and the convergence speed of the learning curve. However, this leads to an increase in the completion time and the average delay at each round.
To give a general understanding of the proposed algorithm with respect to the test accuracy and delay, Fig. 6 illustrates the unified metric $\eta$, formulated in (38), versus communication rounds for different values of the system heterogeneity $H_{SYS}$. We consider three cases, where (a) $w_{DS} = 0.3$, $w_{RA} = 0.7$, (b) $w_{DS} = 0.5$, $w_{RA} = 0.5$, and (c) $w_{DS} = 0.7$, $w_{RA} = 0.3$. As can be seen from the figure, a larger system heterogeneity in the setup results in a lower $\eta$. This comes from the fact that higher system heterogeneity leads to a higher delay in each round and thus a lower $\eta$. Moreover, as the importance weight of the device selection term, $w_{DS}$, in (25) increases, the curve rises more sharply with respect to communication rounds before reaching its peak. The reason is that a higher selection priority has been given to the smart devices with larger learning contributions. However, the curves decrease gradually after reaching their peaks. This is because the test accuracy no longer improves as much as in the initial communication rounds.
Finally, Table 4 shows the quantitative results of the unified metric $\eta$ for both the hybrid VLC/RF and conventional RF systems after the completion of the FL process. It can be concluded from the table that the proposed hybrid VLC/RF system reaches a higher value of $\eta$ than the conventional RF system. The main reason is the lower latency of the VLC transmission in the downlink, which results in a lower cumulative delay and thus a higher $\eta$. Additionally, as the system heterogeneity increases, the value of $\eta$ after the completion of the FL process decreases, since the average delay per communication round grows with the system heterogeneity.

The communication efficiency of the FL algorithm depends on i) which devices are selected and ii) how the communication resources are assigned to the selected devices. To better understand the communication efficiency of the proposed hybrid VLC/RF system model, Fig. 7 compares the cumulative delay required to reach a specified accuracy, e.g., 80%, with the alternatives. Our proposed algorithm selects the devices with a higher contribution to the training process; thus, in the long term, fewer communication rounds, and hence a lower cumulative delay, are required for the convergence of the FL algorithm, resulting in fewer model exchanges between the devices and the CMS. Additionally, in each communication round, the optimal resource allocation is applied, which makes optimal use of the available bandwidth. The proposed hybrid VLC/RF system slightly reduces the delay due to faster communication in the downlink mode.

V. CONCLUSION
This paper put forward a communication-efficient FL framework for a smart indoor environment with multiple VLC transmitters for the downlink transmission and an RF access point for the uplink transmission. We assumed that several users (clients or persons) are present in the indoor environment, where each user has multiple smart devices equipped with PDs to receive VLC signals. We proposed an FL framework to train deep learning models in which devices with the highest contribution to the learning process and low latency are prioritized in the selection scheme. Towards this goal, we formulated the problems of device selection and resource allocation as a multi-objective optimization problem, which was then converted to a single-objective problem using the weighted method. Furthermore, the optimization problem was solved by incrementally adding the candidates to the set of selected devices in each round of the FL process. We conducted several experiments to evaluate the performance of the proposed joint device selection and resource allocation algorithm in a hybrid VLC/RF indoor system. The simulations showed that the proposed GradInn scheme not only reduces the average delay but also increases the convergence speed of the FL algorithm in the presence of both system and data heterogeneity. In addition, the proposed hybrid VLC/RF system further reduces the average delay in the downlink transmission mode compared to conventional RF-based systems. For future work, we suggest using Intelligent Reflecting Surfaces (IRS) to improve the uplink transmission in FL.