Deep Q-Network Learning Based Downlink Resource Allocation for Hybrid RF/VLC Systems

Developing high data rate systems to meet the requirements of fifth generation mobile systems has become crucial. Hybrid radio frequency/visible light communication (RF/VLC) has appeared as a promising mechanism for achieving this objective. In hybrid RF/VLC, data rate maximization is subject to constraints on bandwidth, power and the user association. The joint optimization problem of bandwidth, power and user association to maximize the data rate is non-concave and obtaining an optimal solution is difficult with conventional optimization algorithms. The existing solutions are based on a presumption of at least one optimization variable. In this article, this issue has been overcome by solving the joint optimization problem in hybrid RF/VLC with a deep Q-network (DQN) learning based algorithm, which has been recognized as an efficient learning based mechanism for optimization. Our system model considers one RF and multiple VLC access points (APs). The idle APs are also incorporated in the system model. The application of DQN learning based algorithm is carried out by finding an optimal policy with the help of an action-value function. As the data sets for the considered system are large, a multi-layered network is used for approximating the action-value function estimator. Finally, a transfer learning based algorithm has been proposed for maximizing the total data rate of the system for the case of a newly entering user equipment (UE) that uses the information of the environment before the arrival of the new UE. Through simulations, it is found that our proposed algorithms can lead to an improvement of more than 10% and 54% in the achievable sum-rate and number of iterations for convergence respectively as compared to that obtained with existing conventional optimization algorithms.


I. INTRODUCTION
The associate editor coordinating the review of this manuscript and approving it for publication was Marco Martalo.
With the growing population of mobile internet users, the demand for data rate has grown exponentially in recent years. Conventional radio frequency (RF)-only systems may fail to meet this demand satisfactorily in the near future [1], and the telecommunication community is searching for alternative techniques. Visible light communication (VLC) has emerged as an efficient candidate in this regard [1]-[3]. It uses the deployed light emitting diode (LED) based light sources to transmit data through dimming of the light that is invisible to the eye. VLC offers several advantages such as high data rate, less interference with co-existing RF devices, simultaneous communication and illumination, efficient use of unregulated spectrum, and efficient frequency reuse [3]. However, it has some disadvantages, such as the inefficiency of non-line-of-sight (NLOS) components, which prevent its stand-alone deployment [4].
As a solution to this problem, hybrid RF/VLC has been proposed in the literature [5], [6].
Hybrid RF/VLC merges the RF and the VLC networks into a single hybrid system. A typical hybrid RF/VLC architecture consists of some light sources, with each light source acting as a VLC access point (AP) in an indoor set-up. This set-up is supported by one or multiple RF APs. A user equipment (UE) present in the indoor set-up is associated either with a VLC AP or an RF AP for receiving data. The VLC AP offers high data rate while the RF AP ensures uninterrupted communication during blockage of LOS VLC signals to a UE, or when a UE is out of the coverage area of any of the VLC APs and fails to maintain the minimum needed signal-to-noise-ratio (SNR). In this manner, both the networks compensate for the limitations of each other.
Hybrid RF/VLC systems belong to the class of heterogeneous networks (HetNets). In general HetNets, the joint optimization of resource allocation and association remains a significant research problem [7]-[13]. Similarly, resource allocation is a significant research issue in hybrid RF/VLC systems. Along with deciding the association of the UEs with the APs to receive the downlink data, the allocation of downlink bandwidth and transmission power to the APs for data transmission significantly affects the achievable sum-rate of the system. The study of optimal resource allocation for achievable sum-rate maximization in hybrid RF/VLC has received considerable research attention [14]-[28]. A common issue faced in these works, when the joint optimization of the downlink bandwidth, the transmission power of the APs, and the association parameter is involved, is the non-concavity of the downlink resource allocation problem. Generally, this issue is handled by presuming values for at least one of these parameters and then obtaining the optimal values of the other parameters with conventional convex optimization algorithms. However, the joint optimization of all three parameters without such presumptions is needed, because the association of UEs depends on their signal-to-interference-plus-noise ratios (SINRs) with the APs. Hence, the association is directly affected by the allocation of downlink bandwidth and power, and vice versa. Presuming a value for the downlink bandwidth, the transmit power of the APs, or the association parameter may not give the optimal solution for maximizing the achievable sum-rate of the system. A comprehensive joint optimization problem incorporates the effect of each optimization parameter on the others and on the objective function, which ensures a robust solution.
The above issue is the motivation behind the present study. We aim at jointly optimizing the downlink bandwidth, power and association parameter for maximizing the achievable sum-rate of a downlink hybrid RF/VLC system. The problem is subject to constraints pertaining to the availability of resources. Conventional optimization approaches to this problem may lead to rigid bandwidth and power allocation, as these approaches are less adaptive to the dynamics of the network and result in an inefficient exploitation of resources [29], [30]. In contrast, a moment-to-moment optimal usage of the resources would give a better outcome. It is also necessary that the optimal design for association and resource allocation in hybrid RF/VLC does not depend on prior knowledge of the environment. Some model based optimal solutions have been developed in [31]-[33] for specific models in general HetNets. However, the primary concern with these methods is that incomplete information on the system makes the solutions intractable. Also, obtaining the global maximum of a resource allocation problem is challenging with model based optimization methods due to its non-concavity. The solutions based on game theory, linear programming, Markov approximation, college admission model, and dynamic programming proposed in [5], [7]-[9], [15], [18], [24] need almost accurate information, which is not always possible to obtain in practice, even when localization in VLC is relatively accurate.
In this article, a deep Q-network (DQN) learning based algorithm is developed for jointly optimizing the downlink bandwidth allocation, power allocation for APs, and association parameter, which maximizes the achievable data rate in a downlink hybrid RF/VLC system. The UEs can be associated with any of the APs lying within their field of views (FOVs). Unlike [34] where DQN was carried out at each AP, DQN is trained at a central unit (CU) which controls the association, allows all the APs to set their transmit powers and allocate bandwidths to the UEs associated with them [35].
Our contributions in this article can be summarized as follows:
1) Comprehensiveness of the problem: To the best of our knowledge, a comprehensive problem incorporating the joint optimization of association, power and bandwidth in a downlink hybrid RF/VLC has been considered for the first time in this article. Such a problem is neither convex nor concave. Making it convex requires prior assumption of at least one optimization parameter. Thus, it is difficult to solve with the conventional optimization methods used in the existing works. Here, this limitation is overcome by solving it with DQN based learning. The optimal solution obtained with the help of DQN based learning is not dependent on modeling errors and works on a moment-to-moment update.
2) Considering idle APs: Some APs can be switched off due to hardware malfunction while some APs may not take part in communication as only selected APs have been designed for VLC. Such APs do not cause interference to a UE. Considering interference from these APs can affect the robustness of the analysis. Our mathematical model considers idle APs in the SINR expression. Such formulation improves the practicality of the system model.
3) Novel DQN based resource allocation in hybrid RF/VLC: For the first time, a DQN based learning algorithm is being used for solving the optimal resource allocation and association problem in hybrid RF/VLC. Our developed algorithm allows the CU to adaptively allocate the downlink bandwidth, power and the association parameter to the APs to maximize the achievable sum-rate of the system. It is not dependent on interaction among the UEs. A DQN based learning algorithm is trained at the CU, instead of training the DQN at each AP. This helps in providing an efficiently coordinated association. As the state and action vectors are very large in such problems, the application of DQN outperforms the existing algorithms in terms of achievable sum-rate and the number of iterations needed for convergence.
4) Study the application of DQN with transfer learning: The successful application of DQN with transfer learning has been shown for a newly entering UE in the hybrid RF/VLC set-up, where the experience of UEs already present in the set-up is transferred to a new UE entering into the set-up. It is found that DQN with transfer learning reduces the number of iterations for convergence by approximately 54% compared to when DQN without transfer learning is used for a newly joined UE.
VOLUME 8, 2020
The rest of the paper is organized as follows: In Section II, a literature review of the existing works that have led to the present work is performed. Section III explains the system model, where the light propagation model, the RF signal propagation model, the achievable data rate formulation, and the communication model obtained after the mixing of the RF and VLC networks are discussed, and the resource allocation problem is formulated. In Section IV, the solution for the resource allocation problem is designed, where the framework for learning has been formed and the layout of the proposed algorithm has been written. Section V illustrates the transfer learning based algorithm for the newly entering UEs. In Section VI, the simulation results to verify our proposed algorithms have been studied. The computational complexity and the NP hardness of the proposed schemes have also been studied in this section. Section VIII concludes the paper.

II. RELATED WORK AND THE SIGNIFICANCE OF THE APPLICATION OF DQN LEARNING IN HYBRID RF/VLC
Efficient resource allocation and association can lead to a higher achievable sum-rate in HetNets. The problem of resource allocation becomes crucial in hybrid RF/VLC as RF and VLC networks have completely different communication models. Several resource allocation schemes exist for performing achievable sum-rate maximization and related issues like energy efficiency maximization or packet loss probability minimization [14]-[28]. In [14], the total achievable data rate of a hybrid RF/VLC system is maximized by optimizing the association parameter with the help of a minimum distance condition. The focus of this work is on user association, where each AP allocates equal bandwidth among the UEs associated with it. A fixed allocation of the transmit power of the AP has been considered here. In [15], achievable data rate maximization is performed with joint load balancing and optimal power allocation in hybrid RF/VLC. A fixed configuration of bandwidth has been taken here. In [16], the effect of bandwidth allocation on the overall sum-rate of the hybrid RF/VLC system has been studied. A bandwidth aggregation protocol that uses VLC to increase the bandwidth of the overall hybrid RF/VLC system has been proposed. An optimal packet scheduling scheme is also proposed for the data packets which arrive at the system for transmission to the UEs via the VLC or RF networks. The scheduling scheme has an impact on the overall sum-rate of the system, as the final objective of the work is throughput optimization. A fixed configuration of power allocation and association parameter has been considered here. In [17], maximization of the total sum-rate in hybrid RF/VLC has been carried out with joint balancing of the individual achievable sum-rates of the information UEs and the energy harvesting UEs. The power and the DC-bias of the UEs are optimized while a constant bandwidth and power allocation is considered for the APs.
In [18], the focus is on maximizing energy efficiency of a hybrid RF/VLC system, which is defined as the ratio of the sum-rate and the total operational power, to optimize the bandwidth and power allocation. The system model of this work considers a single RF AP and a single VLC AP. The association parameter and the bandwidth are kept fixed here. In [19], similar to energy efficiency, power efficiency maximization of a hybrid multiple access scheme for visible light communication systems has been studied, which offers a better bandwidth allocation. The fundamental objective here is to fill the odd subcarriers optimally. Once again, the association parameter and the bandwidth are kept fixed here. In [20], power efficiency maximization has been performed for situations when illumination is not needed and the light source is kept on only for the transmission of data. In this situation, a VLC AP consumes more power than the RF AP. First, the number of APs that need to be switched on for satisfying the illumination requirements has been determined. Subsequently, the UEs request real-time communication. Resource allocation remains outside the realm of this work as fixed bandwidth and association parameter have been considered here. The study in [18] has been further extended to multiple VLC APs in [21], but only for optimal power allocation. A fixed configuration of bandwidth and association is considered here. The authors in [22] perform minimization of packet-loss probability in a fractional association time based dual-hop hybrid RF/VLC system enabled with energy harvesting. The fractional association time is based on the time division multiplexing principle, where the entire bandwidth is allocated to each UE for a specific time fraction. The objective of [22] is to obtain the optimal fraction of association time allocated to a UE. Further, in [23], the joint optimization of the fractional association and power allocation has been studied in dual-hop hybrid RF/VLC.
In [24], the sum-rate has been maximized with a mobility aware load balancing scheme, using the location-sensitive feature of VLC systems. The solution is based on a college admission model from matching theory. A fixed configuration of association parameter and power allocation has been considered here. In [25], the focus of the work is on achievable sum-rate maximization with an intelligent selection of the network among the RF and the VLC networks depending on the dynamics of the environment. The study is performed on an uplink-downlink system with the main focus on the dissimilarity of the uplink and downlink parameters. The association parameter is optimized in terms of weighted proportional fairness while the bandwidth and the power allocation are kept fixed. In [26], achievable sum-rate maximization for a hybrid RF/VLC system has been investigated in a cross layer domain to provide optimal association. The solution depends on the effective capacity of the network obtained after imposing constraints on the length of the buffer at the AP, which holds the data before transmitting it over the selected link. A fixed allocation of bandwidth and power has been assumed here. In [27], the user association problem with lighting constraints for a VLC-only system has been studied and a greedy algorithm for maximizing the SINR based utility function has been proposed. Further, in [28], an anticipatory association scheme was proposed to anticipate the future locations of the UEs with the aim of maximizing the achievable sum-rate. The association was performed as per the locations of the UEs. A fixed bandwidth and power allocation has been considered in [27] and [28].
As mentioned earlier, the existing works discussed above presume a value for at least one parameter among bandwidth, power, and association parameter to address the issue of non-concavity in their respective joint optimization problems. However, such a presumption affects the robustness of the solution. To address this issue, we explore learning based solutions. Reinforcement learning (RL) [36] has been recognized as an efficient learning mechanism. It is based on interaction with the environment and requires less prior information. It is an online learning method and has been extensively studied in artificial intelligence research [37]. The most popular RL technique is Q-learning, which was proposed in [38]. The convergence theorem for Q-learning was later proved in [39]. In [40], an autonomous Q-learning algorithm in HetNets for optimal resource allocation for device-to-device (D2D) communication has been proposed. A utility function defined as the difference between the achievable throughput and the cost of power consumption is formulated as a stochastic non-cooperative game. Each D2D pair is a player which becomes a learning agent with the task of learning its best strategy. In [41], the association problem in vehicular networks was solved by using an online reinforcement learning approach. The authors take advantage of the regularities in the features of vehicular networks. Ghadimi et al. proposed a reinforcement learning method to obtain rate adaptation in cellular networks in [42]. However, it should be noted that obtaining an optimal solution with the Q-learning method is difficult when the state and action vectors of the joint optimization problem are large. In this regard, deep learning [43] has emerged as a promising technique to solve problems with large state and action vectors.
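For reference, the tabular Q-learning update discussed above can be sketched as follows. This is a minimal illustration, not the algorithm of any cited work: the state and action labels, the learning rate `alpha`, and the discount factor `gamma` are hypothetical placeholders. The table holds one entry per state-action pair, which is why the method becomes impractical when the state and action vectors are large.

```python
from collections import defaultdict

def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning step and return the updated Q-value.

    q is a dict keyed by (state, action); alpha and gamma are
    illustrative values, not taken from the paper.
    """
    # Greedy estimate of the value of the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    # Temporal-difference target and in-place update.
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]

# The table grows with (number of states) x (number of actions).
q_table = defaultdict(float)
```

A DQN replaces `q_table` with a neural network that maps a state to the Q-values of all actions, so the memory cost no longer scales with the number of distinct states.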
Recently, deep learning-based methods have been used in many areas, such as dynamic channel access [44], power allocation [45], mobile offloading [46], cloud radio access networks [47], interference management [48], mobile edge computing and caching [49]. We first discuss the usability of deep learning in communication networks.
As the use of machine intelligence in future mobile communication networks is drawing tremendous research interest [50], [51], deep learning, a flagship of machine learning, is attracting the attention of communication networking researchers. In [52] and [53], its potential to solve problems in the mobile networking domain has been explored. This encourages the use of deep learning in 5G mobile communication systems, which are largely heterogeneous. The data generated in these systems are also heterogeneous to a large extent, as they are received from sources of different formats having complex correlations [54]. Solving these problems with traditional machine learning tools is quite difficult, mainly because their performance does not improve with more data [55] and because they cannot handle high dimensional state/action spaces [43]. In contrast, big data fuels the performance of deep learning, as it eliminates the need for domain expertise and employs hierarchical feature extraction. Thus, deep learning has become an efficient candidate for solving problems in communication networks, particularly in heterogeneous systems. In this regard, a detailed account of research on the applications of deep learning in communication systems can be found in works like [56], where deep learning approaches for network cybersecurity have been discussed, [57], which reviews deep learning approaches for network traffic control, [58], which presents deep learning approaches for physical layer modulation, network access/resource allocation, and network routing, and [59], which presents deep learning approaches for emerging issues including edge caching and computing, multiple radio access and interference management.
The most significant advantage offered by deep reinforcement learning (DRL) is that it can obtain the solution of sophisticated network optimizations, enabling network controllers like base stations to solve non-convex and complex problems like joint user association, computation, and transmission schedule, and achieve optimal solutions without complete and accurate network information. Some of the major advantages offered by deep learning in communications are as follows:
• Deep learning enables network entities to learn and create knowledge about the communication environment. For instance, by using deep learning, network entities like UEs can learn optimal policies, like AP selection, channel selection, handover decision, caching and offloading decisions, without knowing the channel model and mobility pattern.
• Deep learning enables autonomous decision-making. It enables the network entities to observe and obtain the best policy locally with minimum or without information exchange among each other. This significantly reduces communication overheads. It also improves security and robustness of the networks considerably.
• Deep learning improves the learning speed significantly, particularly where large state and action spaces are involved. Hence, in large-scale networks, deep learning allows network controllers like base stations or APs to control dynamic user association, spectrum access, and transmit power for a massive number of devices and UEs.
• Deep learning has also been found efficient in solving game theory problems. Several crucial problems in communications and networking such as cyber-physical attacks, interference management, and data offloading can be modeled as non-cooperative games. Deep learning has been recently used as an efficient tool in finding the Nash equilibrium, without complete information.
In a major development in this direction, it was found that combining a deep neural network (DNN) with Q-learning can improve the learning performance and learning speed [60]. This system is called DQN. Developing a DRL- or DQN-based learning method for joint resource optimization is a new research direction in HetNets. For example, recently DQN based learning has been used specifically for base station activation in [62]. Further in [63], a post decision state based experience replay and transfer RL algorithm for low latency and high reliability has been proposed, for maximizing energy efficiency in hybrid RF/VLC networks. Some learning based works in hybrid RF/VLC have also been proposed in [25], which use an RL with knowledge transfer based scheme for the selection of the network among the RF and the VLC networks, depending on the dynamics of the environment. However, hybrid RF/VLC systems generally involve large state and action spaces. RL algorithms perform well for small-size models but perform poorly for large-scale models. For such cases, DQN learning can efficiently maximize the Q-value by approximating the action-value function from the current state. However, it has still remained unexplored for finding optimal resource allocation in hybrid RF/VLC systems.

III. SYSTEM MODEL
Fig. 1 shows the system model considered in this investigation. The set-up contains multiple VLC APs (light sources) and a single RF AP deployed on the ceiling of a typical room as shown in the figure. The CU is co-located with the RF AP system, which is responsible for controlling the network, viz. bandwidth allocation for the APs, transmit power control of the APs, and association of the UEs with the APs, with the help of the DQN algorithm. The users carrying the UEs are shown arbitrarily present on the floor of the room. A newly entering user carrying a UE is also shown at the border-line of the floor of the room. Let $\mathcal{N}$ be the set of APs indexed as $i = 0, 1, 2, \ldots, |\mathcal{N}| - 1$. Index $i = 0$ denotes the RF AP while indices $i = 1, 2, \ldots, |\mathcal{N}| - 1$ denote the VLC APs. Let $\mathcal{M}$ be the set of UEs present inside the room indexed as $j = 1, 2, \ldots, |\mathcal{M}|$. The UEs are considered to be at height $h$ from the floor. The downlink communication to a UE is done through the VLC and the RF networks. Each UE is associated with the RF AP or a VLC AP. VLC APs reuse the same bandwidth. Thus, inter-cell interference (ICI) is present in the VLC network. The investigations will be performed on a reference AP $i$ - UE $j$ pair for downlink communication. The data communication between the VLC APs and the RF AP is done through a backhaul circuit [64]. The backhaul circuit also performs the underlying circuitry operations. A non-coordinated transmission has been considered in this set-up. When associated with a VLC AP, a UE receives data with LOS and reflected light ray components.
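As a rough illustration of the geometry in this set-up, the sketch below computes the AP-UE distance and the LOS angle for a ceiling-mounted VLC AP and a UE at height $h$, and lists the VLC APs that fall inside a UE's FOV. It assumes, beyond what the text states, that every LED points straight down and every receiver points straight up, so the irradiance and incidence angles coincide. The coordinates, the function names, and the dictionary layout are hypothetical.

```python
import math

def downlink_geometry(ap_xyz, ue_xyz):
    """Return (distance, LOS angle in degrees) for one AP-UE pair.

    Assumes a downward-facing LED and an upward-facing receiver, so the
    irradiance and incidence angles are equal. The AP must be above the UE.
    """
    dx, dy, dz = (a - u for a, u in zip(ap_xyz, ue_xyz))
    d = math.sqrt(dx * dx + dy * dy + dz * dz)
    # Clamp to guard against rounding slightly above 1.
    angle = math.degrees(math.acos(min(1.0, dz / d)))
    return d, angle

def candidate_vlc_aps(aps, ue_xyz, fov_deg):
    """Indices of VLC APs whose LOS ray arrives within the UE's FOV."""
    return [i for i, ap in aps.items()
            if downlink_geometry(ap, ue_xyz)[1] <= fov_deg]
```

For example, with APs on a 3 m ceiling and a UE held at 1 m, an AP directly overhead gives a distance of 2 m and an angle of 0 degrees, whereas an AP 5 m away horizontally falls outside a 60 degree FOV.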

A. LIGHT PROPAGATION MODEL
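The VLC APs transmit data to the UEs on the downlink. The light propagation in VLC is modeled with diffused reflection, where the light ray incident on a surface is scattered at multiple angles. The optical power of light after undergoing diffused reflection is modeled by the Lambertian law [65] and is given as
$$P(\phi) = P_i \, \frac{m+1}{2\pi} \cos^m(\phi), \tag{1}$$
where $P_i$ is the total LED power, $\phi$ is the angle of irradiance, and $m$ denotes the order of the Lambertian radiation profile, expressed as
$$m = -\frac{\ln 2}{\ln\left(\cos \psi_{1/2}\right)}, \tag{2}$$
where $\psi_{1/2}$ is the semi-angle at half illuminance of the LED. Let the LED emit light with wavelength $\lambda$ and spectral power distribution $P_i(\lambda)$; then $P_i$ can be expressed as
$$P_i = \int_{\lambda} P_i(\lambda)\, d\lambda. \tag{3}$$
From (1), the LOS DC channel gain $G^v_{ij}$ for the downlink communication from the $i$th VLC AP to UE $j$ is obtained as
$$G^v_{ij} = \begin{cases} \dfrac{(m+1)A}{2\pi d_{ij}^2} \cos^m(\phi_{ij})\, T_{\mathrm{opt}}(\psi_{ij})\, g(\psi_{ij}) \cos(\psi_{ij}), & 0 \le \psi_{ij} \le \psi_{\mathrm{FOV}}, \\ 0, & \text{otherwise}, \end{cases} \tag{4}$$
where $A$ is the physical area of the photodiode (PD), $T_{\mathrm{opt}}(\psi_{ij})$ is the gain of the optical receiver filter and is unity or a constant value within the FOV of a receiver, $\phi_{ij}$ is the angle of irradiance at AP $i$, $\psi_{ij}$ is the angle of incidence at UE $j$, and $d_{ij}$ is the distance between AP $i$ and UE $j$. $g(\psi_{ij})$ is the concentrator gain given as
$$g(\psi_{ij}) = \begin{cases} \dfrac{n^2}{\sin^2(\psi_{\mathrm{FOV}})}, & 0 \le \psi_{ij} \le \psi_{\mathrm{FOV}}, \\ 0, & \text{otherwise}, \end{cases} \tag{5}$$
where $n$ is the refractive index, defined as the ratio of the speed of light in vacuum to the speed of light in the optical material, and $\psi_{\mathrm{FOV}}$ is the angle of FOV of the receiver UE.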
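The LOS gain described above can be evaluated numerically with the widely used textbook form of the Lambertian LOS model. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the photodiode area `pd_area`, the default refractive index, and the unit filter gain are hypothetical inputs that the text does not specify.

```python
import math

def lambertian_order(half_angle_deg):
    """Order m of the Lambertian profile from the semi-angle at half illuminance."""
    return -math.log(2.0) / math.log(math.cos(math.radians(half_angle_deg)))

def concentrator_gain(psi_deg, fov_deg, n):
    """Gain of an idealised concentrator with refractive index n; zero outside the FOV."""
    if 0.0 <= psi_deg <= fov_deg:
        return n ** 2 / math.sin(math.radians(fov_deg)) ** 2
    return 0.0

def los_dc_gain(d, phi_deg, psi_deg, half_angle_deg, fov_deg,
                pd_area, n=1.5, t_opt=1.0):
    """LOS DC channel gain between a VLC AP and a UE (standard Lambertian form).

    pd_area (m^2), n and t_opt are assumed receiver parameters.
    """
    if not 0.0 <= psi_deg <= fov_deg:
        return 0.0  # the ray arrives outside the receiver's field of view
    m = lambertian_order(half_angle_deg)
    return ((m + 1) * pd_area / (2 * math.pi * d ** 2)
            * math.cos(math.radians(phi_deg)) ** m
            * t_opt * concentrator_gain(psi_deg, fov_deg, n)
            * math.cos(math.radians(psi_deg)))
```

A semi-angle of 60 degrees gives $m = 1$, the ideal Lambertian emitter, which is a convenient sanity check for the first helper.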
Next, the channel gains of the NLOS reflected light components received by the photodiode (PD) at a UE are computed. The $l$th reflected light ray component is a light ray coming from the $(l-1)$th reflecting point. The $(l-1)$th reflecting point acts as a virtual light source and the $l$th reflecting point becomes the virtual receiver. Investigations in [65] find that the effective DC channel gain of the light ray undergoing various reflections, $G_{\mathrm{EffRef}}$, is cumulative over the channel gains between all the pairs of reflecting points. Mathematically, where $p$ denotes the index of reflection, $G^{(p)}$ is the DC channel gain after the $p$th reflection from the source LED which can be further expressed as where $dA_s$ is the infinitesimally small reflection surface area, and $P^{(p)}_q$ is the optical power of the reflected light ray component after $p$ reflections emitted from the $q$th transmitting VLC AP. The infinitesimally small area of the wall surface is considered as the variable for the above integration.
where $A_s$ is the incidence surface area, and $\phi_b$ and $\psi_b$ ($b = 1, 2, \ldots, p+1$) are the irradiance and incidence angles at the $b$th reflection ($b$ is a dummy variable). In (4), $G^v_{ij}$ is the DC channel gain between the $i$th VLC AP and the PD based $j$th receiver. On the other hand, in (8), $G_1$ is the channel gain between the $i$th VLC AP and the first reflecting point, $G_2$ is the channel gain between the first and the second reflecting points, and similarly $G_{p+1}$ is the channel gain between the $p$th reflecting point and the receiver PD. The channel gains at all the reflecting points have nearly the same mathematical form. $G_{p+1}$ is a function of $T_{\mathrm{opt}}$ and $g(\psi_{p+1})$, as $T_{\mathrm{opt}}$ and $g(\psi_{p+1})$ are properties of the receiving PD and $G_{p+1}$ is the gain relating the last reflection point and the receiving PD.
Let p (λ) be the spectral reflectance of the material at the $p$th reflecting point, then $P^{(p)}_q$ is given as All the surfaces of all the reflecting points are assumed to be composed of the same material. As p is a function of λ, thus, it is assumed that 1 The effective received optical power $P_{\mathrm{eff}}$ from a single LED is the sum of the LOS and the NLOS components and is expressed as
$$P_{\mathrm{eff}} = P_i \, G_{ij},$$
where $P_i$ is the power transmitted by VLC AP $i$ and $G_{ij} = G_{\mathrm{EffRef}} + G^v_{ij}$ is the effective channel gain between AP $i$ and UE $j$ for $i \in \mathcal{N} \setminus \{0\}$.

B. RF SIGNAL PROPAGATION MODEL
The signal received by UE $j$ from the RF AP follows the RF signal propagation model, where the power channel gain includes fading as well as path loss. The received RF signal power is modeled using the WINNER-II channel model [66] given as where $\chi_{0j}$ is the Nakagami fading channel, $pl_{0j}$ is the path-loss exponent and $d_{0j}$ is the distance of UE $j$ from the RF AP indexed as $i = 0$. Here, $L = 10^{X/10}$ and $X = M + N \log_{10}(\cdot)$. The Nakagami distribution is a general fading distribution with shape parameter $\kappa$. It reduces to the Rayleigh distribution when $\kappa = 1$ and approximates the Rician fading distribution when $\kappa > 1$.
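To make the fading model concrete, the sketch below draws Nakagami power gains using the standard fact that the squared envelope of a Nakagami variable with shape $\kappa$ and mean power $\Omega$ is Gamma distributed with shape $\kappa$ and scale $\Omega/\kappa$. It is a simplified stand-in and not the WINNER-II expression of the paper: the factor $L$ is omitted, and whether $\chi_{0j}$ denotes an amplitude or a power gain is an assumption here. All parameter names are hypothetical.

```python
import random

def nakagami_power_gain(kappa, omega=1.0, rng=random):
    """Draw one power gain of a Nakagami fading channel.

    The squared Nakagami envelope is Gamma(shape=kappa, scale=omega/kappa);
    kappa = 1 corresponds to Rayleigh fading.
    """
    return rng.gammavariate(kappa, omega / kappa)

def rf_received_power(p_tx, distance_m, path_loss_exp, kappa,
                      omega=1.0, rng=random):
    """Simplified received RF power: transmit power x fading x distance path loss.

    The WINNER-II frequency and shadowing terms are deliberately left out.
    """
    fading = nakagami_power_gain(kappa, omega, rng)
    return p_tx * fading * distance_m ** (-path_loss_exp)
```

Passing a seeded `random.Random` instance as `rng` makes the draws reproducible, which is convenient when comparing resource allocation policies on identical channel realisations.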

C. ACHIEVABLE DATA RATE
As the objective of our work is maximization of the achievable sum-rate of hybrid RF/VLC systems, developing insight into the achievable data rate of a UE, when it is associated with the RF AP or a VLC AP, is significant. During a UE's association with the RF AP, its achievable data rate is expressed by Shannon's capacity formula. On the other hand, when a UE is associated with a VLC AP, its communication is based on intensity modulation/direct detection (IM/DD) of light. In this scheme, the signal amplitude represents the instantaneous optical power. Consequently, the signal is constrained to be real-valued and non-negative. Due to these constraints, direct application of the Shannon capacity formula may not give the achievable data rate.
The authors in [67]-[70] have investigated the capacity of an IM/DD channel corrupted by Gaussian noise. In [68], investigations show that the channel capacity in VLC networks can be approximated by its lower bound as
$$C \approx \frac{1}{2} B \log_2\left(1 + w \, \frac{(\rho P_{\mathrm{eff}})^2}{\sigma^2}\right), \tag{12}$$
where $w$ is a constant given as $w = e/(2\pi)$ ($e$ is Euler's number), $\rho$ is the responsivity of the PD, $B$ is the modulation bandwidth, $P_{\mathrm{eff}}$ is the received optical power and $\sigma^2$ is the Gaussian noise power. It was found in [68] that a factor of $\frac{1}{2}$ appears as a result of various constraints in VLC. It was also found that (12) is accurate and that, at high SNR, it coincides with the upper bound.

D. COMMUNICATION MODEL
Each UE receives data from the RF AP or from one of the VLC APs. Its association is decided with the help of the DQN based learning algorithm proposed ahead. Some APs are likely to be idle, with no UE associated with them. For UE $j$ associated with AP $i$ for $i \in \mathcal{N}$, the channel gain vector will be written as and $G_{ij} \in \mathbf{G}_j$ denotes the channel gain between UE $j$ and AP $i$. The signal transmitted by the APs will be represented in the vector form as . Remember that index $i = 0$ in the above sets denotes the RF AP. Let us consider that UE $j$ is associated with AP $i$. When UE $j$ is associated with AP $i = 0$, i.e., the RF AP, it will receive signal $y_j$ expressed as where $n^r_j$ is the additive white Gaussian noise (AWGN). On the other hand, when UE $j$ is associated with the $i$th VLC AP, $y_j$ will be expressed as where $\rho_j$ is, as mentioned in (12), the responsivity of the receiving PD at UE $j$, $n^v_j$ includes the shot noise and thermal noise, and where $\alpha_{kj}$ is an indicator function denoting the association of AP $k$ with UE $j$, such that $\alpha_{kj} = 1$ if UE $j$ is associated with AP $k$ and $\alpha_{kj} = 0$ otherwise. In (14), $\alpha_{ij} = 1$ means UE $j$ is associated with AP $i$. AP $i$ and UE $j$ are the desired AP-UE pair while AP $k$ is causing interference at the $j$th receiving UE. The first term in (14) represents the desired signal whereas the second term denotes interference. Note that a conventional form of the expression does not have $D_k(\alpha_{kj})$ in the interference term. We multiply the interference term by $D_k(\alpha_{kj})$ to include the case of idle APs which are not transmitting. It ensures that AP $k$ is considered as an interferer only if it is transmitting to at least one UE $j'$, where $j' \neq j$. The parameter $\alpha_{kj}$ signifies the association of UE $j$ with AP $k$. $D_k(\alpha_{kj})$ equals 0 if AP $k$ is not transmitting and 1 if it is transmitting to a UE $j'$. This factor incorporates the situation when an AP is momentarily switched off due to hardware failure.
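The role of the idle-AP factor can be illustrated with a short sketch. The SINR form used below, with electrical signal power $(\rho\, G_{ij} P_i)^2$ and interference summed only over transmitting VLC APs, is an assumed simplification and not necessarily the exact expression of the paper. The nested-list layout, with row 0 reserved for the RF AP, and all function names are hypothetical.

```python
def transmit_indicator(alpha, k):
    """Idle-AP factor: 1 if AP k serves at least one UE, otherwise 0."""
    return 1 if any(alpha[k]) else 0

def vlc_sinr(i, j, alpha, gain, power, rho, noise_power):
    """SINR of UE j served by VLC AP i, ignoring interference from idle APs.

    alpha[k][j] is the 0/1 association of AP k with UE j, gain[k][j] the
    effective channel gain, power[k] the transmit power of AP k.
    Index 0 is the RF AP, which does not interfere with the VLC links.
    """
    signal = (rho * gain[i][j] * power[i]) ** 2
    interference = sum(
        transmit_indicator(alpha, k) * (rho * gain[k][j] * power[k]) ** 2
        for k in range(1, len(alpha)) if k != i
    )
    return signal / (interference + noise_power)
```

With two VLC APs of equal power, switching the second AP from active to idle removes its term from the denominator, so the SINR of a UE served by the first AP rises accordingly.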
Following (12), (13), and (14), the instantaneous achievable data rate at UE j for an input signal which is continuous and follows a negative exponential distribution is expressed as in [13], where γ 0j and γ ij are the lower bounds of SINR 0j and SINR ij respectively, B 0j is the bandwidth of the RF AP (i = 0) - UE j link and B ij is the bandwidth of the VLC AP i - UE j link such that i ∈ N \{0}. As only one RF AP has been considered in the model, it is assumed that the RF signals suffer negligible interference. Thus, when UE j is connected to the RF AP, we are interested in the SNR. However, for the sake of consistency in notation, the SNR for the RF AP - UE j link is expressed as SINR 0j . When UE j is connected to a VLC AP, the SINR will be of interest. Note that any general mention of SINR hereafter will mean SNR in the case of RF signals. Based on the above expression for the instantaneous data rate, the throughput of AP i is formulated as r i in (19).

E. THE RESOURCE ALLOCATION PROBLEM
This article aims to find the optimal user association, the transmit power allocation for the APs, and the optimal downlink bandwidth allocation made by an AP for the UEs associated with it. The resource allocation is done to maximize r i obtained in (19). The resource allocation problem P is formulated in (20), subject to the following constraints. The constraint in (21) states that the sum of the bandwidths allocated to the UEs associated to VLC AP i for i ∈ N \{0} cannot exceed B v max , which is the total bandwidth that can be allocated to a VLC AP. A similar constraint is imposed on the RF AP: the constraint in (22) states that the sum of the bandwidths allocated to the UEs associated with the RF AP cannot exceed B r max , which is the total bandwidth allocated to the RF AP. The next constraints are imposed on the transmission power to respect the power budget and eye safety considerations. The transmission power of a VLC AP cannot exceed its maximum available power P v max . Similarly, the transmission power of the RF AP cannot exceed its maximum available power P r max . Additional constraints are imposed on SINR ij for i ∈ N , j ∈ M to achieve reliable communication. Let the minimum level of SINR required by the jth UE from the ith AP for successful communication be γ ij . The constraint C 5 in (25) then requires SINR ij to be at least the minimum threshold γ ij . For the calculations in this work, we consider SINR ij = γ ij . Equality is assumed here to carry out the optimization of the variables B ij , P i and α ij . As the optimization of B ij , P i and α ij leads to the optimization of γ ij , assuming equality facilitates the solution.
When the constraint C 5 in (25) holds with equality, the conditions in (27)-(29) are obtained for preventing SINR constraint violation [11], [12]; among them is the requirement 1 − Σ i∈N Σ j∈M ξ ij > 0. Constraints (25), (27), (28), and (29) are significant for controlling interference in the system. It is possible that the maximizations of the achievable data rates of the different APs, namely r i , interfere with each other due to the inter-AP interference. Thus, maximizing the achievable data rates of different APs at the same time will be difficult. The constraint (25) ensures that a minimum SINR threshold, denoted as γ ij , is maintained for every AP-UE pair. Putting a minimum SINR constraint on each AP-UE pair ensures a cap on the interference caused by the APs. When the interference from an AP increases to a level that violates this SINR constraint at some UE, the DQN learning mechanism will regulate the transmission power of the interfering AP in a manner that the SINR constraint is satisfied. The constraint in (25) leads to the constraints in (27)-(29), which are used to incorporate (25) in the algorithm while solving the optimization problem. An AP will interfere with the signals of another AP if the constraints in (27)-(29) are violated. This process will be accomplished with the help of the state space vector S ij formulated in the subsequent section.
Note that a sum of logarithmic functions is concave. However, problem P in (20) is jointly non-concave in B ij , P i and α ij (please refer to the Appendix for the proof).
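The constraint set described in this subsection can be summarized as a feasibility check. The sketch below is only an illustration under assumed data structures (nested lists indexed by AP and UE); it checks the per-AP bandwidth budget, the per-AP power budget, the minimum SINR on every active link, and that each UE is served by exactly one AP.

```python
def constraints_satisfied(alpha, bandwidth, power, sinr, b_max, p_max, gamma_min):
    """Return True if an allocation respects the bandwidth, power, SINR
    and association constraints. All container layouts are assumptions:
    alpha, bandwidth, sinr, gamma_min are [AP][UE]; power, b_max, p_max
    are per-AP lists (index 0 is the RF AP).
    """
    n_ap, n_ue = len(alpha), len(alpha[0])
    for i in range(n_ap):
        served = [j for j in range(n_ue) if alpha[i][j] == 1]
        # Sum of bandwidths given to the served UEs must fit the AP budget.
        if sum(bandwidth[i][j] for j in served) > b_max[i]:
            return False
        # Transmit power must not exceed the AP maximum.
        if power[i] > p_max[i]:
            return False
        # Every active link must meet its minimum SINR threshold.
        if any(sinr[i][j] < gamma_min[i][j] for j in served):
            return False
    # Each UE receives data from exactly one AP.
    return all(sum(alpha[i][j] for i in range(n_ap)) == 1 for j in range(n_ue))
```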

IV. DQN-BASED LEARNING ALGORITHM FOR RESOURCE ALLOCATION IN HYBRID RF/VLC
Now, a DQN-based learning algorithm to maximize the network throughput in (20) is developed.

A. FRAMEWORK FOR LEARNING
In this section, a DQN-based learning algorithm for the resource allocation problem in (20) is formulated. The proposed algorithm maximizes the achievable data rate of AP i in (20) while satisfying the constraints in (21)-(27). Learning based algorithms run with the help of three variables: state, action, and reward. The state vector defines the present status of the environment. The action vector defines the action taken after observing the present status of the environment. The reward defines what the system receives after an action is taken. Let S ij = {s 1 ij , s 2 ij , . . . , s l ij } be the state vector and A ij = {a 1 ij , a 2 ij , . . . , a m ij } be the action vector, where l and m depend on the formulations of S ij and A ij . At time t, the system is in the state s ij (t) ∈ S ij . When action a ij (t) ∈ A ij is taken, the system moves to state s ij (t + 1) ∈ S ij , and the outcome of the action is received in terms of the reward R i (s, a). The CU trains the learning algorithm to perform the association of UEs and communicates the power and bandwidth allocation to the APs. This process is repeated iteratively. With each iteration, the system moves towards receiving the maximum reward. The action vector, state vector and reward are formulated as follows.

1) ACTION SPACE (A ij )
As can be seen in (20), the association and resource allocation variables are α ij , B ij , and P i , so the action space A ij is formulated with α ij , B ij , and P i for i ∈ N and j ∈ M. The bandwidth B ij and the power P i are discretized into finite sets of values for i ∈ N and j ∈ M. In the formulation of these sets, P r/v min and P r/v max are the minimum and maximum levels of the transmit power for the RF and the VLC APs respectively. Note that the cardinality of α ij will be 2^(|N |×|M|) for i ∈ N and j ∈ M. The design of A ij involves 2^(|N |×|M|) values of α ij because of the presence of the interference term in (18).
Without loss of generality, |B ij | = |P i | = 2^(|N |×|M|) is considered for the formulation of A ij . The discretized values of B ij , P i and α ij will be used to compute the threshold γ ij according to (18) on each link from the ith AP to the jth UE, and the action vector A ij will be formulated as in (32). At every iteration, the CU will choose one value from the set A ij for each AP. While choosing a strategy from A ij , the CU adapts the transmit power P i and the bandwidth allocation B ij for the ith AP (such that j ∈ M\{α ij = 0}), and observes the changes in the environment and its own transmission. Thus, the action is the selection of B ij and P i to achieve a minimum SINR (γ ij ). Next, the state space vector is designed.
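A discretized action set of this kind can be enumerated as a Cartesian product. In the sketch below the uniform grid between the minimum and maximum values, the per-link tuple (α, B, P), and all function names are assumptions chosen for illustration; the text only states that B ij and P i are discretized and combined with α ij.

```python
from itertools import product


def discretize(low, high, levels):
    """Uniformly spaced values from low to high (assumed grid, levels >= 2)."""
    step = (high - low) / (levels - 1)
    return [low + n * step for n in range(levels)]


def build_action_space(b_min, b_max, p_min, p_max, levels):
    """All (association, bandwidth, power) combinations for one AP-UE link."""
    bandwidths = discretize(b_min, b_max, levels)
    powers = discretize(p_min, p_max, levels)
    return list(product((0, 1), bandwidths, powers))
```

With `levels` values per variable, each link has 2 × levels × levels candidate actions, which illustrates why the action space grows quickly with the number of APs and UEs.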

2) STATE SPACE (S ij )
The state space vector is based on the constraints defined in (21)-(29) and is defined with binary indicator variables collected in the set S ij . It can be seen that the formulations in (25)-(29) help in creating the state vector, so as to help the proposed DQN maintain a tradeoff between the desired signal power and the interference suffered. Note that the total number of possible states will be 2⁶ = 64.
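Since the state is a vector of six binary variables, it can be mapped to one of the 64 state indices. The following sketch shows one possible encoding; the bit ordering and the function name are assumptions, and the meaning of each individual flag is left open because the definition of S ij is not fully reproduced in this text.

```python
def encode_state(flags):
    """Map six binary flags to an integer state index in [0, 63]."""
    if len(flags) != 6 or any(f not in (0, 1) for f in flags):
        raise ValueError("state must consist of six binary flags")
    index = 0
    for f in flags:
        # Shift the bits collected so far and append the next flag.
        index = (index << 1) | f
    return index
```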

3) REWARD (r i )
As mentioned before, AP i receives an immediate reward depending on the action taken in a particular state. For each i ∈ N and j ∈ M at time iteration t, the CU decides the action a ij (t) ∈ A ij for the i-j link after observing the state s ij (t). The CU communicates α ij (t) through a backhaul link to AP i for all j ∈ M\{α ij = 0}. In the explanation ahead, the subscripts i and j in s ij , a ij and A ij are dropped for simplicity. The immediate reward R i (s, a) is received in the form of the data rate of AP i, where r fix is a fixed reward, smaller than the reward r i obtained when the constraints are satisfied, that is assigned to any action violating the interference constraints. When the constraints are satisfied, the reward received by AP i is r i . The CU will seek to find an optimal policy for each AP to maximize its own r i . The CU repeatedly makes the decision and finally obtains the optimal policies for the APs to maximize their respective r i s subject to constraints (21) to (29). Since the r i s are always non-negative, maximization of Σ i∈N r i can be achieved by maximizing the individual r i of each AP i. Therefore, the CU will seek to find an optimal policy through the DQN learning algorithm to maximize the reward for AP i. The action vector, state vector, and reward are used for performing DQN learning as shown in Fig. 2. The CU is equipped with a replay memory to store the experience e i (t) = {a ij (t), s ij (t), r i (t), s ij (t + 1)}, which is gathered at the transition between two consecutive time instants t and t + 1. The replay memory gets s ij (t), r i (t), and s ij (t + 1) from the network and a ij (t) from the DQN learning output. A mini-batch takes training samples from the replay memory at each iteration. Each iteration consists of a fixed number of episodes EP N such that each episode uses one training sample and runs for T time slots as shown in Algorithm 1. Further, a DQN block is shown where DQN learning is performed. The input switch of the DQN block switches its connection alternately between the output of the mini-batch and a link to the network. When connected to the output of the mini-batch, it receives the training samples, while when connected to the link to the network, it gathers knowledge about the state s ij (t). The DQN learning output is produced in the form of the selected action a ij (t). The output port of the DQN block switches its connection alternately between two input ports ahead. The first input port feeds a ij (t) to the replay memory. The second input port feeds a ij (t) to the loss and gradient and parameter upgrading blocks, where the upgraded θ is obtained. The output of the parameter upgrading block is fed back to the input of the DQN block along with the mini-batch output.
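The reward rule and the replay memory described above can be sketched as follows. The default penalty value `r_fix=-1.0`, the class name, and the method names are hypothetical; the capacity of 100 and mini-batch size of 10 are the values used later in the simulations. The experience tuple follows the order (a, s, r, s') used in the text.

```python
import random
from collections import deque


def immediate_reward(rate, constraints_ok, r_fix=-1.0):
    """Data rate of the AP when constraints hold, else the fixed penalty.

    r_fix = -1.0 is a hypothetical value; it only needs to be smaller
    than any achievable (non-negative) rate.
    """
    return rate if constraints_ok else r_fix


class ReplayMemory:
    """Fixed-capacity experience store; the oldest entries are dropped first."""

    def __init__(self, capacity=100):
        self.buffer = deque(maxlen=capacity)

    def store(self, action, state, reward, next_state):
        self.buffer.append((action, state, reward, next_state))

    def sample(self, batch_size=10, rng=random):
        # Uniform sampling without replacement from the stored experiences.
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)
```

Sampling uniformly from the stored transitions, instead of always training on the latest one, is the usual motivation for experience replay: it reduces the correlation between consecutive training samples.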
To accomplish the DQN based learning algorithm for AP i, the CU finds an optimal policy π for it with the help of the state-value function V π (s) [43]. It is the maximum discounted sum of immediate rewards R i (s, a) over a long span of time while the optimal policy π is being followed. Mathematically, it is written as (R(s, a)) t |s t = s, a t = a, π}.
where ζ is the learning rate at which Q * (s, a) is updated. In (35), Q * (s, a) iteratively converges to its optimal value as t → ∞.
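For reference, the standard tabular Q-learning update has the form sketched below. The update (35) is not reproduced in this text, so this standard form is an assumption rather than a transcription; the default learning rate 0.01 and discount factor 0.9 are the values quoted later for the simulations, and the table layout is illustrative.

```python
def q_update(q_table, state, action, reward, next_state, lr=0.01, discount=0.9):
    """One step of Q(s, a) <- Q(s, a) + lr * (r + discount * max Q(s', .) - Q(s, a)).

    q_table is a list of per-state lists of action values (assumed layout).
    Returns the updated value of Q(s, a).
    """
    best_next = max(q_table[next_state])
    td_target = reward + discount * best_next
    q_table[state][action] += lr * (td_target - q_table[state][action])
    return q_table[state][action]
```

For example, with Q(s, a) = 0, a reward of 1 and a best next-state value of 2, the target is 1 + 0.9 × 2 = 2.8 and the updated value is 0.01 × 2.8 = 0.028.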
The maximization of Q(s, a) leads to the maximization of r i , as the objective of DQN learning is to define an environment in which the agent performs actions to maximize the reward. In this work, the reward is the achievable data rate of the ith AP, r i . First, the state-value function V π (s) is calculated. The state-value function V π (s) tells which state gives the highest reward, i.e., the achievable data rate r i . The next step is the calculation of the action-value function Q(s, a), which signifies the action or the policy that the agent should take so that the maximum state value is achieved. Mathematically, Q(s, a) = max π V π (s). Thus, maximizing the action-value function leads to the maximization of the reward r i . If the state and action vectors are large, obtaining the optimal Q * (s, a) becomes challenging. Thus, the optimal action-value function is estimated with the help of a function estimator. In this regard, [43] is followed, where a neural network is proposed for this estimation as Q(s, a; θ) ≈ Q * (s, a). In this article, a fully connected feed-forward multilayer perceptron (MLP) network is used for this approximation. Using a neural network as the action-value approximator is itself an advantage of the DQN based algorithm. The approximation also includes experience replay to improve the performance of learning, in which the CU stores the experience of the environment at each time step for AP i as e i (t) = {a ij (t), s ij (t), r i (t), s ij (t + 1)} in a replay memory. The replay memory at different time instants is written as D i (t) = {e i (1), . . . , e i (t)}. The two different MLP networks used as Q-network approximators are the action-value function approximator Q(s, a; θ) and the target action-value function approximator Q(s, a; θ − ). Here, θ and θ − are the parameters of the present and previous iterations respectively.
With each iteration, the present iteration parameter θ of the action-value function is updated. This is done with the help of the replay memory D i , from which a random sample (a, s, r, s′) is chosen. The update of θ − is done after a fixed number of iterations, when the parameters of the target value function are replaced with the updated θ of the action-value function. The update procedure is carried out with a gradient descent algorithm based on a cost function, as listed in Algorithm 1.

Algorithm 1
    Otherwise select a random action with probability
    Update the state s ij (t + 1) and the reward r i (t) according to (33) and (37)
    Store e i (t) = (a ij (t), s ij (t), r i (t), s ij (t + 1)) in the experience replay memory created for AP i, D i
    Update the current parameters θ i of the action-value function Q(s ij (t), a ij (t); θ i ) by sampling a mini-batch of transitions from D i (t)
    After every fixed number of steps, update θ − i = θ i
    Get mini-batch samples from the replay memory
    end for
    end for
    end for
    end for
    Compute r = Σ i∈N r i
    As the non-negative r i of each AP is optimized and the sum-rate r is the sum of the r i s, this leads to the optimization of the overall system

The DQN based learning algorithm for maximizing the achievable sum-rate of the hybrid RF/VLC system is given in Algorithm 1. The above application of DQN learning to solve a resource allocation problem is expected to prove efficient, as the considered hybrid RF/VLC system involves large state and action vector spaces. In this regard, DQN learning takes advantage of neural networks to train the learning process and efficiently maximize the Q-value by approximating the action-value function from the current state. With such an application, a higher convergence speed of the algorithm and a better achievable sum-rate are expected. Moreover, the solution is achieved without complete and accurate network information. Each r i for the ith AP is optimized first. As the r i s are non-negative and the overall sum-rate r is their sum, this leads to the overall system optimization.
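The interplay between the current parameters θ, the target parameters θ − and the mini-batch can be sketched in a few lines. To keep the example dependency-free, a linear approximator Q(s, a; θ) = θ[a] · s is swapped in for the MLP used in the article, and a mean squared temporal-difference error is assumed as the cost function; function names and data layout are illustrative.

```python
import copy


def q_values(theta, state):
    """Linear stand-in for the Q-network: one weight row per action."""
    return [sum(w * x for w, x in zip(row, state)) for row in theta]


def train_step(theta, theta_target, batch, lr=0.01, discount=0.9):
    """One gradient-descent step on the mean squared TD error of a mini-batch.

    batch holds (action, state, reward, next_state) tuples; targets are
    computed with the frozen target parameters theta_target.
    """
    grads = [[0.0] * len(row) for row in theta]
    loss = 0.0
    for action, state, reward, next_state in batch:
        target = reward + discount * max(q_values(theta_target, next_state))
        error = q_values(theta, state)[action] - target
        loss += error ** 2
        for d, x in enumerate(state):
            grads[action][d] += 2.0 * error * x
    n = len(batch)
    for a, row in enumerate(theta):
        for d in range(len(row)):
            row[d] -= lr * grads[a][d] / n
    return loss / n


def sync_target(theta):
    """Copy the current parameters into the target network (theta_minus = theta)."""
    return copy.deepcopy(theta)
```

In a full training loop, `train_step` would be called once per iteration on a sampled mini-batch, and `sync_target` only after every fixed number of steps, which keeps the regression targets stable between synchronizations.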

V. A NEWLY ENTERING UE
To investigate a dynamic system, a new UE entering the scenario is considered. Note that the DQN based learning algorithm estimates the new Q-function on the basis of the reward of every action for each AP. The CU learns the environment of each AP separately. Then it takes the action linked with the highest reward, which means performing the association of the UEs with each AP and allocating bandwidth and power to each AP in a manner which gives the highest reward. Thus, the AP gets the reward pertaining to the action taken by the DQN based learning algorithm. The parameters of the Q-function are updated as per the reward received immediately. In other words, these parameters reflect the effects brought by the action parameter of each AP. Each AP causes interference to the other UEs in the hybrid RF/VLC environment. The Q-function parameters reflect the local environment of each AP and also the overall interrelationship between the different modules of the hybrid RF/VLC system.
When a new UE joins the environment, discarding all the information already gathered at the CU about the individual APs and about the interconnection of the modules in the environment, and initiating the algorithm again for the new system, would be an inefficient procedure. We propose applying transfer learning in such a situation [61]. The information about the environment already gathered through Algorithm 1 before the new UE has entered will be used immediately after it enters the environment. As the cognitive cycle proceeds further, the information will be updated according to Algorithm 2.
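The difference between restarting and transferring can be sketched as a choice of initial parameters for the Q-function after the new UE arrives. Algorithm 2 is not reproduced in this text, so the sketch below only illustrates the warm-start idea; the function names, the parameter layout, and the random initialization range are assumptions.

```python
import copy
import random


def random_parameters(n_rows, n_cols, rng):
    """Cold start: small random parameters (range is an arbitrary assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_cols)] for _ in range(n_rows)]


def parameters_for_new_ue(learned, use_transfer, rng=None):
    """Initial Q-function parameters once a new UE has entered.

    With transfer learning the parameters learned before the arrival are
    reused as the starting point; otherwise learning restarts from random
    values of the same shape.
    """
    if use_transfer:
        return copy.deepcopy(learned)
    rng = rng or random.Random()
    return random_parameters(len(learned), len(learned[0]), rng)
```

Training then continues from the returned parameters in either case; only the starting point differs, which is what is expected to shorten convergence when the environment has changed by a single UE.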

VI. SIMULATION RESULTS
In this section, the effectiveness of our proposed algorithms has been verified with the help of simulations.

A. PERFORMANCE ANALYSIS OF THE PROPOSED ALGORITHMS
Initially, the following set-up is considered: the hybrid RF/VLC network consists of 1 RF AP, 4 VLC APs, and 4 UEs. At a given time instant, each AP can serve multiple UEs, while one UE can receive data from only one AP. The values of the parameters have been chosen from [18] for performing the simulations. The VLC AP noise N v 0 is 10⁻²¹ A²/Hz, the average optical power per VLC AP (LED lamp) is 9.2 W, the physical area of the PD A pd is 1 cm², the PD responsivity ρ j for all j ∈ M UEs is 0.28 A/W, the receiver FOV is 60°, the half angle of the LED φ 1/2 is taken as 70°, and the maximum luminous intensity of the LED is 28 cd. A learning rate ζ of 0.01 and a discount factor of 0.9 are used for all the APs. The path-loss exponent pl 0j is taken as 2.8. The VLC APs are deployed on the room ceiling. The replay memory capacity is set to 100 and the mini-batch size is kept at 10. The investigations have been performed over 1000 Monte Carlo simulations. The input to the neural network has 7 nodes: 6 nodes for the state and 1 node for the selected action to be taken. The structure of the DQN consists of two fully connected hidden layers with 3 and 2 neurons, respectively. The state and action vectors are functions of the downlink bandwidth, power, and association parameter. Thus, passing the state and action vectors through the input of the neural network means passing the downlink bandwidth, power and association parameter. The neural network trains the DQN-learning algorithm to generate the action-value approximator through environmental interaction and to receive the maximum reward. As the iterations proceed, the algorithm converges towards the optimal policy selection from the action vector A ij in (32), which means choosing the optimal B ij , P i and α ij , and the achievable sum-rate is maximized.
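The network dimensions given above (7 inputs, hidden layers of 3 and 2 neurons) can be written out as a small forward pass. The single output neuron producing the Q-value, the ReLU activation, and the initialization range are assumptions, since the text does not specify them; the sketch uses plain Python lists to stay self-contained.

```python
import random

# 6 state nodes + 1 action node, two hidden layers, one assumed Q-value output.
LAYER_SIZES = [7, 3, 2, 1]


def init_mlp(rng):
    """Random weights and zero biases for each layer (assumed initialization)."""
    return [
        ([[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(LAYER_SIZES[:-1], LAYER_SIZES[1:])
    ]


def forward(params, state_bits, action_index):
    """Approximate Q(s, a; theta) for a 6-bit state and one action index."""
    x = [float(b) for b in state_bits] + [float(action_index)]
    for layer, (weights, biases) in enumerate(params):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if layer < len(params) - 1:
            x = [max(0.0, v) for v in x]  # ReLU on hidden layers (assumption)
    return x[0]
```

With these sizes the network has 7 × 3 + 3 × 2 + 2 × 1 = 29 weights and 6 biases, which is very small compared with typical DQN architectures.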
In the evaluations, the outcome of the proposed schemes is compared with the exhaustive search algorithm and with the received SINR based and the received power based association schemes as benchmarks, which are popular resource allocation techniques in heterogeneous networks. We also compare the proposed schemes with the Q-learning based power allocation scheme for hybrid RF/VLC proposed in [71]. First, an explanation of the received SINR and received power based resource allocation schemes is provided. These schemes have been widely used in general heterogeneous networks.

1) RECEIVED SINR BASED SCHEME
The fundamental work in this area can be found in [10], and it has been followed further in [72]. The fundamental problem addressed in [10] is the optimal allocation of the association parameter when equal resources are allotted to all the APs. The optimal association is aimed at maximizing the achievable data rate of a single AP i - UE j link. The problem for obtaining the optimal association parameter is given in (40), where Load i = Σ j=1..M α ij is the total load on the ith AP, i.e., the number of UEs associated with the ith AP. As equal resource allocation is followed at each of the |N | APs, the data rate R ij is equally divided among all the UEs associated with the ith AP. The authors propose a highest-SINR based algorithm for solving this problem. The algorithm is based on the SINR which a UE has with each of the |N | APs. To formulate the algorithm, the problem in (40) is rewritten in terms of the Lagrange multiplier µ i as in (41). The algorithm proposed to solve the problem (41) is as follows.
UE's algorithm:
• Each UE measures the SINR by using the pilot signals from all the APs, and receives the value of µ i broadcast by each AP at the beginning of the iteration.
• UE j determines the AP i * which maximizes its UE-side objective. If there are multiple maximizers, the UE will choose one of them.
AP's algorithm: Each AP updates the values of Load i and µ i in the following two steps and announces the new multiplier µ i to the system.
• To obtain the maximizer of the problem in (43), its gradient is set to 0 with the constraint Load i ≤ |N |.
• The new value of the Lagrange multiplier is updated by (46), where δ(t) is a dynamically chosen step-size sequence based on some suitable estimates.

2) RECEIVED POWER BASED ASSOCIATION SCHEME
The next comparison of the proposed DQN-learning based resource allocation scheme is made with the received power based association technique. The most significant received power based association technique has been shown by Lin et al. in [73]. The association of a UE is decided according to the signal power it receives from the different APs: a UE is associated with the AP that provides the signal at the highest power. Suppose UE j is at a position y j and VLC AP i, i ∈ N , is at a position x v,i in a hybrid RF/VLC system. If the position of the RF AP is x 0 , a UE will be associated with the VLC AP if the corresponding received-power condition on these positions is met. This condition is also based on the received power at the UEs from the APs. The channel losses in the RF and VLC media have also been taken into consideration.
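The highest-received-power rule can be stated compactly in code. The sketch assumes that the received powers have already been computed from the channel models and are given as a matrix indexed by AP and UE; the function name and data layout are illustrative.

```python
def associate_by_received_power(received_power):
    """Return, for every UE j, the index of the AP with the highest received power.

    received_power[i][j] is the power UE j receives from AP i
    (index 0 is the RF AP). Ties are resolved in favour of the lowest AP index.
    """
    n_ap = len(received_power)
    n_ue = len(received_power[0])
    return [
        max(range(n_ap), key=lambda i: received_power[i][j])
        for j in range(n_ue)
    ]
```

Because the rule looks only at received power, it ignores the load of each AP and the interference it causes, which is one reason it serves as a baseline rather than as a joint optimizer.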

3) EXHAUSTIVE SEARCH METHOD
The third benchmark considered for investigating the efficiency of our proposed schemes is the exhaustive search method [74]. This method is highly complex. The exhaustive search method used here involves a trellis based mechanism. For instance, let us imagine the optimization of the parameters α ij , B ij , and P i , which lead to the calculation of the action variable γ ij , as a traversal between its initial random value and its final optimal value. This process involves forming a trellis between the two points. The trellis consists of a certain number of levels, with the final level having the optimal value of γ ij . Each level consists of a number of possible values for γ ij . The main objective here is to determine all the possible paths from the initial random value to the final optimal value. It involves working through the trellis from level 1 to the final level, calculating the number of paths at each level. Let R be the set of trellis levels; then there will be |R| trellis levels. Each level has M points where the optimum value could be obtained. If Q(r l , m) is the number of paths possible from level 1 to the point m of the level r l , where 1 ≤ m ≤ M, as shown in Fig. 3, the total number of possible paths will be Σ m=1..M Q(r, m).
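The path count can be illustrated with a small dynamic program. The sketch assumes a fully connected trellis, i.e., every point of one level can be reached from every point of the previous level, and that each point of level 1 starts one path; the connectivity of the trellis in Fig. 3 is not specified in this text, so these are assumptions.

```python
def count_paths(num_levels, points_per_level):
    """Total number of paths through a fully connected trellis.

    counts[m] plays the role of Q(r_l, m): the number of paths from
    level 1 that end at point m of the current level.
    """
    counts = [1] * points_per_level  # level 1: one path per starting point
    for _ in range(num_levels - 1):
        reachable = sum(counts)      # every point is reachable from all previous points
        counts = [reachable] * points_per_level
    return sum(counts)               # sum over m of Q(r, m) at the final level
```

Under these assumptions the total equals `points_per_level ** num_levels`, which shows the exponential growth that makes the exhaustive search highly complex.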

4) Q-LEARNING BASED POWER ALLOCATION SCHEME IN HYBRID RF/VLC
We compare our proposed schemes with the state-of-the-art multi-agent Q-learning based power allocation in hybrid RF/VLC systems proposed by Kong et al. in [71]. Kong et al. have used multi-agent Q-learning for optimization of the transmit power of the RF and VLC APs. Being multi-agent Q-learning, it is performed at each AP separately. On the basis of the application of Q-learning, each AP decides its transmit power. We compare our results with [71] as it is the state-of-the-art work available on this topic. The proposed DQN learning based resource allocation algorithm is different from the work in [71] in several aspects. The work in [71] deals only with transmit power allocation for the APs. On the other hand, the proposed DQN learning based resource allocation algorithm deals with the transmit power allocation for the APs, the bandwidth allocation for the APs and the association of the UEs with the APs. It can be seen that the domain of the problem addressed here is much larger. The work in [71] consists of only two constraints, on the transmit powers of the RF and the VLC APs. However, our work considers six constraints: the two transmit power constraints, two constraints on the bandwidths of the RF and VLC APs, and one constraint each on the association parameter and the SINR. Thus, our problem formulation is more practical. Considering constraints only on the transmit power of the APs leads to presumptions on the bandwidth and association parameters, which may compromise the practicality of the system. The comparison is accomplished by implementing the scheme proposed in [71] for our system and then comparing its output with our results (shown ahead in Fig. 10). In [71], the transmit power of the APs is the optimization variable and is optimized with Q-learning.
Thus, to implement [71] in our work, the action vector in expression (32) is formulated with only the power P i terms as variables, and presumptions are made for the bandwidth B ij and association α ij parameters. The problem (20)-(27) is reduced to a power allocation problem subject to the following constraints. A constraint is imposed on the transmission power to respect the power budget and eye safety considerations. The transmission power of a VLC AP cannot exceed its maximum available power P VLC max . Similarly, the transmission power of an RF AP cannot exceed its maximum available power P RF max . Further, the optimization of the action vector A ij is carried out with Q-learning. For the allocation of bandwidth, equal allocation is considered for all the APs, while the association parameter α ij is allocated as per the minimum distance criterion. For power allocation, the investigation is performed for two cases: when the number of UEs is fixed and when a new UE is entering the system. When the number of UEs is fixed, DQN learning without transfer learning serves the purpose, while when a new UE is entering the scenario, the application of transfer learning is investigated. The comparisons are shown ahead in Fig. 10. The achievable sum-rate is compared with the increasing number of VLC APs deployed as shown in Fig. 9. We now present the simulation results. In Fig. 4, the number of iterations needed for the maximization of the normalized achievable sum-rate with the proposed algorithms is studied. The investigation for the fixed number of UEs is done in Fig. 4a, and the investigation for the case of a new incoming UE is made in Fig. 4b. As mentioned above, |N | and |M| are considered as 5 and 4 respectively. In Fig. 4a, the DQN learning mechanism starts showing an output achievable sum-rate of 380 Mbits/s at nearly 240 iterations, which goes on increasing with minor fluctuations as the iterations are increased.
In nearly 1600 iterations, the final value of the maximized achievable sum-rate is obtained as 1270 Mbits/s. On the other hand, the exhaustive search algorithm starts showing output at nearly 250 iterations and shows a final achievable sum-rate value of 1140 Mbits/s in nearly 1600 iterations. The final values of the achievable sum-rate obtained with the received power and the SINR based association schemes are 850 and 990 Mbits/s respectively, which shows that the proposed DQN learning based mechanism outperforms exhaustive search, received power based association and received SINR based association, and leads to at least a 10% increase in the achievable sum-rate. Fig. 4b shows the performance of the transfer algorithm (Algorithm 2) for the case of a newly entering UE, which has been labelled as DQN-transfer learning. Fig. 4b also shows the performance of the DQN-learning based method, exhaustive search, received SINR based and received power based algorithms for this case. For applying DQN-transfer learning, the CU uses the information of the already learned network for the newly joined UE, while for applying the DQN learning based method, the CU initializes the action-value function parameters randomly for the newly joined UE. Similarly, the exhaustive search method and the received SINR and received power based algorithms restart from the beginning after the arrival of the new UE and operate till convergence. It can be seen that DQN-transfer learning converges to a final value of nearly 1290 Mbits/s in just 1200 iterations, while the DQN learning based mechanism converges to it in 2600 iterations. The exhaustive search, received SINR and received power based mechanisms converge to the final values attained in Fig. 4a, but in 2650, 3200 and 2700 iterations respectively. Note that the high number of iterations needed by the received SINR and power based association schemes arises due to load balancing and proportional fairness issues.
Unlike the proposed DQN based approaches, these algorithms use the total sum-rate as the objective function.
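The headline percentages can be checked directly from the values quoted for Fig. 4. The short computation below uses only numbers stated in the text (1270 versus 1140 Mbits/s for the sum-rate in Fig. 4a, and 1200 versus 2600 iterations in Fig. 4b); the helper names are illustrative.

```python
def percent_gain(new, baseline):
    """Relative increase of new over baseline, in percent."""
    return 100.0 * (new - baseline) / baseline


def percent_fewer(new, baseline):
    """Relative reduction of new with respect to baseline, in percent."""
    return 100.0 * (baseline - new) / baseline


# DQN learning versus exhaustive search, final sum-rate in Fig. 4a.
sum_rate_gain = percent_gain(1270, 1140)
# DQN-transfer learning versus DQN learning, iterations to converge in Fig. 4b.
iteration_saving = percent_fewer(1200, 2600)
```

The sum-rate gain evaluates to about 11.4%, consistent with the stated improvement of more than 10%, and the iteration saving to about 53.8%, consistent with the stated figure of nearly 54%.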
In Fig. 5, the average number of iterations needed for convergence with a varying number of UEs is shown. Fig. 5a shows that as the number of UEs increases, the number of iterations needed for the convergence of all the algorithms increases. It can be seen that the DQN based learning algorithm attains a given level of achievable sum-rate in far fewer iterations than the other algorithms for a given number of UEs in the network. For a network with 55 UEs, an at least 10% higher achievable sum-rate can be attained by the DQN-learning based algorithm in 14.28% fewer iterations compared to the exhaustive algorithm. Further, Fig. 5b shows that when a new UE enters the network, DQN-transfer learning achieves its maximum achievable sum-rate in nearly 54% fewer iterations for a given number of UEs present in the set-up. Fig. 6 shows the plot of the achievable data rate against the number of UEs. Fig. 6a shows the results for a fixed set-up while Fig. 6b shows the results for the case of the arrival of a new UE. In both figures, the number of UEs is varied from 5 to 55. It can be seen that over this entire range, the DQN based learning algorithm and DQN-transfer learning outperform the exhaustive algorithm and the received power and received SINR based association by reasonable margins. On increasing the number of UEs, the achievable sum-rate increases with the increase in the number of AP-UE links. The achievable sum-rate is found to increase at a higher rate in the 5 to 15 UEs range. As the number of UEs is increased from 15 to 25, a slight decrease in the rate of increment can be seen. For all further increments in the number of UEs up to 45, a slight decrement in the respective rates can be seen. This behavior is the same in all the algorithms investigated here. Intuitively, it is due to the fact that an increase in the number of AP-UE links also results in increased interference.
However, when the number of UEs is further increased from 45 to 55, the rate of increment again increases, which shows that for a high number of UEs, the desired signal power component becomes dominant. In Fig. 6b, the DQN-learning algorithm gives nearly the same output as the DQN-transfer algorithm. The difference between their applications is only the number of iterations needed to converge to their final outputs, as shown in Figs. 4 and 5.
In Fig. 7, the effectiveness of the proposed algorithms is investigated for a varying height of the room. As the height of the room increases, the transmitter-receiver separation increases, which results in a decrease in the achievable sum-rate. This decrease is evident from (4), (8), and (11), where it is shown that the channel gains for RF and VLC networks decrease in magnitude with the increase in the transmitter-receiver separation. As the height of the room is increased, the attenuation in the signals received by the UEs increases. Similar to the previous figures, Fig. 7a shows the investigation for the fixed UEs case while Fig. 7b shows the investigation for the newly entering UE. It can be seen that the DQN learning based algorithm and the DQN-transfer algorithm outperform the other algorithms under consideration. Fig. 8 shows investigations on the FOV of UE j. The FOV impacts the VLC system performance significantly. The achievable sum-rate obtained with the different schemes under consideration is plotted over a wide range of FOV from 60° to 180°. When the FOV of the receiver is small, the effect of interfering signals on it is smaller. Thus, it gives a higher achievable sum-rate. In contrast, when the FOV of the receiver is large, it receives more interfering signals from the unassociated APs. Thus, the interference increases, which results in a decrease of the achievable rate. This behavior is depicted in Fig. 8a and Fig. 8b. It can be seen that the proposed DQN learning and DQN-transfer learning based methods outperform the other schemes under consideration. It can also be concluded that for a fixed deployment of VLC APs at the corners of the room, a sharp decrease occurs in the achievable sum-rate with the increasing FOV of the receiver. Such behavior may change for a different deployment of the APs.
The results presented so far do not consider the case of dense AP deployment. To address this concern, we perform simulations for a higher number of APs, as shown in Fig. 9. The new APs are deployed in the same manner as the original 4 APs. The 4 APs deployed earlier remain at the same coordinates in the four corners of the room. The new APs are placed within the area covered by these 4 APs, as shown in Fig. 9, which also gives the coordinates of the new APs. Next, in Fig. 10, the achievable sum-rate is plotted against the number of APs for this set-up. Fig. 10a shows the performance of the DQN-learning mechanism, while Fig. 10b shows the performance of the transfer learning algorithm. From Fig. 10, it can be seen that the DQN-learning and the DQN transfer learning algorithms outperform the existing algorithms in both the static and the dynamic cases.

B. COMPLEXITY ANALYSIS OF THE PROPOSED DQN-LEARNING BASED ALGORITHMS
The objective of this work is the maximization of the action-value function $Q(s, a)$, which is achieved by bringing $Q(s, a)$ as close as possible to the target action-value function $Q(s, a, \phi)$. The algorithmic complexity is the sum of the statistical and the algorithmic errors in this process [75]. The total error rate is given by (51), where $Q_k$ is the Q term at the $k$th iteration, $\mu$ and $\sigma$ are the mean and standard deviation of $P(S \times A)$, where $P$ denotes a distribution, $\phi_{\mu,\sigma}$ is a constant such that $(1 - \iota)^2 \sum_{v \ge 1} \iota^{v-1} v \cdot K \le \phi_{\mu,\sigma}$, $n$ is the sample size, $\zeta$ is a constant, and $R_{i\max}$ is the maximum reward value for the $i$th AP. The first term on the right hand side (RHS) of the equation is the statistical error, while the second term on the RHS is the algorithmic error. The algorithmic error converges to zero at a linear rate as the algorithm proceeds, but the statistical error represents the fundamental problem. When the number of iterations $K$ is sufficiently large, the statistical error dominates the algorithmic error. Viewing $\iota$ and $\phi_{\mu,\sigma}$ as constants and ignoring the polylogarithmic term, the proposed algorithms achieve an error rate which scales linearly with the capacity of the action space and goes to zero when $n$ goes to $\infty$. Here, $t_j$ and $\beta_j$ are time parameters for the $j$th UE. The term $n^{\beta_j^*/(2\beta_j^* + t_j)}$ in the above equation recovers the statistical rate of nonparametric regression in the $\ell_2$-norm. It is further found that the algorithm achieves an error rate of $|A| \cdot n^{-\beta_j/(2\beta_j + r)}$ when $K$ is sufficiently large, where $r \in \mathbb{N}$ is a natural number.
Note that $\pi$ is the greedy policy with respect to Q and Q functions. As Q is constructed with an iterative algorithm, the error convergence has to be related to the error in the previous steps, i.e., $Q_k - Q_{k-1}$. This relation is formulated in (54), where $\phi_{\mu,\sigma}$ is a constant that depends only on the distributions of $\mu$ and $\sigma$. Thus, as mentioned above, the total error is the sum of the algorithmic and statistical errors, where $\max_{k \in [K]} \lVert Q_k - Q_{k-1} \rVert_\sigma$ is the statistical error and the second term on the RHS of the equation is the algorithmic error. The statistical error goes to zero as $n$ increases to a large number. The algorithmic error goes to zero as the number of iterations $K$ increases. The fundamental difficulty of DQN is the error incurred in a single step, for which a bound on $\lVert Q_k - Q_{k-1} \rVert_\sigma$ is obtained.
The action-value and the target action-value functions each need one neural network based MLP. Thus, the core function of the proposed algorithm is based on a neural network, as shown in Fig. 11. The network considered here has 2 hidden layers of 3 and 2 neurons, respectively. This neural network has 7 inputs, 6 from the state vector and 1 from the action vector, as shown in the figure.
It can be seen that the state vector has 6 binary input values. First, each input node decides whether a 1 or a 0 is fed into the neural network. This bit is decided, for an $i$-$j$ link, according to an $M$-dimensional linear equation, as shown in (33), for the constraints in the problem. The case is similar for the other state variables, which maintain the minimum SINR values. The state variables pertaining to the power constraints involve the selection of the power $P_i$ within the constraints. Thus, the input nodes involve solving an $M$-dimensional hyperplane. Once the decisions regarding 1 or 0 are formed at the input nodes, each node passes its decision to each of the neurons of the first layer. Let $I^{ij}_1, \ldots, I^{ij}_6$ be the inputs from the 6 input nodes. The neural network checks whether $\sum_{r=1}^{6} I^{ij}_r \ge 0$, which amounts to solving a 6-dimensional hyperplane.
Solving a hyperplane with a neural network is NP-hard [77]. Therefore, as both the proposed algorithms involve solving hyperplanes with a neural network, both of them are NP-hard.
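The network structure described above (7 inputs, hidden layers of 3 and 2 neurons) can be illustrated with a minimal forward-pass sketch. The single scalar output, the ReLU activation, the random weight initialization, and all function names are assumptions made for illustration only; the text does not specify them.

```python
import random


def relu(values):
    """Element-wise ReLU (activation choice is an assumption)."""
    return [max(0.0, v) for v in values]


def dense(weights, biases, inputs):
    """Affine layer: one weight row and one bias per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Random weights in [-1, 1] and zero biases (hypothetical initialization)."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_q_network(seed=0):
    """7 inputs -> 3 neurons -> 2 neurons -> 1 output (output size assumed)."""
    rng = random.Random(seed)
    sizes = [7, 3, 2, 1]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def q_value(network, state_bits, action):
    """Forward pass: 6 binary state inputs plus 1 action input give Q(s, a)."""
    if len(state_bits) != 6:
        raise ValueError("the state vector must have 6 binary entries")
    x = [float(b) for b in state_bits] + [float(action)]
    for index, (weights, biases) in enumerate(network):
        x = dense(weights, biases, x)
        if index < len(network) - 1:  # no activation on the output layer
            x = relu(x)
    return x[0]


network = build_q_network()
q = q_value(network, [1, 0, 1, 1, 0, 1], 0.5)
```

In a DQN setting, two such networks would be instantiated, one for the action-value function and one for the target action-value function.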

D. DISCUSSIONS ON THE BETTER PERFORMANCE OF DQN LEARNING OVER EXHAUSTIVE SEARCH METHOD
A reasonable question is why the proposed algorithms perform better than the Exhaustive search mechanism. As mentioned earlier, the Exhaustive search method used here involves a trellis based mechanism. For instance, consider the optimization of the parameters $\alpha_{ij}$, $B_{ij}$, and $P_i$, which leads to the calculation of the action variable $\gamma_{ij}$, as a traversal between its initial random value and its final optimal value. This process involves forming a trellis between the two points. The trellis consists of a certain number of levels, with the final level denoting the optimal value of $\gamma_{ij}$. Each level consists of a number of possible values for $\gamma_{ij}$. The main objective is to determine all the possible paths from the initial random value to the final optimal value, which involves working through the trellis from level 1 to the final level and calculating the number of paths at each level. Let there be $|R|$ trellis levels with $M$ points at each level. Let $Q(r, m)$ be the number of paths possible from level 1 to the point $m$ of the level $r$, where $1 \le m \le M$, as shown in Fig. 3. The total number of possible paths is $\sum_{m=1}^{M} Q(r, m)$, which comes out as $M^{|R|}$. The computational cost associated with each path is $\beta_0(|R| - 1)$, where $\beta_0$ is the average computational cost associated with any path segment in the trellis. The total complexity is therefore $\sum_{m=1}^{M} Q(r, m)\,\beta_0(|R| - 1)$. This makes the Exhaustive search method complex. As a moment-to-moment update is needed in the present work, only a limited time span is available for optimizing $\gamma_{ij}$. The Exhaustive search method is likely to fail to finish the optimization of $\gamma_{ij}$ in the available time span.
As the complexity of Exhaustive search is higher, it has to settle for a lower throughput within the time span designated for a moment-to-moment update. Reaching the throughput achieved with DQN learning requires much more time with Exhaustive search. Thus, its final throughput will be lower than that obtained with DQN learning.
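The path count and total cost of the trellis search discussed above can be checked with a short sketch. It assumes a fully connected trellis (every point of one level connects to every point of the next), which is the case in which the count equals M^|R|; the function names and the example values are hypothetical.

```python
def count_trellis_paths(num_levels, points_per_level):
    """Return [Q(r, 1), ..., Q(r, M)] at the final level r = num_levels.

    Q(r, m) is the number of paths from level 1 that end at point m of
    level r, assuming a fully connected trellis.
    """
    paths = [1] * points_per_level  # level 1: one path starts at each point
    for _ in range(num_levels - 1):
        reachable = sum(paths)      # every point of the next level sees all paths
        paths = [reachable] * points_per_level
    return paths


def exhaustive_search_cost(num_levels, points_per_level, beta0):
    """Total cost: number of paths times beta0 * (|R| - 1) per path."""
    total_paths = sum(count_trellis_paths(num_levels, points_per_level))
    return total_paths * beta0 * (num_levels - 1)


final_level = count_trellis_paths(4, 3)      # |R| = 4 levels, M = 3 points
total_paths = sum(final_level)               # equals 3 ** 4
total_cost = exhaustive_search_cost(4, 3, 1.0)
```

Even for this small example the number of paths grows exponentially in the number of levels, which is the source of the complexity argued above.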

E. DISCUSSIONS ON THE CONFLICTS OF INTERESTS AMONG THE ACCESS POINTS
Among the APs, there are conflicts of interest, and hence the action-value functions Q(s, a) of different APs are related to each other. The action-value function Q(s, a) mainly has two variables, s and a. The variable s signifies the state in which an AP-UE pair is present, while the variable a signifies the action taken by the CU for each AP and the UEs associated with it to receive the highest reward. The action-value functions of different APs are related to each other through s and a, as s and a include the conflicts of interest between the APs. The conflict of interest between the APs occurs in two major ways: interference with the signals from other APs, and load balancing. As mentioned earlier, the interference reduction is expressed in (25), which is written as a constraint for the maximization of r i . When the DQN algorithm runs for the maximization of r i , this constraint on interference is included in it. Thus, all the output results are produced with due consideration of this constraint. The second conflict of interest is load balancing: when the UEs get a high SINR from a particular AP, they try to associate with it. As a result, the load on this AP increases severely and it has to divide its bandwidth into smaller parts to allot spectrum to the associated UEs. Consequently, the effective achievable data rate decreases.
Generally, in achievable sum rate maximization with conventional optimization methods such as the maximum received SINR and maximum received power methods, the cost function is the final achievable sum rate of the system, $r = \sum_{i=1}^{N} r_i$. When the problem $\max r$ is studied, it may happen that a particular AP $i \in N$ lies close to many UEs and thus offers a high SINR. Consequently, a large number of UEs will be associated with this AP, causing the problem of load balancing. To address this issue, the most widely used mechanism is to instead maximize $\sum_{i=1}^{N} \log r_i$. This ensures that the final maximization is performed on $\log \prod_{i=1}^{N} r_i$, so that no $r_i$ remains small in magnitude, as a small $r_i$ would harm the final solution.
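The effect of replacing the sum rate by the sum of logarithms can be seen in a small numerical sketch. The two rate allocations below are hypothetical values chosen only to illustrate the point, and the function names are not from the original work.

```python
import math


def sum_rate(rates):
    """Conventional cost function: r = sum of r_i."""
    return sum(rates)


def log_utility(rates):
    """Load-balancing cost function: sum of log r_i = log of the product of r_i."""
    return sum(math.log(r) for r in rates)


# Hypothetical per-AP rates: one overloaded allocation and one balanced allocation.
unbalanced = [10.0, 0.1]
balanced = [5.0, 5.0]

# The plain sum rate slightly prefers the unbalanced allocation ...
prefers_unbalanced = sum_rate(unbalanced) > sum_rate(balanced)

# ... while the log objective clearly prefers the balanced one,
# because the very small rate 0.1 is heavily penalized by the logarithm.
prefers_balanced = log_utility(balanced) > log_utility(unbalanced)
```

The identity between the sum of logarithms and the logarithm of the product is what makes a single small r i costly for the whole objective.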
However, with the DQN learning based maximization technique, directly maximizing the final sum rate r is difficult, as the data rate at each AP or each UE needs to be maximized separately. The maximization is performed on each r i first, and then all the r i are summed up. This remains a limitation of our work.

VII. CONCLUSION
In this article, the joint optimization problem of bandwidth, power and association parameter allocation in the downlink of a hybrid RF/VLC system has been addressed. It is observed that the problem is neither concave nor convex. To overcome the limitations of conventional optimization algorithms in solving such a problem, a centralized DQN based learning algorithm has been designed, which learns from the hybrid RF/VLC environment. The state vector for DQN is formulated with the constraint terms of the optimization problem, while the action vector for DQN is based on the choice of bandwidth, power, and association parameter. The optimal policy is obtained with the help of an action-value function. To select the appropriate action for optimal policy formulation, the CU picks the appropriate values of bandwidth, power and association parameter from the action vector set. A transfer learning based mechanism for a UE newly entering the system has also been proposed, which uses the information already gathered in the network for the new entrant. Simulation results verify that the proposed learning based algorithm outperforms the exhaustive search algorithm, the received SINR based association and the received power based association algorithms by more than 10% in terms of achievable sum-rate and 14.28% in terms of the number of iterations needed for convergence. It is also found that the CU is successful in applying the transfer learning algorithm to use the already gathered information for the new incoming UE in the system, with the maximum achievable sum-rate reached in 54% fewer iterations.
It can be seen that the sign of $d$ is not fixed. It can become positive or negative for varying values of $B_{11}$, $B_{22}$, $P_1$ and $P_2$. Thus, for the VLC network, $r_i$ will be neither concave nor convex in $B_{11}$, $B_{22}$, $P_1$, and $P_2$. Next, the behavior of $r_i$ in the RF network is investigated through its achievable data rate. The system model consists of only one RF AP, indexed as $i = 0$. As mentioned earlier, as $\alpha_{0j}$ is an indicator function, $r_i$ will be neither concave nor convex in $\alpha_{0j}$. Further, the concavity of $r_i$ in $P_0$ and $B_{0j}$ has been investigated. For simplicity in calculations, let us consider $\alpha_{0j} = 1$, $|M| = 1$, $N^{r}_{0} = a$ and a vector $x = \{x_1, x_2\}$, where $x_1 = B_{01}$ and $x_2 = P_0 G_{0j}$. The achievable rate can be written as $r_i(x) = x_1 \log_2\left(1 + \frac{x_2}{a x_1}\right)$.
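As a sketch of the second-order calculation, assuming $x_1 > 0$, $x_2 > 0$ and $a > 0$, direct differentiation of the stated form of $r_i(x)$ gives the Hessian below. This is a derivation from the expression above under those positivity assumptions and may differ in presentation from the original equation.

```latex
\nabla_x^2 r_i(x)
  = -\frac{1}{\ln 2 \; x_1 \,(a x_1 + x_2)^2}
    \begin{bmatrix}
      x_2^2      & -x_1 x_2 \\
      -x_1 x_2   & x_1^2
    \end{bmatrix},
\qquad
\det \nabla_x^2 r_i(x) = 0 .
```

Both diagonal entries are negative and the determinant is zero, so the Hessian is negative semidefinite on the positive orthant, which is the condition for $r_i(x)$ to be concave there.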
It can be seen that both the diagonal elements of $\nabla_x^2 r_i(x)$ are negative. A larger $|M|$ results in a sum of such functions, and a sum of concave functions is a concave function. This means that for the RF network, $r_i$ will be jointly concave in $B_{0j}$ and $P_0$. However, it is neither concave nor convex in the indicator function $\alpha_{0j}$. Thus, jointly, it will be neither concave nor convex in $\alpha_{0j}$, $B_{0j}$ and $P_0$.