Optimizing Simultaneous Lightwave Information and Power Transfer Under Practical Indoor Mobility With Reinforcement Learning

This article investigates reinforcement learning (RL)-based solutions for optimizing resource allocation in simultaneous lightwave information and power transfer (SLIPT) under practical indoor mobility. To confront the excessive outages and intermittent channels posed by practical mobility, the reward used for agent training encodes the tradeoff between energy efficiency and communication quality. Accordingly, two typical RL categories, i.e., value-based tabular RL and policy gradient-based deep RL, are applied and compared through several numerical examinations regarding information-power transfer balance, generalization ability, and complexity. The vanilla tabular RL is demonstrated to outperform the gradient-based deep RL and should be prioritized in practice when feasible.


I. INTRODUCTION
With the rapid development of intelligent devices and the Internet of Things (IoT), concerns about both the insufficient radio spectrum and finite battery capacity have become increasingly prominent. Visible light communication (VLC) has been recognized as a complementary technology to radio frequency (RF) due to its unique advantages: a huge license-free spectrum and no electromagnetic interference [1]. Concurrently, simultaneous wireless information and power transfer (SWIPT) for energy harvesting (EH) is regarded as a disruptive technological paradigm to prolong the lifetime of energy-constrained wireless networks [2], [3].
More recently, VLC has been proven capable of EH in both free space [4] and underwater [5], empowering a new perspective on simultaneous lightwave information and power transfer (SLIPT) [6]. For instance, Wang et al. [7] investigated an indoor SLIPT system taking into account the tradeoff between the secrecy capacity and the harvested energy. Tang et al. [8] investigated joint subchannel assignment and power allocation to solve the mixed-integer fractional nonlinear programming problem in the energy efficiency optimization of SLIPT.

A. Challenges Posed by Indoor Mobility
However, when it comes to practical indoor scenarios for SLIPT, the underlying mobility of user equipment (UE) poses serious challenges to the conventional system model assumptions and the corresponding optimization methodology. The UEs in recent studies are assumed to be placed at fixed positions, without considering the mobility of the UE in the actual environment, or the impact of field-of-view (FoV) variability and burst occlusion. Under such assumptions the channel gains are ideal fixed values, and conventional optimization methods can pursue an optimal allocation strategy for this kind of problem [9], [10].
In fact, due to the sensitivity of light bands to mobility, the real channel gain under user mobility is highly uncertain, with a huge dynamic range that changes over time (e.g., the downlink channel gain data and associated statistics for four light access points (APs) shown in Fig. 1 [11]). Hence it is impractical and inaccurate to optimize the resource allocation of SLIPT along UE trajectories by traditional optimization methods. Furthermore, there are distinct variations among the trajectories of different UEs owing to the nature of human mobility [12]; thus it is not reasonable to unify the dynamic allocation strategy by conventional methods.

B. Why Deep RL vs. Tabular RL?
Recently, data-driven and machine learning methods have proven effective in tackling this kind of challenge [11]. Reinforcement learning (RL) has been widely applied in wireless communications as a tool for solving resource allocation problems in complex wireless environments [13].
The first category of RL is value-based methods. Q-learning, as one of the temporal-difference (TD) RL methods, is a typical tabular value-based RL technique. Vanilla Q-learning in [14] handles resource management in heterogeneous networks (HetNets). Due to the tabular nature of vanilla Q-learning, or temporal-difference RL in general, the computational cost of its training and implementation is much lower than that of its deep neural network-based counterpart.
Another major category features policy gradient methods empowered by deep reinforcement learning (DRL), which enable policy value approximation in a continuous state-action space. Especially when leveraging the policy gradient methodology [15], DRL handles the continuous dimensions of channel observation and transmission control in SLIPT smoothly [13]. However, which of the two main RL methods is more suitable for indoor mobile SLIPT optimization has not been investigated.

C. Challenges in Pursuing RL-Based SLIPT
However, the key to training DRL lies in training the underlying neural networks, and the implementation of DRL on hardware requires further consideration such as quantization and transfer learning. The sparsity of channel data in VLC due to frequent and abrupt outages, as depicted in Fig. 1, has been proven to impede the deep learning process owing to low sample efficiency and sample imbalance [16]. Therefore, it is necessary to investigate whether a DRL approach always outperforms a vanilla RL approach without deep learning in terms of performance and complexity.
Furthermore, data-driven methods require channel gain data representing the actual moving trajectories of UEs. In [12], the authors proposed a generation framework for UE-centric channel data subject to real-world human behavior characteristics. Such a mobility environment improves the practicality and accuracy of the RL training process, which in turn relaxes the dependence on policy adaptation and transfer toward a real-world implementation.
1) Contributions: To the best of our knowledge, resource allocation problems for indoor SLIPT networks under practical user mobility have not been studied yet, and whether DRL always transcends vanilla temporal-difference RL remains a question. This article is aimed at providing insights on addressing these challenges, with the following significant contributions:
- We propose a data-driven SLIPT resource allocation strategy that balances energy harvesting and information transmission. The energy efficiency and outage probability of the SLIPT system are jointly optimized to confront practical mobility. To achieve this, the overall optimization objective is translated into an elaborated immediate reward that maintains the appropriate learning motivation and stresses the tradeoff between energy efficiency and communication quality.
- Considering that the decision variables are quantized to allow for deployment on real hardware, two representative RL approaches are investigated: a value-based tabular TD learning algorithm, i.e., vanilla Q-learning, and a policy gradient-based deep actor-critic learning algorithm, i.e., DDPG. The numerical experimentation shows that both Q-learning and DDPG are superior to fixed strategies. Surprisingly, however, DDPG did not show a greater advantage than Q-learning in terms of either energy harvesting or outage probability, in spite of its higher complexity in training and implementation.
- The generalization ability is examined. Q-learning shows better generalization performance regarding energy harvesting and communication quality. Taking into account generalization capability and training and deployment costs, we recommend trying DRL cautiously and prioritizing Q-learning when possible.

II. SYSTEM MODEL AND PROBLEM FORMULATION
We consider an indoor VLC downlink system, assuming identical light-emitting diode (LED)-based APs, and formulate mobile user-centric SLIPT strategies.

A. System Model
1) UE Mobility and Orientation:
We refer to a practical mobility generation framework in [12], which divides realistic human mobility patterns into macro-scale and micro-scale aspects. The macro-scale aspect specifies the user's next destination and is modeled as a semi-Markov process based on bounded Lévy walks and return regularity. The micro-scale aspect reflects the shortest-path trajectory, steering behavior, and stochastic orientation of UEs. Such a mobility generator is embedded into the following three indoor scenarios as shown in Fig. 2.
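For intuition on the macro-scale component, the following minimal sketch draws bounded Lévy-walk flight lengths by inverse-CDF sampling of a bounded Pareto (truncated power-law) distribution; the exponent, bounds, and function name are illustrative assumptions rather than the exact generator settings of [12].

```python
import numpy as np

def bounded_levy_steps(n: int, alpha: float = 1.5,
                       l_min: float = 0.5, l_max: float = 8.0,
                       rng=np.random.default_rng()) -> np.ndarray:
    """Draw n flight lengths from a bounded Pareto (truncated power-law)
    distribution, a common way to emulate Levy-walk-like human mobility.
    alpha, l_min, l_max are illustrative values, not those used in [12]."""
    u = rng.random(n)
    lo, hi = l_min ** (-alpha), l_max ** (-alpha)
    # Inverse-CDF sampling of the bounded Pareto law on [l_min, l_max].
    return (lo - u * (lo - hi)) ** (-1.0 / alpha)
```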
2) Channel: The UE trajectories are then used to generate VLC channel data. We consider the blockage by opaque objects such as furniture and the user's body [16], [17] prior to calculating the specific channel gain of each link; both line-of-sight (LOS) and non-line-of-sight (NLOS) channels are considered [18], i.e., h_{t,τ} = Σ_{k≥0} h^{(k)}_{t,τ} (1), where h_{t,τ} is the overall time-varying impulse response, k is the index of reflections (k = 0 denoting the LOS component), and τ denotes the delay. In the validation section, such channel data will be captured by an open-source VLC platform [19] under indoor mobility.
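For reference, the following is a minimal sketch of the widely used Lambertian LOS gain that typically forms the zeroth-order (k = 0) term of such impulse responses [18]; the device parameters and the FoV value below are illustrative assumptions, not the settings used to generate the data in this article.

```python
import numpy as np

def los_gain(d, phi, psi, A_pd=1e-4, m=1.0, fov=np.radians(60),
             T_s=1.0, g_conc=1.0):
    """Lambertian LOS DC channel gain (k = 0 term).
    d: AP-PD distance [m], phi: irradiance angle [rad], psi: incidence
    angle [rad], A_pd: PD area [m^2], m: Lambertian order, fov: receiver
    field of view, T_s: optical filter gain, g_conc: concentrator gain.
    All numeric defaults are illustrative."""
    if psi > fov:
        return 0.0                      # outside the FoV: LOS link blocked
    return ((m + 1) * A_pd / (2 * np.pi * d ** 2)
            * np.cos(phi) ** m * T_s * g_conc * np.cos(psi))
```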
3) SLIPT Scheme: Let m(t) denote the modulated electrical signal corresponding to the bit stream from the information source. A DC bias B is added to m(t) to ensure that the resulting signal is non-negative. Assume that the LED driver works in a linear current manner. With P_LED representing the power per unit current of the LED, the transmitted signal can be described as x(t) = Σ_{n=1}^{N} P_LED [m(t) + B], where N is the number of APs. The input electrical signal of the LED must remain within its linear range; the amplitude of the AC signal is A = βA_max, where β represents the ratio of the AC-signal amplitude A to the maximum allowable amplitude A_max. The total channel gain from the APs to the user is h in (1). The output current of the PD can be expressed as i_out = I_DC(t) + I_AC(t) + n(t), where n(t) is additive white Gaussian noise (AWGN) generated by background shot noise and thermal noise.
In a unit time period, the equivalent SNR used for information processing at the receiving end after splitting is expressed as a function of the AC photocurrent and the noise power, where η denotes the PD responsivity. To maximize energy harvesting, we notice that the AC component of the photocurrent can also contribute through a time-splitting (TS) strategy [20]. Let α denote the TS splitting factor, i.e., the portion of a unit time (T = 1 s) that is allocated only for EH; in the remaining period 1 − α, the AC component is used solely for communication. The power generated at the receiving end is P = f α I_DC V_OC, where f is the fill factor and V_OC = V_t ln(1 + I_DC/I_o) is the open-circuit voltage, with V_t the thermal voltage and I_o the dark saturation current of the PD. The energy consumed by an AP in time T follows from the driving current, and, assuming the AC signal is sinusoidal, the total energy collected at a UE in time T can be obtained accordingly. The system structure can be found in Fig. 3. The control strategy for both transmitter and receiver is implemented at the AP side; the time splitter of the receiver is controlled through the downlink signal.
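To make the TS receiver model concrete, here is a minimal numeric sketch of the harvested power P = f α I_DC V_OC with V_OC = V_t ln(1 + I_DC/I_o); the fill factor, thermal voltage, and dark saturation current values are illustrative assumptions.

```python
import numpy as np

def harvested_power(alpha, I_dc, f_fill=0.75, V_t=25e-3, I_o=1e-9):
    """Power harvested from the DC photocurrent over the EH fraction alpha
    of a unit time slot: P = f * alpha * I_DC * V_OC, with the open-circuit
    voltage V_OC = V_t * ln(1 + I_DC / I_o). The device constants (fill
    factor, thermal voltage, dark saturation current) are illustrative."""
    v_oc = V_t * np.log(1.0 + I_dc / I_o)
    return f_fill * alpha * I_dc * v_oc

# Example: alpha = 0.9 and a DC photocurrent of 1 mA.
print(harvested_power(alpha=0.9, I_dc=1e-3))
```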

B. Problem Definition
We maximize the EH efficiency subject to a constraint on the communication outage probability, which is defined as P_out = Pr{γ < γ_th}, where γ_th is the SNR threshold for indicating outages. The parameters used in this section are listed in Table I.
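Since (6) and (7) are not reproduced here, the following is a plausible reading of the optimization problem consistent with the description above; the harvested/consumed energy symbols E_h, E_c and the outage tolerance ε are assumptions for illustration.

```latex
\begin{aligned}
\max_{\{\alpha_t,\,\beta_t\}} \quad
  & \eta_{\mathrm{EH}} \;=\; \frac{\sum_{t} E_{h}(\alpha_t,\beta_t,h_t)}
                                   {\sum_{t} E_{c}(\beta_t)} \\
\text{s.t.} \quad
  & \Pr\{\gamma_t < \gamma_{\mathrm{th}}\} \le \varepsilon, \\
  & 0 \le \alpha_t \le 1, \qquad 0 \le \beta_t \le 1 .
\end{aligned}
```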

III. RL-BASED OPTIMIZATION FOR SLIPT
We investigate the performance of two data-driven RL algorithms, i.e., vanilla Q-learning and DDPG, against practical mobility and varying environments.
1) State: The time slots, as a part of the state, reflect the temporal pattern hidden inside the practical mobility. The state space is described as S^(t,h) = S^(t) × S^(h), where S^(t) represents the time slots and S^(h) represents the channel gain component. For Q-learning, the channel gain in each time slot is divided into G intervals to form a discrete S^(h); as for DDPG, we directly utilize the channel gain value as S^(h), since DDPG handles a continuous state space.
2) Action: For vanilla Q-learning, the decision variables are quantized, i.e., α and β take U and V discrete levels, respectively. As for DDPG, the action space is continuous, A = {(α, β) : α, β ∈ [0, 1]}.
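As a concrete illustration, the following minimal sketch shows how the channel-gain state and the (α, β) actions could be discretized for the tabular agent; the bin edges, gain range, and the specific G, U, V values are assumptions for this sketch, not the exact quantizer used in the article.

```python
import numpy as np

# Illustrative discretization for the tabular Q-learning agent.
G = 8          # number of channel-gain intervals per time slot
U, V = 10, 10  # number of discrete levels for alpha and beta
M = 6          # number of time slots

GAIN_MIN, GAIN_MAX = 1e-6, 1e-4   # assumed dynamic range of channel gain

gain_bins = np.linspace(GAIN_MIN, GAIN_MAX, G + 1)[1:-1]  # interior edges
alpha_levels = np.linspace(0.0, 1.0, U)
beta_levels = np.linspace(0.0, 1.0, V)

def state_index(t_slot: int, h_gain: float) -> tuple:
    """Map (time slot, channel gain) to a discrete state (t, g)."""
    g = int(np.digitize(h_gain, gain_bins))   # 0 .. G-1
    return (t_slot % M, g)

def action_from_index(a_idx: int) -> tuple:
    """Map a flat action index 0 .. U*V-1 to a quantized (alpha, beta)."""
    return (alpha_levels[a_idx // V], beta_levels[a_idx % V])
```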
3) Reward: Formulating the reward function is essential to drive the RL agent toward the optimization objective; thus we devote effort to tuning the ratio between reward and penalty. Specifically, we need to relate the reward signal to the EH efficiency.
Since the magnitude of the overall EH efficiency is tiny due to path loss, as shown in Fig. 1, we first scale the EH efficiency by a factor of 10^4 ∼ 10^6. When the communication restriction (7) is met, i.e., γ ≥ γ_th, or i_out is used entirely for communication, i.e., γ reaches its maximum, we give the policy a positive reward. Otherwise, when γ < γ_th but γ_th could be met if i_out were used entirely for communication, i.e., γ < γ_th < γ_max, we feed back a negative reward.
To sum up, the reward is expressed as a piecewise function of these cases. To promote the agent toward the objective, we introduce a constant θ_r for the positive reward as in [14]; θ_r = 2 is sufficient for keeping a proper momentum in the training process.
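The following is a minimal sketch of the reward logic described above; the scaling constant, the treatment of the unmeetable-constraint case, and the function signature are assumptions for illustration and not the article's exact expression.

```python
def reward(eh_efficiency: float, gamma: float, gamma_max: float,
           gamma_th: float, theta_r: float = 2.0,
           scale: float = 1e5) -> float:
    """Illustrative reward: scale the tiny EH efficiency, reward the agent
    when the SNR constraint is met (or gamma already sits at its maximum),
    and penalize it when the constraint is violated but could have been met
    by using the full photocurrent for communication."""
    if gamma >= gamma_th or gamma >= gamma_max:
        return theta_r * scale * eh_efficiency        # positive reward
    elif gamma < gamma_th < gamma_max:
        return -scale * eh_efficiency                 # negative reward
    else:
        # Remaining case (constraint unmeetable even with full current for
        # communication) is not detailed in the text; treated as neutral.
        return 0.0
```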

B. Policy Value Estimation
The estimation of the policy value, also known as the Q-table, is the key to training Q-learning agents. To balance exploration and exploitation, an ε-greedy strategy is used for action selection during training. During training, the Q-value is updated following Q(s, a) ← Q(s, a) + ρ [r + ψ max_{a'} Q(s', a') − Q(s, a)] until Q(s, a) converges, where ρ denotes the learning rate, ψ the discount factor, and s' and a' correspond to the successors of s and a.
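A minimal sketch of this tabular update with ε-greedy action selection, assuming the flat state/action indexing implied by the earlier discretization sketch; the table sizes follow Section IV, while everything else is illustrative.

```python
import numpy as np

# Illustrative tabular Q-learning update with epsilon-greedy exploration.
M, G, U, V = 6, 8, 10, 10            # time slots, gain levels, action levels
rho, psi = 0.01, 0.8                  # learning rate and discount factor
Q = np.zeros((M * G, U * V))          # Q-table over flat state/action indices

def select_action(s: int, epsilon: float) -> int:
    """Epsilon-greedy selection over the flat action index."""
    if np.random.rand() < epsilon:
        return np.random.randint(U * V)          # explore
    return int(np.argmax(Q[s]))                  # exploit

def q_update(s: int, a: int, r: float, s_next: int) -> None:
    """Q(s,a) <- Q(s,a) + rho * [r + psi * max_a' Q(s',a') - Q(s,a)]."""
    td_target = r + psi * np.max(Q[s_next])
    Q[s, a] += rho * (td_target - Q[s, a])
```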
When it comes to DDPG, the actor is trained to find the policy with maximal value, while the critic is trained to value each policy. The critic network parameter θ_c is determined by solving, with a stochastic gradient descent (SGD) method based on target-network and replay-buffer techniques, the minimization problem min_{θ_c} E[(r + ψ Q(s', μ'(s'|θ_a^t)|θ_c^t) − Q(s, a|θ_c))^2], where μ'(·|θ_a^t) is the target actor network with parameter θ_a^t and θ_c^t denotes the parameter of the target critic network. The actor network parameter θ_a is then obtained by stochastic gradient ascent (SGA) along ∇_{θ_a} J ≈ E[∇_a Q(s, a|θ_c)|_{a=μ(s|θ_a)} ∇_{θ_a} μ(s|θ_a)]. After several training episodes, the target networks are updated using the latest parameters. However, different from Q-learning, the problem of exploration and exploitation is addressed by adding a random action process when filling the replay buffer.
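For comparison, the following is a compact PyTorch sketch of one DDPG update with target networks; the layer sizes, optimizer, and the soft target update shown here are illustrative (the article instead updates the target networks every episode with the latest parameters), and the batch tensors are assumed to come from a replay buffer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, action_dim = 2, 2          # (time, gain) and (alpha, beta)
psi = 0.9                              # discount factor (Sec. IV)

def mlp(inp, out, hidden=128, out_act=None):
    layers = [nn.Linear(inp, hidden), nn.ReLU(), nn.Linear(hidden, out)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

actor = mlp(state_dim, action_dim, out_act=nn.Sigmoid())    # outputs in [0,1]
critic = mlp(state_dim + action_dim, 1)
actor_t = mlp(state_dim, action_dim, out_act=nn.Sigmoid())  # target networks
critic_t = mlp(state_dim + action_dim, 1)
actor_t.load_state_dict(actor.state_dict())
critic_t.load_state_dict(critic.state_dict())
opt_a = torch.optim.Adam(actor.parameters(), lr=2e-5)
opt_c = torch.optim.Adam(critic.parameters(), lr=2e-5)

def ddpg_update(s, a, r, s_next, tau=0.01):
    """One SGD (critic) / SGA (actor) step on a sampled mini-batch.
    s, a, r, s_next: batched torch tensors from the replay buffer."""
    with torch.no_grad():                           # critic target
        q_next = critic_t(torch.cat([s_next, actor_t(s_next)], dim=1))
        y = r + psi * q_next
    critic_loss = F.mse_loss(critic(torch.cat([s, a], dim=1)), y)
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()  # SGA on Q
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()

    # Soft target update (one common option; a hard per-episode copy works too).
    for p, pt in zip(actor.parameters(), actor_t.parameters()):
        pt.data.mul_(1 - tau).add_(tau * p.data)
    for p, pt in zip(critic.parameters(), critic_t.parameters()):
        pt.data.mul_(1 - tau).add_(tau * p.data)
```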

C. Policy Deployment
After the Q-table in vanilla Q-learning or the actor μ(·|θ_a) in DDPG has converged, the output policy to be deployed is the greedy one, i.e., a* = argmax_a Q(s, a) for Q-learning and a* = μ(s|θ_a) for DDPG, which corresponds to the SLIPT optimization objective in (6).

IV. RESULTS AND DISCUSSIONS
After determining the hyper-parameters through exhaustive searches, the balance of energy and information transfer as well as the generalization ability are validated over a variety of scenarios under practical indoor mobility.

A. Training and Hardware Emulation Setups
For the hyper-parameters in Q-learning, an exhaustive search yields the optimal set ψ = 0.8, ρ = 0.01. For the ε-greedy strategy, we set the initial value ε = 1 and gradually decrease it exponentially. The number of time intervals is set to 6, and the number of channel gain levels is set to 8.
As for DDPG, we optimize the parameters through exhaustive search and finally set ψ = 0.9, ρ = 0.00002. The buffer depth is set to 15 episodes, and we update the target networks every episode. The actor has 3 fully-connected layers with 300, 300, and 20 neurons, respectively. The critic has 2 fully-connected layers with 200 and 10 neurons, respectively.
After training the agents of both Q-learning and DDPG, we deploy the agents into a low-cost controller based on reduced instruction set computing (RISC) for validating the real-time decision-making ability in hardware. The CPU of the controller is STM32F407 with ARM Cortex-M4 core running at 168 MHz. One channel of on-chip analog-to-digital converter (ADC) is activated at a sample rate of 100 kSPS to perform the channel gain observation. The action is output through the general-purpose input and output pins.

B. Transferring Trade-Off Between Energy and Information
Recall our optimization objective: on the issue of power allocation between information decoding and energy collection, the goal is to improve the EH efficiency as much as possible while ensuring communication. Q-learning, DDPG, and two fixed splitting strategies are compared. For the sake of result analysis, we normalize the emitted energy at the APs to 1 Joule so that the energy efficiency can be normalized.
1) Benchmark Setup: To examine whether our proposed RL-based SLIPT performs a trade-off between energy and information, we consider two fixed splitting strategies that are biased towards either information or energy transfer. We choose α = 0.9, β = 1 to maximize energy harvesting and α = 0.6, β = 0.9 to minimize the outage probability.
2) Training Strategy: To maximize the exploration within each episode, we take every 10 episodes as a trajectory group. Each group is traversed 10 times before moving on to the next group. To verify the effectiveness of the learned strategy, we apply it to another 30 episodes from non-training datasets, with key parameters collected by the OpenVLC platform under practical mobility [19].
The training of vanilla Q-learning converges as ε is reduced to zero exponentially. However, the convergence of DDPG training cannot be guaranteed [21], and it is partly impacted by how the exploration intensity anneals over time. We reduce the volatility parameter of the Ornstein-Uhlenbeck exploration noise in an exponential fashion. The resulting training progress of our proposed DDPG agent is shown in Fig. 4. We also tested the selection of θ_r to balance the demands of power transfer and communication quality. The training converges after about 3000 episodes. Since each trajectory group is traversed 10 times to enhance the utilization of trajectories, the training curve appears segmented, with noticeable changes occurring every few episodes. Abrupt changes in the training curves occur when the agent faces a new room layout, which has a different baseline of outages and channel gain. However, these changes do not impede the learning performance of the proposed RL agents.
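For completeness, a minimal sketch of Ornstein-Uhlenbeck exploration noise whose volatility decays exponentially, as described above; the θ, σ, and decay values are illustrative assumptions rather than the tuned settings of this article.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise with exponentially decaying
    volatility sigma; theta, sigma0, and decay are illustrative values."""
    def __init__(self, dim=2, theta=0.15, sigma0=0.2, decay=0.999):
        self.dim, self.theta = dim, theta
        self.sigma, self.decay = sigma0, decay
        self.x = np.zeros(dim)

    def sample(self, dt=1.0):
        # Mean-reverting random walk around zero, added to the actor's action.
        self.x += (self.theta * (-self.x) * dt
                   + self.sigma * np.sqrt(dt) * np.random.randn(self.dim))
        return self.x

    def anneal(self):
        self.sigma *= self.decay   # exponential reduction of the volatility
```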
3) Performance Comparisons: As demonstrated in Fig. 5(a), the EH efficiency of Q-learning approximates that of the fixed strategy with α = 0.9, β = 1, while the outage probability of Q-learning shown in Fig. 5(b) is nearly 10% lower than that of the same fixed strategy. On the other hand, the fixed strategy with α = 0.6, β = 0.9 yields a lower outage probability than all the other strategies, yet its EH efficiency is the lowest. This is because higher α and β lead to more EH, which raises the EH efficiency but hinders the communication quality in the meantime. DDPG manages to harvest the maximal energy, 12.8% more than Q-learning, whereas its outage probability is 1.3% higher than that of Q-learning on average. Apparently, both RL-based strategies have learned a balance between energy and information transfer, whereas the fixed strategies cannot adapt to such a dynamic mobile environment.
Fig. 4. Training progress in terms of (a) average EH in each episode and (b) outage probability of the DDPG agent. r_1, r_2, and r_3 correspond to θ_r = 1, θ_r = 2, and θ_r = 3. The trajectories from all three room layouts are mixed and leveraged for training; therefore the training curves sometimes rise or drop when encountering trajectories from another room.
From the perspective of the optimization objective in (6), DDPG gains more return but loses with respect to the constraint in (7) compared to Q-learning. Vanilla Q-learning appears better at the trade-off between EH and communications. However, one has to notice the considerable complexity of DDPG in actual implementation, which approaches the order of O(W) for the DDPG actor with a total number of neurons W. By contrast, the running-time cost of vanilla Q-learning approaches the order of O(MG + UV), which is much lower than O(W) under our setups. The hardware implementation and validation of these two agents confirm the low-complexity advantage of Q-learning. The on-chip Q-learning agent takes only 2.643 μs to make a decision, while the DDPG agent costs 73.810 μs. Both agents make decisions within a very short duration on the low-cost ARM CPU, much shorter than the coherence time of indoor mobility (∼100 ms [12]), such that the effectiveness of each decision can be ensured.

C. Generalization Ability Comparisons
Next, we compare the generalization abilities of the optimization strategies learned by vanilla Q-learning and DDPG over the three indoor layouts introduced in Fig. 2.
1) Data Preparation: We consider three types of data preparation strategies. Apart from training the agents for the three layouts using their respective data, we also train an agent solely on layout R1 to examine generalization. Besides, as a comparison, we mix the trajectory data generated from the three layouts in Fig. 2 for training with a ratio of 1:1:1, i.e., the training layout changes every 333 episodes as illustrated in Fig. 4. For the strategy validation shown in Fig. 6, we select 30 episodes for each layout and compare the generalization abilities in the three room layouts.
2) Performance Comparisons: It can be found from Fig. 6(a) that the difference in EH is trivial among the three data preparation methods. The Q-learning agent trained only on the R1 layout performs almost the same as the others, which shows the great generalization power of vanilla Q-learning for optimizing SLIPT.
The results in Fig. 6(b) confirm this generalization ability from the perspective of communication quality, i.e., the Q-learning agent trained in only one layout can generalize to other environments. However, the DDPG agent cannot exceed its tabular counterpart, as shown in Fig. 6(c) and (d). The differences in EH among its generalization performances are more pronounced than for Q-learning, since the DDPG agent trained with the respective data from the three layouts clearly transcends the other two methods.
When it comes to the outage probability, the agents trained with different data perform very differently. The vanilla Q-learning agent generalizes better than DDPG, since the generalization of deep learning is harder. In fact, when deep learning is involved in RL, the focus of the training strategy turns to maintaining the trainability of the deep neural networks; e.g., the experience replay and target network techniques are applied to keep the samples independently and identically distributed and to stabilize the training dynamics, respectively. Consequently, such complex and strict training conditions greatly reduce the generalization ability and stability of DDPG.
Taking into account the balancing ability, generalization, and complexity, we recommend not using DRL blindly and preferring Q-learning when feasible.

V. CONCLUSION
In this article, we investigate RL-based resource allocation strategies for SLIPT networks under practical indoor mobility for the first time and share our insight on whether DRL methods really outperform vanilla temporal-difference RL in practice. To address the challenges induced by practical mobility, the energy efficiency and communication quality of the SLIPT system are learned to be balanced through a dedicated reward feedback. To achieve such a data-driven solution, we establish a data generator of indoor mobile channel gains based on a semi-Markov process emphasizing truncated Lévy walks and return regularity, which represents the nature of practical human mobility. We investigate vanilla Q-learning and DDPG from the perspectives of information-power transfer balance, generalization ability, and complexity. Both of them outperform the fixed strategies significantly. However, although DDPG has higher complexity in training and implementation, it fails to keep a prominent advantage over vanilla Q-learning in these three perspectives. Finally, we conclude that DRL should be leaned on cautiously and Q-learning prioritized when possible.