A Novel On-Demand Charging Strategy Based on Swarm Reinforcement Learning in WRSNs

The charging problem in Wireless Rechargeable Sensor Networks (WRSNs) is a popular research topic. With wireless energy transfer technology, electrical energy can be transferred from Wireless Charging Equipment (WCE) to the sensor nodes, providing a new paradigm for prolonging the network lifetime. Existing research usually adopts periodic, deterministic charging approaches, but ignores the limited energy of the WCE and the influence of non-deterministic factors such as topological changes and node failures, which makes these approaches unsuitable for real networks. In this study, we aim to minimize the number of dead sensor nodes while maximizing the energy utilization of the WCE under its limited energy. Swarm Reinforcement Learning (SRL) is first introduced to give the WCE an autonomous planning ability. Moreover, to address the insufficient search of existing SRL algorithms, we improve SRL with the firefly algorithm and propose a novel charging algorithm, named Swarm Reinforcement Learning based on Firefly Algorithm (SRL-FA), for the on-demand charging architecture. To evaluate its performance, SRL-FA is compared with existing swarm reinforcement learning algorithms and classic on-demand charging algorithms in two network scenarios. Extensive simulations show that the proposed algorithm achieves promising performance in energy utilization of the WCE, charging success rate and other performance metrics.


I. INTRODUCTION
The associate editor coordinating the review of this manuscript and approving it for publication was Xingwang Li.
Wireless Sensor Networks (WSNs) are widely used in military applications, intelligent transportation, human health monitoring and so on [1]-[3]. These application scenarios require the WSN to work continuously. However, the network lifetime is restricted by the limited battery capacity of the sensor nodes, so the energy problem of sensor nodes has become a bottleneck in WSN research. Extensive research has been conducted to address this problem. Existing work can be divided into three categories, namely energy saving [4], energy harvesting [5] and Wireless Energy Transfer (WET) [6], [7]. Energy saving schemes extend the life of sensor nodes by reducing the energy consumption per unit of time or workload. Because the energy of the sensor nodes is still limited, this method cannot solve the problem fundamentally. Energy harvesting schemes restore energy from the environment (e.g., solar energy and wind energy). However, the strong dependence on the environment and the unpredictability of the amount of harvested energy make energy harvesting unreliable. The main idea of WET is to charge the sensor nodes using magnetic resonant coupling, which can provide a stable energy supply with controllable charging power. With the help of the promising WET technique, researchers have proposed the concept of Wireless Rechargeable Sensor Networks (WRSNs) [8], [9]. In WRSNs, the sensor nodes can be charged by the Wireless Charging Equipment (WCE).
Hence, WCE charging scheduling has become a prominent issue in WRSNs, and different perspectives on charging scheduling have been investigated, including path planning, system performance optimization and so on.
In the existing literature, charging strategies fall into two categories: periodic strategies and on-demand strategies. In the former, the WCE usually follows a fixed charging path to charge all the sensor nodes in the network [10]-[12]. However, due to interaction with the surrounding environment, the energy consumption rates of the sensor nodes have been shown to differ significantly [13]. The sensor nodes therefore have different energy requirements, and it is not necessary to charge all of them. Moreover, the energy consumption profiles of the sensor nodes are highly uncertain. This charging manner is therefore not suitable for the dynamic nature of WRSNs. In contrast, in on-demand strategies a sensor node sends a charging request when its energy falls below a known threshold. Upon receiving a request, the WCE inserts it into the charging list and then charges the sensor nodes according to the charging strategy. On-demand strategies are therefore more suitable for WRSNs.
In on-demand strategies, much information is unknown in advance, such as the number of sensor nodes to be charged, so determining the charging order is difficult. Most studies take a greedy approach: they set a charging priority for the sensor nodes and select the node with the highest priority at each step. Such a local, impromptu decision has low overhead (no global request information is necessary), but it usually does not yield a globally optimal solution. If the WCE can instead learn and adjust the charging path by interacting with the environment, it can charge more efficiently and obtain a better charging path that takes global information into account. Based on this idea, an on-demand charging strategy with autonomous planning for the WCE is studied.
Autonomous path planning for robots is an important branch of Reinforcement Learning (RL) applications. Moreover, RL has been verified to be effective in solving the charging path planning problem in WRSNs. Therefore, RL is considered for the problem in this study. Most reports use ordinary RL, in which only one agent learns to achieve the goal. Because the agent essentially learns through trial and error, ordinary RL takes much computation time to acquire a solution and searches inadequately for the optimal solution. To solve these problems, RL has been improved with swarm methods, yielding Swarm Reinforcement Learning (SRL). An SRL algorithm contains multiple agents, which learn through their respective experiences and the information exchanged among them. SRL algorithms have been recognized as able to rapidly find the global optimal solution. Therefore, SRL is introduced into this study.
The performance of an SRL algorithm highly depends on the method used to exchange Q-value information. The existing methods calculate the Q-value directly. They are therefore mostly used to solve continuous problems and have limitations on discrete problems. The Firefly Algorithm (FA) [14], [15] is an optimization method inspired by the movement behavior of fireflies. FA can solve discrete problems and is similar to the method used in PSO-Q (a kind of SRL). Therefore, to maintain the advantages and overcome the disadvantages of existing SRL algorithms, we improve the SRL algorithm with FA and name the result SRL-FA.
Most on-demand charging strategies design the charging path with a greedy method and cannot obtain the global optimal solution. Moreover, they ignore the limited energy of the WCE, so the resulting charging strategies are not practical.
To solve these problems, the WCE in this study independently designs a globally optimized charging path by interacting with the network. Furthermore, we aim to ensure the stability of the system while maximizing the energy utilization of the WCE under its limited energy. The main contributions of this paper are as follows.
1) The improved Swarm Reinforcement Learning is introduced into the on-demand charging problem. By interacting with the network, the WCE can independently select the sensor nodes to be charged and the charging path.
2) SRL is improved with the Firefly Algorithm (FA), and the result is named SRL-FA. The performance of SRL highly depends on the information exchanging method; existing methods are mostly used to solve continuous problems and have limitations on discrete problems. Therefore, the information exchanging method is redesigned.
The remainder of this study is organized as follows. Section II gives a brief overview of charging strategies in WRSNs and of reinforcement learning. Sections III and IV detail the system model, the problem statement and the learning model. Algorithm descriptions are given in Section V. Evaluations and comparisons are shown in Section VI, and we conclude this study in Section VII.

II. RELATED WORK
In this section, work related to this study is introduced, including on-demand charging strategies and reinforcement learning. The reinforcement learning part also covers the application of reinforcement learning in WRSNs.

A. ON-DEMAND CHARGING STRATEGIES
On-demand charging strategies can be divided into two categories. One focuses on the performance of the network, and the other improves the performance of both the network and the WCE. In the former category, the First Come First Serve (FCFS) algorithm is proposed in [16]. FCFS schedules the incoming charging requests based on their temporal property, which can lead to back-and-forth movement of the charger. To overcome this drawback, He et al. propose a charging strategy based on the Nearest-Job-Next with Preemption (NJNP) discipline [17], which increases throughput by always selecting the spatially closest requesting sensor node as the next charging node, but ignores sensor nodes in urgent need of charging. To balance the fairness of charging, Kaswan et al. [18] consider both the temporal and spatial priorities of the sensor nodes. They present a Linear Programming (LP) formulation for the WCE scheduling problem and a charging strategy based on the gravitational search algorithm to solve it. Zhu et al. [19] present a charging strategy that chooses as charging candidates the sensor nodes whose selection causes the fewest other requesting nodes to suffer energy depletion. Lin et al. [20] present a Primary and Passer-by Scheduling (P2S) algorithm for large-scale WRSNs. After choosing the sensor nodes to be charged, they use a local search algorithm to find surrounding sensor nodes and add them to the charging path. However, the above studies do not consider the limited energy of the WCE and therefore ignore its charging cost.
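The difference between the two classic selection rules can be made concrete with a minimal sketch. The request tuple layout and the function names below are illustrative assumptions, and the preemption step of NJNP is deliberately left out.

```python
import math

# Each request is an assumed tuple: (node_id, (x, y), request_time).

def fcfs_next(requests):
    """FCFS: serve the request that was issued earliest."""
    return min(requests, key=lambda req: req[2])[0]

def njnp_next(requests, charger_pos):
    """NJNP (without preemption): serve the spatially closest requesting node."""
    return min(requests, key=lambda req: math.dist(req[1], charger_pos))[0]

requests = [("n1", (100.0, 0.0), 5.0), ("n2", (10.0, 0.0), 9.0)]
print(fcfs_next(requests), njnp_next(requests, (0.0, 0.0)))
```

With these two requests, FCFS drives to the distant but earlier node, while NJNP picks the nearby but later one, which is the trade-off discussed above.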
To optimize the charging performance of the WCE at the same time, Fu et al. [21] consider different network parameters, such as the travel distance of the WCE and the energy received by the sensor nodes. They construct a set of nested TSP tours based on the energy consumption of the sensor nodes, and only sensor nodes with low remaining energy are involved in each charging round. With similar network parameter considerations, Zhao et al. [22] propose to jointly optimize the charging scheduling and the charging time allocation. Tomar et al. [23] propose a fuzzy-logic-based scheduling scheme to maximize the survival ratio and energy usage efficiency. Unlike the above charging modes, the WCE in [24] can charge the sensor nodes in a one-to-many manner. Rather than fully charging the sensor nodes, Xu et al. [25] propose a charging strategy that replenishes the sensor nodes with only partial energy. Both [24] and [25] design the charging path with a priority strategy that maximizes the sum of sensor lifetimes and minimizes the traveling distance of the WCE. Although the above reports can improve the performance of the WCE and the network, none of them consider the autonomy of the WCE.

B. REINFORCEMENT LEARNING (RL)
The main idea of RL is to gain experience through interaction between the agent and the environment [26], [27]. As shown in Fig. 1, there are three representations. Firstly, the state represents the decision-making factors under consideration, as observed by the agent. Secondly, the action represents the action selected by the agent, which may change or affect the state and the reward. Thirdly, the reward represents the gain or loss in network performance for taking an action in a particular state.
It is assumed that every state update of the agent is a time step. At any time step $t$, the agent observes the state $x(t)$, learns the long-term reward of each state-action pair, and decides on and carries out an appropriate action $a(t)$ on the environment in a trial-and-error manner. The agent then reaches the next state $x(t+1)$ and receives the reward $r(t)$. Next, the agent updates the Q-value of this state-action pair according to Eq. (1) [28]. This operation is repeated until the agent reaches the final state.
$$Q(x(t), a(t)) \leftarrow Q(x(t), a(t)) + \psi \left( r(t) + \gamma \max_{a} Q(x(t+1), a) - Q(x(t), a(t)) \right) \tag{1}$$
where $\psi$ is the update factor, and $\gamma$ ($0 < \gamma < 1$) is the discount factor. In recent years, RL has been widely used in path planning, especially in robot path planning. The robot, treated as the agent, has its own "brain" to plan a path in an environment [29]. To improve the autonomous path planning ability of the WCE in WRSNs, Wei et al. [30] propose a novel charging strategy called CSRL. CSRL uses Simulated Annealing (SA) to select the action and original RL to obtain the charging path for all the sensor nodes in the network. However, original RL learns with a single agent, which may cause slow convergence. To solve this problem, Iima and Kuroe [31] propose Swarm Reinforcement Learning (SRL), in which multiple agents learn not only through their respective experiences but also by exchanging Q-values among them. SRL has been recognized as able to find an optimal solution more rapidly than original RL. Therefore, we consider using SRL in this study.
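As an illustration of the Q-value update referred to as Eq. (1), the following is a minimal tabular sketch. It assumes the standard Q-learning rule with update factor ψ (`psi`) and discount factor γ (`gamma`); the dictionary-based table, the state labels and the default parameter values are illustrative choices, not taken from the paper.

```python
from collections import defaultdict

def q_update(Q, x, a, r, x_next, next_actions, psi=0.1, gamma=0.9):
    """One tabular step: Q(x,a) += psi * (r + gamma * max_b Q(x',b) - Q(x,a))."""
    # A final state has no further actions, so its future value is taken as 0.
    best_next = max((Q[(x_next, b)] for b in next_actions), default=0.0)
    Q[(x, a)] += psi * (r + gamma * best_next - Q[(x, a)])
    return Q[(x, a)]

Q = defaultdict(float)  # unseen state-action pairs start at 0 (an assumption)
q_update(Q, x="s0", a=1, r=1.0, x_next="s1", next_actions=[0, 1])
print(Q[("s0", 1)])
```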
To adapt to the changeable network environment and to improve the autonomous planning capability of the WCE in on-demand charging strategies, SRL is introduced into this study. To overcome the drawback that existing SRL often falls into a local optimum, the firefly algorithm is used to improve SRL.

III. SYSTEM MODEL AND PROBLEM FORMULATION

A. NETWORK MODEL
As shown in Fig. 2, a WRSN consists of $N$ sensor nodes, a Charging Service Station (CS) and a WCE, which are deployed over a 2-D monitored area. The set of sensor nodes is denoted as $V_s = \{n_1, n_2, \cdots, n_i, \cdots, n_N\}$. The positions of all sensor nodes are fixed, and the sensor nodes are powered by the same type of battery with capacity $E_{max}$. The WCE charges sensor nodes one-to-one. The energy available for charging the sensor nodes and the energy available for driving are $E_c^{max}$ and $E_d^{max}$ respectively. When the remaining energy of the WCE is insufficient, it returns to the CS for energy replenishment. The stay of the WCE at the CS is called a vacation, and the time spent at the CS is the vacation period.
The symbols used in this study are shown in TABLE 1.

B. CHARGING MODEL
A sensor node dies if its energy falls below $E_{min}$. To prolong its survival time, $n_i$ sends a charging request $RM_i = \langle n_i, p_i, t_{r,i} \rangle$ to the WCE when its energy is below a threshold $R$. $RM_i$ contains the time point $t_{r,i}$ at which the request is issued, the sensor node ID $n_i$ and its energy consumption rate $p_i$. The WCE accepts the charging requests and stores them in the order of $t_{r,i}$. When the vacation period ends, the WCE accepts $M$ charging requests and puts the corresponding sensor nodes into $V_c$, then designs a charging path for the sensor nodes in $V_c$.
After designing the charging path, the WCE sets off from the CS to charge the sensor nodes and then goes back to the CS. This period is defined as a charging round.
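The request handling described above can be sketched as a small priority queue keyed on the request time. The class and function names are hypothetical, and treating M as an upper bound on the requests taken per round is an assumption made for the sketch.

```python
from dataclasses import dataclass, field
import heapq

@dataclass(order=True)
class ChargingRequest:
    t_r: float                         # request time t_{r,i}; the only ordering key
    node_id: int = field(compare=False)
    p: float = field(compare=False)    # energy consumption rate p_i

def maybe_request(node_id, energy, p, now, threshold_R):
    """Return a request RM_i if the node's energy is below R, else None."""
    return ChargingRequest(now, node_id, p) if energy < threshold_R else None

def collect_round(pending, M):
    """Pop up to M requests in order of t_r to form V_c for the next round."""
    return [heapq.heappop(pending).node_id for _ in range(min(M, len(pending)))]

pending = []
for node_id, energy, now in [(7, 12.0, 5.0), (3, 8.0, 1.0), (9, 40.0, 2.0)]:
    request = maybe_request(node_id, energy, p=0.1, now=now, threshold_R=20.0)
    if request is not None:
        heapq.heappush(pending, request)
print(collect_round(pending, M=2))
```

Node 9 never enters the queue because its energy is above the threshold; the other two are served in the order their requests were issued.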

C. PROBLEM FORMULATION
The problem in this study is to determine which sensor nodes will be charged and the charging strategy of the WCE, so as to improve both the number of sensor nodes that are charged and the charging efficiency of the WCE under its limited energy.
Firstly, to measure the charging efficiency of the WCE, the concept of WCE energy utilization $\eta$ is introduced. We assume the charging path is $W = (\pi_0, \pi_1, \cdots, \pi_L, \pi_0)$, where $\pi_0$ represents the CS and $L$ is the number of sensor nodes in the charging path. Then $\eta$ can be calculated as in Eq. (2).
$EU_c$ is the energy used by the WCE to charge the sensor nodes in a charging round, and $EU_d$ is the driving energy used by the WCE. $EU_c$ and $EU_d$ satisfy Eq. (3) and Eq. (4).
The WCE should ensure that the sensor node $\pi_q$ is charged before its energy falls below $E_{min}$. Let $\rho$ be the charging loss rate and $U$ the charging power of the WCE; this requirement is expressed by Eq. (5). Each sensor node can be charged at most once in a charging round. Assume the duration of a charging round is $T$. The energy of a sensor node in a charging round should satisfy Eq. (6).
To ensure that the WCE can reach the sensor node $\pi_q$ and return to the CS after charging it, the driving energy of the WCE should meet Eq. (7).
Here $v$ is the driving speed and $\mu$ is the driving energy consumption rate of the WCE, $l_{j,j+1}$ is the distance between $\pi_j$ and $\pi_{j+1}$, and $\pi_0$ represents the CS.
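The reach-and-return requirement and the energy utilization measure can be illustrated with a short sketch. Two assumptions are made here because the corresponding equations are not reproduced in this text: the driving energy is taken as μ times the driving time (distance divided by v), and η is taken as the share of all spent energy that went into charging. All function names are hypothetical.

```python
import math

def drive_energy(a, b, v, mu):
    """Energy to drive from a to b, assuming mu is energy per unit time at speed v."""
    return mu * math.dist(a, b) / v

def can_visit(pos, node, cs, ed_remaining, v, mu):
    """True if the WCE can reach `node` from `pos` and still return to the CS."""
    need = drive_energy(pos, node, v, mu) + drive_energy(node, cs, v, mu)
    return need <= ed_remaining

def energy_utilization(eu_c, eu_d):
    """Assumed form of eta: charging energy divided by total energy spent."""
    total = eu_c + eu_d
    return eu_c / total if total > 0 else 0.0

print(can_visit((0, 0), (3, 4), (0, 0), ed_remaining=10.0, v=1.0, mu=1.0))
print(energy_utilization(eu_c=3.0, eu_d=1.0))
```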
Then, the On-demand Charging Planning Problem (OCPP) can be formulated as follows.

IV. LEARNING MODEL OF THE WCE
As described above, the On-demand Charging Planning Problem (OCPP) in this study is to plan a charging path for the WCE in WRSNs, with the objective of optimizing the performance of both the network and the WCE. OCPP is similar to the mobile robot path planning problem, which searches for an optimal collision-free path from a start point to an end point. Solving the mobile robot path planning problem is an important branch of Reinforcement Learning (RL) applications: with the help of RL, the mobile robot has a "brain" and can learn the path autonomously. In the OCPP, the nodes to be charged and the charging path in each charging round are uncertain. Hence, the WCE can charge more efficiently and the charging strategy becomes more flexible if the WCE can autonomously learn and adjust the charging path. Therefore, RL is introduced to solve the OCPP in this study.
The relationship between RL and OCPP is as follows: the WRSN is considered as the environment in RL; the WCE is considered as the agent; the state of the WRSN and the WCE is considered as the state; and the move of the WCE to the next sensor node to be charged is considered as the action. Therefore, the learning model in WRSNs can be represented by a triple $\langle X, A, R \rangle$. $X$ is the state space, which represents the state of the WRSN and the WCE. $A$ is the action space, which represents the action set of the WCE. $R$ is the reward generated by the actions of the WCE.

A. STATE MODEL
From one state to another, the location of the WCE, the remaining travel energy of the WCE, and the energy of the sensor nodes in the network change. Therefore, the definition of the state space considers both the state of the WCE and the state of the sensor nodes in the network. The state space is defined as a two-tuple $X = \langle X_{WCE}, X_{network} \rangle$.
$X_{network}$ describes the state of the sensor nodes in the network, where $\vec{e} = (e_1, e_2, \cdots, e_i, \cdots, e_N)$ is the current energy state of the network.
$\vec{d} = (d_1, \cdots, d_M)$ is the distance between the WCE and the sensor nodes. $\vec{trd} = (trd_1, \cdots, trd_m, \cdots, trd_M)$ holds the flags of the sensor nodes in $V_c$. If the sensor node numbered $m$ in $V_c$ has been traversed by the WCE, the flag $trd_m$ is set to 1. Otherwise, the value of $trd_m$ is 0.

B. ACTION MODEL
For the WCE, selecting an action means determining the next sensor node to be charged. We assume that all sensor nodes in the network are reachable. Because a sensor node can be charged at most once in a charging round, the WCE can only select a sensor node in $V_c$ whose flag is 0. In this paper, the action space is defined as $A = \{a \mid a \in N_c\}$, where $a = m$ means that the next sensor node to be charged is the sensor node numbered $m$ in $V_c$.
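The restriction of the action set by the traversal flags can be sketched in a few lines. The helper names are hypothetical, and node numbers are taken as 1-based positions in $V_c$.

```python
def available_actions(trd):
    """Numbers m (1-based, as in V_c) of the nodes whose flag trd_m is still 0."""
    return [m for m, flag in enumerate(trd, start=1) if flag == 0]

def mark_charged(trd, m):
    """Return a new flag vector with node m marked as traversed."""
    updated = list(trd)
    updated[m - 1] = 1
    return updated

trd = [0, 1, 0]
print(available_actions(trd))                   # nodes 1 and 3 are still selectable
print(available_actions(mark_charged(trd, 3)))  # only node 1 remains
```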

C. STATE TRANSITION PROCESS
It is assumed that every state update of the agent is a time step. At time step $k$, the WCE stays at state $x_k$, selects action $a_k$ according to SA, and then reaches state $x_{k+1}$. At time step $k+1$, the sensor node numbered $x_{k+1}$ in $V_c$ has been fully charged. Assume that the duration between time step $k$ and time step $k+1$ is $t_k$. Next, we discuss the state at time step $k+1$.
For the sensor nodes, the remaining energy $e_i^{k+1}$ of $n_i$ can be calculated by Eq. (8).
$t_k$ consists of two parts, as shown in Eq. (9): 1) the driving time from the sensor node numbered $x_k$ to the sensor node numbered $x_{k+1}$ in $V_c$; 2) the charging time for the sensor node numbered $x_{k+1}$ in $V_c$. $U$ is the charging power.
For the WCE, at time step $k+1$, $ec_m$ and $ed_m$ can be calculated by Eq. (10).
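A minimal sketch of one state transition is given below, under stated assumptions: the step duration is the driving time plus the time to charge the selected node to $E_{max}$ at power $U$, every other node drains linearly at its rate $p_i$, and charging loss and consumption during charging are ignored. These simplifications are ours and may differ from the exact expressions in Eqs. (8)-(10).

```python
import math

def step(energies, rates, positions, wce_pos, m, v, U, E_max):
    """Advance one time step: drive to node m, charge it fully, drain the rest."""
    drive_time = math.dist(wce_pos, positions[m]) / v
    # Node m keeps consuming energy while the WCE is on its way.
    e_arrival = max(energies[m] - rates[m] * drive_time, 0.0)
    charge_time = (E_max - e_arrival) / U
    t_k = drive_time + charge_time
    new_energies = {
        i: E_max if i == m else max(e - rates[i] * t_k, 0.0)
        for i, e in energies.items()
    }
    return new_energies, t_k

energies = {1: 50.0, 2: 80.0}
rates = {1: 1.0, 2: 0.5}
positions = {1: (10.0, 0.0), 2: (0.0, 10.0)}
new_energies, t_k = step(energies, rates, positions, (0.0, 0.0), m=1, v=5.0, U=10.0, E_max=100.0)
print(new_energies, t_k)
```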

D. REWARD MODEL
In RL, the agent learns from the reward value, so the design of the reward function is especially important: it determines whether the algorithm converges and how fast. The problem in this study is to maximize the energy utilization of the WCE in a charging round. Therefore, after performing an action $a_k$, the reward $r_k$ of $a_k$ should consider two aspects: 1) maximizing $EU_c$; 2) minimizing $EU_d$. The reward function is given in Eq. (14). In Eq. (14), $0 < \alpha_1, \alpha_2 < 1$ and $\alpha_1 + \alpha_2 = 1$; $\alpha_1$ and $\alpha_2$ represent the proportions of the two factors, and $K$ represents the unit energy value. The higher $ec_m$ ($m = x_k$) is, the higher $r_k$ is. In addition, the lower $ed_m$ ($m = x_k$) is, the higher $r_k$ is.
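To make the two monotonicity properties concrete, here is one simple reward that satisfies them. The linear form below is an assumption chosen for illustration only; it is not claimed to be the paper's Eq. (14).

```python
def reward(ec, ed, alpha1, K):
    """Illustrative reward: grows with charging energy ec, shrinks with driving energy ed.

    alpha1 weights the charging term and alpha2 = 1 - alpha1 the driving term;
    K is the unit energy value used for normalisation.
    """
    if not 0.0 < alpha1 < 1.0:
        raise ValueError("alpha1 must lie strictly between 0 and 1")
    alpha2 = 1.0 - alpha1
    return (alpha1 * ec - alpha2 * ed) / K

print(reward(ec=40.0, ed=10.0, alpha1=0.7, K=10.0))
```

Raising `alpha1` makes the agent favour nodes that need more energy, while lowering it makes short driving legs more attractive.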

V. PROPOSED ALGORITHM: SRL-FA
To overcome the shortcoming of original reinforcement learning, namely too much invalid learning, swarm reinforcement learning (SRL) is introduced. In SRL, multiple agents learn simultaneously. However, existing SRL often falls into a local optimum. Meanwhile, for optimization problems, population-based methods such as the Firefly Algorithm (FA) have been recognized as able to find optimal solutions rapidly. Therefore, in this section, a Swarm Reinforcement Learning based on FA (SRL-FA) is proposed.
In SRL-FA, all agents learn concurrently in two stages: Individual Learning the Charging Path and Learning through Exchanging Information. The improvement of SRL in this study lies in the latter stage. The two stages are discussed in detail in subsection A and subsection B. The learning framework of the WCE is shown in Fig. 3, where $Y$ is the number of iterations.

A. INDIVIDUAL LEARNING THE CHARGING PATH
In this stage, each agent learns individually by using ordinary RL. For the agent $ag_i$, the learning process in an iteration is as follows. Simulated Annealing (SA) is used to select the action. After the selection, the WCE judges whether it can reach the selected node and return to the CS. If it can, the WCE adds this action to the charging path, calculates the reward of this action and updates the Q-value $Q_{ag}^{i}$ according to Eq. (15). Otherwise, the WCE goes back to the CS.
$$\tilde{Q}_{ag}^{i}(x_k, a_k) = Q_{ag}^{i}(x_k, a_k) + \psi \left( r_k + \gamma \max_{a} Q_{ag}^{i}(x_{k+1}, a) - Q_{ag}^{i}(x_k, a_k) \right) \tag{15}$$
where $\psi$ is the update factor, and $\gamma$ ($0 < \gamma < 1$) is the discount factor. $Q_{ag}^{i}$ is the learned experience of $ag_i$, and $\tilde{Q}_{ag}^{i}$ is the new Q-value in this iteration.
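The SA-based action selection mentioned above is not spelled out in this section. A common Metropolis-style rule, assumed here purely for illustration, compares a random candidate with the greedy action and accepts the candidate with a temperature-dependent probability.

```python
import math
import random

def sa_select(q_values, temperature, rng=random):
    """Pick an action from q_values (dict: action -> Q) with a Metropolis-style rule.

    A random candidate replaces the greedy action with probability
    exp((Q_candidate - Q_greedy) / temperature), which shrinks as the
    temperature is lowered over the iterations.
    """
    greedy = max(q_values, key=q_values.get)
    candidate = rng.choice(list(q_values))
    accept_prob = math.exp((q_values[candidate] - q_values[greedy]) / temperature)
    return candidate if rng.random() < accept_prob else greedy

rng = random.Random(0)
print(sa_select({1: 0.2, 2: 0.9, 3: 0.5}, temperature=1.0, rng=rng))
```

At a high temperature the choice is close to uniform exploration; as the temperature approaches zero the rule degenerates into greedy selection.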
After an iteration ends, the WCE evaluates the charging path $W_{ag}^{i}$ obtained by $ag_i$ in this iteration $y$. The evaluation indicator $V_{ag}^{i}$ can be calculated through Eq. (16). Then $Q_{ag}^{i}$ is updated as in Eq. (17). Based on the above, the Individual Learning Algorithm is shown as Algorithm 1.

Algorithm 1 Individual Learning Algorithm

B. LEARNING THROUGH EXCHANGING INFORMATION
In this stage, each agent learns by updating its Q-value with reference to the other agents. The Q-value update method is important because it determines the performance of the algorithm. The existing SRL algorithms update the Q-value directly, which is only suitable for continuous problems. However, the charging path planning problem in this study is a discrete problem, so this stage should be improved. Suppose the agents have learned independently for $Y$ iterations. The improvement idea is then shown in Fig. 4. $Q_{best}$ is the best Q-value of all the agents at the previous iteration only. $G_{best}$ is the best Q-value found by all the agents so far. $P_{ag}^{i}$ is the best Q-value found by the agent $ag_i$ so far. $V$ is a so-called velocity. $\omega$, $C_1$ and $C_2$ are weight parameters. $R_1$ and $R_2$ are uniform random numbers in the range from 0 to 1.
As Fig. 4 shows, the improvement idea is to update the Q-value by updating the path of the agent. Among the existing SRL algorithms, PSO-Q can update the path. PSO-Q is the combination of PSO and the SRL algorithm. However, PSO easily falls into a local optimum. The Firefly Algorithm (FA) is similar to PSO and performs better than PSO. Therefore, to maintain the advantages of PSO combined with SRL and to improve SRL, FA is introduced into the Q-value update method, and the Q-value Update Algorithm based on FA is proposed. The details of the algorithm are described in the following sections.
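For comparison, the direct Q-value exchange used by PSO-style SRL can be sketched with the symbols introduced above (velocity V, weights ω, C1, C2 and random numbers R1, R2). The sketch assumes the standard PSO velocity rule applied entry by entry to a Q-table; the dictionary layout and default weights are illustrative.

```python
import random

def pso_q_update(Q, V, P_best, G_best, omega=0.7, c1=1.5, c2=1.5, rng=random):
    """Entrywise PSO-style update of one agent's Q-table.

    Q, P_best and G_best are dicts keyed by (state, action); V holds the
    velocity of each entry and is filled in on first use.
    """
    for key in Q:
        r1, r2 = rng.random(), rng.random()
        V[key] = (omega * V.get(key, 0.0)
                  + c1 * r1 * (P_best[key] - Q[key])
                  + c2 * r2 * (G_best[key] - Q[key]))
        Q[key] += V[key]
    return Q, V

Q = {("s0", 1): 0.0}
V = {}
pso_q_update(Q, V, P_best={("s0", 1): 1.0}, G_best={("s0", 1): 1.0}, rng=random.Random(0))
print(Q)
```

Because this rule moves real-valued Q entries continuously, it illustrates why a direct update fits continuous problems better than a discrete path planning problem.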

1) RELATIONSHIP BETWEEN FIREFLY ALGORITHM AND OUR STUDY
The Firefly Algorithm (FA) regards the value of the objective function as the absolute brightness of a firefly. All fireflies are regarded as hermaphroditic, so any firefly can be attracted to any other firefly. The main idea of FA is therefore that a less bright firefly moves towards the brightest one within its visible distance range. If there is no firefly brighter than a particular firefly, it moves randomly.
This study considers updating the Q-value by transforming the charging path of each agent. Therefore, we regard the charging path $W_{ag}^{i}$ as the solution of the firefly $ffy_i$, and the corresponding fitness value $\eta(W_{ag}^{i})$ as the absolute brightness $F_{ffy}^{i}$ of $ffy_i$. The Q-value is then updated as follows: for each firefly $ffy_i$, find the firefly $ffy_{max}$ with the highest fitness value within its visible distance range. If $ffy_{max}$ exists, move $ffy_i$ toward $ffy_{max}$. Otherwise, $ffy_i$ moves randomly. Then calculate the Q-value corresponding to the new charging path.
The solution of a firefly in this study is discrete, and because the WCE has limited energy, it goes back to the CS when it would otherwise use up its energy. The dimension of the charging path may therefore vary between learning processes, and the distance cannot simply be calculated as a Euclidean distance. Instead, we define the distance between any two fireflies as the number of different arcs between them. To be specific, assume the solutions of $ffy_i$ and $ffy_j$ are as shown in Fig. 5. The green solid lines are the arcs shared by $ffy_i$ and $ffy_j$, and the black dotted lines are the arcs that differ between them. Therefore, the distance between them is 4.
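The arc-based distance can be sketched as a set comparison. The counting convention is an assumption: arcs are treated as directed, and an arc counts as different if it appears in exactly one of the two paths, so the numbers produced here need not match the convention used in Fig. 5.

```python
def arcs(path):
    """Directed arcs of a closed charging path given as [pi_0, pi_1, ..., pi_L, pi_0]."""
    return set(zip(path, path[1:]))

def firefly_distance(path_i, path_j):
    """Number of arcs that appear in exactly one of the two paths."""
    return len(arcs(path_i) ^ arcs(path_j))

# Node 0 stands for the CS; the two paths may have different lengths.
print(firefly_distance([0, 1, 2, 3, 0], [0, 1, 3, 2, 0]))
print(firefly_distance([0, 1, 2, 0], [0, 1, 2, 3, 0]))
```

Because the measure works on sets of arcs, it remains defined when two agents charge a different number of nodes in a round.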

2) THE WAY OF FIREFLY MOVING
In this section, we introduce the way the fireflies move. There are two situations. Let $D_i$ be the visible distance range of $ffy_i$, and take firefly $ffy_i$ as the example.
Situation 1: No firefly within $D_i$ has a fitness value larger than that of $ffy_i$. In this situation, $ffy_i$ moves randomly. To be specific, assume the solution of $ffy_i$ is as shown in Fig. 6(a). The random movement process of $ffy_i$ and the resulting new solution $ffy'_i$ are shown in Fig. 6.
Situation 2: There is a firefly $ffy_{max}$ with the highest fitness value within $D_i$. In this situation, if $ffy_i$ moves directly to $ffy_{max}$, it easily falls into a local optimum. To enhance the global search capability, we accept the randomly generated new firefly $ffy'_i$ with a certain probability $P$. The probability $P$ can be calculated by Eq. (18).
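The choice between the two situations, including the probabilistic acceptance in Situation 2, can be sketched as follows. The acceptance probability `P` is passed in as a parameter because Eq. (18) is not reproduced here, and `distance` and `random_move` are hypothetical callables supplied by the caller.

```python
import random

def move_firefly(i, paths, fitness, distance, D_i, P, random_move, rng=random):
    """Return the new charging path for firefly i.

    paths: list of charging paths; fitness: list of their fitness values;
    distance: callable(path_a, path_b); random_move: callable(path) -> new path.
    """
    brighter = [k for k in range(len(paths))
                if k != i
                and distance(paths[i], paths[k]) < D_i
                and fitness[k] > fitness[i]]
    if not brighter:
        return random_move(paths[i])      # Situation 1: nothing brighter in range
    if rng.random() > P:
        best = max(brighter, key=lambda k: fitness[k])
        return list(paths[best])          # move directly to the brightest firefly
    return random_move(paths[i])          # accept a randomly generated firefly

paths = [[0, 1, 2, 0], [0, 2, 1, 0]]
fitness = [0.3, 0.8]
new_path = move_firefly(0, paths, fitness, distance=lambda a, b: 1, D_i=5, P=0.2,
                        random_move=lambda p: list(reversed(p)), rng=random.Random(1))
print(new_path)
```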
Based on the above, there are two ways of moving in this situation. The rand() function randomly generates a real number in the range [0, 1]. The details are as follows.
The Way of Firefly Moving 2: If rand() > $P$, $ffy_i$ moves directly to $ffy_{max}$; that is, the solution of $ffy_{max}$ is assigned to $ffy_i$. The update formula is shown in Eq. (19).
The Way of Firefly Moving 3: If rand() ≤ $P$, $ffy_i$ is updated by $ffy'_i$; that is, the solution of $ffy'_i$ is assigned to $ffy_i$. The update formula is shown in Eq. (20).
Inspired by the idea of the Firefly Algorithm, the Q-value Update Algorithm based on FA is shown as Algorithm 2.

Algorithm 2 Q-value Update Algorithm based on FA
    count ← 0, k ← 1;
    while k ≤ na do
        Calculate the distance d_{i,k} between fly_i and fly_k;
        if d_{i,k} < D_i then
            if F_fly^i < F_fly^k then
                count ← count + 1;
                Save the number k and its corresponding path;
            end if
        end if
        k ← k + 1;
    end while
    if count = 0 then
        Move fly_i around randomly;
    else
        Calculate P according to Eq. (18);
        if P < rand() then
            Find the one with the highest fitness value in the saved path and move fly_i to it;
        else
            Update fly_i with the randomly generated new firefly;
        end if
    end if

In this section, experimental simulations are carried out to demonstrate the advantages of the proposed algorithm, Swarm Reinforcement Learning based on Firefly Algorithm (SRL-FA). We compare our algorithm with other reinforcement learning algorithms and charging scheduling algorithms in Performance Comparison. Moreover, we investigate the impact of several important parameters on the algorithm performance in Parameters Analysis.

A. SIMULATION ENVIRONMENT
As listed in TABLE 2, 100 to 200 sensor nodes are deployed in a 2000m × 2000m square. To analyze the performance of SRL-FA in different kinds of networks, the sensor nodes in this study are distributed in two ways, randomly and uniformly. The corresponding networks are named C1 and C2 respectively. Uniform distribution means that the spacing between sensor nodes is the same. The other parameters listed in TABLE 2 have been chosen mostly based on [30].
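The two deployment scenarios can be generated with a short sketch. The grid construction used for the uniform case (a square grid truncated to the first n points) is an assumption, since the exact layout is not specified here.

```python
import math
import random

def random_deployment(n, side=2000.0, rng=random):
    """Scenario C1: n nodes placed uniformly at random in a side x side square."""
    return [(rng.uniform(0, side), rng.uniform(0, side)) for _ in range(n)]

def uniform_deployment(n, side=2000.0):
    """Scenario C2: nodes on an equally spaced square grid; keeps the first n grid points."""
    per_row = math.ceil(math.sqrt(n))
    spacing = side / per_row
    points = [((c + 0.5) * spacing, (r + 0.5) * spacing)
              for r in range(per_row) for c in range(per_row)]
    return points[:n]

c1 = random_deployment(200, rng=random.Random(0))
c2 = uniform_deployment(100)
print(len(c1), len(c2), c2[0])
```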

B. PERFORMANCE COMPARISON
In this section, the Swarm Reinforcement Learning based on Firefly Algorithm (SRL-FA) is compared with other Reinforcement Learning (RL) algorithms and two classic charging algorithms, FCFS and NJNP, under the on-demand charging architecture to analyze its performance.

FIGURE 7. Performance comparison between SRL-FA and CSRL [30] in terms of (a)(d) the number of charged sensor nodes, (b)(e) total driving distance and (c)(f) energy utilization in two network scenarios respectively.

1) COMPARISONS WITH RL ALGORITHMS
Firstly, we analyze the performance of the redesigned reinforcement learning algorithm in this study. Under the same two network settings, where 200 sensor nodes are deployed randomly (C1) and uniformly (C2) respectively, we compare SRL-FA with 1) the original RL algorithm with a single agent [30] and 2) the three existing Swarm Reinforcement Learning (SRL) algorithms named BEST-Q, AVG-Q and PSO-Q [31], [32]. Different performance metrics are considered, including the energy utilization $\eta$ of the WCE, the number $nc$ of sensor nodes that have been charged and the driving distance of the WCE. The results are shown below and are averages over 30 runs. The red lines in Fig. 7 and Fig. 8 represent our algorithm.
In Fig. 7, we compare SRL-FA with CSRL. The figure shows that the optimization accuracy of SRL-FA is better than that of CSRL. CSRL is based on original RL, so only one agent explores. SRL-FA is based on SRL, so multiple agents explore and, moreover, learn by exchanging information, which increases the ability to explore. As shown in TABLE 3, in the two network scenarios, 1) the energy utilization obtained by SRL-FA is 19% and 7% higher than that of CSRL; 2) the number of charged sensor nodes obtained by SRL-FA is 19% and 8% higher than that of CSRL respectively. The result therefore confirms that SRL-FA based on SRL is superior to CSRL based on original RL.

FIGURE 8. Performance comparison between SRL-FA and the existing SRL algorithms [31], [32] in terms of (a)(d) the number of charged sensor nodes, (b)(e) total driving distance and (c)(f) energy utilization in two network scenarios respectively.

SRL-FA is an improvement of the SRL algorithm. To verify its superiority, we compare SRL-FA with the existing SRL algorithms BEST-Q, AVG-Q and PSO-Q. From Fig. 8, it can be observed that SRL-FA performs better than the other three algorithms. Moreover, as listed in TABLE 3, in the two network scenarios, 1) the energy utilization obtained by SRL-FA is 17% and 12% higher than that of BEST-Q, 15% and 11% higher than that of AVG-Q, and 2% and 7% higher than that of PSO-Q respectively; 2) the number of charged sensor nodes obtained by SRL-FA is 16% and 7% higher than that of BEST-Q, 13% and 7% higher than that of AVG-Q, and 6% and 3% higher than that of PSO-Q respectively. Because the performance of an SRL algorithm highly depends on the method of exchanging information, the result confirms that SRL-FA is well designed.

2) COMPARISONS WITH ON-DEMAND CHARGING SCHEDULING ALGORITHMS
To evaluate SRL-FA against on-demand charging scheduling algorithms, we compare it with two classic on-demand charging scheduling algorithms, FCFS and NJNP, in the two network scenarios.
As demonstrated in Fig.9(b)(e), the energy utilization of SRL-FA is always higher than that of FCFS and NJNP. This is because we use TSP solutions to formulate the charging path, which allows the path to be optimized globally. FCFS schedules the incoming charging requests based on their temporal property and ignores the driving distance, so it has the lowest energy utilization. Although NJNP overcomes this drawback of FCFS, it always selects the nearest sensor node and ignores the residual energy of the sensor node. Consequently, the charging energy delivered under NJNP may not be high, which results in lower energy utilization. Next, Fig.9(c)(f) compares the charging success rate, which is defined as the ratio of the number of sensor nodes that have been successfully charged to the number of sensor nodes that sent a charging request. As the figure shows, SRL-FA performs well. Because the energy of the WCE is limited, some charging requests are not served, but the sensor nodes whose requests are not served are not necessarily dead. Thus the charging success rate does not reflect the survival rate of the sensor nodes. We therefore also compare the number of dead sensor nodes under the three algorithms to evaluate system stability; the result is shown in Fig.9(a)(d). It can be observed that, as the number of sensor nodes grows, the charging success rate decreases and the number of dead sensor nodes increases. This is because the number of charging requests increases with the number of sensor nodes while the energy of the WCE is limited, so the WCE cannot serve such a large number of sensor nodes. Nevertheless, the simulation results show that SRL-FA performs better than FCFS and NJNP.
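The difference between the two baselines and the definition of the charging success rate can be illustrated with a small sketch. The code below is a simplified illustration, not the implementation of [16] or [17]: the nearest-first rule omits the preemption step of NJNP, and all function names, node identifiers and coordinates are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def fcfs_order(requests: List[Tuple[float, str]]) -> List[str]:
    """Serve requests by arrival time only; driving distance is ignored."""
    return [node for _, node in sorted(requests)]


def nearest_first_order(start: Point, positions: Dict[str, Point],
                        pending: List[str]) -> List[str]:
    """Greedy nearest-node-next rule (NJNP-like, without preemption)."""
    order, current, remaining = [], start, list(pending)
    while remaining:
        nxt = min(remaining, key=lambda n: math.dist(current, positions[n]))
        order.append(nxt)
        remaining.remove(nxt)
        current = positions[nxt]
    return order


def tour_length(start: Point, positions: Dict[str, Point],
                order: List[str]) -> float:
    """Total driving distance when visiting the nodes in the given order."""
    total, current = 0.0, start
    for node in order:
        total += math.dist(current, positions[node])
        current = positions[node]
    return total


def charging_success_rate(num_charged: int, num_requests: int) -> float:
    """Ratio of successfully charged nodes to nodes that sent a request."""
    return num_charged / num_requests if num_requests else 0.0


# Hypothetical example: three requests arriving at times 0, 1 and 2.
positions = {"a": (10.0, 0.0), "b": (1.0, 0.0), "c": (5.0, 0.0)}
requests = [(0.0, "a"), (1.0, "b"), (2.0, "c")]
fcfs = fcfs_order(requests)
nearest = nearest_first_order((0.0, 0.0), positions, ["a", "b", "c"])
```

In this toy instance the arrival-time order drives 23 m while the nearest-first order drives 10 m, which illustrates why ignoring driving distance costs energy. Neither rule looks at residual energy, which is the limitation of NJNP noted above.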

C. PARAMETERS ANALYSIS
In this section, we study the impact of different parameters, namely the number of agents, the speed of the WCE and the update factor, on the performance of SRL-FA. The network scale is fixed at 200 sensor nodes in a 2000 m × 2000 m field.

[Figure caption (fragment): ... [16] and NJNP [17] in terms of (a)(d) the number of dead sensor nodes, (b)(e) energy utilization and (c)(f) charging success rate in the two network scenarios, respectively.]

1) IMPACT OF THE NUMBER OF AGENTS
SRL-FA is based on SRL, so the number of agents may influence the performance of the algorithm. We therefore study its impact by varying it from 1 to 6. Fig.10 shows that, as the number of agents grows, the energy utilization and the number of charged sensor nodes increase and the driving distance decreases. The reason is that multiple agents learn simultaneously, which makes the exploration more thorough. However, when the number of agents exceeds 4, the improvement is no longer obvious, which implies that the performance of SRL-FA is near optimal.

2) IMPACT OF THE SPEED OF WCE
An important factor that determines the mobile charger's ability to perform charging tasks is its driving speed v. We explore the performance of SRL-FA while varying v from 5 to 10 m/s. The results are shown in Fig.11. It can be clearly observed that SRL-FA performs better as the speed increases, because the WCE can accomplish charging faster at a higher speed. However, the driving energy consumption of the WCE is related to its speed: the higher the speed, the higher the consumption. Therefore, as shown in Fig.11, when the speed of the WCE exceeds 8 m/s, there is no obvious improvement in performance.

3) IMPACT OF THE VALUE OF UPDATE FACTOR
According to Eq. (15), ψ is the update factor, which balances the new Q-value against the learned Q-value during the learning process. If ψ is set to 1, the learned Q-value has no effect during the learning process.
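The role of ψ can be illustrated with a minimal sketch. Eq. (15) is not reproduced in this section, so the convex blend below is an assumption that is merely consistent with the statement that ψ = 1 removes the influence of the learned Q-value; the function name and arguments are hypothetical.

```python
def blend_q(learned_q: float, new_q: float, psi: float) -> float:
    """Assumed form of the update: (1 - psi) * learned_q + psi * new_q.

    psi = 0 keeps the learned Q-value unchanged; psi = 1 discards it
    and keeps only the new Q-value.
    """
    if not 0.0 <= psi <= 1.0:
        raise ValueError("psi is expected to lie in [0, 1]")
    return (1.0 - psi) * learned_q + psi * new_q
```

Under this assumed form, a small ψ makes each update a cautious step toward the new Q-value, whereas a large ψ lets the most recent value dominate what has been learned so far.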
To study the impact of the update factor on the performance of SRL-FA, we vary ψ from 0.1 to 0.9; the result is shown in Fig.12. There is a significant improvement in the number of charged sensor nodes and in the energy utilization between ψ = 0.1 and ψ = 0.2. As ψ increases further, the performance of the algorithm degrades. Therefore, the algorithm performs best when ψ = 0.2.

VII. CONCLUSION AND FUTURE WORK
In this study, an on-demand charging algorithm based on Swarm Reinforcement Learning, named SRL-FA, is proposed. By applying a reinforcement learning algorithm, SRL-FA helps the WCE achieve autonomous path planning. Moreover, SRL-FA jointly considers the performance of the WCE with limited energy and the response to charging requests. Therefore, SRL-FA can improve the performance of both the WCE and the sensor network.
A large number of experiments are conducted to verify the performance of SRL-FA, which is compared with existing swarm reinforcement learning algorithms and classic on-demand charging algorithms. The simulation results demonstrate that SRL-FA is well designed and can effectively prolong the network lifespan and improve the WCE's energy utilization under the limited energy of the WCE. We further analyze how parameters such as the number of agents, the speed of the WCE and the update factor affect SRL-FA.
In the future, we plan to extend this work by using multiple WCEs and by considering dynamic energy consumption of the sensor nodes. This may require more cooperation among the WCEs to address more practical problems in WRSNs.