Adaptive Exploration Strategy With Multi-Attribute Decision-Making for Reinforcement Learning

Reinforcement learning (RL) agents often hit a performance bottleneck when they face the dilemma of exploration and exploitation. In this study, an adaptive exploration strategy with multi-attribute decision-making is proposed to address the trade-off between exploration and exploitation. First, the proposed method decomposes a complex task into several sub-tasks and trains each sub-task individually with the same training method. Then, it uses a multi-attribute decision-making method to develop an action policy that integrates the training results of these sub-tasks. Allowing multiple learners to learn in parallel offers a practical way to improve learning performance. The adaptive exploration strategy determines the probability of exploration from the information entropy, which replaces the laborious work of empirical tuning. Finally, transfer learning extends the applicability of the proposed method. Experiments on robot migration, robot confrontation, and a real wheeled mobile robot demonstrate the effectiveness and practicality of the proposed method.


I. INTRODUCTION
A. REINFORCEMENT LEARNING
Reinforcement learning is a type of machine learning in which agents learn to perform tasks through trial and error [1]. Agents gain experience by interacting with the environment continually and ultimately acquire the optimal strategy, which guides them to obtain the greatest cumulative reward. RL methods require agents to actively explore the unknown environment and to receive feedback from the environment on each action taken [2], [3]. Agents use this positive or negative feedback to acquire the experience they need to optimize the policy for a task. Recently, multi-learner parallel learning approaches, such as the A3C algorithm [4], have helped agents exceed human-level performance in some video games [5], [6].
The associate editor coordinating the review of this manuscript and approving it for publication was Lefei Zhang.

B. THE DILEMMA OF THE EXPLORATION AND EXPLOITATION
The trade-off between exploration and exploitation has always been a dilemma without a unified solution in reinforcement learning systems [7]. The exploration strategy guides the learning agent to explore more of the unknown environment by collecting new experience. An appropriate exploration method must decide when to collect new experience and when to exploit current experience so as to obtain the greatest cumulative reward [8]. Conversely, it is inappropriate for agents to exploit for a long period when the current experience is inaccurate or inadequate. Therefore, it is crucial to develop an appropriate action policy for the exploration scheme, because it affects both the convergence rate of the RL algorithm and the cumulative reward that agents can receive.
Previous researchers usually used probabilistic exploration methods, such as epsilon-greedy, to tackle this problem [9], [10]. A disadvantage of epsilon-greedy is that the learning time grows exponentially with the scale of the state space [11]. Meanwhile, a fixed value of ε keeps the probability of selecting each action the same throughout the learning process, which leads to a lot of ineffective exploration and can even trap the agent in a local optimum. The softmax policy is another common exploration method [12], [13]. Previous researchers proposed using the Boltzmann distribution and simulated annealing (SA) to balance the conflicting requirements of exploration and exploitation [14]. Exploration guided by intrinsic motivation has also been studied extensively [15]. Besides intrinsic motivation, Thompson sampling, bootstrapped models [16], and parameter space exploration [17] can be used as mechanisms to guide exploration.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/

C. MULTI-ATTRIBUTE DECISION-MAKING METHODS
Multi-attribute decision-making (MADM) problems arise when decision-makers must weigh multiple factors to make a sound decision [18]. These factors can be regarded as attributes and used as criteria to evaluate each scheme, and the decision-maker adopts the scheme with the highest evaluation as the best one; this is the process of MADM. To rank the schemes, many related factors must be taken into account, and these are usually measured or evaluated as numerical values with certain units based on prior experience [19]. The ordered weighted averaging (OWA) operator is a common and effective method for MADM problems [20]. In this work, the available actions act as the scheme set, and the learning experience gained by multiple learners serves as the prior experience for the source task. The value function learned by each learner contributes, as an evaluation factor, to the final decision-making result, that is, the action that the agent will take. MADM thus provides a way to exploit the prior experience gathered from these sub-tasks.

D. RESEARCH GAPS
For the dilemma of exploration and exploitation, prior works provide good exploration but do not exploit the particular structure of the task itself. However, agents need to learn many tasks, not just one, and prior experience can inform how exploration should be performed in new scenarios. Complex tasks often require a lot of learning time. Moreover, the classical training method collects experience less effectively than the parallel training method with multiple learners [21], [22]. Therefore, how a learning agent should balance exploring for new experience against exploiting current experience is an important issue. For MADM, conventional aggregation operators weight attributes empirically, which introduces subjective factors into the decision-making results. Moreover, evaluation values are often inaccurate because they are calculated using these unreasonable weights.

E. RESEARCH METHODS
In this study, we fill the gaps mentioned above for the trade-off between exploration and exploitation in three ways: designing an effective exploration strategy, expanding the prior experience available for exploitation, and designing an appropriate action policy that uses the current knowledge.
First, this study uses an adaptive exploration strategy that determines the value of ε from the information entropy [23], which encourages agents to explore more in the early stage of learning and exploit more in the later stage. Generally, an action policy should require agents to make rational use of the current learning experience. Second, this paper proposes a method that collects learning experience using a divide-and-conquer approach. The proposed method decomposes the source task into several sub-tasks and exploits the prior experience from the sub-tasks, which are all trained by the same temporal difference (TD) method [24]. Unlike the original single-learner method, multi-learner parallel learning expands the source of prior experience and improves the performance of learning algorithms. Third, this study proposes an action policy using MADM, which regards the actions available to an agent as schemes and regards the state-action value functions obtained from the sub-tasks as attributes of these schemes. The MADM method calculates the evaluation value of each scheme, and the action with the maximum evaluation value is, in most cases, the one that the agent takes. The average of the standardized rewards of the sub-tasks acts as the current reward for the source task, and this average is used to update the action value function iteratively.
To fill the gaps mentioned above for MADM, this study presents a MADM method with a support function, which weights attributes using visibility graph theory [25]. This method considers the influence of both the values and the positional relationship of attributes on the decision-making results, which removes the subjective factors of empirical weighting.
Transfer learning extends the learning model to different task scenarios and accelerates the learning process for the new task. Intermediate tasks bring more prior experience to agents to reduce the difficulty of executing the target task, which differs greatly from the source task.

F. CONTRIBUTIONS IN THIS WORK
The main contributions of this paper are as follows.
1) For a complex task, this study decomposes the source task into several sub-tasks using a divide-and-conquer approach, and these sub-tasks are then trained separately by the same training method. This divide-and-conquer method not only expands the source of prior experience but also accelerates learning through multi-learner parallel training.
2) An action policy is developed using a MADM method, and this action policy determines an action from the acquired knowledge. The MADM method integrates a support function to calculate the weights of attributes, which eliminates the subjective factors of empirical methods and increases the accuracy of the results for a MADM problem.
3) To tackle the dilemma of exploration and exploitation, this study first develops an adaptive exploration strategy with a MADM. The adaptive exploration strategy uses information entropy to determine the threshold for epsilon-greedy. The experimental results show that the proposed method outperforms its competitors.

G. STRUCTURE OF THIS PAPER
The remainder of this paper is organized as follows. Section II presents the background for the proposed methods and briefly describes Q-learning, the softmax policy, and MADM. Section III presents a new MADM method using a support function, which is used to develop the action policy. Section IV presents a training method for sub-tasks and a standardization method for the rewards obtained by each learner. Section V presents an adaptive exploration strategy and an action policy using the proposed MADM method.
Experiments that demonstrate the effectiveness of the proposed methods are reported in Section VI. Section VII details a future improvement of the adaptive exploration strategy using transfer learning. Conclusions are drawn in the last section.

II. BACKGROUND
A. Q-LEARNING ALGORITHM
Q-learning is a model-free reinforcement learning algorithm proposed by Watkins in 1989 [26]. It is commonly used because it is simple and converges quickly. The Q value is given by Eq. (1), and the rule for updating Q values using the TD error is given by Eq. (2), where α is the learning rate, whose range is (0, 1). The learning rate reflects the efficiency of the learning process. A round of learning terminates when the agent reaches the target state. The agent then returns to the initial state and starts the next round, and this repeats until the end of the whole learning process, at which point the optimal strategy is obtained.
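As an illustration of the update rule, the following Python sketch applies one tabular Q-learning step. It assumes the standard Watkins form, Q(s, a) ← Q(s, a) + α[r + γ max Q(s', a') − Q(s, a)]; the discount factor `gamma`, the function name, and the array layout are assumptions made for this example and are not specified in the text above.

```python
import numpy as np


def q_learning_update(q_table, state, action, reward, next_state,
                      alpha=0.1, gamma=0.9, terminal=False):
    """Apply one tabular Q-learning step in place and return the TD error.

    ASSUMPTION: standard Watkins update; gamma is a hypothetical discount factor.
    """
    # A terminal next state contributes no future value.
    best_next = 0.0 if terminal else float(np.max(q_table[next_state]))
    td_error = reward + gamma * best_next - q_table[state, action]
    q_table[state, action] += alpha * td_error
    return td_error


# Usage: 3 states, 2 actions, all Q values initialised to zero.
q = np.zeros((3, 2))
delta = q_learning_update(q, state=0, action=1, reward=1.0, next_state=2)
```

The returned TD error is the quantity that the later sections use as the fluctuation value when adapting the exploration probability.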

B. SOFTMAX POLICY
The softmax policy is an action policy that is commonly used in exploration schemes to tackle the dilemma of exploration and exploitation [9]. The agent uses this method to select an action according to the average reward of each action, and the action $a_t$ with the highest average reward is the one most likely to be selected. The simulated annealing (SA) algorithm [27] optimizes the softmax policy by controlling the randomness of actions. The probability for each action is given by Eq. (3), where $P_i$ represents the probability of selecting action $a_i$ and $K$ is the total number of available actions. With simulated annealing, the probability of selecting action $a_i$ is given by Eq. (4), where $T_t$ is the temperature parameter. The temperature parameter is tuned using Eq. (5), where η is the annealing factor and its range is 0 ≤ η ≤ 1.
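The softmax policy with simulated annealing can be sketched as follows. The sketch assumes the usual Boltzmann form, P_i = exp(Q_i / T) / Σ_k exp(Q_k / T), and a geometric cooling rule T ← η·T; both forms, the temperature floor, and the function names are assumptions for illustration rather than the exact Eqs. (3) to (5).

```python
import numpy as np


def boltzmann_probabilities(values, temperature):
    """Return selection probabilities for each action value at a given temperature."""
    values = np.asarray(values, dtype=float)
    # Subtracting the maximum leaves the probabilities unchanged and avoids overflow.
    scaled = (values - values.max()) / temperature
    weights = np.exp(scaled)
    return weights / weights.sum()


def anneal(temperature, eta, floor=1e-3):
    """Cool the temperature by the annealing factor eta (ASSUMED geometric schedule)."""
    return max(floor, eta * temperature)


# Usage: a high temperature gives near-uniform choices, a low one is near-greedy.
hot = boltzmann_probabilities([1.0, 2.0, 3.0], temperature=100.0)
cold = boltzmann_probabilities([1.0, 2.0, 3.0], temperature=0.1)
```

Lowering the temperature over time shifts the policy from exploration toward exploitation, which is the behavior the annealing factor controls.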

C. MULTI-ATTRIBUTE DECISION-MAKING
In general, candidate schemes have many predefined attributes, which affect the decision-making result. MADM measures each attribute and gives an evaluation value for each scheme. Aggregation operators are commonly used to solve MADM problems because they calculate the evaluation value of each scheme effectively. The ordered weighted averaging (OWA) operator is a simple but effective information aggregation operator, and it weights each attribute according to its significance [28]. A set of original data $(b_1, b_2, \ldots, b_m)$ is sorted from large to small to obtain an ordered sequence $(\tilde{b}_1, \tilde{b}_2, \ldots, \tilde{b}_m)$. The OWA operator is given by Eq. (6).
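A minimal Python sketch of the OWA operator follows. It uses the standard definition, in which the values are sorted from large to small and the j-th weight is applied to the j-th largest value; the weight vector in the usage line is a hypothetical example, not one taken from the paper.

```python
def owa(values, weights):
    """Ordered weighted average: weights apply to rank positions, not to sources."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    ordered = sorted(values, reverse=True)  # from large to small
    return sum(w * b for w, b in zip(weights, ordered))


# Usage with hypothetical weights that emphasise the largest value.
score = owa([0.2, 0.9, 0.5], [0.5, 0.3, 0.2])
```

Because the values are sorted before weighting, the result does not depend on the order in which the attributes are supplied.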

III. A MULTI-ATTRIBUTE DECISION-MAKING METHOD WITH A SUPPORT FUNCTION
A. SUPPORT FUNCTION
Graph techniques have been used to address machine learning problems such as sparse feature extraction and dimensionality reduction, and have been shown to perform well [29], [30]. After the attribute values are ordered, visibility graph theory converts the values in the data sequence into nodes in a complex network (CN) [31]. Each node in the CN corresponds to exactly one value in the data sequence. The degree of support is the degree of correlation between values in the data sequence, which is reflected by the connections between nodes in the CN. The more connections a node has, the higher the support its value receives.
The definition of a visibility graph is as follows.
Definition 1: Two data in the data sequence are represented by the two-tuples $(i, b_i)$ and $(j, b_j)$. The two data are correlated if every datum $(k, b_k)$ between them satisfies Eq. (7).
If two values in the data sequence satisfy Eq. (7), the corresponding two nodes are connected in the complex network. The degree of a node is defined as the number of edges that connect it with other nodes in the CN. In general, the degree of a node is positively related to its importance, and the support function describes the degree of support for nodes. Following Coulomb's law, the support between nodes needs to consider both the values of the nodes and the distance between them [32].
Definition 2: Let the values of two nodes be $D_i$ and $D_j$, and let the distance between the two nodes be $dis_{ij} = D_i - D_j$. The support function between the two nodes is given by Eq. (8), where $n$ is a positive integer and $Supp(D_i, D_j)$ denotes the support of node $D_j$ for node $D_i$. The sum of the support that node $D_i$ receives from all nodes is given by Eq. (9).
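As an illustration of Definition 1, the following Python sketch builds the visibility connections of a short sequence. It assumes that Eq. (7) is the standard natural visibility criterion, b_k < b_j + (b_i − b_j)(j − k)/(j − i) for every k between i and j; this form and the function names are assumptions for illustration.

```python
def is_visible(series, i, j):
    """Return True if positions i and j can 'see' each other over the values between them."""
    if i == j:
        return False
    if i > j:
        i, j = j, i
    b_i, b_j = series[i], series[j]
    for k in range(i + 1, j):
        # Height of the straight line from (i, b_i) to (j, b_j) at position k.
        line_height = b_j + (b_i - b_j) * (j - k) / (j - i)
        if series[k] >= line_height:
            return False
    return True


def visibility_edges(series):
    """List every connected pair (i, j) with i < j."""
    n = len(series)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if is_visible(series, i, j)]


def node_degrees(series):
    """Degree of each node, i.e. its number of visibility connections."""
    degrees = [0] * len(series)
    for i, j in visibility_edges(series):
        degrees[i] += 1
        degrees[j] += 1
    return degrees
```

Adjacent values are always connected, so every node in a sequence of two or more values has a degree of at least one.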

B. THE ORDERED WEIGHTED AVERAGING OPERATOR WITH A SUPPORT FUNCTION
A set of data $(b_1, b_2, \ldots, b_n)$ ordered from large to small is represented by $D = \{D_1, D_2, \ldots, D_n\}$, where $n$ is the number of data in the ordered sequence. The ordered weighted averaging operator with a support function (VOWA) is given by Eq. (10). If $(\tilde{b}_1, \tilde{b}_2, \ldots, \tilde{b}_n)$ is any permutation of a data sequence $(b_1, b_2, \ldots, b_n)$, the evaluation values of the VOWA operator for the two sequences are equal, as shown in Eq. (11).
The weight of the datum $D_i$ is then given by Eq. (12), where $Sum(D_i)$ is the sum of the support for node $D_i$, taken over all nodes that satisfy Eq. (7) and are therefore connected with $D_i$.
The evaluation value of the VOWA operator for this sequence is given by Eq. (13).
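The overall flow of the VOWA operator (order the data, find the connected nodes, sum the support, normalise the sums into weights, and aggregate) can be sketched in Python as below. Three parts are assumptions and are marked as such in the code: the visibility test uses the standard natural visibility criterion, the support function is a hypothetical placeholder that decays with the distance between values, and the weight of a node is taken to be its support sum divided by the total of all support sums.

```python
def is_visible(series, i, j):
    """ASSUMED natural visibility test between positions i < j."""
    b_i, b_j = series[i], series[j]
    return all(series[k] < b_j + (b_i - b_j) * (j - k) / (j - i)
               for k in range(i + 1, j))


def hypothetical_support(d_i, d_j, n=2):
    """PLACEHOLDER support function: it is NOT the paper's Eq. (8)."""
    return 1.0 / (1.0 + abs(d_i - d_j)) ** n


def vowa(values, support=hypothetical_support):
    """Return (evaluation value, weights) for one set of attribute values."""
    ordered = sorted(values, reverse=True)  # from large to small
    m = len(ordered)
    sums = []
    for i in range(m):
        # Sum the support received from every node connected with node i.
        total = sum(support(ordered[i], ordered[j])
                    for j in range(m)
                    if j != i and is_visible(ordered, min(i, j), max(i, j)))
        sums.append(total)
    norm = sum(sums)
    # ASSUMPTION: weights are the normalised support sums (uniform if none).
    weights = [s / norm for s in sums] if norm > 0 else [1.0 / m] * m
    score = sum(w * d for w, d in zip(weights, ordered))
    return score, weights


score, weights = vowa([1.0, 3.0, 2.0])
```

Because the weights are derived from the data themselves, no weight vector has to be chosen by hand, which is the property the support function is meant to provide.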

IV. TRAINING FOR SUB-TASKS
A. A TRAINING METHOD FOR SUB-TASKS
This paper proposes an adaptive exploration strategy with a MADM to address the dilemma of exploration and exploitation. For multi-objective decision-making problems, the source task is hard to solve directly, so a divide-and-conquer approach is a natural option. Several trained sub-tasks can be used as modular building blocks to develop a rapid prototype for a complicated task and to improve learning performance.
For a complicated task, the learning experience of related sub-tasks is assembled to induce an action policy. In this work, inspired by the divide-and-conquer approach, a complex task is decomposed into sub-tasks, and the learning experience of the trained sub-tasks is fused to complete the complex task. A TD method [24] is used to train the sub-tasks, and this training method is shown in Algorithm 1. For example, in robot soccer, the task of a robot playing soccer is a complex task that consists of several sub-tasks: passing the ball, taking the ball away, avoiding obstacles, shooting, and so on.

B. A STANDARDIZED METHOD FOR REWARDS
These sub-tasks are trained separately by the same training method, and each learner receives a different reward. The reward received by the agent for the source task needs to take into account the rewards received by all sub-tasks, so the rewards of the different sub-tasks are standardized. If the starting time of an episode is $t_1$, the reward received by sub-task $k$ at that time is $r^{t_1}_k$, and $r^{t_p}_k$ is the reward that sub-task $k$ receives at time $t_p$. The average of the rewards that all sub-tasks receive at time $t_p$ is $\mu_{t_p}$, and the variance of these rewards is $\sigma_{t_p}$. The average and the variance are given by Eqs. (14) and (15), where $m$ is the number of sub-tasks. The standardized reward $R^{t_p}_k$ for sub-task $k$ is given by Eq. (16). The average of the standardized rewards of the sub-tasks at time $t_p$ is $\bar{R}^{t_p}$, which is given to the source task as its current reward.

Algorithm 1: Training method for sub-tasks (partial listing)
Initialize $s_t$, $s_{t+1}$, $a_t$, $a_{t+1}$, reward(t), the Q table, and the cumulative reward totalreward.
Choose an initial state $s_0$.
Repeat (for each step of the episode):
    ...
    Update the TD error.
    ...
until $s$ is terminal.

V. AN ADAPTIVE EXPLORATION STRATEGY WITH A MULTI-ATTRIBUTE DECISION-MAKING
A. AN ADAPTIVE EXPLORATION STRATEGY
An adaptive exploration strategy uses information entropy to achieve more effective exploration. The proposed method performs state-action-dependent exploration with a certain probability. The value of ε depends on the number of available actions in the current state. At the beginning of the learning process, if one action is taken much more often than the others, the defined information entropy indicates this and the agent explores more. If the temporal difference error (TD error) indicates that the agent has acquired stable experience, the probability of exploration is reduced. In this study, the fluctuation value $|\Delta Q(s_t, a_t)|$ represents the TD error, which is obtained from the TD update of the training method. If the fluctuation value is large, the agent explores more, and vice versa. After each learning step, the probability of exploration is recalculated. $A = \{a_1, a_2, \ldots, a_{N_A}\}$ is the action set, where $N_A$ is the number of actions, and $N(a_i, s_t)$ is the number of times that action $a_i$ has been taken in state $s_t$.
The probability $P(a_i \mid s_t)$ of taking action $a_i$ in state $s_t$ is the relative frequency $N(a_i, s_t) / \sum_{j=1}^{N_A} N(a_j, s_t)$. The entropy for state $s_t$ is $E_H(s_t) = -\sum_{i=1}^{N_A} P(a_i \mid s_t) \log P(a_i \mid s_t)$, with the convention $(\log 0) \times 0 = 0$, and $\bar{E}_H(s_t)$ is the normalized entropy. The value of ε in epsilon-greedy is calculated from $\bar{E}_H(s_t)$, as shown in Eq. (21), where $H_c$ is a constant and $|\Delta Q(s_t, a_t)|$ is the TD error.
$E_H(s_t)$ measures how uniformly the available actions have been tried. The value of $\bar{E}_H(s_t)$ is at its maximum if every action has been tried with the same frequency in state $s_t$. The smaller the value of $\bar{E}_H(s_t)$, the more the agent explores, and vice versa. In this way, $E_H(s_t)$ gives the agent an opportunity to try every possible action. In the early stage of the learning process the agent explores more, and in the later stage it exploits more.
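The entropy terms used by the adaptive strategy can be sketched in Python from the visit counts N(a_i, s_t). The sketch assumes that the action probability is the relative visit frequency, that the entropy is the Shannon entropy of these probabilities, and that normalisation divides by the maximum entropy log N_A; the uniform fallback for an unvisited state is also an assumption. The mapping from the normalised entropy to ε (Eq. (21)) is deliberately left out because its exact form is not given in the text above.

```python
import math


def action_probabilities(counts):
    """Relative frequency of each action in one state."""
    total = sum(counts)
    if total == 0:
        # ASSUMPTION: treat an unvisited state as uniformly explored.
        return [1.0 / len(counts)] * len(counts)
    return [c / total for c in counts]


def entropy(probabilities):
    """Shannon entropy, using the convention (log 0) * 0 = 0."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def normalized_entropy(counts):
    """Entropy divided by its maximum, log(number of actions)."""
    n_actions = len(counts)
    if n_actions < 2:
        return 0.0
    return entropy(action_probabilities(counts)) / math.log(n_actions)


# Usage: evenly tried actions give 1.0, a single dominant action gives 0.0.
even = normalized_entropy([5, 5, 5, 5])
skewed = normalized_entropy([10, 0, 0, 0])
```

A low normalised entropy signals that some actions have rarely been tried in the state, which is the situation in which the strategy increases exploration.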

B. AN ACTION POLICY WITH A MADM
The source task is decomposed into multiple sub-tasks, and each sub-task is trained in the same way. The prior experience of the agent comes from the learning experience of every sub-task, which enlarges the source of prior experience. This study uses a MADM method with a support function to calculate the weights of attributes. The state-action value functions of the $m$ sub-tasks for the current state $s_t$ and the current action $a_t$ are $Q^{(1)}(s_t, a_t), Q^{(2)}(s_t, a_t), \ldots, Q^{(m)}(s_t, a_t)$. The action set is defined as the scheme set. The attributes of a scheme are the state-action value function of the source task and the state-action value functions of the sub-tasks. The MADM method then gives each scheme an evaluation value, and the agent performs the action with the maximum evaluation value, according to the definition of MADM. $\{Q^{(p)}(s_t, a_1), Q^{(p)}(s_t, a_2), Q^{(p)}(s_t, a_3), \ldots, Q^{(p)}(s_t, a_n)\}$ is the state-action value function of sub-task $p$ for state $s_t$, where $n$ is the number of actions. The state-action value function of the source task for state $s_t$ is $\{Q(s_t, a_1), Q(s_t, a_2), Q(s_t, a_3), \ldots, Q(s_t, a_n)\}$.

The decision-making matrix $Q$ is given in Eq. (22).
The scheme set is represented as $U = \{a_1, a_2, \ldots, a_n\}$, and $Y = \{Q_1(s_t, a_i), Q_2(s_t, a_i), \ldots, Q_m(s_t, a_i), Q(s_t, a_i)\}$ is the attribute set of scheme $a_i$. The evaluation value $VOWA(a_k)$ of action $a_k$ is calculated in the following steps.
Step 1: For scheme $a_k$, the attributes $V = \{Q_1(s_t, a_k), Q_2(s_t, a_k), \ldots, Q_m(s_t, a_k), Q(s_t, a_k)\}$ are ordered from large to small, and the ordered data are $\{Q^{(1)}(s_t, a_k), Q^{(2)}(s_t, a_k), \ldots, Q^{(m)}(s_t, a_k), Q^{(m+1)}(s_t, a_k)\}$.
Step 2: Visibility between nodes is defined as a direct connection between the nodes in the complex network. We determine the visibility nodes of each node in turn. For node $Q^{(i)}(s_t, a_k)$, the set $VQ$ of visibility nodes is given by Eq. (23), where $kp$ and $ks$ are indexes from $(1, m)$ for any two nodes that satisfy $kp < ks$.
Step 3: The support that node $Q^{(i)}(s_t, a_k)$ receives from its visibility nodes is given by Eq. (24).
Step 4: The weight of each node is calculated from the support for this node, and the weight vector is written as $\omega = (\omega_1, \omega_2, \omega_3, \ldots, \omega_{m+1})$. Taking $\omega_s$ as an example, Eq. (25) is used to calculate the weight.
Step 5: The evaluation value of scheme $a_k$ is given by Eq. (26). The action $a_k$ with the highest evaluation value $VOWA(a_k)$ is selected.
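The five steps above amount to aggregating each column of the decision-making matrix and picking the best-scoring action. The Python sketch below shows that structure only: the aggregation operator is passed in as a function, and the `linear_owa` stand-in (linearly decreasing rank weights) is a hypothetical substitute for the VOWA weights of Steps 2 to 4, used here so that the example stays self-contained.

```python
import numpy as np


def select_action(decision_matrix, aggregate):
    """Score every action (column) and return (best action index, scores).

    decision_matrix has one row per attribute (the Q values of each sub-task
    and of the source task) and one column per action.
    """
    matrix = np.asarray(decision_matrix, dtype=float)
    scores = []
    for k in range(matrix.shape[1]):
        ordered = np.sort(matrix[:, k])[::-1]  # Step 1: from large to small
        scores.append(float(aggregate(ordered)))
    return int(np.argmax(scores)), scores


def linear_owa(ordered):
    """HYPOTHETICAL stand-in aggregator; not the paper's VOWA weights."""
    m = len(ordered)
    weights = np.arange(m, 0, -1, dtype=float)
    weights /= weights.sum()
    return float(np.dot(weights, ordered))


# Usage: rows are two sub-tasks plus the source task, columns are three actions.
q_matrix = [[0.2, 0.8, 0.1],
            [0.5, 0.6, 0.4],
            [0.3, 0.7, 0.9]]
best_action, scores = select_action(q_matrix, linear_owa)
```

Keeping the aggregator as a parameter makes it straightforward to swap in a support-based operator once its weights are available.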

C. THE WHOLE ALGORITHM FOR THE ADAPTIVE EXPLORATION STRATEGY WITH A MADM
This study defines a random number ran. If ran < ε, the agent chooses an action randomly; otherwise, it chooses an action using the MADM method. Compared with the classical ε-greedy strategy, the proposed adaptive exploration strategy avoids the trouble of manual tuning and improves learning performance. The values of the attributes change when the learning agent moves to the next state, so the attribute weights are recalculated using the support function each time a new action is chosen. The adaptive exploration strategy with MADM is detailed below.
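The selection rule described above can be sketched in a few lines of Python. The value of ε and the evaluation values are taken as inputs here, since they come from the adaptive strategy and the MADM method respectively; the function and parameter names are illustrative.

```python
import random


def choose_action(epsilon, evaluation_values, rng=random):
    """Explore with probability epsilon, otherwise take the best-evaluated action."""
    ran = rng.random()  # uniform random number in [0, 1)
    if ran < epsilon:
        # Exploration: pick any action uniformly at random.
        return rng.randrange(len(evaluation_values))
    # Exploitation: pick the action with the highest evaluation value.
    return max(range(len(evaluation_values)), key=lambda k: evaluation_values[k])


# Usage: with epsilon = 0 the choice is always the best-evaluated action.
action = choose_action(0.0, [0.1, 0.9, 0.3])
```

Passing a seeded `random.Random` instance as `rng` makes runs reproducible.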

VI. EXPERIMENTS AND ANALYSIS
A. EXPERIMENT ON ROBOT MIGRATION
In this experiment, robot migration was used to validate the efficiency of the proposed epsilon-greedy method with the adaptive exploration strategy (adaptive strategy). The competitors are epsilon-greedy [33] and Boltzmann probabilistic exploration (Boltzmann exploration) [34]. As shown in Fig. 1, the network for robot migration has the structure of a binary tree. Two groups of comparative experiments were set up: a network with 4 layers (15 nodes) and a network with 10 layers (1023 nodes). Each node in the robot migration network represents a state, and there is no transition between nodes in the same layer. The experimental parameters for robot migration are shown in Table 1.
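The size of the migration network follows from its binary-tree structure: a complete binary tree with L layers has 2^L − 1 nodes and one fewer edges. The Python sketch below builds such a network; the heap-style numbering (the children of node i are 2i and 2i + 1) is an assumption, chosen because it makes node 7 the only node that leads to node 15.

```python
def binary_tree_network(layers):
    """Return (number of nodes, children map) for a complete binary tree.

    ASSUMPTION: nodes are numbered 1..N in heap order, so node i has
    children 2i and 2i + 1 whenever those numbers do not exceed N.
    """
    n_nodes = 2 ** layers - 1
    children = {node: [c for c in (2 * node, 2 * node + 1) if c <= n_nodes]
                for node in range(1, n_nodes + 1)}
    return n_nodes, children


def parent(node):
    """Parent of a node under heap numbering (the root, node 1, has none)."""
    return node // 2


n_small, tree_small = binary_tree_network(4)   # the 4-layer network
n_large, _ = binary_tree_network(10)           # the 10-layer network
n_transitions = sum(len(c) for c in tree_small.values())
```

For the 4-layer network this gives 15 states and 14 parent-to-child transitions.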
There are 15 endpoints on the network with 4 layers. Starting from the starting position, the robot repeatedly follows the branches to reach the lowest endpoint and is rewarded for each endpoint. A total of 14 actions can be selected by the robot, and the Q values of the endpoints are written as $Q_1 \sim Q_{15}$. If the robot reaches endpoint 15, the reward is +1000. If the robot reaches any of the remaining endpoints, it is not rewarded. The experimental results give the values of $Q_1 \sim Q_{15}$.

$$
Q = \begin{bmatrix}
Q^{(1)}(s_t, a_1) & Q^{(1)}(s_t, a_2) & \cdots & Q^{(1)}(s_t, a_{n-1}) & Q^{(1)}(s_t, a_n) \\
Q^{(2)}(s_t, a_1) & Q^{(2)}(s_t, a_2) & \cdots & Q^{(2)}(s_t, a_{n-1}) & Q^{(2)}(s_t, a_n) \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
Q^{(m)}(s_t, a_1) & Q^{(m)}(s_t, a_2) & \cdots & Q^{(m)}(s_t, a_{n-1}) & Q^{(m)}(s_t, a_n) \\
Q(s_t, a_1) & Q(s_t, a_2) & \cdots & Q(s_t, a_{n-1}) & Q(s_t, a_n)
\end{bmatrix} \tag{22}
$$

Algorithm: The adaptive exploration strategy with a MADM (partial listing)
    ...
    ... $\{VOWA(s_t, a_1), \ldots, VOWA(s_t, a_n)\}$
    End if
    Take action $a_t$, and observe the rewards from each sub-task.
    Observe the next state $s_{t+1}$.
    Calculate the average value $\bar{R}$ of the standardized rewards using the rewards obtained from the sub-tasks.
    ...
    Calculate the action probability $P(a_t \mid s_t)$.
    Calculate $E_H(s_t)$ and $\bar{E}_H(s_t)$, respectively.
    ...
until $s_t$ is terminal.
End
In the migration network, endpoint $Q_7$ is the only node leading to endpoint $Q_{15}$, so the robot is encouraged to choose $Q_7$. The experimental results show that the Q value of $Q_{15}$ changes during the experiment, so this study uses the changes in the Q value of $Q_{15}$ to compare the different methods. Fig. 2 shows the curves of the value of $Q_{15}$ for the three methods. The adaptive exploration method converges at the 12th time step, whereas the Boltzmann exploration method and the epsilon-greedy method converge at the 23rd and 43rd time steps, respectively. The adaptive exploration strategy therefore converges faster than the other two methods and can accelerate the convergence of the learning algorithm.
We then extend the 4-layer network to 10 layers and analyze the results. In this experiment, the adaptive exploration strategy, the epsilon-greedy strategy, and Boltzmann exploration were compared using the value of $Q_3$. In Fig. 3, the adaptive exploration strategy is represented by the blue curve, Boltzmann exploration by the yellow curve, and the epsilon-greedy strategy by the purple curve. The adaptive exploration method converges at the 3746th time step, whereas Boltzmann exploration and the epsilon-greedy method converge at the 3970th and 5337th time steps, respectively. In terms of the value of $Q_3$, the adaptive exploration strategy has the fastest convergence rate, so the results for this more complex scenario are similar to those for the 4-layer network.

B. EXPERIMENT ON ROBOT CONFRONTATION
Robocode [35], [36] is an open-source platform developed for multi-robot confrontation. An introduction to the Robocode platform is available at https://robocode.sourceforge.io/. We use several scenarios with different enemies to test the effectiveness of the proposed method in this experiment.
On the platform, robots adopt strategies to attack the enemy and obtain rewards, as shown in Fig. 4. Both sides have 100 health points at the beginning of each round. A round ends when the health points of one side reach 0. The state space, the action space, and the reward function of an agent are described below.
State space: The state space consists of the absolute orientation angle, the relative direction angle, and the distance between the robots. The absolute orientation angle ranges from 0° to 360° and is discretized into four states; the relative direction angle is also discretized into four states. The distance between the robots is divided into 20 discrete values.
Action space: The motion of the robot consists of movement and rotation. Each robot can move in an arbitrary direction over an arbitrary distance on the ground at each time step. Our robot can attack its opponents with bullets of different energies. The action space includes four kinds of movement: forward, backward, clockwise rotation, and anticlockwise rotation.
Reward function: If a robot is hit by a bullet or hits the enemy with a bullet, its health points change. The rules for health changes differ between states, and bullets have different energies. A separate reward function is designed for each sub-task. Three different sub-tasks are designed for this experiment.
Sub-Task 1: Attack the enemy. The reward signal for sub-task 1 is shown in Eq. (27).
Sub-Task 2: Avoid being hit by the enemy. The reward signal is negative: r = −3 if our robot is hit by the enemy.
The adaptive exploration strategy with the MADM (AE with the MADM), the AE method, the MADM method, the ε-greedy strategy with an attenuation threshold (ε-decreasing) [4], and the softmax-greedy method (softmax-greedy) [34] were compared on this platform to demonstrate the effectiveness of the proposed method. The experimental parameters are shown in Table 2.
Every 10 rounds are counted as a session, and the session score is the average of the scores collected over those 10 rounds. The three sub-tasks were each trained to convergence by the TD method before this experiment was performed. First, we train a robot "ε-decreasing" using the ε-decreasing method, and then robots using the proposed AE with the MADM method, the AE method, and the MADM method each fight it for 500 rounds. We define the score ratio as the indicator for these experiments; it is the ratio of one side's score to the total score of both sides, as shown in Eq. (28), where M = 2 is the number of individuals participating in the count. Fig. 5 shows the curves of the score ratio for the ε-decreasing method, the AE method, the MADM method, and the proposed AE with MADM. The proposed method has the highest score and the ε-decreasing method has the lowest. The scores of the AE method and the MADM method are lower than that of the proposed method, which shows that the two methods can be effectively combined to improve learning performance.
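The session score and the score ratio described above can be computed as in the following Python sketch. The score ratio follows the verbal definition (one side's score divided by the total score of all M participants); returning 0.0 when the total is zero is an assumption, as are the function names.

```python
def session_scores(round_scores, session_length=10):
    """Average the per-round scores over consecutive sessions of fixed length."""
    last_start = len(round_scores) - session_length
    return [sum(round_scores[i:i + session_length]) / session_length
            for i in range(0, last_start + 1, session_length)]


def score_ratio(scores, index):
    """Share of the total score held by the participant at the given index."""
    total = sum(scores)
    if total == 0:
        return 0.0  # ASSUMPTION: no points scored by anyone yet
    return scores[index] / total


# Usage with M = 2 participants: our robot scored 60, the opponent 40.
ratio = score_ratio([60.0, 40.0], 0)
```

With two participants, a ratio above 0.5 means that the robot outscored its opponent in that session.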
Then, robots using the AE with the MADM method, the AE method, the MADM method, the softmax-greedy method, and the ε-decreasing method each fight 500 rounds against a robot using the ε-greedy strategy. Fig. 6 shows the score ratios for these five methods. As in Fig. 5, the score of AE with the MADM is higher than those of the AE and MADM methods, which further demonstrates that combining AE and MADM improves the performance of the learning algorithm. In addition, AE with the MADM has a higher score ratio than the ε-decreasing and softmax-greedy methods, which shows that the action policy using MADM helps the agent obtain higher scores in confrontation. Decomposing the source task into several sub-tasks by a divide-and-conquer method expands the source of prior experience for the agent, and the MADM method helps the agent select actions more effectively using this prior experience.
To demonstrate the low fluctuation of the three methods, a fluctuation curve of the score ratio is used to describe the learning performance of the different algorithms. The fluctuation of the score ratio is described by its first-order difference, $scoreratio_{t+1} - scoreratio_t$, that is, each score ratio minus the preceding one. As shown in Figs. 7 and 8, the fluctuation range of the proposed method is smaller than that of the other two competitors, which indicates that the proposed method has low fluctuation. A lower fluctuation means that the agent chooses more appropriate actions and maintains a stable decision.
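The first-order difference used for the fluctuation curves can be sketched in Python as follows. The `fluctuation_range` helper, defined here as the gap between the largest and smallest difference, is an assumed summary statistic for comparing curves and is not defined in the text above.

```python
def first_order_difference(ratios):
    """Each score ratio minus the preceding one."""
    return [later - earlier for earlier, later in zip(ratios, ratios[1:])]


def fluctuation_range(ratios):
    """ASSUMED summary: spread between the largest and smallest difference."""
    diffs = first_order_difference(ratios)
    return max(diffs) - min(diffs) if diffs else 0.0


# Usage with three hypothetical session score ratios.
diffs = first_order_difference([0.50, 0.60, 0.55])
```

A method whose differences stay close to zero changes its score ratio little from session to session.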

C. EXPERIMENT ON THE REAL WHEELED MOBILE ROBOT (WMR)
In order to test the practicality of the proposed method, this study carried out an image-based visual servoing (IBVS) control experiment in a real environment, based on previous works [10], [37]. The experimental environment is shown in Fig.9. Previous work has shown that Q-learning can improve the performance of the IBVS controller by adjusting its mixture parameter [10], [37]. We carried out experiments on the basis of this work, and the four vertex coordinates of a Quick Response (QR) code are used as the feature points in IBVS [10]. Firstly, we train the WMR in the simulation environment. Then, the prior knowledge (the Q-table) learned in simulation is directly transferred to the WMR before the real-world experiment, in order to reduce the training time of the real-world experiment.
Three different methods were compared: the proposed method (AE with the MADM), the ε-greedy strategy (ε-greedy) [33] and the softmax strategy with simulated annealing algorithm (Softmax-SA) [37]. For the real-world environment, the parameters for this experiment are set as shown in Table 3. For the RL model in this experiment, based on the previous work [10], [37], we decompose the source task into three sub-tasks: not losing any feature, reaching the desired position and reaching the desired position faster.
Sub-Task 1: The reward +100 is given when the WMR reaches its target position. In all other cases, a small negative reward is given. The reward function is given by

$$\mathrm{reward} = \begin{cases} +100, & \text{reach the target position} \\ -2, & \text{others} \end{cases}$$

Sub-Task 2: If the WMR is deemed to have lost the target features, a penalty of −100 is given. In all other cases, a small positive reward is given. The reward function is given by

$$\mathrm{reward} = \begin{cases} -100, & \text{missing the target features} \\ +2, & \text{others} \end{cases}$$

Sub-Task 3: The reward is given according to the distance between the WMR and the desired position, which drives the WMR to reach the target position faster. The reward function depends on the following quantities: $N$ is the number of image features used in the IBVS system, $F^{c}_{j}$ is the current feature and $F^{*}_{j}$ is the desired feature, and col and row are the length and width of the obtained image plane, respectively.
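The reward functions of Sub-Task 1 and Sub-Task 2 can be written as two small Python functions. This is a minimal sketch: the function names and the boolean inputs `reached_target` and `features_lost` are hypothetical, and how the real system detects these conditions is not specified here. Sub-Task 3 is omitted because its reward depends on the feature distance defined in the text.

```python
def reward_reach_target(reached_target: bool) -> float:
    """Sub-Task 1: +100 on reaching the target position, -2 otherwise."""
    return 100.0 if reached_target else -2.0


def reward_keep_features(features_lost: bool) -> float:
    """Sub-Task 2: -100 when the target features are lost, +2 otherwise."""
    return -100.0 if features_lost else 2.0


# Hypothetical time step: target not yet reached, features still in view.
step_rewards = (reward_reach_target(False), reward_keep_features(False))
print(step_rewards)
```

Each sub-task learner receives only its own reward signal, which is why the two functions are kept separate.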
Since the average errors for the four feature points are similar, only the feature errors for two diagonal feature points are shown. Each method was tested 50 times, and the average over the 50 tests is reported as the experimental result. Fig.10 shows the convergence curves of the feature error. All three methods eventually converge, but their convergence rates differ. The AE with MADM method converges at the 148th time step, the Softmax-SA method at the 181st time step and the ε-greedy method at the 198th time step. Compared with the other two methods, the proposed method therefore has the fastest convergence rate. The results on the real robot show that the adaptive exploration and the MADM can accelerate the convergence of the learning algorithm. A shorter training time also reduces the risk of damaging the real robot.

VII. A FUTURE IMPROVEMENT FOR ADAPTIVE EXPLORATION STRATEGY USING TRANSFER LEARNING
To further investigate the improvement of learning performance, this study uses transfer learning to extend the learning model to different scenarios and thereby improve the generalization of the proposed AE with MADM. This study compares the AE with MADM method using transfer learning (AE-MADM with TF) with the same method without transfer learning (AE-MADM without TF). The contrast experiment was divided into two groups. The first group used AE-MADM with TF and AE-MADM without TF to fight with two formations of tank robots trained by the ε-greedy strategy. The second group used AE-MADM with TF and AE-MADM without TF to fight with four formations of tank robots trained by the ε-greedy strategy. The robots were set in different formations to ensure that robots in the same team do not attack each other. The robot trained using the ε-greedy strategy and the robot trained using the AE-MADM method fight for 500 rounds before the beginning of the experiment, so the "AE-MADM" robot has some prior experience. The direct transfer strategy [38] is performed if agents have the same state and action spaces in the different tasks, which is given by

$$\rho_S : S_{\mathrm{past}} \to S_{\mathrm{current}}, \qquad \rho_A : A_{\mathrm{past}} \to A_{\mathrm{current}} \tag{32}$$

An intermediate task gives the agent more prior experience to achieve the target task if the target task is far away from the source task. In this experiment, the second experiment is regarded as the target task, and the first experiment can be regarded as the intermediate task. Fig.11 and Fig.12 show the experimental results for the first and second experiments, respectively. In the first experiment, we used a formation as the enemy, which consists of two robots trained by the ε-greedy strategy. The score ratios of both AE-MADM with TF and AE-MADM without TF remained between 0.5 and 0.81.
However, the experimental results show that the score ratio of AE-MADM with TF is still higher than that of AE-MADM without TF, because agents using transfer learning can learn from prior experience. As the number of rounds increases, the score ratio of AE-MADM with TF remains around 0.7, while that of AE-MADM without TF remains around 0.6.
Compared with the first experiment, the second experiment is more difficult. The learning experience gained in the first experiment is transferred to the second experiment using the direct transfer strategy. In the first 220 rounds, AE-MADM with TF does not perform better than AE-MADM without TF. As the number of rounds increases, the former gradually scores higher than the latter, which shows that the agent can not only exploit past experience but also gain new knowledge. The experimental results show that transfer learning extends the proposed method to more difficult task scenarios.
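When the state and action spaces coincide, the direct transfer strategy amounts to initialising the target learner with the values learned on the earlier task. The following Python sketch illustrates this under stated assumptions: the Q-table is stored as a nested dictionary, the mappings between the past and current spaces are taken to be the identity, and the function name, state labels and action labels are all hypothetical.

```python
def direct_transfer(source_q, target_states, target_actions):
    """Initialise a target Q-table from a source Q-table.

    Assumes identity mappings between the past and current state and
    action spaces, so the two tasks must share both spaces exactly.
    """
    if set(source_q) != set(target_states):
        raise ValueError("state spaces differ; direct transfer does not apply")
    target_q = {}
    for state in target_states:
        if set(source_q[state]) != set(target_actions):
            raise ValueError("action spaces differ; direct transfer does not apply")
        # Copy values so later updates on the target task leave the source intact.
        target_q[state] = {a: source_q[state][a] for a in target_actions}
    return target_q


# Hypothetical Q-table learned on the intermediate task.
source_q = {
    "near": {"fire": 1.5, "move": 0.2},
    "far": {"fire": -0.3, "move": 0.8},
}
target_q = direct_transfer(source_q, ["near", "far"], ["fire", "move"])
print(target_q)
```

The transferred table serves only as a starting point, and learning on the target task continues to update it.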

VIII. CONCLUSION
The dilemma of exploration and exploitation in RL systems is a challenging problem. In this work, an adaptive exploration strategy with multi-attribute decision-making is proposed to address this dilemma. The probability of exploration is determined by an adaptive exploration strategy, which uses the information entropy instead of manual, experience-based tuning. Meanwhile, we investigate how to expand the source of prior experience from the structure of the task itself, and how to integrate this multi-source prior experience. Firstly, the source task is decomposed into several sub-tasks, and these sub-tasks are trained separately by the same TD method. Because each sub-task has a different reward function, a reward standardization method is proposed. The average value of the standardized rewards of the sub-tasks is used as the reward for the complex task. Compared with the conventional method in which the agent learns directly in the environment, the training method that runs several learners in parallel can accelerate learning. The transfer learning method extends the proposed learning model to more difficult tasks. We executed several experiments to demonstrate the effectiveness of the proposed method. The experimental results show that the proposed method outperforms the conventional methods in terms of convergence rate and learning performance.
In the future, we will try to extend the proposed method to RL systems in high-dimensional spaces, and the learning method needs to be framed within existing works such as hierarchical RL [39] and curriculum learning [40], [41]. Since some rough prior experience exists in the previous learning tasks, advanced knowledge transfer methods and domain adaptation might lead to more efficient learning. Previous works have applied RL methods to fault diagnosis and fault-tolerant control [42]-[44], so we will investigate the application of the proposed method to fault diagnosis. In addition, the possibility of applying multi-attribute decision-making in deep reinforcement learning systems is also worthy of study.