An Intelligent Deployment Policy for Deception Resources Based on Reinforcement Learning

Traditional deception-based cyber defenses (DCD) often adopt a static deployment policy that places deception resources at fixed positions in the target network. Unfortunately, the effectiveness of these deception resources is greatly restricted by the static deployment policy, which also causes the deployed deception resources to be easily identified and bypassed by attackers. Moreover, the existing studies on dynamic deployment policy make many strict assumptions and constraints, and are thus too idealistic to be practical. To overcome this limitation, we develop an intelligent deployment policy that dynamically adjusts the locations of deception resources according to the network security state. Starting by formulating the problem of deception resources deployment, we then model the attacker-defender scenario and the attacker's strategy. Next, we propose a preliminary screening method that derives the effective deployment locations of deception resources based on a threat penetration graph (TPG). Afterward, we construct a model for finding the optimal policy to deploy the deception resources using reinforcement learning and design a model-free Q-learning training algorithm. Finally, we use a real-world network environment for our experiments and conduct in-depth comparisons with state-of-the-art methods. Our evaluations on a large number of attacks show that our method achieves a high defense success probability of nearly 80%, which is more effective than existing schemes.

Obviously, creating deception resources is the first step in implementing defense measures that belong to DCD. One of the most common deception resources is the honeypot, which has been studied by many scholars. The honeypots created in the existing literature can be divided into three categories: low-interaction, medium-interaction, and high-interaction honeypots. Provos [9] proposes a low-interaction honeypot that mimics the TCP/IP stack of an arbitrary system and can deceive nmap-style stack fingerprinting tools. Other scholars have created honeypots for application layer protocols, e.g., Telnet [10] and HTTP [11], as well as honeypots for special devices, such as smartphones [12], USB devices [13], and data acquisition equipment [14]. Besides honeypots, deception resources take a variety of forms. In [15], Juels and Rivest design ''honeywords'', decoy passwords that expose an attacker who has cracked a password file. ''Honey-patches'' are proposed in [16] to trick the attacker with the help of fake vulnerability patches. Subsequently, fake IoT devices [17], URL addresses [18], and encryption measures [19], [20] have also been proposed to lure and deceive attackers. In fact, anything that attackers are interested in can be forged as a deception resource.
Currently, research on DCD focuses on improving the fidelity of deception resources so that the attacker fails to recognize the difference between them and the real network environment. However, to our knowledge, little attention has been devoted to how to deploy deception resources efficiently, which is also an important enabling technique for DCD. The authors of [21] introduce graph theory to model the attacker's behavior and design a decoy chain deployment method to protect against penetration attacks. A web honeypot based on a collaborative mechanism is proposed to make a honeypot cluster work together as if it were a single honeypot [22]. In addition, Jajodia et al. [23] and La et al. [24] derive guidance for honeypot node deployment by using game theory. Nevertheless, most of these studies can still be classified as static deployment methods, and their effectiveness is limited.
Understanding the state (intent, ability, and strategy) of the attacker is the key requirement of any successful deception [7]. Meanwhile, in an attacker-defender process, the defender's knowledge of the attacker's state may improve over time. However, a static deployment method treats the attacker's state as invariant, which cannot support successful deception. In fact, a dynamic deployment plan for deception resources is easy to implement using software defined network (SDN) technology [25]-[29]. From the above analysis, we can see that designing a dynamic deployment method is critical to maximizing the effectiveness of deception resources. Clark et al. [30] periodically change the IP addresses of honeypot nodes, which invalidates any honeypot IP identified by the attacker and increases the security of the honeypot nodes. Similarly, Sun and Sun [31] and Sun et al. [32] place decoy nodes in the target network and then confuse the attacker through IP randomization. By analyzing network characteristics, Shakarian et al. [33] combine MTD with decoy nodes and deploy ''interference clusters'' in the network to reduce the probability of successful intrusion. A dynamic array honeypot model is proposed to confuse the attacker by pseudo-randomly coordinating and changing the roles of the honeypots [34]. Unfortunately, the above methods suffer from two limitations. First, their approaches tend to prevent attackers from discovering deception resources rather than actively luring them. Second, to infer the attacker's strategy, they make many strict assumptions and constraints that cannot be satisfied in practice. As a result, they are too idealistic to be practical.
As one of the most attractive research hotspots of artificial intelligence (AI), reinforcement learning (RL) has made great breakthroughs in many fields, e.g., robotics, autonomous driving, and board games. In particular, the emergence of AlphaGo [35] and AlphaGo Zero [36] has attracted worldwide attention. Beyond these, in difficult strategy confrontation games such as StarCraft and Dota 2, AI technologies based on RL have also defeated humans. In fact, the essence of network attack and defense is similar to such strategy confrontation games. More importantly, by combining RL and network security data, we can acquire convincing knowledge of the attacker's state, which can help us design more successful deception schemes. Inspired by the recent success of RL, we are now seeing a renewed interest in DCD. Specifically, we propose an intelligent deployment policy for deception resources based on RL. Our method can dynamically adjust the locations of deception resources according to the network security state and trap the attacker with maximum probability. To the best of our knowledge, our work is the first to achieve satisfactory properties without strict assumptions and constraints on the attacker's strategy.
The remainder of this paper is organized as follows. In Section II, the problem setting of deception resources deployment is described. In Section III, the method for preliminary screening of effective deployment locations of deception resources based on the TPG is given. In Section IV, the details of the intelligent deployment policy for deception resources based on RL are presented. Section V describes the experiments and the obtained results. Finally, we conclude the paper in Section VI.

II. PROBLEM SETTING OF DECEPTION RESOURCES DEPLOYMENT
In this section, the problem setting of deception resources deployment is described. We present the assumptions, the attacker-defender scenario, and the attacker's strategy in detail. As mentioned earlier, the core idea of DCD is to induce attackers to stray into false deception schemes designed by the defender in advance, so as to realize proactive protection of the real resources in the target network. Clearly, the effectiveness of DCD depends on two key factors. One is the fidelity of the deception resources, that is, how to make attackers believe them without a shadow of a doubt; the other is the policy of deception resources deployment, namely, how to induce attackers to fall into the deception resources. In this paper, we focus on solving the second problem. For simplicity, we assume that the deception resources have high fidelity, which means that they can lure the attacker into investing and depleting all his/her resources.

A. ATTACKER-DEFENDER SCENARIO
In this paper, we go one step further by showing that RL can help us obtain convincing knowledge of the attacker's state and design a successful deception policy solely from network security data, without human data, guidance, or domain knowledge. Based on the idea of DCD, we consider a simple and practicable attacker-defender scenario (ADS), which consists mainly of three components: the target network, the attacker, and the defender, as shown in Fig. 1.

Definition 1 (Target Network):
A target network TN is a four-tuple TN = (N, E, n_cr, N_fr), where N = {n_j | j = 1, 2, ..., k} is the set of normal nodes, E ⊂ N × N is the set of edges, n_cr ∈ N is the only normal node that stores the confidential resources (data or files), and N_fr = {n_fr^i | i = 1, 2, ..., m} is the set of deception resource nodes, which host the fake resources.
Note that if e_ij ∈ E, i.e., there is an edge from node n_i to another node n_j, users have access to n_j from n_i. In practice, attackers usually have specific goals. In our ADS, we assume that the goal of the attacker is to obtain the confidential resources installed on n_cr.
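As an illustration, the target network of Definition 1 can be encoded directly as a set of nodes and access edges. The sketch below uses hypothetical node names and is not tied to any particular network:

```python
# A minimal encoding of Definition 1, with hypothetical node names.
TN = {
    "N": {"n1", "n2", "n3", "n4"},       # normal nodes
    "E": {("n1", "n2"), ("n2", "n3")},   # access edges e_ij
    "n_cr": "n3",                        # node holding confidential resources
    "N_fr": {"fr1", "fr2"},              # deception resource nodes
}

def has_access(tn, src, dst):
    """Users (and attackers) can reach dst from src iff the edge exists."""
    return (src, dst) in tn["E"]

print(has_access(TN, "n1", "n2"), has_access(TN, "n1", "n3"))  # True False
```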
In most real-world target networks, there are some nodes called ''external facing'', which can be accessed by the outside attacker. In Fig. 1, the node n_1 is external facing. At the beginning of the ADS, the attacker invades and controls these external-facing nodes (by exploiting some vulnerabilities in code running on them), and then launches several penetration attacks by compromising other nodes in the target network. Clearly, a successful attack on node n_i requires two indispensable conditions: i) the attacker has access to node n_i, and ii) there are exploitable vulnerabilities on node n_i. Note that the attacker does not know where the confidential resources are installed. Thus, one complete attack starts from an external-facing node and keeps penetrating until either the attacker invades the node n_cr or the attacker strays into one of the deception resource nodes. For simplicity, A_a(n_i), an attack action implemented by the attacker, means that the attacker invades the node n_i.
To deceive and defeat the attacker, the defender creates deception resources, denoted as N_fr, the set of deception resource nodes. Each n_fr^i ∈ N_fr is identical to the node n_cr, except that the confidential resources are replaced with meticulously synthesized fake resources. Because creating and maintaining a node n_fr^i costs considerable time and money, the number of deception resource nodes is usually small. Besides, we assume that the attacker cannot distinguish between confidential resources and fake resources. We can then define a deception resources deployment action A_d as a mapping from the set N_fr to the set N. Thus, for each deception resource node n_fr^i ∈ N_fr, there is one corresponding node n_j ∈ N that satisfies A_d(n_fr^i) = n_j, which means that the defender creates a connection from n_j to n_fr^i. For the sake of brevity and readability, when A_d(n_fr^i) = n_j, we say that the defender deploys n_fr^i behind n_j, or that the deployment location of n_fr^i is n_j. In this case, the attacker has access to n_fr^i from n_j; therefore, once the attacker invades and controls the node n_j, he/she will be able to implement the attack action A_a(n_fr^i) in the next step. In fact, there are three kinds of possible mapping relationships from N_fr to N, as shown in Fig. 2.
FIGURE 2. Three kinds of possible mapping relationships from N_fr to N: (a) many-to-one mapping; (b) one-to-many mapping; (c) one-to-one mapping.
Fig. 2(a) displays the many-to-one mapping, in which the defender deploys more than one deception resource node behind the same normal node. As all the deception resource nodes are identical, this brings no benefit and only increases the attacker's suspicion. Fig. 2(b) displays the one-to-many mapping, in which the defender deploys one deception resource node behind more than one normal node at the same time. In this case, based on practical observations, the attacker may invade other nodes by using the deception resource node as a springboard. Fig. 2(c) shows the one-to-one mapping, in which the defender deploys only one deception resource node behind one normal node. This mapping exhibits none of the obvious shortcomings of the other two. From the above analysis, we can see that a reasonable deception resources deployment policy must follow the rule of one-to-one mapping.
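The one-to-one rule can be enforced programmatically. The following sketch (hypothetical node names, not the paper's implementation) checks that a candidate deployment action maps every deception resource node behind a distinct normal node:

```python
def is_one_to_one(deployment, normal_nodes):
    """Check that a deployment maps each deception resource node behind
    a distinct normal node (the one-to-one mapping of Fig. 2c)."""
    targets = list(deployment.values())
    return (all(t in normal_nodes for t in targets)
            and len(targets) == len(set(targets)))

# Hypothetical example: two deception nodes, four normal nodes.
normal = {"n1", "n2", "n3", "n4"}
ok = {"fr1": "n2", "fr2": "n3"}    # one-to-one (Fig. 2c)
bad = {"fr1": "n2", "fr2": "n2"}   # many-to-one (Fig. 2a)
print(is_one_to_one(ok, normal), is_one_to_one(bad, normal))  # True False
```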
After a limited number of attack actions, our ADS terminates in one of two cases: (1) the attacker implements the attack action A_a(n_cr) and obtains the confidential resources, which means that the defender has failed; (2) the attacker implements the attack action A_a(n_fr^i) and obtains the fake resources, which means that the defender has succeeded.

B. UNCERTAINTY
Understanding the attacker's strategy is critical in designing a successful deception scheme. Thus, in our ADS, accurate prediction about the attacker's next attack action plays a significant role in selecting the effective deception resources deployment policy. Further analysis shows that the attacker's next attack action is mainly determined by two factors: the set of compromised nodes N compr and the attacker's strategy π a . However, both of these factors are unknown to the defender.
N_compr ⊆ N represents the set of nodes that have already been compromised by the attacker. Obviously, N_compr will change as the attack progresses, and it determines the possible attack actions that the attacker is able to implement in the next step. Fortunately, we can estimate N_compr with the help of a network monitoring system (NMS), which keeps monitoring the health status of each normal node and triggers an alarm when a node encounters an attack. Namely, we can infer N_compr from these alarms. However, due to the existence of missing and false alarms, there are some uncertainties in this method, as shown in Fig. 3. In Fig. 3, a node with a grey solid circle has been compromised by the attacker, and a node marked with a red triangle is one for which the NMS raises an alarm. In this example, we set N_compr = {n_1, n_3}, which means the attacker has compromised n_1 and n_3. Fig. 3(a) displays the situation of accurate alarms: the NMS raises the alarms of n_1 and n_3. Fig. 3(b) displays the situation of missing alarms: the NMS only raises the alarm of n_1 but omits the alarm of n_3. Fig. 3(c) displays the situation of false alarms: apart from the alarms of n_1 and n_3, the NMS also raises the alarm of n_2 when no attack occurs there. Fig. 3(d) displays the situation of missing and false alarms: the NMS omits the alarm of n_3 and raises the extra alarm of n_2. Hence, N_compr cannot be accurately inferred from these alarms alone.
Given the set N_compr, let N_next ⊆ N denote the set of nodes that the attacker can invade in the next step. Then, for each n_i ∈ N_next, there is at least one node n_j ∈ N_compr that satisfies e_ji ∈ E. We denote by n_next ∈ N_next the node that the attacker will invade in the next step. In an actual attack-defense process, n_next is mainly determined by the attacker's strategy π_a, which is closely related to the attacker's interestingness distribution I and success probability distribution J.
Given a node n_i ∈ N_next, let I(n_i) and p_{n_i} denote the attacker's interestingness in node n_i and the probability that the attacker invades n_i in the next step, respectively. The bigger I(n_i) is, the bigger p_{n_i} will be. Analogously, let J(n_i) denote the probability that the attacker invades n_i successfully; the bigger J(n_i) is, the bigger p_{n_i} will be. Here, n_next is given by

n_next = argmax_{n_i ∈ N_next} [α · I(n_i) + β · J(n_i)], (1)

where α and β are the weight coefficients (0 < α, β < 1) that weigh the importance of the interestingness and the success probability from the perspective of the attacker, respectively. On the one hand, the interestingness distribution I is related to the nature of the nodes, the network topology, and the attacker's preference, and it is unknown to the defender. On the other hand, the attacker's success probability distribution J is related to the complexity of the vulnerabilities in the nodes, which can be quantified by the common vulnerability scoring system (CVSS). Given a deception resource node n_fr^i, although its location will not affect the attacker's success probability J(n_fr^i), it will affect the attacker's interestingness I(n_fr^i). In this way, the location of n_fr^i will affect the attacker's next attack action.
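Under the strategy model above, the attacker's next target amounts to an argmax over the weighted score α · I + β · J. The distributions and weights below are purely illustrative; in practice I is unknown to the defender and J would be derived from CVSS:

```python
def predict_next_node(candidates, I, J, alpha=0.5, beta=0.5):
    """Score each reachable node by the weighted sum alpha*I + beta*J
    and return the node the attacker is most likely to invade next."""
    return max(candidates, key=lambda n: alpha * I[n] + beta * J[n])

# Hypothetical distributions over the reachable set N_next.
I = {"n2": 0.9, "n3": 0.4}   # interestingness (unknown to the defender)
J = {"n2": 0.3, "n3": 0.9}   # CVSS-derived success probability
print(predict_next_node(["n2", "n3"], I, J))  # "n3" (0.65 vs 0.60)
```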
Note that the deployment of N_fr can also affect the attacker's next attack action. However, from the perspective of the defender, there are two types of uncertainty. First, missing and false alarms make the inference of the attacker's compromised nodes uncertain, which in turn makes the inference of the attacker's next attack action uncertain. In this process, the direction of uncertainty transmission is alarms → N_compr → n_next. Second, the defender is uncertain about the attacker's interestingness distribution I and success probability distribution J, which also makes the inference of the attacker's next attack action uncertain. In this process, the direction of uncertainty transmission is I, J → π_a → n_next.

C. ANALYSIS OF DEPLOYMENT POLICY
As discussed above, the key focus of the deployment policy is to accurately predict the attacker's next attack action, which is decided by N_compr and π_a. A deployment policy, as a decision-making rule, defines the defender's way of behaving at a given time. In other words, a deployment policy is a mapping from network security states of the target network to deployment actions to be taken when in those states. It can be written as

π_d : S → AD. (2)

Note that A_d is a mapping from N_fr to N. Now let AD denote the set of all possible mappings from N_fr to N. That is,

|AD| = |N|^|N_fr|, (3)

where |N| is the total number of normal nodes, |N_fr| is the total number of deception resource nodes, and |AD| is the total number of mappings from N_fr to N. When the node that the attacker will invade in the next step belongs to the set of deception resource nodes, i.e., n_next ∈ N_fr, the defender traps the attacker and the deployment policy is successful. Note that there are three kinds of deployment policies that are easy to think of. First, in traditional network defense, the defender usually places the deception resource nodes in fixed locations; we call this method the static deployment policy (denoted as π_d^1). Obviously, the static deployment policy has no relation with time t, and it can be written as

π_d^1(t) = A_d, (4)

where A_d represents a fixed element of the set AD. Second, we can randomly place the deception resource nodes in different locations over time; we call this method the random-dynamic deployment policy (denoted as π_d^2). We can then write π_d^2 as

π_d^2(t) = random(AD), (5)

where random(AD) means that we randomly select an element from the set AD. Third, dynamically deploying deception resources behind the nodes where alarms are raised also looks like a good idea; we call this method the following-alarm deployment policy (denoted as π_d^3). Let AL denote the set of nodes that have encountered an alarm. Then π_d^3 can be written as

π_d^3(t): A_d(n_fr^i) = random(AL), (6)

where random(AL) means that we randomly select an element from the set AL.
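The three baseline policies can be sketched as follows, assuming a simplified, hypothetical action space in which each action deploys a single deception node:

```python
import random

def static_policy(AD, fixed_index=0):
    """pi_d^1: always return the same fixed deployment action."""
    return AD[fixed_index]

def random_dynamic_policy(AD):
    """pi_d^2: pick a deployment action uniformly at random."""
    return random.choice(AD)

def following_alarm_policy(deception_nodes, AL):
    """pi_d^3: deploy each deception node behind a random alarmed node."""
    return {fr: random.choice(AL) for fr in deception_nodes}

# Hypothetical action space: each action deploys fr1 behind one node.
AD = [{"fr1": n} for n in ("n1", "n2", "n3")]
print(static_policy(AD))                              # always the same action
print(random_dynamic_policy(AD) in AD)                # True
print(following_alarm_policy(["fr1"], AL=["n1", "n3"]))
```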
Although these three kinds of policies are easy to implement, they are all flawed to some degree. Because of its fixed deployment locations, the static deployment policy can easily be detected and identified by the attacker, leading to poor defense effects. The latter two policies overcome the disadvantage of being easily identified thanks to their dynamic characteristics. However, the random-dynamic deployment policy is very much a hit-and-miss affair, and its effectiveness is unstable. Similarly, due to the uncertainty discussed in Section II.B, the effect of the following-alarm deployment policy leaves room for improvement.
To avoid the above uncertainties, this paper proposes an intelligent deployment policy for deception resources based on RL. In reality, the target network tends to encounter countless network attacks, which can supply our learning model with a large amount of data. Our method regards these uncertainties of all kinds as a black box and uses the alarm data to eliminate them step by step based on RL. Through the learning process, we can build prior knowledge for predicting the attacker's next attack actions in the ADS. Using our intelligent deployment policy, the locations of the deception resource nodes can be dynamically changed according to the network security state. Let π_d denote our policy; it can be written as

A_d^t = π_d(S_t) = RL(S_t), (7)

where S_t represents the network security state at time t, and RL represents the currently trained model. Compared with the previous three methods, our approach builds on the fact that learning can reduce uncertainty, which is objective and reasonable. It should be noted that the longer the learning process is, the higher the accuracy of our policy will be. In addition, given the fact that not all nodes can be broken by attackers, we use the TPG model to reduce the deployment location space and improve the efficiency of learning. The full details are given in Section III. After that, we describe the complete RL model in Section IV.

III. PRELIMINARY SCREENING OF EFFECTIVE DEPLOYMENT LOCATIONS FOR DECEPTION RESOURCES BASED ON THREAT PENETRATION GRAPH
As mentioned in Section II, not every node in TN can be invaded successfully by the attacker. For instance, in Fig. 1, if there is no connection from node n_1 to n_4, the attacker will never have access to node n_4 and hence cannot launch an attack on it. In another case, if there is no vulnerability on node n_2, or if the vulnerabilities on this node are too difficult for the attacker to exploit, the attacker cannot successfully invade and control this node even with access to it. So, if the defender deploys deception resource nodes behind node n_2 or node n_4, the attacker will never be able to invade them. Accordingly, neither node n_2 nor node n_4 can serve as an effective deployment location for deception resources. For this reason, we design a preliminary screening method based on the TPG to derive the effective deployment locations for deception resources and simultaneously eliminate the invalid ones. As a result, the efficiency of our algorithm for selecting the optimal deployment policy is enormously improved.
Traditionally, the penetration paths are generated by using the attack graph (AG), which has two significant drawbacks: (1) it has a rather complex generation process, and hence it is not suitable for large-scale networks; (2) it can only describe the connections among the existing vulnerabilities.
To solve these problems, we propose a two-layer model (TPG), whose lower layer describes the micro-penetration scenario between each node-pair, and the upper layer shows the macro-penetration relationships between each node-pair in the target network. Let G TPG = (G HTPG , G NTPG ) denote the TPG of the target network TN.
Definition 2 (Host Threat Penetration Graph): As the lower layer of the TPG, the host threat penetration graph (HTPG for short) is a directed graph G_HTPG = (N_HTPG, E_HTPG), where N_HTPG is the set of nodes and E_HTPG is the set of edges. A node of N_HTPG can be represented as a tuple (Host, Privilege), which describes the host privileges obtained by the attacker. Specifically, Host denotes the host invaded by the attacker, marked by its IP address, and Privilege denotes the privilege the attacker has obtained on the host, divided into User and Root. An edge of E_HTPG, indicating a single-step penetration attack, can be represented as a triple (Service, Vulnerability, Probability), where Service denotes the service used by the attacker to invade the node, Vulnerability denotes the vulnerability of that service exploited by the attacker, and Probability denotes the probability that the attacker invades the node successfully.

Definition 3 (Network Threat Penetration Graph):
As the upper layer of the TPG, the network threat penetration graph (NTPG for short) is a directed graph G_NTPG = (N_NTPG, E_NTPG), where N_NTPG is the set of nodes and E_NTPG is the set of edges. Specifically, a node of N_NTPG denotes a host in the target network, marked by its IP address. An edge of E_NTPG, indicating the probabilities that the attacker obtains privileges on one node from another node, can be represented as a tuple (U_P, R_P), where U_P denotes the probability of obtaining user privilege and R_P denotes the probability of obtaining root privilege. Obviously, both U_P and R_P are real numbers between 0 and 1.
For brevity, the specific generation method of the TPG is given in the literature [37]. Fig. 4 shows a simple example of a G_TPG.
By using G_TPG, all normal nodes that cannot be invaded successfully by the attacker are regarded as invalid deployment locations for deception resources. For instance, in Fig. 4, if only node n_1 can be accessed by the outside attacker initially, the attacker will never be able to invade and control node n_4. Thus, node n_4 is an invalid deployment location for deception resources. Compared with the traditional attack graph, the TPG generation algorithm has the advantage of low computational complexity due to its layered design. In addition, the TPG itself clearly has a concise visual effect, which makes it easier for the defender to understand the network security state.
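The screening step can be sketched as a reachability search over the NTPG: any node the attacker cannot reach from an external-facing node with non-zero penetration probability is discarded as a deployment location. The edge probabilities below are illustrative, not taken from the paper:

```python
from collections import deque

def effective_locations(ntpg_edges, entry_nodes, threshold=0.0):
    """BFS over the NTPG: a node is a valid deployment location only if
    the attacker can reach it from an external-facing entry node via
    edges whose penetration probability exceeds the threshold."""
    reachable = set(entry_nodes)
    queue = deque(entry_nodes)
    while queue:
        u = queue.popleft()
        for (src, dst), (u_p, r_p) in ntpg_edges.items():
            if src == u and dst not in reachable and max(u_p, r_p) > threshold:
                reachable.add(dst)
                queue.append(dst)
    return reachable

# Hypothetical NTPG: edge -> (U_P, R_P) penetration probabilities.
edges = {("n1", "n2"): (0.7, 0.2),
         ("n2", "n3"): (0.0, 0.5),
         ("n3", "n4"): (0.0, 0.0)}   # n4 cannot be penetrated
print(effective_locations(edges, ["n1"]))  # n4 is screened out
```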

IV. INTELLIGENT DEPLOYMENT POLICY FOR DECEPTION RESOURCES BASED ON REINFORCEMENT LEARNING
As analyzed above, in the ADS, the deception resources deployment is an optimization and decision-making problem with many uncertainties and little prior information. As an important research hotspot of AI, reinforcement learning, which has a strong self-learning ability and can achieve the goal despite uncertainty about the environment it faces, is widely used to solve this kind of problem. In this section, we describe our intelligent deployment policy for deception resources based on RL in detail.

A. MODEL OVERVIEW
According to classical RL theory, the model for selecting the optimal deception resources deployment policy based on RL is as shown in Fig. 5. In our model, the defender is equivalent to the agent, while the target network and the attacker together (containing many uncertainties) are equivalent to the environment, and the NMS can be regarded as the sensor of the environment. At time t, the NMS integrates several alarms, which represent the network security state S_t. The defender then selects the deception resources deployment action A_d^t according to the current network security state S_t. Due to the joint action of the attacker, the defender, and the target network, the network security state changes. At the same time, the defender receives environmental feedback in the form of the reward R_t. Note that state transitions are considered stochastic because the next network security state depends not only on the defender's action but also on uncertainties that the defender cannot control. In this process, the agent (the defender) interacts with the environment (target network and attacker) continuously and finally tries to select an optimal policy, which serves as the guideline for the defender to implement deception resources deployment actions in any network security state. By implementing the optimal policy, the defender can trap the attacker with maximum probability.

B. MODEL REPRESENTATION
Note that the reward signal R in our model indicates what is good in an immediate sense. Roughly speaking, the better the defender's action is, the higher the reward will be. A reward signal defines the goal in an RL problem, and selecting the optimal policy means maximizing the total reward received over the long run. Beyond the defender, the attacker, and the target network, one can identify four main elements of an RL model: state, action, reward, and policy.
Definition 4 (State): Network security state S, as the name suggests, reflects the security situation of the target network. Also, it can reveal the traces of the attacker to the defender. In our ADS, an alarm may be generated by the NMS once the attacker invades a normal node. For this reason, we can use the alarms generated by the NMS to represent the network security state S. As mentioned in Definition 1, k is the number of normal nodes in the target network TN. Given the time t, the network security state can be written as

S_t = (ψ_1^t, ψ_2^t, ..., ψ_k^t), (8)

where ψ_i^t = 1 if the NMS generates an alarm about node n_i, and ψ_i^t = 0 otherwise. Obviously, the size of the network security state space is 2^k. Let S_final^fail and S_final^success denote the final states of defense failure and success, respectively. If the deployment policy of the defender cannot trap the attacker, which means that the attacker achieves his/her attack goal and the defense is unsuccessful, the network security state changes to S_final^fail; if the policy traps the attacker, which means that the defense is successful, the network security state changes to S_final^success.

Definition 5 (Action): Let A_d be the deception resources deployment action of the defender. We assume that the number of deception resource nodes available for deployment is m. Given time t, the deployment action can be written as an m × k matrix

A_d^t = (a_ij^t)_{m×k}, (9)

where a_ij^t = 1 if the deception resource node n_fr^i is deployed behind the normal node n_j at time t, and a_ij^t = 0 otherwise.

Definition 6 (Reward): After the defender implements a deployment action, the environment will give him a reward R. To make a human analogy, rewards are somewhat like success (if high) and failure (if low), whereas values correspond to a more farsighted and refined judgment of how successful or unsuccessful the defender's deployment action is in a particular network security state.
In our model, at time t, we define

R_t = 1, if S_{t+1} = S_final^success; R_t = −1, if S_{t+1} = S_final^fail; R_t = 0, otherwise. (10)

The defender implements the deployment action A_d^t according to the network security state S_t; after that, the network security state transforms to S_{t+1}. When S_{t+1} = S_final^fail, which indicates that the defender has failed, we have R_t = −1. Similarly, when S_{t+1} = S_final^success, which indicates that the defender has succeeded, we have R_t = 1. In other cases, which indicate that the game is not over yet, we have R_t = 0.
Definition 7 (Policy): A policy π_d defines the defender's way of behaving at a given time. Roughly speaking, a policy π_d is a mapping from network security states to actions to be taken by the defender when in those states. That is,

A_d^t = π_d(S_t). (11)

As a result, the size of the policy space is |AD|^|S|, where |AD| is the size of the defender's action space and |S| is the size of the network security state space. We evaluate a policy by the cumulative reward it receives over the long run, as shown in Formula (12). Thus, the policy with the highest cumulative reward is the optimal policy, as shown in Formula (13).
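The state encoding of Definition 4 amounts to a k-bit alarm vector, which can be sketched as follows (hypothetical node naming scheme n1, ..., nk):

```python
def encode_state(alarms, k):
    """Encode the network security state S_t as a k-bit tuple:
    psi_i = 1 iff the NMS raised an alarm about node n_i."""
    return tuple(1 if f"n{i}" in alarms else 0 for i in range(1, k + 1))

# Hypothetical: alarms on n1 and n3 in a 4-node network (cf. Fig. 3a).
state = encode_state({"n1", "n3"}, k=4)
print(state)  # (1, 0, 1, 0); the state space has size 2**k
```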
C. MODEL LEARNING

The task of RL is usually described by the Markov decision process (MDP). However, in this problem, because of the missing and false alarms generated by the NMS, along with the uncertainty of the attacker's strategy, the state transition probability of the MDP is unknown to the defender. In the real world, the target network tends to encounter countless network attacks, which supply our learning model with a large amount of data. As a result, the expected cumulative reward of a policy can be approximated by its average cumulative reward. Accordingly, this paper uses a model-free Q-learning algorithm to learn the optimal policy. The training rule is written symbolically as

Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [r + γ max_{a'} Q̂_{n−1}(s', a')],
α_n = 1 / (1 + visit_n(s, a)),

where Q̂_n(s, a) is the estimated value of Q(s, a), s and a are the state and action updated in the nth cycle, r and s' are the resulting reward and next state, γ is the discount factor, and visit_n(s, a) is the number of times the state-action pair (s, a) has been visited by the training algorithm (including the nth cycle). Obviously, as n increases, α_n decreases, and the difference between Q(s, a) and Q̂_n(s, a) shrinks. The details of the training algorithm based on Q-Learning are shown in Algorithm 1.

Algorithm 1 Training Algorithm Based on Q-Learning
Input: the target network TN, the initial state s_0
Output: the optimal policy π_d
1: Generate the threat penetration graph TPG of TN
2: Generate the deployment action space AD based on the TPG
3: Initialize Q̂(s, a) ← 0 for every state-action pair; s ← s_0
4: repeat
5:    With probability 1 − ε, select the greedy action a = argmax_{a'∈AD} Q̂(s, a'); with probability ε, select an action a from AD randomly
6:    Execute a, then receive the reward r and the new state s'
7:    Update Q̂(s, a) according to the training rule; s ← s'
8: until Q̂ converges
9: return the greedy policy π_d(s) = argmax_{a∈AD} Q̂(s, a)
According to the convergence theorem of the Q-learning algorithm, when each state-action pair is visited infinitely often, the difference between Q(s, a) and Q̂_n(s, a) approaches zero. In addition, the computational complexity of Algorithm 1 is O(n), where n is the number of training steps. For these reasons, the algorithm achieves a fast convergence speed and good stability.
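To make the training rule and ε-greedy selection concrete, here is a minimal, self-contained sketch of the Q-learning loop. The toy one-step environment, action names, and parameter values are illustrative assumptions, not the paper's ADS:

```python
import random

def train(step, actions, s0, n_steps=3000, eps=0.1, gamma=0.9, seed=0):
    """Tabular Q-learning with the decaying rate alpha_n = 1/(1 + visit_n)."""
    rng = random.Random(seed)
    Q = {}       # estimated Q-hat values, default 0
    visits = {}  # visit_n(s, a): how often each pair has been updated
    s = s0
    for _ in range(n_steps):
        # epsilon-greedy: explore with probability eps, else act greedily
        if rng.random() < eps or not any((s, a) in Q for a in actions):
            a = rng.choice(actions)
        else:
            a = max(actions, key=lambda b: Q.get((s, b), 0.0))
        r, s_next, done = step(s, a)
        visits[(s, a)] = visits.get((s, a), 0) + 1
        alpha = 1.0 / (1 + visits[(s, a)])  # decaying learning rate alpha_n
        future = 0.0 if done else gamma * max(Q.get((s_next, b), 0.0) for b in actions)
        Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + future)
        s = s0 if done else s_next
    return Q

# Toy environment: deploying at location 1 always traps the attacker,
# deploying at location 0 always lets the attacker reach the goal.
def step(s, a):
    if a == "deploy_1":
        return 1, "S_success_final", True
    return -1, "S_fail_final", True

Q = train(step, ["deploy_0", "deploy_1"], "s0")
```

After training, the greedy policy extracted from Q prefers "deploy_1" in state "s0", matching the intuition that the trap location should be learned from rewards alone.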

V. EXPERIMENTS AND ANALYSIS
To verify the effectiveness of our method, a real network environment is built for testing. The topology of the experimental network is shown in Fig. 6. The experimental network is divided into four regions: DMZ, Subnet 1, Subnet 2, and Subnet 3. The web server is located in the DMZ network. Subnet 1 is composed of a pad and a host. Subnet 2 includes three hosts. Subnet 3 contains a print server, a file server, and a database server. The service access rules in the target network are shown in Table 1. The attacker controls a host outside the network. Combined with CVSS, we use Nessus to scan the network and obtain the host and vulnerability information, which is shown in Table 2. In particular, Pad and Host 1 cannot be accessed directly from Host 2 and Host 3 through the internet, but they can be accessed from these two hosts through USB or other transmission devices due to some improper operations of users. In addition, Host 4, with a higher security level, is a physically isolated entity.
According to the generation method of TPG, HTPG from the attacker to the web server is shown in Fig. 7 and the NTPG of the whole target network is shown in Fig. 8. NTPG describes all the possible attack actions in the target network that can be implemented by the attacker. We can see that since there are no vulnerabilities on Host 4 and print server, they are absent from this NTPG. Now, if we deploy the deception resource nodes on these two nodes, the attacker will never invade them. In this case, the deployment is meaningless. On the contrary, Pad, Web Server, Host 1, Host 2, Host 3, File Server, and Data Server belong to this NTPG, which means that they may be invaded by attackers at some time. Hence, these nodes can be used as the deployment locations of deception resources. In short, the NTPG can help us to distinguish which nodes of the target network are suitable for deployment of deception resources and which are not.
Furthermore, the curve shown in Fig. 9 relates the reduction in deployment locations achieved by the NTPG to the corresponding reduction in deployment actions. It verifies that the preliminary screening of effective deployment locations for deception resource nodes based on the TPG reduces the action space of the deployment policy and ultimately improves the efficiency of RL. Fig. 8 reveals that there are seven effective deployment locations for deception resource nodes. We then assume that there are two deception resource nodes available for deployment. Hence, the size of the network security state space is 2^7 = 128, and the size of the action space is C(7, 2) = 21. Let the initial state be [0 0 0 0 0 0 0]. The attacker launches a penetration attack against the target network from the external network, then carries out several single-step attacks, and finally invades the attack goal (the database server). When the attacker intrudes into the database server, the state transits to S_final^fail, whereas when the attacker intrudes into any node belonging to N_fr, the state transits to S_final^success. In the simulation experiments, the attacker's strategy is expressed by the attacker's interestingness distribution. Our algorithm is trained solely on network security state transition data (i.e., alarms), without any other data. It should be noted that our simulation conditions are not only easy to implement but also consistent with the actual characteristics of an ADS in real networks.
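The two space sizes quoted above follow from simple counting, which can be checked directly (a trivial sketch; the variable names are illustrative):

```python
import math

# Seven effective deployment locations (from the NTPG) give 2^7 network
# security states; choosing positions for two deception resource nodes
# gives C(7, 2) deployment actions.
k = 7  # effective deployment locations
m = 2  # deception resource nodes available for deployment

state_space_size = 2 ** k            # size of the network security state space
action_space_size = math.comb(k, m)  # number of ways to place the two nodes
```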
To quantitatively evaluate the quality of our policy, this paper proposes an evaluation index, defined as follows.
Definition 9 (Defense Success Probability): Assume the defender implements the policy π_d to deploy the deception resources. The ratio of the number of successful defense experiments to the total number of experiments is defined as the defense success probability (dsp) of policy π_d, labeled dsp(π_d). That is,

dsp(π_d) = num / sum,

where num denotes the number of successful defense experiments and sum denotes the total number of experiments.

Table 3 summarizes the parameters used in our experiments. It should be noted that we treat the false negative rate (FNR) and false positive rate (FPR) of the NMS as variables. To obtain the comparisons, we run experiments under four different settings of FNR and FPR. Moreover, we compare our method with the three other methods described in Section II.C; the results are shown in Figs. 10-13.

Fig. 10 shows the results of the static deployment policy. According to simple combinatorics, there are 21 static deployment policies, owing to the seven effective deployment locations and two deception resource nodes. The simulation results show that the dsp of the static deployment policy is very unstable. Among them, the static deployment policy that places the two deception resource nodes at the first and second deployment locations achieves the best defense effect, with a dsp close to 75.6%. Besides, the results under different settings of FNR and FPR show very similar trends. However, in the real world, the effect of this policy will be greatly affected by the attacker's strategy. Meanwhile, obtaining the optimal static deployment policy, i.e., the one with the largest dsp, is a hard problem. Even worse, a static deployment policy can easily be detected and identified by the attacker.

Fig. 11 shows the results of the random-dynamic deployment policy. Because of the randomness of this policy, its dsp varies greatly from experiment to experiment.
For this reason, the test of this policy is repeated ten times, in which the maximum dsp is close to 75% and the minimum is 15.8%. Although the random-dynamic deployment policy makes it harder for the attacker to identify the locations of the deception resources, it is not a good choice because of its poor effectiveness.

Fig. 12 shows the results of the following-alarm deployment policy. We can immediately see that its effectiveness is stable under specific settings of FNR and FPR. Nonetheless, as FNR and FPR increase, its effectiveness gradually decreases. Given the high FNR and FPR values of NMS in real network environments, this policy cannot be widely used in practical applications.

Fig. 13 shows the results of our method. When the training steps of the Q-learning algorithm number fewer than 2000, the dsp of our method increases rapidly. When the training steps exceed 2000, it is maintained at a high level and keeps improving steadily; with the continuous increase of the training steps, it reaches nearly 80%. As with the static deployment policy, the settings of FNR and FPR have no noticeable effect on the effectiveness of our method. To sum up, our method is superior to the other three deployment policies. Moreover, we record the training time of our method, and the result is shown in Fig. 14, which reveals that the training time of our method increases linearly with the number of training steps, unaffected by the settings of FNR and FPR.

Furthermore, we present a comparison with existing deception resources deployment methods, summarized in Table 4. First, a method with a detailed scenario is convincing and valuable. Second, loose assumptions about the attacker's strategy usually contribute to good practicability. Third, a static deployment policy for deception resources is easy for the attacker to detect and identify, so it is significant to support dynamic deployment.
Then, active deception can improve the probability of trapping the attacker. Finally, only if these methods are easy to deploy can they be widely used.
From the above comparisons, we can conclude that our method is the only one that does not impose strict assumptions and constraints on the attacker's strategy yet still achieves satisfactory properties.
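The defense success probability used throughout these comparisons (Definition 9) reduces to a simple ratio over repeated experiments; as a small sketch with an illustrative outcome list:

```python
def dsp(outcomes):
    """Defense success probability: outcomes is an iterable of booleans,
    True meaning the deception resources trapped the attacker in that run."""
    outcomes = list(outcomes)
    num = sum(outcomes)    # number of successful defense experiments
    total = len(outcomes)  # total number of experiments ("sum" in the text)
    return num / total

example = [True, True, False, True, False]  # 3 successes out of 5 runs
rate = dsp(example)  # 0.6
```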

VI. CONCLUSION
Nowadays, deception-based cyber defense (DCD) plays a significant role in network security. However, current research on DCD pays little attention to selecting the optimal deployment policy for deception resources, which limits the effectiveness of DCD to some extent. To solve this problem, we propose an intelligent deployment policy for deception resources based on RL. Our method can dynamically place deception resources as the network security state changes. Starting with a description of the attacker-defender scenario and the attacker's strategy, we analyze the uncertainties and several deployment policies in detail with formal methods. Next, we propose a preliminary screening method that derives the effective deployment locations for deception resources based on the threat penetration graph (TPG). Finally, we design a Q-learning training algorithm for finding the optimal deployment policy for deception resources. The experimental results show that our method improves the effectiveness of deception resources deployment. From comparisons with similar work, we conclude that our method is the only one that does not impose strict assumptions and constraints on the attacker's strategy yet still achieves satisfactory properties. In the future, we plan to extend our work to support large-scale networks and to study the relationship between the number of deception resource nodes and the effectiveness of DCD.