Strategic Honeypot Deployment in Ultra-Dense Beyond 5G Networks: A Reinforcement Learning Approach

The progression of Software Defined Networking (SDN) and virtualisation technologies is leading to the beyond-5G era, providing multiple benefits to smart economies. However, despite these advantages, security issues remain. In particular, SDN/NFV and cloud/edge computing are associated with various security issues. Moreover, due to their wireless nature, the entities involved are prone to a wide range of cyberthreats. Therefore, the presence of appropriate intrusion detection mechanisms is critical. Although both Machine Learning (ML) and Deep Learning (DL) have improved on typical rule-based detection systems, they require pre-existing labelled datasets, and this kind of data varies with the nature of the respective environment. Another smart solution for detecting intrusions is to use honeypots. A honeypot acts as a decoy with the goal of misleading the cyberattacker and protecting the real assets. In this paper, we focus on Wireless Honeypots (WHs) in ultra-dense networks. In particular, we introduce a strategic honeypot deployment method, using two Reinforcement Learning (RL) techniques: (a) e-Greedy and (b) Q-Learning. Both methods aim to identify the optimal number of honeypots that can be deployed for protecting the actual entities. The experimental results demonstrate the efficacy of both methods.


I. INTRODUCTION
Through the evolution of softwarisation and virtualisation technologies, such as Software Defined Networking (SDN), Network Function Virtualisation (NFV) and cloud/edge computing, 5G has become a digital reality, providing individuals with multiple benefits, such as higher connectivity, lower latency and improved energy efficiency. Most developed countries already offer commercial 5G services. According to the 5G Public-Private Partnership (5G-PPP), 5G will be able to connect approximately seven trillion wireless entities [1]. Therefore, many Internet of Things (IoT) and Industrial IoT (IIoT) applications, such as the smart electrical grid and remote healthcare services, will benefit from 5G. However, the aforementioned technologies are characterised by several security issues [2], [3]. In [1], A. Ahmad et al. provide a detailed overview of the 5G security challenges. Other similar studies are listed in [4], [5]. Moreover, it is noteworthy that despite the security characteristics of 5G, such as sufficient encryption mechanisms, the wireless systems within the Radio Access Network (RAN) have been prone to various cyberthreats since the first generation (1G). Their evolution beyond 5G (B5G) or to 6G can lead to new, sophisticated and complicated cyberattacks with devastating consequences.
Based on the aforementioned remarks, it is evident that the presence of efficient intrusion detection mechanisms is necessary. The rise of Artificial Intelligence (AI) techniques, such as Machine Learning (ML) and Deep Learning (DL), has significantly advanced the conventional signature-based and specification-based Intrusion Detection Systems (IDS). Many studies investigate in detail the efficiency of ML-based and DL-based IDS [6], [7]. In particular, through ML and DL, current IDS are capable of detecting and discriminating unknown anomalies and zero-day cyberattacks. However, in contrast to signature/specification-based IDS, ML-based and DL-based IDS are usually linked to a high number of misclassifications, in the form of False Positive (FP) and False Negative (FN) results. Moreover, ML and DL require a labelled dataset, which can differ from environment to environment. Due to the sensitive nature of this kind of data, publicly available intrusion detection datasets are rare, especially for the 5G domain. Another smart detection mechanism that can contribute to the timely detection of a cyberattacker is a honeypot. A honeypot is an intentional security hole that aims to mislead cyberattackers and protect the real assets. However, it is noteworthy that despite the defensive nature of the honeypot, it can also be used by a cyberattacker to reach the real assets.
The goal of this paper is twofold. First, we focus our attention on deploying honeypots in a strategic manner, taking full advantage of Reinforcement Learning (RL). In particular, the deployment problem is transformed into a Multi-Armed Bandit (MAB) problem, where our goal is to deploy the optimal number of honeypots, taking into account the benefits and costs of the Defender and the Attacker. To this end, we adopt two RL methods: (a) e-Greedy and (b) Q-Learning. Based on the various security events detected by the honeypots or other detection systems, both e-Greedy and Q-Learning converge to the appropriate number of honeypots that should be deployed. Second, we introduce, for the first time, the theoretical framework of the presence and the role of Wireless Honeypots (WHs) in ultra-dense networks. Finally, we evaluate the RL-based honeypot deployment methods with respect to deploying WHs in ultra-dense networks. Consequently, the contribution of this work is summarised in the following bullet points.
• Strategic RL-based Honeypot Deployment: We introduce an RL-based honeypot deployment method, taking advantage of e-Greedy and Q-Learning. The proposed method identifies how many honeypots can be deployed in an infrastructure, taking into account various costs and benefits of the Defender and the Attacker.
• Wireless Honeypots in Ultra-Dense Networks: We introduce, for the first time, the role and use of WHs in ultra-dense networks in order to mitigate security risks. For this purpose, the impact of density in wireless networks is investigated and modelled. Finally, we evaluate the above RL-based honeypot deployment method in ultra-dense networks.
The rest of this paper is organised as follows. Section II provides a background about honeypots and RL. Section III presents some relevant works and discusses our contribution. In Section IV, the concept of the honeypot orchestrator is provided as an RL agent. In Section V, we introduce the strategic RL-based honeypot deployment method. Section VI investigates the deployment of WHs in ultra-dense networks. Next, Section VII focuses on the evaluation analysis with respect to deploying WHs in ultra-dense networks. Finally, Section VIII concludes this work.

A. Honeypot: A Security Trap
Honeypots are assets with no production value that imitate the behaviour of the real assets, thereby protecting them and collecting valuable information about the cyberattackers. In particular, honeypots can be classified into two categories: (a) production honeypots and (b) research honeypots. Production honeypots are placed in the production network, trying to hide the real assets from potential malicious insiders. On the other hand, research honeypots are exposed to public networks like the Internet, attracting potential cyberattackers and collecting important information related to their behaviour. It is noteworthy that any interaction with a honeypot is considered suspicious, since legitimate users do not have any reason to interact with it. Moreover, honeypots can be classified based on the interaction level as (a) Low-Interaction Honeypots (LIH), (b) Medium-Interaction Honeypots (MIH) and (c) High-Interaction Honeypots (HIH). LIH can simulate some network services in terms of the various communication protocols, without completely emulating the network behaviour of the real assets. MIH can better emulate the network behaviour of the real assets, transmitting, for instance, network packets similar to those of the real entity. Finally, HIH represent a complete copy of the real asset, comprising all of its hardware and software characteristics.
Both academia and industry have implemented several honeypots. In particular, Deception Toolkit (DTK) [8] was the first honeypot, released in 1997, emulating known vulnerabilities of UNIX. HoneyBOT [9] is a LIH for Windows Operating Systems (OS), simulating relevant vulnerabilities. Similarly, KFSensor [10] is a commercial honeypot for Windows OS. HoneyD [11] is probably the best-known honeypot, capable of emulating multiple hosts at the same time. Tiny Honeypot [12] is a server-based honeypot, which listens to all Transmission Control Protocol (TCP) ports, logging all interaction activities. Dionaea [13] is written in Python and emulates the MQ Telemetry Transport (MQTT) protocol. Jackpot [12] is related to the Simple Mail Transfer Protocol (SMTP) and aims to combat email spam. Cowrie [14] is a LIH emulating SSH. Conpot [15] is an industrial honeypot emulating multiple relevant protocols like Modbus and IEC 60870-5-104. In addition, a historical overview of WHs is given in [16], where they are defined as nodes that offer wireless access and whose value lies in being probed, attacked, or compromised, letting the attackers interact with them. In more detail, the main goal of WHs is to gather information about the attacks performed on wireless networks and the associated technologies, focusing on the attacks that exploit the weaknesses of wireless technologies, which are mainly due to the use of an unguided transmission medium [17]. The main principles of WHs can be used in several types of networks, including cellular networks, Local Area Networks (LANs), sensor networks and Unmanned Aerial Vehicle (UAV)-based networks [18].
Many supporting tools have been developed in order to analyse the data retrieved from honeypots or to extend their functionalities [19]. In particular, Bait-n-Switch [20] aims to redirect all malicious traffic to a honeypot. Accordingly, Honeynet Security Console (HSC) [21] analyses, correlates and visualises honeypot logs. Honeysnap [22] processes Packet Capture (PCAP) files that were collected by server-based honeypots. GSOC-Honeyweb [12] is devoted to the management of client-based honeypots via a user-friendly environment. Moreover, TraCINg [12] aggregates data from multiple honeypots and correlates this information in order to discover possible worms.
It is noteworthy that many honeypot projects have been organised in order to exploit the benefits of honeypots to the maximum and discover potential zero-day attacks. In particular, the Honeynet Project was started in 1999 to explore and investigate zero-day cyberattacks. Furthermore, the Leurre.com project [23] deployed multiple LIHs in more than 30 countries, aiming at collecting quantitative data related to cyberthreats and vulnerabilities. Accordingly, the NoAH project, coordinated by the Foundation for Research and Technology Hellas (FORTH), deployed an HIH called Argos [24] to enhance the protection of Internet Service Providers (ISPs) and investigate potential zero-day attacks. The mw-collect Alliance project collected information about various malware by deploying multiple Nepenthes sensors [12]. Moreover, Telekom-Fruhwarnsystem [12] was started in 2013 to collect various datasets related to honeypot activities. Finally, H2020 SPEAR [25] and H2020 SDN-microSENSE [26] implemented various industrial honeypots for the smart electrical grid.

B. Reinforcement Learning
The goal of an RL agent is to identify the optimal policy to follow in an environment, based on the various states and the possible actions. An action $a_t$ can be performed at time $t$ in the state $s_t$, thus leading to a new state $s_{t+1}$ and a reward $R(s_t, a_t)$. The optimal policy is the one that maximises the accumulated rewards over time. There are various kinds of RL methods, such as e-Greedy, Thompson Sampling, SARSA, Q-Learning and Deep Q-Learning. In this paper, we focus on e-Greedy and Q-Learning. After defining the environment with respect to the available states and actions, each of the aforementioned methods follows two phases: (a) training and (b) inference. During the training process, the RL model (based on the corresponding method) is trained to identify the best policy. In particular, after initialising the parameters of each method, the RL model starts interacting with the environment, thus reaching new states and obtaining the corresponding rewards. At the end of each episode, the parameters are adjusted appropriately in order to gain a better reward during the next episode. This process is repeated until convergence. Finally, inference follows, which means that the RL model is ready to be used in the environment without further adjusting the parameters of the RL method. More information about the various RL methods is given in [27].

III. RELATED WORK
Several studies have investigated the role of honeypots and relevant optimisation techniques with AI and game theory in order to protect critical organisations and infrastructures. Some of them are listed below [12], [28]-[34]. In particular, in [28], J. Franco et al. provide a survey about honeypots and honeynets for the IoT and IIoT. In [12], M. Nawrocki et al. present a comprehensive study about honeypot software and relevant data analytics. Similarly, in [29], the authors discuss the decoy and security operations of honeypots, presenting a detailed taxonomy. On the other hand, in [30], C. Dalamgkas et al. focus on honeypots related to the smart electrical grid. In [31], C. Kiekintveld et al. present a study about game theory methods used to deploy honeypots in an efficient manner, modelling the behaviour of the attacker and the defender. In [32], L. Shi et al. investigate the performance of honeypots through Petri nets. In [33], W. Zhang et al. present a honeynet composed of multiport honeypots for countering IoT attacks. Finally, in [34], L. Shi et al. provide a blockchain-based dynamic and distributed honeypot. Next, we discuss some relevant works in a more detailed manner and show the novel points of our paper. Each paragraph focuses on a separate paper.
In [35], P. Radoglou-Grammatikis et al. provide TRUSTY, a web-based platform capable of collecting, normalising and processing security logs originating from honeypot applications. In particular, the authors focus mainly on industrial honeypots, using TRUSTY to generate a dataset related to honeypot events. Based on this dataset, a strategic method for deploying honeypots in a smart electrical grid environment is also provided. First, the behaviour of the attacker and the defender is modelled in terms of the various costs and benefits with respect to attacking a real asset or a honeypot. Consequently, the utility functions of the attacker and the defender are defined, respectively. Next, the deployment process is formulated as a Multi-Armed Bandit (MAB) problem with the goal of optimising the utility function of the defender. The MAB problem is solved through the e-Greedy method. The evaluation results demonstrate the efficiency of the proposed deployment method with respect to selecting the optimal number of honeypots.
In [36], P. Diamantoulakis et al. present a sophisticated honeypot deployment method, taking full advantage of game theory. After defining the utility functions of the defender and the attacker, a one-shot game is formulated. For this purpose, the various costs and benefits for the defender and the attacker are determined, respectively. Next, the solution of this game is given by calculating the Nash Equilibrium (NE). If an NE is not available, the decision of the defender is modelled through a non-convex min-max optimisation analysis. Subsequently, the authors investigate a continuous scenario related to the previous one-shot game, in which the defender and the attacker play the one-shot game more than once. Thus, a Bayesian game is modelled, and the corresponding Bayesian NE (BNE) is determined. The simulation results demonstrate the effectiveness of each method regarding the selection of the optimal number of honeypots in a smart electrical grid environment.
In [37], K. Wang et al. introduce a Bayesian honeypot model in order to protect an Advanced Metering Infrastructure (AMI) against Denial of Service (DoS) attacks. In particular, the authors investigate three cases provided by a service provider: (a) a real AMI communication, (b) a honeypot service and (c) an anti-honeypot service. The first two are related to a legitimate user, while the anti-honeypot service refers to actions performed by a cyberattacker in order to recognise the presence of honeypots and bypass them. The goal is to balance the detection rate and the energy consumption. Thus, optimal strategies are defined for the attacker and the defender. Next, several BNEs are identified. The experimental results show that the proposed game can improve both the honeypots' detection rate and the energy consumption.
In [38], Y. Zhang et al. introduce an adaptive honeypot deployment mechanism based on Learning Automata (LA). LA is an RL method used to select an optimal action based on a finite set of actions and the interactions with a random environment. LA can be defined as a tuple of five elements, namely deployment of the necessary defensive mechanisms, while the defence phase refers to the countermeasures applied during the execution of a cyberattack. The proposed method considers the entirety of the nodes as the LA, and each node is considered as an action. Based on the malicious activities and the evolution of the LA, a particular number of honeypots are deployed. The experimental results show the efficacy of the proposed method in terms of the honeypots' detection rate and selecting the appropriate number of honeypots.
In [39], M. Du and K. Wang investigate the role of honeypots against Distributed DoS (DDoS) attacks in SDN environments. First, the authors provide an anti-honeypot strategy capable of identifying the presence of honeypots in an SDN network. In particular, the first step of the anti-honeypot strategy is to identify whether there is a honeypot in the SDN network. Next, the honeypot type is clarified, and the optimal attack strategy is determined. To protect the SDN network from the above anti-honeypot strategy, the authors also provide a Bayesian pseudo-honeypot game with respect to the deployment of various kinds of honeypots in an SDN network. The authors also show the existence of several BNEs and prove that the proposed BNEs can accomplish the optimal equilibrium between the legitimate users and attackers. The evaluation results demonstrate that the proposed method can effectively counter DDoS attacks with low energy consumption.
In [40], U. Bartwal et al. provide a Security Orchestration Automation and Response (SOAR) engine that deploys honeypots based on security events related to DDoS and botnets. In particular, the proposed SOAR engine is composed of ten architectural components: (a) Host Machine, (b) Virtual Machines, (c) Honeypots, (d) Container Registry, (e) Storage, (f) Traffic Tracker, (g) Botnet Detector, (h) DDoS Detector, (i) Orchestration Engine and (j) Access Logs. The orchestration engine is responsible for deploying the honeypots located in the Container Registry based on the security events recognised by the Botnet and DDoS detectors. The detectors adopt both Machine Learning (ML) and signature/specification rules. Initially, no honeypot is deployed. Next, based on the security events, the orchestration engine starts the first honeypots. If the attackers start interacting with the honeypots, new honeypots are deployed by the orchestration engine, thus minimising the attack probability against the real assets.
In [41], W.
Undoubtedly, the previous works provide useful insights, methodologies and tools. Several papers adopt game theory and RL methods in order to deploy honeypots in a strategic manner. Characteristic examples are [36], [37], [39]. However, despite the evaluation results, this kind of modelling cannot easily be adopted in the production mode of real environments. Moreover, the parameters of the game models should be re-adjusted based on the impact of the various security events and alarms. On the other hand, the previous RL methods do not consider the detection of security events through detection mechanisms other than honeypots. Finally, it is worth noting that the current works do not consider the use of WHs in ultra-dense networks. Based on the aforementioned remarks, in this paper, we introduce, for the first time, an RL-based honeypot deployment method that models the behaviour of the Defender with the use of WHs and other detection measures. In terms of 5G, a WH can emulate a vulnerable gNB. The smart deployment of the WHs is modelled as a security game in terms of the costs and benefits of the Defender. The security game is solved through two RL methods, namely: (a) e-Greedy and (b) Q-Learning. Finally, we model and introduce in a theoretical manner the use of WHs in ultra-dense networking environments.

IV. SECURITY ANALYSIS
Based on the aforementioned remarks, Fig. 1 illustrates the goal of our RL-based security agent in the context of an ultra-dense networking environment. Based on the various security events, the RL agent tries to deploy the appropriate number of WHs in order to protect the real access points. In the context of 5G networks, the access points can refer to gNBs. Therefore, the RL agent plays the role of a honeypot orchestrator, deploying the appropriate number of honeypots. To this end, the unique characteristics of the ultra-dense network should be considered. Thus, the utility function of the Defender should take into account not only the security characteristics but also the quality of the network in terms of the services provided. Accordingly, the RL agent interacts with the environment and receives a corresponding reward and state (i.e., a set of observations about the security and the quality of the network) given the new security events. For each security event detected either by WHs or by other detection mechanisms, the honeypot orchestrator is triggered in order to deploy the appropriate number of honeypots.

TABLE I: Notation

| Symbol | Description |
|---|---|
| $N$ | The number of the real access points and honeypots that are connected. |
| $U_D[t]$ | The utility of the Defender at the time interval $t$. |
| $g(\cdot)$ | A function initially expressing the utility of the Defender at the time interval $t$. |
| $S_{D,i}$ | The strategy of the Defender regarding each asset. |
| $S_{A,i}$ | The strategy of the Attacker regarding each asset. |
| $I_{r,i}$ | Denotes whether the attack against the asset $i$ is detected. |
| $\delta_1$ | The benefit of the Defender for each attack against a honeypot. |
| $\delta_2$ | The benefit of the Defender for each detected attack against an access point. |
| $\delta_3$ | The cost of the Defender for each attack not detected. |
| $\theta$ | The probability that an asset is used as a honeypot. |
| $\phi$ | The probability that a real asset is under attack. |
| $\tilde{U}_D$ | The expected value of the Defender's utility. |
| $P_r$ | The probability that an attack against a real asset is detected. |
| $C$ | The cost induced by the use of honeypots. |
| $\tilde{C}$ | The expected value of the cost induced by the use of honeypots. |
| $\lambda$ | The density of the remote radio heads deployment. |
| $L$ | The number of the deployed access points. |
| $M$ | The number of the deployed WHs. |
| $x_i$ | The distance between the $i$-th user and its closest AP. |
| $f_{X,i}(x)$ | The probability density function of $x_i$. |
| $F_{X,i}(x)$ | The cumulative distribution function of $x_i$. |
| $R_i$ | The achievable communication rate of user $i$. |
| $\gamma_i$ | The signal-to-noise ratio at the reference distance of 1 m. |
| $h_i$ | The small scale fading power gain of user $i$. |
| $\beta$ | The path loss exponent. |
| $P_i$ | The transmit power of user $i$. |
| $\sigma^2$ | The power of the additive white Gaussian noise. |
| $L_{ref}$ | The path loss at the reference distance. |
| $R_{t,i}$ | The transmission rate of the $i$-th user. |
| $P_{i,out}$ | The outage probability of the $i$-th user. |
| $Z$ | The number of re-transmissions. |
| $\exp(\cdot)$ | The exponential function. |
| $\kappa$ | The parameter of the exponential distribution. |
| $C_i$ | The cost induced to the $i$-th user due to the use of honeypots. |
| $p_{uc}$ | The price in the unit-commitment stage. |
| $p_{ed}$ | The price in the economic-dispatch stage. |
| $\mu_i$ | The mean energy demand of the $i$-th device. |
| $E_{max}$ | The maximum energy consumption of the $i$-th device. |
| $r$ | The actual energy consumption of the $i$-th device. |
| $S$ | The space of states. |
| $A$ | The space of actions. |
| $s_t$ | The current state at time $t$. |
| $a_t$ | The action performed in the state $s_t$. |
| $R(s_t, a_t)$ | The reward of action $a_t$ in the state $s_t$. |
| $TD$ | Temporal Difference. |
| $SE$ | A set of security events. |
| $a_{LearningRate}$ | The learning rate, which denotes how fast the Q values are updated. |

V. STRATEGIC HONEYPOT DEPLOYMENT WITH REINFORCEMENT LEARNING
We consider the honeypot deployment problem as a security game with two antagonistic players: (a) the Attacker and (b) the Defender. The goal of the Attacker is to attack the real access points, while the Defender intends to deploy the number of honeypots that will provide the maximum protection, taking into account the available computing resources and the behaviour of the Attacker. Let $N$ be the total number of the connected stations that can serve either as honeypots or as access points. The fraction of the $N$ stations used as honeypots is denoted by $\theta$. $S_{D,i} \in \{-1, 1\}$ represents the strategy of the Defender. $S_{D,i}$ equals $-1$ and $1$ when the cyberattack targets a real access point or a honeypot, respectively. Similarly, $\delta_1$ defines the benefit of the Defender for each attack against a honeypot, while $\delta_2$ is the benefit of the Defender for each attack detected without the use of a honeypot. Finally, $\delta_3$ is the cost of the Defender for each attack not detected in a timely manner. For the sake of clarity, Table I summarises the notation.
The utility function of the Defender in a time interval $t$, i.e., $U_D[t]$, is given by Equation 1.
In Equation 1, $I_{r,i}$ is equal to 1 when the attack is detected by the node $i$ and equal to 0 when it is not detected. Of course, when $S_{D,i} = -1$, i.e., the attacked device is a honeypot, $I_{r,i} = 1$, while if $S_{D,i} = 1$, $I_{r,i} \in \{0, 1\}$ is a random variable. Also, $g(\cdot)$ increases in terms of $S_{A,i} I_{r,i}$ and decreases in terms of and . If we assume that the terms of Equation 1 combine linearly, Equation 1 can be written in the form of Equation 2, where $C$ is the cost induced by the use of honeypots (e.g., due to the use of extra resources or due to the degradation of the system's performance).
The best strategy for the Defender is to allocate the honeypots randomly, so that the Attacker will not be able to recognise their presence. Since the Defender cannot know the number of attacks a priori, the goal is to optimise the expected value of $U_D$, denoted by $\tilde{U}_D$. This can be achieved by knowing the probability $\phi$ that each connected device receives an attack and by controlling the probability that an asset is a honeypot, i.e., $\theta$. Thus, the expected value of the Defender's utility function can be written as in Equation 3.
Also, $P_r$ is the probability that an attack is detected without the presence of a honeypot. It is worth noting that in the case where $P_r = 1$, the use of honeypots does not offer any gain to the Defender. Moreover, $g(\theta)$ is related to the expected cost induced by the use of honeypots. Therefore, based on the security events detected by the honeypots and other potential detection mechanisms, such as signature-based detection systems and ML/DL-based classification, our goal is to define the appropriate $\theta$ in order to maximise $U_D[t]$ (Equation 3). Re-defining the appropriate value of $\theta$ for each security event in the time interval $t$ can be expressed as a MAB problem, where exploitation intends to maximise $U_D[t]$ (Equation 3) and exploration aims to check different values of $\theta$ to discover more information about the Attacker. In particular, the deployment process plays the role of the gambler and the various values of $\theta$ represent the slot machines. To solve the MAB problem, we first adopt the e-Greedy method (Algorithm 1), where we usually select the value of $\theta$ providing the maximum mean value of $\tilde{U}_D[t]$, while, with a small probability $e$, other values of $\theta$ are selected in order to discover how Equation 2 varies. However, although e-Greedy is a suitable option for exploration, it sometimes chooses a sub-optimal action at random. Thus, we also use Q-Learning (Algorithm 2) in order to avoid this situation. In both algorithms, Data denotes the input data, while Result indicates the output of the algorithm. The number of honeypots already deployed denotes the current state $s$, and the number of honeypots that can be deployed in a subsequent security event represents the possible actions $a$. In a specific case, all the states are defined in the space $S$, while all the actions are defined in the space $A$. Both $S$ and $A$ rely on $N$. Finally, the reward $R(s_t, a_t)$ of each action $a_t$ performed in the state $s_t$ is given by Equation 2.
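The e-Greedy selection described above can be illustrated with a minimal Python sketch. It is not the authors' Algorithm 1: the function names, the candidate values of $\theta$, all parameter values and, in particular, the quadratic honeypot cost inside `simulated_utility` are hypothetical choices made only so that the example has an interior optimum.

```python
import random


def simulated_utility(theta, rng, n=20, phi=0.5, p_r=0.5,
                      delta1=2.0, delta2=1.0, delta3=3.0, cost_scale=1.0):
    """Hypothetical Defender utility for one security event (an assumption)."""
    m = round(theta * n)                 # number of assets acting as honeypots
    utility = -cost_scale * m * m / n    # assumed quadratic cost of honeypots
    for i in range(n):
        if rng.random() >= phi:          # asset i is not attacked
            continue
        if i < m:                        # a honeypot always detects the attack
            utility += delta1
        elif rng.random() < p_r:         # real asset, attack detected
            utility += delta2
        else:                            # real asset, attack missed
            utility -= delta3
    return utility


def epsilon_greedy(thetas, reward_fn, events=20000, epsilon=0.1, seed=0):
    """Treat each candidate theta as a slot machine and learn its mean reward."""
    rng = random.Random(seed)
    counts = [0] * len(thetas)
    means = [0.0] * len(thetas)
    for _ in range(events):
        if rng.random() < epsilon:       # exploration: try a random theta
            arm = rng.randrange(len(thetas))
        else:                            # exploitation: best mean so far
            arm = max(range(len(thetas)), key=means.__getitem__)
        reward = reward_fn(thetas[arm], rng)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
    best = max(range(len(thetas)), key=means.__getitem__)
    return thetas[best], means


best_theta, mean_rewards = epsilon_greedy([0.0, 0.25, 0.5, 0.75, 1.0],
                                          simulated_utility, events=50000)
print(best_theta, [round(v, 2) for v in mean_rewards])
```

Under these assumed parameters the expected utility is highest when 15 of the 20 assets act as honeypots, so the sketch should settle on $\theta = 0.75$; with a different cost model the preferred value would change.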
The functionality of Q-Learning relies on (a) the $Q(s, a)$ values, (b) the Temporal Difference $TD_t(s_t, a_t)$ (Equation 4) and (c) the Bellman equation (Equation 5). $Q(s, a)$ represents the estimated reward of the action $a$ performed in the state $s$. Next, $TD_t(s_t, a_t)$ expresses the difference between $R(s_t, a_t) + \gamma \max_a Q(s_{t+1}, a)$ and $Q(s_t, a_t)$. The term $R(s_t, a_t) + \gamma \max_a Q(s_{t+1}, a)$ denotes the reward $R(s_t, a_t)$ received by executing the action $a_t$ in the state $s_t$ plus the Q value of the best action executed in the future state $s_{t+1}$, discounted by a factor $\gamma \in [0, 1]$. During the training process, by interacting with the environment, Q-Learning intends to identify a high reward $R(s_t, a_t)$ and increase the respective $Q(s_t, a_t)$. At some point in the course of the training process, Q-Learning will identify all the transitions leading to high rewards and high Q values. At this point, $TD$ will decrease. In order to update the Q values for each security event, the Bellman equation is used. For each new security event detected, the Q values are updated from $t - 1$ (i.e., when the previous security event was received) to $t$ (i.e., the current security event). $a_{LearningRate} \in [0, 1]$ represents the learning rate, which denotes how fast the Q values are updated. Q-Learning is an off-policy method. This means that the actions can be dictated by an action selection policy (i.e., behaviour policy), such as e-Greedy; however, with respect to the training procedure, the greedy option (i.e., target policy) is always used.
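The update rule described above can be sketched in a few lines of Python. This is an illustrative tabular version, not the authors' Algorithm 2: the state is the number of honeypots currently deployed, the action is the number to deploy for the next security event, and the reward table `expected_utility` as well as all hyper-parameter values are hypothetical.

```python
import random


def q_learning(expected_utility, events=100000, alpha=0.2, gamma=0.9,
               epsilon=0.2, seed=0):
    """Tabular Q-Learning with an e-Greedy behaviour policy.

    expected_utility[m] is an assumed reward for deploying m honeypots.
    """
    rng = random.Random(seed)
    n_actions = len(expected_utility)
    q = [[0.0] * n_actions for _ in range(n_actions)]  # q[state][action]
    state = 0                                          # start with no honeypots
    for _ in range(events):
        # Behaviour policy: e-Greedy over the Q values of the current state.
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)
        else:
            action = max(range(n_actions), key=q[state].__getitem__)
        reward = expected_utility[action]
        next_state = action                # honeypots deployed after the action
        # Temporal Difference: R + gamma * max_a Q(s', a) - Q(s, a).
        td = reward + gamma * max(q[next_state]) - q[state][action]
        # Bellman update scaled by the learning rate (greedy target policy).
        q[state][action] += alpha * td
        state = next_state
    return q


def greedy_policy(q):
    """Best action (number of honeypots) for every state."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]


q_table = q_learning([-2.0, -0.5, 0.5, 1.0, 0.2])
print(greedy_policy(q_table))
```

Because the behaviour policy explores while the update always bootstraps from the greedy value of the next state, the sketch mirrors the off-policy property mentioned in the text. With the assumed reward table, deploying three honeypots is the best action from every state.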

A. Communication Network Model
Let us assume a wireless network that consists of $N$ Remote Radio Heads (RRHs), which are deployed according to a Poisson point process (PPP) with density $\lambda$ [42], [43]. Each of the RRHs can operate either as an Access Point (AP) or as a WH. Thus, at a specific time instance, $L$ APs and $M$ WHs are deployed with $L + M = N$, as depicted in Fig. 2. The role of the WHs is to imitate the behaviour of APs in order to attract and directly detect potential attacks. Also, we assume that the network serves $K$ legitimate users, while the potential existence of malicious users who aim to access the real network is also considered. To mitigate their impact, the allocation of APs and WHs is dynamically adjusted and fully controlled by the network coordinator, which communicates with both the APs and the WHs. However, although the WHs imitate the behaviour of APs in order to attract potential attacks, they do not have access to the real network. Also, it is assumed that a normal user will never attempt to access a WH, while a malicious user may try to access either an AP or a WH. Hereinafter, let $L = (1 - \theta)N$ and $M = \theta N$. Moreover, we assume that the APs follow a PPP with density $(1 - \theta)\lambda$, while the WHs also follow a PPP with density $\theta\lambda$. Furthermore, similarly to the scenario considered in the previous section, the Attacker attacks a specific RRH with probability $\phi$. In the considered setup, the density of the WH deployment needs to be specified in order to provide the required level of security without degrading the quality of service offered by the wireless communication network.
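The split of the RRHs into APs and WHs can be illustrated with a short Python sketch. It assumes, for illustration only, that the $N$ RRHs are placed uniformly in a square region (a PPP conditioned on its number of points) and that each RRH is independently marked as a WH with probability $\theta$; independent marking of a PPP with density $\lambda$ yields two PPPs with densities $\theta\lambda$ and $(1 - \theta)\lambda$. The function name `deploy_rrhs` and the parameter values are hypothetical.

```python
import random


def deploy_rrhs(n, theta, side=1000.0, seed=0):
    """Place n RRHs uniformly in a side x side square and mark each as WH or AP."""
    rng = random.Random(seed)
    aps, whs = [], []
    for _ in range(n):
        point = (rng.uniform(0.0, side), rng.uniform(0.0, side))
        if rng.random() < theta:     # RRH operates as a wireless honeypot
            whs.append(point)
        else:                        # RRH operates as a real access point
            aps.append(point)
    return aps, whs


aps, whs = deploy_rrhs(n=10000, theta=0.3)
area = 1000.0 * 1000.0
print(len(aps), len(whs))                 # L and M, with L + M = N
print(len(aps) / area, len(whs) / area)   # empirical AP and WH densities
```

In this sketch $M = \theta N$ holds only on average, since each RRH is marked independently; a coordinator that fixes $M$ exactly would instead pick $\theta N$ RRHs at random.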

B. Defender Utility Function
It is assumed that each user is served by the AP that is closest to the user. The Probability Density Function (PDF) of the distance d_i between user i and its closest AP is given by:
while the Cumulative Distribution Function (CDF) is given by
The achievable rate is given by
where h_i denotes the small-scale fading power gain and γ_i denotes the Signal-to-Noise Ratio (SNR) at the reference distance of 1 m and is given by
with L_ref being the equivalent path-loss. In addition, P_i, σ², and β denote the transmit power, the noise power and the path-loss exponent, respectively. It is assumed that each smart device has N transmission opportunities within an hour to report its demand for the next hour. Assuming that the transmission rate is equal to R_{t,i}, the outage probability after Z re-transmissions can be defined as
where, by following similar steps as in [42],
with f_{H,i}(h) being the PDF of the small-scale fading power gain. By assuming Rayleigh fading, h_i follows the exponential distribution with parameter κ. Thus, (11) can be written as
Hereinafter, it is assumed that β = 4, for which (11) can be written as
When some nodes operate as WHs, more potential attacks are captured; however, the density of nodes that operate as APs decreases, which in turn increases the outage probability and, thus, the estimation error cost.
Taking into account this trade-off, the aim of the Defender is to maximize its utility, given by (3), in which g(θ) is the expected overall cost due to the outage events. In more detail, g is related to the cost induced to each user, denoted by C_i, and can be expressed as
with K being the set of users.
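For reference, the nearest-neighbour distance in a homogeneous two-dimensional PPP has a well-known closed form. With the AP density (1 − θ)λ defined in the communication network model, the standard expressions are given below; they are stated here as the textbook result and are not copied from the paper's own numbered equations.

```latex
f_{d_i}(r) = 2\pi (1-\theta)\lambda \, r \, \exp\!\left(-\pi (1-\theta)\lambda \, r^{2}\right), \qquad r \ge 0,
\qquad
F_{d_i}(r) = 1 - \exp\!\left(-\pi (1-\theta)\lambda \, r^{2}\right).
```

The CDF follows from the probability that a disc of radius r around the user contains no AP, which is exp(−π(1 − θ)λr²) for a PPP; differentiating with respect to r gives the PDF.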
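The trade-off between θ and the outage probability can be explored numerically. The sketch below is a Monte Carlo estimate under assumptions that are not taken from the paper: the rate is taken as log2(1 + γ h d^(−β)), a single transmission attempt is considered (re-transmissions are ignored), and all default parameter values are illustrative. The distance to the nearest AP is sampled by inverting the standard nearest-neighbour CDF of a PPP with density (1 − θ)λ, and Rayleigh fading is modelled with an exponentially distributed power gain.

```python
import math
import random

def outage_probability(theta, lam=1e-3, gamma_ref=1e6, beta=4.0,
                       rate_target=1.0, kappa=1.0, n_samples=20000, seed=0):
    """Estimate P(rate < rate_target) for a user served by its nearest AP.

    Assumed model (not from the paper): rate = log2(1 + gamma_ref * h / d**beta).
    """
    if not 0.0 <= theta < 1.0:
        raise ValueError("theta must lie in [0, 1)")
    rng = random.Random(seed)
    ap_density = (1.0 - theta) * lam
    snr_threshold = 2.0 ** rate_target - 1.0
    outages = 0
    for _ in range(n_samples):
        # Inverse-transform sampling of the nearest-AP distance.
        u = rng.random()
        d = math.sqrt(-math.log(1.0 - u) / (math.pi * ap_density))
        h = rng.expovariate(kappa)  # exponential power gain (Rayleigh fading)
        # Outage when the received SNR falls below the threshold.
        if gamma_ref * h < snr_threshold * d ** beta:
            outages += 1
    return outages / n_samples

for theta in (0.0, 0.4, 0.8):
    print(theta, outage_probability(theta))
```

Under these assumptions, converting more RRHs to WHs thins the AP process, lengthens the typical serving distance and raises the estimated outage probability, which is the effect the Defender's cost term g(θ) is meant to capture.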

C. Cost of Outage Events
Indicatively, to give further insight into the definition of g in real-world applications, the case of the smart grid can be considered, in which the cost might be related to the impact of outage events on dynamic energy management or to the case of equipment failure. To this end, an example that is well known from the existing literature is considered next, which is related to Dynamic Energy Management (DEM). Assuming that the DEM operation is implemented over two consecutive stages, the unit-commitment and economic-dispatch stages, the utility generates and reserves the energy supply based on the estimated energy demand of the consumers. Thus, if the energy supply is over-estimated, the utility needs to pay for the surplus of energy that has been unnecessarily reserved at price p_uc. On the other hand, if the energy supply is under-estimated, the utility needs to buy the energy difference between the actual and the generated energies in the economic-dispatch stage to prevent an under-supply situation [44]. In this case, the expected cost of under- or over-estimating the energy demand of the devices that did not successfully report their demand is given by [44]
where f_{R,i} is the PDF of the actual energy consumption, µ_i is the mean energy demand of the i-th device, E_max is the maximum energy consumption, and p_uc and p_ed are the energy prices in the unit-commitment and economic-dispatch stages, respectively.
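To make the cost structure concrete, the sketch below evaluates one plausible form of the expected cost numerically. It assumes that the utility reserves the mean demand µ_i for a device that failed to report, pays p_uc per unit of surplus when the actual consumption is below µ_i, and pays p_ed per unit of shortfall when it is above. This form is an assumption inferred from the verbal description and may differ from the expression in [44]; the function name and the uniform consumption PDF in the usage example are hypothetical.

```python
def expected_estimation_cost(pdf, mu, e_max, p_uc, p_ed, n_cells=1000):
    """Midpoint-rule integral of the over/under-estimation cost on [0, e_max].

    Assumed form: p_uc * E[(mu - R)+] + p_ed * E[(R - mu)+], with R ~ pdf.
    """
    width = e_max / n_cells
    cost = 0.0
    for k in range(n_cells):
        x = (k + 0.5) * width                        # midpoint of the k-th cell
        if x < mu:
            cost += p_uc * (mu - x) * pdf(x) * width  # surplus reserved in unit commitment
        else:
            cost += p_ed * (x - mu) * pdf(x) * width  # shortfall bought in economic dispatch
    return cost

# Usage: consumption uniform on [0, 8], mean demand 4, illustrative prices.
cost = expected_estimation_cost(lambda x: 1.0 / 8.0, mu=4.0, e_max=8.0,
                                p_uc=1.0, p_ed=3.0)
print(cost)
```

For the uniform example, each of the two expectations equals E_max/8, so the cost reduces to (p_uc + p_ed)·E_max/8, which gives a quick sanity check on the numerical result.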

VII. EVALUATION ANALYSIS
This section focuses on evaluating the proposed RL honeypot deployment methods, (a) e-Greedy and (b) Q-Learning, with respect to the number of WHs in ultra-dense networks. To the best of our knowledge, this is the first work related to honeypots in ultra-dense networks. Therefore, there are no publicly available datasets that can be used in the context of the evaluation analysis. To this end, we use the Honeypot Intrusion Detection Dataset of our previous work in [35]. Furthermore, it is noteworthy that the e-Greedy method of [35] was appropriately adjusted in the context of this work based on the parameters of the ultra-dense networks. The aforementioned dataset includes network traffic data and relevant network flow statistics collected over one year from various research honeypots. This data was used to create a simulation environment, identifying the values of δ_1, δ_2, δ_3, P_r, N, e and C(θ) given the communication network model of subsection VI-A. Since the dataset of our work in [35] is related to smart electrical systems, it can be utilised in the context of this work, taking into consideration the modelling and assumptions of section VI. Moreover, since this dataset is already available, the security events are taken to occur one per second. Each network flow of the dataset corresponds to a security event. Thus, for each security event, we consider how many WHs will be deployed.
We consider a simulation environment where N = 6. Therefore, we can deploy up to six WHs based on the available APs. Regarding the other parameters, various values were checked during our experiments. First, with respect to the e-Greedy method, we investigate how the PDF of U_D[t] ranges based on Equation 2. Figs. 5 to 14 show how the PDF of U_D[t] ranges after 5, 10, 20, 50, 100, 200, 500, 1000, 1500 and 2000 security events. After 2000 security events, we see that the best option is to deploy 2 WHs. Moreover, Fig. 3 shows the accuracy of the e-Greedy model, compared with random choice, as a function of the number of security events. Although, due to randomness, the accuracy of the random model appears to increase, e-Greedy achieves a better accuracy. Finally, Fig. 4 shows the cumulative reward over the iterations of Q-Learning for 2000 security events.
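The e-Greedy experiment can be mimicked with a small multi-armed bandit sketch in which each arm is a candidate number of WHs (0 to 6 for N = 6) and one arm is pulled per security event. The reward function `toy_utility` is a purely hypothetical stand-in for the Defender utility: it is constructed to peak at 2 WHs so that the example echoes the reported outcome, and it does not reproduce the dataset-driven utility of the paper.

```python
import random

def e_greedy_deployment(reward_fn, n_arms=7, n_events=2000, epsilon=0.1, seed=0):
    """e-Greedy over the number of WHs; returns per-arm mean rewards and pull counts."""
    rng = random.Random(seed)
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for _ in range(n_events):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                       # explore
        else:
            arm = max(range(n_arms), key=lambda m: means[m])  # exploit best mean so far
        reward = reward_fn(arm, rng)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]     # incremental mean
    return means, counts

def toy_utility(n_wh, rng):
    """Hypothetical noisy utility that peaks when 2 WHs are deployed."""
    return 1.0 - 0.2 * abs(n_wh - 2) + rng.gauss(0.0, 0.05)

means, counts = e_greedy_deployment(toy_utility)
print("pull counts per number of WHs:", counts)
```

The normalised pull counts play the role of the honeypot distributions plotted after a given number of security events: as more events are processed, the mass concentrates on the arm with the highest estimated utility.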

VIII. CONCLUSIONS
The evolution of 5G technology has brought IoT and IIoT applications into the 5G era. However, security issues still remain. In this paper, we investigate the use of WHs in ultra-dense networks. In particular, we introduce a strategic honeypot deployment method that takes full advantage of two RL methods, namely (a) e-Greedy and (b) Q-Learning. The deployment process is converted into a MAB problem with the goal of deploying the optimal number of WHs in an ultra-dense environment, taking into account the costs and benefits of the Defender. The evaluation results demonstrate the efficiency of the proposed methods. Our future work will focus on investigating more complex RL techniques for using WHs in the 5G-RAN, 5G Core and B5G networks.
(a) actions, (b) rewards, (c) states, (d) state transfer function and (e) output function. An attack-defence scenario is formed with two players: (a) attacker and (b) defender. The actions of the attacker fall into two main phases: (a) the preparation phase and (b) the attack phase. The first one refers to the preparation activities before the execution of the attack, while the attack phase denotes the actual malicious activities. On the other hand, the actions of the defender can also be classified into two main phases: (a) the planning phase and (b) the defending phase. The planning phase indicates the

This article has been accepted for publication in IEEE Transactions on Emerging Topics in Computing. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/TETC.2022.3184112. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

Fan et al. present HoneyDoc, an SDN-based architecture for honeypot deployment. The architectural model of SDN consists of three main planes: (a) Data Plane, (b) Control Plane and (c) Application Plane. The data plane refers to the physical and virtualised entities connected to SDN switches. Next, the control plane is devoted to the SDN controllers responsible for orchestrating and managing the SDN switches. Finally, the application plane refers to the SDN applications that can interact with the SDN controller. HoneyDoc is composed of three main modules: (a) Decoy Manager, (b) Captor Manager and (c) Orchestration Core. The Decoy Manager is responsible for deploying the various honeypots, including LIH, MIH and HIH. All the honeypots are located in the control plane. Next, the Captor Manager refers to an SDN application consisting of three submodules, namely (a) Data Capture, (b) Data Control and (c) Data Analysis, responsible for capturing, controlling and analysing the honeypot data, respectively. Finally, the Orchestration Core is located in the Control Plane and is responsible for coordinating the actions of the Decoy and Captor Managers.

Fig. 1. RL-based Security Game: Deploying a Number of Honeypots in Ultra-Dense Networks

mean then
        max_mean = meanθMatrix[θ];
        θ_selected = θ;
      end
    end
  end
end

Algorithm 2: Q-Learning Honeypot Deployment
Data: Q(S, A), γ, a_learningRate, SE, securityEventCounter
Result: a_action
γ = 0.9;
a_learningRate = 0.1;
SE = init();
securityEventCounter = 0;
for s ∈ S do
  for a ∈ A do
    Q(s, a) = 0;
  end
end
for securityEventCounter ← 1 to SE by 1 do
  s_t = random();
  a_t = e-Greedy();

Fig. 12. Honeypots Distribution after 1000 security events