Learning Near-Optimal Intrusion Responses Against Dynamic Attackers

We study automated intrusion response and formulate the interaction between an attacker and a defender as an optimal stopping game where attack and defense strategies evolve through reinforcement learning and self-play. The game-theoretic modeling enables us to find defender strategies that are effective against a dynamic attacker, i.e. an attacker that adapts its strategy in response to the defender strategy. Further, the optimal stopping formulation allows us to prove that optimal strategies have threshold properties. To obtain near-optimal defender strategies, we develop Threshold Fictitious Self-Play (T-FP), a fictitious self-play algorithm that learns Nash equilibria through stochastic approximation. We show that T-FP outperforms a state-of-the-art algorithm for our use case. The experimental part of this investigation includes two systems: a simulation system where defender strategies are incrementally learned and an emulation system where statistics are collected that drive simulation runs and where learned strategies are evaluated. We argue that this approach can produce effective defender strategies for a practical IT infrastructure.


I. INTRODUCTION
An organization's security strategy has traditionally been defined, implemented, and updated by domain experts [1]. This approach can provide basic security for an organization's communication and computing infrastructure. As infrastructure update cycles become shorter and attacks increase in sophistication, meeting the security requirements becomes increasingly difficult. To address this challenge, significant efforts have started to automate the process of obtaining security strategies [2]. Examples of this research include: computation of defender strategies using dynamic programming and control theory [3], [4]; computation of exploits and corresponding defenses through evolutionary methods [5], [6]; computation of defender strategies through game-theoretic methods [7], [8]; derivation of defender responses through causal inference [9]; use of machine learning techniques to estimate model parameters and strategies [10]-[12]; automated creation of threat models [13]; and identification of infrastructure vulnerabilities through attack simulations and threat intelligence [14], [15].
A promising new direction of research is to automatically learn security strategies through reinforcement learning methods [16], whereby the problem of finding security strategies is modeled as a Markov decision problem and strategies are learned through simulation (see surveys [17], [18]).

Fig. 1: The IT infrastructure and the actors in the intrusion response use case.

While encouraging results have been obtained following this approach [10]-[12], [19]-[57], key challenges remain [58]. Chief among them is narrowing the gap between the environment where strategies are evaluated and a scenario playing out in a real system. Most of the results obtained so far are limited to simulation environments, and it is not clear how they generalize to practical IT infrastructures. Another challenge is to obtain security strategies that are effective against a dynamic attacker, i.e., an attacker that adapts its strategy in response to the defender strategy. Most of the prior work has used reinforcement learning to find effective defender strategies against static attackers, and little is known about how the learned strategies perform against a dynamic attacker.
In this paper, we address the above challenges and present a novel framework to automatically learn a defender strategy against a dynamic attacker. We apply this framework to an intrusion response use case, which involves the IT infrastructure of an organization (see Fig. 1). The operator of this infrastructure, which we call the defender, takes measures to protect it against an attacker while providing services to a client population.
We formulate the intrusion response use case as an optimal stopping game, namely a stochastic game where both players face an optimal stopping problem [59]-[61]. This formulation enables us to gain insight into the structure of optimal strategies, which we prove to have threshold properties. To obtain effective defender strategies, we use reinforcement learning and self-play. Based on the threshold properties, we design Threshold Fictitious Self-Play (T-FP), an efficient algorithm that iteratively computes near-optimal defender strategies against a dynamic attacker.

Fig. 2: Our framework for finding and evaluating intrusion response strategies [10].
Our method for learning and evaluating strategies for a given infrastructure includes building two systems (see Fig. 2). First, we develop an emulation system where key functional components of the target infrastructure are replicated. This system closely approximates the functionality of the target infrastructure and is used to run attack scenarios and defender responses. Such runs produce system measurements and logs, from which we estimate infrastructure statistics, which are then used to instantiate the simulation model.
Second, we build a simulation system where game episodes are simulated and strategies are incrementally learned through self-play. Learned strategies are extracted from the simulation system and evaluated in the emulation system.
We believe that this paper provides a foundation for the next generation of security systems, including Intrusion Prevention Systems (IPSs) (e.g., Trellix [79]), Intrusion Response Systems (IRSs) (e.g., Wazuh [80]), and Intrusion Detection Systems (IDSs) (e.g., Snort [81]). The optimal stopping strategies computed through our framework can be used in these systems to decide at which point in time an automated response action should be triggered or at which point in time a human operator should be alerted to take action.
The work in this paper builds on our earlier results in automated intrusion response [10], [12], [82]. Specifically, this paper can be seen as a generalization of the work in [10], where we investigate intrusion response against a static attacker. As explained in this paper, intrusion response against a dynamic attacker requires a different mathematical framework. An extended abstract of this paper was presented at the "Machine learning for cyber security" workshop at the International Conference on Machine Learning (ICML) 2022 [82].

II. THE INTRUSION RESPONSE USE CASE
We consider an intrusion response use case that involves the IT infrastructure of an organization. The operator of this infrastructure, which we call the defender, takes measures to protect it against an attacker while providing services to a client population (Fig. 1). The infrastructure includes a set of servers that run the services and an Intrusion Detection and Prevention System (IDPS) that logs events in real-time. Clients access the services through a public gateway, which is also open to the attacker. The attacker's goal is to intrude on the infrastructure and compromise its servers. To achieve this, the attacker explores the infrastructure through reconnaissance and exploits vulnerabilities while avoiding detection by the defender. The attacker decides when to start an intrusion and may stop the intrusion at any moment. During the intrusion, the attacker follows a pre-defined strategy. When deciding the time to start or stop an intrusion, the attacker considers both the gain of compromising additional servers and the risk of detection. The optimal strategy for the attacker is to compromise as many servers as possible without being detected.
The defender continuously monitors the infrastructure by accessing and analyzing IDPS alerts and other statistics. It can take a fixed number of defensive actions, each of which has a cost and a chance of stopping an ongoing attack. An example of a defensive action is to drop network traffic that triggers IDPS alerts of a certain priority. The defender takes actions in a pre-determined order, starting with the action that has the lowest cost. The final action blocks all external access to the gateway, which disrupts any intrusion as well as the services to the clients.
When deciding the time for taking a defensive action, the defender balances two objectives: (i) maintain services to its clients; and (ii) stop a possible intrusion at the lowest cost. The optimal strategy for the defender is to monitor the infrastructure and maintain services until the moment when the attacker enters through the gateway, at which time the attack must be stopped at minimal cost through defensive actions. The challenge for the defender is to identify this precise moment.

III. FORMALIZING THE INTRUSION RESPONSE USE CASE
We formulate the above intrusion response use case as a partially observed stochastic game. The attacker wins the game when it can intrude on the infrastructure and hide its actions from the defender. Similarly, the defender wins the game when it manages to stop an intrusion. It is a zero-sum game, which means that the gain of one player equals the loss of the other player.
The attacker and the defender have different observability in the game. The defender observes alerts from an Intrusion Detection and Prevention System (IDPS) but has no certainty about the presence of an attacker or the state of a possible intrusion. The attacker, on the other hand, is assumed to have complete observability. It has access to all the information that the defender has access to, as well as the defender's past actions. This means that the defender has to find strategies that are effective against an opponent that has more knowledge than itself.
The reward function of the game encodes the defender's objective. An optimal defender strategy maximizes the reward when facing an attacker with an optimal strategy, i.e., a worst-case attacker. Similarly, an optimal attacker strategy minimizes the reward when facing a worst-case defender. Such a pair of optimal strategies is known as a Nash equilibrium in game theory [83].
We model the game as a finite, zero-sum Partially Observed Stochastic Game (POSG) with one-sided partial observability. It is a discrete-time game that starts at time t = 1 and ends at time t = T. In the following, we describe the components of the game, its evolution, and the objectives of the players.
Players N. The game has two players: player D is the defender and player A is the attacker. Hence, N = {D, A}.

Time horizon T. The time horizon T is a random variable that depends on both players' strategies and takes values in {2, 3, ..., ∞}.
State space S. The game has three states: s_t = 0 if no intrusion occurs, s_t = 1 if an intrusion is ongoing, and s_t = ∅ if the game has ended. Hence, S = {0, 1, ∅}. The state at time t, s_t, is a realization of the random variable S_t. The initial state is s_1 = 0; hence the initial state distribution ρ_1 : S → [0, 1] is the degenerate distribution ρ_1(0) = 1.
Action spaces A_i. Each player i ∈ N can invoke two actions: "stop" (S) and "continue" (C). The action spaces are thus A_D = A_A = {S, C}. Executing action S triggers a change in the game, while action C is a passive action. In the following, we encode S with 1 and C with 0.
The attacker can invoke the stop action twice: the first time to start the intrusion and the second time to terminate it.
The defender can invoke the stop action L ≥ 1 times. A stop action is a defensive action against a possible intrusion. The number of stop actions remaining to the defender is known to both players and is denoted by l ∈ {1, ..., L}.
At each time-step, the attacker and the defender simultaneously choose an action a_t = (a_t^(D), a_t^(A)), where a_t is a realization of the random vector A_t.

Observation space O. The attacker has complete observability and knows the game state, the defender's actions, and the defender's observations. In contrast, the defender has a limited set of observations o_t ∈ O, where O is a discrete set. (In our use case, o_t relates to the weighted sum of IDPS alerts triggered during time-step t. We focus on the IDPS alert metric as it provides more information than other possible metrics for detecting intrusions; see Appendix D for details.) Both players have perfect recall, meaning they remember their respective play histories. The history of the defender at time-step t is the vector h_t^(D) of its past actions, beliefs, and observations, and the history of the attacker is the vector h_t^(A), which in addition includes the past game states.

Transition probabilities T. The state transitions (2)-(6) are

T(0 | 0, (a_t^(D), C)) = 1,        if l_t − a_t^(D) > 0        (2)-(3)
T(1 | 1, (a_t^(D), C)) = 1 − φ_l,  if l_t − a_t^(D) > 0
T(1 | 0, (a_t^(D), S)) = 1,        if l_t − a_t^(D) > 0        (4)
T(∅ | 1, (a_t^(D), C)) = φ_l,      if l_t − a_t^(D) > 0        (5)-(6)
T(∅ | s_t, a_t) = 1,               if l_t − a_t^(D) = 0 or (s_t = 1 and a_t^(A) = S)

where T_{l>1} and T_{l=1} refer to the transition probabilities when l > 1 and l = 1, respectively (T denotes the transition probabilities for any value of l). All other state transitions have probability 0.
(2)-(3) define the probabilities of the recurrent state transitions 0 → 0 and 1 → 1. The game stays in state 0 with probability 1 if the attacker selects action C and l_t − a_t^(D) > 0. Similarly, the game stays in state 1 with probability 1 − φ_l if the attacker chooses action C and l_t − a_t^(D) > 0. Here φ_l denotes the probability that the intrusion is stopped, which is a parameter of the use case. The intrusion can be stopped at any time-step as a consequence of previous stop actions by the defender. We assume that φ_l increases with each stop action that the defender takes.
(4) captures the transition 0 → 1, which occurs when the attacker chooses action S and l_t − a_t^(D) > 0. (5)-(6) define the probabilities of the transitions to the terminal state ∅, which occur in three cases: (i) when the defender takes its final stop action (l_t − a_t^(D) = 0); (ii) when the intrusion is stopped by the defender, with probability φ_l; and (iii) when s_t = 1 and the attacker terminates the intrusion (a_t^(A) = S). The evolution of the game can be described with the state transition diagram in Fig. 3. The figure captures a game episode, which starts at t = 1 and ends at t = T.
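To make the transition rules described above concrete, they can be sketched in code as follows. This is an illustrative sketch, not part of the paper's artifacts: the function name and the φ_l values are hypothetical, and the terminal state ∅ is represented by None.

```python
import random

# Hypothetical values for phi_l, the probability that previous stop
# actions end an ongoing intrusion; assumed to increase as the
# defender uses up stops (l counts the stops remaining).
PHI = {1: 0.5, 2: 0.35, 3: 0.2}

def next_state(s, a_d, a_a, l):
    """Sample the successor state. States: 0 = no intrusion,
    1 = intrusion ongoing, None = terminal (the paper's symbol ∅)."""
    if s is None:
        return None
    if l - a_d == 0:                 # defender's final stop ends the game
        return None
    if s == 0:
        return 1 if a_a == 1 else 0  # attacker's stop starts the intrusion
    # s == 1
    if a_a == 1:                     # attacker terminates the intrusion
        return None
    if random.random() < PHI[l]:     # previous defender stops take effect
        return None
    return 1
```

The deterministic branches mirror transitions (4)-(6); only the 1 → ∅ transition caused by earlier stop actions is stochastic.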
Reward function R. At time-step t, the defender receives the reward r_t = R(s_t, (a_t^(D), a_t^(A))) and the attacker receives the reward −r_t. Here r_t ∈ R is a realization of the random variable R_t. The reward function R is parameterized by the defender's reward for stopping an intrusion (R_st ∈ R_{>0}), the defender's cost of taking a defensive action (R_cost ∈ R_{<0}), and the defender's cost while an intrusion occurs (R_int ∈ R_{<0}). (7)-(8) state that the reward is zero in the terminal state and when the attacker terminates an intrusion. (9) states that the defender incurs no cost when no attack occurs and it does not take a defensive action. (10) indicates that the defender incurs a cost when taking a defensive action if no intrusion is ongoing. (11) states that the defender receives a reward when taking a stop action while an intrusion occurs. Lastly, (12) indicates that the defender incurs a cost for each time-step during which an intrusion occurs.
Player strategies π_i. A defender strategy is a function π_D ∈ Π_D = {1, ..., L} × B → ∆(A_D), where B is the defender's belief space (defined below) and ∆(A_D) denotes the set of probability distributions over A_D. Similarly, an attacker strategy is a function π_A ∈ Π_A = {1, ..., L} × B × S → ∆(A_A). The strategies of both players depend on l but are independent of t (i.e., strategies are stationary). If π_i always maps to an action with probability 1, it is called pure; otherwise it is called mixed. In other words, a pure strategy is deterministic and a mixed strategy is stochastic.
Belief update. At time-step t > 1, the defender updates the belief state b_{t−1} using the equation

b_t(s_t) = C^{−1} f_{O|s_t}(o_t) Σ_{s_{t−1} ∈ S} Σ_{a^(A) ∈ A_A} b_{t−1}(s_{t−1}) π_A(a^(A) | s_{t−1}) T(s_t | s_{t−1}, (a_{t−1}^(D), a^(A)))    (15)

where C is a normalizing factor that ensures that the sum of b_t(s_t) over all s_t equals 1. The initial belief is b_1(0) = 1.
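The belief update is the standard Bayesian filter for a POMDP and can be sketched as follows; the function and argument names are illustrative stand-ins for the quantities in the update above, and the model components are passed in as callables.

```python
def update_belief(b, o, a_d, f_o_given_s, trans, pi_A,
                  states=(0, 1), actions=(0, 1)):
    """Bayesian belief update for the defender.
    b: dict state -> probability (prior belief b_{t-1})
    o: current observation; a_d: defender's previous action
    f_o_given_s(o, s): observation likelihood
    trans(s_next, s, a_d, a_a): transition probability
    pi_A(a_a, s): attacker strategy
    Returns the posterior belief b_t."""
    unnorm = {}
    for s_next in states:
        total = 0.0
        for s in states:
            for a_a in actions:
                total += b[s] * pi_A(a_a, s) * trans(s_next, s, a_d, a_a)
        unnorm[s_next] = f_o_given_s(o, s_next) * total
    C = sum(unnorm.values())          # normalizing factor
    return {s: p / C for s, p in unnorm.items()}
```

Since the game has only two non-terminal states, the belief is fully described by b(1), the probability that an intrusion is ongoing.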
Objective functions J_i. The goal of the defender is to maximize the expected discounted cumulative reward over the time horizon T; the goal of the attacker is to minimize the same quantity. The objective functions J_D and J_A are therefore

J_D(π_D, π_A) = E_{(π_D,π_A)}[Σ_{t=1}^{T} γ^{t−1} R_t]    (16)
J_A(π_D, π_A) = −J_D(π_D, π_A)    (17)

where γ ∈ [0, 1) is the discount factor and E_{(π_D,π_A)} denotes the expectation of the random variables (S_t, O_t, A_t)_{t ∈ {1,...,T}} under the strategy profile (π_D, π_A).
Best response strategies π̃_i. A defender strategy π̃_D ∈ Π_D is called a best response against π_A ∈ Π_A if it maximizes J_D (16). Similarly, an attacker strategy π̃_A is called a best response against π_D if it minimizes J_D (17). Hence, the best response correspondences B_D and B_A are

B_D(π_A) = argmax_{π_D ∈ Π_D} J_D(π_D, π_A)    (18)
B_A(π_D) = argmin_{π_A ∈ Π_A} J_D(π_D, π_A)    (19)

Optimal strategies π*_i. An optimal defender strategy π*_D is a best response against any attacker strategy that minimizes J_D. Similarly, an optimal attacker strategy π*_A is a best response against any defender strategy that maximizes J_D. Hence, when both players follow optimal strategies, they play best responses against each other:

π*_D ∈ B_D(π*_A) and π*_A ∈ B_A(π*_D)    (20)

Since no player has an incentive to change its strategy in this case, (π*_D, π*_A) is a Nash equilibrium. Table 1 summarizes the notation of our mathematical model.

IV. GAME-THEORETIC ANALYSIS AND OUR ALGORITHM FOR FINDING NEAR-OPTIMAL DEFENDER STRATEGIES
Finding optimal strategies that satisfy (20) is equivalent to finding a Nash equilibrium of the POSG Γ (1). We know from game theory that Γ has at least one mixed Nash equilibrium [83]-[86]. (A Nash equilibrium is called mixed if one or more players follow mixed strategies.) In this section, we first analyze the structure of Nash equilibria of Γ using optimal stopping theory, and then we describe an efficient reinforcement learning algorithm for approximating these equilibria.

A. Analyzing Best Responses using Optimal Stopping Theory
The equilibria of Γ can be obtained by finding the pairs of strategies that are best responses against each other (20). A best response for the defender is obtained by solving a POMDP M_P, and a best response for the attacker is obtained by solving an MDP M. The corresponding Bellman equations are [87]

V*_{l,π_A}(b_t) = max_{a^(D) ∈ A_D} E[R_t + γ V*_{l−a^(D),π_A}(B_{t+1}) | b_t, a^(D)]    (21)
V*_{l,π_D}(s_t) = min_{a^(A) ∈ A_A} E[R_t + γ V*_{L_{t+1},π_D}(S_{t+1}) | s_t, a^(A)]    (22)

where V*_{l,π_A} is the value function of the POMDP M_P given that the attacker follows strategy π_A and the defender has l stops remaining, and V*_{l,π_D} is the value function of the MDP M given that the defender follows strategy π_D and has l stops remaining.

Fig. 4: Stopping times of the defender and the attacker in a game episode. The bottom horizontal axis represents time; the black circles on the middle and upper axes represent the time-steps of the defender's and the attacker's stop actions, respectively; τ_{i,j} denotes the jth stopping time of player i; the cross marks the time at which the intrusion is stopped. An intrusion starts when the attacker takes its first stop action (at time τ_{A,1}); an episode ends either when the attacker is stopped (as a consequence of defender actions) or when the attacker terminates its intrusion.
Since the game is zero-sum, stationary, and discounted with γ < 1, it follows from the minimax theorem in game theory that the game has a value and that a value function exists. We interpret the POMDP M_P and the MDP M that determine the best response strategies as optimal stopping problems (see Fig. 4) [10], [59], [91], [92]. Consequently, an optimal solution to M_P (or M) is also an optimal solution to the corresponding stopping problem, and vice versa.
The problem for the defender is to find a stopping strategy π*_D(b_t) → {S, C} that maximizes J_D (16) and prescribes the optimal stopping times τ*_{D,1}, τ*_{D,2}, ..., τ*_{D,L}. Similarly, the problem for the attacker is to find a stopping strategy π*_A(s_t, b_t) → {S, C} that minimizes J_D (17) and prescribes the optimal stopping times τ*_{A,1} and τ*_{A,2}. Given a pair of stopping strategies (π_D, π_A) and their (pure) best responses π̃_D ∈ B_D(π_A) and π̃_A ∈ B_A(π_D), we define two subsets of B = [0, 1]: the stopping sets and the continuation sets.
The stopping sets S^(D) and S^(A) include the belief states where S is a best response:

S^(D) = {b(1) ∈ B | π̃_D(b(1)) = S},    S^(A)_s = {b(1) ∈ B | π̃_A(s, b(1)) = S}, s ∈ {0, 1}.

Similarly, the continuation sets C^(D) and C^(A) contain the belief states where C is a best response:

C^(D) = B \ S^(D),    C^(A)_s = B \ S^(A)_s, s ∈ {0, 1}.

Based on [90, Thm. 12], we obtain the following result.

Theorem 1. (A) The game Γ has a mixed Nash equilibrium. (B) Assume that the observation distributions f_{O|0} and f_{O|1} are totally positive of order 2 (TP-2). Then, given an attacker strategy π_A ∈ Π_A, there exist thresholds α̃_l ∈ [0, 1], l ∈ {1, ..., L}, such that a best response for the defender with l stops remaining satisfies π̃_D(b(1)) = S ⟺ b(1) ≥ α̃_l. (C) Given a defender strategy π_D ∈ Π_D, there exist thresholds β̃_{0,l}, β̃_{1,l} ∈ [0, 1] such that a best response for the attacker satisfies π̃_A(0, b(1)) = S ⟺ π_D(S | b(1)) ≤ β̃_{0,l} and π̃_A(1, b(1)) = S ⟺ π_D(S | b(1)) ≥ β̃_{1,l}.

Proof. See Appendix A.
Theorem 1 tells us that Γ has a mixed Nash equilibrium. Further, under assumptions generally met in practice, the best response strategies have threshold properties (see Fig. 5). In the following, we describe an algorithm that leverages these properties to efficiently approximate Nash equilibria of Γ.
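A pure defender best response with the threshold structure of Theorem 1 can be sketched in a few lines. The threshold values below are hypothetical; they are chosen so that the stopping probability decreases with the number of remaining stops l, in line with the behavior reported in §VI.

```python
# Hypothetical thresholds alpha_l for l = 1, 2, 3 stops remaining
# (index 0 is unused); with l stops remaining the defender stops
# iff the belief b(1) reaches the threshold alpha_l.
ALPHA = [None, 0.5, 0.7, 0.9]

def defender_action(b1, l):
    """Return 1 (stop) or 0 (continue) given belief b(1) and l."""
    return 1 if b1 >= ALPHA[l] else 0
```

A mixed strategy, as learned by T-FP, can be obtained by smoothing the step function around the threshold instead of using a hard cutoff.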

B. Finding Nash Equilibria through Fictitious Self-Play
Computing Nash equilibria for a POSG is generally intractable [96, Thm. 3.5], [97, Thm. 6]. However, approximate solutions can be obtained through iterative methods. One such method is fictitious self-play, where both players start from random strategies and continuously update their strategies based on the outcomes of played game episodes [98].
Fictitious self-play evolves through a sequence of iteration steps, as illustrated in Fig. 6. An iteration step includes three stages. First, player 1 learns a best response strategy against player 2's current strategy. The roles are then reversed, and player 2 learns a best response strategy against player 1's current strategy. Lastly, each player adopts a new strategy, which is determined by the empirical distribution over its past best response strategies. The sequence of iteration steps continues until the strategies of both players have sufficiently converged to a Nash equilibrium [99, Thms. 7.2.4-7.2.5].
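The three stages above can be sketched as a simple loop. This is a schematic, not the paper's algorithm: `learn_best_response` is a hypothetical stand-in for any best-response learner, a strategy is represented by its parameter (e.g., a threshold), and the average strategy is the empirical distribution over past best responses.

```python
import random

def fictitious_self_play(learn_best_response, iterations,
                         rng=random.Random(0)):
    """Return the buffers of past best responses for both players;
    the average (mixed) strategy samples uniformly from a buffer."""
    past_d, past_a = [rng.random()], [rng.random()]  # random initial strategies
    for _ in range(iterations):
        # stage 1: defender best-responds to the attacker's average strategy
        past_d.append(learn_best_response('D', past_a))
        # stage 2: attacker best-responds to the defender's average strategy
        past_a.append(learn_best_response('A', past_d))
        # stage 3: the averages are implicit in the growing buffers
    return past_d, past_a

def sample_strategy(past, rng=random.Random(0)):
    # playing the empirical mixture = sampling a past best response
    return rng.choice(past)
```
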

C. Our Self-Play Algorithm: T-FP
We present a fictitious self-play algorithm called Threshold Fictitious Self-Play (T-FP), which efficiently approximates a Nash equilibrium of Γ based on Theorem 1. The pseudocode of T-FP is listed in Algorithm 1.
To learn the best response strategies π̃_D ∈ B_D(π_A) and π̃_A ∈ B_A(π_D), T-FP parameterizes π̃_D and π̃_A through threshold vectors θ^(D) and θ^(A) according to Theorem 1. (In the accompanying figure, θ^(D) is the threshold (0.5 in the example); the x-axis indicates the defender's belief state b(1) ∈ [0, 1] and the y-axis indicates the probability prescribed by π̃_{D,θ^(D)} to the stop action S.)
The above procedure of estimating gradients and updating θ^(D) and θ^(A) continues for a given number of iterations (lines 9-21). After these iterations have finished, the threshold vectors θ^(D) and θ^(A) are added to the buffers Θ^(D) and Θ^(A), which contain the vectors learned in previous iterations of T-FP (line 22). Finally, the T-FP iteration step is completed by having both players update their strategies based on the empirical distributions over the past vectors in the buffers (lines 24-25). The sequence of iteration steps described above continues until the strategies have sufficiently converged to a Nash equilibrium (lines 6-27). (In Algorithm 1, U_k({−1, 1}) denotes a k-dimensional discrete multivariate uniform distribution on {−1, 1}, and π_{−i} denotes the strategy of player j ∈ N \ {i}.)
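The gradient-estimation step, with perturbation vectors drawn from U_k({−1, 1}), can be sketched as a simultaneous-perturbation update. The function below is an illustrative sketch: `J` is a hypothetical stand-in for the estimated average episode return of a threshold vector, and the step sizes are arbitrary.

```python
import random

def threshold_update(theta, J, c=0.1, eta=0.1, rng=random.Random(0)):
    """One stochastic-approximation step on the threshold vector theta:
    perturb theta with delta ~ U_k({-1, 1}), estimate the gradient from
    two evaluations of J, and take a gradient-ascent step."""
    delta = [rng.choice((-1, 1)) for _ in theta]
    j_plus = J([t + c * d for t, d in zip(theta, delta)])
    j_minus = J([t - c * d for t, d in zip(theta, delta)])
    grad = [(j_plus - j_minus) / (2 * c * d) for d in delta]
    return [t + eta * g for t, g in zip(theta, grad)]
```

With only two evaluations of J per step, this update scales to long threshold vectors, which is one reason simultaneous perturbation is attractive here.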

V. USING THE EMULATION SYSTEM TO INSTANTIATE THE SIMULATION AND TO EVALUATE LEARNED STRATEGIES
The T-FP algorithm described above approximates a Nash equilibrium of Γ by simulating game episodes and updating both players' strategies through stochastic approximation. T-FP requires the observation distribution conditioned on the system state, f_{O|s} (13)-(14). The emulation system shown in Fig. 2 allows us to estimate this distribution and later to evaluate the learned strategies.
This section describes the emulation system, our method for estimating f O|s , and our method for evaluating defender strategies.

A. Emulating the Target Infrastructure
The emulation system executes on a cluster of machines that runs a virtualization layer provided by Docker containers and virtual links [102]. The system implements network isolation and traffic shaping using network namespaces and the NETEM module in the Linux kernel [103]. Resource allocation to containers, e.g., CPU and memory, is enforced using CGROUPS.
The network topology of the emulated infrastructure is shown in Fig. 1, and its configuration is given in Appendix C. The emulation system includes the clients, the attacker, the defender, network connectivity, and 31 devices of the target infrastructure (e.g., application servers and the gateway). The software functions of the emulation system replicate important components of the target infrastructure, such as web servers, databases, and the Snort IDPS, which is deployed with Snort's community ruleset v2.9.17.1.
We emulate connections between servers as full-duplex lossless connections with a capacity of 1 Gbit/s in both directions. We emulate connections between the gateway and the external client population as full-duplex connections with a capacity of 100 Mbit/s and 0.1% packet loss, with random bursts of 1% packet loss. (These numbers are based on measurements on enterprise and wide-area networks [104]-[106].) Technical documentation and application programming interfaces (APIs) of the emulation system are available in [107].

B. Emulating the Client Population
The client population is emulated by processes in Docker containers. Clients interact with the application servers through the gateway by performing a sequence of functions on a sequence of servers, both of which are selected uniformly at random from Table 2. Client arrivals per time-step are emulated using a stationary Poisson process with mean λ = 20 and exponentially distributed service times with mean μ = 4. The duration of a time-step is 30 seconds.
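The client-population model above can be sketched as follows; the function names are illustrative and the implementation is a simplification of the emulated client processes (arrivals are Poisson with mean 20 per time-step, service times exponential with mean 4 time-steps).

```python
import math
import random

def poisson(rng, lam):
    """Sample a Poisson variate (Knuth's multiplication method)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate_clients(steps, lam=20, mu=4, seed=1):
    """Return the number of active clients at the end of each step."""
    rng = random.Random(seed)
    remaining = []                 # residual service times (in steps)
    trace = []
    for _ in range(steps):
        arrivals = poisson(rng, lam)
        remaining += [rng.expovariate(1 / mu) for _ in range(arrivals)]
        remaining = [r - 1 for r in remaining if r > 1]  # one step elapses
        trace.append(len(remaining))
    return trace
```

Since arrivals and departures are independent, the number of active clients fluctuates around roughly λ · μ in steady state, which drives a realistic background alert load in the emulation.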

C. Emulating Defender and Attacker Actions
The defender and the attacker observe the infrastructure continuously and take actions at time-steps t = 1, 2, ..., T. During each time-step, the defender and the attacker perform one action each.
The defender executes either a continue action or a stop action. A continue action is virtual in the sense that it does not trigger any function in the emulation. A stop action, however, invokes specific functions in the emulated infrastructure. We have implemented L = 7 stop actions for the defender, which are listed in Table 3. The first stop action revokes user certificates and recovers user accounts suspected to be compromised by the attacker. The second stop action updates the firewall configuration of the gateway to drop traffic from IP addresses flagged by the IDPS. Stop actions 3-6 trigger the dropping of traffic that generates IDPS alerts of priorities 1-4. The final stop action blocks all incoming traffic. (Note that, according to Snort's terminology, 1 is the highest priority; we invert the labeling in our framework for convenience.)

Table 4: Attacker commands executed on the emulation system; exploits are identified according to their corresponding vulnerability and its identifier in the Common Vulnerabilities and Exposures (CVE) database [108] and in the Common Weakness Enumeration (CWE) list [109].

Like the defender, the attacker executes either a stop action or a continue action during each time-step. The attacker can take only two stop actions during a game episode. The first determines when the intrusion starts and the second when it terminates (see §III).
During an intrusion, the attacker executes a sequence of commands, drawn randomly from the commands listed in Table 4. (Detailed descriptions of all commands are available in Appendix E.) The first command in this sequence is executed when the attacker takes its first stop action. A further command is invoked whenever the attacker takes a continue action.

D. Estimating the IDPS Alert Distribution
At the end of every time-step, the emulation system collects the number of IDPS alerts with priorities 1-4 that occurred during the time-step. These values are then used to compute the metric o_t, which contains the total number of IDPS alerts, weighted by priority.
For the evaluation reported in this paper, we collect measurements from 23,000 time-steps. Using these measurements, we apply expectation-maximization [110] to fit Gaussian mixture distributions f̂_{O|0} and f̂_{O|1} as estimates of f_{O|0} and f_{O|1} (13)-(14).
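The fitting step can be illustrated with a minimal expectation-maximization routine for a two-component one-dimensional Gaussian mixture. This is a sketch under simplifying assumptions (two components, crude initialization); in practice more components and a library implementation would be used.

```python
import numpy as np

def fit_gmm(x, iters=50):
    """Fit a 2-component 1-D Gaussian mixture to samples x with EM.
    Returns (weights, means, variances)."""
    x = np.asarray(x, dtype=float)
    mu = np.array([x.min(), x.max()])        # crude initialization
    var = np.full(2, x.var() + 1e-6)
    w = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each sample
        dens = (w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var))
                / np.sqrt(2 * np.pi * var))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        n = resp.sum(axis=0)
        w = n / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / n
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / n + 1e-6
    return w, mu, var
```
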
Fig. 8 shows the empirical distributions and the fitted models over the discrete observation space O = {1, 2, ..., 9000}. The stochastic matrix with rows f̂_{O|0} and f̂_{O|1} has about 72 × 10^6 second-order minors, almost all of which are non-negative. This suggests that the TP-2 assumption in Theorem 1 holds for our use case.
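The TP-2 check amounts to testing the sign of every second-order minor of the 2 × |O| matrix with rows f_{O|0} and f_{O|1}, which is equivalent to the likelihood ratio f_{O|1}/f_{O|0} being non-decreasing. A sketch of this check (the function name is illustrative):

```python
import numpy as np

def tp2_fraction(f0, f1):
    """Fraction of non-negative second-order minors of the matrix
    [[f0], [f1]]: the minor for columns i < j is
    f0[i] * f1[j] - f0[j] * f1[i]."""
    f0, f1 = np.asarray(f0, float), np.asarray(f1, float)
    i, j = np.triu_indices(len(f0), k=1)   # all column pairs i < j
    minors = f0[i] * f1[j] - f0[j] * f1[i]
    return float(np.mean(minors >= 0))
```

A value of 1.0 means the matrix is TP-2; values close to 1.0, as measured for our fitted distributions, indicate that the assumption approximately holds.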

E. Running a Game Episode
During a game episode, the state evolves according to the dynamics defined by (2)-(6), the defender's belief state evolves according to (15), the players' rewards are calculated using the reward function R (7)-(12), the defender's observations are obtained from f_O (13)-(14), and the actions of both players are determined by their respective strategies. If the game runs in the emulation system, the players' actions include executing networking and computing functions (see Tables 3-4), and the observations from f_O are obtained by reading log files and metrics of the emulated infrastructure. (To collect the logs and system metrics from the emulation, we run software sensors that write to a distributed queue implemented with Kafka [111].) In the case of a game in the simulation system, the observations are instead sampled from the estimated distribution f̂_O.

VI. LEARNING NASH EQUILIBRIUM STRATEGIES FOR THE TARGET INFRASTRUCTURE
Our approach to finding near-optimal defender strategies includes: (i) emulating the target infrastructure to obtain statistics for instantiating the simulation system; (ii) learning Nash equilibrium strategies using the T-FP algorithm of §IV; and (iii) evaluating the learned strategies on the emulation system of §V (see Fig. 2). This section describes the learning process and the evaluation results for the intrusion response use case.

A. Learning Equilibrium Strategies through Self-Play
We run T-FP for 500 iteration steps to estimate a Nash equilibrium using the iterative method of §IV-B; this number of steps is sufficient to meet the termination condition (line 6 in Algorithm 1). These iteration steps generate a sequence of strategy pairs (π_D, π_A)_1, (π_D, π_A)_2, ..., (π_D, π_A)_{500}.
At the end of each iteration step, we evaluate the current strategy pair (π_D, π_A) by running 500 evaluation episodes in the simulation system and 5 evaluation episodes in the emulation system. This process allows us to produce learning curves for different performance metrics (see Fig. 9).
The 500 training iterations and the associated evaluations constitute one training run. We run four training runs with different random seeds. A single training run takes about 5 hours of processing time in the simulation system. In addition, it takes around 12 hours to evaluate the strategies on the emulation system. The hyperparameters of T-FP are listed in Appendix B.
Computing environment for simulation and emulation. Simulations and strategy training run on a machine equipped with a TESLA P100 GPU. The emulated infrastructure is deployed on a server with a 24-core INTEL XEON GOLD 2.10 GHz CPU and 768 GB RAM. Documentation of the emulation system is available in [107].
The code for the simulation system and the measurement traces for the intrusion response use case are available at [112]. They can be used to validate our results and to extend this research.
Convergence metric for T-FP. To estimate the convergence of the sequence of strategy pairs generated by T-FP, we use the approximate exploitability metric δ [113]:

δ = J_D(π̄_D, π_A) + J_A(π_D, π̄_A)    (34)

where π̄_i denotes an approximate best response strategy for player i, and the objective functions J_D and J_A are defined in (16) and (17), respectively. The closer δ is to 0, the closer (π_D, π_A) is to a Nash equilibrium.

Baseline algorithms. We compare the performance of T-FP with that of two popular algorithms from previous work that use reinforcement learning and study use cases similar to ours [74], [95], [114]-[116]. The first algorithm is Neural Fictitious Self-Play (NFSP) [117], a general fictitious self-play algorithm that does not exploit the threshold structure expressed in Theorem 1. The second algorithm is Heuristic Search Value Iteration (HSVI) for one-sided POSGs [115], a state-of-the-art dynamic programming algorithm for one-sided POSGs.
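The exploitability metric can be illustrated on a small zero-sum matrix game, where exact best responses are computable in closed form. This is a simplified analogue of the estimator used in the experiments, not the estimator itself.

```python
def exploitability(A, x, y):
    """Exploitability of the mixed-strategy pair (x, y) in a zero-sum
    matrix game with payoff matrix A for the row player (maximizer):
    the total gain both players could obtain by deviating to exact
    best responses. It is 0 exactly at a Nash equilibrium."""
    Ay = [sum(A[i][j] * y[j] for j in range(len(y))) for i in range(len(A))]
    xA = [sum(x[i] * A[i][j] for i in range(len(x))) for j in range(len(y))]
    return max(Ay) - min(xA)
```

For matching pennies, the uniform strategy pair has exploitability 0 (it is the Nash equilibrium), while any pure strategy pair is maximally exploitable.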
Defender baseline strategies. We compare the dynamic defender strategies learned through T-FP with three static baseline strategies. The first baseline prescribes the stop action whenever an IDPS alert occurs, i.e., when o_t > 0. The second baseline is derived from the Snort IDPS, which is a de-facto industry standard and can be considered state-of-the-art for our use case. This baseline uses the Snort IDPS's recommendation system and takes a stop action when Snort has dropped 100 IP packets (see Appendix C for the Snort configuration). The third baseline assumes prior knowledge of the intrusion time and performs all L stops during the L subsequent time-steps.

B. Evaluating the Learned Strategies
Figure 9 shows the learning curves of the strategies obtained during the T-FP self-play process and the baselines introduced above. The red curve represents the results from the simulator; the blue curves show the results from the emulation system; the purple curves give the performance of the Snort IDPS baseline; the orange curves relate to the baseline strategy that mandates a stop action when an IDPS alert occurs; and the dashed black curve gives the performance of the baseline strategy that assumes prior knowledge of the intrusion time.
We note that all learning curves in Fig. 9 converge, which suggests that the learned strategies converge as well. (Fig. 9 only shows the first 120 iterations of the 500 iterations we performed, as the curves converge after 100 iterations.) Specifically, we observe that the approximate exploitability (34) of the learned strategies converges to small values (left plot), which indicates that the learned strategies approximate a Nash equilibrium both in the simulator and in the emulation system. Further, we see from the plot in the middle that both baseline strategies show decreasing performance as the attacker updates its strategy. In contrast, the defender strategy learned through T-FP improves its performance over time. This shows the benefit of a game-theoretic approach where the defender strategy is optimized against a dynamic attacker. Lastly, we notice that the average intrusion length when the defender follows the learned defender strategy and the Snort IDPS baseline strategy is 2 and 3, respectively (right plot). In comparison, the average intrusion length when the defender follows the baseline strategy o_t > 0 is close to 0, which indicates that it tends to prescribe all stop actions before an intrusion occurs.
Figure 10 represents the strategies learned through T-FP in a simple form. The y-axis shows the probability of a stop action and the x-axis shows the defender's belief b(1) ∈ B that an intrusion occurs. The strategies are clearly stochastic. This is consistent with Theorem 1.A, which predicts a mixed Nash equilibrium. Further, Theorem 1.B predicts that the defender's stopping probability is increasing with respect to b(1) and decreasing with l, which is visible in the right plot. Similarly, Theorem 1.C predicts that the attacker's stopping probability decreases with the defender's stopping probability when s = 0 and increases when s = 1, which can be seen in the left and the middle plot.
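The monotone structure predicted by Theorem 1.B can be illustrated with a smoothed (logistic) threshold strategy. The parametrization and threshold values below are hypothetical, chosen only to exhibit the two monotonicity properties, not the paper's exact learned strategy:

```python
import math

def defender_stop_prob(b1, l, thresholds, k=20.0):
    """Smoothed threshold stopping strategy (illustrative sketch).

    b1: belief b(1) in [0, 1] that an intrusion is ongoing.
    l:  number of stops remaining; thresholds[l-1] is a hypothetical
        learned threshold, chosen non-decreasing in l so that the
        stopping probability decreases with l (Theorem 1.B).
    """
    alpha = thresholds[l - 1]               # threshold for l stops left
    # Logistic smoothing of the step function 1{b1 >= alpha}:
    # the result is increasing in b1 and decreasing in alpha.
    return 1.0 / (1.0 + math.exp(-k * (b1 - alpha)))
```

With, e.g., thresholds = [0.2, 0.5, 0.8], the strategy stops earlier when fewer stops remain, and for fixed l the stopping probability grows with the belief, mirroring the right plot of Fig. 10.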
Figure 11 compares T-FP with the two baseline algorithms NFSP and HSVI on the simulator. NFSP implements fictitious self-play and can thus be compared with T-FP with respect to approximate exploitability (34). We observe in the left plot that T-FP converges much faster than NFSP. We explain the rapid convergence of T-FP by its design, which exploits the structural properties of the stopping game (Theorem 1).
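The fictitious self-play scheme that both T-FP and NFSP instantiate can be sketched as follows. `best_response` is an abstract stand-in for the learning step; T-FP's restriction of best responses to threshold strategies, which explains its faster convergence, is not shown here:

```python
import numpy as np

def fictitious_self_play(best_response, n_iters, init_D, init_A):
    """Minimal fictitious self-play loop (illustrative sketch).

    best_response(player, opponent_avg) returns an approximate best
    response (here: a strategy vector) against the opponent's average
    strategy.  Each player's average strategy is the empirical
    distribution of its past best responses."""
    avg_D = np.asarray(init_D, dtype=float)
    avg_A = np.asarray(init_A, dtype=float)
    for t in range(1, n_iters + 1):
        br_D = best_response("D", avg_A)   # defender's best response
        br_A = best_response("A", avg_D)   # attacker's best response
        # Incremental-mean update of the empirical strategy averages.
        avg_D += (br_D - avg_D) / t
        avg_A += (br_A - avg_A) / t
    return avg_D, avg_A
```

Run on matching pennies with exact best responses, the averaged strategies approach the uniform equilibrium, while the best responses themselves keep cycling; this is the convergence behavior that the exploitability curves in Fig. 11 track.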
The right plot shows that HSVI reaches an HSVI approximation error below 5 within an hour of processing time; based on the recent literature, we had anticipated a much longer processing time [88], [95]. This suggests that T-FP and HSVI have similar convergence properties. A more detailed comparison between T-FP and HSVI is difficult due to the different nature of the two algorithms.
Figure 12 shows the estimated value function of the game V*_l : B → R (23), where V*_l(b(1)) is the expected cumulative reward when the game starts in the belief state b(1), the defender has l stops remaining, and both players follow optimal (equilibrium) strategies.
We see in Fig. 12 that V*_l is piece-wise linear and convex, as expected from the theory of one-sided POSGs [88]. The figure indicates that V*_l(b(1)) ≤ 0 for all b(1) ∈ B and that V*_l(1) = 0 for all l ∈ {1, . . ., L}. Further, we note that the value of V*_l is minimal when b(1) is around 0.25 and that the values for l = 1 and l = 7 are very close.
That V*_l(b(1)) ≤ 0 for all b(1) ∈ B and all l ∈ {1, . . ., L} has an intuitive explanation. For any b(1), the attacker has the option to never attack if s = 0 or to abort an attack if s = 1. Both options yield a cumulative reward less than or equal to 0 (7)-(12). As a consequence, V*_l(b(1)) ≤ 0 for any optimal attacker strategy and all b(1) ∈ B and l ∈ {1, . . ., L}. (Recall that the attacker aims to minimize the reward.) The fact that V*_l(b(1)) = 0 when b(1) = 1 can be understood as follows. b(1) = 1 means that the defender knows that an intrusion occurs and will take defensive actions (see Theorem 1.B). Hence, when b(1) = 1, the only way for the attacker to avoid detection is to abort the intrusion, which causes the game to end and yields a reward of zero, i.e. V*_l(1) = 0.
We interpret the fact that arg min_{b(1)} V*_l(b(1)) is around 0.25 as follows. The value of b(1) that attains the minimum corresponds to the belief state where the attacker achieves the lowest expected reward in the game. Negative rewards in the game are obtained when the defender mistakes an intrusion for no intrusion and vice versa (7)-(12). As a consequence, the attacker prefers belief states where the defender has high uncertainty, e.g. b(1) = 0.5. At the same time, the attacker does not want b(1) to be so large that the defender performs all its defensive actions before the attacker gets a chance to attack, which can explain why the minimum is around 0.25 rather than 0.5. Lastly, Fig. 13 shows the percentage of blocked attacker and client traffic when running repeated game episodes in the emulation system with different defender strategies. The x-axis shows the running time and the y-axis shows the percentage of blocked traffic per second.
We observe in the upper plot that all defender strategies block some client traffic, which is expected considering the false IDPS alarms generated by the clients (see Fig. 8). (The defender actions that cause traffic to be dropped are listed in Table 3.) The o_t > 0 baseline strategy blocks the most client traffic and the Snort IDPS baseline strategy blocks the least, slightly less than the equilibrium strategy learned through T-FP.
We further observe in the lower plot that the equilibrium strategy learned by T-FP blocks the most attacker traffic and that the o_t > 0 baseline strategy blocks the least. This suggests that the equilibrium strategy strikes a good balance between blocking clients and blocking the attacker, as expressed by the reward function (7)-(12). In comparison, the o_t > 0 baseline implements a trivial defense strategy that blocks nearly all traffic, and the Snort IDPS baseline blocks too little traffic, failing to stop the intrusion.

C. Discussion of the Evaluation Results
In this work, we propose a framework for analyzing and solving the intrusion response use case, which we validate both theoretically and experimentally via simulation and emulation.
The key findings can be summarized as follows: (i) Our framework is able to efficiently approximate optimal defender strategies for a practical IT infrastructure (see Fig. 9). While we have not evaluated the learned strategies in the target infrastructure due to safety reasons, the fact that they achieve almost the same performance in the emulated infrastructure as in the simulator gives us confidence that the obtained strategies would perform as expected in the target infrastructure.
(ii) The theory of optimal stopping provides insight about optimal strategies for attackers and defenders, which enables efficient computation of near-optimal strategies through self-play reinforcement learning (see Fig. 11). This finding can be explained by the threshold structures of the optimal stopping strategies, which drastically reduce the search space of possible strategies (see Theorem 1 and Algorithm 1).
(iii) The learned strategies can be efficiently implemented using the threshold properties. The computational complexity, which is dominated by the computation of the belief state, is upper bounded by O(k|S|), where k is a constant (15).
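The dominant cost, the belief-state update, is a single Bayes filter step. The following sketch is our illustration (names are generic, not the paper's implementation): `T[a][s, s']` denotes the transition probability under defender action `a` and `Z[s', o]` the observation likelihood f_{O|s'}(o):

```python
import numpy as np

def belief_update(b, o, a_D, T, Z):
    """One Bayesian belief-state update: prediction through the
    transition model followed by correction with the observation
    likelihood.  For the two-state stopping game this is a handful of
    multiplications per step, consistent with the O(k|S|) bound."""
    b_next = Z[:, o] * (b @ T[a_D])   # predict, then weight by likelihood
    return b_next / b_next.sum()      # normalize to a distribution
```

For example, with the identity transition matrix and an observation that is more likely during an intrusion, the update shifts probability mass toward state 1, which is exactly how IDPS alerts drive the defender's belief b(1) upward.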
(iv) The performance of static defender strategies deteriorates against a dynamic attacker, whereas defender strategies obtained through T-FP improve over time (see the middle plot in Fig. 9). This finding is consistent with previous studies that use game-theoretic approaches (e.g. [70], [71]) and suggests limitations of static defense systems, such as the Snort IDPS.

VII. RELATED WORK
Since the early 1990s, there has been a broad interest in automating network security functions, especially in the areas of intrusion detection, intrusion prevention, and intrusion response.
While the research reported in this paper is informed by all the above works, we limit the following discussion to prior work that uses game-theoretic models and centers around finding security strategies through automatic control and reinforcement learning.
This paper differs from the works referenced above in two main ways. First, we model the intrusion response use case as an optimal stopping game. The benefit of our model is that it provides insight into the structure of best response strategies through the theory of optimal stopping. Second, we evaluate the obtained strategies on an emulated IT infrastructure. This contrasts with most of the prior works that use game-theoretic approaches, which either evaluate strategies analytically or in simulation [8], [11], [56], [57], [70]-[72], [75], [76], [78], [118]-[130], [159].
Game-theoretic formulations based on optimal stopping theory can be found in prior research on Dynkin games [60], [164]-[167]. Compared to these articles, our approach is more general in that it (i) allows each player to take multiple stop actions within an episode and (ii) does not assume a game of perfect information. Another difference is that the referenced articles either study purely mathematical problems or problems in mathematical finance. To the best of our knowledge, we are the first to apply the stopping game formulation to the use case of intrusion response.
Our stopping game has similarities with the FLIPIT game [70] and signaling games [168], both of which are commonplace in the security literature (see survey [169] and textbooks [7], [148]-[150]). Signaling games have the same information asymmetry as our game and FLIPIT uses the same binary state space to model the state of an attack. The main differences are as follows. FLIPIT models the use case of advanced persistent threats and is a symmetric non-zero-sum game. In contrast, our game models an intrusion response use case and is an asymmetric zero-sum game. Lastly, compared to signaling games, the main difference is that our game is a sequential and simultaneous-move game. Signaling games, in comparison, are typically two-stage games where one player moves in each stage.
Previous game-theoretic studies that use emulation systems similar to ours are [131] and [132]. Specifically, in [131], a denial-of-service use case is formulated as a signaling game, for which a Nash equilibrium is derived. The equilibrium is then used to design a defense mechanism that is evaluated in a software-defined network emulation based on MININET [170]. Compared to this paper, the main differences are that we focus on a different use case than [131] and that our solution method is based on reinforcement learning.
Similar to this paper, the authors of [132] formulate an intrusion response use case as a POSG where the defender observes alerts from a Snort IDPS [81]. In contrast to our approach, however, the approach of [132] assumes access to attack-defense trees designed by human experts. Another difference between this paper and [132] is the underlying POSG. The POSG in [132] has a larger state space than the one considered in this paper. Although this makes the POSG in [132] more expressive than ours, it also makes computation of optimal defender strategies intractable. In fact, to estimate optimal defender strategies, the authors of [132] are forced to approximate their model with one that has a smaller state space and is fully observed. In comparison, we are able to efficiently approximate equilibria of our game without relying on model simplifications and without assuming access to attack-defense trees designed by human experts.

B. Control Theory for Automated Intrusion Response
Control theory provides a well-established mathematical framework for studying automatic systems. Classical control systems involve actuators in the physical world (e.g. electric power systems [171]) and many studies have focused on applying control theory to automate intrusion responses in cyber-physical systems (see surveys [172]-[174]).
The control framework can also be applied to computing systems, and interest in control theory among researchers in IT security is growing (see survey [4]). As opposed to classical control theory, which is focused on continuous-time systems, the research on applying control theory to computing systems is focused almost entirely on discrete-time systems. The main reason is that measurements from computer systems are solicited on a sampled basis, which is best described by a discrete-time model [147], [175].
Previous works that apply control theory to the use case of intrusion response include [63]-[66], [176]-[178], all of which model the problem of selecting response actions as one of controlling a discrete-time dynamical system and obtain optimal defender strategies through dynamic programming.
The main limitation of the works referenced above is that dynamic programming does not scale to problems of practical size due to the curse of dimensionality [179], [180].

C. Reinforcement Learning for Automated Intrusion Response
Reinforcement learning has emerged as a promising approach to approximate optimal control strategies in scenarios where exact dynamic programming is not applicable. Fundamental breakthroughs, demonstrated by systems like ALPHAGO in 2016 [181] and OPENAI FIVE in 2019 [182], have inspired us and other researchers to study reinforcement learning with the goal to automate security functions (see surveys [17], [18]).
This paper differs from the works referenced above in three main ways. First, we model the intrusion response use case as a partially observed stochastic game. Most of the other works model the use case as a single-agent MDP or POMDP. The advantage of the game-theoretic model is that it allows finding defender strategies that are effective against a dynamic attacker, i.e. an attacker that adapts its strategy in response to the defender strategy.
Second, in a novel approach, we derive structural properties of optimal defender strategies in the game using optimal stopping theory.
Third, our method to find effective defender strategies includes using an emulation system in addition to a simulation system. The advantage of our method compared to the simulation-only approaches [11], [12], [20]-[29], [35]-[37], [40], [41], [46]-[48], [52], [54], [57], [67], [73], [74], [76] is that the parameters of our simulation system are determined by measurements from an emulation system instead of being chosen by a human expert. Further, the learned strategies are evaluated in the emulation system, not in the simulation system. As a consequence, the evaluation results give higher confidence in the obtained strategies' performance in the target infrastructure than what simulation results would provide.
Prior work on automated learning of security strategies that makes use of emulation includes [30]-[33], [38], [39], [49]-[51], [53], and [55]. These works either emulate software-defined networks based on MININET [170] or use custom testbeds. The main differences between these efforts and the work described in this article are: (i) we develop our own emulation system, which allows for experiments with a large variety of exploits; (ii) we focus on a different use case (most of the referenced works study denial-of-service attacks); (iii) we do not assume that the defender has perfect observability; (iv) we do not assume a static attacker; and (v) we use an underlying theoretical framework to formalize the use case, derive structural properties of optimal strategies, and test these properties in an emulation system.
Finally, [183], [184], and [185] describe efforts in building emulation platforms for reinforcement learning and cyber defense, which resemble our emulation system. In contrast to these articles, our emulation system has been built to investigate the specific use case of intrusion response and forms an integral part of our general solution method (see Fig. 2).

VIII. CONCLUSION AND FUTURE WORK
In this work, we combine a formal framework with a practical evaluation to address the problem of automated intrusion response. We formulate the interaction between an attacker and a defender as an optimal stopping game. This formulation gives us insight into the structure of optimal strategies, which we prove to have threshold properties. Based on this knowledge, we develop a fictitious self-play algorithm, Threshold Fictitious Self-Play (T-FP), which learns near-optimal strategies in an efficient way. The results from running T-FP show that the learned strategies converge to an approximate Nash equilibrium and thus to near-optimal strategies (see Fig. 9). The results also demonstrate that T-FP converges faster than a state-of-the-art fictitious self-play algorithm by taking advantage of threshold properties of optimal strategies (see Fig. 11). The threshold properties further enable us to provide a graphic representation of the learned strategies in a simple form (see Fig. 5 and Fig. 10).
To assess the learned strategies in a real environment, we evaluate them in a system that emulates our target infrastructure (see Fig. 1). The results show that the strategies achieve almost the same performance in the emulated infrastructure as in the simulation. This gives us confidence that the obtained strategies would perform as expected in the target infrastructure, which is not feasible to evaluate directly.
We plan to continue this work in several directions. First, we will extend the current model of the attacker and the defender, which captures only the timing of actions, to include decisions about a range of attacker and defender actions. Second, we plan to combine the strategies learned through our framework with techniques for online play, such as rollout [186]. Third, we plan to study techniques for obtaining defender strategies that generalize to a variety of infrastructure configurations and topologies. Fourth, we intend to extend our framework to an online-learning setting where the defender strategies co-evolve with changes in the target infrastructure.

IX. ACKNOWLEDGMENTS
This research has been supported in part by the Swedish armed forces and was conducted at KTH Center for Cyber Defense and Information Security (CDIS). The authors would like to thank Pontus Johnson for his useful input to this research, and Forough Shahab Samani and Xiaoxuan Wang for their constructive comments on a draft of this paper. The authors are also grateful to Branislav Bosanský for sharing the code of the HSVI algorithm for one-sided POSGs and to Jakob Stymne for contributing to our implementation of NFSP.

APPENDIX A PROOFS
A. Proof of Theorem 1.A
Proof. Since the POSG Γ in (1) is finite and γ ∈ (0, 1), the existence proofs in [86, §3] and [88, Thm. 2.3] apply, which state that a mixed Nash equilibrium exists. For the sake of brevity we do not restate the proofs, which are based on formulating the POSG as a finite strategic-form game and appealing to Nash's theorem [83, Thm. 1].

B. Proof of Theorem 1.B.
Proof. Given the POSG Γ (1) and a fixed attacker strategy π_A, any best response strategy for the defender π̂_D ∈ B_D(π_A) is an optimal strategy in a POMDP M_P (see §IV). Hence, it is sufficient to show that there exists an optimal strategy π*_D in M_P that satisfies (28). Conditions for (28) to hold and the existence proof are given in our previous work [10, Thm. 1.C]. Since f_O|s is TP-2 by assumption and all of the remaining conditions hold by definition of Γ (1), the result follows.

C. Proof of Theorem 1.C.
Proof. Given the POSG Γ (1) and a fixed defender strategy π_D, any best response strategy for the attacker π̂_A ∈ B_A(π_D) is an optimal strategy in an MDP M (see §IV). Hence, it is sufficient to show that there exists an optimal strategy π*_A in M that satisfies (29)-(30). To prove this, we use properties of M's value function V*_{π_D,l} (22), which we establish through the value iteration algorithm [89], [90]. Let V^k_{π_D,l} denote the value function after k iterations of value iteration. We prove the statement by mathematical induction on k. For k = 1, we know from (7)-(12) that V^1_{π_D,l}(1, b(1)) is non-increasing with π_D(S | b(1)) and non-decreasing with l.
For k > 1, V^k_{π_D,l} is given by the Bellman recursion of value iteration.

APPENDIX D DISTRIBUTIONS OF INFRASTRUCTURE METRICS
The emulation system (see Fig. 2) collects hundreds of metrics every time-step. To measure the information that a metric provides for detecting intrusions, we calculate the Kullback-Leibler (KL) divergence D_KL(f_{O|0} ‖ f_{O|1}) between the distribution of the metric when no intrusion occurs, f_{O|s=0}, and during an intrusion, f_{O|s=1}: D_KL(f_{O|0} ‖ f_{O|1}) = Σ_{o ∈ O} f_{O|s=0}(o) log( f_{O|s=0}(o) / f_{O|s=1}(o) ), where O denotes the random variable representing the value of the metric and O is the domain of O.
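Computing this divergence from empirical histograms is straightforward; the sketch below is our illustration (the binning and smoothing choices are assumptions, not the paper's implementation):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Discrete KL divergence D_KL(p || q) between two empirical metric
    distributions (e.g. histogram counts of a metric with and without
    an intrusion).  Counts are normalized to probabilities, and bins
    unseen in q are floored at eps to avoid log(0)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    q = np.clip(q, eps, None)        # smooth empty bins in q
    mask = p > 0                     # 0 * log(0/q) contributes 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
```

A metric whose conditional distributions differ strongly (such as IDPS alert counts in Fig. 14) yields a large divergence, while an uninformative metric yields a value near 0, which is exactly the ranking criterion used here.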
Figure 14 shows empirical distributions of the collected metrics with the largest KL divergence.We see that the IDPS alerts have the largest KL divergence and thus provide the most information for detecting intrusions.

APPENDIX E ATTACKER ACTIONS
The attacker actions and their descriptions are listed in Table 7.

Fig. 6: The fictitious self-play process; in every iteration step each player i learns a best response strategy π̂_i ∈ B_i(π_{-i}) and updates its strategy based on the empirical distribution of its past best response strategies; the horizontal arrows indicate iteration steps of self-play and the vertical arrows indicate the learning of best response strategies; the process converges towards a Nash equilibrium (π*_D, π*_A).

Fig. 8: Empirical distributions of o_t when no intrusion occurs (s_t = 0) and during intrusion (s_t = 1); the black lines show the fitted Gaussian mixture models.

Fig. 9: Learning curves from the self-play process with T-FP; the red curve shows simulation results and the blue curves show emulation results; the purple, orange, and black curves relate to baseline strategies; the figures show different performance metrics: exploitability (34), episodic reward, and the length of intrusion; the curves indicate the mean and the 95% confidence interval over four training runs with different random seeds.

Fig. 10: Probability of the stop action S under the learned equilibrium strategies as a function of b(1) and l; the left and middle plots show the attacker's stopping probability when s = 0 and s = 1, respectively; the right plot shows the defender's stopping probability.

Fig. 11: Comparison between T-FP and two baseline algorithms: NFSP and HSVI; all curves show simulation results; the red curve relates to T-FP, the blue curve to NFSP, and the purple curve to HSVI; the left plot shows the approximate exploitability metric (34) and the right plot shows the HSVI approximation error [115]; the curves depicting T-FP and NFSP show the mean and the 95% confidence interval over four training runs with different random seeds.

Fig. 13: Percentage of blocked attacker and client traffic in the emulation system; the blue curves show results from the equilibrium strategy learned via T-FP; the purple, orange, and black curves relate to baseline strategies.

Fig. 14: Empirical distributions of selected infrastructure metrics; the red and blue lines show the distributions when no intrusion occurs and during intrusion, respectively.
TABLE 1: Notation (fragment recovered from extraction): the defender's information about the game state s_t is expressed in the belief state b_t(s_t) = P[S_t = s_t | h_t]; at time-step t, o_t ∈ O is drawn from a random variable O whose distribution f_O|s depends on the current state s_t, which defines the observation function Z; M_P and M denote the best response POMDP and MDP for D and A; S_i, C_i denote the stopping and continuation sets of player i; S_t, A_t, O_t are random variables with realizations s_t, a_t, o_t; V* and V*_{π_D,l} denote the value functions of M_P and M.

TABLE 2: Emulated client population; each client invokes functions on application servers.

TABLE 3: Defender commands executed on the emulation system.

TABLE 4: Attacker actions (reconstructed from the flattened extraction):
- TCP scan: TCP port scan by sending SYN or empty packets (nmap)
- UDP port scan: UDP port scan by sending UDP packets (nmap)
- ping scan: IP scan with ICMP ping messages
- VULSCAN: vulnerability scan using nmap
- brute-force attack: performs a dictionary attack against a login service (nmap)
- CVE-2017-7494 exploit: uploads a malicious binary to the SAMBA service and executes it
- CVE-2015-3306 exploit: uses mod_copy in proftpd for remote code execution
- CVE-2014-6271 exploit: uses a vulnerability in bash for remote code execution
- CVE-2016-10033 exploit: uses phpmailer for remote code execution
- CVE-2015-1427 exploit: uses elasticsearch for remote code execution
- CWE-89 exploit: injects malicious SQL code to execute code remotely

TABLE 7: Descriptions of the attacker actions.