Intrusion Prevention Through Optimal Stopping

We study automated intrusion prevention using reinforcement learning. Following a novel approach, we formulate the problem of intrusion prevention as an (optimal) multiple stopping problem. This formulation gives us insight into the structure of optimal policies, which we show to have threshold properties. For most practical cases, it is not feasible to obtain an optimal defender policy using dynamic programming. We therefore develop a reinforcement learning approach to approximate an optimal threshold policy. We introduce T-SPSA, an efficient reinforcement learning algorithm that learns threshold policies through stochastic approximation. We show that T-SPSA outperforms state-of-the-art algorithms for our use case. Our overall method for learning and validating policies includes two systems: a simulation system where defender policies are incrementally learned and an emulation system where statistics are produced that drive simulation runs and where learned policies are evaluated. We show that this approach can produce effective defender policies for a practical IT infrastructure.


I. INTRODUCTION
An organization's security strategy has traditionally been defined, implemented, and updated by domain experts [1]. Although this approach can provide basic security for an organization's communication and computing infrastructure, a growing concern is that infrastructure update cycles become shorter and attacks increase in sophistication [2], [3]. Consequently, the security requirements become increasingly difficult to meet. To address this challenge, significant efforts have started to automate security frameworks and the process of obtaining effective security policies. Examples of this research include: automated creation of threat models [4]; computation of defender policies using dynamic programming and control theory [5], [6]; computation of exploits and corresponding defenses through evolutionary methods [7]; identification of infrastructure vulnerabilities through attack simulations and threat intelligence [8], [9]; computation of defender policies through game-theoretic methods [10], [11]; and use of machine learning techniques to estimate model parameters and policies [12], [13].
In this paper, we present a novel approach to automatically learn defender policies. We apply this approach to an intrusion prevention use case. Here, we use the term "intrusion prevention" as suggested in the literature, e.g. in [1]. It means that a defender prevents an attacker from reaching its goal, rather than preventing it from accessing any part of the infrastructure.

Fig. 1: The IT infrastructure and the actors in the use case.
Our use case involves the IT infrastructure of an organization (see Fig. 1). The operator of this infrastructure, which we call the defender, takes measures to protect it against a possible attacker while, at the same time, providing a service to a client population. The infrastructure includes a public gateway through which the clients access the service and which also is open to a possible attacker. The attacker decides when to start an intrusion and then executes a sequence of actions that includes reconnaissance and exploits. Conversely, the defender aims at preventing intrusions and maintaining service to its clients. It monitors the infrastructure and can defend it by taking defensive actions, which can prevent a possible attacker but also incur costs. What makes the task of the defender difficult is the fact that it lacks direct knowledge of the attacker's actions and must infer that an intrusion occurs from monitoring data.
We study the use case within the framework of discrete-time dynamical systems. Specifically, we formulate the problem of finding an optimal defender policy as an (optimal) multiple stopping problem. In this formulation, the defender can take a finite number of stops. Each stop is associated with a defensive action and the objective is to decide the optimal times when to stop. This approach gives us insight into the structure of optimal defender policies through the theory of dynamic programming and optimal stopping [14], [15]. In particular, we show that an optimal multi-threshold policy exists that can be efficiently computed and implemented.
Since the defender can access only a set of infrastructure metrics and does not directly observe the attacker, we use a Partially Observed Markov Decision Process (POMDP) to model the multiple stopping problem. An optimal policy for a POMDP can be obtained through two main methods: dynamic programming and reinforcement learning. In our case, dynamic programming is not feasible due to the size of the POMDP [39]. Therefore, we use a reinforcement learning approach to obtain the defender policy. We simulate a long series of POMDP episodes whereby the defender continuously updates its policy based on outcomes of previous episodes. To update the policy, we introduce T-SPSA, a reinforcement learning algorithm that exploits the threshold structure of optimal policies. We show that T-SPSA efficiently learns a near-optimal policy despite the high complexity of computing optimal policies for general POMDPs [39].
Our method for learning and validating policies includes building two systems (see Fig. 2). First, we develop an emulation system where key functional components of the target infrastructure are replicated. In this system, we run attack scenarios and defender responses. These runs produce system metrics and logs that we use to estimate empirical distributions of infrastructure metrics, which are needed to simulate POMDP episodes. Second, we develop a simulation system where POMDP episodes are executed and policies are incrementally learned. Finally, the policies are extracted and evaluated in the emulation system and possibly implemented in the target infrastructure (see Fig. 2). In short, the emulation system is used to provide the statistics needed to simulate the POMDP and to evaluate policies, whereas the simulation system is used to learn policies.
We make three contributions with this paper. First, we formulate intrusion prevention as a problem of multiple stopping. This novel formulation allows us a) to derive properties of an optimal defender policy using results from dynamic programming and optimal stopping; and b) to approximate an optimal policy for a non-trivial infrastructure configuration. Second, we present a reinforcement learning approach to obtain policies in an emulated infrastructure. With this approach, we narrow the gap between the evaluation environment and a scenario playing out in a real system. We also address a limitation of many related works, which rely solely on simulations to evaluate policies [12], [7], [18], [20], [21], [40], [19]. Third, we present T-SPSA, an efficient reinforcement learning algorithm that exploits the threshold structure of optimal policies and outperforms state-of-the-art algorithms for our use case.

Fig. 2: Our approach for finding and evaluating intrusion prevention policies.
We conclude this section with remarks about the context of this research and the practical relevance of the results in this paper. The objective of our line of research is to construct a mathematical and conceptual framework, validated by an experimental environment, that produces defender policies for realistic scenarios through self-learning. We are engaged in a program with high potential reward that will need many years of investigation. This paper provides an important result and milestone in this program.
From a practical point of view, the main question the paper answers is this: at which points in time should a defender take defensive actions, given periodic but limited observational data? The paper proposes a fundamental framework to study this question. We show theoretically and experimentally that the optimal action times can be obtained through thresholds that the framework predicts and which can be efficiently implemented in a real system.

II. THE INTRUSION PREVENTION USE CASE
We consider an intrusion prevention use case that involves the IT infrastructure of an organization. The operator of this infrastructure, which we call the defender, takes measures to protect it against an attacker while, at the same time, providing a service to a client population (Fig. 1). The infrastructure includes a set of servers that run the service and an intrusion detection system (IDS) that logs events in real-time. Clients access the service through a public gateway, which also is open to the attacker.
We assume that the attacker intrudes into the infrastructure through the gateway, performs reconnaissance, and exploits found vulnerabilities, while the defender continuously monitors the infrastructure through accessing and analyzing IDS statistics and login attempts at the servers. The defender can take a fixed number of defensive actions to prevent the attacker. A defensive action is, for example, to revoke user certificates in the infrastructure, which will recover user accounts compromised by the attacker. It is assumed that the defender takes the defensive actions in a predetermined order. The final action that the defender can take is to block all external access to the gateway. As a consequence of this action, the service as well as any ongoing intrusion are disrupted.
In deciding when to take defensive actions, the defender has two objectives: (i) maintain service to its clients; and (ii) keep a possible attacker out of the infrastructure. The optimal policy for the defender is to monitor the infrastructure and maintain service until the moment when the attacker enters through the gateway, at which time the attacker must be prevented by taking defensive actions. The challenge for the defender is to identify the precise time when this moment occurs.
In this work, we model the attacker as an agent that starts the intrusion at a random point in time and then takes a predefined sequence of actions, which includes reconnaissance to explore the infrastructure and exploits to compromise servers.
We study the use case from the defender's perspective. The evolution of the system state and the actions by the defender are modeled with a discrete-time Partially Observed Markov Decision Process (POMDP). The reward function of this process encodes the benefit of maintaining service and the loss of being intruded. Finding an optimal defender policy thus means maximizing the expected reward.

III. THEORETICAL BACKGROUND
This section covers the preliminaries on Markov decision processes, reinforcement learning, and optimal stopping.

A. Markov Decision Processes
A Markov Decision Process (MDP) models the control of a discrete-time dynamical system and is defined by a seven-tuple M = ⟨S, A, P^{a_t}_{s_t,s_{t+1}}, R^{a_t}_{s_t,s_{t+1}}, γ, ρ_1, T⟩ [14], [16]. S denotes the set of states and A denotes the set of actions. P^{a_t}_{s_t,s_{t+1}} refers to the probability of transitioning from state s_t to state s_{t+1} when taking action a_t (Eq. 1), which has the Markov property P[s_{t+1}|s_t] = P[s_{t+1}|s_1, ..., s_t]. Similarly, R^{a_t}_{s_t,s_{t+1}} ∈ R is the expected reward when taking action a_t and transitioning from state s_t to state s_{t+1} (Eq. 2), which is bounded, i.e. |R^{a_t}_{s_t,s_{t+1}}| ≤ M < ∞ for some M ∈ R:

P^{a_t}_{s_t,s_{t+1}} = P[s_{t+1} | s_t, a_t]   (1)
R^{a_t}_{s_t,s_{t+1}} = E[r_{t+1} | s_t, a_t, s_{t+1}]   (2)

If P^{a_t}_{s_t,s_{t+1}} and R^{a_t}_{s_t,s_{t+1}} are independent of the time-step t, the MDP is said to be stationary, and if S and A are finite, the MDP is said to be finite. Finally, γ ∈ (0, 1] is the discount factor, ρ_1 : S → [0, 1] is the initial state distribution, and T is the time horizon. The system evolves in discrete time-steps from t = 1 to t = T, which constitute one episode of the system.

A Partially Observed Markov Decision Process (POMDP) is an extension of an MDP [41], [17]. In contrast to an MDP, in a POMDP the states are not directly observable. A POMDP is defined by a nine-tuple M_P = ⟨S, A, P^{a_t}_{s_t,s_{t+1}}, R^{a_t}_{s_t,s_{t+1}}, γ, ρ_1, T, O, Z⟩, where the first seven elements define an MDP, O denotes the set of observations, and Z(o_{t+1}, s_{t+1}, a_t) = P[o_{t+1}|s_{t+1}, a_t] denotes the observation function. Since the states are not directly observable, the agent maintains a belief state b_t ∈ ∆(S), defined as b_t(s_t) = P[s_t|h_t], where h_t = (ρ_1, a_1, o_1, ..., a_{t-1}, o_t) is the history of actions and observations [42], [43] and ∆(S) denotes the set of probability distributions over S. By defining the state at time t to be the belief state b_t, a POMDP can be formulated as a continuous-state MDP: M = ⟨B, A, P^{a_t}_{b_t,b_{t+1}}, R^{a_t}_{b_t,b_{t+1}}, γ, ρ_1, T⟩. The belief state can be computed recursively as follows [17]:

b_{t+1}(s_{t+1}) = C · Z(o_{t+1}, s_{t+1}, a_t) Σ_{s_t ∈ S} P^{a_t}_{s_t,s_{t+1}} b_t(s_t)   (3)

where C = 1 / ( Σ_{s_{t+1} ∈ S} Z(o_{t+1}, s_{t+1}, a_t) Σ_{s_t ∈ S} P^{a_t}_{s_t,s_{t+1}} b_t(s_t) ) is a normalizing factor independent of s_{t+1} that makes b_{t+1} sum to 1.
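The recursive belief update (Eq. 3) can be sketched in a few lines. A minimal illustration, assuming small NumPy arrays P[a][s, s'] for the transition probabilities and Z[o, s', a] for the observation function; all array values here are hypothetical:

```python
import numpy as np

def belief_update(b, a, o, P, Z):
    # Eq. 3: b'(s') = C * Z(o, s', a) * sum_s P^a_{s,s'} b(s)
    b_next = Z[o, :, a] * (b @ P[a])
    C = b_next.sum()  # the normalizer in Eq. 3 is 1 / this sum
    return b_next / C

# Toy example: 2 states, 2 actions, 2 observations (made-up values)
P = np.array([[[0.9, 0.1], [0.0, 1.0]],
              [[0.9, 0.1], [0.0, 1.0]]])   # P[a][s, s']
Z = np.array([[[0.8, 0.8], [0.3, 0.3]],
              [[0.2, 0.2], [0.7, 0.7]]])   # Z[o, s', a]
b = belief_update(np.array([1.0, 0.0]), a=0, o=1, P=P, Z=Z)  # → [0.72, 0.28]
```

Observing o = 1, which is more likely in state 1, shifts the belief mass toward state 1 relative to the predicted distribution [0.9, 0.1].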

B. The Reinforcement Learning Problem
Reinforcement learning deals with the problem of choosing a sequence of actions for a sequentially observed state variable to maximize a reward function [44], [45]. This problem can be modeled with an MDP if the state space is observable, or with a POMDP if the state space is not fully observable.
In the context of an MDP, a policy is defined as a function π : {1, ..., T} × S → ∆(A), where ∆(A) denotes the set of probability distributions over A. In the case of a POMDP, a policy is defined as a function π : H → ∆(A), where H is the set of histories, or, alternatively, as a function π : {1, ..., T} × B → ∆(A). In both cases, a policy is called stationary if it is independent of the time-step t given the current state or belief state.
An optimal policy π* is a policy that maximizes the expected discounted cumulative reward over the time horizon:

π* ∈ argmax_{π ∈ Π} E_π[ Σ_{t=1}^{T} γ^{t−1} r_t ]   (4)

where Π is the policy space, γ is the discount factor, r_t is the reward at time t, and E_π denotes the expectation under π.
The Bellman equations relate any optimal policy π* to the two value functions V* : S → R and Q* : S × A → R, where S and A are the state and action spaces of an MDP [46]:

V*(s_t) = max_{a_t ∈ A} Σ_{s_{t+1}} P^{a_t}_{s_t,s_{t+1}} ( R^{a_t}_{s_t,s_{t+1}} + γV*(s_{t+1}) )   (5)
Q*(s_t, a_t) = Σ_{s_{t+1}} P^{a_t}_{s_t,s_{t+1}} ( R^{a_t}_{s_t,s_{t+1}} + γV*(s_{t+1}) )   (6)
π*(s_t) ∈ argmax_{a_t ∈ A} Q*(s_t, a_t)   (7)

V*(s_t) and Q*(s_t, a_t) denote the expected cumulative discounted reward under π* for each state and state-action pair, respectively. Solving Eqs. 5-6 means computing the value functions, from which an optimal policy can be obtained (Eq. 7). In the case of a POMDP, the Bellman equations contain b_t instead of s_t and V*(b_t) is convex [47].
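As an illustration of solving the Bellman equations by dynamic programming, the following is a minimal value-iteration sketch for a toy finite MDP; the array layout and the example MDP are assumptions for illustration, not the paper's model:

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    # P[a][s, s'] are transition probabilities, R[a][s, s'] are rewards
    nA, nS, _ = P.shape
    V = np.zeros(nS)
    while True:
        # Q(s, a) = sum_{s'} P^a_{s,s'} (R^a_{s,s'} + gamma V(s'))
        Q = (P * (R + gamma * V)).sum(axis=2)   # shape (nA, nS)
        V_new = Q.max(axis=0)                   # Bellman backup
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=0)      # value and greedy policy
        V = V_new

# Toy MDP: being in state 1 yields reward 1 per step;
# action 0 stays in place, action 1 jumps to state 1
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
R = np.array([[[0.0, 0.0], [1.0, 1.0]],
              [[0.0, 0.0], [1.0, 1.0]]])
V, policy = value_iteration(P, R)  # V ≈ [19, 20], policy[0] = 1
```

Here V(1) = 1/(1 − γ) = 20 and the greedy policy moves from state 0 to state 1, matching the fixed point of Eqs. 5-7.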
Two principal methods are used for finding an optimal policy in a finite MDP or POMDP: dynamic programming and reinforcement learning.
First, the dynamic programming method (e.g. value iteration [46], [48], [16]) assumes complete knowledge of the seven-tuple MDP or the nine-tuple POMDP and obtains an optimal policy by solving the Bellman equations iteratively (Eq. 7). Each iteration has polynomial time-complexity for MDPs, whereas computing an optimal policy for a POMDP is PSPACE-complete [39].
Second, the reinforcement learning method computes or approximates an optimal policy without requiring complete knowledge of the transition probabilities or observation probabilities of the MDP or POMDP. Three classes of reinforcement learning algorithms exist: value-based algorithms, which approximate solutions to the Bellman equations (e.g. Q-learning [49]); policy-based algorithms, which directly search through policy space using gradient-based methods (e.g. Proximal Policy Optimization (PPO) [50]); and model-based algorithms, which learn the transition or observation probabilities of the MDP or POMDP (e.g. Dyna-Q [45]). The three algorithm types can also be combined, e.g. through actor-critic algorithms, which are mixtures of value-based and policy-based algorithms [45]. In contrast to dynamic programming algorithms, reinforcement learning algorithms generally have no guarantees to converge to an optimal policy except for the tabular case [51], [52].
C. Optimal Stopping

Many variants of the optimal stopping problem have been studied: discrete-time and continuous-time problems, finite-horizon and infinite-horizon problems, problems with fully observed and partially observed state spaces, problems with finite and infinite state spaces, Markovian and non-Markovian problems, and single-stop and multi-stop problems. Consequently, different solution methods for these variants have been developed. The most commonly used methods are the martingale approach [54], [55], [60] and the Markovian approach [53], [48], [16], [56], [57].
In this paper, we investigate the multiple stopping problem with L stops, a finite time horizon T, discrete-time progression, bounded rewards, a finite state space, and the Markov property. We use the Markovian solution approach and model the problem as a POMDP, where the system state evolves as a discrete-time Markov process (s_{t,l})_{t=1}^{T} that is partially observed and depends on the number of stops remaining l ∈ {1, ..., L}.
At each time-step t of the decision process, two actions are available: "stop" (S) and "continue" (C). The stop action with l stops remaining yields a reward R^S_{s_t,s_{t+1},l_t}, and if only one of the L stops remains, the process terminates. In the case of a continue action or a non-final stop action a_t, the decision process transitions to the next state according to the transition probabilities P^{a_t}_{s_t,s_{t+1},l_t} and yields a reward R^{a_t}_{s_t,s_{t+1},l_t}. The stopping time with l stops remaining is a random variable τ_l that is dependent on s_1, ..., s_{τ_l} and independent of s_{τ_l+1}, ..., s_T [54]:

τ_l = inf{t : t > τ_{l+1}, a_t = S},   with τ_{L+1} ≜ 0   (8)

The objective is to find a stopping policy π*_l(s_t) → {S, C} that depends on l and maximizes the expected discounted cumulative reward of the stopping times τ_L, τ_{L−1}, ..., τ_1:

π*_l ∈ argmax_{π_l} E_{π_l}[ Σ_{t=1}^{T} γ^{t−1} R^{a_t}_{s_t,s_{t+1},l_t} ]   (9)

Due to the Markov property, any policy that satisfies Eq. 9 also satisfies the Bellman equation (Eq. 7), which in the partially observed case is:

V*_l(b) = max{ E_l[ R^S_{b,b^o_S,l} + γV*_{l−1}(b^o_S) ], E_l[ R^C_{b,b^o_C,l} + γV*_l(b^o_C) ] },   with V*_0 ≜ 0   (10)

where π_l is the stopping policy with l stops remaining, E_l denotes the expectation with l stops remaining, b is the belief state, V*_l is the value function with l stops remaining, b^o_S and b^o_C can be computed using Eq. 3, and R^a_{b,b^o_a,l} is the expected reward of action a ∈ {S, C} in belief state b when observing o with l stops remaining.

IV. FORMALIZING THE INTRUSION PREVENTION USE CASE AND OUR REINFORCEMENT LEARNING APPROACH
We first present a formal model of the use case described in Section II and then we introduce our solution method. Specifically, we first define a POMDP model of the intrusion prevention use case. Then, we apply the theory of dynamic programming and optimal stopping to obtain structural results for an optimal defender policy. Lastly, we describe our reinforcement learning approach to approximate an optimal policy.

A. A POMDP Model of the Intrusion Prevention Use Case
We formulate the intrusion prevention use case as a multiple stopping problem where an intrusion starts at a random time and each stop is associated with a defensive action (Fig. 3). We model this problem as a POMDP.
1) Actions A: The defender has two actions: "stop" (S) and "continue" (C). The action space is thus A = {S, C}. We encode S with 1 and C with 0 to simplify the formal description below.
The number of stops that the defender must execute to prevent an intrusion is L ≥ 1, which is a predefined parameter of our use case.
2) States S and Initial State Distribution ρ_1: The system state s_t ∈ {0, 1} is zero if no intrusion is occurring and s_t = 1 if an intrusion is ongoing. In the initial state, no intrusion is occurring and s_1 = 0. Hence, the initial state distribution is the degenerate distribution ρ_1(0) = 1. Further, we introduce a terminal state ∅ ∈ S, which is reached after the defender takes the final stop action or after an intrusion is prevented (see below). The state space is thus S = {0, 1, ∅}.
3) Observations O: The defender has a partial view of the system. If s_t ≠ ∅, the defender observes o_t = (l_t, ∆x_t, ∆y_t, ∆z_t), where l_t ∈ {1, 2, ..., L} is the number of stops remaining and (∆x_t, ∆y_t, ∆z_t) are bounded counters that denote the number of severe IDS alerts, warning IDS alerts, and login attempts generated during time-step t, respectively. If the system is in the terminal state, the defender observes o_t = ∅. Hence, the observation space is O = ({1, ..., L} × {0, ..., ∆x_max} × {0, ..., ∆y_max} × {0, ..., ∆z_max}) ∪ {∅}.
4) Transition Probabilities P^{a_t}_{s_t,s_{t+1},l_t}: We model the start of an intrusion by a Bernoulli process (Q_t)_{t=1}^{T}, where Q_t ∼ Ber(p = 0.01) is a Bernoulli random variable. The time of the first occurrence of Q_t = 1 is the start time of the intrusion I_t, which thus is geometrically distributed, i.e., I_t ∼ Ge(p = 0.01) (Fig. 4).
We define the time-homogeneous transition probabilities P^{a_t}_{s_t,s_{t+1},l_t} = P_{l_t}[s_{t+1}|s_t, a_t] as follows:

P_1[∅ | s_t, 1] = 1   (11)
P_{l_t}[0 | 0, a_t] = 1 − p   (12)
P_{l_t}[1 | 0, a_t] = p   (13)
P_{l_t}[1 | 1, a_t] = 1   (14)

where p = 0.01 and P_{l_t} denotes the probability with l_t stops remaining. All other state transitions occur with probability 0. Eq. 11 defines the transition probabilities to the terminal state ∅. The terminal state is reached when the final (l_t = 1) stop action S (a_t = 1) is taken. If Eq. 11 is not applicable, i.e., if the system does not reach the terminal state, then the transition probabilities when taking action S (a_t = 1) or C (a_t = 0) are defined by Eqs. 12-14.
Eq. 12 captures the case where no intrusion occurs and s_{t+1} = s_t = 0; Eq. 13 specifies the case when the intrusion starts, where s_t = 0 and s_{t+1} = 1; and Eq. 14 describes the case where an intrusion is in progress and s_{t+1} = s_t = 1. With this definition of the transition probabilities, the evolution of the system can be understood using the state transition diagram in Fig. 5.
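The geometrically distributed intrusion start time described above can be sampled directly. A small sketch with p = 0.01 as in the model; the function name and seeding scheme are illustrative:

```python
import random

def sample_intrusion_start(p=0.01, seed=0):
    # First success time of a Bernoulli(p) process: I ~ Ge(p), support {1, 2, ...}
    rng = random.Random(seed)
    t = 1
    while rng.random() >= p:
        t += 1
    return t

# The mean intrusion start time is 1/p = 100 time-steps
samples = [sample_intrusion_start(p=0.01, seed=s) for s in range(2000)]
mean = sum(samples) / len(samples)
```

Averaging over many samples recovers the expected start time E[I_t] = 1/p = 100, which the time-horizon argument in Section IV-A.7 relies on being finite.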
5) Observation Function Z(o_{t+1}, s_{t+1}, a_t): We assume that the numbers of IDS alerts and login attempts generated during one time-step are discrete random variables X ∼ f_X, Y ∼ f_Y, Z ∼ f_Z that depend on the state. Consequently, the probability that ∆x severe alerts, ∆y warning alerts, and ∆z login attempts occur during time-step t can be expressed as f_XYZ(∆x, ∆y, ∆z | s_t).
We define the time-homogeneous observation function Z(o_{t+1}, s_{t+1}, a_t) = P[o_{t+1}|s_{t+1}, a_t] as follows:

Z((l, ∆x, ∆y, ∆z), s_{t+1}, a_t) = f_XYZ(∆x, ∆y, ∆z | s_{t+1})   for s_{t+1} ∈ {0, 1}   (15)
Z(∅, ∅, a_t) = 1   (16)

6) Reward Function R^{a_t}_{s_t,l_t}: The objective of the intrusion prevention use case is to maintain service on the infrastructure while, at the same time, preventing a possible intrusion. Therefore, we define the reward function to give the maximal reward if the defender maintains service until the intrusion starts and then prevents the intrusion by taking L stop actions.
The reward per time-step R^{a_t}_{s_t,l_t} is parameterized by the reward that the defender receives for stopping an intrusion (R_st = 50), the reward for maintaining service (R_sla = 1), and the loss of being intruded (R_int = −10). Eq. 17 states that the reward in the terminal state is zero. Eq. 18 indicates that each stop incurs a cost by interrupting service and possibly yields a reward if it affects an ongoing intrusion. Lastly, Eq. 19 states that the defender receives a positive reward for maintaining service and a loss for each time-step that it is under intrusion. (Remark: the reward function can equivalently be stated to give a cumulative reward upon transitioning to the terminal state and zero reward otherwise [16].)

7) Time Horizon T_∅: The time horizon T_∅ is a random variable that indicates the time t when the terminal state ∅ is reached. Since the expected time of intrusion E[I_t] is finite, it follows that E_{π_l}[T_∅] < ∞ for any policy π_l that is guaranteed to use L stops as t → ∞. Further, as the continue reward is negative when t > I_t, the optimal stopping times τ_1, ..., τ_L exist. (Remark: it is also possible to define T = ∞ and let ∅ be an absorbing state.)

8) Policy Space Π_l and Objective Function J: As the POMDP is stationary and the time horizon T_∅ is not predetermined, it is sufficient to consider stationary policies. Further, since the POMDP is finite, an optimal deterministic policy exists [16], [17]. Despite this, we consider stochastic policies to enable smooth optimization. Specifically, we consider the space of stationary stochastic policies Π_l, where π_l ∈ Π_l is a policy π_l : B → ∆(A) that depends on l ∈ {1, ..., L}.
An optimal policy π*_l ∈ Π_l maximizes the expected discounted cumulative reward over the horizon T_∅:

π*_l ∈ argmax_{π_l ∈ Π_l} E_{π_l}[ Σ_{t=1}^{T_∅} γ^{t−1} R^{a_t}_{s_t,l_t} ]   (20)

We set the discount factor to γ = 1. (The objective in Eq. 20 is upper bounded when γ = 1 since E_{π_l}[T_∅] is finite for any policy π_l ∈ Π_l that is guaranteed to use L stops as t → ∞, which is true for any optimal policy; see Lemma 1 in Appendix A.) Eq. 20 defines an optimization problem which reflects the objective of our use case. In the following section, we state structural properties of an optimal policy that solves this problem.

B. Threshold Properties of an Optimal Policy
A policy that solves the multiple stopping problem is a solution to Eqs. 20-21. We know from the theory of dynamic programming that this policy satisfies the Bellman equation formulated in terms of the belief state (Eq. 10) [17], [42].
The belief state b_t is defined as b_t(s_t) = P[s_t|h_t] (see Section III-A). As the state space of the POMDP is S = {0, 1, ∅} (see Fig. 5), b_t is a probability vector with two components: b_t(0) = P[s_t = 0|h_t] and b_t(1) = P[s_t = 1|h_t], where t = 1, ..., T_∅ − 1. Further, since b_t(0) = 1 − b_t(1), the belief state is determined by b_t(1) and the belief space B can be described by the unit interval, i.e., B = [0, 1].
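For this two-state POMDP, the general belief update (Eq. 3) under the continue action reduces to a scalar recursion on b(1): the predicted intrusion probability is b(1) + (1 − b(1))p by Eqs. 12-14, weighted by the observation likelihoods. A sketch, where the likelihoods f(o|s) would come from the estimated distributions; the function name and example values are illustrative:

```python
def belief_update_scalar(b1, f_o_given_1, f_o_given_0, p=0.01):
    # Predicted probability of intrusion after one continue step:
    # P[s_{t+1} = 1] = b(1) + (1 - b(1)) * p, then Bayes-weight by likelihoods
    num = f_o_given_1 * (b1 + (1 - b1) * p)
    den = num + f_o_given_0 * (1 - b1) * (1 - p)
    return num / den

# With uninformative observations the belief only grows by the prior p
b = belief_update_scalar(0.0, f_o_given_1=1.0, f_o_given_0=1.0)  # → 0.01
# An observation that is more likely under intrusion pushes the belief up
b = belief_update_scalar(0.5, f_o_given_1=0.7, f_o_given_0=0.2)  # ≈ 0.78
```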
We partition B into two sets: the stopping set S_l = {b(1) ∈ [0, 1] : π*_l(b(1)) = S}, which contains the belief states where it is optimal to stop, and the continuation set C_l = [0, 1] \ S_l, which contains the belief states where it is optimal to continue. The number of stops remaining, l, ranges from 1 to L.
Applying the theory developed in [17], [34], [33], we obtain the following structural result for an optimal policy.

Theorem 1. Given the POMDP in Section IV-A, let L denote the number of stop actions, f_XYZ|s the conditional distribution of the observations, b(1) the belief state, S_l the stopping set, and C_l the continuation set. The following holds:

(A) S_1 ⊆ S_2 ⊆ ... ⊆ S_L   (22)
(B) If L = 1, there exists a value α* ∈ [0, 1] and an optimal policy π*_L that satisfies:
π*_L(b(1)) = S ⟺ b(1) ≥ α*   (23)
(C) If L ≥ 1 and f_XYZ|s is totally positive of order 2 (i.e., TP2), there exist L values α*_1 ≥ α*_2 ≥ ... ≥ α*_L ∈ [0, 1] and an optimal policy π*_l that satisfies:
π*_l(b(1)) = S ⟺ b(1) ≥ α*_l   (24)

Proof. See Appendix A.
Theorem 1.A states that the stopping sets have a nested structure. This means that if it is optimal to stop when b(1) has a certain value while l − 1 stops remain, then it is also optimal to stop for the same value when l or more stops remain.
Theorem 1.B and Theorem 1.C state that there exist optimal policies with threshold properties (see Fig. 6). If L ≥ 1, an additional condition applies: the probability matrix of f_XYZ|s must be TP2, i.e., all second-order minors must be nonnegative.

Knowing that there exist optimal policies with special structure has two benefits. First, insight into the structure of optimal policies often leads to a concise formulation and efficient implementation of the policies [16], [11]. This is obvious in the case of threshold policies. Second, the complexity of computing or learning an optimal policy can be reduced by exploiting structural properties [17], [36]. In the following section, we describe a reinforcement learning algorithm that exploits the structural result in Theorem 1.
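A multi-threshold policy of the form in Theorem 1.C is trivial to implement once the thresholds are known. A sketch, where the threshold values in `alphas` are hypothetical:

```python
def threshold_policy(alphas, b1, l):
    # Theorem 1.C: stop iff b(1) >= alpha*_l, where alphas = [a*_1, ..., a*_L]
    # is sorted in decreasing order (a*_1 >= a*_2 >= ... >= a*_L)
    return "S" if b1 >= alphas[l - 1] else "C"

alphas = [0.9, 0.7, 0.5]            # hypothetical thresholds for L = 3
threshold_policy(alphas, 0.6, l=3)  # "S": the first stop is taken early
threshold_policy(alphas, 0.6, l=1)  # "C": the final stop needs more evidence
```

The decreasing thresholds make the nesting of Theorem 1.A concrete: any belief that triggers the final stop (l = 1) also triggers the earlier stops.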
C. Our Reinforcement Learning Algorithm: T-SPSA

Theorem 1 states that under the given assumptions and given L ≥ 1 stop actions, there exists an optimal policy that is defined by L threshold values α*_1, ..., α*_L ∈ [0, 1] on the belief space. We present an algorithm, which we call T-SPSA, that computes these thresholds through reinforcement learning.
We learn the threshold vector θ through simulation of the POMDP as follows. First, we initialize θ^(1) ∈ R^L randomly. Second, for each iteration n ∈ {1, 2, ...} of T-SPSA, we perturb θ^(n) to obtain θ^(n) + c_n ∆_n and θ^(n) − c_n ∆_n, where c_n ∈ R and ∆_n ∈ R^L. Then, we run two POMDP episodes where the defender takes actions according to the two perturbed threshold vectors (Eq. 26). We then use the obtained episode outcomes Ĵ(θ^(n) + c_n ∆_n) and Ĵ(θ^(n) − c_n ∆_n) to estimate the gradient in Eq. 25 using the Simultaneous Perturbation Stochastic Approximation (SPSA) gradient estimator [63], [64]:

(∇̂_θ J(θ^(n)))_i = ( Ĵ(θ^(n) + c_n ∆_n) − Ĵ(θ^(n) − c_n ∆_n) ) / ( 2 c_n (∆_n)_i )   (25)

where i ∈ {1, ..., L} is the component index of the gradient, c_n = c/n^λ is the perturbation size, and c and λ are hyperparameters.
The perturbation vector ∆_n has components (∆_n)_i that are sampled independently and uniformly from {−1, 1}. Next, we use the estimated gradient and the stochastic approximation algorithm [52] to update the vector of thresholds to maximize J(θ) (Eq. 20):

θ^(n+1) = θ^(n) + a_n ∇̂_θ J(θ^(n))

where a_n = a/(n + A)^ε is the step size and a, A, and ε are hyperparameters.
This process of running two episodes and updating the threshold vector continues until the thresholds have sufficiently converged. The described algorithm, T-SPSA, converges to a local maximum of J(θ) with probability one under standard conditions [63]. Since only a local maximum is guaranteed, we run the algorithm several times with different initial conditions. We list the pseudocode of T-SPSA in Appendix D and give its hyperparameters in Appendix B. Our Python implementation of T-SPSA is available at [65].
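The perturbation, gradient-estimation, and update steps above can be sketched compactly. This is a simplified illustration, not the implementation in [65]: `J_hat` stands in for a (possibly noisy) episode return, and all hyperparameter values are assumptions:

```python
import random

def t_spsa(J_hat, L, n_iters=1000, a=1.0, c=1.0, A=100,
           eps=0.602, lam=0.101, seed=0):
    # SPSA gradient ascent on a threshold vector theta in R^L (Eq. 25)
    rng = random.Random(seed)
    theta = [rng.uniform(0, 1) for _ in range(L)]
    for n in range(1, n_iters + 1):
        a_n = a / (n + A) ** eps                      # step size
        c_n = c / n ** lam                            # perturbation size
        delta = [rng.choice([-1, 1]) for _ in range(L)]  # Rademacher vector
        j_plus = J_hat([t + c_n * d for t, d in zip(theta, delta)])
        j_minus = J_hat([t - c_n * d for t, d in zip(theta, delta)])
        # One two-sided evaluation estimates all L gradient components
        grad = [(j_plus - j_minus) / (2 * c_n * d) for d in delta]
        theta = [t + a_n * g for t, g in zip(theta, grad)]
    return theta

# Toy objective with a known maximum at theta = (0.5, 0.5, 0.5)
theta = t_spsa(lambda th: -sum((x - 0.5) ** 2 for x in th), L=3)
```

The key property of SPSA is visible here: regardless of L, each iteration needs only two evaluations of the objective, i.e., two POMDP episodes.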

V. EMULATING THE TARGET INFRASTRUCTURE TO INSTANTIATE THE SIMULATION AND TO EVALUATE THE LEARNED POLICIES
To simulate episodes of the POMDP and to compute the belief state, we must know the distributions of alerts and login attempts conditioned on the system state. We estimate these distributions using measurements from the emulation system shown in Fig. 2. Moreover, to evaluate the performance of policies learned in the simulation system, we run episodes in the emulation system by executing actions of an emulated attacker and having the defender execute stop actions at times given by the learned policies.

A. Emulating the Target Infrastructure
The emulation system executes on a cluster of machines that runs a virtualization layer provided by Docker [66] containers and virtual links. It implements network isolation and traffic shaping on the containers using network namespaces and the NetEm module in the Linux kernel [67]. Resource constraints of the containers, e.g., CPU and memory constraints, are enforced using cgroups.
The configuration of the emulated infrastructure is given by the topology in Fig. 1 and the configuration in Appendix C. The system emulates the clients, the attacker, the defender, as well as 31 physical components of the target infrastructure (e.g., application servers and the gateway). Physical entities are emulated and software functions are executed in Docker containers of the emulation system. The software functions replicate important components of the target infrastructure, such as web servers, databases, and an IDS.
We emulate internal connections between servers in the infrastructure as full-duplex loss-less connections with bit capacities of 1000 Mbit/s in both directions. External connections between the gateway and the client population and the attacker are emulated as full-duplex connections with bit capacities of 100 Mbit/s, with 0.1% packet loss in normal operation and random bursts of 1% packet loss.
The client population is emulated by three Docker containers that interact with the application servers through functions and protocols listed in Table 1.
The emulation evolves in time-steps. During each step, the defender and the attacker can perform one action each. The defender executes either a continue action or a stop action.
The continue action has no effect on the progression of the emulation but the stop action has. We have implemented L = 3 stop actions, which are listed in Table 2. The first stop revokes all user certificates and recovers user accounts compromised by the attacker. The second and third stops update the firewall configuration of the gateway. Specifically, the second stop adds a rule to the firewall that drops incoming traffic from IP addresses that have been flagged by the IDS, and the third stop blocks all incoming traffic.

We have implemented three attacker profiles: NOVICEATTACKER, EXPERIENCEDATTACKER, and EXPERTATTACKER, all of which execute the sequence of actions listed in Table 3, where I_t is the start time of the intrusion. The actions consist of reconnaissance commands and exploits. During each time-step, one action is executed. The three attackers differ in the reconnaissance command that they use and the number of stops L required to prevent the attack (see Table 4).
NOVICEATTACKER uses brute-force attacks to exploit password vulnerabilities (e.g., SSH dictionary attacks) and uses a TCP/UDP port scan for reconnaissance. The attack is prevented if the defender takes a stop action and revokes the user certificates.
EXPERIENCEDATTACKER uses a ping scan for reconnaissance and performs both brute-force attacks and more sophisticated attacks, such as a command-injection attack (e.g., CVE-2014-6271). The attack is prevented if the defender takes two stop actions and blacklists IP addresses that have been flagged by the IDS in addition to revoking the user certificates.
Lastly, EXPERTATTACKER only targets vulnerabilities that can be exploited without brute-force methods, for example remote code execution vulnerabilities such as CVE-2017-7494, and thus generates less network traffic. Like EXPERIENCEDATTACKER, it uses a ping scan for reconnaissance. The attack is prevented if the defender executes three stop actions and blocks the gateway.
Since the ping scan generates fewer IDS alerts than the TCP/UDP port scan, the reconnaissance actions of EXPERIENCEDATTACKER and EXPERTATTACKER are harder to detect than those of NOVICEATTACKER.

B. Estimating the Distributions of Alerts and Login Attempts
In this section, we describe how we collect data from the emulation system and estimate the distributions of alerts and login attempts.
1) At the end of every time-step, the emulation system collects the metrics ∆x, ∆y, ∆z, which contain the alerts and login attempts that occurred during the time-step. For the evaluation reported in this paper, we collected measurements from 21 000 time-steps of 30 seconds each.
2) From the collected measurements, we compute the empirical distribution f̂_XYZ as an estimate of the corresponding distribution f_XYZ in the target infrastructure. For each state s_t, we obtain the conditional distribution f̂_XYZ|s_t. Fig. 7 shows some of the empirical distributions. The distributions related to EXPERIENCEDATTACKER are omitted for better readability. The estimated distributions for EXPERTATTACKER and EXPERIENCEDATTACKER mostly overlap with the distributions obtained when no intrusion occurs. However, a clear difference can be observed between the distributions obtained during an intrusion by NOVICEATTACKER and the distributions when no intrusion occurs. From these empirical distributions, we note that the assumption in Theorem 1.C that the observation distribution is TP2 is reasonable.
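The estimation in step 2 amounts to counting observation vectors per state and normalizing into relative frequencies; a minimal sketch (the data layout and state encoding here are illustrative assumptions):

```python
from collections import Counter, defaultdict

def estimate_conditional_distributions(trace):
    """Estimate f^_{XYZ|s}: the relative frequency of each observation
    vector (dx, dy, dz) for every state s, from emulation measurements."""
    counts = defaultdict(Counter)
    for state, obs in trace:
        counts[state][obs] += 1
    # normalize the per-state counts into probability distributions
    return {
        s: {o: c / sum(ctr.values()) for o, c in ctr.items()}
        for s, ctr in counts.items()
    }

# Toy trace (illustrative): state 0 = no intrusion, state 1 = intrusion;
# each entry is (state, (severe alerts, warning alerts, login attempts)).
trace = [(0, (0, 1, 0))] * 9 + [(0, (2, 3, 1))] + [(1, (50, 80, 10))] * 5
dist = estimate_conditional_distributions(trace)
# dist[0][(0, 1, 0)] is 0.9; dist[1][(50, 80, 10)] is 1.0
```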

C. Simulating an Episode of the POMDP
During a simulation of the POMDP, the system state evolves according to the dynamics described in Section IV, and the observations evolve according to the estimated distribution f̂_XYZ. In the initial state, no intrusion occurs. During an episode, an intrusion normally starts at a random time. It is also possible that the defender performs L stops before the intrusion would start, in which case no intrusion occurs.
A simulated episode evolves as follows. The episode starts in state s_1 = 0 with l_1 = L. During each time-step, the simulation system samples an action from the defender policy: a_t ∼ π_θ,l(·|b_t). If the action is stop (a_t = 1) and l_t = 1, the episode ends. Otherwise, the number of remaining stop actions is updated: l_{t+1} = l_t − a_t. Further, if an intrusion is in progress, the system executes an attacker action following Table 3. It then updates the state s_t → s_{t+1} and samples ∆x_{t+1}, ∆y_{t+1}, ∆z_{t+1} from the empirical distribution f̂_XYZ|s_{t+1}. (The activities of the clients are not simulated but are captured by f̂_XYZ.) The simulation then computes the belief b_{t+1} using Eq. 3 and the defender reward r_{t+1} using Eqs. 17-19. (Note that the exact reward can be computed during training and evaluation of policies but not when the policies are deployed in the target infrastructure, as it depends on the hidden state.) The sequence of time-steps continues until the defender performs the final stop, after which the episode ends. If the attacker sequence in Table 3 completes before the defender performs the final stop, the sequence is restarted.
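The episode loop above can be sketched for a simplified two-state case; the intrusion-start probability and the rewards are placeholder assumptions (the paper's reward is given by Eqs. 17-19), and `belief_update` plays the role of Eq. 3:

```python
import random

P_INTRUSION = 0.01  # assumed per-step intrusion start probability (placeholder)

def belief_update(b, obs, f_obs):
    """Bayesian belief update over the two states (the role of Eq. 3);
    b is the probability that an intrusion is ongoing."""
    prior1 = b + (1 - b) * P_INTRUSION      # predict: intrusion may start
    prior0 = 1 - prior1
    num = f_obs[1].get(obs, 1e-9) * prior1  # weight by observation likelihood
    den = num + f_obs[0].get(obs, 1e-9) * prior0
    return num / den

def simulate_episode(policy, f_obs, sample_obs, L=3):
    """Run one episode; policy maps (belief, stops-left) to {0: continue, 1: stop}."""
    s, b, l, reward = 0, 0.0, L, 0.0
    while True:
        a = policy(b, l)
        if a == 1:
            l -= 1
            reward += 10.0 if s == 1 else -10.0  # placeholder stop reward
            if l == 0:
                return reward                    # final stop ends the episode
        if s == 0 and random.random() < P_INTRUSION:
            s = 1                                # intrusion starts
        obs = sample_obs(s)                      # sample from f^_{XYZ|s}
        b = belief_update(b, obs, f_obs)
```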

D. Emulating an Episode of the POMDP
Just like a simulated episode, an emulated episode starts with the same initial conditions, evolves in discrete time-steps, and experiences an intrusion at a random time. However, an episode in the emulation system differs from an episode in the simulation system in the following ways. First, attacker and defender actions in the emulation system invoke computing and networking functions with real side-effects in the emulation environment (see Tables 2 and 3). Further, the defender observations in the emulation system are not sampled but are obtained by reading log files and metrics of the emulated infrastructure. Lastly, the emulated client population performs requests to the emulated application servers just like on a real infrastructure (see Section V-A). Due to these differences, running an episode in the emulation system takes much longer than running a similar episode in the simulation system.

VI. LEARNING INTRUSION PREVENTION POLICIES FOR THE TARGET INFRASTRUCTURE
Our approach for finding effective defender policies includes (1) extensive simulation of POMDP episodes in the simulation system to learn policies and (2) evaluation of the learned policies by running POMDP episodes in the emulation system. This section describes our evaluation results.
Policies are trained and simulations are run on a Tesla P100 GPU. The hyperparameters of the training algorithm are listed in Appendix B. The emulated infrastructure is deployed on a server with a 24-core Intel Xeon Gold 2.10 GHz CPU and 768 GB RAM. We have made the code of our simulation system available, as well as the measurement traces used to estimate the observation distributions of the POMDP, which can be used by others to extend and validate our results [65].

A. Evaluation Process
We train three defender policies, against NOVICEATTACKER, EXPERIENCEDATTACKER, and EXPERTATTACKER, until convergence. For each attacker, we run 10 000 training episodes to estimate an optimal defender policy using the method described in Section IV-C. After each episode, we evaluate the current defender policy.
To evaluate a defender policy, we run evaluation episodes and compute various performance metrics. Specifically, we run 500 evaluation episodes in the simulation system and 5 evaluation episodes in the emulation system.
The 10 000 training episodes and the evaluation described above constitute one training run. We run five training runs with different random seeds. A single training run takes about 4 hours of processing time on a P100 GPU for the simulations and the policy training, and around 12 hours for evaluating the policies in the emulation system.
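The means and 95% confidence intervals reported across the five seeded runs can be computed with a standard normal approximation; a minimal sketch (the paper does not state which interval estimator is used, and the reward values below are hypothetical):

```python
import math
import statistics

def mean_ci95(xs):
    """Sample mean and half-width of a normal-approximation 95% CI."""
    m = statistics.mean(xs)
    se = statistics.stdev(xs) / math.sqrt(len(xs))  # standard error of the mean
    return m, 1.96 * se

rewards_per_seed = [92.0, 88.5, 90.1, 91.3, 89.6]  # hypothetical episodic rewards
m, hw = mean_ci95(rewards_per_seed)                # plot m with error bars m +/- hw
```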
We compare the policies learned through T-SPSA with three baseline policies. The first baseline prescribes the stop action whenever an IDS alert occurs, i.e., whenever (∆x + ∆y) ≥ 1. The second baseline is obtained by configuring the Snort IDS as an intrusion prevention system (IPS), which drops network traffic following its internal recommendation system (see Appendix C for the Snort configuration). To calculate the reward, we define 100 dropped IP packets of the Snort IPS to constitute one stop action of the defender. Lastly, the third baseline is an ideal policy that presumes knowledge of the exact intrusion time and performs all stop actions at exactly that time.
We evaluate our algorithm, T-SPSA, by comparing it with three baseline algorithms: Proximal Policy Optimization (PPO) [50], Heuristic Search Value Iteration (HSVI) [68], and Shiryaev's algorithm [69]. PPO is a state-of-the-art reinforcement learning algorithm, HSVI is a state-of-the-art dynamic programming algorithm for POMDPs, and Shiryaev's algorithm is an optimal algorithm for change detection. The main difference between T-SPSA and the first two baselines (PPO and HSVI) is that T-SPSA exploits the threshold structure established in Theorem 1; the main difference from Shiryaev's algorithm is that T-SPSA learns L thresholds, whereas Shiryaev's algorithm uses a single predefined threshold. We set this threshold to 0.75 based on a hyperparameter search (see Appendix B).
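The contrast between the two threshold rules can be illustrated as follows; the learned threshold values are hypothetical placeholders, except for the 0.75 used for Shiryaev's algorithm:

```python
def shiryaev_stop(belief, threshold=0.75):
    """Shiryaev's rule: alarm when the posterior probability of intrusion
    exceeds one fixed, preconfigured threshold."""
    return belief >= threshold

def multi_threshold_stop(belief, l, alphas):
    """Threshold policy with one threshold per number of remaining stops;
    alphas[l-1] applies when l stops remain (Theorem 1: alpha_l >= alpha_{l+1})."""
    return belief >= alphas[l - 1]

# Hypothetical learned thresholds for L = 3 (only 0.75 comes from the text);
# the first stop (l = 3) uses the lowest threshold, the last (l = 1) the highest.
alphas = [0.9, 0.75, 0.6]
```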

B. Learning Intrusion Prevention Policies
Fig. 8 shows the performance of the learned policies against the three attacker types. The red curves represent the results from the simulation system and the blue curves show the results from the emulation system. The purple and orange curves give the performance of the Snort IPS baseline and of the baseline policy that mandates a stop action whenever an IDS alert occurs, respectively. The dashed black curves give the performance of the baseline policy that assumes knowledge of the exact intrusion time.
An analysis of the graphs in Fig. 8 leads us to the following conclusions. First, we observe that the learning curves converge quickly to constant mean values for all attackers and across all investigated performance metrics. From this we conclude that the learned policies have converged as well.
Second, we observe that the converged values of the learning curves are close to the dashed black curves, which give an upper bound on any optimal policy. In addition, we see that the empirical probability that the learned policies prevent an intrusion is close to 1 (middle column of Fig. 8) and that the empirical probability of stopping before the intrusion starts is close to 0 (second rightmost column of Fig. 8). This suggests that the learned policies are close to optimal. We also observe that all learned policies perform significantly better than the Snort IPS baseline and the baseline that stops whenever an IDS alert occurs (leftmost column of Fig. 8).
Third, although the learned policies, as expected, perform better in the simulation system than in the emulation system, we are encouraged by the fact that the curves of the emulation system are close to those of the simulation system.
We also note from Fig. 8 that the learned policies perform best against NOVICEATTACKER and less well against EXPERIENCEDATTACKER and EXPERTATTACKER. For instance, the learned policies against EXPERIENCEDATTACKER and EXPERTATTACKER are more likely to stop before an intrusion has started (second rightmost column of Fig. 8). This indicates that NOVICEATTACKER is easier for the defender to detect, as its actions create more IDS alerts than those of the other attackers, as pointed out in Section V-A.
Lastly, Fig. 9 shows a comparison between our reinforcement learning algorithm (T-SPSA) and the three baseline algorithms in the simulation system. We observe in Fig. 9 that both T-SPSA and PPO converge to close approximations of an optimal policy within an hour of training, whereas HSVI does not converge within the measured time. The slow convergence of HSVI illustrates the intractability of using dynamic programming to compute policies for large POMDPs [39]. We also see in Fig. 9 that T-SPSA converges significantly faster than PPO. This is expected, since T-SPSA considers a smaller space of policies than PPO. Finally, we note in Fig. 9 that T-SPSA outperforms Shiryaev's algorithm, which demonstrates the benefit of using L thresholds instead of a single threshold.
VII. RELATED WORK
While the research reported in this paper is informed by all the above works, we limit the following discussion to prior work that centers on finding security policies through reinforcement learning, a topic area that has grown considerably in recent years. Three seminal papers, [83], [84], and [85], published in 2000, 2005, and 2008, respectively, analyze intrusion prevention use cases and evaluate traditional reinforcement learning algorithms for this task. These papers have inspired much follow-up research, e.g., on deep reinforcement learning algorithms for intrusion prevention [12], [13], [25] and on new use cases, such as defense against jamming attacks [86], mitigation of denial-of-service attacks [87], [88], defense against advanced persistent threats [89], placement of honeypots [90], botnet detection [91], [92], detection of flip attacks [93], detection of network traffic anomalies [94], greybox fuzzing [95], and defense against topology attacks [96].
This paper differs from the works referenced above in two main ways. First, we formulate the intrusion prevention problem as a multiple stopping problem, whereas the other works formulate the problem as solving a general MDP, POMDP, or Markov game. The advantage of our approach is that we obtain structural properties of optimal policies, which have practical benefits (see Section IV-B).
Problem formulations based on optimal stopping theory can be found in prior research on change detection [69], [97], [38], [37], [93], [13]. Compared to these papers, our approach is more general in that it allows multiple stop actions within an episode. Another difference is that we model intrusion prevention rather than intrusion detection. Further, compared with traditional change detection algorithms, e.g., CUSUM [97] and Shiryaev's algorithm [69], our algorithm learns thresholds rather than assuming them to be preconfigured.
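For reference, CUSUM illustrates the preconfigured-threshold approach that T-SPSA avoids: it accumulates a clipped log-likelihood ratio and alarms when a fixed threshold is crossed. A minimal sketch (the densities and threshold below are illustrative assumptions):

```python
import math

def cusum(samples, f0, f1, h):
    """CUSUM change detection: accumulate the log-likelihood ratio, clipped
    at zero, and raise an alarm when the statistic exceeds threshold h.
    f0/f1 map an observation to its probability before/after the change."""
    s = 0.0
    for t, x in enumerate(samples, start=1):
        s = max(0.0, s + math.log(f1(x) / f0(x)))
        if s >= h:
            return t   # alarm time
    return None        # no alarm raised

# Illustrative observation models: x = 1 means "IDS alert during the step".
f0 = lambda x: 0.9 if x == 0 else 0.1   # pre-change: alerts are rare
f1 = lambda x: 0.2 if x == 0 else 0.8   # post-change: alerts are frequent
```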
Second, our solution method for finding effective intrusion prevention policies uses an emulation system in addition to a simulation system. The advantage of our method compared to simulation-only approaches [12], [13], [18], [19], [20], [38], [21], [22], [40], [24], [25], [26] is that the parameters of our simulation system are determined by measurements from an emulation system instead of being chosen by a human expert. Further, the learned policies are evaluated in the emulation system, not in the simulation system. As a consequence, the evaluation results give higher confidence in the obtained policies' performance in the target infrastructure than simulation results would provide. Some prior works on reinforcement learning for intrusion prevention that make use of emulation are [23], [27], [28], and [29]. They emulate software-defined networks based on Mininet [98]. The main differences between these efforts and the work described in this paper are: (1) we develop our own emulation system, which allows for experiments with a large variety of exploits; (2) we focus on a different intrusion prevention use case; (3) we do not assume that the defender has perfect observability; and (4) we use an underlying theoretical framework to formalize the use case, derive structural properties of optimal policies, and test these properties in an emulation system.
Finally, [99] and [100] describe ongoing efforts to build emulation platforms for reinforcement learning that resemble our emulation system. In contrast to these works, our emulation system has been built to investigate the specific use case of intrusion prevention and forms an integral part of our general solution method (see Fig. 2).

VIII. CONCLUSION AND FUTURE WORK
In this paper, we proposed a novel formulation of the intrusion prevention problem based on the theory of optimal stopping.This formulation allowed us to derive that a threshold policy based on infrastructure metrics is optimal, which has several practical benefits.
To find and evaluate policies, we used a reinforcement learning method that includes a simulation system and an emulation system.In contrast to a simulation-only approach, our method produces policies that can be executed in a target infrastructure.
Through extensive evaluations, we showed that our approach can produce effective defender policies for a practical configuration of an IT infrastructure (Figs. 8-9). We also demonstrated that our reinforcement learning algorithm (T-SPSA), which takes advantage of the threshold structure (Theorem 1), outperforms state-of-the-art algorithms on our use case.
We make assumptions in this paper that limit the practical applicability of the results: the attacker follows a static policy, and the defender learns only the times at which to take defensive actions but not the types of actions. The question therefore arises whether our approach can be extended so that (1) the attacker can pursue a wide range of realistic policies and (2) the defender learns optimal policies that express not only when defensive actions need to be taken but also the specific measure to be executed.
Addressing these points is part of our research agenda. A dynamic attacker can be studied using a game-theoretic extension of the introduced framework. The theory tells us that an optimal solution can be found through self-play in a similar manner as described in this paper, but further work is needed to show that such a solution is feasible in practice. Scenarios involving several attackers can also be studied in this context.
We also plan to extend the defender model to include the selection of defensive actions.One possible approach is to learn two orthogonal policies: a policy that decides when to take a defensive action and another policy that decides which action to take.

IX. ACKNOWLEDGMENTS
This research has been supported in part by the Swedish Armed Forces and was conducted at the KTH Center for Cyber Defense and Information Security (CDIS). The authors would like to thank Pontus Johnson for his useful input to this research and Vikram Krishnamurthy for helpful discussions. The authors are also grateful to Forough Shahab Samani and Xiaoxuan Wang for their constructive comments on a draft of this paper.

APPENDIX A
PROOF OF THEOREM 1
Given the POMDP introduced in Section IV-A, let L denote the number of stop actions, f_XYZ the observation distribution, B = [0, 1] the belief space (see Section IV-B), b(1) the belief state, S_l the stopping set, and C_l the continuation set.
We use the value iteration algorithm to establish structural properties of V*_l and π*_l [16], [17]. Let V^k_l, S^k_l, and C^k_l denote the value function, the stopping set, and the continuation set at iteration k of the value iteration algorithm, respectively. Let V^0_l(b(1)) = 0 for b(1) ∈ [0, 1] and l ∈ {1, . . ., L}. Then lim_{k→∞} V^k_l = V*_l, lim_{k→∞} S^k_l = S_l, and lim_{k→∞} C^k_l = C_l [16], [17].
The main idea behind the proof of Theorem 1 is to show that the stopping sets S_l have the form S_l = [α*_l, 1] ⊆ B and that α*_l ≥ α*_{l+1} for l ∈ {1, . . ., L}. Towards this goal, we state the following four lemmas.
The left matrix corresponds to the transition probabilities when a_t = C, or when a_t = S and l_t > 1. The right matrix represents the transition probabilities when a_t = S and l_t = 1. To show that P^{a_t}_{s_t,s_{t+1},l_t} is TP2, it is sufficient to show that all nine second-order minors of both matrices are non-negative. The second-order minors of the first matrix are M_{1,2} = M_{1,3} = M_{2,3} = M_{3,1} = M_{3,2} = 0, M_{1,1} = 1, M_{2,1} = 0.01, and M_{2,2} = M_{3,3} = 0.99, where M_{i,j} denotes the determinant of the submatrix formed by deleting the i-th row and j-th column. For the second matrix, all second-order minors are zero. Hence, P^{a_t}_{s_t,s_{t+1},l_t} is TP2.

[Residue of the T-SPSA pseudocode listing: for n ∈ {1, . . ., N} do θ(n+1) ← θ(n) + a_n ∇̂_{θ(n)} J(θ(n)); end for; return θ(N+1); end procedure.]
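The TP2 check can be verified numerically. The first matrix below is reconstructed from the listed minors (an assumption about the exact entries), with states ordered as (no intrusion, intrusion, terminal); the second sends every state to the terminal state on the final stop:

```python
from itertools import combinations

import numpy as np

def second_order_minors(A):
    """All 2x2 minors det(A[rows, cols]) of a 3x3 matrix (nine in total)."""
    return [
        np.linalg.det(A[np.ix_(r, c)])
        for r in combinations(range(3), 2)
        for c in combinations(range(3), 2)
    ]

def is_tp2(A, tol=1e-12):
    """A matrix is TP2 if all of its second-order minors are non-negative."""
    return all(m >= -tol for m in second_order_minors(A))

# Transition matrix for a_t = C (or a_t = S with l_t > 1), reconstructed
# from the minors in the proof: intrusion starts with probability 0.01.
P_continue = np.array([[0.99, 0.01, 0.0],
                       [0.0,  1.0,  0.0],
                       [0.0,  0.0,  1.0]])
# Transition matrix for the final stop (a_t = S, l_t = 1).
P_final = np.array([[0.0, 0.0, 1.0],
                    [0.0, 0.0, 1.0],
                    [0.0, 0.0, 1.0]])
```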

Fig. 3: Optimal multiple stopping formulation of intrusion prevention; the horizontal axis represents time; T is the time horizon; the episode length is T − 1; the dashed line shows the intrusion start time; the optimal policy is to prevent the attacker at the time of intrusion.

Fig. 4: The cumulative distribution function (CDF) of the intrusion start time I_t.

Fig. 5: State transition diagram of the POMDP: each circle represents a state; an arrow represents a state transition; a label indicates the event that triggers the transition; an episode starts in state s_1 = 0 with l_1 = L.

Fig. 7: Empirical distributions of severe IDS alerts ∆x (top row), warning IDS alerts ∆y (middle row), and login attempts ∆z (bottom row), generated during time-steps of intrusions by the different attackers as well as during time-steps when no intrusion occurs.

Fig. 8: Learning curves obtained during training of T-SPSA; red curves show simulation results and blue curves show emulation results; the purple, orange, and black curves relate to baseline policies; the rows from top to bottom relate to NOVICEATTACKER, EXPERIENCEDATTACKER, and EXPERTATTACKER; the columns from left to right show the performance metrics: episodic reward, episode length, empirical prevention probability, empirical early-stopping probability, and the time between the start of the intrusion and the L-th stop action; the curves show the mean and 95% confidence interval over five training runs with different random seeds.

Fig. 9:

Lemma 1. During a POMDP episode, an optimal policy π*_L prescribes L stop actions.

Proof. The proof follows directly from the definition of the transition probabilities (see Eqs. 11-14) and the reward function (see Eqs.

[Residue of the T-SPSA pseudocode listing: R_high ∼ Ĵ(θ(n) + c_n Δ_n), R_low ∼ Ĵ(θ(n) − c_n Δ_n); for i ∈ {1, . . ., L} do (∇̂_{θ(n)} J(θ(n)))_i ← (R_high − R_low)/(2 c_n (Δ_n)_i); end for.]

[Fragment of the POMDP definition: the tuple ends with ρ_1, T, O, Z; the first seven elements define an MDP. O denotes the set of observations and Z(o_{t+1}, s_{t+1}, a_t) = P[o_{t+1} | s_{t+1}, a_t] is the observation function, where o_{t+1} ∈ O, s_{t+1} ∈ S, and a_t ∈ A. If O, S, and A are finite, the POMDP is said to be finite. The belief state b_t ∈ B is defined as b_t(s) = P[s_t = s | h_t] for all s ∈ S; b_t is a sufficient statistic of the state s_t based on the history h_t of the initial state distribution, the actions, and the observations: h_t = (ρ_1, a_1, o_1, . . ., a_{t−1}, o_t) ∈ H. The belief space B = Δ(S) is the unit (|S| − 1)-simplex [17, Definition 10.2.1, pp. 223], [61].]

[Footnote on the TP2 assumption: this condition is satisfied, for example, if f_{XYZ|s} is stochastically monotone in s.]

TABLE 2: Defender stop commands in the emulation.

TABLE 3: Attacker actions in the emulation.

TABLE 4: Number of stops L required to prevent the attack, and the reconnaissance commands of the attacker profiles.