A Soft-Kill Reinforcement Learning Counter Unmanned Aerial System (C-UAS) With Accelerated Training

In recent years, unmanned aerial vehicles (UAVs) have gained significant popularity and are used for many applications, from entertainment to surveillance and the modern battlefield. As regulation demands arose worldwide, controlling and reacting to unauthorized flights of UAVs became a pressing issue. In this work, we present an algorithm to accelerate the training of a reinforcement learning drone agent for a counter unmanned aerial system (C-UAS). The main objective of this C-UAS is to guide an invading drone to a safe-killing zone (SZ) using a hunter quadrotor drone. The hunter quadrotor launches a spoofing, or meaconing, attack on the GNSS receiver of the invading drone. The proposed algorithms employ an abstraction of the C-UAS problem to accelerate the training step and enable training during the mission. Results for different SZ radii, including error metrics and action time, are discussed using a software-in-the-loop simulation as ground truth, based on a detailed model of the UAV embedded system and flight dynamics. We show that a 99% probability of successful target steering to the SZ can be achieved considering an SZ radius of 75 meters and a Q-table trained with the proposed accelerated training model.


I. INTRODUCTION
In the last decade, unmanned aerial vehicles (UAVs) have received increasing academic and commercial attention [1], [2]. While autonomous and self-driving cars are being developed and are slowly starting to appear in the market, aerial vehicles have quickly come to dominate several areas of interest, from commercial to military purposes [3], [4], [5]. These vehicles can also be used for obscure purposes and terrorism, e.g., as a vector for explosives and even for biological/chemical payloads [4], [5]. The reason for such advances can be traced to increased battery capacity and weight reduction, powerful embedded controllers widely available to the public, and easy access to lighter brushless motors. Therefore, today a heavy payload such as a camera or a sensor can be carried at low cost [1].
The associate editor coordinating the review of this manuscript and approving it for publication was Cong Pu.
Nowadays, large areas can be covered and remotely sensed. A few years ago, such tasks had to be performed by conventional fixed-wing aircraft and helicopters, resulting in prohibitive costs and long operation times [6]. Today, these applications are increasingly performed by UAVs and range from the inspection of power lines, search and rescue tasks, filming, railway and wind turbine inspections, agriculture, security and surveillance, and uses on the battlefield in modern warfare [7] to the delivery of products [8].
Furthermore, due to the easy access to UAVs by the general public, their low cost, and their easy operation, the use of UAVs in prohibited areas has been increasing. Consequently, the need to protect such areas against unauthorized access is growing. Examples of areas that need to be protected, or where UAVs are not authorized, are protest areas, stadiums, airports, water supply plants, atomic plants, and military zones [9]. The high cost of conventional methods to destroy such aircraft motivates research on protecting restricted areas from UAVs by other means.
In general, a counter unmanned aerial system (C-UAS) is composed of two steps: detection and mitigation [10]. The detection of drones is a very challenging problem because their size and structure make them difficult to spot and locate compared to conventional aircraft. After detection, the mitigation process usually requires a low reaction time and can be composed of radio frequency (RF) and global navigation satellite systems (GNSS) jamming as well as spoofing, lasers, kinetic means, or a combination of these elements [11]. Also, C-UAS platforms are flexible and more expeditious than static or mobile ground platforms [12].
Our proposed C-UAS differs from the current state-of-the-art solutions as it employs a "drone countering drone" approach, in which the so-called attacker hunter drone (AHD) is able to carry a GNSS spoofing payload. The proposed C-UAS reduces the reaction time as the AHD can fly closer to the so-called target invader drone (TID). A non-destructive or soft-kill method can be employed to catch the TID, preserving vital forensic information on the target system. Most importantly, we present a C-UAS that is able to adapt to different target behaviors and missions using a reinforcement learning (RL) agent. The decision of how the AHD moves to automatically guide the TID to a pre-defined safe zone (SZ) is provided by an RL technique, specifically Q-learning. Due to their various capabilities and robustness, RL algorithms have received much attention in recent years [13]. RL algorithms for guidance and navigation became possible based on recent advances in the theoretical fields of intelligent agents, as presented in [14], [15], and [16]. Problems of moving vehicles in space in a scenario where their dynamics are not fully known or are not implemented directly in the controller are addressed in [17], [18], and [19].
The C-UAS design presented in [20] was tested using a software-in-the-loop (SITL) implementation based on the Ardupilot platform. The SITL simulations are performed to emulate the complex operational scenarios and are used to train the Q-values in the Q-tables. The computational cost of the SITL simulation is high as we simulate the movement and sensors of two drones. Therefore, the time the training takes is on the order of minutes per flight, resulting in only a few missions/scenarios that could be presented to the system in the training phase. Consequently, the high computational load of the SITL method should be bypassed using an alternative way of deriving the Q-values in the Q-tables of the system during training.
This work presents an alternative method of training the C-UAS using an abstraction of the problem to reduce the training time and prepare the C-UAS for different mission scenarios. This proposed new training uses an error-tracking approach (ETA) to emulate the UAVs' movements and guidance. We call this model the simplified UAV movement model (SUMM). We use the SUMM to obtain the Q-tables faster; these are later loaded into the SITL simulation, which we use to simulate the complex mission scenarios of the two UAVs. In [20], the SITL simulation was used both for training the C-UAS and for evaluating its performance. In this work, we use the SUMM to accelerate the training and the SITL as ground truth for the performance evaluation of the SUMM.
The dynamic model of the SITL simulation used is the standard model for rotary-wing aircraft (Copter module) on the Ardupilot platform. The platform uses the SITL simulation as a test tool for the developed control algorithms, and in this work, it is used as a standard with which we compare the results of the proposed algorithm.
In [20], the proposed C-UAS also uses spoofing, but with a much simpler RL scheme, e.g., a reward function with fixed values for movement, and the initial Q-table was chosen to represent the heading of the SZ based on the TID's position. Also, the training and evaluation were performed using only the SITL simulation. No acceleration or pretraining steps were included. In this work, we introduce different initial positions for both UAVs, in contrast to [20], where the initial positions were at the origin of the coordinate system.
The main contribution of this work is a methodology to drastically accelerate training for a C-UAS which can soft-kill a UAV. This contribution is made through a simpler model describing the movement information of two drones. The essential information for the attacker's training is maintained. The attitude information of both aircraft is not used in the proposed algorithm and is, therefore, discarded. Furthermore, we have further developed the Q-learning algorithm and the dimension of the Q-table with respect to the algorithm presented in [20]. Hence, this new approach has the following benefits:
• Lower computational load while training, and therefore cheaper training;
• Different strategies can be tested in scenarios where the C-UAS could not guide the TID to the SZ so far;
• As the training time is reduced to the order of seconds, we can retrain the system while hunting, using the onboard CPU resources of the AHD.
This article is organized as follows. In Section II, related work and essential information related to the presented topic are discussed. In Section III, we discuss GNSS spoofing and RL to guide the AHD and the SITL simulation, while Section IV presents the SUMM simulation details of the proposed ETA method. In Section V, we present the training process along with its parameters. In Section VI, we present and discuss the results of the evaluation step of the SITL simulations of SUMM-trained Q-tables together with a Monte Carlo (MC) simulation for three scenarios with different SZ radii. Finally, in Section VII we discuss our conclusions and in Section VIII we present future work improvements.
VOLUME 11, 2023

II. RELATED WORK
As the UAV system complexity increases with the capabilities of such vehicles, they become more dependent on their onboard instruments, such as the inertial navigation system (INS) and its sensors, such as accelerometers, gyroscopes, and barometers [21], [22].
Other sensors such as altimeters, light detection and ranging (LiDAR) sensors, power measurement sensors of the RF link, and GNSS receivers [23] are also used. GNSS receivers are vital to correct the drift error of the INS using fusion algorithms and to derive robust and highly reliable position, navigation, and timing (PNT).
The systems and subsystems of UAVs are susceptible to several attacks and malicious actions, such as electronic warfare measures, i.e., jamming and spoofing [24], [25], [26], and cyberattacks [24], [27]. The embedded systems of a UAV rely on the information provided by the sensors attached to the main control board. This control board is responsible for capturing the sensor data, processing it, and forwarding the control signals to the actuators. Therefore, to infiltrate the system, one of its sensors has to be attacked to send false information to the embedded hardware. The inertial measurement units (IMUs), responsible for measuring the angular velocities and the linear accelerations in the three axes, require a high level of access to the embedded platform to change the reading of the INS. Generally, to alter the quantities measured by the IMU, a cyberattack on the embedded system is required [28], [29]. The RF links (control and telemetry) are the subsystems that are predominantly attacked [25], [27], [30] with simple jamming techniques. In contrast, most commercial systems use known protocols and network tools similar to the ones found on modern computers. The operating system of these embedded systems is usually a modified Linux system.
As an alternative to launching cyberattacks on the telemetry links, the GNSS receiver, present in almost any commercial UAV that can cover a large area, is susceptible to RF link attacks using electronic warfare measures, such as jamming and spoofing [25], [30], [31]. These attacks are powerful since the control loop is affected and calculates the wrong position. Even for remotely controlled flights, a pilot will have great difficulty controlling the aircraft, even when the telemetry and control links are not directly affected.
Detecting jamming and spoofing does not mean an autonomous drone, or a pilot controlling the drone, will quickly regain control of the aircraft. The controls may not respond correctly to the inputs, and onboard cameras sensing the environment may not be sufficient to correct the injected errors. Several techniques have already been presented to fuse terrain data to estimate aircraft position [29], [32] or to use visual cameras with optical-flow estimation [33], but such techniques require a very high computational load or additional sensors. The use of point clouds [34] and 3D target detection [35] along with position estimation from terrain data can automate flight steps [21], [22], [36], [37], but navigation under a GNSS repeater attack is still a big challenge for UAVs and even for conventional aircraft.

III. SPOOFING MODULE AND REINFORCEMENT LEARNING (RL)
In this work, we consider two kinds of UAV positioning: one based on external observation and one based on GNSS. External observation using a visual camera with object detection and optical flow tracking or a LiDAR sensor is considered for the AHD to estimate the position of the TID. We assume that both the AHD and the TID use GNSS to estimate their position internally. To guide the TID to the SZ, the AHD launches a spoofing repeater attack [38] on the TID's GNSS receiver. Such an attack is also called meaconing [38], [39]. The main purpose of this meaconing system is to superimpose the relayed signal on the signals from the GNSS satellites received by the TID. Thus, the GNSS satellite signals are received by the meaconing system on board the AHD, amplified, and then relayed (with or without modification) to the TID.
As the AHD can get closer to the TID at the beginning of the attack, the additional time delay that the spoofing signals experience when propagating from the AHD to the TID with respect to the direct GNSS satellite signals can be on the order of nanoseconds down to zero. The situation in which the AHD is flying over the TID is called the zero-delay meaconing attack, i.e., the GNSS satellite and the meaconing signals are aligned [38]. When aligned, the propagation time delay difference between the AHD and the TID with respect to the direct GNSS satellite signals is close to zero, and the tracking loops of the TID's receiver can drift over time, accumulating small clock offsets. If, additionally, the power of the meaconing signals is slightly stronger than the power of the direct satellite signals, the tracking loops of the TID will keep tracking the meaconing signals while the AHD slowly moves farther away from the TID's original true position. If the attack is successful, the TID will calculate its position using the meaconing signals from the AHD and thus will calculate the same position as the AHD, but with a slight receiver clock offset [20]. However, in case the zero-delay meaconing attack is not successful, the AHD should also be able to perform a jamming attack to deny tracking of the direct GNSS satellite signals by the TID's GNSS receiver and force it into re-acquisition. During the jamming, the AHD can initiate the meaconing attack and thus ensure that the TID's receiver tracking loops lock onto the meaconing signals.
Assuming a successful spoofing attack, in our simulations, the output coordinates of the GNSS receiver in the TID embedded system will be the coordinates of the AHD embedded system with added errors due to the hardware implementation of the repeater.
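The two quantities discussed above can be illustrated with a minimal Python sketch. It computes the extra propagation delay over the AHD-to-TID distance and models the spoofed GNSS output of the TID as the AHD position plus a repeater error. The function names, the Gaussian error model, and the value of `sigma_m` are assumptions made for illustration only; the source does not specify the error distribution.

```python
import random

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def extra_delay_ns(distance_m):
    """Extra propagation delay (ns) of the relayed signal over the AHD-to-TID distance."""
    return distance_m / SPEED_OF_LIGHT * 1e9


def spoofed_fix(ahd_xy, sigma_m=1.0, rng=random):
    """Position reported by a captured TID receiver.

    Assumption: the repeater hardware error is zero-mean Gaussian with
    standard deviation sigma_m per axis (hypothetical error model).
    """
    return (ahd_xy[0] + rng.gauss(0.0, sigma_m),
            ahd_xy[1] + rng.gauss(0.0, sigma_m))


if __name__ == "__main__":
    print(f"delay at 3 m separation: {extra_delay_ns(3.0):.2f} ns")
    print("spoofed TID fix:", spoofed_fix((120.0, -40.0), rng=random.Random(1)))
```

A separation of a few meters corresponds to a delay of roughly ten nanoseconds, which is consistent with the nanosecond-scale delays mentioned in the text.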
So far, autonomous target drones could be guided to the desired SZ by a pilot controlling the AHD, but this task imposes a high cognitive load on the AHD pilot as the TID's mission is unknown. Therefore, we automate this task by using an RL Q-learning agent to guide the TID to the SZ in the shortest time possible, as it requires little to no knowledge of the system dynamics [4].
The observed states s[n] are defined in (2). As we can observe in (2), the states depend only upon the SZ's and TID's positions. The (relative) position of the TID is assumed to be estimated by the AHD's embedded sensors, such as visual cameras and a LiDAR system, while the TID's mission is not known to the AHD. In the Q-learning algorithm, the AHD agent performs an action a[n] at time n, chosen from a set of possible actions A, and the environment returns a state s[n] from the set of possible states S and a reward. In the reward definition, $r_s$ is the SZ radius, ρ is the maximum allowed distance, also called antenna distance, i.e., the largest permitted value of $\|t[n]\|_2$, and κ is the kill reward the agent receives when a successful kill is made. Also, −κ is the evasion reward when the TID evades the flight area with radius σ or the maximum antenna distance between the TID and the AHD is reached. Each action of the AHD results in a new state s
The agent selects its next action based on the maximum Q-value policy π[n] [41], i.e.,
$$\pi[n] = \arg\max_{a \in A} Q(s[n], a),$$
where Q(s[n], a) is called the Q-value, obtained from the Q-table, a list that comprises the possible states (observations) and the Q-values based on which the agent decides which action is best. During the operation, the values in the Q-table are updated, where α is the learning rate, Q(s[n + 1], a) is the Q-value from state s[n + 1] after the action a was taken, and γ is the discount factor responsible for balancing the short-term and long-term reward, i.e., acting as a memory measure in the system. We use the ϵ-greedy action selection policy to introduce randomness to the system, balancing between exploration and exploitation.
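The mechanics described above can be sketched with the textbook tabular Q-learning update and an ϵ-greedy selection rule. This is a minimal illustration, not the authors' implementation: the eight-action set, the state labels, the zero step reward, and all function names are assumptions, and only the terminal rewards κ and −κ follow the description in the text.

```python
import random
from collections import defaultdict

N_ACTIONS = 8  # hypothetical action set, e.g. eight headings spaced by pi/4


def make_q_table():
    """Q-table mapping each observed state to a list of Q-values, one per action."""
    return defaultdict(lambda: [0.0] * N_ACTIONS)


def select_action(q, state, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise take the max Q-value."""
    if rng.random() < epsilon:
        return rng.randrange(N_ACTIONS)
    values = q[state]
    return max(range(N_ACTIONS), key=lambda a: values[a])


def terminal_reward(dist_tid_to_sz, dist_tid_to_area_center, antenna_dist,
                    r_s, sigma, rho, kappa):
    """Kill reward +kappa inside the SZ, evasion reward -kappa outside the
    flight area (radius sigma) or beyond the antenna distance rho.
    Assumption: the per-step shaping reward is not reproduced and is set to 0."""
    if dist_tid_to_sz <= r_s:
        return kappa
    if dist_tid_to_area_center > sigma or antenna_dist > rho:
        return -kappa
    return 0.0


def q_update(q, state, action, reward, next_state, alpha, gamma):
    """Standard Q-learning update with learning rate alpha and discount gamma."""
    best_next = max(q[next_state])
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])


if __name__ == "__main__":
    rng = random.Random(0)
    q = make_q_table()
    a = select_action(q, "s0", epsilon=0.2, rng=rng)
    q_update(q, "s0", a, reward=10.0, next_state="s1", alpha=0.5, gamma=0.9)
    print(a, q["s0"])
```

Using a `defaultdict` keeps unseen states at zero, so the table only grows with the states actually visited during training.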
In contrast to [20], the velocity ratio of the AHD and the TID was reduced in the present work. Also, as the acceleration step is introduced, the Q-values for a given configuration were improved as the randomness ϵ during training is higher.
The SITL simulation, which is applied to simulate the operational behavior of UAVs in various scenarios, is based on the Ardupilot [42] embedded firmware and was presented in detail in [20]. As shown in [20], this approach enables simulating the behavior of a real embedded firmware, but the SITL simulations take the same time as the real flights. This means at least three minutes per flight, as the test bed also has to initiate and then perform the complete episode of the training process. The number of flights needed to populate the respective Q-table grows as we introduce more parameters to the guidance and control problem. Thus, in the next section we propose a generalized model of the UAV movement, called SUMM, as a means to obtain the Q-table faster. This Q-table is then loaded into the SITL simulation [20], and the performance of the SUMM-trained Q-table in the SITL is evaluated. The SITL test bed is used in this work to validate the SUMM approach in reducing the training time and thus is considered to emulate realistic UAV behavior in the considered scenarios.

IV. SUMM SIMULATION
Proportional-integral-derivative controllers (PIDs) are commonly used to control the attitude and movement of UAVs; several combinations of linear and nonlinear PIDs are discussed in [43], [44], and [45]. The ETA derived in this work uses a PID-like approach to represent the general behavior of a UAV. We consider only the UAV's position and velocity in a 2D space, flying towards the desired waypoint. The behavior of the UAV can be characterized by a simple closed-loop control system where the process variable is the position in space, and the linear velocity is limited to match the real-life maximum velocities of commercially available UAVs. For this approach, the UAV is considered a point mass in space, and only the kinematic movement is considered. Thus, we neglect the rigid body quantities stated in [46]; the vehicle is assumed to be symmetrical and of constant mass. Also, the mass distribution over the vehicle is minimal, resulting in a negligible inertia matrix.
Furthermore, as presented in [45], it is common to separate the vehicle controller into several stages, such as position controller, attitude controller, and adaptive PIDs, to mix multiple quantities to control a UAV with unknown dynamics. In contrast to the classical implementation of error-tracking controllers to control the UAV motion in a given space, shown for example in [47], in our case we emulate a closed-loop problem in which the position error for the desired waypoint is driven to zero for increasing iterations. We also use a pair of controllers responsible for the x and y directions in a 2D Cartesian coordinate system since we assume an independent multivariate positioning sensor model. Thus, the outputs of the ETA in the SUMM simulation for the x and y coordinate directions are each modeled as a PID controller with a feedback loop [48], [49]
$$u_{x,y}(t) = K_{p_{x,y}}\, e_{x,y}(t) + K_{i_{x,y}} \int_{0}^{t} e_{x,y}(\tau)\, d\tau + K_{d_{x,y}} \frac{d e_{x,y}(t)}{dt} \tag{9}$$
or by its discrete form, where $K_{p_{x,y}} \in \mathbb{R}$, $K_{i_{x,y}} \in \mathbb{R}$, $K_{d_{x,y}} \in \mathbb{R}$ are the proportional, integral, and derivative constants of the controllers, $e_{x,y}(t) \in \mathbb{R}$ is the error in time, and $e_{x,y}[n] \in \mathbb{R}$ is the discrete error, defined as the difference between the reference input and the output (from the feedback loop [48], [49]). Here, $r_{x,y}[n] \in \mathbb{R}$ is the discrete input reference signal, the desired waypoint in our case, $u_{x,y}[n] \in \mathbb{R}$ is the output of the PID, and $\epsilon_{x,y}[n] \in \mathbb{R}$ is an error from external sources, e.g., the error from the target detection system and the error from the GNSS receiver of the UAV. It is essential to notice that the x and y subscripts refer to the equation in each coordinate system axis.
Even though the presented algorithm considers a 2D space for guidance, the flight levels of the AHD and TID are different, as required by the meaconing. Therefore, there is no chance of collision.
Also, the flight phases in modern flight controllers are separated, e.g., the take-off and landing phases have an altitude controller, whereas the cruise flight phase has a controller providing navigation and guidance. Thus, small changes in altitude between the initial position and the TW can be neglected, and the presented 2D algorithm can be used in the cruise phase of the flight. As a requirement, the detection and tracking systems of the AHD must maintain a minimum vertical distance from the TID, e.g., by keeping the AHD at a flight level above the TID.
For the SUMM, we define the desired waypoint coordinate tuple in the x and y coordinates as $(r_x[n], r_y[n])$. The constants $K_{p_{x,y}}$, $K_{i_{x,y}}$, $K_{d_{x,y}}$ are the same for the x and y directions but are chosen differently for the AHD and TID. The controller output tuple is defined as $(u_x[n], u_y[n])$ and updates the vehicle position, whereas $K_{p_{x,y}}$, $K_{i_{x,y}}$, $K_{d_{x,y}}$ are kept rather small. In our proposed SUMM simulation, we are not interested in controlling a vehicle; instead, we are interested in finding a simple descriptive model that represents the overall behavior of a quadrotor when a waypoint is set in its embedded firmware and the movement it performs towards the desired direction. The SUMM simulation procedure is outlined in Algorithm 1. In the following, the same structure is applied to emulate the behavior of the AHD and the TID.

Algorithm 1: SUMM UAV Simulation (procedure for the n-th iteration).
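A point-mass, PID-driven movement model of the kind described in this section can be sketched in Python as follows. This is not a reproduction of Algorithm 1: the unit sample time, the reading of the controller output as a per-iteration displacement, the gain values, the velocity limit, and all names are assumptions chosen for illustration.

```python
import math


class AxisPID:
    """Discrete PID on one axis (unit sample time assumed)."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error):
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def summ_step(pos, waypoint, pid_x, pid_y, v_max, noise=(0.0, 0.0)):
    """Advance a point-mass UAV by one iteration towards the waypoint.

    noise models external errors (detection or GNSS) added to the tracking error.
    """
    ex = waypoint[0] - pos[0] + noise[0]
    ey = waypoint[1] - pos[1] + noise[1]
    ux, uy = pid_x.step(ex), pid_y.step(ey)
    speed = math.hypot(ux, uy)
    if speed > v_max:  # limit the displacement to the maximum velocity
        ux, uy = ux * v_max / speed, uy * v_max / speed
    return (pos[0] + ux, pos[1] + uy)


def fly_to(start, waypoint, gains=(0.1, 0.0, 0.05), v_max=5.0, steps=500):
    """Simulate a flight and return the list of visited positions."""
    pid_x, pid_y = AxisPID(*gains), AxisPID(*gains)
    pos = start
    track = [pos]
    for _ in range(steps):
        pos = summ_step(pos, waypoint, pid_x, pid_y, v_max)
        track.append(pos)
    return track


if __name__ == "__main__":
    track = fly_to((0.0, 0.0), (100.0, 50.0))
    print("final position:", track[-1])
```

Because each iteration costs only a handful of floating-point operations, thousands of episodes of such a model run in the time a single SITL flight would take, which is the motivation for training on an abstraction of this kind.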

V. TRAINING PROCESS
The training process aims to derive Q-values of the Q-table so that the agent learns the best way to interact with the environment. In our problem at hand, this means learning the best possible trajectory for the AHD to guide the TID to the SZ. Then, during the actual flight, the Q-table resulting from the final training episode can be used to resolve the guidance and control problem. The training can also be continued during the flight to adapt to changing situations. Thus, in any case, the training process needs to be of low computational complexity.
In Table 1, the training parameters used for the SITL and SUMM training cases are listed, as well as the locations of the TW and the SZ. The SUMM parameters were chosen to match the behavior of the SITL simulation represented by the resulting Q-table with its Q-values. Generally, the Q-table can only be generated for a finite number of discrete actions and states. Additionally, Q-learning, in its basic form, starts to fail for a larger number of actions and states, as the likelihood of the agent observing a particular state and performing a particular action becomes increasingly small. Therefore, the Q-table with its Q-values can only provide rough information about the environment and the involved systems, yet this information is sufficient to solve complex control problems. Consequently, when designing dynamic system models for training, the model depth can be traded off against its complexity, as the Q-tables resulting from different models can provide a similar solution to the problem at hand. Therefore, the training hyperparameters were not included in the Q-table as new state-action pairs, keeping the RL algorithm decoupled from the ETA.
To compare the simulation time for both training cases discussed earlier, we consider 400 episodes simulated with the SITL and the SUMM. The results are shown in Fig. 1. In general, we can observe that the SUMM simulation achieves a much shorter simulation time for each episode. The peaks at multiples of 100 for the SUMM simulations are caused by the insertion of the flight logs into the overall training database. This comparison yields the main reason for proposing the SUMM approach: a low simulation time per flight while keeping the same overall mechanics between the AHD and the TID.
Additionally, we define the maximum-kills stop criterion (MKSC), where we end the training process in the SUMM early when a given overall number of kills is achieved. Also, we define the successive-kills stop criterion (SKSC), where the simulation is stopped when a given number of successive kills is achieved. In the following, MKSC500 means we stop the training process after 500 accumulated kills. The MKSC and SKSC can be combined in the SUMM simulations. These two criteria help to decide when the training is sufficient to start the mission of the AHD based on the derived Q-table.
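The two stop criteria can be expressed as a small check over the per-episode outcomes. The following Python sketch is one possible reading: the function name and the choice to stop as soon as either criterion is met when both are given are assumptions, since the text does not state how the combination is evaluated.

```python
def should_stop(outcomes, mksc=None, sksc=None):
    """Decide whether SUMM training can end early.

    outcomes: list of booleans, one per finished episode (True means a kill).
    mksc: accumulated-kills threshold, or None to disable it.
    sksc: successive-kills threshold, or None to disable it.
    Assumption: with both set, training stops when either one is satisfied.
    """
    total_kills = sum(outcomes)

    # Length of the current run of kills at the end of the history.
    streak = 0
    for killed in reversed(outcomes):
        if not killed:
            break
        streak += 1

    if mksc is not None and total_kills >= mksc:
        return True
    if sksc is not None and streak >= sksc:
        return True
    return False


if __name__ == "__main__":
    history = [True, False, True, True]
    print(should_stop(history, mksc=500))  # MKSC500 is not reached yet
    print(should_stop(history, sksc=2))    # two successive kills reached
```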

VI. RESULTS AND DISCUSSION
To evaluate our proposed C-UAS method, we choose representative flight scenarios and we generate the trajectories of the AHD and TID during a hunting mission using the described SITL simulation. We also perform Monte Carlo (MC) SITL simulations of flights using pre-trained SUMM Q-tables to derive and analyze the performance of our proposed method.

A. SUMM LOADED SITL FLIGHTS
In Fig. 2, we show nine flights simulated with the SITL approach loaded with a pre-trained Q-table using the SUMM training. The red dashed line represents the flight trajectory of the AHD, and the solid blue line shows the TID's flight trajectory. The SZ is illustrated as a yellow circle while the red X denotes the next TW as planned in the TID's mission. It is important to note that for these examples, no previous SITL flights were performed to train the used Q-table. Based on these example flights, we will discuss the behavior of our proposed Q-learning approach for specific cases. Afterward, we will analyze the performance of our proposed approach based on MC simulations, especially the kill probability.
In general, we can observe that, as the AHD moves in space, the TID is guided towards the SZ. Thereby, guiding the TID through the movements of the AHD would impose a high cognitive load on a human pilot steering the AHD. Thus, the need for our Q-learning agent arises, which accomplishes this complex task without directly involving a human pilot.
Fig. 2 (a), Fig. 2 (f), and Fig. 2 (h) show scenarios in which the AHD is not capable of steering the TID towards the SZ, but the TID also does not reach its next TW. In some cases, the TID approaches its next TW without reaching it, as shown in Fig. 2 (a) and Fig. 2 (h). In other cases, the TID simply passes tangentially by the next TW and is then guided away by the AHD, as shown in Fig. 2 (f). The TID is driven away from the TW even when it is not approaching the SZ, due to the TW's location and the implemented reward function, which penalizes movements closer to the safe zone less. In this manner, the TID is first guided towards the SZ and then driven away from the TW. Afterwards, as the TID is still seeking to get to the TW, its position drifts until it leaves the defined flight zone.
The SZ kill radius $r_s$ together with the action size $\|d\|_2$ influences the kill rate of the TID. If $\|d\|_2$ and $r_s$ are chosen quite small, the TID might pass through or by the SZ without a successful kill. Such cases can be seen in Fig. 2 (b) and Fig. 2 (i), where the TID is driven to the SZ, but the $r_s = 25$ m criterion is not met and the AHD decides to simply drive the TID away from its next TW. The flight example shown in Fig. 2 (b) addresses another important aspect. It demonstrates that it is vital that the AHD's velocity is higher than the TID's velocity, as the AHD has to fly ahead of the TID to recover the steering closer to the SZ.
In the example flights shown in Fig. 2 (c) to Fig. 2 (e) and in Fig. 2 (g), the AHD successfully guides the TID to the SZ, the kill criterion is met, and we can consider the TID successfully killed. Also, we can observe that the AHD and TID trajectories in these flights are quite similar but not equal, as convergence can in general be obtained from the SUMM training. However, the act-sense-learn cycle is still active during the flight; thus, the AHD adapts while flying and consequently behaves slightly differently in the depicted scenarios.
Even though the Q-table only has a discrete set of states, the system can adapt to changing conditions as the state-action pairs comprise a set of possible AHD and TID positions in space quantized by π/4 angles. If the discretization is not fine enough to guide the TID to the SZ, the AHD can retrain in flight. As can be seen in Fig. 2 (b) and (i), the AHD's velocity must be higher than the TID's velocity, so that the AHD can move in the direction opposing the TID's heading. The closer the AHD's velocity is to the TID's velocity, the broader the steering area that has to be accepted and the more iterations are needed to guide the TID to the SZ.
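One way to picture a quantization by π/4 angles is to map the bearing between two positions to one of eight sectors. The short Python sketch below shows this idea only. The function name, the sector numbering, and the use of the bearing as the quantized quantity are assumptions, since the exact state encoding is not reproduced here.

```python
import math


def bearing_sector(from_xy, to_xy, sectors=8):
    """Quantize the bearing from one point to another into equal angular sectors.

    With sectors=8 each sector spans pi/4, and sector 0 is centered on the +x axis.
    """
    angle = math.atan2(to_xy[1] - from_xy[1], to_xy[0] - from_xy[0])
    width = 2.0 * math.pi / sectors
    return int(round(angle / width)) % sectors


if __name__ == "__main__":
    tid = (0.0, 0.0)
    sz = (150.0, 150.0)
    print("SZ seen from TID lies in sector", bearing_sector(tid, sz))
```

A coarse encoding of this kind keeps the Q-table small, at the cost of treating all positions within the same π/4 sector identically.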
Finally, an important element of the overall problem that needs to be discussed is the size of the space needed for the maneuvers of the AHD. In practical terms, the larger the needed area, the higher the probability that the AHD will lose its capability to continue spoofing the TID due to the increased antenna distance. On the other hand, we cannot make the maneuver box smaller because of the AHD's and TID's velocities, as the AHD needs sufficient space to guide the TID to the SZ.

B. SUMM LOADED SITL MONTE CARLO (MC) SIMULATION
In order to derive the kill probability $p_k$, we performed MC simulations of a SUMM-trained scenario. $p_k$ is the probability that the TID will be guided into the SZ with a radius of $r_s$ meters. Assuming the number of trials m in our experiment is large enough, i.e., m → ∞, we can approximate $p_k$ by the relative frequency as
$$p_k \approx \frac{m_k}{m},$$
where $m_k$ is the number of kills in m tries. The SITL MC simulations are performed with the parameters described in the previous section and m = 1000. We analyze different SZ radii $r_s$ for the case of an MKSC of 800. The convergence of the relative frequency to $p_k$ for different $r_s$ is depicted in Fig. 3, and the summarized results of the MC simulations are listed in Table 2. The standard deviation was calculated using a binomial approximation, with a 95% confidence interval. The radius $r_s$ plays a key role in the problem at hand, and $p_k$ increases significantly if $r_s$ is increased. The results show that the proposed C-UAS is able to guide the TID to the SZ in 42% of the cases for $r_s = 25$ m, which can be considered a feasible scenario in urban areas. However, even if the drone is not captured in the SZ, it is steered away from its original TW. Thus, the AHD can in all cases confound the mission of the TID.
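The relative-frequency estimate and its binomial uncertainty can be computed as in the following Python sketch. The normal-approximation interval with z = 1.96 for the 95% level is an assumption about how the interval is formed, and the kill count used in the example is illustrative rather than taken from Table 2.

```python
import math


def kill_probability(m_k, m, z=1.96):
    """Estimate p_k = m_k / m with a binomial standard deviation and interval.

    z = 1.96 corresponds to a two-sided 95% interval under the normal
    approximation (assumed here).
    """
    p = m_k / m
    std = math.sqrt(p * (1.0 - p) / m)
    lower = max(0.0, p - z * std)
    upper = min(1.0, p + z * std)
    return p, std, (lower, upper)


if __name__ == "__main__":
    # Illustrative numbers: 420 kills in 1000 Monte Carlo flights.
    p, std, (lo, hi) = kill_probability(420, 1000)
    print(f"p_k = {p:.3f}, std = {std:.4f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

With m = 1000 trials the standard deviation of the estimate stays below about 0.016 for any value of the kill probability, which gives a feel for the resolution of the MC results.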
To further analyze the results, we show the spatial distribution of the final positions for the SITL MC simulations in Fig. 4. The larger $r_s$, the more the AHD's final positions, represented by the red dots, are clustered in a small final region, indicating that, besides the non-deterministic portion of the presented problem, the AHD is adapting while guiding the TID towards the SZ (act-sense-learn cycle). The TID's final positions are concentrated either well inside the SZ or the TID is guided away from the next TW. It is essential to note the arrangement of the points in Fig. 4 when we compare the three scenarios: in the first and second scenarios presented in Fig. 4(a) and (b), with radii of $r_s = 25$ m and $r_s = 50$ m, the distribution
Also, in Fig. 4(c), the final target positions, represented in blue inside the yellow region, are concentrated in an arc of the safe zone perimeter. This result represents the pattern from the UAV dynamics, along with the agent adopting an equivalent strategy to guide the TID to the SZ using similar maneuvers.
For the AHD and TID, the distances of their final positions to the SZ, $r_{sz}$, and to the TW, $r_{tw}$, are calculated. The results are shown in Fig. 5 for MKSC800 and $r_s = 25$ m. We can state that $r_s = 25$ m is the hardest case, as it has the smallest safe zone while maintaining the same velocity ratio between the AHD and the TID. On the other hand, in cases where the TID's velocity is higher than that of the AHD, the latter would not be able to steer the TID to the safe zone. Even in such a case, however, the AHD is still capable of driving the TID away from its TW.
Furthermore, the action size defines the movement of the TID from one time instant to the next. A larger action size ∥d∥ would lead to the TID flying over the SZ at cruise velocity, and a soft-kill would not be successful.
In Fig. 5, the diagonal graphs are the histograms of the respective variables. There are three categories of final positions of the AHD and the TID:
• the TID was killed in the SZ,
• the AHD or TID flew out of the defined area (out of box),
• the AHD reached the maximum movements and could not further guide the TID.
The $r_{sz}$ and $r_{tw}$ for the AHD or the TID have a linear dependency, as can be expected for fixed SZ and TW. If the AHD or the TID is out of the defined flight area or box, $r_{sz}$ and $r_{tw}$ have rather high values. The same is true when either drone is far from the region of interest. The distribution of the higher $r_{tw}$ values reflects our objective of driving the TID away from the TW. The narrow peaks of the killed states represent the capability of our system to steer the TID to a bounded SZ region. The closer to zero and the narrower the peaks, the smaller the bounding box required to execute the soft-kill process. If the final status is maximum movements, we note the wide range over which these points are distributed. In each such case, the flight trajectory is still contained in the bounding box, but the AHD was not able to drive the TID to the SZ effectively, as represented in Fig. 2(a), 2(f), and 2(h).
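The three final-status categories and the two distances plotted in Fig. 5 can be expressed compactly in code. The following is a minimal sketch under stated assumptions: all names (`FinalStatus`, `classify_final_state`, the rectangular `box` tuple) are hypothetical, the kill test is checked before the out-of-box test, and "maximum movements" is treated as the remaining category once an episode has ended without a kill or a boundary violation.

```python
import math
from enum import Enum


class FinalStatus(Enum):
    KILLED = "killed"                # TID ended inside the SZ
    OUT_OF_BOX = "out_of_box"        # AHD or TID left the defined flight area
    MAX_MOVEMENTS = "max_movements"  # movement budget exhausted without a kill


def distance(p, q):
    """Euclidean distance between two 2D points given as (x, y) tuples."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def in_box(p, box):
    """True if p lies inside the axis-aligned box (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = box
    return xmin <= p[0] <= xmax and ymin <= p[1] <= ymax


def classify_final_state(tid, ahd, sz_center, r_s, box):
    """Classify the end of one episode (ordering of checks is an assumption)."""
    if distance(tid, sz_center) <= r_s:
        return FinalStatus.KILLED
    if not (in_box(tid, box) and in_box(ahd, box)):
        return FinalStatus.OUT_OF_BOX
    return FinalStatus.MAX_MOVEMENTS


def final_distances(pos, sz_center, tw):
    """Return (r_sz, r_tw): distances of a final position to the SZ center and the TW."""
    return distance(pos, sz_center), distance(pos, tw)


# Illustrative geometry only: SZ at the origin, TW 400 m to the east.
box = (-500.0, -500.0, 500.0, 500.0)
status = classify_final_state((10.0, 5.0), (40.0, 30.0), (0.0, 0.0), 25.0, box)
r_sz, r_tw = final_distances((10.0, 5.0), (0.0, 0.0), (400.0, 0.0))
print(status.value, round(r_sz, 1), round(r_tw, 1))
```

Grouping the per-episode `(r_sz, r_tw)` pairs by status is what would yield separate histograms for the killed, out-of-box, and maximum-movements cases.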
Finally, it is essential to note that, in most flights, the target is successfully steered towards the SZ within a small region no more than a kilometer wide. These results show that the proposed method is promising for dense urban regions with an increased $p_k$. SUMM pre-training and SITL evaluation are complementary when evaluating C-UAS systems for a wide range of mission scenarios.

VII. CONCLUSION
As UAVs have become available to the general public, the need for C-UAS has arisen to protect a given area from malicious or unauthorized UAVs. The high cost of conventional methods to destroy UAVs at sites ranging from airports to stadiums motivates our research and development of a cheaper and more general approach to capture and soft-kill these vehicles effectively and safely.
This work proposed a soft-kill approach to C-UAS by automatically guiding a UAV to an SZ with a Q-learning algorithm. The proposed approach achieved high success rates and, additionally, was designed so that even if the TID does not reach the SZ, it cannot fly to the next TW. Thus, the Q-learning algorithm in the AHD could always identify and foil the mission of the TID.

FIGURE 5. Final AHD and TID distances to SZ ($r_{sz}$) and distances to TW ($r_{tw}$) distributions obtained by SITL MC simulation, for MKSC 800 and $r_s = 25$ m. The diagonal plots show the histograms of TID $r_{sz}$, AHD $r_{sz}$, TID $r_{tw}$, and AHD $r_{tw}$, respectively.
The SUMM method was introduced to accelerate training relative to the SITL simulations and to achieve a better trade-off between high fidelity and low computational cost. The SITL simulation was designed to simulate the true behavior of UAVs in dedicated mission scenarios, while the SUMM method implements only a coarse model of the behavior of UAVs but captures their main features to accelerate training drastically.
Finally, this work showed a new way to implement expanded Q-tables for complex Q-learning algorithms, to perform the training in reasonable time, and to perform extensive evaluation using a more straightforward training method.

VIII. FUTURE WORK
The proposed algorithm allows more complex scenarios to be considered, and a larger Q-table can be used in future developments and system improvements. Also, as more scenarios with different attacker strategies can be considered, the C-UAS can be prepared for different mission requirements.
The presented techniques and system designs can also be used to simulate strategies against drone swarms, for which the SITL alone would require prohibitive resources for training. The strategy to be used against drone swarms is to spoof the centroid of the swarm in the same way the meaconer payload does for a single drone.
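As a small illustration of the swarm idea, the centroid that would stand in for the single-target position can be computed as the mean of the swarm members' positions. This sketch assumes that 2D positions of all swarm members are available from some tracking source, which is not specified here; the function name `swarm_centroid` is hypothetical.

```python
def swarm_centroid(positions):
    """Return the 2D centroid (mean x, mean y) of a list of (x, y) positions."""
    n = len(positions)
    if n == 0:
        raise ValueError("swarm must contain at least one drone")
    cx = sum(p[0] for p in positions) / n
    cy = sum(p[1] for p in positions) / n
    return cx, cy


# Illustrative swarm of four drones at the corners of a 20 m square.
swarm = [(0.0, 0.0), (20.0, 0.0), (20.0, 20.0), (0.0, 20.0)]
print(swarm_centroid(swarm))
```

Treating the centroid as the target keeps the state space the same size as in the single-drone case, at the cost of ignoring the spread of the swarm around that point.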
Furthermore, the presented 2D algorithm can easily be extended to 3D to additionally consider different flight levels and changes in the target's altitude during flight. The Q-table can also be expanded to allow a quantized azimuth vector pointing from the AHD to the TID as states.
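One possible way to turn the AHD-to-TID direction into a discrete state is sketched below. The conventions are assumptions made for illustration only: positions are (east, north) pairs, the azimuth is measured clockwise from north, the circle is split into `n_sectors` equal sectors centered on the sector directions, and the function name `quantized_azimuth` is hypothetical.

```python
import math


def quantized_azimuth(ahd, tid, n_sectors=8):
    """Return the index (0 .. n_sectors-1) of the azimuth sector from AHD to TID.

    Sector 0 is centered on north; indices increase clockwise (assumed convention).
    """
    d_east = tid[0] - ahd[0]
    d_north = tid[1] - ahd[1]
    if d_east == 0 and d_north == 0:
        raise ValueError("azimuth is undefined for coincident positions")
    # atan2(east, north) gives the angle clockwise from north; wrap to [0, 2*pi).
    azimuth = math.atan2(d_east, d_north) % (2.0 * math.pi)
    width = 2.0 * math.pi / n_sectors
    # Shift by half a sector so that each sector is centered on its direction.
    return int((azimuth + width / 2.0) // width) % n_sectors


# Illustrative use: TID to the north-east of the AHD with 8 sectors.
print(quantized_azimuth((0.0, 0.0), (100.0, 100.0)))
```

With this encoding, the sector index could be added as one more dimension of the Q-table, so the table grows by a factor of `n_sectors`.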