Intelligent Trajectory Design for Secure Full-Duplex MIMO-UAV Relaying Against Active Eavesdroppers: A Model-Free Reinforcement Learning Approach

Unmanned aerial vehicle (UAV) assisted wireless communication has recently been recognized as a highly promising component of future wireless networks. In particular, UAVs can be utilized as relays to establish or improve network connectivity thanks to their flexible mobility and likely line-of-sight channel conditions. However, this gives rise to serious security issues due to potential adversaries, particularly active eavesdroppers. To combat active eavesdroppers, we propose an artificial-noise beamforming based secure transmission scheme for a full-duplex UAV relaying scenario. In the considered scheme, we investigate a UAV-relay equipped with multiple antennas that securely serves multiple ground users in the presence of randomly located active eavesdroppers. We formulate a novel average system secrecy rate (ASSR) maximization problem under quality of service (QoS) and mission time constraints. Since the ASSR optimization problem is intractable for conventional optimization methods, owing to the unavailability of the environment's dynamics and its complex model, we develop several model-free reinforcement learning algorithms, i.e., Q-learning, SARSA, Expected SARSA, Double Q-learning, and SARSA(<inline-formula> <tex-math notation="LaTeX">$\lambda $ </tex-math></inline-formula>), to efficiently solve the problem without substantial UAV-network data exchange. Using the proposed algorithms, we maximize the ASSR by finding an optimal UAV trajectory together with proper resource allocation. Simulation results demonstrate that all the proposed learning-based algorithms can train the UAV-relay to learn the environment through iterative interactions and thus intelligently find an optimal trajectory. 
In particular, we find that the proposed SARSA(<inline-formula> <tex-math notation="LaTeX">$\lambda $ </tex-math></inline-formula>)-based algorithm with <inline-formula> <tex-math notation="LaTeX">$\lambda =0.1$ </tex-math></inline-formula> outperforms the others in terms of the ASSR.


I. INTRODUCTION
The emerging AI-driven 6G wireless communication networks have been envisioned as an enabling technology of the IoE, wherein the networked connection between people, processes, data, and things is anticipated to be autonomously determined [1], [2]. Therefore, the ever-increasing demand for seamless and ubiquitous connectivity as well as high-data-rate transmission serving an exponentially increasing number of users is amongst the most critical challenges. In light of this, UAVs have been recognized as one of the key components of such networks due to their unique attributes: cost-effectiveness, flexible deployment, maneuverability, and versatility [3], [4]. As a result, UAVs can be dispatched to avoid environmental obstacles and to provide seamless connectivity and reliable communications to a massive number of users.
Wireless applications of UAVs can be categorized into the following paradigms: one is on-demand deployment as airborne platforms, such as mobile BSs or relays, to expand coverage and provide wireless connectivity in densely crowded areas where the current infrastructures are facing challenges in meeting all concurrent requests [5], [6], or in hazardous environments where no communication infrastructure is in full operation [7]; another is data collection/dissemination, owing to their high mobility and low-cost operation, for UAV-IoT applications [8]-[10]; and the last is serving as aerial users or cellular-connected UAVs, which receive service from terrestrial stations and cooperate with other UAVs in the sky, leading to information fusion and resource complementation to fulfill a common mission [11]-[13]. Despite the aforementioned advantages, the open nature of UAVs' AG links inevitably makes such systems vulnerable to various malicious attacks [14] such as eavesdropping, particularly active eavesdropping, wherein the adversary simultaneously performs both information eavesdropping and malicious jamming. If employed by illegitimate parties, hostile UAVs, benefiting from their salient attributes, can pose even more detrimental security threats to legitimate transmissions [15]. Therefore, wireless security is a crucial requirement for such UAV-aided wireless systems, and various significant security challenges remain to be addressed in the design of UAV-aided wireless communications [16], [17]. 
Typically, in the network layer of wireless systems, cryptographic techniques have been applied for information safeguarding; for physical layer wireless communications, PLS approaches have been widely investigated and recognized as one of the most promising security countermeasures, especially for confidentiality. This is because PLS exploits the physical characteristics of wireless media without the need for complex encryption procedures, and, more importantly, employing traditional cryptographic techniques may not even achieve satisfactory confidentiality in resource-constrained aerial platforms [18], [19]. The notion of PLS, first introduced by Wyner's seminal work in [20], lies in a wiretap channel model, which guarantees that confidential communication can be established between legitimate users provided that the eavesdropper's channel capacity is a degraded version of the legitimate user's one. Since then, various PLS techniques have been developed for terrestrial wireless communications, classified into three categories: secure channel coding design, channel-based adaptation PLS, and ANI techniques (see [21] and references therein).

A. RELATED WORKS AND MOTIVATIONS
Recently, some research works have investigated PLS for secure UAV communications. For example, in [22], PLS-based secure UAV-enabled communications have been developed via joint trajectory design and power control. In [23], a secure UR-based communication scheme via destination-assisted cooperative jamming has been proposed. In [24], the authors have studied a secure multi-UAV system with wireless energy harvesting in terms of efficient trajectory design and communication resource allocations for average secrecy rate maximization. The authors in [25] have explored employing an ANI-based secure two-phase transmission protocol for a single-antenna UAV system operating as an aerial BS, and then jointly optimized UAV's trajectory, network transmission power, and power allocation factor over a given time horizon. Further, secure energy-efficient power control and trajectory co-design has been investigated for UAV-enabled direct transmission [26], and UAV-assisted mobile relaying [27], [28]. In [29], the authors have studied a joint location-based 3D beamforming and trajectory design for the downlink multiple-antenna UAV relaying, and then proposed a heuristic-based iterative algorithm to improve the secrecy outage probability of the system.
The majority of recent research works have mainly focused on simple direct transmission [24], [29] or half-duplex UAV relaying [30], [31]. Recently, FD transmissions, which can double spectrum efficiency, have attracted notable research interest for adoption at legitimate nodes, e.g., [32], [33]. Specifically, for non-security purposes, a UR-based FD system has been considered in [32] for the joint design of beamforming and power allocation with a fixed circular UR trajectory under a DF relaying protocol. We note that DF relaying refers to a type of transmission protocol used in the communications between a source and a destination aided by one (or more) intermediate relay nodes, where the relay node decodes, remodulates, and then retransmits the received signal to the destination. Another type of relaying architecture is called amplify-and-forward relaying, wherein the relay node simply forwards a scaled version of the received signal without decoding. Further, in [33] the trajectory design and resource allocation of a similar system model have been explored to minimize the outage probability. For security-oriented FD-operated UAV communications, the authors in [34] have considered an FD system with an untrusted UR and studied the secrecy outage and average secrecy rate performance metrics, wherein untrusted relaying refers to the case where the intermediate relay conducts adversarial activity while facilitating communication [35]-[37]. It is worth pointing out that FD malicious nodes can pose more severe security attacks than their passive counterparts. The authors in [38] have proposed an ANI-based secure uplink UAV transmission in the presence of a multiple-antenna FD-operated AE. They have analyzed a hybrid outage secrecy metric, which captures the joint effect of connection and secrecy outage probabilities.
We note that in all the abovementioned research works on UAV trajectory design, e.g., [24], [26], [27], [29], [31], [33], standard optimization techniques such as SCA have been employed under the assumption of a known network model and perfect knowledge of the flight dynamics. However, this assumption can be impractical, since a precise mathematical model can hardly be formed, and the UAV-network topology frequently demands information exchange between the UAV and the core network. Consequently, the objective function to be optimized or the constraints might be unavailable in closed form, or obtaining their gradients analytically becomes almost impossible [39]. Hence, other optimization approaches are required to deal with such complex problems. One promising approach is the family of model-free RL techniques [40], which can reduce the online computational complexity. To that end, in [41], the authors have proposed an RL algorithm for a multi-UAV cooperative system, aiming to maximize the sum rate via trajectory design and resource management. The authors in [42] have considered exploiting RL algorithms to optimize the UAV trajectory for maximum data collection in a sensor network under some QoS constraints. However, these developments have addressed only the reliability aspects of UAV communications, and the PLS aspects have not yet been fully explored.

B. OUR CONTRIBUTIONS
Driven by this demand, in this work, we propose a secure FD-operated MIMO-UAV relaying communication scheme in the presence of multiple AEs, wherein the source is a multiple-antenna BS. We assume that both the BS and the UR adopt ANI-based beamforming, and that each AE is equipped with two antennas to perform concurrent reception and transmission. Our design goal is then to maximize the ASSR under QoS and mission time requirements. To achieve this goal, we develop several applicable RL-based algorithms for adaptive trajectory design, enabling the flying UR to autonomously find the optimal path to complete the mission. Our detailed contributions are summarized below.
• In our design, we aim to maximize the ASSR of the system under conditions that fulfill the UR's flying mission as well as combat the active eavesdropping issue. Besides, we take into account collision avoidance between the flying UR and environmental obstacles for safety purposes.
• The original ASSR optimization problem is, however, hard to solve due to the non-convex complex model of the objective function and some associated constraints. To tackle this problem, we devise several efficient model-free RL-based algorithms, i.e., Q-learning, SARSA, Double Q-learning, Expected SARSA, and SARSA(λ). Via the proposed algorithms, we can train the UR to find its optimal path through environmental interactions and decision updates based on the feedback/reward received, meanwhile forming a simple resource allocation problem whose solution partially contributes to the reward function. Our approach thus significantly diverges from those in [22], [24], [43], where the problems were tractable and mathematical optimization methods were applied.
• Finally, we discuss the convergence and complexity of the proposed adaptive trajectory design algorithms under the considered settings. Via extensive simulations, we demonstrate that these algorithms can effectively improve the considered ASSR performance, and that their convergence rates to the optimal policies are also desirable.

The rest of this paper is organized as follows. In Section II, we detail the system model and signal representations for the proposed secure UR-based FD system in the presence of multiple randomly located AEs, wherein the BS and UR both adopt MIMO-based ANI beamforming. The problem formulation is then given in Section III, followed by our model-free RL-based solutions in Section IV. Section V is devoted to numerical results and discussions about the performance of the developed solutions. Finally, conclusions are drawn in Section VI.

II. SYSTEM MODEL
We consider a UAV-assisted mobile relaying system, as depicted in Fig. 1, wherein a UAV is employed as a mobile relay to provide enhanced service and secure connectivity for multiple ground users. Particularly, we consider a BS, denoted as S, which intends to communicate secretly with remote ground users with the help of a UR, denoted as U, in the presence of multiple terrestrial AEs. We assume that S and U are equipped with N s and N u transmitting antennas, respectively, and that U also has one receiving antenna. Further, we assume there are L single-antenna ground users, denoted as D = {D 1 , D 2 , · · · , D L }, and M double-antenna AEs, represented by E = {E 1 , E 2 , · · · , E M }, all of which are randomly distributed across a rectangular R w × R l region. We emphasize that AEs, compared to conventional passive eavesdroppers, may pose stronger attacks: in addition to overhearing the transmitted confidential messages, they can actively deteriorate the capacity of the main channel, i.e., the quality of the received signals at the legitimate nodes, by malicious jamming transmissions. Further, being equipped with two antennas, each AE is assumed to operate in the FD mode, utilizing one antenna for eavesdropping and the other for jamming transmission, simultaneously. To guarantee the security and reliability of the transmission in the considered system, we employ a UAV acting as a mobile DF relay. The goal of the UR is to fly over the region from a pre-specified starting location and stop at the pre-established final destination (ignoring the landing process, this point is depicted by a flag in Fig. 1) for each flight. Note that obstacles such as high-rise buildings and trees represent forbidden regions due to, for example, the possibility of collision, through which the employed low-altitude UR should avoid passing during the mission. 
In order for the UR to sequentially relay the data and serve multiple users, the total flight duration T , which should not exceed a maximum allowed feasible mission time T max , is divided into multiple sufficiently small time slots, in each of which only one user is scheduled to receive data from U according to the TDMA technique. Moreover, the UR applies the FDD protocol in each time slot, allocating equally shared bandwidth for data transmission and reception. We now detail the channel assumptions and signal representations of the proposed system.

A. CHANNEL ASSUMPTIONS
First off, we model the locations of the network nodes in a 3D Cartesian coordinate system. As such, without loss of generality, due to mission requirements, the UR's predetermined initial and final locations, which may refer to the take-off and landing sites, are represented as Q i = [x I , y I , H u ] T and Q f = [x F , y F , H u ] T , respectively, wherein H u denotes the fixed operating altitude of the UR. The location of the flying UR at time slot t is denoted as Q u (t) = [x u (t), y u (t), H u ] T with the corresponding projected coordinate on the ground (x-y plane) as q u (t) = [x u (t), y u (t)] T . Further, the node S is located on the ground with x-y coordinate q s = [x s , y s ] T , and the locations of the randomly located ground users and AEs are collected in the matrices q i ∈ R 2×N i for i ∈ {d, e}, in which N d and N e represent the cardinality of the sets D and E, respectively. Note that the j-th column (where j = 1, 2, · · · , N i ) of the matrix q i , denoted as q (j) i , represents the x-y coordinate of the j-th terrestrial device of type i. As a result, the instantaneous distance between the flying UR and the terrestrial communication node i j is given by $d_{ui_j}(t) = \sqrt{\|q_u(t) - q_i^{(j)}\|^2 + H_u^2}$. Likewise, the instantaneous distance between S and the UR is $d_{su}(t) = \sqrt{\|q_u(t) - q_s\|^2 + H_u^2}$. Further, defining G = {S} ∪ D ∪ E, the Euclidean distance between any pair of terrestrial nodes a, b ∈ G is denoted as $d_{ab} = \|q_a - q_b\|$, where q a(b) ∈ {q s , q (j) i ∀i, j}.
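As a quick sanity check, the distance expressions above can be evaluated directly; the following minimal Python sketch (function names are ours, not the paper's) computes the UR-to-ground and ground-to-ground distances:

```python
import numpy as np

def uav_ground_distance(q_u, q_g, H_u):
    """3-D distance between the UAV-relay (horizontal position q_u,
    fixed altitude H_u) and a terrestrial node at ground position q_g."""
    return np.sqrt(np.sum((np.asarray(q_u) - np.asarray(q_g)) ** 2) + H_u ** 2)

def ground_distance(q_a, q_b):
    """Euclidean distance between two terrestrial nodes a and b."""
    return np.linalg.norm(np.asarray(q_a) - np.asarray(q_b))
```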

1) LARGE-SCALE ATTENUATION
In this work, we consider that each terrestrial device has an LoS path towards the UR with a given probability, as in [4]. This LoS probability is determined by the environment and by the locations of the terrestrial devices and the UR. Thus, we express the LoS probability between a terrestrial node g ∈ G and the UR as a function of the elevation angle in degrees, $\theta_{gu}(t) = \frac{180}{\pi}\arcsin\big(\frac{H_u}{d_{gu}(t)}\big)$, wherein d gu (t) is given by (1) and (2), and the parameters ω 1 , ω 2 > 0 are determined by the environment. The non-LoS probability of the link between the node g and the UR can then be simply expressed as P N gu (t) = 1 − P L gu (t). Further, we model the elevation-angle dependent probabilistic path loss exponent with η N > η L ≥ 2 for the non-LoS and LoS conditions, respectively. We note that, according to (4) and (5), for a fixed UR altitude the LoS probability decreases as the UR moves away from the terrestrial node g, and accordingly the channel experiences a larger path loss attenuation. Therefore, the large-scale attenuation between any two nodes m and n can be expressed in terms of β 0 , the channel power gain at the reference distance of 1 m.
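To illustrate, the sketch below evaluates the elevation angle exactly as defined above, then a commonly used sigmoid LoS-probability model and a probability-weighted large-scale attenuation. The sigmoid form and the weighting are our assumptions for illustration, since the paper's exact expressions appear in its numbered equations:

```python
import numpy as np

def elevation_angle_deg(H_u, d_gu):
    # theta_gu(t) = (180/pi) * arcsin(H_u / d_gu(t)), as in the text.
    return np.degrees(np.arcsin(H_u / d_gu))

def los_probability(theta_deg, omega1, omega2):
    # Hypothetical sigmoid LoS model in the spirit of [4];
    # omega1, omega2 > 0 are environment-dependent parameters.
    return 1.0 / (1.0 + omega1 * np.exp(-omega2 * (theta_deg - omega1)))

def mean_large_scale_gain(d, theta_deg, beta0, eta_L, eta_N, omega1, omega2):
    # Probability-weighted attenuation (assumed averaging), eta_N > eta_L >= 2.
    p = los_probability(theta_deg, omega1, omega2)
    return beta0 * (p * d ** (-eta_L) + (1.0 - p) * d ** (-eta_N))
```

Consistent with the remark above, a larger UR-to-node distance lowers the elevation angle, lowers the LoS probability, and shrinks the expected channel gain.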

2) SMALL-SCALE FADING
We further consider that the AG channels experience Rician fading under LoS propagation conditions and Rayleigh fading under non-LoS propagation conditions. As such, we express the time-varying channel between the BS and the UR in terms of: K su (t) = ω 3 exp (ω 4 θ su (t)), with ω 3 and ω 4 being constant parameters, the Rician K-factor of the channel; h o su , the LoS component of the corresponding channel, determined by the time-varying azimuth angle β su (t) between the BS and the UR and the constant antenna spacing δ s in wavelengths at the BS; and h r su (t), the scattered component of the channel vector between S and the UR, each element of which is a quasi-static i.i.d. complex Gaussian random variable with zero mean and unit variance, i.e., $h^r_{su} \sim \mathcal{CN}(0_{1\times N_s}, I_{N_s})$. Furthermore, the air-ground channel vector between the UR and the single receiving antenna of the terrestrial node i j is represented by h ui j (t), with Rician K-factor K ui j (t) = ω 3 exp (ω 4 θ ui j (t)) and LoS component h o ui j determined by the time-varying azimuth angle β ui j (t) between the UR and the ground node i j and the constant antenna spacing δ u in wavelengths at the UR. Since each AE has one antenna dedicated to jamming transmission, the corresponding channel from the j-th AE E j ∈ E to the UR at time slot t can be represented as in (11), where K e j u (t) denotes the corresponding Rician K-factor and $h^r_{e_j u} \sim \mathcal{CN}(0, 1)$. Therefore, we can define the 1 × N e vector h eu = [h e 1 u (t), h e 2 u (t), · · · , h e Ne u (t)]. Additionally, the M × L channel matrix H ed from the AEs to the users is subject to i.i.d. Rayleigh fading with normalized channel power gains. Next, let the link between the BS and the terrestrial node i j be the 1 × N s channel vector h si j , such that $h_{si_j} \sim \mathcal{CN}(0_{1\times N_s}, I_{N_s})$. 
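The Rician construction described above can be sketched as follows, assuming the standard decomposition of the channel into a deterministic LoS part and a CN(0, 1) scattered part weighted by the K-factor (a generic sketch, not the paper's exact equations):

```python
import numpy as np

rng = np.random.default_rng(7)

def rician_channel(K, h_los):
    """Sample h = sqrt(K/(1+K)) * h_los + sqrt(1/(1+K)) * h_scatter,
    where h_scatter has i.i.d. CN(0, 1) entries."""
    h_los = np.asarray(h_los, dtype=complex)
    h_scatter = (rng.standard_normal(h_los.shape)
                 + 1j * rng.standard_normal(h_los.shape)) / np.sqrt(2.0)
    return np.sqrt(K / (1.0 + K)) * h_los + np.sqrt(1.0 / (1.0 + K)) * h_scatter
```

As K grows the channel hardens towards its LoS component, while K → 0 recovers Rayleigh fading, matching the non-LoS assumption above.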
Further, the self-interference channel from the UR's transmitting antennas to its receiving antenna is characterized as the 1 × N u vector √ ρ u h uu , wherein ρ u ∈ [0, 1] characterizes the effect of imperfect self-interference cancellation, such that ρ u = 0 implies zero self-interference and 0 < ρ u ≤ 1 accounts for the residual self-interference level; further, $h_{uu} \sim \mathcal{CN}(0_{1\times N_u}, I_{N_u})$. Likewise, the self-interference channels of the AEs as well as the cross-interference channels arising from the other AEs are respectively denoted as $h_{e_p} \sim \mathcal{CN}(0, 1)$ and $h_{e_p e_q} \triangleq h_{e_{pq}} \sim \mathcal{CN}(0, 1)$ with e p(q) ∈ E, ∀p, q (p ≠ q) ∈ {1, · · · , N e }, and can be collected into an N e × N e channel matrix, where 0 ≤ ρ e p ≤ 1 denotes the self-interference factor of the AE e p due to FD operation.

B. USER SELECTION
Since at each time slot t, U forwards the intended confidential message to one scheduled user via FD relaying, we let ζ j (t) ∈ {0, 1} be a binary variable for user D j , where j ∈ {1, · · · , L}, indicating user scheduling at time slot t, i.e., ζ j (t) = 1 if user D j is scheduled at time slot t and ζ j (t) = 0 otherwise. Therefore, the user scheduling constraint is given as $\sum_{j=1}^{L} \zeta_j(t) = 1$. In this work, we consider a user selection criterion based upon the best channel condition of the second hop of information relaying. Thus, the scheduled user at time slot t is the one with index j * , which maximizes the second-hop channel quality; we denote the selected user D j * as the destination node D.
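A minimal sketch of this scheduling rule, assuming the "best second hop" criterion is read as the largest instantaneous channel gain ||h uD j ||^2 (our reading; the function name is ours):

```python
import numpy as np

def schedule_user(h_ud_list):
    """Return j* (0-based), the index of the user whose second-hop
    channel vector has the largest squared norm."""
    gains = [np.linalg.norm(np.asarray(h)) ** 2 for h in h_ud_list]
    return int(np.argmax(gains))
```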

C. MIMO-ENABLED ARTIFICIAL NOISE BEAMFORMING
In order to establish secure end-to-end transmission, we adopt the ANI technique alongside the confidential messages at both the BS and the UR via MIMO beamforming, as shown in Fig. 2. In this scheme, the source transmits noise-like signals in addition to the information signals in order to confuse the malicious nodes. As such, we design the transmitted unit-power signal x s from the BS as in (16), where w s = h † su /||h su || is chosen to maximize the information signal transmission towards the UR, wherein (·) † indicates the conjugate transpose operator, and W s,an is the projection matrix onto the null space of h su , i.e., h su W s,an = 0; therefore, the columns of the first matrix on the right-hand side of (16), the so-called beamforming matrix, form an orthonormal basis. Note that, following the eigendecomposition of the matrix H su = h † su h su , w s can be chosen as the eigenvector corresponding to the maximum eigenvalue, and the remaining eigenvectors can form the matrix W s,an . The designed beamforming matrix aims at degrading the quality of the received signal at the unintended devices while improving the quality of the reception at the UR. Further, t s denotes the scalar information signal to be sent securely to the end user, and t s,an represents the ANI vector at the BS with dimensions (N s − 1) × 1. Letting α s , where 0 < α s ≤ 1, be the power allocation factor between the information signal and the ANI at the BS, we have E{|t s | 2 } = α s and E{t s,an t † s,an } = (1−α s )/(N s −1) I N s −1 , wherein E{·} indicates the expectation operator.
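The beamforming construction above can be sketched numerically; the snippet below builds w s and the null-space AN matrix via an SVD, which yields the same subspaces as the eigendecomposition of H su = h † su h su described in the text (function and variable names are ours):

```python
import numpy as np

def an_beamforming(h_su):
    """Build w_s = h^dagger / ||h|| and an AN matrix W_an whose columns
    form an orthonormal basis of the null space of h_su."""
    h = np.asarray(h_su, dtype=complex).reshape(1, -1)   # 1 x N_s row vector
    w_s = h.conj().T / np.linalg.norm(h)                 # N_s x 1 MRT vector
    _, _, vh = np.linalg.svd(h)                          # rows of vh are orthonormal
    W_an = vh[1:].conj().T                               # N_s x (N_s - 1) null-space basis
    return w_s, W_an
```

One can verify that h su W an = 0, so the injected noise lands only in the null space of the legitimate link and does not disturb the UR.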
Likewise, the UR applies ANI to the previously decoded signal to be forwarded to the selected user D. The forwarded information-bearing signal can then be represented analogously, where w u = h † ud /||h ud || is chosen to maximize the information signal transmission towards the scheduled user D, and W u,an is chosen such that h ud W u,an = 0. Further, t s (τ − 1) denotes the previously decoded information signal, and t u,an represents the ANI vector at the UR with dimensions (N u −1)×1. Letting α u , where 0 < α u ≤ 1, be the power allocation factor between the information signal and the ANI at the UR, we have E{|t u | 2 } = α u and E{t u,an t † u,an } = (1−α u )/(N u −1) I N u −1 .

D. TRANSMISSION PROTOCOL AND SIGNALS REPRESENTATION
Let P s , P u , and P e j with j = {1, · · · , N e } be the transmission powers of the BS, UR, and AEs, respectively. Then, the received signal at the UR at time slot t can be represented as in (18), where n u ∼ CN (0, σ 2 u ) is the AWGN, and x u and x e j are unit-power signals, i.e., E{|x u | 2 } = 1 and E{|x e j | 2 } = 1. Note that the first term on the right-hand side of (18) denotes the information-bearing signal, the second term is the residual self-interference at the UR, and the third term denotes the disturbance arising from the AEs and their jamming transmissions; the resulting SINR at the UR is given in (19). The received signal at the scheduled user D at time slot t is given by (20), where the antenna noise n d is modeled as AWGN, i.e., n d ∼ CN (0, σ 2 d ). Note that the first term in (20) is the information-bearing signal transmitted by the UR. The second term is the signal coming from the BS; however, assuming the worst-case scenario in which the legitimate users have low-complexity receivers that are unable to perform joint processing, the user treats this signal as interference. In other words, since the signals coming from the BS and the UR are inherently different, for example due to having been encoded with different codebooks, the user is only able to detect one signal. Besides, while it is common in the literature to consider solely the signal coming from the UR, in this work we take the direct transmission into account as well. Finally, the third term denotes the disturbance coming from the AEs; the SINR at D follows accordingly. It should be mentioned that each E j ∈ E may receive two copies of the information signal, from the BS and the UR, with some delay, as the relay needs to first process the received signal before forwarding it. 
In contrast to [44] and to the assumption made for the scheduled user, where the direct transmission is treated as interference, each AE E j here is assumed to be able to fully combine these signals and perform a joint processing method such as an ideal Rake receiver [45], [46]. As such, E j can appropriately co-phase and merge the two signals by applying MRC and thus perform more harmful eavesdropping attacks. Besides, we assume that the AEs are non-colluding, i.e., each AE decodes the received signals from the source and the UR without cooperating with the other AEs. Consequently, the received signal at the j-th AE, denoted by y e j , comprises the direct-link component √(P s L se j ) h se j (w s t s + W s,an t s,an ), the relayed component from the UR, the unit-power jamming signals x e j transmitted by the other AEs, and the AWGN n e j ∼ CN (0, σ 2 e ). We assume that E j applies MRC to effectively decode the received information; hence, the SINR at E j combines the SINR of the direct S-to-E j link with the equivalent SINR of the relayed link, which, as per the rules of the DF protocol, is limited by the minimum of the first-hop SINR given in (19) and the second-hop SINR of the U-to-E j link.

III. PROBLEM FORMULATION
The achievable instantaneous system secrecy rate of the proposed UAV-based FD relaying scenario, assuming a normalized shared bandwidth in bits per second per Hertz (bit/s/Hz), is defined as [47] the positive part of the difference between I D (t) and I E (t), where [x] + ≜ max{x, 0}, I D (t) represents the capacity of the main channel, including both the direct and relaying links from the BS to the scheduled user D at time slot t, and I E (t) determines the Shannon capacity of the non-colluding eavesdropping links at time slot t. In this work, we aim at maximizing the ASSR over the mission time T via trajectory design and resource allocation. Thus, the optimization problem can be formulated as in (30), where constraint C1 indicates the minimum instantaneous secrecy rate requirement, below which a secrecy outage may occur; C2 ensures that the amount of data securely received by each user does not exceed the user's capacity, so as to provide fair service amongst the users; C3 and C4 should be satisfied due to green communications and hardware limitations; C5 indicates the ANI factor limitations; constraints C6 and C7 are posed by the restricted flying region requirement; C8 arises from the mission requirement in terms of the pre-specified start and end locations; C9 ensures the feasibility of the mission; and C10 guarantees that the mission completion time does not exceed a reasonably feasible maximum allowed time T max . The problem (30) is too complicated to solve due to the non-convex complex model of the objective function and constraints C1 and C2. Our approach is to employ reinforcement-learning-based techniques to tackle the problem. In the following section, after providing a brief introduction to RL fundamentals, we detail our RL-based solutions to approximately solve the original problem (30).
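For intuition, the secrecy-rate objective can be sketched as below, assuming I D and I E are Shannon capacities log2(1 + SINR) and that, under the non-colluding assumption, the strongest AE determines I E (a sketch consistent with the description above; names are ours):

```python
import numpy as np

def instantaneous_secrecy_rate(sinr_d, sinr_eves):
    """[I_D - I_E]^+ in bit/s/Hz: capacity of the main channel minus the
    capacity of the best (non-colluding) eavesdropping link, floored at 0."""
    I_D = np.log2(1.0 + sinr_d)
    I_E = max(np.log2(1.0 + g) for g in sinr_eves)
    return max(I_D - I_E, 0.0)
```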

A. PRELIMINARIES
Here, we first briefly explain the RL fundamentals (interested readers are encouraged to refer to excellent resources such as [40] for detailed discussions), based on which we then reformulate our optimization problem in (30) for the trajectory optimization of the proposed UR scenario and efficiently solve it via an approximate learning-based approach. RL problems can be studied mathematically within the MDP framework, which establishes a relationship between interaction-based learning and goal achievement. Consequently, it is worthwhile to first recall some key, though abstract, components of the MDP framework, which shall be given concrete meaning later on. In an MDP there is a decision-maker, the so-called learning agent, that continually interacts with the environment over a sequence of discrete time steps t = 1, 2, 3, · · · by taking an action, receiving a feedback signal from the environment, termed the reward, and then being presented with a new situation or state. The objective of the agent is to maximize the rewards received over time. In our work, we consider a finite MDP, wherein the number of elements in the state, action, and reward sets, i.e., {S, A, R}, is finite.
The interaction of the learning agent with the environment is visualized in Fig. 3. Particularly, letting s ∈ S be the agent's current state, it takes an action a ∈ A, transitions to the next state s′ ∈ S, observes the environment, and receives a numerical reward r ∈ R ⊂ ℝ following the action taken. According to the finite MDP framework, for particular values of the reward r ∈ R and the next state s′ ∈ S, there is a well-defined discrete probability distribution p(s′, r | s, a), which depends only on the preceding state s ∈ S and action a ∈ A, and which yields the expected reward function of the state-action-next-state triples as a three-argument function. RL algorithms, which are employed to solve the above-mentioned finite MDP, essentially instruct the learning agent by estimating the action-value function, or the so-called Q-function. Precisely, the Q-function estimates the quality of the action taken by the agent in a given state in terms of the expected discounted return ϒ t . This return captures not only the immediate reward but also a scaled version of the future rewards in the long run for all successive steps, and can be mathematically represented as $\Upsilon_t = \sum_{k=1}^{L} \gamma^{k-1} R_{t+k}$, where R t+k for k = 1, 2, · · · , L indicate the future rewards after time step t, γ ∈ [0, 1] denotes the discount rate, specifying to what degree the future rewards should be taken into account, and L represents the final time step. Note that, since the expected future rewards depend on the particular actions the agent will take in the future, this Q-function should be defined with respect to the agent's way of acting, the so-called policy. Indeed, the policy is the core element of RL methods, and this decision-making rule is merely a mapping from states to the probability of taking each possible action. 
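The discounted return described above can be computed directly; a short sketch (indexing rewards from the first future step, as in the text):

```python
def discounted_return(rewards, gamma):
    """Upsilon_t = sum over k of gamma^(k-1) * R_{t+k}: the first future
    reward plus geometrically discounted later rewards."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))
```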
Mathematically speaking, if the agent follows policy $\pi$ at time step $t$, then it takes action $A_t = a$ in state $S_t = s$ with probability $\pi(a \mid s)$, given by $\pi(a \mid s) = \Pr\{A_t = a \mid S_t = s\}$. In light of this, the value of taking action $a$ in state $s$ under policy $\pi$ is defined as $Q_\pi(s, a) = \mathbb{E}_\pi\left[\Upsilon_t \mid S_t = s, A_t = a\right]$, where the Q-function $Q_\pi(s, a)$ indicates the expected return when starting from state $s$, taking action $a$, and following policy $\pi$ thereafter, and $\Upsilon_t$ is defined in (35). Likewise, the value function of a state $s$ under a policy $\pi$ is defined as the expected return when starting in state $s$ and following $\pi$ afterwards, $v_\pi(s) = \mathbb{E}_\pi\left[\Upsilon_t \mid S_t = s\right]$. These two functions are related via $v_\pi(s) = \sum_{a} \pi(a \mid s) Q_\pi(s, a)$. Solving an RL problem for the finite MDP is roughly equivalent to finding an optimal policy,¹ which can be expressed as $\pi^* = \arg\max_\pi v_\pi(s), \forall s \in \mathcal{S}$. We note that the optimal policy also shares the same optimal Q-function $Q^*(s, a)$, defined precisely as $Q^*(s, a) = \max_\pi Q_\pi(s, a), \forall s \in \mathcal{S}, \forall a \in \mathcal{A}$. This function provides the optimal expected long-term return as a value that is locally and immediately available for each state-action pair. Additionally, $Q^*(s, a)$ must satisfy the Bellman optimality equation, given by $Q^*(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \max_{a'} Q^*(s', a')\right]$. It is worth pointing out that the Bellman optimality equation is non-linear and explicitly solving it is, in practice, too hard: first, we would need to accurately know the dynamics of the environment, i.e., to have $p$ available, and second, we would need sufficient computational resources for solving the equations, among other required assumptions.
¹ A policy $\pi_1$ is defined to be better than or equal to a policy $\pi_2$ if its expected return is greater than or equal to that of $\pi_2$ for all states; in other words, $\pi_1 \geq \pi_2 \Longleftrightarrow v_{\pi_1}(s) \geq v_{\pi_2}(s), \forall s \in \mathcal{S}$. We also note that the optimal policy is not necessarily unique; however, there always exists at least one policy that outperforms all other policies in terms of the expected return.
Alternatively, we can use iterative methods to approximately solve it and hence estimate the optimal Q-function in an efficient time. In light of this, TD-based reinforcement learning algorithms are model-free methods that recursively approximate the Q-function. There are two main types of TD learning methods: On-policy and Off-policy. While the former evaluates and improves the very policy that is used to make decisions, the latter evaluates or improves a policy, known as the estimation policy, which may differ from the one used to generate the data, namely the behavior policy. We now have the necessary tools to address our problem, namely to reformulate the UR's trajectory design problem so that it can be solved via finite MDP-based RL algorithms. Towards that end, we reformulate (30) as a model-free RL problem. Specifically, we divide the original problem into three sub-problems: user scheduling, trajectory optimization, and joint power and ANI allocation. In other words, a three-stage decision-making process is considered to cope with the original problem. First, the UR takes one of the possible actions to update its trajectory; then, it selects one of the users, per the protocol detailed in II-B (mainly the closest user to the UR), for relaying service. Upon the UR's changing position and scheduling the ground user, the ISSR is improved by optimizing the available resources $\{P_s, P_u, \alpha_s, \alpha_u\}$, which in turn contributes to the reward received by the learning agent (i.e., the UR) following the action taken in the first stage. Therefore, to approximately reformulate the original problem so that it can be solved efficiently, we need to precisely specify the RL model in terms of the state set, action set, reward, and algorithm, all of which are detailed below.

B. STATE SET
Since we are interested in the UR's trajectory design, we can let each state represent the position of the UR in 3D space. However, since the UR's position is generally a continuous function of time, i.e., $Q_u(t)$ for $t_0 \le t \le t_0 + T$, this leads to an infinite state set. We therefore restrict attention to a finite number of possible states, as required by the finite MDP framework. To this end, the considered rectangular region, in which the fixed-altitude UR aims to learn the optimal trajectory via RL, is partitioned into $N_w$ by $N_l$ small tiles. Consequently, the region $[0, R_w]$ by $[0, R_l]$ in Fig. 1 is converted into a finite grid-world of $N_w \times N_l$ tiles, in which the x-y coordinates of the center of each tile represent one state. As a result, the state set $\mathcal{S}$ can be represented as $\mathcal{S} = \{S_n = (x_n, y_n) \mid n = 1, 2, 3, \cdots, N\}$, where $N = N_w \times N_l$ and, indexing the tiles row by row, $x_n$ and $y_n$ are given respectively by $x_n = \left(\mathrm{mod}(n-1, N_w) + \tfrac{1}{2}\right)\tfrac{R_w}{N_w}$ and $y_n = \left(\lfloor (n-1)/N_w \rfloor + \tfrac{1}{2}\right)\tfrac{R_l}{N_l}$. It should be mentioned that each tile, representing a position in the x-y plane, might be occupied by a user, an AE, or an obstacle. The corresponding occupied states can thus be represented as the sets $\mathcal{S}_d \subset \mathcal{S}$, $\mathcal{S}_e \subset \mathcal{S}$, and $\mathcal{S}_o \subset \mathcal{S}$, respectively, which are assumed to be disjoint, i.e., $\mathcal{S}_d \cap \mathcal{S}_e \cap \mathcal{S}_o = \emptyset$. Further, according to the definitions given above, we can define the initial state $s_{\mathrm{init}}$ and the termination state $s_{\mathrm{flag}}$ of the considered MDP as the states corresponding to the UR's given start and final positions. Moreover, based on the constraint C8 in (30), we have $q_i = q_u(0) = s_{\mathrm{init}}$ and $q_f = q_u(N^{\max}_{sp}) = s_{\mathrm{flag}}$, where $N^{\max}_{sp}$ is a positive integer corresponding to the mission completion time. With a slight abuse of notation, the UR's discrete position at time step $t$ can then be written as $q_u(t) = s \in \mathcal{S}$.
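The tile-to-state mapping described above can be sketched as follows; the row-major tile ordering and the example grid dimensions are illustrative assumptions, not values from the paper:

```python
def tile_center(n, N_w, N_l, R_w, R_l):
    """Center (x_n, y_n) of the n-th tile (1-indexed, row-major over an
    N_w-by-N_l grid covering the region [0, R_w] x [0, R_l])."""
    col = (n - 1) % N_w       # column index within a row of N_w tiles
    row = (n - 1) // N_w      # row index, 0 .. N_l - 1
    x_n = (col + 0.5) * R_w / N_w
    y_n = (row + 0.5) * R_l / N_l
    return x_n, y_n
```

For example, on a 4-by-4 grid over a 400 m by 400 m region, tile 1 maps to the center of the bottom-left tile.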

C. ACTION SET
As illustrated in Fig. 1, the available action set of the UR is $\mathcal{A} = \{\mathrm{N}, \mathrm{S}, \mathrm{E}, \mathrm{W}, \mathrm{NE}, \mathrm{SE}, \mathrm{NW}, \mathrm{SW}\}$, where N refers to flying one tile towards the north ($-y$ direction), S refers to flying one tile towards the south ($+y$ direction), E refers to flying forward one tile in the $+x$ direction, and W refers to flying backward one tile in the $-x$ direction. Analogously, NE, SE, NW, and SW indicate flying one tile, with higher speed but equal time duration compared to the aforementioned directions, towards the north-east, south-east, north-west, and south-west, respectively. Therefore, the cardinality of the action set is 8. Note that, in practice, the UR is capable of selecting any direction; however, the optimal continuous trajectory is approached as the number of states in our problem goes to infinity. We adopt the $\varepsilon$-greedy strategy for action selection in order to balance exploration and exploitation of the environment. As such, action $a$ is selected according to $a = \arg\max_{a \in \mathcal{A}} Q(s, a)$ if $\mathrm{rand}(\cdot) \ge \varepsilon$, and a random action $a \in \mathcal{A}$ otherwise, where $\mathrm{rand}(\cdot) \in [0, 1]$ and $\varepsilon$ represents the probability of exploration, with which the agent has the chance to improve its current knowledge about each action, thereby enabling more informed decisions in the future. Further, $1 - \varepsilon$ denotes the exploitation rate, i.e., the probability of choosing the greedy action that obtains the most reward by exploiting the UR-agent's current action-value estimates.
In general, we want the UR-agent to start off learning the environment with a fairly randomized policy and gradually move towards a deterministic one, which implies that $\varepsilon$ should decay. In this work we consider $\varepsilon$ as a decreasing function of the episode number as $\varepsilon = \beta^{\lfloor \mathrm{episode}/k \rfloor}$ (47), where $\beta \in (0, 1)$, $k$ is a constant positive integer that makes the $\varepsilon$ parameter decay every $k$-th episode, and $\lfloor \cdot \rfloor$ represents the floor operator. This dynamic choice of the epsilon parameter with proper constants results in more environment exploration at the beginning of the learning process, while mainly following the learned policy in the last episodes.
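The decaying ε-greedy selection described above can be sketched as follows; the constants `eps0`, `beta`, and `k` are illustrative placeholders, not the values used in the paper's simulations:

```python
import random

def epsilon(episode, eps0=1.0, beta=0.9, k=50):
    """Decayed exploration rate: eps0 * beta ** floor(episode / k),
    so epsilon drops by a factor beta every k episodes."""
    return eps0 * beta ** (episode // k)

def select_action(Q, s, eps, actions):
    """Epsilon-greedy selection: explore with probability eps,
    otherwise pick the greedy action under the current Q estimates."""
    if random.random() < eps:                      # explore
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])   # exploit
```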

D. REWARD FORMULATION
In order for the UR to succeed in its quest for an optimal trajectory that maximizes the ASSR during the flight mission, we devise the reward function in such a way that the constraints imposed by the environment are also satisfied in the RL. It is worth pointing out that when the UR takes an action $a$ at time step $t$, $t = 1, 2, \cdots, N^{\max}_{sp}$, and then transits from the current state $s$ to the next state $s'$ receiving the reward $r$ at the successive time step, the UR-agent assigns a score to the taken action to indicate how important that action was in rendering the future reward. Therefore, the reward function of the UR-agent is defined as $r = \zeta_1 \hat{R}_{\mathrm{sec}} - \zeta_2 F_1 - \zeta_3 F_2 + F_3$ (49), where $\hat{R}_{\mathrm{sec}}$ represents the improved ISSR as an immediate reward, $F_1$ is an indicator function accounting for the QoS requirements in terms of both the communication secrecy outage and user service fairness, and $F_2$ is a penalty function that encourages the UR to complete the mission at the final desired destination as soon as possible (decreasing $N^{\max}_{sp}$, which corresponds to minimizing the mission time $T$) in order to improve the overall ASSR performance. Finally, $F_3$ is a function which penalizes the UR-agent so as to avoid collisions with obstacles, keep it flying inside the restricted region, and discourage it from getting stuck in a loop, which may result in a mission failure. It is worth pointing out that the reward function parameters $(\zeta_1, \zeta_2, \zeta_3)$ should be selected in such a way as to balance positive rewards (revenue) against negative rewards (cost). We now explain the functions making up the instantaneous reward $r$ in more detail.

1) DEFINITION OF FUNCTIONR sec
Following user scheduling according to (15), for the sake of maximizing the ASSR in (30), it is important to optimize the ISSR via proper resource allocation, which in turn improves the system sum secrecy rate and accordingly the ASSR performance. We define the function $\hat{R}_{\mathrm{sec}}$ for a given time step $t$ as $\hat{R}_{\mathrm{sec}} = \underset{P_s, P_u, \alpha_s, \alpha_u}{\mathrm{maximize}}\ R_{\mathrm{sec}}(P_s, P_u, \alpha_s, \alpha_u)$ subject to C1: $0 \le P_s \le P^{\max}_s$, C2: $0 \le P_u \le P^{\max}_u$, and C3: $0 \le \alpha_s \le 1$, $0 \le \alpha_u \le 1$ (50), where C1 and C2 are the maximum transmission power constraints posed by hardware limitations and regulatory standards, and C3 indicates the ANI power allocation constraints. The above optimization problem, having a non-linear objective function with convex constraints, can be readily solved via known optimization toolboxes such as fmincon in MATLAB.
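Since the exact ISSR expression is derived earlier in the paper and not reproduced here, the subproblem in (50) can be sketched with a coarse grid search over the four resources. The surrogate objective `issr` is a caller-supplied hypothetical stand-in for $R_{\mathrm{sec}}$; in practice a continuous solver such as fmincon would replace the grid:

```python
from itertools import product

def optimize_resources(issr, P_s_max, P_u_max, grid=11):
    """Grid-search sketch of the resource-allocation subproblem (50):
    maximize issr(P_s, P_u, alpha_s, alpha_u) subject to C1-C3, i.e.
    0 <= P_s <= P_s_max, 0 <= P_u <= P_u_max, alphas in [0, 1]."""
    best, best_val = None, float("-inf")
    Ps = [i * P_s_max / (grid - 1) for i in range(grid)]
    Pu = [i * P_u_max / (grid - 1) for i in range(grid)]
    Al = [i / (grid - 1) for i in range(grid)]
    for p_s, p_u, a_s, a_u in product(Ps, Pu, Al, Al):
        val = issr(p_s, p_u, a_s, a_u)
        if val > best_val:
            best, best_val = (p_s, p_u, a_s, a_u), val
    return best, best_val
```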

2) DEFINITION OF FUNCTION F 1
According to the minimum QoS requirement of the mission, we assume there exists a minimum required secrecy rate $R^{\mathrm{th}}_{\mathrm{sec}}$ constraint to be satisfied, otherwise the secure communication undergoes an outage. To that end, we formulate this constraint as a penalty function that penalizes the UR for taking those actions during the learning process that lead to any QoS failure. Accordingly, we define $F_1$ as the indicator that either QoS condition is violated, i.e., $F_1 = \mathbb{1}\left\{\hat{R}_{\mathrm{sec}} < R^{\mathrm{th}}_{\mathrm{sec}}\ \text{or}\ \textstyle\sum_{\tau \le t} \hat{R}_{\mathrm{sec}}(\tau) > \hat{R}^{\max}_{\mathrm{sec}}\right\}$, where $\hat{R}_{\mathrm{sec}}$ is given by (50), $R^{\mathrm{th}}_{\mathrm{sec}}$ represents the minimum instantaneous secrecy rate requirement at each user, and $\hat{R}^{\max}_{\mathrm{sec}}$ indicates the maximum sum secrecy rate threshold below which the selected user's cumulative secrecy rate up to the given time step $t$ should remain, in order to ensure fairness in serving the users.

3) DEFINITION OF FUNCTION F 2
Since the UR is required to complete the mission at the pre-specified final location, denoted by a flag in Fig. 1, we need a termination state. However, depending on the environment, the UR-agent may fail to complete the mission due to getting stuck in some states. To avoid this, we penalize the UR by the function $F_2$, defined as the product of two terms: $F_2 = \left(1 + \tfrac{t}{N^{\max}_{sp}}\right) \bar{d}\left(q_u(t), q_f\right)$, where I) $1 + t/N^{\max}_{sp}$ is a penalty that grows with the number of time steps taken so far, since we target a reasonable mission completion time of the UR (i.e., as fast as it can with fewer steps) to reduce the mission duration and, in some sense, the mechanical energy consumption, and II) $\bar{d}(q_u(t), q_f)$ denotes the normalized distance between the UR's current location and the desired final location, which motivates the UR to find its way towards the termination state. We emphasize that, via the multiplication of both terms, the penalty function $F_2$ ensures that the UR does not get stuck for an unreasonable period of time in some specific states, which may have higher secrecy rates, and thus avoids mission incompletion.

4) DEFINITION OF FUNCTION F 3
Apart from the previous functions contributing to the UR's reward, we need another function to impose the environmental and mission requirement constraints. To that aim, we define $F_3$ as
$F_3 = \begin{cases} -f_p, & \text{if the UR collides with an obstacle} \\ -f_p, & \text{if the UR leaves the operating region} \\ -f_p, & \text{if the UR revisits states, forming a loop} \\ +f_r, & \text{if the UR reaches the termination state} \\ 0, & \text{otherwise}, \end{cases}$ (53)
where $f_p$ and $f_r$ are the absolute values of some immediate penalty and reward, respectively (both quantified in the simulations), applied according to the conditions the UR-agent has encountered. The first penalty term in (53) ensures that the UR-agent flies at a safe distance from obstacles to avoid possible collisions, the second penalty motivates the UR not to go beyond the operating region of interest, the third penalty term further discourages the UR-agent from getting stuck in specific states forming an infinite loop, and the fourth term is a reward for reaching the termination state and completing the mission.
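The penalty terms F2 and F3 described above can be sketched as follows; the normalization constant `d_max` and the boolean condition flags are illustrative assumptions introduced for this sketch:

```python
import math

def f2_penalty(t, N_max, q, q_f, d_max):
    """Mission-progress penalty F2: (1 + t/N_max) times the distance
    to the final location q_f, normalized by an assumed d_max."""
    return (1 + t / N_max) * (math.dist(q, q_f) / d_max)

def f3_penalty(collision, outside, looped, reached, f_p, f_r):
    """Environment/mission term F3 of (53): -f_p for a collision,
    leaving the region, or looping; +f_r on reaching the flag."""
    if collision or outside or looped:
        return -f_p
    if reached:
        return f_r
    return 0.0
```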

E. TD-BASED MODEL-FREE RL ALGORITHMS
TD-based techniques refer to a class of model-free reinforcement learning algorithms that can learn from raw experience without a model of the environment's dynamics, and that update estimates based in part on other learned estimates, i.e., they learn by bootstrapping from the current estimate of the value function without waiting for a final outcome. In this work, we consider SARSA as an On-policy TD learning algorithm and Q-learning as an Off-policy TD learning algorithm, as well as some generalized versions based on these two, i.e., Expected SARSA, Double Q-learning, and SARSA(λ), for the UR-agent's trajectory optimization, and we compare their performances in the numerical section. We note that the main difference between these algorithms lies in the Q-value update rule, as detailed below.

1) SARSA: ON-POLICY TD LEARNING-BASED ALGORITHM
SARSA is an On-policy TD reinforcement learning method with the Q-value update rule $Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma Q(s', a') - Q(s, a)\right]$ (54), where $\alpha$ and $\gamma$ denote the learning rate and the discount factor, respectively. Note that (54) shows how learning proceeds from one state-action pair to the next and how the Q-value is updated; this update is performed after every transition from a non-terminal state. It has been proven that SARSA converges with unit probability to an optimal policy provided that all state-action pairs are visited an infinite number of times and the policy converges in the limit to the greedy policy [40]. The SARSA-based intelligent trajectory design for the proposed flying FD-operated MIMO-UR system is, therefore, given in Algorithm 1.
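A minimal, self-contained sketch of one SARSA episode follows; a toy one-dimensional corridor stands in for the grid-world, and the environment class and hyperparameters are illustrative, not those of Algorithm 1:

```python
import random
from collections import defaultdict

class ChainEnv:
    """Toy 1-D corridor: states 0..3, action +1/-1 moves the agent,
    reward 1.0 on reaching the terminal state 3 (the 'flag')."""
    actions = (+1, -1)
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = max(0, min(3, self.s + a))
        done = (self.s == 3)
        return self.s, (1.0 if done else 0.0), done

def sarsa_episode(env, Q, eps=0.1, alpha=0.5, gamma=0.9):
    """One SARSA episode with update rule (54):
    Q(s,a) <- Q(s,a) + alpha*[r + gamma*Q(s',a') - Q(s,a)]."""
    def pick(s):
        if random.random() < eps:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])
    s = env.reset()
    a = pick(s)
    done = False
    while not done:
        s2, r, done = env.step(a)
        a2 = pick(s2)                   # next action chosen on-policy
        target = r + (0.0 if done else gamma * Q[(s2, a2)])
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s, a = s2, a2
    return Q
```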
2) Q-LEARNING: OFF-POLICY TD LEARNING-BASED ALGORITHM
Q-learning is an Off-policy TD method with the Q-value update rule $Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a' \in \mathcal{A}} Q(s', a') - Q(s, a)\right]$, where the learned action-value function directly approximates the optimal Q-value, regardless of the policy being followed by the UR-agent. The Q-learning based UR trajectory design for the proposed scenario is given in Algorithm 2.

3) EXPECTED SARSA
Since the use of the next action $a'$ introduces additional variance into the update rule of the On-policy SARSA method, it may slow down convergence. For this reason, a modified version of SARSA, i.e., Expected SARSA, was proposed in [40] and systematically investigated in [48]. Instead of using the next action $a'$, Expected SARSA employs an expectation (weighted sum) over all available actions in state $s'$, weighted by the probability of each action under the current policy, with the update rule $Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \sum_{a' \in \mathcal{A}} \pi(a' \mid s') Q(s', a') - Q(s, a)\right]$, in which the $\varepsilon$-greedy policy probability $\pi(a' \mid s')$ is given by
$\pi(a' \mid s') = \begin{cases} \dfrac{\varepsilon}{|\mathcal{A}|} + \dfrac{1-\varepsilon}{|\mathcal{A}_g|}, & \text{if } a' \text{ is greedy} \\ \dfrac{\varepsilon}{|\mathcal{A}|}, & \text{if } a' \text{ is non-greedy}, \end{cases}$
wherein $|\mathcal{A}_g|$ is the number of greedy actions and $\varepsilon$ is given in (47). This may offer a substantial advantage over SARSA by reducing the variance and accordingly speeding up the convergence of the learning algorithm. On the other hand, Expected SARSA is quite similar to Q-learning when the estimation policy is greedy, and it can therefore be viewed as an On-policy version of Q-learning. We note that Expected SARSA requires more computation per step than SARSA but avoids the high variance caused by the random selection of the next action $a'$. As a result, it moves deterministically in the same direction that SARSA moves in expectation, which leads to relatively better performance than plain SARSA with the same amount of experience, as shall be seen in the numerical section. The Expected SARSA based UR trajectory design for the proposed scenario is given in Algorithm 3.
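The expected backup that distinguishes Expected SARSA from SARSA can be sketched as a single target computation; the ε-greedy probabilities follow the piecewise definition above, and the dictionary-based Q-table is an illustrative choice:

```python
def expected_sarsa_target(Q, s2, r, gamma, eps, actions, done):
    """Expected SARSA backup: r + gamma * sum_a pi(a|s') * Q(s', a),
    with pi the eps-greedy policy over the current Q estimates."""
    if done:
        return r
    q_max = max(Q[(s2, a)] for a in actions)
    greedy = [a for a in actions if Q[(s2, a)] == q_max]
    expected = 0.0
    for a in actions:
        p = eps / len(actions)              # exploration mass
        if a in greedy:
            p += (1 - eps) / len(greedy)    # greedy mass, split on ties
        expected += p * Q[(s2, a)]
    return r + gamma * expected
```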

4) DOUBLE Q-LEARNING
Using the max operation in the single-estimator Q-learning algorithm may result in poor performance due to a large overestimation bias, particularly in some stochastic environments. To remedy this issue, Double Q-learning has been proposed in [49]: it employs two Q-functions, each randomly selected for updating based on the value from the other Q-function at the next state $s'$. This partly relieves the overestimation issue, although the approach may instead underestimate the maximum expected value. Double Q-learning is regarded as an unbiased estimator of the action-value function, since the two Q-functions are updated for the same problem but learn from dissimilar sets of experience samples. Here, we propose a Double Q-learning based RL algorithm for the UR-agent's trajectory optimization in Algorithm 4, wherein for action selection the sum of both state-value functions for each action is considered, so as to capture the effects of both Q-functions. Although this requires more storage, and hence Double Q-learning is generally less data-efficient than normal Q-learning, it significantly speeds up convergence, as we will see later in the numerical section.
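A single Double Q-learning update can be sketched as follows; a fair coin decides which table is updated, with the other table evaluating the selected greedy action (hyperparameters and dictionary Q-tables are illustrative):

```python
import random

def double_q_update(Q1, Q2, s, a, r, s2, actions, done,
                    alpha=0.5, gamma=0.9):
    """One Double Q-learning step: with probability 1/2 swap the roles
    of the two tables, then update the first using the argmax action
    under it but the *other* table's value estimate."""
    if random.random() < 0.5:
        Q1, Q2 = Q2, Q1            # swap which table gets updated
    if done:
        target = r
    else:
        a_star = max(actions, key=lambda b: Q1[(s2, b)])
        target = r + gamma * Q2[(s2, a_star)]   # cross-evaluation
    Q1[(s, a)] += alpha * (target - Q1[(s, a)])
```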

5) SARSA(λ)
So far, all the abovementioned TD learning algorithms, i.e., Q-learning, SARSA, Expected SARSA, and Double Q-learning, consider at most one step when updating the Q-table values corresponding to the current state. However, every step taken before reaching a given state may be important to consider, with different degrees of relevance. The SARSA(λ) algorithm [40] is a generalized multi-step version of SARSA which not only updates the Q-value of the latest step but also efficiently rewards all the related preceding steps using the so-called eligibility trace. The eligibility trace is a matrix $E_{|\mathcal{S}| \times |\mathcal{A}|}$, initialized with zeros prior to each episode, that records each step of the experienced path by incrementing the corresponding state-action entry by one. After each step, all elements of the eligibility trace are decayed in proportion to the bootstrapping factor λ. This ensures that all the action values from the beginning of the episode up to the last step taken are updated to different degrees, following a recency fading. The SARSA(λ) based trajectory design for the proposed UR-based relaying scenario is given in Algorithm 5. We will see how this algorithm performs compared to the others in the numerical section.
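One SARSA(λ) backup with an accumulating eligibility trace can be sketched as follows; a sparse dictionary stands in for the $E_{|\mathcal{S}| \times |\mathcal{A}|}$ matrix, and the hyperparameters are illustrative:

```python
from collections import defaultdict

def sarsa_lambda_step(Q, E, s, a, r, s2, a2, done,
                      alpha=0.5, gamma=0.9, lam=0.1):
    """One SARSA(lambda) backup: bump the trace of (s, a), propagate
    the TD error to every traced pair, then decay all traces by
    gamma * lambda (recency fading)."""
    target = r if done else r + gamma * Q[(s2, a2)]
    delta = target - Q[(s, a)]
    E[(s, a)] += 1.0                     # accumulating trace
    for key in list(E):
        Q[key] += alpha * delta * E[key]
        E[key] *= gamma * lam
    return delta
```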

F. COMPLEXITY DISCUSSIONS
In this paper, we use the grid-world for the exploration of the UR to find its optimal trajectory. Therefore, the state space topology has a linear upper action bound, i.e., the number of possible actions in each state is capped at $|\mathcal{A}| = 8$. Further, we have a finite set of states with cardinality $|\mathcal{S}| = N_l N_w = N$. In Algorithms 1 and 2, the UR-agent learns by visiting the states and updating the corresponding Q-values during each episode. Assuming that all the actions are known to the UR and that the state space is fully observable, in that the UR is capable of determining its current state, the learning agent can reach the goal state and terminate after at most $O\left(|\mathcal{S}| \sum_{s \in \mathcal{S}} |\mathcal{A}(s)|\right)$ steps. Further, since the considered grid-world has the special property of being 1-step invertible due to having no duplicate actions [40], the worst-case complexity of both proposed algorithms becomes $O(8N^2)$. This implies that the worst-case complexity of the proposed learning-based algorithms, though they perform undirected exploration, is polynomial in the number of states $N$. The only major difference between the two basic algorithms is that On-policy SARSA learns action values relative to the policy it follows, and hence may incur a higher sample complexity, while Off-policy Q-learning learns them relative to the greedy policy at each state visit, and hence might be comparatively slower. The generalized versions of the above algorithms have mainly been proposed in the literature to speed up convergence, at the cost of higher per-step computation for Expected SARSA and SARSA(λ), or of higher storage with the same computational complexity for Double Q-learning.

V. NUMERICAL RESULTS
In this section, we demonstrate the performance of the proposed algorithms in finding the optimal trajectory for the considered MIMO-beamforming based secure UAV-assisted flying relay system. The simulation settings, mainly adopted from the literature, are given in Table 2, unless otherwise stated. We use both Python 3.7 and MATLAB R2020a to implement the algorithms and conduct the simulations; all the learning experiments were run in Python 3.7 on an i5-8265U CPU @ 1.6 GHz with 16 GB of RAM. First, we supply Fig. 4 to demonstrate how the proposed FD-operated MIMO-UR with ANI scenario, denoted as Proposed, performs in terms of the ISSR, and the impact of resource allocation according to (50). In this figure, the ISSR performance of the ANI-based UR system with fixed resource allocation, i.e., $P_s(t) = P^{\max}_s$, $P_u(t) = P^{\max}_u$, $\alpha_s(t) = 0.5$, $\alpha_u(t) = 0.5\ \forall t$, is represented by Benchmark 1. Further, Benchmark 2 illustrates the ISSR performance of the UR system without ANI operation (equivalent to $\alpha_s(t) = 1$, $\alpha_u(t) = 1\ \forall t$) and with fixed transmit powers $P_s(t) = P^{\max}_s$, $P_u(t) = P^{\max}_u\ \forall t$. The ISSR performance of the direct transmission protocol from the BS with ANI beamforming to the destination and optimized communication resources, represented by Benchmark 3, is also taken into account for comparison. Note that the curves in Fig. 4 are plotted for different UAV altitudes to demonstrate the effect of this key system parameter. In Fig. 4, we consider a simple scenario with a fixed straight-line trajectory of the UAV, in which the destination user is located at a horizontal distance of $q = 100$ m from the BS. The UAV flies at constant speed at a fixed altitude $H_u$ from just above the BS towards just above the user. Therefore, at mission time $t = 0$ s, the UR is located at the same x-y coordinate as the BS but at altitude $H_u$, and at $t = 20$ s, the UR reaches its final location, assumed to have the same projected x-y coordinate as the user's.
It is clear from the curves that our proposed scheme outperforms the other benchmarks. Further, we see that at the lower UR altitude, i.e., $H_u = 50$ m, there is an optimal location for the UR that offers the best ISSR in all the UR-based scenarios, and this location is roughly closer to the destination user than to the BS. The figure also illustrates that when the low-altitude UR is too far from the destination user, Benchmarks 1 and 2 yield worse ISSR performance than the traditional direct transmission of Benchmark 3 without a UR. The justification is that, at the lower UR altitude and without proper resource allocation, the secrecy capacity of the relaying link may be dramatically impacted by the overall larger attenuation. However, at the reasonably higher UR altitude of $H_u = 100$ m, the advantage of a likely LoS channel plays such a significant role that it may result in higher ISSR performance than the properly resource-allocated scheme at the lower altitude. We also observe that at the higher altitude of $H_u = 100$ m, the ISSR performance of all the UR-assisted scenarios improves as the UR gets closer to the destination user, so the better channel condition due to the UR's placement has a stronger effect than the proper communication resource allocation. It is worth mentioning, however, that there is an inherent trade-off between the LoS channel and the larger path-loss attenuation at higher UR altitudes, which should be taken into account in system design. Note that in this work we consider a fixed UR altitude for the remaining simulations, although this can readily be extended to the general case in future work. We now turn our attention to employing the model-free RL-based algorithms of the previous section to design the UR's path planning so as to maximize the ASSR performance.
The considered environment is shown in Fig. 5, wherein obstacles such as tall trees and buildings denote the prohibited flying regions, and also the randomly located users (visualized by either boy or girl icons) and AEs are considered quasi-stationary such that their possible movements do not result in a change in their occupied states during the UR's flight duration. Having implemented the proposed TD-based RL algorithms detailed in the previous section, we obtain the following results.
Figs. 6 and 7 are provided to demonstrate the accumulated discounted reward (return) and the mission duration (total steps) of all the considered RL-based algorithms, i.e., Q-learning, SARSA, Expected SARSA, Double Q-learning, and SARSA(λ) with the ε-greedy action selection strategy, versus the episode number. As can be observed from Fig. 6, the return initially fluctuates dramatically, as the UR explores the environment with a higher probability and takes actions randomly. This results in mainly negative rewards due to, for example, collisions with obstacles, revisiting already visited states in a given episode, or flying outside the permitted region. Nevertheless, once the UR-agent has been trained sufficiently via the feedback received from the environment, giving it a decent knowledge of the environment's topology, it tends to exploit its experience; consequently, the fluctuation in the accumulated discounted reward becomes negligible, indicating that the Q-function updates have settled. This also implies that the maximum achievable ASSR is attained. Further, we observe from Fig. 7 that the number of steps, which corresponds to the mission time, decreases as the learning agent interacts with the environment over the episodes. It is evident that the UR aims to complete the mission as fast as possible while accomplishing the required objective. Indeed, there is a trade-off between the mission time and the achievable ASSR of the system: the longer the mission time, the higher the sum system secrecy rate can be, thanks to better positioning and data relaying, which improves the numerator of the ASSR objective, but also the larger its denominator, which of course degrades the ASSR. Therefore, the UR intends to complete the mission as fast as it can, while being smart about it.
This is why the paths obtained via the algorithms generally suggest that the UR is instructed to find the best, but not necessarily the shortest, path from the initial location to the final pre-specified location. While (Expected) SARSA achieved the fewest total steps, 16, according to Fig. 7, Q-learning performed the worst in terms of total steps taken, with the longest mission duration of 23 time steps, according to the final routes derived from the optimal policies. The ASSR learning curves, smoothed following [50] to better illustrate the overall trends, show that all the algorithms converge and exhibit an increasing trend quite similar to that of the discounted cumulative reward in Fig. 6. We also note that SARSA(λ = 0.1) achieves the highest value, i.e., 2.73 bps/Hz, amongst all the algorithms for the same number of training episodes. The convergence speed of Double Q-learning is comparatively faster than the others, particularly normal Q-learning, despite the fact that they achieve quite identical optimal ASSR values. We also observe that higher values of λ degrade the performance of the algorithm in the considered environment. Motivated by the latter result, we empirically investigate the effect of the λ factor on the performance of the SARSA(λ) algorithm in Fig. 9 and observe that λ = 0.1 yields the best ASSR, so it can be considered the best choice of bootstrapping factor for the SARSA(λ) based algorithm in the given environment. Fig. 10 depicts the trajectories learned by the UR-agent using the proposed TD-based algorithms; the final routes are derived from the optimal policies of all the aforementioned algorithms. We note that the algorithms find slightly different trajectories, on the grounds that they employ different updating approaches and that SARSA-based algorithms are more conservative than Q-learning ones regarding the actions explored. Fig. 11 depicts the average user secrecy rate against the user index for the different proposed RL-based algorithms. We see that all the proposed algorithms satisfy the minimum and sum secrecy rate requirements of the system. Although some users are not scheduled according to Fig. 11, the secrecy rate requirements of the scheduled users are well satisfied, indicating the effectiveness of the proposed trajectory design and resource allocation algorithms. Fig. 12 is supplied to show the processing time taken by the algorithms on our system; as can be observed, all one-step algorithms, particularly SARSA, achieve a lower processing time than the multi-step ones. The impact of the reward function parameter ζ is explored in Fig. 13. We note, intuitively, that a proper choice of ζ balances the system sum secrecy rate against the total mission completion time, which leads to the best ASSR performance. Further investigation into tuning the learning parameters and reward factors is left as interesting future work.

VI. CONCLUSION
In this paper, we considered an FD-operated UR-assisted secure communication system to serve multiple ground users in the presence of randomly located AEs. We proposed a secure relaying scheme, wherein both the BS and UR adopt MIMO-enabled ANI-based beamforming to combat AEs. Our problem of interest was to maximize the ASSR of the considered scenario. To achieve this objective, we invoked some model-free TD-based RL algorithms, i.e., Q-learning, SARSA, Expected SARSA, Double Q-learning, and SARSA(λ) for trajectory optimization.
The proposed algorithms accounted for QoS requirements in terms of the minimum instantaneous secrecy rate and each user's maximum sum secrecy rate, and operated without the need for system identification. Simulation results revealed that all of the proposed algorithms were capable of finding an optimal trajectory for the UR while improving the ASSR, avoiding collisions with environmental obstacles, and completing the mission as fast as possible. As a future research direction, one can extend this work to more practical scenarios in which the state and action spaces are continuous and/or of very large dimension, suffering from the curse of dimensionality. There, promising DRL techniques such as DQN may be explored to perform functional optimization and efficiently tackle the computationally intensive learning process of the tabular methods investigated in this work.