Reinforcement Learning for Resilient Aerial-IRS Assisted Wireless Communications Networks in the Presence of Multiple Jammers

The evolving landscape of beyond-5G and 6G wireless communication systems in smart urban environments faces numerous interference-related challenges posed by legitimate and illicit devices. In this context, Intelligent Reflecting Surfaces (IRS) have emerged as a promising solution to mitigate interference caused by obstacles and unknown jamming devices. Existing techniques mainly focus on mitigating the impact of a single jammer in IRS-assisted communication systems, which affects both stationary and mobile devices. Additionally, these approaches target a single objective, such as minimizing energy, enhancing the transmission rate, or maximizing the Signal-to-Interference-plus-Noise Ratio (SINR), which restrains the performance of the system. This paper offers a comprehensive anti-jamming solution for securing wireless communications in a smart urban environment comprising diverse public events such as sporting events, parades, festivals, and exhibitions. The focus is on maintaining essential services like security, law enforcement, logistics, emergency response, crowd management, and public health. We introduce a Reinforcement Learning-based technique for a UAV-mounted IRS, optimizing trajectory and phase-shift beamforming to counteract the disruptive impact of jammers and ensure reliable communication in dynamic, security-sensitive settings. Our approach also seeks to achieve multi-objective optimization by striking a balance between transmission rate and energy consumption in this highly challenging environment. The formulated optimization is computationally complex due to its combinatorial nature. Hence, we leverage a lightweight Deep Reinforcement Learning (DRL) technique called Deep Deterministic Policy Gradient (DDPG) to jointly optimize the trajectory and IRS phase shifts and achieve multiple objectives. Experimental results demonstrate the effectiveness of our proposed DDPG-based approach in outperforming other RL algorithms.
It achieves a near-optimal solution with only a small gap to the benchmark technique, and improves both the achievable transmission rate and the energy efficiency by 50-70% compared to related works.

Wireless smart communication environments are becoming more flexible and mobile than terrestrial systems, but they face difficulties maintaining line-of-sight communications between mobile devices or Internet of Things (IoT) devices and base stations due to obstacles or congested urban environments. One promising technology for addressing these challenges is Intelligent Reflecting Surfaces (IRS) [1], which manipulate the propagation of a wireless signal to enhance or suppress it through reflection or refraction. The recent development of smart radio systems, which control wireless signal operations, has opened new research directions for applying IRS in beyond-5G and 6G wireless communication networks.
Wireless communications are increasingly becoming more susceptible to adversarial threats at the physical layer through jamming, which degrades the quality of the wireless links. Anti-jamming techniques that use UAV-mounted IRS, also referred to as Aerial IRS, are critical to building a resilient communication service for mobile devices, especially for securing critical wireless communication infrastructure in remote areas or highly congested urban centers. Communication systems face disruption risks in diverse domains due to malicious and non-malicious factors. Various sectors, including military, emergency response, connected vehicles, unmanned aerial vehicles (UAVs), industrial automation, healthcare, maritime, smart grid networks, agriculture, and mining, rely on wireless communication systems for efficient operations. However, the presence of multiple jamming devices poses significant challenges.
Securing wireless communication links during public events in a smart urban environment, critical for providing services such as emergency response, security surveillance, law enforcement, public health, logistics, and crowd management, can be undermined by adversaries using jamming devices to disrupt the command and control system. Similarly, emergency communications during crises can suffer from malicious jamming by saboteurs seeking chaos or strategic gain, or from non-malicious sources such as technical glitches, impeding rescue efforts. In healthcare applications, wireless communication is essential for tasks like telehealth and remote surgery, but it is vulnerable to attacks from hackers and cybercriminals, as well as unintentional disruptions, compromising patient data and disrupting healthcare services. Similarly, smart grid networks, which are vital for electrical power systems, confront threats from cyber attackers or enemy state actors aiming to destabilize infrastructure, while non-malicious interference such as signal congestion, equipment malfunctions, or environmental factors can also lead to power outages and equipment failures.
Multiple approaches have initially utilized fixed IRS deployed on buildings and employed UAVs acting as relays to optimize various communication parameters, such as transmission rate, throughput, age of information, energy efficiency, sum secrecy rate, and SINR, mostly in non-adversarial scenarios [2], [3], [4], [5]. Yet, a few works have addressed scenarios involving jammers, optimizing IRS passive beamforming and ground base station transmit power to enhance system rate and secure transmission [6], and optimizing UAV trajectory and IRS phase shifts to minimize energy consumption [7]. UAVs offer the advantage of establishing better line-of-sight links with ground and mobile devices while being able to adjust their trajectory and location for overall performance improvement. Therefore, recent advancements have deployed IRS on UAVs to optimize wireless communication parameters [8], [9], [10], including considerations of jamming effects in dynamically complex environments [8], [11], [12]. However, some of these solutions do not consider the deployment of the IRS on the UAV [6], [7], while those that do only counter the effects of a single jammer on end-user devices [6], [8], [9], [10]. Moreover, related works target only a single objective, such as minimizing energy, enhancing the transmission rate, or maximizing the Signal-to-Interference-plus-Noise Ratio (SINR), which fails to capture the interdependencies between multiple objectives and hence restrains the optimization of the overall system performance. The presence of multiple jammers introduces additional interference sources and necessitates more sophisticated path-planning algorithms to avoid or mitigate their effects. Therefore, the UAV trajectory needs to be optimized to navigate around the jamming areas, adaptively allocate energy resources, and ensure reliable and efficient communication links.
This paper proposes a novel, comprehensive, and realistic solution for ensuring secure 5G and 6G wireless communications in a smart, critical urban environment during public outdoor events such as sporting events, national parades, festivals, fairs, rallies, and exhibitions at public venues like stadiums, arenas, grounds, and malls. Any malicious interference from jammers could disrupt communication links critical for public services like security, law enforcement, public safety, crowd management, and logistics, making it difficult to respond to emergencies or maintain crowd safety, leading to financial loss and public discontent, and facilitating criminal activities.
Our proposed approach tackles this challenge by deploying a Reinforcement Learning algorithm at a single UAV-mounted IRS platform that jointly optimizes the UAV trajectory and IRS passive beamforming through phase shifts to mitigate the effects of multiple jammers, achieving the maximum transmission rate with minimum energy consumption. It strives to balance transmission rate and energy consumption to ensure robust and efficient wireless communication links. Furthermore, we propose using a Deep Deterministic Policy Gradient (DDPG) algorithm as the RL technique to optimize the wireless communication system parameters.
The proposed solution is novel and unique in two respects compared to existing solutions deploying RL in UAV-mounted IRS for anti-jamming in wireless communications: the complexity introduced by the presence of multiple jammers, and a multi-objective optimization approach.
Among the few related anti-jamming solutions deploying an IRS mounted on a UAV [8], [11], [12], only Zhifeng Hou et al. [8] use an RL-based technique, namely Dueling Double Deep Q-Networks (3-DQN), while the other two works, [11] and [12], use conventional Alternating Optimization (AO) based techniques. We also consider the mobility of the mobile device, adding more complexity to the system, which has only been considered by the authors of [8] and [12].
Another anti-jamming solution [6], which uses an RL-based technique called Win or Learn Fast-Policy Hill-Climbing (WoLF-CPHC) learning, employs a fixed IRS installed on a stationary object, such as a building, rather than a moving platform like a UAV, to improve the system rate and transmission protection level in wireless communications in the presence of a single jammer. Similarly, although the work in [7] is very close to our system model in that it also considers multiple jammers interfering with the wireless communications, it only deploys a UAV transmitter communicating with ground users assisted by a fixed IRS installed on stationary objects such as buildings.
Existing IRS-assisted anti-jamming works either deploy a non-RL-based solution using a fixed IRS with a static device in the presence of multiple jammers [7], or, when they do deploy an RL-based solution using a UAV-mounted IRS with a moving mobile device [8], they rely on the Deep Q-Network (DQN) technique, which is mainly suited to discrete state and action spaces, and consider only a single jammer.
The second unique aspect of our solution is the adoption of a multi-objective optimization approach. All of the existing IRS-assisted anti-jamming solutions have either solved a single objective, such as improving the maximum achievable rate [8], the transmission rate [11], the SINR [12], or minimizing energy [7], or solved each of multiple objectives separately, such as the system rate and the system protection level [6], using a combination of RL and conventional techniques.
However, in our proposed scenario, the system needs to achieve two objectives simultaneously: maximizing the quality of wireless communication for critical public services at public events in the smart urban environment while minimizing energy consumption at the UAV-IRS platform to ensure long-duration operations. These objectives can often conflict; for example, achieving higher transmission rates might cause the UAV to consume more energy. Since multi-objective optimization allows finding a tradeoff that balances these objectives, it ensures that the system can provide the best possible service under varying conditions while conserving energy. Our proposed solution therefore balances these two objectives while prioritizing one over the other based on the system preferences.
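A common way to realize such a tradeoff is to scalarize the two objectives into a single reward through a weighted sum. The following minimal sketch is illustrative only: the function name, the weights, and the normalization bounds are our own assumptions, not values from this paper.

```python
def multi_objective_reward(rate, energy, w_rate=0.7, w_energy=0.3,
                           rate_max=10.0, energy_max=50.0):
    """Weighted-sum scalarization of the two objectives.

    rate and energy are normalized by assumed upper bounds (rate_max,
    energy_max) so the two terms are on a comparable scale; the weights
    encode the system's preference for throughput over energy saving.
    All constants here are illustrative placeholders.
    """
    rate_term = min(rate / rate_max, 1.0)        # to be maximized
    energy_term = min(energy / energy_max, 1.0)  # to be minimized
    return w_rate * rate_term - w_energy * energy_term

# At equal energy consumption, a higher transmission rate yields a higher reward.
r_low = multi_objective_reward(rate=2.0, energy=20.0)
r_high = multi_objective_reward(rate=8.0, energy=20.0)
```

Shifting the weights moves the operating point along the rate-energy tradeoff, matching the idea of prioritizing one objective over the other based on system preferences.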
Our main contributions are described as follows:
• We propose an anti-jamming technique for smart critical wireless communications infrastructure, such as stadiums, airports, and sporting venues, using an IRS-assisted UAV, also referred to as an aerial IRS. Our work considers one mobile device surrounded by multiple jammers and aims to achieve the maximum achievable transmission rate in an energy-efficient fashion. The jammers are non-collaborating and independently transmit malicious signals, and the system model does not have full knowledge about the jammers.
• We present a problem formulation that involves the optimization of the trajectory of an IRS-assisted UAV and the IRS phase shift (passive beamforming). This optimization aims to maximize both the achievable transmission rate and the energy efficiency of wireless communications between a base station and a mobile device moving within a defined cellular region, while taking into account multiple jammers that interfere with the receiver of the mobile device.
• The formulated optimization problem is computationally complex; therefore, we introduce a Reinforcement Learning (RL) algorithm that seeks to find the optimal trajectory of the IRS-assisted UAV and the phase-shift angles of the IRS. The RL technique we implement is the efficient Deep Deterministic Policy Gradient (DDPG) algorithm.
• To demonstrate the superior performance of our DDPG-based solution, we conduct extensive simulations of our system under different configurations. Specifically, our evaluations compare the proposed aerial IRS system model with the optimal solution and a few baselines, with a mobile device in the presence of different numbers of jammers, IRS phase-shift elements, and jammer distribution areas.

The rest of the paper is organized as follows. Section II gives a brief overview of applications of RL techniques in countering the threats of eavesdropping and jamming in wireless communication systems, as well as recent works that deploy IRS-assisted UAV communications to improve the efficiency of communications in adversarial environments. Section III presents the system model and problem formulation of the proposed framework, along with the details and structure of the RL-based solution. Section IV presents the proposed implementation, including the experimental setup, the simulation process, and the results. Section V summarizes and concludes the paper.

II. RELATED WORKS
The problem of countering the effects of jammers has been the focus of research in wireless communications networks for some years. Conventional methods to counter jamming, such as spread spectrum, adaptive power/rate control, and cognitive radio, have shown their efficacy in diminishing jamming threats [13]. Yet, their ability to withstand the increasing complexities of 5G and 6G networks and the highly adaptive nature of jamming attacks remains constrained. The classical spread spectrum technique is not suitable due to the stringent spectral efficiency requirements of 5G and beyond cellular networks. A more recent technique, MIMO-based jamming mitigation, is considered quite effective in countering the effects of jammers at the base station or mobile device receiver, but it requires channel information about the jammers [14].
In order to address these limitations of conventional techniques, Reinforcement Learning (RL) offers promising solutions [15]. By providing flexible and intelligent anti-jamming capabilities, RL-based approaches can adeptly adjust to ever-changing attack situations and surpass the restrictions of conventional techniques. Consequently, RL has been considered by only a few IRS-based anti-jamming solutions in their system models.
Various solutions have been proposed to counteract the adversarial effects of jamming and eavesdropping on wireless communication networks. Liu et al. [16] developed a CNN-based anti-jamming network called Heterogeneous Information Fusion Deep Reinforcement Learning (HIF-DRL) for frequency channel selection in a high-frequency communication environment with interference and jamming. The authors in [17] proposed an Anti-jamming Deep Reinforcement Learning Algorithm (ADRLA) and a Recursive CNN (RCNN) for anti-jamming communication with multiple jammers. Xiao et al. [18] proposed a power allocation control scheme for MIMO Non-Orthogonal Multiple Access (NOMA) systems using hot-booting Q-learning, and Xiao et al. [19] presented a 2-D frequency-space anti-jamming mobile communication scheme for mobile device or sensing robot communication with multiple jammers.
The problem of wireless communications jamming is faced not only by terrestrial communication systems but also by new vehicular communication systems, such as the UAV-aided VANET proposed by Xiao et al. [20]. Similarly, Li et al. [21] proposed a Deep Reinforcement Learning (DRL) based anti-jamming technique to counteract an RL-enabled intelligent jammer operating in four different modes: sweeping, comb, dynamic, and statistic. Yao et al. [22] proposed a Q-learning method called Collaborative Multi-agent Anti-jamming Algorithm (CMAA) to solve the multi-user scenario with coordination among cognitive users. Peng et al. [23] managed a swarm-of-UAVs communication system jammed by adversarial fixed and UAV jammers, using a Multi-Dimensional Anti-Jamming Reinforcement Learning (MDAJRL) algorithm. Slimeni et al. [24] proposed an On-Policy Synchronous Q-learning (OPSQ-learning) technique based on a real-time reinforcement learning algorithm to help proactively avoid jammed channels. In the scenario presented in [25], blue-force nodes of a wireless network are uniformly deployed over a disk, while red-force nodes seek to attack the communication of the blue-force nodes. The authors develop and implement a QoS-aware routing protocol based on the Actor-Critic Deep Q-Network (DQN) algorithm that allows nodes to avoid communication holes created by jamming attacks. Ye et al. [26] proposed a Prioritized Experience Replay DDQN (PDDQN) based solution for anti-jamming in wireless communications networks.
In [11], the optimization technique involves an Alternating Optimization (AO) based approach that uses Successive Convex Approximation (SCA) for UAV location optimization and Manifold Optimization (MO) for IRS passive beamforming, in a system model consisting of a UAV-mounted IRS acting as a relay between ground base stations and stationary ground users (GUs) in the presence of a single jammer. On the other hand, [7] presents a wireless scenario with multiple jammers where energy consumption is minimized through joint optimization of the UAV trajectory and IRS phase-shift beamforming using an Alternating Optimization (AO) algorithm with Semi-Definite Relaxation (SDR) and Successive Convex Approximation (SCA).
In [12], a system is proposed where multiple UAV base stations communicate with mobile devices in the presence of a single jammer, and multiple UAV-mounted IRS relays mitigate the jamming effect. The solution employs an AO-based Relax-and-Retract algorithm to maximize the Signal-to-Interference-plus-Noise Ratio (SINR) of the mobile devices under power and phase constraints at the base stations and the UAV-mounted IRS devices.
Intelligent Reflecting Surface (IRS) is a promising new technology for reconfiguring the wireless propagation environment via software-controlled reflection. It is a planar surface containing numerous passive reflecting elements capable of inducing amplitude and phase changes independently to modify wireless propagation. Unlike traditional wireless link adaptation techniques, IRS proactively modifies the wireless channel to enhance communication performance and provides a new degree of freedom for a smart and programmable wireless environment. IRS technology has the potential for low energy consumption and scalable deployment [1].
Reinforcement Learning (RL) is a powerful tool used in complex and dynamic control scenarios such as robotics and gaming. Recently, it has been applied to improve the performance of IRS-assisted wireless communications. RL is modeled as a Markov Decision Process (MDP), with an actor taking actions based on the environment state to maximize the expected reward. The actor updates its policy based on the positive or negative rewards received from the environment. RL algorithms can adapt to complex environments in real time through online training on real-time data.
RL techniques have also been applied to optimization problems in IRS-assisted aerial wireless communications. While many solutions have been proposed using reinforcement learning, most have ignored eavesdropping and jamming threats. For instance, in [27], the authors optimized IRS-assisted UAV wireless communication systems using Deep Reinforcement Learning-based Proximal Policy Optimization (PPO) and Block Coordinate Descent (BCD) algorithms to optimize the UAV base station trajectory, transmit power, IRS phase shift, and user association. Another work [2] used a Decaying Deep Q-Network (DQN) to optimize UAV trajectory, transmit power, and IRS phase shift to minimize the energy or power consumption of the UAV in similar scenarios of IRS-assisted UAV communication. Additionally, in [4], an IRS installed on a building supports NOMA communications between UAVs, Ground Users (GUs), and Ground Base Stations, using a robust DRL-based off-policy Actor-Critic algorithm to optimize the UAV trajectory, transmit power, IRS phase shift, and transmit power of the ground users.
Another work, presented by J. Yu et al. in [28], incorporates a fixed IRS in a NOMA context through a Lyapunov-function-based RL algorithm called Mixed Integer Deep Deterministic Policy Gradient (LMIDDPG) to achieve superior energy-efficiency performance of a mobile edge computing (MEC) system while maintaining queue stability. Finally, in [5], a UAV base station communicating with a mobile device via an IRS mounted on a building uses two approaches, Deep Q-Network (DQN) and Deep Deterministic Policy Gradient (DDPG), to optimize energy efficiency and data rate.
In [3], Twin Deep Deterministic Policy Gradient (DDPG) is used to jointly optimize the trajectory and active beamforming of the UAV and the IRS reflection parameters in the presence of multiple eavesdroppers, maximizing the sum secrecy rate of all legitimate users in a millimeter-wave communication system. In a similar scenario with a single jammer, [6] proposes a fast reinforcement learning approach called the Win or Learn Fast-Policy Hill-Climbing (WoLF-CPHC) algorithm to optimize the IRS phase shifts and ground base station transmit power to improve the system rate and transmission protection level.
Recent works have also explored the use of UAV-mounted IRS in wireless communication systems to enhance mobility, especially in scenarios where the user device is mobile in the presence of a single jammer [8], [12]. Dynamic deployment of the IRS on a UAV, instead of a fixed-location IRS, has been found to offer better anti-jamming performance [6]. One such solution [29] uses the Deep Deterministic Policy Gradient (DDPG) algorithm to jointly optimize the trajectory and transmit power of a UAV-mounted IRS to maximize total throughput in a UAV-powered IoT network. Another work [9] uses the Proximal Policy Optimization (PPO) algorithm to minimize the expected sum Age of Information (AoI) by optimizing UAV altitude, IRS phase shift, and scheduling. The Dueling Deep Q-Network (DDQN) algorithm is used in [10] to optimize UAV trajectory control, velocity, sub-carrier allocation, and IRS phase shift to minimize the total transmit power in UAV-assisted IRS HetNets. However, these solutions do not consider jamming threats in their optimization problems. Only one recent work [8] has addressed the issue of a single jammer by using the Dueling Double Deep Q-Network (3-DQN) algorithm to optimize UAV trajectory and IRS passive beamforming for maximizing the achievable communication rate at the mobile device. However, this solution considers neither the scenario of multiple jammers [7] nor other optimization variables such as UAV transmit power, active beamforming, and mobile device transmit power.
The work by H. Zhao et al. presented in [7] appears to solve a similar anti-jamming problem arising from multiple jammers interfering with the wireless communications. However, its approach is quite different from our proposed solution: it only deploys an IRS installed on a stationary object, such as a building, to assist the UAV transmitter in communicating effectively with ground users by mitigating the effects of jammers. It does not allow freedom of movement to the IRS, and it optimizes only the single objective of maximizing energy efficiency. The technique used is also a conventional Alternating Optimization (AO) approach that optimizes the trajectory of the UAV transmitter and the phase shifts of the IRS separately and alternately. Our proposed solution is unique in that it not only mounts the IRS on the UAV but also employs the RL-based DDPG technique to jointly optimize the trajectory and IRS phase shifts simultaneously, achieving the multiple objectives of maximizing the transmission rate and minimizing energy consumption.
Another two works, [11] and [12], use conventional Alternating Optimization (AO) based techniques. We also consider the mobility of the mobile device, adding to the complexity of the environment, which has only been considered by the authors of [8] and [12].
Table 1 provides a brief overview of the contributions made by both the most relevant previous works in the field and our own research.

III. AERIAL-IRS ASSISTED WIRELESS COMMUNICATIONS NETWORKS IN PRESENCE OF MULTIPLE JAMMERS
We consider a Multiple-Input Single-Output (MISO) wireless communication system, as shown in Figure 1, that provides critical services at public events in a smart urban environment, where mobile devices D carried by personnel belonging to security, law enforcement, logistics, emergency response, health safety, crowd management, and the citizenry are connected through a 5G-Advanced or 6G base station B, enabling effective communications and data services.

A. SYSTEM MODEL
We establish a rectangular coordinate system for the smart urban environment with the center of base station B as the origin and with maximum limits defined for the axes, which determine the constraints on the location of the Aerial-IRS R. The communication between the mobile device D and the base station B is threatened by adversarial interference from multiple malicious jammers J_k emitting omnidirectional signals that aim to break down critical communication links, resulting in chaos or criminal activity. To counter the effect of the jammers, a UAV mounted with an IRS, or Aerial-IRS R, is deployed over the region and flies in the air to assist in anti-jamming operations.
Moreover, the mobile device is free to move within the specified boundaries. We assume that a UAV control unit commands the UAV's location and altitude and the phase shifts of the reflecting elements of the IRS to serve the mobile device D and maintain the Quality of Service (QoS). Therefore, the UAV-mounted IRS-assisted anti-jamming communication system operates in a dynamic unknown environment. A dynamic unknown environment refers to a situation where the conditions and variables of a given environment are constantly changing and uncertain; in other words, it is an environment where the agent or observer does not have complete information about the state of the environment or how it will evolve.
The base station B, the mobile device D, and the jammers J_k are placed on the ground in the two-dimensional horizontal plane within the specified area.
The UAV-equipped IRS flies in the air within the given duration T, which is further divided into N time slots of duration δ_t, i.e., T = Nδ_t. When N is large, δ_t is small enough to assume that the UAV location and the IRS phase shifts do not change within a time slot.
We assume that all the communication channels follow frequency-selective fading models. The ground-based communications are assumed to follow Rayleigh fading, dominated by non-Line-of-Sight components due to multi-path scattering from environmental obstacles like trees and buildings, while aerial communications follow Rician fading, which features a dominant Line-of-Sight component (Rayleigh fading being the special case of Rician fading with no LoS component) [30]. The IRS can adjust the phase and amplitude of each beam reflected by its unit cells, adapting to the dynamic channel. This beamforming approach can significantly improve the anti-jamming performance at the mobile device.

1) DISTANCE AND CHANNEL MODEL
First, we define the distance model. We define d_BD, d_BR, and d_RD as the distance between base station B and mobile device D, between base station B and Aerial-IRS R, and between Aerial-IRS R and mobile device D, respectively. Similarly, we define d_JkD and d_JkR, ∀k ∈ [1, K], as the distances from the k-th jammer J_k to the mobile device D and to the Aerial-IRS R. Denoting the coordinates of B, D, R, and J_k by ω_B, ω_D[n], ω_R[n], and ω_Jk, each distance is the Euclidean norm of the corresponding displacement, e.g., d_BD[n] = ∥ω_B − ω_D[n]∥ and d_RD[n] = ∥ω_R[n] − ω_D[n]∥.

Following [9], each channel gain combines a large-scale path-loss coefficient with a small-scale fading term:

h_BD[n] = √(L_0 d_BD^−η[n]) ĥ_BD[n],
h_BR[n] = √(L_0 d_BR^−α[n]) ĥ_BR[n],
h_RD[n] = √(L_0 d_RD^−α[n]) ĥ_RD[n],
h_JkD[n] = √(L_0 d_JkD^−η[n]) ĥ_JkD[n],
h_JkR[n] = √(L_0 d_JkR^−α[n]) ĥ_JkR[n].

We assume there are only Line-of-Sight components between base station B and Aerial-IRS R, and between Aerial-IRS R and mobile device D. Therefore, the aerial channels between the Aerial-IRS and the mobile device, and between the Aerial-IRS and the base station, follow Rician fading with a dominant Line-of-Sight (LoS) component and no non-line-of-sight components due to the multi-path effect.
We also assume that there are no Line-of-Sight (LoS) components between base station B and mobile device D, or between the K jammers J_k and mobile device D, so these channels follow Rayleigh fading. The terms √(L_0 d^−η) and √(L_0 d^−α) represent the path-loss coefficients [9]. In them, L_0 is defined as the average channel power gain at the reference distance d_0 = 1 m, α is the path-loss exponent for air-to-ground wireless transmission, and η is the path-loss exponent for ground wireless transmission.
The small-scale fading vectors (also called the Rician-fading or LoS-component vectors) ĥ_BR[n], ĥ_RD[n], and ĥ_JkR[n] are deterministic array-response (steering) vectors over the M reflecting elements. For a uniform array with element spacing d and signal wavelength λ, a generic entry takes the form e^(−j2π(m−1)(d/λ)cos φ[n]), m ∈ [1, M], where φ[n] is the angle of arrival (or departure) at the IRS in time slot n. The small-scale fading (Rayleigh, non-LoS) components ĥ_BD[n], between base station B and mobile device D, and ĥ_JkD[n], between jammers J_k and mobile device D, are drawn as ĥ_BD[n], ĥ_JkD[n] ~ CN(0, 1), where CN(0, 1) denotes the circularly symmetric complex normal distribution with mean 0 and variance 1. Substituting these small-scale terms into the path-loss expressions above yields the complete channel gain values.
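The large-scale/small-scale decomposition described above can be sketched numerically as follows. The parameter values used here (L0, the path-loss exponents, the element spacing, and the angle cosine) are illustrative placeholders, not the paper's simulation settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def ground_channel(d, L0=1e-3, eta=3.5):
    """Rayleigh-faded ground link: sqrt(L0 * d^-eta) times a CN(0,1) draw."""
    nlos = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)
    return np.sqrt(L0 * d ** -eta) * nlos

def aerial_channel(d, n_elems, L0=1e-3, alpha=2.2,
                   spacing_over_lambda=0.5, cos_aoa=0.3):
    """LoS-dominant aerial link: sqrt(L0 * d^-alpha) times a deterministic
    array-response (steering) vector over the IRS reflecting elements."""
    m = np.arange(n_elems)
    steering = np.exp(-1j * 2 * np.pi * spacing_over_lambda * m * cos_aoa)
    return np.sqrt(L0 * d ** -alpha) * steering

h_bd = ground_channel(d=120.0)              # scalar B-D channel gain
h_br = aerial_channel(d=80.0, n_elems=16)   # length-16 B-R channel vector
```

Note that every entry of the aerial vector has the same magnitude (pure-LoS assumption), while the ground gain is random across realizations.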

2) ACHIEVABLE TRANSMISSION RATE
Given the channel model specified above, the instantaneous signal-to-interference-plus-noise ratio (SINR) of the wireless communication is given as

γ[n] = P_B |h_BD[n] + h_RD^H[n] Θ[n] h_BR[n]|² / ( Σ_{k=1}^{K} P_Jk |h_JkD[n] + h_RD^H[n] Θ[n] h_JkR[n]|² + σ² ),

where P_B and P_Jk are the transmit powers of the base station and the k-th jammer, respectively, Θ[n] is the diagonal IRS reflection-coefficient matrix, and σ² is the background noise power. Accordingly, the achievable rate at the mobile device within time slot n is given as R[n] = B log₂(1 + γ[n]). If we assume unit bandwidth B = 1, the achievable rate becomes R[n] = log₂(1 + γ[n]).
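The effective end-to-end gain (direct path plus the IRS-reflected cascade) and the resulting SINR and rate can be computed as in the following sketch; the function names and sample values are our own illustrative choices.

```python
import numpy as np

def cascaded_gain(h_direct, h_rd, Theta, h_br):
    """Effective gain of the direct path plus the reflected path:
    h_direct + h_rd^H * Theta * h_br."""
    return h_direct + h_rd.conj() @ Theta @ h_br

def sinr_and_rate(P_B, h_sig, jam_powers, h_jam, sigma2, bandwidth=1.0):
    """Instantaneous SINR gamma and achievable rate B * log2(1 + gamma)."""
    signal = P_B * abs(h_sig) ** 2
    interference = sum(p * abs(h) ** 2 for p, h in zip(jam_powers, h_jam))
    gamma = signal / (interference + sigma2)
    return gamma, bandwidth * np.log2(1.0 + gamma)

# Toy usage: a 2-element IRS with zero phase shifts and one jammer.
Theta = np.eye(2, dtype=complex)
h_eff = cascaded_gain(0.1 + 0j, np.ones(2, dtype=complex), Theta,
                      np.ones(2, dtype=complex))
gamma, rate = sinr_and_rate(P_B=1.0, h_sig=h_eff, jam_powers=[0.5],
                            h_jam=[0.2 + 0j], sigma2=1e-3)
```

Adding jammer terms to the denominator lowers the rate, which is exactly the effect the trajectory and phase-shift optimization tries to counteract.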

3) ENERGY CONSUMPTION MODEL FOR AERIAL-IRS
The total energy consumption E_R of the Aerial-IRS consists of two components. The first is the communication energy at the IRS, E_IRS, which is the energy dissipated for reflective beamforming through phase shifts, while the second is the propulsion energy consumed by the UAV for flying, E_UAV.
The IRS communication energy consumption is given as

E_IRS[n] = δ_t Σ_{m=1}^{M} S_{m,2} P_IRS,

where M is the total number of reflecting elements, P_IRS is the power consumed by a single IRS element, and S_{m,2} is the magnitude of the reflection coefficient of the m-th reflecting element. The propulsion energy consumed by the rotary-wing UAV is given as

E_UAV[n] = δ_t [ P_0 (1 + 3v²[n]/U_tip²) + P_1 ( √(1 + v⁴[n]/(4v_0⁴)) − v²[n]/(2v_0²) )^(1/2) + (1/2) c_0 v³[n] ],    (17)

where P_0 and P_1 are the blade-profile power and induced power of the UAV in the hovering state, respectively, c_0 is a constant parasite-power parameter related to aerodynamics, δ_t is the duration of time slot n, U_tip is the tip speed of the rotor blade, and v_0 is the mean rotor-induced velocity. The value of v[n] is calculated as

v[n] = √(dx_R²[n] + dy_R²[n] + dz_R²[n]) / δ_t,

where dx_R[n], dy_R[n], and dz_R[n] are the displacements of the UAV in three dimensions generated by the RL agent at every time step n of exploration. The speed v[n] of the UAV is used to calculate the propulsion energy consumption in equation (17) as the UAV wanders in the cellular region to find the optimal location that maximizes the transmission rate. The IRS phase shifts and the UAV location are maneuvered at each step based on the transmission rate and energy consumption. Since there are no time constraints on the movement of the UAV, it does not need to rush closer to the mobile device to mitigate the influence of the jammers; rather, it keeps flying in the area around the optimal location at some distance from the mobile device.
Since the IRS energy consumption E_IRS is very small compared to the propulsion energy E_UAV, we ignore this component. Therefore, the total energy consumed by the Aerial-IRS is taken to be the propulsion energy consumed by the UAV. We also clarify that the computational energy of PPO or DDPG is ignored as insignificant: it is determined by the DNN model used by the RL algorithm, which in both DDPG and PPO is a low-complexity model, so it places a negligible burden on energy consumption.

4) IRS PASSIVE BEAMFORMING AT AERIAL-IRS
The reflection coefficient matrix of the IRS is composed of θ_m for all reflecting units, so the phase shift (reflection coefficient) matrix is a diagonal matrix presented as $\Theta[n] = \mathrm{diag}\{e^{j\theta_1^n}, e^{j\theta_2^n}, \ldots, e^{j\theta_M^n}\}$. According to [31], the value of θ_m can be easily obtained from the following equation by determining the beamforming direction.
where d_x and d_y are the length and width of each unit cell of the IRS. They are of sub-wavelength scale, within the range of λ/10 to λ/2, where λ is the wavelength of the signal. θ_t is the angle between the incident signal and the x axis, ϕ_t is the angle between the incident signal and the z axis, θ_r is the angle between the reflected signal and the x axis, and ϕ_r is the angle between the reflected signal and the z axis, as shown in Figure 2.
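A per-element phase computation from the four beamforming angles can be sketched as follows. This follows the standard gradient-phase (generalized reflection) rule, where each element's phase compensates the path-length difference between the incident and desired reflected plane waves; the exact indexing and sign convention of the paper's equation from [31] is an assumption here.

```python
import math

def element_phase(mx, my, dx, dy, lam, theta_t, phi_t, theta_r, phi_r):
    """Phase shift theta_m for the IRS element in row mx, column my.

    theta_* are azimuth angles from the x axis, phi_* are angles from the
    z axis, matching the paper's convention. Returns a phase in [0, 2*pi).
    """
    # Direction-cosine sums of the incident and reflected wave vectors.
    ux = math.sin(phi_t) * math.cos(theta_t) + math.sin(phi_r) * math.cos(theta_r)
    uy = math.sin(phi_t) * math.sin(theta_t) + math.sin(phi_r) * math.sin(theta_r)
    return (2.0 * math.pi / lam) * (mx * dx * ux + my * dy * uy) % (2.0 * math.pi)
```

For normal incidence and normal reflection (ϕ_t = ϕ_r = 0) every element phase is zero, i.e., the IRS acts as a plain mirror.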

B. PROBLEM FORMULATION
The main objective of the problem is to simultaneously achieve multiple objectives, specifically the maximization of the achievable transmission rate at the mobile device and of the energy efficiency (EE) at the UAV. This is achieved through a joint optimization of the trajectory of the UAV-IRS and the phase shift-based beamforming of the IRS reflectors. These optimizations are performed within the context of constraints associated with the UAV trajectory and phase shifts, while also considering the presence of multiple jammers.
In the system model outlined earlier, our objective is to enhance the legitimate transmission rate while concurrently minimizing the energy dissipation of the Aerial-IRS with the aid of the UAV-mounted IRS. This entails positioning the Aerial-IRS at a strategically chosen aerial location and controlling passive beamforming through the manipulation of phase shifts in the reflecting elements to exert influence over the end-to-end channel. The problem is therefore formulated as a joint optimization over the UAV trajectory ω_R and the IRS phase-shift matrix Θ. The dynamic nature of the movements of the UAV and the mobile device D presents significant challenges for traditional optimization algorithms, as they struggle to adapt to the constantly changing factors within the environment. The constraints (22b)-(22f) are linear and simple with respect to either ω_R or Θ. Moreover, since the joint trajectory design of the UAV and the phase shift control of the IRS are to be optimized, the search space expands as the number of parameters increases, which also makes conventional gradient-based optimization techniques unsuitable.

C. REINFORCEMENT LEARNING FOR AERIAL-IRS TRAJECTORY OPTIMIZATION AND PASSIVE BEAMFORMING THROUGH IRS PHASE SHIFTS
We observe that our problem is a mixed-integer non-convex multi-objective optimization, which is hard to solve because it contains the continuous variables ω_R and Θ. Heuristic algorithms such as Greedy, Genetic, Hill Climbing, Simulated Annealing, Particle Swarm Optimization (PSO), and Ant Colony Optimization (ACO) work well for specific problems with pre-defined rules but cannot adapt to rapidly changing environmental conditions. In complex scenarios like UAV-IRS communication with a mobile device in the presence of multiple jammers, their fixed rules struggle with evolving interference patterns, risk being stuck in local optima, and lack generalization compared to Reinforcement Learning (RL). RL, on the other hand, learns from feedback and adapts autonomously, which makes it the preferred choice for dynamic, nonlinear optimization problems.
In addition, it is challenging to solve a non-convex optimization problem when complete information on the power and location of the multiple jammers J_k is unavailable and the trajectory of the mobile device D is unpredictable. Hence, we introduce an RL system model to control the UAV trajectory and the passive beamforming through IRS phase shifts. In RL, the algorithm learns the skills required to solve the problem instead of solving a specific instance of it. This allows the trained policy to quickly provide a solution to an unseen problem without having to start from scratch.
The problem is reformulated as a Markov Decision Process (MDP), as described in Section III-C3, and a model-free DRL method based on the Deep Deterministic Policy Gradient (DDPG) is exploited to find an effective control policy that maximizes the achievable transmission rate at the mobile device and the energy efficiency at the UAV. The proposed DDPG algorithm does not rely on prior knowledge of the Aerial-IRS location or the IRS phase shift coefficients.
Compared to other Deep Reinforcement Learning (DRL) methods such as Deep Q-Network (DQN) [32], DDPG is a policy-based method that achieves better performance for continuous-control stochastic problems [33]. The DQN algorithm is a value-based method that works best in scenarios with small discrete action spaces, while our problem has a multi-parameter continuous action space for which value-based methods do not work. Secondly, DDPG is more sample-efficient than DQN, meaning that it requires fewer training samples to converge to an optimal policy. This is particularly important in scenarios where data collection is expensive or time-consuming.
Among policy-based algorithms, our choice could have been Proximal Policy Optimization (PPO) [34], the latest variant of the Actor-Critic method. Although both DDPG and PPO are suited to complex continuous-control problems, PPO is applicable to both discrete and continuous control, while DDPG works only for continuous control. However, based on comparative evaluation, we observe that DDPG gives better performance and faster convergence than PPO in scenarios with high-dimensional action spaces and continuous control, and DDPG is found to converge to the optimal policy more quickly than PPO in complex environments.
Another variant of the Actor-Critic method is Trust Region Policy Optimization (TRPO), first introduced by [35]. We prefer DDPG over TRPO because DDPG is computationally less expensive, making it a better choice for problems with a large state or action space or where data collection is expensive. DDPG is generally more sample-efficient than TRPO and has been shown to converge faster in some cases, making it a better choice for problems where the agent needs to learn quickly [36].

1) DEEP DETERMINISTIC POLICY GRADIENT (DDPG)
DDPG has the ability to relax complex constraints and replace them with flexible ones. It consists of two DNNs: an actor network with the function a = π(s|δ^π) and network parameters δ^π, and a critic network with the function Q(s, a|δ^Q) and network parameters δ^Q. The π(·) function maps states to actions, while Q(·) is an approximator that generates the Q-value of a given state-action pair. The actor network outputs the action a_t for a given state s_t, while the critic network takes the action a_t from the actor network and produces a value to evaluate it. Similar to DQN, DDPG also uses target networks for the actor and critic networks, i.e., π'(·) and Q'(·), respectively. The transitions generated by the actor network are stored in the experience replay memory, and when the learning process starts, M transitions are sampled to train the actor and critic networks. The experience replay buffer breaks the correlation between transitions and ensures efficient learning. The experience transition of the agent at time t is given by d_t = (s_t, a_t, r_t, s_{t+1}).
The Q-values Q(s, a) generated by the critic network are used to calculate the policy gradient, which can be presented as

$\nabla_{\delta^\pi} J \approx \frac{1}{M} \sum_{m} \nabla_a Q(s, a|\delta^Q)\big|_{s=s_m,\, a=\pi(s_m)}\, \nabla_{\delta^\pi} \pi(s|\delta^\pi)\big|_{s=s_m}.$

After the policy gradient is calculated, the actor network parameters δ^π are updated, and the critic network is trained with the following loss function:

$L(\delta^Q) = \frac{1}{M} \sum_{m} \left( y_m - Q(s_m, a_m|\delta^Q) \right)^2, \qquad y_m = r_m + \gamma\, Q'\!\left(s_{m+1}, \pi'(s_{m+1}|\delta^{\pi'})\,\middle|\,\delta^{Q'}\right),$

where m is the index of a transition in the mini-batch sampled from the experience replay memory. The pseudo-code of the DDPG-based algorithm for optimizing the Aerial-IRS trajectory and phase shifting is given in Algorithm 1. It is deployed and executed on board the UAV-IRS platform, as this is the most suitable place from which to communicate with the mobile device and collect the environmental and channel parameters that form the current and next state of the MDP, such as the UAV location, the mobile device location, and the SINR value. Moreover, it allows the RL agent to promptly take actions comprising the UAV displacements in three dimensions and the IRS phase shift angles. Deploying the RL agent at the base station would instead require continuous communication between the base station and the UAV to apply phase shifting and monitor the trajectory.
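The scalar arithmetic of the critic update can be sketched without any deep-learning framework. This is a hedged sketch of the TD target, the mini-batch critic loss, and the soft target-network update; the soft-update coefficient tau is an assumption (the paper does not state it), and real actor/critic parameters would be network weights rather than plain floats.

```python
GAMMA, TAU = 0.9, 0.005  # discount factor from Section IV-A; tau assumed

def td_target(r, q_next, gamma=GAMMA):
    """y_m = r_m + gamma * Q'(s_{m+1}, pi'(s_{m+1})) used in the critic loss."""
    return r + gamma * q_next

def critic_loss(targets, q_values):
    """Mean squared error over a mini-batch of M sampled transitions."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / len(targets)

def soft_update(target_params, online_params, tau=TAU):
    """Slowly track the online networks: theta' <- tau*theta + (1-tau)*theta'."""
    return [tau * p + (1.0 - tau) * tp
            for tp, p in zip(target_params, online_params)]
```

The soft update is what distinguishes DDPG's target networks from DQN's periodic hard copy: the targets drift toward the online weights a little on every learning step, which stabilizes the bootstrapped TD target.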

2) MULTI-OBJECTIVE OPTIMIZATION PROBLEM
Our problem has two objectives that need to be optimized simultaneously, i.e., maximizing the achievable transmission rate R and minimizing the energy consumption of the Aerial-IRS, E_R. Therefore, it is a Multi-Objective Optimization Problem.
[Algorithm 1 steps (fragments): select action; normalize the values of R and E_R; observe reward; store the transition in the experience replay memory; update the actor and critic networks; update the two target networks.]

Since the Deep Deterministic Policy Gradient (DDPG) algorithm is usually applied to single-objective optimization, we solve our problem using the scalarization technique, which is commonly part of Pareto optimization. Scalarization expresses the multiple objectives as a linear combination of weighted objective values. The aim is to find Pareto-optimal solutions that are non-dominated; a non-dominated solution offers the best trade-off between the objectives, in the sense that improving one objective beyond it can only be achieved at the cost of the other. Our proposed solution is thus a multi-objective optimization that maximizes the achievable transmission rate and minimizes the energy consumption of the Aerial-IRS, using DDPG to find a non-dominated solution.
In the next section, we formulate the multi-objective optimization as a Markov Decision Process (MDP).

3) GENERAL MDP FORMULATION
In this section, the state space, action space, and reward function design of the proposed DDPG algorithm are specified to accommodate the continuous decision space of the algorithm. Rewards are generated by calculating the achievable transmission rate R[n] at the mobile device D and the energy consumption E_R of the Aerial-IRS, R. The DDPG-based RL model is shown in Figure 3.
We assume that the locations of jammers are unknown to the Aerial-IRS controller agent and the movement of mobile device D follows a Gaussian pattern, leading to a model-free DRL problem.
We first define state S, action A and reward R.
The action a ∈ A is composed of the Aerial-IRS trajectory, defined by the aerial location of the UAV ω_R, and the IRS phase shift θ_m. The location of the Aerial-IRS ω_R is continuous, while the IRS phase shift parameters θ_m are usually discrete. However, when the number of IRS reflecting units M is large, the action space of the IRS phase shift values θ_m is also large.
To handle this problem, the actions select the beamforming directions of the IRS, θ_t, ϕ_t, θ_r, and ϕ_r, which denote the angles between the incident signal and the x axis, the incident signal and the z axis, the reflected signal and the x axis, and the reflected signal and the z axis, respectively, as shown in Figure 2. According to [31], the corresponding value θ_m can be easily obtained from the following equation by determining the beamforming direction, where the value for element m is obtained from its row and column indices.
where d_x and d_y are the length and width of each unit cell of the IRS. They are of sub-wavelength scale, within the range of λ/10 to λ/2, where λ is the wavelength of the signal. Therefore, the action space consists of two sub-actions for each step of an episode: the displacements of the Aerial-IRS D_x, D_y, and D_z along the three axes, and the four IRS phase shift parameters θ_t, ϕ_t, θ_r, and ϕ_r. The design of an episode is shown in Figure 4, which shows the iterative transition from one state to the next by taking an action. The first sub-action is the Aerial-IRS trajectory a_U and the second is the IRS phase shift a_R at time slot n; therefore a[n] = {a_U^n, a_R^n}. Since the first part, i.e., the Aerial-IRS trajectory, is continuous, it is decomposed into three sub-actions, D_x, D_y, and D_z, the UAV displacements along the x, y, and z axes, respectively, based on a rectangular coordinate system centered at the Base Station B.
3) Reward
The reward r ∈ R gives the evaluation of the actions. It is a function of the achievable rate at the mobile device D and the Aerial-IRS energy consumption E_R. Since the reward r is composed of multiple objectives, i.e., the achievable transmission rate R and the energy consumption of the Aerial-IRS E_R, this is a case of Multi-Objective Reinforcement Learning. This requires scalarization, in which the objectives are normalized and expressed as an additive reward function, i.e., the weighted sum of the normalized objective values. A similar problem has been tackled in [37], which guides the reward shaping in our problem. The scalarized reward function is

r = w_R · R_norm + w_E · E_norm,

where w_R and w_E are the weights assigned to each normalized objective value such that w_R + w_E = 1. Replacing the value of the weight w_E in Equation (28a), we get

r = w_R · R_norm + (1 − w_R) · E_norm.
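The scalarized reward can be sketched as a small function. The min-max normalization bounds and the inversion of the energy term (so that lower energy yields a larger reward) are assumptions about how R_norm and E_norm are defined; the weighted-sum form itself follows the equations above.

```python
def normalize(x, lo, hi):
    """Min-max normalize x into [0, 1]; the lo/hi bounds are assumptions."""
    return (x - lo) / (hi - lo)

def scalarized_reward(rate, energy, w_r, rate_bounds, energy_bounds):
    """r = w_R * R_norm + (1 - w_R) * E_norm  (weighted sum, w_R + w_E = 1).

    The energy term is inverted here so that LOWER energy consumption
    produces a HIGHER reward -- an assumption about E_norm's definition.
    """
    r_norm = normalize(rate, *rate_bounds)
    e_norm = 1.0 - normalize(energy, *energy_bounds)
    return w_r * r_norm + (1.0 - w_r) * e_norm
```

Sweeping w_r from 0 to 1 and recording the converged (rate, energy) pairs traces the trade-off curve from which the non-dominated weight is picked.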
Considering Equation (30b), we use an iterative method to set the value of the weight w_R such that the reward function gives the best trade-off between transmission rate and energy consumption. This yields a non-dominated solution that jointly maximizes the transmission rate and minimizes the energy consumption. The reward function with the optimal weight w_{R,optim} is presented as follows:

IV. SYSTEM EVALUATION
A. SIMULATION SETUP
We set up a simulation environment using the DDPG implementation from the Stable Baselines package to train the DDPG model for critical 5G-Advanced or 6G wireless communications in a smart city environment; this section explains the environment along with its parameters and hyperparameters. Since our problem is non-convex and computationally complex, the DDPG algorithm is well suited to solve it. DDPG consists of two deep neural networks, called the actor network and the critic network. Each of the actor and critic networks is coupled with a target neural network having the same structure. Both the actor and the critic are fully connected neural networks with two hidden layers of 32 neurons each. ReLU activation functions are used in the two hidden layers, while a linear activation function is used for the output layer of the critic network. The output layer of the actor network uses tanh as its activation function to scale the output to [−1, 1] for better performance. The base station transmitter B, Aerial-IRS R, and mobile device D are placed in a region with an area of 1000 by 1000 meters, as shown in the system map for the simulation in Figure 5. The base station is placed at location ω_B = [100, 100, 10], and the Aerial-IRS starts from its initial location ω_R. In order to effectively simulate the optimization solution, we set the parameters of the different system components to be as realistic as possible. The transmit power of the base station, P_B, is set to 1 watt, with the cellular microwave operating frequency f set to 10.5 GHz, which gives a signal wavelength λ of 0.0286 meters. The frequency-selective channel model parameters are also set: the Rician factor K_1 is set to 8 dB, and the average channel power gain at the reference distance d_0 = 1 m, L_0, is set to −20 dBm. The path-loss exponent for air-to-ground wireless transmission, α, is set to 2.3, while that for ground wireless communications, η, is set to 5.
The UAV system parameters are also initialized. The blade profile power of the UAV in hovering status P_0 is set to 79.85 watts, the induced power in hovering status P_1 is set to 88.63 watts, the constant parasite-power parameter related to aerodynamics c_0 equals 0.0145 kg/m³, the tip speed of the rotor blade U_tip is 200 m/s, and the mean rotor induced velocity v_0 is 7.21 m/s.
The Intelligent Reflecting Surface (IRS) also comes with configuration parameters. Different numbers of reflecting elements M have been tested in our simulations, but we keep the standard count at 100, arranged in 10 rows and 10 columns, while the number of bits b controlling the phase shifts of the IRS elements is set to 2. The length and width of each unit cell, as shown in Figure 2, are d_x = 0.01 meters and d_y = 0.01 meters, respectively, with a separation of ζ = 0.001 meters between cells. The incident and reflected signal wavelength is λ = 0.0286 meters, with amplitude A = 1.
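With b = 2 control bits per element, each reflecting unit can realize only 2² = 4 discrete phase levels, so a continuous phase produced by the beamforming rule must be snapped to the nearest level. A minimal sketch of this quantization, assuming uniformly spaced levels over [0, 2π) (the paper does not state the level placement):

```python
import math

def quantize_phase(theta, bits=2):
    """Snap a continuous phase shift to the nearest of 2**bits uniformly
    spaced levels in [0, 2*pi), matching b = 2 control bits per element."""
    levels = 2 ** bits              # 4 levels for b = 2
    step = 2.0 * math.pi / levels   # pi/2 spacing
    return (round(theta / step) % levels) * step
```

The modulo wraps phases near 2π back to the 0 level, keeping the output inside a single period.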
The DDPG algorithm hyperparameters are initialized with reward discount factor γ = 0.9 and learning rate α = 0.0001.During simulations, these values are adjusted numerous times to find the optimal solution.
These parameters and other hyperparameters are listed in Table 3, for easy reference.

B. SIMULATION EXPERIMENTS
The simulation experiments are conducted for multi-objective optimization in search of an optimal solution, aiming to balance both objectives, i.e., maximizing the transmission rate R and minimizing the energy consumption of the Aerial-IRS E_R. As expressed in Equation (31), the value of w_R in the reward function is explored iteratively, where the reward is given as r = w_{R,optimal} · R_norm + (1 − w_{R,optimal}) · E_norm. The aim of these simulations is to achieve convergence of both the transmission rate and the energy consumption simultaneously over multiple episodes.
The simulation uses a multi-dimensional continuous action space with the DDPG policy. The state (observation) space is composed of seven parameters: the three location coordinates of the Aerial-IRS ω_R[n], the SINR value of the previous step SINR[n − 1], and the three location coordinates of the mobile tracking device ω_D[n].
The environment consists of multiple jammers (K = 5) and a mobile device D, which moves in all directions within the constrained space following a Gaussian mobility pattern.
The reward function incorporates both the transmission rate and the energy consumption of the Aerial-IRS, with location and battery constraints imposed. Figure 5 shows the location constraint boundary for the mobile device as a blue box, while the red box shows the boundary limits for the mobility of the Aerial-IRS, ensuring that both remain within their defined regions.
The Aerial-IRS has a maximum mobility radius of 20 meters from its current location, while the mobile device can move a maximum distance of 10 meters during each step of an episode. The jammers are spread within a radius of 150 meters around the initial location of mobile device D. The system, with constraints on location and battery, is set to run for around 4000 episodes with 100 steps each, for a total of 400,000 iterations.
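Enforcing the per-step mobility radii can be sketched as a simple vector clip. This is a hedged illustration of one common way to implement the 20 m / 10 m constraints (rescaling the raw displacement); the paper may instead reject or re-sample out-of-bound actions.

```python
import math

MAX_UAV_STEP, MAX_DEVICE_STEP = 20.0, 10.0  # meters per step (Section IV-B)

def clip_displacement(dx, dy, dz, max_radius):
    """Scale a 3-D displacement down so its length never exceeds the
    per-step mobility radius; shorter displacements pass through unchanged."""
    norm = math.sqrt(dx**2 + dy**2 + dz**2)
    if norm <= max_radius:
        return dx, dy, dz
    scale = max_radius / norm
    return dx * scale, dy * scale, dz * scale
```

Rescaling (rather than rejecting) keeps the action direction chosen by the agent, which tends to give a smoother learning signal than zeroing out invalid moves.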
We carried out multiple simulations to train DDPG agents and observed the cumulative reward, transmission rate, and energy consumption trends. We observe that the value of w_R that gives a non-dominated solution is around 0.92, the cliff point shown in Figure 6. Therefore, setting w_R to 0.92 achieves the best trade-off between maximum transmission rate and minimum energy consumption.
We observe in Figure 6 that the reward and transmission rate curves rise during the first 700 episodes before converging for the remainder of the simulation. This is because during the first 700 episodes the DDPG agent is in the exploration phase; once it has explored enough, it starts exploiting the best action values, achieving convergence in the later episodes.
We also conducted experiments with another baseline RL algorithm, PPO, in the context of our proposed system model to compare its performance with DDPG. The hyperparameters for the PPO algorithm are initialized with a reward discount factor γ = 0.9, learning rate α = 0.0001, and clip fraction 0.2. We observe that for optimization using PPO, the cliff-point value of the scalarization weight w_R that gives a non-dominated solution is around 0.85, given convergence of the transmission rate and energy consumption. The hyperparameters for DDPG are described in Section IV-A.
These results, shown in Figure 6, clearly demonstrate the superiority of the DDPG algorithm in solving the optimization problem of our system model with K = 5 jammers. Figure 6(b) shows that the converged transmission rate of DDPG is significantly higher than that achieved by PPO, and Figure 6(c) shows DDPG also outperforming PPO in minimizing the energy consumption of the Aerial-IRS. The average transmission rate achieved by DDPG is 17.06 Mbps, compared to around 5.96 Mbps for PPO. Moreover, the average energy consumption with DDPG is 19.16 kJ, while that with PPO is 50.08 kJ.
Therefore, we prefer DDPG over PPO: the policy managing the trajectory of the UAV and the phase shifting of the IRS trained by the DDPG agent gives better performance in terms of transmission rate and energy efficiency than the policy trained by the PPO agent. Our experiments thus show that DDPG is the right choice of RL algorithm for optimizing our proposed system model.
In the next subsections, we conduct simulations to evaluate the performance of the proposed system model by comparing it with baselines and related works. We also study the impact of different environment settings and parameters on the proposed technique, detailed as follows: 1) Optimal Benchmark: comparative evaluation of the proposed solution against the conventional optimization technique SurrogateOpt, to demonstrate the high performance of the RL-based technique.

2) Trajectory Optimization with Fixed Phase Shifts:
Comparative evaluation with a baseline where the phase shift angles of the IRS are static while only the trajectory of the UAV is optimized to achieve the multi-objective goal, i.e., maximizing the achievable transmission rate and minimizing the energy consumption.

4) Comparison with Related Works [8] and [11]: the authors of [11] present a scenario with a single static end-user device in the presence of a single jammer, served by a UAV-mounted IRS with the aim of maximizing the transmission rate only, while the authors of [8] add mobility to the end-user device. Both works target the single objective of maximizing the transmission rate. Although their performance appears better than that of our proposed solution, this comes at the cost of higher energy consumption, as shown in the bar plots in Figure 15.

5) Single Objective Optimization for Energy Consumption, w_R = 0: evaluating performance with the scalarization weight w_R set to 0, i.e., minimizing energy consumption only.

We simulate the above scenarios in order to compare and validate the performance of our system model in the presence of multiple jammers.

1) OPTIMAL BENCHMARK
To evaluate the effectiveness of our proposed RL-based optimization approach in terms of optimality, we conducted a comparative analysis against the conventional non-linear optimization technique SurrogateOpt. Specifically, we ran simulations in MATLAB, varying both the number of jammers and their spatial distribution as part of the evaluation process.
Given the inherent non-convexity of our optimization problem, its NP-hard nature, and the computational complexity arising from continuous decision variables, an exhaustive search for the optimal solution is infeasible. Consequently, we employed the SurrogateOpt technique, while recognizing its potential to generate sub-optimal outcomes due to the non-convexity of the problem. To approach the optimal solution, we executed the optimization process 10 times for each configuration and retained the best sub-optimal solution. Owing to the computational time constraints of this optimization, the 10 iterations were run over a limited time horizon. The outcomes were then compared against the corresponding steps executed by the RL technique under identical environmental conditions. The results of this evaluation are shown in Figures 7 and 8.
Our experimentation reveals that our RL-based approach closely approximates the best sub-optimal solution obtained by SurrogateOpt. Across various system configurations, the transmission rate and energy consumption values are close to the sub-optimal values, with a maximum observed optimality gap of around 30%.

2) IMPACT OF VARYING THE NUMBER OF JAMMERS
Our initial series of comparative evaluations focuses on the analysis of the cost of jamming within the framework of our proposed system model. In Figure 9, we illustrate the influence of varying the number of jammers on the performance of the RL-based Aerial-IRS solution, particularly with respect to achieving maximum transmission rates. In our simulation experiments, we maintain a jamming region with a radius spanning from 150 to 180 meters around the mobile device D. The baseline scenario with zero jammers, i.e., K = 0, gives the maximum reward and, consequently, the highest transmission rate across all benchmarks, with the exception of the Single Objective Optimization of Energy Consumption, as depicted in the bar plots in Figure 10(a).
Conversely, an increase in the number of jammers leads to a notable reduction in the overall average cumulative reward and transmission rate, while exhibiting a consistent decline in energy consumption, as shown in Figures 9(a), 9(b), and 9(c), respectively. However, the situation could deteriorate further without IRS beamforming.
Furthermore, we demonstrate that RL-based optimization with an Aerial-IRS (UAV-mounted IRS) yields superior transmission rates compared to an IRS fixed to a building, even in the presence of multiple jammers, as presented in [6] and represented in Figure 10. Within Figure 10, the bar plots show a declining trend in the average transmission rate for the Aerial-IRS and the other reference scenarios. However, the transmission rates achieved with the Aerial-IRS notably surpass those of the baseline scenarios, except for the Single Objective Optimization for transmission rate, as showcased in Figure 15.
Moreover, we conducted a comparative analysis between our proposed solution and the baseline of a prior study [6], where interference from multiple jammers is mitigated through phase shift optimization of a fixed IRS deployed on a stationary object, such as a building. As depicted in our simulation results in Figure 10(a), our solution again proves superior. This performance can be attributed to the dynamic nature of the Aerial-IRS solution, wherein the IRS, mounted on a UAV, continuously adjusts its position in space to counter the effects of the jammers. In contrast, in the fixed-IRS scenario, only the relative distance between the IRS and the mobile device varies, based on the latter's movement. The enhanced transmission rate, while noteworthy, does come at the expense of increased energy consumption. These findings are shown in the graph for the Aerial-IRS in Figure 10(b), which exhibits a clear declining trend in contrast to the plots of the other baseline scenarios, where energy consumption remains constant.
This divergence arises from the dynamic nature of the Aerial-IRS scenario, wherein the UAV's movement requires propulsion energy that is actively optimized. In contrast, the fixed IRS remains stationary, resulting in a consistent energy consumption pattern across all scenarios.
The observed decline in energy consumption with an increasing number of jammers, as illustrated in Figure 10(b), can be attributed to the reduced mobility of the UAV. This reduction in mobility results from the UAV trying to evade the interference caused by the expanding presence of jammers around the mobile device, while simultaneously striving to track the movement of the mobile device to ensure optimal transmission rate performance.
Another baseline comparison, shown in Figure 10, involves Trajectory Optimization with Fixed Phase Shifts. Notably, the transmission rate attained through this approach falls significantly short of the values achieved by our proposed solution, underscoring the pivotal role of the IRS beamforming achieved through phase shifts in realizing elevated transmission rates. Furthermore, this baseline generally exhibits higher energy consumption than our proposed solution.

3) IMPACT OF VARYING THE NUMBER OF IRS ELEMENTS
The impact of varying the number of IRS elements is studied by conducting a series of simulation experiments comparing our approach to different baselines and related works.
Figure 11 shows the convergence trends of the cumulative reward and achievable transmission rate. Notably, these values are substantially superior within the proposed Aerial-IRS system model compared to the other baseline scenarios, as illustrated in the comparative bar plots in Figure 12. The highest transmission rates are consistently observed across all scenarios when M = 100 elements are employed, justifying this choice for our proposed solution. This observation underscores the efficacy of optimization with a UAV-mounted IRS, across various numbers of IRS elements, in outperforming other baseline systems, including the approach detailed in [6], which optimizes the phase shifts of a stationary IRS. This performance advantage can be attributed to the greater mobility afforded to the UAV-mounted IRS, which facilitates exploring the spatial domain to optimize transmission rates effectively.
In Figure 12, the bar plots clearly depict a pronounced increase in the average transmission rate from M = 9 to M = 100, for both the Aerial-IRS system and the baseline configurations. However, this ascending trend experiences a slight downturn beyond M = 100, particularly for M = 144 and M = 196. Conversely, an intriguing contrast arises when examining the energy consumption plots. In the first baseline scenario, the observed trend can be attributed to optimizing the trajectory without IRS beamforming; in such a configuration, the RL model struggles to achieve the optimal transmission rate, which results in an elevated energy consumption profile. The second baseline case exhibits constant energy consumption values. This constancy arises from the assumption that the IRS, when fixed in a stationary location such as a building or a static UAV, maintains a uniform energy consumption level across all considered configurations.
The decreasing energy consumption trend observed as the number of IRS elements increases in the proposed Aerial-IRS configuration can be attributed to the heightened influence of beamforming in attaining maximum transmission rates. As the number of IRS elements increases, the synergy between beamforming and trajectory optimization becomes more pronounced, reducing the need for the UAV to continually adapt its movement to track the mobile device's trajectory.
Furthermore, it is evident that increasing the number of IRS elements beyond M = 100 yields modest energy conservation benefits without a significant enhancement in transmission rates. Additionally, such an increment in IRS elements could escalate hardware costs owing to the enlargement of both the size and the quantity of elements. Therefore, the configuration with M = 100 IRS elements strikes an optimal balance, offering the most favorable trade-off among heightened transmission rates, reduced energy consumption, and reasonable hardware cost.
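The diminishing returns beyond M = 100 can be made concrete with a back-of-the-envelope calculation: under idealized coherent combining, the effective channel gain of an IRS grows roughly with M², so the achievable rate grows only logarithmically once M² times the per-element SNR is large. The sketch below is illustrative only; the per-element SNR value is a hypothetical assumption, not a parameter from our experiments.

```python
import math

# Hypothetical per-element SNR at the receiver (linear scale); illustrative only.
SNR_PER_ELEMENT = 1e-3

def achievable_rate(m, snr0=SNR_PER_ELEMENT):
    """Idealized rate under coherent combining: effective gain scales with m**2."""
    return math.log2(1 + (m ** 2) * snr0)

for m in [9, 25, 49, 100, 144, 196]:
    print(f"M = {m:3d}: rate = {achievable_rate(m):.2f} bits/s/Hz")
```

Because of the logarithm, once M² · snr0 ≫ 1 each doubling of M adds at most 2 bits/s/Hz, while hardware size and element count grow linearly, which mirrors the flattening observed around M = 100 in Figure 12.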

4) IMPACT OF VARYING THE DISTRIBUTION AREAS OF JAMMERS
Another significant aspect of experimentation relates to the influence of jammer distribution areas on the system model. To examine this, we conducted a comparative analysis, expanding the spatial extent of the jammer distribution area and assessing the performance of the Aerial IRS system model alongside the baseline scenarios. The baseline scenarios comprise Phase Shift Optimization with Fixed IRS, in which the IRS is fixed to a stationary object, and Trajectory Optimization with Fixed Phase Shifts. The outcomes, depicted in Figures 13 and 14, reveal that both the cumulative reward and transmission rate values rise consistently as the jammer distribution area surrounding the mobile device, denoted as D, expands.
This observation is rooted in the proximity of the jammers to the mobile device: within a range of 100 meters, the jamming signals exert a more pronounced interference effect on the signals received at the mobile device. Conversely, the interference decreases when jammers are positioned at greater distances from the device, around 350 meters. Despite this, the plots illustrate that effective optimization can still be achieved even in complex and demanding scenarios where jammers are positioned close to the mobile device.
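The distance effect described above can be sketched with a simple distance-based interference calculation. The path-loss model, exponent, transmit powers, and noise floor below are all illustrative assumptions, not the channel model or parameters used in our simulations.

```python
def path_gain(distance_m, exponent=2.5, ref_gain=1e-3):
    """Simplified distance-based path gain; parameters are illustrative."""
    return ref_gain * distance_m ** (-exponent)

def sinr_at_device(signal_power, jammer_powers, jammer_distances, noise=1e-12):
    """SINR at the mobile device with multiple jammers at given distances."""
    interference = sum(p * path_gain(d)
                       for p, d in zip(jammer_powers, jammer_distances))
    return signal_power / (interference + noise)

# Same jammer transmit powers, placed close (100 m) vs. far (350 m) from D.
powers = [0.1, 0.1, 0.1]
sinr_close = sinr_at_device(1e-9, powers, [100, 100, 100])
sinr_far = sinr_at_device(1e-9, powers, [350, 350, 350])
print(sinr_far / sinr_close)  # aggregate interference drops sharply with distance
```

With a path-loss exponent of 2.5, moving the jammers from 100 m to 350 m scales their received power by roughly (350/100)^-2.5, which is why the close-in placement is the most demanding configuration for the optimizer.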
As demonstrated by the bar plots in Figure 14, our proposed Aerial IRS solution shows superiority over the other two baselines due to its joint optimization of both trajectory and phase shifts. In contrast, each baseline optimizes only one of these parameters, either the IRS phase shifts or the UAV trajectory. Additionally, an analysis of the energy consumption plots for the proposed Aerial IRS scenario and the Trajectory Optimization with Fixed Phase Shifts baseline generally reveals an ascending trend, primarily attributable to the additional maneuvering the UAV must perform as the jammer distribution area grows. Upon comparing the outcomes of the proposed Aerial IRS scenario with the baseline configurations, a notable enhancement in performance becomes evident within the proposed system model. This improvement can be attributed to the aerial IRS's capability to contend effectively with densely placed jammers.
The outcomes depicted in the comparative bar plot found in Figure 14 support the relative superiority of our proposed system model when optimizing parameters in the presence of multiple jammers, as compared to other baseline approaches.This includes the approach involving Phase Shift Optimization with Fixed IRS Location as detailed in [6], as well as Trajectory Optimization with Fixed Phase Shifts.

5) IMPACT OF VARYING THE SCALARIZATION WEIGHT W R
Simulation experiments were also undertaken to evaluate Single Objective Optimization scenarios, in which the scalarization weight w R is set to 1 or 0. In contrast, our proposed system model pursues Multi-Objective Optimization by striking a balance between transmission rate and energy consumption. The optimal solution in this context emerges when the scalarization weight w R is set to 0.92. The comparative graph plots depicting these baseline scenarios are presented in Figure 15.
In the first baseline scenario, where w R is set to 1, the primary objective is maximizing transmission rates without regard for energy consumption.In contrast, in the second baseline scenario, characterized by w R set to 0, the primary focus is on minimizing energy consumption while disregarding transmission rate optimization.
The first baseline scenario achieves a higher transmission rate because, with the objective limited to rate maximization, the RL agent disregards energy consumption and therefore expends the maximum amount of energy to maximize the transmission rate. In the second baseline, where w R is set to 0, the RL agent aims only to minimize energy consumption regardless of the transmission rate; consequently, we observe almost identical values for all quantities of IRS elements.
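The scalarization described above can be sketched as a single weighted reward combining a term to maximize (rate) and a term to minimize (energy). The exact reward shaping and normalization used in our experiments are not reproduced here; the sign convention and the sample values below are illustrative assumptions.

```python
def scalarized_reward(rate_norm, energy_norm, w_r):
    """Weighted scalarization: reward the normalized rate, penalize
    normalized energy; w_r = 1 and w_r = 0 recover the single-objective cases."""
    return w_r * rate_norm - (1.0 - w_r) * energy_norm

# Illustrative normalized observations from one environment step.
rate_norm, energy_norm = 0.8, 0.6

print(scalarized_reward(rate_norm, energy_norm, 1.0))   # rate-only baseline
print(scalarized_reward(rate_norm, energy_norm, 0.0))   # energy-only baseline
print(scalarized_reward(rate_norm, energy_norm, 0.92))  # multi-objective setting
```

At w R = 1 the energy term vanishes entirely, and at w R = 0 the rate term vanishes, which explains the one-sided behavior of the two baselines; w R = 0.92 keeps a small but nonzero energy penalty in every step.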

V. CONCLUSION
This study introduced a resilient anti-jamming approach employing Deep Reinforcement Learning (DRL) for critical wireless communications in a smart urban environment. This technique enables the joint optimization of Aerial-IRS trajectory and passive beamforming via IRS phase shifts, aiming to maximize achievable transmission rates and energy efficiency. Our research addresses the challenges of a highly adversarial environment characterized by multiple jammers. We have leveraged the Deep Deterministic Policy Gradient (DDPG) algorithm to attain optimal solutions under various configurations.
Our proposed solution holds considerable promise for deployment in critical public wireless communication infrastructure, such as stadiums, airports, urban centers, and sporting venues. It empowers these environments to effectively combat rapidly evolving threat conditions by autonomously and adaptively adjusting the trajectory of the Aerial IRS and manipulating the phase shifts of the IRS reflecting elements.
In future research, we anticipate exploring multi-agent DRL techniques within the context of highly complex wireless communication environments. This exploration could involve factors such as mobile jammers, multiple mobile devices, and the formation of collaborative swarms of IRS-assisted aerial platforms, thereby introducing additional challenges to the optimization problem.

FIGURE 1. System Model: Aerial-IRS assisted Wireless Communication with Multiple Jammers.
The location coordinates for the base station are ω B = [x B , y B , H B ], where H B denotes the height of the base station. For simplicity, we assume that the height of the base station transmitter is H B = 0. Therefore the base station coordinates are ω B = [x B , y B , 10]. For the mobile tracking device D, the location coordinates are ω D = [x D , y D , 0], while for the jammers are ω

FIGURE 2. IRS Phase Shift Elements: Structure and angles of signals.

FIGURE 4. Episode design containing N iterations each.

FIGURE 5. System map generated by simulation experiment.

FIGURE 6. Comparative graph plots of optimization results for DDPG and PPO (a) Cumulative reward plot (b) plot for transmission rate (c) plot for energy consumption.

FIGURE 7. Graph plots of Comparative Evaluation with SurrogateOpt on proposed system model w.r.t. different number of jammers (a) Average Transmission Rate (b) Cumulative Energy Consumption.

FIGURE 8. Graph plots of Comparative Evaluation with Optimal technique SurrogateOpt on proposed system model w.r.t. different jammer spread areas (a) Average Transmission Rate (b) Cumulative Energy Consumption.

FIGURE 9. Simulation Results for Impact of varying the number of jammers on proposed system model with weight wR = 0.92 (a) Cumulative reward plot (b) Plots for Achievable Transmission Rate (c) Plots for Energy Consumption.

FIGURE 10. Comparative bar plots for the impact of varying the number of jammers in proposed Aerial IRS scenario and baselines on (a) transmission rate and (b) energy consumption.

FIGURE 11. Simulation Results for Impact of varying the number of IRS elements on Aerial IRS proposed system model with weight wR = 0.92 (a) Cumulative reward plot (b) Plots for Achievable Transmission Rate (c) Plots for Energy Consumption.

FIGURE 12. Comparative bar plots for Impact of varying the number of IRS elements in proposed Aerial IRS scenario and baselines on (a) transmission rate and (b) energy consumption.

FIGURE 13. Simulation Results for Impact of Varying the Jammer Distribution Area on Proposed Aerial IRS System Model with Scalarization Weight wR = 0.92 (a) Cumulative reward plot (b) Plots for Achievable Transmission Rate (c) Plots for Energy Consumption.

FIGURE 14. Comparison of Results for the Impact of Varying the Jammer Distribution Area in the Proposed Aerial IRS Scenario and Baselines on (a) transmission rate and (b) energy consumption.

FIGURE 15. Comparison of Single Objective Optimization baselines and Aerial IRS scenarios with respect to varying the number of jammers and their impact on (a) transmission rate and (b) energy consumption.