Deep Reinforcement Learning for Internet of Drones Networks: Issues and Research Directions

Internet of Drones (IoD) is one of the promising technologies to enhance the performance of wireless networks. Deploying IoD to assist wireless networks, however, needs to address various design issues. Due to the highly dynamic nature of IoD networks, conventional methods are expected to encounter inadequacies that can be resolved using emerging deep reinforcement learning (DRL) techniques. In this paper, we discuss the application of DRL for addressing various issues in IoD networks. We first overview the main features, types, applications, and services of IoD networks. Then, we briefly discuss some DRL algorithms used to address the issues and challenges of IoD networks. After that, we explain the most crucial issues in IoD networks and discuss some papers that show how DRL can address them. Finally, we provide insights into some promising research directions in the context of using DRL in IoD networks.

to eavesdropping attacks due to the broadcast nature of the communication [5]. Therefore, a secure communication channel is required to ensure the confidentiality of communication within the IoD-aided network [6]. Developing efficient solutions that address these issues and overcome such challenges is one of the surging topics in the industry and academia.
Experts also expect that existing conventional techniques, such as optimization- and game-theory-based methods, will encounter serious inadequacies when addressing the above challenges due to the high complexity and dynamicity of IoD environments [7], [8]. Therefore, emerging machine learning approaches, such as deep reinforcement learning (DRL) techniques, have recently been proposed as alternatives. DRL is a data-driven technology that has recently emerged to address various issues in wireless networks. In the context of IoD, DRL methods can assist in drone navigation as well as resource allocation and optimization. IoD networks are generally characterized by their high dynamicity, which makes the decision-making process very challenging. In addition, in complicated and large-scale IoD networks, the computational complexity of conventional algorithms becomes extremely difficult to manage. Hence, DRL has been proposed to overcome such challenges in highly dynamic environments with sophisticated state spaces [7]. DRL utilizes deep neural networks (DNNs) to offer efficient implementations of various decision-making tasks with only local knowledge of the IoD environment [9]. The role of DRL is to make autonomous and local decisions in IoD networks, e.g., power control, spectrum access, drone association, and path planning, to reach design goals such as minimizing energy consumption and maximizing throughput.

A. RELATED WORK
There are several survey papers on the applications of DRL techniques for various issues in wireless networks and for resource allocation and management in 5G and beyond wireless networks [1], [7]. Although there are papers that discuss the applications of machine learning, deep learning, and DRL for IoD, the contributions and value added by this paper compared to these papers are summarized in Table 1. As Table 1 shows, the main contribution of this paper compared to these papers (including [10], [11]) is that we focus specifically on the applications of DRL. We also cover several emerging issues that previous surveys do not address and that require more investigation and exploration. In addition, this paper provides insights into some promising research directions in the context of applying DRL to IoD networks.

B. CONTRIBUTION AND ORGANIZATION
In this paper, we overview and discuss the application of DRL techniques for various issues in IoD wireless networks. The main contributions of this paper are summarized as follows:
• We briefly overview IoD networks, including their features, advantages, applications, and shortcomings. Then we briefly explain the most widely used DRL techniques to address the various issues in IoD.
• We highlight the most crucial issues in IoD networks and classify some existing papers based on the issue they address. To this end, we identify the problem addressed in each paper and the DRL techniques utilized to address it.
• We highlight some promising research directions in the context of DRL usage for IoD networks.
The rest of this paper is organized as follows. Section II overviews the applications, features, and types of IoD networks. It also briefly overviews some of the DRL techniques used in IoD-assisted networks. In Section III, we classify and explain some existing papers targeting IoD wireless networks based on the issues they address. In Section IV, we highlight some of the future research directions. Finally, Section V provides a conclusion to the paper.

II. DRL TECHNIQUES FOR IOD NETWORKS
In this section, we overview the IoD networks, including their main features, types, applications, and services. Then, we discuss the most widely used DRL techniques to address various problems and issues in IoD networks.

A. OVERVIEW OF IOD NETWORKS
1) FEATURES AND TYPES OF IOD NETWORKS
Nowadays, drones are being deployed in several civil and military applications due to their reduced maintenance cost, effortless integration with many systems, and ability to hover and move with high mobility [1], [15]. IoD networks are efficient for real-time traffic management and surveillance, remote detection, wireless coverage, and search and rescue missions. IoD networks provide many improvements once integrated with different communication environments. For instance, drones can be deployed as aerial base stations to assist or back up ground stations. In this context, experts expect that drones will be increasingly utilized in future cellular networks to upgrade their capacity and coverage. Also, drones are able to provide aerial caching at small base stations to minimize the communication delay and enhance the users' throughput [17].
Moreover, IoD can be used in real-time urban traffic monitoring and management to assist public transportation [23]. IoD networks are also efficient in object detection missions for reconnaissance and surveillance purposes. IoD is further being utilized to promptly support wireless coverage to urban and highly-dense environments, where a swarm of drones can be quickly navigated to serve cell sites [24].

2) APPLICATIONS AND SERVICES OF IOD
In this subsection, we summarize the main applications and services of IoD networks, which are shown in Fig. 1.
1) IoD for surveillance: Drones can be deployed for surveillance tasks in which they are required to provide a live video stream or capture images for specific objectives. Due to their size, mobility, and capability to endure rough environments, drones can gather information quickly in particular scenarios. Drones can utilize technologies such as object detection, computer vision, and face recognition to enhance their reconnaissance performance. There are various applications deploying drones for reconnaissance, such as traffic surveillance, indoor or outdoor surveillance, and environment surveillance [23].
2) IoD for search and rescue: A swarm of drones can also be used for search and rescue operations when an environmental disaster occurs. In such cases, drones implement search and rescue algorithms that concentrate the search on the center of the disaster area, with the search intensity decreasing progressively as the distance from the center increases [25].
3) IoD for communications: Drones are utilized as a reliable technology to support conventional ground-based wireless networks. Due to the mobility of drones within a network, they can maintain a reliable line of sight (LoS) connection with ground entities to enhance the network's performance and coverage. There are several proposals for utilizing drones as aerial base stations to distribute and offload users' traffic. In this context, IoD can be efficiently integrated with cellular and vehicular networks to improve network performance and decrease congestion [26].
The deployment of drones within a network often requires multiple drones to collaborate in order to achieve the required goal. A swarm of drones is a technique in which multiple drones are concurrently coordinated within the network in order to optimally achieve the required task [27]. A swarm is typically employed to accomplish tasks that a single drone cannot accomplish on its own. In addition to swarm technology, the tethered drone is one of the technologies proposed to solve the battery limitation issue of IoD networks [18].

B. OVERVIEW OF DRL TECHNIQUES FOR IOD
In this subsection, we overview the most widely used DRL algorithms to address the problems and issues of IoD networks. These algorithms belong to two categories: value-based and policy-based algorithms [7]. Note that reviewing all the DRL algorithms is beyond the scope of this paper. However, a detailed review of them can be found in [7], [28].
Deep reinforcement learning (DRL) is a field of artificial intelligence that aims to develop autonomous systems with more efficient knowledge of the real world. DRL targets highly complex problems in highly dynamic environments, such as IoD networks, that are difficult to solve using conventional methods [7]. DRL utilizes deep neural networks to estimate value or policy functions for large-scale reinforcement learning problems. Employing deep neural networks increases the learning capabilities and makes the algorithms more efficient. Several DRL algorithms exist that are highly effective in addressing issues in IoD networks. The following subsections briefly explain the foundations of DRL. Interested readers can also refer to the corresponding references for more details.
The Markov Decision Process: In order to solve decision-making problems, they first need to be formulated mathematically. This is done using the Markov Decision Process (MDP), a discrete-time stochastic control process whose outcomes are partly controlled by an agent [28]. A DRL algorithm is then applied after converting the optimization problem into the MDP representation. Once the problem is cast as an MDP model, seven main entities must be explicitly defined: the environment, agent, state space, action space, reward function, transition probability, and policy. The main objective of an MDP is to find the optimal policy for the underlying decision-making process.
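As an illustration of these entities, the following minimal sketch defines a toy MDP and recovers its optimal policy. Everything in it is an assumption made for illustration and is not taken from the reviewed papers: the four-cell corridor, the goal cell, the slip probability, the rewards, and the discount factor are hypothetical. It uses tabular value iteration, a classical planning method that requires known transition probabilities, rather than a DRL algorithm; DRL replaces the table with a neural network and learns from sampled interactions when such a model is unavailable.

```python
# Toy MDP: a drone moves along a 1-D corridor of 4 cells; cell 3 is the goal.
# All numbers below are illustrative assumptions, not values from the survey.
STATES = [0, 1, 2, 3]          # state space
ACTIONS = ["left", "right"]    # action space
GOAL = 3
GAMMA = 0.9                    # discount factor
SLIP = 0.1                     # probability that a move fails and the drone stays put


def transitions(state, action):
    """Transition model: list of (probability, next_state, reward) triples."""
    if state == GOAL:                          # absorbing goal state
        return [(1.0, state, 0.0)]
    step = 1 if action == "right" else -1
    target = min(max(state + step, 0), GOAL)   # stay inside the corridor
    reward = 1.0 if target == GOAL else -0.1   # small per-move cost (energy)
    return [(1.0 - SLIP, target, reward), (SLIP, state, -0.1)]


def q_value(values, state, action):
    """Expected discounted return of taking `action` in `state`."""
    return sum(p * (r + GAMMA * values[s2])
               for p, s2, r in transitions(state, action))


def value_iteration(tol=1e-8):
    """Return (optimal state values, greedy policy) for the toy MDP."""
    values = {s: 0.0 for s in STATES}
    while True:
        new_values = {s: max(q_value(values, s, a) for a in ACTIONS)
                      for s in STATES}
        delta = max(abs(new_values[s] - values[s]) for s in STATES)
        values = new_values
        if delta < tol:
            break
    policy = {s: max(ACTIONS, key=lambda a: q_value(values, s, a))
              for s in STATES}
    return values, policy


values, policy = value_iteration()
```

In this toy setting the greedy policy moves the drone toward the goal from every non-goal cell, and the state values grow as the drone gets closer to the goal.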

1) THE DEEP Q-LEARNING (DQN) ALGORITHM
This is a model-free value-based technique [28] used to approximate the agent's value function [29]. The value function is an estimate of the accumulated discounted rewards received after an action is executed in an environment. Two value functions exist: the state-value function and the state-action value function [28]. The DQN algorithm is employed in IoD-aided networks, especially for problems with discrete action spaces. DQN efficiently solves optimization problems such as passive beamforming design, trajectory optimization, resource allocation, and access control. The DQN algorithm can provide optimal decision-making with minimal observations in relatively small and simple IoD environments. However, the DQN algorithm suffers from the overestimation issue, which causes a positive bias in the estimated values and leads to sub-optimal policies [30].
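The overestimation issue can be illustrated with a short sketch. It assumes the standard one-step DQN target, r + γ max Q(s', a'), and a hypothetical setting in which every true Q-value is zero while the estimates carry zero-mean Gaussian noise; the function names and parameters are illustrative only. Because the target takes a maximum over noisy estimates, its average is positive even though the true value is zero.

```python
import random


def dqn_target(reward, next_q_values, gamma=0.9, done=False):
    """One-step TD target used to train a DQN: r + gamma * max_a' Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def max_estimate_bias(num_actions=5, noise_std=1.0, trials=2000, seed=0):
    """Average of max_a Q_hat(a) when every true Q-value is 0.

    Each estimate is Q_hat(a) = 0 + Gaussian noise, so an unbiased
    estimator of max_a Q(a) would average to 0.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        noisy_q = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        total += max(noisy_q)
    return total / trials


bias = max_estimate_bias()   # clearly positive although all true values are 0
```

The bias grows with the number of actions and with the noise level, which is one reason the effect matters more in large action spaces.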

2) THE DOUBLE DQN (DDQN) ALGORITHM
This method was introduced to improve the DQN technique [31]. DDQN is also a model-free value-based algorithm, designed to overcome the overestimation issue in the DQN algorithm. The difference between the two algorithms is that the DDQN algorithm uses two Q-value functions: the first one chooses the best action, and the second one evaluates the selected action. DDQN is utilized to solve IoD-aided network problems such as enhancing user mobility and various issues in mobile edge computing environments. The DDQN algorithm mitigates the overestimation problem of the DQN technique in IoD networks by modifying the loss function. However, the DDQN algorithm is vulnerable to high variance and slow convergence [32].
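The decoupling of action selection from action evaluation can be sketched as follows. The sketch assumes the common formulation in which an online network selects the action and a target network evaluates it; the Q-values are passed in as plain lists, and the function and argument names are hypothetical.

```python
def dqn_target(reward, next_q_target, gamma=0.9, done=False):
    """DQN target: the same Q-values both select and evaluate the action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target,
                      gamma=0.9, done=False):
    """Double DQN target: online Q-values select, target Q-values evaluate."""
    if done:
        return reward
    # Selection: greedy action according to the first (online) Q-function.
    best_action = max(range(len(next_q_online)),
                      key=lambda a: next_q_online[a])
    # Evaluation: value of that action according to the second Q-function.
    return reward + gamma * next_q_target[best_action]


# Hypothetical Q-values for the next state (three discrete actions).
q_online = [1.0, 3.0, 2.0]
q_target = [5.0, 0.5, 4.0]
y_dqn = dqn_target(1.0, q_target)                     # uses max(q_target)
y_ddqn = double_dqn_target(1.0, q_online, q_target)   # uses q_target[1]
```

With these illustrative numbers the Double DQN target is smaller than the DQN target, because an action that only the target network rates highly is no longer selected automatically.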

3) PROXIMAL POLICY OPTIMIZATION (PPO) ALGORITHM
This is a policy-gradient-based technique that regularizes policy updates by clipping the probability ratio between the new and old policies [33]. The PPO technique is used in both continuous and discrete action spaces, and it employs the actor-critic approach to evaluate the chosen action. The policy obtained from the PPO algorithm is fitted by a stochastic gradient ascent optimizer, while the value function is updated by gradient descent. The PPO algorithm provides enhanced performance over conventional approaches in issues such as the age of information and power control in IoD networks. The PPO technique is utilized in IoD networks due to its efficiency in solving IoD-based problems with high dimensionality and continuous action and state spaces. However, the PPO algorithm can still suffer from slow convergence and model instability [33].
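A minimal sketch of the clipped surrogate objective for a single sample is given below. It assumes the standard PPO-clip form, min(r·A, clip(r, 1−ε, 1+ε)·A), where r is the probability ratio between the new and old policies and A is the advantage estimate; the clipping range of 0.2 is a commonly used default and is an assumption here, not a value taken from the reviewed papers.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-clip objective for one sample.

    ratio     = pi_new(a|s) / pi_old(a|s)
    advantage = estimated advantage of action a in state s
    """
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# A large ratio with a positive advantage is capped at (1 + eps) * A,
# so the policy gains nothing from moving further in that direction.
capped = clipped_surrogate(1.5, 2.0)

# A small ratio with a negative advantage is bounded at (1 - eps) * A.
bounded = clipped_surrogate(0.5, -1.0)
```

In training, this quantity is averaged over a batch and maximized by gradient ascent; the clipping removes the incentive to move the new policy far from the old one in a single update.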

4) THE DEEP DETERMINISTIC POLICY GRADIENT (DDPG) ALGORITHM
This is a policy-based algorithm utilized in highly dynamic environments with continuous action spaces [34]. This algorithm combines characteristics of both policy gradient and Q-learning techniques. DDPG consists of an actor, which is a deep neural network responsible for choosing the action based on the current state of the environment, and a critic, which is a Q-value deep neural network that evaluates the quality of the actions taken by the actor network. In the context of IoD networks, the DDPG algorithm solves issues such as spectral utilization efficiency, flocking motion management, and drones' trajectory optimization. The DDPG algorithm can be deployed in IoD-assisted environments due to its effectiveness in solving problems with high-dimensional and continuous state and action spaces while simultaneously overcoming the overestimation problem. However, the DDPG algorithm is sensitive to hyperparameter tuning and suffers from model instability [32].
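Two ingredients of DDPG can be sketched in a few lines: a deterministic actor that outputs a bounded continuous action, and the soft (Polyak) update of target-network parameters that DDPG commonly uses to stabilize training. The sketch is a simplification under stated assumptions: the actor is a single linear layer with a tanh output rather than a deep network, parameters are plain lists, and all names and values (including tau) are hypothetical.

```python
import math


def actor(state, weights, bias, max_action):
    """Deterministic policy: maps a state vector to one continuous action.

    A single linear layer followed by tanh stands in for the actor network;
    the output always lies in [-max_action, max_action].
    """
    z = sum(w * s for w, s in zip(weights, state)) + bias
    return max_action * math.tanh(z)


def soft_update(online_params, target_params, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * t
            for w, t in zip(online_params, target_params)]


# Hypothetical example: a 2-D state and an action bounded by 2.0 units.
action = actor([0.3, -0.1], weights=[1.0, 0.5], bias=0.0, max_action=2.0)
new_target = soft_update([1.0, 2.0], [0.0, 0.0], tau=0.1)
```

Because the actor outputs the action directly, no discretization of the action space is needed, which is the property that makes this family of methods attractive for continuous control variables such as transmit power or heading.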

III. ISSUES OF IOD NETWORKS
This section discusses the main issues encountered in IoD networks and how DRL can address them. Related work for each issue is also provided.

FIGURE 2. Applications of DRL for drones navigation in IoD networks. DRL agents are embedded within drones whose main task is to obtain the optimal 2D/3D route for drones within the IoD network.

A. DRONES NAVIGATION
Drone navigation is one of the main issues in designing IoD networks, as shown in Fig. 2. It is the process of deciding the most efficient route and physical location for drones within the network to enhance some performance metrics. Navigation includes drones' trajectory optimization, path planning, and routing [35], [36]. Drone navigation controls drones' hovering speed, elevation, direction, and acceleration. In IoD networks, autonomous control of drones' physical location is required to provide enhanced reliability in terms of network coverage and collision avoidance. The authors in [37] address the problem of route optimization and passive beamforming design in an Intelligent Reflecting Surface (IRS)-aided IoD. An optimization problem is formulated whose objective is maximizing the overall data rate and the geographical fairness among all user equipment by jointly optimizing the drone's trajectory and the IRSs' phase shifts. A multi-agent DQN method is then implemented to solve the problem with discretized trajectories, which has an advantage in terms of training time. Moreover, the authors in [37] propose a DDPG algorithm to tackle the joint design of the RISs' continuous phase shifts and the drones' trajectories to maximize the weighted sum rate of the IoD networks. The authors' solution achieves better performance than two conventional methods for optimizing the RISs' phase shifts to maximize the weighted sum rate: Complex Circle Manifold (CCM) and Majorization-Minimization (MM). Although the work in [37] provides excellent results, it assumes that the agents have real-time and full knowledge of the network, which is difficult to obtain in practical scenarios. In addition, the authors do not consider drone speed, which is a crucial factor.
Furthermore, trajectory optimization is required in IoD networks to avoid the physical collision of drones with obstacles and with each other. Drones utilize the data acquired by their different sensors, such as Lidar, depth camera, video, or ultrasonic sensors, to achieve that task. Unlike the work in [37], the authors in [38] propose probabilistic and DQN-based algorithms to prevent collisions while minimizing energy consumption in IoD networks, considering limited knowledge about the environment. Their technique can be run on board the drone or at a multi-access edge computing entity, according to the drone capacity and the task overhead. Their algorithms are then assessed in challenging environments, including several drones hovering and moving randomly in small areas without any coordination. Simulation results show that their proposed path-planning DRL algorithms can efficiently guarantee collision avoidance among drones and other entities while minimizing energy consumption in several environments. Furthermore, the authors show that their DRL techniques provide promising performance compared to conventional approaches relying on object recognition algorithms, optimization methods, and analytical modeling. However, the authors implemented the DQN algorithm, which requires discretizing the action space. This can lead to sub-optimal policies, a limitation that policy-based algorithms can overcome.
Summary: In this subsection, we review DRL algorithms utilized to address drone navigation issues in IoD-assisted networks. In general, drone navigation issues are characterized by their high dimensionality and continuous action spaces. Therefore, policy-based DRL algorithms are the most appropriate models to be utilized. However, through our review, we observed that both policy-based and value-based models are widely used. In addition, the reviewed papers show that DRL algorithms outperform conventional algorithms in addressing the drone navigation issue. Furthermore, the drone navigation issue can be addressed with either single-agent or multi-agent DRL techniques.

B. POWER CONTROL
One of the widely researched challenges of IoD networks is power control and allocation, as shown in Fig. 3. Due to the restricted battery capacity of drones in IoD networks, power control aims to extend drones' lifetime. Power control improves energy efficiency within the system by minimizing drones' energy consumption. To address the power consumption issue in IoD networks, the authors in [39] employ the power control concept along with energy harvesting techniques. This is typically done by dynamically adjusting the transmission power of drones at each time step while reducing the average system power consumption. Once the power allocation problem is formulated as a Markov Decision Process (MDP), it can be solved by DRL techniques. In this context, the authors propose an actor-critic algorithm to resolve the energy efficiency issue. It is shown that their proposed algorithm outperforms the Greedy algorithm in terms of convergence time and average system energy consumption. However, the authors assume that drones are static and only hovering. The model would be more realistic if the network were dynamic and the drones were moving within the network coverage.
In another context, ultra-dense networks (UDNs) are becoming increasingly important for supporting many users and emerging mission-critical systems. Unlike the work in [39], the authors in [40] propose a method to overcome the restrictions on wireless communication resulting from natural disasters. An emergency transmission system deploying drones as dynamic base stations to support UDNs is introduced. The authors develop a UDN system model incorporating drone base station selection, whose objective is to maximize the energy efficiency of the IoD-assisted UDN. Then a stochastic optimization problem is formulated, and a DQN-based DRL model is used to solve it and maximize the system's energy efficiency. Compared to other solutions based on legacy Q-learning, maximum, and random resource allocation algorithms, the authors' proposed DRL solution can substantially enhance the energy efficiency of the system. However, the authors adopt the DQN algorithm in their solution, which is vulnerable to quantization error that degrades the accuracy of the learned policies.
Summary: In this subsection, we focus on the power control problem for IoD-assisted networks using DRL algorithms. This issue also includes the energy efficiency of IoD networks. The power control issue is characterized by its continuous action space and high complexity. It is observed that the DQN and actor-critic algorithms are the most widely used DRL algorithms to address this issue. However, policy-based DRL algorithms are more accurate in addressing such problems, as they avoid the quantization problem encountered in value-based methods when the latter are applied to power control. The DRL algorithms implemented in the reviewed papers are evaluated in ultra-dense and urban environments, and the results show that DRL techniques outperform traditional methods. In general, DRL agents can be deployed in either a distributed or a centralized manner and in single- or multi-agent settings when used for power control in IoD-based networks.
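The quantization problem mentioned above can be made concrete with a small numerical sketch. It assumes a hypothetical single-link energy-efficiency function, log2(1 + g·p) / (p + p_c), with illustrative values for the channel gain g, the circuit power p_c, and the maximum transmit power; none of these come from the reviewed papers. A value-based agent restricted to K evenly spaced power levels can at best pick the best level, so the sketch compares that best level with the continuous optimum found by ternary search.

```python
import math

# Illustrative assumptions (not from the survey): channel gain, circuit
# power, and maximum transmit power, all in arbitrary normalized units.
GAIN = 10.0
CIRCUIT_POWER = 0.1
MAX_POWER = 1.0


def energy_efficiency(power):
    """Hypothetical energy efficiency: achieved rate per unit of consumed power."""
    return math.log2(1.0 + GAIN * power) / (power + CIRCUIT_POWER)


def continuous_optimum(iterations=200):
    """Ternary search over [0, MAX_POWER]; valid because the function is unimodal."""
    lo, hi = 0.0, MAX_POWER
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if energy_efficiency(m1) < energy_efficiency(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0


def best_discrete_level(num_levels):
    """Best of `num_levels` evenly spaced power levels (a discrete action set)."""
    levels = [i * MAX_POWER / (num_levels - 1) for i in range(num_levels)]
    return max(levels, key=energy_efficiency)


def quantization_gap(num_levels):
    """Energy efficiency lost by restricting the agent to discrete levels."""
    best_continuous = energy_efficiency(continuous_optimum())
    return best_continuous - energy_efficiency(best_discrete_level(num_levels))


coarse_gap = quantization_gap(3)    # only three power levels available
fine_gap = quantization_gap(64)     # a much finer discretization
```

In this toy setting the gap with three levels is far larger than with 64 levels. Finer discretization reduces the loss but enlarges the action set the value-based agent must learn over, which is the trade-off that continuous-action, policy-based methods avoid.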

C. CHANNEL ALLOCATION
Channel allocation is a technique to share the available spectrum between multiple users to perform the required tasks. It also includes the access control of users to provide enhanced performance. Channel allocation in IoD-assisted networks can also reduce transmission interference. The existence of drone base stations makes IoD networks highly dynamic, which further increases the complexity of channel allocation. In [20], the authors introduce a non-centralized DRL algorithm to handle the users' access control issue in IoD networks. In their model, users make their individual access decisions autonomously based on local network information. This increases the network's long-term throughput while preventing recurrent handovers. The authors then propose a DQN algorithm combined with a long short-term memory (LSTM) network to solve the user access problem. It is shown in [20] that their proposed solution allows users to access the appropriate drone base station independently and increases the long-term throughput with minimum handovers. Using simulation, the authors show that their DRL-based method provides more promising results compared with the linear programming technique.
In IoD-assisted networks, drones must communicate effectively without interfering with each other. Hence, it is essential to manage the spectrum resource in such systems. The authors in [41] propose a classified DRL-based scheme to address the problem of multi-drone cell path planning and resource allocation in IoD networks. Unlike the work in [20], in which the authors assume that the drones fly in a predefined orbit, the authors in [41] assume that the drones can move in 3D with a high level of mobility, which is a more realistic yet challenging scenario. The challenge of trajectory planning is to define the routes of the various drone cells in the radio access network over a sustained period. To solve this problem, the authors introduce a multi-agent DRL algorithm in which the non-stationary state space caused by the multi-drone cell network is handled by a multi-agent fingerprint method. In the same context, the authors in [42] propose a framework for dynamically allocating channel resources in IoD networks using DRL. The authors utilize a long short-term memory method with a DQN algorithm to efficiently learn from previous network states and adjust to the IoD network dynamics. It is shown that their proposed method provides faster convergence and better performance in terms of reward, standard collision rate, and effective communication rate compared to Q-learning and DQN methods. However, the authors assume that the drones are static; this work can be extended by considering more dynamic network settings.
Summary: This subsection reviews the channel allocation problem in IoD-aided networks. We reviewed some papers that use DRL models to address various issues, such as access control and channel allocation. Several DRL techniques can be used to address these issues depending on the nature and dimensionality of the problem. It is observed in the reviewed papers that the DQN algorithm is the most widely used method to address the channel allocation issue. In addition, we observe that both single-agent and multi-agent DQN techniques are utilized. Furthermore, the DRL techniques in these papers are used in a non-centralized manner; however, they can also be deployed in a centralized manner.

D. ISSUES IN RIS-ASSISTED IOD
Reconfigurable intelligent surfaces (RISs) have become a promising technology to overcome various challenges in wireless networks, as shown in Fig. 4. RISs consist of an array of engineered reflecting elements with reconfigurable features, which can reshape the targeted signals [19]. Utilizing drones with RISs was investigated in [43] to show the advantages of passively relaying the data to the base stations in IoD networks. The authors formulate an optimization problem that aims to minimize the expected sum of the age of information by adjusting the phase shifts of the RISs, the transmission schedule, and the elevation of the drones. The authors propose a PPO-based DRL algorithm to solve the problem. The authors' results outperform other conventional methods in terms of the age of information. However, the authors assume that both the source and destination nodes are equipped with a single antenna, which might not always be the case in highly dynamic networks.
Furthermore, RISs provide energy and spectrum efficiency in wireless networks by modifying the phase shifts of the reflecting elements to improve signal reflection. For the same network settings as in [43], the authors in [3] study the advantages of integrating RISs in IoD networks. In their system model, RISs are installed on drones, and due to the drones' mobility, 3D signal reflection is achieved. The authors show that drone-carried RISs provide greater implementation flexibility, full-angle reflection, and dependable air-to-ground communication links compared to conventional static RISs. However, drones are constrained by their battery capacity, and it is also challenging to install RISs with a large number of reflecting elements on drones. Unlike the works in [3], [43], the authors in [44] propose an integrated system of RISs and drones that utilizes the drones' mobility and the RISs' reflection in order to improve the performance of IoD networks with energy harvesting. In particular, the implementation intends to increase energy efficiency by jointly optimizing the phase shifts of the RISs and the resource allocation of the drones. The authors propose a PPO-based DRL technique to solve the optimization problems considering a highly dynamic environment. In addition, the authors use a parallel learning technique to minimize the latency of data transmission. Simulation results show that their proposed method gives promising results compared with continuous convex approximation and a closed-form solution in solving the drones' route optimization and RISs' phase shift issues.
Summary: Integrating IoD networks with RIS technology allows more flexibility and reliability for the network. In this subsection, we focus on papers that study the applications of DRL methods for addressing some challenges in RIS-assisted IoD networks. The reviewed papers in this subsection address the issues of age of information and spectrum efficiency for RIS-aided IoD networks using the PPO algorithm. This DRL algorithm is used for its reliability in making optimal decisions in highly stochastic and dynamic environments. Both papers show the robustness and efficiency of DRL techniques compared with other state-of-the-art methods. In addition, for most deployments, RISs will be mounted on drones, which means that a multi-agent DRL method is adopted where each drone-carried RIS acts as an agent.

E. ISSUES IN TETHERED IOD
One use of drones within wireless networks is as aerial base stations, as shown in Figure 5. However, drones have limited battery capacity, which hinders this deployment. To overcome this limitation, the concept of tethered drones has recently been introduced, in which a direct physical link from a ground station supplies the drone with power and data [50]. The authors in [18] propose an implementation of tethered drones to maximize the throughput in multi-cell IoD networks. The problem is formulated as an MDP, and a multi-agent Q-learning technique is implemented to solve it. The authors evaluate the performance of their algorithm extensively in terms of specific rates, sum rates, computational complexity in the air-to-ground network, and fairness. Compared with elevation control schemes based on the Q-learning technique, arbitrary actions, and integrated Q-learning methods, the authors show the superiority of their proposed method. However, the authors in [18] adopt the Q-learning method, which typically leads to sub-optimal policies.
The concept of tethered drones is also being studied as a promising technique to overcome challenging scenarios such as environmental disasters. In [45], the authors extend the work in [18] to solve the energy constraint issue in highly dense tethered IoD networks. In order to efficiently utilize the spectrum, the authors propose a non-orthogonal multiple access (NOMA)-based scheme that assists users in similar resource blocks with inadequate channels. Inter-cell interference is a crucial issue in utilizing tethered drones due to the highly dynamic nature of IoD networks. To overcome this issue, the authors propose to use channel gain plus interference as a feature attribute. The authors then formulate the user association, elevation, and power of tethered drones as a joint optimization problem. The problem is then solved by developing a multi-agent DDPG algorithm. Simulation results show that their proposed algorithm outperforms the greedy algorithm. However, the authors assume that the tethered drones are static and have full knowledge of the system. This work can be enhanced by considering moving tethered drones with limited channel state information (CSI).
Summary: In this subsection, we review some papers that utilize DRL techniques to address some issues in emerging tethered IoD networks, such as fairness, computational complexity, and spectrum efficiency. In general, multi-agent settings with either value-based or policy-based algorithms can be utilized to learn the optimal policies for such complex tethered IoD networks, and both families of algorithms appear in the reviewed papers. This is due to the highly complex nature of the IoD environment.

F. ISSUES IN DIGITAL TWINS-ASSISTED IOD
The digital twin (DT) is a simulated implementation that replicates a physical entity in a virtual environment, as shown in Figure 6. This virtual replica is built from data acquired by several sensors, historical data, and real-time data from the physical entity [46]. The data acquired represent various attributes of the physical entity's performance. The DT technique has been widely deployed in different fields such as smart cities and intelligent manufacturing. DT is efficient when combined with machine learning algorithms, as it provides highly reliable state knowledge of the environment. In [46], the authors study the problem of flocking motion in a multi-drone-assisted environment using policy-based DRL methods. Flocking motion is a critical challenge in a multi-drone system, which requires the coordination of a swarm of drones from the beginning to the end of a route in a cooperative way. The conventional techniques implemented to solve the flocking motion issue require prior knowledge of system settings, which is not available in real environments. Therefore, DRL methods are used instead. The authors in [46] propose a DT-enabled DRL technique to solve the flocking motion issue. A DRL algorithm, called behavior-coupling DDPG, is then used to minimize collision and arrival rates. Their proposed solution provides promising results compared with conventional methods in terms of convergence and performance. However, this work assumes static drones, and it can be further extended to a moving drones scenario.
Mobile edge computing is an ecosystem that brings the resources of cloud computing to the edge of networks. DT can enhance the efficiency of mobile edge computing in high-mobility environments. Unlike the work in [46], the authors in [47] propose a smart task offloading scheme for IoD-assisted mobile edge computing networks with the support of DT. The authors' solution targets maximizing the energy efficiency of the network by jointly optimizing the drone's trajectory, mobile terminal users' involvement, communication power, and computation capability, subject to a maximum processing delay constraint. A double DQN technique is utilized to solve the mobile terminal users' involvement and drone trajectory issues, whereas the communication power and computation capability problems are solved using an iterative method. Simulation results show that the authors' proposed double DQN algorithm outperforms the Greedy and DQN techniques. However, the authors adopted value-based DRL approaches, which may lead to less accurate policies compared with policy-based approaches.
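To illustrate what distinguishes double DQN from the vanilla DQN baseline mentioned above, the following minimal sketch computes the double DQN learning target for a batch of transitions. It is not the implementation of [47]; the function name `double_dqn_targets`, its arguments, and the discount factor value are hypothetical and chosen only for illustration. The key idea is that the online network selects the next action, while the target network evaluates it, which reduces the overestimation of action values.

```python
def double_dqn_targets(rewards, next_q_online, next_q_target, dones, gamma=0.99):
    """Return double DQN targets for a batch of transitions.

    rewards:       list of immediate rewards, one per transition
    next_q_online: per-transition lists of Q-values from the online network
                   evaluated at the next state
    next_q_target: per-transition lists of Q-values from the target network
                   evaluated at the next state
    dones:         list of booleans, True if the transition ended the episode
    gamma:         discount factor (illustrative default, an assumption)
    """
    targets = []
    for r, q_on, q_tg, done in zip(rewards, next_q_online, next_q_target, dones):
        if done:
            # No bootstrapping from a terminal state.
            targets.append(r)
            continue
        # Action selection uses the online network ...
        best_action = max(range(len(q_on)), key=q_on.__getitem__)
        # ... while action evaluation uses the target network.
        targets.append(r + gamma * q_tg[best_action])
    return targets


if __name__ == "__main__":
    batch_targets = double_dqn_targets(
        rewards=[1.0, 0.5],
        next_q_online=[[0.2, 0.9], [0.4, 0.1]],
        next_q_target=[[1.0, 0.5], [0.3, 2.0]],
        dones=[False, True],
        gamma=0.9,
    )
    print(batch_targets)
```

In vanilla DQN, the target network would both select and evaluate the next action by taking the maximum of its own Q-values, so the first transition in the usage example would bootstrap from 1.0 rather than 0.5.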
Summary: This subsection reviews some papers that focus on DRL's applications for addressing challenges in digital twin-based IoD networks, namely flocking motion and performance enhancement of IoD-assisted mobile edge computing networks. Since conventional methods for both problems require prior knowledge of network statistics, they would not be able to overcome them, and DRL is an efficient alternative. The integration of DRL algorithms with digital twin technology has shown promising results compared with traditional techniques. In addition, depending on the problem type, value-based and policy-based algorithms in single-agent or multi-agent settings can be utilized, with the DRL agents installed on the servers where the digital twins are located.

G. ISSUES IN INTEGRATING V2X WITH IOD
Recently, drones have been integrated with vehicle-to-everything (V2X) communication to support various applications, such as traffic surveillance, disaster rescue, and data acquisition, as shown in Fig. 7. In IoD-assisted V2X networks, drones communicate periodically with vehicles and other network entities. Many issues are raised in such networks, such as the security of data exchange. Fu et al. [48] consider drone-to-vehicle communication networks that are vulnerable to eavesdropping attacks in urban scenarios. The authors aim to increase the physical layer secrecy levels subject to constraints on power utilization and flight zones. This is accomplished by optimizing the drones' routes, the drones' transmission power, and the jamming power of the roadside unit. The authors then formulate an optimization problem that takes into account the dynamic properties of the wireless links. A curiosity-driven DQN algorithm is then proposed to solve the problem. Compared to the vanilla DQN algorithm, their proposed algorithm is shown to provide more efficient performance. However, the authors only utilized a single drone and a single-agent algorithm, which could be extended to accommodate multiple drones in the system. In addition, the authors adopted a value-based DRL method, which may lead to sub-optimal policies.
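As a rough illustration of how a secrecy objective and a curiosity signal could be combined in a reward, the sketch below uses the standard physical layer secrecy rate (the legitimate link rate minus the eavesdropper link rate, floored at zero) and adds an intrinsic bonus proportional to the prediction error of a forward model. This is an assumption-laden example, not the formulation of [48]; the function names, the `scale` parameter, and the additive combination are hypothetical.

```python
import math


def secrecy_rate(snr_legit, snr_eve):
    """Secrecy rate in bit/s/Hz: [log2(1 + SNR_legit) - log2(1 + SNR_eve)]^+."""
    return max(0.0, math.log2(1.0 + snr_legit) - math.log2(1.0 + snr_eve))


def curiosity_bonus(predicted_next_state, actual_next_state, scale=0.5):
    """Intrinsic reward: scaled squared error of a forward-model prediction.

    A large error means the agent reached a state it could not predict well,
    which encourages exploration. `scale` is an illustrative weight.
    """
    squared_error = sum(
        (p - a) ** 2 for p, a in zip(predicted_next_state, actual_next_state)
    )
    return scale * squared_error


def shaped_reward(snr_legit, snr_eve, predicted_next_state, actual_next_state,
                  scale=0.5):
    """Extrinsic secrecy reward plus curiosity bonus (assumed additive form)."""
    return secrecy_rate(snr_legit, snr_eve) + curiosity_bonus(
        predicted_next_state, actual_next_state, scale
    )


if __name__ == "__main__":
    # Linear-scale SNRs: 15 on the drone-to-vehicle link, 3 at the eavesdropper.
    print(shaped_reward(15.0, 3.0, [0.0, 1.0], [1.0, 1.0]))
```

In such a design, the curiosity term typically matters most early in training, when the forward model is inaccurate, and shrinks as the agent's predictions improve.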
Research shows that drones can fill the communication gaps among ground vehicles, despite their non-optimal mobility, restricted energy resources, and limited communication. However, increasing the number of operated drones increases the rate of collisions among drones and makes them more difficult to control. In a different context from the work in [48], the authors in [49] propose an intelligent network of drones that are dispatched in an organized form to provide communication relays for V2X IoD networks. In particular, the authors aim to maintain connectivity while providing efficient coverage in the IoD network and minimizing the total energy consumption. The authors developed a DRL framework called DISCOUNT to solve the issue, which shows efficient performance compared to the DQN and Dueling DQN algorithms. However, the authors assumed that the agents control only a limited action space of the drones and that only minimal urban vehicular coverage is provided, assumptions that are difficult to satisfy in real settings.
Summary: This subsection reviews some papers that study DRL's applications for addressing some challenges in emerging networks that integrate IoD with vehicular networks. The reviewed papers discuss two main issues: efficient communication relays and secure communication. IoD-aided vehicular networks are highly dynamic and complex. Hence, the challenges raised in such networks are characterized by their complicated state space and time-varying nature. DRL can be efficiently used to manage such stochastic settings, and the reviewed papers in this subsection have demonstrated the efficiency of DRL methods compared to conventional methods. In general, single- and multi-agent DRL can address the issues in IoD-aided vehicular networks using both value- and policy-based algorithms. Table 3 summarizes the papers discussed in this section, indicating the issue addressed, design objective, domain, and DRL techniques used in each.

IV. RESEARCH DIRECTION
In this section, we provide insights into some promising research directions stemming from the works reviewed in this paper. These research directions are directly related to the issues discussed in Section III. More specifically, all the issues discussed in Section III also arise in the emerging applications, services, and IoD-assisted network types presented in the following subsections.

A. DRL FOR INTERNET OF QUANTUM DRONES
Recently, quantum drones have emerged as an interesting technology that can enhance the performance of real-time applications. They integrate quantum sensors, radars, and other quantum devices within the system [51]. In addition, quantum drones employ techniques such as quantum machine learning, quantum genetic systems, and secure quantum cryptography. Quantum drones transmit quantum signals for obscure communications and enhanced reliability and security. However, there are several issues in drone-based quantum computing that DRL can help address. For example, DRL can be used to address the loss of coherence among qubits or to allocate communication resources among quantum drones, both of which are promising research directions [52].

B. DRL FOR DIGITAL TWINS-ENABLED IOD NETWORKS
As mentioned previously, Digital Twins (DT) technology has recently attracted much research in various domains of wireless networks. However, DT research in the IoD is still in its early stage, and many issues still need more investigation. For example, developing advanced, resource-intensive DRL models based on DT to cope with the growth of IoD networks, as in [53], is a promising research direction that needs more exploration. In addition, developing DRL models based on DT to address path planning in IoD-aided networks, as in [47], is another promising research direction.

C. DRL FOR FREE SPACE OPTICAL SYSTEMS UTILIZING IOD NETWORKS
Free space optical (FSO) systems are envisioned to meet the growing need for bandwidth caused by emerging data-hungry wireless applications. However, FSO links are highly affected by atmospheric turbulence. In this context, IoD can be used as an efficient technology to assist FSO communication. For example, drones can be deployed as relays in FSO networks to overcome mobility and buffer restrictions. The authors in [54] conduct extensive research on this topic and observe that integrating drones with traditional relay-aided FSO systems enhances network performance. Research on integrating DRL models with such emerging networks requires more investigation. For example, DRL can be used as an efficient tool to enhance communication coverage by optimizing radio resource allocation and adjusting drones' constellations and trajectories in free space optical IoD networks. Early work in [55] utilizes DRL to optimize the coverage of cell-free space optical vehicular networks, which is a promising field of research.

D. DRL FOR INTERNET-SATELLITE-DRONE NETWORKS
The Internet-Satellite-Drone Network (ISDN) has emerged recently as a promising technology that integrates drones with satellite networks [56]. Several issues raised in this field require more research. For instance, jointly allocating and managing the radio resources of satellite networks, drone BSs, and ground BSs in ISDN using DRL is a major challenge that requires more investigation. Pioneering work is reported in [57], in which DRL is used to maximize the end-to-end data rate via packet forwarding between ground base stations through LEO satellites and fixed-wing drones.

E. DRL FOR DRONE-AIDED TERAHERTZ NETWORKS
The terahertz (THz) band is one of the main enabling technologies for ultra-broadband short-range communication in future wireless networks. Deploying drones with THz technology has raised several challenges and issues that can be addressed using DRL methods. For example, DRL can be used for drones' harmonization in order to enhance the coverage and rate in ultra-dense network (UDN) deployments of the THz band. DRL-based path planning to ensure spectrum efficiency in THz-aided drone networks is another promising research field.

V. CONCLUSION
This paper provided an overview of the applications of DRL techniques in IoD networks. We explained the advantages, types, and applications of IoD. Then, we reviewed some DRL algorithms widely used in addressing IoD problems. We then explained the main issues encountered in IoD networks and reviewed some of the existing papers in the literature to examine the merits of utilizing DRL to address these issues compared to conventional methods. Finally, we highlighted some promising research directions that are yet to be investigated.