Reinforcement Learning for Joint Detection and Mapping Using Dynamic UAV Networks

Dynamic radar networks (DRNs), usually composed of flying unmanned aerial vehicles (UAVs), have recently attracted great interest for time-critical applications, such as search-and-rescue operations, involving the reliable detection of multiple targets and situational awareness through environment radio mapping. Unfortunately, the time available for detection is often limited, and in most settings, there are no reliable models of the environment, which must therefore be learned quickly. One possibility to guarantee a short learning time is to enhance cooperation among UAVs. For example, they can share information for properly navigating the environment if they have a common goal. Alternatively, in the case of multiple and different goals or tasks, they can exchange their available information to suitably assign tasks (e.g., targets) to each network agent. In this article, we consider ad hoc approaches for task assignment, together with a multi-agent reinforcement learning algorithm, that allow the UAVs to learn a suitable navigation policy for exploring an unknown environment while maximizing the target detection accuracy. The obtained results demonstrate that cooperation at different levels accelerates the learning process and brings benefits in accomplishing the team's goals.


I. INTRODUCTION
Wireless sensor networks, either with terrestrial fixed [1] or dynamic [2] sensors, are widely used for data gathering, sensing, and communications. Among all possible applications, their monostatic and multistatic deployments have been investigated for radar localization and target detection [3].
A step forward has been the introduction of flying dynamic sensor networks, where sensors are integrated onboard unmanned aerial vehicles (UAVs) [4], [5]. A recent review on the use of UAVs for remote sensing, spanning from precision agriculture (e.g., forest monitoring) and urban environment and management (e.g., air traffic control) to disaster hazards and rescue (e.g., postdisaster assessment), can be found in [6] and the references therein. In all such situations, networks of UAVs can offer privileged views for gathering radio and vision-based data. Compared to terrestrial fixed networks, the advantages of using UAV-based networks lie in their flexibility, robustness to single-point failure, reconfigurability, and ability to maintain a line-of-sight (LOS) condition with users and other destination points. For example, in [7] and [8], swarms of coordinated UAVs equipped with ad hoc radar sensors are deployed to track a malicious target. Alternatively, UAVs have been used as network infrastructure for localization, communications, and other applications [9], [10], [11].
In this domain, an important line of research is the optimization of the UAV trajectory [12], [13]. In fact, unlike terrestrial sensors, UAVs frequently need to recharge their batteries, so all tasks and navigation must be optimized to avoid wasting time flying over areas of little interest from the mission perspective [4]. Moreover, UAVs must complete their mission within a finite time horizon, especially if they operate in time-critical applications. For example, in postdisaster situations, targets (e.g., victims) must be detected and localized as quickly as possible by rescuers aided by networks of autonomous UAVs [14]. Several recent papers have studied UAV trajectory optimization for wireless communication purposes, where UAVs are used either as flying base stations (BSs) or as users [15]. For example, in [16] and [17], the navigation goal was maximizing the communication rates of multiple concurrent cellular users' transmissions. Other contributions focus on localization [8], [18] or on the minimization of electromagnetic exposure [19].
Traditionally, the navigation control problem is solved by adopting model-based optimization, e.g., nonlinear programming or dynamic programming [33]. Thanks to the availability of a statistical model, the navigation problem can be written as the minimization (maximization) of a cost (information) function and is solved by relying on classic optimization tools. Usually, the formulated optimization problem also considers constraints for anticollision, obstacle avoidance, and energy consumption. For example, in [8] and [18], the UAV navigation problem is described as the minimization of the uncertainty of target positioning.
Unfortunately, empirical system models are often unavailable or, in some situations, unreliable due to fast-changing environments. To this purpose, machine learning (ML)-based approaches are of interest to learn a policy that achieves the desired objectives efficiently and in a data-driven fashion [20], [30], [34], [35], [36], [37], [38]. Among different ML approaches, reinforcement learning (RL) and deep RL have been used for UAV navigation policies because of their ability to learn directly by interaction with the surrounding environment [24], [39], [40], [41], [42], [43]. When the environment has a grid-world representation (e.g., indoors), Q-learning represents a simple and optimal solution because state-action pairs can be represented by a tractable Q-table that is updated at each time instant according to the received rewards [4], [14], [44]. Table I summarizes the use of tabular and deep Q-learning applied to UAV networks. The main disadvantage of tabular Q-learning is the curse of dimensionality that occurs for large state and action spaces (e.g., large environments) and leads to increased computational complexity and slow convergence [30]. The combination of deep learning with RL (deep RL) overcomes this issue by relying on neural networks (NNs) for Q-function representations [45]. However, most applications treat NNs as black boxes, and understanding and interpreting deep learning models remains challenging [46]. The lack of interpretability still requires comprehensive treatment, especially for dual-use technologies like those based on UAVs and for safety-critical applications. Moreover, having a tabular representation can help analyze the impact of different parameters and schemes on performance.
A way to accelerate the training of large Q-tables without relying on NNs is having agents cooperate with each other [47]. Different techniques have been proposed in the literature for cooperation, accounting for centralized and decentralized solutions [41], [48]. While centralized solutions usually permit a global view of the environment, decentralized solutions are more flexible but require more intelligent agents. Within the Q-learning framework, several cooperative approaches for sharing the learned experience within the agents' network have already been proposed in [49] and [50]. In [50] and [51], the authors considered distributed Q-learning approaches for multiple device access in massive machine-type communications scenarios, whereas in [48], the authors analyze a cooperative setting where some agents are more expert (and thus more informative) than others.
Given this background, and differently from the optimization objectives evidenced in Table I, in this article, we aim to study cooperative RL in a dynamic radar network (DRN) of UAVs whose tasks are to detect targets accurately and to enhance their ambient awareness by estimating an occupancy radio map. Cooperation among UAVs will be tackled in a twofold manner: 1) for task assignment, when agents within a DRN share different goals, e.g., detect multiple targets; and 2) for UAV navigation, when only a single target is present. Through such cooperation, UAVs take actions based on a "global" (network) shared knowledge and reduce the overall learning time, thus improving the network's performance.
The main contributions of this article can be summarized as follows.
1) We propose an ad hoc DRN architecture, composed of UAVs, for solving joint target detection and environment mapping tasks. 2) We investigate cooperative multi-agent Q-learning approaches for solving autonomous navigation of UAVs when agents share the same mission goal, so that the required mission time is reduced. 3) We investigate different approaches for task assignment when multiple and competing tasks are required during the mission. In particular, we consider either a simple received signal strength indicator (RSSI)-based solution, a random assignment, or a multiarmed bandit (MAB) scheme for properly managing the UAV-target assignment at specific time instants of the mission. 4) We demonstrate through a comprehensive case study, including terahertz (THz) mapping, the feasibility of the proposed DRN in various settings. In particular, we investigate the tradeoff between the mapping and detection performance while shortening the learning process, highlighting the benefits brought by cooperation.

Fig. 1. Block diagram for decentralized joint detection, mapping, and navigation. Green, red, and purple blocks indicate state perception, state estimation, and policy estimation, respectively. The environment is estimated by processing radar observations. More specifically, an RF sensor is used to gather RSS measurements for target detection, indicated as RSS i,k, and a scanning radar permits the collection of an angle-range matrix, denoted with e i,k, for environment mapping. The true state at time instant k is indicated with s i,k for the ith UAV. The set of targets and the estimated number of targets are denoted by T and T̂, respectively, whereas the belief of the environment map is denoted by b i,k. The policy estimation allows the UAV to select an optimized action a i,k. Such an action leads to a new state where the UAV collects a reward indicated with r i,k+1.
The rest of the article is organized as follows. Section II describes the problem formulation for the DRN, whereas Sections III and IV overview the considered navigation and cooperation approaches. Then, Section V reports the considered case study. Finally, Section VI concludes this article.

II. PROBLEM FORMULATION
In this article, we consider a DRN composed of UAVs that, by either collaborating or acting as independent learners, navigate to detect active targets in an unknown environment while reconstructing a probabilistic map of it. More specifically, UAVs have the following two tasks.
1) Primary (Extrinsic) Task: High-quality detection of multiple active targets. Practical examples are cooperative users that need to be rescued, or hidden malicious targets whose unwanted communication is sniffed within a certain frequency band. 2) Secondary (Intrinsic) Task: Estimation of an occupancy map of the explored area.
To accomplish them, each UAV performs the following steps, depicted in Fig. 1 and described in the following: sensing, state estimation, task allocation (for scenarios with multiple targets), and policy estimation. 1) Sensing: Each UAV is considered equipped with ad hoc low-cost and low-complexity sensors. A radar working at high frequencies (e.g., millimeter-waves or THz) can be accommodated in a small space despite the adoption of many antennas for accurate beam-steering operations. Such radars are useful for mapping, as they can collect a range-angle energy matrix to be processed by a mapping algorithm [52].
On the other hand, different sensors can be used for target detection, spanning from vision-based systems to radars. In the following, we will consider a radio frequency (RF) sensor able to measure the RSSI from a target, which can be discriminated from other targets through the detected packet ID [53]. Generally speaking, if other sensors gathering different types of data are on board, data-fusion techniques can be used to process the heterogeneous information. 2) State Estimation: The state comprises the UAV positions, an occupancy map of the environment, and the ID of an associated target. In our investigated scenario, the map is estimated using a Bayesian filtering approach, namely, an occupancy grid (OG).
Appendix A shows the basic principles of the adopted OG algorithm. The environment is discretized in cells, and each cell has a binary status (1 if occupied and 0 if free). The goal of the estimation is to infer the a posteriori probability mass function of the occupancy of each cell based on the history of radar measurements.
In addition, we assume to be able to distinguish the signals coming from different targets, as they use different tones of an orthogonal frequency-division multiplexing (OFDM) signaling scheme, provided that the signal-to-noise ratio (SNR) is above a certain threshold that allows us to decode the received signals and extract the sources' IDs. 3) Task Allocation: To avoid situations where several UAVs are likely to get closer to the same target, with the risk of missing other targets, a UAV-target allocation algorithm can be run either at each UAV or at the network level to better distribute the available resources. Next, we consider the following three solutions, described in Section IV-B. The random strategy considers a random UAV-target allocation. The independent RSSI strategy assigns each UAV to the target corresponding to the maximum received power. The cooperative MAB strategy uses a MAB formulation and a UCB-based approach. 4) Policy Estimation: Starting from the estimated state, each UAV should decide where to navigate to maximize joint detection and mapping performance and global network behavior. The functions that map states into actions are called policies. As a first step, each UAV acts as an independent learner, estimates its own policy, and takes a navigation decision. The navigation action drives the UAV to the next position, where an instantaneous reward is collected according to the goodness of the chosen action. Such a reward permits a first update of the policy. In collaborative settings, UAVs can share their knowledge with neighbors or with more expert UAVs (e.g., by exchanging Q-values). After such an exchange, the policy can be further updated. In this sense, in the rest of the article, we will focus on the capability to make informative navigation decisions, independently or cooperatively, according to multi-agent Q-learning.
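As a concrete illustration of the state-estimation step, the OG belief update can be sketched in a few lines. This is a minimal sketch under an assumed inverse sensor model (the probabilities below are placeholders, not the paper's radar model): each cell stores the log-odds of being occupied, so the Bayesian update of the belief reduces to a sum.

```python
import numpy as np

# Minimal occupancy-grid sketch. Each cell stores the log-odds of being
# occupied; fusing a new observation adds its log-likelihood ratio.
N_CELLS = 100
log_odds = np.zeros(N_CELLS)          # b_0(m_j) = 0.5  <->  log-odds = 0

def update_cell(j, p_occ_given_obs):
    """Fuse one observation; p_occ_given_obs is the (assumed) inverse sensor model."""
    log_odds[j] += np.log(p_occ_given_obs / (1.0 - p_occ_given_obs))

def belief(j):
    """Posterior occupancy probability of cell j."""
    return 1.0 / (1.0 + np.exp(-log_odds[j]))

update_cell(0, 0.8)   # two radar returns suggest cell 0 is occupied
update_cell(0, 0.8)
update_cell(1, 0.2)   # cell 1 looks free
```

Repeated consistent observations drive the belief toward 0 or 1, while unobserved cells stay at the uninformative prior of 0.5.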

A. System Model
We consider a set M = {1, . . ., i, . . ., M} of UAVs deployed in the environment, and a set T = {1, . . ., n, . . ., T} of targets' IDs to be discovered with certain reliability, i.e., the measured SNR should exceed a desired threshold (numerically set to SNR = 10 dB in our case study [56]). We divide the time into a sequence K of discrete time instants, upper bounded by K to take into account the limited UAV endurance.
1) State-Action Model: In our scenario, the state vector s i,k of each UAV at time k contains the UAV location, the map of the environment, and a detection variable, i.e., s i,k = [p i,k , m k , n i,k ], where p i,k is the UAV position, m k is the true map at time k, described as a vector of the N cell cells in which the map is discretized, and n i,k contains the target ID associated to the ith UAV, which can be empty in case no target is associated to the considered UAV. The environment is assumed stationary, so that m k = m, ∀k. In the navigation algorithm, we consider that the state coincides with the UAV position, that is, s i,k = p i,k . We also assume that the UAVs move in a grid of step Δ, so that positions are constrained to the grid vertices. Consequently, for a single-agent case, the state space considered for RL navigation purposes has cardinality |S| = N cell . Similarly, the UAV navigation actions can be defined as a i,k = Δp i,k , where Δp i,k is a position displacement in a continuous space. As before, since the UAVs are constrained to move in the grid with only four available actions, the action space is A = {right, left, up, down}, so that |A| = 4.
2) Observation Model for Target Detection: The UAVs initially sense the environment through a detection module whose intent is to reveal the presence of a collaborative target that periodically broadcasts a beacon in the environment. Then, if the received packets are correctly demodulated, the UAVs collect a vector with the RSSIs that, for time instant k, is RSS i,k = [RSS n,i,k ], n ∈ T̂ i , where T̂ i refers to the set containing the target IDs detected by the ith UAV, with cardinality T̂ i , and RSS n,i,k is the RSSI measured from the nth target at time instant k, where we assume that the duration of the beacon is less than the interval between k and k + 1. Here, the RSS n,i,k (in dBm) is modeled according to a log-normal power loss model as follows [58]:

RSS n,i,k = k 0 − 10 α pl log 10 (d n,i,k /d 1 ) + S h − 1 i,n,k L NLOS

where α pl is the path-loss exponent, d n,i,k is the distance between the ith UAV and the nth target at time k, S h models the shadowing effect, here considered normally distributed as S h ∼ N (0, σ s ²), with σ s being the shadowing spread, and where k 0 = P n + G n + G i + 20 log 10 (λ/(4π d 1 )) is the received power at d 1 = 1 m, with λ being the wavelength, P n the nth target's transmitted power, and G n (G i ) the transmitting (receiving) antenna gain; 1 i,n,k is an indicator function set to one if there is non-line-of-sight (NLOS) between the target and the UAV at time instant k, and L NLOS is the additional attenuation due to the blockages creating the NLOS condition [14], [59], [60]. Note that RSS n,i,k is a function of the UAV and target positions and of the distance d n,i,k , and in the following, it is used for defining the rewards. Next, we describe RL in DRNs for navigation, and then, we discuss cooperation in Section IV.
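Before moving on, the log-distance RSS model above can be sketched as follows. The default constants (2.4 GHz carrier, transmit power, antenna gains) echo the case study but are assumptions of this sketch, not a definitive implementation.

```python
import math
import random

# Illustrative log-distance RSS model with log-normal shadowing and an
# optional NLOS penalty; all default values are assumptions for the sketch.
C = 3e8  # speed of light (m/s)

def rss_dbm(d, p_tx_dbm=10.0, g_tx_dbi=0.0, g_rx_dbi=0.0, alpha_pl=2.0,
            sigma_s=1.7, nlos=False, l_nlos_db=30.0, f_hz=2.4e9, rng=None):
    """RSS (dBm) at distance d (m): k0 - 10*alpha*log10(d) + shadowing - NLOS loss."""
    rng = rng or random
    lam = C / f_hz
    # k0: received power at the reference distance d1 = 1 m (free-space term).
    k0 = p_tx_dbm + g_tx_dbi + g_rx_dbi + 20 * math.log10(lam / (4 * math.pi))
    shadowing = rng.gauss(0.0, sigma_s)
    return k0 - 10 * alpha_pl * math.log10(d) + shadowing - (l_nlos_db if nlos else 0.0)
```

With a path-loss exponent of 2, every decade of distance costs 20 dB, and the NLOS indicator subtracts a fixed extra attenuation.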

III. NAVIGATION POLICY A. Single-Agent Markov Decision Process
A Markov decision process (MDP) is defined by the tuple (S, A, R, P), comprising the state space S, the action space A, the reward space R, and the probability P of transitioning from one state s k , at time instant k, to the next state s k+1 [57]. The actions of the ith agent are selected according to a policy π i (a i,k |s i,k ), which is the conditional probability mass function of the action. The optimal policy is the one that maximizes the value function, i.e., the expected sum of discounted rewards over all possible policies,

v π (s) = E π [ Σ k≥0 γ^k R k+1,i | S 0,i = s ]

with 0 ≤ γ ≤ 1 being the discount rate, and where {R k,i , S k,i , A k,i } are the random variables for the ith agent related, respectively, to rewards, states, and actions at time instant k, taking values in {R, S, A}. The expected reward at time instant k + 1 for the state-action pair (s, a) is r(s, a) = E[R k+1,i | S k,i = s, A k,i = a]. Optimal policies share the same optimal action-value function, defined over all policies π as Q*(s, a) = max π Q π (s, a) [57]. Q-learning is an off-policy temporal-difference (TD) control algorithm where the policy is learned at run-time while the UAV navigates the environment. It is a model-free tabular approach with the possibility of choosing a random action. The simplest solution is represented by the ε-greedy approach [57], [61], [62], where a random action is selected with probability ε. Other variants of these approaches account for exploration only at the beginning (ε-first strategy) or for a time-decaying exploration (ε-decaying strategy) to converge to a quasi-optimal solution. The advantages of using TD methods instead of Monte Carlo or dynamic programming are that there is no need for a model, and an update of the return (i.e., cumulative rewards) is made at each time step.
For discrete states and actions, the Q-value in (5) can be represented by a Q-table that, at each time instant and for each agent, is updated by [57]

Q(s k , a k ) ← Q(s k , a k ) + α [ r k+1 + γ max a Q(s k+1 , a) − Q(s k , a k ) ]

where α is the learning rate, and the max operator is used to have a greedy policy. In this case, the learned action-value function directly approximates the optimal action-value function in (7), independently of the policy being followed.
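The tabular update above can be exercised end to end on a toy problem. The 5-state chain environment below (move right to reach a terminal reward) is an assumption of this sketch, used only to demonstrate the ε-greedy selection and TD update, not the paper's navigation scenario.

```python
import random
import numpy as np

# Minimal tabular Q-learning sketch: epsilon-greedy selection plus the TD
# update Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a)).
N_STATES, N_ACTIONS = 5, 2          # toy chain; actions: 0 = left, 1 = right
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1
Q = np.zeros((N_STATES, N_ACTIONS))

def choose_action(s, rng):
    """Epsilon-greedy selection with random tie-breaking."""
    if rng.random() < EPS:
        return rng.randrange(N_ACTIONS)          # explore
    best = np.flatnonzero(Q[s] == Q[s].max())
    return int(rng.choice(list(best)))           # exploit

def td_update(s, a, r, s_next):
    Q[s, a] += ALPHA * (r + GAMMA * np.max(Q[s_next]) - Q[s, a])

rng = random.Random(0)
for _ in range(500):                             # training episodes
    s = 0
    while s != N_STATES - 1:                     # rightmost state is terminal
        a = choose_action(s, rng)
        s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        td_update(s, a, r, s_next)
        s = s_next
```

After training, the greedy policy read off the Q-table moves right in every nonterminal state, illustrating how the table converges without any environment model.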

C. Navigation Rewards
One of the most important aspects when adopting RL is the reward shaping that drives the agents' behavior in the desired manner [63], [64].To this purpose, we recall that the UAV network has a primary (extrinsic) and a secondary (intrinsic) task and associated rewards.
The extrinsic rewards are usually task-specific, and they map a state-action pair to a real-valued reward, whereas the intrinsic rewards depend on the world's state only indirectly, through the beliefs estimated by the UAV about such a state [63].
Thus, as in [62], we can write the reward as a weighted sum of extrinsic and intrinsic rewards,

r i,k+1 = η r d i,k+1 + ξ 1 r c i,k+1 + ξ 2 r m i,k+1

where r d i,k+1 , r c i,k+1 , and r m i,k+1 relate to the target detection, the mapping coverage, and the mapping accuracy, respectively, and η, ξ 1 , and ξ 2 are the related weight coefficients whose impact will be studied and discussed in the numerical results.
The detection reward is expressed as a function of the RSSI measured from the assigned target as

r d i,k+1 = RSS n i ,i,k+1 / RSS max    (11)

where RSS n i ,i,k+1 is the measured RSSI from the n i,k+1 th target associated to the ith UAV at time k + 1, and RSS max is the RSSI a UAV would experience at a distance of a single cell (i.e., the minimum possible distance) from a target. The mapping coverage reward is given by

r c i,k+1 = Σ j∈I i,k+1 I( j ∈ D i,k+1 )    (12)

where I is the indicator function, i.e., I(x) = 1 if x is true and 0 otherwise, D i,k+1 ⊆ I i,k+1 indicates the cells visited for the first time at time k + 1, whereas I i,k+1 represents the set of indices of all the cells illuminated by the ith UAV at the same instant k + 1. In other words, the higher the number of cells visited for the first time, the higher the reward. Finally, r m i,k+1 is defined as a function of the beliefs b i,k+1 (m j ), where b i,k+1 (m j ) is the belief of the occupancy state of the jth cell as predicted by the ith agent at time slot k + 1.
Notably, this reward aims to push actions that minimize the uncertainty of the map in the shortest possible time.
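The weighted combination of the three reward terms can be sketched as below. The weight values and the exact form of the accuracy term are illustrative assumptions (here, a negative penalty proportional to residual map uncertainty), not the paper's tuned coefficients; RSS values are taken on a linear scale so the detection ratio stays in [0, 1].

```python
# Sketch of the weighted extrinsic/intrinsic reward combination; eta, xi1,
# xi2 and the uncertainty penalty are placeholder assumptions.
def total_reward(rss, rss_max, n_new_cells, map_uncertainty,
                 eta=1.0, xi1=0.5, xi2=0.5):
    r_d = rss / rss_max            # detection: normalized received strength
    r_c = float(n_new_cells)       # coverage: cells illuminated for the first time
    r_m = -map_uncertainty         # accuracy: reward shrinking map uncertainty
    return eta * r_d + xi1 * r_c + xi2 * r_m
```

Tuning eta against xi1 and xi2 trades detection accuracy against mapping quality, which is exactly the tradeoff explored in the numerical results.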
Note that obstacles are assumed to be detected with (proximity) sensors that allow the agent to avoid them by including numerical penalties in the Q-table.

IV. MULTI-AGENT COOPERATION
According to Fig. 1, cooperation can be intended in a twofold manner. In the first one, if a group of UAVs shares a common (detection) task, an exchange of Q-values can speed up the mission completion. In the second one, if multiple (detection) tasks should be completed by the network, then the cooperation can be implemented through task assignment between UAVs.

A. Common Task: Q-Sharing
When UAVs are networked and share a common (detection) task, they coordinate together for navigation to achieve the mission goal more rapidly than if they operate independently.In this section, we extend the single-agent framework of Section III-A to the multi-agent case.
According to Fig. 2, two types of UAVs are conceived: 1) independent learners; and 2) cooperative learners that can work either in a distributed or in a centralized manner.More specifically, we have the following.
1) Independent Learning (IL): Each UAV finds the best policy in an independent way by solving the optimization problem in (4) for its local Q-table, i.e., Q i (s i,k , a i,k ). Thanks to the first stage of single-agent Q-learning, the UAVs select their own action and move to the next position (i.e., p i,k+1 ), where they collect an instantaneous (sample) reward r i,k+1 .
2) Centralized Cooperative Learning (CC): In this case, the UAVs share the same Q-table, that is, Q 1 = Q 2 = · · · = Q M = Q. Thus, each UAV indirectly knows what has been experienced by the others through the shared Q-table, which is updated by the ith agent using the Q-learning rule of Section III. Note that while actions are selected onboard by each UAV, the Q-table is shared among all of them (e.g., it can be stored at an edge or cloud node), and this allows the UAVs to make more informed decisions. This approach has the same disadvantages as any centralized architecture, i.e., a low degree of robustness and the need to communicate with a central node or share all the Q-tables.
3) Distributed Cooperative Learning (DC): In this case, UAVs share some learning information (e.g., Q-values) with other UAVs. This allows them to update their own Q-tables by considering the knowledge acquired by others. Each agent updates its own Q-table according to a specific cooperative function, i.e., Q i ← f (Q i , {Q j } j∈M i,k ), where M i,k is the set of UAVs within the communication range of the ith agent (also indicated as "neighbors"). In what follows, we consider M i,k = M i , ∀k.

a) Distributed Cooperation With Maximum Q-values:
Each agent i updates its Q-table by substituting each Q-value with the related best Q-value among all the Q-tables of neighboring agents. By omitting the temporal index, for each state-action pair, the Q-value is updated by

Q i (s, a) = max j∈M i ∪{i} Q j (s, a)    (15)

This approach might suffer because negative rewards are neglected, even though they are important, as they can prevent other agents from repeating the same mistakes. Thus, a possible alternative is to account for the absolute value of the Q-values (BestAbs-Q). In this case, (15) becomes

Q i (s, a) = Q j* (s, a), with j* = arg max j∈M i ∪{i} |Q j (s, a)|

∀ i ∈ M, s ∈ S, a ∈ A. In this way, the best values of the neighbors' Q-tables are mixed with the current Q-values of the ith agent, such that new and past information is balanced. Since there are different ways to conceive distributed cooperative approaches, in the following, we highlight different possible techniques and implementations of the cooperative function f (·), according to the work in [68].
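The two merge rules above can be sketched as elementwise operations over the neighbors' Q-tables. This is a simplified reading: any additional blending with the agent's own values that the full scheme may apply is omitted here.

```python
import numpy as np

# Sketch of the distributed merge rules. q_tables is a list of Q-tables of
# identical shape (the agent's own plus its neighbours').
def merge_max_q(q_tables):
    """Max-Q: take the elementwise maximum Q-value over all tables."""
    return np.max(np.stack(q_tables), axis=0)

def merge_best_abs_q(q_tables):
    """BestAbs-Q: keep, per state-action pair, the value of largest magnitude,
    so that strongly negative experience (mistakes) is propagated too."""
    stacked = np.stack(q_tables)
    idx = np.argmax(np.abs(stacked), axis=0)
    return np.take_along_axis(stacked, idx[None, ...], axis=0)[0]
```

On a pair like Q-values 1 and -5 for the same state-action, Max-Q keeps 1 while BestAbs-Q keeps -5, preserving the warning carried by the negative experience.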

B. Multiple Tasks: UAV-Target Association
We refer to task allocation as the target-UAV association procedure that maximizes the number of discovered targets in a given mission time.
Let L = {1, 2, . . ., ℓ, . . ., L} be the set of all possible associations. The output of the assignment is a vector containing the estimated association. Such a vector is part of a matrix A of size L × M, where L = |L| is the cardinality of L. Each matrix entry is an index identifying the target associated with the considered UAV.
We recall that the term "discovered target ID" refers to a label associated with targets detected with a received power larger than a predefined threshold ξ as defined in (1).Moreover, the task of target discovery ends when the aforementioned targets are detected with an RSSI higher than RSS max .In the following, we describe the proposed solutions.
1) Cooperative UAV-Target Association: The cooperative solution is based on MAB. MAB is a sequential decision process, equivalent to a one-state MDP, whose objective is the maximization of the cumulative payoff/reward obtained in a sequence of decisions [69], [70]. Indeed, differently from Q-learning, it is defined by a reward and a set of actions (i.e., arms), but it does not entail the concept of state transitions. Among the possible solutions for the MAB problem, the UCB allows picking arms by solving the exploration-versus-exploitation dilemma in closed form [69], [71]. In the following, we propose an ad hoc UCB-based approach such that the target-UAV association is performed in a fashion that avoids assigning the same target to multiple UAVs.
The UCB-based algorithm consists of four phases: 1) initial; 2) explore-all; 3) training; and 4) association, and it works as follows [72]. During the initial phase, the UAVs perform an initial measurement campaign with the intent of revealing the presence of a target through its transmitted beacon, as described in Section II-A2. At the end of this phase, all the UAVs combine the gathered information, and a set T̂ = ∪ i∈M T̂ i of detected target IDs is created, with cardinality T̂ = |T̂|. The number of arms (i.e., of possible associations), indicated with L = |L|, is determined as a function of T̂ and M through a factorial expression, where x! indicates the factorial of x. Note that if the number of targets is below the number of agents, some UAVs will remain without targets and dedicate their efforts only to environment radio mapping. On the contrary, if the number of targets is larger than the number of UAVs, some targets will be served later.
After the initial procedure, with the creation of T̂, there is a training phase, namely, the Explore-All phase, where all the arms ℓ ∈ L are played once. For each arm ℓ, played at instant t = ℓ, we observe a global reward r ℓ , and we update the average reward μ ℓ due to the choice of arm ℓ, where N ℓ is the number of times the ℓth arm has been selected, which, for this phase, is equal to 1 (i.e., μ ℓ = r ℓ ), and r ℓ,i is evaluated according to (11).
After the Explore-All phase, the UCB solves the tradeoff between exploration (e.g., choosing a random action) and exploitation (e.g., choosing an action according to the collected information) "in closed form" during the Training phase. By assuming Gaussian distributions and a known standard deviation of the rewards, the objective reduces to computing the best estimate of the mean reward in order to pick the best arm. By accounting for τ training instants, at the tth training time, with L < t ≤ τ, the UAV-target allocation action is performed by picking the ℓth arm as

ℓ t = arg max ℓ∈L [ μ ℓ + √(2 ln(t)/N ℓ ) ]    (20)

where μ ℓ , defined as in (19), is the estimated average cumulative reward collected so far for arm ℓ, updated from the rewards r ℓ,i associated to each agent i due to choice ℓ, with 1(x) the indicator function defined as in (12). The rationale behind the utility function in (20) is to find a tradeoff between exploitation and exploration based on time and on the number of times an arm has been chosen. The exploration part is included through the term √(2 ln(t)/N ℓ ): when N ℓ is large, this term goes to zero, and the agent can rely on its already acquired knowledge. Concurrently, the exploitation term described by μ ℓ becomes more accurate as time passes.
Finally, during the so-called Association phase, the arm selected for the association is ℓ̂ = ℓ τ , where ℓ τ is computed as in (20).
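The UCB arm choice and the running-average update above can be sketched as follows; the function names are illustrative, and the sketch assumes the Explore-All phase has already played each arm once.

```python
import math

# Sketch of the UCB rule: pick the arm maximizing the estimated mean reward
# plus an exploration bonus that shrinks as the arm is played more often.
def ucb_pick(mu, n_pulls, t):
    """mu[l]: average reward of arm l; n_pulls[l]: times arm l was played (>0)."""
    scores = [mu[l] + math.sqrt(2.0 * math.log(t) / n_pulls[l])
              for l in range(len(mu))]
    return max(range(len(mu)), key=lambda l: scores[l])

def update_arm(mu, n_pulls, l, reward):
    """Incremental running average of the reward collected for arm l."""
    n_pulls[l] += 1
    mu[l] += (reward - mu[l]) / n_pulls[l]
```

With equal means, the arm pulled less often wins thanks to its larger bonus, which is exactly how the rule forces occasional exploration of rarely tried associations.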
To avoid the need to run the UCB continuously, an intermittent MAB mode can be used, enabling UAVs to make assignments only at a few time instants to save power.
2) Independent UAV-Target Association: The previous approach allows the coordination of multiple UAVs for assigning separate tasks, but it might entail the exchange of several messages between each UAV and a central node, which can be a UAV of the network or an edge/cloud. This, however, can imply extra power consumption.
To overcome this limitation, a viable and simple solution is that each UAV decides on its own which target to associate with, according to the measured environmental characteristics. In this sense, the simplest and most intuitive solution is that each UAV picks the target it reveals with the highest RSSI, even though other agents might have the same goal. Such an association rule aligns with the work in [67] and [73].
When operating in this way, two aspects merit attention. The first is that the entire swarm of agents might go toward the same goal, neglecting other targets that need support. On the other hand, this solution can also be adopted as a backup plan if the network's connectivity prevents using the MAB approach.
Thus, omitting the time index k, in this case, the association is defined by a scalar given by

{A} 1,i = arg max n∈T̂ i RSS n,i    (23)

where RSS n,i is defined as in (1), and A contains a single row.
In particular, the arg max operation in (23) searches for the highest RSS value inside the vector RSS i and returns the corresponding index.
3) Random UAV-Target Association: The third approach entails the adoption of a completely random target-UAV assignment procedure, which can also decide that a UAV is assigned only to exploration. Each matrix element is set as {A} 1,i = ⌊χ⌋, where χ ∼ U(1, T̂ i ) is a uniformly distributed random variable taking values between 1 and T̂ i , and ⌊·⌋ is the rounding-down (floor) operator.
On the one hand, this approach is not efficient, especially when an agent is close to a target but is assigned to another one. On the other hand, it also exploits the concept of exploration in resource allocation, which can be beneficial when the accumulated experience is insufficient to make informed decisions.
Note also that A is a single vector for the last two approaches, as the task assignment does not require any learning procedure but is essentially based on an independent choice.
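The independent RSSI rule of (23) and the random assignment reduce to one line each. This is a minimal sketch; the dictionary-based interface is an assumption of the example, not the paper's data structure.

```python
import random

# Sketch of the two learning-free association rules.
def associate_by_rssi(rss_i):
    """Independent rule (23): pick the target ID with the strongest RSSI.
    rss_i maps detected target IDs to measured RSSI values (dBm)."""
    return max(rss_i, key=rss_i.get)

def associate_randomly(target_ids, rng=None):
    """Random rule: pick any detected target ID uniformly at random."""
    rng = rng or random
    return rng.choice(list(target_ids))
```

Both rules need no coordination messages, which is why they serve as backup plans when connectivity prevents running the cooperative MAB scheme.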

V. NUMERICAL RESULTS
In this section, we provide a numerical investigation of the proposed techniques for task assignment and navigation of a network of agents exploring an unknown indoor environment while detecting the presence of targets. The mission time is described by episodes of duration K, so that the generic discrete-time index is given by t = (e − 1)K + k ∈ {1, 2, . . ., K N ep }, where k ∈ {1, 2, . . ., K} is the discrete-time index within each episode and e ∈ {1, 2, . . ., N ep } is the index of a generic episode. We accounted for typical values of the shadowing standard deviation, namely, 1.7 dB and 3 dB in LOS and NLOS situations, respectively [58]. The path-loss exponent was set to 2. Moreover, in the presence of NLOS, in our case study, we considered an attenuation of L NLOS = 30 dB.
A. Simulation Set-Up

1) Onboard Sensors: For this case study, we used a sub-THz multiple-antenna radar for environment mapping and a radio receiver for active target detection. The sub-THz radar was equipped with a planar squared array of 100 antennas working at 140 GHz with a maximum gain of 26 dBi. The transmitted power and the signal bandwidth were set to 5 dBm and 1 GHz, respectively, whereas the observation window (time frame) was set to 50 ns in accordance with the size of the environments. We considered 25 steering directions with angles between −60° and 60° with a step of 5°. Finally, the radar noise power was set to −80 dBm. Please refer to Appendix A for further details.
The RF receiver for target detection worked at a central frequency of 2.4 GHz with a bandwidth of 120 MHz [53]. The transmitted power of each target was 10 dBm, and the receiver noise power was set to −90 dBm. Finally, we assumed that each UAV is equipped with a conventional omnidirectional antenna with a gain of G = 5 dBi.

2) Figures of Merit: Next, we assess the source detection performance in terms of the rate of experiencing a certain SNR regime [success rate (SR)]. Let us define the time instant at which the mission can be considered successfully completed as

k* = min{ k : SNR_{i,k} ≥ ζ, ∀ i },  if T = 1
k* = min{ k : SNR_{n,k} ≥ ζ, ∀ n ∈ {1, ..., T} },  if T > 1    (24)

with ζ being a threshold set to guarantee a reliable detection rate (i.e., the expected SNR at a distance of 2 m from the source). More specifically, the condition for T = 1 in (24) is valid for the single-target scenario and indicates that all UAVs should experience an SNR above a certain threshold to reach the mission goal. Instead, the condition for T > 1, valid for the multiple-target case, indicates that the mission ends when all the targets are detected with an SNR over the threshold. Then, for each episode e, we define the SR as

SR(e) = (1/N_MC) Σ_{q=1}^{N_MC} 1{ k*_q(e) ≤ K }    (25)

where N_MC is the number of Monte Carlo iterations, 1{·} is the indicator function, and k*_q(e) is the mission completion instant of episode e in the qth Monte Carlo run.
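As a sketch of how the SR statistic can be accumulated over Monte Carlo runs, one may proceed as follows. The aggregation into a single worst-case SNR trace per run is a simplifying assumption made here for illustration; the paper's condition in (24) involves all UAVs or all targets:

```python
import numpy as np

def success_rate(snr_runs, zeta):
    """Empirical success rate over Monte Carlo runs.

    snr_runs: array of shape (N_MC, K) holding, for each run and each
    step of the episode, the worst-case SNR (dB) across the UAV-target
    pairs (hypothetical aggregation of the condition in (24)).
    zeta: detection SNR threshold.
    Returns the fraction of runs in which the threshold was met at
    some step of the episode.
    """
    success = (snr_runs >= zeta).any(axis=1)  # per-run mission success
    return success.mean()

# Toy example: 4 Monte Carlo runs, 5 steps each, threshold 10 dB.
snr = np.array([[2., 6., 11., 12., 13.],
                [1., 3.,  5.,  7.,  9.],
                [0., 12., 14., 15., 16.],
                [4., 5.,  8.,  9., 10.]])
print(success_rate(snr, zeta=10.0))  # → 0.75
```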
To obtain a quantitative evaluation of the mapping performance, we consider the image similarity (IS) index defined as [52], [74]

ψ(m_1, m_2, c) = (1/N_c) Σ_{i : m_{1,i} = c} min_{j : m_{2,j} = c} d_M(m_{1,i}, m_{2,j})    (26)

IS(m, m̂) = Σ_{c ∈ C} [ψ(m, m̂, c) + ψ(m̂, m, c)]    (27)

where m is the actual occupancy map, taking values m_j ∈ C = {0, 1}, ∀ j ∈ M, with M being the set of all grid cell indexes, m̂ is the estimated map, N_c is the number of times a cell in map m_1 has the occupancy value c, and d_M(m_{1,i}, m_{2,j}) is the Manhattan distance between the ith cell of map m_1 and the jth cell of map m_2, both having the same occupancy value c. The mapping metric in (27) can be computed for each UAV and at each Monte Carlo cycle.

3) Simulation Environment: The reference scenario is displayed in Fig. 3, where the color of each cell represents its occupancy value: empty cells are displayed in white, whereas occupied cells are in black. To describe complete uncertainty about the map status, we initially set b_0(m_j) = 0.5, ∀ j ∈ M. The navigation task was solved by running a multi-agent tabular Q-learning where the learning parameters were set to α = 0.99 and γ = 0.9, and the probability of taking a random action, i.e., ε, was modeled as a time-decaying function to favor exploitation over exploration as experience accumulates (similar to decayed epsilon-greedy [75]). For this reason, we considered the empirical strategy reported in Table II. We fixed the mission time K of each episode to 150 and the number of episodes N_ep to 100. The training episodes allowed UAVs to leverage prior knowledge acquired through time and experience. We denote with ξ = ξ_1 = ξ_2 the mapping weight.
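A possible implementation of the IS computation, following the common formulation based on average Manhattan distances between same-occupancy cells, is sketched below; the exact symmetrization used in [52], [74] may differ slightly:

```python
import numpy as np

def psi(m1, m2, c):
    """Average Manhattan (L1) distance from each cell of m1 with
    occupancy value c to the nearest cell of m2 with the same value."""
    cells1 = np.argwhere(m1 == c)
    cells2 = np.argwhere(m2 == c)
    if len(cells1) == 0 or len(cells2) == 0:
        return 0.0
    # Pairwise L1 distances between grid coordinates via broadcasting.
    d = np.abs(cells1[:, None, :] - cells2[None, :, :]).sum(axis=2)
    return float(d.min(axis=1).mean())

def image_similarity(m_true, m_est):
    """Symmetric IS score over the occupancy classes C = {0, 1}:
    0 for a perfect reconstruction, growing as the maps diverge."""
    return sum(psi(m_true, m_est, c) + psi(m_est, m_true, c) for c in (0, 1))

m_true = np.array([[0, 0, 1],
                   [0, 1, 1],
                   [0, 0, 0]])
print(image_similarity(m_true, m_true))  # → 0.0
```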

B. Performance in Single-Target Scenario
We first consider a scenario with only one target, and the UAVs of the DRN cooperate to reduce the learning time. In such settings, since the mission goal is common to all the network UAVs, no task allocation phase is performed. In Figs. 4 and 5, we report the estimated occupancy maps and trajectories during the last episode by considering a single Monte Carlo realization. Rewards were set to optimize the detection of the source, i.e., η = 1 and ξ = 0 (see Fig. 4), or for joint detection and mapping, i.e., η = 1 and ξ = 0.3 (see Fig. 5). We tested the following configurations with two UAVs: 1) independent learning (IL, top), where each UAV has and updates its own Q-table; 2) centralized learning (middle), where each UAV updates a common Q-table; and 3) distributed learning (bottom), where the UAVs cooperate by exchanging their Q-values (BestAbs-Q). It is possible to notice that when centralized or distributed cooperation is performed, all the UAVs successfully complete the mission.

Fig. 6. Q-table for the distributed cooperation with BestAbs-Q with η = 1, ξ = 0. The true map is juxtaposed in white. The initial UAV positions are white markers with blue edges; the source is indicated with a white marker with a red contour. The UAV trajectory is displayed with white markers and green edges. Colors represent the intensity of the Q-values: a color tending toward red indicates that a UAV located at the considered cell and choosing the action indicated in the title received a good reward based on past experiences, whereas colors tending toward blue indicate state-action pairs that did not lead to high rewards in previous episodes.

As expected, in Fig. 6, UAV trajectories are mainly driven by the target detection, and hence, the cells with higher values minimize the distance from the target. Instead, in Fig. 7, the path toward the target leads to lower rewards because the mapping reward penalizes navigation over the same trajectories in favor of exploring new areas. In Figs. 8 and 9, the Q-tables are displayed as functions of the cooperation scheme for ξ = 0 and ξ = 0.3, respectively. As can be noticed, independent learners update only a local part of the Q-table, whereas, when cooperation is performed, UAVs can opt to follow paths explored by others that lead to higher rewards. We now investigate the performance averaged over Monte Carlo iterations. To this end, we set the number of simulations to 50. Fig. 10 reports the IS score for two agents and for different cooperation strategies and values of ξ. Continuous and dashed lines refer to a radar (maximum) range of 7.5 m and 3.75 m, respectively. The IS index diminishes over time as the map reconstruction accuracy improves. By contrast, the choice of the cooperation strategy does not significantly impact this metric. Finally, a variation of ξ does not affect the mapping performance when the radar range is comparable to the environment size, whereas some changes can be perceived when the radar has a limited illumination capability. Fig. 11 reports the SR as a function of time for different rewards, that is, for joint detection and mapping rewards with ξ = 0.3. As expected and confirmed by the estimated trajectories, cooperation among the UAVs is beneficial in reducing the time needed to find a single source. Indeed, with cooperation, 80 episodes are sufficient to accomplish the mission in more than 85% of cases.
Fig. 12 depicts the SR obtained by varying the mapping weight and the cooperation scheme. Fig. 12-top confirms that centralized or distributed cooperation is beneficial for speeding up target detection. Fig. 12-bottom instead puts in evidence that when the radar has a limited reading range (e.g., 3.75 m instead of 7.5 m), the mission cannot be successfully accomplished even in the presence of cooperation, which has almost no impact (the SR is always below 40% at the end of the mission in all cases). Note also that for ξ = 1, Fig. 12-top shows that the detection performance worsens because the UAVs do not focus only on the primary task but also on mapping. Conversely, when the reading range (RR) is reduced (see Fig. 12-bottom), having ξ = 1 helps the UAVs privilege the exploration phase, with an increased likelihood of finding the target. In Fig. 13, performance is compared by accounting for the different Q-learning cooperation schemes described in Section IV. Notably, cooperation allows for boosting performance, regardless of the choice of a specific algorithm.

C. Performance in Multitask Scenario
We now analyze the task assignment performance in the presence of multiple targets. To this end, we considered the scenario of Fig. 14-top, where the UAVs and the T = 2 targets are placed in three different geometric settings, namely Config. #1, Config. #2, and Config. #3. In the first configuration, the UAV initial positions are close to each other, whereas, in the second, they are more spread out in the environment. The third configuration was chosen to investigate the performance when one target is close to an obstacle. In our simulations, the task assignment was performed from k = 1 to k = K/2 with a step of 20. The training duration of the MAB was set to τ = 20 steps. In Fig. 14-bottom, we plot the results obtained for different multitask assignment techniques, that is: 1) cooperation through the MAB; 2) independent assignment based on the maximum measured RSSI (namely, max-RSSI); and 3) independent and random task assignment (namely, Random).
The results show that employing the maximum RSSI or the MAB drastically reduces the mission time with respect to the random approach. Moreover, the MAB-based approach avoids having different UAVs sharing the same task in the environment, and, consequently, the UAVs can focus on other operations. Nevertheless, the MAB-based approach requires a certain number of training episodes and, thus, entails a higher complexity.
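A minimal epsilon-greedy bandit sketch of the MAB-based assignment for a single UAV is given below; the reward model, arm count, and hyperparameters are illustrative assumptions, and the paper's exact MAB formulation may differ:

```python
import numpy as np

rng = np.random.default_rng(1)

def mab_assign(rewards_fn, T, tau=20, eps=0.1):
    """Epsilon-greedy multi-armed bandit sketch: one UAV choosing among
    T targets (arms). rewards_fn(a) returns a noisy reward for arm a
    (e.g., derived from the measured RSSI); tau matches the 20-step
    training horizon used in the simulations."""
    q = np.zeros(T)   # running mean reward per arm
    n = np.zeros(T)   # pull counts
    for _ in range(tau):
        a = int(rng.integers(T)) if rng.random() < eps else int(np.argmax(q))
        r = rewards_fn(a)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]   # incremental sample-mean update
    return int(np.argmax(q))        # final task assignment

# Toy reward model (hypothetical): target 1 yields the highest mean reward.
means = [0.2, 0.9, 0.4]
best = mab_assign(lambda a: means[a] + 0.05 * rng.standard_normal(), T=3)
print(best)
```

With enough exploration steps, the estimated mean rewards concentrate on the best arm, which is why a longer training horizon trades complexity for assignment quality, as noted above.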

VI. CONCLUSION
In this article, we have investigated the possibility of employing a DRN in indoor scenarios, where targets must be revealed in the shortest possible time and the environment has to be reconstructed. More specifically, we have investigated scenarios where cooperation is exploited alternatively for navigation or for task assignment. First, we proposed an ad hoc model for both situations and assessed the performance through an extensive simulation analysis. Our results showed that the proposed framework attains robust performance (in terms of SR and IS) under different settings, which makes the DRN a promising solution for solving joint detection and mapping problems.

APPENDIX A MAPPING
We provide insights on how the mapping reward is evaluated when a multiantenna radar is exploited by the ith UAV. In this case, for each steering direction, the radar transmits a train of N_p pulses and, for each pulse, collects the backscattered response (i.e., an echo). The time frame is subdivided into N_bin bins, and for each time bin, the radar computes the corresponding energy profile. Notably, according to the work in [52], each energy element e_{b,s}, referring to a specific steering direction b and a specific time bin s, is expressed as

e_{b,s} = Σ_{p=0}^{N_p − 1} ∫_{p T_f + s T_ED}^{p T_f + (s+1) T_ED} y_i²(t; θ_b) dt    (29)

where N_p is the number of pulses transmitted for each steering direction, θ_b is the bth steering direction, T_ED ≈ 1/W is the duration of the energy bin, y_i(t; θ_b) is the band-pass filtered version of the received signal, and T_f is the time frame. Starting from (29), it is possible to write a range-angle matrix given by

E_i = [ e_{1,1} e_{1,2} ... e_{1,N_bin} ; ... ; e_{N_steer,1} e_{N_steer,2} ... e_{N_steer,N_bin} ]    (30)

A possible approach to estimate a map of the environment is to use an OG algorithm that, given these observations and operating cell-by-cell, estimates the log-odd of occupancy of the jth cell as

ℓ_{i,k}(m_j) = log [ b_{i,k}(m_j) / (1 − b_{i,k}(m_j)) ]    (31)

where b_{i,k}(m_j) is the belief of the occupancy state of the jth cell computed by the ith UAV at time instant k. To this end, the algorithm proceeds in the following three main steps [52].
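The log-odd in (31) and the standard occupancy-grid belief recursion it enables can be sketched as follows; the additive measurement update is the textbook OG rule, shown here under the assumption of a precomputed measurement log-odd:

```python
import numpy as np

def log_odd(belief):
    """Log-odd of occupancy, l = log(b / (1 - b)), as in (31)."""
    return np.log(belief / (1.0 - belief))

def update_belief(belief_prior, log_odd_meas):
    """One cell-by-cell occupancy-grid update (standard OG recursion,
    sketched): add the measurement log-odd to the prior log-odd, then
    map the result back to a belief via the logistic function."""
    l = log_odd(belief_prior) + log_odd_meas
    return 1.0 / (1.0 + np.exp(-l))

b0 = 0.5                               # complete uncertainty, as in the setup
b1 = update_belief(b0, log_odd(0.7))   # one "likely occupied" reading
print(round(b1, 3))                    # → 0.7
```

Starting from b_0(m_j) = 0.5, the prior log-odd is zero, so the first measurement dominates; subsequent readings accumulate additively in the log-odd domain.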

Fig. 2. From left to right: (a) independent, (b) centralized, and (c) distributed cooperative learning schemes.

Fig. 3. Reference occupancy map for the single-target case. The UAV initial positions are indicated with blue markers, and the source positions with red markers. Double-circle markers indicate the positions of the UAVs when only two are considered in simulations.

Fig. 8. Q-tables for the right action as a function of the learning strategy with η = 1, ξ = 0. Left: IL; Middle: CC; Right: DC. The meaning of colors is the same as for Fig. 6.

Fig. 11. SR for source detection as a function of the number of episodes, averaged over Monte Carlo iterations, for N = 2 agents. Dashed lines: IL; dot-dashed lines: CC; continuous lines: distributed cooperation with BestAbs-Q (DC).

Fig. 12. SR for source detection as a function of the number of episodes, averaged over Monte Carlo iterations, for N = 2 agents. The radar reading range was 7.5 m (top) and 3.75 m (bottom). The mapping weights are varied as reported in the legend, whereas the weight associated with detection was set to η = 1.

Fig. 13. SR for source detection as a function of the number of episodes and the number of agents. We consider distributed cooperation with different schemes for updating the Q-tables. In all cases, rewards were set considering η = 1, ξ = 0.3. Solid and dashed lines refer to M = 2 and M = 4 agents, respectively.

Fig. 14. SR as a function of the task assignment strategy and the geometric configuration.

TABLE I
Some Examples of Deep and Standard Q-Learning Based Applications in UAV Networks

TABLE II
Adopted Time-Decayed Epsilon-Greedy Rule