
Decoupling Patrolling Tasks for Water Quality Monitoring: A Multi-Agent Deep Reinforcement Learning Approach



Abstract:

This study proposes the use of an Autonomous Surface Vehicle (ASV) fleet with water quality sensors for efficient patrolling to monitor water resource pollution. This is formulated as a Patrolling Problem, which consists of planning and executing efficient routes to continuously monitor a given area. When patrolling Lake Ypacaraí with ASVs, the scenario transforms into a Partially Observable Markov Game (POMG) due to unknown pollution levels. Given the computational complexity, a Multi-Agent Deep Reinforcement Learning (MADRL) approach is adopted, with a common policy for homogeneous agents. A consensus algorithm assists in collision avoidance and coordination. The work introduces exploration and intensification phases to the patrolling problem. The Exploration Phase aims at homogeneous map coverage, while the Intensification Phase prioritizes highly polluted areas. The innovative introduction of a transition variable, $\nu$, efficiently controls the transition from exploration to intensification. Results demonstrate the superiority of the method, which outperforms a Single-Phase (trained on a single task) Deep Q-Network (DQN) by an average of 17% on the intensification task. The proposed multitask learning approach with parameter sharing, coupled with DQN training, outperforms Task-Specific DQN (two DQNs trained on separate tasks) by 6% in exploration and 13% in intensification. It also outperforms the heuristic-based Lawn Mower Path Planner (LMPP) and Random Wanderer Path Planner (RWPP) algorithms, by 35% and 20% on average respectively. Additionally, it outperforms a Particle Swarm Optimization-based Path Planner (PSOPP) by an average of 26%. The algorithm demonstrates adaptability in unforeseen scenarios, giving users flexibility in configuration.
Published in: IEEE Access ( Volume: 12)
Page(s): 75559 - 75576
Date of Publication: 21 May 2024
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Water resources, such as lakes and ponds, are vital to life, providing habitat for wildlife and a source of drinking water. Lake Ypacaraí in Paraguay, which spans over 60 km², is facing increasing pollution due to waste discharges, lack of sewage systems, and tourism. This has resulted in eutrophication, excessive nutrients [1], and cyanobacteria blooms that deplete dissolved oxygen, posing a threat to aquatic life and water quality. Efficient monitoring of Lake Ypacaraí’s biological status, especially for dynamic blue-green algae blooms, necessitates periodic Water Quality (WQ) sampling. WQ data, such as pH, dissolved oxygen, temperature, and water turbidity, can be used to create a physicochemical model that provides up-to-date information on the water’s biological status. This can aid the decision-making process for monitoring pollution in Lake Ypacaraí more efficiently. Also, certain areas of the lake are more polluted than others [2], [3], [4], and therefore require more frequent monitoring. Manual sampling is impractical due to the lake’s vast size (60 km²) and the potential health risks for biologists in contaminated environments. Furthermore, installing a fixed sensor grid is not an optimal solution because it limits the ability to vary sampling locations and may be more costly due to frequent battery replacements compared to human-conducted missions [5].

In [2] and [3], the deployment of Autonomous Surface Vehicles (ASVs) equipped with WQ sensors for continuous monitoring instead of manual sampling (Figure 1) was proposed. ASVs are capable of operating autonomously, making them more efficient and suitable for tasks performed in hazardous environments, such as polluted areas. In addition, they offer the advantage of hourly sampling, in contrast to the limited frequency of manual sampling (once per day). By deploying a fleet of ASVs, simultaneous measurements can be taken at various locations on the map, yielding a more complete and detailed physicochemical model. The main challenge in the multi-ASV paradigm is to develop an effective coordination strategy that enables the ASVs to independently sample the lake with collision-free routes while meeting a low redundancy criterion. Physical constraints such as battery limitations and non-navigable zones must also be considered, as well as avoiding collisions between ASVs. To address these issues, the continuous monitoring of the lake is modeled as the Patrolling Problem. The objective of this task is to find an efficient route or strategy to cover important zones of the lake while also revisiting those that have not been sampled in a long time.

FIGURE 1. ASV prototype designed for water resources exploration.

When monitoring Lake Ypacaraí with ASVs, the scenario becomes a Partially Observable Markov Game (POMG) due to the unknown pollution levels at the beginning of the episode. Additionally, experiments in this work focused only on blue-green algae bloom contamination, which is dynamic, chaotic, and changes in size over time.

In this work, the patrolling problem is addressed in a decoupled way, dividing it into two phases: exploration and intensification. During the exploration phase, the ASVs are required to visit all areas of the map and take pollution measurements. At this stage, homogeneous patrolling is carried out: the relative importance of the areas is not taken into account, so the lake is covered uniformly. After identifying the most significant areas of contamination, which are characterized by blue-green algae blooms, the intensification phase follows. In this stage, non-homogeneous patrolling is carried out, with a focus on visiting the most relevant areas more frequently. This approach poses a challenge, as it implies that the same agent must adapt to the changing phases and make optimal decisions in each of them. The proposed approach employs a variable named $\nu$ to regulate a smooth transition between the Exploration and Intensification phases. This variable establishes the probability of choosing an exploratory or an intensifying policy action. This transition mechanism, based on $\nu$, is an innovative scientific contribution since, to the best of our knowledge, it has not been proposed in the literature before.

Given the extensive number of routes and the number of ASVs required for continuous monitoring of large bodies of water, this problem becomes NP-hard. As suggested by [3], [6], and [7], a Deep Reinforcement Learning (DRL) approach is recommended for dealing with this problem’s high dimensionality. DRL methods, like Deep Q-Learning (DQL), operate without a model of the environment (model-free) and use a Convolutional Neural Network (CNN) to approximate the Q-function, which can effectively handle high-dimensional state spaces [8]. In this multi-ASV system, each vehicle is an agent in a cooperative Multi-Agent DRL (MADRL) environment. To deal with scalability issues, a single neural network, designed and trained based on the study of [6], is shared by all agents. The agents can share the Deep Policy because they are interchangeable in terms of their observation and action capabilities. As the patrolling mission has been decoupled into two phases, the agents must learn both tasks, resulting in Multitask Multiagent DRL (MT-MADRL). Our approach is based on a neural network with two heads, one for each task’s Q-function. Therefore, regardless of the number of agents, there is only one network and one Q-function to optimize for each task. This is possible because the agents pursue the same collective reward and act in a purely cooperative manner, so each individual reward can be modeled with a single network. Concerning the non-navigable terrain constraints, Censoring-DQL [5] is applied, which takes the deterministic information of the environment and discards the actions that would violate those constraints. Furthermore, if a possible collision between agents is about to occur, a consensus algorithm proposed in [9] is used to avoid it. Therefore, the main contributions of this article are:

  • A framework for monitoring partially observable dynamic scenarios with a two-phase approach: Exploration and Intensification. The proposed approach employs a smooth transition mechanism between the phases.

  • The application of Multitask Multiagent Deep Reinforcement Learning using Deep Q-Learning and parameter-sharing techniques for environmental monitoring.

  • A comparison between the proposed framework and other algorithms based on heuristics.

This paper is structured as follows: Section II is a review of the literature on Multiagent and Multitask Deep Reinforcement Learning with applications to environmental monitoring. Section III presents the Multiagent Patrolling Problem, provides specific scenario details, and outlines key assumptions. Section IV details the approach utilized to tackle the Multiagent Patrolling Problem, including the use of DRL approaches, the Exploration versus Intensification phase strategy, reward function design principles, and state space representation. In Section V, the paper presents key metrics and simulation results to verify the effectiveness of the proposed methodology. Finally, Section VI summarizes the findings and proposes potential avenues for future research.

SECTION II.

Related Work

The use of ASVs in aquatic environments has gained significant attention in recent years [10]. ASVs have a wide range of potential applications, including environmental monitoring for early warning of pollution [3], [6], [7], bathymetric surveying for navigation safety [11], [12], and emergency response tasks such as oil spill tracking [13]. Artificial Intelligence approaches, such as Bayesian Optimization (BO) [14], Genetic Algorithms (GA) [15] and DRL [3], have been applied to mitigate the NP-hard complexity of many applications involving a large number of possible trajectories. In [14], a mission planning method based on BO was proposed, which defines the movements of the ASV with the aim of minimizing the uncertainty of the contamination distribution of an aquatic environment. Although this approach efficiently obtains a physicochemical model in low-data regimes, the authors do not consider that water quality parameters can change during a mission, whereas our proposal addresses the dynamic behavior of algae blooms. Additionally, in [7], the authors demonstrated that scalability issues are better addressed with DRL than with GA approaches, which motivated this work to use DQL to solve the Patrolling Problem.

The Deep Q-Network (DQN) algorithm [8], which enabled games on the Atari 2600 console to be played at an expert human level, catapulted the popularity of DQL in 2015. In this work, Rainbow DQN [16] is used, which was introduced as a combination of several improvements, including DDQN [17], Dueling DQN [18], and Prioritized Experience Replay [19]. Regarding solving non-homogeneous patrolling with DQL, in [20] a relevance map is used to represent areas with different coverage requirements, and the policy can only make decisions based on the data of the relevance map within the camera’s field of view. In contrast, our proposal takes point sample measurements while creating a model of the relevant zones (polluted areas). This model is then used as an input image for the neural network. This methodology of environmental representation was previously employed in [3], where the authors created a pollution map for Lake Ypacaraí and generated an image to highlight the areas with the highest pollution levels. That work differs from ours in that it did not use a dynamic pollution map or a collision avoidance mechanism.

Using multiple autonomous vehicles as sensing nodes [6], [21], [22] leads to increased coverage and faster data collection. The extension of DRL to the multi-agent case (MADRL) enhances the performance of the single-agent case but brings new inherent challenges such as computational complexity, nonstationarity, partial observability and credit assignment [23]. In cooperative multiagent environments, parameter sharing (PS) [24] has been shown to be effective when agents are homogeneous (they share the same set of skills). The PS approach, which our proposal uses, involves sharing the parameters of a single policy among all agents, which is trained with the experiences of all agents simultaneously. This enables an efficient training process that can scale up to an arbitrary number of agents, thereby reducing the computational complexity. Previous studies, such as [25], have utilized PS in MADRL, where the network input includes information on the relative position and velocity of other agents. While their approach is also egocentric, our work diverges in that it employs a single buffer to store the experiences of all agents, whereas they use a separate buffer for each agent, limiting the scalability of their algorithm. In contrast, other works like [26] do not use replay memory. Instead, they combine PS with the Deep Recurrent Q-Network (DRQN) [27] to address partial observability. They approximate the Q-function with a recurrent neural network (RNN) that can maintain an internal state and aggregate observations over time. However, the Dueling DQN architecture has been proposed as a faster and easier-to-train alternative to RNNs in high-dimensional spaces, as noted by [24]. Additionally, the partial observability of our proposed environment is mitigated by implementing a first phase of homogeneous patrolling, since the pollution distribution along the lake is unknown at the beginning of the episode. In the context of multi-agent patrolling for environmental missions, [6] and [28] proposed a scheme that employs Deep Q-Learning with a Convolutional Neural Network as a shared fleet policy and utilizes a global visual state. In [6], a decoupled final layer was proposed for N agents with $|A|$ possible actions. This layer consists of an individual fully-connected layer for every agent, resulting in $|A| \times N$ neurons in the last layer. In contrast, our proposal only has $2 \times |A|$ neurons in the last layer for any number of agents. In fully cooperative environments that employ joint reward signals, agents face credit assignment challenges when determining the impact of their actions on team performance. To tackle the credit assignment problem, [28] proposed a decoupled reward in which each agent receives a reward only for its individual contributions, without any additional considerations. They demonstrated that their approach is effective in addressing the credit assignment problem, which motivates this work to employ it. The works mentioned above only address possible collisions of the agents by explicitly penalizing them in the reward function. In contrast, our proposal includes collision avoidance mechanisms that guarantee that no action will cause a collision.

In this study, the agents learn to optimize two tasks and, therefore, two policies. The first part of the episode involves making decisions based on a policy optimized only for the exploration task, while the second part involves taking actions following an intensification policy. In the literature, the multitask paradigm in DQL has been addressed with the Multitask Deep Q-Network (MDQN) [29], [30], [31]: a network is trained as a vanilla DQN, but with a separate output layer (head) for each task. In [29], it was concluded that, given sufficiently related tasks, shared hidden layers in the MDQN can efficiently learn a shared feature representation and therefore perform well across tasks. In our case, the exploration and intensification phases are similar, as the agents coordinate their efforts to reduce the weighted average idleness on a single map. The conclusion of [29] and other successful single-agent applications of MDQN [30], [31] motivated us to use this approach.

Extending to the multi-agent paradigm, the works [32] and [33] have applied knowledge transfer to MT-MADRL by training task-specific neural networks (teachers) first and then applying a knowledge distillation algorithm to train a policy that performs well across tasks. Our proposal differs in that it aims to learn a policy for each task, rather than one policy for multiple tasks. On the other hand, the authors in [34] proposed an approach that allows agents to have multiple policies for different tasks. However, their algorithm is based on the Deep Deterministic Policy Gradient (DDPG) algorithm, which is a DRL algorithm for continuous action space environments, and therefore cannot be applied to our proposed discrete action space environment. Furthermore, to the best of our knowledge, no approaches to MT-MADRL utilize DQL. This research aims to fill this gap.

SECTION III.

Preliminaries

A. The Patrolling Problem

The main task of the patrolling problem is to identify the ideal strategy for visiting relevant areas of an environment in order to monitor them periodically. To represent a continuous surface, the terrain skeletonization technique [35] can be used. This means that the environment to be covered can be represented as an undirected graph $G(V,E,W)$ (see Figure 2), where $V=\{1,2,\ldots,n\}$ is the set of vertices (also called nodes) of the graph and E is the set of edges of G. With this formulation, different types of problems can be represented, depending on the importance of the costs associated with the edges. In the proposed scenario, the cost is determined by the distance between nodes. Every graph node also has an idleness, denoted by $W_{k}$ for node $k\in V$, which indicates the amount of time since the node was last visited by an agent. In this paper, the objective is to reduce the average idleness of the graph and, consequently, of each individual node. In the patrolling problem, when all areas are considered equally important and there are no specific or relevant areas, it is referred to as the homogeneous patrolling problem. However, if there are relevant areas, they should be patrolled more frequently based on the importance assigned to each zone, which is known as the non-homogeneous patrolling problem.

FIGURE 2. Example of a graph G with vertices $V_{k}$ and edges $E_{k-j}$; for clarity, not all edge names are displayed. Either of the two agents (in $V_{7}$ and $V_{9}$) can move in eight different directions, indicated by the red arrows. Nodes in red indicate the possible locations where a collision could occur if both agents head to them.

Furthermore, Lake Ypacaraí contains regions with high pollution concentration, such as blue-green algae blooms. The behavior of these blooms is dynamic and often chaotic, causing them to frequently change in size over time. Therefore, the set of dynamic weights for the graph nodes, denoted as I(t), indicates the importance of each node based on its level of contamination, with the more contaminated ones having greater importance. The non-homogeneous patrolling problem thus proves to be the best fit for the Lake Ypacaraí case [3] and can be formulated as finding a policy $\boldsymbol{\pi}$ that minimizes the average idleness weighted by the set of weights I(t) given a number of time steps:\begin{equation*} G(V,E,W,I,t) \longrightarrow \pi(E) \;\Big|\; \min \frac{1}{|V|} \sum_{k=1}^{|V|} W_{k} \times I_{k}(t) \tag{1}\end{equation*}

The patrolling problem is naturally suited to being shared by multiple agents in space and time, since they can patrol different areas simultaneously. Our fleet of ASVs shares a homogeneous architecture and its information about the environment, so the idleness to minimize is the shared idleness. This means that all agents collectively reset the estimate of a node’s idleness based on the visits made by all agents.

B. Scenario and Assumptions

The scenario definition is essential in patrolling problems as it establishes the ASV or agent’s movement abilities, as well as the real constraints and boundary conditions. Our target scenario design is based on the following assumptions:

1) Lake Map

The map of Lake Ypacaraí has been divided into a grid map (see Figure 3), where each cell corresponds to a node and the distance traveled between adjacent cells corresponds to the weight of an edge. Given the lake’s total surface area of 60 km², each square cell represents 290 m × 290 m of the lake’s area. Cells that cannot be occupied (brown cells in Figure 3), such as those outside the navigable surface of the lake or those representing land, have a null value in the grid map. However, obstacles within the lake are not taken into consideration during the simulation, as the ASV obstacle sensors (Lidar + camera) are able to avoid them with a local reactive trajectory planner.

FIGURE 3. Discretized map of Lake Ypacaraí. Blue cells denote visitable water surfaces, while brown cells denote illegal, non-navigable areas.

2) Lake’s Contamination

As the ASVs explore the lake and collect samples, they build a map of the lake’s pollution, which is also part of the scenario. Our experiments focused solely on pollution caused by blue-green algal blooms. The bloom’s behavior is dynamic and chaotic, and its size changes over time, so it is modeled using Brownian motion. In this model, the concentration of algae is represented by discrete particles that can move within the lake space. Using the position of each particle, a particle map $P(t)$ is constructed following the same discretization as the grid map. Each position or cell of the map indicates the number of particles within it. To convert the particle map $P(t)$ into a scalar contamination field $I(t)$, a Gaussian filter is applied to the image of the particles on the grid. This converts the particle map into a particle density map, which is ultimately referred to as the contamination map (see Figure 4).
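The particle-to-density conversion described above can be reproduced in a few lines. The following sketch is an illustrative assumption about the implementation (not the authors' code) that uses NumPy for the particle map and scipy.ndimage.gaussian_filter for the smoothing step:

import numpy as np
from scipy.ndimage import gaussian_filter

def contamination_map(particles, grid_shape, cell_size, sigma=2.0, rng=None):
    """Build a contamination map I(t) from Brownian algae particles (sketch).

    particles : (P, 2) float array of particle positions in metres (assumed layout).
    grid_shape: (rows, cols) of the discretized lake map.
    cell_size : cell edge length in metres (e.g. 290).
    """
    rng = rng or np.random.default_rng()
    # Brownian step: each particle drifts by Gaussian noise.
    particles += rng.normal(scale=cell_size * 0.1, size=particles.shape)

    # Particle map P(t): count particles per cell.
    idx = np.clip((particles // cell_size).astype(int), 0, np.array(grid_shape) - 1)
    particle_map = np.zeros(grid_shape)
    np.add.at(particle_map, (idx[:, 0], idx[:, 1]), 1)

    # Gaussian filter turns the counts into a smooth density (contamination) field.
    importance = gaussian_filter(particle_map, sigma=sigma)
    return importance / max(importance.max(), 1e-12)  # normalised to [0, 1]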

FIGURE 4. Example of the evolution of a contamination map as a result of blue-green algae in Lake Ypacaraí: (a) the initial map, (b) after 30 steps, (c) after 60 steps, (d) after 100 steps.

3) Vehicle Movement

Once the number of movement directions for the agent has been established, the angles for each direction are calculated, evenly spaced within the interval $[0,2\pi]$. To ensure the ASV’s motion capabilities are realistically represented, eight cardinal and intercardinal directions (N, E, S, W and NE, SE, NW, SW; red arrows in Figure 2) are assumed at each step. Since all ASVs have the same capabilities, they are assumed to move synchronously. For simplicity, each step of the simulation is a movement in one direction. Illegal actions are movements into a cell that is simultaneously desired by another agent (red nodes in Figure 2) or into a cell that forms part of a non-navigable zone (obstacles or shores). Actions leading to non-navigable zones are masked to prevent them from being taken, ensuring that only actions allowing movement within the graph are available. The battery life of ASVs is a significant constraint when deployed in real-world applications. In our study, the maximum distance that the ASVs can travel when starting with a full battery is determined; once this distance is reached, the mission is terminated.
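As an illustration of how the eight movement directions can be derived from the evenly spaced angle interval, the short sketch below (a hypothetical helper, not taken from the paper's code) maps each angle to an integer grid offset:

import numpy as np

def movement_offsets(n_directions=8, step=1):
    """Grid offsets for evenly spaced headings in [0, 2*pi) (illustrative sketch).

    With n_directions=8 this yields the eight compass moves (one cell per step).
    """
    angles = np.linspace(0, 2 * np.pi, n_directions, endpoint=False)
    # Round the unit vectors to the nearest integer cell offset.
    return [(int(round(np.cos(a))) * step, int(round(np.sin(a))) * step) for a in angles]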

4) Communication

In this scenario, communication between agents regarding fleet position and pollution measurements is assumed. They jointly create an importance map (the set of weights I(t)) based on contamination measurements. The model determines the importance of a cell based on its most recent contamination measurement. To ensure an interest in covering them, all cells within the navigable zone of the lake have a minimum importance value. Furthermore, having knowledge about the fleet position is crucial for each agent to avoid collisions and ensure safe and efficient behavior.

SECTION IV.

Methodology

A. Multitask Reinforcement Learning

Reinforcement Learning (RL) [36] is an Artificial Intelligence approach where an agent learns by interacting with the environment. The Markov Decision Process (MDP) mathematical framework is used to formalize RL problems, such as patrolling Lake Ypacaraí. An MDP is denoted by a tuple of five elements $(S, A, P, R, \gamma)$, where S is the state space, A is the action space, $P: S \times A \times S \longrightarrow [0,1]$ is the transition probability, $R: S \times A \longrightarrow \mathbb{R}_{+}$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor, which represents the agent’s preference for immediate rewards over future rewards. At each time step t, the agent observes a state $s_{t}\in S$ and selects an action $a_{t}\in A$ based on the policy $\pi(a_{t}|s_{t})$. The environment produces a reward $r_{t} = R(s_{t}, a_{t})$ and progresses to the next state $s_{t+1}$ according to the transition probability $P(s_{t+1}|s_{t}, a_{t})$.

The agent’s long-term return under a policy ($J(\pi)$) is defined as the expected discounted accumulated reward:\begin{equation*} J(\pi)=\mathbb{E}_{a_{t} \sim \pi}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\right] \tag{2}\end{equation*} Therefore, the agent must learn an optimal behavior policy $\pi: S \longrightarrow P(A)$ that optimizes the expected performance $J(\pi)$. In partially observable environments, agents receive observations of the state rather than observing it directly. These domains are formalized as Partially Observable Markov Decision Processes (POMDPs), which are defined as $(S, A, P, R, \Omega, O, \gamma)$. In addition to the elements shared with MDPs, POMDPs include the discrete set of observations $\Omega = \{o^{1},\ldots,o^{|\Omega|}\}$ and the observation function $O: S \times A \times \Omega \longrightarrow [0,1]$. After choosing an action $a_{t}$ at timestep t, the agent observes $o_{t+1}\in \Omega$ with probability $P(o_{t+1}|s_{t+1}, a_{t}) = O(o_{t+1}, s_{t+1}, a_{t})$. In Multitask Reinforcement Learning, there is a collection of M tasks $\mathcal{T}=\{T_{i}\}_{i=1}^{M}$. Each task $T_{i}$ has a unique MDP (POMDP if the environment is partially observable) $\mathcal{M}_{i}=(S, A, P_{i}, R_{i}, \gamma)$ that shares the same state-action space, but the transition probability $P_{i}$ and the reward function $R_{i}$ differ across tasks. The goal is to simultaneously learn multiple tasks and exploit their similarity to improve performance relative to single-task learning, i.e., to learn a policy that maximizes the total expected return over all tasks.

B. Deep Q-Learning

The state-action value function, $Q_{\pi}(s,a)$, describes the utility of being in a specific state s and taking a particular action a while following a given policy $\pi(s)$. Deep Q-Learning (DQL) is a widely used algorithm in reinforcement learning. Instead of iteratively updating the Q-values in a table (Q-Learning), it uses a neural network with parameterized weights ($\theta$) to estimate the state-action values, learning to map states directly to the Q-value of each action. This is useful for handling high-dimensional state spaces. For a given state s, action a, received reward r, and next state $s'$, the Q-value update in DQL is expressed as:\begin{align*} Q(s,a; \theta) & \leftarrow Q(s,a; \theta) \\ & \quad + \alpha \left(r+\gamma \max_{a'} Q(s',a'; \theta)-Q(s,a; \theta)\right) \tag{3}\end{align*} The learning rate ($0 \lt \alpha \leq 1$) determines how quickly $Q(s,a)$ is updated with new data during each iteration. The discount factor $\gamma \in (0, 1]$ adjusts the importance of rewards over time. In addition, DQL commonly uses the epsilon-greedy strategy to balance exploration and exploitation during action selection: the action with the highest estimated Q-value is selected with probability $(1 - \epsilon)$ (exploitation), and a random action is selected with probability $\epsilon$ (exploration).

In [8], the authors present two important methods for improving stability and efficiency in reinforcement learning: Experience Replay and the Target Network. Experience Replay consists of storing past experiences $(s,a,r,s')$ in a memory buffer. This allows the agent to learn from diverse situations and avoids problems such as catastrophic forgetting and correlated observations. The Target Network is a copy of the main network that is updated less frequently and is used to compute the learning targets, reducing correlations and enhancing stability. Additionally, to tackle the overestimation of action values, the Double Deep Q-Network (DDQN) [17] was proposed. This approach uses the main network for action selection and the target network for value estimation, mitigating the overestimation problem. Furthermore, [18] proposed the dueling architecture, a neural network framework that separates the estimation of the State-Value Function $V(s)$ and the Advantage Function $A(s, a)$. In this dueling architecture, two streams representing $V(s)$ and $A(s, a)$ share a common feature learning module and are then combined through an aggregation layer to obtain an estimate of the state-action value function Q using the following general formula:\begin{align*} Q(s, a; \theta) = V(s; \theta') + \left( A(s, a; \theta'') - \frac{1}{|A|} \sum_{a'} A(s, a'; \theta'') \right) \tag{4}\end{align*} where $\theta'$ and $\theta''$ are the separate parameter sets used to estimate V and A, respectively.
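For concreteness, a minimal PyTorch sketch of the dueling aggregation of Equation 4 and of a Double-DQN target computation follows; the names DuelingHead and ddqn_target and the layer sizes are illustrative assumptions, not the paper's implementation:

import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling aggregation of Eq. (4): Q = V + (A - mean(A))."""

    def __init__(self, in_features, n_actions):
        super().__init__()
        self.value = nn.Linear(in_features, 1)              # V(s; theta')
        self.advantage = nn.Linear(in_features, n_actions)  # A(s, a; theta'')

    def forward(self, features):
        v = self.value(features)
        a = self.advantage(features)
        return v + a - a.mean(dim=1, keepdim=True)

def ddqn_target(reward, next_features, online_head, target_head, gamma=0.99):
    """Double DQN target: the online head selects the action, the target head
    evaluates it (terminal-state masking omitted for brevity)."""
    with torch.no_grad():
        best_action = online_head(next_features).argmax(dim=1, keepdim=True)
        next_q = target_head(next_features).gather(1, best_action).squeeze(1)
    return reward + gamma * next_q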

C. Proposed Decoupled Method

The pollution levels and, consequently, the set of weights I(t) are unknown at the beginning of the mission. However, to effectively achieve Equation 1, the group of agents must first acquire sufficient knowledge of I. For this reason, the mission is divided into two phases: exploration and intensification. In the exploration phase, ASVs patrol the area homogeneously with the aim of minimizing average idleness, without considering the importance (the set I(t)) of specific zones. So, our main goal is to achieve homogeneous coverage by searching for a joint policy $\Pi = \{\pi^{1}, \pi^{2}, \ldots, \pi^{N}\}$ of N agents that minimizes the average idleness value throughout the map:\begin{equation*} G(V,E,W) \longrightarrow \Pi(E) \;\Big|\; \min \frac{1}{|V|} \sum_{k=1}^{|V|} W_{k} \tag{5}\end{equation*}

Once contamination information has been gathered, our goal is to implement a strategic plan to ensure thorough and intensive coverage of the highly polluted zones during the intensification phase. Our primary objective is to develop a joint policy, denoted by $\Pi = \{\pi^{1}, \pi^{2}, \ldots, \pi^{N}\}$, for N agents with the aim of minimizing the average idleness weighted by the set of weights I(t) throughout the map:\begin{equation*} G(V,E,W,I,t) \longrightarrow \Pi(E) \;\Big|\; \min \frac{1}{|V|} \sum_{k=1}^{|V|} W_{k} \times I_{k}(t) \tag{6}\end{equation*}

In this case, the exploration and intensification phases are similar since the agents coordinate their efforts to decrease average idleness on a single map. However, exploration seeks homogeneous coverage while intensification is heterogeneous, making our problem inherently multi-objective. If the agents only learned the intensification task, they would need to explore to gather a better model of the contamination at the beginning of the mission and then decide where to intensify. To speed up the learning process and increase the efficiency of sampling, this study relaxes the requirement that agents must use the same policy for both exploration and intensification, acknowledging the conflicting nature of these goals.

The proposed method uses a variable, $\nu$, to regulate a smooth transition between the exploration and intensification phases. This variable determines the probability of choosing an exploration or intensification policy action. For the sake of clarity, let $Q_{e}$ represent the head of our network that estimates the Q-values of the exploration policy, and $Q_{i}$ the head that estimates the Q-values of the intensification policy. The resultant policy, $\pi_{\nu}$, is defined as:\begin{align*} \pi_{\nu}(s) = \begin{cases} \underset{a}{\arg\max}\, Q_{e}(s,a) & \text{with probability } \nu \\ \underset{a}{\arg\max}\, Q_{i}(s,a) & \text{with probability } 1-\nu \end{cases} \tag{7}\end{align*} Thus, the patrolling phases and the transition phase can be defined flexibly:

  • $\nu = 1$ : Exploration phase.

  • $\nu = 0$ : Intensification phase.

  • $0 \lt \nu \lt 1$: Transition phase, during which there is a $(\nu \times 100)\%$ probability of choosing an exploratory action; $\nu$ gradually decreases so that the fleet transitions smoothly to the intensification phase (see the action-selection sketch below).
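A minimal sketch of this $\nu$-mixed action selection (Equation 7) is shown below; the function names and interfaces are assumptions rather than the paper's implementation:

import random

def select_action(observation, q_exploration, q_intensification, nu):
    """Pick an action with the nu-mixed policy of Eq. (7).

    q_exploration / q_intensification: callables returning a list of Q-values,
    one per action, for the given observation (illustrative interfaces).
    """
    q_values = q_exploration(observation) if random.random() < nu \
        else q_intensification(observation)
    return max(range(len(q_values)), key=lambda a: q_values[a])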

D. Collision Avoidance Mechanisms

As mentioned above, the Censoring-DQL [5] algorithm is employed to deterministically discard actions leading to non-navigable zones. In this algorithm, the invalid actions of an agent i that would lead to non-navigable zones are identified using the information in the lake map, which is known a priori. Then, a censoring function $\eta(s,a^{i})\in\mathbb{R}^{|A^{i}|}$ (see Equation 8) is computed:\begin{align*} \eta(s,a^{i}) = \begin{cases} 1 & \text{if } a^{i} \text{ is valid} \\ -\infty & \text{if } a^{i} \text{ is not valid} \end{cases} \tag{8}\end{align*} Once the invalid actions are detected, the observation $o^{i}$ is processed by the Q-network and the final censored Q-values (for each task) $Q_{C}^{i}$ are obtained:\begin{equation*} Q_{C}^{i}(o^{i},a^{i}) = \eta(s,a^{i}) \circ Q^{i}(o^{i},a^{i}) \tag{9}\end{equation*} This ensures that only actions allowing movement within the graph are available.
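A sketch of the masking operation of Equations 8 and 9, assuming a per-agent vector of Q-values and a boolean validity mask derived from the known lake map:

import numpy as np

def censor_q_values(q_values, valid_mask):
    """Censoring-DQL style masking (Eqs. 8-9), as an illustrative sketch.

    q_values  : (n_actions,) Q-values for one agent.
    valid_mask: (n_actions,) boolean array, True where the action keeps the
                ASV inside the navigable zone (known a priori from the map).
    """
    censored = np.where(valid_mask, q_values, -np.inf)
    return censored  # an argmax over this vector can never pick an invalid action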

On the other hand, simultaneous actions within the navigable zone may cause conflicts, resulting in collisions between agents. To address this issue, the SafeConsensus [9] algorithm is utilized. The algorithm sorts the agents by their highest Q-value; the agent with the highest Q acts first without considering the others, and each following agent then considers the new position of the preceding ones, censoring with $-\infty$ the Q-values that would lead to collisions. This heuristic is based on conditional decision-making and relies on agent optimism to prioritize actions.
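The following sketch loosely reproduces this sequential, optimism-based conflict resolution; it is an illustration of the idea rather than the reference implementation of SafeConsensus [9]:

import numpy as np

def safe_consensus(censored_q, positions, offsets):
    """Sequential collision-free action selection, loosely following SafeConsensus [9].

    censored_q: (n_agents, n_actions) Q-values after Censoring-DQL.
    positions : list of (row, col) agent positions.
    offsets   : list of (drow, dcol) grid moves, one per action.
    """
    order = np.argsort(-censored_q.max(axis=1))  # most confident agent decides first
    reserved, actions = set(), {}
    for i in order:
        q = censored_q[i].copy()
        for a, (dr, dc) in enumerate(offsets):
            target = (positions[i][0] + dr, positions[i][1] + dc)
            if target in reserved:               # cell already claimed by a previous agent
                q[a] = -np.inf
        actions[i] = int(np.argmax(q))
        dr, dc = offsets[actions[i]]
        reserved.add((positions[i][0] + dr, positions[i][1] + dc))
    return [actions[i] for i in range(len(positions))]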

E. Multitask Multiagent Deep Q-Network

In this study, the Parameter-Sharing Multiagent MDQN (PSMA-MDQN) is presented as an extension of the MDQN to the multiagent paradigm. The DDQN method, in combination with the dueling architecture (Dueling DQN), is used to develop a dense Convolutional Neural Network with two parallel terminations, referred to as "heads", each corresponding to a specific task: exploration and intensification. The output layer and parameters of each head are separate, allowing the PSMA-MDQN to learn and optimize the two tasks independently. Each head has its own loss function, and its weights are adjusted independently during training. The shared block, also known as the Feature Extractor, extracts useful common features for both tasks [29]. These shared features are utilized by each individual head to generate its corresponding output (see Figure 5). This approach is feasible because each agent operates with the same set of actions and is subject to the same constraints, making them homologous in both actions and observations. So, this work benefits from the homogeneity of the agents to train, on a single network, a policy $\pi^{1}=\pi^{2}=\ldots=\pi^{N}$ that serves multiple agents without increasing the number of network parameters with respect to a single agent. Since the objective is to learn more than one policy, our proposal has only $M \times |A|$ neurons in the last layer for any number of agents and M tasks. Furthermore, as validated in [6] and [9], the number of agents in this scheme does not impact the stability of the learning process. In Algorithm 1, the pseudocode for the approach is presented.
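As a sketch of this two-head architecture with a shared Feature Extractor, the PyTorch module below is illustrative only: the convolutional stack and layer sizes are assumptions, while Figure 5 specifies the actual design:

import torch
import torch.nn as nn

class PSMAMDQN(nn.Module):
    """Shared Feature Extractor with two dueling heads (exploration, intensification).
    Layer sizes are illustrative, not the paper's exact ones."""

    def __init__(self, in_channels=4, n_actions=8):
        super().__init__()
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),
            nn.Linear(64 * 4 * 4, 256), nn.ReLU(),
        )
        def dueling_head():
            return nn.ModuleDict({
                "value": nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 1)),
                "advantage": nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, n_actions)),
            })
        self.heads = nn.ModuleDict({"exploration": dueling_head(),
                                    "intensification": dueling_head()})

    def forward(self, obs):
        """obs: (batch, 4, H, W) egocentric observation. Returns Q_e and Q_i."""
        features = self.feature_extractor(obs)
        out = {}
        for task, head in self.heads.items():
            v, a = head["value"](features), head["advantage"](features)
            out[task] = v + a - a.mean(dim=1, keepdim=True)   # dueling aggregation, Eq. (4)
        return out["exploration"], out["intensification"]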

FIGURE 5. The PSMA-MDQN architecture proposed here utilizes a shared Feature Extractor that captures common features from the input state. Each task is assigned an individual head consisting of a three-layer dense neural network followed by a dueling architecture with separate Advantage ($A(s,a)$) and State ($V(s)$) value streams. The final Q-values of each task, $Q_{e}(s,a)$ for exploration and $Q_{i}(s,a)$ for intensification, are obtained by combining the corresponding $A(s,a)$ and $V(s)$ streams. All activation layers correspond to the ReLU function.

Algorithm 1 Parameter-Sharing Multiagent MDQN With SafeConsensus and Censoring-DQL

1: Initialize replay memory B to capacity |B|
2: Initialize the multi-policy network Q with random weights $\theta$ and heads $Q_{e}$ for exploration and $Q_{i}$ for intensification
3: Clone Q into the target network Q′ with weights $\theta' = \theta$
4: Set hyperparameters: $\epsilon$, $\nu$
5: for episode $= 1, E$ do
6:   Reset the environment and observe the initial observation $o^{i}$ of every agent i
7:   for $t = 1, T$ do
8:     $p \sim \mathcal{U}(0, 1)$
9:     if $p \lt \epsilon$ then
10:       $\vec{Q} = \mathcal{U}(0,1) \times \ldots \times \mathcal{U}(0,1)$   ▷ Random joint Q-values of the N agents ($\epsilon$-greedy exploration)
11:     else
12:       $n \sim \mathcal{U}(0, 1)$
13:       if $n \lt \nu$ then
14:         $\vec{Q} = Q_{e}^{0}(o^{0}, \cdot) \times \ldots \times Q_{e}^{N-1}(o^{N-1}, \cdot)$   ▷ Choose the exploration head (Eq. 7)
15:       else
16:         $\vec{Q} = Q_{i}^{0}(o^{0}, \cdot) \times \ldots \times Q_{i}^{N-1}(o^{N-1}, \cdot)$   ▷ Choose the intensification head
17:       end if
18:     end if
19:     Apply Censoring-DQL to $\vec{Q}$ to obtain $\vec{Q}_{C}$
20:     $\vec{a}_{t} \leftarrow SafeConsensus(\vec{Q}_{C})$
21:     Execute the joint action $\vec{a}_{t}$, observe the reward vector $r_{t}^{i}$ and the next observation $o_{t+1}^{i}$ for every agent i
22:     Store the transitions $(o_{t}^{i}, a_{t}^{i}, r_{t}^{i}, o_{t+1}^{i})$ in B
23:     Sample a minibatch of transitions $(o_{k}, a_{k}, r_{k}, o_{k+1})$ from B
24:     Update both Q-value heads based on the temporal-difference error
25:     Every C steps, update the target network Q′
26:   end for
27: end for

F. State Representation

The state represents the environment in which the agent operates and the information accessible for decision-making purposes. In our simulation environment, the agents have partial information about the lake’s contamination, as they do not know the contamination levels of the cells they have never visited. Therefore, the environment is partially observable, as the agents only have an observation of the dynamic set of weights I(t). To enable the use of a single policy network for all agents, an egocentric observation formulation is employed: the relative position of the other agents is also observed, ensuring that each agent has a unique observation. Thus, there is a distinction between the shared and individual elements of the observations. Our proposed state representation is composed of four min-max-normalized image channels, so that every pixel value of the state is within [0, 1].

1) Shared Elements

These common elements of the state ensure that all agents have access to the same information about the environment and can coordinate their actions effectively (see Figure 6).

  • Idleness map: An image containing idleness values for each cell, which is generated based on fleet visits (see Figure 6a). All cells within the navigable zone are set to the maximum value 1 initially, denoting that none of the cells has been visited. When an agent visits a cell, its idleness resets to zero. If a cell remains unvisited over time, its idleness value increases, indicating the need for revisitation.

  • Importance map: This map updates the relative importance of visited cells as new measurements are taken (see Figure 6b). All cells within the navigable zone are initially set to the minimum importance value, indicating an unknown but not null interest. When an agent visits a cell, its relative importance is updated based on the level of contamination. The more polluted a cell is, the more importance it has.

FIGURE 6. 4-channel egocentric observation of an agent, in this case, agent 0.

2) Individual Elements

Each agent's state is differentiated by its position, which makes it possible to use a single policy network for all of them (see Figure 6):

  • Agent position: This binary image has a zero value for all cells, except for those covered by the ASV detection area (white cells in Figure 6c). This image provides specific information about the agent's current location and allows it to make decisions based on its own position.

  • Other agents' position: This binary image has a zero value for all cells, except for those corresponding to the location and detection area of the other agents in the environment (white cells in Figure 6d). This image enables each agent to know the other agents' positions and to avoid collisions or coordinate its actions appropriately. A sketch of how the full 4-channel observation can be assembled is given below.
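A minimal sketch of how the 4-channel egocentric observation could be assembled, assuming a disc-shaped detection footprint and maps already normalized to [0, 1] (the exact layout is an assumption, not the paper's code):

import numpy as np

def build_observation(idleness, importance, positions, agent_id, radius=2):
    """Assemble the 4-channel egocentric observation of one agent (sketch).

    idleness, importance: (H, W) shared maps, already normalised to [0, 1].
    positions           : list of (row, col) cells occupied by the fleet.
    """
    h, w = idleness.shape
    rows, cols = np.ogrid[:h, :w]

    def footprint(center):
        r0, c0 = center
        return ((rows - r0) ** 2 + (cols - c0) ** 2 <= radius ** 2).astype(np.float32)

    own = footprint(positions[agent_id])
    others = np.zeros_like(own)
    for i, p in enumerate(positions):
        if i != agent_id:
            others = np.maximum(others, footprint(p))
    # Channel order: idleness, importance, own position, other agents' positions.
    return np.stack([idleness, importance, own, others]).astype(np.float32)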

G. Reward Functions

To guide the agents towards optimal behavior, it is necessary to design a reward function that motivates the agents to achieve the following goals:

  1. During the first phase of the patrolling, the agent fleet should be fully exploratory, visiting the entire map in a coordinated fashion.

  2. After gathering data on the lake during the initial phase, the agents should move on to an intensification phase that takes into account the importance of each zone.

  3. Agents that take measurements in the same cell of the map should be penalized.

In both phases of the process, an idleness matrix (${\mathcal {W}}$ ) is defined such that, ${\mathcal {W}}_{t}(x,y)$ is the idleness at the position (cell) $(x,y)$ at time t. Moreover, the measurements taken by agent i at time t are restricted to an area of detection ($\omega ^{i}_{t}$ ) of radius r, defined by the agent’s sensors (see Figure 7). In this notation, cells measured by an individual agent i (within the area $\omega ^{i}_{t}$ ) are denoted as $(\mathbf {x}_{i},\mathbf {y}_{i})$ .

FIGURE 7. Detection areas $\omega$ of the ASVs and the values of the redundancy mask RM in overlapping $\omega$ regions.

To penalize more than one agent taking a measurement in the same cell, the Redundancy Mask (RM) stores the number of measurements in each cell (see Figure 7). At each step t, every agent $i = 1,2,\ldots,N$ performs a single measurement within its detection mask, so RM is computed as:\begin{equation*} RM_{t} = \sum_{i=1}^{N} \omega^{i}_{t} \tag{10}\end{equation*}

When multiple agents take measurements in the same cell, they share the reward received for measuring that cell. To achieve this, the value of the reward is divided by the RM matrix. As a result, the agents should distribute themselves and maintain a safe distance from each other to improve coverage and maximize the use of the available information. Additionally, the total reward collected is normalized by dividing it by r, which enables a fair comparison of the rewards regardless of the size of the detection radius used. Although rewards are calculated using global matrices (where $\mathcal{W}$ and RM are updated with data from the entire fleet), only the information within an ASV's detection area is considered to reward that particular ASV. During the exploratory phase, the objective is to visit all zones homogeneously to minimize the average idleness of the map. Therefore, visits to cells with high idleness should be incentivized. Given $(\mathbf{x}_{i},\mathbf{y}_{i})$, the cells to be measured as a result of the action $a^{i}$, the Exploration Reward ($ER^{i}$) received by agent i is calculated as follows:\begin{equation*} ER_{t}^{i} = \frac{\mathcal{W}_{t}(\mathbf{x}_{i},\mathbf{y}_{i})}{r \times RM_{t}(\mathbf{x}_{i},\mathbf{y}_{i})} \tag{11}\end{equation*}

During the Intensification phase, the agents should focus on the most relevant zones. Thus, an importance matrix ($\mathcal{I}$) is defined such that $\mathcal{I}_{t}(x,y)$ is the relative importance of the position $(x,y)$ at time t. The Intensification Reward ($IR^{i}$) received by agent i is calculated as in Equation 12. By weighting the idleness of the cells to be visited with their relative importance and receiving it as a reward, the agents are incentivized to focus on the most important areas.\begin{equation*} IR_{t}^{i} = \frac{\mathcal{W}_{t}(\mathbf{x}_{i},\mathbf{y}_{i}) \times \mathcal{I}_{t}(\mathbf{x}_{i},\mathbf{y}_{i})}{r \times RM_{t}(\mathbf{x}_{i},\mathbf{y}_{i})} \tag{12}\end{equation*}
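A compact sketch of the reward computation of Equations 10 to 12, assuming boolean detection masks per agent; it illustrates the idea rather than the exact implementation:

import numpy as np

def rewards(idleness, importance, masks, radius):
    """Exploration and intensification rewards (Eqs. 10-12), illustrative sketch.

    idleness, importance: (H, W) global matrices W_t and I_t.
    masks               : (N, H, W) boolean detection areas, one per agent (omega_t^i).
    radius              : detection radius r used to normalise the rewards.
    """
    rm = masks.sum(axis=0)                      # Eq. (10): redundancy mask
    er, ir = [], []
    for mask in masks:
        cells = mask.astype(bool)
        shared = radius * rm[cells]             # shared cells split the reward
        er.append(float(np.sum(idleness[cells] / shared)))                      # Eq. (11)
        ir.append(float(np.sum(idleness[cells] * importance[cells] / shared)))  # Eq. (12)
    return er, ir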

Regardless of the phase of the episode, every time an agent takes an action, it receives a vector of rewards $r_{t}^{i}=[ER_{t}^{i},IR_{t}^{i}]$. Then, every head in the policy network is trained with all the transitions (each with its respective reward), even if the action was not taken with that specific policy. This is possible because Q-Learning, the basis of DQL, is an off-policy algorithm [37], meaning that it can learn the optimal policy even from actions that were not taken with that same policy.

SECTION V.

Results

This Section presents performance metrics, conducted simulations and learning settings. First, it discusses how the algorithm compares to other approaches in terms of the use of the decoupled method and shared parameters in this case study. Finally, the proposed approach is compared with other heuristic-based patrolling algorithms used in the literature.

The proposed algorithm has been implemented in Python 3, using the PyTorch library for the construction of the neural network. The Gym library served as the simulation environment, and numerical and matrix operations were conducted with the NumPy and SciPy libraries. The algorithm’s code and results are accessible in a GitHub repository. All simulations were conducted utilizing an Intel Xeon Gold 5220R CPU operating at 2.20 GHz with 187 GB of RAM. Additionally, an NVIDIA GeForce RTX 3090 GPU with 24 GB of VRAM was employed to speed up the neural network training.

A. Metrics

The following performance metrics have been defined to evaluate the performance of our algorithm. Let K be the total number of navigable cells, $(i_{k},j_{k})$ be the position of cell k, and let t be a specific time instant during the simulation of duration T.

  • The Instantaneous Global Idleness (IGI): the average idleness of all cells at time step t:\begin{equation*} IGI(t) = \frac{1}{K} \sum_{k=1}^{K} \mathcal{W}_{t}(i_{k},j_{k}) \tag{13}\end{equation*}

  • The Instantaneous Global Weighted Idleness (IGWI): the average weighted idleness of all cells at time step t:\begin{equation*} IGWI(t) = \frac{1}{K} \sum_{k=1}^{K} \mathcal{W}_{t}(i_{k},j_{k}) \times I_{t}(i_{k},j_{k}) \tag{14}\end{equation*}

  • The Average Global Idleness (AGI): the IGI averaged over the whole exploration time $T_{e}$:\begin{equation*} AGI = \frac{1}{T_{e}} \sum_{t=1}^{T_{e}} IGI(t) \tag{15}\end{equation*}

  • The Average Global Weighted Idleness (AGWI): the IGWI averaged over the whole simulation time T:\begin{equation*} AGWI = \frac{1}{T} \sum_{t=1}^{T} IGWI(t) \tag{16}\end{equation*}

  • Percentage visited of the map (PV(t)): the proportion of the map visited at time t.

The IGI and IGWI metrics offer insight into possible momentary inefficiencies in the patrolling strategy, while the AGI and AGWI metrics summarize the strategy’s performance and focus on its sustained efficiency. Note that the AGI only averages the IGI until the exploration phase is over (i.e., once $\nu \lt 1$), because the policy changes afterwards. PV is used as a surrogate metric because full coverage, while not the primary objective, indirectly helps minimize idleness; it complements the other metrics by being more easily interpretable. A sketch of how these metrics can be computed is shown below.
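The following sketch computes the four idleness metrics from logged idleness and importance matrices; the array layouts are assumptions made for illustration:

import numpy as np

def global_idleness_metrics(idleness_hist, importance_hist, t_exploration):
    """IGI, IGWI, AGI, AGWI (Eqs. 13-16) over a simulation, as a sketch.

    idleness_hist, importance_hist: (T, K) arrays with the idleness W_t and
    importance I_t of the K navigable cells at every time step.
    t_exploration: number of steps of the Exploration Phase (T_e).
    """
    igi = idleness_hist.mean(axis=1)                        # Eq. (13), one value per step
    igwi = (idleness_hist * importance_hist).mean(axis=1)   # Eq. (14)
    agi = igi[:t_exploration].mean()                        # Eq. (15), exploration only
    agwi = igwi.mean()                                      # Eq. (16), whole episode
    return igi, igwi, agi, agwi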

B. Simulation Settings

Table 1 lists the key training hyperparameters used during the simulations. The parameter $\nu$-intervals specifies four specific time points during an episode at which a particular $\nu$ value should be set. Each point is defined by a pair of values: the time as a percentage of the total episode duration and the corresponding $\nu$ value. This way, the progression of the $\nu$ value throughout the episode is determined by defining $\nu$ at certain points and interpolating linearly between them. The $\nu$-intervals used for training, as shown in Figure 8, ensure equal duration for the Exploration and Transition Phases, which allows the agent to train equally on both. Additionally, the Intensification Phase lasts 10% longer than the other two phases, as intensification is a more challenging task.
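The linear interpolation of $\nu$ can be sketched as follows; the default schedule reproduces the shape of Figure 8 and is an assumption about the exact values in Table 1:

import numpy as np

def nu_schedule(step, episode_length,
                nu_intervals=((0.0, 1.0), (0.30, 1.0), (0.60, 0.0), (1.0, 0.0))):
    """Linear interpolation of nu over the episode (illustrative sketch).

    nu_intervals: (episode fraction, nu value) pairs; the default keeps nu = 1 for
    the first 30% of the episode, decreases it linearly to 0 at 60%, and keeps it
    at 0 for the remaining Intensification Phase.
    """
    fractions, values = zip(*nu_intervals)
    return float(np.interp(step / episode_length, fractions, values))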

TABLE 1. Hyperparameters Used for Training
FIGURE 8. Visual representation of the $\nu$ evolution using the $\nu$-intervals from Table 1. The Exploration Phase is the first 30% of the episode, followed by a Transition Phase until the episode reaches 60% of its duration. Finally, the Intensification Phase spans the last 40% of the episode.

Table 2 summarizes the environmental parameters. The simulations were conducted using four ASVs (agents) with sensors that have a detection radius of 580 m. The distance an ASV can travel (autonomy) is 58 km, and the ASV movement length is 580 m per step.

TABLE 2. Environment Parameters

Figure 9 displays the total rewards obtained by the fleet throughout the training process. These rewards are the sum of the cumulative rewards earned by individual agents for both intensification and exploration strategies at the end of each of the 10,000 episodes. The figure illustrates the algorithm’s effectiveness in consistently maximizing rewards and eventually converging to a stable solution.

FIGURE 9. Accumulated fleet rewards at the end of each of the 10,000 training episodes.

C. Comparison of Policy Strategies

To validate the two-phase training of the algorithm and the efficiency of parameter sharing, the following two policies were trained:

  • Single-Phase DQN: This policy is a parameter-sharing Multiagent DDQN trained on a single task, utilizing a single Dueling DQN head. Therefore, the training process is performed in a single intensification-phase setting. The purpose of this comparison is to highlight the impact of decoupled-phase training.

  • Task-Specific DQN: This baseline involves two parameter-sharing Multiagent DDQN policies, each dedicated to a single task and trained within the decoupled-phase setting. The methodology differs from our approach, which utilizes a single shared network for multiple tasks. The comparison addresses whether sharing a common DQN across tasks (MDQN) offers advantages in our particular context.

In Figure 10b, a comparison of the IGWI throughout the episode for the three DQN algorithms is presented. It shows how the Single-Phase DQN quickly reduces the IGWI by utilizing its strategy optimized for this metric. However, during this period the PSMA-MDQN algorithm is still in the Exploration Phase, whose primary objective is to minimize the IGI. Upon entering the Intensification Phase, our algorithm reaches an IGWI that is 17% lower, and this advantage is maintained for the rest of the episode. This suggests that the Single-Phase DQN struggles to minimize idleness by revisiting cells that were visited long ago (or never visited) while also having to visit the more relevant zones more frequently. This inefficiency allows our approach, which first patrols homogeneously, to outperform it. Having two phases relaxes the requirement that the agent learn to explore and to intensify with the same policy, as these are conflicting goals, which speeds up the learning process and increases sampling efficiency.

FIGURE 10. A comparison of the results for the three DQN algorithms throughout the episode. It shows the average and standard deviation of the metrics obtained from 500 episodes.

Table 3 summarizes the results of all algorithms calculated after running 500 episodes. Our proposed method outperforms all other algorithms in all metrics, as highlighted in bold black. The AGWI at the end of the episode is 3% lower than that of the Single-Phase DQN, which is the second-best algorithm. Additionally, Figure 10c displays the average PV of 500 episodes and its standard deviation. It is evident that PSMA-MDQN covers a larger percentage of the map at a faster rate, averaging 29% more coverage than Single-Phase DQN.

TABLE 3 Average and Standard Deviation of the Metrics of the Algorithms After the Execution of 500 Episodes. The Presented IGI Is the One Obtained at the End of the Exploration Phase at Time $T_{e}$. In Bold Black the Best Score and in Red the Second-Best Score. $\uparrow$: The Higher the Metric, the Better; $\downarrow$: The Lower the Metric, the Better.

Furthermore, PSMA-MDQN outperforms Task-Specific DQN in all metrics (Table 3), despite the latter using nearly twice (1.96 times) as many neural network parameters. Figure 10a compares the IGI at the end of the Exploration Phase: PSMA-MDQN achieves a 6% lower IGI, indicating that it has learned the exploration strategy more effectively. It also achieves a 13% lower IGWI (see Figure 10b) and a 7% lower AGWI (Table 3), which suggests more efficient learning of the intensification strategy. This superiority is attributed to the shared DQN architecture (MDQN) employed in our approach. The Feature Extractor, the shared block in PSMA-MDQN, enhances task performance by learning shared representations and leveraging information common to both tasks, which improves the model's ability to generalize. Sharing the Feature Extractor across tasks therefore proves beneficial in our particular context.
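To make the parameter-sharing idea concrete, the following PyTorch-style sketch shows one possible way to wire a shared Feature Extractor to two dueling heads, one per task. It is a minimal sketch under our own assumptions: the layer types, layer sizes, and the image-like egocentric input are illustrative and do not reproduce the actual PSMA-MDQN architecture.

import torch
import torch.nn as nn

class SharedDuelingMDQN(nn.Module):
    """Illustrative multitask dueling DQN: one shared feature extractor
    feeding two dueling heads (exploration / intensification).
    Layer sizes and the input shape are assumptions, not the paper's."""

    def __init__(self, in_channels: int, n_actions: int):
        super().__init__()
        # Shared block: learns representations common to both tasks.
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(),
        )
        # One dueling head per task: a value stream and an advantage stream.
        self.heads = nn.ModuleList(
            [self._dueling_head(256, n_actions) for _ in range(2)]
        )

    @staticmethod
    def _dueling_head(in_features: int, n_actions: int) -> nn.ModuleDict:
        return nn.ModuleDict({
            "value": nn.Linear(in_features, 1),
            "advantage": nn.Linear(in_features, n_actions),
        })

    def forward(self, state: torch.Tensor, task: int) -> torch.Tensor:
        z = self.feature_extractor(state)
        head = self.heads[task]          # 0 = exploration, 1 = intensification
        v, a = head["value"](z), head["advantage"](z)
        # Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
        return v + a - a.mean(dim=1, keepdim=True)

At execution time, every homogeneous agent could query this same network and select the head corresponding to its current phase (exploration while the transition variable indicates exploration, intensification otherwise); this head-selection rule is our reading of the decoupled-phase scheme, not a detail taken verbatim from the text.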

The proposed method is also compared with heuristic-based path planning algorithms to validate the results. To ensure a fair comparison, only safe actions are taken by these algorithms. The comparison is made using three different path planners (see Figure 11):

  • Lawn Mower Path Planner (LMPP): Each agent starts its path by randomly choosing a direction and proceeds in that direction until it encounters an obstacle (which could be land or another ASV). The agent then reverses direction, offset by one step from its previous path, to avoid retracing the same path in reverse. Figure 11a shows an example of LMPP trajectories with 4 agents.

  • Random Wanderer Path Planner (RWPP): Agents randomly select a direction and follow it until they encounter an obstacle. At that point, each agent selects another random safe direction except the reverse to introduce variety. This approach ensures that the algorithm explores new paths without redundancy. Figure 11b shows an example of RWPP trajectories with 4 agents.

  • Particle Swarm Optimization Path Planner (PSOPP): PSO [38] is an evolutionary algorithm that deploys a group of particles to find an optimal solution based on a given metric. These particles navigate the search space using mathematical formulas that take into account their position and velocity. PSO has been successfully used and extended for environmental monitoring [39]. In our scenario, each agent i is treated as a particle, with its position denoted as $p^{i}$. The closest position on the map with the highest idleness is $p^{i}_{bW}$, and the closest position with the highest weighted idleness is $p^{i}_{bI}$. At each iteration t, the particle's velocity $vel^{i}_{t}$ is updated according to Equation 17.\begin{equation*} vel^{i}_{t} = w \cdot vel^{i}_{t-1} + c_{1} \cdot (p^{i}_{bW} - p^{i}) + c_{2} \cdot (p^{i}_{bI} - p^{i}) \tag {17}\end{equation*} Table 4 lists the values of the parameters for the PSOPP algorithm. Note that a distinction is made between the parameters of the two phases, so that agents move toward the positions with the highest idleness in the Exploration Phase and toward the positions with the highest weighted idleness in the Intensification Phase. The action taken is the one whose direction is closest to $vel^{i}_{t}$ (a minimal code sketch of this update appears after Figure 11).

TABLE 4 Parameter Values for PSOPP Algorithm
FIGURE 11. An example of LMPP (a) and RWPP (b) trajectories for 4 ASVs.
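As a reading aid for Equation 17, the following Python sketch shows one possible implementation of the PSOPP velocity update and of the rule that picks the discrete action closest in direction to $vel^{i}_{t}$. The eight-direction action set, the function name, and the cosine-similarity criterion are our own assumptions; the parameter values of Table 4 are not reproduced here.

import numpy as np

# Candidate movement directions (assumed 8-connected grid moves).
ACTIONS = np.array([[1, 0], [1, 1], [0, 1], [-1, 1],
                    [-1, 0], [-1, -1], [0, -1], [1, -1]], dtype=float)

def psopp_step(pos, vel_prev, p_best_idleness, p_best_weighted, w, c1, c2):
    """One PSOPP update for a single agent; all point arguments are 2-D arrays."""
    vel = (w * vel_prev
           + c1 * (p_best_idleness - pos)     # pull toward highest idleness
           + c2 * (p_best_weighted - pos))    # pull toward highest weighted idleness
    # Pick the discrete action whose direction is closest to vel
    # (maximum cosine similarity with the candidate movement vectors).
    norms = np.linalg.norm(ACTIONS, axis=1) * (np.linalg.norm(vel) + 1e-9)
    action_idx = int(np.argmax(ACTIONS @ vel / norms))
    return vel, action_idx

# Example usage with arbitrary placeholder values.
vel, action = psopp_step(np.array([5.0, 5.0]), np.zeros(2),
                         np.array([9.0, 2.0]), np.array([1.0, 8.0]),
                         w=0.5, c1=1.0, c2=1.0)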

The results in Figure 12a demonstrate our algorithm's ability to reduce the IGI faster than the heuristic-based algorithms. Due to its slowness and redundancy, the LMPP proves inadequate for reducing either the IGI or the IGWI in a time-efficient manner. At the end of the Exploration Phase, our algorithm achieves an IGI 47% lower than LMPP's. The limitations of LMPP become apparent in Figure 12b, which highlights its inability to maintain a reduced IGWI value. This deficiency stems from the fact that LMPP ignores crucial zones, which ultimately limits its effectiveness for pollution monitoring. At the end of the episode, our algorithm achieves, on average, an AGWI 44% lower than LMPP's.

The Random Wanderer Path Planner (RWPP) is less redundant than LMPP. However, our approach has learned a coordination strategy and surpasses RWPP with a 34% lower IGI at the end of the first phase. Our algorithm also excels in the second phase, achieving a 31% lower AGWI than RWPP, an advantage it maintains consistently despite RWPP's non-redundant nature. Moreover, it achieves a 45% greater reduction in the minimum IGWI. Figure 12c displays the average PV throughout the episode for the heuristic-based algorithms and ours: our algorithm covers the map much faster, with 48% more PV than RWPP and 130% more than LMPP at the end of the Exploration Phase (step 30).

Although PSOPP considers zone idleness for decision-making, its coordination strategy is less sophisticated than ours. The natural swarming behavior of PSOPP's particles (agents) leads to less spatial dispersion and suboptimal performance. As shown in Figure 12a, our approach achieves a 39% lower IGI than PSOPP, and even RWPP reduces the IGI more effectively than PSOPP thanks to its less redundant nature. Figure 12b shows that PSOPP outperforms the other heuristic-based algorithms in terms of intensification, but our method still achieves a 37% greater reduction in weighted idleness and a 31% lower AGWI.

FIGURE 12. A comparison of the results for the heuristic-based algorithms throughout the episode. It shows the average and standard deviation of the metrics obtained from 500 episodes.

Number of Visit Maps (NVMs) (see Figure 13) record the number of times each cell was visited. These maps provide a visual representation of the uniformity of coverage during the Exploration Phase and of the concentration of visits in the more important regions during the Intensification Phase. An episode is randomly selected to inspect the visit maps. The contamination map at the beginning of the Intensification Phase of that episode is shown in Figure 13a and can be used to identify the areas where agents should increase their efforts.
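Conceptually, an NVM is simply a per-cell counter that is incremented whenever an agent occupies a cell. The short sketch below illustrates this bookkeeping; the grid size and cell-indexing convention are illustrative assumptions, not those of the actual simulator.

import numpy as np

# One counter per navigable cell; grid size is an illustrative assumption.
nvm = np.zeros((64, 64), dtype=int)

def record_visits(nvm, agent_cells):
    """Increment the visit counter of every cell currently occupied by an agent."""
    for row, col in agent_cells:
        nvm[row, col] += 1

record_visits(nvm, [(10, 12), (40, 7), (22, 55), (3, 31)])  # e.g., 4 ASVs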

FIGURE 13. Number of Visit Maps (NVMs) record how many times each cell was visited during the Exploration and Intensification Phases. Figure (a) shows the contamination map at the beginning of the Intensification Phase, Figures (c), (e), (g), (i), (k) show the NVMs for the Exploration Phase, and Figures (b), (d), (f), (h), (j), (l) show the NVMs for the Intensification Phase.

Figures 13c, 13e, 13g, 13i and 13k show the NVMs during the Exploration Phase for all algorithms except the Single-Phase DQN, which was not trained on that phase. The visual representations clearly demonstrate the superior and more efficient coverage achieved by PSMA-MDQN compared with the other algorithms. The NVM for LMPP (see Figure 13g) again confirms the notably slow pace and the redundancies of its exploration strategy. Likewise, the inefficiency of RWPP (see Figure 13i) is evident, owing to its inherent lack of coordination. Compared with Task-Specific DQN (see Figure 13e), it is clear that this algorithm does not exhibit coordination behaviors to the same extent as our proposed PSMA-MDQN; consequently, our approach achieves more extensive coverage in fewer steps. Regarding PSOPP (see Figure 13k), exploration is inefficient because the agents' trajectories remain close to each other.

As for the Intensification Phase, Figures 13b, 13d, 13f, 13h, 13j and 13l show the NVMs for all algorithms. It is evident that LMPP and RWPP do not consider the important zones and therefore intensify poorly. In contrast, the DQN-based algorithms and PSOPP successfully intensify in the important zone. The policies trained with the decoupled method, PSMA-MDQN and Task-Specific DQN, shift their focus from covering the entire map to targeting only the most relevant areas. This efficient transition between phases demonstrates that the agents are able to identify the relevant areas and intensify their patrols there. As for PSOPP, agents migrate collectively from one contamination peak to another (see Figure 13l). This collective movement makes them behave like a single agent, which severely degrades performance.

D. Generalization

The goal of this Section is to evaluate the proposed algorithm's robustness and generalization ability after training by changing the values of the $\nu$-intervals. If the algorithm performs well across diverse $\nu$-interval values and achieves good results on both tasks, this suggests that it has learned general patterns and strategies rather than being limited to the specific $\nu$-interval values used during training. Figure 15 illustrates the evolution of the four $\nu$-intervals used for evaluation. Figure 14a shows the IGI evolution for the different configurations. The Only Exploration configuration, which uses the exploration policy throughout the entire episode, exhibits the minimum IGI, and it maintains a low IGI for as long as the Exploration Phase is extended. Figure 14b presents the IGWI evolution of the various configurations. The Only Intensification configuration, which uses the intensification policy for the full episode, experiences a faster decrease in IGWI during the initial steps. However, as the other configurations transition to the Intensification Phase, they quickly overtake the Only Intensification configuration in terms of IGWI reduction. Figure 14c displays the PV evolution for the different configurations. There is a clear trend: as the duration of the Exploration Phase decreases, the average percentage of the map visited decreases. However, both the Only Exploration and the 70-80 configurations reach the same PV levels. It is important to note that the percentage of the map explored depends on the number of steps available during the Exploration Phase, rather than on the proportion of the episode spent in this phase. In simpler terms, if one episode lasts longer than another for the same $\nu$-interval configuration, it will have more steps and the agents will cover larger areas.
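One possible reading of a 'start-end' $\nu$-interval is sketched below: $\nu = 1$ until the end of the Exploration Phase, $\nu = 0$ from the end of the Transition Phase onward, and, as an assumption on our part, a linear decay in between (the paper's exact transition function is not reproduced here).

def nu_schedule(step: int, start: int, end: int) -> float:
    """Transition variable nu for a given 'start-end' nu-interval.

    Sketch under assumptions: nu = 1 (pure exploration) until `start`,
    nu = 0 (pure intensification) after `end`, and a linear decay in
    between.  The exact transition shape is an assumption.
    """
    if step <= start:
        return 1.0
    if step >= end:
        return 0.0
    return 1.0 - (step - start) / (end - start)

# '20-30' interval: exploration up to step 20, intensification from step 30 on.
print([round(nu_schedule(t, 20, 30), 2) for t in (0, 20, 25, 30, 100)])
# -> [1.0, 1.0, 0.5, 0.0, 0.0]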

FIGURE 14. A comparison of the results for the PSMA-MDQN throughout the episode with different $\nu$-intervals. It shows the average and standard deviation of the metrics obtained from 500 episodes.

FIGURE 15. Illustration of the $\nu$-intervals used for evaluation. The curve labeled ('20-30', '70-80') represents the duration of the Exploration Phase up to step (20, 70), followed by a Transition Phase up to step (30, 80); the remaining steps constitute the Intensification Phase. The curve labeled 'Only Exploration' maintains $\nu = 1$ throughout the entire episode, indicating that the episode exclusively consists of the Exploration Phase. Similarly, for the curve labeled 'Only Intensification', $\nu = 0$ throughout the entire episode, indicating that the episode is solely dedicated to the Intensification Phase.

Nonetheless, it is undeniable that the agents have gained the capability to make decisions based on their current phase of operation and have indeed learned to perform the two tasks independently.

E. Discussions

With the results presented, the following discussions unfold:

  • The decoupled method introduced in this study removes the constraint that agents must use a single policy for both initial exploration and subsequent intensification, thereby taking into account the inherent conflict between these objectives. This accelerates the learning process, as demonstrated by the fact that, upon entering the Intensification Phase, our algorithm reduces the IGWI by 17% more than the Single-Phase DQN, a superiority maintained throughout the rest of the episode.

  • Parameter sharing is more efficient for related multitask learning in our case study. Our algorithm achieved a 6% lower IGI during the Exploration Phase, a 13% lower IGWI, and a 7% lower AGWI than the Task-Specific DQN, despite the latter using 1.96 times as many neural network parameters.

  • The algorithm developed in this study outperforms heuristic-based approaches such as LMPP, RWPP and PSOPP. On average, it achieves a 44% lower AGWI than LMPP and a 31% lower AGWI than RWPP and PSOPP by the end of the episode. The algorithm also demonstrates a learned coordination strategy, resulting in a 47% lower IGI than LMPP and a 34% and 39% lower IGI than RWPP and PSOPP, respectively, by the end of the first phase. Its superior speed in covering the map is evident, with 48% more PV than RWPP, a substantial 130% lead over LMPP, and 58% more PV than PSOPP by the end of the Exploration Phase (step 30).

  • The comparison between the proposed algorithm and LMPP highlights the superior performance of the former. This is mainly due to LMPP's exhaustive approach to ensuring complete coverage by strictly following parallel paths, which leads to excessive redundancy. Additionally, in non-convex scenarios, LMPP would face challenges in escaping corners, which would impede its proper operation. As for RWPP, it provides fast homogeneous map coverage but lacks coordination based on weighted idleness, and therefore lacks effective intensification. Its performance would also degrade in non-convex settings.

  • Regarding PSOPP, when particles initiate exploration from nearby locations, they tend to exhibit similar behaviors. In the Exploration Phase, each particle moves towards the nearest cell with the highest idleness, resulting in a dispersion of paths when encountering obstacles such as the border of the lake. However, in the Intensification Phase, where importance is concentrated, particles tend to congregate and move collectively. This behavior is especially problematic in environments with multiple contamination peaks. In such cases, particles tend to move in groups from one peak to another, making the task extremely inefficient. In contrast, our algorithm provides agent allocation strategies that are particularly suitable for achieving homogeneous and non-homogeneous coverage across the entire map, even in complex and dynamic environments, e.g., when contamination peaks are dispersed.

  • Our algorithm has learned to perform two tasks independently, and the policies can be used arbitrarily.

  • Our algorithm allows for smooth transitions between phases or the use of a single phase. This provides users with the flexibility to configure the algorithm according to their needs.

SECTION VI.

Conclusion

In the context of a dynamic Partially Observable Markov Game (POMG), such as the challenging Lake Ypacaraí patrolling scenario with multiple ASVs, our approach strategically divides the patrolling task into two distinct phases: the Exploration Phase and the Intensification Phase. The aim of the Exploration Phase is to cover the map homogeneously, while the objective of the Intensification Phase is to intensify the coverage of the most polluted areas. Additionally, a novel approach has been introduced to ensure a smooth transition between the two phases. To tackle the computational complexity of the problem, a Dueling DQN has been trained with two heads, one dedicated to estimating the Q-function for the Exploration Phase and the other for the Intensification Phase. The policy is shared across all agents since they are homogeneous and the input state formulation is egocentric.

The results indicate that the decoupled method introduced in our study is effective. This method frees agents from the constraint of using a single policy for both exploration and intensification, which accelerates the learning process. Our algorithm consistently outperforms the Single-Phase DQN, a policy trained only on the intensification task with the same architecture as ours but a single Dueling DQN head. Furthermore, our multitask learning approach with parameter sharing is more efficient than the Task-Specific DQN, which uses two Dueling DQNs with the same architecture as ours, each dedicated to a different task and trained within a decoupled-phase setting. Our approach achieves better results despite using approximately half as many neural network parameters.

By changing the values of the $\nu$-intervals during evaluation to ones that the agents had not been trained with, it is demonstrated that the algorithm has acquired the ability to perform the two tasks autonomously, even in scenarios not encountered during training. This gives users the flexibility to configure the values of the $\nu$-intervals according to their requirements.

In future lines of research, shifting from predefined task durations to training the network with multiple objectives emerges as a promising way to achieve optimal performance configurations. This shift allows for a more nuanced consideration of user preferences, including exploration and intensification requirements, as well as energy efficiency. The proposed approach is to solve a multi-objective optimization problem to identify a Pareto front containing non-dominated policies, i.e., solutions where one objective cannot be improved without compromising another. Exploring the Pareto front by training with varying objective weightings offers opportunities to discover versatile and adaptive patrolling strategies tailored to diverse user needs and environmental dynamics. Additionally, other environmental monitoring tasks, such as bathymetric surveys and trash detection, could be added to the proposed framework. Moreover, including the battery level as another decision variable in trajectory design offers a promising area of research, with the potential to improve the efficiency and sustainability of autonomous systems operating in dynamic environments.
