Introduction
Water resources, such as lakes and ponds, are vital to life, providing habitat for wildlife and a source of drinking water. Lake Ypacaraí in Paraguay, which spans over 60 km², faces increasing pollution due to waste discharges, the lack of sewage systems, and tourism. This has resulted in eutrophication driven by excess nutrients [1] and in cyanobacteria blooms that deplete dissolved oxygen, posing a threat to aquatic life and water quality. Efficient monitoring of Lake Ypacaraí's biological status, especially of dynamic blue-green algae blooms, necessitates periodic Water Quality (WQ) sampling. WQ data, such as pH, dissolved oxygen, temperature, and water turbidity, can be used to create a physicochemical model that provides up-to-date information on the water's biological status, which aids decision-making for monitoring pollution in Lake Ypacaraí more efficiently. In addition, certain areas of the lake are more polluted than others [2], [3], [4] and therefore require more frequent monitoring. Manual sampling is impractical due to the lake's vast size (60 km²) and the potential health risks for biologists in contaminated environments. Furthermore, installing a fixed sensor grid is not an optimal solution, because it limits the ability to vary sampling locations and may be more costly, due to frequent battery replacements, than human-conducted missions [5].
In [2] and [3], the deployment of Autonomous Surface Vehicles (ASVs) equipped with WQ sensors for continuous monitoring instead of manual sampling (Figure 1) was proposed. ASVs are capable of operating autonomously, making them more efficient and suitable for tasks performed in hazardous environments, such as polluted areas. In addition, they offer the advantage of hourly sampling, in contrast to the limited frequency of manual sampling (once per day). By deploying a fleet of ASVs, simultaneous measurements can be taken at various locations on the map, yielding a more complete and detailed physicochemical model. The main challenge in the multi-ASV paradigm is to develop an effective coordination strategy that enables the ASVs to independently sample the lake with collision-free routes while meeting a low redundancy criterion. Physical constraints such as battery limitations and non-navigable zones must also be considered, as well as avoiding collisions between ASVs. To address these issues, the continuous monitoring of the lake is modeled as the Patrolling Problem. The objective of this task is to find an efficient route or strategy to cover important zones of the lake while also revisiting those that have not been sampled in a long time.
When monitoring Lake Ypacaraí with ASVs, the scenario becomes a Partially Observable Markov Game (POMG) due to the unknown pollution levels at the beginning of the episode. Additionally, experiments in this work focused only on blue-green algae bloom contamination, which is dynamic, chaotic, and changes in size over time.
In this work, the patrolling problem is addressed in a decoupled way, dividing it into two phases: exploration and intensification. During the exploration phase, the ASVs are required to visit all areas of the map and take pollution measurements. At this stage, homogeneous patrolling is carried out: the relative importance of the areas is not taken into account, so the lake is covered uniformly. After identifying the most significant areas of contamination, which are characterized by blue-green algae blooms, the intensification phase follows. In this stage, non-homogeneous patrolling is carried out, with a focus on visiting the most relevant areas more frequently. This approach poses a challenge, as it implies that the same agent must adapt to the changing phases and make optimal decisions in each of them. The proposed approach employs a variable, named ν, that governs the transition between the two phases.
Given the extensive number of possible routes and the number of ASVs required for continuous monitoring of large bodies of water, this problem becomes NP-hard. As suggested by [3], [6], and [7], a Deep Reinforcement Learning (DRL) approach is recommended for dealing with this problem's high dimensionality. DRL methods, like Deep Q-Learning (DQL), operate without a model of the environment (model-free) and use a Convolutional Neural Network (CNN) to approximate the Q-function, which can effectively handle high-dimensional state spaces [8]. In this multi-ASV system, each vehicle is an agent in a cooperative Multi-Agent DRL (MADRL) environment. To deal with scalability issues, a single neural network, designed and trained based on the study of [6], is shared by all agents. The agents can share the Deep Policy because they are interchangeable in terms of their observation and action capabilities. As the patrolling mission has been decoupled into two phases, the agents must learn both tasks, resulting in Multitask Multiagent DRL (MT-MADRL). Our approach is based on a neural network with two heads, one for each task's Q-function. Therefore, regardless of the number of agents, there is only one network and one Q-function to optimize for each task. This is possible because the agents pursue the same collective reward and act in a purely cooperative manner, so each individual reward can be modeled with a single network. Concerning the non-navigable terrain constraints, Censoring-DQL [5] is applied, which uses the deterministic information of the environment to discard actions that would violate those constraints. Furthermore, if a collision between agents is about to occur, the consensus algorithm proposed in [9] is used to avoid it. Therefore, the main contributions of this article are:
A framework for monitoring partially observable dynamic scenarios with a two-phase approach: Exploration and Intensification. The proposed approach employs a smooth transition mechanism between the phases.
The application of Multitask Multiagent Deep Reinforcement Learning using Deep Q-Learning and parameter-sharing techniques for environmental monitoring.
A comparison between the proposed framework and other algorithms based on heuristics.
This paper is structured as follows: Section II is a review of the literature on Multiagent and Multitask Deep Reinforcement Learning with applications to environmental monitoring. Section III presents the Multiagent Patrolling Problem, provides specific scenario details, and outlines key assumptions. Section IV details the approach utilized to tackle the Multiagent Patrolling Problem, including the use of DRL approaches, the Exploration versus Intensification phase strategy, reward function design principles, and state space representation. In Section V, the paper presents key metrics and simulation results to verify the effectiveness of the proposed methodology. Finally, Section VI summarizes the findings and proposes potential avenues for future research.
Related Work
The use of ASVs in aquatic environments has gained significant attention in recent years [10]. ASVs have a wide range of potential applications, including environmental monitoring for early warning of pollution [3], [6], [7], bathymetric surveying for navigation safety [11], [12], and emergency response tasks such as oil spill tracking [13]. Artificial Intelligence approaches, such as Bayesian Optimization (BO) [14], Genetic Algorithms (GA) [15], and DRL [3], have been applied to mitigate the NP-hard complexity of many applications involving a large number of possible trajectories. In [14], a mission planning method based on BO was proposed, which defines the movements of the ASV with the aim of minimizing the uncertainty of the contamination distribution of an aquatic environment. Although this approach efficiently obtains a physicochemical model in low-data regimes, the authors do not consider that water quality parameters can change during a mission. Our proposal addresses the dynamic behavior of algae blooms. Additionally, in [7], the authors demonstrated that scalability issues are better addressed with DRL than with GA approaches, which motivated this work to use DQL to solve the Patrolling Problem.
The Deep Q-Network (DQN) algorithm [8], which enabled games on the Atari 2600 console to be played at an expert human level, catapulted the popularity of DQL in 2015. In this work, Rainbow DQN [16] is used, which was introduced as a combination of several improvements, including DDQN [17], Dueling DQN [18], and Prioritized Experience Replay [19]. Regarding non-homogeneous patrolling with DQL, in [20] a relevance map is used to represent areas with different coverage requirements; the policy can only make decisions based on the data of the relevance map within the camera's field of view. In contrast, our proposal takes point sample measurements while creating a model of the relevant zones (polluted areas). This model is then used as an input image for the neural network. This methodology of environmental representation was previously employed in [3], where the authors created a pollution map for Lake Ypacaraí and generated an image to highlight the areas with the highest pollution levels. That work differs from ours in that it neither considered a dynamic pollution map nor used a collision avoidance mechanism.
Using multiple autonomous vehicles as sensing nodes [6], [21], [22] leads to increased coverage and faster data collection. The extension of DRL to the multi-agent case (MADRL) enhances the performance of the single-agent case but brings new inherent challenges such as computational complexity, nonstationarity, partial observability, and credit assignment [23]. In cooperative multiagent environments, parameter sharing (PS) [24] has been shown to be effective when agents are homogeneous (they share the same set of skills). The PS approach, which our proposal uses, involves sharing the parameters of a single policy among all agents, which is trained with the experiences of all agents simultaneously. This enables an efficient training process that can scale up to an arbitrary number of agents, thereby reducing the computational complexity. Previous studies, such as [25], have utilized PS in MADRL, where the network input includes information on the relative position and velocity of the other agents. While that approach is also egocentric, our work diverges in that it employs a single buffer to store the experiences of all agents, whereas they use a separate buffer for each agent, limiting the scalability of their algorithm. In contrast, other works, such as [26], do not use replay memory. Instead, they combine PS with the Deep Recurrent Q-Network (DRQN) [27] to address partial observability, approximating the Q-function with a recurrent neural network (RNN) that can maintain an internal state and aggregate observations over time. However, the Dueling DQN architecture has been proposed as a faster and easier-to-train alternative to RNNs in high-dimensional spaces, as noted by [24]. Additionally, the partial observability of our proposed environment is mitigated by implementing a first phase of homogeneous patrolling, since the pollution distribution along the lake is unknown at the beginning of the episode. In the context of multi-agent patrolling for environmental missions, [6] and [28] proposed a scheme that employs Deep Q-Learning with a Convolutional Neural Network as a shared fleet policy and utilizes a global visual state. In [6], a decoupled final layer was proposed for N agents with |A| possible actions. This layer consists of an individual fully connected layer for every agent, resulting in |A|×N neurons in the last layer. In contrast, our proposal only has 2×|A| neurons in the last layer for any number of agents. In fully cooperative environments that employ joint reward signals, agents face credit assignment challenges when determining the impact of their actions on team performance. To tackle the credit assignment problem, [28] proposed a decoupled reward in which each agent receives a reward only for its individual contributions, without any additional considerations. They demonstrated that their approach is effective in addressing the credit assignment problem, which motivates this work to employ it. The works mentioned above only address possible collisions of the agents by explicitly penalizing them in the reward function. In contrast, our proposal includes collision avoidance mechanisms that guarantee that no action will cause a collision.
In this study, the agents learn to optimize two tasks and therefore two policies. The first part of the episode involves making decisions based on a policy optimized only for the exploration task, while the second part involves taking actions following an intensification policy. In the literature, the multitask paradigm in DQL has been addressed with the Multitask Deep Q-Network (MDQN) [29], [30], [31]: a network is trained as a vanilla DQN, but with a separate output layer (head) for each task. In [29], it was concluded that, given sufficiently related tasks, shared hidden layers in the MDQN can efficiently learn a shared feature representation and therefore perform well across tasks. In our case, the exploration and intensification phases are similar, as the agents coordinate their efforts to reduce the weighted average idleness on a single map. The conclusion of [29] and other successful single-agent applications of MDQN [30], [31] motivated us to use this approach.
Extending to the multi-agent paradigm, [32] and [33] have applied knowledge transfer to MT-MADRL by first training task-specific neural networks (teachers) and then applying a knowledge distillation algorithm to train a single policy that performs well across tasks. Our proposal differs in that it aims to learn a policy for each task, rather than one policy for multiple tasks. On the other hand, the authors in [34] proposed an approach that allows agents to have multiple policies for different tasks. However, their algorithm is based on Deep Deterministic Policy Gradient (DDPG), a DRL algorithm for continuous action spaces, and therefore cannot be applied to our proposed discrete action space environment. Furthermore, to the best of our knowledge, no approaches to MT-MADRL utilize DQL. This research aims to fill this gap.
Preliminaries
A. The Patrolling Problem
The main task of the patrolling problem is to identify the ideal strategy for visiting relevant areas of an environment in order to monitor them periodically. To represent a continuous surface, the terrain skeletonization technique [35] can be used. This means that the environment to be covered can be represented as an undirected graph G(V, E, W), with a set of vertices V, a set of edges E connecting them, and a set of weights W that represents the idleness of each vertex, i.e., the time elapsed since its last visit.
Example of a graph G with vertices V and edges E.
Furthermore, Lake Ypacaraí contains regions with high pollution concentration, such as blue-green algae blooms. The behavior of these blooms is dynamic and often chaotic, causing them to frequently change in size over time. The set of dynamic weights of the graph nodes, denoted as I(t), therefore indicates the importance of each node based on its level of contamination, with the more contaminated ones having greater importance. As a consequence, the non-homogeneous patrolling problem proves to be the best fit for the Lake Ypacaraí case [3] and can be formulated as finding a policy π that minimizes the importance-weighted average idleness: \begin{equation*} G(V,E,W,I,t) \longrightarrow \pi (E) \;\big|\; \min \frac {1}{|V|} \sum ^{|V|}_{k=1} W_{k} \times I_{k}(t) \tag {1}\end{equation*}
The patrolling problem is naturally suited to being shared by multiple agents in space and time, since they can patrol different areas simultaneously. Our fleet of ASVs shares a homogeneous architecture and shares its information about the environment, so the idleness to minimize is the shared idleness. This means that all agents collectively reset the estimate of a node's idleness based on the visits made by any of them.
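As a reference for how the objective in Equation 1 and the shared idleness are evaluated, the following minimal sketch (in Python with NumPy, consistent with the libraries used in this work) computes the weighted average idleness over a small set of nodes; the array names and values are illustrative, not taken from the actual simulator.

import numpy as np

def weighted_average_idleness(W: np.ndarray, I: np.ndarray) -> float:
    """Objective of Eq. (1): mean of the node idleness W_k weighted by importance I_k(t)."""
    return float(np.mean(W * I))

def register_visits(W: np.ndarray, visited_nodes: np.ndarray) -> np.ndarray:
    """Shared idleness: a visit by *any* agent resets the node's idleness to zero."""
    W = W.copy()
    W[visited_nodes] = 0.0
    return W

# Example: 5 nodes, one of them heavily contaminated (high importance).
W = np.array([0.2, 0.9, 0.5, 1.0, 0.7])   # current idleness per node
I = np.array([0.1, 0.1, 0.1, 0.8, 0.1])   # importance weights I(t)
print(weighted_average_idleness(W, I))
print(weighted_average_idleness(register_visits(W, np.array([3])), I))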
B. Scenario and Assumptions
The scenario definition is essential in patrolling problems as it establishes the ASV or agent’s movement abilities, as well as the real constraints and boundary conditions. Our target scenario design is based on the following assumptions:
1) Lake Map
The map of Lake Ypacaraí has been divided into a grid map (see Figure 3), where each cell corresponds to a node and the distance traveled between adjacent cells corresponds to the weight of an edge. Given the total surface area of the lake (around 60 km²), each square cell represents an equally sized portion of the lake surface.
Discretized map of Lake Ypacaraí. Blue cells denote visitable water surfaces while brown cells denote illegal, non-navigable areas.
2) Lake’s Contamination
As the ASVs explore the lake and collect samples, they build a map of the lake's pollution, which is also part of the scenario. Our experiments focused solely on pollution caused by blue-green algal blooms. The bloom's behavior is dynamic and chaotic, and its size changes over time, so it is modeled using Brownian motion. In this model, the concentration of algae is represented by discrete particles that can move within the lake space. Using the position of each particle, a particle map is generated at every step, from which the contamination level of each cell of the grid is derived.
Example of the evolution of a contamination map as a result of blue-green algae in Lake Ypacaraí. (a) is the initial map, (b) is the map after 30 steps, (c) after 60 steps, (d) after 100 steps.
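A minimal sketch of how such a Brownian particle model can be simulated on a grid is shown below; the step size, the number of particles, and the normalization are illustrative assumptions, not the exact parameters of the simulator used in this work.

import numpy as np

def step_particles(particles: np.ndarray, sigma: float, bounds: tuple) -> np.ndarray:
    """One Brownian step: add Gaussian noise to each particle position and clip to the lake bounds."""
    particles = particles + np.random.normal(0.0, sigma, particles.shape)
    return np.clip(particles, 0, np.array(bounds) - 1)

def contamination_map(particles: np.ndarray, shape: tuple) -> np.ndarray:
    """Count particles per grid cell and normalize to [0, 1] as a proxy for algae concentration."""
    hist, _, _ = np.histogram2d(particles[:, 0], particles[:, 1],
                                bins=shape, range=[[0, shape[0]], [0, shape[1]]])
    return hist / max(hist.max(), 1.0)

# Example: 500 particles drifting on a 50x50 grid for 30 steps.
grid_shape = (50, 50)
parts = np.random.uniform(0, 50, size=(500, 2))
for _ in range(30):
    parts = step_particles(parts, sigma=0.8, bounds=grid_shape)
cmap = contamination_map(parts, grid_shape)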
3) Vehicle Movement
Once the number of movement directions for the agent has been established, the heading angles for each direction are calculated, evenly spaced within an angle interval of [0, 2π). At each step, the agent moves a fixed distance along the chosen heading, provided that the destination cell is navigable.
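For illustration, the following sketch maps a discrete action index to a heading and a planar displacement, assuming eight evenly spaced directions and the 580 m step length listed later in the simulation settings; the function name and defaults are illustrative.

import numpy as np

def action_to_displacement(action: int, n_actions: int = 8, step_length_m: float = 580.0):
    """Map a discrete action index to a planar displacement.

    The n_actions headings are evenly spaced over [0, 2*pi); each step moves the ASV
    a fixed distance along the selected heading.
    """
    angle = 2.0 * np.pi * action / n_actions
    return step_length_m * np.cos(angle), step_length_m * np.sin(angle)

# Example: with 8 actions, action 2 corresponds to a 90-degree heading.
dx, dy = action_to_displacement(2)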
4) Communication
In this scenario, communication between agents regarding the fleet positions and the pollution measurements is assumed. The agents jointly create an importance map (the set of weights I(t)) based on the contamination measurements. The model determines the importance of a cell based on its most recent contamination measurement. All cells within the navigable zone of the lake have a minimum importance value, ensuring an interest in covering them. Furthermore, knowledge of the fleet positions is crucial for each agent to avoid collisions and to ensure safe and efficient behavior.
Methodology
A. Multitask Reinforcement Learning
Reinforcement Learning (RL) [36] is an Artificial Intelligence approach in which an agent learns by interacting with the environment. The Markov Decision Process (MDP) mathematical framework is used to formalize RL problems, such as patrolling Lake Ypacaraí. An MDP is denoted by a tuple consisting of five elements: the state space S, the action space A, the transition function P, the reward function R, and the discount factor γ ∈ [0, 1).
The agent's objective is to find a policy π that maximizes the expected long-term return J(π), the discounted sum of rewards obtained by following π: \begin{equation*} J(\pi )=\mathbb {E}_{a_{t} \sim \pi }\left [{\sum _{t=0}^{\infty }\gamma ^{t}r_{t} }\right ] \tag {2}\end{equation*}
B. Deep Q-Learning
The state-action value function Q(s, a; θ) estimates the expected return of taking action a in state s and following the policy thereafter. Following the Q-Learning update rule, it is adjusted at every step as: \begin{align*} Q(s,a; \theta )& \leftarrow Q(s,a; \theta ) \\ & \quad +\alpha \left (r+\gamma \,\underset {a'}{\max }\, Q(s',a'; \theta )-Q(s,a; \theta )\right ) \tag {3}\end{align*}
In [8], the authors present two important methods for improving stability and efficiency in reinforcement learning: Experience Replay and the Target Network. Experience Replay consists of storing past experiences (s, a, r, s′) in a buffer and sampling random minibatches from it during training, breaking the correlation between consecutive samples. The Target Network is a periodically updated copy of the Q-network that is used to compute the learning targets, which stabilizes training. In addition, the Dueling DQN architecture [18] decomposes the Q-function into a state-value stream V and an advantage stream A, which are combined as: \begin{align*} Q(s, a; \theta ) = V(s; \theta ') + \left ({ A(s, a; \theta '') - \frac {1}{|A|} \sum _{a'} A(s, a'; \theta '') }\right ) \tag {4}\end{align*}
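The following sketch illustrates these two mechanisms with a uniform replay buffer and a Double-DQN-style temporal-difference update in PyTorch; it is a simplified stand-in for the Rainbow DQN used in this work (which additionally employs prioritized replay and the dueling architecture), and all names are illustrative.

import random
from collections import deque
import torch
import torch.nn.functional as F

class ReplayBuffer:
    """Uniform experience replay (the paper uses a prioritized variant)."""
    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)
    def push(self, transition):          # transition = (s, a, r, s_next, done)
        self.buffer.append(transition)
    def sample(self, batch_size: int):
        return random.sample(self.buffer, batch_size)

def td_update(q_net, target_net, optimizer, batch, gamma: float = 0.99):
    """One Double-DQN update: the online net selects the next action, the target net evaluates it."""
    s, a, r, s_next, done = (torch.stack(x) if torch.is_tensor(x[0]) else torch.tensor(x)
                             for x in zip(*batch))
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_a = q_net(s_next).argmax(dim=1, keepdim=True)
        target = r + gamma * (1 - done.float()) * target_net(s_next).gather(1, next_a).squeeze(1)
    loss = F.smooth_l1_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Every C training steps, the target network is refreshed by copying the online weights (not shown).
    return loss.item()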
C. Proposed Decoupled Method
The pollution levels and, consequently, the set of weights I(t) are unknown at the beginning of the mission. However, to effectively achieve Equation 1, the group of agents must first acquire sufficient knowledge of I. For this reason, the mission is divided into two phases: exploration and intensification. In the exploration phase, the ASVs patrol the area homogeneously with the aim of minimizing the average idleness, without considering the importance (the set I(t)) of specific zones. Hence, the main goal of this phase is to achieve homogeneous coverage by searching for a joint policy Π that satisfies: \begin{equation*} G(V,E,W) \longrightarrow \Pi (E) \;\big|\; \min \frac {1}{|V|} \sum ^{|V|}_{k=1} W_{k} \tag {5}\end{equation*}
Once contamination information has been gathered, the goal is to ensure thorough and intensive coverage of the highly polluted zones during the intensification phase. The primary objective is therefore to develop a joint policy, denoted by π, that minimizes the importance-weighted average idleness: \begin{equation*} G(V,E,W,I,t) \longrightarrow \pi (E) \;\big|\; \min \frac {1}{|V|} \sum ^{|V|}_{k=1} W_{k} \times I_{k}(t) \tag {6}\end{equation*}
In this case, the exploration and intensification phases are similar, since the agents coordinate their efforts to decrease the average idleness on a single map. However, exploration seeks homogeneous coverage while intensification seeks non-homogeneous coverage, making the problem inherently multi-objective. If the agents only learned the intensification task, they would first need to explore to gather a good model of the contamination at the beginning of the mission and only then decide where to intensify. To speed up the learning process and increase the efficiency of sampling, this study relaxes the requirement that agents must use the same policy for both exploration and intensification, acknowledging the conflicting nature of these goals.
The proposed method uses a variable ν ∈ [0, 1] that defines the probability of acting according to the exploration policy. At every step, each agent selects its action as: \begin{align*} \pi _{\nu }(s) = \begin{cases} \underset {a}{\arg \max }\, Q_{e}(s,a) & \text {with probability } \nu \\ \underset {a}{\arg \max }\, Q_{i}(s,a) & \text {with probability } 1- \nu \end{cases} \tag {7}\end{align*} where Q_e and Q_i denote the Q-functions of the exploration and intensification tasks, respectively. Three regimes are distinguished:
ν = 1: Exploration phase.
ν = 0: Intensification phase.
0 < ν < 1: Transition phase, during which there is a (ν × 100)% probability of choosing an exploratory action; ν gradually decreases so as to transition smoothly to the intensification phase.
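A minimal sketch of this ν-gated action selection (Equation 7) is given below; the linear decay schedule is only an illustrative assumption, since the exact shape of the decay is a design choice, and all names are illustrative.

import numpy as np

def select_action(q_exploration: np.ndarray, q_intensification: np.ndarray, nu: float,
                  rng: np.random.Generator) -> int:
    """Eq. (7): act greedily on the exploration head with probability nu,
    otherwise on the intensification head."""
    q_values = q_exploration if rng.random() < nu else q_intensification
    return int(np.argmax(q_values))

def nu_schedule(t: int, t_explore: int, t_transition: int) -> float:
    """Illustrative schedule: nu = 1 during exploration, then a linear decay to 0."""
    if t < t_explore:
        return 1.0
    return max(0.0, 1.0 - (t - t_explore) / max(t_transition, 1))

# Example: step 35 of an episode whose exploration phase ends at step 30.
rng = np.random.default_rng(0)
a = select_action(np.array([0.1, 0.7, 0.3]), np.array([0.9, 0.2, 0.4]),
                  nu_schedule(t=35, t_explore=30, t_transition=10), rng)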
D. Collision Avoidance Mechanisms
As mentioned above, the Censoring-DQL [5] algorithm is employed, which uses the deterministic knowledge of the environment to handle actions leading to non-navigable zones. In this algorithm, the invalid actions of an agent i, i.e., those that would lead it into a non-navigable zone, are identified using the information in the lake map, which is known a priori. Then, a censoring function η is defined as: \begin{align*} \eta (s,a^{i}) = \begin{cases} 1 & \text {if } a^{i} \text { is valid} \\ -\infty & \text {if } a^{i} \text { is not valid} \end{cases} \tag {8}\end{align*}
The censored Q-values of each agent are then obtained by applying η element-wise to its Q-values, so that invalid actions can never be selected: \begin{equation*} Q_{C}^{i}(o^{i},a^{i}) = \eta (s,a^{i}) \circ Q^{i}(o^{i},a^{i}) \tag {9}\end{equation*}
On the other hand, simultaneous actions within the navigable zone may cause conflicts, resulting in collisions between agents. To address this issue, the SafeConsensus [9] algorithm is utilized. The algorithm sorts the agents by their highest censored Q-value; the agent with the highest value acts first, without considering the others. Each subsequent agent then takes into account the new positions of the agents that have already decided, censoring the Q-values of any action that would lead to a collision with them.
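The following sketch combines both mechanisms: Q-values of invalid actions are censored to −∞ (Equations 8-9), and agents then choose sequentially in descending order of their best censored Q-value, additionally censoring actions that would collide with agents that have already decided. The conflicts predicate and all names are hypothetical placeholders, not the exact implementations of [5] or [9].

import numpy as np

def censor_q(q_values: np.ndarray, valid_mask: np.ndarray) -> np.ndarray:
    """Censoring (Eqs. 8-9): invalid actions get -inf so that argmax can never pick them."""
    censored = q_values.astype(float).copy()
    censored[~valid_mask] = -np.inf
    return censored

def safe_consensus(per_agent_q, per_agent_valid, conflicts) -> list:
    """Consensus sketch: agents decide in descending order of their best censored Q-value.
    conflicts(i, a, j, b) is a hypothetical predicate returning True when action a of
    agent i collides with the already-chosen action b of agent j."""
    best = [np.max(censor_q(q, v)) for q, v in zip(per_agent_q, per_agent_valid)]
    order = np.argsort(best)[::-1]                 # highest joint Q decides first
    chosen = {}
    for i in order:
        q = censor_q(per_agent_q[i], per_agent_valid[i])
        blocked = [a for a in range(q.size)
                   if any(conflicts(i, a, j, b) for j, b in chosen.items())]
        if blocked:
            q[blocked] = -np.inf                   # censor conflicting actions
        chosen[int(i)] = int(np.argmax(q))
    return [chosen[i] for i in range(len(per_agent_q))]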
E. Multitask Multiagent Deep Q-Network
In this study, the Parameter-Sharing Multiagent MDQN (PSMA-MDQN) is presented as an extension of the MDQN to the multiagent paradigm. The DDQN method, in combination with the dueling architecture (Dueling DQN), is used to develop a dense Convolutional Neural Network with two parallel terminations, referred to as "heads", each corresponding to a specific task: exploration and intensification. The output layer and parameters of each head are separate, allowing the PSMA-MDQN to learn and optimize the two tasks independently. Each head has its own loss function, and the weights of each head are adjusted independently during training. The shared block, also known as the Feature Extractor, extracts common features that are useful for both tasks [29]. These shared features are utilized by each individual head to generate its corresponding output (see Figure 5). This approach is feasible because each agent operates with the same set of actions and is subject to the same constraints, making them homologous in both actions and observations. This work therefore benefits from the homogeneity of the agents to train, on a single network, one policy per task that is shared by the whole fleet.
The PSMA-MDQN architecture proposed here utilizes a shared Feature Extractor that captures common features from the input state. Each task is assigned an individual head consisting of a three-layer dense neural network followed by a dueling block with separate Advantage (A) and Value (V) streams, which are combined according to Equation 4 to produce the Q-values of that task.
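A possible PyTorch sketch of this two-headed architecture is shown below; the convolutional layer sizes and hidden widths are illustrative assumptions, as the exact dimensions of the network used in this work are not reproduced here.

import torch
import torch.nn as nn

class PSMAMDQN(nn.Module):
    """Sketch of the shared-parameter multitask network: a convolutional Feature Extractor
    shared by all agents, followed by one dueling head per task (exploration and
    intensification). Layer sizes are illustrative."""

    def __init__(self, n_actions: int, in_channels: int = 4):
        super().__init__()
        self.features = nn.Sequential(                      # shared Feature Extractor
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        )
        self.heads = nn.ModuleDict({
            task: self._dueling_head(64 * 4 * 4, n_actions)
            for task in ("exploration", "intensification")
        })

    @staticmethod
    def _dueling_head(in_dim: int, n_actions: int) -> nn.ModuleDict:
        hidden = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                               nn.Linear(256, 256), nn.ReLU(),
                               nn.Linear(256, 128), nn.ReLU())
        return nn.ModuleDict({"hidden": hidden,
                              "value": nn.Linear(128, 1),
                              "advantage": nn.Linear(128, n_actions)})

    def forward(self, obs: torch.Tensor, task: str) -> torch.Tensor:
        x = self.features(obs)
        head = self.heads[task]
        h = head["hidden"](x)
        v, a = head["value"](h), head["advantage"](h)
        return v + a - a.mean(dim=1, keepdim=True)           # Eq. (4) aggregation

# Example: a batch of two egocentric 4-channel observations.
net = PSMAMDQN(n_actions=8)
q_explore = net(torch.rand(2, 4, 64, 64), task="exploration")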
Algorithm 1 Parameter-Sharing Multiagent MDQN With SafeConsensus and Censoring-DQL Algorithm
Initialize replay memory B to capacity |B|
Initialize multi-policy network Q (heads Q_e and Q_i) with random weights θ
Clone Q into target network Q′ with weights θ′ ← θ
Set hyperparameters: ε, ν, γ, learning rate, batch size, target-update period C
for episode = 1, …, E do
    Reset environment and initialize state s_1
    for t = 1, …, T do
        if rand() < ε then
            Select a random valid joint action a_t
        else
            if rand() < ν then
                a_t ← argmax_a Q_e(s_t, a)   (exploration head, Eq. 7)
            else
                a_t ← argmax_a Q_i(s_t, a)   (intensification head)
            end if
        end if
        Apply Censoring-DQL to a_t to discard invalid actions and SafeConsensus to resolve conflicts between agents
        Execute joint action a_t, observe rewards (ER_t, IR_t) and next state s_{t+1}
        Store transition (s_t, a_t, ER_t, IR_t, s_{t+1}) in B
        Sample a minibatch of transitions from B
        Update both Q-value heads based on the temporal difference error
        Every C steps, update the target network Q′
    end for
end for
F. State Representation
The state represents the environment in which the agent operates and the information accessible for decision-making. In our simulation environment, the agents have partial information about the lake's contamination, as they do not know the contamination levels of the cells they have never visited. Therefore, the environment is partially observable: the agents only have an observation of the dynamic set of weights I(t). To enable the use of a single policy network for all agents, an egocentric observation formulation is employed. This means that the relative positions of the other agents are also observed, ensuring that each agent has a unique observation. Thus, there is a distinction between the shared and individual elements of the observations. Our proposed state representation is composed of four min-max-normalized image channels, so that every pixel value of the state lies within [0, 1].
1) Shared Elements
These common elements of the state ensure that all agents have access to the same information about the environment and can coordinate their actions effectively (see Figure 6).
Idleness map: An image containing idleness values for each cell, which is generated based on fleet visits (see Figure 6a). All cells within the navigable zone are set to the maximum value 1 initially, denoting that none of the cells has been visited. When an agent visits a cell, its idleness resets to zero. If a cell remains unvisited over time, its idleness value increases, indicating the need for revisitation.
Importance map: This map updates the relative importance of visited cells as new measurements are taken (see Figure 6b). All cells within the navigable zone are initially set to the minimum importance value, indicating an unknown but not null interest. When an agent visits a cell, its relative importance is updated based on the level of contamination. The more polluted a cell is, the more importance it has.
2) Individual Elements
Each agent's state is differentiated by its own position, which allows a single policy network to be used for all of them (see Figure 6):
Agent position: This binary image has a zero value for all cells, except for those covered by the ASV's detection area (white cells in Figure 6c). This image provides specific information about the agent's current location and allows it to make decisions based on its own position.
Other agents' positions: a binary image where all cells have a value of zero, except for the cells corresponding to the locations and detection areas of the other agents in the environment (white cells in Figure 6d). This image enables each agent to know the other agents' positions, and thus to avoid collisions and coordinate its actions appropriately.
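A minimal sketch of how these four channels can be assembled into the network input is shown below; the array names and grid size are illustrative.

import numpy as np

def build_observation(idleness: np.ndarray, importance: np.ndarray,
                      agent_mask: np.ndarray, others_mask: np.ndarray) -> np.ndarray:
    """Stack the four min-max-normalized channels of the egocentric state:
    shared idleness map, shared importance map, own position/detection area,
    and the other agents' positions/detection areas."""
    def minmax(x):
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    return np.stack([minmax(idleness), minmax(importance),
                     agent_mask.astype(np.float32), others_mask.astype(np.float32)])

# Example: 50x50 grid; the binary masks would mark the detection footprints.
H, W = 50, 50
obs = build_observation(np.random.rand(H, W), np.random.rand(H, W),
                        np.zeros((H, W)), np.zeros((H, W)))
assert obs.shape == (4, H, W)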
G. Reward Functions
To guide the agents towards optimal behavior, it is necessary to design a reward function that motivates the agents to achieve the following goals:
During the first phase of the patrolling, the agent fleet should be fully exploratory, visiting the entire map in a coordinated fashion.
After gathering data on the lake during the initial phase, the agents should move on to an intensification phase that takes into account the importance of each zone.
Penalize multiple agents taking measurements in the same cell of the map.
In both phases of the process, an idleness matrix (${\mathcal {W}}_{t}$) is used; it stores the idleness value of each cell, which increases over time and is reset to zero when an agent visits that cell.
This image illustrates the detection areas of the agents and the resulting Redundancy Mask.
To penalize more than one agent taking a measurement in the same cell, the Redundancy Mask (RM) stores the number of measurements taken in each cell (see Figure 7). At each step t, every agent i produces a binary detection mask $\omega^{i}_{t}$ that marks the cells inside its detection area, and the Redundancy Mask is computed as their sum: \begin{equation*} \mathit {RM}_{t} = \sum _{i}^{N} \omega ^{i}_{t} \tag {10}\end{equation*}
When multiple agents take measurements in the same cell, they share the reward received for measuring in that cell. To achieve this, the value of the reward is divided by the corresponding entry of the RM matrix. As a result, the agents should spread out and maintain a safe distance from each other to improve coverage and maximize the use of the available information. Additionally, the reward is normalized by dividing it by the detection radius r, which enables a fair comparison of rewards regardless of the size of the detection radius used. Although the rewards are calculated using global matrices, each agent i receives an individual exploration reward evaluated at its own cell $(\mathbf{x}_{i},\mathbf{y}_{i})$: \begin{equation*} ER_{t}^{i} = \dfrac {{\mathcal {W}}_{t}(\mathbf {x}_{i},\mathbf {y}_{i})}{r \times \mathit {RM}_{t}(\mathbf {x}_{i},\mathbf {y}_{i})} \tag {11}\end{equation*}
During the Intensification Phase, the agents should focus on the most relevant zones. Thus, an importance matrix (${\mathcal {I}}_{t}$), built from the contamination measurements, is used to weight the idleness, yielding the intensification reward: \begin{equation*} IR_{t}^{i} = \dfrac {{\mathcal {W}}_{t}(\mathbf {x}_{i},\mathbf {y}_{i}) \times {\mathcal {I}}_{t}(\mathbf {x}_{i},\mathbf {y}_{i})}{r \times \mathit {RM}_{t}(\mathbf {x}_{i},\mathbf {y}_{i})} \tag {12}\end{equation*}
Regardless of the phase of the episode, every time an agent takes an action it receives a vector of rewards $(ER_{t}^{i}, IR_{t}^{i})$, with one component per task, which is used to train the corresponding head of the network.
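The following sketch gathers Equations 10-12 into a single reward computation; it assumes binary detection masks and cell-indexed positions, and the guard against a zero redundancy count is an implementation detail added here for safety, not taken from the original method.

import numpy as np

def compute_rewards(W: np.ndarray, I: np.ndarray, masks: list, positions: list, r: float):
    """Sketch of Eqs. (10)-(12): build the Redundancy Mask from the agents' detection
    masks, then give each agent an exploration reward (idleness) and an intensification
    reward (idleness weighted by importance), both shared through RM and scaled by 1/r."""
    RM = np.sum(np.stack(masks), axis=0)                     # Eq. (10)
    rewards = []
    for (x, y) in positions:
        shared = r * max(RM[x, y], 1.0)                      # avoid division by zero
        er = W[x, y] / shared                                # Eq. (11)
        ir = W[x, y] * I[x, y] / shared                      # Eq. (12)
        rewards.append((er, ir))                             # one component per head
    return rewards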
Results
This section presents the performance metrics, the conducted simulations, and the learning settings. First, it discusses how the algorithm compares to other approaches in terms of the use of the decoupled method and parameter sharing in this case study. Finally, the proposed approach is compared with other heuristic-based patrolling algorithms used in the literature.
The proposed algorithm has been implemented in Python 3, using the PyTorch library for the construction of the neural network. The Gym library served as the simulation environment, and numerical and matrix operations were conducted with the NumPy and SciPy libraries. The algorithm's code and results are accessible in a GitHub repository. All simulations were conducted on an Intel Xeon Gold 5220R CPU operating at 2.20 GHz with 187 GB of RAM. Additionally, an NVIDIA GeForce RTX 3090 GPU with 24 GB of VRAM was employed to speed up the neural network training.
A. Metrics
The following performance metrics have been defined to evaluate the performance of the algorithm. Let K be the total number of navigable cells, $(i_{k},j_{k})$ the coordinates of the k-th navigable cell, ${\mathcal {W}}_{t}$ the idleness matrix, and ${\mathcal {I}}_{t}$ the importance matrix at time step t.
The Instantaneous Global Idleness (IGI): the average idleness of all cells in the time step t:
\begin{equation*} IGI(t) = \frac {1}{K} \sum _{k=1}^{K} {\mathcal {W}}_{t}(i_{k},j_{k}) \tag {13}\end{equation*}
The Instantaneous Global Weighted Idleness (IGWI): the average weighted idleness of all cells in the time step t:
\begin{equation*} IGWI(t) = \frac {1}{K} \sum _{k=1}^{K} {\mathcal {W}}_{t}(i_{k},j_{k}) \times \mathit {I}_{t}(i_{k},j_{k}) \tag {14}\end{equation*}
The Average Global Idleness (AGI): the IGI averaged over the whole exploration time $T_{e}$:
\begin{equation*} AGI = \frac {1}{T_{e}} \sum _{t=1}^{T_{e}} IGI(t) \tag {15}\end{equation*}
The Average Global Weighted Idleness (AGWI): The IGWI averaged over the whole simulation time T:
\begin{equation*} AGWI = \frac {1}{T} \sum _{t=1}^{T} IGWI(t) \tag {16}\end{equation*}
Percentage of the map Visited (PV(t)): the proportion of the navigable map that has been visited up to time step t.
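For reference, the metrics above can be computed directly from the idleness and importance matrices, as in the following sketch; the boolean navigable mask and the visited map are illustrative inputs.

import numpy as np

def igi(W: np.ndarray, navigable: np.ndarray) -> float:
    """Eq. (13): average idleness over the K navigable cells at time t."""
    return float(W[navigable].mean())

def igwi(W: np.ndarray, I: np.ndarray, navigable: np.ndarray) -> float:
    """Eq. (14): average importance-weighted idleness over the navigable cells."""
    return float((W[navigable] * I[navigable]).mean())

def averaged(series: list) -> float:
    """AGI / AGWI (Eqs. 15-16): the instantaneous metric averaged over the time horizon."""
    return float(np.mean(series))

def percentage_visited(visited: np.ndarray, navigable: np.ndarray) -> float:
    """PV(t): proportion of navigable cells visited at least once up to time t."""
    return float(visited[navigable].mean())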
B. Simulation Settings
Table 1 lists the key training hyperparameters used during the simulations. The parameter ν is kept at 1 during the Exploration Phase and is then gradually decreased to 0, producing a smooth transition into the Intensification Phase (see Figure 8).
Visual representation of the evolution of ν throughout an episode.
Table 2 summarizes the environmental parameters. The simulations were conducted using four ASVs (agents) with sensors that have a detection radius of 580 m. The distance an ASV can travel (autonomy) is 58 km, and the ASV movement length is 580 m per step.
Figure 9 displays the total rewards obtained by the fleet throughout the training process. These rewards are the sum of the cumulative rewards earned by individual agents for both intensification and exploration strategies at the end of each of the 10,000 episodes. The figure illustrates the algorithm’s effectiveness in consistently maximizing rewards and eventually converging to a stable solution.
C. Comparison of Policy Strategies
To validate the training of the algorithm with two phases and the efficiency of parameter sharing, the two following policies were also trained:
Single-Phase DQN: This policy is a parameter-sharing Multiagent DDQN trained on a single task, utilizing a single Dueling DQN head. The training process is therefore performed in a single intensification-phase setting. The purpose of this comparison is to highlight the impact of decoupled-phase training.
Task-Specific DQN: This baseline involves two parameter-sharing Multiagent DDQN policies, each dedicated to a single task and trained within the decoupled-phase setting. The methodology differs from our approach, which utilizes a single shared network for both tasks. The comparison addresses whether sharing a common DQN across tasks (MDQN) offers advantages in our particular context.
In Figure 10b, a comparison of the IGWI throughout the episode for the three DQN algorithms is presented. It shows how the Single-Phase DQN quickly reduces the IGWI by exploiting its strategy, which is optimized for this metric. However, during this period the PSMA-MDQN algorithm is still in the Exploration Phase, whose primary objective is to minimize the IGI. Upon entering the Intensification Phase, our algorithm achieves an IGWI that is 17% lower, and this advantage is maintained for the rest of the episode. This suggests that the Single-Phase DQN struggles to minimize idleness by revisiting cells that were visited long ago (or never visited) while also having to visit the more relevant zones more frequently. This inefficiency allows our policy, which first patrols homogeneously, to outperform it. Having two phases relaxes the requirement that the agent must learn to explore and to intensify with the same policy, which are conflicting goals; this speeds up the learning process and increases the efficiency of sampling.
A comparison of the results for the three DQN algorithms throughout the episode. It shows the average and standard deviation of the metrics obtained from 500 episodes.
Table 3 summarizes the results of all algorithms calculated after running 500 episodes. Our proposed method outperforms all other algorithms in all metrics, as highlighted in bold in the table. The AGWI at the end of the episode is 3% lower than that of the Single-Phase DQN, which is the second-best algorithm. Additionally, Figure 10c displays the average PV over 500 episodes and its standard deviation. It is evident that PSMA-MDQN covers a larger percentage of the map at a faster rate, averaging 29% more coverage than the Single-Phase DQN.
Furthermore, PSMA-MDQN outperforms the Task-Specific DQN in all metrics (Table 3), despite the latter using nearly twice (1.96 times) as many neural network parameters. In Figure 10a, a comparison of the IGI at the end of the Exploration Phase is presented. PSMA-MDQN achieves a 6% lower IGI, indicating better learning of the exploration strategy. Additionally, it achieves a 13% lower IGWI (see Figure 10b) and a 7% lower AGWI (Table 3), which suggests more efficient learning of the intensification strategy. This superiority is attributed to the shared DQN architecture (MDQN) employed in our approach: the Feature Extractor, the shared block of the PSMA-MDQN, enhances task performance by learning shared representations and leveraging shared information. This shared knowledge improves the model's ability to generalize, so sharing the Feature Extractor across tasks proves to be beneficial in our particular context.
The proposed method is also compared with other heuristic-based path planning algorithms to validate the results. To ensure a fair comparison, only safe actions are taken by these algorithms. The comparison is made using three different path planners (see Figure 11):
Lawn Mower Path Planner (LMPP): Each agent starts its path by randomly choosing a direction and proceeds in that direction until it encounters an obstacle (either land or another ASV). Then, the agent reverses direction, offset by one step from the previous path so as not to retrace the same path in reverse. Figure 11a shows an example of LMPP trajectories with 4 agents.
Random Wanderer Path Planner (RWPP): Agents randomly select a direction and follow it until they encounter an obstacle. At that point, each agent selects another random safe direction except the reverse to introduce variety. This approach ensures that the algorithm explores new paths without redundancy. Figure 11b shows an example of RWPP trajectories with 4 agents.
Particle Swarm Optimization Path Planner (PSOPP): PSO [38] is an evolutionary algorithm that deploys a group of particles to find an optimal solution based on a given metric. These particles navigate the search space using mathematical formulas that take into account their position and velocity. PSO has been successfully used and enhanced for environmental monitoring [39]. In our scenario, each agent i is treated as a particle, with its position denoted as $p^{i}$. The closest position on the map with the highest idleness is $p^{i}_{bW}$, and the closest position with the highest weighted idleness is $p^{i}_{bI}$. At each iteration t, the particle's velocity $vel^{i}_{t}$ is updated according to Equation 17:
\begin{equation*} vel^{i}_{t} = w*vel^{i}_{t-1} + c_{1} * (p^{i}_{bW} - p^{i}) + c_{2} * (p^{i}_{bI} - p^{i}) \tag {17}\end{equation*}
Table 4 lists the values of the parameters of the PSOPP algorithm. It should be noted that a distinction is made between the parameters of the two phases, so that agents move toward the positions with the highest idleness in the Exploration Phase and toward the positions with the highest weighted idleness in the Intensification Phase. The action taken is the one whose direction is closest to $vel^{i}_{t}$.
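A minimal sketch of this velocity update and of the projection of the velocity onto the discrete action set is given below; the per-phase handling of c1 and c2 is an assumption based on the description above, and all names are illustrative.

import numpy as np

def psopp_velocity(vel_prev: np.ndarray, p: np.ndarray, p_bW: np.ndarray, p_bI: np.ndarray,
                   w: float, c1: float, c2: float) -> np.ndarray:
    """Eq. (17): inertia plus attraction toward the closest highest-idleness cell (p_bW)
    and the closest highest-weighted-idleness cell (p_bI). Per-phase parameter sets would
    emphasize one of the two attraction terms."""
    return w * vel_prev + c1 * (p_bW - p) + c2 * (p_bI - p)

def closest_action(vel: np.ndarray, n_actions: int = 8) -> int:
    """Pick the discrete heading whose direction is closest to the velocity vector."""
    target = np.arctan2(vel[1], vel[0]) % (2 * np.pi)
    headings = 2 * np.pi * np.arange(n_actions) / n_actions
    diff = np.minimum(np.abs(headings - target), 2 * np.pi - np.abs(headings - target))
    return int(np.argmin(diff))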
The results shown in Figure 12a demonstrate our algorithm's capability to reduce the IGI at a faster rate than the heuristic-based algorithms. Due to its slowness and redundancy, the LMPP proves inadequate to reduce either the IGI or the IGWI in a time-efficient manner. At the end of the Exploration Phase, our algorithm achieves an IGI 47% lower than that of LMPP. The limitations of LMPP become apparent in Figure 12b, highlighting its inability to maintain a reduced IGWI value. This deficiency is attributed to the fact that LMPP ignores crucial zones, which ultimately limits its effectiveness in pollution monitoring. Consequently, our algorithm achieves, on average, an AGWI 44% lower than LMPP at the end of the episode. The Random Wanderer Path Planner (RWPP) has less redundancy than LMPP. However, our approach has successfully learned a coordination strategy, surpassing RWPP with a 34% lower IGI at the end of the first phase. Additionally, our algorithm excels in the second phase, outperforming RWPP with a 31% lower AGWI, maintained consistently despite RWPP's non-redundant nature. Moreover, our algorithm achieves a 45% greater reduction in the minimum IGWI. Figure 12c displays the average PV throughout the episode for the heuristic-based algorithms and ours. It is easy to see how much faster our algorithm is, covering 48% more than RWPP and 130% more than LMPP at the end of the Exploration Phase (step 30). Although PSOPP considers zone idleness for decision-making, its coordination strategy is less sophisticated than our approach: the natural swarming behavior of the particles (agents) leads to less sparsity and suboptimal performance. As shown in Figure 12a, our approach outperforms PSOPP by 39% in IGI, and even RWPP reduces the IGI more effectively than PSOPP due to its less redundant nature. Figure 12b shows that PSOPP outperforms the other heuristic-based algorithms in terms of intensification, but our method still achieves a 37% lower idleness and a 31% lower AGWI.
A comparison of the results for the heuristic-based algorithms throughout the episode. It shows the average and standard deviation of the metrics obtained from 500 episodes.
Number of Visit Maps (NVMs) (see Figure 13) record the number of times each cell was visited. The purpose of these maps is to provide a visual representation of the uniformity of coverage during the Exploration Phase, as well as of the concentration of visits in the most significant regions during the Intensification Phase. An episode is randomly selected to assess the appearance of the visit maps. The contamination map at the beginning of the Intensification Phase of that episode is shown in Figure 13a and can be used to identify the areas where the agents should increase their efforts.
Number of Visit Maps (NVM) record how many times each cell was visited during the Exploration and Intensification Phases. Figure (a) shows the contamination map at the beginning of the Intensification Phase, Figures (c),(e),(g),(i),(k) show the NVMs for the Exploration Phase, while figures (b),(d),(f),(h),(j),(l) show the NVMs for the Intensification Phase.
Figures 13c, 13e, 13g, 13i and 13k show the NVMs during the Exploration Phase for all the algorithms except the Single-Phase DQN, which was not trained for that phase. The visual representations clearly demonstrate the superior and more efficient coverage achieved by PSMA-MDQN when compared to the other algorithms. The NVM of LMPP (see Figure 13g) again confirms the notably slow pace and the redundancies of its exploration strategy. Likewise, the inefficiency of RWPP (see Figure 13i) is evident, owing to its inherent lack of coordination. When compared to the Task-Specific DQN (see Figure 13e), it is clear that this algorithm does not employ coordination behaviors to the same extent as the proposed PSMA-MDQN; consequently, our approach achieves more extensive coverage in fewer steps. Regarding PSOPP (see Figure 13k), the exploration is inefficient because the trajectories of the agents remain close to each other.
As for the Intensification Phase, Figures 13b, 13d, 13f, 13h, 13j and 13l show the NVMs for all algorithms. It is evident that LMPP and RWPP do not consider the important zones and therefore intensify poorly. In contrast, the DQN-based algorithms and PSOPP successfully intensify in the important zone. The policies trained with the decoupled method, PSMA-MDQN and Task-Specific DQN, have shifted their focus from covering the entire map to targeting only the most relevant areas. This efficient transition between phases demonstrates that the agents have been able to identify the relevant areas and intensify their patrols there. As for PSOPP, the agents migrate collectively from one contamination peak to another (see Figure 13l). This collective movement makes them behave like a single agent, which severely reduces performance.
D. Generalization
The goal of this section is to evaluate the proposed algorithm's robustness and generalization ability after training by changing the values of ν, i.e., the schedule that determines the duration of each phase and the transition between them.
A comparison of the results for the PSMA-MDQN throughout the episode with different ν configurations.
Illustration of the different ν configurations evaluated.
Nonetheless, it is undeniable that the agents have gained the capability to make decisions based on their current phase of operation and have indeed learned to perform the two tasks independently.
E. Discussions
With the results presented, the following discussions unfold:
The decoupled method introduced in this study removes the constraint that agents must use a single policy for both the initial exploration and the subsequent intensification, thereby taking into account the inherent conflict between these objectives. This method accelerates the learning process, as demonstrated by the fact that, upon entering the Intensification Phase, our algorithm achieves an IGWI 17% lower than that of the Single-Phase DQN, a performance advantage that is maintained throughout the rest of the episode.
Parameter sharing is more efficient for related multitask learning in our case study. Our algorithm achieved a 6% lower IGI during the Exploration Phase, a 13% lower IGWI, and a 7% lower AGWI than the Task-Specific DQN, despite the latter using 1.96 times as many neural network parameters.
The algorithm developed in this study outperforms heuristic-based approaches such as LMPP, RWPP, and PSOPP. On average, it achieves a 44% lower AGWI than LMPP and a 31% lower AGWI than RWPP and PSOPP by the end of the episode. Additionally, the algorithm demonstrates a learned coordination strategy, resulting in a 47% lower IGI than LMPP, and a 34% and 39% lower IGI than RWPP and PSOPP, respectively, by the conclusion of the first phase. The algorithm's superior map-coverage speed is also evident, with 48% more coverage than RWPP, a substantial 130% lead over LMPP, and 58% more PV than PSOPP by the end of the Exploration Phase (step 30).
The comparison of the proposed algorithm and LMPP highlights the superior performance of the former. This is mainly due to LMPP's exhaustive approach to ensuring complete coverage by strictly following parallel paths, leading to excessive redundancy. Additionally, in non-convex scenarios, LMPP would face challenges in escaping corners, which would impede its proper operation. As for RWPP, it provides fast homogeneous map coverage, but it lacks coordination based on weighted idleness and therefore lacks effective intensification. Its performance would also decrease in non-convex settings.
Regarding PSOPP, when particles initiate exploration from nearby locations, they tend to exhibit similar behaviors. In the Exploration Phase, each particle moves towards the nearest cell with the highest idleness, resulting in a dispersion of paths when encountering obstacles such as the border of the lake. However, in the Intensification Phase, where importance is concentrated, particles tend to congregate and move collectively. This behavior is especially problematic in environments with multiple contamination peaks. In such cases, particles tend to move in groups from one peak to another, making the task extremely inefficient. In contrast, our algorithm provides agent allocation strategies that are particularly suitable for achieving homogeneous and non-homogeneous coverage across the entire map, even in complex and dynamic environments, e.g., when contamination peaks are dispersed.
Our algorithm has learned to perform two tasks independently, and the policies can be used arbitrarily.
Our algorithm allows for smooth transitions between phases or the use of a single phase. This provides users with the flexibility to configure the algorithm according to their needs.
Conclusion
In the context of a dynamic Partially Observable Markov Game (POMG), such as the challenging Lake Ypacaraí patrolling scenario with multiple ASVs, our approach divides the patrolling task strategically into two distinct phases: the Exploration Phase and the Intensification Phase. The aim of the Exploration Phase is to cover the map homogeneously, while the objective of the Intensification Phase is to intensify the coverage of the most polluted areas. Additionally, a novel mechanism has been introduced to ensure a smooth transition between the two phases. To tackle the computational complexity of the problem, a Dueling DQN has been trained with two heads, one dedicated to estimating the Q-function of the Exploration Phase and the other that of the Intensification Phase. The policy is shared across all agents, since they are homogeneous and the input state formulation is egocentric.
The results indicate that the decoupled method introduced in this study is effective. This method frees the agents from the constraint of using a single policy for both exploration and intensification, which accelerates the learning process. Our algorithm consistently outperforms the Single-Phase DQN, a policy trained on a single intensification task with the same architecture as ours but a single Dueling DQN head. Furthermore, our multitask learning approach with parameter sharing is more efficient than the Task-Specific DQN. The latter uses two separate Dueling DQNs with the same architecture as ours, each dedicated to a different task and trained within the decoupled-phase setting. Our approach achieves better results despite using approximately half as many neural network parameters.
By changing the values of ν after training, it has been shown that the agents have learned to perform the two tasks independently and that the policies can be used arbitrarily, allowing either a smooth transition between phases or the use of a single phase, according to the user's needs.
In future lines of research, shifting from predefined task durations to training the network with multiple objectives emerges as a promising way to achieve optimal performance configurations. This shift allows for a more nuanced consideration of user preferences, including exploration and intensification requirements, as well as energy efficiency. The proposed approach is to solve a multi-objective optimization problem in order to identify a Pareto front containing non-dominated policies, i.e., solutions where one objective cannot be improved without compromising another. The exploration of the Pareto front, achieved by training with varying objective weightings, offers opportunities to discover versatile and adaptive patrolling strategies tailored to diverse user needs and environmental dynamics. Additionally, other environmental monitoring tasks, such as bathymetric surveys and trash detection, could be added to the proposed framework. Moreover, including the battery level as another decision variable in trajectory design offers a promising area of research, with the potential to improve the efficiency and sustainability of autonomous systems operating in dynamic environments.