Implementation of Decentralized Reinforcement Learning-Based Multi-Quadrotor Flocking

Enabling coordinated motion of multiple quadrotors is an active area of research in the field of small unmanned aerial vehicles (sUAVs). While many techniques in the literature address the problem, these studies are limited to simulation results and seldom account for wind disturbances. This paper presents the experimental validation of a decentralized planner based on multi-objective reinforcement learning (RL) that achieves waypoint-based flocking (separation, velocity alignment, and cohesion) for multiple quadrotors in the presence of wind gusts. The planner is learned using an object-focused, greatest mass, state-action-reward-state-action (OF-GM-SARSA) approach. The Dryden wind gust model is used to simulate wind gusts during hardware-in-the-loop (HWIL) tests. The hardware and software architecture developed for the multi-quadrotor flocking controller is described in detail. HWIL and outdoor flight test results show that the trained RL planner can generalize the flocking behaviors learned in training to the real-world flight dynamics of the DJI M100 quadrotor in windy conditions.


I. INTRODUCTION
Small unmanned aerial vehicles (sUAVs) are a growing class of vehicles that can perform complex tasks, especially in hard-to-reach areas. There is a distinct advantage to using a multi-sUAV system over a single sUAV platform, as it provides increased capabilities for tasks such as surveying, search-and-rescue operations, and mapping [1]-[7]. Multi-sUAV systems are not straightforward to deploy autonomously: multi-sUAV algorithms need to ensure that the sUAVs do not collide with one another and work together towards a common task. The problem of coordinated movement of the sUAVs between waypoints while maintaining a set of desired kinematic behaviors and avoiding collisions has been well studied using model-based techniques [8]-[11]. Although recent works demonstrate that decentralization can be achieved given online construction of kinematically feasible trajectories, decentralized path planning at scale remains an area of active research [12], [13].
The associate editor coordinating the review of this manuscript and approving it for publication was Xujie Li.

Flocking refers to the characteristics of multiple sUAVs interacting with one another and exhibiting common collective behavior. One of the defining characteristics of flocking is that each agent performs localized actions that contribute to an overall coordinated behavior of the system. Separation involves agents moving away from each other to avoid collisions and is also termed collision avoidance. Alignment involves agents moving along the average heading of the flock, so that each agent in the system exhibits the same relative velocity. Cohesion involves agents staying near the average position (i.e., centroid) of the group.
Wind conditions, being stochastic in nature, play a crucial role in the proper functioning of coordinated sUAV operations outdoors. An overview of the impact of external stimuli such as wind gusts on sUAV operations is provided in [18]. Performing wind-agnostic motion planning for sUAVs may produce a sizeable cross-track error if the wind on the planned route leads to actuator saturation [19]. In a multi-sUAV system, each sUAV has to locally counter the wind disturbance while maintaining the safety of the system. Such continuous manipulation of the control effort for multiple sUAVs under uncertain environmental conditions is computationally taxing and can lead to reduced efficiency and safety concerns [18].
Several works in the literature propose approaches to solve multi-agent system flocking problems [20]-[24]. In our previous work, we demonstrated a novel reinforcement learning technique using the multi-objective formulation of the decentralized flocking control problem for up to 25 fixed-wing UAVs [25]. The training was based on a relative state-space construction of obstacles and waypoints in software simulation. In comparison to existing behavioral swarm controllers, our controller, learned via object-focused greatest mass state-action-reward-state-action (OF-GM-SARSA), was shown to be generalizable across multiple flocking scenarios.
While meaningful experiments of flocking algorithms on ground robots have been reported [26], [27], those for sUAVs have emerged only recently. A select few papers document real-time experimental evaluation of their flocking systems [27]-[32]. Only [33] reports accounting for wind, using a simplified Gaussian noise model. Most other publications evaluate their flocking approaches in multi-sUAV simulation environments such as ArduPilot, QGroundControl, Gazebo, and ROS [34]-[36], or in numerical simulations using Python and MATLAB [3], [20], [22], [37]-[39], thus leaving a gap in the literature regarding the hardware/software approaches required for implementing flocking-based motion planners in real-world outdoor flights.

A. MAIN CONTRIBUTIONS
This paper fills the above-mentioned gap by focusing on the relatively untouched area of experimental implementation of flocking algorithms for outdoor multi-sUAV systems. Specifically, in this work, we leverage our previously developed OF-GM-SARSA-based path planner for flight testing the coordinated motion of multiple quadrotors to reach waypoints while maintaining the flocking behaviors. A snapshot of the outdoor flight tests is shown in Figure 1. The following are the main contributions of this paper:
1) A method to incorporate realistic wind gusts using the Dryden Wind Gust Model for evaluating flocking algorithm performance in hardware-in-the-loop (HWIL) tests.
2) Experimental evaluation and validation of a decentralized OF-GM-SARSA based hardware/software architecture via outdoor flight tests involving up to 4 DJI M100 quadrotors operating in the presence of natural wind gusts.
Additionally, a detailed discussion of the hardware/software architecture used to implement a multi-objective, reinforcement learning (RL) based decentralized flocking planner for multiple quadrotors is provided. The OF-GM-SARSA technique was used to learn a flocking planner for a group of up to 4 DJI Matrice M100 quadrotors that use DJI's N1 flight controller [40]. Each quadrotor was also fitted with a transceiver radio module operating in the 915 MHz license-free frequency band, providing half-duplex bi-directional RF links at 300 Kbps. The OF-GM-SARSA planner used the quadrotors' position and velocity information, as reported by the DJI telemetry streams of the one-hop neighbors. The output of the planner consisted of local heading and speed setpoints sent to each quadrotor's flight controller.
The rest of the paper is organized as follows. Section II presents a discussion of representative works in model-based and reinforcement learning-based flocking approaches. Section III discusses the flocking problem formulation in terms of the quadrotor dynamics, the inter-quadrotor communication setup, the Dryden wind model, and the flocking behaviors. Sections IV and V present the algorithm for the flocking planner and the methodology used for training the RL algorithm. The algorithm was first evaluated in the simulations described in Section VI before HWIL and outdoor flight tests. The hardware and software architecture for HWIL and outdoor flight tests are described in Section VII, followed by the flight test results in Section VIII. Finally, the observations and insights gleaned from this study are noted in Section IX.

II. RELATED WORK
The aggregate motion of small birds is known as cluster flocking in the literature [41]. In such cluster flocking applications, each individual bird may only observe its neighboring birds at any time. The flocking formation considered here is, in essence, inspired by flocks of birds, where each agent (bird or sUAV) aggregates around a geometric centroid independently of a leader agent. The bio-inspired behaviors implemented here posit that each learning agent strives to maintain the integrity of the flock using its sensory input (in this case, the position and velocity of its neighbors). While there exists a mobile ad-hoc network (MANET) for wireless communication of sensor data, the execution of the motion-planning policy is fully decentralized, where each agent runs an identical copy of the flocking policy to achieve a common goal [42]. In contrast to centralized approaches, such decentralization of motion planning offers computational tractability as the number of agents increases and eliminates a single point of failure (the centralized decision maker) [43], [44].
Several centralized and decentralized approaches to solving the multi-robot flocking problem have been reported in the literature. These approaches can be broadly classified into model-based flocking methods and model-free methods.

A. MODEL-BASED FLOCKING
Model-based approaches to solving the multi-vehicle flocking problem continue to be studied in the literature [21], [23], [29], [33], [36], [46]. These approaches involve formulating the kinematic and dynamic behaviors of the system in the environment.
In a multi-sUAV system, the environment can include obstacles and environmental disturbances such as wind. The authors in [33] modeled wind disturbances as Gaussian noise, a simplified approximation of real-world turbulence. In [36], the authors defined obstacles as special agents, enabling sUAVs to effectively evade them by avoiding collisions while maintaining the flocking formation. Distributed or decentralized model predictive control (MPC) based formulations of the multi-sUAV flocking problem have been a popular approach due to their ability to accommodate different systems and mission requirements with low computational overhead [29], [45]. In [29], the authors evaluated a consensus-based MPC approach for flocking experimentally using 5 Crazyflie 2.0 mini-quadrotors in an indoor environment. The authors of [45] tested a decentralized MPC algorithm using a flock of 5 outdoor quadrotors. Figure 2 depicts a custom-built sUAV used for flocking experiments in [45].
Additional control-theoretic approaches such as high-frequency feedback robust control [21], Particle Swarm Optimization (PSO) [30], and PID controllers [32] have also been applied to solve the multi-sUAV flocking problem.
Model-based flocking approaches rely on the accuracy of the model of the multi-vehicle system. The accuracy of the model decreases when the number of agents in the system increases. The modeling becomes especially challenging if the real-world characteristics of wind disturbances are also taken into consideration.

B. REINFORCEMENT LEARNING (RL) BASED FLOCKING
Among the model-free methods, RL-based flocking approaches have been presented in several studies as a means for overcoming platform and environmental modeling restrictions while maintaining decentralized operations [47]- [52].
Q-learning and state-action-reward-state-action (SARSA) have been implemented in multi-vehicle robotics problems because they do not require modeling the complicated flight dynamics of the system. In [24], the authors used a SARSA-based approach to successfully tackle the enemy sUAV avoidance problem in multi-sUAV systems. Figure 2 depicts the hardware-in-the-loop simulation setup used to evaluate their SARSA-based collision avoidance algorithm for multi-sUAV systems. In [53], the authors simulated a Q-learning-based approach for search-and-rescue operations using sUAVs. They considered an indoor scenario where the sUAV relies on RF signals emitted by a smart device owned by the target victim. In [54], the authors proposed a SARSA-based approach to deploy a multi-sUAV assisted wireless network. The goal of the reinforcement learning was to enable the sUAVs to learn the features of the environment and plan trajectories accordingly to provide wireless service in disaster-hit areas.
One of the key differences between Q-learning and SARSA is how the update target is formed. Q-learning is off-policy: its update always bootstraps on the maximum achievable Q-value in the next state. SARSA is on-policy: its update bootstraps on the state-action pair actually selected by the current policy, which often yields safer, more conservative behavior than Q-learning. A few authors have provided a comparison of Q-learning and SARSA-based approaches for multi-UAV flocking [39], [55]. In [55], the authors compared Q-learning and SARSA for reducing power consumption in multi-sUAV systems in the presence of wind. Simulations were carried out using a combination of ROS and Gazebo. The wind field was simulated using the physics engine in Gazebo and actual wind data. Both RL approaches were compared to a naive planner that selected the shortest paths irrespective of the wind fields at each time step. The results showed that the RL approaches reduced power consumption by about 30% compared with the naive planner. In [39], the authors provided a comparison of Q-learning and SARSA for global path planning for mobile ground robots using Python simulations. It was observed that while Q-learning emphasized the minimum number of actions necessary to reach the goal, the SARSA algorithm gave priority to safety and found the optimal safe distance that avoided the risk of collisions/accidents. Therefore, Q-learning showed faster convergence rates, and SARSA provided a safer path for the ground robot.
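To make this distinction concrete, the two update rules can be sketched in a minimal tabular form; the state and action labels below are hypothetical, not the paper's.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.2, gamma=0.9):
    """Off-policy TD update: bootstraps on the greedy (maximum) value in s_next."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.2, gamma=0.9):
    """On-policy TD update: bootstraps on the action a_next actually taken in s_next."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
```

Given the same experience, Q-learning's target uses the best next action even if the policy would not choose it, while SARSA's target reflects the (possibly exploratory) action the policy actually takes, which is what makes SARSA sensitive to the risk of its own exploration.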
Neural networks (NNs) have been proposed to produce generalized results that take into account the dynamic behavior of the flight environment [34], [38], [56]. In [56], the authors simulated a centralized, deep-Q-learning based leader-follower approach to solving the flocking problem for multi-sUAV systems using the Hungarian algorithm [57]. The proposed algorithm showed feasible convergence times for different flocking formations such as circle, v-shape, and star (12 ms). However, the algorithm took a significantly longer time when the number of sUAVs increased to 100. In [34], a deep-SARSA approach involved combining the traditional SARSA algorithm with a NN instead of Q-tables for storing the states and predicting the best action. The NN used was implemented in Keras and contained three dense layers with 549 trainable parameters [58]. Data was generated based on the training process, which was executed for 4000 simulation runs. The trained model was then successfully evaluated in a simulation testbed built using ROS and Gazebo.
In [38], the authors provided a deep-RL approach for sUAV ground target tracking in the presence of an obstacle. They used a deep deterministic policy gradient (DDPG) to generate the path plan around large-scale and complex environments. Simulation experiments were carried out in TensorFlow 2.0 using Python [59]. The improved DDPG algorithm raised the success rate for target tracking from 70.0% to 91.8% in the sparse environment and from 13.6% to 67.5% in a dense environment. In [60], the authors formulated the fixed-wing sUAV flocking problem as a Markov Decision Process (MDP). A NN was used to train the model using TensorFlow and Keras, with the number of training episodes set to 50000 and a maximum time step of 30 seconds. The authors also implemented a hardware-in-the-loop simulation using the X-Plane flight simulator and two PX4 Pixhawks. The flight simulator modeled weather changes and wind disturbances. The flight simulation showed that the proposed RL algorithm could deal with environmental changes and complete the simulation mission.
As observed, a majority of these studies are limited to simulations. By contrast, this study provides detailed discussions on the hardware/software implementation and validation of OF-GM-SARSA applied to a multi-sUAV system to learn flocking using HWIL and outdoor flight tests.

III. PROBLEM FORMULATION
In this section, the key elements of the problem formulation are elucidated.

A. QUADROTOR MOTION
Consider a group of n quadrotor vehicles with 6 degrees of freedom operating in appropriately defined right-handed inertial, body, and body-fixed frames of reference. The non-linear rigid body dynamics of quadrotors have been well studied and documented [61].
Each quadrotor i = 1, . . . , n has a given start (origin) point o_i and a given end (goal) point e_i. O denotes the set of all start (origin) points, with o_i ∈ O, and E denotes the set of all end points, with e_i ∈ E. The Euclidean distance between two quadrotors i and j is denoted by d_ij, where i, j = 1, . . . , n and j ≠ i.

B. COMMUNICATION MODEL
Each quadrotor is equipped with a mobile ad-hoc network (MANET) radio used to exchange its trajectory information with other quadrotors. Communication connectivity is maintained either directly (one-hop) or indirectly (multi-hop relaying). The quadrotors are modeled as able to communicate if they are within a distance η_d of one another. This constraint is equivalent to the signal-to-noise ratio (SNR) experienced by the receiving quadrotor being above a threshold [62].
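A minimal sketch of the one-hop neighborhood induced by this distance threshold follows; the planar positions and the threshold value are illustrative, not taken from the paper.

```python
import math

def one_hop_neighbors(positions, i, eta_d):
    """Return indices of quadrotors within communication range eta_d of quadrotor i.

    positions: list of (x, y) coordinates in a common frame.
    eta_d: maximum link distance (same units as positions).
    """
    xi, yi = positions[i]
    return [j for j, (xj, yj) in enumerate(positions)
            if j != i and math.hypot(xj - xi, yj - yi) <= eta_d]
```

Each agent would evaluate this locally from received telemetry, which is what keeps the communication model compatible with the decentralized planner.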

C. DRYDEN WIND MODEL
Wind gusts are modeled using the continuous Dryden turbulence model [63]. The Dryden wind turbulence model has been commonly used to model continuous wind gusts effects on large-scale aircraft. Since quadrotor sUAVs are significantly smaller in size compared to such aircraft, appropriate approximations have been reported [64], [65].
Functionally, the Dryden model is a pulse shaping filter wherein a unit variance, band-limited white noise signal, is passed through a shaping function to generate an output signal with spectral properties defined by the shaping function [64]- [66]. The Dryden turbulence model specifies a power spectral density (PSD) function to define the turbulence spectra for along-wind, crosswind, and vertical wind directions.
As shown in Figure 3, the along-wind direction is wind flowing along the path of the quadrotor movement, either in the direction of the quadrotor (tailwind) or against it (headwind). Crosswind is the wind flowing across the path of the quadrotor movement. The vertical wind is the wind flowing in the direction perpendicular to the plane of the quadrotor movement. The three wind directions are orthonormal to each other, and the corresponding variables are modeled as independent. Figure 3 shows two important parameters of wind gust behavior, as described by the Dryden model. The first is the scale length L_along-wind of the wind turbulence in the along-wind (headwind/tailwind) direction of motion. The second is u_along-wind, the along-wind gust speed. Additionally, u_crosswind is the crosswind gust speed, and u_vertical-wind is the vertical wind gust speed. σ_along-wind, σ_crosswind, and σ_vertical-wind are the RMS gust speeds (also called turbulence intensities) in the along-wind, crosswind, and vertical directions, respectively.
Since the general flight altitudes for quadrotors do not exceed 1000 feet mean sea level (MSL), they are classified as low-altitude aerial vehicles according to US military standards [67], [68]. For altitudes below 1000 feet MSL, the scale lengths and turbulence intensities in the spectral forms are approximated by equations (1)-(4), where h is the altitude above sea level (in feet), a-w denotes along-wind, c-w denotes cross-wind, and v-w denotes vertical-wind:

L_a-w = L_c-w = h / (0.177 + 0.000823h)^1.2    (1)

L_v-w = h    (2)

σ_v-w = 0.1 |u_wind-20|    (3)

σ_a-w = σ_c-w = σ_v-w / (0.177 + 0.000823h)^0.4    (4)

The wind speed at 20 feet AGL, |u_wind-20|, is set to 15, 30, and 45 knots for light, moderate, and severe turbulence conditions, respectively. These values were experimentally determined in [69]. Open-source Python code for implementing the Dryden model is available at [70].
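The low-altitude parameter approximations and the pulse-shaping interpretation described above can be sketched as follows. This is an illustrative sketch following the standard MIL-spec low-altitude Dryden forms, not the authors' implementation; the airspeed, time step, and sample count in the filter are hypothetical.

```python
import numpy as np

def dryden_low_altitude_params(h_ft, u20_kt):
    """Scale lengths (ft) and RMS gust intensities (kt) for altitudes below
    1000 ft MSL, per the standard low-altitude Dryden approximations."""
    denom = 0.177 + 0.000823 * h_ft
    L_along = L_cross = h_ft / denom ** 1.2   # along-wind / crosswind scale length
    L_vert = h_ft                             # vertical scale length
    sigma_vert = 0.1 * u20_kt                 # vertical RMS gust speed
    sigma_along = sigma_cross = sigma_vert / denom ** 0.4
    return (L_along, L_cross, L_vert), (sigma_along, sigma_cross, sigma_vert)

def dryden_gust_channel(L_ft, sigma, airspeed_fps, dt, n, rng):
    """First-order discrete approximation of one Dryden channel: unit-variance
    white noise shaped by a low-pass filter with time constant L/V."""
    a = airspeed_fps * dt / L_ft   # must be << 1 for the approximation to hold
    gust = np.zeros(n)
    noise = rng.standard_normal(n)
    for k in range(n - 1):
        gust[k + 1] = (1.0 - a) * gust[k] + sigma * np.sqrt(2.0 * a) * noise[k]
    return gust
```

The first-order recursion is a common discretization of the Dryden along-wind shaping filter; higher-fidelity implementations (e.g., [70]) discretize the full second-order vertical/crosswind filters instead.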

D. FLOCKING BEHAVIOR
The rules for multi-agent flocking behavior have been defined according to Reynolds flocking rule set. These flocking rules were encapsulated as reward functions that motivated the agents to exhibit cohesion, alignment, separation, target seek, and obstacle avoidance. Similar to our previous work [25], the thresholds for each reward were selected based on running a uniform search (sweep) on the parameter space and evaluating the reward curves.
• Cohesion (r_COH): allows the flock to move inward towards an estimated centroid position. The cohesion objective seeks to minimize the distance between a quadrotor sUAV and its one-hop neighbors while maintaining an inter-sUAV separation distance k_sep. The flock radius was set proportional to the square root of the flock size. States were given a reward of r_COH = +1 for staying within k_sep of the flock centroid, and r_COH = −1 otherwise.
• Alignment (r_ALN): allows the quadrotor sUAVs in the flock to match their velocity headings such that each member of the flock moves as a single unit. The velocity alignment objective seeks to minimize the velocity heading difference between the local quadrotor sUAV and the average heading of its one-hop neighbors. States were given a reward r_ALN in the interval [−1, +1] as a function of the velocity heading difference θ (in degrees) between the platform and the average heading of the flock.
• Separation or Collision Avoidance (r_COL): provides a minimum safe distance k_col between the flock members such that they avoid colliding with each other. Two levels of negative rewards are defined based on the degree of k_col violation: the agent is severely penalized when it is within 0.2 distance units of a neighbor and mildly penalized when it is within 1.5 distance units. This differentiation was necessary since the agents might come close to one another due to external disturbances; however, extremely close distances should be avoided under all circumstances. States were given a reward of r_COL = −1 for getting within 1.5 distance units of a neighbor, r_COL = −100 for getting within 0.2 distance units of a neighbor, and r_COL = 0 otherwise.
• Target Seek (r_TGT): seeks to minimize the distance between a local quadrotor sUAV and the current waypoint of the flock, allowing for waypoint tracking of the flock. The target seek reward is defined such that the agents reach the waypoints as quickly as possible [71]. The quadrotors are considered to have reached a waypoint if they are within a distance k_reach from that waypoint.
• Obstacle Avoidance (r_OBS): formulated to ensure safe operations in an environment with obstacles. The quadrotors were rewarded based on their distance d_io from an obstacle relative to an obstacle avoidance threshold k_obs [74].

IV. FLOCKING PLANNER
A single Q-table characterizes each module m. The optimal policy selects the action with the largest weighted sum of Q-values across all Q-tables. During training, the Q-tables are updated using the SARSA algorithm. The overall procedure described here was elucidated previously in [25]. Figure 4 depicts the graphical representation of the OF-GM-SARSA planner, where s and a denote the state and action, respectively. Each flocking behavior is described as an objective represented by a Q-table. The collision avoidance objective is represented by multiple copies of a single Q-table, one per inter-sUAV pairing, to avoid exponential growth in the size of the Q-table. The greatest mass operation involves selecting the maximizing action under a weighted average of the corresponding Q-values.
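The tiered separation penalty and the binary cohesion reward described in the list above can be sketched directly; the target seek shaping shown is a hypothetical stand-in, since the paper's exact expression is not reproduced here.

```python
def r_collision(d):
    """Tiered separation penalty (distance units as in the text)."""
    if d <= 0.2:
        return -100.0  # severe: dangerously close to a neighbor
    if d <= 1.5:
        return -1.0    # mild: inside the soft separation margin
    return 0.0

def r_cohesion(dist_to_centroid, k_sep):
    """+1 inside the cohesion radius k_sep around the flock centroid, -1 outside."""
    return 1.0 if dist_to_centroid <= k_sep else -1.0

def r_target(dist_to_waypoint, k_reach):
    """Hypothetical shaping: reward only once the waypoint counts as reached
    (within k_reach), encouraging the agent to get there quickly."""
    return 1.0 if dist_to_waypoint <= k_reach else 0.0
```

The two-tier structure of r_collision is what lets the learned policy tolerate brief disturbance-induced proximity while still strongly avoiding near-collisions.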

A. STATE SPACE REPRESENTATION
For a given sUAV, the states included in the state space representation of the OF-GM-SARSA planner are estimates of the relative position (i.e., range and bearing) and velocity heading of neighboring sUAVs. Let the state s denote the total collection of these measurements across modules. The s^(COL) states are quantized distance, bearing, and velocity heading difference measurements between a given sUAV and its neighboring sUAVs. The s^(ALN) states are quantized velocity heading differences between the sUAV's current heading and the average heading of all neighboring sUAVs.
The object-focused formulation of the learning problem further decomposes the collision avoidance states into states for each neighboring sUAV in O^(COL), the set of neighbor sUAVs. A single Q-table was learned for the collision avoidance behavior and shared across all sUAVs. The control policy was evaluated based on the separation distances queried from this Q-table. Each sUAV has a different inter-sUAV distance and therefore receives different rewards, but as specified by the single learned Q-table. A linear sum of the learned Q-values was then performed to obtain a single composite Q-value for the collision avoidance criterion. The measurement discretization used was specific to each module. The distance partitioning for the collision avoidance module was set to a higher resolution at closer ranges. The target seek module depended on long-range distances, and its discretization was uniform.
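A minimal sketch of this module-specific distance quantization follows; the bin edges are illustrative choices, not the paper's exact partitions.

```python
import numpy as np

# Collision module: finer bins at close range, coarser bins farther out.
COL_EDGES = np.array([0.2, 0.5, 1.0, 1.5, 3.0, 6.0, 12.0])
# Target module: uniform bins over the long-range distances it depends on.
TGT_EDGES = np.linspace(5.0, 40.0, 8)

def quantize(distance, edges):
    """Map a continuous distance to a discrete state index in 0..len(edges)."""
    return int(np.digitize(distance, edges))
```

Nonuniform edges give the collision module high resolution exactly where the tiered penalties change, while the target module's uniform edges keep its Q-table compact over long ranges.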

B. ACTION SPACE REPRESENTATION
The action space consisted of discrete velocity setpoints for the DJI N1 flight stack. The heading angles were quantized to remain within sufficiently smooth yaw rates achievable by the DJI M100 quadrotor and N1 flight stack. The discretized action space A for small roll and pitch angles is represented as the Cartesian product of the heading angles (degrees) and the speeds (m/s). The planner selects the action maximizing the weighted sum of the module Q-values, where the module weights, w_m, sum to one. These values were set manually; cohesion and alignment were weighted less than the other three modules.
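The Cartesian-product action set and the greatest-mass selection can be sketched as follows; the heading and speed values are hypothetical, since the paper's exact discretization is not reproduced here.

```python
from itertools import product

# Hypothetical discretization of heading (deg) and speed (m/s) setpoints.
HEADINGS_DEG = (-30.0, -15.0, 0.0, 15.0, 30.0)
SPEEDS_MPS = (0.5, 1.0, 1.5)
ACTIONS = tuple(product(HEADINGS_DEG, SPEEDS_MPS))  # Cartesian product A

def gm_select(q_tables, weights, states):
    """Greatest-mass arbitration: pick the action maximizing the weighted sum
    of per-module Q-values (module weights sum to one)."""
    def total_q(a):
        return sum(w * q[s][a] for q, w, s in zip(q_tables, weights, states))
    return max(ACTIONS, key=total_q)
```

Because each module only contributes its own Q-values, modules can be trained separately (as in the staged training below) and still be arbitrated jointly at execution time.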

V. STATE EXPLORATION AND MODEL TRAINING
The OF-GM-SARSA training procedure is shown in Algorithm 1. After the quadrotor sUAVs and waypoints are initialized, each quadrotor sUAV selects an action according to the softmax exploration rule. This exploration rule uses the progressively learned Q-values from each table to compute the total weighted Q-value from (13). The actions are then sampled using the softmax probability mass function, where larger values of the temperature T encourage exploration. State-space updates are applied sequentially to each quadrotor sUAV, and the next action is chosen according to the same on-policy exploration rule. During training, only discrete actions were used. The Q-table for each module was then updated using SARSA. This process was repeated until the sUAVs reached the waypoint or a maximum time elapsed.

A. TRAINING METHODOLOGY
The learning rate, α, is the same for all Q-table updates for each flocking condition, whereas the discount factors, γ_m, are unique. The discount factors decide how much future outcomes influence the behavior of the learning rule for each module. Target seek, for example, requires information over a longer horizon. Other modules, however, are more reactive, making immediate rewards more important. Specifically, the collision avoidance discount factor was set to 0, the cohesion and alignment discount factors were set to 0.01, and the discount factor for target seek was set to 0.9. The task of multi-agent flocking consists of non-convex criteria [71]; as such, an efficient training method is imperative. In [75], the authors showed that starting the training procedure with easier examples of a learning task, followed by a gradual increase in difficulty, improved the speed of convergence and the generalizability of the results. Similarly, the training process of modular Q-learning is faster when working with subsets of modules rather than all at once. As such, the training procedure employed in this study consisted of three stages with increasing levels of task difficulty. The training was executed using the 6-degrees-of-freedom (DOF) quadrotor rigid body dynamic model.
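A per-module SARSA update under these module-specific discount factors might look like the following sketch; the discount factors and the shared learning rate of 0.20 are taken from the text, while the table layout is illustrative.

```python
# Module-specific discount factors, as stated in the text; alpha is shared.
GAMMAS = {"collision": 0.0, "cohesion": 0.01, "alignment": 0.01, "target": 0.9}
ALPHA = 0.20

def module_sarsa_update(Q, module, s, a, r, s_next, a_next):
    """One on-policy SARSA step for a single module's Q-table."""
    gamma = GAMMAS[module]
    Q[s][a] += ALPHA * (r + gamma * Q[s_next][a_next] - Q[s][a])
```

With γ = 0, the collision module reduces to learning immediate penalties only, which matches its purely reactive role, while the target module's γ = 0.9 propagates reward over the long horizon needed to reach distant waypoints.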

1) FIRST STAGE
A single quadrotor sUAV was used during the first stage to learn the Q-table for the target seek objective, with the OF-GM-SARSA target seek objective weight set to 1.0. A quadrotor rigid body dynamic simulation was implemented and tuned to match the specific construction of the DJI M100. The starting location of the sUAV and the waypoint were randomized uniformly during each training episode/iteration.

2) SECOND STAGE
In the second stage, a fixed set of 25 sUAVs was used to learn the Q-tables for the flocking objectives (i.e., collision avoidance, velocity alignment, and cohesion). The OF-GM-SARSA training weights for this second stage were set to 0.40 for the collision avoidance objective, 0.20 for the velocity alignment objective, and 0.40 for the cohesion objective.

3) THIRD STAGE
In the final stage of training, the target seeking and flocking training procedures were combined to update the Q-tables of all four objectives jointly. The training weights used in the third stage were 0.40 for the collision avoidance objective, 0.40 for the target seek objective, 0.10 for the velocity alignment objective, and 0.10 for the cohesion objective.
One thousand training iterations were used for each of the first two stages, whereas the third stage used ten thousand training iterations. Q-table updates were performed at the end of every time step of the simulation. A training episode ended when the sUAVs reached the waypoint or a maximum elapsed time was exceeded for the first and third training stages. For the second training stage, a training episode ended after a fixed maximum elapsed time. The learning rate was kept fixed at 0.20 during all three stages of training. Moving-average estimates of the reward curves observed during training are shown in Figure 6. The figure demonstrates the convergence of the OF-GM-SARSA learning algorithm across all three stages, with slower convergence in the third stage due to the complexity of the combined flocking and target seek tasks.

B. ON CONVERGENCE OF OF-GM-SARSA
The OF-GM-SARSA algorithm is used to develop a behavioral flocking controller with a defined set of component rules that govern the selection of control outputs (e.g., acceleration, heading) in outdoor windy environments based on the relative position and velocity differences between neighboring sUAVs [26]. Since the sUAVs operate in a dynamic and stochastic environment, there is no guarantee of convergence to a globally optimal policy. Singh et al. [76] proved that the SARSA learning algorithm converges to an optimal Q* if the policy is greedy in the limit of infinite exploration (GLIE). GLIE implies a learning policy in which, eventually, the probability of selecting the optimal action over a random action becomes 1 (greedy). This requirement can be met with both ε-greedy and Boltzmann (softmax) exploration. Russell and Zimdars [73] extended this result to the use of an arbitrator and local Q-functions (modular GM-SARSA): given that the arbitrator satisfies GLIE, the individual module estimates Q_m converge to the optimal Q*_m and thus to a global optimum Q*. Cobo et al. [77] addressed OF Q-learning, demonstrating that OF Q-function estimates Q_o converge to the true Q-functions Q*_o, in that several objects of the same class can be seen as independent episodes of the same Markov Decision Process. The first and foremost goal of the training was to obtain Q-values that maintain generalizability and ensure that the sUAVs maintain safe flight formation in the presence of realistic stochastic disturbances. Figure 6 depicts the sliding-window mean and standard deviation of the average returned reward values per iteration for each training phase. As observed in Figure 6, the first two phases demonstrate strong convergence. In the third stage, when training for all objectives, the mean of the obtained rewards demonstrates slow but sufficient convergence, indicating that the policy is learning the desired behaviors.
Note that some variation in reward during training is also due to randomized scenarios with respect to initial distances from the target in addition to quantity and location of obstacles per iteration.

VI. SIMULATIONS USING DRYDEN MODEL
The OF-GM-SARSA flocking controller was evaluated with and without wind disturbances in Python simulations. The wind disturbances were modeled using the Dryden wind turbulence model discussed in Section III-C. Figure 7 depicts an example of the wind speed in m/s generated in the along-wind (headwind or tailwind depending on the +/− sign), crosswind, and vertical-wind directions for a flight duration of 10 seconds. The sampling rate for this particular dataset was 43 Hz. The wind turbulence was generated for a quadrotor sUAV operating at an altitude of 5 m above the ground and moving with an airspeed between 0.25 m/s and 1 m/s. Waypoints were considered achieved if sUAVs were within 3.0 meters of them. The collision avoidance minimum and desired separation distances were set to 3.0 and 6.0 meters, respectively.
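A simplified discrete-time sketch of the along-wind Dryden component is shown below. It uses a first-order Gauss-Markov approximation of the Dryden along-wind filter; the function name and the scale-length/intensity values are illustrative, not the parameters used in the paper:

```python
import numpy as np

def dryden_along_wind(duration_s, fs_hz, airspeed, L_u, sigma_u, seed=0):
    """First-order discrete approximation of the Dryden along-wind gust
    component: a Gauss-Markov process driven by white noise.

    airspeed: vehicle airspeed [m/s]; L_u: turbulence scale length [m];
    sigma_u: turbulence intensity [m/s].
    """
    rng = np.random.default_rng(seed)
    dt = 1.0 / fs_hz
    n = int(duration_s * fs_hz)
    a = 1.0 - airspeed * dt / L_u                     # pole of discretized filter
    b = sigma_u * np.sqrt(2.0 * airspeed * dt / L_u)  # driving-noise gain
    u = np.zeros(n)
    for k in range(n - 1):
        u[k + 1] = a * u[k] + b * rng.standard_normal()
    return u

# 10 s of gusts at 43 Hz for a low-altitude, slow-moving quadrotor
gust = dryden_along_wind(10.0, 43.0, airspeed=1.0, L_u=200.0, sigma_u=1.0)
```

The crosswind and vertical components of the full Dryden model use second-order shaping filters and different scale lengths, so this first-order sketch captures only the qualitative colored-noise character of the gusts.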
A total of 100 runs of four quadrotor sUAVs flocking along a 5-waypoint rectangular area while maintaining the flocking rules of cohesion, alignment, and collision avoidance were simulated. Figure 8a depicts the multi-sUAV flock at the third waypoint when there were no wind disturbances. Figure 8b depicts the multi-sUAV flock at that same point when wind disturbances were applied. From Figure 8c, it is observed that the quadrotors overshot the waypoint due to the wind disturbance.
However, as observed in the accompanying video, successful flocking behavior was maintained throughout the simulation. The multi-sUAV formation completed the mission with and without wind disturbances while avoiding collisions and reaching all 5 waypoints. A simulation-based comparison of the OF-GM-SARSA flocking planner presented here with the genetic algorithm-based behavioral flocking planner was conducted in our previous work [25]. The comparison suggested high generalizability of the trained model and demonstrated the model's capability of keeping the number of collisions to a minimum.

VII. TESTBED AND METHODOLOGY
This section describes the experimental hardware and software architecture and test procedures. The HWIL tests used table-top setups of the quadrotors that ran the flocking planner for 5-waypoint missions. The waypoints were spread out in a square pattern over a 40 meter by 40 meter area for both the HWIL and the outdoor flight tests.

A. FLOCKING QUADROTOR ARCHITECTURE
Each quadrotor sUAV was equipped with a Raspberry Pi 4 flight computer, an RFM69 radio module, and the DJI N1 flight controller as depicted in Figure 9. The associated software architecture is also depicted in Figure 9. The DJI M100 N1 flight controller was responsible for position and velocity control using velocity and yaw-rate inputs generated by the flocking algorithm. The N1 flight controller was connected to the Raspberry Pi running the DJI Onboard SDK via a universal asynchronous receiver-transmitter (UART) connection. The OF-GM-SARSA software and the Q-tables learned in training (Section V) ran on the Raspberry Pi in a Python ROS node for each platform. The local telemetry information was obtained by interfacing with the DJI N1 flight controller. Both telemetry acquisition and velocity control were achieved using the DJI SDK ROS interface. An Arduino microcontroller was connected to the Raspberry Pi over USB. The Arduino provided a bridge for the RFM69 915 MHz wireless radio transceiver.
A ground station computer hosted a web-based user interface that provided status monitoring information and high-level experimental control for each experiment. The ground station also used an RFM69 915 MHz wireless radio transceiver to acquire telemetry information and perform clock synchronization for all quadrotors. A time division multiple access (TDMA) data link protocol was implemented to achieve low latency wireless communication between neighboring sUAVs and the ground station. Each data payload included neighbor position, yaw heading, and network connectivity information about the transmitting quadrotor. Before starting a mission, communication slot assignments were determined, with the ground station always configured to use the first communication slot.
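A minimal sketch of the slot-assignment logic such a TDMA link could use, assuming fixed equal-length slots with the ground station always in slot 0. All names and the slot length are hypothetical; this is not the implementation flown in the tests:

```python
def tdma_schedule(num_quadrotors, slot_ms, epoch_start_ms=0):
    """Assign fixed TDMA transmit slots; the ground station always
    takes the first slot, quadrotors take slots 1..N.

    Returns {node: (slot_offset_ms, frame_ms)}.
    """
    nodes = ["ground_station"] + [f"quad_{i}" for i in range(1, num_quadrotors + 1)]
    frame_ms = slot_ms * len(nodes)
    return {node: (epoch_start_ms + i * slot_ms, frame_ms)
            for i, node in enumerate(nodes)}

def next_tx_time(now_ms, slot_offset_ms, frame_ms):
    """Next transmit time at or after now_ms for a given slot
    (relies on clock synchronization across nodes)."""
    frames_elapsed = max(0, -(-(now_ms - slot_offset_ms) // frame_ms))
    return slot_offset_ms + frames_elapsed * frame_ms
```

Fixed slot assignment trades peak throughput for bounded, predictable latency, which is why clock synchronization from the ground station matters: slot boundaries drift apart if the node clocks do.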
All HWIL tests were executed using the DJI Assistant 2 flight simulation software and the same compute and communication payloads. The software provided a real-time emulation of the DJI M100 rigid body dynamics to simulate the telemetry outputs and the velocity and yaw-rate inputs into the N1 flight controller. The DJI SDK ROS topics were published to the Raspberry Pi shown in Figure 9 to test the OF-GM-SARSA flight planner. During HWIL testing, the three quadrotors and the ground station were placed in a line with approximately 0.5 meters between each quadrotor and approximately 1.5 meters of separation to the ground station to minimize packet loss.

B. WIND MEASUREMENT
An FT205 ultrasonic anemometer sensor with an embedded computer was used to perform wind measurements during the flight tests, as shown in Figure 10. The sensor was mounted on a DJI M100 quadrotor using a 20-inch-long 3D-printed pole. The FT205 was sampled at a frequency of 2 Hz. The wind measurements were performed with the quadrotor hovering at a height of 12.5 meters above ground level. The quadrotor was flown approximately 50 meters away from where the flocking quadrotors were operating to maintain safety.

C. OF-GM-SARSA CONFIGURATION
For all tests, the OF-GM-SARSA planner was configured with a desired separation distance of k_sep = 6 meters and a minimum separation distance of k_col = 3 meters. The reach distance for waypoints was set to 6 meters. The cohesion radius for the cohesion objective was set to 30 meters. Several weight combinations were used for the OF-GM-SARSA objectives, with each combination summing to 1. All quadrotors were required to be within the reach distance of the current waypoint before moving to the new waypoint. The update rate for the OF-GM-SARSA planner was set to 4 Hz, selected to match the sampling rate of the DJI N1 flight telemetry.
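The greatest-mass arbitration over weighted objective modules can be sketched as follows. This is a minimal tabular illustration; the function, module names, and weights are hypothetical examples, not the combinations used in the experiments:

```python
def gm_select_action(module_qs, weights, actions):
    """Greatest-mass arbitration: pick the action that maximizes the
    weighted sum of per-module Q-values.

    module_qs: {module_name: {action: q_value}}
    weights:   {module_name: weight}, summing to 1
    """
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    def mass(a):
        return sum(w * module_qs[m].get(a, 0.0) for m, w in weights.items())
    return max(actions, key=mass)

# Example: separation disfavors moving right, cohesion favors it;
# with these (illustrative) weights, separation wins the arbitration.
module_qs = {"separation": {"left": 1.0, "right": 0.0},
             "cohesion":   {"left": 0.0, "right": 0.8}}
weights = {"separation": 0.6, "cohesion": 0.4}
chosen = gm_select_action(module_qs, weights, ["left", "right"])
```

Because the arbitrator only sums module values at decision time, each module's Q-table can be trained and stored independently, which is what keeps the object-focused decomposition tractable on the Raspberry Pi.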

D. FLIGHT PROCEDURE
Each HWIL and outdoor flight test consisted of the following three stages:

1) FLIGHT STAGE 1: TAKE-OFF AND POSITION
In the first stage, the DJI M100 quadrotors would take off one at a time and move along a pre-specified path to a nearby location denoted by the center point. The quadrotor would then move into a relative formation as determined by a position offset from the center point, after which the next quadrotor would be cued to take off. Example initial formations included a straight line or a triangle as pictured in Figure 11. Field tests were performed with a minimum height offset of 3.5 meters between quadrotors for safety reasons.

2) FLIGHT STAGE 2: TARGET SEEK
Once the quadrotors reached their initial deterministic formation, the OF-GM-SARSA planner would start the flocking process over the series of pre-specified waypoints.

3) FLIGHT STAGE 3: RETURN-TO-HOME (RTH)
Once the OF-GM-SARSA planner achieved all waypoints, the quadrotors remained in place and then executed an RTH procedure. This stage involved each quadrotor returning to its initial formation location and landing at its original takeoff location. The order in which the quadrotors returned to their takeoff points was determined by their distance to the center point. The quadrotors were assigned a wait time before starting their RTH to allow enough separation between them.
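The distance-based RTH ordering could be sketched as follows. The closest-first ordering and the staggered wait values are assumptions for illustration; the paper specifies only that ordering depends on distance to the center point:

```python
import math

def rth_order(positions, center, base_wait_s=15.0):
    """Order quadrotors for return-to-home by distance to the center
    point and assign staggered start delays for separation.

    positions: {quad_id: (x, y)}; returns [(quad_id, wait_s), ...]
    ordered closest-first (an assumption), each waiting base_wait_s
    longer than its predecessor.
    """
    def dist(q):
        p = positions[q]
        return math.hypot(p[0] - center[0], p[1] - center[1])
    ordered = sorted(positions, key=dist)
    return [(q, i * base_wait_s) for i, q in enumerate(ordered)]
```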

VIII. EXPERIMENTAL RESULTS
This section presents the experimental results and a discussion of the performance achieved by the OF-GM-SARSA planner for the 3-sUAV and 4-sUAV scenarios. The OF-GM-SARSA planner was evaluated in HWIL and field experiments. All reported metrics exclude telemetry measurements from the takeoff and RTH stages. Each experiment was performed five times to ensure repeatability.

A. KEY EVALUATION METRICS
The experimental evaluation of the OF-GM-SARSA motion planner, including the HWIL simulation and outdoor flight tests, provides actionable insights into the practical application of the proposed method. The following metrics were used to measure the system performance in the HWIL and field test experiments:
1) Inter-sUAV distance d_ij: the Euclidean distance between two sUAVs at any given time.
2) k_col violation: an instance where the minimum d_ij was below the specified k_col.
3) Takeoff + target seek + return-to-home (RTH) time τ_mission: the total time to complete the mission from takeoff to landing.
4) Velocity alignment deviation: the difference between the velocity heading of each sUAV and the average heading of the flock (closer to zero is better).
5) Cohesion distance deviation: the difference between the position of each sUAV and the average position of the flock (lower is better).
6) Total radio packet loss ζ_total_loss: the fraction of total transmitted radio packets that were lost across all quadrotors during the flight test.
7) Average pairwise packet loss ζ_avg_loss: the packet loss per quadrotor pair averaged across the total number of quadrotor pairs during the flight test.
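Several of these metrics reduce to short computations. A hedged sketch follows, with hypothetical function names and data layouts:

```python
import itertools
import math

def inter_suav_distances(positions):
    """Pairwise Euclidean distances d_ij for all quadrotor pairs.
    positions: {quad_id: (x, y)}."""
    return {(i, j): math.dist(positions[i], positions[j])
            for i, j in itertools.combinations(sorted(positions), 2)}

def k_col_violations(distances, k_col):
    """Pairs whose separation fell below the minimum distance k_col."""
    return [pair for pair, d in distances.items() if d < k_col]

def velocity_alignment_deviation(headings_deg):
    """Per-sUAV difference from the mean flock heading, in degrees."""
    mean_h = sum(headings_deg) / len(headings_deg)
    return [h - mean_h for h in headings_deg]

def cohesion_deviation(positions):
    """Per-sUAV distance from the flock centroid."""
    pts = list(positions.values())
    cx = sum(p[0] for p in pts) / len(pts)
    cy = sum(p[1] for p in pts) / len(pts)
    return {q: math.dist(p, (cx, cy)) for q, p in positions.items()}
```

Note that a naive heading average as used here wraps incorrectly near ±180°; a production implementation would average unit vectors instead.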

B. INTER-sUAV DISTANCES
1) 3-QUADROTOR TEST
The summary statistics associated with 3 quadrotors for d_ij are shown in Table 1. The top left and top right graphs in Figure 12 depict the inter-sUAV distances for 3 quadrotors throughout the entire duration of the HWIL experiment and field test, respectively. The average wind speed during the HWIL test was ∼6.54 m/s with a standard deviation of ∼0.72 m/s. The average wind speed during the field test was higher at ∼7.23 m/s with a standard deviation of ∼2.49 m/s.

2) 4-QUADROTOR TEST
The summary statistics associated with 4 quadrotors for d_ij are shown in Table 2. The bottom left and bottom right graphs in Figure 12 depict the inter-sUAV distances for one set of quadrotor pairs in the 4-quadrotor scenario throughout the entire duration of the HWIL experiment and field test, respectively. The average wind speed during the HWIL test was ∼6.54 m/s with a standard deviation of ∼0.72 m/s. The average wind speed during the field test was recorded to be ∼6.79 m/s with a standard deviation of ∼1.77 m/s. As observed from the tables and the graphs, the mean d_ij distances for both tests were well above the desired separation distance of 6 meters. However, during the HWIL and field test experiments, there were rare instances where the minimum inter-sUAV distance fell below the specified k_col of 3 meters. For example, the pair (1,3) witnessed a minimum inter-sUAV distance of 1.03 meters in HWIL and 2.06 meters in the field test for the 3-quadrotor scenario. Similarly, several pairs in the 4-quadrotor scenario came close to the k_col distance. In all these instances, the quadrotors corrected the violations quickly (in less than 2.5 seconds).
The inter-sUAV distance performance recorded in our outdoor flight experiments was benchmarked against the inter-sUAV distance performance obtained using the model-based approach incorporating an evolutionary optimization framework reported in [31]. The following observations are made:
1) Using the approach of [31], the average inter-sUAV distances varied between 12 meters and 30 meters. From Table 4, the average inter-sUAV distances varied from 8 meters to 18.52 meters in our experiments.
2) Using the approach of [31], the minimum inter-agent distance remained between 5 meters and 15 meters. From Table 4, the minimum inter-agent distance for our experiments did not exceed 8 meters.
For the four-quadrotor test, the mean τ_mission was 692.44 seconds for the HWIL simulation. For the field test, this duration decreased to 578.35 seconds.
The most significant factor associated with τ_mission variations was the presence of unpredictable wind conditions in the field test. During windier conditions, the total execution time varied significantly, as the quadrotors were aided by the wind gusts in some directions more than in others while trying to maintain collision avoidance, velocity alignment, and cohesion. These wind effects can be readily observed in the flight test video.
Table 3 depicts ζ_total_loss and ζ_avg_loss for the 3-sUAV and 4-sUAV tests. Both ζ_total_loss and ζ_avg_loss are lower in HWIL than in the outdoor field tests. For outdoor field tests, when a pairwise line of sight is not readily established among quadrotors, ζ_total_loss increases as the number of quadrotors in the mission increases. However, ζ_avg_loss remains low, i.e., the majority of the communication happens through multi-hop indirect propagation. In the case of 4 quadrotors operating outdoors, ζ_total_loss is particularly severe at 63%, while ζ_avg_loss remains around 18%.
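The two packet loss metrics can be sketched as follows, under the assumption that per-pair sent/received counts are available. The paper's exact bookkeeping is not specified, so this is one plausible interpretation:

```python
def packet_loss_metrics(pair_stats):
    """Compute total and average pairwise packet loss fractions.

    pair_stats: {(tx_id, rx_id): (packets_sent, packets_received)}
    Returns (zeta_total, zeta_avg): overall loss over all packets,
    and the per-pair loss fraction averaged over pairs.
    """
    sent = sum(s for s, _ in pair_stats.values())
    recv = sum(r for _, r in pair_stats.values())
    zeta_total = 1.0 - recv / sent
    per_pair = [1.0 - r / s for s, r in pair_stats.values()]
    zeta_avg = sum(per_pair) / len(per_pair)
    return zeta_total, zeta_avg
```

The two values diverge when traffic volume differs across pairs, which is why a flock can show a severe total loss while the pair-averaged loss stays moderate.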

E. VELOCITY ALIGNMENT AND COHESION DEVIATIONS
The summary statistics for sUAV velocity alignment and cohesion deviations for the 3-quadrotor formation flight are shown in Table 4. Similarly, the summary statistics for sUAV velocity alignment and cohesion deviations for the 4-quadrotor flight are shown in Table 5. From the tables, it is observed that the 3-quadrotor flock achieves stronger cohesion and velocity alignment performance than the 4-quadrotor flock. It is also observed that the velocity alignment deviation in the HWIL tests is significantly lower than in the field tests. Several factors need to be considered:
1) The average wind speed during the field tests was higher than during the HWIL experiments.
2) The radio packet losses (ζ_total_loss and ζ_avg_loss) were significantly higher in the field tests than in the HWIL tests. A higher packet loss results in slower/variable telemetry refresh rates, affecting flight control quality.
3) As the number of sUAVs in the flock increases, each sUAV has to accommodate the "wind-disturbed" positions and orientations of an increasing number of neighbors in the flock.
The velocity alignment performance recorded in our outdoor flight experiments was benchmarked against the velocity alignment performance obtained using the deep deterministic policy gradient (DDPG) approach presented in [78]. The mean velocity alignment observed in our outdoor flight tests with three quadrotors varied between −7.96° and 4.72°, indicating the agents maintained their headings relative to each other with minimal deviations. Using the DDPG approach of [78], for the three-quadrotor test, the mean velocity alignment varied between −108° and 63°.

IX. DISCUSSION AND FUTURE RESEARCH
Several future research directions for the experimental implementation of RL-based motion planning algorithms emerge from this study.

A. ENERGY-AWARE PLANNING
The energy requirements of sUAVs directly affect the practicality of any motion planning algorithm [79], [80]. This is especially true when quadrotors operate in windy outdoor environments where they must compensate for gusts. An exemplar study in this context is [81], wherein the authors developed an RL approach that combined the effects of the power consumption and object detection modules to develop a policy for object detection in large areas with limited battery life. The quadrotors used in this study (DJI M100) were evaluated in hover flight tests to have a flight time of approximately 1200 seconds, which proved sufficient to test the OF-GM-SARSA approach in flight-test experiments. However, the inclusion of sUAV power constraints as an objective in the OF-GM-SARSA paradigm using an approach similar to [81] will be explored in future efforts.

B. SCALABILITY AND COLLISION AVOIDANCE GUARANTEES
Future efforts will explore guarantees of collision avoidance with outdoor experimental tests involving groups of up to 12-15 quadrotors using Control Barrier Functions (CBFs) [82]. CBFs are being increasingly used to verify and enforce safety properties in the context of safety-critical controllers. They have the potential to provide a computationally tractable approach to combining learning with safety guarantees.
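As a hedged illustration of how a CBF could filter the planner's velocity commands for a single pairwise-distance constraint, the sketch below assumes single-integrator dynamics and a static neighbor. It is not part of the presented system, and all names are hypothetical:

```python
import numpy as np

def cbf_filter_velocity(p_i, p_j, v_des, d_min, alpha=1.0):
    """Filter a desired velocity so the pairwise-distance barrier
    h(x) = ||p_i - p_j||^2 - d_min^2 satisfies dh/dt >= -alpha * h.

    With single-integrator dynamics and a static neighbor j, the
    constraint is 2 (p_i - p_j) . v >= -alpha * h. If v_des violates
    it, take the minimum-norm projection onto the constraint boundary
    (closed form; no QP solver needed for a single constraint).
    """
    delta = p_i - p_j
    h = float(delta @ delta) - d_min ** 2
    g = 2.0 * delta                       # gradient of h w.r.t. p_i
    slack = float(g @ v_des) + alpha * h
    if slack >= 0.0:
        return v_des                      # desired command is already safe
    return v_des - (slack / float(g @ g)) * g
```

With multiple neighbors the per-pair constraints combine into a small quadratic program, which is the form typically solved in the CBF literature [82].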

C. ACTOR-CRITIC METHODS
Despite a relatively slow convergence in the final training stage, the trained OF-GM-SARSA behavioral controller was more than capable of generating flocking behaviors for multiple sUAVs flying in formation outdoors in the presence of wind disturbances. Among existing reinforcement learning techniques, SARSA-based approaches have been shown to have poor convergence performance for real-world applications because they are oriented towards finding a deterministic policy, whereas the optimal policy may be stochastic [83]. Alternatively, reinforcement learning approaches based on the actor-critic paradigm have been proven to have good convergence properties [84]. Actor-critic methods have been shown to scale to multiple robots with more than 3 degrees of freedom, and they can also be used model-free [85]. In future work, such actor-critic approaches will be explored. It is well known that despite favorable convergence properties, actor-critic methods are difficult to implement, and their performance is highly dependent on the hardware/software implementation [86]. Towards this end, we plan to use Flightmare, a quadrotor simulator that provides OpenAI Gym-style wrappers for several RL algorithms, including actor-critic methods such as the Proximal Policy Optimization (PPO) algorithm [87], [88].

X. CONCLUSION
This work presented an experimental evaluation of the OF-GM-SARSA planner to address the flocking problem for small rotor-based multi-UAV systems. The study presented the background of the algorithm, along with the training procedure. The flocking controller was experimentally evaluated in HWIL and field tests for 3-quadrotor and 4-quadrotor missions. The controller's performance was also evaluated under windy conditions in HWIL simulations using the Dryden wind turbulence model. The controller behavior observed in windy conditions in the HWIL and outdoor tests was similar to the simulations, suggesting that the technique presented here generalizes the behaviors trained in simulation to real-world flight.