On the Scalability of Vision-based Drone Swarms in the Presence of Occlusions

Vision-based drone swarms have recently emerged as a promising alternative to address the fault-tolerance and flexibility limitations of centralized and communication-based aerial collective systems. Although most vision-based control algorithms rely on the detection of neighbors, they usually neglect critical perceptual factors such as visual occlusions and their effect on the scalability of the swarm. To estimate the impact of occlusions on the detection of neighbors, we propose a simple but perceptually realistic visual neighbor selection model that discards obstructed agents. We evaluate the visibility model using a potential-field-based flocking algorithm with up to one thousand agents, showing that occlusions have adverse effects on the inter-agent distances and velocity alignment as the swarm scales up, both in terms of group size and density. In particular, we find that small agent displacements have considerable effects on neighbor visibility and lead to control discontinuities. We show that the destabilizing effects of visibility switches, i.e., agents continuously becoming visible or invisible, can be mitigated if agents select their neighbors from adjacent Voronoi regions. We validate the resulting flocking algorithm using up to one hundred agents with quadcopter dynamics and subject to sensor noise in a high-fidelity physics simulator. The results show that Voronoi-based interactions enable vision-based swarms to remain collision-free, ordered, and cohesive in the presence of occlusions. These results are consistent across group sizes, agent number densities, and relative localization noise. The source code and experimental data are available at https://github.com/lis-epfl/vmodel.


I. INTRODUCTION
AERIAL robot swarms have a vast socio-economic potential and are used for numerous real-world applications in industries such as agriculture, mapping, and construction [1]-[3]. Drone swarms can be deployed to monitor crops, create maps, and survey sites much faster than a single drone since they can solve tasks cooperatively and in parallel. Larger group sizes can further decrease task completion times and operating swarms in compact formations can enable new applications in confined spaces such as buildings. However, most drone swarms deployed today rely on external localization and wireless communication, both of which represent major limiting factors towards their scalability in terms of group size and swarm density.
Localization in drone swarms is usually achieved with satellite-based systems for outdoor applications [4] or optical motion capture for indoor deployments [5]. The drones are typically equipped with wireless communication devices that [10], and others [11]). Moreover, vision is arguably the ideal sensory modality for localization on aerial robots since cameras are small, lightweight, and provide extremely high information density at comparatively low power consumption [12]. Multi-robot systems that use a vision-based approach to mutual localization have recently emerged in the form of leader-follower formations [13]- [15] and the first aerial flocks [16]- [18]. Important perceptual factors such as visual occlusions, i.e., agents that are obstructed by others, are usually neglected in these swarms because of their small group size. However, these factors become a deterrent for larger swarms, especially when they have to fly in dense configurations.
While some swarm roboticists explicitly make use of visual occlusions to solve collaborative transport problems [19] and robotic shepherding tasks [20], the most thorough treatment of visibility constraints can be found in the collective motion literature. Using computer vision techniques, researchers are able to reconstruct the poses and visual fields of individual animals and show that visual perception best explains how information about food sources and predators transfers within the group [21]-[24]. How individuals select and react to their neighbors is one of the fundamental questions in the study of collective motion and agent-based flocking models provide an indispensable tool to test and verify different hypotheses [25]-[28]. Notable examples of neighbor selection methods include metric (i.e., within a metric radius) [29], topological (i.e., the set of n nearest neighbors) [30], and Voronoi-based (i.e., from adjacent Voronoi regions) [31] interactions. Recently, different forms of visual neighbor selection have gained popularity due to their biological plausibility [21]-[24]. For example, research on flocking models with a limited field of view shows that lateral vision is crucial for collision-free collective motion [32], [33] and may explain why flocking birds have almost omnidirectional vision [34]. Simulations of large schools of fish show that visual obstructions lead to more realistic group shapes and densities than purely metric interactions [35]. Simulations of large vision-based flocks show that bird density can be regulated effectively if individuals only react to the projection of their neighbors [36]. Other researchers show that many natural behaviors such as milling and polarized flocking emerge from purely visual interactions even in the absence of a spatial representation of neighbors [37].
Although these models offer interesting collective behaviors, they often make modeling choices that are geared towards a particular species or result in undesirable behavior for robotic swarms since they lead to frequent collisions.
In this work, we tackle visibility constraints arising from occlusions from a robotics perspective with the goal of synthesizing large and compact vision-based drone swarms. In particular, we study the effect of occlusions on the performance (i.e., collision avoidance, cohesion, and velocity alignment) of vision-based swarms as they scale from low densities and a handful of agents to high-density swarms with thousands of individuals. To this end, we propose a visual neighbor selection model that offers a perceptually plausible alternative to the ubiquitous but unrealistic metric selection of neighbors, i.e., methods that assume agents can sense arbitrary neighbors within a given radius regardless of occlusions. We simulate vision-based swarms of up to one thousand point mass agents and program them to perform collective waypoint navigation using a simple attractive/repulsive flocking algorithm. The results show that swarms in which agents react to all visible neighbors perform poorly, especially at high densities and as the group size increases beyond tens of agents. However, by limiting visual interactions to their Voronoi neighbors, we can successfully synthesize collision-free, cohesive, and ordered vision-based swarms. A comparison of Voronoi interactions with other common neighbor selection methods (i.e., metric and topological) reveals their superiority in large, high-density swarms. We validate the scalability of the resulting flocking algorithm at different densities and group sizes with quadcopter dynamics using a simulator with realistic physics and noise levels. The analysis shows that visually-constrained Voronoi interactions are both perceptually plausible and highly effective for the coordination of large aerial robot swarms in which agents rely purely on local visual information for control.

II. METHOD
We aim to synthesize a vision-based swarm that remains as compact as possible and collision-free while performing collective waypoint navigation. We define this objective since it enables many practical applications such as cooperative mapping, aerial deliveries, and search & rescue.
We briefly define preliminary concepts and the notation used throughout the article (Sec. II-A). We then describe a simple attractive/repulsive flocking algorithm that provides collision avoidance and cohesion, as well as a navigation capability to the swarm (Sec. II-B). To obtain a flocking algorithm that is plausible for vision-based swarms, we define the notion of agent visibility in the form of a neighbor selection strategy that is based on a realistic occlusion model (Sec. II-C). Since vision-based detection is an inherently stochastic process, we further model sensing noise on the range and bearing measurements (Sec. II-D).

A. PRELIMINARIES AND NOTATION
We consider a set of $N$ homogeneous agents that are labeled by $i \in \mathcal{A}$, where $\mathcal{A} = \{1, 2, \dots, N\}$ denotes the set of all agents and $|\mathcal{A}| = N$ its cardinality. We also define the set of all but the focal agent $i$ as $\mathcal{A}_i = \mathcal{A} \setminus \{i\}$. The state of each agent $i$ can be described by its position and velocity $\mathbf{p}_i, \mathbf{v}_i \in \mathbb{R}^m$. We focus on the two-dimensional case and let $m = 2$, assuming that the agents move in planar configurations. We denote the relative position of agent $j$ with respect to agent $i$ as $\mathbf{r}_{ij} = \mathbf{p}_j - \mathbf{p}_i$ with distance $d_{ij} = \|\mathbf{r}_{ij}\|$, where $\|\cdot\|$ is the Euclidean norm.
We model the swarm of agents as a directed sensing graph $G = (V, E)$, where the set of vertices $V = \{1, \dots, N\}$ denotes the agents and the set of edges $E \subseteq V \times V$ contains the ordered pairs of agents $(i, j) \in E$ if an agent $i$ is adjacent to agent $j$, which we denote by $i \sim j$. The graph $G$ can also be represented by an $N \times N$ adjacency matrix with entries $A_{ij} = 1$ if $i \sim j$ and $A_{ij} = 0$ otherwise. The motion of each agent can be described by single-integrator dynamics of the form
$$\mathbf{p}_i[k+1] = \mathbf{p}_i[k] + \Delta t \, \mathbf{v}_i[k] \tag{1}$$
where $k$ denotes the index of the discrete time step with duration $\Delta t$.

FIGURE 1. Pairwise potential of separation and cohesion terms as a function of inter-agent distance. Separation is inversely proportional to the inter-agent distance, whereas the cohesion term grows linearly with distance. The equilibrium distance is defined as the distance at which separation and cohesion balance.
In the remainder of the section, we skip the dependence on the discrete time step k for notational brevity and clarity. However, all computations in this section are performed at every time step without exception.
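As a minimal sketch of the single-integrator update, the following Python snippet advances planar positions by one discrete time step. The function name `step` and the tuple-based data layout are illustrative assumptions and are not taken from the released source code.

```python
def step(positions, velocities, dt=0.1):
    """Single-integrator update: p[k+1] = p[k] + dt * v[k] for every agent."""
    return [(px + dt * vx, py + dt * vy)
            for (px, py), (vx, vy) in zip(positions, velocities)]


# Two planar agents; dt is given in seconds (hypothetical example values).
positions = [(0.0, 0.0), (1.0, 2.0)]
velocities = [(1.0, 0.0), (0.0, -0.5)]
new_positions = step(positions, velocities)
```

Because the update is first order, the commanded velocity is applied without any inertia, which is what makes this model suitable for rapid prototyping with many agents.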

B. FLOCKING ALGORITHM
The objective of the swarm is to perform waypoint navigation while avoiding inter-agent collisions and staying together as a group. We formulate this objective as an artificial potential field that is inspired by the Reynolds flocking algorithm [38]. The motion of an agent is composed of an attractive/repulsive potential that provides separation and cohesion between agents (Sec. II-B1), as well as a migratory potential responsible for goal-directed navigation (Sec. II-B2).
The motion of an agent is composed of a social term that captures agent-to-agent interactions and a migration term that introduces the navigation objective. The velocity command of an agent can be written as
$$\mathbf{v}_i = \mathbf{v}_i^{\mathrm{soc}} + \mathbf{v}_i^{\mathrm{mig}} \tag{2}$$
where $\mathbf{v}_i^{\mathrm{soc}}$ and $\mathbf{v}_i^{\mathrm{mig}}$ denote the respective social (Eq. 3) and migration terms (Eq. 4). In order to obtain a final velocity command that is feasible even under the actuation constraints of a physical robot, we limit the maximum speed to $v_{\max}$:
$$\mathbf{v}_i \leftarrow \min\left(\|\mathbf{v}_i\|, v_{\max}\right) \frac{\mathbf{v}_i}{\|\mathbf{v}_i\|}$$
The velocity command $\mathbf{v}_i$ can then be used directly for the motion update to obtain the agent positions of the next time step (Eq. 1).

FIGURE 2. Scalability of minimum nearest neighbor distances to increasing numbers of agents using the baseline metric neighbor selection model, i.e., agents within the perception radius are detected irrespective of whether they are occluded. Each line represents the minimum equilibrium distance between nearest neighbors obtained from different separation gains as the swarm size increases (mean and std. dev. over ten trials, all other parameters constant). Aside from a noticeable increase of inter-agent distances between ten and thirty agents that occurs due to the saturation of the perception range with agents, the inter-agent distances remain constant across different group sizes (note the logarithmic scale).
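The composition of the velocity command and the speed limit can be sketched as follows. This is an illustrative Python sketch, assuming that the speed limit rescales the command to length `v_max` while preserving its direction; the function names are hypothetical.

```python
import math


def limit_speed(v, v_max=1.0):
    """Rescale v to length v_max if it is faster; its direction is preserved."""
    speed = math.hypot(v[0], v[1])
    if speed <= v_max:
        return v
    scale = v_max / speed
    return (v[0] * scale, v[1] * scale)


def velocity_command(v_soc, v_mig, v_max=1.0):
    """Sum the social and migration terms, then saturate the resulting speed."""
    return limit_speed((v_soc[0] + v_mig[0], v_soc[1] + v_mig[1]), v_max)


# A raw command of length 5 is scaled down to unit length.
cmd = velocity_command((3.0, 0.0), (0.0, 4.0))
# A slow command passes through unchanged.
slow = velocity_command((0.1, 0.0), (0.2, 0.0))
```

Saturating the norm rather than clipping each component separately keeps the commanded heading intact, which matters for the velocity alignment of the swarm.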

1) Separation and cohesion
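Cohesion and collision avoidance can be achieved with an attractive/repulsive potential that keeps the agents at an equilibrium distance (Fig. 1). The cohesion term keeps the swarm together by attracting agents to the average position of their neighbors. The separation term leads to collision avoidance by repulsing nearby agents from each other. We can express these rules more formally as
$$\mathbf{v}_i^{\mathrm{soc}} = \mathbf{v}_i^{\mathrm{sep}} + \mathbf{v}_i^{\mathrm{coh}}, \qquad \mathbf{v}_i^{\mathrm{sep}} = -k_{\mathrm{sep}} \sum_{j \in \mathcal{N}_i} \frac{\mathbf{r}_{ij}}{\|\mathbf{r}_{ij}\|^2}, \qquad \mathbf{v}_i^{\mathrm{coh}} = \frac{k_{\mathrm{coh}}}{|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} \mathbf{r}_{ij} \tag{3}$$
where $\mathcal{N}_i$ denotes the set of neighbors of agent $i$ (Sec. II-C), and $k_{\mathrm{coh}}$ and $k_{\mathrm{sep}}$ are gains that regulate the strength of the attraction and repulsion, respectively. Note that we do not scale the separation velocity command by the number of agents. This formulation has the advantage that minimum inter-agent distances remain quasi-constant as the group size increases and thus reduces the need for readjusting the control gains (Fig. 2). We further use the analytical solution to the above equations for three agents as a first approximation of the desired inter-agent distance $d_{\mathrm{ref}}$. For three agents in an equilateral triangle with side length $d$, each agent has two neighbors and the social term vanishes when $k_{\mathrm{coh}}/2 = k_{\mathrm{sep}}/d^2$.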
This allows us to express an approximate reference distance by using a separation gain of the form $k_{\mathrm{sep}} = (d_{\mathrm{ref}})^2 / 2\ \mathrm{m\,s^{-1}}$ and keeping the cohesion gain fixed at $k_{\mathrm{coh}} = 1\ \mathrm{m\,s^{-1}}$. Note that, in general, this choice of separation gain yields inter-agent distances slightly above the reference distance for larger swarms since it does not take the number of neighbors into account. It is nevertheless a useful approximation that spares us the tedious task of finding the reference distance empirically for each swarm size separately.
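The relation between the separation gain and the reference distance can be checked numerically. The sketch below assumes a separation term of the form $-k_{\mathrm{sep}} \sum_j \mathbf{r}_{ij} / d_{ij}^2$ (not scaled by the number of neighbors) and a cohesion term of the form $(k_{\mathrm{coh}} / |\mathcal{N}_i|) \sum_j \mathbf{r}_{ij}$; the function name and data layout are hypothetical.

```python
import math


def social_velocity(p_i, neighbors, d_ref=1.0, k_coh=1.0):
    """Separation plus cohesion velocity of one agent (assumed functional form)."""
    if not neighbors:
        return (0.0, 0.0)
    k_sep = d_ref ** 2 / 2.0  # gain derived from the three-agent solution
    sep_x = sep_y = coh_x = coh_y = 0.0
    for p_j in neighbors:
        rx, ry = p_j[0] - p_i[0], p_j[1] - p_i[1]
        d2 = rx * rx + ry * ry
        sep_x -= k_sep * rx / d2  # repulsion, magnitude proportional to 1/d
        sep_y -= k_sep * ry / d2
        coh_x += rx               # attraction towards the neighbor centroid
        coh_y += ry
    n = len(neighbors)
    return (sep_x + k_coh * coh_x / n, sep_y + k_coh * coh_y / n)


# Equilateral triangle with side length d_ref = 1 m: the social term vanishes.
triangle = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3.0) / 2.0)]
v0 = social_velocity(triangle[0], triangle[1:])
```

For three agents the command is zero at exactly the reference distance. With more neighbors the unscaled repulsion accumulates, which is consistent with the slightly larger equilibrium distances reported for larger swarms.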

2) Migration
The purpose of the migration term is to give the agents a navigation goal by steering them towards a waypoint. The migration term can be written as
$$\mathbf{v}_i^{\mathrm{mig}} = k_{\mathrm{mig}} \frac{\mathbf{r}^{\mathrm{mig}}}{\|\mathbf{r}^{\mathrm{mig}}\|} \tag{4}$$
where $\mathbf{r}^{\mathrm{mig}}$ denotes the relative position of the migration point with respect to the focal agent, and $k_{\mathrm{mig}}$ the gain for modulating the migration speed.

FIGURE 3. Schematic visualization of different neighbor selection strategies: metric, visual, topological, and voronoi. We take the perspective of a focal agent within a swarm (central red disk) that selects agents (blue disks) and discards others (gray disks) depending on the following selection criteria: (a) metric selects all agents within a metric perception radius, (b) visual selects all visible agents within a metric radius, i.e., all agents that appear large enough and are not occluded by others, assuming agents are equally sized and have an omnidirectional camera at their center, (c) topological selects only the n closest agents (here n = 6), irrespective of their distance, and (d) voronoi selects only those agents that belong to a neighboring Voronoi region.

C. NEIGHBOR SELECTION
Neighbor selection is an important consideration for all flocking algorithms since it introduces the notion of locality (e.g., in communication, perception, etc.) as opposed to all-to-all information transfer. In the following, we denote the neighbors of agent $i$ as a set $\mathcal{N}_i$ where $\mathcal{N}_i \subseteq \mathcal{A}_i$.

1) Metric: distance-based neighbor selection
Metric neighbor selection keeps only those agents that fall within a radius $r_{\max}$ centered around the focal agent (Fig. 3a). We can formalize metric neighbor selection as the set
$$\mathcal{N}_i^{\mathrm{metric}} = \{ j \in \mathcal{A}_i \mid d_{ij} < r_{\max} \}$$
where $r_{\max}$ denotes the maximum perception range. Defining the set of neighbors based on a metric range is the most popular means of neighbor selection in the literature [29], [38]-[40]. Metric neighbor selection is a simple and effective method to introduce locality in the interactions and can be interpreted as a perception radius for vision-based swarms or a communication range for swarms that can exchange information via wireless links, for example. With the assumption that all agents are homogeneous and equally sized, we can use the metric perception range to represent visual acuity, i.e., the minimum size that another agent spans on the retina of the focal agent before it can no longer be perceived. We therefore encourage the reader to think about the perception range as the equivalent of the minimum subtended angle that another agent spans on the retina of the focal agent.
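Metric selection amounts to a distance threshold. A minimal Python sketch, with a hypothetical function name and positions given as a list of planar tuples, could look as follows.

```python
import math


def metric_neighbors(positions, i, r_max=10.0):
    """Indices of all agents closer than r_max to agent i, occluded or not."""
    p_i = positions[i]
    return [j for j, p_j in enumerate(positions)
            if j != i and math.dist(p_i, p_j) < r_max]


# Agent 1 is 5 m away and is selected; agent 2 is 20 m away and is discarded.
positions = [(0.0, 0.0), (3.0, 4.0), (20.0, 0.0)]
selected = metric_neighbors(positions, 0)
```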

2) Visual: occlusion-based neighbor selection
Visual neighbor selection keeps only those agents that appear large enough and are not occluded by closer ones as seen from the perspective of the focal agent (Fig. 3b). The set of visible agents can be formalized in terms of $\mathbf{u}_{ij} = \mathbf{r}_{ij}/d_{ij}$ and $\hat{r}_{ij} = r/d_{ij}$, which are the projections of the agent position and radius onto the unit circle, respectively. Note that by combining metric and visual neighbor selection, we obtain a model of visibility that takes into account both visual acuity and occlusions. We consider this model plausible for vision-based swarms since it captures the information that is de facto available to an individual that operates purely on visual perception.
The above definition of visibility contains two key assumptions. The first assumption is that agents can distinguish individuals from each other. Note that this assumption does not require identities to be maintained over time. The second assumption is that partially occluded agents are considered invisible, i.e., only the closest set of agents with an uninterrupted line of sight are contained in the visible set. This assumption is reasonable for monocular vision since the relative distance to other agents can only be reliably estimated if their entire spatial extent is visible.

3) Topological: n-nearest neighbor selection
Topological neighbor selection keeps only the $n$ nearest neighbors of the focal agent (Fig. 3c). We can write the set of nearest neighbors as
$$\mathcal{N}_i^{\mathrm{topo}} = \operatorname*{n\text{-}arg\,min}_{j \in \mathcal{A}_i} d_{ij}$$
where the n-arg min operator selects at most the $n$ nearest neighbors. Topological neighbor selection is a popular method due to its explanatory success in natural swarms [21], [41] and is often used in models of collective motion to maintain group cohesion [30], [40].
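The n-arg min operator can be sketched by sorting the other agents by distance and truncating the list. The function name below is hypothetical.

```python
import math


def topological_neighbors(positions, i, n=6):
    """Indices of at most the n agents nearest to agent i."""
    others = [j for j in range(len(positions)) if j != i]
    others.sort(key=lambda j: math.dist(positions[i], positions[j]))
    return others[:n]


# With n = 2, the agent 5 m away is dropped regardless of any metric radius.
positions = [(0.0, 0.0), (5.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
nearest = topological_neighbors(positions, 0, n=2)
```

Slicing with `[:n]` returns fewer than `n` agents when the swarm is small, which matches the "at most" in the definition.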

4) Voronoi: spatially balanced nearest neighbor selection
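Voronoi neighbor selection keeps only those agents whose Voronoi regions share a border with the focal agent (Fig. 3d). We can write the set of Voronoi neighbors as
$$\mathcal{N}_i^{\mathrm{voronoi}} = \{ j \in \mathcal{A}_i \mid \mathcal{V}_i \cap \mathcal{V}_j \neq \emptyset \}$$
where $\emptyset$ denotes the empty set and $\mathcal{V}_i$ the Voronoi region of agent $i$, which can be defined as
$$\mathcal{V}_i = \{ \mathbf{x} \in \mathbb{R}^m \mid \|\mathbf{x} - \mathbf{p}_i\| \le \|\mathbf{x} - \mathbf{p}_j\| \ \ \forall j \in \mathcal{A}_i \}$$
In other words, the Voronoi region of an agent can be described as the set of all points that are at least as close to that agent as to any other agent.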
Neighbor selection based on the Voronoi tessellation can be seen as topological interactions that are parameter-free and automatically balanced in space [30]. Moreover, it can be shown that the average number of Voronoi neighbors is at most six for the planar case we are considering here [42].
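Voronoi neighbors are exactly the agents connected to the focal agent in the Delaunay triangulation. As a sketch for small swarms, the brute-force search below tests every triangle that contains the focal agent and keeps it if its circumcircle contains no other agent. The function names are hypothetical, the positions are assumed not to be all collinear, and a computational geometry library would normally replace this cubic-time search.

```python
from itertools import combinations


def circumcircle(a, b, c):
    """Center and squared radius of the circle through a, b, c (None if collinear)."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return (ux, uy), (ax - ux) ** 2 + (ay - uy) ** 2


def voronoi_neighbors(positions, i, eps=1e-9):
    """Indices of agents whose Voronoi region borders that of agent i."""
    others = [j for j in range(len(positions)) if j != i]
    neighbors = set()
    for j, k in combinations(others, 2):
        circle = circumcircle(positions[i], positions[j], positions[k])
        if circle is None:
            continue
        (ux, uy), r2 = circle
        # Delaunay criterion: no other agent lies strictly inside the circle.
        if all((positions[m][0] - ux) ** 2 + (positions[m][1] - uy) ** 2 >= r2 - eps
               for m in others if m not in (j, k)):
            neighbors.update((j, k))
    return sorted(neighbors)


# Agent 5 lies behind agent 1 and does not border the region of agent 0.
positions = [(0.0, 0.0), (1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0), (3.0, 0.0)]
adjacent = voronoi_neighbors(positions, 0)
```

Unlike metric or topological selection, this criterion has no tunable parameter: the four surrounding agents are selected and the shielded agent is discarded purely because of the spatial arrangement.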

D. SENSING NOISE
We model the visual relative localization inaccuracies in two independent components: range and bearing. We model range noise as a function that varies linearly with relative distance from the observer whereas the bearing noise is constant over the field of view [15], [18], [43], [44]. More formally, we define the noisy version of range and bearing with which agent $i$ detects agent $j$ as
$$\hat{d}_{ij} = d_{ij} \, (1 + \omega_d), \qquad \hat{\beta}_{ij} = \beta_{ij} + \omega_\beta$$
where $\beta_{ij}$ is the bearing of agent $j$ as seen from agent $i$, and $\omega_d$ and $\omega_\beta$ are independent and identically distributed white noise with zero mean and standard deviation of $\sigma_d$ and $\sigma_\beta$, respectively. The noisy relative position can then be constructed from polar coordinates as
$$\hat{\mathbf{r}}_{ij} = \hat{d}_{ij} \begin{bmatrix} \cos \hat{\beta}_{ij} \\ \sin \hat{\beta}_{ij} \end{bmatrix}$$
where $\hat{\mathbf{r}}_{ij}$ can serve directly as an input to the social term of the flocking algorithm (Eq. 3). The exact values for range and bearing noise depend on several factors such as camera resolution, lens quality, calibration accuracy, and target deformation.
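The noise model can be sketched as a perturbation in polar coordinates. The snippet below assumes Gaussian noise, a range error that is proportional to the true distance, and an additive bearing error; the default standard deviations are hypothetical placeholders rather than values from the experiments.

```python
import math
import random


def noisy_relative_position(r_ij, sigma_d=0.1, sigma_beta=0.02, rng=random):
    """Perturb a relative position with range-proportional and bearing noise."""
    d = math.hypot(r_ij[0], r_ij[1])
    beta = math.atan2(r_ij[1], r_ij[0])
    d_hat = d * (1.0 + rng.gauss(0.0, sigma_d))    # error grows linearly with range
    beta_hat = beta + rng.gauss(0.0, sigma_beta)   # constant over the field of view
    return (d_hat * math.cos(beta_hat), d_hat * math.sin(beta_hat))


# With zero noise the measurement reproduces the true relative position.
exact = noisy_relative_position((3.0, 4.0), sigma_d=0.0, sigma_beta=0.0)
# A seeded generator makes noisy measurements reproducible across runs.
noisy = noisy_relative_position((3.0, 4.0), rng=random.Random(42))
```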

III. EXPERIMENTAL SETUP
Before we analyze the experimental results, we briefly describe the metrics that we use to measure the swarm performance (Sec. III-A), the experimental parameters that are used throughout the experiments (Sec. III-B), and the simulation environments (Sec. III-C).

A. PERFORMANCE METRICS
We report our results in terms of several complementary metrics: minimum nearest neighbor distance $d_{\min}$, order $\phi_{\mathrm{order}}$, and union $\phi_{\mathrm{union}}$. These metrics capture whether we have achieved collision-free, aligned, and cohesive collective navigation, respectively. The following metrics are computed at every discrete time step $k$ and we therefore omit the time dependence for notational brevity. The minimum nearest neighbor distance is arguably the most important metric since it captures whether or not the agents can effectively avoid collisions during migration. It is computed as
$$d_{\min} = \min_{i \in \mathcal{A}} \min_{j \in \mathcal{A}_i} d_{ij}$$
and we say that a collision occurs whenever two agents get closer than twice their radius, i.e., $d_{\min} < 2r$.
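A direct way to evaluate this metric at one time step is to take the minimum over all agent pairs. The following Python sketch uses hypothetical function names and the agent radius of 0.25 m as the default.

```python
import math
from itertools import combinations


def min_nearest_neighbor_distance(positions):
    """Smallest distance between any two agents at one time step."""
    return min(math.dist(p, q) for p, q in combinations(positions, 2))


def has_collision(positions, radius=0.25):
    """True if any two agent disks of the given radius overlap."""
    return min_nearest_neighbor_distance(positions) < 2.0 * radius


# The last two agents are 0.4 m apart, which is below 2r = 0.5 m.
positions = [(0.0, 0.0), (1.0, 0.0), (1.0, 0.4)]
d_min = min_nearest_neighbor_distance(positions)
collided = has_collision(positions)
```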
The order metric measures the correlation of the velocity vectors of the agents within the swarm. It is computed as
$$\phi_{\mathrm{order}} = \frac{1}{N (N - 1)} \sum_{i \in \mathcal{A}} \sum_{j \in \mathcal{A}_i} \frac{\mathbf{v}_i \cdot \mathbf{v}_j}{\|\mathbf{v}_i\| \, \|\mathbf{v}_j\|}$$
An order value of one indicates that all agents are moving in the same direction in perfect alignment, whereas a value around zero means that the swarm is in a completely disordered state in which no two agents align their direction of motion.
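As an illustration, the order of a swarm can be computed as the mean cosine of the angle between the velocities of all agent pairs. This pairwise form is an assumption about the exact definition, and the sketch requires all velocities to be nonzero.

```python
import math
from itertools import combinations


def order(velocities):
    """Mean normalized dot product over all pairs of agent velocities."""
    total, pairs = 0.0, 0
    for (ax, ay), (bx, by) in combinations(velocities, 2):
        total += (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
        pairs += 1
    return total / pairs


# Speeds differ but headings agree, so the swarm is perfectly ordered.
aligned = order([(1.0, 0.0), (2.0, 0.0), (0.5, 0.0)])
# Two perpendicular headings are uncorrelated.
perpendicular = order([(1.0, 0.0), (0.0, 1.0)])
```

Averaging over unordered pairs gives the same value as averaging over ordered pairs, since the dot product is symmetric.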
The union metric measures the cohesion of the swarm and expresses whether the swarm has split into subgroups. It is computed as
$$\phi_{\mathrm{union}} = 1 - \frac{n_{\mathrm{comp}} - 1}{N - 1}$$
where $n_{\mathrm{comp}}$ is the number of connected components of the neighbor adjacency matrix (Sec. II-C). A union value of one indicates that the swarm is moving as a single cohesive unit. A value of zero represents the degenerate situation in which the swarm is split into $N$ subgroups and the agents are unable to perceive any other agent.
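The number of connected components can be obtained with a graph traversal over the adjacency matrix. The sketch below assumes the normalization $1 - (n_{\mathrm{comp}} - 1)/(N - 1)$, which yields one for a single group and zero for $N$ isolated agents, and it treats the sensing graph as undirected (an agent pair is connected if either agent perceives the other). Both choices and the function names are assumptions.

```python
def count_components(adjacency):
    """Number of connected components, treating every edge as undirected."""
    n = len(adjacency)
    seen = set()
    components = 0
    for start in range(n):
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:  # depth-first traversal of one component
            i = stack.pop()
            for j in range(n):
                if j not in seen and (adjacency[i][j] or adjacency[j][i]):
                    seen.add(j)
                    stack.append(j)
    return components


def union(adjacency):
    """Cohesion in [0, 1]: one for a single group, zero for N isolated agents."""
    n = len(adjacency)
    return 1.0 - (count_components(adjacency) - 1) / (n - 1)


# Four agents that have split into two pairs.
two_pairs = [[0, 1, 0, 0],
             [1, 0, 0, 0],
             [0, 0, 0, 1],
             [0, 0, 1, 0]]
phi_union = union(two_pairs)
```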

B. EXPERIMENTAL PARAMETERS
We perform ten repeated runs of migration experiments to make statistical statements about the scalability of the swarm using different neighbor selection methods, group sizes, swarm densities, agent dynamics, and noise levels.
The specific parameter values we use are informed by our previous experiments with real vision-based quadcopters in indoor [14] and outdoor environments [18], as well as the literature on vision-based drone localization [13], [16], [43]-[49]. We choose the radius of an agent as r = 0.25 m since it reflects a common physical size of quadcopter platforms used in robotic experiments. The perception radius r max = 10 m is chosen as the distance at which other drones were no longer reliably detected during outdoor experiments. The time delta ∆t = 100 ms is chosen as a reasonable amount of time to solve the visual perception, state estimation, and control problems in real time. The desired inter-agent distance is set to d ref = 1 m to generate the most compact formation that simultaneously provides enough safety margin against potential collisions.
In order to provide a fair comparison of the visual neighbor selection methods, we choose parameter values that result in comparable numbers of neighbors as the group size increases (Fig. 4d). In particular, we set the maximum number of agents for topological neighbor selection to n = 6 since it reflects the average number of Voronoi neighbors for planar configurations [42]. We further let r max = 2d ref for myopic interactions since it approaches an average number of six neighbors as the group size increases. We provide an overview of the neighbor selection methods used during the experiments in Tab. 1.
Note that metric neighbor selection is not plausible for swarms in which relative localization is vision-based. We include an analysis of metric neighbor selection only for comparison and because it is commonly used in the literature. Conversely, all other visual neighbor selection methods (i.e., visual, visual + myopic, visual + topological, visual + voronoi) are feasible for vision-based swarms since the agents have uninterrupted line of sight.
At the beginning of each experiment, the agents are spawned randomly within a circular region. The initial positions are sampled uniformly in a non-overlapping fashion using rejection sampling such that no pair of agents is closer than the desired reference distance d ref . The area of the circular region is chosen such that the agent number density ρ N remains constant for different numbers of agents. The agents exhibit no motion at the beginning of the experiment, i.e., their initial velocities are set to zero. The agents are given a constant navigation direction r mig = [1, 0] along the horizontal axis which can be seen as a migratory route along the magnetic field [50]. We let the swarm develop its collective motion for a total of T = 200 s composed of 2000 isochronous discrete time steps k with duration ∆t k = 0.1 s. At each time step, the agents select their neighbors according to the indicated neighbor selection function (Fig. 3) and compute their motion command (Sec. II-B). We set the separation and cohesion gains to k sep = 0.5 m s −1 (i.e., k sep = (d ref ) 2 /2) and k coh = 1 m s −1 to provide an approximate nearest neighbor distance of d ref = 1 m. The migration gain is set to k mig = 0.5 m s −1 which provides goal-directed motion without overpowering the attractive/repulsive commands. We set the maximum speed an agent can sustain to v max = 1 m s −1 . A concise overview of the experimental parameters is provided in Tab. 2.
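The initialization procedure can be sketched with a simple rejection sampler. In the Python sketch below, the radius of the spawn region follows from the requested number density, so that the density stays constant as the number of agents grows. The density value, the seed, the retry limit, and the function name are hypothetical.

```python
import math
import random


def sample_initial_positions(n, d_ref=1.0, density=0.2, seed=0, max_tries=100000):
    """Uniformly sample n positions in a disk, at least d_ref apart from each other."""
    rng = random.Random(seed)
    # Disk area n / density keeps the agent number density constant for any n.
    region_radius = math.sqrt(n / (density * math.pi))
    positions = []
    tries = 0
    while len(positions) < n:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("density too high for rejection sampling")
        # The square root of a uniform variate gives uniform density over the disk.
        rad = region_radius * math.sqrt(rng.random())
        ang = rng.uniform(0.0, 2.0 * math.pi)
        candidate = (rad * math.cos(ang), rad * math.sin(ang))
        # Reject candidates that violate the minimum separation.
        if all(math.dist(candidate, p) >= d_ref for p in positions):
            positions.append(candidate)
    return positions


positions = sample_initial_positions(30)
velocities = [(0.0, 0.0)] * len(positions)  # agents start at rest
```

Rejection sampling becomes slow as the density approaches the packing limit of the disks, which is why the sketch caps the number of attempts.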
In order to provide a fair comparison across vastly different group sizes, we compute the metrics over the last quarter of the simulation, i.e. considering only the final 500 time steps. Particularly for large swarm sizes, we avoid computing metrics during an initial transient period in which agents have not yet aggregated to their final configuration. We refer to the time range during which we compute the metrics as the equilibrium period for convenience. We report the minimum nearest neighbor distances as a minimum over time over the equilibrium period since it reveals whether collisions occur. For the order and union metrics, we report time averages over the equilibrium period. The mean and standard deviations are computed over the ten independent runs with random initial conditions.
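The aggregation over the equilibrium period can be sketched as follows: the minimum distance is reduced with a minimum over the last quarter of the time series, whereas order and union are averaged. The function name and the dictionary layout are hypothetical.

```python
from statistics import mean


def equilibrium_summary(d_min_series, order_series, union_series):
    """Summarize per-step metrics over the last quarter of a single run."""
    start = len(d_min_series) - len(d_min_series) // 4
    return {
        "d_min": min(d_min_series[start:]),    # worst case reveals collisions
        "order": mean(order_series[start:]),   # time average
        "union": mean(union_series[start:]),   # time average
    }


# Toy run with eight time steps: only the final two enter the summary.
summary = equilibrium_summary(
    d_min_series=[0.2, 0.3, 1.0, 1.0, 1.0, 1.0, 0.9, 1.1],
    order_series=[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 1.0],
    union_series=[1.0] * 8,
)
```

For a run with 2000 time steps, the slice starts at index 1500 and therefore covers the final 500 steps. The mean and standard deviation across independent runs would then be computed from one such summary per run.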

C. SIMULATION ENVIRONMENTS
We employ two different simulation environments that serve complementary purposes. The simulation environment with point mass dynamics allows us to rapidly prototype algorithms and quickly generate statistical results with up to one thousand agents without running into time or computational constraints.
The Gazebo simulator, on the other hand, provides more physical realism and allows us to obtain an approximation of how an algorithm would behave on real hardware. However, by default, Gazebo, ROS, and PX4 run asynchronously, meaning that messages are exchanged on a best-effort basis given the computational load. To provide a fair comparison at different group sizes, we must ensure that the number of agents does not have any adverse effects on the simulation fidelity by lockstepping all of its software components. In practice, this means we run Gazebo and PX4 in their respective lockstep modes and additionally pause the simulation at each time step, compute the velocity commands for all agents in parallel, and resume the simulation. Unfortunately, even with lockstepping, Gazebo reaches its computational limits at around one hundred agents, after which the real-time factor decreases considerably and spawning additional agents becomes unreliable. We therefore limit our experiments with quadcopter dynamics to one hundred agents.

IV. RESULTS
We report results on four sets of complementary simulation experiments: 1) we compare several neighbor selection methods with increasing numbers of agents to show their performance for different swarm sizes (Sec. IV-A), 2) we evaluate the neighbor selection methods for increasing inter-agent distances to show the effect of varying agent number densities on the swarm performance (Sec. IV-B), 3) we analyze the performance of the neighbor selection methods when they are subjected to increased range noise during relative localization (Sec. IV-C), and 4) we validate the highest-performing neighbor selection method (across group sizes and densities) with quadcopter dynamics and realistic sensing noise to show its performance under real-world conditions (Sec. IV-D).

A. PERFORMANCE ACROSS SWARM SIZES
We assess the performance of the swarm for all neighbor selection methods and six levels of increasing group size N ∈ {3, 10, 30, 100, 300, 1000}. We set the reference distance d ref = 1 m constant throughout the experiments to keep the agent number density fixed and to allow a direct comparison of the effect of group size.

1) Visual neighbor selection
Purely visual neighbor selection shows the overall lowest performance as the group size increases. There is a considerable performance penalty in the distance and order metrics (Fig. 4a and 4b). The minimum distance is tracked well only for a group size of 3 agents (d min = 1.0 ± 0.0 m; Fig. 4a). The distance gradually approaches the collision threshold of 2r = 0.5 m and reaches its minimum at 1000 agents (d min = 0.58 ± 0.0 m; Fig. 4a). The order metric shows a similar trend since the agents start out perfectly ordered for 3 agents (φ order = 1.0 ± 0.0; Fig. 4b). However, for larger group sizes, the order metric decreases monotonically until reaching its minimum at 1000 agents (φ order = 0.87 ± 0.0; Fig. 4b).
The swarm stays cohesive as a single unit across all group sizes (φ union = 1.0 ± 0.0; Fig. 4c). Generally, using visual neighbor selection, the swarm performance decreases as soon as occlusions start to emerge (Fig. 4d). There is no performance penalty for 3 agents using visual neighbor selection since they predominantly settle into equilateral triangle formations in which there are no occlusions (i.e., |N i | = 2). For larger group sizes, an increasing number of agents within the perception radius is occluded (32% occluded for N = 10; up to 90% occluded for N = 1000).
The trajectories of agents using purely visual neighbor selection are subject to frequent directional changes (Fig. 5a). As a result, the agents migrate with considerable deviations from the optimal linear trajectory in the migration direction. In particular, the relative positions of the agents within the swarm are not fixed but rather subject to frequent topology switches. For instance, agents that initially belong to the swarm periphery move towards the swarm center ( Fig. 5a; blue line) and vice versa.
The topology switches can be explained by considering that an agent within the swarm is exposed to constant changes of its neighbor set (Fig. 6). Small agent displacements result in considerable changes of perspective that cause neighbors to appear and disappear from the visible set ( Fig. 6a and 6b: 11 agents appear and 4 disappear, for example). Here, the focal agent is exposed to a total of 32 visibility switches (8 ± 1.22 switches per timestep) over the course of four consecutive seconds of the experiment.
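Counting visibility switches reduces to comparing the visible sets of two consecutive time steps. A minimal Python sketch, with a hypothetical function name and made-up agent indices, is shown below.

```python
def visibility_switches(prev_visible, curr_visible):
    """Agents that appeared in and disappeared from the visible set."""
    prev_set, curr_set = set(prev_visible), set(curr_visible)
    appeared = curr_set - prev_set      # occluded before, visible now
    disappeared = prev_set - curr_set   # visible before, occluded now
    return appeared, disappeared


# Hypothetical visible sets of one focal agent at two consecutive time steps.
appeared, disappeared = visibility_switches([1, 2, 3, 4], [2, 3, 5, 6, 7])
n_switches = len(appeared) + len(disappeared)
```

Each switch adds or removes one term in the sums of the social velocity, which is how small displacements translate into discontinuities of the control command.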

2) Alternatives to purely visual neighbor selection
Neighbor selection based on the Voronoi tessellation shows the highest performance of all neighbor selection methods across group sizes. The minimum distance, order, and union metrics show performance comparable to metric neighbor selection (Fig. 4a, 4b, and 4c). In particular, the minimum distance is tracked even closer to the reference distance of d ref = 1 m for increasing group size (for 1000 agents: d min = 1.13 ± 0.02 m for visual + voronoi and d min = 1.21 ± 0.02 m for metric, for example; Fig. 4a). This can be explained by considering that metric swarms have a significantly larger number of neighbors compared to those based on visual + voronoi neighbor selection for group sizes N > 3 (Fig. 4d). For example, at N = 1000 agents, the metric neighbor set contains around 22 times as many agents as the visual + voronoi neighbor set (on average 11.2 ± 8.5 times the number of neighbors for all group sizes; Fig. 4d). Recall that the flocking algorithm computes the separation term as a sum of reciprocal distances (Eq. 3). Therefore, each neighbor has an additive contribution towards the repulsion (albeit a very small one for distant agents) that explains the slightly larger distances. The agents are perfectly ordered and cohesive for all group sizes (φ order = 1.0 ± 0.0 and φ union = 1.0 ± 0.0, respectively; Fig. 4b and 4c). Qualitatively, the paths taken by visual + voronoi swarms are generally linear and smooth (Fig. 5b). The swarm performs collision-free, ordered, and cohesive collective migration. Switches in the neighbor set do occur but are infrequent and do not lead to unsafe situations or disorder (e.g., changes in neighbor configuration at x ≈ 23 m; Fig. 5b).
Swarms that use visual + myopic or visual + topological neighbor selection do not perform as well as those using visual + voronoi selection for different group sizes. Generally, visual + myopic swarms exhibit low cohesion and easily fragment into several subgroups (Fig. 4c). Fragmentation occurs because agents that exit the perception radius are usually found within small subgroups or entirely isolated due to their limited perception range (see subgroups and isolated agent; Fig. 5c). The fragmentation phenomenon also skews the minimum distance metric towards lower values with large standard deviations compared to other neighbor selection methods (average of d min = 0.82 ± 0.12 m across group sizes; Fig. 4a). This occurs because isolated agents are usually far away from any other agent (see isolated agent; Fig. 5c). We verified that minimum distances to nearest neighbors are usually well-tracked within subgroups of at least three agents. The union metric is always below one (φ union < 1), which indicates that fragmentation occurs for all group sizes (Fig. 4c). Cohesion is lowest for small groups and approaches, but never reaches, a value of φ union = 1 that would indicate a single-unit cohesive swarm (φ union = 0.7 ± 0.25 for N = 3, up to φ union = 0.98 ± 0.0 for N = 1000; Fig. 4c). Note that larger groups exhibit higher union performance since the metric is normalized by group size, i.e., larger groups consist of fewer subgroups relative to the overall group size. Swarms with visual + myopic neighbor selection are effectively ordered (φ order = 1.0 ± 0.0; Fig. 4b). Qualitatively, apart from fragmentation, larger subgroups tend to have irregular shapes that are less circular compared to other neighbor selection methods (see the largest subgroup; Fig. 5c).
FIGURE 6. Visual representation of the switching topologies caused by occlusions during a collective migration experiment. We show the perspective of an arbitrary focal agent (central red disk) over the course of four isochronous time steps t ∈ {1 s, 2 s, 3 s, 4 s}. The focal agent uses visual neighbor selection and therefore perceives only agents within its perception radius that are in a direct line of sight (blue disks), whereas occluded agents are invisible (grey disks). We further highlight visibility switches, i.e., when an agent that has been occluded since the previous time step becomes visible (green disks) and when a previously visible agent becomes occluded (brown disks). A total of 32 visibility switches occur over the course of four seconds.
Swarms that use visual + topological neighbor selection do not exhibit consistent performance across swarm sizes. Especially for intermediate group sizes of 10, 30, and 100 agents, both minimum distances and order metrics suffer a decrease in performance (Fig. 4a and 7b, respectively). For the respective distances and order metrics, the minimum performance occurs at 30 agents (d min = 0.85 ± 0.04 m and φ order = 0.97 ± 0.01; Fig. 4a and 4b, respectively).
We can explain this behavior by considering that agents always select the six closest visible neighbors, irrespective of where they are located. Agents that belong to the swarm center tend to have six neighbors that are spaced around them at approximately equal angles from each other. Conversely, agents on the periphery consider only neighbors in one direction which are subject to occlusions. This leads to similar visual switching topologies as for the purely visual neighbor selection, albeit less severe since even the most distant nearest neighbor for n = 6 is usually in close proximity. The effect of occlusions is mostly mitigated for larger swarm sizes N > 100 since a smaller proportion of agents is located on the periphery relative to the swarm center. We do not observe fragmentation with visual + topological neighbor selection for any group size (φ union = 1.0 ± 0.0; Fig. 4c). Qualitatively, visual + topological interactions generate paths that are not perfectly straight (Fig. 5d). We also observe swarms that exhibit rotations, as well as ones that periodically switch between a set of recurring configurations.
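The one-sided neighborhood of peripheral agents can be reproduced with a toy example. The sketch below selects the six closest agents from a set that is assumed to be fully visible; the grid layout, the function name, and the positions are hypothetical and only serve to illustrate the argument.

```python
import math


def nearest_visible(focal, visible, k=6):
    """Return the k visible positions closest to the focal position."""
    return sorted(visible, key=lambda p: math.dist(focal, p))[:k]


# Hypothetical peripheral agent on the left edge of a small grid-shaped swarm.
focal = (0.0, 0.0)
visible = [(float(x), float(y)) for x in range(4) for y in (-1, 0, 1)
           if (x, y) != (0, 0)]
selected = nearest_visible(focal, visible)

# All selected neighbors lie on one side of the peripheral agent (x >= 0).
one_sided = all(x >= 0.0 for x, _ in selected)
```

For an agent in the swarm center, the same selection would return neighbors spread in all directions, which is the contrast drawn in the paragraph above.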

B. PERFORMANCE ACROSS SWARM DENSITIES
We evaluate the swarm performance for all neighbor selection methods and for five levels of increasing inter-agent distances d ref ∈ {1 m, 2 m, 3 m, 4 m, 5 m}. We let N = 100 to fix the group size and to enable a direct comparison between agent number densities. We define the normalized minimum nearest neighbor distance as d norm = d min /d ref to make the minimum distances more easily comparable for different agent densities.
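As a brief illustration of the normalization, the sketch below computes d norm = d min / d ref for a single snapshot of agent positions. It assumes that d min is the smallest pairwise distance in that snapshot; the function names and positions are hypothetical.

```python
import math
from itertools import combinations


def min_nearest_neighbor_distance(positions):
    """Smallest pairwise distance among all agents in one snapshot."""
    return min(math.dist(p, q) for p, q in combinations(positions, 2))


def normalized_min_distance(positions, d_ref):
    """d_norm = d_min / d_ref, so that d_norm = 1 means the reference is met."""
    return min_nearest_neighbor_distance(positions) / d_ref


# Hypothetical snapshot: three agents spaced 2.2 m apart, reference distance 2 m.
positions = [(0.0, 0.0), (2.2, 0.0), (4.4, 0.0)]
d_norm = normalized_min_distance(positions, d_ref=2.0)  # about 1.1
```

Because d norm is dimensionless, values obtained at d ref = 1 m and d ref = 5 m can be compared directly.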

1) Visual neighbor selection
Purely visual neighbor selection does not show consistent performance for different swarm densities. The performance penalty in distance and order is especially severe for agents in high-density configurations with small reference distances (Fig. 7a and 7b, respectively). The normalized distance is much lower than the desired reference of d norm ≥ 1 and has its minimum for . This indicates that order follows an inverse relationship with the number of visible neighbors: if more agents are visible, the likelihood of visual topology switches that lead to disorder increases (Fig. 6). The neighbor graph also highlights that the effect of occlusions is maximized at intermediate densities. At high densities, the nearest neighbors occlude most agents in all directions (87% occluded for d ref = 1 m; Fig. 7d). Conversely, the effect of occlusions diminishes at lower densities since the agents are not large enough to break the line of sight (5% occluded for d ref = 3 m, for example; Fig. 7d).

2) Alternatives to purely visual neighbor selection
The Voronoi-based neighbor selection provides the highest and most consistent performance across different group densities. The distance, order, and union metrics remain stable for all but the lowest density level (d ref = 5 m), at which interactions are rendered myopic (Fig. 7a, 7b, and 7c; see Sec. IV-B1 for a discussion of myopic interactions). The normalized distance, order, and union remain stable for high and intermediate swarm densities (average d norm = 1.12 ± 0.03, φ order = 1.0 ± 0.0, and φ union = 1.0 ± 0.0; Fig. 7a, 7b, and 7c, respectively).
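To make the notion of adjacent Voronoi regions concrete, the following sketch finds Voronoi neighbors through the dual Delaunay triangulation: two agents share a Voronoi edge if they belong to a triangle whose circumcircle contains no other agent. This brute-force version is written for readability on small point sets and is not the implementation used in the experiments; the function names and positions are hypothetical. In a vision-based setting, the point set would presumably consist of the focal agent and its visible neighbors.

```python
from itertools import combinations


def circumcircle(a, b, c):
    """Return (center_x, center_y, radius_squared), or None if collinear."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ux, uy, (ax - ux) ** 2 + (ay - uy) ** 2


def voronoi_neighbors(points):
    """Map each point index to the indices of its Voronoi neighbors."""
    n = len(points)
    neighbors = {i: set() for i in range(n)}
    for i, j, k in combinations(range(n), 3):
        circle = circumcircle(points[i], points[j], points[k])
        if circle is None:
            continue
        ux, uy, r2 = circle
        # Delaunay criterion: no other point lies strictly inside the circle.
        empty = all(
            (points[m][0] - ux) ** 2 + (points[m][1] - uy) ** 2 >= r2 - 1e-9
            for m in range(n) if m not in (i, j, k)
        )
        if empty:
            for p, q in ((i, j), (j, k), (i, k)):
                neighbors[p].add(q)
                neighbors[q].add(p)
    return neighbors


# Hypothetical layout: focal agent 0, four close agents, one distant agent 5.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0), (3.0, 0.0)]
focal_neighbors = voronoi_neighbors(points)[0]
print(sorted(focal_neighbors))  # [1, 2, 3, 4]
```

The distant agent 5 is not selected because the four closer agents already enclose the focal agent, and no parameter such as a perception range or a neighbor count had to be chosen.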
Swarms with visual + myopic and visual + topological interactions perform poorly compared to visual + voronoi neighbor selection across group densities. The visual + myopic neighbor selection method shows consistently low performance in terms of distance and union metrics (Fig. 7a and 7c). Myopic interactions effectively reduce the negative impact of occlusions. However, they also induce low distances and fragmentation (average d norm = 0.84 ± 0.02 and φ union = 0.97 ± 0.01 across reference distances; Fig. 7a and 7c, respectively). Swarms with visual + topological interactions can avoid the fragmentation issues but their minimum distances fluctuate for different densities (e.g.,

1) Visual neighbor selection
Purely visual neighbor selection shows the overall lowest performance in terms of minimum nearest neighbor distance for all noise levels. Collisions between agents start to occur with a noise level of σ d = 0.3 m (d min = 0.46 ± 0.05 m; Fig. 8a). Purely visual neighbor selection also exhibits the lowest average order in comparison to the other visual neighbor selection methods (myopic, topological, and voronoi). The average order for purely visual neighbor selection follows the same trend as the other visual neighbor selection methods as noise increases, however with a consistently lower average order (difference of about φ order = 0.1; Fig. 8b). Purely visual swarms do not separate into subflocks even as noise increases, as evidenced by their perfect union score (φ union = 1.0 ± 0.0; Fig. 8c). On average, purely visual swarms travel only roughly half as far when subjected to 50% range noise compared to when they operate without noise (d travel = 49.47 ± 0.15 m at σ d = 0.5 m vs. d travel = 98.87 ± 0.29 m at σ d = 0.0 m; Fig. 8d).

2) Alternatives to visual neighbor selection
Overall, the visual + voronoi neighbor selection method shows similar or higher performance scores than the other vision-based alternatives (namely, visual + myopic and visual + topological interactions) across the evaluated metrics and noise levels. Regarding the minimum nearest neighbor distance, visual + voronoi neighbor selection outperforms the vision-based alternatives for all noise levels (highest score d min = 1.10 ± 0.03 m for σ d = 0.0 m and lowest score d min = 0.67 ± 0.04 m for σ d = 0.5 m; Fig. 8a). Generally, the minimum nearest neighbor distances of the vision-based neighbor selection methods show a similar downward trend for increasing noise levels (Fig. 8a). For comparison, metric neighbor selection is much more sensitive to increasing noise levels and has the largest difference between performance scores (maximum d min = 1.15 ± 0.02 m for σ d = 0.0 m and minimum d min = 0.30 ± 0.04 m for σ d = 0.5 m; Fig. 8a). For the order and travel distance metrics, visual + voronoi interactions perform comparably to the other vision-based alternatives (Fig. 8b and Fig. 8d). In terms of the union metric, only visual + myopic interactions break the swarm into subflocks (Fig. 8c). Interestingly, the union metric also increases with higher noise levels, which allow separate subflocks to reunite occasionally (minimum φ union = 0.97 ± 0.02 at σ d = 0.0 m vs. maximum φ union = 0.99 ± 0.01 at σ d = 0.5 m; Fig. 8c).

D. VALIDATION IN REALISTIC CONDITIONS
We finally assess the performance of the most promising visual + voronoi neighbor selection method in more realistic conditions. This is done to evaluate whether the performance transfers to agents with quadcopter dynamics and more realistic sensor noise. Analogous to the previous experiments, we   We replace the single-integrator dynamics (Eq. 1) with a cascaded PID controller [51] that uses the velocity commands from the flocking algorithm as inputs (Eq. 2). We further set the range and bearing noise to σ d = 0.05 m and σ β = 1°, respectively. The specific values are informed by our previous experiments in indoor [14] and outdoor environments [18] and resemble estimates from visual relative localization using object detection with a multi-target state tracker [52] that was specifically tuned for the operating conditions. The exact noise values may be higher if raw observations are used and depend on many factors such as detector performance, camera resolution, and background clutter.
The visual + voronoi neighbor selection method shows comparable performance with point mass agents and quadcopters operating with realistic sensor noise. The swarm performance generally degrades more with increasing reference distances than it does for increasing group size, regardless of the simulation realism. We omit an analysis of the union since the swarms remained cohesive as a single unit during all experiments without exception (φ union = 1.0 ± 0.0). We further omit the neighbor statistics since we did not observe any discernable differences. The only noticeable difference between point mass and quadcopter simulations is the divergence of the average order for decreasing density (φ order = 0.81 ± 0.05 for point mass and φ order = 0.70 ± 0.04 for quadcopter; Fig. 10d). This difference can largely be attributed to the range noise that increases linearly with distance (Eq. 10). The effect of the noise for quadcopter dynamics can also be observed in the slightly lower normalized distances compared to point mass dynamics (Fig. 10a). Interestingly, for 100 agents the more realistic simulation results in slightly larger minimum distances, whereas noise would be expected to decrease them (d min = 1.07 ± 0.04 m for point mass and d min = 1.11 ± 0.05 m for quadcopter; Fig. 10a). However, these effects are too small to be significant and could have occurred due to chance.
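The effect of distance-dependent range noise can be sketched as follows. The exact form of Eq. 10 is not reproduced here; the sketch assumes zero-mean Gaussian noise whose standard deviation equals σ d multiplied by the true distance, which treats σ d as a relative factor. The function name, the seed, and the sample size are hypothetical.

```python
import random
import statistics


def noisy_range(distance, sigma_d, rng):
    """Range measurement whose noise standard deviation grows with distance."""
    return distance + rng.gauss(0.0, sigma_d * distance)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
sigma_d = 0.05          # value quoted in the text, used here as a relative factor

near = [noisy_range(1.0, sigma_d, rng) for _ in range(20000)]
far = [noisy_range(5.0, sigma_d, rng) for _ in range(20000)]

spread_near = statistics.pstdev(near)  # close to 0.05
spread_far = statistics.pstdev(far)    # close to 0.25
```

Under this assumption, measurements of a neighbor at 5 m scatter roughly five times as much as those of a neighbor at 1 m, which is consistent with order degrading more at low densities.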

V. CONCLUSION
Methods for multi-agent coordination often make unrealistic assumptions about the information that is available to the individual agent. One of the most pervasive simplifying assumptions is that vision-based agents can sense the state of all surrounding neighbors within a metric perception radius, even if they are obstructed by closer ones. Here, we break this common assumption and construct a simple yet realistic model of visibility that selects neighbors only if 1) they appear large enough in the field of view, and 2) are not occluded by other agents. Extensive flocking simulations with the visual occlusion model show that perfectly ordered metric-based swarms become disordered and unsafe when agents react to all of their visible neighbors. These adverse effects can be attributed to small perspective changes that continuously influence the set of visible neighbors, thus causing the agents to move in reaction to the new neighbor configuration. We show that this interplay between visibility constraints and collective motion can lead to severe instabilities for vision-based swarms, especially for large numbers of agents and high swarm densities.
Example paths taken by a swarm of thirty agents during a single run of the collective migration experiment using (a) point mass dynamics without noise and (b) quadcopter dynamics with realistic noise. We use the same random seed to create equal initial conditions and highlight an arbitrary focal agent (colored, thick line) to reveal its motion among the other agents (grey, thin lines). The agents start from their initial positions (solid squares) on the left and migrate along the horizontal axis (solid triangles) to the right side of the virtual arena (solid disks). Apart from the effect of noise, there is no discernable qualitative difference between the point mass and quadcopter swarms.
VOLUME 4, 2016. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
Selecting a subset of visible neighbors from adjacent Voronoi regions significantly improves the swarm performance (i.e., collision avoidance, velocity alignment, group cohesion, and travel distance) across group sizes, agent number densities, and noise levels. Controlled experiments with subsets of the visual neighbors show that Voronoi-based interactions are a more effective countermeasure against occlusions than metric and topological ones. The main drawback of metric and topological neighbor selection methods is their dependence on specific parameters, namely the perception range and the number of nearest neighbors, respectively. Choosing favorable values for these parameters that provide high performance at all group sizes and densities may be impossible for vision-based swarms. In particular, swarms that select too many neighbors suffer from the adverse effects of occlusions and selecting too few neighbors inevitably leads to fragmentation. Voronoi-based interactions provide an elegant solution to this problem since they are both parameter-free and spatially balanced [30].
The occlusion model presented here is undoubtedly useful but it neglects two important aspects of vision-based relative localization: errors due to misdetections (false positives and false negatives) and partial occlusions. False positives (i.e., detecting an agent that is not there) and false negatives (i.e., not detecting an agent that is de facto there) inevitably occur in real-world conditions but are notoriously difficult to model. Multi-target filtering algorithms can alleviate errors due to sensing noise and false positive detections to some extent but are largely ineffective against false negatives [18]. Modeling partial visual occlusions is equally challenging; agents that occlude others with a given overlap may themselves be (possibly recursively) occluded by other agents at different locations. Whether multiple partially occluded agents should be detected as a single agent at a closer distance is another modeling choice to consider. The main difficulty is that the distribution of these errors depends not only on the robot's physical appearance and the error distribution of the detection algorithm but also on environmental conditions such as background clutter and lighting. We believe that modeling these factors based on first principles is of limited use due to many arbitrary modeling choices. Future work should therefore systematically characterize misdetections and partial occlusions in a more realistic setting with vision-based detectors that localize physical robots in real images. This characterization could then inform many modeling choices, e.g., temporal and spatial distribution of false positives and overlap thresholds for partial occlusions.
We argue that occlusions should not be neglected when designing algorithms for vision-based swarms. We consider the simple occlusion model presented here (Eq. 6) as a useful drop-in replacement for vision-based flocking algorithms that would otherwise default to purely metric interactions (Eq. 5). Simple agent-based simulations can thus prevent significant hardware damage by considering occlusions early in the algorithm design and before they are implemented on real robots. The validation presented here is specifically geared towards drones but we expect the results to translate well to other types of vision-based robots.
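As an illustration of how such a visibility check might slot into a simulation, the sketch below applies the two criteria summarized in this conclusion: an agent must appear large enough, and it must not be hidden behind a closer agent. It is a deliberately simplified line-of-sight test on disk-shaped agents with distinct positions, not a reproduction of Eq. 6; the function name, the agent radius, and the minimum angular size are hypothetical.

```python
import math


def visible_neighbors(focal, positions, radius=0.25, min_angle=0.05):
    """Indices of agents that are large enough and in direct line of sight.

    radius (m) and min_angle (rad) are hypothetical values for illustration.
    """
    fx, fy = positions[focal]
    visible = []
    for j, (jx, jy) in enumerate(positions):
        if j == focal:
            continue
        dx, dy = jx - fx, jy - fy
        dist = math.hypot(dx, dy)
        # Criterion 1: the agent's disk must subtend a large enough angle.
        angular_size = 2.0 * math.asin(min(1.0, radius / dist))
        if angular_size < min_angle:
            continue
        # Criterion 2: no closer agent may block the ray to the agent's center.
        occluded = False
        for k, (kx, ky) in enumerate(positions):
            if k in (focal, j):
                continue
            # Relative position of agent k along the ray (0 = focal, 1 = agent j).
            t = ((kx - fx) * dx + (ky - fy) * dy) / (dist * dist)
            # Perpendicular distance from agent k's center to the ray.
            offset = abs((kx - fx) * dy - (ky - fy) * dx) / dist
            if 0.0 < t < 1.0 and offset < radius:
                occluded = True
                break
        if not occluded:
            visible.append(j)
    return visible


positions = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.1), (0.0, 1.5), (0.0, -30.0)]
print(visible_neighbors(0, positions))  # [1, 3]
```

In this example, agent 2 is hidden behind agent 1 and agent 4 is too far away to appear large enough, so a purely metric selection would have returned more neighbors than the focal agent could actually see.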