GPU Accelerated Convex Approximations for Fast Multi-Agent Trajectory Optimization

In this paper, we present a computationally efficient trajectory optimizer that can exploit GPUs to jointly compute trajectories of tens of agents in under a second. At the heart of our optimizer is a novel reformulation of the non-convex collision avoidance constraints that reduces the core computation in each iteration to that of solving a large scale, convex, unconstrained Quadratic Program (QP). We also show that the matrix factorization/inverse computation associated with the QP needs to be done only once and can be done offline for a given number of agents. This further simplifies the solution process, effectively reducing it to a problem of evaluating a few matrix-vector products. Moreover, for a large number of agents, this computation can be trivially accelerated on GPUs using existing off-the-shelf libraries. We validate our optimizer's performance on challenging benchmarks and show substantial improvement over state of the art in computation time and trajectory quality.


I. INTRODUCTION
Coordinating multiple agents between given start and goal positions without collision is crucial to any multi-agent application. For highly agile agents like quadrotors and autonomous cars, a popular approach has been to formulate this collision-free coordination as a trajectory optimization problem. There are two core computational challenges in this context. First, the number of variables in the optimization problem increases linearly with the number of agents n. Second and more importantly, the number of pair-wise nonconvex collision avoidance constraints grow by a factor n 2 . Existing works have predominantly explored two classes of simplifications to keep the optimization problem tractable. The sequential approaches, e.g., [1], [2], follow an iterative process wherein motion plans for only one agent is computed at a time. Collision avoidance is ensured by treating agents whose motions were computed in earlier iterations as dynamic non-responsive obstacles for the currently planned agent. On the other hand, the distributed model predictive control (MPC) approaches such as [3], [4] decouple the planning process by allowing each agent to view all the others at any given instant as dynamic obstacles with a known trajectory. The sequential approaches do not leverage the cooperation between the agents while it is incorporated only implicitly (through trajectory prediction) in the distributed MPC based approaches. As a result, both class of simplifications have access to a smaller feasible space and are thus conservative.
All authors are with the Institute of Technology, University of Tartu. The work was supported in part by the European Social Fund through IT Academy program in Estonia, smart specialization project with BOLT and Estonian Centre of Excellence in IT (EXCITE) funded by the European Regional Development Fund.
In this paper, we explicitly account for inter-agent cooperation by adopting the classical set-up of [5], wherein a large scale optimization is formulated to compute the trajectories of all the agents jointly. The primary goal of this paper is to improve the computational tractability of such large scale optimization problems. Our optimizer falls into the class of existing algorithms such as [6] that reformulates the underlying numerical computation of the optimizer to parallelize them over CPUs/GPUs.

A. Main Idea
Consider the following unconstrained, quadratic program (QP) for a constant matrix Q and a vector q. As shown, the solution process reduces to solving a set of linear equations.
Now, imagine that QP (1) needs to be solved for several instantiations of q for a given Q. Such scenarios are common in linear MPC, wherein Q encodes the system dynamics and cost functions and is thus constant, while q that encodes the initial condition changes at each iteration. As shown in [7], an efficient way of handling such scenarios is to pre-compute the inverse (or just the factorization) of Q, in which case, MPC computation (or solving (1)) in each iteration reduces to evaluating just matrix-vector products. This reduction becomes particularly important in cases where the QP (1) is formulated over tens of hundreds of variables.

B. Contributions
For the first time, we show how the idea of off-line caching of matrix inverses can be used to accelerate non-linear and nonconvex, multi-agent trajectory optimization problems. This is achieved by deriving the multi-agent version of the collision avoidance constraints presented in our prior work [8]. Subsequently, we use Alternating Minimization [9] to reduce the trajectory optimization into smaller sub-problems wherein the most computationally intensive block has a structure similar to (1). It is important to note that although QP based approaches are exceedingly common in multi-agent trajectory optimization, their mathematical structure precludes leveraging pre-computed matrix inverses/factorization. (see Remark 3 in Section III-B). Our optimizer provides the following benefits over the current state of the art. Ease of Implementation and GPU Accelerations: The entire numerical computation of our optimizer reduces to computing either element-wise operations over vectors or matrixvector products. For a large number of agents, these can be trivially accelerated on GPUs using libraries like CUPY [10] and JAX [11]. We also provide an open source implementation in https://github.com/arunkumar-singh/ GPU-Multi-Agent-Traj-Opt. Our GPU accelerated optimizer can compute trajectories for 32 agents in 0.7 s on a RTX-2080 enabled desktop computer. State of the Art Performance: Our optimizer outperforms the computation time of joint trajectory optimization of [5] by several orders of magnitude while achieving trajectories of similar quality. It also outperforms the current state of the art, sequential approach of [2], by obtaining shorter trajectories in all the considered benchmarks. Even more importantly, our optimizer also outperforms [2] in terms of the computation time on several benchmarks. The speed-up is particularly impressive given that our optimizer performs a much more rigorous joint search over the agents' trajectory space. Suitability on Edge-Devices: Our optimizer can compute trajectories of 16 agents in around 2s on Nvidia Jetson-TX2. This is orders of magnitude faster than the computation time of [5] on an Intel i7 desktop computer with 32GB RAM. Thus, our work presents an important step towards achieving complex onboard decision-making abilities for light-weight quadrotors.

II. BACKGROUND AND PRELIMINARIES
This section introduces some necessary mathematical preliminaries and uses them to draw a contrast between existing works and our optimizer. We begin by summarizing next the basic symbols and notations used throughout the paper.

A. Symbols and Notations
We will use lower case normal font letters to represent scalars, while bold font variants represent vectors. Matrices are represented through upper case bold fonts. The time dependency of the variable is shown by t. The superscript T will denote the transpose of vectors and matrices. The left superscript k will be used to indicate the iteration index in a trajectory optimizer. We use subscript i, j as agent index. We will use n, m to represent the number of agents and planning horizon throughout the paper.

B. Quadratic Inequalities and Conservative Convex Bounds
For spheroid agents with dimension l xy , l z , the inter-agent collision avoidance constraints take the following non-convex quadratic inequality form.
where, (x i (t), y i (t), z i (t)) represent the position of the i th at some time t. Interestingly, a simple linearization of (2) around any arbitrary guess trajectory leads to a convex but conservative approximation of the feasible space [12], [13]. This simplification has been exploited in many recent works on multi-agent trajectory optimization such as [5], [1]. However, the conservativeness often leads to an infeasible optimization problem even when a solution exists. To sidestep this bottleneck, [1] relaxes the trajectory optimization by only incrementally enforcing the collision avoidance constraints. Our optimizer induces convexity in a uniquely different way based on our prior work [14], [15], [16]. Instead of relying on linearization, it breaks down the problem into smaller sub-problems where all but one are convex optimization problems. Moreover, the non-convex sub-problem has a geometrical structure that allows for obtaining an approximate analytical solution.
C. Joint Trajectory Optimization of [5] Reference [5] employs a sequential convex programming (SCP) based optimizer on the multi-agent trajectory optimization problem. Let ( k x i (t), k y i (t), k z i (t)) be the solution guess at iteration k. Then, the SCP of [5] solves the following QP at iteration k + 1.
. The constraints (3b) force the initial and final boundary conditions of the agents to lie in the set C boundary . The main complexity of QP (3a)-(3c) stems from the fact the number of affine inequality constraints increases by a factor of n 2 . As mentioned earlier, works like [1] bypass this intractability by adopting a sequential approach.
Similarly to [5], our optimizer also reduces to solving QPs over the joint trajectory space of all the agents. However, it does not have any inequality constraints. Furthermore, the most expensive part of the solution process can be precomputed for a given number of agents.

D. GPU Acceleration Through Gradient Descent
An effective way of accelerating optimization problems on GPU is to reformulate them in an unconstrained form and then apply the method of Gradient Descent. For example, [17] achieves this for the multi-agent trajectory optimization by augmenting the constraints (3b)-(3c) as penalties in the cost function (3a).
The core computations in Gradient Descent reduces to computing matrix-vector products which can be readily accelerated on GPUs. Reference [17] essentially exploits this feature and also brings in additional innovation in terms of parallelizing collision checks to further improve the computation time.
Our optimizer provides substantial improvements over [17]. As mentioned by the authors themselves, [17] requires extensive hyper-parameter tuning that is likely to be redone if the problem parameters such as robot dimension change. In contrast, the proposed optimizer relies on accelerating a QP on GPU by offline caching of matrix inverses and worked with trivial default parameters on dozens of examples.

III. MAIN RESULTS
In this section, we present our main theoretical result: a GPU accelerated optimizer based on convex optimization. We begin by reiterating the main assumptions.
• We consider agents with decoupled affine motion models along the (x, y, z) axis. We also assume differential flatness which allows us to extract control inputs from position derivatives. This is typical of holonomic agents like quadrotors [5], [1]. Even autonomous cars can be modeled in this form under some conditions [18]. • The agents are modeled as spheroids (or disks in 2D). • We do not explicitly consider the bounds on velocities and accelerations and rely on choosing an appropriate traversal time and regularization on accelerations to ensure the same. However, it is possible to reformulate bounds as quadratic penalties and incorporate within the optimizer without disturbing its computational structure (see eqn (18) in [19]). Alternately, like [2], we can also scale the traversal time during post-processing to satisfy the bounds.
The primary changes involve introducing additional timedependent variables α ij (t), β ij (t), d ij (t), and using them to rephrase quadratic collision avoidance constraints (2) into a set of non-linear equalities (5). On the surface, our formulation (4a)-(4d) looks more complicated than the more conventional multi-agent trajectory optimization (3a)-(3c) as the former involves highly non-linear trigonometric functions. But in fact, (4a)-(4d) has some hidden geometrical and computational structures that we can expose using techniques from Alternating Minimization (AM). To this end, we first create an augmented cost function L by incorporating f c as l 2 penalties In (6a), ρ is a scalar constant and λ xij (t), λ yij (t), λ zij (t) are time-dependent Lagrange multipliers that can be used to drive the residual of f c to zero. Algorithm 1 summarizes the minimization of (6a) subject to (4b) and (4d) based on the AM technique. As before, the left superscript k represents the respective variable at iteration k. As shown, we start with an initialization for and optimize the variables in a sequence. The optimizations (15a)-(15c) and (15f) are convex constrained QPs. In contrast, (15d) is non-convex. But interestingly, simple geometrical intuitions can be used to derive an analytical solution for it. We delve deeper into each of these optimizations next.
B. Analysis of Algorithm 1 1) Steps (15a)-(15c) : The most important feature of these optimizations is that each of them takes the form of a QP wherein the matrices do not change over iteration. To validate this assertion and to show how it is useful, we parametrize x i (t) and its derivatives in the following form.
where, P is a matrix formed with time-dependent basis functions (e.g polynomials) and c xi are the coefficients associated with the basis functions. Let c x be the joint coefficient formed by stacking c xi for all the agents. Now, with the help of (7), we can derive the following matrix representation for optimization (15a) Note, how only the vector k b x fc is shown to be dependent on the iteration index k. The various matrices and vectors involved in (8a)-(8b) are derived in the following manner.
The matrix A is formed by stacking the first and last row of P and its derivatives, and b x eq is formed by stacking initial and final position, velocities and accelerations. Similarly, we obtain the following matrix representation.
In (13), the operator (.) ×n−r vertically stacks the matrix P n − r times. Similarly, the block-diagonal matrix on the r.h.s is formed by (n−r) number of matrix P. In (14), k d, k β, k α are formed by stacking the respective variables at all times and for all the agents. Similar construction is also followed for k λ xij . Reduction to Linear Equations: The equality constrained QP (8a)-(8b) can be reduced to a problem of solving a set of linear equations [7].
Remark 1. For a given ρ, the matrix on the l.h.s of (16) is independent of the iteration index k. Thus, its inverse can be pre-computed and used without any computational cost in each iteration of Algorithm 1.
Remark 2. For a given ρ and the number of agents n, the matrix on the l.h.s of (16) only depends on our choice of trajectory parametrization P. Thus, the same pre-computed inverse can be used to solve trajectory optimization for any variations of start and goal positions as long as the number of agents and trajectory parametrization remains the same.

Remark 3.
It is possible to reformulate optimization (3a)-(3c) used in works like [5], [1], [3] in the form of equality constrained QP and subsequently reduce it a problem of linear equation solving. However, it will not be possible to pre-compute the matrix inverses in this approach because the matrix k A ij in (3c) will change at each iteration of the sequential convex programming optimizer.
Per-Iteration Complexity: Let, n v be the number of decision variables of each agent (columns of P). Let n b be the number of boundary conditions (row of A) for each agent along each motion axis. Then the per-iteration complexity of step (15a) or (16) is dominated by two large matrix-vector products. The first product involves multiplying A T fc with dimensions (nn v × m n 2 ) with vector k b x fc of m n 2 × 1, where we recall n, m to be the number of agents and length of the planning horizon respectively. The complexity of second matrix-vector product, Q −1 x q x is O(n 2 (n b + n v ) 2 ) and follows similar reasoning. However, it should be noted that these are worst-case complexities without the GPU parallelization. For example, theoretically, A T fc k b x fc can be split into nn v parallel computations. However, in practice, the speed-up through parallelization depends on the number of available GPU cores and other hardware limitations such as speed of CPU-GPU transfer. The above analysis drawn for (15a) can be trivially extended to steps (15b)-(15c) as well. 2) Step (15d): Although optimization (15d) is nonconvex, an approximate solution can be derived using simple geometrical intuition. Recall (5) to note that the set of feasible (x i (t)−x j (t)), (y i (t)−y j (t)) and (z i (t)−z j (t)) constitute a spheroid centered at origin with dimensions l xy d ij and l z d ij . Thus, we compute k+1 α ij (t) and k+1 β i j(t) by projecting k+1 x i (t), k+1 y i (t), k+1 z i (t) obtained from steps (15a)-(15c) onto the requisite spheroid through equations (15e). The projection process is also illustrated in Fig.1. The projection satisfies the constraints on α ij (t) by construction. For β ij (t), we simply clip the values to [0, π]. 3) Step 15f: For a given pair of agents (i, j), d ij (t) at different time instants are decoupled from each other. Similarly, they are also decoupled across agent pairs. Thus, (15f) splits into m * n 2 decoupled single-variable convex QPs, each of which can be solved symbolically. That is the solutions are available as analytical formulae that we can evaluate using just element wise operations over vectors. The constraints on d ij (t) are ensured by simply clipping the values to [0 1] at each iteration.
Step (15e) projects this point onto a spheroid centered at origin. The projected point is shown in black and the center of the spheroid is shown in blue. Algorithm 1 Alternating Minimization based Multi-Agent Trajectory Optimization 1: Initialize k d ij (t), k α ij (t), k β ij at k = 0 2: while k ≤ maxiter or till norm of the residuals are below some threshold do k+1 xi(t) = arg min 3: end while

A. Implementation Details
We implemented Algorithm 1 in Python using CUPY [10] and JAX [11] to accelerate linear algebra on GPUs. Specifically, we use CUPY for trajectory optimization up to 32 agents, while JAX was used for a higher number of agents.
Computing hardware consisted of a i7-8750 32 GB RAM desktop computer with RTX 2080 (8GB) and Nividia Jetson TX2. We pre-computed the inverse of Q x in (16) for 10 different increasing values of ρ instead of just one. The higher values were used in the latter iterations of Algorithm 1. This is inspired by [21] that advocates using adaptive ρ to speed up the convergence of optimizer such as Algorithm 1. It is worth reiterating that for a given number of agents, these inverses are computed only once and can be subsequently used to optimize trajectories from arbitrary start and goal positions. For the ease of implementation, we considered agents as spheres rather than spheroids. Along similar lines, we constructed the static obstacles' circumscribing sphere to incorporate them within our optimizer. For comparison with [5] and [2], we used the open-source implementation and data-set provided by the latter. For a fair comparison, we did not include any hard bounds on position, velocities, and accelerations on the implementation of [5]. This led to some reduction in the number of inequality constraints. Similarly, we also changed the so called "downwash" parameter in [2] to 1 to conform with the implementation of our optimizer. Since trajectories obtained with our optimizer and [5], [2] are at different time scales, we used the second-order finitedifference of the position as a proxy for comparing the accelerations across the three methods. Fig. 2(a)-2(c) show the typical benchmarks employed in our experiments. In Fig.2(a), the agents are placed in a square and are required to navigate to their antipodal positions. In Fig. 2(b), the start and goal positions are sampled randomly. Fig. 2(c) repeats the previous benchmark by adding random static obstacles. Each of the benchmarks was evaluated with a different number of agents with a diverse range of radii. A key metric for validating Algorithm 1 is the trend in the residuals of f c (recall (4c)) over iterations. It should converge to zero in a diverse set of benchmarks to ensure that the optimizer reliably computes at a collision-free trajectory. Fig.3(a) provides this empirical validation. The plots show the mean and one standard deviation of the residual trajectory obtained across 20 different problem instances. On average, 150 iterations were sufficient to obtain residuals in the order of 10 −2 . To be more precise, we sampled random initial and final positions in a square room of varying lengths to create problem instances of varying complexity levels. As can be Fig. 4. Fig. (a) shows computation time for a varying number of agents for benchmarks where we sample start and goal positions from a square with varying lengths. Fig. (b) shows the linear scaling of computation time with obstacles for a given number of agents. Fig. 5. Comparison of our optimizer with [5] in terms of arc-length and smoothness of the obtained trajectories. The arc-lengths are similar across both the approaches but our optimizer achieves smoother trajectories. Note that the smoothness cost is computed as the norm of the second-order finitedifference of the position of the respective agents at different time instants. seen, even in the most challenging instance, our optimizer could compute trajectories for 32 agents in around 1s. Fig. 4(b) shows the computation time for 16 agents with different number of static obstacles. Our optimizer shows an almost linear scaling in computation time. This is because incorporation of static obstacles only affects the computation cost of obtaining A T fc k b x fc in (16). Furthermore, the dimensions of both the matrix and the vector increase linearly with the number of obstacles.

D. Comparison with [5]
Fig. 5 presents a comparison of the trajectory quality obtained with our optimizer and [5]. Although both the optimizer converge to different trajectories, the arc-length statistics observed across all the agents are very similar. Furthermore, our optimizer outperforms [5] in terms of trajectory smoothness cost. Table I contrasts the computation time between the two approaches. Our optimizer shows a speed-up of almost 3 orders of magnitude on 8 agent benchmark, and this number shoots up to 60 for the 16 agent benchmark. The computation time trend is not surprising as even state-of-the-art primal-dual interior-point solvers for QP scale cubically with the total number of inequality and equality constraints. Furthermore, the number of inequality constraints stemming from collision avoidance will itself scale as n 2 . As mentioned earlier, our optimizer by-passes this intractability by pre-computing the expensive matrix inverses and parallelizing matrix-vector product on GPUs. It is essential to point out that GPU accelerations of SCP of [5] is a challenging open problem on its own. We conjecture this to be the motivation behind the same authors adopting the gradient descent approach for leveraging GPUs [17].

E. Comparison with [2]
Fig. 6 presents the most important result of this paper, wherein we compare our optimizer with the current state of the art [2]. The cited work adopts a sequential approach but with a batch of agents. It also leverages the parallel QP solving ability of CPLEX [22] on multi-core CPUs. Our optimizer produces trajectories of similar smoothness as [2] but with substantially lower arc-lengths. This trend can be attributed to the reduced feasible space accessible to a sequential approach. More surprisingly, our optimizer also outperforms [2] in terms of computation time on 16 and 32 agent benchmarks. On the 64 agent benchmark, our optimizer is only marginally slower than [2]. We reiterate that it is essential to observe these timings with the context that our optimizer performs a much more rigorous search than [2]. The trends in computation time can be understood in the following manner. For a lower number of agents, the computation time of [2] is dominated by the niche trajectory initialization it leverages from sampling-based planners. Furthermore, for a lower number of agents, the overhead of CPU parallelization is also significant. But these overheads are easily offset by the computation speed-up achieved for a higher number of agents.

F. Performance on Jetson TX2
Table II shows the computation time for a different number of agents for the square benchmark on Nvidia Jetson TX2. The start and goal positions are sampled from a square of length 8m. The timings indicate that our optimizer allows for fast on-board decision making for up to 16 agents. Moreover, even for 32 agents, the computation time is small enough to be useful for practical applications.

V. CONCLUSIONS AND FUTURE WORK
In this paper, we fundamentally improved the scalability of joint multi-agent trajectory optimization. Simultaneously, our algorithm is simple to implement and requires the computation of only a few matrix-vector products and elementwise operation over vectors. We achieved this by leveraging hidden geometrical and convex structures in the problem. We outperformed state of the art in joint and sequential approaches in terms of trajectory quality and computation time. One limitation of our optimizer is that it has a constant computational overhead stemming from loading large scale Fig. 6. Comparisons with current state of the art [2]. Our optimizer outperforms [2] in terms of trajectory arc-lengths. Even more importantly, it also outperforms in terms of computation time on several benchmarks. matrices at run-time. However, this limitation has limited practical impact as it needs to be done only once for a given number of agents.
We are extending our optimizer to work with complicated geometries like an elongated rectangle through multi-circle approximation. Along similar lines, we also aim to extend the results to non-linear agents like autonomous cars by building on bi-convex approximations proposed in our prior work [15].