A GPU Accelerated Dual-Ascent Algorithm for the Multidimensional Assignment Problem in a Multitarget Tracking Application

We develop a Graphics Processing Unit (GPU) accelerated algorithm for the NP-Hard Multi-dimensional Assignment Problem (MAP), suitable for target tracking applications. First, the original MAP formulation with a quadratic objective function is reformulated using a creative linearization technique. This formulation lends itself well to Lagrangian Relaxation, which decomposes the problem into pairwise Linear Assignment Problems (LAPs). These LAPs are solved in parallel, each using a recent GPU-accelerated approach. Next, we propose a dual-ascent scheme for the Lagrange multiplier updates. The advantage of this scheme is that it produces monotonically increasing lower bounds and converges in a fraction of the iterations typically needed by a subgradient method. The dual-ascent technique is also parallelized for the GPU. Finally, we develop a creative gap closure scheme that computes $M$-best LAP solutions for each dimension and finds the shortest path in the resulting staged graph. The algorithm is applied to the Multi-Target Tracking (MTT) problem and tested on datasets of maneuverable targets. Scaling studies are also performed, and we observe that the processing time decreases approximately linearly with the number of GPU devices. The algorithm can efficiently solve, with high accuracy, problems of up to 400 targets in 400 time frames, which corresponds to 25 billion variables.

Note to Practitioners—The Multi-Target Tracking (MTT) problem is a longstanding problem with many variants and solution algorithms. Still, the problem remains challenging, especially when dealing with a large number of targets over many time frames, when solution speed and optimality are concerns. Many problems, including entity resolution, weapon-target assignment, resource allocation, and data association, can be formulated as a MAP.
Our overall algorithm, implemented with GPU acceleration, enables addressing large-dimensioned MAPs, e.g., a large number of observed targets over a long horizon, with around 25 billion variables. To the best of our knowledge, no prior algorithm has tackled data at this scale, for either MAP or MTT.


I. INTRODUCTION
The Multi-dimensional Assignment Problem (MAP) is a generalization of the Linear Assignment Problem (LAP) to more than two dimensions. MAPs have several applications in data association, information fusion, surveillance, natural language processing, and biomedical fields. Examples from such applications include: Multi-Target Tracking [1], Weapon-Target Assignment [2], Resource Allocation [3], Customer Relationship Management [4], Nuclear Fuel Assembly [5], and Entity Resolution [6]. Apart from these applications, some of the most prominent real-life problems are modeled as MAPs, and over time, various solution approaches have been proposed. Variants of MAP, such as the Multi-dimensional Assignment Problem with Decomposable Costs (MDADC), the Multi-Campaign Assignment Problem (MCAP), and the Generalized Multi-dimensional Assignment Problem (GMAP), are also applied to problems with modifications, either in the constraints or in the objective coefficients, depending on the real-life situation.
In this paper, we apply the dual ascent algorithm to find solutions to large multi-target tracking problems. While several approaches have been proposed to formulate and solve the multi-target tracking problem ([7]-[10], [11]), most of these approaches provide approximate solutions. Since MAP is an NP-Hard problem [12], even the exact methods proposed to find optimal solutions are not scalable. This paper presents a novel MAP formulation for the MTT and uses methods from mathematical programming, decomposition, and GPU-accelerated programming to (near-)optimally solve larger problems than has been possible thus far. Specifically, we propose a novel formulation for the multi-dimensional assignment problem resulting from a linearization technique and reformulation. The novelty of this MAP formulation is that it is conducive to decomposition and subsequent parallelization. This formulation is then relaxed using Lagrangian multipliers and written in the dual form, decomposable into subproblems solvable in polynomial time. Then, a regulated dual ascent algorithm is proposed to solve the relaxed problem and find a lower bound. This dual ascent technique demonstrates a monotonic increase in the objective function of the dual, which works better than other subgradient schemes whose lower bound values tend to oscillate and thus take a long time to converge [11], [13]. A feasible solution (upper bound) to the problem is also computed, and a gap is calculated using these upper and lower bound values.
Fig. 1. Tracking problem depicted as a multi-dimensional assignment problem.
We believe this is the first application of dual ascent to the MAP. In addition to the tight bounds provided by the dual ascent algorithm, we propose a gap closure scheme that closes the gap between the upper and lower bounds and finds a provably optimal solution or, at the least, improves the upper bound. This gap closure scheme is based on ranked assignments for adjacent dimensions and finding the shortest path through them.
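The shortest-path step of the gap closure idea can be illustrated with a small sketch. In the actual scheme, the stage nodes would be the $M$-best LAP solutions for each pair of adjacent dimensions, with edge costs derived from the assignment costs; the staged graph below is purely illustrative, searched by dynamic programming:

```python
def staged_shortest_path(edge_costs):
    """Minimum-cost path through a staged graph.

    edge_costs[t][u][v] is the cost of moving from candidate u in stage t
    to candidate v in stage t + 1.  Returns (min_cost, path).
    """
    # Start: every candidate in stage 0 is a zero-cost partial path.
    best = {u: (0.0, [u]) for u in range(len(edge_costs[0]))}
    for costs in edge_costs:
        nxt = {}
        for u, (cost_u, path_u) in best.items():
            for v, w in enumerate(costs[u]):
                cand = (cost_u + w, path_u + [v])
                # Keep only the cheapest way to reach candidate v.
                if v not in nxt or cand[0] < nxt[v][0]:
                    nxt[v] = cand
        best = nxt
    return min(best.values(), key=lambda cp: cp[0])

# Two candidates per stage, three stages:
print(staged_shortest_path([[[1, 5], [2, 3]], [[4, 1], [2, 2]]]))
# → (2.0, [0, 0, 1])
```

Because each stage only connects to the next, the dynamic program visits every edge once, so the cost is linear in the number of edges of the staged graph.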
For our numerical experiments, we apply this algorithm to multi-target tracking datasets that simulate the behavior of maneuverable constant-velocity targets. The objective function in the MAP formulation has triple costs, which capture the global properties of the trajectories and help provide better tracking solutions than the pairwise costs alone. In addition to these methodological and experimental contributions, we developed an efficient GPU-accelerated parallel optimization framework that enables us to handle large problems, up to 25 billion variables. To the best of our knowledge, neither MAP nor MTT problems of this size have been solved to optimality in the past. In summary, the main contributions of this paper are:
• A novel Multi-dimensional Assignment Problem formulation for Multi-Target Tracking, based on a creative linearization and reformulation.
• A regulated dual ascent update scheme for the Lagrangian multipliers, with a monotonic increase in the lower bound (to assess upper bound quality), as well as a gap closure scheme to close the gap between the upper and lower bounds and prove (near-)optimality.
• An efficient and innovative parallelization scheme for Graphics Processing Units (GPUs) that provides much faster solutions to large problems (up to 25 billion variables) than previously possible.

A. Overall Framework and Contribution
The MTT problem is first formulated as a MAP, as shown in Figure 1. Each time frame corresponds to a stage t in a K-dimensional MAP, and each target in that time frame is denoted by a node in stage t of the MAP. The track associated with a target is the assignment of a node over all K stages. Figure 2 presents the components of the overall framework. The remainder of the paper is organized as follows. Section II reviews literature related to our work. In Section III, the formulation of MTT as a MAP is discussed in detail, and the linearized formulation is derived. In Sections IV and V, the Lagrangian relaxation and the dual ascent-based multiplier update scheme are discussed in detail. In Section VI, the GPU-accelerated algorithm is described. Section VII is devoted to a creative gap closure scheme that can prove the optimality of the upper bound solution or find a better solution or the provable optimum. In Section VIII, the experimental test data is presented, and the results are analyzed based on run time and duality gaps. Finally, in Section IX, we present our conclusions.

II. RELATED WORK
The Multi-dimensional Assignment Problem (MAP) was first introduced in [12] and is NP-Hard for the number of dimensions N ≥ 3 [14]. Several exact and approximate algorithms have been proposed to solve MAP. Among the exact algorithms, [12] used a tree-search approach, a variant of a Branch-and-Bound (BnB) algorithm, to solve MAP. Reference [15] provided an exact solution for the axial 3-dimensional Assignment Problem (3AP) using a transformation method and a subgradient method. Reference [16] also employed a BnB technique for a 3AP, with Lagrangian relaxation-based techniques to obtain the lower bounds. After a detailed analysis of the various BnB techniques, [17] determined that a classical BnB technique with the lower bound obtained by the subgradient method provides the best results. Reference [18] developed a depth-first BnB technique in which a Lagrangian relaxation was applied to obtain the lower bound and tested it on problems with 64 to 64,000 binary variables. Given the hardness of the problem, several heuristics have also been proposed to find approximate solutions: GRASP (Greedy Randomized Adaptive Search Procedure) [19], [20] and other local search techniques [21]-[23] have been applied to MAP.
On the other hand, the MTT problem was formulated as a set packing and covering problem in [10], and 0-1 integer programming was applied to find solutions. Reference [24] proposed a Joint Probabilistic Data Association (JPDA) model for the Multi-Hypothesis Tracking (MHT) application. References [8], [9] formulated MTT as a K-Dimensional Assignment Problem (K-DAP) and solved it using a recursive Lagrangian relaxation method: a K-DAP is relaxed to a (K-1)-DAP, which finally trickles down to a two-dimensional assignment problem solvable in polynomial time, and the solution of the (K-1)-DAP is used to build a solution for the K-DAP. While [25] proposed an algorithm to solve a generalized 3D assignment problem, [26] extended this to a generalized S-dimensional (S-D) assignment algorithm for MTT and applied Lagrangian Relaxation. Reference [27] proposed an approach to solving multi-assignment problems using successive one-to-one assignments of decreasing size with modified costs. References [28]-[31] and [32] proposed parallel algorithms for target tracking problems in various domains and applications. Reference [11] provided approximate solutions to the data association problem using dual decomposition and a message-passing algorithm, and [33] proposed a block ICM technique for assignment problems that converges to a local minimum. Moreover, [11], [34] observed that triple costs in the objective function capture the global properties of the tracks and can help in real-time situations when objects disappear for a short period or are occluded.
This paper provides a tight lower bound for MAP using a dual ascent technique and closes the optimality gap between a feasible upper bound and a lower bound using a gap closure scheme. We apply this algorithm to the multi-target tracking problem and achieve optimal solutions to large-sized MTT problems. The triple costs in the objective function help obtain better tracking solutions by capturing the global properties of the trajectories. The Lagrangian relaxed formulation lends itself to decomposing the problem into subproblems, which are solved independently, thus supporting efficient parallelization and testing on large instances (up to 25 billion variables).
In the past, similar linearization techniques have been proposed for formulations with quadratic objective functions. For example, for the Quadratic Assignment Problem (QAP), different linearizations were proposed in [35]-[37] and [38]. Moreover, various approximation algorithms and heuristics were developed for related NP-Hard problems using Lagrangian relaxation techniques. In [39], one such Lagrangian relaxation-based technique was proposed to obtain tighter lower bounds for the QAP. Reference [40] then developed a dual ascent scheme to approach the Linear Programming (LP) relaxation bound of the QAP. Reference [41] used dual projective methods to solve the RLT formulation of the QAP used in [42]. References [43]-[45] and [46] developed dual ascent-based procedures for a reformulation of the QAP. Reference [47] applied RLT to the Generalized Quadratic Assignment Problem (GQAP) and the Crossdock Door Assignment Problem (CDAP). Reference [48] applied a similar linearization technique to the Asymmetric Traveling Salesman Problem, and [49], [50] have also used dual ascent techniques on Lagrangian relaxation models. To the best of our knowledge, the linearization technique in this paper is novel for MAP and, as we will see later, lends itself well to dual ascent.

III. FORMULATING MULTI-TARGET TRACKING AS A MULTI-DIMENSIONAL ASSIGNMENT PROBLEM
The objective of an MTT problem is to detect maneuverable targets (Objects of Interest) in the sensors' field of view (Region of Interest) and estimate the states associated with them, such as position, velocity, and acceleration. The tracking problem can be divided into two main tasks: Association and Estimation. Association links observations and detects the tracks of a target; Estimation statistically filters these linked observations to estimate the states of the targets. In this paper, we focus on the Association task. The goal is to collect sensor measurements to identify and reconstruct the trajectories of the targets. In the region of interest, there may be spurious observations that are considered noise, and there is also a possibility of targets being missed by the sensors. With multiple targets and multiple sensors observing their tracks, there is a high probability of incorrect observations being matched to a target. This situation, referred to as misassociation, can degrade solution quality. To handle this problem, observation data is collected over discrete time intervals, also referred to as time frames.
The MTT problem described above is modeled as a K-dimensional assignment problem in the following way. Let K be the number of time intervals over which the observation data was collected. Each time frame t ∈ {0, . . . , K − 1} corresponds to a stage (dimension) of the MAP. Observations of a target in time frame t are denoted by the nodes in stage t of the MAP. The track associated with a target is represented by an assignment of a node spanning all K stages. Thus, the objective of the MAP is to minimize the sum of the scores associated with each observation across the track. In this paper, the objective coefficients are both pairwise and triplet scores of observations, so as to better model constant-velocity and maneuverable targets, as explained in [11]. The pairwise costs are defined for consecutive stages (t, t + 1), and the triplet costs are defined for three consecutive stages (t, t + 1, t + 2). In Figure 1, a tracking problem is illustrated as a K-dimensional assignment problem: the assignment (across all stages) of node 0 originating in stage $t_0$ translates to the track of target 0 through the discrete time frames up to K − 1. In this paper, the multi-target tracking problem is modeled as a multi-dimensional assignment problem similar to [1] and is rewritten by considering the set of trajectories as a union of edges on all paths. In the next section, the mathematical formulation of MAP and our creative linearization are discussed.
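As a concrete illustration, one plausible way to score observations is to penalize the distance between linked observations in consecutive frames (pairwise) and, for triplets, the deviation from a constant-velocity prediction. This is an assumption for illustration only; the paper's actual scoring model may differ:

```python
import math

def pairwise_cost(z1, z2):
    # Distance between observations in frames t and t + 1: a simple proxy
    # for how plausibly the two observations belong to the same track.
    return math.dist(z1, z2)

def triplet_cost(z1, z2, z3):
    # Deviation from constant velocity over frames t, t + 1, t + 2.
    # A constant-velocity target satisfies z3 - z2 == z2 - z1, so we
    # compare z3 against the prediction 2 * z2 - z1.
    predicted = [2 * b - a for a, b in zip(z1, z2)]
    return math.dist(predicted, z3)

print(pairwise_cost((0, 0), (3, 4)))           # → 5.0
print(triplet_cost((0, 0), (1, 1), (2, 2)))    # → 0.0 (constant velocity)
```

A maneuvering target produces a positive triplet cost even when every pairwise distance is small, which is why triplet scores capture trajectory properties that pairwise scores alone miss.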

A. Mathematical Formulation of MAP
MAP has K stages and N nodes in each stage. The objective is to find the assignment of each node across the stages such that the objective function is minimized. Consider the graph in which the observations in each stage are the nodes and the edges connect nodes in consecutive stages.

1) MAP: Let the binary variable $x^p_{ij}$ indicate that node $i$ in stage $p$ is assigned to node $j$ in stage $p+1$. With pairwise costs $C^p_{ij}$ and triplet costs $D^p_{ijk}$, the MAP is

$$\min \sum_{p}\sum_{i,j} C^p_{ij}\, x^p_{ij} + \sum_{p}\sum_{i,j,k} D^p_{ijk}\, x^p_{ij}\, x^{p+1}_{jk}$$

subject to the uniqueness constraints

$$\sum_{j} x^p_{ij} = 1, \quad \sum_{i} x^p_{ij} = 1, \quad x^p_{ij} \in \{0, 1\}. \quad (2)$$

2) Linearization: A new variable $y$ is introduced as the product of two $x$ variables, linearizing the above formulation. It is expressed as

$$y^p_{ijk} = x^p_{ij}\, x^{p+1}_{jk}, \quad (3)$$

where the nodes $i, j, k$ originate from stages $p, p+1, p+2$. The $y$ variable connects the two $x$ variables from three consecutive stages and depicts the chain of edges formed by them: node $i$ in stage $p$, node $j$ in stage $p+1$, and node $k$ in stage $p+2$. In addition to the uniqueness constraints on $x$ in the MAP formulation, the linearized formulation contains constraints corresponding to the $y$ variables. The non-zero property of $y^p_{ijk}$ is contingent upon the non-zero property of $x^p_{ij}$: for a fixed $(i, j, p)$, the edge $y^p_{ijk}$, for any $k$ in stage $p+2$, exists (is non-zero) if and only if the edge $x^p_{ij}$ exists (is non-zero), as shown in Figure 3(a). This property is expressed mathematically as

$$\sum_{k} y^p_{ijk} = x^p_{ij}.$$

The same holds for the other end of the chain formed by $y^p_{ijk}$: as can be seen in Figure 3(b), $y^p_{ijk}$ exists if and only if $x^{p+1}_{jk}$ is non-zero, which is expressed as

$$\sum_{i} y^p_{ijk} = x^{p+1}_{jk}. \quad (6)$$

The linearized MAP formulation is later relaxed using Lagrangian multipliers. For ease of the multiplier update, which is explained in Section V, Equation (6) is replaced by an equivalent constraint. Consider two $y$ variables, $y^p_{ijk}$ and $y^{p+1}_{jkl}$. For the edges formed by these two variables to be connected as a chain, there needs to be a common edge connecting them, as shown in Figure 3(c); in this case, it is $x^{p+1}_{jk}$, irrespective of the nodes $i$ from stage $p$ and $l$ from stage $p+3$. Mathematically, for a fixed $(j, k, p)$, this is expressed as

$$\sum_{i} y^p_{ijk} = \sum_{l} y^{p+1}_{jkl}.$$
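The linearization can be sanity-checked numerically: for any feasible (permutation) choice of the x matrices, defining y as the product of consecutive x variables automatically satisfies both the endpoint constraints and the chain constraint. A minimal sketch, with arbitrary permutations standing in for real assignments:

```python
import itertools
import random

N = 3
random.seed(0)
# Three consecutive x stages, each a random permutation matrix x[p][i][j].
perms = random.sample(list(itertools.permutations(range(N))), 3)
x = [[[1 if perm[i] == j else 0 for j in range(N)] for i in range(N)]
     for perm in perms]

# y^p_{ijk} = x^p_{ij} * x^{p+1}_{jk}: the linearizing product variable.
y = [[[[x[p][i][j] * x[p + 1][j][k] for k in range(N)]
       for j in range(N)] for i in range(N)] for p in range(2)]

# sum_k y^p_{ijk} = x^p_{ij}: y exists iff its leading x edge exists.
ok_lead = all(sum(y[p][i][j]) == x[p][i][j]
              for p in range(2) for i in range(N) for j in range(N))

# Chain constraint: sum_i y^p_{ijk} = sum_l y^{p+1}_{jkl} = x^{p+1}_{jk}.
ok_chain = all(sum(y[0][i][j][k] for i in range(N))
               == sum(y[1][j][k][l] for l in range(N)) == x[1][j][k]
               for j in range(N) for k in range(N))
```

Both checks hold for every choice of permutation matrices, which is exactly why the chain equality can replace the endpoint constraint (6) without losing feasible solutions.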

B. Mathematical Derivation of the Reformulated MAP (RMAP) Formulation
We will now provide mathematical derivations for the equations written in Section III-A and obtain the reformulated version of MAP, which is referred to as RMAP.
Constraint 1: Consider equation (3). For a node $j$ from stage $p+1$, summing over all $k$ from stage $p+2$ and using the uniqueness constraint $\sum_k x^{p+1}_{jk} = 1$ gives

$$\sum_{k} y^p_{ijk} = x^p_{ij} \sum_{k} x^{p+1}_{jk} = x^p_{ij}.$$

Constraint 2: If two $y$ variables are connected, there is a common $x$ variable between them; this can be understood from Figure 3(c). Consider equation (2), $\sum_i x^p_{ij} = 1$, and multiply both sides by $x^{p+1}_{jk}$:

$$\sum_{i} x^p_{ij}\, x^{p+1}_{jk} = x^{p+1}_{jk}, \quad \text{i.e.,} \quad \sum_{i} y^p_{ijk} = x^{p+1}_{jk}. \quad (9)$$

Equation (2) is rewritten for nodes $k$ from stage $p+2$ and $i$ from stage $p+3$ as $\sum_i x^{p+2}_{ki} = 1$; multiplying both sides by $x^{p+1}_{jk}$ results in

$$\sum_{i} y^{p+1}_{jki} = x^{p+1}_{jk}. \quad (10)$$

Thus, from (9) and (10), equating the right-hand sides, we get

$$\sum_{i} y^p_{ijk} = \sum_{l} y^{p+1}_{jkl}.$$

One important observation is that the edges between the last two stages $(K-2, K-1)$ are not accounted for in the new modified RMAP, so they are explicitly included in (15) to maintain correctness and consistency. With all the constraints in place, the final Reformulated MAP formulation is RMAP.

Proposition 1: The MAP and RMAP formulations are equivalent.
The proof is included in the appendix (supplementary document).

IV. LAGRANGIAN RELAXATION FOR RMAP FORMULATION
The RMAP formulation has a large number of constraints and variables, making it quite difficult to solve. To overcome this challenge, the following steps are undertaken, and in the next few sections we discuss these steps in detail:
• The chain constraint (16) is relaxed with Lagrangian multipliers μ, yielding the relaxed formulation LRMAP.
• The linear relaxation of LRMAP, denoted LPLRMAP, with the binary condition on x relaxed, is derived.
• LPLRMAP is dualized and decomposed into x and y subproblems that can be solved independently.
In the RMAP formulation shown above, the constraint (16) is relaxed and added to the objective function with the corresponding multiplier μ, giving the Lagrangian Relaxation of the RMAP formulation. To save memory, the multipliers of complementary variables ($\mu^p_{jk}$ and $\mu^{p+1}_{kl}$ in this case) are stored as a sum in a single location. Whenever the multiplier $\mu^p_{jk}$ is updated, the corresponding complementary multiplier $\mu^{p+1}_{kl}$, for some l, is adjusted accordingly. This technique has also been used in [43], [51], and other lifted-formulation-based QAP papers, and is explained in detail in Section V.
The Lagrangian relaxation formulation of the MAP after the above modifications is LRMAP. The LP relaxation of lifted formulations has been shown to provide stronger bounds than most other linearized formulations ([45]), which motivates us to consider the LP relaxation of LRMAP for the multi-dimensional assignment problem. The binary condition on x is relaxed, and the resulting LP formulation of the Lagrangian relaxation of RMAP is LPLRMAP.

A. Dualization and Decomposition
To find the tightest lower bound on RMAP, we need to identify a set of multipliers that maximizes LPLRMAP(μ); this is referred to as the Lagrangian Dual problem. The biggest hurdle in typical subgradient schemes is choosing the right step size for the multiplier updates, which usually results in oscillatory (non-monotonic) behavior of the lower bound. In this paper, the step size is regulated through a dual-ascent scheme that guarantees the objective function either increases or, at worst, remains the same.
The above formulation (LPLRMAP) is dualized with α, β, γ, δ as the dual variables corresponding to the constraints (27), (28), (29), and (30), yielding the dual of LPLRMAP.
As discussed in [38], degeneracy in the primal formulations of lifted problems is a reason for using dual-based heuristics. In RMAP, of the (2N + N²)K constraints, only 2N of the constraints for x have a non-zero right-hand-side value. Moreover, to obtain a lower bound, a dual feasible solution suffices; the dual does not have to be solved to optimality. Thus, we can obtain a lower bound with a monitored dual-based heuristic that can be terminated before reaching optimality.
The dual formulation is decomposable into two separate types of subproblems: one pertaining to the x variables and the other to the y variables. The subproblem that corresponds to the x variables is an LAP in its dual form and the y subproblem is a Linear Semi Assignment Problem (LSAP), which is a simple linear program that can be solved by taking the minimum over a small set of inequalities. Thus, the dual formulation can be decomposed into multiple LAPs and LSAPs in their dual formulation and are referred to as x subproblems and y subproblems in the paper.
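The decomposition can be illustrated on a toy instance with K = 3 stages. Each y subproblem is solved in closed form by a minimum over a small set of inequalities (one per k), its value is folded into the pairwise costs, and the x subproblems are then solved as independent LAPs; the sum is a valid lower bound on the coupled problem. Brute-force solvers and random costs stand in for the GPU Hungarian algorithm and the real cost model:

```python
import itertools
import random

random.seed(1)
N = 3
# Random pairwise costs C[p][i][j] (two x stages) and triplet costs D[i][j][k].
C = [[[random.randint(0, 9) for _ in range(N)] for _ in range(N)]
     for _ in range(2)]
D = [[[random.randint(0, 9) for _ in range(N)] for _ in range(N)]
     for _ in range(N)]

def lap(cost):
    # Brute-force linear assignment, a stand-in for the GPU Hungarian solver.
    return min(sum(cost[i][p[i]] for i in range(N))
               for p in itertools.permutations(range(N)))

# y subproblem: closed form, the minimum over k; fold it into pairwise costs.
Cx = [[C[0][i][j] + min(D[i][j]) for j in range(N)] for i in range(N)]
lower_bound = lap(Cx) + lap(C[1])

# Exact optimum, enumerating both stage-to-stage assignments.
opt = min(
    sum(C[0][i][s0[i]] for i in range(N))
    + sum(C[1][j][s1[j]] for j in range(N))
    + sum(D[i][s0[i]][s1[s0[i]]] for i in range(N))
    for s0 in itertools.permutations(range(N))
    for s1 in itertools.permutations(range(N)))

assert lower_bound <= opt
```

The inequality holds because any feasible track incurs at least the minimum triplet cost for each of its edges, so ignoring the chain coupling can only decrease the objective.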
From (34), for the objective to be maximized, $\gamma^p_{ij}$ has to be large. However, by (36), $\gamma^p_{ij}$ is constrained according to the corresponding y subproblem. To satisfy the constraint in (40) and maintain feasibility, we assign $\gamma^p_{ij}$ the largest value permitted by (40), i.e., the minimum over the corresponding inequalities. There are N²(K − 2) y subproblems (40), and for every (i, j, p), the value $\gamma^p_{ij}$ is added to the cost in (34); the x subproblem then becomes an LAP. The x subproblem (43) is an LAP in terms of the dual variables corresponding to the x variables in the LPLRMAP formulation. Since each x subproblem (LAP) involves the dimensions p, p + 1, considering the consecutive pairs, there are K − 1 x LAPs. Similarly, since each y subproblem (LSAP) is formed with the stages p, p + 1, p + 2, there are K − 2 y subproblems. The LAPs can be solved using the GPU-accelerated Hungarian algorithm of [52] and its dynamic counterpart in [45].
We note that the constraints (35) and (37) correspond to the last stages. With these constraints, dual ascent for the multiplier update requires solving a linear program for $\delta_{ij}$ and $\gamma^{K-3}_{ij}$.
To avoid this complication, a dummy stage is added after the last stage; this dummy stage does not require additional multiplier updates. The addition makes constraint (30) redundant for the stages K − 2 and K − 1, which avoids the special handling of $\delta_{ij}$ and $\gamma^{K-3}_{ij}$ in the multiplier update. Therefore, in all the computations, p ranges from 0 to K − 1 in $x^p_{ij}$, and in $y^p_{ijk}$, p takes values from 0 to K − 2. Owing to the addition of the dummy stage to the graph, the total number of x LAPs is now K and the number of y subproblems is N²(K − 1). All the costs corresponding to the dummy stage are zero, and the stage has no significance other than avoiding additional computation.
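The effect of the zero-cost dummy stage can be seen in a tiny sketch: padding the list of pairwise cost matrices with an all-zero stage leaves the bound unchanged, since the dummy LAP contributes exactly zero. The cost values below are illustrative:

```python
import itertools

N = 3

def lap(cost):
    # Brute-force LAP, a stand-in for the GPU-accelerated Hungarian solver.
    return min(sum(cost[i][p[i]] for i in range(N))
               for p in itertools.permutations(range(N)))

stages = [[[2, 7, 1], [4, 3, 6], [5, 8, 2]],
          [[1, 9, 4], [6, 2, 3], [7, 5, 8]]]   # K - 1 pairwise cost matrices
before = sum(lap(c) for c in stages)

stages.append([[0] * N for _ in range(N)])     # zero-cost dummy stage
after = sum(lap(c) for c in stages)            # unchanged: dummy LAP is 0
```

Uniform stage counts simplify the bookkeeping: every multiplier update and every LAP launch follows the same index pattern, with no special case for the final stages.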
The dual formulation with the dummy stage is now as follows. The constraints and the dual variables in (35), (37) can be ignored, as they now correspond to the dummy stage and carry no importance in the problem. This addition ensures that all the stages in the MAP are covered in the formulation and that consistency and correctness are maintained.

Integrality Property: An integer problem has the integrality property if the optimal value of the problem is not altered by dropping the integrality conditions on its variables ([53]).
The proof of the integrality of LRMAP formulation is included in the appendix (supplementary document).
Proposition 2: ν(DLRMAP) =(a) ν(LPLRMAP) =(b) ν(LRMAP) ≤(c) ν(RMAP) =(d) ν(MAP).
(a) This is true because of strong duality in linear programs. If the dual is not solved optimally, this equality becomes "≤" due to weak duality for a minimization problem.
(b) In general, this relationship is "≤". But in this case, since the problem has integrality property, it will convert to "=".
(c) This inequality is due to the properties of Lagrangian Relaxation.
(d) As discussed in Section III-B, this follows from the equivalence of the formulations.
In the next section, the dual ascent procedure and determination of the step size is explained in detail.

V. DUAL ASCENT PROCEDURE AND MULTIPLIER UPDATE
As seen in the previous section, the DLRMAP formulation is decomposable into a y subproblem, which is an LSAP, and an x subproblem, which is an LAP. In this section, we describe the multiplier update scheme and then discuss the algorithm's implementation details. Recall that in the DLRMAP formulation, for a fixed multiplier μ, the objective function is the maximization of the sum of the α and β variables, and α, β are constrained by (34). So, to maximize the objective function, the values of α and β are to be maximized, which implies the maximization of the γ variables with respect to constraint (36). We first solve for the maximum value of γ with respect to constraint (36). The value of $\gamma^p_{ij}$ is transferred to the respective (i, j, p) x subproblem and is added to the $C^p_{ij}$ costs on the right-hand side of the constraint. Now, (33), (34) are in the dual form of an LAP in terms of α, β, and the LAPs are solved to return the values of α, β and the corresponding primal x variables as output. These steps are carried out iteratively, updating the multipliers at each iteration until convergence. The slacks from the x and y subproblems are used to update the multipliers, and there are two phases to the multiplier update process. In this section, we first explain both phases and then provide the implementation details. From here on, all indices i, j, k can be assumed to be in {0, . . . , N − 1}, with p ∈ {0, . . . , K − 1} for x variables and C costs, and p ∈ {0, . . . , K − 2} for y variables and D costs, unless specified otherwise. The multipliers are referred to as μ, and α, β, γ are referred to as dual variables.
As seen in (42), the y subproblem is an LSAP.

A. Phase 1: Multiplier Update
After the y subproblem is solved, we compute the slacks of the constraints in (49); the slack of the constraint corresponding to a fixed (i, j, k, p) is denoted by $\pi^p_{ijk}$. The LRMAP is a min-max problem: we aim to minimize the objective function while maximizing the multipliers. We want to increase the multipliers while maintaining dual feasibility. Since not all the constraints in the y subproblem are active, we use the slack variables to update the multipliers. From (50), a multiplier can be increased by as much as the slack's value, which preserves the feasibility of that constraint. Since $\pi^p_{ijk} \geq 0$, the strategy is to use this slack to increase the value of $\mu^p_{jk}$. However, the same multiplier $\mu^p_{jk}$ appears in N such constraints (one for each i ∈ {0, . . . , N − 1}), so we must update it in a way that violates none of them: we take the minimum of all the slacks with which this multiplier can be updated. To illustrate, let N = 2. For a fixed (j, k, p), the corresponding dual constraints (52), (53) have slacks $\pi^p_{i_1 jk}$ and $\pi^p_{i_2 jk}$. Since $\mu^p_{jk}$ can be increased by at most $\pi^p_{i_1 jk}$ in (52) and at most $\pi^p_{i_2 jk}$ in (53), taking the minimum of the two ensures that feasibility of both constraints is maintained.
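Phase 1 can be sketched in a few lines: the multiplier is raised by the minimum slack over the N constraints it appears in, which keeps every constraint feasible and makes at least one of them tight. The slack values here are hypothetical:

```python
# Slacks pi^p_{ijk} for a fixed (j, k, p), one entry per i (hypothetical).
slacks = {0: 3.0, 1: 1.5, 2: 4.0}
mu = 0.0

# Largest increase that keeps every constraint feasible.
step = min(slacks.values())
mu += step
slacks = {i: s - step for i, s in slacks.items()}

assert all(s >= 0 for s in slacks.values())   # dual feasibility preserved
assert min(slacks.values()) == 0.0            # one constraint becomes tight
```

Because the step is never larger than any slack, the dual objective can only increase or stay flat, which is the source of the monotone lower bound.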

B. Phase 2: Adjusting the Multipliers
In Section IV, we mentioned that the multipliers for the variables $y^{p+1}_{jkl}$ are stored in the same location as those for $y^p_{ijk}$, for some i, j, k, l, p; these variables are referred to as Complementary Variables. Storing the multipliers of the complementary variables together saves memory, and the multipliers are adjusted accordingly in Phase 2, so it does not affect the objective value in any way. In Phase 1, we have only accounted for the multipliers associated with variables of the type $y^p_{ijk}$. To understand this better, consider an example with K = 3. The original objective function is shown below; we focus on the y variables and their coefficients.
The strategy is to store the sum of all the multipliers of the complementary variables in the same location as $\mu^p_{jk}$. With this reduction, dual feasibility is maintained and the multiplier values for the complementary variables are adjusted. In other words, the sum of the multipliers is adjusted using the minimum over the slacks $\pi^p_{ijk}$ and $\pi^{p+1}_{jkl}$ for some i, l and a fixed (j, k, p). This sum is distributed among the multipliers based on percentages. We cover the update and adjustment scheme, with the actual implementation details of this technique, in Section V-D.

C. Multiplier Update From the x Subproblem
Once the x subproblems are solved, the slacks of their constraints are used to update the multipliers. Analogously to the y subproblem, the slack of an x-subproblem constraint for a fixed (i, j, p) bounds the corresponding multiplier update. The relation between the indices of μ and the x-subproblem slacks, and the implementation, are discussed in detail in the next section.

D. Implementation Details
In order to save memory, a technique called "Fast DA" [45] is applied for the dual ascent in this paper. The multipliers, the slack variables, and the y variables are not stored explicitly; instead, all updates are done using the cost variables C, D. The algorithm is divided into three steps: Cost Spreading, Cost Transfer, and Cost Concentration. The costs are directly updated with the slack and multiplier values. The updated C and D costs are denoted Ĉ and D̂, respectively. References [43], [45], [46] have also adapted this idea of adjusting the cost variables to perform dual ascent for the QAP. Although the implementation idea is similar, the QAP and MAP lifted formulations are quite different and thus need different cost update techniques.
1) Cost Spreading: In this step, the Ĉ costs are transferred to the D̂ costs. In the first iteration, only the C costs are transferred, as Ĉ is initialized to C. But once the x subproblems are solved at the end of the first iteration and the Ĉ costs are updated with the slack values, this step represents the transfer of the slack from the x subproblem to the multipliers, as discussed in (56).
2) Cost Transfer: This step represents the multiplier update (both Phases 1 and 2). The sum of the multipliers is adjusted using the minimum over the slacks $\pi^p_{ijk}$ and $\pi^{p+1}_{jkl}$ for some i, l and a fixed (j, k, p). This sum is distributed among the multipliers based on percentages $\kappa_1$, $\kappa_2$, which are chosen experimentally.
For a fixed (j, k, p), we find $i_1$ and $i_2$.
The minimum of the updated cost values is found, and then percentages ($\kappa_1$, $\kappa_2$) of the sum of these minimum costs are distributed between the complementary variable costs $D^p_{i_1 jk}$ and $D^{p+1}_{jk i_2}$. This technique of storing the multipliers saves a significant amount of memory on the GPU (on the order of N²K).

3) Cost Concentration:
The y subproblem is solved with the updated costs. The slacks are computed, and the D̂ cost is then updated to be the slack value. Now, in the cost concentration step, the objective value of the y subproblem is transferred to the Ĉ costs, so all the costs in the problem are concentrated in the Ĉ costs. The K x subproblems are solved with these concentrated costs, and the sum of the solutions of these K LAPs, denoted ν, is added to the lower bound value ν(DLRMAP). The Ĉ costs are then updated to be the slack values of the constraints in the x subproblem. Since, in every iteration, the costs are updated with the slack values, the value of the objective function of the x subproblem diminishes with each iteration, and the lower bound increases monotonically until convergence. The algorithm continues until convergence or until the termination criteria (number of iterations or duality gap) are met. Algorithm 1 presents the dual ascent algorithm without GPU acceleration.

E. Upper Bound Computation
The upper bound is constructed from the DLRMAP solution and is compared to the lower bound to compute the duality gap. The upper bound scheme finds a feasible solution from the intermediate (infeasible) variable values obtained during the algorithm. Once the $x$ LAPs are solved, the $x$ variables are produced as outputs along with the lower bound value. The solutions of the binary $x$ variables for all the $K$ independent LAPs are referred to as the pairwise assignments, because each $x^p_{ij}$ variable provides the solution to an assignment problem between the pair of the $p$-th and $(p+1)$-th stages. Utilizing the solution of the $x$ variables, a feasible solution is obtained by connecting the pairwise assignments ($x$) through the $K$ dimensions. The $y$ variables are then simply derived from the $x$ values. The algorithm terminates when the gap is less than MIN_GAP = 0.25% or the number of iterations exceeds the maximum number of iterations (ITN_LIM), which we set to ITN_LIM = 100.

In Table I, the sizes of the costs associated with the $x$ and $y$ variables (the $C$ and $D$ costs, respectively) are shown, to convey the complexity of the problem in terms of both memory and computational hardness. This advocates the need for an accelerated algorithm, which is supported by advanced GPU architectures designed to implement this level of parallelism in hardware.
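A sketch of this feasibility-recovery step (Python, hypothetical costs; not the paper's implementation): the pairwise permutations returned by the $x$ LAPs are chained stage by stage into $K$-stage tracks, and evaluating the full objective on the chained tracks gives a valid upper bound.

```python
def chain_tracks(perms, n):
    """Connect pairwise assignments (one permutation per stage pair)
    into full K-stage tracks, one track per starting target."""
    tracks = []
    for start in range(n):
        t = [start]
        for sigma in perms:
            t.append(sigma[t[-1]])   # follow the assignment to the next stage
        tracks.append(t)
    return tracks

# Hypothetical pairwise and triplet costs for N = 2, K = 3.
C = [[[1, 4], [3, 2]], [[2, 5], [6, 1]]]
D = [[[0, 2], [1, 0]], [[3, 0], [0, 4]]]

perms = [(0, 1), (0, 1)]          # LAP solutions (identity assignments here)
tracks = chain_tracks(perms, 2)   # full tracks through all 3 stages

# Feasible objective value: sum pairwise and triplet costs along each track.
ub = sum(C[0][i][j] + C[1][j][k] + D[i][j][k] for i, j, k in tracks)
```

Because the chained tracks satisfy all the assignment constraints, `ub` is always an admissible upper bound for the duality-gap check.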
Each CPU-GPU pair is called a Processing Element (PE), which handles a set of $x$ and $y$ subproblems and is responsible for implementing the dual ascent algorithm on these problems. To execute the algorithm efficiently, each PE handles a group of $M_y$ $y$ subproblems and, similarly, a group of $x$ subproblems; the cost data could, in principle, be exchanged across multiple nodes. However, given the size of the $D$ costs, this communication is expensive and can be completely avoided if the subproblems are divided efficiently. Fortunately, the RMAP formulation is conducive to decomposition and allows a great level of data parallelization. The $x$ and $y$ subproblems are mostly independent of each other and can be solved in parallel. Specifically, in Algorithm 1, steps 1(b), 3, 4, 5, 6, and 7(a) can be solved in parallel. The division of the subproblems among the PEs and the level of dependency among the subproblems is discussed in detail below.

A. Distribution of Subproblems
In Algorithm 1, some tasks need both the $x$ and $y$ subproblems, while others use only the $y$ subproblems (with a small level of dependency) or only the $x$ subproblems. For a better understanding, the division of the subproblems is explained in two parts: the $xy$ subproblem dependency and the $yy$ subproblem dependency.
1) xy Subproblem Dependency: One $x$ subproblem, denoted by $x^p_{ij}\ \forall i, j \in N$ for a fixed $p$, and only one corresponding $y$ subproblem, denoted by $y^p_{ijk}\ \forall i, j, k \in N$, are dependent on each other. So when executing the tasks, the corresponding $x$ and $y$ subproblems should reside on the same PE, thus avoiding communication overhead. To see this, consider a simple MAP with $K = 10$, including the dummy stage. The number of $x$ subproblems is 9 and the number of $y$ subproblems is 8 (explained in Section V). Figure 4 helps to understand this division better. The numbers in the figure denote stages and the arrows denote dependency. Each $y$ subproblem over dimensions $p, p+1, p+2$ is denoted by an arc covering these three numbers. An $x$ subproblem over dimensions $p, p+1$ is denoted by a line between the two dimensions. As shown in Figure 4(a), the $i$-th $y$ subproblem and the $i$-th $x$ subproblem are dependent on each other, requiring these two subproblems to be available for read and write on the same PE. For example, with 2 PEs, the subproblems would be divided such that $y$ subproblems 1, 2, 3, 4 are on one PE and 5, 6, 7, 8 are on the other. Since the $x$ and $y$ subproblems are dependent on each other, the corresponding $x$ subproblems 1, 2, 3, 4 and 5, 6, 7, 8 should be on PE1 and PE2, respectively. The same idea is applied for larger problems and more PEs.
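This co-location rule amounts to a contiguous block partition of the subproblem indices. A minimal sketch (Python; 0-indexed, whereas the text numbers the subproblems from 1):

```python
def distribute(n_sub, n_pe):
    """Split subproblem indices into contiguous blocks, one per PE, so the
    i-th y subproblem and its dependent i-th x subproblem land together."""
    base, rem = divmod(n_sub, n_pe)
    blocks, start = [], 0
    for pe in range(n_pe):
        size = base + (1 if pe < rem else 0)   # spread any remainder evenly
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks

# The K = 10 example from the text: 8 y subproblems over 2 PEs.
y_blocks = distribute(8, 2)
```

Using the same block boundaries for the $x$ subproblems keeps every dependent $x$/$y$ pair on one PE, so no cost data crosses PE boundaries.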
2) yy Subproblem Dependency: In the multiplier update task, as shown in Figure 4(b), the $i$-th $y$ subproblem is dependent on the $(i+1)$-th $y$ subproblem, requiring these two $y$ subproblems to be on the same PE; this aligns with the $y$ subproblem division described above. The Solve $x$-LAP task only utilizes the $x$ subproblems, and each subproblem is independent of the others. In addition to the CUDA- and MPI-induced parallelism in this task, we employ a technique called tiling. While solving the $x$ LAPs, instead of solving one LAP at a time, we solve multiple subproblems in parallel. It is less time consuming and computationally cheaper to store a group of $x$ LAPs together and solve them as one large LAP (conceptually). So the LAPs are tiled together as one large LAP and solved using the accelerated LAP solver of [52]. The objective function values and the gap values are communicated to the root PE using MPI_Bcast. An important advantage is that no communication of the cost values is required, as the subproblems are independent. Given the size of the cost matrices, this communication would have been highly expensive, and it is completely avoided in this algorithmic scheme.
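The tiling idea can be illustrated conceptually (Python sketch with hypothetical costs; the paper's implementation uses the GPU LAP solver of [52], not this brute force). Stacking the individual cost matrices on the diagonal of one large matrix, with prohibitively large costs off the blocks, yields one big LAP whose optimum equals the sum of the individual LAP optima:

```python
from itertools import permutations

BIG = 10**9   # prohibitive cost that blocks off cross-LAP assignments

def lap_value(cost):
    """Brute-force LAP optimum; fine for toy sizes."""
    n = len(cost)
    return min(sum(cost[i][p[i]] for i in range(n))
               for p in permutations(range(n)))

def tile_laps(costs):
    """Place each LAP's cost matrix on the diagonal of one large matrix."""
    n = sum(len(c) for c in costs)
    tiled = [[BIG] * n for _ in range(n)]
    off = 0
    for c in costs:
        for i in range(len(c)):
            for j in range(len(c)):
                tiled[off + i][off + j] = c[i][j]
        off += len(c)
    return tiled

laps = [[[1, 4], [3, 2]], [[2, 5], [6, 1]]]   # two hypothetical 2x2 LAPs
tiled = tile_laps(laps)
total = lap_value(tiled)   # equals lap_value of each LAP summed
```

On the GPU, one solver launch over the tiled matrix replaces many small launches, which amortizes kernel-launch overhead.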

VII. GAP CLOSURE USING SHORTEST PATH AND RANKED ASSIGNMENTS
The dual ascent procedure solves the problem with a small gap (between the feasible upper bound and the lower bound) for most of the test cases and obtains a (near-)optimal solution for a significant portion of them. For the test cases where dual ascent does not provide a convincingly near-optimal solution, we develop a gap closure scheme that finds an improved upper bound or an optimal solution (depending on the setup). This gap closure scheme is based on formulating a staged graph with ranked assignments between the consecutive stages of the MAP and then finding a shortest path through the graph. It is motivated by our past work: (1) [54], with a column generation heuristic in the wireless ad hoc network context, and (2) [55], in the context of opportunistic supply chain formation. Hereafter, we refer to this technique as SP-RA. Figure 5 illustrates the concept of the gap closure scheme (SP-RA).
Recall from Section IV-A that an LAP is defined for matching measurements between two consecutive stages $p$ and $p+1$. These matches are myopic in that the lowest-cost one might not correspond to the optimal track(s) through the $K$ stages. The basic idea of the SP-RA scheme is to find alternate matches for each adjacent pair of stages in a principled manner, such that the smallest overall-cost match through the $K$ stages returns the optimal track. The SP-RA algorithm has three major phases.

A. Find the Partial Lower Bound Value Excluding Each LAP p From the Dual Ascent Procedure
Let $G$ denote the graph with $K - 1$ time frames and $N$ nodes in each time frame, as discussed in Section III-A. Consider an LAP that connects stages $p$ and $p+1$; this LAP is indexed by $p$. Now, let $G \setminus p$ denote the graph that excludes this LAP $p$. To achieve this exclusion, the pairwise and triplet costs corresponding to LAP $p$ are set to zero in the LPRMAP and DLRMAP formulations of Section IV. For an LAP $p$, nullifying the pairwise and triplet costs is equivalent to ignoring the value that LAP $p$ adds to the total cost. The dual ascent algorithm is then executed with these modified pairwise and triplet costs to obtain the partial lower bound value, denoted $LB_p$. This process is performed for all the LAPs in the graph, and the values are stored for the next step.
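To make the exclusion step concrete, here is a stripped-down sketch (Python, hypothetical pairwise costs only; the paper reruns the full dual ascent, triplet costs included). Zeroing the costs of LAP $p$ and bounding the remaining relaxed problem gives the partial bound $LB_p$:

```python
from itertools import permutations

def lap_value(cost):
    """Brute-force LAP optimum; fine for toy sizes."""
    n = len(cost)
    return min(sum(cost[i][p[i]] for i in range(n))
               for p in permutations(range(n)))

def partial_lb(C, p_excl):
    """LB_p: drop LAP p_excl's contribution (equivalent to zeroing its
    costs) and take the relaxed bound over the remaining LAPs."""
    return sum(lap_value(Cp) for p, Cp in enumerate(C) if p != p_excl)

C = [[[1, 4], [3, 2]], [[2, 5], [6, 1]]]          # hypothetical pairwise costs
lbs = [partial_lb(C, p) for p in range(len(C))]   # one LB_p per LAP
```

Each $LB_p$ is by construction no larger than the full relaxed bound, which is what makes the pruning rule of the next step valid.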

B. Find the Ranked Assignments for LAP p
For each LAP $p$, we want to find $M_p$ ranked assignments. These are the $M_p$ best solutions to LAP $p$, ordered in non-decreasing order of cost. For complete gap closure, we need $M_p$ to be sufficiently large that any assignment beyond rank $M_p$ will provably not be included in an optimal solution (see Proposition 3). Let $Cost^m_p$ denote the cost of LAP $p$ at rank $m$. Let $BEST\_UB$ be the best known upper bound value for the problem, determined from the dual ascent procedure on the full graph $G$. To find $M_p$, we utilize the partial lower bound values obtained from the dual ascent procedure on the graph $G \setminus p$, as explained in Step 1. We increase $m$ until $Cost^m_p + LB_p > BEST\_UB$. This gives a bound on the value of $M_p$, i.e., the number of ranked assignments that need to be computed for a particular LAP $p$ for provable optimality (Proposition 3). If a desired level of accuracy is sufficient, a smaller value of $M_p$ can be used (see also Section VIII on experimental results). Finding the $M_p$ best (ranked) solutions of an LAP is a well-studied problem [56]. For the sake of simplicity, we employ the solution pool concept in [57] and use the bound termination criterion.
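A minimal sketch of the bound termination criterion (Python; brute-force enumeration stands in for the solution pool of [57] or Murty's algorithm [56], and the numbers are hypothetical):

```python
from itertools import permutations

def ranked_assignments(cost, lb_p, best_ub):
    """Return LAP solutions in non-decreasing cost order, stopping once
    Cost_m + LB_p > BEST_UB: later ranks cannot be part of an optimum."""
    n = len(cost)
    ranked = sorted(permutations(range(n)),
                    key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    kept = []
    for p in ranked:
        c = sum(cost[i][p[i]] for i in range(n))
        if c + lb_p > best_ub:
            break                     # pruning rule from Proposition 3
        kept.append((c, p))
    return kept

cost = [[1, 4], [3, 2]]               # two assignments, costs 3 and 7
M_p = ranked_assignments(cost, lb_p=5, best_ub=9)
# Rank 0 (cost 3) is kept: 3 + 5 <= 9; rank 1 (cost 7) is pruned: 7 + 5 > 9.
```

The length of the returned list is exactly the $M_p$ needed for this LAP under the given bound.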

C. Set Up the Staged Graph of Ranked Assignments for the Shortest Path
From the preceding step, we now have $M_p$ ranked assignments for each LAP $p$. The value of $M_p$ need not be the same for all the LAPs. For LAP 0, all the ranked assignments are stacked vertically in ascending order of rank, and this is repeated for all the $K - 1$ LAPs. We refer to each ranked assignment in this graph as a node, $N^m_p$, $m = 0, 1, \ldots, M_p - 1$; $p = 0, 1, \ldots, K - 1$. This results in a staged graph of breadth $K - 1$ and depth equal to the largest $M_p$ value, as illustrated in Figure 6. In this example, $M_1 = 3$, $M_2 = 4$, and $M_3 = 2$. The arc costs in this staged graph are assigned according to Proposition 3 below. Using these arc costs, the shortest (cheapest) path is computed; this is the optimal solution to the problem, as shown in Proposition 3.
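The shortest-path computation over the staged graph reduces to a stage-by-stage dynamic program. In this sketch (Python, hypothetical values), `node_costs[p][m]` plays the role of $Cost^m_p$ and `arc_costs[p][m][m2]` is an assumed coupling cost between consecutive ranked assignments; the paper's actual arc-cost assignment follows Proposition 3.

```python
def staged_shortest_path(node_costs, arc_costs):
    """Forward DP: dist[m] is the cheapest path cost ending at the
    m-th ranked assignment of the current stage."""
    dist = list(node_costs[0])
    for p in range(1, len(node_costs)):
        dist = [min(dist[m] + arc_costs[p - 1][m][m2]
                    for m in range(len(dist))) + c2
                for m2, c2 in enumerate(node_costs[p])]
    return min(dist)

# Two stages of ranked assignments (hypothetical numbers); stages may have
# different depths M_p, which the DP handles naturally.
node_costs = [[3, 7], [3, 11]]
arc_costs = [[[4, 2], [3, 1]]]
best = staged_shortest_path(node_costs, arc_costs)   # cheapest full track
```

The DP visits each arc once, so the cost is linear in the number of arcs of the staged graph.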
Proposition 3: $Cost^m_p + LB_p$ is a lower bound to the problem constrained by solution $m$ of LAP $p$, where $Cost^m_p$ represents the cost of the $m$-th ranked assignment of LAP $p$ and $LB_p$ is the partial lower bound obtained from the dual ascent algorithm on $G \setminus p$.
Proof: The objective function of the RMAP problem is shown in (26). The relation between the $y$ variables can be understood from the constraints in (16), and this 'connection' in the formulation translates into the multiplier updates explained in Section V. For computing $LB_p$, the pairwise costs ($C^p_{ij}$) and the triplet costs ($D^p_{ijk}$) of LAP $p$ are set to zero in the objective function, i.e., $C^p_{ij} = 0$, $D^p_{ijk} = 0$. The dual ascent procedure is performed on the graph as described in Section V-D with these cost updates, yielding $LRMAP^m_p$, the Lagrangian relaxation of RMAP constrained by the $m$-th ranked assignment of LAP $p$. Summing the costs of the selected ranked assignments along a path through the staged graph recovers the objective function of the original RMAP problem. We choose the shortest path among all the possible paths, thus obtaining the lowest value of the objective function. The optimality of the solution is guaranteed since, for each LAP, we found all the solutions that are sufficient to find an optimum, and no better solution can be found by exploring more solutions. Therefore, finding the shortest (or cheapest) path in this staged graph guarantees an optimal solution to the RMAP problem.

VIII. EXPERIMENTAL RESULTS
For the experiments, the datasets are divided into two classes based on the problem size, defined by the number of targets ($N$) and the number of time frames ($K$). The smaller datasets are problems with $5 \le N, K \le 50$; the problems considered are $(N, K) = \{(5, 5), (5, 10), (10, 5), (10, 10), (10, 20), (20, 10), (30, 30), (50, 50)\}$. The larger datasets are $(N, K) = \{(100, 100), (200, 200), (300, 300), (400, 400)\}$. The accelerated algorithms developed in this paper are coded in the C++ and CUDA C programming languages. The tests were conducted on two architectures, largely to mirror these two dataset classes and partly because of the availability of computing resources. The smaller tests were performed on an Intel Core(TM) i7-8550U CPU @ 3.6 GHz with 2 NVIDIA GeForce RTX 2080 Ti GPUs with 11 GB of memory each. The larger tests were run on an Intel(R) Xeon(R) Gold 6230 CPU @ 2.1 GHz with 4 TESLA V100 GPUs with 32 GB of memory each.

A. Data Generation
The datasets were generated using two methods. For smaller-sized problems, the data generation scheme from [58] is used; this method simulates target tracking behavior for maneuverable targets with constant velocity. For larger test cases, the previous method was not practical in terms of memory and computational time, and hence another method of data generation was used.
1) The first method is based on the cost generation scheme described in [58], using both linear and cubic spline interpolation. The idea for obtaining the pairwise costs ($C^p_{ij}$) is to compute the lengths of the spline curves ($l$) between two successive pairs of positions $(x_i(p), y_i(p))$ and $(x_j(p+1), y_j(p+1))$, measured at time frames $p$ and $p+1$. For the triplet costs, the absolute value of the difference between successive curve lengths is used.
2) For larger datasets, the costs are generated using a different scheme based on a similar idea. For each dimension, every node is given a label ($l$) between 0 and $N - 1$, without repetition. Therefore, across all the dimensions, exactly $K$ nodes will have the same label. The pairwise costs are computed as $C^p_{ij} = |l_i - l_j|$, the absolute difference of the node labels.
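A sketch of the label-based scheme (Python; the exact label distribution used in the paper is not specified here, so drawing each stage's labels as a random permutation is an assumption of this sketch):

```python
import random

def gen_costs(N, K, seed=0):
    """Assign each of the K stages a permutation of labels 0..N-1, then
    set the pairwise cost to the absolute difference of endpoint labels."""
    rng = random.Random(seed)
    labels = [rng.sample(range(N), N) for _ in range(K)]   # no repetition
    C = [[[abs(labels[p][i] - labels[p + 1][j]) for j in range(N)]
          for i in range(N)]
         for p in range(K - 1)]
    return labels, C

labels, C = gen_costs(N=4, K=3, seed=1)
# K - 1 pairwise cost matrices; nodes sharing a label have zero cost,
# so the ground-truth tracks follow equal labels across the stages.
```

This construction makes the planted (equal-label) tracks the zero-cost optimum, which is convenient for checking solution quality at large scale.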

1) Dual Ascent RMAP Algorithm:
The smaller-sized problems have two termination criteria. The first is based on the duality gap: the lower bound improves and approaches the objective value with every iteration, the upper bound is also computed at every iteration as described in Section V-E, and the algorithm terminates when the computed gap is less than 0.25%. The second termination criterion is a limit on the maximum number of iterations, which is set to 100. The primary criterion is the duality gap; that is, the algorithm terminates as soon as the gap goes below 0.25%. This was incorporated because running the algorithm for 100 iterations when the desired duality gap is reached in fewer than five would be an unnecessary use of resources. If the primary termination criterion is not satisfied, the algorithm runs for 100 iterations and then terminates.
For the larger test cases, the algorithm is run only with the second termination criterion (the maximum iteration limit), since computing the upper bound at every iteration was expensive. The upper bound is computed once all 100 iterations are completed, and the gap is reported. As expected, the algorithm exhibits a monotonic increase in the lower bound. In Table II, the duality gap and the solution times are reported; the smaller and larger test cases are separated by a horizontal line for better readability. For the smaller test cases, where $(N, K) = \{(5, 5), (5, 10), (10, 5), (10, 10), (10, 20), (30, 30), (50, 50)\}$, five different problem instances were run. In Table II, the average of the duality gaps, the maximum and minimum gaps among the 5 instances for each problem size, and the solution times over these five instances are reported. For the larger test cases of problem sizes $(N, K) = \{(100, 100), (200, 200), (300, 300), (400, 400)\}$, only one problem instance is generated for each problem size, and the duality gap and the solution time (run on 1 device) are reported. We further note that the gap for all the problem instances of all problem sizes is less than 5%, except for one problem instance of size $(N, K) = (20, 10)$, which reported a gap of 6.70%. (The optimal solution for this instance was found through SP-RA.)
2) SP-RA: For gap closure by SP-RA, ranked assignments were computed and the shortest path was found for the same five instances of each problem size. The ranked assignments for an LAP $p$ are computed using [57], with the parameter "CutOff" set so that we find solutions with a cost less than $BEST\_UB - LB_p$, as discussed in Section VII-B. A limit of 1000 on the maximum number of solutions is set. If the cost cutoff is not reached, then 1000 solutions are found for that LAP.
In such cases, since we have not found a guaranteed bound on $M_p$, it is possible that we have not found the optimal solution but only an improved upper bound; the algorithm then acts as a gap closure scheme. The partial lower bounds from the dual ascent runs with each LAP excluded are first found, and these values are given to Gurobi to find solutions whose costs are within the set CutOff value. Then, the shortest path problem is set up with these ranked assignments as nodes and the arc costs assigned as described in Section VII. Efforts are underway to accelerate the computation of the ranked assignments using Murty's algorithm [56], which would drastically reduce the computation time of the ranked assignments compared to using Gurobi. An accelerated version of single-source shortest paths is employed [59].
The average time (over the five problem instances of a fixed problem size) for finding the ranked assignments in Gurobi is reported under "RA time". Similarly, the average time for finding the partial dual ascent values is reported under "DA' time", and the average time to find the shortest path is reported under "SP time". For each problem size and all five instances, the minimum and maximum depth ($M$ value) of ranked assignments explored among the LAPs is noted. For each problem size, out of the five instances, the least value of the minimum and the highest value of the maximum depth explored are reported under the columns "Min RA" and "Max RA". These values give an idea of how many ranked assignments need to be computed to prove the optimality of the solution.
As mentioned before, a limit of 1000 ranked assignments per LAP is set in the gap closure scheme. However, for some instances, the optimal solution might not be found even with 1000 ranked assignments. Nevertheless, the gap closure scheme can still improve the upper bound found by the dual ascent algorithm. To demonstrate this, all instances whose optimal solution was not found are indicated by "†" in Table III. UB:SP-RA is the upper bound obtained from the gap closure scheme (denoted SP-RA); UB:DA and LB:DA are the upper and lower bounds obtained from the dual ascent algorithm. It can be seen that, by initializing the gap closure scheme with the dual ascent solution, the upper bound (UB:SP-RA) improves over the upper bound obtained from the dual ascent algorithm (UB:DA).
To show the significance of gap closure through SP-RA for problems where optimality is not guaranteed, we compare the upper bound values from the dual ascent runs and the SP-RA solutions for these cases in Table III; a better solution was found through the SP-RA algorithm. For the sake of brevity, we report these values for only one instance (the one with the worst duality gap among the five instances) from each of the $(N, K) = \{(20, 10), (30, 30), (50, 50)\}$ problem sizes.
Our experiments showed that the dual ascent technique solved most problems either optimally or with a small gap (near-optimally). Out of the 40 instances of the small-sized problems (5 instances for each of the eight problem size groups, i.e., $N, K \in [5, 50]$), 10 were solved optimally by dual ascent (i.e., with zero duality gap). Of the 30 problem instances with a non-zero gap, 29 had a gap of less than 5%, and only one instance had a 6.7% gap. After executing the gap closure scheme (with the required ranked assignments), optimal solutions were obtained for 15 of the 30 problems. For the remaining 15 problems, we had to truncate the number of ranked assignments at 1000, which does not permit us to prove the optimality of the shortest path. Nevertheless, in these cases we improved the upper bound value over the one found by the dual ascent scheme. For instance, the problem that had a gap of 6.7% was improved to a gap of 4.4% from the original dual ascent lower bound. We note that the optimal solutions were not found in these cases due to the computational limits set in the SP-RA algorithm; considering a sufficiently large number of ranked assignments would permit the gap closure scheme to find optimal solutions.

C. Strong Scaling Studies
We perform strong scaling studies on the larger-sized problems. We increase the number of processing elements from 1 to 4 and compare the solution times for each problem size. In Figures 8 and 9, the time comparison for $(N, K) = \{(100, 100), (200, 200), (300, 300), (400, 400)\}$ using multiple devices is shown. The solution time decreases nearly linearly as the number of PEs increases, which shows that the algorithm is scalable. For $(N, K) = (100, 100)$, since the solution time when using more than 1 PE is already very small (less than 2 seconds), a significant reduction in time cannot be seen. For $(N, K) = \{(300, 300), (400, 400)\}$, the problem cannot be solved fully in parallel in one shot because of memory restrictions; thus, these problems are solved in parts, or batches. The more PEs, the fewer the batches. The solution time reported accounts for all the batches and the complete set of dual-ascent iterations. Even with batching, the algorithm scales almost linearly across multiple PEs. These large problems are solved with very low optimality gaps (generally less than 5%) in a few minutes. As seen in Figure 9, the 400 × 400 problem, with 64 million binary variables and over 25 billion continuous variables, can be solved in 4 minutes using 4 GPUs. Furthermore, with an increasing number of processing elements in a parallel computing cluster, the performance scales almost linearly, establishing the scalability of the implementation.

IX. CONCLUSION AND FUTURE WORK
To summarize the research in this paper: a dual-ascent-based algorithm is developed for the Multi-dimensional Assignment Problem (MAP). The algorithm is applied to a Multi-Target Tracking (MTT) application, where the targets are maneuverable and maintain near-constant velocity. To better model the problem, both the pairwise and triplet costs between measurements are considered in the objective function.
This paper makes three major contributions. The first is a novel Multi-dimensional Assignment Problem formulation for Multi-Target Tracking, based on a creative linearization and reformulation. The second is an integrated algorithmic framework with a novel multiplier update scheme and a creative gap closure scheme. The dual formulation is highly decomposable and can be embarrassingly parallelized. Also, a feasible solution (upper bound) can be readily obtained by minor adjustments to the lower bound solution, and the gap between the computed upper and lower bounds is generally tight. To the best of our knowledge, this is the first application of dual ascent to the multi-dimensional assignment problem, although both the method and the problem are long-standing. The proposed gap closure technique can prove the optimality of the upper bound solution or find other alternatives that close the optimality gap. The third contribution is a GPU-accelerated algorithm designed to run on multiple GPU devices in a scalable manner. The framework is applied to the multi-target tracking problem, using datasets that simulate an MTT problem with maneuverable, constant-velocity targets. We note that the MAP framework in this paper is readily applicable to a wide range of other problems.
Future work should be directed at realistic tracking issues such as missed detections, false alarms, and the birth and death of targets; some ideas for handling these issues are discussed in [58]. Moreover, while the proposed dual ascent algorithm focuses on the association aspect of the target tracking application, it can be integrated with an accurate model for estimating target properties such as state and velocity [60]-[64] to develop a robust, efficient, and accurate algorithm for both the estimation and association aspects of target tracking applications.