Online Multi-Object Tracking With Visual and Radar Features



I. INTRODUCTION
Multi-object tracking (MOT) aims to estimate the states (i.e., positions, velocities, and sizes) of multiple objects over consecutive frames (or scans) while preserving their identities. Over the past decades, it has been extensively studied in autonomous driving, robotics, and computer vision, since it serves as a core algorithm for understanding and predicting the behavior of dynamic objects. However, it remains a difficult problem due to inaccurate detections, abrupt changes in object motion or appearance, and frequent occlusion by clutter or other objects.
To resolve this problem, the tracking-by-detection approach has flourished. Given object detections (or measurements) from a radar and a camera, it builds trajectories by linking detections between consecutive frames. Therefore, automated tracking can be achieved by initializing and terminating tracks with the provided detections. In addition, tracking accuracy can be improved because track fragments and identity switches can be recovered by matching tracks with the corresponding detections.
The associate editor coordinating the review of this manuscript and approving it for publication was Ivan Lee.
In tracking-by-detection, the data association between tracks and detections is crucial, and many methods have been developed for it. Greedy association methods such as the nearest neighbor [1] and the strongest neighbor [2] are fast, but often lose accuracy when many matching combinations exist. Joint probabilistic data association (JPDA) [1] and multiple hypothesis tracking (MHT) [3] can determine the optimal assignments between tracks and detections over a single frame and multiple frames, respectively. However, their association complexity grows combinatorially as the number of possible assignments between tracks and detections increases linearly. To reduce the complexity of JPDA, [4] leverages the m-best solutions of an integer program. Also, [5] shows that the classical MHT method with online appearance learning can be comparable to recent MOT methods.
However, in recent years, many autonomous systems (e.g., vehicles, mobile robots, and unmanned aerial vehicles) use a camera and a radar together for more accurate and stable object detection and tracking. In many practical scenarios, combining different types of features can improve the accuracy and robustness of MOT since the sensors are complementary to each other [6]. Therefore, [7] designs object dynamic and measurement models based on the EKF to fuse radar, image, and ego-vehicle odometry measurements. Given a scene geometry, [8], [9] present a method to align camera and radar features in global Cartesian coordinates, and use the aligned features for object detection. Reference [10] presents an overall system for detecting and tracking moving objects by combining different measurements from radars, Lidars, and a camera. Reference [11] models measurements of a radar and a stereo camera in polar coordinates as members of Lie groups and performs object state filtering on Lie groups. Most of these works have focused on fusing heterogeneous features effectively by developing object dynamic and measurement models [7], [10], [11] or sensor alignments [8], [9]. Then, they use the aligned or fused features to improve the estimation of the object state [7], [8], [10], [11].

FIGURE 1. The overall framework of our approach for tracking objects with visual and amplitude features. When detection bounding boxes and amplitudes are provided, multiple objects are tracked with the object models learned at the previous frame and confidence-based data association. Then, visual/amplitude object models and object trajectories are updated online with the association results. The updated models and trajectories are used as inputs of the subsequent frame.
Similar to the aforementioned works, we also leverage visual and radar features for more robust MOT. Compared to those works [7]-[11], however, our work focuses on improving MOT accuracy and speed by improving the data association. Because the core of the data association is the affinity evaluation, we propose effective object affinity models and an accurate affinity evaluation measure. In other words, our work aims at learning various object models efficiently from the visual and amplitude features, and at modeling the affinity measure so that the learned models are applicable to the data association. As a result, we can improve online MOT accuracy while maintaining a low tracking complexity.
To this end, we propose an overall MOT system which can exploit both features effectively as shown in Fig. 1.
The proposed system is based on object model learning and confidence-based data association. We first evaluate the confidence scores of tracks, and then categorize them into tracks with low confidence and tracks with high confidence. For the tracks with high confidence, we perform a local association to associate them with detections at the current frame. As a result, we can sequentially grow tracks with detections provided online using this frame-by-frame association. On the other hand, we regard tracks with low confidence as fragmented ones, and perform a global association between tracks with low confidence and other tracks with high confidence or detections. From this global association, we can build long trajectories under occlusions.
For reliable association, accurate affinity (or likelihood) evaluation between tracks and detections is essential. For a track and a detection from the same object, the affinity score should be high; otherwise, it should be low. From visual features, we learn object appearance, motion, and shape models during tracking, whereas we learn an amplitude model from the radar feature. Using the learned object models, we can evaluate affinity scores more accurately even when many tracks and detections exist, and use the evaluated scores for the confidence-based association. For automated MOT, it is usually required to initialize tracks from detections and to terminate tracks according to their status. In addition, in many cases, duplicated tracks which follow the same object are generated. To handle these issues, we also present an effective track management method.
On challenging visual surveillance benchmark datasets for MOT, we thoroughly evaluate our methods in terms of a standard evaluation metric for radar-based MOT. In particular, we implement different versions of MOT systems with different object models and association methods, and compare their MOT performance under several clutter densities. In addition, we compare our method with state-of-the-art MOT methods using deep learning. In this evaluation, we compare these methods using the common evaluation metrics for vision-based MOT. From these comparisons, we demonstrate the benefits of our methods on several datasets.
The key contributions of this paper can be summarized as follows:
• A unified MOT framework which can leverage visual and radar features effectively.
• A variety of visual and amplitude object models for learning object models more accurately.
• An enhanced confidence-based association which applies the several object models for affinity evaluation.
• Extensive implementation and evaluation of various MOT systems on challenging visual MOT datasets.
• State-of-the-art performance comparable with recent deep learning methods while maintaining a low tracking complexity.

II. RELATED WORK
In this section, we discuss previous studies on radar-based and vision-based MOT.
A radar usually provides a spatial detection (or measurement) including a range and a bearing. In many practical cases, the origins of detections are unknown because the returned radar signal is a mixture of objects and clutter. Therefore, many data association methods have been developed in order to assign a measurement to its corresponding track. Simple greedy association methods such as the nearest neighbor [1] and the strongest neighbor [2] associations have been presented. Although these methods have a low association complexity, incorrect associations occur when tracks are spatially located close together. For handling this joint track-to-measurement assignment problem within a single-frame or a multi-frame search, joint probabilistic data association (JPDA) [1] and multiple hypothesis tracking (MHT) [3] methods have been proposed. For reducing the joint association complexity, linear multi-target integrated probabilistic data association (LMIPDA) [12] has also been developed. For handling nonlinear dynamics of multiple objects, sequential Monte Carlo (SMC) methods [13]-[16] for MOT have been developed. To estimate object states and cardinality simultaneously, joint probability densities of multiple objects are modeled in [13], [14]. However, the computational complexity of these methods increases exponentially as the number of hypotheses increases. To alleviate this problem, the data association and state estimation are treated as separate problems in [15], [16].
However, the spatial feature alone is not sufficient for association cases in which objects are closely spaced or clutter is densely distributed in the object's vicinity. Therefore, for more accurate association, an amplitude is used as an extra feature in [17]-[21]. The basic idea of these methods is that the amplitude from an object is usually stronger than that from clutter. The extended MHT [17] and Viterbi data association [18] using the amplitude have been proposed. In order to exploit the amplitude without prior knowledge of signal-to-noise ratios (SNRs), a marginalization method [19] which computes an object amplitude likelihood within any SNR boundary has been presented. For estimating objects' states and SNRs jointly, SMC-based [20] and MAP-based [21] SNR estimation methods have been proposed.
In vision-based MOT, tracking-by-detection methods have flourished for achieving automated and robust MOT. In general, they build trajectories by associating (or linking) detections. They can be divided into batch and online tracking methods according to the association manner. Batch tracking methods [5], [22]-[24] usually build trajectories by a global association over the detections of all frames. They produce better MOT results than online methods in most cases. However, they cannot be applied to real-time or causal systems because they construct a batch of detections beforehand and build trajectories by linking all detections via an iterative global association. On the other hand, online tracking methods [25]-[30] build trajectories by a frame-by-frame association of past and current detections. Therefore, they are suitable for real-time applications. However, they tend to yield identity switches and track fragments under long-term occlusions since detections of future frames are not used.
Because both kinds of tracking methods build trajectories by local or global associations, the affinity evaluation between tracks and detections is important for accurate association. To this end, object affinity models using object appearance, motion, and shape cues [5], [25]-[28] have been developed. Due to the recent advances of deep learning, deep learning-based affinity models [28], [31]-[33] have been presented. References [31], [34] use an autoencoder and a convolutional neural network (CNN) as deep appearance models for learning richer representations. References [28], [35] exploit the Siamese network [36] to calculate the affinity between an object pair directly from the network output. References [32], [33] learn temporal dynamics of tracked objects using a recurrent neural network and a CNN. References [24], [30] learn a deep distance metric by aggregating appearance and motion cues. Although deep learning can improve model discriminability, many training samples and costly GPUs are required. In this work, we introduce an amplitude affinity model for vision-based MOT, and show that MOT accuracy can be enhanced by using this new and simple amplitude affinity model. We thus argue that the main benefit of our method is improving tracking accuracy while maintaining a low tracking complexity. We also prove this by comparing our method with recent MOT methods using deep learning on challenging visual MOT datasets.

III. VISUAL AND AMPLITUDE OBJECT MODELS

A. OBJECT DYNAMICS AND MEASUREMENT MODELS
We represent the state of an object i at frame t as x^i_t = [x^i_{1,t}, ..., x^i_{7,t}]^T, where its components and d^i_t = x^i_{7,t} are the position, velocity, size, and expected (or mean) SNR. A nonlinear discrete-time dynamic motion model describes the behavior of an object i as

x^i_{m,t+1} = f_t(x^i_{m,t}) + w_t,  (1)

where x^i_{m,t} ∈ R^4 denotes the dynamic state of object i at frame t, composed of positions and velocities along the x and y coordinates, and f_t is a nonlinear function of the motion state. In general, a detection set obtained at a frame is composed of many detections originating from multiple objects and clutter (or background) [1], [37]. Let us denote the set of detections at frame t, obtained from a camera and a radar, as Z_t = {z^j_t}_{j=1}^{m} with z^j_t = [b^j_t, a^j_t]^T, where b^j_t = [x^j_t, y^j_t, w^j_t, h^j_t] are the x and y positions, the width, and the height of a detection box obtained from the camera, and a^j_t is an amplitude measurement from the radar. Even though range and bearing features can also be obtained from a radar, we use only its amplitude feature because a camera usually provides more accurate locations and sizes in a real-world environment [6].
Furthermore, an object-originated measurement ξ^i_{j,t} is modeled by a linear measurement model as

ξ^i_{j,t} = [b^i_{j,t}, a^i_{j,t}]^T = H x^i_t + w_t,  (2)

where the noises w_{x,t} ~ N(0, σ²_x) and w_{y,t} ~ N(0, σ²_y) for the localization errors are uncorrelated Gaussian noise sequences. Here, it is assumed that the visual measurement b^i_{j,t} and the amplitude measurement a^i_{j,t} are independent of each other.

B. AFFINITY EVALUATION MODELS
We then define a track (or trajectory) T^i as a set of states up to frame t, i.e., T^i = {x^i_k | t^i_s ≤ k ≤ t^i_e}, where t^i_s and t^i_e are the time stamps of the start- and end-frames of the track. If an object i appears at frame t, we denote it by a binary function as v^i(t) = 1; otherwise, v^i(t) = 0.
In addition, we describe a track T^i with four model elements {A^i, S^i, M^i, P^i}, where A^i, S^i, M^i, and P^i represent its appearance, shape, motion, and amplitude models, respectively.
Then, an affinity measure to determine how well two objects u and z are matched is defined as

Λ(u, z) = A(u, z) · S(u, z) · M(u, z) · P(u, z),  (3)

where u and z can each be a track or a detection. Each affinity is computed as follows. For the appearance affinity A(u, z), we use subspace learning with partial least squares (PLS) [38]. We first extract an averaged RGB color histogram f^u_hist of each track over recent frames. We then project f^u_hist onto the learned PLS subspace to produce a compact and discriminative feature f^u_proj. The appearance affinity is the cosine similarity between f^u_proj and f^z_proj. More details on learning the projection matrix W are given in Sec. III-D.
The shape affinity S(u, z) is calculated with the updated heights h and widths w of u and z. M(u, z) is the motion affinity between u_tail (i.e., the last refined position) and z_head (i.e., the first refined position) with the frame gap between them. The forward velocity v^u_F is evaluated from the head to the tail of u, while the backward velocity v^z_B is evaluated from the tail to the head of z. We use Kalman filtering to update the velocities. The difference between the position predicted with the velocity and the refined position is assumed to follow a Gaussian distribution. The forward motion is used only when evaluating the affinity between a track and a detection.
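The forward/backward velocity check above can be sketched as follows. This is a minimal illustration: the function name, the variance `sigma2`, and the product of the two Gaussian terms are our assumptions, not the paper's exact formulation.

```python
import math

def motion_affinity(tail_pos, head_pos, v_forward, v_backward, gap, sigma2=25.0):
    """Sketch of the motion affinity between a track tail and a detection/track head.

    tail_pos, head_pos: (x, y) refined positions of the two objects.
    v_forward: forward velocity of the first object (head to tail).
    v_backward: backward velocity of the second object (tail to head).
    gap: frame gap between the two.
    sigma2: assumed variance of the Gaussian position-error model.
    """
    def gauss(pred, obs):
        dx, dy = pred[0] - obs[0], pred[1] - obs[1]
        return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma2))

    # Predict forward from the tail and backward from the head, then score
    # both prediction errors under the Gaussian assumption.
    fwd_pred = (tail_pos[0] + v_forward[0] * gap, tail_pos[1] + v_forward[1] * gap)
    bwd_pred = (head_pos[0] - v_backward[0] * gap, head_pos[1] - v_backward[1] * gap)
    return gauss(fwd_pred, head_pos) * gauss(bwd_pred, tail_pos)
```

When the velocities explain the positional gap exactly, the affinity is 1; it decays toward 0 as the prediction error grows.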
The amplitude affinity P(u, z) is evaluated with the averaged amplitude scores ā_u and ā_z of the amplitude measurements associated up to the current frame, and with their estimated SNRs d̂_u and d̂_z. In the next section, we discuss the amplitude likelihood model g_DT(a|d) and a method to estimate d̂_u and d̂_z.

C. AMPLITUDE MODEL AND UNKNOWN SNR ESTIMATION

1) OBJECT AMPLITUDE MODEL
We assume that the probability density of an amplitude a follows a Rayleigh distribution, as discussed in [39]. We then define the expected (or mean) SNR d = S/N_0, where S is the signal power; d can be treated as the expected object signal power because N_0 = 1. In addition, a slow Rayleigh-fading amplitude-modulated narrowband signal is considered in the presence of narrowband noise. In this case, the signal returned from the object is expressed as the sum of the transmitted signal and the narrowband noise. The background noise is normalized as in [39], which means that the expected noise power N_0 is unity. Therefore, the amplitude density function of an object follows the Rayleigh distribution with variance 1 + d (i.e., the signal-plus-noise to noise ratio):

p(a | d) = (a / (1 + d)) exp(−a² / (2(1 + d))).  (5)

However, to evaluate the signal power S from the object amplitude distribution (5), the expected object SNR d needs to be estimated, because S = d · N_0 = d. Let us next consider the case in which the amplitude a exceeds a detection threshold DT, i.e., a ≥ DT. Then, the amplitude density of the object becomes

g_DT(a | d) = (1 / P_D) (a / (1 + d)) exp(−a² / (2(1 + d))),  a ≥ DT,  (6)

where the object detection probability P_D used for normalization is calculated as

P_D = ∫_DT^∞ p(a | d) da = exp(−DT² / (2(1 + d))).  (7)

When the object SNR d is known, the amplitude likelihood of an object measurement can then be computed from (6).
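The amplitude model above can be sketched directly in code. The function names are ours; the densities follow the Rayleigh form with parameter 1 + d and the threshold normalization described in the text.

```python
import math

def rayleigh_pdf(a, d):
    """Amplitude density of an object with expected SNR d (clutter: d = 0).

    Rayleigh with parameter 1 + d; the noise power is normalized to unity.
    """
    return (a / (1.0 + d)) * math.exp(-a * a / (2.0 * (1.0 + d)))

def detection_prob(dt, d):
    """P_D: probability that the amplitude exceeds the detection threshold DT."""
    return math.exp(-dt * dt / (2.0 * (1.0 + d)))

def thresholded_likelihood(a, d, dt):
    """g_DT(a|d): amplitude density conditioned on a >= DT, normalized by P_D."""
    if a < dt:
        return 0.0
    return rayleigh_pdf(a, d) / detection_prob(dt, d)
```

Because of the P_D normalization, `thresholded_likelihood` integrates to one over a ≥ DT, which makes it a proper density for detected amplitudes.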

2) SNR ESTIMATION
To exploit g_DT(a|d), we estimate the object SNR d using the MAP method [21]. We model the prior p(d) with a Gaussian random-walk model. In other words, we consider that the SNR fluctuates randomly in the vicinity of the previously estimated (or initial) SNR d̂^i_{t−1}. Then, p(d) can be represented with the estimate d̂_{t−1} at frame t − 1 and variance σ²_d as follows:

p(d) = N(d; d̂_{t−1}, σ²_d).  (10)

To estimate an unknown SNR more accurately, one can use several amplitude measurements. In other words, rather than inferring the object SNR from an instantaneous amplitude feature a^i_t of the object i at frame t, it can be estimated from a set of amplitude features stacked over several frames.
Let us denote the stacked amplitude measurements from time t − Δ + 1 to time t as a^i_{t−Δ+1:t}, where Δ is the window length. (To determine a^i_t at frame t, we first filter out measurements using the track gating technique and amplitude thresholding. We then select the amplitude with the maximum strength among the filtered measurements and consider it as a^i_t. More details can be found in [21].)
The MAP problem of finding an optimal SNR with respect to the collection of amplitudes a^i_{t−Δ+1:t} can be modeled as

d̂ = argmax_d p(a^i_{t−Δ+1:t} | d) p(d),  (11)

where the likelihood term for each amplitude, p(a^i_k | d), is given by the thresholded amplitude density g_DT(a|d). By substituting the SNR prior (10) into (11), a nonlinear least-squares objective (12) can be derived, which we solve using the Levenberg-Marquardt method [40].
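The MAP estimation can be sketched as follows. For simplicity, a grid search stands in for the Levenberg-Marquardt solver used in the paper; the prior variance σ²_d = 5 follows Sec. VI-A, while DT = 1 and the search range are illustrative assumptions.

```python
import math

def neg_log_posterior(d, amps, d_prev, sigma_d=math.sqrt(5.0), dt=1.0):
    """Negative log of prod_k g_DT(a_k | d) * N(d; d_prev, sigma_d^2), up to constants."""
    nll = (d - d_prev) ** 2 / (2.0 * sigma_d ** 2)  # Gaussian random-walk prior
    for a in amps:
        # Rayleigh amplitude likelihood with parameter 1 + d (a >= DT assumed),
        # normalized by the detection probability exp(-DT^2 / (2(1+d))).
        nll -= math.log(a / (1.0 + d)) - a * a / (2.0 * (1.0 + d))
        nll -= dt * dt / (2.0 * (1.0 + d))
    return nll

def estimate_snr(amps, d_prev, lo=0.1, hi=40.0, step=0.01):
    """Grid-search MAP estimate of the expected SNR d over [lo, hi]."""
    best_d, best_v = lo, float("inf")
    d = lo
    while d <= hi:
        v = neg_log_posterior(d, amps, d_prev)
        if v < best_v:
            best_d, best_v = d, v
        d += step
    return best_d
```

With amplitudes whose mean square is consistent with an SNR of about 10 and a prior centered at 10, the estimate lands near 10; stronger amplitudes pull the estimate upward, as expected.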
D. APPEARANCE MODEL LEARNING

1) TRAINING SAMPLE COLLECTION

Given a detection box b^i for an object i, we can generate positive sample boxes by rescaling b^i with a scaling factor ψ. We denote a rescaled box as d^i_res. We initially set ψ = 0.7 and increase ψ with an interval of 0.1 as long as the overlap ratio α_over (the intersection region over the union region) between d^i and d^i_res does not fall below 0.75, which yields a set of positive boxes Z^{i,+}. For improving the appearance discriminability between an object and nearby objects or scene clutter, we also collect negative sample boxes around the object. Given an object bounding box b^i, we define negative sample boxes by shifting b^i around the object, where the parameters ρ, ζ_w, and ζ_h control the shifts; in our experiments, we set ρ, ζ_w, and ζ_h to 1.2, 2, and 4, respectively. As a result, a negative sample set Z^{i,−} is collected.
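The positive-sample generation can be sketched as below. Because the exact ψ schedule is garbled in our copy, this sketch simply keeps candidate scales whose IoU with the original box stays at or above 0.75; the function names and the center-based box format (x, y, w, h) are our assumptions.

```python
def iou(b1, b2):
    """Intersection-over-union of two boxes given as (cx, cy, w, h)."""
    def corners(b):
        x, y, w, h = b
        return x - w / 2, y - h / 2, x + w / 2, y + h / 2
    ax1, ay1, ax2, ay2 = corners(b1)
    bx1, by1, bx2, by2 = corners(b2)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter
    return inter / union if union > 0 else 0.0

def positive_boxes(box, scales=(0.7, 0.8, 0.9, 1.0, 1.1), min_iou=0.75):
    """Keep rescaled versions of the detection whose IoU with it is >= min_iou."""
    x, y, w, h = box
    return [(x, y, w * s, h * s) for s in scales
            if iou(box, (x, y, w * s, h * s)) >= min_iou]
```

For a centered rescale by s < 1 the IoU is exactly s², so scales below about 0.87 are rejected by the 0.75 threshold.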

2) PARTIAL LEAST SQUARE (PLS) SUBSPACE LEARNING
To discriminate appearance features of different objects, we learn projection spaces using PLS, since appearance learning with PLS shows more discriminability than PCA and raw color histogram features [38]. We denote the sample set of the i-th track collected from frame t − Δ + 1 to frame t as Z^i_{t−Δ+1:t}, which consists of Z^{i,+}_{t−Δ+1:t} and Z^{i,−}_{t−Δ+1:t} as defined in Sec. III-D1. Using the NIPALS algorithm, we learn a new PLS weight vector w at each iteration as in (13), where F = {f^1_hist, f^2_hist, ..., f^g_hist} is the appearance feature matrix of size g × (histogram dimension), consisting of g histogram features for Z^i_{t−Δ+1:t}; r, o, and e are g-dimensional feature score, label, and label score vectors, respectively; and p is a label loading value. By learning w for τ iterations, we produce a PLS weight matrix W = {w_1, w_2, ..., w_τ}^T. Then, a weight matrix W^i for the i-th object can be learned from Z^i_{t−Δ+1:t} using (13). For updating W^i during tracking, we first generate W^i_new from Z^i_{t−Δ+1:t}, and combine W^i_new with the learned W^i using a balancing weight υ = 0.5:

W^i ← υ W^i + (1 − υ) W^i_new.  (14)

Once W^i is learned, we can generate a projected PLS feature f^i_proj = W^i f^i_hist and use f^i_proj for affinity evaluation in (4). In our case, we set the histogram dimension and τ to 144 and 40. This also improves tracking speed because the dimension of f^i_proj is much lower than the dimension of the original feature f^i_hist.
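The projection, cosine-similarity affinity, and online weight update can be sketched as follows. We assume W has already been learned by NIPALS; plain lists of lists stand in for a matrix library, and the function names are ours.

```python
import math

def project(W, f_hist):
    """Project a color histogram onto the track's PLS subspace: f_proj = W f_hist."""
    return [sum(wi * fi for wi, fi in zip(row, f_hist)) for row in W]

def appearance_affinity(W, f_u_hist, f_z_hist):
    """Appearance affinity: cosine similarity of the two projected features."""
    pu, pz = project(W, f_u_hist), project(W, f_z_hist)
    dot = sum(a * b for a, b in zip(pu, pz))
    nu = math.sqrt(sum(a * a for a in pu))
    nz = math.sqrt(sum(b * b for b in pz))
    return dot / (nu * nz) if nu > 0.0 and nz > 0.0 else 0.0

def update_weight_matrix(W_old, W_new, upsilon=0.5):
    """Online update of the PLS weight matrix with balancing weight upsilon = 0.5."""
    return [[upsilon * a + (1.0 - upsilon) * b for a, b in zip(ra, rb)]
            for ra, rb in zip(W_old, W_new)]
```

Since τ = 40 is far smaller than the 144-dimensional histogram, the cosine similarity is computed on much shorter vectors, which is the speed benefit noted above.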

IV. DATA ASSOCIATION
We define T^i in Sec. III-B. A set of trajectories of all objects up to frame t can then be denoted as T_{1:t}, and we denote the set of trajectories existing at frame t as {T^i}_{i=1}^N. Using the confidence measure [28], we evaluate a track confidence in consideration of the length and continuity of a track and its affinity with the associated detections as follows:

conf(T^i) = ( (1/L) Σ_{k: v^i(k)=1} Λ(T^i, z^i_k) ) × exp(−β · λ / L),  (15)

where L = |T^i| is the length of the track, and λ = t^i_e − t^i_s + 1 − L is the number of frames in which the object i is missing due to occlusion by other objects or unreliable detection. β is a control parameter relying on the performance of the detector: when the detector shows high accuracy, β should be set to a large value (β is set to 1.2 as done in [28]). The affinity Λ(T^i, z^i_k) between the track and a detection is computed by (3).
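This confidence measure can be sketched as below. The mean-affinity times exponential-penalty form is a reconstruction following [28]; the function name and argument layout are ours.

```python
import math

def track_confidence(affinities, t_start, t_end, beta=1.2):
    """Track confidence from per-frame association affinities.

    affinities: affinity scores of the L frames where the track was actually
    associated with a detection. lambda = t_end - t_start + 1 - L counts the
    missing frames; longer gaps shrink the confidence exponentially.
    """
    L = len(affinities)
    lam = (t_end - t_start + 1) - L
    mean_affinity = sum(affinities) / L
    return mean_affinity * math.exp(-beta * lam / L)
```

A track observed in every frame keeps its mean affinity as its confidence; each missed frame multiplies it by exp(−β/L).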
Once the confidence scores of tracks are computed by (15), local and global associations are adaptively performed according to the track confidence. A track with high confidence T^{i(hi)} is considered a reliable track, and is locally associated with a detection in order to grow it progressively. When h tracks with high confidence and a detection set Z_t = {z^j_t}_{j=1}^{m} are given at frame t, we compute a local association score matrix S_{h×m} whose (i, j)-th element is the affinity Λ(T^{i(hi)}, z^j_t) computed by (3). Then, the track-detection pairs which maximize the total affinity in S_{h×m} are determined using the Hungarian algorithm [41]. When the association cost of a pair is less than a pre-defined threshold −log(θ), z^j_t is associated with T^{i(hi)}. For the track T^{i(hi)} associated with detection z^j_t, the states and confidence of the track are updated with the association results as follows:
• The position and the velocity of the track are updated with the associated z^j_t. The size of the object is also updated by averaging the sizes of the detections associated over recent past frames.
• conf(T^i) is updated using z^j_t by (15).
On the other hand, a track with low confidence T^{i(lo)} is considered a trajectory fragmented by occlusions. To link fragmented tracks into one, we associate T^{i(lo)} with a T^{i(hi)} or with a detection y^j_t not associated with any T^{i(hi)} in the local association. Assume that there exist η non-associated detections (η ≤ m), and h and l tracks with high and low confidence, respectively. Then, we perform the global association by considering the possible events between these tracks and detections, and define a global association score matrix G over all the events. Once G is computed, we determine the optimal matching pairs using the Hungarian algorithm such that the total affinity score in the matrix is maximized. Then, the detections of the associated pairs are linked to each other in a sequential manner, and the confidences of all existing tracks are updated by (15).
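The local association step can be sketched as follows. Brute-force enumeration stands in for the Hungarian algorithm (practical only for small problems), and the affinity is passed in as a callable; the threshold θ and the assumption h ≤ m are illustrative.

```python
import itertools

def associate(tracks, detections, affinity, theta=0.05):
    """Local association sketch (assumes h <= m): build the h x m affinity
    matrix S, pick the assignment maximizing the total affinity, and reject
    pairs whose affinity falls below theta (i.e., cost above -log(theta)).
    """
    h, m = len(tracks), len(detections)
    if h == 0 or m == 0 or h > m:
        return []
    S = [[affinity(tr, z) for z in detections] for tr in tracks]
    best, best_score = None, -float("inf")
    # enumerate all injective track -> detection assignments
    for perm in itertools.permutations(range(m), h):
        score = sum(S[i][j] for i, j in enumerate(perm))
        if score > best_score:
            best, best_score = perm, score
    return [(i, j) for i, j in enumerate(best) if S[i][j] >= theta]
```

In practice the Hungarian algorithm (e.g., scipy's `linear_sum_assignment`) replaces the permutation loop, reducing the cost from factorial to cubic time.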

V. TRACK MANAGEMENT AND UPDATE
For achieving automated MOT, managing tracks appropriately is also important. In this section, we briefly discuss the tasks contained in the track management.
VOLUME 8, 2020
In general, track initialization is required to generate a new track from detection responses. Once a track is generated, it tracks an object. However, a track can erroneously follow a non-object (e.g., clutter) due to occlusions and inaccurate detections. In this case, we need to eliminate such a false track to correct the tracking failure. In some cases, track duplication, in which two or more tracks follow the same object, can occur due to inaccurate track initialization and tracking failures. In the following subsections, we provide our track initialization, termination, and merging methods to deal with these difficulties.

A. TRACK INITIALIZATION AND TERMINATION
The problem of initiating a new track can be transformed into a problem of finding consecutive and similar detection responses over a certain number of recent frames. In general, the detections of a new track should not be associated with any existing tracks in the local and global association stages. We define the set of non-associated detections from t − Δ_new + 1 to t as Y_{t−Δ_new+1:t}; this means that the candidates for new tracks are reduced to the non-associated detections. A new track candidate is T_new = {y_k}, where the time stamps of the start- and end-frames of the new trajectory bound k, and y^new_t = [b_t, a_t]^T. Now, we define an affinity score for the new track initialization N(T_new) as

N(T_new) = Π_k S(y_k, y_{k−1}) N(y_k, y_{k−1}),

where S(y_k, y_{k−1}) is the shape affinity defined in (4), and the spatial affinity N(y_k, y_{k−1}) is evaluated from the spatial distances along the x and y coordinates.
The covariance Σ_v used in the spatial affinity is determined by the maximum velocities of an object along the x and y coordinates and the unbiased converted covariance R^c_t [42]. Then, we generate a new track when N(T_new) exceeds the track initialization probability ϑ_I.
A desirable track termination method should identify and eliminate false tracks, i.e., those that do not follow true objects. In our case, we evaluate the reliability of a track using the track confidence model conf(T^i), and tracks whose confidence is lower than ϑ_T are eliminated. Using the track initialization and termination methods, we can generate new tracks and eliminate false tracks efficiently by considering the affinities between detections and the track reliability.

B. TRACK MERGING
In MOT problems, several tracks often follow the same object due to inaccurate track initialization or tracking failures; this is called track duplication. In [20], we presented a track merging method based on a mean shift algorithm. In brief, we classify and group the tracks {T^i}_{i=1}^N according to their recent states {x^i_t}_{i=1}^N. Using the mean shift, the m_c modes of the clusters C^q, q = 1, ..., m_c, are determined. Once the clusters are generated, the merged track q and its components, such as the track state x̂^q_{t|t}, covariance P^q_{t|t}, track confidence conf(T^q), and object models, are determined as follows:
• The track state x̂^q_{t|t} is the mode of the cluster C^q.
• The covariance P^q_{t|t} is the minimum covariance among the tracks in the cluster.
• The object models {A^q, S^q, M^q, P^q} are the models of the track q*, where q* is the track with the maximum confidence in the cluster.

VI. EXPERIMENTAL RESULTS
We evaluate our MOT method on challenging visual surveillance datasets. For further comparison, we also implement and compare different MOT methods.

A. IMPLEMENTATION
To verify our affinity models using visual/amplitude features and the confidence-based association method, we have implemented and compared several multi-object tracking systems (M1-M4) using different object models and data association methods. For this comparison, based on Algorithm 1, we have implemented the following MOT systems by combining different methods:
• (M1) without visual models;
• (M2) without an amplitude model;
• (M3) with all models and LMIPDA-AI association [20];
• (M4) with all models and confidence-based association.
Here, the system (M1) uses only the range, bearing, and amplitude features of radars. For affinity evaluation of (M1), we therefore use the object motion model M and amplitude model P. On the other hand, (M2) does not exploit an amplitude feature, and exploits the visual models A, S, and M for affinity evaluation. For (M3) and (M4), we use all the models of the camera and radar, but different association methods are applied to each system. In (M3), we use the LMIPDA-AI association method. In this association, a track existence probability should be computed in order to evaluate the posterior association probability β^i_{j,t} between a track i and a measurement j within the gate of the track i. For a fair comparison, we replace the track existence probability with the track confidence. In addition, (M3) leverages all the affinity models when evaluating β^i_{j,t}. When estimating an object SNR in (M1), (M3), and (M4), we set the variance σ²_d and the window length to 5 and 5 when solving the objective function (12).
For (M3), we use the gating technique to reduce the matching combinations between tracks and measurements, as done in [12], [20]. Using the gating technique, m^i_t validated measurements are selected for each track i by the gate condition

(v^i_{j,t})^T (S^i_t)^{−1} v^i_{j,t} ≤ γ,

where γ is a gate threshold and m^i_t is the number of measurements in the gate of the track i; v^i_{j,t} = ξ^i_{j,t} − ξ̂^i_{t|t−1} is a zero-mean Gaussian residual with a covariance S^i_t. Given the gated measurements, amplitude thresholding with the threshold DT is exploited to filter out false alarms, because the amplitude from an object is usually stronger than that of false alarms [20].

Algorithm 1 The Overall Algorithm for Implementing MOT Systems With Different Association Methods and Affinity Models
Input: a set of measurements Z_t and a set of tracks.
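The gate test on the Gaussian residual can be sketched for the 2-D case as follows. The value γ = 9.21 is an illustrative choice (a 99% chi-square gate with 2 degrees of freedom); the paper's γ is not specified here.

```python
def in_gate(residual, S, gamma=9.21):
    """Gating test: v^T S^{-1} v <= gamma for a 2-D residual v with covariance S.

    S is given as a 2x2 nested tuple; gamma = 9.21 corresponds to a 99%
    chi-square gate with 2 degrees of freedom (an assumed setting).
    """
    vx, vy = residual
    (a, b), (c, d) = S
    det = a * d - b * c
    # closed-form inverse of the 2x2 covariance
    inv = ((d / det, -b / det), (-c / det, a / det))
    m2 = (vx * (inv[0][0] * vx + inv[0][1] * vy)
          + vy * (inv[1][0] * vx + inv[1][1] * vy))
    return m2 <= gamma
```

Measurements failing this Mahalanobis test are excluded before the amplitude thresholding step.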

B. EVALUATION METRIC
As a performance measure, the optimal subpattern assignment (OSPA) metric [43] is used. Given the true and estimated sets composed of states of multiple objects, we measure the localization distance and cardinality distance. The localization distance evaluates the state similarities between matched pairs of the true and estimated sets. On the other hand, the cardinality distance evaluates how well the number of existing tracks matches the number of true objects. As an overall performance measure, the OSPA distance representing the total error is calculated by summing both the localization and cardinality distances. For all the distance metrics, a smaller distance indicates better results.
In the OSPA metric, the cut-off parameter is set to c = 100; it determines the relative weighting of the penalties assigned to the cardinality and localization errors. The order parameter is then set to p = 1, which determines the sensitivity of the metric to outliers.
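For concreteness, the order-1 OSPA distance with these parameters can be sketched as below. Brute-force assignment stands in for an optimal matching solver and is only practical for small sets; the function name and 2-D position format are ours.

```python
import itertools

def ospa(X, Y, c=100.0):
    """OSPA distance of order p = 1 between two sets of 2-D positions.

    The smaller set is optimally assigned to the larger one; unmatched
    elements incur the cut-off penalty c, combining the localization and
    cardinality terms of the metric [43].
    """
    if len(X) > len(Y):
        X, Y = Y, X
    m, n = len(X), len(Y)
    if n == 0:
        return 0.0
    def d(a, b):
        # cut-off Euclidean base distance
        return min(c, ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5)
    best = 0.0
    if m > 0:
        best = min(sum(d(x, Y[j]) for x, j in zip(X, perm))
                   for perm in itertools.permutations(range(n), m))
    return (best + c * (n - m)) / n
```

A missed or spurious track contributes the full cut-off c to the average, which is why cardinality errors dominate the total OSPA value under heavy clutter.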

C. VISUAL MULTI-OBJECT TRACKING DATASET
To compare the systems (M1-M4) in a real MOT environment, we use the publicly available VS-PETS 2009 benchmark dataset [44]. From this dataset, the PETS S2.L1 and PETS S2.L2 sequences are exploited for multi-object tracking evaluation. The PETS S2.L1 and S2.L2 sequences consist of 795 and 436 frames, respectively, and the resolution of each image is 768 (pixels) × 576 (pixels). 23 and 74 objects exist in PETS S2.L1 and S2.L2, respectively. As shown in Fig. 2(a) and Fig. 3(a), the trajectories of the multiple objects are complicated. In particular, the PETS S2.L2 sequence is very challenging because many objects are moving and interacting with each other. For further evaluation, we compare (M1)-(M4) on the Town Centre dataset. This dataset consists of 4500 frames, and each frame is a full HD image of 1920 (pixels) × 1080 (pixels) resolution. 230 objects are moving and interacting, as shown in Fig. 4(a). We allocate each object an initial SNR within [5 dB, 20 dB], and the object SNRs fluctuate at each scan according to the Gaussian distribution (10) with σ_d = 10.

D. DETECTION
For the PETS and Town Centre datasets, we use the publicly available detections from [45] and [46], which exploit the HOG detector [47] and its variant [48], respectively. Measurements of objects are assumed to be detected with P_D = 0.95, and some detections of the objects are removed according to P_D. From each detection, the spatial locations (i.e., x and y positions) and sizes (i.e., width and height) are obtained. For each object SNR at frame t, amplitude measurements are generated according to the Rayleigh distribution (6).
1) PETS S2.L1 SEQUENCE
On the other hand, (M3) and (M4) show better accuracy than (M1) and (M2), and maintain their performance for high λ. This indicates that using both features can enhance the association accuracy, and that doing so is more effective in heavily cluttered environments. When comparing (M3) and (M4), which use different association methods, the confidence-based association shows better results than LMIPDA-AI. This means that the adaptive local and global association based on track confidence can determine the association pairs more accurately.
2) PETS S2.L2 SEQUENCE
Figure 3 demonstrates the tracking results of (M1)-(M4) on the PETS S2.L2 sequence. This sequence is very challenging because of the complex motions of the objects and the many interactions between them. Therefore, the overall performance of all the systems is degraded compared with their performance on PETS S2.L1. In particular, the localization errors of the systems increase due to inaccurate detections and many false detections.
From the OSPA results shown in Fig. 3(d), we also confirm that exploiting both the visual and amplitude models is indeed beneficial for reducing OSPA errors when comparing (M1)/(M2) with (M3)/(M4). In addition, (M1), which uses no visual feature, shows the lowest accuracy. Using the amplitude model reduces the OSPA error by about 10 when comparing (M2) and (M4). In particular, the effect of using the amplitude model P_i increases as λ increases. In this evaluation, our (M4) achieves the best accuracy, and its cardinality errors are not sensitive to λ. The low cardinality errors reflect that the number of generated tracks is close to the number of true objects.
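For reference, the OSPA metric used throughout this evaluation combines localization error (optimal assignment under a cutoff c) with a cardinality penalty for unmatched objects. A brute-force sketch for the small per-frame object counts involved (the cutoff `c` and order `p` values here are illustrative defaults, not the paper's settings) could be:

```python
import math
from itertools import permutations

def ospa(X, Y, c=100.0, p=1):
    """OSPA distance between two point sets X and Y of 2-D positions.
    Brute-force over assignments; fine for a handful of objects per frame."""
    if len(X) > len(Y):
        X, Y = Y, X                      # ensure |X| <= |Y|
    m, n = len(X), len(Y)
    if n == 0:
        return 0.0
    dist = lambda a, b: min(c, math.hypot(a[0] - b[0], a[1] - b[1]))
    # best assignment of the m smaller-set points onto the n larger-set points
    best = min(
        sum(dist(x, Y[j]) ** p for x, j in zip(X, perm))
        for perm in permutations(range(n), m)
    )
    # unmatched points each incur the maximum (cutoff) penalty c
    return ((best + (n - m) * c ** p) / n) ** (1.0 / p)
```

A pure cardinality mismatch thus contributes the cutoff penalty, which is why the low cardinality errors of (M4) translate directly into low OSPA scores.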

3) TOWN CENTRE SEQUENCE
We further compare (M1)-(M4) on the Town Centre sequence, as shown in Fig. 4. This sequence is very long and contains many objects. Nevertheless, the performance of all the systems is better than on the other two sequences. We again obtain better results by using both the visual and amplitude models.
In addition, the confidence-based association shows lower OSPA errors than LMIPDA-AI. From the quantitative results on PETS S2.L1, PETS S2.L2, and Town Centre, we show that our affinity models and association method contribute to increasing MOT accuracy, and that the performance gain of our methods grows as the clutter density increases.

G. QUALITATIVE EVALUATION
In Fig. 5, we compare the tracking results of (M1), (M2), and (M4). Figures 5(a) and 5(b) compare (M1), which uses no visual feature, with (M4), which uses both features. We find that some track fragments (FG) and identity switches (ID switches) are caused by the inaccurate association of (M1) when tracked objects are occluded. Furthermore, (M2), which lacks the amplitude model, produces an ID switch, as shown in Fig. 5(c).

H. COMPARISON WITH DEEP APPEARANCE LEARNING
To further show the benefits and effects of our method, we compare it with recent MOT systems using deep appearance learning [28], [31], [36]. For a fair comparison, we implement all the other systems on the same framework shown in Fig. 1 and replace the appearance model (4) with their deep appearance models. We use the publicly available code for [28], [31], [36], and train the deep appearance models on the CUHK02 [49] person re-identification dataset. The dataset contains 7,262 image patches of 1,816 different persons captured from 10 camera views. We resize each color image patch of a person to 128×64 and use the resized patches as the input of the deep appearance models. We obtain detection boxes by applying a Mask R-CNN [50] detector to each image. We also generate amplitude measurements with the Rayleigh distribution (6).
In addition, for this comparison we use the common evaluation metrics in vision-based MOT: multiple object tracking accuracy (MOTA↑), multiple object tracking precision (MOTP↑), the ratio of mostly tracked trajectories (MT↑), the ratio of mostly lost trajectories (ML↓), the number of track fragments (FG↓), recall (REC↑), precision (PRE↑), false alarms per frame (FAF↓), the number of identity switches (IDS↓), and tracker speed in frames per second (Hz↑). Here, ↑ and ↓ indicate that higher and lower scores are better, respectively. Table 1 shows the evaluation results on the PETS S2.L1, PETS S2.L2, and Town Centre datasets. As shown, the proposed methods are comparable with the other MOT systems [28], [31], [36] using deep learning. Although the recent method [28] shows the best tracking accuracy, the proposed method with the amplitude feature shows better MOTA, IDS, FG, REC, PRE, and FAF scores than the other deep learning-based MOT trackers [31], [36]. In addition, using the amplitude feature improves the MOTA score, which is the most important metric, by 1.58% when comparing our systems with and without amplitude. However, the greatest benefit of our method is its tracking speed. Indeed, our methods greatly reduce the run time compared to [28], [31], [36]. Note that we achieve this performance without using the person re-identification dataset for appearance learning. These comparison results indicate that our method can run very fast while keeping high MOT accuracy.
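The MOTA score emphasized above follows the standard CLEAR MOT definition, which penalizes misses, false positives, and identity switches relative to the total number of ground-truth detections. A minimal sketch:

```python
def mota(fn, fp, ids, num_gt):
    """CLEAR MOT accuracy: 1 - (misses + false positives + ID switches)
    divided by the number of ground-truth detections over all frames."""
    return 1.0 - (fn + fp + ids) / float(num_gt)
```

Because all three error types share one denominator, MOTA can go negative for very poor trackers, and a small reduction in ID switches (as with our amplitude feature) shifts the score directly.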

I. EVALUATION USING DIFFERENT DETECTORS
In order to investigate how detector accuracy affects MOT performance, we evaluate our MOT system on different detection responses on PETS S2.L1 and PETS S2.L2. We use the publicly available HOG detections from [45], [46] and detections obtained by applying the Mask R-CNN [50] detector. We compare our systems with and without the amplitude affinity model. Figure 7 compares the performance of both systems in terms of several MOT metrics. Since detection accuracy affects the precision and recall of a tracker the most, we compute recall, precision, and the metrics related to them. For all the metrics, our trackers yield better scores with the recent Mask R-CNN detector than with the HOG detector. In particular, the gap in MOTA scores is large.
Since this metric represents the overall tracking accuracy, it turns out that MOT performance is strongly affected by detection quality. When comparing our systems with and without the affinity model, exploiting the amplitude affinity produces better rates for all the metrics. Thus, we also confirm that the proposed amplitude affinity model can indeed enhance MOT performance regardless of the detector performance.

VII. CONCLUSION
In recent years, many autonomous systems have exploited camera and radar sensors to achieve stable and accurate multi-object tracking. In this study, we have proposed a unified framework that effectively exploits visual and amplitude features for MOT. The proposed framework is based on object model learning and data association methods.
We have learned visual and amplitude models during tracking. In particular, we have learned an object appearance model using discriminative subspace learning, and an amplitude model using MAP-based SNR estimation. By combining these affinity models with the confidence-based association, we have enhanced the MOT performance significantly. Furthermore, we have presented a practical track management method to deal with track initialization and duplication.
In order to show the benefits of our methods, we have implemented several MOT systems using different affinity models and association methods, and compared their performance extensively on several challenging visual MOT datasets. In addition, we have compared our method with state-of-the-art MOT methods using deep appearance learning. The comparison shows that our method achieves high tracking accuracy comparable to the recent methods. In particular, the greatest benefit of our method is its low MOT complexity: it greatly reduces the run time relative to the deep learning methods.