Hybrid Motion Model for Multiple Object Tracking in Mobile Devices

For intelligent transportation systems, multiple object tracking (MOT) becomes more challenging when moving from traditional static surveillance cameras to mobile devices of the Internet of Things (IoT). To cope with this problem, previous works usually rely on additional information from multivision, various sensors, or precalibration. Using only a monocular camera, we propose a hybrid motion model to improve tracking accuracy on mobile devices. First, the model evaluates camera motion hypotheses by measuring optical flow similarity and transition smoothness to perform robust camera trajectory estimation. Second, along the camera trajectory, smooth dynamic projection is used to map objects from image to world coordinates. Third, to deal with trajectory motion inconsistency, which is caused by occlusion and interaction over long time intervals, tracklet motion is described by a multimode motion filter for adaptive modeling. Fourth, in tracklet association, we propose a spatiotemporal evaluation mechanism, which achieves higher discriminability in motion measurement. Experiments on the MOT15, MOT17, and KITTI benchmarks show that the proposed method improves trajectory accuracy, especially on mobile devices, and achieves competitive results compared with other state-of-the-art methods.

(IoT) [1], such as intelligent transportation, video surveillance, etc. With the development of mobile devices, videos from automobiles, UAVs, robots, and mobile phones offer more data and bring greater challenges for MOT. In this article, we address the motion modeling problem of MOT on mobile devices, where it is difficult to measure and predict object motion without additional sensors or precalibration.
Recently, remarkable progress has been achieved in object detection [2], [3], [4], which promotes the popular tracking-by-detection paradigm for MOT. Despite the high accuracy of detectors, false and missing detections still have an impact. To solve this problem, trackers [5], [6], [7], [8] are proposed to generate tracklets (short trajectories) with high confidence to reduce false positives (FPs) in detections. Whether objects are tracked as detections or as tracklets, the key problem is to correctly associate objects among multiple frames. To find the optimal association, many successful algorithms have been proposed, e.g., min-cost flow [9], conditional random field [10], [11], [12], multiple hypothesis tracking (MHT) [13], etc. The associations are built on an affinity measurement, which considers appearance consistency and motion prediction. In crowded scenarios, the lower similarity discrimination caused by illumination and pose changes makes appearance unreliable. Therefore, motion information is applied as another basis of association.
In the traditional surveillance system, cameras are assumed to be static, so motion information can be obtained directly from image coordinates. On mobile devices, however, the relative movement between the object and the camera leads to great changes in image coordinates, which poses additional difficulty for the MOT system. For instance, as shown in Fig. 1, the object in the red box is almost in sync with the camera, so its image coordinates change little. In contrast, both the size and image coordinates of the blue one vary greatly. Our proposed model uses world coordinates, which eliminates the interference caused by mobile devices. As shown in Fig. 1(d), the movements of the two objects have a similar pattern except for the opposite direction. This phenomenon makes it difficult to build a unified motion model from image coordinates. Therefore, associating with world coordinates is effective for measuring object motion on mobile devices. In Fig. 1(c) and (d), the black dashed arrow indicates the distance between the starting positions of the two objects in the vertical direction. In world coordinates, there is a significant distance between the two objects, whereas their image coordinates are highly similar. As a result, spatial constraints are sensitive when using image coordinates, but an algorithm using world coordinates can detect and track objects more accurately. Consequently, it is necessary to construct a motion model in world coordinates for MOT on mobile devices.
When tracking on mobile devices, acquiring the world coordinates of the objects usually relies on multivision, depth sensors, radar, etc., which brings extra hardware and computing expenses. Considering that image coordinates take the camera as reference, the camera trajectory can be used to compute the ground position of the objects indirectly. To obtain the camera trajectory, many motion-position estimation methods [14], [15], [16] have been proposed. However, these approaches are not designed for MOT and are unsuitable for locating the camera in a monocular system without additional information (e.g., location from GPS and depth from an RGB-D sensor). In addition, the large number of moving objects in the video sequences makes point matching between frames difficult.
In this article, we propose a hybrid motion model to address the challenges posed by mobile devices in MOT. Because objects move in different directions relative to the camera, the motion states of similar objects differ greatly, and these differences are amplified by the motion of the device. Therefore, it is difficult for the MOT system to establish a unified motion model. To solve this problem, existing MOT methods often rely on additional information, including multiple cameras, sensors, calibration, etc. These bring more hardware requirements and computational burdens. In a monocular uncalibrated system, our proposed method makes full use of the information (background, detections, and horizon) in the video scene to estimate the camera trajectory, and maps the objects from image to world coordinates along the camera trajectory. We use the geometric perspective with the horizon line to simplify the calculation and reduce error in visual mapping. Furthermore, high-confidence tracklets for association are generated from the ground position and height of objects.
Multiobject tracking in world coordinates can not only avoid the influence of device motion but also increase the discrimination of motion measurement. The change of motion state between adjacent tracklets is approximately stationary, but is usually nonstationary between tracklets separated by a long interval. In this case, the incompatible motion states of tracklets lead to inconsistent measurement and prediction by a single motion model. To solve this problem, a multimode motion filter (MMF) is proposed to estimate the motion of adjacent and long spaced tracklets. MMF establishes prediction and error estimation for different motion modes. Meanwhile, we propose a spatiotemporal evaluation mechanism (STEM) to evaluate the similarity of tracklets by the motion metrics in MMF and the appearance feature.
Experiments on the MOT15, MOT17, and KITTI benchmarks demonstrate higher tracking accuracy. As shown in the leaderboard, our method is competitive with other state-of-the-art trackers.
In summary, the main contributions of this article are as follows.
1) To solve the problem of object motion modeling under mobile devices, we propose a hybrid motion model based on world coordinates using scene information.
2) To adapt to the monocular uncalibrated system and reduce the computational complexity of projection, we propose smooth dynamic projection for object coordinate mapping according to the perspective of the imaging system with horizon.
3) To solve incompatibility between adjacent and long spaced tracklets, MMF is established for the adaptability of modeling.
4) To provide accurate affinity measurement in the tracklets association, STEM is proposed with an error variance estimator of motion.

The remainder of this article is organized as follows. Related work is discussed in Section II. The hybrid motion model for mobile devices is presented in Section III. MHT based on the hybrid motion model is described in Section IV. The experimental results are shown and analyzed in Section V, followed by the conclusion in Section VI.

II. RELATED WORK
In this section, we analyze the merits and weaknesses of recent tracking methods, especially those for mobile devices.

A. Tracking-by-Detection
With preprovided detections, MOT methods focus on data association algorithms, which are divided into online and batch methods according to whether the information of subsequent frames is considered. Online methods [17], [18], [19], [20], [21], [22], [23] meet the needs of real-time processing without subsequent information, but sacrifice trajectory integrity. Most of these methods [17], [18], [19], [20], [21], [22] focus on improving the detector through spatiotemporal affinity, but rely on redetection in tracking. Stadler and Beyerer [23] solved the occlusion problem through heuristic trajectory management. Zhang et al. [24] proposed a multiplex labeling graph for near-online tracking in intelligent devices. Batch methods utilize global information to achieve higher tracking accuracy at the expense of speed. The methods proposed in [25] and [26] solve MOT with a lifted disjoint paths model, which is conducive to global optimization. Graph networks are naturally suitable for modeling MOT problems. With the development of the graph neural network (GNN), some GNN-based methods [27], [28] have recently been proposed for further association.
MHT, proposed in [29], is one of the earliest successful methods. The main idea of MHT is to establish a hypothesis tree for all possible association nodes, then evaluate and solve the global hypothesis. With dense objects in long videos, this strategy incurs large time and space costs. To overcome this defect, the hypothesis decision is transformed into a maximum-weight-independent set (MWIS) problem [30], and Sheng et al. [7] proposed a category transfer model for further efficiency optimization. To apply the method to mobile devices, we incorporate pruning and gating strategies and use a sliding window. The number of hypotheses is significantly reduced, so that the algorithm achieves near-online performance.
To improve the accuracy of tracking, some methods use tracklets as association units instead of detections. In [5], [6], [7], [8], and [31], the tracklets are generated from detection for association. By reducing detection errors and improving the reliability of association, tracking-by-tracklet has achieved success. Therefore, we use tracklet-level MHT (TLMHT) proposed in [7] as the baseline to implement our method.

B. Tracking in Mobile Devices
Tracking multiple objects on mobile devices is a complex problem involving many vision techniques. Solutions can be divided into three categories according to the vision sensor. First, vision matching is established by multiple cameras with preacquired location or calibration. In [32], [33], [34], [35], and [36], a stereo camera pair is used to track objects. Stereo cameras provide additional information to estimate camera motion and restore the world coordinates of the objects. Second, a 3-D reconstruction can be obtained to detect and track objects if depths are provided. In [37], [38], [39], and [40], RGB-D data or laser points provide the basis for segmentation and detection of the object, and the global motion of the object can then be estimated from the depth change. These methods depend on a depth sensor and need much computation in complex scenes with varied motion. Third, in the monocular case, precalibration with assumptions, GPS, or odometry is used to estimate the camera trajectory. Ess et al. [41] established a basic paradigm to track multiple objects on a mobile platform. Wojek et al. [42] focused on 3-D scene understanding for traffic scenes. Then, Choi et al. [43] proposed a general framework by integrating the tracking method in mobile devices. However, calibration is still required and the camera speed is assumed to be constant. Some methods [44], [45] are designed with lower computation expenses for tracking on UAVs. Gao et al. [16] used odometry in a mobile phone to track vehicles in GPS-blocked environments. To cope with dense scenes, these methods rely on additional sensor information and lack adaptability to complex motion interaction. In contrast, we propose a hybrid motion model with minimum requirements and assumptions to achieve competitive results on the latest benchmarks.

C. Motion Estimation
In terms of visual odometry, many successful methods have been proposed in the SLAM field. The method proposed in [15] is based on an RGB-D sensor, and the method in [14] calibrates the monocular vision system precisely. Without additional information, our method is initialized with the detection information and is suitable for MOT applications. For motion segmentation, Liu et al. [46] and Shen et al. [47] used optical flow to acquire point trajectories for segmentation, which inspired us to perform background segmentation under mobile devices.
Filter-based motion models have achieved success by considering multiple errors. The Kalman filter, proposed early in [48], provides a basic framework. However, a single motion model is not suited to unpredictable motion changes. To model multiple modes, Bar-Shalom [49] used multiple Kalman filters with different transition matrices. Genovesio et al. [50] established a probability model to measure the switching between modes. Recently, LSTM-based motion estimation [8] has been proposed with a neural network, but it relies on a specialized training step. Inspired by [50], we simplify the model representation and give the STEM for the MOT problem.

III. HYBRID MOTION MODEL FOR MOBILE DEVICES
In this section, we integrate the camera motion of the mobile device, the camera-object motion projection, and the different motion modes of objects into a hybrid motion model.

A. Model Overview
In order to detect and track multiple objects in mobile devices, the main idea of the hybrid motion model is to use world coordinates, which are not affected by camera motion. The world coordinates of an object can be projected from detections along with the camera trajectory. In this way, camera motion is required as an indirect quantity for coordinates mapping. As shown in Fig. 2, the model is mainly composed of three parts as follows.
Camera Motion: Unlike the methods in [41], [42], [43], and [51], our model does not require strict assumptions of precalibration or GPS data. Our method only needs to segment the background according to the optical flow and detections. To enhance robustness and reduce error, the camera trajectory is obtained by evaluating and maximizing the motion probability. Iterating frame by frame, the motion state is updated to obtain the camera trajectory.
Camera-Object Motion: With the camera trajectory, each object is mapped from image to world coordinates by smooth dynamic projection. Since both the objects and the camera are moving, a mapping method based on point matching is not feasible. To adapt to the dynamic change of object and camera, the projection is established by taking the horizon as reference. In the projection, the heights of the camera and object are smoothed by information in previous frames. Besides, only the focal length is required to obtain the position of each object in the world coordinate.
Object Motion: In motion modeling, after tracklet generation, it is inconsistent to measure and predict the motion state with the same parameters because of missing detections and long intervals between tracklets. To solve this problem, MMF is proposed for adjacent and long spaced associations. With motion prediction and spatial gating by MMF, we propose STEM to evaluate hypotheses in the MHT framework.
To clarify the meaning of the equations in this article, the styles of notation used are summarized in Table I.

Fig. 2. First, for camera motion, with segmented background, the camera trajectory is obtained by maximizing the probability of motion hypotheses (Section III-B). Second, for camera-object motion, the objects are projected to the world coordinate by smooth dynamic projection (Section III-C). Third, for object motion, with tracklets generated in Section IV-A, the motion of adjacent and long spaced tracklets is modeled by MMF (Section III-D) and candidate hypotheses are solved with STEM (Section IV-B).

TABLE I NOTATION

B. Camera Motion Estimation
In the first step of the hybrid motion model, camera motion estimation provides camera trajectory for world coordinate mapping. The basic geometric matching is initialized to obtain reliable feature points by finding the static background in adjacent frames.
Let $P_{t-1}$ and $P_t$ be the sets of matched feature points [52] in frames $t-1$ and $t$. The initial translation ($T_t$) and rotation ($R_t$) of the camera at frame $t$ are then obtained by the eight-point algorithm. To eliminate the noise caused by mismatching, RANSAC [53] is used iteratively to minimize a cost function in which $F^*_t$ is the fundamental matrix estimated from randomly selected pairs from $P_t$ in each iteration and $D_{sp}$ computes the Sampson distance of each pair $(P^i_{t-1}, P^i_t)$. The continuous pairing process often introduces accumulated and estimation errors. To reduce these errors, the video sequence is divided by a sliding window, in which camera motion hypotheses are proposed with different samples of frames, pairing points, and parameters. To obtain the optimal trajectory, the camera motion hypothesis in each video segment is measured by the probability function (2), which has three terms. Term (2)(a) is the flow similarity $F_t$ at time $t$ given the camera state set $C_t$; it controls the flow change caused by camera movement. Term (2)(b) is the hypothesis probability updated from $H^i_t$ to $H^{i+1}_t$ at time $t$, modeling the association between camera motion hypotheses. Term (2)(c) represents the transition, which controls the smoothness of camera movement in state set $C_t$ from time $t-1$ to $t$. Through the iterative process, the probability of the motion hypothesis is maximized, which yields the optimal motion estimate.
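As an illustration of how one RANSAC candidate could be scored, the sketch below computes the Sampson distance of each point pair under a candidate fundamental matrix and counts inliers. It is not the article's cost function: the truncated cost, the `inlier_threshold` value, and the function names are assumptions introduced here, and the example matrix corresponds to a pure sideways translation with identity intrinsics.

```python
def sampson_distance(F, p1, p2):
    """First-order geometric error of a pair (p1 in frame t-1, p2 in frame t).

    F is a 3x3 nested list; p1 and p2 are (u, v) image coordinates.
    """
    x1 = (p1[0], p1[1], 1.0)
    x2 = (p2[0], p2[1], 1.0)
    Fx1 = [sum(F[r][c] * x1[c] for c in range(3)) for r in range(3)]   # F x1
    Ftx2 = [sum(F[r][c] * x2[r] for r in range(3)) for c in range(3)]  # F^T x2
    residual = sum(x2[r] * Fx1[r] for r in range(3))                   # x2^T F x1
    denom = Fx1[0] ** 2 + Fx1[1] ** 2 + Ftx2[0] ** 2 + Ftx2[1] ** 2
    return residual ** 2 / denom if denom > 0 else float("inf")


def score_candidate(F, pairs, inlier_threshold=1.0):
    """Return (inlier count, truncated cost) for one candidate F.

    The truncation at inlier_threshold is an assumed robust cost.
    """
    inliers, cost = 0, 0.0
    for p1, p2 in pairs:
        d = sampson_distance(F, p1, p2)
        if d < inlier_threshold:
            inliers += 1
        cost += min(d, inlier_threshold)
    return inliers, cost


# Candidate F for a pure translation along the image x axis: skew matrix of (1, 0, 0).
F_translate_x = [[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
pairs = [((10.0, 20.0), (15.0, 20.0)),   # consistent with the candidate
         ((30.0, 40.0), (33.0, 40.0)),   # consistent with the candidate
         ((50.0, 60.0), (52.0, 65.0))]   # mismatched pair
inliers, cost = score_candidate(F_translate_x, pairs)
```

In a full RANSAC loop, candidates would be re-estimated from random subsets of pairs and the candidate with the lowest cost (or most inliers) retained.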

1) Initialization for Estimation:
Before hypothesis evaluation, the background segmentation must be initialized and the focal length estimated. The static area in the background overlaps across video fragments, which meets the point matching requirement. In MOT applications, images are generally full of various objects with complex interactions. Considering that detections describe the position of objects, a cluster model to segment the background is trained on pixel samples and optical flow.
First, for the pixel samples, real-time superpixel segmentation [54] is used to form the training set $T_v$ with labels $l$, in which $C_i$ represents the color feature of superpixel $i$ from the video and $D_{Hconf}$ represents the set of detections in each video whose confidence is higher than threshold $\theta_1$ among the raw detections. Then, we train a linear SVM with $T_v$ to provide a cluster score $S_{clu}$ for all superpixels. Second, optical flow [55] is used as local motion information. In the static background, the optical flow field is nearly a smooth surface in velocity space, whereas object motion is inconsistent with the static background. To measure the background probability, the distance score $S_{dis}$ is obtained from the distance between the object optical flow and the background optical flow ($S_{dis}$ is mapped to the range from 0 to 1).
Based on the fusion of $S_{clu}$ and $S_{dis}$, the measurement function $M$ is formulated with weight parameter $\theta_2$ to extract the static surface from the background in each video.
In camera motion estimation and object projection, the focal length is an essential parameter. The matching $x_{t-1} \sim M x_t$ between point pairs $(x_{t-1}, x_t)$ is used to estimate the focal length. The camera position change between adjacent frames is approximated as a rotation around the center of the background without translation. Radial distortion is also ignored in practice, so the mapping is expressed in terms of $K$, the internal parameter matrix of the camera, and $R_{t-1 \Rightarrow t}$, the rotation matrix between point pairs $(x_{t-1}, x_t)$. For simplicity of notation, a fixed pixel center is defined. The first two rows (or columns) of $R_{t-1 \Rightarrow t}$ must have the same norm and be orthogonal (even if the matrix is scaled). From these two constraints, the focal length can be obtained by solving either of the resulting equations.

2) Motion Hypothesis Measurement and Updating:
To evaluate the motion hypothesis of the camera, flow similarity is measured by camera movement from frame $t-1$ to $t$ and is computed in terms of optical flow $\varphi$. The predicted optical flow $\varphi_{C_t}$ at frame $t$ is obtained from the camera translation matrix $T_t$ and the current camera state $C_t$ (including the focal length and other parameters). The flow similarity $P(F_t \mid C_t)$ is computed over $P^*$, the selected set of pairing points, using $D$, which calculates the difference of flow displacement.
In the probability function, transition smoothness measures the stability of camera motion from frame $t-1$ to $t$. Unlike the assumption in [43] that the camera speed is constant over the whole video, our model allows the speed to change dynamically. The speed of the camera has a limited mathematical expectation and the acceleration is relatively stable. Therefore, the transition motion of the camera is modeled as a normal distribution to control smoothness, with $s_t$ as the scale factor of the transition. The scale factor is computed from the displacement of matching points on the axis with the maximum camera movement; $e$ denotes the axis of maximum movement and $T_t$ represents the velocity of the camera transition. The parameters, pairing points, and frames in each window are selected to propose different hypotheses. To maximize the probability of the motion hypothesis, we iterate over these hypotheses: the update from $H^i_t$ to $H^{i+1}_t$ is accepted when it yields the maximum $P_t$. In the camera position update, the model tends to adopt a conservative estimate, which indicates that the position of the camera tends to be stable. Therefore, the hypothesis update probability in (2)(b) is modeled using the current camera position $H^i_t$. The update strategy evaluates the probability according to (2). The motion hypothesis with higher flow similarity and translation stability is retained. With the sliding window, the camera trajectory is estimated for world coordinate projection.
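The selection step can be sketched as follows, assuming each hypothesis supplies a flow similarity in (0, 1] and a camera velocity along the dominant axis, and that smoothness is a normal density centered on the previous velocity with standard deviation equal to the scale factor. These modeling choices and all names are illustrative assumptions; the hypothesis-update term (2)(b) is omitted.

```python
import math


def log_normal_pdf(x, mean, std):
    """Log density of a 1-D normal distribution."""
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - (x - mean) ** 2 / (2.0 * std ** 2)


def hypothesis_log_score(flow_similarity, velocity, prev_velocity, scale):
    """Assumed combined score: log flow similarity plus log transition smoothness."""
    return math.log(flow_similarity) + log_normal_pdf(velocity, prev_velocity, scale)


def select_hypothesis(hypotheses, prev_velocity, scale):
    """Return the (name, flow_similarity, velocity) tuple with the highest score."""
    return max(hypotheses,
               key=lambda h: hypothesis_log_score(h[1], h[2], prev_velocity, scale))


# Three hypothetical candidates inside one sliding window.
hypotheses = [("A", 0.90, 3.0),   # best flow match, but an abrupt speed jump
              ("B", 0.85, 1.1),   # good flow match and smooth transition
              ("C", 0.40, 1.0)]   # smooth transition, poor flow match
best = select_hypothesis(hypotheses, prev_velocity=1.0, scale=0.5)
```

Working in log space keeps the product of probabilities numerically stable when many frames are chained.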

C. Smooth Dynamic Projection
In the second step of the hybrid motion model, the objects are mapped from image to world coordinates by smooth dynamic projection. To make the projection more robust and practical, we consider changes in the camera height $h_c$ and the pitch rotation angle $\alpha$. The motion of the objects and the camera is measured in the same plane (the ground). The basic geometric perspective with horizon line $h$ is shown in Fig. 3.
In the initial frame, the vertical projection of the camera on the ground is taken as the origin (0, 0, 0) of the world coordinates. The height of the camera above the ground plane is $h_c$. As shown in Fig. 3, the projection of each object is given by (11).

1) Horizon Height Estimation:
According to (11) and Fig. 3, in most scenarios, the horizon can be used to simplify calculations. We sample from the training set in experiments to estimate the horizon height $H^t_{h1}$, which follows a normal distribution with the image center height $0.5 \times H_{image}$ as mean and the scale factor $s_t$ as variance, as in (13). By detecting vanishing points [56] at lines parallel to the horizon, the position of the horizon is obtained for projection. The set of vanishing points $V = \{(x_v, y_v, c_v)_{1:v_n}\}$ with confidence $c_v$ can be obtained. To reduce the interference caused by lines not parallel to the ground, the horizon height $H^t_{h2}$ is modeled from the set of vanishing points $V$ as a normal distribution, as in (14). With (13) and (14), we combine the horizon estimation and the vanishing points to get the aggregated horizon height $H^t_{h3}$.

2) Update:
In (17), $h^t_c$ is computed by (11)(a) and (16). Both the camera and object heights are smoothed with the scale factor, and the world coordinates of each object are obtained according to the projection (11).
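The geometry behind a horizon-based projection can be sketched under simplifying assumptions that are not necessarily those of (11): a pinhole camera, flat ground, negligible pitch, and image rows increasing downward. The precision-weighted fusion of two normal horizon estimates is likewise an assumed stand-in for the aggregation step; all function and parameter names are hypothetical.

```python
def fuse_gaussians(mean_a, var_a, mean_b, var_b):
    """Precision-weighted fusion of two normal estimates of the horizon row."""
    var = 1.0 / (1.0 / var_a + 1.0 / var_b)
    mean = var * (mean_a / var_a + mean_b / var_b)
    return mean, var


def project_to_ground(u, v_bottom, v_top, horizon_row, focal_px, cam_height, center_u):
    """Map a bounding box to (lateral offset, depth, object height) in meters.

    Assumes flat ground and a level camera, so a ground point at depth Z
    appears focal_px * cam_height / Z rows below the horizon.
    """
    rows_below = v_bottom - horizon_row
    if rows_below <= 0:
        raise ValueError("box bottom must lie below the horizon")
    depth = focal_px * cam_height / rows_below
    lateral = (u - center_u) * depth / focal_px
    height = cam_height * (v_bottom - v_top) / rows_below
    return lateral, depth, height


# Fuse a prior horizon estimate with a vanishing-point estimate, then project.
horizon, horizon_var = fuse_gaussians(500.0, 100.0, 520.0, 100.0)
lateral, depth, height = project_to_ground(
    u=740.0, v_bottom=650.0, v_top=480.0, horizon_row=500.0,
    focal_px=1000.0, cam_height=1.5, center_u=640.0)
```

Only the focal length, the camera height, and the horizon row are needed here, which matches the article's point that the horizon simplifies the mapping.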

D. Multimode Motion Filter
In the third step of the hybrid motion model, the object motion is modeled by MMF with different motion parameters. The motion state vector of the object is expressed in world coordinates. In practice, all objects are considered to move in the same plane ($z = 0$), so the following descriptions are in 2-D form to simplify the equations.
For the MOT problem, tracklets are often broken into fragments by missing detections and occlusion. When the interval between tracklets is short, the speed of the object does not change dramatically, so the prediction of the motion model can achieve high accuracy. After a long interval, however, the same motion model is often unable to predict the location of the object because its movement is uncertain. Therefore, we divide the motion of the object into two modes: 1) continuous and 2) discontinuous. In the continuous mode, the tracklets are connected by adjacent association and the motion state is relatively stable. In the discontinuous mode, the interval between tracklets clearly affects the state of movement. To model the two motion modes and adapt to the disturbance of the mobile device, state prediction and measurement of object motion based on multiple Kalman filters are defined as follows:

$$x_{t+1} = F^{m_i} x_t + w_t \tag{18}$$
$$z_t = H x_t + v_t \tag{19}$$

where $m_i \in M$, $i \in \{0 : n\}$, denotes the successive modes, and the frame set $T \in \{T_1 : T_n\}$ identifies the beginnings and ends of the modes. The matrix $H$ is the observation matrix of $z_t$. Vectors $w_t$ and $v_t$ represent the model noise and the measurement noise, respectively. The transition matrix $F$ is associated with the mode $m_i$, which defines the motion mode of the filter. Specifically, two transition matrices are defined: $F_1$ for the continuous mode, which models uniform linear motion, and $F_2$ for the discontinuous mode, which retains the transmission of the speed vector $v(v_{xt}, v_{yt})$ in the estimation of the state parameter. For a discontinuous motion filter, $v$ is ignored if the estimated gain is low.
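A minimal sketch of the prediction step for two modes is given below. The state ordering `[x, y, vx, vy]` and both matrices are assumptions made for illustration: `F1` is the standard constant-velocity matrix with unit time step, while `F2` is a hypothetical choice that carries the velocity forward without advancing the position; the article's actual $F_2$ may differ.

```python
def matmul(A, B):
    """Multiply two matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def transpose(A):
    return [list(row) for row in zip(*A)]


def kalman_predict(F, x, P, Q):
    """Kalman prediction: x' = F x and P' = F P F^T + Q."""
    x_pred = [sum(F[r][c] * x[c] for c in range(len(x))) for r in range(len(x))]
    FPFt = matmul(matmul(F, P), transpose(F))
    P_pred = [[FPFt[i][j] + Q[i][j] for j in range(len(x))] for i in range(len(x))]
    return x_pred, P_pred


# Assumed state order: [x, y, vx, vy], unit time step.
F1 = [[1.0, 0.0, 1.0, 0.0],   # continuous mode: uniform linear motion
      [0.0, 1.0, 0.0, 1.0],
      [0.0, 0.0, 1.0, 0.0],
      [0.0, 0.0, 0.0, 1.0]]
F2 = [[1.0, 0.0, 0.0, 0.0],   # discontinuous mode (hypothetical): hold position,
      [0.0, 1.0, 0.0, 0.0],   # carry the velocity components forward
      [0.0, 0.0, 1.0, 0.0],
      [0.0, 0.0, 0.0, 1.0]]

identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
zeros = [[0.0] * 4 for _ in range(4)]
state = [0.0, 0.0, 2.0, -1.0]

x_cont, P_cont = kalman_predict(F1, state, identity, zeros)
x_disc, P_disc = kalman_predict(F2, state, identity, zeros)
```

The continuous mode grows the position uncertainty through the velocity coupling, whereas the hypothetical discontinuous mode relies on its process noise `Q` to widen the gate.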
With (18) and (19), each possible mode sequence for $M$ is considered to filter the state parameter optimally. To reduce the cost of parameterization, the possible mode at frame $t$ is described without approximation in terms of $P(x_{t+1} \mid z_{T_i:t+1}, m_i)$, which is assumed to follow a normal distribution $\mathcal{N}(\hat{x}^{m_i}_{t+1}, P^{m_i}_{t+1})$, where $\hat{x}^{m_i}_{t+1}$ is the mode-conditional mean and $P^{m_i}_{t+1}$ is the covariance matrix. $T_i$ represents the start time of the current motion mode. The conditional probability follows from the multiple motion filter. According to the consistency of observations and predictions, the uniform assumption of motion mode (23) takes the state estimate $\hat{x}$ from $m^*$, the motion mode with the highest probability.
The parameters of the different motion modes are mixed to predict and update the object motion. The association cycle of the multiple Kalman filter consists of the prediction step (24), in which $Q^{m_i}_t$ and $R_t$ are the white noise covariance matrices of the model noise $w$ and the measurement noise $v$, and the update step (25), in which $K^{m_i}_{t+1}$ is the adaptive Kalman gain, $I$ represents the identity matrix, and the update uses the measured innovation. Then, in (21), the mode probability $P(m_i \mid z_{T_i:t+1})$ is derived as

$$P(m_i \mid z_{T_i:t+1}) \propto P(z_{t+1} \mid z_{T_i:t}, m_i)\, P(m_i \mid z_{T_i:t}) \tag{26}$$

In the process of object motion estimation, if the next measurement contradicts the uniform assumption (23), the filter mode switches. To detect mode switches in tracking, the transport between the two modes is measured as in (28). According to (28), switches of the model between the two modes are detected, and each filter is reinitialized with the previous measurement parameters. However, in the discontinuous mode, the object motion state may experience an irregular change. Therefore, we use a smoothing strategy that fuses two independent filtering processes, one forward and one backward in temporal order, to solve this problem. In this way, the parameter differences between the two modes are balanced.
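The proportionality in (26) amounts to a Bayesian reweighting of the modes by how well each one predicted the new measurement. The sketch below assumes an isotropic 2-D normal innovation likelihood with a shared variance, which is a simplification introduced here; the function names are hypothetical.

```python
import math


def innovation_likelihood(z, z_pred, var):
    """Isotropic 2-D normal likelihood of measurement z given a mode's prediction."""
    d2 = (z[0] - z_pred[0]) ** 2 + (z[1] - z_pred[1]) ** 2
    return math.exp(-0.5 * d2 / var) / (2.0 * math.pi * var)


def update_mode_probabilities(priors, likelihoods):
    """Posterior over modes: prior times likelihood, then normalized."""
    unnormalized = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(unnormalized)
    if total == 0.0:
        return list(priors)   # no evidence either way: keep the priors
    return [u / total for u in unnormalized]


# Measurement close to the continuous-mode prediction, far from the other one.
z_next = (2.0, 0.0)
predictions = [(2.1, 0.0),   # continuous mode
               (0.0, 0.0)]   # discontinuous mode
likelihoods = [innovation_likelihood(z_next, p, var=1.0) for p in predictions]
posterior = update_mode_probabilities([0.5, 0.5], likelihoods)
most_probable_mode = posterior.index(max(posterior))
```

A change in `most_probable_mode` between consecutive frames is one simple way to flag a candidate mode switch, although the article's own switch test in (28) may be defined differently.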

IV. MULTIPLE HYPOTHESIS TRACKING BASED ON HYBRID MOTION MODEL
In this section, we implement the hybrid motion model in the MHT framework to solve the MOT problem. MHT establishes hypothesis trees for all possible tracklet associations, where the motion of nodes is measured by the proposed model. Considering that the confidence of tracklets has a great influence on tracking accuracy, world coordinates with height information are used to generate high-confidence tracklets. To select and evaluate candidate tracklets for hypothesis updating and solving, STEM is proposed based on the motion measurement in the model.
The tracking framework can be summarized in three steps.
1) Tracklet Generation: With world coordinates and height information provided by smooth dynamic projection, false detections are filtered out to generate high-confidence tracklets.
2) Hypothesis Updating: From the initial tracklet, MHT maintains hypothesis trees that contain all possible associations. The probability of each branch in a hypothesis tree is measured by STEM.
3) Hypothesis Solving: To avoid the exponential growth of hypotheses, the N-Scan pruning proposed in [13] is used to set a time window of size N. Then, the data association is formulated as MWIS to get object trajectories. We solve MWIS by the category transfer model proposed in [7].
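To make the MWIS formulation in the third step concrete, the sketch below solves a tiny instance by exhaustive search. It is only an illustration of the problem statement, not the category transfer model of [7]; the hypothesis names, weights, and conflict pairs are invented for the example, and the exponential search is only viable for very small graphs.

```python
from itertools import combinations


def max_weight_independent_set(weights, conflicts):
    """Exhaustive MWIS.

    weights: dict mapping hypothesis name to score.
    conflicts: set of frozenset({a, b}) pairs that cannot both be selected,
    for example two track hypotheses that share a tracklet.
    """
    nodes = list(weights)
    best_set, best_score = (), 0.0
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            if any(frozenset(pair) in conflicts for pair in combinations(subset, 2)):
                continue   # subset contains two incompatible hypotheses
            score = sum(weights[n] for n in subset)
            if score > best_score:
                best_set, best_score = subset, score
    return set(best_set), best_score


# Four hypothetical track hypotheses; neighbors in the chain conflict.
weights = {"h1": 3.0, "h2": 2.5, "h3": 2.0, "h4": 1.0}
conflicts = {frozenset({"h1", "h2"}), frozenset({"h2", "h3"}), frozenset({"h3", "h4"})}
selected, total_score = max_weight_independent_set(weights, conflicts)
```

Pruning with a time window, as in N-Scan, keeps the number of nodes small enough that practical solvers remain tractable.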

A. Tracklet Generation by Smooth Dynamic Projection
The scale relationship between each object and the scene is estimated by the mapping between image and world coordinates. The resulting height information can be used to further filter erroneous detections and generate more accurate tracklets.
Given the detection set $D_t = \{d_{1:k}\}$ at frame $t$, FPs in $D_t$ can be filtered out before generating tracklets. Since all objects are assumed to move on the ground, the bottom $b_i$ of $d_i \in D_t$ is expected to lie below the horizon estimated in Section III-C. Therefore, the detection confidence $c_{d_i}$ is modified by (29), and the height $h_i$ of object $i$ should obey the normal distribution $h_i \sim \mathcal{N}(\hat{h}_i, 0.5)$ under perspective, where $\hat{h}_i$ is the estimated expected mean height. Detections with height probability less than threshold $\theta_3$ are filtered out.
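The filtering rule can be sketched as below. Several details are assumptions: the value 0.5 is treated as a standard deviation in meters, the height probability is normalized so that its peak equals 1, and detections whose bottom lies on or above the horizon are simply discarded rather than having their confidence rescaled as in (29). The dictionary keys are hypothetical.

```python
import math


def height_probability(height, expected_height, std=0.5):
    """Normal density relative to its peak, so the value lies in (0, 1]."""
    return math.exp(-0.5 * ((height - expected_height) / std) ** 2)


def filter_detections(detections, horizon_row, expected_height, threshold):
    """Keep detections that stand on the ground and have a plausible height."""
    kept = []
    for det in detections:
        if det["bottom_row"] <= horizon_row:
            continue   # box bottom is not below the horizon (rows grow downward)
        if height_probability(det["height_m"], expected_height) < threshold:
            continue   # implausible physical height after projection
        kept.append(det)
    return kept


detections = [
    {"id": 1, "bottom_row": 650.0, "height_m": 1.7},   # plausible pedestrian
    {"id": 2, "bottom_row": 640.0, "height_m": 4.0},   # far too tall
    {"id": 3, "bottom_row": 450.0, "height_m": 1.6},   # bottom above the horizon
]
kept = filter_detections(detections, horizon_row=500.0, expected_height=1.7, threshold=0.1)
kept_ids = [det["id"] for det in kept]
```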
In tracklet generation, a K-partite graph is modeled for the detection set $D_{k_1:k_2}$ from frame $k_1$ to $k_2$ in window $k$. Unlike TLMHT [7], we only use world coordinates to generate high-confidence tracklets for near-online tracking. The K-partite graph is defined as follows.
1) Node Set N: In initialization, N only includes detections. In subsequent iterations, N also contains the tracklets left by the previous matching window whose length is less than threshold $\theta_4$.
2) Edge Set E: Besides the edges between detections, E also contains the edges between tracklets and detections.
3) Weight Set W: W represents the similarity of the nodes joined by each edge, measured by appearance feature and motion state. The weight includes a motion measurement using world coordinates (X, Y).

Each graph is associated with a score set and a weight set. Scores serve as cost coefficients in the various discrete optimization formulations used to rank solutions. The graph can be solved by the linear programming proposed in [8] to find the maximum sum of all weights. By solving the K-partite graph, tracklets are generated as the nodes of MHT. In addition, we propose an improvement that incorporates the remaining individual detections into the nodes to improve the recall of the tracking method.

B. STEM for Hypothesis Updating and Solving
In hypothesis updating, MHT maintains possible tracklets in each window. Existing hypotheses are updated to link all candidate nodes of tracklets. The evaluation of the branch in hypothesis tree determines the effect of tracking method. Based on the hybrid motion model, the object motion change is used to calculate the weight score of edge.
The edge between adjacent tracklets is considered to represent continuous motion mode, and the edge between long spaced tracklets belongs to discontinuous motion mode. Here, MMF is used to measure the object motion and predict the mode switch. Due to the false and missing detection, if the tracklets are mismatched in spatial area, a dummy node is predicted according to the object motion model of discontinuous mode by (24).
To improve the recall and precision of association in mobile devices, we propose a STEM based on a variance estimator. The count of prediction errors at time t is $C(e)^{m_i}_t$ for the current mode $m_i$. The parameters are estimated by (31), where $\bar{e}_t$ is the average error at time t, $d^{m_i}_{t+1}$ represents the mean deviation of $\bar{e}_t$, and $M^{m_i}_{t+1}$ is the square matrix of $d^{m_i}_{t+1}$. The covariance matrix is then estimated by (32). Following the iteration of (31) and (32), the basic spatial gating can be estimated by a normal distribution with the current state and covariance matrix. When the object motion switches to the discontinuous mode, the optimal association always falls outside the estimated spatial range. To deal with this problem, the covariance matrix is reinitialized on the mode switch: the parameters are reset to their initial values $(C(e)_0, d_0, M_0)$ when a switch to the discontinuous mode is detected. To obtain a robust spatial gating prediction, a smoothing method similar to that of [57] is applied in the iteration.
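The iterate-then-reset behavior described above can be sketched in Python as follows. This is only an illustration under stated assumptions: a standard Welford-style running covariance is used as a stand-in for the estimator of (31) and (32), the class and function names are hypothetical, and the chi-square gate value is a conventional choice rather than a value taken from this article.

```python
class ErrorCovarianceEstimator:
    """Running mean and covariance of 2-D prediction errors (Welford-style)."""

    def __init__(self, init_cov=((1.0, 0.0), (0.0, 1.0))):
        self.init_cov = [list(row) for row in init_cov]
        self.reset()

    def reset(self):
        """Return to the initial values, e.g. on a switch to the discontinuous mode."""
        self.count = 0
        self.mean = [0.0, 0.0]
        self.m2 = [[0.0, 0.0], [0.0, 0.0]]  # accumulated deviation products

    def update(self, error):
        """Fold one prediction error (ex, ey) into the running statistics."""
        self.count += 1
        delta = [error[k] - self.mean[k] for k in range(2)]
        self.mean = [self.mean[k] + delta[k] / self.count for k in range(2)]
        delta2 = [error[k] - self.mean[k] for k in range(2)]
        for r in range(2):
            for c in range(2):
                self.m2[r][c] += delta[r] * delta2[c]

    def covariance(self):
        """Sample covariance, or the initial covariance with fewer than two errors."""
        if self.count < 2:
            return [row[:] for row in self.init_cov]
        return [[v / (self.count - 1) for v in row] for row in self.m2]


def inside_gate(residual, cov, gate=9.21):
    """Chi-square gate (2 d.o.f., about 99%) on the squared Mahalanobis distance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0.0:
        return False  # degenerate covariance: no usable gate
    dx, dy = residual
    dist2 = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return dist2 <= gate


estimator = ErrorCovarianceEstimator()
for err in [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]:
    estimator.update(err)
cov_before_reset = estimator.covariance()
estimator.reset()  # mode switch detected
cov_after_reset = estimator.covariance()
```

Resetting widens the gate back to its initial size, which is the effect the reinitialization is meant to achieve once the accumulated statistics no longer describe the object motion.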
In hypothesis solving, the log-likelihood ratio is used to measure the motion similarity between hypotheses. The object location probability P is measured by the normal distribution $\mathcal{N}(x^{m_i}_{t+1}, \|Q^{m_i}_t\|)$, with $x^{m_i}_{t+1}$ and $Q^{m_i}_t$ predicted by (24) in Section III-D. For the null hypothesis $\varphi$, which indicates a false association, the probability is $P(L_t \in \varphi) = 1/\hat{A}$, where $\hat{A}$ represents the estimated area of the world coordinate. The motion score for hypothesis branch L at time t is defined as the log-likelihood ratio between H and $\varphi$, where H means that the associated nodes belong to the same object. Accordingly, the aggregated score $S^*$ of hypothesis $H = l_{1:k}$ combines the motion and appearance scores, where $c_{d_i}$ is the confidence of each detection in tracklet l and $\theta_5$ is the weight between the motion and appearance scores, computed by the adaptive method proposed in [58]. The appearance features of tracklets are represented by the linear mean of the detection features extracted by the real-time Re-Id network [59]. The score $S_{app}$ is the cosine similarity of the tracklet features: $w_{app,ij} = \cos(app_i, app_j)$. Using these similarity scores, the associations are solved as in the baseline [7] to obtain the trajectories of all objects.
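The two scores described above can be sketched in Python as follows. This is a hedged illustration rather than the exact scoring used here: the function names are hypothetical, a full 2 x 2 covariance matrix is assumed for the normal distribution, and the weighting by detection confidence and $\theta_5$ is not modeled.

```python
import math


def gaussian_log_pdf_2d(x, mean, cov):
    """Log-density of a 2-D normal distribution with a 2 x 2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    maha = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return -0.5 * maha - math.log(2.0 * math.pi) - 0.5 * math.log(det)


def motion_llr(observed, predicted, cov, area):
    """ln P(L | H) - ln P(L | null), with the null hypothesis uniform over `area`."""
    return gaussian_log_pdf_2d(observed, predicted, cov) + math.log(area)


def cosine_similarity(u, v):
    """Cosine of the angle between two appearance feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


identity = ((1.0, 0.0), (0.0, 1.0))
area = 100.0  # assumed world-coordinate area in square metres
near = motion_llr((1.0, 1.0), (1.2, 0.9), identity, area)
far = motion_llr((1.0, 1.0), (9.0, 9.0), identity, area)
```

A positive ratio (`near`) favors the association over the null hypothesis, while a negative one (`far`) means that the uniform false-association model explains the observation better.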

V. EXPERIMENTS
In this section, we first introduce the data sets and evaluation metrics. Then, the parameter settings are presented together with a visual analysis. To verify the effect of our method, we perform a component analysis and compare it with various methods. Through ablation experiments, each component is evaluated quantitatively. For the overall results, our method is compared with state-of-the-art methods on the MOT and KITTI data sets.

A. Data Sets and Metrics
We evaluate the performance of our method on the MOT15 [60], MOT17 [61], and KITTI [62] data sets, which are widely used to evaluate MOT performance under the tracking-by-detection paradigm. The MOT15 data set consists of 22 video sequences divided into 11 training sets and 11 testing sets, and the MOT17 data set consists of 14 video sequences divided into seven training sets and seven testing sets. In MOT15 and MOT17, the videos are filmed under different lighting conditions, shooting angles, and object densities with both static cameras and mobile devices. Because public detections are provided, the evaluation focuses on the performance of the tracking algorithms. In addition, in MOT17, the detections of each video are obtained by the three detectors proposed in [2], [3], and [4] to balance the impact of different detectors. In the KITTI data set, the videos are captured by a vehicle-mounted camera without public detections, and the data set provides camera calibration, stereo views, laser points, and GPS data, which supports a variety of 3-D tracking methods.
The evaluation metrics include the CLEAR MOT metrics [63], identity switches (ID Sw.) [64], the IDF1 score [65], and higher order tracking accuracy (HOTA) [66]. The MOT accuracy (MOTA) shows the comprehensive MOT performance by combining three error sources: 1) ID Sw.; 2) FPs; and 3) missed objects (FN). IDF1 is the ratio of correctly identified detections to the average number of ground-truth and computed detections. HOTA is the geometric mean of detection accuracy and association accuracy, averaged across localization thresholds. MT and ML are the ratios of mostly tracked (tracked for at least 80% of their lifespan) and mostly lost (lost for at least 80% of their lifespan) objects. Frag is the total number of times a trajectory is fragmented.
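For reference, the three summary metrics can be written as short Python functions. This sketch uses the standard definitions from [63], [65], and [66]; the function and argument names are illustrative, and the counts in the example are made up.

```python
import math


def mota(fn, fp, id_sw, num_gt):
    """MOTA = 1 - (FN + FP + ID Sw.) / number of ground-truth boxes."""
    return 1.0 - (fn + fp + id_sw) / num_gt


def idf1(idtp, idfp, idfn):
    """IDF1 = 2 IDTP / (2 IDTP + IDFP + IDFN)."""
    return 2.0 * idtp / (2.0 * idtp + idfp + idfn)


def hota(det_a, ass_a):
    """Mean over localization thresholds of sqrt(DetA * AssA).

    det_a and ass_a hold one detection and one association accuracy value
    per localization threshold.
    """
    per_threshold = [math.sqrt(d * a) for d, a in zip(det_a, ass_a)]
    return sum(per_threshold) / len(per_threshold)


# Made-up counts for a toy sequence with 100 ground-truth boxes.
example_mota = mota(fn=10, fp=5, id_sw=5, num_gt=100)
example_idf1 = idf1(idtp=80, idfp=10, idfn=30)
example_hota = hota(det_a=[0.64, 0.36], ass_a=[0.25, 0.25])
```

Because MOTA subtracts errors from one, it can become negative when the number of errors exceeds the number of ground-truth boxes, whereas IDF1 and HOTA stay within [0, 1].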

B. Parameters and Visual Analysis
In this section, we analyze the parameters of the method. In addition, selected tracking visualization results are shown in Fig. 4.
Parameters: Threshold $\theta_1$ is used to select detections for background segmentation in Section III. Based on the detection quality in the training set, $\theta_1$ is set to 0.9, because most correct detections have a confidence higher than this threshold. In Sections III-C and IV-A, the average real-world height of the object $h_0$ is fixed to 1.7 m, which represents the mean height of pedestrians. As shown in Fig. 5, we evaluated the effects of $\theta_2$, $\theta_3$, $\theta_4$, and NRatio on MOTA on the MOT17 and KITTI data sets. Because the parameters are relatively independent, when one parameter is evaluated, the other parameters are fixed at their optimal settings. The weight parameter $\theta_2$ for background segmentation is set to 0.55, which gives the best balance of MOTA between the MOT17 and KITTI data sets. When $\theta_3$ is set to 0.3, i.e., detections whose estimated height lies in the range (1.45, 1.95) m are retained, MOTA reaches its highest values of 56.6% and 85.2% on the two data sets, respectively. As $\theta_3$ increases, more reliable detections are filtered out, so MOTA drops rapidly. In Section IV-A, to balance the tracklet length and computational efficiency, the maximum tracklet length $\theta_4$ is set to 5, as in the baseline method [7]. N determines the computing window in hypothesis solving, which is related to the rate at which the object state changes over time. Thus, we set N as a percentage NRatio of the frame rate. When NRatio is set to 50%-70%, MOTA on each data set achieves its highest value. Owing to the complexity of the algorithm, the memory cost and computing time increase exponentially with N. We therefore set NRatio to 50% to balance tracking accuracy against the space-time cost.
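The settings above can be collected in a small configuration, together with the conversion from NRatio to a window length. This Python fragment is a minimal sketch: the dictionary keys and the `window_size` helper are hypothetical names, and rounding to the nearest frame is an assumption, since the rounding rule is not stated.

```python
# Parameter values as reported in this section.
PARAMS = {
    "theta_1": 0.9,   # detection confidence threshold for background segmentation
    "theta_2": 0.55,  # weight parameter for background segmentation
    "theta_3": 0.3,   # height-estimation filter parameter
    "theta_4": 5,     # maximum tracklet length
    "n_ratio": 0.5,   # window length as a fraction of the frame rate
    "h_0": 1.7,       # assumed mean pedestrian height in metres
}


def window_size(n_ratio, frame_rate):
    """Window length N in frames; rounding to the nearest frame is an assumption."""
    return max(1, round(n_ratio * frame_rate))


n_30fps = window_size(PARAMS["n_ratio"], 30)
n_10fps = window_size(PARAMS["n_ratio"], 10)
```

Tying N to the frame rate keeps the temporal extent of the window roughly constant (about half a second here) across sequences recorded at different frame rates.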
In summary, to obtain the best results, we use the following parameter settings in the experiments: $\theta_1 = 0.9$, $\theta_2 = 0.55$, $\theta_3 = 0.3$, $\theta_4 = 5$, and NRatio = 50%.
Visual Qualitative Analysis: For background segmentation, qualitative results are shown in Fig. 6. Darker areas in Fig. 6(b) have higher background scores under the cluster measurement. Because it uses the color features of the detections as the training set, the cluster method is more sensitive to color. However, it fails when the object color is close to the cluster center. As shown in Fig. 6(b), the pants of the man in the middle are not distinguished. In Fig. 6(c), the distance between the pixel velocity and the optical flow is shown in different colors. Pixels with a higher distance are painted red (the person on the right side), and pixels with a lower distance are painted blue. The distance measurement based on optical flow is more sensitive to motion differences, but it fails when objects move synchronously with the background. By combining the cluster and distance scores, a more accurate fusion result is obtained, as shown in Fig. 6(d). Fig. 4 shows the estimated camera trajectory of the MOT17-05 sequence and selected frames of the MOT17-14 sequence for visual comparison. As shown for the MOT17-05 sequence, our method accurately restores the positions of the objects in the world coordinate. As indicated by the arrows in Fig. 4, compared with MPNTrack [27] and Lif_T [25], our method tracks more objects and maintains continuous tracking under occlusion.

C. Effect Analysis for Tracklet Generation and Motion Model
In this section, we analyze the effect of tracklet generation and motion model with MMF and STEM.
As shown in Table II, Tracklet_s represents the tracklet generation method based on smooth dynamic projection. We use different methods [5], [7], [8] to generate tracklets as input. Using the height information provided by the projection, erroneous detections are filtered out and the confidences of the detections are corrected. Therefore, both the recall and the ID consistency of the tracklets are improved for the association. Evaluated on the MOT17 training set, our method achieves the highest MOTA, IDF1, and HOTA. To further verify the quality of the tracklets in our method, we combine our tracklets with three open-source trackers [19], [20], [21], which are widely used for online MOT. As shown in Table III, combining them with Tracklet_s improves all of these methods on the main metrics.
TABLE VII: COMPARISON ON MOVING SEQUENCES IN THE MOT15 TEST SET
TABLE VIII: COMPARISON ON MOVING SEQUENCES IN THE MOT17 TEST SET
As shown in Tables IV and V, for the motion model, we compare our proposed MMF and STEM with four mainstream motion models [7], [8], [20], [21] on the MOT17 and KITTI data sets. Greedy represents distance-based greedy matching, K represents the Kalman filter, IOU represents measurement based on the IOU value, G represents threshold-based gating, and LSTM represents a motion prediction network based on LSTM. By providing the prediction probability and error estimation for different motion modes, more tracklets are associated with lower FP, FN, and ID Sw., which allows our method to achieve better results on the MOT17 and KITTI data sets.
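To make two of the baseline measurements concrete, the following Python sketch computes the IOU of two boxes and applies greedy matching with a threshold gate. It illustrates the generic techniques only, not the specific implementations of [7], [8], [20], or [21]; the box format (x1, y1, x2, y2), the function names, and the `min_iou` value are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0.0 else 0.0


def greedy_match(tracks, detections, min_iou=0.3):
    """Repeatedly take the highest-IOU pair still free; stop below the gate."""
    pairs = sorted(
        ((iou(t, d), i, j)
         for i, t in enumerate(tracks)
         for j, d in enumerate(detections)),
        reverse=True,
    )
    used_tracks, used_dets, matches = set(), set(), []
    for score, i, j in pairs:
        if score < min_iou:
            break  # all remaining pairs fall below the gate
        if i in used_tracks or j in used_dets:
            continue
        used_tracks.add(i)
        used_dets.add(j)
        matches.append((i, j))
    return matches


tracks = [(0.0, 0.0, 2.0, 2.0), (10.0, 10.0, 12.0, 12.0)]
detections = [(10.0, 10.0, 12.0, 13.0), (0.0, 0.0, 2.0, 2.0), (50.0, 50.0, 52.0, 52.0)]
matches = greedy_match(tracks, detections)
```

Measurements of this kind operate in image coordinates and carry no model of camera motion or mode switches, which is the gap that the prediction probability and error estimation described above are intended to close.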

D. Ablation Study for Hybrid Motion Model
We evaluated the components of our method on the moving sequences in the MOT17 and KITTI training sets, and the results are shown in Table VI. Model T indicates that only the generated tracklets are used for tracking. The ID Sw. and Frag of the trajectories are significantly reduced with the tracklets generated by our method. When only STEM is used, in Model M+S, both FP and FN of each sequence are reduced, which shows that STEM is effective in tracklet evaluation rather than merely trading FP against FN through parameter adjustment. In Model M+S, the constraints on tracklet association are stricter, but the ID Sw. and Frag of the trajectories do not increase significantly. To achieve the best tracking result, Model T+sM+S gives a more accurate evaluation, so that the integrity of each object trajectory is higher and FN is reduced. At the same time, the trajectories of different objects are correctly associated, which reduces FP. However, compared with the other sequences, the camera trajectory changes greatly in the MOT17-13 sequence, which leads to a slightly smaller improvement in the tracking results. The overall results show that our method effectively improves the accuracy of multiobject tracking in videos from handheld mobile cameras and vehicle-mounted cameras. In addition, the method is robust to scenes with different densities, different weather, and both indoor and outdoor environments.

E. Comparison on Benchmark
To evaluate the overall performance, our method is compared with published state-of-the-art methods on the MOT15, MOT17, and KITTI benchmarks. For a fair comparison, only methods using public detections are compared on the MOT benchmarks, and we use the same private detector as [27] to obtain results on the KITTI benchmark.
On moving sequences, as shown in Tables VII and VIII, our method achieves a significant improvement in MOTA, IDF1, and HOTA over all the compared methods on the MOT15 and MOT17 test sets. By generating tracklets with high confidence and modeling motion with MMF, the hybrid motion model significantly improves the ID consistency within and between tracklets. Furthermore, STEM maintains higher trajectory integrity without introducing more false associations. Therefore, FN decreases while FP grows only slightly, and more accurate trajectories are finally obtained.
In the overall benchmark results shown in Tables IX-XI, our method achieves the highest MOTA. By giving more accurate measurement and prediction of object motion, our method achieves high trajectory integrity (MT) for the video sequences. Moreover, our tracker is also effective for static cameras. Without camera motion estimation, the model can directly use the image coordinates of the objects for MMF and use STEM for tracklet association. In Table XI, in particular, we choose trackers that use three types of additional sensor data [35], [40], [51]. Compared with these methods, our method uses only monocular video data without calibration information and obtains better MOT results. The algorithms used in our method are all designed for quasi-real-time systems. With only the delay of the sliding window size, and using real-time detectors, our method can achieve near-online performance on mobile devices. The benchmark results can also be found on the MOT Challenge website¹ and the KITTI benchmark website.²

VI. CONCLUSION
In this article, a hybrid motion model was proposed to address the motion modeling problem of MOT in mobile devices. Through motion hypothesis evaluation, the camera motion was estimated for projection to world coordinates. Our method reduces the estimation error and avoids the requirement for additional information such as calibration. Using the horizon perspective, the smooth dynamic projection in the model extracts the world coordinate, which avoids the interference of camera motion and results in higher tracking accuracy. Meanwhile, MMF solves the motion measurement and prediction problem for different motion modes, and it adapts to object motion estimation under a moving camera. In the tracking framework, STEM provides a more accurate affinity measurement for tracklets. The experimental results showed that our method has simple parameter settings and high robustness. Comparisons on the MOT and KITTI benchmarks demonstrated competitive performance against other state-of-the-art methods.

¹ https://motchallenge.net/results/MOT17/
² http://www.cvlibs.net/datasets/kitti/eval_tracking.php