Orthogonal Single-Target Tracking

In this study, we propose a novel Wasserstein distributional tracking method that can balance approximation with accuracy in terms of Monte Carlo estimation. To achieve this goal, we present three different visual tracking systems: sliced Wasserstein-based (SWT), projected Wasserstein-based (PWT), and orthogonal coupled Wasserstein-based (OCWT). Sliced Wasserstein-based visual trackers can find accurate target configurations using the optimal transport plan, which minimizes the discrepancy between appearance distributions described by the estimated and ground truth configurations. Because this plan involves a finite number of probability distributions, the computational costs can be considerably reduced. Projected Wasserstein-based and orthogonal coupled Wasserstein-based visual trackers further enhance the tracking accuracy using bijective mapping functions and orthogonal Monte Carlo, respectively. Experimental results demonstrate that our approach can balance computational efficiency with accuracy and that the proposed visual trackers outperform other state-of-the-art visual trackers on several benchmark visual tracking datasets.


I. INTRODUCTION
Visual tracking is a fundamental technique used to predict the trajectories of target objects (e.g., vehicles). Recently, visual tracking performance has been enhanced by defining tracking problems in the Wasserstein space, which enables the accurate measurement of the distance between probability distributions. Because it can handle probability distributions, the Wasserstein distance has been used in various computer vision applications (e.g., classification [1], detection [2], visual tracking [3], and 3D representation [4]) and has been applied to several machine learning tasks (e.g., semi-supervised learning [5], adversarial learning [6], meta learning [7], reinforcement learning [8], and metric learning [9]).
Conventional visual tracking typically adopts matching metrics in the Euclidean space, e.g., the ℓ1 and ℓ2 norms, the Kullback-Leibler divergence, and the Jensen-Shannon divergence, which have several limitations under real-world visual tracking environments. For example, the ℓ1 and ℓ2 norms cannot accurately measure the discrepancy between distributions. The Kullback-Leibler divergence is asymmetric, whereas the Jensen-Shannon divergence is discontinuous and is not proportional to the discrepancy between distributions. Thus, a new matching metric is required in the Wasserstein space, which has rarely been explored in visual tracking. In particular, the Wasserstein distance can measure the discrepancy between the probability distributions of the reference appearance and the current target appearance at the estimated state. Because visual trackers explicitly consider the discrepancy of probability distributions, they can encode the uncertainty in measuring the distance from the distributional perspective.
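A small numerical illustration of these properties (not from the paper; function names are ours): the KL divergence between two discrete distributions changes when its arguments are swapped, while the 1-Wasserstein distance, computed here from the CDF difference on a shared 1D support, is symmetric.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def w1_1d(p, q, support):
    """1-Wasserstein distance between two discrete distributions on the
    same sorted 1D support, via W1 = integral of |CDF_p - CDF_q|."""
    dx = np.diff(support)
    cdf_gap = np.abs(np.cumsum(p) - np.cumsum(q))[:-1]
    return float(np.sum(cdf_gap * dx))

support = np.array([0.0, 1.0, 2.0])
p = np.array([0.5, 0.4, 0.1])
q = np.array([0.1, 0.2, 0.7])

# KL is asymmetric: KL(p||q) != KL(q||p) in general ...
print(kl(p, q), kl(q, p))
# ... while the Wasserstein distance is a symmetric metric:
print(w1_1d(p, q, support), w1_1d(q, p, support))  # both 1.0
```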
However, calculating the Wasserstein distance incurs high computational costs and is intractable in real-world settings with limited resources. To alleviate this problem, several methods approximate the Wasserstein distance. For example, Kolouri et al. [10] projected the distributions onto one-dimensional spaces and presented the sliced Wasserstein distance. Cuturi et al. [11] transformed optimal transport problems into maximum-entropy problems to speed up the computation and introduced the Sinkhorn distance. Genevay et al. [12] proposed a stochastic optimization method for dealing with large-scale optimal transport problems. While these methods have made the distance computation tractable, they inevitably degrade the accuracy of the Wasserstein distance.
Thus, it is important to balance approximation with accuracy in the computation of the Wasserstein distance. For this purpose, we adopt a variant of the sliced Wasserstein distance augmented by orthogonal coupling in the course of Monte Carlo simulation of the Wasserstein distance [13], called orthogonal coupled Wasserstein (OCW). Our OCW method can preserve the distance information in high-dimensional space, although the method approximates the Wasserstein distance to reduce the computational cost.

FIGURE 1: Framework of the proposed visual tracking system. The proposed visual tracker proposes a new state at each time and estimates the target appearance. Then, our visual tracker compares the reference target appearance with the estimated target appearance from the distributional perspective using three Wasserstein-based distances: the sliced Wasserstein distance, the projected Wasserstein distance, and the orthogonal coupled Wasserstein distance.
In this study, we aim to solve visual tracking problems using the proposed OCW. The proposed visual tracking method represents a target appearance vector as a target appearance distribution to cope with ambiguities in the appearance representation. Subsequently, the OCW accurately and efficiently minimizes the discrepancy between the estimated and ground-truth target appearance distributions to obtain an accurate target configuration.
The contributions of the proposed method are as follows:
• We develop a novel sliced Wasserstein-based visual tracking system (SWT), in which two appearance distributions described by estimated configurations and ground truth configurations become similar via the optimal transport plan. This plan can be conducted using a finite number of probability distributions; thus, the computational costs can be considerably reduced.
• We present a novel projected Wasserstein-based visual tracking system (PWT), in which the discrepancy between the aforementioned sliced Wasserstein distance and the true Wasserstein distance can be minimized using bijective mapping functions.
• We propose a novel orthogonal coupled Wasserstein-based visual tracking system (OCWT), in which the aforementioned projected distance can induce accurate projection directions using orthogonal Monte Carlo.

Figure 1 describes the framework of the visual tracking system. The remainder of this paper is organized as follows. Section II relates the proposed method to existing methods. Sections III, IV, and V propose visual tracking methods based on the sliced Wasserstein, projected Wasserstein, and orthogonal coupled Wasserstein distances, respectively. Section VI-A describes the experimental settings used in this study. We compare the proposed visual trackers with other state-of-the-art methods using the object tracking benchmark (OTB) and large-scale single object tracking (LaSOT) datasets in Sections VI-C and VI-D, respectively. Section VI-B analyzes the proposed visual trackers in depth. We conclude the study in Section VII.

II. RELATED WORK
While visual tracking has a long history, in this section, we discuss the methods most relevant to our study, which can be categorized into three groups: Wasserstein distributional visual tracking, visual tracking via projection, and deep learning-based visual tracking.

A. WASSERSTEIN DISTRIBUTIONAL VISUAL TRACKING
Yao et al. [14] transformed visual tracking problems into transportation problems via linear programming algorithms, where 1-Wasserstein distances (i.e., earth mover's distances) were used as a distance metric. Danu et al. [15] employed the Wasserstein distance in a particle filter formulation to compare estimated multi-target states with ground truths in multi-sensor environments. Zeng et al. [3] measured the discrepancy between target-specific features using the 1-Wasserstein distance to accurately track vehicles. Danis et al. [16] used the Wasserstein distance to evaluate Bluetooth data via a sequential Monte Carlo method.
In contrast to these methods, which use the Wasserstein distance to enhance visual tracking accuracy, we use the orthogonal coupled Wasserstein distance to balance accuracy with computational efficiency.

B. VISUAL TRACKING VIA PROJECTION
Xiao et al. [17] designed random projection matrices to find subspaces that make visual trackers robust to noise. Zhang et al. [18] transformed visual tracking problems into projection problems, in which a robust target representation model is learned via a projection onto the ℓp ball. Zhang et al. [19] proposed a visual tracker based on a structurally random projection for dimensionality reduction of the template space, in which the original distance is preserved with efficient computation. Danelljan et al. [20] projected color names onto an orthonormal basis of a 10-dimensional subspace to extract sophisticated color features for visual tracking.
In contrast to these methods that project the Euclidean space into the subspaces of the target appearance, we project the Wasserstein space and explicitly guide the projection direction for accurate visual tracking.

C. DEEP LEARNING-BASED VISUAL TRACKING
Li et al. [21] presented Siamese deep neural architectures combined with region proposal networks, which aimed to search for candidate regions for target objects. Valmadre et al. [22] proposed deep neural networks based on correlation filters that efficiently compared deep features with reference features. Zhang et al. [23] introduced very deep neural networks to extract representative features for accurate visual tracking. Bertinetto et al. [24] made Siamese networks fully convolutional for accurate and fast matching. Li et al. [25] applied meta information to deep neural networks for fast adaptation in different visual tracking environments and changes in target appearances. Zhu et al. [26] enhanced the discriminative power of deep neural networks using both negative and positive samples for target objects. Choi et al. [27] boosted the adaptive representation ability of deep neural networks using gradient information for visual tracking. Bhat et al. [28] used discriminative classifiers for deep neural networks, in which classifier weights were generated via a novel optimization technique. Guo et al. [29] presented dynamic Siamese network architectures that enable the update of target appearances online.
In contrast to these methods, we do not use complex deep neural architectures. Nevertheless, our proposed visual tracker exhibits state-of-the-art visual tracking performance, because target appearances are described by Wasserstein distributions; thus, several variations in target appearances can be covered during visual tracking.

D. OTHER VISUAL TRACKING
Li et al. [30] proposed a dual-regression framework for visual tracking, which combines a discriminative fully convolutional module (for discriminative ability) and a fine-grained correlation filter (for accurate localization). Fan et al. [31] introduced a novel interactive learning framework for visual tracking, in which multiple convolutional filter models interact with each other and their responses are fused based on confidence scores. Liu et al. developed robust visual trackers for thermal infrared objects based on multi-level similarity models under the Siamese framework [32], via the multi-task framework [33], and using pretrained convolutional neural networks [34].
Muresan et al. [35] introduced a multi-object tracking method based on an affinity measurement function and a context-aware descriptor for 3D objects. Karunasekera et al. [36] presented a multi-object visual tracking system using a new dissimilarity measure that considers object motion, appearance, structure, and size. Brasó and Leal-Taixé [37] proposed fully differentiable message passing networks for multi-object tracking, which is formulated in terms of network flows.
In contrast to these methods, we present a novel mathematical approach based on the Wasserstein distance to boost visual tracking performance. This approach can therefore be integrated into existing visual trackers to improve their performance. Note that using the Wasserstein distance enables us to exploit many useful mathematical properties.

III. SLICED WASSERSTEIN-BASED VISUAL TRACKING

A. SLICED WASSERSTEIN DISTANCE
The p-Wasserstein distance W_p measures the discrepancy between two probability distributions µ, ν ∈ P(R^d), where P(R^d) denotes the set of distributions defined on R^d with a finite p-th moment. We then define the p-Wasserstein distance as follows:

\[
W_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \| x - y \|^p \, d\gamma(x, y) \right)^{1/p}, \tag{1}
\]

where Γ(µ, ν) denotes the set of joint probability distributions whose marginals are µ and ν. In (1), we can find the optimal transport plan γ between µ and ν, inducing W_p.
The Wasserstein distance in (1) can directly consider probability distributions. However, it is difficult to handle the set of joint probability distributions Γ(µ, ν). Thus, conventional approaches [38] approximate ν as a finite set {ν_m}_{m=1}^M and W_p(µ, ν) as

\[
W_p(\mu, \nu) \approx \sum_{m=1}^{M} w_m W_p(\mu, \nu_m), \tag{2}
\]

where w_m denotes the m-th weight. As an alternative approach, µ and ν are assumed to be one-dimensional probability distributions (i.e., µ, ν ∈ P(R)). Then, we can find the optimal transport plan γ using a finite number of probability distributions, which can considerably reduce the computational costs. This approach induces the sliced Wasserstein distance [13], [39].
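The computational saving in the one-dimensional case comes from the fact that the optimal transport plan between two 1D empirical distributions simply matches sorted samples, with no linear program required. A minimal sketch (function names are ours, not the paper's):

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """p-Wasserstein distance between two 1D empirical distributions with
    the same number of samples: the optimal transport plan pairs the m-th
    smallest sample of x with the m-th smallest sample of y."""
    x_sorted = np.sort(np.asarray(x, dtype=float))
    y_sorted = np.sort(np.asarray(y, dtype=float))
    return float(np.mean(np.abs(x_sorted - y_sorted) ** p) ** (1.0 / p))
```

For example, `wasserstein_1d([0, 1, 2], [2, 0, 1], p=1)` is 0, because the two sample sets coincide after sorting, while shifting every sample by a constant c yields a distance of |c|.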
To compute the sliced Wasserstein distance, we define the unit sphere S^{d−1} in R^d. Subsequently, for a vector s ∈ S^{d−1}, we define the projection map proj_s, which transforms x ∈ R^d into ⟨s, x⟩ ∈ R (i.e., proj_s(x) = ⟨s, x⟩). We denote the pushforward of the probability distribution µ under proj_s by proj_s#(µ). Using proj_s#(µ), we can deal with one-dimensional probability distributions. The sliced Wasserstein distance W_p^slice is then defined as follows:

\[
W_p^{\mathrm{slice}}(\mu, \nu) = \left( \int_{S^{d-1}} W_p^p\!\left( \mathrm{proj}_s^{\#}(\mu), \mathrm{proj}_s^{\#}(\nu) \right) ds \right)^{1/p}. \tag{3}
\]
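For empirical distributions, the integral over S^{d−1} is typically estimated by Monte Carlo with a finite number of random directions, each 1D transport being solved by sorting. A self-contained sketch under these assumptions (names are illustrative):

```python
import numpy as np

def sliced_wasserstein(X, Y, n_projections=100, p=2, rng=None):
    """Monte Carlo estimate of the sliced p-Wasserstein distance between
    two empirical distributions given as arrays of shape (M, d).
    Directions s are drawn uniformly on the unit sphere S^{d-1}; each
    projection reduces the problem to a 1D transport solved by sorting."""
    rng = np.random.default_rng(rng)
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    d = X.shape[1]
    S = rng.normal(size=(n_projections, d))
    S /= np.linalg.norm(S, axis=1, keepdims=True)  # s ~ Uniform(S^{d-1})
    px = np.sort(X @ S.T, axis=0)                  # projected, sorted samples
    py = np.sort(Y @ S.T, axis=0)
    return float(np.mean(np.abs(px - py) ** p) ** (1.0 / p))
```

With a fixed seed, the estimate is symmetric in its arguments and vanishes when both sample sets coincide.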

B. VISUAL TRACKING
With W_p^slice, we present a novel sliced Wasserstein distance-based visual tracker. In the visual tracking context, µ and ν indicate the estimated and ground-truth target appearance distributions, respectively. We adopt empirical distributions for µ and ν, which are defined as follows:

\[
\mu = \frac{1}{M} \sum_{m=1}^{M} \delta_{x_m}, \qquad \nu = \frac{1}{M} \sum_{m=1}^{M} \delta_{y_m}, \tag{4}
\]

where δ denotes the Dirac delta function, and x_m ∼ µ and y_m ∼ ν denote samples describing the estimated and ground-truth target appearances, respectively.

Given the best target configuration at time t − 1, Ô_{t−1}, the goal of visual tracking is to find the best target configuration at time t, Ô_t. For this purpose, we randomly search for candidate configurations around Ô_{t−1}. Thus, our motion model is based on a normal distribution, as follows:

\[
O_t^{(c)} \sim \mathcal{N}(\hat{O}_{t-1}, \Sigma), \qquad c = 1, \ldots, C. \tag{5}
\]
In (5), O_t^{(c)} denotes the c-th candidate configuration, which is proposed based on a normal distribution with center Ô_{t−1} and standard deviation Σ. Subsequently, we measure the sliced Wasserstein distance W_p^slice between the appearance distributions described by candidate configuration O_t^{(c)} and ground-truth configuration O_t^GT, which are µ^{(c)} and ν, respectively. Our objective is to find the best index c*, in which the corresponding appearance distribution µ^{(c)} described by candidate configuration O_t^{(c)} minimizes the distance:

\[
c^* = \arg\min_{c} W_p^{\mathrm{slice}}(\mu^{(c)}, \nu), \tag{6}
\]

where the best target configuration at time t is Ô_t = O_t^{(c*)}.
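The propose-then-select loop above can be sketched as follows; `distance_to_reference` stands in for any of the paper's Wasserstein-based distances, and all function and variable names are illustrative, not from the paper:

```python
import numpy as np

def propose_candidates(prev_config, sigma, n_candidates, rng=None):
    """Motion model: draw C candidate configurations around the previous
    best configuration from a normal distribution with scale sigma."""
    rng = np.random.default_rng(rng)
    shape = (n_candidates, len(prev_config))
    return prev_config + rng.normal(scale=sigma, size=shape)

def select_best(candidates, distance_to_reference):
    """Pick the candidate whose appearance distribution is closest to the
    reference; `distance_to_reference` maps a configuration to a scalar
    distance (the SWT, PWT, and OCWT criteria all fit this interface)."""
    dists = [distance_to_reference(c) for c in candidates]
    c_star = int(np.argmin(dists))
    return candidates[c_star], dists[c_star]
```

As a toy check, with a Euclidean distance to a known target configuration, the selected candidate is exactly the argmin over all proposals.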

IV. PROJECTED WASSERSTEIN-BASED VISUAL TRACKING

A. PROJECTED WASSERSTEIN DISTANCE
Using the sliced Wasserstein distance, we can considerably reduce the computational cost, but we can obtain erroneous results, because a discrepancy exists between the sliced Wasserstein distance and the true Wasserstein distance. In particular, depending on s in (3), the projected vector proj_s(x) can be biased [40]: proj_s(x_m) < proj_s(x_{m'}) does not imply proj_s(y_m) < proj_s(y_{m'}). To solve this problem, bijective mapping has been introduced to measure the sliced Wasserstein distance [13]. Bijective mapping induces

\[
W_p^{\mathrm{bij}}(\mu, \nu) = \left( \frac{1}{M} \sum_{m=1}^{M} \left| \mathrm{proj}_s(x_m) - \mathrm{proj}_s(b(x_m)) \right|^p \right)^{1/p}. \tag{7}
\]

In (7), the bijective mapping b(·) can be implemented by sorting {y_m}_{m=1}^M, which results in {y_m^sort}_{m=1}^M, and selecting y^sort_{argsort(x_m)} for x_m, where argsort returns the indices that sort {x_m}_{m=1}^M and argsort(x_m) returns the index of x_m. Subsequently, the projection is conducted using a new projection vector s_new ∈ S^{d−1}, which is different from s in (7). Using s_new, we can prevent the aforementioned projection from being biased. The projected Wasserstein distance is then defined as

\[
W_p^{\mathrm{proj}}(\mu, \nu) = \left( \frac{1}{M} \sum_{m=1}^{M} \left| \mathrm{proj}_{s_{\mathrm{new}}}(x_m) - \mathrm{proj}_{s_{\mathrm{new}}}(b(x_m)) \right|^p \right)^{1/p}, \tag{8}
\]
where x_m ∼ µ and y_m ∼ ν, as in (4).
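The sorting-based bijective mapping b(·) described above can be sketched as follows (a minimal reconstruction; the function name is ours): each x_m is paired with the y value whose rank among {y_m} equals the rank of x_m among {x_m}, so the ordering of x is preserved in the matched values.

```python
import numpy as np

def bijective_coupling(x, y):
    """Sorting-based bijective map b: sort y, then assign to each x_m the
    y value of matching rank, so larger x values pair with larger y values."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    y_sorted = np.sort(y)
    ranks = np.argsort(np.argsort(x))  # rank of each x_m within x
    return y_sorted[ranks]             # b(x_m) = y_sort[rank(x_m)]
```

For example, `bijective_coupling([3, 1, 2], [10, 30, 20])` returns `[30, 10, 20]`: the largest x (3) is paired with the largest y (30), and so on.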

B. VISUAL TRACKING
Our objective is to find the best index c*, in which the corresponding appearance distribution µ^{(c)} described by candidate configuration O_t^{(c)} minimizes the distance:

\[
c^* = \arg\min_{c} W_p^{\mathrm{proj}}(\mu^{(c)}, \nu), \tag{9}
\]
where the best target configuration at time t is Ô_t = O_t^{(c*)}. Algorithm 2 shows the entire pipeline of the proposed visual tracker based on the projected Wasserstein distance.

V. ORTHOGONAL COUPLED WASSERSTEIN-BASED VISUAL TRACKING

A. ORTHOGONAL COUPLED WASSERSTEIN DISTANCE
Using the projected Wasserstein distance, we can reduce the discrepancy between the sliced Wasserstein distance and the true Wasserstein distance. However, the projection direction s in proj_s is crucial for the success of the projected Wasserstein distance, as mentioned in [41]. In this context, we use orthogonal directions, because orthogonal projection vectors guarantee an improvement in the estimator variance of the projected Wasserstein distance, as proven in [13]. To sample mutually orthogonal vectors s_1^ort, ..., s_N^ort ∈ S^{d−1} (i.e., ⟨s_i^ort, s_j^ort⟩ = 0 for i ≠ j), we employ the orthogonal Monte Carlo (OMC) techniques in [42]. Using OMC, mutually orthogonal vectors can be efficiently obtained from the unit sphere S^{d−1} in R^d. OMC builds on the Givens rotation G[i, j, θ]:

\[
G[i, j, \theta]_{kl} =
\begin{cases}
\cos\theta, & (k, l) \in \{(i, i), (j, j)\}, \\
-\sin\theta, & (k, l) = (i, j), \\
\sin\theta, & (k, l) = (j, i), \\
\delta_{kl}, & \text{otherwise},
\end{cases} \tag{10}
\]

where all coordinates of R^d are fixed except i and j, and the two-dimensional subspace they span is rotated by θ.
Using G[i, j, θ], we can sample orthogonal vectors via Kac's random walk on the Markov chain {K_t}_{t=1}^∞:

\[
K_t = G[i_t, j_t, \theta_t] \, K_{t-1}, \qquad K_0 = I_d, \tag{11}
\]

where, at each step t, the coordinate pair (i_t, j_t) is drawn uniformly at random and θ_t is drawn uniformly from [0, 2π). In (11), the sequence K_t s_n^ort is a Markov chain on S^{d−1} [44]. Then, the orthogonal coupled Wasserstein distance W_p^ort is defined as follows:

\[
W_p^{\mathrm{ort}}(\mu, \nu) = \left( \frac{1}{N} \sum_{n=1}^{N} \frac{1}{M} \sum_{m=1}^{M} \left| \mathrm{proj}_{s_n^{\mathrm{ort}}}(x_m) - \mathrm{proj}_{s_n^{\mathrm{ort}}}(b(x_m)) \right|^p \right)^{1/p}. \tag{12}
\]
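A self-contained sketch of this construction (our reconstruction, with illustrative names; the paper's exact algorithm may differ): a chain of random Givens rotations is applied to the first N standard basis vectors. Since rotations preserve inner products, the resulting directions remain exactly mutually orthogonal while their joint distribution mixes over the sphere, and they can then be plugged into a sorting-based sliced estimator.

```python
import numpy as np

def givens_rotation(d, i, j, theta):
    """Givens rotation G[i, j, theta]: rotate by theta in the plane of
    coordinates i and j, leaving all other coordinates fixed."""
    G = np.eye(d)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i] = G[j, j] = c
    G[i, j], G[j, i] = -s, s
    return G

def kac_orthogonal_directions(d, n_vectors, n_steps=200, rng=None):
    """Sample mutually orthogonal unit vectors via Kac's random walk:
    apply a chain of random Givens rotations to the first n_vectors
    standard basis vectors (requires n_vectors <= d)."""
    rng = np.random.default_rng(rng)
    V = np.eye(d)[:, :n_vectors]  # columns e_1, ..., e_N
    for _ in range(n_steps):
        i, j = rng.choice(d, size=2, replace=False)
        theta = rng.uniform(0.0, 2.0 * np.pi)
        V = givens_rotation(d, int(i), int(j), theta) @ V
    return V.T                    # rows s_1^ort, ..., s_N^ort

def orthogonal_coupled_wasserstein(X, Y, n_directions, p=2, rng=None):
    """Sliced p-Wasserstein estimate of two empirical distributions (rows
    of X and Y) using mutually orthogonal projection directions; each 1D
    transport is solved by sorting."""
    S = kac_orthogonal_directions(X.shape[1], n_directions, rng=rng)
    px = np.sort(np.asarray(X, float) @ S.T, axis=0)
    py = np.sort(np.asarray(Y, float) @ S.T, axis=0)
    return float(np.mean(np.abs(px - py) ** p) ** (1.0 / p))
```

The Gram matrix of the sampled directions stays equal to the identity regardless of the number of walk steps, which is the coupling property the OCW estimator relies on.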

B. VISUAL TRACKING
Our objective is to find the best index c*, in which the corresponding appearance distribution µ^{(c)} described by candidate configuration O_t^{(c)} minimizes the distance:

\[
c^* = \arg\min_{c} W_p^{\mathrm{ort}}(\mu^{(c)}, \nu), \tag{14}
\]

where the best target configuration at time t is Ô_t = O_t^{(c*)}. Algorithm 3 shows the entire pipeline of our visual tracker based on the orthogonal coupled Wasserstein distance.

VI. EXPERIMENTS

A. EXPERIMENTAL SETTINGS
OTB dataset: To demonstrate the effectiveness of the proposed methods, we compared the three proposed visual trackers (i.e., SWT, PWT, and OCWT) with nine recent deep learning-based visual trackers (i.e., ECO-HC [45], TADT [46], SiamRPN++ [21], SINT-op [47], C-COT [48], DAT [49], ECO [45], SiamDW [23], and SINT [47]) on the OTB dataset [50]. This dataset includes various attributes of visual tracking environments, including out-of-view, out-of-plane rotation, deformation, motion blur, scale variation, illumination change, fast motion, background clutter, in-plane rotation, low resolution, and occlusion. To evaluate the visual tracking methods, precision plots, success plots, and the area under the curve (AUC) were used. The precision plot computes the ratio of frames in which the discrepancy between the estimated and ground-truth configurations of the targets is less than a specific threshold. The success plot computes the percentage of frames in which the intersection over union between the estimated and ground-truth bounding boxes is greater than a specific threshold. The AUC computes the area under the success plot.

B. ANALYSIS OF THE PROPOSED METHOD
To examine the effectiveness of each proposed technique, Table 1 compares the proposed SWT with its extensions, PWT and OCWT. As shown in the table, describing multiple appearances of the target using Wasserstein distributions is helpful for accurate visual tracking: even our simple SWT-based visual tracker outperforms state-of-the-art visual trackers, including GlobalTrack, in terms of normalized precision (as shown in Table 6).
We also examined the robustness of the proposed method against hyperparameter settings. Table 2 shows that the proposed OCWT is not sensitive to different settings for the number of Monte Carlo samples. Although the OCWT exhibited more accurate results with more samples at the cost of computational time, it still shows accurate visual tracking performance even with 50 samples. Table 3 includes the visual tracking results of the proposed OCWT according to different numbers of candidate configurations (C in (5)). If we consider a large number of candidate regions for the target, we have more chances of getting trapped in local minima; thus, the visual tracking accuracy decreased when we used 20 candidate regions. In contrast, if we consider a very small number of candidate regions for the target, the visual tracking accuracy can decrease because the search areas are not sufficient to find the target. In either case, however, our tracker is not sensitive to the number of candidate configurations. Table 4 lists the visual tracking results of the proposed OCWT according to different numbers of moment statistics (M in (4)). As shown in the table, using a single moment statistic to describe the target appearance was not sufficient to accurately track the target. If we use more than four moment statistics, the visual tracking performance converges, and our visual tracker can successfully track the target. Table 5 shows that the proposed OCWT is not sensitive to different settings with respect to the number of frames (T in (11)). Although we could obtain more accurate orthogonal vectors with a large number of frames, the performance improvement was not significant. Even when the orthogonal vectors are only approximate, using them is crucial for robust visual tracking. It should be noted that the proposed OCWT with orthogonal vectors considerably outperforms the PWT without orthogonal vectors.

C. COMPARISONS ON THE OTB DATASET
Our method was quantitatively compared with non-deep-learning visual trackers. As shown in Figure 2, the proposed method considerably surpassed existing non-deep-learning visual trackers in all evaluation metrics (i.e., precision plot, success plot, and AUC). While the second-best methods are Struck and SCM for the precision and success plots, respectively, the proposed method outperformed these methods by a large margin. Empirically, we argue that the accurate visual tracking results of our method are induced by precisely measuring the discrepancy between the two distributions of estimated and ground-truth appearances via advanced Wasserstein-based techniques. Our method was also compared with recent deep learning visual trackers, as shown in Figure 3. The method exhibited state-of-the-art performance in all evaluation metrics, even though it adopts no complex deep neural network architecture. In contrast, SiamDW showed the second-best performance in terms of the precision plot, even though it employed a deeper and wider neural network architecture for visual tracking. Thus, this quantitative comparison verifies the effectiveness of our Wasserstein distributional tracking, in which the discrepancy between the two appearance distributions is efficiently minimized. It is noteworthy that we present a novel appearance model for visual tracking based on the Wasserstein distribution; thus, the proposed technique can be plugged into existing visual trackers to improve their visual tracking accuracy. Figure 4 shows qualitative visual tracking results of our method on the OTB dataset.
The test video sequences contain fast motion (e.g., the (a) Biker, (b) Bolt, and (c) Deer sequences), nonrigid deformation (e.g., the (d) Diving, (e) Ironman, and (f) Jump sequences), background clutter (e.g., the (g) Matrix, (h) MotorRolling, and (i) Shaking sequences), occlusions (e.g., the (g) Matrix and (l) Soccer sequences), illumination changes (e.g., the (e) Ironman, (g) Matrix, (i) Shaking, and (j) Singer2 sequences), and small objects (e.g., the (f) Jump and (k) Skiing sequences). Although these sequences are very challenging, our method accurately tracked the targets. This accurate visual tracking performance stems from the modeling of multiple appearances using Wasserstein distributions.

D. COMPARISONS ON THE LASOT DATASET
Table 6 shows quantitative comparisons between the proposed OCWT and recent state-of-the-art visual trackers on the LaSOT dataset. As shown in the table, our method produces accurate visual tracking results and outperforms the other visual trackers, with GlobalTrack showing the second-best performance. However, GlobalTrack adopted a complex backbone network (ResNet) to extract representative features, whereas the proposed method used a small backbone network (VGG) to exhibit state-of-the-art performance with small computational costs. These experimental results demonstrate that using Wasserstein distributions for the target appearances makes the proposed visual tracker robust to variations in the target appearances caused by illumination changes, deformation, and background clutter. Figures 5 and 6 show the success and normalized precision plots of the visual trackers on the LaSOT dataset, respectively. As shown in the figures, the proposed visual tracker, OCWT, is comparable with recent state-of-the-art visual trackers such as DiMP and LTMU, while it considerably outperforms state-of-the-art correlation filter-based trackers (e.g., GFSDCF [55], ASRCF [56], STRCF [57], and BACF [58]).
E. COMPARISONS ON THE VOT DATASET
Figure 7 demonstrates the effectiveness of the proposed method on the VOT dataset. The proposed visual tracker, OCWT, achieves state-of-the-art accuracy, while its robustness is also competitive with other methods. LSART exhibits the best robustness, but it tracks target objects less accurately than the proposed method. Table 7 reports the tracking speed in frames per second (FPS). Correlation filter-based visual trackers are fast, because their mathematical operations are computationally efficient. The proposed method runs at 79 frames per second, which is faster than other non-correlation-filter-based visual trackers. This indicates that the proposed orthogonal coupled Wasserstein distance is useful for improving visual tracking accuracy with low computational costs.

VII. CONCLUSION
In this study, we proposed a novel Wasserstein distributional tracking method that balances approximation with accuracy in terms of Monte Carlo estimation. To achieve this goal, we presented three different visual tracking systems: sliced Wasserstein-based, projected Wasserstein-based, and orthogonal coupled Wasserstein-based. Sliced Wasserstein-based visual trackers can find accurate target configurations using the optimal transport plan, which minimizes the discrepancy between appearance distributions described by the estimated and ground truth configurations. Because this plan involves a finite number of probability distributions, the computational costs can be considerably reduced. Projected Wasserstein-based and orthogonal coupled Wasserstein-based visual trackers further enhance the tracking accuracy using bijective mapping functions and orthogonal Monte Carlo, respectively. Experimental results demonstrate that our approach balances computational efficiency with accuracy, and the proposed visual trackers outperform other state-of-the-art visual trackers on benchmark visual tracking datasets.