A Novel Algorithm Based on a Common Subspace Fusion for Visual Object Tracking

Recent methods for visual tracking exploit a multitude of information obtained from combinations of handcrafted and/or deep features. However, the response maps derived from these feature combinations are often fused using simple strategies such as winner-takes-all or weighted sum approaches. Although some efficient fusion methods have also been proposed, these methods still do not leverage the individual strengths of the different features being fused. In the current work, we propose a novel information fusion strategy comprising a common low-rank subspace for the fusion of different types of features and tracker responses. Firstly, we interpret the response maps as smoothly varying functions which can be efficiently represented using individual low-rank matrices, thus removing high frequency noise and sparse artifacts. Secondly, we estimate a common low-rank subspace which is constrained to remain close to each individual low-rank subspace resulting in an efficient fusion strategy. The proposed algorithm achieves good performance by integrating the information contained in heterogeneous feature types. We demonstrate the efficiency of our algorithm using several combinations of features as well as correlation filter and end-to-end deep trackers. The proposed common subspace fusion algorithm is generic and can be used to efficiently fuse the response maps of varying types of feature representations as well as trackers. Extensive experiments on several tracking benchmarks including OTB100, TC128, VOT-ST 2018, VOT-LT 2018, UAV123, GOT-10K and LaSoT have demonstrated significant performance improvements compared to many SOTA tracking methods.


I. INTRODUCTION
Visual Object Tracking (VOT) is one of the most fundamental tasks in computer vision, with a wide range of applications across several domains [1], [2], for example autonomous driving [3], anomaly detection [4], augmented reality [5], action recognition [6], surveillance, and security [7]. Numerous research directions have been investigated in recent years for VOT [8]-[24]. Despite this research focus, VOT in challenging environments is still an open problem which needs to be further investigated [25]-[30]. Among the most investigated tracking approaches, Correlation Filters (CFs) have attracted significant attention because of their impressive performance in terms of speed and accuracy [9]-[12], [31]-[40]. In most of these methods, a correlation filter is trained over a region of interest in the current frame and then employed to track the target object in subsequent frames by maximizing the filter response [18], [41], [42]. More recently, end-to-end deep learning-based trackers have also been proposed and have achieved excellent performance [43]-[45]. In many cases, classical object detectors such as Faster R-CNN have also been adapted for tracking-by-detection tasks [13], [14]. The performance of CF-based trackers is further enhanced through scale invariance [42], target re-detection [46], deep end-to-end training [43], local and global filter ensembles [12], and the combination of deep CNN and handcrafted features [11], [47].
Most existing CNN-based methods use only features from later layers to represent target objects, because these features capture rich category-level semantic information. However, the spatial details captured by earlier layers are also important for accurately localizing a target [9]. The features from these earlier layers are less discriminative than those of later layers, which often leads to failure in more challenging tracking scenarios. Consequently, many trackers complement deep representations with shallow activations or handcrafted features for improved localization [41], [42], [48]. This raises the question of how to optimally fuse the fundamentally different properties of shallow and deep features in order to achieve both accuracy and robustness [11]. For optimal tracking performance, it is imperative that handcrafted features be combined with deep features from different CNN layers to best discriminate between the target object and background clutter. Fusion of different feature representations has been shown to improve VOT performance [9]-[12]. For instance, Ma et al. aggregate response maps extracted from earlier and later CNN layers by manually assigning a relative weight to each [9]. Qi et al. proposed fusion of response maps from six CNN features using a hedging method [10]. Wang et al. proposed feature-level and decision-level strategies to fuse multi-expert response maps [12]. Bhat et al. recently proposed the Unveiling the Power of Deep Tracking (UPDT) tracker, in which the relative weights of features are learnt from training samples [11]. Although feature-level strategies have demonstrated competitive performance for VOT, the initial weights of deeper layers tend to be higher than those of shallow layers, due to their ability to encode more semantic information.
However, it has been observed that shallow layers can improve localization performance [9], [11], suggesting that in some tracking scenarios shallow layers should be given significant weight. Decision-level fusion partially addresses this issue; however, the early feature-level fusion strategy remains an important factor to consider [12]. Figure 1 presents a challenging situation in which the aforementioned trackers struggle to track the target objects.
In the current work, we propose to learn a common subspace-based response map which complements the information captured by handcrafted as well as deep features. For each response map, we estimate its low-rank representation using non-negative matrix factorization [49]. Then we compute a common low-rank representation across all these response maps which is constrained to remain close to each individual low-rank representation [50]. Thus, a consensus is achieved by those response maps which correctly estimate the target position, while the incorrect ones are not accumulated, resulting in more robust VOT. We observe the effectiveness of the proposed algorithm by comparing its response map with those of various State-Of-The-Art (SOTA) trackers, as shown in Figure 2. In the first case, Figure 2 (d) shows the fused response map of the proposed CSF tracker using KCF as a baseline tracker on deep features (Figure 2 (c)) and on handcrafted features (Figure 2 (b)). In the second case, Figure 2 (h) shows the fusion by the proposed CSF tracker over three existing SOTA correlation filter-based trackers including GFS-DCF [41], ASRCF [42], and RPCF [48]. In both cases, the fused response map shows a higher signal peak and suppressed noisy peaks.
The proposed tracker, which we name Common Subspace Fusion (CSF), is evaluated on seven tracking benchmark datasets and compared with many SOTA trackers. Our experiments demonstrate a significant performance improvement in terms of both speed and accuracy. Specifically, our tracker demonstrates a 7.0% improvement in Expected Average Overlap (EAO) compared to the baseline GFS-DCF tracker [41] and a 3.0% improvement compared to the PrDiMP tracker [16] on the VOT2018 dataset [25]. Further experiments on the GOT-10k [51], OTB100 [26], UAV123 [28] and LaSoT [30] datasets demonstrate significant improvements over existing SOTA trackers. The main contributions of the current work are as follows:
• A novel common subspace fusion algorithm is proposed based on low-rank response map representations of various types of features and trackers. Using the individual low-rank response map representations, a common subspace-based representation is estimated which is constrained to be close to each individual representation on the Grassmannian manifold.
• The proposed fusion scheme is employed on correlation filter-based trackers using different features, resulting in significant performance improvement in all cases. It is also employed to fuse the predicted response maps of deep trackers, which again results in significant performance improvement. Rigorous evaluations are performed on two long-term and six short-term tracking datasets. The proposed CSF tracker consistently demonstrates improved performance.
The rest of this paper is organized as follows. Section II summarizes related work. Section III presents the proposed methodology in detail. Section IV describes our experimental evaluation and results, and Section V presents our conclusions and future directions.

II. LITERATURE REVIEW
In the past decade, a number of studies have demonstrated improved performance for the task of VOT [9]- [20], [52]. Since the current work is focused on the fusion of various types of features and trackers, we particularly review those studies which present some type of fusion scheme.
Many researchers have aimed to tap the complementary information contained in various types of handcrafted and deep features by using different fusion strategies. These schemes may be categorized into two groups: feature-level and decision-level fusion. Feature-level fusion is an intermediate level fusion in which each feature representation is used to obtain a probability map of the target location. These probability maps, also known as response maps, are then fused using different strategies such as pre-defined or learned weights. This fusion strategy assigns relative weights to different types of features based on semantic information, therefore semantically rich high-level features get higher weights compared to their shallow counterparts. It has been observed that in many tracking scenarios, shallow features are more effective than deep features, resulting in performance degradation of feature fusion strategies that prioritize deep features. For instance, Ma et al. trained correlation filters on each feature layer of VGG-19 [9]. The fused responses were estimated by aggregating all feature maps using a manually hard-coded weighting scheme. Qi et al. proposed a fusion method for hedging correlation filter responses based on relative hard-coded weights into a single response map for target detection [10]. These manually assigned hardcoded weighting schemes may not be optimal in all tracking scenarios. To address this problem, Bhat et al. proposed to learn the weights of the individual feature representation and demonstrated improved VOT performance as compared to the aforementioned fusion techniques [11]. UPDT learned optimal fusion hyper-parameters on the OTB100 dataset [26], which were subsequently applied to other tracking datasets, albeit with no guarantee of the effectiveness of these learned parameters across different tracking challenges. 
It is observed that when weights are learned, higher priority is still given to deep features over shallow or handcrafted features.
Decision-level fusion is exploited by the MCCF tracker, in which a result is selected from multiple proposals based on the agreement of multiple feature combinations as well as temporal consistency [12]. While it has been shown to improve performance in some scenarios, the significance of decision-level fusion strongly depends on the design of the baseline feature combinations, since decision-level fusion still builds on a feature-level fusion in which semantic information is given high significance. Decision-level fusion is also prone to errors in scenarios where multiple feature combinations contain similar errors. In many tracking scenarios, semantic information may cause errors that could be overcome by prioritizing low-level information. Some studies have also been reported on other imaging modalities, such as thermal infrared, for robust object tracking [53]-[55].
In the current work, we address this shortcoming by using a feature-level fusion strategy based on the common subspace spanned by the individual response maps. In contrast to the aforementioned fusion strategies, we consider each feature representation to be equally significant, so that a broader range of tracking scenarios can be effectively handled compared to the prior unequal weighting schemes. We learn a common subspace across all feature representations which is constrained to be close to each low-rank representation on the Grassmannian manifold. Our proposed fusion scheme is generic, allowing us to demonstrate its efficacy by plugging it into many recent SOTA trackers, resulting in significant performance improvement.

III. PROPOSED COMMON SUBSPACE FUSION ALGORITHM
In our proposed Common Subspace Fusion (CSF) algorithm, each response map is considered equally important; therefore, we do not compute any weights for shallow or deep feature maps. Thus, we address the problem of tracking errors caused by incorrect semantic information being given too much importance. Compared to the aforementioned fusion schemes, which improve performance by combining weak classifiers, our proposed fusion strategy is more generic and improves performance beyond current SOTA trackers. The system diagram of our proposed CSF tracker is shown in Figure 3. For each response map estimated by a set of SOTA trackers, a low-rank representation is computed. Then, using these multiple low-rank representations, our aim is to compute a common subspace representation, resulting in a fusion over multiple response maps.

A. MATHEMATICAL FORMULATION
An ideal tracking response map $R_k \in \mathbb{R}^{m \times m}$ should be a smoothly varying continuous function; however, when working with real-world data, it may contain high-frequency artifacts, where $m \times m$ is the size of the response map and $k$ indexes the feature representation. We therefore propose to compute a low-rank representation $L_k \in \mathbb{R}^{m \times c}$ of $R_k$, where $c < m$ and $\mathrm{rank}(L_k) \le c$. For this purpose, we convert the response map $R_k$ into an affinity matrix $S_k = R_k R_k^\top \in \mathbb{R}^{m \times m}$, which is symmetric and positive semi-definite and may be factorized into low-rank non-negative factors. $S_k$ preserves the structure of the corresponding response map $R_k$ such that one cluster in $S_k$ corresponds to the target region while the remaining clusters correspond to non-target regions in the search space.
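As a concrete illustration, the affinity construction above can be sketched in NumPy; the map size, peak location, and the clipping of negative response values are toy implementation choices, not taken from the paper:

```python
import numpy as np

def affinity_matrix(response: np.ndarray) -> np.ndarray:
    """Build the affinity S_k = R_k R_k^T from an m x m response map.

    The result is symmetric and positive semi-definite by construction.
    Negative response values are clipped so that the subsequent
    non-negative factorization is well posed (an implementation choice).
    """
    R = np.maximum(response, 0.0)
    return R @ R.T

# Toy response map with a single smooth peak (hypothetical values).
m = 32
y, x = np.mgrid[0:m, 0:m]
R = np.exp(-((x - 20) ** 2 + (y - 12) ** 2) / 40.0)
S = affinity_matrix(R)
assert np.allclose(S, S.T)                    # symmetric
assert np.linalg.eigvalsh(S).min() > -1e-8    # PSD up to round-off
```

The smooth peak in $R$ produces one dominant block (cluster) in $S$, which is what the factorization in the next subsection isolates.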
Non-negative Matrix Factorization (NMF) has been widely employed for the estimation of low-rank approximations [56], [57]. NMF factorizes an input data matrix $S_k$ into two non-negative matrices $L_k$ and $G_k$, i.e., $S_k \approx L_k G_k^\top$. The rank of both non-negative matrices $L_k$ and $G_k$ is significantly lower than that of $S_k$. For the purpose of uniqueness and clustering interpretation, $G_k$ is enforced to be orthogonal, $G_k^\top G_k = I$. We enforce orthogonality constraints on both non-negative matrices $L_k$ and $G_k$, so that $L_k$ can be considered the cluster indicator matrix for row clustering and $G_k$ the cluster indicator matrix for column clustering. Such a configuration allows us to identify the target region as an intersection of the rows and columns corresponding to the target clusters. The objective function for such a decomposition is formulated as follows:
$$\min_{L_k \ge 0,\, G_k \ge 0} \|S_k - L_k G_k^\top\|_F^2 \quad \text{s.t.} \quad L_k^\top L_k = I,\ G_k^\top G_k = I. \tag{1}$$
However, this double orthogonality is very restrictive and gives a rather poor low-rank approximation. One needs an extra factor $B_k$ to absorb the different scales of $S_k$, $L_k$, and $G_k$, i.e., $S_k \approx L_k B_k G_k^\top$. In the case of a symmetric input matrix, $S_k = S_k^\top$, the two non-negative factors become the same, i.e., $L_k = G_k$. Using Symmetric Non-negative Matrix Tri-Factorization (SNMTF), we factorize each $S_k$ as $S_k \approx L_k B_k L_k^\top$ by solving the following objective function [49]:
$$\min_{L_k \ge 0,\, B_k \ge 0} \|S_k - L_k B_k L_k^\top\|_F^2 \quad \text{s.t.} \quad L_k^\top L_k = I, \tag{2}$$
where $\|\cdot\|_F$ is the Frobenius norm [58] and $B_k$ is a non-negative auxiliary matrix. The matrix $L_k$ contains the feature-specific response map structure such that one particular cluster corresponds to the target region while the remaining clusters belong to non-target regions.
In order to obtain a common subspace-based tracking response map structure across all feature representations, we compute a common low-rank representation $M \in \mathbb{R}^{m \times c}$ which should be close to each individual low-rank response representation $L_k$. The common representation contains a unified target region cluster over all feature response maps, such that the individual target regions are superimposed, resulting in an amplified target response. Matrix $M$ can be computed using Eq. (2), with $M$ replacing $L_k$, by minimizing the objective function across all $k$ feature representations:
$$\min_{M \ge 0,\, B_k \ge 0} \sum_k \|S_k - M B_k M^\top\|_F^2. \tag{3}$$
If each minimization problem is solved independently, the matrix $M$ may end up farther from some low-rank representations $L_k$ than from others. The set of $c$-dimensional linear subspaces of $\mathbb{R}^m$ can be considered a Grassmann manifold $\mathcal{G}(c, m)$, such that each point on this manifold corresponds to a unique subspace. Each subspace can be represented by its basis vectors as an orthonormal matrix whose columns span the corresponding $c$-dimensional subspace of $\mathbb{R}^m$. In order to ensure that $M$ is close to the majority of the $L_k$, we enforce that the subspace spanned by $M$ is close to the subspace spanned by each $L_k$ on a Grassmann manifold [50], [59]. Each $L_k$ spans a corresponding $c'$-dimensional subspace, where $c' \le c \le m$, and is mapped to a unique point on the Grassmann manifold $\mathcal{G}(c', m)$, defined as the set of $c'$-dimensional linear subspaces of $\mathbb{R}^m$.
The geodesic distance between two subspaces $M$ and $L_k$ on the Grassmann manifold can be approximated by the projection distance as follows [50], [59]:
$$d_p^2(M, L_k) = \sum_{j=1}^{c'} \sin^2 \theta_{kj} = c' - \mathrm{tr}\!\left(M M^\top L_k L_k^\top\right), \tag{4}$$
where $\{\theta_{kj}\}_{j=1}^{c'}$ are the principal angles between the $c'$-dimensional subspaces spanned by $L_k$ and $M$, and $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. In order to ensure that $M$ is close to each $L_k$, the overall objective function is given by:
$$\Psi = \sum_k \|S_k - M B_k M^\top\|_F^2 + \gamma \sum_k d_p^2(M, L_k), \tag{5}$$
where $\gamma > 0$ is a weighting parameter. The second term is the sum of the projection distances between $M$ and each $L_k$. Minimizing this term ensures that the matrix $M$ remains close to each individual matrix $L_k$ on the Grassmann manifold in terms of geodesic distance. In order to minimize Eq. (5), we first formulate the multiplicative update rules to compute each $L_k$ using the SNMTF method [49], and then we jointly optimize our objective function (Eq. (5)) to derive the multiplicative update rules for our common low-rank matrix $M$.
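The projection distance of Eq. (4) is straightforward to compute once both bases are orthonormalized; a minimal NumPy sketch (the toy matrices below are illustrative, not from the paper):

```python
import numpy as np

def projection_distance_sq(M: np.ndarray, L: np.ndarray) -> float:
    """Squared projection distance between the column spaces of M and L,
    i.e. sum_j sin^2(theta_j) = c - tr(M M^T L L^T) for orthonormal bases."""
    Qm, _ = np.linalg.qr(M)  # orthonormalize so principal angles are well defined
    Ql, _ = np.linalg.qr(L)
    c = Qm.shape[1]
    return float(c - np.trace(Qm @ Qm.T @ Ql @ Ql.T))

# Identical subspaces are at distance 0; orthogonal ones at the maximum c.
A = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
B = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
assert abs(projection_distance_sq(A, A)) < 1e-9
assert abs(projection_distance_sq(A, B) - 2.0) < 1e-9
```

Because the distance depends only on the spanned subspaces, it is invariant to any invertible recombination of the basis columns, which is exactly the Grassmann-manifold view used above.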

1) Individual Low-rank Representation Computation
The following multiplicative update rules are derived from Eq. (2) using the SNMTF method [49] for estimating $L_k$:
$$L_k(i,j) \leftarrow L_k(i,j) \sqrt{\frac{(S_k L_k B_k)(i,j)}{(L_k L_k^\top S_k L_k B_k)(i,j)}}, \qquad B_k(i,j) \leftarrow B_k(i,j) \sqrt{\frac{(L_k^\top S_k L_k)(i,j)}{(L_k^\top L_k B_k L_k^\top L_k)(i,j)}}, \tag{6}$$
where $L_k(i,j)$ is the $(i,j)$-th element of the low-rank representation $L_k$. Eq. (6) is iterated until convergence, i.e., until $\|L_k^{(t+1)} - L_k^{(t)}\|_F \le \zeta$, where $\zeta$ is a tolerance factor.
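For reference, the SNMTF iteration can be sketched in NumPy. The multiplicative updates below follow the standard form of Ding et al. for $S \approx L B L^\top$; the initialization, iteration count, and toy input are implementation choices, not taken from the paper:

```python
import numpy as np

def snmtf(S, c, n_iter=300, eps=1e-12, seed=0):
    """Sketch of Symmetric NMF Tri-Factorization, S ~= L B L^T (Eq. (2)).

    Multiplicative updates in the style of Ding et al.; both factors
    stay element-wise non-negative throughout.
    """
    rng = np.random.default_rng(seed)
    m = S.shape[0]
    L = rng.random((m, c)) + 0.1   # non-negative random init
    B = np.eye(c)                  # symmetric scale-absorbing factor
    for _ in range(n_iter):
        L *= np.sqrt((S @ L @ B) / (L @ L.T @ S @ L @ B + eps))
        B *= np.sqrt((L.T @ S @ L) / (L.T @ L @ B @ L.T @ L + eps))
    return L, B

# Factor a toy rank-2 non-negative symmetric matrix (hypothetical data).
rng = np.random.default_rng(1)
W = rng.random((8, 2))
S = W @ W.T
L, B = snmtf(S, c=2)
err = np.linalg.norm(S - L @ B @ L.T) / np.linalg.norm(S)
assert err < 0.5 and np.all(L >= 0)
```

In the tracker, each affinity matrix $S_k$ is factorized independently this way, and the dominant column of $L_k$ indicates the target cluster.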

2) Common Low-rank Representation Matrix Computation
Following constrained optimization theory [60] and non-negative matrix factorization [61], we take the derivative of Eq. (5) with respect to $M$:
$$\nabla_M \Psi = \sum_k \left( -4 S_k M B_k + 4 M B_k M^\top M B_k - 2\gamma L_k L_k^\top M \right). \tag{7}$$
The ordinary gradient of this optimization problem does not represent its steepest descent direction because the matrix $M$ lies on the Grassmann manifold [62]. However, the steepest direction can be obtained by using the notion of the natural gradient [59], [62], [63]. The natural gradient of $\Psi$ on the Grassmann manifold at $M$ can be written in terms of the ordinary gradient as follows [62], [63]:
$$\tilde{\nabla}_M \Psi = (I - M M^\top)\, \nabla_M \Psi, \tag{8}$$
where $\nabla_M \Psi$ is the ordinary gradient given by Eq. (7). Combining Eq. (7) and Eq. (8), we get
$$\tilde{\nabla}_M \Psi = (I - M M^\top) \sum_k \left( -4 S_k M B_k + 4 M B_k M^\top M B_k - 2\gamma L_k L_k^\top M \right). \tag{9}$$
In order to ensure the positivity constraints on $M$, the natural gradient is decomposed into two non-negative terms [63], [64] such that:
$$\tilde{\nabla}_M \Psi = [\tilde{\nabla}_M \Psi]^{+} - [\tilde{\nabla}_M \Psi]^{-}. \tag{10}$$
The two terms are enforced to be positive as follows:
$$[\tilde{\nabla}_M \Psi]^{+} = \sum_k \left( 4 M B_k M^\top M B_k + 4 M M^\top S_k M B_k + 2\gamma M M^\top L_k L_k^\top M \right) \tag{11}$$
and
$$[\tilde{\nabla}_M \Psi]^{-} = \sum_k \left( 4 S_k M B_k + 4 M M^\top M B_k M^\top M B_k + 2\gamma L_k L_k^\top M \right). \tag{12}$$
Following the KKT conditions [60] and preserving the non-negativity of $M$, substituting Eq. (11) and Eq. (12) into Eq. (10), we obtain the multiplicative update rule for the common subspace matrix $M$ using the natural gradient:
$$M(i,j) \leftarrow M(i,j)\, \frac{[\tilde{\nabla}_M \Psi]^{-}(i,j)}{[\tilde{\nabla}_M \Psi]^{+}(i,j)}. \tag{13}$$
Target detection is then performed by seeking the maximum value in the common low-rank representation matrix $M$. Algorithm 1 summarizes the steps of the proposed CSF algorithm.
Algorithm 1: Pseudocode of the proposed CSF algorithm.
Input: Response maps of the target object $R_k \in \mathbb{R}^{m \times m}$ obtained using any DCF-based tracker.
Output: The maximum value in the common low-rank representation matrix $M$ for target localization.

IV. EXPERIMENTAL EVALUATIONS
The performance of the proposed CSF algorithm is evaluated on seven tracking datasets including OTB100 [26], UAV123 [28], VOT2018 Short Term challenge (VOT2018-ST) [25], TC128 [27], GOT-10K [51], VOT2018 Long Term challenge [25], and LaSoT [30]. These datasets comprise a variety of tracking challenges including occlusion, background clutter, and scale variations to name a few [28]. The description of each dataset is shown in Table 1. Our proposed algorithm is implemented on a PC with an Intel core i7 4GHz, Titan Xp GPU, and 64 GB RAM.
The tracking performance is evaluated using two popular measures, precision and success rates [26], for the OTB100, TC128, UAV123, and LaSoT datasets. The precision rate is defined as the percentage of frames in which the Euclidean distance between the predicted and ground-truth target locations is less than a threshold of 20 pixels [26]. The success rate is defined as the percentage of frames with overlap ratio $|b_1 \cap b_2| / |b_1 \cup b_2| > 0.5$ [26], where $b_1$ and $b_2$ are the predicted and ground-truth bounding boxes, respectively. By varying the overlap threshold from 0 to 1, success plots are generated and the area under the curve is estimated. Moreover, following the protocols defined in VOT2017/VOT2018 [25], we use three primary measures, Expected Average Overlap (EAO), Robustness (R), and Accuracy (A), to compare the performance of different trackers on the VOT2018-ST dataset. The EAO estimates the average overlap a tracker is expected to obtain on a large set of short sequences with the same visual properties as a given dataset. Robustness measures the number of times a tracker fails (loses the target) during tracking, while accuracy is the average overlap between the ground-truth and estimated bounding boxes during the successful tracking periods.
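The precision and success measures above can be made concrete with a short sketch; the `(x, y, w, h)` box format and the toy boxes are illustrative assumptions:

```python
import numpy as np

def center_error(pred, gt):
    """Euclidean distance between predicted and ground-truth box centers.
    Boxes are (x, y, w, h) tuples."""
    cp = (pred[0] + pred[2] / 2.0, pred[1] + pred[3] / 2.0)
    cg = (gt[0] + gt[2] / 2.0, gt[1] + gt[3] / 2.0)
    return float(np.hypot(cp[0] - cg[0], cp[1] - cg[1]))

def iou(pred, gt):
    """Overlap ratio |b1 ∩ b2| / |b1 ∪ b2| for (x, y, w, h) boxes."""
    x1, y1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    x2 = min(pred[0] + pred[2], gt[0] + gt[2])
    y2 = min(pred[1] + pred[3], gt[1] + gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = pred[2] * pred[3] + gt[2] * gt[3] - inter
    return inter / union if union > 0 else 0.0

# Precision: fraction of frames with center error under 20 px;
# success: fraction of frames with IoU above the 0.5 overlap threshold.
preds = [(10, 10, 20, 20), (50, 50, 20, 20)]
gts   = [(12, 10, 20, 20), (90, 90, 20, 20)]
precision = np.mean([center_error(p, g) < 20 for p, g in zip(preds, gts)])
success   = np.mean([iou(p, g) > 0.5 for p, g in zip(preds, gts)])
assert precision == 0.5 and success == 0.5
```

Sweeping the IoU threshold from 0 to 1 and averaging the resulting success values yields the area-under-the-curve score reported in the success plots.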
The proposed CSF algorithm is tested in two different configurations: fusion using varying types of features for the same tracker (Config-1) and integrating multiple deep trackers (Config-2). In Config-1, we fuse the feature responses obtained by varying types of features while using the same tracker as a baseline.
In Config-2 we fuse the responses of multiple trackers. Figures 2 (e)-(g) show the response maps of three SOTA CF-based trackers; the fused map is smoother and has a higher signal-to-noise ratio. The objective is to complement the information captured by different trackers and to analyse the capability of our proposed CSF algorithm to fuse multiple deep trackers into a unified framework.
The performance of the two configurations is evaluated using the VOT2018-ST challenge dataset in terms of Expected Average Overlap (EAO), Robustness (R), and Accuracy (A) using the protocols provided by the authors [25]. Each of these experiments is discussed in detail in the following subsections.

A. FEATURE FUSION (CONFIG-1)
Feature fusion is performed using the proposed CSF algorithm with three recent SOTA trackers as baselines: GFS-DCF, ASRCF, and RPCF. We use the same feature set for each tracker, including HOG, Intensity Channel (IC), Color Names (CN), and deep features extracted from the 4th block of ResNet50. The performance comparison on the VOT2018 dataset is shown in Table 2, demonstrating that the proposed CSF fusion algorithm improves the performance of each baseline tracker by a significant margin. In this experiment, the baseline trackers are used exactly as proposed by their original authors, while the feature set is the one proposed in GFS-DCF. The EAO measure improves by up to 7.0% for CSF-GFS-DCF, while accuracy (A) improves by 4.0%.

B. TRACKER FUSION (CONFIG-2)
In this experiment, the proposed CSF algorithm is used to fuse the output response maps of three existing SOTA deep trackers including ATOM, PrDiMP, and DSLT. Performance is evaluated on the VOT2018-ST dataset, with results shown in Table 3. The features and parameters suggested by the original authors are used in each case. We observe a significant EAO improvement of 3.0% beyond PrDiMP tracker.
In terms of accuracy, we observe an improvement of up to 3.0% while in terms of robustness an improvement of up to 2.0% is observed. This simple experiment demonstrates the effectiveness of the proposed CSF algorithm in fusing complementary information from different deep trackers, resulting in a significant performance improvement.

C. COMPARISON OF CSF ALGORITHM WITH EXISTING FUSION SCHEMES
The proposed CSF algorithm is also compared with four existing SOTA fusion-based trackers: UPDT, HCF, HDT, and MCCF. For a fair comparison, the classical KCF tracker [31] is used as a baseline and the same set of features, including HOG, IC, CN, and deep features extracted from the 4th block of ResNet50, is used. Thus, this experiment only compares the strengths and weaknesses of the different fusion schemes while keeping all other variables fixed. The experiments are repeated with and without scale estimation on three datasets: OTB100, TC128, and UAV123. The scale of the target object is estimated using the same coarse-to-fine search strategy on HOG features for all compared trackers, as proposed in the ASRCF tracker [42]. Figures 5 (a)-(d) show the precision and success plots of all trackers on the three datasets. The proposed CSF algorithm consistently demonstrates the best performance in all experiments. On average, UPDT gives the second best results on OTB100, while MCCF gives the second best results on the TC128 and UAV123 datasets. In most experiments, HCF shows the lowest performance among the compared fusion-based trackers. The scale estimation step assists all compared trackers in obtaining better performance; however, our proposed CSF algorithm remains the best performer.

D. ABLATION STUDY
The proposed CSF algorithm has only one hyper-parameter to tune, namely γ in Eq. (5). To find a good value of γ, experiments are performed on the OTB100 dataset using the protocols defined in [11], in which the authors divided the OTB100 sequences into hard videos (OTB-H), used as a validation set, and easy videos (OTB-E), used as a test set, based on the performance of different trackers (please see the supplementary material of the UPDT tracker, https://arxiv.org/pdf/1804.06833.pdf, for the list of videos included in OTB-H and OTB-E). The value of γ is varied from 0 to 1.6 in intervals of 0.4. In addition, a high value of γ = 50 is also tested. Figure 4 shows the performance comparison as precision plots. For γ = 0 in our objective function (Eq. (5)), only the common low-rank representation component is evaluated, while for γ = 50 the second term, the sum of the projection distances between the common subspace and the individual subspaces, dominates. In both cases, we observe a graceful degradation of the proposed fusion algorithm, while for γ = 50 the performance is better than for γ = 0, suggesting that the second term plays a more important role than the first. The best performance is observed at γ = 0.8, which is the value used in all other experiments.

E. QUALITATIVE RESULTS
To evaluate the performance of the proposed CSF tracker, we present rigorous visual results on key frames of 13 challenging sequences selected from the OTB100 dataset and five sequences from the UAV123 dataset. Figure 6 presents the visual results of the proposed CSF tracker. The bounding boxes of the tracked objects are overlaid on the input images, and comparisons are shown with six existing trackers including ATOM, PrDiMP, DSLT, GFS-DCF, ASRCF, and RPCF. The sequences presented in this figure undergo a variety of tracking challenges including occlusion, background clutter, scale variation, deformation, in-plane rotation, out-of-plane rotation, out-of-view, illumination variation, fast motion, motion blur, and low resolution. Overall, the proposed CSF tracker performs much better than the compared trackers on all these sequences, which can be attributed to the fusion of multiple trackers and a variety of features within the proposed objective function.

F. EVALUATIONS ON SHORT-TERM TRACKING DATASETS
In addition to the VOT2018 short-term tracking dataset evaluated in the previous section, we have also performed experiments on the OTB100, UAV123 and GOT-10K datasets.

FIGURE 6: Visual results of the proposed CSF algorithm compared with existing SOTA trackers including ATOM [65], PrDiMP [16], DSLT [80], GFS-DCF [41], ASRCF [42], and RPCF [48] on 12 challenging sequences selected from the OTB100 [26] and 6 sequences from the UAV123 [28] datasets. Frame indexes and sequence names are shown for each video. Our proposed CSF algorithm consistently performs better than the compared trackers.

1) OTB100 Dataset
The proposed CSF algorithm is evaluated on the OTB100 dataset (100 videos with an average length of 590 frames). The performance is evaluated using success and precision over varying overlap thresholds. Some visual results on the OTB100 dataset are presented in Figure 6. Figure 7 shows the precision and success plots of the proposed CSF tracker against other SOTA trackers including GFS-DCF, RPCF, ASRCF, MCCT, ECO, CCOT, HCFTs, STRCF, and SRDCF. It should be noted that the proposed CSF (ASRCF) tracker uses the response maps estimated by the ASRCF tracker on different features. In terms of the precision plot, the proposed tracker obtains a 95.1% precision score, while the second best tracker, GFS-DCF, obtains 93.2%. Compared to the baseline ASRCF tracker, the performance of the proposed tracker improves by 2.8%. In terms of the success plot, the proposed CSF tracker obtains a 72.2% success rate, while the second best, the ECO tracker, obtains a 70.0% success rate. Compared to the baseline ASRCF tracker, the improvement is 3.0%. This demonstrates the effectiveness of our proposed fusion algorithm.
We have also evaluated the attribute-based performance on the OTB100 dataset. The 11 attributes, including Illumination Variation (IV), Scale Variation (SV), Occlusion (OCC), Out-of-Plane Rotation (OPR), In-Plane Rotation (IPR), Deformation (DEF), Out of View (OV), Background Clutter (BC), Motion Blur (MB), Low Resolution (LR) and Fast Motion (FM), are evaluated in terms of Precision Rate (PR) and Success Rate (SR) and compared with many SOTA trackers. Table 4 shows the attribute-based performance comparison of the proposed CSF tracker with SOTA trackers.
In terms of Precision Rate (PR), the proposed CSF tracker (baseline ASRCF) achieves the best results under 5 out of 11 challenging tracking attributes including OCC (91.6%), BC (95.1%), DEF (93.1%), OPR (93.2%) and OV (93.9%). For sequences with IV, SV, MB, FM, and IPR tracking challenges, the proposed tracker achieves the second best performance compared to other competing trackers. In terms of Success Rate (SR), the proposed CSF tracker (baseline ASRCF) achieves the best results under 7 out of 11 challenging tracking attributes including IV (72.4%), SV (68.1%), OCC (69.3%), BC (72.2%), DEF (67.6%), OPR (69.3%) and IPR (68.2%). For sequences with MB, FM, OV and LR tracking challenges, the proposed tracker achieves the second best performance compared to other competing trackers. The improved performance of the proposed tracker demonstrates the effectiveness of the common subspace fusion mechanism.

2) UAV123 Dataset
This dataset contains 123 video sequences with an average length of 915 frames [28]. The results are compared with 15 SOTA trackers: ECO, GCT, CREST, SRDCF, STRCF, MEEM, BACF, MUSTER, DSST, MCCT, STAPLE, ASRCF, GFS-DCF, RPCF and DSLT. Some visual results from the UAV123 dataset are shown in Figure 6. Figure 8 shows the performance comparison of the proposed CSF tracker with other SOTA trackers in terms of Precision Rate (PR) and Success Rate (SR).
Overall, the proposed CSF tracker achieves the best precision rate of 79.2%, which is 2.0% better than the baseline tracker GFS-DCF (77.2%) and 2.4% better than the deep tracker DSLT. The proposed CSF algorithm also achieves the best success rate (AUC) of 56.6%, which is 2.3% better than GFS-DCF and 3.1% better than DSLT. This improvement is attributable to the subspace fusion mechanism across varying types of feature representations incorporated within the baseline tracker GFS-DCF.
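The subspace fusion mechanism credited above can be sketched concretely. Minimizing the sum of squared projection distances from a common subspace to the individual low-rank subspaces is equivalent to taking the top eigenvectors of the summed projection matrices. The NumPy illustration below uses our own function names and a simple truncated-SVD low-rank step; it is a sketch of the idea, not the authors' implementation:

```python
import numpy as np

def low_rank_basis(R, k):
    # truncated SVD of a response map: denoised map + column-subspace basis
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k], U[:, :k]

def common_subspace(bases, k):
    # minimizer of sum_i ||U U^T - U_i U_i^T||_F^2 over orthonormal U:
    # the top-k eigenvectors of the summed projectors sum_i U_i U_i^T
    P = sum(U @ U.T for U in bases)
    _, V = np.linalg.eigh(P)  # eigenvalues in ascending order
    return V[:, -k:]

def fuse_responses(maps, k=3):
    # denoise each response map, estimate the common subspace,
    # then project each denoised map onto it and average
    denoised, bases = zip(*(low_rank_basis(R, k) for R in maps))
    Uc = common_subspace(bases, k)
    return sum(Uc @ (Uc.T @ R) for R in denoised) / len(maps)
```

The target location would then be taken as the peak of the fused map; note that no per-map weights are required, consistent with the weight-free property of the proposed fusion.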
3) GOT-10K Dataset
In terms of mAO, the proposed CSF tracker achieves 46.4%, which is 4.2% better than the baseline tracker GFS-DCF (42.2%) and 5.8% better than the CCOT tracker (40.6%). In terms of the mSR0.50 mea-

G. EVALUATION ON LONG-TERM TRACKING DATASETS
We have also evaluated the performance of the proposed tracker on two long-term visual object tracking datasets: LaSoT [30] and VOT2018-LT [25]. The following subsections describe the performance comparison of the proposed tracker on these datasets.

1) LaSoT Dataset [30]
The test split of this dataset consists of 280 videos with an average length of 2448 frames [30]. The results of the proposed CSF tracker are compared with 15 SOTA trackers including ECO, DSLT, BACF, HCFTs, CFNET, LCT, SRDCF, TRACA, STAPLE, STRCF, ASRCF, and GFS-DCF. The performance is reported in terms of PR and SR by using the protocols provided by the original authors [30].
In terms of PR, the CSF tracker obtains 38.2%, which is 2.6% better than the baseline tracker GFS-DCF (35.6%) and 4.5% better than the ASRCF tracker, as shown in Figure 9. In terms of SR, the CSF tracker achieves a 3.5% performance improvement over GFS-DCF and up to 5.7% better accuracy than ASRCF. This experiment demonstrates the effectiveness of the proposed fusion algorithm on long-term tracking challenges.
2) VOT2018-LT Dataset [25]
The long-term challenge of the VOT2018 dataset consists of 35 video sequences with an average resolution of 468 × 785, as shown in Table 1. The proposed CSF tracker has also been evaluated on this dataset in terms of the F-score, Recall (Re), and Precision (Pr) measures defined in the VOT2018-LT evaluation kit [25]. On this dataset, trackers are ranked by the maximum F-score each attains. Table 6 shows the performance comparison of the proposed CSF tracker with five SOTA trackers: CCOT, DeepSRDCF, DeepSTRCF, UDT, and GFS-DCF. Overall, the proposed CSF tracker attains the best F-score of 67.8%, outperforming the GFS-DCF and UDT trackers by margins of 2.6% and 5.8%, respectively, and also surpassing CCOT. The corresponding Re and Pr scores likewise demonstrate the improvement of the proposed tracker over the second-best tracking method.

Figure 9: Precision and success plots using OPE of the proposed CSF tracker against other SOTA trackers on the LaSoT dataset [30]. The legend of the precision plot contains the threshold scores at 20 pixels, while the legend of the success plot contains the area-under-the-curve score for each tracker.

34.19 fps; thus, the frame-rate overhead in this configuration is 5.87 fps. The proposed CSF fusion algorithm is therefore computationally efficient and does not incur significant computational overhead beyond the underlying baseline trackers.
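For reference, the F-score used for VOT-LT ranking is the harmonic mean of tracking precision and recall, maximized over confidence thresholds. A minimal sketch of the combining formula (the helper name is our own, not the evaluation kit's API):

```python
def f_score(pr: float, re: float) -> float:
    # harmonic mean of tracking precision (Pr) and recall (Re);
    # VOT-LT ranks trackers by the maximum of this value over thresholds
    return 2.0 * pr * re / (pr + re) if (pr + re) > 0 else 0.0
```

For example, balanced Pr and Re of 0.678 each would yield exactly the 67.8% F-score reported above, while any imbalance between the two lowers the score.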

V. CONCLUSION
In this work, an information fusion algorithm is proposed to encode the complementary information contained in various types of features and trackers to improve VOT performance. For this purpose, low-rank representations of response maps are computed which remove unwanted boundary effects and suppress noise. A common low-rank subspace representation is estimated such that it is close to each individual subspace on the Grassmann manifold in terms of projection distance. The common subspace representation acts as the fusion scheme which integrates the information encoded by the individual low-rank response maps. The CSF algorithm is generic and works well with various types of features, correlation filter-based trackers, and deep trackers. Evaluations are performed for feature fusion and tracker fusion on seven challenging tracking benchmark datasets and compared with several SOTA trackers. Our algorithm has consistently demonstrated significant performance improvements over various baseline methods. The CSF algorithm has also outperformed existing fusion schemes using the same features and baseline tracker. We observe that the fusion of deep correlation filter-based trackers results in the highest performance gain. Existing SOTA fusion-based tracking methods assign weights to different features or response maps; an advantage of the proposed fusion algorithm is that it does not require weight assignment to different feature representations or response maps. The proposed fusion algorithm finds it challenging to handle significant target scale and orientation variations. In the future, the proposed fusion algorithm will be implemented as a deep layer in an end-to-end network for VOT.

JORGE DIAS has researched in the area of Computer Vision and Robotics and has contributed to the field since 1984. He has several publications in international journals, books, and conferences. Jorge Dias has been principal investigator of several international research projects.
Jorge Dias has published several articles in the area of Computer Vision and Robotics, including more than 80 publications in international journals, 1 published book, 15 book chapters, and more than 280 articles in refereed international conferences.
NAOUFEL WERGHI received his Habilitation and PhD in Computer Vision from the University of Strasbourg. He has been a Research Fellow at the Division of Informatics at the University of Edinburgh and a Lecturer at the Department of Computer Sciences at the University of Glasgow. Currently, he is an Associate Professor at the Electrical Engineering and Computer Science Department at Khalifa University of Science and Technology. He has been a visiting professor at the University of Louisville, the University of Florence, the University of Lille, and the Korea Advanced Institute of Science and Technology (KAIST) in South Korea. His main research area is 2D/3D image analysis and interpretation, where he has been leading several funded projects related to biometrics, medical imaging, remote sensing, and intelligent systems. He is an Associate Editor of the EURASIP Journal on Image and Video Processing. He is a member of the IEEE Signal Processing Society and IEEE Pattern Analysis and Machine Intelligence.
VOLUME 4, 2016