Deep Group-Shuffling Dual Random Walks With Label Smoothing for Person Reidentification

Person reidentification (ReID) is a challenging task of finding a target pedestrian in a gallery set collected from multiple nonoverlapping camera views. Recently, state-of-the-art ReID performance has been achieved via an end-to-end trainable deep neural network framework, which integrates convolutional feature extraction, similarity learning and reranking into a joint optimization framework. In such a framework, the similarity is learned via an embedding network, the reranking is conducted with a random walk, and the whole framework is optimized with a cross-entropy-based verification loss. Unfortunately, the embedding networks are difficult to train well because their two-dimensional outputs mutually interfere with each other when the conventional random walk is used. In addition, the supervision information has not been fully exploited during the training phase due to the binary nature of the verification loss. In this paper, we propose a novel approach, called group-shuffling dual random walks with label smoothing (GSDRWLS), in which random walks are performed separately on two channels, one for positive verification and one for negative verification, and the binary verification labels are properly modified with an adaptive label smoothing technique before being fed into the verification loss, in order to train the overall network effectively and to avoid overfitting. Extensive experiments conducted on three large benchmark datasets, including CUHK03, Market-1501 and DukeMTMC, confirm the superior performance of our proposal.


I. INTRODUCTION
Person reidentification (ReID) aims to match pedestrians across multiple cameras and has increasingly gained attention in the computer vision and pattern recognition community due to its importance in video surveillance analysis. However, person ReID is quite challenging because of heavy variations caused by different viewpoints, varying illumination, changing weather conditions, cluttered backgrounds, etc.
Traditional approaches address the person ReID task roughly from three perspectives: a) feature extraction, b) metric learning, and c) reranking. In [1]-[4], hand-crafted features based on color, texture or their combination are designed. However, it is a challenge for these methods to capture enough discriminative information. In [5]-[10], metric learning-based methods are proposed to learn a transformation that projects the original features into a new feature space where different pedestrians can be clearly separated. Nevertheless, these methods suffer from high computation costs and are hardly suitable for large-scale datasets. In [11]-[13], reranking is proposed as a post-processing approach to refine the ranking list based on the contextual information of the probe image and the gallery set. However, the improvement from reranking is usually restrained because the information exploitable by reranking methods is very limited.
(The associate editor coordinating the review of this manuscript and approving it for publication was Joewono Widjaja.)
Recently, the deep neural network ResNet-50 [14] has been introduced into the ReID task so that feature extraction and metric learning can be merged. Although remarkable improvements have been reported in [15], [16], these methods are still suboptimal because reranking is ignored. More recently, Shen et al. [17] proposed an end-to-end trainable deep learning framework for person ReID, in which feature extraction, similarity learning and reranking are integrated, and a group-shuffling random walk is used to perform reranking for effectively training the overall framework. However, the embedding networks used to learn the similarity are difficult to train well due to the entanglement phenomena of the single-channel random walk, and they suffer from overfitting. In addition, the supervision information in each minibatch has not yet been fully exploited in the training stage.
In this paper, we propose to address the shortcomings in the integrated end-to-end framework [17] with three ingredients: a) conducting dual random walks in two separate channels, one for positive verification information and one for negative verification information; b) using an adaptive label-smoothing technique in the cross-entropy-based verification loss to effectively exploit both the positive and the negative verification labels; and c) implementing the random walk in the training phase without splitting each minibatch into a probe set and a gallery set, in order to obtain more supervision information. The contributions of this paper are highlighted as follows.
• We propose a dual random walk that performs random walks on two separate channels, one for positive verification information and one for negative verification information. By doing so, our method not only eliminates the mutual interference but also improves the discriminative ability.
• We propose an adaptive label-smoothing technique for the verification loss to address the overfitting problem, in which the smoothing term is associated with the number of identities.
• We exploit all the pairwise supervision information in each minibatch by conducting dual random walks without splitting each minibatch of the training set into probe and gallery sets.
• We conduct extensive experiments on three benchmark datasets to evaluate the overall performance of our proposals, the effects on the training strategy, and the effects of the hyperparameters.
Paper Outline: The remainder of this paper is organized as follows. Section II reviews the relevant work. Section III presents the preliminary random walk and our proposals. Section IV shows experiments with discussion and Section V concludes the paper.

II. RELATED WORK
This section reviews the existing work for person ReID, which is relevant to our proposal.

A. GRAPH BASED MODEL
In a person ReID task, it is difficult to obtain enough labeled data in real scenarios. In [18], an attribute-based person recognition network is proposed, in which the manually annotated attributes are used not only as identity information, but also for speeding up the retrieval process. Additionally, to address the lack of labeled data, a semi-supervised learning framework is proposed in [19], in which only one example of each identity is used to perform label estimation in the training stage, and the size of the pseudo-labeled candidate set is progressively increased for the next updating step. On the other hand, it is also crucial to make full use of the available supervision information. In [20] and [17], an end-to-end deep CNN model is designed that employs embedding networks (fully connected layers) to refine the output of ResNet-50 and achieves good performance. Specifically, [20] proposed an end-to-end network called similarity-guided graph neural network (SGGNN) that directly uses the output of ResNet-50 to calculate the node feature (node information) of two input images through the operations of subtraction, squaring and batch normalization, and then introduces a small two-FC-layer message network to transform the node feature into a message feature (edge information). SGGNN can not only obtain weights from the node feature under the supervision of pairwise information, but also refine the node feature with the message feature with the help of a random walk. In [17], an end-to-end framework is proposed in which the closed-form solution of an infinite-step random walk is employed in the training stage, and the output of ResNet-50 is refined by the learned embedding networks in the testing stage. Note that both [20] and [17] proposed to split each minibatch of the training set into probe and gallery sets and are fine-tuned on a strong baseline.
In contrast to these approaches, our proposal performs similarity propagation in a concise manner without splitting each minibatch of the training set into probe and gallery sets, and thus, exploits all the pairwise supervision information.

B. RERANKING
Many works [12], [13], [21]-[26] have reported that the original ranking list can be refined by extra contextual information in the gallery set. For example, in [21], [22], extra information is obtained from human feedback; in [12], [13], [23]-[26], the contextual information in the local neighborhood structure or the manifold structure of the gallery samples is considered. Additionally, a unified framework is proposed in [27], which utilizes both neighbor information and user feedback to refine the ranking scores for cross-modal retrieval. Note that all the extra information obtained in the testing stage is a kind of similarity information, which can also be obtained in the training set. In this sense, if all the similarity information in the training set is fully captured during the feature extraction or metric learning process, a reranking stage is no longer necessary. However, with traditional feature extraction and metric learning, it is difficult to fully exploit the similarity information in the training set. Fortunately, the emergence of ResNet [14] makes it possible with its giant parameter space. To better learn the similarity in the training data, embedding networks are used. To refine the feature extraction with the similarity information of the training data, a random walk based on a group-shuffling trick is used to guide the training of ResNet-50 in [17]. However, the embedding networks used in [17] are not only quite difficult to train well due to the entanglement phenomena of the single-channel random walk used, but also suffer from overfitting problems.

FIGURE 1. ResNet-50 is used for feature extraction. After global average pooling (GAP), the output of ResNet-50 is ranked according to the L2 distance and regarded as the output of Stage 1. After elementwise subtraction, batch normalization and feature grouping, the output of ResNet-50 is divided into $N_C$ equal-length sections, which are sent into different embedding nets (fully connected layers) to compute similarity scores. Then, each similarity score $S[k]$ is refined to $S^{(\infty)}[k][l]$ by the dual random walks algorithm with probability transition matrix $A[l]$. Finally, the averaged refined score is labeled as the output of Stage 2, which is the refined outcome of Stage 1.

III. OUR PROPOSALS
This section introduces the random walk as a preliminary method and then presents our proposals.

A. PRELIMINARY RANDOM WALK
Given a probe set $X_p$ with $N_p$ images and a gallery set $X_g$ with $N_g$ images, let $S$ denote an undirected similarity graph such that, for a probe $x_p$, $S_{pg}$ describes the relationship between $x_p$ and the gallery set, and $S_{gg}$ describes the relationships among all the samples in the gallery set. To refine the similarity graph $S_{pg}$ by $S_{gg}$, a probability transition matrix $A$ is induced from $S_{gg}$ as follows:
$$A_{ij} = \frac{(S_{gg})_{ij}}{\sum_{k=1}^{N_g} (S_{gg})_{ik}}, \quad i, j = 1, \cdots, N_g, \qquad (1)$$
where all the diagonal entries of $A$ are set to zero to avoid self-reinforcement. Hence, the initial similarities $s^{(0)}_{pg} \in \mathbb{R}^{1 \times N_g}$ between $x_p$ and the gallery set can be refined by a random walk,
$$s^{(1)}_{pg} = \lambda s^{(0)}_{pg} A + (1 - \lambda) s^{(0)}_{pg}, \qquad (2)$$
where $s^{(1)}_{pg}$ denotes the refined similarities after a one-step random walk and $\lambda \in (0, 1)$ balances the two terms by guaranteeing that the refined similarities will not be too far away from the initial similarities $s^{(0)}_{pg}$. The general form of Eq. (2) is
$$s^{(t)}_{pg} = \lambda s^{(t-1)}_{pg} A + (1 - \lambda) s^{(0)}_{pg}. \qquad (3)$$
Expanding Eq. (3) yields
$$s^{(t)}_{pg} = \lambda^t s^{(0)}_{pg} A^t + (1 - \lambda) \sum_{i=0}^{t-1} \lambda^i s^{(0)}_{pg} A^i. \qquad (4)$$
According to the sum of a geometric series, for $\lambda \in (0, 1)$, we have
$$\lim_{t \to \infty} \lambda^t A^t = 0 \qquad (5)$$
and
$$\lim_{t \to \infty} \sum_{i=0}^{t-1} \lambda^i A^i = (I - \lambda A)^{-1}. \qquad (6)$$
Therefore, the convergence point of the infinite-step random walk is
$$s^{(\infty)}_{pg} = (1 - \lambda) s^{(0)}_{pg} (I - \lambda A)^{-1}, \qquad (7)$$
where $I$ denotes the identity matrix. Stacking the rows for all the samples in the probe set, we have
$$S^{(\infty)}_{pg} = (1 - \lambda) S^{(0)}_{pg} (I - \lambda A)^{-1}, \qquad (8)$$
where $S_{pg} \in \mathbb{R}^{N_p \times N_g}$ indicates the similarity graph between the probe set and the gallery set.
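The preliminary random walk above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are ours, and we assume the raw similarities are nonnegative so that row normalization is well defined.

```python
import numpy as np

def transition_matrix(S_gg):
    """Row-normalize gallery-gallery similarities into a probability
    transition matrix, zeroing the diagonal first to avoid
    self-reinforcement."""
    A = S_gg.astype(float).copy()
    np.fill_diagonal(A, 0.0)
    return A / A.sum(axis=1, keepdims=True)

def one_step(S_t, S_0, A, lam=0.95):
    """One propagation step: S^(t) = lam * S^(t-1) @ A + (1 - lam) * S^(0)."""
    return lam * S_t @ A + (1.0 - lam) * S_0

def infinite_step(S_0, A, lam=0.95):
    """Closed form of the infinite-step random walk:
    S^(inf) = (1 - lam) * S^(0) @ inv(I - lam * A)."""
    n = A.shape[0]
    return (1.0 - lam) * S_0 @ np.linalg.inv(np.eye(n) - lam * A)
```

Iterating `one_step` converges to `infinite_step`, since for a row-stochastic $A$ and $\lambda \in (0,1)$ the spectral radius of $\lambda A$ is below one, which is exactly the geometric-series argument used above.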

B. TRAINING THE JOINT FRAMEWORK OF LEARNING FEATURES AND SIMILARITY VIA GROUP-SHUFFLING DUAL RANDOM WALKS WITH LABEL SMOOTHING
For clarity, we present the flowchart of the overall framework in Fig. 1, which mainly consists of three components: a) a deep feature extraction module, b) a similarity learning module, and c) a reranking module. Compared to [17], the random walk, the training strategy, and the loss function used here are different. Briefly, our proposals exploit more supervision information during the training stage to effectively train the overall framework and avoid overfitting.

1) DEEP FEATURE LEARNING MODULE
Denote a training set with $N$ images as $X = \{x_1, x_2, \cdots, x_N\}$. Then, in the feature extraction module, we use a deep convolutional neural network, ResNet-50 [14], which produces the feature maps $Z = \{z_1, z_2, \cdots, z_N\}$, where
$$z_i = \mathrm{CNN}(x_i; \theta), \qquad (9)$$
in which $\theta$ denotes the set of parameters of the CNN and $z_i \in \mathbb{R}^D$.

2) SIMILARITY LEARNING MODULE
Consider a minibatch of $n$ images. The similarity learning module, which is an embedding network built on top of the output of the feature extraction module, transforms the learned convolutional features $Z = \{z_1, z_2, \cdots, z_n\}$ into pairwise similarities.
Denote by $\mathcal{E}$ a three-mode tensor of size $n \times n \times D$, which is calculated as follows:
$$\mathcal{E}(i, j, :) = \mathrm{BN}\big((z_i - z_j) \odot (z_i - z_j)\big), \qquad (10)$$
where $i, j = 1, \cdots, n$, $\odot$ denotes the elementwise product, $\mathrm{BN}$ means batch normalization, and $\mathcal{E}(i, j, :)$ is a $D$-dimensional vector.
With the group-shuffling trick proposed in [17], we divide $\mathcal{E}$ equally into $N_C$ sections along the feature dimension, in which the $k$-th section is denoted as $\mathcal{E}[k]$ and is defined as
$$\mathcal{E}[k] = \mathcal{E}(:, :, (k-1)d + 1 : kd), \qquad (11)$$
where $\mathcal{E}[k] \in \mathbb{R}^{n \times n \times d}$ and $d = D / N_C$. Then, with each $\mathcal{E}[k]$, rather than learning a single similarity matrix of $n \times n$, we learn two similarity matrices, one for negative verification and one for positive verification, which are arranged into a scoring tensor of $n \times n \times 2$ to score the information for pairwise verification. Specifically, for each $\mathcal{E}[k]$, we form a fully connected (FC) layer $h[k]$ with two-dimensional output to calculate the scoring tensor $S[k] \in \mathbb{R}^{n \times n \times 2}$ as follows:
$$S[k](i, j, :) = h[k]\big(\mathcal{E}[k](i, j, :)\big), \qquad (12)$$
where $h[k]$ denotes the FC layer associated with the $k$-th section.
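The similarity learning module can be sketched as follows in NumPy. All names are illustrative, and a per-dimension standardization stands in for the learned batch normalization; the FC weights would normally be trained, not random.

```python
import numpy as np

def pairwise_tensor(Z):
    """E(i, j, :) = BN((z_i - z_j) ⊙ (z_i - z_j)); a per-dimension
    standardization is used here as a stand-in for batch normalization."""
    diff = Z[:, None, :] - Z[None, :, :]          # shape (n, n, D)
    E = diff * diff
    flat = E.reshape(-1, E.shape[-1])
    return (E - flat.mean(axis=0)) / (flat.std(axis=0) + 1e-5)

def group_scores(E, weights, biases):
    """Split E into N_C equal sections along the feature axis and score
    each with its own two-output FC layer, giving one (n, n, 2)
    scoring tensor per group."""
    sections = np.split(E, len(weights), axis=2)
    return [sec @ W + b for sec, W, b in zip(sections, weights, biases)]
```

Note that the tensor is symmetric in its first two modes by construction, since $(z_i - z_j) \odot (z_i - z_j) = (z_j - z_i) \odot (z_j - z_i)$.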

3) GROUP-SHUFFLING DUAL RANDOM WALK
To conduct dual random walks, we form two probability transition matrices $A[k][l]$, $l \in \{1, 2\}$, from the scoring tensor $S[k]$ by a softmax:
$$A[k][l]_{ij} = \frac{\exp\big(S[k](i, j, l)\big)}{\sum_{m \ne i} \exp\big(S[k](i, m, l)\big)}, \qquad (13)$$
where $i, j = 1, \cdots, n$ and all the diagonal entries of $A[k][l]$ are set to zero to avoid self-reinforcement. Then, each channel of the scoring tensor is refined on its own transition matrix,
$$S^{(1)}[k][l] = \lambda S^{(0)}[k][l] A[k][l] + (1 - \lambda) S^{(0)}[k][l], \qquad (14)$$
where $\lambda \in (0, 1)$ and $S^{(1)}[k][l]$ denotes the $l$-th channel of the scoring tensor $S[k]$ refined after one-step propagation over $A[k][l]$. According to Eqs. (3) to (7), the infinite-step refinement has the closed form
$$S^{(\infty)}[k][l] = (1 - \lambda) S^{(0)}[k][l] \big(I - \lambda A[k][l]\big)^{-1}, \qquad (15)$$
so that the positive verification channel is propagated with the transition matrix built from the positive scoring information, and the negative channel with that built from the negative scoring information. Since we conduct random walks on two separate channels, we term our approach dual random walks.
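The dual random walk described above can be sketched as follows (NumPy, names ours): each verification channel builds its own transition matrix and is refined independently, which is what removes the mutual interference between the two channels.

```python
import numpy as np

def dual_transitions(S_k):
    """One transition matrix per channel of the scoring tensor S_k
    (shape (n, n, 2)): row-wise softmax with a zeroed diagonal."""
    n = S_k.shape[0]
    A = np.exp(S_k - S_k.max(axis=1, keepdims=True))  # stabilized softmax
    A[np.arange(n), np.arange(n), :] = 0.0            # no self-reinforcement
    return A / A.sum(axis=1, keepdims=True)

def dual_random_walk(S_k, lam=0.95):
    """Closed-form infinite-step refinement, run separately on the
    negative (channel 0) and positive (channel 1) channels."""
    n = S_k.shape[0]
    A = dual_transitions(S_k)
    out = np.empty_like(S_k)
    for l in range(2):
        out[:, :, l] = (1 - lam) * S_k[:, :, l] @ np.linalg.inv(
            np.eye(n) - lam * A[:, :, l])
    return out
```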

4) LOSS FUNCTION OF THE OVERALL FRAMEWORK
To exploit richer supervision information for more effectively training the convolutional feature learning module and the similarity learning module, the approach in [17] uses a cross-entropy verification loss based on the binary verification labels of image pairs and the scoring information $S[k][l]$. Unfortunately, the binary nature of the verification labels results in information omission, in which only half of the information in $S[k][l]$ is used by the verification loss. To make full use of the verification information and prevent overfitting, in this paper, rather than directly using the binary verification labels, we introduce a novel verification loss with an adaptive label smoothing technique. Let $Y \in \{0, 1\}^{n \times n}$ denote the binary verification labels, where $Y_{ij} = 1$ if images $i$ and $j$ share the same identity and $Y_{ij} = 0$ otherwise. The smoothed labels are defined as
$$\tilde{Y}_{ij} = (1 - \tau) Y_{ij} + \frac{\tau}{C}, \qquad (16)$$
where $\tau$ is the smoothing coefficient and $C$ is the number of identities in the minibatch, and the cross-entropy verification loss is computed with $\tilde{Y}$ in place of $Y$:
$$\mathcal{L} = -\frac{1}{n^2} \sum_{i,j} \Big[\tilde{Y}_{ij} \log p_{ij} + (1 - \tilde{Y}_{ij}) \log(1 - p_{ij})\Big], \qquad (17)$$
where $p_{ij}$ is the probability of the positive verification channel obtained by a softmax over the two channels of the refined scoring tensor. In the experiments, we will evaluate the verification loss with label smoothing in an ablation study.
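A minimal NumPy sketch of the smoothed verification loss follows; the function names are ours, and $p_{ij}$ is obtained by a softmax over the two verification channels. With $\tau = 0$, this reduces to the binary verification loss of [17]; $\tau = 0.10$ and $C = 64$ match the settings reported later in Section IV-B.

```python
import numpy as np

def smooth_labels(Y, tau=0.10, C=64):
    """Adaptive label smoothing: Y~ = (1 - tau) * Y + tau / C,
    with C the number of identities in the minibatch."""
    return (1.0 - tau) * Y + tau / C

def verification_loss(S, Y, tau=0.10, C=64):
    """Cross-entropy with smoothed labels.  S is an (n, n, 2) scoring
    tensor; p is the positive-channel probability from a softmax
    over the two channels."""
    e = np.exp(S - S.max(axis=2, keepdims=True))
    p = e[:, :, 1] / e.sum(axis=2)
    Yt = smooth_labels(Y, tau, C)
    eps = 1e-12
    return -np.mean(Yt * np.log(p + eps) + (1 - Yt) * np.log(1 - p + eps))
```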
Remarks: In the training phase, we use all $n$ images in a minibatch to conduct the group-shuffling dual random walks without dividing the $n$ images into a probe set and a gallery set. By doing so, we exploit the supervision information on all $n^2$ pairs of images, whereas in [17], the $n$ images in a minibatch are divided into a probe set and a gallery set, ignoring the supervision information among the images in the probe set. In the reranking stage, we conduct group-shuffling random walks on two separate channels and incorporate label smoothing into the verification loss. Putting these together, our approach exploits richer supervision information than that in [17] and avoids the overfitting problem.
For clarity, we provide the pseudocode for training our proposed GSDRWLS framework in Algorithm 1. Note that the major computational load lies in line 10, which computes $S^{(\infty)}[k][l]$ via Eq. (15) with a time complexity of $O(n^3)$. When the group-shuffling trick is used, the time complexity becomes $O(N_C^2 n^3)$.

IV. EXPERIMENTS
To validate the efficiency and effectiveness of our proposals, we conduct experiments on three large benchmark datasets, including CUHK03 [28], Market-1501 [29] and DukeMTMC-ReID [30].
A. DATASETS

1) Market-1501
The training set of Market-1501 [29] consists of 12,936 images of 751 identities, in which 3,368 images randomly selected from the 12,936 images are used as the probe set.

2) CUHK03
Reference [28] collects 13,164 images of 1,360 walking pedestrians from six disjoint cameras on the campus of The Chinese University of Hong Kong. It is close to a real-world surveillance scenario because it suffers from misalignments and missing body parts. In addition, the dataset includes both manually cropped pedestrian images and detected pedestrian images. In the experiments, a total of 13,000 images of 1,260 people are used as the training set, and 200 images of 100 people are used as the testing set.

3) DukeMTMC-ReID
Reference [30] is also a large dataset, captured from 8 camera views in an open environment on the Duke University campus. It is a subset of the DukeMTMC set and provides manually labeled bounding boxes for more than 2,700 different identities. There are 16,522 images of 702 identities, obtained by random sampling, that are used as the training set, and 17,661 images of 702 identities with 408 distractors that are used as the testing set. The probe set is randomly sampled from each identity in each camera view of the testing set and contains a total of 2,228 images.

Testing and Evaluation Metrics:
The testing procedure is the same as in [17]. For a probe image, we use the convolutional feature obtained at Stage 1 to find the best-matched top-$q$ ($q = 75$) images in the gallery set with the Euclidean distance. Then, we compute the affinity matrices $S_{pg}$ and $S_{gg}$, where $S_{pg}$ indicates the affinity between the probe and the top-$q$ images in the gallery set and $S_{gg}$ indicates the affinity matrix among the top-$q$ images, and refine $S_{pg}$ with $S_{gg}$ by group-shuffling dual random walks. Finally, all the refined $S_{pg}[k][l]$ are averaged as the final results (labeled as Stage 2 in Fig. 1).
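The Stage-2 testing procedure can be sketched as follows. This is a simplified single-channel, ungrouped version under our own assumptions: negated Euclidean distances serve as the probe-gallery affinities and a Gaussian kernel as the gallery-gallery affinities, and all names are illustrative rather than the paper's exact formulation.

```python
import numpy as np

def rerank(f_probe, F_gallery, q=75, lam=0.95):
    """Refine the probe ranking with a random walk over the top-q
    gallery neighbours (sketch of Stage 2 in Fig. 1)."""
    d = np.linalg.norm(F_gallery - f_probe, axis=1)
    top = np.argsort(d)[:q]                       # top-q by Euclidean distance
    S_pg = -d[top][None, :]                       # 1 x q probe-gallery affinity
    G = F_gallery[top]
    S_gg = np.exp(-np.linalg.norm(G[:, None] - G[None, :], axis=2))
    np.fill_diagonal(S_gg, 0.0)                   # avoid self-reinforcement
    A = S_gg / S_gg.sum(axis=1, keepdims=True)    # transition matrix
    refined = (1 - lam) * S_pg @ np.linalg.inv(np.eye(len(top)) - lam * A)
    return top[np.argsort(-refined[0])]           # re-ranked top-q indices
```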
In our experiments, the mean average precision (mAP) and the CMC rank-1, rank-5 and rank-10 accuracies are adopted as the evaluation metrics. For each dataset, the mAP and CMC scores are computed following its standard evaluation protocol.

B. PARAMETERS AND IMPLEMENTATION DETAILS
The backbone network for feature learning is ResNet-50, which is pretrained on ImageNet [31]. As a baseline approach, we train ResNet-50 using both the triplet loss and the ID loss, as illustrated in Fig. 3.
ResNet-50 with ImageNet initialization is first trained as in Fig. 3 and then loaded into the framework shown in Fig. 1 for further fine-tuning, where the batch size $n$ is set to 256 and, in each minibatch, 64 different identities ($C = 64$) are randomly sampled, with 4 images ($K = 4$) per identity. All the images are resized to 256 × 128 pixels, and the initial learning rate of ResNet-50 is set to $10^{-4}$ in the first 50 epochs and then decreased to $10^{-5}$ in the next 50 epochs. The FC layers are initialized with a zero-mean Gaussian with a standard deviation of 0.01, and their learning rate is set to 10 times the learning rate of ResNet-50. The Adam algorithm [32] is employed for gradient descent in back-propagation. We set $\lambda = 0.95$ in the random walk and $\tau = 0.10$, $C = 64$ in label smoothing. For data augmentation, we use random erasing [33] and random horizontal flipping. Our experiments are conducted on a server equipped with an Intel Xeon CPU E5-2630, 64 GB of memory and 4 GeForce GTX 1080 Ti GPUs.

C. PERFORMANCE EVALUATIONS
In this section, we conduct extensive experiments to evaluate the performance of our proposals on the three benchmark datasets.

1) EVALUATION OF DIRECT TRAINING STRATEGIES
ResNet-50 [14] is initialized with the parameters pretrained on ImageNet [31], and all the FC layers are initialized with a Gaussian of zero mean and a standard deviation of 0.01. For data augmentation, we use random erasing [33] and random horizontal flipping. The initial learning rate of ResNet-50 is set to $10^{-4}$ in the first 50 epochs and then decreased to $10^{-5}$ in the next 50 epochs for the CUHK03 dataset. Through extensive experiments, we find that this learning rate is also suitable for Market-1501 and DukeMTMC. To further improve the performance, we use the learning rate schedule suggested in [34], so that 120 epochs in total are needed in the training stage, where the learning rate linearly increases from $3.5 \times 10^{-5}$ to $3.5 \times 10^{-4}$ during the first 10 epochs and decays to $3.5 \times 10^{-5}$ and $3.5 \times 10^{-6}$ at the 40th and 70th epochs, respectively. As two baselines, we take the framework in Fig. 3, which is trained with both the triplet loss and the ID loss, and the directly trained GSRW [17]. Moreover, we train the overall framework in Fig. 1 using the verification loss with label smoothing under different propagation steps (e.g., one-step, infinite-step, or a combination of them).
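The warmup schedule above can be written as a small helper. This is our paraphrase of the schedule (epochs are 0-indexed here), not code taken from [34].

```python
def warmup_lr(epoch, base=3.5e-4):
    """Linear warmup from base/10 to base over the first 10 epochs,
    then 10x decays at epochs 40 and 70 (120 epochs in total)."""
    if epoch < 10:
        return base * (0.1 + 0.9 * epoch / 10.0)
    if epoch < 40:
        return base
    if epoch < 70:
        return base * 0.1
    return base * 0.01
```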
Effects of Different Propagation Steps: The experimental results are reported in Table 1. We observe that the end-to-end trainable framework in Fig. 1 cannot yield satisfactory performance if we use the verification loss with infinite-step propagation in a random walk. When directly training with the verification loss under infinite-step propagation, the model can hardly converge during the training stage, so the performance of the model on the testing set is very poor. A similar phenomenon can also be observed for GSRW [17], which is also trained in the infinite-step scenario. The reason is that the initial parameters of the model are not good enough, so the correct information contained in the primal transition matrix $A$ produced by the FC layers is very limited. In particular, during forward propagation, this incorrect information is further contaminated by the matrix inverse operation of the infinite-step propagation, making it difficult for back-propagation to correct. Surprisingly, if we use a combination of one-step propagation and infinite-step propagation, we can train the overall framework in Fig. 1 very well and yield results that are better than or competitive with those of the baseline approach. In Eq. (14), the affinity information is only slightly diffused based on the original information, so the useful information can be retained during forward propagation. Therefore, it is advantageous for the back-propagation algorithm to find a good convergence direction. While using only one-step propagation in the verification loss cannot yield satisfactory performance, one-step and infinite-step propagation reinforce each other to achieve very promising performance when combined.

Effects of Dual Random Walks and Label Smoothing:
To understand the effect of using dual random walks and label smoothing, we conduct a set of ablation experiments on the three datasets and list the results in Table 2, where the verification loss under a combination of one-step propagation and infinite-step propagation is used. From the results in Table 2, we conclude that using either dual random walks or label smoothing alone cannot yield satisfactory performance. Promising performance is obtained only by using dual random walks assisted with label smoothing.

TABLE 3. The results of fine-tuning the overall framework in Fig. 1 using the verification loss with label smoothing under different propagation steps. GSRW* refers to the results reported in [17]. Baseline+RW indicates employing the random walk involved in our proposal to directly refine the baselines as a post-reranking process.

2) EVALUATION OF FINE-TUNING STRATEGIES
For a fair comparison to GSRW [17], we use the same training procedure to train our proposals with the same parameter settings on the three benchmark datasets. Specifically, we first train ResNet-50, as shown in Fig. 3 and then adopt the welltrained ResNet-50 into the framework in Fig. 1 and fine-tune the overall framework on the training dataset.
Effects of Different Propagation Steps: We conduct experiments on the three benchmark datasets and list the results in Table 3. Both GSRW [17] and our GSDRWLS with different loss functions notably outperform the baseline approach. This finding confirms that, rather than directly training the overall network, training the network progressively is important. Note that the results of our GSDRWLS with different loss functions are consistently better than the results of GSRW [17]. This finding confirms the effectiveness of dual random walks and label smoothing. The difference among the results of our GSDRWLS with verification loss under different propagation steps is not significant. Unlike in the direct training case, compared to using infinite-step propagation in the verification loss, using a combination of onestep and infinite-step propagation does not yield significant differences. To make a fair comparison, we use the random walk algorithm in our proposal to directly refine the baselines generated in Fig. 3 as the post-reranking process and show the results in Table 3. From this comparison, we can confirm that properly incorporating the reranking into the training process can yield a significant improvement than using the reranking as a post-processing.

Effects of Dual Random Walks and Label Smoothing:
To further validate the effect of dual random walks and label smoothing, we again conduct an ablation study for training the overall framework in Fig. 1 with the fine-tuning strategy. The experimental results are reported in Table 4. Note that without dual random walks and label smoothing, the difference between our GSDRWLS and GSRW lies in the way the minibatch training data are used: GSDRWLS performs similarity propagation on each whole minibatch, while GSRW first divides each minibatch into a probe set and a gallery set and uses the $S_{gg}$ similarities among the members of the gallery set to refine the $S_{pg}$ similarities. Unfortunately, as seen in Table 4, the results of GSDRWLS without label smoothing are slightly worse than those of GSRW [17]. This finding suggests that although more supervision information is exploited in our GSDRWLS, the generalization of the overall network might degrade because the verification loss is inclined to overfit. When label smoothing is used in the verification loss, the results of GSDRWLS are consistently improved, which confirms the importance of using label smoothing in the verification loss. Furthermore, the combination of dual random walks and label smoothing helps GSDRWLS yield the best performance in most cases.

3) EVALUATION OF THE SENSITIVITY OF HYPERPARAMETERS
To understand the effect of label smoothing on the verification loss, we conduct experiments to evaluate performance with respect to the parameter τ . The experimental results are displayed in Fig. 4. The performance is relatively stable when setting τ ∈ [0.1, 0.4].
Moreover, we conduct experiments to evaluate the performance with respect to the parameter $\lambda \in \{0.6, 0.7, 0.8, 0.9, 0.95\}$ in the label propagation and show the experimental results in Fig. 5. We can see that the performance is not sensitive to the parameter $\lambda$ when it is set in the range [0.6, 0.95]. In addition, we conduct experiments to evaluate the effect of using different numbers of groups in computing the similarity and in the group shuffle [17]. We train the whole framework using the number of groups $N_C \in \{1, 2, 4, 8\}$ and report the results in Table 5. Setting $N_C = 1$ corresponds to using the whole feature maps produced by ResNet-50 and only one FC layer to compute the similarity, in which case the similarity is propagated on its own probability transition matrix $A$; even so, our model still shows improved performance on all three datasets. If the feature maps are split further, the performance of our model can be further improved because more delicate information can be captured by the group shuffle. However, the performance slightly degenerates if the feature maps are divided into too many groups. As can be seen, the best performance is achieved with $N_C = 4$.

4) EVALUATION OF MODEL COMPLEXITY
To clearly show the model parameters, which reflect the space complexity of the model, we divide our framework into two parts: one part is ResNet-50, which is the backbone used for feature extraction; the other is the embedding networks, which are refined by our proposed GSDRWLS to learn a suitable similarity metric in the training procedure. Table 6 shows that ResNet-50 dominates the model parameters. Specifically, the embedding networks amount to approximately 0.174% (41,000/23,508,032) of ResNet-50 in terms of model parameters and 0.002% (61,480/2,705,506,304) in terms of FLOPs. In addition, due to the matrix inverse and the group-shuffling trick, training our model takes around 20.5, 20.5 and 17.8 hours on Market-1501, DukeMTMC and CUHK03, respectively.

D. COMPARISON TO STATE-OF-THE-ART RESULTS
We compare our model with state-of-the-art methods on Market-1501, CUHK03 and DukeMTMC-reID in Table 7, Table 8 and Table 9, respectively. The results clearly show that even when trained directly, our model achieves 85.6%/94.0% on Market-1501, 96.2%/96.2% on CUHK03 and 69.1%/83.9% on DukeMTMC-reID in mAP/rank-1 accuracy, which is even better than the results reported in [17], where the model is fine-tuned on a strong baseline. Then, based on the good baseline shown in Fig. 3, the performance of our model surpasses all the state-of-the-art methods and achieves 88.1%/95.1% on Market-1501, 95.2%/96.0% on CUHK03 and 75.0%/86.7% on DukeMTMC-reID in mAP/rank-1 accuracy. Moreover, our model has a low complexity, as it only operates on FC layers, and is compatible with the strong backbone (ResNet-50). In addition, compared to a traditional reranking method such as [13], which only refines the ranking list with the context information of the testing set, our model is more stable and stronger at learning discriminative information from the training set.

V. CONCLUSION
We have proposed a novel approach, called group-shuffling dual random walks with label smoothing (GSDRWLS), to fully exploit richer pairwise supervision information for more effectively training a deep network-based person reidentification framework. Specifically, in GSDRWLS, random walks are conducted in two separate channels, an adaptive label smoothing technique is used in the cross-entropy-based verification loss, and the random walk in the training phase is implemented without splitting the minibatch data into a probe set and a gallery set. Extensive experiments conducted on three benchmark datasets validate the effectiveness of our proposal. Finally, we note that our proposed approach can be combined with state-of-the-art methods, e.g., [35], [37], [45], to further improve the performance, and can be modified by adopting lightweight deep networks to reduce the complexity.