Deep Learning Research With an Expectation-Maximization Model for Person Re-Identification

In existing person re-identification methods based on deep learning, the extraction of good features is still a key step. Some efforts divide the image of a person into multiple parts to extract more detailed information from semantically coherent parts but ignore their correlation with each other. Others adopt self-attention to reallocate pixel weights and learn the associations between different regions. These associations can improve the accuracy of the person re-identification task, but the features obtained by this type of algorithm are highly redundant, which hinders the expression of feature information. To address these challenges, we propose a feature extraction method based on a novel attention mechanism that combines the expectation-maximization (EM) algorithm and non-local operation. We embed the attention module into the ResNet-50 backbone network. The attention module captures the correlation between different regional features through non-local operation and then reconstructs these features through the EM algorithm. In addition, we divide the network into a global branch and a local branch, where the global branch extracts the complete features, and the local branch uses the Batch DropBlock method to erase a portion of the features to achieve feature diversity. Finally, extensive experiments validate the superiority of the proposed model for person re-ID over a wide variety of state-of-the-art methods on three large-scale benchmarks: DukeMTMC-ReID, Market-1501 and CUHK03.


I. INTRODUCTION
Person re-identification (re-ID), also known as cross-border tracking, is one of the most important research topics in the field of computer vision, serving vital roles in video surveillance, intelligent security, pedestrian authentication, human-computer interaction and other fields. There are some differences between the recognition task and the detection task. The detection task [1], [2] devotes more attention to the category of the object, while the recognition task devotes more attention to the ID of each target. The purpose of person re-ID is to retrieve images of a person with the same identity as the query person from a large-scale person gallery acquired by multiple non-overlapping cameras in scenes with different perspectives, times and places. There are also many works on person identification from other sources, such as human faces, fingerprints, gaits and movements. Compared with face recognition, the person re-identification scene is closer to the real environment, but it is more difficult and challenging under dramatic variations in illumination, occlusion, resolution, human pose, view angle, clothing, and background. In previous studies [3]-[5], the authors use gait and movement behaviors as dynamic features for person identification. These methods do not require high image resolution, so they can identify a person at a greater distance. Compared with gait and motion identification, the feature extraction process of person re-identification is simpler, because it only needs to extract static features of a person, such as clothing, hairstyle and backpack. Current person re-ID research only uses cropped or detected person images to match query images. Therefore, extracting a rich representation for each input image is one of the goals of person re-ID on labeled datasets.
In addition, the person images captured by cameras in various areas have different backgrounds, so an effective mechanism is needed to direct attention to the person.
A small number of attention-based deep learning models for person re-ID have recently been developed to reduce the negative effects of poor detection and human pose change. Nevertheless, attention models trained by some deep learning methods generally have a high degree of redundancy. Others only focus on attention selection in the spatial or channel domain of the features. Additionally, they often consider only local feature information while ignoring the relationships between those local features.
Therefore, in order to alleviate the negative impact of the above problems on model performance, we propose a technique to capture inter-regional associations and reduce redundancy between features. The remainder of this paper is organized as follows. Related works are described in Section II, and the contributions of this study are described in Section III. The framework of the whole network and the details of the attention module are explained in Section IV. In Section V, we conduct multiple experiments on commonly used datasets to prove the effectiveness of the proposed approach. Finally, the conclusions of this study are presented in Section VI.

II. RELATED WORKS
At present, there are two main types of person re-ID research: feature-based [6] learning methods and measure-based [7] learning methods. Feature-based learning methods treat person re-ID as a classification task, which assigns the same class to images of a person with the same ID. Therefore, the main task of this kind of method is to learn more recognizable features from each person ID image, reducing the difficulty of classification. Measure-based learning methods measure the semantic similarity between the embedded features of the model by mapping the high-dimensional image of a person to a low-dimensional feature space. These methods can reduce the intra-class distance and increase the inter-class distance between features. In this paper, we focus on the former type of methods. Conventional methods describe features by manually designing feature descriptors. In recent years, with the development of deep learning [8], [9], methods based on deep convolutional networks can learn more discriminative features than manual features. In [10], [11], a ResNet-based network was proposed to extract the global features of the whole body and to measure the similarity of the extracted features. Among these methods, several efforts [12]-[15] locate a number of body parts first and then input each part into convolutional networks to extract discriminative part feature representations. In [15], the Part-based Convolutional Baseline (PCB) network was proposed, which divides the feature map into six horizontal parts on the conv-layer and forces each part to satisfy an individual identification loss to acquire rich part feature representations. The Multiple Granularity Network (MGN) [16] is one of the best performing person re-ID networks among the part-based approaches. MGN uses a multi-branch structure to combine global features and local features as the final feature representation. On the local branches, MGN horizontally divides the feature maps into two or three parts for learning part-level features.
These part-based methods can achieve very promising results due to the mining of more detailed features. However, these models require relatively well-aligned body parts for the same person ID in order to reduce external noise.
Various works [17]-[22] have studied attention mechanisms, which help neural networks mine more discriminative features and suppress useless ones. In the person re-ID task, the attention mechanism mainly focuses on learning meaningful information for the task, enhancing the representation ability of features, and reducing the harmful effects of useless information, such as background clutter and occlusion. In [23], SENet learned the relations between feature channels and filtered out the features of the channel with the greatest response. In [24], spatial and frequency domain information is selectively fused to classify targets. In [25], the Mixed High-Order Attention Network employed high-order statistics for learning attention information. In addition, self-attention-based methods have also been used in computer vision tasks. The self-attention mechanism usually attends to all positions of the feature map and takes their weighted average in the embedded space as the response at the corresponding position in the image. Among self-attention methods, non-local operation [26] can effectively capture the relationships between various parts of a person's body, which is ignored by the previous methods. However, these methods, which compute the correlation between each pixel and every other pixel in the image, cause a sharp increase in the number of model parameters and are computationally intensive. In the past few years, some approaches have utilized various dropout-based data augmentations to enlarge the dataset, which is beneficial for discovering rich features. For example, Batch DropBlock [27] achieved high performance by partially erasing the extracted features and jointly training with the original features. However, these previous studies only considered extracting features from images in various ways and did not consider that the redundancy generated in this process would harm the performance of the model.
Therefore, in this study, we present a method which combines the expectation-maximization (EM) [28] algorithm and non-local operation to resolve the above problem.

III. CONTRIBUTIONS
The main contributions of this paper are summarized as follows:
- To obtain the relationships between distant body parts in the person image, we propose a method using non-local operations to model the correlation between pixels.
- Inspired by the expectation-maximization [28] algorithm, we use it to map the extracted features onto a small set of the most representative feature descriptors and use these descriptors to reconstruct the original features. This process effectively filters out redundant features and useless information. We combine this process with the non-local method into an attention module whose input and output dimensions are the same, so it can be easily inserted into a CNN.
- We propose a novel two-branch lightweight architecture for discovering rich features for the person re-ID task. Specifically, we adopt the idea of Batch DropBlock, developing a feature erasing branch alongside the global branch to ensure feature diversity.

IV. PROPOSED METHOD
A. OVERALL PROCEDURE OF PROPOSED METHOD
In this section, we describe the proposed method in detail. Figure 1 shows the overall network architecture, which includes a backbone network, a global branch (orange arrows) and a local branch (blue arrows). There are many options for the backbone network, among which VGG, Inception, ResNet and their improved versions are the most commonly used. The number of VGG layers can reach 19, but VGG has two disadvantages: 1) excessive network parameters, which consume substantial disk space; 2) very slow training. Inception has 22 layers, and it uses convolution kernels of different sizes in the same layer, which is beneficial for obtaining richer feature information. The gradient vanishing problem of VGG and Inception becomes very serious as the network depth increases. ResNet uses shortcut connections, which effectively relieve the gradient vanishing problem, help increase the depth of the network and improve the accuracy of the model. Based on the above considerations, we use the residual convolutional neural network ResNet-50 [29] as the backbone network for feature extraction. We divide ResNet-50 into four stages, which extract feature information at different levels from Stage 1 to Stage 4. To make the extracted features more discriminative, we introduce a novel attention module into the backbone network. Most current methods model attention in the spatial domain or channel domain, while we adopt the self-attention mechanism to model the correlations between different positions of the feature maps. We then use the EM algorithm to find a compact descriptor set for reconstructing the modeled features. This process can improve the accuracy of the person re-ID task by obtaining low-rank reconstructed features and suppressing redundant information. Table 1 describes the detailed calculation of the layer parameters in the proposed network.
Firstly, we convert the original image through a series of preprocessing steps. We resize the image into the shape of 384×128×3 and use it as an input to the whole network. In Table 1, convolution sets are expressed as (Conv1, Conv2_x, Conv3_x, Conv4_x, Conv5_x), where x represents the frequency of convolution. In order to describe the data flow more concisely, we no longer list BN and ReLU separately after Stage 1. In fact, they appear in every convolution process. ''Up'' and ''dilated'' in Table 1 represent upsampling and dilated convolution, respectively.

B. FEATURE EXTRACTION
Due to the 3 × 3 and 7 × 7 convolution kernels, feature maps gradually become smaller in the process of extracting features, which makes it harder for the attention module to capture detailed correlations of the feature map in the deep network. To this end, we modified the convolutions in Stage 4 of ResNet-50 to dilated convolutions [30] for a larger receptive field. It can be seen from Table 1 that the final feature maps of Stage 3 and Stage 4 have the same spatial size, with shapes 48×16×512 and 48×16×2048, respectively; the difference is that the number of output channels of Stage 4 is four times that of Stage 3. Larger feature maps are beneficial for obtaining richer human body information. The network is divided into two branches after the feature map output by Stage 4 of ResNet-50 passes through a bottleneck block. The global branch provides a global feature representation across the network structure and also supervises the training of the local branch. It calculates the average value of all pixels on each feature map through a global average pooling (GAP) [31] layer to produce a 2048-dimensional feature vector, which then passes through a feature dimension reduction module. The feature dimension reduction module consists of a 1 × 1 convolution layer, a batch normalization (BN) [32] layer, and a ReLU activation layer. The final feature dimension is reduced to 512, which is jointly trained with the triplet loss, label-smoothed cross entropy loss and center loss.
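The global branch's pooling and reduction steps can be sketched as follows; the NumPy stand-ins for the 1×1 convolution and inference-time BN, and the random weights, are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative Stage-4 output: c=2048 channels over a 48x16 spatial grid
feat = rng.standard_normal((2048, 48, 16))

# Global average pooling: mean over spatial positions -> 2048-d vector
gap = feat.mean(axis=(1, 2))

# A 1x1 convolution applied to a pooled vector is just a linear projection
W = rng.standard_normal((512, 2048)) * 0.01  # 2048 -> 512 reduction
v = W @ gap

# BN at inference reduces to an affine map; we standardize as a stand-in
v = (v - v.mean()) / (v.std() + 1e-5)
v = np.maximum(v, 0.0)  # ReLU

assert v.shape == (512,)
```

The resulting 512-d vector is what the three losses are applied to during training.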
The local branch uses a Batch DropBlock module to erase a certain percentage of the feature map after the bottleneck layer. Different from ordinary random erasure, Batch DropBlock erases the same area of the features within the same batch to find semantic correlations. We use a global max pooling (GMP) [31] layer instead of a global average pooling layer to produce a 2048-dimensional max feature vector; GMP can identify significant areas of the unerased features. The local branch feature is reduced to 512 dimensions after dimensionality reduction. Similar to the global branch, we use the triplet loss, label-smoothed cross entropy loss and center loss to jointly train the local branch features.
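A sketch of the batch-level erasing as we understand [27]: the same spatial block is zeroed for every sample in the batch before global max pooling. The drop ratio, block placement, and reduced channel count here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 256, 48, 16))  # batch of 4 feature maps (channels reduced for the sketch)

# Batch DropBlock: zero the SAME spatial block for every sample in the batch;
# here we drop a horizontal band covering 1/3 of the height.
h = feats.shape[2]
drop_h = h // 3
top = int(rng.integers(0, h - drop_h + 1))
dropped = feats.copy()
dropped[:, :, top:top + drop_h, :] = 0.0

# Global max pooling over the surviving positions -> one vector per image
gmp = dropped.max(axis=(2, 3))
assert gmp.shape == (4, 256)
```

Because the erased region is shared across the batch, the surviving regions of all samples are forced to carry discriminative detail, which is the intent of the method.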
Finally, the 512-dimensional feature vectors from the two branches are concatenated as the final 1024-dimensional embedding feature for the inference phase of the person re-ID task.

C. ATTENTION MODULE
The convolutional neural network is the mainstream method for person re-ID tasks. In order to extract rich feature information from the receptive field, it stacks multiple convolutional layers to obtain a larger receptive field. However, it is difficult for convolution to capture the dependencies between distant areas of the feature map, which is exactly what the task needs. Non-local operation quickly captures long-distance dependencies by directly calculating the correlations between two positions on the feature map. It can be formulated as

y_i = (1/C(x)) Σ_j f(x_i, x_j) g(x_j),  (1)

where x represents the input feature vector, the function f(·,·) calculates the similarity between pixels i and j, the function g calculates the representation of the feature map at pixel j, and C(x) is the normalization factor. The final y_i represents the weighted average over all pixels j, transformed by the function g, where the weights are given by the normalized similarity function f(·,·).
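As a concrete illustration, the non-local operation can be sketched with NumPy on a flattened feature map; the exponential dot-product choice of f and the identity embedding for g are simplifying assumptions, not the paper's exact instantiation (which uses learned 1×1 convolutions):

```python
import numpy as np

def nonlocal_block(x):
    """y_i = (1/C(x)) * sum_j f(x_i, x_j) g(x_j), with f = exp(dot product)
    and g = identity for brevity."""
    s = x @ x.T                                  # pairwise similarities
    s = np.exp(s - s.max(axis=1, keepdims=True)) # stable exponentiation
    s /= s.sum(axis=1, keepdims=True)            # normalization factor C(x)
    return s @ x                                 # weighted average of g(x_j)

x = np.random.default_rng(0).standard_normal((12, 8))  # 12 positions, 8 channels
y = nonlocal_block(x)
assert y.shape == x.shape
```

Every output position is a weighted mixture of all positions, which is how the operation captures long-range dependencies in a single step.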
The attention module in this paper introduces a second-order statistical covariance instead of a simple dot-product operation to calculate the similarity between pixels, because covariance can obtain richer inter-regional information than dot-product operation. However, the covariance inevitably results in some redundant features, which will affect the expression of effective features. In order to reduce the redundancy of the features while retaining the valid feature information, we adopt the reconstruction idea to map the features in high-dimensional space to the low-dimensional manifold. We then reconstruct the features through a few of the most representative feature descriptors to obtain more compact low-rank features than the original features. The reconstruction process can filter redundant features and effectively restore valid information.
In this paper, we use the EM algorithm to reconstruct redundant features. Only a small portion of the complex high-dimensional data has a beneficial effect on the person re-ID task, so this information deserves more attention. The EM algorithm is a heuristic iterative method. It maximizes the posterior probability or likelihood value for latent-variable models to obtain the optimal estimate of the parameters. Suppose that X = {x_1, x_2, ..., x_N} is the collection of extracted feature information, consisting of N observed samples, where each feature point x_i has a corresponding latent variable z_i. {X, Z} is the complete data, and its log-likelihood function is ln p(X, Z|θ), where θ is the set of all parameters in the model. Knowledge of the latent variables in Z comes from the posterior distribution p(Z|X, θ). The EM algorithm maximizes the likelihood

ln p(X|θ) = ln Σ_Z p(X, Z|θ)  (2)

by alternating two steps: computing an expectation (step E) and maximizing that expectation (step M). Before the iteration, we first select the initial value of the parameters θ^(0). We denote by θ^(i) the parameter estimate of the i-th iteration. In step E of the (i+1)-th iteration, we calculate the posterior distribution of the latent variables Z according to θ^(i), and then the expectation of the complete-data log-likelihood:

Q(θ, θ^(i)) = Σ_Z p(Z|X, θ^(i)) ln p(X, Z|θ),  (3)

where p(Z|X, θ^(i)) represents the conditional probability distribution of the latent variables given the feature information X and the i-th parameter estimate θ^(i).
Step M updates the parameters by maximizing the expectation obtained in step E, giving the (i+1)-th parameter estimate:

θ^(i+1) = argmax_θ Q(θ, θ^(i)).  (4)

The EM algorithm executes steps E and M repeatedly until convergence. In this paper, we regard the attention information in the features as the latent variables of the EM algorithm. We first estimate the attention information, then maximize the likelihood based on the observed features and attention information to obtain the current model parameters (feature descriptors). After a few iterations, the EM algorithm reaches convergence when the change in the model parameters becomes very small. At this point, the feature descriptors are the most representative, and the features reconstructed with the attention information are more robust than those learned with convolutional networks alone.
As shown in Figure 2, we divide the proposed attention module into two stages: F and B. Because there are certain relationships between the various parts of the human body, Stage F introduces second-order statistical covariance to capture the relations between long-distance regions on the feature map. Given the input feature map X ∈ R^(h×w×c), h and w are the height and width of the feature map, and c is the number of channels. We flatten the spatial dimensions of X so that X ∈ R^(hw×c). We then build two functions θ(x) and g(x), each a 1×1 convolution followed by a batch normalization layer and a ReLU activation layer, to obtain feature maps of shape hw × c/r, where r is a reduction factor of the feature channels. The covariance matrix is then calculated from θ(x) as

Σ = θ(X) Ī θ(X)^T,  (5)

where Ī = (1/(c/r)) (I − (1/(c/r)) 1), I is the identity matrix of shape c/r × c/r, 1 is the all-ones matrix of the same shape, and Σ has shape hw × hw. We adopt 1/(c/r) as the scaling factor for the covariance matrix, then normalize Σ by applying softmax. We multiply the normalized result by g(x) to obtain

X̄ = softmax(Σ) g(X).  (6)

While modeling the second-order statistics, Stage F also produces substantial redundant feature information, which negatively impacts the person re-ID task. To this end, we introduce the expectation-maximization algorithm to reconstruct the features of Stage F with a small number of feature descriptors, so as to obtain effective features of low redundancy. Stage B consists of three steps: probability estimation (E), likelihood maximization (M) and feature reconstruction. We regard the mapping matrix Z as the latent variables in the model and the K feature descriptors as the model parameters. The input to Stage B is the feature map X̄ ∈ R^(hw×c/r), and the initial descriptors are µ ∈ R^(K×c/r).
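Stage F's covariance weighting can be sketched in NumPy as follows; random matrices stand in for θ(X) and g(X), which in the paper come from learned 1×1 convolutions, and the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
hw, cr = 24, 16                        # flattened positions, reduced channels c/r
theta = rng.standard_normal((hw, cr))  # stand-in for theta(X)
g = rng.standard_normal((hw, cr))      # stand-in for g(X)

# I_bar = (1/(c/r)) * (I - (1/(c/r)) * ones): centering + scaling, so that
# Sigma = theta @ I_bar @ theta.T is a scaled covariance of shape hw x hw
ones = np.ones((cr, cr))
I_bar = (np.eye(cr) - ones / cr) / cr
sigma = theta @ I_bar @ theta.T

# Row-wise softmax normalization, then weight g(X) -> X_bar of shape hw x c/r
sigma = np.exp(sigma - sigma.max(axis=1, keepdims=True))
sigma /= sigma.sum(axis=1, keepdims=True)
x_bar = sigma @ g
assert x_bar.shape == (hw, cr)
```

The centering term inside Ī is what makes Σ a covariance rather than a raw Gram matrix, which is the source of the richer second-order information claimed above.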
Step E updates the mapping matrix (attention map) Z:

Z = softmax(λ X̄ µ^T),  (7)

where λ is a hyperparameter used to control the distribution of Z, with a default value of 1.
Step M updates the descriptors µ by maximizing the complete-data likelihood, where each descriptor is the weighted average of X̄. The k-th descriptor is computed as

µ_k = (Σ_{n=1}^{hw} z_{nk} X̄_n) / (Σ_{n=1}^{hw} z_{nk}).  (8)

Steps E and M alternate for T iterations until µ and Z reach an approximately convergent state. The final µ and Z are used to reconstruct X̄, yielding

X̃ = Z µ.  (9)

Finally, we use a 1×1 convolution to restore the dimension of X̃ from hw × c/r to h × w × c and add it to the original feature map X:

X' = X + X̃.  (10)

The pseudocode of this attention module is provided as Algorithm 1.

Algorithm 1 EM-Based Attention Module
1: Input: feature map X ∈ R^(h×w×c);
2: flatten X to R^(hw×c);
3: compute θ(X), g(X) ∈ R^(hw×c/r) by 1×1 convolutions;
4: X̄ = softmax(Σ) g(X) ← formulas (5), (6);
5: initialize descriptors µ ∈ R^(K×c/r);
6: for t = 1 to T do
7:   E: Z = softmax(λ X̄ µ^T) ← X̄ ∈ R^(hw×c/r), µ ∈ R^(K×c/r);
8:   M: µ_k = Σ_{n=1}^{hw} z_{nk} X̄_n / Σ_{n=1}^{hw} z_{nk} ← Z;
9: end for
10: X̃ = Z µ ← final Z, µ;
11: restore X̃ ∈ R^(hw×c/r) to R^(h×w×c) by 1×1 convolution;
12: X' = X + X̃;
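Stage B's E/M iterations and the reconstruction X̃ = Zµ can be sketched as follows; the dimensions, K, T and the random initialization are illustrative:

```python
import numpy as np

def em_attention(x_bar, k=8, t=3, lam=1.0, seed=0):
    """E/M iterations of Stage B on flattened features x_bar (hw x c/r)."""
    rng = np.random.default_rng(seed)
    mu = rng.standard_normal((k, x_bar.shape[1]))  # K feature descriptors
    for _ in range(t):
        # E-step: responsibilities Z = softmax(lambda * X_bar @ mu^T), hw x K
        z = lam * x_bar @ mu.T
        z = np.exp(z - z.max(axis=1, keepdims=True))
        z /= z.sum(axis=1, keepdims=True)
        # M-step: each descriptor is the Z-weighted average of X_bar
        mu = (z.T @ x_bar) / z.sum(axis=0)[:, None]
    return z @ mu                                  # reconstruction X_tilde = Z @ mu

x_bar = np.random.default_rng(1).standard_normal((24, 16))
x_tilde = em_attention(x_bar)
assert x_tilde.shape == x_bar.shape
```

Because X̃ is the product of an hw×K and a K×(c/r) matrix, its rank is at most K, which makes the low-rank property of the reconstructed features explicit.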

D. LOSS FUNCTION
We adopt multiple loss functions to jointly train the feature vectors of the two branches: soft-margin batch-hard triplet loss [33], label-smoothed cross entropy loss [34] and center loss [35]. The triplet loss is formulated as

L_tri = Σ_i [ m + max_p D(f_θ(x_i^a), f_θ(x_i^p)) − min_n D(f_θ(x_i^a), f_θ(x_i^n)) ]_+,  (11)

where D(f_θ(x_i^a), f_θ(x_i^p)) and D(f_θ(x_i^a), f_θ(x_i^n)) represent the distances between the anchor sample and the positive and negative samples, respectively, and m is the margin of the loss.
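The batch-hard selection inside the triplet loss can be sketched as follows. The paper adopts the soft-margin variant of [33]; this sketch uses the plain hard-margin form for clarity, with an illustrative margin value:

```python
import numpy as np

def batch_hard_triplet(feats, labels, margin=0.3):
    """For each anchor, take the farthest positive and the nearest negative
    in the batch (hard-margin form; margin=0.3 is illustrative)."""
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=2)
    loss = 0.0
    for i, y in enumerate(labels):
        hardest_pos = d[i][labels == y].max()   # farthest same-ID sample
        hardest_neg = d[i][labels != y].min()   # nearest different-ID sample
        loss += max(0.0, margin + hardest_pos - hardest_neg)
    return loss / len(labels)

rng = np.random.default_rng(0)
feats = rng.standard_normal((6, 16))
labels = np.array([0, 0, 0, 1, 1, 1])
loss = batch_hard_triplet(feats, labels)
assert loss >= 0.0
```

Mining only the hardest pair per anchor keeps the loss informative late in training, when most random triplets are already satisfied.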
The person re-ID problem can be transformed into a classification problem: each person ID corresponds to a class, so we use the cross entropy loss for optimization. Cross entropy describes the distance between two probability distributions and is defined as

L_ce = − Σ_{k=1}^{K} q(k) ln p(k),  (12)

where k ∈ {1, 2, ..., K} represents the person category, p(k) represents the predicted probability that the input image belongs to category k, and q(k) is the actual probability. The center loss reduces the distance between similar samples and is formulated as

L_c = (1/2) Σ_i ||f_i − c_{y_i}||^2,  (13)

where c_{y_i} is the feature center of the class to which sample i belongs. The final loss is the weighted sum of the above three loss functions, namely

L = L_tri + γ_i L_ce + γ_c L_c,  (14)

where γ_i and γ_c are weighting factors.
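The classification terms of the loss can be sketched as follows; the label-smoothing scheme and the weighting factor are illustrative assumptions (the paper does not state its γ values), and the triplet term is omitted for brevity:

```python
import numpy as np

def smoothed_ce(logits, label, eps=0.1):
    """Label-smoothed cross entropy for one sample: q(k) = 1-eps for the true
    class, eps/(K-1) elsewhere (one common smoothing scheme)."""
    k = logits.size
    p = np.exp(logits - logits.max()); p /= p.sum()        # softmax
    q = np.full(k, eps / (k - 1)); q[label] = 1.0 - eps    # smoothed target
    return float(-(q * np.log(p + 1e-12)).sum())

def center_loss(feats, labels, centers):
    """0.5 * mean squared distance of each feature to its class center."""
    d = feats - centers[labels]
    return float(0.5 * (d * d).sum() / len(feats))

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8))
labels = np.array([0, 0, 1, 1])
centers = rng.standard_normal((2, 8))
logits = rng.standard_normal((4, 2))

l_ce = np.mean([smoothed_ce(lg, y) for lg, y in zip(logits, labels)])
l_c = center_loss(feats, labels, centers)
total = l_ce + 0.0005 * l_c  # gamma_c = 0.0005 is a common choice, not from the paper
assert total > 0.0
```

In practice the class centers are learnable parameters updated alongside the network, so the center loss pulls features of the same identity together while cross entropy separates identities.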

V. EXPERIMENTS

A. DATASETS AND EVALUATION CRITERIA
For better evaluation, we selected three large-scale person re-ID benchmarks: Market-1501 [36], DukeMTMC-ReID [37] and CUHK03 [38]. The Market-1501 dataset includes 1,501 identities collected by six cameras at Tsinghua University. The training set includes 12,936 training images with 751 different identities. Gallery and query sets have 19,732 and 3,368 images, respectively, with another 750 identities.
The DukeMTMC-reID dataset includes 16,522 training images of 702 identities, and 2,228 query and 17,661 gallery images of another 702 identities. The CUHK03 dataset contains 14,096 labeled images and 14,097 detected images of 1,467 identities captured by two camera views. The labeled set contains 7,368 training, 1,400 query and 5,328 gallery images. The detected CUHK03 set includes 7,365 training images, 1,400 query images and 5,332 gallery images. The dataset parameters are shown in Table 2.
To evaluate the performance, we use the cumulative matching characteristic (CMC) and mean average precision (mAP) metrics. The horizontal coordinate of the CMC curve is the Rank value, among which Rank-1, Rank-5 and Rank-10 are commonly used. The vertical coordinate is the matching rate, which represents the probability of finding the correct ID within the first k matching results. It is calculated as follows: if the top k candidate samples from the returned list contain a correctly matched image, then the accuracy at rank k is 1; otherwise, it is 0.
The second indicator, mAP, regards the person re-ID problem as an object retrieval problem and can evaluate the performance of the whole model. mAP is calculated on the basis of precision (P) and average precision (AP). P is the proportion of correct matches among the retrieved images. AP measures the quality of the trained model on a single category, and mAP measures the quality of the model over all categories. AP is formulated as

AP = (1/N) Σ_{k=1}^{N} k / M_k,  (15)

where k indexes the correct matches of the query image, M_k is the total number of retrieved images when the k-th correct match of the query image is retrieved, and N is the number of matches of the query image. The mAP is the average of the AP over all queries:

mAP = (1/Q) Σ_{q=1}^{Q} AP_q,  (16)

where Q is the total number of images in the query set. In addition, we utilized receiver operating characteristic (ROC) curves and the equal error rate (EER) for performance evaluation. The ROC curve plots the true positive rate (TPR) on the y-axis against the false positive rate (FPR) on the x-axis. We evaluate a binary classification model at different classification thresholds to obtain the ROC curve. FPR and TPR are formulated as

FPR = FP / (FP + TN),  TPR = TP / (TP + FN),  (17)

where FP (false positive) is a sample predicted positive that is actually negative, TP (true positive) is a sample predicted positive that is actually positive, FN (false negative) is a sample predicted negative that is actually positive, and TN (true negative) is a sample predicted negative that is actually negative. The equal error rate (EER) is the point on the ROC curve at which the probabilities of misclassifying a positive and a negative sample are equal; it is obtained by intersecting the ROC curve with the diagonal of the unit square.
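The AP computation can be checked on a toy ranked list; `average_precision` is a hypothetical helper in which M_k is read off as the 1-based rank of the k-th correct match:

```python
import numpy as np

def average_precision(ranked_match):
    """AP = (1/N) * sum_k (k / M_k), where M_k is the 1-based rank at which
    the k-th correct match appears in the ranked gallery list."""
    ranks = np.flatnonzero(ranked_match) + 1     # ranks of the correct matches
    if ranks.size == 0:
        return 0.0
    ks = np.arange(1, ranks.size + 1)            # k = 1..N
    return float((ks / ranks).mean())

# Toy ranked gallery: 1 marks an image sharing the query's identity
ranked = np.array([1, 0, 1, 0, 0, 1])
ap = average_precision(ranked)                   # (1/1 + 2/3 + 3/6) / 3
assert abs(ap - (1.0 + 2.0 / 3.0 + 0.5) / 3.0) < 1e-9
```

Averaging this value over all Q queries gives the mAP reported in the tables.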

B. IMPLEMENTATION DETAILS
Our network is trained on a single GTX 2080Ti GPU using the PyTorch framework with a batch size of 64. We employed ResNet-50 as the basic backbone network for our method in the experiments and initialized it with ImageNet pre-trained parameters. For training, the input images are resized to 384×128 and augmented by horizontal flipping, normalization and cutout. We conducted the experiments on three datasets: DukeMTMC-reID, Market-1501 and CUHK03. The total number of epochs is set to 200 [300], namely, 200 for both Market-1501 and DukeMTMC-reID and 300 for CUHK03. We choose the Adam optimizer with a warm-up strategy, where the base learning rate is initialized to 3.5e-5. Over the first 50 epochs, the learning rate increases to 3.5e-4, then decays to 3.5e-5 at epoch 100 [150] before further decaying to 3.5e-6 at epoch 150 [200]. The initial values of the EM feature descriptors in Stage B of the attention module are randomly generated and then continuously updated and optimized through iterations of steps E and M.
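Our reading of this schedule for the 200-epoch datasets (Market-1501 and DukeMTMC-reID) can be written as a small helper; the linear interpolation inside the warm-up phase is an assumption, as the text only gives the endpoints:

```python
def lr_at(epoch, base=3.5e-5, peak=3.5e-4):
    """Warm-up schedule as described above: linear ramp to the peak over the
    first 50 epochs, then step decays at epochs 100 and 150."""
    if epoch < 50:
        return base + (peak - base) * epoch / 50  # assumed linear warm-up
    if epoch < 100:
        return peak
    if epoch < 150:
        return 3.5e-5
    return 3.5e-6

assert lr_at(0) == 3.5e-5
assert lr_at(75) == 3.5e-4
```

For CUHK03 the bracketed milestones (150 and 200 over 300 epochs) would replace the step boundaries.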
In the experiments, we use cross-validation to prevent the model from overfitting. Specifically, we first divide the dataset into a training set and a test set, and then split a small part of the training set off as a validation set to fine-tune the model. The person IDs in the training set and the test set do not overlap. After every 10 epochs of training, the program uses the validation set to verify the performance of the model and fine-tune it. In the testing phase, we use the test set to verify the final performance of the model. In this way, we can obtain a model with good generalization performance.

C. EXPERIMENTAL ANALYSIS
We evaluate the effectiveness of the proposed network on three datasets. The attention module is divided into two stages, F and B, where F represents the non-local operation with covariance, and B represents the process of feature reconstruction with the EM algorithm. Figure 3 shows the attention map generated by the EM algorithm in the attention module. It can be seen that the EM algorithm guides the attention of the model to the person's body and ignores the disturbance caused by background information. The feature descriptors µ generated in the iterative process are orthogonal, which ensures that the redundancy of the features is minimized. Experimental verification shows that the highest accuracy is achieved when the number of feature descriptors K is 160 and the number of EM iterations T is 3.
We compared the impact of different parts of the attention module on network performance. We can see from Table 3 that both Stage F and Stage B offer a certain degree of accuracy improvement for the network. For the DukeMTMC-reID dataset, our model with only F improved the performance by 1.1% and 1.0% for the mAP and Rank-1 metrics, respectively. This proves that the feature information captured by the non-local stage is beneficial to the task, and the complete attention module achieves better results than either stage alone: 78.8% for mAP and 89.4% for Rank-1. Similar to the results on the DukeMTMC-reID dataset, the use of the attention module on Market-1501 and CUHK03 also provides an improvement in accuracy as compared with the baseline. Tables 3 and 4 show the detailed comparisons on the above three datasets. Figures 4, 5, and 6 show the CMC curves of the attention module on the three datasets: DukeMTMC-reID, Market-1501 and CUHK03. From these diagrams, we can see that the F and B stages of the attention module can each improve the performance of the network, and the complete attention module after the fusion of the two stages enables the greatest improvement. Figure 7 shows the ROC and EER curves of the attention module on the above datasets. Experimental results show that the proposed model performs better than the original model, which indicates that the model with our approach can learn more discriminative deep features.
We compared the proposed model with the original ResNet-50 with respect to model complexity. Since our network structure has one more branch than the original ResNet-50, the time complexity of the model has increased to a certain extent. In addition, it can be seen from Table 5 that after the proposed attention module is added, the FLOPs and the number of parameters of the model do not increase significantly. In Tables 6, 7 and 8, we compare our work with state-of-the-art methods proposed in recent years on the popular benchmark datasets DukeMTMC-ReID, Market-1501 and CUHK03, respectively. All reported results do not apply any re-ranking techniques. From these tables, we observe that our proposed method achieves substantially superior performance over all compared methods on the DukeMTMC-ReID and Market-1501 benchmarks. For DukeMTMC-ReID, Table 6 shows that our proposed approach obtained 78.8% mAP and 89.9% Rank-1 accuracy. Compared with HA-CNN (CVPR18), which also studies the attention mechanism, our method improved mAP by 15% and Rank-1 by 9.4%. Compared with BFE (ICCV19) and MGN (ACM MM18), which are also multi-branch, our attention-based method also achieved great improvement. Similar to the comparison on DukeMTMC-reID, our model achieved slightly better results on Market-1501 than baseline methods such as MHN (ICCV19) and MGN (ACM MM18). As shown in Table 7, our model achieved 87.1% for mAP and 95.4% for Rank-1. Compared with MHN-6, which introduces 6-level attention information, our method improved mAP by 2.1% and Rank-1 by 0.3%. On the CUHK03 dataset, our method also obtains good results as compared with some classic methods.

VI. CONCLUSION
In this paper, we propose a multi-branch person re-ID method based on a fused attention mechanism to mine effective feature information from the perspective of feature extraction. Instead of normal attention, such as channel attention and spatial attention, we use covariance in the non-local operation to capture the correlations between features, then reconstruct the features with an expectation-maximization operation, which computes compact feature descriptors by iteratively executing the EM algorithm. The reconstructed features are low-rank and have more discriminative and robust representations. In the network framework of this paper, we adopt a two-branch structure: in addition to the global branch, we add a feature erasing branch, which can obtain more features than the global branch alone. This branch applies a batch erasing strategy that erases the same area of the features within a batch during training to learn detailed features. Extensive experiments show that the proposed approach achieves state-of-the-art performance on popular re-ID datasets, including Market-1501, DukeMTMC-reID and CUHK03. In addition, the results show that our proposed attention module can improve model performance.
In future research, we will first evaluate the performance of a model trained by one domain on another domain, and then study the method of improving the generalization ability of the model. Additionally, we plan to use surveillance cameras to collect person images, and study unsupervised methods to further improve model performance with these unlabeled images.