Feature Correlation Residual Network for Fine-Grained Image Recognition

Learning discriminative features for visually similar classes is crucial for fine-grained image recognition tasks. Bilinear pooling models use the outer product of embedding features to enhance representation capability and achieve favorable classification performance. However, these models produce features of exceedingly high dimensionality, which makes them impractical for large-scale applications and may result in overfitting. This article proposes a feature correlation residual method that mines the channel and spatial correlation of embedding features without increasing their dimensionality. To this end, each channel/location of the embedding features in the residual module is determined by its channel/spatial correlation to all other channels/locations, and the resulting correlation residual features are used to complement the original ones. In addition to the cross-entropy loss, a batch nuclear norm loss and a triplet loss based on the extracted features are used as regularization to alleviate overfitting, enlarge inter-class variations and reduce intra-class variations. Experimental results show that our method achieves state-of-the-art performance on several popular fine-grained image recognition datasets.


I. INTRODUCTION
Fine-grained image recognition (FGIR) has become an important research topic in computer vision and is widely applied in biological monitoring, autonomous driving, and intelligent retail. FGIR aims to recognize sub-categories, such as different species of birds, different brands of cars, and different types of retail products. Unlike traditional image classification tasks, the inter-class variations in FGIR can be much smaller because instances from different classes may belong to the same super-class. Meanwhile, the intra-class variations can be large owing to pose variations, which makes FGIR challenging.
As shown in Fig. 1, some instances from four gull species resemble each other, while instances of the same species in various poses appear different. Recently, the success of deep learning in image classification [1], [2] has spread to FGIR, which employs convolutional neural networks (CNNs) to learn discriminative features for visually similar classes. These deep learning approaches can be classified into two paradigms: local feature extraction and feature correlation extraction.
Local feature extraction methods first detect semantic local parts, and then extract local features and merge them with global features for classification. However, densely labeling object parts is expensive, which limits the practicality of these approaches. Furthermore, the performance of these approaches relies on the accuracy of the part localization step, which may introduce additional errors.
To learn more discriminative feature representations, feature correlation extraction methods have attempted to encode higher order statistics of deep embedding features from feature correlation. For example, bilinear pooling CNN [3] replaced embedding features by their outer product, in order to enhance the representation capability. However, the extremely high dimensionality of features has made it impractical for large-scale applications.
This article proposes a feature correlation residual network. The channel and spatial correlations of the embedding features are estimated and accumulated as complements (residuals) of the original features, in order to learn discriminative features. To alleviate overfitting, we minimize the batch nuclear norm of the embedding feature matrix, which gently encourages the features within a batch to be similar to each other. A triplet network is constructed on top of the features to increase inter-class variations and reduce intra-class variations. The final loss of the network combines the classification cross-entropy loss, the batch nuclear norm loss and the triplet loss.
The main contributions of this work are: 1) We extract the channel and spatial correlations of the original embedding features without increasing the dimensionality of the final embedding features. Each dimension of the embedding features in the network is determined by its channel and spatial correlation to all the other dimensions. The features containing correlation information are added to the original features as a residual, in the spirit of the deep residual network [4], so that the new features retain all the information of the original ones while integrating high-order information from channel and spatial feature correlation.
2) Batch nuclear norm loss based on the embedding features is introduced as regularization to alleviate overfitting. Unlike [53] and [54], where the nuclear norm of the class prediction distribution matrix is computed, we minimize the batch nuclear norm of the embedding feature matrix, which affects the embedding features more directly.
3) We evaluate our method on three popular FGIR datasets and achieve state-of-the-art performance on all of them.

II. RELATED WORK
In this section, we review the studies related to local feature extraction, feature correlation extraction, and triplet network.

A. LOCAL FEATURE EXTRACTION VIA EXTRA ANNOTATIONS OR AUXILIARY DATA
To capture the subtle difference between visually similar subcategories, several researchers have sought to extract features of semantic parts of the target object and create mid-level representations. Specifically, a part location, detection or segmentation module is employed before the classification. [5] trained both object and part-based detectors and chose the best detections with geometric constraints. [6] integrated localization, alignment and classification in a single network to adaptively compromise the errors of classification and alignment. [7] proposed a mask-CNN model to learn object and part masks for selecting meaningful descriptors.
Meanwhile, several studies have indicated the network learning visual concepts and semantic parts through the domain knowledge obtained from extra data. For example, [30] explored semantic embedding from knowledge bases and text, [31] organized the visual concepts in the form of knowledge graph, and [32] learned prior feature distribution from an auxiliary dataset. [48] utilized the language descriptions of discriminative parts or attributes to learn a text encoder model. The text category score from language stream and the image category score from classic vision stream were complementary and thus improved the final classification accuracy.

B. LOCAL FEATURE EXTRACTION BY SELF-SUPERVISED PART DETECTION
Obtaining part annotations such as key points, bounding boxes, and segmentation masks of parts, is labor-intensive, which limits the scalability for real-world applications. Therefore, several studies have extracted local features without extra annotations or data.
[19] learned a whole-object detector automatically to localize the object by a saliency map, and then selected the distinguished parts under spatial constraints. Motivated by attention mechanisms [20], [21] and [22] recursively learned discriminative local region attention maps and region-based feature representations. [23] proposed a self-supervision method that localizes informative regions without extra annotations by employing a teacher agent that guides the part navigators to the most informative regions. [49] localized the object and its parts and then determined the number of discriminative regions by a two-stage stacked deep reinforcement learning network; a semantic reward function combining an attention-based reward and a category-based reward was proposed to help the network learn more discriminative features. [50] integrated object-level attention to localize objects and part-level attention to select discriminative parts of the object. The part selection model was driven by an object spatial constraint that restricts the selected parts to lie within the object region, and a part spatial constraint that reduces the overlap among parts. [52] forced all feature channels belonging to the same class to be discriminative through a channel-wise attention mechanism. [51] adopted data augmentation by splitting the original image into pieces and randomly rearranging them; to compensate for the noise introduced by this destruction process, an adversarial loss that distinguishes the original image from the destructed ones was applied. [47] sequentially inferred attention maps along the channel and spatial dimensions for adaptive feature refinement. The difference between our method and [47] is that our method exploits the channel and spatial correlations among features.
Note that the performance of this kind of approach is highly dependent on the part localization accuracy. In contrast to face recognition, FGIR usually suffers from much larger pose variations, which makes it more difficult to accurately locate and align the target objects. In such cases, false localization deteriorates model accuracy and may lead to classification results even worse than those obtained by merely employing global features.

C. FEATURE CORRELATION EXTRACTION
Unlike semantic part localization and local feature extraction, some studies have directly learned more discriminative feature representations by mining feature correlation. One of the most popular methods is bilinear pooling CNN [3], which replaces embedding features by their outer product, to enhance representation capability and achieve favorable classification performance. [9] showed that the matrix square-root normalization offers further improvement of representation power.
However, this approach can lead to exceedingly high dimensionality of the features, making it impractical for large-scale applications and prone to overfitting.
To alleviate this problem, recent studies have attempted to project the correlation features into a feature space of lower dimensionality. [8] and [10] provided a kernelized viewpoint of bilinear pooling, reducing the feature dimensionality via kernel function approximations used with a logistic regression or support vector machine classifier.
On the other hand, [11] considered inter-layer feature correlation and constructed a hierarchical bilinear pooling network for the interaction of cross-layer bilinear features, which further improved the feature representation.

D. DEEP METRIC LEARNING AND TRIPLET NETWORK
The siamese network [36] and triplet network are both deep metric learning methods, which aim to learn a distance metric that measures similarity or dissimilarity between objects.
The siamese network contains two twin networks that accept distinct inputs but share the same weights, joined by an energy function that computes a distance metric between the outputs of the two networks. The distance between two objects from the same class (a positive pair) should be as small as possible, while that between objects from different classes (a negative pair) should be larger than a margin.
The triplet network [37] extends the concept of the siamese network to three identical networks, which accept an anchor object, a positive object from the same class as the anchor, and a negative object from a different class. The triplet network adaptively keeps the distance of the negative pair larger than that of the positive pair by at least a margin.
In many domain-specific classification tasks, such as face recognition [38], [39] and person re-identification [40], where inter-class variations are much smaller than in classic image classification tasks, combining a metric loss with the classification loss typically achieves better performance than using the latter alone.

III. FEATURE CORRELATION RESIDUAL NETWORK
In this section, we present a feature correlation residual network for FGIR. We first provide an overview of the whole pipeline, and then introduce it in detail.

A. OVERALL FRAMEWORK
The overall network architecture of the proposed model is illustrated in Fig. 2.

FIGURE 2. Overall framework of the proposed method. The network comprises three backbone CNN models sharing the same weights, three identical channel and spatial feature correlation residual modules extracting features containing high-order information from the corresponding outputs of the backbone models, and three losses including classification loss, batch nuclear norm loss and triplet loss.

Our network is a convolutional triplet network that comprises three pre-trained backbone CNN models sharing the same weights, three identical feature correlation residual modules extracting features containing high-order information from the corresponding outputs of the backbone models, and three losses: classification loss, batch nuclear norm loss and triplet loss. The backbone model is a general image classification model such as VGG16 [41] or ResNet50 [4] pre-trained on ImageNet [45], from which the final average pooling layer and fully-connected (FC) layer have been removed. The batch nuclear norm loss and triplet loss are used only in the training phase to find a more discriminative feature space; the prediction phase acts exactly like a general classification task and requires only one of the three streams.
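A minimal sketch of this shared-weight design is given below (module and variable names are ours, not from the paper): since the three streams share parameters, a single backbone with its final pooling and FC layers removed, together with one correlation residual module, can simply be applied to the anchor, positive and negative batches in turn.

```python
import torch
import torchvision

# Hypothetical sketch of the shared-weight three-stream design: one backbone
# (final average pooling and FC layers removed) and one correlation residual
# module are applied to anchor, positive and negative images in turn.
resnet = torchvision.models.resnet50(pretrained=True)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])   # drop avgpool + FC

def embed(images, correlation_residual):
    feature_map = backbone(images)             # (B, 2048, 14, 14) for 448 x 448 inputs
    return correlation_residual(feature_map)   # (B, d) embedding features

# z_anchor   = embed(anchor_batch,   correlation_module)
# z_positive = embed(positive_batch, correlation_module)
# z_negative = embed(negative_batch, correlation_module)
```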

B. FEATURE CORRELATION RESIDUAL MODULE
The feature correlation residual module has two branches: one for the channel feature correlation residual and the other for the spatial feature correlation residual. In each branch, the correlation-weighted response is calculated and sent to a post-processing sub-network. The outputs of these two branches are added to the original features to form our embedding features for loss computation. The channel and spatial feature correlation modules can be employed in parallel or sequentially; we discuss the performance of different combinations of the two modules in Sec. IV-D.

1) REVIEW OF BILINEAR POOLING
Given an image $I$, let $\bar{X} \in \mathbb{R}^{w \times h \times c}$ denote the output feature map of the backbone network, where $w$, $h$ and $c$ indicate the width, height and number of channels, respectively. We reshape $\bar{X}$ to $X \in \mathbb{R}^{s \times c}$, where $s = w \times h$.
Bilinear pooling extracts feature correlation via an explicit outer product of the feature map $X$:
$$Z = X^{\top} X \in \mathbb{R}^{c \times c}, \qquad Z_{ij} = \sum_{k=1}^{s} X_{ki} X_{kj}.$$
Each element $Z_{ij}$ is the sum, over locations $k = 1, \ldots, s$, of the products of the responses of the corresponding channel pair $i, j = 1, \ldots, c$.
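As a reference point, the following is a minimal PyTorch sketch of bilinear pooling over the reshaped feature map $X$; the signed square-root and l2 normalization steps are the post-processing commonly used with bilinear pooling rather than something specific to this paper.

```python
import torch
import torch.nn.functional as F

def bilinear_pool(X):
    """Bilinear pooling over the reshaped feature map X of shape (s, c):
    Z[i, j] = sum_k X[k, i] * X[k, j] is the c x c channel co-occurrence matrix."""
    Z = X.t() @ X                                          # (c, c) second-order statistics
    z = Z.flatten()                                        # c**2-dimensional descriptor
    z = torch.sign(z) * torch.sqrt(torch.abs(z) + 1e-12)   # signed square-root
    return F.normalize(z, dim=0)                           # l2 normalization

# With c = 512 VGG-style channels the descriptor already has 512**2 = 262,144
# dimensions, which is the dimensionality problem discussed next.
```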
Bilinear pooling contains the information about secondorder feature correlation for a more powerful feature representation. However, it involves a much higher-dimensional feature space, which may cause overfitting.

2) CORRELATION-WEIGHTED RESPONSE
Intuitively, the objective of channel feature correlation extraction is to obtain high-order channel relations from the original features without increasing the feature dimensions. Each channel $i_c$ is replaced by an interpolation of all channels $j_c$, where the weight of each channel $j_c$ is determined by the correlation function $f_c$ between channels $i_c$ and $j_c$. In other words, all channels contribute to $i_c$ according to their correlations.
Formally, let the vector $x_{i_c} \in \mathbb{R}^{s}$ denote the $i_c$-th channel of the feature map $X$. The channel correlation-weighted response of $x_{i_c}$ is defined as
$$g_c(x_{i_c}) = \sum_{j_c=1}^{c} f_c(x_{i_c}, x_{j_c})\, x_{j_c},$$
where $f_c(x_{i_c}, x_{j_c}) \in \mathbb{R}$ denotes the correlation between channels $i_c$ and $j_c$ of the feature map $X$, satisfying $\sum_{j_c} f_c(x_{i_c}, x_{j_c}) = 1$. The channel correlation-weighted response of the feature map $X$ is the concatenation of $g_c(x_{i_c})$ over all channels.
Similarly, let the vector $x_{i_s} \in \mathbb{R}^{c}$ denote the $i_s$-th location of the feature map $X$. The spatial correlation-weighted response of the $i_s$-th location is defined as
$$g_s(x_{i_s}) = \sum_{j_s=1}^{s} f_s(x_{i_s}, x_{j_s})\, x_{j_s},$$
where $f_s(x_{i_s}, x_{j_s})$ denotes the correlation between locations $i_s$ and $j_s$ of the feature map $X$, satisfying $\sum_{j_s} f_s(x_{i_s}, x_{j_s}) = 1$. The spatial correlation-weighted response of the feature map $X$ is the concatenation of $g_s(x_{i_s})$ over all locations.

3) CORRELATION FUNCTION FOR RESIDUAL
A natural choice of the channel correlation function $f_c(x_{i_c}, x_{j_c})$ is the softmax of the (negated) inner product of $x_{i_c}$ and $x_{j_c}$:
$$f_c(x_{i_c}, x_{j_c}) = \frac{\exp\!\left(-x_{i_c}^{\top} x_{j_c}\right)}{\sum_{k_c=1}^{c} \exp\!\left(-x_{i_c}^{\top} x_{k_c}\right)}.$$
Note that we use $-x_{i_c}^{\top} x_{j_c}$ instead of $x_{i_c}^{\top} x_{j_c}$, which means that a channel $j_c$ with lower correlation to $x_{i_c}$ receives a higher weight in the correlation response. This is because the correlation-weighted response $g_c(x_{i_c})$ will be added to the original feature $x_{i_c}$ as a residual; it therefore emphasizes the channels that differ from $x_{i_c}$, capturing complementary information and improving generalization.
Similarly, the spatial correlation function $f_s$ is a softmax of the negated inner product of $x_{i_s}$ and $x_{j_s}$:
$$f_s(x_{i_s}, x_{j_s}) = \frac{\exp\!\left(-x_{i_s}^{\top} x_{j_s}\right)}{\sum_{k_s=1}^{s} \exp\!\left(-x_{i_s}^{\top} x_{k_s}\right)}.$$

4) IMPLEMENTATION OF CORRELATION RESIDUAL FEATURES
The channel and spatial correlation residuals can be conveniently implemented with matrix multiplications. The channel correlation function yields the matrix
$$F_c = \sigma\!\left(-X^{\top} X\right) \in \mathbb{R}^{c \times c},$$
where $\sigma(A)$ denotes the softmax of matrix $A$ along its rows. We then obtain the correlation-weighted response matrix
$$G_c = X F_c^{\top} \in \mathbb{R}^{s \times c},$$
whose $i_c$-th column is $g_c(x_{i_c})$. The spatial correlation residual can be implemented through an analogous operation, with $F_s = \sigma\!\left(-X X^{\top}\right) \in \mathbb{R}^{s \times s}$ and $G_s = F_s X \in \mathbb{R}^{s \times c}$.
The final embedding features aggregate the residual features and the original ones:
$$z = \mathrm{AP}(X) + \mathrm{RN}(G_c) + \mathrm{RN}(G_s),$$
where RN denotes a small sub-network consisting of a $3 \times 3$ convolutional layer, a batch-norm layer and an average pooling layer, and AP denotes average pooling.
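For concreteness, the PyTorch sketch below implements one reading of the parallel variant of this module; the exact ordering inside the RN sub-network and the parallel aggregation are our assumptions, and the class and variable names are not from the paper. The sequential variants compared in Sec. IV-D would instead chain the two branches.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrelationResidual(nn.Module):
    """Sketch of the parallel channel/spatial correlation residual module."""

    def __init__(self, channels=2048):
        super().__init__()
        # RN: 3x3 convolution + batch norm; average pooling is applied afterwards.
        self.rn_channel = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                        nn.BatchNorm2d(channels))
        self.rn_spatial = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                        nn.BatchNorm2d(channels))

    def forward(self, feat):                            # feat: (B, c, h, w) backbone output
        b, c, h, w = feat.shape
        X = feat.flatten(2).transpose(1, 2)             # (B, s, c), s = h * w

        # Channel correlation residual: F_c = softmax(-X^T X), G_c = X F_c^T
        Fc = F.softmax(-X.transpose(1, 2) @ X, dim=-1)  # (B, c, c), rows sum to 1
        Gc = X @ Fc.transpose(1, 2)                     # (B, s, c)

        # Spatial correlation residual: F_s = softmax(-X X^T), G_s = F_s X
        Fs = F.softmax(-X @ X.transpose(1, 2), dim=-1)  # (B, s, s), rows sum to 1
        Gs = Fs @ X                                     # (B, s, c)

        to_map = lambda Y: Y.transpose(1, 2).reshape(b, c, h, w)
        # z = AP(X) + RN(G_c) + RN(G_s), with global average pooling as AP
        z = feat.mean(dim=(2, 3)) \
            + self.rn_channel(to_map(Gc)).mean(dim=(2, 3)) \
            + self.rn_spatial(to_map(Gs)).mean(dim=(2, 3))
        return z                                        # (B, c) embedding features

# module = CorrelationResidual(channels=2048)
# z = module(backbone_feature_map)   # backbone_feature_map: (B, 2048, 14, 14)
```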

C. FEATURE BATCH NUCLEAR NORM REGULARIZATION
Inspired by [53] and [54], we utilize a batch nuclear norm module as regularization to prevent the classifier from overfitting.
Assume that $\{z_1, z_2, \ldots, z_K\}$ denote the embedding features of the $K$ training samples in a batch. We define the batch feature matrix as
$$Z = [z_1, z_2, \ldots, z_K]^{\top}.$$
To alleviate the overfitting problem, we gently encourage the features of different samples to stay close to each other. To this end, we would like to minimize the rank of $Z$ so that all the features in the batch become similar. However, rank minimization is NP-hard, so we instead minimize a convex relaxation of the rank of $Z$, namely its nuclear norm. Therefore, the batch nuclear norm loss used as regularization is
$$L_b = \lVert Z \rVert_{*}.$$
Unlike [53] and [54], where the nuclear norm of the class prediction distribution matrix is computed, we minimize the batch nuclear norm of the embedding feature matrix, which affects the embedding features more directly and achieves better performance.
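A minimal PyTorch sketch of this regularizer, assuming the embedding features are stacked row-wise into the batch matrix $Z$:

```python
import torch

def batch_nuclear_norm_loss(Z):
    """Nuclear norm (sum of singular values) of the batch feature matrix Z of
    shape (K, d): a convex surrogate for rank(Z) that gently pulls the K
    embedding features in the batch toward a low-rank, i.e. similar, configuration."""
    return torch.linalg.matrix_norm(Z, ord='nuc')

# Z = torch.stack(list_of_embeddings, dim=0)   # K embedding features of dimension d
# loss_b = batch_nuclear_norm_loss(Z)
```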

D. COMBINED NETWORK WITH CLASSIFICATION LOSS, BATCH NUCLEAR NORM LOSS AND TRIPLET LOSS 1) COMBINED NETWORK
The outputs of the feature correlation residual network are evaluated through the combination of three objectives: classification loss, batch nuclear norm loss and metric loss.
For the classification loss, we simply connect an FC layer and a softmax unit to the embedding features and compute the cross-entropy loss, as in a traditional classification task. For the metric loss, we use the triplet network to increase inter-class variations and reduce intra-class variations.
The final loss of our network is a weighted combination of the classification loss $L_c$, the batch nuclear norm loss $L_b$ and the triplet loss $L_t$:
$$L = L_c + \lambda_b L_b + \lambda_t L_t,$$
where $\lambda_b$ and $\lambda_t$ are hyper-parameters that control the importance of $L_b$ and $L_t$.
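Assembled in code, the objective is a single weighted sum. The sketch below uses PyTorch's built-in cross-entropy and triplet-margin losses as stand-ins for the terms defined in this section (the built-in triplet loss uses plain Euclidean distance); the default weights follow the values used in the hyper-parameter study of Sec. IV-D, and all function and variable names are placeholders.

```python
import torch
import torch.nn.functional as F

# Sketch of the combined objective L = L_c + lambda_b * L_b + lambda_t * L_t.
def total_loss(logits, labels, Z, z_a, z_p, z_n, lambda_b=5.0, lambda_t=1.0, m=0.5):
    L_c = F.cross_entropy(logits, labels)                  # classification loss
    L_b = torch.linalg.matrix_norm(Z, ord='nuc')           # batch nuclear norm loss
    L_t = F.triplet_margin_loss(z_a, z_p, z_n, margin=m)   # triplet loss
    return L_c + lambda_b * L_b + lambda_t * L_t
```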

2) TRIPLET LOSS FUNCTION
A triplet network comprises three sub-networks sharing the same weights. The three images fed to these sub-networks are the anchor input $I$, a positive input $I^{+}$ sampled from the images of $I$'s class, and a negative input $I^{-}$ sampled from a different class.
Let the triplet $(z, z^{+}, z^{-})$ denote the embedding features of the triplet inputs extracted by the backbone model and the feature correlation residual module. The triplet metric loss is
$$L_t = \max\!\left(0,\ \lVert z - z^{+} \rVert_2^2 - \lVert z - z^{-} \rVert_2^2 + m\right),$$
where $m$ is the margin parameter. Intuitively, the loss ensures that, given an anchor point $z$ in the embedding feature space, a positive point $z^{+}$ belonging to the same class lies closer to the anchor than a negative point $z^{-}$ belonging to another class by at least the margin $m$.
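A direct PyTorch sketch of this loss, assuming squared Euclidean distances and a mean reduction over the batch (both assumptions on our part):

```python
import torch

def triplet_loss(z, z_pos, z_neg, margin=0.5):
    """Hinge-style triplet loss over a batch of embeddings of shape (B, d): the
    negative must be at least `margin` farther from the anchor than the positive."""
    d_pos = (z - z_pos).pow(2).sum(dim=1)    # squared distance anchor-positive
    d_neg = (z - z_neg).pow(2).sum(dim=1)    # squared distance anchor-negative
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

# e.g. z, z_pos, z_neg of shape (16, 2048) produced by the correlation residual module
```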

3) HARD NEGATIVE SAMPLING
A major problem of the triplet network is that the number of possible triplets grows cubically with the size of the training dataset, which makes exhaustive training infeasible.
Note that most of these combinations already satisfy the triplet distance constraint and thus do not contribute to feature learning. Therefore, we use the semi-hard negative sampling strategy [39], which selects negative samples online within a batch such that
$$\lVert z - z^{+} \rVert_2^2 < \lVert z - z^{-} \rVert_2^2 < \lVert z - z^{+} \rVert_2^2 + m.$$
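The following is a simple, non-vectorized sketch of semi-hard negative mining within a batch, under our reading of [39]; the random choice among valid candidates and the use of plain (non-squared) Euclidean distances are assumptions.

```python
import torch

def semi_hard_negatives(Z, labels, margin=0.5):
    """For each anchor/positive pair in the batch, pick a 'semi-hard' negative:
    farther from the anchor than the positive, yet still within the margin, so
    that the resulting triplet produces a non-zero loss."""
    dist = torch.cdist(Z, Z, p=2)                  # (K, K) pairwise distances
    triplets = []
    for a in range(Z.size(0)):
        for p in range(Z.size(0)):
            if p == a or labels[p] != labels[a]:
                continue                           # keep only anchor/positive pairs
            d_ap = dist[a, p]
            mask = (labels != labels[a]) & (dist[a] > d_ap) & (dist[a] < d_ap + margin)
            candidates = mask.nonzero(as_tuple=True)[0]
            if len(candidates) > 0:                # pick one valid semi-hard negative
                n = candidates[torch.randint(len(candidates), (1,))].item()
                triplets.append((a, p, n))
    return triplets

# Example: Z of shape (16, 2048) and integer labels of shape (16,), as in a
# 4-class x 4-instance batch; returns a list of (anchor, positive, negative) indices.
```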

IV. EXPERIMENTS
A. DATASETS AND EXPERIMENT SETTINGS
To evaluate the effectiveness of our method, we conduct experiments on three extensively used and publicly available datasets: CUB-200-2011 [42], Stanford Cars [43] and FGVC-Aircraft [44]. Besides category labels, these datasets include additional annotations such as object bounding boxes, local key area locations, hierarchical labels or attribute labels. Compared to category information, these extra annotations are labor-intensive and dataset-dependent, limiting the extensibility for real-world applications and performance evaluation. Therefore, our method uses only category labels. The statistics of these datasets are summarized in Table 1.

B. IMPLEMENTATION DETAILS
We use an input image resolution of 448 × 448 in all experiments and employ data augmentation such as random cropping, rotation, and horizontal flipping during training. The batch size is set to 16, and each batch contains 4 categories with 4 instances per category. We fine-tune ResNet50 pre-trained on ImageNet as our backbone network. The model is trained for 180 epochs with a stochastic gradient descent optimizer, with an initial learning rate of 0.001 that is decayed by a factor of 0.1 every 60 epochs. The weight decay is set to $5 \times 10^{-4}$. The margin $m$ in (11) is set to 0.5 and the importance weight $\lambda$ in (10) is set to 1 empirically.
Our method is implemented in PyTorch. All experiments are conducted on a workstation with a GeForce RTX 2080 Ti GPU.
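For reference, the reported optimizer settings translate into the following PyTorch configuration; the momentum value is our assumption, as it is not stated above.

```python
import torch
import torchvision

# Hypothetical optimizer and schedule matching the reported settings: SGD with an
# initial learning rate of 0.001 decayed by 0.1 every 60 epochs, weight decay 5e-4,
# and 180 training epochs in total.
model = torchvision.models.resnet50(pretrained=True)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)
# scheduler.step() is called once at the end of each of the 180 training epochs.
```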

C. COMPARISON WITH STATE-OF-THE-ART METHODS
We compare our method with several baselines selected for their state-of-the-art performance. The comparison results are shown in Table 2.
The results in Table 2 indicate that our method outperforms these baselines. Note that our method does not require extra annotations.

D. ABLATION STUDY
We conduct ablation studies to better understand the impact of each component on our approach, with the results summarized in Table 3.
TABLE 3. Impact of different components of our method. Channel denotes employing the channel feature correlation module only; Spatial denotes employing the spatial feature correlation module only; Channel-Spatial-Par denotes employing the channel and spatial correlation modules in parallel; Channel-Spatial-Seq denotes sequentially employing the channel and spatial feature correlation modules; Spatial-Channel-Seq denotes sequentially employing the spatial and channel feature correlation modules; Channel-Spatial-Seq + BNN + Tri denotes employing the channel and spatial feature correlation modules, the batch nuclear norm module and the triplet loss module.

We compare the performance of the feature correlation residual module with channel correlation only, with spatial correlation only, with the combination of channel and spatial correlation, and with the batch nuclear norm and triplet losses added. We also compare three different architectures for combining channel and spatial correlation: sequential channel-spatial, sequential spatial-channel, and parallel. The results show that sequential channel-spatial correlation yields a larger performance improvement than the other configurations. Moreover, using channel correlation alone is slightly better than using spatial correlation alone, probably because, compared to the spatial structure of appearance, the subtle differences carried by the channels are more significant for FGIR. In addition, the results confirm that the batch nuclear norm loss and triplet loss further improve the classification accuracy.
We also investigate the effect of the hyper-parameters $\lambda_b$, $\lambda_t$ and $m$ on the CUB-200-2011 dataset. Fig. 3 - Fig. 5 show the classification accuracy for different hyper-parameter values. In Fig. 3, we fix $m$ to 0.5 and $\lambda_b$ to 5, and vary $\lambda_t$ from 0.1 to 10. In Fig. 4, we fix $m$ to 0.5 and $\lambda_t$ to 1, and vary $\lambda_b$ from 1 to 50. In Fig. 5, we fix $\lambda_b$ to 5 and $\lambda_t$ to 1, and vary $m$ from 0.01 to 10. As shown in Fig. 3 - Fig. 5, the performance of our model remains largely stable over a wide range of these hyper-parameters.

E. QUALITATIVE ANALYSIS
To better understand our model, we apply Grad-CAM [46] to visualize the response of the last convolutional layer of ResNet50 in our network on the CUB-200-2011 and CARS196 datasets. Grad-CAM computes the model response in convolutional layers weighted across channels, producing an activation map that highlights the regions of the image that are important for predicting a class. We compare the visualization results of the last convolutional layer of the backbone ResNet50 in our network with those of the last convolutional layer of ResNet50 in a fine-tuned traditional image classification network (baseline). As illustrated in Fig. 6, the activation region of our network covers the target object better than that of the baseline. Although the baseline network captures the most discriminative part of the object, our method covers more discriminative parts, which is of particular importance for FGIR.
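For readers who wish to reproduce the visualization, the following is a minimal Grad-CAM re-implementation using forward/backward hooks on the last convolutional block of ResNet50; it is our own sketch, not the code used for Fig. 6.

```python
import torch
import torchvision
import torch.nn.functional as F

# Minimal Grad-CAM sketch: gradients of the class score w.r.t. the last
# convolutional feature map weight the channels, producing a coarse heat map.
model = torchvision.models.resnet50(pretrained=True).eval()
feats, grads = {}, {}
layer = model.layer4[-1]                       # last bottleneck block of ResNet50
layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

def grad_cam(image, class_idx):
    """image: (1, 3, H, W) tensor; returns an (H, W) heat map scaled to [0, 1]."""
    score = model(image)[0, class_idx]         # class score before softmax
    model.zero_grad()
    score.backward()
    weights = grads['a'].mean(dim=(2, 3), keepdim=True)        # channel importance
    cam = F.relu((weights * feats['a']).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode='bilinear',
                        align_corners=False)
    return (cam / (cam.max() + 1e-8)).squeeze()
```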

V. CONCLUSION
In this article, we proposed a feature correlation residual network. The channel and spatial correlations of the embedding features were extracted and accumulated as a residual to the original features, enhancing the feature representations. A batch nuclear norm loss and a triplet loss were constructed based on the features to alleviate overfitting, further increase inter-class variations and reduce intra-class variations. Comprehensive experiments on several FGIR benchmarks convincingly demonstrated the effectiveness of the proposed method.