Adopting Attention and Cross-Layer Features for Fine-Grained Representation

Fine-grained visual classification (FGVC) is a challenging task because it depends on discriminative feature representations. Attention-based methods show great potential for FGVC, but they neglect that deeply mining inter-layer feature relations helps refine feature learning. Similarly, methods that associate cross-layer features achieve significant feature enhancement, but they lose the long-distance dependencies between elements. Most previous research treats these two approaches independently and overlooks that they can reinforce each other in feature learning. We therefore combine the respective advantages of the two approaches to promote fine-grained feature representations. In this paper, we propose a novel network, CLNET, which effectively applies the attention mechanism and cross-layer features to obtain feature representations. Specifically, CLNET 1) adopts self-attention to capture long-range dependencies for each element, 2) associates cross-layer features to reinforce feature learning, and 3) integrates attention-based operations between output and input to cover more feature regions. Experiments verify that CLNET yields new state-of-the-art performance on three widely used fine-grained benchmarks: CUB-200-2011, Stanford Cars, and FGVC-Aircraft. Our code is available at https://github.com/dlearing/CLNET.git.


I. INTRODUCTION
Image classification has taken a giant leap forward in computer vision [37]-[39]. FGVC distinguishes objects of subcategories, e.g., aircraft models [1], flower species [2], etc. Because object appearances are so similar, it is of great value to focus on subtle and discriminative features [3]-[6], [36]. However, this is a challenging task because hidden features are difficult to obtain. Currently, weakly supervised learning approaches with image-level labels are the typical way to achieve FGVC, namely region proposal methods, attention-based methods, and transformer methods. Each method has pros and cons.
Region proposal methods [7]-[11], [19], [33]-[35] rely on local region proposals to identify discriminative regions. Fu et al. [20] proposed RA-CNN, which gradually seeks out discriminative regions and merges multiple classification results to achieve image classification. Zheng et al. [21] proposed MA-CNN, which obtains rich image features from the feature representations of multiple local regions. Liu et al. [9] proposed filtration and distillation learning (FDL), which treats object-based feature learning and region-based feature learning as "teacher" and "student", respectively. FDL [9] provides better supervision for region-based feature learning. However, these methods rely on complex algorithms to select discriminative parts, which makes the network difficult to train.
The associate editor coordinating the review of this manuscript and approving it for publication was Khoa Luu.
On the other hand, attention-based methods are a giant leap for FGVC; they utilize attention operations to obtain better classification results. One strategy is a model with a fixed network structure, e.g., StackedLSTM [8] and TASN [12], which hinders practical use. Another strategy is to use the attention mechanism to design a generic block [13], [14] that can be conveniently integrated into a CNN. However, these generic blocks improve feature extraction only in deep layers and perform poorly in shallow layers, even though the shallow layers contain rich spatial features. Thus, the features of different layers are complementary. Unfortunately, existing methods either neglect the interaction of cross-layer features or simply concatenate features [15].
Recently, transformers have been used in FGVC [17], [22], [23]; they rely on raw self-attention weights to learn discriminative features and perform very well. This shows that long-range dependencies among elements are of great value for FGVC. However, we find that these methods still amount to learning region proposals. Specifically, they require the image patch size to be fixed in advance with a sliding window, which may split discriminative regions and harm local region proposal. Moreover, they keep only the K tokens with the largest values and discard the other tokens, which causes the loss of complementary parts.
To address the above challenges, we propose a novel FGVC model, CLNET, to reinforce discriminative feature extraction. We observe that the self-attention method and the cross-layer method can mutually reinforce fine-grained feature learning. Accordingly, CLNET consists of a self-attention module, a high-level feature module, and an associating cross-layer feature module. First, the self-attention module is the non-local block [26], which obtains long-distance dependencies for each element. Second, the high-level feature module is the DBT block [13], which effectively learns deep semantic representations. Finally, the associating cross-layer feature module is R3Net [27], which integrates the features of different layers to enhance the saliency of the feature representation.
Discriminative features of an object tend to be lost as they propagate through a CNN. To minimize the loss of valuable information, we combine self-attention and cross-layer features to achieve enhanced feature representations. To exploit the dependencies between all pixels, we use the non-local block [26], [13] to reinforce deep feature representations. To utilize the complementarity of different convolution layers, we employ R3Net [27] to achieve discriminative feature learning. To our knowledge, we are the first to combine the self-attention mechanism and cross-layer features for FGVC. CLNET outperforms the existing vision transformer models ViT [16] and TransFG [17] on the benchmark datasets, as shown in FIGURE 1. Our contributions are summarized as follows: 1) We propose a novel model, CLNET, which demonstrates the effectiveness of associating the self-attention mechanism with cross-layer features.

II. RELATED WORK
In this section, we review existing FGVC works that are relevant to our research. The methods proposed to overcome the challenges of FGVC fall into attention models and feature fusion models.

A. ATTENTION METHODS
Feature learning plays an important role in FGVC. Because the differences among subcategories are subtle, using a CNN only to extract deep semantic features hinders further representation learning. To address this problem, Lin et al. [3] proposed the bilinear pooling model, which adopts two CNNs to obtain pairwise features and then uses the outer product to generate a high-dimensional vector. Hu et al. [18] proposed SENet, which calculates the weight of each channel to enhance significant features and thereby realize feature recalibration. Zheng et al. [12] adopted a trilinear attention module to extract fine attention maps and designed an attention-based sampler to highlight discriminative regions. Woo et al. [14] proposed the Convolutional Block Attention Module (CBAM), a general module that combines spatial attention and channel attention. Zheng et al. [13] proposed the deep bilinear transformer (DBT), which learns fine-grained feature representations by semantic grouping and intra-group interaction; CNNs perform well with DBT blocks. Dosovitskiy et al. [16] proposed ViT, the first model to apply the transformer to image classification. Meanwhile, He et al. [17] proposed the first transformer model for FGVC (i.e., TransFG), which uses the raw attention weights to select the discriminative regions of the image. However, attention-based generic blocks, e.g., CBAM and DBT, merely utilize deep semantic information, and ViT-based models require very large-scale training datasets. In contrast, our model achieves global attention with the non-local [26] operator within a flexible network framework.

B. FEATURE FUSION METHODS
Owing to the success of ResNet [28], ResNet-based models are widely used in visual tasks. However, a single convolution layer contains limited discriminative features for FGVC, so researchers try to utilize multi-layer features for feature extraction. These approaches rely on the interaction of cross-layer features to increase attention to the region of interest. In general, low-level layers contain rich spatial features and locate the object accurately, whereas high-level layers are rich only in semantic features. Long et al. [24] used the feature representations of different convolutional layers to achieve better image segmentation. Yu et al. [15] proposed HBP, which refines feature representation capabilities by cross-layer bilinear pooling. Qi et al. [25] proposed a road extraction algorithm for high-resolution remote sensing images based on multi-feature fusion, which fuses spectral features with spatial features to improve the recognition of road meshes. However, simple inter-layer linear pooling cannot capture sufficient feature relationships. Therefore, our model refines cross-layer features for saliency detection with R3Net [27].

III. METHOD
In this section, we introduce the proposed CLNET, which contains three modules: the long-range dependencies module, the deep semantic information extraction module, and the associating cross-layer feature module.
An overview of the proposed CLNET is shown in FIGURE 2, and the framework of the backbone is given in Table 1. As FIGURE 2 shows, we integrate non-local blocks [26] and DBT blocks [13] into ResNet [28] as the backbone. Subsequently, feature integration consists of 1) refining the features of each layer and 2) integrating the high-level features (H) and the low-level features (L). Next, we take H and L as the inputs of R3Net [27] to reinforce the salient features. R3Net uses H as the saliency feature, which shows that applying supervision signals to H can obtain better saliency features. Finally, the last saliency map ($S_n$) is used as input to the fully convolutional network (FCN [24]) to obtain the classification results via softmax.
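The integration of multi-resolution features into H and L can be illustrated with a minimal sketch. It assumes, following the description of R3Net in Section III-C, that feature maps are brought to a common spatial size by upsampling and then concatenated along the channel axis. The nearest-neighbour interpolation, the pure-Python nested-list layout (channels x height x width), and the function names `upsample_nearest`, `concat_channels`, and `integrate` are illustrative assumptions, not the authors' implementation.

```python
def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of a C x H x W nested list by an integer factor."""
    return [
        [
            [row[x // factor] for x in range(len(row) * factor)]
            for row in channel
            for _ in range(factor)  # repeat every row `factor` times
        ]
        for channel in fmap
    ]


def concat_channels(*fmaps):
    """Concatenate feature maps along the channel axis; spatial sizes must agree."""
    h, w = len(fmaps[0][0]), len(fmaps[0][0][0])
    for f in fmaps:
        if len(f[0]) != h or len(f[0][0]) != w:
            raise ValueError("spatial sizes must match before concatenation")
    return [channel for f in fmaps for channel in f]


def integrate(fmaps):
    """Upsample every map to the size of the first (largest) one, then concatenate."""
    target_h = len(fmaps[0][0])
    resized = [upsample_nearest(f, target_h // len(f[0])) for f in fmaps]
    return concat_channels(*resized)


# Hypothetical toy inputs: one 4x4 shallow channel and two 2x2 deep channels.
shallow = [[[1, 2, 3, 4] for _ in range(4)]]
deep = [[[5, 6], [7, 8]], [[0, 1], [2, 3]]]
fused = integrate([shallow, deep])  # 3 channels, each 4x4
```

In a real network the same two steps would operate on tensors, and the upsampling would typically be bilinear rather than nearest-neighbour.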

A. LONG-RANGE DEPENDENCIES MODULE
If global features can be used effectively in FGVC, image classification performance can be further improved.
To this end, we utilize the non-local block [26], which captures the long-range dependencies between any two positions. Furthermore, experiments show that the non-local block [26] can be integrated into ResNet [28] and performs well, so it is well suited to enhancing feature learning. Specifically, the non-local block is defined as $z_i = W_z y_i + x_i$, where $y_i$ is the output of the non-local operation, $x_i$ is the input feature, $W_z$ is a convolution operation whose output channels are equal to those of $x_i$, and $z_i$ is the output with a residual connection to $x_i$. The detailed calculation process is shown in FIGURE 3.
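As a rough illustration of the non-local operation, recall that in [26] the response at position i is a normalized, similarity-weighted sum over all positions j, $y_i = \frac{1}{C(x)} \sum_j f(x_i, x_j)\, g(x_j)$. The sketch below uses the softmax (embedded Gaussian) form of that normalization. It makes several simplifying assumptions that are not taken from the paper: the embeddings are identities, $W_z$ is a single scalar `w_z` rather than a 1x1 convolution, and the feature map is flattened to a short list of position vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def non_local(x, w_z=1.0):
    """x: list of N position vectors. Returns (z, attention).

    attention[i][j] is the normalized similarity between positions i and j,
    y_i is the attention-weighted sum of all positions, and
    z_i = w_z * y_i + x_i adds the residual connection.
    """
    n, dim = len(x), len(x[0])
    attention = [softmax([dot(xi, xj) for xj in x]) for xi in x]
    y = [
        [sum(attention[i][j] * x[j][d] for j in range(n)) for d in range(dim)]
        for i in range(n)
    ]
    z = [[w_z * y[i][d] + x[i][d] for d in range(dim)] for i in range(n)]
    return z, attention


# Hypothetical toy input: three positions with two channels each.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z, attention = non_local(features)
```

Every output position depends on every input position through `attention`, which is what gives the block its long-range reach; setting `w_z` to zero reduces the block to the identity, so it can be inserted into a pre-trained backbone without disturbing it initially.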

B. DEEP SEMANTIC INFORMATION EXTRACTION MODULE
Currently, the attention mechanism plays an important role in deep feature representations. Typically, DBT [13] adopts semantic grouping and intra-group bilinear interaction to promote feature learning. Ablation studies on the integrated stages show that the DBT block is insufficient for obtaining low-level features, whereas it is effective for extracting high-level features. DBTNet [13] is a model built by integrating DBT blocks into ResNet. Thus, we simply add non-local blocks to DBTNet [13] to form our backbone. The loss function of DBTNet is as follows:

C. ASSOCIATING CROSS-LAYER FEATURE MODULE

1) OPTIMIZED INTEGRATED FEATURES
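As far as we know, we are the first to apply R3Net [27] to FGVC. However, R3Net integrates features using only upsampling and concatenation of feature maps. To alleviate this issue, we use an attention mechanism, which can better suppress weakly correlated features and enhance strongly correlated ones. To be specific, we utilize the convolutional block attention module (CBAM) [14] to achieve better fine-grained feature representation:
$$O_1 = M_c(I) \otimes I, \qquad O_2 = M_s(O_1) \otimes O_1,$$
where $\otimes$ represents element-wise multiplication, $I \in \mathbb{R}^{C \times H \times W}$ is the input feature map, $M_c \in \mathbb{R}^{C \times 1 \times 1}$ is the channel attention map, $M_s \in \mathbb{R}^{1 \times H \times W}$ is the spatial attention map, and $O_2$ is the final output. The channel attention map is
$$M_c(I) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(I)) + \mathrm{MLP}(\mathrm{MaxPool}(I))),$$
where $\sigma$ denotes the sigmoid function and MLP is a multi-layer perceptron. The spatial attention map is
$$M_s(I) = \sigma(f^{7 \times 7}([\mathrm{AvgPool}(I); \mathrm{MaxPool}(I)])),$$
where $f^{7 \times 7}$ is a convolution operation with a filter size of $7 \times 7$.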
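The channel-then-spatial gating of CBAM [14] can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the module used in CLNET: the shared MLP is passed in as an arbitrary callable (an identity function in the example), and the 7 x 7 convolution over the pooled maps is replaced by a per-pixel weighted sum with hypothetical weights `w_avg` and `w_max`, so spatial context is not modelled. Feature maps are pure-Python nested lists of shape C x H x W.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def channel_attention(fmap, mlp):
    """One gate per channel from average- and max-pooled channel descriptors."""
    avg = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]
    mx = [max(max(row) for row in ch) for ch in fmap]
    return [sigmoid(a + m) for a, m in zip(mlp(avg), mlp(mx))]


def spatial_attention(fmap, w_avg=1.0, w_max=1.0):
    """One gate per pixel from channel-wise average and max.

    Assumption: a per-pixel weighted sum stands in for the 7x7 convolution.
    """
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    gates = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [fmap[k][y][x] for k in range(c)]
            row.append(sigmoid(w_avg * sum(vals) / c + w_max * max(vals)))
        gates.append(row)
    return gates


def cbam(fmap, mlp):
    """Apply the channel gate first, then the spatial gate."""
    mc = channel_attention(fmap, mlp)
    o1 = [[[mc[k] * v for v in row] for row in ch] for k, ch in enumerate(fmap)]
    ms = spatial_attention(o1)
    return [
        [[ms[y][x] * v for x, v in enumerate(row)] for y, row in enumerate(ch)]
        for ch in o1
    ]


def identity_mlp(descriptor):
    """Placeholder for the shared MLP (assumption for illustration)."""
    return list(descriptor)


ones = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(2)]  # 2 x 2 x 2 toy input
refined = cbam(ones, identity_mlp)
```

Because both gates lie strictly between 0 and 1, the module can only attenuate activations; weakly correlated responses are attenuated more strongly than strongly correlated ones, which matches the stated purpose of the module.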

2) SALIENT FEATURE NETWORK
Saliency detection aims to find the salient region in the image, i.e., the region of interest (RoI). In this paper, we use R3Net. The optimized network structure is shown in FIGURE 4.

D. NETWORK ARCHITECTURE
We integrate non-local blocks and DBT blocks into ResNet, as shown in Table 1.

IV. EXPERIMENTS
In this section, we evaluate and analyze the performance of CLNET on three fine-grained benchmarks.

A. EXPERIMENTAL SETUP

1) DATASETS
To evaluate the effectiveness of CLNET, we conducted experiments on three widely used datasets: CUB-200-2011 [29], Stanford Cars [30], and FGVC-Aircraft [1]. The number of images, the number of categories, and the standard training/testing splits can be found in Table 2.

2) IMPLEMENTATION
PyTorch was used as the deep learning framework. CLNET is trained on three GPUs (GeForce RTX 2070, 8 GB). We adopt the common setting of pre-training CLNET on ImageNet [31]. To speed up the training process and reduce over-fitting, we utilize R3Net [27] pre-trained on MSRA10K. The loss function of CLNET is as follows:
Following most FGVC models, we used the SGD optimizer without momentum and weight decay, and the batch size was set to 48. Note that for a fair comparison, the loss function (L) has the same parameters as in DBTNet [13].
VOLUME 10, 2022
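The optimizer setting described above corresponds to the plainest form of stochastic gradient descent. The sketch below shows one such update together with batching at the stated batch size of 48. The function names and the learning rate value are illustrative assumptions; the paper does not report the learning rate in this section.

```python
def sgd_step(params, grads, lr):
    """Plain SGD: p <- p - lr * g, with no momentum buffer and no weight-decay term."""
    return [p - lr * g for p, g in zip(params, grads)]


def minibatches(samples, batch_size=48):
    """Yield consecutive mini-batches; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


# Hypothetical values for illustration only.
weights = sgd_step([1.0, -2.0], [0.5, 0.5], lr=0.1)
batch_sizes = [len(b) for b in minibatches(list(range(100)))]
```

With momentum or weight decay enabled, the update would carry an extra velocity term or an additional `lr * decay * p` term, respectively; neither appears here.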

B. PERFORMANCE COMPARISON
To verify the advantage of CLNET, we compared it with other state-of-the-art (SOTA) networks on three benchmark datasets. Table 3 shows that CLNET achieves competitive SOTA performance on CUB-200-2011 [29], Stanford Cars [30], and FGVC-Aircraft [1].
Specifically, the third column of Table 3 shows the comparison results on CUB [29]. Among ResNet-101 based methods, we compare CLNET101 with the SOTA StackedLSTM and achieve a 2.7% improvement. Moreover, CLNET101 achieves a 1.4% performance gain over TransFG [17] and reaches 93.1% accuracy. HBP [15] uses hierarchical bilinear pooling to extract discriminative features but ignores low-level features. FDL [9] utilizes filtration learning with a discrimination matching method to locate discriminative regions but ignores the associations between global features of the image. StackedLSTM [8] uses an LSTM for image classification, and its fixed network structure hinders practical application. TransFG uses raw attention weights to select discriminative regions of the image, but the initial splitting of the image into patches ignores the association between pixels. In contrast, CLNET maintains simplicity and robustness.
Next, we analyze the results on the Cars [30] dataset, where CLNET has a clear advantage. We also observe that models with a CNN backbone perform better than transformer-based models on this dataset. In particular, our model improves accuracy by 1.9% over TransFG [17] even when the backbone network is ResNet-50. We attribute this to the fact that the images in this dataset contain less noise, so their discriminative features are easier to extract.
Similarly, CLNET performs very well on the Aircraft dataset. Because the differences between the objects in this dataset are subtle, image classification is difficult, and only a few FGVC models report experiments on it. MA-CNN [21] obtains feature regions by generating multiple significant feature regions. FDL [9] conducts region proposal via filtration learning. Experiments show that CLNET50 improves accuracy by 1.69% over FDL and reaches 95.06% accuracy.
Since non-local blocks and cross-layer feature fusion are adopted, the efficiency of CLNET should also be analyzed. We use memory size and image processing throughput as metrics of the complexity of the proposed method.
Table 4 shows that the efficiency of CLNET is lower than that of ResNet-50. We believe this is due to the relatively complex network structure of CLNET. At the same time, CLNET surpasses the vision transformer on these metrics, which also demonstrates the advantage of CLNET.
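Image processing throughput of the kind used as an efficiency metric here can be measured with a simple timing loop. The sketch below is a generic illustration rather than the authors' benchmarking script: `model_fn` is a hypothetical stand-in for a forward pass, the batch is a plain list, and memory size is not measured.

```python
import time


def images_per_second(model_fn, batch, repeats=10, warmup=2):
    """Time repeated forward passes and return processed images per second."""
    for _ in range(warmup):  # warm-up runs are excluded from the timing
        model_fn(batch)
    start = time.perf_counter()
    for _ in range(repeats):
        model_fn(batch)
    elapsed = max(time.perf_counter() - start, 1e-12)  # guard against a zero interval
    return repeats * len(batch) / elapsed


def dummy_model(batch):
    """Placeholder workload standing in for a network forward pass (assumption)."""
    return [sum(v * v for v in range(200)) + item for item in batch]


throughput = images_per_second(dummy_model, list(range(48)), repeats=5)
```

On a GPU, the device would also need to be synchronized before each clock reading so that asynchronous kernels are included in the measured interval.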

C. ABLATION STUDIES
We conducted ablation studies on CLNET50 to illustrate the impact of different model structures on accuracy. The ablation studies show the same effects on all three datasets, so we report these experiments only on the CUB-200-2011 [29] dataset.

1) NON-LOCAL BLOCKS
It is important to verify the effectiveness of non-local blocks in CLNET.
First, to clarify the contribution of non-local blocks, we add non-local blocks to DBTNet50 [13] and measure their influence on deep feature learning.
In Table 5, the baseline is ResNet-50 without non-local blocks. We observe that integrating non-local blocks into stage II and stage III brings a 0.7% accuracy gain over the baseline, whereas integrating a non-local block into stage V does not significantly improve performance. Thus, we integrate non-local blocks into stage II, stage III, and stage IV of DBTNet50 [13].
Second, to further verify the influence of non-local blocks on the classification results, we conduct more experiments with non-local blocks added to different stages.
In Table 6, we define the baseline as CLNET50 without non-local blocks and observe the impact of adding non-local blocks at different stages on the classification results. Specifically, if we add non-local blocks at every stage, accuracy improves by 3.0%. Even if we add non-local blocks only in stage II, there is a 0.7% improvement over the baseline. However, integrating non-local blocks into stage I and stage V does not significantly improve performance. Our analysis is that stage I has less semantic information and stage V has fewer low-level features. In addition, adding blocks to CLNET50 increases the amount of computation. Thus, we omit the blocks in stage I and stage V.

2) OPTIMIZED INTEGRATED FEATURES
To obtain more discriminative features, we use channel attention and spatial attention; that is, our model integrates CBAM [14] blocks. Table 7 shows that integrating CBAM [14] blocks into the model brings a 0.9% accuracy gain, which means that CBAM blocks improve feature representations.
As each module has a different function, the order of the CBAM blocks may affect overall performance. Table 8 shows that CBAM-first achieves a 0.07% improvement, but the impact is not significant.

3) CROSS-LAYER FEATURE FUSION
To show the advantages of associating cross-layer features, we use R3Net [27] to illustrate the performance gain.
In Table 9, the baseline is the DBTNet50 [13] network with non-local blocks integrated. Table 9 shows that associating cross-layer features through R3Net [27] brings a 2% gain. Thus, features at different levels benefit each other and enhance the information of the region of interest.
Table 10 shows that using H as the saliency map brings a 0.7% accuracy gain. This confirms that high-level features retain more discriminative features.

D. VISUALIZATION EXPERIMENTS
We randomly select two images from each dataset. The visualization results of CLNET50 are shown in FIGURE 5.
To investigate the advantages of CLNET50, we conduct experiments by gradually integrating different modules.
Specifically, the third row shows that the model identifies multiple discriminative parts of the object. In contrast, CLNET50 is able to focus on multiple attention regions and reinforce the feature representation, as shown in the fourth row.

V. CONCLUSION
In this work, we propose a novel FGVC model, CLNET, to enhance the learning of fine-grained features. To our knowledge, we are the first to propose using global features to refine semantic features and associating cross-layer features to reinforce saliency features. Extensive experiments demonstrate that CLNET achieves state-of-the-art performance on various FGVC tasks. In addition, visualization experiments demonstrate the interpretability and effectiveness of the model.
The results achieved by CLNET show the great potential of adopting attention and cross-layer features in FGVC tasks. Because of the complex model structure of CLNET, we will study methods such as knowledge distillation to compress the model and further improve efficiency. In addition, we will explore utilizing cross-layer feature fusion in vision transformers.