Multi-Label Image Classification by Feature Attention Network

Learning the correlation among labels is a long-standing problem in multi-label image recognition. Label correlation is the key to solving multi-label classification, but it is too abstract to model directly. Most solutions try to learn image label dependencies to improve multi-label classification performance. However, they ignore two more practical problems: object scale inconsistency and label tail (category imbalance). Both problems degrade the classification model. To tackle these two problems and learn the label correlations, we propose the feature attention network (FAN), which contains a feature refinement network and a correlation learning network. FAN builds a top-down feature fusion mechanism to refine the more important features, and learns the correlations among the refined convolutional features to learn the label dependencies indirectly. With the proposed solution, we achieve improved classification accuracy on the MSCOCO 2014 and VOC 2007 datasets.


I. INTRODUCTION
Multi-label image classification aims to recognize the different objects or attributes in images. Compared with single-label image classification, which predicts only one label for each image, multi-label classification is more complicated: the labels of each image differ, and the number of labels per image is not fixed. Because labels are correlated, one can infer whether other labels are present in an image from the labels already predicted. The key to solving the multi-label problem is therefore to exploit label correlation to predict labels precisely. Learning label correlation is a long-standing problem because it is abstract and difficult to model directly.
With the development of machine learning and deep learning technologies, many solutions [2], [3], [8], [34], [37], [53] have been proposed to learn label correlation and have achieved promising performance on different benchmarks. However, they all ignore two practical problems in multi-label classification: object scale inconsistency and label tail (category imbalance), as shown in Figure 1. Object scale inconsistency: in real applications, different objects occupy different proportions of an image, such as a person and a tennis ball. Small objects are more difficult to identify than large ones. Label tail: label tail can also be viewed as category imbalance, which manifests itself as a long-tailed distribution of labels. It is difficult for an algorithm to learn informative features of tail objects and to identify tail labels accurately, because tail labels appear very few times in the dataset, whereas frequently occurring categories are more easily identified.
The associate editor coordinating the review of this manuscript and approving it for publication was Chenguang Yang.
Both object scale inconsistency and label tail are common phenomena in real datasets. In a deep neural network, the features of the last few layers have a larger receptive field. If the receptive field is much larger than the object, the features of small objects are easily overlooked. Further, deep networks tend to attend to easily identifiable categories such as person and car. Meanwhile, tail objects appear only a few times in the dataset, so the network cannot learn generic, distinguishable features from such limited data. Small objects and tail labels therefore usually have lower recognition performance than other categories, which affects the overall performance and versatility of the algorithm. The two problems nevertheless share one key point: the lack of informative and representative features for classifying these categories.
To address the multi-label image classification task, we propose the Feature Attention Network, which mines more representative features and learns label correlation based on the self-attention mechanism.
Our Feature Attention Network contains two sub-networks: the Feature Refinement Network and the Correlation Learning Network. The Feature Refinement Network aims to solve the object scale inconsistency and label tail problems by mining informative features, and the Correlation Learning Network learns label correlation indirectly by learning the semantic and spatial dependencies among features.
To recognize multi-scale objects, multi-scale features and context information are important. We therefore extract multi-scale features for recognition. Smaller objects are usually evident in low-level (spatial) features and disappear in high-level (semantic) features, so the multi-scale features must be exploited carefully. However, not all features are informative: we should highlight the important features and down-weight the less important ones. Therefore, inspired by SEnet [15], we propose the Feature Refinement Block to select the useful and salient features.
The Correlation Learning Network aims to learn label correlation by modeling the dependencies among convolutional features. Label correlation is a long-standing but key problem in multi-label classification. Because of its abstract nature, many methods [3], [34], [37] try to model label dependencies indirectly. In our solution, we learn feature correlation with the self-attention [33] method. Convolutional features contain pixel intensity information and spatial distribution information. The Correlation Learning Network integrates the multi-scale features from the Feature Refinement Network. It explicitly exploits feature intensity and spatial information to obtain a new feature that accounts for label correlation and further alleviates the object scale inconsistency and label tail problems.
In this paper, we reconsider the large-scale multi-label image classification task. We point out two ignored problems in multi-label image classification: object scale inconsistency and label tail. We then propose the Feature Attention Network, which not only addresses these two problems but also learns the label relationships. Our experimental results on MSCOCO 2014 and PASCAL VOC 2007 demonstrate the effectiveness of our solution.

II. RELATED WORKS

A. MULTI-LABEL CLASSIFICATION
Instead of transforming the multi-label problem [8], [26], WARP [11] proposed exploiting convolutional features for multi-label annotation and analyzed the key components that improve performance. Hypotheses-CNN-Pooling (HCP) [39] proposed using max pooling to aggregate the results of the individual object hypotheses. CNN-RNN [34] built a joint CNN-RNN network to learn a joint image-label embedding in which semantic label relevance is considered. Other works such as [2], [3], [37] used RNNs to reason about or find the attention regions relevant to multi-label classification. Those solutions can only predict the top-k labels, not a variable number of labels. SRN [53] learned class-wise attention maps and captured the potential correlation between them by applying spatial regularization to the feature maps. However, despite their better performance, these methods ignore the object scale inconsistency and label tail issues.

B. ATTENTION
Attention plays an important role in both computer vision and natural language processing. Some models introduce supervised information to capture context information or long-range dependencies among features in action recognition [4], [10], [25]. Apart from this, SEnet proposed the Squeeze-and-Excitation module to adaptively recalibrate channel-wise features without extra supervised information. Meanwhile, work [33] proposed the self-attention mechanism to draw global dependencies between input and output and achieved great success in machine learning. Further, the non-local operation [36] was introduced to relate the response at a position to the features at all positions, and it brings improvements on many computer vision tasks. Work [14] improved object detection performance with a well-designed relevance learning network. The attention mechanism has proven effective in learning label dependencies.

C. MULTI-SCALE
Object scale inconsistency is a practical and long-standing problem. Multi-scale features have been used to improve object detection performance [20], [23], [27]. The top-level features of deep neural networks carry rich semantic information; they have a small spatial size but a larger receptive field, which is useful for recognizing bigger objects. The features of the first few layers contain rich spatial information, representing the network's simple understanding of the image, and have a bigger spatial size but a smaller receptive field. Therefore, the larger-scale features are useful for finding small objects. However, not all features are useful, so it is necessary to select the informative features to pass forward to the output.

D. LABEL TAIL
Work [38] pointed out that tail labels have less impact in terms of top-k precision and the nDCG@k metric. They therefore developed a low-complexity multi-label algorithm by adaptively trimming tail labels. However, a good multi-label classification model should not be limited by the variable number of labels. In the object detection field, methods [9], [12], [29] use online hard example mining to balance the ratio of positive and negative samples. However, one cannot know which examples are positive before obtaining the results. Focal loss [21] puts more attention on hard, misclassified examples by changing the loss function. Our model addresses the low classification accuracy of tail labels by finding more fine-grained and discriminative features.

III. PROPOSED SOLUTION
In this section, we detail our proposed solution, the Feature Attention Network (FAN). We first analyze the problems in multi-label image classification. Then, we describe how we address object scale inconsistency and label tail using the Feature Refinement Network. Finally, we detail our Correlation Learning Network for learning label relevance.
The Feature Attention Network consists of a backbone network, the Feature Refinement Network (FRN) and the Correlation Learning Network (CLN). The backbone can be a classic network such as VGG or Resnet. FRN and CLN are introduced as follows.

A. FEATURE REFINEMENT NETWORK
The recognition accuracy of small objects and tail labels is usually lower than that of labels that are easier to recognize.
Features of small objects are easily ignored by deep neural networks because of strided convolutions and pooling operations. Tail-label objects, on the other hand, appear only a few times in the dataset, and the lack of training data leads to under-representation of tail labels by the network. Object scale inconsistency and label tail therefore share the same problem: the lack of informative and fine-grained features. In our solution, we build a feature recalibration mechanism to mine the useful features, which exploits the global context information and multi-scale features.

1) RTB
The Residual Transform Block (RTB), shown in Figure 3, transforms the features of different Resnet stages into the same feature space, which benefits the subsequent feature fusion and recalibration. RTB contains a convolution layer and a residual block, and acts as a buffer between the backbone network and the Feature Refinement Network. It is similar to the common residual block in Resnet [13]. In RTB, we first use a convolution to reduce the dimension of the inputs to K; in our model, we set K to 512. A residual block then transforms the feature. Finally, we use a pooling operation to halve the size of the feature maps; our solution uses average pooling with a 2 × 2 kernel and stride 2. For higher-resolution feature maps, such as those of the Block2 stage, we use more RTBs and average pooling to obtain feature maps of uniform size.

2) FRB
The Feature Refinement Block (FRB) fuses and recalibrates the features of different convolution stages. It highlights the informative and discriminative features and pays less attention to unimportant ones, which is achieved by a self-attention mechanism. FRB learns a weight vector from the features of different stages, and this vector serves as an attention vector to recalibrate the features. This highlights the features that are useful for recognizing small objects and tail labels. Specifically, we concatenate the high-level features $x_h$ and the low-level features $x_l$ along the channel dimension to get new features $x_c \in \mathbb{R}^{C \times H \times W}$. The high-level features $x_h$ have rich semantic information but less spatial information; they can be viewed as semantic supervision that guides the recalibration of the low-level features $x_l$. Then, we use global max pooling to capture the global context information:
$$z_c = F_{sq}(x_c),$$
where $F_{sq}$ denotes global max pooling. It takes $x_c$ as input to calculate the vector $z_c \in \mathbb{R}^{C \times 1 \times 1}$.
$$\hat{z}_c = \sigma(F_{tr}(z_c)),$$
where $F_{tr}$ refers to the convolution layers followed by ReLU activation, and $\sigma$ is the sigmoid function. $\hat{z}_c$ is the learned weight vector, whose entries range from 0 to 1 because of the sigmoid function. Notice that $\hat{z}_c$ is learned from $x_c$ but applied to $x_l$.
$$\hat{x}_l = \hat{z}_c \otimes x_l,$$
where $\hat{x}_l$ is the refined feature, which guides the next feature refinement iteration. We expand the dimension of $\hat{z}_c$ to $\mathbb{R}^{C \times H \times W}$ before the channel-wise multiplication. We iteratively apply FRB to recalibrate features from top to bottom.
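The FRB steps (concatenate, global max pooling, learned weights, channel-wise multiplication) can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the two dense layers `w1`/`w2` standing in for the learned transform, their sizes, and the toy inputs are all assumptions made for illustration, and the transform is assumed to output one weight per low-level channel.

```python
import math

def global_max_pool(x):
    """x: list of C channels, each an H x W nested list. Returns C maxima."""
    return [max(max(row) for row in channel) for channel in x]

def linear(weight, bias, vec):
    """Dense layer: weight is out_dim x in_dim, bias has length out_dim."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def feature_refinement_block(x_h, x_l, w1, b1, w2, b2):
    """Recalibrate low-level features x_l under guidance of high-level x_h.

    w1/b1 and w2/b2 are hypothetical stand-ins for the learned transform;
    w2 must produce one weight per channel of x_l.
    """
    x_c = x_h + x_l                                       # concatenate along channels
    z_c = global_max_pool(x_c)                            # one global value per channel
    hidden = [max(0.0, h) for h in linear(w1, b1, z_c)]   # dense layer + ReLU
    z_hat = [sigmoid(o) for o in linear(w2, b2, hidden)]  # weights in (0, 1)
    # Channel-wise multiplication: broadcast each weight over its H x W map.
    x_l_hat = [[[z * v for v in row] for row in channel]
               for z, channel in zip(z_hat, x_l)]
    return z_hat, x_l_hat

# Toy example: one high-level channel and two low-level channels of size 2 x 2.
x_h = [[[1.0, 0.0], [0.0, 0.0]]]
x_l = [[[0.0, 2.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]]
w1 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
b1 = [0.0, 0.0]
w2 = [[0.0, 0.0], [0.0, 0.0]]
b2 = [2.0, -2.0]
z_hat, x_l_hat = feature_refinement_block(x_h, x_l, w1, b1, w2, b2)
print(z_hat)
```

With these toy weights the first low-level channel is kept almost intact while the second is strongly suppressed, which is the intended effect of the recalibration.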

3) GMP
Note that we use global max pooling in FRB. Other works have verified that global average pooling (GAP) is effective in image classification [13], [37] and semantic image segmentation [24]. However, in our feature refinement and fusion process, we want the model to pay more attention to representative and discriminative features. Global average pooling tends to miss the responses of small objects, especially when the feature maps have a larger resolution. Global max pooling (GMP), in contrast, selects the maximum response as the global representation of the corresponding feature map.
It therefore does not ignore the responses from small objects or tail-label objects, and we use it to capture global context information. Our ablation experiments demonstrate that GMP is better than GAP.
In conclusion, we build the Feature Refinement Network to fuse multi-scale features and mine representative and discriminative features, and we use global max pooling in the Feature Refinement Block to capture context information. Both are beneficial for recognizing small objects and tail labels.
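The difference between the two pooling choices can be seen on a toy feature map. The sketch below assumes a hypothetical map that is zero everywhere except for a single strong response, as a small object might produce; the map sizes are illustrative only.

```python
def global_avg_pool(fmap):
    """Mean over all spatial positions of one H x W feature map."""
    values = [v for row in fmap for v in row]
    return sum(values) / len(values)

def global_max_pool(fmap):
    """Maximum over all spatial positions of one H x W feature map."""
    return max(v for row in fmap for v in row)

def toy_map(size, peak=1.0):
    """size x size map that is zero except for one strong response."""
    fmap = [[0.0] * size for _ in range(size)]
    fmap[0][0] = peak
    return fmap

for size in (7, 14, 28):
    fmap = toy_map(size)
    print(size, global_avg_pool(fmap), global_max_pool(fmap))
```

The averaged value shrinks with the square of the map size, while the max-pooled value is unchanged, which matches the argument that GAP dilutes small-object responses at larger resolutions.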

B. CORRELATION LEARNING NETWORK
This section introduces our Correlation Learning Network (CLN), which learns the spatial dependencies and semantic relevance of features by self-attention. Some works use an LSTM to locate attentional, informative regions related to different semantic objects and then predict label scores on the located regions; the LSTM can capture the global dependencies among those regions. SRN, in contrast, learns an attention map for each label and applies spatial regularization to the learned feature maps. We design CLN to learn the semantic dependencies and spatial relevance of features simultaneously through the self-attention mechanism. Specifically, CLN learns attention responses from the relationships between different positions of the feature, so the response of any location in the attention feature is related to the features of the other locations.
In this formulation, $f(x_i)$ is the attention feature scalar, $i$ is the index of the output feature position in space, and $x$ is the input signal. The response $f(x_i)$ is related to the features $f(x_j)$ at all positions ($\forall j$).
Here $g(x_i, x_j)$ is a pairwise function that computes an attention matrix for regularizing the feature $f$. We treat $g$ as a linear embedding operation and, in our solution, define it as a dot product.
In this definition, $\theta$ and $\phi$ are different image features, and $C(x)$ is a normalization function, for which we use softmax in our network. $g$ computes the attentional weight matrix, taking the semantic relevance of features into account. In our solution, $\theta$ and $\phi$ are the refined features $P_3$ and $P_2$, respectively, from the Feature Refinement Network. Compared with $P_2$ and $P_3$, $P_4$ has richer semantic information; therefore, we use the learned attention matrix to regularize $P_4$.
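One possible reading of this attention step is sketched below in plain Python. Several points are assumptions rather than details given in the text: the three feature maps are flattened to the same number N of positions, the softmax is taken over each row of the dot-product matrix, and "regularizing" $P_4$ is interpreted as replacing each position by an attention-weighted sum of $P_4$ over all positions.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def correlation_attention(p3, p2, p4):
    """p3, p2, p4: lists of N position vectors (flattened H*W positions).

    Returns the N x N attention matrix and the re-weighted P4 features.
    """
    # Dot-product similarity between every position of P3 and of P2,
    # normalized with softmax so that each row sums to one.
    attn = [softmax([dot(t, f) for f in p2]) for t in p3]
    dim = len(p4[0])
    # Each output position is a weighted sum of P4 over all positions.
    out = [[sum(a * v[d] for a, v in zip(row, p4)) for d in range(dim)]
           for row in attn]
    return attn, out

# Toy example with N = 3 positions; values are illustrative only.
p3 = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
p2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
p4 = [[3.0], [6.0], [9.0]]
attn, out = correlation_attention(p3, p2, p4)
print(out)
```

In the toy example the first position of `p3` is all zeros, so its attention is uniform and its output is simply the mean of `p4`.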

C. LOSS FUNCTION
Previous works such as [3], [11], [34], [37], [39], [52] can only produce top-k predictions, although the number of labels of each image is not fixed. Work [17] can predict a variable number of labels through adaptive thresholds learned by a dedicated label decision module. Our model outputs prediction scores in $\mathbb{R}^C$, where C is the number of classes in each dataset. The correlation information is taken into account before the predictions are output. We can view the multi-label output as a collection of C binary outputs; therefore, we use 0 as the threshold for screening prediction scores, which makes it possible to predict a variable number of labels. We use the multi-label soft margin loss to optimize our model, computed for each mini-batch, where $\hat{y}$ is the prediction and $y$ is the ground-truth label vector.
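As a sketch of the loss and the zero threshold, the code below assumes the standard definition of the multi-label soft margin loss (the one implemented, for example, by `torch.nn.MultiLabelSoftMarginLoss` in PyTorch), averaged over samples and classes. The function names are hypothetical, and whether the authors average in exactly this way is not stated in the text.

```python
import math

def log_sigmoid(v):
    """Numerically stable log(sigmoid(v))."""
    if v >= 0:
        return -math.log1p(math.exp(-v))
    return v - math.log1p(math.exp(v))

def multilabel_soft_margin_loss(scores, targets):
    """scores, targets: lists of per-sample lists of length C.

    targets hold 0/1 entries; the loss is averaged over samples and classes.
    """
    total, count = 0.0, 0
    for score_row, target_row in zip(scores, targets):
        for s, t in zip(score_row, target_row):
            total -= t * log_sigmoid(s) + (1 - t) * log_sigmoid(-s)
            count += 1
    return total / count

def predict_labels(score_row, threshold=0.0):
    """Indices of classes whose score exceeds the threshold (0 by default)."""
    return [i for i, s in enumerate(score_row) if s > threshold]

loss = multilabel_soft_margin_loss([[0.0, 0.0]], [[1, 0]])
labels = predict_labels([2.0, -1.0, 0.5])
print(loss, labels)
```

A score of 0 corresponds to a sigmoid probability of 0.5, so thresholding scores at 0 treats each class as an independent binary decision and lets the number of predicted labels vary per image.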

IV. EXPERIMENTS
To validate the effectiveness of our proposed solution, we carry out experiments on the MSCOCO 2014 and VOC 2007 datasets.
The results on both datasets demonstrate that our solution achieves outstanding performance. In this section, we first introduce the datasets and our implementation details. We then compare with other leading multi-label classification methods and perform ablation experiments to evaluate each module of our model. Finally, we report our results on both datasets and analyze them in detail.

B. IMPLEMENTATION DETAILS
Our deep neural model contains two parts: Resnet-101 [13] and the Feature Attention Network. Resnet is used to extract the image features for the Feature Attention Network. For a fair comparison with other methods, we also use VGG [32] as the backbone to demonstrate the effectiveness of our solution.

1) NETWORK IMPLEMENTATION
Our deep framework is shown in Figure 2. It contains the backbone network, the Feature Refinement Network and the Correlation Learning Network. The Feature Refinement Network consists of two components, the Feature Transform Block (FTB) and the Feature Refinement Block (FRB), schematically depicted in Figure 3. FTB transforms features into a low-dimensional space to obtain compact and informative representations. First, we use a convolution layer with a 1 × 1 kernel to reduce the dimension of the inputs to 512, followed by Batchnorm [16] and a ReLU activation. Then, two further convolution layers with 3 × 3 kernels follow. Finally, we obtain the transformed result through a residual connection. For feature $x_2$, we use two FTBs to map the features. We use average pooling with kernel 2 and stride 2. FTB acts as a buffer between the backbone and the feature refinement block.
In FRB, global max pooling is used to get a compressed feature vector. A fully connected layer then learns the channel-wise weights θ, which have the same number of channels as the low-level feature.
Here f denotes the fully connected layers with weights $w_f$, and w is their output. We calculate the weight θ by applying the sigmoid function σ to w, so θ ranges from 0 to 1. A value of θ close to 1 indicates that the corresponding channel is more important than other channels, and vice versa. Finally, the refined feature is computed by point-wise multiplication.

2) TRAINING DETAIL
We use Resnet-101 or VGG as the backbone of our model and load weights pre-trained on the ImageNet dataset [5]. We train our network end-to-end using mini-batch stochastic gradient descent (SGD) with momentum 0.9 and weight decay 1e-4. The batch size is 16.
We use random cropping and random horizontal flipping during training for both datasets. We assign different learning rates to different network layers: in the early stage of training, the learning rate is 0.1 for the Feature Attention Network and 0.01 for Resnet-101, which speeds up training. The learning rate is multiplied by 0.1 when the test accuracy stops increasing. The input image size is 448 × 448.
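The learning-rate policy described above (separate rates per parameter group, decayed by 0.1 on a plateau) can be sketched as follows. The class name, the `patience` of two evaluations and the accuracy values are hypothetical; the text only states that the rate is reduced when accuracy stops increasing.

```python
class PlateauDecay:
    """Multiply every group's learning rate by `factor` once the monitored
    accuracy has failed to improve for `patience` consecutive evaluations."""

    def __init__(self, lrs, factor=0.1, patience=2):
        self.lrs = dict(lrs)
        self.factor = factor
        self.patience = patience      # assumed value, not given in the text
        self.best = float("-inf")
        self.stale = 0

    def step(self, accuracy):
        if accuracy > self.best:
            self.best = accuracy
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lrs = {name: lr * self.factor
                            for name, lr in self.lrs.items()}
                self.stale = 0
        return dict(self.lrs)

# Initial rates from the text: 0.1 for FAN layers, 0.01 for the backbone.
schedule = PlateauDecay({"fan": 0.1, "backbone": 0.01})
history = [schedule.step(acc) for acc in (0.70, 0.75, 0.75, 0.75)]
print(history[-1])
```

In a deep learning framework the same effect is usually obtained with per-group learning rates in the optimizer plus a plateau-based scheduler; the plain-Python version only makes the rule explicit.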

3) METRICS
We use the same seven metrics as [34], [47], [53] to evaluate our proposed solution: macro/micro precision (P-C/P-O), macro/micro recall (R-C/R-O), macro/micro F measure (F-C/F-O) and mean average precision (mAP). Precision measures result relevancy, while recall measures how many of the truly relevant results are returned. Precision and recall do not make sense in isolation from each other. They are computed as
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN},$$
where TP, FP and FN denote the numbers of true positives, false positives and false negatives, respectively. The F measure is a balanced metric that considers precision and recall simultaneously; in our paper, we use the F1 measure. Mean average precision is the mean of the class-wise average precision. Therefore, the F measure and mAP are the more important metrics.
We compare our proposed solution against the previous best multi-label image classification methods on the MSCOCO 2014 [22] and PASCAL VOC 2007 [7] datasets. The evaluation results are shown in Table 1 (for COCO) and Table 2 (for VOC). Some methods only provide top-k predictions; to compare with them fairly, we also compute top-k metrics based on our top-k predictions. For the other methods and ours, the hyper-parameter k is 3. Our proposed approach clearly outperforms the baseline and SRN [53], improving mAP from 77.1% to 81.8%. For the balanced metrics F1-C and F1-O, we also obtain state-of-the-art performance. Compared with methods [3], [34], [37] that use RNNs to learn label correlation information, our results show a significant improvement.
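The per-class (macro, "-C") and overall (micro, "-O") precision, recall and F1 used in the evaluation can be sketched as below. Two conventions here are assumptions: F-C is computed from the averaged P-C and R-C rather than by averaging per-class F1 scores, and a ratio with a zero denominator is counted as 0. The toy predictions are illustrative only.

```python
def safe_div(a, b):
    """Return a / b, or 0.0 when the denominator is zero (assumed convention)."""
    return a / b if b else 0.0

def multilabel_metrics(preds, targets, num_classes):
    """preds, targets: lists of sets of label indices, one set per image."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for pred, target in zip(preds, targets):
        for c in range(num_classes):
            if c in pred and c in target:
                tp[c] += 1
            elif c in pred:
                fp[c] += 1
            elif c in target:
                fn[c] += 1
    # Macro ("-C"): average the per-class ratios.
    p_c = sum(safe_div(tp[c], tp[c] + fp[c]) for c in range(num_classes)) / num_classes
    r_c = sum(safe_div(tp[c], tp[c] + fn[c]) for c in range(num_classes)) / num_classes
    # Micro ("-O"): pool the counts over all classes first.
    p_o = safe_div(sum(tp), sum(tp) + sum(fp))
    r_o = safe_div(sum(tp), sum(tp) + sum(fn))
    return {
        "P-C": p_c, "R-C": r_c, "F-C": safe_div(2 * p_c * r_c, p_c + r_c),
        "P-O": p_o, "R-O": r_o, "F-O": safe_div(2 * p_o * r_o, p_o + r_o),
    }

targets = [{0, 1}, {1}, {0}]
preds = [{0}, {1, 2}, {0}]
metrics = multilabel_metrics(preds, targets, num_classes=3)
print(metrics)
```

Macro metrics weight every class equally, so rare (tail) classes influence them as much as frequent ones, whereas micro metrics are dominated by frequent classes.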

1) RESULTS FOR TAIL LABEL AND SMALL OBJECT
To illustrate how our solution improves the classification accuracy of tail labels and small-object categories, we select the six labels with the fewest occurrences and the six labels that occupy the smallest percentage of the images in the COCO dataset, and show their AP improvement in Table 3. Table 3 shows that our solution greatly improves the AP of tail labels, especially when Resnet is used as the feature extractor. We also show the effectiveness of our approach on small-object category classification in Table 4. The classification accuracy of VGG with FAN can exceed that of the Resnet101 baseline, which demonstrates the effectiveness of our Feature Attention Network on multi-label image classification.

D. ABLATION STUDY
To evaluate the modules we designed, we decompose our approach and reveal the effect of each component on the COCO [22] and VOC [7] datasets. The COCO dataset is more complicated and realistic in image scenes than the VOC dataset.

1) ABLATION FOR FEATURE REFINEMENT NETWORK
The Feature Refinement Network aims to learn informative and discriminative features, which benefits the classification of small-scale objects and tail-label objects. We use Resnet-101 as our backbone and also run comparison experiments with plain Resnet-101. The ablation results are shown in Table 5.
The feature refinement network greatly improves classification performance, especially the recall rates R-C and R-O. An increase in recall means that more of the ground-truth positive labels are predicted.
This indicates that the feature refinement network can predict more positive labels than the baseline; more features help to find easily neglected objects. When the correlation learning network is not used, we use joint predictions from $P_2$, $P_3$ and $P_4$ to obtain the final predicted scores.

2) ABLATION FOR CORRELATION LEARNING NETWORK
The Correlation Learning Network is responsible for learning label dependencies. Label dependencies play an important role in image recognition and scene understanding. We can exploit the predicted labels and the label dependency information to infer possible positive labels in the image. Our ablation results are shown in Table 5. Compared with the Resnet-101 results, Resnet-101 with the Correlation Learning Network improves the F score and mAP considerably owing to the increase in recall. The results of Resnet101 + CLN show that CLN has less impact on the precision score P-C. The increase in recall means that fewer positive labels are missed, that is, the number of false negatives decreases. When the feature refinement network is not used, $x_2$, $x_3$ and $x_4$ are equal to $P_2$, $P_3$ and $P_4$, which means that the feature attention matrix is computed from $x_2$ and $x_3$. Our final results from Resnet-101 with joint FRN and CLN demonstrate the improvement brought by CLN.

3) ABLATION FOR GLOBAL MAX POOLING
As described in Section III-A, we use global max pooling instead of global average pooling to capture global context information. Global max pooling is sensitive to salient responses and does not miss the features of small objects. We ran ablation experiments to validate the role of global max pooling; the results are shown in Table 6. Global max pooling improves P-C and the F score in our solution.

E. VISUALIZATION
To further illustrate the effect of FAN on the tail label and object scale inconsistency problems, we visualize the learned feature maps using the CAM method [31] in Figure 4.
The visualized results show that FAN with Resnet101 as the backbone can locate easily neglected objects more accurately than Resnet101 alone. This suggests that our network is trained to capture the semantic and spatial dependencies of objects in the image.

V. CONCLUSION
In this paper, we proposed the Feature Attention Network for large-scale multi-label image classification. On the one hand, we proposed feature recalibration to make our deep model pay more attention to small objects and tail-label objects.
On the other hand, we designed the correlation learning module to learn the semantic and spatial dependencies of objects based on the attention mechanism. Our ablation experiments also demonstrated the effectiveness of each component of our model and validated the role of global max pooling in capturing context information. Extensive evaluations on the MSCOCO2014 and VOC2007 datasets confirm that our proposed Feature Attention Network outperforms other multi-label image classification methods. Visualization results show that FAN can accurately locate the objects in images, which benefits the recognition of small objects and tail labels.

FIGURE 1. Illustration of the COCO2014 dataset: (a) label number distribution; (b) object scale distribution. We zoom in to show the data in the red box. Tail labels appear only a few times in the COCO2014 dataset, and small objects occupy a small proportion of the images, which makes image classification difficult.

FIGURE 2. Illustration of our deep framework. Our proposed model contains three parts: the backbone, the Feature Refinement Network and the Correlation Learning Network. Block1-4 denote different convolutional stages; conv is a single convolution layer; fc is the fully connected layer. FTB and FRB are the Feature Transform Block and the Feature Refinement Block, respectively, and x and P denote the features of different stages. The blue fonts denote the corresponding math operations.

FIGURE 3. Illustration of our proposed feature transform block and feature refinement block. (a) Feature Refinement Block. (b) Feature Transform Block.

FIGURE 4. Visualized feature maps from the COCO dataset. We compare with the Resnet101 baseline in locating multi-scale objects in the image. FAN is our method with Resnet101 as the backbone network. Labels in black are ground-truth labels and labels in red are false labels. The results suggest that FAN can more accurately locate the objects corresponding to the ground-truth labels in the image.

TABLE 1. Comparison of average precision and mAP between other methods and our method on the MSCOCO dataset. Bold font marks the best results.

TABLE 2. Comparison of average precision and mAP between other methods and our method on the VOC dataset. The best value is highlighted in bold font.

TABLE 3. The increase in the average precision (AP) of tail labels in the COCO dataset.

TABLE 4. The increase in the average precision (AP) of small object labels in the COCO dataset.

TABLE 5. Detailed results of each component of our proposed solution on the COCO dataset. FRN: feature refinement network. CLN: correlation learning network.

TABLE 6. Comparison of global average pooling and global max pooling on the COCO2014 dataset. GAP: global average pooling. GMP: global max pooling. Our approach with GMP has better performance.

TABLE 7. Detailed results of our proposed approach and the baseline on the VOC2007 dataset.