Significance of Softmax-Based Features in Comparison to Distance Metric Learning-Based Features

End-to-end distance metric learning (DML) has been applied to obtain features useful in many computer vision tasks. However, these DML studies have not provided equitable comparisons between features extracted from DML-based networks and softmax-based networks. In this paper, we present objective comparisons between these two approaches under the same network architecture.


INTRODUCTION
Recent developments in deep convolutional neural networks have made it possible to classify many classes of images with high accuracy. It has also been shown that such classification networks work well as feature extractors. Features extracted from classification networks show excellent performance in image classification [1], detection, and retrieval [2], [3], even when they have been trained to classify 1000 classes of the ImageNet dataset [4]. It has also been shown that fine-tuning for target domains further improves the features' performance [5], [6].
On the other hand, distance metric learning (DML) approaches have recently attracted considerable attention. These approaches learn a feature space in which distance corresponds to class similarity; the feature space is learned directly rather than obtained as a byproduct of a classification network. End-to-end distance metric learning is a typical approach to constructing a feature extractor using convolutional neural networks and has been the focus of numerous studies [7], [8], [9], [10], [11].
However, no experiments have compared softmax-based features with DML-based features under the same network architecture or with adequate fine-tuning. An analysis providing a fair comparison of DML-based and softmax-based features is long overdue. Figure 1 depicts the feature vectors extracted from a softmax-based classification network and a metric learning-based network. We used the LeNet architecture for both networks and trained them on the MNIST dataset [12]. For DML, we used the contrastive loss function [13] to map images into a two-dimensional space. For softmax-based classification, we added a two- or three-dimensional fully connected layer before the output layer for visualization. DML succeeds in learning a feature embedding (Fig. 1a). Softmax-based classification networks achieve a very similar result: images are located near one another if they belong to the same class and far apart otherwise (Figures 1b and 1c).
Our contributions in this paper are as follows:
• We show methods to exploit the ability of deep features extracted from softmax-based networks, such as normalization and proper dimensionality reduction. They are technically not novel, but they must be used for a fair comparison between the image representations.
• We demonstrate that deep features extracted from softmax-based classification networks show competitive or better results on clustering and retrieval tasks compared to those from state-of-the-art DML-based networks [9], [10], [11].
• We show that L2 normalization of softmax-based features is a powerful way to improve their performance. Even when we introduce the probability invariant shift, which removes the effects of softmax ambiguity and null space ambiguity, L2 normalization still works better.
To align the conditions across methods, we restrict the network architecture to GoogLeNet [14], which has been used in state-of-the-art DML studies [9], [10], [11].

Softmax-Based Classification and Repurposing of the Classifier as a Feature Extractor
Convolutional neural networks have demonstrated great potential for highly accurate image recognition [14], [15], [16], [17]. It has been shown that features extracted from classification networks can be repurposed as a good feature representation for novel tasks [1], [2], [18] even if the network was trained on ImageNet [4]. For obtaining better feature representations, fine-tuning is also effective [6].
Bell and Bala [7] compared classification networks and siamese networks, but they used coarse class labels for classification networks and fine labels for siamese networks; thus, it was left unclear whether siamese networks are better for feature-embedding learning than classification networks. Schroff et al. [8] used triplet loss for deep metric learning in their FaceNet, which showed performance that was state-of-the-art at the time, but their network was deeper than that of the previous method (Taigman et al. [23]); thus, triplet loss might not have been the only reason for the performance improvement, and the contribution from adopting triplet loss remains uncertain. Song et al. [9] used lifted structured feature embedding; however, they only compared their method with a softmax-based classification network pretrained on ImageNet (Russakovsky et al. [4]) and did not compare it with a fine-tuned network. Sohn [10] and Song et al. [11] also compared their methods with lifted structured feature embedding; thus, comparisons with softmax-based features have not been shown.
arXiv:1712.10151v2 [cs.CV] 13 Apr 2019
Figure 2: Illustration of learning processes for the softmax-based classification network and the siamese-based DML network. For softmax, the gradient is defined by the distance between a sample and a fixed one-hot vector; for siamese, by the distance between samples.

Differences Between Softmax-based Classification and Metric Learning
For classification, the softmax function (Eq. 1) is typically used:
$$p_c = \frac{\exp(u_c)}{\sum_{i=1}^{C} \exp(u_i)}, \qquad (1)$$
where $p_c$ denotes the probability that the vector $u$ belongs to class $c$ and $C$ is the number of classes. The loss of the softmax function is defined by the cross-entropy
$$E = -\sum_{c=1}^{C} q_c \log p_c, \qquad (2)$$
where $q$ is a one-hot encoding of the correct class of $u$. To minimize the cross-entropy loss, networks are trained to make the output vector $u$ close to its corresponding one-hot vector. It is important to note that the target vectors (the correct outputs of the network) are fixed during the entire training (Fig. 2). On the other hand, DML methods use distances between samples. They do not use the values of the labels; rather, they ascertain whether the labels are the same between target samples. For example, contrastive loss [13] considers the distance between a pair of samples. Recent studies [8], [9], [10], [11] use pairwise distances between three or more images at the same time for fast convergence and efficient calculation. However, these methods have some drawbacks. For DML, in contrast to optimization of the softmax cross-entropy loss, the optimization targets are not always consistent during training even if all possible distances within the mini-batch are considered. Thus, the DML optimization converges slowly and is not stable.
Figure 3: The GoogLeNet [14] architecture we use in this paper. We extracted the features of the red-colored layers. For (a), we applied PCA to reduce the number of feature dimensions. For (b) and (c), the dimensionality is reduced by the fc reduction layer.
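The contrast between the two objectives can be made concrete with a minimal NumPy sketch. It assumes the standard forms of the softmax cross-entropy and of the pairwise contrastive loss; the margin value of 1.0 and the function names are illustrative choices, not taken from the paper. The cross-entropy gradient depends on a fixed one-hot target, whereas the contrastive loss only sees whether two labels match.

```python
import numpy as np

def softmax(u):
    """Softmax over the last axis, computed stably by subtracting the max."""
    z = u - np.max(u, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def cross_entropy(u, label):
    """Cross-entropy between softmax(u) and the one-hot vector for `label`."""
    return -np.log(softmax(u)[label])

def contrastive_loss(x1, x2, same_class, margin=1.0):
    """Contrastive loss for one pair; only label equality is used.
    The margin is an assumed hyperparameter."""
    d = np.linalg.norm(x1 - x2)
    if same_class:
        return 0.5 * d ** 2                      # pull same-class pairs together
    return 0.5 * max(0.0, margin - d) ** 2       # push other pairs beyond the margin

# Softmax: the target q is a fixed one-hot vector, and dE/du = p - q.
u = np.array([2.0, 0.5, -1.0])
q = np.eye(3)[0]
loss_ce = cross_entropy(u, label=0)
grad_u = softmax(u) - q

# DML: the loss depends on the distance between two samples.
a, b = np.array([0.0, 0.0]), np.array([0.3, 0.4])
loss_pos = contrastive_loss(a, b, same_class=True)
loss_neg = contrastive_loss(a, b, same_class=False)
```

In the softmax case the target `q` never changes during training, while in the contrastive case the "target" for `a` is the current position of `b`, which itself moves as the network is updated.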

Dimensionality Reduction Layer
One of DML's strengths when fine-tuning is the flexibility of its output dimensionality, which is set by a final fully connected layer. When using features from a mid-layer of a softmax classification network, on the other hand, the dimensionality of the features is fixed. Some existing methods [6] use PCA or discriminative dimensionality reduction to reduce the number of feature dimensions. In our experiments, we evaluated three methods for changing the feature dimensionality. Following conventional PCA approaches, we extracted features from the 1024-dimensional pool5 layer of GoogLeNet [14] (Fig. 3a) and applied PCA to reduce the dimensionality. As a comparison, we also tried random projection for dimensionality reduction via an orthogonal projection matrix. In a contrasting approach, we made use of a fully connected layer: we added a fully connected layer having the required number of neurons just before the output layer (FCR1, Fig. 3b). We also investigated a third approach in which a fully connected layer is added followed by a dropout layer (FCR2, Fig. 3c).
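The two post-hoc reductions applied to pool5 features (PCA and random orthogonal projection) can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the random orthogonal matrix is obtained from the QR decomposition of a Gaussian matrix, which is one common construction and not necessarily the one used in the experiments, and the random array stands in for real 1024-dimensional features.

```python
import numpy as np

def pca_reduce(features, dim):
    """Project mean-centered features onto the top `dim` principal axes."""
    centered = features - features.mean(axis=0)
    # Rows of vt are principal axes sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dim].T

def random_orthogonal_reduce(features, dim, seed=0):
    """Project features with a random matrix that has orthonormal columns."""
    rng = np.random.default_rng(seed)
    gaussian = rng.standard_normal((features.shape[1], dim))
    q, _ = np.linalg.qr(gaussian)        # q has shape (input_dim, dim), q.T @ q = I
    return features @ q

rng = np.random.default_rng(0)
pool5 = rng.standard_normal((200, 1024))  # stand-in for 1024-d pool5 features
reduced_pca = pca_reduce(pool5, 64)
reduced_rand = random_orthogonal_reduce(pool5, 64)
```

FCR1 and FCR2 differ from both in that the reduction is a trained layer inside the network, so they require fine-tuning rather than a projection fitted after feature extraction.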

Normalization
In this study, all features extracted from the classification networks are taken from the layer immediately before the output layer. The outputs are normalized by the softmax function and then evaluated by the cross-entropy loss function. The output vector $p = (p_i)$ is given by $\mathrm{softmax}(y)$. For an arbitrary constant $c$, $\mathrm{softmax}(y)$ equals $\mathrm{softmax}(y + c\mathbf{1})$, where $\mathbf{1}$ is the all-ones vector.
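The shift invariance stated above follows directly from the definition of the softmax function. A short derivation, writing $y_i$ for the components of $y$ (notation assumed here):

```latex
\begin{align*}
\mathrm{softmax}(y + c\mathbf{1})_i
  &= \frac{\exp(y_i + c)}{\sum_{j} \exp(y_j + c)} \\
  &= \frac{e^{c}\,\exp(y_i)}{e^{c}\sum_{j} \exp(y_j)} \\
  &= \frac{\exp(y_i)}{\sum_{j} \exp(y_j)}
   = \mathrm{softmax}(y)_i .
\end{align*}
```

The common factor $e^{c}$ cancels, so adding the same constant to every component of $y$ leaves every probability unchanged.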
The features $x$ we extracted from the networks are related to $y$ by $y = Wx + b$, where $W$ and $b$ are the linear projection matrix and the bias of the output layer, respectively. As pointed out above, the vector $y$ is ambiguous under the softmax function; thus, $x$ should be normalized when it is used as a deep feature.
In this paper, we show that L2 normalization is empirically effective. Some studies used L2 normalization for deep features extracted from softmax-based classification networks [6], [23], whereas many recent studies have used the features without any normalization [9], [15], [26]. $Wx$ and $Wx/\|x\|$ do not always result in the same probabilities after the softmax function is applied. Applying L2 normalization to deep features changes the confidence of the predicted results while preserving the ordering of the class probabilities. However, as Fig. 1b clearly indicates, the distance between features extracted from a softmax-based classifier should be evaluated by cosine similarity, not by the Euclidean distance. In this study, we mainly validated the effectiveness of L2 normalization of deep features.
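The link between L2 normalization and cosine similarity can be checked numerically. The sketch below uses random stand-in features and a hypothetical helper `l2_normalize`; it illustrates that, for unit-length vectors, the squared Euclidean distance equals 2 − 2·cos, so Euclidean nearest neighbors on normalized features follow the cosine-similarity ranking.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit Euclidean length."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

rng = np.random.default_rng(0)
# Stand-in features whose rows have very different magnitudes.
features = rng.standard_normal((5, 8)) * rng.uniform(0.1, 10.0, size=(5, 1))
unit = l2_normalize(features)

cosine = unit @ unit.T                                   # pairwise cosine similarity
sq_euclid = np.sum((unit[:, None, :] - unit[None, :, :]) ** 2, axis=-1)
# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
identity_holds = np.allclose(sq_euclid, 2.0 - 2.0 * cosine)
```

After normalization, k-means and nearest-neighbor retrieval with Euclidean distance therefore depend only on the direction of each feature vector, not on its magnitude.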
We also considered another way to cope with the ambiguity introduced by the shift invariance of the softmax function and the null space of $W$. We define a distance metric that takes the softmax invariance and the null space into account, which treats features that result in the same probabilities as equal. We report the experimental results of using the distance metric with probability invariant shift in Section 4.4.

EXPERIMENTS
In this section, we compare the deep features extracted from classification networks with those from state-of-the-art DML-based networks [9], [10], [11]. The GoogLeNet architecture [14] was used for all the methods; thus, the numbers of parameters are the same for the DML-based and softmax-based networks. All the networks were fine-tuned from the weights pretrained on ImageNet [4]. We used the Caffe [27] framework for the implementation.

Comparisons between softmax-based features and DML-based features
Here, we give our evaluation of clustering and retrieval scores for the state-of-the-art DML methods [9], [10], [11] and for the softmax classification networks. We used the Caltech UCSD Birds 200-2011 (CUB) dataset [24], the Stanford Cars 196 (CAR) dataset [25], and the Stanford Online Products (OP) dataset [9]. For CUB and CAR, we used the first half of the dataset classes for training and the rest for testing. For OP, we used the training-testing class split provided. The dataset properties are shown in Table 1. We emphasize that the class sets used for training and testing were completely different.
For clustering evaluation, we applied k-means clustering 100 times and calculated NMI (Normalized Mutual Information) [28]; the value for k was set to the number of classes in the test set. For retrieval evaluation, we calculated Recall@K [29].
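The two evaluation measures can be sketched in plain NumPy. Two conventions are assumed here rather than taken from the text: Recall@K counts a query as a hit if at least one of its K nearest neighbors (excluding the query itself) shares its class, and NMI is normalized by the arithmetic mean of the two entropies. The k-means step itself is omitted; `nmi` only scores a given cluster assignment.

```python
import numpy as np

def recall_at_k(features, labels, k):
    """Fraction of queries with at least one same-class item among the k
    nearest neighbors (Euclidean distance, query itself excluded)."""
    sq_norms = np.sum(features ** 2, axis=1)
    dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    np.fill_diagonal(dist, np.inf)               # never retrieve the query itself
    nearest = np.argsort(dist, axis=1)[:, :k]
    hits = (labels[nearest] == labels[:, None]).any(axis=1)
    return hits.mean()

def nmi(labels_true, labels_pred):
    """Normalized mutual information: I(T; P) / ((H(T) + H(P)) / 2)."""
    _, t = np.unique(np.asarray(labels_true).ravel(), return_inverse=True)
    _, p = np.unique(np.asarray(labels_pred).ravel(), return_inverse=True)
    joint = np.zeros((t.max() + 1, p.max() + 1))
    np.add.at(joint, (t, p), 1.0)                # contingency table of counts
    joint /= joint.sum()
    pt, pp = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(pt, pp)[nz]))
    h_t = -np.sum(pt * np.log(pt))
    h_p = -np.sum(pp * np.log(pp))
    denom = 0.5 * (h_t + h_p)
    return mi / denom if denom > 0 else 1.0

# Toy data: two tight classes and one class with a single image.
features = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]])
labels = np.array([0, 0, 1, 1, 2])
r1 = recall_at_k(features, labels, k=1)
score = nmi(labels, np.array([2, 2, 0, 0, 1]))   # same partition, permuted ids
```

In this toy example `r1` is 0.8, because the single-image class can never retrieve a same-class neighbor, and `score` is 1.0, because NMI is invariant to a relabeling of the clusters.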
In Table 2 and Table 3, we compare clustering and retrieval performance, measured by NMI and Recall@K scores, respectively, for the CUB and CAR datasets. We compared the softmax-based features with lifted structure [9], N-pair loss [10], and the clustering loss [11]. The results of the DML methods were quoted from [11]. Regarding the lifted structure [9], the results in parentheses correspond to the scores we obtained by running the publicly available code ourselves, which we confirmed were almost the same as those in [11]. As Table 2 and Table 3 show, softmax-based features outperformed DML features. The softmax-based features all performed well on both datasets.
On the OP dataset (Table 4), in contrast to the CUB and CAR datasets, DML features outperform softmax-based features. We analyze this difference in detail in the following sections.

Detailed comparisons between softmax-based features and lifted structure embedding features
We made detailed comparisons between softmax-based features and lifted structure embedding [9] while varying the feature dimensionality and the dataset size. We conducted these experiments using the code available for lifted structure embedding [9].
Firstly, we show how the performance varies when changing the feature dimensionalities. We changed the dimensionalities of softmax-based features via PCA, FCR1 and FCR2, and investigated how the performance of clustering and retrieval varied. We compared them against those of lifted structure embedding of the same dimensionality.
For training, we multiplied the learning rates of the changed layers (output layers for all models and the fully connected layer added for FCR1 and FCR2) by 10. The batch size was set to 128, and the maximum number of iterations for our training was set to 20,000, which was large enough for the three datasets to converge as mentioned in [11]. These training strategies were exactly the same as those used in [9].
We show the results for CUB and CAR datasets in Fig. 4 and in Fig. 5, respectively, under varying dimensionalities. The deep features extracted from the softmax-based classification networks outperformed the lifted structured feature embedding in clustering (NMI) and retrieval (Recall@K).
For clustering performance measured by NMI, all of the softmax models (PCA, FCR1, and FCR2) showed better scores than the lifted structured feature embedding. Regarding normalization, softmax-based features with L2 normalization performed better than those without normalization. The NMI scores of PCA, FCR1, and FCR2 increased monotonically with the feature dimensionality for the CUB dataset (Fig. 4). On the other hand, for the CAR dataset (Fig. 5), the NMI scores of FCR2 and the lifted structure embeddings decreased above 256 dimensions, and those of PCA and FCR1 saturated above 256 dimensions. This result suggests that 1024 dimensions is too large to represent the image classes of the CAR dataset. It also implies that the feature dimensionality should be chosen carefully for the target data in order to achieve the best performance.
For retrieval performance measured by the Recall@K metric, the softmax-based features also outperformed the features of lifted structured feature embedding. Regarding L2 normalization, features with normalization scored better than those without it.
Fig. 6 shows the clustering and retrieval performance, measured by NMI and Recall@K, respectively, for the Online Products dataset. In contrast to the CUB and CAR datasets, the softmax-based features with L2 normalization and the lifted structure embedding showed almost the same performance in both clustering and retrieval. As shown in Table 1, the OP dataset is very different from the CUB and CAR datasets in terms of the number of classes and the number of samples per class: it has about 22k classes and 120k samples. The number of samples per class in the OP dataset is 5.3 on average, which is far smaller than in the CUB and CAR datasets.

The effect of the dataset scales
From the results for these three datasets, we conjecture that the dataset size (that is, the number of samples per class) has a considerable influence on softmax-based features. Hence, we changed the size of the datasets by sampling the images of the CUB and CAR datasets for each class and ran the experiments again. We constructed seven datasets of different sizes, containing 5, 10, 20, 40, 60, 80, and 100% of the whole dataset, respectively. Among them, 5% corresponds to approximately 3 and 4 images per class in the CUB and the CAR dataset, respectively. As shown in Fig. 7 and Fig. 8, the differences between the scores for softmax and DML were small when the training dataset was small. The gap between softmax and DML became larger as the dataset size increased. The softmax-based classifier was strongly influenced by the size of the dataset.
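A per-class subsampling step of this kind might look like the sketch below. The details are assumptions made for illustration: images are drawn uniformly at random without replacement within each class, the per-class count is rounded to the nearest integer, and at least one image per class is kept. The toy labels (10 classes of 60 images) are not the real datasets.

```python
import numpy as np

def subsample_per_class(labels, fraction, seed=0):
    """Return sorted indices keeping about `fraction` of each class (at least one)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    keep = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        n_keep = max(1, int(round(fraction * len(idx))))
        keep.append(rng.choice(idx, size=n_keep, replace=False))
    return np.sort(np.concatenate(keep))

labels = np.repeat(np.arange(10), 60)            # toy data: 10 classes, 60 images each
subsets = {pct: subsample_per_class(labels, pct / 100.0)
           for pct in (5, 10, 20, 40, 60, 80, 100)}
per_class_at_5 = np.bincount(labels[subsets[5]])  # images kept per class at 5%
```

With 60 images per class, the 5% subset keeps 3 images per class, which is the same order as the roughly 3 to 4 images per class reported for the smallest CUB and CAR subsets.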

Distance metric with probability invariant shift
We define a distance metric that considers the softmax invariance and the null space of the linear projection matrix $W$. When two feature vectors are mapped to the same probability, the distance between the two becomes zero. Assume a vector $u$ such that
$$W u = \mathbf{1}. \qquad (3)$$
The shift operation $x + cu$ has no influence on the softmax operation because $\mathrm{softmax}(Wx) = \mathrm{softmax}(W(x + cu)) = \mathrm{softmax}(Wx + c\mathbf{1})$, where $c$ is an arbitrary constant. Such a $u$ exists when the dimensionality of the feature $x$ is larger than the number of classes to be classified. $u$ is represented by
$$u = W^\top (W W^\top)^{-1} \mathbf{1} + \sum_{i=1}^{D-1} c_i v_i, \qquad (4)$$
where $W^\top (W W^\top)^{-1}$ is the pseudo-inverse of the linear projection matrix $W$, $\{v_1, \dots, v_{D-1}\}$ are the basis vectors that span the null space of $W$, and $\{c_0, \dots, c_{D-1}\}$ are arbitrary constants. The shift operation is called the probability invariant shift in this paper. Using the probability invariant shift, the distance between $x_1$ and $x_2$, defined below, removes the effects of the softmax ambiguity and dimensionality reduction:
$$d(x_1, x_2) = \min_{c_0, \dots, c_{D-1}} \left\| x_1 - x_2 + c_0 W^\top (W W^\top)^{-1} \mathbf{1} + \sum_{i=1}^{D-1} c_i v_i \right\|. \qquad (5)$$
In this section, we present comparative experiments on the distance with a probability invariant shift and with L2 normalization using the CUB and the CAR datasets. Because the [...]
Tables 5 and 6 show the results of the comparisons. In all cases, the L2 normalization was the most effective. The results demonstrated that the distance metric with a probability invariant shift had little effect on the clustering performance.
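The distance with a probability invariant shift described above can be evaluated as a small least-squares problem. The sketch below is one possible implementation, not the authors' code: it assumes the distance is the minimum of ‖x1 − x2 + shift‖ over all shifts spanned by W^T (W W^T)^{-1} 1 and a basis of the null space of W, that W has shape (number of classes, feature dimensionality) with full row rank, and that the null-space basis is taken from the SVD. All function names are hypothetical.

```python
import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))
    return e / e.sum()

def invariant_directions(W, tol=1e-10):
    """Columns: W^T (W W^T)^{-1} 1, followed by an orthonormal basis of null(W).
    Assumes W is (num_classes, feature_dim) with full row rank."""
    num_classes = W.shape[0]
    u0 = W.T @ np.linalg.solve(W @ W.T, np.ones(num_classes))   # satisfies W u0 = 1
    _, s, vt = np.linalg.svd(W, full_matrices=True)
    rank = int(np.sum(s > tol * s[0]))
    null_basis = vt[rank:].T                                    # W @ null_basis = 0
    return np.column_stack([u0, null_basis])

def invariant_distance(x1, x2, directions):
    """min over coefficients c of || x1 - x2 + directions @ c ||."""
    diff = x1 - x2
    coeffs, *_ = np.linalg.lstsq(directions, -diff, rcond=None)
    return np.linalg.norm(diff + directions @ coeffs)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 6))                  # toy output layer: 3 classes, 6-d features
B = invariant_directions(W)
x1 = rng.standard_normal(6)
x2 = x1 + B @ rng.standard_normal(B.shape[1])    # x1 plus an arbitrary invariant shift
x3 = rng.standard_normal(6)                      # unrelated feature
d_same = invariant_distance(x1, x2, B)           # numerically zero
d_other = invariant_distance(x1, x3, B)
```

Here `x1` and `x2` produce the same class probabilities under `softmax(W @ x)`, so their distance vanishes up to rounding, while an unrelated feature `x3` keeps a nonzero distance.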

CONCLUSION
Because previous studies offered no equitable comparison, we compared softmax-based features with state-of-the-art DML features under conditions designed to let both approaches show their true performance. Our results showed that the features extracted from softmax-based classifiers performed better than those from state-of-the-art DML methods [9], [10], [11] on fine-grained classification, clustering, and retrieval tasks when the size of the training dataset (samples per class) is large. The results also showed that the size of the dataset strongly influenced the performance of softmax-based features. When the dataset was small, DML showed better or competitive performance. DML methods have advantages when the number of classes is very large and the softmax-based classifier is no longer applicable. In DML studies, softmax-based features have rarely been compared fairly with DML-based features under the same network architecture or with adequate fine-tuning. This paper revealed that softmax-based features are still strong baselines. The results suggest that fine-tuned softmax-based features should be taken into account when evaluating the performance of deep features.

Limitations.
• When the number of classes is huge, it is hard to train classification networks due to GPU memory constraints. DML-based methods are suitable for such cases because they do not need an output layer whose size is proportional to the number of classes.
• For cross-domain tasks, such as sketches to photos [30], [31] or aerial views to ground views [32], DML is effective. Classification-based learning needs complicated learning strategies such as those in [33]. DML-based methods can learn cross-domain representations simply by using a pair of networks.
• For datasets with continuous labels, DML-based methods might be helpful because classifier-based methods cannot deal with them. However, most recent DML studies are specialized to datasets with discrete labels. To apply these methods to datasets with continuous labels, some extensions are necessary.