Heterogeneous Double-Head Ensemble for Deep Metric Learning

The multi-head ensemble structure has been employed by many algorithms in various applications, including deep metric learning. However, these structures have typically been designed empirically in a simple way, such as giving every head the same structure, which limits the ensemble effect due to a lack of head diversity. In this paper, for an elaborate design of the multi-head ensemble structure, we establish design concepts based on three structural factors: designing the feature layer for extracting an ensemble-favorable feature vector, designing the shared part for memory savings, and designing diverse multi-heads for performance improvement. Through rigorous evaluation of variants based on these design concepts, we propose a heterogeneous double-head ensemble structure that drastically increases the ensemble gain along with memory savings. In verification experiments on image retrieval datasets, the proposed ensemble structure outperforms the state-of-the-art algorithms by margins of over 5.3%, 6.1%, 5.9%, and 1.8% on CUB-200, Car-196, SOP, and Inshop, respectively.


I. INTRODUCTION
Deep metric learning has been successfully used in various applications related to computer vision, such as image retrieval [1]-[10], person re-identification [11], [12], and face verification [13]-[15]. Deep metric learning refers to the design of feature extracting functions with deep neural networks so that the features of semantically similar images are close to each other. Ensembling is a method of ensuring robust performance by training diverse models and aggregating their predictions.
For deep metric learning, an ensemble requires two or more deep networks that extract diverse feature vectors [16]. A variety of feature vectors can be obtained through semantically diverse attention over the input image [11], [17] or by applying a loss that pushes feature vectors away from each other [2], [18]. Diversity can also be achieved by making networks converge to diverse local minima [19]. The goal of the ensemble is to generate synergy between those diverse feature vectors [16]. However, such ensemble methods have a crucial disadvantage: the multiple models require considerable memory and heavy computation.
To alleviate this disadvantage, multi-head structures have been employed in many applications [20], [21], including deep metric learning [3], [11], [22], [23]. Multi-head structures can effectively reduce memory usage by sharing a part of the low-level layers. However, most previous studies simply use multiple heads of the same structure, copying the original block for each head, which leads to a limited ensemble effect due to a lack of head diversity. In this paper, to accommodate the trade-off between memory saving and performance, we present an elaborate design concept. By examining the trade-off based on design concepts, we search for an effective multi-head ensemble structure, referred to as the Heterogeneous Double-head Ensemble (HDhE), as shown in Figure 1. In our design, we consider three factors influential to memory saving and performance: designing a dimension-reduced feature vector, determining the shared body part, and designing diverse heads that generate diverse, ensemble-favorable feature vectors. Accordingly, we prepare design concepts and conduct rigorous evaluations of variants chosen on the basis of those concepts. Based on this evaluation of the trade-off between memory saving and ensemble performance, we determine the dimension-reduced feature vector, the low-level shared structure, and the heterogeneous structure of the heads.
To show the advantages of our HDhE structure, we present the following experimental results. Our HDhE structure outperforms the baseline ensemble of two independent networks by over 4% on the CUB-200 dataset (Recall@1), using only one-quarter of the baseline feature dimension and three-quarters of the baseline memory usage. In addition, on four image retrieval benchmark datasets, our HDhE outperforms not only existing ensemble methods but also the state-of-the-art methods by large margins. Besides, we show that HDhE is valid for other types of loss, such as triplet loss, even though it was designed using only softmax loss.
The contributions of our work are summarized as follows: • We identify the design factors influential to an effective multi-head ensemble structure in terms of memory saving and performance improvement.
• We establish the design concepts encompassing the design factors to search for the target ensemble structure.
• Following the design concepts, we propose a heterogeneous double-head ensemble structure that provides memory savings and renews the state-of-the-art performance.

II. RELATED WORK
A. ENSEMBLE FOR DEEP METRIC LEARNING
[8], which appeared later, can achieve higher performance without an ensemble than the above algorithms while using the same backbone network (Inception-V1). Meanwhile, Xuan et al. [5] proposed an ensemble algorithm that creates meta-classes through random grouping, where independent models learn the different meta-class data. However, the strategy that yields the best performance of the algorithm uses 48 independent networks, which proves to be a disadvantage: the computational power of 48 independent models is needed for one inference. In contrast, our multi-head structural ensemble model requires fewer parameters than even two independent models. ABE [3] is an ensemble algorithm that uses a multi-head structure for deep metric learning. This algorithm introduces a new loss function in which the feature vectors of the heads push each other apart. However, its 8-head ensemble designed with Inception-V1 also lags behind the current single-model algorithms with the same backbone. In this paper, the proposed structure outperforms not only the existing ensemble algorithms but also the state-of-the-art algorithms by a large margin on four image retrieval benchmarks.

B. MULTI-HEAD ENSEMBLE STRUCTURE
Multi-head structures that expand the high-level layers of a backbone network have been used in various tasks, such as person re-identification, image classification, and deep metric learning [3], [11], [20]-[23]. The shared-layer structure of the multi-head ensemble differs from study to study.
For example, one study [23] using the ResNet-50 network used the layers up to Res3_2 as the shared layers and the remaining layers as multi-heads. Another study [20], using the ResNet-110 network, used the layers up to Res3_18 as shared layers. There was also a study [3] using the Inception-V1 network, where the layers up to inception_3b were shared. However, these studies did not clearly state how their multi-head structures were designed. In this paper, we clarify the effective structure of the multi-head ensemble and its trade-offs through insightful experiments.

III. METHODOLOGY
For the multi-head structure in deep metric learning, there are three structural design factors. First, the feature vector should be of small dimension so that a concatenated ensemble feature can be formed. Second, the structure that shares low-level layers should save memory without degrading performance relative to an ensemble of independent models. Third, the feature vectors from the heads should be diverse to create ensemble synergy. In this paper, we consider only the structural design of each head. In this section, we set up design concepts according to the three design factors and propose a multi-head structure following the design concepts through rigorous evaluation of variants on each factor. For the baseline network, we adopt ResNet-50 [24], the most popular backbone model. All experiments in this section are conducted on a validation split of the CUB-200-2011 [25] dataset: the CUB-200-2011 training set is divided equally into training and validation sets. Values in the tables are averages over 5 trials using different random seeds.
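The first factor, a small per-head dimension so that the concatenated ensemble feature stays manageable, can be illustrated with a toy sketch. This is not the authors' code; the dimensions and the NumPy implementation are illustrative assumptions:

```python
import numpy as np

def ensemble_feature(head_feats):
    """Concatenate L2-normalized per-head features into one ensemble vector.

    Normalizing each head first keeps every head's contribution to the
    retrieval distance on the same scale.
    """
    normed = [f / np.linalg.norm(f, axis=-1, keepdims=True) for f in head_feats]
    return np.concatenate(normed, axis=-1)

# Two heads with 512-d features give a 1024-d ensemble feature; with the
# original 2048-d heads the concatenation would already be 4096-d, which is
# why a dimension-reduced feature block matters for a multi-head ensemble.
f1 = np.random.randn(8, 512)
f2 = np.random.randn(8, 512)
ens = ensemble_feature([f1, f2])
print(ens.shape)  # (8, 1024)
```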

A. DIMENSION-REDUCED FEATURE VECTOR
Design Concept: A design concept for the feature extraction block is presented to mitigate the redundancy caused by concatenating ensemble vectors, thus improving network efficiency. To this end, we set up multiple variants of the last feature block that reduce the feature dimension and network complexity without considerable loss of performance. Then, we select one of the variants through a trade-off evaluation between complexity and performance. Evaluation: We present three representative variants that modify the last residual block of ResNet-50, as shown in Figure 2. Their performance is shown in Table 1. The trade-off evaluation results are as follows: • The ''original'' variant gives a 2048-dimensional feature vector and its time complexity is O(n^2 × dim). Hence, it is not recommended for use as a feature vector due to its excessively large dimension, even though it shows the best performance, as seen in Table 1.
• The variant that reduces the dimension by adding a linear layer suffers a large performance drop compared to the original, as shown in the second row of Table 1. Furthermore, the added linear layer increases the number of network parameters, so this method is not effective.
• The ''Penultimate'' and ''Penultimate+'' variants greatly reduce the complexity without considerable loss of performance. In addition, these variants reduce the parameters by removing the last convolutional layer. ''Penultimate+'' is chosen as our feature block because it performs slightly better than ''Penultimate''.
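As a structural illustration (not the authors' exact code), the ''Penultimate'' idea can be sketched in PyTorch: the last ResNet-50 bottleneck normally expands 512 to 2048 channels with a final 1x1 convolution, so dropping that convolution and pooling the penultimate activation yields a 512-d feature, one-quarter of the original. The exact extra modification in ''Penultimate+'' is not fully specified in this text, so this block stands in for the family of variants:

```python
import torch
import torch.nn as nn

class PenultimateBlock(nn.Module):
    """Last ResNet-50 bottleneck with the final 1x1 expansion conv removed.

    The original bottleneck maps 2048 -> 512 -> 512 -> 2048 channels; keeping
    only the first two convolutions ends at 512 channels, so the pooled
    feature is one-quarter of the original 2048-d vector.
    """
    def __init__(self, in_ch=2048, mid_ch=512):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_ch)
        self.conv2 = nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_ch)
        self.relu = nn.ReLU(inplace=True)
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        x = self.relu(self.bn1(self.conv1(x)))
        x = self.relu(self.bn2(self.conv2(x)))
        return self.pool(x).flatten(1)   # (N, 512) feature vector

x = torch.randn(2, 2048, 7, 7)           # activations entering the last block
feat = PenultimateBlock()(x)
print(feat.shape)                        # torch.Size([2, 512])
```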

B. EFFECTIVE STRUCTURE OF SHARED BODY
Design Concept: The multi-head structure shares low-level layers, and the remaining high-level layers branch off into several heads. The design concept for the shared structure of low-level layers is provided to mitigate excessive memory usage, a side effect of the ensemble method. For this, we set up multiple variants with different shared layers. Since the goal of this design concept is to investigate the effective number of shared low-level layers, the number of heads is set to 2. Then, we investigate the trade-off between memory saving and performance variation. Finally, we choose the variant that shares the maximum number of layers without degrading the performance of the two-head ensemble structure. Evaluation: The layers of the ResNet-50 network can be divided into five blocks: {conv1, Res1_x, Res2_x, Res3_x, Res4_x}. We redefine them as {B1, B2, B3, B4} by merging conv1 into Res1_x. We prepared four variants based on different shared blocks. To clearly observe the ensemble effect, the learning rate of one head was set to twice that of the other. The feature vector of each head is taken from Penultimate+. The performance of each head and of the ensemble is shown in Table 2. The trade-off evaluation results are as follows:
• As the shared portion grows, memory saving increases but performance may decrease. Interestingly, however, the ensemble performance does not always decrease as block sharing increases: with Until B1 and Until B2, the ensemble performance actually increases compared to no sharing (None).
• Sharing Until B3 reduces a large number of parameters, but the ensemble performance is degraded. This implies that, with this much sharing, the feature vector of each head can no longer be independent of the other.
• Although Until B1 and Until B2 show similar ensemble performance, Until B2 is selected for our proposed structure because it has the advantage of larger memory savings.
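A minimal skeleton of the Until B2 two-head layout can be sketched as follows. The blocks here are simple stand-ins for the real ResNet-50 stages, and the channel counts are illustrative assumptions, not the paper's exact configuration:

```python
import copy
import torch
import torch.nn as nn

def make_block(in_ch, out_ch):
    # Stand-in for one ResNet stage (B1..B4); real code would use ResNet blocks.
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class TwoHeadNet(nn.Module):
    """Shares B1-B2 ("Until B2"); B3-B4 are duplicated into two heads."""
    def __init__(self):
        super().__init__()
        self.shared = nn.Sequential(make_block(3, 64), make_block(64, 128))    # B1, B2
        head = nn.Sequential(make_block(128, 256), make_block(256, 512),       # B3, B4
                             nn.AdaptiveAvgPool2d(1), nn.Flatten(1))
        self.head1 = head
        self.head2 = copy.deepcopy(head)   # same structure, independent weights

    def forward(self, x):
        h = self.shared(x)                 # low-level features computed once
        return self.head1(h), self.head2(h)

f1, f2 = TwoHeadNet()(torch.randn(2, 3, 64, 64))
print(f1.shape, f2.shape)                  # torch.Size([2, 512]) twice
```

Note that `deepcopy` produces homogeneous heads, which is the configuration evaluated in this subsection; the heterogeneous heads of Section III-C would additionally alter one head's stride, pruning, or learning rate.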

C. DIVERSE MULTI-HEADS
Design Concept: The third design concept of the proposed multi-head structure is to produce diverse feature vectors from the multi-heads. We assume that diverse designs of the multi-heads can yield diverse feature vectors. Under this assumption, we investigate which structural combinations are favorable for the ensemble via experimental evaluation. To this end, we prepare multiple head variants on top of the Until B2 shared structure and evaluate the ensemble performance for every combination of head features, training only the heads while fixing the remaining part. The head variants are obtained by simply modifying parts of the head layers at the hyper-parameter level (e.g., stride and learning rate changes). Based on the results, we employ the combination that provides the best ensemble performance in the proposed multi-head design. Evaluation: We prepared nine head variants whose modification factors are summarized as follows: Learning rate change (f_η): The learning rate is the first modification factor because it is a simple way to obtain an ensemble gain. The learning rate determines the step size of the parameter updates during training. In the experiments, four learning rates of 0.5×, 1×, 2×, and 4× the original learning rate (5 × 10^-4) are considered.
Stride change (f_s): The stride is the second modification factor. The stride is a design parameter of the convolution filter that determines the spatial filtering density over the input tensor. For the second convolution filter of Res4_1 (stride 2 in the original ResNet-50), the two stride choices of 1 and 2 are considered.
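The stride modification can be tried directly on a convolution module, since PyTorch convolutions read their `stride` attribute at forward time. The module below is a toy stand-in for the second conv of Res4_1, not the full network:

```python
import torch
import torch.nn as nn

# Stand-in for the second conv of Res4_1 (3x3, 512 channels in ResNet-50).
conv = nn.Conv2d(512, 512, kernel_size=3, padding=1, stride=2, bias=False)
x = torch.randn(1, 512, 14, 14)
print(conv(x).shape)     # stride 2 -> torch.Size([1, 512, 7, 7])

conv.stride = (1, 1)     # the f_s modification: denser spatial filtering
print(conv(x).shape)     # stride 1 -> torch.Size([1, 512, 14, 14])
```

In torchvision's ResNet-50 this would presumably correspond to setting `model.layer4[0].conv2.stride = (1, 1)` together with the stride of the parallel downsample convolution, so that the residual branch and shortcut keep matching spatial sizes; the paper does not show its exact implementation.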
Higher-layer pruning (Last layer): We also leverage a technique that prunes higher layers. Ro and Choi [26] have shown that pruning a couple of the highest layers can outperform the original network. Therefore, we consider three variants whose Last layer is Res4_1 (Res4_2 and Res4_3 pruned), Res4_2 (Res4_3 pruned), or Res4_3 (not pruned). A total of nine head variants are prepared by choosing the learning rate, the stride, and the pruning variant, as shown in Table 4. The feature vector of each variant is taken before the last convolutional layer, in the same manner as Penultimate+ in Section III-A. The performance of the single feature vector from each head is shown in Table 4. The ensemble performance of two heads is shown in Table 3, where the value in parentheses shows the ensemble effect, i.e., the increase over the better of the two heads. Based on these experimental results, the discussions for the design of each head are as follows: • From the per-head results in Table 4, Type-G (η1, s1, r2) provides the best performance. In general, setting the stride to 1 outperforms 2, as shown in Table 4. The layer-pruning design saves memory by reducing parameters; for example, Type-H (η1, s2, r1) with Res4_1 has about half the parameters of Type-A (η1, s2, r3) with Res4_3, with almost no loss of performance.
• In the top left of Table 3, among the combinations of learning rate variants, the Type-B (η0.5, s2, r3) and Type-D (η4, s2, r3) ensemble has the highest performance. Type-D (η4, s2, r3) performs worst in Table 4, so its self-ensemble (Type-D,D) is also the worst in Table 3. Nevertheless, Type-D is the most helpful for an ensemble, and, interestingly, it has the largest deviation from the default learning rate among the learning rate variants. Conversely, the combination of Type-A (η1, s2, r3) and Type-B (η0.5, s2, r3), which has the smallest learning-rate variation, shows the lowest increase in ensemble performance. These results show that a successful ensemble of learning rate variants requires sufficiently large variations, such as 0.5× and 4×.
• Among the results in the top right of Table 3, the combination of Type-A (η1, s2, r3) and Type-I (η1, s1, r1) shows the best ensemble performance, with different pruning (Res4_3 vs. Res4_1) and different strides (2 vs. 1). The comparison of Type-A (η1, s2, r3) with Type-E (η1, s1, r3) shows that a difference in stride alone does not contribute to the ensemble gain.
• In the bottom right of Table 3, the combinations of Type-E (η1, s1, r3) with Type-H (η1, s2, r1) and Type-E (η1, s1, r3) with Type-I (η1, s1, r1), which differ in pruning (Res4_3 vs. Res4_1), provide the best and second-best performance. Together with the case of Type-G (η1, s1, r2) with Type-I (η1, s1, r1), this confirms that pruning each head differently from every other head is favorable to the ensemble.
• Table 5 shows representative results for combinations of triple heads. The triple-head structure using Type-E (η1, s1, r3), Type-D (η4, s2, r3), and Type-I (η1, s1, r1) shows the best ensemble performance, where each head adopts a different structure from the others, i.e., Res4_3, Res4_2, and Res4_1 as the Last layer, respectively. The best performance among the triple-head structures is only slightly better than the best among the double-head structures, while the triple-head structure requires larger complexity. Therefore, the double-head structure has been selected for our proposed design in consideration of the trade-off between memory savings and ensemble performance. In summary, our experimental investigation found the following ensemble-favorable choices for multi-head design. First, a large learning rate variation, such as 0.5× and 4×, is favorable for the ensemble. Second, a Last layer difference between Res4_3 and Res4_1 is the most ensemble-favorable. Third, a difference in stride does not contribute to the ensemble, but reducing the stride yields better performance in each head. Fourth, the triple-head structure has no large merit over the double-head structure in the performance comparison.
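The learning-rate factor (f_η) above maps naturally onto per-parameter-group learning rates in the optimizer. A minimal sketch with hypothetical stand-in modules follows; the 0.5× and 4× factors come from the findings above, while the module shapes are made up for illustration:

```python
import torch
import torch.nn as nn

base_lr = 5e-4   # the default head learning rate used in the evaluation above

# Hypothetical stand-ins for the shared body and the two heterogeneous heads.
shared = nn.Linear(8, 8)
head1 = nn.Linear(8, 4)   # e.g. unpruned head (Last layer Res4_3)
head2 = nn.Linear(8, 4)   # e.g. pruned head (Last layer Res4_1)

# Head diversity via learning rate: one param group per sub-module.
optimizer = torch.optim.SGD(
    [{"params": shared.parameters(), "lr": base_lr},
     {"params": head1.parameters(), "lr": 0.5 * base_lr},
     {"params": head2.parameters(), "lr": 4.0 * base_lr}],
    momentum=0.9)

print([g["lr"] for g in optimizer.param_groups])  # [0.0005, 0.00025, 0.002]
```

With torchvision's ResNet-50, pruning a head down to Res4_1 would presumably amount to keeping only the first block of `layer4` (e.g. `head.layer4 = head.layer4[:1]`); this is an assumption about how the pruning variants could be realized, not the authors' code.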

D. PROPOSED MULTI-HEAD ENSEMBLE STRUCTURE
Based on the evaluation results, we propose the Heterogeneous Double-head Ensemble (HDhE) structure shown in Figure 3-(a). Following the result in subsection A, the HDhE structure uses feature vectors reduced to one-quarter of the dimension of the original model. Following the result in subsection B, the shared body adopts Until B2, improving performance while saving memory. Lastly, following the result in subsection C, the head structure is designed to produce diverse feature vectors: the Last layers of the double head are Res4_3 and Res4_1, the learning rates are set to 0.5× and 4×, respectively, and the stride is set to 1 for both heads.
In addition, referring to the evaluation results, the multi-head structure can be customized in various ways. For example, in Figure 3-(b), a memory-saving version of HDhE (Ms-HDhE) is suggested: the shared body is enlarged to Until B3, and the heads use Res4_2 and Res4_1 with learning rates of 0.5× and 4×, respectively, and a stride of 1 for both.

IV. EXPERIMENTS
In this section, we provide experimental results evaluating the proposed structure in deep metric learning for image retrieval tasks. We first introduce the standard benchmark datasets of image retrieval in Section IV-A and give the implementation details in Section IV-B. The evaluation then contains three parts: the ablation studies on structural variants (Section IV-C), the verification of suitability as a backbone structure (Section IV-D), and the comparison with state-of-the-art methods (Section IV-E).

A. DATASETS
The proposed structure has been evaluated on four standard benchmark datasets for the image retrieval task: CUB-200, Car-196, SOP, and Inshop.
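These datasets are evaluated with the Recall@K metric: the fraction of query images whose K nearest neighbors (excluding the query itself) contain at least one image of the same class. A minimal NumPy sketch of the metric, with toy features, follows; the real evaluation would use the learned embedding vectors:

```python
import numpy as np

def recall_at_k(feats, labels, k=1):
    """Recall@K over a gallery that doubles as the query set."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T
    np.fill_diagonal(sim, -np.inf)            # exclude the query itself
    topk = np.argsort(-sim, axis=1)[:, :k]    # indices of the k nearest neighbors
    hits = (labels[topk] == labels[:, None]).any(axis=1)
    return hits.mean()

feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
print(recall_at_k(feats, labels, k=1))   # 1.0: every nearest neighbor shares the class
```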

B. IMPLEMENTATION DETAILS
Our experiments were implemented using the PyTorch [30] library. As the backbone network, a ResNet-50 [24] pre-trained on ImageNet [31] was used. The input size was set to 224 × 224 and the batch size to 40. For data augmentation during training, input images were flipped horizontally with probability 0.5. Basically, our HDhE was trained along with a classifier by the softmax (cross-entropy) loss. The learning rates were initially set to 0.001 for the convolutional blocks (B1, B2, B3, and B4) and 0.01 for the classifier. For CUB-200 and Car-196, each HDhE was trained with the initial setting for the first 10 epochs and for 10 subsequent epochs with a learning rate of 0.1× the initial one. For SOP and Inshop, each HDhE was trained for the first 30 epochs with the initial learning rate, which was then dropped to 0.1× for the next 20 epochs. The optimizer was stochastic gradient descent (SGD) with Nesterov momentum [32]; the momentum rate and weight decay were set to 0.9 and 5 × 10^-4, respectively. For the evaluation metric, Recall@K [28] is employed.
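The optimizer and schedule above can be sketched with standard PyTorch pieces. The modules are hypothetical stand-ins; the learning rates, momentum, weight decay, and the 10+10-epoch CUB-200/Car-196 schedule follow the settings stated above:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the convolutional body and the softmax classifier.
body = nn.Linear(16, 16)
classifier = nn.Linear(16, 200)   # e.g. 200 classes for CUB-200

optimizer = torch.optim.SGD(
    [{"params": body.parameters(), "lr": 1e-3},         # B1-B4 blocks
     {"params": classifier.parameters(), "lr": 1e-2}],  # classifier
    momentum=0.9, weight_decay=5e-4, nesterov=True)

# CUB-200 / Car-196 schedule: 10 epochs at the initial rate, then 10 at 0.1x.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10], gamma=0.1)

for epoch in range(20):
    # ... one training epoch over the dataset would go here ...
    scheduler.step()

print([round(g["lr"], 6) for g in optimizer.param_groups])  # [0.0001, 0.001]
```

For SOP and Inshop the same sketch would use `milestones=[30]` over 50 epochs.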

C. ABLATION STUDY
In this section, we provide ablation studies to evaluate the extent to which design factors contribute to performance improvements. To this end, we have evaluated the ablation variant models from the proposed Heterogeneous Double-head Ensemble (HDhE) model. For the baseline model, we used the simple ensemble of two independent networks.

1) EFFECT OF PENULTIMATE +
To verify the effect of the proposed Penultimate+, we evaluated a model that replaces the Res4_3 block with Penultimate+ in the baseline. As the first and second rows of Table 6 show, the proposed Penultimate+ not only reduces the feature dimension to one-quarter of the baseline but also delivers better ensemble performance in Recall@1.

2) EFFECT OF SHARED LAYERS
To evaluate the effect of our shared layers, we compared two variants. As shown in the third row of Table 6, HDhE-A (Until B3) saves parameters by a large margin, but its performance drop is relatively large. On the other hand, HDhE-B (Until B1) delivers performance comparable to the proposed HDhE (Until B2) but requires more parameters. Therefore, the selection of Until B2 is valid in that it saves parameters and improves performance as well. In addition, the proposed memory-saving version (Ms-HDhE) reduces the model complexity without considerable loss of performance. Finally, to observe the ensemble gain, a single-model variant (single) using the original ResNet-50 was compared: interestingly, Ms-HDhE achieves much better performance (+2.5% in Recall@1) with less complexity than the single model, as shown in Table 6.

3) EFFECT OF HEAD DESIGN
HDhE-C, HDhE-D, and HDhE-E use the same learning rate for both heads, unlike HDhE. Additionally, HDhE-C and HDhE-D vary the pruning of head-2: HDhE-C uses Res4_3 and HDhE-D uses Res4_2 as the Last layer, unlike HDhE, which uses Res4_1. The stride of every head was set to 1. The evaluation results of these variants are shown in Table 7. Among them, HDhE-E delivers the best performance, and HDhE-D performs better than HDhE-C, implying that the structural diversity introduced by pruning only head-2 contributes to the increased performance; in particular, the larger the difference between the two heads (Res4_3 vs. Res4_1), the larger the improvement. Moreover, the proposed HDhE, which adopts different learning rates for the two heads, improves further on HDhE-E, which uses the same learning rate for both. This implies that learning rate diversity also contributes to the increased performance. Overall, these results show that the heterogeneous design of multi-heads can achieve a significant ensemble gain.

Table 8 shows the evaluation results of a triple-head model with the Until B2 shared structure. The triple heads were designed with Res4_3, Res4_2, and Res4_1 as the Last layer and 0.5×, 2×, and 4× of the default learning rate, with the stride set to 1 for every head. The triple-head structure provides better performance than HDhE on Car-196 but a slight decrease in Recall@1 on CUB-200, while its parameters are 16M larger than those of HDhE. Considering the trade-off between complexity and performance, the triple-head (or four-head) structure has no significant merit over the double-head (HDhE) structure.

D. GENERALITY FOR THE OTHER LOSS FUNCTIONS
In this section, to verify the generality of HDhE, we conducted an additional experiment applying another type of loss function: triplet loss [33] together with softmax loss. The proposed HDhE structure was originally trained only with softmax loss under random sampling, which is a simple and fast way to train the model. The triplet loss, however, requires a different sampling scheme to select triplet pairs. Thus, we employed an identity sampling method and the hard-sample mining technique [34], and applied the triplet and softmax losses to the double heads simultaneously. We call this method Tri-HDhE.
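A combined softmax-plus-triplet objective with batch-hard mining in the style of [34] can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the margin, the equal weighting of the two terms, and the feature dimensions are assumptions the paper does not state:

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet(feats, labels, margin=0.3):
    """Batch-hard triplet loss: hardest positive / hardest negative per anchor."""
    dist = torch.cdist(feats, feats)            # pairwise Euclidean distances
    same = labels[:, None] == labels[None, :]
    pos = dist.clone()
    pos[~same] = -1.0                           # mask out negatives
    hardest_pos = pos.max(dim=1).values         # farthest same-class sample
    neg = dist.clone()
    neg[same] = float("inf")                    # mask out positives (and self)
    hardest_neg = neg.min(dim=1).values         # closest other-class sample
    return F.relu(hardest_pos - hardest_neg + margin).mean()

def tri_hdhe_loss(feats, logits, labels):
    # Softmax (cross-entropy) and triplet terms applied to the same head
    # features; equal weighting is an assumption made for this sketch.
    return F.cross_entropy(logits, labels) + batch_hard_triplet(feats, labels)

feats = torch.randn(8, 512)
logits = torch.randn(8, 10)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])  # identity sampling: pairs per class
loss = tri_hdhe_loss(feats, logits, labels)
print(loss.item() >= 0)                          # True
```

In Tri-HDhE this combined loss would be applied to each of the two heads; the identity sampler guarantees that every anchor in the batch has at least one positive, which batch-hard mining requires.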
Interestingly, Table 9 shows that the performance of Tri-HDhE improves significantly over that of HDhE (softmax loss only). Even though the performance on the Car-196 and Inshop datasets already exceeds 90% when training only with softmax loss, there are further improvements of over 1% in Recall@1. From the results of this experiment, the generality of the proposed HDhE structure is verified: it can be used as a backbone network for further performance improvement with other losses or in various applications.

E. COMPARISON TO STATE-OF-THE-ART METHODS
We compared the proposed HDhE family with existing state-of-the-art algorithms. Surprisingly, the Tri-HDhE model outperforms the state-of-the-art methods on the four image retrieval datasets, as shown in Table 10 and Table 11. In addition, not only HDhE but also Ms-HDhE outperforms the existing deep metric learning algorithms. Moreover, the results show that HDhE outperforms the existing ensemble algorithms (HDC [1], A-BIER [2], ABE-8 [3], and DREML [5]) by a large margin. Impressively, even the single-head version of HDhE (Sh-HDhE), which has only the pruned head (Last layer Res4_1), is competitive with the latest algorithms (SCHM [8], SoftTriple [9], and Proxy-Anchor [10]). Furthermore, the proposed HDhE improves Recall@1 on CUB-200 by over 4% relative to Sh-HDhE, which implies that our ensemble structure provides a meaningful ensemble gain.

V. CONCLUSION
In this paper, we have identified the factors that influence an effective multi-head ensemble structure design, and we set up design concepts covering those factors. Through an evaluation of various variants following the design concepts, we propose a heterogeneous double-head ensemble structure. The proposed structure has the following advantages:
- The dimension of the feature vector from each head is reduced to one-quarter of the baseline model while improving performance.
- The shared body contributes not only to memory savings but also to performance enhancement.
- The heterogeneous design of heads yields a large ensemble gain.
- The designed structure outperforms the existing ensemble structures and achieves state-of-the-art performance on the image retrieval benchmarks.
- The designed structure can be used as a backbone for various applications and other types of loss, such as triplet loss.