I. Introduction
Deep metric learning (DML), a family of similarity learning methods, has attracted increasing attention in the field of deep learning [1]–[7]. Its common strategy is to learn a deep end-to-end feature representation. DML methods take the relations between samples into consideration and map samples into a new embedding space, where samples with the same label lie closer together while samples with different labels lie far apart. However, the features learned by DML methods may yield suboptimal results if they model only simple relations in the data. For this reason, many deep learning methods introduce a multitask learning mechanism [8], [9], i.e., optimizing the classification task and the similarity learning task simultaneously, and thereby achieve better results. However, two problems still exist in these methods.
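As a minimal illustration of the embedding objective described above, the following sketch computes a contrastive-style pairwise loss [13], [14] that pulls same-label embeddings together and pushes different-label embeddings apart; the Euclidean distance and the margin value are illustrative assumptions, not details from this paper:

```python
import numpy as np

def contrastive_loss(z1, z2, same_label, margin=1.0):
    """Contrastive-style loss on one pair of embeddings.

    z1, z2: embedding vectors produced by the network;
    same_label: True if the pair shares a class label;
    margin: hypothetical margin hyperparameter (illustrative only).
    """
    d = np.linalg.norm(z1 - z2)           # Euclidean distance in embedding space
    if same_label:
        return d ** 2                     # pull similar pairs together
    return max(0.0, margin - d) ** 2      # push dissimilar pairs beyond the margin

# Toy example in a 2-D embedding space
a = np.array([0.0, 0.0])
b = np.array([0.1, 0.0])   # same label as a, already close -> small loss
c = np.array([2.0, 0.0])   # different label, beyond the margin -> zero loss
loss_pos = contrastive_loss(a, b, same_label=True)
loss_neg = contrastive_loss(a, c, same_label=False)
```

The two branches of the loss are exactly the "closer for same label, far apart for different labels" behavior the mapping is trained to produce.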
A variety of existing methods extract only single-layer features from the last fully connected layer of deep neural networks [10]–[12]. In fact, features extracted from low layers of an image carry abundant details, such as position, while high-layer features contain semantic information, such as shapes and targets. Hence, single-layer features may be sensitive to variations such as viewpoint and illumination.
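A simple way to see the alternative is to fuse descriptors from several depths of the network into one representation. The sketch below (a hypothetical fusion scheme, not the paper's method; layer shapes are illustrative) pools and normalizes each layer's feature map before concatenation, so both low-layer detail and high-layer semantics enter the final descriptor:

```python
import numpy as np

def fuse_multilayer(features):
    """Fuse feature maps from several network depths into one descriptor.

    features: list of (C, H, W) arrays taken from different layers
    (hypothetical shapes). Each map is global-average-pooled and
    L2-normalized, then the per-layer vectors are concatenated.
    """
    pooled = []
    for f in features:
        v = f.mean(axis=(1, 2))                  # global average pooling -> (C,)
        v = v / (np.linalg.norm(v) + 1e-12)      # L2-normalize each layer's vector
        pooled.append(v)
    return np.concatenate(pooled)

low = np.random.rand(64, 56, 56)     # low-layer map: fine spatial details
high = np.random.rand(512, 7, 7)     # high-layer map: semantic content
desc = fuse_multilayer([low, high])  # fused descriptor of length 64 + 512
```

Because each layer's vector is normalized separately, neither the detail-rich low layer nor the semantic high layer dominates the fused descriptor by scale alone.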
Most deep multitask metric learning methods organize the training samples into small cliques to compute their correlations, such as pairs [13], [14], triplets [15]–[18], and quadruplets [10]. Accordingly, the learned features may be discriminative only within these cliques rather than in the whole embedding space, owing to the limited correlations they capture.
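To make the clique-based formulation concrete, the following is a minimal sketch of the standard triplet loss [15]–[18] over one clique of three samples; the margin value is an illustrative assumption:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss over one clique (anchor, positive, negative).

    Encourages the anchor-negative distance to exceed the
    anchor-positive distance by at least `margin` (an illustrative
    hyperparameter, not taken from this paper).
    """
    d_ap = np.linalg.norm(anchor - positive)   # anchor to same-label sample
    d_an = np.linalg.norm(anchor - negative)   # anchor to different-label sample
    return max(0.0, d_ap - d_an + margin)

# Only the three samples inside the clique are compared; samples outside
# the triplet contribute nothing, which is the limitation noted above.
a = np.array([0.0, 0.0])   # anchor
p = np.array([0.3, 0.0])   # positive (same label)
n = np.array([1.0, 0.0])   # negative (different label)
loss = triplet_loss(a, p, n)
```

Since the loss is zero whenever the margin constraint is satisfied within the triplet, the gradient carries no information about how the anchor relates to samples outside the clique.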