Ranked List Loss for Deep Metric Learning

Abstract—The objective of deep metric learning (DML) is to learn embeddings that capture semantic similarity and dissimilarity among data points. Existing pairwise or tripletwise loss functions used in DML are known to suffer from slow convergence because a large proportion of pairs or triplets becomes trivial as the model improves. To address this, ranking-motivated structured losses have recently been proposed to incorporate multiple examples and exploit the structured information among them. They converge faster and achieve state-of-the-art performance. In this work, we unveil two limitations of existing ranking-motivated structured losses and propose a novel ranked list loss to solve both of them. First, given a query, only a fraction of data points is incorporated to build the similarity structure. Consequently, some useful examples are ignored and the structure is less informative. To address this, we propose to build a set-based similarity structure by exploiting all instances in the gallery. The learning setting can be interpreted as few-shot retrieval: given a mini-batch, every example is iteratively used as a query, and the remaining ones compose the gallery to search, i.e., the support set in the few-shot setting. The remaining examples are split into a positive set and a negative set. For every mini-batch, the learning objective of ranked list loss is to make the query closer to the positive set than to the negative set by a margin. Second, previous methods aim to pull positive pairs as close as possible in the embedding space. As a result, the intraclass data distribution tends to be extremely compressed. In contrast, we propose to learn a hypersphere for each class in order to preserve the useful similarity structure inside it, which functions as regularisation. Extensive experiments demonstrate the superiority of our proposal by comparison with state-of-the-art methods on the fine-grained image retrieval task.
Our source code is available online: https://github.com/XinshaoAmosWang/Ranked-List-Loss-for-DML


INTRODUCTION
Deep metric learning (DML) plays a crucial role in a variety of applications in computer vision, such as image retrieval [37], [52], clustering [18], and transfer learning [54]. For example, by using DML, FaceNet [47] achieves superhuman performance on face verification with 260M face images of 8M identities.
The loss function is one of the essential components of successful DML frameworks, and a large variety of loss functions have been proposed in the literature. Contrastive loss [5], [13] captures the relationship between pairwise data points, i.e., similarity or dissimilarity. Triplet-based losses are also widely studied [6], [47], [65]. A triplet is composed of an anchor point, a similar (positive) data point and a dissimilar (negative) data point. The purpose of triplet loss is to learn a distance metric by which the anchor point is closer to the similar point than to the dissimilar one by a margin. In general, the triplet loss outperforms the contrastive loss [47], [54] because the relationship between positive and negative pairs is considered. Inspired by this, recent ranking-motivated methods¹ [26], [37], [47], [52], [53], [54] propose to take into consideration the richer structured information among multiple data points, and achieve impressive performance on many applications, e.g., fine-grained image retrieval and clustering.
However, there are still certain limitations in current state-of-the-art DML approaches. Firstly, we notice that only a proportion of informative examples is incorporated to capture the structure in previous ranking-motivated loss functions. In this case, some non-trivial examples are wasted and the structured information is extracted from fewer data points. To address this, we propose to utilise all non-trivial data points to build a more informative structure and exploit it to learn more discriminative embeddings. Specifically, given a query, we obtain a ranked list by sorting all other data points (the gallery) according to their similarities to the query. Ideally, all the positive examples are supposed to be ranked before the negative samples in the feature space. To achieve this, we introduce ranked list loss (RLL) to organise the samples of each query. Given a query, RLL is optimised to rank all positive points before the negative points and to enforce a margin between them. In other words, RLL aims to explore the set-based similarity structure, which contains richer information than point-based approaches, e.g., triplet loss.

1. We term them ranking-motivated methods, whose target is to make the largest distance of similar pairs smaller than the smallest distance of dissimilar pairs. In this context, we do not care about the distance order inside the positive and negative sets; namely, rank and retrieve can be used interchangeably here.
Secondly, we observe that the intraclass data distribution is not considered in previous structured losses. The algorithms in [12], [37], [47], [52], [53], [54], [58] all aim to pull data points in the same class as close as possible. Consequently, these approaches try to shrink samples of the same class into one point in the feature space and may easily lose their similarity structure. To solve this, we propose to learn a hypersphere for each class in RLL. Specifically, instead of pulling intraclass examples as close together as possible, we only force the distance of a positive pair to be smaller than a threshold, which is the diameter of each class's hypersphere. In this way, RLL can explicitly preserve the intraclass similarity structure within each class's hypersphere.
Empirically, the convergence rate of DML methods highly depends on the probability of seeing non-trivial samples [47]. Given a query (anchor), it is non-trivial to separate the positive and negative sets by a margin when all data points are considered. As a result, only a few ranked lists are perfectly optimised as the model improves during training. Therefore, our method can exploit a maximum number of elements with non-zero losses and unlock more potential in the learning procedure. The proposed RLL is illustrated in Figure 1.
Furthermore, a recent work [1] studied the existence of critical learning periods in artificial learning systems; in biological systems, a critical period represents the time during which a temporary stimulus deficit may cause a permanent skill impairment. Specifically, [1] characterises the critical period in classification tasks using Fisher Information and Information Plasticity. In contrast, we study and validate the critical learning period in deep metric learning via dynamic example weighting. Verifying the existence of critical learning periods is important because it can help us understand why network initialisation and warm-up training strategies have a large impact on the final performance. More importantly, it inspires us to be more careful and to spend more effort monitoring the early learning process of a deep network in future work.
In short, our contributions in this paper are listed as follows: • We propose a novel ranking-motivated structured loss, named Ranked List Loss (RLL), to learn discriminative embeddings using the recent setting of mini-batch few-shot retrieval. In contrast with previous ranking-motivated losses, we incorporate all non-trivial data points and exploit the structure among them. Notably, we learn a hypersphere for each class to preserve the intraclass data distribution, instead of shrinking each class into one point in the embedding space.

• We propose two versions of RLL, i.e., the full version and a simplified version termed RLL-Simpler. As indicated by its name, RLL-Simpler simplifies the full version and is preferable for exploration in practice, because it has only two hyper-parameters.

• Using RLL-Simpler, we obtain state-of-the-art performance on two large datasets, i.e., SOP [54] and In-shop Clothes [32]. Then, using the full version of RLL, we present comprehensive ablation studies to understand the vital factors in deep metric learning. In particular, the study on network depth is of high practical value.
The rest of this paper is organised as follows: Section 2 introduces some basic notations and preliminaries of deep metric learning, e.g., how a loss function relates to deep metric learning, prior work and practical strategies. We present our method in Section 3. Specifically, in Section 3.7, we introduce a dynamic weighting scheme and use it to study the critical learning periods in the context of deep metric learning for the first time. In Section 4, we show extensive experiments that compare with related baselines and comprehensively study the important components. The critical learning period of deep metric learning is studied in Section 5. Finally, we summarise this work in Section 6.

PRELIMINARIES AND RELATED WORK
Notations. We use bold capital characters to denote matrices and sets. Bold lower-case letters represent vectors. Scalar variables are denoted by non-bold letters. Let X = {(x_i, y_i)}_{i=1}^N be the input data, where (x_i, y_i) indicates the i-th image and its corresponding class label. The total number of classes is C, i.e., N = Σ_{c=1}^{C} N_c, where N_c is the number of images in the c-th class. For every x_i, we use y_i and the superscript (as in x_i^c) interchangeably to indicate its class label.

Design of Loss Functions for Learning Discriminative Deep Representations
In this subsection, we briefly introduce the relationship between design of loss functions for learning discriminative representations and deep distance metric learning.
Generally, metric learning aims to learn a metric to measure the distance between two samples. For example, NCA [12] learns a linear transformation from an input space X to a metric space AX. In NCA, d(x_i, x_j) = ||Ax_i − Ax_j||² = (x_i − x_j)ᵀAᵀA(x_i − x_j). Here AᵀA is the learned metric, while d(x_i, x_j) is the distance between x_i and x_j in the metric space. However, from the perspective of feature learning, metric learning learns an embedding function f, e.g., f(x) = Ax. In the embedding (metric) space, the distance between every pair is computed by their Euclidean distance. A loss function, e.g., NCA for preserving neighbourhood structure, is needed to supervise the learning of an embedding function f. Therefore, we can see that the loss function defines how we compute the distance between two instances in the feature space. As a result, the design of a loss function is crucial in metric learning (discriminative embedding learning).
In the context of deep metric learning, a deep neural network is used as a non-linear encoding function. To empirically provide supervision information in the loss layer, there are three widely applied approaches: • Learning to fulfil instance-to-class similarity relationship constraints, i.e., pulling an example's feature towards its class centre while pushing it away from other class centres simultaneously. After encoding the input data points, this approach is usually implemented with a fully connected layer for computing the dot product between an example's feature and all class centres, i.e., the weight vectors of the fully connected layer. The outputs are termed logits, which are normalised with a softmax transformation. Finally, the cross entropy is applied to ensure the instance-to-class similarity relationship constraint is met. There are many variants of this approach, including L2-Softmax [43], Large-margin Softmax [31], Angular Softmax [30], NormFace [63], AM-Softmax [62], CosFace [64] and ArcFace [7].

Fig. 2: In triplet loss [47], the anchor is compared with only one negative example and one positive example. In N-pair-mc [52], Proxy-NCA [37] and Lifted Struct [54], one positive example and multiple negative classes are incorporated. N-pair-mc randomly selects one example per negative class. Proxy-NCA pushes the anchor away from negative proxies instead of negative examples. The proxy is class-level and can represent any instance in the corresponding class. Lifted Struct uses all examples from all negative classes. On the contrary, our proposed Ranked List Loss not only exploits all negative examples, but also makes use of all positive ones. Best viewed in colour.
• Learning according to instance-to-proxy similarity relationship constraints. This idea is an interesting extension of instance-to-class similarity relationship modelling. In this approach, classes are represented by proxies [37], [42]. It is interesting in that more flexibility is given: (1) The number of proxies can be smaller than the number of training classes, in which case multiple classes are assigned to the same proxy. This is named fractional proxy assignment, i.e., one proxy represents multiple classes [37]. (2) We can also represent one class using multiple proxies [37], [42]. (3) When semantic class labels are available, we can apply static proxy assignment analogously to instance-to-class modelling. In cases where semantic labels are not given, we need to exploit dynamic proxy assignment according to the distances between a data point and all proxies [37]. (4) When the number of proxies is the same as the number of classes and static proxy assignment is applied, this approach becomes equivalent to instance-to-class similarity relationship modelling.

• Learning to meet instance-to-instance similarity relationship constraints. Intrinsically, high-order similarity relationships, i.e., those of ranking-motivated algorithms, are derived over pairwise relationships. In this context, we only label whether two data points are similar or dissimilar, without the need to know how many training classes are given or which class each sample belongs to. Methods based on instance-to-instance similarity relationships include contrastive loss [5], [13], triplet loss [19], [47], Lifted Struct [54], N-pair-mc [52], Histogram loss [59], Angular loss [66], Sampling Matters [75], OSM and CAA [70], Multi-similarity Weighting [68], ICE [71] and so on.
In this work, we focus on studying the instance-to-instance approach. Compared with instance-to-class or instance-to-proxy similarity relationship constraints, the pairwise constraint is a good solution for the challenging extreme object recognition setting [41], [77], in which there exists an enormous number of classes and only a few images per class. In our experiments, we evaluate our proposed method in such a setting. Additionally, the pairwise similarity constraint is more suitable for the incremental (online) learning setting. When new training classes arrive, with instance-to-instance distance metric learning we do not need to verify whether those new classes have occurred in the previous training stage. Beyond that, knowledge transfer and re-learning are also more straightforward, since we do not need to retrain a new fully connected layer. Theoretically, instance-to-instance similarity relationship modelling is more applicable for scaling up visual recognition.

Ranking-Motivated Structured Losses
Triplet Loss [47], [73] aims to pull the anchor point closer to the positive point than to the negative point by a fixed margin m:

L(Γ; f) = Σ_{(i,j,k)∈Γ} [d_{ij} + m − d_{ik}]_+,

where Γ is the set of triplets, i, j and k are the indexes of the anchor, positive and negative points, respectively, and d_{ij} = ||f(x_i) − f(x_j)||_2. f is the embedding function and [·]_+ is the hinge function.
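As a concrete illustration, the triplet hinge above can be sketched in a few lines of pure Python on pre-computed embedding vectors (the function name and margin value are illustrative, not taken from the authors' code):

```python
def triplet_loss(anchor, positive, negative, m=0.2):
    """Hinge loss [d(a, p) + m - d(a, n)]_+ for one triplet of embeddings."""
    def d(u, v):  # Euclidean distance between two vectors
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
    return max(0.0, d(anchor, positive) + m - d(anchor, negative))

# A well-separated triplet incurs zero loss; a violating one is penalised.
print(triplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 0.0]))            # 0.0
print(round(triplet_loss([0.0, 0.0], [0.9, 0.0], [1.0, 0.0]), 6))  # 0.1
```

Note that once d(a, p) + m < d(a, n) holds, the triplet becomes trivial and contributes zero gradient, which is the convergence issue discussed above.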
N-pair-mc [52] exploits the structured relationship among multiple data points to learn the embedding function. Triplet loss pulls one positive point while pushing a negative one simultaneously. To improve the triplet loss by interacting with more negative classes and examples, N-pair-mc aims to identify one positive example from N − 1 negative examples of N − 1 classes (one negative example per class):

L({(x_i, x_i^+)}_{i=1}^N; f) = (1/N) Σ_{i=1}^N log(1 + Σ_{j≠i} exp(f_i^T f_j^+ − f_i^T f_i^+)),

where {(x_i, x_i^+)}_{i=1}^N are N pairs of examples from N different classes, i.e., y_i ≠ y_j, ∀i ≠ j. Here, x_i and x_i^+ are the query and the positive example respectively, and {x_j^+, j ≠ i} are the negative examples.
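The identification objective of N-pair-mc can be sketched as follows (a hedged pure-Python illustration on raw embedding vectors; the function and variable names are ours):

```python
import math

def n_pair_mc_loss(query, positive, negatives):
    """log(1 + sum_j exp(q.n_j - q.p)): identify the positive among negatives."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return math.log(1.0 + sum(math.exp(dot(query, n) - dot(query, positive))
                              for n in negatives))

# The loss shrinks as the positive's similarity dominates the negatives'.
easy = n_pair_mc_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0], [0.0, -1.0]])
hard = n_pair_mc_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [0.0, -1.0]])
print(easy < hard)  # True
```

The log-sum-exp form means every negative class in the tuple contributes gradient, unlike the single negative of a triplet.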
Lifted Struct [54] is proposed by Song et al. to learn the embedding function by incorporating all negative examples. The objective of Lifted Struct is to pull one positive pair (x_i^+, x_j^+) as close as possible and push all negative data points corresponding to x_i^+ or x_j^+ farther than a margin α. Mathematically:

L = (1 / (2|P|)) Σ_{(i,j)∈P} [ d_{ij} + log( Σ_{(i,k)∈N} exp(α − d_{ik}) + Σ_{(j,l)∈N} exp(α − d_{jl}) ) ]_+²,

where P and N respectively represent the sets of positive pairs and negative pairs, and d_{ij} is the distance between f(x_i) and f(x_j). Given the query x_i, Lifted Struct intends to identify one positive example from all corresponding negative data points.
Proxy-NCA [37] is proposed to address the sampling problem using proxies. The proxy set W is a small set of data points that represent the training classes in the original data. The proxy for u is chosen by p(u) = argmin_{w∈W} d(u, w), i.e., p(u) denotes the closest point to u in W. The Proxy-NCA loss is the traditional NCA loss defined over proxies instead of the original data points:

L(a, u, Z) = −log( exp(−d(a, p(u))) / Σ_{z∈Z} exp(−d(a, p(z))) ),

where Z is the negative set, and p(u) and p(z) are the proxies of the positive and negative points, respectively. a is the anchor and d(·, ·) is the Euclidean distance between two points. With static proxy assignment, i.e., one proxy per class, the performance is much better than with dynamic proxy assignment. However, the proxies in static proxy assignment are learned during training, similarly to the class vectors of the fully connected layer in classification. Recently, SoftTriple [42] further improves this proxy-based idea and achieves promising performance. The main drawback is that scalability to extremely large datasets is not theoretically guaranteed. The proposed RLL is a ranking-motivated structured loss, which avoids two limitations of traditional methods by incorporating all non-trivial data points and exploring the intrinsic structured information among them. An illustration and comparison of different ranking-motivated losses and ours are presented in Figure 2.
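Following the formula above, Proxy-NCA can be sketched as a negative log-ratio over proxy distances (an illustrative sketch, not the authors' code; using squared Euclidean distance here is our assumption):

```python
import math

def proxy_nca_loss(anchor, pos_proxy, neg_proxies):
    """-log( exp(-d(a, p(u))) / sum_z exp(-d(a, p(z))) ) over proxy distances."""
    def d(u, v):  # squared Euclidean distance (an assumption of this sketch)
        return sum((a - b) ** 2 for a, b in zip(u, v))
    num = math.exp(-d(anchor, pos_proxy))
    den = sum(math.exp(-d(anchor, z)) for z in neg_proxies)
    return -math.log(num / den)

# Moving the anchor towards its positive proxy lowers the loss.
near = proxy_nca_loss([0.9, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
far  = proxy_nca_loss([0.1, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
print(near < far)  # True
```

Because the anchor is compared against a handful of proxies rather than all instances, the sampling problem over pairs largely disappears.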

Directly Learning to Retrieve
Inspired by information retrieval, many methods have been proposed to maximise mean Average Precision over the retrieved results of queries [27], [29], [34]. We briefly present some recent representative work as follows.

Information Retrieval Lens [58] proposes a new form of few-shot learning task, i.e., few-shot retrieval. Every batch is an independent task, composed of n classes and k images per class. Instead of splitting a batch into a support set (a training set) and a query set as done in Prototypical Networks [51], Information Retrieval Lens [58] iteratively uses each data point in the batch as a 'query' to rank the remaining ones. The optimisation objective of [58] is to maximise the mean Average Precision (mAP) over those rankings.

FastAP [4] approximately optimises the mean Average Precision via distance quantization. Optimising mAP directly is highly challenging, thus FastAP exploits quantization-based approximation to reduce the complexity and improve the efficiency. Therefore, FastAP is tailored for optimising mAP using stochastic gradient descent.
Prec@K [33], i.e., top-K precision or the Recall@K performance metric, targets optimising the top K retrieved neighbours. To optimise Prec@K, it proposes to emphasise misplaced images near the decision boundary, i.e., beside the K-th nearest neighbour. Concretely, those prioritised misplaced images are: (1) similar ones which do not belong to the K nearest neighbours but are close to the K-th nearest neighbour; the optimisation objective is to pull them into the K nearest neighbours set; (2) dissimilar ones which are in the K nearest neighbours set and are close to the K-th nearest neighbour; the optimisation target is to push them out of the K nearest neighbours set.
MIHash [3] is a supervised hashing method for learning binary vector embeddings. For every query, MIHash minimises its neighbourhood ambiguity through an information-theoretic lens, where the separability between positive and negative points is measured by the mutual information between the distance distribution of positive examples and that of negative ones.
In this paper, we also show that our proposed algorithm follows the setting of few-shot retrieval [58]. It optimises the distances of positive and negative pairs instead of average precision. Additionally, Prec@K is partially motivated by our CVPR 2019 conference version [69]. FastAP [4] was published in CVPR 2019, concurrently with ours. The focus of MIHash is to learn a Hamming space for efficient retrieval in large databases.

Clustering-Motivated Structured Losses
Struct Clust [53] is recently proposed to learn the embedding function f by optimising a clustering quality metric. The proposed structured loss function is defined as:

L = [ max_{ŷ} { F(X, ŷ; f) + γ Δ(y, ŷ) } − F(X, y; f) ]_+,  Δ(y, ŷ) = 1 − NMI(y, ŷ),

where ŷ and y are the predicted and ground-truth clustering assignments, respectively. F measures the quality of the clustering on X given a label assignment and the distance metric. NMI(y, ŷ) is the normalised mutual information [48]: NMI is 1 if the predicted clustering assignment is as good as the ground truth and 0 if it is the worst. ŷ is predicted based on the learned distance metric f, and Struct Clust [53] aims to learn f such that the F of the ground-truth assignment is larger than that of any other predicted clustering assignment. However, this algorithm is NP-hard, as we need to optimise both the clustering medoids and the distance metric simultaneously. As a result, loss-augmented inference and refinement are applied to select facilities (clustering medoids) based on a greedy algorithm [35]. Sufficiently many greedy search iterations are needed to find a local optimum, which can be costly.
Spectral Clust [26] also aims to optimise the quality of the clustering. Spectral Clust relaxes the problem of clustering with Bregman divergences [2] and computes the gradient in closed form, which reduces the algorithmic complexity of existing iterative methods, e.g., Struct Clust [53]. However, it is still non-trivial to learn deep models based on a mini-batch implementation. A large batch size (i.e., 1260 = 18 classes × 70 samples per class) is required for the clustering in the mini-batch. As a result, Spectral Clust iteratively computes submatrices and concatenates them into a single matrix for computing the loss and gradient, which is computationally expensive.
Both ranking-motivated and clustering-motivated structured loss functions exploit the structured similarity information among multiple data points. However, in general, clustering-motivated losses are more difficult to optimise than ranking-motivated losses.

Mining and Weighting Non-trivial Examples
Example mining strategies are widely applied in existing methods [6], [20], [33], [47], [49], [50], [52], [54], [67], [70], [75], [79] to provide non-trivial examples for faster convergence and better performance. Mining strategies vary in different cases. For example, FaceNet [47] proposes to mine semi-hard negative samples. In N-pair-mc [52], hard negative class mining is proposed to provide informative negative examples. In Lifted Struct [54], harder negative examples are emphasised in a soft way. Given a query, Sampling Wisely [33] proposes to select misplaced images near the decision boundary of its top-K nearest neighbours. Namely, Sampling Wisely [33] prioritises data points near the top-K decision boundary. In Divide & Conquer [46], the embedding space is split into non-overlapping subspaces. Accordingly, the training data is split into clusters for training embedding functions in different subspaces. Consequently, two samples from the same subspace have smaller distances than two from different clusters, which can be interpreted as a proxy for the mining of meaningful relationships. Stochastic class-based hard example mining [55] is proposed to mine hard examples effectively.

METHODOLOGY
Our objective is to learn a discriminative embedding function f such that the similarity scores of positive pairs are higher than those of negative pairs in the feature space. We assume there exist at least two samples in each class, so that every data point can be optimised as a query. Given a query from any class, the objective is to rank its matching samples at the front of the retrieved result.

Pairwise Constraint
Inspired by the former work on pairwise similarity constraints [13], [75], we aim to pull positive examples closer than a predefined threshold (boundary). In addition, we intend to separate the positive and negative sets by a margin m. To achieve this, we choose the pairwise margin loss [75] as our basic pairwise constraint to construct the set-based similarity structure. Specifically, given an image x_i, the learning objective is to push its negative points farther than a boundary α and pull its positive ones closer than another boundary α − m. Thus m becomes the margin between the two boundaries. Mathematically,

L_m(x_i, x_j; f) = (1 − y_{ij}) [α − d_{ij}]_+ + y_{ij} [d_{ij} − (α − m)]_+,  (8)

where y_{ij} = 1 if y_i = y_j, and y_{ij} = 0 otherwise; d_{ij} = ||f(x_i) − f(x_j)||_2 is the Euclidean distance and [·]_+ is the hinge function.
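A small sketch of this pairwise margin constraint on a single pair distance (the boundary values α = 1.2 and m = 0.4 are illustrative defaults, not prescribed by the paper):

```python
def pairwise_margin_loss(d_ij, same_class, alpha=1.2, m=0.4):
    """Positives are pulled inside (alpha - m); negatives are pushed beyond alpha."""
    if same_class:                         # y_ij = 1
        return max(0.0, d_ij - (alpha - m))
    return max(0.0, alpha - d_ij)          # y_ij = 0

print(pairwise_margin_loss(0.5, True))             # 0.0: positive already close enough
print(round(pairwise_margin_loss(1.0, True), 6))   # 0.2: positive violates alpha - m = 0.8
print(round(pairwise_margin_loss(1.0, False), 6))  # 0.2: negative violates alpha = 1.2
```

Any pair whose distance already satisfies its boundary contributes exactly zero loss, which is what makes the mining step below meaningful.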

Ranked List Loss
In a mini-batch, when x_i^c is the query, we use it to retrieve the remaining data points, i.e., the gallery, according to their similarities to the query, as illustrated in Figure 1. In the retrieved list, there are N_c − 1 points in the positive set and Σ_{k≠c} N_k points in the negative set. The positive set with respect to the query x_i^c is denoted as P_{c,i} = {x_j^c | j ≠ i}, with |P_{c,i}| = N_c − 1. Similarly, we represent the negative set with respect to x_i^c as N_{c,i} = {x_j^k | k ≠ c}, with |N_{c,i}| = Σ_{k≠c} N_k.

Mining Informative Pairs
Mining informative examples is widely adopted [6], [17], [47], [52], [54], [70], [79], because it can help to speed up convergence and improve generalisation performance if properly designed. By informative examples, we mean non-trivial data points which have non-zero losses, i.e., which violate the pairwise constraint with respect to a given query. Because trivial data pairs have zero gradients, including them in training can 'weaken' the contribution of non-trivial examples during gradient fusion as the model improves [17]. Therefore, we only train on non-trivial positive and negative examples.
Concretely, for the query x_i^c, the non-trivial positive set after mining is represented as P*_{c,i} = {x_j^c | j ≠ i, d_{ij} > α − m}. Analogously, we denote the negative set after mining as N*_{c,i} = {x_j^k | k ≠ c, d_{ij} < α}.
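The mining step can be sketched as a simple filter that keeps only constraint-violating pairs (a pure-Python illustration; the `(embedding, same_class)` gallery format and boundary values are our assumptions):

```python
def mine_nontrivial(query, gallery, alpha=1.2, m=0.4):
    """Return (P*, N*): positives farther than alpha - m, negatives closer than alpha."""
    def d(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
    pos = [x for x, same in gallery if same and d(query, x) > alpha - m]
    neg = [x for x, same in gallery if not same and d(query, x) < alpha]
    return pos, neg

gallery = [([0.1, 0.0], True),   # trivial positive: already inside the boundary
           ([1.0, 0.0], True),   # non-trivial positive: violates alpha - m = 0.8
           ([0.5, 0.0], False),  # non-trivial negative: inside alpha = 1.2
           ([3.0, 0.0], False)]  # trivial negative: already far enough
pos, neg = mine_nontrivial([0.0, 0.0], gallery)
print(len(pos), len(neg))  # 1 1
```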

Weighting Negative Pairs
For each query x_i^c, there is a large number of non-trivial negative examples (N*_{c,i}) with different magnitudes of loss. To make better use of them, we propose to weight the negative examples based on their loss values, i.e., how much each negative pair violates the constraint. Our weighting strategy can be simply represented as:

w_{ij} = exp(T_n · (α − d_{ij})),  x_j^k ∈ N*_{c,i},

where T_n ≥ 0 is the temperature parameter which controls the degree (slope) of weighting negative examples. If T_n = 0, we treat all non-trivial negative examples equally. If T_n = +∞, it becomes hardest negative example mining, because the weights are normalised by their sum.
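This temperature-controlled weighting, with the sum normalisation mentioned above, can be sketched as (names and default values are illustrative):

```python
import math

def negative_weights(neg_dists, alpha=1.2, T_n=1.0):
    """Normalised w_ij = exp(T_n * (alpha - d_ij)): harder (closer) negatives weigh more."""
    raw = [math.exp(T_n * (alpha - d)) for d in neg_dists]
    total = sum(raw)
    return [w / total for w in raw]

w_uniform = negative_weights([0.3, 0.9, 1.1], T_n=0.0)   # T_n = 0: all weights equal
w_sloped  = negative_weights([0.3, 0.9, 1.1], T_n=10.0)  # large T_n: mass on the hardest
print(w_sloped[0] == max(w_sloped))  # True: the closest negative (0.3) dominates
```

As T_n grows, the normalised weights approach a one-hot vector on the hardest negative, recovering hardest-negative mining as the limiting case.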

Weighting Positive Pairs
Usually, in the settings of deep metric learning, given a query, there are quite a few matching positives in the search space. However, when multiple positive data points exist, we can also weight those positive ones according to their loss values to make better use of them. Analogously to negative example weighting, the weighting strategy for positive data pairs can be denoted as follows:

w_{ij} = exp(T_p · (d_{ij} − (α − m))),  x_j^c ∈ P*_{c,i},

where T_p is the temperature parameter which controls the degree (slope) of weighting positive examples. If T_p = 0, it treats all non-trivial positive data pairs equally. If T_p > 0, positive pairs with larger distances are emphasised. On the contrary, if T_p < 0, closer positive pairs are assigned higher weights, which is widely used to preserve the local similarity manifold structure when many positive data points exist [6], [20], [70]. The absolute value of T_p determines the degree of differentiation over positive data pairs.
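The positive-pair weighting mirrors the negative case, with the sign of T_p selecting which end of the distance range is emphasised (a sketch with illustrative values):

```python
import math

def positive_weights(pos_dists, alpha=1.2, m=0.4, T_p=1.0):
    """Normalised w_ij = exp(T_p * (d_ij - (alpha - m))) over non-trivial positives."""
    raw = [math.exp(T_p * (d - (alpha - m))) for d in pos_dists]
    total = sum(raw)
    return [w / total for w in raw]

# T_p > 0 emphasises far positives; T_p < 0 emphasises close ones.
far_first  = positive_weights([1.1, 0.9], T_p=5.0)
near_first = positive_weights([1.1, 0.9], T_p=-5.0)
print(far_first[0] > far_first[1], near_first[0] < near_first[1])  # True True
```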

Overall Optimisation Objective
In order to pull all non-trivial positive points in P*_{c,i} together and learn a class hypersphere, we minimise:

L_P(x_i^c; f) = Σ_{x_j^c ∈ P*_{c,i}} ( w_{ij} / Σ_{x_j^c ∈ P*_{c,i}} w_{ij} ) [d_{ij} − (α − m)]_+.

Fig. 3: The overall framework of our proposed ranked list loss. In one mini-batch, every image acts as the query iteratively and ranks the other images according to the similarity scores. Then, in every ranked list, we mine non-trivial data points and weight them based on their distances to the query. Finally, the ranked list loss is computed for every query.
Meanwhile, to push the informative negative points in N*_{c,i} beyond the boundary α, we minimise:

L_N(x_i^c; f) = Σ_{x_j^k ∈ N*_{c,i}} ( w_{ij} / Σ_{x_j^k ∈ N*_{c,i}} w_{ij} ) [α − d_{ij}]_+.

In RLL, we optimise the two minimisation objectives jointly:

L_RLL(x_i^c; f) = (1 − λ) L_P(x_i^c; f) + λ L_N(x_i^c; f),

where λ controls the balance between the positive and negative sets.
We treat the two objectives equally and fix λ = 0.5 without tuning in all our experiments. In this case, the positive and negative sets contribute equally. Consequently, the sample imbalance problem, i.e., that the majority of examples are negative, is addressed well.
We remark that our proposed RLL is an extension of the traditional point-based triplet loss. Given an anchor, RLL separates the positive and negative sets with a margin between them. Note that in the optimisation of every retrieved list, we exploit independent normalisation [70] to address the imbalanced numbers of positive and negative examples, which is not considered in previous ranking-motivated losses. In the ranked list of x_i^c, we regard the features of the other examples as constants. Therefore, only f(x_i^c) is updated, based on the weighted combination of the other data points, which makes the learning process more stable.
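Putting mining, weighting, and independent normalisation together, the per-query objective can be sketched as follows (this is our reading of the formulas with illustrative hyper-parameter values, not the authors' implementation; the ordering of λ and 1 − λ is our convention and is immaterial at λ = 0.5):

```python
import math

def rll_query_loss(query, gallery, alpha=1.2, m=0.4, T_n=1.0, T_p=0.0, lam=0.5):
    """L = (1 - lam) * L_P + lam * L_N, each side independently normalised."""
    def d(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
    # Boundary violations; only non-trivial pairs survive the filters.
    pos_v = [d(query, x) - (alpha - m) for x, same in gallery
             if same and d(query, x) > alpha - m]
    neg_v = [alpha - d(query, x) for x, same in gallery
             if not same and d(query, x) < alpha]

    def side(violations, T):
        """Weighted average of violations, weights normalised within this side only."""
        if not violations:
            return 0.0
        w = [math.exp(T * v) for v in violations]
        total = sum(w)
        return sum(wi / total * v for wi, v in zip(w, violations))

    return (1.0 - lam) * side(pos_v, T_p) + lam * side(neg_v, T_n)

gallery = [([1.0, 0.0], True), ([0.5, 0.0], False), ([3.0, 0.0], False)]
loss = rll_query_loss([0.0, 0.0], gallery)
print(round(loss, 6))  # 0.45
```

Because each side is normalised by its own weight sum before the λ-mix, the many negatives cannot drown out the few positives, which is the independent-normalisation point made above.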

Hypersphere Regularisation by Distance Thresholds
For each query x_i^c, we propose to make it closer to its positive set P_{c,i} than to its negative set N_{c,i} by a margin m. At the same time, we force all negative examples to be farther away than a boundary α. Consequently, we pull all samples from the same class into a hypersphere. The diameter of each class hypersphere is α − m. According to our extensive ablation studies on both large and small datasets, we find that the common practice in the literature of pulling positive data pairs as close as possible is improper. Intuitively, we should not try to suppress all intraclass variances. By pulling all intraclass examples into one hypersphere instead of one point, better generalisation performance can be obtained. Therefore, we term it hypersphere regularisation. The optimisation objectives with hypersphere regularisation are illustrated in Figure 4.

RLL-Simpler
We have introduced the full version of RLL, which includes two distance hyper-parameters α and m for defining the optimisation objectives, and two scaling parameters T_n and T_p for weighting negative and positive data points, respectively. In addition, we propose a simpler version which has only two hyper-parameters and is thus much easier to apply in practice.
Firstly, we can reformulate the optimisation objectives to use only one distance threshold. Following the widely used practice [26], [37], [53], [63], we apply an L2 normalisation layer after the final fully connected layer. As a result, the Euclidean distance between any two data points ranges from 0 to 2. Recognising a data pair as similar or dissimilar can be treated as a binary classification problem. Therefore, intuitively, we can use the hyperplane at Euclidean distance = 1 as the decision boundary. Following this intuition, we set α and m in RLL as follows: • Smallest distance between negative points: α = 1 + m/2; • Largest distance between positive points: α − m = 1 − m/2; • Margin between the positive and negative sets: m.

Fig. 4: The optimisation objectives with hypersphere regularisation. Different shapes represent different classes. For simplicity, we only show three classes, while many classes exist in practice. α − m denotes the diameter of each class hypersphere. Therefore, the distance between any two positive examples is optimised to be no greater than α − m. In addition, the distance between any two hypersphere boundaries is no less than m.
Namely, in RLL-Simpler, we aim to push the distances of negative pairs to be larger than 1 + m/2 while pulling those of positive pairs to be smaller than 1 − m/2. Secondly, training RLL on a mini-batch follows the few-shot retrieval setting, i.e., the N_c-shot C-way setting. Generally, C is much larger than N_c, e.g., C = 60 and N_c = 3, ∀c; therefore, we have many more negative points than positive ones in the retrieved list. Consequently, by default, we set T_p = 0, i.e., we do not weight positive data points in practice.
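The derived boundaries of RLL-Simpler follow directly from the margin m, symmetric around the unit decision hyperplane (the function name is ours):

```python
def rll_simpler_thresholds(m):
    """Boundaries symmetric around the unit decision hyperplane (distance = 1)."""
    neg_bound = 1 + m / 2   # alpha: negatives pushed farther than this
    pos_bound = 1 - m / 2   # alpha - m: positives pulled closer than this
    return neg_bound, pos_bound

print(rll_simpler_thresholds(0.4))  # (1.2, 0.8)
```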
In summary, in RLL-Simpler, we have only two active hyperparameters: m for defining the margin between positive and negative pairs, and T n for weighting negative examples of a query.It is a simplified version of the full version by setting T p = 0, and α = 1 + m 2 .With RLL-Simpler, our objective is to demonstrate Algorithm 1 Ranked List Loss on one mini-batch.
3: Input: the training images, the embedding function f, and the learning rate β.
4: Output: the updated f.
5: Step 1: Feed all images {x_i}, i = 1, …, N, forward through f to obtain their embeddings {f(x_i)}.
6: Step 2: Iterative retrieval and loss computation: mine the non-trivial positive set P*_{c,i} for each query.
In real-world applications, we recommend first applying RLL-Simpler, and then adopting the full version to further push the performance if needed.

Learning Deep Models with RLL
To learn deep models, we implement our RLL based on mini-batches and stochastic gradient descent. Each mini-batch is a randomly sampled subset of the whole set of training classes, and can be regarded as a mini ranking problem with a much smaller gallery, i.e., searching for the matching examples among a smaller number of classes.
A mini-batch is composed of C classes and N_c images per class (the N_c-shot C-way setting). Therefore, each mini-batch can also be treated as a few-shot retrieval task [58]. Every image x_i^c in the mini-batch acts as the query (anchor) iteratively and the other images serve as the gallery. The RLL on a mini-batch is the average of the per-query losses over all queries:

L_RLL = (1/N) Σ_c Σ_i L(x_i^c; f),

where N = Σ_c N_c is the batch size. The learning of the deep embedding function f based on RLL is illustrated in Algorithm 1.
The overall pipeline is shown in Figure 3.
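The per-batch computation above can be sketched in NumPy. This is a simplified assumption-laden illustration, not the authors' implementation: the exact mining thresholds and weighting follow Eqs. (9)-(12) of the paper, and the exponential weighting and its normalisation below are plausible simplifications.

```python
import numpy as np

def rll_minibatch(emb, labels, alpha=1.2, m=0.4, t_n=10.0):
    """Sketch of Ranked List Loss on one mini-batch of L2-normalised embeddings.

    Every embedding acts as the query in turn; the rest form the gallery,
    split into a positive set (same label) and a negative set. Non-trivial
    positives (distance > alpha - m) are pulled inside the class hypersphere;
    non-trivial negatives (distance < alpha) are pushed beyond alpha, with
    harder negatives weighted more strongly (assumed exponential weighting).
    """
    n = len(labels)
    dist = np.linalg.norm(emb[:, None] - emb[None, :], axis=2)  # O(N^2)
    total = 0.0
    for i in range(n):
        others = np.arange(n) != i
        pos = others & (labels == labels[i])
        neg = others & (labels != labels[i])
        # Non-trivial positives violate the positive constraint alpha - m.
        p_loss = np.maximum(0.0, dist[i, pos] - (alpha - m)).sum()
        # Non-trivial negatives violate the negative constraint alpha.
        n_viol = np.maximum(0.0, alpha - dist[i, neg])
        w = np.exp(t_n * n_viol)
        w = w / max(w.sum(), 1e-12)  # normalise weights over the negative set
        n_loss = (w * n_viol).sum()
        total += p_loss + n_loss
    return total / n
```

Note that with identical same-class embeddings and mutually orthogonal classes (pairwise distance √2 > α), every constraint is satisfied and the loss is zero.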

Computational Complexity
As illustrated in Algorithm 1, our proposed method does not require the input data to be prepared in any rigid format, e.g., triplets or n-pair tuplets. Instead, it takes random input images with multiclass labels. We conduct online iterative ranking and loss computation (Step 2 in Algorithm 1) after obtaining the images' embeddings (Step 1 in Algorithm 1). Therefore, the computational complexity of RLL is O(N²), which is the same as that of existing ranking-motivated structured loss functions [37], [52], [54].
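The O(N²) cost comes from computing all pairwise distances once per mini-batch. A vectorised NumPy sketch (an assumed helper, not the authors' code):

```python
import numpy as np

def pairwise_distances(emb):
    """All N x N Euclidean distances at once, via the expansion
    ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b; time and memory are O(N^2)."""
    sq = (emb ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * emb @ emb.T
    return np.sqrt(np.maximum(d2, 0.0))  # clamp tiny negatives from rounding
```

One such matrix serves all N queries of the batch, which is why no offline triplet or tuple preparation is needed.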

Exploring The Critical Learning Periods of Deep Metric Learning
In biological systems, a critical period is a time window of early post-natal development during which a temporary stimulus deficit may lead to a permanent skill impairment [11], [22], [24], [36], [39], [74]. Critical periods exist not only in biological systems, but also in artificial learning systems [1]. Recently, a study on the critical learning periods [16] of deep neural networks has attracted much attention [1]. As a result of fundamental constraints coming from learning dynamics and information processing, [1] finds that the critical early transient determines the final optimised solution. Specifically, no matter how much additional training is performed, a temporary stimulus deficit during the critical early learning period cannot be overcome later.
The early learning phase of artificial deep neural networks is under-scrutinised compared with the network's behaviour around convergence and the asymptotic properties of the optimisation process. However, it plays a key role. In [1], Fisher Information and Information Plasticity are exploited to study the early learning phase. In our work, we validate and study this critical early learning phase in deep metric learning from a novel perspective, i.e., dynamic example weighting. We present our design choices in this section and display the results in Section 5.
Concretely, during training, we revise Eq. (9) to its dynamic variant:

T = T_1 + (T_2 − T_1) · cur_iter / max_iter,

where cur_iter < max_iter; max_iter denotes the total number of training iterations, while cur_iter is the number of iterations performed so far. Therefore, instead of fixing the scaling parameter of the weighting scheme, we can study and analyse the effect of a dynamic weighting scheme on the optimisation results. The two cases we aim to explore are introduced in detail as follows, and their empirical results are presented and discussed in Section 5.
• Study of The Early Learning Phase. To validate the critical early learning phase in deep metric learning, we fix T_2 and present the results of different T_1. In this case, the early learning phase changes along with T_1. Intuitively, if the early learning phase is critical, then the test performance will be sensitive to T_1.
• Exploration of The Later Learning Phase. On the contrary, we fix T_1 and change T_2 to study the effect of the later learning phase on the final optimisation solution. Similarly, if the test performance is sensitive to T_2, then the later learning phase is crucial, and unimportant otherwise.
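One plausible instantiation of the dynamic scheme is sketched below. The exact schedule is given by the paper's revised Eq. (9); the linear interpolation between T_1 and T_2 is an assumption made for illustration.

```python
def dynamic_temperature(cur_iter, max_iter, t1, t2):
    """Scaling parameter that moves from t1 (early phase) to t2 (later
    phase) as training progresses; requires 0 <= cur_iter < max_iter."""
    assert 0 <= cur_iter < max_iter
    return t1 + (t2 - t1) * cur_iter / max_iter
```

Under this sketch, fixing t2 and varying t1 perturbs the early learning phase, while fixing t1 and varying t2 perturbs the later phase, matching the two cases above.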

Experimental Details
Some recent papers [9], [38] have raised concerns about the fairness of comparing different DML methods. For example, Lifted Struct [54] reports that the embedding dimension does not play a crucial role; accordingly, the embedding size differs across subsequent papers [10], [26], [52], [53], [54]. However, it has recently been noticed that the embedding size has a huge impact on performance [9], [38]. In this work, to make our comparison as fair as possible, we present all our implementation details as follows:

Datasets
We conduct experiments on four popular benchmarks. (1) SOP [54] contains 120,053 images of 22,634 online products sold on eBay.com; 59,551 images of 11,318 categories and 60,502 images of 11,316 categories are used for training and testing, respectively. The train/test split and evaluation protocol are the same as in [54]. (2) In-shop Clothes [32] contains 7,982 classes and 52,712 images in total. It is split into a training set and a testing set: the training data contains 25,882 images of 3,997 classes; the testing set includes 3,985 classes, with 14,218 query images and 12,612 gallery images to search. (3) CUB-200-2011 [25] has 11,788 images of 200 bird species; 5,864 images of the first 100 classes are used for training and 5,924 images of the other 100 classes for testing. (4) CARS196 [61] contains 16,185 images of 196 car models; we use the first 98 classes (8,054 images) for training and the remaining 98 classes (8,131 images) for testing. On all datasets, our method is evaluated on the original images, i.e., without using the bounding box information.

Data Augmentation
For fair comparisons, we follow the practice of [68] for data augmentation. Concretely, we (1) warp the original image to a size of 256 × 256; (2) take a crop of random size (default: 0.5 to 1.0 of the resized image)2 with a random aspect ratio (default: 3/4 to 4/3); (3) resize the crop to 227 × 227 and horizontally flip it with a probability of 0.5. At the testing stage, we only use a single centre crop without mirroring. We remark that the performance on the large datasets (i.e., SOP and In-shop Clothes) changes only slightly when the data augmentation differs, while the small datasets CUB-200-2011 and CARS196 are very sensitive to small changes.
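Steps (2)-(3) above can be sketched in pure NumPy. This is a hedged re-implementation for illustration (nearest-neighbour resizing for brevity); the actual pipeline follows [68].

```python
import numpy as np

def random_resized_crop(img, out=227, scale=(0.5, 1.0), ratio=(3/4, 4/3), rng=None):
    """Sample a crop covering 50-100% of the image area with aspect ratio
    in [3/4, 4/3], resize it to out x out, and flip horizontally with p=0.5."""
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    for _ in range(10):  # retry until the sampled crop fits inside the image
        area = rng.uniform(*scale) * h * w
        ar = rng.uniform(*ratio)
        ch, cw = int(round((area / ar) ** 0.5)), int(round((area * ar) ** 0.5))
        if 0 < ch <= h and 0 < cw <= w:
            top = int(rng.integers(0, h - ch + 1))
            left = int(rng.integers(0, w - cw + 1))
            crop = img[top:top + ch, left:left + cw]
            break
    else:  # fallback: use the whole image
        crop, ch, cw = img, h, w
    # Nearest-neighbour resize to out x out.
    ys = np.arange(out) * ch // out
    xs = np.arange(out) * cw // out
    crop = crop[ys][:, xs]
    if rng.random() < 0.5:  # horizontal flip with probability 0.5
        crop = crop[:, ::-1]
    return crop
```

At test time the sketch would be replaced by a single deterministic centre crop without mirroring, as described above.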

Backbone and Initialisation Details
We use GoogLeNet V2 [21] as our backbone network, and compare with prior results using the same backbone. In the original network, there are three fully connected layers at different depths. We refer to them by their relative locations: L for the low-level layer (inception-3c/output), M for the mid-level layer (inception-4e/output) and H for the high-level layer (inception-5b/output). By exploiting them, we study representations of different depths, which is valuable for deploying a model in practice. Following [26], [37], [53], the model pretrained on ImageNet [45] is used for initialisation in our experiments. The three original 1000-neuron fully connected layers, followed by the softmax layer and cross-entropy loss, are replaced by three new fully connected layers followed by an L2 normalisation layer and our proposed ranked list loss. The new layers are randomly initialised and optimised with a 10 times larger learning rate than the others for faster convergence.
2. We change the random crop size from the range [0.16, 1.0] used in [68] to [0.5, 1.0], as we do not want a crop containing no object of interest at all.

Following the concurrent work Multi-Simi [68], and the follow-up works SoftTriple [42] and Reality Check [38], we freeze the BatchNorm layers during training to fairly compare with their reported results on CUB-200-2011 and CARS196.

Embedding and The Embedding Size
We use RLL-H to denote the single high-level embedding; RLL-M and RLL-L are analogous. For an exactly fair comparison with other baselines, we look at the results of RLL-H. Generally, there are two ways to increase the embedding size: (1) increasing the size of a single embedding; (2) concatenating multiple embeddings. Empirically, we find that our method is less sensitive to increasing the size of a single embedding, while it performs much better when concatenating multiple embeddings.
On all datasets, we follow the most common corresponding setting and set the embedding size accordingly, to fairly compare with as many existing approaches as possible. Finally, we study the impact of the embedding size on the largest dataset, SOP, in Section 4.8.2.

Performance Metric
Following [54], we report the image retrieval performance measured by Recall@K. We do not report the image clustering quality NMI [48], because NMI is not a proper metric for fine-grained object recognition with a tremendous number of classes and only several images per class. A similar argument is presented in detail in [38].
Indeed, all reported results are validation performance. Theoretically, the validation data follows the same distribution as the test data, and the test data is always unknown before deployment for practical use. Therefore, it is reasonable to report the validation performance as long as the validation data is only used for model selection, without being used for training.

Training and Optimisation Settings
The standard stochastic gradient descent (SGD) optimiser is used with a momentum of 0.9 and a weight decay rate of 2e-5. We set the base learning rate to 1e-2. In each mini-batch, we randomly sample C classes and K images per class. We use a single Tesla V100 to train on the large dataset SOP, and a single GTX 1080 Ti on the other datasets. To leverage the computational resources, we set C = 60, K = 3 on SOP and C = 22, K = 3 on the other three datasets. Thus N_c = K = 3, ∀c; the batch size N = 180 on SOP and N = 66 on the others.
We do not manually optimise the hyper-parameters extensively: we use RLL-Simpler to compare with existing methods, so that we only need to optimise m and T_n. In addition, we simply fix T_n throughout the training process. We only explore the dynamic weighting scheme when studying the critical learning periods in DML.
In the ablation study, we use the full version of RLL for a more comprehensive analysis. Moreover, in those experiments, we use the simpler data augmentation for faster convergence, since our focus is analysing the key components rather than comparing with existing methods. Specifically, we crop a random 227 × 227 patch from the resized image (256 × 256) and horizontally flip the crop with a probability of 0.5.
For the ease of reproducing our results, our source code and the training scripts for all datasets are publicly available online: https://github.com/XinshaoAmosWang/Ranked-List-Loss-for-DML.

TABLE 1: Comparison with the state-of-the-art methods on SOP. XBM [72] is marked with '*' because it exploits information across mini-batch tasks. The '-' denotes that the corresponding results are not reported in the original paper. The embedding dimension of different approaches is shown for the sake of clarity.

Comparison with Recent Baselines
Competitors. We compare our method with the following methods, which are implemented and tested under the same settings: Triplet Semihard, Lifted Struct, N-pair-mc, Struct Clust, Spectral Clust, and Proxy NCA. These methods have been described in Section 2, except for Triplet Semihard [47], which mines semihard negative examples to improve the conventional triplet loss and is reimplemented in Struct Clust [53] with GoogLeNet V2. HTL [10] exploits the hierarchical similarity structure among different classes and merges similar classes recursively, i.e., it builds a global class-level hierarchical tree using all original classes as leaves and updates the tree after every epoch. HTL [10] is computationally expensive and unscalable to extremely large datasets. Additionally, SoftMax norm and SoftTriple [42] are theoretically non-scalable to extremely large datasets because they use multiple proxies to represent one class. XBM [72] exploits extra information across mini-batch tasks. We do not compare with ensemble models [23], [40], [76], [79] or some other methods [8].

TABLE 3: Comparison with the state-of-the-art methods on CARS196 in terms of Recall@K (%). For an exact comparison, all the reported results have an embedding size of 64.
Result analysis. The comparisons between our method and existing competitors on the four datasets are presented in Tables 1, 2, 3, and 4, respectively. All other reported approaches use the high-level embedding. From Tables 1, 2, 3, and 4, we have the following observations:
• On the two larger datasets, RLL-Simpler achieves state-of-the-art performance, as shown in Tables 1 and 2.
• On the two smaller datasets, except for the theoretically non-scalable SoftMax norm and SoftTriple, only Multi-Simi is better than ours. However, the weighting scheme of Multi-Simi considers multiple metrics and is thus more complex than ours.
We remark that: (1) in the general setting of metric learning, the training classes are disjoint from the testing classes, which means the training set can be infinitely large; therefore, larger datasets, e.g., SOP and In-shop Clothes, are better to test on; (2) CARS196 and CUB-200-2011 are significantly sensitive to minor changes of the training details, as mentioned in [38], [42], [68].

On the network depth of an embedding function
In this subsection, we study the network depth of an embedding function and the concatenation of embedding functions. The results on SOP and In-shop Clothes are displayed in Tables 6 and 7, respectively. RLL-Simpler-(L,M,H) denotes the multi-level embedding obtained by concatenating the low-level, mid-level and high-level embeddings; others are analogous. We study and report them because, from a practical perspective, RLL-Simpler-L and RLL-Simpler-M have smaller network depth while performing similarly to, or even better than, RLL-Simpler-H. Firstly, this indicates that very deep networks are not necessarily better in practice. This information is of great value in applications where smaller networks with faster computation are preferred. Secondly, this motivates us to exploit multi-layer embeddings well. All of them are discriminative, and they exploit non-identical knowledge to measure the distance between data points since they are of different depths. We obtain promising performance by simply concatenating them. Therefore, an open lead for future research is to better exploit multi-layer embeddings in deep metric learning, without the need to train multiple learners [23], [40], [76], [79].

Mining Non-trivial Pairs
As presented in Section 3.2, for each query, RLL mines examples which violate the pairwise constraint with respect to the query. Specifically, we mine negative examples whose distance is smaller than α in Eq. (12). Simultaneously, we mine positive examples whose distance is larger than α − m in Eq. (11). As a result, what examples are mined is determined by α and m.

Impact of α. To study the impact of α, we set the temperature T_n = 10 and the margin m = 0.4 in all experiments, and conduct the experiments on the large dataset SOP; regarding m, we study it in Section 4.7. The results are presented in Table 8. We observe that a proper negative constraint α is important for RLL to learn discriminative embeddings. This is consistent with our intuition, as α controls how much the negative examples are pushed away.

Weighting Negative Pairs
In this section, we conduct experiments to evaluate the influence of T_n for weighting negative examples in Eq. (9). We fix m = 0.4 and α = 1.2 in all experiments. The temperature parameter T_n (T_n ≥ 0) controls the slope of the weighting. The results are presented in Table 9. We observe that:
• When T_n = 0, RLL treats all non-trivial negative examples equally, i.e., no weighting is applied. The Recall@1 result is 78.8%, which is only 1% lower than the best result using proper weighting. This demonstrates the superiority of RLL even without weighting.

• The performance changes only slightly as T_n does: the performance gap is around 1% when T_n ranges from 0 to 20. In addition, the performance drops when T_n is too large. This may be because 'very' hard examples (e.g., outliers) exist in the training data [6], [47], [70].
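The slope behaviour of T_n can be seen in a small sketch. The exponential form below is an assumption consistent with the weighting scheme discussed around Eq. (9), not a verbatim transcription of it:

```python
import numpy as np

def negative_weights(dists, alpha=1.2, t_n=10.0):
    """Normalised weights over the mined negatives (distance < alpha).

    Larger t_n puts more emphasis on harder negatives (larger violation
    alpha - d); t_n = 0 weights all mined negatives equally."""
    viol = np.maximum(0.0, alpha - np.asarray(dists, dtype=float))
    w = np.where(viol > 0, np.exp(t_n * viol), 0.0)  # unmined negatives get 0
    s = w.sum()
    return w / s if s > 0 else w
```

Setting t_n = 0 recovers the no-weighting case discussed above, while very large t_n concentrates nearly all weight on the single hardest negative, which is where outliers can dominate.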

Weighting Positive Pairs
In most cases, where only a few positive data points exist, we only weight negative examples. For example, when K = 3, there are only two positives for every query, so we treat them equally. We have tried differentiating them, but the performance difference is negligible. However, when multiple positives exist, it is natural to ask whether weighting positive examples helps. Therefore, following [70], we empirically study weighting positive examples when there are many positive data points in a retrieved list. While fixing the batch size at 180, we study two different cases: K = 6 and K = 12. Given a query, there are 5 and 11 positive instances out of 179 in its retrieved list, respectively. In each case, we choose 5 different values of T_p: 10, 5, 0, -5, -10. We remark that: 1) T_p > 0 denotes that harder positives with a larger distance are emphasised, while T_p < 0 denotes that easier positives with a smaller distance are focused on; 2) when the absolute value of T_p is larger, e.g., 10 or -10, the relative weight between two instances is larger, so the differentiation becomes more significant. The results are shown in Table 10 and we discuss them as follows:
• When emphasising easier positive examples (T_p < 0) or without weighting (T_p = 0), the performance is similar.
• When focusing on harder positive examples, the performance decreases as T p increases.

• The results are generally consistent with [6], [70], which find that emphasising harder positive examples cannot preserve the intraclass similarity manifold structure, thus reducing the generalisation performance.
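The sign convention of T_p can be illustrated with a sketch analogous to the negative weighting; the exponential form is again an assumption made for illustration:

```python
import numpy as np

def positive_weights(dists, t_p):
    """Normalised weights over the positives of a query.

    t_p > 0 emphasises harder positives (larger distance), t_p < 0
    emphasises easier ones (smaller distance), and t_p = 0 is uniform."""
    w = np.exp(t_p * np.asarray(dists, dtype=float))
    return w / w.sum()
```

Under this sketch, a large |T_p| makes the weight ratio between any two positives grow exponentially with their distance gap, which is why the differentiation becomes more significant.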

Hypersphere Regularisation
The hypersphere diameter is an indicator of the intraclass variance.
To study its impact, we fix α = 1.2 and T_n = 10 while changing m. For a more comprehensive exploration, we try two different settings: C = 60, K = 3 and C = 30, K = 6. We conduct the experiments on the SOP dataset and present the results in Table 11.
We have two important observations:
• When m = α = 1.2, the diameter is 0, which means positive pairs are pulled as close as possible; this has the same effect as the conventional contrastive loss. In this case, Recall@1 is considerably worse than the best. In particular, when C = 30, K = 6, without hypersphere regularisation the Recall@1 is only 76.4%, while the best Recall@1 with strong regularisation is 79.1%.

• In both settings, we obtain almost the best performance when the hypersphere diameter is 0.8. In the case where C = 30, K = 6, the Recall@K results are more sensitive to the hypersphere diameter.
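The geometry induced by α and m can be summarised in a tiny sketch, following the reading of the two thresholds in Fig. 4:

```python
def hypersphere_geometry(alpha, m):
    """Each class occupies a hypersphere of diameter alpha - m, and the
    boundaries of any two class hyperspheres are separated by at least m.
    Setting m = alpha collapses each class to a point, i.e., the maximal
    intraclass compression of a conventional contrastive loss."""
    assert 0 < m <= alpha
    return {"diameter": alpha - m, "boundary_gap": m}
```

With the ablation's best setting (α = 1.2, m = 0.4) the diameter is 0.8, which is exactly the regime where Table 11 reports nearly the best Recall@K.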

Ablation Study on Other General Factors
In this subsection, we present our empirical study on other method-independent factors of deep metric learning: 1) Batch size; 2) Embedding size; 3) Batch content.

The Impact of Batch Size
The batch size is usually important in deep metric learning. During training, we follow the few-shot retrieval setting, so the batch size determines the scale of the problem we solve in every iteration. We conduct experiments on SOP to evaluate the influence of the batch size in our approach. Specifically, we fix the number of images per class (∀c, N_c = K = 3) and only change the number of classes (C ∈ {40, 50, 55, 60, 65}) in each mini-batch. The results are reported in Table 12. We can see that RLL is not very sensitive to the batch size.

TABLE 14: The results of different batch content C × K on the SOP. We fix N = C × K = 180 and change C and K.

The Impact of Embedding Size
The feature dimension is another considerable factor when learning deep representations for downstream tasks. Generally, the objective is to encode an input into a low-dimensional feature vector so that the storage and computational cost can be reduced on downstream tasks, e.g., fast image retrieval [78], [80]. Therefore, in this subsection, we study the influence of the embedding size in our RLL. In all experiments, we set C = 60, K = 3, α = 1.2, m = 0.4, T_n = 10. The results on SOP are displayed in Table 13. We can see that, generally, a larger embedding size leads to better performance, although eventually the performance increase becomes negligible. Therefore, given limited storage and the requirement for faster computation in practice, we can choose a smaller encoding size.

The Impact of Batch Content
In this subsection, we study the format of the few-shot retrieval task in every iteration, i.e., the number of classes C and the number of images per class K. To isolate the impact of the batch size, we fix N = C × K = 180 and change C and K at the same time. We set α = 1.2, m = 0.4, T_n = 10 in all experiments. We display the results of RLL-(L,M,H) in Table 14. We observe that when there are more classes and fewer images per class, i.e., when a task becomes more difficult, we obtain better generalisation performance.
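The C-way K-shot batch composition studied here can be sketched as a sampler. This is an assumed implementation for illustration, not the authors' code:

```python
import numpy as np

def sample_episode(labels, c, k, rng=None):
    """Indices for one few-shot retrieval mini-batch: c classes sampled
    without replacement, then k images per sampled class, so N = c * k."""
    rng = rng or np.random.default_rng()
    labels = np.asarray(labels)
    classes = rng.choice(np.unique(labels), size=c, replace=False)
    batch = []
    for cls in classes:
        idx = np.flatnonzero(labels == cls)
        batch.extend(rng.choice(idx, size=k, replace=False))
    return np.array(batch)
```

Fixing c * k while trading c against k reproduces the batch-content experiments above: larger c with smaller k yields more classes per episode and hence a harder retrieval task.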

Qualitative results
Visualisation of Image Retrieval. In Figure 5, we visualise the image retrieval results on the SOP test dataset. For every query, we show its top 4 images in the ranked list of the gallery set. We observe that the learned embedding model is robust and invariant to rotation and viewpoint. Visualisation of Image Clustering. We visualise the image clustering results on the SOP test dataset in Figure 6.

STUDY ON THE CRITICAL LEARNING PERIODS OF DEEP METRIC LEARNING
We discussed in Section 3.7 that the early learning period is critical in artificial deep neural networks. In this section, we study and validate this empirically in the context of deep metric learning, from the perspective of example weighting. Our results on the In-shop Clothes and CUB-200-2011 datasets are displayed in Table 15. In these experiments, the batch size is 60. The number of training iterations is 30,000 on In-shop Clothes and 10,000 on CUB-200-2011. A significant observation is that T_1 is a more sensitive factor than T_2: the performance variance across different values of T_1 is much larger than that across different values of T_2. Therefore, critical learning periods of deep metric learning also exist in the early learning phase. This is interesting and suggests that more effort should be spent on the design of the early learning phase.

CONCLUSION
In this paper, the ranked list loss is proposed to exploit all non-trivial data points in order to provide more informative supervision for learning discriminative embeddings. Following up on our CVPR 2019 conference version, we further improve RLL to be a general extension of ranking-motivated losses. Concretely, given a query, RLL splits the other data points into positive and negative sets, and forces a margin between them. In addition, example mining and weighting are exploited to leverage all informative data points. Our proposed RLL achieves state-of-the-art performance on two large datasets. Furthermore, we present many other interesting results, which are of high practical value and open leads for future research: (1) the results of RLL-L and RLL-M are highly competitive, and even better, although their network depth is shallower; (2) how to better exploit multi-layer embeddings in deep metric learning; (3) how to better design the early learning phase of deep metric learning, since it is the critical learning stage.

Fig. 1 :
Fig. 1: Illustration of our proposed Ranked List Loss (RLL). Given a query and its ranked list, RLL aims to make the query closer to the positive set than to the negative set by a margin m. Circle and triangle represent two different classes. The blue circle is a query. The yellow shapes represent non-trivial examples while the red shapes represent trivial examples. Each arrow indicates the query's gradient direction determined by the corresponding non-trivial example. The final gradient direction of the query is a weighted combination of them. The optimisation target for every list is shown at the bottom. Best viewed in colour.

Fig. 2 :
Fig. 2: Illustration of different ranking-motivated structured losses. Different shapes (circle, triangle and square) represent different classes. For simplicity, only 3 classes are shown. The purple circle is an anchor (query). In triplet [47], the anchor is compared with only one negative example and one positive example. In N-pair-mc [52], Proxy-NCA [37] and Lifted Struct [54], one positive example and multiple negative classes are incorporated. N-pair-mc randomly selects one example per negative class. Proxy NCA pushes the anchor away from negative proxies instead of negative examples. The proxy is class-level and can represent any instance in the corresponding class. Lifted Struct uses all examples from all negative classes. On the contrary, our proposed Ranked List Loss not only exploits all negative examples, but also makes use of all positive ones. Best viewed in colour.

Fig. 5 :
Fig. 5: Visualisation of image retrieval on the SOP test dataset. The leftmost column shows the queries; each query ranks the images in the gallery according to their similarity.

Fig. 6 :
Fig. 6: t-SNE visualisation [60] on the SOP test dataset. Best viewed on a monitor when zoomed in.

1: Mini-Batch Settings: the batch size N, the number of classes C, the number of images per class N_c.
2: Parameters: the distance constraint α on negative points, the margin m between positive and negative examples, the weighting temperatures T_n, T_p.
3:

TABLE 2 :
Comparison with the state-of-the-art methods on the In-shop Clothes dataset. XBM [72] is marked with '*' because it exploits information across mini-batch tasks.

TABLE 4 :
Comparison with the state-of-the-art methods on CUB-200-2011 in terms of Recall@K (%). All the displayed results have an embedding size of 64 for an exact comparison.

TABLE 5 :
Comparison with the state-of-the-art methods on SOP with the ResNet-50 backbone. The embedding dimension of different approaches is shown for the sake of clarity.

TABLE 6 :
Exploration on the network depth of an embedding function and the concatenation of embedding functions on SOP.

TABLE 7 :
Exploration on the network depth of an embedding function and the concatenation of embedding functions on In-shop Clothes dataset.

TABLE 8 :
The impact of α on the SOP. In all experiments, m = 0.4, T_n = 10.

TABLE 9 :
Results of different T_n on the SOP in terms of Recall@K (%). We fix m = 0.4, α = 1.2 in all experiments.

TABLE 10 :
Results of weighting positive examples on the SOP. We fix the other parameters, i.e., m = 0.4, α = 1.2 and T_n = 10.

TABLE 11 :
The impact of the hypersphere diameter for preserving the intraclass variance on the SOP. We set α = 1.2, T_n = 10 in all experiments.

TABLE 12 :
The results of different batch size on the SOP.

TABLE 13 :
The results of RLL-(L,M,H) with different embedding size on the SOP test dataset.

TABLE 15 :
Study on the dynamic weighting scheme of negative pairs on the In-shop Clothes and CUB-200-2011 datasets. We report Recall@1 (%).
Neil M. Robertson is Professor and Director of the Centre for Data Sciences and Scalable Computing at Queen's University Belfast, UK. He researches underpinning machine learning methods for visual analytics. His principal research focus is face and activity recognition in video. He started his career in the UK Scientific Civil Service with DERA (2000-2002) and QinetiQ (2002-2007). Neil was the 1851 Royal Commission Fellow at Oxford University (2003-2006) in the Robotics Research Group. His autonomous systems, defence and security research is extensive, including UK major research programmes and doctoral training centres.