
Similarity learning, which is useful for comparing various characteristics of images in the computer vision field, has often been applied to deep metric learning (DML), and many combinations of pairwise similarity metrics, such as Euclidean distance and cosine similarity, have been studied actively. However, such a local similarity-based approach can become a bottleneck for retrieval tasks, in which the global characteristics of images must be considered. Therefore, this paper proposes a new similarity metric structure that considers not only local similarity but also a global characteristic of the representation space, i.e., class variability. In addition, based on the insight that better class variability analysis can be accomplished on the Stiefel (or Riemannian) manifold, manifold geometry is employed to generate the class variability information. Finally, we show theoretically and experimentally that the proposed method, designed through an in-depth analysis of the generalization bound of DML, outperforms conventional DML methods.

INDEX TERMS Deep metric learning, image retrieval, Stiefel manifold, non-linear mapping.


I. INTRODUCTION
Deep metric learning (DML) is a learning method that increases intra-class compactness and inter-class variability by quantifying the intrinsic or extrinsic relationships between images. Since existing DML methods [14,37] are based on similarity metrics that can successfully encode images of various attributes, they have been widely applied to product search [32], face verification [37], perturbation analysis [30], etc.
The similarity metric is mainly defined using the pairwise sample-based Euclidean distance or cosine similarity [37,23]. Among the popular triplet-based methods [37], some used mining [15] for complexity reduction, and others [39] adopted multiple samples with a margin. Even an episode-based learning scheme with meta data [65] was grafted onto [37]. Recently, similarity metrics [23,43] that apply cosine similarity (or the dot product) to the softplus function have received a lot of attention owing to their high performance. Also, some methods [41,34] tried to exploit global features of the embedding space by using a histogram or beta function. To further improve performance, [31,44,48] suggested ensembles of the above-mentioned similarity metrics at the feature or network level.
On the other hand, class variability, one of the core properties of DML, can be analyzed through discriminant analysis (DA) [2]. Basically, the goal of DA is to find a projection component that maximizes the ratio of inter-class variability to intra-class compactness, and it has mainly been used for dimension reduction and discriminative eigenpair analysis, where an eigenpair denotes a tuple of an eigenvalue and its eigenvector. The performance of DA is guaranteed even on manifolds [29], and the association of DA with pairwise sample-based DML has already been verified [1].
Note that most DML studies have considered only local similarity based on pairwise connections. Since DML must improve retrieval performance, global features such as semantic representations need to be reflected in the similarity metric [13,66]. In addition, since existing DML methods defined the embedding space produced by convolutional neural networks (CNNs) as a vector space, they were limited in reflecting nonlinear characteristics such as multi-variate covariance (see Sec. II.B for detailed motivation).
To overcome the drawbacks of conventional DML, we come up with two ideas: 1) Starting from a pairwise similarity metric that reflects local characteristics, we also reflect a global characteristic of the representation space, i.e., class variability. 2) We further consider nonlinear characteristics by analyzing the class variability in a low-dimensional space (LDS) derived from the embedding space. Class variability is observed through the global characteristics of samples corresponding to discrete labels [45]. This global property is space dependent, which justifies analyzing class variability from the manifold and the global characteristics of the space.
As in Fig. 1, we compared the silhouette scores [64] of a few representation spaces to choose an appropriate one. The silhouette score is a metric that shows the degree of partitioning of clustered data and can quantify the learning tendency of retrieval tasks. The silhouette scores were highest for the Stiefel space, followed by the Riemannian and Euclidean spaces. This result shows that a compact representation in matrix form enables discriminative class variability analysis. In this experiment, the Euclidean space and the manifolds employed PCA and local linear embedding [3], respectively, for low-dimensional embedding. On the other hand, [17] reported that a matrix representation of the embedding space is more useful than a vector representation for learning the intrinsic characteristics of images. This report supports the analysis of Fig. 1.
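As a concrete illustration of this comparison, the sketch below contrasts silhouette scores for PCA-, LLE-, and length-normalized ("Stiefel-style") embeddings. The synthetic blobs and all dimensions are stand-ins for the paper's CNN embeddings, not the authors' exact setup.

```python
# Silhouette comparison across representation spaces (toy data, illustrative).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.metrics import silhouette_score

X, y = make_blobs(n_samples=600, centers=8, n_features=512, random_state=0)

# Euclidean space: linear low-dimensional embedding via PCA.
z_euc = PCA(n_components=16).fit_transform(X)

# Manifold: nonlinear low-dimensional embedding via local linear embedding (LLE).
z_man = LocallyLinearEmbedding(n_components=16, n_neighbors=20).fit_transform(X)

# A Stiefel-style representation: length-normalized points (unit-norm constraint).
z_st = z_man / np.linalg.norm(z_man, axis=1, keepdims=True)

for name, z in [("Euclidean+PCA", z_euc), ("Manifold+LLE", z_man),
                ("Stiefel (normalized)", z_st)]:
    print(name, silhouette_score(z, y))
```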
Therefore, this paper presents a new metric loss for DML. First, we derive a class variability factor by using manifold sampling [55] and the eigenpair obtained through DA on a nonlinear manifold such as the Stiefel manifold. Next, we map this factor onto a one-dimensional line manifold so that it can be associated with a 1D similarity metric such as Euclidean distance, and we define the projected factor as the geodesic factor (GF) (see Sec. III.A). Finally, we present several geodesic metric losses (GMLs) based on a locality-perspective metric and the GF (see Sec. III.B). Note that GF can be interpreted from the following two points of view.
• GF, which analyzes class variability on the LDS, has the same goal as subspace clustering [54]. In other words, we can regard GF as explicitly grafting the subspace clustering paradigm onto the retrieval task.
• GF plays the role of self-supervision that allows the embedding layer to further reflect class variability during learning. Note that GF is not used in testing; only the embedding layer, which has learned the class variability of the data distribution, is used in testing.
This paper is organized as follows. Sec. II describes our motivation from a technical point of view, and then previews preliminary knowledge for the proposed method. Sec. III depicts the design procedure of GF. Sec. IV evaluates the performance of the proposed method qualitatively and quantitatively.

II. MOTIVATION AND PRELIMINARIES

A. Notation
Let $z = f(x) \in \mathbb{R}^d$ be the d-dimensional (embedding) vector of an image $x \in X$ with corresponding class label $y \in Y$, where $f$ is a CNN-based embedding network [16,18]. Also, let the set of embedding vectors and the number of classes be $Z = \{z_i\}_{i=1}^{N}$ and $C$, where $N$ is the minibatch size. Linear DA (LDA) is used to find the d-dimensional optimal eigenvectors $\{v_i\}_{i=1}^{d}$ that maximize the ratio of inter-class variability to intra-class compactness. LDA can be reinterpreted as finding the optimal eigenmatrix $V^* \in \mathbb{R}^{d \times d}$ from $S_b$ (inter-class covariance), $S_t$ (total covariance), and $S_w$ (intra-class covariance) [2]. The eigenpairs of $V^*$, i.e., $\{(\lambda_i, v_i)\}_{i=1}^{d}$, can be found using the eigenvalue equation, with $\lambda_1 \ge \cdots \ge \lambda_d$. The eigenvalue solver performs the forward/backward pass operations (e.g., DA and gradient calculation) during the end-to-end learning of LDA (SDA in Fig. 2).
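A minimal numerical sketch of the LDA eigenpair computation described above, assuming embeddings `Z` and integer labels `y`; the ridge term and all variable names are ours, not the paper's.

```python
# LDA eigenpairs via the generalized eigenproblem Sb v = lambda Sw v.
import numpy as np
from scipy.linalg import eigh

def lda_eigenpairs(Z, y):
    mu, d = Z.mean(axis=0), Z.shape[1]
    Sw = np.zeros((d, d))  # intra-class covariance
    Sb = np.zeros((d, d))  # inter-class covariance
    for c in np.unique(y):
        Zc = Z[y == c]
        mc = Zc.mean(axis=0)
        Sw += (Zc - mc).T @ (Zc - mc)
        Sb += len(Zc) * np.outer(mc - mu, mc - mu)
    # eigh solves the generalized symmetric problem, eigenvalues ascending;
    # a small ridge keeps Sw invertible on small minibatches (our choice).
    lam, V = eigh(Sb, Sw + 1e-6 * np.eye(d))
    return lam[::-1], V[:, ::-1]   # reorder so lambda_1 >= ... >= lambda_d

lam, V = lda_eigenpairs(np.random.randn(128, 8), np.random.randint(0, 4, 128))
print(lam[:3])
```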

B. Motivation
As a representative similarity metric, the Mahalanobis distance between $z_i$ and $z_j$ is defined by

$d_M(z_i, z_j) = \sqrt{(z_i - z_j)^\top \Sigma^{-1} (z_i - z_j)}. \quad (1)$

If the covariance matrix $\Sigma$ is $I$, Eq. (1) becomes the Euclidean distance $d(z_i, z_j) = \|z_i - z_j\|_2$, and $\Sigma^{-1}$ can be approximated by a scale term based on the eigenpair $(\lambda, v)$ [4]. Early metric learning techniques [9,45] treated $z$ as raw data and focused on projecting the data into a discriminative LDS through a covariance matrix. On the other hand, the latest metric learning techniques [37,23] could handle the retrieval task with only the local linearity of the embedding vector under the assumption $\Sigma^{-1} = I$, thanks to the feature analysis capability of CNNs. However, these previous approaches, which seldom took the global information of features (or data) and nonlinearity into account when designing the metric, showed limitations in improving performance on fine-grained retrieval tasks. [36] designed the scale term of Eq. (1) using an orthogonal layer and a scale layer on the Stiefel manifold. Since the two layers were designed only for scale invariance, eigenpairs could not be dealt with explicitly, which indicates that global characteristics such as class variability could not be reflected properly. Therefore, we intend to design a GF that can reflect the spectrum of eigenpairs, i.e., global information, on a nonlinear space.
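To make Eq. (1) concrete, the following snippet verifies that the Mahalanobis distance with $\Sigma = I$ reduces to the Euclidean distance and shows the eigenpair-based form of $\Sigma^{-1}$; the data is random and for illustration only.

```python
# Mahalanobis distance of Eq. (1) and its eigenpair-based scale term.
import numpy as np

def mahalanobis(zi, zj, Sigma_inv):
    diff = zi - zj
    return np.sqrt(diff @ Sigma_inv @ diff)

zi, zj = np.random.randn(8), np.random.randn(8)
print(mahalanobis(zi, zj, np.eye(8)))   # Sigma = I: equals the Euclidean distance
print(np.linalg.norm(zi - zj))          # same value

# Sigma^{-1} through its eigenpairs (lam_k, v_k): sum_k (1/lam_k) v_k v_k^T,
# i.e., a per-direction scale term as noted in [4].
A = np.random.randn(8, 8)
Sigma = A @ A.T + np.eye(8)
lam, V = np.linalg.eigh(Sigma)
Sigma_inv = (V / lam) @ V.T             # V diag(1/lam) V^T
print(mahalanobis(zi, zj, Sigma_inv))
```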

C. Geometry of Matrix Manifold
Riemannian Manifold. A Riemannian manifold $(\mathcal{M}, g)$ is a smooth manifold $\mathcal{M}$ equipped with a Riemannian metric $g$, where $g$ is defined as an inner product on each tangent space $T_x\mathcal{M}$. The Stiefel manifold $\mathrm{St}(p, d) = \{W \in \mathbb{R}^{d \times p} : W^\top W = I_p\}$ is a Riemannian manifold of orthonormal $p$-frames. Also, the Lie group $\mathrm{SO}(d)$, which operates on the Stiefel manifold through a group action such as matrix multiplication, is defined by $\mathrm{SO}(d) = \{O \in \mathbb{R}^{d \times d} : O^\top O = I_d, \det(O) = 1\}$. Then, the group action between $O \in \mathrm{SO}(d)$ and $W \in \mathrm{St}(p, d)$ is expressed as $OW \in \mathrm{St}(p, d)$. As a result, the concept of this group action is used to design a retraction map $R$ or lifting map $R^{-1}$ on the Stiefel manifold (Fig. 2) [7].
Usually, (length) normalization [46] is applied to the embedding vectors to meet the Stiefel constraint, and orthogonal initialization is applied to the other trainable parameters.
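A small PyTorch sketch of this convention, assuming a linear embedding layer; it mirrors the length normalization and orthogonal initialization mentioned above, not the authors' exact code.

```python
# Meeting the Stiefel constraint in practice: orthogonal init + unit-norm outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

embedding = nn.Linear(2048, 512, bias=False)
nn.init.orthogonal_(embedding.weight)       # orthogonal init of trainable params

feats = torch.randn(32, 2048)               # backbone features (illustrative)
z = F.normalize(embedding(feats), dim=1)    # length normalization: ||z|| = 1
print(z.norm(dim=1)[:4])                    # ~1.0 for every sample
```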

III. PROPOSED METHOD

A. Geodesic Factor (GF) Generation
The process of generating GF consists of three steps and begins just after the feature extraction process, as in Fig. 2. We assume the Stiefel manifold as the LDS. Adaptive LDS searching. The first step uses a specific dimension reduction tool [3] that maps the embedding space to a low-dimensional representation space (① in Fig. 2). The data-adaptive dimension reduction tools in [3] are simple but powerful projection methods that generalize even to out-of-distribution (OOD) samples. Then, the pairwise inner product (PIP) criterion [52] is used to find the optimal dimension in the dimensionality reduction process:
where $Z \in \mathbb{R}^{N \times d}$ and $\hat{Z} \in \mathbb{R}^{N \times k}$. Non-linear mapping. The second step lifts each reduced point onto the tangent space at a reference point, where $P$ and $X$ denote a reference point [11] of St and an arbitrary point, respectively. Since the well-known logarithmic map [11] may not be defined in closed form, the lifting map $R^{-1}$, which is the inverse process of the retraction map $R$, is used for this step [22] (② in Fig. 2):

$R_P^{-1}(X) = XS - P, \quad (6)$
where $S$ is the symmetric positive semidefinite $p \times p$ matrix such that $XS - P \in T_P\mathrm{St}$. Next, $R^{-1}$ of Eq. (6) is computed as follows [22]: 1) compute $M = P^\top X$; 2) solve the continuous-time algebraic Riccati equation (CARE) [22], $S(-M) + (-M)^\top S + 2I = 0$, for $S$; and 3) compute $XS - P$. Before obtaining the eigenpair on St, we transform the matrix space into a vector space according to existing approaches [51] for transforming the data shape on a manifold. Specifically, based on the Kronecker product-based vectorization operation $\mathrm{St}(p, k) \otimes \mathrm{St}(p, k) \to \overline{\mathrm{St}}(kp, 1)$, St is transformed into the product space $\overline{\mathrm{St}}$ [51], so that an eigenpair reflecting data consistency can be obtained. Finally, based on $\bar{z}$ on $\overline{\mathrm{St}}$, the eigenpair $(\lambda, v)$ is produced by LDA. Generating GF. The third step quantifies class variability in the representation space. The quantification is based on an eigenpair and probability model-based sampling [55]. In detail, as in Eq. (7), the class variability factor $\eta$ is sampled on $\overline{\mathrm{St}}$ by using the average eigenvalue $\bar{\lambda}$, which reflects the separability per class label, as the range of the probability model [55] (③ in Fig. 2):

$\eta \sim U\!\left(R_{\bar{z}}(\mathbf{0}) - \epsilon_1 \bar{\lambda},\; R_{\bar{z}}(\mathbf{0}) + \epsilon_1 \bar{\lambda}\right), \quad (7)$
where the probability model assumes a uniform distribution $U$, $\epsilon_1$ is a scale factor, and the center value of $U$ is computed by the retraction map of $\bar{z}$ and the $kp$-dimensional zero vector $\mathbf{0}$, i.e., $R_{\bar{z}}(\mathbf{0})$. Here, the subscripts $a$, $p$, and $n$ of $\bar{z}$ indicate anchor, positive, and negative points, respectively, and $\mathcal{T}$ stands for the set of all possible pairs. Then, GF is determined by placing $\eta$ on the line manifold that reflects the separability characteristics as much as possible, through the discriminative eigenvector $v_1 \in \mathbb{R}^{kp}$:

$g = \langle \eta, v_1 \rangle. \quad (8)$

Since $g$ is global information that quantifies class variability in the representation space, and it lies on the line manifold, it can be fused with a scalar similarity metric reflecting local characteristics. Also, because $g$ is defined based on (inter-)class variability, it gradually becomes larger as the latent vectors corresponding to different class supervision are separated, that is, as similarity learning progresses. This can alleviate the inherent overfitting problem of similarity learning by playing the role of a momentum term in the gradient operation (see Sec. III.C). In the next section, the one-dimensional GF generated through Eq. (8) is used to reflect the class variability factor in the Euclidean distance or cosine similarity.
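Putting the three steps together, the sketch below is our hedged, toy-scale reading of the GF pipeline of Eqs. (6)-(8): LLE for LDS searching, the CARE-based lifting map of [22], LDA eigenpairs on the vectorized product space, and uniform sampling projected onto $v_1$. The matrix-shaping step, the dimensions (k, p), and all names are assumptions, not the authors' released implementation.

```python
# GF generation sketch: LDS searching -> lifting + vectorization -> GF sampling.
import numpy as np
from scipy.linalg import eigh, solve_sylvester
from sklearn.manifold import LocallyLinearEmbedding

def lift(P, Q):
    """Lifting map R_P^{-1}(Q) [22]: M = P^T Q; solve M^T S + S M = 2I for S
    (the CARE of Sec. III.A, which here has no quadratic term); return QS - P."""
    M = P.T @ Q
    S = solve_sylvester(M.T, M, 2.0 * np.eye(M.shape[0]))
    return Q @ S - P

def lda_eigenpairs(Z, y):
    """Eigenpairs of the LDA problem Sb v = lambda Sw v (Sec. II.A)."""
    mu, d = Z.mean(0), Z.shape[1]
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Zc = Z[y == c]; mc = Zc.mean(0)
        Sw += (Zc - mc).T @ (Zc - mc)
        Sb += len(Zc) * np.outer(mc - mu, mc - mu)
    lam, V = eigh(Sb, Sw + 1e-6 * np.eye(d))   # ascending
    return lam[::-1], V[:, ::-1]               # descending

rng = np.random.default_rng(0)
Z, y = rng.standard_normal((112, 512)), rng.integers(0, 8, 112)
k, p = 8, 2

# Step 1: adaptive LDS searching -- nonlinear reduction to k*p dimensions
# (the PIP criterion [52] would select this dimension; fixed here for brevity).
Zlow = LocallyLinearEmbedding(n_components=k * p, n_neighbors=20).fit_transform(Z)

# Shape each sample into a point of St(p, k) by QR orthonormalization (assumed).
mats = [np.linalg.qr(v.reshape(k, p))[0] for v in Zlow]

# Step 2: nonlinear mapping -- lift all points onto the tangent space at a
# reference point, then vectorize into the kp-dimensional product space St-bar.
P = mats[0]
Zbar = np.stack([lift(P, Q).ravel() for Q in mats])

lam, V = lda_eigenpairs(Zbar, y)
v1, lam_avg = V[:, 0], lam[: len(np.unique(y))].mean()

# Step 3: sample the class variability factor from U with range set by the
# average eigenvalue (Eq. (7)) and project it onto the line manifold spanned
# by the discriminative eigenvector v1 (Eq. (8)).
eps1 = 0.1
eta = rng.uniform(-eps1 * lam_avg, eps1 * lam_avg, size=k * p)
g = float(eta @ v1)   # 1D geodesic factor, fusable with a scalar metric
print("GF:", g)
```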

B. Defining Geodesic Metric Loss (GML)
This section presents three versions of GML based on representative or latest DMLs [23,31,37]. GML-Tri is based on the matrix multiplication of the squared Euclidean distance $d^2$ and $g$, as in Eq. (9).
To reflect all elements of $g$ in $d^2$, an N-dimensional all-one vector is used. Similar to [14,37], the first and second terms of Eq. (9) are defined by the hinge function $[\cdot]_+$ and the class-specific margin $\gamma_1$ [46]; they are trained so that $z_a$ and $z_p$ are located close to each other while $z_a$ and $z_n$ are located far from each other. The third term encourages the tangent space of $\overline{\mathrm{St}}$ to find the discriminative eigenpair $(\lambda, v)$. On the other hand, in order to reduce the computational burden of Eq. (9), the distance vector $(d^2_{ap}) \in \mathbb{R}^{N_p}$ is reconfigured as in Eq. (10). Similarly, $(d^2_{an})$ is reconfigured. Among the $N$ samples, the numbers of pairs satisfying $d^2 > \gamma_1$ and $d^2 < \gamma_1$ are $N_p$ and $N_n$, respectively. GML-PA employs $g$ like GML-Tri and utilizes Proxy-Anchor (PA) [23] as the baseline:
where $s(\cdot,\cdot)$ indicates cosine similarity, and $C_p$ and $C_n$ denote the sets of vectors in positive and negative relationships with a proxy vector $p$, respectively. $P$ and $P^+$ denote the sets of all and positive proxies, respectively [23].
The exponentially exploding or asymptotically vanishing phenomenon that occurs frequently in [23] is controlled by $g$; thus, $g$ can give stability to similarity learning. GML-DiVA is designed using DiVA [31] as the baseline, which ensembles several similarity metrics at the network level. The loss function of GML-DiVA is defined as follows, where $\mathcal{L}_{metric}$ is the margin loss [46] based on the class-label triplet set $\{y_a, y_p, y_n\}$ with $y_a = y_p$ and $y_a \ne y_n$. $\mathcal{L}_{shared}$ and $\mathcal{L}_{intra}$ are the losses for the triplet configurations $y_a \ne y_p \ne y_n$ and $y_a = y_p = y_n$, respectively. $\mathcal{L}_{reg}$ is the regularization loss proposed in [31]. Similar to GML-Tri in Eq. (9), $\mathcal{L}_{metric}$ is defined by multiplying the Euclidean distance of the margin loss by $g$. $\mathcal{L}_{shared}$ and $\mathcal{L}_{intra}$ are defined likewise.
The goal of GML-DiVA is to create synergy with the loss functions defined through $g$ for understanding global semantic similarity. Please refer to Appendix I for GML-Cont and GML-MS, which are handled in the experiment section.
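For concreteness, here is a hedged PyTorch sketch of the GML-Tri structure as we read Eq. (9): a triplet hinge loss whose squared distances are scaled by the GF through an all-one vector. The class-specific margin is a plain constant here, and the third (eigenpair) term is simplified away.

```python
# GML-Tri sketch (our reading of Eq. (9), not the authors' implementation).
import torch
import torch.nn.functional as F

def gml_tri(za, zp, zn, g, gamma1=1.0):
    d2_ap = ((za - zp) ** 2).sum(dim=1)   # squared distances, anchor-positive
    d2_an = ((za - zn) ** 2).sum(dim=1)   # squared distances, anchor-negative
    g_bar = g.sum()                        # 1^T g: all-one vector applied to GF
    # Hinge terms with margin; distances are scaled by the GF sum.
    loss = F.relu(g_bar * d2_ap - gamma1).mean() \
         + F.relu(gamma1 - g_bar * d2_an).mean()
    return loss

za, zp, zn = (F.normalize(torch.randn(32, 128), dim=1) for _ in range(3))
g = torch.rand(32)   # GF values produced by the pipeline of Sec. III.A (stand-in)
print(gml_tri(za, zp, zn, g))
```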

Complexity analysis. A cubic and a quadratic complexity in the reduced dimensions are required for solving the CARE and for the non-linear mapping to $\overline{\mathrm{St}}$, respectively, where $N \gg k, p$. In addition, $O(N^2)$ is required for the eigenvalue solver, which is less than the $O(N^3)$ of Triplet [10]. On the other hand, compared to the $O(N)$ of PA [23], the N-squared complexity of the eigenvalue solver is somewhat burdensome. Table I shows that, considering the performance improvement achieved by the proposed method, the additional training time is sufficiently tolerable. In the test phase, the proposed method uses only the embedding layer like the other techniques, so it consumes an equivalent testing time.

C. Gradient Analysis of GML
Eq. (9) is differentiated to analyze the effect of $\bar{g}$ on the behavior of the embedding vector.
Here, only the case where the hinge function is greater than 0 is handled, and the gradient term of $g$ can be omitted because it is relatively smaller than the gradient term of $d^2$. In addition, the gradient of $d^2$ can be decomposed into a direction component and a magnitude component through gradient analysis. In the case of Trip-semi, since the gradient is only proportional to the magnitude component, the negative vector $z_n$ can easily end up in a different class-label region at the beginning of learning. As such, it will be difficult for a $z_n$ that was initially mis-learned to be included in the region of its original class label. On the contrary, a $z_n$ that belongs to $z_a$'s class in the latter part of learning is hard to force out of the region of this class (Fig. 3). On the other hand, the gradient of GML-Tri, i.e., the GML based on the triplet loss, is affected by $\bar{g}$, the sum of all components of GF. As a result, $\bar{g}$, which quantifies class variability, plays the role of clipping the gradient of $z_n$ at the beginning of learning and acts as a momentum at the end of learning (Fig. 3). Fig. 4 shows the positive impact of GF in the early stages of learning: GML-PA showed a steeper increase in recall rate in the early stage of learning (0-10 epochs) than PA and MS. This momentum is maintained even as the $Q$ of recall@$Q$ increases, and it affects the peak performance.
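A tiny autograd check of the clipping/momentum claim above: scaling the squared distance by $\bar{g}$ rescales the gradient on $z_n$ by the same factor, so a small $\bar{g}$ early in training damps the update while a large $\bar{g}$ late in training amplifies it (our illustration, not the paper's experiment).

```python
# Gradient magnitude on z_n scales linearly with the GF sum g_bar.
import torch

za = torch.randn(128)
zn = torch.randn(128, requires_grad=True)
for g_bar in (0.1, 1.0, 10.0):
    if zn.grad is not None:
        zn.grad = None
    loss = g_bar * ((za - zn) ** 2).sum()
    loss.backward()
    print(g_bar, zn.grad.norm().item())   # gradient norm grows with g_bar
```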
See Appendix I for the gradient analysis of the anchor $z_a$, the Stiefel manifold, and the derivation of GML-PA.

D. Generalization Bounds of GML
This section analyzes the generalization bound of the similarity metric structure based on the representative covering number [25]. Let $f$ be a parameterized model optimized by a learning algorithm with the similarity metric of $f$. In addition, let a triplet generated by a probability distribution $\mu$ be $(x, x', x'') \sim \mu$. Then, the generalization bound of $f$ in the metric space $(X, d)$ is given by Theorem 1. Proof. For the proof of Theorem 1 and the details of $\mathcal{N}(\cdot)$ and the covering number, see Appendix II.
On the other hand, based on Theorem 1, the bounds of different similarity metric structures can be compared as follows.

Theorem 2.
Let the generalization bound of the similarity metric structure [23] based on cosine similarity be $B_{cos}$; [23] assumes a softplus function type. Also, as in [37], let the generalization bound of the hinge function with margin based on Euclidean distance be $B_{euc}$. Then, the following relationship is established between $B_{cos}$ and $B_{euc}$ (Eq. (15)).
Proof. Refer to Appendix II for the proof of Theorem 2. $B_{euc}$ and $B_{cos}$ correspond to the bounds of GML-Tri and GML-PA, respectively. Eq. (15) provides theoretical indicators regarding optimization stability and convergence between different similarity metric structures. In addition, a structure with a tighter generalization bound improves retrieval performance (see Table II of Sec. IV.A). From this point of view, Theorem 2 can be an explicit tool for analyzing the relationship between retrieval performance and the generalization bound.

TABLE II. Performance comparison on CUB200-2011, Cars196, and SOP. Superscripts 128 and 512 represent the embedding vector size, and superscript 512B represents the results for 256×256 train/test images with a 512-dimensional embedding. Here, 'PA', 'Tri', and 'Cont' represent Proxy-Anchor, Trip-semi, and Contrastive, respectively.

IV. EXPERIMENTS
Configuration details. We used the PyTorch library for network design and parameter optimization. All experiments were performed five times on four NVIDIA GeForce GTX 1080 Ti GPUs. We adopted the training protocols of the base techniques [37,23,31] to which GML is applied. Two backbone networks, ResNet50 (R50) [16] and BN-Inception (BN) [18], were used. For pre-processing, a center crop of 224×224 (or 256×256) and a horizontal flip were applied to the input images. As in the other techniques, only the embedding layer trained by GML was used for evaluation, and the GF generation process was excluded. Please refer to Appendix III for specific details.
Hyper-parameter setting. For PA as the base technique, AdamW [28] with a learning rate of $10^{-4}$ and $N = 180$ was used; for the others, Adam [21] with a learning rate of $10^{-5}$ and $N = 112$ was used. $\epsilon_1$ of Eq. (7) and $\epsilon_2$ of Eq. (9) were set to 1e-1 and 1e-6, respectively. The remaining hyper-parameters are listed in Appendix III; see Fig. 5 for examples of the datasets.
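The stated optimizer settings map onto PyTorch as follows; `net` and `proxies` are stand-ins for the embedding network and proxy parameters, and only the quoted numbers come from the text (the proxy learning-rate scaling and weight decay are from Appendix III).

```python
# Optimizer setup per the stated hyper-parameters (illustrative model objects).
import torch

net = torch.nn.Linear(2048, 512)                    # stand-in embedding network
proxies = torch.nn.Parameter(torch.randn(100, 512))  # stand-in PA proxies

# PA baseline: AdamW, lr 1e-4, minibatch N = 180; proxy lr scaled up 200x.
opt_pa = torch.optim.AdamW(
    [{"params": net.parameters(), "lr": 1e-4},
     {"params": [proxies], "lr": 1e-4 * 200}],
    weight_decay=1e-4,
)

# Other baselines: Adam, lr 1e-5, minibatch N = 112.
opt_tri = torch.optim.Adam(net.parameters(), lr=1e-5)
```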

A. Experimental Results
Quantitative results. Quantitative evaluation was performed on the CUB200-2011, Cars196, and SOP datasets. Recall@Q (R@Q) [20] was employed for retrieval performance evaluation, and normalized mutual information (NMI) [38] for clustering performance evaluation. Table II lists Euclidean distance-based and cosine similarity-based techniques [23,43,33,39] in order of embedding vector size and reports their performance. The greatest performance improvements were observed for GML-Cont and GML-Tri. For example, on CUB200-2011, GML improved R@1 by 6% and 6.8%, respectively, compared to the base techniques (see the 2nd and 6th rows of Table II). Note that GML-Tri was the best among the triplet-based DML techniques. GML-Tri showed an R@1 gap of only 0.3% from PADS-Tri [35] on CUB200-2011, but it provided significant R@1 differences of 2.1% and 2.4% on the Cars196 and SOP datasets, which supports its generalization performance. Also, it showed a steady performance gain of more than 1% over DSML [53]. This proves that a loss function inspired by subspace clustering is more effective in retrieval tasks than SNR-based class variability analysis [53] and reinforcement learning-based sampling [35]. GML-PA achieved the highest performance improvement among the cosine similarity-based techniques. Looking at the 12th row, R@1 was 70.7% on CUB200-2011 and 87.5% on Cars196. When the input image size is set to 256×256, GML boosted the performance of PA [23] and HORDE [19] by about 1% (see the last row). This is SOTA performance for this embedding vector size.

FIGURE 5. Examples of datasets for DML.
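For reference, a minimal sketch of the evaluation protocol on synthetic data: Recall@Q via nearest neighbours in the embedding space and NMI via k-means clustering; the class count and sample sizes are illustrative.

```python
# Recall@Q and NMI evaluation sketch (synthetic embeddings and labels).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score
from sklearn.neighbors import NearestNeighbors

Z = np.random.randn(500, 512)
y = np.random.randint(0, 50, 500)

def recall_at_q(Z, y, q):
    # q+1 neighbours because the closest neighbour of each point is itself.
    nn = NearestNeighbors(n_neighbors=q + 1).fit(Z)
    _, idx = nn.kneighbors(Z)
    hits = [(y[i] == y[idx[i, 1:]]).any() for i in range(len(y))]
    return float(np.mean(hits))

print("R@1:", recall_at_q(Z, y, 1))
nmi = normalized_mutual_info_score(
    y, KMeans(n_clusters=50, n_init=10).fit_predict(Z))
print("NMI:", nmi)
```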
Next, Table III compares GML-DiVA with conventional ensemble methods [31,44,48]. GML-DiVA showed better performance than [44] and [48] in terms of both recall and NMI for embedding vector sizes of 512 or more.
Although GML-DiVA showed a meaningful R@1 gain of 1.7% on the SOP dataset, its overall performance improvement was not significant compared to the GML-Tri and GML-PA of Table II. Actually, DiVA differs somewhat from the purpose of GML, which is designed to improve a single metric, because DiVA boosts performance through the fusion of metrics. Based on a study [50] showing that ensemble performance depends on the performance of the single metrics, we can interpret that the metric used in GML-DiVA has a positive effect on the classifier fusion process.
Qualitative results. In this section, a qualitative evaluation in terms of similarity was performed using a confusion matrix based on the distance correlation (DC) index. Note that the well-known Euclidean distance has difficulty effectively reflecting the similarity of latent vectors as well as their disentanglement characteristics, so we analyzed the relationship between vectors based on DC. As in Fig. 6, for the same-class images $x_1$ and $x_2$, both GML-PA and PA showed high DC values of about 0.7. However, the DC value between a 'bird' image $x_1$ and a 'cup' image $x_4$ was 0.41 for GML-PA and 0.47 for PA. This supports the outstanding discrimination performance of GML. In the case of PA, although the class labels are different, the DC (0.44) between $x_3$ and $x_1$, which share the common attribute 'bird', is smaller than the DC (0.45) between $x_3$ and $x_6$, which have completely different attributes. On the other hand, GML-PA shows an ordinary DC distribution. This experimental result shows that GML-PA can learn the similarity metric by reflecting not only the class label but also image attributes. See Appendix IV for more examples of Fig. 6 and latent space visualizations.
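The DC index itself can be computed as below (Székely's distance correlation); treating the dimensions of each latent vector as observations is our assumption about how the pairwise confusion matrix was built.

```python
# Distance correlation (dCor) between two latent vectors.
import numpy as np

def distance_correlation(X, Y):
    def centered(D):
        return D - D.mean(0, keepdims=True) - D.mean(1, keepdims=True) + D.mean()
    dX = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    dY = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
    A, B = centered(dX), centered(dY)
    dcov = np.sqrt(np.clip((A * B).mean(), 0, None))
    return dcov / np.sqrt(np.sqrt((A * A).mean() * (B * B).mean()))

# Each 128-dim latent vector is treated as 128 scalar observations (assumption).
z1 = np.random.randn(128)
z2 = z1 + 0.3 * np.random.randn(128)   # a correlated counterpart
print(distance_correlation(z1[:, None], z2[:, None]))  # near 1 for related vectors
```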

B. Ablation Study
In this section, we performed a comparative analysis on the hyper-parameters of GF. Fig. 7 evaluates the techniques used for the LDS searching of Sec. III.A according to the embedding vector size. The performance was highest in the order of GML-PA, Proxy-Anchor, and MS. As the embedding vector size increased from 32 to 256, R@1 also improved in proportion to the capacity of the embedding layer. However, from 512 and above, the performance of all techniques saturated around 87%. This indicates that the embedding layer no longer provides additional information for calculating the similarity metric. Also, the performance variation according to the LDS search method amounted to about 1%. On the other hand, it is noteworthy that the best-performing technique differs depending on the embedding vector size. For example, LLE showed the best performance of about 73.8% at size 32, and IsoMap was the best at 1024 with about 87.8%. Next, we examined the GF hyper-parameters k and p. Although GML-PA is somewhat sensitive to these parameters, the overall performance was clearly improved compared to authentic PA. Note that GML-PA with p = 1 also showed higher performance than PA (68.4%). This proves that class variability has a positive effect on similarity learning. In detail, the performance according to k rarely shows a certain trend, whereas R@1 is somewhat proportional to p. This is because p indicates the capacity for data representation on the Stiefel manifold.
Finally, Table IV evaluates the performance of different manifolds and non-linear mappings. We performed this experiment with manifolds created while changing the orthogonal parameter of the Stiefel manifold. From Table IV, we can observe that the Stiefel-based method has a performance advantage of about 1% R@1 over the Riemannian and Unitary ones. Next, the logarithmic map [11] was used to generate the tangent space, but only a variation within 0.3% of the recall rate was observed. As a result, applying the same interpretation as in Fig. 7, finding an appropriate p on the Stiefel manifold is equivalent to creating a GF that reflects the class variability of the embedding space well. Refer to Appendix V for additional materials, such as experimental results on the In-shop clothes dataset and performance changes according to minibatch size.

V. RELATED WORKS
Metric learning summary. The goal of DML is to learn the projection from the input space into the representation space [45,9]. Various regularization terms have been proposed to mitigate overfitting in the DML process [6,47]. In addition, CNN-based DML techniques such as pair losses [24] and triplet losses [37] have appeared to deal with high image dimensionality. There have also been studies on the concept of neighborhood component analysis (NCA), which locates points of the representation space in a discriminative decision space based on class supervision information [12]. On the other hand, unlike the previous studies, there were cases where global information on the embedding space was quantified in various ways and used as a tool for similarity learning [41,34]. Finally, a novel approach for image retrieval tasks in bulk databases has emerged [67]. [67] has in common with the proposed method that color and texture features are extracted considering both local and global views of an image; it is also noteworthy that features useful for improving retrieval performance were extracted from the frequency domain.

Latest DML approaches. Recently, several metric learning techniques based on the analysis of class variability have been proposed. For instance, DSML [53] defined the signal-to-noise ratio (SNR) from the (Euclidean) distance and variance between samples in a minibatch and then used it as a loss function. However, due to the inherent characteristics of the SNR metric, DSML can only be applied to metric learning based on triplet or contrastive distances; that is, it is difficult to apply to formally different techniques such as PA [23]. PADS [35] proposed an adaptive negative sampling mechanism through a class variability factor and a feedback loop based on reinforcement learning. PADS, which suggested a new sampling concept, is easy to attach and detach. However, since PADS does not structurally change the loss function, it does not affect similarity learning itself. Also, since PADS analyzes statistical properties in Euclidean space, it cannot capture the non-linear properties of the data. Finally, [63] analyzed class variability on a manifold: it quantified class variability on the Riemannian manifold and applied this to basic metric learning techniques such as triplet loss. However, [63] is numerically unstable and sensitive to outlier samples because no other constraint terms, such as orthogonality, are employed when mapping to the manifold.

Discriminant analysis (DA). The objective of DA is to obtain a factor that maximizes inter-class variability in a given space and then to perform dimension reduction and eigenvalue analysis using this factor. For example, dimension reduction could be performed after defining the relationships between samples through Laplacian graphs [49], or Laplacian graphs based on the Gaussian kernel could be used to construct intra/inter-class covariance matrices [40]. One of the major features of DA is that it can perform pattern analysis in association with manifold geometry. For example, Louis et al. proposed a probabilistic DA to deal with manifold characteristics [29]. On the other hand, since DA shares a common goal with metric learning, the relationship between DA and metric learning has been theoretically analyzed [1]. However, as far as we know, there have been no recent studies that perform the retrieval task using both DA and metric learning at the same time.

VI. CONCLUSION AND FUTURE WORK
This paper proposes a method for successfully capturing nonlinear characteristics of the embedding space and presents a novel metric structure based on these characteristics. The proposed method can inspire recent DML studies that consider only the Euclidean distance or cosine similarity in a vector space. Future work will apply nonlinear characteristics to diverse DML techniques and propose an optimization method for efficient matrix computation.

APPENDIX I (GRADIENT ANALYSIS OF GML)

A. Form of GML-Cont and GML-MS
The GML-Cont and GML-MS used in the experiments of the main body are defined in Eqs. (A1) and (A2), where $\langle\cdot,\cdot\rangle$ denotes the dot product and $\alpha$, $\beta$, and $\lambda$ are set to 2, 50, and 1, respectively. Here, $C_p$ and $C_n$ denote the sets of vectors in positive and negative relationships with $z_i$, respectively.

B. Gradient of GML-Tri
The final form of Eq. (9) of the main body is given by Eq. (A3). The partial derivatives of Eq. (A3) with respect to the triplet set $\{z_a, z_p, z_n\}$ and $v_1$ are given in Eqs. (A4)-(A7). The magnitudes of the embedding vector gradients are tuned by $\bar{g}$. In Eq. (A7), the indicator $\mathbb{1}\{\cdot\}$ outputs 1 when the given condition is satisfied, and 0 otherwise.

C. Gradient of GML-PA
The loss function of GML-PA is defined by Eq. (A8), and the derivative analysis on cosine similarity is given by Eq. (A9).
Note that the gradients of GML-Cont and GML-MS can be analyzed in the same way as GML-Tri and GML-PA, respectively.

D. Gradient of $\eta$
In general, gradients cannot be backpropagated through $\eta$, which is created by a sampling process. To solve this problem, we build an end-to-end algorithm by referring to the reparameterization trick of [56].
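A minimal sketch of this trick for the uniform sampling of Eq. (7): the noise is drawn independently of the parameters, so gradients flow to the center and range of the distribution; the learnable center here is our simplification.

```python
# Reparameterized uniform sampling: eta ~ U(center - r, center + r).
import torch

center = torch.zeros(16, requires_grad=True)       # stands in for R(0) of Eq. (7)
half_range = torch.tensor(0.5, requires_grad=True)  # stands in for eps1 * lam_avg

u = torch.rand(16)                          # U(0,1) noise, detached from params
eta = center + (2 * u - 1) * half_range     # differentiable in center, half_range
eta.sum().backward()                        # gradients reach both parameters
print(center.grad is not None, half_range.grad.item())
```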

E. Gradient on the Stiefel Manifold
Let a point $W$ be a trainable parameter. The process of updating $W$ on the Stiefel manifold consists of four steps. First, find the derivative $\nabla \mathcal{L}^*(W_t)$ of the objective function at iteration $t$. Second, map $\nabla \mathcal{L}^*(W_t)$ to $\nabla_{\mathrm{St}} \mathcal{L}^*(W_t)$ on $T_{W_t}\mathrm{St}$ using the lifting map. Third, from the momentum $M_t$ and $\nabla_{\mathrm{St}} \mathcal{L}^*(W_t)$ at iteration $t$, calculate the momentum $M_{t+1}$ for the next step. Finally, update $W_t$ to $W_{t+1}$ on the Stiefel manifold along the direction of $M_{t+1}$. For more details, please refer to [57] and [58].
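A compact sketch of these four steps, assuming the canonical tangent-space projection and a QR retraction in place of the exact maps of [57,58]; vector transport of the momentum is omitted for brevity.

```python
# Riemannian SGD with momentum on the Stiefel manifold (sketch).
import torch

def stiefel_step(W, eucl_grad, M, lr=1e-2, beta=0.9):
    # Step 2: project the Euclidean gradient onto the tangent space at W:
    # G_st = G - W sym(W^T G).
    WtG = W.T @ eucl_grad
    G_st = eucl_grad - W @ (WtG + WtG.T) / 2
    # Step 3: momentum update on the tangent space (transport omitted).
    M = beta * M + G_st
    # Step 4: retract W - lr * M back onto the manifold via QR.
    Q, R = torch.linalg.qr(W - lr * M)
    return Q * torch.sign(torch.diagonal(R)), M

W, _ = torch.linalg.qr(torch.randn(64, 8))   # initial point on St(8, 64)
M = torch.zeros_like(W)
grad = torch.randn(64, 8)                    # dL/dW from autograd (illustrative)
W, M = stiefel_step(W, grad, M)
print((W.T @ W - torch.eye(8)).abs().max())  # ~0: constraint preserved
```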

APPENDIX II (GENERALIZATION BOUNDS OF GML)
In this section, the background knowledge for generalization bound analysis and the proof of Theorem II.2 are examined.
The robustness of a pairwise similarity metric over subset samples placed on a compact space is defined as follows.
Definition II.2 (Robustness of metric learning [59]). where $h$ is the hypothesis learned by the algorithm on the training sample $T$. $K(\cdot)$ and $\epsilon(\cdot)$ quantify the robustness of the algorithm and depend on the training sample. As shown in [59], Definition II.2 guarantees the following generalization bound.
Theorem II.1 (Generalization bound of metric learning [59]). If a learning algorithm is $(K(\cdot), \epsilon(\cdot))$-robust and the training sample consists of the triplets $T$, then for any $\delta > 0$, with probability at least $1 - \delta$, we have the following bound, where $B$ is a predefined constant.
It is noteworthy that the generalization bound can be analyzed more clearly with Theorem II.2 than with Theorem II.1. Theorem II.2 shows that the losses of triplet tuples located close to each other within the partitions split by the covering number stay within a certain range.

C. Generalization bound for GML-PA
In this section, we guarantee the generalization bound of GML-PA by using Theorem II.2. Let $s$ and $s'$ denote the cosine similarities corresponding to the empirical and expected parts of the loss, respectively. Then, we can show that the generalization bound of GML-PA is given by Eq. (A20). Here, since the positive and negative terms are derived identically, Eq. (A20) can be regarded as having the same meaning as Theorem II.2.
By the way, the cosine similarity metric, which has a finite range on the compact space, satisfies the Taylor expansion of Eq. (A21). According to Eq. (A21), the relationship between GML-PA with its softplus function structure and GML-Tri with its hinge function structure can be expressed by Eq. (A22). If the generalization bounds of GML-PA and GML-Tri are defined as $B_{PA}$ and $B_{Tri}$, respectively, this can be rewritten as $B_{PA} \le B_{Tri}$. Therefore, it can be inferred that GML-PA also has a finite generalization bound, which is more compact than $B_{Tri}$.
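The softplus-hinge relationship used here can be made explicit with the following identity (ours), which shows that the softplus structure is pointwise sandwiched by the hinge structure up to $\log 2$, consistent with Eqs. (A21)-(A22):

```latex
% Pointwise relationship between the softplus structure of GML-PA and the
% hinge structure of GML-Tri:
\operatorname{softplus}(x) = \log\!\left(1 + e^{x}\right)
  = [x]_{+} + \log\!\left(1 + e^{-|x|}\right),
\qquad
[x]_{+} \;\le\; \operatorname{softplus}(x) \;\le\; [x]_{+} + \log 2 .
```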

D. Generalization bounds for GML-Cont and GML-MS
GML-Cont/MS have the same structures as GML-Tri/PA; only the composition of the constant terms changes. Therefore, the generalization bounds of GML-Cont and GML-MS can be derived in the same way from Eqs. (A13) and (A22).

APPENDIX III (HYPER-PARAMETER SETTING)
GML-PA/MS. The embedding network is trained for 80 epochs with an initial learning rate of $10^{-4}$ on CUB200-2011 and Cars196, and for 120 epochs with an initial learning rate of $6 \times 10^{-4}$ on SOP and In-shop. The decay factor of the optimizer is set to $10^{-4}$.
A step-wise learning rate reducer (StepLR) with a step size of 10 and a ratio of 0.5 is used for convergence stability on CUB200-2011 and Cars196, and one with a step size of 20 and a ratio of 0.25 on SOP and In-shop. In the case of GML-PA, the learning rate for the proxies is scaled up 200 times for faster convergence. Random sampling is used to configure minibatches during the training phase. The default minibatch size is 180.

GML-Tri/Cont. The embedding network is trained for 100 epochs with an initial learning rate of $10^{-5}$ on CUB200-2011 and Cars196, and for 100 epochs with an initial learning rate of $4 \times 10^{-5}$ on SOP and In-shop. The decay factor of the optimizer is set to $4 \times 10^{-4}$. StepLR with a step size of 25 and a ratio of 0.3 is used for convergence stability on the four public DML datasets. Distance-based sampling [46] is used to configure minibatches during the training phase. The default minibatch size is 112.

GML-DiVA. The embedding network is trained for 150 epochs with an initial learning rate of $10^{-5}$ on the four public DML datasets. The configuration of StepLR is the same as for GML-Tri. Distance-based sampling [46] is used to configure minibatches during the training phase. The default minibatch size is 112. Other detailed coefficients are set by referring to [31].
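As an illustration, the StepLR configurations quoted above map onto PyTorch as follows; the model and training loop are placeholders.

```python
# StepLR schedule per the quoted settings (GML-PA/MS on CUB200-2011 / Cars196).
import torch

net = torch.nn.Linear(2048, 512)
opt = torch.optim.AdamW(net.parameters(), lr=1e-4, weight_decay=1e-4)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.5)
# SOP / In-shop would use step_size=20, gamma=0.25; GML-Tri/Cont uses
# step_size=25, gamma=0.3 on all four datasets.

for epoch in range(80):
    # ... one training epoch over the minibatches ...
    sched.step()
print(opt.param_groups[0]["lr"])   # learning rate after 80 epochs
```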

APPENDIX IV (ADDITIONAL QUALITATIVE RESULTS)
In this section, qualitative results in terms of the confusion matrix based on the distance correlation (DC) index are additionally examined. We qualitatively compare the results in Figs. A1-A8 obtained on the CUB200-2011 [42], Cars196 [26], and Stanford Online Products (SOP) [32] datasets. The main comparison method is Proxy-Anchor (PA) [23]. The overall DC value between inter-class images is lower for GML-PA than for PA, because the inter-class variability property has been further considered in GML-PA. For example, in Fig. A5, the DC value between images $x_1$ and $x_5$ for PA is 0.7, which is higher than that between all intra-class images. Figures A9-A12 show the embedding spaces using the grid-wise t-SNE visualization tool [60]. We can see that images containing similar attributes are located close to each other, and vice versa. Compared to the visualization results of [23], our 2D embedding plot shows discriminative locations of images.

APPENDIX V (ABLATION STUDY)
Table A.1 shows the performance on the In-shop clothes dataset. As on the SOP dataset, GML-PA shows an excellent R@1 improvement of 1.2% for an embedding vector size of 512; since the performance has already reached its upper limit, it is difficult to expect further improvement. On the other hand, GML-Cont and GML-Tri, whose performance is less saturated, show noticeable R@1 improvements of 7% and 8.8%, respectively.
Next, Table A.2 shows the performance according to minibatch size. Since the similarity metric is calculated using pairwise components within a minibatch, smaller minibatch sizes tend to show lower performance. However, GML-PA consistently shows higher performance than PA even with relatively small minibatch sizes. This demonstrates that GML-PA is less sensitive to the minibatch size, which is important for extracting local information.