DeepDiffusion: Unsupervised Learning of Retrieval-adapted Representations via Diffusion-based Ranking on Latent Feature Manifold

Unsupervised learning of feature representations is a challenging yet important problem for analyzing a large collection of multimedia data that do not have semantic labels. Recently proposed neural network-based unsupervised learning approaches have succeeded in obtaining features appropriate for classification of multimedia data. However, unsupervised learning of feature representations adapted to content-based matching, comparison, or retrieval of multimedia data has not been explored well. To obtain such retrieval-adapted features, we introduce the idea of combining diffusion distance on a feature manifold with neural network-based unsupervised feature learning. This idea is realized as a novel algorithm called DeepDiffusion (DD). DD simultaneously optimizes two components, a feature embedding by a deep neural network and a distance metric that leverages diffusion on a latent feature manifold, together. DD relies on its loss function but not encoder architecture. It can thus be applied to diverse multimedia data types with their respective encoder architectures. Experimental evaluation using 3D shapes and 2D images demonstrates versatility as well as high accuracy of the DD algorithm. Code is available at https://github.com/takahikof/DeepDiffusion


I. INTRODUCTION
Technology for content-based multimedia information retrieval is essential to manage a variety of data, such as text documents, two-dimensional (2D) images, and threedimensional (3D) shapes. Over the past decade, accuracy of multimedia information retrieval has improved drastically due in large part to the development of distance metric learning techniques, especially those using Deep Neural Networks (DNNs). Previous studies (e.g., [1], [2]) trained a DNN by using a large amount of labeled data to acquire a distance metric useful for matching, comparing, or ranking of the data. Feature representations extracted from the supervisedly trained DNNs yield retrieval accuracy significantly higher than traditional handcrafted features. However, collecting sufficient number of labeled data is often impractical since annotation by humans is quite laborious. Insufficiency in number of labeled data hampers the supervised learning of feature representations and their distance metric.
Recently, unsupervised learning of feature representations, which tries to obtain expressive features from a collection of unlabeled data, has received much attention. In general, unsupervised learning is more challenging than supervised learning since direct supervision by labels is not available. To train feature extraction, or encoder, DNNs without using labels, "pretext" tasks are usually employed. In the field of 2D image analysis, such pretext tasks as self-reconstruction [3], context prediction [4], pseudo label classification [5], and feature contrast [6] have been proposed. Training using these pretext tasks achieved a certain level of success in learning feature representations in an unsupervised manner.
We argue that these previous pretext tasks for unsupervised representation learning have two shortcomings. (i) First, features learned by most of the previous pretext tasks for classification [4]- [16] are not optimal for retrieval. Training sample x (e.g., 3D shape) Data augmentation θ Intrinsic feature m 7,843 That is, DNN training using these pretext tasks attempts to form the latent feature space where similar data samples are embedded close to each other. However, none of the existing methods [4]- [16] impose any constraints on the structure of feature distribution in the learned latent feature space. Features learned without such constraints are expected to be nonlinearly distributed, possibly with gaps, in the latent feature space. Accurate classification can be achieved if decision boundaries among such nonlinearly distributed set of features are properly established by a subsequent supervised nonlinear classifier. On the other hand, however, retrieval accuracy suffers if the latent feature space is highly nonlinear, for example with bunching and/or gaps in the latent feature distribution. Feature comparison for retrieval is typically done by using a fixed distance metric (e.g., the Euclidean distance). The fixed distance metric has difficulty in finding all the data samples relevant to a query in the latent feature space having nonlinear distribution. To achieve accurate retrieval, a latent feature space having a continuous and smooth distance metric adapted to the nonlinear distribution of the input data is essential.
(ii) Second, some of the previous pretext tasks (e.g., [3], [4]) lack versatility, as they can only be applied to a specific data type. For example, the context prediction task [4] works only for 2D images since it heavily depends on the structure of 2D image. Or, the self-reconstruction [3] requires a loss function in addition to a pair of an encoder DNN and a decoder DNN, all of which must be exclusively designed for each data type.
Our goal in this paper is to develop a datatype agnostic unsupervised learning framework that learns retrievaladapted feature representation. To this end, we introduce diffusion distance on latent feature manifold.
We first motivate the use of diffusion distance on latent feature manifold. Manifold hypothesis [61] assumes that high-dimensional data samples or their feature descriptors tend to lie in the vicinity of a nonlinear, low-dimensional manifold. Before the DNN era, manifold learning techniques [21] have empirically shown their effectiveness in tasks such as dimensionality reduction, clustering, and retrieval of diverse data types. We believe that the manifold hypothesis is also applicable to latent feature spaces formed by randomly initialized DNNs designed for encoding various data types such as 2D image and 3D shape. We thus hypothesize that incorporating the idea of manifold learning into the unsupervised deep learning could achieve our goal, i.e., learning retrieval-adapted features of diverse data types. Following our hypothesis, we attempt to optimize an encoder DNN and its latent feature space by using a pretext task that respects the structure of nonlinear manifold lying in the latent feature space. Our pretext task leverages diffusion distance on feature manifold [45] as a distance metric. Diffusion distance on a feature manifold is a powerful metric for information retrieval [45]. Given a set of discrete data points, or nodes, diffusion distance is defined as an average length of multiple paths between two nodes on the feature manifold graph. Averaging distances through multiple paths produces a smoother distance metric among a set of nodes. As described in [45], diffusion distances are typically computed by iteratively propagating similarities on the manifold graph. Many studies on multi-media information retrieval (e.g., [17], [18], [19], [45]) demonstrated that the data-adaptive diffusion distance can retrieve semantically similar data more accurately than a non-data-adaptive, or fixed, distance such as Euclidean distance. These successes motivate us to use the diffusion distance on a feature manifold, which tends to be smooth, for learning retrieval-adapted features. Also, to achieve applicability to a wide range of multimedia data, our pretext task is formulated as a loss function so that it works in conjunction with various data types, datasets, and encoder DNN architectures.
Our motivation described above is realized as a novel algorithm called DeepDiffusion (DD). Fig. 1 illustrates the learning framework of the DD algorithm. The crux of the DD algorithm is its loss function, which is designed to learn retrieval-adapted features. Our loss function, named Latent Manifold Ranking (LMR) loss, is built by reformulating the loss function of the Manifold Ranking algorithm [17], [19].
As shown in Fig. 1, the LMR loss is computed based on diffusion-based ranking in the latent feature space. The LMR loss is used to jointly optimize two sets of parameters, that are, connection weights within the encoder DNN and the manifold formed by latent features of all training samples. This optimization aims to transform the initial latent feature space so that a submanifold consisting of mutually similar features to contract while mutually distinct submanifolds to move farther away from each other. Diffusion distance-based ranking is achieved by comparing the latent features using a fixed (e.g., Euclidean) distance in the transformed latent feature space. After a training, feature vectors extracted by the trained encoder DNN are used for retrieval. In addition to this simple feature extraction, we also propose another feature extraction method that exploits both the trained encoder and the optimized latent feature manifold to boost retrieval accuracy.
Strengths of the DD algorithm are threefold. First, it learns features appropriate for content-based matching, comparison, and ranking of multimedia data. Second, our method learns such retrieval-adapted features in a fully unsupervised manner. The DD algorithm requires no semantic labels for its training. Third, our algorithm is widely applicable to any type of multimedia data as long as encoder DNNs that process the multimedia data type are available. This is because the DD relies only on its loss function.
We evaluate the effectiveness of the DD algorithm by using two types of multimedia data, that are, 3D shape represented by point set and 2D image represented by pixels. The experiments demonstrate that feature representations learned by the DD algorithm yield retrieval accuracy significantly higher than those learned by the existing methods for unsupervised feature learning.
Contributions of this paper can be summarized as follows.
 Proposing a novel unsupervised representation learning algorithm named DeepDiffusion (DD). Training framework of the DD algorithm is suitable to learn retrieval-adapted features and is applicable to diverse multimedia data types.
 Empirically evaluating accuracy and versatility of the DD algorithm by using scenarios of 3D shape retrieval and 2D image retrieval.
The rest of the paper is organized as follows. Related work is reviewed in Section II and our proposed algorithm is described in Section III. Section IV reports experimental results. Finally, conclusion and future work are discussed in Section V.

A. DNN-BASED DISTANCE METRIC LEARNING
Supervised deep metric learning is a well-studied approach to improving accuracy of information retrieval [48]. Majority of the approach creates supervisory signals by forming pairs [1], triplets [2], or higher order tuples [54], of labeled training samples. A latent feature space is optimized so that intraclass distances become smaller than inter-class distances. Some studies [55], [56] exploit user-generated tags as weak supervisory signal for deep metric learning. Meanwhile, a group of studies [49], [50], [57] try to directly optimize retrieval accuracy indices such as Average Precision and Recall via DNN training. They adopt ranked lists as supervisory signals and train the DNN with a dedicated objective function to maximize the retrieval accuracy. Although these approaches ( [48]- [50], [57]) are effective in improving retrieval accuracy, they require a large number of labeled data for training.
Unsupervised fine-tuning of a DNN is one of the alternatives for deep metric learning. [51], [52], [58], [59] mine pseudo supervisory signals from latent features of unlabeled data to fine-tune the DNN. Although the mining procedures are unsupervised, they heavily depend on latent features extracted by the DNN pre-trained with labeled data. In contrast, our approach is purely unsupervised. That is, a DNN is trained from scratch by using only unlabeled data to obtain latent features suitable for retrieval.

B. DNN-BASED UNSUPERVISED REPRESENTATION LEARNING
Unsupervised learning can potentially leverage a large amount of unlabeled data as training samples. To effectively train DNNs without relying on labels, pretext tasks are usually employed. Various pretext tasks for unsupervised representation learning have been proposed mainly in the field of 2D image analysis.
Autoencoder (AE) [3] learns features via a selfreconstruction task. An AE is formed by paring an encoder DNN with a decoder DNN. The encoder embeds an input data into the latent feature space while the decoder reconstructs the input from the embedded feature. Generative Adversarial Network (GAN) [20] learns a sample distribution of training dataset through adversarial training of a generator DNN and a discriminator DNN. Although GAN was originally proposed to generate realistic 2D images, it has also proven useful for unsupervised visual feature learning [7]. The other pretext tasks for unsupervised visual feature learning include, for example, predicting different channels (e.g., [8]), predicting spatial context (e.g., [4]), and predicting geometric transformation (e.g., [9]) of 2D images. Though these pretext tasks perform reasonably well in learning feature representation, they lack versatility. Designing the pretext task requires deep understanding of the data type. In addition, designing a DNN architecture and/or a loss function for the data type also requires knowledge of the data type.
Compared to the abovementioned pretext tasks, pseudolabel classification (e.g., [5]) and feature contrast (e.g., [6]) are more versatile and flexible. These tasks do not require the decoder DNN, and the encoder DNN is trained with a loss function that can be used independently of data type. The pseudo-label classification approach automatically assigns a pseudo category label to each unlabeled training sample. Instance Discrimination [5], [10], [60] assigns a unique pseudo-label per training sample. The pseudo-labels of DeepCluster [11] are generated by using k-means clustering of a set of latent features extracted by an encoder DNN. Swapping Assignments between multiple Views (SwAV) [12] and Local Aggregation [13] incorporate a clustering procedure as a part of a DNN to generate pseudo-labels adapted to the training data. On the other hand, feature contrast, also known as contrastive learning, trains the encoder DNN by comparing latent features. The feature contrast approach tries to form a latent feature space where features of positive sample pairs are embedded closer while features of negative sample pairs are embedded further apart from each other. Majority of the feature contrast methods, e.g., [6], [14] and [15] creates a positive pair by coupling two different "views" of a training sample. These two views are generated by applying data augmentation with different augmentation parameters to the training sample. A negative pair is formed between the pair of views derived from mutually different training samples. Features learned by pseudo-label classification or feature contrast were experimentally proven to be effective for classification and segmentation of multimedia data, especially 2D images.
Nevertheless, the existing pretext tasks mentioned in this section do not necessarily learn features that are suitable for information retrieval. Training by these pretext tasks do not pay much attention to the structure of the initial latent feature manifold formed by a randomly initialized encoder DNN. In addition, these pretext tasks do not impose any constraints on the learned latent feature distribution. Therefore, the training would strongly distort the initial latent feature manifold and converges to an unconstrained latent feature space having nonlinear distribution. Such nonlinear latent feature distribution, possibly with folds, bunches and gaps, is unsuitable for information retrieval ranked by using a fixed distance metric. For example, pseudo-labels of DeepCluster [11] ignore the nonlinearity of latent feature manifold since they are created by the k-means clustering. If one of the estimated centroids lies among different submanifolds in the latent feature space, DeepCluster overly deforms these submanifolds so that their latent features agglomerate as one cluster. Also, positive/negative pairs used by the feature contrast approach [6], [14], [15] may distort the latent feature manifold since negative samples are chosen randomly from a training data set. If both positive and negative samples lie on the same submanifold, it would be torn apart to converge to a latent feature space that is not optimal for retrieval. Our DD, on the other hand, learns retrieval-adapted features based on the criterion entirely different from the existing methods, that is, diffusion distance on the latent feature manifold.

C. GRAPH-BASED MANIFOLD LEARNING
Manifold learning is a powerful technique to discover a nonlinear low-dimensional subspace inherent in a highdimensional feature space. Various manifold learning algorithms were proposed primarily before the DNN era [21]. To unveil an intrinsic geometry of the high-dimensional space, manifold learning algorithms analyze a manifold graph. In this graph, each node represents a sample (i.e., a raw datum or its feature vector) and a set of such nodes are connected by edges whose weights are calculated based on proximity of the samples. The representative manifold learning algorithms (e.g., [22], [23]) use eigenanalysis of the manifold graph to nonlinearly project the samples to the lowdimensional subspace. Manifold Ranking (MR) [17] [19] learns a distance metric among the samples by using similarity diffusion on the manifold graph. We later review the MR algorithm.
These classical manifold learning algorithms described above have been widely used for compression, clustering, comparison, and visualization, of high-dimensional data. However, accuracy of the distance metric learned by these classical algorithms highly depends on quality of data samples associated with nodes. Analyzing the manifold graph formed by raw data or their handcrafted features often produces a sub-optimal distance metric.
Some studies attempt to combine the idea of manifold learning with deep learning [24]- [27] for "manifold-aware" DNNs. Each of these studies proposes either a loss function or a DNN architecture that considers the manifold structure of input features. [24]- [26] propose loss functions that learn a latent feature space whose distance metric is constrained by proximity, in an input feature space, of training samples. [27] proposes to use a graph convolution layer as a building block of a DNN to diffuse similarity over a manifold formed by input features.
These manifold-aware deep learning methods [24]- [27] succeeded in finding feature embedding better than the classical manifold learning methods [21]. However, most of the existing manifold-aware DNNs do not optimize feature extraction since they process handcrafted features extracted prior to training the embedding DNN. In contrast, we take a holistic approach: our proposed DD optimizes both extraction of features from raw input data and embedding of the extracted features to the latent feature space.

1) Manifold Ranking algorithm
Here, we briefly review the Manifold Ranking algorithm [17], [19], whose loss function serves as the basis of our LMR loss. MR learns a distance metric for retrieval of diverse multimedia data (e.g., [18]). MR works either in supervised, semi-supervised, or unsupervised mode. We describe the unsupervised MR below.
Given a set of N unlabeled feature vectors, MR first constructs a manifold graph W by connecting neighboring samples in the feature space. W ∈ ℝ N×N is an adjacency matrix whose element wij represents a similarity between two samples i and j. Usually, wij is calculated from the Euclidean distance between the input feature vectors. The objective of MR is to find a set of ranking score vectors {ri}i=1, …, N according to the diffusion distances on the manifold. The ranking vector ri contains similarity values between query i that serves as the diffusion source and all the other features lying on the manifold. The loss function associated with {ri}i=1, …, N is formulated in (1). For simplicity, we omit some coefficients contained in the original loss function [19].
In (1), the first term is called fitting constraint, which fits the ranking score vector ri of the query i to the one-hot vector yi representing the diffusion source. The second term is called smoothing constraint, which makes the query and its neighbors in the input feature space to have similar ranking scores. Eq. 1 can be solved either by an iterative form or a closed form solutions [17] [19]. Both solutions, however, require high spatial and temporal costs to analyze N×N adjacency matrix when the number of samples N is large (e.g., N > 10k). Our proposed loss function is more efficient and is compatible with DNN training on a larger dataset (e.g., N~100k).

A. OVERVIEW OF DEEPDIFFUSION ALGORITHM
Fig. 1 illustrates our DeepDiffusion (DD) algorithm. The concept of the DD algorithm is to incorporate the idea of the MR algorithm, which leverages smooth diffusion distance on feature manifold for retrieval of diverse data types, into the problem of unsupervised deep representation learning. DD differs significantly from the existing pretext tasks for unsupervised deep representation learning in that DD respects the structure of a latent feature manifold formed by a randomly initialized encoder DNN. During a training by DD, both feature extraction and embedding by the encoder DNN are optimized so that the features are salient and the metric space is continuous and smooth. In the optimized latent feature space, the fixed distance (e.g., Euclidean distance) among the latent features approximates the diffusion distance on the latent feature manifold formed by the initial latent features.
Learning such retrieval-adapted latent feature space is achieved primarily by a novel loss function called Latent Manifold Ranking (LMR) loss. The LMR loss is built upon the loss function of the MR algorithm shown in (1). As depicted in Fig. 1, the LMR loss is computed by using two kinds of features, namely, extrinsic features and intrinsic features. An extrinsic feature is a feature of the input sample embedded by using the encoder DNN. An intrinsic feature is a feature lying on the manifold formed by latent features of all the training samples. The intrinsic features are not computed by the encoder, but updated via training.
One can use the MR loss shown in (1) as-is as a loss function for DNN training. However, directly applying the MR loss to DNN training confronts two issues. First, constructing a full manifold graph having N 2 connections from N training samples suffers from high temporal and spatial computation costs. Our LMR loss mitigates this issue by constructing an ad-hoc, small manifold graph constructed from a subset of data samples for each training iteration. That is, at every training iteration, a bipartite manifold graph is formed between B (B ≪ N) extrinsic features extracted from an input minibatch and N intrinsic features of all training samples. The use of the small manifold graph significantly reduces both temporal and spatial complexities compared to the full manifold graph. Second, the (squared) Euclidean distance used in the MR loss is not necessarily the best criterion for deep learning of distance metric. Recent studies on supervised deep metric learning (e.g., [28]) have demonstrated that comparing latent features by using divergence leads to a latent feature space better than that learned by using the Euclidean distance (e.g., [1], [2], [54]). Following the success of supervised deep metric learning, we employ divergence between probability distributions, instead of the Euclidean distance used in (1), to compare the ranking score vectors. Detail of the LMR loss is described in Section III.B.
After training, the optimized encoder DNN and the optimized latent feature manifold (i.e., intrinsic features) are used to extract features from novel data unseen during the training. We propose two feature extraction methods. The first method simply uses the trained encoder only. The second method exploits both the encoder and the latent feature manifold for feature extraction. Details of the feature extraction are described in Section III.C.
Since the crux of the DD algorithm is the LMR loss function, DD can potentially work in conjunction with any data type and encoder DNN architecture. As mentioned in Section I, we assume that the manifold hypothesis is valid in the latent feature space of an encoder DNN. If the initial latent features lie on a nonlinear manifold, DD is expected to learn retrieval-adapted latent features regardless of input data type. In the experiments, we demonstrate the versatility of DD by using multiple data types, datasets, and encoder DNN architectures.

B. LATENT MANIFOLD RANKING LOSS
This subsection elaborates on the LMR loss, which is the core of the DD algorithm. We first define the symbols necessary to formulate our loss. Let {(xn, IDn)}n=1, …, N be a training dataset containing N unlabeled samples x. Each training sample x is paired with its unique identification number ID, which is used to specify a diffusion source. ID for the n-th training sample is n. In other words, the value of ID differs per training sample and does not change throughout training. The DD algorithm at the training stage takes as its input a mini-batch containing B training samples, i.e., {(xb, IDb)}b=1, …, B. To diversify the training samples, we apply data augmentation to each sample x in the mini-batch to obtain x̂. The augmented sample x̂ is then embedded to the P-dimensional (e.g., P=256) latent feature space by using the encoder DNN parameterized by θ. The embedded feature, or extrinsic feature, is denoted by f, whose L2 norm is normalized to 1. Let M ∈ ℝ N×P be a matrix representing the latent feature manifold formed by all the training samples. M is constructed by stacking a set of N row vectors, i.e., {mn}n=1, …, N. θ and M are optimization targets of the DD algorithm.
Our objective function using the LMR loss is defined as (2), (3), and (4). Similar to the original MR loss (1), our LMR loss (2) comprises the fitting term Lfit and the smoothing term Lsmooth. The coefficient λ balances these two terms.
Fitting term: Lfit constrains the ranking vector rb of the training sample in the mini-batch to be close to the diffusion source vector yb. rb is computed as rb = softmax(fb • M T ). The N-dimensional vector rb holds probabilistic similarities among the feature fb and all the intrinsic features contained in M. yb is an N-dimensional one-hot vector whose IDb-th element is 1 while the other elements are 0. We use crossentropy H(rb, yb) to compare the ranking score vector and the diffusion source vector.
The effect of the fitting term is as follows. By minimizing Lfit, all the extrinsic features are embedded farther from each other since their ranking vectors are pulled toward different diffusion source vectors. Therefore, the fitting term is effective in learning salient features that can distinguish each training sample from other samples contained in the training dataset.
Smoothing term: Lsmooth constrains the extrinsic features and their neighboring intrinsic features to have similar ranking score vectors. Identical to the fitting term, rb of Lsmooth is computed as rb = softmax(fb • M T ). rn is defined as rn = softmax(mn • M T ) where mn is n-th row of M. Thus, rn contains ranking scores from the intrinsic feature mn to all the intrinsic features including mn itself. wbn indicates similarity between the extrinsic feature fb and the intrinsic feature mn. wbn can be interpreted as one of the connection weights of the bipartite manifold graph, whose matrix size is B×N, formed between the B extrinsic features and the N intrinsic features. The value for wbn is computed by using (5), where kNN(fb) is a set of k nearest intrinsic features of fb. We use Cosine distance in the latent feature space to compute kNN(fb). The neighborhood size k is a hyper-parameter to be determined later. D(rb || rn) is a dissimilarity between the two ranking score vectors. We use the Jensen-Shannon divergence, i.e., a symmetric version of the Kullback-Leibler divergence, to measure the dissimilarity between the two probability distributions rb and rn.
By reducing Lsmooth, the extrinsic features and their neighboring intrinsic features are attracted to each other in the latent feature space. In other words, when Lsmooth is small, the extrinsic features are likely to be projected onto the surface of the latent feature manifold formed by the intrinsic features. Therefore, concurrently minimizing Lfit and Lsmooth helps the encoder DNN embed each training sample in the proximity of its corresponding intrinsic feature point on the latent manifold. In such a latent feature space, the distances between the extrinsic features computed as diffusion distances approximate the distances along the surface of the latent feature manifold.
Choosing hyper-parameters: The LMR loss has two hyper-parameters, that are, the balancing parameter λ and the number of nearest intrinsic features k. Both hyper-parameters control smoothness of the feature distribution in the latent feature space. Optimal values for λ and k would depend on various conditions including data samples used for training and the architecture of an encoder DNN. To validate generalization ability of the DD algorithm, we fix λ at 1 and k at 20 when we empirically compare DD against the existing unsupervised feature learning algorithms. We also experimentally investigate the influences that these hyperparameters have on retrieval accuracy.
Initializing learnable parameters: We randomly initialize the set of connection weights θ of the encoder DNN by using the algorithm proposed by He et al. [29]. Subsequently, each of the training samples is projected into a feature in the latent space by using the randomly initialized encoder. These randomly projected feature vectors are used as the initial values of the latent feature manifold matrix M. We employ such initialization of M since several studies [30] [31] have shown that randomly initialized DNNs extract features having some degree of accuracy. We expect that M initialized with the randomly projected features represents more accurate initial manifold structure than M initialized with random values sampled from, for example, a normal distribution. The effectiveness of initializing M with randomly projected features will be shown in the experiments.
Effect of data augmentation: Data augmentation is applied to each training sample prior to passing it to the DNN. We expect that the training with data augmentation yields latent features robust against perturbation of training samples (e.g., affine transformation of 3D shapes). The fitting term Lfit facilitates learning such robust features. That is, ranking vectors of multiple augmented data samples derived from an original training sample are pulled toward the same diffusion source vector as their original. As a result, latent features become less sensitive to the perturbation caused by the data augmentation. The detailed procedure of the data augmentation is described in Section IV.A.
Algorithm 1 summarizes the procedures of unsupervised DNN training by the DD algorithm.

C. FEATURE EXTRACTION
After the training, the encoder DNN and the latent manifold are used for feature extraction from unseen data samples. We propose two features, that are, an embedded feature DD (E) and a diffused feature DD (D). Computation of the embedded feature is straightforward. An input sample is projected to the latent feature space by using the encoder DNN. The output from the encoder, i.e., f, is the DD (E) feature of the input. The Euclidean distance between the pair of DD (E) features is expected to approximate the diffusion distance on the latent feature manifold. Algorithm 2 summarizes the procedures for extracting the DD (E) feature.
We observe that the iterative diffusion process adopted by the traditional diffusion-based ranking algorithms [45] would make the DD (E) feature better fit for retrieval. We thus propose the DD (D) feature, which is a smoothed version of the DD (E) feature. The DD (D) feature is computed by similarity diffusion on the latent feature manifold. After computing f, we find its k nearest intrinsic features kNN(f) from the optimized M, or {mn}n=1, …, N, to determine multiple diffusion sources. kNN(f) is a set of k intrinsic features closest to f found by using the Euclidean distance. The diffusion sources are represented by an N-dimensional vector g0. The n-th element of g0, i.e., g0(n), is computed by using (6).
We use the recursive formula gr = gr˗1 • S, which is one of Algorithm 1 Training by using the DD algorithm.

1: Inputs: The training dataset {(xn, IDn)}n=1, …, N 2:
The encoder DNN parameterized by θ 3: The balancing coefficient λ (e.g., λ=1) 4: The number of neighbors k (e.g., k=20) 5: The optimization algorithm (e.g., Adam [43] with initial learning rate η=10 −4 ) 6: Outputs: Optimized encoder parameters θ 7: Optimized intrinsic features M, or {mn}n=1, …, N , of the training samples 8: Initialization: 9: Randomly initialize θ. 10  The encoder DNN optimized by Algorithm 1 3: The intrinsic features M optimized by Algorithm 1 4: The number of neighbors k (e.g., k=20) 5: The number of iterations R for similarity diffusion (e.g., R=20). 6: Output: The DD (D) feature gR of x 7: Construct the sparse similarity matrix S: (S can be precomputed prior to feature extraction.) 8: S ← M • M T 9: for each row si of S do 10: Sparsify si by replacing all the elements except for the k largest elements with 0. 11: end for 12: Feed the input sample x into the encoder DNN to obtain the latent feature f. 13: Find a set of k nearest intrinsic features of f, i.e., kNN(f), from the intrinsic features M. 14: Create the diffusion source vector g0 by using Eq. (6). 15  The encoder DNN optimized by Algorithm 1 3: Output: The DD (E) feature f of x 4: Feed the input sample x into the encoder DNN to obtain the latent feature f. 5: return f the simplest forms of diffusion computation on a feature manifold [45], to derive similarity. S ∈ ℝ N×N is a sparse similarity graph whose (i, j) element is mi • m T j if mj is included in a set of k nearest neighbors of mi, and 0 otherwise. By iterating the recursive formula, the diffusion source vector gr becomes smooth according to the pairwise similarities of the intrinsic features optimized during the training. We iterate the recursion for R=20 times, whose validity is evaluated in the experiments. After the iteration, gR is scaled to unit length to obtain the DD (D) feature. k for computing the diffusion feature is identical to k used in the LMR loss. Algorithm 3 summarizes the procedures for extracting the DD (D) feature.
Retrieval rankings are generated by using the Euclidean distances among either the DD (E) features, the DD (D) features, or the "fused" features. Intending to improve retrieval accuracy, the fused feature DD (E+D) is formed by concatenating the DD (E) vector (i.e., f) and the DD (D) vector (i.e., the scaled gR).

A. EXPERIMENTAL SETUP
We comprehensively evaluate the effectiveness of the proposed DD algorithm through the experiments using multiple data types (i.e., 3D shape and 2D image), multiple datasets, and multiple encoder architectures.
Datasets: We use three 3D shape datasets and three 2D image datasets. Table 1 summarizes the statistics for each dataset. ModelNet10 (MN10) [32], ModelNet40 (MN40) [32], and ShapeNetCore55 (SN55) [33] contain 3D polygonal models of rigid 3D objects such as furniture, vehicles, and household appliances. We convert each 3D shape to a set of 1,024 3D points by using the algorithm by Ohbuchi [34]. Each 3D point set is normalized so that its gravity center coincides the origin of the 3D space and its whole shape is enclosed by a unit sphere. Fashion MNIST (FMNIST) [35] includes 2D images of clothes, while STL10 [37] consists of 2D natural images. COIL100 [36] contains 2D images of 100 different objects placed on a turntable. For each object, a set of 72 images was taken at intervals of 5 degrees while rotating the turntable. We use the images taken between 0 to 175 degrees for training, and the rest for evaluation.
Retrieval accuracy of the testing data is measured in Mean Average Precision (MAP) [%], which is the de facto standard accuracy index for information retrieval systems.
Implementation details: For a fair comparison, we use the same data augmentation, DNN architecture, and optimization method among all the algorithms used in the experiments.
We employ online data augmentation. Each training sample in the mini-batch is augmented with a probability of 0.8. A 3D point set is augmented by random affine transformation. Specifically, the point set is first randomly rotated about the x-, y-, and z-axes in this order. The rotation angle for each axis is randomly and independently chosen from the uniform distribution U(−5°, 5°). After the rotation, the point set is anisotropically scaled. Directions of scaling are the x-, y-, and z-axes, and scaling factor for each axis is randomly sampled from U(0.8, 1.2). The 3D shape is then sheared successively in the x-, y-and z-axes. Shearing factor for each axis is sampled from U(−0.2, 0.2). Finally, the 3D shape is randomly translated. Displacement along each axis is chosen from U(−0.2, 0.2). A 2D image is augmented by random cropping and horizontal flipping. The 2D image is isotropically scaled by the factor 1.2, and a rectangle having the original image size is cropped at random position. The cropped image is then horizontally flipped with the probability of 0.5.
We use four popular DNN architectures as encoders. To encode 3D point sets, we adopt PointNet [38] and Dynamic Graph CNN (DGCNN) [39]. To process 2D images, we employ Inception V1 (also known as GoogLeNet) [40] and ResNet18 [41]. The number of dimensions P of the latent space is fixed at 256 throughout the experiments. In addition to the encoder DNN, AE and GAN require a decoder DNN. We employ the 3D point set decoder composed of five fullyconnected layers [42] and the 2D image decoder consisting of five deconvolution layers [7]. The DNNs are trained by using mini-batch gradient descent. We use Adam [43] with the initial learning rate of 10 −4 . Each mini-batch contains 16 3D shapes or 64 2D images. We iterate the training for 300 epochs and report the highest MAP scores obtained during the training.
The DD algorithm is implemented in Python using TensorFlow library [44]. Most of the experiments were done on a PC having an Intel Core i9-7900X CPU and an Nvidia GeForce RTX 2080 Ti with 11 GBytes of GPU memory. Table 2 compares retrieval accuracies of the 12 unsupervised feature learning algorithms including the proposed DD. In the tables, "DD (E)" and "DD (D)" indicate the embedded feature and the diffused feature, respectively, learned by using the DD algorithm. "DD (E+D)" denotes concatenation of the two feature vectors above. The hyper-parameters of DD, i.e., k and λ, are fixed at 20 and 1, respectively. As the reference, Table 2 includes "untrained" that shows accuracy of the features extracted by using the randomly initialized encoder DNN.

1) Comparison with existing algorithms
As shown in Table 2, our DD algorithm yields MAP scores higher than the existing methods for almost all the combinations of the encoder DNNs and the datasets. These results verify that optimization using the diffusion distance on the latent feature manifold is effective for learning feature representation adapted to multimedia information retrieval.
In addition, Table 2 compares efficacy of the two proposed features, i.e., DD (E) and DD (D). The winner among DD (E) and DD (D) depends on dataset and encoder used. But the fused feature DD (E+D) slightly but consistently outperforms the two. We suspect that distances among the embedded features of the testing data slightly differ from the diffusion distances on the latent manifold since the testing data are not used in training. Thus, the diffused feature, which is computed by using the latent feature manifold, would assist the embedded feature to find larger number of true neighbors in the latent feature space. Fig. 2 visualizes the latent feature spaces learned by using the three unsupervised feature learning algorithms, i.e., AE, SwAV, and our DD. We use PointNet and Inception V1 as encoder DNNs and use DD (E) as the latent features of DD. The t-SNE algorithm [53] is used to visualize the latent features extracted from the testing samples of the MN10 and FMNIST datasets. As can be observed in Fig. 2, the latent features learned by the DD algorithms are well-separated. That is, each cluster of the latent features consists of one or two semantic categories in most cases and these clusters are reasonably far from each other. Such feature embeddings contribute to the high retrieval accuracy of the DD algorithm shown in Table 2. On the other hand, AE and SwAV tend to learn feature embeddings that are not suitable for information retrieval. In the latent space of AE, the feature clusters overlap each other. SwAV learns favorable feature embeddings for the MN10 dataset, but the latent features of the FMNIST dataset are separated into small clusters. Since the training frameworks of AE and SwAV are not intended to acquire retrieval-adapted features, they do not necessarily form latent feature spaces suitable for retrieval.

2) In-depth evaluation of DD algorithm
This subsection investigates the influences that various design parameters of the DD algorithm have on retrieval accuracy. We use the MN10, MN40, FMNIST, and COIL100 datasets for evaluation. PointNet and Inception V1 are employed as encoder DNNs.
Hyper-parameters: Fig. 3 plots retrieval accuracies of the DD (E) feature against the number of nearest intrinsic features k used for computing the smoothing term. We can observe that, for all the four datasets, MAP scores gradually improve with increasing k from 1 to 20. Optimal k depends on dataset (and probably also on encoder DNN). The FMNIST dataset requires larger k than the other datasets to attain peak accuracy. This is probably because the category size of FMNIST (6,000 samples per category) is larger than those of the other datasets (tens to hundreds samples per category). When the category size is large, computing the smoothing term with large k would facilitate capturing the manifold structure composed of many samples of the same category. Fig. 4 shows the influence the balancing parameter λ of the LMR loss has on the retrieval accuracy of the DD (E) feature. In the graph, λ=0 means the DNN is trained by using the fitting term only. Fig. 4 indicates that the smoothing term has positive impact on retrieval accuracy. However, excessively large λ leads to low retrieval accuracy since gradients derived from the loss function is dominated by the smoothing term. Fig. 5 shows retrieval accuracy of the DD (E+D) feature plotted against the number of iterations R for calculating the diffused feature. Although the influence by R on the accuracy is small, the diffusion with sufficient iterations slightly improves the MAP score. The accuracies saturate at about about R=20 for the MN10, MN40, and COIL100 datasets. In the FMNIST, the accuracy continues to improve at R > 20.

AE
SwAV Visualization of the latent feature space learned by AE [3], SwAV [12], and our DD algorithm.    Since FMNIST has more training data than the other three datasets, more iterations would be required for convergence of the similarity diffusion. Ablation study: We evaluate contributions from the components of the DD algorithm, that are, the loss function, the data augmentation of training samples, and the initialization of intrinsic features. Table 3 demonstrates that all the four components positively impact retrieval accuracy. The two terms of the LMR loss, i.e., Lfit and Lsmooth, are indispensable to learn retrieval-adapted features. The data augmentation clearly improves retrieval accuracy. Diversifying training samples would help the DD algorithm obtain robustness against perturbation of the samples and improve generalization ability. Table 3 also shows that the initialization method for intrinsic features has a significant impact on accuracy. In the column "RPF init.", "Yes" indicates that the intrinsic features are initialized with the features of training data extracted by the encoder having random parameters. "No" means that the intrinsic features are initialized with randomly sampled values [29]. Evidently, initialization with random projection features improves accuracy. Fig. 6 compares training behavior between the two initialization methods. Initialization with random values appears to cause overfitting; MAP of the testing data saturates early (around 120 epochs) while loss value of the training data continues to decrease. In contrast, the training started with randomly projected features is more regularized and yields higher MAP score (nearly 80%) on the MN10 dataset.
Computational cost for training: We evaluate training scalability of the DD algorithm. To do so, we create pseudo large-scale training datasets by duplicating the 3D shape data contained in the MN10 dataset. During training, we measure GPU memory footprint and time per epoch. Fig. 7 shows that the training by DD is reasonably efficient when the number of training samples N is less than 100K. However, both spatial and temporal costs grow significantly as N increases. Using the GPU having 11 GBytes of memory, training is not executable at N > 330K due to an out-of-GPU-memory error. Training by DD requires O(N) spatial complexity since multiple N-dimensional ranking score vectors as well as N intrinsic features need to be stored to compute the LMR loss. The increase in GPU memory usage against N, however, is not linear in Fig. 7. This is probably due to the behavior specific to the TensorFlow library. In terms of training time, the DD algorithm requires O(N 2 ) temporal complexity since the similarities (i.e., wbn in (4)) of all pairs of the N extrinsic features and the N intrinsic features must be calculated during training of one epoch.
The high training costs of the DD algorithm might be mitigated by using subsampling methods (e.g., [46]) to reduce N for loss computation, or approximate nearest neighbor search methods (e.g., [47]) to avoid the all-pair comparison among the extrinsic/intrinsic features. We leave for future work applications of these techniques to improve efficiency of the DD algorithm.

V. DISCUSSION
While the quantitative experiments in Section IV show the effectiveness of the proposed DD algorithm, it is still unclear how DD behaves differently from the existing unsupervised deep feature learning algorithms. As we mentioned in Section I, DD aims to transform the initial latent feature space so that a submanifold consisting of mutually similar features to contract while mutually distinct submanifolds to move farther away from each other. To prove that DD works as intended, this section compares training behaviors of the multiple feature learning algorithms including DD. We visualize and observe changes in the manifold structure in the latent feature space during training. However, the visualization methods for high-dimensional space such as t-SNE used in Fig. 2 are not suitable for observing the true  structure of feature manifold. This is because these visualization methods by themselves use some form of dimension reduction that could obscure the observation. We thus use a far simpler problem, i.e., a 2D toy dataset, to observe manifold structures. Our toy dataset consists of 1,000 2D points lying on the three-arm spiral distribution (see "0th epoch" in Fig. 8a). We first describe the setup of the experiment. We use an encoder DNN consisting of four fully-connected layers. The input and output layers of the encoder have two neurons, 30  while the hidden layers have 1,000 neurons. Following the manifold hypothesis, we want to create a nonlinear feature manifold (e.g., the three-arm spiral) in the latent space of the encoder DNN. To do so, we pretrain the encoder DNN so that it learns an identity mapping. That is, we sample 10,000 2D points from the uniform distribution as training data for pretraining. The encoder DNN is pretrained so that the input 2D points are embedded in the same coordinates as the inputs in the 2D latent space. After the pretraining, the encoder DNN is further trained by using either PID, DC, SimCLR, or our DD. For DD, we use the different values for the number of neighbors k, i.e., k=5, 20, and 200. The 1,000 2D points lying on the three-arm spiral is used as a training dataset. The training is iterated for 6,000 epochs. In addition to observing the latent feature space, we compute retrieval accuracy of the latent features. To measure retrieval accuracy, we embed all the 1,000 2D toy data points in the training dataset and compare these latent features by using the Euclidean distance. A MAP score is computed assuming that the features lying on the same submanifold, i.e., the same arm of the spiral, belong to the same semantic category. Fig. 8a visualizes changes in the 2D latent feature space during training. Fig. 8b shows retrieval accuracies, measured in MAP, during training. Note that the latent feature distribution and the MAP score (77%) at the beginning of training (0th epoch) are the same across all the methods in Fig. 8. Evidently, the six cases in Fig. 8 show different training behaviors. PID and DC destroy the structure of the initial latent feature manifold. That is, the three submanifolds during training overlap with each other, resulting in decrease in retrieval accuracy. PID in this experiment tries to embed the latent features in a straight line so that each training sample can be discriminated. However, PID fails to separate the green submanifold from the other two. DC in this experiment mixes up the latent features after 2,000 epochs. This is because pseudo labels, i.e., cluster centers, are generated among different submanifolds. Latent features are mixed up since those belonging to the different submanifolds are attracted to the same cluster center. For SimCLR, the latent feature space is hardly distorted and retrieval accuracy is almost unchanged throughout training. This is probably because the training objective of SimCLR, i.e., making distances among positive pairs smaller than distances among negative pairs, is achieved to some extent in the initial latent space. Although SimCLR does not show high retrieval accuracy, it is likely to yield high classification accuracy if a nonlinear classifier is trained in the learned latent feature space.
In contrast, DD behaves differently from PID, DC, and SimCLR. When DD uses an appropriate value for k (i.e., DD with k=20), the nonlinearly distributed initial latent features are successfully disentangled. Although continuity of the initial feature distribution is lost to some degree, the three submanifolds at the 6,000th epoch are mutually separated and the latent features on the same arm are embedded close to each other. Retrieval accuracy of DD with k=20 improves as the training progresses, and MAP score reaches nearly 97%. These results indicate that the initial latent features properly adapt to retrieval via the training by DD.
The success of DD stems from the design of its loss function consisting of both the fitting term and the smoothing term. As we mentioned in Section III.B, the fitting term acts to move latent features away from each other. Therefore, using only the fitting term ignores the manifold structure and would result in a latent feature space similar to PID. The smoothing term helps DD consider the manifold structure by constraining neighboring latent features to have similar ranking scores. Combining the fitting term and the smoothing term (with proper k) enables DD to transform the latent features while preserving the proximity of data points in neighborhoods on the same submanifold. Fig. 8 also shows that using an inappropriate value for k leads DD to training failure. When k is small, i.e., k=5, the latent features after the training of 6,000 epochs are split into numerous small clusters and retrieval accuracy scarcely changes from the initial MAP score. Or, if k is too large, i.e., k=200, a collapse of the latent feature space occurs. The collapse occurs since the smoothing term acts strongly at a large scale. Each latent feature attracts 200 of its neighbors out of 1,000 training samples. Such a global influence force all the features to converge at the same point in the latent space. As also demonstrated in Fig. 3, the smoothing term and its hyperparameter k have a significant impact on the retrieval accuracy of DD.

VI. CONCLUSION AND FUTURE WORK
This paper tackled the problem of unsupervised learning of feature representations suitable for multimedia information retrieval. To obtain retrieval-adapted features without relying on semantic labels, we proposed the novel algorithm called DeepDiffusion (DD). The DD exploits diffusion distances on a latent feature manifold to optimize a feature extraction and embedding as well as a distance metric among embedded features. The optimization is achieved by using the carefully designed loss function, named Latent Manifold Ranking (LMR) loss, which encourages to form the latent feature space suitable for computing similarities among data samples. The comprehensive evaluation showed the following advantages and limitation of the DD algorithm.
 DD is capable of learning features more adapted to information retrieval than the existing deep learningbased unsupervised feature learning algorithms.
 DD is versatile as it learns retrieval-adapted features regardless of the data types, the datasets, and the encoder DNN architectures we have tried.
 DD is reasonably efficient on datasets having less than 100K training samples. However, for larger datasets, DD suffers from high spatial and temporal computational costs.
We also evaluated a difference in training behavior among the DD algorithm and the existing unsupervised feature learning algorithms. The visualization of the latent feature space demonstrated that DD with an appropriate hyperparameter learns a distance metric that reflects the manifold structure in an initial latent feature space.
Future work includes evaluating the DD algorithm on datasets having more samples (e.g., 1M) and datasets consisting of multimedia data other than 3D shape and 2D image (e.g., text document). As we discussed in the section on experiments, the current DD algorithm has difficulty in training using very large datasets. We will thus consider improving computational efficiency of the DD. Also, we intend to extend the DD algorithm to a scenario of semisupervised or supervised learning of retrieval-adapted feature representations.