Probability Distribution-Based Dimensionality Reduction on Riemannian Manifold of SPD Matrices

Representing images and videos with Symmetric Positive Definite (SPD) matrices and utilizing the intrinsic Riemannian geometry of the resulting manifold has proved successful in many computer vision tasks. Since SPD matrices lie in a nonlinear space known as a Riemannian manifold, researchers have recently shown a growing interest in learning discriminative SPD matrices with appropriate Riemannian metrics. However, the computational complexity of analyzing high-dimensional SPD matrices is nonnegligible in practical applications. Inspired by the theory of nonparametric estimation, we propose a probability distribution-based approach to overcome this drawback by learning a mapping from the manifold of high-dimensional SPD matrices to a manifold of lower dimension, which can be expressed as an optimization problem on the Grassmann manifold. Specifically, we perform dimensionality reduction for high-dimensional SPD matrices with popular Riemannian metrics and an affinity matrix constructed from an estimated probability density function (PDF) to achieve maximum class separability. Evaluations on several classification tasks show the competitiveness of the proposed approach compared with state-of-the-art methods.


I. INTRODUCTION
In recent years, Symmetric Positive Definite (SPD) matrices have received significant attention in the field of computer vision, due to their capacity to characterize data variations. As an effective and powerful representation of images and videos, SPD matrices have been employed in diverse applications. Successful examples include covariance descriptors for recognizing faces [1]-[3], actions [4], textures [5], gestures [6], and pedestrians [7], as well as diffusion tensors for segmenting images [8], [9].
While offering advantages and promising properties for various computer vision tasks, handling SPD matrices on the basis of their intrinsic geometry is a challenge. As shown in several studies [10], [11], the space of SPD matrices is not a vector space but a non-linear Riemannian manifold. Though it is feasible to apply Euclidean geometry to SPD matrices and employ the matrix Frobenius norm to measure the similarity between them, poor performance and undesirable side-effects would be the end result [8], [12]. In other words, conventional learning methods based on the Euclidean structure are inadequate for analyzing SPD matrices owing to their neglect of the manifold geometry. In an attempt to generalize algorithms designed for Euclidean spaces to Riemannian manifolds, several researchers [2], [3], [13] have used popular Riemannian metrics [8], [9], [14] to account for the geometry of SPD matrices. Briefly, these works can be categorized into two major types: the first aims to obtain a locally flattened approximation of the manifold in the tangent space, where linear geometry applies [6], [7], [15]; the second embeds the manifold into a high-dimensional Reproducing Kernel Hilbert Space (RKHS) with an implicit map and then uses kernel-based methods [2], [3], [13], [16]. However, mapping SPD matrices either into a tangent space or into a high-dimensional RKHS inevitably distorts the geometry of the original manifold, and the computational cost increases dramatically with the size of the SPD matrices.

The associate editor coordinating the review of this manuscript and approving it for publication was Aniruddha Datta.

VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

FIGURE 1. Conceptual illustration of typical learning methods on the manifold of SPD matrices. Methods of type (a) flatten the original manifold S^n_++ with the tangent space S^n_+ (the space of symmetric matrices) and then apply Euclidean geometry. Type (b) methods embed the original manifold S^n_++ into an RKHS H with an implicit kernel map, which makes it possible to exploit various kernel-based methods. The approach proposed in this article, type (c), learns a mapping from the original manifold S^n_++ to a lower-dimensional one S^m_++ (m < n) directly, with the additional objective of enhancing the discriminative power of the representation.
Since high-dimensional SPD matrices are often encountered in computer vision, another type of learning method [17]-[19] has been proposed to seek a column full-rank projection matrix mapping from the manifold of high-dimensional SPD matrices to a lower-dimensional and more discriminative one. The work in [18] aims to learn a tangent map that directly transforms the logarithms of SPD matrices from the original tangent space to a more discriminative one, and formulates the learning as an optimization problem of seeking a Mahalanobis-like matrix. While a mapping from the original manifold to the target manifold can be derived from this tangent map, it distorts the geometry of SPD matrices to a certain extent. In contrast, the work in [17] directly transforms the original manifold to a lower-dimensional, more discriminative manifold. It restricts the projection matrix to be orthonormal, and thus implements the learning as an optimization problem on the Grassmann manifold. It is noteworthy that the Riemannian metrics used in this framework must be affine-invariant, and the solution space of the optimal projection matrix is constrained. Thankfully, the work in [19] developed an adaptation that relaxes these restrictions and extends the framework beyond affine-invariant Riemannian metrics. The key concepts of typical learning methods on the manifold of SPD matrices are illustrated in Fig. 1.
In this article, we propose an improved manifold-to-manifold dimensionality reduction approach for SPD matrices that shares the same scheme as [17]-[19], yet enhances the discriminative power by exploiting the probability distribution of SPD matrices to define a weighted affinity measure for the optimization. Compared with [17], [19], the main contribution of this work is twofold:
• By exploiting the probability distribution of SPD matrices, we provide a new solution to the problem of constructing the affinity matrix. As the probability distribution indicates the membership grade of a sample to a class, the resulting affinity matrix is more accurate than one based on a binary nearest neighbor selection.
• To reflect the notion of between-class dispersion more accurately, we employ a 'sample-to-class' membership to represent the dissimilarities of inter-class samples instead of the simplistic 'sample-to-sample' distance. Besides, we introduce a new strategy that focuses on the points from other classes that are closest to a given point. The number of samples drawn from such selected classes is determined by a control parameter.

II. PRELIMINARIES
This section provides an overview of Riemannian geometry on the manifold of SPD matrices and several popular Riemannian metrics.

A. RIEMANNIAN MANIFOLD OF SPD MATRICES
In mathematics, a differentiable manifold M is a topological space that is locally similar to Euclidean space and has a globally defined differential structure. For a point P on the manifold, its tangent space T_P M is defined by the tangent vectors of all possible curves passing through P. SPD matrices are symmetric matrices whose eigenvalues are all positive. As mentioned above, the space of n × n SPD matrices (denoted by S^n_++) is not a subspace of a Euclidean space but a specific Riemannian manifold with nonpositive curvature. More precisely, S^n_++ is a differentiable manifold endowed with an inner product on each tangent space. For a point X ∈ S^n_++, the tangent space T_X S^n_++ is related to the manifold by the logarithm map log_X : S^n_++ → T_X S^n_++ and is equipped with an inner product ⟨·, ·⟩_X. A smoothly varying family of inner products on all tangent spaces is known as a Riemannian metric. It enables us to define various geometric notions on the manifold, such as angles and lengths of curves. The geodesic distance between two points on the manifold is the length of the shortest curve connecting them. Such shortest curves are known as geodesics; they are analogous to straight lines in R^n. Obviously, the geodesic distance induced by a Riemannian metric is a more appropriate measure for comparing two SPD matrices than the Euclidean distance.

B. RIEMANNIAN METRICS
A number of metrics on S^n_++ have recently been proposed to exploit its nonlinear geometry, but not all of them define a true geodesic distance on the manifold. The two most popular ones are the Affine-Invariant Metric (AIM) [8] and the Log-Euclidean Metric (LEM) [12].
Although AIM has a high computational cost, its invariance to affine transformations is a desirable property in practice for computer vision tasks. Given a point X ∈ S^n_++, AIM is defined as

⟨V, W⟩_X = tr(X^{-1} V X^{-1} W),  (1)

where V, W ∈ T_X S^n_++, and T_X S^n_++ denotes the tangent space of S^n_++ at the point X. The geodesic distance between two points X_i, X_j ∈ S^n_++ induced by AIM is

δ_A(X_i, X_j) = ‖log(X_i^{-1/2} X_j X_i^{-1/2})‖_F,  (2)

where log(·) is the matrix logarithm operator and ‖·‖_F denotes the matrix Frobenius norm. Another widely used Riemannian metric, LEM, is expressed through classical Euclidean computations on matrix logarithms; its geodesic distance is

δ_L(X_i, X_j) = ‖log(X_i) − log(X_j)‖_F.  (3)

LEM also defines a true geodesic distance on S^n_++ and is simple to use compared with AIM.
Besides, the Stein divergence [14], derived from the Bregman matrix divergence, approximates the geodesic distance induced by AIM while being less expensive to compute and preserving invariance to affine transformations. The approximated geodesic distance is defined by

δ_S^2(X_i, X_j) = log det((X_i + X_j)/2) − (1/2) log det(X_i X_j),  (4)

where det(·) is the matrix determinant operator.
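To make the three measures concrete, the following NumPy sketch computes the AIM geodesic distance (2), the LEM distance (3), and the Stein divergence (4) for a pair of SPD matrices, and numerically checks the affine invariance of AIM; the helper names are ours, not from the paper.

```python
import numpy as np

def spd_logm(X):
    """Matrix logarithm of an SPD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(X)
    return (V * np.log(w)) @ V.T

def dist_aim(X, Y):
    """AIM geodesic distance: ||log(X^{-1/2} Y X^{-1/2})||_F, as in (2)."""
    w, V = np.linalg.eigh(X)
    X_isqrt = (V / np.sqrt(w)) @ V.T          # X^{-1/2}
    return np.linalg.norm(spd_logm(X_isqrt @ Y @ X_isqrt), 'fro')

def dist_lem(X, Y):
    """Log-Euclidean distance: ||log(X) - log(Y)||_F, as in (3)."""
    return np.linalg.norm(spd_logm(X) - spd_logm(Y), 'fro')

def stein_div(X, Y):
    """Stein divergence of (4): log det((X+Y)/2) - (1/2) log det(XY)."""
    s = lambda M: np.linalg.slogdet(M)[1]
    return s((X + Y) / 2) - 0.5 * (s(X) + s(Y))

# Random SPD test matrices.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5)); X = A @ A.T + 5 * np.eye(5)
B = rng.standard_normal((5, 5)); Y = B @ B.T + 5 * np.eye(5)

# Affine invariance of AIM: d(MXM^T, MYM^T) = d(X, Y) for invertible M.
M = rng.standard_normal((5, 5)) + 5 * np.eye(5)
assert abs(dist_aim(M @ X @ M.T, M @ Y @ M.T) - dist_aim(X, Y)) < 1e-6
```

The same invariance check also holds for the Stein divergence, while the Log-Euclidean distance is only invariant under orthogonal congruences.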

III. PROPOSED APPROACH
In this section, we describe the proposed dimensionality reduction approach for SPD matrices. We first formulate the problem of dimensionality reduction for SPD matrices, then introduce the process of constructing an affinity matrix from an estimated probability density function (PDF) to yield maximum discriminability, and finally provide a detailed solution to the corresponding optimization problem.

A. DIMENSIONALITY REDUCTION FOR SPD MATRICES
Suppose we have a set of SPD matrices X = {X_1, . . . , X_N}, X_i ∈ S^n_++. To achieve the goal of dimensionality reduction, a map f : S^n_++ × R^{n×m} → S^m_++, m < n, from the original manifold S^n_++ to the target manifold S^m_++ of lower dimensionality, with a parameter W ∈ R^{n×m}, is defined as

f(X, W) = W^T X W.  (5)

Thus, we seek a transformation W that ensures the resulting space preserves a structure similar to that of the original data and forms a valid manifold of SPD matrices S^m_++. Obviously, W is required to be a column full-rank matrix. Analogously to [17], [19], we construct an N × N real matrix G as an affinity matrix. Its element G(i, j) encodes a measure of similarity or dissimilarity of the SPD matrices X_i and X_j. Detailed information about this affinity matrix is given in Sect. III-B.
With the affinity matrix G, we define a cost function that describes the overall distribution of SPD matrices in the target manifold as

L(W) = Σ_{i,j} G(i, j) δ^2(W^T X_i W, W^T X_j W),  (6)

where δ can be one of the geodesic distances induced by the Riemannian metrics in (2), (3), or (4). To avoid potential degeneracies, and following common practice in dimensionality reduction, the solution is enforced to be orthogonal, i.e., W^T W = I_m, where I_m is the identity matrix of size m. It has been proven that this additional requirement entails no loss of generality for the Riemannian metrics mentioned in Sect. II-B [17], [19]. Formally, the procedure of dimensionality reduction for SPD matrices can be expressed as the minimization problem

min_{W ∈ R^{n×m}, W^T W = I_m} L(W).  (7)
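As a sketch of the objective in (6)-(7), the code below evaluates the map f(X, W) = W^T X W of (5) and the cost L(W), using the Log-Euclidean distance as δ and a random orthonormal W; the function names and toy data are our own.

```python
import numpy as np

def spd_logm(X):
    """Matrix logarithm of an SPD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(X)
    return (V * np.log(w)) @ V.T

def project(X, W):
    """The map of (5): f(X, W) = W^T X W, from S^n_++ to S^m_++."""
    return W.T @ X @ W

def cost(W, Xs, G):
    """The objective of (6), with the Log-Euclidean distance as delta."""
    logs = [spd_logm(project(X, W)) for X in Xs]
    return sum(G[i, j] * np.linalg.norm(logs[i] - logs[j], 'fro') ** 2
               for i in range(len(Xs)) for j in range(len(Xs)))

rng = np.random.default_rng(1)
n, m, N = 6, 3, 4
Xs = []
for _ in range(N):
    A = rng.standard_normal((n, n))
    Xs.append(A @ A.T + n * np.eye(n))        # random SPD matrices
W, _ = np.linalg.qr(rng.standard_normal((n, m)))   # orthonormal: W^T W = I_m
G = rng.standard_normal((N, N)); G = (G + G.T) / 2
assert np.allclose(W.T @ W, np.eye(m))             # constraint of (7)
```

Since W has full column rank, each projected matrix W^T X W is again SPD, so the target points indeed live on S^m_++.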

B. AFFINITY MATRIX
Although different criteria can be applied to construct the affinity matrix, we choose the supervised scenario, since discriminative information is critical to classification. Suppose there are C different classes and each point X_i ∈ S^n_++ belongs to one of them. In this article, the affinity matrix G used by the cost function in (6) is defined as

G = G_w − G_b,  (8)

which resembles the Maximum Margin Criterion (MMC) [20]. G_w and G_b are affinity matrices encoding within-class similarity and between-class dissimilarity, respectively. The theory of Graph Embedding was exploited by [17], [19] to define G_w and G_b in the form of symmetric binary matrices constructed from nearest neighbor graphs:

G_w(i, j) = 1 if X_j ∈ N_w(X_i) or X_i ∈ N_w(X_j), and 0 otherwise,  (9)
G_b(i, j) = 1 if X_j ∈ N_b(X_i) or X_i ∈ N_b(X_j), and 0 otherwise,  (10)

where N_w(X_i) is the set of λ_w nearest intra-class neighbors of X_i, and N_b(X_i) is the set of λ_b nearest inter-class neighbors of X_i. λ_w and λ_b reflect the trade-off between within-class compactness and between-class dispersion. With this affinity matrix G, the mapping f defined in (5), with an optimal W learned from (7), is considered a discriminative mapping that not only minimizes the intra-class distances but also maximizes the inter-class distances. Nevertheless, there are two problems to overcome. First, the binary nearest neighbor selection is too simplistic. It does not differentiate the degrees of (dis)similarity between two points, which makes the mapping sensitive to scattered samples (see case (a) in Fig. 2). Second, the between-class affinity matrix G_b defined in (10) encodes the between-class dispersion via a few selected 'sample-to-sample' pair-wise distances rather than 'sample-to-class' membership (see case (b) in Fig. 2).

FIGURE 2. Two instances illustrating the deficiencies of the symmetric binary affinity matrix G. In case (a), the elements of G_w relating X_1 to its nearest intra-class neighbours (including X_1 itself) are set to 1 when λ_w = 5. However, the distances between X_1 and the other points vary dramatically, and this distribution cannot be reflected in G_w with binary elements. In case (b), the nearest inter-class neighbours of X_1 in class c_1 (λ_b = 2) only include samples from class c_2, while the distance from X_1 to a sample from class c_3 is almost the same as the distances from X_1 to samples from class c_2. In other words, the affinity matrix G_b can be unreliable, since it encodes the inter-class dissimilarity with just a few pairs of samples, especially when λ_b is small.
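A minimal sketch of the binary graph construction in (9) and (10), given a precomputed pairwise geodesic-distance matrix; the exact symmetrisation rule is our assumption, not necessarily the implementation of [17], [19].

```python
import numpy as np

def binary_affinities(D, labels, lam_w, lam_b):
    """Build the symmetric binary graphs G_w and G_b of (9)-(10) from a
    pairwise geodesic-distance matrix D (a sketch)."""
    N = len(labels)
    Gw = np.zeros((N, N))
    Gb = np.zeros((N, N))
    for i in range(N):
        same = np.where(labels == labels[i])[0]     # includes i itself
        diff = np.where(labels != labels[i])[0]
        # lam_w nearest intra-class neighbours of X_i.
        for j in same[np.argsort(D[i, same])][:lam_w]:
            Gw[i, j] = Gw[j, i] = 1.0
        # lam_b nearest inter-class neighbours of X_i.
        for j in diff[np.argsort(D[i, diff])][:lam_b]:
            Gb[i, j] = Gb[j, i] = 1.0
    return Gw, Gb

rng = np.random.default_rng(2)
D = rng.random((6, 6)); D = (D + D.T) / 2
np.fill_diagonal(D, 0.0)
labels = np.array([0, 0, 0, 1, 1, 1])
Gw, Gb = binary_affinities(D, labels, lam_w=2, lam_b=1)
assert (Gw[labels[:, None] != labels[None, :]] == 0).all()
```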
To overcome these problems, we introduce a novel definition of the affinity matrix based on the probability distribution of SPD matrices. Let X be a set. A probability density function on X is a function f : X → R_+ that satisfies ∫_X f(x) dx = 1. We use Kernel Density Estimation (KDE) to obtain a fine-grained estimate f̂_p(x) of the distribution f_p(x) of a set X_p. It is defined as

f̂_p(x) = (1/N_p) Σ_{i=1}^{N_p} (1/h^d) K((x − x_i)/h),  (11)

where N_p is the number of points x_i ∈ X_p, d is the dimension of the data, K is a kernel satisfying K(x) ≥ 0, ∀x ∈ X, and ∫ K(x) dx = 1, and h is a bandwidth parameter (also known as the smoothing parameter). By utilizing the Gaussian kernel

K(x) = (2π)^{−d/2} exp(−x^T x / 2),  (12)

we can combine the bandwidth parameter h with the covariance (i.e., Σ = I and H = diag(h)) and express the estimate as

f̂_p(x) = (1/N_p) Σ_{i=1}^{N_p} (2π)^{−d/2} |H|^{−1/2} exp(−(x − x_i)^T H^{−1} (x − x_i)/2).  (13)

As introduced in Section I, we propose to rely on the probability distribution of SPD matrices to define a new affinity matrix Ĝ. However, the multi-dimensional KDE defined in (11) is based on a Euclidean structure and does not work for SPD matrices. If one attempts to interpret SPD matrices as living in a Euclidean space and applies the multi-dimensional KDE to them, the underlying PDF involves a 'delta function' which vanishes as one moves away from the manifold and 'becomes infinite' on the manifold in order to have a proper normalization [21]. More specifically, the estimator will 'try to converge' to a singular PDF that is infinite on the manifold and zero elsewhere. In order to estimate the PDF on the manifold of SPD matrices, we could map the matrices to a tangent space and transform them into vectors. Since KDE may produce less reliable and very sparse PDFs [22], we could then learn a low-dimensional representation of these vectors through PCA. However, this solution is inconsistent with the aim of performing discriminative manifold-to-manifold dimensionality reduction for SPD matrices.
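The Euclidean KDE of (11)-(13), with a Gaussian kernel and H = diag(h), can be sketched as follows; the Monte Carlo check that the estimate integrates to one is our addition.

```python
import numpy as np

def gaussian_kde(x, data, h):
    """Multivariate KDE with a Gaussian kernel and H = diag(h), as in (13):
    f_hat(x) = (1/N) * sum_i N(x; x_i, H)."""
    d = data.shape[1]
    H = np.diag(np.full(d, h))
    Hinv = np.linalg.inv(H)
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(H) ** (-0.5)
    diffs = x - data                                  # shape (N, d)
    expo = -0.5 * np.einsum('nd,de,ne->n', diffs, Hinv, diffs)
    return norm * np.exp(expo).mean()

rng = np.random.default_rng(3)
data = rng.standard_normal((500, 2))
# A KDE integrates to 1; coarse Monte Carlo check over [-5, 5]^2 (area 100).
grid = rng.uniform(-5, 5, size=(20000, 2))
vals = np.array([gaussian_kde(g, data, h=0.5) for g in grid])
print(vals.mean() * 100.0)   # should be near 1
```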
Recently, [23] analyzed KDE in spaces of Gaussian distributions endowed with different metrics. Explicit expressions for the estimator were provided based on the Fisher metric defined in terms of multivariate centered Gaussian distributions. Under the Fisher metric, the space of multivariate centered Gaussian distributions is isometric to the space of SPD matrices under the affine-invariant metric, and the distance between two centered Gaussians is [8]

d(X_i, X_j) = ‖log(X_i^{−1/2} X_j X_i^{−1/2})‖_F,  (14)

which coincides with the AIM geodesic distance in (2). Suppose {X_1, . . . , X_N} is the set of n × n SPD matrices with corresponding labels c_i ∈ {1, . . . , C}. The estimator given in [23] for class c takes the form

f̂_c(X) = (1/N_c) Σ_{i : c_i = c} K_h(d(X, X_i)),  (15)

where N_c is the number of samples of class c and the normalization of the kernel K_h depends on the eigenvalues of log(X^{−1/2} X_i X^{−1/2}) (see [23] for the explicit expression). Naturally, f̂_c(X) indicates the membership grade of a sample X to a class c, which describes the local geometrical structure of the data. We define the new within-class affinity matrix Ĝ_w as

Ĝ_w(i, j) = |f̂_c(X_i) − f̂_c(X_j)| if X_j ∈ N̂_w(X_i) or X_i ∈ N̂_w(X_j), and 0 otherwise,  (16)

where c is the shared class of the intra-class pair (X_i, X_j) and N̂_w(X_i) is the set of λ_w 'furthest' intra-class neighbors of X_i, selected on the basis of their proximities to X_i in descending order.
Note that the definition of N̂_w(X_i) here differs from that of N_w(X_i) in (9), where the latter is selected on the basis of ascending distance to X_i measured by a Riemannian metric. The reason for this change is that we intend to construct an improved Ĝ_w by giving priority to the weights of samples from the same class that are far away from each other. The proximity of two samples is computed as the absolute value of the difference between their membership grades to class c (i.e., f̂_c(X)). For two intra-class samples, a large difference between their membership grades means that bringing them close to each other contributes to the objective of within-class compactness, and vice versa. In other words, we mainly focus on keeping samples in overall close proximity instead of equally reducing their relative distances. As a consequence, Ĝ_w encodes the within-class similarities by selecting pairs of intra-class samples with large variances and marks the corresponding elements with real-valued (rather than binary) coefficients. Analogously to (16), a new between-class affinity matrix Ĝ_b is defined as

Ĝ_b(i, j) = f̂_{c_j}(X_i) if c_j ∈ Ĉ_b(X_i) and X_j ∈ T^s_{c_j}, and 0 otherwise,  (17)

where c_j denotes the class of X_j, Ĉ_b(X_i) is the set of λ_b most confounded classes of X_i, selected on the basis of the membership grades of X_i in these classes arranged in descending order (i.e., f̂_{c_j}(X_i)), and T^s_{c_j} = {X_k}_{k=1}^{λ_s} is the set of λ_s most representative samples in class c_j, selected on the basis of the membership grades of the samples of class c_j to their own class in descending order. λ_s controls the number of samples selected from a class c to describe the between-class dissimilarity between c and an inter-class sample X. Finally, the affinity matrix Ĝ is calculated by Ĝ = Ĝ_w − Ĝ_b.
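Putting the two definitions together, the following sketch builds Ĝ = Ĝ_w − Ĝ_b from a precomputed membership matrix F with F[i, c] standing in for f̂_c(X_i); the tie-breaking, the symmetrisation, and the use of f̂_{c_j}(X_i) as the Ĝ_b weight are our reading of (16)-(17), not the authors' exact implementation.

```python
import numpy as np

def pdf_affinities(F, labels, lam_w, lam_b, lam_s):
    """Sketch of G_hat = G_hat_w - G_hat_b from membership grades F[i, c]."""
    N, C = F.shape
    Gw = np.zeros((N, N))
    Gb = np.zeros((N, N))
    for i in range(N):
        c = labels[i]
        same = np.array([j for j in np.where(labels == c)[0] if j != i])
        # 'Furthest' intra-class neighbours: largest membership differences.
        diffs = np.abs(F[same, c] - F[i, c])
        for j in same[np.argsort(-diffs)][:lam_w]:
            Gw[i, j] = Gw[j, i] = abs(F[i, c] - F[j, c])
        # lam_b most confounded classes: highest membership grades of X_i.
        conf = sorted((k for k in range(C) if k != c),
                      key=lambda k: -F[i, k])[:lam_b]
        for k in conf:
            members = np.where(labels == k)[0]
            # lam_s most representative samples of class k.
            for j in members[np.argsort(-F[members, k])][:lam_s]:
                Gb[i, j] = Gb[j, i] = F[i, k]
    return Gw - Gb

rng = np.random.default_rng(4)
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
F = rng.random((9, 3))                     # stand-in membership grades
G = pdf_affinities(F, labels, lam_w=2, lam_b=1, lam_s=2)
assert np.allclose(G, G.T)
```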

C. OPTIMIZATION
As discussed in [17], the minimization problem in (7) can be regarded as an optimization problem on a Grassmann manifold and solved by the Riemannian Conjugate Gradient (RCG) method [24]. Therefore, we need to compute the Jacobian matrices of δ^2 with respect to W. For AIM and the Stein divergence, with A = W^T X_i W and B = W^T X_j W, the corresponding Jacobian matrices are derived as

∂δ_A^2/∂W = 4 (X_i W A^{−1} − X_j W B^{−1}) log(A B^{−1}),

∂δ_S^2/∂W = (X_i + X_j) W ((A + B)/2)^{−1} − X_i W A^{−1} − X_j W B^{−1}.

Although there is no analytical form for the gradient of LEM with respect to W, the approximation log(W^T X W) ≈ W^T log(X) W was given by [19], and the required gradient is then

∂δ_L^2/∂W ≈ 4 (log(X_i) − log(X_j)) W W^T (log(X_i) − log(X_j)) W.

The procedure of our probability distribution-based dimensionality reduction approach is summarized in Algorithm 1, where ∇_W L(W) is the gradient on the Grassmann manifold and τ(H, W_0, W_1) is the parallel transport of the tangent vector H from W_0 to W_1. The main computational cost of this algorithm lies in the iterative Riemannian optimization, which includes three major steps: evaluating the objective, computing the gradient, and performing RCG. A detailed analysis of the complexity of these three steps can be found in [19].
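The Stein Jacobian lends itself to a quick numerical sanity check: the sketch below implements δ_S² and its gradient with respect to W, and compares the latter against central finite differences (all names are ours).

```python
import numpy as np

def stein_sq(A, B):
    """delta_S^2(A, B) = log det((A+B)/2) - (1/2) log det(AB), as in (4)."""
    s = lambda M: np.linalg.slogdet(M)[1]
    return s((A + B) / 2) - 0.5 * (s(A) + s(B))

def grad_stein_W(W, Xi, Xj):
    """Analytic Jacobian of delta_S^2(W^T Xi W, W^T Xj W) w.r.t. W."""
    A = W.T @ Xi @ W
    B = W.T @ Xj @ W
    return (2 * (Xi + Xj) @ W @ np.linalg.inv(A + B)
            - Xi @ W @ np.linalg.inv(A)
            - Xj @ W @ np.linalg.inv(B))

rng = np.random.default_rng(5)
n, m = 5, 3
P = rng.standard_normal((n, n)); Xi = P @ P.T + n * np.eye(n)
Q = rng.standard_normal((n, n)); Xj = Q @ Q.T + n * np.eye(n)
W = rng.standard_normal((n, m))

f = lambda V: stein_sq(V.T @ Xi @ V, V.T @ Xj @ V)
G = grad_stein_W(W, Xi, Xj)

# Central finite differences, entry by entry.
num = np.zeros_like(W)
eps = 1e-6
for a in range(n):
    for b in range(m):
        E = np.zeros_like(W); E[a, b] = eps
        num[a, b] = (f(W + E) - f(W - E)) / (2 * eps)
assert np.allclose(G, num, atol=1e-5)
```

In a full RCG implementation this Euclidean Jacobian would still need to be projected onto the tangent space of the Grassmann manifold before the line search.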

D. COVARIANCE DESCRIPTORS
When we use SPD matrices as the input of our approach, we refer to covariance matrices computed from feature vectors. Let S = [s_1, s_2, . . . , s_r] be the data matrix of a set with r samples, where s_i ∈ R^n denotes the n-dimensional vector representation of the i-th sample. A sample-based covariance matrix C is defined as

C = (1/(r − 1)) Σ_{i=1}^{r} (s_i − s̄)(s_i − s̄)^T,

where s̄ = (1/r) Σ_{i=1}^{r} s_i is the mean of all samples. Besides being a fundamental concept in various fields of computer vision, the covariance matrix was initially introduced as a region descriptor by [5] over ten years ago. Given a w × l image region, a feature vector s_i ∈ R^n

Algorithm 1 Optimization Algorithm
Input: A set of SPD matrices {X_i}_{i=1}^N with corresponding labels c_i ∈ {1, . . . , C}; parameters m, h, λ_w, λ_b, and λ_s.
Output: The optimal transform W.
1: Construct the affinity matrix Ĝ using (16) and (17).
2: W_old ← W_0 (i.e., the truncated identity matrix I_{n×m}); H ← −∇_W L(W_old).
3: repeat
4:   Line search along the geodesic in the direction H from W_old to find W* = arg min_W L(W).
5:   H ← −∇_W L(W*) + β τ(H, W_old, W*), with β the conjugate gradient coefficient.
6:   W_old ← W*.
7: until convergence
8: W ← W*.
is extracted from each pixel of this region to describe its attributes such as location, color, gradient, filter response, etc. With these feature vectors, an n × n covariance matrix C can be computed to represent this image region. There are several merits of choosing the covariance matrix as a region descriptor. First, it provides a natural manner to fuse various feature types. Second, it is robust against noise and outliers due to its averaging operation. Third, it allows a comparison between regions of different sizes to be carried out, which means a region covariance descriptor is independent of the size of the region. In addition to a region descriptor, the covariance matrix has recently been utilized as a form of general representation in an increasing number of computer vision tasks, such as image set classification [2] and frame-based video classification [15]. In this scenario, each individual image/frame in a given set will produce a corresponding feature vector, and the covariance matrix C is then computed from these image/frame-based vectors to represent this set.
In order to exploit the Riemannian geometry, the nonsingularity of C is indispensable. When C is used as a region descriptor, its size (n × n) is determined by the dimensionality of the feature vectors. The total number of feature vectors extracted from an image region is usually much greater than their dimensionality (w · l ≫ n). This ensures the covariance matrix C is positive definite and allows it to be computed very fast, with a complexity of O(n^2) per region via integral images [5]. In contrast, using the covariance matrix as a general representation for image sets/videos usually suffers from an insufficient-sample problem. This is because the feature vectors are no longer pixel-based and have a relatively high dimension, while the number of available images per set in most tasks is likely to be low. C is a valid SPD matrix only if r > n; otherwise, C has zero eigenvalues and becomes singular. To avoid the latter case, a popular choice is to add a small-scaled identity matrix to the covariance matrix C, as X = C + λI, λ = 0.001. Although the covariance matrix C then becomes strictly positive definite, the impact of such a regularization deepens as the size of the perturbation matrix grows [25]. As a consequence, expressive features of compact dimensionality are highly desired for the covariance representation. Feasible measures include, but are not limited to, reducing the image resolution, mapping data to a low-dimensional space, performing proper feature extraction, and applying suitable regularization methods.
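A small sketch of the covariance descriptor with the X = C + λI regularization, illustrating that the result stays SPD even in the insufficient-sample regime r < n:

```python
import numpy as np

def covariance_descriptor(S, lam=1e-3):
    """Sample covariance of the columns of S plus lam*I regularisation,
    X = C + lam*I, so the descriptor is strictly positive definite."""
    n, r = S.shape
    mean = S.mean(axis=1, keepdims=True)
    D = S - mean
    C = (D @ D.T) / (r - 1)
    return C + lam * np.eye(n)

# Insufficient-sample regime: 5-dimensional features, only 3 samples (r < n).
rng = np.random.default_rng(6)
S = rng.standard_normal((5, 3))
X = covariance_descriptor(S)
assert (np.linalg.eigvalsh(X) > 0).all()   # still a valid SPD matrix
```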

IV. EVALUATIONS
In this section, we study the effectiveness of the proposed probability distribution-based SPD dimensionality reduction (PDSPDDR) approach by presenting the results of experiments on four different computer vision tasks: object classification, material categorization, face recognition, and scene classification. To demonstrate the effectiveness of our approach, we compare it with several representative SPD manifold learning algorithms. We first briefly describe these methods and the datasets used in the experiments, and then discuss the results.

A. IMPLEMENTATION DETAILS
For the benchmarking, the proposed PDSPDDR approach and three types of SPD manifold learning methods, namely Riemannian metric-based methods, Riemannian kernel-based methods, and dimensionality reduction methods, are evaluated with their corresponding classifiers. The Riemannian metric-based methods employ a simple NN classifier (the number of neighbors used in the NN classifiers is set to 1). These methods provide clear evidence of the benefits of the discriminative manifold mapping. The Riemannian kernel-based methods make use of kernels on the Riemannian manifold to exploit classic vectorial algorithms (PLS, Sparse Representation). The dimensionality reduction methods seek a nonlinear mapping from the original manifold to another space with more discriminative power and usually formulate the learning of this manifold mapping as an optimization problem. The algorithms evaluated in the experiments are listed as follows:

MMDML: Multi-Manifold Deep Metric Learning for Image Set Classification [26]; DRSPD-AIM: Dimensionality reduction on SPD manifolds with AIM-based NN classifier [19]; DRSPD-LEM: DRSPD with LEM-based NN classifier [19]; DRSPD-Stein: DRSPD with Stein divergence-based NN classifier [19]; DRSPD-RSR: DRSPD with Stein divergence-based Riemannian Sparse Representation [19]; DRSPD-LEKSR: DRSPD with Log-Euclidean Kernels for Sparse Representation [19].
For a fair comparison, the main parameters of each algorithm are empirically tuned according to the original settings. For LEKSR, we use the implementation based on the Gaussian kernel. For DRSPD and PDSPDDR, the maximum iteration number of the RCG is set to 50, λ_w is set to the minimum number of samples in each class, and λ_b is determined by cross-validation (λ_b ≤ λ_w). We also keep the target dimensionality m in PDSPDDR consistent with that in DRSPD for each dataset. In the pursuit of good performance, the λ_s of our PDSPDDR is set to be the same as λ_w. For each task, the experiments are repeated 10 times with randomly selected gallery/probe combinations, and the average recognition accuracies are reported.

B. OBJECT CLASSIFICATION
For the first task, we used the ETH80 dataset [27]. This dataset contains images of 8 different categories, each with 10 objects from everyday life: apples, pears, tomatoes, cows, dogs, horses, cups, and cars. Each object includes 41 images captured under different views, some of which are shown in Fig. 3. We resize all images to 20 × 20 and treat each object as an image set. The set-based covariance descriptors of size 400 × 400 are therefore generated from intensity images. Note that we use 4 randomly selected objects for the gallery and the other 6 objects for the probe, which differs from the setting in [2] so as to increase the difficulty of classification on this dataset. We use m = 30 and λ_b = 3 on this dataset. Table 1 shows the performance of our approach and other methods on this dataset. The kernel-based learning methods (i.e., CDL and RSR) perform strongly, while the Riemannian metric-based NN methods obtain lower performance. However, with the help of discriminative dimensionality reduction, the performance of the NN methods is notably improved. In particular, our PDSPDDR not only outperforms DRSPD by over 2%, but also outperforms the kernel-based methods including CDL, RSR, and LEKSR. This demonstrates the benefits of encoding the (dis)similarity between SPD matrices with their PDFs. The highest accuracy of 94.39% is obtained by RSR on the manifold learned with PDSPDDR, which is slightly better than the performance of MMDML on this dataset.

C. MATERIAL CATEGORIZATION
For the experiments on material categorization, we used the UIUC dataset [28]. The UIUC material dataset contains 18 classes of complex materials taken in the wild, from four general categories: bark, fabric, construction materials, and fur of animals. Each class has 12 images with various geometric fine-scale details (see Fig. 4 for sample images). Following the same protocol as in [19], we randomly assigned half of the images of each class as gallery data and used the rest as probe data. To obtain robust and discriminative Region Covariance Matrices (RCMs) [5] for material categorization, we extract SIFT features [29] from grayscale images. Specifically, every image is resized to 400 × 400, and dense SIFT descriptors are computed on a grid (with a spacing of 4 pixels and a bin size of 4 pixels). Each grid point thus produces one 128-dimensional SIFT feature vector used for constructing RCMs of size 128 × 128. We adopt the same parameters as [19] for the dimensionality reduction on this dataset.
As listed in Table 1, the proposed PDSPDDR improves on the most closely related method, DRSPD, by 1%-2% with NN classifiers and also achieves a performance comparable to the kernel-based learning methods (our PDSPDDR-AIM and PDSPDDR-Stein are better than RSR and LEKSR, but not as good as CDL). Note that the gap between the dimensionality reduction methods and the NN methods with the LEM solution is quite small, which is probably caused by deviations introduced by the approximation of LEM. However, the low-dimensional SPD matrices with kernel sparse coding (PDSPDDR-RSR and PDSPDDR-LEKSR) outperform CDL.

D. FACE RECOGNITION
For the face recognition task, we used the YouTube Celebrities (YTC) dataset [30]. It is a quite challenging and widely used video face dataset with 1910 video sequences of 47 different persons collected from the YouTube website, such as actors, actresses, and politicians (see Fig. 5 for examples). Most video sequences are of low resolution and recorded at a high compression ratio, with varying numbers of frames (from 8 to 400). The face region in each frame of a video sequence is first extracted and then resized to a 20 × 20 intensity image. Histogram equalization is applied to eliminate illumination effects. Hence, each video sequence generates an image set of faces. Similar to the strategy in [19], we use the mean face of each image set to improve the set-based covariance descriptor, concatenating it with the covariance to yield a 401-dimensional SPD matrix

X = [ C + µµ^T, µ ; µ^T, 1 ],

where µ and C are the mean and covariance of one image set. Following the standard practice [31], the entire dataset is equally divided into five folds (with a minimal overlap). In each fold, one person has 3 randomly chosen image sets for the gallery and the other 6 image sets for the probe. As can be seen from Table 1, our approach achieves a performance comparable to the state-of-the-art methods, with PDSPDDR-AIM yielding the maximum performance of 75.23%. Our approach clearly surpasses some dimensionality reduction-based methods (i.e., LEML and DRSPD) on this dataset, and the performance of NN based on AIM and the Stein divergence is dramatically improved (by up to 20%) when combined with our PDSPDDR. With the superb capability of deep learning to model the nonlinearity of samples, MMDML achieves the best result of 80.57%. It is worth noting that our approach also boosts the performance of RSR to a level similar to MMDML (from 76.94% to 79.42% in the case of the Stein divergence).
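The mean-augmented descriptor can be sketched as follows; the block form X = [C + µµ^T, µ; µ^T, 1] is the one stated above, and the λI regularization of C is our addition to keep the toy example well-conditioned (here r < n, so the raw covariance is singular).

```python
import numpy as np

def augmented_spd(frames, lam=1e-3):
    """Fuse the mean mu with the covariance C of an image set into
    X = [[C + mu mu^T, mu], [mu^T, 1]], an (n+1) x (n+1) SPD matrix."""
    n, r = frames.shape                        # one n-dim feature per frame
    mu = frames.mean(axis=1, keepdims=True)    # (n, 1) mean face
    D = frames - mu
    C = (D @ D.T) / (r - 1) + lam * np.eye(n)  # regularised, since r < n here
    top = np.hstack([C + mu @ mu.T, mu])
    bot = np.hstack([mu.T, np.ones((1, 1))])
    return np.vstack([top, bot])

rng = np.random.default_rng(7)
frames = rng.standard_normal((400, 50))        # 50 face images of dimension 400
X = augmented_spd(frames)
assert X.shape == (401, 401)
assert (np.linalg.eigvalsh(X) > 0).all()       # SPD: the Schur complement is C
```

The block matrix is positive definite exactly when C is, since the Schur complement of the bottom-right entry equals C.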

E. SCENE CLASSIFICATION
Scene classification is an important task in computer vision. In our work, we formulate it as an image set classification problem and evaluate it on the MDSD dataset. MDSD includes 13 classes with 10 short clips per class. These videos have large variations in illumination, view, scale, and resolution (see Fig. 6 for examples). Since deep neural networks have become the most popular and powerful approach to image classification, and the direct use of raw images does not provide robust and discriminative features for this task, we use the output of the last pooling layer of ResNet [32] (pretrained on ImageNet [33]) as the descriptor for each frame. The original-sized color images are used as the input to the CNN, and the dimensionality of the CNN output is reduced by PCA to 400. Following the same settings as in [22], we adopt a seventy-thirty-ratio (STR) protocol, which randomly selects 7 videos for training and keeps the remaining 3 videos for testing. We set m = 40 and λ_b = 7 on this dataset.
We report the experimental results on this challenging dataset in Table 1. Our approach succeeds in boosting the performance of covariance descriptors computed from deep features under different classifiers. PDDRSPD-LEM improves on the closest method, DRSPD-LEM, by 5%, while PDDRSPD-RSR and PDDRSPD-LEKSR also outperform the deep metric learning method MMDML by up to 2%. In addition, the performance of the Stein divergence-based NN is doubled by applying the proposed PDDRSPD approach.

F. PARAMETER SENSITIVITY
Note that there are 5 major parameters in the proposed approach, and they were all set in our experiments in a simple, yet principled manner. Here we introduce the setting of these parameters and analyze the influence of some parameters on the overall performance.
There are three parameters shared between the proposed PDDRSPD and the most closely related method, DRSPD: the number λ_w of intra-class neighbors, the number λ_b of inter-class neighbors, and the dimensionality m of the induced manifold. As observed in [19], the performance of the learned low-dimensional SPD matrices increases monotonically as λ_w is raised from one to its upper bound. It is therefore natural to set λ_w to the minimum number of gallery samples per class, which maximizes within-class compactness. In contrast, [19] also found that either a too high or a too low value of λ_b degrades performance. Since λ_b balances the within-class and between-class affinity matrices, we set it to its empirical optimum by cross-validation over the limited range 1 ≤ λ_b ≤ λ_w. We also tested the sensitivity of both algorithms to m; on the UIUC material dataset, the final result is largely insensitive to its variation.
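The cross-validated choice of λ_b can be sketched as a simple search over the admissible range. Here `toy_evaluate` is a hypothetical stand-in for training the dimensionality reduction with a given λ_b on one gallery/probe split and returning probe accuracy; only the selection loop reflects the procedure described above.

```python
import numpy as np

def select_lambda_b(lambda_w, evaluate, n_splits=5):
    """Pick lambda_b in [1, lambda_w] by mean cross-validated accuracy.

    evaluate(lambda_b, split) -> accuracy on one gallery/probe split
    (placeholder for the actual train-and-test routine).
    """
    scores = {lb: np.mean([evaluate(lb, s) for s in range(n_splits)])
              for lb in range(1, lambda_w + 1)}
    return max(scores, key=scores.get)

# Dummy objective whose accuracy peaks at an intermediate lambda_b,
# mimicking the degradation observed at both extremes
def toy_evaluate(lambda_b, split):
    return 1.0 - abs(lambda_b - 4) * 0.05

print(select_lambda_b(7, toy_evaluate))  # 4
```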
For the two new parameters λ_s and h introduced in the proposed approach, we investigated their influence on the overall performance. To this end, we applied PDDRSPD-AIM on the UIUC material dataset with varying values of λ_s (from 1 to 6) and of h (with λ_s = 1). The results are depicted in Fig. 7. As described in Sect. III-B, λ_s is the number of samples selected from a class c_j to describe the inter-class dissimilarity between a given sample X_i and c_j, so it naturally shares the same upper bound as λ_w: the minimum number of gallery samples per class. When λ_s = 1, the number of effective elements in Ĝ_b equals that in G_b. The accuracy of PDDRSPD-AIM generally improves as λ_s increases, with the maximum performance reached at λ_s = 6. Accordingly, simply setting λ_s = λ_w in the proposed PDDRSPD can be expected to give good performance in all experiments.
In order to obtain the probability distribution of SPD matrices on the Riemannian manifold, h is an essential smoothing parameter for producing reliable estimates. As shown in Fig. 7, the sensitivity of our approach to h is relatively low, and in every case the performance can be further improved by tuning h.
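A kernel density estimate of this kind under the affine-invariant metric can be sketched as follows. This is a standard Gaussian-kernel KDE on the SPD manifold; the paper's exact kernel and normalization may differ, and the function names are ours.

```python
import numpy as np
from scipy.linalg import eigh

def aim_distance(X, Y):
    """Affine-invariant distance between SPD matrices X and Y:
    sqrt(sum_i log^2 lambda_i), where lambda_i are the generalized
    eigenvalues of the pair (Y, X)."""
    lam = eigh(Y, X, eigvals_only=True)
    return np.sqrt(np.sum(np.log(lam) ** 2))

def kde_pdf(X, samples, h):
    """Gaussian-kernel density estimate at X from a set of SPD samples,
    with smoothing parameter h (normalizing constant omitted)."""
    d2 = np.array([aim_distance(X, S) ** 2 for S in samples])
    return np.mean(np.exp(-d2 / (2 * h ** 2)))

# Toy usage with random SPD matrices
rng = np.random.default_rng(2)
def rand_spd(d):
    A = rng.standard_normal((d, d))
    return A @ A.T + d * np.eye(d)   # well-conditioned SPD matrix

samples = [rand_spd(5) for _ in range(20)]
X = rand_spd(5)
print(kde_pdf(X, samples, h=1.0))
```

Larger h smooths the estimate over more distant samples, which is exactly the trade-off the sensitivity analysis in Fig. 7 probes.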
In addition, we have also analyzed the efficiency of our approach. In particular, Fig. 8 shows the run time of 50 iterations of DRSPD and PDDRSPD (with λ_s = 1 and λ_s = 6). The comparison was performed on the UIUC material dataset with a fixed gallery/probe combination. As λ_s increases, the running time of PDDRSPD grows, but the optimization also converges to a lower cost. Even with λ_s = 1, our approach still achieves a better result than DRSPD within a comparable time, thanks to the affinity matrix constructed from PDFs.
Overall, compared with DRSPD, our approach does require one extra parameter h, but we consider this not burdensome, and its impact on performance is clearly positive.

V. CONCLUSION
We have proposed a probability distribution-based dimensionality reduction approach for SPD matrices that maps the manifold of high-dimensional SPD matrices to a manifold of lower-dimensional, discriminative ones. By exploiting the PDFs of SPD matrices estimated with a KDE, we defined a novel affinity matrix that enhances the discriminative power of popular Riemannian metrics on the target manifold, and formulated the learning of the manifold-to-manifold transformation as an optimization problem on the Grassmann manifold. Extensive evaluations on three challenging datasets have shown the competitiveness and efficiency of the proposed approach compared with state-of-the-art methods. In the future, we intend to extend our probability distribution-based approach to the unsupervised scenario.