Joint and Progressive Subspace Analysis (JPSA) with Spatial-Spectral Manifold Alignment for Semi-Supervised Hyperspectral Dimensionality Reduction

Conventional nonlinear subspace learning techniques (e.g., manifold learning) usually suffer from drawbacks in explainability (no explicit mapping), cost-effectiveness (no linearization), generalization capability (out-of-sample extension), and representability (spatial-spectral discrimination). To overcome these shortcomings, a novel linearized subspace analysis technique with spatial-spectral manifold alignment is developed for semi-supervised hyperspectral dimensionality reduction (HDR), called joint and progressive subspace analysis (JPSA). The JPSA learns a high-level, semantically meaningful, joint spatial-spectral feature representation from hyperspectral data by 1) jointly learning latent subspaces and a linear classifier to find an effective projection direction favorable for classification; 2) progressively searching several intermediate states of subspaces to approach an optimal mapping from the original space to a potentially more discriminative subspace; and 3) spatially and spectrally aligning the manifold structure in each learned latent subspace in order to preserve the same or similar topological properties between the compressed data and the original data. A simple but effective classifier, i.e., nearest neighbor (NN), is explored as a potential application for validating the algorithm performance of different HDR approaches. Extensive experiments are conducted to demonstrate the superiority and effectiveness of the proposed JPSA on two widely-used hyperspectral datasets: Indian Pines (92.98\%) and the University of Houston (86.09\%) in comparison with previous state-of-the-art HDR methods. The demo of the basic version of this work (i.e., ECCV2018) is openly available at https://github.com/danfenghong/ECCV2018_J-Play.


I. INTRODUCTION
HYPERSPECTRAL (HS) data are often characterized by rich and diverse spectral information, which enables us to locate or identify targets more easily. However, their high dimensionality also raises some crucial issues that need to be carefully considered, including information redundancy, complex noise effects, the need for large storage capacities and high-performance computing, and the curse of dimensionality. A general way to address this problem is to compress the original data to a low-dimensional and highly-discriminative subspace with the preservation of the topological structure. In general, this is also referred to as dimensionality reduction (DR) or subspace learning (SL).
Over the past decade, SL techniques have been widely used in remote sensing data processing and analysis [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], particularly hyperspectral dimensionality reduction (HDR) [12]. Li et al. [13] carried out HDR and classification by locally preserving neighborhood relations. In [14], spectral-spatial noise estimation was shown to largely enhance the performance of dimensionality reduction, and the proposed method not only extracts high-quality features but also deals well with nonlinear problems in hyperspectral image classification. The authors of [15] introduced the sparseness property [16] into the to-be-estimated subspace in order to better structure the low-dimensional embedding space. Rasti et al. [17] extracted hyperspectral features in an unsupervised fashion using orthogonal total variation component analysis (OTVCA), yielding smooth spatial-spectral HSI feature extraction. In [18], a spatial-spectral manifold (SSM) embedding was developed to compress the HS data into a more robust and discriminative subspace. Wang et al. [19] proposed to select representative features hierarchically by means of random projection in an end-to-end neural network, which has shown its effectiveness on large-scale data. Very recently, Huang et al. [20] traced the drawbacks of spatial-spectral techniques and addressed them by designing a new spatial-spectral combined distance to select the spatial-spectral neighbors of each HS pixel more effectively. In the combined distance, the pixel-to-pixel distance between two spectral signatures is converted into a weighted summation of distances between the spatial neighborhoods of the two target pixels.
Despite the good performance of these methods in HDR, most of them adhere only to either the unsupervised or the supervised strategy, and fail to jointly consider labeled and unlabeled information in the process of HDR. Some recent works on semi-supervised HDR have attempted to preserve the potentially global data structure that lies in the whole high-dimensional space. For example, Liao et al. [21] simultaneously exploited labeled and unlabeled data to extract the feature representation from the HSI in a semi-supervised fashion, called semi-supervised local discriminant analysis (SELD). Different from [21], which utilizes a similarity measurement to construct the graph structure, in [22] the performance of LDA is enhanced with the joint use of labels and "soft-labels" predicted by label propagation, yielding a soft-label LDA (SLLDA) for semi-supervised HDR. A similar semi-supervised strategy was presented in [23] to reduce the spectral dimension of HSI by embedding pseudo-labels obtained using a pre-trained classifier into LFDA, called semi-supervised LFDA (SSLFDA). The use of "soft-labels" or "pseudo-labels" is effective for the low-dimensional embedding process, since considering more pixels helps to better capture the global manifold of the data, even though these soft or pseudo labels could be noisy and inaccurate. It should be noted that these techniques are commonly applied as a separate feature learning step before classification, whose limitation mainly lies in a weak connection between the features obtained by SL and the label space (see the top panel of Fig. 1). It is unknown which learned features can actually improve the classification. In [24], the features are adequately exploited by using t-distributed stochastic neighbor embedding and a multi-scale scheme, and the proposed neural network shows outstanding and reliable performance in HS image classification.
A feasible solution to this problem can be generalized into a joint learning framework [26] that simultaneously learns a linearized subspace projection and a classifier, as illustrated in the middle panel of Fig. 1. Inspired by it, a large amount of work has been proposed for various applications, such as cross-modality learning and retrieval [27] and heterogeneous joint feature learning [28]. Although these works have tried to make a connection between the learned subspaces and the label information using regression techniques (e.g., linear regression) to adaptively find a latent subspace in favor of classification, they fail to find an optimal subspace. The reason is that the representative ability of only a single linear projection remains limited for the complex transformation from the original data space to the potentially optimal subspace. Similar to the joint learning model, deep neural networks (DNNs) have attracted increasing attention due to their powerful ability in HS feature extraction. Chen et al. [29] designed a stacked autoencoder (SAE) for feature extraction and classification of HSI. In [30], the authors investigated the performance of self-taught feature learning (e.g., a convolutional autoencoder (CAE)) by jointly considering spatial-spectral information embedding, with application to HSI classification.

Fig. 1. The motivation interpolation from separately learning subspaces and training a classifier [25], to jointly learning subspaces and a classifier [26], to jointly and progressively learning multi-coupled subspaces and a classifier [1]. The green bottom line from left to right indicates a gradual improvement in feature discriminative ability. Ideally, the features (subspaces) learned by our model are expected to have a higher discrimination ability, which benefits from the proposed joint and progressive learning strategy.

A. Motivation and Objectives
To sum up, the aforementioned methods can be roughly categorized into linear HDR and nonlinear HDR techniques. The strengths and weaknesses of the two categories can be summarized as follows.
1) Theoretically, nonlinear HDR strategies, such as manifold learning [31] and DNN-based DR methods (e.g., SAE and CAE) [32], can fit the data almost perfectly, owing to their powerful model learning capability. However, this type of method is relatively sensitive to the complex spectral variability inevitably caused by complex noise, atmospheric effects, and various physical and chemical factors in hyperspectral imaging. Because spectral variability tends to be absorbed by DNN-based methods [33], the discriminative ability of the dimension-reduced features may be degraded.
2) In turn, linearized SL methods, such as principal component analysis (PCA) [34], linearized manifold learning (e.g., locality preserving projection (LPP) [35]), linear discriminant analysis (LDA) [25], and local Fisher discriminant analysis (LFDA) [36], can well address the above drawbacks, yet they usually provide limited performance due to a defect of the model itself, that is, a single linearized model lacks data representation ability.
The above trade-off motivates us to develop a multi-layered linearized SL technique for HDR with more discriminative and robust data representation and to preserve the structural consistency between the compressed data and the original data.

B. Method Overview and Contributions
To effectively pursue high spectral discrimination and preservation of the spatial-spectral topological structure when compressing the HS data, we propose a novel joint and progressive subspace analysis (JPSA) to linearly find an optimal subspace for low-dimensional data representation, as shown in the bottom panel of Fig. 1. The promising idea of simultaneous SL and classification is used to form the basic skeleton of the proposed JPSA model. In the framework, we learn a series of subspaces instead of a single subspace, so that the original data space is progressively converted to a potentially optimal subspace through multi-coupled intermediate transformations. To avoid trivial solutions, a self-reconstruction (SR) strategy in the form of regularization is applied in each latent subspace. Furthermore, we not only consider the structural consistency (topology) between the compressed data and the original data in both the spatial and spectral domains, but also align the two (spatial and spectral) manifolds in each latent subspace, yielding the SSM embedding in the process of HDR.
Beyond the previous works, i.e., [1], [37], the main contributions of our work can be summarized as follows:

• We develop a novel semi-supervised HDR framework (JPSA) for better learning the spatial-spectral low-dimensional embedding by modeling relations between superpixels and pixels in a joint and progressive fashion.

• With the SR term simultaneously performed on superpixels and pixels, the linearized JPSA shows its robustness and effectiveness in handling spectral variability over many nonlinear HDR approaches, which is demonstrated in the experiment section.

• Spatial-spectral manifolds are preserved in each latent subspace and are further aligned for spatial-spectral structure consistency between the compressed data and the original data, where the manifold structure in the spectral space is computed by a Gaussian kernel function, and the spatial manifold structure is determined by superpixels, e.g., simple linear iterative clustering (SLIC) [38].

• To avoid falling into bad local optima, a pre-training model, called auto-reconstructing unsupervised learning (AutoRULe), is proposed as an initialization of JPSA to jointly initialize the pixel and superpixel branches.

• An iterative optimization algorithm based on the alternating direction method of multipliers (ADMM) is designed to solve the newly-proposed model.

II. JPSA: JOINT & PROGRESSIVE SUBSPACE ANALYSIS
Fig. 2 illustrates the workflow of the proposed JPSA. Intuitively, the JPSA is a two-stream multi-layered regression model involving two input sources, pixel-wise and superpixel-wise spectral signatures, and the same output (ground truth). In the learning process of the two-stream model, the to-be-estimated parameters (projections) are shared, with a spatial-spectral alignment constraint in each latent subspace. Moreover, each learned subspace is expected to be capable of projecting back to its former high-dimensional product, which is measured by the reconstruction loss.

A. Review of Joint Regression
Before introducing our JPSA, we first briefly review the basis for developing our method: a joint regression model [26], in which SL and classification are simultaneously performed to reduce the gap between the estimated subspace and the labels. This model has been proven to be effective in extracting discriminative low-dimensional representations [39]. Let X = [x_1, ..., x_k, ..., x_N] ∈ R^{d_0×N} be a HS data matrix with d_0 bands and N pixels, and Y ∈ {0, 1}^{L×N} be the one-hot encoded class matrix corresponding to the labels, whose k-th column is defined as y_k = [y_{k1}, ..., y_{kt}, ..., y_{kL}]^T ∈ R^{L×1}. We then have

\min_{\boldsymbol{\Theta},\mathbf{P}}\ \frac{1}{2}\big\|\mathbf{Y}-\mathbf{P}\boldsymbol{\Theta}\mathbf{X}\big\|_{\mathrm{F}}^{2}+\frac{\gamma}{2}\|\mathbf{P}\|_{\mathrm{F}}^{2}, \qquad (1)

where \|\cdot\|_{\mathrm{F}} represents the Frobenius norm; P ∈ R^{L×d_m} (d_m denotes the dimension of the latent subspace) is the regression matrix that explicitly bridges the learnt latent subspace and the labels; and the projection Θ ∈ R^{d_m×d_0} is usually called the intermediate transformation, with the corresponding subspace ΘX called the latent subspace. It has been proven in [40] that features are prone to be structurally learned and represented in such a latent subspace. Further, by considering the graph structure measured by an adjacency matrix W ∈ R^{N×N} as a regularizer [41], the joint regression model in Eq. (1) can be extended to the following improved version [37]:

\min_{\boldsymbol{\Theta},\mathbf{P}}\ \frac{1}{2}\big\|\mathbf{Y}-\mathbf{P}\boldsymbol{\Theta}\mathbf{X}\big\|_{\mathrm{F}}^{2}+\frac{\gamma}{2}\|\mathbf{P}\|_{\mathrm{F}}^{2}+\frac{\beta}{2}\operatorname{tr}\!\big(\boldsymbol{\Theta}\mathbf{X}\mathbf{L}\mathbf{X}^{\mathsf{T}}\boldsymbol{\Theta}^{\mathsf{T}}\big), \qquad (2)

where D_{ii} = \sum_{j} W_{ij} defines the degree matrix and the Laplacian matrix L can be computed by L = D − W [42]. The third term of Eq. (2), i.e., the graph regularization, provides additional prior knowledge by modeling relations between samples, thereby improving the regression performance.
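For illustration only, the following NumPy sketch alternates between a closed-form ridge update of P and a pseudo-inverse-based least-squares update of Θ for Eq. (1); the function name, the random initialization, and the Θ-step via pseudo-inverses are assumptions made for this sketch, not the solver used in [26] or [37].

```python
import numpy as np

def joint_regression(X, Y, d_m, gamma=1.0, n_iter=50, seed=0):
    """Alternately minimize 0.5*||Y - P @ Theta @ X||_F^2 + 0.5*gamma*||P||_F^2.

    X: (d0, N) spectral data; Y: (L, N) one-hot labels; d_m: latent dimension.
    """
    rng = np.random.default_rng(seed)
    d0, _ = X.shape
    Theta = 0.01 * rng.standard_normal((d_m, d0))      # intermediate transformation
    for _ in range(n_iter):
        V = Theta @ X                                   # latent subspace features
        # P-step (ridge regression): P = Y V^T (V V^T + gamma I)^-1
        P = Y @ V.T @ np.linalg.inv(V @ V.T + gamma * np.eye(d_m))
        # Theta-step: minimum-norm least-squares fit of Y ≈ P Theta X
        Theta = np.linalg.pinv(P) @ Y @ np.linalg.pinv(X)
    return Theta, P
```

In practice, the Θ-step would additionally include the graph regularizer of Eq. (2); the sketch only conveys the alternating structure of the joint model.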

B. Problem Formulation
A single linear transformation is hardly capable of describing the complex mapping relationship between the data and the labels well, particularly for HS data suffering from a variety of spectral variabilities. On the other hand, although nonlinear techniques (e.g., manifold learning or DL) hold a powerful representation ability for HS data, they are usually vulnerable to spectral variability, which inevitably degrades the quality of the dimension-reduced features. As a trade-off, we propose to progressively learn multi-coupled linear projections on the basis of the joint regression framework. Thus, the resulting JPSA with the necessary priors can be formulated as the following constrained optimization problem:

\min_{\mathbf{P},\,\{\boldsymbol{\Theta}_l\}_{l=1}^{m}}\ \Upsilon(\{\boldsymbol{\Theta}_l\}_{l=1}^{m})+\alpha\,\mathrm{E}(\mathbf{P},\{\boldsymbol{\Theta}_l\}_{l=1}^{m})+\beta\,\Phi(\{\boldsymbol{\Theta}_l\}_{l=1}^{m})+\gamma\,\Psi(\mathbf{P}), \quad \mathrm{s.t.}\ \ \mathbf{X}_l=\boldsymbol{\Theta}_l\mathbf{X}_{l-1},\ \mathbf{X}_l\geq 0,\ \|\mathbf{x}_{lk}\|_2\leq 1, \qquad (3)

where {Θ_l}_{l=1}^{m} ∈ R^{d_l×d_{l-1}} are defined as a set of intermediate transformations, m is the number of subspace projections, {d_l}_{l=1}^{m} stand for the dimensions of those latent subspaces, and α, β, and γ are regularization parameters. Moreover, X_l denotes the l-th layer subspace features, where X_0 represents the original data (X), while X_l^{sp} denotes the superpixel representation of X_l. To effectively solve the two-stream joint regression model in Eq. (3), the following key terms are featured.
1) SR Loss Term Υ({Θ_l}_{l=1}^{m}): Without any constraints or priors, jointly estimating multiple successive variables in JPSA can hardly be implemented, especially when the number of estimated variables gradually increases. This can be explained by vanishing gradients between two neighboring variables estimated in the optimization process. In other words, the variations between two neighboring variables approach a tiny value or even zero. When the number of estimated projections accumulates to a certain extent, most of the valid information could gather in only a few projections, making the other projections close to identity matrices and hence meaningless. To address this issue, an autoencoder-like scheme is adopted to reduce the information loss in the propagation between two neighboring spaces. The benefits of the scheme are two-fold. On the one hand, this term can prevent over-fitting of the data to a great extent, in particular preventing all kinds of spectral variabilities from being absorbed, since we found that those variabilities are difficult to reconstruct linearly. On the other hand, it can also establish an effective link between the original space and the subspace, enabling the learned subspace to be projected back to the former one as much as possible. Such a strategy can be formulated by simultaneously considering the pixels and superpixels of the HSI:

\Upsilon(\{\boldsymbol{\Theta}_l\}_{l=1}^{m})=\frac{1}{2}\sum_{l=1}^{m}\Big(\big\|\mathbf{X}_{l-1}-\boldsymbol{\Theta}_l^{\mathsf{T}}\boldsymbol{\Theta}_l\mathbf{X}_{l-1}\big\|_{\mathrm{F}}^{2}+\big\|\mathbf{X}_{l-1}^{sp}-\boldsymbol{\Theta}_l^{\mathsf{T}}\boldsymbol{\Theta}_l\mathbf{X}_{l-1}^{sp}\big\|_{\mathrm{F}}^{2}\Big). \qquad (4)

Please note that we utilize Eq. (4) in each latent subspace to maximize the advantages of this term.
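As a rough illustration of the SR term, the following sketch accumulates the autoencoder-like reconstruction error over the coupled projections for both the pixel and superpixel streams; the decoder Θ_l^T and the 1/2 weighting follow Eq. (4) as reconstructed above, and the function name is our own.

```python
import numpy as np

def sr_loss(thetas, X0, X0_sp):
    """Self-reconstruction loss accumulated over all latent subspaces
    for the pixel stream X0 and the superpixel stream X0_sp."""
    loss, X, X_sp = 0.0, X0, X0_sp
    for Theta in thetas:                       # Theta: (d_l, d_{l-1})
        X_next, X_sp_next = Theta @ X, Theta @ X_sp
        # each subspace should be able to project back to its former space
        loss += 0.5 * np.linalg.norm(X - Theta.T @ X_next) ** 2
        loss += 0.5 * np.linalg.norm(X_sp - Theta.T @ X_sp_next) ** 2
        X, X_sp = X_next, X_sp_next
    return loss
```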
2) Prediction Loss Term E(P, {Θ_l}_{l=1}^{m}): This term minimizes the empirical risk between the original data and the label matrix through a set of subspace projections and a linear regression coefficient, which can be written as

\mathrm{E}(\mathbf{P},\{\boldsymbol{\Theta}_l\}_{l=1}^{m})=\frac{1}{2}\big\|\mathbf{Y}-\mathbf{P}\boldsymbol{\Theta}_m\cdots\boldsymbol{\Theta}_1\mathbf{X}\big\|_{\mathrm{F}}^{2}+\frac{1}{2}\big\|\mathbf{Y}-\mathbf{P}\boldsymbol{\Theta}_m\cdots\boldsymbol{\Theta}_1\mathbf{X}^{sp}\big\|_{\mathrm{F}}^{2}. \qquad (5)

Theoretically, as the number of estimated subspaces increases, the variations between neighboring subspaces are gradually narrowed down to a very small range. In this case, such small variations can be approximately represented by a linear transformation. This allows us to find a good solution in a simple way, especially for a non-convex model.
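A corresponding sketch of the prediction loss, chaining the coupled projections before the linear regressor, is given below; it illustrates Eq. (5) as reconstructed here, not the reference implementation.

```python
import numpy as np

def prediction_loss(P, thetas, X, X_sp, Y):
    """Empirical risk between the labels Y and both data streams pushed
    through the progressive projections Theta_1, ..., Theta_m."""
    Z, Z_sp = X, X_sp
    for Theta in thetas:
        Z, Z_sp = Theta @ Z, Theta @ Z_sp      # progressive projection
    return 0.5 * np.linalg.norm(Y - P @ Z) ** 2 + 0.5 * np.linalg.norm(Y - P @ Z_sp) ** 2
```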
3) Alignment-based SSM Regularization Φ({Θ_l}_{l=1}^{m}): As introduced in [43], the manifold structure is an important prior for compressing high-dimensional data, which can effectively capture the intrinsic structure between samples. For this reason, we not only embed the local spectral manifold structure computed between the pixels, but also model the non-local-like spatial manifolds constructed by the superpixels. The two graph structures can be formulated as

\mathbf{W}^{spe}_{ij}=\begin{cases}\exp\!\big(-\|\mathbf{X}_i-\mathbf{X}_j\|_2^{2}/(2\sigma^{2})\big), & \mathbf{X}_j\in\varphi_k(\mathbf{X}_i)\ \text{or}\ \mathbf{X}_i\in\varphi_k(\mathbf{X}_j),\\ 0, & \text{otherwise},\end{cases} \qquad (6)

\mathbf{W}^{spa}_{ij}=\begin{cases}\exp\!\big(-\|\mathbf{X}^{sp}_i-\mathbf{X}^{sp}_j\|_2^{2}/(2\sigma^{2})\big), & \mathbf{X}^{sp}_j\in\varphi_k(\mathbf{X}^{sp}_i)\ \text{or}\ \mathbf{X}^{sp}_i\in\varphi_k(\mathbf{X}^{sp}_j),\\ 0, & \text{otherwise},\end{cases} \qquad (7)

where φ_k(X_i) and φ_k(X_i^{sp}) are the k neighbors of the pixel X_i and the superpixel X_i^{sp}, respectively. Additionally, we also align the spatial and spectral manifolds in each learned subspace to enhance the model's ability to discriminate and generalize, further yielding the structure consistency of the two-stream joint regression model. The alignment operator can be expressed in the form of a graph:

\mathbf{W}^{ali}_{ij}=\begin{cases}1, & \mathbf{X}_i\in\varphi(\mathbf{X}^{sp}_j),\\ 0, & \text{otherwise},\end{cases} \qquad (8)

where φ(X_j^{sp}) denotes the pixel set in the j-th superpixel. By collecting the above sub-graphs, we have the final graph structure (W_f) that considers the spatial and spectral neighbors of each pixel as well as their alignment information:

\mathbf{W}_f=\begin{bmatrix}\mathbf{W}^{spe} & \mathbf{W}^{ali}\\ (\mathbf{W}^{ali})^{\mathsf{T}} & \mathbf{W}^{spa}\end{bmatrix}. \qquad (9)

Thus, the resulting manifold alignment-based spatial-spectral regularization can be written as

\Phi(\{\boldsymbol{\Theta}_l\}_{l=1}^{m})=\sum_{l=1}^{m}\operatorname{tr}\!\big(\tilde{\mathbf{X}}_l\mathbf{L}_f\tilde{\mathbf{X}}_l^{\mathsf{T}}\big),\qquad \tilde{\mathbf{X}}_l=[\mathbf{X}_l\ \ \mathbf{X}^{sp}_l], \qquad (10)

where L_f can be computed by D_f − W_f. In this study, each pixel's spatial neighbors are the other pixels in the same segment obtained by SLIC, while its k spectral neighbors are selected with the Euclidean measurement in a kernel-induced space. Fig. 3 illustrates the spatial-spectral graph structure.
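To make the graph construction concrete, the sketch below builds a Gaussian-kernel k-NN spectral graph among pixels, a superpixel-membership spatial graph, and fuses them into a single affinity matrix and Laplacian; it is a simplified stand-in for Eqs. (6)-(10) (in particular, the spatial and alignment sub-graphs are merged at the pixel level), and the SLIC segmentation is assumed to be given as a per-pixel label array.

```python
import numpy as np
from scipy.spatial.distance import cdist

def spatial_spectral_graph(X, segments, k=10, sigma=1.0):
    """X: (d0, N) pixel spectra; segments: (N,) superpixel label of each pixel."""
    N = X.shape[1]
    D2 = cdist(X.T, X.T, metric="sqeuclidean")
    W_spe = np.zeros((N, N))
    for i in range(N):
        nn = np.argsort(D2[i])[1:k + 1]                   # k spectral neighbors
        W_spe[i, nn] = np.exp(-D2[i, nn] / (2 * sigma ** 2))
    W_spe = np.maximum(W_spe, W_spe.T)                    # symmetrize the k-NN graph
    # spatial neighbors: pixels that fall into the same SLIC segment
    W_spa = (segments[:, None] == segments[None, :]).astype(float)
    np.fill_diagonal(W_spa, 0.0)
    W = W_spe + W_spa                                     # simple fusion of both graphs
    L = np.diag(W.sum(axis=1)) - W                        # graph Laplacian L = D - W
    return W, L
```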
4) Regression Coefficient Regularization Ψ(P): This regularization term ensures a reliable solution and improves the generalization ability of the model, which is

\Psi(\mathbf{P})=\frac{1}{2}\|\mathbf{P}\|_{\mathrm{F}}^{2}. \qquad (11)

Hyperspectral data are non-negative, either in radiance or in reflectance. To inherit this physical nature, we expect to learn non-negative features in each learned low-dimensional representation (i.e., {X_l}_{l=1}^{m} ≥ 0). A hard orthogonal constraint with respect to the variable Θ could lead to non-convergence of the model or to a bad solution. To provide a proper search space for the solution, we therefore relax the constraint by imposing a sample-based norm constraint [44] on each latent subspace, i.e., ||x_{lk}||_2 ≤ 1, ∀k = 1, ..., N and l = 1, ..., m. Note that these constraints are similarly applicable to the superpixel-guided optimization problem.
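For reference, the two relaxed constraints can be enforced with simple projection operators like the ones sketched below: an element-wise non-negativity projection and a column-wise (sample-based) norm projection. These are generic implementations, not necessarily the exact operators used by the authors.

```python
import numpy as np

def project_nonnegative(X):
    """Element-wise projection onto the non-negative orthant (X_l >= 0)."""
    return np.maximum(X, 0.0)

def project_column_norm(X):
    """Rescale each column (sample) x_k so that ||x_k||_2 <= 1."""
    norms = np.maximum(np.linalg.norm(X, axis=0), 1.0)
    return X / norms
```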

C. Model Learning
Considering that we need to successively estimate multi-coupled variables in JPSA, which obviously increases the complexity and non-convexity of our model, a group of good initial approximations of the subspace projections {Θ_l}_{l=1}^{m} would greatly reduce the training time and help find a better local optimum. This is a common tactic that has been widely used to address this issue [45]. Inspired by this trick, we pre-train our model by simplifying Eq. (3) into the following layer-wise unsupervised problem:

\min_{\boldsymbol{\Theta}_l}\ \frac{1}{2}\big\|\tilde{\mathbf{X}}_{l-1}-\boldsymbol{\Theta}_l^{\mathsf{T}}\boldsymbol{\Theta}_l\tilde{\mathbf{X}}_{l-1}\big\|_{\mathrm{F}}^{2}+\frac{\beta}{2}\operatorname{tr}\!\big(\boldsymbol{\Theta}_l\tilde{\mathbf{X}}_{l-1}\mathbf{L}_f\tilde{\mathbf{X}}_{l-1}^{\mathsf{T}}\boldsymbol{\Theta}_l^{\mathsf{T}}\big) \quad \mathrm{s.t.}\ \ \tilde{\mathbf{X}}_l=\boldsymbol{\Theta}_l\tilde{\mathbf{X}}_{l-1}\geq 0,\ \|\tilde{\mathbf{x}}_{lk}\|_2\leq 1, \qquad (12)

where [X_l X_l^{sp}] is collectively rewritten as X̃_l for convenience of writing and model optimization.
We call Eq. (12) auto-reconstructing unsupervised learning (AutoRULe). Given the outputs of AutoRULe as the initialization for the problem of Eq. (3), {Θ_l}_{l=1}^{m} and P tend to obtain better estimates. In detail, Algorithm 1 summarizes the global algorithm of JPSA, where AutoRULe is itself initialized by LPP.
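A greedy layer-wise initialization in the spirit of AutoRULe might look like the sketch below; here a PCA-like projection stands in for the LPP-initialized per-layer solver of Algorithm 2 (an assumption made purely to keep the example self-contained), and the non-negativity clipping mimics the constraint on each latent subspace.

```python
import numpy as np

def pretrain_projections(X0, X0_sp, dims):
    """Greedy layer-wise initialization of {Theta_l} for the given layer dimensions."""
    thetas, X = [], np.hstack([X0, X0_sp])           # tilde-X: pixel and superpixel streams
    for d_l in dims:
        Xc = X - X.mean(axis=1, keepdims=True)
        U, _, _ = np.linalg.svd(Xc, full_matrices=False)
        Theta = U[:, :d_l].T                          # (d_l, d_{l-1}) projection (PCA-like)
        thetas.append(Theta)
        X = np.maximum(Theta @ X, 0.0)                # propagate with the non-negativity prior
    return thetas
```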
We propose to use an ADMM-based optimization method to solve the pre-training model (AutoRULe). Hence, an equivalent form of Eq. (12) is considered by introducing multiple auxiliary variables H, G, Q, and S to replace X̃_l, Θ_l, X̃_l^+, and X̃_l^∼, respectively, where (·)^+ denotes an operator that converts each component of a matrix to its absolute value and (·)^∼ is a proximal operator that handles the constraint ||x̃_{lk}||_2 ≤ 1 [46]. The resulting constrained optimization problem in Eq. (13) can be converted to its augmented Lagrangian version by introducing the Lagrange multipliers {Λ_n}_{n=1}^{4} and the penalty parameter μ, where the non-negativity and norm constraints are relaxed by defining two kinds of proximal projection operators, l_R^+(·) and l_R^∼(·). More specifically, l_R^+(·) enforces the non-negativity element-wise, while l_R^∼(·_k) is a sample-based normalization operator acting on the k-th column of its matrix argument. Algorithm 2 lists the optimization procedure of AutoRULe, in which the variables Θ_l, H, G, Q, and S are updated in turn, the Lagrange multipliers are then refreshed, and the penalty parameter is increased as μ^{t+1} = min(ρμ^t, μ_max) until the convergence conditions are met; the solution to each subproblem is detailed in Appendix A.

After running AutoRULe, its outputs are fed into JPSA as the model initialization, and then the two subproblems of Eq. (3), with respect to P and {Θ_l}_{l=1}^{m}, can be optimized alternately as follows.

Optimization with respect to the P subproblem: Typically, this is a Tikhonov-regularized least-squares regression problem, which can be formulated as

\min_{\mathbf{P}}\ \frac{\alpha}{2}\big\|\tilde{\mathbf{Y}}-\mathbf{P}\mathbf{V}\big\|_{\mathrm{F}}^{2}+\frac{\gamma}{2}\|\mathbf{P}\|_{\mathrm{F}}^{2}, \qquad (16)

where the variable Ỹ is the collection [Y Y], analogous to the variable X̃, and V is assigned to Θ_m ⋯ Θ_l ⋯ Θ_1 X̃, ∀l = 1, ..., m. Intuitively, the analytical solution of Eq. (16) can be directly derived as

\mathbf{P}=\alpha\,\tilde{\mathbf{Y}}\mathbf{V}^{\mathsf{T}}\big(\alpha\,\mathbf{V}\mathbf{V}^{\mathsf{T}}+\gamma\,\mathbf{I}\big)^{-1}. \qquad (17)
Optimization with respect to {Θ_l}_{l=1}^{m}: When the other variables are fixed, each Θ_l can be solved individually; the resulting problem for any Θ_l can be written in a general form (Eq. (18)) that parallels the AutoRULe problem. Likewise, the problem of Eq. (18) can basically be solved by following the framework of Algorithm 2 (more details regarding the variable optimization can be found in Appendix A). The only difference lies in the optimization of the subproblem with respect to H, whose modified formulation again admits a closed-form analytical solution. Finally, the aforementioned optimization procedures are repeated until a stopping criterion is satisfied.

D. Convergence Analysis
The iterative alternating strategy used in Algorithm 1 is essentially a block coordinate descent, whose convergence is theoretically guaranteed as long as each subproblem of Eq. (12) is exactly minimized [47]. Each subproblem optimized in Algorithm 2 is strongly convex, and thus the ADMM-based optimization strategy can converge to a unique minimum when the parameters are updated over a finite number of steps [48], [49]. Moreover, we experimentally illustrate the convergence of J-Play and the proposed JPSA on the two HS datasets, where the relative errors of the objective function value are recorded at each iteration (see Fig. 4).
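The stopping rule implied above, based on the relative change of the objective value between iterations, can be checked as in the following one-line sketch; the tolerance value is an arbitrary choice for illustration.

```python
def converged(obj_prev, obj_curr, tol=1e-4):
    """Stop when the relative change of the objective value falls below tol."""
    return abs(obj_curr - obj_prev) / max(abs(obj_prev), 1e-12) < tol
```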

III. EXPERIMENTS

A. Description of the Data
The experiments are performed on two different standard HS datasets, corresponding to different contexts, different sensors, and different resolutions.
1) Indian Pines AVIRIS Image: The first HS cube was acquired by the AVIRIS sensor and contains 16 vegetation-related classes. It consists of 145 × 145 pixels with 220 spectral bands covering the wavelength range from 400 nm to 2500 nm at a 10 nm spectral resolution. A widely-used set of training and test samples [1] with the specific categories is listed in Table I. A false-color image of the data is given in Fig. 5.
2) University of Houston Image: The second HSI was provided for the 2013 IEEE GRSS Data Fusion Contest. It was acquired by an ITRES CASI-1500 sensor over the campus of the University of Houston, Houston, USA, with a size of 349 × 1905 × 144 in the wavelength range from 364 nm to 1046 nm. The information regarding the classes as well as the training and test samples can also be found in Table I. The first image of Fig. 6 shows a false-color image of the study scene.

B. Experimental Setup and Preparation
We learn the subspaces for the different methods on the training set; the test set is then simply projected onto the learned subspace, where training and test samples are further classified by the nearest neighbor (NN) classifier. The reason for selecting this simple but effective classifier is that the NN classifier tends to avoid a confusing evaluation: if more advanced classifiers were used, it would be unclear whether a performance improvement originates from the classifiers or from the features themselves.
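This protocol, projecting both sets through the learned projections and classifying with a 1-NN classifier, can be sketched as follows using scikit-learn's KNeighborsClassifier; the function name and data layout are our own conventions, not those of the released code.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def evaluate_subspace(thetas, X_train, y_train, X_test):
    """Project training/test pixels through {Theta_l}, then classify with 1-NN."""
    Z_tr, Z_te = X_train, X_test
    for Theta in thetas:
        Z_tr, Z_te = Theta @ Z_tr, Theta @ Z_te
    clf = KNeighborsClassifier(n_neighbors=1).fit(Z_tr.T, y_train)
    return clf.predict(Z_te.T)
```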
We adopt three criteria to quantitatively assess the algorithm performance: Overall Accuracy (OA), Average Accuracy (AA), and the Kappa Coefficient (κ). They can be formulated as

\mathrm{OA}=\frac{N_c}{N_a},\qquad \mathrm{AA}=\frac{1}{L}\sum_{i=1}^{L}\frac{N_c^{i}}{N_a^{i}},\qquad \kappa=\frac{\mathrm{OA}-P_e}{1-P_e},

where N_c and N_a denote the number of correctly classified samples and the total number of samples, respectively, while N_c^{i} and N_a^{i} correspond to N_c and N_a for each class. P_e in κ is defined as the hypothetical probability of chance agreement [51], which can be computed by

P_e=\frac{1}{N_a^{2}}\sum_{i=1}^{L}N_r^{i}\,N_p^{i},

where N_r^{i} and N_p^{i} denote the number of real (reference) samples and the number of predicted samples for each class, respectively.
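For completeness, a small sketch computing the three criteria from integer label vectors is given below; labels are assumed to be encoded as 0, ..., L-1, and the function name is ours.

```python
import numpy as np

def oa_aa_kappa(y_true, y_pred, n_classes):
    """Overall accuracy, average accuracy, and kappa from a confusion matrix."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    oa = np.trace(cm) / cm.sum()
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))                 # mean per-class accuracy
    pe = np.sum(cm.sum(axis=1) * cm.sum(axis=0)) / cm.sum() ** 2
    return oa, aa, (oa - pe) / (1 - pe)
```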

C. Results Analysis and Discussion
1) Indian Pines Dataset: Table II presents the classification performance of the different methods, with the optimal parameter settings tuned by cross-validation on the training set, using the NN classifier. Correspondingly, the classification maps are given in Fig. 5 for visual assessment.
Overall, PCA provides performances similar to the baseline (OSF), as PCA focuses more on maximizing the retained information but can fail to uncover the underlying data structure. By smoothing the spatial structure of the HSI, OTVCA enables better identification of the materials than OSF and PCA. For the supervised HDR methods, the classification performance of classic LDA is even lower than those mentioned previously, due to the limited number of training samples. Holding a more powerful intra-class homogeneity and inter-class separation, LFDA obtains more competitive results by locally focusing on discriminative information, outperforming the baseline, PCA, and LDA by around 8%. However, the features learned by LFDA are relatively difficult to generalize, due to the small number of labeled samples. Comparatively, SELD learns a robust low-dimensional feature representation with a higher generalization ability, since unlabeled samples are involved in the model training. In SELD, the unlabeled information is embedded by computing the similarities between samples, which is, to some extent, more effective than using pseudo-labels (e.g., SLLDA and SSLFDA). However, these semi-supervised methods still struggle with noisy data. Direct evidence can be seen in Fig. 5, where obvious salt-and-pepper-like noise appears in the classification maps of SELD, SLLDA, and SSLFDA. Likewise, although the SAE holds a strong nonlinear learning ability for data representation, its performance is still limited by complex spectral variability and pixel-wise feature embedding. Thanks to spatial information modeling, CAE locally extracts the spatial information and thus obtains a relatively smooth classification result. With the benefit of a multi-linear regression system, the J-Play algorithm performs much better (by at least 7% OA) than the DNN-based nonlinear HDR methods (SAE and CAE). Such a strategy makes the learned features more robust against various spectral deformations and degradations, despite not accounting for spatial information.
The performance of the proposed JPSA is superior to that of the other methods, which indicates that JPSA can learn a more discriminative and robust spectral embedding. The alignment-based SSM embedding enables us to identify the materials more accurately on a small-scale training set. As shown in Fig. 5, the classification map obtained by JPSA is smoother than the others, demonstrating that our method is capable of effectively aggregating spatially contextual information in the process of HDR by means of superpixels. It is worth noting that JPSA not only outperforms the others from the overall perspective, but also obtains highly competitive results for each class, particularly for Corn, Soybean-Notill, Soybean-Mintill, Soybean-Clean, and Building-Grass-Trees, which show a dramatic improvement of about 10% in classification accuracy.
2) The University of Houston Dataset: Fig. 6 shows a visual comparison among the different algorithms, and the specific classification accuracies of the compared methods, optimally parameterized by cross-validation, are listed in Table III.
Generally, there is a basically consistent trend in classification performance between OSF and PCA: around 72% OA as a baseline. For another unsupervised HDR method, OTVCA yields approximately a 2% improvement over OSF and PCA. Owing to the use of the total variation operator in OTVCA (see the smooth classification map in Fig. 6), it shares similar performances with discriminant analysis-based approaches such as LDA and LFDA. The reason why the unsupervised OTVCA is comparable to the supervised HDR methods could be, to some extent, two-fold. On the one hand, the local smoothing strategy is a good fit for HS feature extraction and HDR tasks; on the other hand, the small-size training set hinders the supervised LDA and LFDA from finding a generalized or transferable discriminative subspace. Nevertheless, LFDA is capable of steadily performing better than OTVCA owing to the consideration of the local manifold structure. This might be seen as indirect evidence of the effectiveness of manifold embedding in HDR. More intuitively, the performance of the semi-supervised methods is superior to that of those only considering the labeled samples, where SSLFDA achieves the best classification results. This demonstrates the effectiveness of embedding unlabeled samples in improving the generalization ability of the learned subspace.

JPSA outperforms the other HDR algorithms significantly, which indicates that the proposed method is capable of effectively approximating an optimal mapping from the original space to the label space by fully considering a trade-off between spectral discrimination and subspace robustness, thus providing a robust and discriminative low-dimensional feature representation. Further, the embedding of spatial-spectral information enables semantically meaningful object-based HS classification results. Notably, JPSA is able to more effectively eliminate the effects of the shadow cast by clouds during image acquisition, compared to the other methods, as shown in Fig. 6.

D. Parameter Sensitivity Analysis of JPSA
The quality of the low-dimensional feature embedding depends, to some extent, on the parameter selection; it is therefore indispensable to investigate the sensitivity of the parameter setting in JPSA. Five main parameters involved in JPSA need to be analyzed and discussed with emphasis, as they have a significant effect on the dimension-reduced features and even on the final classification results. They include the three regularization parameters (α, β, and γ) in Eq. (3), the subspace dimension (d), and the number of layers (m).
We first analyze the effect of different m on JPSA. With different numbers of learnt projections, we successively specify the proposed model as JPSA_1, ..., JPSA_l, ..., JPSA_m, ∀l = 1, ..., m. To investigate the trend of the OAs, m is uniformly set up to 8 on the two datasets. We experimentally set the number of clusters in SLIC to 10% of the total number of samples. As listed in Table IV, with the increase of m, the performance of JPSA with SSM embedding steadily increases, reaching its best at around 3 layers for both datasets, and then gradually decreases with slight perturbations. This might be explained by over-fitting and error accumulation of the model in the multi-layered regression process, since our model is trained on only a limited number of samples. Note that more results for J-Play in terms of the parameter m can be found in [1], and the code is openly available at https://github.com/danfenghong/ECCV2018_J-Play.
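As a pointer to how the superpixel map used throughout the experiments could be generated, the following sketch calls scikit-image's SLIC implementation with the number of segments set to roughly 10% of the pixel count; the compactness value and the interpretation of "total samples" as the number of pixels are assumptions of this sketch.

```python
from skimage.segmentation import slic

def superpixel_labels(cube):
    """cube: (H, W, B) hyperspectral image; returns an (H*W,) array of segment labels."""
    h, w, _ = cube.shape
    segments = slic(cube.astype(float), n_segments=int(0.1 * h * w), compactness=10)
    return segments.reshape(-1)
```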
Apart from the parameter m, the regularization parameters and the subspace dimension also play a crucial role in the model's performance. More specifically, the resulting quantitative analysis on the two datasets is given in Fig. 7, where the parameter combinations (α = 1, β = 0.1, γ = 0.1, d = 20) and (α = 1, β = 0.1, γ = 0.1, d = 30) achieve the best classification performance on the test sets of the first and second datasets, respectively. The resulting parameter selection for the two datasets is basically consistent with that determined by 10-fold cross-validation on the training set (please see Section III-B for more details). Cross-validation is therefore an effective strategy to automatically determine the model's parameters, so that other researchers are able to reproduce the results for their own tasks; specifically, the optimal parameters can be determined by testing all parameter combinations. Furthermore, we only show two-dimensional figures (see Fig. 7) for the convenience of visualization, where the other variables are fixed to their optimal values except for the currently investigated one. Moreover, we can observe from Fig. 7 that, with the increase of d, JPSA's performance increases up to the optimal value at a dimension of 20 for the Indian Pines dataset and 30 for the University of Houston dataset, respectively, then remains relatively stable, and finally decreases with slight perturbations when the subspace dimension approaches that of the original spectral signature. The variable α, which mainly controls the prediction errors between the input data and the labels, is a very important factor that needs to be carefully considered in the model learning, since the setting of α is sensitive for the feature embedding and even for the final classification results. Similarly, the SR and SSM alignment terms also have a great effect on the classification performance, which indicates the importance of these two terms. Moreover, the subspace dimension is a noteworthy factor as well, although the OAs with different dimensions are relatively stable once the variable d reaches a larger value (e.g., 10).
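The cross-validation procedure described above can be organized as a simple grid search, sketched below; `fit` and `score` are user-supplied callables standing in for JPSA training and OA evaluation, and the parameter grid itself would be chosen by the user.

```python
import itertools
import numpy as np

def grid_search(fit, score, folds, grid):
    """Exhaustively evaluate parameter combinations by k-fold cross-validation.

    fit(train, **params) -> model;  score(model, val) -> accuracy;
    folds: list of (train, val) splits;  grid: dict mapping parameter name -> candidates.
    """
    best_params, best_score = None, -np.inf
    for combo in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        cv = np.mean([score(fit(tr, **params), va) for tr, va in folds])
        if cv > best_score:
            best_params, best_score = params, cv
    return best_params, best_score
```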

E. Ablation Studies of JPSA
Additionally, we analyze the performance gain of JPSA by adding its different components step by step, i.e., the SR term, the SSM alignment term, etc. Table V details the increase in performance as the different terms are fused. As it turns out, successively embedding each component into JPSA leads to a progressive enhancement of the feature representation ability. This demonstrates the effectiveness of the proposed JPSA model for HDR.

IV. CONCLUSION
In this paper, we proposed a joint and progressive subspace analysis (JPSA) technique to learn an optimal mapping for effective HS data compression along the spectral dimension. JPSA is expected to find a discriminative subspace where the samples can be semantically (label information) and structurally (SSM or topology preservation and alignment) represented, and can thereby be better classified. Oriented towards assessing pixel-wise HS classification performance, we conducted extensive experiments with JPSA in comparison with previous state-of-the-art HDR methods. The desirable results obtained by JPSA demonstrate its superiority and effectiveness, particularly in handling various complex spectral variabilities compared to nonlinear DR techniques (e.g., DL-based methods). In the future, we will further develop and apply the JPSA framework to multi-modality learning.

APPENDIX A SOLUTION TO AUTORULE
The solution to problem (12) can be obtained by equivalently solving problem (13) with ADMM. Considering that the objective function in Eq. (13) is not convex with respect to all variables simultaneously, but is convex with respect to each separate variable when the others are fixed, we successively minimize L_μ (Eq. (13)) with respect to Θ_l, H, G, Q, S, and {Λ_n}_{n=1}^{4} as follows.

Θ_l problem: With the other variables fixed, the subproblem for Θ_l admits a closed-form solution.

H problem: The variable H can be estimated by solving a regularized least-squares problem, whose analytical solution is likewise given in closed form.

G problem: The optimization problem for G is also a least-squares problem that can be solved effectively in closed form.

Q problem: The update rule for Q is given by applying the non-negativity-related proximal projection operator l_R^+(·) in each iteration.

S problem: The variable S is estimated by a problem whose solution is updated in each iteration by the vector-based (column-wise) projection operator of Eq. (15).

Lagrange multipliers ({Λ_n}_{n=1}^{4}) update: Before stepping into the next iteration, the Lagrange multipliers are updated by the standard ADMM dual ascent step with the penalty parameter μ.

Fig. 3. A showcase illustrating the graph structure used in the alignment-based SSM regularization term.


Fig. 4. Convergence analysis of J-Play and JPSA with different m values of 2, 4, 6, and 8 (left to right), experimentally performed on the two HS datasets. (a) Indian Pines dataset. (b) University of Houston dataset.

Fig. 5. A false-color image, the distribution of training and test sets with category names, and classification maps of the different algorithms obtained using the NN classifier on the Indian Pines dataset.

Fig. 6. A false-color image, the distribution of training and test sets with category names, and classification maps of the different algorithms obtained using the NN classifier on the University of Houston dataset.

TABLE I. Scene categories and the number of training (TR) and test (TE) samples for each class on the two datasets: Indian Pines and University of Houston.

TABLE II. Quantitative performance comparison of different algorithms on the Indian Pines dataset with the optimal parameter combination, in terms of OA, AA, and κ, as well as the accuracy for each class. The best results are shown in bold. J-Play_4 denotes a four-layered J-Play model (m = 4), while JPSA_4 denotes a four-layered JPSA model (m = 4).

TABLE IV. Classification performance (OA, AA, and κ) with different numbers of learnt projections (m) on the two datasets.

TABLE V. Ablation analysis of JPSA with a progressive combination of different terms on the two datasets.