MIMA: MAPPER-Induced Manifold Alignment for Semi-Supervised Fusion of Optical Image and Polarimetric SAR Data

Multi-modal data fusion has recently shown promise in classification tasks in remote sensing. Optical data and radar data, two important yet intrinsically different data sources, are attracting more and more attention for potential data fusion. It is widely known that machine learning-based methodologies often yield excellent performance. However, such methodologies rely on a large training set, which is very expensive to obtain in remote sensing. Semi-supervised manifold alignment (SSMA), a multi-modal data fusion algorithm, has been designed to amplify the impact of an existing training set by linking labeled data to unlabeled data via unsupervised techniques. In this paper, we explore the potential of SSMA in fusing optical data and polarimetric synthetic aperture radar (SAR) data, which are multi-sensory data sources. Furthermore, we propose a MAPPER-induced manifold alignment (MIMA) for the semi-supervised fusion of multi-sensory data sources. Our proposed method unites SSMA with MAPPER, which is developed from the emerging field of topological data analysis (TDA). To the best of our knowledge, this is the first time that SSMA has been applied to fusing optical data and SAR data, and also the first time that TDA has been applied in remote sensing. The conventional SSMA derives a topological structure using $k$-nearest neighbor (kNN), while MIMA employs MAPPER, which considers the field knowledge and derives a novel topological structure through spectral clustering in a data-driven fashion. The experimental results on data fusion with respect to land cover land use classification and local climate zone classification suggest superior performance of MIMA.


I. INTRODUCTION
In recent decades, data fusion has attracted a lot of attention in the remote sensing community [1], [2], [3], [4], motivated by the simple fact that multiple data sources reveal complementary physical properties of observed scenes.
This work is jointly supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. [ERC-2016-StG-714087], Acronym: So2Sat), Helmholtz Association under the framework of the Young Investigators Group "SiPEO" (VH-NG-1018, www.sipeo.bgu.tum.de), and the Bavarian Academy of Sciences and Humanities in the framework of Junges Kolleg.
For example, optical RGB data normally possesses high spatial resolution [5], while multi/hyperspectral data contains spectral information [6], and synthetic aperture radar (SAR) data gives dielectric and geometric properties. Thus, it is valuable to develop algorithms that are able to take advantage of different data sources for applications. In this regard, machine learning techniques are becoming increasingly important due to their excellent performance [7], [8], [9]. As is generally known in machine learning, the training data set is of great importance [10]. Most successful techniques require a large set of training data [11]. However, accessing a large training data set is very expensive, especially in remote sensing, because labeling a training data set in this field requires more expertise than identifying dogs and cats. Therefore, a semi-supervised learning technique is a good option for remote sensing tasks, as it links the unlabeled data set to the training data set by unsupervised approaches during learning, which amplifies the effect of the existing training data set. Considering the importance of data fusion and the value of training data, this paper studies a semi-supervised learning technique, named manifold alignment, to fuse optical image and polarimetric SAR (PolSAR) data for the purpose of classification.

A. Fusion of optical and SAR data
Due to the rapid development of remote sensing missions such as LandSat-8, Sentinel-2, EnMAP for optical remote sensing and TerraSAR-X, TanDEM-X, Sentinel-1 for radar remote sensing, a huge amount of optical data and SAR data has been collected, and the data volume can be expected to increase over time. The fusion of the two data sets holds great potential for use in various applications [12]. Besides data availability, the other reason to fuse them is that the dielectric and geometric properties provided by SAR data are complementary to the spectral information of optical data. However, fusing them in practice is not as straightforward as the argument for doing it. The difficulty lies in the intrinsic differences in their imaging geometry. Because of the slanted looking angle of the SAR sensor, SAR images have an oblique appearance, with distortions of foreshortening, shadowing, and layover. This results in an image geometry that is severely dissimilar to that of the nadir-looking optical data. The extent of SAR distortions is positively correlated with height. This poses substantial challenges when fusing these two data sets, especially in urban areas with large height fluctuations. To date, some studies have explored fusing these two data sources. We categorize those studies into three types based on their purposes: (1) registration oriented, (2) detection oriented, and (3) classification oriented.
1) Registration is actually a prerequisite of any further fusion. However, precise registration of SAR and optical images is very challenging due to geometric differences. A conditional generative adversarial network [13] was trained to generate an artificial SAR image given a real-world optical image, and the optical image and SAR data are then registered by matching the artificial SAR data with the real-world SAR data. This technique was shown to be effective in a suburban area. A 3D registration is introduced in [14] to align optical and SAR data by imitating the physical procedure of optical and SAR imaging based on a digital surface model. A pseudo-siamese convolutional neural network architecture [15] was trained to identify corresponding optical and SAR data in image patches and showed promising preliminary results. So far, although progress has been made in recent years, precise SAR and optical data registration has not achieved a robust solution, especially for complex urban areas. Thus, for other purposes of fusing these two data sources, the straightforward approach is registration by geographic coordinates.
2) Detection tasks have proven successful when using optical and SAR data for the purpose of detecting building outlines [16], crops [17], water [18], [19] and urban areas [20]. Since the detection task focuses on specific targets, studies extract representations of those targets from each of the two data sources so that they can work together to identify targets. For example, for detecting crops, optical data provides spectral signatures and SAR provides scattering mechanisms of the targets of interest; these characteristics are extracted and used together to identify the target under detection.
3) Classification is more challenging than detection because more than one class of interest is under consideration. This paper focuses on this challenge [21]. Recently, a number of studies [22], [23], [24], [25] have tried to solve classification tasks by using both optical and SAR data. In general, these fusion strategies all extract features from the individual data sets, then concatenate all the features and feed them into various classifiers. The most important part of this procedure is to extract hand-crafted features [26] that are informative for classification. A two-stream convolutional neural network (CNN) [27] derives high-level features of individual data sets by utilizing the power of CNNs and then concatenates those features for classification. In brief, concatenation, an effective and straightforward approach, has been the main strategy for fusing SAR and optical data so far.

B. Semi-supervised manifold alignment (SSMA)
SSMA pursues a projection for each input data source, and maps corresponding data into a shared latent space [28], [29]. The following properties hold within this space: (a) data of the same class are located close to each other; (b) data of different classes are located far from each other; (c) the topological property of individual data is preserved. These three properties make SSMA a promising candidate for our task from a methodological perspective, for two reasons. First, the first two properties allow classification-relevant information from any data source to be used. Second, the final property implicitly connects unlabeled data to training data, which amplifies the functionality of the training data. These two factors meet our need for an algorithm that fuses data sets with the maximum usage of the training data.
In the remote sensing community, SSMA has been investigated for various applications. It was applied in [30] to fuse an RGB image and a hyperspectral image so that visualization of the hyperspectral image could be achieved in the latent space, exhibiting more spectral information than conventional visualization methods. A kernel manifold alignment was introduced in [31] to fuse multiple optical remote sensing data into a latent space by nonlinear projections for a classification task. Manifold alignment was also used in [32] to align spectral signatures from different optical data sets by projecting them into a latent space so that object detection was achieved.
In remote sensing data fusion, different data sources observe the same region of interest. Essentially, the observed target is a single object that appears differently in the data sources due to sensor specifications. Thus, a question arises: although SSMA is theoretically a good choice, does there exist one latent space of the observed objects in which the data sources can be aligned? If it exists, can we find that space by using SSMA? Tuia et al. [33] applied SSMA to find the underlying space of multiple optical data sets under three scenarios: different looking angles, multi-temporal, and different sensors. In this work, we aim to fuse multi-sensory data sets, namely optical image and polarimetric SAR data, by SSMA.

C. Topology and MAPPER
One important feature of SSMA resides in its exploration of the topological structure of data. Conventional manifold-based methods [28], [29], [34], [35], [36] approximate topological properties by using the kNN. They essentially assume that the underlying manifold of the data is a Riemannian manifold which can be locally approximated by Euclidean measurement [37], [38], [39]. Recently, topological data analysis (TDA) has emerged as a new mathematical sub-field of big data analysis, by means of studying topological properties in the data [40], [41], [42], [43]. One TDA tool, named MAPPER, resolves a computable approximation of the Reeb graph, which represents the topological structure of the data with respect to one intrinsic property of interest [44], [45].
A general explanation of topology is that it is an art of simplification. It ignores complex information of the object under study, and instead focuses on one meaningful aspect of it. In this regard, conventional manifold methods focus on the local connection or the local structure. MAPPER (Reeb graph), on the other hand, focuses on the topological structure of data related to the intrinsic property of interest.
In real applications, MAPPER has been proven capable of revealing unknown knowledge in medical studies, by interpreting topological structures of data sets. The tool was applied to analyze breast cancer transcriptional data and uncovered a sub-group of Estrogen Receptor-positive (ER+) breast cancers. Patients suffering from this kind of cancer exhibit 100% survival and no metastasis. This finding was previously unknown and is invaluable for future treatment [46]. MAPPER was also applied to analyze the data of preclinical traumatic brain injury (TBI) and spinal cord injury (SCI). It revealed a previously unknown pattern of co-occurring TBI and SCI, as well as a previously unknown harmful effect of an experimental drug treatment [47]. With the help of MAPPER, Li et al. [48] explored complex medical records of type 2 diabetes (T2D) patients and revealed previously unknown sub-groups within T2D. All the above discoveries are invaluable and contribute to greater precision in the practice of medicine.
Besides the inspiration of these successful studies in medicine and the sound theoretical foundation of MAPPER [44], the other reasons which motivated the authors to utilize MAPPER to explore the topological structure of remote sensing data are as follows:
1) The field knowledge: MAPPER focuses on the topological structure of data related to an intrinsic property. In practice, the intrinsic property is quantitatively derived from the data by an expert-designed filter function. The quantified property operates as a lens through which MAPPER observes the data and extracts the topological structure of the data. Therefore, the choice of the lens, or equivalently the filter function, introduces field knowledge into the procedure of extracting topological structure. To the best of our knowledge, the ability to extract topological structure from a field-knowledge perspective is unprecedented for manifold-related techniques in remote sensing.
2) The regional-to-global topological structure: Instead of focusing on local structures of data points as in conventional manifold-based techniques, MAPPER focuses on an intrinsic property introduced by the filter function. Under the guidance of the filtered values, MAPPER divides the data into several bins, derives the topological structure of each bin, and collects those structures together as a global one. This results in a regional-to-global topological structure. For complex remote sensing data, especially SAR data, the regionally derived structure is more robust to outliers than the locally derived one.
3) The data-driven and optimized topology: Spectral clustering is embedded into MAPPER in this work, leading to a data-driven and optimized topological structure.
• A data-driven topology. The eigen-gap concept in spectral clustering detects the number of clusters [49]. This ensures that the derived topological structure suits the distribution of the data. In contrast, conventional techniques derive the topological structure of the whole data set with the kNN of a fixed $k$ [50].
• An optimized topology. Spectral clustering is an optimized graph-cut algorithm, which is capable of unbiased grouping [51], [52]. In contrast, a conventional manifold technique directly relies on the precision of the similarity measurement. Although sophisticated similarity measurements have been developed in remote sensing, the high dimensionality and the complexity of the data still pose challenges to the measurement.

D. Summary
The contributions of this paper are three-fold.
• This work studies the fusion of heterogeneous remote sensing data sources, namely, optical data and polarimetric SAR data, with the semi-supervised manifold alignment technique.
• To the best of our knowledge, this is the first time that the topological data analysis (TDA) technique has been applied in the remote sensing community.
• A novel MAPPER-induced manifold alignment is proposed for semi-supervised data fusion. Its performance on the fusion of polarimetric data and optical data regarding classification is quantitatively analyzed.
The remainder of this paper is organized as follows. In Section II, MAPPER and SSMA are reviewed, and MIMA is introduced. The experimental setup, results, and comparisons are provided in Section III. Finally, Section IV provides conclusions and remarks on the work.

II. METHODOLOGY
In this section, we first introduce the background of the topological data analysis tool called MAPPER. Then, we review the basics of semi-supervised manifold alignment. Finally, the novel MIMA is introduced.

A. MAPPER
In order to introduce MAPPER [44] in a comprehensive and understandable way, we first provide an intuitive example, shown in Fig. 1. The theoretical foundation of MAPPER is then introduced from the perspective of applied topology. Because this section relies heavily on mathematical concepts, we note that the notation in Section II-A is independent of the notation used in the rest of the paper.
1) Intuitive explanation: MAPPER is a mathematical tool developed from applied topology to analyze and visualize big data sets [44]. The algorithm essentially consists of three components:
• Filter function selection. MAPPER first requires a filter function which derives a filtering space where the intrinsic property of interest is quantified. The chosen filter function should reveal a physical meaning or geometric property of the data. It allows a specialist to introduce field knowledge into data analysis. For the example shown in Fig. 1, the filter function is chosen as the distance to the wrist, so that the filtering space in Fig. 1 (B) is derived from the data point cloud.
• Data separation. In the filtering space, the continuous value range is sliced into overlapping intervals with a given overlap percentage and number of intervals, as shown in Fig. 1 (C). Guided by the overlapping intervals [54], the original input data can be separated into overlapping data bins accordingly, as shown in Fig. 1 (D). The separated data in the bins have the same dimension as the original data.
• Clustering. The data in each bin are clustered, and the result is summarized as a graph where a node represents a cluster, and an edge represents a link between two clusters. The link is generated for two clusters if they share common data points. Therefore, the graph serves as a simplified visualization of the topological structure of a data set. For example, the graph in Fig. 1 (E) is derived by MAPPER to represent the topological structure of the point cloud data of a human hand.
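The data separation step can be illustrated with a minimal Python sketch. The function names, the equal-length interval layout, and the convention that the overlap is a fraction of the interval length are assumptions made here for illustration; the source only specifies a number of intervals and an overlap percentage.

```python
def overlapping_intervals(lo, hi, n_intervals, overlap):
    """Cover [lo, hi] with n_intervals equal-length intervals.

    overlap is the fraction (0 <= overlap < 1) of an interval's length
    that it shares with its neighbour (assumed convention).
    """
    if n_intervals < 1 or not 0 <= overlap < 1:
        raise ValueError("need n_intervals >= 1 and 0 <= overlap < 1")
    # The union has length: length * (n - (n - 1) * overlap) = hi - lo.
    length = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = length * (1 - overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n_intervals)]


def assign_to_bins(filter_values, intervals):
    """Return, per interval, the indices of points whose filter value lies in it."""
    return [
        [idx for idx, value in enumerate(filter_values) if a <= value <= b]
        for a, b in intervals
    ]


# Three intervals with 50% overlap on the filter range [0, 10].
intervals = overlapping_intervals(0.0, 10.0, 3, 0.5)
bins = assign_to_bins([1.0, 3.0, 6.0, 9.0], intervals)
```

Points whose filter value falls into the overlap of two intervals appear in both bins, which is what later allows clusters from adjacent bins to be linked.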
It is worth highlighting that the filter function is not seen as a dimension reduction; rather, it defines a filtered space which guides the separation of the original data. As mentioned above, topology is an art of simplification. Conventional manifold learning focuses on the local structure of individual data points. MAPPER, on the other hand, derives the topological structure of data while focusing on the property quantified by the filter function.
2) Theoretical foundation: First, it is necessary to introduce the concept of covering in topology. In [55], it is explained as follows: let $p : U \to Z$ be continuous and surjective. If every point $z$ of $Z$ has a neighborhood $C$ that is evenly covered by $p$, then $p$ is called a covering map, and $U$ is said to be a covering space of $Z$; moreover, $p$ is a local homeomorphism of $U$ with $Z$. This means that, in terms of the function $p$, the preimage in $U$ and the image in $Z$ share the same topological properties locally.
The rest of the theoretical foundation is introduced in blocks corresponding to the three components of the MAPPER.
• Filter function selection. According to MAPPER [44], data is situated in a topological space $X$, as illustrated in Fig. 1 (A). A continuous function $f : X \to Z$ projects space $X$ to another space $Z$, as shown in Fig. 1 (B).
• Data separation. The space $Z$ is equipped with a covering space $U$, as shown in Fig. 1 (C). Assuming that the covering space $U$ is a $k$-simplex spanned by a set $\{\alpha_1, \alpha_2, \ldots, \alpha_k\}$ so that $U = \{U_\alpha\}$, since $f$ is continuous, $f^{-1}(U_\alpha)$ forms a covering of space $X$ and could be used to represent the topological space $X$ of the given data.

B. Semi-supervised manifold alignment (SSMA)
Let $X_i \in \mathbb{R}^{m_i \times n_i}$ be a matrix representing the $i$th data source, with $m_i$ dimensions by $n_i$ instances. The term $x_i^k$ denotes the $k$th instance of the $i$th data source. Let $K$ denote the total number of data sources. SSMA learns a set of $K$ projections $\{f_1, \ldots, f_K\}$. The $i$th projection $f_i$ maps the $i$th data source $X_i$ into the latent space, where all the $K$ data sources are aligned in terms of the three desired properties discussed in the Introduction. The properties are formulated by three matrices, called the similarity matrix, dissimilarity matrix, and topology matrix. More specifically, the similarity matrix (1) is computed from label information to pursue property (a): data of the same class are located close to each other.
The dissimilarity matrix is formed as (2) to accomplish property (b): data of different classes are located far from one another.
The topology matrix (3) describes the topological structure of the data, which aims at the property (c): the topological property of individual data is preserved.
Each of the matrices (1), (2), and (3) has the size $(n_1 + \cdots + n_K) \times (n_1 + \cdots + n_K)$. In each matrix, the block $W^{i,j}$ is a matrix representing the relationship between the $i$th and $j$th data sources with respect to the individual property. The similarity matrix $W_s$ and the dissimilarity matrix $W_d$ are generated based on label information. If $x_i^p$ and $x_j^q$ share the same label, then $W_s^{i,j}(p, q) = 1$; otherwise $W_s^{i,j}(p, q) = 0$. If $x_i^p$ and $x_j^q$ belong to different classes, then $W_d^{i,j}(p, q) = 1$; otherwise $W_d^{i,j}(p, q) = 0$.
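The label-based blocks can be sketched in a few lines of Python. The function name and the use of `None` to mark unlabeled instances are assumptions for illustration; the rule itself (1 for matching labels in the similarity block, 1 for differing labels in the dissimilarity block, 0 otherwise) follows the definitions above.

```python
def label_matrices(labels_i, labels_j):
    """Build one similarity block and one dissimilarity block from labels.

    labels_i, labels_j: per-instance class labels of data sources i and j.
    None marks an unlabeled instance (assumed convention); such an
    instance contributes 0 to both blocks.
    """
    w_s = [[0] * len(labels_j) for _ in labels_i]
    w_d = [[0] * len(labels_j) for _ in labels_i]
    for p, label_p in enumerate(labels_i):
        for q, label_q in enumerate(labels_j):
            if label_p is None or label_q is None:
                continue  # no label information for this pair
            if label_p == label_q:
                w_s[p][q] = 1
            else:
                w_d[p][q] = 1
    return w_s, w_d


# Source i has three instances (one unlabeled), source j has two.
w_s, w_d = label_matrices(["water", "forest", None], ["forest", "water"])
```

Rows belonging to unlabeled instances stay zero in both blocks, so those instances influence the projections only through the topology matrix.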
Since the topological structure of each individual data set is preserved, the matrix $W_t$ is a block-diagonal matrix. The topological structure is conventionally given by the kNN, which means $W_t^{i,i}(p, q) = 1$ if $x_i^p$ and $x_i^q$ are neighbors in a given kNN neighborhood; otherwise, $W_t^{i,i}(p, q) = 0$.
In order to simultaneously model the three properties of the latent space, three terms, Eqs. (4)-(6), are formulated for the cost function. Minimizing Eq. (4) has the effect of pulling data of the same class together in the latent space, which meets property (a).
Maximizing Eq. (5) tends to push data of different classes away from each other, which is consistent with property (b).
Minimizing Eq. (6) preserves the topological structure of each individual data set, corresponding to property (c). Eqs. (4)-(6) can be combined into the final cost function, formulated as (7), and hence an optimization problem (8) can be written. As proven in [29], the solution $f_1, \ldots, f_K$ that minimizes the cost function $L(f_1, \ldots, f_K)$ is given by the eigenvectors corresponding to the smallest non-zero eigenvalues of the generalized eigenvalue decomposition (9). The matrices $D$ and $L$ in (9) are the degree matrix and the Laplacian matrix, respectively.

C. MAPPER-induced manifold alignment for semi-supervised data fusion (MIMA)
As introduced in the last section, three properties are pursued in SSMA while projections are being learned. Essentially, the first two properties seek to minimize intra-class variance and maximize inter-class variance for the projected data by using label information. This is a goal commonly pursued by many classification strategies, such as linear discriminant analysis [56]. The third property, preserving topological structure, brings two powerful characteristics to SSMA. First, the topological structure is extracted from data, both with and without a label. Thus, SSMA builds up connections among them, which implicitly propagates the label information to unlabeled data. This amplifies the usage of existing labels. Since labels are valuable, the propagation property of the topological term is highly valued. Second, topology emphasizes a notion of nearness, but can distort or even ignore large distances [44]. This is a desirable property for the purpose of classification. For instance, data of one class are located within a certain extent of the feature space, and locations far from that extent are meaningless for classifying the specific class. This has also been shown in classification using topology [36], [35].

Algorithm 1: MAPPER($X_i$, $b$, $c$, $F$)
Input: $X_i \in \mathbb{R}^{m_i \times n_i}$: the $i$th data source with $n_i$ instances and $m_i$ dimensions; $b$: the number of bins; $c$: overlapping percentage of adjacent bins; $F$: filter function.
Output: $W_c^{i,i}$: adjacency matrix with the size of $n_i \times n_i$.
1 calculate the parameter space $X_i^F$
2 divide $X_i^F$ into $b$ intervals with $c$% overlap of adjacent intervals
3 divide data $X_i$ into $b$ data bins corresponding to the intervals obtained in step 2
4 for (each data bin):
5     spectral clustering
6 end for
7 construct the topological matrix: $W_c^{i,i}(p, q) = 1$ if $p$ and $q$ are in the same cluster; $1$ if $p$ and $q$ are in linked clusters; $0$ otherwise.

The projections $\{f_1, \ldots, f_K\}$ are then computed by solving Eq. (9).
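As a reading aid, the three properties and their combination can be written out explicitly. The block below is a sketch of one common formulation from the manifold-alignment literature, not necessarily the exact form of Eqs. (4)-(9): the factor $1/2$, the symbol names $A_s$, $A_d$, $A_t$ and $Z$, and in particular the placement of the weight $\mu$ are assumptions.

```latex
\begin{align*}
A_s &= \tfrac{1}{2} \sum_{i,j=1}^{K} \sum_{p=1}^{n_i} \sum_{q=1}^{n_j}
       \bigl\| f_i^{\top} x_i^{p} - f_j^{\top} x_j^{q} \bigr\|^2 \, W_s^{i,j}(p,q)
       && \text{(similarity, to be minimized)} \\
A_d &= \tfrac{1}{2} \sum_{i,j=1}^{K} \sum_{p=1}^{n_i} \sum_{q=1}^{n_j}
       \bigl\| f_i^{\top} x_i^{p} - f_j^{\top} x_j^{q} \bigr\|^2 \, W_d^{i,j}(p,q)
       && \text{(dissimilarity, to be maximized)} \\
A_t &= \tfrac{1}{2} \sum_{i=1}^{K} \sum_{p,q=1}^{n_i}
       \bigl\| f_i^{\top} x_i^{p} - f_i^{\top} x_i^{q} \bigr\|^2 \, W_t^{i,i}(p,q)
       && \text{(topology, to be minimized)} \\
L(f_1, \ldots, f_K) &= \frac{A_t + \mu A_s}{A_d} \\
Z \,(L_t + \mu L_s)\, Z^{\top} f &= \lambda \, Z L_d Z^{\top} f
\end{align*}
```

Here $Z = \mathrm{diag}(X_1, \ldots, X_K)$ is the block-diagonal data matrix, $f$ stacks the projections $f_1, \ldots, f_K$, and $L_s = D_s - W_s$, $L_d = D_d - W_d$, $L_t = D_t - W_t$ are graph Laplacians built from the corresponding degree matrices. Under this formulation, each sum equals a trace of the form $\mathrm{tr}(f^{\top} Z L Z^{\top} f)$, which is why minimizing the ratio leads to a generalized eigenvalue problem.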
In order to achieve the topological term, kNN commonly serves as the tool to approximate topological structure in conventional methods [29], [57], chosen for its simplicity. In our proposed MIMA, we utilize MAPPER to extract topological structure. There are two reasons to do so. First, when applying MAPPER, field knowledge can be introduced by choosing the filter function $F$. In remote sensing classification, field knowledge is essential because of the complexity of the data. Second, when using kNN, nearness is decided solely by the parameter $k$, which is manually given. Once $k$ is determined, it is applied to all data without any adaptation. In MAPPER, however, nearness is obtained by clustering, which is a more robust approach than deciding nearness by a fixed value of $k$. Furthermore, in order to empower MAPPER to decide the nearness in an adaptive manner, the original single-linkage clustering [58], [44] is replaced by spectral clustering [51] in MIMA. The reason is that spectral clustering is able to detect the number of clusters by the concept of the eigengap [51]. Thus, when clustering each data bin, the number of clusters is decided based on the data itself, meaning that the nearness is derived in a data-driven manner. For different data bins, the numbers of clusters are different, meaning that the nearness is derived for different parts of the data in an adaptive fashion. Thus, our improved version of MAPPER is capable of deriving topological structure in an automatic and adaptive fashion.
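The eigengap idea can be sketched as follows. This is the generic heuristic (choose the number of clusters at the largest gap between consecutive sorted Laplacian eigenvalues); the function name and the `max_clusters` parameter are hypothetical, and the exact rule used in MIMA is not spelled out in the text.

```python
def eigengap_num_clusters(eigenvalues, max_clusters=None):
    """Estimate the number of clusters from graph Laplacian eigenvalues.

    The eigenvalues are sorted in ascending order; the estimate is the
    count k of eigenvalues lying below the largest gap.
    """
    vals = sorted(eigenvalues)
    if len(vals) < 2:
        return 1
    limit = len(vals) - 1
    if max_clusters is not None:
        limit = min(limit, max_clusters)
    # gaps[k - 1] is the gap between the k-th and (k + 1)-th smallest eigenvalue.
    gaps = [vals[k] - vals[k - 1] for k in range(1, limit + 1)]
    return gaps.index(max(gaps)) + 1


# Three near-zero eigenvalues followed by a clear jump suggest three clusters.
k = eigengap_num_clusters([0.0, 0.01, 0.02, 0.9, 1.1])
```

Because the estimate is recomputed for every data bin, bins with different internal structure can end up with different numbers of clusters, which is the adaptive behaviour described above.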
Although the original goal of MAPPER is to provide a simplified visualization of a complicated data set, as shown in Fig. 1, one can also derive the comprehensive topological structure of all data points using MAPPER. The topological structure of data source $X_i$ can be represented as an $n_i \times n_i$ matrix $W_c^i$, where $n_i$ is the number of instances: $W_c^i(p, q) = 1$ when data instances $p$ and $q$ are in the same cluster or in linked clusters; otherwise, $W_c^i(p, q) = 0$. In MIMA, the topological matrix $W_t$ in equation (3) is replaced by $W_c$ (10).
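A minimal sketch of this construction is given below. It assumes the clusters of all bins are already available as sets of point indices; the function name is hypothetical, and leaving the diagonal at zero is an assumption, since the text does not state how $p = q$ is treated.

```python
def mapper_adjacency(n_points, clusters):
    """Build the n x n 0/1 matrix from MAPPER clusters.

    clusters: list of sets of point indices, collected over all bins.
    Two clusters are linked when they share at least one point.
    """
    w_c = [[0] * n_points for _ in range(n_points)]

    def connect(group_a, group_b):
        for p in group_a:
            for q in group_b:
                if p != q:  # diagonal left at 0 (assumption)
                    w_c[p][q] = 1
                    w_c[q][p] = 1

    for a, cluster_a in enumerate(clusters):
        connect(cluster_a, cluster_a)  # points in the same cluster
        for cluster_b in clusters[a + 1:]:
            if cluster_a & cluster_b:  # linked clusters share points
                connect(cluster_a, cluster_b)
    return w_c


# Point 1 lies in the overlap of two bins and links the first two clusters.
w_c = mapper_adjacency(4, [{0, 1}, {1, 2}, {3}])
```

In this toy example, points 0 and 2 become neighbours although they never share a cluster, because their clusters are linked through point 1.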

III. EXPERIMENTS AND DISCUSSION
A. Data set and feature design
1) Land cover land use data set (LCLU data set): As shown in Fig. 3, the LCLU data set consists of three data sources: a hyperspectral image, dual-Pol SAR data, and ground truth data. The hyperspectral image is a simulated spaceborne EnMAP scene with a size of 817 by 220, a 30-meter ground sampling distance (GSD), and 244 spectral bands ranging from 400 nm to 2500 nm [59]. The dual-Pol SAR data is VH-VV polarized Sentinel-1 single look complex (SLC) data collected in interferometric wide swath mode. The Sentinel-1 SLC data is preprocessed by the ESA SNAP toolbox. The processed dual-Pol SAR data has a GSD of 13 meters and a size of 1723 by 476. It is organized as the commonly used PolSAR covariance matrix. The ground truth is a land cover land use data set derived from OpenStreetMap data.
2) Local climate zone data set (LCZ data set): The local climate zone data set is shown in Fig. 4. It consists of a multispectral image, dual-Pol SAR data, and ground truth data. The multispectral image is a scene of LandSat-8 data with a size of 2220 by 2143, a 30-meter GSD, and 11 bands. The dual-Pol SAR data is also VV-VH polarized Sentinel-1 data processed by the ESA SNAP toolbox. It has a 13.9-meter GSD, a size of 4795 by 4632, and is organized as the commonly used PolSAR covariance matrix. The ground truth is a local climate zone label released by the IEEE GRSS IADF for the data fusion contest in 2017.
3) Label configuration: For both the LCLU data set and the LCZ data set, as shown in Fig. 3 and Fig. 4, the training labels and the testing labels are block-wise separated so that the transferring ability of the algorithms is examined and the risk of implicitly including testing samples in the training procedure is avoided [60]. The label information is detailed in Table I and Table II.
4) Unlabeled data: Regarding SSMA and MIMA, the training procedure involves both labeled data and unlabeled data. In [16], the unlabeled data was selected by a clustering strategy so that cluster centers of unlabeled data were selected. In this work, for a more general case, the unlabeled data for training is randomly selected outside the extent of the training set. For both the LCLU data set and the LCZ data set, 6000 unlabeled data instances are selected to be involved in training.
5) Feature design of the LCLU data set: In order to conduct fair comparisons among algorithms, two principles are pursued in the design of the input features of individual data sources. The first principle is simply that the input features of each data source should be the same for all algorithms. The second principle is that, when an individual data source is used for classification, the input features should enable reasonably good performance. This is to ensure that later improvements do not originate from the unexplored potential of one data source, but from the fusion or the fusion algorithms. For example, due to the well-known curse of dimensionality [61], conducting classification on selected dimensions of hyperspectral images could result in better performance than using the data with all dimensions [62]. If the original full-dimensional data were used in our case, it would then be unclear whether the improvement comes from the fusion or from the dimension reduction. Regarding the feature design of the simulated EnMAP data, the spectral-spatial feature concept is adopted by extracting morphological profiles from extracted informative sub-dimensions [63], [64]. Specifically, the first four principal components (PCs) are extracted, which account for 99% of the variance of the simulated EnMAP data. The morphological profile is then extracted from these four PCs with radius equal to one, two, and three. In total, 28 features are extracted from the simulated EnMAP data set.
Regarding the feature design of the Sentinel-1 dual-Pol data, four polarimetric features are derived: the intensity of the VH channel, the intensity of the VV channel, the coherence of VV and VH, and the intensity ratio of VV and VH. Since the morphological profile was proven to promote classification of PolSAR data [25], [65], [66], it is also used here to extract spatial information from the dual-Pol data with radius equal to one, two, and three. In total, this results in 28 features from the Sentinel-1 dual-Pol data.
6) Feature design of the LCZ data set: The feature design for the LCZ data set also follows the principles described in the feature design of the LCLU data set.
Regarding the feature design of the LandSat-8 data, in order to achieve a feature combination with reasonably good performance, the feature extraction and selection follow the strategy of the first-prize work from the GRSS IADF data fusion contest in 2017 [8]. Local statistical parameters (mean and standard deviation in a 100 × 100-meter neighborhood) and morphological profiles are extracted from the original LandSat-8 data. For details, please refer to [8]. In total, 34 features are used in our work.
Regarding the feature design of the Sentinel-1 dual-Pol data in the LCZ data set, the data source and the preprocessing are the same as those in the LCLU data set. The prepared fundamental features are the four polarimetric features. However, because the local climate zone describes an urban local neighborhood on a grid with a 100 × 100-meter unit cell, feature extraction in the LCZ data set differs from that in the LCLU data set. Local statistical parameters, namely the mean and standard deviation of the local 100 × 100-meter cell, are derived from all four polarimetric features, resulting in eight features. Morphological profiles are then extracted from these eight features with radius equal to one and three. Thus, 40 features are prepared in total.
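The reported feature totals are consistent with one simple reading of the morphological profile, in which each base feature is kept and two filtered versions (for instance an opening and a closing) are added per radius. The helper below, whose name and `operators_per_radius` parameter are hypothetical, only checks that arithmetic; it is an assumption about how the totals arise, not a statement of the exact profile definition used.

```python
def profile_feature_count(n_base, radii, operators_per_radius=2):
    """Count features when every base feature is kept and one filtered
    image is added per operator and per radius (assumed layout)."""
    return n_base * (1 + operators_per_radius * len(radii))


# Four PCs of the simulated EnMAP data, radii one, two, and three.
enmap_features = profile_feature_count(4, (1, 2, 3))      # 28
# Four polarimetric Sentinel-1 features, radii one, two, and three.
lclu_sar_features = profile_feature_count(4, (1, 2, 3))   # 28
# Eight local statistics of the Sentinel-1 features, radii one and three.
lcz_sar_features = profile_feature_count(8, (1, 3))       # 40
```

The 34 LandSat-8 features follow the selection in [8] and are not covered by this count.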

B. Experiments setting
In the experiments, nine algorithms, which are listed in Table III, are applied to extract features from optical and dual-Pol SAR data. Afterwards, three classifiers are used to test the performance of these algorithms in terms of classification accuracy. The nine algorithms are: (A) dual-Pol SAR data (POL), (B) optical data (OPT), (C) fusing optical and dual-Pol SAR data by feature concatenation (OPT-POL), (D) fusing optical and dual-Pol SAR data by COSPACE [67], (E) fusing optical and dual-Pol SAR data by LeMA [68], (F) fusing optical and dual-Pol SAR data by unsupervised joint dimension reduction using locality preserving projection [35] (LPP), (G) fusing optical and dual-Pol SAR data by semi-supervised joint dimension reduction using locality preserving projection (LPP-SE), (H) fusing optical and dual-Pol SAR data by SSMA [33], and (I) fusing optical and dual-Pol SAR data by the proposed MIMA.
Among these nine algorithms, parameter tuning is required by (D) COSPACE, (E) LeMA, (F) LPP, (G) LPP-SE, (H) SSMA, and (I) MIMA, as shown in Table III. For (D) COSPACE and (E) LeMA, two learning rates, α and β, need to be set for the optimization. They are tuned by searching the grid $\{10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}\}$. Regarding the parameter k in (F) LPP, (G) LPP-SE, and (H) SSMA, it has been reported in [33] that the parameter does not have a significant influence on the result, and a value of nine is recommended. The parameters dn and µ are discussed later in our experiments. In MIMA, two more parameters have to be decided: (1) b: the number of intervals for dividing the data, and (2) c: the overlapping rate. Since these two parameters have limited influence, as discussed in Section III-F, especially when compared with the other parameters, their values are set to 5 and 50%, respectively, based on other studies [46], [48] and on the practical experience of the authors.
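The grid search over the two learning rates can be sketched as follows. This is a minimal illustration, not the tuning code used in the experiments: `evaluate` is a hypothetical callback that would train COSPACE or LeMA with the given rates and return a validation accuracy, and `toy_accuracy` is an invented stand-in so that the sketch runs on its own.

```python
import math
from itertools import product

def grid_search(evaluate, exponents=(-1, -2, -3, -4)):
    """Try every (alpha, beta) pair on the grid {10^-1, ..., 10^-4}."""
    grid = [10.0 ** e for e in exponents]
    best = None
    for alpha, beta in product(grid, grid):
        score = evaluate(alpha, beta)
        if best is None or score > best[0]:
            best = (score, alpha, beta)
    return best

def toy_accuracy(alpha, beta):
    """Invented score surface with its peak at alpha=1e-2, beta=1e-3."""
    return (1.0
            - 0.1 * abs(math.log10(alpha) + 2)
            - 0.1 * abs(math.log10(beta) + 3))

best_score, best_alpha, best_beta = grid_search(toy_accuracy)
print(best_alpha, best_beta)
```

With four candidate values per learning rate, the search evaluates 16 combinations per algorithm.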
Besides the parameter tuning, one important part of MIMA is the selection of the filtering function. As discussed before, the filtering function provides a perspective for observing the data and introduces field knowledge. Since principal components are widely used in the classification of remote sensing data and have been proven to be effective, and since this is the first attempt at applying MAPPER in remote sensing, the first and second principal components are chosen to serve as the field knowledge in this work.
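To make the role of the filtering function concrete, the following sketch computes the first two principal component scores of a small set of feature vectors. It is a dependency-free illustration that uses power iteration with deflation in place of a library PCA routine; all function names and the toy data are our own, and the paper does not state how the components were computed.

```python
import math

def covariance(rows):
    """Return mean-centered rows and their sample covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return centered, cov

def leading_eigenvector(mat, iters=500):
    """Power iteration; assumes the start vector is not orthogonal
    to the dominant eigenvector."""
    d = len(mat)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def pc_filter(rows, n_components=2):
    """Filter values: projections onto the first principal axes."""
    centered, cov = covariance(rows)
    d = len(cov)
    axes = []
    for _ in range(n_components):
        v = leading_eigenvector(cov)
        lam = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
        axes.append(v)
        # Deflation: remove the found component before the next search.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return [[sum(c[j] * a[j] for j in range(d)) for a in axes]
            for c in centered]

# Toy "pixels": the variance is largest along the first coordinate.
pixels = [(2.0, 0.0), (0.0, 1.0), (-2.0, 0.0), (0.0, -1.0)]
filter_values = pc_filter(pixels)
print(filter_values)
```

Each row of `filter_values` holds the two filter values of one sample; in MAPPER, these values are what the overlapping intervals are laid over.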
As shown in Table III, (D) COSPACE, (E) LeMA, (H) SSMA, and (I) MIMA all fall into the manifold alignment fusion strategy. However, their learning resources are different. COSPACE is designed to learn a joint latent space from the existing labeled data. In addition to the labeled data, LeMA also uses pseudo-labeled data, i.e., the predictions of a trained classifier on unlabeled data, to include unlabeled data in the data fusion procedure. SSMA and MIMA utilize the labeled data and extract the data distribution under the guidance of mathematical assumptions to achieve data fusion. Therefore, when a large amount of labeled data exists or the data distribution is not correlated with the labels, LeMA would be more appropriate than SSMA and MIMA, and vice versa.
Since our goal is to assess the performance of fusion, three classical classifiers are chosen: the one-nearest-neighbor classifier (ONE-NN), the linear support vector machine (LSVM), and the Gaussian kernel support vector machine (KSVM). In this work, the parameter tuning of LSVM and KSVM is done with a heuristic procedure [69].
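As an illustration of the simplest of the three classifiers, a one-nearest-neighbor rule can be written in a few lines. The sketch below assumes the Euclidean distance and uses invented two-dimensional toy features and class names; it is not the implementation used in the experiments.

```python
import math

def one_nn_predict(train_x, train_y, test_x):
    """Assign each test sample the label of its closest training sample
    (Euclidean distance is an assumption of this sketch)."""
    predictions = []
    for query in test_x:
        nearest = min(range(len(train_x)),
                      key=lambda i: math.dist(train_x[i], query))
        predictions.append(train_y[nearest])
    return predictions

# Invented fused features and labels, for illustration only.
train_x = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
train_y = ["water", "water", "urban", "urban"]
test_x = [(0.2, 0.1), (5.4, 5.2), (2.0, 2.0)]

predicted = one_nn_predict(train_x, train_y, test_x)
print(predicted)
```

Because ONE-NN has no parameters to tune, its accuracy reflects the geometry of the fused feature space directly, which also makes it sensitive to changes in the training set.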

C. Classification on the LCLU data set
This section demonstrates and discusses the experimental results obtained on the LCLU data set.
1) Fusion vs. non-fusion. As shown in Fig. 5 and Table IV, classification on fused hyperspectral imagery and dual-Pol SAR data outperforms classification on the individual data sources in terms of classification accuracy. Among the fusion algorithms, our proposed MIMA provides the best classification performance, which, in terms of overall accuracy, exceeds classification on dual-Pol SAR data by 25%, 20%, and 21% and exceeds classification on hyperspectral imagery by 7%, 4%, and 5%, using ONE-NN, LSVM, and KSVM, respectively. This indicates that the fusion of hyperspectral imagery and dual-Pol SAR data is advantageous for LCLU classification.
2) Fusion categories. Based on properties of fusion algorithms listed in Table III

Fig. 6: Visualization of the optical data and the PolSAR data of the LCLU data set, using t-SNE [70], in their original and projected spaces. The x and y axes are the first and second dimensions resulting from t-SNE. The first row shows: the PolSAR data in the original space, the optical data in the original space, the LPP jointly projected space, the LPP-SE jointly projected space, and the COSPACE projected space, respectively. The second row shows: the PolSAR data in the SSMA projected space, the optical data in the SSMA projected space, the PolSAR data in the MIMA projected space, the optical data in the MIMA projected space, and the LeMA projected space, respectively.
According to the classification accuracy in Table IV, with the feature concatenation (OPT-POL) serving as the benchmark, we find that: (a) the joint dimension reduction fusion algorithms achieve classification accuracy similar to that of the feature concatenation fusion; (b) the overall accuracy provided by the label-driven manifold alignment is around 7% lower than the accuracy achieved by the feature concatenation; (c) the data-driven manifold alignment fusion outperforms the feature concatenation by 2%. These three findings are discussed in detail as follows.
It is well known that dimension reduction is capable of boosting the classification accuracy, due to the curse-of-dimensionality [61]. However, according to finding (a) above, this does not hold for the feature concatenation fusion in our experiment, because the curse-of-dimensionality has already been tackled in our feature design. Finding (a) also validates that the improvement of our proposed method is not a side effect of dimension reduction.
The label-driven manifold alignment fusion learns projections that map the original data sources to a latent space purely based on the labels, and applies the learned projections to the unlabeled data to accomplish the fusion. Finding (b) gives a clear indication that this type of fusion cannot provide a proper fusion result for the LCLU data set. This could be because the latent space learned from the labels does not generalize to the unlabeled data. Thus, the label-driven manifold alignment fusion might provide a destructive fusion when the labeled data cannot represent the data distribution, which is often the case in remote sensing.
The data-driven manifold alignment fusion also learns projections that map the original data sources to a latent space. However, the latent space is jointly defined by the labels and the data structure explored from the original data sources, including labeled and unlabeled data. Finding (c) suggests that the data-driven manifold alignment fusion is an effective fusion strategy, which improves the overall accuracy by about 2% compared to the feature concatenation fusion.
3) MIMA vs. SSMA. As shown in Fig. 5 and Table IV, the proposed MIMA has superior performance to SSMA. In Fig. 5, verified with three different classifiers, classifications on MIMA-fused data outperform classifications on SSMA-fused data when the parameter µ and the number of dimensions are the same for both fusion strategies. The classification performance of the best parameter combinations is shown in Table IV. MIMA still outperforms SSMA there, which supports the superior performance of the proposed algorithm and suggests that a MAPPER-derived topological structure is more effective than a kNN-derived structure for LCLU classification.
4) Parameter µ. As shown in Fig. 5, with the ONE-NN and KSVM classifiers, a higher value of µ results in better classification performance for both the SSMA and MIMA algorithms. Recalling that a higher value of µ assigns a stronger weight to the topological structure of the data in the fusing phase, this is evidence that the topological structure benefits our classification task. We also find that the way MIMA derives the structure is more beneficial to this LCLU classification than the way SSMA derives it.
5) Fusion visualization. In Fig. 6, we visualize the fused features of different algorithms using the t-SNE algorithm [70]. The joint dimension reduction technique results in a set of features that is less discriminative than the original features. This is also reflected in the classification results shown in Fig. 5. On the other hand, when using manifold alignment techniques, the derived features are more discriminative than the original ones.
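For reference, the accuracy measures used throughout this comparison (overall accuracy, average accuracy, and the kappa coefficient) can all be derived from a confusion matrix. The sketch below uses the standard definitions and an invented two-class matrix; treating the class-specific accuracy as the per-class recall is our assumption.

```python
def accuracy_metrics(confusion):
    """confusion[i][j]: number of samples of true class i predicted as j.
    Returns (overall accuracy, average accuracy, Cohen's kappa)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    diag = [confusion[i][i] for i in range(n)]
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n)) for j in range(n)]

    oa = sum(diag) / total
    per_class = [diag[i] / row_sums[i] for i in range(n)]  # assumed: recall
    aa = sum(per_class) / n
    # Agreement expected by chance, from the row and column marginals.
    pe = sum(row_sums[i] * col_sums[i] for i in range(n)) / total ** 2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

# Invented confusion matrix for two classes.
oa, aa, kappa = accuracy_metrics([[40, 10],
                                  [5, 45]])
print(oa, aa, kappa)
```

Overall accuracy weights every test sample equally, whereas average accuracy weights every class equally, so the two diverge when the test set is imbalanced.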

D. Classification on the LCZ data set
This section demonstrates and discusses the experimental results obtained on the LCZ data set.
1) LeMA. The most striking observation in Fig. 7 and Table V is that LeMA outperforms all the other algorithms by 2 to 6%, which is considered a large margin in this experiment. Since LeMA and COSPACE both accomplish the fusion by using the labeled data and have similar performance in the experiment on the LCLU data set, it is worth examining why LeMA outperforms not only COSPACE but also all the other fusion algorithms by a large margin. The difference between COSPACE and LeMA is that, while learning the projections from the labeled data, LeMA additionally includes pseudo-labels in the learning phase. The pseudo-labels are predictions for unlabeled data inferred by a trained classifier. LeMA has a strategy for selecting pseudo-labels that have a high probability of being correct.
2) Fusion. According to Table V, all fusion algorithms, except LeMA, have performance similar to that of the classification using only the multispectral imagery. Based on the 0.19% difference between OPT-POL and OPT, we might infer that features extracted from dual-Pol SAR data do not benefit the LCZ classification scheme in terms of overall accuracy.

Fig. 8: Visualization of the optical data and the PolSAR data of the LCZ data set, using t-SNE [70], in their original and projected spaces. The x and y axes are the first and second dimensions resulting from t-SNE. The first row shows: the PolSAR data in the original space, the optical data in the original space, the LPP jointly projected space, the LPP-SE jointly projected space, and the COSPACE projected space, respectively. The second row shows: the PolSAR data in the SSMA projected space, the optical data in the SSMA projected space, the PolSAR data in the MIMA projected space, the optical data in the MIMA projected space, and the LeMA projected space, respectively.
3) Data-driven manifold alignment. When the fusion is carried out by OPT-POL, it is essentially the information given by the labels that decides the classification boundary. In addition to the labels, the data-driven manifold alignment involves the topological structures of the data in finding the classification boundary. As shown in Fig. 7, SSMA cannot compete with OPT-POL. This suggests that the data structure is not beneficial with respect to LCZ classification. This is actually reasonable. The LCZ classification scheme describes the contents of an urban local neighborhood in terms of the morphological structure, the man-made or natural components, and the height of structures. However, the topological structure derived from the remote sensing data reveals data structure in terms of its physical meaning, such as covering materials for multispectral images and geometric information for SAR data. Thus, the structure is not directly related to LCZ concepts. In this regard, the data-driven manifold alignment is more appropriate for the LCLU classification, since the information captured by the topological structure is directly related to LCLU classes.
Despite the challenges that LCZ classes pose, the proposed MIMA slightly improves the overall accuracy, by 1.02%, compared to OPT-POL with the LSVM. Compared to LeMA, the performance of MIMA differs by -3.83%, -0.79%, and 0.23% in terms of overall accuracy with the three different classifiers. We consider these performances comparable; only the -3.83% indicates a large difference. However, this is because the 38.83% and 6.76% differences in training and testing records have a large impact on the ONE-NN classifier. With the two other classifiers, even with fewer training samples, the proposed MIMA is able to provide comparable classification accuracy.
4) Parameter µ. According to Fig. 7 and Table V, the trends in terms of µ show that SSMA achieves its best performance when µ equals one, and that its performance degrades as µ increases, without a clear pattern. MIMA, in contrast, exhibits a pattern in which the classification accuracy increases as the value of µ increases. This means that putting higher weights on the topological structure while fusing with MIMA provides better classification performance in terms of OA.
5) Fusion visualization. We visualize the fused features of different algorithms by t-SNE, as illustrated in Fig. 8. However, it is difficult to carry out a detailed analysis based on the visualization results. In general, the manifold alignment based fusion provides spaces where classes concentrate well. To our knowledge, the optical data are projected into a more discriminative subspace via the proposed MIMA.

Our proposed MIMA does suffer from a comparably high computational cost, as shown in Table VI. This is due to the high computational cost of the spectral clustering [51] in MIMA. If the algorithm efficiency is of key importance for a targeted application, more studies could be carried out to find a less demanding clustering algorithm as a substitute. However, considering that 9170 optical pixels with 34 dimensions and 9170 SAR pixels with 40 dimensions are involved in the training of the algorithm, we think the two minutes required by MIMA is still acceptable.

F. Data bins and overlap rates
As described in Section II-C, MAPPER brings two parameters to MIMA: the number of data bins and the overlap rate of adjacent data bins. In all experiments in the previous sections, the number of data bins is set to 5 and the overlap rate to 50%, based on the experience of medical studies [46], [48]. In this section, the impact of these two parameters is discussed for remote sensing data.
Theoretically, the number of data bins has an effect similar to that of the value k of the kNN, which controls the extent of a local neighborhood, because the local topological structure is derived by clustering in smaller slices of the data when a larger number of bins is applied. The overlap rate, on the other hand, controls the strength of the connection between adjacent local neighborhoods. Although the theoretical concept is clear, the actual impact of both parameters depends on the data set at hand. Fig. 9 demonstrates the impact of the number of data bins and the overlap rate on the classification accuracy, using the LCLU data set and the LCZ data set. The number of bins is set to values from 5 to 50 with an interval of 5. The overlap rate is set to values from 0.1 to 0.9 with an interval of 0.1. For the sake of simplicity, the parameter µ is set to 2 and the latent space dimension is set to 50 for the analysis in this section.
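The interplay of the two parameters can be illustrated with a one-dimensional cover of the filter range. The sketch below assumes equal-length intervals in which the overlap rate is the fraction of an interval shared with its successor; this convention, the function names, and the toy filter values are our assumptions, not details taken from the paper.

```python
def overlapping_cover(lo, hi, n_bins, overlap):
    """Split [lo, hi] into n_bins equal-length intervals, where `overlap`
    is the fraction of each interval shared with the next one."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must lie in [0, 1)")
    length = (hi - lo) / (1 + (n_bins - 1) * (1 - overlap))
    step = length * (1 - overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n_bins)]

def assign_to_bins(filter_values, cover):
    """A sample falls into every interval that contains its filter value."""
    return [[idx for idx, v in enumerate(filter_values) if a <= v <= b]
            for a, b in cover]

# Five bins with 50% overlap, the setting used in the main experiments.
cover = overlapping_cover(0.0, 6.0, n_bins=5, overlap=0.5)
bins = assign_to_bins([0.5, 1.5, 2.5, 5.5], cover)
print(cover)
print(bins)
```

Samples that fall into two adjacent intervals are what later connect clusters of neighboring bins, which is why a higher overlap rate strengthens the links between local neighborhoods, while more bins shrink each neighborhood.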
According to the upper two plots in Fig. 9, regarding the LCLU classification, the number of bins and the overlap rate do not have a significant influence on the overall accuracy. However, based on the bottom two plots in Fig. 9, regarding the LCZ classification, one can recommend a large overlap rate of around 90% and a number of data bins of around 10. Thus, the choice of both parameters depends on the data set and the targeted classification scheme. Last but not least, it also relates to the choice of the filtering function. One more interesting point is that, by comparing the fluctuation of the curves in Fig. 5, Fig. 7, and Fig. 9, we can observe that the impacts of µ and dn are much larger than the impacts of the number of bins and the overlap rate.

IV. CONCLUSION
In this paper, we propose a MAPPER-induced manifold alignment for the semi-supervised fusion of optical data and polarimetric SAR data, inspired by semi-supervised techniques and the emerging field of topological data analysis. Specifically, we embed a successful topological data analysis tool, MAPPER, into SSMA to accomplish heterogeneous data fusion. Furthermore, our modified version of MAPPER adapts to the data by improving the clustering step. The performance of MIMA on fusing optical data and polarimetric SAR data is superior to that of SSMA, LPP, COSPACE, LeMA, and the feature concatenation, with respect to LCLU classification and LCZ classification. An SSMA-based method is applied to fuse optical data and SAR data for the first time. This is also the first time that topological data analysis is applied in the remote sensing field.
In the future, further experiments will be conducted to explore the potential of the proposed MIMA by selectively introducing field knowledge of remote sensing data. In this manner, the physical meanings of different remote sensing data can be explicitly introduced into data fusion, instead of treating it as a purely data-driven machine learning problem. We believe an expert-knowledge-driven MIMA can further improve the fusion performance.

Fig. 1: Example of the MAPPER approach to derive the topological structure of the point cloud of a human hand. (A): Data space $X$, point cloud data of a human hand; (B): Filtered space $Z$, points colorized by the filter value; filter function $f$: assigning data points their horizontal distances to the right end; (C): $U$, covering of $Z$, overlapped intervals of the filtered value; (D): $f^{-1}(U_\alpha)$, covering of $X$, separating the original data into bins according to the intervals in (C); data in bins retain their original dimension; (E): $f^{-1}(U_\alpha)$, covering of $X$, achieved by clustering bins of data. Modified from [53].

Fig. 1 (D).
• Clustering and visualization construction. The set $\{\alpha_1, \alpha_2, \ldots, \alpha_k\}$, as the vertices of a $k$-simplex, are $k$ connected components in the topological space $X$, which can be obtained by clustering. Thus, $f^{-1}(U_\alpha)$ is obtained to represent the data space $X$, as shown in Fig. 1 (E).
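The construction of the resulting graph can be sketched as follows: once every bin has been clustered, each cluster becomes a vertex, and two vertices are connected whenever their clusters share at least one data point, which can only happen through the overlap of adjacent bins. The cluster names and memberships below are invented for illustration, and the clustering itself (spectral clustering in MIMA) is assumed to have been done already.

```python
from itertools import combinations

def mapper_graph(clusters):
    """clusters: dict mapping a vertex name to the set of point indices
    in that cluster. Returns the edges between clusters sharing points."""
    edges = set()
    for a, b in combinations(sorted(clusters), 2):
        if clusters[a] & clusters[b]:  # non-empty intersection
            edges.add((a, b))
    return edges

# Invented clusters; point 1 and point 2 lie in overlapping bins.
clusters = {
    "bin0_c0": {0, 1},
    "bin1_c0": {1, 2},
    "bin2_c0": {2},
    "bin4_c0": {3},
}
edges = mapper_graph(clusters)
print(sorted(edges))
```

Vertices without shared points, such as `bin4_c0` here, remain isolated, which is how the graph reflects disconnected parts of the data.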

Fig. 2: The flowchart of the algorithm MIMA. (A) Training phase: a topological graph ($W_c$) is derived from the optical data and the SAR data by MAPPER. A similarity graph ($W_s$) and a dissimilarity graph ($W_d$) are formed by using the label information. Therefore, three regularization terms A, B, and C are formulated as Eq. 4, Eq. 5, and Eq. 6, respectively. Lastly, the projection to the latent space is learned by optimizing $\arg\min_{f_1, \ldots, f_K}$

Fig. 5: Classification performance in terms of overall accuracy (OA) for the experiments applied on the LCLU data set. The charts show the results of the three classifiers, from left to right, ONE-NN, LSVM, and KSVM.

Fig. 7: Classification performance in terms of overall accuracy (OA) for the experiments applied on the LCZ data set.The charts show results for the three classifiers, from left to right, ONE-NN, LSVM, and KSVM.
In our classification evaluation, those pseudo-labels that are likely to be correct are also used for training the classifiers on fused data. In the case of the experiment on the LCZ data set, compared to the original 3170 training records and 18205 testing records, 1231 additional pseudo-labels selected from the test data set are used for training classifiers. This increases the training data set by 38.83% and occupies 6.76% of the testing data for validation. We believe this change in the classification setting is the main reason that LeMA performs best in the experiment on the LCZ data set. On the other hand, in the experiment on the LCLU data set (3116 training records and 441778 testing records), LeMA has 721 additional pseudo-labels, which increases the training data set by 23.14% and occupies 0.16% of the testing data for validation.
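The percentages quoted above follow directly from the record counts given in the text, as the short check below shows. The function name is ours; the numbers are those stated in the paragraph.

```python
def pseudo_label_share(n_train, n_test, n_pseudo):
    """Return (increase of the training set in %, share of the test set in %)."""
    return 100.0 * n_pseudo / n_train, 100.0 * n_pseudo / n_test

# LCZ: 3170 training records, 18205 testing records, 1231 pseudo-labels.
lcz_train_pct, lcz_test_pct = pseudo_label_share(3170, 18205, 1231)
# LCLU: 3116 training records, 441778 testing records, 721 pseudo-labels.
lclu_train_pct, lclu_test_pct = pseudo_label_share(3116, 441778, 721)

print(round(lcz_train_pct, 2), round(lcz_test_pct, 2))
print(round(lclu_train_pct, 2), round(lclu_test_pct, 2))
```

The pseudo-labels taken from the test set make up a far larger share of the test data in the LCZ experiment than in the LCLU experiment, which is consistent with the larger effect observed there.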

Fig. 9: Plots of the overall classification accuracies achieved by applying the ONE-NN, LSVM, and KSVM classifiers to MIMA-fused features, while only the values of two parameters vary: the number of data bins and the overlap rate. From left to right: (1) overall accuracies achieved on the LCLU data set; the curve and the error bar represent the mean and the standard deviation computed over varying overlap rates; (2) overall accuracies achieved on the LCLU data set; the curve and the error bar represent the mean and the standard deviation computed over varying numbers of bins; (3) overall accuracies achieved on the LCZ data set; the curve and the error bar represent the mean and the standard deviation computed over varying overlap rates; (4) overall accuracies achieved on the LCZ data set; the curve and the error bar represent the mean and the standard deviation computed over varying numbers of bins.

TABLE I: Summary of training and testing for the LCLU data set

TABLE II: Summary of training and testing for the LCZ data set

TABLE III: The nine algorithms in the experimental comparisons. Their fusion strategies are MA-fusion (manifold alignment fusion) and DR-fusion (joint dimension reduction fusion). The learning resources are the Label (annotated data-label records), the Pseudo-label (prediction from a classifier), and the data structure (the distribution of data in feature space). The parameters of these algorithms are k: the $k$-th neighbor in kNN for approximating the topological structure; dn: the number of dimensions in the projected space; µ: the importance weighting of the topological structure; α and β: learning rates.

TABLE IV: Quantitative performance comparison of the different algorithms on the LCLU data, in terms of class-specific accuracy, kappa coefficient, average accuracy, overall accuracy, and mean overall accuracy. The best performance achieved is shown in bold. The powers (a, b) of the learning rates $\alpha = 10^{a}$ and $\beta = 10^{b}$ are shown for COSPACE and LeMA at their best performance. The number of dimensions is indicated as (dn) for the best performance of LPP and LPP-SE. The values of µ and the number of dimensions are indicated as (µ, dn) for the best performance of SSMA and MIMA.
, to simplify the discussion, we

TABLE V: Quantitative performance comparison of the different algorithms on the LCZ data, in terms of class-specific accuracy, kappa coefficient, average accuracy, overall accuracy, and mean overall accuracy. The best performance achieved is shown in bold. The powers (a, b) of the learning rates $\alpha = 10^{a}$ and $\beta = 10^{b}$ are shown for COSPACE and LeMA at their best performance. The number of dimensions is indicated as (dn) for the best performance of LPP and LPP-SE. The values of µ and the number of dimensions are indicated as (µ, dn) for the best performance of SSMA and MIMA.

TABLE VI: The computational cost of the algorithms in comparison. The times listed in this table are means of ten repetitions of each algorithm, carried out on the LCZ data set. The unit is seconds.

In order to show the computational efficiency of the algorithms in comparison, experiments of ten repetitions over the LCZ data set were carried out for every algorithm. All the experiments are accomplished on a desktop with