Multi-Modal Data Fusion for Land-Subsidence Image Improvement in PSInSAR Analysis

There are three popular methods for measuring land subsidence: leveling, Global Navigation Satellite System (GNSS) measurement, and Interferometric Synthetic Aperture Radar (InSAR) analysis using SAR images. While both leveling and GNSS measurement can measure the amount of land subsidence only at specific points, InSAR analysis can observe a wide area in short time intervals. In terms of accuracy, however, InSAR analysis is inferior to leveling: centimeter/millimeter order (InSAR/PSInSAR analysis) vs. millimeter order (leveling). Among all observation errors in InSAR analysis, the tropospheric delay error has a particularly large adverse effect on the measurement. It is difficult to suppress this tropospheric delay error by conventional methods because they try to remove the error at each pixel of an InSAR image independently, even though geometrically-neighboring regions/pixels are naturally correlated. Our proposed method exploits such neighboring relationships in a convolutional neural network (CNN). The CNN is designed to improve InSAR analysis by mutually incorporating the InSAR image and the tropospheric delay error, both of which are estimated by conventional methods. Experimental results demonstrate that our proposed method reduces the mean error compared with a conventional method: from 10.3 mm to 6.80 mm.


I. INTRODUCTION
The ground always shifts vertically and horizontally. One of the most critical shifts is land subsidence. Once land subsidence occurs, the ground hardly returns to its original state. It is essential to understand land subsidence and to take countermeasures against it as early as possible, because it may have huge negative impacts on our lives, such as building collapse and damage to lifelines.
We have three popular methods to measure land subsidence: leveling, Global Navigation Satellite Systems (GNSSs), and Interferometric Synthetic Aperture Radar (InSAR) analysis. The functional properties of these three methods are summarized in Table 1. For leveling, a huge human effort is required to measure a wide area. The GNSSs, including the Global Positioning System (GPS), GLObal NAvigation Satellite System (GLONASS), Galileo, and Quasi-Zenith Satellite System (QZSS), measure crustal alteration by having ground-installed devices receive signals transmitted from artificial satellites. SAR is a form of high-spatial-resolution microwave radar transmitted from a satellite towards the ground surface. Its reflection intensity and phase are captured by the same satellite. From the phase shift between signals captured at different capturing times, we can compute the ground subsidence. As summarized in Table 1, while both leveling and GNSS (1) require special measurement devices and (2) measure the amount of land subsidence only at specific points, InSAR analysis can observe a wide area in short time intervals with no ground-installed device. These advantages motivate us to employ InSAR analysis for various real-world applications (e.g., [1]-[3]). Although these advantages enrich the value of InSAR analysis, its accuracy is inferior to that of the leveling process (footnote 1). This paper proposes convolutional neural networks (CNNs) for improving InSAR analysis. Machine learning such as CNNs allows us to stochastically rectify conventional InSAR analysis based on a number of training data.

(The associate editor coordinating the review of this manuscript and approving it for publication was John Xun Yang.)
In this work, the training data is a set of InSAR analysis results with larger error and their ground-truth data, measured by leveling with less error.

Footnote 1: While the quality of leveling measurements changes among different leveling datasets, high-quality leveling measurements (i.e., 0.1-millimeter-order resolution [4]) were used in our experiments, as described in Section IV.

TABLE 1. Functional properties of three measurement methods: leveling, GNSS, and InSAR. The best and second-best properties of each function are colored red and blue, respectively.

In accordance with a standard supervised machine-learning manner, the CNN is trained to rectify the result of InSAR analysis so that it matches the ground-truth data measured by leveling. Once the CNN is trained properly, it can rectify new InSAR analysis results that are not included in the training data. In addition, we focus on reducing the error associated with the tropospheric delay, which has a particularly adverse effect on measurement by InSAR analysis.
The contributions of this paper are as follows: • Our proposed method achieves millimeter-order accuracy in land-subsidence measurements by the CNN that fuses multi-modal data (i.e., land-subsidence and tropospheric-delay images) obtained by conventional InSAR analysis.
• We explore several types of multi-modal fusion approaches for our proposed CNN in order to evaluate the appropriateness of each approach.

FIGURE 1. Optical and SAR images captured by satellite-mounted sensors. Each pixel value in the SAR image shows the reflection intensity of SAR. The reflection intensity is, in general, higher for metals and minerals such as buildings and rocks. For example, pixel values in part of the island shown in the lower-left region of the SAR image are high because this island is an airport with landing fields.

II. RELATED WORK

A. InSAR ANALYSIS AND ITS EXTENSIONS
Different from an optical image (e.g., the left image in Fig. 1), a SAR image (e.g., the right image in Fig. 1) consists of the intensity and phase of radar signals. Assume that an orbiting satellite captures SAR images of the same ground-surface region at different times, as shown in Fig. 2. The distance between the satellite and the ground surface is expressed by the following equations [5]-[7], where r_1 and r_2 are the distances at the first and second capturing times:

r_1 = n_1 λ + x_1, (1)
r_2 = n_2 λ + x_2, (2)

where n_i, λ, and x_i denote the number of waves, the wavelength, and the fraction (0 ≤ x_i < λ), respectively, and i denotes the capturing time (i.e., i ∈ {1, 2}).

FIGURE 2. Geometric relationship between an orbiting satellite and the ground surface. SAR images of this surface region are captured by the same satellite at different times. These SAR images are captured from slightly different locations. r_i denotes the distance between the i-th trajectory and the ground surface. As shown in the figure, one SAR image captures around 100 km × 100 km, and one of its pixels covers around 10 m × 10 m.

If the two SAR images are captured at exactly the same location, the difference between r_1 and r_2 excepting their first terms (i.e., the integer multiples of the wavelength) is determined only by the land subsidence:

φ_1 − φ_2 = (2π / λ)(x_1 − x_2), (3)

where φ_i denotes the phase of the radar signal. In the phase shift in Eq. (3) (i.e., φ_1 − φ_2), unknown scaling factors of the terrain vanish. As a result, φ_1 − φ_2 represents the distance difference in the order of the radar wavelength. Since Eq. (3) is computed without the integer multiples of the wavelength, a phase unwrapping algorithm [8]-[10] is applied to obtain the distance difference, r_2 − r_1, between the two capturing times. Eq. (3) is computed at every pixel of the paired SAR images, and the resulting phase-shift values compose an initial InSAR image. This initial InSAR image still contains several types of erroneous components. Each pixel value, φ = φ_1 − φ_2, in the initial InSAR image is called a residual phase and is expressed as a sum of five components, an ideal phase shift φ_def and four erroneous components, as follows [11]:

φ = φ_def + φ_atm + φ_orb + φ_DEM + φ_noise, (4)

• φ_def denotes an ideal phase shift caused only by the ground subsidence. Estimating φ_def is the goal of InSAR analysis.
• φ_atm is the error caused by the atmospheric phase. φ_atm includes the estimation error of the tropospheric delay, which is called the tropospheric delay error.
• In reality, the two SAR images are captured from slightly different locations, as shown in Fig. 2. This difference in the capturing locations produces φ_orb.
• Since an InSAR image inevitably contains stripe noise, the Digital Elevation Model (DEM) is used for removing this stripe noise. This process produces φ_DEM.
• φ_noise denotes the remaining noise component.
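As a concrete sketch of Eq. (3), the wrapped phase difference of two co-registered complex SAR pixel stacks can be computed per pixel. The array names below are illustrative, not part of the paper's processing chain:

```python
import numpy as np

def wrapped_phase_difference(s1: np.ndarray, s2: np.ndarray) -> np.ndarray:
    """Return phi_1 - phi_2 wrapped into (-pi, pi] for each pixel.

    s1 and s2 are complex-valued, co-registered SAR pixel arrays.
    angle(s1 * conj(s2)) equals phi_1 - phi_2 modulo 2*pi.
    """
    return np.angle(s1 * np.conj(s2))

# Toy example: two pixels whose true phase difference is 0.5 rad.
s1 = np.exp(1j * np.array([1.0, 2.0]))
s2 = np.exp(1j * np.array([0.5, 1.5]))
dphi = wrapped_phase_difference(s1, s2)
```

Because the result is wrapped into (−π, π], a phase unwrapping algorithm [8]-[10] is still needed afterwards to recover the full distance difference.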
With the initial InSAR image, these four erroneous components are estimated and removed in the further InSAR analysis. Readers can refer to [12], for example, for more details. Removing these erroneous components is hampered by noisy values in the SAR images. Noise in the SAR images strongly depends on the ground objects that reflect the radar signals; that is, the noise level across multiple SAR captures is roughly determined per pixel, provided the SAR images are spatially aligned so that each pixel captures the same ground surface.
To remove the above erroneous components robustly against noise, only pixels with stable amplitudes of the reflected radar signals are used in Persistent Scatterer InSAR (PSInSAR) [5], [13]. Given a sufficient number of spatially-aligned SAR images, a metric of this stability is provided by the standard deviation, σ_p, of the amplitudes at the p-th pixel. In addition to σ_p, the mean µ_p of the amplitudes is also used in this metric, because it is known that, with microwave radar, more stable and stronger reflections come from metals and minerals such as buildings and rocks [13]. With σ_p and µ_p, the p-th pixel is selected by PSInSAR analysis if σ_p/µ_p is lower than a threshold (typically σ_p/µ_p < 0.25) [13]. At each selected pixel, µ_p is considered to be the pixel value in the PSInSAR image.
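The pixel-selection rule above can be sketched as follows, assuming a stack of spatially aligned amplitude images; names and shapes are illustrative:

```python
import numpy as np

def select_ps_pixels(amplitudes: np.ndarray, threshold: float = 0.25) -> np.ndarray:
    """Persistent-scatterer selection by amplitude dispersion.

    amplitudes: (N, H, W) stack of N aligned amplitude images.
    Returns a boolean (H, W) mask of pixels with sigma_p / mu_p < threshold.
    """
    mu = amplitudes.mean(axis=0)
    sigma = amplitudes.std(axis=0)
    # Guard against division by zero on empty (zero-amplitude) pixels.
    dispersion = np.where(mu > 0, sigma / np.maximum(mu, 1e-12), np.inf)
    return dispersion < threshold

# Toy stack: one stable bright pixel and one fluctuating pixel.
stack = np.stack([
    np.array([[100.0, 10.0]]),
    np.array([[101.0, 90.0]]),
    np.array([[ 99.0,  5.0]]),
])
mask = select_ps_pixels(stack)
```

Only the stable pixel survives the 0.25 threshold; the fluctuating pixel has a dispersion above 1 and is rejected.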
While PSInSAR can achieve millimeter-order accuracy under exceptionally good conditions [14], the unknown and complex atmospheric density (in particular, water vapor pressure in the air) makes φ_atm larger. Additional data (e.g., numerical meteorological models [15]) and the relationship between phase and topography can reduce φ_atm. Machine-learning methods [16] are also effective for reducing φ_atm. However, tropospheric-delay correction remains a difficult problem.

B. InSAR ANALYSIS USING MACHINE LEARNING
In [17]-[20], local areas in InSAR images are classified for detecting important time-series changes such as volcano deformations and ground deformations. Low-level processing of InSAR images, such as phase filtering [21], phase unwrapping [22], and denoising [23], can also be improved by CNNs. In order to utilize time-series InSAR images, recurrent networks are also useful [24].
For machine-learning-based methods, the small amount of training data available in InSAR analysis is often a bottleneck that prevents fully utilizing their performance. This problem is alleviated by self-supervised learning [25] and synthetic data [18], [26].
In contrast to the aforementioned previous methods, our proposed method fuses a land-subsidence image, which is acquired by PSInSAR analysis, with a tropospheric-delay error map to further improve the accuracy of the land-subsidence image. How to fuse the land-subsidence image with the tropospheric-delay error map via CNNs is a technical contribution of our paper.

FIGURE 3. Typical network architectures for multi-modal data fusion. ''F'' indicates a fusion operation such as channel concatenation and elementwise add. In (a), multi-modal data are first fused and then fed into encoder-decoder networks consisting of convolution layers. In (b), encoders extract features before multi-modal fusion. The network shown in (c) employs an additional path (colored magenta in the figure) connecting the multi-modal paths for information exchange even before fusion. In (d), since one input modality is equal to the output modality, the path corresponding to this modality is regarded as dominant. Towards this dominant path, features extracted from the other modalities are fused. In the example shown in (d), the first input modality, colored green, is equal to the output modality.

C. MULTI-MODAL IMAGE FUSION USING CNNs
Multi-modal data fusion is widely used in various computer vision problems. Recently, as in most computer-vision methodologies, multi-modal data fusion has been achieved by CNNs. CNNs allow us to fuse complex multi-modal data. Such a CNN is appropriate for fusing the land-subsidence image with the tropospheric-delay error map, which is affected by complex atmospheric factors.
Typical network architectures proposed for multi-modal data fusion are shown in Fig. 3. In the CNNs shown in (a), multi-modal data are fused and then fed into convolutional layers. In [27], for example, RGB and depth images are fused for image segmentation using a convolutional encoder and decoder [28]. While multi-modal data is fused before convolutional feature extraction in (a), in (b) features are first extracted independently in each modality and then fused [29]-[32]. For example, DMFNet [32] employs residual modules [33] for this feature extraction. In (c), as proposed in ACNet [34] and RFBNet [35], in addition to the paths for each modality, an additional path for feature fusion (i.e., the path colored magenta in the figure) is employed. For the case where one of the input modalities is equal to the output modality, in (d) [36], [37], only one modality is regarded as dominant. For intensive analysis of this dominant modality, features extracted from the other modalities are auxiliarily fused into those extracted from the dominant modality. This approach is employed in FuseNet [38] for fusing RGB and depth data. Depending on the network architecture, different fusion operations indicated by ''F'' in the figure are employed. For example, multi-modal data are concatenated in [27], while the elementwise add operation is used in [38].
As described at the end of Section I, our proposed method fuses multi-modal data (i.e., land-subsidence and tropospheric-delay images obtained by conventional InSAR analysis) for millimeter-order accuracy.

III. PROPOSED METHOD
Our strategy for improving an InSAR image is to reduce the tropospheric delay by a CNN with a tropospheric-delay image. We expect that the distribution of tropospheric-delay values provides critical information about where and how large the tropospheric delay error is. Section III-A introduces how we obtain a land-subsidence image and its corresponding tropospheric-delay image from a pair of SAR images captured by a satellite. With these land-subsidence and tropospheric-delay images, our network for tropospheric-delay correction is proposed in Section III-B.

FIGURE 4. Land-subsidence image generation from N SAR images. After SNAP achieves InSAR analysis, its output images (i.e., N − 1 InSAR images) are fed into StaMPS/MTI [39], [40] for PSInSAR analysis. Its output is one PSInSAR image. The snap2stamps software is used for converting the SNAP format to the StaMPS/MTI format. For further improvement, precise tropospheric delays are estimated by TRAIN [41]. The estimated tropospheric delays are used for correcting the PSInSAR image. Our proposed method requires (a) the corrected PSInSAR image and (b) the estimated tropospheric-delay image.

Figure 4 shows the process flow for generating the InSAR and tropospheric-delay images in our work. First, multiple SAR images (i.e., N SAR images in the figure) are fed into SNAP [42], [43] to obtain initial InSAR images. With SNAP, InSAR images between one SAR image and all other images are computed; as a result, N − 1 InSAR images are obtained. These InSAR images are used for PSInSAR analysis with the Stanford Method for Persistent Scatterers / Multi-Temporal InSAR (StaMPS/MTI) [39], [40], [44] in order to acquire one PSInSAR image. Inconsistency between the data formats of SNAP and StaMPS/MTI is resolved by the snap2stamps software [12], [45]. Long-range phase-shift computation is done by Snaphu [10], [46] in StaMPS/MTI. Further improvement is achieved by tropospheric-delay correction with TRAIN [41], [47].
In TRAIN, a tropospheric delay is corrected under the assumption that a tropospheric-delay error depends on the altitude at each point. With the tropospheric-delay image estimated by TRAIN, the PSInSAR image is improved.
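The altitude-dependence assumption can be illustrated with a toy linear phase-versus-elevation correction. This is a minimal stand-in for intuition only, not TRAIN's actual implementation:

```python
import numpy as np

def remove_linear_tropo(phase: np.ndarray, elevation: np.ndarray):
    """Fit phase ~ a * elevation + b by least squares and subtract the fit."""
    A = np.stack([elevation.ravel(), np.ones(elevation.size)], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, phase.ravel(), rcond=None)
    fitted = (A @ coeffs).reshape(phase.shape)
    return phase - fitted, coeffs

# Toy profile: a constant "true" deformation plus an elevation-correlated delay.
elev = np.linspace(0.0, 1000.0, 50)
deform = 0.002 * np.ones(50)
phase = deform + 0.0005 * elev + 0.1   # delay slope 5e-4 per meter, offset 0.1
corrected, coeffs = remove_linear_tropo(phase, elev)
```

One caveat visible even in this toy: a spatially constant deformation is absorbed into the intercept, so the correction removes relative, not absolute, offsets.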

A. PSInSAR FOR LAND-SUBSIDENCE AND TROPOSPHERIC-DELAY IMAGES
Finally, the improved PSInSAR image is converted into a land-subsidence image. The phase shift included in the PSInSAR image is measured along the radar direction, while the land subsidence is measured along the vertical axis. Based on the angle θ between them, StaMPS/MTI computes the land-subsidence image. In our experiments, θ was given by the European Space Agency (ESA). These land-subsidence and tropospheric-delay images, each of which is expressed as an 8-bit grayscale image, are used in our proposed method, as described in what follows.
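A minimal sketch of this geometric conversion, assuming θ is the angle between the radar line of sight and the vertical axis (the function name is illustrative):

```python
import math

def los_to_vertical(d_los: float, theta_deg: float) -> float:
    """Convert a line-of-sight displacement into a vertical displacement.

    With theta = 0 the radar looks straight down and the two coincide;
    larger incidence angles amplify the vertical equivalent.
    """
    return d_los / math.cos(math.radians(theta_deg))
```

For example, a line-of-sight displacement observed at a 40-degree incidence angle corresponds to a vertical displacement about 1.3 times larger.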

B. TROPOSPHERIC-DELAY CORRECTION WITH FUSION OF LAND-SUBSIDENCE AND TROPOSPHERIC-DELAY IMAGES
The network architecture of our proposed method is shown in Fig. 5. Two input images (i.e., land-subsidence and tropospheric-delay images) are fed into the network, which produces the improved land-subsidence image. As with general CNNs, our network employs a combination of convolution and pooling layers for representing the global context of the input images. We used max pooling layers. Since each pooling operation reduces the size of a feature map, it is enlarged to its original size by unpooling layers [48]. Batch normalization layers [49] make the training process efficient and stable. A Rectified Linear Unit (ReLU) is used as the activation function. At the last layer, a 1 × 1 convolution reconstructs the output image, as done in many image-reconstruction methods such as image style transfer [50] and image super-resolution [51], [52].
The basic architecture is based on Fig. 3 (d) with a dominant path, because a dominant path is appropriate for our problem, in which one of the input modalities is equal to the output modality (i.e., the land-subsidence image). As in FuseNet [38], features extracted from the other modality (i.e., the tropospheric-delay image) are fused into the dominant path by the elementwise add indicated by the yellow layers in Fig. 5.
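The two fusion operations ''F'' mentioned above (channel concatenation as in [27] and elementwise add as in FuseNet [38]) can be sketched in numpy; the feature maps here are illustrative placeholders for encoder outputs:

```python
import numpy as np

def fuse_concat(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Channel concatenation: (C1, H, W) + (C2, H, W) -> (C1 + C2, H, W)."""
    return np.concatenate([a, b], axis=0)

def fuse_add(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Elementwise add: requires identical shapes (C, H, W)."""
    return a + b

# Illustrative feature maps for the two modalities.
subsidence_feat = np.ones((8, 4, 4))
tropo_feat = 2 * np.ones((8, 4, 4))
```

Concatenation doubles the channel count (so the next layer needs more parameters), while elementwise add keeps the shape but requires both branches to produce matching channels.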

1) DENSE SKIP CONNECTIONS FOR FUSING GLOBAL AND LOCAL FEATURES
Different from FuseNet, on the other hand, our network is augmented with skip connections between each pair of encoding and decoding layers. Through each skip connection, features extracted by the encoder and decoder are concatenated in the concatenation layer colored gray in Fig. 5. Such skip connections were proposed in UNet [53] for simultaneously capturing global and local features.

VOLUME 9, 2021

2) LOSS FUNCTIONS
For training our network, ground-truth subsidence values are required in order to compare them with the output land-subsidence image. These ground-truth values are measured by leveling in our method. The measured points are spatially aligned with the land-subsidence images based on their latitudes and longitudes, and the spatially-aligned outputs and ground-truth measurements are compared for training the network. This training uses the following Mean Squared Error loss (indicated by ''MSE loss (l)'' in Fig. 5):

MSE_l = (1 / N_l) Σ_l (q_gt,l − p_out,s(l))², (5)

where q_gt,l and p_out,s(l) denote the measurement value of the l-th ground-truth leveling point and the value of its spatially-aligned pixel (denoted by s(l)) in the output land-subsidence image, respectively. N_l is the number of such points in each land-subsidence image. In addition to MSE_l, the following Mean Squared Error loss (indicated by ''MSE loss (p)'' in Fig. 5) is also used for making the output image similar to its input image:

MSE_p = (1 / N_p) Σ_i (p_in,i − p_out,i)², (6)

where p_in,i and p_out,i denote the values of the i-th pixel in the input and output land-subsidence images, respectively, and N_p denotes the number of pixels in each land-subsidence image. Note that the loss functions (5) and (6) impose two different conditions on a pixel corresponding to each leveling point. On such pixels, the loss function (5) should be more important in order to learn the values measured by leveling; in other words, the loss function (6) should be neglected at each leveling point. Therefore, pixels around the leveling points are down-weighted in MSE_p by W(i) as follows:

MSE_p = (1 / N_p) Σ_i W(i) (p_in,i − p_out,i)², (7)

where W(i) has lower values around the leveling points. W(i) is expressed by a 2D Gaussian mixture, as shown in Fig. 6, where the dimension of the yellow rectangle is equal to the image size. Initially, all weights in W(i) are 1, and the 2D Gaussian mixture is then subtracted from this initial W(i). Each 2D isotropic Gaussian distribution is centered at one of the leveling points and is normalized so that its maximum value is 1. Therefore, 0 ≤ W(i) ≤ 1, with W(i) = 0 at each leveling point. With the two loss functions (5) and (7), the loss for training our proposed network is defined as

L = MSE_l + α MSE_p, (8)

where α denotes a weight constant.
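A numpy sketch of the losses (5)-(8) and the weight map W(i); the variable names mirror the text, and the defaults follow the values the paper reports in Section IV (α = 1.5, σ² = 200). The framework-level implementation is not given in the paper, so this is an illustration only:

```python
import numpy as np

def gaussian_weight_map(shape, points, sigma2: float) -> np.ndarray:
    """W(i): start from 1 and subtract a max-normalized Gaussian per leveling point."""
    H, W = shape
    yy, xx = np.mgrid[0:H, 0:W]
    w = np.ones(shape)
    for (py, px) in points:
        g = np.exp(-((yy - py) ** 2 + (xx - px) ** 2) / (2.0 * sigma2))
        w = w - g
    # Clip in case nearby Gaussians overlap, keeping 0 <= W(i) <= 1.
    return np.clip(w, 0.0, 1.0)

def total_loss(p_out, p_in, q_gt, points, sigma2=200.0, alpha=1.5) -> float:
    """L = MSE_l + alpha * weighted MSE_p, per Eqs. (5), (7), (8)."""
    mse_l = np.mean([(q - p_out[py, px]) ** 2
                     for (py, px), q in zip(points, q_gt)])
    w = gaussian_weight_map(p_out.shape, points, sigma2)
    mse_p = np.mean(w * (p_in - p_out) ** 2)
    return float(mse_l + alpha * mse_p)
```

Note that W(i) equals 0 exactly at each leveling point, so (7) ignores those pixels and (5) alone constrains them, as the text requires.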

3) RANDOM PATCH CROPPING FOR A SMALL NUMBER OF GROUND-TRUTH DATA
Unlike standard image-recognition problems, a SAR image is huge: for example, many images in the COCO dataset [54] are 640 × 480 pixels, while SAR images are around 20,000 × 10,000 pixels. Such a huge image cannot be directly loaded onto a GPU, which is required for practical deep learning. To resolve this problem, we crop the huge image into small patch images, each of which is independently fed into the network. In our experiments, the size of each patch image is 1024 × 1024 pixels (footnote 2). This size is the maximum that can be loaded onto a 24-GB NVIDIA Titan RTX GPU with an effective mini-batch size. We empirically found that a smaller patch image (e.g., 512 × 512 pixels) makes the training difficult to converge successfully. This might be because a smaller patch cannot cover a meaningful variation of the altitudes observed in the tropospheric-delay image, while the altitudes have a large impact on the tropospheric-delay error.
In the inference process, each image is divided into non-overlapping patches with no margin, and then all of the patches are independently fed into the network. By tiling the outputs estimated from these input patches, the rectified land-subsidence image is acquired. Since the locations of these patches are determined independently of the leveling points in the inference process, the leveling points are located randomly within each patch. Therefore, the leveling points should also be randomly located within each patch in the training process. To this end, in the training process, each patch is cropped randomly so that it contains at least one leveling point.
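The tiling procedure can be sketched as follows, assuming the image dimensions are multiples of the patch size (border handling is omitted); `net` stands for any trained patch-to-patch model and is an assumption of this sketch:

```python
import numpy as np

def tile_inference(image: np.ndarray, net, patch: int = 1024) -> np.ndarray:
    """Divide `image` into non-overlapping patches, run `net` on each
    patch independently, and tile the outputs back into one image."""
    H, W = image.shape
    out = np.zeros_like(image)
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            out[y:y + patch, x:x + patch] = net(image[y:y + patch, x:x + patch])
    return out
```

With an identity-like stand-in for `net`, the tiled output reproduces the whole image, confirming that the patches cover it without overlap.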
While the aforementioned random cropping process extracts only one patch around each leveling point in the training process, the performance of general machine learning, including CNNs, can be improved by increasing the training data (i.e., the cropped patches in our case). Since the amount of SAR images and leveling data is limited (footnote 3), the aforementioned one-patch cropping process might provide insufficient training data. To fully utilize the limited amount of available training data, data augmentation [55] is widely employed. Data augmentation increases the training data by slightly changing the original training data as long as the image context is unchanged; for example, images are cropped, flipped, rotated, blurred, and degraded with noise. Based on the above discussion, not only one but several patches are randomly located and cropped around each leveling point, as illustrated in Fig. 7.

Footnote 2: 1024 × 1024 pixels captures around 10 km × 10 km on the ground.
Footnote 3: See Section IV-A for details.

FIGURE 7. Random patch cropping for data augmentation in training the CNN shown in Fig. 5. Black rectangles, blue squares, and red and purple rectangles indicate SAR images, leveling points, and patches, respectively.
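The randomly-shifted cropping can be sketched as drawing a random patch window constrained to contain a given leveling point; coordinates and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop_around_point(image: np.ndarray, point, patch: int):
    """Crop a random (patch, patch) window guaranteed to contain `point`.

    Returns the crop and the point's coordinates inside the crop.
    """
    py, px = point
    H, W = image.shape
    # Valid top-left corners keep (py, px) inside the patch and the image.
    y0 = rng.integers(max(0, py - patch + 1), min(py, H - patch) + 1)
    x0 = rng.integers(max(0, px - patch + 1), min(px, W - patch) + 1)
    return image[y0:y0 + patch, x0:x0 + patch], (py - y0, px - x0)
```

Calling this several times per leveling point (N_a times, in the paper's notation) yields the augmented patch set illustrated in Fig. 7.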
For the training process with the patches, our loss function (8) is modified so that this loss is computed in each patch.

IV. EXPERIMENTS

A. DATASET

1) LEVELING MEASUREMENT DATA
Official land-subsidence data collected by the Ministry of the Environment, Japan [67], are used in our experiments [56]-[66]. This dataset covers 33 regions in Japan. While other common publicly-available measurement data are not sufficiently precise compared with InSAR/PSInSAR analysis, as shown in Table 1, the leveling process employed in this data collection provides 0.1-millimeter-order resolution [4]. Therefore, this dataset of leveling points can be regarded as ground-truth data for our supervised learning strategy.

TABLE 2. Statistics of the leveling points used in our experiments: (1) where the points are measured, (2) which year the measurement began, (3) which year the measurement ended, (4) the measurement period of the points, and (5) the number of points measured in each area. The points in Ibaraki prefecture are used as test data, while all other points are used in the training process.

Table 2 shows the statistics of the leveling points used in our experiments. The land-subsidence values of the leveling points between two consecutive years (e.g., between 2010 and 2011) are compared with the subsidence maps of InSAR analysis between the same years. While this dataset contains 1,025 leveling points, each of which was measured around January 1st, some of them were not measured multiple times. For InSAR analysis, at least two SAR images of the same region are required. Therefore, among all leveling points in the above data, we used only points that were observed twice or more. From these points, 2,339 land-subsidence values were collected in total.
For a fair comparison in which the training and test data have no overlap, measurements in Ibaraki prefecture are used as test data. All other measurements are used in the training process. In total, the 2,339 land-subsidence values are split into 2,055 training data and 284 test data; 284 = 142 × 2 is the number of land-subsidence values obtained from three-year leveling measurements in Ibaraki.
As described in Section III-B, a SAR image is cropped to small patch images around each leveling point. The number of the leveling points included in each patch image is 4.49 on average. The maximum, minimum, and variance are 22, 1, and 16.23, respectively.

2) SAR IMAGES
We use charge-free SAR images collected by ESA Sentinel-1 [68]. Sentinel-1 observes the leveling points included in our dataset introduced in Section IV-A1 with a period of 12 days. Since all the leveling points were observed around January 1st, we use SAR images captured between December 25th and January 5th for making a pair of a SAR image and the leveling measurements included in it. We assume that the time difference between the leveling and SAR measurements (i.e., ±5 days at most) is sufficiently short compared with a year (i.e., the time interval between two leveling measurements), so that the SAR measurements are almost synchronized with the leveling measurements. Note that, while a SAR image is asynchronous with the leveling points, the land-subsidence image and the tropospheric-delay error map (i.e., the two inputs of the proposed network) are synchronized because they are estimated from the same SAR image. This is important because the tropospheric-delay error map is affected by the weather, which changes from moment to moment; if these two inputs were asynchronous, the changing weather would make it difficult to fuse them. On the other hand, a short time difference between the two inputs and the leveling points is acceptable under the assumption that land subsidence does not occur in such a short time (i.e., ±5 days at most). In total, 346 SAR images were collected for our dataset, around 10 SAR images on average for each region. These SAR images allow PSInSAR to estimate InSAR images robustly against noise.
As described in Section III-A, SAR and InSAR images were generated by SNAP [42]. From nine SAR images, eight InSAR images are generated. While these SAR images consist of reflection intensity values, each pixel value in the InSAR images is a phase shift.

B. EVALUATION
All estimation results are evaluated by the root mean square error (RMSE), which is expressed as follows:

RMSE = √( (1 / N_t) Σ_l (q_gt,l − p_out,s(l))² ), (9)

where q_gt,l and p_out,s(l) denote the ground-truth land-subsidence value of the l-th leveling point and its spatially-aligned pixel (denoted by s(l)) in the output images, respectively. N_t is the number of leveling points over all images; in our experiments, N_t = 284, the total number of test land-subsidence values.
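The evaluation metric (9) in code form, with the ground-truth values and aligned output pixels gathered into flat arrays:

```python
import numpy as np

def rmse(q_gt: np.ndarray, p_out: np.ndarray) -> float:
    """Root mean square error over all leveling points (Eq. (9))."""
    return float(np.sqrt(np.mean((q_gt - p_out) ** 2)))
```

For example, errors of 3 and 4 over two points give an RMSE of √12.5 ≈ 3.54.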

C. PARAMETERS
All parameters were determined empirically so that the RMSE (9) is minimized on all training data: α = 1.5 in the loss function (8), and σ² of the Gaussian distributions defining W(i) in the loss function (7) is 200.
As mentioned before, we used StaMPS/MTI [39], [40] for PSInSAR analysis. Its parameters used in our experiments are shown in Table 6.

D. CNN TRAINING
Our CNN shown in Fig. 5 was trained with the Adam optimizer [69]. The learning rate and mini-batch size were 1.0e-5 and 8, respectively. The training process was halted when the number of epochs reached 100.

The RMSE computed from the PSInSAR images obtained by the baseline method shown in Fig. 4 is 10.3 mm.
In the first experiment, we evaluate the effectiveness of multi-modal data fusion. In order to validate the effectiveness of the tropospheric-delay image for multi-modal data fusion, the network path for the tropospheric-delay image is removed from our proposed network shown in Fig. 5. That is, the tropospheric-delay image is not fed into the network, and only the upper path for the land-subsidence image is used in this comparison. This network is called a unimodal CNN in Table 3.

TABLE 3. Comparison of the PSInSAR baseline, the unimodal CNN, and our multi-modal fusion CNN. This evaluation was done with the test data (i.e., the data of Ibaraki prefecture), which was excluded from the training data. Since this criterion is an error metric, a smaller value means a better performance. The best score is colored red. The baseline result obtained by PSInSAR is shown in the leftmost row.

As shown in Table 3, both CNNs are better than the baseline (i.e., the PSInSAR result). While a large performance gain is acquired by the CNN (i.e., 10.3 by PSInSAR vs. 7.15 by the unimodal CNN), the performance is further improved by our multi-modal fusion with the tropospheric-delay image (i.e., 7.15 by the unimodal CNN vs. 6.80 by our multi-modal fusion CNN).

TABLE 5. Ablation study (1). The effectiveness of the proposed randomly-shifted cropping is demonstrated.

For validating the superiority of our proposed network for the fusion of land-subsidence and tropospheric-delay images, several variations of multi-modality fusion networks are evaluated. Table 4 shows the results (i.e., RMSEs) of all the variations. In the table, (a), (b), (c), and (d) are the original UNet [53] that accepts one input image, the modified UNet that accepts two images (i.e., land-subsidence and tropospheric-delay images), ACNet [34], and our proposed network, respectively. Since the original UNet (a) accepts only one input image, the land-subsidence and tropospheric-delay images are concatenated channel-wise to be merged into one image.
That is, given a W × H-pixel land-subsidence image and a W × H-pixel tropospheric-delay image, a W × H × 2 image is fed into the UNet in (a). The detailed architecture of (b) is shown in Fig. 8. Note that all of (a),  Table 7.
Compared with PSInSAR as the baseline, all deep network-based methods with multi-modal data fusion get better results. However, several fusion methods are inferior to the unimodal CNN, which is shown in Table 3. This might be because it is not easy to train CNNs with more parameters successfully.
Among the four variations of network architectures, our proposed networks (d) and (d') outperform other networks both with and without skip connections. Furthermore, the skip connections can improve the performance, i.e., (d) 7.73 vs. (d') 6.80. These results demonstrate that our proposed method (d') is the best among all variations shown in Table 4.
In what follows, the results of our ablation studies are shown to demonstrate the advantages of our algorithm and parameter choices. Table 5 shows the results obtained by our method with and without randomly-shifted patch cropping in the training process. We can see that the randomly-shifted patch cropping scheme decreases the RMSE.

TABLE 8. Ablation study (2). The effect of the receptive field is evaluated. While a wide variety of receptive fields are evaluated, its size has a limited impact on the performance.
In a CNN, the size of its receptive field is important in determining how much global context information the CNN captures. As shown in Table 8, a wide range of receptive fields is verified. The receptive field of the default network shown in Fig. 5 is 140 pixels. The receptive field is changed by adding or removing the following set of layers: 3 × 3 Conv, Batch Normalization, ReLU, 3 × 3 Conv, Batch Normalization, ReLU, and Pooling. By removing one set from, adding one set to, or adding two sets to the default network, the receptive field becomes 68 pixels, 284 pixels, or 572 pixels, respectively. Table 8 shows that the performance varies depending on the receptive field. The lowest RMSE is observed with a receptive field of 140 pixels. The RMSE increases when the receptive field gets larger, probably because (1) it is not easy to properly train a greater number of parameters and (2) a larger receptive field (such as 284 or 572 pixels) is not effective for representing the land subsidence around each local leveling point. In all other experiments, therefore, the size of the receptive field is 140 pixels.
Table 9 shows the effects of the data augmentation process described in Section III-B. We tested the augmentation process by varying the number of randomly-cropped patches around each leveling point (denoted by N_a, where N_a ∈ {1, 2, 3, 4, 5, 6}). While the best result is obtained with N_a = 5, the performance decreases with N_a = 6. This performance drop might be caused by excessive data augmentation saturating useful information. Furthermore, increasing the randomly-shifted patch cropping over-spreads the appearance variation of patch images, which makes it difficult to train the network properly.
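The growth of the receptive field caused by adding or removing a set of layers follows the standard recurrence RF ← RF + (k − 1) · jump for a k × k convolution and jump ← jump · s for a layer with stride s. The sketch below computes this recurrence for a plain stack of such sets; it does not reproduce the full encoder-decoder network, so the printed value differs from the 140-pixel default.

```python
def receptive_field(layers):
    """Compute the receptive field of a plain stack of layers,
    where each layer is given as (kernel_size, stride)."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each kernel widens the field by (k-1)*jump
        jump *= s             # striding/pooling doubles the step size
    return rf

# One "set" from the paper: 3x3 Conv, BN, ReLU, 3x3 Conv, BN, ReLU, Pooling.
# BN and ReLU do not change the receptive field; pooling is taken as (2, 2).
conv_set = [(3, 1), (3, 1), (2, 2)]
print(receptive_field(conv_set * 3))  # 36
```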
While the lowest error is obtained with N_a = 5, all other experiments shown in this section (i.e., Tables 3, 4, 5, 8, and 10) are done with N_a = 4 for efficient training; more training data requires more training time. Note that, for a fair comparison, data augmentation with N_a = 4 was executed for the other methods as well as for our proposed method, which achieves the best result.
Table 10 shows the results acquired with and without residual learning [33], which is known to be effective for deep networks. In our network shown in Fig. 5, a residual path begins right after the input land-subsidence image and ends right before the output land-subsidence image. The output of the last 1 × 1 convolution layer (indicated by the purple layer in the figure) is elementwise-added to the input land-subsidence image via the residual path in order to reconstruct the output land-subsidence image. While no performance gain is obtained by this residual learning, the performance drop is small. These results show that our network is (1) sufficiently deep for improving the land-subsidence image, as shown in Table 3, but (2) sufficiently trained without overfitting because the network is not too deep (i.e., if the network were too deep, residual learning might be able to improve the performance).
For visual evaluation, input and output images of our proposed network are shown in Fig. 9. The input and output images are shown on the left and right sides of the same row, respectively. In the upper example (i.e., (a) success case), the global subsidence values are similar between the input and output images. This demonstrates that the MSE loss (7) works correctly. Based on the facts that (1) the input and output images are similar, as shown in Fig. 9 (a), and (2) the subsidence values at the leveling points in the output are improved, as shown in Table 3, it is expected that the output image is estimated more accurately than the input image.
In contrast to the success case, in the lower example (i.e., (b) failure case), it can be seen that (1) the input and output images are dissimilar and (2) the ground-truth value overlaid at the center is dissimilar to its neighboring pixels in the land-subsidence image modified by our method. In this failure case, there is a large difference between the ground-truth leveling value and its neighboring values in the land-subsidence image estimated by PSInSAR. For resolving this problem, additional cues might be required in addition to the tropospheric delay error map. Otherwise, the leveling value itself might be erroneous due to measurement error or some other reasons. In such a case, our proposed method requires an explicit mechanism for removing the erroneous training data.
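The residual learning scheme evaluated in Table 10 can be sketched as an elementwise addition of a predicted correction to the input image; the network body is abstracted into a placeholder function here, and all names are hypothetical.

```python
import numpy as np

def residual_reconstruct(subsidence_in, network_body):
    """Elementwise-add the predicted correction (output of the last 1x1
    convolution) to the input land-subsidence image."""
    correction = network_body(subsidence_in)
    return subsidence_in + correction

# With a zero correction, the output equals the input image.
out = residual_reconstruct(np.full((4, 4), 2.0), lambda x: np.zeros_like(x))
print(out[0, 0])  # 2.0
```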

V. CONCLUDING REMARKS
This paper proposes a method for improving the accuracy of a land-subsidence image acquired by PSInSAR analysis. Our method employs a PSInSAR scheme for obtaining the land-subsidence image as the baseline. This baseline is improved by being fused with the tropospheric-delay image through our proposed CNN. Our CNN achieves this multi-modal data fusion as well as global and local feature fusion.
Future work includes the usage of other modalities (e.g., atmospheric water vapor and numerical meteorological models) that are known to be essential for PSInSAR analysis. While our proposed method is, in theory, applicable to SAR data observed at any time and in any region, it is possible that region- and time-specific training is required for maintaining the performance independently of region and time.