A Wavelet-Driven Subspace Basis Learning Network for High-Resolution Synthetic Aperture Radar Image Classification

The feature learning strategy of convolutional neural networks extracts deep spatial features from high-resolution (HR) synthetic aperture radar (SAR) images but ignores the speckle noise that arises from the SAR imaging mechanism. In the feature learning module, noise reduction by feature-adaptive projection, guided by a powerful embedded wavelet feature reconstruction mechanism, can effectively learn the deep feature statistics. In this article, we present a wavelet-driven subspace basis learning network (WDSBLN), following an encoder–decoder architecture, for HR SAR image classification. A powerful wavelet module, including wavelet decomposition and reconstruction, is employed to keep the structures of the learned features intact under speckle noise. Specifically, a compact second-order feature enhancement mechanism is designed to improve the contour and edge information of the low-frequency components in the feature decomposition stage, and a local feature attention module based on a point-wise convolutional layer is adopted to aggregate the contextual information of the local channels and preserve the detail information in the high-frequency components. Then, the reconstructed feature map is employed as a guide in the subspace basis learning (SBL) module. The SBL module, including basis generation (generating the subspace basis vectors) and subspace projection (transforming deep feature maps into a signal subspace), maintains the local structure of HR SAR image patches and acquires robust feature statistics. We conduct evaluations on three real HR SAR image classification datasets and achieve superior performance compared with other related networks.


I. INTRODUCTION
Synthetic aperture radar (SAR) systems, active imaging sensors with the advantages of high resolution [1], [2], [3] and strong penetrating power, can capture the backscattering information of ground objects in day-and-night and all-weather conditions [4]. Recently, SAR systems have provided rich high-resolution (HR) images for various applications, e.g., land planning, crop yield mapping, and environmental monitoring [5]. HR SAR image classification aims to assign a semantic label to every pixel of the image and is considered one of the important steps of high-level SAR image interpretation. Nevertheless, SAR images contain complex spatial information and coherent speckle noise, both of which make the description of SAR images difficult and challenging. Consequently, comprehensive representation plays a critical role in the SAR image classification task.
The low-level and middle-level feature descriptors, also called handcrafted features, have been successfully developed for the description of SAR images. The low-level feature descriptors [6], [7], [8], e.g., the intensity feature and the gray-level cooccurrence matrix (GLCM), focus on describing the geometric structure and texture of image patches and achieve good performance. Song et al. [9] proposed a histogram of oriented gradients (HOG)-like feature, named SAR-HOG, for SAR image classification, and this feature representation effectively captures the structures of targets. Esch et al. [10] analyzed the land cover types of SAR images by speckle statistics and intensity information; the statistics of the speckle noise could be acquired from the SAR images by means of an unsupervised analysis. Mian et al. [11] presented a novel family of parameterized wavelets, which can be viewed as a description of SAR images, and it reduced undesired high side lobes in the process of SAR image decomposition. All the aforementioned methods incorporate prior knowledge to capture SAR image features, but these features generalize poorly. Afterwards, with the purpose of achieving stronger features, there has been increasing interest in middle-level feature learning methods based on machine learning techniques, e.g., sparse representation [12], the Fisher vector (FV) [13], the superpixel-level FV [14], and multiscale local Fisher patterns [15]. Although these feature descriptors characterize SAR images better than the low-level features, they require substantial prior knowledge for feature extraction and are not suitable for processing the large volumes of HR SAR data. At present, deep learning feature-based methods, which have excellent feature learning capability, have been successfully applied to SAR image classification.
Several approaches have been introduced to classify different land covers via deep neural networks (DNNs), e.g., the deep belief network (DBN) [16], the deep sparse tensor filtering network (DSTFN) [17], the distribution and structure match generative adversarial network [18], deep joint distribution adaptation networks (DJDANs) [19], and the complex-valued convolutional neural network (CV-CNN) [20]. To handle the highly imbalanced classes, geographic diversity, and label noise in HR SAR images, Huang et al. [21] introduced a deep transfer learning network with a top-2 smooth loss function, which was proven to generalize well for large-scale HR SAR image classification. Qian et al. [22] combined data-based methods with model-based methods and proposed a neural network with structural constraints for capturing the distribution and structural characteristics of HR SAR images. Simultaneously, Liang et al. [23] applied a multiscale deep feature fusion network and covariance pooling manifold network (MFFN-CPMN) to learn the global statistical properties and local spatial features simultaneously with limited SAR labels. Although all the above methods pay much attention to the complex structural and geometrical information of HR SAR images, coherent speckle noise remains very challenging for interpreting land covers in HR SAR images.
Various methods have been developed to interpret HR SAR images efficiently with consideration of coherent speckle noise or complex structure [24]. Geng et al. [25] introduced graph-cut-based spatial regularization into a deep supervised and contractive neural network (DSCNN) to restrain the influence of speckle noise and yielded superior classification performance. Qian et al. [26] trained Ridgelet-Nets with a speckle reduction regularization that combines deep features with statistical modeling and geometric analysis of HR SAR image patches and showed great potential for HR SAR image classification. Simultaneously, with the purpose of exploring distinguishable context information affected by coherent speckle, Geng et al. [27] developed a spatial feature learning network based on long short-term memory (LSTM) for learning contextual features. Moreover, Liang et al. [28] employed a global context-aware block and a residual context encoder block to enhance the local-global semantic contexts in an encoder-decoder network architecture, with superior performance achieved. Notably, the popular pixel-wise image classification methods [28] usually follow an encoder-decoder structure; in computer vision, these methods are usually called semantic segmentation, e.g., the fully convolutional network (FCN) [29], U-Net [30], and SegNet [31]. These methods explicitly learn the spatial layout of labels and have attracted tremendous attention. Ley et al. [32] employed a generative adversarial network (GAN) to transcode SAR images into optical images, and then the output layers of the FCN were replaced with a classifier. Wu et al. [33] designed a multiscale convolutional neural network (CNN) for pixel-wise classification that follows an encoder-decoder architecture; this network utilized an autoencoder regularization branch and a contextual attention branch to learn classification information efficiently. Fang et al.
[34] designed a Siamese U-Net with shared weights and a fast Fourier transform (FFT) correlation layer for SAR-optical matching; notably, the global context and local details of the SAR and optical images were well retained. Ren et al. [35] integrated an attention mechanism into the popular U-Net to classify open water and sea ice in Sentinel-1 A SAR images at the pixel level and achieved better performance. These encoder-decoder methods attempt feature learning on HR SAR images but do not pay much attention to the influence of coherent speckle noise. Consequently, a distinguishable deep feature learning method under speckle noise, using an encoder-decoder structure, is valuable for boosting the performance of HR SAR image classification.
Methods combining DNNs with transform-domain algorithms capture deep features and keep the structures of the extracted features intact under coherent speckle noise. Recently, Qin et al. [36] trained a wavelet speckle reduction module in CNNs and proved the superiority of the proposed method under different degrees of speckle noise. Ni et al. [37] proposed a subspace wavelet encoding network (SWENet) for modeling robust features in individual subspaces. Additionally, Gao et al. [38] and Duan et al. [39] also integrated wavelets into CNNs and designed wavelet-based layers for SAR image analysis. The aforementioned wavelet-based methods embed the wavelet algorithm into DNNs for feature learning directly and are effective for SAR image classification tasks, but a specific analysis of the low-frequency and high-frequency components, as is standard in the image denoising field, is still necessary. Therefore, we should conduct a specific analysis of the low-frequency and high-frequency components under the influence of speckle noise on SAR images to preserve the deep features. Popular attention mechanisms, e.g., SENet [40], CBAM [41], SKNet [42], and GCNet [43], can enhance deep feature representations to different degrees. However, attention mechanisms designed for computer vision are usually not directly suitable for SAR image processing because of the speckle noise caused by the SAR imaging mechanism. Thus, designing an efficient wavelet components-enhanced scheme under speckle noise is challenging for HR SAR image classification. Moreover, semantic segmentation methods based on CNN layers, which depend on local filter responses without global structure information, still struggle to segment HR SAR images in the high-frequency details, e.g., under speckle noise.
To address the aforementioned challenges, based on robust deep feature learning, we propose a wavelet-driven subspace basis learning network (WDSBLN), following an encoder-decoder architecture, for pixel-wise HR SAR image classification. The proposed WDSBLN contains the following modules: a powerful wavelet components-enhanced scheme and nonlocal image feature learning by projection. The powerful wavelet components-enhanced scheme, inserted after the final convolutional layer, keeps the structures of the learned features intact under speckle noise. In particular, the components-enhanced scheme is carefully designed based on the characteristics of the high- and low-frequency subbands of SAR image patches. The nonlocal image feature learning strategy is designed by subspace projection. The main contributions of this article are summarized as follows:
1) The powerful wavelet module (PWM), including wavelet decomposition and reconstruction, is inserted in the feature learning stage to keep the structures of the learned features and generate a superior and robust feature map in an end-to-end training pipeline. Specifically, a compact second-order feature enhancement mechanism and a local feature attention module are designed to improve the contour information of the low-frequency component and preserve the detail information of the high-frequency components, respectively.
2) The subspace basis learning (SBL) module is adopted to transform deep feature maps into a signal subspace and capture powerful features spanned by the basis vectors learned in the basis generation part. Herein, the reconstructed feature maps generated by the PWM are employed as a guide in the subspace projection part. The SBL module maintains the local structure of HR SAR image patches while considering the global structure information of the features.
3) The PWM and SBL modules are inserted into a commonly used U-Net architecture and trained in an end-to-end fashion. Experimental results on three real HR SAR image classification datasets demonstrate the improvement of the feature representation ability and HR SAR image classification results compared with other related networks.
The rest of this article is organized as follows. The details of our proposed WDSBLN are presented in Section II. Section III reports and analyzes the experimental results. Finally, Section IV concludes this article.

II. METHODOLOGY
In this section, the overall structure of the WDSBLN, consisting of four parts, i.e., the multiconvolution+batch normalization (BN) block, the PWM module, the SBL module, and the deconvolution+BN block, is illustrated in Fig. 1. Notably, the powerful wavelet module (PWM), including a compact second-order feature enhancement (CSFE) block and a local feature attention (LFA) block, is utilized to keep the structures of the learned features intact under speckle noise. The SBL module, including basis generation and subspace projection, maintains the local structure of HR SAR image patches while considering the global structure information of the deep features. Afterwards, we offer a complexity analysis of the proposed WDSBLN.

A. Powerful Wavelet Module (PWM)
The PWM has four parts, i.e., the wavelet decomposition layer, the CSFE block, the LFA block, and the wavelet reconstruction layer. For the wavelet decomposition layer, given a feature map X ∈ R^{H×W×C} output by the encoder module, this layer is carefully designed by an orthogonal Haar wavelet transform. Herein, X can be converted into a 2-D matrix, and then the 2-D discrete Haar wavelet transform can be utilized to acquire the wavelet features. In particular, the 2-D discrete Haar wavelet transform can be realized by a row-by-column 1-D discrete Haar wavelet transform [44]. Let the wavelet basis vectors be l = (1/√2)[1, 1]^T and h = (1/√2)[1, −1]^T. The 1-D discrete Haar wavelet transform is first calculated row-by-row to obtain the high-frequency and low-frequency components in the horizontal direction; then, the acquired high-frequency and low-frequency components are calculated column-by-column to obtain one low-frequency feature component and three high-frequency components. The forward calculation of the wavelet decomposition layer is presented as follows:

W_ll = (2↓1)(f_0^d ⊗_c (1↓2)(f_0^d ⊗_r X))
W_lh = (2↓1)(f_1^d ⊗_c (1↓2)(f_0^d ⊗_r X))
W_hl = (2↓1)(f_0^d ⊗_c (1↓2)(f_1^d ⊗_r X))
W_hh = (2↓1)(f_1^d ⊗_c (1↓2)(f_1^d ⊗_r X))        (1)

where W_ll, W_lh, W_hl, and W_hh denote one low-frequency feature component and three high-frequency components of X. (1↓2) and (2↓1) are downsampling by row and by column, respectively. ⊗_c and ⊗_r are the column-convolution and row-convolution operations, respectively. f_0^d and f_1^d are the low-pass and high-pass filters defined by the wavelet basis vectors l and h. W_ll represents most of the feature information, e.g., contours and edges. W_lh, W_hl, and W_hh indicate the high-frequency feature components, e.g., the details of the deep features and noise. For the low-frequency component W_ll, a CSFE mechanism is designed to improve the contour and edge information. The detailed structure of the CSFE is given in Fig. 2.
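The row-by-column 1-D Haar transform described above can be sketched in a few lines of NumPy. This is an illustrative single-channel version under the stated (1/√2)-normalized basis, not the network layer itself:

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2-D Haar DWT of a 2-D array with even sides, realized
    as a row-by-row then column-by-column 1-D transform.
    Returns (ll, lh, hl, hh). Illustrative sketch only."""
    s = 1.0 / np.sqrt(2.0)
    # Row-wise 1-D transform: low/high-pass filtering of column pairs.
    lo = s * (x[:, 0::2] + x[:, 1::2])   # horizontal low frequency
    hi = s * (x[:, 0::2] - x[:, 1::2])   # horizontal high frequency
    # Column-wise 1-D transform on each intermediate band.
    ll = s * (lo[0::2, :] + lo[1::2, :])
    lh = s * (lo[0::2, :] - lo[1::2, :])
    hl = s * (hi[0::2, :] + hi[1::2, :])
    hh = s * (hi[0::2, :] - hi[1::2, :])
    return ll, lh, hl, hh

x = np.arange(16, dtype=float).reshape(4, 4)
ll, lh, hl, hh = haar_dwt2(x)
# The transform is orthogonal, so the signal energy is preserved.
assert np.isclose((x**2).sum(),
                  sum((b**2).sum() for b in (ll, lh, hl, hh)))
```

Because the filters are orthonormal, the four subbands jointly preserve the energy of the input, which is what makes a lossless wavelet reconstruction layer possible later in the module.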
In the CSFE module, we first introduce compact high-order statistics into the feature enhancement module to improve the contour and edge information of the low-frequency feature component. First of all, we generate two pairs of random sequences [45] m_i and n_i, i = 1, 2, where the entries of m_i and n_i are uniformly sampled from {1, 2, 3, ..., B} and {−1, +1}, respectively, and B denotes the dimension of the compact second-order feature statistics, B ≪ C². Second, we define a sparse matrix P, which is determined by m and n; then Z = PW_ll. More specifically, Z can be calculated as

(PW_ll)_h = Σ_{e: m(e)=h} n(e)W_ll(e),  h = 1, 2, ..., B        (2)

where W_ll(e) is the e-th element of W_ll, so (PW_ll)_h is acquired as the sum of all n(e)W_ll(e) that satisfy m(e) = h. Based on this, we obtain two tensor sketch vectors Z(W_ll, m_1, n_1) and Z(W_ll, m_2, n_2) [45], both of which carry the sparse feature information of W_ll. Then, the compact second-order attentional statistics are achieved by

C(W_ll) = Z(W_ll, m_1, n_1) ⊛ Z(W_ll, m_2, n_2)        (3)

where ⊛ is the circular convolution operation. For fast calculation, the FFT is utilized to acquire the compact second-order statistics approximately as follows:

C(W_ll) = F⁻¹(F(Z(W_ll, m_1, n_1)) • F(Z(W_ll, m_2, n_2)))        (4)

where F and F⁻¹ are the FFT and inverse FFT, respectively, and • denotes element-wise multiplication. Then, the output of the CSFE module is obtained by applying the resulting compact second-order attentional statistics to the low-frequency component W_ll. The compact second-order feature statistics are employed to enhance the feature discriminability of the low-frequency feature component in the CSFE module. Herein, the CSFE maps the high-dimensional feature space into a low dimension, which solves the calculation and storage problems of high-dimensional bilinear pooling feature statistics.
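The tensor-sketch construction and its FFT-based circular convolution can be illustrated with a small NumPy sketch. The dimensions C and B and the 0-based hash indexing are illustrative assumptions, not values fixed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)
C, B = 64, 256          # channel dimension and sketch dimension (B << C**2)

def count_sketch(w, m, n, B):
    """Tensor-sketch projection Z = Pw with (Pw)_h = sum_{e: m(e)=h} n(e)w(e).
    Indices here are 0-based, unlike the 1-based notation in the text."""
    z = np.zeros(B)
    np.add.at(z, m, n * w)   # unbuffered accumulation handles repeated bins
    return z

w = rng.standard_normal(C)                                 # a vectorized W_ll
m = [rng.integers(0, B, size=C) for _ in range(2)]         # hash indices
n = [rng.choice([-1.0, 1.0], size=C) for _ in range(2)]    # random signs

z1 = count_sketch(w, m[0], n[0], B)
z2 = count_sketch(w, m[1], n[1], B)

# Circular convolution of the two sketches, computed via the FFT:
# F^-1( F(z1) . F(z2) ) approximates the second-order statistics.
c = np.fft.ifft(np.fft.fft(z1) * np.fft.fft(z2)).real

# Sanity check: the entries of a circular convolution sum to the
# product of the input sums.
assert np.isclose(c.sum(), z1.sum() * z2.sum())
```

The FFT route reduces the cost of the circular convolution from O(B²) to O(B log B), which is what makes the large sketch dimensions analyzed later (up to B = 16384) practical.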
For the three high-frequency components W_hl, W_lh, and W_hh, the local feature attention (LFA) module is adopted to aggregate the contextual information of the local channels and preserve the detail information. The architecture of the LFA module is presented in Fig. 3.
From Fig. 3, we can see that the LFA module is designed based on the popular SENet [40]; differently, we employ a point-wise convolutional layer to aggregate the contextual information of the local channels. For W_hl, we have

W̃_hl = W_hl ⊙ σ(PConv(R(B(PConv(W_hl)))))        (6)

where σ is the sigmoid function, ⊙ denotes element-wise multiplication, B and R denote the BN and ReLU layers, respectively, and PConv is the point-wise convolutional layer. W_lh and W_hh are processed in the same way as W_hl. Our LFA module can be viewed as a local version of SENet: it removes the global average pooling (GAP) layer of SENet and replaces the fully connected (FC) layer with a point-wise convolutional layer, emphasizing the importance of local attention for activating the units. Additionally, to emphasize and preserve the detailed structure, the LFA module employs the local cross-channel context to adaptively activate each element of the feature maps. Then, a wavelet reconstruction layer, based on the inverse wavelet transform algorithm, is designed to reconstruct the wavelet-based feature map in the following style:

X_P = L^T W_ll L + H^T W_lh L + L^T W_hl H + H^T W_hh H        (7)

where W_ll, W_lh, W_hl, and W_hh here denote the subbands enhanced by the CSFE and LFA blocks, L and H are calculated from l and h [37], and X_P ∈ R^{H×W×C}. The output of the wavelet reconstruction layer is a superior and robust feature map that keeps the structures of the learned features intact under speckle noise thanks to the CSFE and LFA modules.
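A minimal NumPy sketch of the LFA gating follows, assuming a SENet-style channel reduction ratio r and omitting the BN layer for brevity; both are illustrative assumptions not fixed by the text:

```python
import numpy as np

rng = np.random.default_rng(1)
H, W, C, r = 8, 8, 16, 4      # r: assumed channel reduction ratio

def pconv(x, weight):
    """Point-wise (1x1) convolution: a per-pixel linear map over channels."""
    return x @ weight          # (H, W, Cin) @ (Cin, Cout) -> (H, W, Cout)

def lfa(x, w1, w2):
    """Local feature attention sketch: SENet-like gating WITHOUT global
    average pooling, so every spatial position keeps its own gate."""
    a = pconv(x, w1)                          # channel reduction
    a = np.maximum(a, 0.0)                    # ReLU (BN omitted here)
    a = 1.0 / (1.0 + np.exp(-pconv(a, w2)))   # sigmoid gate, back to C channels
    return x * a                              # element-wise re-weighting

x = rng.standard_normal((H, W, C))
w1 = rng.standard_normal((C, C // r)) * 0.1
w2 = rng.standard_normal((C // r, C)) * 0.1
y = lfa(x, w1, w2)
# The sigmoid gate lies in (0, 1), so the output never grows in magnitude.
assert np.all(np.abs(y) <= np.abs(x) + 1e-12)
```

Dropping GAP is the key design choice: a global pool would collapse the spatial detail that the high-frequency subbands are meant to preserve, whereas a per-position gate keeps it.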

B. SBL Module
The SBL module, including basis generation and subspace projection, generates the subspace basis vectors and transforms the deep feature maps into a signal subspace [46]. For the inputs X_P (the wavelet-based feature map) and X (the convolutional feature map), we first estimate γ basis vectors V = [ν_1, ν_2, ν_3, ..., ν_γ], where each ν_i ∈ R^N, N = HW, denotes a basis vector of the signal subspace. Herein, the procedure of the SBL module is given in Fig. 4.
For the basis generation part, we generate the subspace basis vectors V ∈ R^{HW×γ} from X ∈ R^{H×W×C} and X_P ∈ R^{H×W×C} by a bank of convolutional layers [46], i.e.,

V = Conv(Concat(X, X_P))        (8)
The output of the concatenation operation is of size H × W × 2C; then the convolutional layers, named Conv(·) in (8), are employed to learn the subspace basis vectors [ν_1, ν_2, ν_3, ..., ν_γ]. Herein, we feed the high-level features X and the wavelet-based features X_P into the SBL module, so the high-level features are mapped into a signal subspace guided by the superior and robust wavelet-based features. For the subspace projection part, we transform the high-level feature maps X into the signal subspace through an orthogonal linear projection [46], which can be formulated as

X_S = V(V^T V)⁻¹V^T X        (9)

where (V^T V)⁻¹ is a normalization matrix that guarantees that the basis vectors [ν_1, ν_2, ν_3, ..., ν_γ] behave as an orthogonal basis under the consideration of the global structure information of the deep features. Herein, the reconstructed features X_S ∈ R^{H×W×C}. As observed from (8) and (9), the basis generation and subspace projection parts can be inserted into an end-to-end training pipeline.
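The orthogonal linear projection onto the learned subspace can be sketched as follows, assuming flattened features X ∈ R^{N×C} and a basis matrix V ∈ R^{N×γ} as in the text (random placeholders stand in for the learned quantities):

```python
import numpy as np

rng = np.random.default_rng(2)
N, C, gamma = 64, 16, 4       # N = H*W pixels, C channels, gamma basis vectors

X = rng.standard_normal((N, C))      # flattened high-level feature map
V = rng.standard_normal((N, gamma))  # stand-in for the learned basis vectors

# Orthogonal projection onto span(V): X_S = V (V^T V)^-1 V^T X, where
# (V^T V)^-1 is the normalization matrix mentioned in the text.
P = V @ np.linalg.inv(V.T @ V) @ V.T
X_S = P @ X

# The projector is idempotent: projecting twice changes nothing.
assert np.allclose(P @ X_S, X_S)
# The residual X - X_S is orthogonal to every basis vector.
assert np.allclose(V.T @ (X - X_S), 0.0, atol=1e-8)
```

In practice a numerically safer solve (e.g., `np.linalg.lstsq`) would replace the explicit inverse, but the explicit form mirrors equation (9) most directly.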

C. Multiconvolution Block and Loss Function
In the encoder stage, the multiconvolution block is utilized to acquire the multiscale high-level features, which is more powerful than the conventional convolutional layer. The multiconvolution block consists of three parts, i.e., multiconvolution layer, BN layer, and ReLU layer, as illustrated in Fig. 5.
As clearly displayed in Fig. 5, the multiconvolution layer [47] contains three parallel stages with different parameters (h, w, c, g), which denote the height, width, channel, and group number of the convolutional operation, respectively. Then, the calculation of the multiconvolution block is processed as

X_o = R(B(GConv_1(X_i) + GConv_2(X_i) + GConv_3(X_i)))        (10)

where GConv(·) indicates a group convolution and X_i is the input of the multiconvolution block. From (10), one can observe that this block generates multiscale deep features with three different convolutional parameter settings, similar to a pyramidal architecture. Additionally, the stride and padding of the convolutional operations must be set carefully to guarantee that the feature maps obtained by the different group convolutions have the same size. In general, the stride and padding are set to 1 and h//2 (the largest integer not greater than h/2), respectively. Different from the conventional convolutional layer, the computational cost and the number of parameters are reduced by the group convolution, and the model expression and feature generalization abilities are enhanced by the multiscale structure and the ReLU activation layer.
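The padding rule h//2 with stride 1 keeps the spatial size unchanged for any odd kernel, which is what allows the three parallel branches to be fused; a quick check:

```python
def out_size(n, k, stride=1, pad=None):
    """Output side length of a convolution over an n-wide input with a
    k-wide kernel; pad defaults to k // 2, the rule stated in the text,
    which preserves the size for odd kernels at stride 1."""
    if pad is None:
        pad = k // 2
    return (n + 2 * pad - k) // stride + 1

# The three branches (3x3, 5x5, 7x7) of the multiconvolution block all
# keep a 128x128 feature map at 128x128, so their outputs align.
for k in (3, 5, 7):
    assert out_size(128, k) == 128
```

The same formula also explains why an even kernel would break the alignment: k // 2 padding then shrinks or shifts the output, so all three branch kernels must be odd.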
Finally, the cross-entropy loss is adopted for training the proposed network:

Loss = −(1/D) Σ_{i=1}^{D} Σ_{j=1}^{E} s_ij log(t_ij)        (11)

where D and E denote the number of pixels and classes, respectively, s is the ground truth, and t is the estimated probability of our proposed WDSBLN.
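Assuming the standard pixel-wise cross-entropy form suggested by the symbols D, E, s, and t, the loss can be sketched as:

```python
import numpy as np

def pixelwise_cross_entropy(s, t, eps=1e-12):
    """Mean cross-entropy over D pixels and E classes:
    L = -(1/D) * sum_i sum_j s_ij * log(t_ij).
    eps guards against log(0) for confident wrong predictions."""
    D = s.shape[0]
    return -np.sum(s * np.log(t + eps)) / D

# Toy example: D = 2 pixels, E = 3 classes, one-hot ground truth.
s = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
t = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
loss = pixelwise_cross_entropy(s, t)
# A perfect prediction drives the loss to (near) zero.
assert pixelwise_cross_entropy(s, s) < 1e-6
```

With one-hot ground truth the double sum reduces to the negative log-probability assigned to the correct class at each pixel, averaged over the D pixels.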

D. Network Architecture
The detailed architecture of the proposed WDSBLN, including the multiconvolution and deconvolution parts, is illustrated in Table I.
Conv and ConvT denote a convolutional layer without group parameters and a deconvolution layer, respectively, and Layer is the multiconvolution layer shown in Fig. 5. In the encoder stage, one conventional convolutional layer and three multiconvolution layers are employed for feature learning and network training. The multiconvolution layers use different kernel sizes, similar to a pyramidal architecture, i.e., 3 × 3, 5 × 5, and 7 × 7. Additionally, the group parameter of the group convolutions grows from 1 to 8. The decoder module employs four deconvolution layers with 3 × 3 kernels to reconstruct the high-resolution feature map, and the output of the decoder stage is of size 128 × 128.

III. EXPERIMENTAL RESULTS AND ANALYSIS

A. Dataset
For our experiments, three real HR SAR land-cover classification datasets, acquired by TerraSAR-X and Sentinel-1B, are employed to evaluate the validity of our proposed WDSBLN, as shown in Fig. 6. The first TerraSAR-X dataset, called TerraSAR-X1, was obtained over the city of Lillestroem, Norway, in HH-polarization imaging mode with 0.38-m resolution; a subimage of size 3580 × 2250, including building, forest, grassland, road, river, and open land, is utilized for classifying the land cover of HR SAR images. The second TerraSAR-X dataset, named TerraSAR-X2, with 3600 × 3600 pixels acquired in horizontal-receive (HH)-polarization spotlight mode and consisting of water, forest, farmland, and buildings, is selected for the land-cover classification task. The Sentinel-1B dataset [48], of size 8149 × 5957, was acquired over Berlin, Germany, in interferometric wide mode with a 10.13-m GSD; this dataset has four land-cover categories, i.e., built-up, agricultural field, forest, and water body.

B. Experimental Setup
All experiments are implemented in the PyTorch framework on a machine with an Intel i9-11900K CPU at 3.50 GHz, an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory, and 64 GB of RAM. The training parameters are set as follows: the learning rate is fixed at 0.001, and the weight decay and momentum are set to 0.001 and 0.9, respectively. The batch size is 128, and the number of epochs is set to 200. The Adam (adaptive moment estimation) optimizer is utilized to train our proposed WDSBLN.
In the experiments, the three training and testing datasets for HR SAR image classification are produced via a sliding-window strategy with a window size of 128 × 128 and step sizes of 16, 32, and 32, respectively. The total numbers of image patches for the TerraSAR-X1, TerraSAR-X2, and Sentinel-1B data are 29 729, 12 769, and 46 805, respectively. We randomly split the image patches into 10%, 20%, and 30% for training, respectively, with the rest used for validation. Herein, we repeat the splitting of image patches and labels for training and validation 10 times to avoid the overfitting phenomenon.
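The patch totals follow from the usual sliding-window arithmetic; a sketch with an assumed 512 × 512 image (the real scenes are larger and the exact totals may involve boundary handling not specified in the text):

```python
def num_patches(height, width, window=128, step=16):
    """Number of patches a sliding window extracts (no padding; the
    window stops at the last position fully inside the image)."""
    rows = (height - window) // step + 1
    cols = (width - window) // step + 1
    return rows * cols

# Assumed 512x512 image, window 128, step 32:
# (512 - 128) // 32 + 1 = 13 positions per axis -> 13 * 13 = 169 patches.
assert num_patches(512, 512, window=128, step=32) == 169
```

Note that a step smaller than the window (16 or 32 versus 128 here) produces heavily overlapping patches, which is what inflates the patch counts well beyond the number of disjoint tiles.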

C. Parameter Analysis
1) Influence of the Convolutional Kernels in the Multiconvolution Part: The number of convolutional kernels in the multiconvolution part plays an important role in feature learning in the encoder stage. Herein, the TerraSAR-X1 dataset with a 20% training ratio is utilized to analyze this parameter. The overall accuracy (OA) and the experimental results are computed and listed in Table II. One can conclude that the performance of the convolutional block with multiple kernels is better than that with a single kernel because of the multiscale feature learning strategy. Moreover, the convolutional layer with a small kernel, i.e., 3 × 3, is more powerful than those with 5 × 5 and 7 × 7 kernels, by about 0.19% and 0.51%, respectively, as shown in the single-convolution experiments. This is mainly because a stack of 3 × 3 convolutional layers can capture local deep features well while keeping a suitable receptive field. Simultaneously, the combination of 3 × 3 and 5 × 5 convolutional kernels achieves better performance than the other two-kernel combinations. The multiconvolution block with convolutional kernels of sizes 3 × 3, 5 × 5, and 7 × 7 achieves the best performance of all the approaches.
2) Influence of the Dimension of the Compact Second-Order Feature Statistics: In the proposed CSFE module, the dimension B of the compact second-order feature statistics influences the improvement of the contour and edge information in the feature map. Fig. 7 gives the performance of HR SAR image classification on the TerraSAR-X1 dataset with 20% training samples, with B set to 1024, 2048, 4096, 8192, and 16384, respectively. Moreover, the OA, AA, and Kappa metrics are employed in this experiment.
As can be seen from Fig. 7, growing the dimension of the compact second-order feature statistics from 1024 to 8192 generally improves the OA, AA, and Kappa coefficient. Excessive compression of the dimension of the second-order feature statistics loses essential feature information, so both the feature distinguishability and the model accuracy decrease. Furthermore, B = 8192 and B = 16384 achieve the highest, and similar, scores. However, the larger the dimension, the higher the model computation. Therefore, B is set to 8192 in the following experiments, considering the tradeoff between classification accuracy and model complexity.

D. Performance of HR SAR Image Classification
1) Qualitative Analysis of HR SAR Images:
In this section, the three HR SAR image classification datasets, i.e., TerraSAR-X1, TerraSAR-X2, and Sentinel-1B, are employed to evaluate the effectiveness of our proposed WDSBLN. A 30% training ratio of each dataset is fed into the proposed network, and each experiment on the three HR SAR datasets is repeated ten times. At the same time, we compare the WDSBLN with Tiny-FCN (TFCN) [29], Tiny-Pyramid-UNet (TPUNet) [49], SegNet [31], convolutional-wavelet neural networks (CWNN) [39], and the statistical convolutional neural network (SCNN) [50] to demonstrate the classification performance. The details of these compared networks are as follows: TFCN and TPUNet are designed from the conventional FCN and U-Net, respectively, but differently, sequences of a 3 × 3 convolutional layer, a BN layer, and a ReLU layer are utilized to downsample the feature map to 8 × 8. For TPUNet, the 3 × 3 convolutional layer is replaced by the multiconvolution part with parameters 3 × 3, 5 × 5, and 7 × 7, the same as in the WDSBLN. The first 13 convolutional blocks of VGG-16 form the encoder part of SegNet. The CWNN, including four convolutional layers and two wavelet pooling layers, captures a final feature map of size 12 × 12 and then employs an FC layer for the classification results. Both a GAP layer and a global variance pooling layer are adopted in the SCNN. Additionally, three convolutional layers with channel numbers 12, 32, and 64 are migrated into the SCNN. Table III and Fig. 8 present the classification performance and maps for the TerraSAR-X1 dataset, respectively.
From Table III, it can be observed that our proposed WDSBLN achieves the highest scores among the compared methods and produces better classification performance. The OA, AA, Kappa, and IoU of our WDSBLN reach 96.29%, 96.66%, 94.65%, and 90.60%, respectively. Compared with the conventional deep learning networks based on the encoder-decoder architecture, i.e., TFCN, SegNet, and TPUNet, the proposed WDSBLN acquires more distinguishable feature statistics via the powerful wavelet module and the SBL module, gaining about 8.80%-21.85% and 13.60%-28.20% on Kappa and IoU, respectively, which proves the effectiveness of the PWM and SBL modules. The CWNN, which inserts a wavelet-constrained pooling layer, is a wavelet-based CNN and Markov random field (MRF) method for SAR image classification. Although the CWNN applies the wavelet algorithm to capture deep features, our WDSBLN shows better recognition ability on all of the objective evaluation metrics (2.61% higher OA, 3.03% higher AA, 3.76% higher Kappa, and 7.50% higher IoU). As for the SCNN, a representation learning and statistical-analysis-based method, its Kappa and IoU are 2.85% and 5.40% lower than those of the WDSBLN, but higher than those of the other four related methods. In summary, our proposed WDSBLN acquires more distinguishable features under speckle noise and improves the performance of HR SAR image classification. Fig. 8 depicts the classification maps of each compared method on the TerraSAR-X1 dataset. Almost all of the land-cover classes produced by the TFCN method have misclassified pixels; the SegNet and TPUNet methods effectively reduce this phenomenon because the backbone of SegNet is based on the conventional, more powerful VGG network. Simultaneously, skip connections and upsampling stages are inserted into TPUNet for SAR image classification, so TPUNet achieves higher performance than TFCN and SegNet.
CWNN and SCNN have fewer misclassified pixels, especially in forest and grassland; the likely reason is the power of the wavelet-constrained pooling layer and the statistical features of SAR images. Our WDSBLN acquires the optimal classification results compared with the other methods because of the special design of the powerful wavelet module and the SBL strategy. Subsequently, Table IV and Fig. 9 report the classification performance and maps for the TerraSAR-X2 dataset.
Fig. 9 illustrates the classification results on the TerraSAR-X2 dataset, from which it can be concluded that the proposed WDSBLN yields better visual results than the other five methods. Moreover, it can be observed that TFCN has many misclassified pixels in the forest and farmland areas, and similar results occur for the forest class with the SegNet method. TPUNet achieves segmentation results similar to SegNet, and a minority of "open area" pixels are misclassified as "forest," "road," or other classes by the TFCN, SegNet, and TPUNet methods. The CWNN and SCNN methods improve the classification results considerably because of the introduction of the wavelet pooling layer and the statistical feature learning module. Compared with all the related methods, the proposed WDSBLN produces the best visual effects: the boundaries are clear, and the classification accuracy of all land-cover classes is the highest. Therefore, our WDSBLN is able to greatly improve HR SAR image classification performance. Subsequently, the classification performance and maps for the Sentinel-1B dataset are given in Table V and Fig. 10.
The Sentinel-1B dataset is a very challenging SAR image classification dataset for which multispectral information [48] is usually utilized in land-cover classification. Herein, the OA (88.38%), AA (87.57%), Kappa coefficient (82.69), and IoU (82.69%) of the proposed WDSBLN are the best, outperforming the strongest competing method (SCNN) by 1.96% in OA, 0.38% in AA, 2.67% in Kappa, and 1.80% in IoU. Moreover, compared with CWNN, a wavelet-based CNN for SAR image segmentation, WDSBLN gains 3.73% and 3.00% in Kappa and IoU because of the introduction of the powerful wavelet module. As shown in Fig. 10, both TFCN and TPUNet have some misclassified points, especially in the built-up area, whereas SegNet, CWNN, and SCNN have fewer misclassified pixels there. Possible reasons are that, first, TFCN and TPUNet are built on plain 3 × 3 convolutional blocks, whereas SegNet utilizes the conventional and powerful VGG network for feature learning; second, CWNN and SCNN insert a wavelet pooling layer and a statistical feature learning module into the CNNs, which are more suitable for HR SAR image classification. Notably, WDSBLN produces the best visual effect with few misclassified pixels.

2) Classification Results Under Different Training Ratios:
Herein, we conduct experiments on the three HR SAR datasets under training ratios of 10%, 20%, and 30% to evaluate the stability of WDSBLN. The OA, AA, and Kappa are reported, as shown in Fig. 11.
It is noted that even when few training samples (i.e., 10%) are fed into WDSBLN, high OA, AA, and Kappa scores are obtained on all three real HR SAR land-cover classification datasets. The results also make evident that larger training ratios bring better classification results, since more training image patches are available than at low training ratios. Furthermore, the performances on the Sentinel-1B dataset are slightly lower than those on the TerraSAR-X1 and TerraSAR-X2 datasets; this is because Sentinel-1B is a very challenging SAR image classification dataset for which multispectral information is usually utilized in land-cover classification.

3) Inference Time and Analysis:
The inference-time experiments are conducted 10 times on the three HR SAR image classification datasets, as shown in Table VI. The values before and after ± indicate the average and standard deviation over the ten runs.
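The mean ± standard deviation protocol above can be reproduced with a small timing harness; a minimal sketch, assuming each run is one full inference pass over the test patches (`model_fn` and `inputs` are placeholders, not names from the article):

```python
import statistics
import time

def benchmark(model_fn, inputs, runs=10):
    """Time `runs` full inference passes and return (mean, std) in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        for patch in inputs:
            model_fn(patch)          # one forward pass per image patch
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)
```

`time.perf_counter()` is preferred over `time.time()` here because it is a monotonic, high-resolution clock intended for interval measurement.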
We can observe that TFCN incurs significantly less computational cost than the other methods because of its simple architecture, i.e., plain 3 × 3 convolutional blocks.

4) Feature Visualization:
Three HR SAR image patches, taken from the TerraSAR-X1, TerraSAR-X2, and Sentinel-1B datasets, respectively, are given in Fig. 12(a). The feature maps from the multiconvolution blocks, shown in Fig. 12(b), are not distinct and cannot restrain the speckle noise well. PWM, comprising end-to-end trainable wavelet decomposition and reconstruction modules, keeps the structures of the learned features well under speckle noise and achieves more distinct results than the multiconvolution block. The output of SBL, which cascades the PWM and SBL modules in WDSBLN, maintains the local structure of the HR SAR image patches well, as shown in Fig. 12(d), which proves the effectiveness of the proposed WDSBLN for HR SAR image classification.
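The structure-preserving behavior of SBL comes from its subspace projection step: feature vectors are projected onto a low-dimensional signal subspace spanned by learned basis vectors. In WDSBLN the basis is produced by a learned basis-generation branch; the sketch below instead obtains an orthonormal basis from an SVD of the features themselves, purely to illustrate the projection operation.

```python
import numpy as np

def subspace_projection(feat, k):
    """Project C-dimensional features at N positions onto a k-dim subspace.

    feat: array of shape (C, N). Returns the projection of every column
    onto the span of the top-k left singular vectors (an orthonormal basis).
    """
    u, _, _ = np.linalg.svd(feat, full_matrices=False)
    basis = u[:, :k]                   # orthonormal basis vectors, shape (C, k)
    return basis @ (basis.T @ feat)    # orthogonal projection onto span(basis)
```

Because the projection is onto the dominant singular directions, applying it twice changes nothing (idempotence), which is the property that lets the module suppress off-subspace noise while keeping the local feature structure.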

5) Ablation Study:
We conduct extensive ablation studies to investigate the effectiveness of the PWM and SBL modules in Fig. 13. Herein, the dimension of the compact second-order feature statistics B is set to 8192; the kernels of the multiconvolution block are 3 × 3, 5 × 5, and 7 × 7; and the group numbers are set to 1, 4, and 8. Additionally, 20% of the image samples of the three HR SAR datasets are employed as the experimental datasets. We can observe from the classification performances on the three real SAR classification datasets that PWM, comprising the Gabor wavelet decomposition layer, the CSFE block, the LFA block, and the wavelet reconstruction layer, achieves better OA, AA, and Kappa than the SBL module alone. The reason might lie in the special architecture of PWM: the CSFE block is designed to improve the contour and edge information, and the LFA block is adopted to aggregate the contextual information of the local channels and preserve detail information. Notably, PWM+SBL outperforms the other two variants, which illustrates that the combination of PWM and SBL can further improve HR SAR image classification performance. All of these results validate the effectiveness and robustness of PWM and SBL.
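The LFA block is described as a point-wise-convolution attention that reweights high-frequency channels. The article does not give its equations, so the sketch below is one plausible realization (ReLU bottleneck followed by a sigmoid gate; the weight shapes and the gating form are assumptions, not the authors' exact design). A 1 × 1 convolution is a per-pixel linear map over channels, so it reduces to a matrix product on the flattened spatial grid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_feature_attention(feat, w1, w2):
    """Hypothetical point-wise-convolution attention over channels.

    feat: (C, H, W) high-frequency feature map.
    w1: (C_mid, C) and w2: (C, C_mid) -- 1x1 conv kernels (biases omitted).
    """
    C, H, W = feat.shape
    x = feat.reshape(C, H * W)                       # flatten spatial grid
    attn = sigmoid(w2 @ np.maximum(w1 @ x, 0.0))     # per-pixel channel gate in (0, 1)
    return (x * attn).reshape(C, H, W)               # reweighted features
```

Since the gate lies in (0, 1), the block can only attenuate channel responses, which matches the stated goal of selectively preserving detail rather than amplifying noise.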

IV. CONCLUSION
In this article, an end-to-end trainable network following an encoder-decoder architecture, called WDSBLN, is proposed for HR SAR image classification. The network aims to capture robust feature statistics under speckle noise and maintain the local structure of HR SAR image patches. A compact second-order feature enhancement strategy and a local feature attention module are carefully designed for the different frequency components, improving the contour information and preserving detail information, respectively. Notably, the reconstructed feature map generated by PWM is employed as a guide in the subspace projection part of the SBL module, so the network maintains local structure information while considering the global information of the features. The experiments on three real HR SAR image classification datasets, i.e., the TerraSAR-X1, TerraSAR-X2, and Sentinel-1B datasets, indicate powerful classification results, and the ablation study clearly shows the effectiveness of the PWM and SBL modules. Although the proposed WDSBLN maintains the local structure of SAR image patches, processing images in slices may affect complete targets, and the PWM and SBL modules increase the computational complexity. Therefore, how to efficiently integrate the structural information of land cover into HR SAR image processing is a meaningful problem for future study.