Novel Convolutions for Semantic Segmentation of Remote Sensing Images

Networks applied to remote sensing image (RSI) semantic segmentation tasks must be capable of learning low-level features well. To capture accurate and abundant low-level semantic information, the early feature extraction layer is crucial to the whole network because all subsequent features are inferred from that base. To address the low-level feature extraction issue and overcome shortcomings of traditional convolution, such as too many parameters or a limited receptive field, some novel convolution units have been proposed in the literature. In this article, we propose two elaborately designed, portable, yet effective convolution units, i.e., directional convolution (DC) and large field convolution (LFC), combined as an extractor of low-level semantic features. DC is designed to extract directional features from specific directions, and LFC achieves a large receptive field with few parameters. Experimental results on two public datasets provide evidence that our convolution units help deep learning networks improve performance stably and comprehensively compared to the baseline networks.


I. INTRODUCTION
With the development of remote sensing techniques, high-resolution (HR) satellite images have become easy to access. We have reached a stage where abundant remote sensing images (RSIs) can be used to address societal needs. Currently, RSIs are widely used in fields such as urban planning [1], forestry detection [2], [3], meteorology forecast [4], agricultural monitoring [5], and even automatic driving assistance [6]. All these tasks are based on extracting useful information from RSIs. Hence, semantic segmentation, a technology that allows researchers to analyze images at the pixel level, is considered one of the most important fields in both computer vision and RSI interpretation [7].
The main purpose of semantic segmentation is to cluster the parts of an image that belong to the same object class, which means it can predict the precise semantic class of each pixel in an image [8]. Before the era of deep learning, researchers often chose support vector machines [9], [10], random decision forests [11], and Markov random fields [12] to implement RSI semantic segmentation tasks. These so-called traditional approaches make heavy use of domain knowledge. Although they still play an important role in the field of remote sensing, they cannot compete with deep learning algorithms nowadays. Since deep convolution neural networks (DCNNs) achieved striking results in computer vision, completely data-driven semantic segmentation techniques [13] have been developing rapidly for RSI processing. DCNN approaches, such as FCN [14], UNet [15], and the DeepLab series [16], [17], [18], almost dominate the field of RSI semantic segmentation. Subsequently, more and more mechanisms have appeared, such as self-attention [19] and the transformer [20]. Yet alongside these excellent networks and mechanisms stands a problem worth pondering thoroughly: how to capture abundant yet accurate low-level information in semantic segmentation tasks.
As Plato said, "The beginning is the most important part of the work." For deep learning-based semantic segmentation tasks, it is crucial that the beginning of the process, the low-level feature extractor, collects abundant yet accurate information. Accurate low-level information is the key basis of the subsequent inference of DCNNs [21]. In terms of RSIs specifically, this kind of image contains more directional features than the images taken by normal cameras used in segmentation, and there is a strong correlation between the semantic class of one pixel and those of its surrounding pixels. Besides, restricted by the limited receptive field of a normal convolution, the use of DCNNs may lead to failures in modeling contextual spatial relations [22]. From this perspective, it is not appropriate to apply DCNN semantic segmentation algorithms directly to RSIs. In other words, it is necessary to reform existing deep learning networks to fit this scenario better. To address the issue of low-level feature extraction and overcome the shortcomings of a normal convolution, we propose directional convolution (DC) and large field convolution (LFC) in this work. As their names suggest, DC is designed to capture directional semantic features, and LFC expands the receptive field of the original convolution. In addition, we design a low-level information extractor by combining these two novel convolutions, which can be conveniently embedded into current deep learning networks.
Our main contributions to this work are summarized as follows.
1) We design two novel convolutions, DC and LFC. One extracts abundant directional features from RSIs and the other expands the receptive fields of layers.
2) We propose a residual low-level information extractor combining the two novel convolutions, which can be used to improve the performance of existing networks.
3) We evaluate our method with the improved networks on the DLRSD and Vaihingen datasets for semantic segmentation, and achieve a comprehensive and stable boost compared to the baseline networks.

The rest of this article is organized as follows. In Section II, we introduce studies related to novel convolutions and RSI semantic segmentation. In Section III, the design and characteristics of our proposed method are introduced in detail, including DC, LFC, and the residual extractor. In Section IV, the datasets and experiment settings are described. Section V presents the experimental results, which verify the effectiveness and generalizability of our proposed method. Conclusions are drawn in Section VI.

II. RELATED WORK

A. Novel Convolutions
Convolution is the core of DCNNs and can be found everywhere in a network. However, normal convolution cannot always satisfy specific needs when researchers conduct experiments; therefore, novel convolutions are designed naturally [23]. Take transposed convolution for instance: it was proposed to realize a transformation going in the opposite direction of a normal convolution [24], and it is widely used in networks that need upsampling, such as UNet, to reform semantic information at the feature level. Another far-reaching novel convolution is dilated convolution [16], a convolution that can expand receptive fields and capture multiscale semantic information. It became known with the proposal of DeepLabV1, in which part of the network's convolutions are replaced with dilated convolutions. In MobileNet [25], depthwise separable convolution is proposed, a convolution now widely used in lightweight networks. Singh et al. [26] propose a heterogeneous convolution that decreases computing costs while maintaining accuracy. Han et al. [27] design a plug-and-play ghost convolution to reduce the computing resources required by the model.
These prior works have all left a great impact on both the academic and industrial communities, but research on novel convolutions seems to have slowed down recently. Our work on DC and LFC aims to help fill this gap.

B. Semantic Segmentation of HR RSIs
There are two notable characteristics of HR RSIs: they often cover a large area of land, and they are very clear. Under these circumstances, Mou et al. [22] point out that spatial relations matter a lot for RSI processing.
With the growing number of HR RSIs and the emergence of DCNNs and domain knowledge-guided networks [28], [29], [30], [31], [32], [33], [34], vast available data have led to great achievements in the field of HR RSI semantic segmentation. Nevertheless, no matter how much the training data increase and how deep network architectures go, the extraction of contextual information still remains a technical challenge for semantic segmentation. Many researchers attempt to overcome this hurdle by introducing spatial relations into DCNNs, using spatial propagation and graphical models [35], [36], [37], [38].

III. METHODS
Unlike the convolutions used in prior studies, two brand-new convolutions are designed and combined to extract low-level semantic information. Implementation details are introduced in this section.

A. Directional Convolution
The receptive field of a conventional convolution kernel is usually square, and the most common size is 3 × 3 pixels. However, a square receptive field has no directionality: it does not capture the various directional features in the image well, and redundant features may be extracted during the convolution process. To better perceive the various directional features, a direction-sensitive convolution is designed in this work, which is equivalent to filtering the image from different directions to perceive the gradient variation in each direction. Four regular types of DC are designed as the early layer of the networks to extract low-level semantic features from four directions, each with a different kernel shape. Specifically, two are the 0° and 90° convolution kernels, and the other two are the 45° and −45° ones. Each kernel has three weights corresponding to three pixel strips, as shown in Fig. 1.
For image semantic segmentation tasks, the extractor should obtain as much accurate boundary information as it can. The DCs are designed as a tool for searching boundary information from four directions in an image. Features perpendicular to the boundary direction and the gradient variation in that direction are what the network needs to learn, while features parallel to the boundary direction interfere with boundary extraction, and the gradient variation in that direction should be suppressed. To extract effective features and suppress irrelevant ones, each type of DC kernel contains only three weights per feature channel, and pixels in the same strip share the same weight. Since only one weight is applied to each strip, the gradient variation parallel to the kernel direction is smoothed and irrelevant features are suppressed. This design helps the network learn image features directionally and reduces the noise in irrelevant directions, which is shown to benefit low-level information extraction in the experiments.
The implementation of the 0° and 90° convolutions is shown in Fig. 2, and the steps are described below.
1) The input image is processed by an average-pooling operation with kernel size 1 × 5 (or 5 × 1), so the average of every five adjacent pixels in the horizontal (or vertical) direction is merged into one pixel.
2) The average-pooling result is multiplied by 5 to get the sum of the five pixel values.
3) A 3 × 1 (or 1 × 3) convolution is applied to the output feature map from step 2.

After this series of operations, we obtain the DC with a receptive field of 3 × 5 (or 5 × 3). Only three weights are introduced in the last step. The directions of the pooling and convolution operations are perpendicular to each other. The pooling and convolution strides are both 1 in steps 1 and 3, so the output feature map has the same size as the input.
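Assuming a PyTorch implementation, the three steps above can be sketched as follows; the module and parameter names are ours, not from the original code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DirectionalConv0(nn.Module):
    """Sketch of the 0-degree DC: a 3 x 5 receptive field realized with
    only three learnable weights per input/output channel pair."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Step 3: a 3 x 1 convolution, perpendicular to the pooling direction.
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=(3, 1),
                              padding=(1, 0), bias=False)

    def forward(self, x):
        # Steps 1-2: 1 x 5 average pooling with stride 1, then x5 to get the
        # sum of every five horizontally adjacent pixels.
        s = F.avg_pool2d(x, kernel_size=(1, 5), stride=1, padding=(0, 2)) * 5
        return self.conv(s)  # output keeps the input spatial size
```

The 90° variant would simply swap the pooling and convolution axes (5 × 1 pooling followed by a 1 × 3 convolution).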
The 45° and −45° convolutions share weights in diagonal strips, which makes them more complicated than the 0° and 90° ones. Geometrically, pixels in the image are aligned horizontally and vertically, so the three strips of pixels in the diagonal direction are staggered. To maintain the geometric symmetry of the convolution kernel, the number of receptive points in the three strips cannot be equal: we use five receptive points in the middle strip and four receptive points in each of the two side strips. Specifically, we implement these two convolution kernels efficiently through a well-designed combination of zero-padding, cropping, concatenation, and 1 × 1 convolution, and ensure that the weights can be updated via backpropagation. The steps are illustrated in Fig. 3 and described as follows.
1) Make five duplicates of the input feature map, named A–E.
2) Enlarge the duplicated feature maps by zero-padding. The padding strategy makes the pixels at the same position in A–E constitute a diagonal strip in the input feature map.

3) Add up A–D and offset the result by padding and cropping to obtain F and G, and add up A–E to obtain H.
4) F–H are concatenated, and then a 1 × 1 convolution is applied. As the result is 4 pixels larger than the original input, it is cropped by 4 (2 pixels on each edge).

The principle of the 45° convolution is formulated as follows. Let $f_{i,j}$ be the value at position $(i, j)$ in the input feature map. The values at the same position in F, G, and H are

$F_{i,j} = f_{i-1,j+2} + f_{i,j+1} + f_{i+1,j} + f_{i+2,j-1}$
$G_{i,j} = f_{i-2,j+1} + f_{i-1,j} + f_{i,j-1} + f_{i+1,j-2}$
$H_{i,j} = \sum_{k=-2}^{2} f_{i+k,\,j-k}.$

The output of the 1 × 1 convolution is

$O_{i,j} = w_1 F_{i,j} + w_2 G_{i,j} + w_3 H_{i,j}$

which is equivalent to filtering with the kernel

[ 0    0    0    w_2  w_3 ]
[ 0    0    w_2  w_3  w_1 ]
[ 0    w_2  w_3  w_1  0   ]
[ w_2  w_3  w_1  0    0   ]
[ w_3  w_1  0    0    0   ]

Through these steps, the diagonal 45° convolution is implemented, and the −45° convolution can be implemented similarly.
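An equivalent way to realize such a diagonal DC is to build the weight-shared 5 × 5 kernel directly from three learnable weights. The sketch below takes that route for a single-channel input; the class name and the exact strip offsets are our assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiagonalDC(nn.Module):
    """Sketch of a 45-degree DC via its equivalent 5 x 5 kernel: three
    learnable weights shared along three anti-diagonal strips (5 points
    in the middle strip, 4 in each side strip)."""
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.randn(3))

    def forward(self, x):  # x: (N, 1, H, W)
        k = x.new_zeros(5, 5)
        for r in range(5):
            for c in range(5):
                if r + c == 4:        # middle anti-diagonal strip, 5 points
                    k[r, c] = self.w[2]
                elif r + c == 5:      # one side strip, 4 points
                    k[r, c] = self.w[0]
                elif r + c == 3:      # other side strip, 4 points
                    k[r, c] = self.w[1]
        return F.conv2d(x, k.view(1, 1, 5, 5), padding=2)
```

Because the kernel is rebuilt from the three shared parameters on every forward pass, gradients flow back to exactly three weights, matching the weight-sharing design described above.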

B. Large Field Convolution
Although DC can extract directional features, it is far from enough to solve the spatial relation hurdle. The receptive field of DC is quite small. As there is plenty of semantic information scattered all over the image, DC is not able to learn large-scale context information, which may lead to poor performance with a lot of misclassified pixels in large-area segmentation. To solve this problem, we propose LFC, whose geometric structure is shown in Fig. 4(b) and (c).
Its inspiration comes from the experience of observing objects with human eyes: the object details at the sight focus are clear, while those far away from the sight focus are blurred. However, details both at and far away from the sight focus are important for object recognition. Features at the sight focus help identify object details, while features far away from the sight focus represent context information. For example, an object can be identified as a tree by focusing our eyes on a leaf and seeing the rest of the tree through peripheral vision. If there is only a single leaf and the rest of the tree cannot be perceived, the object can only be identified as a leaf instead of a tree. Based on a similar principle, combining context information and local details can improve resolution and accuracy for image segmentation.
LFC is based on the above principle. Fig. 4(a) shows a general 3 × 3 convolution kernel; (b) is a 9 × 9 LFC kernel, whose center has nine receptive points (yellow) in a 3 × 3 block, surrounded by eight receptive blocks (red). Each receptive block is 3 × 3, and its nine pixels share a common weight, so the kernel covers a receptive field of 9 × 9 pixels with only 17 weights. Similarly, (c) is a 27 × 27 LFC kernel, formed by surrounding kernel (b) with eight receptive blocks (yellow) of 9 × 9 pixels; such a large kernel has only 25 weights. Based on this geometric relation, the receptive field of LFC grows threefold at each level, i.e., the sizes are 9 × 9, 27 × 27, 81 × 81, and so on. In this kind of kernel, the closer to the center, the smaller the receptive point (block) is, which is in line with the experience of human eyes when observing objects: small receptive blocks are conducive to extracting local object details, while the surrounding large receptive blocks need not perceive rich details, only the whole object. Thus, LFC can effectively combine context information and local details for segmentation.
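The weight counts quoted above follow a simple pattern: nine center weights plus eight shared block weights per surrounding ring, while the side length triples per ring. A quick sanity check (function names are ours):

```python
def lfc_side(levels: int) -> int:
    """Side length of the LFC receptive field: triples per level (9, 27, 81, ...)."""
    return 3 ** (levels + 1)

def lfc_weights(levels: int) -> int:
    """Nine center weights plus eight shared weights per surrounding ring."""
    return 9 + 8 * levels

# levels=1 -> 9 x 9 kernel with 17 weights; levels=2 -> 27 x 27 with 25 weights
```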
In a general convolution operation, every pixel value in the receptive field is weighted with the kernel weight by each stride. If the input image is large and a large receptive field is required (such as in RSI segmentation), it is unwise to use a general large-scale kernel for the convolution.
The computation will increase sharply, with the number of parameters growing quadratically with the kernel size. A commonly used alternative is down-sampling by convolution or pooling with a stride greater than 1. However, such operations introduce new problems: the loss of pixel position information and the distortion of predictions. Furthermore, a large down-sampling rate degrades the image resolution, which causes much information loss. In some research, multiscale features are fused through concatenation or other operations. For example, shallow and deep features are fused through the contracting and expanding paths in UNet, but the number of feature map channels multiplies after concatenation, and the number of parameters also increases dramatically.
The implementation of LFC kernel is illustrated in Fig. 4(d). The steps to implement a 9 × 9 LFC kernel are briefly described as follows.
1) The input image is convolved with a 3 × 3 kernel.
2) The input image is processed by 3 × 3 average-pooling with stride 1 and then multiplied by 9.
3) 1 × 2, 2 × 1, and 2 × 2 convolutions with a dilation rate of 6 are applied to the output of step 2.
4) The results of steps 1 and 3 are summed.
Step 1 is to set the nine receptive points in the center of the LFC kernel.
Step 2 averages and fuses the nine pixels in each 3 × 3 block into one pixel. The 1 × 2 and 2 × 1 convolutions with a dilation rate of 6 in step 3 act on the result of step 2, which is equivalent to realizing the four receptive blocks on the four sides of the kernel. The dilation rate of 6 makes the receptive points correspond exactly to the centers of the four receptive blocks. The 2 × 2 convolution realizes the receptive blocks at the four corners.
After this series of steps, LFC has no hollows or gridding effects. In particular, the stride of LFC is set to 1, as in a general convolution, so the output feature map remains the same size as the input image. There is no resolution degradation, and the pixel features are well-preserved. This design greatly reduces the computation and improves the quality of segmentation.
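A PyTorch sketch of the four steps for a 9 × 9 LFC is given below; the module name is ours, and the padding values are chosen so the output keeps the input size, following the stride-1 design above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LFC9(nn.Module):
    """Sketch of a 9 x 9 large field convolution: a 3 x 3 center convolution
    plus dilated convolutions over 3 x 3 box sums for the eight blocks."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Step 1: ordinary 3 x 3 convolution for the nine center points.
        self.center = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)
        # Step 3: dilated convolutions on the pooled map; the dilation of 6
        # places the taps on the centers of the side and corner 3 x 3 blocks.
        self.side_h = nn.Conv2d(in_ch, out_ch, (1, 2), dilation=6,
                                padding=(0, 3), bias=False)
        self.side_v = nn.Conv2d(in_ch, out_ch, (2, 1), dilation=6,
                                padding=(3, 0), bias=False)
        self.corner = nn.Conv2d(in_ch, out_ch, 2, dilation=6,
                                padding=3, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)  # batch norm after LFC, as advised

    def forward(self, x):
        # Step 2: 3 x 3 average pooling (stride 1) times 9 = 3 x 3 box sums.
        p = F.avg_pool2d(x, 3, stride=1, padding=1) * 9
        # Step 4: sum the center response and the block responses.
        return self.bn(self.center(x) + self.side_h(p)
                       + self.side_v(p) + self.corner(p))
```

Per input/output channel pair, this module carries 9 + 2 + 2 + 4 = 17 weights, matching the weight count of the 9 × 9 LFC kernel.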
The principle of LFC is formulated as follows. Let $f_{i,j}$ be the value at position $(i, j)$ in the input feature map. The output of the 3 × 3 average-pooling with stride 1, after being multiplied by 9, is

$P_{i,j} = \sum_{m=-1}^{1} \sum_{n=-1}^{1} f_{i+m,\,j+n}.$

The output of the 3 × 3 convolution is

$O_{3\times3} = y_1 f_{i-1,j-1} + y_2 f_{i-1,j} + y_3 f_{i-1,j+1} + y_4 f_{i,j-1} + y_5 f_{i,j} + y_6 f_{i,j+1} + y_7 f_{i+1,j-1} + y_8 f_{i+1,j} + y_9 f_{i+1,j+1}.$

The outputs of the 1 × 2 and 2 × 1 convolutions with a dilation rate of 6, applied to $P$, are

$O_{1\times2} = w_4 P_{i,j-3} + w_6 P_{i,j+3}$
$O_{2\times1} = w_2 P_{i-3,j} + w_8 P_{i+3,j}.$

The output of the 2 × 2 convolution with a dilation rate of 6 is

$O_{2\times2} = w_1 P_{i-3,j-3} + w_3 P_{i-3,j+3} + w_7 P_{i+3,j-3} + w_9 P_{i+3,j+3}.$

After the summation of $O_{3\times3}$, $O_{1\times2}$, $O_{2\times1}$, and $O_{2\times2}$, the final result is equivalent to filtering by the following 9 × 9 kernel:

[ w_1 w_1 w_1 w_2 w_2 w_2 w_3 w_3 w_3 ]
[ w_1 w_1 w_1 w_2 w_2 w_2 w_3 w_3 w_3 ]
[ w_1 w_1 w_1 w_2 w_2 w_2 w_3 w_3 w_3 ]
[ w_4 w_4 w_4 y_1 y_2 y_3 w_6 w_6 w_6 ]
[ w_4 w_4 w_4 y_4 y_5 y_6 w_6 w_6 w_6 ]
[ w_4 w_4 w_4 y_7 y_8 y_9 w_6 w_6 w_6 ]
[ w_7 w_7 w_7 w_8 w_8 w_8 w_9 w_9 w_9 ]
[ w_7 w_7 w_7 w_8 w_8 w_8 w_9 w_9 w_9 ]
[ w_7 w_7 w_7 w_8 w_8 w_8 w_9 w_9 w_9 ]

Similarly, other receptive fields such as 27 × 27 or even larger can be achieved by setting the average-pooling kernel size and the dilation rate of the convolutions in the module. To further reduce computation, an average-pooling operation with a kernel larger than 3 × 3 pixels (for example, 9 × 9) can be replaced by dilated average-pooling applied directly to the output of the previous average-pooling. When the dilation rate is set to 3, the output is equivalent to that of 9 × 9 average-pooling. If the input image is pooled by a 9 × 9 kernel directly, each stride requires summing $9^2 = 81$ values; with this cascading trick, only nine summations are needed, a reduction of 8/9. Kernels with larger receptive fields can be implemented by extending the module structure along the branch marked "For bigger receptive field" in Fig. 4. When the side length of the receptive field is tripled, the number of parameters increases by only 8. The LFC layer should be followed by a batch normalization layer for effective training convergence.
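The cascading trick can be checked numerically: a 3 × 3 sum pooling followed by another 3 × 3 sum pooling with dilation 3 equals a direct 9 × 9 sum pooling (valid positions only). The `box_sum` helper below is ours, written just for this check.

```python
import torch
import torch.nn.functional as F

def box_sum(x, k, dilation=1):
    """k x k sum pooling with stride 1 and no padding, via an all-ones kernel."""
    ones = x.new_ones(x.shape[1], 1, k, k)
    return F.conv2d(x, ones, dilation=dilation, groups=x.shape[1])

x = torch.randn(1, 1, 32, 32)
direct = box_sum(x, 9)                           # 81 additions per position
cascade = box_sum(box_sum(x, 3), 3, dilation=3)  # 9 + 9 additions per position
assert direct.shape == cascade.shape == (1, 1, 24, 24)
assert torch.allclose(direct, cascade, atol=1e-4)
```

The second pooling's taps sit 3 pixels apart, so each tap picks up one non-overlapping 3 × 3 partial sum, and the two stages together cover the full 9 × 9 window.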
Based on the principle of LFC, a large number of pixel values are summed and then given shared weights. For example, when the receptive field is set to 81 × 81, the size of the outermost receptive block is 27 × 27, which requires summing 27² = 729 pixel values, so the results can be very large. Such outputs are severely out of balance as inputs to the next layer, and training becomes unstable and hardly converges. After being processed by batch normalization, the output of LFC approximately follows a standard normal distribution, and the gradient vanishing problem is avoided.

C. Residual Extractor
The two proposed convolution units each have their own strengths. To find out how to integrate them as the feature extractor in a deep network, exhaustive experiments are carried out, and the results show that the best way is to place the two units at the very beginning of the network. Furthermore, inspired by the pioneering work of He et al. [29] on residual networks, we adopt the residual structure to construct a residual extractor, which can be used to replace the beginning convolution layers for low-level feature extraction.
The comparison of network structures between UNet and UNet with the residual extractor (UNet + DC and LFC) is illustrated in Fig. 5. The input of the latter is first convolved with DC, with the number of channels increasing from 3 to 64. Then, the output of DC is convolved with LFC. The results of DC and LFC are added and fed into the next layer of the network. As shown in Fig. 5, the low-level feature extracting layer of UNet is replaced by our convolution units, and this concise design enables researchers and developers to use the extractor in their own research easily.
As mentioned above, there are four types of DC kernels, which are combined in the residual extractor. As Fig. 6 presents, the four convolutions are performed in parallel, each producing 16-channel output. These outputs are concatenated in the channel dimension, resulting in a feature map of 64 channels. Before being fed into the LFC layer, the feature map goes through a ReLU function and then a BatchNorm function.
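A structural sketch of the residual extractor is given below. The DC and LFC layers are replaced here by plain convolution stand-ins so the sketch stays self-contained; the class name, stand-ins, and channel choices beyond those stated in the text are our assumptions.

```python
import torch
import torch.nn as nn

class ResidualExtractor(nn.Module):
    """Sketch of the residual low-level extractor: four parallel DC branches
    (16 channels each), concatenation to 64 channels, ReLU then BatchNorm,
    an LFC layer, and a residual addition of the DC and LFC outputs."""
    def __init__(self, in_ch=3):
        super().__init__()
        # Stand-ins: each branch would be one DC kernel (0, 90, 45, -45 deg).
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, 16, 3, padding=1, bias=False) for _ in range(4)])
        self.relu = nn.ReLU(inplace=True)
        self.bn = nn.BatchNorm2d(64)
        # Stand-in for the LFC layer described in Section III-B.
        self.lfc = nn.Conv2d(64, 64, 3, padding=1, bias=False)

    def forward(self, x):
        dc = torch.cat([b(x) for b in self.branches], dim=1)  # 4 x 16 -> 64 ch
        out = self.lfc(self.bn(self.relu(dc)))
        return dc + out  # residual combination, fed to the next layer
```

Swapping the stand-ins for real DC and LFC modules would yield the extractor used to replace the first convolution of UNet, FCN, or SegNet.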
This residual extractor allows developers to modify a network freely as long as its low-level information extracting layer uses normal convolution, as UNet does. In the following experiments, FCN and SegNet are modified in the same way as UNet. The main idea of the modification is simply to replace the very first convolution of the network.

IV. DATASETS AND EXPERIMENT SETTINGS
In order to assess the generalizability and effectiveness of the proposed convolution units and extractor, two very-HR (VHR) RSI datasets are used in the semantic segmentation tasks, i.e., the DLRSD dataset and the Vaihingen dataset. Samples from the two datasets are shown in Fig. 7.
A. Descriptions of Datasets

1) DLRSD: Images of the dense labeling remote sensing dataset (DLRSD) come from the UC Merced archive [39] and were densely labeled by Shao et al. [40]; the spatial resolution is 0.3 m. There are 21 categories in the UC Merced dataset, and each category contains 100 images of the same size of 256 × 256 pixels. The 2100 images were labeled into 17 semantic classes, and the pixel distribution per class is shown in Table I. We randomly selected 80% of the data from each category as training samples and the remaining 20% as validation samples.
2) Vaihingen: The Vaihingen [41] dataset consists of 33 aerial images of Vaihingen city, with a spatial resolution of 0.09 m and an average size of 2494 × 2064 pixels. The dataset is composed of five foreground object classes, i.e., 1) car; 2) tree; 3) low vegetation; 4) building; and 5) impervious surface, and one background class. Sixteen of the 33 images were randomly selected as training samples and the remaining 17 as validation samples.

B. Experiment Purpose
Three deep learning frameworks widely used in remote sensing semantic segmentation are selected as the baseline models, i.e., SegNet [42], fully convolutional networks (FCN [14]) with a ResNet backbone, and UNet [15]. We modify these three frameworks by replacing some layers in the networks with our residual extractor, similar to Fig. 5, to find out how much our method can improve the baselines.

2) Preprocess of Data: Due to the memory limitation of the GPU, images from the Vaihingen dataset are first resized to 2560 × 2560 pixels, and each image is then cropped into 100 patches of 256 × 256 pixels. No data augmentation is used during training.
3) Implementation Details: Adam is chosen as the optimizer with a learning rate of 0.01, β1 = 0.9, and β2 = 0.99. Different batch sizes and numbers of epochs are assigned to the different baselines. For SegNet, FCN, and UNet, the batch sizes are 16, 18, and 12, respectively, when training on the DLRSD dataset, and 14, 18, and 14, respectively, when training on the Vaihingen dataset. The training runs for 140 and 100 epochs on DLRSD and Vaihingen, respectively. The modified models are referred to as SegNet + DC and LFC, FCN + DC and LFC, and UNet + DC and LFC. The kernel size of LFC is set to 9 × 9 throughout the experiments.

D. Evaluation Metrics
To evaluate the improvement of the baseline networks modified with our convolution units, the F1 score and mean intersection over union (mIoU) are obtained. Furthermore, precision and recall are also used for the comparison between the modified networks and recently published methods.
The F1 score is formulated as

$F1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$

Precision and recall are defined as

$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}$

in which TP, FP, and FN represent the true positives, false positives, and false negatives for each class, respectively. The more widely used metric in recent years, mIoU, is formulated as

$\text{mIoU} = \frac{1}{k} \sum_{i=1}^{k} \frac{TP_i}{TP_i + FP_i + FN_i}$

where k is the number of classes, and $TP_i$, $FP_i$, and $FN_i$ represent the true positives, false positives, and false negatives of the ith class.
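These metrics can be computed directly from per-class counts. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def segmentation_metrics(pred, target, k):
    """Mean F1 and mIoU from flat integer label arrays with classes in [0, k)."""
    f1, iou = [], []
    for c in range(k):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1.append(2 * precision * recall / (precision + recall)
                  if precision + recall else 0.0)
        iou.append(tp / (tp + fp + fn) if tp + fp + fn else 0.0)
    return float(np.mean(f1)), float(np.mean(iou))
```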

V. RESULTS AND DISCUSSION
To prove the generalization ability and effectiveness of our portable convolution units, we focus on the improvement achieved by adding them to classic deep learning networks; therefore, no data augmentation is applied. Our method demonstrates consistent effectiveness on three classic networks and two datasets. Furthermore, the networks embedded with our units surpass some excellent recently published networks on the DLRSD dataset.
This section consists of four parts: improvement of network performance, comparison with recent networks, an ablation study, and low-level feature analysis.

A. Improvement of Network Performance
The experimental results are listed in Tables II–V. The F1 score and mIoU are used for numerical analysis. These results provide strong evidence that the proposed modules help the backbone networks extract more accurate semantic information on the experimental datasets.
Let us first look at the experiments on the DLRSD dataset. As shown in Table II, SegNet, UNet, and FCN obtain mIoU increments of 2.13%, 1.63%, and 1.53%, respectively. From Table III, SegNet, UNet, and FCN improve the mean F1 score by 2.07%, 1.16%, and 1.04%, respectively. As for fine-grained per-class improvement, on average over 13 of the 17 classes are not inferior to the original models in terms of mIoU for each modification; for the F1 score, the number is approximately 14. A visual comparison on the DLRSD dataset for each baseline network is illustrated in Fig. 8. In the FCN comparison (top image group), FCN + DC and LFC predicts the sea class much better than FCN does, reflecting Table II, where FCN + DC and LFC exceeds FCN by 6% for the sea class. Another interesting observation is the many minute dots in the car class predicted by UNet (bottom group), which are absent from the prediction of UNet + DC and LFC; this is consistent with the statistics in Table II, where UNet + DC and LFC exceeds UNet by 1% for the car class.
Turning to the Vaihingen dataset, Tables IV and V compare the original and modified networks in terms of mIoU and F1 score. As in the experiments on the DLRSD dataset, our modules make comprehensive improvements on both indicators. Compared to the baseline models, the mIoU scores increase by 2.55%, 0.03%, and 1.38%, and the mean F1 scores improve by 2.19%, 0.16%, and 1.26%. As for fine-grained per-class improvement, approximately 4.7 on average of the five classes are not inferior to the original models in terms of mIoU for each modification; for the F1 score, the number is also 4.7. A visual comparison on the Vaihingen dataset for each baseline network is illustrated in Fig. 9, where our methods show much less noise.
The fact that modifying the baseline networks with the DC and LFC units consistently yields a significant improvement, both on the indicators and in the visual comparison, provides strong evidence of their effectiveness. The all-around gains on both datasets and all three baseline models also show that these portable units generalize well.

B. Comparison With Recent Networks
Networks enhanced with our modules are not only more precise than the original ones but also surpass, on the DLRSD dataset, some excellent networks that have emerged in recent years. FGC [43], Deeplab v3+ [44], DHN [45], and UNet-Att [46] are the models we compare our method to. Due to the limited unified indicators presented in previous studies, mean precision and mean recall over all classes are selected for measurement, and the results are shown in Table VI. UNet + DC and LFC outperforms the second-best UNet-Att by over 6% in precision and nearly 1% in recall.

C. Ablation Study
Since two convolution units are proposed in this work, we need to figure out which part helps more. Moreover, as there are four types of DC kernels, their various combinations should be evaluated. Besides, LFC with 9 × 9 and 27 × 27 receptive fields will be compared, as will the LFC unit against normal convolution. SegNet is used as the experimental baseline.
1) Directional Convolution: To find out the effect of each DC kernel, an ablation analysis is conducted, and the results on Vaihingen and DLRSD are presented in Tables VII and VIII, respectively. After being processed with DCs of certain direction combinations, such as 45° and −45° or 0° and 90°, the mIoU improves significantly, but it still does not match SegNet with the full DC of all four directions, in terms of either mIoU or the sum of mIoU and mF1. Not surprisingly, when convolutions of all four directions are combined, the mIoU on Vaihingen and DLRSD both reach the top scores: 60.86% (an increment of 2.39%) for Vaihingen and 48.43% (an increment of 1.40%) for DLRSD, compared to the baseline model. This result confirms that the complete DC module works best.
2) Large Field Convolution: Intuitively, LFC with a larger receptive field can gain more semantic context information, but accuracy is lower due to the weight-sharing design. We conduct experiments on two receptive fields, i.e., 9 × 9 and 27 × 27, to determine which is more appropriate for this task, and the results on Vaihingen and DLRSD are shown in Tables IX and X, respectively. On Vaihingen, each of the two receptive fields has an advantage in one indicator. However, the 9 × 9 receptive field has a more obvious advantage in the mIoU indicator, and its computation cost is much smaller. Besides, only the 9 × 9 receptive field yields an improvement on DLRSD. Therefore, the 9 × 9 receptive field is adopted when applying LFC in our experiments.
3) Combinations of DC, LFC, and Residual Block: As shown in Tables XI and XII, an exhaustive ablation study on the combinations of DC, LFC, and the residual block is made on the two datasets. It should be noted that rows with double checkmarks in Tables XI and XII represent a parallel network design. Although both DC and LFC can improve the results, the best way to use them is to combine the two, with DC first and LFC following, connected by a residual block.

4) LFC Versus Normal Convolution:
Although LFC achieves a significant increment, we are not sure whether a normal 9 × 9 convolution could achieve the same. Therefore, another ablation study is conducted, and the results are presented in Tables XIII and XIV. LFC with the 9 × 9 receptive field reaches the top score on both datasets.

D. Low-Level Feature Analysis
To establish a more intuitive understanding of DC and LFC, low-level feature visualization on DLRSD is presented in Fig. 10, with UNet + DC and LFC selected as the test network.
As mentioned above, the network has a DC unit first, followed by an LFC unit. The figures in the Directional Conv column are therefore feature maps output by DC, and those in the Large Field Conv column are output by LFC, which follows DC. As presented in Fig. 10, although DC alone can separate subjects from the background, the effect is much better when the DC features are further processed by an LFC unit. All the cars and grass in the Large Field Conv image of the lower group are highlighted, while those of Directional Conv are not, which again shows that DC followed by LFC works best.

VI. CONCLUSION
In this article, we propose two novel convolution units and a plug-and-play low-level feature extractor that can fit directional features better and alleviate the spatial relation hurdle in RSI segmentation. For extracting directional features, we propose DC to implement convolution from four specific directions. For the spatial relation hurdle, we propose LFC with a scalable receptive field and very few weights. To combine these two convolution units, we introduce a residual extractor as the early extracting layer of a network. We evaluate our method on two public datasets, and the experimental results show that our novel convolution design can improve the original networks reliably and comprehensively. The ablation study demonstrates the effectiveness of the design strategies of the two units. In future work, we will extend our convolution units to more tasks within RSI analysis, such as instance segmentation and object detection.