Introduction
Agricultural plastic greenhouses, as emerging protected cultivation facilities, are designated as production facility land within agricultural land and are clearly distinguished from construction land. In recent years, however, some individuals have illegally constructed "greenhouse houses" on arable land, including basic farmland, under the pretext of developing facility agriculture, or have built "private estates" inside greenhouses. These practices seriously violate the red line for arable land and pose significant risks to land resources and food security. Accurate and timely identification and monitoring of the location and spatial distribution of agricultural greenhouses have therefore become important tasks for natural resource management departments in curbing the "non-agriculturalization" of arable land.
Due to the concealed appearance and scattered distribution of agricultural greenhouses, manual methods are inadequate for rapid identification and monitoring. As a result, many scholars have turned to remote sensing technology to detect and extract agricultural greenhouses and mulched farmland, combining optical remote sensing images of different spatial resolutions with deep learning models to conduct large-scale greenhouse monitoring [1], [2], as shown in Fig. 1.
Fig. 1. Overall workflow: images are acquired by remote sensing satellites, split into tiles, and fed into the model for segmentation.
Greenhouse segmentation still faces numerous challenges. Traditional segmentation methods fail to achieve the accuracy required for practical applications. Deep learning methods have significantly improved segmentation accuracy, but they struggle to suppress interference from urban features that resemble greenhouses. In addition, greenhouse datasets are extremely scarce, which makes it difficult to meet the demands of training deep learning models at scale [3], [4]. Resolving these problems is the key determinant of a model's effectiveness.
To address the issue of greenhouse detection, we obtained remote sensing images of land use from satellites [5] and cropped them into 512×512 pixel tiles to construct a training dataset. Considering the limited number of effective samples in the dataset, we applied a series of data augmentation techniques [6], [7] from various perspectives to make the dataset more suitable for model training. Segmentation results were then obtained with FADNet, which consists of an encoder and a decoder.
To better address the challenges of detecting small objects and distinguishing urban construction areas from agricultural greenhouses, we propose a new semantic segmentation architecture, FADNet, which combines deformable convolution, a feature pyramid, and an attention mechanism. We made targeted improvements to both the encoder and decoder stages to enhance the accuracy of small object and edge detection. In the encoder, FADNet refines multiscale semantic information through three modules. Specifically, it uses an attention mechanism to extract global features and enhance semantic information acquisition; it utilizes deformable convolutions to extract irregular features and improve the collection of edge information; and it employs an improved feature pyramid (ASPP) to fuse multiscale information and increase the weight of low-level features in the feature representation, enabling FADNet to pay more attention to small objects and improve their segmentation. In the decoder, we improve the upsampling process to better integrate low-level feature information.
This article makes the following key contributions.
We manually construct a dataset for training the greenhouse segmentation task and apply data augmentation from multiple perspectives to meet the requirements of model training.
We propose a novel semantic segmentation network, FADNet, designed specifically for greenhouse segmentation to capture small object and edge information and to distinguish between urban and rural areas. FADNet incorporates the SimAM attention mechanism for semantic information retrieval and uses an improved feature pyramid network for multiscale feature fusion. In addition, an improved deformable convolution is employed to capture more edge information and relevant features.
The rest of this article is organized as follows: Section II provides an overview of relevant literature on greenhouse segmentation. Section III presents the technical background of the relevant modules. Section IV focuses on the description of the modules used in this article. Section V presents the experimental results and the related experimental environment. Finally, Section VI concludes this article.
Related Work
Currently, there are several methods for greenhouse detection. One approach is based on traditional methods using machine learning and mathematical statistical techniques. Another approach involves using convolutional neural networks (CNNs) for feature extraction and greenhouse segmentation.
A. Traditional Method
Traditional methods mainly use machine learning techniques to extract information, such as color and structure, from greenhouse images and classify it with support vector machines [8], [9]. However, these features have limited generalization and representational ability, which ultimately leads to suboptimal classification results. To overcome these limitations and further improve classification accuracy, some researchers began to use the texture features of plastic greenhouses to supplement the spectral features [10].
B. CNN-Based Models
Deep learning [8], [11] has experienced rapid development in the field of computer vision and has achieved remarkable results in image segmentation [12]. In particular, the successful application of fully convolutional networks (FCNs) [13] to greenhouse detection has significantly outperformed traditional image segmentation methods in accuracy [14]. However, considerable information is lost during repeated convolution, so details of small targets are lost during segmentation. In addition, such segmentation methods produce only blurry contours and fail to accurately delineate edge details [15]. To solve this problem, encoder-decoder architectures were proposed that directly integrate multilevel semantic features and spatial information [16], [17], such as U-Net [18] and RefineNet [19]. However, after multiple layers of feature extraction, image detail features are severely degraded, making accurate segmentation of small targets challenging. Furthermore, because convolutional networks extract features with local kernels, relational features between different regions are lost, leading to unreasonable segmentation results, such as confusing plastic films used for dust prevention in urban areas with greenhouse plastic films on agricultural land.
Preliminaries
A. U-Net
The U-Net [20] was proposed by Olaf Ronneberger, Philipp Fischer, and Thomas Brox, and was initially used for medical image segmentation tasks. The architecture consists of an encoder and a decoder, with skip connections to facilitate information flow between them and improve segmentation performance.
The encoder section comprises convolutional and pooling layers, which progressively extract features and contextual information from the input image. The decoder section is symmetric to the encoder and consists of convolutional and upsampling layers. Through repeated upsampling, the decoder gradually restores the spatial dimensions of the image and establishes skip connections with the corresponding encoder features [21]. The purpose of skip connections is to pass low-level detail information to the decoder, aiding in the recovery of details and improving segmentation results [22]. The skip connections in U-Net are implemented by concatenating the feature maps from the encoder with the corresponding feature maps in the decoder. This combines the contextual information captured by the encoder with the detail information restored by the decoder, resulting in improved image segmentation.
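To make the skip connections concrete, the following is a minimal two-level U-Net sketch in PyTorch; the channel sizes are illustrative, not the configuration used in this article:

```python
import torch
import torch.nn as nn

def double_conv(c_in, c_out):
    # Two 3x3 convolutions with BatchNorm and ReLU, as in U-Net blocks.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """A two-level U-Net illustrating concatenation-based skip connections."""
    def __init__(self, c_in=3, c_out=1):
        super().__init__()
        self.enc1 = double_conv(c_in, 64)
        self.enc2 = double_conv(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = double_conv(128, 64)   # 128 = 64 (skip) + 64 (upsampled)
        self.head = nn.Conv2d(64, c_out, 1)

    def forward(self, x):
        s1 = self.enc1(x)                         # low-level features kept for the skip
        x = self.enc2(self.pool(s1))              # deeper, coarser features
        x = self.up(x)                            # restore spatial resolution
        x = self.dec1(torch.cat([x, s1], dim=1))  # skip connection by concatenation
        return self.head(x)
```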
B. Attention Mechanism
The attention mechanism [23], [24], [25] was initially proposed to enhance the model's focus on different parts of the input and is primarily applied in natural language processing tasks. An attention mechanism typically has three main components: 1) query, 2) key, and 3) value. The query represents the content the model is currently focusing on, while the key and value represent different parts of the input data. Attention weights are obtained by computing the correlation between the query and the key; these weights are then used to weight the values, producing the final attention representation. This mechanism allows the model to selectively focus on specific parts of the input data according to task requirements, thereby improving model performance. The main approach employed in this article is the spatial attention mechanism.
Spatial attention [26] focuses on the spatial dimension of the input data, such as different locations in an image or feature map. It assigns different attention weights to different locations by calculating the similarity or correlation between them. This enables the model to selectively focus on specific regions or positions in the image, enhancing its perception of spatial structure.
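As a concrete illustration (generic scaled dot-product attention, not the parameter-free SimAM used later in this article), the positions of a feature map can be flattened into a sequence and attended over spatially; all shapes and names below are illustrative:

```python
import torch

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (batch, seq_len, d). Returns attention-weighted values."""
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d ** 0.5  # query-key correlation
    weights = torch.softmax(scores, dim=-1)      # attention weights per position
    return weights @ V

feat = torch.randn(2, 64, 16, 16)                  # (B, C, H, W) feature map
seq = feat.flatten(2).transpose(1, 2)              # (B, H*W, C): positions as tokens
out = scaled_dot_product_attention(seq, seq, seq)  # spatial self-attention
```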
C. Deformable Convolution
Deformable convolution [27], [28] is an improved convolution operation that aims to better adapt to object deformations and movements. Traditional convolution operations use kernels of fixed shape and size, which may overlook fine-grained changes. In contrast, deformable convolution adjusts the sampling positions of the convolution kernel based on the content of the input feature map by introducing learnable offset parameters.
In deformable convolution, the sampling positions of each convolution kernel are controlled by corresponding offset parameters. These offsets are learned during training and dynamically adjust the sampling positions of the kernel based on the contextual information of the input feature map. This allows deformable convolution to better capture local details and deformation information of the target object, enhancing the model's ability to handle complex scenes.
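A brief sketch of this mechanism using torchvision's DeformConv2d, in which a small convolution predicts an (x, y) offset for each of the nine sampling points of a 3×3 kernel (layer sizes are illustrative):

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

x = torch.randn(1, 64, 32, 32)
# A small conv predicts a (dx, dy) offset for each of the 3x3 = 9 sampling points.
offset_pred = nn.Conv2d(64, 2 * 3 * 3, kernel_size=3, padding=1)
deform = DeformConv2d(64, 64, kernel_size=3, padding=1)

offsets = offset_pred(x)  # offsets depend on the input content
y = deform(x, offsets)    # sampling grid shifts independently at each position
print(y.shape)            # torch.Size([1, 64, 32, 32])
```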
Methods
In this section, we briefly summarize our research and describe the components of FADNet in detail. FADNet builds on the U-Net architecture and comprises three main components: the DA module, which combines the SimAM attention mechanism with convolution; the deformable module, which uses deformable convolution for edge feature extraction; and the ASPP module [29], which replaces plain skip connections with a feature pyramid. By integrating these three modules, FADNet effectively addresses the challenges of large-scale segmentation tasks.
A. FADNet
The FADNet proposed in this article is an improvement on the U-Net architecture. Specifically, we introduced the DA module to replace the first, third, and fifth layers of the encoder, and utilized the deformable module to replace the second, fourth, and sixth layers of the encoder. The same improvements were applied to the decoder as well [30], [31]. The purpose of these improvements was to enhance the performance and accuracy of FADNet. The DA module dynamically adjusts the computation of attention weights, allowing FADNet to focus more on regions of interest, particularly the greenhouse areas. The deformable module is able to better capture deformations and fine details in the images.
Furthermore, in the skip connections, the original model fused features with a simple addition followed by pooling, which we believed could lead to feature loss. We therefore introduced the ASPP module into the skip connections to fuse the input features at different scales. This regulates the relative importance of low-level features during upsampling while reducing the loss of feature information. Through backpropagation, FADNet autonomously learns the magnitudes of the relevant weights and thus makes better use of the encoder features. With the feature spatial pyramid module, the skip connections can flexibly fuse features between the encoder and decoder: the convolution operation learns specific weights to blend encoder and decoder features appropriately, preserving and exploiting useful feature information. This fusion helps FADNet better understand the contextual information of the image and improves the accuracy of image segmentation. The overall model architecture is illustrated in Fig. 2.
Fig. 2. Overall architecture of FADNet: remote sensing images are split into 512 × 512 tiles and fed into the model for segmentation.
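Since the exact configuration of the improved feature pyramid is not fully specified here, the following sketch shows a common ASPP variant: parallel dilated convolutions at several rates, concatenated and blended by a learnable 1×1 projection, matching the weighted-fusion role described above (the dilation rates are assumptions):

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Parallel dilated convolutions fuse context at several scales;
    a 1x1 convolution then learns how to weight the fused features."""
    def __init__(self, c_in, c_out, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))
            for r in rates
        ])
        self.project = nn.Conv2d(c_out * len(rates), c_out, 1)

    def forward(self, x):
        # Each branch sees a different receptive field; concatenation keeps
        # all scales, and the projection learns their relative importance.
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```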
B. DA-Module
The DA module is an important part of the FADNet encoder and decoder and enhances the model's ability to extract global features. It consists of three components: 1) a SimAM spatial attention module, 2) a DoubleConv double-convolution module, and 3) a pooling operation. Within this module, SimAM computes attention weights for the feature matrix Q. First, the dimensions of Q are read, including batch size, channel, height, and width. The spatial mean of Q is then subtracted, and the squared deviation is normalized into an energy term, as shown in (1) and (2)
\begin{align*}
T&=\text{Pow}\left(Q-\frac{\sum _{j=0}^{h}\sum _{i=0}^{w}Q_{ji}}{w\times h}\right) \tag{1}
\\
E&=\frac{T}{4\times\left(\frac{\sum _{j=0}^{h}\sum _{i=0}^{w}T_{ji}}{w\times h}+\theta\right)}+\alpha. \tag{2}
\end{align*}
Equation (2) sums the feature matrix T over its two spatial dimensions, where w and h represent the width and height of the input feature matrix Q, and θ and α are small regularization constants.
SimAM is an attention module that constructs full 3-D attention weights, deriving a weight for each neuron by modeling neuronal activity. Conventional 3-D weights are generated by fusing 1-D channel and 2-D spatial attention maps, which can lead to high computational complexity. SimAM instead computes the weight of each neuron from the following energy function:
\begin{equation*}
\zeta _{t}\left(\varphi _{t}, \phi _{t}, y, x_{i}\right)=\left(y_{t}-\widehat{t}\,\right)^{2}+\frac{1}{G-1}\sum _{i=1}^{G-1}\left(y_{o}-\widehat{x}_{i}\right)^{2}. \tag{3}
\end{equation*}
The attention-refined feature map is then obtained by gating Q with the inverse energy through a sigmoid function
\begin{equation*}
\widehat{Q}=\text{sigmoid}\left(\frac{1}{E}\odot Q\right) \tag{4}
\end{equation*}
and the overall output of the DA module combines SimAM with the double convolution
\begin{equation*}
X=\text{ReLU}\left(\text{BN}\left(\text{Conv}\left(\text{SimAM}\left(\text{Conv}(Q)\right)\right)\right)\right). \tag{5}
\end{equation*}
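Putting (1), (2), (4), and (5) together, a minimal PyTorch sketch of the SimAM gate and the DA module follows; θ, α, and the channel sizes are illustrative, and the pooling step is omitted:

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free spatial attention following (1), (2), and (4)."""
    def __init__(self, theta=1e-4, alpha=0.5):
        super().__init__()
        self.theta, self.alpha = theta, alpha

    def forward(self, q):
        _, _, h, w = q.shape
        # (1): squared deviation from the spatial mean
        t = (q - q.mean(dim=(2, 3), keepdim=True)).pow(2)
        # (2): normalized energy with regularizers theta and alpha
        e = t / (4 * (t.sum(dim=(2, 3), keepdim=True) / (h * w) + self.theta)) + self.alpha
        # (4): sigmoid gate driven by the inverse energy
        return torch.sigmoid((1.0 / e) * q)

class DAModule(nn.Module):
    """Conv -> SimAM -> Conv with BN and ReLU, following (5)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, 3, padding=1)
        self.simam = SimAM()
        self.conv2 = nn.Conv2d(c_out, c_out, 3, padding=1)
        self.bn = nn.BatchNorm2d(c_out)

    def forward(self, q):
        return torch.relu(self.bn(self.conv2(self.simam(self.conv1(q)))))  # (5)
```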
By introducing the DA module, the focus on the region of interest, especially the greenhouse location, can be enhanced. This increased focus is mainly manifested in the allocation of attention weights in the attention mechanism. The DA module dynamically adjusts the computation of attention weights, allowing the greenhouse region to receive more weight, thereby strengthening the accuracy of greenhouse recognition.
FADNet determines which regions to pay more attention to by computing weights for the input data. For the greenhouse region, due to its high relevance to farmland, the DA module enables the greenhouse region to receive higher weights. As a result, the FADNet focuses more on the features and details of the greenhouse. Conversely, for urban regions with lower relevance to the target area of interest, the DA module assigns lower weights during the computation, reducing the level of attention given to urban regions. Through this computation method of attention mechanism, the FADNet can allocate weights more effectively to differentiate between agricultural land and urban land.
C. Deformable-Module
Deformable convolution is an improved CNN operation that allows the sampling pattern of the convolution kernel to adapt to the input. Traditional convolution uses fixed kernel sizes and strides, so every layer samples the input on the same rigid grid. However, a fixed kernel may not suit all feature maps, since different feature maps can have varying spatial scales and contextual information.
Deformable convolution introduces an adaptive approach that allows the network to dynamically adjust the sampling positions of the convolution kernel based on the properties of the input feature map. This enhances the network's expressive power and adaptability, enabling it to better capture feature information at different scales and contexts, and significantly improves the accuracy of feature extraction. However, implementing deformable convolution requires predicting a position offset, i.e., two components (x, y), for every sampling point at every feature location. Replacing all regular convolution kernels with deformable ones would therefore greatly increase FADNet's parameter count and computational time complexity. In this work, a large number of experiments were conducted to selectively replace only some convolutional layers with deformable kernels, achieving a reasonable balance among performance, time complexity, and space complexity. The deformable module learns a weight for each sampling point while controlling the offsets of the kernel's sampling points, reducing the influence of irrelevant factors and improving FADNet's accuracy. The operational flow of the deformable module is as follows. For a 3×3 convolution kernel, the regular 2-D sampling grid R is
\begin{equation*}
R=\left\lbrace (-1,-1), (-1,0),{\ldots }, (0,1), (1,1)\right\rbrace. \tag{6}
\end{equation*}
The deformable module extracts feature maps from the input image in two steps. First, the input image undergoes a traditional convolution to obtain initial feature maps. Then, a convolution over these feature maps generates the deformable offsets, yielding features that adapt to objects of various shapes. Equation (7) gives the output of a regular convolution
\begin{equation*}
y(\nu _{0})=\sum _{\nu _{n} \in R}W\left(\nu _{n}\right)\cdot x\left(\nu _{0}+\nu _{n}\right) \tag{7}
\end{equation*}
while deformable convolution augments (7) with a learned offset $\triangle \nu _{n}$ and a modulation scalar $\triangle m_{n}$ for each sampling point
\begin{equation*}
y(\nu _{0})=\sum _{\nu _{n} \in R}W\left(\nu _{n}\right)\cdot x\left(\nu _{0}+\nu _{n}+\triangle \nu _{n}\right)\triangle m_{n} \tag{8}
\end{equation*}
Because the learned offsets are generally fractional, the feature value at a nonintegral position $\nu$ is computed by bilinear interpolation over the integral positions $\rho$ with interpolation kernel $G$
\begin{equation*}
x(\nu)=\sum _{\rho }G(\rho, \nu)\cdot x(\rho). \tag{9}
\end{equation*}
Based on the aforementioned deformable convolution, a deformable module was constructed in this work. The module consists of one deformable convolution, one regular convolution, normalization operations, and ReLU activation functions. The structure of the deformable module is shown in Fig. 5.
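A sketch of such a module using torchvision's DeformConv2d, with a companion convolution predicting the offsets; the channel sizes and exact ordering are assumptions rather than the authors' precise configuration:

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableModule(nn.Module):
    """Offset conv + deformable conv + regular conv, each followed by
    BatchNorm and ReLU, mirroring the module layout described above."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.offset = nn.Conv2d(c_in, 2 * k * k, k, padding=k // 2)  # (dx, dy) per sampling point
        self.deform = DeformConv2d(c_in, c_out, k, padding=k // 2)
        self.conv = nn.Conv2d(c_out, c_out, k, padding=k // 2)
        self.bn1 = nn.BatchNorm2d(c_out)
        self.bn2 = nn.BatchNorm2d(c_out)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.act(self.bn1(self.deform(x, self.offset(x))))  # content-adaptive sampling
        return self.act(self.bn2(self.conv(x)))                 # regular refinement
```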
D. Loss Function
We adopt focal loss as the loss function. Focal loss was introduced mainly to address the extreme imbalance between positive and negative samples in one-stage object detection, and it also performs well in binary classification problems. Focal loss is defined as
\begin{equation*}
\text{FL}(p_{t})=-(1-p_{t})^{\gamma }\log (p_{t}). \tag{10}
\end{equation*}
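A minimal implementation of (10) for binary segmentation, where p_t is the predicted probability of the true class; γ = 2 is a common default, not necessarily the value used in this work:

```python
import torch

def focal_loss(p, target, gamma=2.0, eps=1e-7):
    """Binary focal loss per (10). p: predicted foreground probability,
    target: {0, 1} mask of the same shape."""
    p_t = torch.where(target == 1, p, 1 - p)  # probability of the true class
    # Well-classified pixels (p_t near 1) are down-weighted by (1 - p_t)^gamma,
    # so training focuses on hard, misclassified pixels.
    return (-(1 - p_t) ** gamma * torch.log(p_t + eps)).mean()
```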
Results and Discussion
This section presents the experiment: the dataset, the experimental environment, the experimental results, and comparisons with other methods. The comprehensive comparative analysis demonstrates the superiority of the proposed model.
A. Datasets
The experiment utilized a dataset collected by remote sensing satellites and stored in TIFF format. Each TIFF file contains the latitude and longitude coordinates of the corresponding area, along with the RGB image information for that area. In the preprocessing stage, we first separated the RGB image information from the dataset while preserving the latitude and longitude coordinates. Because the images are of high resolution and their size is incompatible with the network input, we used the OpenCV library to split them into 512×512 tiles.
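A minimal tiling sketch with OpenCV, assuming the georeferencing has already been separated out; border tiles smaller than 512×512 are simply discarded for brevity:

```python
import cv2

def tile_image(path, size=512):
    """Split a large RGB image into non-overlapping size x size tiles,
    discarding incomplete border tiles (a simplification)."""
    img = cv2.imread(path)  # reads the RGB bands (OpenCV stores them as BGR)
    h, w = img.shape[:2]
    tiles = []
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            tiles.append(img[y:y + size, x:x + size])
    return tiles
```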
The characteristics of the dataset meant that tiling produced invalid data and images unsuitable for processing. After filtering out these invalid data, 127 valid samples remained. Because this number was relatively small, training on them directly could lead to overfitting, so we further preprocessed the images before training. To account for possible image blurring in practical applications, we simulated blurring caused by factors such as weather by adding noise. In addition, to augment the dataset and maintain a balanced ratio of positive and negative samples, we employed data augmentation techniques, such as rotation, translation, mirroring, cropping, and scaling, with more emphasis on negative samples. Through these techniques, we expanded the dataset from the original 180 samples to 1796 samples and achieved a positive-to-negative sample ratio of 7:10. The preprocessed and augmented dataset is shown in Fig. 6.
Fig. 6. Examples of augmented samples: the original, mirrored, rotated, cropped, and Gaussian-blurred images, each with its corresponding mask.
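The augmentations described above can be sketched with OpenCV and NumPy as follows; the noise level, rotation angle, and blur kernel are illustrative:

```python
import cv2
import numpy as np

def augment(img, mask):
    """Yield mirrored, rotated, and noise-blurred variants of an image and
    its mask (a sketch of the augmentations described in the text)."""
    yield cv2.flip(img, 1), cv2.flip(mask, 1)            # horizontal mirror
    yield (cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE),
           cv2.rotate(mask, cv2.ROTATE_90_CLOCKWISE))    # 90-degree rotation
    # Additive Gaussian noise plus blur simulates weather-degraded imagery;
    # the mask is unchanged because the geometry is not altered.
    noisy = np.clip(img + np.random.normal(0, 10, img.shape), 0, 255).astype(np.uint8)
    yield cv2.GaussianBlur(noisy, (5, 5), 0), mask
```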
After data augmentation, the ratio of positive and negative samples has also reached a relatively balanced state, as shown in Fig. 7.
B. Results
This experiment focuses on greenhouse identification based on remote sensing satellite imagery. It was conducted on a Windows operating system with an NVIDIA GeForce RTX 3090 graphics card as the hardware environment. The neural network architecture was built with the PyTorch library under Python 3.9.10.
In this experiment, we conducted multiple comparisons. Without data augmentation, FADNet achieved a training set accuracy of 99.7% but a test set accuracy of only 98.9%. We attribute this discrepancy to the limited number of samples and the imbalanced distribution of the dataset. To address this, we applied a series of data augmentation techniques to increase the sample size and balance the ratio of positive and negative samples. After augmentation, FADNet achieved a training set accuracy of 99.8% and a test set accuracy of 99.6%. The results are depicted in Fig. 8: although the training performance without augmentation is remarkable, the disparity on the test set confirms the presence of overfitting caused by limited training samples.
In addition to using accuracy as a metric for evaluating FADNet's training performance, we also compared results at the pixel level: each pixel was classified by FADNet, and the per-pixel results were used as the scoring criterion. The effectiveness is illustrated in Fig. 9.
We conducted a comparative analysis of FADNet against traditional models, including U-Net, a plain CNN, and the variant convolution model, as shown in Table I. The results demonstrate a significant improvement, indicating FADNet's superior effectiveness compared with the other models.
We trained for 200 epochs and evaluated the model's accuracy on both the training and test sets. The model began to converge and stabilize around the 90th epoch, but its accuracy oscillated over the subsequent 100 epochs, as shown in Fig. 10. We therefore believe that reducing training to 100 epochs would still yield the best results.
In the ablation experiment section, a comparative analysis was performed by adding or removing different modules to observe their respective effects.
Based on the data presented in Table II, it is evident that integrating multiple modules effectively enhances the accuracy of FADNet. Fig. 11 presents the prediction results as images; comparison with the ground truth shows that FADNet successfully extracts the locations of greenhouses from remote sensing satellite imagery. Table III shows the impact of different values of the hyperparameter in (2) on model accuracy; an appropriate setting of this hyperparameter yields higher accuracy.
Conclusion
This article introduces FADNet, a lightweight model based on an encoder-decoder architecture combined with attention mechanisms and deformable convolutions. The design makes full use of the SimAM, deformable, and ASPP modules to address the technical challenges of identifying greenhouses from remote sensing satellite imagery and greatly improves the accuracy of greenhouse segmentation. The SimAM module lets the model more accurately distinguish rural from urban areas and reduces interference from urban regions. The deformable module adaptively adjusts the sampling positions of the convolution kernel to better capture the details of small targets and produce smoother, clearer segmentation edges. The ASPP module further improves the perception of features at different scales and helps raise segmentation accuracy. Compared with previous segmentation methods, FADNet achieves significant improvements in greenhouse segmentation accuracy. In practical applications, the model has a small memory footprint and can be readily deployed in the field, which is of great significance for supervising the illegal use of agricultural greenhouses. Nevertheless, despite these results, the model still has room to improve in accuracy and inference speed, which will be the direction of future research.