
FADNet: Greenhouse Identification With Fusion Attention Mechanism and Deformable Convolution



Abstract:

With the development of agricultural modernization, strengthening the supervision of agricultural production activities plays a crucial role in social security management and economic development. However, the current monitoring of greenhouse usage and planning in agricultural production lacks effective regulation. Existing technological approaches analyze and monitor agricultural activities using machine learning and simple neural networks, but their detection accuracy is limited. In this article, we present a novel deep learning model called FADNet, which combines a fusion attention mechanism and deformable convolution techniques. FADNet is trained on remotely sensed images obtained from satellite sensors. It employs deformable convolution and feature pyramid networks to achieve accurate segmentation of small-scale greenhouse targets, and a spatial attention mechanism to address the interference that greenhouse-like urban features cause for greenhouse identification. In addition, data augmentation techniques are used to address the scarcity of greenhouse datasets and the imbalance between positive and negative samples, thereby enhancing the reliability of the dataset. FADNet achieves greenhouse segmentation by performing pixel-level classification on the images. Extensive experiments demonstrate that FADNet performs exceptionally well in greenhouse segmentation tasks.
Topic: Remote Sensing and Artificial Intelligence for Sustainable Agricultural Applications
Page(s): 7170 - 7178
Date of Publication: 12 March 2024



SECTION I.

Introduction

Agricultural plastic greenhouses, as emerging protected cultivation facilities, are designated as production facility land within agricultural land and differ distinctly from construction land. In recent years, however, some individuals have illegally constructed “greenhouse houses” on arable land, including basic farmland, under the pretext of developing facility agriculture, or have built “private estates” within greenhouses. These practices seriously violate the red line for arable land and pose significant harm to land resources and food security. Therefore, accurately and promptly identifying and monitoring the location and spatial distribution of agricultural greenhouses has become an important task for natural resource management departments in curbing the “non-agriculturalization” of arable land.

Due to the concealed appearance and scattered distribution of agricultural greenhouses, manual methods are inadequate for rapid determination and monitoring. As a result, many scholars have begun to use remote sensing technology to detect and extract agricultural greenhouses and mulched farmland, using optical remote sensing images of different spatial resolutions combined with deep learning models to conduct large-scale greenhouse monitoring [1], [2], as shown in Fig. 1.

Fig. 1. Process involves acquiring images through remote sensing satellites, splitting the image, and then feeding the tiles into the model for image segmentation.

During the iterative process of greenhouse segmentation, there are still numerous challenges. Traditional segmentation methods fail to achieve the required level of accuracy for practical applications. Currently, with the use of deep learning methods, the segmentation accuracy has significantly improved. However, these methods struggle to effectively differentiate interference from urban features that bear similarities to greenhouses. In addition, greenhouse datasets are extremely scarce, which makes it difficult to meet the demands of training deep learning models on a large scale [3], [4]. Resolving the aforementioned problems will be the key determinant of the model's effectiveness.

To address the issue of greenhouse detection, we obtained remote sensing images of land use from satellites [5] and cropped them into 512 × 512 pixel tiles to construct a training dataset. Considering the limited number of samples and effective samples in the dataset, we applied a series of data augmentation techniques [6], [7] from various perspectives to make the dataset more suitable for model training. The segmentation results are produced by FADNet, which consists of an encoder and a decoder.

To better address the challenges of small object detection and distinguishing between urban construction areas and agricultural greenhouses in greenhouse detection, we advocate a new semantic segmentation architecture called the FADNet, which combines deformable convolution, feature pyramid, and attention mechanism. We made targeted improvements to FADNet in the encoder and decoder stages to enhance the accuracy of small object and edge detection. In the encoder, FADNet refines multiscale semantic information through three modules. Specifically, it uses an attention mechanism to extract global features and enhance semantic information acquisition. It utilizes deformable convolutional networks to extract irregular features and improve the collection of feature edge information. Finally, an improved feature pyramid (ASPP) is utilized to fuse multiscale information and increase the weight of low-level feature information in the feature representation, enabling FADNet to pay more attention to small object information and improve the segmentation capability for small objects. In the decoder, we improve the upsampling process to better integrate low-level feature information.

This article makes the following key contributions.

  1. We manually construct a dataset for training the greenhouse segmentation task and apply data augmentation from multiple perspectives to meet the requirements of model training.

  2. We propose a novel semantic segmentation network FADNet, which is designed specifically for greenhouse segmentation to capture small object and edge information and distinguish between urban and rural areas. FADNet incorporates the SimAM attention mechanism for semantic information retrieval and uses an improved feature pyramid network for multiscale feature fusion. In addition, an improved deformable convolution is employed to capture more edge information and relevant features.

The rest of this article is organized as follows: Section II provides an overview of relevant literature on greenhouse segmentation. Section III presents the technical background of the relevant modules. Section IV focuses on the description of the modules used in this article. Section V presents the experimental results and the related experimental environment. Finally, Section VI concludes this article.

SECTION II.

Related Work

Currently, there are several methods for greenhouse detection. One approach is based on traditional methods using machine learning and mathematical statistical techniques. Another approach involves using convolutional neural networks (CNNs) for feature extraction and greenhouse segmentation.

A. Traditional Method

Traditional methods mainly use machine learning technology to extract information, such as color and structure from greenhouse images, and use support vector machines to classify them [8], [9]. However, these features have certain limitations in terms of generalization ability and extraction ability, which ultimately leads to suboptimal classification results. In order to solve the limitations of feature extraction by traditional methods and further improve the accuracy of classification results, some researchers began to use the texture features of plastic greenhouses to supplement the spectral features [10].

B. CNN-Based Models

Deep learning [8], [11] has experienced rapid development in the field of computer vision and has achieved remarkable results in image segmentation [12]. Specifically, fully convolutional networks (FCNs) [13] have been successfully applied to greenhouse detection and significantly outperform traditional image segmentation methods in terms of accuracy [14]. However, during the continuous convolution process, a considerable amount of information loss occurs, leading to the loss of details for small targets during segmentation. In addition, segmentation methods only provide blurry contours and fail to accurately segment edge details [15]. To solve this problem, encoder-decoder architectures have been proposed that directly integrate multilevel semantic features and spatial information [16], [17], such as the U-Net [18] and RefineNet [19] architectures. However, after multiple layers of feature extraction, there is a severe loss of image detail features, making it challenging to achieve accurate segmentation of small targets in image segmentation tasks. Furthermore, because convolutional networks extract features with fixed convolution kernels, the relational features between different regions are lost, leading to unreasonable segmentation results, such as confusing plastic films used for dust prevention in urban areas with greenhouse plastic films used on agricultural land.

SECTION III.

Preliminaries

A. U-Net

The U-Net [20] was proposed by Olaf Ronneberger, Philipp Fischer, and Thomas Brox, initially used for medical image segmentation tasks. The architecture consists of an encoder and a decoder, with skip connections to facilitate information flow between them and improve the performance of image segmentation.

The encoder section comprises convolutional and pooling layers, which progressively extract features and contextual information from the input image. The decoder section is symmetric to the encoder and consists of convolutional and upsampling layers. By performing multiple upsampling operations, the decoder gradually restores the spatial dimensions of the image and establishes skip connections with corresponding encoder features [21]. The purpose of skip connections is to pass low-level detail information to the decoder, aiding in the recovery of details and improving segmentation results [22]. The skip connections in U-Net are implemented by concatenating the feature vectors from the encoder with the corresponding feature vectors in the decoder. This allows the combination of the contextual information captured in the encoder with the detail information restored in the decoder, resulting in improved image segmentation.
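To make the role of skip connections concrete, the following minimal PyTorch sketch (our own illustration under assumed layer sizes, not the authors' implementation) shows one encoder level, one decoder level, and a concatenation-based skip connection:

import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, base=16, out_ch=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU(inplace=True))
        self.down = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(nn.Conv2d(base, base * 2, 3, padding=1), nn.ReLU(inplace=True))
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        # After concatenation the decoder sees encoder features (base) plus upsampled features (base).
        self.dec = nn.Sequential(nn.Conv2d(base * 2, base, 3, padding=1), nn.ReLU(inplace=True))
        self.head = nn.Conv2d(base, out_ch, 1)

    def forward(self, x):
        e = self.enc(x)                         # low-level encoder features
        b = self.bottleneck(self.down(e))       # contextual features at half resolution
        d = self.up(b)                          # upsample back to the encoder resolution
        d = self.dec(torch.cat([e, d], dim=1))  # skip connection: concatenate, then convolve
        return self.head(d)

# Example: a 512 x 512 RGB tile yields a 2-class score map of the same spatial size.
scores = TinyUNet()(torch.randn(1, 3, 512, 512))  # -> (1, 2, 512, 512)

The concatenation doubles the channel count entering the decoder convolution, which is why the decoder layer expects twice the base number of channels.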

B. Attention Mechanism

The attention mechanism [23], [24], [25] was initially proposed to enhance the model's focus on different parts of the input and is primarily applied in natural language processing tasks. In the attention mechanism, there are typically three main components: 1) query, 2) key, and 3) value. The query represents the content the model is currently focusing on, while the key and value represent different parts of the input data. By calculating the correlation between the query and the key, attention weights are obtained. These weights are then used to weight the values, resulting in the final attention representation. This mechanism allows the model to selectively focus on specific parts of the input data based on task requirements, thereby improving model performance and effectiveness. The main approach employed in this article is the spatial attention mechanism.

Spatial attention [26] primarily focuses on the spatial dimension of the input data, such as different locations in an image or feature vectors. It assigns different attention weights to different locations by calculating the similarity or correlation between them. This enables the model to selectively focus on specific regions or positions in the image, enhancing its perception of spatial structures.
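As a generic illustration of spatial attention (a CBAM-style formulation chosen for brevity, not the specific mechanism adopted later in FADNet), the sketch below derives a per-location weight map from channel statistics and rescales the input accordingly:

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Generic spatial attention: a per-location weight map derived from channel statistics."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg_map = x.mean(dim=1, keepdim=True)        # (B, 1, H, W) average over channels
        max_map = x.max(dim=1, keepdim=True).values  # (B, 1, H, W) maximum over channels
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn                              # reweight every spatial location

features = torch.randn(1, 64, 128, 128)
reweighted = SpatialAttention()(features)            # same shape, spatially reweighted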

C. Deformable Convolution

Deformable convolution [27], [28] is an improved convolution operation that aims to better adapt to object deformations and movements. Traditional convolution operations use fixed-shaped and sized kernels, which may overlook some fine changes. In contrast, deformable convolution adjusts the sampling positions of the convolution kernel based on the content of the input feature vector by introducing learnable offset parameters.

In deformable convolution, the sampling positions of each convolution kernel are controlled by corresponding offset parameters. These offset parameters are learned during the training process and can dynamically adjust the sampling positions of the convolution kernel based on the contextual information of the input feature vector. This allows deformable convolution to better capture local details and deformation information of the target object, enhancing the model's ability to model complex scenes.

SECTION IV.

Methods

In this section, we provide a brief summary of our research and a detailed description of the components of FADNet. FADNet builds on the U-Net network architecture and consists of three main modules: the DA module, which combines the SimAM attention mechanism with convolution fusion; the deformable module, which utilizes deformable convolution for edge feature extraction; and the ASPP module [29], which replaces the skip connections with a feature pyramid. By integrating the characteristics of these three modules, FADNet effectively addresses the challenges in large-scale segmentation tasks.

A. FADNet

The FADNet proposed in this article is an improvement on the U-Net architecture. Specifically, we introduced the DA module to replace the first, third, and fifth layers of the encoder, and utilized the deformable module to replace the second, fourth, and sixth layers of the encoder. The same improvements were applied to the decoder as well [30], [31]. The purpose of these improvements was to enhance the performance and accuracy of FADNet. The DA module dynamically adjusts the computation of attention weights, allowing FADNet to focus more on regions of interest, particularly the greenhouse areas. The deformable module is able to better capture deformations and fine details in the images.

Furthermore, in the skip connection operation, the original model employed a simple addition operation followed by pooling. However, we believed that this simple addition operation might lead to feature loss. Therefore, in the skip connection process, we introduced the ASPP module to fuse the input features, which involves the fusion of features at different scales. This helps to regulate the relative importance of low-level features during the upsampling process while reducing the loss of feature information. Through backpropagation, FADNet can autonomously learn the magnitude of the relevant weights to better utilize the features in the encoder. By introducing the feature spatial pyramid module, the skip connections can flexibly fuse the features between the encoder and decoder. The convolution operation can learn specific weights to appropriately blend the features from the encoder and decoder, thereby preserving and utilizing useful feature information. This fusion operation helps FADNet better understand the contextual information of the image and improves the accuracy and performance of image segmentation. With these improvements, we further enhance the performance of the FADNet in image segmentation tasks. The overall model architecture is illustrated in Fig. 2.

Fig. 2. Process involves acquiring images through remote sensing satellites, segmenting the images into 512 × 512 pixel sizes, and then feeding them into the model for image segmentation.
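For reference, a standard ASPP block is sketched below in PyTorch; the dilation rates, channel sizes, and the concatenation of encoder and decoder features before fusion are our assumptions, since the exact configuration of the improved feature pyramid is not spelled out here:

import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Standard ASPP: parallel atrous convolutions at several dilation rates, fused by a 1x1 conv."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3 if r > 1 else 1,
                          padding=r if r > 1 else 0, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
            for r in rates
        ])
        self.project = nn.Sequential(
            nn.Conv2d(out_ch * len(rates), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

# Fusing encoder and decoder features through ASPP in a skip connection (illustrative only):
enc_feat = torch.randn(1, 64, 128, 128)
dec_feat = torch.randn(1, 64, 128, 128)
fused = ASPP(in_ch=128, out_ch=64)(torch.cat([enc_feat, dec_feat], dim=1))

Because each branch keeps the spatial resolution, the fused output can be passed directly to the next decoder stage.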

B. DA-Module

The DA-module is an important part of the FADNet encoder and decoder, which can enhance the model's ability to extract global features. It consists of three components: 1) a SimAM spatial attention module, 2) a DoubleConv double convolution module, and 3) a pooling operation. In this module, the SimAM module is used to calculate the attention weights for the feature matrix Q. First, the data from the various dimensions of the feature matrix Q are extracted, including batch size, channel, height, and width. The mean value Q_{\text{mean}} of the feature matrix Q is calculated along the height and width dimensions. Then, the difference Q_{\text{distance}} between the feature matrix Q and its mean value is computed, and the preliminary feature T of the feature vector Q is obtained by squaring Q_{\text{distance}}, as shown in (1) and (2) \begin{align*} T&=\left(Q-\frac{\sum _{j=1}^{h}\sum _{i=1}^{w}Q_{ji}}{h\times w}\right)^{2} \tag{1} \\ E&=\frac{T}{4\times\left(\frac{\sum _{j=1}^{h}\sum _{i=1}^{w}T_{ji}}{h\times w}+\theta\right)}+\alpha. \tag{2} \end{align*}


Equation (2) normalizes T by its sum over the spatial (height and width) dimensions, where w and h represent the width and height of the input feature vector Q and \theta denotes an added bias weight. This operation groups the feature vector T across the channel and spatial dimensions and yields the feature vector E.

SimAM is an attention module that generates 3-D attention weights, constructing a weight for each neuron by modeling the activity pattern of neurons. Traditional 3-D weights are generated by fusing 1-D and 2-D attention, which can lead to high computational complexity. In SimAM, the weight of each neuron is derived from the following energy function: \begin{equation*} \zeta _{t}\left(W_{t}, b_{t}, y, x_{i}\right)=\left(y_{t}-\widehat{t}\,\right)^{2}+\frac{1}{G-1}\sum _{i=1}^{G-1}\left(y_{o}-\widehat{x}_{i}\right)^{2}. \tag{3} \end{equation*}

In (3), \widehat{t}=W_{t}t+b_{t} and \widehat{x}_{i}=W_{i}x_{i}+b_{i} represent the linear transformations of t and x_{i}, respectively, where t and x_{i} are the target neuron and the other related neurons in a single channel of the input feature vector Q \in R^{C\times H\times W}, and W_{t} and b_{t} are the weight and bias of the transformation. Building on this energy function and on the attention modulation observed in the mammalian brain, which typically manifests as a gain effect on neuron responses, SimAM uses a scaling operator instead of additional feature refinement, as shown in \begin{equation*} \widehat{Q}=\text{sigmoid}\left(\frac{1}{E}\odot Q\right) \tag{4} \end{equation*}
where E is obtained from (2) using the feature T in (1); the sigmoid function weakens the influence of excessively large values in E, ensuring that they do not distort the relative importance of each neuron. Finally, the feature vector \widehat{Q} obtained from the attention module is fused with the initial feature vector Q. Subsequently, a double convolution operation is applied to the attention-modulated feature vector \widehat{Q}, as shown in \begin{equation*} X=\text{ReLU}\left(\text{BN}\left(\text{Conv}\left(\text{SimAM}\left(\text{Conv}(Q)\right)\right)\right)\right). \tag{5} \end{equation*}
The first step involves applying a convolution operation with an input channel size equal to the number of channels in the feature vector \widehat{Q} and an output channel size equal to half of the number of channels in the feature vector Q. This is followed by a normalization operation (BN) to constrain the feature range within a certain interval, ensuring network stability and convergence. Finally, a nonlinear transformation is applied through an activation function. This process is repeated, with the only difference being that the input and output channel sizes of the convolution operation remain equal to half of the number of channels in the feature vector \widehat{Q}. This leads to the final feature vector X. The network structure is depicted in Fig. 3.

Fig. 3. Structure of the DA-module.
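A hedged PyTorch sketch of the DA module is given below. SimAM is written in its commonly used parameter-free form; the normalization constant \theta (eps), the bias \alpha = 0.5, the channel sizes, and the exact placement of BN/ReLU are assumptions rather than the authors' released code:

import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free SimAM attention: per-neuron weights from the squared distance to the channel mean."""
    def __init__(self, eps=1e-4):
        super().__init__()
        self.eps = eps  # plays the role of the bias term theta in (2)

    def forward(self, q):
        b, c, h, w = q.shape
        n = h * w - 1
        d = (q - q.mean(dim=(2, 3), keepdim=True)).pow(2)                          # T in (1)
        e_inv = d / (4 * (d.sum(dim=(2, 3), keepdim=True) / n + self.eps)) + 0.5   # energy term, cf. (2)
        return q * torch.sigmoid(e_inv)                                            # scaling operator, cf. (4)

class DAModule(nn.Module):
    """Sketch of the DA module: convolution, SimAM reweighting, convolution, BN, ReLU, cf. (5)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)
        self.attn = SimAM()
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, q):
        return self.act(self.bn(self.conv2(self.attn(self.conv1(q)))))

x = DAModule(64, 128)(torch.randn(1, 64, 256, 256))  # -> (1, 128, 256, 256)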

By introducing the DA module, the focus on the region of interest, especially the greenhouse location, can be enhanced. This increased focus is mainly manifested in the allocation of attention weights in the attention mechanism. The DA module dynamically adjusts the computation of attention weights, allowing the greenhouse region to receive more weight, thereby strengthening the accuracy of greenhouse recognition.

FADNet determines which regions to pay more attention to by computing weights for the input data. For the greenhouse region, due to its high relevance to farmland, the DA module enables the greenhouse region to receive higher weights. As a result, the FADNet focuses more on the features and details of the greenhouse. Conversely, for urban regions with lower relevance to the target area of interest, the DA module assigns lower weights during the computation, reducing the level of attention given to urban regions. Through this computation method of attention mechanism, the FADNet can allocate weights more effectively to differentiate between agricultural land and urban land.

C. Deformable-Module

The deformable convolution is an advanced CNN technology that allows the shape and parameters of the convolution operation to adaptively change between different layers. Traditional convolution operations use fixed convolution kernel sizes and strides, which means that all convolution layers in the network use the same convolution kernel shape and parameters. However, in certain cases, a fixed convolution kernel may not be suitable for all feature vectors, as different feature vectors may have varying spatial scales and contextual information.

The deformable convolution introduces an adaptive approach that allows the network to dynamically adjust the shape and parameters of the convolution kernel based on the properties of the input feature vector. This enhances the network's expressive power and adaptability, enabling it to better capture feature information at different scales and contexts. Deformable convolutions significantly improve the accuracy of feature extraction. However, implementing deformable convolutional operations involves adding a bias term to the feature vector to compute the position offset for the deformable convolution, which requires a pair of offset components (x, y) for each feature point. If all regular convolution kernels were replaced with deformable convolution kernels, it would greatly increase FADNet's parameter count and computational time complexity. Therefore, in this work, a significant number of experiments were conducted to selectively replace some convolutional layers with deformable convolution kernels. This approach achieves a relatively reasonable balance between performance, time complexity, and space complexity. The deformable module is capable of learning the weights for each sampling point while controlling the offset of the convolution kernel sampling points, reducing the influence of irrelevant factors and improving FADNet's accuracy. The following is the operational flow of the deformable module. For a 3 × 3 convolution kernel, its 2-D sampling grid R is defined as \begin{equation*} R=\{(-1, -1), (-1, 0), \ldots, (0, 1), (1, 1)\}. \tag{6} \end{equation*}


The deformable model has two steps in extracting the feature vectors of the input image. First, the input image undergoes a traditional convolution operation to obtain the initial feature vectors. Then, by performing convolution on the feature vectors, deformable convolution offsets are generated, resulting in feature vectors that can adapt to objects of various shapes. Equation (7) represents the process of calculating the output feature vectors for a normal convolution operation \begin{equation*} y(\nu _{0})=\sum _{\nu _{n} \in R}W\left(\nu _{n}\right)\cdot x\left(\nu _{0}+\nu _{n}\right) \tag{7} \end{equation*}

where y denotes the output vector, \nu _{0} is the center point of the traditional convolution kernel, \nu _{n} enumerates the sampling points of the traditional convolution kernel, and x represents the input vector. The formula for calculating the output vector using a deformable convolution kernel is as follows: \begin{equation*} y(\nu _{0})=\sum _{\nu _{n} \in R}W\left(\nu _{n}\right)\cdot x\left(\nu _{0}+\nu _{n}+\Delta \nu _{n}\right)\Delta m_{n} \tag{8} \end{equation*}
where \Delta \nu _{n} denotes the computed offset, \Delta m_{n} denotes the weight coefficient of the offset sampling point, and the remaining symbols are defined as in (7). The deformable convolution introduces positional offsets, allowing the output vectors to better represent the features of irregular objects. The offset \Delta \nu _{n} moves the points in the region R according to the distribution of the target features, and since the offset is generated by convolving the input vector with another convolutional layer, it typically takes fractional values. By performing bilinear interpolation on the offset positions, the deformable convolution can be expressed as \begin{equation*} x(\nu)=\sum _{\rho }G(\rho, \nu)\cdot x(\rho) \tag{9} \end{equation*}
where \nu denotes the (possibly fractional) sampling position after the offset, \rho enumerates the integer grid points, and G(\rho, \nu) is the bilinear interpolation kernel that maps the fractional sampling position back to the integer grid. The structural diagram of deformable convolution is shown in Fig. 4.

Fig. 4. Deformable convolution structure diagram.

Based on the aforementioned deformable convolution, a deformable convolution module was constructed in this work. The module consists of one deformable convolution and one normal convolution, each followed by a normalization operation, and uses the ReLU activation function to provide nonlinearity and help prevent overfitting. The structure of the deformable module is shown in Fig. 5.

Fig. 5. Diagram of the deformable convolution module.
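The following sketch shows one way to assemble such a module with torchvision.ops.DeformConv2d; the offset- and mask-predicting convolutions play the roles of \Delta \nu _{n} and \Delta m_{n} in (8), while the layer ordering and channel sizes are our assumptions (the modulation mask requires a reasonably recent torchvision):

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableModule(nn.Module):
    """Sketch of the deformable module: offset/mask prediction + deformable conv + plain conv with BN/ReLU."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        # A regular convolution predicts 2 offsets (dx, dy) per kernel sampling point, cf. (8).
        self.offset = nn.Conv2d(in_ch, 2 * k * k, k, padding=k // 2)
        # A second convolution predicts the per-point modulation weights (Delta m_n).
        self.mask = nn.Conv2d(in_ch, k * k, k, padding=k // 2)
        self.deform = DeformConv2d(in_ch, out_ch, k, padding=k // 2)
        self.conv = nn.Conv2d(out_ch, out_ch, k, padding=k // 2, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(out_ch), nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        offsets = self.offset(x)                 # learnable offsets for every location
        mask = torch.sigmoid(self.mask(x))       # modulation weights in [0, 1]
        y = self.act(self.bn1(self.deform(x, offsets, mask)))
        return self.act(self.bn2(self.conv(y)))

y = DeformableModule(64, 128)(torch.randn(1, 64, 256, 256))  # -> (1, 128, 256, 256)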

D. Loss Function

When designing the loss function, we used focal loss. Focal loss was introduced mainly to address the extreme imbalance between positive and negative samples in one-stage object detection, and it also performs well on binary classification problems. Focal loss is defined as \begin{equation*} \text{FL}(p_{t})=-(1-p_{t})^{\gamma }\log (p_{t}). \tag{10} \end{equation*}

In the greenhouse dataset, the pixels occupied by greenhouses account for a very small proportion relative to the other pixels, so the classes are imbalanced. For regions that differ markedly from greenhouses, such as bare land, the model already segments them well and achieves good classification results. However, for pixels that are difficult to distinguish, the segmentation accuracy is poor. Focal loss effectively enhances the segmentation of these hard-to-distinguish samples, which is also well reflected in the model training process.
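A minimal sketch of focal loss for binary pixel classification is given below; the \alpha-balancing factor is a common extension and our own addition, whereas (10) uses only the (1 - p_t)^\gamma modulating term:

import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Focal loss for binary (greenhouse vs. background) pixel classification, cf. (10).

    logits, targets: tensors of shape (B, H, W); targets contain 0/1 labels.
    gamma focuses training on hard pixels; alpha (an added balancing term) reweights positives/negatives.
    """
    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1 - p)  # probability assigned to the true class
    alpha_t = torch.where(targets == 1, torch.full_like(p, alpha), torch.full_like(p, 1 - alpha))
    ce = F.binary_cross_entropy_with_logits(logits, targets.float(), reduction="none")
    return (alpha_t * (1 - p_t).pow(gamma) * ce).mean()  # -(1 - p_t)^gamma log(p_t), weighted

loss = binary_focal_loss(torch.randn(2, 512, 512), torch.randint(0, 2, (2, 512, 512)))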

SECTION V.

Results and Discussion

This section aims to present the results of the experiment, including several main sections: dataset, experimental environment, demonstration of experimental results, and comparison of experimental effects. Through comprehensive comparative analysis, the superiority of this model is demonstrated.

A. Datasets

The experiment utilized a dataset collected by remote sensing satellites, stored in TIFF file format. Each TIFF file contained the latitude and longitude coordinates of the corresponding area, along with the RGB image information for that area. In the data preprocessing stage, we first processed the dataset to retain the RGB image information and separate it without affecting the latitude and longitude coordinates. Because the images have high resolution and their size is incompatible with the network input, we applied preprocessing using the OpenCV library to segment the images into smaller blocks of size 512 × 512.
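The tiling step can be sketched as follows; file names, directory layout, and the handling of border regions are illustrative assumptions rather than the authors' actual preprocessing script:

import cv2
import os

def split_into_tiles(image_path, out_dir, tile=512):
    img = cv2.imread(image_path)  # reads the RGB bands; geo-coordinates are handled separately
    os.makedirs(out_dir, exist_ok=True)
    h, w = img.shape[:2]
    for r in range(0, h - tile + 1, tile):
        for c in range(0, w - tile + 1, tile):
            block = img[r:r + tile, c:c + tile]  # one 512 x 512 block (partial border tiles are skipped here)
            cv2.imwrite(os.path.join(out_dir, f"tile_{r}_{c}.png"), block)

split_into_tiles("scene.tif", "tiles/")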

During the image segmentation process, the dataset's characteristics resulted in the presence of invalid data and images unsuitable for image processing. Through filtering, we removed these invalid data, resulting in a total of 127 valid samples. However, since the number of valid samples was relatively small, directly using them for training could lead to overfitting issues. Therefore, before training, we performed further preprocessing on the images. Considering possible image blurring in practical applications, we simulated blurring caused by factors such as weather by adding noise. In addition, to augment the dataset and maintain a balanced ratio of positive and negative samples, we employed data augmentation techniques such as rotation, translation, mirroring, cropping, and scaling, with more emphasis on processing negative samples. Through these data augmentation techniques, we expanded the dataset from the original 180 samples to 1796 samples and achieved a positive-to-negative sample ratio of 7:10. The preprocessed and augmented dataset is shown in Fig. 6.

Fig. 6. Original image, mask of the original image; mirrored image, mask of the mirrored image; rotated image, mask of the rotated image; cropped image, mask of the cropped image; Gaussian-blurred image, mask of the Gaussian-blurred image.
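The augmentation operations listed above can be sketched as follows; the kernel size, noise level, and rotation angle are illustrative assumptions, and the key point is that geometric transforms must be applied identically to the image and its mask:

import cv2
import numpy as np

def augment(image, mask):
    samples = [(image, mask)]
    samples.append((cv2.flip(image, 1), cv2.flip(mask, 1)))              # horizontal mirror
    samples.append((cv2.rotate(image, cv2.ROTATE_90_CLOCKWISE),
                    cv2.rotate(mask, cv2.ROTATE_90_CLOCKWISE)))          # 90-degree rotation
    blurred = cv2.GaussianBlur(image, (5, 5), 0)                         # simulate weather-induced blur
    noisy = np.clip(image + np.random.normal(0, 10, image.shape), 0, 255).astype(np.uint8)
    samples += [(blurred, mask), (noisy, mask)]                          # photometric changes keep the mask
    return samples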

After data augmentation, the ratio of positive and negative samples has also reached a relatively balanced state, as shown in Fig. 7.

Fig. 7. Comparison of dataset ratios before and after data augmentation.

B. Results

This experiment primarily focuses on greenhouse identification based on remote sensing satellite imagery. The experiment was conducted on a Windows operating system with an NVIDIA GeForce RTX 3090 graphics card as the hardware environment. For building the neural network architecture, we utilized the PyTorch 3.9.10 library in the experiment.

In this experiment, we conducted multiple comparisons. Without data augmentation, FADNet achieved a training set accuracy of 99.7%, but the accuracy on the test set was only 98.9%. We attribute this discrepancy to the limited number of samples and the imbalanced distribution of the dataset. To address this issue, we performed a series of data augmentation techniques to increase the sample size and maintain a balanced ratio of positive and negative samples. After data augmentation, FADNet achieved a training set accuracy of 99.8% and a test set accuracy of 99.6%. The results are depicted in Fig. 8. As evident from Fig. 8, although the training performance is remarkable, there is a noticeable disparity in the performance on the test set, which confirms the presence of overfitting caused by the limited number of training samples.

Fig. 8. Comparison chart of training effect under data augmentation.

In addition to using accuracy as a metric for evaluating FADNet's training performance, we also examined the comparison at the pixel level. We classified each pixel using FADNet and used the results as a scoring criterion. The effectiveness is illustrated in Fig. 9.

Fig. 9. Pixel-level comparison of different models.

We conducted a comparative analysis of the performance of FADNet against traditional models, including U-Net, CNN, and the variant convolution model, as shown in Table I. The results clearly demonstrate a significant improvement, indicating that FADNet is more effective than the other models examined in our study.

We trained the model for 200 epochs and evaluated its accuracy on both the training and test sets. We found that the model began to converge and stabilize around the 90th epoch. However, during the subsequent 100 epochs, the model's accuracy exhibited oscillations, as shown in Fig. 10. Therefore, we believe that reducing the training to 100 epochs would still yield the best training results.

In the ablation experiment section, a comparative analysis was performed by adding or removing different modules to observe their respective effects.

TABLE I Comparison of Training Accuracy of Different Models

TABLE II Comparison of Recognition Accuracy of Different Pairs of Models

Fig. 10. Accuracy line chart of 200 epochs training and test sets.

Based on the data presented in Table II, it is evident that the integration of multiple modules can effectively enhance the accuracy of FADNet.

TABLE III Comparison of Training Accuracy of Different Modules

In Fig. 11, the prediction results are presented in the form of images. By comparing them with the ground truth values, it can be observed that FADNet successfully extracts the location information of greenhouses from the satellite imagery obtained through remote sensing.

Fig. 11. Presentation of prediction results.

Table III shows the impact on model accuracy when the hyperparameter \alpha in (2) takes different values. After comparison, it can be inferred that the model achieved higher accuracy when \alpha was set to 0.5.

SECTION VI.

Conclusion

This article introduces a lightweight FADNet model based on an encoder and decoder architecture, combined with attention mechanisms and deformable convolutions. The design makes full use of the advantages of the SimAM module, the deformable module, and the ASPP module to effectively solve the technical problems of identifying greenhouses from remote sensing satellite imagery and to greatly improve the accuracy of greenhouse segmentation. By introducing the SimAM module, the model can more accurately distinguish rural from urban areas and reduce the interference caused by urban information. The deformable module adaptively adjusts the sampling positions of the convolution kernel to better capture the details of small targets and to produce smoother and clearer segmentation edges. In addition, the model uses the ASPP module to further improve the perception of features at different scales, which helps improve segmentation accuracy. Compared with previous segmentation methods, FADNet achieves significant improvements in the accuracy of greenhouse segmentation. In practical applications, the model has small memory requirements and can readily be deployed in the field, which is of great significance for supervising the illegal use of agricultural greenhouses, and it realizes the greenhouse identification task based on remote sensing satellite images. However, despite these results, the model still has the potential to further improve accuracy and inference speed, which will be the direction of future research.
