Journals & Magazines >IEEE Journal of Selected Topi... >Volume: 18

Multiclass Crop Interpretation via a Lightweight Attentive Feature Fusion Network Using Vehicle-View Images

Abstract:

Automatic crop interpretation can provide important reference information for national agricultural decision-making. However, due to the diverse characteristics and compl...Show More

Topic: Street View Imagery and GeoAI

Metadata

Abstract:

Automatic crop interpretation can provide important reference information for national agricultural decision-making. However, due to the diverse characteristics and complex spatial relationship of crops, remote sensing images taken from a bird's eye view are insufficient in vertical features of crops, making it difficult to interpret crop types and locations accurately. The similar features and blurred edges between different crops make it difficult to extract crop boundaries accurately. Due to the high memory and computational costs, most of the deep learning-based models face efficiency limitations in real-scenario crop interpretation. To address the abovementioned issues, this article proposes a novel lightweight neural network, namely the CropNet, for crop interpretation. Aiming at the problem of feature similarity among different categories of crops, this article designs a global-local path aggregation (GLPA) mechanism, which uses shallow and deep neural networks to extract global detail information and local high-level information to enhance feature discrimination. An edge context feature enhancement module (ECFEM) is proposed to enhance edge and context feature extraction to address the problem of ambiguous crop edges. Finally, a feature fusion module based on an attention mechanism is used to automatically weigh different feature channels to enhance the crop semantics. To demonstrate the effectiveness of the CropNet proposed in this article, we constructed several sets of comparison experiments comparing it with state-of-the-art deep learning models on a manually labeled vehicle-view crop image dataset. The experimental results show that CropNet has better semantic segmentation results with fewer model parameters and lower computational costs.

Topic: Street View Imagery and GeoAI

Published in: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing ( Volume: 18)

Page(s): 496 - 509

Date of Publication: 23 October 2024

ISSN Information:

DOI: 10.1109/JSTARS.2024.3481248

Funding Agency:

Contents

CCBY - IEEE is not the copyright holder of this material. Please follow the instructions via https://creativecommons.org/licenses/by/4.0/ to obtain full-text articles and stipulations in the API documentation.

SECTION I.

Introduction

Food security [1], as one of the crucial issues in current national development, serves as a fundamental pillar for national security and the establishment of a global community with a shared future for humanity. It is also a prerequisite for world peace and development. Crop extraction involves capturing descriptive information about crop distribution areas, types, sizes, compositions, and other characteristics from images. Timely and accurate crop extraction enables dynamic monitoring of crop distribution, shape, growth, growth cycles, and other key information, thereby enabling precise classification of crops. This process can provide essential reference information for national agricultural decision-making [2] and, at the same time, offer crucial support for national agricultural monitoring, production management, and food security [3].

Previously, the primary method of crop extraction involved conducting on-site surveys of preobservation areas by manual experts or relevant monitoring units. For instance, to gather data on field crops and other relevant topics, the National Agricultural Statistics Service [4] and the Farm Service Agency [5] in the United States rely on direct interviews and interactions with farmers. They also make use of the data they have gathered to identify and forecast crop yields during the growing season. This manual extraction method is not only time-consuming but also requires substantial material and labor resources.

With the emergence and development of intelligent methods, such as machine learning, deep learning, and swarm intelligence computing, these methods are widely used in various fields, such as remote sensing [6], [7], [8], [9], geoscience [10], urban computing [11], [12], [13], [14], urban sustainability [15], [16], [17], urban disasters [18], [19], urban function and human healthcare [20], [21], [22], spatial data management [23], [24], [25], and crop monitoring [26], [27], [28]. Compared to traditional manual observation methods, intelligent techniques offer numerous advantages. These include processing large volumes of repetitive tasks to free up labor and resources, automating computations to reduce human error, and quickly and accurately detecting minute details that are difficult to capture manually. Therefore, scientists have begun to try to apply methods combined with artificial intelligence to crop extraction tasks.

Nonetheless, there are significant issues with classic machine learning-based methods that are hard to overlook. For example, the restricted nonlinear capabilities of the model make it difficult to achieve the desired recognition effect, and conventional machine learning techniques usually rely on features defined by experts [29]. These difficulties make it challenging to extract crops and obtain more accurate recognition results from image samples with complex environmental backgrounds. Researchers have been investigating the use of deep learning algorithms to automatically derive crops from image data as a reaction to the aforementioned problems. Deep learning models have many advantages over typical machine learning techniques, including faster speed and accuracy, better compatibility, fewer data processing stages, and a balance of speed and accuracy. The main challenges in crop interpretation tasks include the following three points. Similar features: The characteristics of some crops are quite similar, leading to frequent misidentification during crop interpretation tasks. Blurred edges: The edges of the crop area are relatively blurred, which increases the difficulty of the crop interpretation task. Single perspective: It is difficult to obtain characteristic information such as changes in crop roots, stems, and leaves from remote sensing image data from a bird's-eye view.

Crop interpretation tasks have important research significance in national agricultural monitoring and food security. This work addresses the challenges of similar features between different crop types, blurred crop edges, and single perspective in crop interpretation. Different from most existing works [30], [31], [32], [33], we focus on the problem of crop feature similarity and design modules to extract features from the global and local perspectives to obtain more feature diversity. For the problem of blurred crop edges, we use a more efficient feature fusion strategy to enhance the feature difference between crop edges and background. In addition, this article uses vehicle-view images that are closer to human perspectives and provide higher visibility of details of crops. The main contributions of this article are as follows.

This article uses vehicle-view images for crop interpretation. Compared to remote sensing images, vehicle-view images can capture the vertical spatial information of crops. Building on this, we propose CropNet, a lightweight neural network based on multifeature aggregation for refined crop interpretation.
To address the issue of crop feature similarity, this article introduces a global-local path aggregation (GLPA) mechanism, which employs both shallow and deep neural networks to extract global detail information and local high-level features, respectively, to enhance feature discrimination.
To address the issue of blurred edges of crops, this article proposes an edge context feature enhancement module (ECFEM) to enhance the extraction of edge and context features and automatically weights different feature channels based on the attention mechanism to improve the accuracy and efficiency of crop interpretation.

SECTION II.

Related Work

A. Threshold-Based Segmentation

Since grayscale histograms vary across different images, the threshold can be set based on the grayscale difference between the target and the background or other content in the image. Thresholds are used to divide grayscale ranges, classifying pixels to achieve image segmentation. Traditionally, this strategy has focused on selecting an adequate threshold for more accurate segmentation. Otsu is a global thresholding approach [34]. This method uses interclass variance as a criterion to assess the quality of the threshold and chooses a suitable global threshold to separate the target from the image backdrop. When extracting targets from the background in variable illumination circumstances, using a single threshold makes accurate segmentation difficult. Therefore, Kapur et al. [35] suggested the greatest entropy method for threshold selection. The maximum entropy method selects the threshold by maximizing the combined entropy of the target and background in the segmented image.

Many scholars have further developed more advanced threshold segmentation methods based on the maximum entropy approach, which can express more information for target extraction compared to Otsu. Building on this, Mohamed et al. [36] proposed a vegetation segmentation method based on the hue histogram of RGB images to determine the threshold. The hue histogram in this method conveys more information and can more effectively detect vegetation coverage. Due to the influence of ambient illumination changes on RGB images, this method frequently leads to recognition errors when the original cropped image is in grayscale. Therefore, Dorj et al. [37] attempted to convert RGB images to HSV images using a watershed algorithm in order to reduce the unpredictable effects of the illumination environment on the prediction. Castillo-Martínez et al. [38] proposed a colour-index-based thresholding method to improve crop and background segmentation by processing the thresholds. The threshold-based segmentation method is suitable for simple scenes but is easily affected by changes in ambient lighting. It is gradually replaced by the clustering-based segmentation method.

B. Clustering-Based Segmentation

Clustering-based segmentation groups pixels with similar properties within the same region of the image. After numerous iterations and clustering, they converge to discriminate between target and background information, boosting the accuracy of information division during the segmentation process. Clustering-based segmentation methods often utilize superpixels to extract boundary information within the image. A simple linear iterative method based on $K$-means clustering to generate hyperpixels was proposed by Achanta et al. [39]. It has strong border compliance and adhesion capabilities [40] and can respond more effectively to boundary information when employed in segmentation tasks. This article also proposes an adaptive SLIC method. It uses the maximum distance between different clusters to dynamically adjust the color and spatial information of the image, thus improving the computational speed. However, this method performs poorly in weak boundary recognition. To solve the aforementioned concerns, Di et al. [41] performed multilayer hierarchical processing on SLIC and proposed LI-SLIC. It evaluates whether neighboring superpixels belong to the same item by comparing the similarity of their probability distributions, and then merges them to reduce the number of superpixels and improve the algorithm's robustness. However, this algorithm is slower. The clustering-based segmentation method has the advantage of improving scene generalization throughout the image segmentation process and mitigating the impact of issues like uneven lighting and shadows. However, the quality of segmentation results from two-dimensional images is limited, and understanding complex scene information can be challenging.

Researchers have found that traditional segmentation algorithms struggle to classify multiple label categories in segmentation results. Some scholars use traditional machine learning algorithms for crop classification. For example, Piiroinen et al. [30] obtained hyperspectral data in a highly heterogeneous study area using the AisaEAGLE imaging spectrometer and classified and mapped crops using support vector machines and minimum noise fraction classification algorithms. However, typical machine learning algorithms are inefficient when dealing with large data sets, making it difficult to obtain comprehensive crop information. As a result, researchers are aiming to use deep learning approaches to perform visual domain tasks related to agriculture.

C. Deep Learning-Based Segmentation

Many scholars have studied different improved semantic segmentation networks. Long et al. [42] proposed FCN, which achieved pixel-level prediction and can retain more spatial information of the image for semantic prediction. Ronneberger et al. [43] further proposed U-Net, which ensures learning ability with fewer samples, can extract information at different levels of resolution, and then fuse them, but the global feature recognition effect is poor. To solve the abovementioned problems, Cambridge University proposed SegNet [44], based on the Caffe architecture of fast feature embedding and VGG16. It has a very high convergence speed to improve overall efficiency. The introduction of these networks is an important basis for crop feature extraction tasks. On this basis, Wu et al. [45] proposed a pixel-by-pixel method based on FCN and used an adaptive threshold to improve the accuracy when training data is limited. This method is used for crop semantic segmentation tasks. However, when the target range in the crop image is small, the segmentation and recognition effects are poor. The DeepLab series of structures [46], [47], [48] solves this problem well. DeepLab improves the FCN architecture through dilated convolution and introduces a “dilation rate” parameter to obtain contextual semantic information. This method is suitable for identifying crops and land cover information at different scales. However, the huge amount of parameter information generated by DeepLabV3+ increases a lot of processing load.

The characteristics of hyperspectral remote sensing images make them often selected as research data in crop extraction tasks [49], [50]. For example, the SCRF [31] proposed by Zhao et al. uses spectral weighting to adjust the sensitivity to crop spectral information in order to differentiate between crop types. Jarrett et al. [32] fused multitemporal optical data with radar imagery to implement DnCNN and 3-D U-Net, which successfully implemented changes in time series in static images to identify different crop features more easily. Zhang et al. [51] proposed a fusion method based on the polarization expansion of multispectral data with PolSAR data and FTDN. The method extracts data from the original data and structures them for effective feature extraction. Tian et al. [52] proposed a selective patching TabNet based on single-temporal phase images, which is able to adjust the weights of image elements in classification by introducing TabNet and multitask learning to, and designing an adaptive selective patching module. This model can achieve high classification accuracy despite insufficient crop phenology information.

With the widespread application of hyperspectral remote sensing images, Wu et al. [53] established a new network framework, CCR-Net, for multimodal remote sensing data classification tasks, which reduces the visual impact of cloud cover and enhances the fusion of different features in multimodal data. Yang et al. [33] used the LeNet deep convolutional network to identify corn varieties, and developed the DTCARS-LeNet-5 model, which significantly improved the accuracy of corn recognition. Patel et al. [54] applied deep learning models for transfer learning and evaluated benchmark high-resolution hyperspectral crop images. Hong et al. [55] have been dedicated to studying the semantic segmentation task of multimodal remote sensing data. In 2023, they proposed a super-resolution hyperspectral framework and a novel decoupled-and-coupled network, which fuses super-resolution hyperspectral and high-spatial-resolution multispectral information at the subpixel level. They also constructed a multimodal remote sensing benchmark dataset, C2Seg [56], which includes hyperspectral, multispectral, and SAR data. In addition, they developed a high-resolution domain adaptive network, HighDAN, which improves the generalization ability of AI models in multicity environments through adversarial learning. In 2024, they introduced the SpectralGPT [57] model, which can process spectral remote sensing images broadly and significantly enhances information extraction using 3-D masking strategies, advancing the application of spectral remote sensing data in interpretation tasks. Although intelligent models for processing hyperspectral remote sensing images continue to innovate, the fixed high-altitude perspective of hyperspectral remote sensing still makes it difficult to capture vertical crop details. Bhatti et al. [58] used convolutional neural networks and the Inception-v3 architecture to address crop disease issues and developed mobile applications for real-time recognition. These real-time mobile tools can be effectively used in actual agricultural environments.

SECTION III.

Proposed Method

With the extraction challenges posed by similar features and fuzzy edges between different crop classes and the limited feature representation capability of traditional neural networks. This article proposes a novel lightweight neural network (CropNet) for the semantic segmentation of crops. Fig. 1 shows the general design of CropNet. CropNet designed a GLPA mechanism, which includes the global information extraction module (GIEM) and the local feature extraction module (LFEM). The two modules use shallow and deep neural networks to extract global detail information and local high-level information, respectively. CropNet uses the ECFEM to enhance the extraction of edge and contextual features. In addition, it also includes a multifeature fusion block (MFFB) based on the attention mechanism to automatically weight different feature channels to improve the accuracy of crop extraction.

Fig. 1.

Overview of the proposed method. (a) Acquisition of vehicle-view images along the road and the annotation process with ground truth. (b) CropNet model, a new lightweight neural network for crop extraction tasks. (c) Accuracy evaluation and validation of vehicle-view images, visualization of semantic segmentation results, and comparison of class-related feature maps.

Show All

For crop interpretation tasks, the existing works mainly focus on how to improve the recognition accuracy and efficiency of crops in remote sensing image data. The advantages of the existing works are that the methods for acquiring different remote sensing image data are relatively mature [49], [56]. There are many models applicable to remote sensing images that can be improved through transfer learning. The disadvantages are that it is difficult to obtain the longitudinal characteristic information of crops using remote sensing image data, and there are areas where remote sensing data is difficult to obtain. Our work uses vehicle-view images for crop interpretation, offering higher visibility of details and compensating for the lack of resolution or angle in remote sensing images. More targeted real-time data collection can be carried out for specific agricultural areas, which can provide more timely response measures for some emergencies. And it has the potential advantage of low cost.

A. Local Feature Extraction Module

The LFEM takes vehicle-view images as input and extracts deep features using the image shape $(3, H, W)$. The output is made up of feature pyramids of varying sizes, including $(24, H/4, W/4)$, $(32, H/8, W/8)$, $(96, H/16, W/16)$, and $(320, H/32, W/32)$. Because the original images have different sizes, exposing them to the same downsampling approach may generate semantic issues such as inaccurate detection of small targets. Using a feature pyramid, images of varied sizes can have appropriate feature representations at matching scales, enabling image recognition and prediction across several scales. The fusion of multiscale data can improve the performance of model identification and prediction. MibileNetV2, a lightweight visual architecture, optimizes storage economy using inverted residuals. Simultaneously, the MibileNetV2 model replaces classic convolution operations with linear convolution, depth convolution, and point-by-point, resulting in significant computational savings. While decreasing the parameters, the calculation's accuracy and speed can be considerably improved.

After MibileNetV2, we utilize LightDecoder, a lightweight decoder, to decode the feature pyramid generated by MibileNetV2. LightDecoder extracts features from different layers in the feature pyramid using a fully connected layer operation, upsamples them to the same size as the largest layer, and then merges them. By connecting feature maps of different scales, information on features from shallow to deep layers can be retrieved. The connected feature map is then sent into the convolution, batch normalization (BN) [59], and activation layers to produce a category-related shape feature map $(M, H/4, W/4)$. A category-related feature map can use color to indicate the recognition certainty of several image semantic categories and, as a result, express the model's confidence in category prediction. The local high-level feature extraction module can extract information from crop photos at different scales, extract differential features, and increase CropNet's recognition performance.

B. Global Information Extraction Module

In the GIEM, a unique convolutional layer is constructed as the feature extraction module using the traditional Conv2D+BN+ReLU method. The application of BN improves the stability and generalization of the learning process in deep networks and helps alleviate the issue of low processing speed, which often occurs when the network learns from diverse data distributions. In addition, it can generate smoother optimization surfaces and address the vanishing gradient problem caused by diverse data distributions [60]. We first input the image into seven sets of Conv2D+BN+ReLU blocks to generate feature maps containing detailed information about the entire domain. Then, using PixelShuffle, we upsample the resulting object map to obtain a global information extraction image with the same size as the original image. Based on the feature image data, we derive category information for the global crop features. The global detail extraction module can extract features from the entire range of crop images. This module highlights the feature differences between the image's crop areas and background areas and makes the same category's regional features easier to identify. Finally, we combine the two feature maps obtained from the local higher-order feature extraction module and the global detail information extraction module, serving as the outputs of the global and local feature gathering modules. The data are subsequently sent into the multifunction aggregation module. This approach successfully overcomes the challenge of identifying crops with similar traits.

C. Edge Context Feature Enhancement Module

In CropNet, the ECFEM uses the context-sensitive feature extraction block (CFEB) and the edge-sensitive feature extraction block (EFEB) to extract context features and edge features, respectively. The MFFB, which is based on the attention mechanism, automatically weighs different feature channels to produce the semantic segmentation results of crops.

1) Context-Sensitive Feature Extraction Block

This research employs the global context block [61] as the primary architecture for the CFEB. The structure is depicted in Fig. 2. It extracts global information from input feature images. Similar to the SNL block, the global context block can construct effective long-distance relationships to capture global context information. It also significantly reduces computational costs and the number of parameters, similar to a lightweight squeeze-and-excitation (SE) block. It includes a 2-D convolution layer, a Softmax layer, a LayerNorm layer, and an ReLU layer. The LayerNorm layer decreases the optimization difficulty imposed by the usage of two bottleneck transform layers while also improving the module's generalization ability. The global context block is used to perform context modeling on the full input object map so that the perceptual field can cover the entire object map, effectively extending the semantic information extracted by CropNet. It also makes up for the lack of feature diversity and enhances the context modeling capabilities for classifying crops. The global context block can be expressed using the following formula: \begin{equation*} z_{i}=x_{i}+W_{v 2} \text{ReLU}\left(\text{LN}\left(W_{v 1} \sum _{j=1}^{N_{p}} \frac{e^{W_{k} x_{j}}}{\sum _{m=1}^{N_{p}} e^{W_{k} x_{m}}} x_{j}\right)\right). \tag{1} \end{equation*} View Source$x_{i}$ and $z_{i}$ represent the input feature vector and output feature vector, respectively. $W_{k}$ represents the weight of the key value matrix in the attention weight. $e^{W_{k}x_{j} }$ represents the linear transformation result of the key matrix. $W_{v1}$ and $W_{v2}$ are the weight matrices of different linear transformations, respectively. $\text{LN}$ represents layer normalization. $\text{ReLU}$ represents the ReLU activation function. $\delta (\cdot) W_{v 2} \operatorname{ReLU}(\operatorname{LN}(W_{v 1}(\cdot)))$ represents bottleneck transform \begin{equation*} \alpha _{j}=\frac{e^{W_{k} x_{j}}}{\sum _{m} e^{W_{k} x_{m}}}. \tag{2} \end{equation*} View Source$m$ represents the index of each input feature in the input sample. $\alpha _{j}$ represents the global attention pooling weight.

Fig. 2.

Global context block architecture.

Show All

2) Edge-Sensitive Feature Extraction Block

The EFEB primarily employs the Sobel operator to improve the segmentation accuracy of crop edge information. The Sobel operator was first proposed by American computer scientists Alvin Sobel and Gary Feldman in 1968 at the Stanford University Artificial Intelligence Laboratory. It is now commonly used for edge detection in image processing applications. The Sobel operator can extract information regarding the differences in grayscale values from the feature map. Since crops and backgrounds in vehicle-view images exhibit different image gradients, the Sobel operator can more accurately detect edge information for various crop categories. The Sobel operator enhances the edge information of the input image both laterally and vertically using two sets of $3\times 3$ matrices. This improves the model's sensitivity to edge characteristics through convolution. As indicated in formulas (3) and (4), $A$ represents the initial input picture, while $G_{x}$ and $G_{y}$ represent the images after the convolution procedure, respectively. The Sobel operator's computation process can then be stated using the formula as follows: \begin{align*} G_{x}=& \left[\!\begin{array}{lll}+1.0 & 0.0 & -1.0 \\ +2.0 & 0.0 & -2.0 \\ +1.0 & 0.0 & -1.0 \end{array}\!\right] * A \tag{3} \\ G_{y}=& \left[\!\begin{array}{ccc}+1.0 & +2.0 & +1.0 \\ 0.0 & 0.0 & 0.0 \\ -1.0 & -2.0 & -1.0 \end{array}\!\right] * A. \tag{4} \end{align*} View Source$G$ represents the gradient in the abovementioned formula. For each pixel in the input image, use the generated image pixel obtained above as its horizontal and vertical approximation. The gradient computation is then represented as follows: \begin{equation*} G=\sqrt{G_{x}^{2}+G_{y}^{2}}. \tag{5} \end{equation*} View SourceSimilarly, the gradient direction is determined using both the horizontal and vertical approximations. The steps are as follows: \begin{equation*} \theta =\arctan \frac{G_{y}}{G_{x}}. \tag{6} \end{equation*} View Source

3) Multifeature Fusion Block

This article incorporates the SE block [62] into the MFFB. The SE block is often used as a substructure in other classification or detection models. The basic concept is to modify feature weights by computing the interdependence of feature channels in convolutional networks. The SE block may enhance the expressiveness of effective features and improve the model's capacity in order to extract feature information. It calculates and trains based on the feedback from the loss image feature weights. By utilizing global information to adaptively enhance informative features, the effective element weights are increased, while the less relevant element weights are reduced, improving feature extraction through weight redistribution.

The SE block can learn the significance of various feature weights in crop feature maps, resulting in category-related feature maps after weight modification. Finally, the resulting category-related feature map is fed into a two-dimensional convolution with the output channel number M, and the feature map channels are assigned to the category number M. The feature maps are then upsampled to dimensions (M, H, W). The highest category probability in the category channel determines the final category at each position. The attention-based feature fusion module may dynamically identify the most appropriate category information based on the various feature probabilities retrieved from crop images, thereby improving model recognition accuracy. Simultaneously, the lightweight design ensures the overall recognition speed of CropNet. In addition, in order to reduce the rough and fuzzy boundaries and further improve the prediction accuracy of crop semantic segmentation, CropNet measures the difference between the output image and the ground truth after CIEM, LFEM, and ECFEM, and calculates the loss as $\text{Loss}1$, $\text{Loss}2$, and $\text{Loss}3$, respectively, and obtains Loss through the following formula: \begin{equation*} \text{Loss}=\alpha \cdot \text{Loss}1+\beta \cdot \text{Loss}2+\gamma \cdot \text{Loss}3. \tag{7} \end{equation*} View SourceAmong them, $\alpha = 0.1$, $\beta = 0.1$, and $\gamma = 0.8$ are satisfied.

SECTION IV.

Experiment and Analysis

A. Experimental Crop Dataset

Crop spatial features provide an important theoretical foundation for crop extraction. Spatial features refer to various ground elements, such as landforms, topography, hydrology, and vegetation, along with their accurate spatial representation in images [63]. These differences in features allow crops and other vegetation to be distinguished in images. Therefore, choosing vehicle-view image samples for research can better obtain the spatial features of crops. The information in crop images mostly contains the following nine categories: background, bareland, wasteland, soybean, rice, rape, corn, wheat, and fallow.

The original sample photos of the crop dataset used in this investigation are from the GVG capture system [64]. The GVG capture system was developed by the Institute of Remote Sensing Applications, Chinese Academy of Sciences. It mainly dynamically collects land cover types by taking photos beside roads, with the purpose of investigating land cover types and their proportions. During collection, the location information of the captured image is automatically obtained and stored in the system's background database. The dataset used in this work consists of some crop images saved in the system database. It includes 1112 vehicle-view image samples, with 960 for the training set and 152 for the test set.

Based on the information provided above, experts from the Institute of Remote Sensing Applications, Chinese Academy of Sciences constructed a crop dataset, and the images in the dataset were semantically annotated for various crops. The presence of crop samples in each image was determined by human visual interpretation, and the sample images were then annotated at the pixel level using LabelMe software (https://github.com/wkentaro/labelme).

B. Experimental Data Processing

The data for the experiment were obtained from crop images captured periodically by vehicles equipped with the GVG agricultural information gathering system, along a specified route. The final dataset comprises 1112 vehicle-view images, with 960 samples allocated to the training set and 152 samples to the test set.

This article proposes CropNet, a crop semantic segmentation model based on a multifeature aggregation neural network, which takes fixed-size samples as input. Since the size of each sample image in the original dataset is different, it is necessary to use a sliding window to divide the sample images in the training set and test set. Each plot is divided into image blocks of 512×512pixels. Fig. 3 depicts an intuitive illustration of the sliding window method for partitioning an image. By employing sliding windows to segregate the samples in the crop dataset, a cropped training set and a test set were established, containing 18 780 and 2951 image patches of the same size, respectively.

Fig. 3.

Demonstration of the image segmentation process using a sliding window.

Show All

C. Evaluation Index

In order to evaluate the model provided in this article scientifically and reasonably, we choose overall accuracy (OA) and intersection over union (IoU) as the key indexes for evaluating the experimental results of the semantic segmentation of this model, and calculate the corresponding mOA and mIoU. OA represents the proportion of correctly categorized classes to the total number of classes in all test sets, indicating the accuracy of sample classification. IoU is a measure of the proportion of crosses and combinations of two sets of true and estimated values. The formula is calculated as follows: \begin{align*} \text{OA}=&\frac{\text{TP+TN}}{\text{TP+TN}+\text{FP+FN}} \tag{8}\\ \text{IoU} =&\frac{\text{TP }}{\text{TP +FP+FN}} \tag{9}\\ \text{mOA}=&\frac{1}{M} {\textstyle \sum _{i=1}^{M}} \text{OA}_{i} \tag{10}\\ \text{mIoU}=&\frac{1}{M} {\textstyle \sum _{i=1}^{M}} \text{IoU}_{i} \tag{11} \end{align*} View Sourcewhere TP and FP, respectively, represent the number of samples of the positive class and the negative class that are judged as positive classes, and TN and FN, respectively, represent the number of samples of the negative class and the positive class that are judged as the negative class. $M$ denotes the sample category.

D. Ablation Study

To evaluate the performance of the overall structure of the proposed CropNet with the three important modules, we conducted ablation experiments.

1) Effect of GIEM

The results in Table I show that the mIoU of the model that only introduces GIEM is 38.32%. After introducing GIEM and LFEM and simply combining them, the mIoU is 60.4%, which is quite different from the case of introducing GIEM alone. This indicates that in the crop interpretation task of this article, although GIEM uses a convolutional layer constructed by the Conv2D+BN+ReLU method to enhance the feature differences between the cropped area and the background. The recognition accuracy is less significant because the image may contain multiple crop categories. At the same time, extraction from the global perspective fails to extract detailed information, which may also lead to poor crop segmentation results.

TABLE I Detailed Performance Comparison of Each Module in CropNet

2) Effect of LFEM

The results are shown in Table I. It can be found that when LFEM is introduced, the model's mIoU is 60.7%. The introduction of LFEM can effectively improve the crop interpretation performance of CropNet. Compared with the mIoU (60.4%) after the simple fusion of LFEM and GIEM, the mIoU (60.7%) of using LFEM alone is 0.3%. This shows that in LFEM, the feature pyramid generated by the MibileNetV2 architecture can effectively extract local detail information.

3) Effect of ECFEM

Table I shows that the introduction of ECFEM improves the model's mIoU (65.0%) by 4.6% compared to the model with a simple combination of LFEM and GIEM (60.4%). The model with ECFEM is better than the model with LFEM or GIEM alone, which shows the necessity of introducing ECFEM. The global context block and Sobel operator used in ECFEM enhance the extraction of contextual and edge information, respectively. This better integrates the feature information obtained by the two modules, leading to more accurate recognition results.

E. Analysis and Comparison of Experimental Results

The hardware platform parameters used in the experiment are: CPU I9 12900kf, Memory 64 GB DDR4, and GPU NVIDIA RTX 3090. The software platform parameters are: operating system Windows 10, Pytorch-based deep learning framework (torch version 1.8.2), and CUDA 11.1.

To evaluate the effectiveness of our CropNet model in crop type semantic segmentation, multiple comparative tests were conducted under the same experimental settings. The models compared in this article include BiSeNetv1 [65], BiSeNetv2 [66], FPN [67], and U-Net [43]. BiSeNetv1 can capture spatial and semantic information at the same time while preserving fast processing and effective feature expression capabilities. Building on BiSeNetv1, BiSeNetv2 optimizes the structure by removing time-consuming cross-layer connections, resulting in a more clear and straightforward design and being more lightweight and efficient. FPN can combine features from several layers and predict independently for each fused feature layer. It then aggregates data from several layers to increase prediction accuracy. U-Net improves edge extraction by merging channel numbers and ensures that the output image is the same size as the input image, resulting in more precise semantic segmentation results.

The experimental results are presented in tables including Tables II–IV. In addition, Table II displays the IoU value of the semantic segmentation results obtained from the comparative model on the crop test set. Table III illustrates the OA values of the semantic segmentation results. Table IV gives information on the amount of parameters and FLOPs generated during different model procedures.

TABLE II IoU Values for the Segmentation Results of the Models on the Crop Dataset

TABLE III OA Values for the Segmentation Results of the Models on the Crop Dataset

TABLE IV Parameters and Floating Point of Operations (FLOPs) of the Segmentation Results of the Model on the Crop Dataset

Table II shows the IoU value of each category of different semantic segmentation models on the crop test set. In the bareland category, all models had IoU values of less than 60%, with the largest result (55.5%) for CropNet and the smallest result (20.6%) for U-Net. In the wasteland category, CropNet had the highest results (64.5%) and BiSeNetv2 had the lowest IoU value (52.6%). In the fallow category, the CropNet model gave the best results (51.75%), which was almost double the results of the U-Net model. In the soybean category, CropNet had the highest accuracy (64.0%). In the rice category, CropNet had the highest IoU value (61.7%), which was almost three times that of the U-Net model (22.6%). In the rape category, the BiSeNetv1 model had the highest prediction accuracy (80.4%), and the IoU value of the CropNet model (79.0%) was slightly lower than the former. In the corn category, all models had IoU values less than 56.0%; the CropNet model was the best and the BiSeNetv2 model was the worst. In the wheat category, CropNet (58.7%) and BiSeNetv1 (57.8%) had higher IoU values. In conclusion, the IoU values of GropNet are higher than those of the other four models in all categories except the oilseed rape category. In the oilseed rape category, the IoU value of CropNet is only slightly lower than that of the highest BiSeNetv1 model, with a numerical difference of about 1.48%. The results show that the CropNet model has the highest recognition accuracy on the crop test set.

Table III displays the OA value for each category of the various semantic segmentation models on the crop test set. In the bare ground category, the OA values of CropNet (66.7%), BiSeNetv1 (58.5%), and FPN (56.2%) are all above 55%, and the OA value of CropNet is more than twice that of U-Net. In the wasteland category, the OA values of all models are above 70%, with CropNet (88.2%) being the highest. In the fallow land category, the OA values of all models are below 70%, among which CropNet has the highest (67.3%), which is approximately 3 times the OA value of U-Net (26.1%). In the soybean, rice, rape, and barnyard grass categories, CropNet has the highest OA values, which are 86.3%, 84.2%, 88.4%, and 68.8% respectively. In the wheat category, FPN has the highest OA value (76.3%), which is only about 5.0% better than U-Net (71.3%). In summary, the CropNet model has higher OA values than the other four models in all categories except wheat. The accuracy of CropNet in the wasteland, soybean, and rapeseed categories exceeds 85.0%, and the accuracy in the rice category exceeds 80.0%, indicating that CropNet has a high judgment accuracy on the crop test set and has a good semantic segmentation effect.

In the last row of Tables II and III show the mIoU and mOA values of the different models on the crop test set. Among the other four compared models, the results of mIoU and mOA values were in descending order: BiSeNetv1, FPN, BiSeNetv2, and U-Net. In contrast, this article proposes that CropNet has the highest mIoU (65.0%) and mOA (91.2%) values. This indicates that CropNet is the best for category recognition in crop datasets. To demonstrate the performance and computational overhead of CropNet, we compared the parameters and FLOPs of different models, as shown in Table IV. Parameters refers to the total number of parameters trained by the deep learning model during the entire training process. FLOPs refers to the amount of computation of the model. These two reflect the spatial complexity and time complexity of an algorithm model, respectively. As shown in the table, CropNet has the least parameters (3.38 M) among all the models in the comparison experiment. This shows that when other experimental conditions are the same, CropNet has the least computational burden when performing crop semantic segmentation. The comparison results of FLOPs show that CropNet has the second smallest FLOPs, and the amount of computation is only higher than the BiSeNetv2 model. This shows that CropNet can perform crop semantic segmentation tasks with fewer parameters and computation. In addition, combined with the results in Tables II and III, CropNet surpasses other models in terms of comprehensive performance and accuracy. While obtaining better semantic segmentation results, it has a faster operation rate and less computational burden.

Fig. 4 shows the comparison between the semantic segmentation results of CropNet and the ground truth. The crop distribution area predicted by CropNet is consistent with the ground truth, and there are few misclassified areas. This is because CropNet combines the extraction results of global and local feature information and contextual semantic information for prediction. On the other hand, the boundary contours of crops output by CropNet are more detailed, reflecting the geometric characteristics of different crops. This shows that CropNet can smooth noise during edge extraction, reduce the impact of noise on edge extraction, clearly depict the geometric contours of crops, and enhance boundary extraction capabilities. In addition, the area extracted by crop categories is relatively continuous, and the misclassified areas are less interfered with by noise points. This shows that the design of CropNet can extract differential features and capture more semantic information of crops. The use of SE block can also make more accurate category judgments through the recognition probability of different crops. Based on the abovementioned information, the CropNet model can perform semantic segmentation of crops more accurately.

Fig. 4.

Segmentation prediction results of CropNet.

Show All

Fig. 5 describes the segmentation results of the new lightweight neural network CropNet and four comparison models on the crop test set. Compared with other network models, CropNet is able to retain the geometric features of the classification information in the image more accurately, and the segmentation shapes output by it are more regular with smoother edges. And the extracted content is more continuous and the segmentation boundary of the image is less noisy. These results indicate that CropNet works better in crop semantic segmentation tasks with fewer misclassification cases. Fig. 6 shows the confidence of CropNet and the four comparison models in successfully recognizing crop locations on the crop category's feature map. These category-related feature maps are acquired in the feature maps prior to the Softmax classifier layer in CropNet and correspond to the probability values in the crop category channel to determine the category of the crop region. The red areas in the category-related feature map indicate high confidence in the crop, while the purple areas indicate the opposite. In the category-related feature map of CropNet, the crop area is closer to the ground truth area. The boundary between the red and purple areas in CropNet's category-related feature map is more distinct than in other network models. These results show that CropNet has a stronger ability to classify crop categories and more accurate recognition.

Fig. 5.

Comparison results of segmentation prediction of CropNet with U-Net, FPN, BiSeNetv1, and BiSeNetv2.

Show All

Fig. 6.

Comparison results of category-related feature maps of CropNet with U-Net, FPN, BiSeNetv1, and BiSeNetv2.

Show All

Fig. 7 shows that there are also some errors in CropNet's segmentation results. While shooting the original vehicle-view image, there are some interference factors, such as dark shooting environment, interference occlusion, blurry shooting, and reflection shooting. These reasons will cause the segmentation results to be unrecognizable or unrecognizable, resulting in CropNet's segmentation effect for this part of the image not meeting expectations. These are some limitations of the model for crop semantic segmentation.

Fig. 7.

Demonstration of misclassification results of original vehicle-view images in CropNet and their causes.

Show All

SECTION V.

Conclusion

Crop extraction is crucial for national food security, the economic development of the food industry, and the effective improvement of national macrocontrol. Current deep learning-based extraction methods that use hyperspectral and drone images involve high memory usage and computing costs. Hyperspectral images taken from an overhead perspective fail to capture the intricate longitudinal characteristics of plants, which encourages the use of vehicle-view image datasets. The current deep learning-based extraction method, which utilizes hyperspectral and drone photos, incurs high memory usage and computing costs. To address the challenge of extracting similar features across different crop categories and handling blurred edges, we introduce CropNet, a lightweight neural network based on multifeature aggregation. This architecture comprises two feature aggregation modules: the global-local aggregation module and the multifeature aggregation module. The feature fusion module employs an attention mechanism to assign weights to individual feature channels automatically. In conclusion, CropNet provides advantages, such as fast, lightweight network processing, fewer parameters, and improved feature extraction efficiency.

ACKNOWLEDGMENT

The authors would like to thank for the labeled data from the GVG provided by CropWatch Team.

References is not available for this document.

Multiclass Crop Interpretation via a Lightweight Attentive Feature Fusion Network Using Vehicle-View Images

Abstract:

Metadata

Abstract:

ISSN Information:

Funding Agency:

Introduction