Semantic Segmentation of Crop and Weed using an Encoder-Decoder Network and Image Enhancement Method under Uncontrolled Outdoor Illumination

Weeds are among the major factors that can harm crop yield. Site-specific weed management has become an effective tool to control weeds, and machine vision combined with image processing is an effective approach for weed detection. In this work, an encoder-decoder deep learning network was investigated for pixel-wise semantic segmentation of crop and weed. Different input representations, including different color space transformations and color indices, were compared to optimize the input of the network. Three image enhancement methods were investigated to improve model robustness against different lighting conditions. The results show that for images without enhancement, color space transformations and vegetation indices without NIR (Near Infrared) information did not improve the segmentation results, while inclusion of NIR information significantly improved the segmentation accuracy, indicating the effectiveness of NIR information for precise segmentation under weak lighting conditions. Image enhancement improved the image quality and consequently the robustness of segmentation models against different lighting conditions. The best MIoU value for pixel-wise segmentation was 88.91% and the best mean accuracy of object-wise segmentation was 96.12%. The deep network and image enhancement methods applied in this work provided promising segmentation results for weed detection and did not need a large amount of data for model training, which makes them suitable for site-specific weed management.


I. INTRODUCTION
Agriculture is facing tremendous challenges from weeds, which appear randomly everywhere in the field and compete with crops for water, nutrients and sunlight, resulting in a detrimental impact on crop yield and quality if not controlled properly [1], [2]. Numerous studies have demonstrated a strong correlation between crop yield loss and weed competition; the production loss due to weeds can be up to 34% [3]-[6]. To control weeds, different operations have been adopted, among which chemical weeding has been the most widely used one since the 1940s. However, conventional chemical weeding sprays herbicides uniformly over the whole field, resulting in the overuse of herbicides and further leading to catastrophic environmental pollution problems [3]. To counteract these issues, site-specific weed management (SSWM) was introduced. In SSWM, accurate weed identification is crucial, as it provides the necessary individual target information for spraying to the control system [7].
Machine vision is one of the most popular approaches and has been investigated extensively for weed identification [8]. Conventional procedures for weed detection with machine vision include image pre-processing, segmentation, feature extraction and classification [2], [9]. In the feature extraction procedure, handcrafted features are usually extracted and optimally selected, and then used for classification. For images captured under ideal conditions and at specific plant growth stages, these conventional methods provide very promising classification results, with performance on the order of 80-95% in terms of accuracy [7]. However, for real applications in the field, the task becomes extremely challenging. Weed identification accuracy is influenced by weed density, weed distribution characteristics, varying lighting conditions in the field, occlusion or overlapping of the leaves of crops and weeds, different growth stages of plants, etc. [10]-[13]. Handcrafted features extracted from color, shape, texture and spectrum are not robust to changes in these factors, leading to the poor robustness and low generalization capability of conventional crop-weed classification methods, and imposing difficulties on the practical application of such methods in precision agriculture [7], [14].
Deep learning has been investigated extensively for image processing and has also been applied in agriculture, including weed identification. Compared with conventional machine learning methods for identifying weeds from digital images, deep learning can automatically learn the hierarchical feature expressions hidden deep within the images, avoiding the tedious procedures of extracting and optimizing handcrafted features [14]. In addition, semantic segmentation is one of the most effective approaches for alleviating the effect of occlusion and overlapping, since pixel-wise segmentation can be achieved. Some deep learning algorithms have been investigated for weed detection. Dyrmann et al. [15] trained a fully convolutional network based on the GoogLeNet architecture to detect weed locations in leaf-occluded cereal crops, which yielded a recall of 46.3% and a precision of 86.6%. To cope with substantial environmental changes with respect to weed pressure, weed types, growth stages of the crop, visual appearance, and soil conditions, Lottes et al. [7] adopted a fully convolutional network (FCN) with an encoder-decoder structure and incorporated spatial information by considering image sequences. Both RGB and NIR (Near Infrared) images were used for model training. Results showed that the method substantially improved the accuracy of crop-weed classification. Similarly, Milioto et al. [16] constructed an end-to-end encoder-decoder semantic segmentation network, and fed the network with 14 different vegetation indices and alternate representations as input for semantic weed/crop/background segmentation. The proposed method could properly deal with heavily overlapping objects and a large variety of growth stages, yielding the best MIoU (mean intersection over union) value of 80.8% for pixel-wise segmentation. Though promising results can be obtained with these deep learning-based methods for weed identification, there is still room for improvement.
Deep learning networks can learn effective features for weed detection, but they are also affected by varying lighting conditions, which were not fully considered in the aforementioned studies. To further improve semantic segmentation accuracy, Chen et al. [17] proposed an encoder-decoder network with atrous separable convolution for semantic image segmentation. The network can refine the segmentation results, especially along object boundaries, and yields state-of-the-art performance on the PASCAL VOC 2012 and Cityscapes datasets. However, this network also did not consider varying lighting conditions. Therefore, this work aimed at performing pixel-wise semantic segmentation of field images into soil, crop and weed. Specifically, (1) an encoder-decoder network with atrous separable convolution was investigated for semantic crop/weed/soil segmentation; (2) different input representations, including different color space transformations and color indices, were compared to analyze the effect of the input representation on the performance of the adopted network; and (3) model robustness with respect to lighting conditions was improved by image enhancement.

II. MATERIALS AND METHODS

A. IMAGE DATASETS
Two image datasets were evaluated in this work. One is a publicly available sugar beet image dataset (http://www.ipb.uni-bonn.de/data/sugarbeets2016/) captured with a readily available agricultural robotic platform, BoniRob, on a sugar beet farm near Bonn in Germany over a period of three months in spring 2016 [18]. The other is an oilseed image dataset captured with a commercial RGB camera (Canon 60D, 50 mm lens, 5184 pixel × 3456 pixel) mounted on a gantry-type frame at a height of around 1.5 m above the soil at our own test field on campus in early winter 2017. The sugar beet dataset consists of both RGB and corresponding NIR images, captured with a JAI AD-130GE multi-spectral camera at an image resolution of 1296 pixel × 966 pixel. The JAI AD-130GE camera, consisting of an RGB sensor and an NIR monochromatic sensor, was mounted to the bottom of the BoniRob robot chassis at a height of around 85 cm above the soil. The NIR monochromatic sensor collects signals within the spectral range of 750-1000 nm, with a sensitivity peak at around 780 nm. The sugar beet was in an early growth stage and the weed density was relatively low, with slight overlapping of the leaves of sugar beet and weed. The lighting conditions for the RGB images appear poor, as their brightness and contrast are low, as shown in Figure 1a. There are 283 images in the sugar beet dataset with ground-truth labeling provided by Chebrolu et al. [18], from which we randomly selected 200 images for training and 83 images for evaluation. For our oilseed dataset, the captured images with a resolution of 5184 pixel × 3456 pixel were cropped into more images with a resolution of 1550 pixel × 3456 pixel. The oilseed was also in an early growth stage, but with heavy weed pressure and overlapping. The oilseed images (Figure 1b) were captured under the direct illumination of sunlight, with some shadow regions.
These 68 RGB images were annotated by hand, with 50 images for training and 18 images for evaluation. To further verify the generalization capability of the proposed approach, the datasets were augmented by gamma correction and by changing the L component in HSL color space. For the sugar beet dataset, the gamma value was set to 0.5, 1.5 and 2, and the L component was increased by 50, 100 and 150, and decreased by 40, respectively. For the oilseed dataset, the gamma value was set to 0.5, 1.5 and 2, and the L component was increased by 50, and decreased by 50 and 100, respectively. Examples of the augmented images are shown in Figure 1.
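The two augmentation operations above can be sketched as follows. This is a minimal illustration assuming 8-bit RGB arrays and the standard-library colorsys HLS conversion; the paper does not specify its own implementation, and the per-pixel loop is for clarity rather than speed.

```python
import colorsys
import numpy as np

def gamma_correct(img, gamma):
    """Apply gamma correction to a uint8 RGB image array."""
    norm = img.astype(np.float64) / 255.0
    return np.clip(np.round(255.0 * norm ** gamma), 0, 255).astype(np.uint8)

def shift_lightness(img, delta):
    """Shift the L component in HLS space by `delta` (on a 0-255 scale)."""
    out = np.empty_like(img)
    h, w, _ = img.shape
    for i in range(h):
        for j in range(w):
            r, g, b = img[i, j] / 255.0
            hh, ll, ss = colorsys.rgb_to_hls(r, g, b)
            ll = min(max(ll + delta / 255.0, 0.0), 1.0)  # clamp L to [0, 1]
            r2, g2, b2 = colorsys.hls_to_rgb(hh, ll, ss)
            out[i, j] = [round(r2 * 255), round(g2 * 255), round(b2 * 255)]
    return out
```

A gamma below 1 brightens mid-tones while a gamma above 1 darkens them, which is why the settings 0.5, 1.5 and 2 span both brighter and darker variants of each image.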

B. IMAGE PREPROCESSING
Image preprocessing can help to improve the generalization capability of a classification model by aligning the training and test data distributions and improving the image quality [7]. As the lighting conditions substantially affect the robustness of a classification model, three image enhancement methods were investigated in this work. For the input of the deep network, Milioto et al. [16] deployed 14 different input representations, including different vegetation indices and raw input in different color spaces, to improve the performance of the classification model and reduce the amount of images needed for training. Similarly, several different vegetation indices and color spaces were also evaluated as input representations in this work.

1) IMAGE ENHANCEMENT
The two datasets involved in this work were acquired under totally different lighting conditions, as can be seen in Figure 1. The images in the sugar beet dataset have low brightness and contrast. By contrast, the images in our oilseed dataset were captured under the direct illumination of sunlight. The brightness and contrast of the oilseed images are high, with some regions close to saturation. There are also some shadow regions in the oilseed images, imposing more difficulties for crop/weed classification. To alleviate the effect of different lighting conditions and improve the robustness of the classification model, three image enhancement methods were evaluated.

a: HISTOGRAM EQUALIZATION
Histogram equalization (HE) is a powerful scheme for adjusting image intensities to enhance contrast. In grey-scale histogram equalization, the method rearranges the grey values in such a way that the modified histogram resembles the histogram of a uniform distribution [19]. The detailed principle and implementation procedure of HE can be found in reference [20]. For color images with three channels, applying the same technique to the three channels independently causes unequal shifts in the three components, resulting in a change of hue of the pixels [21]. The HE preprocessing for color images adopted in this work is to equalize only the intensity component in the HSI color space, and then transform the equalized HSI image back to the RGB color space [22].
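The grey-level equalization step can be sketched with the standard CDF mapping. This is a generic illustration rather than the authors' exact code, and it operates on a single channel; for color images the paper applies it only to the intensity component of HSI.

```python
import numpy as np

def equalize_hist(channel):
    """Histogram-equalize a single uint8 channel via the CDF mapping."""
    hist = np.bincount(channel.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]          # first non-zero CDF value
    n = channel.size
    # classic HE lookup table: spread grey levels toward a uniform histogram
    lut = np.round((cdf - cdf_min) / (n - cdf_min) * 255.0)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[channel]
```

Because the mapping is built from the cumulative histogram, frequently occurring grey levels are spread apart, which is what raises the contrast of the low-contrast sugar beet images.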

b: AUTO CONTRAST
The process of contrast enhancement increases the perceptibility of the objects in the image. To enhance the contrast of the images involved in this research, the Auto Contrast algorithm used in the commercial software Adobe Photoshop CS6 (Adobe Systems Software Ireland Ltd.) was applied. The Auto Contrast operator does not adjust channels individually and does not introduce or remove color casts. It simply darkens the darkest pixels to pure black, lightens the lightest pixels to pure white, and redistributes all the other tonal values in between proportionally. This makes the highlights appear lighter and the shadows appear darker. Pseudocode demonstrating the process of Auto Contrast is given in Algorithm 1. The parameter 'percent' in the algorithm is the clipping percentage, and 'delta' is a parameter to fine-tune the enhanced image. For RGB images, the R, G and B channels are fed to the algorithm, while for NIR images, the input is just one channel.
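A minimal NumPy sketch of such a percentile-based contrast stretch is given below. It follows the description above rather than Photoshop's proprietary implementation; clipping the upper tail at the (1 − percent) quantile and treating `delta` as a gain factor are assumptions.

```python
import numpy as np

def auto_contrast(channel, percent=0.01, delta=0.0):
    """Percentile-based contrast stretch of one channel, normalized to [0, 1].

    `percent` clips the darkest/lightest tails; `delta` is an optional
    fine-tuning gain on the stretched mid-tones.
    """
    x = channel.astype(np.float64)
    x = x / x.max()
    flat = np.sort(x.ravel())
    n = flat.size
    lo = flat[int(n * percent)]                       # clipped dark point
    hi = flat[min(int(n * (1.0 - percent)), n - 1)]   # clipped light point
    out = (x - lo) * (1.0 + delta) / (hi - lo)
    return np.clip(out, 0.0, 1.0)   # darkest -> 0 (black), lightest -> 1 (white)
```

Clipping a small percentage of the tails before stretching prevents a few outlier pixels from compressing the tonal range of the whole image, which is the point of the `percent` parameter.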

c: DEEP PHOTO ENHANCER
A Deep Photo Enhancer based on unpaired learning, proposed by Chen et al. [23], was applied for image enhancement. As shown in Figure 2, the method is based on the framework of two-way generative adversarial networks (GANs), with a U-Net augmented with global features acting as the generator in the GAN model. The Wasserstein GAN (WGAN) was improved with an adaptive weighting scheme, resulting in faster and better training convergence. In addition, individual batch normalization layers for the generators in the two-way GANs were used to better adapt to the characteristics of their own inputs. For enhancing the images in this work, the model provided by the authors was adopted, which was trained on photographer labels of the MIT-Adobe 5K dataset as well as an HDR dataset selected from Flickr images tagged with HDR.

2) INPUT REPRESENTATIONS
To facilitate greenness identification and plant classification, several frequently used color spaces and vegetation indices were adopted as input representations for model training. The YCrCb and YCgCb color spaces have been shown to be effective for greenness segmentation [24], [25], so the raw images in these two color spaces were used as two input representations. The vegetation indices involved include NDI (Normalized Difference Index), NDVI (Normalized Difference Vegetation Index), ExG (Excess Green), ExR (Excess Red), ExGR (Excess Green minus Excess Red), CIVE (Color Index of Vegetation), VEG (Vegetative Index), MExG (Modified Excess Green Index), COM1 (Combined Indices) and COM2, as calculated by Equations (1)-(10) [2], [16]. These indices were developed for vegetation extraction and are less sensitive to changes in field conditions.
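A few of these indices can be sketched as follows, using the commonly cited definitions; the paper's exact Equations (1)-(10) follow [2], [16], and the remaining indices (NDI, VEG, MExG, COM1, COM2) are omitted here. The chromaticity normalization (r + g + b = 1) before computing the ExG family is the usual convention and an assumption on our part.

```python
import numpy as np

def vegetation_indices(rgb, nir=None):
    """Compute common vegetation indices from a float RGB (and optional NIR) image."""
    R, G, B = [rgb[..., k].astype(np.float64) for k in range(3)]
    s = R + G + B + 1e-12                         # avoid division by zero
    r, g, b = R / s, G / s, B / s                 # chromaticity coordinates
    idx = {
        "ExG": 2 * g - r - b,                     # Excess Green
        "ExR": 1.4 * r - g,                       # Excess Red
        "CIVE": 0.441 * r - 0.811 * g + 0.385 * b + 18.78745,
    }
    idx["ExGR"] = idx["ExG"] - idx["ExR"]         # Excess Green minus Excess Red
    if nir is not None:
        N = nir.astype(np.float64)
        idx["NDVI"] = (N - R) / (N + R + 1e-12)   # needs the NIR channel
    return idx
```

Note that NDVI is the only index here that requires NIR information, which is why it is available for the sugar beet dataset but not for the RGB-only oilseed dataset.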

C. NETWORK ARCHITECTURE
An encoder-decoder network with atrous separable convolution was investigated for semantic image segmentation in this work. As shown in Figure 3, the encoder module encodes multi-scale contextual information by applying depthwise atrous convolution at multiple scales. Atrous convolution is a powerful tool that allows the features computed by deep convolutional neural networks to be extracted at an arbitrary resolution. Depthwise separable convolution drastically reduces computational complexity by factorizing a standard convolution into a depthwise convolution followed by a pointwise convolution. In the encoder-decoder network, depthwise atrous convolution combines atrous convolution and depthwise separable convolution to reduce computational complexity while maintaining similar (or better) performance. A simple yet effective decoder first bilinearly upsamples the encoder features by a factor of 4; these features are then concatenated with corresponding low-level features from the network backbone. After the concatenation, several 3 × 3 convolutions are applied to refine the features, followed by another simple bilinear upsampling by a factor of 4. With these operations, the decoder module can refine the segmentation results along object boundaries. More details regarding the network can be found in reference [17]. The effectiveness of this encoder-decoder network has been proven on the PASCAL VOC 2012 and Cityscapes benchmarks, achieving test set performances of 89.0% and 82.1%, respectively, without any post-processing. The network was implemented with the Google TensorFlow library in Python 3.5.
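The depthwise-atrous-plus-pointwise factorization can be illustrated with a toy NumPy sketch. The actual model uses TensorFlow's implementation [17]; the (H, W, C) layout, stride 1, and 'same' padding convention here are assumptions for illustration only.

```python
import numpy as np

def depthwise_atrous_conv(x, dw_kernels, rate):
    """Depthwise convolution with dilation `rate` ('same' padding, stride 1).

    x: (H, W, C) input; dw_kernels: (k, k, C), one k x k filter per channel.
    """
    k = dw_kernels.shape[0]
    pad = rate * (k - 1) // 2                      # 'same' padding for odd k
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    H, W, C = x.shape
    out = np.zeros_like(x, dtype=np.float64)
    for u in range(k):
        for v in range(k):
            # sample the input on a grid spaced by `rate` (the atrous part)
            patch = xp[u * rate:u * rate + H, v * rate:v * rate + W, :]
            out += patch * dw_kernels[u, v, :]     # per-channel multiply-accumulate
    return out

def pointwise_conv(x, pw_kernels):
    """1 x 1 convolution mixing channels: pw_kernels has shape (C_in, C_out)."""
    return x @ pw_kernels
```

The separable atrous convolution is the composition of the two: the depthwise step captures spatial context at a dilation-controlled receptive field per channel, and the 1 × 1 pointwise step mixes channels, giving roughly a k²·C + C·C' cost instead of k²·C·C' for a standard convolution.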

D. TRAINING DEEP NETWORK
Training a deep model from scratch is computationally expensive and requires large amounts of labelled data. However, in this work we only have 200 and 50 training images for the sugar beet and oilseed datasets, respectively, making it impractical to train the models from scratch. Hence, we utilized knowledge from other segmentation domains to solve our problem via transfer learning [26] in a low-cost way. For a convolutional neural network that consists of a convolution base and fully-connected layers at the end, transfer learning means retraining the final layers of the network with new training data based on a previously trained network. This process slightly adjusts the weights of the final layers of the pre-trained model according to the input images. Therefore, in this work we trained the encoder-decoder network based on a model pre-trained on the PASCAL VOC 2012 dataset from the VOC challenges with 11530 images. This leads to much lower computational load and less training data, while retaining comparable segmentation accuracy.
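The principle of freezing the base and retraining only the final layer can be shown with a deliberately tiny stand-in. Everything below is synthetic: a random frozen projection plays the role of the pre-trained encoder (in the paper, the DeepLab encoder pre-trained on PASCAL VOC 2012), and only a logistic output layer is trained on the "new domain" data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "pre-trained convolution base": a frozen non-linear projection.
W_frozen = rng.normal(size=(8, 16))

def features(x):
    """Frozen feature extractor: its weights are never updated."""
    return np.tanh(x @ W_frozen)

# Synthetic new-domain data whose labels are expressible in the frozen feature space.
x = rng.normal(size=(200, 8))
w_true = rng.normal(size=16)
y = (features(x) @ w_true > 0).astype(np.float64)

# Transfer learning: retrain only the final (logistic) layer by gradient descent.
w, b = np.zeros(16), 0.0
f = features(x)                        # computed once -- the base is frozen
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(f @ w + b)))
    w -= 0.5 * f.T @ (p - y) / len(y)  # gradient of the logistic loss
    b -= 0.5 * (p - y).mean()

accuracy = ((p > 0.5) == (y > 0.5)).mean()
```

Because the gradient only flows into `w` and `b`, the number of trainable parameters (and hence the data requirement) is a tiny fraction of the full model, which is what makes training with 50-200 images feasible.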

E. EVALUATION
Three classes (soil, weed and crop) were considered in this work. The performance of segmentation was first measured in terms of pixel intersection over union (IoU) averaged across the three classes. The mean intersection over union (MIoU) can be calculated using Equation (11).
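Assuming Equation (11) is the standard definition, the per-class IoU is TP / (TP + FP + FN) and MIoU is its mean over the three classes, which can be computed directly from the label maps:

```python
import numpy as np

def mean_iou(y_true, y_pred, num_classes=3):
    """Mean intersection over union across classes (soil / weed / crop).

    For each class c: IoU_c = TP_c / (TP_c + FP_c + FN_c); MIoU is their mean.
    """
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(y_true == c, y_pred == c).sum()   # TP
        union = np.logical_or(y_true == c, y_pred == c).sum()    # TP + FP + FN
        ious.append(inter / union if union > 0 else np.nan)      # skip absent classes
    return np.nanmean(ious)
```

Averaging over classes rather than pixels keeps the dominant soil class from masking poor crop or weed segmentation.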
For automated weeding, it is more important to recognize the targeted object accurately, since the weeding actuator cannot perform pixel-wise operations. Therefore, an object-wise metric was also calculated to indicate the model's performance. We analyzed all objects with an area larger than 320 pixels, which was calculated by dividing the desired minimum object detection size of 1 cm² by the spatial resolution of 2 mm²/px in the 1296 × 966 images.

III. RESULTS AND DISCUSSION
A. IMAGE ENHANCEMENT
Figure 4 compares the visual effect of the different image enhancement methods. The raw RGB images (Figure 4a1) from the sugar beet dataset have low brightness and contrast. After enhancement, the brightness and contrast of the images were improved significantly. The images enhanced by the HE method (Figure 4a2) had the highest brightness and contrast, but their color was distorted and seemed very unnatural. That may be caused by the irreducible singularities of the transformation between the RGB and HSI spaces and by the fact that HE is only performed on the intensity component [22]. The RGB images enhanced by Photoshop Auto Contrast (Figure 4a3) looked bright and sharp, with visually appealing color. For the RGB images processed by the Deep Photo Enhancer (Figure 4a4), the brightness and contrast were also improved, but the change was not as significant as for the images processed by HE and Photoshop Auto Contrast. Different from the raw RGB images, the NIR images (Figure 4b1) from the sugar beet dataset have relatively high contrast, thanks to the ability of the NIR camera to capture information in low-illumination environments. After enhancement, the brightness and contrast of the NIR images were improved significantly. However, the images processed by the Deep Photo Enhancer (Figure 4b4) appear to be in color rather than grayscale. Analysis showed that these images consisted of three channels, caused by the three-channel output of the Deep Photo Enhancer. With respect to the oilseed dataset, the raw RGB images have high brightness and contrast, but were captured under the direct illumination of sunlight and contain some shadow regions. The difference between the raw and enhanced RGB images is visually marginal. Table 1 illustrates the pixel-wise segmentation results with different input representations and image enhancement methods. The corresponding segmentation results are shown in Figure 5 and Figure 6.
For the sugar beet dataset, transformation of the color space did not improve the segmentation results, with the RGB space yielding the highest MIoU value of 72.01%. Compared with the color images in the RGB, YCrCb and YCgCb spaces, the model trained with NIR images yielded a much better result, with an MIoU value of 79.28%. This is consistent with the observation in Figure 4 that the NIR images have higher brightness and contrast, facilitating the discrimination between sugar beet and weeds.

Algorithm 1 Auto Contrast
Input: RGB or NIR image with a dimension of row × col
Output: Enhanced image
1: I ← R·Parameter1 + G·Parameter2 + B·Parameter3
2: I ← I / max(I)
3: I_sort ← sort(I)
4: I_out ← I
5: I_min ← I_sort(row · col · percent)
6: I_max ← I_sort(row · col · (1 − percent))
7: for i ← 1 to row do
8:   for j ← 1 to col do
9:     if I(i, j) < I_min then
10:      I_out(i, j) = 0
11:    else if I(i, j) > I_max then
12:      I_out(i, j) = 1
13:    else
14:      I_out(i, j) = (I(i, j) − I_min)·(1 + delta)/(I_max − I_min)

B. PERFORMANCE OF SEMANTIC SEGMENTATION
Since the network only allows three channels as input, all representations listed in Table 1 are set as input with the format of (channel1, channel2, channel3). For grayscale images that only have one channel, such as the NIR images, the three input channels are identical. From the MIoU values it can be observed that the input representations including NIR information (No. 5, 7 and 9) provided much better performance than those without NIR information. By contrast, other vegetation indices did not benefit the segmentation, with some even deteriorating the performance. This again confirms the effectiveness of NIR information for precise segmentation under weak lighting conditions.
As stated previously, three image enhancement methods were applied to improve the image quality, and the enhanced images were then used for model training. After enhancement, the brightness and contrast of the images were improved significantly, and the performance of the segmentation models was boosted correspondingly; the results were all superior to the result obtained by Milioto et al. [16] using 14 channels as input. For the enhanced RGB images, comparison of the MIoU values showed that the three image enhancement methods performed similarly, with HE being slightly inferior. For the enhanced NIR images, the model trained with images enhanced by Photoshop Auto Contrast yielded the best results, followed by the model trained with images enhanced by the Deep Photo Enhancer. It can also be seen that models trained with enhanced images all yielded better segmentation results than those trained without enhancement. This could be attributed to the low brightness and contrast of the raw RGB images, which would result in missing boundary information of the objects. From Figure 5 it can be seen that the objects (sugar beet and weed) segmented by models trained with NIR information contain more abundant details along the object boundaries. By contrast, the boundaries of objects segmented by models trained without NIR information are much smoother, as if they had been processed by a dilation operation.
For our oilseed dataset, captured under better lighting conditions, the difference between the raw and enhanced RGB images is visually marginal, and the pixel-wise segmentation results also demonstrate that image enhancement did not change the segmentation performance (Figure 6). The deep network is capable of learning effective features hidden deep within images with high brightness and contrast, regardless of shadow regions. The three image enhancement methods did not cause any negative effect on segmentation. Comparing the segmentation results of the two datasets shows that the MIoU values of the oilseed dataset are all greater than those of the sugar beet dataset. This may be mainly caused by some mistaken labels in the sugar beet dataset (Figure 7). For the mislabelled objects shown in Figure 7, the deep model correctly segmented most of them. However, when calculating MIoU, they were not counted as true positives since they differed from the labels in the provided ground truth images. In addition, the poor illumination in the sugar beet dataset may also have some effect. The proposed method can alleviate but not totally eliminate the negative effect of poor illumination. This is also confirmed by the semantic segmentation results of the augmented datasets (Table 2), from which we can see that after changing the brightness and contrast of the images by altering the L component and gamma values, the MIoU values for segmentation without image enhancement (No. 1, 5, 9 and 13) decreased significantly to less than 77%, while after enhancement, the MIoU values all increased to over 82%, comparable to but slightly lower than those shown in Table 1. Generally, it can be concluded that for the sugar beet dataset, image enhancement improved the image quality and thus the robustness of the segmentation models in terms of different lighting conditions, and for the oilseed dataset, image enhancement did not degrade the performance of the segmentation models.
Another point is that the deep network applied in this work does not need a large amount of data for model training, thanks to the advantage of transfer learning. The segmentation model for the oilseed dataset was trained with only 50 images and yielded MIoU values around 88%.

C. OBJECT-WISE SEGMENTATION
Table 3 illustrates the object-wise segmentation results with different input representations and image enhancement methods. The connected areas larger than 320 pixels in the ground truth and prediction images, which were treated as objects, were counted, and the mean accuracy over the connected areas was calculated. For the sugar beet dataset, the mean accuracies of the different input representations did not differ obviously from each other, ranging from 93.55% to 96.06%. However, after image enhancement, the mean accuracies all decreased, which was counter-intuitive since image enhancement improved the performance of pixel-wise segmentation. Analysis found two reasons that may lead to this result. The first is that some objects were wrongly labelled in the ground truth images, as shown in Figure 7. The second is that the mean accuracy was calculated as the ratio of true positives to all objects, which tends to be larger for coarser segmentations, since the objects in coarser segmentations are larger than the objects in the ground truth images and cover the latter more easily. For segmentations No. 4, 5, 7 and 21 in Table 1 and Table 3, whose pixel-wise accuracies were close, the object-wise accuracies were also very close to each other, indicating that the image enhancement did not reduce the segmentation accuracy. This is further confirmed by the object-wise segmentation results for the oilseed dataset.
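The object-extraction step of this metric can be sketched as follows: connected regions of a class mask are collected and those above the area threshold are treated as objects. This is an illustrative implementation; the 4-connectivity choice is an assumption (the paper does not state the connectivity), and the matching of predicted objects to ground-truth objects is omitted.

```python
import numpy as np
from collections import deque

def objects_larger_than(mask, min_area):
    """Return connected components (4-connectivity) of a binary mask whose
    pixel area exceeds `min_area`, as lists of (row, col) coordinates."""
    mask = np.asarray(mask, dtype=bool)
    seen = np.zeros_like(mask)
    objects = []
    H, W = mask.shape
    for i in range(H):
        for j in range(W):
            if mask[i, j] and not seen[i, j]:
                comp, q = [], deque([(i, j)])      # BFS flood fill from (i, j)
                seen[i, j] = True
                while q:
                    a, b = q.popleft()
                    comp.append((a, b))
                    for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        na, nb = a + da, b + db
                        if 0 <= na < H and 0 <= nb < W and mask[na, nb] and not seen[na, nb]:
                            seen[na, nb] = True
                            q.append((na, nb))
                if len(comp) > min_area:           # keep only actuator-relevant objects
                    objects.append(comp)
    return objects
```

In the paper's setting, `min_area` would be 320 pixels and the mask would be the weed (or crop) class of a ground-truth or prediction image.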

D. RUNTIME
The training time for the 200 sugar beet images for 40,000 iterations was about 8 hours, and for the 50 oilseed images for 40,000 iterations about 3 hours, on a workstation with an Intel i7 CPU (256 GB RAM) and an NVIDIA GTX1080Ti GPU (11 GB GPU memory). The runtime for running the classifier on our workstation is shown in Table 4. The total inference time is less than 100 ms for a camera with a resolution of 1296 × 966 pixel, which meets the requirement of real-time processing. For a higher-resolution image, processing takes longer.

IV. CONCLUSION
In this work, an encoder-decoder deep learning network was investigated for pixel-wise semantic segmentation of crop and weed. Different input representations, including different color space transformations and color indices, were compared to optimize the input of the network. Three image enhancement methods were investigated to improve model robustness against different lighting conditions. The results show that color space transformations and vegetation indices without NIR information did not improve the segmentation results, while inclusion of NIR information significantly improved the segmentation accuracy, indicating the effectiveness of NIR information for precise segmentation under weak lighting conditions. Image enhancement improved the image quality and thus the robustness of the segmentation models against different lighting conditions. In addition, the deep network applied in this work does not need a large amount of data for model training. Future work will focus on model compression, through which the trained model can be applied on mobile platforms with less computing capability.