Drone-Aided Detection of Weeds: Transfer Learning for Embedded Image Processing

In this article, we address the problem of hogweed detection using a drone equipped with red, green, blue (RGB) and multispectral cameras. We study two approaches: 1) offline detection running on the orthophoto of the area scanned within the mission and 2) real-time scanning of the frame stream directly on the edge device performing the flight mission. We show that by fusing the information from an additional multispectral camera installed on the drone, the detection quality can be boosted, and that this gain can be preserved even with a single-RGB-camera setup by introducing an additional convolutional neural network trained with transfer learning to produce fake multispectral images directly from the RGB stream. This approach either eliminates the multispectral hardware from the drone or, if only the RGB camera is at hand, boosts the segmentation performance at the cost of a slight increase in computational budget. To support this claim, we have performed an extensive study of network performance in simulations of both the real-time and offline modes, where we achieve at least a 1.1% increase in the mean intersection over union metric when evaluated on the RGB stream from the camera and 1.4% when evaluated on orthophoto data. Our results show that proper optimization allows the multispectral camera to be completely eliminated from the flight mission by adding a preprocessing stage to the segmentation network, without loss of quality.


I. INTRODUCTION
Continuously growing human population imposes strict demands on the agricultural industry in terms of improving crop yields and making food production efficient. At the same time, high efficiency can be achieved by controlling and removing weeds. To overcome this challenge, a large number of state-of-the-art technologies have been applied in the scope of precision agriculture (PA) [1].
The distribution of weeds and, in particular, Hogweed of Sosnowsky, is a quickly growing problem for the agricultural industry. Hogweed of Sosnowsky is a weed that has spread across the Eurasian region, including European and Asian countries, Great Britain, and Russia, as well as local parts of the North American continent. Hogweed typically grows up to 5 m tall and may have a stem diameter of up to 12 cm. It is capable of producing thousands of seeds, which are then distributed by wind across a large area. The plant produces toxins that are hazardous to humans, and at the same time hogweed endangers other crops, including farmed ones, by affecting their ecosystem. This hazardous weed must, therefore, be precisely identified and removed.
There are three typical approaches to weed localization: 1) inspection by humans; 2) satellite imaging; and 3) unmanned aerial vehicle (UAV) monitoring. We discuss these approaches in Section II.
In this article, we rely on the UAV option due to its practical feasibility. The most commonly utilized sensor setup for this platform consists of a single video camera, capable of producing high-quality red, green, blue (RGB) images, and a global navigation satellite system sensor providing geographic coordinates for the measurements from the camera. An additional multispectral camera is also typically introduced for crop monitoring missions to allow for the computation of vegetation indices. The same sensor could also be utilized to increase the crop detection accuracy, as was shown, for example, in [2], but exploiting two video cameras on board is not efficient in terms of weight and cost.
In this article, we investigate the tradeoffs between the semantic segmentation quality of crops and the complexity of the imaging sensor setup. We use the task of hogweed detection as a playground and propose a solution based on the advances of deep learning, which allows the multispectral camera to be completely removed from the mission without any deterioration of segmentation performance compared to the multiple-camera platform. More specifically, we report the following novel contributions.
1) Dataset collection, alignment, and labeling: We propose to label the hogweed plants by transforming a sequence of overlapping frames into a single orthophoto. This enables labeling the crops in the area of interest only once, without repetition in intersecting parts of the images. The corresponding labels are then back-transformed to the camera frames using a greedy feature-based image alignment procedure. A similar procedure is applied to align the RGB and multispectral data, with the corresponding orthophotos acting as a reference. As a result, in our approach, the labeled and aligned RGB and multispectral images are produced quickly without the need to synchronize frame streams coming from two separate camera sensors. To the best of our knowledge, such a methodology has never been reported before.
2) Removal of the multispectral camera from the setup by replacement with a preprocessing network: Instead of using the two-camera setup, we propose to adopt a transfer learning procedure and train a convolutional neural network (CNN) on the collected paired data to mimic multispectral images. Such images are then synthesized during the preprocessing stage, which allows us to completely avoid multispectral hardware on the aerial platform without any deterioration of segmentation quality.

The rest of this article is organized as follows. We introduce the reader to the relevant works in the area in Section II. In Section III, we discuss the methods used in this research. Our results are provided in Section IV. Finally, Section V concludes this article.

II. RELATED WORK

A. Weed Detection
Traditional approaches to weed localization include manual inspection with special handheld equipment, which is done either on foot or with the use of transport machines such as cars or tractors [3]. Although these methods can still be favored by farmers, there is a certain shift toward systems that allow monitoring of larger areas while preventing the spread of seeds by feet and wheels. That is why modern methods involving UAVs and even satellite imaging are gaining close attention as more effective, human-free ways of crop monitoring and weed control [4].
Currently, satellite imaging provides accurate images across huge areas in spectral bands lying far beyond the conventional visual spectrum provided by an ordinary RGB camera. However, satellite imaging still does not allow precise crop localization due to the large ground sampling distance (GSD): the resolution of the best available satellite images does not exceed 30 cm/pixel. The typical size of hogweed leaves is around the same 30 cm, so a single hogweed plant is resolved into an image area of only a few pixels, which makes its detection nearly impossible. In addition, a satellite monitoring system is vulnerable to weather changes, since it becomes impossible to capture plants in cloudy weather. At the same time, UAVs enable selecting a GSD from millimeters to meters per pixel depending on the needs by varying the altitude during the mission. Although monitoring in rainy conditions is still impossible for UAVs, their applicability is far more flexible than that of satellites. Currently, the equipment and computational resources that UAVs can carry, as well as the area they can cover per single mission, are strongly bounded by their lifting capacity and power supply. This poses the problem of effective management of the sensors and computer vision algorithms used for monitoring.
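For reference, the GSD of a nadir-pointing camera follows the standard pinhole relation (the symbols and the numerical example below are generic illustrations, not parameters of our missions)

$$\mathrm{GSD} = \frac{H \, s_w}{f \, N_w},$$

where $H$ is the flight altitude, $s_w$ the physical sensor width, $f$ the focal length, and $N_w$ the image width in pixels. For instance, a 6.2-mm-wide sensor behind a 3.6-mm lens flown at 50 m and producing 4000-pixel-wide images yields a GSD of roughly 2 cm/pixel.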
Nowadays, a multisensor setup is typically used in PA. It combines a traditional RGB sensor with an infrared one; for the latter, the near-infrared (NIR) band is usually considered. Compared with a single RGB camera, multispectral cameras enable crop monitoring with much greater precision as well as direct calculation of vegetation characteristics. This comes at the cost of additional hardware carried on board during the mission and additional effort for accurate sensor synchronization and image alignment.

B. Image Analysis in PA
Even before the rise of deep learning, many algorithms were proposed to perform high-level vision tasks on images. Most of them rely on handcrafted distinctive features of the object of interest, and it was shown that this principle is suitable for agriculture as well [5]. Now, almost all state-of-the-art approaches involve CNNs and even transformers. Ronneberger et al. [6] were among the pioneers who introduced a U-shaped neural architecture for the task of image segmentation, which is still applied in real high-level vision tasks. This architecture was shown to produce state-of-the-art results in agricultural tasks as well and to be capable of running on an edge device. For example, in [7], a U-shaped architecture was deployed to an unmanned ground vehicle (UGV) in order to robustly estimate a pixelwise labeling of the images into crop and weed. It was shown that this setup outperforms previous state-of-the-art approaches and improves the accuracy of crop-weed classification without requiring retraining of the model. Menshchikov et al. [8] reported a comparison with other segmentation CNNs, which showed that U-Net performance is slightly lower than the competing SegNet and RefineNet with a ResNet backbone. However, the computational complexity of the latter two models does not allow deploying them to an embedded unit and performing inference in real time during the mission.
In our previous work [8], we focused on the optimization of the most relevant FCNN architectures for inference on board a single-board computer (SBC). The focus of that work was the feasibility of applying such a drone with artificial intelligence (AI) on board in PA. However, the dataset was collected during a single day and does not contain different phases of hogweed growth. Thus, the capabilities of an FCNN trained on such a dataset are limited. Furthermore, the data were collected using the RGB camera only. The current article proposes an approach that enhances the capabilities of the monitoring system in several ways.
First, the dataset used in this study was collected in the same area weekly over a period of 2.5 months. It contains data on the growth phases of the hogweed during this time span; hence, such a dataset is valuable for both the AI and PA areas. Moreover, the dataset was collected using both RGB and multispectral cameras. Additional channels give more information about the object of interest and improve the accuracy and robustness of the neural network. The contribution of the current research is the enhancement of the neural network capabilities via the generation of artificial NIR channels. It provides a significant advantage in terms of both the accuracy of predictions and the practical feasibility of such a monitoring system. The proposed approach allows us to get rid of the multispectral camera for hogweed detection, thus making the monitoring system cheaper and more sustainable and prolonging the flight time.
It should be noted that the deployment of a neural architecture to an edge device is not a trivial task [9], and various optimization techniques have been proposed to overcome this problem. A hardware-based optimization for binary-weight CNNs was proposed in [10], where the arithmetic core was optimized by eliminating the need for expensive multiplications, as well as by reducing I/O bandwidth and storage. Typically, however, most well-known techniques to accelerate inference are software based and include architecture-related optimizations, which are performed either manually or automatically via neural architecture search and pruning [11]. It was also shown that another type of optimization, quantizing neural network weights and the inference procedure to uint8 or even binary, can drastically accelerate inference without a large deterioration of performance. Such techniques were used, for example, in [12] to accelerate semantic segmentation on an NVIDIA Jetson Xavier NX by using computationally efficient manually designed adaptive blocks, namely a multifiber unit and an attention module, together with quantization of model weights to int8 and pruning, achieving good inference speed without a large drop in prediction quality.
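As a generic illustration of the weight-quantization idea (not the exact pipelines of [10] or [12]), post-training dynamic quantization in PyTorch converts the listed layer types to int8 weights; convolutional segmentation networks such as those discussed above would instead require the static or quantization-aware workflows:

```python
import torch
import torch.nn as nn

# Toy stand-in network; the segmentation CNNs discussed in the text are larger
# and convolutional, so this is only a minimal demonstration of the mechanism.
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 2)).eval()

# Post-training dynamic quantization: weights are stored as int8, and
# activations are quantized on the fly for the listed layer types.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
```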

C. Intelligent Monitoring Systems in PA
There are different types of autonomous vehicles that allow precise monitoring of crops. Typically, they are classified into two groups: UAVs, which operate in the air, and UGVs, which operate on the ground. Although UGVs are able to perform individual plant phenotyping, characteristic monitoring, and even manipulations with crops and weeds [13], in crop monitoring tasks the priority is given to UAVs due to their ability to cover large areas relatively fast without disseminating the seeds.
Typically, the UAVs are equipped with an RGB camera capturing the crops underneath in the visual spectrum, and machine vision algorithms process the images from this camera in order to get the desired output. Usually, such processing is done offline after the mission is complete [14], since the frame stream is stored on the internal storage of the UAV [15] or is transferred to an intermediate server during the mission over a wireless protocol [16]. Since both cases are not practically feasible, as they require additional effort and do not allow fast decision making, it is important to consider inference performed directly on the edge device [17]. This is especially vital for hogweed detection, as the weed can spread very fast by generating up to 100 000 seeds a year that are quickly disseminated by wind. Proper detection and elimination of each poisonous plant is essential, and edge computing enables it by providing instant information on the geolocation of the plant.
In order to improve the monitoring outcomes, additional information can be provided to the vision algorithm in a form of a separate multispectral frame stream. Typically, this is done to allow direct calculations of vegetation indices such as normalized difference vegetation index (NDVI) or normalized difference red edge index [18]. For example, Honrado et al. [19] report a UAV-based setup consisting of RGB and NIR-retrofitted cameras, which perform imaging missions over rice and corn fields in the Philippines. They comment on using a dual-camera setup in order to compute a precise NDVI of the crops. A survey performed by Radoglou-Grammatikis et al. [20] extends the list of possible multispectral camera usages toward the optimization of the image acquisition system and water stress indices. At the same time, robust quantitative studies require a proper usage of multispectral cameras and additional attention not only to their temporal calibration with visual band cameras, but also to the calibration of their spectral characteristics, as highlighted in [21].
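For completeness, the NDVI mentioned above is a per-pixel ratio of the NIR and red reflectances, NDVI = (NIR − Red)/(NIR + Red); a minimal sketch (array names are illustrative):

```python
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-pixel normalized difference vegetation index, values in [-1, 1]."""
    nir = nir.astype(np.float32)
    red = red.astype(np.float32)
    return (nir - red) / (nir + red + eps)
```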
For this reason, several approaches have been proposed targeting the capture of multispectral information directly from the corresponding RGB camera frames. For example, several approaches propose to remove the NIR-blocking filter installed in all consumer RGB cameras [22]. In this case, the signal from the NIR band is mixed with the red channel; however, the authors claim that it still helps to calculate the NDVI with high tolerance. Costa et al. [23] introduce a new NDVI-mimicking index computed from RGB data by the use of a genetic algorithm. Finally, inspired by the performance of Pix2Pix [24] in conditional synthesis tasks, Yuan et al. [25] and Soni et al. [26] separately proposed to use the same conditional generative adversarial CNN for the purpose of sampling visually plausible NIR spectral bands from satellite and aerial RGB observations, respectively.

III. METHODS

A. Data Collection
Owing to the distinctive features of a hogweed plant, the large size of its leaves, and its considerable height, we found it possible to perform aerial data collection. We performed two flight missions on two separate days, with a time difference of two weeks, in order to cover crops at different growth stages, scanning two neighboring fields in x city and y country, and we used the images captured from the two fields separately for training and evaluation purposes. In both missions, we collected RGB images of size 3000 × 4000 (height × width) pixels using the DJI-FC330 camera mounted on a DJI drone, capturing the field underneath at spatial locations predefined during mission planning. Specifically, for the flight collecting training data, we mounted an additional MAPIR Survey3 multispectral camera, capturing red (660 nm), green (550 nm), and NIR (850 nm) channels of the same spatial resolution at a predefined frequency of 0.5 frames/s (FPS), and we denote the data from this camera as RGN.
In this research, we generate a fake infrared channel using a generative adversarial neural network. This approach is promising from a practical point of view: there are several alignment issues, listed below, that the artificial channel allows us to avoid.
1) First of all, the DJI-FC330 camera is attached to the drone's gimbal. This allows precise control of the camera position during the flight according to the mission requirements, and the gimbal also stabilizes the camera in the case of wind gusts. The multispectral camera, on the other hand, is fixed and is always aligned with the drone's pose. Hence, there is no way to compensate for its shift in the case of any disturbance, and the resulting images from the two cameras are not aligned with each other.
2) Owing to the different design of the lenses, the apertures of the cameras are 20 cm away from each other, so both the field of view and the sensor illumination vary between the two. This effect also contributes to the alignment problems.
3) The two cameras have different frame rates and are not synchronized, so they do not take every shot simultaneously. This raises the issue of alignment in the case of real-time processing during the flight.
4) Using two cameras also leads to issues even in the case of two-orthophoto alignment. There is a separate GPS antenna for each camera, and GPS data are critical for orthophoto stitching; therefore, this influences the precision of the resulting orthophotos.
5) NIR cameras typically have lower resolution, so we need to crop images from the RGB camera in the case of real-time data processing. This leads to a loss of data from the RGB camera and, therefore, decreases the overall effectiveness of the system.
6) Finally, a multispectral camera is expensive in comparison with an SBC and a drone. Moreover, it is an additional payload, which decreases the flight time.

To sum up, it is better to have a single cheap RGB camera on board rather than two cameras. To upgrade the RGB camera with NIR capabilities, we can use the algorithm described in this article. As a result, the monitoring system with a single camera will be cheap and reliable.

B. Data Processing and Labeling
The collected dataset consists of two independent sequences of frames, which are not synchronized with each other. In order to align them, we propose to fuse both frame streams into corresponding orthophotos, and we use the OpenDroneMap software for this purpose. Once this step is completed, it is convenient to label all the regions of interest within the RGB orthophoto using any raster image processing software.
In order to align the images and construct a paired dataset for training, we applied a greedy region-based procedure; a code sketch of the underlying feature-based alignment is given after this paragraph. In the first step, the two orthophotos are globally aligned based on the best affine transformation between corresponding ORB features [27]. In the second stage, we divide both the orthophotos and the labels into overlapping patches of size 256 × 256 pixels. We apply the same alignment procedure to each patch independently, searching for the best match within a region of ±100 pixels around the patch. We depict the overall pipeline in Fig. 1.
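A minimal sketch of the feature-based alignment step, assuming OpenCV and 8-bit inputs; the greedy patchwise search within ±100 pixels described above is omitted for brevity, and the function name is illustrative:

```python
import cv2
import numpy as np

def align_to_reference(src: np.ndarray, ref: np.ndarray, max_feats: int = 1000):
    """Warp `src` onto `ref` using an affine transform fit to matched ORB features."""
    g_src = cv2.cvtColor(src, cv2.COLOR_BGR2GRAY) if src.ndim == 3 else src
    g_ref = cv2.cvtColor(ref, cv2.COLOR_BGR2GRAY) if ref.ndim == 3 else ref

    orb = cv2.ORB_create(max_feats)
    k1, d1 = orb.detectAndCompute(g_src, None)
    k2, d2 = orb.detectAndCompute(g_ref, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(d1, d2), key=lambda m: m.distance)
    pts_src = np.float32([k1[m.queryIdx].pt for m in matches])
    pts_ref = np.float32([k2[m.trainIdx].pt for m in matches])

    # Robustly fit an affine transform with RANSAC and warp src into ref's frame.
    M, _ = cv2.estimateAffine2D(pts_src, pts_ref, method=cv2.RANSAC)
    warped = cv2.warpAffine(src, M, (ref.shape[1], ref.shape[0]))
    return warped, M
```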
Constructing training data for the online scanning regime involves the additional RGB frame stream used to create the orthophoto. For this scenario, an intermediate stage based on the same alignment procedure is required in order to find the position of the entire frame in the RGB orthophoto. The selected region is cropped from both orthophotos and the labels. Owing to the nature of the texturing step used to synthesize orthophotos, the best match between the original frame and its region inside the orthophoto is achieved within the central part of the frame. Thus, we run the patch-based refinement only on the central part of size 2000 × 2000 pixels. We performed this procedure in order to sample both train and test data, with random crops of size 128 × 128 pixels for training and 256 × 256 pixels for testing. Since we do not collect multispectral images for the test dataset, we exclude these data from the discussed sampling procedure. Overall, we collected more than 20 000 crops for training and 83 crops for evaluation, and we carefully checked the correctness of the resulting labels in the latter by an additional visual inspection of the images.
We should note that the surface where we performed the data collection included both vegetating (young) hogweed crops and those that had already faded (old). Since the procedures required to eliminate these two types of hogweed plants can vary, we labeled such cases with different semantic classes in order to teach the segmentation network to distinguish between them, facilitating an effective practical application of our work. For all our reported data, we used a manual labeling procedure in the GIMP graphic editor (https://www.gimp.org). It should be noted that only RGB orthophotos are required for labeling, since the produced labels can be back-projected to the corresponding RGN orthophotos and original frames using the procedure discussed above. We release the dataset consisting of training, validation, and testing samples, where we include both the labeled orthophotos and the paired crops suitable for training. The dataset is available at www.x.com (subject of double-blind review).

C. Neural Network Architectures
In this study, we do not provide extensive experiments with neural network architectures, since our main goal is to show how to improve the performance of an existing predefined architecture when additional data are at hand. However, we do compare different architectures in two modes, computationally more and less intensive, and we place special emphasis on the latter in this article due to practical concerns. We start from high-capacity neural networks and use a SegNet architecture [28] with 29.5M parameters for the segmentation part and an encoder-decoder network with residual blocks [29] and 7.8M parameters for the fake NIR synthesis. Both architectures have proven to perform well in general-purpose segmentation and synthesis tasks, respectively [30], [31]. Targeting a balance between performance and inference speed when evaluated on an embedded system, we selected the well-known U-Net architecture for both the segmentation and synthesis tasks in the less computationally intensive regime, since its suitability for both goals was shown in recent papers [7]. It is a fully convolutional network that was initially developed specifically for the semantic segmentation task; however, it was also shown that this network can efficiently solve image-to-image tasks [32]. Being an encoder-decoder architecture, U-Net allows the processing and analysis of images at different scales, which greatly increases its receptive field without increasing the computational complexity. Targeting fast inference speeds, for the segmentation network we used a U-Net backbone consisting of three blocks in both the encoder and decoder parts with 17.3M parameters in total, and a backbone with two blocks and 2.2M parameters for synthesis. We found the capacity of these networks sufficient both to obtain good quality in the semantic segmentation task and to run them with high speed on an embedded system.
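As an illustration of the U-Net idea (not the exact configuration used in our experiments), a minimal two-level variant in PyTorch could look as follows; all names and channel widths are illustrative:

```python
import torch
import torch.nn as nn

def conv_block(cin: int, cout: int) -> nn.Sequential:
    """Two 3x3 convolutions with batch norm and ReLU."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """Two-level encoder-decoder with skip connections (illustrative sketch)."""
    def __init__(self, in_ch: int = 3, out_ch: int = 3, width: int = 32):
        super().__init__()
        self.enc1 = conv_block(in_ch, width)
        self.enc2 = conv_block(width, width * 2)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(width * 2, width * 4)
        self.up2 = nn.ConvTranspose2d(width * 4, width * 2, 2, stride=2)
        self.dec2 = conv_block(width * 4, width * 2)
        self.up1 = nn.ConvTranspose2d(width * 2, width, 2, stride=2)
        self.dec1 = conv_block(width * 2, width)
        self.head = nn.Conv2d(width, out_ch, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))  # skip connection at level 2
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection at level 1
        return self.head(d1)
```

The same skeleton serves both tasks: with `out_ch` equal to the number of semantic classes it acts as a segmentation network, and with three output channels it acts as the RGN synthesis network.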

D. Training and Implementation Details
We conducted all our experiments using the PyTorch framework [33], which allowed us to effectively deploy network training, involving parallel computations, to GPUs. In all cases, we trained our networks using crops of size 128 × 128 pixels. We used random crops as well as random multiples of 90° rotations and random horizontal and vertical flips to augment our training data. We minimized all the objective functions using the Adam optimizer [34] with a learning rate of 1e-3, and warm-up scheduling was enabled for the first epoch. For training the synthesis neural network, we used adversarial learning adopted from pix2pix [24]. Since the network chosen to perform synthesis is relatively small, we used an adaptive gradient step frequency [35] for both the generator and the discriminator in order to avoid discriminator overfitting and other instabilities.
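A minimal sketch of the paired augmentation described above; the helper and its names are illustrative, and the key point is that the same crop, rotation, and flips must be applied consistently to the RGB, RGN, and label tensors:

```python
import random
import torch

def augment(rgb: torch.Tensor, rgn: torch.Tensor, mask: torch.Tensor, crop: int = 128):
    """Apply a shared random crop, k*90-degree rotation, and flips to (C, H, W) tensors.

    Inputs are assumed to be at least `crop` pixels in each spatial dimension.
    """
    _, h, w = rgb.shape
    top, left = random.randint(0, h - crop), random.randint(0, w - crop)
    out = [t[:, top:top + crop, left:left + crop] for t in (rgb, rgn, mask)]

    k = random.randint(0, 3)                      # random multiple of 90 degrees
    out = [torch.rot90(t, k, dims=(1, 2)) for t in out]
    if random.random() < 0.5:                     # random horizontal flip
        out = [torch.flip(t, dims=(2,)) for t in out]
    if random.random() < 0.5:                     # random vertical flip
        out = [torch.flip(t, dims=(1,)) for t in out]
    return out
```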
Overall, our training strategy is depicted in Fig. 2 and consists of the following three steps.
1) During the first step, we train our network to segment hogweed plants based on both RGB and RGN data concatenated and passed as an input. This step is highlighted in Fig. 2(a).

2) During the second step, we train another neural network to synthesize fake RGN-like images out of the real RGB ones. The training of this stage is highlighted in Fig. 2(b). For this task, we conduct adversarial training similar to the earlier proposed pix2pix model [24], and we use the same loss function of the form

$$\mathcal{L}_{\mathrm{pix2pix}}(\theta_G, \theta_D) = \mathcal{L}_{\mathrm{GAN}}(\theta_G, \theta_D) + \lambda_1 \mathcal{L}_{L1}(\theta_G). \quad (1)$$

Here, we denote RGB and RGN images as $x$ and $y$, respectively, and $\theta_G$ and $\theta_D$ are the trainable weights of the generator and discriminator networks $G$ and $D$, respectively. In addition, we involve transfer learning and force our network to output images that should benefit the target semantic segmentation task. For this purpose, we penalize the $L_2$ objective $\mathcal{L}_{\mathrm{seg}}$ computed between the output features of the network from the previous step evaluated on the ground-truth and fake RGN data

$$\mathcal{L}_{\mathrm{seg}}(\theta_G) = \left\| T(y; \theta_T) - T\bigl(G(x; \theta_G); \theta_T\bigr) \right\|_2^2, \quad (2)$$

where we have denoted the segmentation network from the previous step and its weights as $T$ and $\theta_T$. We use the traditional generative adversarial network (GAN) [36] min-max game to conduct the following optimization:

$$\min_{\theta_G} \max_{\theta_D} \; \mathcal{L}_{\mathrm{pix2pix}}(\theta_G, \theta_D) + \lambda_2 \mathcal{L}_{\mathrm{seg}}(\theta_G). \quad (3)$$

The intuition behind this step is to fuse information about valuable features from RGN images, which can benefit the segmentation of hogweed plants, into the preprocessing network trained at this stage. We have found the values $\lambda_1 = 100$ from (1) and $\lambda_2 = 10$ from (3) to suit this goal well (a code sketch of the generator objective is given after this list).

3) During the final step, we perform an iterative inference of the network trained at the previous step through all our dataset samples to obtain fake RGN images and then fine-tune a semantic segmentation network on these data using the same procedure as in step 1. This step is depicted in Fig. 2(c) and does not require real RGN data anymore, since information about this domain is fused inside the parameters of the synthesis network during the previous stage.
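A compact sketch of the generator objective from step 2; the function signatures, the conditional discriminator interface, and the choice of segmentation outputs used for the transfer term are assumptions for illustration, and the discriminator update and the adaptive step frequency are omitted:

```python
import torch
import torch.nn.functional as F

def generator_loss(G, D, T, rgb, rgn_real, lambda_l1=100.0, lambda_seg=10.0):
    """Adversarial + L1 + segmentation-feature transfer loss for the synthesis network G.

    G maps RGB -> fake RGN, D is a conditional (patch) discriminator taking the RGB
    image and an RGN candidate, and T is the frozen segmentation network from step 1,
    which consumes the concatenated RGB and RGN channels.
    """
    rgn_fake = G(rgb)

    # Non-saturating adversarial term: D should label the fake RGN as real.
    logits_fake = D(rgb, rgn_fake)
    adv = F.binary_cross_entropy_with_logits(logits_fake, torch.ones_like(logits_fake))

    # Pixelwise reconstruction term, as in pix2pix.
    l1 = F.l1_loss(rgn_fake, rgn_real)

    # Transfer term: match the segmentation outputs on real and fake RGN inputs.
    with torch.no_grad():
        feat_real = T(torch.cat([rgb, rgn_real], dim=1))
    feat_fake = T(torch.cat([rgb, rgn_fake], dim=1))  # gradients flow into G via rgn_fake
    seg = F.mse_loss(feat_fake, feat_real)

    return adv + lambda_l1 * l1 + lambda_seg * seg
```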

E. Porting to the Embedded System
We used an NVIDIA Jetson Xavier NX as the target device, since it has proven to be an efficient low-cost solution for various vision tasks in edge computing [8]. In order to accelerate the inference speed of our networks and simulate the proposed approach running on an embedded system, we applied the well-known techniques of network quantization and layer fusion (pruning). Owing to the specific choice of the segmentation and synthesis networks in our less computationally intensive regime to be standard U-Net architectures, porting our models to the edge device is relatively straightforward. For this reason, we used the NVIDIA TensorRT inference optimizer and runtime, which performs all these optimizations as well as other architecture-specific tunings.
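A typical export path from PyTorch to TensorRT is sketched below; the file names, input size, and build flags are illustrative assumptions rather than our exact deployment commands:

```python
import torch
import torch.nn as nn

# Stand-in for the trained U-Net; the real model is loaded from a checkpoint.
model = nn.Conv2d(3, 3, 3, padding=1).eval()

# Export to ONNX; TensorRT then builds an optimized engine from this file, e.g.,
#   trtexec --onnx=unet.onnx --fp16
# on the Jetson, which performs layer fusion and reduced-precision optimization.
dummy = torch.randn(1, 3, 256, 256)  # tile-sized input, batch size 1
torch.onnx.export(model, dummy, "unet.onnx", opset_version=13,
                  input_names=["rgb"], output_names=["mask"])
```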

F. Evaluation Metrics
In order to evaluate our models and compare them to the baselines, we used the two most common metrics for classification and semantic segmentation (i.e., classification of each image pixel), namely, mean intersection over union (mIoU) and the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, given by the true positive and false positive rates. For a more informative evaluation, both metrics are computed in a multiclass regime, when young and old crops are treated as different classes, and in a binary regime, when both types of crops are treated as a single class. As a reference, typical values of 60% mIoU reported in various multiclass semantic segmentation benchmarks [37] are considered an acceptable segmentation performance.
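For reference, the per-class IoU and its mean can be computed as in the following minimal sketch (function and variable names are illustrative; classes absent from both prediction and ground truth are skipped):

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Mean intersection over union across classes for integer label maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```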
To assess the performance of our approach ported to the embedded system, we measure the inference speed per 12-MPx frame (3000 × 4000 pixel) in FPS, as well as the power consumption of the hardware in watts (W).

IV. RESULTS
At the first stage of our study, we validate the designed label transfer. For this purpose, we randomly selected ten frames out of the sequence used to construct the training dataset. The proposed strategy was used to sample images of crops with corresponding labels transferred from the orthophoto data. For the same set of frames, we performed manual labeling, which allowed us to obtain the true labels for the sampled crops and use them for a quantitative comparison. The evaluation by means of mIoU showed 95% and 90% for old and young crops, respectively, with the main source of error attributed to the inconsistencies introduced by the independent manual labeling of frames and orthophoto. These findings demonstrate that the difference between the true labels and the labels transferred from the orthophoto with our proposed strategy is negligible. As extensively studied and shown in [38] and [39], such small errors in labels do not affect the training and performance of neural networks. We also present an indicative example in Fig. 3 for a qualitative comparison.
We have trained our models separately with orthophoto images and with images from the camera. For both the scenarios, we trained the following models.
1) a segmentation network working with an RGB+RGN pair involving true RGN data (TrueRGN);
2) a segmentation network working with only RGB frames (NoRGN);
3) a network consisting of synthesis and segmentation parts trained end to end with only RGB data (SynthNoRGN);
4) a network consisting of synthesis and segmentation parts trained using the transfer learning procedure with true RGN data (SynthFakeRGN);
5) a network consisting of synthesis and segmentation parts trained without the transfer learning procedure with true RGN data (SynthFakeRGNnoTL).

In Tables I and II, we report the results of all our models evaluated both on the mission we performed directly for testing purposes (Test field) and on a subset of the data obtained during the collection of our training data (Train field). In Fig. 4, we report a visual example of the synthesis U-Net network trained to mimic orthophoto data. It should be noted that there is no intersection between the training and testing data.
Several conclusions can be drawn from the tables. First, the metrics obtained on the subset of the train field dataset are higher than those computed on the test data. We think this difference may reflect the differences in data collection and manual labeling, performed separately and at different times for the train and test fields. An additional effect is overfitting, which is more pronounced for the larger networks in Table I, where the gap between the train and test fields is larger. Comparing the same models trained on either orthophoto data or frame data but evaluated on both, it is clear that networks trained with frames are, in general, more robust with respect to the change of data type. Thus, the difference between the multiclass mIoU metrics of SynthFakeRGN models trained with frames is, on average, 2% on the test field data and 5.9% on the train field data, while for the same models trained with orthophoto data, it is 6.3% and 24.2%, respectively. This means that the frame stream from the camera is more favorable for precise hogweed monitoring, which supports our efforts in transferring all computations to the aerial platform.
Comparing segmentation models trained with and without true RGN data, we can emphasize the superiority of the former. Indeed, in all cases, when evaluated on data from the train field, TrueRGN and SynthFakeRGN models consistently perform better than NoRGN and even its more computationally expensive analogue SynthNoRGN. For example, TrueRGN models trained and evaluated on frame data perform on average 2.2% better than the corresponding NoRGN models and 2.4% better than SynthNoRGN ones in terms of the multiclass mIoU metric. In addition, from Tables I and II, it is clear that TrueRGN models generalize better to the different test data. As an example, the U-Net model (see Table II) trained on orthophoto data gives a score of 75.3% when evaluated on frame data, which is 6.3% less than that of the same model evaluated on the frame test set. At the same time, all the other U-Net models trained on orthophoto data overfit to the training domain, which results in differences of 26%, 20.9%, and 24% for SynthFakeRGN, NoRGN, and SynthNoRGN, respectively. It is clearly seen from Tables I and II that in almost all cases, the SynthFakeRGN model performs considerably better than NoRGN and SynthNoRGN in both the mIoU and ROC-AUC metrics, which means that it has benefited from the training procedure introduced in this article. For example, trained on orthophoto data, SynthFakeRGN models on average outperform NoRGN by 3.4% and 0.017 in terms of the multiclass mIoU and ROC-AUC metrics and by 2.8% and 0.019 in terms of the binary ones when evaluated on the test dataset. It is also interesting to note that, being trained within the unified pipeline, SynthNoRGN performance in general deteriorates relative to the lighter NoRGN models, which have a smaller number of parameters and lower computational complexity. This fact supports our approach, since it again highlights the importance of a proper training procedure.
Finally, we observe that the models utilizing fake RGN data trained with the transfer learning approach (SynthFakeRGN) consistently perform better than their analogues trained with just the RGB-to-NIR GAN (SynthFakeRGNnoTL). The highest difference occurs for models trained with frame data, where SynthFakeRGN outperforms SynthFakeRGNnoTL by 5.1% on average in terms of the multiclass mIoU metric measured on the test data. This fact proves that the proposed transfer learning approach is a crucial component of fake NIR utilization when the semantic segmentation task is at hand.
Overall, we observe that the multiclass metrics are lower than the binary ones, and the difference is especially pronounced when evaluated on the test field. Indeed, the binary classification is an easier problem to solve compared to the task of separating plants by age; however, even for the latter case, the metrics are relatively high, which shows the success of our approach. We present a visual comparison of the multiclass segmentation performance between the SynthFakeRGN, SynthNoRGN, and NoRGN models with the U-Net architecture in Fig. 5.
In order to show the direct applicability of our precise hogweed monitoring system, we ported both the U-Net semantic segmentation and fake NIR synthesis models to a Jetson Xavier NX embedded system and simulated their runtime during the mission by imitating a frame stream from the RGB camera. Owing to the memory limitations of this computational unit, we performed tile inference, where each frame is first divided into nonoverlapping regions of size 100 × 100, which are fed to the network with batch size 1, and the resulting local maps are stitched back to form a global map corresponding to the full frame resolution. Interestingly enough, we managed to deploy all our networks to the embedded platform without any loss of segmentation quality, so we do not report differences in metrics for networks evaluated on different platforms. In Table III, we report the runtime and power consumption measured during the simulation. For comparison, we also report values for a run on an NVIDIA Tesla V100, where we used our original models without any optimizations and passed the whole 3000 × 4000 frame to the network.
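The tile inference described above can be sketched as follows (a simplified illustration: frame dimensions are assumed divisible by the tile size, and the model is a placeholder for the deployed networks):

```python
import torch

@torch.no_grad()
def tiled_inference(net, frame: torch.Tensor, tile: int = 100) -> torch.Tensor:
    """Run `net` on nonoverlapping tiles of a (C, H, W) frame and stitch the maps back."""
    _, h, w = frame.shape
    out = None
    for top in range(0, h, tile):
        for left in range(0, w, tile):
            patch = frame[:, top:top + tile, left:left + tile].unsqueeze(0)  # batch size 1
            pred = net(patch).squeeze(0)                                     # (C_out, tile, tile)
            if out is None:
                out = frame.new_zeros(pred.shape[0], h, w)
            out[:, top:top + tile, left:left + tile] = pred                  # stitch local map back
    return out
```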

V. CONCLUSION
In this article, we proposed and evaluated the idea of incorporating knowledge about multispectral bands into a neural network performing the semantic segmentation of hogweed crops. We collected and labeled two datasets consisting of old and young hogweed crops from two different fields within the same region; one involved only an RGB frame stream and is used for testing purposes, and the other was collected with an additional RGN camera on board. We proposed the idea of labeling frames by labeling the corresponding orthophotos, and we successfully applied it to both datasets. We also proposed an alignment strategy, which allowed us to align the RGB, RGN, and label orthophotos with each other and with the RGB frames from the frame stream. With these data, we trained U-Net and SegNet architectures to perform semantic segmentation, and we used a transfer learning procedure to train a separate synthesis head consisting of another, lighter network, which synthesizes RGN data to facilitate the downstream semantic segmentation task. We observed that this approach is at least 1.1% better than the end-to-end training of either the segmentation network or both the synthesis and segmentation networks when only RGB data are involved. We also showed that this training procedure allows us to obtain segmentation quality at the same level as when real RGN data are provided. We ported our trained networks to an embedded system and showed that they are capable of performing the processing part of a real mission targeting the precise localization of hogweed crops. The proposed approach can be scaled to other agricultural-industry-related applications, including disease and insect detection.