A Curriculum Learning Approach to Classify Nitrogen Concentration in Greenhouse Basil Plants Using a Very Small Dataset and Low-Cost RGB Images

The automatic classification of plants with nutrient deficiencies or excesses is essential in precision agriculture. In particular, early detection of nutrient concentrations would increase crop yields and enable appropriate use of fertilizers. RGB cameras are a low-cost sensor alternative for plant monitoring, but the task is complicated when it is purely visual and samples are limited. In this paper, we analyze the Curriculum by Smoothing technique with a small dataset of RGB images (144 images per class) to classify nitrogen concentrations in greenhouse basil plants. This Deep Learning method changes the texture information available during training by convolving each feature map (the output of a convolutional layer) of a Convolutional Neural Network with a Gaussian kernel whose width decreases as training progresses. We observed that this controlled information extraction allows a state-of-the-art deep neural network to perform well using little training data with high variance between items of the same class. As a result, Curriculum by Smoothing provides an average accuracy 7% higher than the traditional transfer learning method for classifying the nitrogen concentration level of greenhouse basil 'Nufar' plants with little data.


I. INTRODUCTION
Nitrogen (N) is one of the mineral elements whose absence most often limits crop growth and yield [1]. Usually, the soil does not contain enough N to keep plants healthy, so it is necessary to apply commercial nitrogen fertilizers. However, fertilizer must be applied correctly for each crop type. Over-fertilization can hurt production costs and the environment, whereas poor fertilization can result in little or no benefit in increasing crop yields [1].

(The associate editor coordinating the review of this manuscript and approving it for publication was Mingbo Zhao.)
Different ways of determining plant nutrient concentration can be categorized as destructive or non-destructive. Recently, there has been great interest in using cameras as a non-destructive method for plant monitoring thanks to the speed of obtaining results [2]. Multispectral cameras have proven very useful in detecting nutrient concentration levels because changes in nutrient activity appear in different wavelength ranges; however, these cameras are expensive. Cameras that capture images in the visible spectrum are a low-cost alternative. However, the information that can be extracted is limited to the visual traits of the plant, which can change drastically due to lighting conditions, the uneven distribution of branches on the main stem, and the camera's pose relative to the plant when the picture is taken [3].
Machine Learning and Deep Learning techniques have been used to determine nutrient concentration levels in plants from RGB images (e.g., pak choi [4], tomato [5], corn [6], sugar beet [7], sorghum [8], and rice [9]), resulting in a non-invasive, cheap, and fast method for plant monitoring. We are interested in monitoring basil 'Nufar' crops in greenhouses. Basil is an aromatic herbaceous plant. In Mexico, it is cultivated mainly as a cooking ingredient and, to a lesser extent, for medicinal use [10]. Basil is grown under both greenhouse and field conditions. The area devoted to this crop in Mexico increased by 60% in 2019 compared to 2017. Over the same period, the value of production increased by 200%, reaching 2.3 million dollars in 2019 [11]. The country's average yield in 2019 was 11.3 t/ha (tons per hectare), with marked differences between regions of Mexico. The leading destinations of this crop's production are countries such as the United States, Canada, and the United Kingdom.
One characteristic that has made Deep Learning successful in many scenarios is the use of big data for training. This can be a problem in agriculture [12], [13], either because of the scarcity of study elements or the high cost of obtaining them. State-of-the-art deep networks contain millions of learnable parameters for a complex but effective image representation. However, due to their large capacity, they are also prone to overfitting. Training Convolutional Neural Networks (CNNs) with large amounts of data is the best way to avoid or reduce overfitting; otherwise, regularization strategies are necessary to generalize well. The most common model regularization techniques are reducing the model's capacity, data augmentation, dropout, and weight regularization [14]. Classical image augmentation approaches increase training data variability and improve network performance. Nevertheless, it is also possible that the training data does not precisely represent the distribution of the test set, a common problem when there are too few samples.
In this paper, we propose to use Curriculum by Smoothing (CS) [15] to classify N concentration levels in a small set of basil plant images using a high-capacity artificial neural network, ResNet50V2 [16]. In this Curriculum Learning (CL) strategy, texture information is fed to the network in a controlled manner: the amount of texture increases as training progresses, allowing the network to improve its performance despite having little data. This work aims to predict the treatment type with which the plants were grown, resulting in a 4-class classification problem. To our knowledge, this work is the first to use CL to classify nitrogen concentration levels.
The contributions of this paper are as follows:
•A quantitative and qualitative analysis of the use of CS in a CNN to classify basil 'Nufar' (Ocimum basilicum L.) plants grown under different N concentrations, assuming that only a limited dataset is available for training.
•A new set of RGB images of basil plants grown under greenhouse conditions and four N nutrition regimes. This dataset has the peculiarity that the plants in some treatments are visually very similar, which makes classification difficult. These images and their ground truth are available to the scientific community.
The rest of this paper is organized as follows. Section II discusses related work, section III explains the dataset's creation, and section IV presents the description of the implemented method. Then, section V describes the experimental results. Finally, section VI presents the conclusions of this research.

II. RELATED WORK
Approaches to detecting the nutrient concentration level in plants using images can be classified by how the plants are grown: in laboratory conditions, in greenhouses, or in the open field, each presenting different problems to be solved. We focus the state-of-the-art study on plants grown under laboratory and greenhouse conditions, where the main issues arise from changes in the appearance of plants when viewed from different points of view. For example, the arrangement of branches and leaves can make the same plant look very different, complicating the extraction of distinctive features for correct classification. In this section, we also introduce Transfer Learning (TL) and CL in convolutional neural networks.

A. DETECTION OF THE NUTRIENT CONCENTRATION LEVEL IN PLANTS USING PROXIMAL RGB IMAGES
Machine learning techniques have been widely used to solve the problem of detecting the nutrient concentration level using leaf images [3], for which it is necessary first to perform hand-crafted feature extraction and then apply traditional classification techniques such as Support Vector Machines (SVM), K-nearest neighbors, and Decision trees. The essential features have been color in different transformation spaces [17], [18], texture [19], and leaf morphology [20].
On the other hand, Ghosal et al. [21] used a convolutional neural network to detect, classify, and quantify biotic (bacterial and fungal) and abiotic (nutrient deficiency and chemical injury) stresses in soybean leaf images. They extracted and identified the visual cues employed by the network to make predictions, where color-based features had a higher presence in the first layers. Sethy et al. [22] assessed six pre-trained CNNs as feature extractors to classify four levels of N in 5790 close-up images of rice plant leaves. The features extracted by ResNet50 performed best when classified with an SVM. Although there have been significant advances in the study of nutrient stress, these works have only considered proximal images of isolated leaves, which may be impractical for real-life greenhouse monitoring of the concentration levels of a single nutrient, as plants contain many leaves of different ages and coloration.
Story et al. [23] addressed calcium deficiency detection in images of lettuce canopies captured by a vision-guided plant monitoring system. They concluded that the best features representing calcium deficiencies are canopy area and textural features such as energy, entropy, and homogeneity. Textural features were of great relevance due to the changes in the appearance of the plant canopies during the experiment, like leaf necrosis, small brown spots, and tip burn. The main limitation of that work is that the images were collected in laboratory conditions with controlled lighting, temperature, and humidity. Vakilian and Massah [24] proposed a vision-based mobile robot to monitor and fertilize with N four cucumber varieties grown in a hydroponic greenhouse. They also used the same textural features as the previous work to detect N deficiencies. However, changes in the different plant elements were not analyzed even though the robot captured images with lateral views of the plants.
Only a few papers deal with the classification of nutrition stress using whole plants grown in greenhouses, RGB images, and Deep Learning, mainly due to the difficulty of obtaining hundreds or thousands of samples with their respective ground truth. Azimi et al. [8] proposed a 23-layer neural network to classify N deficiencies in sorghum plants. They considered three different N treatments and concluded that their network could outperform machine learning techniques and deep models such as ResNet18 and NASNet-Large. Taha et al. [25] utilized InceptionV3 to classify the nutrition status of lettuce grown in aquaponics. They used four classes: complete nutrition, phosphorus deficiency, N deficiency, and potassium deficiency. However, these works do not consider the case of having little training data, nor do they perform a qualitative analysis of the parts of the plant that influence the classification.

B. TRANSFER LEARNING (TL)
TL transfers a model's knowledge from one domain to another [26]. TL starts training a high-capacity neural network from its pre-trained weights, typically obtained with the ImageNet dataset [27]. This technique reuses features extracted from a large amount of data. With this prior knowledge, the network can be trained with a much smaller number (thousands or hundreds) of images and generally achieve good performance.

C. CURRICULUM LEARNING (CL)
CL is the technique of training a machine learning model with organized data. This data first represents simple examples and then contains more complex information as the training progresses [32]. A proper CL strategy allows one to converge quickly and arrive at a better solution; otherwise, the CL could degrade the diversity of the data and generate worse results than training without a curriculum [33].
There are different strategies for CL in various domains, such as Computer Vision, Robotics, and Natural Language Processing [33], [34]. However, CL had not yet been used to detect nutrient concentration levels. Our work is inspired by Sinha et al. [15], who proposed the CS technique. In that curriculum, the outputs of each network layer are convolved with a Gaussian kernel whose variance decreases as the training progresses, allowing the network to be progressively fed with more high-frequency information. They evaluated their proposal on different vision tasks such as classification, semantic segmentation, and object detection with datasets of thousands of images, improving the performance of the networks. Nevertheless, they did not assess the capacity of their curriculum strategy with small datasets.

III. DATASET CREATION
In this section, we first introduce the plant cultivation process and how the N concentration was measured. We then describe how the images were collected and processed for use in the network experimentation.

A. CULTURAL CONDITIONS AND CROP MANAGEMENT
Basil plants were established from April to June 2019. Cultivation began by sowing basil seeds in germination trays; once germinated, the seedlings were transplanted to an inert substrate in polyethylene bag pots with a volume capacity of 8 litres. The substrate was composed of volcanic rock with a grain size of 1 to 7 mm. The plants were grown in a tunnel-type greenhouse with polyethylene as cover material, located inside the Universidad Autónoma del Estado de Morelos, Mexico (18°58′56″ N, 99°13′59″ W).
After transplanting (30 days after sowing), the seedlings were watered daily at 0.5-1 L/day until the third week. Afterward, irrigation water was increased to 2 L/day as the basil plants grew.
Four treatments were established based on different levels of N, supplied by applying nitrate mixed with the irrigation water. The four levels of N correspond to 4, 8, 12, and 16 mEq/L (milliequivalents per litre), which we denominate level I (very deficient), II (deficient), III (ideal), and IV (excessive), respectively. In addition, we supplied micronutrients using the commercial product Ultrasol® micro Rexene® BSP Mix at a constant dose of 80 g/m³ for all four N treatments to ensure that the physical changes shown by the plants were due only to changes in N doses and not to the lack of any other nutrient. During the growing cycle of the basil plants, there were no pest or disease problems, nor was any stress observed due to abiotic factors that could cause deviations in the development of the plants. The average temperature in the greenhouse during the experiment was 24 °C, and the average relative humidity was 49%.

B. N CONCENTRATION LEVEL IN PLANTS
Ontiveros-Capurata et al. [35] analyzed N concentration levels in the same plant species and variety (Ocimum basilicum L.). They found the highest production of fresh matter (leaves and stems) at 42 days after transplanting (week six). Therefore, in this work, images were captured in week four so that the model can estimate possible alterations in N concentration while corrective fertilization measures can still be applied, thus reducing the potential effects on leaf production or fertilizer purchase costs. In addition, we cut off the basil plants' aerial parts (leaves and stems) in week four. These were crushed and mixed to determine the N concentration in the samples using the Kjeldahl method [36]. The leaf N content and leaf area (parameters considered to quantify leaf production) of the sampled plants are shown in Fig. 1. These measurements were used as the ground truth of our dataset.

C. DATASET ACQUISITION
We analyzed 48 plants in the fourth week after transplanting, 12 per N treatment. For each plant, 18 images of 4032 × 3024 pixels (see Fig. 2) were captured from different points of view. The plant was placed on a turntable and rotated every 20 degrees while the camera remained fixed, framing the plant against a white background. Fig. 3 displays six examples of rotated views of the same plant. Although these come from the same plant, they have a very different appearance due to the morphological characteristics of basil.

D. DATASET PREPARATION
We found that when fed the original images, the neural network uses the size of the background or circular base region to predict the class to which the plant belongs, since plants grown with low N levels are smaller than the others. Therefore, we pre-processed the images to isolate the plants and set the background pixels to black. First, the raw images are automatically cropped to contain only the plant. Next, each RGB image is transformed to HSV (Hue, Saturation, and Value) color space using the OpenCV library [37] to detect the pixel regions that belong to the plant by specifying maximum and minimum values for each channel using (1). Both H and S are in the range [0, 255]. Once the image is binarized, it is processed to remove noise through an opening morphological operation (erosion followed by dilation) using a square kernel of 5 × 5 pixels, which was found empirically to work well. Finally, a bounding box of the whole plant region is determined and adjusted to extract a square patch with a black background. One-third of the photos were labelled as test images and the remaining as training images, as shown in Table 1. Therefore, there are 576 images (144 per treatment) in the training set and 288 (72 per treatment) in the test set. For a better analysis of our dataset, three distinct splits of the data (split 1, split 2, split 3) were created under the criterion that the plants considered for the test were different in each split. We make this data available, with its corresponding hand-segmented masks and the code used for this work, at https://github.com/jofuepa/basil-dataset.git.

IV. METHOD DESCRIPTION
This section details the method we used to classify basil 'Nufar' plants according to four classes corresponding to different N treatments. First, we present a data augmentation technique implemented to increase variability in our dataset; then, we provide information about the base model used for transfer learning. Finally, we explain how we incorporated curriculum learning into this model.

A. DATA AUGMENTATION TECHNIQUES
TensorFlow [38] and Keras [39] libraries were used to build the network. Real-time data augmentation is performed using the Keras ImageDataGenerator class, where a new randomly transformed batch replaces each original batch in the dataset; the original data is not used for training. The image transformations are rotation, brightness, and horizontal flip. Each image in the batch is modified using a random subset of one or more transformations. That is, when training for 10 epochs, 5,760 transformed images are generated, and 3 times more for 30 epochs, as detailed in step vii of the curriculum method described in section IV-C. Afterward, the Cutout [40] technique is applied, where black squares are drawn randomly on the image. The insertion of these squares generates greater variability in the image information by hiding parts of it, which is suitable for our problem given the morphological characteristics of the basil plants. Cutout has two parameters: the size of the square l and the probability p of an image being altered by this method. Different values of l were tested manually, and the one that minimized the test loss was chosen, while p always remained at 1.0. Table 2 shows the strategies and parameters used in the experiments and their respective values.
The ranges for data augmentation were set manually by visually analyzing the original image conditions. Zoom and shear transformations were tested, but performance did not improve; on the contrary, it sometimes worsened. Cutout is applied to all images because applying it to only some images within the batch produced inferior performance.
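Cutout itself is short to implement. A sketch is given below; the default square size l=16 is an arbitrary illustrative value, not the one selected in Table 2. Such a function could be plugged into ImageDataGenerator via its `preprocessing_function` argument.

```python
import numpy as np

def cutout(image, l=16, p=1.0, rng=None):
    """Zero out one random l x l square (Cutout [40]).

    l and p mirror the paper's two parameters; l=16 is an arbitrary
    example default, not the value chosen in Table 2."""
    rng = rng if rng is not None else np.random.default_rng()
    if rng.random() > p:          # with probability 1 - p, leave the image intact
        return image
    out = image.copy()
    h, w = out.shape[:2]
    cy, cx = int(rng.integers(h)), int(rng.integers(w))  # square centre
    y0, y1 = max(0, cy - l // 2), min(h, cy + l // 2)
    x0, x1 = max(0, cx - l // 2), min(w, cx + l // 2)
    out[y0:y1, x0:x1] = 0         # black square; may be clipped at the border
    return out
```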

B. TRANSFER LEARNING METHODOLOGY
Fig. 5 shows the network architecture for applying Transfer Learning (TL). The state-of-the-art architecture analyzed is ResNet50V2 [16], which performs well on ImageNet classification using ∼26M parameters. We chose this type of network because residual connections have been demonstrated, both practically and theoretically, to facilitate training by fitting the data very well while retaining the ability to generalize to new inputs [41], [42]. Moreover, residual connections help reduce the probability of encountering vanishing gradients during training [16]. The input layer size is 224 × 224 pixels. A global average pooling layer is adopted to reduce the number of parameters in the model. The last dense layer has a Softmax activation function that yields the class predictions.
The TL process is performed in three main stages: i. We replace the classification layer of a model pre-trained on ImageNet with a new one whose weights are randomly initialized. ii. The convolutional base of this model is frozen, and the classification layer is trained for a few epochs (warm-up). iii. Fine-tuning is done, which consists of unfreezing all the network's convolutional layers and re-training with a low learning rate.
The following model parameters are kept constant during training to compare the different tests carried out with the ResNet50V2 model. The model is trained for ten epochs in the first training phase (warm-up) with an initial learning rate of 1e-3. In the second phase, training starts with a learning rate of 1e-5, and early stopping is applied when the loss is no longer decreasing, with a maximum of 10 epochs without improvement. For both stages, we use a batch size of 32, the categorical cross-entropy loss function, and the Adam optimizer for its convergence speed, with the default parameters β1 = 0.9, β2 = 0.999, and ε = 1e-7.
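The two-stage TL setup can be sketched in Keras as follows. This is a simplified outline of stages i-iii, with the `weights` argument exposed so the backbone can also be built without downloading the ImageNet weights; data pipelines and the EarlyStopping callback (patience 10) are omitted.

```python
import tensorflow as tf

def build_tl_model(num_classes=4, weights="imagenet"):
    """ResNet50V2 backbone with a fresh Softmax head, frozen for warm-up
    (stages i-ii of the TL process)."""
    base = tf.keras.applications.ResNet50V2(
        include_top=False, weights=weights, input_shape=(224, 224, 3))
    base.trainable = False  # stage ii: freeze the convolutional base
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),  # warm-up LR
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model, base

def start_fine_tuning(model, base):
    """Stage iii: unfreeze everything and recompile with a low learning rate."""
    base.trainable = True
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```

Early stopping on the validation loss would be added to `model.fit` via `tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True)`.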

C. CURRICULUM BY SMOOTHING PROCESS
We conducted a different training than the one described in [15]. Our curriculum strategy consists in smoothing the outputs of the convolutional layers (i.e., the signal responses are smoothed). This smoothing is applied iteratively, such that the amount of smoothing decreases with each iteration, thus incorporating higher frequencies of the signal responses as the training advances. More precisely, the strategy consists of the following steps: i. The pre-trained ResNet50V2 is subjected to steps i and ii of the TL process described in section IV-B. ii. The convolutional base of the model is unfrozen. iii. The network is modified by adding convolution layers with 3 × 3 Gaussian kernels after each original convolutional layer. iv. The sigma value is initialized as σ = 1.0. v. The learning rate is set to 1e-5. vi. The weights of all the layers containing Gaussian kernels are computed and updated using the current σ value. These layers are frozen, so their weights are not trainable. vii. The network is retrained for 30 epochs. Then, the model that generates the lowest loss is stored for the next iteration (a further 30 epochs) with a new σ value. viii. The sigma value is updated using σ = σ − 0.2. If σ = 0, the training is finished; otherwise, return to step v.
We also use the Adam optimizer and the categorical cross-entropy loss function in this curriculum process. The batch size and data augmentation strategies also remain the same. In our case, the σ value decays every 30 epochs (empirically determined) due to the limited number of training images. Table 2 shows the strategies and parameters used in the experiments and their respective values.
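The frozen Gaussian smoothing layers of steps iii and vi can be sketched as below: a depthwise convolution applies the same normalized 3 × 3 kernel to every feature map independently. This is an illustrative reconstruction under our reading of the method, not the authors' exact implementation.

```python
import numpy as np
import tensorflow as tf

def gaussian_kernel_3x3(sigma):
    """Normalized 3x3 Gaussian kernel; small sigma approaches an identity kernel."""
    ax = np.array([-1.0, 0.0, 1.0])
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def make_smoothing_layer(sigma, channels):
    """Frozen depthwise conv that smooths each feature map with the same kernel."""
    k = gaussian_kernel_3x3(sigma).astype("float32")
    weights = np.repeat(k[:, :, None, None], channels, axis=2)  # (3, 3, C, 1)
    layer = tf.keras.layers.DepthwiseConv2D(
        3, padding="same", use_bias=False, trainable=False)
    layer.build((None, None, None, channels))
    layer.set_weights([weights])
    return layer
```

At each curriculum iteration the kernels would be recomputed with the decayed σ; as σ approaches 0 the kernel concentrates its mass at the centre, and the smoothing layer tends toward the identity.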
The CS technique allows the model to initially learn plant shapes (simple examples) and gradually introduce texture information (more complex instances) over time.This type of structured learning process is well suited to our data-limited problem, as it artificially expands the dataset through the controlled creation of increasingly challenging plant images.By maximizing the utility of the available data, the model improves its capability of generalization and consequently decreases the risk of overfitting.

V. EXPERIMENTAL RESULTS AND DISCUSSION
Each of the following experiments is repeated seven times. In [15], experiments are repeated five times; however, we found empirically that seven repetitions work better for our dataset. The experimentation was carried out on a laptop with 16 GB of RAM and a GTX 1080 GPU under Ubuntu 22.04 LTS.

A. THE IMPACT OF REDUCING THE STANDARD DEVIATION OF THE GAUSSIAN KERNEL DURING CURRICULUM LEARNING
Fig. 6 shows the accuracy and loss values when the layers with Gaussian kernels of the ResNet50V2 network take different σ values while learning using the three splits. Accuracy increases and loss decreases as σ decreases, even though the neural network has many parameters. When σ = 0.2, the model becomes more robust in discerning between classes. This indicates that mild to low smoothing is better for classification purposes; that is, the high frequencies of the objects within the images provide crucial information to discriminate among visual classes. The gap between the training and test loss curves is due to the small amount of data and the unavoidable differences that can exist between samples of the training and test sets. The model might be more confident with more data. The sigma decrement (0.2) was selected empirically; with 0.1, training arrived at the same result but more slowly.

B. COMPARING TRANSFER LEARNING VS CURRICULUM BY SMOOTHING
We evaluate the performance of the ResNet50V2 network using TL and CS. For CS, the results presented are obtained when σ = 0.2. Fig. 7 shows the results of the three splits. The worst performance appears in split 2 using TL, most likely because the training set does not appropriately represent the test set. In all cases, when TL is used the performance decreases considerably due to the change in domain from pre-training and the minimal training set. In our case, the model does not generalize well to the domain of nitrogen concentration classification using lateral views of basil plants. Freezing different parts of the network and fine-tuning the rest did not produce good results either. CS improves the network's performance, and the model becomes more robust in discerning between classes, with better confidence than when using TL alone. In addition, learning by curriculum allows a better extraction of discriminative features from plant images.

TABLE 3. Evaluation metrics of the test sets of the three splits using TL and CS strategies with and without data augmentation, considering all runs with ResNet50V2.
These results suggest that initially considering only shape information and gradually adding texture information (encoded in the high frequencies of the images) enables a more robust learning process than using shape and texture information from the very beginning.
We chose the model that generated the lowest test loss for each run and calculated the accuracy, precision, recall, and loss metrics. Table 3 shows the statistics of these metrics for the three splits using TL and CS strategies, with and without data augmentation. When data augmentation is used, each original image is replaced by a randomly modified version using Cutout plus one or more of the following: rotation, flip, or brightness. From the 576 original images selected for training, modified versions are generated in each epoch, e.g., 5,760 for 10 epochs. In the CS case, the model was selected when sigma was 0.2. We report the mean and the standard deviation in the format mean±std. We also display the statistics generated using the results of all runs in all partitions. These results validate that the performance of the CS strategy is superior to that of TL: with CS, the accuracy improves by 7%, while the loss value decreases by 0.12 units. To better assess the impact of data size on the performance of the model, Table 4 shows the metrics obtained with different-sized subsets, from 4 to 10 plants for training and from 2 to 8 plants for testing, with 18 images per plant. Data augmentation was always used in the experiments of Table 4.

C. QUALITATIVE ANALYSIS OF THE NEURAL NETWORK TRAINED BY CURRICULUM
We obtained the intermediate activations generated by the outputs of internal layers to analyze how an input image is transformed by some of the filters learned by the network. For better visualization, only the outputs of three layers near the beginning of the network (''conv2_block1_preact_relu'', ''conv2_block3_preact_relu'', and ''conv3_block2_preact_relu'') were considered, and all feature maps were rescaled to the same size. In the shallow layer ''conv2_block1_preact_relu'' (first row of each figure), the output channels mainly encode color and edge filters. Leaves close to the terminal buds show higher activation in some filters because they are younger than the rest, so their coloring differs. The greenness of the leaves increases with age and with the amount of N provided to the plant. Also, the plants of class I may have small yellowish leaves, which are not present in the other categories, due to the low N concentration at which they were grown.
The activations generated by layer ''conv2_block3_preact_relu'' encode the contour of the whole plant and the textures and edges of the leaves and stems. In class I, the region belonging to the stem is more relevant, indicating that it is a distinctive feature for classification; in the other classes, the stem is almost entirely occluded by the leaves. In classes II, III, and IV, the contour of the whole plant is present in several filters. Furthermore, since leaf venation becomes more pronounced as the amount of N supplied increases, in class IV the presence of texture filters that enhance the details of the leaves is more noticeable.
In the deepest layer analyzed, ''conv3_block2_preact_relu'', contour and texture filters are prevalent. At this level, the network also creates filters to separate the plant into regions with similar textures or colors, which becomes more evident in classes III and IV, perhaps because of their similarity and the need to identify discriminative areas.
To interpret which elements of the plants are finally considered for prediction, we show the output maps of Grad-CAM [43] using the ResNet50V2 network trained by the curriculum in Table 5. This approach uses the gradients flowing into the network's last convolutional layer to produce maps highlighting the regions that influence the prediction. In this experiment, 4 test images from each class of split 1 were used. Red-colored regions are the most important for the class, while blue-colored regions are the least significant. The model looked at the plant's stem and some of the nodes in class I because this class presents buds and small leaves, features found only in this class. In addition, the base of the stem is of interest in some instances. For class II, the model also highlights the stem and node regions, but in this case, the nodes are made up of leaves of a larger area, and the petioles are long. When the leaves occlude more nodes, there is a tendency to categorize such plants as class III with high probability. In class III, the region belonging to the stem receives strong attention in some samples; however, leaves broadly cover it. In class IV images, there are a larger number of shoots, and the model focuses on the terminal parts of these shoots, whose leaves are unoccluded, distinctive, and younger than the others. Images of this class tend to be classified as class III when the plant does not appear to have many leafy shoots.
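For reference, Grad-CAM can be computed in a few lines. The sketch below assumes a functional Keras model and a caller-supplied last-conv layer name; it follows the standard formulation of [43] rather than reproducing our exact script.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, last_conv_name, class_index=None):
    """Grad-CAM [43]: weight the last conv layer's feature maps by the
    spatially pooled gradients of the class score, then ReLU and normalize."""
    conv_layer = model.get_layer(last_conv_name)
    grad_model = tf.keras.Model(model.inputs, [conv_layer.output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))  # default: predicted class
        score = preds[:, class_index]
    grads = tape.gradient(score, conv_out)
    weights = tf.reduce_mean(grads, axis=(1, 2))            # pooled gradients
    cam = tf.nn.relu(tf.reduce_sum(conv_out[0] * weights[0], axis=-1))
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()      # heatmap in [0, 1]
```

The resulting low-resolution heatmap is then upsampled to the input size and overlaid on the image to produce maps like those in Table 5.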

D. FLEXIBILITY AND LIMITATIONS OF OUR METHOD
Our method can deal with the problem of classifying plants grown under different nitrogen concentrations by teaching the model the texture in the images in a controlled way. This is beneficial when data is limited, as many different images are created during training, thus augmenting the dataset. This artificial data prevents overfitting, allowing the model to generalize to new data. Our method has been shown to produce a better model than the simple TL technique with a large-capacity neural network.
CS has the disadvantage of requiring more training epochs than TL, and therefore more powerful hardware to speed up training. Moreover, CS introduces more hyperparameters than TL, such as the σ values and the number of epochs in each iteration of our algorithm, which depend mainly on the number of images in the dataset and the information they contain. A model with more hyperparameters can make the training process harder, as the optimal value of each must be searched for. If the data distribution changes, these hyperparameters must be tuned again.
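As a sketch of the smoothing operation these hyperparameters control, the following NumPy code blurs a single feature map with a normalized Gaussian kernel whose σ is annealed across training stages, as in CS. The kernel radius, the σ schedule, and the toy feature map are illustrative assumptions, not the values used in our experiments:

```python
import numpy as np

def gaussian_kernel(sigma, radius=2):
    """Separable 1-D Gaussian kernel, normalized so its weights sum to 1."""
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def smooth_feature_map(fmap, sigma):
    """Blur a 2-D feature map by convolving rows then columns (the Gaussian
    is separable); 'same' padding keeps the spatial shape, at the cost of
    slightly darkened borders from implicit zero padding."""
    if sigma <= 0:
        return fmap
    k = gaussian_kernel(sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, fmap)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

# anneal sigma across training stages: heavy blur (coarse texture) first,
# then progressively sharper, higher-frequency detail
fmap = np.random.default_rng(1).random((8, 8))
blurred = [smooth_feature_map(fmap, s) for s in (1.0, 0.5, 0.0)]
```

In the actual method this smoothing is applied to every feature map inside the CNN during training, so the network first learns from low-frequency content and only later sees the full texture of the plants.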
Compared with previous work, ours is the first to incorporate CL in the classification of nitrogen concentration for plants. Although previous reports [15] have evaluated the contribution of CL in tasks like classification, semantic segmentation, and object detection, those evaluations were performed on datasets of thousands of images. In contrast, our results show that CL is also beneficial for increasing classification accuracy with a small dataset on the order of only a few hundred training samples. Moreover, unlike previous work, where performance is reported only in terms of metrics, we also conducted experiments on the explainability of the learning models' decisions, based on the analysis of Grad-CAM [43] maps. Other differences of our work with respect to previous reports include: • We use an end-to-end learning methodology instead of hand-crafted descriptors paired with traditional learning algorithms [17], [18], [19], [20].
• We focus strictly on the classification of nitrogen treatments and not on other target variables such as biotic or abiotic stress [21], calcium [23], phosphorus, or potassium [25].
Finally, our model consists of a neural network that includes a pre-trained ResNet50V2 with ∼26M parameters, comparable to the aforementioned works that have relied on pre-trained models with between 23M and 88M parameters.

VI. CONCLUSION
This work presents a strategy for incorporating CS in the fine-tuning process of Convolutional Neural Networks, which classify images of basil 'Nufar' plants into four classes corresponding to different N concentration levels.
Over the years, texture has proven to be an essential feature for determining nutrient concentration levels [19], [23], [24]. The CS technique exploits the plants' texture by controlling how high-frequency information is presented to the network. Hence, this curriculum strategy can be regarded as a data augmentation technique and can cope with little training data. Another way to enrich a training set of plant images has been the use of Generative Adversarial Networks (GANs) [44]. Although training GANs with little data has been studied, on the order of thousands of images are still needed to generate quality samples [45].
The texture of the plant's leaves was very influential in improving the model's performance. However, the model could also use other plant elements to describe each class, such as the appearance of the main stem, the nodes, the morphology of the whole plant, and the terminal parts of the shoots. The easiest class to differentiate was class I (very deficient), owing to the scarce presence of leaves and the yellowish coloration of some of them. In contrast, it is harder to distinguish between classes II (deficient), III (ideal), and IV (excessive). If top-view images are also considered, the model could benefit from this visual information to better classify these three classes. Recently, Taha et al. [25] showed that deep models perform well in classifying nutrient deficiencies using plant canopy images. We will investigate how to integrate these two types of views (top and lateral) into the model since, as shown in Fig. 1, the leaf nitrogen content determined in the laboratory showed significant differences between the levels of N-based fertilization.
We also plan to explore the teacher-student model using networks with smoothing filters [46], which has proven to be a promising approach for classification when there is limited data.
Using CS could enable automatic systems with low-cost cameras for nutrient detection in greenhouses when the training dataset is limited, a common problem in agriculture. Such systems could mitigate problems such as loss of yield, environmental pollution, and high fertilizer purchase costs.

FIGURE 1. Boxplots of N concentration treatments from leaves and stems of basil plants in the fourth week after transplanting. Moreover, the average leaf area values obtained on the same date are also shown.

FIGURE 2. Example images of basil plants of the same age grown under four N concentrations, from I (left) to IV (right).

FIGURE 3. Images of a plant from six fields of view.

Fig. 4 shows examples of these patches rescaled to 224 × 224 pixels. With patch creation, we disregard plant height as a classification feature and focus on the other physiological traits of the plants, making the model robust to scale changes.

g(x, y) = 1, if 60 < H(x, y) < 100 and 35 < S(x, y) < 255; 0, otherwise. (1)

Our dataset consists of 864 patches obtained from 48 plants (i.e., 48 × 18), of which 32 plants (8 plants per treatment) are considered for training and 16 plants (4 plants per treatment) for testing.

FIGURE 4. Images of plants with the background removed.
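The background removal behind these images follows the HSV thresholding of Eq. (1). A minimal NumPy sketch of that mask, assuming OpenCV-style 8-bit HSV ranges (H in [0, 179], S and V in [0, 255]) and a hypothetical toy image, is:

```python
import numpy as np

def plant_mask(hsv):
    """Binary foreground mask from an HSV image, following Eq. (1):
    a pixel belongs to the plant when 60 < H < 100 (green hues) and
    35 < S < 255 (sufficiently saturated); everything else is background."""
    h, s = hsv[..., 0], hsv[..., 1]
    return ((h > 60) & (h < 100) & (s > 35) & (s < 255)).astype(np.uint8)

# toy 2x2 HSV image: one green, well-saturated pixel; three background pixels
hsv = np.array([[[80, 120, 200], [10, 120, 200]],
                [[80,  10, 200], [80, 255, 200]]], dtype=np.uint8)
mask = plant_mask(hsv)  # only the top-left pixel passes both thresholds
```

Multiplying each channel of the original image by this mask zeroes out the background, producing images like those in Fig. 4.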

FIGURE 5. The network architecture used to apply Transfer Learning (TL).

FIGURE 6. Comparison of accuracies and losses under different σ values throughout the CS process with the three splits. The lines are averaged over seven runs and accompanied by their standard deviation. The plots share the same axis ranges for comparison.

FIGURE 7. Comparison of performance achieved by TL vs. CS. All lines are averaged over seven runs and accompanied by their standard deviation.

FIGURE 8. Normalized confusion matrices obtained for each split during a random run of the CS process with σ = 0.2, using data augmentation.

Fig. 8 shows the normalized confusion matrix obtained in a random run with each split during CS with σ = 0.2. Each matrix corresponds to the epoch at which the test loss was smallest. The accuracies achieved in splits 1, 2, and 3 are 88.88%, 86.45%, and 90.27%, respectively. In all cases, class I is the easiest to discern. The problem becomes harder when separating classes II, III, and IV. Most misclassified class II and class III images are assigned to classes III and IV, while incorrectly classified class IV images are assigned to class III. Visually, class I plants have fewer and smaller leaves than the others, which could explain their good separability, whereas classes II, III, and IV share very similar features, such as the density of the foliage and the shade of the leaves.
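The row normalization used for these matrices can be sketched as follows; the counts below are hypothetical, chosen only to mimic the confusion pattern described above (class I well separated, classes II-IV confused among themselves):

```python
import numpy as np

def normalize_confusion(cm):
    """Row-normalize a confusion matrix so each true-class row sums to 1,
    i.e. entry (i, j) becomes the fraction of class-i samples predicted
    as class j; empty rows are left as zeros to avoid division by zero."""
    cm = np.asarray(cm, dtype=float)
    row_sums = cm.sum(axis=1, keepdims=True)
    return np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums > 0)

# hypothetical raw counts for classes I-IV
cm = np.array([[72,  0,  0,  0],
               [ 0, 60, 12,  0],
               [ 0,  6, 60,  6],
               [ 0,  0, 12, 60]])
norm = normalize_confusion(cm)  # diagonal entries are per-class recall
```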

FIGURE 9. Visualization of some feature maps of a class I test image, derived from the outputs of the following network layers: conv2_block1_preact_relu (first row), conv2_block3_preact_relu (second row), and conv3_block2_preact_relu (third row).

FIGURE 10. Visualization of some feature maps of a class II test image, derived from the outputs of the following network layers: conv2_block1_preact_relu (first row), conv2_block3_preact_relu (second row), and conv3_block2_preact_relu (third row).

Figs. 9, 10, 11, and 12 show the intermediate activations for class I, II, III, and IV test images. Each image in each row results from applying a filter obtained from a specific layer.

FIGURE 11. Visualization of some feature maps of a class III test image, derived from the outputs of the following network layers: conv2_block1_preact_relu (first row), conv2_block3_preact_relu (second row), and conv3_block2_preact_relu (third row).

FIGURE 12. Visualization of some feature maps of a class IV test image, derived from the outputs of the following network layers: conv2_block1_preact_relu (first row), conv2_block3_preact_relu (second row), and conv3_block2_preact_relu (third row).

TABLE 1. Distribution of the dataset.

TABLE 2. Summary of the parameters used in the proposed method.

TABLE 4. Evaluation metrics on test sets generated with different numbers of plants for the TL and CS strategies with data augmentation, considering all runs with ResNet50V2. These new sets were created from split 2.

TABLE 5. Test images showing the discriminative regions located by Grad-CAM for the different classes.