Predicting Agriculture Yields Based on Machine Learning Using Regression and Deep Learning


Abstract:

Agriculture contributes significantly to the economy of India, and human beings depend on it for their survival. The main obstacle to food security is population expansion, which leads to rising demand for food. Farmers must produce more on the same land to boost the supply. Through crop yield prediction, technology can assist farmers in producing more. This paper’s primary goal is to predict crop yield using variables that have posed a serious threat to the long-term viability of agriculture: rainfall, crop, meteorological conditions, area, production, and yield. Crop yield prediction is a decision-support tool, built on machine learning and deep learning, that can be used to decide which crops to produce and what to do during the crop’s growing season. Regardless of environmental disturbances, machine learning and deep learning algorithms are used in crop selection to reduce agricultural yield losses. To estimate the agricultural yield, three machine learning techniques (decision tree, random forest, and XGBoost regression) and two deep learning techniques (convolutional neural network and long short-term memory network) have been used. Accuracy, root mean square error, mean square error, mean absolute error, standard deviation, and losses are compared. The random forest and the convolutional neural network outperform the other machine learning and deep learning methods. The random forest has a maximum accuracy of 98.96%, mean absolute error of 1.97, root mean square error of 2.45, and standard deviation of 1.23. The convolutional neural network has been evaluated with a minimum loss of 0.00060. Consequently, a model is developed that, compared to other algorithms, predicts the yield quite well. The findings are then analyzed using the root mean square error metric to understand better how the model’s errors compare to those of the other methods.
Published in: IEEE Access (Volume: 11)
Page(s): 111255 - 111264
Date of Publication: 04 October 2023
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Agriculture is vital to humanity, not only for food but also for employment and the economy. Crop cultivation on arable land is relatively “new,” even though people have been eating grains and plants for over 100,000 years. Around 11,000 years ago, during the Neolithic era, often referred to as the New Stone Age, people first actively managed the land and its vegetation. In India, agriculture provides a considerable portion of the country’s economic support and the majority of the country’s food needs. Due to India’s rapid population growth and significant climatic changes, the balance between food demand and food supply must be maintained. Several scientific approaches have been introduced into agriculture to preserve this balance. The significant climatic variance makes it difficult for farmers to choose how to be more flexible and sustainable [1]. With modern technology and innovative farming techniques, agriculture must produce more with fewer inputs. Therefore, crop production estimation is crucial in identifying issues with food security. Worldwide, agriculture uses 70 percent of the water [2].

More than 50% of Indian workers were employed in agriculture in 2018, which contributed 17%-18% of the country’s GDP. India’s national area held steady at about 3,28,726 thousand hectares between 1971 and 2020 [3]. By 2025, India’s agricultural industry is expected to grow to USD 24 billion. With 70 percent of sales coming from retail, India has the world’s 6th largest food and grocery market. According to preliminary estimates for the fiscal year 2022-2023, the nation will produce 149.92 million tons of food grains altogether (Kharif alone). India’s rapid population growth primarily drives the sector.

This growth is further supported by increasing income levels in urban and rural locations, which boost the demand for agricultural products across the country. As a result, in addition to the introduction of numerous e-farming applications, the sector is increasingly adopting cutting-edge technologies such as geographic information systems (GIS), artificial intelligence (AI), blockchain, remote sensing, and drones.

Figure 1 shows the season-wise analysis of crop production for the Kharif, Rabi, Summer, Autumn, and Winter seasons and for the whole year. Crop production is highest in the Kharif season and lowest in winter.

FIGURE 1. Season-wise crop production.

Farmers can boost production under favorable conditions and decrease production loss under unfavorable conditions by using crop yield estimation. Numerous variables, such as farmer practices, choices, pesticides, fertilizers, weather, and market prices, affect crop yield forecasts. It is possible to estimate crop yield using statistical information from yields of previous years, together with climate, area-wise output, and rainfall. Recently, machine learning has advanced in various industries, including agriculture.

Numerous machine learning approaches, such as decision trees, artificial neural networks, support vector machines, and deep learning [5], have been applied to calculate agricultural production.

The crop production for India from 1997 to 2020 is shown in Figure 2. From the analysis shown, it can be observed that the majority of the land in the country was planted with wheat and rice, which produced more than 73% of the nation’s staple grains.

FIGURE 2. Average yearly crop production 1997–2020 [4].

India accounts for around 40 percent of global rice (Basmati and Non-Basmati) trade and exports to more than 150 countries. Trade ministry data showed that exports rose 11 percent to 2.16 million tons [6] in the first half of 2022-23.

The illustration shows that rice has the highest production and area allocation. The agricultural industry saw strong export growth over the past year. Exports of rice (basmati and non-basmati) totaled USD 6.12 billion in FY22 (through December 2021), as represented in Figure 3.

FIGURE 3. Rice exports from India in 2015-2022.

As farming developed throughout the centuries, people stopped relying on the stars and animal sacrifices to manage their crops and turned to ever more scientific techniques. Mechanization, mathematics, and scales became significant as farming expanded throughout the industrial revolution, and modern approaches were used more and more. By the 1900s, farming was using regression analysis to gauge agricultural output. Regression analysis is a family of statistical methods for determining the relationship between a dependent variable and one or more independent variables. Regression models for crop forecasting frequently rely on inputs like weather, soil characteristics, and past crop yields. Farming has now employed these models for almost a century. Regression approaches became more popular as relevant data and processing power became more widely accessible. Classification and fruit recognition have also been emerging areas in image classification and computer vision in the agriculture sector [7].

A. Research Contributions

India is a predominantly agrarian nation, with around 48.9% of the population employed in agriculture in some capacity. One of the most pressing issues is farmers’ suicides, which are a national calamity. According to a report of the Indian National Crime Records Bureau (NCRB), more than 2 lakh Indian farmers committed suicide between 1995 and 2020. This serious issue tends to come from farmers’ failure to repay loans typically obtained from banks and private landlords. Many losses occur because of inadequate knowledge about the crops, such as which crop should be sown in which season. This insight served as the impetus for this study. A robust mechanism is needed to estimate yield production at the initial stage by utilizing historical agricultural production statistics for India. Early foresight can help farmers avoid financial losses from their crops, and thereby help prevent suicides and the resulting effect on the nation’s GDP. Forecasting agricultural yields is not a simple task, as several factors such as rainfall, wind speed, soil characteristics, climate, humidity, and temperature affect agricultural production. No single dataset covering all of them is available, so the data need to be collected from multiple sources. Although the same concern has been the subject of numerous studies, improved performance is still desirable. Machine learning and deep learning are unquestionably among the most captivating scientific frontiers, inspiring academics to investigate further aspects of prediction. The primary contributions of this research are to propose a reliable agricultural yield prediction method by analyzing the data gathered from official government websites and then applying various ML and DL models to improve accuracy metrics and test losses while offering a standard for comparison in this area of research. This study will help researchers understand the problem and solution domains effectively and will enhance their knowledge.

B. Article Organization

The objective of this study is to predict agriculture yields. For a thorough understanding of the literature, Section II looks closely at the theories and related research about crop yield prediction according to the investigation’s goals and the research gaps found in the literature. Section III describes the study region, data sources, technological perspective, research methodology, and approaches for agriculture yield prediction, along with the relationships among various dataset features. Section IV shows the model performance of various approaches (i.e., results). Section V concludes the entire work and also suggests additional work.

SECTION II.

Literature Review

Crop yield estimation can be achieved using machine learning approaches. In one study, a dataset containing the total area under cultivation, the canal length, the average maximum temperature, and the water sources (tanks and wells) for irrigation was used to forecast the crop output. According to another study, the computational model developed was superior to those produced using Regression Tree, Lasso, Deep Neural Network, and Shallow Neural Network approaches. For validation on the dataset using projected weather data, the RMSE is fifty percent of the standard deviation and twelve percent of the average yield [8].

Another study focused on estimating crop yields for the Kharif season in the Vishakhapatnam district of Andhra Pradesh. For the Kharif seasons from 1998 to 2002, it reported an accuracy of 97.5 percent using the parameters min/max/average temperature, area, rainfall, production, and yield [9]. Rainfall significantly influences the production of Kharif crops; hence, the scientists first used modular artificial neural networks to forecast rainfall and then used support vector regression to estimate crop output from rainfall and area data. The two approaches were used together with the aim of boosting crop yield.

The research focused on the following four goals: investigating the Artificial Neural Network (ANN) model to forecast soybean and corn yields under unfavorable weather conditions; examining the model’s capacity for regional, state, and local estimation; assessing the performance of the ANN concerning variation of the parameter; and evaluating the evolved ANN model in comparison to additional models of multivariate linear regression [10]. The study used artificial neural networks to assess rice output in various cities in Maharashtra, India. Maharashtra’s 27 districts’ information was gathered from the open records of the Indian government.

Another study estimates crop yield using ML methods such as Support Vector Regression (SVR), Random Forest, Artificial Neural Network (ANN), and K-Nearest Neighbor (KNN). The research’s data set comprises 745 examples, with 70 percent of those cases randomly assigned for model training and the remaining 30 percent for testing and performance evaluation. According to the final analysis, Random Forest achieves the highest level of accuracy [11]. Using Long Short-Term Memory (LSTM) networks and satellite data from southern Brazil, further research proposes a novel model to forecast soybean yield.

The primary aim of that research is to compare the effectiveness of LSTM neural networks, random forest, and multivariate OLS linear regression [12]. Rainfall, land surface temperature, and vegetation indices are used as independent variables in the forecasting process for soybean data, and a second aim is to determine how early the model can reasonably anticipate the yield. For all forecasts except DOY 16, LSTM outperforms all other algorithms. For DOY 16, multivariate OLS linear regression outperforms all other algorithms [13]. Another study analyzes the results of using a Sequential Minimal Optimization classifier. The experiment was conducted using the WEKA tool and data from 27 districts in Maharashtra, India.

According to the experiment’s results on the same dataset, other strategies outperform Sequential Minimal Optimization. Sequential Minimal Optimization demonstrated the lowest accuracy and poorest quality, whereas Multilayer Perceptron and BayesNet demonstrated the best accuracy and improved quality [14].

Crop productivity estimation has been proposed using a Deep Belief Network (DBN) and Parallel Layer Regression (PLR) approach. Here, five major crops flourishing in Karnataka, including pulse, ragi, and rice, are the subject of a DBN approach. The suggested methodology predicts which of the five crops each location in the relevant database should grow. The experimental findings demonstrate that the approach has significant potential for accurate prediction of agricultural productivity in terms of accuracy, specificity, and sensitivity, and that its effectiveness has been validated using real-time data and interactions with people [15].

Machine learning approaches, namely linear, lasso, and ridge regression and decision trees, have been used to estimate agricultural yield. Some of these machine learning methods were inferior to the decision tree [16].

A Crop Yield Prediction System (CYPS) is implemented using a KNN algorithm. For a farmer, however, yield predictions must be based on various factors that could influence crop production and quality. Season, crop type, and production area are three aspects that influence yield production; therefore, authors use specific fields like year, crop, area, region, and season to predict yield production. Making decisions related to agricultural risk management requires accurate knowledge of crop yield history [17].

KNN, Random Forest, and Decision Tree Classifier were examined by Rao et al. [18] using two distinct metrics: Gini and Entropy. Findings show that Random Forest has given the most accurate results.

Good performances of 91.35% and 91.17% have been obtained for VGG_19 and VGG_16 based on feature vectors [19]. Vanipriya et al. [20] have proposed using Hydroponics to deal with low agricultural production issues in India as it has high efficiency. Additionally, it offers soil cultivation a greener alternative. Food production depends on the economy and also on the yield of agricultural production [21].

SECTION III.

Proposed Work

A. Study Area

India is the subject of this study because of its diverse climate, which ranges from humid and arid in the south to temperate alpine in the north. It spans 3,28,726 thousand hectares [22], from the snow-capped Himalayan peaks to the tropical rain forests of the south, out of which 2,97,319 thousand hectares (between 1971 and 2020) are in agricultural use [3]. To conduct this study, ten key Indian crops (the majorly grown crops) have been chosen: rice, jowar, maize, bajra, tobacco, jute, barley, ragi, cotton, and wheat.

B. Data Sources

To create the dataset, the information was gathered from multiple sources, namely the publicly available official websites https://data.gov.in and https://aps.dac.gov.in [23], [24]. The generated dataset contains the following details for the years 1997 to 2020: state name, district name, crop year, season, crop type, rainfall, wind speed, humidity, area under irrigation, area, production, and yield. Figure 2 depicts the crop production for the years mentioned. To forecast the crop yield for India, this study uses a Decision Tree, Random Forest, XGBoost Regression, Convolutional Neural Network, and Long Short-Term Memory Network. Furthermore, accuracy, standard deviation, root mean squared error, and test loss are used for validation.

The flow of the work done is demonstrated in Figure 4. Here, the dataset is partitioned into training and testing data. 70 percent of the entire data is selected for training purposes, and the remaining 30 percent is used for validation. On the training dataset, k-fold cross-validation is employed. The number of folds taken is four, and training data is randomly assigned to each fold. The trained model is then evaluated.
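The partitioning described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the row count of 1000, the random seeds, and the helper names `train_test_split_indices` and `assign_folds` are hypothetical and do not come from the article, which does not say how the split was implemented.

```python
import random

def train_test_split_indices(n_rows, train_fraction=0.7, seed=0):
    """Shuffle row indices and split them into train and test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_rows * train_fraction))
    return indices[:cut], indices[cut:]

def assign_folds(train_indices, k=4, seed=0):
    """Randomly distribute the training indices over k folds of near-equal size."""
    shuffled = list(train_indices)
    random.Random(seed).shuffle(shuffled)
    # Taking every k-th element gives folds whose sizes differ by at most one.
    return [shuffled[i::k] for i in range(k)]

# 1000 rows is a made-up dataset size used only for this sketch.
train_idx, test_idx = train_test_split_indices(n_rows=1000)
folds = assign_folds(train_idx, k=4)
print(len(train_idx), len(test_idx), [len(f) for f in folds])
# prints: 700 300 [175, 175, 175, 175]
```

The 30 percent held-out rows are never touched by the folds, so they remain available for the final evaluation of the trained model.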

FIGURE 4. The k-fold cross-validation.

C. Methods

1) K-Fold Cross Validation

In k-fold cross-validation, k-1 folds are used as training data to train the model, and the remaining fold is used to validate the resulting model.

This process is performed k times, and the performance metric, accuracy, is computed at each step. k-fold cross-validation guarantees that every observation from the original dataset has a chance to appear in both the training and test sets. It is one of the best approaches when only a small amount of input data is available. The flow chart for the proposed model is shown in Figure 5.
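The loop over folds can be sketched as follows. This is a minimal illustration, not the authors' code: the mean-predicting placeholder model, the toy data, and the use of RMSE as the per-fold score are all assumptions (the article reports accuracy but does not define how it is computed for regression).

```python
import math

def k_fold_scores(xs, ys, k, fit, predict):
    """Return one RMSE per fold: train on k-1 folds, validate on the held-out fold."""
    n = len(xs)
    # Deterministic interleaved folds keep the sketch reproducible.
    folds = [list(range(i, n, k)) for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [predict(model, xs[i]) - ys[i] for i in held_out]
        scores.append(math.sqrt(sum(e * e for e in errors) / len(errors)))
    return scores

# Placeholder model (hypothetical): always predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)

def predict_mean(model, x):
    return model

xs = list(range(12))              # toy feature values
ys = [2.0 * x + 1.0 for x in xs]  # toy targets
scores = k_fold_scores(xs, ys, k=4, fit=fit_mean, predict=predict_mean)
print([round(s, 3) for s in scores])
```

Each observation is used for validation exactly once and for training k-1 times, which is the property the text refers to.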

FIGURE 5. Flow chart of the approach used.

2) Decision Tree

Classification and regression problems can be solved using a decision tree, which performs two essential functions: firstly, it identifies the features that are pertinent to each decision, and secondly, it determines the best course of action based on the selected features. The Decision Tree algorithm assigns a probability distribution to the plausible choices [25]. Here, every internal node represents a feature, each branch denotes a decision, and each leaf node denotes an outcome. To begin building the tree, one feature is chosen as the root node. The data are then split repeatedly to complete the decision tree.
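As an illustration of the data-splitting step for a regression tree, the sketch below searches one numeric feature for the threshold that minimizes the summed squared error of the two resulting groups. The splitting criterion, the function name `best_split`, and the toy area and yield values are assumptions made for this example; the article does not state which criterion was used.

```python
def best_split(xs, ys):
    """Find the threshold on one numeric feature that minimizes the summed squared error."""
    def sse(values):
        # Squared error of a group around its own mean (the leaf prediction).
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best = None  # (error, threshold)
    candidates = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct feature values.
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        error = sse(left) + sse(right)
        if best is None or error < best[0]:
            best = (error, threshold)
    return best

# Toy data (hypothetical): cultivated area versus yield.
area = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
crop_yield = [5.0, 6.0, 5.5, 20.0, 21.0, 19.5]
error, threshold = best_split(area, crop_yield)
print(threshold)  # prints: 6.5
```

A full tree applies the same search recursively to the left and right groups until a stopping rule, such as a maximum depth, is reached.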

3) Random Forest

Random forest (RF) is an extension of the bagging approach: it combines bagging with feature randomness to build a forest of largely uncorrelated decision trees [26]. Low correlation across decision trees is ensured via feature randomness, also known as feature bagging or the random subspace technique, which provides a random selection of features. Whereas a single decision tree can take all possible feature splits into account, RF chooses only a subset of them [27].

Each decision tree in the ensemble is built using a bootstrap sample, a data sample drawn with replacement from the training set. About one-third of the training set is left out of each bootstrap sample; this out-of-bag portion is used as test data.
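The bootstrap and out-of-bag idea can be checked numerically with a short sketch. The row count, the seed, the feature list, and the subset size of 3 are hypothetical choices for illustration only; the article does not report the settings used.

```python
import random

def bootstrap_sample(n_rows, rng):
    """Draw n_rows indices with replacement; rows never drawn form the out-of-bag set."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag

rng = random.Random(42)
n_rows = 1000  # made-up training set size
in_bag, out_of_bag = bootstrap_sample(n_rows, rng)
oob_fraction = len(out_of_bag) / n_rows
print(round(oob_fraction, 2))  # close to 1/e, i.e. roughly one-third of the rows

# Feature randomness: each tree (or split) sees only a random subset of the features.
features = ["rainfall", "wind_speed", "humidity", "area_under_irrigation", "area"]
feature_subset = rng.sample(features, 3)
print(feature_subset)
```

Because every tree is trained on a different bootstrap sample and feature subset, the trees make partly independent errors, which is what averaging over the forest exploits.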

4) XGBoost

Extreme Gradient Boosting, i.e., XGBoost, is a popular and successful open-source implementation of the gradient-boosted trees method. Regression trees [28] act as weak learners when gradient boosting is used for regression. Each regression tree maps every input data point to one of its leaves, which holds a continuous score. XGBoost minimizes a regularized (L1 and L2) objective function that combines a convex loss function with a penalty term for model complexity, i.e., for the regression tree functions [29].
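For reference, the regularized objective described above is commonly written as follows. The notation is an assumption borrowed from the usual XGBoost formulation rather than from this article: l is the convex loss, f_k is the k-th regression tree, T is its number of leaves, w_j are its leaf scores, and gamma, lambda, and alpha are the complexity, L2, and L1 penalty weights.

```latex
\mathcal{L} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right),
\qquad
\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2} + \alpha \sum_{j=1}^{T} \lvert w_j \rvert
```

The first sum measures how well the ensemble fits the training data, while the second sum penalizes trees with many leaves or large leaf scores.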

During training, new trees are added repeatedly to predict the errors (residuals) of the previous trees, and they are then combined with the preceding trees to make the final prediction. Because the technique uses a gradient descent algorithm to reduce the loss while adding new models, it is called “gradient boosting” [30].
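The residual-fitting loop can be illustrated with one-split regression trees (stumps) and a squared-error loss. This is a plain gradient boosting sketch, not XGBoost itself: it omits the regularization terms, and the toy data, learning rate, and number of trees are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree to the residuals: (threshold, left_value, right_value)."""
    best = None
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        t = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]

def boost(xs, ys, n_trees=20, learning_rate=0.3):
    """Add stumps one at a time, each fitted to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)  # initial prediction: mean of the targets
    preds = [base] * len(ys)
    trees, history = [], []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        trees.append((t, lv, rv))
        preds = [p + learning_rate * (lv if x <= t else rv) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, trees, history

xs = [1, 2, 3, 4, 5, 6, 7, 8]                   # toy feature values
ys = [1.0, 1.5, 1.2, 3.0, 3.2, 5.0, 5.5, 5.2]   # toy targets
base, trees, history = boost(xs, ys)
print(round(history[0], 3), round(history[-1], 3))  # training MSE after first and last tree
```

For squared error the residuals are the negative gradient of the loss with respect to the predictions, so each added tree takes a small gradient descent step in prediction space.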

Decision tree, random forest, and XGBoost are each represented graphically in Figures 6, 7, and 8, respectively.

FIGURE 6. Decision tree.

FIGURE 7. Random forest.

FIGURE 8. XGBoost.

5) Convolutional Neural Network (CNN)

Convolutional neural networks apply a mathematical operation called convolution between their layers. This implementation of CNN has a total of 7 layers. The first layer is a Conv1D layer with 64 filters, each with a kernel size of 3. Layer 2 is a MaxPooling1D layer with a pool size of two. Layer 3 applies dropout. Layer 4, which uses the ReLU activation function, flattens the output of layer 3 and sends it to layer 5. Layer 6 is a hidden layer with 330 neurons. The output layer, layer 7, uses the SoftMax function and has 11 neurons for 11 kinds of output. The CNN architecture of the model used is shown in Figure 9.
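To make the layer sizes concrete, the sketch below traces output shapes and trainable parameter counts through the layers named above. Several points are assumptions: the input length of 12 with a single channel is hypothetical (the article does not give the input shape), the convolution is assumed to be unpadded with stride 1, and layer 5 is left out because the text does not describe it.

```python
def conv1d_out(length, kernel_size, stride=1):
    """Output length of an unpadded ('valid') 1-D convolution."""
    return (length - kernel_size) // stride + 1

def maxpool1d_out(length, pool_size):
    """Output length of non-overlapping 1-D max pooling."""
    return length // pool_size

INPUT_LENGTH = 12    # hypothetical number of input steps/features
INPUT_CHANNELS = 1   # hypothetical number of input channels

shapes, params = [], {}

length = conv1d_out(INPUT_LENGTH, kernel_size=3)
shapes.append(("Conv1D(64, kernel=3)", (length, 64)))
params["conv1d"] = 3 * INPUT_CHANNELS * 64 + 64   # weights + biases

length = maxpool1d_out(length, pool_size=2)
shapes.append(("MaxPooling1D(2)", (length, 64)))
shapes.append(("Dropout", (length, 64)))           # dropout keeps the shape

flat = length * 64
shapes.append(("Flatten", (flat,)))

shapes.append(("Dense(330)", (330,)))
params["dense_330"] = flat * 330 + 330

shapes.append(("Dense(11, softmax)", (11,)))
params["dense_11"] = 330 * 11 + 11

for name, shape in shapes:
    print(name, shape)
print(params)
```

Changing `INPUT_LENGTH` shows how strongly the size of the first dense layer depends on the flattened convolution output.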

FIGURE 9. CNN architecture.

6) Long Short-Term Memory Network (LSTM)

A Long Short-Term Memory (LSTM) network is a type of recurrent neural network (RNN) that specifically handles sequential data, such as text, speech, and time series. LSTM networks can learn long-term dependencies from sequential data. For both models, i.e., CNN and LSTM, the optimizer used is Adam, which tunes the attributes of a neural network such as layer weights and learning rate [31], [32]; it is a combination of the RMSProp and “gradient descent with momentum” algorithms. The loss is calculated using the mean squared error (MSE). For training the model, the number of epochs is 50, the batch size is 32, and the validation split is 0.2. The LSTM architecture of the model used is shown in Figure 10.
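How Adam combines a momentum term with an RMSProp-style scaling term can be seen in a scalar sketch. The quadratic objective, the learning rate of 0.1, and the 500 steps are hypothetical choices for illustration; the article does not report the optimizer hyperparameters used for the CNN and LSTM.

```python
import math

def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a scalar function with Adam, given its gradient."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # momentum: running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # RMSProp: running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Toy objective (hypothetical): f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(round(x_min, 3))
```

In a neural network the same update is applied independently to every weight, with the gradient of the MSE loss taking the place of the toy gradient.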

FIGURE 10. Working of LSTM.

SECTION IV.

Results

The features of the dataset are related to one another. The authors identified the crop as a significant feature and plotted its relationships with other features. Figure 11 shows the production count for each of the 10 crops taken into consideration in the study. Figure 12 shows the historical relationship between Area and Crop Type. It can be observed that the maximum area is allotted to wheat, followed by rice.

FIGURE 11. Relationship between crop and Production.

FIGURE 12. Area allotted to crops.

Figures 11 and 12 show the historical correlations between the dataset’s features, such as Crop type, Area, and Production.

Table 1 presents the forecast outcomes. The study concludes that Random Forest outperforms other machine learning algorithms regarding accuracy.

TABLE 1. Simulating model performance with area and production as inputs

Using statistical data, Random Forest produces the most accurate crop production estimation for India, with accuracy = 98.96%, MAE = 1.97, RMSE = 2.45, and SD = 1.23 (Table 1). For the Decision Tree and XGBoost, the accuracy, mean absolute error, root mean square error, and standard deviation are 89.78%, 4.58, 5.86, 2.75 and 86.46%, 6.31, 7.89, 3.54, respectively. Figure 13, Figure 14, and Figure 15 show the model performance of Decision Tree, Random Forest, and XGBoost, respectively.
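The error metrics reported above can be computed from true and predicted values as in the following sketch. Two points are assumptions: the article does not define how SD is obtained, so it is taken here as the population standard deviation of the prediction errors, and accuracy is left out because its definition for a regression task is not given. The sample values are made up.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, and the standard deviation of the prediction errors."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_error = sum(errors) / n
    # Assumed definition: population standard deviation of the errors.
    sd = math.sqrt(sum((e - mean_error) ** 2 for e in errors) / n)
    return {"mae": mae, "mse": mse, "rmse": rmse, "sd": sd}

# Made-up yields and predictions, used only to exercise the function.
metrics = regression_metrics([10.0, 12.0, 14.0, 16.0], [11.0, 11.0, 15.0, 17.0])
print({name: round(value, 3) for name, value in metrics.items()})
# prints: {'mae': 1.0, 'mse': 1.0, 'rmse': 1.0, 'sd': 0.866}
```

Under this definition the squared RMSE equals the squared SD plus the squared mean error, so RMSE is never smaller than SD.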

FIGURE 13. Model performance of decision tree.

FIGURE 14. Model performance of random forest.

FIGURE 15. Model performance of XGBoost regression.

With accuracy ratings of 89.78% and 86.46%, respectively, Decision Tree and XGBoost Regression perform worse than Random Forest. Machine learning is often described as a “black box” technique because its predictions offer little interpretation. In this study’s application of machine learning techniques to forecast the estimated crop output for India, Random Forest outperformed the other regression approaches.

Figure 16 details the layers used in CNN and LSTM. Table 2 shows the test loss for CNN and LSTM, evaluated as 0.00060 and 0.00063, respectively.

TABLE 2. Test results of CNN and LSTM
FIGURE 16. Layer description of CNN and LSTM.

Figures 17 and 18 show the model performance of CNN and LSTM, respectively.

FIGURE 17. Model performance of CNN.

FIGURE 18. Model performance of LSTM.

Figure 19 illustrates a comparison of CNN and LSTM. The results show that a change in the number of epochs leads to a significant change in mean absolute error. Here, it can be observed that CNN performs better than LSTM, as the loss is lower with CNN.

FIGURE 19. CNN v/s LSTM.

SECTION V.

Conclusion

The demand and supply for food have grown more difficult to manage as the population grows. To assist farmers, experts have worked hard over the past few years to anticipate agricultural yield production. In order to forecast India’s crop yield, this study uses various machine learning and deep learning approaches. The study underlines the advantages of cutting-edge procedures. It is beneficial for small-scale farmers, as they may use the predictions to estimate crop production for upcoming years and plan their planting accordingly. Five machine learning and deep learning algorithms, namely Decision Tree, Random Forest, XGBoost regression, Convolutional Neural Network, and Long Short-Term Memory Network, are applied to the dataset taken into consideration. When data is analyzed at the country level, Random Forest (with accuracy 98.96%, mean absolute error 1.97, RMSE 2.45, and standard deviation 1.23) and CNN (with minimum loss 0.00060) perform better according to the current prediction. Experimental findings demonstrate that the approach has great potential for precise crop productivity prediction and that its effectiveness has been validated using real-time data and interactions with people. More data for each crop year, with more historically precise information about the climate and environment, is needed. More deep learning models need to be applied to the dataset to identify the method that performs best. To increase the model’s accuracy in crop production prediction, remote sensing data could be combined with district-level statistical data. The prediction could also be made more accurate using satellite imagery of land cover or satellite image classification.
