Crop Yield Prediction Using Deep Reinforcement Learning Model for Sustainable Agrarian Applications

Predicting crop yield from environmental, soil, water and crop parameters has been an active research topic. Deep-learning-based models are broadly used to extract significant crop features for prediction. Though these methods can address the yield prediction problem, they have the following inadequacies: they are unable to create a direct non-linear or linear mapping between the raw data and crop yield values, and their performance relies heavily on the quality of the extracted features. Deep reinforcement learning provides direction and motivation for addressing the aforementioned shortcomings. Combining the intelligence of reinforcement learning and deep learning, deep reinforcement learning builds a complete crop yield prediction framework that can map the raw data to the crop prediction values. The proposed work constructs a Deep Recurrent Q-Network model, which is a Recurrent Neural Network deep learning algorithm over the Q-Learning reinforcement learning algorithm, to forecast the crop yield. The sequentially stacked layers of the Recurrent Neural Network are fed by the data parameters. The Q-learning network constructs a crop yield prediction environment based on the input parameters. A linear layer maps the Recurrent Neural Network output values to the Q-values. The reinforcement learning agent incorporates a combination of parametric features with the threshold that assist in predicting crop yield. Finally, the agent receives an aggregate score for the actions performed by minimizing the error and maximizing the forecast accuracy. The proposed model efficiently predicts the crop yield, outperforming existing models by preserving the original data distribution, with an accuracy of 93.7%.


I. INTRODUCTION
Agriculture is one of the most substantial areas of interest to society, since a large portion of food is produced by it. Currently, many countries still experience hunger because of the shortfall or absence of food with a growing population. Expanding food production is a compelling way to eliminate famine. Improving food security and reducing hunger by 2030 are critical objectives for the United Nations. Hence crop protection, land assessment and crop yield prediction are of considerable significance to global food production [1]. A country's policymakers depend on precise forecasts to make appropriate export and import assessments to reinforce national food security. Cultivators and farmers further benefit from yield forecasts to make financial and management decisions. Agricultural supervision, especially the observation of crop yield, is indispensable to determine food security in a region [2]. On the other hand, crop yield forecasting is exceedingly challenging because of various complex aspects. Crop yield mainly depends upon climatic conditions, soil quality, landscapes, pest infestations, water quality and availability, genotype, planning of harvest activity and so on [3]-[5].
The associate editor coordinating the review of this manuscript and approving it for publication was Dongxiao Yu.
The crop yield processes and strategies vary with time and are profoundly non-linear in nature [6], and intricate due to the integration of a wide range of correlated factors [7], [8] characterized and impacted by non-arbitrate runs and external aspects. Usually, a considerable part of the agricultural framework cannot be delineated in a simple stepwise calculation, especially with complex, incomplete, ambiguous and noisy datasets. Currently, many studies demonstrate that machine learning algorithms have comparatively greater potential than conventional statistics [9]-[12]. Machine learning is a field of artificial intelligence in which computers can be instructed without explicit programming. These processes resolve non-linear or linear agricultural frameworks with remarkable forecasting ability [13]. In machine learning agricultural frameworks, the techniques are obtained from the learning process. These processes require training to perform a specific task. After the completion of the training process, the model makes predictions on the test data.
Further, machine learning resembles an umbrella that holds various significant strategies and methodologies. On observing the most prominent models in agriculture, we can see the utilization of artificial and deep neural networks [14]. Deep learning is a subgroup of machine learning that can determine outcomes from varying arrangements of raw data. Deep learning algorithms, for example, can develop a probability model by taking a decade of field data and providing insights about crop performance under various climatic conditions [15]. Data scientists utilize various machine learning algorithms to derive actionable insights from the available information. Another intriguing area of artificial intelligence is reinforcement learning [16]. These can be examined as an essential class of algorithms that can be utilized for streamlining logic for dynamic programming. Reinforcement learning is the preparation of machine learning models to make decision sequences [17]. The agent learns to accomplish an objective in an ambiguous, potentially complex environment. Based on the agent's action, the environment rewards it. This scenario depicts the machine as the agent and its surroundings as the environment.
In recent times, an advanced artificial intelligence technique named deep reinforcement learning (DRL) has become prominent for intelligent decision making in various domains like energy management [18], robotics [19], health care [20], smart grid, game theory [21], [22], finance, computer vision [23], natural language processing [24], sentiment analysis [25] and so on, through an extensive combination of reinforcement learning methods with deep learning models [26], [27]. This model has been efficient in resolving a wide range of complicated decision-making tasks that were formerly out of reach for machines. As a result, it is a convincing model for developing intelligent agricultural frameworks. The characteristic models of deep reinforcement learning include the deep successor network, multi-agent deep reinforcement learning and the deep Q-network.
In this paper, we propose a supervised smart agriculture framework based on the deep reinforcement learning algorithm. A deep Q-Learning based DRL algorithm is used to strengthen the crop yield forecasting efficiency with the best rewarding iterations. There exist several other deep learning algorithms that may not be bounded by biases or require huge manual effort in label creation, deriving insights directly from the data, such as autoencoders [28], deep belief networks [29], Gaussian-Bernoulli RBMs [30], Bayesian neural nets [31] and deep generative models [32]. These models can sometimes fail to account for uncertainty while interpreting ambiguous inputs. Most of these approaches follow greedy procedures that are sub-optimal, learning a single layer of features at a time without updating the lower-level parameters, resulting in slow and inefficient computations. The proposed work overcomes the above-mentioned shortcomings, promoting the advancement of smart agriculture and thereby leading to increased food production. The rest of the paper is organized as follows. Section II presents the literature review of the existing works. Section III describes the Deep Q-learning algorithm and the proposed Deep Recurrent Q-Network (DRQN) model for forecasting the crop yield. Section IV explains the agriculture dataset and study area description. Section V presents the experimental results and frameworks, and the performance of the DRL model over the other machine learning algorithms. Section VI wraps up with the conclusion and future works.

II. RELATED WORK
The advances in Artificial Intelligence have wide-ranging potential [33], [34]. Deep learning has surged together with enormous data growth, creating new opportunities [35]. This results in the need for improved measures to envision, determine and assess data-intensive strategies in agricultural frameworks [36], [37]. Crop yield prediction can be considered as a pattern recognition problem where AI has shown notable efficiency for agricultural applications [38]. Abrougui et al. have proposed yield prediction of the potato crop from the soil properties and tillage system using an ANN. The ANN model showed great potential to estimate yield [39]. Haghverdi et al. have defined the prediction of cotton lint yield from the phenology of crop indices using ANN. The ANN approach is used to generate 61200 models relating individual crop indices to field estimates of the cotton yield to be predicted [40]. Byakatonda et al. explained an ANN-based yield forecast for the maize crop based on the climatic indices and the precipitation length. In order to facilitate agricultural planning, yield predictions are made using ANN models [41]. In the approaches mentioned above, the ANNs were used for the processing, which relied on feature extraction by time-domain and frequency-domain processing methods. This has two drawbacks: manual feature extraction depends mainly on prior knowledge of the data for predicting yield, and the ANN's shallow architecture is limited in learning the complex non-linear relationships in the yield prediction system. With the advent of deep learning, such problems are handled to a certain extent.
Yang et al. have proposed a deep convolutional neural network model to estimate the yield of the rice crop at the ripening stage. The CNN learns the significant spatial features concerning the crop yield from high spatial resolution RGB images [42]. Deep learning has enabled crop mapping strategies to identify the crop yield in a respective region. Winter wheat mapping using ground data statistical references, employing artificial neural networks and a deep CNN, is modeled by Zhong et al. This enables automatic identification of wheat seasonality without using samples [43]. Ramesh et al. proposed an optimized deep neural network algorithm to recognize and classify crop yield based on diseased leaf images obtained by an image processing method [44]. Babak et al. computed a numerical deep learning model of crop growth by incorporating the DSSAT model's rainfall and irrigation inputs to predict maize yield [45]. An efficient automatic method for rice crop yield heading date estimation through a deep learning CNN using time series RGB images of the crop [46] has been proposed by Desai et al. Koirala et al. proposed a two-staged deep learning method using CNN for mango fruit yield estimation [47]. From the literature, the ANN-based process can be efficiently identified as a primary predictor, whereas deep learning approaches can perform adaptive crop feature extraction through the hierarchical representation of the DNN architecture. The DNN architecture, however, needs a great deal of experience and prior knowledge, which limits its generalization capability. Therefore it is essential to organize a deep reinforcement learning (DRL) based smart architecture to examine crop yield prediction. In the DRL framework, deep learning provides the agent with the ability to sense the environment and reinforcement learning provides the ability to learn the best strategy for real-time problems [48].
VOLUME 8, 2020
DRL enables creating an agent that can generalize to an environment that is examined as meta-learning [49]. As a generic way of solving optimization problems through trial and error, DRL finds its application in several fields like agriculture [50], health care [51], energy management [52], robotic system [53] and game theory [54]. The following section provides a brief introduction to the Deep Q-Network DRL algorithm and the proposed methodology.

III. DEEP Q-NETWORK ALGORITHM BACKGROUND
Deep reinforcement learning has advanced together with enormous data growth and improved measure persistence to make new opportunities to determine, evaluate and acknowledge extensive data procedures for agricultural frameworks. Some of the essential factors that need to be analyzed in structuring the deep reinforcement learning models are:
• Understanding the patterns and basic structures from the restricted sample space.
• Reviewing the objective functions with constant representations of events.
• The performance of the framework must be adequately viable to embrace consistently dynamic actions.
This section explains in detail the reinforcement learning, Q-learning and the deep Q-Network algorithm.

A. REINFORCEMENT LEARNING
Reinforcement learning (RL) is a framework in artificial intelligence with a dynamic programming concept that develops and trains algorithms utilizing a strategy of reward and penalty. RL differs from other machine learning algorithms in that it is not explicitly instructed how to perform a task, but works through the problem on its own [55]. For the RL study process, a Markov Decision Process (MDP) is characterized, which provides the formalism in which reinforcement learning problems are posed. The RL algorithm, which acts as an agent, learns by interacting with the environment. The agent gets rewards for correct actions and penalties for wrong actions. The agent learns by itself without human intervention by increasing its rewards and limiting its penalties. The process of reinforcement learning is presented in Fig. 1. An agent that is present in a state 's' performs an action 'a'. On performing an action the agent attains a reward R(s,a) and moves into a new state s'. The policy is a function that maps the states to the actions. In each state, a policy π is determined to specify the action to be carried out by an agent. In an agent's lifetime, its key objective is to identify an optimal policy π* which maximizes the total discounted reward. The optimal policy π* is defined in equation (1).
A value function V^π(s,a) [56], defined for each state-action pair, is an estimate of the expected reward following a policy π. The optimal value function is attained from the optimal policy, which is identified by the highest reward obtained by an agent over all the states. This optimal value function is represented in equation (2).
Thus the reinforcement learning agent learns from the environment through interactions. It maximizes its rewards by determining the Bellman optimal policy and value function using dynamic programming functions.
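Equations (1) and (2) are cited in this subsection but do not appear in the text. As a hedged reconstruction in standard textbook notation, and not necessarily the authors' exact form, they can be written as below. The discount factor γ in [0, 1) and the time index t are assumptions introduced here for illustration.

```latex
% Assumed standard forms; \gamma (discount factor) and t (time step) are illustrative symbols.
\begin{align}
\pi^{*} &= \arg\max_{\pi} \; \mathbb{E}\!\left[ \sum_{t \ge 0} \gamma^{t} R(s_t, a_t) \,\middle|\, \pi \right] \tag{1} \\
V^{*}(s, a) &= \max_{\pi} V^{\pi}(s, a) \tag{2}
\end{align}
```

Read this way, (1) selects the policy with the largest expected total discounted reward, and (2) takes the value of each state-action pair under the best available policy.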

B. Q-LEARNING
Q-Learning is a method by which an agent assesses which action to take, depending on an action-value function. It determines the value of being in a specific state and taking a specific action in that state. It is one of the most significant advances in reinforcement learning, through the development of an off-policy temporal difference control algorithm. Q-Learning evaluates a state-action value function for a target policy that chooses the action of maximum value. The function Q takes as input the current state 's' and an action 'a' and returns the expected reward of that action in that state. In the initial steps, before analyzing the environment, the Q function gives arbitrary fixed values. Later, with better analysis, the Q function provides a better approximation of the value function for the action 'a' in the state 's'. The Q function keeps updating until it provides the optimal value. The agent then performs a series of actions that ultimately generates the maximum total reward.
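The update described above can be illustrated with a minimal tabular sketch. This is the standard Q-learning temporal-difference rule, not code from the paper; the learning rate `alpha`, the discount `gamma`, and the state and action names are hypothetical values chosen for illustration.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Apply one off-policy temporal-difference update to the tabular Q function.

    alpha (learning rate) and gamma (discount factor) are assumed values.
    """
    # Off-policy target: value of the best action available in the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    # Move the current estimate a fraction alpha towards the target.
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Q starts from arbitrary fixed values (0.0 here) and improves with experience.
q_table = defaultdict(float)
ACTIONS = ["predict_low", "predict_high"]  # hypothetical action names
q_update(q_table, "s0", "predict_high", 1.0, "s1", ACTIONS)
q_update(q_table, "s0", "predict_high", 1.0, "s1", ACTIONS)
```

Repeating the update with the same transition moves the estimate steadily towards the target, which is the sense in which the Q function "keeps updating".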

C. DEEP Q-NETWORK
The Deep Q-Network (DQN) is an advanced reinforcement learning agent that uses a Deep Neural Network (DNN) to map the connections among the states and the actions, analogous to a Q-Table in Q-Learning. DNNs like Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN) and sparse auto-encoders can directly learn abstract representations of the raw data from the sensors. A DQN agent communicates with the environment through a series of observations, actions and rewards, which is identical to the task of a Q-Learning agent. Fig. 2 depicts the generic structure of the deep Q-Network. The network takes a state as input and generates a Q-value for each action in the action space. The objective of the neural network is to learn and train the parameters. During the prediction process, this trained network is used to predict the next best action to take in the environment. Basically, Q-Learning determines the state-action value function for a specific target policy that ultimately chooses the action of best value. It works well for a restricted state and action space. However, a huge action space may require millions of records to be stored in program memory. This inflates the memory volume, leading to the curse of dimensionality or an unstable representation of the Q-function.
The instability in Q-Learning arises due to the correlations existing in the series of observations. Relatively small updates in the Q-value can result in a drastic change in the policy of the agent and also in the correlation between the target and the Q-value. These inadequacies are overcome in the Deep Q-Network using two strategies, namely, experience replay and iterative updates. Iterative updates minimize the correlation between the target and the Q-values by consistently revising the Q-values towards the target values, while experience replay tends to solve the correlation problem by smoothing over changes in the data distribution through data randomization. In the proposed work, during the enhancement of the DQN agent, the experience replay randomly selects experiences from the memory, and the Deep Q-Network utilized is the RNN, which acts as a function approximator with weights θ. Hence the Q-Network can be trained by revising the parameters θ_i in the i-th iteration to reduce the mean squared error in the Bellman equation. The loss function, which is the squared difference between the target Q and the predicted Q, is defined in equation (3). Gradient descent on the actual parameters can be performed in order to reduce this loss.
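Equation (3) is cited above but does not appear in the text. A hedged reconstruction, using the usual DQN form of "target Q minus predicted Q, squared", is given below. The discount factor γ, the target-network parameters θ_i^- and the replay memory D as the sampling distribution are assumptions taken from the standard formulation rather than from the source.

```latex
% Assumed standard DQN loss; \gamma, \theta_i^{-} and sampling from D are illustrative assumptions.
\begin{equation}
L_i(\theta_i) = \mathbb{E}_{(s, a, r, s') \sim D}
\left[ \Big( \underbrace{r + \gamma \max_{a'} Q(s', a'; \theta_i^{-})}_{\text{target } Q}
 - \underbrace{Q(s, a; \theta_i)}_{\text{predicted } Q} \Big)^{2} \right] \tag{3}
\end{equation}
```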

D. PROPOSED DEEP REINFORCEMENT LEARNING MODEL FOR CROP YIELD PREDICTION
Reinforcement learning is broadly applied in areas such as operations research, game theory, multi-agent systems, and control theory. In the proposed work, forecasting crop yield is studied as a regression problem that is resolved by supervised learning. This supervised learning-based crop yield prediction process needs to consider the crop yield data and its corresponding parameters as the inputs to determine the crop yield in the concerned region. In the RL based methods, the learning efficiency of the yield predicting agents is determined by the overall rewards. This results in unsteady feedback for the agents to adapt their performance, compared with supervised learning methods. In other words, the agents will not be able to recognize from the inputs which samples are not efficiently learned during the learning process. Such a component forces the agent to be more efficient by uncovering the deep characteristic contrasts among the crop yields. In order to formulate the yield forecast method based on DRL, a yield forecasting environment is designed based on the input parameters, which converts the supervised learning process into a reinforcement learning process. The environment can be regarded as a 'yield prediction game'. Every game incorporates certain parametric feature combinations and thresholds that aid in predicting crop yield, and each combination has a set of samples and its corresponding labels. When the agent starts playing, it determines the crop yield parameter values by performing actions to attain rewards. For every predicted value close to the target, the agent gets a positive reward, otherwise a negative reward. After completing the entire process, the agent receives an aggregate score for the actions performed. This flow of yield prediction is presented in Fig. 3.
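The reward scheme of the 'yield prediction game' can be sketched as follows. This is only an illustration of the idea of a positive reward for a prediction close to the target and a negative one otherwise; the relative-error threshold of 5%, the reward magnitudes of +1 and -1, and the function names are assumptions, since the text does not specify them.

```python
def yield_reward(predicted, target, threshold=0.05):
    """Return +1.0 if the prediction is within the relative threshold of the target, else -1.0.

    The threshold and the +/-1 reward values are hypothetical; targets are assumed non-zero.
    """
    relative_error = abs(predicted - target) / abs(target)
    return 1.0 if relative_error <= threshold else -1.0


def play_episode(predictions, targets, threshold=0.05):
    """Aggregate score the agent receives for all actions in one game."""
    return sum(yield_reward(p, t, threshold) for p, t in zip(predictions, targets))


# Hypothetical yields in kg/hectare: two close predictions and one poor one.
score = play_episode([100.0, 200.0, 300.0], [102.0, 250.0, 300.0])
```

The aggregate score is what turns per-sample supervised labels into the episode-level feedback the agent learns from.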
For conventional reinforcement learning methods like Q-learning, it is challenging to discriminate and analyze the crop yield prediction due to the restricted ability of those methods to describe the states. Inspired by the DQN concept of processing huge amounts of information, a Recurrent Neural Network based DNN is used in the proposed method to predict the crop yield using the various environmental, soil and groundwater parameters. It is termed the Deep Recurrent Q-Learning model, which is basically an RNN on top of the DQN. An RNN can assist in mining temporal and semantic data and has advanced time series analysis, language modeling and speech recognition. The RNN is a variant of the ANN in which the current state input is connected to the output of the previous state. In other words, the network recollects the previous information and applies it to the present network calculation.
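The recurrence just described, where the current state depends on the current input and the previous state, is commonly written as shown below. This is the generic textbook form and not necessarily the authors' exact notation: the weight matrices U, W, V, the biases b, c, the activations f and g, the input x_t, the target y_t and the squared-error choice for the loss are all assumptions introduced for illustration.

```latex
% Generic RNN equations; U, W, V, b, c, f, g, x_t, y_t are illustrative symbols.
\begin{align}
h_t &= f\!\left( U x_t + W h_{t-1} + b \right) && \text{hidden state at time } t \\
O_t &= g\!\left( V h_t + c \right) && \text{predicted output at time } t \\
L_t &= \tfrac{1}{2} \left( y_t - O_t \right)^{2} && \text{one common choice of error at time } t
\end{align}
```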
In our proposed method the DQN agent is framed by stacking the RNN layers sequentially, initializing the parameters using the weights saved in the RNN pre-training process and adding a linear layer mapping the RNN output to Q-values. Fig. 4 shows the structure of the RNN used in the DQN. The value of the hidden state at time 't' is given in equation (4), the predicted output $O_t$ of the RNN at time 't' is given in equation (5), and the error L of the RNN at time 't' is given in equation (6). The crucial aspects of the RNN, which can efficiently determine the crop yield, are the layer-by-layer self-learning representation of the actual features and the sparse constraint that limits the parameter space, preventing over-fitting. The RNN in the proposed work consists of three hidden layers between the input layer and the output Q-value layer. For each RNN layer, a ReLU [57] activation function and L1 regularization [58], [59] are used. The regularization penalizes the absolute values of the parameters in the neural network when they are large. Before the training process of the DRL, a pre-training process is applied to all the training data samples. Then the agent's yield prediction perception is built by stacking the input layers and the fully connected layer to output the final Q-values. During the training process of the DRL framework, a huge state and action space is processed, which can result in instability due to data correlations. Hence in the training process of the DQN, two alterations of Q-Learning are made to ensure non-divergence of the DRL's training process. The first is experience replay, where the agent's experience is saved in the replay memory (D) as the state, action and reward of the present time step and the state of the next time step. At each time step t, the experience replay saves the agent's experience, resulting in a collection of specific sets of experiences.
An individual experience $e_t$ at a time t is described as $e_t = (s_t, a_t, r_t, s_{t+1})$ and the memory at time t is defined as $D_t$, where $D_t = \{e_1, \ldots, e_t\}$. Experience replay is an effective technique for eliminating the divergence in the parameters, enabling the agent to draw on its experience in the learning process. The second alteration of Q-Learning is to utilize an independent network for generating the targets during the Q-Learning update process. These alterations can substantially improve the DRL stability. Also, it is observed that RL algorithms usually update the action-value function iteratively using a Bellman equation. As this approach is tedious in practice, the action-value function is estimated using an RNN function approximator with weights θ. Hence the Q-Network can be trained by revising the parameters $\theta_i$ in the i-th iteration to reduce the mean squared error in the Bellman equation.
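The replay memory D described above can be sketched as a small container that stores experiences and returns random mini-batches. This is a minimal illustration, not the authors' implementation; the class name, the fixed `capacity` (the text lets D grow with t) and the uniform sampling are assumptions.

```python
import random
from collections import deque, namedtuple

# One experience e_t = (s_t, a_t, r_t, s_{t+1}).
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])


class ReplayMemory:
    """Bounded store of experiences; the capacity limit is an assumption."""

    def __init__(self, capacity):
        # Oldest experiences are discarded once the capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append(Experience(state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation of consecutive steps.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.push(state=t, action=0, reward=1.0, next_state=t + 1)
batch = memory.sample(2)
```

Sampling at random from the stored experiences, instead of training on them in order, is what smooths over changes in the data distribution.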
The training process comprises two steps. The first step involves the pre-training of the RNN and the second step is the training of the DQN agent. The agent selects and executes an action based on an ε-greedy policy. Here the action is selected randomly with probability ε, while with probability 1-ε the action with the maximum Q-value is chosen. The optimization algorithm utilized in the proposed study is the stochastic gradient descent algorithm. The optimization algorithm updates the network weights iteratively based on the training data. The algorithm for training the RNN based Deep Recurrent Q-Network is defined as follows:

Algorithm: Training of RNN Based DQN
Step 1: Pre-training of the RNN.
(e) Save the experience $(s_t, a_t, r_t, s_{t+1})$ in the memory D.
(f) With respect to the network parameters θ, perform gradient descent on $(r_t - Q(s_t, a_t; \theta))^2$.
(g) Reset Q' = Q.
End For
End For
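The ε-greedy action selection used during training can be sketched as below. The function name and the example Q-values are hypothetical; only the rule itself, a random action with probability ε and the greedy action otherwise, comes from the text.

```python
import random


def epsilon_greedy(q_values, epsilon):
    """Pick a random action index with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        # Explore: any action, chosen uniformly.
        return random.randrange(len(q_values))
    # Exploit: index of the action with the maximum Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.1, 0.9, 0.3]  # hypothetical Q-values for three actions
greedy_action = epsilon_greedy(q_values, epsilon=0.0)
random_action = epsilon_greedy(q_values, epsilon=1.0)
```

With ε = 0 the agent always exploits and with ε = 1 it always explores. In practice ε is often decayed over training, although the text does not state whether that is done here.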
The following section explains a brief definition of the various agrarian factors that influence the crop yield, and evaluation of various crop parameters to be considered for the construction of the learning models.

IV. DATASET AND STUDY AREA DESCRIPTION
Deep learning models demand huge data volume for efficient processing. Information with adaptable characteristics streamlines the effort of finding regularities by removing the irrelevant features for the learning objective. Fabricating a deep reinforcement learning model for the agricultural framework is highly tedious since they are extremely unsteady and possess a dynamic non-linear behavior. This section explains in detail the dataset used for the study for predicting the crop yield.
The proposed study investigates the yield prediction of the paddy crop for the Vellore district in the southern part of India. The blocks of the district considered for the study include Ponnai, Arcot, Sholinghur, Ammur, Thimiri, and Kalavai. Paddy is one of the prevailing commercial crops cultivated in this region and hence this area is considered for investigation. In addition to the typical climatic and soil parameters, the dataset incorporates specific climate, soil and groundwater properties along with the volume of fertilizers consumed by the crops of the study area. Some of the parameters analyzed in the current study include evapotranspiration, ground frost frequency, groundwater nutrients, wet day frequency and aquifer characteristics, which are not considered together in the existing literature. Table 1 presents concise information about the various crop parameters utilized in the study. The data is taken for 35 years. The paddy crop yield is estimated in terms of area cultivated (in hectares), paddy production (in tons) and yield acquired (in kg/hectare). Data on regular climatic factors like temperature, precipitation, reference crop evapotranspiration, potential evapotranspiration and humidity, and distinctive climatic parameters like ground frost frequency, diurnal temperature range, and wind speed have been utilized. The climatic data are provided by the Indian Meteorological Department from its portal metdata tool. The soil parameters comprise topsoil density, soil pH and the amount of the soil macronutrients (nitrogen, phosphorus and potassium) present. Distinctive hydro-chemical properties of groundwater like transmissivity, aquifer type, permeability, electrical conductivity, and pre-monsoon and post-monsoon micro-nutrient (calcium, potassium, sodium, magnesium, and chloride) content in groundwater are considered for the study.
The following section presents the experimental results obtained for predicting the crop yield using the DRQN model and comparison of the results with the existing models.

V. RESULTS AND DISCUSSION
The efficiency of a learning model is determined by evaluating the model with various execution measures or by monitoring the performance with various evaluation metrics. For the proposed work the model is validated in terms of:
• Performance estimation
• Comparison of various other algorithms in terms of:
  • Evaluation metrics
  • Data distribution properties
  • Model accuracy measures

A. PERFORMANCE ESTIMATION
During the construction of machine learning models, the dataset is arbitrarily split into training and test set, where the highest amount of data is taken as the training set. Even though the test dataset is small, there exist chances of leaving out some important information that may have enhanced the model. Also, there is a concern of high variance in the dataset.
To handle this issue, K-fold cross-validation is utilized. It is a strategy that is utilized to assess the deep learning models by re-sampling the training information for enhancing the performance. Modeling and forecasting time series data are intricate and challenging. Randomly splitting a time series data for cross-validation does not hold well. It may lead to a temporal dependency problem as there is an implicit reliance on past observation and simultaneously, a leakage from the response variable to lag variables is bound to happen. This results in non-stationarity, which is the frequent changes in mean and variance in the information space. In such cases, cross-validation is performed in a forward-chaining manner.
For the proposed approach, five-fold forward chaining cross-validation is performed, which more precisely models the prediction setting, where the model is built on past data and predicts the forward-looking data. The results are tabulated in Table 2. The procedure starts with a small initial subset of data for training, predicts the data points that follow and determines the accuracy of the predicted data. The same forecasted data points are then included as part of the next training data subset and the following data points are forecasted. Fig. 5 represents the deep reinforcement model plot before and after performing the cross-validation.
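The forward chaining scheme described above can be sketched in plain Python as follows. This is an illustrative splitter, not the authors' code: the function name and the equal-sized test folds are assumptions, and the 35 samples assume one record per year of the 35-year dataset. Scikit-learn's `TimeSeriesSplit` in `model_selection` provides a comparable expanding-window split with an `n_splits` parameter.

```python
def forward_chaining_splits(n_samples, n_splits=5):
    """Yield (train_indices, test_indices) pairs with an expanding training window.

    Every test fold lies strictly after its training data, so no future
    observation leaks into the model that predicts it.
    """
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


# Assumed: 35 yearly samples, five folds.
splits = list(forward_chaining_splits(35, n_splits=5))
for train_idx, test_idx in splits:
    print(f"train on years 0..{train_idx[-1]}, test on years {test_idx[0]}..{test_idx[-1]}")
```

Each fold's test points become part of the next fold's training window, which matches the description of forecasted points being added to the following training subset.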
The cross-validation is performed using the Python Scikit-Learn machine learning library. The dataset is preprocessed using min-max scaling, which normalizes the dataset. From the Sklearn sub-library model_selection, the train_test_split function is imported to split the training and test sets. The hyperparameter tuning for the cross-validation in determining the best 'K' is obtained using the cross_val_score function of the Sklearn library. The data is split into 'K' subsets, in this case five, by setting the parameter n_splits to 5. The training and validation data set size is determined by the parameter test_size, which is taken as 0.3 for the proposed work, indicating that 70% of the data is used for training and 30% for testing. The model is trained throughout the K-fold forward chaining cross-validation process and the error metric is determined. The error metric is the r2 score, which is recorded in every iteration, and the best value is taken as the overall model accuracy.
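To make the preprocessing and the error metric concrete, the sketch below implements min-max scaling and the r2 score directly from their definitions. It is a plain-Python illustration, not the Scikit-Learn code used in the study, and the sample values are hypothetical.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


scaled = min_max_scale([2.0, 4.0, 6.0])           # hypothetical parameter column
perfect = r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
baseline = r2_score([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

A model that always predicts the mean of the targets scores 0, and a perfect model scores 1, which is why the highest r2 value across folds is read as the best accuracy.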

B. COMPARISON WITH OTHER MODELS
The proposed deep reinforcement model DRQN is explored and tested against other significant algorithms, namely the Deep LSTM network, Artificial Neural Networks (ANN), gradient boosting (GB), random forest (RF) and other deep learning based algorithms like the Bernoulli Deep Belief Network (BDN), Bayesian Artificial Neural Networks (BAN), Rough Auto Encoders (RAE) and Interval Deep Generative Artificial Neural Networks (IDANN). An important aspect of the evaluation metrics is their ability to discriminate between the results of different learning models. The metrics include the Mean Absolute Percentage Error (MAPE) and the Explained Variance Score (Exp. Var.). To assure a fair examination of the model error metrics, two sets of these four models for training and validation were constructed to forecast yield. The hyperparameters for the proposed approach and the other models are chosen through manual selection for the respective models. The key objective of the manual hyperparameter selection is to tune the model's capacity to match the target task complexity. Hyperparameters like the learning rate, the number of hidden units, the optimizer, the activation function and the dropout values are determined by the degree to which the training process and cost function reduce the test error. The DRQN based DRL model is constructed using an RNN with one input layer, three hidden layers each consisting of 8 neurons, a fully connected layer and an output layer presenting the crop yield value. The input layer consists of 30 neurons representing the crop dataset parameters. The RNN uses a ReLU activation function for the processing in the hidden layers. To attain the best performance accuracy without over-fitting, the agent learns by performing actions through 1000 epochs.
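Two of the metrics named above, MAPE and the explained variance score, can be computed directly from their usual definitions. The sketch below is a plain-Python illustration with hypothetical yield values, not the evaluation code used in the study; it assumes non-zero targets and uses the population variance.

```python
def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent. Targets are assumed non-zero."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def variance(xs):
    """Population variance."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def explained_variance(y_true, y_pred):
    """1 - Var(residuals) / Var(targets); insensitive to a constant bias in predictions."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - variance(residuals) / variance(y_true)


# Hypothetical yields in kg/hectare.
error_pct = mape([100.0, 200.0, 400.0], [110.0, 180.0, 400.0])
biased_ev = explained_variance([1.0, 2.0, 3.0], [6.0, 7.0, 8.0])
```

The second example shows why the two metrics discriminate differently: predictions that are offset by a constant still reach an explained variance of 1, while MAPE penalizes the offset.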
In the construction of the interval deep generative artificial neural networks [60] and the rough auto encoders [61], rough set theory is introduced into the deep learning algorithm to deal with data ambiguity. Rough set theory is a mathematical framework introduced by Pawlak [62] to handle uncertainty in learning. It is a formal theory derived from fundamental research on the logical properties of information systems. An information system $S$ is defined as a 4-tuple $S = \langle U, V, A, f \rangle$. Here $U$ is the universe of primitive objects, $A$ is the set of attributes, and $V$ is the domain set such that $V = \bigcup_{a \in A} V_a$. The mapping $f : U \times A \to V$ is termed the information or total function. For any attribute set $A_t \subseteq A$ and concept set $X \subseteq U$, rough set theory defines two approximations:
• $\overline{A_t}X$, the upper approximation, represents the set of objects in $U$ that can possibly be members of $X$ with respect to the attributes of $A_t$.
• $\underline{A_t}X$, the lower approximation, represents the set of all objects in $U$ that can be identified with certainty as members of $X$ with respect to the attributes of $A_t$.
The boundary region $B(X) = \overline{A_t}X - \underline{A_t}X$ is the set of objects that cannot be assigned to $X$ with certainty by considering only the attributes of $A_t$. If $B(X) = \emptyset$ then $X$ is a crisp set with respect to $A_t$; otherwise it is a rough set.
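The lower and upper approximations defined above can be computed by grouping objects that are indistinguishable on the chosen attributes. The sketch below assumes a small, hypothetical information system of four fields; the attribute names, values and the concept set are invented for illustration.

```python
def approximations(universe, attributes, concept):
    """Return the (lower, upper) approximations of `concept`.

    universe   : dict mapping each object to its attribute-value dict
    attributes : attribute names forming the subset A_t
    concept    : set of objects X, a subset of the universe
    """
    # Group objects that share the same values on the chosen attributes.
    classes = {}
    for obj, values in universe.items():
        key = tuple(values[a] for a in attributes)
        classes.setdefault(key, set()).add(obj)

    lower, upper = set(), set()
    for members in classes.values():
        if members <= concept:   # class lies entirely inside X
            lower |= members
        if members & concept:    # class overlaps X
            upper |= members
    return lower, upper


# Hypothetical information system: four fields, two attributes.
fields = {
    "f1": {"soil": "clay", "rain": "high"},
    "f2": {"soil": "clay", "rain": "high"},
    "f3": {"soil": "sand", "rain": "low"},
    "f4": {"soil": "sand", "rain": "high"},
}
high_yield = {"f1", "f3"}  # the concept set X
lower, upper = approximations(fields, ["soil", "rain"], high_yield)
boundary = upper - lower   # non-empty, so X is rough for these attributes
```

Here f1 and f2 are indistinguishable on the two attributes, yet only f1 belongs to the concept, so both fall into the boundary region.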
For the experimental rough autoencoder model, a rough neuron based on rough set theory is incorporated into the two-layer stacked autoencoder model. The rough neurons are applied in the output and hidden layers of the RAE model. The rough neuron used in this approach consists of an upper bound neuron $U = (w_1, b_1, \alpha)$ and a lower bound neuron $L = (w_2, b_2, \beta)$. The autoencoders are trained progressively using backpropagation with stochastic gradient descent to determine the rough features for crop yield.
For the experimental IDANN, variational autoencoders combined with rough set theory are used to extract the data features. The variational autoencoder is a framework consisting of an encoder and a decoder that is trained to reduce the reconstruction error between the generated and the actual data. The features are learned by means of stochastic generation of the mean and standard deviation of the input samples. The initialization process is a regularization task; the randomly initialized parameters are moved to a better latent space. The features are learned by maximizing the probabilities of the generative model (the variational autoencoder) to initialize the biases and weights of the multi-layered neural network. Intuitively, the mean vector controls where the input encoding should be centered, and the standard deviation controls how far the encodings can vary from the mean. As the encodings are generated at random, the decoder learns that a sample refers not to a single point in latent space but to all nearby points.
Tables 3 to 6 compare the performance of the machine learning models on both the training and validation datasets against the evaluation metrics mentioned earlier. These models were implemented in Python and tested under similar software and hardware conditions to assure reasonable comparisons. The error metric is used to define the degree of performance during the execution of a model. The residuals obtained during the experiments, which are the differences between the actual and the predicted values, are used to estimate the error measure. In other words, by observing the magnitude of the residual spread, the precision as well as the efficiency of the model is determined.
In terms of precision and efficiency, the proposed deep reinforcement model is observed to outperform the other machine learning models with an accuracy of 93.7% and improved error measures.
However, the performance of the other deep learning models BDN, BAN, IDANN, RAE and Deep LSTM is reasonably close to that of the DRL approach. Fig. 6 and Fig. 7 show the evaluated performance measures of the experimented models for crop yield prediction.

2) DATA DISTRIBUTION PROPERTIES
To determine whether the proposed DRQN model preserves the original distributional properties of the data, the probability density functions of the actual crop yield data and of the predictions of the experimented models are examined. The probability density function (PDF) is an analytical expression that characterizes the probability distribution of a continuous random variable, as opposed to a discrete random variable. When the PDF is plotted, the area under the curve over an interval gives the probability that the variable falls within that interval. It thus enables us to calculate the probabilities of a range of outcomes.
The probability density functions of the actual crop yield and of the crop yield predicted by the proposed deep reinforcement learning model and the other machine learning algorithms are shown in Fig. 8. This is done to observe whether the proposed model and the other ML algorithms preserve the distributional properties of the actual crop yield data. Fig. 9 shows the individual probability density plots of the original data, the proposed DRL method and the other experimented ML algorithms.
From Fig. 9, it is evident that the proposed deep reinforcement learning model preserves the distributional properties of the actual crop yield data more closely than the other experimented machine learning algorithms.
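A density comparison of this kind can be sketched with a simple Gaussian kernel density estimate. The paper does not state how the density curves were produced, so the estimator, the bandwidth, the sample values and the gap measure below are all assumptions chosen for illustration.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the PDF of `samples` with Gaussian kernels."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def pdf(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return pdf


def integrate(f, a, b, steps=2000):
    """Trapezoidal rule for the area under f between a and b."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, steps))
    return total * h


# Hypothetical actual and predicted yields (not data from the study).
actual = [3.1, 3.4, 3.6, 3.9, 4.0, 4.2, 4.5, 4.8]
predicted = [3.2, 3.3, 3.7, 3.8, 4.1, 4.2, 4.4, 4.7]
pdf_actual = gaussian_kde(actual, bandwidth=0.3)
pdf_predicted = gaussian_kde(predicted, bandwidth=0.3)

total_area = integrate(pdf_actual, 0.0, 8.0)    # close to 1 for a valid PDF
prob_between = integrate(pdf_actual, 3.5, 4.5)  # P(3.5 <= yield <= 4.5)
# Integrated absolute difference: 0 for identical densities, at most 2.
density_gap = integrate(lambda x: abs(pdf_actual(x) - pdf_predicted(x)), 0.0, 8.0)
```

A smaller density_gap indicates that a model's predictions follow the distribution of the actual yields more closely, which is one way to quantify what the density plots show visually.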

3) MODEL ACCURACY MEASURES
Evaluation of the model accuracy is an integral part of the model development process. It helps in identifying the optimum model for the data representation and in assessing the performance of the model on future timestamps.
Accuracy refers to the proportion of predictions that the model has forecast correctly. It reflects the closeness of the predicted value to the actual or true value. Fig. 10 graphically represents the accuracy measure of the data predicted using the proposed deep reinforcement learning algorithm and the other experimented machine learning algorithms.
On observing the experimental values and results obtained for the paddy crop dataset, the deep reinforcement learning model is found to predict the data with better accuracy and precision of 93.7% over the other experimented algorithms.
Though the accuracy measures of the other deep learning algorithms like BDN, BAN, IDANN, RAE and Deep LSTM are reasonably close to that of the proposed approach, their computational cost and time complexity are higher than those of the proposed model. The BDN and IDANN are asserted to be more suitable for predicting continuous data, enabling efficient greedy layer-by-layer learning by evaluating the parameters quickly. A critical disadvantage is that the approximation process is restricted to a single bottom-up pass and the existing greedy process is very slow and inefficient.
RAE learns automatically from the data samples, which is an essential feature: it is simple to train specific instances of the algorithm that perform well on a particular kind of input. It does not require any new design work, just relevant training data. However, the autoencoder's reconstructed outputs are degraded in comparison with the actual inputs, unlike lossless arithmetic compression. Generalizing the model also requires a large amount of training data. Although the RAE supports a greedy layer-wise approach for deep networks, better random weight initialization schemes, batch normalization and residual learning could provide sufficient training for deep networks. BAN brings some powerful insights and techniques to deep learning by automatically estimating the errors associated with predictions; however, it is difficult to scale to large datasets. This is evident from comparing the MAPE values obtained from the proposed approach and the BAN model.
In terms of error measures, the DRQN presented the lowest error values and almost preserved the original data distribution. It is evident from the results obtained that the proposed deep reinforcement learning DRQN model can solve the crop yield prediction problem by learning from the various dataset parameters through memory replay and self-learning. Thus the proposed method additionally enhances the system intelligence to predict the yield while minimizing the dependence on expert experience.
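The memory replay mentioned above is commonly realized as a fixed-size buffer of past transitions, combined with the standard Q-learning target. The sketch below is a generic illustration under stated assumptions, not the authors' implementation: the class and function names, the transition layout, the discount factor and the reward (the negative absolute prediction error) are all hypothetical.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of past transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Draw a random mini-batch, which breaks the ordering of experience."""
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def q_target(reward, next_q_values, done, gamma=0.9):
    """Standard Q-learning target: r + gamma * max over a' of Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Hypothetical usage: the reward is the negative absolute prediction error,
# one possible way to encode "minimizing the error" (an assumption).
memory = ReplayMemory(capacity=3)
for step in range(4):
    actual_yield, predicted_yield = 4.0, 3.0 + 0.25 * step
    reward = -abs(actual_yield - predicted_yield)
    memory.push(state=step, action=0, reward=reward, next_state=step + 1, done=False)

batch = memory.sample(batch_size=2)
target = q_target(reward=1.0, next_q_values=[0.5, 2.0], done=False, gamma=0.9)
```

With a capacity of three, the first of the four transitions is discarded, so the agent always trains on its most recent experience.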

VI. CONCLUSION AND FUTURE WORKS
The evolution of DRL has raised the self-reliance and intelligence of Artificial Intelligence algorithms, which motivated the proposal of a novel crop yield prediction system. The results observed from the precision and efficiency tests illustrate the effectiveness and versatility of the proposed Deep Recurrent Q-Network for yield prediction. By building a yield prediction environment, the proposed method makes it feasible for the agent to identify and learn the crop yield prediction through self-exploration and experience replay. Through the dataset prediction results, it is evident that the yield prediction agent administers the process, suggesting that the proposed method can precisely define the characteristics for crop yield. The combination of RNN-based feature processing and DQN-based self-experimental analysis is the key to attaining favorable results. Unlike the supervised learning-based crop yield prediction process, the DRQN-based process provides a complete solution that independently mines the non-linear mapping between the crop yield and the climatic, soil and groundwater parameters. This advantage can minimize expert dependency and the need for prior knowledge in developing crop yield prediction models. Hence the proposed approach offers a path toward a more generalized model for yield prediction. However, the RNN-based DRL can cause the gradients to explode or vanish if the time series is very long. Predictions from a wide range of ML algorithms can serve as a basis for decision making, but it is critical to interpret the statistical uncertainty related to these predictions. Hence there is a need to design a framework that predicts both the target and the uncertainty of its prediction. Probabilistic predictive modeling strategies such as information theory, probabilistic bias-variance decomposition, composite prediction strategies, and probabilistic boosting and bagging approaches can be considered to handle the uncertainty in statistical predictions, as a future extension of the current model. Another alternative approach to be considered is an LSTM-based DRL. More crop yield prediction parameters, relating to pests, infestations and crop damage, can be included in the current framework to construct a more robust working model in the future. Further improvement in the computing efficiency of the training process is another promising direction.