WB-CPI: Weather Based Crop Prediction in India Using Big Data Analytics

This paper aims at collecting and analysing temperature, rainfall, soil, seed, crop production, humidity and wind speed data (in a few regions), which will help farmers improve the produce of their crops. Firstly, we pre-process the data in a Python environment and then apply the MapReduce framework, which further analyses and processes the large volume of data. Secondly, k-means clustering is employed on the results gained from MapReduce and provides a mean result on the data in terms of accuracy. After that, we use bar graphs and scatter plots to study the relationship between the crop, rainfall, temperature, soil and seed type of two regions (Ahmednagar, Maharashtra, and the Andaman and Nicobar Islands). Further, a self-designed recommender system has been used to predict the crops and display them on a Graphical User Interface designed in a Flask environment. The system design is scalable and can be used to find the recommended crops of other states in a similar manner in the future.


I. INTRODUCTION
Due to sudden changes in weather conditions, farmers and agriculture throughout the country suffer as they fail to produce enough crops. This leads them to take serious steps as they are unable to provide for their family and make ends meet. This also leads to a scarcity of availability of food resources in the country. The conditions of farmers in our country need to be changed.
India's economy is greatly influenced by agriculture, which serves as the backbone of the country. More than 50% of the country is dependent directly or indirectly on the agriculture sector, and it is responsible for the employment of the major labour force of the country, which accounts for over 40%. Agriculture produces big volumes of data every year, and hence there is a need to replace the obsolete traditional chart-based prediction methods and use the big data collected to create a more prioritized and accurate predicting system. Big data will help confront the challenges and enhance the understanding of the whole sector. Big data analytics [34] is the process of examining large data sets containing a variety of data types.
The associate editor coordinating the review of this manuscript and approving it for publication was Walter Didimo.
The influence of weather can be deemed as a major priority in the prediction of crop yield. A lot of research work has been conducted in identifying how weather as a factor affects agriculture, but most of these studies require large complex information which is not directly available. This leads to the collection of data by estimation which can have either a negative or a positive effect. Hence improvement is needed in the methodology to compensate for the availability of data.
This work focuses on crop prediction using agricultural and meteorological data in India. The data is mainly collected from open dataset sources that contain crop information from all states, while the meteorological data covers three states and two union territories. The rainfall data has been collected since 1901, and the temperature data is available from 1995. The crop data has been amassed from 2000 and comprises the production of 123 crops from various regions of India. As a result, the combination of all the data provides an elaborate view of the system and hence serves as the source of the big data. In this project, a MapReduce framework for data processing and a K-means clustering algorithm along with a recommendation function are implemented in the hope of proposing crops to sow and elucidating big data applications in agricultural production.
This paper is arranged as follows. Section II presents the existing work done and recommendations in agriculture using big data by various analysis methods. Section III proposes the system architecture and algorithms on the basis of the outcomes defined in Section II. In Section IV, the work is carried out using already existing datasets, and the MapReduce framework, recommender function and clustering algorithm are implemented to give the desired output. Finally, in Section V, conclusions are made, and the future scope of the project is discussed.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

II. LITERATURE REVIEW
A. D. BOSE, "BIG DATA ANALYTICS IN AGRICULTURE" [1]
This paper discusses how Big Data Analytics, combined with various structured and unstructured data, helps provide insight to farmers in deciding which crops to grow and in reducing losses due to unexpected or unpredictable disasters [23]. In Section I, the paper states that the data produced by sensors can be collected from official databases that are usually maintained and governed by institutions. Here the author suggests that data can be collected and analysed at different stages of agriculture to see their influence on the bigger picture. It is dependent on two major factors, the push and pull factors [8]. Visualisation of agricultural data is done to simplify the complex, structured and unstructured data. Interpretation of data can be done using methods like overviews, verifiable models, or in an ad hoc manner, and then visualized in the form of tables and graphs [9].
In Section II, the paper discusses techniques such as predictive analysis, where an appropriate prediction of a future outcome is made on the basis of previous data [33]. A recommendation system is an informative system whose task is to offer an output based on functional patterns and behavioural data. Recommender systems generally give useful advice, as the output is based on the approach used and the categories. The next method is data mining, which can be defined as the process of extracting previously unknown and useful information from large quantities of incomplete data for practical application [35]. It plays a vital role in the agriculture sector, especially in discovering patterns in big datasets, i.e., pattern mining. Next, the spike and slab regression analytic technique is discussed, where the terms spike and slab are used as types of coefficient for regression [24]. In the time series analytic technique using big data, time is taken as an independent variable with the aim of forecasting crops, vegetation price movement and price fluctuation in the current market.
In Section III, the implementation of analytic techniques in agriculture is discussed. The first method is an intelligent crop recommendation system that considers factors such as soil conditions, temperature, rainfall and location. This system is further split into two different systems: the crop predictor, whose main task is to help agriculturists by recommending crops, and the rainfall prediction system, which predicts the occurrence of rainfall for each month across the year [17].
The next method discussed is Precision Agriculture using Map-Reduce, which allows variable rates and inputs and helps in understanding time and space variability in the criteria [18]. Here the data is obtained and pre-processed. Then map-reduce is performed, and 3D visualization is used to present the output.
Further, crop prediction using various machine learning approaches is discussed. A few of them are 1) Grey wolf optimisation (GWO) technique 2) K-means clustering 3) Apriori algorithm 4) Naive Bayes. Next, Smart Farming is discussed, where services such as the Internet of Things, Cloud Computing and Mobile Computing are detailed.
The Crop analysis using Data mining techniques discussed is aimed at analysing greenhouse crops with the help of data mining techniques to extract patterns. With the help of the user interface and selection of specific greenhouse attributes, farmers will be able to predict yield patterns, crop patterns and further make important decisions based on them.
Lastly, the author talks about a Spark-based system to perform collection, learning, training, validation and visualization of distributed data. This method of data analytics can be used for crop yield prediction, current weather trends and performing insights on Agricultural market data [10]. In Section IV, the challenges that are faced in the analysis of big data in agriculture are discussed. The author states that obstacles faced for agriculture are usually Technical or Organizational problems. The paper further mentions the problems faced in the big data analysis of agriculture data, majorly, availability, accessibility and scalability of data for analysis.
Section V talks about the future scope of work, where the author discusses various areas that could benefit, such as product traceability, genetic engineering, supply chain, yield production, high precision and scientific simulations. Lastly, Section VI contains a comparison table of big data techniques, which suggests using MapReduce for weather and climate data and K-Means Clustering for crop and vegetation data by collecting historical datasets.
[2] In Section I of the paper, the focus is on the system of agriculture in Telangana. The data is collected from Cridas and farms of Hyderabad and Hayathnagar. A recommendation system recommends which crop to cultivate in the related seasons using a Naïve Bayes classifier. Rice, Cotton, Maize and Chillies are the crops taken into consideration.
Section II talks about the previous work done in the field of precision agriculture. The author describes the advantages and the grey areas of methods and models used in previous work, such as linear regression with neural networks, MapReduce, the KNN algorithm, a crop growth prediction model and sequential data assimilation. In Section III, the author describes the proposed methodology, which is used to predict which of the four crops are suitable in Telangana. The author discusses the modality and methodological conditions of three zones, i.e. (i) Northern Telangana, (ii) Central Telangana and (iii) Southern Telangana, with seven major types of soil in which farmers mainly cultivate soybean, maize, rice and cotton, and where the water for irrigating the soil is provided by the Godavari and Krishna rivers and the monsoons (June-September). The suitable conditions for growing rice, maize and chillies are discussed. After collecting data from various sources like field sensors, satellite images, crop data, irrigation reports and weather data, it was pre-processed to find the missing values and impute them using the mean method. Then feature selection and data extraction were performed in terms of soil, temperature, rainfall and atmospheric pressure. Further, MapReduce was implemented on this data, and then a crop prediction model was built using the Naive Bayes algorithm [11]. This model recommended two or more crops based on the input data supplied.
Section IV describes the results and recommends sowing and harvesting suitable crops.
1) It was concluded that cotton should be planted in March/April as July to September are its ideal growth months where maximum growth is noted in the month of August. Since there is no noticeable growth from October to December, the crop can be harvested in January or February.
2) Rice is grown in the Rabi and Kharif seasons. It should be sown in July, as there is notable growth from August to September, and it can be harvested in October and November.
3) Chilli requires good rainfall; hence the crop is sown at the start of July.
4) Maize should be sown at the end of June, as it has the highest growth in July, and is preferably harvested in September.
Section V discusses possible future enhancements. This work used Naïve Bayes to introduce a crop recommender system that is very efficient in terms of computation. The system can be used on a variety of crops as it is scalable.
[3] This paper discusses crop yield prediction, food security, MapReduce and nearest neighbour modelling in terms of big data using agricultural data in China. In Section I, the paper talks about food security and its aspects, such as producing enough food and maintaining a stable supply of food in the market, and how big data [25] can help address them. It points out that time in advance and accuracy are the priorities in predicting crop yield.
Section II portrays the advancement and application of big data in crop yield prediction. The paper discusses effective plans for improving the performance of crop yield prediction and methods to take maximum advantage of huge datasets related to agriculture and food security. Currently, big data can be obtained in semi-structured or unstructured forms from recognition technology, Radio Frequency Identification, remote sensing and weather stations. The paper further reviews that crop yield forecasting is the most addressed topic, followed by climate change impact assessment and water resources. The well-developed methods have been categorised into statistical methods, remote methods, crop growth simulation and econometrics. Section III proposes a model based on a prior structure and a weather data processing structure. First, the data was prepared by collecting it from the China Meteorological Administration, with a high accuracy of above 99%. Then MapReduce was performed by partitioning the data into multiple sections. The Map function was executed according to certain rules, followed by the Reduce function, where data having the same year was rearranged and combined. The output was written to distributed file systems. After that, weather similarity (defined by weather distances) [26] was checked using nearest neighbours. The smaller the quantified distance between two years, the more similar the two years are. Finally, the autoregressive moving average model, which combines two models, was used. Its output is produced by a model that takes white noise as its input, which means that the relationship is linear. In Section IV, the experiment is conducted on already existing weather datasets, and the advantages of using this new method are discussed. This crop yield predictor is an application based on a processing structure that manages data in sequence to search for similar years.
In the first step, the weather data was processed using MapReduce, keeping precipitation, the intensity of sunshine and temperature at ground level as variables to calculate the daily mean and monthly mean. The process was divided into three steps, i.e. Map (to calculate the monthly mean value), Reduce (to combine intermediate data) and storing the result. Next, a search for similar years was performed by normalising three matrices to obtain a single 59 × 36 matrix, and then the distance was obtained by computing the norm with respect to the target year. After sorting, the 20 nearest neighbours most similar to the target year were obtained. This was followed by preparing an ARMA model for prediction based on the nearest neighbours found. This model was used to predict the crop yield of 2013 as an example and had a deviation of only 0.5%. The nearest neighbours method using the MapReduce weather data processing structure achieved a balance of both accuracy and time in advance.
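The similar-year search described above can be illustrated with a minimal sketch. It is not the reviewed paper's implementation: min-max scaling, the Euclidean norm, the function names and the toy values are all assumptions made for illustration, and only three features per year are used instead of the full monthly matrix.

```python
import math

def normalise_columns(rows):
    """Min-max scale every column to [0, 1] so that no variable dominates."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def nearest_years(features_by_year, target_year, k):
    """Return the k years whose scaled feature vectors lie closest to the target."""
    years = sorted(features_by_year)
    scaled_rows = normalise_columns([features_by_year[y] for y in years])
    scaled = dict(zip(years, scaled_rows))
    target = scaled[target_year]
    # Euclidean norm of the difference between each year and the target year.
    distances = sorted(
        (math.dist(scaled[y], target), y) for y in years if y != target_year
    )
    return [year for _, year in distances[:k]]

# Made-up yearly features (temperature, precipitation, sunshine), illustration only.
weather = {
    2009: [10.0, 200.0, 5.0],
    2010: [13.0, 170.0, 6.5],
    2011: [20.0, 90.0, 9.0],
    2012: [11.0, 190.0, 5.5],
}
similar = nearest_years(weather, target_year=2012, k=2)
print(similar)
```

The years returned by such a search would then feed the prediction model; scaling first matters because precipitation values would otherwise dominate the distance.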
Finally, in Section V, it is concluded that the method mentioned above makes it possible to take advantage of already existing large datasets and put them to use. Possible future work includes faster accumulation of data and integrating the weather calculation into the data-processing section to reduce computing time. Lastly, this paper focused on data mining in agricultural data from the perspective of time, using the MapReduce weather data processing structure. The same methods can be applied to different geographical aspects.
[4] The main aim of the paper is to predict changes in weather and help farmers make agriculture-related decisions based on those changes. The paper proposes a model to find solutions to modern world problems, ranging from worldwide food insecurity induced by frequent climate change to predicting the impact of extreme weather events and mitigating their effect on global finance. The authors make use of Big Data Analytics techniques to build an automatic prediction system. The model is based on the Hadoop framework.
First, they collected data from various sources like social media, sensor data and weather forecasts, and loaded the pre-processed data into HDFS. HDFS stores datasets and provides backup features. They focused on three factors while collecting data, namely precipitation, temperature and cloud cover, for the state of Karnataka. The authors have mainly used Hive for reading and processing data. Hive's strong SQL capabilities make it possible to process huge volumes of data stored in HDFS. Hive converts SQL queries into a series of MapReduce jobs. They have used MapReduce to analyse the data, as an execution engine suitable for large data processing, and to improve the response speed for returning query results.
Then they implemented a prediction function for establishing forecast data through the k-means clustering algorithm. They used Apache Mahout to implement a logistic regression algorithm to predict the future based on past data. For this, training and testing are performed on the data. Then they evaluated the accuracy of the predicted result and represented the output through visualizations using the Flotend tool. Flot is a JavaScript plotting library. They used a Pig script to perform analysis, and its output was provided as input to Flotend. They made various plots showing yearly and monthly average temperature for a particular region, maximum and minimum temperature, precipitation, etc. The authors of the paper aim to improve their model such that it can be used for providing alerts for natural hazards in the future.
[5] This paper aims to predict the crop yield and suggest crops based on it, which would, in turn, increase the profit of the farmers and of the agriculture sector overall. It also focuses on improving the quality of the crops using datasets for diseases. The authors have used a new algorithm called Agro Algorithm to predict the crop yield and suggest crops [7] based on the crop yield, taking the soil type into consideration.
In section I of the paper, the authors have used weather datasets containing information about temperature, rainfall in mm, wind speed, evaporation, humidity etc. They further used the weather datasets to determine the type of soil. They also used datasets for crop diseases to determine the ideal weather conditions which would be suitable for a particular crop to grow. Section II presents the numerous methods that already exist for crop prediction and their drawbacks. Here they discussed techniques like clustering, soft computing techniques such as k-means and artificial neural networks. In section III of the paper, the paper talks about some basic knowledge that is required to improve the quality of crops, such as selection of plant and soil factors such as pH, which would play an important role in getting a good yield. The properties of the soil should also be known beforehand. It is also important to select the right seeds and estimate the right amount of fertiliser and pesticides required.
Section IV is the implementation. The implementation is performed on the Hadoop platform since the datasets are large. Normalisation is performed on the data stored in HDFS. This is done by taking the statistical average mean of data. First, the month in which the crop has to be sown is selected, then the classified data is used to predict the quality of soil and the recommended crop. The classification algorithm used is a simple statistical-based learning system. This prediction is represented using pie charts and this prediction is used to form five categories: ''very good'', ''good'', ''average'', ''bad'' and ''very bad''. Section V discusses the architecture used by the authors. They represented their architecture using a flowchart. Firstly, they collect multiple datasets related to agriculture and weather and perform the required analysis and classify the data. Then they used the classified data to predict the soil type and crop that can be sown.
In section VI of the paper, the authors talk about some issues and obstacles that are faced in quality farming in India [12]. These include technical gaps, small sizes of the farms, less availability of data and disparate harvesting systems.
In section VII, they conclude that the crop yield is improved by using weather, soil, crop and disease datasets. This in turn, boosts the standard of production of crops. It helps farmers immensely in selecting a crop suitable for the weather and soil type.
In the future, the authors aim to classify all the types of diseases for a particular crop and determine their causes, which would further improve the quality of crops.
[6] In section I, this paper presents various methodologies for making weather predictions and discusses them in detail. Weather forecasts are important to prevent future damage, economic downfall, deaths and floods due to extreme weather conditions. Therefore, weather prediction, such as rainfall and tropical cyclone prediction, becomes very important. Weather prediction also proves helpful to farmers since it can prevent crop damage [19], [20]. The authors collected weather datasets that had data on humidity, rainfall, wind speed, temperature, air pressure, vapour pressure, sunlight intensity and various other factors. They collected this historical data from various sources. They have used big data analytics to study trends and patterns and predict short-term as well as long-term weather changes.
Section II describes the obstacles that occur in weather prediction. Conventional and statistical models do not give accurate results because the datasets are large and the output depends on assumptions [13]. It is hard to make a prediction model for the long term because the input parameters change rapidly, which are difficult to incorporate in an already built model. Noise in the dataset leads to inaccurate short-term predictions. In section III, they discuss the different methods adopted by researchers for weather prediction:

1) MapReduce
This model was used by a researcher for studying problems related to agricultural lands. They made a soil analysis system using various datasets like historical crop data, soil nutrient levels, fertilizers and manure used.
2) LINEAR REGRESSION AND MapReduce [22]
They collected data such as rainfall, temperature and humidity from weather stations and predicted weather, helping farmers in planting crops with good yield and cutting their costs.
ξ denotes the difference between actual and predicted values and N is the number of numerical values in vector x in (1).
3) TIME DELAY RECURRENT NEURAL NETWORK AND FEED FORWARD NEURAL NETWORK [14]
They made use of evaporation, soil temperature and humidity, among other factors, to predict rainfall. This provided daily, monthly as well as annual rainfall forecasting.

4) WAVELET ANN [15]
This method was used to predict daily average temperature and rainfall using various weather-related datasets.
$e^{(\cos(\cos(-10.6\,EP\,T_i)) - (X_i / (\sin(RF1\,RH) \cdot RH)) + 3.7292)}$ (2)
where $X_i = EP\,e^{\log_{10}(4.3\,EP + 0.4941)} / (T_x + RH)$. In (2), RH means relative humidity, RF1 is 1-day previous rainfall and EP is evaporation.
These comparisons between different methodologies help in finding the best-suited one for a particular prediction. Section IV concludes that MapReduce and Linear Regression models are better than other techniques for weather and climate forecasting, as these methods provide an accurate result. It was concluded from this literature review that MapReduce should be used as an efficient programming model for computing very large datasets of weather and climate with ease and high performance. Further, we concluded that using k-means clustering on our final generated datasets would help us identify the relation between the crop and the produce per area of the particular region.

III. PROPOSED METHODOLOGY
After studying the previous work done, the main aim would be to process the data using MapReduce and frame a recommender algorithm in Python to extract output according to the seasonal conditions and region followed by executing k-means clustering and finding the mean produce per area a group of crops will give in a particular region.
Keeping in mind the previous work done in other papers, we have taken temperature, rainfall, wind speed, humidity, soil type and seed type as the deciding parameters of our system. Firstly, the raw data will be collected and pre-processed in a Python environment. Then this pre-processed data is used as input for the MapReduce framework of Hadoop. MapReduce is a programming model for processing large amounts of data with a parallel, distributed algorithm [29]. This model is implemented on the collected datasets for faster processing. In this work, each dataset will be processed differently. In the MapReduce model, the dataset will be divided into key-value pairs as shown in Fig. 1, where the different parameters will be individually taken to perform a MapReduce. The year and region will be stored in the key, and the respective parameter for all the months will be taken as the value for the Map function. In the Reduce function, these parameters will be calculated and assigned to crop seasons. For the Map function of the crop dataset, the region, year, season and crop will be assigned as the key, and the produce and area will be taken as the value. Then the Reduce function will calculate the produce per area for each row of data, where the region, year, season and crop will form the key and the produce per area will become the value.
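The crop-dataset Map and Reduce steps described above can be sketched in plain Python. This is only an in-memory illustration of the key and value layout, not the Hadoop job itself; the function names, field names and sample records are hypothetical.

```python
from collections import defaultdict

def map_crop(record):
    """Map step: emit ((region, year, season, crop), (produce, area))."""
    key = (record["region"], record["year"], record["season"], record["crop"])
    return key, (record["produce"], record["area"])

def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_crop(key, values):
    """Reduce step: emit (key, produce per area) for one key."""
    produce = sum(p for p, _ in values)
    area = sum(a for _, a in values)
    return key, (produce / area if area else None)

# Made-up records, for illustration only.
records = [
    {"region": "Ahmednagar", "year": 2010, "season": "Kharif",
     "crop": "Maize", "produce": 400.0, "area": 200.0},
    {"region": "Ahmednagar", "year": 2010, "season": "Rabi",
     "crop": "Wheat", "produce": 300.0, "area": 100.0},
]
grouped = shuffle(map_crop(r) for r in records)
produce_per_area = dict(reduce_crop(k, v) for k, v in grouped.items())
print(produce_per_area)
```

Keeping the full (region, year, season, crop) tuple as the key means the reduced output can later be joined with the seasonal weather outputs on region and year.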
Next, we propose to combine all the MapReduce outputs of the different parameters to form one final/super dataset and build a recommendation algorithm. Three parameters are taken as user input: the month, region and state. We then initialise the different agricultural seasons and, depending on the month entered by the user, assign it the respective season/s. For example, if the user input is November, we assign it Rabi/Winter crops. Next, we parse through the data and select the three crops that give the best yield in that particular season, and also the crops that give the best yield throughout the whole year, in that particular state and region. These are held in two different data frames, one for the particular season and one for the whole year. Along with this, we output the temperature, rainfall, wind speed and humidity under which the crop had previously given that output. We also mention the seed type to be used for each kind of soil for the crops in different regions, as the availability of seed and soil type varies from region to region. The output of this recommendation function will be displayed on a graphical user interface (website) designed on Flask using Python, where the user can input the required data and get the output from the system.
Next, we will use a K-means clustering model. Firstly, we will make an elbow graph to determine the optimal number of clusters (the value of K). We will utilise the Scikit-learn library for this purpose. Then we will use the fit_predict method to obtain the cluster assignments. These are returned as an array in which integer labels starting from 0 identify the cluster each value belongs to. Then the clusters will be plotted using the scatter method of the Matplotlib library. Each cluster centroid will be shown, representing the average value of the cluster around which its crops are plotted, and every cluster will be represented by a different colour.
To study the relationship between the produce per area, crops and the respective parameters, we would create several 3D graphs and scatter plots. The produce per area could be taken as the Y-axis and crops as the X-axis, and the Z-axis could be changed according to the parameter taken (here temperature, rainfall, humidity and wind speed). We will also study the relationship between soil and seed type using 2D bar graphs and scatter plots with Matplotlib, seaborn and mpl_toolkits in Python, taking soil as the X-axis and the number of crops it supports on the Y-axis. Also, a graph could be made by taking crops as the X-axis and produce per area on the Y-axis to study which crop gives the maximum produce per area in the particular region.
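To make the elbow step concrete, the sketch below runs a plain-Python k-means (Lloyd's algorithm) for several values of K and records the within-cluster sum of squares that an elbow graph would plot. It is a standard-library stand-in for the Scikit-learn calls mentioned above, not the code used in this work; the farthest-point seeding, the function names and the produce-per-area values are assumptions chosen to keep the example deterministic.

```python
import math

def farthest_point_seeds(points, k):
    """Deterministic seeding: start at the first point, then repeatedly
    pick the point farthest from the seeds chosen so far."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(math.dist(p, s) for s in seeds)))
    return [list(s) for s in seeds]

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]

def kmeans(points, k, max_iter=100):
    """Return (labels, centroids, inertia) after Lloyd's iterations."""
    centroids = farthest_point_seeds(points, k)
    labels = assign(points, centroids)
    for _ in range(max_iter):
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    inertia = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centroids, inertia

# Made-up produce-per-area values (one feature per crop), illustration only.
points = [[0.8], [1.1], [1.3], [5.0], [5.3], [9.0], [9.4]]
inertias = {k: kmeans(points, k)[2] for k in range(1, 5)}
labels, centroids, _ = kmeans(points, 3)
print(inertias)
print(labels)
```

In this toy data the inertia drops sharply up to K = 3 and only marginally afterwards, which is the bend an elbow graph is read for.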

A. DATA COLLECTION
Various datasets were collected during this step. With some difficulty, we found seven datasets relevant to our workflow and needs from Kaggle [27] and a university website [28].

B. PRE-PROCESSING DATA
Here, the collected datasets were combined and cleaned. We uploaded our datasets to a Colab notebook and used pandas data frames to drop the unneeded columns and retain the ones important to us. We used the NumPy and SciPy libraries for our calculations. A few index columns were added for future calculations. Interpolation, as represented in (3), was used to statistically estimate the missing values in the dataset [45], [46], [32]. In our datasets, we estimated the values of certain numerical columns which had NA values, such as the month columns of the rainfall dataset.
where dataframe_name is the name of the dataset in use and the interpolation is done in the forward direction linearly. In the next step, redundant and dirty data was eliminated using the IQR and z-score methods for detecting and deleting outliers, and the data was interpolated in a few datasets where it was deemed fit. The Interquartile Range Method as in (4.1-4.4) was used to remove the outliers in the temperature datasets. The 25th and 75th percentiles are found, and 1.5 is taken as the factor because considering only three deviations is appropriate, i.e., 1.5 multiplied by 2; anything more than these deviations above the 75th percentile or below the 25th percentile will give inaccurate results [30].
where Q1 and Q3 are the 25th and 75th percentiles and IQR is the interquartile range, dataframe_name is the name of the dataset in use and column_name is the column we want to apply it to. In Fig. 2 the outliers before cleaning the dataset are plotted, and in Fig. 3 the cleaned dataset with no outliers is plotted using matplotlib. The z-score, as shown in (5), is also used to remove outliers here. We take 3 as the factor because anything above positive 3 or below negative 3 will give inaccurate results and is hence an outlier [31].
.apply(lambda x: numpy.abs(scipy.stats.zscore(x)) < 3)
where dataframe_name is the name of the dataset in use and x is the data we are performing the z-score on.
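The three cleaning steps (linear interpolation, the IQR rule with a factor of 1.5 and the z-score rule with a threshold of 3) can be illustrated with the standard-library sketch below. It is a stand-in for the pandas, NumPy and SciPy calls used in this work, not a copy of them; the function names and sample values are hypothetical, and leaving leading or trailing gaps unfilled is a simplifying assumption.

```python
import statistics

def interpolate_linear(values):
    """Fill interior None gaps on a straight line between known neighbours.
    Leading and trailing gaps are left as None in this sketch."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled

def iqr_filter(values, factor=1.5):
    """Keep values inside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if low <= v <= high]

def zscore_filter(values, threshold=3.0):
    """Keep values whose absolute z-score is below the threshold.
    Uses the population standard deviation (ddof=0)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return list(values)
    return [v for v in values if abs((v - mean) / sd) < threshold]

# Made-up readings, for illustration only.
rainfall = interpolate_linear([10.0, None, None, 16.0])
temps = [24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 90.0]
cleaned_iqr = iqr_filter(temps)
readings = [float(v) for v in range(20, 40)] + [500.0]
cleaned_z = zscore_filter(readings)
print(rainfall, cleaned_iqr, len(cleaned_z))
```

Note that the z-score rule can only flag a point when there are enough observations: with very few values, no single point can reach an absolute z-score of 3, which is one reason to pair it with the IQR rule.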

C. MAP REDUCE
The MapReduce model was implemented on the cleaned data, where the dataset is divided into key-value pairs. For INBOMBAY.csv, INCHENAI.csv, INCALCUT.csv and INDELHI.csv, the Map function first took the month, year and region as the key and the temperatures as the value. The Reduce function then calculated the average monthly temperature of the regions and stored the month, year and region as the key and the average temperature as the value. More generally, for each weather parameter the year and region are stored in the key and the parameter is taken month-wise as the value for the Map function.
In the Reduce function, these parameters were calculated and assigned to crop seasons such as Rabi, Kharif, Autumn, Winter, Summer and Whole Year, along with the temporary November and December temperatures needed for the next year's Rabi and Winter calculations. For the rainfall in India 1901-2015.csv dataset in Fig. 4, MapReduce was performed to calculate cumulative rainfall in the different agricultural seasons according to the dataset. The output for the same is displayed in Fig. 5. For the Map function of the crop dataset, the region, year, season and crop were assigned as the key, and the produce and area were taken as the value. Then the Reduce function calculated the produce per area for each row of data, where the region, year, season and crop form the key and the produce per area becomes the value. The produce per area of all the crops was found from the crop_production.csv dataset, which is used to suggest a particular crop according to the region.
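The two temperature passes described above can be sketched as chained jobs in plain Python: the first averages daily readings per (region, year, month), and the second aggregates the monthly averages into a season. Only the Whole Year season is shown, since the month ranges of the other seasons are not listed here. The small job runner, the function names and the readings are hypothetical and stand in for the Hadoop jobs.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Tiny in-memory stand-in for a MapReduce job: map, group by key, reduce."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return [reducer(key, values) for key, values in grouped.items()]

def map_daily(record):
    """Job 1 map: key = (region, year, month), value = daily temperature."""
    yield (record["region"], record["year"], record["month"]), record["temp"]

def map_monthly(pair):
    """Job 2 map: key = (region, year, season), value = monthly average."""
    (region, year, _month), monthly_avg = pair
    yield (region, year, "Whole Year"), monthly_avg

def reduce_mean(key, values):
    """Shared reduce: average all values seen for a key."""
    return key, sum(values) / len(values)

# Made-up daily readings, for illustration only.
daily = [
    {"region": "Delhi", "year": 2010, "month": 1, "temp": 14.0},
    {"region": "Delhi", "year": 2010, "month": 1, "temp": 16.0},
    {"region": "Delhi", "year": 2010, "month": 2, "temp": 18.0},
    {"region": "Delhi", "year": 2010, "month": 2, "temp": 20.0},
    {"region": "Delhi", "year": 2010, "month": 2, "temp": 22.0},
]
monthly = run_job(daily, map_daily, reduce_mean)
seasonal = dict(run_job(monthly, map_monthly, reduce_mean))
print(dict(monthly))
print(seasonal)
```

A season that spans two calendar years, such as Rabi, would additionally need the carried-over November and December values mentioned above; that bookkeeping is omitted from this sketch.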

D. PYTHON RECOMMENDATION FUNCTION
The cleaned and MapReduce-processed datasets of rainfall, temperature and crop production were combined with the manually collected wind speed, humidity, soil type and seed type data to form one final dataset. The user supplies the state, region and month as input, as depicted in Fig. 6. On the basis of the month entered, we assign the agricultural season or seasons. Depending on the season assigned, we then parse the data to collect the three crops with the best yield in the selected region and state. In addition, we extract the top three crops giving the best yield throughout the year. For each crop, two suitable soil types and the corresponding seed types are suggested, since the availability of soil and of suitable seed varieties differs from region to region. The temperature, rainfall, wind speed and humidity of the region under which the crop gave these high yields are also displayed for reference. The function runs the provided input through the above algorithm to produce the recommendation shown in Fig. 7. We have taken the months of Rabi, Kharif, Summer, Winter, Autumn and Whole Year into consideration.
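The selection logic described above can be sketched as a short function. This is an illustrative reconstruction, not the authors' recommender (which is available in their repository): the month-to-season table, the field names and the sample yields are all assumptions, and year-round crops are taken to be the rows labelled Whole Year.

```python
# Hypothetical month-to-season table; the paper's exact assignment is not
# given in this section.
SEASONS_BY_MONTH = {
    1: ["Rabi", "Winter"], 2: ["Rabi", "Winter"], 3: ["Rabi", "Summer"],
    4: ["Summer"], 5: ["Summer"], 6: ["Kharif"], 7: ["Kharif"],
    8: ["Kharif"], 9: ["Kharif", "Autumn"], 10: ["Kharif", "Autumn"],
    11: ["Rabi", "Autumn"], 12: ["Rabi", "Winter"],
}


def top_crops(rows, n=3):
    """Return the n crops with the highest recorded produce per area."""
    best = {}
    for row in rows:
        crop = row["crop"]
        best[crop] = max(best.get(crop, 0.0), row["yield"])
    return sorted(best, key=best.get, reverse=True)[:n]


def recommend(rows, state, region, month, n=3):
    """Pick seasonal and year-round top crops for one state, region and month."""
    local = [r for r in rows if r["state"] == state and r["region"] == region]
    seasons = SEASONS_BY_MONTH[month]
    seasonal = [r for r in local if r["season"] in seasons]
    year_round = [r for r in local if r["season"] == "Whole Year"]
    return {
        "seasons": seasons,
        "seasonal": top_crops(seasonal, n),
        "year_round": top_crops(year_round, n),
    }


# Made-up rows for illustration only.
rows = [
    {"state": "Maharashtra", "region": "Ahmednagar", "season": "Kharif", "crop": "Bajra", "yield": 0.8},
    {"state": "Maharashtra", "region": "Ahmednagar", "season": "Kharif", "crop": "Maize", "yield": 1.9},
    {"state": "Maharashtra", "region": "Ahmednagar", "season": "Kharif", "crop": "Soyabean", "yield": 1.2},
    {"state": "Maharashtra", "region": "Ahmednagar", "season": "Kharif", "crop": "Cotton", "yield": 0.5},
    {"state": "Maharashtra", "region": "Ahmednagar", "season": "Whole Year", "crop": "Sugarcane", "yield": 70.0},
    {"state": "Maharashtra", "region": "Ahmednagar", "season": "Whole Year", "crop": "Banana", "yield": 30.0},
]
print(recommend(rows, "Maharashtra", "Ahmednagar", 7))
```

In the full system each recommended crop would additionally be joined with its soil, seed and weather attributes before display; that lookup is omitted here.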

E. WEBSITE
A minimalistic website is designed as a graphical user interface for the user. The front end of the website has been made on Flask 2.0.0 in a virtual environment using Python 3, as shown in Fig. 9. It takes the desired input from the user and shows the predicted output, that is, the top three seasonal crops with the best yield and the top three year-round crops with the best yield, along with the temperature, rainfall, wind speed and humidity of the input region under which that output was obtained. It also suggests two kinds of suitable soil and the respective seeds with which the crop can be grown, as shown in Fig. 10.

F. VISUALISATION
The super dataset is used to find the relation between produce per area and a particular crop using a bar graph and a scatter plot for a particular region. The bar graph in Fig. 11 shows that sugarcane has the highest production per area in the North and Middle Andaman region. In Fig. 12, the scatter plot of the same region shows the initial produce per area of the crops before clustering, so that the data can be compared before and after the k-means clustering algorithm is applied.
The elbow graph has been plotted to find the number of clusters that should be formed for a particular region's crops according to their produce per area, as shown in Fig. 13 a and b. The point where the graph bends gives the number of clusters the dataset should ideally have. The elbow was found at 2 for both the Nicobars and Ahmednagar, Maharashtra, as can be seen in the figures. The clustering algorithm is then applied, and the crops are plotted by their produce per area around the cluster centroids. From the graph in Fig. 14 a, we can deduce that the mean produce per area of the first ten crops in the Nicobars is around 2.5 units and that of the next ten crops in the graph is around 4 units. In Ahmednagar, however, most crops have a produce per area of less than 1 unit and hence form a cluster with a center near 0, as shown in Fig. 14 b, while the other cluster consists of crops giving a very high output per area, averaging about 600 units. These are usually oilseeds grown by farmers.
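The elbow method and the clustering step can be illustrated on one-dimensional produce-per-area values with a small pure-Python version of Lloyd's k-means algorithm. This is a sketch under stated assumptions, not the authors' implementation: the deterministic centroid initialisation, the function names and the sample values (chosen to resemble the Ahmednagar pattern of one cluster near 0 and one near 600) are all hypothetical.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on scalar data; returns (centroids, inertia)."""
    data = sorted(values)
    # Deterministic start (an assumption): spread centroids across sorted data.
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Recompute each centroid as its cluster mean; keep it if the cluster is empty.
        updated = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    # Inertia: sum of squared distances to the nearest centroid.
    inertia = sum(min((v - c) ** 2 for c in centroids) for v in data)
    return centroids, inertia


def elbow_curve(values, max_k):
    """Inertia for k = 1..max_k; the bend in this curve suggests k."""
    return {k: kmeans_1d(values, k)[1] for k in range(1, max_k + 1)}


# Made-up produce-per-area values for illustration only.
yields = [0.2, 0.4, 0.5, 0.7, 0.9, 580.0, 600.0, 620.0]
print(elbow_curve(yields, 4))
print(kmeans_1d(yields, 2)[0])
```

For data of this shape the inertia drops sharply from k = 1 to k = 2 and barely changes afterwards, which is the bend that the elbow graph is read for.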
Next, bar graphs have been plotted to check the relationship between a soil and the different types of seeds that can be sown in it. The graphs in Fig. 15 a and b show 8 different types of soil found in the Nicobar region and the number of seed varieties that can be sown in each. Each soil type has been assigned a number from 1 to 7, and the number of crops that can be planted in a particular soil type is calculated and shown on the y-axis. It can be concluded from the graph that Clay Loam soil can support 17 crops with different seed varieties. In Fig. 15 b, Laterite (Red Clay), Clay Loam and Sandy Loam support the largest number of crops.
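The count shown on the y-axis of these bar graphs amounts to the number of distinct crops recorded for each soil type. A minimal sketch of that count is given below; the (soil, crop) pair format and the sample entries are assumptions made for illustration.

```python
from collections import defaultdict


def crops_per_soil(records):
    """Count the distinct crops recorded for each soil type."""
    crops = defaultdict(set)
    for soil, crop in records:
        crops[soil].add(crop)  # a set ignores repeated (soil, crop) rows
    return {soil: len(names) for soil, names in crops.items()}


# Made-up (soil, crop) pairs for illustration only.
records = [
    ("Clay Loam", "Rice"),
    ("Clay Loam", "Coconut"),
    ("Clay Loam", "Rice"),
    ("Sandy Loam", "Arecanut"),
]
print(crops_per_soil(records))
```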
Further, we have depicted 3D plots of the Andaman and Nicobar Islands and the Ahmednagar region of Maharashtra. The plots in Fig. 16 a and b represent the relationship between the crops, production per area and rainfall. Fig. 16 a shows the effect of rainfall in the Andaman and Nicobar Islands, whereas Fig. 16 b shows the effect in Ahmednagar, Maharashtra. In the Andaman and Nicobar Islands, a high produce per area can be expected if the rainfall is around 500-2000 mm throughout the year. For Ahmednagar, most of the crops give a high produce per area if the rainfall is from 0 to 800 mm every year. Fig. 17 a and b show the relationship between various crops, their production per area and the temperature they were grown in, for Andaman and Nicobar and for Ahmednagar respectively. In the Andaman and Nicobar region, most of the crops give a high produce per area if the temperature is between 80.3 and 80.7 units, while for Ahmednagar, a high produce per area is obtained when the temperature is between 74 and 80 units.
For the relation between humidity, crop and production per area, Fig. 18 shows that most crops thrive in 74%-78% humidity, whereas a few crops require a higher humidity of 80%-84%. The scatter plot in Fig. 19, which depicts the relation of wind speed with crop and production per area, indicates that most crops grow best at a wind speed of 5-8 units, whereas some crops thrive irrespective of an increase or decrease in wind speed.
All the 3D graphs plotted above show how the types of crops and their production per area are affected by factors such as rainfall, temperature, wind speed and humidity. These parameters combine to form a suitable environment that gives a good produce per area for a wide variety of crops.

V. CONCLUSION AND FUTURE SCOPE
The proposed work introduces a crop recommendation system that uses MapReduce and K-Means clustering, which give computationally efficient results. The model focuses on a wide range of crops and their produce per area, along with the soil types and seed types, depending on the varieties used in a particular region. From the visualisation graphs of K-Means clustering, we can find the mean produce for a group of crops. The algorithms used for the recommender function and K-Means clustering can be accessed at https://github.com/oorjagarg/WB-CPI. The relation between the parameters (such as optimal temperature, seasonal rainfall, wind speed, humidity, soil availability and required seed types), crop and region has also been studied and displayed using 2D and 3D graphs. The system is scalable and can be used to find the recommended crops of other states in the manner described in the methodology. This work can be further improved to address the disproportion between production and requirement: adding humidity and wind speed data for all the regions would give a more accurate recommendation. Factors like soil moisture, irrigation and cloud cover may be included in the system to refine its output. The recommender can also be modified to warn about the diseases that can occur in a crop in a particular season and to suggest the types of fertilizers or nutrients needed in the soil for the crop to grow and give its best yield.

AKHILESH KUMAR SHARMA (Senior Member, IEEE) received the B.E., M.E., and Ph.D. degrees in computer science and engineering. He is currently working with Manipal University Jaipur, Rajasthan, India, as an Associate Professor. He has more than 18 years of experience. He has chaired sessions and served as an expert for keynotes at IITs and NITs and in Vietnam, Thailand, Malaysia, Australia, China, and Singapore. He has presented many research articles in international journals and conferences and organized various FDPs, events, conferences, and workshops. He holds four patents and four copyrights and has set up a cognitive intelligence research lab in Jaipur. His research interests include soft computing, machine learning, big data analytics, and healthcare. He is affiliated with IEEE, ACM, CSI, IUCEE, and MIR Lab, USA. He is currently the Joint Secretary of ACM Professional Chapter Jaipur.
OORJA GARG was born in Lucknow, Uttar Pradesh, India, in 2000. She is currently pursuing the B.Tech. degree in computer science and engineering with the School of Computing and Information Technology, Manipal University Jaipur, Jaipur, Rajasthan, India. She is also interning as a Developer at Glorich India Pvt., Ltd. Her research interests lie in the field of big data analytics and utilizing it to develop and improve predictions concerning the agricultural sector.
KRISHNA MODI was born in Mumbai, Maharashtra, India, in 2000. She is currently pursuing the bachelor's degree in technology in computer science and engineering with the School of Computing and Information Technology, Manipal University Jaipur, Jaipur, Rajasthan, India. She is currently interning as a Developer at VenEx, India, a financial services company. Her research interests include the development of prediction and recommendation systems in the agricultural sector using big data analysis, fundamental concepts of weather prediction systems using meteorological data, and the development of smart farming concepts.

SHAHREEN KASIM is currently an Associate Professor with the Department of Security Information and Web Technology, Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia. Her areas of interests include bioinformatics, soft computing, data mining, and web and mobile applications.