Rainfall Prediction Using Machine Learning Algorithms for the Various Ecological Zones of Ghana

Accurate rainfall prediction has become increasingly difficult in recent times due to climate change and variability. Classification algorithms have proven increasingly effective for rainfall prediction. This study applies several classification algorithms to rainfall prediction in the different ecological zones of Ghana: Decision Tree (DT), Random Forest (RF), Multilayer Perceptron (MLP), Extreme Gradient Boosting (XGB) and K-Nearest Neighbour (KNN). The dataset, consisting of various climatic attributes, was sourced from the Ghana Meteorological Agency and spans 1980 to 2019. The performance of the classification algorithms was examined based on precision, recall, f1-score, accuracy and execution time with various training and testing data ratios. On all three training and testing ratios (70:30, 80:20 and 90:10), RF, XGB and MLP performed well, whereas KNN performed worst across all zones. In terms of execution time, Decision Tree was consistently the fastest model, whereas MLP had the longest run time.


I. INTRODUCTION
Accurate and timely rainfall prediction can support new interventions in the sectors exposed to the negative effects of rainfall extremes. These critical sectors include, but are not limited to, energy and agriculture, which are greatly affected by rainfall. A plethora of scholarly research has demonstrated that the duration and intensity of rainfall cause major climate-related disasters [1], [2]. The impacts of rainfall extremes include drought [3] and floods [4], among others, together with their associated effects. For example, in 2009, torrential rains affected almost 600,000 people in Senegal, Niger, Burkina Faso and Ghana [1]. In addition, almost half a million people died through floods recorded in 2007 over Ethiopia, Uganda, Togo, Niger, Sudan, Mali and Burkina Faso [1]. Furthermore, findings from [5] project that the death of 30,000 to 50,000 children due to malnutrition in 2009 in sub-Saharan Africa may worsen due to changes in rainfall variability coupled with acute weather episodes affecting the agricultural sector. Apart from the precious lives lost through floods, an ample body of literature has reported the impact of rainfall on other vital sectors of the Ghanaian economy [6]-[8]. In [8], it is reported that two major hydroelectric plants, which cater for over 70% of the electricity demand in Ghana, are rainfall reliant. This implies that a decrease in rainfall has dire consequences for the country's electricity generation. Agriculture, which employs about 44.7% of the Ghanaian labour force and contributes significantly to the nation's economy, is pivotal to economic growth [9]. Despite its recent decline in performance, the agriculture sector remains a crucial element for poverty reduction and food security in Ghana [9].
The associate editor coordinating the review of this manuscript and approving it for publication was Li He.
However, Ghana's agriculture sector is mainly rain-fed, with about 3% of the cultivable land supported with irrigation [9]. Meanwhile, in developing countries, including Ghana, the primary water source for agriculture, hydropower generation, and others is rainfall.
Many classification algorithms such as Random Forest (RF), Decision Tree (DT), Neural Network (NN), K-Nearest Neighbour (KNN) and others have been investigated for the prediction of rainfall. The performance of these algorithms varies widely, leaving room for enhancement by varying training and testing ratios or combining different techniques. However, rainfall prediction continues to be a challenging task. Therefore, selecting suitable methods for classifying rainfall over a region is vital. Meanwhile, machine learning algorithms have been proposed to enhance rainfall prediction accuracy [10].
For this reason, the literature is replete with rainfall prediction studies based on various techniques for locations such as Malaysia, India and Egypt. For instance, [11] used machine learning techniques to build rainfall prediction models for some major cities in Australia by comparing Decision Trees, Random Forest, Logistic Regression, AdaBoost, Gradient Boosting and K-Nearest Neighbour. In a similar comparative study, [12] reported that random forest achieved an accuracy of 87.1% for a weather prediction model, compared to 82.4% for the C4.5 decision tree algorithm. In Malaysia [13], a rainfall prediction model involving different classification algorithms revealed Neural Networks as the best performer on the evaluation metrics, with the highest F-score of 73.2%.
Insights from these studies suggest that machine learning algorithms perform well in terms of rainfall prediction accuracy and timeliness. It is therefore imperative to investigate various classification algorithms to establish the best performing techniques for predicting rainfall in Ghana. In this paper, we put forward the following contributions:
1) We apply data preprocessing techniques to Ghana Meteorological Agency (GMet) climatic data from the 22 synoptic stations across the four ecological zones of Ghana.
2) We employ classification algorithms, namely Decision Tree (DT), Random Forest (RF), Multilayer Perceptron (MLP), Extreme Gradient Boosting (XGB) and K-Nearest Neighbour (KNN), to build models for rainfall prediction in the different ecological zones.
3) We evaluate the models based on precision, recall, accuracy and f1 score.
The remaining sections of this paper are organized as follows. The next section gives a brief description of the study area and the data source. The methodology is given in Section III. Section IV presents the results and discussion and, finally, the study's conclusions are given in Section V.

II. STUDY AREA AND DATA SOURCE
A. STUDY AREA
Ghana is grouped into four (4) agro-ecological zones according to the Ghana Meteorological Agency classification, namely the Coastal, Forest, Transition and Savannah zones [14]. The zones are characterized by distinct climate conditions [15]. Ghana has two main rainfall regimes resulting from the Inter-Tropical Discontinuity (ITD) [16]: the bi-modal and uni-modal rainfall patterns. The coastal and forest zones experience a bi-modal rainfall pattern, whereas the transition and savannah zones are characterized by a uni-modal rainfall pattern [17].
The mean annual rainfall of the Savannah zone is about 1100 mm and, in comparison with the other zones, the Savannah is characterized by warm temperatures all year round. In view of the climatic conditions over the zone, the leading crops cultivated here include, but are not limited to, sorghum and millet [18]. Meanwhile, the Transitional zone, which is sandwiched between the Savannah and the Forest zones, receives a mean annual rainfall of about 1300 mm. Owing to its location, this zone exhibits climatic conditions of both the Savannah and Forest zones. Annual food crops such as plantain and maize are dominant in this zone [18]. The highest mean annual rainfall, 2200 mm, is recorded over the Forest zone. This zone is located in the southwestern part of Ghana and is predominantly wet throughout the year. The mean annual rainfall received in the Coastal zone, which is largely modulated by the land-sea breeze circulation, is about 900 mm.

B. DATA SOURCE
The study employed rainfall, temperature (minimum and maximum), relative humidity (at 1500 and 0600), sunshine hours and wind speed data from the 22 synoptic stations across the four ecological zones, spanning 1980 to 2019 and sourced from the Ghana Meteorological Agency (GMet). Figure 1 shows the locations of all 22 stations across Ghana. Weather parameters measured by GMet follow World Meteorological Organization (WMO) standards [14]. The datasets are listed in Table 1.

III. METHODOLOGY
A. CLASSIFICATION FRAMEWORK
In this section we describe the techniques and tools employed to use the various weather features for rainfall prediction. The classification framework used in this paper, shown in Figure 2, involves handling of missing data, handling of outliers, normalization, training and testing, implementation of the prediction models and evaluation of model performance.
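The training and testing step of the framework can be illustrated with a minimal sketch that partitions a dataset at the three ratios examined in this paper. The function name `split_by_ratio`, the fixed seed and the integer `records` list are hypothetical stand-ins introduced for illustration; the study itself does not state how the split was implemented.

```python
import random


def split_by_ratio(rows, train_percent, seed=0):
    """Shuffle rows reproducibly and return (train, test) for a train:test ratio."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_percent // 100
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical stand-in for the preprocessed records (not the GMet data).
records = list(range(1000))

# The three training:testing ratios examined in the study.
splits = {f"{p}:{100 - p}": split_by_ratio(records, p) for p in (70, 80, 90)}

for name, (train, test) in splits.items():
    print(name, len(train), len(test))
```

Using the same seed for every ratio keeps the shuffles identical, so differences between ratios come only from where the cut is placed.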

B. DATA EXPLORATORY AND ANALYSIS
To achieve high certainty in the validity of future results, Exploratory Data Analysis (EDA) is key to machine learning tasks [19]. Pre-processing techniques are essential for a stable classification process and for generating satisfactory results. In view of this, various data preprocessing techniques were applied to the dataset. Firstly, the Multiple Imputation by Chained Equations (MICE) package was used to impute missing values. MICE is a robust means of filling in missing data through an iterative process. A significant advantage of MICE over other missing-value approaches is that it fills in the missing values multiple times, which leads to a complete dataset [20]. Secondly, another important phase of the EDA is to identify and remove outliers that can affect the model. The study employed a mathematical function approach for this. Specifically, the Inter Quartile Range (IQR) score was used to detect and remove outliers from the datasets using the relation:
$$\mathrm{IQR} = Q_3 - Q_1$$
where IQR is the interquartile range, which is equal to the first quartile $Q_1$ subtracted from the third quartile $Q_3$. In other words, the IQR is the difference between the 75th and the 25th percentiles. Table 2 summarizes the datasets for all the ecological zones before and after outlier removal. Further, to check for multi-collinearity among the variables, a pair-wise correlation matrix and a corresponding pairplot were constructed across all ecological zones, as shown in Figure 3. Also, oversampling of the minority class was employed to handle class imbalance in the target variable, since imbalanced data can generate biased results due to the model's inability to learn much about the minority class [11]. Data normalization, or feature scaling, is essential since the mathematical processes in several machine learning algorithms rely on the Euclidean distance between two data points.
This is important since the features used in this study have varying magnitudes. Normalization scales each feature to values lying between 0 and 1. This puts the data on a common scale without distortions or loss of vital information. The Min-Max normalization was calculated using:
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$
where $x$, $\min(x)$ and $\max(x)$ represent the value to be scaled and the minimum and maximum values of the feature respectively, and $x'$ is the scaled value.
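The outlier removal, oversampling and normalization steps described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the fence multiplier `k=1.5` is the conventional Tukey choice and is not given in the text, the oversampling shown is simple random duplication, and all function names and the sample humidity values are hypothetical.

```python
import random
import statistics


def remove_outliers(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is an assumed default."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # IQR = Q3 - Q1
    return [v for v in values if q1 - k * iqr <= v <= q3 + k * iqr]


def min_max_scale(values):
    """Scale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def oversample_minority(rows, labels, seed=0):
    """Randomly duplicate rows of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Hypothetical relative humidity readings with one implausible value.
humidity = [60, 62, 65, 67, 70, 72, 75, 140]
cleaned = remove_outliers(humidity)
scaled = min_max_scale(cleaned)

balanced_rows, balanced_labels = oversample_minority(
    [[0.1], [0.2], [0.3], [0.4], [0.9]],
    ["no-rain", "no-rain", "no-rain", "no-rain", "rain"],
)
print(cleaned, scaled[0], scaled[-1], balanced_labels.count("rain"))
```

In practice the scaling parameters would normally be computed on the training portion only and then applied to the test portion, so that no information from the test set leaks into training.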

C. MODELS
The study employed 5 classification algorithms. In selecting the classifiers, this current study adopted the model family approach in [11]. These include Tree-based, distance-based, ensemble and a deep learning model. Scikit-learn was used to implement the classifiers. The details of the classification algorithms employed are given below:

1) ARTIFICIAL NEURAL NETWORK-MULTI-LAYER PERCEPTRON
As one of the most extensively used machine learning algorithms [21], the Artificial Neural Network (ANN) comes in various types, which have been employed in various aspects of research [21]. One of these is the Multi-Layer Perceptron (MLP). In hydro-climatological research, MLP is employed to establish the relationship between predictors and predictands [22]. MLP is a classical structure of Deep Neural Network (DNN) [23], characterized by several layers with numerous neurons. The first layer of an MLP is known as the input layer, whereas the last layer is the output layer. The layers located in between are known as the hidden layers. Using an activation function, the hidden layers combine weights and bias terms with the inputs to generate the output. Consider $n$ inputs $x = (x_1, x_2, \ldots, x_n)$ and a vector of weights $w_j = (w_{1j}, w_{2j}, \ldots, w_{nj})$ for a given node $j$. The simulated output $y_j$ at node $j$ is given by the equation below [24]:
$$y_j = f\left(\sum_{i=1}^{n} w_{ij}\, x_i + b_j\right)$$
where $f$ is the activation function, $w_j$ is the weight vector and $b_j$ is the bias related to the node. The output $y_k$ is therefore computed by the equation below:
$$y_k = f_2\left(\sum_{i=1}^{m} w_{ki}\, f_1\left(\sum_{j=1}^{n} w_{ij}\, x_j + b_i\right) + b_k\right)$$
where $f_1$ and $f_2$ represent the activation functions, and $j$, $i$ and $k$ refer to the input, hidden and output layers respectively. $x_j$ denotes the inputs, and the biases linked to the hidden and output layers are denoted by $b_i$ and $b_k$ respectively. Further, $n$ and $m$ are the numbers of neurons in the input and hidden layers respectively. The weights between the input and hidden layers are denoted by $w_{ij}$, whereas the weights between the hidden and output layers are denoted by $w_{ki}$. $y_k$ represents the output of the network. Hereafter, the Artificial Neural Network-Multi-Layer Perceptron used in this study is referred to as the Multi-Layer Perceptron (MLP). The MLP algorithm was applied on the three training and testing ratios.
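The forward computation of an MLP with one hidden layer, as described above, can be sketched as follows. The sigmoid activation for both $f_1$ and $f_2$, the layer sizes and all weight values are assumptions made for illustration; the text does not specify the activation functions or architecture used in the study.

```python
import math


def sigmoid(z):
    """Logistic activation (an assumed choice for f1 and f2)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases, activation):
    """Compute activation(sum_j w[i][j] * x[j] + b[i]) for every node i."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def mlp_forward(x, w_hidden, b_hidden, w_out, b_out, f1=sigmoid, f2=sigmoid):
    """One hidden layer: y_k = f2(sum_i w_ki * f1(sum_j w_ij * x_j + b_i) + b_k)."""
    hidden = layer(x, w_hidden, b_hidden, f1)
    return layer(hidden, w_out, b_out, f2)


# Hypothetical example: 3 scaled inputs, 2 hidden neurons, 1 output neuron.
x = [0.5, 0.2, 0.9]
w_hidden = [[0.4, -0.6, 0.1], [0.3, 0.8, -0.5]]
b_hidden = [0.0, 0.1]
w_out = [[1.2, -0.7]]
b_out = [0.05]

y = mlp_forward(x, w_hidden, b_hidden, w_out, b_out)
print(y)  # a single value in (0, 1), readable as a rain probability
```

Training (adjusting the weights and biases by backpropagation) is not shown; the sketch covers only how an output is produced from given weights.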

2) K-NEAREST NEIGHBOUR
K-Nearest Neighbour (KNN) is a non-parametric learning algorithm that classifies using a distance measure such as the Euclidean, Manhattan or Minkowski distance [25]. It has been reported that KNN performs better with a minimal number of features [11]. The Euclidean distance is calculated using equation 4:
$$d_j = \sqrt{\sum_{i} \left(x_{ij} - x_{io}\right)^2} \qquad (4)$$
where $x_{ij}$ and $x_{io}$ refer to the $i$-th data point in the $j$-th predictor and the predictand respectively.
To calculate the KNN value, equation 5 is used:
$$Z_r = \sum_{k=1}^{K} f_k(d_j)\, Z_k \qquad (5)$$
where $Z_r$ and $Z_k$ represent the predicted and neighbouring data respectively, and $f_k(d_j)$ is the kernel function. This study employs 7 meteorological features, which makes KNN a favourable candidate. According to [26], the performance of KNN depends on the number of neighbours ($K$) used. The value of $K$ was set to 5 in this study after a preliminary assessment. KNN with $K = 5$ was applied on all three training and testing ratios.
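A minimal sketch of KNN classification with the Euclidean distance and K = 5 is given below. It uses an unweighted majority vote among the neighbours, which corresponds to a uniform kernel; this is an assumption, since the kernel used in the study is not stated. The function name and the two-feature toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=5):
    """Return the majority label among the k training points closest to query."""
    nearest = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical scaled features (e.g. humidity, sunshine) with class labels.
train_x = [
    (0.10, 0.20), (0.15, 0.10), (0.20, 0.25), (0.05, 0.15),
    (0.80, 0.90), (0.85, 0.75), (0.90, 0.95), (0.75, 0.85),
]
train_y = ["no-rain"] * 4 + ["rain"] * 4

prediction = knn_predict(train_x, train_y, query=(0.85, 0.80), k=5)
print(prediction)
```

Because the distance is computed over all features jointly, the features should be on a common scale first, which is the reason normalization precedes this step.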

3) DECISION TREE
The decision tree algorithm is used for both classification and regression in machine learning. In a decision tree, each internal node represents a choice between alternatives, whereas each leaf node signifies a decision. Decision trees perform well with both categorical and continuous variables, which suits this study since our target variable (rainfall) is binary categorical. Known algorithms for building decision trees include ID3, C4.5, C5.0, Chi-squared Automatic Interaction Detection (CHAID), QUEST and Classification and Regression Trees (CART). C5.0 was selected for this study and applied on the three training and testing ratios. C5.0 is an enhancement of the earlier ID3 and C4.5 algorithms.
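To illustrate how a single node of a decision tree can choose a split, the sketch below computes the entropy-based information gain used by ID3-family algorithms for a threshold on one continuous feature. It is not the C5.0 procedure itself, which adds further refinements; the function names and the toy humidity values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting on value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values and keep the best one."""
    distinct = sorted(set(values))
    midpoints = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(midpoints, key=lambda t: information_gain(values, labels, t))


# Hypothetical relative humidity readings and rain labels.
humidity = [55, 60, 65, 80, 85, 90]
labels = ["no-rain", "no-rain", "no-rain", "rain", "rain", "rain"]

threshold = best_threshold(humidity, labels)
gain = information_gain(humidity, labels, threshold)
print(threshold, gain)
```

A full tree repeats this choice recursively on each resulting subset until the leaves are sufficiently pure or a depth limit is reached.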

4) RANDOM FOREST
Random Forest (RF) was developed by Breiman in 2001 for classification. It is an ensemble machine learning algorithm that uses numerous classification trees, hence the name random forest. It merges different decision trees for regression and classification purposes [25]. In spite of its bias towards variables with many levels among categorical variables with varying numbers of levels, the RF algorithm is regarded as an extremely rigorous learning algorithm [27]. The RF algorithm first draws random samples from the given data. In the second stage, a decision tree is created for each sample. Voting is then carried out on the predicted results and, finally, the most voted prediction is chosen as the classification. The RF algorithm was applied on all three training and testing ratios. The model configuration used in this study set the number of weak learners to 100 and the maximum tree depth to 16.
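The sample, build and vote stages described above can be sketched as follows. To keep the example short, each "tree" is a one-split stump on a randomly chosen feature, which is a deliberate simplification and not the depth-16 trees of the study; the function names, the seed and the toy data are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels, rng):
    """Fit a one-split tree on a random feature (a stand-in for a full tree)."""
    feature = rng.randrange(len(rows[0]))
    threshold = sum(r[feature] for r in rows) / len(rows)
    left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
    fallback = majority(labels)
    return (
        feature,
        threshold,
        majority(left) if left else fallback,
        majority(right) if right else fallback,
    )


def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def fit_forest(rows, labels, n_trees=100, seed=0):
    """Stage 1: draw bootstrap samples. Stage 2: fit one tree per sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], rng))
    return forest


def predict_forest(forest, row):
    """Stages 3 and 4: collect one vote per tree and return the most voted class."""
    return majority([predict_stump(stump, row) for stump in forest])


# Hypothetical scaled features with class labels.
rows = [(0.10, 0.20), (0.20, 0.10), (0.15, 0.15), (0.80, 0.90), (0.90, 0.80), (0.85, 0.85)]
labels = ["no-rain"] * 3 + ["rain"] * 3

forest = fit_forest(rows, labels, n_trees=100)
print(predict_forest(forest, (0.9, 0.9)), predict_forest(forest, (0.1, 0.1)))
```

The value `n_trees=100` mirrors the number of weak learners reported in the text; individual stumps fitted on unlucky samples may be wrong, but the vote across the ensemble absorbs those errors.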

5) EXTREME GRADIENT BOOSTING
Extreme Gradient Boosting (XGBoost) is an advanced machine learning technique built on the gradient boosting algorithm developed by [28]. XGBoost controls overfitting through a more regularized model formalization. This algorithm was selected for this study due to its high execution speed. XGB was applied on all three training and testing ratios.

D. EVALUATION METRICS
The efficiency of the various algorithms can be evaluated with numerous metrics. However, this study focuses on accuracy, precision, recall, f-measure and the confusion matrix, which is the basis for the other metrics. The metrics are defined as follows:

1) CONFUSION MATRIX
The confusion matrix, shown in Table 3, presents the model's performance in matrix form, where:
1. TN (true negative) is the number of negative instances correctly classified as negative.
2. FN (false negative) is the number of positive instances falsely classified as negative.
3. FP (false positive) is the number of negative instances falsely classified as positive.
4. TP (true positive) is the number of positive instances correctly classified as positive.
From these counts, recall, precision, accuracy and F1 score are expressed in equations 7, 8, 9 and 10 respectively:
$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (7)$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (8)$$
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (9)$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (10)$$
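The standard definitions of these four metrics can be computed directly from the confusion matrix counts, as in the sketch below. The function name and the example counts are hypothetical and are not results from this study.

```python
def safe_div(numerator, denominator):
    """Return 0.0 instead of failing when a class has no instances."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, fn, tn):
    """Compute recall, precision, accuracy and F1 from confusion matrix counts."""
    recall = safe_div(tp, tp + fn)
    precision = safe_div(tp, tp + fp)
    accuracy = safe_div(tp + tn, tp + tn + fp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {"recall": recall, "precision": precision, "accuracy": accuracy, "f1": f1}


# Hypothetical counts for the rain class (not taken from the paper's tables).
metrics = classification_metrics(tp=80, fp=20, fn=10, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting the metrics per class, as done in the results tables, amounts to calling the function once with rain treated as the positive class and once with no-rain treated as the positive class.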

IV. RESULTS AND DISCUSSION
A. ANALYSIS OF CLASSIFICATION ALGORITHMS AT THE COASTAL ZONE OF GHANA
The results for the coastal zone are shown in Table 4. Firstly, for the no-rain class, DT, RF and XGB achieved precision, recall and f1 score of 1, 0.993 and 0.996 respectively at the 70:30 ratio. MLP, however, performed better in precision at the 90:10 ratio. The KNN classifier consistently showed low results on all three ratios. Secondly, for the rain class, the best performance in precision, recall and f1 score was recorded at the 80:20 ratio with DT, RF and XGB. However, MLP performed better in recall at 90:10. Again, KNN performed worst on all ratios for the rain class, whereas XGB performed best on both classes in precision, recall and f1 score at 80:20.

B. ANALYSIS OF CLASSIFICATION ALGORITHMS AT THE FOREST ZONE OF GHANA
The results for the forest zone are presented in Table 5. For the no-rain class, RF and XGB performed well, with the same results in precision, recall and f1 score at 70:30 and 80:20. DT, however, performed better for the no-rain class in recall at 70:30 and 90:10 and in f1 score at 70:30. Further, MLP performed better in recall for the no-rain class at 90:10.
VOLUME 10, 2022

C. ANALYSIS OF CLASSIFICATION ALGORITHMS AT THE TRANSITIONAL ZONE OF GHANA
The results for the transitional zone are shown in Table 6. For both the rain and no-rain classes, RF and XGB performed best in precision, recall and f1 score on all 3 ratios. Notably, both RF and XGB classified all instances correctly.
Similar performance was seen with MLP, with precision, recall and f1 score each reaching 1. However, MLP achieved this only on the 90:10 ratio. Comparatively, DT outperformed KNN in precision, recall and f1 score on all ratios for both the rain and no-rain classes.

D. ANALYSIS OF CLASSIFICATION ALGORITHMS AT THE SAVANNAH ZONE OF GHANA
The results for the savannah zone are shown in Table 7. For the no-rain class, RF and XGB consistently performed best in precision, recall and f1 score on all 3 ratios (70:30, 80:20 and 90:10). At 90:10, DT performed better in precision for the no-rain class. KNN performed better in recall for the no-rain class at both 70:30 and 80:20. For the rain class, RF and XGB performed best in recall on all 3 ratios. MLP, on the other hand, exhibited similar performance in recall but only on the 90:10 ratio. DT performed better in f1 score for the rain class at 70:30. In the savannah zone, KNN outperformed DT in precision on the 80:20 ratio.

1) CRITICAL ANALYSIS AND SUMMARY OF THE RESULTS OF THE CLASSIFICATION ALGORITHMS
The machine learning classifiers used in this study showed good results on all three training and testing ratios for the no-rain class in the coastal zone. The F1-score, also known as the F-measure, is an excellent evaluation measure based on the harmonic mean of recall and precision. As clearly shown in Table 4, in terms of f1-score all the classifiers classified the no-rain class better than the rain class. Several reasons could explain the lower results for the rain class on all ratios. These may include the absence of other climatic features and the low rainfall rate over the zone, which is consistent with findings in [18].
In contrast, the results in Table 5 show that all the classifiers classified the rain class better than the no-rain class on the basis of the f1-score. In this zone, which is characterized by high rainfall amounts throughout the year, all five (5) classifiers performed well on the 70:30 training and testing ratio. However, among the five classifiers, Random Forest and XGBoost outperformed the others in classifying the rain class. Furthermore, the f1-score at the savannah zone shows that all the classifiers achieved good results for the rain class on all three (3) training and testing ratios.
Whereas the best performance at the forest zone relied on Random Forest and XGBoost at 70:30, the best performing classifiers at the transitional zone include MLP, RF and XGB (see Table 6).
The performance of these classifiers was independent of the training and testing ratio, as all three were consistent across the three different ratios. From Table 7, it is observed that the classifiers' performance for the no-rain class is dominant on all three training and testing ratios, which is indicative of low rainfall at the savannah zone and consistent with the findings in [18]. Generally, the f1-score of all classifiers was better for the no-rain class than for the rain class.
Timeliness is crucial in rainfall prediction, as adverse effects of rainfall such as floods can be prevented if timely prediction is achieved. Regardless of a model's accuracy, its speed of prediction is essential. A visual inspection of Fig. 4a shows that almost all the classifiers exhibited similar overall model accuracy, with the exception of KNN, which showed low accuracy on the 70:30 training and testing ratio at the coastal zone. However, with regard to execution time, decision tree was the fastest, whereas MLP had the longest execution time. Similarly, the decision tree classifier was also the fastest to execute at the forest zone (see Figure 4b).
However, at the coastal zone, the execution times of XGBoost and random forest are comparable. On the same 70:30 ratio, the transitional and savannah zones also showed the consistency of the decision tree in terms of execution speed (see Figures 4c and 4d). Meanwhile, compared to the coastal and forest zones, random forest took more time to execute at the transitional zone, whereas at the savannah zone MLP regained its position as the model with the longest execution time. Generally, from Fig. 4, it can be seen that the overall model accuracy of MLP, RF and XGB was high at all the zones on the 70:30 ratio, whereas KNN consistently performed worst. Comparing the execution times and model accuracy levels at all the zones on the 80:20 training and testing ratio, decision tree was consistently the model with the fastest execution time at all zones (see Figure 5).
Meanwhile, in terms of model accuracy, DT exhibited good accuracy levels with the exception of the savannah zone, where its performance was low (see Figure 5d). From Fig. 6, it can be observed that, in spite of its longer execution time at all zones, MLP performed best in terms of model accuracy compared to the other models. Again, on the 90:10 ratio, decision tree stood out as the fastest model to execute. Overall, decision tree has shown itself to be a good candidate in terms of timeliness in rainfall prediction over all ecological zones of Ghana. However, in terms of accuracy, MLP, XGB and RF stood out.

V. SUMMARY AND CONCLUSION
This research carried out rainfall prediction in Ghana covering all the ecological zones using five (5) classification algorithms, namely Decision Tree, Random Forest, Multilayer Perceptron, Extreme Gradient Boosting and K-Nearest Neighbour. Forty years of past climatic data spanning 1980 to 2019 from the Ghana Meteorological Agency were used for this study. To evaluate the performance of the classifiers, the evaluation metrics employed included precision, f1-score and recall, with results presented in tables. Further, the overall accuracy and the execution times of the individual models were also ascertained, and the results are shown in figures.
To ensure effective rainfall prediction, the input datasets went through exploratory data analysis, in which the multiple imputation by chained equations algorithm was used to replace missing data, outliers were removed and the datasets were normalized before the classification stage. The datasets were split into two parts: training datasets and testing datasets. We employed 3 different training and testing ratios (training data : testing data) of 70:30, 80:20 and 90:10 to analyze the performance of the classification algorithms. Findings from the study showed distinct characteristics in the classification of the rain and no-rain classes in the various ecological zones of the country. The no-rain class was classified better than the rain class by the classifiers in the coastal zone.
However, the opposite was observed in the forest zone, where the classifiers' performance was best for the rain class. At the savannah zone, all classifiers on the 3 training and testing ratios performed well in classifying the no-rain class, which is consistent with the low rainfall pattern observed in the region. Across all ecological zones and all training and testing ratios, decision tree distinguished itself as the model with the fastest execution time, whereas multilayer perceptron had the longest execution time.
Generally, random forest, extreme gradient boosting and multilayer perceptron performed well in all instances, which suggests that ensemble and deep learning models are good candidates for rainfall prediction. However, K-Nearest Neighbour performed worst in all zones on all training and testing ratios, which warrants further investigation. Further study using other classification algorithms and a hybrid model at different training and testing ratios for rainfall prediction in all ecological zones of Ghana is under consideration.
Change and Adapted Land Use (WASCAL), Dynamic-Aerosol-Chemistry-Cloud interactions in West Africa (DACCIWA) and International Development Research Center -Climate Change Adaptation Research and Training Capacity for Development (IDRC-CCARTCD), Global Challenge Research Fund Africa Science for Weather Information and Forecasting Techniques (GCRF African SWIFT), Current and Future risks of Urban and Rural Flooding in West Africa-An integrated analysis and eco-system-based solutions (FURILOOD), and Green gas emissions and mitigation options under climate and land-use change in West Africa-A concerted regional modeling and assessment (CONCERT). He has over 70 publications in high-impact peer-reviewed journals and over 85 oral and poster presentations in international conferences.
NAJIM USSIPH received the B.Sc. and M.Sc. degrees in computer science from the University of Ibadan, Ibadan, Nigeria, in 1987 and 1993, respectively, and the Ph.D. degree from the University of Salford, Manchester, in 2015. He has over 15 years of experience in teaching and research at Polytechnic and University level. He has been a Faculty Member with the Department of Computer Science, Kwame Nkrumah University of Science and Technology, Kumasi, since April 2001, and has taught several courses at both undergraduate and graduate levels. His research interests include information systems and e-learning and learning technologies.
TWUM FRIMPONG received the M.Sc. degree in information systems from Roehampton University, and the Ph.D. degree in computer science from the Kwame Nkrumah University of Science and Technology (KNUST). He is currently a Senior Lecturer with the Department of Computer Science, KNUST. His research interests include computer networks security and machine learning.
EMMANUEL AHENE received the M.Eng. and Ph.D. degrees in computer science and technology from the University of Electronic Science and Technology of China. He is currently a Lecturer with the Department of Computer Science, Kwame Nkrumah University of Science and Technology, Ghana, and also a Co-Founder of Cyberpassconsult, an international cybersecurity consultancy firm. His research interests include information security and machine learning.