Pan Evaporation Prediction Using LSTM Models Based on PCA Factor Reduction and Firefly Optimization Algorithm

Evaporation is an important part of the moisture exchange between the earth and the air. Understanding the trend of pan evaporation helps reveal the status of actual evaporation, which is very useful for the allocation of regional water resources. Although long short-term memory (LSTM) has become a mainstream algorithm for predicting pan evaporation, two issues are worth considering. One is how to automatically find the optimal hyperparameters; the other is how to eliminate the correlation between prediction factors to improve prediction performance. To address these two issues, this article proposes LSTM models based on principal component analysis (PCA) factor reduction and the firefly optimization algorithm. In the proposed model, the firefly algorithm finds the optimal hyperparameters, and PCA eliminates the correlation between prediction factors. The Xiangjiang River Basin, an important basin for China’s water resource management, is selected as the study area, and the experimental results are evaluated by root mean square error (RMSE) and the coefficient of determination ($R^{2}$). The results show that the proposed models can successfully predict daily pan evaporation of the study area.


Chuanli Wang, Tianyu Li, Dongjun Xin, Qian Wang, Ran Chen, and Chaoyi Cao

I. INTRODUCTION
Evaporation is an important pathway of the hydrological cycle. It is also an important factor in many applications, such as water resources assessment [23], hydrological forecasting and calculation [10], and basin hydrological model calculations [26]. Pan evaporation is a key indicator of evaporation, and its prediction has received increasing attention in recent years.
Machine learning methods are often adopted to predict pan evaporation because evaporation is a nonlinear process and machine learning has a strong nonlinear regression ability [25], [29]. Literature [12] uses an adaptive neuro-fuzzy inference system and multiple linear regression, respectively, to predict the evapotranspiration of the Mediterranean region, and concludes that the adaptive neuro-fuzzy inference system has a better simulation effect. Literature [20] compares the performance of an artificial neural network and a co-active neuro-fuzzy inference system in simulating pan evaporation at Pantnagar; the results show that the artificial neural network is superior to the co-active neuro-fuzzy inference system.
With the rise of deep learning, convolutional neural networks (CNNs) [2], long short-term memory (LSTM) [11], and other deep learning algorithms [3] have been applied to evaporation prediction and achieved significant performance. Among them, LSTM has achieved remarkable performance in modeling time series data, such as the water level of Dongting Lake [15] and water quality in an Internet of Things environment [18]. Observed pan evaporation is also a time series, so LSTM has become a mainstream algorithm for predicting it. Literature [1] proposes an LSTM model to simulate evaporation in Malaysia and shows that the LSTM model achieves the highest accuracy compared with other machine learning models. Literature [19] analyzes the performance of LSTM and a multilayer artificial neural network in estimating evaporation in Chhattisgarh in east-central India, and the results show that LSTM is better than the multilayer artificial neural network.
Although LSTM-based prediction models of pan evaporation have achieved good performance, two issues are worth considering. One is how to automatically find the optimal hyperparameters of LSTM, because hyperparameters such as the number of neurons, batch size, and number of epochs need to be set manually. The other is how to eliminate the correlation between prediction factors to improve prediction performance, because common predictive factors, such as precipitation, temperature, humidity, wind speed, sunshine, and air pressure, are correlated. The first issue belongs to hyperparameter optimization [5], and swarm intelligence algorithms are commonly used to solve it; for example, literature [4], [21] uses the mayfly or particle swarm optimization (PSO) algorithm to optimize the hyperparameters of a support vector machine. For neural networks, some works choose the firefly algorithm (FA); for example, [9] proposes an FA-based multilayer perceptron to predict pan evaporation. Regarding the second issue, principal component analysis (PCA) is a suitable solution [13], because PCA converts correlated factors into uncorrelated principal components.
The Xiangjiang River Basin is very important for China's water resource management, but few works have addressed the prediction of its pan evaporation. The aim of this article is therefore to propose LSTM models based on PCA factor reduction and the firefly optimization algorithm to predict the daily pan evaporation of the Xiangjiang River Basin.

A. Study Area and Data
The Xiangjiang River Basin is situated in the south-central part of Hunan Province, China, between east longitude 112°30′16″ and 113°17′32″ and north latitude 26°15′50″ and 27°25′00″. It has a subtropical monsoon climate, characterized by four distinct seasons, concentrated rainfall, variable spring temperatures, hot summers, dry summers and autumns, and short cold periods.
The data used in this article come from three hydrological stations: Hengyang Hydrological Station, Zhuzhou Hydrological Station, and the Mapoling Hydrological Station of Changsha City. The data span January 1, 1986, to December 31, 2001, and include precipitation, temperature, relative humidity, wind speed, sunshine, pressure, runoff information, and pan evaporation. Evaporation data were observed using an Eφ20 pan, and each hydrological station has 5840 entries.

B. Research Method
In this section, the entire algorithm process is first presented, followed by a specific introduction of LSTM, FA, PCA, and variable selection and data preprocessing. The flowchart of the proposed algorithm is shown in Fig. 1.
1) Long Short-Term Memory Network: The LSTM network, an improved recurrent neural network (RNN), has the ability of long-term memory, which the traditional RNN lacks [28]. LSTM has achieved great success in the field of sequential data processing, such as natural language processing [16], [30]. LSTM realizes a flexible memory function through gates. The basic unit block of LSTM is illustrated in Fig. 2.
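In Fig. 2, $t-1$ refers to the previous moment, $t$ refers to the current moment, and $x_t$ refers to the current input. $h_{t-1}$ and $h_t$ represent the hidden layer output of the previous moment and of the current moment, respectively. $C_{t-1}$ represents the network memory of the previous moment, $C_t$ the current network memory, $f_t$ the forget gate, $i_t$ the input gate, $o_t$ the output gate, and $\tilde{C}_t$ the current temporary memory; $\sigma$ and $\tanh$ represent the sigmoid and tanh activation functions, respectively. They are calculated as follows:

$$f_t = \sigma\left(M_f\,[h_{t-1}, x_t] + v_f\right) \tag{1}$$
$$i_t = \sigma\left(M_i\,[h_{t-1}, x_t] + v_i\right) \tag{2}$$
$$\tilde{C}_t = \tanh\left(M_c\,[h_{t-1}, x_t] + v_c\right) \tag{3}$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{4}$$
$$o_t = \sigma\left(M_o\,[h_{t-1}, x_t] + v_o\right) \tag{5}$$
$$h_t = o_t \odot \tanh\left(C_t\right) \tag{6}$$

where $M_f$, $M_i$, $M_c$, and $M_o$ are transformation matrices, $v_f$, $v_i$, $v_c$, and $v_o$ are offset values, $[h_{t-1}, x_t]$ denotes the concatenation of the previous hidden output and the current input, and $\odot$ denotes element-wise multiplication. The matrices and offsets are solved by the backpropagation process.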
As seen from (1), (2), and (4), the forget gate determines how much of the information stored in the previous network memory is kept in the current memory, and the input gate determines how much of the information stored in the current temporary memory is added to the current memory.
As seen from (6), the output gate determines how much of the information stored in the current memory is used to calculate the hidden layer output of the current moment.
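The gate computations described above can be sketched in NumPy as follows. This is a minimal illustration that assumes the standard LSTM cell acting on the concatenation of $h_{t-1}$ and $x_t$; the function name `lstm_step`, the dictionary layout of the parameters, and the toy sizes are hypothetical and are not taken from the article's Keras implementation.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation used by the three gates."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, M, v):
    """One LSTM time step; M and v are dicts keyed by 'f', 'i', 'c', 'o'."""
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(M["f"] @ z + v["f"])      # forget gate
    i_t = sigmoid(M["i"] @ z + v["i"])      # input gate
    c_tmp = np.tanh(M["c"] @ z + v["c"])    # current temporary memory
    c_t = f_t * c_prev + i_t * c_tmp        # current network memory
    o_t = sigmoid(M["o"] @ z + v["o"])      # output gate
    h_t = o_t * np.tanh(c_t)                # current hidden layer output
    return h_t, c_t

# Toy usage: 6 input factors, 4 hidden units, random (untrained) parameters.
rng = np.random.default_rng(0)
n_in, n_hid = 6, 4
M = {k: rng.normal(scale=0.1, size=(n_hid, n_hid + n_in)) for k in "fico"}
v = {k: np.zeros(n_hid) for k in "fico"}
h_t, c_t = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), M, v)
```

In practice these parameters are learned by backpropagation; the sketch only shows how the gates combine the previous memory with the current input.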
It should be pointed out that, in this article, daily pan evaporation is used as the label data, and the time series are transformed into supervised learning samples by the sliding window method [7].
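The sliding window transformation can be sketched as below. The window length and the choice of the next day's evaporation as the label are assumptions made for illustration, since the article does not state them; `make_supervised` is a hypothetical helper name.

```python
import numpy as np

def make_supervised(features, target, window):
    """Build (samples, window, n_features) inputs and next-step labels."""
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    X, y = [], []
    for start in range(len(features) - window):
        X.append(features[start:start + window])   # `window` consecutive days
        y.append(target[start + window])            # evaporation on the following day
    return np.array(X), np.array(y)

# Toy usage: 10 days, 2 factors, window of 3 days (assumed value).
feats = np.arange(20, dtype=float).reshape(10, 2)
evap = np.arange(10, dtype=float)
X, y = make_supervised(feats, evap, window=3)
```

The resulting three-dimensional array has the (samples, time steps, features) layout expected by recurrent layers such as LSTM.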
2) Firefly Algorithm: FA [14], a method for solving optimization problems, belongs to the category of swarm intelligence algorithms. Its key idea is that fireflies attract each other, and a firefly moves toward another firefly with higher brightness.
In the firefly optimization algorithm, each firefly is regarded as a solution in the solution space. The firefly population is randomly distributed in the search space as the initial solutions, and each firefly is then moved according to the movement mode of fireflies in nature. Through the movement of each generation, the fireflies eventually gather around the better fireflies.
Let $g$ denote the objective function of a maximization problem, let fireflies $x_i$ and $x_j$ be two solutions in its solution space, and let $d$ denote the distance between $x_i$ and $x_j$. Define $g(x_i)$ and $g(x_j)$ as the brightness of fireflies $x_i$ and $x_j$, respectively, and suppose $g(x_j) > g(x_i)$. The attraction between $x_i$ and $x_j$ and the movement of $x_i$ toward $x_j$ are then given by (7) and (8), respectively, where $A$ denotes the maximum attraction, reached when $x_i$ and $x_j$ meet, $\alpha$ is an attenuation coefficient, and $\beta$ is a random value. $A$ and $\alpha$ must be set in advance, as must the range of $\beta$.
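For reference, the standard formulation of the firefly algorithm writes the attraction and the movement rule in the following form. The squared-distance exponential is an assumption taken from that standard formulation and may differ in detail from the exact expressions used in this article; the name $a_{ij}$ for the attraction is introduced here only for illustration, while $A$, $\alpha$, $\beta$, and $d$ are the symbols defined above.

$$a_{ij} = A\,e^{-\alpha d^{2}} \tag{7}$$
$$x_i \leftarrow x_i + a_{ij}\,(x_j - x_i) + \beta \tag{8}$$

With $d = 0$ the attraction equals $A$, which matches the description of $A$ as the maximum attraction when the two fireflies meet, and the attraction decays as they move apart.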
The basic steps of the FA are as follows.
Step 1: Initialize parameters including the number of fireflies, maximum attraction A, attenuation coefficient α, random value β, and so on.
Step 2: Calculate the brightness of all fireflies.
Step 3: Update each firefly by (8); if it is currently the best firefly, it moves randomly.
Step 4: Repeat Steps 2 and 3 until the stop conditions are met.
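The four steps can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the article's code: the attraction is assumed to be $A e^{-\alpha d^{2}}$, the random value $\beta$ is drawn uniformly from a symmetric range, the best solution seen so far is remembered, and the function name `firefly_maximize` and all default values are hypothetical. For hyperparameter search, `g` would be something like the negative validation RMSE of an LSTM trained with the candidate settings; a cheap toy objective is used here instead.

```python
import numpy as np

def firefly_maximize(g, bounds, n_fireflies=15, n_iter=50,
                     A=1.0, alpha=1.0, beta_range=0.1, seed=0):
    """Maximize g over a box with a basic firefly algorithm (sketch)."""
    rng = np.random.default_rng(seed)
    low, high = np.asarray(bounds, dtype=float).T
    dim = len(low)
    # Step 1: random initial population inside the search box.
    x = rng.uniform(low, high, size=(n_fireflies, dim))
    best_x, best_val = None, -np.inf
    for _ in range(n_iter + 1):
        # Step 2: brightness of every firefly.
        light = np.array([g(p) for p in x])
        top = int(np.argmax(light))
        if light[top] > best_val:
            best_x, best_val = x[top].copy(), float(light[top])
        # Step 3: move each firefly toward every brighter one.
        new_x = x.copy()
        for i in range(n_fireflies):
            if i == top:
                # The currently best firefly takes a random step.
                new_x[i] = x[i] + rng.uniform(-beta_range, beta_range, dim)
                continue
            for j in range(n_fireflies):
                if light[j] > light[i]:
                    d2 = np.sum((x[j] - new_x[i]) ** 2)
                    attract = A * np.exp(-alpha * d2)   # assumed attraction form
                    beta = rng.uniform(-beta_range, beta_range, dim)
                    new_x[i] = new_x[i] + attract * (x[j] - new_x[i]) + beta
        # Step 4: keep solutions inside the bounds and repeat.
        x = np.clip(new_x, low, high)
    return best_x, best_val

# Toy usage: the maximum of this objective is at (1, 1).
g = lambda p: -float(np.sum((p - 1.0) ** 2))
best_x, best_val = firefly_maximize(g, [(-3, 3), (-3, 3)])
```

Integer hyperparameters such as the number of neurons, batch size, and number of epochs would need to be rounded before each evaluation of `g`.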
3) Principal Component Analysis: The basic idea of PCA is to map a group of correlated variables into a group of linearly uncorrelated variables through an orthogonal transformation and to retain the principal components with the largest variance [6]. It is mainly realized by eigenvalue decomposition or singular value decomposition. The specific implementation steps are as follows.
Step 1: Original data is zero-centered by subtracting the mean in each component.
Step 2: Construct the covariance matrix of the original data.
Step 3: Solve for the eigenvalues and eigenvectors of the covariance matrix obtained in Step 2.
Step 4: Select the principal components according to the size of the eigenvalues, keeping as many as the application requires.
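The steps above can be sketched with NumPy's eigenvalue decomposition. The helper names `pca_fit` and `pca_transform` and the synthetic data are hypothetical; the 0.95 cumulative-variance threshold used to pick the number of components mirrors the criterion reported later in the article.

```python
import numpy as np

def pca_fit(X):
    """Return mean, sorted eigenvalues/eigenvectors, and cumulative variance ratio."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                                  # Step 1: zero-center each variable
    cov = np.cov(Xc, rowvar=False)                 # Step 2: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # Step 3: symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]              # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()
    return mean, eigvals, eigvecs, ratio

def pca_transform(X, mean, eigvecs, k):
    """Project data onto the first k principal components."""
    return (np.asarray(X, dtype=float) - mean) @ eigvecs[:, :k]

# Toy usage: three variables, two of which are strongly correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.column_stack([base[:, 0],
                     2 * base[:, 0] + 0.01 * rng.normal(size=200),
                     base[:, 1]])
mean, eigvals, eigvecs, ratio = pca_fit(X)
k = int(np.searchsorted(ratio, 0.95) + 1)          # Step 4: smallest k reaching 95%
Z = pca_transform(X, mean, eigvecs, k)
```

The projected columns are mutually uncorrelated, which is the property the model relies on when the principal components replace the original factors as LSTM inputs.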

C. Variable Selection and Data Preprocessing
1) Variable Selection: In this article, six variables were selected as predictive factors: air pressure (PRS), relative humidity (RHU), precipitation (PRE), wind speed (WIN), temperature (TEM), and sunshine day (SSD).
2) Anomaly Data Cleaning: Some of the observed data are missing, but the number of consecutive missing values never exceeds 6. Therefore, according to the PauTa criterion (3-sigma principle) [24], a missing value $x_i$ is calculated via (9), where $\mu$ is the mean, $\alpha = 0.7$, $\beta = 0.2$, and $\gamma = 1 - \alpha - \beta$.
3) Scaling of Data: Data normalization is a common preprocessing task in machine learning. It eliminates the magnitude differences between input variables with different units and thus helps to evaluate objectively the impact of each input variable on the results. In this article, the input variables, such as precipitation, temperature, and evaporation, also have different units, so it is necessary to normalize the original data. This article adopts the min-max normalization method [8] to map the input variables into the range [0, 1]. The normalization formula is

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{10}$$

where $x$ is an original value, $x_{\min}$ is the minimum value, $x_{\max}$ is the maximum value, and $x'$ is the normalized value.
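A minimal sketch of min-max normalization applied column by column is given below. Computing the minimum and maximum on one array and reusing them for another (for example, fitting on training data and applying to test data) is an assumption for illustration, as is the guard against constant columns; the helper names are hypothetical.

```python
import numpy as np

def minmax_fit(X):
    """Column-wise minimum and maximum of the reference data."""
    X = np.asarray(X, dtype=float)
    return X.min(axis=0), X.max(axis=0)

def minmax_apply(X, x_min, x_max):
    """Scale each column with (x - x_min) / (x_max - x_min)."""
    span = np.where(x_max > x_min, x_max - x_min, 1.0)  # avoid division by zero
    return (np.asarray(X, dtype=float) - x_min) / span

# Toy usage: two variables with very different magnitudes.
train = np.array([[1000.0, 10.0], [1010.0, 30.0], [1005.0, 20.0]])
x_min, x_max = minmax_fit(train)
scaled = minmax_apply(train, x_min, x_max)
```

After scaling, both columns lie in [0, 1] regardless of their original units.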

4) Evaluation Indicators: The LSTM model of pan evaporation is evaluated by the coefficient of determination ($R^{2}$) and the root mean square error (RMSE). $R^{2}$ measures how well the model fits the observations, and higher values indicate a better fit. RMSE represents the degree of deviation between the predicted values and the observed values, and smaller values indicate higher prediction accuracy. $R^{2}$ and RMSE are calculated as

$$R^{2} = 1 - \frac{\sum_{i=1}^{m}\left(o_i - p_i\right)^{2}}{\sum_{i=1}^{m}\left(o_i - \bar{o}\right)^{2}} \tag{11}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(o_i - p_i\right)^{2}} \tag{12}$$

where $o_i$ is the observed value of the $i$th sample, $p_i$ is the predicted value of the $i$th sample, $\bar{o}$ is the average value of all observations, and $m$ is the total number of samples.
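The two indicators can be computed as in the sketch below. It assumes the common definition of $R^{2}$ as one minus the ratio of the residual sum of squares to the total sum of squares around the observed mean, which uses only the symbols defined above; the function names and the toy values are hypothetical.

```python
import numpy as np

def rmse(o, p):
    """Root mean square error between observations o and predictions p."""
    o, p = np.asarray(o, dtype=float), np.asarray(p, dtype=float)
    return float(np.sqrt(np.mean((o - p) ** 2)))

def r2(o, p):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    o, p = np.asarray(o, dtype=float), np.asarray(p, dtype=float)
    ss_res = np.sum((o - p) ** 2)
    ss_tot = np.sum((o - o.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy usage: every prediction is off by 0.5.
obs = [2.0, 4.0, 6.0, 8.0]
pred = [2.5, 3.5, 6.5, 7.5]
print(rmse(obs, pred))  # 0.5
print(r2(obs, pred))    # 0.95 (up to floating-point rounding)
```

Here the residual sum of squares is 1 and the total sum of squares is 20, which gives $R^{2} = 0.95$.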

III. RESULTS
This article uses 16 years of daily data, covering 1986 to 2001, for the six input variables. About 70% of the data is used for training and the remaining 30% for testing; in other words, the 11 years of data from 1986 to 1996 are the training samples, and the five years of data from 1997 to 2001 are the test samples.
In this article, LSTM optimized by FA is called FA-LSTM, and LSTM optimized by FA and PCA is called PCA-FA-LSTM.
After the optimization of hyperparameters by the FA, the number of neurons, batch size, and number of epochs of the LSTM network were set to 66, 22, and 372, respectively. The hidden layer uses the hard sigmoid and tanh activation functions, and the fully connected layer uses the sigmoid function. Candidate optimization algorithms for the weight matrices and transformation matrices include RMSProp, SGD, AdaGrad, Adam, and Adadelta [22]. This article selects AdaGrad, which is an improvement of the batch gradient descent method.
The experiment is implemented in Python with Keras and TensorFlow; the version numbers of Keras and TensorFlow are 2.2.4 and 1.8.0, respectively.

5) Performance of FA-LSTM: The predictive variables of FA-LSTM are PRS, RHU, PRE, WIN, TEM, and SSD. For comparative observation, the pan evaporation predicted by FA-LSTM for 200 consecutive days is plotted. Figs. 3-5 compare the five predictions of pan evaporation with the original evaporation for Hengyang Station, Zhuzhou Station, and Mapoling Station. As can be seen from the figures, the results of the five predictions are very close to one another, and the predictions for the three stations are close to the original values. For Hengyang Station, Zhuzhou Station, and Mapoling Station, the RMSE and $R^{2}$ of pan evaporation for the five groups of experiments are shown in Tables I-III. According to indicator $R^{2}$, the fitting results of the three hydrological stations are all very good, with the smallest value reaching 0.948. Sorted by the maximum value of $R^{2}$, they are Hengyang, Mapoling, and Zhuzhou. Moreover, the values of each experiment at each station vary very little, which indicates that the FA-LSTM model is very stable. Similar conclusions can also be drawn from indicator RMSE.

Fig. 3. Time series and scatter plot of FA-LSTM prediction results and original data for Hengyang station.
Fig. 4. Time series and scatter plot of FA-LSTM prediction results and original data for Zhuzhou station.
Fig. 5. Time series and scatter plot of FA-LSTM prediction results and original data for Mapoling station.

TABLE I. RMSE AND $R^{2}$ OF FA-LSTM MODELS AT HENGYANG STATION
TABLE II. RMSE AND $R^{2}$ OF FA-LSTM MODELS AT ZHUZHOU STATION
TABLE III. RMSE AND $R^{2}$ OF FA-LSTM MODELS AT MAPOLING STATION

6) Performance of PCA-FA-LSTM: Principal components are analyzed for precipitation, temperature, relative humidity, wind speed, sunshine day, air pressure, and daily pan evaporation on the hydrological data from the three hydrological stations. The principal component variances and their cumulative proportions are listed in Tables IV-VI. It can be seen from Tables IV-VI that, for each station, the first three principal components account for more than 0.95 of the total variance, so the input variables of the PCA-FA-LSTM model are the corresponding first three principal components. The parameter settings and network structure are the same as for FA-LSTM. For comparative analysis, the evaporation predicted for 200 consecutive days is plotted. Figs. 6-8 compare the five predictions of evaporation with the corresponding original values for Hengyang Station, Zhuzhou Station, and Mapoling Station using PCA-FA-LSTM. It can be seen that the results of the five predictions are very close to one another, and the predictions for the three stations are close to the original values. For Hengyang Station, Zhuzhou Station, and Mapoling Station, the RMSE and $R^{2}$ of the PCA-FA-LSTM model are shown in Tables VII-IX. For the Hengyang hydrological station, the maximum and minimum values of RMSE are 0.427 and 0.424, respectively, and the maximum and minimum values of $R^{2}$ are 0.974 and 0.973, respectively. For the Zhuzhou hydrological station, the maximum and minimum values of RMSE are 0.483 and 0.476, respectively, and the maximum and minimum values of $R^{2}$ are 0.973 and 0.972, respectively. For the Mapoling hydrological station, the maximum and minimum values of RMSE are 0.429 and 0.425, respectively, and the maximum and minimum values of $R^{2}$ are 0.984 and 0.983, respectively. For each station, the values of RMSE and $R^{2}$ are relatively close.

TABLE IV. PRINCIPAL COMPONENT VARIANCES OF PCA AT HENGYANG STATION
TABLE V. PRINCIPAL COMPONENT VARIANCES OF PCA AT ZHUZHOU STATION
TABLE VI. PRINCIPAL COMPONENT VARIANCES OF PCA AT MAPOLING STATION

Fig. 6. Time series and scatter plot of PCA-FA-LSTM prediction results and original data for Hengyang station.
Fig. 7. Time series and scatter plot of PCA-FA-LSTM prediction results and original data for Zhuzhou station.
Fig. 8. Time series and scatter plot of PCA-FA-LSTM prediction results and original data for Mapoling station.

TABLE VII. RMSE AND $R^{2}$ OF PCA-FA-LSTM MODEL AT HENGYANG STATION
TABLE VIII. RMSE AND $R^{2}$ OF PCA-FA-LSTM MODEL AT ZHUZHOU STATION
TABLE IX. RMSE AND $R^{2}$ OF PCA-FA-LSTM MODEL AT MAPOLING STATION

IV. DISCUSSION
The predictions with the minimum RMSE among the results of the FA-LSTM model and the PCA-FA-LSTM model are selected, and their comparison with the original data is shown in Figs. 9-11.