Climate Indices Impact in Monthly Streamflow Series Forecasting

Hydroelectricity has been widely deployed in many countries for decades, and remains the main source of electricity in Canada, Norway, and Brazil. Despite the recent diversification, hydroelectricity accounts for approximately 65% of the electricity generated in Brazil. It also represents the largest non-polluting source in the country, and is important for complying with the global goals of reducing carbon emissions. Meeting energy and environmental sustainability requirements impose challenges on the hydroelectric sector in terms of resilience to climate change observed around the planet. These changes affect electricity generation by altering seasonality and increasing the variability of streamflow and evaporation losses in reservoirs. Therefore, in this study, among a set of 27 climate indices, we identified the most relevant for improving the performance of the models applied to monthly seasonal streamflow series forecasting. A database provided by the NOAA Physical Sciences Laboratory was used as exogenous variables for three machine learning models (support vector regression, extreme learning machine, and kernel ridge regression) and one linear model (seasonal autoregressive integrated moving average with exogenous factors, SARIMAX). Random forest with recursive feature elimination was used as the feature-selection technique. The results obtained allowed for the identification of the most relevant set of indices for the analyzed plants, thereby improving streamflow predictions.


I. INTRODUCTION
Hydroelectric electricity production is associated with hydrological variations that directly impact water storage in plant reservoirs [1], [2]. This affects the operation of the system, long-term planning, and prices for the final consumer. The National Energy Plan 2050 [3] presents, in its Main Challenges section, a specific topic on the vulnerability of hydroelectric generation to climate change, highlighting that the variability of natural streamflows affects electricity generation, primarily hydroelectric generation. A possible reduction in precipitation in some regions may affect the generation of installed parks and the economic viability of future plants. In addition, there is a current environmental sustainability agenda, in which Brazil is committed to the global reduction of carbon emissions and in which hydroelectric power plants (HPP) play a central role. In the Brazilian electricity matrix for 2020, hydraulic sources represented approximately 65% of the total electricity produced [4].
In operational planning, streamflow forecasting aims to provide the expected average weekly and monthly natural flows for short- and medium-term operation. In the current methodology, streamflow forecasting is conducted monthly by the National Electric System Operator (ONS) during preparation of the monthly operational program (PMO). Linear computational models dedicated to each of the three planning stages for HPP operation (medium-term, short-term, and operational) are used. From the beginning of the 2000s, the models considered, for the first time, the observed and predicted precipitation in the basins and rainfall-runoff modeling, presenting performance superior to the linear baseline model [5]. The precipitation information in the linear models was later included in the daily forecasts (PREVIVAZH model) [6]. In 2018, the ONS proposed that all current short-term predictions, within the scope of the National Interconnected System (SIN), be replaced by the conceptual hydrological simulation model SMAP, of the rainfall-runoff transformation type, called SMAP-ONS, aimed at operational standardization, resource savings, and improvement in the quality of forecasts compared to linear models [7].
Brazil is the fifth largest country in the world, with 8.5 million km² distributed across tropical and subtropical areas. Its continental dimensions make the country susceptible to different meteorological patterns acting on multiple spatial and temporal scales that influence weather and climatic conditions. This meteorological scenario allows the country to have an energy matrix that is three times cleaner than the global average [8], with renewable energy representing 83% of total generation [9].
Climate variability is a crucial factor that affects precipitation patterns and streamflow. The relationship between climate indices and streamflows in Brazil has been discussed in the literature, and the El Niño-Southern Oscillation (ENSO) is one of the most important climate phenomena and has been widely studied. Tedeschi and Grimm [10] identified streamflow extreme events in the Uruguay and Paraguay river basins related to ENSO in the positive phase. Marengo [11] observed different behavior when investigating precipitation in the Amazon region between 1929 and 1998: the increased frequency and intensification of the positive ENSO phase reduced rainfall in the Amazon Basin after the 1975/1976 period. Climate indices (CI) related to variations in sea surface temperature (SST) in the Atlantic Ocean are commonly related to variations in precipitation and streamflow in Northeast Brazilian rivers. Alves [12] indicated that the reduction in rainfall in northeast Brazil between 1974 and 2005 was related to the positive phase of the tropical Atlantic dipole. Similar results were obtained by Pezzi and Cavalcanti [13]: the positive phase of the tropical Atlantic dipole, when it occurs together with the positive phase of ENSO, is related to the reduction of precipitation and streamflow in the rivers of Northeast Brazil.
In our previous study [14], we compared the performance of three machine learning models in scenarios that considered the inclusion of climate indicators as inputs for the models. In this study, as an extension of that work, the previously tested models (support vector regression, extreme learning machine, and kernel ridge regression) were complemented by the seasonal autoregressive integrated moving average with exogenous factors (SARIMAX) linear model. We also expanded the number of stations analyzed, including another Brazilian geographic region (the southern region). The proposed method identifies climate indices that can be coupled with predictive models as exogenous variables to improve the performance of streamflow forecasting in reservoirs of a set of HPPs, as evidenced by the reduction in the final mean absolute error. All models were adapted to work with a set of exogenous variables provided by the National Oceanic and Atmospheric Administration [15].
In this sense, the main objective of this work is to verify whether the use of climate indices improves forecasting results for machine learning-based and linear predictors. In addition, feature-selection techniques were implemented.
The remainder of this paper is organized as follows: Section II presents the theoretical concepts used in the study, including the forecasting models and feature selection approach; Section III presents the case study, computational results, and critical analysis; and Section IV presents conclusions and future work.

II. THEORETICAL CONCEPTS
A. RANDOM FOREST WITH RECURSIVE FEATURE ELIMINATION FOR FEATURE SELECTION
Two specific aspects have received special attention from researchers in machine learning because of the increasing amount of data to be processed: the selection of relevant variables to represent the problem and the selection of relevant samples to orient the learning process [16]. In both cases, three generic approaches can be used: the embedded, filter, and wrapper approaches.
In the embedded approach, the selection algorithm is a component of a more complex algorithm. For example, in the Classification and Regression Tree (CART) model [17], an exhaustive search is conducted in the space of decision trees, and at each stage, an evaluation function selects the attribute that best separates the classes. In the filter method, the selection is performed prior to induction. In wrappers, the selection of variables takes place outside the basic induction process, but uses the induction method as a subroutine rather than as a preprocessing step [18].
Random Forest (RF) [17] is a technique that processes several decision trees (DT) [19] created from a random sampling process (bootstrap [20]) over a set of samples. For example, in classification problems (categorical dependent variables), random samples are used to build the trees. Each tree provides a classification (vote), and the RF algorithm produces the final classification from the most-voted class. The procedure is similar for regression problems (continuous dependent variables), where the RF response is obtained by averaging the responses of the involved DTs [21]. Additionally, the random forest has an inherent procedure for assessing the importance of the predictor variables [22].
Darst [23] investigated the superiority of RF results with recursive feature elimination (RFE) over the single use of RF for the selection of variables. The RFE algorithm, initially with all features in iterative processing, removes variables classified as less important, thereby refining RF results. Fig. 1 shows the schematic of the entire process.
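The RF-plus-RFE procedure described above can be sketched with scikit-learn's RFE wrapper around a random forest. This is a minimal illustration on synthetic data; the matrix shape, the 4-feature target, and the informative columns are arbitrary choices for the example, not the study's configuration:

```python
# Minimal sketch of feature selection with Random Forest + Recursive
# Feature Elimination (RFE iteratively drops the least important features).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

rng = np.random.default_rng(42)
# Toy stand-in for the climate-index matrix: 200 months x 8 indices,
# where only the first two indices actually drive the target streamflow.
X = rng.normal(size=(200, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

# 100 trees, matching the configuration described later in the text.
rf = RandomForestRegressor(n_estimators=100, random_state=0)
selector = RFE(rf, n_features_to_select=4, step=1).fit(X, y)

selected = np.flatnonzero(selector.support_)
print("selected feature columns:", selected)
```

The `support_` mask marks the surviving columns; in the study's pipeline these would be the climate indices kept as exogenous inputs.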
Other feature selection algorithms were also investigated in this work, such as the Boruta algorithm [24] and selection by correlation analysis between climate indices and streamflow. None of them improved the mean absolute errors of the forecasts in the tested scenarios.

B. SUPPORT VECTOR REGRESSION (SVR)
In Support Vector Regression (SVR) methods, the objective is to find a function f(x) that deviates at most ε from the observed values y_i of the training data and, at the same time, is as flat as possible [25]. In other words, errors smaller than ε do not matter (Fig. 2).
The flatter f(x) is in the induced space F, the smaller the slope w in (1), which is achieved by minimizing ||w||² [26]:

f(x) = ⟨w, x⟩ + b,   (1)

where w ∈ F and b ∈ R is the bias. The convex problem to be minimized is given by (2):

minimize   (1/2)||w||² + C Σ_i (ξ_i + ξ_i*)
subject to y_i − ⟨w, x_i⟩ − b ≤ ε + ξ_i,
           ⟨w, x_i⟩ + b − y_i ≤ ε + ξ_i*,
           ξ_i, ξ_i* ≥ 0.   (2)

Kernel-based learning is used for nonlinear cases. The support vector algorithm depends only on scalar products of the input variables x_i; therefore, it is sufficient to know the kernel function k(x_i, x_j) instead of the transformation explicitly [25]. The dual optimization problem is given by (3), in which α and α* are the Lagrange multipliers of the quadratic optimization problem of (2), and the constant C controls the tradeoff between the complexity of the machine and the number of non-separable points [26]:

maximize   −(1/2) Σ_{i,j} (α_i − α_i*)(α_j − α_j*) k(x_i, x_j) − ε Σ_i (α_i + α_i*) + Σ_i y_i (α_i − α_i*)
subject to Σ_i (α_i − α_i*) = 0,   0 ≤ α_i, α_i* ≤ C.   (3)
Note that in the nonlinear case, the optimization problem consists of finding a function that is nearly linear in the space induced by the kernel (feature space) and not in the original space [25].
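The ε-insensitive fit can be illustrated with scikit-learn's SVR, a standard implementation of the formulation above. The data and parameter values below are illustrative only:

```python
# Illustrative epsilon-insensitive SVR fit with an RBF kernel: residuals
# inside the epsilon tube do not contribute to the loss.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 120).reshape(-1, 1)
y = np.sin(X).ravel() + 0.05 * rng.normal(size=120)

# C trades off model flatness against violations of the epsilon tube.
model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
pred = model.predict(X)
mae = np.mean(np.abs(pred - y))
print(f"training MAE: {mae:.3f}")
```

With ε = 0.1, most training residuals stay within the tube, so the fitted curve tracks the underlying sine rather than the noise.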

C. EXTREME LEARNING MACHINE (ELM)
A classical multilayer perceptron (MLP) neural network is composed of artificial neurons, with an architecture divided into an input layer, one or more intermediate (hidden) layers, and an output layer. First, the input signals are transmitted to the intermediate layers without any modification. The hidden layers are responsible for mapping the input signal nonlinearly into another space. The output layer receives this transformed signal and produces the network response using linear combinations. The traditional method for adjusting the free parameters of an MLP is the backpropagation algorithm [27]. An extension of this concept is the extreme learning machine (ELM), an artificial neural network with only one hidden layer. It differs in the training process, because the weights of the neurons in the intermediate layer are determined randomly and independently. Therefore, the adjustment process does not adapt the weights of this layer, but only those of the output layer. The optimal values of these weights are typically calculated analytically. Thus, there is no need to calculate derivatives, backpropagate error signals, or run iterative algorithms, which significantly reduces the computational cost of the training process [25].
ELM presents a universal approximation capability because the approximation error produced by the network can always be reduced by adding a neuron to the intermediate layer via rigorous determination of the weights of the output layer [28], [29].
ELM training involves determining the W_out matrix with the smallest error for the vector d of the expected outputs [27]. The ELM training process is summarized by the Moore-Penrose pseudo-inverse operator (†) presented in (4):

W_out = X_hid† d,   (4)

where X_hid is the matrix with the outputs of the hidden layer. Unlike other algorithms, ELM finds not only the smallest training error but also the smallest norm of the output weights, resulting in better generalization performance [30].
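The whole training procedure fits in a few lines of NumPy. The sketch below is a generic ELM on a toy target; the hidden-layer size, activation, and data are illustrative assumptions, not the study's settings:

```python
# Minimal ELM sketch: random fixed hidden weights, output weights solved
# analytically with the Moore-Penrose pseudo-inverse, as in Eq. (4).
import numpy as np

rng = np.random.default_rng(1)
X = np.linspace(-1, 1, 200).reshape(-1, 1)
d = X.ravel() ** 2  # toy target: a simple nonlinear function

n_hidden = 50
W_in = rng.normal(size=(1, n_hidden))  # random input weights (never trained)
b = rng.normal(size=n_hidden)          # random biases (never trained)

X_hid = np.tanh(X @ W_in + b)          # hidden-layer output matrix
W_out = np.linalg.pinv(X_hid) @ d      # analytic solution of Eq. (4)

pred = X_hid @ W_out
print("max abs error:", np.max(np.abs(pred - d)))
```

No iteration or backpropagation occurs: the only "training" is the single pseudo-inverse solve for the output weights.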

D. KERNEL RIDGE REGRESSION (KRR)
This technique is based on the linear ridge regression algorithm applied in high-dimensional spaces (feature spaces). A linear regression function constructed in the feature space is equivalent to a nonlinear regression in the original problem space [30]. A kernel function is used to build this regression function in the induced space, thereby avoiding the computational overhead caused by a large increase in the number of parameters.
In the linear case, the ridge regression loss function (L_RR) for T samples is given by (5), where the x_t are vectors in R^n and y_t ∈ R. The constant λ is determined by cross-validation and represents a penalty imposed on the slope of the regression line:

L_RR = λ||w||² + Σ_{t=1}^{T} (y_t − ⟨w, x_t⟩)².   (5)

It can be generalized to the nonlinear case from (5), as explained in [30], to obtain a linear regression in the feature space, according to (6), where K is the kernel matrix with entries K_ij = k(x_i, x_j), k(x) is the vector with entries k(x_t, x), and y is the vector of targets:

f(x) = yᵀ (K + λI)⁻¹ k(x).   (6)
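The dual solution can be computed explicitly and checked against scikit-learn's KernelRidge, which implements the same closed form. Data and hyperparameters below are illustrative:

```python
# KRR sketch: solve the dual system (K + lambda*I) alpha = y directly,
# then verify the fitted values against sklearn's KernelRidge.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(80, 1))
y = np.sin(3 * X).ravel()

lam, gamma = 0.1, 1.0
K = rbf_kernel(X, X, gamma=gamma)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)  # dual coefficients
pred_manual = K @ alpha                               # f(x) on training points

pred_skl = KernelRidge(alpha=lam, kernel="rbf", gamma=gamma).fit(X, y).predict(X)
print("max difference:", np.max(np.abs(pred_manual - pred_skl)))
```

Only one T×T linear system is solved, regardless of the (implicit) dimensionality of the feature space, which is the computational point made above.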
E. SEASONAL AUTOREGRESSIVE INTEGRATED MOVING AVERAGE WITH EXOGENOUS FACTORS (SARIMAX)
The Seasonal Autoregressive Integrated Moving Average with exogenous factors (SARIMAX) model represents the evolution of other linear models, as detailed in [31].
Considering a time series {X_t}, the lag operator L is defined by L^n X_t = X_{t−n}, where n is the backward shift. A polynomial in L of order p is given by ϕ(L) = 1 − ϕ_1 L − ϕ_2 L² − ... − ϕ_p L^p. The autoregressive model AR(p) can be written according to (7), where ϵ_t is the noise at time t:

ϕ(L) X_t = ϵ_t.   (7)
The moving average model MA(q) regresses on the previous error values according to (8), where q is the order of the model and θ(L) = 1 + θ_1 L + ... + θ_q L^q:

X_t = θ(L) ϵ_t.   (8)
The ARMA(p,q) model combines AR(p) and MA(q) according to (9):

ϕ(L) X_t = θ(L) ϵ_t.   (9)
AR, MA, and ARMA can be used for stationary series. To remove the nonstationarity present in most real problems, a d-order differencing operator ∇ = 1 − L is used, so that ∇X_t = X_t − X_{t−1}. If the original series is differenced d times before the adjustment of the ARMA(p,q) process, the resulting model is called the autoregressive integrated moving average ARIMA(p,d,q), as presented in (10):

ϕ(L) (1 − L)^d X_t = θ(L) ϵ_t.   (10)
A seasonal series with s time periods in the year can be modelled by generalizing (10). Let the seasonal lag operator be defined by L^s X_t = X_{t−s}. The SARIMA model with non-seasonal order terms (p,d,q) and seasonal order terms (P,D,Q), abbreviated as SARIMA(p,d,q)(P,D,Q)_s, can be written as in (11):

Φ(L^s) ϕ(L) (1 − L^s)^D (1 − L)^d X_t = Θ(L^s) θ(L) ϵ_t.   (11)
The SARIMAX model considers exogenous variables Z_t that influence the values of the series but do not undergo autoregression. These are added to the right-hand side of (11), resulting in (12):

Φ(L^s) ϕ(L) (1 − L^s)^D (1 − L)^d X_t = Θ(L^s) θ(L) ϵ_t + βᵀ Z_t.   (12)

III. MATERIALS AND METHODS
Teleconnections are patterns of climate variability linking noncontiguous geographic regions [32]. The term also refers to a recurrent and persistent pattern of anomalies of a given variable, which can act for several weeks, months, or even years [33], and can be extracted through correlation analyses or the calculation of principal components, among other techniques [34].
Streamflow forecasting using machine-learning methods and climate indices has been discussed by several authors. Rasouli et al. [35] studied streamflow forecasting in a watershed in British Columbia, Canada, using three machine learning methods (Bayesian neural network (BNN), support vector regression (SVR), and Gaussian process (GP)) with weather and climate data as inputs. The results showed that combining local observations with climate indices performed best at longer lead times. Norel et al. [36] studied the existence of spatial links between climate variability indices (ENSO, the El Niño-Southern Oscillation; PDO, the Pacific Decadal Oscillation; AMO, the Atlantic Multidecadal Oscillation; NAO, the North Atlantic Oscillation) and time series of streamflow in 18 areas on six continents for the period 1901-2010. The results demonstrated that a machine learning method (convolutional neural network) trained with climate variability indices could model the streamflow better than a long-term monthly mean baseline. The results also suggest that ENSO is the primary determinant among the considered indices [37].
In this sense, the data resources used in this work consisted of the HPP list (Table 1), the historical series of streamflow obtained in [37], available for the period 1931-2020, and the historical series of climate indices available in [15], totaling twenty-seven indices (Table 2) over distinct periods.
The selection of the HPPs considered their locations (Fig. 3) in the hydrographic basins, so as to identify the representative indices for each region.
Python libraries [38] were used as software tools, enabling the implementation of the predictor models and of the feature-selection module based on random forest with recursive feature elimination [39]. The machine learning techniques adopted were the extreme learning machine (ELM), support vector regression (SVR), and kernel ridge regression (KRR); the linear model was SARIMAX. The internal parameters of the models, optimized via grid search, are listed in Table 3. Table 4 presents the parameters of the feature-selection model based on random forest (RF); grid search was again used to select the best value from a list.
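The tuning step can be sketched with scikit-learn's GridSearchCV. The parameter grid below is an illustrative placeholder, not the actual ranges from Table 3, and a time-series split is used so that tuning never trains on future samples:

```python
# Hedged sketch of the grid-search tuning step for one of the predictors
# (SVR shown here); grids and data are illustrative only.
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.svm import SVR

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 4))
y = X @ np.array([1.0, -0.5, 0.0, 2.0]) + 0.1 * rng.normal(size=150)

param_grid = {"C": [1.0, 10.0, 100.0], "epsilon": [0.01, 0.1]}
# TimeSeriesSplit keeps validation folds strictly after training folds.
search = GridSearchCV(SVR(kernel="rbf"), param_grid,
                      cv=TimeSeriesSplit(n_splits=3))
search.fit(X, y)
print("best parameters:", search.best_params_)
```

The same pattern applies to the other models: each estimator gets its own grid, and `best_params_` feeds the final training run.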

A. PRE-PROCESSING
Initially, it was necessary to observe the number of samples in each time series, because the climate series were available for shorter periods than the streamflow series. Therefore, 1979-2014 was selected as the study period. In the next step, the data were prepared for processing, including the treatment of outliers [40], data normalization, and data standardization. In this study, lags were not applied to the historical series.
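A minimal version of these preparation steps is sketched below, assuming a simple percentile-based clipping as the outlier treatment (the paper does not specify its exact scheme, so this choice is an assumption) followed by standardization:

```python
# Pre-processing sketch: clip extreme values to percentile bounds
# (assumed outlier treatment), then standardize to zero mean, unit std.
import numpy as np

def preprocess(series, lo=1, hi=99):
    low, high = np.percentile(series, [lo, hi])
    clipped = np.clip(series, low, high)               # outlier treatment
    return (clipped - clipped.mean()) / clipped.std()  # standardization

x = np.array([10.0, 12.0, 11.0, 300.0, 9.0, 13.0])  # 300.0 is an outlier
z = preprocess(x)
print(z.round(2))
```

After this step, every series (streamflow and climate indices) is on a comparable scale for the predictors.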
To verify the influence of climate variables on streamflow forecasts, they were treated as exogenous inputs together with the streamflow series. For this, a matrix was created containing the period of the training or test stage in the first column and all the available climate-index series in the other columns. Additionally, the time-corresponding flow vector represents the target variable. Fig. 4 illustrates the initial architecture of the input data. In a specific module, prior to the forecasts, the most statistically relevant climate indices were selected to determine the dataset that produced the lowest mean absolute errors. Selection was performed using a random forest with recursive feature elimination [23], [39], [41], configured to process 100 decision trees, whose output is the relevance classification of each index. Based on this result, the columns (features) corresponding to the climate indices with the worst classifications were removed from the input data matrix. The elements involved in each configuration are:
• Ci = setting with (C1) or without (C2) the use of climate indices;
• Model = one of SVR, ELM, KRR, or SARIMAX;
• Station = codes of the 14 stations (Table 1);
• Date = date in the training and testing period;
• CIi = most relevant climate indices, obtained in the training step;
• Streamflow = streamflow values from the history of each station.
The training stage consisted of selecting, for all monthly series involved (streamflow and climate indices), data from the period between 1940 and the year before the forecast period (2004). The predictions were made considering one step ahead as the forecasting horizon, corresponding to the next month's value. Table 5 presents the number of samples involved in training and testing for the two scenarios.
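The input layout and one-step-ahead split described above can be sketched with pandas. Column names (`AMO`, `MEI_v2`) and the data are hypothetical stand-ins for the selected climate indices:

```python
# Sketch of the input architecture: date column, exogenous climate-index
# columns, and streamflow as target, split for a one-step-ahead forecast
# (train through month t-1, predict month t). All values are synthetic.
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
dates = pd.date_range("1979-01-01", periods=120, freq="MS")
df = pd.DataFrame({
    "date": dates,
    "AMO": rng.normal(size=120),                   # exogenous index (toy)
    "MEI_v2": rng.normal(size=120),                # exogenous index (toy)
    "streamflow": rng.gamma(2.0, 50.0, size=120),  # target variable (toy)
})

features = ["AMO", "MEI_v2"]
X_train, y_train = df[features][:-1], df["streamflow"][:-1]
X_next = df[features].iloc[[-1]]  # exogenous values for the forecast month
print(X_train.shape, y_train.shape, X_next.shape)
```

Each predictor is then fitted on `(X_train, y_train)` and queried with `X_next` to produce the next month's value.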
For the four predictors addressed (SVR, ELM, KRR, and SARIMAX), the free parameters were tuned following the guidelines presented in Section II. A grid search was then performed to find the best set of parameters, and the models were run one final time, except for the ELM, which was trained 30 independent times; the set of ELM weights that led to the smallest error was selected.
Finally, graphs and performance reports were prepared by comparing the forecasts with (C1) and without (C2) climate indices, including the average forecast errors obtained from the SVR, KRR, and ELM models under the C1 and C2 settings; these are presented in the next section.

IV. RESULTS
Only the values of the three machine learning models were considered because of the low general performance of the linear model, as shown in Tables 13-16. Columns MAE_C1 and MAE_C2 refer to the mean absolute errors (m³/s) with and without the climate indices (CI), respectively. The Dif% column shows the percentage difference between MAE_C1 and MAE_C2.
The graphs in Fig. 5 show the normalized series of forecast errors (MAE) for the period, indicating differences in the use of the climate indices at different stations. The results cover stations in different Brazilian regions. Station 175 (NE) is in the Northeast region, an area with low precipitation/flow and a reduction over the last 30 years [42]. A similar scenario applies to station 270 (CO), in the Midwest, a semi-arid region of Brazil. Station 275 (N) is in northern Brazil, an equatorial area characterized by high rainfall. In these regions, the climate indices considerably increased the performance of the forecasts, as indicated in Table 6. This result is possibly associated with the high correlation between the climate indices and the flows of these stations. At the other stations, 217 (S) and 1 (SE), in the south and southeast of Brazil, the forecasts using climate indices also performed better, although with smaller gains. These results may be associated with meteorological uncertainties in regions under or close to the subtropical climate, characterized by extreme weather events.
The graphs in Fig. 6 represent the hydrographs obtained with forecasts from the SVR model, highlighting the differences between the use or absence of CI in relation to the observed data.
The graphs in Fig. 7 represent the hydrographs obtained with forecasts from the SARIMAX model, highlighting the differences between the use or absence of CI in relation to the observed data. For both models, except at station 270 (CO), the hydrographs were closer to the observed data in the forecasts that used climate indices. The MAE_C1 column shows the errors of the forecasts that used climate indices. The MAE_C2 column shows errors without the use of the climate indices. The Dif% column displays the percentage difference between the previous columns. Table 12 presents the climate indices found in the variable selection procedure, all of which had the same degree of relevance, as indicated by the algorithm. Columns CI1-CI4 represent the climate indices indicated by the models according to their importance among the twenty-seven indices tested (Table 2).
Although the performance analysis is centered on the mean absolute error (MAE), other metrics were used for comparison purposes, such as the root mean square error (RMSE), the Nash-Sutcliffe efficiency (NSE) coefficient, and the Willmott index of agreement (WILL). Consistency with the previous results (MAE) was observed when comparing the use or not of climate indices in the forecasts for each region. Specifically, the station in the southern region, which presented the highest errors in the first analysis (MAE), also revealed the worst fit index (Table 16).
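These comparison metrics follow directly from their textbook definitions, implemented below on a tiny illustrative observed/simulated pair:

```python
# The four comparison metrics, implemented from their definitions:
# MAE, RMSE, Nash-Sutcliffe efficiency, and the Willmott index of agreement.
import numpy as np

def mae(obs, sim):
    return np.mean(np.abs(obs - sim))

def rmse(obs, sim):
    return np.sqrt(np.mean((obs - sim) ** 2))

def nse(obs, sim):
    # 1 minus the ratio of residual variance to observed variance; 1 is perfect.
    return 1 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def willmott(obs, sim):
    # Index of agreement d, bounded in [0, 1]; 1 is perfect agreement.
    num = np.sum((obs - sim) ** 2)
    den = np.sum((np.abs(sim - obs.mean()) + np.abs(obs - obs.mean())) ** 2)
    return 1 - num / den

obs = np.array([100.0, 120.0, 90.0, 150.0])  # toy observed flows (m3/s)
sim = np.array([110.0, 115.0, 95.0, 140.0])  # toy forecast flows (m3/s)
print(mae(obs, sim), rmse(obs, sim), nse(obs, sim), willmott(obs, sim))
```

MAE and RMSE are in the flow units (m³/s), while NSE and WILL are dimensionless fit indices, which is why the latter pair is used for cross-station comparison.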

V. DISCUSSION
From the results presented, the problem can be analyzed from two aspects: whether there are gains from including climate indices in streamflow forecasting, and whether the indices identified for each region are consistent with the literature. Tables 7-11 were constructed from 420 predictions using machine learning models and 140 predictions using the SARIMAX model. In Tables 17 and 18, the quantities are detailed by region, where column A represents the number of forecasts with some gain, and column B contains the number of gains greater than 10%. For machine learning (Table 17), 69.5% of the forecasts had their errors reduced by using climate indices (column A), and in 48% of them the gain was greater than 10% (column B). Similarly, for the SARIMAX model (Table 18), 45% of the tests showed gains with the use of the indices (column A). Based on the data from this study, the use of climate indices in streamflow forecasts improved the results, which justifies the continuity of this research.
The results shown in Table 12 are consistent with those reported in the literature regarding the influence of climate indices in different regions of Brazil. The climate indices most closely related to the stations in the southern region (S: 77, 217) were associated with the ocean-atmospheric phenomenon El Niño-Southern Oscillation (ENSO): the ENSO precipitation index (ENSO_PI) and the Multivariate ENSO Index (MEI_v2). The ENSO_PI index considers precipitation over the equatorial Pacific Ocean to characterize ENSO phases [43]. In turn, the MEI_v2 index uses five variables (sea level pressure (SLP), sea surface temperature (SST), surface zonal winds (U), surface meridional winds (V), and outgoing longwave radiation (OLR)) to produce a time series that characterizes ENSO [44]. ENSO in the positive (negative) phase alters atmospheric circulation, favoring (disfavoring) precipitation and, consequently, streamflows in the watersheds of southern Brazil [45], [46], in line with the results shown in Table 7.
The impacts of ENSO in North and Northeast Brazil tend to be the opposite of those in the south. A positive ENSO phase reduces precipitation in northern and northeastern Brazil, whereas a negative phase favors precipitation and streamflow. The results also showed that the ENSO_PI and MEI_v2 indices were correlated with stations in the northeast (NE) and north (N) of Brazil in CI2 and CI3.
The climate index that presented the best relationship (CI1) with the stations in the northeast, north, southeast (SE), and midwest (CO) regions was the Atlantic Multidecadal Oscillation (AMO). The AMO long version (data from 1948) is a fluctuation pattern of the North Atlantic sea surface temperature (SST) that varies on a multidecadal temporal scale [47]. The AMO climate index is characterized by a dipole with positive and negative phases. In the negative phase, the Intertropical Convergence Zone (ITCZ) is positioned at southern latitudes, increasing precipitation in the north/northeast region of Brazil. Conversely, the positive phase of the dipole, which occurs together with the positive phase of ENSO, causes a reduction in precipitation in northeast Brazil [13].
Knight et al. [48] indicated that decadal variations in the northeast wet season precipitation are strongly modulated by variations in the SST gradient between the northern and southern tropical Atlantic regions. The AMO index also showed a greater relationship with the SE and CO stations. However, a physical explanation or a high correlation between this climate index and precipitation/streamflow in these regions is not common in the literature. Further studies are needed to confirm this hypothesis.

VI. CONCLUSION
This study investigated the influence of 27 climate indices on streamflow forecasting for a set of hydroelectric plants in the main Brazilian hydrographic basins. Three machine learning models (SVR, KRR, and ELM) and one linear model (SARIMAX) were used, and the most relevant climate indices were selected using a random forest with recursive feature elimination. In addition, the climate indices were ranked according to their importance in reducing the mean absolute error of the forecasts.
The results showed gains when climate indices were included in the input data as exogenous variables. The case studies presented, which relate to distinct river basins, were better solved when the CI were considered, especially for plants located close to the equator, in the northeast and north of Brazil. These findings are important because the planning of operations and pricing strategies is highly dependent on accurate streamflow prediction.
As future work, we recommend testing historical series of climate indices with time lags and using other variable selection methods, such as filters. Regarding the forecasting models, approaches for dealing with multistep horizons should be investigated [49]. In addition, several models have recently stood out, among linear [50], nonlinear [51], and combined proposals [52], [53], which can be explored from the perspectives of this work. Tables 19-21 list the full range of tests for all stations under the root mean square error (RMSE), Nash-Sutcliffe efficiency (NSE), and Willmott index of agreement (WILL) metrics. Note that the results individualized by model were consolidated for the purpose of comparing the metrics.

DECLARATIONS Funding
The authors thank the Federal University of ABC and Câmara de Comercialização de Energia Elétrica for the funding of this study.