Streamflow Prediction in the Mekong River Basin Using Deep Neural Networks

In recent years, the Mekong River Basin (MRB), one of the largest river basins in Southeast Asia, has experienced severe impacts from extreme droughts and floods. Streamflow forecasting has become crucial for effective risk management strategies in the region. However, this task presents significant challenges due to rapid climate change and the presence of numerous newly constructed upstream dams, which disrupt the natural flow. In this paper, we develop multiple deep learning models (incl. Multi-Layer Perceptron (MLP), Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and Transformer) to predict streamflow at different forecast lead times based on observed meteorological variables and climatic indices (i.e., discharge, water level, precipitation, and temperature) from 1979 to 2019. The results indicate that LSTM obtains high performance for streamflow prediction in both dry and wet seasons, while Transformer is not recommended for long-term prediction, especially in the dry season. The proposed deep learning models capture well the fluctuation of river flow in the MRB during the period of high-dam development, especially LSTM (NSE ≥ 0.8). The models' performance is enhanced by adding temperature for short-term prediction, while precipitation is the most sensitive variable for long-term prediction. The proposed models are essential for government agencies planning mitigation and adaptation strategies over periods ranging from days to years.


I. INTRODUCTION
The Mekong River Basin (MRB) is one of the most important transnational river basins in the world, and the MRB's streamflow plays a key role in many fundamental ecological system processes [1]. Hence, streamflow prediction is significantly important for water resource planning and management [2], [3], [4]. However, streamflow phenomena and characteristics exhibit a high degree of complexity and nonlinearity due to the complex responses of soil characteristics, land cover dynamics, and precipitation patterns [5]. Currently, there are two main approaches used to simulate the flow: conceptual models and time-series (black-box) models [6]. Conceptual models are widely used to simulate river flow and are based on the concepts of the hydrological cycle. These hydrological models require various input data (e.g., topography, soil maps, land use land cover maps, and hydroclimate data), which are not always available over long time periods or can be difficult to obtain at all sites. Moreover, the implementation and calibration of such models typically present various difficulties and require some degree of expertise and experience with the models [7]. Previous studies in the MRB used a wide range of hydrological models, such as the Soil and Water Assessment Tool (SWAT), Variable Infiltration Capacity (VIC), VMod, or Integrated Catchment model (INCA) [8], [9], [10], [11], [12], [13]. These models are not only used to simulate historical river flows but also to predict streamflow up to the 2060s based on climate projections (i.e., CMIP3 and CMIP5). However, they are highly uncertain and depend heavily on experts' knowledge to define model inputs [9]. In contrast, time-series black-box models are data-driven methods based on ideas from statistical analysis of both linear and nonlinear relationships between
the input and output data [14]. These methods do not require understanding the internal structure of the physical hydrological process but can still provide effective streamflow forecasting [6], [15]. Hence, these black-box models have been applied to streamflow prediction since the early 1970s, e.g., [4], [6], [16], [17], [18], and [19].
Due to their great successes in many other applications [17], [20], [21], [22], most recent studies focus on using state-of-the-art black-box deep learning techniques for streamflow prediction. For example, Xu et al. [3] used LSTM for 10-day average and daily flow predictions in the Hun and Yangtze river basins (China), showing the dominance of LSTM over several hydrological models. Le et al. [23] simulated the river flow in the Red River, Vietnam, using LSTM to forecast the flow rate one, two, and three days ahead, and to forecast floods. Le et al. [24] compared LSTM with the feed-forward neural network (FFNN) and convolutional neural network (CNN), and confirmed that LSTM outperformed them for streamflow forecasting. An LSTM model was also used in [25] for daily flow prediction, showing higher accuracy compared to the SWAT model. Besides LSTM, Pham et al. [26] developed a Convolutional Neural Network (CNN) for daily rainfall-runoff prediction in the Vietnamese Mekong delta with slightly better performance compared to LSTM [26]. In [27], Multi-Layer Perceptron networks (MLP) were shown to perform better than several hydrological models for short-term streamflow forecasting (two weeks ahead) in the Paraná Basin in Brazil. A rainfall-runoff model based on Transformer, proposed by [28] to predict 7-day-ahead runoff on 673 basins in the United States, showed better performance compared to an LSTM-based model.

A. RESEARCH GAPS
Short-term prediction is crucial for flood early warning systems, especially for flash floods, while long-term prediction is significant for planning and water resources management, such as hydro-power operations, sediment transport, and irrigation management [4]. Most existing studies have focused on daily streamflow forecasting, and there is a lack of studies predicting streamflow at longer lead times (e.g., months, years) using multiple deep learning models, especially in a transnational river basin. By considering a diverse range of models, we can avoid relying solely on one model for our findings and gain a more comprehensive understanding of their performance. Most studies utilize rainfall, water level, and discharge as inputs to predict streamflow, e.g., [3], [6], [25], [26], [27], [29], and [30]. However, these studies did not examine input sensitivity to check whether the models' performance is improved by adding all these inputs. Simulating streamflow under the influence of extreme events and dam development is a highly complex task. Thus, further investigations are needed for both short- and long-term prediction using various deep learning models to support development planning and strategic decision making in the MRB.

B. CONTRIBUTIONS
The present study is motivated by the increasing number of hydropower dams in the MRB and by extreme events becoming more frequent and intense in recent years [10], [31], resulting in detrimental impacts on local communities and natural habitats. The scientific novelty of this study is to predict streamflow at different lead times in a large river basin, focusing on three main objectives: • Predicting streamflow using multiple deep learning models (incl. Multi-Layer Perceptron (MLP), Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and Transformer), and indicating which one is the most suitable for predicting streamflow in the MRB.
• Examining the models' performance in the case of extreme events (drought and flood) and during the period of high-dam impacts to verify the accuracy of the proposed models.
• Indicating the most sensitive input, which influences the performance and accuracy of the deep learning models in the MRB.
The results from this study provide valuable insights for water resources management in the MRB. Specifically, they help to improve the accuracy of drought and flood predictions at different lead times, leading to better water resources management and more effective mitigation and adaptation strategies for extreme events. The rest of the paper is organized as follows. Section II describes the study area and data. Section III presents the methodology. Section IV presents results and discussion. Conclusions are drawn in Section V.

II. STUDY AREA AND DATA
In this section, we describe the location of the study area and details of the data used in this study.
The MRB is located in mainland Southeast Asia and is ranked the 10th largest river in the world in terms of mean water discharge [32]. It covers an area of 795,000 km² and is 4,800 km in length [33]. There are two main sub-basins, namely the Upper MRB or Lancang River (i.e., the name of the Mekong in China), and the Lower MRB (covering Myanmar, Laos, Thailand, Cambodia, and Vietnam). The Mekong river originates from the Tibetan Plateau in China, where it is dominated by high mountains and deep valleys, then flows through the less mountainous regions of Thailand and Laos before entering the floodplains of Cambodia and Vietnam. The MRB's climate is strongly regulated by Southeast Asian monsoons, creating a seasonal climate with distinct dry and wet periods [33]. The average annual rainfall in the MRB ranges between 400 and 2,000 mm/yr, 70% of which is concentrated in the wet season (May to November) [33], [34].
In this study, we use meteorological variables and climatic indices for streamflow prediction, comprising ground-based data and interpolated grids. Gridded data are commonly used to tackle the problems of manual errors and the lack of long time-series data [35]. Daily precipitation and temperature are extracted from [36] between 1979 and 2019, with a spatial resolution of 0.5° × 0.5°. These data have been applied in MRB studies in [37] and [35]. All gridded data sets are re-gridded to match the MRB's meteorological station locations using the bilinear interpolation method, which has found successful application in [9], [38], and [39]. Daily streamflow and water level data at seven hydrological stations (i.e., Chiang Saen (CS), Luang Prabang (LP), Nong Khai (NK), Nakhon Phanom (NP), Mukdahan (Muk), Pakse (Pak), and Kratie (Kra)) were collected from a data portal of the Mekong River Commission (MRC) for the period 1979-2019. All these stations are located on the main stream of the MRB, as shown in Figure 1, providing long and reliable daily time series.

III. METHODOLOGY
In this section, we present the detailed framework of the study, the data pre-processing, and the architectures of the deep neural networks.

A. FRAMEWORK OF THE STUDY
Our framework is presented in Figure 2. The inputs for predicting streamflow include meteorological variables and climatic indices (i.e., discharge, water level, precipitation, and temperature). Precipitation and temperature are crucial factors that influence streamflow [40]. Precipitation either infiltrates into the ground or flows directly into streams, contributing to increased streamflow. Temperature influences evapotranspiration processes. Daily streamflow provides important information for understanding historical flow patterns, and water level reflects the height of the water surface relative to a reference level. By incorporating these meteorological variables and climatic indices, the deep learning models learn the patterns and correlations between these variables, enabling them to make discharge predictions at different lead times. We examine four deep learning models for streamflow prediction: Multi-Layer Perceptron (MLP), Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and Transformer. These models are selected because they represent state-of-the-art techniques in deep learning and have been widely adopted in previous studies, especially in the field of water resources [17], [23], [24]. All models are run in two phases (i.e., training and testing). First, each model is trained on the training data in the training phase. Then, the trained model is frozen to predict the streamflow values in the testing phase. In this study, 80% of the data is used for training (1979 to 2011) and 20% for testing (2012 to 2019). The training dataset is selected to represent flood years (e.g., 2000 and 2011) and drought years (e.g., 1992, 1997, 2004, and 2010). Similarly, the testing dataset also covers extreme events, for example the flood year 2018 and drought years (e.g., 2016 and 2019), which were recorded as severe droughts, resulting in agricultural reduction and economic losses [31], [41]. The period of dam impacts (1992-2019) [31], [42] is also included in both the training and testing phases. These choices ensure comprehensive model learning and evaluation covering both extreme weather conditions and man-made vulnerability. The input features of all models include daily hydrometeorological data (incl. discharge, water level, precipitation, and temperature).
At an arbitrary time t, our model predicts the future streamflow at time t + k. Depending on the value of k, we categorize the streamflow prediction into two types, namely short-term and long-term prediction. In this paper, we evaluate the performance of different models using k = 5, 10, 15 days for short-term prediction and k = 90, 180, 365 days for long-term prediction. This clearly shows the prediction ability of the different models over both short and long terms. For short-term prediction, the models' inputs are the inputs of 15 consecutive days in the past (lb = 15), and the models are trained separately to predict three different lead times of 5, 10, and 15 days (k = 5, 10, 15, respectively). For long-term prediction, the inputs of 365 consecutive days in the past (lb = 365) are adopted, and the models predict streamflow at lead times of 3 months, 6 months, and 12 months (k = 90, 180, 365, respectively).
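The windowing described above (a lookback of lb past days predicting the streamflow k days ahead) can be sketched as a simple pre-processing step. The following NumPy snippet is an illustrative reconstruction, not the authors' code; the function name make_windows is ours:

```python
import numpy as np

def make_windows(features, target, lb, k):
    """Build (input, label) pairs for lead-time forecasting.

    features: array of shape (T, n_features) -- daily inputs
              (discharge, water level, precipitation, temperature)
    target:   array of shape (T,)            -- daily streamflow
    lb:       lookback window length (e.g. 15 or 365 days)
    k:        lead time in days (e.g. 5 or 180)
    """
    X, y = [], []
    for t in range(lb, len(target) - k + 1):
        X.append(features[t - lb:t])   # lb consecutive past days
        y.append(target[t + k - 1])    # streamflow k days ahead
    return np.array(X), np.array(y)
```

Each model is then trained separately on the pairs produced for its own lead time k.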

B. DATA PRE-PROCESSING
1) NORMALIZATION
The hydrometeorological variables are measured in different units; for example, daily temperature and precipitation values range over [10, 40] degrees Celsius and [0, 150] mm/day, respectively, whereas the daily streamflow values are between 800 m³/s and 50,000 m³/s at the most downstream station (Kratie). To ensure all input features are treated equally in the learning process, they are rescaled into [0, 1] using the Min-Max normalization as follows:

x' = (x − min(x)) / (max(x) − min(x)) (1)

where x is the feature to be normalized, and max(x) and min(x) are the maximum and minimum values of the observed data, respectively.
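The Min-Max normalization above is a one-liner in NumPy; this small sketch (our naming, not the paper's code) shows it applied to a feature series:

```python
import numpy as np

def min_max_normalize(x):
    """Rescale a feature series into [0, 1] via Min-Max normalization."""
    return (x - x.min()) / (x.max() - x.min())
```

For example, a temperature series spanning [10, 40] °C maps 10 → 0, 25 → 0.5, and 40 → 1. In practice the min and max are taken from the training data only, so the frozen model sees consistently scaled testing inputs.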

C. DEEP NEURAL NETWORKS
The Root Mean Square Error (RMSE) is adopted as the regression loss function during the learning process of all proposed deep learning networks below, as shown in Eq. 2:

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² ) (2)

where y_i is the observed streamflow, ŷ_i indicates the model's output (i.e., the predicted streamflow value), and n is the number of samples.
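The RMSE loss of Eq. 2 can be written directly in NumPy; this is an illustrative sketch rather than the authors' training code:

```python
import numpy as np

def rmse(y_obs, y_pred):
    """Root Mean Square Error (Eq. 2), used as the regression loss."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_obs - y_pred) ** 2))
```

A perfect prediction gives RMSE = 0, and larger deviations are penalized quadratically before the square root.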

1) MULTI-LAYER PERCEPTRON (MLP)
MLP is a kind of feedforward network that maps inputs to outputs non-linearly. This network may have multiple nodes at the output, forming an output layer, as well as additional intermediate layers between the input layer and the output layer, called hidden layers. Each hidden layer contains many neurons (i.e., units stacked together). Activation functions such as sigmoid, Tanh, or ReLU (Rectified Linear Unit) are usually applied in all hidden layers, of which ReLU [43] is widely used because of its simplicity and effectiveness.
In this study, we utilize three hidden layers for the MLP model (see Figure 3). Specifically, we first flatten the input via a flatten layer, then the flattened data is passed through each hidden layer to extract features from the inputs. A simple but robust activation function, namely ReLU, is adopted at each hidden layer to create a non-linear mapping between the layer's inputs and outputs. In addition, the ReLU function also reduces the likelihood of vanishing gradients when training the network. There are 64, 128, and 256 units in the three hidden layers, respectively. Each hidden layer is followed by a batch normalization layer [44] and a ReLU activation layer [43]. The features produced by the hidden layers are used for streamflow prediction via the output layer. We adopt a dropout layer with a drop rate of 0.5 before predicting the streamflow.
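The flatten → three hidden layers (64, 128, 256 units with ReLU) → output structure can be sketched as a plain NumPy forward pass. This is a simplified reconstruction under our own naming (batch normalization and dropout are omitted for brevity, and the weights here are random, not trained):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def mlp_forward(window, weights):
    """Forward pass of a three-hidden-layer MLP.

    window:  (lb, n_features) input, flattened first
    weights: list of (W, b) pairs for the 64-, 128-, 256-unit
             hidden layers plus the single-unit output layer
    """
    h = window.reshape(-1)              # flatten layer
    for W, b in weights[:-1]:
        h = relu(h @ W + b)             # hidden layer + ReLU
    W_out, b_out = weights[-1]
    return (h @ W_out + b_out).item()   # predicted streamflow

# Randomly initialized weights for a 15-day x 4-feature window
sizes = [15 * 4, 64, 128, 256, 1]
weights = [(rng.normal(0, 0.1, (m, n)), np.zeros(n))
           for m, n in zip(sizes[:-1], sizes[1:])]
y_hat = mlp_forward(rng.normal(size=(15, 4)), weights)
```

The same layer widths (64, 128, 256) recur in the CNN and LSTM variants described below.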

2) CONVOLUTIONAL NEURAL NETWORK (CNN)
Unlike the MLP network, which uses fully connected layers, CNNs exploit the spatial correlation of the signal to structure the architecture in a more sensible way. Their architecture is somewhat inspired by the biological visual system, which makes them extremely useful for spatial convolution and spatial pooling. In CNNs, convolution layers are utilized to find the locations and amplitudes of detected features in the input via one or more filters. A filter (or kernel) is organized in three dimensions, i.e., height, width, and depth. Instead of being fully connected to the previous layer as in the MLP, a CNN filter is applied to only a small region of the previous layer and ''slides'' over it. The movement of a filter is always from left to right and top to bottom. At each location, matrix multiplication is performed between the filter matrix and a patch of the input to generate a feature map value. Then, all values in a feature map are passed through a nonlinear activation (i.e., ReLU). Numerous different filters are generated to learn multiple features from a given input in parallel. These feature maps are then put together to form the final output of the convolution layer. In the proposed CNN model, we adopt three 1D convolution layers (see Figure 4), where the kernel size in each convolution layer is set to 3, the stride is set to 1, and the zero-padding is set to 1. We use 64 filters in the first convolution layer, 128 filters in the second, and 256 filters in the third. Batch normalization and ReLU activation layers are used after each convolution layer (the same as in the MLP model). Finally, the last feature maps are reshaped into a single vector via a flatten layer. A dropout layer with a drop rate of 0.5 and a fully connected layer are adopted to generate the streamflow prediction from this single vector.
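The sliding-filter operation with kernel size 3, stride 1, and zero-padding 1 can be reconstructed in NumPy as follows. This is an illustrative sketch of one such 1D convolution layer (with ReLU), not the authors' implementation; note that with these settings the output keeps the input's time length:

```python
import numpy as np

def conv1d(x, kernels, stride=1, pad=1):
    """'Same' 1D convolution with ReLU, as in the described layers.

    x:       (T, C_in) input sequence
    kernels: (C_out, k, C_in) filters with kernel size k = 3
    """
    c_out, k, c_in = kernels.shape
    xp = np.pad(x, ((pad, pad), (0, 0)))           # zero-padding
    T_out = (len(xp) - k) // stride + 1
    out = np.zeros((T_out, c_out))
    for t in range(T_out):
        patch = xp[t * stride:t * stride + k]      # (k, C_in) window
        out[t] = np.einsum('kc,okc->o', patch, kernels)
    return np.maximum(0.0, out)                    # ReLU activation
```

Stacking three such layers with 64, 128, and 256 filters reproduces the shape flow of the proposed CNN before the flatten and fully connected layers.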

3) LONG SHORT-TERM MEMORY (LSTM)
Recurrent Neural Networks (RNNs) are extremely powerful sequence models and were introduced by [45]. A typical RNN contains three parts, namely sequential input data, a hidden state, and sequential output data. RNNs are used for sequential information and perform the same task for every element of a sequence, where the output depends on the previous computations.
The difficulty of training RNNs lies in capturing long-term dependencies, which has been studied in [46]. RNNs are easily overloaded by processing and remembering too much information. To address the issue of learning long-term dependencies, [47] proposed Long Short-Term Memory (LSTM), which maintains a separate memory cell that is updated while unnecessary information is removed. The architecture of an LSTM memory cell is shown in Figure 5(a).
The key to the LSTM is the cell state (denoted as C). The LSTM is capable of removing or adding information to the cell state, carefully regulated by gates. The gates filter the information passing through them; each combines a sigmoid activation and a multiplication. There are three types of gates in an LSTM cell: the Forget, Input, and Output gates. At each time step t, the input to all gates consists of x_t (representing an element of the input sequence) and the hidden state h_{t−1}, the memory cell output from the previous time step (i.e., t − 1). The gates are all responsible for filtering information, with different purposes: • The Forget Gate decides which information is removed from the cell state. The output value for each element of the cell state C_{t−1} ranges over [0, 1]. An output of 1 indicates that all information is kept, whereas 0 indicates that all information is discarded.
• The Input Gate selects which information should be added to the cell state. This consists of two parts. First, a sigmoid layer, called the input gate layer (i_t), decides which values from x_t are kept, while a tanh layer generates candidate values C̃_t to be added to the cell state. Then, i_t and C̃_t are combined to create the updated cell state.
• The Output Gate decides which information is used as the output for the next state. The output values are based on the cell state, but they are further screened. Specifically, a sigmoid layer decides which parts of the hidden state are needed for the output. After that, the cell state C_t is passed through a tanh function to bound its values within [−1, 1] and multiplied by the output of the above sigmoid layer to obtain the output for the next state. The working mechanism of the gates and the flow of information in Figure 5 are expressed as follows:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f) (3)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i) (4)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C) (5)
C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t (6)
o_t = σ(W_o · [h_{t−1}, x_t] + b_o) (7)
h_t = o_t ⊙ tanh(C_t) (8)

where σ is the sigmoid function, ⊙ denotes element-wise multiplication, and the W and b terms are the weight matrices and bias vectors of the corresponding gates. In this study, we stack three LSTM layers, with 64 units in the first LSTM layer, 128 units in the second, and 256 in the third. The output of the last LSTM layer is flattened into a single vector and then passed to a fully connected layer to predict the streamflow value. We adopt a dropout layer with a drop rate of 0.5 before predicting the streamflow.
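One time step of the standard LSTM gate computations described above can be sketched in NumPy. This is an illustrative reconstruction (with the four gate transforms stacked into a single weight matrix for brevity), not the authors' implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step: forget, input, candidate, and output gates.

    W: (4*H, H + D) stacked weights for the four transforms of
       [h_{t-1}, x_t]; b: (4*H,) stacked biases.
    """
    H = h_prev.size
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:H])              # forget gate
    i = sigmoid(z[H:2*H])            # input gate
    c_tilde = np.tanh(z[2*H:3*H])    # candidate cell state
    o = sigmoid(z[3*H:4*H])          # output gate
    c_t = f * c_prev + i * c_tilde   # updated cell state
    h_t = o * np.tanh(c_t)           # new hidden state
    return h_t, c_t
```

Because h_t is a sigmoid output multiplied by a tanh output, every component of the hidden state stays strictly inside (−1, 1), which keeps the recurrence numerically stable over long sequences.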

4) TRANSFORMER (TRANS)
Transformer is a deep learning model designed to solve problems in many fields, such as automatic translation, language generation, classification, and entity recognition. Unlike RNNs and LSTMs, Transformers do not process the elements of an input sequence sequentially; the whole input is passed into the Transformer model at the same time. Instead of using a recurrent architecture like RNNs, Transformer models utilize self-attention layers. Self-attention is a mechanism that helps Transformer models ''understand'' the relationships between features in an input sequence. Self-attention consists of four steps (as shown in Figure 6): • Generate three vectors (Query, Key, and Value). These vectors are created by matrix multiplication between the input vector and the three weight matrices corresponding to the Query (Q), Key (K), and Value (V).
• Score calculation. The score is calculated as a matrix multiplication (MatMul) between the Query vector and each Key vector in turn. The objective of this step is to learn the correlation between the query and key vectors.
• Score normalization. In the original Transformer architecture, the scores are divided by √d_k (equal to 8, with key dimension d_k = 64), which makes the gradients more stable [48]. These values are then passed through the softmax function to ensure that the scores are all positive and sum to 1. A value near 1 means that the query closely matches the key, whereas a value near 0 means it does not.
• Output calculation. The attention weights are multiplied by the Value matrix. The aim of this step is to preserve the attended features and discard the irrelevant ones.
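The four steps above can be condensed into a single-head scaled dot-product attention function. The following NumPy sketch is an illustrative reconstruction under our own naming, not the paper's code:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a sequence X of shape (T, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # step 1: Q, K, V
    scores = Q @ K.T                                  # step 2: scores
    weights = softmax(scores / np.sqrt(K.shape[1]))   # step 3: scale + softmax
    return weights @ V, weights                       # step 4: output
```

Each row of the attention-weight matrix sums to 1, so the output for each time step is a weighted average of the Value vectors of all time steps.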
To mitigate the tendency of self-attention to always attend to its own position, an extended version of self-attention, namely multi-head attention, is used. Instead of a single self-attention (one head), multiple heads are used to capture different attention patterns from a given input sequence. Our Transformer framework is illustrated in Figure 6, in which we use two Transformer layers with 4 heads and 64 units in the Linear and Feed Forward layers. The output of the two Transformer layers is then passed to a flatten layer. We utilize a dropout layer with a drop rate of 0.5 before the fully connected layer that predicts the streamflow value.

D. MODEL EVALUATION
1) EVALUATION CRITERIA
The simulation performance of the models is evaluated by two statistical metrics, namely the Nash-Sutcliffe model efficiency coefficient (NSE) and the root mean square error (RMSE). These two metrics are commonly used to quantitatively describe the accuracy of simulation models. NSE values range from −∞ to 1, where 1 indicates a perfect match between simulated and observed values and smaller NSE values denote weaker association. In contrast, a lower RMSE value indicates better performance, and this metric also assesses how well the proposed model explains and predicts future streamflow. RMSE is also utilized as the regression loss function of the proposed models (see Eq. 2), and the NSE is calculated as follows:

NSE = 1 − Σ_{i=1}^{n} (y_i − ŷ_i)² / Σ_{i=1}^{n} (y_i − ȳ)² (9)

where y_i is the observed data, ŷ_i is the predicted data from the models, and ȳ is the mean of the observed data.
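The NSE formula described above is straightforward to compute; this NumPy sketch (our naming, for illustration) makes its two reference points explicit:

```python
import numpy as np

def nse(y_obs, y_pred):
    """Nash-Sutcliffe model efficiency coefficient."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 1.0 - np.sum((y_obs - y_pred) ** 2) / np.sum((y_obs - y_obs.mean()) ** 2)
```

A perfect prediction gives NSE = 1, while predicting the observed mean everywhere gives NSE = 0; negative values mean the model is worse than simply predicting the mean.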
For extreme events (drought and flood), the deep learning models' performance is evaluated using the relative peak error (RPE) and differences in the annual 30-day minimum. The accuracy and performance of different forecasting approaches can be assessed by the RPE, which reflects the reliability of predictions of peak discharge and the rising stage during flood events. This metric has been used in previous studies [49], [50], [51] to quantify the accuracy of flash flood forecasting.

RPE = (q_peak^sim − q_peak^obs) / q_peak^obs × 100% (10)

where q_peak^sim and q_peak^obs are the simulated and observed peak discharge, respectively; RPE values closer to 0 indicate a better estimation of peak discharge.
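The RPE of Eq. 10 is a simple percentage deviation; this one-line sketch (our naming) illustrates it:

```python
def relative_peak_error(q_sim_peak, q_obs_peak):
    """Relative peak error (Eq. 10), expressed in percent.

    Positive values mean the model overestimates the flood peak,
    negative values mean it underestimates it.
    """
    return (q_sim_peak - q_obs_peak) / q_obs_peak * 100.0
```

For instance, a simulated peak of 11,000 m³/s against an observed peak of 10,000 m³/s gives an RPE of about +10%.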

IV. RESULTS AND DISCUSSION
In this section, we first present the performance of all proposed models for both short-term and long-term prediction. An ablation study is adopted to analyse the sensitivity of the inputs (incl. precipitation, water level, and temperature) for streamflow forecasting. The models' performance is then examined for the extreme events (i.e., the drought year 1997 and the flood year 2000), the seasonal flow (dry season and wet season), and the period of high-dam development to show the accuracy and efficiency of the models.

A. PERFORMANCE EVALUATION
To examine the effectiveness of the proposed models for both short-term and long-term streamflow prediction, the RMSE and NSE are used to present the models' performance, and five runs are conducted in the testing phase to evaluate the stability and reliability of the proposed deep learning models. The results in Table 1 are the averages of RMSE and NSE over the seven stations on the mainstream MRB.

1) SHORT-TERM STREAMFLOW PREDICTION
Table 1 (left) shows the results of all models for three different forecast lead times: 5 days, 10 days, and 15 days. As can be seen, the models' results show a good fit between predicted and observed streamflow. RMSE and NSE values are not significantly different across models and show high streamflow prediction performance. In particular, CNN achieves the best performance, with an averaged RMSE of 1,930.35 m³/s for 5-day forecasting, while LSTM and MLP achieve 1,969.80 m³/s (+39.45 m³/s compared to CNN) and 1,998.31 m³/s (+67.96 m³/s compared to CNN), respectively. The NSE values of these three models are the same at 0.94. For 10-day streamflow prediction, the MLP model achieves better performance (averaged RMSE of 2,697.94 m³/s) than CNN (averaged RMSE of 2,733.10 m³/s) and LSTM (2,755.42 m³/s). The best model for 15-day forecasting is LSTM, with an averaged RMSE of 2,878.22 m³/s, followed by MLP (2,935.61 m³/s) and CNN (2,939.04 m³/s). The Transformer model achieves the worst performance for all short-term predictions, with RMSE values of 2,314.97 m³/s, 2,768.54 m³/s, and 3,174.72 m³/s for the 5-day, 10-day, and 15-day streamflow predictions, respectively.

2) LONG-TERM STREAMFLOW PREDICTION
Table 1 (right) shows the performance of all proposed networks for three forecast lead times (incl. 3 months, 6 months, and 12 months). Overall, LSTM significantly outperforms the other three models. Specifically, it achieves the best performance for the 6-month and 12-month streamflow forecasts and ranks second for 3-month forecasting. The LSTM network has the same NSE value (0.84) for all three lead times, with RMSE values of 3,365.21 m³/s, 3,393.21 m³/s, and 3,399.02 m³/s for the 3-month, 6-month, and 12-month predictions, respectively. The better results of the LSTM model can be attributed to its recurrent structure with memory cells, which allows it to capture and store information about previous inputs in the sequence, helping it to discard redundant information and keep track of important information from the past to forecast streamflow better.
The performance of all models for long-term streamflow prediction is reduced significantly compared to short-term prediction. The results show that the longer the lead time, the less accurate the forecasts are. According to [52], higher NSE values indicate less error variance, and values greater than 0.5 are typically considered acceptable levels of performance. Thus, our models' results are reliable for predicting streamflow in the MRB over both short-term and long-term periods.
The 10-day and 6-month lead times are selected to represent short-term and long-term streamflow prediction and are shown in Figure 7. Short-term prediction (Figure 7a) yields better results than long-term prediction (Figure 7b). This is not a surprise: the further we predict into the future, the more uncertain the results are. The streamflow predictions at the upstream station (i.e., Chiang Saen) have lower performance compared to the downstream stations (from Nakhon Phanom to Kratie). In particular, for long-term prediction, the NSE values of all models at Nakhon Phanom range from 0.88 to 0.92, while those at Chiang Saen range from 0.58 to 0.66. For short-term prediction, the MLP outperforms the other models at Chiang Saen and Luang Prabang, while the LSTM is the best model for streamflow prediction at the other stations (from Nong Khai to Kratie), especially at Nakhon Phanom. For long-term prediction, the LSTM obtains higher accuracy for streamflow prediction, followed by the CNN. Similar to the short-term prediction, all deep learning networks perform better at Nakhon Phanom and worse at the Chiang Saen station. Overall, the deep learning networks obtain good results for both short- and long-term streamflow prediction at all stations. However, performance clearly differs between the upstream and downstream stations. Chiang Saen, which is close to the cascade dams on the Lancang, is harder to predict than the others. This can be explained by the influence of dam constructions that have regulated river flow in the MRB in recent years.

B. EFFECTS OF DIFFERENT CLIMATE FACTORS
An ablation study of the models' inputs (incl. precipitation, water level, and temperature) is applied to evaluate the models' performance when each feature is removed. In particular, we set up new inputs and re-train all models with each variable removed in turn; the models are then compared with the results in Section IV-A (involving all inputs). The results are illustrated in Figure 8, in which 10 days and 6 months are used to represent short-term and long-term predictions, respectively.

1) FOR SHORT-TERM PREDICTION
The absence of temperature has the most significant influence on the models' results, followed by water level and precipitation. In particular, the performance of all models without temperature is reduced by 12.9% to 18%. Without the water level and precipitation, the models' performance is reduced by 8.5% to 19.5% and 5.7% to 24.2%, respectively. Moreover, the Transformer shows an obvious reduction in performance when inputs are removed, especially precipitation, with its accuracy decreasing sharply by 18% to 24.2%.

2) FOR LONG-TERM PREDICTION
The presence of some variables becomes noise and degrades the models' performance. For example, temperature plays a vital role in enhancing the models' performance for short-term prediction, whereas it is unnecessary information for long-term prediction, especially for CNN and LSTM. In particular, the accuracy of these two models increases by 3.6% and 0.4%, respectively, without temperature. Precipitation and water level are the two most important inputs for long-term prediction, especially for MLP. Among the four proposed models, the LSTM is the most efficient for long-term prediction. In particular, its performance is reduced by 3.3% when the water level is removed, and it performs well in the absence of the other inputs (i.e., precipitation and temperature). Temperature is a more sensitive variable for short-term prediction, while precipitation is more sensitive for long-term prediction. This can be explained by the fact that temperature affects the rate of evapotranspiration and thus the immediate runoff, while precipitation is a primary driver of long-term water availability.

C. STREAMFLOW PREDICTION IN THE EXTREME EVENTS
In this analysis, we present the models' performance at seven stations in the extreme drought year (1997) and flood year (2000) for short-term (10 days) and long-term (6 months) predictions. The year 1997 is recorded as an extreme drought that coincided with a strong El Niño event, affecting 3 million people and causing rice production losses of around $400 million in Vietnam [53]. Moreover, the severe flood in 2000, one of the largest floods in the past 80 years in Vietnam, coincided with a strong La Niña event and caused damage of $4 million in the Vietnamese Mekong delta [54]. Thus, these two years are adopted to represent streamflow in extreme events (droughts and floods).

1) FOR SHORT-TERM
Figure 9 presents the models' performance for short-term prediction (10 days) at the seven stations in the drought year (1997) and flood year (2000). LSTM is the best model for short-term prediction, followed by CNN. Particularly, LSTM obtains the best performance in the drought year (1997) from Chiang Saen to Pakse stations, with NSE values ranging between 0.81 and 0.94. Similarly, LSTM outperforms the rest of the proposed models at all stations in the flood year (2000), except at Chiang Saen. The models obtain better results at Nakhon Phanom, whereas they show worse results at the upstream stations (i.e., Chiang Saen and Luang Prabang). The comparisons among the four models in the extreme years at Kratie for long-term prediction (6 months) are presented in Figure 11. Kratie station is the most downstream station, located in a lowland region of Cambodia; it is not only significantly influenced by flooding but also prone to drought [55], [56]. Therefore, this station is chosen to compare all models' performance in the drought year (1997) and flood year (2000). The average water discharge in the dry season of 1997 was 3,735 m³/s, while that amount nearly doubled in 2000 (6,762 m³/s). Furthermore, the flood in 2000 occurred early and had two peaks, of which the second was one of the highest in the past 80 years [54] and lasted for two months (from August to October). The long-term prediction for 6 months shows obvious differences between the four models. Particularly, LSTM achieves the best performance and is able to capture the low and high peaks of the observed discharge in both flood and drought years, whereas MLP, CNN, and Transformer cannot detect well the fluctuations of discharge during the flooding period, especially from July to October.

TABLE 2. The performances of all models at Kratie station in the flood months and dry months for short-term prediction (10 days) and long-term prediction (6 months). These statistical indicators are from the testing period, and the best results are shown in bold.
A comparison between predicted and observed streamflow in the annual, flood-season, and dry-season periods at Kratie for short-term (10 days) and long-term (6 months) prediction is illustrated in Table 2. The proposed models perform better for short-term prediction than for long-term prediction. The performance of LSTM is slightly greater than that of the other models in both the flood season and the dry season. Specifically, the NSE values of LSTM are 0.85 (annual), 0.73 (flood months), and 0.79 (dry months), whereas those of MLP are 0.82 (annual), 0.68 (flood months), and 0.77 (dry months) for short-term prediction. Similarly, LSTM still outperforms the other models for long-term prediction, with NSE values of 0.78 (annual), 0.63 (flood months), and 0.67 (dry months). Transformer has the worst performance in the dry season, with an NSE of only 0.38 for dry-season streamflow prediction.
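The two scores used throughout this comparison can be computed as follows. These are the standard definitions of NSE and RMSE, not the authors' exact code:

```python
import numpy as np

# Nash-Sutcliffe efficiency: 1 is a perfect match; values <= 0 mean the model
# is no better than predicting the observed mean.
def nse(obs, sim):
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2))

# Root-mean-square error, in the same units as the discharge (m³/s).
def rmse(obs, sim):
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(np.sqrt(np.mean((obs - sim) ** 2)))

obs = np.array([3200.0, 4100.0, 6800.0, 5200.0])  # toy discharge series, m³/s
print(nse(obs, obs))            # 1.0  (perfect prediction)
print(rmse(obs, obs + 100.0))   # 100.0 (constant 100 m³/s bias)
```

Because NSE normalizes by the variance of the observations, it allows fair comparison across stations with very different discharge magnitudes, which is why it is reported alongside RMSE here.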
Figure 12a and 12b illustrate the differences in annual 30-day minimum discharge and the relative peak error of peak discharge, respectively, from 1980 to 2019. The results compare the simulated low/peak discharge with the observed discharge, indicating which models are appropriate for predicting extreme drought and flood events. LSTM shows the best skill in predicting discharge, especially in extreme droughts and floods, with values ranging from −20% to 20% for both low and peak discharge, followed by MLP. Transformer has the worst performance; for example, the years 1990 and 2011 (Figure 12a) show significant underestimation compared to the observed discharge, with values below −60%. These results confirm the good skill of the LSTM model in predicting streamflow in extreme events.
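These two extreme-flow statistics can be sketched as follows. This is our reading of the definitions, not the authors' code: we take the annual 30-day minimum as the minimum of the 30-day moving-average discharge (a standard low-flow statistic), and the relative peak error as the percentage difference in flood peak:

```python
import numpy as np

def annual_30day_min(q):
    """Minimum 30-day moving-average discharge of a one-year daily series."""
    windows = np.lib.stride_tricks.sliding_window_view(np.asarray(q, float), 30)
    return float(windows.mean(axis=1).min())

def relative_peak_error(obs, sim):
    """Percentage difference between simulated and observed flood peaks."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(100.0 * (sim.max() - obs.max()) / obs.max())

q = np.arange(365.0)  # toy monotonically rising hydrograph
print(annual_30day_min(q))                         # 14.5 (mean of days 0..29)
print(round(relative_peak_error(q, 0.8 * q), 1))   # -20.0 (peak underestimated by 20%)
```

Negative relative peak errors such as the Transformer's below −60% values thus correspond to flood peaks simulated far below the observed peak.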
In short, two main implications are drawn from the above results (Figures 9, 10, 11 and Table 2). First, LSTM outperforms the other models for streamflow prediction in extreme drought and flood events, while MLP is not able to capture the peak and low flows at Kratie station. Second, although Transformer can produce acceptable results for short-term streamflow prediction, it should be used with care for long-term prediction, especially in the dry season.

D. STREAMFLOW PREDICTION IN THE ERA OF MEGA-DAMS
In this analysis, the accuracy of the proposed models is confirmed in the era of mega-dams (2010-2019). Currently, many dams have been constructed in the MRB (more than 100 existing dams) [57], leading to obvious changes in the natural flow regime and to sediment reduction [31], [42], [58], [59]. Significant flow alterations have been observed after the completion of the Xiaowan and Nuozhadu dams in 2010 and 2014, respectively [31], [57]. Chiang Saen is the station nearest to the Lancang cascade dams (China), and it has shown a clearer change in flows due to the impact of dams than the lower stations [58], [60]. As can be seen from Figure 13, LSTM obtains the best performance in predicting river flow at Chiang Saen, followed by Transformer, CNN, and MLP. In particular, the NSE values of LSTM are 0.88 and 0.8 for short-term and long-term prediction, respectively. However, the accuracy of MLP is lower than that of the other models, with NSE values of 0.79 (short-term prediction) and 0.63 (long-term prediction). A comparison between predicted and observed discharge at Chiang Saen in the mega-dam period (2010-2019) is illustrated in Figure 14.
For short-term prediction (10 days), the simulation results from all four proposed models agree well with the observed discharge, whereas MLP and Transformer show lower performance for long-term prediction (6 months). The median, 25th, and 75th percentiles from LSTM (for both short-term and long-term prediction) are quite similar to those of the observed discharge, showing that this model achieves better performance than the others during the period of high-dam impacts. MLP and Transformer are not able to capture the low and peak flows for long-term prediction, and their results are overestimated with respect to the observed discharge. This indicates the limitations of MLP and Transformer in capturing the nonlinear and complex relationships that exist in the data, leading to poor performance in predicting streamflow in extreme events and during the period of dam impacts.

V. CONCLUSION
In this study, we proposed four deep learning models (incl. MLP, CNN, LSTM, and Transformer) to predict short-term and long-term streamflow in the MRB. Our proposed models not only forecast streamflow with different lead times effectively but are also able to capture the rapid flow changes in extreme events and in the period of mega-dams. Streamflow prediction in the MRB faces challenges due to a lack of dam information; such data are not easy to obtain because of the political volatility in the region, but our experiments show that deep learning methods can deal with this effectively. Results from our models show that LSTM is the most accurate and efficient of the four models. The longer the lead time, the less accurate the models are. Particularly, all proposed models achieve good performance for short-term prediction (RMSE values below 3,000 m³/s and NSE between 0.87 and 0.94), while their performance is reduced for long-term prediction (RMSE values exceeding 3,000 m³/s and NSE between 0.82 and 0.85). Besides, LSTM performs well for streamflow forecasts in the era of mega-dams (2010-2019), with NSE ranging from 0.63 to 0.88. LSTM performs better for long-lead-time streamflow forecasts because it is able to capture long-term dependencies in time-series data, thus predicting future streamflow more accurately. Furthermore, temperature has a significant influence on the models' performance for short-term prediction, while precipitation and water level are the two most sensitive variables contributing to the accuracy of long-term prediction. Such deep learning models are of great value for supporting development planning and strategic decision making in the MRB, especially in the period of high-dam development.
Further investigations could focus on improving the accuracy of long-term streamflow predictions by incorporating the seasonal regulation of mega-dams. Streamflow prediction with different lead times will provide useful information, such as on extreme droughts and floods, thus calling for timely mitigation and adaptation in the context of rapid development in the MRB.

The CNN network architecture is shown in Table 4, where Conv denotes the convolution layer with f as the number of filters and k as the kernel size. Table 5 and Table 6 present the architectures of the LSTM network and the Transformer network, in which unit in each LSTM layer denotes the number of LSTM cells, proj_dim is the dimension of the projection layer, and num_heads is the number of attention heads in the Transformer layers. T in the output size column of all networks denotes the number of consecutive days of input data. For short-term and long-term streamflow prediction, T is set to 15 and 365, respectively (i.e., 15 and 365 consecutive days are the sequence lengths for the lead-time forecasts).

FIGURE 2. The framework adopted in this study.

FIGURE 5. Long Short-Term Memory (LSTM) model. (a) denotes the architecture of an LSTM cell, and (b) illustrates our proposed LSTM model for streamflow prediction.
weight matrices in an LSTM cell, and b_f, b_i, b_C, b_o are bias vectors. Our proposed LSTM model is illustrated in Figure 5(b). We use three LSTM layers.

FIGURE 8. Ablation study for (a) short-term (10 days) prediction and (b) long-term (6 months) prediction, in which w/o P, w/o WL, and w/o T denote the model's input without precipitation, water level, and temperature, respectively. The results are based on the testing period.

FIGURE 9. The models' performance (NSE) for short-term prediction (10 days) at seven stations in (a) the drought year 1997 and (b) the flood year 2000. The best results are shown in blue.

FIGURE 10. The models' performance (NSE) for long-term prediction (6 months) at seven stations in (a) the drought year 1997 and (b) the flood year 2000. The best results are shown in blue.

FIGURE 12. (a) The differences in annual 30-day minimum discharge. (b) The relative peak error, shown as the percentage difference in flood peak. Both compare discharge predicted by the deep learning models with observed discharge.

2) FOR LONG-TERM
Figure 10 shows the results of streamflow prediction for the long-term period (6 months) in the extreme events. LSTM achieves the best performance in both the drought year (1997) and the flood year (2000), followed by CNN. Particularly, the NSE values of LSTM are above 0.95 from Nakhon Phanom to Kratie in the dry year (Figure 10a), while its result is less accurate at Chiang Saen (NSE = 0.79). For the flood year (Figure 10b), LSTM has the best result at Pakse and Kratie (NSE = 0.98), and its NSE values at all stations range from 0.87 to 0.98. In contrast, MLP obtains lower NSE values than the other three models and shows less accuracy at Chiang Saen (NSE = 0.71).

FIGURE 13. The performances of all models at Chiang Saen station during the high-dam development (2010-2019) for short-term prediction (10 days) and long-term prediction (6 months).

FIGURE 14. Streamflow prediction at Chiang Saen station during the high-dam development (2010-2019) for (a) short-term prediction (10 days) and (b) long-term prediction (6 months). The outer edges of the boxes represent the 25th and 75th percentiles, and the horizontal lines of the boxes represent the medians.

TABLE 1. The short-term streamflow prediction (left table) and the long-term streamflow prediction (right table) of all proposed networks (incl. MLP, CNN, LSTM, and Transformer). The best results are shown in bold.

TABLE 3. The MLP network architecture.

TABLE 4. The CNN network architecture.

TABLE 6. The Transformer network architecture.