Improved Transformer Model for Enhanced Monthly Streamflow Predictions of the Yangtze River

Over the past few decades, floods have severely damaged production and daily life, causing enormous economic losses. Streamflow forecasts prepare us to fight floods ahead of time and mitigate the disasters arising from them. Streamflow forecasting demands a high-capacity model that can make precise long-term predictions. Traditional physics-based hydrological models can only make short-term predictions for streamflow, while current machine learning methods can only obtain acceptable results in normal years without floods. Previous studies have demonstrated a close relation between El Niño-Southern Oscillation (ENSO) and the streamflow of the Yangtze River. However, traditional models, holding the encoder–decoder architecture, only have one encoder block that can not support bivariate time series forecasting. In this study, a transformer-based double-encoder-enabled model was proposed, called the double-encoder Transformer, with a distinctive characteristic: "cross-attention" mechanism that can capture the relation between two time series sequences. Using river flow observation collected by the Yangtze River Water Resources Commission and El Niño-Southern Oscillation (ENSO) observation collected by the National Oceanic and Atmospheric Administration, the model can achieve better performance. By using variational mode decomposition (VMD) technique for preprocessing, the model can make precise long-term predictions for the river flow of the Yangtze River. A monthly prediction of 21 years (from January 1998 to December 2018) was made, and the results indicate that the double-encoder Transformer outperforms mainstream time series models.


I. INTRODUCTION
T HE Yangtze River has seen numerous floods in its history. Affected by various factors, including the El Nino weather pattern [1], [2], [3], the river flow time series are complex and nonlinear [4]. Flow prediction is a practical problem and has been drawing an increasing amount of attention. Many studies have been done to predict the river flow for months [5], [6] or even years in advance [7], [8]. Streamflow predictions help in the ability to fight floods in advance and help local administrators make better decisions and mitigate disasters [9].
Over the past few decades, many numerical and machine learning methods have been developed to predict streamflow.
Numerical prediction models [10], [11], [12] can simulate the interactions of various physical processes, such as atmospheric circulation and the evolution of long-term weather in the physical world [13]. Numerical models can be analogous to conducting a particular physics experiment as a way to achieve satisfactory results in short-term forecasting [14]. The soil and water assessment tool (SWAT) [10] was proposed as a way to predict the effects of land management on water, sediment, and chemicals in a large watershed with complex and varied soil types, land-use patterns, and management practices. SWAT is a distributed watershed hydrologic model based on the geographic information system (GIS). SWAT primarily uses the space-based information from remote sensing and geographical information systems to simulate various hydrological physical and chemical processes [10]. However, hydrologic numerical models have some severe drawbacks: 1. Large amounts of data with high accuracy are required; 2. long-term forecasting is less accurate; and 3. numerical methods are computationally intensive, requiring vast computing resources. Statistical prediction methods generally belong to traditional machine learning. Many statistical models have been employed in predicting streamflow [15], [16], [17]. Support vector machines (SVMs) [18] was designed to be a strict theoretical and mathematical basis classifier . In 2016, Shuang Zhu et al. used SVMs to predict the upper reaches of the Yangtze River; here, the R 2 could reach 0.87 in a single-year monthly forecast [19]. Statistical hydrological models can perform well in normal years but perform poorly in flood years because the organizational structure of its parameters limits the complexity of the model, hence not being able to predict the anomalies precisely.
Traditional deep learning models for extracting features that can be used as prediction models. Artificial neural network (ANN), convolutional neural network (CNN) [20], and recurrent neural network(RNN) [21] are promising methods for predicting river flow. An ANN can approximate the unknown and nonlinear functions with arbitrary precision, namely universal function approximators [22]. ANN models have been used to predict streamflow [23], [24], [25]. However, the ANN has two drawbacks: 1. With the increase of sequence length, the amount of trainable parameters increases sharply. 2. An ANN cannot capture sequence information (e.g., position and order ). It is generally accepted that an ANN is not suitable for processing time series. CNNs and RNNs were designed to overcome the limitations of ANNs. Previous studies have found CNNs and RNNs to be more accurate than other models for dealing with time series. Shun-Yao Shih et al. used a CNN for multivariate time series forecasting [26]. Shaojie Bai et al. proposed a new CNN architecture named a temporal convolutional network (TCN) [27] and obtained good results on various time series datasets. An RNN was specifically designed to deal with time series [21]. However, the vanilla RNN architecture is full of problems that cannot be used directly. Based on RNN architecture, long short-term memory (LSTM) [28] has been proposed, as a way to alleviate problems within an RNN. Currently, LSTM is widely used in hydrologic prediction tasks [29], [30], [31]. In 2020, Liu et al. used LSTM to predict the middle reaches of the Yangtze River; here, in flood years, the R 2 monthly prediction could reach 0.89 [32]. However, there are also some shortcomings in CNNs and RNNs: 1. RNNs are not computationally parallel, making them slow and time-consuming. 2. CNNs will lose critical sequence information while processing with time series, decreasing their accuracy. 3. CNNs and RNNs cannot capture the long-term dependence effectively.
The attention mechanism was first proposed by Bengio et al. in 2014 [33], and since then, it has been widely applied in various fields of deep learning. With the attention mechanism, models can ignore low-value information and focus on high-value information. The attention layer can capture the long-term dependency by computing the "attention" between all pairs of points. The process does not need to be sequential, so it is computationally parallel. The attention mechanism can also be used as a feature extractor, outperforming CNN and RNN architectures. Based on an attention mechanism, Google proposed a new model called Transformer [34]. Transformer abandoned the traditional CNNs and RNNs, and the entire network structure is completely composed of the attention mechanism. Transformer was initially designed for natural language processing (NLP). Still, currently, many studies have indicated that Transformer is faster and stronger than CNNs and RNNs in dealing with time series [35], [36], [37]. However, like the previous models, Transformer also has trouble in predicting flood peak discharge.
There are three deficiencies in the previous works: 1. They can not forecast the flood peak discharge precisely. The R 2 in flood years has been lower than 0.9. 2. They cannot make long-term stable predictions with high accuracy. When making long-term predictions (e.g., more than a decade of predictions), the average R 2 may only be around 0.85. 3. Previous models do not have structures for multidimensional data processing and combine all the dimensions into feature vectors as input to make multidimensional predictions for river flow, limiting the models' capability. The previous multidimensional data prediction models all use this method of feature vectors [26], [38], [39]. In this study, a restructured Transformer model combined with VMD is proposed, namely double-encoder Transformer. Variational mode decomposition (VMD) is widely used to preprocess complex and nonlinear data [40]. It can decompose a signal into several different modes to transform the signal in the time domain into the frequency domain. The inherent features of the original signal can be better reflected, making the model fit well, no matter in normal or flood years. The restructured Transformer is still an encoder-decoder architecture with two encoders and one decoder. The structure of the multiencoder can support multivariate prediction. The inputs of the two encoders are observed river flow data and observed ENSO data, respectively. To better obtain the correlation between El Nino and flow, the decoder receives the outputs of the two encoders as inputs and then learns and computes the correlation between them.
The contributions of the present paper are as follows: • The double-encoder Transformer is proposed to enhance the prediction capability successfully, which significantly improves the flow prediction accuracy of flood years with an R 2 higher than 0.95. • A reliable long-term (21 years) flow prediction of the Yangtze River had been performed, achieving high accuracy. • The "cross-attention" mechanism is proposed to solve the bivariate prediction problem, which validates the attention-based model's potential value for making a VOLUME 4, 2016 multivariate prediction. • A heuristic method is applied to determine the K value in the VMD algorithm.

II. METHODOLOGY
The specific research methods can be divided into data preprocessing and deep learning. The critical step of data preprocessing is the VMD algorithm. In section II-A, starting with data preprocessing, the focus is on the details of the VMD algorithm. Deep learning techniques generally develop an encoder-decoder paradigm [39], mainly using a CNN and RNN and their variants. The double-encoder Transformer holds the encoder-decoder architecture with attention blocks.
In section II-B, the principles and processes of attention operation are presented and described. Section II-C provides the core part of the work. The model's architecture and network layers will be presented in detail. Please refer to Fig.  1 for an overview and the sections for more information. This section presents the process of preprocessing original observed flow data and ENSO data. There are three steps: 1. data normalization; 2. Adding time stamps; 3. applying the VMD algorithm to decompose the signal. The key to the third step is the appropriate amount of frequency components decomposed by the VMD. Here, we will demonstrate how to determine the right amount. There are two steps before employing the VMD.
1. Data normalization. The flow of the Yangtze River changes dramatically, ranging from 5000 m 3 /month to nearly 70000 m 3 /month. Suppose the dimensional (scaler) difference of the data is too large. In this case, the data with a large dimension (scale) will take the leading position, which will reduce the accuracy of the model. In addition, data with a large dimension will have a longer axis when updating the model using a gradient descent, which means that the convergence of the model will be slow [41]. In this study, we used min-max scaling to perform data normalization.
2. Adding time stamps. For time series, it is essential to add time information while training [42], [43]. Classic Transformer only encodes the input data with position and does not contain more fine-grained time information such as year, month, day, or hour [39]. The dataset used has a monthly dimension, so we added time stamps of years and months to each piece of data. Like positional encoding [34], these time stamps would be embedded into higher dimensions and added to the input data as sequential information.
VMD works efficiently with nonlinear and nonstationary data like streamflow by decomposing a signal into different modes of distinct spectral bands. In the original description, a model is defined as a signal whose number of local extrema and zero-crossings differ almost by one. Now, the definition is slightly changed into the so-called intrinsic mode function (IMF) [44], [45]. The complete VMD algorithm [40] can be summarized in Algorithm 1, where u k := {u 1 , u 2 , u 3 , ..., u k } Algorithm 1: Complete optimization of VMD [40] Initialize Update ω k : are the shorthand notations for the set of all modes (IMFs) and w k := {w 1 , w 2 , w 3 , ..., w k } are the shorthand notations for their center frequencies, respectively. The role of Lagrangian multipliers λ is to enforce the constraint. The goal of VMD is to decompose the original signal into subsignals u k . VMD has proven to be a better algorithm than empirical mode decomposition (EMD) [46]. VMD is much more robust to sampling and noise and supported by mathematical theory [40]. However, the amount of IMF decomposed by VMD is not fixed, and the effect of VMD is mainly affected by this amount. K will be used as a shorthand notation for this amount in the following presentation. Applying VMD to decompose the original signal will produce losses. When K is too small, a lot of important information in the original signal will be filtered, affecting the accuracy of subsequent predictions. The loss can be observed by decomposing the original signal into IMFs and then recomposing them and comparing them with the original signal. The loss can be defined as the difference between the recomposed signal and the original signal. To measure the loss, the coefficient of determination (R 2 ) is introduced. A R 2 between 0 and 1 can reflect the difference between the two distributions, and the smaller the R 2 the greater the difference. The loss can be measured by R 2 :Loss = 1−R 2 . Fig. 2(a) and Fig. 2(b) show the influence of K to loss on the flow and ENSO dataset, respectively. The larger K is, the smaller the loss. However, When K is too large, it will cause a few problems: 1. There will be gaps in high-frequency IMF, and the high-frequency signal becomes intermittent. 2. The center frequencies of the adjacent IMFs will be close to each other, resulting in modal repetition and extra noise. 3. More IMFs mean more computing resources and more time. In this study, a heuristic approach was adopted to select the appropriate K value. We apply Hilbert transform (HT) [47] to each IMF. The HT can be described as (1): where π is the Pi. * and τ are the convolution operator and the integration variable, respectively. The integral is considered to be Cauchy principal value, which avoids singularities at τ = t and τ = ±∞. The instantaneous amplitude A(t) and the instantaneous phase ψ(t) can be calculated using x(t) and x (t): Then, calculate the instantaneous frequency IF (t) for all points except for two endpoints: Finally, the mean of all the instantaneous frequencies of each IMF is the central frequency of that IMF: Fig. 3(a) and Fig. 3(b) show the central frequency distributions of the two datasets under different K values, respectively. Here, we focus on the high frequencies. When the K value is in a reasonable range, it is meaningful that the higher the frequency, the higher the central frequency. When K is too large, there is an obvious inflection point where the central frequency goes down as K goes up. This inflection point means that K is already large enough. When the K value is too high, the high-frequency part of the signal will break off, and there is no instantaneous frequency at the breakpoint; hence, after averaging, the center frequency will decrease instead. Fig. 4(a) and Fig. 4(b) show the variation of the highest center frequency with K , where the inflection points can be seen. The inflection point means the highest center frequency, and the K should be selected at the point. In this study, K = 9 was selected for the flow dataset and K = 8 for the ENSO dataset.

B. ATTENTION MECHANISM
Self-attention is the core part of Transformer. The attention mechanism is a heuristic method that refers to the process of human attention, hence enabling neural networks to focus more on what is important [33]. Therefore Transformer can also be regarded as a feature extractor. As the name suggests, self-attention is about inner attention, which excels in dealing with time series. For a time series sequence, self-attention can capture the correlations between each time step and all other time steps so that the prediction results will be more accurate.
Deep learning techniques mainly develop an encoder-decoder architecture by using RNNs and CNNs. However, both the RNN and CNN models are much slower than attention-based models. Compared with RNN and CNN architectures, the self-attention mechanism is faster [34] and is almost unaffected by the length of the sequence. In general, the longer the sequence, the slower the processing. With this in mind, an experiment was performed to test the speed of the three architectures. Three time series models, including vanilla TCN, vanilla LSTM, and vanilla Transformer, were selected. By recording the time consumed by three models while training, the speed can be measured. We used the flow time series data to train these models, respectively. The length of the input sequence ranges from 12 to 96 and the models were trained for 10 and 50 epochs. This experiment and all the following experiments were performed in the environment given in Table 1. Fig. 5 shows the time consumed     to train the models of three architectures for 10 and 50 epochs under different lengths of the input sequence. It is supposed that the longer the sequence, the slower the processing. For the LSTM, with the length of the sequence increasing, it consumes significantly more time. However, for CNN and Transformer, the time consumption changes gently and is less affected by the sequence length. The reason for this phenomenon is that the sequence length is not long enough.
In the settings of this study, the sequence length ranging from 12 to 96 is reasonable and meaningful. It can be seen that the time consumption of TCN has an obvious increase when the sequence length is longer than 48. As for the Transformer, there will be a significant increase, when the sequence length is longer than 400. In a word, the time consumed by the Transformer is the least of the three and is hardly affected by the sequence length. On the one hand, the RNN and CNN architectures are slower than the attention architecture. On the other hand, the CNN and RNN models are powerless to capture longterm dependency as efficiently as attention-based models. The reasons for this are as follows: 1. RNN is a linear architecture, and to obtain the correlations between time steps, the operation of each time step depends strictly on the previous steps [21]. RNN is a stepby-step architecture that limits its parallelism. As a result, the RNN is slow. The CNN uses convolution kernels to extract features. The size of the kernels and step size of the convolution affect its speed. Moreover, to better extract features, a multilayer convolutional network is required. Although a CNN is much faster than an RNN, it is still not as fast as self-attention. Self-attention completely abandons RNN structures and instead introduces matrix operations. Matrix operations are parallel, which means that each time step is computed simultaneously instead of one by one, which significantly increases the speed. Because it is parallel, the length of the sequence has little effect on the speed.
2. Although the structure of the "gates" mechanism such as LSTM and its variant gated recurrent unit (GRU) [48] alleviate the problem of long-term dependence, RNN is still powerless in dealing with the exceedingly long-term dependence [34]. If the length of inputs is L, the RNN obtains a VOLUME 4, 2016 length of dependence that is shorter than L. The length of dependence that a CNN could get completely depends on the size of convolution kernels, which are generally shorter than L. On the other hand, because the attention value between any two-time steps is computed, it can obtain arbitrarily longterm dependence before then focusing on the high weight and ignoring the low weight. Self-attention obtains a length of dependence that can be L.
The input sequence is transformed into three vectors through three matrices in the self-attention mechanism. These are query matrix (Q), key matrix (K) and value matrix (V ). The canonical self-attention is defined based on the tuple inputs (Q, K, V ). Self-attention performs the scaled dot product that can be summarized as follows [34]: where Q ∈ R dx×d k , K ∈ R dx×d k , V ∈ R dx×dv and d x is the input dimension. d k stands for the dimension of the Q martrix and K matrix, and d v stands for the dimension of the V matrix. We use 1 √ d k as the scaling factor. The concepts of query, key, and value are derived from information retrieval systems. This matches the key to the query to get the matching value. The value is the similarity between the query and the key. In matrix operations, the dot product is a method to compute the similarity of two matrices, and operation Q · K T computes the similarity of each pair of time steps. Weighted matching is then performed based on similarity, and the weights are the similarities. In this study, based on attention mechanism, we propose the "cross-attention" mechanism (Fig. 6). The cross-attention mechanism combines both the dot product attention and the additive attention to efficiently capture the dependency between two time series sequences. Like traditional attention mechanisms, there are one query, one key, and one value. The cross-attention block comes immediately after the two encoders. The outputs of the flow encoder were transformed into the query and the value matrices, and the outputs of the ENSO encoder were transformed into the key matrix. Let q i , k i , and v i stand for the i-th row in Q, K, and V respectively.The process of cross-attention can be described as follows: First, use additive attention to weighted average the key matrix into a global key vector (k): The weight α i is calculated as follows: where w k ∈ R L is a trainable vector. Use Hadamard product to simulate the nonlinear relation between q i and k and get a score (s i ): The i-th query's cross-attention is defined as: It has been proven that the additive attention has a lower computational complexity than dot product attention. The main purpose of using the additive attention is that it can quickly summarize important information in a sequence with linear complexity, which greatly improves the efficiency of multivariate forecasting.

C. DOUBLE-ENCODER TRANSFORMER
It has been proven that El Nino has a close correlation with rainfall, which affects the flow of the river, so El Nino has a significant impact on streamflow [1]. Classic Transformer only has the self-attention block, which is only suitable for dealing with a single time series. When dealing with the prediction issues with multivariate data, it cannot capture the attention among different kinds of variates effectively. The aim of the attention mechanism is to obtain the correlation of one point to all the other points and then weight the average of them so that the attention operation could be more than just self-attention. Because the self-attention process can obtain a correlation among the time steps of one sequence, the correlation between different sequences can be obtained, too. In this study, we improved the classic Transformer by using two encoders and one decoder with a cross-attention block to make streamflow predictions with El Nino covariate. The whole double-encoder architecture is proposed and illustrated in Fig. 7. Because Transformer abandoned the RNN linear sequence structure to support parallel computing, the positional information in time series will be lost. Therefore, adding order signals to vectors is necessary to help the model learn this positional information, so positional encoding [34] is used to solve this problem. Positional encoding works Use the VMD to decompose the original data: foreach modal in VMD(F )and VMD(E) do Perform data embedding: Convert EMB into Q, K, V matrices usingQ self _f , K self _f , V self _f , Q self _e , K self _e , and V self _e , and compute the Self-attention: The outputs of the two encoders are: Re-compose the F self and E self into one: Convert F ′ into Q, and V matrices using Q cross and V cross . Convert E ′ into K using K cross . Compute the Cross-attention: The prediction of river flow is: by combining order information and vectors to form a new representation input to the model to learn order information.
Positional encoding is itself a vector with order information.
In this study, order information and date information are added to the input sequence. The datasets are monthly, so the year and month information were embed into vectors and added to original input vectors with positional encoding. The original hydrological time series are nonlinear and nonstationary. Some features will be ignored if used directly, causing decreases in the prediction accuracy, especially in flood years. The VMD could decompose original data into several IMFs. Compared with the other frequency decomposition technique, the fundamental components decomposed by VMD have physical significance and are much more robust to sampling and noise. In this study, the flow sequence was decomposed into nine IMFs and ENSO sequence into eight IMFs. The IMFs were sent into encoders to perform self-attention. Each IMF should have its own encoder block, so there are nine encoder blocks for the flow sequence and eight for the ENSO sequence. The present study adopted the transfer learning method: flow IMFs share a common input embedding block and self-attention block, and so do ENSO IMFs, but the layers behind them are different. To sum up, there are two input embedding blocks, two self-attention blocks, 9+8 feed-forward blocks and 9+8 linear blocks. The introduction of transfer learning can significantly save the space occupied by the model and accelerate the training speed. All the IMFs of flow and IMFs of ENSO were recomposed as one, respectiovely, after being processed by the encoders. The recomposed flow and ENSO data were sent into the decoder to perform cross-attention. In river flow forecasting, the ENSO dataset works as the covariate. The main purpose of cross-attention is to capture the impact of ENSO on flow. flow was convert into Q and V , and ENSO was convert into K. The details of cross-attention were presented in section II-B.
Let L i n and L o ut stand for the input window size and output window size. The proposed method can be summarized as Algorithm 2: where Q self _f , K self _f , V self _f , Q self _e , VOLUME 4, 2016 K self _e , and V self _e are conversion matrices in self-attention. All the IMFs of flow were converted into Q, K, and V using the same Q self _f , K self _f , and V self _f matrices, and all the IMFs of ENSO were converted into Q, K, V using the same Q self _e , K self _e , and V self _e matrices. The w f and w e are trainable vectors used to recompose the IMFs.

III. EXPERIMENT AND RESULTS
This study focuses on the streamflow predictions on the Hankou Hydrological Station. The goal is to predict river flow with the flow and ENSO datasets. The flow dataset was collected by the Yangtze River Water Resources Commission, and the ENSO dataset was collected by the National Oceanic and Atmospheric Administration. Both datasets were collected monthly from January 1952 to December 2018, hence totaling 67 years. The datasets were divided into two parts: one from 1952 to 1997 and another from 1998 to 2018. The former was used to train the model, and the latter was used to make predictions. The Yangtze River has experienced several floods from 1952 to 2018, most recently in 2016. In 1998, the Yangtze River experienced a devastating flood because of strong subtropical highs, resulting in the most significant surge over the past 50 years. In this study, the proposed model was proven to work well in both flood years and normal years by making predictions from 1998 to 2018. We have chosen streamflow data in 1998, 2016, and 2018 (two flood years and one normal year) to make predictions and then made 21 years of rolling predictions from January 1998 to December 2018 to further verify the overall reliability.
Some representative models were selected as comparisons, including the traditional statistical analysis method: autoregressive integrated moving average model (ARIMA), the convolutional neural network: TCN [27], the representative of the RNN: LSTMa [49], and the classic Transformer [34]. For CNNs, RNNs, and classic Transformer, there is no structure specifically designed to support multidimensional prediction, so it is general to combine all dimensions into feature vectors as input when dealing with multidimensional time series. All the models have a 2d correlated input to make multidimensional time series forecasting. Under the fixed-size window forecasting setting, for double-encoder Transformer, the input flow sequence is F t = f t−L , ..., f t | f i ∈ R dx from time t−L to time t, and the input ENSO sequence is E t = e t−L , ..., e t | e i ∈ R dx . And the input sequence for TCN, LSTMa, and classic Trans- L is the length of the inputs. It has been proven that El Nino is cyclical, with a cycle of about 5.5 years [50] and through experiments, we have verified that there will be better results when the input length (L) is 72 (6 years). l is the length of the prediction steps. In this study, l is set to 12 so that the decision makers can take action to prepare for the flood a year in advance. For the TCN model, the kernel size was 24, and the stride was 1. The MSE-Loss was selected as the loss function, here using AdamW as the optimizer.
To measure the performance of the results, two evaluation metrics were introduced, including root mean squared error (RM SE) and coefficient of determination (R 2 ). The RMSE was defined as: (11) and the R 2 was defined as:

A. ORIGINAL DATA DECOMPOSE
To improve the accuracy and speed up the convergence of the model, data normalization is necessary. Min-max scaling was used to perform data normalization. For data X = {x 1 , x 2 , x 3 , ..., x n }, the min-max scaling is defined as: where the [min, max] is a scaling interval and all the data is scaled into the interval. In this study, both the flow and ENSO data were scaled into [−1, 1]. The scaled data from January 1952 to December 2018 were decomposed into nine IMFs and eight IMFs, respectively. The decomposed results are shown in Fig. 8(a) and Fig. 8(b).

B. THE FLOOD YEARS AND NORMAL YEAR PREDICTIONS
The main purpose of river flow forecasting is to predict floods, especially the flood peak discharge, helping those affected prepare to fight the flood ahead of time. So it is crucial to predict floods accurately, which previous models have not been able to do. Traditional models could obtain decent performance in normal years but did not work well in flood years. In this section, the time series from January 1952 to December 1997 were used to train the model, and 1998, 2016, and 2018 were selected to make predictions to measure the model's performance. Taking 1998 predictions as an example, the input data is the flow data and ENSO data from January 1992 to December 1997 (a total of 72 months). The output is a 12-months prediction for 1998. ARIMA, TCN, LSTMa, and classic Transformer were selected as comparisons that made the same predictions on these models. Fig. 9(a) and Fig. 9(b) show the whole 12 months of predictions for 1998 and 2016, respectively. Additionally, data in 2018 were selected as a representative of normal years to measure the model's performance in normal years. Fig. 9(c) shows the whole 12-month prediction of 2018. Finally, the metrics including R 2 and RM SE were used to measure the performances of all the models in 1998, 2016, and 2018. Table 2 shows the RM SE and R 2 of all the models on these three years of predictions.
All three models could fit the streamflow change trend that increases first and then decreases. They all performed well   at the beginning and end of the year (when the streamflow is low), and the gap among their forecasts was widest in June, July, and August (when the streamflow is at its peak of the year). The double-encoder Transformer had a higher R 2 and lower RM SE. In the flood years (1998,2016), it is evident that double-encoder Transformer fits better than other models with the actual values, especially the flood peak, and the R 2 could reach more than 0.95. In a normal year (2018), all three models performed well, but double-encoder Transformer was more accurate than the others, here with a 0.94 R 2 .

C. THE 21 YEARS OF ROLLING PREDICTIONS
In section III-B, three years were selected to perform predictions, and the results showed that the double-encoder Transformer had significant advantages in flood years and normal years. However, the results of the three-year predictions were largely haphazard. In this section, a 21-year rolling forecast (from 1998 to 2018) was made to test the models' capacity for long-term prediction. Fig. 10 Annual R 2 and RM SE were calculated for each year of the five models and finally obtained the 21 years of the average R 2 and RM SE. Fig. 11(a) and Fig. 11(b) show the 21 years of the annual R 2 and RM SE, respectively.
The results of the 21-year prediction show that the doubleencoder Transformer was better than the traditional time series models, with an average R 2 of more than 0.91 and an average RM SE of lower than 2600 m 3 /month.
The experimental results show that double-encoder Transformer is superior to the current time series forecasting models. In 21 years of rolling predictions, the average R 2 of the double-encoder Transformer could reach more than 0.91, which is higher than the other models by about 0.1, and the RM SE was just 2579 m 3 /month, nearly half of the other models. It can be seen from Fig. 10(c) that these points of the five models are very concentrated near the actual red line at first, which means that they all performed well when streamflow was low. As the flow continued to increase, the distribution of the points became more scattered, resulting in bad predictions. In this case, the double-encoder Transformer had a considerable advantage of being more accurate in predictions of high streamflow. The results fully prove that the double-encoder Transformer could perform very well in normal and flood years and could be reliable enough in longterm predictions with high accuracy.

IV. CONCLUSION
In this work, the streamflow forecasting problem of the Yangtze River was studied, and a new Transformerbased double-encoder-enabled model was proposed: doubleencoder Transformer. This model, combined with the VMD algorithm, can effectively make precise long-term streamflow forecasting of the Yangtze River, especially in flood years. Although the model is still of an encoder-decoder architecture, it alleviates the limitation of a traditional encoder-decoder architecture. Specifically, we designed the cross-attention mechanism to handle the challenges of not supporting multivariate prediction in traditional time series forecasting models. Combining the additive attention with the dot product attention, the cross-attention mechanism can effectively capture the relation between flow and ENSO data.
The experiments on real-world data of the Hankou Hydrological Station demonstrated the effectiveness of the new model for enhancing the prediction capacity both in normal and flood years. A reliable long-term monthly prediction (from 1998 to 2018) was made. There were floods in two of these 21 years (1998 and 2016). The R 2 in both years is higher than 0.95, and the RM SE value is just 3224 m 3 /month, which is when the flow reached nearly 70000 m 3 /month in 1998. Other mainstream forecasting models, including ARIMA, TCN, LSTMa, and classic Transformer, were selected as comparisons to demonstrate the superiority of the double-encoder Transformer. In the 21 years of predictions, the average R 2 of double-encoder Transformer was about 0.91, which is higher than the other model by 0.1, and the RM SE was 2579 m 3 /month, which is significantly lower than the other models. These experimental results show that the double-Encoder Transformer can be used in realworld streamflow predictions.
The main work of this study is to predict the stream-flow of the Yangtze River, focusing on flood prediction. However, drought year is also important because water is a vital resource on earth and we depend heavily on river water. Accurate drought forecasting is promising for future research. The variation of river flow in the Yangtze River is related to many variables. In this work, we made a bivariate forecast with the ENSO data. The outcomes can be improved by adding more variables to make multivariate forecasting. The cross-attention mechanism is a promising method that can efficiently summarize the features of a sequence into a vector and compute the attention between the independent variable and covariates. This work is a good start in making the multivariate long sequence time series forecasting. We are excited about the future of models with the cross-attention mechanism.