Saturation Line Forecasting via a Channel and Temporal Attention-Based Network

Tailings ponds are sites for storing industrial waste, and the saturation line is the crucial factor in quantifying their safety. Existing saturation line time-series prediction methods are mainly based on statistical models or shallow machine learning models. Although these models aim to capture the time dependence of the sequence data, channel and temporal dependencies are largely beyond their reach. To mitigate this problem, we present a two-stage forecasting method that embeds channel and temporal attention into a hybrid CNN-LSTM model to predict the saturation line; the channel and temporal attention are utilized to capture subtle high-dimensional time-series dependence. In the first stage, the discrete wavelet transform (DWT) is applied to extract refined sequence information. In the second stage, a CNN-LSTM model learns the basic spatial and temporal features of the time series, and the channel and temporal attention modules embedded in it enhance the feature-extraction ability along the channel and temporal dimensions. Our proposed model is shown to outperform classic models on multiple real-world datasets in terms of RMSE, MAPE, R² and MAE.


I. INTRODUCTION
Tailings ponds are places to store industrial waste. Tailings pond failure ranks 18th among the world's assessed risks [8]. More than 100 major tailings dam accidents causing significant damage were reported between 1960 and 2022 [1]. The common methods for evaluating the safety situation of tailings ponds rely on manual observation or measurement analysis using different sensors, e.g., water level sensors, displacement sensors, rainfall sensors and deformation sensors. In fact, considering the variability of topography, mine construction conditions and weather, the state of a tailings dam is complicated and changeable. Tailings dams are typically located in remote mountainous areas, their structure is intricate, and the processes leading to dam breakage are highly nonlinear. This makes it impossible to observe a tailings pond's stability directly.
When the saturation line drops by 1 meter, the safety factor of static stability increases by 0.05 or more, making the saturation line the most important factor in tailings dam stability [11]. If the saturation line is too high, dam stability is reduced, and leakage, landslides, and even dam failure may occur [20]-[22]. Hence, the saturation line is termed the lifeline of a tailings dam [13].
It is imperative to establish a reliable model to predict the height of the saturation line so as to evaluate the security situation of a tailings pond. However, prediction research on tailings ponds is rare. To alleviate this problem, our goal is to propose a model that takes full advantage of deep learning to fit complex data. In more detail, utilizing the hidden information of historical saturation line values, the future value and tendency can be predicted. Based on this, we propose a channel and temporal attention-based CNN-LSTM network. In our model, convolutional layers extract high-dimensional structural information and pass it on to the LSTM layers, which learn the time-series dependence. Furthermore, the channel-wise operation in the attention module extracts and enhances the high-dimensional channel structure generated by the CNN, while the temporal attention captures the temporal information, which is crucial since it expresses the implicit time-series dependencies to a large degree. Since the situation of the tailings dam is complex, the data sequence of the saturation line is unstable and lacks an obvious periodic structure, and the noisy data is misleading and consumes considerable space and memory. To overcome these drawbacks, we apply the discrete wavelet transform (DWT) to decompose the saturation line into different time-frequency sequences, and then remove the noise in the decomposed data according to the rigrsure strategy. Through decomposition and reconstruction, the data is refined to expose the effective time-series dependence. Taking a Chinese tailings pond as the study area, the main contributions of our study are summarized as follows:
• Proposing an effective channel and temporal attention-based CNN-LSTM network to predict the saturation line, which achieves satisfactory performance in terms of MAPE, RMSE and R².
• Comparing our proposed model with different hyperparameters and with other state-of-the-art models.
• Conducting ablation studies to confirm the components of our model, i.e., channel and temporal attention, DWT, and LSTM.

II. RELATED WORK
Recently, researchers have devoted much effort to tailings pond monitoring [2], [14], [18], [23]-[25]. They mainly focus on the stability status, monitoring data from sensors and issuing timely early warnings via mathematical modeling, image recognition and data analysis. Huang et al. [3] developed a tailings pond monitoring and early-warning system based on three-dimensional GIS, whose response time is less than 5 seconds. Li et al. [5] proposed GPS-based means to monitor the displacement of tailings dams online. Gao et al. [6] established remote sensing interpretation using high-resolution remote sensing images. M. Necsoiu [7] used satellite radar interferometry to monitor tailings sedimentation. D. F. Che et al. [8] assessed the risk of tailings ponds by runoff coefficient, which can simultaneously determine the safety performance of multiple tailings dams. Dong et al. [10] set up an alarm system based on a cloud platform, showing good performance in real-time monitoring. Qiu et al. [11] designed a saturation line monitoring system based on mixed programming.
Recently, with their ability to handle almost any nonlinear and linear problem in both low and high dimensions, neural networks and machine learning methods have been effectively applied to real-time risk analysis and evaluation [4], [9], [12], [15], [26]-[29]. However, real-time monitoring cannot be equated with early warning and forecasting: risk prediction methods help people perceive risk before it happens. With their excellent ability to process time series, classic prediction models such as the Auto-Regressive Integrated Moving Average (ARIMA) and LSTM have been used in prediction problems [16], [19], [30]-[32]. They analyze and identify the time-series information of training data and give prediction values a few days in advance. Nevertheless, unlike LSTM, the ARIMA model only scores highly when the data has a linear correlation or lacks obvious fluctuation. With the rapid development of deep learning, CNN and LSTM have become the most popular networks. A CNN can filter out noisy data and extract important features, achieving good performance on images, speech, and time series [33], [34], while an LSTM network can find linear or nonlinear time-series information in shallow and deep layers and combine it with the current memory [35]. In light of this, combining LSTM with CNN may achieve better prediction performance to a large extent.

FIGURE 1: The water level sensors we utilized for collecting saturation line data.

III. DATA PREPARATION
The study site is a copper mine tailings pond in Zhejiang Province, China. The real-world historical data for this work were collected from water level sensors from 2020-03-18 to 2021-04-29. Specifically, we collected five datasets from different heights of the tailings pond. The water level sensor is shown in Figure 1.
After filling missing values and replacing abnormal values with the median, 9,865 data points are collected for our prediction task. The monitoring data are continuous, preserving a wide range of the original time-series information. Of the 9,865 points, 70% are set as the training set and 10% as the validation set; the performance of our models is confirmed on the remaining 20%. Unlike in usual deep learning studies, we do not shuffle the data, since the time-series dependence relies on the temporal order. Table 1 describes the monitoring data: the first three rows show the historical monitoring data, while the remaining rows show the statistical details.
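For illustration, the chronological 70/10/20 split described above can be sketched as follows (a minimal sketch; the function name and synthetic series are our own, not from the paper):

```python
import numpy as np

def chronological_split(data, train_frac=0.70, val_frac=0.10):
    """Split a time series into train/val/test WITHOUT shuffling,
    preserving the temporal order the prediction task relies on."""
    n = len(data)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = data[:n_train]
    val = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]  # remaining ~20% for evaluation
    return train, val, test

# Example with the 9,865 collected points (values here are synthetic).
series = np.arange(9865, dtype=float)
train, val, test = chronological_split(series)
```

Because the split is chronological, every validation point is strictly later than every training point, and every test point later still.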
In order to eliminate the impact of different data dimensions on the calculation, we apply Z-score normalization, formulated as

ẋ_t = (x_t − μ_t) / σ_t

where x_t is the input data, and μ_t and σ_t are the average and standard deviation of the data.
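The normalization above can be sketched in a few lines (a minimal sketch; the example values are ours):

```python
import numpy as np

def z_score(x):
    """Z-score normalization: subtract the mean and divide by the
    standard deviation, as in the formula x' = (x - mu) / sigma."""
    mu = x.mean()
    sigma = x.std()
    return (x - mu) / sigma

# Toy example: mean 5, standard deviation 2.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
x_norm = z_score(x)
```

After the transform, the series has zero mean and unit standard deviation, so all five datasets are on a comparable scale.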

IV. METHOD
This section first gives an overview of our model. The DWT method is described in Sec. IV-B, and our proposed model is explained in Sec. IV-C.

A. OVERVIEW
In this paper, we present a two-stage forecasting method, which embeds channel and temporal attention into a CNN-LSTM model to predict the saturation line. In the first stage, i.e., the first row in Figure 2, the DWT is applied to extract refined sequence information. In the second stage, i.e., the second row in Figure 2, the CNN-LSTM model is used to learn the spatial and temporal features of the refined time series. Furthermore, the channel and temporal attention modules are utilized to enhance the feature-extraction ability.

B. DISCRETE WAVELET TRANSFORM
Although the windowed Fourier transform (short-time Fourier transform) can partially localize in time, its fixed window size makes it suitable only for stationary signals with small frequency fluctuations, not for nonstationary signals with large ones. As a time-frequency analysis method, the wavelet transform (WT) can automatically adjust the window size according to the frequency. What has greatly contributed to the effectiveness of the WT is that it is an adaptive time-frequency analysis method capable of multi-resolution analysis; as a result, the wavelet transform is known as a microscope for analyzing and processing signals. In our study, we apply the discrete wavelet transform (DWT) [18] to decompose the collected saturation line data of the tailings pond into 4 frequency sequences. After removing the noise in the decomposed data, the wavelets are reconstructed to obtain new integrated data for further multi-resolution study.
The WT shifts a basic wavelet function by ω units and then takes the inner product with the analyzed signal p(t) at different scales:

W(a, ω) = (1/√a) ∫ p(t) φ((t − ω)/a) dt

where a is the scale factor (a > 0) that stretches the basic wavelet φ(t), and ω is the displacement. The Mallat algorithm [17] provides an effective way to realize the DWT by processing the data with low- and high-pass filters:

oL(k) = Σ_i T(i) ψ_l(2k − i),  oH(k) = Σ_i T(i) ψ_h(2k − i)

where T(i) is the signal, and ψ_l, ψ_h, oL, oH are the low-pass filter, high-pass filter, output of the low-pass filter, and output of the high-pass filter, respectively. Notably, in the wavelet domain the coefficients corresponding to the effective signal are large, while those corresponding to the noise are small; the noise can therefore be removed by thresholding. In this paper, we apply the common rigrsure threshold in the DWT: the absolute value of each coefficient of the signal t is taken and sorted, and each value is squared to obtain a new sequence f(k). The risk associated with each candidate k is

Risk(k) = [n − 2k + Σ_{i=1}^{k} f(i) + (n − k) f(k)] / n

and the final threshold γ_t is obtained from the index k* that minimizes Risk(k) over all k.
The 3-level decomposition and the reconstruction process of the DWT using the Mallat algorithm are shown in Figure 4(a) and Figure 4(b), respectively. Figure 4(a) shows the signal being decomposed into three levels. In more detail, at the first level the original signal T is decomposed into the coefficients oL1 and oH1. The obtained oL1 is then decomposed into two further coefficients oL2 and oH2 at the second level, and the decomposition continues until the set number of n levels is reached. Figure 4(b) illustrates the de-noising and reconstruction process. The noise corresponds to small wavelet coefficients, while the useful signal corresponds to large wavelet coefficients. The time-series signal T passes through the low-pass filter oL1 and high-pass filter oH1; wavelet coefficients of lower amplitude are removed and those of higher amplitude are restored to achieve noise reduction. Subsequently, the wavelet reconstruction and integration process is applied to all of these coefficients. Using the coefficient oL3, the low-frequency, high-amplitude sequence rL3 is reconstructed. As shown for rL3 in Figure 4, the sequences become smooth, revealing the refined sequence patterns.

FIGURE 4: The decomposition-denoise-reconstruction process of saturation line data. After the DWT process, the random noise of the original data (green) is removed and the data become smooth (orange).
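A minimal sketch of one level of the decompose-threshold-reconstruct pipeline, using the Haar wavelet for concreteness (the paper uses a multi-level Mallat decomposition with the rigrsure threshold; the Haar filters and the fixed threshold here are simplifying assumptions of ours):

```python
import numpy as np

def haar_dwt(signal):
    """One level of the Mallat scheme with the Haar wavelet:
    oL holds the smooth approximation, oH the detail coefficients."""
    s = np.asarray(signal, dtype=float)
    oL = (s[0::2] + s[1::2]) / np.sqrt(2.0)
    oH = (s[0::2] - s[1::2]) / np.sqrt(2.0)
    return oL, oH

def haar_idwt(oL, oH):
    """Inverse transform: rebuild the even/odd samples and interleave."""
    even = (oL + oH) / np.sqrt(2.0)
    odd = (oL - oH) / np.sqrt(2.0)
    out = np.empty(2 * len(oL))
    out[0::2] = even
    out[1::2] = odd
    return out

def denoise(signal, threshold):
    """Zero out small detail coefficients (noise), keep large ones."""
    oL, oH = haar_dwt(signal)
    oH = np.where(np.abs(oH) < threshold, 0.0, oH)
    return haar_idwt(oL, oH)

# A step-like signal with small jitter; thresholding smooths each pair.
noisy = np.array([1.0, 1.1, 0.9, 1.0, 5.0, 5.1, 4.9, 5.0])
clean = denoise(noisy, threshold=0.2)
```

Without thresholding the transform reconstructs the signal exactly; with it, the small detail coefficients (the jitter) are suppressed while the step between levels 1 and 5 survives.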

C. STRUCTURE OF THE PROPOSED MODEL
Our study aims to construct a prediction system for forecasting the saturation line utilizing state-of-the-art LSTM and CNN networks. What has driven the popularity of the convolutional layer is that it is good at extracting and identifying the structures of the time series in the monitoring data, while LSTM networks achieve good performance in detecting long-short-term dependence. In light of this, the principal idea of our study is to combine the advantages of CNN and LSTM.

FIGURE 5: The channel and temporal attention. The first row indicates the channel-wise attention, while the second row indicates the temporal-wise attention.
We first build a baseline model, namely CNN-LSTM, which utilizes one CNN and one LSTM model. Our proposed model improves on this baseline with one more LSTM layer together with the channel and temporal attention. As mentioned before, our proposed model is a two-stage model: the first stage is the DWT process and the second stage is the time-series prediction model. The convolutional layers encode the time-series information as a high-dimensional structure, while the LSTM layers decode the information from the convolutional layers for time-series dependence. The channel- and temporal-wise features are enhanced by the attention modules. Our proposed model is shown in Figure 2.
Specifically, our proposed model consists of one convolutional layer with 32 filters, a max-pooling layer of size 2, two LSTM layers with 25 and 50 units, a flatten layer and a fully connected layer, in that order. Different parameters of the CNN and LSTM are compared in Table 4. Furthermore, the attention module is embedded in the baseline CNN-LSTM model; the details of the embedded channel and temporal attention are described in Sec. IV-D.
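The layer stack above can be sketched in Keras as follows (a sketch, not the authors' exact implementation: the kernel size, activations, and univariate input shape are assumptions the paper does not state):

```python
import tensorflow as tf

SEQ_LEN = 10     # sequence length used in the paper
N_FEATURES = 1   # univariate saturation-line series (assumption)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(SEQ_LEN, N_FEATURES)),
    # "same" padding, as in the paper, so the sequence length is kept.
    tf.keras.layers.Conv1D(32, kernel_size=3, padding="same",
                           activation="relu"),
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.LSTM(25, return_sequences=True),
    tf.keras.layers.LSTM(50, return_sequences=True),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1),   # next saturation-line value
])
model.compile(optimizer="adam", loss="mse")  # paper optimizes an RMSE-style loss
```

The attention modules of Sec. IV-D would be inserted between the convolutional and recurrent blocks; they are omitted here to keep the baseline stack readable.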

D. CHANNEL AND TEMPORAL ATTENTION MODULE
In this section, we explain the channel and temporal attention used in this paper. The channel-wise operation extracts the high-dimensional channel structure generated by the CNN model.
For the channel-wise features, the spatial information of a feature map F is aggregated by average-pooling and max-pooling operations, generating two different spatial descriptors, F^c_avg and F^c_max, which represent the average-pooled and max-pooled features respectively. Both descriptors are forwarded through a shared fully connected layer and summed:

M_c(F) = σ(FC(F^c_avg) + FC(F^c_max))

where σ indicates the sigmoid function and FC indicates the shared fully connected layer with ReLU activation. The final channel-wise attention is formulated as

F′ = M_c(F) ⊗ F

where ⊗ means element-wise product.
For the time-series prediction task, the temporal information is crucial since it expresses the implicit time-series dependencies to a large degree. For the temporal-wise features, the channel information of the feature map is aggregated by the same two pooling operations, generating two maps, F^t_avg and F^t_max, which are concatenated and convolved:

M_t(F′) = σ(Conv^{7×7}([F^t_avg; F^t_max]))

where Conv^{7×7} represents the convolution kernel with a size of 7 × 7. The overall channel and temporal-wise attention is formulated as

F″ = M_t(F′) ⊗ F′.
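For illustration, a minimal NumPy sketch of the two operations (simplifications of ours: the shared FC layer is collapsed to a single weight matrix, and the 7×7 convolution is replaced by a scalar mix of the two pooled maps):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(F, W):
    """Channel attention: pool over time, pass both descriptors through
    a shared layer (a single weight matrix W here), sum, then sigmoid."""
    f_avg = F.mean(axis=0)                  # (C,) average-pooled descriptor
    f_max = F.max(axis=0)                   # (C,) max-pooled descriptor
    m_c = sigmoid(f_avg @ W + f_max @ W)    # (C,) channel weights
    return F * m_c                          # reweight, broadcast over time

def temporal_attention(F, w):
    """Temporal attention: pool over channels, mix the two maps with
    weights w (a stand-in for the paper's 7x7 convolution), sigmoid."""
    f_avg = F.mean(axis=1)                  # (T,) per-step average
    f_max = F.max(axis=1)                   # (T,) per-step max
    m_t = sigmoid(w[0] * f_avg + w[1] * f_max)  # (T,) temporal weights
    return F * m_t[:, None]                 # reweight each time step

rng = np.random.default_rng(0)
F = rng.normal(size=(10, 32))               # (time steps, channels)
F1 = channel_attention(F, rng.normal(size=(32, 32)) * 0.1)
F2 = temporal_attention(F1, np.array([0.5, 0.5]))
```

Both stages preserve the feature-map shape and only rescale it, channel-wise first and then step-wise, which is what lets the module drop into the CNN-LSTM stack without changing any downstream layer.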

V. EXPERIMENTAL SETUP
A. METRICS
The prediction performance of our proposed model is evaluated by the root mean square error (RMSE), mean absolute percentage error (MAPE) and coefficient of determination (R²). RMSE alone can be misleading: even if a model has an error of less than 0.1% on 95% of the dataset, a very large error on the remaining 5% keeps the overall RMSE high, and the model is judged poor. To mitigate this, MAPE is also used in the evaluation. R² measures the proportion of the total variation of the dependent variable that is explained by the model.
RMSE = √((1/n) Σ_{t=1}^{n} (y_t − ŷ_t)²)

MAPE = (100/n) Σ_{t=1}^{n} |y_t − ŷ_t| / y_t

R² = 1 − Σ_t (y_t − ŷ_t)² / Σ_t (y_t − ȳ)²

where y_t represents the true value, ŷ_t the predicted saturation line value, ȳ the average of the true values, and n the number of data points. Figure 6 shows the prediction results of our proposed model on five different real-world datasets.
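The three metrics above can be computed directly from their definitions (a sketch with toy values of ours):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    """Mean absolute percentage error; assumes y_true has no zeros."""
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: explained share of total variation."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_pred = np.array([2.5, 3.5, 6.5, 7.5])
```

On this toy series every prediction is off by 0.5, giving an RMSE of exactly 0.5 and an R² of 0.95.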

B. IMPLEMENTATION DETAILS
We implement our network using the TensorFlow deep learning framework, and all models are trained on two Nvidia Tesla V100 GPUs. In this study, the proposed model is trained for 200 epochs with a batch size of 32. We use RMSE as the loss function and Adam as the optimizer; Adam is an improved RMSProp optimizer combined with the momentum trick. It is worth noting that, to reduce feature loss in the convolutional layers, same-padding is used. Specifically, we set the sequence length to 10: on the one hand, a longer sequence occupies considerable memory; on the other hand, our experiments show that a length of 10 performs better than, e.g., 3, 7, 20 or 50. The most important hyperparameter is the learning rate, which has a significant influence on the time to convergence. If the learning rate is too large, the loss function is hard to converge, resulting in lower final accuracy; conversely, a small learning rate leads to slow convergence and longer training. For the optimizer, we use Adam with a momentum of 0.9 and a weight decay of 0.001. As for the number of iterations, training is stopped when the model no longer improves.
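The sequence length of 10 implies a sliding-window construction of the training samples; a minimal sketch (the function name and synthetic series are ours):

```python
import numpy as np

SEQ_LEN = 10  # sequence length selected in the paper

def make_windows(series, seq_len=SEQ_LEN):
    """Turn a 1-D series into (samples, seq_len, 1) model inputs with
    the next value as the target: the usual setup for one-step-ahead
    forecasting with a CNN-LSTM."""
    X, y = [], []
    for i in range(len(series) - seq_len):
        X.append(series[i:i + seq_len])   # window of past values
        y.append(series[i + seq_len])     # value to predict
    X = np.asarray(X)[..., None]          # add the feature axis
    return X, np.asarray(y)

series = np.arange(100, dtype=float)
X, y = make_windows(series)
```

Each sample holds 10 consecutive past values, and the target is the value that immediately follows the window, so a series of length n yields n − 10 samples.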

VI. EXPERIMENT
Our proposed model is evaluated and compared to other models to demonstrate its prediction performance. The results show that our proposed model captures the long-short-term data dependencies to a significant degree. The scatter plots of the raw data and the predicted saturation line are illustrated in Figure 8, which shows the prediction performance intuitively.
To show the superiority of our proposed model, we conduct comparative studies against other state-of-the-art machine learning and deep learning models, including support vector regression (SVR), decision tree regression (DTR), random forest regression (RFR), the multilayer perceptron (MLP), a single GRU, a SimpleRNN and an LSTM model on the No. 8 dataset. Table 3 presents the RMSE, MAPE and R² scores of these models in our experiments, demonstrating that our proposed model significantly outperforms the others in R².
To build the complete saturation line prediction model and show the reliability of our proposed model together with its parameter settings, we compared different hyperparameters such as the batch size, the number of convolutional filters, the max-pooling size and the number of LSTM cells. Table 4 lists the different combinations of hyperparameters. Considering the evaluation metrics and running time together, we choose the design shown in Case 9. Specifically, although Case 2 and Case 5 achieve slightly better metric values than Case 9, their runtime is almost double; predictions perform worse in real time when the running time is excessive, and this disadvantage grows more pronounced with larger amounts of data. Case 3 needs the least runtime but achieves low accuracy. To be clear, the Case 9 configuration is: a batch size of 64, one convolutional layer with 32 filters, a max-pooling layer of size 2, and two LSTM layers with 25 and 50 units.

VII. ABLATION STUDY
We conduct an ablation study to evaluate the effectiveness of the DWT and the channel and temporal-wise attention. Since our proposed model includes the DWT process, one convolutional layer, two LSTM layers and two channel and temporal attention modules, we evaluate it by removing these modules in turn; the results are displayed in Table 5. Furthermore, we compare the fitting results with the baseline model (one CNN and one LSTM). Our proposed model (Figure 8) achieves more effective predictions than the baseline (Figure 7), especially at predictions No. 17, No. 21 and No. 28. This is because, although the long-term and short-term dependence and hidden time-series information can be discovered from the data, the prediction accuracy is greatly affected by the noise present in the data. Furthermore, the channel- and temporal-wise features are ignored by the simple CNN-LSTM model.
To overcome the baseline's inability to de-noise the raw data, we applied the DWT to decompose the saturation line into different time-frequency sequences and remove the random noise; the de-noised data is then used to train our proposed model. With the channel and temporal features enhanced, the results for all positions are shown in Table 5. This again proves that the DWT can remove a large amount of useless information, thereby helping our model explore the time-series information hidden in the data more accurately, as can also be seen by comparing Table 5 with Figures 7 and 8. For example, on dataset No. 21 the baseline achieves an R² of 0.851, while our proposed model reaches 0.961. Removing the DWT yields an R² of 0.959; removing the attention module yields 0.957; and removing one LSTM drops the R² from 0.961 to 0.951. These results show that all components are crucial to our proposed model.

VIII. DISCUSSION AND CONCLUSION
In this work, we applied a new method to predict the safety of a tailings pond according to the saturation line; to our knowledge, our proposed model is the first used for tailings pond risk prediction. Compared with traditional methods, this risk evaluation method for tailings ponds offers high accuracy and high real-time performance.
The contributions of this work are threefold. First, we propose an effective channel and temporal attention-based CNN-LSTM network to predict the saturation line, which achieves satisfactory performance in terms of MAPE, RMSE and R². Second, we compare our proposed model with different hyperparameters and with other state-of-the-art models. Third, we conduct ablation studies to confirm the components of our model. The wavelet transform is applied to overcome the original CNN-LSTM model's inability to de-noise the data: it decomposes the data into 4 levels of wavelets, uses the rigrsure threshold to de-noise the decomposed wavelets, reconstructs them, and then feeds the reconstructed useful signal to our attention-based model to obtain better prediction results. These experiments demonstrate the applicability of the proposed model to the tailings pond risk prediction task. Additionally, our model can also be used for other time-series predictions such as water levels, weather, and air quality. The model is evidently capable not only of extracting and recognizing spatial and time-series structures, but also of identifying long-term and short-term series information.
Our method uses a single factor; in the future, we will focus on more safety monitoring parameters of the tailings pond, such as underground displacement, ground displacement and dry beach length. We should also build a risk-level scale corresponding to tailings pond monitoring so that future work can more intuitively reflect the safety of the tailings pond.
LINCHAO LI received the M.S. degree from China Jiliang University. He is a machine learning algorithm engineer. His research interests include computer vision and image processing.
XUWEI WANG received the M.S. degree from the College of Civil Engineering and Architecture, Zhejiang University. He has 5 years of working experience in the computer vision field and holds three authorized invention patents. He is currently a doctoral student majoring in Computer Science and Technology at Zhejiang University. His research interests include AI, deep learning and computer vision.
QING LI is currently a Professor with China Jiliang University. He was the dean of the School of Mechanical and Electrical Engineering, China Jiliang University, and is now an executive director of China's metrological testing association. He was selected as a famous teacher of the national "Ten-Thousand Talents Program" in 2017. His current research directions are dynamic measurement and control, and sensing technology. He has presided over the completion of more than 60 science and technology projects.
SHENGYAO JIA received the B.E. degree in measurement and control technology and instrumentation and the M.S. degree in detection technology and automation from China Jiliang University, Hangzhou, China, in 2008 and 2011, respectively, and the Ph.D. degree in agricultural electrification and automation from Zhejiang University, Hangzhou, China, in 2015. He is a lecturer with China Jiliang University. His current research interests include energy harvesting systems, sensors and measuring technology, and embedded systems.
RENYUAN TONG graduated from East China Normal University. He is now an associate professor in the College of Mechanical and Electrical Engineering, China Jiliang University. His research interest is detection technology.