A Gated Recurrent Unit Network Model for Predicting Open Channel Flow in Coal Mines Based on Attention Mechanisms

The prediction of water inflow during coal mining is an important issue. There are many factors that can affect the water inflow in mines. The intercoupling of these factors makes it difficult for current water inflow forecasting methods to meet the needs of real-time forecasting. Open channels are the main devices used for mine drainage, and their flow rate reflects the water inrush of a mine to some extent. This paper uses a hybrid neural network model combining attention mechanisms and a gated recurrent unit network to make real-time predictions of open channel flow. First, attention mechanisms are used to learn the interdependence between multisource hydrosensor data, and then, a gated recurrent unit network is employed to capture the dependencies on different time scales to improve the prediction accuracy of the neural network model. Finally, we design a series of comparative experiments to verify and analyse the performance of the hybrid neural network model. The experimental verification shows that the proposed model can learn the dependency relationships among multisource sensors, and the modelling of these dependencies can greatly improve the prediction accuracy of real-time flow in open channels.


I. INTRODUCTION
As the cornerstone of China's economic development, coal will continue to be the country's primary source of energy for the foreseeable future. The issue of safe coal production has always been a focus of attention because of the complex hydrological and geological conditions in China's mining areas. As China's demand for coal continues to grow, the scale of coal mining widens, the mining depth increases, and the probability of flooding and other water damage accidents rises [1], [2]. Mine water inflow is the primary basis for rationally designing mine drainage systems and formulating mine water prevention and control measures. The dynamic prediction of mine water inflow has therefore become an extensively studied problem.
The existing methods of predicting mine water inflow can be roughly divided into two types: analytical methods and numerical simulations. Analytical methods use hydrogeological parameters to establish a conceptual hydrogeological model of a mining area and to calculate an analytical solution for the water inflow during mining operations [3], [4], [5]. Among these solutions, the large well method has become a commonly used prediction approach since it allows the use of specific assumptions and simple geological and boundary conditions [6]. Li et al. proposed a generalized large well method for dynamically predicting and evaluating groundwater levels during mining operations, taking the Yimin open-pit mine as an example to verify the effectiveness of the analysis method [4]. Wu et al. used the large well method and a numerical simulation to calculate the comprehensive mine inflow volume in three goafs and compared and analysed the water levels in different periods to provide a reference for decision makers to improve the level of safety in mine production [7].
The associate editor coordinating the review of this manuscript and approving it for publication was Jenny Mahoney.
Numerical simulation is currently the most widely used method; it also employs hydrogeological parameters to establish a groundwater flow model of the mining area and uses the model to simulate the change of groundwater in the mining area during the mining process to numerically predict mine water inflows [8]-[11]. Zhang [15]. Mu et al. established a conceptual and mathematical model of a karst flow system and carried out a numerical simulation of the karst flow system using a finite difference method [16]. Both analytical methods and numerical simulations can achieve satisfactory prediction accuracy when the hydrogeological parameters are detailed and the established model is sufficiently accurate. However, analytical methods are based on certain assumptions and specific boundary conditions, which limits their applicability to different mining situations. For example, most analytical solutions can neither directly explain the gushing water from the floor of a mine [17] nor simulate the head and the saturated and unsaturated flow conditions of a confined aquifer [18]. Numerical simulation methods require many hydrogeological parameters, such as aquifer permeability coefficients, aquifer transmissivities, and rainfall data. Obtaining and determining these parameters is very difficult, expensive, and time consuming. In addition, errors in these parameters accumulate in the model, thereby increasing the uncertainty of the final result.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In recent years, machine learning algorithms have achieved great success in areas such as target recognition [19], natural language processing [20], and sequence prediction [21]-[23]. In particular, neural networks such as recurrent neural networks (RNNs) [24], [25], deep belief networks [26], and radial basis function networks [27] have demonstrated excellent performance in sequence prediction problems. Neural networks have therefore been widely used in the field of mine water damage prediction, including a prediction model of flow height in a fractured zone [28] and a framework for the probability assessment of mine water inrush accidents [29]. Bahrami et al. [30] designed two hybrid methods coupling artificial neural networks with genetic algorithms and simulated annealing methods to predict the head of an open-pit mine. The results show that these two hybrid methods can compete with numerical models to some extent. Ardejani et al. [31] used a neural network model to predict the rebound process of groundwater after dewatering stopped at a restored open-pit coal yard in the East Midlands, UK. The predicted values were very close to the monitoring results, indicating that the model predictions were satisfactory.
Although the models proposed by these studies have achieved good prediction accuracy, their prediction time steps are measured in days, months and even years. Forecasting with such a long time step can only be used to analyse the general trends of mine water inrushes, which is less helpful for real-time warnings of mine water damage. Open channels are the main devices used for coal mine water discharges. The effective prediction of open channel flows can not only help predict the amount of water in a mine but also acquire the precursory information of water damage problems, such as water permeability, in a timely manner by comparing and analysing historical data.
Attention mechanisms were first proposed in the field of visual images. In 2014, the Google DeepMind team [32] used attention mechanisms in a recurrent neural network model to classify images, which made them a research topic of interest. Subsequently, in [33], Bahdanau et al. used an attention-like mechanism to perform translation and alignment simultaneously on machine translation tasks. Their work was the first to apply the attention mechanism to the field of natural language processing. Since then, attention mechanisms have been widely used in various natural language processing tasks based on neural network models such as recurrent neural networks or convolutional neural networks. In 2017, a study published by the Google Machine Translation team [34] made extensive use of self-attention to learn text representations. Self-attention has since become a research area of interest and has been applied to various natural language processing tasks. Attention mechanisms have been successful in the fields of visual images and natural language processing because they can model dependencies regardless of their distance in the input or output sequence.
Gated recurrent unit (GRU) networks were proposed by Cho et al. [36] as a variant of long short-term memory (LSTM) networks. In addition to solving the long-term dependence problem of traditional RNNs [37], GRU networks simplify the calculation of the gating functions in LSTM networks and improve the computational efficiency. This paper proposes a hybrid neural network model combining attention mechanisms and GRU networks to predict the flow of open channels with better performance than conventional methods. The proposed methodology can provide an important basis for the design of coal mine drainage capacity and prevention measures. The main contributions are outlined as follows.
1) The proposed method addresses the problem of an excessively large step size in current mine water inrush prediction research. Reducing the prediction step size can improve the real-time prediction of mine water inrushes.
2) The attention mechanism combined with the GRU network is used to strengthen the GRU network's ability to mine dependencies. This enables the network to more comprehensively obtain the association relationship of the input data and improve the network prediction performance.
3) The aquifer elevation, water temperature, and burial depth, the open channel flow and water temperature, and the dibhole liquid level data are chosen as the inputs to obtain more comprehensive information. The results validate the performance improvement.
This paper describes the algorithm of the attention mechanisms, the GRU and the overall architecture of the hybrid neural network model in Section II. In Section III, the experimental data are explained, and the optimization method of the model is introduced. Section IV presents the experimental design, results and analysis, and Section V offers the conclusions.

II. ARCHITECTURES AND ALGORITHMS

A. ATTENTION MECHANISMS
The input of an attention algorithm consists of sequences of queries, keys and values. An attention algorithm can be described as a mapping from a query to a series of keys and values. The two most commonly used attention functions are dot-product (multiplicative) attention and additive attention [33]. To improve the operation efficiency, we pack the sequences of queries, keys, and values into the matrices $Q = (\mathrm{query}_1, \ldots, \mathrm{query}_T)$, $K = (\mathrm{key}_1, \ldots, \mathrm{key}_T)$, and $V = (\mathrm{value}_1, \ldots, \mathrm{value}_T)$, where $\mathrm{query}_i$, $\mathrm{key}_i$, and $\mathrm{value}_i$ are the query, key, and value vectors stacked into the matrices $Q \in \mathbb{R}^{T \times L}$, $K \in \mathbb{R}^{T \times L}$, and $V \in \mathbb{R}^{T \times L}$, respectively, $v_{ij}$ is an element of the matrix $V$, $\mathbb{R}$ is the set of real numbers, $T$ is the number of packed vectors in each matrix, and $L$ is the dimension of the input vectors, with $i = 1, 2, \ldots, T$ and $j = 1, 2, \ldots, L$. The computation of the elements in the attention matrix can be expressed abstractly [35] in terms of a function $\mathrm{Similarity}(\mathrm{query}_i, \mathrm{key}_i)$, which calculates the correlation or similarity between queries and keys. Different attention functions use different calculation methods. In our work, the dependencies between the target parameter and the other input parameters differ from one prediction target to another. We therefore compute the similarity with a feed-forward neural network that is jointly trained with the other components of the predictive network; it operates on the elements $q_l$ and $k_l$ of the vectors $\mathrm{query}_i$ and $\mathrm{key}_j$, respectively, with trainable weights $w_{il}$. Then, the SoftMax function, also called the normalized exponential function, is used to normalize the calculated correlation. On the one hand, this organizes the correlations into a probability distribution that sums to 1. On the other hand, the exponential in the SoftMax function makes the weights of important elements more prominent.
Finally, the attention value is obtained as the weighted sum of the values. Self-attention, also called intra-attention, is calculated in the same way as attention. Unlike in the general attention mechanism, the queries, keys, and values in self-attention come from the same input sequence, that is, $\mathrm{query}_i = \mathrm{key}_i = \mathrm{value}_i$. Using the same input sequence makes it possible to learn the dependencies within the sequence and to relate the different positions of a single sequence when calculating its representation.
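The normalization and weighted-summation steps described above can be written compactly in the standard softmax-attention form. This is a sketch under assumed notation: $s_{ij}$ (a symbol not used in the paper) stands for the similarity score between $\mathrm{query}_i$ and $\mathrm{key}_j$, and normalizing each row over $j$ is the usual convention rather than something the text states explicitly.

```latex
\begin{align*}
  \alpha_{ij} &= \frac{\exp(s_{ij})}{\sum_{k=1}^{T} \exp(s_{ik})},
  \qquad \sum_{j=1}^{T} \alpha_{ij} = 1, \\
  \mathrm{attention}_i &= \sum_{j=1}^{T} \alpha_{ij}\, \mathrm{value}_j .
\end{align*}
```

Because the exponential grows quickly, a score that is only moderately larger than the others in its row receives a much larger weight, which is how the SoftMax step makes important elements more prominent.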
In practice, the input $X$ exists in the form of a matrix. Suppose that $x_{ij}$ is an element of the matrix $X \in \mathbb{R}^{m \times n}$, where $n$ is the number of intercepted time steps and $m$ is the number of sensor parameters. The attention mechanism used in this work is expressed by Eqs. (5)-(7). For example, we take the monitoring values of 5 sensors at $t$ times to calculate the attention matrix and analyse the correlation between the sensors. The input $X$ can then be represented as a matrix whose rows $\mathrm{sensor}_{1i}, \mathrm{sensor}_{2i}, \ldots, \mathrm{sensor}_{5i}$ are the monitoring values of the 5 sensors at times $i = 1, 2, \ldots, t$. Then, using Eq. (6) and Eq. (7), the similarity matrix shown in Fig. 1 can be calculated. As shown, the similarity matrix is composed of elements with values between 0 and 1. The greater the value of an element is, the higher the correlation between the corresponding sensors. Therefore, when the elements of the similarity matrix are used as the weights $\alpha_{ij}$ and the attention matrix is calculated through Eq. (5) with the original input $X$, the values with a high degree of correlation are amplified according to their weights, which achieves the purpose of focusing attention.
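The sensor-level self-attention described above can be sketched in plain Python. The similarity used here, a weighted inner product between the rows of two sensors, is an assumption chosen to match the description of trainable weights acting on the elements of the query and key; the function names, the all-ones weights, and the toy values are hypothetical and are not taken from the paper.

```python
import math

def softmax(row):
    """Normalize a list of scores into weights that sum to 1."""
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def sensor_self_attention(X, w):
    """X: m sensors x n time steps; w: m x n trainable weights (assumed shape).

    Returns (alpha, attended): alpha is the m x m similarity matrix and
    attended = alpha @ X has the same shape as X.
    """
    m, n = len(X), len(X[0])
    # Assumed similarity: weighted inner product between sensor i and sensor j.
    scores = [[sum(w[i][l] * X[i][l] * X[j][l] for l in range(n))
               for j in range(m)] for i in range(m)]
    alpha = [softmax(row) for row in scores]
    # Weighted summation: each output row mixes all sensor rows.
    attended = [[sum(alpha[i][k] * X[k][l] for k in range(m))
                 for l in range(n)] for i in range(m)]
    return alpha, attended

# Toy input: 3 sensors, 4 time steps (made-up values, not mine data).
X = [[0.1, 0.2, 0.3, 0.4],
     [0.1, 0.2, 0.3, 0.5],
     [0.9, 0.1, 0.8, 0.0]]
w = [[1.0] * 4 for _ in range(3)]  # untrained weights, all ones
alpha, attended = sensor_self_attention(X, w)
for row in alpha:
    print([round(a, 3) for a in row])
```

Each row of `alpha` is a probability distribution over the sensors, so every entry lies between 0 and 1, as in the similarity matrix of Fig. 1. In a trained model the weights `w` would be learned jointly with the rest of the network.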

B. GATED RECURRENT UNIT
The LSTM introduces three gate functions on top of the traditional recurrent neural network structure (the input gate, forget gate and output gate) to control the input value, memory value and output value, respectively, so as to retain long-term memory and solve the long-term dependence problem of the traditional recurrent neural network. The GRU combines the forget gate and input gate of the LSTM into a single update gate $z$:
$$z = \sigma(W_z x + U_z h_{t-1}),$$
where $x$ is the input, $h_{t-1}$ is the previous hidden state, $W_z$ and $U_z$ are the learned weight matrices, and $\sigma$ is the logistic sigmoid function. Similarly, the GRU replaces the output gate of the LSTM with a reset gate $r$:
$$r = \sigma(W_r x + U_r h_{t-1}),$$
where $W_r$ and $U_r$ are the learned weight matrices. In the end, the output of the hidden unit is:
$$h_t = z \odot h_{t-1} + (1 - z) \odot \tilde{h}, \qquad \tilde{h} = \tanh\big(W x + U (r \odot h_{t-1})\big),$$
where $\tilde{h}$ is a new hidden state, $W$ and $U$ are the learned weight matrices, $\odot$ denotes the elementwise product, and $h_t$ is the hidden state at the current moment. An illustration of the hidden activation function is shown in Fig. 2. The update gate $z$ determines whether the hidden state is to be updated with the new hidden state $\tilde{h}$. The reset gate $r$ decides whether the previous hidden state is ignored.
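A single GRU step can be sketched in plain Python as follows. The sketch assumes the formulation of Cho et al. [36] without bias terms, a tanh candidate activation, and the convention that an update gate close to 1 keeps the previous state; the helper names and the all-zero demonstration weights are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def gru_step(x, h_prev, p):
    """One GRU step; p holds the weight matrices Wz, Uz, Wr, Ur, W, U (no biases)."""
    # Update gate z and reset gate r, both computed from x and the previous state.
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h_prev))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h_prev))]
    # The reset gate scales the previous state before it enters the candidate.
    gated = [r_j * h_j for r_j, h_j in zip(r, h_prev)]
    h_new = [math.tanh(a + b) for a, b in zip(matvec(p["W"], x), matvec(p["U"], gated))]
    # Assumed convention: z close to 1 keeps the previous state.
    return [z_j * h_j + (1.0 - z_j) * hn_j for z_j, h_j, hn_j in zip(z, h_prev, h_new)]

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

# Demonstration with 3 inputs and 2 hidden units; all weights are zero.
params = {"Wz": zeros(2, 3), "Uz": zeros(2, 2),
          "Wr": zeros(2, 3), "Ur": zeros(2, 2),
          "W": zeros(2, 3), "U": zeros(2, 2)}
h = gru_step([1.0, -2.0, 0.5], [0.4, -0.2], params)
print(h)  # both gates equal 0.5 and the candidate is 0, so h = [0.2, -0.1]
```

With zero weights both gates sit at 0.5 and the candidate state is zero, so the step simply halves the previous state, which makes the interpolation role of the update gate easy to see.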

C. MODEL ARCHITECTURE
Because of its network architecture, a recurrent neural network exhibits excellent performance in processing sequence data [38]. The neural network model proposed in this paper adds attention mechanisms to a GRU network architecture. The overall architecture is shown in Fig. 3.
First, the attention mechanism is applied to the input of the neural network to learn the dependencies among the coal mine hydrological monitoring parameters. Then, a GRU network is utilized to capture the dependencies on different time scales. We employ a residual connection [39] between the GRU network and the attention mechanism, followed by layer normalization. Finally, a simple fully connected feed-forward network is used for output.
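The residual connection followed by layer normalization can be illustrated with a small sketch. The text does not specify exactly which tensors are added, so the form LayerNorm(x + sublayer(x)) used here, the absence of a learnable gain and bias, and the toy `sublayer` are all assumptions made for illustration.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def residual_block(x, sublayer):
    """Return LayerNorm(x + sublayer(x)): the skip path preserves the raw input."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

# Hypothetical stand-in for the attention sublayer: scale the input by 0.5.
out = residual_block([1.0, 2.0, 4.0], lambda v: [0.5 * a for a in v])
print([round(a, 3) for a in out])
```

The skip path lets the original sensor values reach the next layer even when the sublayer output is small, and the normalization keeps the scale of the combined signal stable during training.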

III. LEARNING ALGORITHM FOR MODEL

A. DATA DESCRIPTION
The data used for training and testing in this paper were collected by hydrological sensors in the same mine. We design two sets of data to verify the ability of the hybrid neural network to learn data dependencies. One set is single-sensor data. We input the flow data monitored at times t − n to t in an open channel and predict the flow at time t + 1 in this open channel. The other set is multi-sensor data. We input data from a total of 16 hydrological sensors across the entire underground mine at times t − n to t, including the elevation, water temperature, and burial depth from the aquifer sensors, the open channel flow and water temperature, and the dibhole liquid level, and predict the flow in the same open channel as for the single-sensor data. To better reflect the changes in the hydrological monitoring data under the mine and to highlight the correlation between the various parameters when they change or are affected by other parameters, both the single-sensor data and the multi-sensor data use the change between the value at the current time and the value at the previous time.
[Algorithm 1 (Adam), partially recovered: calculate the gradient of the stochastic objective function at timestep t, $g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$; update the moment estimates.]
Adam is an efficient stochastic optimization method that only requires first-order gradients [40]. Unlike traditional stochastic gradient descent, which keeps a single learning rate to update all the weights, Adam calculates first-order moment estimates and second-order raw moment estimates of the gradient to design independent adaptive learning rates for different parameters. Adam combines the advantages of the adaptive subgradient method (AdaGrad) [41] and the root mean square propagation (RMSProp) optimization algorithm. It works well not only with sparse gradients but also in on-line and non-stationary settings.
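Let $f(\theta)$ be the objective function that needs to be optimized, and $\theta$ be the parameters that need to be solved for in the objective function; then, the gradient $g_t$ at timestep $t$ can be expressed as
$$g_t = \nabla_\theta f_t(\theta_{t-1}).$$
The first moment estimate $m_t$ and second moment estimate $v_t$ of the gradient $g_t$ are estimates of the expectations of $g_t$ and $g_t^2$, respectively. $m_t$ and $v_t$ are given by
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t,$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2,$$
where the hyperparameters $\beta_1, \beta_2 \in [0, 1)$ control the exponential decay rates of $m_t$ and $v_t$, and $g_t^2$ indicates the elementwise square of $g_t$. Because $m_t$ and $v_t$ are initialized as zero vectors, the moment estimates are biased towards zero, especially during the initial time steps and when the decay rates are small (that is, when $\beta_1$ and $\beta_2$ are close to 1). Therefore, this initialization bias needs to be corrected. Taking $v_t$ as an example, its update rule can be expanded as
$$v_t = (1 - \beta_2) \sum_{i=1}^{t} \beta_2^{\,t-i}\, g_i^2.$$
By taking the expectations of both sides of this expansion, it can be concluded that
$$E[v_t] = E[g_t^2]\,(1 - \beta_2^{\,t}) + \zeta,$$
where $\zeta$ is 0 or a very small number; the remaining factor $(1 - \beta_2^{\,t})$ is the initialization bias that we need to correct. The bias-corrected estimates $\hat{m}_t$ and $\hat{v}_t$ are given by
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^{\,t}}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^{\,t}}.$$
The final parameter update is calculated as follows:
$$\theta_t = \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon},$$
where $\alpha$ is the preset learning rate and $\varepsilon$ is the preset blur factor, a small constant that prevents division by zero. Adam's pseudocode is shown as Algorithm 1.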
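The update rule can be sketched in a few lines of plain Python. This follows the standard Adam algorithm [40]; the function name `adam_minimize`, the toy quadratic objective, and the learning rate of 0.1 used in the demonstration are illustrative assumptions and are not the settings used in the experiments.

```python
import math

def adam_minimize(grad, theta, alpha=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Minimize an objective given its gradient function, using Adam."""
    theta = list(theta)          # work on a copy
    m = [0.0] * len(theta)       # first moment estimates, initialized to zero
    v = [0.0] * len(theta)       # second raw moment estimates, initialized to zero
    for t in range(1, steps + 1):
        g = grad(theta)
        for i, g_i in enumerate(g):
            m[i] = beta1 * m[i] + (1 - beta1) * g_i
            v[i] = beta2 * v[i] + (1 - beta2) * g_i * g_i
            m_hat = m[i] / (1 - beta1 ** t)   # correct the initialization bias
            v_hat = v[i] / (1 - beta2 ** t)
            theta[i] -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Toy objective (assumed): f(theta) = (theta_0 - 3)^2 + (theta_1 + 1)^2.
def toy_grad(theta):
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]

solution = adam_minimize(toy_grad, [0.0, 0.0], alpha=0.1, steps=1000)
print([round(s, 4) for s in solution])
```

Note that each parameter has its own `m` and `v`, so the effective step size adapts per parameter, which is the property that distinguishes Adam from stochastic gradient descent with a single learning rate.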

IV. EXPERIMENT AND ANALYSIS

A. EVALUATION METHOD AND EXPERIMENT SETUP
To obtain a reliable and stable model, verification of the model is indispensable. In this experiment, because the data have a strong temporal dependence, nested cross-validation is used to evaluate the model. The nested cross-validation process can provide a nearly unbiased estimate of the true error. To generate a better estimate of the model prediction error, we split the data into training and test sets multiple times and then average the error over these splits. A schematic diagram of the nested cross-validation is shown in Fig. 4. We split the training data and test data in a 7:3 ratio. To facilitate further segmentation during the nested cross-validation, the test data are rounded up to 10 validations. The numbers of training and test samples are shown in Table 1.
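One common way to evaluate a model on time-ordered data is forward chaining, in which each test segment is predicted by a model trained only on earlier samples. The sketch below is an assumption about how such a scheme could look with a 7:3 split and 10 test segments; the exact procedure of Fig. 4 is not recoverable from the text, and the function name and the sample count of 1000 are hypothetical.

```python
def forward_chaining_splits(n_samples, train_ratio=0.7, n_folds=10):
    """Yield (train_indices, test_indices) pairs that respect time order."""
    n_train = round(n_samples * train_ratio)
    n_test = n_samples - n_train
    n_test -= n_test % n_folds           # make the test part divisible by n_folds
    fold = n_test // n_folds
    first_test = n_samples - n_test      # leftover samples join the training part
    for k in range(n_folds):
        start = first_test + k * fold
        # Train on everything before the fold, test on the fold itself.
        yield range(0, start), range(start, start + fold)

splits = list(forward_chaining_splits(1000))
for train_idx, test_idx in splits[:3]:
    print(len(train_idx), test_idx.start, test_idx.stop)
```

Averaging the error over the folds gives one estimate per experiment, and no fold ever sees samples from its own future, which matters for data with strong temporal dependence.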
The parameters of the hybrid neural network model are set as in Table 2. The four parameters $\alpha$, $\beta_1$, $\beta_2$ and $\varepsilon$ of the

B. EXPERIMENTAL RESULTS AND ANALYSIS
This paper uses a GRU network model with attention mechanisms for the real-time prediction of open channel flow in mines and compares this model with three other recurrent neural network models: the LSTM, the GRU, and an LSTM model that also uses the attention mechanism. The average training cost and prediction error over repeated experiments are shown in Table 3, and the prediction results of one experiment are shown in Fig. 6 and Fig. 7. Fig. 6 shows the prediction results for single-sensor data and multi-sensor data using the LSTM or GRU model alone. Under the same set of data, the prediction curves of the LSTM and GRU are similar, although the predictions of the GRU are slightly better than those of the LSTM at some times.

1) MODEL PERFORMANCE ANALYSIS
As shown in Table 3, in terms of the prediction error, the RMSE and MAE of the GRU models are slightly better than those of the LSTM models under both sets of data, but the difference is small. In terms of the training cost, the GRU model is also cheaper than the LSTM model. Under multi-sensor data in particular, the difference between the two models is more than one second.
It can be seen that, for the open channel flow prediction problem under the data selected in this paper, although the GRU model combines and simplifies the gate functions of the LSTM model, its gates still effectively pass the information obtained from the previous hidden state to the current hidden state, thereby helping the recurrent neural network to remember long-term information. At the same time, because of the simplified gate functions, the hidden state is represented more compactly, which reduces the overall amount of computation in the model. Multi-sensor data in particular require a large amount of computation, so the reduction in the training cost of the GRU model is more visible there.
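The saving in computation can be made concrete by counting trainable parameters in the recurrent layer. The sketch below assumes the standard formulations with one bias vector per block, 16 input features (the number of sensors in the multi-sensor data), and 50 hidden units; these sizes are illustrative and the function name is hypothetical.

```python
def recurrent_params(n_inputs, n_hidden, n_blocks):
    """Parameters per block: input matrix + recurrent matrix + one bias vector."""
    per_block = n_hidden * n_inputs + n_hidden * n_hidden + n_hidden
    return n_blocks * per_block

n_inputs, n_hidden = 16, 50  # illustrative sizes (assumed)
lstm_params = recurrent_params(n_inputs, n_hidden, 4)  # 3 gates + candidate
gru_params = recurrent_params(n_inputs, n_hidden, 3)   # 2 gates + candidate
print(lstm_params, gru_params, gru_params / lstm_params)  # 13400 10050 0.75
```

Under these assumptions the GRU layer has three quarters of the parameters of an LSTM layer of the same width, independent of the layer sizes, which is consistent with its lower training cost.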

2) DEPENDENCY MODELLING ANALYSIS
The LSTM and GRU have separate gate functions for each hidden unit, which gives them the ability to capture dependencies on different time scales. Taking the GRU as an example, when the reset gates in a hidden unit are active, short-term dependencies are captured, and when the update gates are active, long-term dependencies are captured. Fig. 6 compares the prediction results of the same model using single-sensor data and multi-sensor data: under multi-sensor data, because the input data contain more information, the model captures more temporal dependencies. Therefore, the prediction result is not as close to the average line as it is under single-sensor data, and the prediction accuracy improves.
The prediction results of the LSTM and GRU models after introducing attention mechanisms are shown in Fig. 7. Compared with the results of the models without attention mechanisms in Fig. 6, the prediction accuracy of the models with attention mechanisms is considerably higher. For example, in the case of multi-sensor data, the RMSE and MAE of the prediction results of the pure GRU model are 19.689 m³/h and 16.927 m³/h, respectively, and they are reduced to 15.265 m³/h and 12.656 m³/h, respectively, after the attention mechanism is added. After adding attention mechanisms, the prediction curves of the models are closer to the original data curve. This demonstrates the ability of the attention mechanism to learn inter-data dependencies and verifies the importance of dependency modelling for improving prediction accuracy in such prediction problems.
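The two error measures and the size of the improvement can be checked with a short calculation. The RMSE and MAE definitions below are the standard ones; the four toy flow values are made up for illustration, while the error figures passed to `relative_reduction` are the ones quoted above for the multi-sensor GRU model with and without attention.

```python
import math

def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def relative_reduction(before, after):
    """Fraction by which an error measure decreased."""
    return (before - after) / before

# Toy flows in m^3/h (made-up values, not mine data).
actual = [100.0, 110.0, 120.0, 130.0]
predicted = [98.0, 113.0, 119.0, 134.0]
print(round(rmse(actual, predicted), 3), mae(actual, predicted))

# Reported errors for the GRU model without and with attention.
rmse_drop = relative_reduction(19.689, 15.265)
mae_drop = relative_reduction(16.927, 12.656)
print(round(100 * rmse_drop, 1), round(100 * mae_drop, 1))
```

With the reported figures, adding attention lowers the RMSE by about 22.5% and the MAE by about 25.2% relative to the pure GRU model.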
The GRU model with an attention mechanism is used to perform another open channel flow prediction experiment. Fig. 8 compares the curves of the actual flow and the predicted flow. Table 4 and Fig. 8 show that the hybrid neural network model proposed in this paper also achieves good results in this second open channel flow prediction experiment. This further validates the predictive ability of the proposed model.

C. COMPARISON RESULTS
To further verify the performance of our proposed model in the real-time prediction of open channel flow, we compared our model with a series of models from the same research area and with models from other research areas that have similar algorithm structures. The models for comparison include back propagation neural networks (BPNNs), the hybrid method coupling artificial neural networks with genetic algorithm methods (ANN-GA) for predicting groundwater inflow in mines [30], the random forest regression method (RFR) for predicting the height of fractured water-conducting zones in coal roof strata [42], the hybrid method coupling LSTM with support vector machines (LSTM+SVM) for fault prediction [43], and a deep neural network-based traffic flow prediction model (DNN-BTF) that also uses attention mechanisms and GRU networks [44]. The comparative experiment was conducted using the same samples as in Table 1. Because of the differences in data dimensions, some parameters of the above models were adjusted in the experiment; for example, the batch size was unified as 30, the number of output units was changed to 1, the number of LSTM/GRU units was changed to 50, and the number of epochs was set to 20, while the remaining parameters were unchanged. The average prediction error over repeated experiments is shown in Fig. 9 and Table 5.
The results show that the proposed method exhibits better performance than BPNNs, RFR, ANN-GA, DNN-BTF and LSTM+SVM. GRU networks can extract long-term and short-term dependencies from data through their hidden layer units, whereas BPNNs, RFR, and ANN-GA cannot extract this information. Although LSTM+SVM uses LSTM networks to obtain time dependence, it lacks an attention mechanism to learn the dependence relationship between multi-sensor data. DNN-BTF also uses the attention mechanism and GRU networks, but it differs from our method in two respects. On the one hand, the method we propose uses the self-attention mechanism, and a feed-forward network replaces the traditional similarity calculation function so that the method can better learn the dependency relationship for the target data. On the other hand, DNN-BTF uses convolutional networks, which weaken the input boundary features when extracting features; this is extremely disadvantageous for the prediction of time sequences.
In general, the real-time prediction method of open channel flow based on GRU networks and attention mechanisms proposed in this paper is a noteworthy method for early-warning research on coal mine water disasters.

V. CONCLUSION
In this work, we propose a hybrid neural network model combining the attention mechanism and a GRU structure for the real-time prediction of open channel flow in a mine. In a series of comparative experiments, the model demonstrated a higher prediction accuracy than the traditional LSTM and GRU algorithms. In addition, these experiments reveal the dependence relationship between the sensor data, and exploring this relationship helps us predict the amount of water in the mine. Compared with some traditional recurrent neural network models, the GRU network model based on attention mechanisms can better learn the dependence relationship between input data.
In future work, we will use more detailed monitoring data for training on the basis of this hybrid neural network model and at the same time improve the model such that it can more fully and reasonably establish the dependency model between data. This work will further improve the accuracy of the prediction results, provide technical support for the prevention of water inflows and guarantee coal mine safety. We can also apply the hybrid neural network to automated manufacturing systems [45] and social networks [46].