Research on a Mine Gas Concentration Forecasting Model Based on a GRU Network

To improve the level of safety in coal mine production, it is important to enhance the accuracy of coal mine gas concentration prediction. In the context of deep learning, we propose a mine gas concentration prediction model based on gated recurrent units (GRUs). The GRU model is not only simple in structure but also offers high prediction accuracy, and it can make full use of the time-series characteristics of mine gas concentration data. First, we apply the Pauta criterion and Lagrange interpolation to preprocess mine gas concentration monitoring data. Then, a spatial reconstruction method is used to construct the training set for the prediction model. Finally, the mean square error (MSE) is used as the loss function and adaptive moment estimation (Adam) is used as the optimization algorithm to determine the learning parameters of the GRU model for predicting gas concentration values. Experimental results show that compared with models based on support vector regression (SVR), a backpropagation neural network (BPNN), a recurrent neural network (RNN) and a long short-term memory (LSTM) network, the proposed GRU-based model for gas concentration prediction achieves reduced error on the test set, and moreover, the GRU model is more efficient than the LSTM model in terms of run time. Thus, the accuracy and efficiency of gas concentration prediction are both improved, showing that the proposed model is of high practical value.


I. INTRODUCTION
Coal is an important pillar of China's primary energy consumption, and it is related to the economic and energy security of China. Frequent mine gas disasters have caused significant losses to China's coal industry and have cost many miners their lives. At present, coal enterprises have installed safety monitoring systems, but the main functions provided by these systems are simply the short-term identification of and response to disasters. These monitoring systems fail to fully exploit the value of the available gas data, resulting in insufficient forecasting ability for mine gas disasters. Therefore, there is an urgent need to introduce new methods and technologies for supporting improved safety in coal mine production. Through the analysis of monitoring data, reliable and accurate forecasting of gas concentration levels can be achieved, thus improving the early warning capability for coal mine gas disasters. Such forecasting will be of great significance for reducing the economic losses caused by mine gas disasters and protecting the lives of miners.
The associate editor coordinating the review of this manuscript and approving it for publication was Liangxiu Han.
Mine gas concentration data are dynamic and nonlinear, and they are influenced both by many natural factors and by mining technology; thus, it is difficult to perform forecasting based on these data using traditional methods [1]. In recent years, neural networks have been widely applied for nonlinear regression modeling and forecasting. Among them, gated recurrent unit (GRU)-based networks have been theoretically proven to be able to represent nonlinear functions with arbitrary accuracy. Therefore, in this paper, we propose a GRU-based algorithm for gas concentration forecasting. The first step is to apply the Pauta criterion to process the noise in the mine gas concentration monitoring data. Then, Lagrange interpolation is applied to fill in the missing values of the monitoring data, which completes the necessary preprocessing. Subsequently, spatial reconstruction of the gas monitoring data is performed to construct the input samples used to train the GRU network. Then, we choose the mean square error (MSE) as the loss function and use adaptive moment estimation (Adam) as the optimization algorithm to construct a GRU learning model that is suitable for predicting the time-varying coal mine gas concentration. Finally, coal mine gas data are used for testing. An experimental evaluation shows that the error of the proposed GRU-based gas concentration prediction model is reduced by 7.9% compared with that of a model based on a long short-term memory (LSTM) network and that its run-time efficiency is simultaneously improved by 13.05%, endowing it with better practical value for application.
The remainder of this paper is organized as follows. Section II reviews some of the relevant literature on mine gas concentration forecasting. Section III describes the dataset and data processing. Section IV introduces the theoretical background on forecasting methods and presents the proposed GRU-based method of mine gas concentration forecasting. Section V presents the parameter settings of the proposed model. The experimental results are analyzed in Section VI, and Section VII concludes the paper.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/

II. LITERATURE STUDY
Considerable research has been conducted in the area of mine gas concentration forecasting. For example, Zhang Zhao-zhao proposed a brain-like multihierarchical modular neural network (BMNN) with applications for gas concentration forecasting [2]. The BMNN model has a brain-like multihierarchical structure and uses a collaborative learning approach. Zhang et al. [3] presented an improved gas concentration prediction model for a digital mine based on gray theory and a backpropagation neural network (BPNN). That paper combined the advantages of gray prediction with the ability of a BPNN to revise the gray prediction model; the researchers built the improved model and carried out specific computer simulations. Stopforth and Davrajh [4] developed equations for closely estimating gas concentrations from the Figaro sensors they used, so that other researchers can identify gas concentrations when using these sensors in different applications. Booth et al. [5] proposed a solution to the shortcomings of a traditional gas emissions forecasting method and analyzed the various characteristics and limitations of gas management design in practice. The proposed method used an improved spatial dataset for prediction while also incorporating basic physics and energy-related principles. Yu and Shi [6] proposed a gas concentration forecasting model based on a radial basis function (RBF) neural network and chaos theory, in accordance with the nonlinear and chaotic time-series characteristics of mine gas data. Liu Kun proposed a method of coal mine gas concentration analysis based on a support vector machine (SVM) model [7]. In that paper, on the one hand, the authors adopted support vector regression (SVR) to predict gas concentration values based on data from other well-performing sensors; on the other hand, they classified the gas concentration data into two classes, corresponding to either totally safe or slightly high concentrations, by applying a model constructed based on either C-support vector classification (SVC) or a one-class SVM.
However, all of the above methods focus on single-feature learning and prediction for mine gas concentration levels, and consequently, their forecasting accuracy is limited. To improve the accuracy of mine gas concentration forecasting, techniques based on time-series forecasting are receiving increasing interest. Since Hinton's team won the ImageNet competition using deep learning in 2012 [8], an increasing number of scholars have begun to pay attention to deep learning and to apply it in various fields. Deep learning has achieved breakthroughs in various fields, including natural language processing [9], [10], speech recognition [11], machine translation [12], and image comprehension [13]. A recurrent neural network (RNN) is a kind of deep neural network that is effective at processing sequential data. In contrast to conventional feed-forward neural networks, an RNN preserves, learns, and records the historical information contained in sequential data by means of recurrently connected hidden layer nodes [14]. As shown in Figure 1, the structure of an RNN includes an input layer, hidden layers, and an output layer; U, V, and W are the weight matrices of the connections from the input layer to the hidden layers, between the hidden states at successive time steps, and from the hidden layers to the output layer.
Since the parameters of an RNN are shared across time steps, the number of parameters is dramatically reduced, thereby shortening the training time. However, an RNN can easily suffer from gradient vanishing or gradient explosion when processing a long sequence of data.
To solve the problems of gradient vanishing, gradient explosion and long-term dependence, Hochreiter and Schmidhuber [15] introduced the LSTM neural network architecture in 1997. LSTM is an improved design for deep neural networks based on the RNN architecture; it also has a chain-like structure, as shown in Figure 2. However, compared with the simple layers of RNN neurons, the structure of LSTM neurons is more complex, as shown in Figure 3.
The loop structure of an LSTM neuron includes three control gates: a forget gate, an external input gate, and an output gate. These gate structures allow the LSTM neuron to update, maintain, or delete information contained within the cell state. An LSTM network is trained using the Backpropagation Through Time (BPTT) algorithm to determine its parameters [16]. The BPTT algorithm calculates the error term of each LSTM neuron in the reverse direction, calculates the gradient of each weight in accordance with the corresponding error term, and then updates the weights using a gradient optimization algorithm.
LSTM networks are widely used because they can avoid the problems of gradient explosion and gradient vanishing encountered in RNNs. Li Weishan presented preliminary discussions on the application of LSTM models in coal mine gas forecasting and early warning systems [17] and confirmed their effectiveness for this purpose. However, the LSTM architecture is complex in structure and prone to overfitting. Therefore, we propose a GRU-based algorithm for gas concentration prediction. The GRU structure is simpler than the LSTM structure, with fewer parameters, a shorter training time, and lower susceptibility to overfitting.

III. RESEARCH DATA

A. DATASET DESCRIPTION
The dataset used in this study was constructed from gas concentration data for a mine working face. We collected the dataset, which consists of 10419 mine gas concentration data points, from January to March 2014, with a sampling interval of 5 minutes. There are no missing raw data, as shown in Figure 4. To highlight the advantages of each model, we divided the dataset into three subsets: one containing 1041 data points, one containing 5209 data points and one containing all 10419 data points. In each subset, 70% of the data points were designated as training data, and the remaining 30% were used as test data to evaluate the prediction accuracy of each model.
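The 70%/30% division of each subset can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes that each subset is taken from the start of the series and that the split is chronological (training data first), neither of which is stated in the text, and the function name `chronological_split` and the placeholder readings are hypothetical.

```python
def chronological_split(series, train_percent=70):
    """Split a time-ordered sequence into a leading training part and a trailing test part.

    Integer arithmetic is used so that the cut index does not depend on
    floating-point rounding.
    """
    cut = len(series) * train_percent // 100
    return series[:cut], series[cut:]


# Placeholder values standing in for the 10419 monitored readings (not real data).
readings = [0.0] * 10419

for size in (1041, 5209, 10419):
    train, test = chronological_split(readings[:size])
    print(size, len(train), len(test))
```

Keeping the test data strictly after the training data avoids evaluating a forecasting model on readings that precede the ones it was trained on.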

B. DATA PREPROCESSING
A typical dataset for the task addressed in this paper consists of gas concentration data and other related data collected by monitoring equipment over a certain period of time (e.g., one hour, one day, or one month). However, these data will generally contain various missing values and outliers due to factors such as process alteration, equipment failure, air volume regulation or other factors related to human activity. Therefore, it is necessary to preprocess the gas concentration monitoring data that are collected in real time to improve the accuracy of mine gas concentration forecasting.
The first step of data preprocessing is to apply the Pauta criterion (the 3σ rule) to process the noise in the gas concentration monitoring data. The Pauta criterion is one of the simplest and most common criteria for identifying gross errors. With the standard deviation given by formula (1), a measured value $x_i$ whose residual satisfies inequality (2), that is, whose absolute residual is larger than three times the standard deviation, is considered to be an outlier and is rejected, resulting in a missing value for the corresponding measurement time.
$$\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2} \tag{1}$$
$$\left|x_i - \bar{x}\right| > 3\sigma \tag{2}$$
where $\bar{x}$ is the average value of the dataset, $\sigma$ is the standard deviation of the dataset, and $n$ is the number of data points. The second step is to apply Lagrange interpolation to fill in the resulting missing values of the monitoring data sequence to obtain a complete set of processed monitoring data.
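The two preprocessing steps can be illustrated with a short sketch. It is not the authors' implementation and makes several assumptions that the text does not state: the standard deviation is the sample form (division by n - 1), each rejected point is refilled from up to `k` valid neighbours on each side (`k = 2` here), and the names `pauta_mask`, `lagrange_value` and `clean_series` are hypothetical.

```python
import statistics


def pauta_mask(values):
    """Return True for each value whose absolute residual exceeds three standard deviations."""
    mean = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample standard deviation (n - 1), an assumption
    return [abs(v - mean) > 3 * sigma for v in values]


def lagrange_value(xs, ys, x):
    """Evaluate at x the Lagrange polynomial passing through the points (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def clean_series(values, k=2):
    """Reject 3-sigma outliers, then refill each gap by Lagrange interpolation
    over up to k valid neighbours on each side (k is an illustrative choice)."""
    bad = pauta_mask(values)
    good_idx = [i for i, is_bad in enumerate(bad) if not is_bad]
    cleaned = list(values)
    for i, is_bad in enumerate(bad):
        if is_bad:
            left = [g for g in good_idx if g < i][-k:]
            right = [g for g in good_idx if g > i][:k]
            nodes = left + right
            cleaned[i] = lagrange_value(nodes, [values[g] for g in nodes], i)
    return cleaned


# Made-up readings with one injected spike at index 10.
series = [0.50, 0.52, 0.51, 0.49] * 5
series[10] = 5.0
print(clean_series(series)[8:13])
```

Using only a few neighbouring nodes keeps the interpolating polynomial low in degree; a polynomial through many points can oscillate strongly between them.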
Finally, the dataset is normalized to values in the range of [0,1]. The normalization formula is as follows:
$$\mathrm{norm}(x_i) = \frac{x_i - \min(x)}{\max(x) - \min(x)}$$
where $\mathrm{norm}(x_i)$ is the normalized value of $x_i$, $\min(x)$ is the smallest value in the dataset, and $\max(x)$ is the largest value in the dataset.
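A minimal sketch of this min-max normalization and of its inverse, which is needed later to map predictions back to concentration units, is given below. The function names and the sample values are hypothetical, and the sketch assumes that the largest and smallest values differ.

```python
def fit_min_max(values):
    """Return the smallest and largest values, which define the scaling."""
    return min(values), max(values)


def normalize(values, lo, hi):
    """Map values linearly so that lo becomes 0 and hi becomes 1 (requires hi != lo)."""
    return [(v - lo) / (hi - lo) for v in values]


def denormalize(scaled, lo, hi):
    """Invert normalize(): map values in [0, 1] back to the original units."""
    return [s * (hi - lo) + lo for s in scaled]


# Made-up concentration readings, for illustration only.
raw = [0.2, 0.5, 0.8]
lo, hi = fit_min_max(raw)
scaled = normalize(raw, lo, hi)
restored = denormalize(scaled, lo, hi)
print(scaled, restored)
```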

IV. GRU-BASED FORECASTING FRAMEWORK

A. GRU MODEL
An LSTM module has a large number of parameters and a complex structure; thus, it is prone to overfitting. To overcome these deficiencies, Cho et al. proposed the GRU structure as a variant of LSTM in 2014 [18]. The GRU architecture maintains the characteristics of LSTM while having a simpler structure. The internal structure of a GRU neuron is shown in Figure 5.
Each of these recurrent neural network variants has a structure that consists of replicated instances of a particular module; however, the structure of the replicated module in a GRU network is slightly simpler than that in an LSTM network. A GRU neuron has only two gates, namely, an update gate and a reset gate, denoted by $z_t$ and $r_t$ in Figure 5. The update gate is used to control the extent to which the information of previous hidden states is carried over into the current state. The larger the value of the update gate is, the more information is preserved from previous states. Similarly, the reset gate is used to control the extent to which the information of previous states is ignored; the smaller the value of the reset gate is, the more information is ignored. Accordingly, short-term dependencies are usually captured by means of frequent activation of reset gates, while long-term dependencies are associated with the activation of update gates. Since the GRU architecture has only two types of control gates, the calculation speed of a GRU model is much faster than that of an LSTM model.
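Let $r_t$ represent the reset gate of a GRU at time t, with the following calculation formula:
$$r_t = \sigma\left(W_r X_t + U_r h_{t-1}\right)$$
where $\sigma$ is the sigmoid function; $X_t$ and $h_{t-1}$ are the current input value and the previous activation value, respectively; $W_r$ is the input weight matrix; and $U_r$ is the weight matrix of the recurrent connection. Similarly, let $z_t$ be the update gate of the GRU at time t, with the following calculation formula:
$$z_t = \sigma\left(W_z X_t + U_z h_{t-1}\right)$$
Let $h_t$ be the activation value of the GRU at time t, which is a linear interpolation between the previous activation value $h_{t-1}$ and the candidate activation value $\tilde{h}_t$:
$$h_t = z_t \odot h_{t-1} + \left(1 - z_t\right) \odot \tilde{h}_t$$
$\tilde{h}_t$ is calculated as follows:
$$\tilde{h}_t = \tanh\left(W X_t + U\left(r_t \odot h_{t-1}\right)\right)$$
where $\odot$ represents the Hadamard product, and $W_z$, $U_z$, $W$ and $U$ are the corresponding input and recurrent weight matrices of the update gate and of the candidate activation.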
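The gate computations can be made concrete with a small forward-pass sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: bias terms are omitted, the candidate activation uses tanh, and the new state weights the previous state by the update gate, which matches the statement above that a larger update gate preserves more of the previous state. The weight values, the hidden size of 2 and the names `gru_step` and `matvec` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def gru_step(x_t, h_prev, p):
    """One GRU update; p maps the names W_r, U_r, W_z, U_z, W, U to weight matrices."""
    # Reset gate r_t and update gate z_t, both in (0, 1).
    r = [sigmoid(a + b) for a, b in zip(matvec(p["W_r"], x_t), matvec(p["U_r"], h_prev))]
    z = [sigmoid(a + b) for a, b in zip(matvec(p["W_z"], x_t), matvec(p["U_z"], h_prev))]
    # Candidate activation computed from the reset-gated previous state.
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    h_cand = [math.tanh(a + b) for a, b in zip(matvec(p["W"], x_t), matvec(p["U"], gated))]
    # Element-wise interpolation between the previous state and the candidate.
    return [zi * hp + (1.0 - zi) * hc for zi, hp, hc in zip(z, h_prev, h_cand)]


# Arbitrary illustrative weights: input size 1, hidden size 2.
params = {
    "W_r": [[0.5], [-0.3]], "U_r": [[0.1, 0.2], [0.0, -0.1]],
    "W_z": [[0.4], [0.6]], "U_z": [[-0.2, 0.1], [0.3, 0.0]],
    "W": [[0.7], [-0.5]], "U": [[0.2, -0.4], [0.1, 0.3]],
}

h = [0.0, 0.0]
for value in [0.31, 0.35, 0.33]:  # made-up normalized readings
    h = gru_step([value], h, params)
print(h)
```

Some libraries use the opposite convention, in which the update gate weights the candidate activation instead; the two forms are equivalent up to replacing the gate by one minus the gate.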

B. MODEL OPTIMIZATION
In traditional gradient descent optimization, the change in the loss function depends on the change in the parameters. Theoretically, the more iterations are performed, the less the loss function should change from one iteration to the next; however, the gradient descent algorithm still may not reach a globally optimal solution. Therefore, the Adam optimizer is chosen as the optimization algorithm in this work. The Adam algorithm [19] was proposed by Kingma and Ba in 2014. It is an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. It dynamically adjusts the step size of each parameter using estimates of the first and second moments of the gradient of the loss function. It has the advantages of easy implementation, high computational efficiency, and low memory consumption. In deep learning, the mini-batch technique is adopted, which causes the objective function to change with the different samples included in each batch. However, the Adam algorithm can still effectively solve the optimization problem even with this randomness in the objective function. Experiments have shown that the Adam algorithm is superior to stochastic gradient descent optimization.
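To make the moment-based update concrete, the following sketch applies the Adam rule to a plain list of parameters. The default values of `lr`, `beta1`, `beta2` and `eps` are those suggested in the Adam paper [19]; the settings actually used in this work are not stated, so they are assumptions, as are the function name `adam_step` and the toy objective.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update and return the new parameters and moment estimates.

    theta, grad, m and v are lists of equal length; t is the 1-based step count.
    """
    # Exponential moving averages of the gradient and of its square.
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    theta = [th - lr * mh / (math.sqrt(vh) + eps)
             for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v


# Toy objective f(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
theta, m, v = [0.0], [0.0], [0.0]
for t in range(1, 501):
    grad = [2.0 * (theta[0] - 3.0)]
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
print(theta)
```

Because each step is divided by the square root of the second-moment estimate, the size of the update for a parameter is governed mainly by `lr` rather than by the raw magnitude of its gradient.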

C. GAS CONCENTRATION FORECASTING FRAMEWORK
The gas concentration in a coal mine varies over time in a nonlinear manner; thus, it cannot be predicted by means of a simple linear relationship. Neural networks have obvious advantages in predicting complex nonlinear time-varying sequences. Therefore, a gas prediction model based on a GRU neural network is constructed to predict the trend of gas concentration data.
Let the time series of previous gas concentration values at n consecutive moments be denoted by X(t-n+1), X(t-n+2), . . ., X(t-1), X(t), with X(t+1) being the predicted value at the next moment; then, the GRU-based gas concentration prediction model can be expressed as follows:
$$X(t+1) = f\left(X(t-n+1), X(t-n+2), \ldots, X(t-1), X(t)\right)$$
where f denotes the mapping learned by the GRU network.
The three-layer network framework of the proposed GRU gas concentration prediction model, which includes an input layer, a hidden layer, and an output layer, is shown in Figure 6. The input layer is responsible for preprocessing the original time series of gas concentration data to satisfy the requirements for the input to the GRU model; in the hidden layer, GRU neurons are used to construct a single-layer recurrent neural network; and in the output layer, a fully connected layer maps the hidden state to a one-dimensional sequence of data to realize gas concentration prediction.
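The mapping from n past values to the next value implies a sliding-window construction of the training samples. The sketch below shows one way to build them, arranged as nested lists with the shape (samples, time steps, features) that recurrent layers in Keras expect. The window length used in this work is not stated, so `n = 3` and the function name `make_windows` are illustrative assumptions.

```python
def make_windows(series, n):
    """Turn a 1-D series into input windows of length n and next-step targets.

    Each input has shape (n, 1): n time steps with one feature per step,
    so the full input list has shape (samples, n, 1).
    """
    inputs, targets = [], []
    for start in range(len(series) - n):
        window = series[start:start + n]
        inputs.append([[value] for value in window])
        targets.append(series[start + n])  # the value that follows the window
    return inputs, targets


# Made-up normalized readings, for illustration only.
X, y = make_windows([0.1, 0.2, 0.3, 0.4, 0.5], n=3)
print(len(X), len(X[0]), len(X[0][0]))
print(y)
```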

V. MODEL PARAMETER SETTINGS
TensorFlow is a library that runs on both GPUs and CPUs, and it acts as a backend for the Keras library [20]. Keras is a deep learning library that provides prepackaged implementations of complicated architectures such as the RNN, LSTM and GRU architectures. TensorFlow provides fewer prepackaged architectures but supports the design of new ones, whereas Keras makes it easy to apply known architectures to new datasets. In this work, the Keras and TensorFlow libraries were used for mine gas concentration forecasting. Additionally, Keras and TensorFlow contain many prepackaged deep learning functions, such as activation functions and loss functions. By choosing among these functions, better results can be obtained with a given architecture. The model parameters are explained as follows.

A. PERFORMANCE INDEX
To test the effectiveness of the proposed GRU-based gas concentration prediction method, it is necessary to choose an indicator to comprehensively measure and evaluate the prediction effect. The MSE is chosen as the evaluation index in accordance with both the principles and practice of prediction effect evaluation. The MSE is defined as the expected value of the squared difference between the estimated value of the quantity being predicted and the corresponding true value. The smaller the value of the MSE is, the more accurately the prediction model describes the experimental data. The MSE can be calculated using the formula below:
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$
where N is the number of data points, $y_i$ is the true value of the i-th data point, and $\hat{y}_i$ is the corresponding predicted value.
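The performance index can be computed with a few lines of Python. The second helper shows how a percentage error reduction between two models could be derived from their MSE values; that this is how the percentages reported in the results section were computed is an assumption, and the function names and sample numbers are hypothetical.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_reduction(baseline_mse, model_mse):
    """Percentage by which model_mse is lower than baseline_mse."""
    return 100.0 * (baseline_mse - model_mse) / baseline_mse


# Made-up values, for illustration only.
print(mean_squared_error([0.30, 0.32, 0.35], [0.31, 0.30, 0.36]))
print(relative_reduction(0.2, 0.1))
```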

B. ACTIVATION FUNCTION SELECTION
An activation function is used to convert an input signal into an output signal, which will then act as the input signal for the next layer. In each cell, a weight function is used for processing in combination with the current state input and the previous state output. In this work, we use the sigmoid activation function. When information from the previous hidden state and information from the current input enter a cell, they are activated by the sigmoid function, with the values of the vectors being between 0 and 1. A value closer to 0 means that the corresponding information is more likely to be forgotten, whereas a value closer to 1 means that it is more likely to be retained. The sigmoid function is calculated using the following formula:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

C. HIDDEN LAYER STRUCTURE
We have tested several representative combinations of numbers of hidden layers and hidden nodes and have found that the best choices for the values of these two parameters are 1 and 12, respectively.

D. SELECTION OF THE OPTIMIZATION ALGORITHM AND THE NUMBER OF ITERATIONS
The training algorithm for a neural network usually refers to the algorithm used to update the weight coefficients such that the value of the objective function is gradually reduced until it converges to a minimum [21]. In this work, we have selected Adam as the optimization algorithm. The Adam algorithm is introduced in Section IV-B.
To address the problems of undertraining and overtraining, we apply the cross-validation method. If both the validation error and the training error are steadily decreasing, then the current model is still undertrained. If the validation error is increasing while the training error is steadily decreasing, we consider this behavior to be indicative of overtraining. Ultimately, we find that when the number of iterations is 10, the training error and validation error of the neural network model are basically stable.
An SVR model has two very important parameters: C and gamma. Here, C is the penalty coefficient, which represents the tolerance for error. The higher C is, the less tolerance there is for error, and the easier it is to overfit the model. The smaller C is, the less well fitted the model will be. Gamma is a parameter associated with the RBF kernel function and implicitly determines the distribution of the data mapped to the new feature space. The larger the value of gamma is, the fewer support vectors there are, whereas a smaller gamma value corresponds to more support vectors. The number of support vectors affects the speed of training and prediction. Finally, we set C equal to 1 and gamma equal to 0.01.
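Returning to the choice of the number of iterations, the rule described above for telling undertraining from overtraining can be sketched as a small check on the per-epoch error curves. The tolerance `tol`, the function name `training_state` and the error values are hypothetical and serve only to illustrate the rule.

```python
def training_state(train_errors, val_errors, tol=1e-4):
    """Classify the latest epoch from the last two points of each error curve."""
    d_train = train_errors[-1] - train_errors[-2]
    d_val = val_errors[-1] - val_errors[-2]
    if abs(d_train) <= tol and abs(d_val) <= tol:
        return "stable"        # both errors have levelled off
    if d_train < 0 and d_val < 0:
        return "undertrained"  # both errors are still decreasing
    if d_train < 0 and d_val > 0:
        return "overtrained"   # training error falls while validation error rises
    return "inconclusive"


# Made-up error curves, for illustration only.
print(training_state([0.50, 0.30], [0.60, 0.40]))
print(training_state([0.30, 0.20], [0.40, 0.45]))
print(training_state([0.20, 0.20], [0.30, 0.30]))
```

In practice, such a check would be evaluated after every epoch, and training would stop once the state is no longer "undertrained".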

VI. EXPERIMENTS

A. EXPERIMENTAL SETTINGS
In the experiment, the data were split into two different sets, one for training and one for testing. The training set was used to train the forecasting models, while the test set was used to evaluate the final results. The procedure applied in the mine gas concentration forecasting experiment is described as follows:
I. Data preprocessing.
II. Division of the dataset into a training set and a test set, with the training set comprising 70% of the data and the test set comprising the remaining 30% of the data.
III. Construction of a 3D array from the training set for input to the GRU network via spatial reconstruction.
IV. Determination of the parameters of the GRU model, using the MSE as the loss function and the Adam algorithm as the optimization method.
V. Determination of the number of training iterations, namely, the number of epochs, in accordance with the loss function.
VI. Forecasting on the test set.
VII. Comparison with other gas concentration forecasting models of different structures and evaluation of these models in terms of the chosen performance index.
VIII. Visualization of the prediction results after inverse normalization.
Through the above 8 steps, a mine gas concentration prediction model was established based on a GRU neural network.

B. RESULTS AND ANALYSIS
In this experiment, a neural-network-based method was used for the prediction of mine gas concentration levels. Then, we considered the MSE and the run time of the model to evaluate the modeling effect. Through a comparative experimental analysis, we found that the effect of each model is different on different subsets of data; however, as the amount of data to be processed increases, the effect of the GRU model becomes superior to that of the other models.
First, we tested the models on the subset consisting of 1042 data points. The prediction effects of the SVR, BPNN, RNN, LSTM, and GRU models on the corresponding test set are shown in Figures 7, 8, 9, 10, and 11, respectively. A further performance comparison of these models is presented in Table 1.
From Table 1, we find that the RNN and LSTM models achieve higher prediction accuracy than the proposed GRU model does on the subset with a length of 1042 data points; therefore, for small data sets, RNN and LSTM models are still preferred in terms of prediction accuracy.
Then, we conducted the same experiment on the subset consisting of 5209 data points. The prediction effects of the SVR, BPNN, RNN, LSTM, and GRU models on the corresponding test set are shown in Figures 12, 13, 14, 15, and 16, respectively, and the MSE and run time of each model are listed in Table 2.
From Table 2, it can be concluded that the error of the GRU model is less than that of the other four models, indicating that the prediction accuracy of the GRU model increases with an increasing number of data.
Finally, we conducted the same experiment again on the whole collected dataset, with a total of 10419 data points. The prediction effects of the SVR, BPNN, RNN, LSTM, and GRU models on the corresponding test set are shown in Figures 17, 18, 19, 20, and 21, respectively, and the MSE and run time of each model are listed in Table 3.
From Table 3, we can see that the GRU model achieves smaller errors on both the training set and the test set compared with the other models. Compared with the SVR model, the error of the GRU model on the test set is reduced by 44.41%. Compared with the BPNN and RNN models, the error of the GRU model on the test set is reduced by 21.05% and 7.9%, respectively. Compared with the LSTM model, the error of the GRU model on the test set is reduced by 6.8%, and the GRU model also shows a 13.05% efficiency improvement over the LSTM model in terms of run time.

VII. CONCLUSION
To reduce the economic losses caused by mine gas disasters and protect the lives of miners, it is important to develop an effective method of mine gas concentration forecasting. Therefore, in this paper, we have proposed a gas concentration forecasting algorithm to enable the prediction of gas concentration time series. The optimal number of training rounds to avoid overfitting was determined by considering the loss function on the training set. Case studies showed that with increasing data volume, compared with SVR, BPNN, RNN, and LSTM models, the proposed GRU model achieves both a better prediction effect and lower time complexity, endowing it with better practical application value.
In future work, we will consider the impact of various external factors, such as the temperature, humidity, and pressure in the coal mine, on mine gas forecasting and the incorporation of prior knowledge to achieve better performance.