Improve the Model Stability of Dam’s Displacement Prediction Using a Numerical-Statistical Combined Model

In most studies of dam displacement prediction based on monitoring data, emphasis has been placed on improving the prediction accuracy, while the model stability has rarely been considered. This study proposed a numerical-statistical combined model which aims to improve the model stability. The displacement was modelled with three modules: recoverable displacement (i.e., displacement induced by external loads, including the water pressure and temperature), non-recoverable displacement (i.e., displacement due to inherent variations of the materials, such as the creep and fatigue of the concrete), and measurement errors (i.e., instrument error and human error). To reduce the random errors and increase the model stability, we used numerical simulation to constrain the coefficients of the explanatory variables for the recoverable displacement. The non-recoverable displacement was estimated by empirical equations, and the measurement errors were described by Gaussian distributions. The randomness of the coefficients across all monitoring points was further constrained by a random coefficient model. We adopted the root mean square error (RMSE) over varying prediction periods and the change ratio of coefficient (CRC) to evaluate the model stability. Results indicated that the proposed model has both better prediction accuracy and better model stability than the statistical model and the coordinates-included statistical model proposed in previous studies.


I. INTRODUCTION
In most early studies of dam displacement prediction based on monitoring data, researchers used statistical models to estimate the future displacement from past monitoring data, in which the displacement was quantified by three influencing factors, i.e., hydrostatic pressure, temperature, and ageing [1]-[4]. With the development of computational technologies, machine learning methods were introduced to the field, and increasingly complicated influencing factors were taken into consideration. Such methods include the artificial neural network method [5]-[8], the support vector machine method [9], [10], and the extreme learning machine method [11], among others.

The associate editor coordinating the review of this manuscript and approving it for publication was Vlad Diaconita.
Both statistical models and machine learning methods have shown high prediction precision. However, the model stability has rarely been discussed, although it is as important as the prediction precision, especially when long-term prediction is involved [12]-[14]. Many unquantifiable factors, such as the construction quality of pouring and material properties such as compression strength, elastic modulus, and Poisson's ratio, may induce uncertainties in predicting the displacement. These factors depend on the spatial positions of the monitoring points on the dam.
In order to enhance the model stability, recent studies have integrated the spatial correlations of the monitoring points into the statistical models by classifying the monitoring data at different monitoring points into several groups [15], [16]. In statistical models with classified monitoring data, the spatial correlations were quantified at the group level, whereas the correlations between monitoring points within one group were missing. To reflect the overall spatial correlations, one method is to integrate the coordinates of the monitoring points into the statistical models as explanatory variables [17]. The coordinates-included statistical model considers the spatial correlations between all monitoring points; however, the accuracy of the model is usually reduced by the increased number of explanatory variables and the increased model complexity. Taking a simple power function as an example, the number of explanatory variables of the function that includes the coordinates (x, y, z) is $4^3 = 64$ times that of the function without coordinates. As the number of variables increases, the coordinates-included model exhibits an increasing possibility of autocorrelations between variables, which would weaken the prediction accuracy. The objective of the present study is to enhance the stability of the prediction model without increasing the number of variables, as coordinates-included statistical models do.
In this study, we proposed a numerical-statistical combined model, which modelled the displacement via three modules: the recoverable displacement, the non-recoverable displacement, and the measurement errors. The recoverable displacement is the most critical module, and it represents the displacement induced by the external loads, including the water pressure and temperature. This component was quantified by numerical simulation [18]-[20] with a reduced number of explanatory variables. The non-recoverable displacement includes the displacement resulting from inherent variations of the materials, such as the creep and fatigue of the concrete, and it was described with an empirical formula. The measurement errors (i.e., instrument error and human error) were estimated with Gaussian distributions. To constrain the explanatory variables of the proposed model further, a statistical model called the random coefficient model [21]-[23] was used to obtain the coefficients. To evaluate the proposed model, we first quantified the model's prediction accuracy for six prediction periods varying from one month to six months. We then quantified the model stability with an indicator named the change ratio of coefficient (CRC), which estimates the fluctuation of the coefficients of the explanatory variables.
This article is organized as follows. Section II presents how the proposed model was developed. We first introduce the statistical model and the coordinates-included statistical model, and then develop the numerical-statistical combined model. Section III describes the engineering project and the dataset. Section IV discusses the results, including the prediction accuracy and the model stability. Concluding remarks complete the paper in Section V.

II. MODEL DEVELOPMENT
A. COORDINATES-INCLUDED STATISTICAL MODEL
Dam displacement depends on many external and internal factors, such as the external load, the material properties of the dam body and the dam foundation, and the quality of the construction. Researchers developed statistical models to quantify the influence of these factors on dam displacement. In most statistical models, the dam displacement $\delta$ is divided into three components: the hydraulic component $\delta_H$, the temperature component $\delta_T$, and the ageing component $\delta_\theta$.
The displacement $\delta$ of a single monitoring point is written as the sum of these three components (Equations (1) to (3)), where $a_0$ is a constant term which represents the initial conditions; $H$ is the upstream water level; $n$ is the highest power of $H$, with $n = 3$ for gravity dams and $n = 4$ for arch dams; $t$ counts time from the starting date of the selected dataset; $\theta = t/100$; and $a_i$, $b_i$, $c_i$ are unknown coefficients.
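For orientation only, a widely used form of this hydrostatic-temperature-ageing decomposition is sketched below. The polynomial in $H$ and the linear-plus-logarithmic ageing term follow the symbol definitions given in the text. The harmonic form of the temperature term is an assumption borrowed from the classical statistical model and may differ from the expression used in this study.

```latex
% Hedged sketch of a classical three-component statistical model.
% The temperature term below is an ASSUMED form, not taken from the source.
\begin{align}
\delta &= \delta_H + \delta_T + \delta_\theta \\
\delta_H &= a_0 + \sum_{i=1}^{n} a_i H^i \\
\delta_T &= b_1 \sin\frac{2\pi t}{365} + b_2 \cos\frac{2\pi t}{365} \\
\delta_\theta &= c_1 \theta + c_2 \ln\theta, \qquad \theta = t/100
\end{align}
```

With $n = 4$, this form has nine coefficients per monitoring point ($a_0$ to $a_4$, $b_1$, $b_2$, $c_1$, $c_2$), which agrees with the count of nine variables reported for the statistical model in the discussion of prediction accuracy.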
To consider the spatial correlations among all monitoring points, the coordinates of the monitoring points were included as variables in the statistical model (see Figure 1). Then, the displacement $\delta$ becomes:

$$\delta(H, T, \theta, x, y, z) = f_1(H, x, y, z) + f_2(T, x, y, z) + f_3(\theta, x, y, z) \tag{4}$$

with $H$ the upstream water level, $T$ the temperature, $\theta = t/100$, $t$ the time, and $x$, $y$, $z$ the coordinates of the monitoring points. $f_1(H, x, y, z)$, $f_2(T, x, y, z)$, and $f_3(\theta, x, y, z)$ are related to $\delta_H$, $\delta_T$, and $\delta_\theta$, respectively (see Equation (1)).

The hydrostatic pressure component at a single monitoring point, $f(H)$, is expressed in Equation (2). $f_1(H, x, y, z)$ represents the displacement field induced by the water load, and it can be expressed with a multivariate power function:

$$f_1(H, x, y, z) = \sum_{k} \sum_{l, m, n} A_{klmn} H^k x^l y^m z^n \tag{5}$$

The displacement field induced by the temperature component, $f_2(T, x, y, z)$, is given by Equation (6). The displacement field induced by the ageing component, $f_3(\theta, x, y, z)$, is:

$$f_3(\theta, x, y, z) = \sum_{j, k} \sum_{l, m, n} C_{jklmn} \theta^j (\ln\theta)^k x^l y^m z^n \quad (j = 0, k = 1 \text{ or } j = 1, k = 0) \tag{7}$$

Then, the coordinates-included statistical model can be expressed as:

$$\delta = \sum_{k} \sum_{l, m, n} A_{klmn} H^k x^l y^m z^n + f_2(T, x, y, z) + \sum_{j, k} \sum_{l, m, n} C_{jklmn} \theta^j (\ln\theta)^k x^l y^m z^n \tag{8}$$

Given the spatial coordinates $(x, y, z)$ and the displacement $\delta(H, T, \theta, x, y, z)$ of each monitoring point, we can fit the coefficients $A_i$, $B_i$, $C_i$ with the monitoring data using a stepwise regression method.

VOLUME 8, 2020
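The growth in the number of regression terms caused by the coordinates can be illustrated with a short sketch. It assumes that each coordinate exponent runs from 0 to 3, which is consistent with the factor of 4^3 quoted in the introduction; the function names `coordinate_monomials` and `expand_term` are hypothetical and are not part of the original study.

```python
from itertools import product


def coordinate_monomials(max_power=3):
    """Return exponent triples (l, m, n) for x**l * y**m * z**n with 0 <= l, m, n <= max_power."""
    return list(product(range(max_power + 1), repeat=3))


def expand_term(base_value, x, y, z, max_power=3):
    """Expand one explanatory variable (for example H**k) into its coordinate-weighted copies."""
    return [
        base_value * x**l * y**m * z**n
        for l, m, n in coordinate_monomials(max_power)
    ]


# Each original explanatory variable becomes 4**3 = 64 regression terms.
print(len(coordinate_monomials()))  # 64

# Illustrative values only: H**k = 2.0 at a point with coordinates (0.5, 1.0, 2.0).
features = expand_term(2.0, 0.5, 1.0, 2.0)
```

Many of these terms are strongly correlated with one another, which is the motivation given in the text for avoiding coordinate expansion in the proposed model.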

B. NUMERICAL-STATISTICAL COMBINED MODEL
With the coordinates (x, y, z) integrated into the statistical model, the number of independent variables is increased. The additional variables may lead to an autocorrelation problem. The proposed numerical-statistical combined model aims to increase the model stability without increasing the number of independent variables. In contrast to previous studies, which divided the displacement into a water pressure component, a temperature component, and an ageing component, we model the displacement with three modules: the recoverable displacement $\delta_{i,r}$, the non-recoverable displacement $\delta_{i,n-r}$, and the measurement errors $\delta_{i,e}$ (see Figure 2). The recoverable displacement is estimated by integrating the numerical simulation into the statistical model, the non-recoverable displacement is evaluated with an empirical equation, and the measurement errors are assumed to follow a Gaussian distribution.

1) RECOVERABLE DISPLACEMENT
The module of the recoverable displacement $\delta_{i,r}$ represents the dam displacement induced by the external loads, including the water pressure component $\delta_{iH}$ and the temperature component $\delta_{iT}$. $\delta_{iH}$ and $\delta_{iT}$ are regarded as linear elastic and satisfy the small-deformation assumption; hence, they can be obtained from the equilibrium equation (9), the geometric equation (10), and the constitutive equation (11).
In these equations, $\sigma$ and $\varepsilon$ are the stress tensor and the strain tensor, respectively; $u$ is the theoretical displacement field; $f$ is the volume force; and $\delta_{ij}$ is the Kronecker symbol, with $\delta_{ij} = 1$ when $i = j$ and $\delta_{ij} = 0$ when $i \neq j$. The simulated displacements $\delta^0_{iH}$ at different water levels and $\delta^0_{iT}$ at different temperatures can be solved from Equations (9) to (11) using a finite element method. The relation between the actual displacement field $\delta_{iH}$ and the displacement field $\delta^0_{iH}$ simulated with the designed parameters is:

$$\delta_{iH} = \zeta_i \, \delta^0_{iH}, \qquad \zeta_i = \frac{E^0_c}{E_{ic}}$$

where $i$ is the serial number of the monitoring point, $E^0_c$ is the designed elastic modulus, $E_{ic}$ is the actual elastic modulus, and $\zeta_i$ is the ratio of the designed elastic modulus $E^0_c$ to the actual elastic modulus $E_{ic}$.
Similarly, the relation between the actual linear expansion $\delta_{iT}$ and the designed linear expansion $\delta^0_{iT}$ can be expressed as

$$\delta_{iT} = \xi_i \, \delta^0_{iT}$$

in which $\xi_i$ is the ratio of the designed linear expansion to the actual linear expansion. The displacement of the recoverable module is:

$$\delta_{i,r} = \zeta_i \, \delta^0_{iH} + \xi_i \, \delta^0_{iT}$$

where $\delta^0_{iH}$ and $\delta^0_{iT}$ are provided by the numerical simulation, and $\zeta_i$ and $\xi_i$ represent the fluctuations of the actual elastic modulus and the actual linear expansion coefficient at the indicated position.
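For reference, the equilibrium, geometric, and constitutive equations of small-strain linear elasticity are commonly written in index notation as sketched below. This is the standard textbook form rather than a reproduction of Equations (9) to (11); in particular, the Lamé parameters $\lambda$ and $G$ are notation assumed here and are not defined in the text.

```latex
% Standard small-strain linear elasticity, summation over repeated indices.
% \lambda and G (Lame parameters) are ASSUMED notation, not symbols from the source.
\begin{align}
\sigma_{ij,j} + f_i &= 0 \\
\varepsilon_{ij} &= \tfrac{1}{2}\left(u_{i,j} + u_{j,i}\right) \\
\sigma_{ij} &= \lambda \, \varepsilon_{kk} \, \delta_{ij} + 2G \, \varepsilon_{ij}
\end{align}
```

Because these equations are linear, scaling the elastic modulus uniformly by some factor, with Poisson's ratio and the loads unchanged, scales the load-induced displacement by the inverse of that factor. This is consistent with correcting the simulated displacement by a multiplicative ratio such as $\zeta_i$.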

2) NON-RECOVERABLE DISPLACEMENT
The non-recoverable displacement field $\delta_{i,n-r}$ represents the displacement that results from inherent variations of the materials, such as plasticity, creep, and fatigue of the concrete. The physical mechanisms of these influencing factors are still unclear. Here, we use a linear function and a logarithmic function of time to characterize the divergence trend and the convergence trend of $\delta_{i,n-r}$, respectively. The expression of $\delta_{i,n-r}$ is:

$$\delta_{i,n-r} = \sum_{l} \sum_{m} \left( d_{1lm} \, \theta + d_{2lm} \ln\theta \right) x^l z^m$$

where $x$ is the horizontal coordinate, $z$ is the vertical coordinate, $\theta$ is the time, and $d_{1lm}$ and $d_{2lm}$ are pending coefficients.

3) MEASUREMENT ERRORS
The measurement errors $\delta_{i,e}$ include the instrument errors and the human errors. The measurement errors of the displacement data at a monitoring point can be regarded as following a Gaussian distribution $N(0, \sigma^2)$. The probability density function of $\delta_{i,e}$ is:

$$p(\delta_{i,e}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\delta_{i,e}^2}{2\sigma^2} \right)$$

4) SOLVING THE MODEL
As presented in Sections II-B1 to II-B3, the displacement field can be expressed as:

$$\delta_i = \delta_{i,r} + \delta_{i,n-r} + \delta_{i,e} \tag{16}$$

Figure 3 illustrates the flowchart of the proposed numerical-statistical combined model.
With known spatial coordinates of the monitoring points, the geometric characteristics at each monitoring point can be determined from the numerical simulation. The advantage of the numerical method is that only two unknown coefficients ($\zeta_i$ and $\xi_i$) are left in the module of recoverable displacement.
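To make the role of the two remaining coefficients concrete, the sketch below estimates them for a single monitoring point by ordinary least squares. It is a simplified illustration under stated assumptions: the non-recoverable and error terms are ignored, ordinary least squares stands in for the random coefficient model used in the study, and the function name `fit_recoverable` and all numerical values are hypothetical.

```python
def fit_recoverable(delta_h0, delta_t0, observed):
    """Least-squares estimate of (zeta, xi) in observed ~ zeta * delta_h0 + xi * delta_t0.

    delta_h0, delta_t0: simulated water-pressure and temperature displacements.
    observed: monitored displacements at the same time steps.
    """
    s_hh = sum(h * h for h in delta_h0)
    s_tt = sum(t * t for t in delta_t0)
    s_ht = sum(h * t for h, t in zip(delta_h0, delta_t0))
    s_hy = sum(h * y for h, y in zip(delta_h0, observed))
    s_ty = sum(t * y for t, y in zip(delta_t0, observed))

    # Solve the 2x2 normal equations with Cramer's rule.
    det = s_hh * s_tt - s_ht * s_ht
    if abs(det) < 1e-12:
        raise ValueError("simulated series are collinear; zeta and xi are not identifiable")
    zeta = (s_hy * s_tt - s_ty * s_ht) / det
    xi = (s_ty * s_hh - s_hy * s_ht) / det
    return zeta, xi


# Hypothetical simulated displacements and synthetic observations.
delta_h0 = [10.0, 12.0, 15.0, 18.0]
delta_t0 = [-2.0, 1.0, 3.0, -1.0]
observed = [1.1 * h + 0.9 * t for h, t in zip(delta_h0, delta_t0)]

zeta, xi = fit_recoverable(delta_h0, delta_t0, observed)
```

Because the simulated series already carry the dependence on water level, temperature, and position, the regression only has to recover two scale factors per monitoring point.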
We use the random coefficient model to solve the numerical-statistical combined model. This model assumes that the regression coefficients are random variables that obey a Gaussian distribution (see Figure 4).
By introducing the random effects, the correlations between individual observations are taken into account, and the degrees of freedom of the model are reduced. Then, Equation (16) can be expressed as:

$$\delta_{it} = \sum_{k=1}^{K} \beta_{ki} \, a_{kit} + u_{it}, \qquad \beta_{ki} = \beta_k + \gamma_{ki} \tag{17}$$

where $\delta_{it}$ is a two-dimensional data panel of the displacements containing temporal and spatial information; $a_{kit}$ is a two-dimensional data panel of explanatory variables; $t$ is time; $i$ is the dam's cross-section index; $k$ is the explanatory variable index; and $u$ is a random term. The pending coefficient $\beta_{ki}$ consists of $\beta_k$ and $\gamma_{ki}$, with $\beta = (\beta_1, \cdots, \beta_K)$ the common mean coefficient vector and $\gamma_i = (\gamma_{1i}, \cdots, \gamma_{Ki})$ the deviation of the individual data from the common mean value. According to the central limit theorem, $\beta_{ki}$ approximately obeys a Gaussian distribution.
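The decomposition of each coefficient into a common mean and an individual deviation can be illustrated with a deliberately simplified sketch. It fits one explanatory variable per monitoring point without an intercept, then averages the individual slopes. This is not the estimator used in the study, since a full random coefficient estimator would typically weight the individual estimates rather than average them equally; the function names and the data are hypothetical.

```python
def ols_slope(a, delta):
    """Slope of a no-intercept least-squares fit delta ~ beta * a for one monitoring point."""
    return sum(x * y for x, y in zip(a, delta)) / sum(x * x for x in a)


def random_coefficient_summary(panel):
    """Split per-point slopes into a common mean (beta) and point-specific deviations (gamma).

    panel maps a point label to a pair (explanatory series, displacement series).
    """
    slopes = {point: ols_slope(a, d) for point, (a, d) in panel.items()}
    beta = sum(slopes.values()) / len(slopes)
    gamma = {point: slope - beta for point, slope in slopes.items()}
    return beta, gamma


# Hypothetical panel: three points sharing one explanatory series.
a_series = [1.0, 2.0, 3.0]
panel = {
    "P1": (a_series, [0.9 * x for x in a_series]),
    "P2": (a_series, [1.0 * x for x in a_series]),
    "P3": (a_series, [1.1 * x for x in a_series]),
}

beta, gamma = random_coefficient_summary(panel)
```

In this toy panel the common mean is close to 1.0 and the deviations are close to -0.1, 0.0, and 0.1, which mirrors the roles of $\beta_k$ and $\gamma_{ki}$ in the text.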

III. DATASETS
In the present study, we used the displacement monitoring data of the concrete arch dam at the Jinping-I hydropower station, which is one of the highest concrete arch dams in the world (see Figure 5). The elevations of the crest and the foundation of the dam are 1885 m and 1580 m, respectively. The normal impounded water level and the dead water level are 1880 m and 1800 m, respectively. For the dataset, we selected the radial displacement monitoring data of 23 monitoring points distributed along six plumb lines (5#, 9#, 11#, 13#, 16# and 19#) from July 1, 2015 to December 31, 2018. Figure 6 shows the distribution of the monitoring points. Displacement in the downstream direction is counted as positive, and displacement in the upstream direction is counted as negative. The displacement data were recorded once a day. After eliminating the missing data, we obtained 914 valid records in total. The dataset was divided into two parts: data from July 1, 2015 to June 30, 2018 were selected as the training dataset, whereas data from July 1, 2018 to December 31, 2018 were selected as the testing dataset. Figure 7 shows the time variation of the upstream water level and the monitoring data of the selected monitoring points.
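The chronological split described above can be sketched as follows. The date boundaries come from the text; the record layout (pairs of a date and a displacement, with `None` marking a missing reading) and the function name `split_by_date` are assumptions made for illustration.

```python
from datetime import date

TRAIN_START = date(2015, 7, 1)
TRAIN_END = date(2018, 6, 30)
TEST_END = date(2018, 12, 31)


def split_by_date(records):
    """Split (day, displacement) pairs into training and testing lists, dropping missing values."""
    train, test = [], []
    for day, value in records:
        if value is None:
            continue  # missing reading, removed before modelling
        if TRAIN_START <= day <= TRAIN_END:
            train.append((day, value))
        elif TRAIN_END < day <= TEST_END:
            test.append((day, value))
    return train, test


# Hypothetical readings for one monitoring point.
records = [
    (date(2015, 7, 1), 1.0),
    (date(2018, 6, 30), 2.0),
    (date(2018, 7, 1), None),
    (date(2018, 7, 2), 3.0),
    (date(2019, 1, 1), 4.0),
]
train, test = split_by_date(records)
```

Splitting by date rather than at random keeps the testing period strictly after the training period, which matches the forecasting use of the model.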

IV. RESULTS AND DISCUSSION
A. RESULTS OF THE PROPOSED MODEL
We analysed the module of the recoverable displacement in ABAQUS. We first established a three-dimensional finite element model of the selected dam, which consists of the dam body, the dam pedestal, and the surrounding mountain. The dam body contained 38537 elements and 31941 nodes. The model was constrained in the normal direction on all lateral boundaries and was fixed in all directions on the bottom boundary (see Figure 8). The material parameters were set to their designed values (see Table 1). The density, Young's modulus, and Poisson's ratio were used to calculate the hydrostatic pressure-induced displacement, and the expansion coefficient was used to calculate the temperature-induced displacement. We simplified the dam body as concrete, and the dam pedestal and foundation as rock.
As the upstream water level varied between 1700 m and 1880 m in the real world, we simulated the dam displacement at six different upstream water levels (see Figure 9). Results showed that the displacement field was approximately symmetric in the horizontal direction, with the displacement at the midline larger than that in the border areas.
Similarly, as the temperature in the real world varied from 4 °C to 24 °C, we selected 4 °C, 8 °C, 12 °C, 16 °C, 20 °C and 24 °C as the boundary conditions of the downstream dam surface and the dam crest. The boundary conditions of the upstream dam surface were set to 3 °C, 4 °C, 5 °C, 8 °C, 11 °C and 14 °C, which are the average temperatures of the water body in the above six configurations. Figure 10 exhibits the displacement fields for these six configurations, with the temperature of the downstream dam surface varied systematically from 4 °C to 24 °C. The dam crest deforms toward downstream at low temperature and toward upstream at high temperature. With the numerical simulations of the configurations at different water levels and temperatures, we obtained the regression relationship between $H$ and $\delta^0_H$ (i.e., $\delta^0_H = \sum_{i=0}^{4} a_i H^i$), and the relation between $T$ and $\delta^0_T$ (i.e., $\delta^0_T = \sum_{i=0}^{2} b_i T^i$). Table 2 lists the coordinates of all the selected monitoring points.
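The polynomial regression between a boundary condition and the simulated displacement can be sketched in a few lines. The example fits the quadratic temperature relation at the six simulated temperatures by solving the normal equations. The displacement values are synthetic, generated from a known quadratic purely for illustration, and the helper names `solve_linear` and `polyfit_ls` are hypothetical.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; use more distinct sample points")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def polyfit_ls(xs, ys, degree):
    """Least-squares polynomial coefficients [b_0, ..., b_degree] via the normal equations."""
    size = degree + 1
    gram = [[sum(x ** (i + j) for x in xs) for j in range(size)] for i in range(size)]
    rhs = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(size)]
    return solve_linear(gram, rhs)


temperatures = [4.0, 8.0, 12.0, 16.0, 20.0, 24.0]
# Synthetic displacements from a known quadratic (illustrative values only).
simulated = [1.5 - 0.4 * t + 0.01 * t * t for t in temperatures]
b = polyfit_ls(temperatures, simulated, degree=2)
```

For the degree-4 relation in $H$, where water levels are of the order of 10^3 m, centring or scaling $H$ before fitting would be advisable, because raw powers of such values make the normal equations poorly conditioned.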
With the random coefficient model, the coefficients of the explanatory variables in Equation 16 can be obtained, which are exhibited in Table 3.

B. PREDICTION ACCURACY
In order to evaluate the proposed model, we calculated the coefficient of determination $R^2$ of the training dataset and the root mean square error (RMSE) of the testing dataset:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (\delta_i - \hat{\delta}_i)^2}{\sum_{i=1}^{n} (\delta_i - \bar{\delta})^2} \tag{18}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\delta_i - \hat{\delta}_i)^2} \tag{19}$$

where $\delta_i$ is the monitored displacement, $\hat{\delta}_i$ is the displacement given by the model, $\bar{\delta}$ is the average of the monitoring data, and $n$ is the number of displacement data. As shown in Table 4, we compared the coefficient of determination $R^2$ of the proposed model with those of the statistical model (S model) and the coordinates-included statistical model (C-S model). The coefficients of the explanatory variables in the S model and the C-S model are shown in the Appendix (Tables 6 and 7). The numbers of variables for a single monitoring point in the S model, the C-S model, and the proposed model were 9, 160, and 5, respectively. For the displacement prediction of a single monitoring point, the C-S model therefore had more variables than the S model and the proposed model. To establish the model for all monitoring points, however, the numbers of variables were 207, 160, and 115 for the S model, the C-S model, and the proposed model, respectively. For the whole dataset, the S model had the most variables (207) and thus had the best fitting results, but also a higher risk of over-fitting. The $R^2$ of the C-S model was slightly smaller than those of the S model and the proposed model at all monitoring points. Overall, the $R^2$ of all three models exceeded 0.95 for all monitoring points, which means that all these models performed well in fitting the training data. Figure 11(a) presents the RMSE of the three models for each monitoring point; the average RMSE over all monitoring points was 0.315, 1.679, and 0.270 for the S model, the C-S model, and the proposed model, respectively. Of the 23 monitoring points, the S model had the lowest RMSE at 8 monitoring points, and the proposed model had the lowest RMSE at 15 monitoring points.
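The two accuracy measures can be computed as sketched below. The code uses the standard definitions of the coefficient of determination and the root mean square error; the function names and the sample values are hypothetical.

```python
import math


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root mean square error between monitored and modelled displacements."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative values only: every prediction is off by exactly one unit.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 4.0, 5.0]
print(rmse(observed, predicted))  # 1.0
```

In the study, $R^2$ is reported on the training dataset as a measure of fit, while the RMSE is reported on the testing dataset as a measure of prediction error.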
Since more than half of the coefficients in the C-S model were very small (close to zero), its number of effective variables was the smallest. Therefore, the C-S model performed the worst of the three models in fitting and predicting the displacement. The prediction accuracy of the proposed model was as good as that of the S model, because it captured the deterministic relation between the variables and the dam displacement through the numerical simulation.
Further, to evaluate the prediction accuracy over different time periods, we calculated the RMSE for prediction periods from one to six months in the testing dataset. Figure 11(b) exhibits the average RMSE of the 23 monitoring points, which varied from 1.67 to 1.97 for the C-S model, from 0.36 to 0.93 for the S model, and from 0.21 to 0.35 for the proposed numerical-statistical combined model. For all three models, the average RMSE of all monitoring points kept increasing as the prediction period became longer. When the prediction period rose from five months to six months, the RMSE increased sharply. Compared with the other two models, the S model showed a more pronounced increase in RMSE; at the sixth month in particular, the increase rate was 52.4% (from 0.61 for a five-month prediction period to 0.93 for six months). The average RMSE of the C-S model was the highest for all prediction periods, but its increment was smaller than that of the S model. To conclude, the prediction accuracy of all three models declined as the prediction period became longer. By comparing the RMSE and its increments for prediction periods from one month to six months, we noticed that the proposed model had the steadiest prediction accuracy. This is because the proposed model constrains the variables with the random coefficient model.

C. MODEL STABILITY
Here, we selected the change ratio of coefficient (CRC) as the indicator of model stability; it represents the sensitivity of the coefficients of the model to varied inputs. The CRC is given by Equation (20), in which $\mathrm{Coef}_{ini}$ and $\mathrm{Coef}$ are the coefficients calculated with the initial inputs and the varied inputs, respectively. A larger CRC signifies that the coefficients calculated with the varied inputs fluctuate more, and thus that the model is less stable. We adopted varied training datasets and varied measurement errors ($\delta_{i,e}$) to obtain the CRC in this section. To calculate the CRC for varied training datasets, we defined a comparison scheme and used the coefficients calculated with the whole training dataset as a control group. We then reduced the training dataset by 5% and 10%, respectively, and obtained the differences among the coefficients calculated with these three datasets. Similarly, the CRC for varied $\delta_{i,e}$ was obtained from the coefficients calculated with the initial $\delta_{i,e}$ and the varied $\delta_{i,e}$ according to Equation (20). Here, we let $\delta_{i,e}$ fluctuate within the ranges of [-10%, 10%], [-20%, 20%], [-30%, 30%], [-40%, 40%] and [-50%, 50%], respectively, as the varied inputs.

1) VARIED TRAINING DATASETS
Figure 12, Figure 13 and Figure 14 exhibit the $\mathrm{CRC}_{5\%}$ and $\mathrm{CRC}_{10\%}$ of the S model, the C-S model and the proposed model, respectively ($\mathrm{CRC}_{5\%}$ and $\mathrm{CRC}_{10\%}$ denote the CRC with the training dataset reduced by 5% and 10%, respectively). In Figure 12, 20 monitoring points had a higher average CRC over all coefficients ($a_i$, $i = 1, 2, 3, 4$; $b_i$, $i = 1, 2$; $c_i$, $i = 1, 2$; and the constant term) with the 10% reduced training dataset than with the 5% reduced training dataset. The average $\mathrm{CRC}_{5\%}$ of PL19-5 (3.677) was the maximum, and the fluctuations of the coefficients of PL9-4 (0.571), PL11-5 (0.917), PL16-4 (0.596) and PL19-4 (0.521) were also relatively pronounced.
When the training dataset was reduced by 10%, the coefficients of PL9-4, PL9-5, PL11-5, PL16-4, PL19-4 and PL19-5 had the most obvious fluctuations, with $\mathrm{CRC}_{10\%}$ above 0.5.
In Figure 13, the $\mathrm{CRC}_{5\%}$ of $A_{1ln}$, $A_{2ln}$ and $B_{11ln}$ ($l = 0, 1, 2, 3$; $n = 0, 1, 2, 3$) and the $\mathrm{CRC}_{10\%}$ of $A_{1ln}$, $A_{2ln}$, $B_{01ln}$, $B_{10ln}$ and $B_{11ln}$ ($l = 0, 1, 2, 3$; $n = 0, 1, 2, 3$) were relatively large. As more than half of the coefficients were not used in the C-S model, the CRC of these coefficients was taken as 0. However, in the C-S model, the maximum $\mathrm{CRC}_{5\%}$ and $\mathrm{CRC}_{10\%}$ were 67 and 55, which were 18.26 and 59.98 times the maximum CRC in the S model. In this case, the coefficients in the C-S model were quite sensitive to variations of the training dataset, and thus the model was unstable.
In the proposed model, the maximum average $\mathrm{CRC}_{5\%}$ was 0.146 at the monitoring point PL19-5, and the average $\mathrm{CRC}_{5\%}$ of the monitoring points PL9-5, PL13-4, PL19-3 and PL19-4 exceeded 0.05. The CRC of over 80% of the monitoring points (19 of 23) tended to increase as the training dataset was reduced, and the $\mathrm{CRC}_{10\%}$ of PL11-5 and PL19-5 exceeded 0.2. Compared with the S model and the C-S model, the proposed model was more stable when the training dataset changed. From Figures 12 and 14, it is noted that the monitoring points near the dam foundation (i.e., PL19-5, PL19-4 and PL9-5) were more strongly affected by variations of the training dataset.

2) VARIED MEASUREMENT ERRORS
Table 5 presents the maximum CRC of the three models: the maximum CRC of the S model, the C-S model and the proposed model increased from 15.3235 to 448.5635, from 15.8318 to 285.4802 and from 0.0436 to 0.2361, respectively. The coefficients in the proposed model fluctuated much less in response to the varied $\delta_{i,e}$ than those of the other two models.
In Figure 15, the average CRC of the S model, the C-S model and the proposed model varied from 0.1898 to 3.8137, from 0.8939 to 4.9890 and from 0.0085 to 0.0209, respectively. In general, the coefficients of all three models fluctuated more as the measurement error inputs varied over a wider range. The coefficients in the S model and the C-S model were sensitive to the variation of the measurement error inputs, with an average change nearly three times as large as the initial inputs, whereas the coefficients of the proposed model varied little in response to the varied measurement error inputs.
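The stability indicator used in this section can be sketched as a relative change between two sets of fitted coefficients. The exact form of Equation (20) is assumed here to be the absolute change divided by the absolute initial value; the function names and the coefficient values are hypothetical.

```python
def change_ratio(coef_ini, coef):
    """Relative change of one coefficient between the initial and the varied inputs.

    Assumed form: |coef - coef_ini| / |coef_ini|.
    """
    if coef_ini == 0:
        raise ValueError("CRC is undefined for a zero initial coefficient")
    return abs(coef - coef_ini) / abs(coef_ini)


def average_crc(coefs_ini, coefs):
    """Average change ratio over all coefficients of one monitoring point."""
    ratios = [change_ratio(c0, c) for c0, c in zip(coefs_ini, coefs)]
    return sum(ratios) / len(ratios)


# Hypothetical coefficients fitted on the full and on a reduced training dataset.
full_fit = [2.0, -4.0]
reduced_fit = [2.5, -3.0]
print(average_crc(full_fit, reduced_fit))  # 0.25
```

Under this definition a value near zero means that refitting with perturbed inputs leaves the coefficients almost unchanged, while values above one mean that a coefficient changed by more than its own initial magnitude.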

V. CONCLUSION
In the domain of dam displacement prediction based on monitoring data, most previous studies placed emphasis on improving the prediction accuracy and paid less attention to the model stability. In this study, we proposed a numerical-statistical combined model to enhance the model stability. The proposed model considered the spatial correlations of different monitoring points and the randomness of the coefficients of the explanatory variables. We quantified the dam displacement via three modules: the recoverable displacement, the non-recoverable displacement, and the measurement errors. Numerical simulation was used to constrain the coefficients of the explanatory variables in the module of recoverable displacement. The randomness of the coefficients of the explanatory variables was constrained with a random coefficient model.
We used the displacement monitoring data of a concrete arch dam at the Jinping-I hydropower station to validate the proposed model. The coefficients of determination $R^2$ of the proposed model were above 0.9 for the training dataset at all monitoring points. Compared with the statistical model (S model) and the coordinates-included statistical model (C-S model), the proposed model had a better prediction ability, with the smallest increase of RMSE as the prediction period became longer. In addition, the proposed model had the best stability: its average change ratios of coefficient when the training dataset was reduced by 5% and 10% ($\mathrm{CRC}_{5\%}$ and $\mathrm{CRC}_{10\%}$) were the lowest of the three models, and it had the lowest CRC in response to the five varied $\delta_{i,e}$ series. With its better model stability, the proposed model is more suitable for the long-term displacement prediction of large dams.

ABBREVIATIONS
The following abbreviations are used in this manuscript:

S model: Statistical model
C-S model: Coordinates-included statistical model
RMSE: Root mean square error
CRC: Change ratio of coefficient
$R^2$: Coefficient of determination

APPENDIX
COEFFICIENT RESULTS OF THE S MODEL AND THE C-S MODEL
Tables 6 and 7 list the coefficients of the explanatory variables in the S model and the C-S model.

ZHENZHU MENG was born in Dezhou, Shandong, China, in 1991. She received the B.Eng. and M.Sc. degrees in water conservancy and hydropower engineering from Hohai University, in 2012 and 2015, respectively. She is currently pursuing the Ph.D. degree with the Environmental Hydraulics Laboratory, École Polytechnique Fédérale de Lausanne (EPFL). Her research interests include landslide generated waves and structural monitoring data analysis.

CHENFEI SHAO was born in Nanjing, Jiangsu, China, in 1989. He received the B.Eng. degree in water conservancy and hydropower engineering from Hohai University, Nanjing, in 2011, where he is currently pursuing the Ph.D. degree. His research interests include early warning models for hydraulic structures, monitoring data analysis, computational mechanics, and engineering software.