Research on CART Model of Mass Concrete Temperature Prediction Based on Big Data Processing Technology

Temperature changes and temperature gradients during the construction of mass concrete can cause temperature cracks in the concrete. To predict the temperature change of the concrete during construction and to capture its trend accurately, this paper constructs a CART prediction model based on big data processing technology and on the characteristics of temperature change in mass concrete. Data processing methods such as missing value filling and moving averaging are presented together with the construction method of the CART prediction model, and an engineering example demonstrates the feasibility of the model and method. The results show that the model and method predict the temperature change of mass concrete well, with high prediction accuracy, and can provide guidance for practical engineering.


I. INTRODUCTION
Mass concrete is a common structural form in modern engineering structures. It generates a large amount of hydration heat during pouring. Because of the large size of the structure, the poor thermal conductivity of concrete, construction constraints, and temperature changes, temperature cracks usually occur during construction [1][2][3][4][5][6]. The mechanism of temperature cracking is mainly as follows: the temperature difference and temperature gradient between the inside and outside of the mass concrete structure generate stress and strain, and the internal and external constraints of the structure restrain the development of this stress and strain. When the stress or strain exceeds the limit value, the concrete structure is damaged and cracks form. Therefore, to control concrete cracking, the maximum tensile stress or the maximum tensile strain must be kept within the corresponding limit of the concrete. Temperature cracks not only affect the appearance of the concrete structure but also have a great impact on its safety, for example by reducing its bearing capacity, impairing its waterproof performance, and shortening its durability and service life.
Temperature control is one of the key means of ensuring the quality of mass concrete pouring, for example by controlling temperature changes during construction to avoid large temperature changes or large temperature gradients [7][8][9][10][11]. Predicting the concrete temperature during pouring, anticipating how it will change, and taking corresponding temperature control measures can ensure the construction quality of mass concrete [12,13].
At present, the main methods for analyzing the temperature change of mass concrete include the empirical formula method, the finite difference method, and the finite element method [14]. In 1933, measures such as low-heat cement, split joints, and intermittent pouring were adopted for the first time in the construction of Hoover Dam; around 1960, some concrete dams in the United States adopted measures such as reducing the proportion of cement and pre-embedding cooling water pipes, and achieved good temperature control and crack prevention; in 1968, Professor E. L. Wilson of the University of California first introduced the finite element time history analysis method into the temperature stress analysis of concrete dams; in 1990, Enrique Mirambell et al. established a two-dimensional finite difference analysis model; in 1992, Barrett P. K. et al. creatively applied a cracking model to the calculation and analysis of 3D temperature stress; in 2004, Yunus Ballim introduced the development and application of a finite difference thermal model that can predict the time-varying temperature distribution in mass concrete members; in 2008, Denzil Lokuliy studied the relationship between the generation and development of temperature cracks and temperature stress in mass concrete structures, and systematically analyzed the relationship between the development of mass concrete cracks and other internal effects; in 2014, Lawrence A. M. et al. used the finite element method to analyze the early strength and cracking of concrete with different auxiliary cementitious materials.
Since the temperature change process of mass concrete is relatively complex and has many influencing factors, the above conventional physical analysis methods simplify the heat diffusion equation, the boundary conditions and the external environment. They can reflect the thermal change law of concrete, but because the analysis is based on the physical diffusion process of temperature change, it places high demands on the temperature parameters and can deviate considerably from the actual temperature change of concrete [15][16][17][18][19][20][21][22][23][24][25].
Big data processing technology is a new type of data analysis and processing method that makes full use of existing data resources, mines them in depth, and extracts potential data laws and useful knowledge [26][27][28][29][30]. During the pouring of mass concrete, temperature change data can be collected by embedded temperature sensors, and various other data on the concrete pouring can also be obtained [31][32][33][34]. These data provide a basis for the analysis and application of big data technology. For example [35][36][37][38][39][40], with the help of mass concrete temperature data, Norris et al. proposed an embedded nano- and micro-electromechanical sensing system and successfully applied it to temperature and humidity monitoring inside concrete structures; Kim et al. proposed a surface acoustic wave monitoring system that can be used for non-destructive monitoring of concrete temperature. However, research on temperature prediction is still lacking that considers the real-time change of concrete temperature with time, the spatial distribution of temperature measurement points, sensor anomalies and failures, and other factors. In particular, practical methods that can guide temperature monitoring and predictive analysis in actual mass concrete engineering are lacking.
To address the temperature change of mass concrete [41][42][43][44], this paper introduces big data processing technology and, by analyzing temperature data that have already been recorded, provides a new analysis model for the temperature prediction of mass concrete.

1) INSTANCE SELECTION
The engineering example is the construction of the left-side joint bottom plate of the intake pool of a booster pumping station in Henan Province. The left joint bottom plate has a pouring length of 25 m, a width of 12.2 m, and a thickness of 3 m (4.5 m at the tooth groove), and the tooth groove is 2 m from the rear boundary. Figure 1 is a schematic diagram of the left-linked bottom plate project, showing the structure of the project. The concrete pouring size of the left-side slab of the inlet pool is relatively large, so it is a mass concrete pour. To provide the necessary temperature control, temperature sensors were embedded when the left-side slab was poured to collect temperature data.
The concrete grade of the left joint bottom plate of the inlet tank is C30W4F150, and the grade is grade two. The concrete mix ratio is shown in Table 1, which gives the amounts of water, cement, fly ash and other materials in 1 m³ of concrete. To ensure the effect of temperature control, cooling water pipes are laid during the concrete pouring of the left-linked bottom plate of the inlet tank. High-density polyethylene (HDPE) cooling water pipes are used for water cooling and are arranged in two layers: the first layer is laid at 1 m and the second layer at 2 m, with a horizontal spacing of 1 m. The inner diameter of the cooling water pipe is Φ32 mm, the wall thickness is 2 mm, and the water flow rate is controlled at 30 L/min. According to the construction schedule, the pouring of the left-linked bottom plate of the inlet pool ends at 10:30 on March 11, 2021. Figure 2 shows the outside temperature change during pouring, with time on the abscissa and temperature on the ordinate; the blue line represents the minimum temperature and the red line represents the maximum temperature.

2) MASS CONCRETE TEMPERATURE DATA
There are 21 temperature sensors embedded in the concrete of the left-side bottom plate, arranged in 3 layers of 7 sensors each. The top sensors are embedded 0.1 m from the concrete surface, the middle sensors 1.3 m from the concrete surface, and the bottom sensors 2.9 m from the concrete surface; all sensors are 0.5 m from the concrete side. The layout of the temperature sensors in each layer is shown in Figure 3, which indicates the position and number of the temperature sensors on the upper, middle and lower layers and the arrangement of the cooling water pipes. Table 2 summarizes the coordinates of each temperature measuring point. The embedded sensors are digital temperature sensors that store readings at a set interval. Considering that the concrete temperature changes relatively smoothly, and to meet the accuracy and quantity requirements of the temperature data, the storage interval is set to 10 min per group; that is, each temperature sensor stores 6 temperature readings per hour.
In the process of collecting temperature data of mass concrete, factors such as construction conditions, the weather environment, instrument failure, and differing amounts of data collected from different data sources (temperature sensors) mean that the consistency of data attributes cannot be guaranteed. The original temperature data may include abnormal points, missing values, and data with inconsistent attribute formats, and the data quality is difficult to meet the requirements.
If the original data are used directly, the outliers and missing values will seriously affect the analysis results. Therefore, it is necessary to apply big data analysis techniques to handle the outliers and missing values and improve the quality of the original data. For the accidental errors in the original data [45,46], moving average processing is performed to eliminate them [47,48].

1) MODEL SELECTION
The decision tree is one of the most classic predictive models in data science and a very widely used classification method. Compared with conventional methods, decision trees have the advantages of low time and space complexity, easily understood output, insensitivity to missing features, and no need for feature preprocessing. Researchers have proposed many decision tree generation algorithms that differ in feature types, impurity measures, the way split variables and split points are selected, and the number of child nodes per split, such as the common ID3, C4.5, conditional inference tree and CART.
The ID3 (Iterative Dichotomiser 3) algorithm pioneered decision trees. It uses information entropy to measure node impurity and information gain to evaluate node splitting. It can only handle discrete features; for continuous features, all values are enumerated to split nodes. The algorithm tends to select features with more values and to split off many leaf nodes with few samples, which easily causes overfitting. As an improvement on ID3, the C4.5 algorithm takes the sample size of each split node into account and uses the information gain ratio instead of the information gain as the evaluation criterion for node splitting, which avoids splitting off many child nodes with small sample sizes and favoring features with more values; continuous features are handled by split thresholds. However, both ID3 and C4.5 can only solve classification problems, not regression problems, so they are not suitable for predicting the temperature change trend in the later stage of mass concrete construction.
Conditional inference trees are similar to classic decision trees, but they select the independent variables and split points by statistical tests rather than by impurity measures. The significance test is a permutation test. A conditional inference tree splits into only two child nodes at a time and does not need pruning; the threshold of the significance test determines the complexity of the model, so determining this threshold parameter is very important. The temperature change of mass concrete during construction is affected by many interacting factors, such as the concrete material itself, the external temperature, and the construction process. When a conditional inference tree is used, independent and dependent variables must be distinguished separately, and it is difficult to determine the threshold parameter reasonably.
Classification and Regression Tree (CART) can solve both classification and regression problems. The decision tree prediction model directly reflects the characteristics of the data, handles numerical and categorical attributes at the same time, and can produce feasible and effective results for large data sources in a relatively short time. For classification, node impurity is measured by the Gini index, and node splitting is evaluated by the decrease in the Gini index. For regression, node impurity is measured by the variance of the target feature, and node splitting is evaluated by the decrease in variance. The model has a tree structure in which each internal node represents a test on an attribute, each branch represents a test outcome, and each leaf node represents a category (or, for regression, a predicted value); each split produces only two child nodes. Continuous features are split into two child nodes by a split threshold. For discrete features, CART does not create one child node per feature value; instead it selects one value at a time and splits the node according to whether the feature equals that value. When the CART model is learned, the training data are used to build the decision tree according to the principle of minimizing the loss function. Therefore, this paper recommends the CART algorithm for predicting the temperature of mass concrete.
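The two impurity measures named above, and the "drop value" used to score a candidate split, can be sketched in a few lines of Python. This is an illustrative sketch only, not code from the study; the function names `gini`, `weighted_impurity` and `impurity_drop` and the toy inputs are assumptions.

```python
from collections import Counter
from statistics import pvariance


def gini(labels):
    """Gini index: chance that two draws (with replacement) differ in class."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_impurity(impurity, left, right):
    """Size-weighted impurity of the two child nodes of a binary split."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)


def impurity_drop(impurity, parent, left, right):
    """Decrease in impurity achieved by splitting parent into left and right."""
    return impurity(parent) - weighted_impurity(impurity, left, right)


# Classification: Gini drop of a split that separates the classes perfectly.
gini_drop = impurity_drop(gini, ["a", "a", "b", "b"], ["a", "a"], ["b", "b"])

# Regression: variance drop for made-up target values.
var_drop = impurity_drop(pvariance, [1.0, 1.0, 3.0, 3.0], [1.0, 1.0], [3.0, 3.0])
```

A split is preferred when its drop is larger; in both toy cases the children are pure, so the drop equals the parent impurity.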
The CART algorithm consists of the following two steps: (1) Decision tree model generation: generate a decision tree from the training data set, making the generated tree as large as possible; (2) Decision tree model pruning: prune the generated tree with the verification data set and select the optimal subtree. Minimizing the loss function can be used as the pruning criterion.

2) CART MODEL BUILDING
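The generation of the CART model is a process of recursively constructing a binary decision tree. For the classification tree, the Gini index is the basis for selecting the optimal partition feature [49][50][51][52]. The Gini value is the probability that two samples drawn at random from a sample set do not belong to the same class. Its main function is to measure the "impurity" of a data division: the smaller the Gini value, the higher the "purity" of the samples after the division. Formula 1 gives the Gini value in its standard form:
$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^{2} \tag{1}$$
where $D$ is the sample set, $K$ is the number of classes, and $p_k$ is the proportion of samples in $D$ that belong to class $k$. For the regression tree, the division is determined by minimizing the square error. Each region of the input space is recursively divided into two subregions, and the output value of each subregion is determined. Formula 2 is the standard least-squares criterion for selecting the split feature $j$ and the split point $s$:
$$\min_{j,s}\left[\min_{c_1}\sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2}\sum_{x_i \in R_2(j,s)} (y_i - c_2)^2\right] \tag{2}$$
where $R_1(j,s) = \{x \mid x^{(j)} \le s\}$ and $R_2(j,s) = \{x \mid x^{(j)} > s\}$ are the two subregions, and the optimal outputs $c_1$ and $c_2$ are the mean values of $y_i$ in $R_1$ and $R_2$.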
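The recursive least-squares splitting described above can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this study: the function names (`best_split`, `build_tree`, `predict`), the dictionary-based tree layout and the toy data are assumptions made for illustration, and the stopping parameters simply borrow the names quoted later in the paper.

```python
from statistics import mean


def sse(ys):
    """Sum of squared errors of ys around their mean (0.0 for an empty list)."""
    if not ys:
        return 0.0
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)


def best_split(rows, ys, min_samples_leaf=1):
    """Return (feature index, threshold, error) of the least-squares split, or None."""
    best = None
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        # Candidate thresholds lie midway between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2
            left = [y for r, y in zip(rows, ys) if r[j] <= s]
            right = [y for r, y in zip(rows, ys) if r[j] > s]
            if min(len(left), len(right)) < min_samples_leaf:
                continue
            err = sse(left) + sse(right)
            if best is None or err < best[2]:
                best = (j, s, err)
    return best


def build_tree(rows, ys, max_depth=3, min_samples_split=2, min_samples_leaf=1):
    """Recursively grow a binary regression tree stored as nested dicts."""
    node = {"value": mean(ys)}
    if max_depth == 0 or len(ys) < min_samples_split:
        return node
    split = best_split(rows, ys, min_samples_leaf)
    if split is None:
        return node
    j, s, _ = split
    left = [(r, y) for r, y in zip(rows, ys) if r[j] <= s]
    right = [(r, y) for r, y in zip(rows, ys) if r[j] > s]
    node.update(
        feature=j,
        threshold=s,
        left=build_tree([r for r, _ in left], [y for _, y in left],
                        max_depth - 1, min_samples_split, min_samples_leaf),
        right=build_tree([r for r, _ in right], [y for _, y in right],
                         max_depth - 1, min_samples_split, min_samples_leaf),
    )
    return node


def predict(node, row):
    """Follow the splits down to a leaf and return its mean target value."""
    while "feature" in node:
        go_left = row[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["value"]


# Made-up data: [elapsed hours, depth in m] -> temperature in deg C.
X = [[0, 0.1], [10, 0.1], [20, 0.1], [0, 1.3], [10, 1.3], [20, 1.3]]
y = [15.0, 25.0, 30.0, 16.0, 35.0, 45.0]
tree = build_tree(X, y, max_depth=2)
```

Each leaf stores the mean of the training targets that reach it, which is the output value that minimizes the square error inside that region.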

3) CART MODEL PRUNING
The classic decision tree is the core of the CART model. The regression tree is pruned to prevent the model from overfitting. The CART algorithm performs pruning by calculating the error gain rate on the training set. Formula 3 is the calculation formula of the error gain rate:
$$g(t) = \frac{C(t) - C(T_t)}{|T_t| - 1} \tag{3}$$
where $g(t)$ is the error gain rate; $C(t)$ is the error of node $t$ after its subtree is pruned; $C(T_t)$ is the error of the subtree $T_t$ when the subtree of node $t$ has not been pruned; and $|T_t|$ is the number of leaf nodes of $T_t$. The pruning strategy is to take the node with the smallest $g(t)$, prune its subtree to generate the first subtree, and repeat this process until only the root node is left, which is the last subtree. The verification set is then used to evaluate all subtrees, and the subtree with the smallest error is selected.
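One pruning step can be sketched as follows, assuming the usual cost-complexity form of the error gain rate, g(t) = (C(t) - C(T_t)) / (|T_t| - 1). The tree layout, the node names and the error values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def leaves(node):
    """Return the leaf nodes of the subtree rooted at node."""
    if "children" not in node:
        return [node]
    return [leaf for child in node["children"] for leaf in leaves(child)]


def error_gain_rate(node):
    """g(t) = (C(t) - C(T_t)) / (|T_t| - 1) for an internal node t."""
    subtree_leaves = leaves(node)
    c_pruned = node["error"]                                   # C(t)
    c_subtree = sum(leaf["error"] for leaf in subtree_leaves)  # C(T_t)
    return (c_pruned - c_subtree) / (len(subtree_leaves) - 1)


def internal_nodes(node):
    """Yield every internal (non-leaf) node of the tree."""
    if "children" in node:
        yield node
        for child in node["children"]:
            yield from internal_nodes(child)


def weakest_link(root):
    """Return the internal node with the smallest error gain rate."""
    return min(internal_nodes(root), key=error_gain_rate)


# Hypothetical tree: "error" is the training error a node would have as a leaf.
tree = {
    "name": "root", "error": 10.0,
    "children": [
        {"name": "A", "error": 4.0,
         "children": [{"name": "A1", "error": 1.5}, {"name": "A2", "error": 2.0}]},
        {"name": "B", "error": 3.0},
    ],
}
node = weakest_link(tree)
```

Here node A has the smallest rate (0.5 against 1.75 for the root), so its subtree would be cut first; repeating the step on the pruned tree produces the nested sequence of subtrees that the verification set then ranks.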

C. Big Data Processing Technology And Methods
The big data analysis workflow for temperature changes of mass concrete comprises data collection, data cleaning, data analysis and data application, briefly described as follows. Data collection: the raw data that form the basis of big data analysis are obtained first, including the temperature data of each measuring point of the concrete, the position data of the measuring points, the concrete pouring data, the outside air temperature data, the outside weather data, and the outside wind data.
Data cleaning: because of problems with collection equipment and manual records, part of the acquired raw data will be wrong or missing, so the raw data need to be cleaned and sorted to meet the data usage requirements and improve the effect of the big data application. Data analysis: the processed data are mined with machine learning, intelligent algorithms, mathematical modeling and other means to analyze the data in depth and obtain their internal laws and predictive models. Data application: with the prediction model obtained from big data analysis, the corresponding working condition parameters are input for each working condition and the analysis model is called to predict the temperature change of the mass concrete.

1) OUTLIER IDENTIFICATION
Outliers, also known as abnormal values, are data samples that deviate significantly from the other data and are unreasonable. Handling outliers usually involves detection, screening and processing. Commonly used detection methods include boxplots, simple statistics (such as inspecting maximum or minimum values), and the 3σ principle. Commonly used processing methods include deletion, imputation (regression imputation, multiple imputation) and substitution (the mean for continuous variables, and the mode or median for discrete variables).
Since mass concrete temperature data come from many sources, the data samples are large and the number of outliers is generally small. Therefore, this paper recommends using a boxplot to identify outliers and directly removing them. This makes outlier processing very simple, reduces the workload of data detection, screening, interpolation and replacement, yields usable data quickly, and has little impact given the large volume of data.
A boxplot is a method commonly used in statistics to display the distribution of data, and it places no restrictions on the data. It is drawn from the maximum, minimum, 0.25 quantile, median, and 0.75 quantile of a data set. A boxplot intuitively reflects the degree of dispersion and the overall distribution of the data and helps identify singular values in the sample. For a vertical boxplot, the upper and lower edges of the box are the 0.75 quantile and the 0.25 quantile, respectively, and the red notch in the middle is the sample median. The upper and lower boundaries lie beyond the box at a distance proportional to the interquartile range (0.75 quantile - 0.25 quantile). Data points that fall within the upper and lower boundaries are normal points, and data points outside the boundaries are abnormal points that need to be eliminated.
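A boxplot-based outlier screen can be sketched as below. The multiplier `k = 1.5` is the conventional Tukey value and is an assumption here, as are the function names, the quantile interpolation method and the made-up readings. Outliers are replaced by `None` rather than dropped so that the time alignment of the series is kept for the later missing-value step.

```python
from statistics import quantiles


def boxplot_bounds(values, k=1.5):
    """Return (lower, upper) boundaries: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Replace readings outside the boxplot boundaries with None."""
    lower, upper = boxplot_bounds(values, k)
    return [v if lower <= v <= upper else None for v in values]


# Made-up temperature readings in deg C; the last one is a sensor glitch.
readings = [20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 95.0]
cleaned = remove_outliers(readings)
```

For these readings the quartiles are 22 and 26, so the boundaries are 16 and 32 and only the 95.0 reading is flagged.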

2) MISSING VALUE FILL
Because of failures of the measurement equipment, missing records, outlier elimination and similar causes, the measurement data contain missing values. Missing values are usually handled by filling methods such as mean filling, mode filling and linear regression filling, by setting them as dummy variables, or by eliminating them directly. Mean and mode filling are simple, but the filling results are rough and can even have a negative impact on model training. Linear regression filling is relatively complex and can obtain better collinearity, but the selection of the filling basis is particularly important. When the number of missing values is small and their impact on the overall trend analysis is small, setting dummy variables or eliminating them can be used directly.
For the missing values in the mass concrete temperature data, because data acquisition devices such as thermometers are embedded in the concrete, damaged thermometers are difficult to repair or replace, which inevitably causes some continuous data to be missing. To reflect the temperature change trend at different positions inside the mass concrete, this paper proposes to use the nearest neighbor algorithm model for filling [53][54][55][56].
The nearest neighbor algorithm is an estimation method based on a similarity measurement strategy. It judges the relationship between two data items according to their similarity, and then estimates the missing values of a data item from related data items according to the "distance" between them.
Suppose the two data samples are $i$ and $j$; Formula 4 is the expression of the proximity value between the two samples.
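Nearest-neighbor filling can be sketched as follows. The proximity measure used in the study is not reproduced in the text, so this sketch assumes plain Euclidean distance over the sample features and averages the `k` closest complete samples; the function name `knn_fill`, the choice of features and the toy values are all hypothetical.

```python
from math import dist


def knn_fill(samples, k=2):
    """Fill missing values (None) with the mean of the k nearest complete samples.

    samples: list of (features, value) pairs; at least one value must be known.
    """
    known = [(f, v) for f, v in samples if v is not None]
    filled = []
    for features, value in samples:
        if value is None:
            # Rank complete samples by Euclidean distance in feature space.
            nearest = sorted(known, key=lambda fv: dist(features, fv[0]))[:k]
            value = sum(v for _, v in nearest) / len(nearest)
        filled.append((features, value))
    return filled


# Made-up samples: (elapsed hours, sensor depth in m) -> temperature in deg C.
samples = [
    ((1.0, 0.1), 30.0),
    ((2.0, 0.1), 32.0),
    ((3.0, 0.1), None),   # missing reading
    ((4.0, 0.1), 36.0),
    ((3.0, 1.3), 50.0),
]
filled = knn_fill(samples, k=2)
```

In practice the features would need to be scaled to comparable ranges before distances are computed, since hours and metres are mixed here only for brevity.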

3) RANDOM ERROR ELIMINATION
Many random factors in the measurement process make the magnitude and direction of the measurement deviation difficult to predict, resulting in measurement errors. Because measurement errors are universal, any measurement contains random errors. In data trend monitoring, random errors appear as fluctuations, that is, burrs in the data curve. Commonly used methods for handling random errors during data monitoring include median filtering, arithmetic mean filtering, moving average filtering, exponential smoothing filtering, and linear quadratic exponential smoothing filtering. Median filtering is a simple and coarse data processing method; it works well on static measurement signals and is suited to dealing with gross errors. Arithmetic mean filtering is suited to signals with general random interference, but it cannot reject gross errors. Moving average filtering is a predictive filtering method suited to systems with high real-time requirements but slow processing speed. Exponential smoothing filtering and linear quadratic exponential smoothing filtering are weighted moving average filters: in predictive filtering, the latest measurement value is given a larger weight than earlier measurement values, which effectively reduces the lag error of moving average filtering.
Since the mass concrete temperature data are continuous monitoring data whose change law and trend need to be analyzed and predicted, the requirements on error accuracy and tolerance are high. At the same time, the temperature data are suitably given equal weight throughout the monitoring process. Therefore, this paper recommends the moving average filtering method, a widely used method for random error processing, analysis and prediction. The moving average is a commonly used technique in data analysis; its basic idea is to average the monitored quantity dynamically over time, thereby eliminating errors caused by random fluctuations of uncertain factors during the measurement process. Formula 5 is the model calculation formula for moving average processing [57][58][59][60].
$$T(t) = \frac{1}{n(t)} \sum_{i=1}^{n(t)} T_i(t) \tag{5}$$
where $T(t)$ is the temperature value after moving average, $\tau(t)$ is the time interval containing the moment $t$, $n(t)$ is the number of temperature values collected from the monitoring object in the time interval $\tau(t)$, and $T_i(t)$ is the raw temperature monitoring data in that interval.
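A centered moving average of this kind can be sketched in a few lines. The function name, the shrinking window at the ends of the series and the sample readings are assumptions for illustration; `half_window=2` corresponds to a five-point window.

```python
def moving_average(values, half_window=2):
    """Centered moving average; the window shrinks at both ends of the series."""
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half_window): i + half_window + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Made-up raw readings in deg C with small random fluctuations.
raw = [30.0, 30.4, 29.9, 30.6, 30.1, 30.8, 30.3]
smooth = moving_average(raw, half_window=2)

# Largest and mean absolute difference between raw and smoothed series.
diffs = [abs(r - s) for r, s in zip(raw, smooth)]
max_abs_diff = max(diffs)
mean_abs_diff = sum(diffs) / len(diffs)
```

All points in the window carry equal weight, which matches the equal-treatment assumption stated above; a weighted variant would replace the plain mean with a weighted sum.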

A. Temperature data processing
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3161556, IEEE Access VOLUME XX, 2017 9

1) OUTLIER IDENTIFICATION
Abnormal values in the collected raw temperature data are identified with the box plot, and the normal value range is determined from the distribution of the box plot, as shown in Table 3. Through box plot processing, temperature data outside the interval [1.4, 73.1] are regarded as outliers and deleted, which improves the overall quality of the sample data.

2) MISSING VALUE FILL
Because of the removal of abnormal values and collection failures, missing values appear in the temperature data. In this example, because of a measurement failure of the upper-layer temperature sensor, no temperature data were collected between 16:20 on March 20 and 0:20 on March 22, 2021, leaving a continuous gap. Figure 4 shows the original distribution of temperature data at the upper-layer measurement points. Filling the missing data ensures the continuity of the data; the filled data follow the temperature change law of the measuring point where they are located, so the filling is scientifically reasonable.

3) MOVING AVERAGE PROCESSING
After abnormal value processing and missing value filling, the collected temperature data are processed with the moving average method to eliminate random errors in the data and improve the data quality.
Taking point 3# of the upper-layer temperature sensors as an example, the characteristic temperature is obtained by averaging 5 data points centered on the current point (the point itself plus 2 points before and 2 points after). Figure 6 shows the data distribution after moving average of point 3# on the upper layer, and Figure 7 is a partial enlarged view of the same point. After moving average processing, the difference between the characteristic temperature and the original temperature lies within [-0.3, 0.3], and the average single-point absolute difference between the original temperature and the characteristic temperature is 0.046. The fluctuation of the temperature value becomes smaller and the temperature becomes a smooth curve over time, which eliminates the effect of errors caused by high-frequency fluctuations.

B. Temperature prediction model training
The inputs to the CART regression model for training are the temperature data collected by upper-layer temperature sensors 1#-7# and middle-layer temperature sensors 1#, 2#, 3#, 6# and 7# on the left side of the inlet pool bottom plate, together with the sensor location, collection time, outside air temperature, wind, weather conditions, and the height from the temperature sensor to the cooling water pipe.
Through cross-validation, the hyperparameters of the CART algorithm are set to max_depth = 10, min_samples_split = 4, and min_samples_leaf = 3. With these settings, the mean square error is 0.047 on the training set and 0.086 on the test set, the smallest values obtained for the training set and the test set.
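The cross-validation scoring behind such a hyperparameter choice can be sketched as below. The text does not state the number of folds, the shuffling strategy or the software used, so everything here is an assumption: `k_fold_mse` is a hypothetical helper, the folds are contiguous, and `fit_mean` is a stand-in model (not CART) used only so that the sketch runs on its own.

```python
def k_fold_mse(xs, ys, fit, k=5):
    """Mean validation MSE over k contiguous folds (requires len(xs) >= k)."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        start, stop = fold * n // k, (fold + 1) * n // k
        # Train on everything outside the current fold.
        train_x = xs[:start] + xs[stop:]
        train_y = ys[:start] + ys[stop:]
        predict = fit(train_x, train_y)
        squared = [(predict(x) - y) ** 2
                   for x, y in zip(xs[start:stop], ys[start:stop])]
        fold_errors.append(sum(squared) / len(squared))
    return sum(fold_errors) / k


def fit_mean(train_x, train_y):
    """Stand-in model (not CART): always predicts the training mean."""
    average = sum(train_y) / len(train_y)
    return lambda x: average


# Made-up data: elapsed hours -> temperature in deg C.
xs = [[float(h)] for h in range(10)]
ys = [20.0 + 2.0 * h for h in range(10)]
score = k_fold_mse(xs, ys, fit_mean, k=5)
```

To select hyperparameters, the same score would be computed for each candidate combination (for example, different depth and leaf-size limits passed to the tree-fitting function) and the combination with the smallest score kept.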

C. Temperature prediction model verification
The model trained on the above temperature data is verified on middle-layer temperature sensors 4# and 5# and lower-layer temperature sensors 1#-7# on the left side of the inlet tank bottom plate. Figure 8 compares the predicted and measured temperatures at points 4# and 5# in the middle layer, and Figure 9 shows the measured and predicted temperatures of lower-layer temperature sensors 1# to 7#. The verification shows that the trends of the predicted and measured temperature curves are consistent, which basically reflects the change law of the concrete temperature during the construction period. The prediction error is larger at the inflection point of the temperature curve and at local points in the later stage of temperature change, but it basically stays within 1 °C. The deviations of the prediction curve from the measured curve fluctuate but are basically concentrated around 0 °C.
In summary, the CART temperature prediction model trained with big data processing technology can better reflect the law of concrete temperature change during the construction of mass concrete projects and can predict temperature changes with high prediction accuracy, strong generalization ability and stability.

IV. CONCLUSION
In this paper, based on the characteristics of temperature change of mass concrete and a review of prediction models and methods, a CART model suitable for temperature prediction of mass concrete is constructed. Abnormal value identification, missing value filling, and random error elimination are applied to improve the quality of the data. In an engineering example, the collected temperature data are processed and analyzed to verify the feasibility of temperature change prediction and analysis based on big data processing technology and the CART prediction model. The results show that the model and method can better predict the temperature change of mass concrete.
Because the model and method proposed in this paper are based on the acquisition and analysis of a large amount of data during the construction of mass concrete, the data are cleaned with big data processing technology and the temperature prediction model is trained on this basis, so the model is more targeted and has higher prediction accuracy. The model and method can also be updated in real time according to different working conditions in the construction process, which makes them more applicable than traditional methods. At the same time, the model and method are easy to understand and use, and they are suitable for mass concrete construction enterprises to analyze and process a large amount of temperature data during construction and predict the temperature change trend. This makes it convenient for construction personnel to adjust the concrete construction process in time according to the prediction results, optimize the engineering measures, save engineering investment, reduce the occurrence of concrete temperature cracks, and improve the quality of concrete construction.