An Improved-Bagging Model for Water Chemical Oxygen Demand Measurements Using UV-Vis Spectroscopy

Ultraviolet-visible (UV-Vis) spectroscopy is a simple physical method for measuring Chemical Oxygen Demand (COD) in water that avoids secondary pollution from chemical reagents. To address the low accuracy and insufficient generalization capability of existing COD prediction models, an improved Bagging algorithm is proposed and evaluated in this study. The Improved-Bagging algorithm reduces model variance and bias concurrently, and improves on the accuracy and stability of the traditional Bagging algorithm. Results show that the Improved-Bagging algorithm achieves better prediction ability than the traditional Bagging algorithm on differently preprocessed data. After denoising with the ensemble empirical mode decomposition based (EEMD-Based) algorithm and dimension reduction with the stability competitive adaptive reweighted sampling (SCARS) algorithm, the Improved-Bagging model achieves the best prediction performance: its coefficient of determination (R²) on the prediction set reached 0.9317, its root mean square error of prediction (RMSEP) reached 5.39 mg/L, and its variance reached 5.53 mg². Results also show that the Improved-Bagging algorithm can accurately measure the COD concentration in water, which lays a foundation for the wider application of spectroscopy to measuring water quality parameters.


I. INTRODUCTION
COD describes the pollution of water by reducing substances; it is an important parameter for evaluating water and a required parameter in water quality measurements. Generally, there are two kinds of methods for measuring COD: chemical methods and physical methods (i.e., spectroscopy methods) [1], [2]. Chemical methods generally use strong oxidants, such as potassium permanganate and potassium dichromate, to oxidize water samples under strongly acidic conditions, and then calculate the COD in water from the amount of oxidant consumed. Chemical methods have disadvantages such as secondary pollution and long measurement periods, which make them unsuitable for online and real-time measurements [3], [4]. The wavelength range for measuring COD via spectroscopy is generally in the ultraviolet-visible interval. After UV-Vis light is transmitted through water, the corresponding COD value is obtained by measuring the absorbance of the water [5]. The water quality measurement method based on UV-Vis spectroscopy has received increasing attention in recent years, and its application prospects are good.
The associate editor coordinating the review of this manuscript and approving it for publication was Wen-Sheng Zhao.
Essentially, measuring COD in water with UV-Vis spectroscopy amounts to building a COD prediction model based on UV-Vis spectra. By building a calibration model between the UV-Vis spectrum data of water and the COD standard values, the COD concentration in water can be predicted from the water's spectrum. Therefore, the prediction accuracy of the model depends on the quality of the calibration modeling method [6]. In recent years, with developments and breakthroughs in statistics, applied mathematics, chemometrics, artificial intelligence and other fields, some new modeling methods have been applied to UV-Vis spectroscopy to measure COD, providing a new way to measure COD concentrations in water with a complex composition [7]-[9]. Modeling methods primarily include statistical methods and machine learning methods, and the appropriate method should be chosen for the specific application environment. Building prediction models suited to a given water body with statistical or machine learning methods, according to the morphological characteristics of its spectra, has attracted substantial attention and led to useful results. In early studies, COD was measured via UV-Vis spectroscopy at one or several wavelengths, which is referred to as the single-wavelength or multi-wavelength method. Linear regression (LR) and multiple linear regression (MLR) are primarily used for modeling. These methods are simple and accurate for water samples that are of uniform composition and remain relatively similar over time; however, they are unsuitable for water samples whose composition changes markedly with time [10]-[13]. The full spectrum, by contrast, contains more abundant information, which can effectively improve the accuracy of COD measurement. Therefore, combining chemometrics methods with the full spectrum is the current development trend.
For example, partial least squares (PLS), support vector machine (SVM), random forest regression (RFR), and artificial neural networks (ANNs) have already been applied to UV-Vis spectroscopy to measure COD in water [14]-[22]. Machine learning algorithms, which have developed with the rise of artificial intelligence, come in many types, are flexible to use, optimize quickly, and can be further improved. Both their measurement performance and their scope of application have made a qualitative leap. However, many machine learning algorithms, such as ensemble learning methods, have not yet been used in real-world water quality measurement applications using UV-Vis spectroscopy; thus, there is still much work to be done. Ensemble learning has achieved good results in many fields and has the advantages of improving prediction accuracy and stability and reducing overfitting [23]-[27]. Therefore, this paper uses ensemble learning in COD measurements based on UV-Vis spectroscopy to build a COD calibration model. To improve model accuracy, the basic ensemble learning method is adapted to the requirements of COD measurement. This paper thus proposes an Improved-Bagging algorithm based on elastic net regression. Drawing on the two-phase learning idea of the Stacking algorithm, the "learning method" is used to replace the "simple average" ensemble strategy (for regression problems) in the traditional Bagging algorithm. Because elastic net regression is simple and performs feature extraction, it is used as the combination (ensemble) strategy for the base learners in the Bagging algorithm. In this way, the influence of the high collinearity among the base learners in Bagging on the accuracy of the final ensemble model can be mitigated.
The improved algorithm retains the advantages of both the Bagging algorithm and the Stacking algorithm, and can reduce the variance and bias concurrently. Thus, the prediction accuracy of the Bagging model is improved. In addition, a variety of spectrum denoising and dimension reduction methods are used to preprocess the spectrum data that are input to the model. Spectrum features that comprehensively reflect the COD in water are extracted from the high-dimensional spectrum data to speed up model convergence and reduce computational complexity.
The remainder of this paper is organized as follows. Section 2 presents the materials required for the experiment and the proposed methods for spectrum modeling. Section 3 reports the results of COD predictions from different models and discusses the prediction results. Section 4 concludes the paper.

A. INSTRUMENTS AND SAMPLES
1) EXPERIMENTAL INSTRUMENTS
The measurement of COD in water by UV-Vis spectroscopy was performed with a COD measurement instrument system. The system structure is shown in Figure 1 and is primarily composed of a light source, a sample cell, a spectrometer and a computer. The light source used in the experiment was a DH-2000-DUV deuterium-halogen-tungsten light source (Ocean Optics, USA), which can provide 190-2500 nm light. The optical path length of the sample cell for water is 10 mm. The UV-Vis spectrometer used in the experiment was a USB2000+ (Ocean Optics, USA), which can measure light with wavelengths between 165 and 1200 nm with a resolution of 0.45 nm. The OceanView spectrum acquisition software, which comes with the spectrometer, was used to analyze all data and stores spectra over the wavelength range from 193.91 to 1121.69 nm. The baseline was corrected based on deionized water, and the integration time of the spectrometer was 10 ms. Each water sample was scanned 10 times in succession, and the average value was taken. A computer was used to save data, process data and build models with the corresponding software.

2) EXPERIMENTAL SAMPLES
Water samples were collected from Qian Lake in the center of Nanjing city, which is important to local fisheries, bird habitats, water resources and the regional environment. The water quality of this lake is affected by domestic sewage from the growing urban population. Therefore, it is critical for local residents to effectively measure and monitor water quality, and to warn of water quality problems quickly. From June 2019 to June 2020, water samples from the lake were collected once per day (except holidays); a total of 249 samples were collected throughout the year. Some samples contained many impurities, were not suitable for further research, and were not considered in this study. In view of the subsequent sample set division, 240 samples were selected from all collected samples.
VOLUME 9, 2021
Each water sample was divided into two parts: one was used to measure the standard value of COD, and the other was used to collect UV-Vis spectrum data. To retain the original characteristics of the lake water to the maximum extent, the UV-Vis spectra and COD standard values of the collected water samples were measured immediately after collection, using the UV-Vis spectrum collection instrument system shown in Figure 1. Baseline correction was based on deionized water. The integration time of the spectrometer was set to 10 ms, each water sample was scanned 10 times, and the average value was taken. The original UV-Vis spectra of the 240 water samples are shown in Figure 2.
Figure 2 shows the curves of the original UV-Vis spectra of the collected water samples. The curve trends of the original spectrum data of water samples collected at different times are similar. A strong absorption peak appears at approximately 235 nm, which is due to absorption by COD. Jumps and pinnacles appear at approximately 500 nm and 680 nm, which are due to the presence of noise in the spectrum. Because of these deficiencies in the collected spectra, it is necessary to preprocess them before building a model to improve the accuracy of COD measurement based on UV-Vis spectroscopy. The jumps and pinnacles at approximately 500 nm and 680 nm carry no meaningful information and are removed before preprocessing.

2) COD STANDARD VALUE MEASUREMENT
According to the rapid digestion spectrophotometry method, the COD of the collected water samples was measured by a DRB200 digester and a DR3900 visible spectrophotometer (HACH, USA). The required chemical reagents and water samples to be tested were fully mixed and put into a DRB200 COD digester preheated to 165 °C in advance for 20 minutes. After digestion, each digestion tube was placed onto the cooling shelf to cool. After cooling to room temperature (25 ± 1) °C, color reagent was added to each digestion tube. Finally, the COD value was measured with a DR3900 spectrophotometer.

C. SAMPLE SET DIVISION
A reasonable calibration set can improve the prediction ability of the built calibration model. Common sample selection methods include random sampling (RS), conventional selection (CS), Kennard-Stone (KS) and sample set partitioning based on joint X-Y distance (SPXY) [28]-[31]. The RS method cannot guarantee the representativeness of the selected samples because it selects the samples of the calibration set randomly. The CS method selects samples according to their chemical measurement values and takes the samples with the maximum or minimum chemical measurement values as the calibration set samples. RS and CS are subjective in sample selection. The KS method first selects the pair of samples with the farthest Mahalanobis distance for inclusion in the calibration set. It then calculates the distance from each remaining sample to each selected sample in the calibration set, finds for each remaining sample its minimum distance to the selected samples, and adds the sample with the largest such minimum distance to the calibration set. This step is repeated until the number of samples in the calibration set meets the requirements. The SPXY algorithm is a sample set division method based on the KS algorithm that divides the sample set by considering both the spectrum and the chemical values of the samples. This algorithm has the advantages of covering the multidimensional vector space and effectively improving the prediction performance of the calibration model. Using the SPXY algorithm, the calibration set and prediction set are divided in a ratio of 2:1; therefore, among the 240 water samples, 160 samples are used as the calibration set, and 80 samples are used as the prediction set. The statistical characteristics of the samples are shown in Table 1.
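The selection procedure described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function name `spxy_split` is hypothetical, Euclidean distance is assumed for the spectra (the text mentions Mahalanobis distance for KS), and the joint distance is assumed to be the sum of the X and y distances, each normalised by its maximum.

```python
import math

def spxy_split(X, y, n_cal):
    """Return (calibration_indices, prediction_indices), SPXY-style.

    Assumes n_cal >= 2 and Euclidean distances (an assumption of this sketch).
    """
    n = len(X)
    # Pairwise distances in spectrum space (dx) and in COD space (dy).
    dx = [[math.dist(X[p], X[q]) for q in range(n)] for p in range(n)]
    dy = [[abs(y[p] - y[q]) for q in range(n)] for p in range(n)]
    max_dx = max(map(max, dx)) or 1.0
    max_dy = max(map(max, dy)) or 1.0
    # Joint X-Y distance: each part is normalised by its maximum.
    d = [[dx[p][q] / max_dx + dy[p][q] / max_dy for q in range(n)]
         for p in range(n)]
    # Start from the two samples that are farthest apart.
    p0, q0 = max(((p, q) for p in range(n) for q in range(p + 1, n)),
                 key=lambda pq: d[pq[0]][pq[1]])
    selected = [p0, q0]
    remaining = [i for i in range(n) if i not in selected]
    # Repeatedly add the sample whose nearest selected neighbour is farthest.
    while len(selected) < n_cal:
        best = max(remaining, key=lambda i: min(d[i][j] for j in selected))
        selected.append(best)
        remaining.remove(best)
    return selected, remaining

# Toy data (made up for illustration): 6 "spectra" with 2 features each.
X = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.5, 0.5], [0.9, 1.0], [0.0, 0.1]]
y = [1.0, 1.2, 9.0, 5.0, 8.5, 1.1]
cal, pred = spxy_split(X, y, n_cal=4)
```

With 240 samples and `n_cal=160`, the same call would give the 2:1 division used in this study.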

D. UV-VIS SPECTRUM PREPROCESSING
Due to the influence of experimental conditions, including the spectrometer hardware and natural light, the original spectrum data contain some noise, as shown in Figure 2. If the original spectra are used directly, the reliability and stability of the calibration model will be affected; thus, it is important to preprocess the original spectrum data properly in advance. Preprocessing reduces the influence of external factors on the spectrum and improves the correlation between the spectrum and the component to be measured, which makes it possible to build a robust and reliable prediction model. In this paper, Gaussian smoothing (SG), Fourier transform (FT), wavelet transform (WT) and EEMD-Based denoising algorithms are used to process the spectrum, and the effects of the four denoising algorithms are compared [32]-[36]. SG smoothing is a linear smoothing method that is suitable for eliminating Gaussian noise and is widely applied to many kinds of data. FT works well for denoising stationary signals, and WT denoising methods have been widely studied and have achieved good results in a variety of spectral denoising tasks. The EEMD-Based denoising method is a newer denoising method that has strong adaptability and plays an important role in signal denoising.
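As an illustration of the simplest of the four methods, the sketch below smooths a one-dimensional absorbance curve with a normalised Gaussian kernel. It is only a sketch under stated assumptions: the function name `gaussian_smooth`, the kernel width `sigma`, the default `radius` of three standard deviations, and the edge handling (repeating the first and last sample) are all illustrative choices, since the text does not give the settings used in this study.

```python
import math

def gaussian_smooth(signal, sigma=2.0, radius=None):
    """Smooth a 1-D sequence with a normalised Gaussian kernel."""
    if radius is None:
        radius = int(3 * sigma)  # assumed cut-off: three standard deviations
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    total = sum(weights)
    weights = [w / total for w in weights]  # weights now sum to 1
    n = len(signal)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for k, w in zip(offsets, weights):
            # Clamp the index so the edges reuse the first/last sample.
            j = min(max(i + k, 0), n - 1)
            acc += w * signal[j]
        smoothed.append(acc)
    return smoothed

# A single noise spike is spread out and its height is reduced.
spiky = [0.0] * 5 + [1.0] + [0.0] * 5
print(gaussian_smooth(spiky, sigma=1.0))
```

A larger `sigma` removes more noise but also flattens genuine absorption peaks, which is why the denoising methods are compared through the downstream COD model rather than by eye alone.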

E. SPECTRUM DIMENSION REDUCTION ALGORITHM
After denoising, the UV-Vis spectra collected by the instrument system still have serious nonlinear or linear overlap, and the spectrum data dimension is high. High-dimensional data contain a large amount of redundant information, which obscures the important information. If the full UV-Vis spectrum is used as the input variable of the calibration model, it will consume substantial computing resources and introduce interference from unknown substances, thus reducing the measurement accuracy and generalization performance of the model. Therefore, it is necessary to reduce the dimension of the high-dimensional spectrum, extract its effective feature information, and improve the efficiency of model training and the generalization ability. Data dimension reduction can be divided into two categories: feature transformation and feature selection. According to its result, feature selection can be further divided into continuous feature selection (wavelength interval selection) and discontinuous feature selection (a limited number of discontinuous wavelengths). This paper selects one representative algorithm from each of feature transformation, continuous feature selection and discontinuous feature selection for analysis. Principal component analysis (PCA), interval partial least squares (iPLS), and SCARS were used to reduce the dimension of the full UV-Vis spectrum, and the performance of the resulting COD calibration models was analyzed [37]-[39].
Bagging [40], [41] is an ensemble learning method that uses the same base learner on different data subsets to generate multiple different learners and combines these learners into an ensemble model. The core idea of Bagging is the bootstrap. For a given dataset containing n samples, we first draw m random samples with replacement to obtain a data subset containing m samples and use this data subset to train a base learner. The operation is repeated T times, and T base learners are generated. Due to the bootstrap sampling method, there are differences between the data subsets; thus, there are also marked differences between the T base learners. Finally, the T base learners are combined. For regression problems, the average method is used to combine the base learners. For classification problems, the voting method is used to combine the base learners to produce a final ensemble model. The accuracy of each individual base learner is not necessarily high, but the accuracy of their ensemble can be considerably higher. A schematic diagram of the Bagging algorithm is shown in Figure 3.
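The bootstrap-and-average procedure described above can be sketched as follows. This is a minimal illustration rather than the model used in this study: a depth-1 regression tree (a "stump") stands in for a full decision tree to keep the example short, each bootstrap subset is assumed to have the same size as the original data (m = n), and all function names are hypothetical.

```python
import random
from statistics import mean

def fit_stump(X, y):
    """Depth-1 regression tree: the best single split on one feature."""
    best = None
    for j in range(len(X[0])):
        # Candidate thresholds: every distinct value except the largest,
        # so both sides of the split are always non-empty.
        for thr in sorted({row[j] for row in X})[:-1]:
            left = [t for row, t in zip(X, y) if row[j] <= thr]
            right = [t for row, t in zip(X, y) if row[j] > thr]
            lm, rm = mean(left), mean(right)
            sse = (sum((t - lm) ** 2 for t in left)
                   + sum((t - rm) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, lm, rm)
    if best is None:  # no split possible: fall back to a constant
        const = mean(y)
        return lambda row: const
    _, j, thr, lm, rm = best
    return lambda row: lm if row[j] <= thr else rm

def bagging_fit(X, y, n_learners=25, seed=0):
    """Train T base learners, each on its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(X)
    learners = []
    for _ in range(n_learners):
        idx = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        learners.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return learners

def bagging_predict(learners, row):
    """Regression ensemble: simple average of the base learner outputs."""
    return mean(f(row) for f in learners)

# Toy step-shaped data (made up for illustration).
X = [[float(i)] for i in range(20)]
y = [0.0] * 10 + [10.0] * 10
learners = bagging_fit(X, y, n_learners=15, seed=1)
print(bagging_predict(learners, [2.0]), bagging_predict(learners, [17.0]))
```

Because each stump sees a slightly different resample, the split points vary from learner to learner, and averaging smooths out that variation.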
The Bagging algorithm changes the distribution of the original dataset by resampling and produces a number of data subsets that differ from one another. The more sensitive the base learner is to changes in the data subset, the better the performance of Bagging. Currently, decision trees (DT) and artificial neural networks (ANNs) are commonly used as the base learners of Bagging because these two algorithms are sensitive to the training data. Considering the limited number of samples used in this paper, ANNs are not suitable as the base learner; thus, DT is used as the base learner. The primary advantage of the Bagging algorithm is that it reduces model variance; its effect on bias is negligible. Therefore, this paper explores methods to reduce the bias of Bagging.
2) STACKING ALGORITHM
Stacking [42], [43] is another well-known ensemble learning method and differs from Bagging in base learner selection. The base learners of Bagging are usually the same algorithm (with different training data subsets), while the base learners of the Stacking algorithm are different learning algorithms. The Stacking algorithm trains base learners, takes their outputs as inputs of a second learner (the meta-learner), and generates the final ensemble results through two-phase learning. The Stacking algorithm first trains the first-phase learners on the original training dataset, and then uses the prediction results of the first-phase learners to form a new dataset for training the meta-learner. A schematic diagram of a two-phase Stacking algorithm is shown in Figure 4. Unlike Bagging, Stacking divides the model into two phases. The first phase trains on the sample set and predicts the results, and the meta-learner uses the results of the first phase for further learning to find and correct the bias in the first phase and improve the accuracy of the ensemble model.
Stacking is a generalization of the ensemble strategy: it is an ensemble method based on the "learning method" that uses a meta-learner to replace the average method (for regression problems) in Bagging to reduce model bias. Therefore, Stacking can make full use of the advantages of two-phase learning, reduce model bias, and improve the accuracy of the ensemble model.

3) ELASTIC NET REGRESSION
In the standard linear regression model, y is related to x through a weight vector ω and an intercept b (Equation (1)). The regression coefficients in ω can be estimated by minimizing a loss function with the elastic net penalty (Equation (2)), which gives elastic net regression [44]. Here, {x_i, y_i} is the sample data, x_i ∈ R^n is the independent variable, and y_i is the corresponding dependent variable; ω ∈ R^n is the feature weight vector, and b ∈ R is the intercept. The regularization parameter r, with 0 ≤ r ≤ 1, controls how much of the penalty comes from ridge regression and how much from lasso regression. When r = 0, a complete ridge regression is performed. When r = 1, a complete lasso regression is performed.
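The displayed equations for the linear model and the elastic net objective are not reproduced in this text. A common formulation that is consistent with the symbol definitions above is given below as a hedged reconstruction. The overall penalty strength α, the sample count m, and the 1/(2m) scaling are assumptions of this sketch and may differ from the authors' Equations (1) and (2).

```latex
% Linear model (assumed form)
y = \omega^{\mathsf{T}} x + b

% Elastic net objective (assumed form; \alpha and the 1/(2m) scaling are not from the text)
\min_{\omega,\, b} \;
  \frac{1}{2m} \sum_{i=1}^{m} \bigl( y_i - \omega^{\mathsf{T}} x_i - b \bigr)^2
  + \alpha \, r \, \lVert \omega \rVert_1
  + \frac{\alpha (1 - r)}{2} \, \lVert \omega \rVert_2^2
```

Under this form, r = 0 leaves only the squared (ridge) penalty and r = 1 leaves only the absolute-value (lasso) penalty, which matches the two limiting cases stated above.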
As a typical linear regression technique, elastic net regression integrates the ridge regression and lasso regression algorithms. Like lasso regression, elastic net regression can shrink regression coefficients while performing regularization to select characteristic variables and obtain a simpler model. Like ridge regression, it can also retain closely associated variables together, which simplifies the model and ensures its stability. Therefore, elastic net regression combines the advantages of ridge regression and lasso regression, and performs feature extraction and regression analysis concurrently. Elastic net regression achieves better performance on data that contain many characteristic variables that are associated with each other.

4) IMPROVED-BAGGING ALGORITHM
When the Bagging algorithm is used to solve regression problems, the base learners are typically combined by a simple average. However, there is high collinearity between the base learners, and a simple average of collinear variables offers only a limited improvement in model accuracy.
To mitigate the effects of collinearity, the meta-learner idea of Stacking was adopted, and the simple average was replaced by the "learning method". The meta-learner selected in this study should overcome the problem of high collinearity between the base learners in Bagging and should avoid the risk that an overly complex meta-learner leads to overfitting of the ensemble model. The elastic net regression algorithm is simple and performs feature extraction, which can reduce the influence of the high collinearity between the base learners in the Bagging algorithm. Therefore, elastic net regression is introduced as a meta-learner to replace the simple average and improve the Bagging algorithm. With elastic net regression, the improved Bagging (Improved-Bagging) algorithm retains the advantages of both the Bagging algorithm and the Stacking algorithm, and can reduce model variance and bias concurrently. The Improved-Bagging algorithm is shown in Figure 5.
The workflow of the Improved-Bagging algorithm is as follows:
Input: Original data S = {(x_p, y_p)}, p = 1, ..., N; Base learner f; Meta-learner h; Number of base learners T.
Phase 1: %Training base learners
for t = 1, ..., T:
    S_t = Bootstrap(S) %Drawing data subset S_t from S by sampling with replacement
    F_t = f(S_t) %Using data subset S_t for training the base learner to obtain the prediction result F_t
end;
for n = 1, ..., N: %Building a new dataset D_h used for meta-learner training
    X_n = {F_1(x_n), ..., F_T(x_n)}
    D_h = {(X_n, y_n)}, n = 1, ..., N %New dataset D_h
end;
Phase 2: %Training meta-learner
H = h(D_h)
Output: Ensemble model H
Based on the two phases of the Improved-Bagging algorithm, the Bagging algorithm first trains the base learner f repeatedly using different data subsets to obtain T base prediction models F_1, F_2, ..., F_T and outputs the prediction results. Then, according to the outputs of the base learners, a new dataset D_h for meta-learner training is constructed. Finally, the new dataset D_h is used to train the meta-learner h to obtain the ensemble model H.
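The two-phase workflow above can be sketched in plain Python. This is an illustrative sketch under several assumptions, not the authors' code: a depth-1 regression tree stands in for the DT base learner, the elastic net is fitted by simple coordinate descent on an assumed objective with penalty strength `alpha` and mixing parameter `r`, the meta-learner is trained on in-sample base learner outputs as in the workflow, and all function names are hypothetical.

```python
import random
from statistics import mean

def fit_stump(X, y):
    """Depth-1 regression tree used here as a stand-in base learner f."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X})[:-1]:
            left = [t for row, t in zip(X, y) if row[j] <= thr]
            right = [t for row, t in zip(X, y) if row[j] > thr]
            lm, rm = mean(left), mean(right)
            sse = (sum((t - lm) ** 2 for t in left)
                   + sum((t - rm) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, lm, rm)
    if best is None:
        const = mean(y)
        return lambda row: const
    _, j, thr, lm, rm = best
    return lambda row: lm if row[j] <= thr else rm

def soft_threshold(v, t):
    """Shrink v towards zero by t (the lasso part of the update)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def fit_elastic_net(Z, y, alpha=0.1, r=0.5, n_iter=200):
    """Coordinate descent for an assumed elastic net objective."""
    n, p = len(Z), len(Z[0])
    w, b = [0.0] * p, sum(y) / n
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that excludes it.
            rho = sum(
                Z[i][j] * (y[i] - b - sum(w[k] * Z[i][k]
                                          for k in range(p) if k != j))
                for i in range(n)) / n
            denom = sum(Z[i][j] ** 2 for i in range(n)) / n + alpha * (1 - r)
            w[j] = soft_threshold(rho, alpha * r) / denom if denom > 0 else 0.0
        b = sum(y[i] - sum(w[k] * Z[i][k] for k in range(p))
                for i in range(n)) / n
    return w, b

def improved_bagging_fit(X, y, n_learners=10, alpha=0.1, r=0.5, seed=0):
    rng = random.Random(seed)
    n = len(X)
    learners = []
    for _ in range(n_learners):  # Phase 1: bootstrap and train base learners
        idx = [rng.randrange(n) for _ in range(n)]
        learners.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    # New dataset D_h: one row of base learner outputs per original sample.
    Z = [[f(row) for f in learners] for row in X]
    w, b = fit_elastic_net(Z, y, alpha, r)  # Phase 2: train the meta-learner
    return lambda row: b + sum(wj * f(row) for wj, f in zip(w, learners))

# Toy step-shaped data (made up for illustration).
X = [[float(i)] for i in range(20)]
y = [0.0] * 10 + [10.0] * 10
model = improved_bagging_fit(X, y, n_learners=8, seed=1)
print(model([2.0]), model([17.0]))
```

The only difference from plain Bagging is the last step: instead of averaging the base learner outputs with equal weights, the weights and intercept are learned, and the elastic net penalty keeps them stable when the outputs are highly collinear.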

G. MODEL TUNING
When the dataset and model remain unchanged, tuning the model hyper-parameters is an effective method to reduce model complexity and improve model accuracy. Grid Search (GS) is a model parameter tuning method based on exhaustive traversal. With the improvement of computer hardware, computing power and speed have increased greatly. Therefore, more search levels and smaller search steps can be set during GS to improve the accuracy of the model. The hyper-parameters that the Improved-Bagging algorithm needs to tune include the parameters of Improved-Bagging itself and the parameters of the base learner (DT), forming a two-level grid search. The root mean square error of calibration (RMSEC) is used as a fitness function to assess the quality of each group of parameters. The smaller the fitness function value, the higher the model accuracy.
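A grid search with RMSEC as the fitness function can be sketched as below. The sketch is generic and makes no claim about the grids used in this study: `grid_search`, `fit_line`, and the parameter names `lam` and `use_intercept` are hypothetical, and a one-dimensional penalised line fit stands in for the Improved-Bagging model so that the example stays short.

```python
import itertools
import math

def rmse(y_true, y_pred):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(
        sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def grid_search(fit, X, y, grid):
    """Try every parameter combination; keep the one with the lowest RMSEC."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        model = fit(X, y, **params)
        # Fitness: error on the calibration set itself (RMSEC).
        score = rmse(y, [model(row) for row in X])
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

def fit_line(X, y, lam, use_intercept):
    """Toy stand-in model: penalised straight-line fit on one feature."""
    xs = [row[0] for row in X]
    mx = sum(xs) / len(xs) if use_intercept else 0.0
    my = sum(y) / len(y) if use_intercept else 0.0
    slope = (sum((x - mx) * (t - my) for x, t in zip(xs, y))
             / (sum((x - mx) ** 2 for x in xs) + lam))
    return lambda row: my + slope * (row[0] - mx)

# Toy data on the line y = 2x + 1 (made up for illustration).
X = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
grid = {"lam": [0.0, 1.0, 10.0], "use_intercept": [False, True]}
print(grid_search(fit_line, X, y, grid))
```

For the two-level search described above, the `grid` dictionary would simply hold both the ensemble-level parameters and the base learner parameters, and `itertools.product` would enumerate their combinations.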

H. PERFORMANCE INDICES
Machine learning methods are used to build the spectrum data model, but different types of modeling methods have different advantages and disadvantages. Therefore, the prediction performance of different models must be compared using quantitative performance indices. The evaluation of the prediction performance of the model was based on several performance indices: R², the root mean square error of calibration (RMSEC), RMSEP, and the variance (s²). The larger R² and the smaller RMSEC/RMSEP are, the better the model. The smaller RMSEC/RMSEP is, the smaller the model bias. The smaller s² is, the smaller the model variance. In the equations of these performance indices, y_i is the measured value based on the standard method; ȳ is the average value of y_i; ŷ_i is the predicted value based on the spectroscopy method; n is the number of samples; y_i^c is the measured value based on the standard method for the calibration set; ŷ_i^c is the predicted value based on the spectroscopy method for the calibration set; n_c is the number of samples in the calibration set; y_i^p is the measured value based on the standard method for the prediction set; ŷ_i^p is the predicted value based on the spectroscopy method for the prediction set; n_p is the number of samples in the prediction set; and µ is the average value of the prediction results. The variance of the 10 prediction results was used to represent the variance of the model.
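The displayed equations for these indices are not reproduced in this text, so the sketch below uses the standard textbook definitions, which are consistent with the symbol definitions above. Two points are assumptions: the variance is computed with a 1/n divisor (the authors may use n - 1), and it is applied to a list of repeated prediction results. The function names are hypothetical.

```python
import math

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - y_bar) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """RMSEC when given calibration-set values, RMSEP for the prediction set."""
    return math.sqrt(
        sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def variance(values):
    """Population variance (1/n divisor, an assumption) around the mean mu."""
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values) / len(values)

# Toy numbers (made up for illustration, not results from this study).
measured = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
print(r2_score(measured, predicted), rmse(measured, predicted))
print(variance([1.0, 2.0, 3.0, 4.0]))
```

The same `rmse` function serves for both RMSEC and RMSEP; only the sample set passed in changes.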

III. RESULTS AND DISCUSSION
A. UV-VIS SPECTRUM PREPROCESSING
Figure 2 shows the original UV-Vis spectrum data curves of the water samples collected by the experimental instrument. The trends of the original spectra of different water samples are relatively similar but contain considerable noise. Figures 6(a)-6(d) show the spectra denoised by the SG, FT, WT and EEMD-Based methods.
By comparing the denoised spectra in Figure 6 with the original spectra in Figure 2, the denoised spectrum is shown to retain the basic absorption characteristics of the original spectrum. After processing by different denoising methods, the jump and pinnacle noise in the original spectrum has been reduced to some extent, and the EEMD-Based denoising method achieves the best result. However, the influence of different denoising methods on the COD prediction results still must be studied in more detail via modeling.

B. MODEL PERFORMANCE OF FULL UV-VIS SPECTRUM
To study the effect of different spectrum denoising methods on the original UV-Vis spectra, and the fitting effect of the proposed modeling method on the spectrum data, this section compares the COD prediction performance of different combinations of spectrum denoising and modeling methods. The denoising options include raw UV-Vis spectra without any processing, SG, FT, WT and EEMD-Based denoising. The modeling methods include DT, Bagging, random forest (RF) and Improved-Bagging. The COD prediction performance of each combination of spectrum denoising and modeling methods is shown in Table 2.
The results in Table 2 show that, for a given modeling method, the model built on UV-Vis spectra denoised by the EEMD-Based method outperforms those built with the other denoising methods: its prediction set R² is the largest, and its RMSEP and variance are the smallest. This indicates that the EEMD-Based denoising method is more effective; thus, the subsequent research uses EEMD-Based denoising during preprocessing. For a given denoising method, the Improved-Bagging model is better than the other modeling methods, with the largest prediction set R² and the smallest RMSEP and variance, which demonstrates the improved performance of the proposed Improved-Bagging modeling method. Among the full spectrum prediction models of COD in water shown in Table 2, the optimal prediction model is the one built on UV-Vis spectra processed by the EEMD-Based denoising method and modeled by the Improved-Bagging algorithm. The R² of the prediction set was 0.9054, and the RMSEP and variance of the prediction set were 7.11 mg/L and 6.73 mg², respectively.

C. DIMENSION REDUCTION OF UV-VIS SPECTRUM
In this study, the original spectrum wavelength range is 193.91-1121.69 nm, and the spectrum resolution is 0.45 nm, giving 2048 wavelength features. Full spectrum modeling makes the input variable dimension too high, and the number of samples is far lower than the spectrum feature dimension. This increases the complexity of the calibration model and makes the model prone to overfitting; thus, feature dimension reduction is necessary. Through feature dimension reduction, a COD prediction model with better generalization ability is constructed.

1) PCA DIMENSION REDUCTION OF UV-VIS SPECTRUM
In this study, 240 water samples were collected, of which 160 water samples were used for the calibration set and 80 water samples were used for the prediction set. Before PCA dimension reduction, EEMD-Based denoising preprocessing is performed on the collected spectra, and the denoising results are shown in Figure 6(d).
The PCA algorithm is used to reduce redundant information in the input UV-Vis spectrum matrix (160 × 2048), and the results are shown in Table 3. Due to limited space, only the first five principal components and their contribution rates and cumulative contribution rates are listed in the table. Table 3 shows that the first principal component has a contribution rate of 90.4987% and that the cumulative contribution rate of the first three principal components reaches 99.8444%, which is sufficient to represent the information in the original full spectrum. Adding the fourth principal component does not significantly improve the cumulative contribution rate. Therefore, the original UV-Vis spectrum matrix can be simplified to a 160 × 3 matrix after PCA dimension reduction, as shown in Table 4. Due to space constraints, only part of the sample dimension reduction results are shown. For the UV-Vis spectra of the 80 water samples in the prediction set, PCA dimension reduction was performed using the same transformation as for the calibration set.
... compared. Table 5 shows the best iPLS model results for the full spectrum with different interval divisions. When the UV-Vis spectrum is divided into different numbers of subintervals, the subintervals selected by the iPLS method and the prediction results of the PLS model built on the selected subinterval differ. When the full spectrum is divided into 20 subintervals, the RMSECV value of the iPLS model established in the first subinterval (wavelength range 193.91-242.78 nm) is the smallest, at 5.46 mg/L, and the optimal number of latent variables is 3, as shown in Figure 7. In this case, the model accuracy in this wavelength range is better than that in any other subinterval. Figure 8 shows the scatter plot of the COD value predicted by the optimal iPLS model against the COD standard value. The prediction set R is 0.8811, and RMSEP is 5.68 mg/L.
Therefore, the wavelength range of 193.91-242.78 nm is the optimal wavelength subinterval after dimension reduction by the iPLS wavelength interval selection method.

3) SCARS WAVELENGTH SELECTION OF UV-VIS SPECTRUM
When using the SCARS algorithm to select feature wavelengths, it is necessary to first determine the optimal number of principal components (latent variables) in the PLS model. Initially, the maximum number of latent variables for the PLS model is set to 15, and the number of Monte Carlo samplings is set to 3000. Figure 9 shows the RMSECV of the PLS model with different numbers of latent variables. When the number of latent variables is 9, the RMSECV reaches its minimum of 6.2604 mg/L; thus, the optimal number of latent variables for the PLS model is 9.
Through many attempts to select a group of more appropriate SCARS parameters, this paper sets the Monte Carlo sampling times to 200, the number of latent variables to 9, and the number of cross validation groups to 10. Figure 10 shows that as the number of samplings increases, the number of optimized wavelength variables gradually decreases. The RMSECV value decreased continuously between 1 and 142 samplings, indicating that the variables removed in the screening process did not affect COD prediction. After 142 sampling, RMSECV began to rise, indicating that COD-related variables began to be removed, resulting in the increase in RMSECV. When the number of samplings reached 142, the RMSECV was the smallest (6.15 mg/L), and the corresponding feature wavelength subset was optimal. The subset contained 14  These wavelengths are the optimal wavelength features after dimension reduction of the SCARS wavelength selection method.
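Two ingredients of this procedure can be sketched compactly: the stability index used to rank variables, and the gradual reduction in the number of retained variables. The sketch assumes the usual SCARS definition of stability (absolute mean of a variable's PLS regression coefficient over Monte Carlo runs, divided by its standard deviation) and the exponentially decreasing function of the original CARS formulation. Neither formula is stated in this section, so both are assumptions here, as are the function names. The PLS fitting and the adaptive reweighted sampling step are omitted.

```python
import math
import statistics


def stability(coef_runs):
    """coef_runs[m][j]: regression coefficient of variable j in Monte Carlo run m."""
    n_vars = len(coef_runs[0])
    out = []
    for j in range(n_vars):
        column = [run[j] for run in coef_runs]
        mean = statistics.fmean(column)
        spread = statistics.stdev(column)
        if spread > 0:
            out.append(abs(mean) / spread)
        else:
            out.append(math.inf if mean else 0.0)
    return out


def retained_counts(n_vars, n_iterations):
    """Variables kept at each iteration under the assumed CARS-style EDF.

    The ratio r_i = a * exp(-k * i) equals 1 at i = 1 and 2 / n_vars at i = n_iterations.
    """
    k = math.log(n_vars / 2) / (n_iterations - 1)
    a = (n_vars / 2) ** (1 / (n_iterations - 1))
    return [round(n_vars * a * math.exp(-k * i)) for i in range(1, n_iterations + 1)]


def keep_most_stable(stab, n_keep):
    """Indices of the n_keep variables with the highest stability."""
    order = sorted(range(len(stab)), key=lambda j: stab[j], reverse=True)
    return sorted(order[:n_keep])


# Toy coefficients from three Monte Carlo runs for two variables (hypothetical values).
coefs = [[1.0, 0.1], [1.1, -0.1], [0.9, 0.0]]
stab = stability(coefs)
print("most stable variable:", keep_most_stable(stab, 1))

# Retention schedule for 2048 wavelengths over 200 sampling runs.
counts = retained_counts(2048, 200)
print("variables kept at runs 1, 142, 200:", counts[0], counts[141], counts[-1])
```

Under this assumed schedule the number of retained variables never increases from one run to the next, which is consistent with the trend reported for Figure 10. The exact count at a given run in the study also depends on the adaptive reweighted sampling step, which this sketch does not model.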

D. MODEL PERFORMANCE ON UV-VIS SPECTRUM AFTER FEATURE DIMENSION REDUCTION
This section compares the COD prediction performance of different spectrum dimension reduction and modeling methods. The dimension reduction methods are the PCA dimension reduction algorithm, the iPLS wavelength interval selection algorithm, and the SCARS feature wavelength selection algorithm. The modeling methods are DT, Bagging, RF, and Improved-Bagging. The COD prediction performance of each combination of dimension reduction and modeling method is shown in Table 6.
According to Table 6, for a given modeling method, the model built on the UV-Vis spectrum after SCARS dimension reduction performs better than the models built after the other two dimension reduction methods. Its prediction set R² is the largest, and its RMSEP and variance are the smallest, indicating that the SCARS dimension reduction method is the most effective. For a given dimension reduction method, the model built by Improved-Bagging is better than those built by the other three modeling methods. Its prediction set R² is the largest, and its RMSEP and variance are the smallest, which also demonstrates the superiority of the proposed Improved-Bagging modeling method. Among all the COD prediction models built, the best is the UV-Vis spectroscopy model with EEMD-Based denoising, SCARS dimension reduction, and Improved-Bagging modeling. Its prediction set R² is 0.9317, and its RMSEP and variance are 5.39 mg/L and 5.53 mg², respectively.
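For reference, the two accuracy metrics used in this comparison can be computed as in the following sketch. It assumes the conventional definitions (R² as one minus the ratio of residual to total sum of squares, and RMSEP as the root mean square error on the prediction set), since the formulas are not restated in this section. The variance column of Table 6 is not reproduced because its definition is not given here. The sample values are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error; called RMSEP when evaluated on the prediction set."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical COD standard values and predictions in mg/L.
cod_standard = [10, 20, 30, 40]
cod_predicted = [12, 18, 33, 39]

# Residual sum of squares is 18 and total sum of squares is 500,
# so RMSEP = sqrt(18 / 4) and R^2 = 1 - 18 / 500 = 0.964.
print("RMSEP:", rmse(cod_standard, cod_predicted))
print("R2:", r_squared(cod_standard, cod_predicted))
```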

E. COMPARISON OF PREDICTION PERFORMANCE OF ALL MODELS
The optimal COD prediction models obtained by each modeling method are compared in Tables 2 and 6. The R² and RMSEC/RMSEP of each model on the calibration set and the prediction set show good consistency. For a given spectrum preprocessing method, the Improved-Bagging algorithm proposed in this paper has better prediction performance than the RF, Bagging, and DT algorithms, which supports the superiority of the proposed Improved-Bagging algorithm. Spectrum preprocessing (denoising) improves the accuracy of the COD prediction model to some extent. Compared with full spectrum modeling, feature dimension reduction improves the prediction accuracy of the COD prediction model further. For the experimental water samples, the full spectrum DT model built on the raw spectrum has the worst prediction performance, and the Improved-Bagging model built on spectra denoised by the EEMD-Based algorithm and reduced in dimension by SCARS has the best prediction performance. The prediction set R² of the optimal model (EEMD-Based+SCARS+Improved-Bagging) is 0.9317, and its RMSEP and variance are 5.39 mg/L and 5.53 mg², respectively. The scatter plot of the COD values predicted by the optimal model against the standard values on the prediction set is shown in Figure 11. Figure 11 shows that the model performs well on the prediction set and that the predicted values agree closely with the standard values, indicating that the prediction model built in this research has good robustness and adaptability and can measure COD in water accurately.

IV. CONCLUSION
When UV-Vis spectroscopy is used to measure COD in water, an effective prediction model can be built from the UV-Vis spectra of the water samples and their COD values. This paper proposed a model optimization method that uses elastic net regression to improve the Bagging algorithm. In addition, the input UV-Vis spectrum of the model was processed by spectrum preprocessing algorithms to further improve model performance. Results show that the Improved-Bagging algorithm predicts better than the traditional Bagging algorithm, with markedly improved prediction accuracy and generalization ability. Appropriate denoising and feature dimension reduction methods can effectively remove non-informative features, extract important features, and yield a more accurate COD prediction model. The results show that UV-Vis spectroscopy combined with the Improved-Bagging modeling method can measure COD in water accurately. UV-Vis spectroscopy can thus serve as a new method for COD measurement in water.