A Bootstrapping Soft Shrinkage Approach and Interval Random Variables Selection Hybrid Model for Variable Selection in Near-Infrared Spectroscopy

The high dimensionality of spectral datasets is a significant challenge that requires effective methods for extracting the optimal variable subset, which can improve the accuracy of predictions or classifications. In this study, a hybrid variable selection method based on incrementing the number of variables, combining the bootstrapping soft shrinkage (BOSS) method and interval random variable selection (IRVS), is proposed and named BOSS-IRVS. The BOSS method is used to determine the informative intervals, while the IRVS method searches for informative variables within the intervals determined by BOSS. The proposed BOSS-IRVS method was tested on seven publicly accessible near-infrared (NIR) spectroscopic datasets covering corn, diesel fuel, soy, wheat protein, and hemoglobin. Its performance was compared with that of two outstanding variable selection methods, i.e., BOSS and the hybrid variable selection strategy based on continuous shrinkage of the variable space (VCPA-IRIV). The experimental results clearly show that the proposed BOSS-IRVS method outperforms the VCPA-IRIV and BOSS methods on all tested datasets, improving the prediction accuracy by 15.4% and 15.3% for corn moisture, 13.4% and 49.8% for corn oil, 41.5% and 50.6% for corn protein, 12.6% and 5.6% for soy moisture, 0.6% and 6.3% for total aromatics in diesel fuel, 19.9% and 14.3% for wheat protein, and 5.8% and 20.3% for hemoglobin, respectively.


I. INTRODUCTION
In recent years, near-infrared (NIR) spectroscopy has gained wide acceptance in different fields such as agriculture and the petrochemical and pharmaceutical industries by virtue of its advantages in recording spectra for solid and liquid samples.
The associate editor coordinating the review of this manuscript and approving it for publication was Barbara Masini .
NIR spectra typically consist of broad, weak, non-specific, and overlapping bands and contain irrelevant variables [1]. These unrelated variables can lead to wrong or inefficient prediction results. To overcome this problem, a multivariate analysis process for NIR spectroscopy should be followed, as shown in Figure 1. The first step is to take the NIR samples as X and the properties of interest as y. A preprocessing technique is then used to remove physical phenomena from the spectra [2]. Next, the important variables are extracted using a variable selection method. Finally, a multivariate calibration model is built to relate the selected variables to the properties of interest and predict their values. Variable selection is a critical step in the multivariate calibration of NIR spectroscopy because it reduces the curse of dimensionality, which speeds up the operating model, provides a better interpretation of the model by selecting the informative variables, and improves prediction performance by eliminating uninformative variables [3].
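The workflow of Figure 1 can be summarized as a simple function chain. The sketch below is illustrative only; the stage functions (`preprocess`, `select_vars`, `calibrate`) are hypothetical placeholders for, e.g., scatter correction, a variable selection method such as BOSS-IRVS, and PLS calibration, respectively:

```python
def nir_pipeline(X, y, preprocess, select_vars, calibrate):
    """Multivariate analysis of NIR spectra (Figure 1): preprocess the
    spectra X, pick informative variable indices, then fit a calibration
    model relating the selected variables to the property of interest y."""
    Xp = preprocess(X)                              # remove physical phenomena
    cols = select_vars(Xp, y)                       # variable selection step
    X_sel = [[row[c] for c in cols] for row in Xp]  # keep selected columns
    model = calibrate(X_sel, y)                     # e.g. PLS regression
    return model, cols
```

Each stage is swappable, which mirrors how the comparisons later in this paper hold the preprocessing and calibration fixed while varying only the selection step.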
Deng et al. proposed an effective single-variable selection method named BOSS [4]. This method showed a significant improvement in prediction accuracy on three NIR spectroscopic datasets and outperformed partial least squares (PLS), Monte Carlo uninformative variable elimination (MCUVE), competitive adaptive reweighted sampling (CARS), and the genetic algorithm coupled with partial least squares (GA-PLS). The advantages of the BOSS method can be summarized in three aspects. First, the use of soft shrinkage lowers the risk of eliminating essential variables. Second, the use of weighted bootstrap sampling (WBS) allows a fair comparison of variables that compensates for the influence of collinearity on the regression coefficients. Third, the use of model population analysis (MPA) extracts information from a large population of sub-models instead of a single model, obtaining more reliable results by considering the combined effects among variables [4]. Despite these advantages, BOSS has drawbacks that can also be summed up in three aspects. First, BOSS ignores the high correlation among consecutive variables. Second, because bootstrap sampling is inappropriate for dependent data, BOSS selects fewer variables, which causes it to miss some informative wavelengths. Third, it cannot avoid the over-fitting problem because it relies on regression coefficients (RC), which are susceptible to noise [5], [6].
Most recently, three different methods have been developed that outperform the BOSS method. The first is a modification of the bootstrapping soft shrinkage approach named the stabilized bootstrapping soft shrinkage approach (SBOSS) [5], in which variables are selected by an index of the stability of the regression coefficients instead of their absolute values. The second is fisher optimal subspace shrinkage (FOSS) [6], which splits the variables into intervals using information from the regression coefficients of a PLS model; weighted block bootstrap sampling (WBBS) is then used to select intervals, and the mean of the absolute values of the regression coefficients of each interval determines the weights of the sub-intervals. The third is significant multivariate competitive population analysis (SMCPA) [7], which combines the ideas of significant multivariate correlation (SMC) and MPA; it employs WBS, an improved version of bootstrap sampling with different weights on the sampled objects, and an exponential decline function (EDF) competition mechanism to force the elimination of uninformative or redundant variables. For the corn and wheat protein datasets, all of these methods, including BOSS, select informative intervals. However, BOSS was unstable and selected only a few variables, whereas the other high-performance methods were more accurate and selected more variables in these crucial intervals.
In terms of the selection of spectral intervals, all of these models except FOSS (i.e., BOSS, SBOSS, and SMCPA) do not consider interval selection, although it can provide a reasonable interpretation. Using interval selection in the proposed model is thus expected to improve accuracy, as the vibrational spectral band relating to a chemical group generally has a width of 4-200 cm−1 [6]. Moreover, none of these approaches, including FOSS, searches for optimal variable combinations within specific informative intervals.
Therefore, in this study, a new hybrid model is proposed based on the BOSS method. The proposed hybrid model works by incrementing the number of selected variables rather than decreasing it. To the best of the authors' knowledge, no hybrid model in the literature is based on increasing the number of variables, whereas many hybrid models have been developed based on reducing it, such as the hybrid VCPA-IRIV model [9], the competitive adaptive reweighted sampling-successive projections algorithm (CARS-SPA) [10], and a combination strategy of random forest and backpropagation network (RF-BPN) [11]. These methods have their own merits and unique characteristics. The decreasing-number variable selection methods attempt to exploit the features of other methods by combining them effectively. However, the overall performance can be reduced significantly if the preliminary method does not successfully select the key variables [12]. The proposed hybrid method follows the same concept by taking advantage of the BOSS method, which has been proven to select important intervals; however, because BOSS selects fewer variables and does not select optimal combinations, we use IRVS to add more variables in these intervals to achieve excellent performance. In addition, we focus on the importance of intervals, which have been proven to be more robust and more interpretable, so our model increases the number of variables in those informative intervals. The disadvantage of increasing-number variable selection methods is that the optimal number of variables to add is not known in advance; this parameter therefore needs to be tuned.
The novelty of this research is the following:
1- There is no previous hybrid variable selection method in NIR spectroscopy based on increasing the number of variables; most studies use a hybrid model to eliminate variables. This paper introduces a new hybrid method based on the incremental approach.
2- The number of datasets used to evaluate the proposed hybrid model (seven NIR datasets) is considerably large, which allows a proper evaluation. The NIR datasets used are the corn datasets with moisture, oil, and protein properties, the hemoglobin dataset, the diesel fuel dataset with the total aromatics property, the soy dataset with the moisture property, and the wheat protein dataset.
3- The proposed hybrid method is investigated against two high-performance models, namely hybrid VCPA-IRIV and BOSS.
4- A comprehensive review of different variable selection methods is provided in terms of their ability to select informative intervals, their performance, and the number of variables chosen.
The remainder of this paper is divided into the following sections. Related studies are described in Section II, followed by a detailed description of the proposed hybrid method in Section III. The datasets used in this study are described in Section IV. The experimental work and obtained results are presented in Section V. Finally, the conclusion of this study is presented in Section VI.

II. RELATED WORKS
During the last several decades, a large number of various mathematical strategies for variable selection have been employed in NIR spectroscopy.
Li-Li Wang classified single-variable selection methods and interval variable selection methods into different categories [13]. The single-variable selection methods are classified into classic stepwise methods, variable ranking-based strategies, penalty-based strategies, MPA, heuristic algorithm-based strategies, and some other methods, including the successive projections algorithm (SPA) and uninformative variable elimination (UVE). The interval selection methods, on the other hand, are classified into: (1) classic methods, including interval PLS (iPLS) and its variants; (2) moving window PLS (MWPLS) and its variants; (3) penalty-based methods, including elastic net combined with partial least squares regression (EN-PLSR), iterative rank PLS regression coefficient screening (EN-IRRCS), and group PLS (gPLS); (4) sampling-based methods, including iPLS-Bootstrap and bootstrap variable importance in projection (Bootstrap-VIP); (5) correlation-based methods, including sure independence screening combined with interval PLS (SIS-iPLS); and finally, (6) projection-based methods, including the interval successive projections algorithm (iSPA).
Moreover, selecting variables in near-infrared spectroscopy using models that hybridize two or more different techniques was recommended in [12]. In particular, the UVE method was used in [24] to filter out the noise variables; the SPA method was then used to achieve an excellent selection. This is known as the UVE-SPA-MLR hybrid model. Another hybrid model, iPLS-mIPW, combined two methods, i.e., iPLS with mIPW [25]. In iPLS-mIPW, the informative intervals are first obtained using the iPLS method; further variable selection is then performed using mIPW. Additionally, to select critical wavelengths in NIR spectra, random forest was hybridized with the BP network by Chen et al. [11]. In that model, some informative wavelengths are initially selected using random forest; a new comprehensive variable group with minimum errors is then produced using the BP network. Recently, a VCPA-based hybrid model was proposed by Yun et al. [9], in which VCPA was hybridized with the genetic algorithm (GA) and with IRIV separately. First, VCPA is used to continuously shrink and optimize the variable space from large to small. After that, additional optimization is performed on the variables retained by VCPA using IRIV or GA. Table 1 compares the previous methods in terms of selecting informative intervals, performance, and the number of variables selected.

III. PROPOSED MODEL
In this section, the proposed hybrid method, named bootstrapping soft shrinkage approach and interval random variable selection (BOSS-IRVS), is described in detail. It combines the selection of informative intervals using the BOSS method, as illustrated in Section A, with an interval variable selection method, as shown in Section B. A brief description of the compared methods and the model validation is then given in Sections C and D, respectively. Figure 2 shows an illustration of the proposed model.

A. INFORMATIVE INTERVALS SELECTION USING BOSS METHODS
The BOSS approach is designed to choose informative intervals even in the presence of collinearity, using information from the regression coefficients in a soft shrinkage manner [26]-[29]. Two sampling methods are used: bootstrap sampling (BSS) and weighted bootstrap sampling (WBS). The purpose of sampling is to produce random combinations of variables and to construct sub-models. Two techniques, MPA [25] and PLS regression [29], are then coupled to extract the information from the sub-models. The BOSS method selects the informative intervals in the following main steps.
Step 1: BSS is used to produce K subsets of the variable space. The variables chosen by BSS are extracted for each subset, and the duplicated variables are excluded, so that only the unique variables remain. The number of draws with replacement in BSS is identical to the total number of variables P; therefore, the number of variables chosen in each subset is roughly 0.632P. Here, all variables are treated equally so that they are picked into subsets with the same probability, i.e., equal weights (w) are set for all variables.
Step 2: The subsets obtained are used to construct K PLS sub-models. The prediction error of each sub-model is then calculated as its RMSEV, and the fraction of models with the lowest RMSEV (e.g., 10 percent) is selected as the best models.
Step 3: For each of the best models, the regression coefficients (RC) are computed, the absolute value of every element of the regression vector is taken, and each regression vector is normalized to unit length. Subsequently, equation (1) is used to obtain new weights for the variables by summing the normalized regression vectors:

w_i = Σ_{A=1}^{K} b_{i,A}    (1)

where w_i is the new weight of the ith variable, K denotes the number of sub-models, and b_{i,A} represents the absolute normalized regression coefficient value of the ith variable in the Ath sub-model.
Step 4: New subsets are generated using WBS according to the variables' new weights. As in BSS, the variables chosen are extracted for each subset to construct the sub-models, and the duplicated variables are excluded. The average number of variables obtained in Step 3 determines the number of draws in WBS; therefore, in the new subsets, the number of variables is 0.632 times the number previously determined [4]. The aim of this step is to ensure that variables with larger absolute regression coefficient values are more likely to be selected into the best sub-models.
Step 5: Repeat Steps 2-4 until the number of variables in the new subsets is 1; then return the optimal subset, i.e., the one with the lowest RMSEV.
Step 6: Repeat the BOSS method twenty times to select informative intervals.
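The Steps 1-5 loop can be sketched as follows. This is a minimal structural sketch, not the authors' implementation: the PLS sub-modelling is abstracted behind a user-supplied `fit(cols)` function (a hypothetical name) that must return the sub-model's RMSEV together with the absolute, unit-normalized regression coefficients for the given columns. The equation (1) weight update and the ~0.632P shrinkage then follow from the bootstrap-and-deduplicate mechanics.

```python
import random

def boss_sketch(n_vars, fit, K=50, top_frac=0.1, seed=0, max_iter=100):
    """Structural sketch of BOSS (Steps 1-5). `fit(cols)` stands in for PLS
    sub-modelling: it must return (rmsev, coeffs), where coeffs are the
    absolute, unit-normalized regression coefficients for `cols`."""
    rng = random.Random(seed)
    weights = [1.0] * n_vars        # Step 1: equal weights for all variables
    n_draw = n_vars                 # draws with replacement = P at first
    best_cols, best_rmsev = list(range(n_vars)), float("inf")
    for _ in range(max_iter):
        # Steps 1/4: (weighted) bootstrap sampling, duplicates removed,
        # so each subset keeps roughly 0.632 * n_draw distinct variables
        subsets = [sorted(set(rng.choices(range(n_vars), weights=weights, k=n_draw)))
                   for _ in range(K)]
        # Step 2: build K sub-models and keep the best fraction by RMSEV
        scored = sorted((fit(cols) + (cols,) for cols in subsets), key=lambda t: t[0])
        top = scored[:max(1, int(top_frac * K))]
        if top[0][0] < best_rmsev:
            best_rmsev, best_cols = top[0][0], top[0][2]
        # Step 3: new weights = summed normalized |coefficients| (equation (1))
        weights = [0.0] * n_vars
        for _, coeffs, cols in top:
            for c, i in zip(coeffs, cols):
                weights[i] += c
        # Steps 4/5: shrink the draw count; stop once subsets reach one variable
        n_draw = sum(len(cols) for _, _, cols in top) // len(top)
        if n_draw <= 1:
            break
    return best_cols, best_rmsev
```

With a toy `fit` that favours subsets containing an informative variable, the sketch converges on that variable; in the full method, Step 6 repeats this entire procedure twenty times to obtain stable informative intervals.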

B. SELECTION OF INFORMATIVE VARIABLES IN INFORMATIVE INTERVALS
After applying the BOSS method to the NIR datasets to select informative intervals with Algorithm 1, the output of that algorithm acts as the input for Algorithm 2. The latter algorithm selects informative variables in the informative intervals. The selection of informative variables is affected by three parameters that need to be tuned carefully: (i) The number of populations (np): To select an adequate number of populations, three cases of 50, 100, and 500 populations were investigated. For example, 50 populations comprise 50 individuals, where each individual combines the variables selected by Algorithm 1 with the interval random variables that search for informative variables in the informative intervals. Five hundred populations were chosen as the optimal number based on the 20 replicated results shown in Figure 3, so np was set to 500 in this work. From Figure 3(a), it should be noted that with 50 populations the value of RMSEC varies from 3.1 to 3.9, which is an indication of underfitting, as shown in Figure 3(b). With 500 populations, however, both the RMSEC and RMSEP values drop to their lowest levels, which avoids overfitting and gives the best performance compared with 50 and 100 populations.
(ii) The way of selecting random variables: The choice is between selecting the random variables gradually and selecting them all at one time. Gradual selection means selecting a specific number of random variables in each run, while one-time selection means selecting all the random variables in a single run. Every random variable is drawn from a small random interval within the large interval selected by BOSS. Figure 4 shows that the gradual selection of random variables is the optimal approach, as it avoids overfitting: selecting the random variables at one time leads to a low RMSEC but a high RMSEP, while gradual selection leads to a low RMSEP.
(iii) The number of informative variables selected (nv): To select an adequate number of added informative variables, three cases of 3, 6, and 9 variables were investigated. Among the three numbers tested, shown in Figure 5, three variables give the worst RMSEC value, while nine variables give the best. However, with nine selected variables, the RMSEP is high. As a compromise, six variables were chosen, as this produces the lowest RMSEP value and an acceptable RMSEC value. The pseudocode of the proposed algorithm is presented in Algorithm 2. In the beginning, five hundred random populations are generated. Each individual in the population combines the input variables with three random variables from the informative intervals, where the input is the set of variables selected by Algorithm 1. For each individual, the RMSEC value is calculated, and the individual with the lowest RMSEC value is selected. These steps are repeated until nv random variables have been selected; in each round, the input is updated by adding the three variables chosen in the previous round.
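The incremental loop of Algorithm 2 described above can be sketched as follows. This is a simplified sketch, not the authors' code: `rmsec(cols)` is a hypothetical stand-in for fitting a calibration model and computing its RMSEC, and the informative intervals are given as inclusive (low, high) index pairs.

```python
import random

def irvs_sketch(base_vars, intervals, rmsec, n_pop=500, step=3, n_add=6, seed=0):
    """Incremental IRVS loop (Algorithm 2): each round draws n_pop candidate
    individuals, each adding `step` random variables from the informative
    intervals to the current selection, keeps the individual with the lowest
    RMSEC, and stops once n_add variables have been added."""
    rng = random.Random(seed)
    selected = list(base_vars)                 # output of Algorithm 1 (BOSS)
    pool = [v for lo, hi in intervals for v in range(lo, hi + 1)]
    for _ in range(n_add // step):
        fresh = [v for v in pool if v not in selected]
        candidates = [sorted(set(selected) | set(rng.sample(fresh, step)))
                      for _ in range(n_pop)]
        selected = min(candidates, key=rmsec)  # lowest-RMSEC individual wins
    return selected
```

With the tuned parameters above (np = 500, step = 3, nv = 6), the loop runs exactly two rounds, which matches the gradual-selection strategy of point (ii).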

D. MODEL VALIDATION
With 5-fold cross-validation and test sets, the predictive ability of the models is assessed by the root mean squared error of training (RMSEC), the root mean squared error of cross-validation (RMSEV), the root mean squared error of prediction (RMSEP), the coefficient of determination of training (Q2_C), the coefficient of determination of cross-validation (Q2_CV), and the coefficient of determination of the test set (Q2_T).

RMSEC = sqrt( (1/Ntrain) Σ_{i=1}^{Ntrain} (y_i − ŷ_i)² )    (2)

Q2_C = 1 − Σ_{i=1}^{Ntrain} (y_i − ŷ_i)² / Σ_{i=1}^{Ntrain} (y_i − ȳ)²    (3)

where y_i, ŷ_i, and ȳ are the experimental, predicted, and average experimental properties, respectively, and Ntrain is the number of calibration samples in the training set. RMSEP and RMSEV are computed in the same way as RMSEC, and Q2_T and Q2_CV in the same way as Q2_C, but with Ntrain replaced by the number of test samples for RMSEP and Q2_T.
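The validation metrics of this section reduce to a few lines of code. The sketch below is plain Python; RMSEP/RMSEV and Q2_T/Q2_CV reuse the same two functions on the corresponding sample sets.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: RMSEC on the training set; the same formula
    gives RMSEV on the cross-validation folds and RMSEP on the test set."""
    return math.sqrt(sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true))

def q2(y_true, y_pred):
    """Coefficient of determination, 1 - SSE/SST (Q2_C / Q2_CV / Q2_T)."""
    mean = sum(y_true) / len(y_true)
    sse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    sst = sum((yt - mean) ** 2 for yt in y_true)
    return 1.0 - sse / sst
```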

IV. DATASETS
In this study, seven NIR datasets have been used to evaluate the BOSS-IRVS, which are datasets of diesel, soy, wheat protein, corn, and hemoglobin. The important details of these datasets are summarized below.

A. CORN DATASETS
Four NIR corn datasets were collected from http://www.eigenvector.com/data/Corn/index.html. Each dataset contains 80 corn samples measured on an m5 NIR spectrometer, and each spectrum has 700 wavelength points at 2 nm intervals in the range of 1100-2498 nm. The properties of interest are the moisture, oil, and protein contents. In each dataset, the 80 corn samples were divided into a training set of 60 samples and an independent test set of 20 samples.

B. DIESEL FUELS DATASET
This dataset was downloaded from http://www.eigenvector.com/data/SWRI/index.html. The wavelength points range from 750 to 1550 nm at 2 nm intervals, giving 401 points per spectrum. Only one property of interest is considered, the total aromatics; the remaining properties are discarded. The 20 high-leverage samples together with one of the two random sample groups were used to create the training set, and the other group was used as an independent test set, leading to dataset partitions of 138 training and 118 test samples in total.

C. SOY DATASETS
An NIR spectrometer was used to measure the samples of soy flour [33]. Each spectrum has 175 wavelengths at 8 nm intervals in the range of 1104-2496 nm. The moisture content was considered as the property of interest. According to reference [33], the dataset contains 54 samples, split between a training set (40 samples) and a test set (14 samples).

D. WHEAT DATASET
This NIR dataset [34] contains 100 wheat samples. Each spectrum was recorded at 2 nm intervals from 1100 to 2500 nm, giving 701 points. The property of interest y is the protein value. Owing to the 'large p, small n' problem [35], [36], an acceptable window size is used to compress the original spectrum to a limit of 200 variables [37]: the dataset is reduced to 175 variables by setting the window size to 4 and averaging each group of four original variables. Out of the 100 samples, 80 were used for training and 20 for testing.
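The window-size-4 compression described above amounts to averaging non-overlapping blocks of four points. The sketch below assumes the single leftover point of the 701-point spectrum is dropped (the text does not specify how the remainder is handled):

```python
def compress_spectrum(spectrum, window=4):
    """Average each non-overlapping block of `window` points, reducing a
    701-point wheat spectrum to 175 variables; leftover points are dropped."""
    n = len(spectrum) // window
    return [sum(spectrum[i * window:(i + 1) * window]) / window for i in range(n)]
```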

E. HEMOGLOBIN DATASET
Using the IDRC shootout 2010 software, Karl Norris [38] has produced this dataset that has been used by Mohd Nazrul Idrus [39]. With the spectrometer of NIR Systems 6500, the blood samples have been analyzed. The blood hemoglobin reference was measured by a high-volume hematology analyzer. All spectra have 700 variables of 2 nm interval in the range between 1100 and 2498 nm. To evaluate the model, the dataset is divided into 173 sets and 194 unseen data sets, respectively, for training, and blind testing to measure the model's predictive accuracy.

V. RESULTS AND DISCUSSIONS
To assess the performance of BOSS-IRVS, some high-performance wavelength selection methods, including BOSS and VCPA-IRIV, are used for comparison. All code was implemented in Matlab, and the datasets were mean-centered. In this study, the calibration set is used for building the model and performing the variable selection; the independent test set is then used to validate the calibration model. Several evaluation metrics are reported, namely RMSEV, Q2_CV, RMSEC, Q2_C, RMSEP, and Q2_T.

A. CORN DATASETS
The variables selected by the different selection methods on the moisture dataset are shown in Figure 6. The wavelengths chosen by the BOSS, VCPA-IRIV, and BOSS-IRVS models are located in two intervals that include the two wavelengths 1908 nm and 2108 nm. These two wavelengths are regarded as key wavelengths by Li et al. [9], [20] and correspond to water absorption and the combination of O-H bonds according to the literature [22]. The average number of variables selected by BOSS is 3.8, which indicates that the BOSS algorithm misses important variables and ignores the high correlation among consecutive variables. BOSS-IRVS improved the prediction ability of BOSS by adding six important variables. VCPA-IRIV selects 5.5 variables on average, the same as the BOSS-IRVS model when three variables are added. However, the variables selected by BOSS-IRVS give better performance than those selected by VCPA-IRIV, because the variable combinations of BOSS-IRVS are better. For the oil dataset, Figure 7 shows that the VCPA-IRIV, BOSS, and BOSS-IRVS methods select informative spectral intervals near 1700 nm (region 1) and 2300 nm (region 2), which correspond to the second and first overtones of the C-H stretching mode and the combination of C-H vibrations [8].
VCPA-IRIV shows a good concentration on the two intervals compared with the BOSS method, whose selected variables are unstable since it uses bootstrap sampling. BOSS-IRVS combines the variables selected by BOSS and adds six variables in the informative intervals selected by BOSS, which leads it to outperform the VCPA-IRIV model. BOSS selects the fewest variables, while the VCPA-IRIV and BOSS-IRVS models select the same number of variables. For the protein dataset, Figure 8 shows that the VCPA-IRIV, BOSS, and BOSS-IRVS methods select a combination of several groups that are chemically meaningful for spectral data analysis [5]. All the methods selected the intervals around 1680, 1800, and 2180 nm. These selected intervals cover a wide range linked to the complicated structure of protein, e.g., C-H, O-H, and N-H bonds with different vibration patterns, the complex microenvironment of the three bonds, and their interactions [4]. The lowest number of variables is selected by BOSS, followed by the BOSS-IRVS model with three added variables; then both VCPA-IRIV and BOSS-IRVS with six added variables select nearly the same number of variables. BOSS-IRVS selects important variables near the 1800 and 2180 nm intervals, outperforming BOSS and VCPA-IRIV.
Furthermore, from Figure 6, it can be seen that the proposed model has high variable stability since it focuses on specific important intervals. In more detail, all variables selected by BOSS in the first step of the proposed model are considered informative. The selected BOSS variables are then used as input for the IRVS algorithm, which means that the IRVS algorithm keeps the variables selected by BOSS and adds the six incrementally selected variables to them. The process is repeated 20 times until the optimal incremental number of variables is reached. As a result, the variables selected in the first step always have the highest frequency of 20.
From Table 1, with respect to the moisture dataset, many methods succeed in selecting informative intervals, including CARS, MCUVE, OHPL, SCARS, SPEA-LASSO, BOSS, and VCPA-IRIV. However, some methods show lower performance than others for various reasons: they select uninformative variables in other intervals (e.g., iRF and LASSO), or concentrate poorly when choosing variables in the informative intervals (e.g., CARS). Furthermore, although some methods succeed in selecting important intervals, they choose many variables, including uninformative ones (e.g., OHPL). Moreover, the variable combinations differ, which is why some methods outperform others that select the same informative intervals; for example, SPEA-LASSO outperforms SCARS. Our hybrid method succeeds in selecting a low number of variables with a good concentration in the informative intervals by choosing informative variables within them. For the oil dataset, some methods succeed in selecting informative intervals but also select uninformative variables, such as CARS, GA-PLS, and VIP-GA. GA-iPLS and VCPA-IRIV succeed in selecting informative variables; however, VCPA-IRIV chooses the optimal number of variables and has a good combination of variables, which is why it outperforms GA-iPLS. The FOSS method performs well by concentrating on informative intervals and choosing informative variables within these intervals. Our hybrid method succeeds in selecting a lower number of variables and in selecting informative variables in the informative intervals. For the protein dataset, CARS and GA-PLS select important intervals; however, they also select uninformative intervals, which reduces the performance of both methods. The VISSA and iVISSA methods show low performance because they select variables across the whole spectrum.
The ICO method outperforms CARS, MC-UVE, VISSA, and iVISSA because of its low number of variables and its success in selecting variables in the informative intervals. A recent method called SBOSS outperforms SCARS, BOSS, CARS, GA-PLS, and MCUVE; SBOSS uses a low number of variables with a good selection of variables in the informative interval.

B. SOY MOISTURE DATASET
The results of the variable selection methods on the soy dataset are shown in Table 3. A clear ranking of the VCPA-IRIV, BOSS, and BOSS-IRVS models emerges: BOSS-IRVS is followed by BOSS and then VCPA-IRIV. The RMSEP values for BOSS-IRVS with six added variables, BOSS-IRVS with three added variables, BOSS, and VCPA-IRIV are 0.8610, 0.8701, 0.9126, and 0.9854, respectively. Moreover, the proposed BOSS-IRVS achieved the best Q2_T of 0.9306, compared with 0.9126 and 0.9091 for BOSS and VCPA-IRIV, respectively. Figure 9 shows that all the methods select two informative intervals around 1900 nm and 2100 nm, which are commonly selected by the four methods and correspond to water absorption and the combination of O-H bonds [22]. VCPA-IRIV selects some variables around 1550 and 2450 nm, and the BOSS method selects intervals around 2450 nm. The BOSS-IRVS method selects variables around 2100 nm, which improves the accuracy of the model. Table 3 and Table 1 show the performance of the BOSS-IRVS method and other variable selection methods on the soy moisture dataset. Most of these methods select informative intervals; however, some methods select other intervals as well, such as CARS, MC-UVE, and GA-PLS. Also, some methods, such as MC-UVE, siPLS, MW-PLS, and iRF, select more variables and show lower accuracy compared with methods that use a low number of variables, such as BOSS, VCPA-IRIV, and BOSS-IRVS. The proposed hybrid model selects a good combination and an optimal number of variables, which achieves higher accuracy.

C. TOTAL DIESEL FUELS DATASET
The results of the variable selection methods on the total diesel fuel dataset are displayed in Table 4 and Figure 10. They show a clear ranking of prediction ability: BOSS-IRVS with six added variables, the VCPA-IRIV method, BOSS-IRVS with three added variables, and then the BOSS method, with RMSEP values of 0.5965, 0.6004, 0.6026, and 0.6366, respectively. The wavelengths selected by all methods are concentrated in the regions of 1000-1100 nm, 1200-1300 nm, and 1450-1550 nm, which indicates the importance of these intervals. Moreover, VCPA-IRIV and BOSS selected variables in other intervals, including those between 800 and 900 nm and between 1300 and 1400 nm. The BOSS-IRVS models selected their variables around the informative intervals, which improves on the BOSS method significantly. From Table 4 and Table 1, it can be seen that MC-UVE and GA-PLS use a higher number of variables than BOSS, CARS, VCPA-IRIV, and the proposed hybrid model; the methods with a low number of variables perform well.

D. WHEAT PROTEIN DATASET
The variables around 1104-1400 nm are selected by all methods, which indicates the importance of this region; it corresponds to the first overtone of the O-H stretch bond vibration [7]. VCPA-IRIV selects other variables in intervals around 1800 nm and between 2200 and 2400 nm. BOSS and BOSS-IRVS concentrate on the informative region and show better performance than VCPA-IRIV. The proposed method combines the variables selected by BOSS and adds only three variables in the important intervals, showing a significant improvement in prediction accuracy; the fewest variables are selected by BOSS, followed by VCPA-IRIV and the BOSS-IRVS method. Table 5 and Table 1 show the proposed hybrid method and previous variable selection methods on the wheat protein dataset.
Our analysis shows that the MC-UVE and CARS methods select informative variables; however, they also select variables in uninformative intervals. The IVSO and GA-PLS-LRC methods select informative intervals and concentrate their variables in them, which leads to good performance: IVSO outperforms PLS, CARS, and MC-UVE, while GA-PLS-LRC outperforms GA-PLS. A recent method called SMCPA shows a good concentration with a low number of variables and outperforms BOSS, VCPA, and CARS. Our proposed hybrid method proves that adding three informative variables in the informative intervals can improve the result significantly.

E. HEMOGLOBIN DATASET
Figure 12 shows that the intervals between 1600 and 1800 nm and between 2200 and 2400 nm are selected by all methods. VCPA-IRIV and BOSS also selected variables between 1200 and 1400 nm. The BOSS-IRVS method added six variables only in the intervals between 1600 and 1800 nm and between 2200 and 2400 nm. The selection of these variables improved the result of the BOSS method significantly and outperformed VCPA-IRIV. All three methods select the same two intervals, indicating their importance. In addition, BOSS-IRVS selects fewer variables than VCPA-IRIV yet outperforms it: the percentage of improvement on the hemoglobin dataset is 5.8% over VCPA-IRIV. Moreover, when only six variables are added to BOSS, the result improves significantly, by 20.3%.

VI. CONCLUSIONS AND FUTURE WORKS
To conclude, a new hybrid strategy for variable selection, BOSS-IRVS, has been proposed in this study. The hybrid strategy takes full advantage of BOSS, which has proved able to select informative intervals, and uses interval random variable selection to search for informative variables in the intervals selected by BOSS. It solves the problem of BOSS's tendency to select too few variables and also improves the predictive accuracy. Seven NIR datasets were used to investigate the improvement achieved by this hybrid strategy. The results show that it significantly improved the models' prediction performance compared with two high-performance methods (BOSS and VCPA-IRIV). It is worth pointing out that the proposed hybrid strategy is general and can be coupled with other optimization or variable selection methods for further improvement. Although it was applied to NIR datasets in this study, it could be applied to other kinds of high-dimensional data, such as genomics, proteomics, metabolomics, and QSAR data. In future work, we will consider applying our proposed approach to high-performance variable selection methods such as FOSS, SBOSS, and SMCPA, and we will include the computational cost in the performance evaluation.