A New Formula for Faster Computation of the K-Fold Cross-Validation and Good Regularisation Parameter Values in Ridge Regression

In the present paper, we prove a new theorem resulting in an update formula for linear regression model residuals, which yields the exact k-fold cross-validation residuals for any choice of cross-validation strategy without model refitting. The required matrix inversions are limited by the cross-validation segment sizes and can be executed with high efficiency in parallel. The well-known formula for leave-one-out cross-validation follows as a special case of the theorem. In situations where the cross-validation segments consist of small groups of repeated measurements, we suggest a heuristic strategy for fast serial approximations of the cross-validated residuals and the associated Predicted Residual Sum of Squares (PRESS) statistic. We also suggest strategies for efficient estimation of the minimum PRESS value and of the full PRESS function over a selected interval of regularisation values. The computational effectiveness of the parameter selection for Ridge and Tikhonov regression modelling resulting from our theoretical findings and heuristic arguments is demonstrated in several applications with real and highly multivariate datasets.


I. INTRODUCTION
Model/parameter selection in statistical modelling is frequently justified from the maximum likelihood (ML) principle in combination with some measure of model quality (such as Akaike's Information Criterion (AIC), the Bayesian Information Criterion (BIC), Mallows' C_p, the PRESS statistic, etc.) that estimates the expected predictive performance for some candidate model(s) [1].
According to Hjorth [2], the application of cross-validation measures as a methodology for model/parameter selection in statistical applications was introduced by Stone [3]. Stone's ideas motivated the invention of the generalised cross-validation (GCV) method by Golub et al. [4], which is a computationally efficient approximation to the leave-one-out cross-validation (LooCV) method. It is invariant under orthogonal transformations and is considered to be a computationally efficient method for choosing appropriate regularisation parameter values in ridge regression (RR) modelling. Cross-validation is still an active area of research, see, e.g., [5]-[7] for some recent works regarding prediction estimates for cross-validation, and [8] for an analysis of the stability of generalisation bounds for leave-one-out cross-validation. The focus of the present work is the selection of an appropriate regularisation parameter value, primarily by grid search, but numerical optimisation is also discussed.
The RR method was introduced to the statistics community by Hoerl and Kennard [9], and is perhaps the most important special case in the Tikhonov regularisation (TR) [10] framework of linear regression methods. The TR ideas were originally introduced to the community of numerical mathematics for solving linear discrete ill-posed problems in the context of inverse modelling. A good elementary introduction to the field is given in Hansen [11].
Fast and exact calculations of the LooCV-based Predicted Residual Sum of Squares (PRESS) statistic for ordinary least squares (OLS) regression were demonstrated by Allen [12], [13]. The purpose of the present paper is to demonstrate that such calculations are also available for the regularisation parameter selection problem of TR/RR at essentially no additional computational cost. In the present paper, we demonstrate this as follows: i) From the Sherman-Morrison-Woodbury updating formula for matrix inversion, see Householder [14], we prove a new theorem that gives the general formula for calculating the segmented cross-validation (SegCV) residuals of linear least squares regression modelling.
The formula for calculating the LooCV residuals in Allen's PRESS statistic [12], [13] follows as a corollary of this result. ii) We demonstrate how to obtain simple and fast LooCV calculations utilising the compact singular value decomposition (SVD) of a data matrix to quickly obtain PRESS values associated with any choice of the regularisation parameter for a TR-problem. In particular, this enables fast graphing of the PRESS-values as a function of the regularisation parameter at any desired level of detail. iii) For situations where some segmented cross-validation approach is required for obtaining the relevant PRESS-statistic values in the regularisation parameter selection, one may experience that even the segmented cross-validation formula from our theorem becomes computationally slow. To handle such situations, we propose an approximation of the segmented (K-fold) cross-validation strategy by invoking the computationally inexpensive LooCV strategy after conducting an appropriate orthogonal transformation of the data matrix. The particular orthogonal transformation is constructed from the left singular vectors of the K local SVDs associated with each of the K distinct cross-validation segments.
We demonstrate that the latter alternative provides practically useful approximations of the PRESS-statistic at substantial computational savings, in particular for large datasets with many cross-validation segments (large K) containing either identical or highly related measurement values.

II. MATHEMATICAL PRELIMINARIES
If not otherwise stated, we assume that X is a centred (n × p) data matrix (X′ denotes the transpose of X) and that the corresponding (n × 1) vector y of responses is also centred. We define the scalar ȳ and the row vector x̄ as the (column) averages of y and X obtained before centring, respectively.

A. MODEL ESTIMATION IN ORDINARY LEAST SQUARES AND RIDGE REGRESSION
In ordinary least squares (OLS) regression [1] one minimises the residual sum of squares

RSS(b) = ∥y − Xb∥²  (1)

to identify the least squares solution(s) of (1) with respect to the regression coefficients b. A least squares solution b_OLS of (1) corresponds to an exact solution of the associated normal equations

X′Xb = X′y,  (2)

where b_OLS is unique when X′X is non-singular. For later predictions of uncentred data, the associated vector of fitted values is given by

ŷ = Xb_OLS + b_0 1,  (3)

where the constant term (intercept) b_0 = ȳ − x̄b_OLS. For centred vectors/matrices, y and X, this equation becomes ŷ = Xb_OLS = Hy. Here, the projection matrix H (a.k.a. the hat matrix) is defined as

H = X(X′X)⁻¹X′ = TT′,  (4)

where T can be chosen as any orthogonal (n × r)-matrix spanning the column space of the centred X-data.
For various reasons, a minimiser b_OLS of RSS(b) in equation (1) is not always the most attractive choice from a predictive point of view [1], [11], [15]. For instance, X′X may be singular or poorly conditioned, or the solution of (2) may be non-unique or otherwise inappropriate. An alternative and quite useful solution was independently recognised by Tikhonov [10], Phillips [16], and Hoerl and Kennard [9]. Instead of directly minimising RSS(b), their alternative proposal was to minimise the weighted bi-objective least squares problem

RSS(b) + λ∥b∥² = ∥y − Xb∥² + λ∥Ib − 0∥²,  (5)

where the scalar λ > 0 is a fixed regularisation parameter (of appropriate magnitude), the matrix I is the (p × p) identity matrix and 0 is a (p × 1) vector of zeros. This formulation explicitly represents a penalisation with respect to the Euclidean (L_2) norm ∥b∥ of the regression coefficients. The identity matrix I can also be replaced by an alternative regularisation matrix L as described in Appendix C. For a fixed λ, the unique minimiser of (5) is given by b_λ of equation (8) below. The rightmost part of Equation (5) is sometimes referred to as a TR-problem in standard form [11].
The minimisation of equation (5) with respect to b is equivalent to solving the OLS problem associated with the augmented data matrix and response vector:

X_λ = [X ; √λ I] and y_0 = [y ; 0].  (6)

Note that linear independence of the X_λ-columns trivially follows from linear independence of the I-columns. The matrix product X′_λX_λ in the associated normal equations is therefore non-singular, and the corresponding least squares solution of the augmented problem (6) becomes unique:

X′_λX_λ b = X′_λ y_0.  (7)

Straightforward algebraic simplification of (7) results in the familiar normal equations associated with the RR-problem, (X′X + λI)b = X′y, and the solution b_λ in (8) simplifies to

b_λ = (X′X + λI)⁻¹X′y.  (8)

For subsequent applications of the λ-regularised model to uncentred X-data, the appropriate constant term in the resulting regression model is

b_{0,λ} = ȳ − x̄b_λ,  (9)

and the associated vector of fitted values ŷ_λ is given by

ŷ_λ = Xb_λ + b_{0,λ}1.  (10)

The full SVD of X = USV′ yields VV′ = I_p and X′X = VS′SV′. Assuming that X has full rank, it is shown in Appendix A that the regression coefficients are given by b_λ = V_r c_λ, where V_r holds the right singular vectors of the compact SVD and the coordinate vector c_λ has the scalar entries

c_{λ,j} = s_j(u′_j y)/(s_j² + λ), j = 1, ..., r,

with u_j denoting the j-th left singular vector. Compared to the relatively large computational costs associated with calculating the (compact) SVD of X, calculation of the regression coefficient candidates (even for a large number of different λ-values) only requires computing the vectors c_λ according to Equation (37) and the matrix-vector multiplications b_λ = V_r c_λ as derived in Equation (36).
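As an illustration, the following minimal MATLAB sketch (synthetic data; all variable names are our own) computes the coefficient vectors b_λ = V_r c_λ for a grid of 1000 λ-values from a single compact SVD:

% MATLAB sketch: ridge coefficient paths from one compact SVD.
% Synthetic data; all variable names are our own.
n = 100; p = 500;
X = randn(n, p); y = randn(n, 1);          % assumed already centred
[Ur, Sr, Vr] = svd(X, 'econ');             % compact SVD of X
s   = diag(Sr);                            % non-zero singular values
Uty = Ur' * y;                             % fixed across all lambda values
lambdas = logspace(-4, 4, 1000);
B = zeros(p, numel(lambdas));              % one coefficient vector per lambda
for j = 1:numel(lambdas)
    c = (s ./ (s.^2 + lambdas(j))) .* Uty; % entries of c_lambda
    B(:, j) = Vr * c;                      % b_lambda = V_r c_lambda
end

After the single SVD, each additional λ-value costs only O(r) scalar operations for c_λ plus one matrix-vector multiplication.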
For regularised multivariate regression with several (q) responses, Y ∈ R^{n×q}, the associated matrix of regression coefficients is B_λ = V_r C_λ, where C_λ ∈ R^{r×q}, with columns formed as above for each individual response, is the obvious multivariate generalisation of the vector c_λ introduced above.

B. OBTAINING CROSS-VALIDATION SEGMENTS BY PROJECTION MATRIX CORRECTION
When the columns of the data matrix X are linearly independent, the associated OLS-solution b_OLS of the normal equations (2) is unique, and cross-validation residuals can be derived from the Sherman-Morrison-Woodbury formula for updating matrix inverses [14]. From Theorem B in the Appendix, we obtain the general segmented CV (SegCV) residuals

r_({k}) = [I − H_{{k}} − (1/n)11′]⁻¹ r_{{k}},  (15)

where {k} refers to the samples of the k-th CV segment, r_({k}) refers to the vector of predicted residuals when the segment samples are not included in the modelling, r_{{k}} is the corresponding sub-vector of the model residuals r = y − ŷ, I and 1 denote the (n_k × n_k) identity matrix and the n_k-vector of ones, n_k is the number of samples in the segment, and H_{{k}} is the sub-matrix of the projection matrix H (defined in Equation (4) above) associated with the samples of the k-th CV segment. This means that updating the residuals for a given segment entails the inversion of a matrix involving the entries of H corresponding to all pairs of sample indices of the k-th CV segment. The computational cost of the inversions obviously depends on the number of segments and the number of samples belonging to each segment. Allen [12], [13] suggested the PRESS (Prediction Sum-Of-Squares) statistic

PRESS = Σ_{i=1}^{n} (y_i − ŷ_{i,(i)})² = Σ_{i=1}^{n} r_(i)²,  (16)

where ŷ_{i,(i)} denotes the OLS prediction of the i-th sample when the sample has been deleted from the regression estimation, and r_(i) is the corresponding residual. With ŷ_{i,({k})} denoting the predictions of the i-th sample after deleting the corresponding k-th CV segment samples from the regression problem in (1), the SegCV equivalent of the PRESS-statistic becomes:

PRESS_Seg = Σ_{k=1}^{K} Σ_{i∈{k}} (y_i − ŷ_{i,({k})})² = Σ_{k=1}^{K} Σ_{i∈{k}} r_{i,{k}}².  (17)

Here r_{i,{k}} are the elements of the residual vectors defined in Equation (15).
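The following MATLAB sketch verifies the segment update numerically in the simpler uncentred, intercept-free setting of Theorem B (Equation (44) in the Appendix), where the (1/n)11′ term is absent; the toy data and the segment index set are our own choices:

% MATLAB sketch: leave-segment-out residuals without refitting, checked
% against an explicit hold-out refit (uncentred, intercept-free case).
n = 30; p = 5;
X = randn(n, p); y = randn(n, 1);
b = X \ y;                                 % OLS fit on all samples
r = y - X * b;                             % fitted (model) residuals
H = X * ((X' * X) \ X');                   % hat matrix (fine at this toy size)
idx = 4:6;                                 % samples of the k-th CV segment
r_seg = (eye(numel(idx)) - H(idx, idx)) \ r(idx);  % updated residuals
keep = setdiff(1:n, idx);                  % explicit hold-out refit
b_k = X(keep, :) \ y(keep);
max(abs(r_seg - (y(idx) - X(idx, :) * b_k)))       % agreement to ~1e-15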
1) The leave-one-out cross-validation

Corollary B of Theorem B covers the special case of LooCV, where Equation (15) simplifies to a computationally efficient scalar formula for updating the individual residuals:

r_(i) = y_i − ŷ_{i,(i)} = (y_i − ŷ_i)/(1 − h_i − 1/n).  (18)

Here, h_i is often referred to as the leverage value associated with the i-th sample (row) in X. For ŷ_{i,(i)} denoting the prediction of the i-th sample after deleting it from the regression modelling problem in (1), the LooCV PRESS-statistic is given by

PRESS = Σ_{i=1}^{n} (y_i − ŷ_{i,(i)})² = Σ_{i=1}^{n} [(y_i − ŷ_i)/(1 − h_i − 1/n)]².  (19)

In (19), ŷ_i is the i-th entry in the vector of fitted values ŷ = Xb_OLS + b_0 1, and h_i denotes the i-th diagonal element of the projection matrix H defined in (4) above. The denominator (1 − h_i − 1/n) scales the i-th model residual (y_i − ŷ_i) to obtain the exact LooCV prediction residual (y_i − ŷ_{i,(i)}). The term 1/n in this denominator accounts for the centring of the X-columns and the associated inclusion of a constant term (b_0) in the regression model (3).
From the last identity in Equation (4) it is clear that the entries of the n-vector h = [h_1 h_2 ... h_n]′, corresponding to the diagonal elements of H, are identical to the squared norms of the T-rows, i.e.,

h = (T ⊙ T)1.  (20)

Here, T ⊙ T denotes the Hadamard (element-wise) product of T with itself and 1 ∈ R^r is the constant vector with 1's in all entries. Appropriate choices of the matrix T can be obtained in various ways, including both the QR-factorisation and the SVD of X.
It should be noted that calculating the matrix inverse (X′X)⁻¹ in the process of finding the diagonal h of H in (4) is neither required nor recommended in practice. In general, the explicit calculation of matrix inverses (for non-diagonal matrices) should be avoided whenever possible due to various unfavourable computational aspects, see Björck [17, Section 1.2.6].
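A minimal MATLAB sketch combining Equations (18)-(20), with a thin QR-factorisation as the choice of T (toy data; all names are our own):

% MATLAB sketch: exact LooCV residuals via leverage correction, with the
% leverages obtained from a thin QR (no explicit matrix inverse).
n = 30; p = 5;
X0 = randn(n, p); y0 = randn(n, 1);        % uncentred toy data
Xc = X0 - mean(X0); yc = y0 - mean(y0);    % centred copies
b = Xc \ yc;                               % OLS fit on the centred data
r = yc - Xc * b;                           % model residuals
[Q, ~] = qr(Xc, 0);                        % T = Q spans the column space of Xc
h = sum(Q.^2, 2);                          % leverages, cf. Eq. (20)
r_loo = r ./ (1 - h - 1/n);                % exact LooCV residuals, Eq. (18)
i = 7; keep = setdiff(1:n, i);             % check one explicit hold-out refit
bi = [ones(n-1, 1), X0(keep, :)] \ y0(keep);
r_loo(i) - (y0(i) - [1, X0(i, :)] * bi)    % agreement to ~1e-15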
2) The generalised cross-validation

The GCV(λ) was proposed by Golub et al. [4] as a fast method for choosing good regularisation parameter (λ) values in RR. Here, we consider the definition

GCV(λ) = Σ_{i=1}^{n} [(y_i − ŷ_{λ,i})/(1 − h̄_λ − 1/n)]²,  (21)

where (y_i − ŷ_{λ,i}) is the i-th entry of the residual vector r_λ = y − ŷ_λ, h̄_λ = (1/n)Σ_{j=1}^{r} s_j²/(s_j² + λ) is the averaged leverage value, and the effective degrees of freedom are df(λ) = nh̄_λ + 1. This definition of GCV(λ) is proportional (by the sample size n) to the definition given in [4, page 216], and can be explained as a rotation-invariant alternative to the LooCV that approximates the PRESS(λ)-statistic defined below. From the elementary matrix-vector multiplication formula (36) for computing the regression coefficients b_λ, it is clear that GCV(λ) can be calculated very efficiently for a large number of different λ-values once the non-zero singular values of X are available.
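A corresponding MATLAB sketch evaluating GCV(λ) over a grid of λ-values from the compact SVD, following the definition in (21) as reconstructed above (toy data; names are our own):

% MATLAB sketch: GCV(lambda) over a grid of lambda values from one SVD.
n = 50; p = 300;
X0 = randn(n, p); y0 = randn(n, 1);
Xc = X0 - mean(X0); yc = y0 - mean(y0);
[Ur, Sr, ~] = svd(Xc, 'econ');
s = diag(Sr); Uty = Ur' * yc;              % fixed across all lambda values
lambdas = logspace(-4, 4, 1000);
gcv = zeros(size(lambdas));
for j = 1:numel(lambdas)
    yhat = (Ur .* s') * ((s ./ (s.^2 + lambdas(j))) .* Uty);  % fitted values
    hbar = sum(s.^2 ./ (s.^2 + lambdas(j))) / n;              % averaged leverage
    gcv(j) = sum((yc - yhat).^2) / (1 - hbar - 1/n)^2;        % Eq. (21)
end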
In their justification of GCV(λ) as the preferable choice over the exact LooCV-based PRESS(λ), Golub and co-workers stressed the unsatisfactory properties of the PRESS-function when the rows of X are exactly or approximately orthogonal. In this case, the estimated regression coefficients b_λ^{(i)} (obtained by excluding the i-th row x_i of X) must be correspondingly orthogonal (or nearly orthogonal) to the excluded sample x_i. Consequently, the associated leave-one-out prediction ŷ_{i,(i)} (= x_i b_λ^{(i)}) becomes a poor estimate of the corresponding i-th response value y_i.
NOTE: In situations such as the one just described, it makes little sense to think of the X-data as a collection of independent random samples, and the statistical motivation for considering the LooCV idea becomes correspondingly weaker. In [4] it is claimed that any parameter selection procedure should be invariant under orthogonal transformations of the (X, y)-data. We are sceptical of this requirement and consider it an inexpedient restriction. This scepticism relates to the context of approximating the PRESS-statistic for situations where a segmented/folded cross-validation approach is appropriate.

III. CALCULATION OF THE CROSS-VALIDATION BASED PRESS(λ)-FUNCTIONS
From Equations (17) and (19) and the matrix and vector augmentations in Equation (6), it is clear that the computationally fast versions of the SegCV and LooCV with the associated PRESS-statistic are also valid for TR-problems when the regularisation parameter λ is treated as a fixed quantity.
Below we first handle the general case of segmented cross-validation. Thereafter we derive an equation ensuring fast calculation of the regularised leverages in the vectors h_λ necessary for the LooCV situation. The required calculations are remarkably similar to a computationally efficient calculation of the fitted values ŷ_λ and closely related to the corresponding regularised regression coefficients b_λ in (36).
Both h_λ and ŷ_λ (and b_λ) can be obtained from the SVD of the original centred data matrix X. This makes the computations of the exact LooCV-based PRESS(λ)-function defined in (26) below about as efficient as the approximation obtained by the GCV(λ) in (21).
A. EXACT PRESS(λ)-FUNCTIONS FROM THE SVD OF THE AUGMENTED MATRIX X_λ

Again, we assume that the centred X has full rank r and that X = U_r S_r V′_r is the associated compact SVD. By defining S_{λ,r} to be the diagonal (r × r) matrix with non-zero diagonal entries √(s_j² + λ), j = 1, ..., r, the r most dominant singular values of the augmented matrix X_λ in (6) are given by the diagonal elements of S_{λ,r}. From equation (34) in Section II, the right singular vectors V_r of X are also the right singular vectors of X_λ, and the associated r left singular vectors are given by

T_{λ,r} = [U_r S_r S_{λ,r}⁻¹ ; √λ V_r S_{λ,r}⁻¹],  (22)

where the matrix U_{λ,r} := U_r S_r S_{λ,r}⁻¹, comprising the upper n rows of T_{λ,r}, is the part of actual interest (the additional left singular vectors not included in (22) are all zeros in the upper n entries). Because S_r S_{λ,r}⁻¹ is an (r × r) diagonal matrix, U_{λ,r} is obtained by scaling the j-th column (1 ≤ j ≤ r) of U_r by s_j/√(s_j² + λ).
From the above definition of U_{λ,r}, calculation of the PRESS-residuals associated with the n original (X, y) data points in the augmented least squares problem X_λ b = y_0 is straightforward. According to Equations (4) and (22), the regularised hat matrix H_λ is given by

H_λ = U_{λ,r} U′_{λ,r}.  (23)

For each choice of the regularisation parameter λ > 0 and the corresponding expression for the regression coefficients b_λ in Equation (36), the fitted values are

ŷ_λ = U_r S_r c_λ + b_{0,λ}1 = H_λ y + b_{0,λ}1.  (24)

Hence,

r_{λ,({k})} = [I − H_{λ,{k}} − (1/n)11′]⁻¹ (y_{{k}} − ŷ_{λ,{k}}),  (25)

where y_{{k}} − ŷ_{λ,{k}} is the sub-vector of the residual vector r_λ = y − ŷ_λ corresponding to the k-th CV segment and H_{λ,{k}} is the associated sub-matrix of H_λ. While Equation (25) defines the general, segmented cross-validation case, the special case of LooCV simplifies considerably. Only the diagonal entries of H_λ (the sample leverages) are required, i.e., Equation (25) simplifies to

PRESS(λ) = Σ_{i=1}^{n} [(y_i − ŷ_{λ,i})/(1 − h_{λ,i} − 1/n)]².  (26)

Note that h̄_λ in the denominator of Equation (21) defining GCV(λ) is identical to the mean of the h_λ-entries, i.e., h̄_λ = (1/n)Σ_{i=1}^{n} h_{λ,i}, due to the fact that U_r is an orthogonal matrix. Also note that the diagonal entries of H_λ can be calculated directly by

h_λ = (U_r ⊙ U_r) d_λ,  (27)

where the coefficient vector d_λ ∈ R^r has the entries d_{λ,j} = s_j²/(s_j² + λ), j = 1, ..., r. Consequently, the evaluation of the PRESS(λ)-function defined in (26) is essentially available at the additional computational cost of two matrix-vector multiplications (Equations (24) and (27)), where the matrices (U_r S_r and U_r ⊙ U_r) are fixed and the associated coefficient vectors c_λ and d_λ are obtained by elementary arithmetic operations for each choice of λ > 0. A note on the number of floating point operations (flops) required for the fast calculation of the LooCV-based PRESS(λ)-function is included in Appendix H.
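The complete recipe is summarised by the following MATLAB sketch, which evaluates the exact LooCV-based PRESS(λ) of Equation (26) over a grid of 1000 λ-values after a single SVD (toy data; variable names are our own):

% MATLAB sketch: exact LooCV-based PRESS(lambda) via Eqs. (24), (26), (27):
% one SVD, two fixed matrices, and two cheap coefficient vectors per lambda.
n = 60; p = 400;
X0 = randn(n, p); y0 = randn(n, 1);
Xc = X0 - mean(X0); yc = y0 - mean(y0);
[Ur, Sr, ~] = svd(Xc, 'econ');
s   = diag(Sr);
Uty = Ur' * yc;                            % fixed across all lambda values
UrS = Ur .* s';                            % U_r S_r, fixed
UU  = Ur .^ 2;                             % U_r (Hadamard) U_r, fixed
lambdas = logspace(-4, 5, 1000);
PRESS = zeros(size(lambdas));
for j = 1:numel(lambdas)
    c = (s ./ (s.^2 + lambdas(j))) .* Uty;             % c_lambda
    d = s.^2 ./ (s.^2 + lambdas(j));                   % d_lambda
    yhat = UrS * c;                                    % fitted values, Eq. (24)
    h    = UU * d;                                     % leverages, Eq. (27)
    PRESS(j) = sum(((yc - yhat) ./ (1 - h - 1/n)).^2); % Eq. (26)
end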

B. ALTERNATIVE STRATEGIES FOR ESTIMATING THE SEGCV-BASED PRESS(λ)-FUNCTION
The LooCV calculations in the previous section can be implemented at low computational costs dominated by the SVD of X. The SegCV version, however, also involves the inversion of several matrices, one for each combination of the regularisation parameter value λ and cross-validation segment. In situations with many CV segments, e.g., defined by relatively small groups of replicates, the additional computational costs may be acceptable as the matrices to be inverted are small. However, for large datasets with few segments, e.g., 5-10, the required amount of computation may be rather large (comparable to explicitly holding out samples and recalculating a full TR model from scratch for each CV segment).
We therefore describe two alternative strategies for speeding up the calculations. The first is based on approximating the PRESS-values, while the second involves clever use of a small subset of exact PRESS(λ)-values to estimate the minimum PRESS(λ)-value and/or the complete PRESS(λ)-curve within some range of the regularisation parameter.

1) PRESS(λ) approximated by segmented virtual cross-validation - VirCV
We will consider a faster alternative for approximating the SegCV approach for the type of situations just described. In the following, we assume (without loss of generality) that the uncentred data matrix X together with the uncentred response vector y is composed of K distinct sample segments. For k = 1, ..., K, X_k = U_k S_k V′_k denotes the compact SVD of segment number k, and n_k is the number of rows in X_k, so that the total number of samples is n = Σ_{k=1}^{K} n_k. From the SVD of the k-th segment, we obtain the identity

U′_k X_k = S_k V′_k.  (29)

Consequently, the orthogonal transformation performed by left multiplication with the (n_k × n_k) matrix U′_k transforms the sample segment X_k into a matrix of mutually orthogonal rows. Now we define the two block diagonal matrices

T = diag(U_1, U_2, ..., U_K) and T̃ = diag(T, I),  (30)

with the properties T′T = TT′ = I and T̃′T̃ = T̃T̃′ = I, i.e., both T and T̃ are orthogonal.
The formulation of TR-modelling for uncentred X with explicit inclusion of the constant term corresponds to finding the least squares solution of the linear system

[1 X ; 0 √λI] [b_0 ; b] = [y ; 0],  (31)

and left multiplication of (31) by the orthogonal matrix T̃′ yields the system

[T′1 T′X ; 0 √λI] [b_0 ; b] = [T′y ; 0].  (32)

Note that the associated normal equations of the systems in (31) and (32) are identical. Hence, their least squares solutions are also identical.

Definition of the segmented virtual cross-validation
We define the segmented virtual cross-validation (VirCV) strategy as the process of applying the LooCV strategy to the transformed system in equation (32). As noted above, multiplication by T′ has the effect of orthogonalising the rows within each of the K segments of the X matrix.
The heuristic argument justifying the VirCV approach as an approximation of the SegCV approach is that the rows within each transformed data segment are unsupportive of each other under the LooCV strategy (due to the internal "decoupling" of each segment into a set of mutually orthogonal row vectors). However, from practical cases it can be observed that the accuracy of this approximation depends on the level of similarity between the original samples within each segment of data points.
Note that, contrary to the LooCV, the GCV is not useful in combination with the VirCV strategy. The reason is that the singular values of X are invariant under orthogonal transformations. From equation (21) and the definition of h̄_λ, it follows that GCV(λ) is also invariant under orthogonal transformations, i.e., the systems in (31) and (32) lead to the same GCV(λ)-function.
With the VirCV we are clearly cross-validating on the orthogonal phenomena caused by the samples within each segment. As all the samples in a segment contribute to identifying these directions, the VirCV cannot be expected to provide exactly the same results as the SegCV. One may, however, expect that when the different segments are carefully arranged to contain highly similar samples only (which is a reasonable assumption for most organised studies with such data segments), the VirCV should provide a useful approximation to the SegCV. This will be demonstrated in the applications section below. For special situations deviating from highly similar samples in the segments, see Appendix D.

Computational aspects of the leverage corrections for the VirCV
As noted in association with (29), the VirCV procedure requires an initial calculation of the transformation T from the segments of the uncentred X-data. For a correct implementation of the computational shortcuts similar to those of the LooCV, it is necessary to mean centre the data matrix X prior to executing the T-transformation and the least squares modelling. In practice, one must therefore mean centre the data prior to the multiplication with T′ (or, equivalently, one can multiply by T′ and subtract the projection of the transformed data onto the transformed vector T′1 of ones). As T is an orthogonal transformation, the angles, and in particular the orthogonality, between vectors are preserved. For the transformed data, modelling by including a constant term is therefore associated with the transformed vector T′1 of ones. With X_c and y_c denoting the centred data matrix and the associated centred response vector, respectively, the vector T′1 is orthogonal to the columns of the transformed centred data T′X_c, and ∥T′1∥ = ∥1∥ = √n. The justification for the leverage correction described earlier therefore still holds, but the particular correction terms (1/n) change.
With the transformed centred predictors X̃ = T′X_c and responses ỹ = T′y_c in (32), and the associated fitted values ŷ_λ = X̃b_λ, the PRESS-function for the VirCV is given by

PRESS_vir(λ) = Σ_{i=1}^{n} [(ỹ_i − ŷ_{λ,i})/(1 − h_{λ,i} − m_i/n)]².  (33)

Here the leverages h_{λ,i} are calculated as in (27), based on the transformed version X̃ of the centred data, and the numerators of the correction terms are the entries of the vector m = T′1 ⊙ T′1 ∈ R^n. This means that the correction term 1/n in the denominator of (26) must be replaced by m_i/n in (33), where m_i denotes the i-th entry of the vector m (to be consistent with the orthogonal transformation of the regularised least squares problem).
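A minimal MATLAB sketch of the complete VirCV procedure, including the modified correction terms m_i/n of Equation (33) (toy data with K = 3 replicate segments; all names are our own):

% MATLAB sketch of the VirCV: block-diagonal orthogonal transform from
% per-segment SVDs, then LooCV with the m_i/n correction of Eq. (33).
segs = {1:3, 4:6, 7:10};                   % row indices of the K = 3 segments
n = 10; p = 50;
X0 = randn(n, p); y0 = randn(n, 1);
T = eye(n);
for k = 1:numel(segs)
    rows = segs{k};
    [Uk, ~, ~] = svd(X0(rows, :), 'econ'); % left singular vectors of segment k
    T(rows, rows) = Uk;                    % block-diagonal orthogonal T
end
Xc = X0 - mean(X0); yc = y0 - mean(y0);    % centre first ...
Xt = T' * Xc; yt = T' * yc;                % ... then transform
m  = (T' * ones(n, 1)).^2;                 % entries of m = T'1 (Hadamard) T'1
lam = 1e-2;                                % a single illustrative lambda value
[Ur, Sr, ~] = svd(Xt, 'econ');
s = diag(Sr); Uty = Ur' * yt;
yhat = (Ur .* s') * ((s ./ (s.^2 + lam)) .* Uty);  % fitted values
h    = (Ur .^ 2) * (s.^2 ./ (s.^2 + lam));         % leverages, Eq. (27)
PRESS_vir = sum(((yt - yhat) ./ (1 - h - m/n)).^2) % Eq. (33)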
A comparison of the number of flops required for the VirCV and the SegCV is included in Appendix I.
2) Approximated PRESS-function using subsets of λ

Minimum PRESS-value estimation
If TR is used in an automated system (without subjective assessment), or if only the optimal PRESS(λ) is needed, we can avoid redundant calculations by searching directly for the λ-value that minimises the PRESS(λ) based on (25), instead of calculating a large range of solutions. A possible approach for such a search is the golden section search with parabolic interpolation [18]. This method searches for the minimal function value over a bounded interval of a single parameter. To leverage the previously described efficient computations of fitted values ŷ_λ, coefficient vectors d_λ, etc., the search for the minimum PRESS(λ) is performed over a fixed set of λ-values. The grid of λ-values can have high resolution while still achieving a considerable advantage in computational speed compared to the exhaustive PRESS-function calculations. It is well known that this type of function minimisation cannot guarantee that the optimal value is found; however, the PRESS-functions of interest often have relatively smooth and simple graphs, where a global minimum over the λ-interval of interest can be found with high accuracy.
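As a sketch of this idea in MATLAB, note that the built-in fminbnd implements exactly the golden section search with parabolic interpolation of [18]; here we search over log10(λ) on a continuous interval (snapping the result to a fixed grid, as described above, is straightforward). Toy data; names are our own:

% MATLAB sketch: locating the PRESS-minimising lambda with fminbnd
% (golden section search with parabolic interpolation, cf. [18]).
n = 60; p = 200;
X0 = randn(n, p); y0 = randn(n, 1);
Xc = X0 - mean(X0); yc = y0 - mean(y0);
[Ur, Sr, ~] = svd(Xc, 'econ');
s = diag(Sr); Uty = Ur' * yc; UrS = Ur .* s'; UU = Ur .^ 2;
press = @(loglam) sum(((yc - UrS * ((s ./ (s.^2 + 10^loglam)) .* Uty)) ...
        ./ (1 - UU * (s.^2 ./ (s.^2 + 10^loglam)) - 1/n)).^2);
[loglam_opt, press_min] = fminbnd(press, -4, 5);   % search on the log10 scale
lam_opt = 10^loglam_opt;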

PRESS(λ)-function estimation by spline interpolation
In cases where estimating the detailed PRESS(λ)-function is beneficial, e.g., for plotting and inspection, it may be possible to reduce the number of exact PRESS(λ)-evaluations quite substantially without sacrificing much precision in the estimation.
We propose a cubic spline strategy, where the PRESS(λ)-function is estimated from a small set of distinct λ-values, and new values are added to the set iteratively until the difference between the estimate and the true PRESS-value falls below a chosen threshold for all λ-values in the extended set. The latter is determined by cross-validation of the cubic spline interpolation, i.e., a low-cost operation.
As with the PRESS-minimisation procedure, we consider a fixed set of λ-values from which we choose starting points and select subsequent values. The λ-values extending the set in each iteration are the ones halfway to the neighbours of the chosen λ-values on both sides, effectively doubling the local density of λ-values where needed (i.e., where the spline approximation has low accuracy). Starting values for the initial set of λs can be chosen equidistantly (on a log10 scale) or from the sequence obtained using the above "Minimum PRESS-value estimation" strategy. Experience with real datasets indicates that the latter is an efficient strategy that may provide close to exact estimation of the minimum PRESS-value.
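One possible MATLAB realisation of this refinement loop, reusing the press handle from the preceding sketch (the initial knots and the tolerance are our own choices):

% MATLAB sketch: adaptive cubic spline estimation of the PRESS curve.
% Knots are added halfway to both neighbours wherever the leave-one-out
% spline check misses the exact PRESS value by more than tol.
grid0 = linspace(-4, 5, 9);                % initial log10(lambda) knots
vals  = arrayfun(press, grid0);            % exact PRESS at the knots
tol   = 1e-3 * (max(vals) - min(vals));
done  = false;
while ~done
    done = true;
    bad = false(size(grid0));
    for i = 2:numel(grid0)-1               % leave-one-out check of the spline
        est = interp1(grid0([1:i-1, i+1:end]), vals([1:i-1, i+1:end]), ...
                      grid0(i), 'spline');
        bad(i) = abs(est - vals(i)) > tol;
    end
    if any(bad)
        idx  = find(bad);                  % refine around inaccurate knots
        newg = unique([(grid0(idx-1) + grid0(idx))/2, ...
                       (grid0(idx) + grid0(idx+1))/2]);
        [grid0, ord] = sort([grid0, newg]);
        vals = [vals, arrayfun(press, newg)]; vals = vals(ord);
        done = false;
    end
end
fine = linspace(-4, 5, 1000);
press_est = interp1(grid0, vals, fine, 'spline');   % estimated PRESS curve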

C. A SHORT NOTE ON MODEL SELECTION HEURISTICS
With the key formulas derived above, we obtain efficient model selection procedures from minimising the PRESS(λ)- or the GCV(λ)-functions with respect to the regularisation parameter λ. However, the minima of these functions will not necessarily assure the selection of the best model in terms of future predictions. This is particularly the case when the PRESS- and GCV-functions are relatively flat over a relatively large interval of λs containing the minimum value. In such situations it is often useful to invoke heuristic principles such as Occam's razor for identifying a simpler model (in terms of the norm of the regression coefficients) at a small additional cost in terms of the PRESS (or the GCV): The '1 standard error rule' described in [1] obtains a simpler (more regularised) alternative by selecting a model where the PRESS-statistic is within one standard error of the PRESS-minimal model. More precisely, we first identify the minimum PRESS-value and calculate the standard error of the squared cross-validation errors associated with this model. Then the largest regularisation parameter value where the associated model has a PRESS-statistic within one standard error of the PRESS-minimum is selected.
The 'χ² model selection rule' for determining the regularisation parameter was originally introduced for model selection in Partial Least Squares regression modelling [19]. By assuming that the residuals associated with the minimum value PRESS_min of PRESS(λ) are randomly drawn from a normal distribution, the statistic given by n·PRESS_min/σ², where σ² is the associated (unknown) variance, follows a χ²_n distribution (where n is the degrees of freedom). By fixing a particular significance level α, the selection rule says: choose the largest possible value of λ so that n·PRESS_min/PRESS(λ) ≥ χ²_{n,α}. Here, χ²_{n,α} is the lower α-quantile of the χ²_n distribution, and PRESS(λ) is a substitute for σ².
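Both rules reduce to a few lines of MATLAB once a matrix of cross-validated residuals is available. In the sketch below, R_cv is assumed to hold such residuals (n × m, one column per candidate λ, with the λ-values sorted in increasing order); it is faked with random numbers purely to make the sketch executable, and chi2inv requires the Statistics and Machine Learning Toolbox. All names are our own:

% MATLAB sketch of the '1 standard error rule' and the chi-square rule.
n = 60; m = 500;
lambdas = logspace(-4, 5, m);
R_cv = randn(n, m) .* (1 + (1:m)/m);       % placeholder residual matrix
press = sum(R_cv.^2, 1);                   % PRESS(lambda), one value per column
[pmin, jmin] = min(press);
se = sqrt(n) * std(R_cv(:, jmin).^2);      % SE of the PRESS at its minimum
j1se = find(press <= pmin + se, 1, 'last');% largest lambda within one SE
lam_1se = lambdas(j1se);
alpha = 0.2;                               % chi-square rule, cf. [19]
q = chi2inv(alpha, n);                     % lower alpha-quantile of chi2_n
jchi = find(n * pmin ./ press >= q, 1, 'last');
lam_chi2 = lambdas(jchi);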
Based on the efficient formulas for calculating the PRESS(λ) function, both these model selection alternatives can be implemented without affecting the total computational costs significantly.

IV. APPLICATIONS
In the following, we demonstrate some applications of our fast cross-validation approaches for model selection within the TR framework on several real-world datasets. We consider situations where both leave-one-out and segmented cross-validation are appropriate. The required algorithms were implemented and executed in MATLAB, and prototype code is given in Appendices E-G. A corresponding implementation in R will be made available upon publication at https://CRAN.R-project.org/package=TR. A flow chart of the LooCV, segmented CV, and VirCV algorithms is given in Figure 1. We used a computer running Mac OS Ventura 13.0.1 and MATLAB R2022a, with 16 GB RAM and an M1 Pro 10-core processor. For the derivative regularisations, we use the full rank approximations described in Appendix C with the scaling coefficient set to ϵ = 10⁻¹⁰ in the appended rows of the discrete regularisation matrices. This is done to mitigate the numerical impact of these rows on the resulting regression coefficients.

A. THE FAST LEAVE-ONE-OUT CROSS-VALIDATION

1) Datasets
The following datasets will be considered in the examples presented below:
1) Octane data [20]. This dataset consists of near-infrared (NIR) spectra of gasoline. There are 60 samples and 401 features (wavelengths in the range 900 nm to 1700 nm). The response value is the octane number measured for each sample.
2) Pork fat data [21]. This dataset consists of Raman spectra measured on pork fat tissue. There are 105 samples, 5567 features (wavenumbers in the range 1889.9 cm⁻¹ to 200.1 cm⁻¹), and 19 different responses. For modelling and prediction, we only consider the response consisting of saturated fatty acids as a percentage of total fatty acids, hereafter referred to as SFA.
3) Prostate gene data [22]. This dataset is a microarray gene expression dataset. There are 102 samples, and the gene expression of 12600 different genes was measured. The response is binary (cancer/not cancer), and we consider the dummy-regression approach to the underlying classification problem. For this dataset, we standardise the data prior to modelling. The standardisation will introduce a small bias in the model selection that will be discussed later.
For all datasets, we have used approximately 2/3 of the available samples for model building and selection. The remaining 1/3 of the samples were used for testing the selected models. (Note that our choice of data splitting is somewhat arbitrary, and serves the purpose of illustrating the ideas with an appropriate number of samples for both training and testing.) We considered the following model selection alternatives for identifying good regularisation parameter candidates: (i) PRESS_min, the minimum PRESS(λ)-value, (ii) GCV_min, the minimum GCV(λ)-value, (iii) the 1 standard error rule for PRESS(λ), and (iv) the χ²-rule for PRESS(λ) using the significance level α = 0.2.

2) Model selection and prediction
For each dataset, the modelling was based on a grid search of 1000 regularisation parameter candidate values spaced uniformly on a log-scale. For the octane data, the displayed values were in the range 10⁻⁴ to 10⁵, for the Pork fat data in the range 10² to 10²⁵, and for the Prostate data in the range 10⁻¹ to 10⁸. Different ranges were chosen for each dataset to avoid irrelevant levels of regularisation and to obtain a good visualisation of the PRESS- and GCV-curves including the located minima. In Figures 2-4, PRESS/n and GCV/n are plotted as functions of the regularisation parameter for the different datasets and the different choices of the regularisation matrix. Such plots are useful for model selection as they allow for a direct comparison of the model quality for different values of the regularisation parameter. Division of the PRESS- and GCV-values by the sample size n makes the model selection statistics directly comparable to the prediction results obtained with the test sets. The test set results are shown in Tables 1-3. For the prostate data, the percentage correctly classified on the training set using cross-validation (classifying each sample to the largest of the fitted target values when using 0/1 dummy-coding for the group memberships) is 91.2% for all the parameter selection methods (it should be noted that this number happens to be identical to the test set result for most of the parameter selection methods).
It should be noted that most of the displayed PRESS- (and GCV-) curves are relatively flat without a very distinct minimum point. Therefore it may be advantageous to employ either the 1 S.E. rule or the χ²-rule to assure the selection of a simpler model. For the Prostate data, in particular, we note that the smallest available candidate regularisation parameter value provides the minimum PRESS-value. The effect in terms of prediction when using the 1 S.E. rule or the χ²-rule to obtain a simpler model varies between the datasets. For the Pork fat data, the χ²-rule gives better prediction than the other parameter selection methods for the SFA response, while the χ²-rule selects a poorer model than the other parameter selection methods on the Prostate data.
For the most precise identification of the PRESS- and GCV-minima, a numerical optimiser should be used. However, in most practical situations the suggested strategy of considering just a subset of candidate regularisation parameter values is usually good enough for approximating the minima before the subsequent identification of parsimonious models (based on the principle of Occam's razor) that predict well.

3) Regression coefficients
Figure 5 shows the octane data together with the PRESS-minimal regression coefficients using the L_2, the first derivative, and the second derivative regularisations. Note that the choice of regularisation matrix heavily influences the appearance of the regression coefficients without the minimum PRESS- or GCV-values changing much. Table 1 confirms that the predictive powers are relatively similar for all these models. Doing consistent model interpretations solely based on the regression coefficients in Figure 5 is obviously a challenging (if not impossible) task, see also [23].

B. SEGMENTED CROSS-VALIDATION

1) Datasets
In the following we will demonstrate the use of segmented cross-validation with L_2 regularisation for three datasets:
1) Raman spectra of fish oil [24]. The dataset consists of 42 sample segments including 3 replicate spectra of each unique sample, giving a total of 126 rows and 2801 wavenumbers in the range 3200 cm⁻¹ to 400 cm⁻¹. The response variable was the iodine value (the response values were identical across each segment), which is frequently used as an indicator of the degree of unsaturation of fat [24]. The spectra of this dataset are plotted in Figure 6 after applying Extended Multiplicative Signal Correction (EMSC) [25] with 6th-order polynomial baseline correction.
2) Fourier transform infrared (FTIR) spectra of hydrolysates from various mixtures of rest raw materials and enzymes [26]. The dataset consists of 332 samples including 1 to 12 replicates of each unique sample, giving a total of 885 rows and 571 wavenumbers in the range 1800 cm⁻¹ to 700 cm⁻¹. The response variable was the average molecular weight (AMW) (identical across each replicate set), which can be used as a proxy for the degree of hydrolysation. The spectra of this dataset are plotted in Figure 7.
3) Raman milk spectra [27]-[29]. The dataset consists of 232 unique sample segments including between 6 and 12 replicate measurements of each unique sample, giving a total of 2682 rows and 2981 wavenumbers in the range 3100 cm⁻¹ to 120 cm⁻¹. The response variables were the iodine value and the concentration of conjugated linoleic acid (CLA). Also for this dataset, the response values were identical across each segment. The spectra of this dataset are plotted in Figure 8 after applying EMSC with 6th-order polynomial baseline correction.

For all datasets, we have excluded the endpoint regions of the original spectra due to noise and the poor quality of the measurements. The wavenumbers reported above are those remaining after this truncation. Approximately 2/3 of the replicate segments were used for model building and selection, and the remaining 1/3 of the segments were used as a test set.
The following four model selection strategies were considered: (i) PRESS_min, the minimum PRESS(λ)-value from LooCV (ignoring the presence of sample segments), (ii) GCV_min, the minimum GCV(λ)-value, (iii) the PRESS_min from the SegCV (successively holding out entire sample segments), and (iv) the PRESS_min from the VirCV. We have chosen to focus only on the parameter selections associated with the minima of the various error curves in this part of our study (neither the χ²-rule nor the 1 S.E. rule turned out to affect the model selections much). Neither of the two strategies for quicker estimation of PRESS-values is shown in the plots, as the minimum PRESS-value (from searching) coincides with the minimum PRESS-value from the 1000 sampled λ-values, and the cubic spline interpolation is visually indistinguishable from the full PRESS-curve obtained from explicit segment removal.
2) Fish data - effect of pre-processing

Spectroscopic measurements may be corrupted by both additive and multiplicative types of noise. Pre-processing of such data prior to modelling is therefore usually required. It is of particular interest, then, to investigate how the model selection strategies considered above compare for pre-processed data. In particular, we will consider the Extended Multiplicative Signal Correction (EMSC) [25] with replicate corrections [30].
In general, the goal of the EMSC pre-processing is to adjust all the measured spectra to a common scale and to eliminate the possible effects of additive noise. This includes the estimation of an individual scaling constant for each spectrum and an orthogonalisation step that de-trends the spectra with respect to some set of lower-order polynomial trends (the reader is referred to the provided references for the technical details). In the present examples with Raman spectra, the samples were orthogonalised with respect to the subspace including all polynomial trends up to the 6th degree.
The Raman spectra of the fish samples were subjected to EMSC pre-processing to compensate for different scaling and competing phenomena such as fluorescence and optical/scattering effects in the equipment and samples. For the milk data, the spectrum having the least fluorescence background was chosen as a reference, though the effect of the choice of reference spectrum is minimal.
For datasets including segments of replicated measurements, a replicate correction step is often considered to alleviate the presence of inter-replicate variance. Such correction can be done by an initial EMSC-based pre-processing of the spectra in each sample segment. Thereafter, the corrected sample segments can be individually mean-centred and organised into a full data matrix.
As we expect the dominant right singular vectors of the full matrix to account for the most dominant inter-replicate variance, orthogonalisation of the data with respect to one or more of the associated dimensions contributes to making the replicates more similar, see [30] for details. Because every sample in the training dataset is included in the pre-processing, some bias affecting the subsequent PRESS-calculations and model selection must be expected.
Figure 9 shows the model selection for pre-processed fish oil data based on the pure EMSC and on the EMSC where 30% of the inter-replicate variance is removed. It is evident that the SegCV and the VirCV become considerably more similar in the latter case. As one should expect, the GCV- and PRESS-curves based on the LooCV seem to provide unrealistically low error values and the selection of less regularised models. This phenomenon does not occur with the SegCV, where an entire segment of replicates is held out in each cross-validation step. The VirCV seems quite robust against the inter-replicate variance. The prediction results for the test set of the fish oil data with the various pre-processing alternatives are presented in Table 5, and show that the best results are obtained with the ordinary EMSC pre-processing and model selection based on the SegCV. By simultaneously considering Figure 9, it is clear that the more heavily regularised among the selected models (those based on the largest regularisation parameter values) perform better on the test set. With standard EMSC pre-processing, the minimum of the VirCV is located at a smaller regularisation parameter value than for the SegCV, suggesting an explanation of the difference in predictive performance.
For the milk data, the prediction error estimates obtained after pre-processing the data are similar for all the parameter selection methods (table omitted), as was also the case with the raw data.

3) Hydrolysis data - heterogeneous segments
The hydrolysis data are used as an example of a model comparison that is often performed using 5-fold or 10-fold segmented cross-validation. For the FTIR data, we have chosen a 5-fold strategy where replicates are kept together inside each fold to prevent information leakage caused by replicates of the same sample appearing in both training and test data. The resulting cross-validation segments vary in size from 103 to 117 samples each, due to the present replicate sets. We have chosen to combine this with a 2nd derivative regularisation.
In Figure 10, we have plotted the PRESS-curves for SegCV, VirCV, LooCV and GCV. For these highly heterogeneous cross-validation segments, the virtual cross-validation strategy coincides with the GCV, both underestimating the prediction errors. The LooCV also underestimates the errors, but less so. Since the general forms of the PRESS-curves are quite similar, the minimum PRESS-values are located quite close together, suggesting that for the FTIR dataset, any of the strategies will give a reasonable estimate of the optimal λ-value. As Table 6 suggests, the performances when applying the regressions corresponding to the minimal PRESS-values on the test data are also similar, with a slight advantage to the more regularised LooCV solution.

4) Milk data - efficiency with many segments
The milk data is an example of relatively many samples (2682) and replicate groups (232), which can be challenging with regard to computational resources when cross-validating over a large range of λ-values. As can be observed from Figure 11, the differences between SegCV, VirCV, LooCV and GCV are small, both with regard to the shape of the curves and the location of the respective minimum values. This is due to the low variation between samples within each replicate group, in sharp contrast to the FTIR dataset with its highly heterogeneous cross-validation segments. Of more interest is the time usage of the various strategies, which is summarised in Section IV-B5 below.

5) Approximations of PRESS-values - computational speed
Table 7 shows the computational times for the different model selection strategies. Both the PRESS- and the GCV-values are included, as computing only one of them takes approximately the same time as computing both. Because the sizes of the replicate segments are relatively small for the Raman datasets (3 replicate measurements for the fish oil data and 6 to 12 replicate measurements for the milk data), the SVDs required for the internal orthogonalisations of the segments contribute insignificantly to the total computational load. The amount of computation required for model selection based on the VirCV is therefore quite comparable to the computations required for the LooCV version of PRESS (and for the GCV). The strategy of searching for the minimum PRESS-value by golden section search and parabolic interpolation (MinSearch) is remarkably similar to the VirCV in time usage. However, there is a trade-off between obtaining an estimate of the exact minimum value (MinSearch) and a full PRESS-curve (VirCV). Approximation of the SegCV using spline interpolation is slower than VirCV and MinSearch, but still sufficiently fast for practical use in all tested cases, and with the advantage of giving a PRESS-curve highly similar to the one obtained by the SegCV. The implicit segmented cross-validation (ImpCV) using Theorem B is faster than SegCV for small segments and a bit slower for large segments, though still fast enough to provide exact results for all λ-values. In general, the initial calculation of the SVD seems to be the main limiting factor in computational speed when the datasets grow in size. This is especially prominent for the milk data, where SegCV performs this initial SVD 232 times. Here, a strategy avoiding the SVD or using a randomised SVD algorithm [31] might be favourable; however, the other presented strategies are still usable.

V. DISCUSSION AND CONCLUSIONS
The essence of the TR-framework described in the present work is that just a single SVD-calculation (of either the original data matrix X or a transformed version X̃) is required to explore a particular regularised regression problem of interest. We have pointed out that the PRESS- and GCV-values required for model selection(s) based on the LooCV or the GCV can be obtained at the computational cost of two matrix-vector multiplications for each choice of the regularisation parameter value λ. In the applications section, it is demonstrated that our framework scales well when increasing the number of candidate regularisation parameter values in the case of 'small n, large p' problems. This scaling will also work well for problems involving multiple responses, as most of the computations will be shared among the responses. For smaller and medium-sized data, as well as for other situations where the required SVD can be calculated (or approximated) reasonably fast, the acquired computational efficiency allows for the exploration of a large number of candidate models in a very short amount of time.
For situations where leave-one-out cross-validation underestimates the validation error because of sample replicates or other groupings of samples, segmented cross-validation is the appropriate choice. We have proved a theorem saying that explicit remodelling for the computation of cross-validated PRESS-values can be avoided, while still giving exact results, at the computational cost of inverting one matrix per sample segment per λ-value. For cases where this cost outweighs the benefits, we have proposed alternative strategies for reducing the number of inversions through careful selection of λ-values, as well as an approximate virtual cross-validation (VirCV) strategy. The VirCV is a computationally efficient approximation of the traditional SegCV. In the applications (Section IV) we observed that the VirCV approximation of the SegCV appears to be quite accurate for model selection in the case of highly similar samples within each segment, while using the LooCV or GCV in such situations is more likely to propose insufficient regularisation and models that predict more poorly.
It is important to note that when the dataset is pre-processed and/or transformed by a data-dependent method, some bias in both the LooCV- and VirCV-based PRESS-values must be expected. The variable standardisation commonly used in RR is a typical example. The EMSC pre-processing that was used with or without replicate corrections is another. However, the main purpose of the LooCV- and VirCV-based PRESS-values in the proposed framework is model selection rather than error estimation. The bias introduced by such pre-processing methods is therefore not likely to be very harmful as long as the (training) data does not contain serious outliers.
Although leverage correction of the model residuals for fast calculation of the LooCV in linear least squares regression problems is well known, there are some misleading assertions in the literature regarding both the properties and the accuracy of PRESS-values that require clarification: i) Hansen [11, page 96] claims that the leverage values are not invariant under row permutations of the X-data, making the PRESS-values dependent on the ordering of the data. However, when the rows of the data matrix are permuted, it can be verified that the leverage values are unchanged and undergo precisely the same permutation. Consequently, the correct leverage values will match up perfectly with the corresponding model residuals in the PRESS(λ) calculations, assuring invariance under any row permutation of the (X, y)-data. ii) Myers [32, page 399] claims that the expression for fast calculation of PRESS(λ) is only an approximation when performing centring and scaling of the data. This is, however, only true when the scaling factors are calculated from the data to be used in the model building. The data centring, as such, does not corrupt the leverage- and PRESS(λ)-values as long as the 1/n terms are included in the associated leverage corrections of the model residuals. iii) The version of Ridge regression implemented in the MASS package [33] for the R programming language includes a fast calculation of the GCV(λ)-values for a desired vector of corresponding λ-values. The 1/n term is, however, ignored when correcting the model residuals by the required averaged leverage value. Consequently, the resulting GCV-values are misleading when the centring of the data is included as a part of the Ridge Regression modelling.
We believe that future statistical texts and software dealing with Ridge Regression (and Tikhonov Regularisation) will find value in including the necessary pieces of linear algebra (in particular the simple matrix-vector multiplications of Equation (27)) to establish the fast calculation of the PRESS(λ) in Equation (26). In our opinion, these relatively simple but still powerful results demonstrate yet another remarkable consequence of the SVD at the core of applied multivariate data analysis.
Finally, we have established a theorem describing how to compute the cross-validated residuals for (regularised) linear regression models from the fitted-value residuals. The computation can be seen as a multi-sample kind of leverage correction that applies to any type of segmented cross-validation strategy. In many cases, it represents a computationally efficient alternative to the computationally slower "hold out/remodelling" approach most common within statistics and machine learning. For the special case of LooCV, our theorem simplifies to the well-known scalar leverage correction calculations of the LooCV errors.

Equation (44) shows that we can calculate the prediction residuals for a cross-validation sample segment of size k at the cost of inverting the (k × k) matrix [I_k − H_cv], followed by a matrix-vector multiplication with the fitted residuals r_cv. The case k = 1 corresponds to leave-one-out cross-validation (LooCV), where the prediction residual calculations reduce to scalar operations:

Corollary (LooCV)
The prediction residual r_(i) when holding out the i-th sample (x_i, y_i) from the modelling is

r_(i) = r_i/(1 − h_i),

where r_i = y_i − x_iβ is the fitted residual and

h_i = x_i(X′X)⁻¹x′_i.

Here, h_i is the i-th diagonal element of the projection matrix H = X(X′X)⁻¹X′ (projection onto the column space of X).

APPENDIX C THE TIKHONOV L_2-REGULARISATION FRAMEWORK
Tikhonov [10] noted that it is straightforward to generalise the above L_2 regularisation of b to more specialised types of regularisation through a corresponding regularisation matrix L. These cases are expressed in terms of identifying the minimising solution of the bi-objective least squares problem

∥y − Xb∥² + λ∥Lb∥²  (47)

for some fixed λ > 0. The minimisation of Equation (47) with respect to b can be obtained by considering the augmented data X_{L,λ} = [X ; √λL] and y_0 = [y ; 0], and solving the normal equations

X′_{L,λ}X_{L,λ}b = X′_{L,λ}y_0.  (48)

To avoid technical distractions, we will in the following restrict our attention to the cases of square and non-singular regularisation matrices L (even for situations where a non-square regularisation matrix is the immediate choice, a non-singular (p × p)-alternative that provides a good approximation is often available). By defining X̃ = XL⁻¹, the solution of the OLS problem in (48) is equivalent to finding the unique OLS-solution β_λ of the transformed problem X̃_λβ = y_0, where X̃_λ = X_{L,λ}L⁻¹ = [X̃ ; √λI] and β = Lb. The associated expression minimised by β_λ is

∥y − X̃β∥² + λ∥β∥²,  (49)

i.e., a problem in the standard form (5), and the minimising solution b_λ of the original problem (47) is obtained by

b_λ = L⁻¹β_λ.  (50)

Among the many useful choices for the regularisation matrix L are the following: 1) diagonal scaling (e.g., the standardisation of variables often advised for RR applications): L = diag(σ̂_1, ..., σ̂_p), where σ̂_i approximates the standard deviation of the i-th variable (1 ≤ i ≤ p);
2) a (full) rank p discrete first-derivative approximation:

L_1 = [D_1 ; ϵc_1 1′],

where D_1 ∈ R^{(p−1)×p} has rows of the form (..., −1, 1, ...) implementing first-order differences; 3) a (full) rank p discrete second-derivative approximation:

L_2 = [D_2 ; ϵc_1 1′ ; ϵc_2 t′],

where D_2 ∈ R^{(p−2)×p} has rows of the form (..., 1, −2, 1, ...) implementing second-order differences, and t is the discretised linear trend. The alternatives L_1 and L_2 are relevant for problems where the X-data are associated with discretised (uniform) sampling of continuous signals, so that some smoothness in the solution candidates b_λ is reasonable. The two last rows in L_2 (and the last row in L_1) above are scaled versions of the discretised and normalised Legendre polynomials [34] of order 0 and 1, respectively (c_1 and c_2 represent the normalisation constants, and ϵ > 0 is a scaling factor to be commented on below). It should be noted that both these row vectors are orthogonal to all the rows above them in the derivative matrices where they appear.
The main purpose of the included Legendre vectors in these regularisation matrices is to ensure full rank of the regularisation matrices. Appropriate regularisation of the solutions b_λ may be obtained by choosing the fixed scaling factor ϵ > 0 to be
• either sufficiently large to make b_λ practically orthogonal to the subspace spanned by the Legendre vectors, or
• sufficiently small to inhibit any notable penalisation with respect to the same Legendre vectors.
The choice of ϵ in the last case can therefore not be made arbitrarily small in practice. It must be chosen large enough to avoid numerical difficulties in the computations of X̃ and b_λ. Alternative (non-invertible) differentiation matrix candidates taking various boundary conditions into account are described in [11].
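A sketch of one possible MATLAB construction of the full-rank derivative matrices described above (our own construction following the description; the value of ϵ matches the one used in the applications of Section IV):

% MATLAB sketch: full-rank discrete derivative regularisation matrices
% with appended, epsilon-scaled Legendre rows (our own construction).
p    = 200;
eps0 = 1e-10;                              % scaling factor for appended rows
t    = linspace(-1, 1, p);
P0   = ones(1, p) / sqrt(p);               % normalised order-0 (constant) row
P1   = t / norm(t);                        % normalised order-1 (linear) row
L1 = [diff(eye(p)); eps0 * P0];            % (p x p), rows (..., -1, 1, ...)
L2 = [diff(eye(p), 2); eps0 * P0; eps0 * P1];  % (p x p), rows (..., 1, -2, 1, ...)
X = randn(100, p);                         % toy data matrix
Xtil = X / L2;                             % X*inv(L_2) via mrdivide; note that a
                                           % small eps0 makes L_2 ill-conditioned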

APPENDIX D SPECIAL SITUATION FOR SEGMENT DECOMPOSITION IN VIRCV
In the following we will examine the proposed VirCV strategy more closely for three different situations: a) segments of identical rows, b) segments of collinear rows, and c) the general case (segments with no particular structure in the rows).

Identical rows:
Let us assume that all the rows of a segment X_k (1 ≤ k ≤ K) are identical. In this particular case, the PRESS-function associated with the VirCV is identical to the PRESS-function obtained by the SegCV.
The identity can be derived by noting that the left-multiplication of the left- and right-hand sides of a linear system by an orthogonal matrix affects neither the least squares solution nor the norm of the associated residual vector. Consequently, the SegCV strategy applied to the two systems (31) and (32) will result in identical PRESS-functions. With all rows within each segment X_k ∈ R^{n_k×p} being identical to its first row (denoted x_{k,1}), it is straightforward to verify that X_k has only one non-zero singular value, s_{k,1} = √(n_k x_{k,1}x′_{k,1}), and the corresponding left and right singular vectors are

u_{k,1} = (1/√n_k)1 and v_{k,1} = x′_{k,1}/∥x_{k,1}∥.  (51)

By the orthogonality requirements of the SVD, any other left singular vector u must satisfy u′u_{k,1} = 0. Consequently,

u′X_k = (u′1)x_{k,1} = 0′,  (52)

meaning that there will be only one non-zero row in each segment on the left-hand side of the T-transformed system (32). It is therefore sufficient to demonstrate that the PRESS-functions obtained from applying the SegCV and the LooCV to the system in (32) are equal: Clearly, for any row containing just zeros on the left-hand side of (32), the prediction based on it is trivially identical to 0 (zero) for either of the cross-validation strategies (regardless of the regression coefficients). Because such zero rows do not contribute to the calculation of the regression coefficients, we conclude that the regression coefficients obtained by holding out the (only) non-zero row of a segment must be equal to the regression coefficients obtained by holding out the entire segment. Thus the predicted values for the non-zero row in each segment must also be identical for both cross-validation strategies, and we can conclude that the PRESS-functions obtained by the SegCV and the VirCV strategies must be identical.

Collinear (proportional) rows:
One might expect the same result to hold when the rows within a segment are proportional. This is, however, not the case with the modelling strategy described above. The reason is that the inclusion of a constant term will make each of the K segments a rank-2 rather than a rank-1 submatrix. With more than one non-zero row on the left-hand side in each segment, the argument of the previous situation fails, and doing LooCV on the transformed data is no longer equivalent to doing SegCV on the original data. However, when omitting the constant term from the modelling, each of the K segments has rank 1, and the SegCV and VirCV approaches will result in identical PRESS(λ)-functions. The rigorous explanation is similar to the argument given for the situation with identical rows.

FIGURE 1. Flow chart illustrating the LooCV, segmented CV, and VirCV. Most of the steps are common to all the algorithms. For the virtual CV we calculate local SVDs for each segment and left multiply by the transposed left-singular vectors of the segments prior to applying the regularisation matrix (if any). Detailed calculations and minor differences between the algorithms, such as the modified leverage correction for VirCV, are not shown.

FIGURE 2. Octane data. PRESS/n and GCV/n for a range of regularisation parameter values and different regularisation matrices. Top: L_2 regularisation. Middle: 1st derivative regularisation. Bottom: 2nd derivative regularisation. The minimum PRESS and GCV values have been marked, as well as the regularisation parameter values selected by the 1 S.E. rule and the χ²-rule.

FIGURE 3. Pork fat data and SFA response. PRESS/n and GCV/n for a range of regularisation parameter values and different regularisation matrices. Top: L_2 regularisation. Middle: 1st derivative regularisation. Bottom: 2nd derivative regularisation. The minimum PRESS and GCV values have been marked, as well as the regularisation parameter values selected by the 1 S.E. rule and the χ²-rule.

FIGURE 5. Octane data. Top: Plot of the NIR spectra of octane. Bottom: PRESS-minimal regression coefficients based on different regularisation matrices.

FIGURE 6. Plot of the fish oil spectra after pre-processing with EMSC with 6th-order polynomial baseline (top) and additional replicate correction (bottom).

FIGURE 7. Plot of the hydrolysis spectra after pre-processing with EMSC with 2nd-order polynomial baseline.

FIGURE 8. Plot of the milk spectra after pre-processing with EMSC with 6th-order polynomial baseline. Noise in some replicates is clearly visible as spikes around the main variation.

FIGURE 9. Fish oil data. Model selection for data pre-processed with the EMSC, both with and without replicate correction. Top: Standard EMSC pre-processing. Bottom: EMSC with 30% of the inter-replicate variance removed.

FIGURE 10. Hydrolysis data. Different model selection strategies for a range of regularisation parameter values using 2nd derivative regularisation.

FIGURE 11. Milk data. Different model selection strategies for a range of regularisation parameter values using L_2 regularisation. Top: CLA. Bottom: Iodine value.

TABLE 1. Octane data. MSE (from test data) using various regularisation types and parameter selection methods.

TABLE 2. Pork fat data. MSE (from test data) for the SFA response using various regularisation types and parameter selection methods.
FIGURE 4. Prostate data. PRESS/n and GCV/n for a range of regularisation parameter values using L_2 regularisation. The minimum PRESS and GCV values have been marked, as well as the regularisation parameter values selected by the 1 S.E. rule and the χ²-rule.

TABLE 3. Prostate data. Percentage of correctly classified (PCC) samples using the test set predictions of the selected 0-1 dummy regression model based on L_2 regularisation.

Table 4 shows the computing times for model selection, including finding the GCV- and PRESS-values for all responses. The main differences in computational time between finding the SVD in the case of L_2 regularisation and in the cases of first- and second-derivative regularisation are due to the initial calculations of X̃, see Appendix C. Similarly, the required transformation of the regression coefficients (see (50)) explains the increase in computational time from calculating the SVD only to finding PRESS, GCV and regression coefficients for a single regularisation parameter value for the first and second derivative regularisations.

TABLE 4. Computing time (in seconds) for model selection including finding the PRESS- and GCV-minimal regression coefficients when varying the number of candidate regularisation parameter values. The times are the averages of 50 repeated runs, rounded to the two most significant digits.

TABLE 5. Fish oil data. MSE (from test data) for different model selection strategies and different pre-processing alternatives.

TABLE 6. Hydrolysis data. MSE (from test data) using EMSC for pre-processing.

TABLE 7. Computational time for different model selection strategies for the Fish oil data, Hydrolysis data and Milk data when considering 500 candidate regularisation parameter values. The times are given in seconds, rounded to two significant digits, and are the averages of 50 repeated runs. The speedup relative to SegCV is shown in parentheses.