Linear Model Selection and Regularization for Serum Prostate-Specific Antigen Prediction of Patients With Prostate Cancer Using R

Prostate cancer is the commonly diagnosed cancer worldwide, and there were 1,276 thousand new prostate cancer cases and 359 thousand deaths in 2018. Prostate-specific antigen (PSA) blood level is often elevated in men with prostate cancer, so PSA testing can detect prostate tumours when they are small, low-grade, and localized. The PSA testing is hard to apply on the less developed and poor areas without sufficient medical funds, so the early accurate PSA level prediction by statistical machine learning models is significant to avoid later stages of prostate cancer that spread outside the Prostate. In this literature, we compare three linear model selection and regularization methods (shrinkage, subset selection, dimension reduction) and nine candidate models (OLS regression, Ridge regression, Lasso regression, Elastic net, best subset selection, forward subset selection, backward subset selection, PCR, PLS) based on leave-one-out-cross-validation (LOOCV) prediction error. As the selection criteria leave-one-out cross-validation is sensitive to outliers, Mahalanobis distance is used for outlier detection and deletion before running each model. The shrinkage method (only lasso and elastic net models) and subset selection method (based on adjusted $R^{2}$ , BIC, Cp, and cross-validation prediction error) can select the variables out. The feature selection results show that prostate weight, cancer volume, amount of benign prostatic hyperplasia, and whether seminal vesicle invasion is necessary variables must include predicting PSA. Age and capsular penetration are the least important variables. The variables of Gleason score, a percent of Gleason scores 4 or 5 are essential sometimes. All the diagnostic figures and results are coded by R, open access, and published on IEEE Xplore Code Ocean.


I. INTRODUCTION
An adenocarcinoma is a type of cancer that arises in the cells of glands. Most prostate gland cells are of the glandular type, so adenocarcinoma is the most common cancer type in the prostate [1]. Prostate cancer is a commonly diagnosed cancer worldwide and crucial challenges in developed and developing countries [2]. Statistical results show that there were 1,276 thousand new prostate cancer cases and 359 thousand deaths in 2018 [3]. However, the prognosis for prostate cancer is relatively good if it is detected early. Prostatespecific antigen (PSA) is a protein that is produced in the glandular epithelium of the prostate. It is secreted into the The associate editor coordinating the review of this manuscript and approving it for publication was Gina Tourassi. prostatic acini lumen and has an important physiological role in prostatic fluid [4]. As The blood level of Prostate-specific antigen is often elevated in men with prostate cancer [5], PSA has profoundly affected the diagnosis and treatment of prostate cancer, and prostate tumours can be detected by PSA testing when they are small, low-grade, and localized [6]. Some doctors and professional organizations suggested that a man has to take PSA examinations every year from age 50 [7], but most medical testing, including PSA testing, is hard to apply on the less developed and poor areas without sufficient medical funds [8], so the early accurate PSA prediction by statistical learning models is meaningful to avoid later stages of prostate cancer that spread outside the prostate. In recent years, plenty of machine learning classification models have been proposed and published for Prostate VOLUME 9, 2021 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ cancer monitoring as well as detection, such as decision tree (DT) [9], Logistic Regression (LG) [10], and Random Forest [11]. Besides, some literature has been proposed computer-vision models to predict and detect Prostate cancer, such as Fully Convolutional Neutral Network (FCNN) [12], Near-infrared (NIRF) [13], and Artificial Neural Networks System (ANNS) [14]. However, few papers predict Prostatespecific antigen for Patients with Adenocarcinoma of the Prostate.
We compare three linear model selection and regularization methods (subset selection, dimension reduction, and shrinkage) and eight models (forward selection, backward selection, exhaustive selection, PLS, PCR, ridge, lasso, and elastic net) to find their optimal tuning parameters and leaveone-out cross-validation prediction error from the pre-process prostate cancer dataset. Besides, lasso, elastic net, and subset selection did variable selection to choose the necessary three predictors and not important predictors for PSA prediction. All the diagnostic figures and results are coded by R, open access, and IEEE Xplore Code Ocean. The framework of the remaining paper is as follows: Section II describes the dataset and proposed methodology. Finally, the experimental results are reported with the interpretation and concluded with a discussion in section III.

II. DATA AND METHODS
This section focuses on the data and methodology used for the literature. Subsections II-A, II-B, and II-C respectively explain the dataset, background of a PSA test and proposed framework.

A. DATA DESCRIPTION
The linear Machine Learning models were trained and tested on a pubic source prostate-specific antigen dataset [15] of 97 male patients before radical prostatectomy. The surgical operation that removes the entire prostate gland along with some surrounding tissue. Table 1 shows the descriptions and brief statistical summary of the attributes, where the set of svi and gleason are discrete numerical variables only with two and four values. The Gleason grading system divides the two largest tumour areas in a tissue sample into (1-5) levels, where 1 is the least aggressive and 5 is the most aggressive, then add these two levels together to get a Gleason score. The BPH and capsular penetration variables originally contained zeros, and a small number was substituted before the log transform was taken. The original paper did not declare why the log transform was taken though PSA varies over a wide range, probably aiming for the variable's linearity. Besides, it is also not clear why the variable pgg45 was constructed. Fig. 1 shows the bar plots for gleason and svi and densityhistogram plots for other variables. Fig. 2 shows the density correlation plots and Pearson correlation, calculated as (1), for the other six variables.
The left-bottom correlation plot between a log of cancer volume and a prostate-specific antigen log shows a high positive Pearson correlation of 0.734. Besides, a log of prostate weight, a capsular penetration log, and percent of Gleason scores 4 or 5 are also positively correlated with a log of prostate-specific antigen with Pearson correlation equal to 0.433, 0.549, and 0.422. Fig. 3 shows the box plots of variables of svi and gleason with outliers. The outliers are data points x that do not fall within the distances as (2).
B. PSA TESTING Prostate-specific antigen is a protein generated by normal and malignant prostate cells. In the PSA test, the laboratory analysis blood samples usually reported nanograms of PSA per mL of blood. In the past, most doctors thought that PSA levels of 4.0 ng/mL or lower were normal [16]. However, recent studies have shown that some men with PSA levels below 4.0 ng/mL have prostate cancer, while many men with higher PSA levels do not have prostate cancer [17]. For instance, if a person has prostatitis or a urinary tract infection, his PSA level will usually rise. In contrast, some drugs used to treat benign prostatic hyperplasia can reduce men's PSA levels, such as finasteride and dutasteride [18]. However, generally speaking, the higher a person's PSA level, the more likely he is to develop prostate cancer.

C. PROPOSED FRAMEWORK
In this literature, the proposed framework has been illustrated in Fig. 4. Transfer numerical variables to Factor variables, data standardization, and outlier detection are applied in data pre-processing. There are ten candidate models (OLS regression, Ridge regression, Lasso regression, Elastic net, best subset selection, forward subset selection, backward subset selection, PCR, and PLS) using to find the lowest LOOCV prediction error. Subset selection methods using adjusted R 2 , BIC, and Cp to eliminate unnecessary variables do not follow this framework.

1) FACTOR VARIABLES AND DATA STANDARDIZATION
In Fig. 3, the box plots show prostate-specific antigen levels are significantly different for patients whether seminal vesicle invasion (p-value < 0.001) and a Gleason score greater than 6 (p-value < 0.001). Therefore, the Gleason score is classified by the cut-off level of a Gleason score greater than 6. These two variables are treated as factor variables. Standardization is the concept and step of scaling and transforming to equal each feature's equal contribution. As formulation (3), Z-score normalization is one of the standardization techniques for achieving standard normal distribution with zero mean and unit variance.
The variables of lcavol, lweight, age, lbph, lcp, and pgg45 are treated as numerical variables and did the Z-score normalization.

2) MAHALANOBIS DISTANCE TO DETECT OUTLIER
Mahalanobis distance [19], introduced by Prof. P. C Mahalanobis in 1936, is a reasonable multivariate distance metric that measures the distance between a point and a distribution. Mahalanobis distance calculates the distance between two points by considering a covariance factor that is a difference between that and Euclidean distance. The Mahalanobis distance between two points p 1 and p 2 is presented as (4), where S is the covariance of multivariate data X .
The Mahalanobis distance is an effective way to find outliers for multivariate data [20]. The idea is to calculate the Mahalanobis distance between each point and centre that can be chosen as a mean value of multivariate data as (5). where When n is relatively large and X is a k-dimensional Gaussian random vector with mean vector µ and rank k covariance matrix C, D 2 i follows χ 2 k , proved in [21]. Based on VOLUME 9, 2021 FIGURE 2. The density correlation plots and pearson correlation for numerical variables of lcavol, lweight, age, lbph, pgg45, lcp, and lpsa. a chosen critical p-value αwith its critical value, it is 1 − α confident that the i-th observation is an outlier if

3) OLS REGRESSION
The loss function of ordinary least squares regression is finding the plane that minimizes the sum of squared errors (SSE) between the observed and predicted response (6).
The OLS performance depends on the key assumptions of OLS regression: • No or little multicollinearity • There are more observations (n) than features (p) • Homoscedastic (constant variance in residuals) • No autocorrelation • Multivariate normality • Linear relationship However, the number of features (p) is large for many reallife data sets. The OLS assumptions are easy to be violated for large p; there are three classical methods (shrinkage, dimension reduction, subset selection) to solve the large features.

4) SUBSET SELECTION METHOD
The first step of best subset selection is to fit a separate least squares regression for each possible combination of the p predictors, and then identifying the best subset of each number of predictors from 1, . . . , p. The last step is to find the optimal number of predictors among the best subset using adjusted R 2 , BIC, C p , or cross-validation prediction error. The detail is shown in Algorithm 1.
As Algorithm 2, stepwise forward selection starts with an empty set of attributes as the minimal set. The most relevant attribute is chosen (having minimum p-value or maximum SSR) and added to the minimal set. As Algorithm 3, stepwise backward elimination is initial with all the attributes; in each iteration, one attribute is eliminated from the set of attributes whose p-value is highest or SSR is lowest.
Best subset selection and stepwise selection, each of which contains a subset of the p-predictors, are used to create a set of models. Adjusted R 2 , BIC, Cp, and cross-validation prediction error are ways to determine which of these models is best. Choosing minimum test error estimation using the validation set or cross-validation is a direct method. Adjusted R 2 , BIC, and Cp indirectly estimate test error methods by adjusting to the training error to avoid overfitting. For a fitted least-squares model containing d predictors, the Cp [22], invented by C.L. Mallows, estimates test MSE (7).
The penalty of 2dσ 2 increases as the number of predictors d in the model increases, whereσ 2 estimates the variance of the error using the full model containing all predictors. The technique compares the full model with a smaller model with ''d'' parameters and determines how much error is left unexplained by the partial model. Various suggestions have been made about exactly how the statistic should be interpreted, but the general consensus is that smaller Cp values are better as they indicate smaller amounts of unexplained error. BIC [23] is derived from a Bayesian perspective, but the resulting formula, as (8), is similar to Cp, where n is a number of observations.
The R 2 is defined as 1 − RSS/TSS, whee TSS = (y i −ȳ) 2 . The R 2 increases when more variables are added because RSS always decreases as more variables including in the model. Adjusted R 2 [24] statistic (9), adds penalty on R 2 when increasing number of variables.

5) SHRINKAGE (REGULARIZATION) METHOD
Regularized regression methods' loss function is very similar to OLS regression; however, a penalty parameter (P) is added (10).
Regularized regression puts constraints on the magnitude of the coefficients and will progressively shrink the coefficients towards zero. This constraint reduces the magnitude and fluctuations of the coefficients and will reduce the variance of our model.
The penalty parameter is called L 2 because it means a second-order penalty is used on the coefficients. Tuning parameter λ controls the penalty parameter that can take on a wide range of values.
The full name of the Lasso model [26] is called as least absolute shrinkage and selection operator. L 1 penalty λ p j=1 |β j | in the loss function is used as (12).
Rather than ridge regression pushing variables to approximately but not equal to zero, the lasso penalty will actually shrinkage coefficients to zero, so the lasso model can improve the model with regularization and conduct automated variable selection.
The elastic net [27], which combines the L 1 and L 2 penalties, is a generalized ridge and lasso model (13).
The elastic net model is proposed for effective regularization via the ridge penalty and the lasso penalty's feature selection characteristics.

6) DIMENSION REDUCTION METHOD
Both subset selection and shrinkage methods are defined using the original predictors X 1 , . . . , X p . The dimension reduction method [28] transforms the predictors and then fit a least-squares model using the transformed variables. Let Z 1 , . . . , Z m represent linear combinations of the original p predictors, where M < p, Z m = p j=1 φ jm X j for some constants φ 1m , . . . , φ pm (m = 1, . . . , M ). Therefore, the linear regression model is fitted using the least-squares as (14) for i = 1, . . . , n.
Principal component analysis [29] is a popular unsupervised learning method for deriving a low-dimensional set of features from a large set of variables. The principal component regression (PCR) [30] involves constructing the first M principle components Z 1 , . . . Z M using PCA to reduce dimension from p to M, and then using these M components as the predictors in a linear regression model to fit optimal least square.
Partial least square regression (PLSR) [31] is a dimension reduction method like PCR. However, PLSR identifies the new features Z 1 , . . . , Z M in a supervised way. It uses response Y to identify new features that approximate the old features well and related to the response. In PLSR computing the first direction Z 1 = p j=1 φ j1 X j , PLSR places the highest weight on the variables that are most strongly related to the response. The residuals from regressing each variable on Z 1 are the remaining information that has not been explained by the first PLSR direction. And then, Z 2 was computed using the orthogonalized data in the same fashion of Z 1 based on the original data. Z 1 , . . . , Z M can be computed after M times repeating. The last step is to use optimal least squares to fit a linear model to predict Y using the new features Z 1 , . . . , Z M .

7) LEAVE-ONE-OUT CROSS VALIDATION
Leave-one-out cross-validation [32] is K-fold crossvalidation taken to its logical extreme, with K equal to N (the number of observations). It means that N separate times, the function approximator is trained on all the data except for one point, and a prediction is made for that point. Then the average error is computed and used to evaluate the model. The evaluation given by leave-one-out cross-validation error is great for a small sample dataset. Still, it is not friendly for a large sample dataset because it seems very expensive to compute.

III. RESULTS AND DISCUSSION
This section illustrated the results of Fig. 4. Mahalanobis distance result identifies outlier. The results of three linear model selection and regularization methods are detail explained with plots. In the end, we compare ten candidate models (OLS regression, Ridge regression, Lasso regression, Elastic net, best subset selection, forward subset selection, backward subset selection, PCR, PLS) and choose the best one based on leave-one-out-cross-validation (LOOCV) prediction error.

A. OUTLIER DETECTION BY MAHALANOBIS DISTANCE RESULT
For all 97 observations, the Mahalanobis distance D 2 i , i from 1 to 97, can be calculated as (5) by inputting meanx and covariance matrix S. As the S is rank 9 covariance matrix, D 2 i follows χ 2 df =9 . Set critical p-value αequals to 0.05 such as   coefficients when λ = 0 for the coefficient paths plots because there is no effect and the objective functions equals the normal OLS regression objective function of simply minimizing SSE. As λ → ∞, the penalty parameter becomes large and force coefficients of lasso and elastic net to exact zero but coefficients of the ridge to approximately but not equal to zero. The coefficients path plots illustrate the detail of how the largest λ values have pushed these coefficients to nearly 0.
In LOOCV MSE plots across the λ values for ridge and lasso, the plots show the MSE rise considerably when λ cross over the second vertical dashed lines for ridge and lasso. At the top of each LOOCV MSE plot, the numbers represent the number of variables in the model. As ridge regression does not force any variables to exact zero, all variables will remain in the model, and the top numbers are 8 across all the λ values. The upper and lower bar around the MSE results for each λ denote the MSE plus/minus its standard error. The first and second vertical dashed lines refer to the λ value with the minimum MSE and largest λvalue within one standard error minimum MSE. Adding one standard error to the minimum MSE value can get a more regularized model, a largest λvalue within one standard error minimum MSE is used to evaluate model performance.
The elastic net penalty has two tuning parameters: λ for the complexity and α for the compromise between LASSO and ridge. Fig. 6 is LOOCV prediction error ± one standard error of Elastic net α from 0 to 1 by 0.01. To find the best tuning parameter α, a tuning grid that searches across a range from 0 to 1 by 0.01 is created. Then, iterate over each α value and extract the minimum and one standard error MSE values and their respective λ. The blue label point shows that the VOLUME 9, 2021 FIGURE 8. Standardized coefficients path and LOOCV and adjusted LOOCV prediction error for different components using PLS and PCR methods. largest λvalue within one standard error minimum MSE for all α from 0 to 1 by 0.01 has minimum MSE when α equals to 0.47. Table 2 shows the detail of λ values with the minimum MSE and largest λ value within one standard error of the minimum MSE, LOOCV-MSE, standard error, and some variables shrinkage toward zero. Elastic net with α equals 0.47 > Lasso > Ridge based on minimum 1se PMSE compare. Besides, Lasso and Elastic net with α equals 0.47 models eliminated lcp and age variables.   Fig. 8 shows standardized coefficient estimates for different components and leave-one-out cross-validation MSE on preprocess Prostate data set using PCR and PLSR. The dimension reduction method (PCR and PLSR) can make feature selection but cannot select the variables out; the coefficients are toward zero but not equal to zero when the number of components decreases to one.

C. DIMENSION REDUCTION METHOD RESULT
To find the optimal tuning parameter M (number of components), ordinary CV estimate, and adjusted CV (bias-corrected CV estimate) are applied to find the minimum MSE. From the figures, it shows there is virtually no difference for LOOCV and adjusted LOOCV. The blue dots are the location of the minimum LOOCV MSE for PLSR and PCR models. Eight components minimize LOOCV MSE for PCR and PLSR methods to predict log of prostate-specific antigen. LOOCV MSE of PCR model with eight components (0.4859) is larger than that of seven components (0.5155) and six components (0.5094), and the second lowest LOOCV MSE is PCR model with five components (0.50197). LOOCV MSE of PLSR model with eight components (0.4859) is almost equal to that of seven (0.4860), six (0.4897), five (0.4904), four (0.4914), and three (0.4940). As mentioned in the dimension reduction methods introduction, the main practical difference between PCR and PLSR is that PCR often needs more components than PLSR to achieve the same prediction error. In this literature, LOOCV MSE of PLSR model with two variables (0.5061) is smaller than LOOCV MSE of PCR model with the variables that is less than eight.  Therefore, an optimal tuning parameter for PCR linear is eight-components that did not reduce the dimension. It is the same as OLS regression, but the independent variables did orthogonal transformation. It is hard to determine an optimal tuning parameter (number of components) for PLSR model. PLSR model with eight-components has minimum LOOCV MSE. However, choosing seven, six, or five components are reasonable as well.

D. SUBSET SELECTION METHOD RESULT
Use all 94 excluded 4 outlier data to build the regression subset model; the selected subsets are not the same for exhaustive, backward, and forward subset selection methods shown in Table 3, 4, and 5, but the selected four variables are the same in these three subset selection methods. They contain lcavol, lweight, svi, and lbph variables.     Table 3, 4, and 5 are not same, they will have different LOOCV MSE. The number of predictors with the lowest LOOCV MSE (0.47656) is equal to four among these three subset selection methods. The selected variables are lcavol, lweight, svi, and lbph. Fig. 10, 11, and 12 show the R 2 , adjusted R 2 , BIC, and Cp statistics of regression subset model for a different number of predictors among exhaustive, forward, and backward subset selection methods.R 2 statistic increases from 0.56 when only one variable is included in the model to 0.70 when all variables are included. As expected, the R 2 statistic increases monotonically as more variables are included, so R 2 statistic cannot evaluate the machine learning model's performance. The number of predictors in subsets is a tuning parameter. The best selected model for these three selection methods by BIC, Cp and adjusted R 2 are the same. The best performer using BIC method is four variables including in the linear regression model that are lcavol, lweight, lbph, and svi. Adjusted R 2 method suggests seven variables that only exclude gleason variable. Cp method selects five variables out that are lcavol,lweight, lbph, svi, and pgg45.
E. DISCUSSION Table 6 shows eight feature selection results by directly or indirectly calculating prediction error to eliminate variables based on the above experimental results, and Fig. 13 visualizes the variable selection results. The variables of lweight, lcaval, lbph, and svi are included all the times, and the variable of lcp and age are included only once by adjusted R 2 method. Therefore, prostate weight, cancer volume, amount of benign prostatic hyperplasia, and whether seminal vesicle invasion are necessary variables for prostate-specific antigen prediction. Age and capsular penetration are not important variables.

IV. CONCLUSION
Although the sample size is small in this experimental study, the data is collected precious and accurate. The Prostate specimens were subjected to detailed histological and morphometric analysis [15], and all 97 observations are effective without obvious outliers and extreme points. Besides, leave-one-out cross-validation is only efficient using a small sample size of data by standard computers, which can help reviewers and readers with different versions and R-language environments to reproduce the same figure and table results.
The paper discusses some important considerations for feature selection with a detailed R code. Besides, it proves prostate weight, cancer volume, amount of benign prostatic hyperplasia, and whether seminal vesicle invasion are necessary variables that must include predicting PSA. Age and capsular penetration are least important variables. The variables of Gleason score, a percent of Gleason scores 4 or 5 are important sometimes. Lastly, the less developed and poor areas are hard to apply PSA testing, so PSA prediction by statistical models is meaningful for them to detect prostate cancer early. (a) Consider all k models that contain all but one of the predictors in M k , for a total of k −1 predictors. (b) Choose the best among these k models, and call it M k−1 . 3) Select the best model among M 0 , . . . , M p by Adjusted R 2 , C p , BIC or cross-validation prediction error.