Regularizing the Deepsurv Network Using Projection Loss for Medical Risk Assessment

State-of-the-art deep survival prediction approaches expand network parameters to accommodate performance over a fine discretization of output time. For medical applications where data are limited, the regression-based Deepsurv approach is more advantageous because its continuous output design limits unnecessary network parameters. Despite the practical advantage, the typical network lacks control over the feature distribution causing the network to be more prone to noisy information and occasional poor prediction performance. We propose a novel projection loss as a regularizing objective to improve the time-to-event Deepsurv model. The loss formulation maximizes the lower bound of the multiple-correlation coefficient between the network’s features and the desired hazard value. Reducing the loss also theoretically lowers the upper bound on the likelihood of discordant pair and improves C-index performance. We observe superior performances and robustness of regularized Deepsurv over many state-of-the-art approaches in our experiments with five public medical datasets and two cross-cohort validation tasks.


I. INTRODUCTION
Survival analysis, also known as time-to-event analysis, is a crucial task in medical prognosis and risk assessment. Fundamentally, the objective is to create a predictive function mapping from relevant input data to time until event occurrence, or ''hitting time.'' A wide range of medical applications is enabled once the hitting times are predicted [14]. Unlike the typical regression tasks, the model has to account for right-censored outcomes (e.g., unknown time-to-death due to patients dropping out of studies.) The problem has been widely investigated in the Statistics community [18]. The typical approaches are developed from regression models with the additional consideration of outcome censoring.
The associate editor coordinating the review of this manuscript and approving it for publication was Wenbing Zhao . For example, Cox proportional hazard model (CPH) [5] and Survival Support Vector Machine (SVM) [20] are formulated from linear-based regressors, which allow applications of classical regularization and analysis, such as L-1 and L-2 norm, into the learning. Survival Tree [21] and Survival Forest [22] are early attempts to utilize non-linear regressors for the survival task.
Recent approaches utilize deep learning architectures for the prediction (''deep survival''). The introduction of deep survival networks not only opens opportunities for developing non-linear prediction models but also allows prediction using other data forms instead of handcrafted features, such as medical 3D-MRI scans, and sequence data for survival tasks [3], [7], [19]. Without the tedious work of defining relevant features, recent emphases of many works have shifted toward identifying relevant data and developments of suitable deep survival network architectures. Our study focuses on the development of deep survival approaches that have previously adapted to data forms without changing objective functions or introducing drastic preprocessing steps.
A prominent trend in developing deep survival approaches is discretizing possible survival outcomes for classification architecture in modeling hitting time distribution. The strategy trains deep networks to classify or maximize likelihoods corresponding to pre-defined intervals of discretized hitting time. Variations of the networks are differently characterized based on assumptions and outcomes about censored data. For example, partial-logistic regression [1] and NNet-survival [6] are feed-forward classification networks trained by increasing likelihood over output nodes corresponding to the period after censoring with the assumption that the unobserved event is likely to occur after the censored time point. The survival probability mass function (PMF) network [10] is trained to output a discretized cumulative density function of survival time instead of the likelihood of occurrence. DeepHit [12] is the state-of-the-art method that combines the discretized density training similar to that of the PMF with a multi-task network for the prediction. Notice that these discrete-time network can adapt to new input forms by simply changing the input layer architecture (e.g., Convolutional Neural Network [3], [40].) The drawback of this approach is the requirement of hitting time discretization, which inconveniently introduces a trade-off between the number of parameters and the granularity of output time. Specifically, fine-grain discretization helps the network distinguish cases with slight differences in hitting time at the price of more network parameters. Optimally discretizing the data for the deep-learning approach is non-trivial and requires consensus between engineers and medical experts. On the broader picture, the medical context often involves the lack of gold-standard outcomes and several difficulties in data acquisition, including privacy laws, lack of patient enrollment on research studies, and low frequency of events. Training a large-scale neural network under these data circumstances often causes overfitting and poor prediction performance. Unlike some other domains, solving the insufficient data using data augmentation is also limited as the learning from invalid data introduces risk in subsequent medical practice. The discretization leads to more difficulties in network design and training.
In Contrast to the discretization strategy, regression-based networks avoid unnecessary difficulties by modeling the hitting time density with continuous hazard output. By mapping the input to a single-dimension hazard value instead of a discrete outcome, the architecture development allows more flexible control over the number of parameters regardless of the survival outcome range, which is favorable in the smalldata medical context. Variations of the regression networks are defined on how the output values relate to survival likelihood. Deepsurv [9] is a well-established network under this approach with the assumption of proportional hazards such that the difference between survival likelihood for a given time is proportional to the difference in feature or hazard values. The model is developed with intuitions behind the Cox proportional hazard (CPH) model, a widely applied statistical method for survival analysis. The Cox-time network [11] improves this approach by dropping the CPH's proportionality assumption. Many survival prediction works adapted the regression approach for various health-related data [7], [19]. There are also hybrid regression approaches, such as Survival-net [3], that train the network to classify censored and uncensored data before using the features from penultimate layers for survival regressions. However, the regressionbased and the hybrid networks have inferior performance to many discretized networks, as reported in [6], [10], and [12].
Inspired by the advantages offered by the deep survival regression approach, we investigate the lack of feature distribution control during the regular Deepsurv training that exposes the network to suboptimal performance. Our contributions in this work are three-fold.
• Theoretical foundation to enhance network on the perspective of representation learning without scaling the number of network parameters.
• The projection loss regularization based on the theory. The regularization loss is applicable to survival prediction on various input forms implying that the proposed improvements are applicable to the other networks with similar organizations to Deepsurv, such as DeepCon-vSurv [19] networks, which we used in our experiments.
• Demonstration of the improvement generality through experiments on many expert-defined medical prognosis datasets derived from clinical outcomes, histology, cellularity, immune markers, and volumetric imaging. Unlike many previous deep survivals works, we present thorough qualitative and quantitative investigations covering both theorems and empirical experiments. We also set up experiments to mimic cross-cohort validations, a crucial practice in developing practical medical decision support systems. Our experiments also show that the regularized Deepsurv outperform discrete-time state-of-the-art baselines in the smalldata medical context.
It is noteworthy that the discrete-time and regression-based survival approaches focused in this study are not the entire spectrum of deep learning techniques for survival analysis. We consider approaches under a similar medical context and experimented with small-sample survival datasets. To consider applicable approaches to various data forms, we exclude networks that are difficult to adjust to new forms of data (e.g., focusing on specific modality [33] or specific architecture [34], and relying on time-varying information [35]) because there are increasingly diverse types of input for survival prediction. Because medical data are difficult to model, we also avoid approaches that attempt to model the distribution of data or distribution of outcome or outcome censoring (e.g., sampling from generative modeling [36], Perturbation [37], assuming distributions of outcome [38], and training with artificial target responses [39]) because data modeling and augmentation are difficult to validate and 8006 VOLUME 10, 2022 introduce risks of learning from invalid data. Deepsurv and our baselines under our investigations fit all these criteria. We discuss the advantages of Deepsurv over the other baselines in further sections.

II. PARAMETRIC SURVIVAL REGRESSION
Let {(I 1 , δ 1 , t 1 ) , (I 2 , δ 2 , t 2 ) , . . . , (I N , δ N , t N )} be the dataset for survival analysis where I i is input data for case i, t i is the latest elapsed time recorded for the case, and N is the total number of data cases. δ i indicates whether the outcome event of case i is observed where δ i = 0 means that the case outcome is censored after t i , and δ i = 1 means that the event takes place at time t i . The general goal is to model survival function S (t) = P(T ≥ t) where S(t) value is the probability of hitting time happening after time t. The model can be parameterized such that adjusting parameters gives flexibility in learning to predict new cases.

A. COX PROPORTIONAL HAZARD REGRESSION
One of the earliest methods for the parametric approach is Cox proportional hazard regression. Cox proportional hazard (CPH) model define S(t) as where X is a vector of input covariates or features extracted from I , and β is the vector of CPH regression parameters. h 0 (t) and H 0 (t) are baseline hazard function and cumulative baseline hazard function. h 0 (t) is often constructed using the Breslow estimator [13]. The cumulative hazard function is defined as Along with S (t), the method also defines hazard function h(t, X ) as Notice that taking a ratio between h (t, X 1 ) and h (t, X 2 ) cancels out h 0 (t), resulting in exp((X i − X j ) · β). This built-in model property without h 0 (t) is referred to as the proportional hazard assumption. Without hitting time censoring, β can be obtained with a least-squares approach explored in [8] by rearranging the target variable as a linear function with an X ·β term. With the censoring, however, the regression exploits the proportional hazard assumption and be solved by minimizing the negative log-partial-likelihood (NLPL) loss objective formed by all comparable pairs of X i and X j . The NLPL loss is defined as where R j = ∀i, t i > t j is a set of data case indices that are still at risk at time t j . Even though the loss itself is convex, it has no lower bound and causes β to be unnecessarily large or difficult to obtain after optimization runs, especially when the amount of data is less than that of parameters. Therefore, the training often reduces the loss with normbased regularizations, such as Ridge, Lasso, or their variations and combinations [23], to avoid the degenerate solution and to select suitable features for improving performance.
Regardless of regularization methods,β is a β estimate obtain from minimizing the loss. The loss prefers that the score term X i ·β should be higher than X j ·β corresponding to an event at the time later than t i . Minimizing the loss is the attempt to make all pairwise comparisons of X ·β as concordant with the risk ordering as possible. Specifically, the goal is to make X i ·β > X j ·β if t i < t j when case j is not censored. X ·β is often referred to as hazard score or risk value and used instead of h(t, X ) for performance evaluation. One performance measure of survival regression is the concordance index [17] or C-Index. For CPH, the measure can be calculated using X ·β as where P is the total number of comparable pairs in the dataset. i : δ i = 1 represents any index i belongs to the set where the δ i = 1 or not censored. j : t j > t i means any index j belongs to the set where recorded time t j > t i . I is an indicator function. The measure gauges the proportion of pairs with concordance between the event times and hazard scores, (i.e., the higher hazard score corresponds to the shorter event time within the pair.) Notice that the predicted hazard score X · β is subject to uncertainty and error. C value closer to 1.0 means less error such that the predicted score mostly following the expected ordering.

B. DEEPSURV NETWORK
Deepsurv network uses a similar formulation for S(t) as the CPH model. However, the approach replaces X · β with the output from the deep neural network. Specifically, the Deepsurv define the hazard score or risk score value as g (I |θ, β net ) = f (I |θ)·β net or X net ·β net and the loss in Eq. (4) become where θ is a parameter set for Deepsurv network until the penultimate layer f (I |θ), and β net is the last layer's parameter of the network. Notice that the output f (I |θ) is a feature vector from the network. Its linear combination with β net is the risk score g (I |θ, β net ). From the perspective of deep representation learning, the network simultaneously digests a new set VOLUME 10, 2022 of covariates X net = f (I |θ) and regression parameters β net as they are tuned in an end-to-end manner with a gradient-based backpropagation. Adjusting f (I |θ) as opposed to a manually defined X gives flexibility in lowering NLPL. Despite the network being relatively dated since its first conception compared to other deep learning alternatives, the major advantage of the approach is its ability to cope with wide ranges of outcome value without increasing network parameters while allowing flexibility of adjusting the neural network architectures with modern mechanisms (e.g., batch-normalization) to improve prediction performance [27]. Moreover, network variations can be created by changing to accommodate other forms of input data, such as a convolutional neural network for images input [19], and a recurrent neural network or similar mechanism for timedependent data [32]. With the benefits, Deepsurv remains widely employed recently to make survival predictions under small-sample medical and health-related contexts [28]- [31].

III. PROPOSED METHOD
We propose applying regularization to NLPL loss for the Deepsurv network. The regularization is designed to induce a correlation between features from the Deepsurv network and the desired hazard value trend as well as to reduce the probability of lower C-Index values caused by noise from the network's features. This section will first discuss the motivation and the quality measure that reveals the drawback of training the Deepsurv network under NLPL loss. Afterward, details on the regularization as well as its theoretical support are provided.

A. IMPROVING QUALITY MEASUREMENT OF DEEPSURV FEATURE
Our regularization objective is to ensure that the Deepsurv learns suitable feature distribution for X net = f (I |θ) during the optimization of pairwise concordance. Intuitively, a good set of features lead to well-perform β estimation and a higher C-Index. For the typical CPH model, poor feature quality is the most likely cause of underperformance because optimizing β is simpler with the convex loss. For Deepsurv, however, the loss function provides no explicit goal on improving the feature. Consequently, it is unclear during training whether X net , β, or the other aspects of the network architecture cause poor performance in models with inferior C-index. To clarify such questions, we seek to include the suitability of X net into the objective function to ensure that the best β for X net is easier to obtain, and efforts could be put elsewhere to improve the performance.
Despite the Deepsurv network being a non-linear regression approach, the risk score output X net · β net at the last layer is linear in nature. For linear predictions in general, the multiple-correlation coefficient R is a suitability measure for the feature. However, the typical calculation of R requires knowing the exact values of the target prediction variable, which are not available in our case due to censoring. We propose an alternative measure as the substitute measure R for the feature quality goal. is defined as β net is a directional unit vector calculated from normalizing the β net vector of parameters. Var is variance calculated from random variables in the parenthesis. We further simplify the calculation in Eq. (7) with principal component analysis (PCA), resulting in the following where each W i and λ i are eigen vector of the principal components (PCs) and its corresponding eigenvalue. The PCs are the result of applying PCA to the interested data of which p is the total number of dimensions. W i is a directional unit vector calculated from normalizing W i . The numerator is the summation of variance captured in the direction of β net proportional to the directional similarity between the PCs and β net . The operator (_) + is the element-wise absolute value to prevent any negative cosine value. The measure is bounded in the range of [0, 1]. According to Eq. (7), the value close to 0 means the network captures only a minor fraction of the total data variance for the prediction, which could entail more noises and ungeneralizable data trends. On the other hand, the value close to 1 implies utilizing a significant part of the variance for the prediction. Even though the fraction of variance does not directly lead to poor C-index performance, it semantically hints at the magnitude of feature information that supports β net and the efficiency of the prediction.
The purpose of is two-fold. The first purpose is to establish that maximizing increases the multiplecorrelation coefficient and decreases the discordance probability. We provide theoretical discussions on the derivation of measure in Section III-C. In brief summary, the theory reveals that ∝ R and 2 ∝ 1 P(disc) where R is the lower bound on the multiple-correlation coefficient and P(disc) is the upper bound on discordant pairs occurring probability. These properties are the foundation of our loss formulation. The second purpose is to gauge relative changes in the relevant feature distribution across many regularization strategies. Even though is relative to a lower bound measure and does not imply poor prediction quality, it is calculated from distributional variance measures, directly reflecting the feature distribution changes. It is surmisable that different regularization strategies alter the network's features, although the effect is unclear and difficult to gauge through C-index performance. Thus, we use to observe whether there is feature training emphasis among regularization strategies and to inspect whether the feature change improves C-index performances.

B. PROJECTION LOSS REGULARIZATION
To increase , we propose the following regularization term in addition to the NLPL loss where w reg is the weight of our proposed regularization term and L NLPL (θ, β net ) is the NLPL loss in Eq. (6). The loss in Eq. (9) is the primary objective function for our regularized Deepsurv network. With geometrical consideration, the loss is equivalent to the following rearrangement where α is an angle formed by β net and a features vector The regularization aims to simultaneously control both the distribution of f i and the β net . By decreasing f (I i |θ) 2 · β net 2 2 and increasing (f (I i |θ ) · β net ) 2 , the loss reduces the perpendicular components, which has a similar effect to geometrical projection and makes f (I i |θ) or X net more distributed along the direction of β net . The effect is illustrated graphically in Fig 1. In other words, the regularization reduces feature variance that does not contribute to the hazard score and improves the Var X · β /Var (X ) term value on . The proposed loss force of the Deepsurv network training to filter out the irrelevant information from the input I , and effectively predict with information in β net and X net that matter to the hazard score. It is noteworthy that the loss does not limit its applicability to only the typical Deepsurv network. The proposed method applies to any network designs with outputs with β net and X net , especially network with variations of f (I i |θ) explained in section II-B trained with NLPL. However, the proposed loss is not without a drawback. If the loss value is reduced to 0, then that potentially makes β net = 0 and renders the prediction useless. Thus, the strength of the regularization w reg must not overwhelm the main objective of minimizing L NLPL . The projection loss is proposed as a complementary objective such that it means to be reduced but not entirely minimized.

C. IMPROVE MULTIPLE-CORRELATION COEFFICENT AND REDUCE DISCORDANCE FROM NOISE
This section establishes the theoretical foundation and derivation of the proposed method. Our development strategy is to simply find the lower bound on the coefficient R or measure R 2 then seek to maximize them as the secondary learning objective for the Deepsurv network. However, we discover that maximizing the lower bound also lowers the upper bound of the discordance pair occurrence likelihood. In this section, we first state our assumptions surrounding our investigations. Then we provide statements, propositions, and theorems that we used to derive and the projection loss. Finally, we discuss the theorem on how our method reduces the upper bound on the discordance probability.
Consider a simplified survival regression problem in which all survival outcomes are available. Without the censoring, the simplified problem is to derive suitable hazard values and regress for the prediction model. Reference [8] explored the problem by deriving desired hazard value d and the linear prediction model where d = log (H 0 (t)) = X ·β * + e. The formulation converts the problem into a least-square estimation of β using X − d covariance. Notice that d can be re-centered and re-scaled to eliminate the intercept and derive a valid estimation of β * . The evaluation with the C-Index mostly concerns β * and the ordering of X ·β * such that estimates of β * in different magnitudes result in the same performance level. In survival prediction, it is common to assume that probability distributions of the outcome and the censoring are independent. If the model can capture the general trends and distribution of patient survival, the model from uncensored data can still be used to predict the censored cases. The relationship between the features and survival outcome would then apply to the censored cases as well.
We then made the following assumptions.
• First, we assume that β * ≈ β such that the NLPL solutionβ is a valid answer for the least-square β * formulations that capture the underlying patient survival trend.
• Second, magnitudes of d and β * are small such that is total variance in the features. This assumption is in-line with many regularizations approach for survival prediction [23] and regression in general, which try to keep parameters small and simple for better generalization. It also implies that d can be rescaled (e.g., z-normalization) such that Var (X ) ≥ Var (d) even if the variance is originally more significant than that of input features.
• Third, the noise term in d ≈ X · β * + e is a zeromean error term independent of the covariates X . This assumption is a general assumption of linear prediction, implying that the prediction error is the primary source of discordance. With the assumptions, we state the following statements and propositions.
is the standard closed-form solution for the least-square regression, which can be done when all the desired values are known. In statistical literature, Statement 2 defines the calculation of R. We use these statements to posit the following propositions.
Var(X T ·d)+Var(e) . Proof: With z-normalization, we can consider all elements of σ X vector equal to 1 and −1 XX = P −1 XX . R 2 can then be rearranged according to the above statements as Consider the following facts.
is 0 due to the z-normalization. Also, the linear formulation states that d = X · β * + e. Then, According to the third assumption, the noise term e is independent of X .
and E [e] = 0. Then, Given that E [X ] = 0 and E [e] = 0. Then, Thus, the equality above proves the proposition. Proposition 4: Given zero-centered rescaled d such that Var(X ·β * ) + 1. Proof: consider the formulation d = X ·β * + e. As d is a random variable. The variance of d can be calculated as Consider Var (X ) ≥ Var (d) and Cov (X ·β * , e) = 0 from the second and the third assumptions respectively. Then, Thus, the equality above proves the proposition. These statements and propositions give some clues on how to grasp the value of R from the available data. The apparent difficulty is d is not entirely available due to censoring. Therefore, we derive a corollary to proposition 3 and the theorem 6 to establish relationships between X, β * , and the lower bound on R.
The corollary is merely a derivation from Eq. (13) with Cauchy-Schwarz inequality in the denominator. Specifically, Proof: Consider Eq. (14) in Proposition 3. The equation can be re-arranged as Var (X · β * ) 1 8010 VOLUME 10, 2022 Plugging in the value in Eq. (17) into Eq. (16) in Proposition 4 leads to the following Let β * be a unit vector representing the direction of β * . Then, Thus, the equality above proves the theorem. Both Corollary 5 and Theorem 6 give helpful information about the lower bound of R related to X · β * ,in which we use them to derive and the projection loss. We substitute the direction β * with β from the NLPL formulation due to the first assumption. The Deepsurv formulation substitute X and β * with X net and β net . Therefore, the fraction of variance term can be substitute with defined in Eq. (7). Eq. (19) establish the lower bound R that and vice versa, where c = β * 2 is from by solving for β * using d or its scaled version according to the second assumption. The purpose of this relationship between R and is not to measure the exact value of R via because β * 2 is unknown due to the d censoring. The property shows that maximizing is equivalent to maximizing the lower bound R and encourages using as the feature quality measure.
Corollary 5 offers a simpler way to increase R . It suggests a simultaneous decrease in E X 2 β * 2 2 and increase in E (X · β * ) 2 , which we use in the projection loss formulation in Eq. (9). For the formulation, we substitute the constant β * with β net to control the variance direction and magnitude of β net at the same time. Technically, the proposed regularization improves by increasing Var X net · β net and reduce Var (X net ) that does not contribute to hazard score.
Because many parts of the derivation use z-normalization to ensure the zero-mean property, the theorem encourages using batch-normalization on the penultimate layer at the minimum. Therefore, we include batch-normalization as a crucial part of our network designs and training approach due to this theoretical insight.
There are valuable interpretations on improving R and R 2 through the proposed projection loss. Achieving good C-index performance with β net at a higher value of R means that the network finds concrete X net that correlated well with the desire hazard score even if complete knowledge of the exact score d is not available. The higher value of R 2 means the β net trend line has smaller fraction of unexplained variance, which implies that the network efficiently filters out the irrelevant information from the input I and effectively predicts with information in X net that really matters.
The minimizing projection loss not only forces the Deepsurv network to digest better information from the input I . We discover that it also reduces the discordance rate caused by noise and improves C-index performance. Consider that the desired hazard score variable d = X · β + e is the variable that strictly follows the ordering. In other words, the linear model inherently establishes that the error or noise term e causes the discordance of some X · β pair comparisons such that X · β = d − e and d − e diverge from the ordering of d, which follows the third assumption.
We establish the following proposition and corollary to proposition 4 to show the foundation of our upper bound on discordance probability. Proposition 7: Given β * , and (X i , X j ) which are two samples of random variables X , then the upper bound From the definition, e is non-negative. Then, the Markov inequality defines an upper-bound probability related to the quantity of e as where k is a non-negative constant. As values of β * , X i , and X j are given, let k = X i · β * − X j · β * 2 . Then, Thus, the derivation of Eq. (20) proves the proposition. The stated propositions and corollary provide information about the error term. Specifically, proposition 7 provide a probability bound of events that the error difference exceeds VOLUME 10, 2022 that of the hazard score. Corollary 8 suggests a worse-case quantity of error variance relative to variance in input data and the predicted score. Then, we derive the discordance probability in the predicted score as in the following theorem.
Theorem 9: Given β * and z-normalized X , then P (discordantinX ·β * ) ≤ Proof: Let X i , X j and e i , e j be two random samples of features X and the error noise e, respectively. We define discordance events caused by the noise term as events where the ordering of d = X ·β * +e differs from that of the predicted score X ·β * for any two pair of samples i, j. In other words, orders for d and X ·β * should be equivalent if the magnitude of the error term does not change the results of any pairwise comparison. Consider the formulation of discordance events as the following joint cases.
where # (discd) is the number of discordant events. The function sign (k) = 1 if k ≥ 0, and sign (k) = −1 otherwise. From the above definition, discordant pair occurs when pairwise the comparison of signs between X i ·β * , X j ·β * pair is not the same as that of e i , e j pair and the noise pair magnitude difference is significant enough to interfere with the order. Otherwise, the comparison of X i ·β * + e i and X j ·β * + e j is the same as that of X i ·β * and X j ·β * . Eq. (24) further establishes that where X ·β * = X i ·β * − X j ·β * , and e = e i − e j . The probability of ( e ) 2 ≥ X ·β * 2 is a P (discd) upper bound. Using proposition 7, Eq. (23) can be elaborated as We seek to gauge overall quantity for any pair comparison. Thus, we apply expectation on both sides of the inequality As P (discd) is an intrinsic constant on the population, the expectation operator does not change its value. Likewise, E e i − e j 2 is a constant. Then, Consider that X i , X j and e i , e j are randomly sampled from same variables X and e, respectively. Therefore, X i , X j and e i , e j are independent and identically distributed (iid). With zero-mean noise according to the third assumption and z-normalization X centered at zero means, then With these facts, Eq. (28) can be simplified as Consider the upper bound in Corollary 8, Eq. (29) can be re-arranged as Therefore, Eq. (30) prove the theorem. From theorem 2, we establish that the upper bound of the discordance probability P (discd) is inversely proportional to the fraction of the X · β * variance and the total variance in X . For the Deepsurv network, X is X net , and β * is substituted with β net according to the second assumption. Eq. (30) establish the upper bound that and vice versa, where c = 1/ β * 2 2 .The relationship further expands the merit of increasing such that maximizing 2 is equivalent to minimizing the upper bound of the probability of discordance pair in the predicted score. The theorem reveals that our proposed regularization improves C-index performance by increasing 2 , which we investigate further in our experiments.

IV. EXPERIMENTS
Our investigations through the experiments aim to address the following aspect under the medical context with limited sample size.
• Impact of the regularization on C-index performance.
• Effect of the regularization on the Deepsurv's feature.
• Performance of the regularized model relative to that of the State-of-the-art.
A. DATASETS Table 1 gives an overview of the survival datasets on the number of features and censoring. Our experiments employed five public datasets and four in-house developed datasets. We use two criteria in selecting the current public datasets in our experiments. First, the survival datasets should be related to healthcare, have limited sample sizes, and are highly censored. In this study, we posit that each dataset should have less than 8,000 cases, and more than 40% are censored. Second, the selected datasets have been previously used to demonstrate the performance of the included baselines (e.g., when the included baselines were proposed). Intuitively, we ensure fair comparisons such that all baselines are expected to perform well on some datasets. In brief details, Wisconsin Breast Cancer prognosis (Wisconsin) is a dataset for breast cancer diagnosis which analyzes digital images of cells taken from breast lumps to time recurrence of cancer. Expert-defined features were extracted from cellular nuclei images. Each record represents follow-up data for a cancer case. The censored case is defined by no recurrence during the follow-up period. The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) is a breast cancer survival prediction dataset based on gene expressions and clinical features. The target variable is the number of months until observed death. Rotterdam and German Breast Cancer Study Group (GBSG) dataset contains records of node-positive breast cancer patients with features related to effects from chemotherapy and hormone treatments. The recorded survival times are in the number of months. National Wilm's Tumor Study (NWTCO) [2] is a dataset to study the relationship between tumor histology on embryonal kidney cancer and treatment outcome. The features are clinical and histological variables. All the cases are associated with the time-to-death or survival time of patients. Patients who survived over the follow-up period are censored. Assay of Serum Free Light chain (FLCHAIN) is a dataset for studying the prevalence of the monoclonal gammopathy of undetermined significance (MGUS), an immune disregulation condition. We employ the same pre-processing as in [11]. Notably, the censored patients either survived or dropped out of the study during follow-up. The extracted variables are conditions related to the ailment. Wisconsin data is available at the UCI repository. METABRIC, GBSG, NWTCO, and FLCHAIN datasets are available either in the R Survival package or Pycox package.
We developed Sarcoma datasets from magnetic resonance imaging (MRI) scans of patients with sarcoma soft-tissue cancer. Soft tissue sarcoma is a heterogeneous cancer with severe outcomes for many patients. Pre-treatment contrastenhanced T1-weighted 3D MRI scans were acquired from two independent cohorts of patients diagnosed with biopsyproven soft tissue Sarcoma (STS) from the University of Washington (UW cohort) and the Technical University of Munich (Munich cohort). The acquisition was done with the institutional picture archiving and communication system (PACS) standard using a similar image matrix and resolutions to [15]. Using the similar protocol to [15], the selected patients had high-risk STS of various histologies of the extremity, trunk, or retroperitoneum. In both cohorts of Sarcoma datasets, radiologist and radiation oncologist experts manually segmented the gross tumor as ROI at the fixed resolution of 1mm 3 , which later resampled into the size of 64×64×64 voxels. The segmentation and resampling were completed using MIM software (version 6.6, MIM Software Inc., Cleveland, OH) for the UW cohort and iPlan RT (version 4.1.2, Brainlab, Munich, Germany) for the Munich cohort. Fig. 2 visualizes some samples in our datasets.
Sarcoma-Rad-UW and Sarcoma-Rad-Munich are radiomics features derived from the MRI scans describing the tumor's textural appearances from each cohort. The relationship of these empirical image features sets to patient survival, pathologic response, and tumor grade has been described previously [15], [16], [24], [25]. The features are extracted using the PORTS software package and extraction protocol as in [15]. Sarcoma-3DMRI-UW and Sarcoma-3DMRI-Munich data are bounded MRI scans from each corresponding institute. The scans were normalized as in [3]. Sarcoma-3DMRI is not in the typical feature vector format. There have been many successful explorations in predicting severity and patient risk directly from 3D cancer scans [3], [19], [26]. We use these datasets to demonstrate our regularization applicable on DeepConvSurv network [19] and to perform comparisons with many other deep survival baselines for end-to-end prediction. More details are provided in experiment Section IV-D test scenario three.

B. EFFECT ON C-INDEX PERFORMANCE
The first experiment demonstrated effects on the Deepsurv network C-Index performance given different weights w reg settings on Wisconsin, METABRIC, GBSG, NWTCO, and FLCHAIN datasets. In all datasets, all features were z-normalized before the experiments except for binary features. We randomly split each dataset into 60% training, 20% validation, and 20% testing. Our Deepsurv network architecture was (#feature-128-64) network which consisted of 2 hidden layers of size 128 and 64 nodes followed by a 1-nodes hazard output. Each of the hidden layers used Leaky-Rectified Linear unit (L-ReLU), defined as max(0.01X , X ), followed by batch-normalization. In prior experiments, we tried multiple activation functions and designated L-ReLU for its performance. We set w reg as 1 × 10 m where integer m ∈ [−5, 5]. All the results were compared with no regularization and norm-based Ridge and Lasso regularization on β net with the same regularization weight range. We set Adam optimizer with a 0.01 learning rate in all trainings. The networks were trained under loss function in equation (6) until no improvement on validation performance after 10 patience epochs from the best validation round. The experiments were repeated 1000 times with different random data splits. The estimated C-index from the testing set was averaged across all the repetitions. Then, the averaged C-index estimates were used to compare the performances of different regularization strategies. Notice that the splitting and result measurement reflect the scenario where the training, validation, and testing set belong to the identical distributions from well-explored datasets. We applied the same network architecture and the grid search on regularization weights of all the compared strategies. Fig. 3 summarizes the results. Improvement in C-index performance under our regularization method can be observed in different magnitudes across the five datasets. In the smaller Wisconsin, METABRIC, and GBSG datasets (<3000 cases), all the regularization strategies with appropriate weight led to better performance. For the larger datasets (approximately 4000-6000 cases), however, we observe that the norm-based method performances were marginally worse than that of no regularization, whereas our method was improved from the no regularization baseline to a certain extent. It is also noteworthy that the appropriate weight for our approach tends to be more stable such that the peak C-Index performances were from w reg values around [0.01, 1], whereas the suitable range for other strategies varied depending on the dataset. Applying too large w reg values can lead to poor performance worse than that of no regularization. Nevertheless, the performance trends show the potential of our proposed regularization on improving the C-index.

C. EFFECT ON THE FEATURE DISTRIBUTION
The improved performance warranted deeper investigations on the suitability between the learned feature and the hazard direction, along with the effect of regularization on the feature distribution. The second experiment repeated previous settings with the highest performance from each regularization strategy 1000 times in all datasets. In each run, measures were calculated to inspect whether there are increases from any regularization strategies. The values were averaged and then compared to observe the learned information from the trained network in each setting. In addition to the measure, testing X = f (I |θ) and β net of each strategy were extracted from a data split with the highest C-Index. Then PCA dimensional reduction was applied to reduce the dimension of X net to 2 for visual comparison. Table 2 presents average and C-index measures. Due to slight differences in C-index performances between some strategies, 95% confidence interval halfwidths are provided for all C-Index results. The unregularized Deepsurv resulted in poor and learned β net such that a tiny fraction of the variance contributed to the prediction. Even though low does not necessarily result in a poor C-index, the network could have done better optimization to digest input I for relevant information. Otherwise, the network would predict using less suitable β net and more likely to capture noise information. The results opened opportunity for the feature control and selection.
Norm-based regularization had varying effects on and C-index performances. We observe increases in both and C-index performance on Wisconsin, METABRIC, and GBSG data. On the other hand, the norm-based strategies somewhat altered the in NWTCO and FLCHAIN datasets, which means the regularizations have some influence over features distribution, not just the β net parameter. However, they offered no significant change on C-index, suggesting failure to select well-perform features information on both X net and β net . Thus, norm-based strategies are less effective for the Deepsurv network.
The proposed regularization remedies the problem by allowing more control on the distribution of the learned feature. Larger values mean that the network learned to encode more information from X net that the β net line can capture. Fig. 4 also elaborates our observations from Table 2. The useful information that drives the prediction is the data projected toward the β net line, with less variation in the other direction. Thus, network trainings with less effective control cause more data to be distributed more perpendicular to the line instead of the trend of increasing or decreasing hazards. From this perspective, the proposed regularization showed a more linearly oriented distribution toward the hazard prediction. Better C-index performance across all the datasets also supported the utility of our proposed method.

D. PERFORMANCE COMPARISON WITH OTHER STATE-OF-THE-ART APPROACHES
In the third experiment, we evaluate our proposed method against other state-of-the-art approaches. The tests are under 3 increasingly difficult data scenarios relevant to the prediction model development for medical prognosis and decision making. We design experiments to inspect whether the proposed method's performance can generalize.
The first scenario is when expert-defined variables are available, and distributions of training and testing datasets are almost identical. This scenario is expected at the early-stage model development to confirm that the prediction approach holds sufficient predictive power under the available data. The test is constructed to satisfy iid data environments such that errors are mainly caused by prediction approaches rather than the discrepancy between training and testing distributions. We use well-established public datasets to further ensure that tested models learn to predict only from the relevant feature information. We employ 5 datasets from previous experiments with the same training/validation/testing splits. For our proposed strategy, we tried various values of w reg for each dataset among the [0.1, 1.0] interval. Then, we select configurations with the best validation performance and compare them with those of 6 deep learning and 4 nondeep learning approaches. The deep learning baselines were unregularized Deepsurv [9], Cox-time network [11], Survival Net [3] with CPH model regression, NNet [6], PMF network [10], and Deep Hit network [12]. All baselines used the same core network as in experiment 1 with different last layers and losses depends on their respective training strategies. The non-deep learning baselines were a typical CPH model, survival SVM with linear kernel [20], Survival Regression Tree [21], and survival regression Forest [22]. For all baselines which require discretization of output time (NNet, PMF, Deep Hit), we tried varying coarse discretization by doubly increasing the grouping of survival time from 1 day, 2 days, 4 days, 8, days, and etc. After trials, we set the finest discretization defined as max(time) -min(time) + 1 because it consistently outputted the best C-Index across many of the discretization baselines. We designate the time data in the unit of days for Wisconsin, NWTCO, and FLCHAIN datasets, and in the unit of months for METABRIC and GBSG datasets.
Intuitively, the fine-grain discretization assumed no prior expert knowledge on time-to-event distribution such that the discretization should not destroy comparability between cases. Similar to the proposed strategy, all the deep learning baselines were repeatedly trained using Adam Optimizer with the same learning rate and early stopping criteria. The repeated evaluations compared average testing C-Index with 95% confidence interval (CI) under 1000 runs. Notice that all networks had their number of parameters greater than the number of data cases ( 8000). The large numbers reflect the usual scenario of limited data in survival analysis under the medical setting.
The second scenario is when distributions of training and testing sets are not necessarily identical. However, the features that encoded some expert knowledge are available. Under practical circumstances, information learned from the expert-provided features must be applicable for predicting various new cases. This expectation is necessary and realistic, especially when the survival model undergoes external validation between different cohorts of patients (e.g., due to differences in patient or tumor characteristics) from different institutions. The test is designed to mimic the validation to ensure that medical-decision making is based on generalizable predictions. The proposed method and baselines were trained and validated using Sarcoma-Rad-UW with random 80% training and 20% validation data split. Instead of the same dataset, we tested the trained models on Sarcoma-Rad-Munich data. The experiment repeated 1000 times with different train-validate splits. All baselines from the first scenario were also subject to this experiment with the same architectural settings as in the first scenario. We then compared C-Index performances with 95% CI on the testing set across survival prediction models.
The third scenario is when training and testing data distributions are not identical, and expert-defined features are unavailable. Unlike the second scenario, this scenario is more difficult as there is no expert guidance on specific information to capture from the raw data. The test is conducted to demonstrate the proposed method's applicability and to observe the generalizability of various deep survival approaches under non-typical input instead of the handcrafted feature vector. The proposed method and baselines were trained and validated using the Sarcoma-3DMRI-UW dataset and tested on the Sarcoma-3DMRI-Munich dataset. Similar to the second scenario, the UW cohort cases were randomly split into 80% train and 20% validate data. The experiment repeated 100 times with different train-validate splits. Due to no expertdefined feature variables, non-deep learning baselines were excluded for Sarcoma-3DMRI experiments. We used the same deep learning baselines as the first and second scenarios with the core network replaced with a convolutional neural network (CNN). CNN version of Deepsurv is also called Deepconvsurv, which has been explored in [19]. The CNN architecture setting was (img-conv16 conv32-conv64-conv128-flatten-512-128) consisted of 4 convolution layers with an increasing number of filters from 16-128 followed by a feed-forward network with 2 hidden layers of size 512 and 128. All convolutional filters have a size of 3 × 3×3. All convolutions and feed-forward layers used the L-ReLU activations function and followed by batch normalization. Random flipping augmentation was tried but later dropped due to validity concerns and no significant improvement in all baselines. All training used Adam optimizer with a learning rate set to 0.0001. C-Index performances with 95% CI on the testing set were compared. Table 3 summarizes the result comparisons of the proposed regularization with all the baselines. We observe that the proposed method performed on par or better than state-ofthe-art methods and non-deep learning baselines in 4 out of 5 datasets. The outperformance demonstrates that our regularization applies well when data distributions are similar across training, validation, and testing sets. It also shows the utility of the proposed method that does not require discretization of output time, which can be difficult to define without expert knowledge. In larger GBSG NWTCO and FLCHAIN datasets, inferior performances of NNet, PMF, and Deep hit may be due to the simpler core network, unlike bigger networks in their original works that greatly increase the number of parameters in the networks. In a separate experiment, we experienced increased performance using larger networks with more time-consuming training. However, the larger network performed poorly on the small Wisconsin dataset. Interestingly, Deephit outperformed all other baselines in the METABRIC dataset despite being trained with a relatively small amount of data and underperformances of other discrete-time approaches. This discrepancy is a good reminder that the approach is not a bad survival baseline and should be considered in our following scenarios. Nevertheless, unregularized regression-based networks (Cox-time and Deepsurv) mostly performed on par or better than many discretization approaches at the current core network setting. The underperformance demonstrates limitations of the discretization approach under the current network setting and scenario. Table 4 and Table 5 illustrate performances comparisons using Sarcoma-Rad datasets, whose expert features guide the prediction when training and testing datasets are not the same. The proposed method outperformed both deep learning and non-deep learning baselines, achieving the average C-Index of 0.657. In this small dataset, NNet, PMF, and Survival Net failed to outperformance many non-deep learning baselines, suggesting that this approach could not learn generalizable information for the prediction under such a data scenario. DeepHit, Deepsurv, Cox-time, and the proposed networks outperformed the non-deep learning baseline in this scenario. Most of these well-perform networks are from the regressionbased approach.
Similar trends in performance also exist in experiments with more difficult Sarcoma-3DMRI in which the training and testing dataset may not have the same data distribution and no expert variable to guide the prediction. In Table 6, all the deep learning approaches had lower performances without the expert information, especially DeepHit network whose performance dropped the most from the previous scenario. All the discrete-time baselines performed significantly poorly compared to the regression-based networks. Despite the weaker performance, the proposed method outperformed the baselines with the average C-Index of 0.630.
Performance across all the scenarios show the utility of our approach under small-data cross-cohort environments. Even though it is arguable that improved performances under the first scenario in Table 3 were marginal, the regularization prevented significant performance drops due to data changes in Table 5 and Table 6. The minor performance decrease means that our regularized network was successful in learning more generalizable predictions. The compatibility    between training data and network parameters sizes is the primary factor in the failed generalization of unregularized deep learning baselines. Unlike public datasets in previous experiments, many real-world medical datasets such as our Sarcoma datasets are small and highly censored, which is detrimental to highly parameterized neural network training. In such a scenario, it is more likely for larger discretetime networks to fail to generalize and capture proper information for the cross-cohort prediction. On the other hand, regression-based networks with fewer parameters outperformed the discretized alternatives, demonstrating the flexibility and robustness of our regularization to generalize across different data scenarios. It can be concluded that our regularized Deepsurv model further improves the regressionbased survival prediction performance, especially for small datasets.
Nevertheless, this work focuses on C-index survival prediction performance with one type of input and one outcome. In real-world scenarios, there are needs for considering and integrating multiple input types simultaneously into making decisions, such as imaging scans to time-series signals. Adapting the prediction network to these grand challenges is the subject of our future work.

V. CONCLUSION
We successfully developed a novel regularization strategy that theoretically upgrades features quality and practically improves the robust performance of the Deepsurv network for survival prediction. The experiment results demonstrate the advantage of deep survival regression approaches over discrete-time networks and the generalizability of our method across datasets in medical applications. The success of our work demonstrates that deep survival regression performance can be further improved with applied insight from representation space instead of more parameterization and expansion of deep neural network architectures. STEPHANIE K. SCHAUB received the M.D. degree from Florida Atlantic University. She is currently a Board-Certified Radiation Oncologist, who joined the Department of Radiation Oncology, School of Medicine, University of Washington, as an Assistant Professor, in 2020. She specializes in the treatment of adult sarcoma patients and pediatric tumors of all types. She completed her residency in radiation oncology at the University of Washington, where she was the Chief Resident during her final year, and Harvard's Brigham and Women's Hospital. Her research interests include developing novel, noninvasive biomarkers (such as imaging or blood sample signals) that can predict cancer outcomes or treatment-related toxicity to advance toward precision radiation therapy. She is a member of the Connective Tissue Oncology Society, the Children's Oncology Group, and the American Society of Radiation Oncology. PAUL E. KINAHAN (Fellow, IEEE) is currently a member of the UW Imaging Research Laboratory. He was a part of the group that built the first prototype combined PET/CT scanner and has also contributed to the current class of data processing image reconstruction algorithms used in PET/CT oncology imaging. He moved to the University of Washington, in 2001, where he continues his research in PET/CT imaging. He has served on committees for RSNA, AAPM, SNM, NIH, and IEEE. STEPHANIE E. COMBS received the Doctorate degree, with preclinical work in field of neuroanatomy, where she studied the sympathoadrenal system and the impact of growth factors on development and formation of the neuronal and non-neuronal networks. She studied medicine in Heidelberg; Norfolk, USA; and San Antonio, USA. After her graduation and promotion in 2003, she worked as a Research Associate in Heidelberg. Following her postdoctoral lecturer qualification in 2009, she was promoted to the Vice Chair of the Radiation Oncology Department in Heidelberg, in 2011. In 2014, she was appointed as a Professor and the Chair of the Department of Radiation Oncology, Technical University of Munich (TUM). In 2015, she also took over the Institute of Radiation Medicine, Helmholtz Zentrum. Since 2019, she has been heading the TUM Senate and Serves as the Vice Dean for diversity and talent management. Her key expertise is highly conformal radiation therapy (stereotactic treatment, IMRT/IGRT/ART, protons, and carbon ions). Her scientific work includes areas, including treatment optimization for brain and skull base tumors, biomarkers in radiation oncology, pediatric oncology, gastrointestinal oncology, uro-oncology, gynecological oncology, radiochemotherapy, and radioimmunotherapy. She is serving as a Board Member for the Neurooncology Working Group of the German Cancer Society (DKG). She received several scientific awards, such as the Hermann Holthusen Prize of the German Radiation Oncology Society.