The Deep Promotion Time Cure Model

We propose a novel method for predicting time-to-event in the presence of cure fractions, based on flexible survival models integrated into a deep neural network framework. Our approach allows for non-linear relationships and high-dimensional interactions between covariates and survival and is suitable for large-scale applications. Furthermore, the method can incorporate an identifiable predictor formed by an additive decomposition of interpretable linear and non-linear effects, with an orthogonalization layer capturing potential higher-dimensional interactions. We demonstrate the usefulness and computational efficiency of our method via simulations and apply it to a large portfolio of US mortgage loans. Here, we find not only better predictive performance of our framework but also a more realistic picture of covariate effects.


Introduction
Lenders employ mathematical models to assist decision-making by estimating each customer's probability of a credit event. These models, known as credit scoring systems, were initially developed to predict the probability of default for specific products.
Over time, the use and purpose of these systems have become more diverse and aligned with the lender's strategic goals. Moreover, new computational advancements and the pursuit of better models have urged research on deep learning (DL) approaches in this field. Gunnarsson et al. [27], conducting a study comparing DL algorithms and their practicality in credit scoring, find that while ensemble methods are still favored, DL approaches have potential, e.g., in handling less traditional data sources. Meanwhile, Stevenson et al. [45] reveal the merit of DL for predicting default in small businesses using text data. In another study by Korangi et al. [31], transformer models were employed to process time-varying covariates, such as accounting metrics from the balance sheet, to predict the bankruptcy of middle-capitalization companies, showing better performance than traditional models. These findings are part of the recent evidence suggesting that DL techniques are promising to improve credit scoring systems and expand the range of data types that can be leveraged in this field.

arXiv:2305.11575v1 [stat.ML] 19 May 2023
Until now, most DL models applied to credit risk have focused on classification tasks, where a pre-defined performance period for a binary decision is established. A different, less explored route in the DL context is to build scoring systems with survival analysis [6]. Here, the outcome of interest is the time until an event occurs. One challenge in survival analysis is to reliably describe the distribution of survival times, trying to convey, for example, whether all subjects are prone to the event of interest. In credit risk modeling, however, it is natural to expect that some borrowers will never experience the event, resulting in heavy censoring at the end of the study [19]. In this situation, cure rate models are preferred [23], which extend survival models by including a latent cure fraction. The advantage is that these models allow separating the factors that influence the probability of the event's occurrence from those that affect its timing.
Another challenge is understanding how the subject-specific features (or covariate effects) relate to survival times. To this end, two main classes of cure models exist: the mixture cure model (MCM) [7] and the promotion time cure model (PTCM) [54,49]. Although the MCM has been extensively studied in the credit risk literature (see Table 1), the PTCM, introduced in the late 1990s, has gone practically under the radar and is the focus of this paper.
The MCM assumes a binary response variable in the population that describes those cured and those susceptible to the event. This approach has been broadly developed in parametric, semi-parametric, and non-parametric formulations, and to handle continuous, discrete, and longitudinal data (see [2] for a comprehensive review). In contrast, the PTCM, which originates from cancer studies, assumes each subject has unobserved competing risk factors, such as cancer cells. In this setting, a cured patient has zero cancer cells, while a susceptible patient's event occurs when the first cell develops into a palpable cancer mass. Although initially conceived for tumors, its statistical principles apply to broader contexts. For example, in credit-related applications, competing risk factors include causes of borrower default, such as job loss, inability to work, strategic default, and failed businesses [4].
We make three significant contributions in this manuscript: two methodological and one empirical. From a methodological standpoint, first, we reformulate the PTCM using a deep neural network (DNN) architecture. We label our approach Deep-PTCM.
The second methodological contribution allows the user to decompose the predictor into linear and non-linear components, with the latter estimated through a DNN. This separation aims to facilitate the interpretation of covariate effects, a common criticism when applying DL approaches. However, it is known that a neural network (NN) can approximate any continuous function [17], in particular a linear one. Hence, to avoid identifiability issues, we follow [44] and add an orthogonalization layer that projects the non-linear component's output onto the linear one's orthogonal complement. The orthogonalization is performed using a QR decomposition, which is stable when computed in a mini-batch training routine [43].
From an empirical perspective and to the best of our knowledge, this is the first study to apply such a general and flexible framework of PTCMs in the credit risk context. First, most of the cure models studied in credit scoring belong to the class of MCMs, leaving the PTCM relatively unexplored. However, we do not find any solid justification in the literature for choosing the MCM over the PTCM, and this preference may simply be due to the MCM's popularity. Second, none of the existing cure models, regardless of class, allow for the complex and often more realistic non-linear relationships and interactions between covariates and survival. As we show later, this assumption limits predictive power, an essential aspect of credit risk management [46]. Concretely, we build a cure model to predict the time to default in a large US mortgage portfolio. We show that the Deep-PTCM significantly outperforms the standard PTCM in calibration and discrimination.
Overall, our Deep-PTCM has the following highly relevant advantages over existing competitors: (i) It generalizes the standard PTCM, which assumes a linear predictor; the latter can be seen as a one-layer NN with one unit. (ii) It provides more flexibility than traditional estimation pipelines by replacing data preprocessing and feature engineering with a differentiable loss function estimated via gradient descent. This allows the model to incorporate structured and unstructured data, such as text and images, increasing the scope of its applicability. (iii) It is scalable, since all model parameters are integrated into an end-to-end DNN, making the estimation procedure computationally efficient and easily parallelizable (GPUs/TPUs).
This provides comparative advantages over recent efforts from medical research by Xie and Yu [53]. The authors propose a PTCM with a DNN component and show it can increase the model's performance compared with non-parametric approaches based on splines. However, this is the only work in this respect, indicating that the interface of PTCMs and DNNs is an underexplored area from both a modeling and an application perspective. An example is the estimation procedure, carried out iteratively using the Expectation-Maximization (EM) algorithm introduced by [14], where the DNN is optimized at each maximization step. This results in a computationally inefficient procedure that limits the approach's applicability in increasingly prevalent big data environments. Through a simulation study, we demonstrate that the Deep-PTCM scales better than the approach proposed in [53]. This improvement permits us to estimate the model on a training set with approximately 150k borrowers, the largest in this context (see Table 1), in a few minutes rather than hours.
(iv) Its implementation uses the TensorFlow framework [1], making it easy to accommodate all layers, optimizers, and features available there.
The rest of the paper is organized as follows. In Section 3, we describe the PTCM, its reformulation in an end-to-end DNN framework, and how we estimate it efficiently even when the dataset is large. Section 5 offers two simulation studies, one comparing our estimation approach to that of Xie and Yu [53] and the other studying how the Deep-PTCM can recover the linear effects when the orthogonalization step is included. In Section 6, we present our credit risk study, while Section 7 concludes.

Related work
Although the contributions presented have the potential for applications beyond the credit-related context, the motivation for this work arises from the importance of credit scoring models in a predominantly data-driven industry and the lack of studies combining cure models and DL.
Most of the cure models applied so far belong to the class of MCMs. Tong et al. [48] introduce the MCM and compare its performance to logistic regression and the Cox proportional hazards model (Cox PH), noting that the ability to distinguish borrowers' susceptibility is appealing for risk management. Similarly, Dirick et al. [21] compare the performance of different survival approaches on ten datasets. They find comparable performance between the MCM, Cox PH, and accelerated failure time models, with promising economic performance by the MCM.
Moreover, Louzada et al. [35] demonstrate that the flexibility of the MCM allows modeling survival data even when the proportional hazards assumption is not satisfied. Extensions to include exogenous time-varying covariates can be found in [18], from a discrete-time perspective, and in [19], for the continuous-time one. Furthermore, Zhang et al. [56] introduce a new MCM that allows non-cured borrowers to be susceptible to a subset of risks instead of all of them, as is commonly assumed in competing risk settings.
Although much of the literature focuses on the MCM, some works, all from the same group, have studied the PTCM. Namely, in [39], a PTCM is applied to relate the intensity of default and recovery rates in a Brazilian loan portfolio. This study, however, does not include covariates in the model. In addition, in [4], different activation mechanisms of the PTCM are analyzed. A bivariate survival process is considered in [11], and Ribeiro de Oliveira Jr et al. [42] extend it to account for events at time zero. More recently, Toledo et al. [47] allow for early events with fractions incorporating covariate effects.
Table 1 compares the various approaches involving cure models with application to credit risk. All these contributions, whether with the MCM or the PTCM, consider linear covariate effects, an assumption that is often unrealistic and too simplistic. Furthermore, regarding cure rate models that include non-linear effects, traditional non-parametric methods, such as smoothing splines, do not scale well with high-dimensional interactions [53]. In this regard, the Deep-PTCM can not only capture complex covariate effects but is also scalable by leveraging the flexibility of DL.

The Promotion Time Cure Model
The PTCM assumes that a subject has K unobserved competing risk factors, each of which can lead to the event's occurrence. Furthermore, suppose that K is distributed as a Poisson with mean θ, and denote by Y_k, k = 1, ..., K, the random time for the kth risk factor. Given K, it is assumed that the random variables Y_k are independently distributed with cumulative distribution function (CDF) F(t). The time to event T* is the time elapsed until the first unobserved competing risk factor is triggered, i.e., T* = min{Y_0, Y_1, ..., Y_K}, where Y_0 = ∞ with probability one covers the cured case K = 0. The survival function of the population is then S_p(t) = P(T* > t) = exp(−θF(t)), (1) and the cure fraction is lim_{t→∞} S_p(t) = exp(−θ). Note that, since lim_{t→∞} S_p(t) can be positive, S_p(·) is not a proper survival function. We deliberately call it the survival function of the population and add the subindex p to differentiate it from S(t) = 1 − F(t), the (proper) survival function of the risk factors.
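The form of S_p(t) follows from conditioning on the Poisson count K; a short derivation, consistent with the definitions above:

```latex
\begin{align*}
S_p(t) = P(T^* > t)
  &= \sum_{k=0}^{\infty} P(K = k)\, P\bigl(\min\{Y_1,\dots,Y_k\} > t \mid K = k\bigr) \\
  &= \sum_{k=0}^{\infty} \frac{\theta^k e^{-\theta}}{k!}\, S(t)^k
   = e^{-\theta}\, e^{\theta S(t)}
   = \exp\{-\theta\, F(t)\}.
\end{align*}
```

Letting t → ∞, F(t) → 1, which recovers the cure fraction exp(−θ).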
Denote by x the q-dimensional vector of covariates. The standard PTCM then relates θ with x through θ(x) = exp(w⊤x + b), where w ∈ R^q is the vector of regression coefficients and b ∈ R the intercept [15]. Note that the hazard function of the population is then h_p(t; x) = θ(x)f(t), where f(t) = dF(t)/dt. Since the hazard function preserves proportionality, the PTCM is also known as the proportional hazards cure model [41].
The PTCM has been studied and extended in several directions in the statistical community. For example, an EM algorithm to estimate the model with missing covariates [14], extensions to handle interval-censored data [34] and to include random effects [13], categorical time-varying covariates [32], latent risk classes [30], and longitudinal covariates [10] have been developed. Moreover, non-parametric approaches have recently been explored, in particular to model univariate covariate effects by including smoothing splines [16,8]. Still, these do not scale well in large-scale applications.

The Deep-PTCM
It is commonly assumed that there is a linear relationship between covariates and survival, which facilitates interpretation. However, this also restricts the regression predictor to a specific structure. To address this limitation, we redefine the PTCM as an end-to-end DNN architecture. This approach lets us consider non-linear relationships and high-dimensional interactions between covariates and survival. Furthermore, we develop an efficient estimation algorithm using existing DL libraries, allowing the framework to be applied to large and unstructured datasets.
Consider that subject i in the population (i = 1, ..., N) has a vector of covariates x_i ∈ R^q and a number of unobserved competing risk factors K_i. Denote by Y_ik the associated random event time for the kth risk factor, with CDF F(t), ∀k. As in the PTCM, the uncensored time to event for subject i is T*_i = min{Y_i0, Y_i1, ..., Y_iK_i}. Assuming a right-censoring mechanism and denoting by C_i the censoring time for subject i, the observed event time follows T_i = min{C_i, T*_i}. Denote the realization of T_i by t_i and by δ_i the event indicator, which is 1 if the event occurs at time t_i and 0 otherwise.
The survival function of the population follows analogously to Equation (1), i.e., S_p(t; x_i) = exp(−θ(x_i)F(t)), and the corresponding hazard function is h_p(t; x_i) = θ(x_i)f(t), with θ(x_i) = exp(η(x_i)) and η: R^q → R a general continuous function rather than a linear combination of the covariates x_i. We proceed similarly to [53] and model η through a DNN. From the universal approximation theorem [17], we know that a NN, under certain conditions, can approximate any continuous function. Moreover, the linear specification described in Section 3.1, η(x_i) = w⊤x_i + b, can also be parameterized with a NN in which only one layer and one neuron are needed. Therefore, the Deep-PTCM can be regarded as a generalization of the standard PTCM.
The Deep-PTCM is flexible enough to represent complex covariate relationships and/or handle unstructured data. However, there are situations in which the interest is to identify whether the predictor η has structured linear effects to ease interpretation. To this aim, we also allow the framework to estimate η as an identifiable sum of linear (η_lin) and non-linear (η_non) predictors.
Specifically, to avoid potential identifiability issues, we follow the orthogonalization procedure in [44]. This is accomplished by calculating orthogonal projection matrices P, P⊥ ∈ R^{N×N} such that η_lin = Pη and η_non = P⊥η, where η = (η(x_1), ..., η(x_N))⊤ (η_lin and η_non follow analogously). The projection matrices are obtained by performing a QR decomposition of X̃ = [1_N, X] = [1_N, (x_1, ..., x_N)⊤], i.e., X̃ = QR, where Q has orthonormal columns, R is upper triangular, and 1_N ∈ R^N is a vector of ones (intercept). Hence, we obtain the projection matrices as P = QQ⊤ and P⊥ = I − QQ⊤. While the QR decomposition is usually performed on the entire set X̃, it has been shown that this decomposition can be computed stably in a mini-batch training routine [44,43], and thus can be readily included in the learning procedure.
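The projection step above can be sketched in a few lines of NumPy; the data, the toy predictor, and the variable names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
N, q = 200, 3

# Design matrix with an intercept column, as in X_tilde = [1_N, X].
X = rng.normal(size=(N, q))
X_tilde = np.column_stack([np.ones(N), X])

# Reduced QR decomposition: Q has orthonormal columns.
Q, R = np.linalg.qr(X_tilde)
P = Q @ Q.T             # projection onto the column space of X_tilde
P_perp = np.eye(N) - P  # projection onto its orthogonal complement

# Any predictor vector eta can now be split into identifiable parts.
eta = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=N)
eta_lin = P @ eta
eta_non = P_perp @ eta

# The two components are orthogonal and sum back to eta.
assert np.allclose(eta_lin + eta_non, eta)
assert abs(eta_lin @ eta_non) < 1e-8
```

Because P⊥ annihilates the column space of X̃, any linear trend in the DNN output is absorbed by the linear part, which is what makes the decomposition identifiable.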
Figure 1 depicts the main idea of the Deep-PTCM architecture. The DNN block takes as inputs the covariates X, and it is here that we define the appropriate architecture for the data at hand. For example, if we are provided with unstructured data, such as images, we may include convolutional NNs in this block [25]. The output of this block goes to the Orthogonalization layer. If the orthogonalization step is required, then η is built as the sum of a linear predictor and the projection of the DNN block output onto the orthogonal complement of that linear predictor (e.g., a subset of covariates). If not, the whole predictor η is estimated without decomposition (P⊥ = I).
For illustrative purposes, let us consider a fully connected feedforward network (FCFN) with L hidden layers. Specifically, suppose that layer l (l = 1, ..., L) has n_l neurons; hence the lth layer, g^(l): R^{n_{l−1}} → R^{n_l} (n_0 = q), follows g^(l)(z) = (a^(l)(w^(l)_1⊤z + b^(l)_1), ..., a^(l)(w^(l)_{n_l}⊤z + b^(l)_{n_l}))⊤, where a^(l): R → R is the activation function for the lth layer, w^(l)_m ∈ R^{n_{l−1}} are the weights associated with the mth neuron of the lth layer, and b^(l)_m ∈ R is the corresponding intercept. Many activation functions have been proposed (see [55], Chap. 5.1.2). One popular choice is the rectified linear unit (ReLU) [38], which is the non-linear transformation defined by a(x) = max(0, x).
Therefore, when a FCFN is considered in the DNN block, its output is formed by the composition of the L hidden layers, i.e., z_i = g^(L) ∘ ··· ∘ g^(2) ∘ g^(1)(x_i) ∈ R^{n_L}. Moreover, if no orthogonalization is performed, η(x_i) is finally computed through a final unit with a linear activation, i.e., η(x_i) = w^(L+1)⊤z_i + b^(L+1). On the other hand, if orthogonalization is carried out for all covariates, then η = 1_N b_lin + Xw_lin + P⊥η_DNN, where η_DNN collects the final-unit outputs of the DNN block, and w_lin ∈ R^q and b_lin ∈ R are, respectively, the vector of linear coefficients and the intercept. Moreover, the Endpoint Layer defines the loss function by receiving the inputs t = (t_1, ..., t_N)⊤ and δ = (δ_1, ..., δ_N)⊤, in addition to η. The loss function for the Deep-PTCM is the negative log-likelihood, introduced below in Equation (2). Thus, it is in the Endpoint Layer where F is specified in order to calculate the loss. Standard specifications in the PTCM context are the Weibull or the piecewise exponential function. Following [53,14], we use the latter, as described in Section 3.3, but other specifications can be easily accommodated.
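As a concrete sketch, the composition of ReLU hidden layers followed by a final linear unit can be written in plain NumPy; the widths and random weights below are arbitrary placeholders, not the tuned architecture used later:

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(0.0, z)

def forward(x, weights, biases):
    """Compose the hidden layers g^(L) o ... o g^(1), then a final
    linear unit that maps the last hidden layer to the scalar eta."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)              # g^(l)(z) = a^(l)(W^(l) z + b^(l))
    W_out, b_out = weights[-1], biases[-1]
    return (W_out @ h + b_out).item()    # linear activation for eta(x_i)

q, n1, n2 = 4, 8, 8                      # input dim and two hidden widths
weights = [rng.normal(size=(n1, q)),
           rng.normal(size=(n2, n1)),
           rng.normal(size=(1, n2))]
biases = [np.zeros(n1), np.zeros(n2), np.zeros(1)]

eta = forward(rng.normal(size=q), weights, biases)
assert np.isfinite(eta)
```

With a single layer, a single unit, and a linear activation, `forward` reduces to w⊤x + b, i.e., the predictor of the standard PTCM.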

Estimation of the Deep-PTCM
Traditionally, estimation in the PTCM is carried out using the EM algorithm, where K_i, the number of risk factors for subject i, is treated as missing data [14]. As we illustrate in Section 5, however, this approach does not scale well when considering NNs and large datasets. To overcome this computational limitation, we present an end-to-end framework to estimate both the predictor η and the parameters associated with F through the NN optimization problem.
For that, first note that the log-likelihood is ℓ = Σ_{i=1}^{N} { δ_i [η(x_i) + log f(t_i)] − exp(η(x_i)) F(t_i) }. (2) Considering F as a piecewise exponential function, we partition the length of the study into J intervals according to the distribution of the events, i.e., u_0 = 0 < u_1 < ... < u_J with u_J > max_{i∈{1,...,N}} t_i.
In each interval (u_{j−1}, u_j], the hazard function of the competing risk factors is assumed to be constant. Denote these constants by λ_j, j = 1, ..., J. Thus, for t ∈ (u_{j−1}, u_j], F and f can be expressed as F(t) = 1 − exp(−λ_j(t − u_{j−1}) − Σ_{g=1}^{j−1} λ_g(u_g − u_{g−1})) and f(t) = λ_j(1 − F(t)). We train this model efficiently using backpropagation [25]. This involves (i) randomly initializing the weights of the DNN units, (ii) feeding the input data through the DNN and calculating the loss, (iii) adjusting the weights of the units to minimize the loss in Equation (2), and (iv) repeating steps (ii) and (iii) until the loss on a validation set is minimized.
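The piecewise exponential pieces and the resulting negative log-likelihood can be sketched as follows; the interval grid, hazard rates, and toy data are illustrative only, and the function names are ours:

```python
import numpy as np

def cum_hazard(t, u, lam):
    """Cumulative hazard H(t): sum over intervals of lam_j times the
    length of (u_{j-1}, u_j] overlapped by [0, t]."""
    t = np.asarray(t, dtype=float)
    lengths = np.clip(t[:, None], u[:-1], u[1:]) - u[:-1]
    return lengths @ lam

def neg_log_lik(eta, t, delta, u, lam):
    """Negative log-likelihood of the PTCM with piecewise exponential F."""
    H = cum_hazard(t, u, lam)
    F = 1.0 - np.exp(-H)                        # CDF of the risk factors
    j = np.searchsorted(u, t, side="left") - 1  # interval index of each t
    f = lam[j] * np.exp(-H)                     # f(t) = lam_j * (1 - F(t))
    ll = delta * (eta + np.log(f)) - np.exp(eta) * F
    return -ll.sum()

u = np.array([0.0, 1.0, 2.0, 4.0])   # cut points u_0 < ... < u_J
lam = np.array([0.2, 0.3, 0.1])      # constant hazard per interval
t = np.array([0.5, 1.5, 3.0])
delta = np.array([1, 0, 1])
eta = np.array([0.1, -0.2, 0.0])
loss = neg_log_lik(eta, t, delta, u, lam)
assert np.isfinite(loss)
```

In the actual model, `eta` would be the DNN output and both the network weights and the λ_j would be updated jointly by gradient descent on this loss.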
Once the DNN is optimized, we can use its estimated weights for prediction. In particular, we can retrieve the DNN block to infer the predictor η for new data. Analogously, we can compute any quantity of interest, such as S_p and S, by recovering the corresponding parameters of each block. We created a Python package, deepcure, for the estimation of the Deep-PTCM, which is available on GitHub. The implementation uses TensorFlow, allowing for seamless integration of all available optimizers and additional features provided by the framework.

Performance Metrics
In the empirical study presented in Section 6, we evaluate the performance of the models under two metrics.
The area under the receiver operating characteristic curve (AUC) for cure proportions measures how well the model distinguishes between cured and non-cured subjects, while the integrated Brier score (IBS) measures calibration throughout the whole study period. We describe these metrics in the following.

AUC for Cure Proportions (AUC cure )
The AUC [24] is commonly used in survival analysis to evaluate the performance of a corresponding model. However, the classical formulation does not take cure proportions into account. The receiver operating characteristic curve can be regarded as the curve formed by the true positive rate (TPR) and the false positive rate (FPR) for all cutoff points c in [0, 1]. Asano et al. [3] propose the imputation-based AUC for mixture cure models, and [53] extend it to the PTCM. This version of the AUC evaluates the TPR and FPR with respect to the probability of being cured. Denote the estimated long-term survival probability by π̂(x_i) := lim_{t→∞} Ŝ_p(t; x_i) = exp(−exp(η̂(x_i))), where η̂(·) is the point estimate of η(·). Then, for a given cut-off point c, the estimates of TPR and FPR are TPR(c) = Σ_i (1 − π̂(x_i)) 1(π̂(x_i) ≤ c) / Σ_i (1 − π̂(x_i)) and FPR(c) = Σ_i π̂(x_i) 1(π̂(x_i) ≤ c) / Σ_i π̂(x_i), where 1(·) denotes the indicator function. AUC_cure is calculated using trapezoidal integration over c ∈ [0, 1].
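A minimal sketch of this metric, under our reading of the weighted TPR/FPR construction above; the toy predictions and the function name are ours, not the paper's implementation:

```python
import numpy as np

def auc_cure(pi_hat, grid_size=101):
    """Imputation-style AUC over estimated cure probabilities pi_hat.

    Each subject contributes weight (1 - pi_hat) to the positives
    (non-cured) and pi_hat to the negatives (cured); a subject is
    classified as non-cured when pi_hat <= c."""
    cutoffs = np.linspace(0.0, 1.0, grid_size)
    pos_w, neg_w = 1.0 - pi_hat, pi_hat
    tpr = np.array([(pos_w * (pi_hat <= c)).sum() / pos_w.sum() for c in cutoffs])
    fpr = np.array([(neg_w * (pi_hat <= c)).sum() / neg_w.sum() for c in cutoffs])
    # trapezoidal integration of the ROC curve over c in [0, 1]
    return float(np.sum(0.5 * (tpr[1:] + tpr[:-1]) * np.diff(fpr)))

rng = np.random.default_rng(2)
eta_hat = rng.normal(size=500)
pi_hat = np.exp(-np.exp(eta_hat))   # pi(x) = exp(-exp(eta(x)))
auc = auc_cure(pi_hat)
assert 0.0 <= auc <= 1.0
```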

Integrated Brier Score (IBS)
The Brier score [9] corresponds to the mean squared error of the predicted probabilities in binary classification. In the survival context, we can estimate whether a subject survives beyond a specific time t. Graf et al. [26] introduced a generalization of the Brier score to handle censoring. This is the version we use, and it is specified as BS(t) = (1/N) Σ_{i=1}^{N} [ Ŝ_p(t; x_i)² 1(t_i ≤ t, δ_i = 1) / Ĝ(t_i) + (1 − Ŝ_p(t; x_i))² 1(t_i > t) / Ĝ(t) ], where Ĝ(·) is the Kaplan-Meier estimator of the censoring survival function. By integrating the time-dependent Brier score over time, we obtain the integrated Brier score (IBS) [26].
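A simplified sketch of the censoring-weighted Brier score at a fixed time t; the toy data, the naive tie handling in the Kaplan-Meier step, and all function names are ours:

```python
import numpy as np

def km_censoring(t_obs, delta):
    """Kaplan-Meier estimator of the censoring survival function G(t),
    treating censoring (delta == 0) as the 'event' (ties handled naively)."""
    order = np.argsort(t_obs)
    t_sorted = t_obs[order]
    censored = (delta[order] == 0)
    n = len(t_obs)
    surv, s = [], 1.0
    for i in range(n):
        if censored[i]:
            s *= 1.0 - 1.0 / (n - i)   # n - i subjects still at risk
        surv.append(s)
    surv = np.array(surv)

    def G(t):
        idx = np.searchsorted(t_sorted, t, side="right") - 1
        return np.where(idx < 0, 1.0, surv[np.clip(idx, 0, None)])

    return G

def brier_score(t, S_pred, t_obs, delta, G):
    """Censoring-weighted Brier score BS(t) in the spirit of Graf et al."""
    w_event = ((t_obs <= t) & (delta == 1)) / np.maximum(G(t_obs), 1e-12)
    w_alive = (t_obs > t) / np.maximum(G(np.full(len(t_obs), float(t))), 1e-12)
    return float(np.mean(S_pred**2 * w_event + (1.0 - S_pred)**2 * w_alive))

rng = np.random.default_rng(3)
n = 300
t_obs = rng.exponential(1.0, size=n)
delta = rng.integers(0, 2, size=n)
G = km_censoring(t_obs, delta)
S_pred = np.full(n, 0.7)   # placeholder predicted S_p(t; x_i) at t = 1.0
bs = brier_score(1.0, S_pred, t_obs, delta, G)
assert np.isfinite(bs) and bs >= 0.0
```

Averaging `brier_score` over a grid of evaluation times (with trapezoidal weights) would give the IBS.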

Simulation study
The purpose of this section is threefold. First, we illustrate how the proposed estimation framework scales to the large sample sizes commonly seen in the credit context. Second, using simulation setups identical to those presented in [53], we show the computational advantages of estimating the model through an end-to-end trained DNN architecture versus iteratively optimizing the DNN in the maximization step of the EM algorithm. Finally, we analyze how the orthogonalization procedure can recover the structured linear predictor without compromising performance compared to the setting without orthogonalization (unrestricted).

Simulation Design
We study three sample sizes N: 50,000, 100,000, and 150,000 subjects. The works presented in Table 1 have, on average, ∼30,000 subjects, with a maximum N of 80,641. Therefore, we consider 50,000 a relevant starting sample size in this context and scale it up to 150,000, which is roughly the size of our dataset (and the largest we are aware of). One can argue that the cure models used for credit applications so far, commonly estimated via the EM algorithm, could not be scaled due to computational burdens. Our framework does not have those constraints.
We evaluate four simulated scenarios: three presented in [53] and a fourth in which we added a linear component to study the orthogonalization feature.All scenarios are described in detail in Appendix A.

Summary of Results
Table 2 shows the comparison between the EM implementation (EM-PTCM) and the Deep-PTCM for each combination of sample size (N) and scenario. The column Time is the average time in minutes needed to estimate the model for the corresponding setting. Moreover, the columns ∆S, ∆S_p, and ∆η show the mean square difference between the true and estimated quantities S, S_p, and η, respectively.
That is, e.g., ∆S = (1/R) Σ_{r=1}^{R} (1/N) Σ_{i=1}^{N} (Ŝ^(r)(t_i) − S^(r)(t_i))², where Ŝ^(r)(·) and S^(r)(·) are, respectively, the estimated and the true survival function for replication r. The other cases follow analogously. These metrics are evaluated on 100 (R = 100) holdout datasets with the same data generation process but different random seeds. The minimum between the two approaches is shown in bold. We observe that the mean square differences for S_p and η are generally lower for the Deep-PTCM than for the EM-PTCM. In the case of S, this difference is not so clear. Nevertheless, since both implementations are meant to estimate the same model, it is not surprising that these results are comparable. The great advantage, however, is that the Deep-PTCM is significantly faster than the EM implementation (more than 100 times in some cases) without compromising accuracy.
To analyze orthogonalization, we create a fourth setting with η = η_lin + η_non, in which η_lin = b_lin + w_lin,1 x_1 + w_lin,2 x_2 + w_lin,3 x_3 and η_non follows similarly to the one defined in Scenario 2. Table 3 summarizes the results from 100 independent replications comparing the Deep-PTCM and its version with orthogonalization. We note that, in general, the performance of both models is similar with respect to the mean square differences. Moreover, Figure 2 depicts the 2.5-97.5% range of the estimates of b_lin, w_lin,1, w_lin,2, and w_lin,3 across the 100 replications. We observe a suitable recovery of the true parameter values (dashed vertical lines), especially when increasing the sample size.


Application

Data
We analyze the publicly available single-family loan-level dataset from Freddie Mac. The dataset contains loan-level origination and monthly performance data for fixed-rate US mortgages and is periodically updated. The event of interest is credit default, defined as the moment the loan is 90 or more days past due.
The training set contains 149,561 loans granted between 2009 and 2011. The test set includes 49,888 loans granted in 2013. The monitoring periods for both sets run from loan origination to December 2021. The data include eleven categorical and eight numerical variables. Tables 5 and 6 in the Appendix describe the categorical and the numerical variables, respectively. Some categorical variables present high cardinality, which can be challenging from an estimation perspective (poor generalization and high resource usage [12]). Common practice is either to drop these variables, thus discarding valuable information, or to transform the attributes into numerical representations, such as target encoding [37]. To compare different preprocessing practices for these variables when estimating the standard PTCM, we employ target encoding, one-hot encoding, and principal component analysis (PCA) for dimensionality reduction. On the other hand, when estimating the deep version, we only use one-hot encoding, arguing that the DNN should be able to generalize well without further preprocessing steps. The dimension of the feature space after encoding is 921. We normalized all numerical variables to make the training procedure more efficient.
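For illustration, the two encodings can be contrasted on a toy categorical variable; the data and level names below are made up:

```python
import numpy as np

categories = np.array(["CA", "TX", "CA", "NY", "TX", "CA"])  # toy factor
defaulted = np.array([1, 0, 0, 1, 0, 1])                     # toy binary target

# One-hot encoding: one indicator column per level; the feature space
# grows linearly with the cardinality of the variable.
levels = np.unique(categories)                # ['CA', 'NY', 'TX']
one_hot = (categories[:, None] == levels[None, :]).astype(float)
assert one_hot.shape == (6, 3)

# Target encoding: replace each level with the mean target observed for
# it, keeping a single column. In practice the means are computed on
# training folds only, to avoid target leakage.
level_means = {lvl: defaulted[categories == lvl].mean() for lvl in levels}
target_enc = np.array([level_means[c] for c in categories])
```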
Figure 3 illustrates the distribution of default events as a function of duration (left) and the number of defaults with respect to the borrowers at risk over calendar time (right). The solid blue line corresponds to the moving average for a six-month window.
Figure 3: Single-family loan-level dataset from Freddie Mac. Left: distribution of default events versus duration. Right: ratio between the number of default events and borrowers at risk over calendar time. The solid blue line is the moving average for a six-month window.

Network Architecture and Training
For the available data, we use an FCFN in the DNN block (cf. Figure 1). This network architecture is often employed in tabular data settings and can be effective for learning complex relationships within features. Although we use this type of network to predict the time to default in a mortgage portfolio, we emphasize that since the Deep-PTCM implementation uses the TensorFlow framework, any architecture available there can be adapted to the needs of other applications.
To train the network, the architecture of the FCFN needs to be tuned to achieve high prediction accuracy. This includes defining the number of layers, the number of units for each layer, the activation functions, which optimization algorithm to use, etc. These hyperparameters are not learned during backpropagation and must be set manually. Hyperparameters can significantly impact the performance of the model, and several tuning strategies have been proposed [28]. We use the random search strategy commonly employed in DL [5]. We then chose the combination of hyperparameters that achieved the minimum average loss over three execution runs per trial with independent random initializations. Running each trial multiple times avoids making the final decision strictly dependent on the initial random values.
In addition, to prevent overfitting when training each trial, we use early stopping. We set a maximum number of epochs and a "patience" parameter, which is the number of epochs to wait before deciding to stop the training process. During training, we monitor the model's performance on a hold-out (validation) set not used in the optimization. If the performance on the validation set does not improve for a certain number of epochs (the patience parameter), the training is stopped.
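The early-stopping logic described above can be sketched generically; the `step` and `val_loss` callbacks are placeholders for one training epoch and a validation evaluation:

```python
def train_with_early_stopping(step, val_loss, max_epochs=100, patience=5):
    """Generic early-stopping loop: stop when the validation loss has
    not improved for `patience` consecutive epochs."""
    best, wait, best_epoch = float("inf"), 0, 0
    for epoch in range(max_epochs):
        step(epoch)                  # one pass of gradient updates
        loss = val_loss(epoch)       # evaluate on the hold-out set
        if loss < best:
            best, wait, best_epoch = loss, 0, epoch
        else:
            wait += 1
            if wait >= patience:
                break
    return best_epoch, best

# Toy validation curve: improves for three epochs, then degrades.
losses = [1.0, 0.8, 0.6, 0.65, 0.7, 0.71, 0.72, 0.73, 0.74]
epoch, best = train_with_early_stopping(lambda e: None,
                                        lambda e: losses[e],
                                        max_epochs=len(losses), patience=3)
assert (epoch, best) == (2, 0.6)
```

TensorFlow provides the same behavior out of the box via its `EarlyStopping` callback.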
After hyperparameter tuning, the resulting network architecture for the DNN block of the Deep-PTCM has three layers (two hidden plus the output layer). The first two hidden layers have 512 units, ReLU activation functions, and a dropout rate of 0.2. The output layer, representing the predictor η, has one unit and a linear activation function. The best optimization algorithm was found to be stochastic gradient descent (SGD; see [55], Chap. 12.4) with a learning rate schedule that follows inverse time decay with an initial rate of 0.01, a decay rate of 0.75, and 100 decay steps. The final model is re-trained using the entire training set (see Table 7 in the Appendix for further details on the hyperparameter search space).
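With these settings, the inverse time decay schedule follows lr(step) = lr_0 / (1 + decay_rate · step / decay_steps), as implemented, e.g., by TensorFlow's `InverseTimeDecay`; a quick check of the formula:

```python
def inverse_time_decay(step, initial_rate=0.01, decay_rate=0.75, decay_steps=100):
    """lr(step) = lr_0 / (1 + decay_rate * step / decay_steps)."""
    return initial_rate / (1.0 + decay_rate * step / decay_steps)

assert inverse_time_decay(0) == 0.01                       # starts at lr_0
assert abs(inverse_time_decay(100) - 0.01 / 1.75) < 1e-12  # after 100 steps
```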

Results
Figure 4 illustrates the Kaplan-Meier curve for the whole population (left) and the estimated survival function S(t) = 1 − F(t) of the risk factors (right). The transparent curves represent 500 estimations of the survival function based on resampling with replacement. S(t) is sometimes mistakenly interpreted as the survival function of non-cured subjects (e.g., [53]). But since F(t) is the CDF of the risk factors, and the time to event is when the first one is triggered (cf. Section 3.1), S(t) represents an upper bound on the survival function of the susceptible individuals [40]. Since the Kaplan-Meier estimator does not control for cured and non-cured subjects, it calculates, for instance, that the probability of default, or "not surviving", would be ∼5% after ten years of payments. However, the Deep-PTCM estimates that if a subject belonged to the susceptible population, the probability of default would not be lower than ∼35%.

Table 4 depicts the results obtained on the test dataset for the performance metrics described in Section 4. The numbers in parentheses are the standard deviations obtained from 100 bootstrap samples of the same size as the original data. We notice that among the three PTCMs, the best discrimination, as measured by AUC_cure, is obtained by the version with one-hot encoding (PTCM). In terms of calibration, as measured by the IBS, the PCA-PTCM showed the minimum among the three, but the difference is not significant. Moreover, compared to the deep versions, none of the three PTCMs with linear predictors performed better in discrimination.
Between the two deep versions, we note that orthogonalization does not improve predictive performance in this case study, although it always brings interpretability gains. The best results for discrimination and calibration are achieved by the Deep-PTCM, with an AUC cure of 0.88 (versus 0.85 for the PTCM) and an IBS of 0.022 (versus 0.023 for the PCA-PTCM). To illustrate how the best-performing model, the Deep-PTCM, measures the effect of the numeric covariates on η, and how this compares to the linear models, we emulate fictional borrowers with all covariates centered at their base values except the one in question, for which we create a grid within its range. Results for three numeric covariates are presented in Figure 5. One first notices that the deep version captures non-linear relationships between some covariates and the predictor. Two remarkable examples are the combined loan-to-value ratio cltv and the borrower's external credit score fico. Both covariates show the expected signs reported in the credit risk literature when linearity is assumed [51, 36]: greater values of cltv are associated with a greater risk of default, and greater fico values with a lower risk. We find the same trend, but in a non-linear form.
In the cltv case, we observe that the Deep-PTCM, like the PTCM, estimates similar risk increments between 50 and 120. However, for values below 50 or above 120, the risk estimated by the deep version is lower.
For fico, the effect between 550 and 750 estimated by the Deep-PTCM is stronger than that shown by the PTCM. Yet, if the score assigned by the credit bureau is above ∼ 750 (good creditworthiness), the risk measured by the deep version decreases comparatively.
In addition, we note that there are covariates, such as the interest rate int_rt, for which both the PTCM and the Deep-PTCM estimate a linear relationship, even though the latter is not restricted to do so. The effects of the other numerical covariates are shown in the Appendix.
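The "fictional borrower" construction behind Figure 5 can be sketched as follows: every covariate is held at its base value while the covariate of interest sweeps a grid, and the fitted predictor η is evaluated on each row. The predictor below is a toy stand-in of our own, not the trained Deep-PTCM network.

```python
import numpy as np

def effect_curve(predict_eta, base_row, col, grid):
    """Evaluate eta on copies of `base_row` where column `col` sweeps `grid`."""
    X = np.tile(base_row, (len(grid), 1))
    X[:, col] = grid
    return predict_eta(X)

# Toy stand-in for the trained network's predictor: a non-linear
# effect in the first column, a linear one in the second.
toy_eta = lambda X: np.tanh(X[:, 0]) + 0.4 * X[:, 1]

base = np.zeros(2)              # "fictional borrower" at base values
grid = np.linspace(-3, 3, 61)   # grid within the covariate's range
curve = effect_curve(toy_eta, base, 0, grid)
assert curve.shape == (61,)
```
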
The Deep-PTCM can reveal not only the non-linearities of single covariate effects but also potential interactions. To illustrate this, Figure 6 visualizes slices of the bivariate interactions of the pairs int_rt-fico and int_rt-cltv. We observe that the effect of int_rt does not change substantially for fico values below 600. Similarly, for loans with interest rates close to 6%, the effect of cltv is maintained for values greater than 80. The traditional PTCM cannot provide this information. Overall, we conclude that the Deep-PTCM is able to recover the simpler embedded PTCM without requiring this (often too restrictive) assumption to be made in advance.
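A slice such as the one in Figure 6 can be produced the same way, pinning one covariate of the pair at a chosen value while the other sweeps its grid. Again, the predictor is a toy stand-in with a hand-coded interaction, not the fitted network; the 600-point threshold merely mimics the qualitative pattern described above.

```python
import numpy as np

# Toy predictor with an explicit int_rt-fico interaction (our own
# stand-in, not the trained Deep-PTCM): the rate effect switches on
# only for fico >= 600.
toy_eta = lambda X: 0.5 * X[:, 0] * (X[:, 1] >= 600) - 0.001 * X[:, 1]

def interaction_slice(predict_eta, base_row, sweep_col, grid, fixed_col, value):
    """Sweep `sweep_col` over `grid` while pinning `fixed_col` at `value`."""
    X = np.tile(base_row, (len(grid), 1))
    X[:, sweep_col] = grid
    X[:, fixed_col] = value
    return predict_eta(X)

base = np.zeros(2)
rates = np.linspace(2.0, 8.0, 25)  # e.g. an interest-rate grid
low = interaction_slice(toy_eta, base, 0, rates, 1, 550.0)   # fico fixed low
high = interaction_slice(toy_eta, base, 0, rates, 1, 750.0)  # fico fixed high
# With the toy predictor, the rate effect is flat when fico is low:
assert np.allclose(low, low[0])
```
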
The estimates for the linear effects further support this conclusion. For the variables cltv, fico, and int_rt, these are 0.30/0.26, −0.41/−0.52, and 0.39/0.38 for the PTCM and Deep-PTCM-Ort models, respectively (see the Appendix for more details). However, the increased flexibility of the Deep-PTCM-Ort comes at the cost of more unknown parameters and, therefore, greater parameter uncertainty. Nevertheless, given the slightly better predictive performance of the Deep-PTCM and our simulations, which demonstrate that this uncertainty can be significantly reduced with more data, we believe our approach makes a valuable and innovative contribution to future large-scale credit risk applications.

Discussion
Survival models assume that all individuals are, sooner or later, prone to the event of interest. However, there are applications, such as mortgage default prediction, where some individuals are evidently not susceptible to the event. Under these circumstances, cure rate models are preferable. The literature on credit risk modeling with cure fractions mostly considers the MCM approach (cf. Table 1); another class of models, the PTCM, has not received the same level of attention in this context.
We propose a reformulation of the PTCM, the Deep-PTCM, that simultaneously estimates the covariate effects and the parameters of the survival distribution in an end-to-end DNN. This allows us, on the one hand, to account for complex and often more realistic non-linear relationships between covariates and survival and, on the other, to obtain a computationally efficient approach that scales well to large datasets, such as those seen in credit risk applications. Moreover, interpretability can be crucial when choosing a model. While DL offers flexibility and superior predictive power, it is often criticized for its lack of explainability. Our approach leverages the advantages of DL together with an orthogonalization method that facilitates interpretability. We aim to provide an accurate and transparent framework, allowing decision-makers to make informed decisions without imposing restrictive linearity assumptions a priori.
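The end-to-end objective can be sketched as the right-censored promotion time cure negative log-likelihood. Under the standard PTCM with θ_i = exp(η_i), an observed event contributes θ_i f(t_i) exp(−θ_i F(t_i)) and a censored observation contributes exp(−θ_i F(t_i)). The numpy sketch below assumes an exponential F purely for illustration; it is not the deepcure implementation.

```python
import numpy as np

def ptcm_nll(eta, t, delta, rate=1.0):
    """Negative log-likelihood of the promotion time cure model.

    eta:   predictor per subject (theta = exp(eta))
    t:     observed times; delta: 1 = event, 0 = censored
    F, f:  exponential CDF/density with `rate`, used here for illustration.
    """
    theta = np.exp(eta)
    F = 1.0 - np.exp(-rate * t)          # CDF of the risk-factor times
    log_f = np.log(rate) - rate * t      # log density
    # Event: log theta + log f(t) - theta F(t); censored: -theta F(t).
    loglik = delta * (eta + log_f) - theta * F
    return -np.sum(loglik)

eta = np.array([0.2, -1.0, 0.5])
t = np.array([1.0, 3.0, 0.5])
delta = np.array([1, 0, 1])
nll = ptcm_nll(eta, t, delta)
assert np.isfinite(nll)
```

In the Deep-PTCM this quantity is minimized by SGD, with η produced by the DNN block and the parameters of F trained jointly.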
Via simulations, we demonstrate the scalability of our method compared to an existing EM-based approach, reducing the average training time in some cases to one hundredth. In addition, we show that the Deep-PTCM can significantly improve discrimination and calibration metrics compared to the standard PTCM when predicting the time to default in a large US mortgage portfolio. Finally, we further explore how the DNN's flexibility captures the effects of the covariates on the predictor, observing its ability both to detect deviations from the linearity assumed by the classic PTCM and to recover linearity when the evidence supports it.
Despite concentrating on one specific use case, credit risk modeling, we emphasize that the statistical and computational properties of the Deep-PTCM ensure broader applicability to time-to-event data in other fields. To facilitate this, we provide a Python package, deepcure, that accompanies the paper and is available on GitHub.
Ultimately, we envision high-potential research directions for the Deep-PTCM. In particular, the framework can be extended to include covariate effects on both the predictor η and the CDF F [50]. One possibility is a second DNN block that takes as input, in addition to t and δ, a covariate matrix (the same as X or a different one), and outputs F and f. Together with η from the first DNN block, these would be the new inputs to the Endpoint Layer. The appeal of this setup is that it permits analyzing short- and long-term effects without strong parametric assumptions on F, by including covariates and modeling them via a DNN block.

Figure 4 :
Figure 4: Left: Kaplan-Meier curve. Right: survival function of the risk factors S(t) (dashed). The blue-shaded curves are obtained from 500 bootstrap samples with replacement.

Figure 5 :
Figure 5: Comparison of the effect of numerical covariates on the predictor η for four cure models.

Table 1 :
List of references in the credit risk literature with cure models.T|N|X refer to the maximum performance periods (months), the sample size, and the number of covariates (before preprocessing).

Table 2 :
Simulation results for 100 independent replications. Time is the average time in minutes needed for training.

Table 4 :
AUC cure and IBS results evaluated in the test set.

Table 5 :
Categorical variables in the single-family loan-level dataset from Freddie Mac. Variable (number of levels): description.
(property type variable, name not recovered): Denotes whether the property type secured by the mortgage is a condo / planned unit development / manufactured housing / single-family / cooperative share
rel_ref_ind (2): Indicates if the loan is part of a relief refinance program (yes/no)
servicer_name (24): Entity acting in its capacity as the servicer of mortgages to Freddie Mac (24 servicers)
zipcode (860): First 3 digits of the zipcode of the property

Table 6 :
Numerical variables in the single-family loan-level dataset from Freddie Mac.