The Conditional Super Learner

Using cross validation to select the best model from a library is standard practice in machine learning. Similarly, meta learning is a widely used technique in which previously developed models are combined (mainly linearly) with the expectation of improving performance relative to the individual models. In this article we consider the Conditional Super Learner (CSL), an algorithm that selects the best model candidate from a library of models conditional on the covariates. The CSL expands the idea of using cross validation to select the best model and merges it with meta learning. We propose an optimization algorithm that finds a local minimum of the problem posed and prove that it converges at a rate faster than <inline-formula><tex-math notation="LaTeX">$O_p(n^{-1/4})$</tex-math><alternatives><mml:math><mml:mrow><mml:msub><mml:mi>O</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>n</mml:mi><mml:mrow><mml:mo>-</mml:mo><mml:mn>1</mml:mn><mml:mo>/</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:math><inline-graphic xlink:href="valdes-ieq1-3131976.gif"/></alternatives></inline-formula>. We offer empirical evidence that: (1) CSL is an excellent candidate to replace stacking and (2) CSL is well suited to the analysis of hierarchical problems. Additionally, implications for global interpretability are emphasized.


INTRODUCTION
The idea of combining different models to obtain one that is better than any of its constituents (meta learning) has been explored extensively and is currently used in many applications [1], [2], [3]. Meta learning today, however, mainly consists of creating linear combinations of models (i.e., stacking). Its purpose is to improve on the accuracy of the individual models, albeit at the expense of interpretability. Related ideas are also explored in ensemble methods, which create linear combinations of simpler models. Two main ensemble methods can be highlighted: bagging and boosting [4], [5]. In bagging, models are averaged to reduce the variance of individual models and improve accuracy. In boosting, simple models are learned sequentially, reducing the bias of the estimator at each step [6].
Usually thought of independently of meta learning, the use of cross validation to select the best algorithm from a library (either different models or different hyperparameters) is widely popular [7]. Establishing the theoretical basis for designing an oracle algorithm that selects the best from a library of models (using cross validation), Van der Laan et al. demonstrated that cross validation can be used more aggressively than previously thought, terming the cross-validation selector the "super learner" [8]. Specifically, it was shown that if the number of candidate estimators, K(n), is polynomial in the sample size, then the cross-validation selector is asymptotically equivalent to the oracle selector, the one that knows the best algorithm [8]. Like the empirical use of cross validation, the super learner proposes to select one model from the library for all the observations. Everything else being equal, we would prefer that a simple model be selected, to afford interpretability and easier deployment. However, if the data generating process is complex and one uses simple models in the library, the model selected as best in one region might not be the best in another. Therefore, using simple models in the library can introduce bias and decrease accuracy.
In the present article, we develop an algorithm that selects the best model from a library conditional on the covariates, called here the Conditional Super Learner (CSL). This meta algorithm can be thought of as learning in the cross-validation space. With the CSL, therefore, we investigate a meta-learning strategy that reduces the bias of the models in the library by selecting them conditional on the covariates. We show how the CSL has implications for both accuracy and interpretability. Additionally, we provide extensive empirical results on how the CSL can be seen as an alternative to stacking and, due to its hierarchical nature, an excellent candidate for the analysis of hierarchical problems. Specifically, in this article we:
1) Develop the theoretical foundations for the Conditional Super Learner: an algorithm that selects the best model from a library conditional on the covariates.
2) Provide convergence rate theorems and inequalities for the CSL.
3) Illustrate how the CSL is a generalized partitioning algorithm that finds different boundary functions (not just vertical cuts, as CART does) with M-estimator algorithms at the nodes.
4) Establish the connection between CSL and interpretability.
5) Show empirically how CSL improves over the regular strategy of using cross validation to select the best model for all observations.
6) Show empirically how CSL can give better R²/accuracy than stacking on a set of regression and classification problems.
7) Show empirically how CSL performs in the analysis of hierarchical data.

CONDITIONAL SUPER LEARNER
The algorithms that we discuss in this paper are supervised learning algorithms. The data are a finite set of paired observations X = {(x_i, y_i)}_{i=1}^N. The input vectors x, whose components are called here covariates, are assumed to lie in R^p, while y can be either a regression target or a classification label.
We propose to solve supervised learning problems by 1) dividing the input space into a set of regions that are learned by an iterative algorithm and 2) fitting simple interpretable models, which we call "experts", to the data that fall in these regions. The regions are learned by fitting a multi-class classification model, which we call the "oracle", that learns which expert should be used in each region. Given an oracle o(x), region k is defined as {o(x) = k}, that is, the set of points for which the oracle predicts that the expert F_k(x) should be used.
An example of an application of the CSL is shown in Fig. 1. Here we have 4 variables with which to predict house prices: bedrooms, bathrooms, latitude and longitude. Two of these variables (latitude and longitude) are given to the "oracle". The rectangular region shows how the oracle divides the (normalized) latitude and longitude into 3 regions. Each region has its own expert, represented here by a tree diagram, to make predictions. Each expert receives all 4 variables as input.
As with any meta-learning algorithm, the Conditional Super Learner algorithm for learning the oracle o(x) (given the fits of the K experts) is applied to a cross-validated data set, using V-fold cross validation. That is, each observation Y_i falls in one of the V validation samples and has a corresponding training sample. We couple each observation Y_i with the K expert algorithms trained on subsets (given by the current best estimate of the oracle) of its corresponding training data set, thereby creating a cross-validated data set of N observations. In this section, for the sake of explaining the conditional super learner algorithm, this formality of cross validation is suppressed, but in our theoretical section (see supplementary material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2021.3131976) we make the full conditional super learner algorithm formal.

Definition of CSL
Given an oracle o(x) and K expert models {F_k(x)}_{k=1}^K fitted on each of the corresponding regions {o(x) = k}, the CSL is defined as

CSL(x) = F_{o(x)}(x) = Σ_{k=1}^K 1{o(x) = k} F_k(x),   (1)

where o(x) ∈ {1, 2, ..., K}. CSL(x) is the Conditional Super Learner that outputs the prediction of the best model F_k(x), selected from a library of K models conditional on the covariate x. The idea is to find the o(x) and corresponding fits {F_k(x)}_1^K that minimize a given loss function over the training data:

argmin_{o, F_1, ..., F_K} Σ_{i=1}^N L(y_i, CSL(x_i)).   (2)
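In code, the defining equation amounts to routing each observation to the expert the oracle names. A minimal sketch (the toy oracle and constant experts below are our own illustration, not the paper's implementation):

```python
import numpy as np

def csl_predict(x, oracle, experts):
    """Conditional Super Learner prediction, Equation (1): the oracle maps
    covariates to an expert index in {0, ..., K-1}; each point gets the
    prediction of its assigned expert."""
    x = np.asarray(x)
    k = oracle(x)                      # region index per observation
    preds = np.empty(len(x))
    for j, expert in enumerate(experts):
        mask = (k == j)
        if mask.any():
            preds[mask] = expert(x[mask])
    return preds

# Toy example: the oracle splits on the sign of the first covariate,
# and the two experts are constant models.
oracle = lambda x: (x[:, 0] > 0).astype(int)
experts = [lambda x: np.full(len(x), -1.0), lambda x: np.full(len(x), 1.0)]
x = np.array([[-2.0], [3.0]])
print(csl_predict(x, oracle, experts))  # [-1.  1.]
```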

Fitting the Oracle
To find the solution to Equation (2) we employ a trick often used in machine learning: we iterate between solving for o(x) and solving for {F_k(x)}_1^K. To solve for o(x) we assume that all {F_k(x)}_1^K (the library) are known and that we also have unbiased (i.e., cross-validated) estimates of the loss at each training point, L(y_i, F_k(x_i)). In this case, the CSL aims to find the o(x) that minimizes the loss function over the training data:

argmin_o Σ_{i=1}^N L(y_i, Σ_{k=1}^K 1{o(x_i) = k} F_k(x_i)),   (3)

and, using the definition of the indicator function, we can take the sum outside the loss function and get Equation (4):

argmin_o Σ_{i=1}^N Σ_{k=1}^K 1{o(x_i) = k} L(y_i, F_k(x_i)).   (4)

To introduce how we fit the oracle, we define a new dataset that we call the "extended" dataset. This is a dataset for a multi-class classification problem with K classes, each class corresponding to one of the expert models. This dataset has K · N observations: each x_i appears K times, with corresponding labels {1, ..., K} and a specific weight.

Definition 1. Given dataset X = {(x_i, y_i)}_1^N and expert functions F = {F_k}_1^K, we define the extended dataset X_F = {(x_{i,k}, z_{i,k}, w_{i,k})} with K · N observations, where x_{i,k} = x_i, z_{i,k} = k, and

w_{i,k} = l_{i,+}/(K - 1) - l_{i,k}

is a weight on observation (x_{i,k}, z_{i,k}), with l_{i,k} = L(y_i, F_k(x_i)) and l_{i,+} = Σ_{k=1}^K l_{i,k}.

Lemma 2.1. Solving problem (4) is equivalent to finding the oracle o(x) that minimizes the weighted misclassification error on the extended dataset X_F:

argmin_o Σ_{i=1}^N Σ_{k=1}^K w_{i,k} 1{o(x_i) ≠ k}.   (5)

Proof. First, note that the misclassification loss for a multi-class classification problem can be written as L(y, f(x)) = 1{f(x) ≠ y}. We want to write Equation (4) as the misclassification loss of a classification problem. For each observation (x_i, y_i), consider the weighted dataset {(x_i, 1, w_{i,1}), (x_i, 2, w_{i,2}), ..., (x_i, K, w_{i,K})} with misclassification loss Σ_{k=1}^K w_{i,k} 1{o(x_i) ≠ k}. That is, we want to find, for each observation (x_i, y_i), weights (w_{i,1}, ..., w_{i,K}) such that

Σ_{k=1}^K w_{i,k} 1{o(x_i) ≠ k} = L(y_i, F_{o(x_i)}(x_i)).   (6)

Since o(x_i) can only take values in {1, ..., K}, we can consider all the options. For example, if o(x_i) = k, the equality in Equation (6) becomes L(y_i, F_k(x_i)) = Σ_{j=1}^K w_{i,j} - w_{i,k}. Considering all possible values of o(x_i) gives a set of K equations that can be written in matrix form as

(1_{K×K} - I_K) w_i = l_i,

where w_i = (w_{i,1}, ..., w_{i,K})^T and l_i = (l_{i,1}, ..., l_{i,K})^T, which gives

w_i = (1_{K×K} - I_K)^{-1} l_i,   i.e.,   w_{i,k} = l_{i,+}/(K - 1) - l_{i,k}.

As a result of Lemma 2.1, o(x) is the solution of a multi-class classification problem on the extended dataset X_F. We approximate o(x) by fitting any standard classification algorithm on X_F.
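Under the lemma, fitting the oracle reduces to a single weighted multi-class fit on the extended dataset. A minimal sketch (the depth-1 tree mirrors the oracle used in our first experiment; the function name and setup are ours):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_oracle(X, losses):
    """Fit the oracle on the 'extended' dataset of Definition 1.

    `losses` is an (N, K) array of cross-validated losses
    l_{i,k} = L(y_i, F_k(x_i)). Weights follow Lemma 2.1:
    w_{i,k} = l_{i,+}/(K-1) - l_{i,k}. Note the weights are guaranteed
    non-negative only for K = 2; for larger K the classifier must
    tolerate arbitrary sample weights."""
    N, K = losses.shape
    w = losses.sum(axis=1, keepdims=True) / (K - 1) - losses
    X_ext = np.repeat(X, K, axis=0)       # each x_i appears K times...
    z_ext = np.tile(np.arange(K), N)      # ...once per candidate expert
    oracle = DecisionTreeClassifier(max_depth=1)
    oracle.fit(X_ext, z_ext, sample_weight=w.ravel())
    return oracle
```

For example, with K = 2 experts where expert 0 is lossless on negative x and expert 1 on positive x, the fitted oracle recovers the split at zero.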

Fitting the Experts
Similarly to the previous section, to fit the experts we assume that o(x) is known. Then Equation (2) becomes K independent classification/regression problems, each minimizing the empirical loss over the observations {i : o(x_i) = k}, for k = 1, ..., K; these are generally already solved by standard machine learning algorithms.

A Two Step Algorithm
The way the oracle is fitted given the experts {F_k(x)}_1^K, and vice versa, indicates that Equation (2) can be minimized iteratively. Following the philosophy of the K-means algorithm, we propose minimizing Equation (2) in two steps: one to fit the oracle and the other to fit the experts. Please note that, since the loss function decreases every time either step is applied, convergence to a local minimum is guaranteed. Of course, this is only true if we use, for each observation, the estimate of L(y_i, F_k(x_i)) on the training data; this, however, will most likely result in overfitting. With this discussion in place, we are ready to state the Conditional Super Learner pseudocode (see above).
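The two-step loop can be sketched as follows. This is an illustration only, with linear experts, a shallow-tree oracle, and squared-error losses computed on the training set rather than the cross-validated losses the full algorithm requires; all names are ours:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

def fit_csl(X, y, K=2, n_iter=5, seed=0):
    """Two-step CSL sketch: alternate between fitting the K experts on
    their current regions and refitting the oracle on the weighted
    extended dataset of Lemma 2.1."""
    rng = np.random.default_rng(seed)
    assign = rng.integers(K, size=len(y))           # random initial regions
    experts = [LinearRegression() for _ in range(K)]
    oracle = None
    for _ in range(n_iter):
        # Step 1: fit each expert on the observations routed to it.
        for k in range(K):
            m = assign == k
            if m.sum() >= 2:
                experts[k].fit(X[m], y[m])
        # Step 2: refit the oracle on the extended dataset with weights
        # w_{i,k} = l_{i,+}/(K-1) - l_{i,k} (squared-error losses here).
        losses = np.column_stack([(y - e.predict(X)) ** 2 for e in experts])
        w = losses.sum(axis=1, keepdims=True) / (K - 1) - losses
        oracle = DecisionTreeClassifier(max_depth=2, random_state=0)
        oracle.fit(np.repeat(X, K, axis=0),
                   np.tile(np.arange(K), len(y)),
                   sample_weight=w.ravel())
        assign = oracle.predict(X)
    # Final expert refit so experts match the last oracle assignment.
    for k in range(K):
        m = assign == k
        if m.sum() >= 2:
            experts[k].fit(X[m], y[m])
    return oracle, experts
```

On a toy piecewise-linear problem this loop typically recovers the two regimes; as discussed later, restarts help because the procedure only reaches a local minimum.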

RELATED WORK
To further the understanding of the CSL, let us highlight its relationships and connections with different algorithms. First, please note that o(x) partitions the space into K regions or subsets {R_k}_1^K in which the models {F_k(x)}_1^K are used for prediction, thereby establishing a connection between meta learning and generalized partitioning machines through the CSL. Unlike recursive partitioning algorithms such as CART, MediBoost or the Additive Tree [9], [10], [11], the partitions defined by the CSL oracle can have complex forms and are not forced to be perpendicular to the covariate axes.
CSL(x) also generalizes the strategy of using cross validation to select the best model. Please note that if in Equation (1) we force o(x) = c, where c is a constant in {1, ..., K}, then the solution to (4), ô(x), simply selects the model that minimizes the cross-validation error. Using cross validation to select the best model is thus the simplest case of CSL(x), in which the meta learner predicts a constant regardless of the covariates.
CSL can also be thought of as a generalization of the K-means algorithm: if the expert models are constants and the oracle has infinite capacity, so that it can always assign each observation to the best mean, then CSL becomes the K-means algorithm.
Another algorithm closely related to the CSL is the Hierarchical Mixture of Experts (HME) [12]. In both cases, a hierarchical topology is found. In the HME, however, this hierarchy depends on a parametric specification of the probability distribution, while the CSL can estimate the hierarchical topology in a non-parametric fashion with a rich set of meta algorithms that can include linear models, trees or neural networks. Equally important, the HME predicts with a combination of experts, which harms interpretability, while the CSL does not. Finally, as shown above, due to its architecture, the CSL is a non-parametric hierarchical algorithm and performs quite well on this type of problem.

ALGORITHM ASSESSMENT
In this section, we describe our experiments. We consider regression and classification problems from the Penn Machine Learning Benchmarks [13]. In the first set of experiments, datasets with between 200 and 500,000 observations (N = 84) were considered. In the second set of experiments, we selected the subset of those with at least 2000 points (N = 19). Table 1 describes the characteristics of the classification datasets used in our second set of experiments. In the third set of experiments, we consider synthetic hierarchical data.
We report R² as the performance metric for regression and accuracy for classification. In the first set of experiments, datasets were split into 80% training and 20% testing sets. The second set of experiments requires a validation set, so the data were split 70%/15%/15% into train/validation/test sets.
In all our experiments, we use a fixed set of six base algorithms, or experts. Although the CSL can be used with more complex models as experts, using simple linear or tree models protects the algorithm from overfitting and improves interpretability, as discussed below.

Single Step CSL and its Comparison to Cross Validation
First, we wanted to evaluate empirically how CSL (the F_3 class of meta learners in the theoretical section presented in the supplemental material, available online) performs compared to the naive strategy of using the expert model selected through cross validation for all the points (the F_1 class in the theoretical section). We used the library of experts described above and a decision tree algorithm with max depth = 1 as our oracle. No partition was allowed if a terminal node would contain fewer than 2% of the observations. This decision tree algorithm was selected as the oracle because, if no partition is performed, the minimization of Equation (1) reduces to selecting the model that minimizes the cross-validation error (equal to the F_1 class in the supplemental material, available online). As such, our CSL in this case includes the possibility of just using cross validation, and it should perform, on average, at least as well as using cross validation to select one model. Please note that in this section the experts are fitted on all the training data and only one iteration is allowed for the meta learner. The empirical performance of this CSL was compared to that obtained by using cross validation to select the best model from among the six algorithms mentioned above. Fig. 2 shows the results obtained when we compared R² for both CSL and cross validation. In 77.5% of the datasets CSL had at least the same performance as cross validation, and in 20% its R² was larger by at least 1%. Although these results show that CSL improves over naive cross validation, overfitting can happen and extra attention needs to be paid. In fact, we obtained an outlier where cross validation outperformed the CSL by 0.125. This problem is the synthetic dataset 658_fri_c3_250_25 from Friedman's regression datasets [13], [14]. It is a hard problem on which algorithms are prone to overfit, since the training set contains only 200 points with 25 explanatory variables, several of them correlated with each other. In the next section, we compare a full CSL to stacking: the F_2 versus F_3 meta-learning strategies in our theoretical discussion (supplemental material, available online).

CSL Versus Stacking
Stacking [1], [3], [15] is a general procedure in which a learner is trained to combine the individual learners. The base-level models are trained on the original training set, and then a meta-model is trained using the outputs of the base-level models as features. The base level often consists of different learning algorithms. In our experiments, we use the same 6 base models defined in the previous section. As the meta-model in stacking we use a linear regression model.
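For concreteness, this stacking baseline can be assembled with scikit-learn's StackingRegressor. The three base learners below are an illustrative stand-in for the six-expert library, not our exact configuration:

```python
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

# Base learners are fitted on the training data; the linear meta-model is
# trained on their (internally cross-validated) predictions.
stack = StackingRegressor(
    estimators=[("ridge", Ridge()),
                ("tree", DecisionTreeRegressor(max_depth=5)),
                ("rf", RandomForestRegressor(n_estimators=50, random_state=0))],
    final_estimator=LinearRegression(),
)
```

Calling `stack.fit(X, y)` then trains the whole pipeline; by default scikit-learn uses 5-fold cross-validated base predictions as the meta-model's features.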
CSL. In this experiment, we use the same set of experts as in the previous section, and a two-layer feed-forward neural network as the oracle. The oracle was written in PyTorch and fitted with the Adam optimizer with a learning rate of 0.15. The number of epochs at each iteration was a function of the sample size (3000/log(N)²), as was the size of the hidden layer (min(2 log(N), 150)). After the first linear layer and before the ReLU activation function, batch normalization was used. A dropout layer with p = 0.2 is used before the second linear layer.
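An oracle network with the layer order described above (linear, batch normalization, ReLU, dropout, linear) can be sketched in PyTorch as follows; the sizes passed in below are placeholders for the sample-size-dependent choices in the text:

```python
import torch
import torch.nn as nn

def make_oracle_net(p, K, hidden, dropout=0.2):
    """Two-layer feed-forward oracle: batch norm after the first linear
    layer and before the ReLU, dropout before the output layer. The K
    outputs are logits over the expert library."""
    return nn.Sequential(
        nn.Linear(p, hidden),
        nn.BatchNorm1d(hidden),
        nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(hidden, K),
    )

net = make_oracle_net(p=4, K=3, hidden=16)
opt = torch.optim.Adam(net.parameters(), lr=0.15)
```

Training on the extended dataset would then minimize a per-sample weighted cross-entropy, e.g. `(F.cross_entropy(net(x), z, reduction="none") * w).mean()`.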
In Table 2 we show results comparing CSL and stacking on regression datasets. For each dataset we ran each algorithm 10 times, splitting the training, validation and testing sets using different seeds. CSL_mean shows the mean R² over all experiments. Similarly, Stack_mean shows the mean R² of the stacking experiments. Diff shows the % difference between CSL_mean and Stack_mean. The column Test results shows whether a t-test found the difference in means to be significant. There were 19 datasets in this experiment. For one dataset, stacking is significantly better than CSL. On 8 problems, both algorithms are statistically the same. On 11 problems, CSL is better. In the experiment for dataset 574_house_16H, one of the runs produced an outlier, which is responsible for the mean difference.
In Table 3 we show results comparing CSL and stacking on classification datasets. For each dataset we ran each algorithm 10 times, splitting the training, validation and testing sets using different seeds. CSL_mean shows the mean accuracy over all experiments. Similarly, Stack_mean shows the mean accuracy of the stacking experiments. Diff shows the difference between CSL_mean and Stack_mean. The column Test results shows whether a t-test found the difference in means to be significant. There were 5 datasets in this experiment. For one dataset, both algorithms are statistically the same. On 4 problems, CSL is better.

CSL on Hierarchical Synthetic Data
Hierarchical models are extremely important in fields like medicine [16]. CSL, due to its architecture, seems especially suited to analyzing hierarchical data. In this section, we evaluate the performance of CSL on hierarchical synthetic problems. Specifically, we wanted to investigate: 1) whether CSL can discover hierarchical structures, and 2) how its performance on this type of problem compares to gradient boosting and random forest. Given a real dataset {(x_i, y_i)}_1^N from the Penn Machine Learning Benchmarks, we generate synthetic data by keeping the observed covariates but generating new labels. Here is how we generated the new labels (ỹ):
1) Sample 70% of the data.
2) Use the covariates to find K = 3 clusters using the K-means algorithm.
3) Fit a 2-layer neural network using the cluster id as the label.
4) Using the neural network, predict a cluster id (l_i) for each of the original observations, creating the dataset {(x_i, y_i, l_i)}_1^N.
5) Fit a regression model (Ridge) on each subset {(x_i, y_i) : l_i = k} for k ∈ {1, 2, 3}. Use the prediction from the regression model as the new synthetic label (ỹ).
For the experiments in Table 4, gradient boosting was run with the following parameters: min_child_weight = 50, learning rate = 0.1, colsample_bytree = 0.3, max_depth = 15, subsample = 0.8, and 500 trees. For random forest, we used 1000 trees, max_features = 'sqrt', and we found max_depth with cross validation for each problem. The problems are a subset of the problems in Table 2 for which the R² of the best base expert is lower than 0.9. The CSL algorithm was initialized using 11 linear models as base experts. It turns out that, on these problems, using trees together with linear models would sometimes get the algorithm stuck in suboptimal local minima, so we ran the CSL multiple times. Table 4 shows the results (R²) of CSL on the synthetic data compared to random forest, gradient boosting and the best base expert (selected using cross validation).
As can be seen, CSL significantly outperforms all the other algorithms, as expected.
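The label-generation procedure above can be sketched with scikit-learn stand-ins. The network size and other hyperparameters here are our illustrative choices, not the exact settings used in the experiments:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import Ridge

def make_hierarchical_labels(X, y, K=3, seed=0):
    """Generate synthetic hierarchical labels: cluster a 70% sample,
    teach a small network the cluster map, then replace y inside each
    induced subset by the predictions of a subset-specific Ridge fit."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=int(0.7 * len(X)), replace=False)
    # Steps 1-2: sample 70% and find K clusters on the covariates.
    clusters = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(X[idx])
    # Step 3: fit a 2-layer network on the cluster ids.
    net = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=500,
                        random_state=seed).fit(X[idx], clusters)
    # Step 4: predict a cluster id for every original observation.
    l = net.predict(X)
    # Step 5: per-cluster Ridge predictions become the new labels.
    y_new = np.empty(len(X))
    for k in range(K):
        m = l == k
        if m.any():
            y_new[m] = Ridge().fit(X[m], y[m]).predict(X[m])
    return y_new, l
```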

Implications on Interpretability
Without attempting to formally quantify and define interpretability here, we will illustrate below how the CSL results in models that are highly transparent.
Predicting House Prices. To illustrate how CSL can be used as an interpretable algorithm, we use a dataset of house rental prices in New York City. We have 4 input variables: latitude, longitude, number of bedrooms and number of bathrooms. To make the result simple to visualize and interpret, the oracle was given only two of the variables: latitude and longitude. The CSL model found a solution in which the oracle partitions the space of latitude and longitude into 3 regions (see top of Fig. 1) and, for each region, a tree of depth 5 predicts the house prices. This simple solution achieves an R² of 0.68. As a comparison, the best single tree of depth 5 has an R² of 0.62 and a random forest with 500 trees (of depth 9) has an R² of 0.72. To find the best random forest, we did a grid search over the number of variables and the depth of the trees.
The CSL solution is interpretable: a single tree of depth 5 can be easily examined, and so can 3 trees of depth 5, together with the 2-dimensional space in which the oracle splits the restricted input region. On the other hand, a random forest with 500 trees of unrestricted depth would be much harder to interpret.

THEORETICAL RESULTS
In this section, due to space constraints, we summarize the main theoretical results obtained and presented in the supplemental materials, available online. The main result, proven in Theorem S1, establishes that a Super Learner given by the function f_n will converge to the oracle f_{0n}, the true underlying function generator, at a rate at least as fast as O_p(n^{-1/4}), where n is the number of observations. Additionally, we provide a theoretical analysis of the bias-variance tradeoff incurred for different choices of the meta-learning algorithm and the base learners. Specifically, if one considers relatively large meta-learning models, as is easily the case for our conditional super learner, then there is a risk of worsening performance relative to simpler meta-learning models. Therefore, the most sensible strategy is to define a sequence of Super Learner models that goes from the simplest (cross validation) to more complex ones (trees, neural networks, etc.). This process is referred to as double super-learning, and a detailed discussion is presented in the supplemental materials, available online. We then finalize by presenting an oracle inequality for the double super-learner, showing that it is asymptotically equivalent to a super learner that knows the true oracle choice of meta-learning. Additionally, we also present a finite-sample oracle inequality for an ε-net double super-learner. These results establish that the double super-learner also approximates the oracle choice in its meta-learning model at a rate at least as fast as n^{-1/4}. We refer readers interested in diving deeper into our theoretical analysis to the supplemental materials, available online.

DISCUSSION
In the present article we introduced the CSL, provided extensive empirical simulations and baseline results on a wide variety of problems, and gave mathematical proofs of its convergence rates (see supplemental material, available online).
There are some important technical points that can help in understanding the CSL and point to possible avenues for its improvement. In our experiments, we initialized the CSL by picking the type of experts (e.g., trees, linear models, etc.) and a random subset of the data to fit each expert, to introduce diversity. CSL, like K-means, can get stuck in local minima. Therefore, we ran the algorithm a few times in our experiments and used a validation set (different from the test set) to select the best solution.
Researching better initialization methods will likely improve the performance of the algorithm. While running CSL, it is often the case that some experts collapse (in effect, model selection). This happens when the number of distinct values predicted by o(x) is less than K or, similarly, when the size of {x : o(x) = k} becomes very small. In these cases, we readjust K as the algorithm runs. This behavior is highly desirable because it allows the end user to specify as many base experts as desired, and the algorithm will perform model selection, provided enough regularization is applied.
There are different strategies to regularize the CSL. One would be to penalize overly complex meta learners using the hyper-parameters specific to the chosen model. Additionally, we can use the step in which we fit the experts to introduce regularization. If the oracle o(x) also estimates the probability of an observation belonging to a model (e.g., logistic regression), then we can use p(o(x) = k | x) to introduce similarity among the experts when fitting them, especially around the boundaries defined by the oracle. Other forms of regularization can also be explored in the future, given that the partitioning nature of the CSL makes the local models data hungry. In fact, the datasets where CSL was better had, on average, 3805 data points, compared to 2368 for those where cross validation did better. Therefore, as a general rule, CSL needs more than 2500 data points to perform better than cross validation. These empirical results corroborate the suggestions in our theoretical section and the need for a two-tier Super Learner algorithm in which the type of meta strategy, or class of functions as defined here, is also selected. They also underscore the need for regularization. In that sense, global regularization of the local parameters, like that explored in the Highly Adaptive Lasso estimator, seems appealing and will be investigated in the future [17].
Finally, we would like to finish our discussion with some notes on the implications that the CSL can have for interpretability. In many high-stakes domains like medicine and the law, interpretability of machine learning algorithms is highly desirable, since errors can have dire consequences [10], [18], [19]. This has led to a renewed interest in the development of interpretable models. Interpretability, however, can only be judged relative to the field of application.
There is considerable disagreement on what the concept means and how to measure it [20], [21], [22]. Different algorithms afford different degrees of interpretability, and even black boxes can be investigated to gain some intuition about how predictions are being made. For instance, variable importance or distillation can be used to interpret neural networks [23]. Alternatively, post hoc local model explainability is a popular topic: algorithms like LIME or SHAP build models around specific points [24], [25]. Specifically, LIME explains a complex model's behavior for a specific observation by perturbing the covariates around the point, getting the model's predictions and then fitting a linear model. Although the specific limitations of this approach in recovering the true underlying explanations are beyond the scope of our article, we must say that, at its best, LIME provides local explanations [25]. This level of interpretability might be enough for applications that do not impose high risk. In other applications (e.g., medicine), the need to understand the models globally arises [10], [11], [26].
The CSL is thus a hybrid between the idea of using one very simple model for all observations and building a different model for each one. By finding a finite number of simple models that will be used for prediction (e.g., linear models), it provides a sense of global understanding of which variables are important and what their contributions are.

CONCLUSION
In this work, we introduced the CSL algorithm. We showed, theoretically and empirically, how the idea of meta learning can be extended to develop an algorithm that outperforms the naive use of cross validation to select the best model. We proved that the CSL has a rate of convergence faster than O_p(n^{-1/4}). Moreover, we have obtained very interesting and practical results. For instance, CSL outperformed stacking on most of the datasets analyzed. Additionally, it significantly outperformed random forests and gradient boosting in the analysis of hierarchical data. Finally, its connections to interpretability and to other algorithms were highlighted to deepen our understanding of its performance. CSL is an algorithm suited for the analysis of datasets in high-stakes domains where hierarchical models, accuracy and interpretability are of importance.

Yannet Interian received the PhD degree in applied mathematics from Cornell University and the BS degree in mathematics from the University of Havana, Cuba. She is currently an assistant professor in the master's in Data Science program. Her research interests include the application of machine learning and deep learning to medical data. After a postdoctoral fellowship at UC Berkeley, she worked for five years as a data scientist at Google. She co-founded Akualab, a start-up that helped organizations develop data-driven products using machine intelligence, and has designed data science courses for both UC Berkeley and USF.
Efstathios Gennatas received the MBBS degree in medicine from Imperial College London and the PhD degree in neuroscience from the University of Pennsylvania. From 2017 to 2019, he was an assistant professional researcher at the University of California, San Francisco, and from 2019 to 2020 a research scientist at Stanford University. In August 2020, he joined the Department of Epidemiology and Biostatistics, University of California, San Francisco.
Mark Van der Laan received the PhD degree in statistics from Utrecht University in 1993. He is currently the Jiann-Ping Hsu/Karl E. Peace Professor of Biostatistics and Statistics at the University of California, Berkeley. He is a founding editor of the Journal of Causal Inference.