Optimal causal decision trees ensemble for improved prediction and causal inference

Ensemble methods can be used to identify causal relationships in data, providing a better understanding and supporting sound decisions in high-risk processes. This paper explores the idea of a causal decision tree forest and proposes a regularized ensemble method that integrates optimal causal trees for improved prediction accuracy without compromising the accurate estimation of heterogeneous treatment effects. The proposed method selects a subset of the most accurate causal trees from a sufficiently large pool based on their out-of-sample error estimates. The selected trees are integrated to form an ensemble that is used for estimating heterogeneous treatment effects and predicting unseen data. The proposed method is applied to Pakistan's income function, an example dataset consisting of 27,964 observations on the wages of workers aged 10 and above. The paper gives a detailed simulation study where datasets are generated under 5 different designs. The proposed method is assessed against ordinary least squares (OLS), the least absolute shrinkage and selection operator (LASSO), Ridge regression, the causal tree and the standard decision tree forest via mean square error (MSE), root mean square error (RMSE), mean absolute deviation (MAD) and Pearson correlation (r) as performance metrics. The analyses given in the paper reveal that the proposed method can be used effectively for estimating heterogeneous treatment effects and achieves better prediction performance than the rest of the methods given in the paper.


I. INTRODUCTION
The identification of causal relationships in data is key to providing a better understanding and the knowledge needed for taking accurate decisions in high-risk processes. Such relationships are usually established with the help of experiments, which are effective but, at the same time, costly and difficult to conduct [1]. Observational studies can also be used to find causal relationships in data [2]; these are tested by taking a sample from historical data or by observing the characteristic of interest over a period of time, which makes observational studies time-consuming.
Machine learning is generally used for accurately predicting unknown data based on learning from known data. However, the purpose of using machine learning methods can sometimes go beyond prediction, such as representing and discovering causal relationships in data and estimating heterogeneous causal effects. This kind of application provides a compact and precise graphical representation of the causal relationships between a set of predictor attributes and an outcome attribute. Various machine learning methods are used for obtaining the desired results; typical examples include classification and regression trees, k-nearest neighbour models and support vector machines.
Methods for the identification of causal relationships in data should be capable of identifying the causal effect without any prior knowledge. Moreover, these methods should be capable of dealing with high-dimensional datasets efficiently. Classification methods such as decision trees [3] can identify causal relationships in data using a supervised learning approach where the response variable is known or fixed. Such methods are commonly used in medical and social data analyses, for example. However, these methods are not specifically designed for identifying causal relationships in data, and hence can provide incorrect estimates of causal relationships.
Classification methods, like random forests, are fast and can find causal signals in data effectively. These methods, with the provision of scalability and automation, are important for exploring causal relationships in large datasets. To this end, researchers have proposed several machine learning methods for exploring causal relationships. Although these methods serve the purpose of finding causal signals, they fail to achieve high accuracy. Therefore, the aim of this paper is to achieve both causal exploration and high prediction accuracy by proposing a causal decision tree ensemble. This will help economists in predicting and answering various causal questions for policy implementation in a machine learning framework.
Various authors have suggested that combining weak models leads to efficient ensembles. Moreover, combining the outputs of multiple classifiers also reduces generalisation error. Ensemble methods have high efficacy in that the different models involved have different inductive biases, where such diversity reduces variance error while not increasing the bias error [4]-[6]. As the number of trees in a random forest is often very large, there has been significant work on the problem of minimising this number to reduce computational cost without decreasing prediction accuracy [7]-[10]. The overall prediction error of a random forest is highly associated with the strength of the individual trees and their diversity in the forest. This idea is backed by Breiman's [11] upper bound for the overall prediction error of a random forest, Err ≤ ρ · err_t, where t = 1, 2, . . . , B, B denotes the number of trees in the forest, Err is the overall prediction error of the forest, ρ represents the weighted correlation between residuals from two independent trees, i.e. the mean (expected value) of their correlation over the entire ensemble, and err_t is the prediction error of the t-th tree in the forest.
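The interplay between tree strength and residual correlation in Breiman's bound can be illustrated numerically. The sketch below (hypothetical data, not the paper's experiments) simulates out-of-sample residuals for B trees that share a common noise component, then compares the mean pairwise residual correlation, the mean individual-tree error and the error of the averaged ensemble:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hedged numerical illustration of Breiman's bound Err <= rho * err_t:
# each tree's residuals share a common component, so their errors are
# positively correlated and averaging cannot remove the shared part.
B, n = 50, 2000
shared = rng.normal(scale=0.5, size=n)                  # noise common to all trees
residuals = shared + rng.normal(size=(B, n))            # per-tree residuals

err_t = (residuals ** 2).mean(axis=1)                   # per-tree squared error
err_bar = err_t.mean()                                  # mean individual-tree error
corr = np.corrcoef(residuals)                           # B x B residual correlations
rho_bar = (corr.sum() - B) / (B * (B - 1))              # mean off-diagonal correlation

ensemble_err = (residuals.mean(axis=0) ** 2).mean()     # error of the averaged forest
# averaging shrinks the idiosyncratic part, so the ensemble error approaches
# rho_bar * err_bar as B grows, while staying well below err_bar
```

This motivates the selection strategy in this paper: keeping only strong (low-error) trees lowers err_t, which tightens the bound.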
Generally, a random forest is based on a large number of base trees, and researchers have always tried to minimise this number in order to gradually shrink the cost of computation without negatively affecting the prediction accuracy. Overall, the prediction error of a random forest is strongly connected with the accuracy of the individual trees and their diversity in the forest. The proposed method selects a subset of the best causal trees in terms of their individual strength, i.e. accuracy, from a large ensemble grown by the causal random forest. The selected trees are combined into an ensemble for predicting unknown data and estimating heterogeneous treatment effects in the data. The proposed method is applied on an example dataset from the labour force survey (LFS) of Pakistan for causal effects exploration. The paper also gives a detailed simulation study of the proposed method in comparison with the causal decision tree, causal random forest, ordinary least squares (OLS) linear regression, the least absolute shrinkage and selection operator (LASSO) and ridge regression for further assessment. For judging the efficacy of the newly developed method, the conditional average treatment effect (CATE), average treatment effect (ATE), mean square error (MSE), root mean square error (RMSE), mean absolute deviation (MAD) and Pearson correlation coefficient (r) are used as performance measures.
The remainder of this paper is organised as follows. Section II provides a summary of the related work done in the literature; Section III presents a detailed description of the method proposed in this paper; Section IV gives the analyses conducted in this paper based on the simulated and real datasets; and Section V concludes the findings.

II. RELATED WORK
Extensive research has been done in the literature on estimating parameters of interest, such as heterogeneous treatment effects. Some well-known methods consist of local maximum likelihood and the local generalized method of moments, such as [12]-[17]. Applications of these techniques in the field of economics include multinomial choice models for longitudinal data, i.e. [18], and instrumental variables regression, i.e. [19]. To estimate the parameters at a particular value of the covariates, the core idea is to use a kernel weighting function in order to place more weight on nearby observations in the covariate space. The main problem with such techniques is that, if the feature space is high dimensional, their performance can suffer from the problem known as the "curse of dimensionality" [20].
Authors in [21] replaced kernel weighting with forest-based weights, i.e. weights obtained from the fraction of trees that place an observation in the same leaf as the target covariate vector. The study in [11] proposed a random forest algorithm for non-parametric classification and regression, building on insights from the ensemble learning literature [22]-[25]. Random forests viewed as a type of adaptive nearest-neighbour prediction are closely related to the studies given in [26] and [27], which are forest-based methods for quantile regression and survival analysis. Authors in [28]-[34] have used gradient-based test statistics to identify change points in likelihood models.
According to [35], numerous data mining techniques, such as classification, k-nearest neighbours and sequential pattern mining, are sequentially applied to identify similarity in decision trees. The authors in [36] used a unified Granger causality analysis (uGCA) framework for sequential medical imaging. The study in [37] proposed a hierarchical probabilistic graphical model to simultaneously handle the classification of multi-sensor and multi-resolution remote sensing of the same scene. This study combines a hierarchical Markov model with quadtree structures to model the necessary information present at various spatial scales, a planar Markov model to tackle contextual spatial information at each resolution, and an ensemble of causal decision trees for pixelwise modelling. Further reading on the methods for causal analysis used in machine learning can be found in the recent literature given in [38]-[44].
Recursive partitioning models using gradient-based test statistics were considered in [45]. Several authors in [46]-[48] achieved statistical stability by using a random forest resampling mechanism. Similarly, another study in [49] adopted a greedy and non-parametric regression technique utilising a gradient-based approximation. Several studies [27], [46]-[48], [50]-[62] have considered regression problems using random forest algorithms. Athey, Tibshirani and Wager [21] proposed a computationally efficient method for generating generalised random forests (GRF). The estimates of this method are consistent and asymptotically normal, thereby providing valid confidence intervals. The method is designed to handle three main tasks, i.e., heterogeneous treatment effects, nonparametric quantile regression and conditional average treatment effects via instrumental variables.
The computational burden of a tree ensemble can also be decreased without compromising prediction accuracy by combining a small number of diverse and accurate trees. This can be achieved by using out-of-bag prediction errors from each training bootstrap sample in order to select the optimal trees on the basis of their individual performance [63]. The proposed method is a modified version of the generalised random forest [21]: it involves generating a large number of causal trees and then selecting the proportion of those trees whose error rates are smallest among all the constructed causal trees.

III. METHODS
This section provides a detailed description of the method used in this paper. Before introducing the proposed method, it is deemed important to introduce the causal decision tree as a building block of the suggested optimal causal trees ensemble (OCTE).

A. CAUSAL DECISION TREE
Using a decision tree for estimating heterogeneous treatment effects is entirely different from using the classical decision tree for classification and regression problems. The classical decision tree uses a function mapping an individual's characteristics to the response variable. This can be illustrated by the tree given in Figure 1, as discussed in [64], where use of the matchmaking mobile app "Tinder" is examined as a treatment for a particular disease. The decision tree is unable to identify the true causal effect, since the majority of "Tinder" users are young people rather than old people; moreover, old people have a smaller chance of recovery from the disease than young people. Thus, the comparison between two groups that are not actually comparable leads to a misleading decision tree. A standard decision tree may therefore be an appropriate choice for predicting recovery from the disease, but it fails to identify the true cause of the disease. Furthermore, its nodes possess no causal interpretation.
FIGURE 1: A standard classification tree

On the other hand, the causal decision tree calculates the average of the treated and untreated observations in each node and then computes the difference between these averages, which represents the actual treatment effect in that node.
Estimating the individual treatment effect, i.e. τ_i = Y_1i − Y_0i, is not possible in real-world problems, because the outcome of the i-th individual is either Y_1i (the sample is treated) or Y_0i (the sample is untreated). One of the two outcomes, i.e. Y_1i or Y_0i, has to be predicted using the counterfactual model (the potential outcome model). For example, suppose an individual has salary Y_0i and education below secondary level (untreated). We want to know the i-th individual's salary Y_1i if the person had education above secondary level (treated), i.e., X_i = 1. The average treatment effect of a group (population) is simply the average of the individual treatment effects in the population, i.e., τ = (1/n) Σ_{i=1}^{n} τ_i. A general workflow of a causal decision tree is given in Figure 2.
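The relationship between individual effects and the ATE can be made concrete on synthetic data, where (unlike in real problems) both potential outcomes are known for every individual. The sketch below is purely illustrative; all names and coefficients are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic example: both potential outcomes Y0 and Y1 are generated, so the
# individual effects tau_i = Y1 - Y0 and their average (the ATE) can be
# computed directly -- something that is impossible with observed data.
n = 1000
x = rng.normal(size=n)                      # covariate driving heterogeneity
y0 = 2.0 + x + rng.normal(size=n)           # untreated potential outcome
y1 = y0 + 1.5 + 0.5 * x                     # treated outcome: effect 1.5 + 0.5x
tau_i = y1 - y0                             # individual treatment effects
ate = tau_i.mean()                          # average treatment effect, approx. 1.5
```

In real data only one of `y0` or `y1` is observed per individual, which is why the counterfactual must be predicted, e.g. by a causal tree.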
Decision trees for causal inference are generally used to separate data into buckets in order to estimate the average treatment effects within each node. The process of decision tree learning for causal inference can be separated into two steps, commonly referred to as the splitting step and the estimation step, respectively. A causal decision tree can therefore be used effectively for estimating heterogeneous treatment effects in a computationally efficient manner. However, its prediction in regression and classification problems is less efficient.
Although a single causal decision tree model is interpretable and fast, the estimates it returns for heterogeneous treatment effects might not be generalisable. Therefore, the ensemble of causal decision trees can solve this problem by providing robust estimates for causal relationships at the cost of interpretability without significantly increasing computational cost. Combining a few accurate and diverse causal decision trees could provide improved estimates and might be taken forward in the direction of improving interpretability of the standard causal decision trees ensemble.
This work aims at improving the causal random forest through best-tree selection, for size reduction and improved estimation.

FIGURE 2: A standard causal decision tree

To achieve this, B sub-samples are taken from the given training data L = (X, Y, W), where X is the feature space, Y is the response and W is the binary treatment. A causal decision tree is grown on each sub-sample. The performance of the trees is evaluated on out-of-sample observations and the trees are ranked accordingly. Trees having the smallest error estimates on the out-of-sample observations are selected, while the rest of the trees are discarded. The selected trees are then combined to form the optimal causal tree ensemble.
Partitioning of the given training data L = (X, Y, W) is carried out randomly into two non-overlapping groups: a training part used to grow the causal trees and a validation part L_v = (X_v, Y_v). While doing so, a random subset of p′ < p features is selected from the entire set of features at each node of the causal tree (where p is the total number of features and p′ is the size of the subset taken at random from the p features). This adds additional randomness to the causal trees. As the observations in L_v = (X_v, Y_v) take no part in the training phase of the causal trees, error estimates are calculated for these out-of-sample observations and the trees are ranked with respect to their error estimates. The error estimates taken in this paper are the standard errors for out-of-sample observations from each tree. The final ensemble is constructed by choosing the top-ranked M causal trees, i.e. those with the smallest errors.

B. STEPS OF THE PROPOSED (OCTE)
The proposed algorithm considers the following steps to assess the treatment effect:
1) Select B sub-samples from the training part of the dataset, i.e. L = (X, Y, W).
2) Use the generalized random forest to grow a causal decision tree on each sub-sample.
3) Rank the causal trees according to their out-of-sample predicted standard errors.
4) Choose the best M causal trees, i.e. those having the smallest individual prediction standard errors on out-of-sample observations.
5) Combine the M selected trees to form an optimal causal decision tree forest and use it to predict the treatment effect of new/test data points.
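The steps above can be sketched end to end in code. The paper grows its causal trees with the generalized random forest (an R implementation); as a minimal numpy-only stand-in, each "tree" below is a depth-1 causal stump (one random split, with each leaf's effect estimated as the treated/control mean difference), and trees are ranked by out-of-sample error against the transformed outcome Y* = 2Y(2W − 1), whose conditional mean equals the treatment effect when P(W = 1) = 0.5. All names and the stump construction are illustrative, not the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data with a known heterogeneous effect: tau = 1 for X <= 0, 2 for X > 0.
n, B, M = 2000, 60, 12
X = rng.normal(size=n)
W = rng.integers(0, 2, size=n)                 # binary treatment, P(W=1) = 0.5
tau = 1.0 + (X > 0)
Y = W * tau + rng.normal(scale=0.5, size=n)
Ystar = 2 * Y * (2 * W - 1)                    # transformed outcome, E[Y*|X] = tau(X)

def fit_stump(x, y, w, cut):
    """Treated-minus-control mean on each side of `cut` (a depth-1 causal tree)."""
    return [y[side & (w == 1)].mean() - y[side & (w == 0)].mean()
            for side in (x <= cut, x > cut)]

stumps, errors = [], []
for b in range(B):                             # Step 1-2: sub-sample, grow tree
    idx = rng.choice(n, size=n // 2, replace=False)
    oob = np.setdiff1d(np.arange(n), idx)      # out-of-sample rows
    cut = rng.uniform(-1, 1)                   # random split point
    eff = fit_stump(X[idx], Y[idx], W[idx], cut)
    pred = np.where(X[oob] <= cut, eff[0], eff[1])
    stumps.append((cut, eff))
    errors.append(((Ystar[oob] - pred) ** 2).mean())   # Step 3: rank by OOB error

best = np.argsort(errors)[:M]                  # Step 4: keep the M best trees
cate = np.mean([np.where(X <= stumps[b][0], *stumps[b][1]) for b in best], axis=0)
# Step 5: the averaged predictions `cate` estimate the heterogeneous effect
```

The selected ensemble's `cate` should recover the higher effect for X > 0, mirroring how OCTE averages only its strongest trees.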
Pseudocode of the proposed method OCTE is given in Algorithm 1, along with an illustrating flow chart in Figure 3.

Algorithm 1: Optimal causal tree ensemble (OCTE)
1: Input: training data L = (X, Y, W), number of sub-samples B, ensemble size M;
2: for b = 1 to B do
3: Draw a sub-sample from L;
4: Grow a causal decision tree using generalized random forest;
5: Compute error rate on out-of-sample observations;
6: Save all the trees;
7: Save the out-of-sample errors;
8: end for
9: Rank the causal trees based on the out-of-sample standard error;
10: Select top ranked M causal trees;
11: Combine the M selected causal trees to construct the optimal causal tree ensemble (OCTE);
12: Use OCTE for estimating treatment effects and predicting unseen data.

In this paper, the proposed OCTE is assessed using five different simulation scenarios. It is then compared with five state-of-the-art methods, i.e., OLS, LASSO, Ridge, causal tree and causal random forest.
The OCTE is also applied on a real dataset, the nationally representative Labour Force Survey of Pakistan (LFSP). The LFSP data include records from 2017 to 2018 taken from the Pakistan Bureau of Statistics; the Labour Force Survey is a nationwide survey containing micro-level demographic and employment information from all over the country.

The simulated datasets are generated as follows.

Design 1: Normally Distributed U_j and Linear Outcome Model
1) Each cluster size is generated by drawing a random number from a normal distribution with mean I and a small standard deviation, rounded to the closest integer. All generated datasets contain about 4,000 observations, and a conditional sample size of five instances is used for (J, I).
2) For all observations, i.e. i = 1, 2, . . . , n_j in the j-th cluster, simulate the individual-level confounders X_ij = (X_1ij, X_2ij, X_3ij), the measured cluster-level confounder Z_j, and the unmeasured cluster-level confounder U_j.

3) The individual treatment status W_ij is generated from a logistic propensity score model, with random error r_ij ∼ N(0, 1) for the i-th sample in the j-th cluster.
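The treatment-assignment step can be sketched as follows. The coefficients and dimensions below are hypothetical placeholders, not the paper's exact design values:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hedged sketch of generating treatment status W_ij from a logistic propensity
# model with cluster-level and individual-level confounders.
J, n_j = 40, 100                                   # clusters, individuals per cluster
Z = rng.normal(size=J)                             # measured cluster-level confounder
U = rng.uniform(-2, 2, size=J)                     # unmeasured cluster-level confounder
X = rng.normal(size=(J, n_j, 3))                   # individual-level confounders X_ij
lin = 0.3 * X[..., 0] - 0.2 * X[..., 1] + 0.4 * Z[:, None] + 0.3 * U[:, None]
p = 1.0 / (1.0 + np.exp(-lin))                     # propensity score P(W_ij = 1)
W = rng.binomial(1, p)                             # treatment status W_ij
r = rng.normal(size=(J, n_j))                      # random errors r_ij ~ N(0, 1)
```

The outcome is then formed from the confounders, W and r according to each design's outcome model.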

Design 2: Uniformly Distributed U_j and Linear Outcome Model
This design uses the same data-generating model as Design 1; the only difference is that U_j (the unmeasured cluster-level confounder) has a uniform distribution, i.e. U_j ∼ U(−2, 2).

Design 3: Uniformly Distributed U_j, Linear Outcome Model, and Misspecified Working Models
This design is also similar to Design 2, but higher-order terms in the outcome model are ignored.

Design 4: Uniformly Distributed U_j and Nonlinear Binary Outcome Model
This design is also similar to Design 2, but the outcome model is nonlinear and binary.

Design 5: Uniformly Distributed U_j and Linear Outcome with Exponential Error Model
The construction of Design 5 is almost identical to Design 2, with U_j uniformly distributed and a linear outcome model, but with an exponential error term.

All five designs consist of about 4,000 samples, and the data in each design are divided into 70% training and 30% testing parts. MSE, RMSE, MAD and the correlation coefficient (r) are used as performance measures for both the conditional average treatment effect (CATE), i.e. τ_i, and the average treatment effect (ATE), i.e. τ.
For CATE, 500 realizations are made under each design, and MSE, RMSE, MAD and the correlation coefficient (r) are calculated, reporting their average values. The metrics used are MSE = (1/n) Σ_i (τ̂_i − τ_i)², RMSE = √MSE, MAD = (1/n) Σ_i |τ̂_i − τ_i|, and the Pearson correlation r between the estimated and true effects.

Table 4 shows the results of the proposed method (OCTE) and all the other methods considered in this study in terms of the conditional average treatment effect (CATE). The results suggest that the proposed OCTE outperformed all the other state-of-the-art procedures on almost all five designs. The results are also shown in the form of bar plots in Figures 4-7. The proposed OCTE provides the minimum mean square error (MSE) on the first four designs, while the causal forest (CF) gives the optimal value for Design 5. OLS, LASSO, Ridge and CT did not perform well on any design. In terms of root mean square error (RMSE), the proposed OCTE performed better than the other methods on four designs, and CF outperformed the others on Design 5. Apart from OCTE and CF, none of the remaining methods outperformed the rest on any of the designs. Similarly, OCTE gives optimal results in terms of mean absolute deviation (MAD) compared to the other methods. In terms of Pearson's product-moment correlation coefficient (r), OCTE also outperformed the other procedures on Design 3 and has similar performance on Designs 1 and 4, while OLS outperformed the rest of the methods on Designs 4 and 5. Table 5 shows the results for the average treatment effect (ATE) under the 5 scenarios. It is evident from the results that the proposed OCTE performed better than the other methods in four scenarios in terms of MSE, RMSE and MAD, and it outperformed them in three scenarios in terms of correlation. The CF method, on the other hand, gave the same correlation value in Design 3 as OCTE and outperformed the other methods in Design 4. However, LASSO achieved the best correlation in Design 5.
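The four metrics are standard and straightforward to compute; the sketch below shows them on hypothetical estimated and true effect vectors (names illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# Performance metrics comparing estimated vs. true treatment effects.
tau_true = rng.normal(loc=1.5, scale=0.5, size=500)
tau_hat = tau_true + rng.normal(scale=0.2, size=500)    # a hypothetical estimate

err = tau_hat - tau_true
mse = np.mean(err ** 2)                    # mean square error
rmse = np.sqrt(mse)                        # root mean square error
mad = np.mean(np.abs(err))                 # mean absolute deviation
r = np.corrcoef(tau_hat, tau_true)[0, 1]   # Pearson correlation
```

Note that MAD ≤ RMSE always holds, so the two metrics can rank methods differently only through the spread of the errors.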
For a visual illustration, the results obtained in Table 4 and Table 5 are also shown in the form of bar plots in Figures 4-7 and Figures 12-15, respectively.
For further assessment, boxplots of the MSE, RMSE, MAD and correlation values of CATE obtained from all 500 runs of the simulated data for each design are displayed in Figures 8, 9, 10 and 11. The boxplots reveal that the OCTE is more consistent than the other well-known methods.
Moreover, the execution (running) times in seconds of the new method and the causal forest (CF) method are also given in Table 2, where it can be noticed that, as the number of trees grows, the execution time increases linearly, i.e. f(B) = O(B), where B is the number of trees. The execution time of the OCTE is greater than that of the CF method, due to the additional tree selection step. To reduce the execution time, Step 3 of the proposed algorithm can be parallelised using existing tools, such as the "parallel" R package [69].
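Since the per-tree work (growing a tree and scoring it out-of-sample) is independent across trees, the loop is embarrassingly parallel. The paper points to R's "parallel" package; the sketch below is a hypothetical Python analogue that farms out per-tree work with a thread pool (for CPU-bound tree growing, a process pool or R's parallel backend would be the usual choice). The `grow_and_score` body is a placeholder, not the paper's tree-growing code:

```python
import concurrent.futures as cf
import numpy as np

rng = np.random.default_rng(5)

data = rng.normal(size=(2000, 5))
subsamples = [rng.choice(2000, size=1000, replace=False) for _ in range(8)]

def grow_and_score(idx):
    """Stand-in for 'grow one causal tree and return its out-of-sample error'."""
    oob = np.setdiff1d(np.arange(2000), idx)
    return float(np.var(data[oob, 0]))     # placeholder error estimate

# each sub-sample is handled by an independent worker
with cf.ThreadPoolExecutor(max_workers=4) as pool:
    errors = list(pool.map(grow_and_score, subsamples))
```

Because tree selection only needs the vector of out-of-sample errors, the ranking step itself remains cheap relative to tree growing.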

C. LABOUR FORCE SURVEY OF PAKISTAN (LFSP) AS AN EXAMPLE
In this research, the nationally representative Labour Force Survey of Pakistan (LFSP) data from 2017 to 2018 are taken from the Pakistan Bureau of Statistics. The LFSP is a nationwide survey consisting of labourers' employment information from all over the country at the micro level. The working sample is based on those in wage employment and comprises a total of 272,610 workers. The analysis is restricted to those older than 10. Missing values and unusable observations are discarded, leaving a total of 27,964 observations. The variables used to analyse each worker's wage include hours worked, education, occupation, residence (in an urban or rural area and in one of the four provinces), schooling attainment, gender, employment status, marital status, experience, industry, kind of enterprise and training. A brief description of the variables is given in Table 3:

Hours worked: the total number of hours an individual works.
X5 Gender: two classes, male and female.
X6 Marital status: classified as widowed, single, married or divorced.
X7 Employment status: categorical with four classes: paid employee, employer, self-employed and contributing family helper.
X8 Occupation: a categorical variable with nine classes showing the occupation of employees, such as manager, professional, etc.
X9 Training: a binary variable showing whether the employee has been given job training or not.
X10 Region: residence of the employee, with two categories, rural and urban.
X11 Province: province of residence, with four categories, i.e. Khyber Pakhtunkhwa, Punjab, Sindh and Balochistan.

Figure 16 shows the conditional average treatment effect (CATE) of the proposed OCTE computed for the LFSP data. It can be observed from the figure that the treatment (education) has a positive effect on the average income of individuals. This implies that higher education leads to higher individual income.
A similar conclusion can be drawn from the box plot constructed in the same figure. Figures 17-19 show the heterogeneity of the effect of education across the variables "Province", "Region" and "Gender", respectively. From Figure 17, it is clear that in the province of Khyber Pakhtunkhwa, the average effect of education on individual income is smaller than in the rest of the provinces. Punjab and Sindh possess almost equal average effects of education on individual income. Figure 18 indicates the heterogeneity of the education effect between rural and urban areas, where it is clear that in rural areas education has a smaller effect on individual income. In the case of the variable "Gender", education has a smaller effect on the income of males than on that of females (Figure 19). In a nutshell, education is observed to have a heterogeneous effect across the variables gender and region, whereas across the variable province it has an approximately homogeneous effect.