SDTR: Soft Decision Tree Regressor for Tabular Data

Deep neural networks have proven successful in multiple fields. However, when processing heterogeneous tabular data, researchers still favor traditional approaches that yield more interpretable models, such as Bayesian methods and decision trees. Such models are hard to differentiate and are therefore inconvenient to integrate into end-to-end settings. On the other hand, traditional neural networks are differentiable but perform poorly on tabular data. We propose a hierarchical differentiable neural regression model, the Soft Decision Tree Regressor (SDTR). SDTR imitates a binary decision tree with a differentiable neural network and is suitable for ensemble schemes such as bagging and boosting. The SDTR method was evaluated on multiple tabular regression tasks (YearPredictionMSD, MSLR-Web10K, Yahoo LETOR, SARCOS and Wine quality). Its performance is comparable with non-differentiable models (gradient boosting decision trees) and better than uninterpretable models (regular FCNN). In addition, it can produce fair results with a restricted number of parameters, using only a small forest or even a single tree. We also propose an “average entropy” metric to evaluate the level of interpretability of a trained soft decision tree neural network. This metric also helps to select a proper structure and hyperparameters for such networks.


I. INTRODUCTION
Regression refers to a category of supervised learning whose output is continuous rather than a limited set of values. In the past decades, many algorithms have been proposed for regression tasks. Roughly speaking, we can divide them into two categories, each with its own pros and cons.
• Parametric models. This category ranges from the simplest linear regression models [26] to the most complicated deep neural networks. Most of them can be optimized via backpropagation, so they can be easily integrated as components in complex, end-to-end pipelines and are applicable to a wide range of problems. However, there is an implicit tradeoff between a model's accuracy and its ease of interpretation. A simple function, for example linear regression or the Cox model [42] for survival analysis, is easy to interpret, while in real-world settings a proper approximation function can be hard to find. In most cases, such a function might not exist in the solution spaces of linear regressors or other straightforward models. When dealing with a large dataset, as in many real-world regression tasks, a complicated neural network can be trained to obtain a close approximation of the ground-truth results [39]. However, it can hardly be interpreted due to its multi-layer nature and complex connections.
(The associate editor coordinating the review of this manuscript and approving it for publication was F. K. Wang.)
• Non-parametric models. This type of regression model makes assumptions about the target distribution from the patterns observed in input samples, rather than making assumptions beforehand. Kernel regression [27] and gradient boosting tree-based methods (which have been quite popular in recent years) are members of this category. Non-parametric models are relatively easy to interpret: such a model generally tends to ''group'' similar examples into the same cluster or assign them to the same branch at a tree node. This behavior naturally reveals how the model works. However, such non-parametric models are harder to combine with other gradient-based methods.
As proposed by Frosst and Hinton [13], using deep neural nets to mimic the branching structure of decision trees helps to explain the neural network model better, thus inheriting the merits of both parametric and non-parametric models. The proposed Soft Decision Tree (SDT) model performs slightly worse than traditional convolutional neural networks on the MNIST hand-written digit classification task [22]. However, the model itself shows a clear relationship among different classes in a hierarchical fashion. It automatically groups some similar classes (such as digits 5 and 6) by assigning a common parent to those classes, without explicit supervision forcing the model to do so.
However, the potential of SDT on heterogeneous tabular data and regression tasks was left unnoticed, as [13] only examines the performance of SDT on homogeneous image classification datasets (e.g., MNIST). We adopted this idea and developed the Soft Decision Tree Regressor (SDTR) for regression tasks and heterogeneous tabular data.
In this paper, our main contributions are as follows:
• We proposed SDTR, a lightweight, differentiable, soft decision tree-based neural regression model.
• Based on observations of single-tree settings of SDTR, we developed several techniques for improving the accuracy of soft decision trees.
• We tested the performance of SDTR at different scales, including single-tree settings and ensembles with both bagging and boosting. We compared its performance against a traditional FCNN, the state-of-the-art decision tree-based NN method NODE [34], the non-hierarchical NN TabNet [2], and several non-differentiable decision tree methods (XGBoost [9] and DeepForest [49]). Among those models, SDTR is competitive for regression tasks on tabular data, and it achieves performance comparable to NODE and TabNet using fewer parameters.
• We proposed the exponentially decaying $L_1$ regularization to encourage the sparsity of the weight matrix in SDT and SDTR, making its interpretability level closer to GBDT. We also introduced the ''average entropy'' metric to evaluate the interpretability of SDTR and other soft decision tree models. This metric can be used to select a proper structure and hyperparameters.

II. RELATED WORK
A. DECISION TREES
Decision trees have been a common approach to regression problems. As proposed in [6], if one defines a node's predicting value as the constant prediction and the node's impurity as the sum of squared deviations about its samples' mean, the decision tree becomes a regression model. In the past decades, this field's primary research focus has been the gradient boosting decision tree method (GBDT) proposed in [12]. There are multiple open-source packages that implement the GBDT algorithm (for both classification and regression tasks), for example scikit-learn [33], gbm [36], XGBoost [9], LightGBM [19] and CatBoost [50]. While the core idea was left unchanged, these packages mainly focused on speed-up, parallelization, handling of large-scale datasets, and robust training.

B. SOFT DECISION TREES
Recently, Deep Neural Networks (DNNs) achieved great success in fields like Computer Graphics [16], Natural Language Processing [44], speech recognition [15], and reinforcement learning [25]. However, DNNs have met obstacles in processing heterogeneous tabular data, unable to outperform prevailing traditional models consistently.
Traditional models, in turn, are generally not differentiable and thus cannot be integrated as components in end-to-end pipelines. Arik and Pfister (2019) [2] proposed a new canonical DNN architecture, TabNet, which outperforms or is on par with traditional tabular learning models.
Another line of research tries to imitate the traditional learners with neural networks. Inspired by decision trees, which have proven capable of processing heterogeneous data, researchers have developed various sorts of differentiable, or ''soft'', decision trees and forests. Soft decision trees were first introduced by Suarez and Lutsko [40]. They performed a ''fuzzification'' process over a trained CART decision tree skeleton, replacing the hard threshold at each non-leaf node with sigmoid functions. The fuzzy model was treated as a feed-forward network and could thus be trained via backpropagation.
In recent years, soft decision trees (SDT) have been an active field of research again. Léon and Denoyer [23] proposed a computationally efficient backpropagation scheme to directly optimize the hard partitioning function at each node. Frosst and Hinton [13] proposed a regularization method to encourage a balanced split at each internal node, improving the robustness of SDT.
While traditional SDTs use a single feedforward layer with a sigmoidal activation function as their decision function, modern SDTs may choose various decision functions to resolve different problems. Bulo and Kontschieder [37] proposed a tree-shaped neural network with a randomized MLP decision function to solve semantic image labelling, and ensembled multiple such networks to make a ''decision forest''. Yang et al. [47] took a soft binning function into account, using a neural network to faithfully imitate the ''splitting'' choice made at each node split. Popov et al. [34] used a similar technique, but instead of a full binary tree, they implemented an oblivious decision tree for faster training and inference. Tanno et al. [41] took a more adaptive approach. In their work, each edge of the tree additionally stands for a transformation function (often residual).

C. ENSEMBLE LEARNING AND TREE ENSEMBLE
In real tasks where the features are highly entangled, the performance of a single tree is limited. The conclusion holds for both traditional decision trees and their soft counterparts. A widely accepted workaround is to use more trees, then aggregate the results together. This method falls into the ensemble learning category.
Ensemble learning is a well-developed direction of machine learning research. It has been utilized in various fields and applications, e.g. stock returns prediction [43], credit scoring [1], sentiment analysis [31] and text classification [30]. In these works, a decision tree often serves as a base learner, and its result is often aggregated with other base learners such as Naïve Bayes, nearest neighbor classifiers [29] and support vector machines.
Modern decision trees also use ensemble methods to increase the model's capacity. In this scope, all base learners are decision trees. Different decision trees deal with different inputs or are in charge of different positions in a pipeline. Breiman [5] proposed the random forest algorithm, organizing decision trees by bagging on random subsets of examples and features. Friedman [12] proposed the gradient-boosted trees (GBT) method, which yielded state-of-the-art performance in many fields including research institutions ranking [38], recommender systems [46], bioinformatics [8] and medical applications [21]. Zhou and Feng (2017) [49] proposed the Deep Forest approach to generate a cascade ensemble of forests, which outperforms traditional tree ensembles in various domains.
Soft decision trees can also be ensembled via a similar scheme. Kumar et al. [20] proposed an ensemble of soft decision trees for robust classification; Yıldız et al. [48] tested the bagged soft decision trees on two-class classification datasets and regression datasets. Some previously mentioned neural SDTs also utilized ensemble learning. Tanno et al. [41] trained multiple models and took the average of outputs as the final result, and observed a performance gain; Popov et al. [34] stacked the oblivious trees, feeding the output of previous trees into following ones, and improved the performance at the cost of training time.

III. METHODOLOGY
In this section, we describe the algorithm and techniques used in SDTR.

A. THE HIERARCHICAL MIXTURE OF CONSTANT PREDICTIONS
The main idea of SDTR follows the ''hierarchical mixture of bigots'' setting proposed by Léon et al. [23] and Frosst et al. [13]. Unlike [47] and [34], each bigot draws its conclusion on all feature vectors through the decision function of its own, rather than selecting a splitting point for a particular feature.
In the binary decision tree model, decisions are made by a series of hierarchical nodes. Each node can be viewed as a binary classifier: for an incoming sample, the node would decide which branch should further handle the sample.
Similarly, the SDTR structure is a full binary tree. Each node $i$ in this tree represents a binary classifier with learnable parameters $w_i$ and $b_i$. Given a certain input $x$, the classifier's output is calculated by a sigmoid function $\sigma$ and represents the probability of choosing the left branch:

$$p_i(x) = \sigma\big(\beta_i (w_i x + b_i)\big) \qquad (1)$$

To prevent decisions that are too soft, we multiply the term $w_i x + b_i$ by $\beta_i$ before applying the sigmoid. Each $\beta_i$ is initialized as a hyperparameter (in all our experiments, we choose 1.5), and the different $\beta_i$ are independently trainable. Thus the output can become soft again at some nodes when needed. Note that the choice function $p_i(x)$ does not have to be built on a linear function $w_i x + b_i$. Any function can be used to make the decision, as long as its output range is [0,1], so that we can interpret the result as a ''probability''. Multi-layered networks or convolutional networks (for image processing) are also feasible choices. Given the probability of choosing the left branch, the probability of choosing the right branch is $1 - p_i(x)$.
All nodes naturally form a hierarchical mixture of experts (HME) [18]. Each leaf node $\ell$ corresponds to a scalar $R_\ell$, which serves as the model's prediction. The label is normalized by $\tilde{y} = (y - \bar{y}) / \sigma$, where $y$ is the original label and $(\bar{y}, \sigma)$ are the mean and standard deviation of $y$ in the training set. $R_\ell$ is initialized by sampling from a standard normal distribution. For each leaf $\ell$, the probability of choosing the leaf equals the product of the probabilities of the nodes that lie along the path from the root to the leaf:

$$P^\ell(x) = \prod_{i \in \mathrm{Path}(\ell)} p_i(x)^{l_i} \big(1 - p_i(x)\big)^{1 - l_i} \qquad (2)$$

where $l_i \in \{0, 1\}$ stands for the routing (whether the next node is the left child of $i$) on $\mathrm{Path}(\ell)$. Unlike traditional neural networks, where the output of the previous layer serves as the input to the following layer, all nodes in SDTR use the same input $x$ from the dataset. This structure itself mitigates the vanishing gradient problem, making it possible to stack multiple sigmoid functions sequentially while preserving the strength of gradient descent.
We illustrate a 2-layer SDTR tree in Figure 1.
The objective is to minimize the MSE (mean squared error) between each leaf's prediction $R_\ell$ and the ground-truth target $y$; other reasonable differentiable objective functions can be used as well. Taking MSE as an example, the objective function at each leaf is weighted by its path probability:

$$L(x) = \sum_{\ell \in \mathrm{Leaves}} P^\ell(x) \, (R_\ell - y)^2 \qquad (3)$$

There are two ways to interpret the model's output to obtain the prediction $\hat{y}$.
• Take the value on the leaf with the largest probability as the prediction:

$$\ell^* = \arg\max_{\ell} P^\ell(x), \qquad \hat{y} = R_{\ell^*}$$

• Sum up the predictions on all leaves and weight them by the corresponding path probabilities:

$$\hat{y} = \sum_{\ell \in \mathrm{Leaves}} P^\ell(x) \, R_\ell$$

FIGURE 1 (caption fragment): $x$ stands for the normalized input feature (all nodes use the same $x$ from data), and we use a linear function followed by a sigmoid to generate the decision on each non-leaf node. Note that at each decisional node, we can use an arbitrary differentiable function, as long as the output vector sums up to 1.
In our experiments, the second interpretation performs better in most settings, which means the conclusion of [13] still holds for regression tasks. In the experiment section, we only use the second interpretation.
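The forward pass described in this subsection can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' PyTorch implementation: the heap-style node indexing, the function names, the toy parameter values, and the convention that `depth` counts levels of inner nodes are all choices made here for readability.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def leaf_probabilities(x, weights, biases, betas, depth):
    """Return the path probability of every leaf of a full binary tree.

    Assumed layout: inner nodes are stored in heap order, node 0 is the
    root and node i has children 2*i + 1 (left) and 2*i + 2 (right).
    `depth` counts levels of inner nodes, so there are 2**depth leaves.
    """
    n_inner = 2 ** depth - 1
    reach = [0.0] * (2 ** (depth + 1) - 1)  # probability of reaching each node
    reach[0] = 1.0
    for i in range(n_inner):
        # Every node sees the same input x (no layer-to-layer chaining).
        z = sum(w * v for w, v in zip(weights[i], x)) + biases[i]
        p_left = sigmoid(betas[i] * z)
        reach[2 * i + 1] = reach[i] * p_left
        reach[2 * i + 2] = reach[i] * (1.0 - p_left)
    return reach[n_inner:]


def predict(x, weights, biases, betas, responses, depth, soft=True):
    probs = leaf_probabilities(x, weights, biases, betas, depth)
    if soft:
        # Second interpretation: probability-weighted sum over all leaves.
        return sum(p * r for p, r in zip(probs, responses))
    # First interpretation: response of the most probable leaf.
    best = max(range(len(probs)), key=probs.__getitem__)
    return responses[best]


# Toy tree with 3 inner nodes and 4 leaves; all values are made up.
depth = 2
weights = [[1.0, -1.0], [0.5, 0.5], [-0.5, 0.25]]
biases = [0.0, 0.1, -0.1]
betas = [1.5, 1.5, 1.5]
responses = [-1.0, -0.3, 0.4, 1.2]
x = [0.2, -0.4]

probs = leaf_probabilities(x, weights, biases, betas, depth)
soft_pred = predict(x, weights, biases, betas, responses, depth, soft=True)
hard_pred = predict(x, weights, biases, betas, responses, depth, soft=False)
```

Because the leaf probabilities sum to one, the soft prediction is always a convex combination of the leaf responses, whereas the hard prediction is exactly one of them.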

B. REGULARIZATION
As in [13], we used a ''path penalty'' term that encourages a balanced tree classifier. This penalty causes each internal node to equally use both left and right sub-trees, thus avoiding getting stuck on too biased weights (the case that classifier assigns most examples to one leaf node).
We compute the penalty term for each node. Mathematically speaking, this penalty term is the cross-entropy between the ideal ''balanced'' distribution [0.5,0.5] and the actual distribution. The latter is sampled per mini-batch.
The actual distribution $[\alpha_i, 1 - \alpha_i]$ for node $i$ is defined by:

$$\alpha_i = \frac{\sum_x P^i(x) \, p_i(x)}{\sum_x P^i(x)}$$

where $P^i(x)$ is the accumulated path probability of feature vector $x$ from the root to node $i$. Then, we compute a weighted sum of the cross-entropy obtained from each non-leaf node (inner node) to get the final path penalty term:

$$L_{pp} = -\sum_{i \in \mathrm{InnerNodes}} 2^{-d} \big( 0.5 \log \alpha_i + 0.5 \log (1 - \alpha_i) \big)$$

where $d$ denotes the depth of node $i$. As mentioned in [13], a deeper node is more likely to handle a non-equal split. Its weight in $L_{pp}$ should therefore be decayed; otherwise, the path penalty can be harmful to the model's accuracy. In our early experiments, we also observed that exponentially decaying the weight according to the node's depth $d$ achieved better results.
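As a rough sketch of the path penalty described above, the snippet below estimates each node's left-branch share from a mini-batch and accumulates the depth-weighted cross-entropy against the balanced distribution [0.5, 0.5]. The decay factor `2 ** -depth` follows the convention of [13] and is an assumption here, as are the function name, the argument layout, and the `eps` clamp added for numerical safety.

```python
import math


def path_penalty(reach, p_left, depths, eps=1e-8):
    """Depth-weighted cross-entropy between [0.5, 0.5] and each node's split.

    reach[n][i]  : probability that sample n reaches inner node i
    p_left[n][i] : probability that node i routes sample n to the left
    depths[i]    : depth of inner node i (root = 0)
    """
    penalty = 0.0
    for i, depth in enumerate(depths):
        num = sum(r[i] * p[i] for r, p in zip(reach, p_left))
        den = sum(r[i] for r in reach)
        alpha = num / (den + eps)
        alpha = min(max(alpha, eps), 1.0 - eps)  # keep the logs finite
        cross_entropy = -0.5 * (math.log(alpha) + math.log(1.0 - alpha))
        penalty += 2.0 ** (-depth) * cross_entropy
    return penalty


# A single root node evaluated on a batch of two samples.
depths = [0]
balanced = path_penalty([[1.0], [1.0]], [[0.5], [0.5]], depths)
skewed = path_penalty([[1.0], [1.0]], [[0.9], [0.95]], depths)
```

A perfectly balanced node yields the minimum value log 2, and the penalty grows as the node sends most samples to one side.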
Besides that, we also regularize each node's weight matrix by an $L_1$ loss. On the one hand, this loss keeps the output in the sigmoid function's near-center region, thus helping gradient flow. On the other hand, this regularization term helps the model generate more interpretable results.
Intuitively, a decision tree's interpretability relies on two facts: it uses a hierarchical structure, and the decision function at each node is a threshold function on a particular feature. Ensembles of decision trees (both bagging and boosting) are typically regarded as black-box models. However, their interpretability is still better than that of regular deep neural networks. For example, it is possible to compute a feature importance ranking in a trained GBDT. Friedman [12] proposed a feature importance measure based on the relative influence of each individual input feature. The importance of feature $j$ on tree $T$ can be approximated by the resulting improvement in squared error loss $\hat{i}_t^2$, summed over each inner node $t$:

$$\hat{I}_j^2(T) = \sum_{t} \hat{i}_t^2 \, \mathbb{1}(v_t = j)$$

where $\mathbb{1}(v_t = j)$ denotes whether the node's splitting feature $v_t$ is feature $j$.
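The importance measure attributed to Friedman [12] above amounts to summing the split gains of all inner nodes that split on a given feature. The snippet below is a small sketch of that bookkeeping; the flat-list representation of a tree and the example gains are hypothetical and chosen only for illustration.

```python
def feature_importance(split_features, split_gains, n_features):
    """Sum the squared-error improvement of every node that splits on feature j.

    split_features[t] : index of the feature used by inner node t
    split_gains[t]    : improvement in squared error achieved by that split
    """
    importance = [0.0] * n_features
    for feature, gain in zip(split_features, split_gains):
        importance[feature] += gain
    return importance


# Hypothetical tree with three inner nodes over three features.
importance = feature_importance(
    split_features=[0, 2, 0],
    split_gains=[4.0, 1.0, 0.5],
    n_features=3,
)
```

In this toy tree, feature 0 is used twice and accumulates both gains, while feature 1 is never used and keeps an importance of zero.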
It is theoretically possible to implement the same approach in SDTR, making its interpretability better than that of traditional NNs. However, the ''split gain'' ($\hat{i}_t^2$) at each node $t$ is hard to compute. From GBDT's greedy splitting policy, we can reach the intuitive conclusion that the splitting features used at the root or top levels of the tree are more important than those used at the bottom, near-leaf levels. Since we have utilized the path penalty term to balance the distribution at each tree classifier, we assume that the near-root nodes, and their corresponding decision functions, are more important in the overall decision process.
Another difficulty lies in the decision function. As we have replaced the hard threshold function with linear sigmoid, we cannot directly use the term 1(v t = j). However, we still hope our decision functions to be sparse; thus, the most important feature(s) at a certain node would be clearer.
By introducing the $L_1$ regularization loss, we encourage sparsity at each decisional node. Based on the previous observations on $L_{pp}$ and $\hat{I}_j^2(T)$, we chose to reweight the $L_1$ loss at node $t$ in an exponentially decaying fashion, according to the node's depth $d_t$. Let $w^{(t)}$ be the $1 \times K$ weight matrix associated with an inner node $t$; the final expression of our regularization loss $L_{reg}$ is:

$$L_{reg} = \sum_{t \in \mathrm{InnerNodes}} 2^{-d_t} \sum_{k=1}^{K} \big| w^{(t)}_k \big|$$

By reweighting the $L_1$ loss, we keep the sum of $L_{reg}$ of each layer at the same magnitude. This regularization approach also controls the input norm of each decision node's sigmoid function, keeping the function's input near zero, where gradient descent is effective. Like Batch Normalization [17], this approach helps to prevent over-saturated decision nodes and avoids the risk of vanishing gradients. Our experiments show that this regularization term brings a notable performance gain.
Both the path penalty term and the $L_1$ regularization loss are integrated into the MSE loss function. Our final optimization target is the sum of these three expressions:

$$L = L_{MSE} + \lambda_1 L_{pp} + \lambda_2 L_{reg}$$

The strengths of $L_{pp}$ and $L_{reg}$ are controlled by two hyperparameters, $\lambda_1$ and $\lambda_2$.
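To make the weighting concrete, the sketch below computes a depth-decayed L1 term and combines it with the other two loss components. The decay factor `2 ** -depth` is an assumption that matches the statement that each tree level contributes at a similar magnitude (level d has 2^d nodes); any additional normalization used by the authors is not reproduced, and all numbers are made up.

```python
def depth_decayed_l1(node_weights, node_depths):
    """L1 norm of each inner node's weight vector, scaled by 2 ** -depth."""
    return sum(
        2.0 ** (-depth) * sum(abs(w) for w in weights)
        for weights, depth in zip(node_weights, node_depths)
    )


def total_loss(mse, path_penalty, l1_reg, lambda1, lambda2):
    """Final objective: MSE plus the two weighted regularizers."""
    return mse + lambda1 * path_penalty + lambda2 * l1_reg


# Root at depth 0 and its two children at depth 1 (toy values).
node_weights = [[1.0, -1.0], [0.5, 0.5], [2.0, 0.0]]
node_depths = [0, 1, 1]

l1_reg = depth_decayed_l1(node_weights, node_depths)  # 2.0 + 0.5 + 1.0
loss = total_loss(mse=1.0, path_penalty=0.5, l1_reg=l1_reg,
                  lambda1=0.1, lambda2=0.01)
```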

C. IMPROVING SDTR: SINGLE TREE
A single regression tree is weak. In a single-tree setting, since our predictions can only be made by R at each leaf node, the number of leaf nodes will influence the minimum MSE (Mean Squared Error) that we can obtain.
In order to achieve a better result in the single-tree setting, our tree needs to go deeper. However, merely increasing the depth of the tree leads to a less robust model, and such model requires more steps to converge.
We observed that the main obstacle in training a deeper SDTR lies in the gradient of the leaf nodes' weights. Consider updating the response $R_\ell$ of a single leaf node $\ell$. Taking the derivative of (3), we obtain:

$$\frac{\partial L}{\partial R_\ell} = 2 P^\ell(x) \, (R_\ell - y)$$

Since $\sum_\ell P^\ell(x) = 1$, as the model's depth increases, the norm of the gradient on each leaf node's weight shrinks exponentially, causing slow convergence in the training phase. We attempted to resolve this problem by re-weighting the learning rate of $R_\ell$.
We manually multiply the learning rate of $R_\ell$ by the total number of leaf nodes ($2^{d-1}$). Note that directly increasing the overall learning rate only results in a failing model, because the preceding decision nodes would become unstable. So we only adjust the learning rate associated with each $R_\ell$.
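The effect of this re-weighting can be illustrated with a toy calculation. The sketch assumes the idealized case in which every leaf receives the same path probability 1 / n_leaves, and it applies a plain gradient step rather than the optimizer used in the experiments; the function names and numbers are illustrative only.

```python
def leaf_gradient(path_prob, response, target):
    """Gradient of path_prob * (response - target) ** 2 w.r.t. the response."""
    return 2.0 * path_prob * (response - target)


def leaf_update(path_prob, response, target, base_lr, n_leaves, reweight):
    """One plain gradient step on a leaf response, optionally re-weighted."""
    lr = base_lr * n_leaves if reweight else base_lr
    return response - lr * leaf_gradient(path_prob, response, target)


plain_steps, reweighted_steps = [], []
for n_leaves in (2, 16, 128):
    p = 1.0 / n_leaves  # uniform routing: each leaf gets an equal share
    plain = leaf_update(p, 1.0, 0.0, 0.01, n_leaves, reweight=False)
    scaled = leaf_update(p, 1.0, 0.0, 0.01, n_leaves, reweight=True)
    plain_steps.append(abs(plain - 1.0))
    reweighted_steps.append(abs(scaled - 1.0))
```

Without re-weighting, the step taken by each leaf shrinks in proportion to the number of leaves; with re-weighting it stays the same regardless of tree size, while the learning rate of the decision nodes is left untouched.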
Besides adjusting the learning rate, we tried to use hidden layers (a fully connected layer with ReLU [28] activation) to enhance a single tree's expressive power. Before feeding input features into each decisional node, we first transform them by a fully connected (FC) layer. All decisional nodes share the same FC layer. However, such a method introduces extra parameters, and we have not seen apparent improvement in our experiments. One possible reason is that the hierarchical structure itself is sufficient for representing higher-order logic, making the preceding FC layer unnecessary.

D. IMPROVING SDTR: ENSEMBLE
To resolve the limited accuracy of a single tree, Friedman [12] and many following papers used the ''ensemble'' technique. Generally speaking, there are two ways to ensemble models: one is to take the average output of multiple different models, the so-called ''voting'' mechanism; the other is called ''boosting''. The ''boosting tree'' structure refers to a sequence of decision trees. The first tree aims to optimize the target value directly; the following trees aim to compensate for the accumulated error made by all previous trees.
The output of SDTR is a weighted average of all leaf nodes. If we view each leaf node as a base learner (which returns a constant), then a single SDTR tree is already a bagging model. Each base learner deals with a specific division of the data; the path probabilities determine the division, and the division itself is updated through backpropagation. However, this bagging scheme is limited by the single tree structure. To alleviate the problem, we initialized a forest of trees differently and then averaged their outputs. Such a forest can produce a prediction directly or serve as a ''layer'' in our boosting scheme.
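A forest of differently initialized trees whose outputs are averaged can be sketched as follows. For brevity each ''tree'' here is a depth-one soft stump, and the seeding scheme, initialization ranges, and function names are assumptions made for this illustration rather than details taken from the paper.

```python
import math
import random


def make_stump(n_features, seed):
    """Build a one-split soft tree with its own random initialization."""
    rng = random.Random(seed)
    w = [rng.uniform(-1.0, 1.0) for _ in range(n_features)]
    b = rng.uniform(-1.0, 1.0)
    left, right = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)

    def predict(x):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        p_left = 1.0 / (1.0 + math.exp(-z))
        return p_left * left + (1.0 - p_left) * right

    return predict


def forest_predict(trees, x):
    """Average the outputs of all trees in the forest."""
    return sum(tree(x) for tree in trees) / len(trees)


trees = [make_stump(n_features=3, seed=s) for s in range(8)]
x = [0.3, -1.2, 0.7]
individual = [tree(x) for tree in trees]
forest_output = forest_predict(trees, x)
```

Since the trees are evaluated independently, the forest output can be computed in parallel, which is the property the single-layer mode relies on.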
Despite being successful, the boosting method is harder to implement in an end-to-end training environment. This technique implies that some part of the whole network will be repeatedly trained while the other parts are frozen, introducing considerable overhead. Popov et al. [34] proposed a workaround for the end-to-end setting: each tree has multi-dimensional outputs (rather than a scalar), and a tree located in a successor layer can treat the outputs of all previous trees as additional input features. The final output is then given by averaging across the first dimension of all trees' outputs. The method improves accuracy but significantly increases the number of model parameters and the time consumption: for trees in successor layers, the dimension of input features increases by $O(l^2)$, thus prohibiting the model from getting deeper.
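The growth of the input dimension in such a stacked scheme can be made concrete with a short calculation. The sketch assumes that every tree in layer k receives the original features plus the outputs of all trees in layers 0 to k-1; the function name and the example sizes are hypothetical.

```python
def layer_input_dims(n_features, n_layers, trees_per_layer, tree_output_dim):
    """Input dimension seen by the trees of each stacked layer."""
    outputs_per_layer = trees_per_layer * tree_output_dim
    return [n_features + k * outputs_per_layer for k in range(n_layers)]


# Hypothetical setup: 10 raw features, 3 layers of 4 trees with 2 outputs each.
dims = layer_input_dims(n_features=10, n_layers=3,
                        trees_per_layer=4, tree_output_dim=2)
```

With realistic forest sizes the appended outputs quickly dominate the raw features, which is why deeper stacks become expensive in parameters and time.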
To alleviate the problem, we made a single tree more expressive than the oblivious decision trees implemented in [34]. We will further justify this in the experiment section. As the complexity of a single tree increases, we can achieve comparable results with fewer boosting layers or with fewer trees in each layer.

IV. EXPERIMENTS
A. OVERVIEW
We built the SDTR model with the PyTorch Python package [32] (see footnote 1). All experiments were performed on a single GTX 1080Ti. We tested the performance of SDTR on each dataset under three different conditions.
• Single-tree mode: In this mode, only one tree was used. We compare it against a fully connected neural network (FCNN) and a small ensemble of oblivious trees (NODE) [34]. The aim of this part is to show that a single SDTR model has expressive power comparable to FCNN and slightly stronger than that of its oblivious counterparts. We keep the search space of hyperparameters of SDTR identical across all experiments.
• Single-layer mode: In this mode, we constructed multiple trees and integrated them as a forest. We averaged all trees' outputs as the model's output and used it for both the training and inference phases. The single-layer approach is suitable for parallelization and is thus a useful setup in practice. The number of trees across all five datasets is n = 256.
• Multi-layer boosting mode: In this mode, we arrange multiple trees in ''layers'' as previously described in Section III-D. One layer consists of an SDTR forest, and the output of such a forest serves as additional input features for the following layers. This approach tests the best possible performance of a certain model on tabular data. We compare the performance of SDTR with other state-of-the-art models (XGBoost, LightGBM, CatBoost, NODE, TabNet and DeepForest).
Since our regularization term is computed per batch, some may argue that a bigger batch size would help train a more balanced SDTR tree. We provide ablation results on batch size in Section V for single-tree and single-layer setups, keeping all the other hyperparameters fixed; based on these observations, we keep the batch size at 1024 for the multi-layer boosting mode.
In all our experiments with SDTR, we stopped training when the validation MSE had not improved for 1000 consecutive batches.
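The stopping rule above can be expressed as a small helper that counts consecutive batches without improvement. This is a generic sketch; the class name and interface are assumptions, and the toy run uses a patience of 3 instead of 1000 to keep the example short.

```python
class EarlyStopper:
    """Signal a stop after `patience` consecutive batches without improvement."""

    def __init__(self, patience=1000):
        self.patience = patience
        self.best = float("inf")
        self.bad_batches = 0

    def update(self, val_mse):
        """Record one validation result; return True when training should stop."""
        if val_mse < self.best:
            self.best = val_mse
            self.bad_batches = 0
        else:
            self.bad_batches += 1
        return self.bad_batches >= self.patience


stopper = EarlyStopper(patience=3)
history = [1.0, 0.9, 0.95, 0.92, 0.91, 0.5]
stopped_at = None
for step, val_mse in enumerate(history):
    if stopper.update(val_mse):
        stopped_at = step
        break
```

In this toy run the best value 0.9 is reached at step 1, the next three values fail to improve on it, and training stops at step 4 before the final value is ever seen.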
Other models. We also provide results of FCNN, XGBoost [9], LightGBM [19], CatBoost [50], NODE [34], DeepForest [49] and TabNet [2], trained with the same dataset. For FCNN and NODE, we provide the results of both shallow and deep variations. We used the ''Dense-LeakyReLU-Dropout'' structure as the basic building block of FCNN. The shallow version consists of 2 layers, while the deep version consists of 7. We keep the search space of hyperparameters the same within the same experiment group (Small, Shallow, and Deep). The shallow NODE is built as a forest of n = 2048 oblivious trees, while the deep counterpart uses the previously mentioned ''boosting'' technique to stack forests together. As the number of trainable parameters in a single tree of NODE is approximately one order of magnitude smaller than in SDTR, we also provide a smaller single-layer version of NODE (NODE-small) with the same hyperparameters as NODE-shallow but consisting of n = 10 trees.
Footnote 1: Source code will be available on GitHub soon.

B. DATASETS
To evaluate the performance of the proposed SDTR model, we performed regression tasks on five open-source regression datasets: YearPredictionMSD [11], MSLR-WEB10K [35], Yahoo LETOR [7], SARCOS [45] and WINE quality [10]. For the first three datasets, we used the same train-val split for all experiments on different hyperparameters and models, identical to the one used by NODE [34]. For SARCOS, we used 20% of the samples from the train split as a validation set and used the provided test split. For WINE, which does not provide a test split, we manually split the dataset by 8:1:1. The val split is used for hyperparameter tuning and early stopping. Then, we evaluate each derived model on the test split defined by the dataset providers (if available). The train/val/test split is fixed across different models.
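A fixed 8:1:1 split of the kind described for WINE can be produced with a seeded shuffle, so that every model sees exactly the same partition. The helper name, the seed, and the sample count below are illustrative assumptions.

```python
import random


def split_811(n_samples, seed=0):
    """Return train/val/test index lists in an 8:1:1 ratio with a fixed seed."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = n_samples * 8 // 10
    n_val = n_samples // 10
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_811(1000, seed=0)
```

Reusing the same seed for every model keeps the comparison fair, since differences in the results cannot then be attributed to a different partition.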
Here we provide a brief description for each dataset.

YearPredictionMSD (Year): A dataset derived from the MillionSong dataset [4]. The features are extracted from the ''timbre'' features generated by the Echo Nest API. The target is the release year of a certain song, an integer ranging from 1922 to 2011.

Yahoo LETOR (Yahoo): Similar to MSLR-WEB10K, Yahoo LETOR is a learning-to-rank dataset with query-URL pairs, labeled by integers from 0 to 4. We treat both ranking datasets as regression tasks, using feature vectors to predict the (real-valued) label directly.

SARCOS: The regression task is to solve an inverse dynamics problem for a SARCOS anthropomorphic robot arm. We map the 21 input variables to the first of seven joint torques, transforming the multi-regression problem in [2] into a single-target one.

WINE quality (WINE): The dataset aims to model red wine quality based on physicochemical tests. Each instance corresponds to a red Vinho Verde wine sample. Input variables are physicochemical properties, for example, density, pH, and fixed acidity. Instances are labelled by a ''quality score'' (between 0 and 10) based on sensory data.

Table 1 shows details of the datasets. They differ in both scale and the number of input features. On the last two datasets, models are more likely to overfit during training. The results on SARCOS and WINE can therefore be used to test the models' robustness.

C. TRAINING
Here we describe our training routine.
Data Preprocessing: Each input feature is transformed to follow a normal distribution using a quantile transform. Specifically, we use the scikit-learn implementation [33]. This step is critical for fast and robust training. The integer labels are normalized by the mean and standard deviation on the train split. We applied the quantile transform for all models in our experiments (SDTR, FCNN, NODE, XGBoost, TabNet, and DeepForest).
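The idea behind the preprocessing step can be illustrated without external dependencies. The snippet below is a simplified stand-in for scikit-learn's quantile transform, not its actual algorithm: it maps each value through the empirical CDF of the training column and then through the inverse normal CDF. The mid-rank convention, the clipping, the use of the population standard deviation, and the toy data are assumptions of this sketch.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def quantile_to_normal(train_column, values):
    """Map values to N(0, 1) through the empirical CDF of the training column."""
    ordered = sorted(train_column)
    n = len(ordered)
    normal = NormalDist()
    out = []
    for v in values:
        rank = bisect_right(ordered, v)
        # Mid-rank quantile, clipped so that inv_cdf never sees 0 or 1.
        q = min(max((rank - 0.5) / n, 0.5 / n), 1.0 - 0.5 / n)
        out.append(normal.inv_cdf(q))
    return out


def normalize_labels(train_labels, labels):
    """Standardize labels with statistics computed on the train split only."""
    mu = mean(train_labels)
    sigma = pstdev(train_labels)
    return [(y - mu) / sigma for y in labels]


train_feature = [0.1, 5.0, 0.3, 120.0, 2.0]  # heavily skewed toy feature
transformed = quantile_to_normal(train_feature, sorted(train_feature))

train_years = [1990, 2001, 2005, 2010, 1985]
scaled_years = normalize_labels(train_years, train_years)
```

The outlier 120.0 ends up only slightly further from zero than its neighbours, which is the main reason such a rank-based transform stabilizes training on heterogeneous features.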
Training: SDTR is trained end-to-end via an SGD-based optimizer. We only used MSE as the objective function in the experiments, but any differentiable objective function can fit into the SDTR model. As in [34], we use Quasi-Hyperbolic Adam (QHAdam) [24] as our optimizer.
The $w$ and $b$ of each node are initialized via Xavier uniform initialization [14]. The leaf responses $R$ are initialized from a standard normal distribution.
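The initialization scheme can be sketched as follows. The Xavier uniform bound sqrt(6 / (fan_in + fan_out)) is the standard one; treating each decision node as a layer with K inputs and one output, drawing the bias from the same range, and the function names are assumptions of this sketch rather than details confirmed by the paper.

```python
import math
import random


def xavier_bound(fan_in, fan_out):
    """Half-width of the Xavier (Glorot) uniform distribution."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def init_tree(n_features, depth, seed=0):
    """Initialize decision nodes (Xavier uniform) and leaf responses (N(0, 1))."""
    rng = random.Random(seed)
    n_inner = 2 ** depth - 1
    n_leaves = 2 ** depth
    bound = xavier_bound(n_features, 1)  # each node maps K features to 1 logit
    weights = [[rng.uniform(-bound, bound) for _ in range(n_features)]
               for _ in range(n_inner)]
    # Assumption: biases are drawn from the same uniform range as the weights.
    biases = [rng.uniform(-bound, bound) for _ in range(n_inner)]
    responses = [rng.gauss(0.0, 1.0) for _ in range(n_leaves)]
    return weights, biases, responses


weights, biases, responses = init_tree(n_features=21, depth=3, seed=0)
bound = xavier_bound(21, 1)
```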
The group of hyperparameters that yields the best test MSE in SDTR-Single was used as hyperparameters in the SDTR-Shallow setting. The number of trees across all datasets is n = 256.
For deep settings, we performed a mixture of grid and random search on the hyperparameters of the model via Hyperopt [3]. The tuned hyperparameters include the depth of a single tree, the output dimensions of each tree, the number of trees in one ''boosting'' layer, the number of layers, the learning rate, $\lambda_1$ and $\lambda_2$. Only the last two parameters are searched continuously.
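A search of this shape, with discrete choices for the structural hyperparameters and continuous sampling for the two regularization strengths, can be sketched in plain Python. This is not the Hyperopt API, and every candidate value, range, and name below is hypothetical; the toy objective merely stands in for training a model and measuring its validation MSE.

```python
import random


def sample_config(rng):
    """Draw one configuration: grid-like choices plus two continuous values."""
    return {
        "tree_depth": rng.choice([4, 6, 8]),
        "tree_output_dim": rng.choice([1, 2, 3]),
        "trees_per_layer": rng.choice([64, 128, 256]),
        "n_layers": rng.choice([2, 4, 8]),
        "learning_rate": rng.choice([1e-3, 3e-3, 1e-2]),
        # Only the two lambdas are sampled continuously (log-uniform here).
        "lambda1": 10.0 ** rng.uniform(-4.0, -1.0),
        "lambda2": 10.0 ** rng.uniform(-4.0, -1.0),
    }


def random_search(evaluate, n_trials, seed=0):
    """Keep the configuration with the lowest score returned by `evaluate`."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = evaluate(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_validation_mse(config):
    # Placeholder objective; a real run would train and validate a model.
    return abs(config["tree_depth"] - 6) + config["lambda1"]


best_config, best_score = random_search(toy_validation_mse, n_trials=50)
```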
The hyperparameter choices of other benchmark models are described in Appendix A.

D. RESULTS AND ANALYSIS
We use mean squared error (MSE) as our metric across all five datasets. As shown in Table 2, SDTR performed well in the ''single tree'' and ''shallow'' settings, while performing slightly worse than NODE/XGBoost after being boosted. On small datasets (SARCOS and WINE), SDTR showed better robustness than other neural models (FCNN, NODE, and TabNet), and is comparable with state-of-the-art non-differentiable models (XGBoost, LightGBM, CatBoost, and DeepForest).
It is worth noting that we use the same structure and hyperparameters for SDTR for all datasets in the same set of experiments. For ''shallow'' setups, the hyperparameters, including learning rate, tree depth, and λs, are identical to those in the ''single'' setup. We found that the hyperparameter setting which performed well in the ''single tree'' setup also performed well after being ensembled. The consistent performance of SDTR shows the robustness of our proposed model. We can perform a low-cost hyperparameter search in the ''single tree'' setup; the resulting optimal hyperparameters can be used for the following ensemble phase. It might also explain why the shallow SDTR models performed better than their deeper counterparts on YAHOO and SARCOS. Theoretically, shallow models have few advantages over deeper counterparts, since their solution space is a subset of that of deeper models; however, deeper models naturally have more hyperparameters, need more careful tuning, and have a higher risk of overfitting. The boosted model for YAHOO showed signs of overfitting in our training phase. The ''boosting'' process also introduces significant structural change; we have found no evidence that the optimal hyperparameters for ''single'' and ''shallow'' setups are still optimal for the boosted models. Furthermore, the boosting scheme significantly increases the training time and memory consumption, while the performance gain over the ''shallow'' scheme is only marginal. Thus we do not recommend using the boosted version of SDTR. If the number of trees is fixed in specific cases (for example, when memory is limited), placing all trees in the same ''shallow'' layer would be a better choice.

V. ABLATION STUDY
In this section, we compare different hyperparameter settings of our SDTR model. We also justify the effectiveness of our proposed training techniques, i.e., exponentially decaying L1 regularization and gradient re-weighting. In addition, we propose a new "average entropy" metric to evaluate the model's interpretability. We find that this metric is consistent with our optimization target (MSE) in tree regression tasks, which may explain the effectiveness of tree-structured models in this field.

A. ABLATION STUDY: BATCH SIZE AND GRADIENT RE-WEIGHTING
In many cases, the batch size is critical for models that use Batch Normalization (BN) layers [17]. Although our model does not use BN, the regression terms on each inner node are still computed per batch, which naturally suggests that a larger batch size gives a more precise approximation and thus improves the model's overall MSE. We provide ablation results in Table 3. We kept all hyperparameters (including random seeds) the same and coordinated the early-stopping conditions so that training stops after observing no improvement for exactly the same amount of data. For each batch size, we provide two results (with and without gradient re-weighting).
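One way to coordinate early stopping across batch sizes is to express the patience as a number of training samples and convert it to optimizer steps per setting. The following is a minimal sketch under that assumption; the function name `patience_in_steps` and the sample budget are hypothetical and are not taken from our implementation.

```python
import math


def patience_in_steps(patience_samples: int, batch_size: int) -> int:
    """Return the number of optimizer steps that cover `patience_samples` samples.

    Rounding up guarantees that at least `patience_samples` samples are seen
    before training stops for lack of improvement.
    """
    if patience_samples <= 0 or batch_size <= 0:
        raise ValueError("patience_samples and batch_size must be positive")
    return math.ceil(patience_samples / batch_size)


# Hypothetical budget: stop after 1,024,000 samples without improvement.
PATIENCE_SAMPLES = 1_024_000
steps_by_batch_size = {
    batch_size: patience_in_steps(PATIENCE_SAMPLES, batch_size)
    for batch_size in (256, 512, 1024, 2048)
}
for batch_size, steps in steps_by_batch_size.items():
    print(f"batch size {batch_size}: patience = {steps} steps")
```

With this conversion, doubling the batch size halves the patience measured in steps, so every setting waits for the same amount of data.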
Although a larger batch size accelerated the overall training process, we did not observe a clear improvement in the model's accuracy. On the other hand, a smaller batch size (and hence more training batches) might benefit our QHAdam optimizer, because the optimizer updates its momentum estimates more frequently. The consistent performance across batch sizes makes SDTR suitable for low-resource settings.
Multiplying the learning rate for the "responses" R improved performance by approximately 2%, which supports the usefulness of the learning-rate re-weighting technique.

B. ABLATION STUDY: λ₁ AND DEPTH FOR SINGLE-TREE SETTING
To test the stability against different hyperparameters, we ran ablation studies on the MSLR-Web10K dataset with different choices of λ₁ and tree depth. We provide the results in Figure 2.
To control the variables, we set λ₂ = 0 in this set of experiments. Because exponentially decaying L1 regularization is absent, the performance in this set of experiments is slightly worse than our main result on MSLR.
To sum up, the value of λ₁ is flexible and affects the model's performance only slightly, while a suitable tree depth brings a significant boost. Based on our observations, the optimal depth depends on the dataset's characteristics.

C. ABLATION STUDY: SINGLE-LAYER ENSEMBLE
As shown in the previous results, the model achieves higher precision when we place multiple trees in the same layer and average their outputs to obtain the final prediction. To investigate this method, we performed an ablation study over the number of trees and the depth of each tree on the MSLR-Web10K dataset. The results are provided in Figure 3.
A straightforward conclusion is that the model's accuracy improves as the number of trees increases. The ensemble's accuracy also appears to be related to the accuracy of each single tree: a hyperparameter choice that performs well on its own (such as depth = 5 in the single-tree ablation) also yields the best result in an ensemble. As mentioned before, this characteristic makes SDTR friendly to hyperparameter tuning.

D. ABLATION STUDY: L1 REGULARIZATION AND INTERPRETABILITY ANALYSIS
As proposed in Section III-B, a sparse weight matrix implies better interpretability. To quantitatively evaluate the interpretability of an SDTR, we propose an "average entropy" metric suited to our experimental setup.
For each non-leaf node in our tree, the 1 × k input features are multiplied by a k × 1 weight matrix W to obtain a scalar, which is fed into a sigmoid function. Note that all input features are already normalized, so the scales of the weights on different features are roughly the same. Naturally, a weight with a larger absolute value means that the corresponding feature is more influential; thus, we normalize W to get a new matrix $\hat{W}$:

$$\hat{W}_i = \frac{|W_i| + \epsilon}{\sum_{j=1}^{k} \left( |W_j| + \epsilon \right)}, \quad i = 1, \dots, k,$$

where $\epsilon$ is a small positive scalar that prevents the occurrence of zeros. In all of our experiments we use $\epsilon = 10^{-3}$. The entries of $\hat{W}$ lie in [0, 1] and sum to 1. $\hat{W}$ can therefore be viewed as a multinoulli distribution, and we can compute its entropy:

$$H(\hat{W}) = -\sum_{i=1}^{k} \hat{W}_i \log \hat{W}_i.$$

Intuitively, if one (or several) feature(s) dominate, the entropy of the induced multinoulli distribution is small; if all features are equally important, the entropy reaches its maximum.
Again, we take a weighted sum of these entropies over all nodes to obtain our final "average entropy" (AE) metric, where $\hat{W}^{(t)}$ denotes the normalized weight matrix of inner node $t$ and $H(\hat{W}^{(t)})$ its entropy. We use this metric to evaluate the effectiveness of our exponentially decaying L1 regularization. Table 4 shows the relationship among λ₂, MSE, and AE; λ₂ = 0 means that we do not use L1 regularization at all.
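The metric described above can be sketched in a few lines of Python. This is an illustrative sketch, not our released code: it assumes the normalization adds ε = 10⁻³ to each absolute weight before dividing by the total, and, because the per-node coefficients of the weighted sum are not restated here, it defaults to uniform coefficients. The names `normalized_weights`, `entropy`, `average_entropy`, and `node_coeffs` are hypothetical.

```python
import math


def normalized_weights(w, eps=1e-3):
    """Map a node's weight vector to a distribution over features.

    Assumption: each entry becomes (|w_i| + eps) / sum_j (|w_j| + eps),
    so all entries are strictly positive and sum to 1.
    """
    shifted = [abs(v) + eps for v in w]
    total = sum(shifted)
    return [v / total for v in shifted]


def entropy(p):
    """Shannon entropy (natural log) of a strictly positive distribution."""
    return -sum(v * math.log(v) for v in p)


def average_entropy(node_weights, node_coeffs=None, eps=1e-3):
    """Weighted sum of per-node entropies.

    node_weights: one weight vector per inner node.
    node_coeffs:  per-node coefficients; uniform if omitted (an assumption).
    """
    if node_coeffs is None:
        node_coeffs = [1.0 / len(node_weights)] * len(node_weights)
    return sum(
        c * entropy(normalized_weights(w, eps))
        for c, w in zip(node_coeffs, node_weights)
    )


# A node dominated by one feature versus a node using all features equally.
sparse_node = [2.0, 0.0, 0.0, 0.0]
dense_node = [1.0, 1.0, 1.0, 1.0]
print(average_entropy([sparse_node]), average_entropy([dense_node]))
```

In this toy comparison the node dominated by a single feature has a much lower entropy than the node that weights all four features equally, whose entropy equals log 4, the maximum for four features.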
The results show that the L1 regularization term improved the model's interpretability and that a proper value of λ₂ improved the model's performance. An interesting finding is that a model with a smaller MSE tends to have a smaller AE; Figure 4 further illustrates this phenomenon, which also showed up in our experiments on the other two datasets. Note that we do not use AE to optimize our models: it does not appear in our loss function. Furthermore, the early-stopping scheme relies solely on the MSE on the validation set. This finding may imply that tabular data regression favors sparse weight matrices, at least on these datasets.
We also provide a visualization of the first three layers of a shallow SDTR single tree (depth = 4) trained on the YearPredictionMSD dataset. The model reached an MSE of 79.32. Each sub-plot shows the weights associated with one node. Our L1 regularization effectively encouraged sparsity for the d = 1 and d = 2 nodes; because of the exponentially decaying weight, the regularization strength (and hence sparsity) decreases in deeper layers. One can already identify the most important features (in this example, feature #0) and exclude unused features in subsequent experiments.
It is worth noting that ensembling and boosting harm the interpretability of the base model, and this observation also holds for SDTR. In our "single layer ensemble" scheme, the final output is the average of every tree regressor's prediction. Every tree's contribution to the final output is associated with the statistics of the tree's path probabilities and leaf predictions. We leave this direction for future work.

E. ABLATION STUDY: GRADIENT REWEIGHTING ON SDT
Theoretically, vanishing gradients at leaf nodes also exist in the original SDT for classification tasks. To justify our gradient reweighting technique, we performed an ablation study on the original SDT model proposed by Frosst and Hinton [13]. We used depth = 5 and a learning rate of $10^{-3}$ on the MNIST classification task with cross entropy as the loss function; the only difference between the two models is whether gradient reweighting is used. Figure 6 shows our results on MNIST. The reweighting method significantly accelerated training, reaching a near-optimal result after a few hundred epochs. It is worth noting that directly multiplying the learning rate by $2^{\text{depth}}$ caused the model to fail: the loss became intractable (NaN) after 100 epochs. The final accuracy on the test set was 94.49% (reweighted) versus 94.94% (original), which implies that gradient reweighting might introduce a small performance loss. Such a loss can easily be recovered by tuning hyperparameters.
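The vanishing-gradient argument can be illustrated numerically. The sketch below is not the implementation used in our experiments. It assumes a complete binary soft tree in which each inner node routes to the right with probability sigmoid(w·x + b), and it assumes that the prediction is the path-probability-weighted sum of the leaf responses, so that the derivative of the prediction with respect to a leaf response equals that leaf's path probability. All names and the random parameters are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def leaf_path_probabilities(x, weights, biases, depth):
    """Path probabilities of the 2**depth leaves of a complete soft binary tree.

    Inner nodes are stored in heap order: node i has children 2i+1 (left)
    and 2i+2 (right), so level `l` starts at index 2**l - 1.
    """
    probs = [1.0]  # probability of reaching the root
    for level in range(depth):
        first = 2 ** level - 1
        next_probs = []
        for offset, p in enumerate(probs):
            node = first + offset
            z = sum(w * xi for w, xi in zip(weights[node], x)) + biases[node]
            go_right = sigmoid(z)
            next_probs.append(p * (1.0 - go_right))  # left child
            next_probs.append(p * go_right)          # right child
        probs = next_probs
    return probs


random.seed(0)
depth, k = 5, 8
n_inner = 2 ** depth - 1
weights = [[random.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n_inner)]
biases = [0.0] * n_inner
x = [random.gauss(0.0, 1.0) for _ in range(k)]

probs = leaf_path_probabilities(x, weights, biases, depth)
mean_prob = sum(probs) / len(probs)
print(f"{len(probs)} leaves, mean path probability = {mean_prob:.6f}")
```

Because the path probabilities sum to 1, their mean over the $2^{\text{depth}}$ leaves is exactly $2^{-\text{depth}}$. Under the stated assumptions, the average gradient reaching a leaf response therefore shrinks exponentially with depth.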

VI. CONCLUSION
In this paper, we adapted an existing model (the Soft Decision Tree) to heterogeneous tabular data regression, which has long troubled DNNs. Our experiments show that, although our model performs slightly worse than NODE and XGBoost under fine-tuned settings, it outperforms both competitors in shallow settings and performs far better than FCNN. The SDTR model performs best in the extremely constrained, low-memory, single-tree setting.
The differentiable nature of SDTR allows it to be incorporated into complex pipelines trained by back-propagation. As a regressor, SDTR can be used to extract valuable features from tabular data or to solve regression problems directly at low cost.

APPENDIX A OPTIMIZATION OF HYPERPARAMETERS
For each model, we run a grid search over the group of hyperparameters described below. All models are trained on the same train/val/test split. Some models, e.g., XGBoost and SDTR, require a validation set to perform early stopping and prevent overfitting. We trained those models on the train split, performed early stopping on the val split, and report the best test-split result among the different hyperparameter choices.
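The search protocol can be sketched as follows. This is a minimal illustration and not our experiment code: `train_and_evaluate` is a hypothetical callback that would train on the train split, early-stop on the val split, and return the test MSE. The search space and the toy objective below are made up for the example.

```python
import itertools


def grid(search_space):
    """Yield every combination from a dict of name -> list of candidate values."""
    names = sorted(search_space)
    for values in itertools.product(*(search_space[name] for name in names)):
        yield dict(zip(names, values))


def run_grid_search(search_space, train_and_evaluate):
    """Evaluate every configuration and keep the one with the lowest MSE."""
    best_config, best_mse = None, float("inf")
    for config in grid(search_space):
        mse = train_and_evaluate(config)
        if mse < best_mse:
            best_config, best_mse = config, mse
    return best_config, best_mse


# Hypothetical search space and a toy stand-in for model training.
space = {"depth": [3, 4, 5], "learning_rate": [1e-3, 1e-2]}


def toy_objective(config):
    return (config["depth"] - 4) ** 2 + config["learning_rate"]


best_config, best_mse = run_grid_search(space, toy_objective)
print(best_config, best_mse)
```

In practice, the callback would wrap the training loop of the model under study, so the same driver can be reused for every model in the comparison.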

A. FCNN
As described above, the FCNN uses a "Dense-LeakyReLU-Dropout" structure as its basic building block. FCNN-Shallow consists of 2 blocks, and FCNN-Deep consists of 7 blocks. The negative slope of LeakyReLU is 0.01. We use the Adam optimizer for back-propagation.
The number of neurons in each hidden layer was set to a proportion of the input feature size k.
To make the comparison fair, the XGBoost-Tuned setting uses the same search space as in [34]. We describe the hyperparameter choices below.
The hyperparameters of NODE-Small are sampled from below.
Based on our observations, the boosting process greatly influences the model's structure and logic. We use the same search space as in [34]:
We use the PyTorch implementation of TabNet.² The hyperparameters are sampled by uniform choice from below. The bold text stands for the default hyperparameters.
We use the CascadeForestRegressor in the Deep-Forest library.³ The hyperparameters of CascadeForestRegressor are sampled by uniform choice from below. The bold text stands for the default hyperparameters.