I. Introduction
Decision trees in machine learning have a long history. Automatic Interaction Detection (AID) [1] and THeta Automatic Interaction Detection (THAID) [2] are often considered the first published decision tree algorithms for regression and classification, respectively; both were later combined and extended into the Classification And Regression Trees (CART) algorithm [3]. Subsequently, ensemble learning methods such as bagging [4], gradient boosting [5]–[7], and random forests [8], [9] were developed, which combine multiple decision trees to substantially improve prediction accuracy over individual trees. Despite this long history, decision trees and ensemble methods such as random forests and gradient boosting remain among the most frequently used machine learning algorithms today, as shown by a recent Kaggle survey [10]. Their popularity stems from several inherent advantages: a relatively simple underlying concept, native support for both numerical and categorical features, and interpretable predictions.