Learning Explainable Decision Rules via Maximum Satisfiability

Decision trees are a popular choice for providing explainable machine learning, since they make explicit how different features contribute towards the prediction. We apply tools from constraint satisfaction to learn optimal decision trees in the form of sparse k-CNF (Conjunctive Normal Form) rules. We develop two methods offering different trade-offs between accuracy and computational complexity: an offline method that learns decision trees using the entire training dataset, and an online method that learns decision trees over a local subset of the training dataset, obtained from training examples near a query point. The developed methods are applied to a number of datasets in both online and offline settings. We find that our methods learn decision trees which are significantly more accurate than those learned by existing heuristic approaches. However, the global decision tree model tends to be computationally more expensive than heuristic approaches. The online method is faster to train and finds smaller decision trees with an accuracy comparable to that of the k-nearest-neighbour method.


I. INTRODUCTION
Explainable artificial intelligence (XAI) is a family of methods focusing on the algorithmic transparency and interpretability of decision procedures [1], [2]. The ability to explain the decision of a machine learning (ML) algorithm is a vital component of diagnostic, feedback, and human-in-the-loop systems [3]-[5]. Furthermore, with increasing statutory restrictions planned on the use of ML methods in customer-oriented applications, there has been renewed interest in the field [6], [7].
Transparency in ML algorithms can be achieved in a number of direct and indirect ways [8]. This work pursues methods whose underlying classification rules are constructed from logical clauses of simple operators. The resulting decision tree models comprise an enumeration of feature combinations leading to a positive prediction. At moderate complexity, tree models can provide an intuitive explanation of the prediction.
Decision trees are aggregates of classification paths propagating through test nodes or decision nodes. A node represents a binary test (or ''yes/no question'') assessing the state of a single feature, e.g. ''Is your monthly income less than 2400€?''. By modelling these nodes as propositional variables, we can represent a decision tree in propositional logic as a conjunction of disjunctions of variables. Also known as k-cnf formulae, such formulae can provide a compact and interpretable representation of classification rules over a set of features.

(The associate editor coordinating the review of this manuscript and approving it for publication was Szidónia Lefkovits.)
A recent line of work, adapting ideas from Boolean compressed sensing, focuses on learning sparse k-cnf rules via linear programming [9]-[11]. Following advances in SAT solving, the problem has also been cast into various maximum-satisfiability (MAX-SAT) frameworks [12]-[14]. Other works consider learning decision lists and decision sets, which are generalizations of k-cnf rules [15], [16].
This work extends the MAX-SAT-based framework for learning k-cnf rules proposed in [13]. In particular, we generalize their framework to non-constant clause weights and add cardinality constraints for more efficient recovery. While this allows for finding solutions over much larger datasets, the global model still suffers from the inherent NP-hardness of SAT solving. Therefore, we propose a locally weighted learning approach for query-specific applications [17], [18]. Similar to lazy decision trees proposed in [19], our local model constructs a k-cnf rule within the neighbourhood of a query data point.
We empirically validate our methods on a dataset of Housing Benefit Applications provided by the Social Insurance Institution of Finland (Kela). The task is to classify benefit applications as accepted/rejected. In addition, an explanation of the decision is provided by the classification rule. We also consider six publicly available datasets.

A. OPTIMAL DECISION TREES
Decision trees (Fig. 1) are directed acyclic graphs whose internal nodes represent propositional variables (e.g., $v$ = ''income < 5000'') and whose edges represent assignments to source nodes. We use the term literal, e.g. $t$, to denote a propositional variable $v$ or its negation $\overline{v}$. In propositional logic, a decision tree can be modelled as a formula in conjunctive normal form (CNF), $T_{\text{cnf}} = \bigwedge_k T_k$, where each clause $T_k = \bigvee_m t_k^m$ is a disjunction of literals of variables in the tree and where variables model tests on input features. (Alternatively, a decision tree can be represented in disjunctive normal form (DNF) as $T_{\text{dnf}} = \bigvee_k T_k$, where each $T_k = \bigwedge_m t_k^m$ is a conjunction of literals.) We only consider binary variables, which can be either true or false.
Let $T = \{\{t_k^1, \ldots, t_k^{2L}\} : 1 \le k \le K\}$ and let $\sigma : T \to \{0, 1\}$ be a valuation. We define a decision tree (in CNF) in terms of a valuation $\sigma$ as

$T_\sigma = \bigwedge_{k=1}^{K} \bigvee \{\, t \in T_k : \sigma(t) = 1 \,\}. \quad (1)$

Let $D = \{(x_i, y_i)\}_{i=1}^{N}$ be a dataset with numeric feature vectors $x_i \in \mathbb{R}^M$ and labels $y_i \in \{0, 1\}$. Such a feature could be the age of a person or their household income. We quantize a feature $x^j$ using a set of thresholds $a^j = \{a_1^j, \ldots, a_{|a^j|}^j\}$ to produce $|a^j|$ binary features $\tau_{a^j} = (\tau_{a_1^j}, \ldots, \tau_{a_{|a^j|}^j})$. We discuss how to choose the thresholds in Section IV. To avoid notational clutter, we use $l$ to index a threshold $a_l$. We classify a data point $x$ over a decision tree $T_\sigma$ as

$\hat{y} = T_\sigma(\tau(x)), \quad (2)$

where $\tau(x)$ denotes the binarised representation of $x$. An optimal decision tree over $D$ solves

$\sigma^* = \arg\min_{\sigma} \; \mathcal{L}(\sigma; D) + \lambda \lVert T_\sigma \rVert_1, \quad (3)$

subject to

$|T_k| \ge 1, \quad 1 \le k \le K, \quad (4)$
$\sigma(\tau_a) + \sigma(\tau_b) \le 1 \quad \text{for implied thresholds } a \ne b \text{ of the same feature}, \quad (5)$

with loss function

$\mathcal{L}(\sigma; D) = \sum_{i=1}^{N} w_i \, \mathbb{I}[T_\sigma(x_i) \ne y_i]. \quad (6)$

The regularization parameter $\lambda$ trades off prediction accuracy against the sparsity of the decision tree. Sparsity is measured as the number of true literals in the clauses of a tree, $\lVert T_\sigma \rVert_1 = \sum_k |T_k|$, where $|T_k|$ counts the literals assigned True in clause $T_k$. For example, $T_{\text{cnf}}$ in Fig. 1 has $\lVert T_{\text{cnf}} \rVert_1 = 1 + 2 + 2$. The sparsity of a decision tree determines its generalization ability: in general, we expect a sparse decision tree to be less prone to overfitting. The constraints (4) require the clauses $T_k$ to be non-empty, while the constraints (5) exclude implied nodes such as $\tau_a$ and $\tau_b$ for thresholds $a \ne b$ of feature $j$ (e.g. the node 'income < 500' implies the node 'income < 700'). The weights $w_i$ in (6) are explained in Section III; for now we assume $w_i = 1$.
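Classifying a data point over a CNF decision tree amounts to checking that every clause contains at least one satisfied literal. A minimal Python sketch of this evaluation follows; the encoding of a clause as a list of (feature_index, negated) pairs, and all names, are illustrative choices rather than the paper's notation:

```python
def eval_clause(clause, tau):
    # A clause is a disjunction of literals; each literal is a pair
    # (feature_index, negated) over the binary feature vector tau.
    # A positive literal is satisfied when tau[j] == 1, a negated
    # literal when tau[j] == 0, i.e. exactly when tau[j] != negated.
    return any(tau[j] != negated for j, negated in clause)

def eval_tree(tree, tau):
    # A CNF tree is a conjunction of clauses: all must be satisfied.
    return all(eval_clause(clause, tau) for clause in tree)

# Example tree: (t0 or not t1) and (t2)
tree = [[(0, False), (1, True)], [(2, False)]]
print(eval_tree(tree, [1, 1, 1]))  # True: t0 satisfies clause 1, t2 clause 2
print(eval_tree(tree, [0, 1, 0]))  # False: no literal of clause 1 holds
```

A rejected data point thus comes with an explanation for free: the falsified clause names the feature tests that failed.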

B. OPTIMAL DECISION TREES VIA PARTIAL MAX-SAT
Owing to the 0-1 loss function in (6), finding direct solutions of (3)-(5) is known to be NP-hard [20]. We reformulate (3)-(5) as a weighted partial MAX-SAT problem, which will enable the use of specialized solvers.
To this end, let $\eta = \{\eta_1, \ldots, \eta_N\}$ and extend $\sigma$ to $\sigma : T \cup \eta \to \{0, 1\}$. Let $\omega : F \to \mathbb{R} \cup \{\infty\}$ be a mapping associating non-negative weights with the clauses in $F$ (to be defined later in this section). It will be convenient to reformulate (3) by using the equivalence $\min(f) = -\max(-f)$, as

$\sigma^* = \arg\max_{\sigma} \; \sum_{i=1}^{N} w_i \, \mathbb{I}[T_\sigma(x_i) = y_i] - \lambda \lVert T_\sigma \rVert_1, \quad (7)$

where $\overline{T}$ denotes the negation of the variables in $T$. Let

$D_i = \bigwedge_{k=1}^{K} D_i^k, \qquad D_i^k = \{\, t \in T_k : t(\tau(x_i)) = 1 \,\}, \quad (8)$

which lists, for each rule $k$, the variables matching the quantized representation of datapoint $x_i$. We can represent the weighted indicator function $w_i \, \mathbb{I}[T_\sigma(x_i) = y_i]$ by conjoining the paired constraints

$(\eta_i), \quad \omega = w_i, \quad (9)$
$(\overline{\eta}_i \lor D_i^k), \quad 1 \le k \le K, \quad \omega = \infty. \quad (10)$

The clauses (9) with weights $w_i \in \mathbb{R}$ are selectors of data points in the training dataset. The clauses (10), in turn, regulate the correctness of the model's prediction over the chosen subset. In particular, with slight abuse of notation, we have the equivalence

$\sigma(\eta_i) = 1 \iff T_\sigma(x_i) = y_i. \quad (11)$

Similarly, the sparsity term $\lambda \lVert T_\sigma \rVert_1$ in (7) is incorporated into the clauses

$(\overline{t}\,), \quad t \in T, \quad \omega = \lambda. \quad (12)$

Lastly, we redefine the cardinality constraints (4) and (5) as the propositional clauses

$\{t_k^1, \ldots, t_k^{2L}\}_{\ge 1}, \quad 1 \le k \le K, \quad (13)$
$\{\overline{\tau}_a, \overline{\tau}_b\}_{\ge 1} \quad \text{for implied thresholds } a \ne b, \quad (14)$

where $\{r_1, \ldots, r_n\}_{\ge 1}$ denotes a 1-out-of-$n$ bound on the number of literals $r \in \{r_1, \ldots, r_n\}$ which can be set to True. For an efficient encoding of these cardinality constraints, we make use of the sequential encoding developed in [21].

In practice, writing the clauses $D_i$ in CNF for negative examples ($y_i = 0$) is less straightforward. This difficulty can be resolved using Tseytin's transformation: we introduce new variables $s_1, \ldots, s_K$ and set $s_k \leftrightarrow \overline{D_i^k}$. With these definitions, we have the equivalence of (3) and its partial weighted MAX-SAT form

$\sigma^* = \arg\max_{\sigma} \sum_{C \in F : \, \sigma \models C} \omega(C). \quad (15)$

Optimal decision trees $T_{\sigma^*}$ in the form of (1) yield interpretable decision tree classifiers $T_{\sigma^*} : \{0, 1\}^M \to \{0, 1\}$ over the subset $\{x_i, y_i\}_{\sigma^*(\eta)} \subset D_{\text{train}}$. Interpretability is endowed by the semantic interpretation of the variables $t_k^m \in T$, which inherit meaning from the features they represent. In this way, the model provides an interpretation of the dichotomy in the dataset, as well as a justification for predictions $\hat{y} = T_{\sigma^*}(x)$ made on unseen data.
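For intuition, the objective that the MAX-SAT encoding optimizes — a weighted 0-1 loss plus a sparsity penalty — can be checked by brute force on toy data. The sketch below is not the paper's solver: it exhaustively scores single-clause candidate trees over positive literals only, standing in for the MAX-SAT search, and all names are illustrative:

```python
from itertools import combinations

def eval_tree(tree, tau):
    # Simplified CNF with positive literals only: each clause is a
    # list of feature indices, and it is satisfied when at least one
    # of those binary features is set in tau.
    return all(any(tau[j] for j in clause) for clause in tree)

def objective(tree, X, y, w, lam):
    # Weighted 0-1 loss plus sparsity penalty lam * (number of
    # literals), mirroring the loss-plus-regularizer objective.
    loss = sum(wi for tau, yi, wi in zip(X, y, w)
               if eval_tree(tree, tau) != bool(yi))
    return loss + lam * sum(len(clause) for clause in tree)

def brute_force_tree(X, y, w, lam, n_feats, max_len=2):
    # Exhaustive stand-in for the MAX-SAT solver: score every
    # single-clause tree of up to max_len positive literals.
    candidates = [[list(c)]
                  for r in range(1, max_len + 1)
                  for c in combinations(range(n_feats), r)]
    return min(candidates, key=lambda t: objective(t, X, y, w, lam))

# Toy data: the label is 1 exactly when binary feature 0 is set.
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
w = [1.0] * 4
best = brute_force_tree(X, y, w, lam=0.1, n_feats=2)
print(best)  # [[0]] -- the single clause "feature 0 is true"
```

The exhaustive loop is exponential in the number of literals, which is precisely why the paper delegates the search to a MAX-SAT solver instead.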

III. LOCAL DECISION TREES
The number of constraints (10) grows as O(LN) with the number of data points N and binary features L. As a consequence, the problem (9)-(13) becomes intractable in practice for large datasets. This motivates a local learning approach in which (15) is solved locally, in a query-oriented fashion.
Let $x_q \in \mathbb{R}^M$ be a real-valued query and $\tau_q \in \{0, 1\}^{2L}$ its binarisation. We construct $D_{\text{local}}$ in the vicinity of $x_q$ by selecting the data points $\{x_i, y_i\}_{\mathcal{N}(x_q)}$ in the neighbourhood $\mathcal{N}(x_q) = \{i \mid d(x_i, x_q) \le \delta\}$. Distance information in the feature space is incorporated into the weights $w_i$ in (6) using a kernel function $K(d)$, where $d$ is a metric in the (real-valued) feature space. A popular choice is the Gaussian kernel [22]

$K(d) = \exp(-d^2 / \delta^2), \quad (16)$

with cut-off parameter $\delta$ (although in practice we include a fixed number of nearest neighbours). We consider the weighted norm

$d(x_i, x_q) = \Big( \sum_{j=1}^{M} \frac{(x_i^j - x_q^j)^2}{s_j^2} \Big)^{1/2}, \quad (17)$

with sample variances $s_j^2$ over the training dataset. Classification of a query point $x_q$ amounts to solving (9)-(13) over $\{x_i, y_i\}_{\mathcal{N}(x_q)}$ with weights $w_i = K(d(x_i, x_q))$ and classifying $\hat{y}_q = T_{\sigma^*}(\tau_q)$.
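The locality weighting can be sketched as follows. The sketch assumes a Gaussian kernel of the simplified form exp(-d^2) and the standard-deviation-scaled metric described above; the function names and the fixed-neighbour selection are illustrative:

```python
import math

def weighted_dist(x, x_q, s):
    # Weighted Euclidean metric: each coordinate is scaled by its
    # sample standard deviation s[j], so that no single feature
    # dominates the distance purely by its units.
    return math.sqrt(sum(((xj - qj) / sj) ** 2
                         for xj, qj, sj in zip(x, x_q, s)))

def neighbourhood_weights(X, x_q, s, n_neighbours=2):
    # Select the n nearest training points and weight them with a
    # Gaussian kernel exp(-d^2); nearer points get weights closer
    # to 1 and thus matter more in the weighted 0-1 loss.
    dists = sorted((weighted_dist(x, x_q, s), i)
                   for i, x in enumerate(X))
    return [(i, math.exp(-d ** 2)) for d, i in dists[:n_neighbours]]

X = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0]]
s = [1.0, 1.0]
print(neighbourhood_weights(X, [0.5, 0.5], s))
```

The outlier at (10, 10) never enters the local dataset, which is what keeps the per-query MAX-SAT instance small.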

IV. EXPERIMENTS
We validate the global and local models of the previous sections on a number of publicly available datasets.

A. HOUSING BENEFIT APPLICATION DATASET
The HBA dataset consists of non-identifying data selected from 6250 housing benefit applications received by the Social Insurance Institution of Finland. A typical application comprises binary- and multiple-choice questions, numerical fields (e.g. age, income), and optional text fields. We concern ourselves with predicting whether to accept or reject such an application based on categorical and numerical features. Specifically, we consider the salient features: income, #residents, #children, max expenses, housing expenses, deductibles, and calculated expenses. Feature binarisation is done for a fixed number of thresholds, except for the discrete features (#residents, #children), for which all thresholds a_j are included. After experimenting with a number of ways of choosing these thresholds, we settled on choosing the thresholds a_j as in-between values observed in the training data. In particular, for each feature j, we sorted the training data in ascending numerical order, took the differences between all consecutive ordered pairs, and chose a fixed number of quantiles from these differences.
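A simplified reading of this threshold-selection procedure — midpoints between consecutive distinct training values, subsampled down to a fixed budget in place of the quantile step — might look as follows (the function names are illustrative, not the paper's exact procedure):

```python
def midpoint_thresholds(values, n_thresholds):
    # Candidate thresholds are the midpoints between consecutive
    # distinct sorted training values; if there are more candidates
    # than the budget, take an evenly spaced subsample.
    v = sorted(set(values))
    mids = [(a + b) / 2 for a, b in zip(v, v[1:])]
    if len(mids) <= n_thresholds:
        return mids
    step = len(mids) / n_thresholds
    return [mids[int(k * step)] for k in range(n_thresholds)]

def binarise(x, thresholds):
    # One binary feature per threshold, encoding the test "x < a";
    # the learner also has access to the negated literal "x >= a".
    return [int(x < a) for a in thresholds]

income = [1200, 1500, 1500, 2400, 3100]
ths = midpoint_thresholds(income, 3)
print(ths)               # midpoints between the distinct incomes
print(binarise(2000, ths))
```

Placing thresholds between observed values, rather than at them, ensures no training point sits exactly on a decision boundary.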

B. GLOBAL MODEL
We compared the optimal decision tree model of Section II-B with rule size K against the decision tree algorithms CART [24] and RIPPER [25]. For CART, we trained decision trees of both fixed and unlimited tree depth using the DecisionTreeClassifier algorithm in the scikit-learn v0.23.1 Python package. For RIPPER, decision trees were trained using the Java Weka (v3.9.2) JRip classifier [26]. In our optimal decision tree model, we fixed the regularization parameter λ ∈ {1.0, 0.1, 0.01} and the number of binarisation thresholds per numerical feature at |a_j| = 20, except for the HBA dataset, where we used |a_j| = 60. To provide statistically meaningful measures of accuracy, we report 5-fold cross-validation scores averaged over 100 realizations of the training set [27].

Table 2 reports the average training accuracies (above) and test accuracies (below) for each dataset. For the DecisionTreeClassifier (DT), we report the highest achieved test score over all fixed-depth and unlimited-depth CART classifiers. We only considered RIPPER after relinquishing our rights to the HBA dataset, which is why its entry is missing for that dataset. Increasing the number of thresholds |a_j| during the discretization of a dataset tends to increase the accuracy of the model. This trend is plotted in Fig. 2 (left) for the HBA dataset and fixed parameters K = 5 and λ = 10^-2. The regularization parameter λ reflects model sparsity and directly impacts overfitting/underfitting during training. A typical trend for the HBA dataset is plotted in Fig. 2 (right) for K = 5 and |a_j| = 60.
We provide an example 3-clause classification rule extracted by our method on the HBA dataset below.
(expenses < 550 ∨ housing exp. < 700) ∧
(deductibles < 1040 ∨ child exp. < 970) ∧
(expenses < 80 ∨ income < 1460)    (18)

For decision trees larger than K = 3, the 2-hour timeout was consistently reached. In comparison, the CART and RIPPER methods trained within milliseconds.
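Read as a CNF rule, the extracted classifier above can be applied directly to an application record. In the sketch below the dictionary keys and the example values are invented for illustration; the clause structure is the one printed above:

```python
def rule(app):
    # Conjunction of the three extracted disjunctive clauses: the
    # application is accepted only if every clause is satisfied.
    return ((app["expenses"] < 550 or app["housing_exp"] < 700)
            and (app["deductibles"] < 1040 or app["child_exp"] < 970)
            and (app["expenses"] < 80 or app["income"] < 1460))

app = {"expenses": 500, "housing_exp": 800,
       "deductibles": 1100, "child_exp": 900, "income": 1400}
print(rule(app))  # True: each clause has at least one satisfied literal
```

When the rule rejects an application, the first unsatisfied clause doubles as the explanation: it names exactly which feature tests failed.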

C. LOCAL MODEL
The second experiment considered query-based classification: each query (housing benefit application) was classified using a local dataset of the N data points closest to the query. That is, for each query point in the test dataset D_test, we subsampled a training dataset D_train of the N nearest data points in D_database based on (17). We then trained the model on D_train, as described in Section III, and used it to classify the query point.
Since the model was query-based, we fixed a randomly chosen test set, |D_test| = 1250, for each experiment and chose D_database from the remaining 5000 data points. We again fixed the model parameters λ = 0.1 and |a_j| = 10 and chose the N = 100 nearest neighbours. Since the training dataset was restricted to a small neighbourhood around the query point, we considered shallow decision trees with K = 2 and K = 3. The resulting model was compared against the nearest-neighbour algorithm (we used the scikit-learn 0.23.1 Python package with default settings). We repeated the experiment over 100 realizations of the dataset. Fig. 3 (top) shows the median classification accuracies and the 10th/90th percentiles (shaded) of the models as |D_database| is increased. Fixing |D_database| = 5000, we also plot the dependence of our model's classification accuracy on the size of the training dataset (left) and on the number of feature thresholds (right). As expected, increasing the size of the database leads to more accurate models. The local decision tree model performs best on small neighbourhood sizes (∼100) and does not require a large number of thresholds a_j. Average runtimes were around 30 seconds for K = 2 and 5 minutes for K = 3.

V. CONCLUSION AND FUTURE WORK
A. CONCLUSION
We have generalized existing methods for constructing weighted optimal decision trees in the form of k-cnf rules via MaxSAT. We also translated this framework into a lazy learning scheme where decision trees are constructed locally in a query-oriented fashion.
Our empirical study (Table 2) supports the belief that optimal methods can be significantly more accurate than heuristic approaches, at the cost of computational efficiency.
Our lazy-learning method (Section III) shows performance comparable to that of the k-nearest-neighbour method (Fig. 3). The method finds optimal decision trees with two clauses (K = 2) and small neighbourhood sizes (δ), which allows for particularly fast training.
The explanations (18) take the form of clauses in a decision tree. Additional work may be required to translate these mathematical explanations into proper justifications of decisions on social benefit applications: it is difficult, perhaps impossible, to base the required legal reasoning of a benefit decision on the mathematical explanation alone. In spite of this, we believe that these methods could provide sufficient transparency in other applications.

B. FUTURE WORK
A key restriction for learning optimal decision trees is the exponential growth of the data constraints. Specializing MAX-SAT solvers to the subclass of problems described by (9)-(13) could make the search more efficient. Furthermore, we expect that exploiting symmetries can help guide the search for candidate solutions. In fact, during our experiments, we made an attempt at introducing symmetry constraints explicitly, but this had the adverse effect of slowing down the solver. We leave both solver specialization and symmetry constraints for future work.