Rethinking Logic Minimization for Tabular Machine Learning

Tabular datasets can be viewed as logic functions that can be simplified using two-level logic minimization to produce minimal logic formulas in disjunctive normal form, which in turn can be readily viewed as an explainable decision rule set for binary classification. However, there are two problems with using logic minimization for tabular machine learning. First, tabular datasets often contain overlapping examples that have different class labels, which have to be resolved before logic minimization can be applied since logic minimization assumes consistent logic functions. Second, even without inconsistencies, logic minimization alone generally produces complex models with poor generalization because it exactly fits all data points, which leads to detrimental overfitting. How best to remove training instances to eliminate inconsistencies and overfitting is highly nontrivial. In this article, we propose a novel statistical framework for removing these training samples so that logic minimization can become an effective approach to tabular machine learning. Using the proposed approach, we are able to obtain performance comparable to that of gradient boosted and ensemble decision trees, which have been the winning hypothesis classes in tabular learning competitions, but with human-understandable explanations in the form of decision rules. To the best of the authors' knowledge, neither logic minimization nor explainable decision rule methods have previously achieved state-of-the-art performance in tabular learning problems.


Litao Qiao, Weijia Wang, Sanjoy Dasgupta, and Bill Lin, Member, IEEE
Impact Statement: Decision rule sets are an important hypothesis class for tabular learning problems in which the ability to provide human-understandable explanations is of critical importance. However, they are generally not the winning hypothesis class in terms of accuracy. Black-box models like gradient boosted and ensemble decision trees are generally the superior models. In this article, we revisit the use of logic minimization to derive explainable decision rule sets from tabular datasets. Logic minimization alone produces complex models with poor generalization because it exactly fits all data points as provided. We overcome this problem by removing instances that cause inconsistencies and overfitting via a novel statistical framework. The proposed approach makes possible the learning of decision rules that achieve state-of-the-art classification performance in tabular learning problems with explainable rule-based predictions, which has not been achieved before.

I. INTRODUCTION
In machine learning domains such as healthcare and criminal justice, where human lives may be deeply impacted, creating inherently interpretable models that can provide human-understandable explanations is critically important [1]. In these domains, the datasets are often provided as tabular data with naturally meaningful features. Due to their intrinsic explainability, decision rule sets [2], [3], [4], [5] are often a popular hypothesis class of choice in these applications. However, they are not the winning class in these tabular learning problems in terms of accuracy. For example, in Kaggle competitions, gradient boosted and ensemble decision trees [6], [7], [8] are generally the superior models. While these more complex classifiers can provide some level of feature attributions to predictions, their interpretability is limited compared to the rule-based sentences that decision rule sets provide, which can be easily understood by humans.
In this article, we explore the use of two-level logic minimization as a means for deriving explainable decision rule sets for tabular learning. An example of a decision rule set with two conjunctive rules is as follows.

IF (Systolic blood pressure > 120) OR (age > 60 AND cholesterol = very high)
THEN Presence of cardiovascular disease.

In this example, the model would predict someone to have cardiovascular disease if the person has systolic blood pressure above 120, or if the person is above 60 years of age and has a very high level of cholesterol. The model not only provides a prediction, but the corresponding matching rule also provides an explanation that humans can easily understand.¹

¹ As discussed in Section V on related work, prior work on decision rule sets has established the benefits of interpretability of decision rule sets for tabular learning problems over black-box models (e.g., [1]), primarily because the activated IF-THEN rule also provides an explanation in terms of human-understandable features. Beyond what has already been studied in the literature about the interpretability of decision rule sets, we do not make further claims in this article regarding the interpretability of decision rule sets. Instead, our focus is on a new logic minimization approach for deriving decision rules that can achieve state-of-the-art classification performance in tabular learning problems, which neither logic minimization nor decision rule methods have been able to achieve before. We believe advancing the state of the art in both logic minimization and decision rule learning for tabular machine learning is of significant importance.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

TABLE I TEST ACCURACY (AS A PERCENTAGE) FOR THE CARDIOVASCULAR DISEASE DATASET (CARDIO) [9]

In particular, the explanations are stated directly in terms of meaningful input features, which can be categorical (e.g., color equal to red, blue, or green) or numerical (e.g., age > 60) attributes, where the binary encoding of categorical and numerical attributes is well-studied [4], [5].
When binary encoded, tabular datasets can be viewed as logic functions to be minimized, and the minimized logic in disjunctive normal form (DNF) can be readily viewed as an explainable decision rule set for binary classification. However, tabular datasets often contain overlapping examples that have different class labels, which have to be resolved before logic minimization can be applied since logic minimization assumes consistent logic functions. Such inconsistencies can be resolved by taking the majority class such that the largest consistent subset of nonoverlapping training instances is retained. Logic minimization can then be applied to the derived incompletely specified logic function to fit the data points exactly with a minimal number of rules and a minimal number of conditions in each rule. However, in practice, the decision rule set derived this way tends to perform poorly in test accuracy and contains complex rules.
Consider the cardiovascular disease dataset (cardio) from the Kaggle competition [9]. This task predicts whether a patient has cardiovascular disease or not based on the patient's basic information, the results of medical examinations, and the extra information given by the patient. The performance of the classifier derived by logic minimization in the abovementioned manner is shown in Table I, with a test accuracy of only 66.03% [shown in the row labeled "logic minimization (no-denoise)"]. This is quite poor in comparison, for example, to known decision rule learners like RIPPER [2], which achieves 70.57% test accuracy, or a state-of-the-art nonexplainable tabular learner like XGBoost (gradient boosted decision tree) [6], which achieves 73.06% test accuracy.
Our conjecture is that logic minimization used in the abovementioned manner fails to produce accurate classifiers in part because it overfits the training data. In particular, because logic minimization exactly fits all data points, noisy data points (those whose label is not the Bayes-optimal choice) can be quite problematic. These noisy data points can lead to a model that both generalizes poorly and is larger than would be needed. In addition, resolving inconsistencies by means of the majority class is often not the best strategy. How best to remove training instances to eliminate overfitting and inconsistencies is highly nontrivial.
To remedy these problems, we propose a statistical framework for denoising (to be detailed later) the training dataset by removing a subset of noisy data points, for the purpose of eliminating both overfitting and inconsistencies. Logic minimization can then be applied to this edited dataset to produce simple and accurate decision rules from the minimized DNF formula. With the denoising preprocessing step, logic minimization is able to produce a classifier that achieves 73.20% test accuracy for the cardio dataset, as shown in Table I, which is significantly better than logic minimization without denoising, significantly better than known decision rule learners, and comparable to state-of-the-art tabular learners like XGBoost. As shown in the evaluation section, our logic minimization approach with denoising achieves accuracies that are, on average over all datasets evaluated, within just 0.7% of those of the state-of-the-art but nonexplainable tabular learners. Thus, our approach is able to achieve results comparable to the state of the art while providing human-understandable explanations in the form of decision rules. To the best of the authors' knowledge, neither logic minimization nor explainable decision rule methods have previously achieved state-of-the-art performance in tabular learning problems.
The rest of this article is organized as follows. Section II formulates tabular learning as a logic minimization problem. Section III introduces our denoising framework to enable logic minimization to achieve state-of-the-art performance. Section IV provides an extensive evaluation of our proposed approach. Section V outlines related work. Finally, Section VI concludes this article.

II. TABULAR LEARNING AS LOGIC MINIMIZATION
As discussed in the previous section, a tabular dataset can be viewed as an incompletely specified logic function that can be minimized into a DNF formula, which can then be readily translated into independent unordered IF-THEN decision rules. In this section, we first provide further details regarding the binarization of tabular datasets into incompletely specified logic functions. We then summarize the role of two-level logic minimization as a decision rule learner.

A. Binarization of Tabular Data
Although binary features commonly appear in tabular datasets, these datasets also generally include categorical and numerical features, which are naturally used when the data is collected. In this work, we assume all data are binary encoded and, thus, categorical and numerical features need to be first binarized using well-established preprocessing steps in the machine learning literature. In particular, we follow exactly the same binarization approach used in some decision rule learners [4], [5], where we simply one-hot encode all categorical features into binary vectors. For numerical features, we adopt quantile discretization based on the distribution of numerical values in the training data to get a set of thresholds for each feature, where the original numerical value is encoded into a binary vector by comparing it with the thresholds (e.g., age ≤ 25, age ≤ 50, age ≤ 75), with each entry encoded as 1 if the value is less than the threshold or 0 otherwise. This binarization approach for numerical features has been widely used by decision rule learners and has been shown to achieve better performance than directly discretizing numerical values into intervals [4].
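The binarization described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the preprocessing code used in [4], [5]: the function names are hypothetical, the comparison uses ≤ as in the parenthetical example (age ≤ 25), and the quantiles come from the standard-library `statistics.quantiles` with its default method.

```python
import statistics

def quantile_thresholds(values, num_thresholds=10):
    """Pick thresholds for one numerical column from its training values.

    Assumption: if the column has fewer unique values than requested
    thresholds, the unique values themselves serve as thresholds.
    """
    unique = sorted(set(values))
    if len(unique) < num_thresholds:
        return unique
    # quantiles(..., n=k+1) returns k cut points.
    cuts = statistics.quantiles(values, n=num_thresholds + 1)
    return sorted(set(cuts))

def binarize_numeric(values, thresholds):
    """Encode each value as one bit per threshold: 1 if value <= threshold."""
    return [[int(v <= t) for t in thresholds] for v in values]

def one_hot(values):
    """One-hot encode a categorical column; returns (categories, rows)."""
    categories = sorted(set(values))
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows
```

For instance, with the thresholds 25, 50, and 75 from the example, an age of 30 is encoded as the three bits 0, 1, 1.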

B. Logic Minimization as a Decision Rule Learner
Once binary encoded, the tabular dataset can be viewed as an incompletely specified logic function. As noted earlier, when an instance in the dataset has both positive and negative labels, the majority label can be taken to define the incompletely specified logic function. In particular, a binary encoded tabular dataset can be viewed as an incompletely specified logic function $f : \{0, 1\}^m \to \{0, 1, *\}$ that maps an $m$-dimensional binary encoded vector $x \in \{0, 1\}^m$ into either 0, 1, or $*$. The set of vectors $\{x \in \{0, 1\}^m : f(x) = 1\}$ is referred to as the ON-set, the set of vectors $\{x \in \{0, 1\}^m : f(x) = 0\}$ is referred to as the OFF-set, and the set of vectors $\{x \in \{0, 1\}^m : f(x) = *\}$ is referred to as the dc-set (the do not care set) [10].
With respect to a binary encoded tabular dataset, all instances $x$ with a positive label would be included in the ON-set (i.e., $f(x) = 1$), and all instances $x$ with a negative label would be included in the OFF-set (i.e., $f(x) = 0$). All other input combinations $x \in \{0, 1\}^m$ not specified in the encoded dataset would belong to the dc-set (i.e., they are implicitly defined to be $f(x) = *$).
Given an incompletely specified logic function, well-established two-level logic minimization algorithms can be employed to produce a minimized DNF formula as a disjunction (OR) of conjunctive (AND) terms [10]. Modern two-level logic minimization algorithms are able to guarantee a prime and irredundant cover for a given incompletely specified logic function, which means no conjunctive (AND) term can be made simpler by removing a feature (i.e., the conjunctive term is a prime), and no conjunctive term can be removed to further simplify the DNF formula (i.e., the cover is irredundant). In terms of the corresponding decision rule set, it means no rules can be further simplified or removed from the rule set. However, as noted earlier, logic minimization alone can lead to models with poor generalization due to the presence of noisy instances in the training data, which leads to detrimental overfitting. This problem can be remedied by first removing these noisy data points through a denoising process, as described in Section III.
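To make the ON-set, OFF-set, and dc-set view concrete, the following Python sketch builds the ON-set and OFF-set from labeled binary rows with majority-vote conflict resolution, and checks whether a candidate DNF formula is a valid cover. It does not perform minimization itself. The function names, the dictionary representation of a conjunctive term, and the tie-breaking toward label 0 are illustrative assumptions.

```python
from collections import Counter, defaultdict

def build_on_off_sets(rows, labels):
    """Resolve conflicting labels by majority vote, then split the inputs.

    Inputs never seen in `rows` implicitly belong to the dc-set.
    Assumption: ties are broken toward label 0.
    """
    votes = defaultdict(Counter)
    for x, y in zip(rows, labels):
        votes[tuple(x)][y] += 1
    on_set, off_set = set(), set()
    for x, counts in votes.items():
        (on_set if counts[1] > counts[0] else off_set).add(x)
    return on_set, off_set

def term_covers(term, x):
    """A term maps feature index -> required bit (a conjunction of literals)."""
    return all(x[i] == bit for i, bit in term.items())

def dnf_predict(dnf, x):
    """A DNF formula is a list of terms; predict 1 if any term covers x."""
    return int(any(term_covers(term, x) for term in dnf))

def is_valid_cover(dnf, on_set, off_set):
    """A cover must include every ON-set point and exclude every OFF-set point."""
    return (all(dnf_predict(dnf, x) == 1 for x in on_set)
            and all(dnf_predict(dnf, x) == 0 for x in off_set))

rows = [(1, 0, 0), (1, 0, 0), (1, 0, 0), (0, 1, 1), (0, 0, 0)]
labels = [1, 1, 0, 1, 0]
on_set, off_set = build_on_off_sets(rows, labels)
# Hypothetical two-rule formula: x0 OR (x1 AND x2).
dnf = [{0: 1}, {1: 1, 2: 1}]
print(is_valid_cover(dnf, on_set, off_set))
```

Because unseen inputs are don't-cares, the formula is free to assign them either label; this freedom is what lets the minimizer find short terms.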

C. Example Rule Set From Logic Minimization
Consider a toy example shown in Table II, corresponding to a truth table derived from a binary-encoded tabular dataset. Logic minimization can be applied to this incompletely specified logic function to produce the following DNF formula:

This minimized DNF formula corresponds to the following decision rule set.

Overall, given a binary-encoded tabular dataset as an incompletely specified logic function, logic minimization produces a decision rule set in DNF as a classifier.

III. DENOISING FORMULATION
We now present a formal model in which the denoising process can be analyzed and understood.
Consider a binary classification task in which data points lie in an instance space $\mathcal{X}$ and the possible labels are $\mathcal{Y} = \{0, 1\}$. There is an unknown distribution $P$ over $\mathcal{X} \times \mathcal{Y}$ from which all instances and labels (past, present, and future) are drawn.
The distribution $P$ over $(X, Y)$ pairs can as usual be broken into two parts: the marginal distribution of $X$, denoted $\mu$, and the conditional probability distribution of $Y$ given $X$,

$$\eta(x) = \Pr(Y = 1 \mid X = x).$$

A classifier $h : \mathcal{X} \to \mathcal{Y}$ has error rate, or risk, $\mathrm{err}(h) = P(h(X) \neq Y)$. The lowest achievable risk is that of the Bayes-optimal classifier

$$g^*(x) = \begin{cases} 1 & \text{if } \eta(x) \geq 1/2 \\ 0 & \text{otherwise.} \end{cases}$$

Notice that if $\eta(x) = 1/2$, then either prediction is optimal. The risk of $g^*$, that is, $R^* = \mathrm{err}(g^*)$, is called the Bayes risk. In many applications, a significant part of the instance space has $\eta$ bounded away from 0 and 1 and, thus, $R^* > 0$.
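As a small worked illustration of these definitions, the Python sketch below computes the risk of a classifier and the Bayes risk on a finite instance space. The three-point distribution is invented for illustration, and the Bayes-optimal classifier is assumed to take the usual thresholding form (predict 1 when the conditional probability of label 1 is at least 1/2).

```python
def bayes_classifier(eta):
    """Predict 1 wherever eta(x) >= 1/2 (either label is optimal at exactly 1/2)."""
    return {x: int(p >= 0.5) for x, p in eta.items()}

def risk(h, mu, eta):
    """err(h) = sum over x of mu(x) * Pr(Y != h(x) | X = x)."""
    return sum(mu[x] * ((1 - eta[x]) if h[x] == 1 else eta[x]) for x in mu)

# Hypothetical distribution: marginal mu and conditional eta.
mu = {"a": 0.5, "b": 0.3, "c": 0.2}
eta = {"a": 0.9, "b": 0.2, "c": 0.5}

g_star = bayes_classifier(eta)
bayes_risk = risk(g_star, mu, eta)   # 0.5*0.1 + 0.3*0.2 + 0.2*0.5
always_one = {x: 1 for x in mu}
print(bayes_risk, risk(always_one, mu, eta))
```

Here the Bayes risk is strictly positive because no point has a deterministic label, which is exactly the regime in which exact fitting becomes problematic.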

A. Lack of Consistency of Learning Decision Rules by Logic Minimization
Given a dataset $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, logic minimization will find a DNF formula that exactly fits all these points. In cases where there is even a little bit of stochasticity in the labels, that is, $R^* > 0$, this can be problematic.
To see this, consider a situation where $\mathcal{X}$ is finite and $\eta(x) \notin \{0, 1\}$ (that is, there is some stochasticity in $x$'s label) for all $x$. Thus, any point $x$ can potentially occur in the dataset with both labels. Given a sufficiently large dataset, this will happen with every point.
For this reason, logic minimization alone is not a consistent method for learning a classifier: it is not guaranteed to converge to $g^*$ as the size of the training set grows. More generally, stochasticity in the labels can lead to the selection of a model that both generalizes poorly and is larger than would be needed for, say, the Bayes-optimal labeling. In particular, the problem is the presence of noise in the dataset, where a point $(x, y)$ is said to be noisy if $y \neq g^*(x)$ and $\eta(x) \neq 1/2$, i.e., $y$ is not the Bayes-optimal label for $x$.

B. Preprocessing as a Denoising Step
Our proposed preprocessing step has the effect of denoising the labels in the training set. We now establish this formally.
Given training data $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, we fit a classifier $g_n$ to $D$ and then define an edited training set

$$D' = \{(x_i, y_i) \in D : g_n(x_i) = y_i\}. \tag{1}$$

That is, the edited training set will contain the instances whose labels agree with the predicted labels of the classifier. Finally, we apply logic minimization to the edited data to get a set of decision rules.
In the original data $D$, the labels $y_i$ can disagree with the Bayes-optimal predictions $g^*(x_i)$ on as many as half of the points, since $\eta(x_i)$ can be arbitrarily close to 1/2. We will now see that for the edited data, the fraction is much smaller.
In interpreting the following lemma, recall that when $\eta(x) = 1/2$, either prediction (0 or 1) is Bayes-optimal. This leads to some messiness when stating results; to avoid it, we assume that none of the $x_i$ has $\eta(x_i)$ exactly 1/2. This holds with probability one if the decision boundary has measure zero, i.e., $\mu(\{x : \eta(x) = 1/2\}) = 0$.
Lemma 1: Fix any $x_1, \ldots, x_n \in \mathcal{X}$. Assume that $\eta(x_i) \neq 1/2$ for all of these points. Suppose each label $y_i$ is drawn according to the conditional probability distribution $\eta(x_i)$, and let $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$. Let $g_n$ be any classifier learned from $D$, and let $\epsilon_n$ denote the fraction of the points $\{x_i\}$ for which $g_n(x_i) \neq g^*(x_i)$. Finally, define the edited dataset $D' \subset D$ as in (1) above. Then,
a) $|\{(x, y) \in D' : y \neq g^*(x)\}| \leq \epsilon_n n$;
b) $\mathbb{E}|D'| \geq (1/2 - \epsilon_n)\, n$, where the expectation is over the randomness in the labels $y_i$.

Proof: Let $B$ denote the set of "bad" data points $x_i$ on which $g_n(x_i) \neq g^*(x_i)$. We are given that $|B| = \epsilon_n n$.
For part (a), note that any $x_i$ that makes it into $D'$ has $y_i = g_n(x_i)$. Therefore, the only way that $y_i \neq g^*(x_i)$ is if $g_n(x_i) \neq g^*(x_i)$, that is, $x_i \in B$.
For part (b), a point is excluded from $D'$ only if $y_i \neq g_n(x_i)$, and

$$\{x_i : y_i \neq g_n(x_i)\} \subseteq \{x_i : g_n(x_i) \neq g^*(x_i)\} \cup \{x_i : y_i \neq g^*(x_i)\}.$$

The first set in this expression is $B$. The second set has expected size at most $n/2$, since for any $i$, $\Pr(y_i \neq g^*(x_i)) \leq 1/2$.

In short, $D'$ contains roughly at least half the original training points, and the fraction of faulty (non-Bayes-optimal) labels in it is at most $O(\epsilon_n)$.

Lemma 1 works for any choice of intermediate classifier $g_n$. We suggest taking $g_n$ from a family of classifiers that is strongly consistent, that is, for which $\mathrm{err}(g_n) \to R^*$ almost surely as $n \to \infty$. Under this condition, the error $\epsilon_n$ defined in the lemma goes to zero. Strong consistency is known to hold for the adaptive nearest neighbor rule [11], for boosted decision trees [12], [13], and for support vector machines with the Gaussian kernel [14].
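The counting argument behind the lemma can be checked numerically. The Python sketch below is an illustration under stated assumptions, not an experiment from this article: labels are drawn from given conditional probabilities, an arbitrary learner produces predictions on the training points, and the script counts the bad points (where the learner disagrees with the Bayes-optimal label), the edited set, and the faulty labels that survive editing. The `window_majority_learner` is a hypothetical stand-in for a real intermediate classifier.

```python
import random

def simulate_editing(etas, learner, seed=0):
    """Draw labels, edit the dataset with the learner, and count key sets."""
    rng = random.Random(seed)
    n = len(etas)
    g_star = [int(eta > 0.5) for eta in etas]          # assumes eta != 1/2
    labels = [int(rng.random() < eta) for eta in etas]  # y_i ~ eta(x_i)
    g_n = learner(labels)                               # predictions on x_1..x_n
    bad = {i for i in range(n) if g_n[i] != g_star[i]}
    flipped = {i for i in range(n) if labels[i] != g_star[i]}
    edited = [i for i in range(n) if labels[i] == g_n[i]]
    faulty = [i for i in edited if i in flipped]
    return {"n": n, "bad": len(bad), "flipped": len(flipped),
            "edited": len(edited), "faulty": len(faulty)}

def window_majority_learner(labels, radius=5):
    """Hypothetical smoother: majority label among index-neighbors."""
    preds = []
    for i in range(len(labels)):
        window = labels[max(0, i - radius): i + radius + 1]
        preds.append(int(2 * sum(window) > len(window)))
    return preds

etas = [0.9] * 50 + [0.1] * 50
stats = simulate_editing(etas, window_majority_learner, seed=1)
memorized = simulate_editing(etas, lambda labels: list(labels), seed=1)
print(stats, memorized)
```

Two inequalities hold for every draw: surviving faulty labels never outnumber the bad points, and the edited set loses at most the bad points plus the points whose drawn label differs from the Bayes-optimal one. A learner that merely memorizes the labels keeps every point, including all the noisy ones, which is why a consistent learner is the suggested choice.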

C. Localization Properties of the Preprocessing
In data drawn from the underlying distribution $P$, as many as half the points $x$ could have labels that disagree with the Bayes-optimal prediction $g^*(x)$, due to the stochasticity in the conditional probability distributions $\eta(\cdot)$. The preprocessing step selects a subset $D' \subset D$ that is not too much smaller than $D$ and in which at most an $O(\epsilon_n)$ fraction of the labels are noisy.
However, even a small amount of noise can be troublesome if it is scattered throughout the instance space. This is because logic minimization searches for logical rules (conjunctions) that perfectly agree with the data, and even one noisy label could falsely invalidate a good rule.
We now show that if the estimator $g_n$ is an adaptive nearest neighbor rule [11], then any noisy points in $D'$ are localized: they are not spread throughout $\mathcal{X}$, but lie in a region around the decision boundary, and this region shrinks as the size of the training set, $n$, is increased.
We begin with a brief overview of the adaptive nearest neighbor classifier. In contrast with $k$-nearest neighbor, which makes a prediction on a query point $x$ by looking at its $k$ nearest neighbors in the training set, the adaptive rule does not use a predefined choice of $k$. Instead, it grows $k$ until the resulting set of training labels has a significant majority, and then predicts accordingly. If a significant majority is never achieved, then it outputs "?" (don't know). The tradeoff between accuracy and level of abstention is managed through a single confidence parameter $0 < \delta < 1$. The smaller this parameter, the higher the required level of significance; this results in more don't-knows as well as higher accuracy when a prediction is actually made.
In the terminology above, the adaptive nearest neighbor classifier produces predictions $g_n(x) \in \{0, 1, ?\}$. Our editing rule will discard any point $(x_i, y_i)$ with $g_n(x_i) \neq y_i$; this includes any point with $g_n(x_i) = {?}$.
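The grow-until-significant idea can be sketched as follows. This is a simplified illustration, not the exact rule of [11]: the significance test is assumed to compare the mean of the neighbors' labels (mapped to -1 and +1) against a margin of the form A/√k, where the constant `A` plays the role of the confidence parameter (larger `A` means stricter significance and more abstentions), and abstention is represented by `None`.

```python
import math

def adaptive_nn_predict(train_x, train_y, query, A=1.5):
    """Grow k until the k nearest labels show a significant majority.

    Assumption: the majority is 'significant' when the absolute mean of
    the +/-1 labels exceeds A / sqrt(k). Returns 0, 1, or None (abstain).
    """
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    total = 0
    for k, i in enumerate(order, start=1):
        total += 1 if train_y[i] == 1 else -1
        if abs(total) / k > A / math.sqrt(k):
            return 1 if total > 0 else 0
    return None  # no significant majority at any k: "don't know"

# Two well-separated clusters: confident predictions near each cluster.
xs = [(0.0,), (0.1,), (0.2,), (0.3,), (10.0,), (10.1,), (10.2,), (10.3,)]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
print(adaptive_nn_predict(xs, ys, (0.05,)), adaptive_nn_predict(xs, ys, (10.2,)))

# Interleaved labels: no majority is ever significant, so the rule abstains.
mixed_x = [(0.0,), (1.0,), (2.0,), (3.0,)]
mixed_y = [0, 1, 0, 1]
print(adaptive_nn_predict(mixed_x, mixed_y, (1.5,)))
```

Under the editing rule, a training point for which this function returns `None` is discarded together with the points whose predicted label disagrees with the observed one.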
What are the points on which $g_n$ will fail to predict the correct label (or abstain)? It turns out that these are guaranteed to be near the decision boundary, that is, to have $\eta(x)$ close to 1/2. The following result is a corollary of the convergence guarantees of the adaptive nearest neighbor estimator [11, Th. 2].
Lemma 2: Suppose $\mathcal{X} \subset \mathbb{R}^d$ and $\eta$ is $\alpha$-Hölder continuous. Let $g_n$ denote the adaptive nearest neighbor classifier with confidence parameter $0 < \delta < 1$. Let $D'$ be the edited training set, as defined in Lemma 1. Then, with probability at least $1 - \delta$ (over the randomness in the original dataset), every point in $D'$ with
The adaptive nearest neighbor work introduced a notion of margin at each point $x \in \mathcal{X}$, called the "advantage" at $x$ and denoted $\mathrm{adv}(x)$ [11]. This is a value in the range [0, 1] and corresponds to the statistical ease of estimating the Bayes-optimal prediction at $x$. Roughly speaking, points with low advantage are those near the decision boundary. Under Hölder smoothness conditions on $\eta$, it can be shown by following [15, Lemma 18] that for any $q \in [0, 1]$,

$$\mathrm{adv}(x) < q \;\Rightarrow\; x \in \mathrm{BD}(Lq^{\beta})$$

for some $L > 0$ and $\beta = \alpha/(d + 2\alpha)$.
The key convergence guarantee (see [11, Th. 2]) of the nearest neighbor estimator $g_n$ is that with probability $> 1 - \delta$, it will correctly classify all points with significant advantage.
The statement then follows by tracing the proof of Lemma 1 and observing that any mistake in $D'$ is also a point on which $g_n$ disagrees with $g^*$.

D. Details of the Denoising Step
Earlier in this section, the edited training set was defined by (1). We further elaborate on this denoising step in the pseudocode shown in Algorithm 1.
The denoised datasets generated using Algorithm 1 are guaranteed to be functions (there is only one unique output label for each input combination) even though the original dataset may be a relation (each input combination may have multiple output labels). This follows from the fact that all classifiers $g_n$ used for denoising are functions, and thus, the label that corresponds to the prediction $g_n(x_i)$ is unique for input $x_i$, i.e., if $x_i = x_j$, then $g_n(x_i) = g_n(x_j)$. On the other hand, as noted before, in the original dataset, the same $x_i$ may have both positive and negative labels. In that case, to derive an incompletely specified logic function, one of the labels has to be taken, for example, by taking the majority label.
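The editing rule and the function property can be illustrated with a short Python sketch. It follows the textual definition of the edited training set (keep the instances whose labels agree with the classifier's predictions) and is not a transcription of Algorithm 1. The `fit_single_feature_vote` learner is a hypothetical stand-in for AKNN, SVM, or XGBoost, chosen only to keep the example self-contained.

```python
from collections import Counter, defaultdict

def denoise(dataset, fit):
    """Keep only the (x, y) pairs whose label matches the fitted classifier."""
    g_n = fit(dataset)
    return [(x, y) for x, y in dataset if g_n(x) == y]

def is_function(dataset):
    """True if no input appears with two different labels."""
    seen = {}
    for x, y in dataset:
        if seen.setdefault(x, y) != y:
            return False
    return True

def fit_single_feature_vote(dataset, feature=0):
    """Hypothetical toy learner: majority label per value of one feature.

    Ties and unseen values default to label 0 (an assumption).
    """
    votes = defaultdict(Counter)
    for x, y in dataset:
        votes[x[feature]][y] += 1
    table = {v: int(c[1] > c[0]) for v, c in votes.items()}
    return lambda x: table.get(x[feature], 0)

data = [((1, 0), 1), ((1, 0), 0), ((1, 1), 1), ((1, 1), 1),
        ((0, 0), 0), ((0, 1), 0), ((0, 1), 1)]
edited = denoise(data, fit_single_feature_vote)
print(is_function(data), is_function(edited), len(edited))
```

Because the classifier assigns a single label to each input, two copies of the same input with different labels can never both survive the filter, so the edited set needs no separate majority-vote step.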

IV. EXPERIMENTAL EVALUATION
A. Evaluation Setup
1) Datasets: We performed numerical evaluations on seven publicly available tabular datasets, most of which have more than 10,000 instances and comprise categorical and numerical attributes for each instance before binarization. Among them, four are from Kaggle (churn, airline, market, and cardio), two (adult and chess) are from the UCI Machine Learning Repository [16], and the last one (retention) is from the AIX360 package [17]. For all datasets, we adopted the preprocessing approach discussed in Section II to encode categorical and numerical attributes into binarized features.
A fixed number of ten thresholds is used for all numerical features unless there are fewer than ten unique values in the feature column, in which case we used the unique values as thresholds. All results in this section were obtained using nested five-fold cross-validation, which selects the best parameters for optimizing the models' performance on each partition.
2) Denoising and Logic Minimization: As discussed in Section III, we first perform a denoising step to remove noisy training samples. In particular, we experimented with three strongly consistent classifiers that are known to theoretically converge to a Bayes-optimal classifier to perform the denoising, namely adaptive nearest neighbor (AKNN) [11], support vector machines (SVM) with the Gaussian kernel [18], and gradient boosted decision trees (XGBoost) [6]. We then applied logic minimization to the denoised training datasets to derive the decision rules in DNF. For logic minimization, we used the ESPRESSO minimizer [19], which is a widely used computer program that efficiently solves the two-level logic minimization problem with iterative improvements. In our experiments, the decision rule set models derived by applying ESPRESSO to the denoised datasets are named "Denoise-A," "Denoise-S," and "Denoise-X" for AKNN, SVM, and XGBoost, respectively. We also included the results of using ESPRESSO directly on the original noisy datasets, which is named "no-denoise."
3) Baselines and Parameter Tuning: Apart from the baseline models that we used to remove the noisy training samples in the datasets, we also included the following five other classifiers: RIPPER [2], CG [5], random forest (RF) [8], classification and regression tree (CART) [20], and a deep neural network (DNN). The first two are representatives of the state-of-the-art decision rule learners, while the next two are popular machine learning models used on tabular datasets. We also included a neural network as another black-box model for comparison. In particular, we used a 6-layer deep fully connected neural network, with 64 neurons per layer and ReLU activation in the intermediate layers. Overall, we consider three explainable models (CART, RIPPER, and CG) and five nonexplainable models (AKNN, RF, SVM, XGBoost, and DNN) in our evaluations. In particular, decision trees were constructed using the CART [20] algorithm, RIPPER is an old variant of the sequential covering algorithm that greedily mines a rule set from the dataset, and CG formulates the problem of learning a decision rule set as a mixed-integer programming problem with a loss function that captures the interpretability and accuracy of the decision rule set at the same time. Since CG cannot implicitly learn the negations of the input binarized features, the negations of the input features were appended to the datasets for CG only so that we can get the best models from it.
As stated before, all classifiers were trained with the best parameters according to the nested five-fold cross-validation. Specifically, we varied the minimum number of samples per leaf for CART and RF, the regularization term for XGBoost, the regularization parameter C for SVM, the parameter A corresponding to the confidence parameter δ as stated in the AKNN paper, and the learning rate for DNN. For rule learners, we tuned the parameters used in the actual implementations that control the complexity of the decision rule set: the maximum number of conditions and the maximum number of rules for RIPPER; the cost of each clause and the cost of each condition for CG. Since there is no parameter to be tuned for ESPRESSO, the results of no-denoise were obtained on the same training and test datasets used by other methods without any parameter tuning. Also, the parameters tuned for Denoise-A, Denoise-S, and Denoise-X are exactly the same as the parameters tuned for AKNN, SVM, and XGBoost, respectively. We used the sklearn implementations [21] for RF, CART, and SVM. The implementations of other models are publicly available on GitHub.
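For readers unfamiliar with nested cross-validation, the following Python sketch shows the general pattern: an inner loop selects the parameter on each outer training split, and the outer loop reports accuracy on the held-out fold. It is a generic illustration, not the evaluation code used here; the fold construction, the `fit(data, param)` interface, and the toy threshold learner are all assumptions.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

def nested_cv(data, fit, param_grid, k=5, seed=0):
    """Return one held-out accuracy per outer fold."""
    outer = k_fold_indices(len(data), k, seed)
    scores = []
    for o, test_idx in enumerate(outer):
        train = [data[i] for f, fold in enumerate(outer) if f != o for i in fold]
        test = [data[i] for i in test_idx]
        inner = k_fold_indices(len(train), k, seed + 1)

        def inner_score(param):
            # Mean validation accuracy over the inner folds.
            accs = []
            for j, val_idx in enumerate(inner):
                part = [train[i] for f, fold in enumerate(inner) if f != j for i in fold]
                val = [train[i] for i in val_idx]
                accs.append(accuracy(fit(part, param), val))
            return sum(accs) / len(accs)

        best = max(param_grid, key=inner_score)
        scores.append(accuracy(fit(train, best), test))
    return scores

# Toy learner (hypothetical): the parameter is a decision threshold.
def fit_threshold(train, param):
    return lambda x: int(x > param)

toy = [(i, int(i >= 50)) for i in range(100)]
scores = nested_cv(toy, fit_threshold, [10, 49.5, 90])
print(scores)
```

The test fold of each outer split is never seen during parameter selection, so the reported scores are not biased by the tuning itself.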

B. Classification Results on Popular Tabular Datasets
1) Denoised Datasets Statistics: The sizes (total number of data points) of the training sets and the percentages of noisy instances in the training set removed by the denoising methods, i.e., AKNN, SVM, and XGBoost, are shown in Table III.
The standard benchmarks shown are generally considered to be large and sufficiently representative, with some benchmarks containing up to 56,000 instances. The percentages of removed instances reflect how noisy each classifier considers the datasets to be, and span a wide range from 0.10% to 25.72%. Among the seven datasets, chess, retention, and airline contain the least noise, whereas cardio has around a quarter of its data points flagged as noisy, which, as we will see later, is consistent with the performance of logic minimization when no denoising is applied. In general, the number of removed noisy instances is relatively limited compared with the size of the dataset, which matches our theoretical expectation.
2) Improvements Over Standard Logic Minimization: As seen in the first four rows of Table IV, logic minimization with denoising techniques (Denoise-A, Denoise-S, and Denoise-X) always achieves significant improvements over standard logic minimization (no-denoise), where the latter is on average the least competitive among all models due to its lack of consistency as a classifier. In particular, logic minimization with denoising yields an improvement in test accuracy by as much as 9% (adult, churn) compared with its no-denoise counterpart when a clear degree of noisiness (e.g., > 13%) is present in the dataset, which validates that preprocessing the dataset by removing the noisy data points is an effective method to enhance logic minimization as a machine learning model. On the other hand, no-denoise outperforms or is on par with all other explainable models (CART, RIPPER, and CG) and AKNN on the chess, retention, and airline datasets, indicating that logic minimization without any preprocessing might be a good choice for datasets that come with low stochasticity. In both scenarios, we can always expect a performance gain by denoising the datasets first before applying logic minimization, with logic minimization benefiting more substantially from noise removal when the noise percentage is higher.
As already shown in Tables III and IV, denoising noisy datasets can significantly improve the performance of models derived from logic minimization. We further show this in Fig. 1, where we see four quadrants depicted. In the upper-right quadrant, we see that a large reduction in the denoised dataset generally correlates with a significant improvement in test accuracy. This is because a large reduction implies that the dataset is noisy, which causes detrimental overfitting problems for the logic minimizer. Therefore, logic minimization gains significant improvements by first denoising the dataset. On the other hand, we see in the lower-left quadrant that a small reduction in the denoised dataset generally correlates with a more modest improvement in test accuracy. This demonstrates a clear positive correlation between the noise ratio and the corresponding accuracy improvement after denoising.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

TABLE IV TEST ACCURACY (AS A PERCENTAGE) FOR ALL CLASSIFIERS WITH STANDARD DEVIATION
Fig. 1. Percentage of the noisy instances removed by each classifier (X-axis) versus the test accuracy improvement obtained by applying logic minimization to the denoised dataset compared with the no-denoise results (Y-axis).

3) An Overall Better Explainable Model:
The benefit of a decision rule set in DNF is that the user can always extract the satisfied rules to explain a decision. In comparison with other explainable models, our paradigm generates explainable decision rule sets that achieve superior predictive performance, as can be seen in Table IV (rows 2-7). The accuracies of Denoise-A, Denoise-S, and Denoise-X across all datasets are as much as 11% better than those of CART, RIPPER, and CG.
4) Competitive Compared to Nonexplainable Models: Not only are the decision rule sets from logic minimization on denoised datasets dominant over other explainable models, but they are also very competitive even when compared with nonexplainable models. As can be seen by comparing the logic minimization models (rows 2-4) and the nonexplainable models (rows 8-12) in Table IV, the performance of ESPRESSO applied to the denoised datasets is very close to that of the nonexplainable models, with a maximum difference of less than 1.5%, while still being completely explainable. This is significant because it has been generally thought that explainable models are not competitive with nonexplainable models, but our results show otherwise.
5) Decision Rule Set Without Loss of Performance: Lastly, we compare the performance of the ESPRESSO models with that of the corresponding classifiers used to remove the noisy data points in the training dataset; the differences can be seen in Table V. The accuracies of both Denoise-S and Denoise-X decrease by less than 1% relative to their denoising classifiers, providing further evidence that logic minimization applied to the denoised dataset can achieve performance comparable to that of the corresponding denoising classifier. Moreover, Denoise-A actually outperforms AKNN by more than 1%, further indicating that logic minimization can potentially generalize better on datasets that have low stochasticity. The last row in the table shows that logic minimization after the denoising process is very close to the state-of-the-art nonexplainable models, with a discrepancy of less than 0.7% in average test accuracy.

C. Denoising Example
In this section, we provide some intuition behind what denoising is doing by means of an example. Consider again the cardiovascular disease dataset (cardio) from the Kaggle competition [9], where the classification task is to predict the presence of cardiovascular disease based on the patient's information. This is a large training dataset comprising 56 000 data points. As shown in Table III, this is a noisy dataset, where the denoisers removed on average 24.77% of the dataset. This correlates with the significant accuracy improvements of logic minimization with denoising over no-denoising, with an average improvement of about 7% in accuracy (see Table IV). In the case of no-denoising, one of the decision rules derived by logic minimization is as follows. However, this rule only applies to 14 patients in the original dataset, which is a relatively small number of cases in comparison to the complete training dataset of 56 000 cases (under 0.03% of the cases). Rules with such narrow coverage cause logic minimization to introduce many more rules, each more complex than would otherwise be needed, just to exactly fit the given dataset.
On the other hand, after denoising the dataset (using SVM with a Gaussian kernel in this example), the above decision rule simplifies to just the following.
IF Systolic blood pressure > 130 THEN Presence of cardiovascular disease.
This new rule covers 15 740 data points in the original dataset. Our denoising step removed 2603 of these data points as noise (about 16.5% of these data points). In particular, without denoising, the original rule involving height, weight, and an upper limit on the systolic blood pressure was needed to cover relatively rare cases of patients without cardiovascular disease that had systolic blood pressure above 130, for example, with weight above 58 kg or height shorter than 170 cm. These rare anomalous cases detract from the general trend that patients with systolic blood pressure above 130 overwhelmingly have cardiovascular disease. This is an example of denoising that led to significant simplification of rules and much better generalization, as evidenced by the significant improvements in test accuracy.
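The notion of rule coverage used in this example can be illustrated with a minimal sketch. The field names `systolic_bp` and `disease` and the four toy records are hypothetical and are not taken from the cardio dataset; only the two percentages at the end are computed from figures quoted in the text.

```python
def rule_fires(record):
    """IF systolic blood pressure > 130 THEN presence of cardiovascular disease."""
    return record["systolic_bp"] > 130

# Hypothetical toy records (not real patients from the cardio dataset).
toy_records = [
    {"systolic_bp": 150, "disease": 1},
    {"systolic_bp": 120, "disease": 0},
    {"systolic_bp": 140, "disease": 1},
    {"systolic_bp": 135, "disease": 0},  # covered by the rule, but the label disagrees
]

# Coverage: the records on which the rule's condition holds.
covered = [r for r in toy_records if rule_fires(r)]
# Covered records whose label contradicts the rule's conclusion.
contradicting = [r for r in covered if r["disease"] != 1]

# Percentages quoted in the text, recomputed from the reported counts.
removed_share = round(100 * 2603 / 15740, 1)    # share of covered points removed as noise
narrow_rule_share = round(100 * 14 / 56000, 3)  # share of the dataset covered by the narrow rule

print(len(covered), len(contradicting))   # 3 1
print(removed_share, narrow_rule_share)   # 16.5 0.025
```

In this toy setting, the single contradicting record plays the role of the rare anomalous cases: an exact-fit minimizer would need extra conditions to exclude it, whereas removing it leaves the one-literal rule.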

D. Synthetic Data Experiments to Quantify the Impact of Denoising
1) Quantifying the Impact of Noise on Logic Minimization:
To quantitatively evaluate the impact of noisy data points on logic minimization, we manually generated synthetic noisy datasets based on predefined DNF rules. In particular, we consider the input space comprising 16 binarized features, which leads to 2^16 = 65 536 (about 64K) combinations. Then, we randomly generated five rules as the predefined rules, each of which is a conjunction of three randomly chosen features. To generate a synthetic dataset, we randomly sampled p "ground-truth" data points and labeled them according to the predefined rules. Besides the ground-truth data points, we further added q noisy data points into the synthetic dataset, so that the combined dataset contains p + q = 10 000 data points. To generate the noisy data points, we randomly sampled q new combinations and purposely mislabeled them with the label opposite to the predefined rules. For example, if we want to inject q = 3000 noisy data points, we would randomly sample p = 7000 ground-truth data points for a total of 10 000 data points in the synthetic dataset.
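The generation procedure described above can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' script: it assumes that each rule uses only positive literals (a rule fires when its three features all equal 1) and that the p + q input combinations are drawn without replacement, neither of which is stated explicitly in the text. All function names are hypothetical.

```python
import random

N_FEATURES = 16   # binarized features, so 2**16 = 65,536 possible inputs
N_RULES = 5       # number of predefined DNF rules
RULE_WIDTH = 3    # features per conjunction

def make_rules(rng):
    """Draw N_RULES conjunctions, each over RULE_WIDTH distinct feature indices."""
    return [tuple(sorted(rng.sample(range(N_FEATURES), RULE_WIDTH)))
            for _ in range(N_RULES)]

def dnf_label(x, rules):
    """Return 1 if at least one rule has all of its features set to 1, else 0."""
    return int(any(all(x[i] for i in rule) for rule in rules))

def make_dataset(p, q, rules, rng):
    """Return p correctly labeled points plus q points with flipped labels.

    Each element is (features, label, is_noisy).
    """
    # Assumption: all p + q combinations are distinct.
    codes = rng.sample(range(2 ** N_FEATURES), p + q)
    data = []
    for k, code in enumerate(codes):
        x = tuple((code >> i) & 1 for i in range(N_FEATURES))
        y = dnf_label(x, rules)
        is_noisy = k >= p               # the last q samples are mislabeled
        data.append((x, 1 - y if is_noisy else y, is_noisy))
    rng.shuffle(data)
    return data

rng = random.Random(0)
rules = make_rules(rng)
dataset = make_dataset(p=7000, q=3000, rules=rules, rng=rng)
```

Keeping the `is_noisy` flag alongside each point is a convenience for later evaluation; a learner would only see the features and the (possibly flipped) label.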
In Fig. 2, we show the impact of noise on logic minimization. In particular, as shown in Fig. 2(a), we varied the number of noisy data points in the synthetic dataset from q = 0 to q = 3000 on the X-axis (with a corresponding p = 10 000 to p = 7000 ground-truth data points). On the Y-axis, Fig. 2(a) shows the corresponding test accuracies of both no-denoise and Denoise-S with the parameter C for SVM fixed to be 1. The test accuracies are based on sampling another 10 000 test instances labeled according to the predefined rules. As expected, as we increase the noise percentage from 0% to 30%, the test accuracy of no-denoise decreases linearly, while the performance of Denoise-S remains relatively the same near 100%. This shows that no-denoise generalizes poorly with an increasing amount of noise, whereas Denoise-S is robust to noisy samples.
In Fig. 2(b), we show the complexities of the decision rule sets derived from both no-denoise and Denoise-S, where the complexity of the rule set is calculated by summing the number of features across all rules and the number of rules. As we increase the noise percentage on the X-axis, the complexity of the derived rule set from no-denoise increases dramatically due to the overfitting of the noisy samples. On the other hand, the rule set complexity of Denoise-S remains at a minimum level, verifying that a classifier with provable Bayes-optimal convergence properties can successfully remove most of the noise in the training set so that logic minimization can uncover the underlying distribution, which in this case is the predefined rules.
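As a concrete reading of the complexity measure above, the following minimal sketch counts the features in every rule and adds the number of rules. The rule representation (a tuple of feature indices per rule) and the specific indices are hypothetical.

```python
def rule_set_complexity(rules):
    """Number of features summed over all rules, plus the number of rules."""
    return sum(len(rule) for rule in rules) + len(rules)

# Hypothetical rule set with five rules of three features each, mirroring
# the shape of the predefined ground-truth rules.
example_rules = [(0, 3, 7), (1, 4, 9), (2, 5, 11), (6, 8, 12), (10, 13, 15)]
complexity = rule_set_complexity(example_rules)
print(complexity)  # 5 * 3 + 5 = 20
```

Under this measure, a rule set that recovers five rules of three features each has complexity 20, which gives a reference point for the minimum level mentioned above.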
2) Quantifying the Performance of Denoisers: We next evaluate the effectiveness of different denoisers in identifying noise in datasets. To quantitatively evaluate this, we again manually produced synthetic noisy datasets using the same randomly generated five rules as the predefined rules, as in the previous section. In this experiment, we again generated synthetic datasets with 10 000 data points, but this time, we randomly sampled 9000 "ground-truth" data points labeled according to the predefined rules. We then injected two types of noisy points into the dataset. The first type of noise (Noise 1) is the same type of noise as in the previous section: we randomly sampled new combinations and purposely mislabeled them with the label opposite to the predefined rules. We also injected a second type of noise (Noise 2): we randomly selected some combinations among the 9000 ground-truth combinations and assigned the opposite labels to them. In other words, these combinations have both positive and negative labels in the synthetic dataset. We added in total 1000 noisy points to the datasets with different ratios of the two types. We then applied the denoisers to the datasets consisting of 9000 + 1000 = 10 000 points and let them identify the noisy data points. The results are shown in Table VI.
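In particular, in all cases with different compositions of noise types, SVM always perfectly removes all noisy points without mistakenly removing any correct data, which is consistent with the results shown in Table IV, where Denoise-S always achieves the best accuracy. With respect to XGB and AKNN, both are able to correctly identify most of the noisy data points, with XGB successfully finding about 79%-99% of the noisy points and AKNN successfully finding about 90%-95% of the noisy points. Although XGB incorrectly removes about 0.1%-0.2% of the correct data out of the 9000 points, and AKNN incorrectly removes about 1.5% of the correct data, the percentage of incorrectly removed data points is negligibly small in both cases relative to the total number of clean points. Overall, the amount of correctly identified noise is substantially greater than the number of missed and incorrectly removed points, which meets our expectation: the denoisers remove almost all noisy points and not too many of the clean points.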
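The three quantities discussed here (correctly identified noise, missed noise, and incorrectly removed clean points) can be computed from index sets as in the sketch below. The function name and the example indices are hypothetical; the numbers are chosen for illustration only and are not taken from Table VI.

```python
def denoising_stats(noisy_idx, removed_idx, n_total):
    """Compare the points a denoiser removed against the truly noisy points."""
    noisy, removed = set(noisy_idx), set(removed_idx)
    n_clean = n_total - len(noisy)
    found = len(noisy & removed)             # noisy points correctly removed
    missed = len(noisy - removed)            # noisy points left in the data
    wrongly_removed = len(removed - noisy)   # clean points removed by mistake
    return {
        "found": found,
        "missed": missed,
        "wrongly_removed": wrongly_removed,
        "found_pct": 100 * found / len(noisy) if noisy else 0.0,
        "wrongly_removed_pct": 100 * wrongly_removed / n_clean if n_clean else 0.0,
    }

# Hypothetical outcome: 10,000 points, the last 1,000 of which are noisy.
noisy_points = range(9000, 10000)
# A made-up denoiser that misses 50 noisy points and removes 9 clean ones.
removed_points = list(range(9050, 10000)) + list(range(9))
stats = denoising_stats(noisy_points, removed_points, n_total=10000)
print(stats["found_pct"], stats["wrongly_removed_pct"])  # 95.0 0.1
```

Reporting the incorrectly removed points as a percentage of the clean points, rather than of all removed points, follows the way the text states the XGB and AKNN figures.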

E. Impact of Denoising on Other Explainable Models
Throughout this article, we extensively discussed the importance of denoising when logic minimization is used to derive decision rule sets from tabular datasets since logic minimization exactly fits all data points. In this section, we also evaluate how denoising affects other models. The results are summarized in Table VII.

TABLE VII AVERAGE IMPROVEMENT (AS A PERCENTAGE) IN TEST ACCURACY BY DERIVING OTHER MODELS ON THE DENOISED DATASETS
The average improvement from applying logic minimization to the denoised datasets is significant, over 4%. This is shown under the column labeled ESPRESSO. It can be seen that denoising also generally improves the test accuracies of CART, RIPPER, and CG. However, the improvements are relatively small in comparison with logic minimization: under 0.25%, 1.5%, and 0.5% for CART, RIPPER, and CG, respectively. The improvements may be much smaller for these methods because these learners are already somewhat noise-tolerant: they tend to underfit data in favor of simple models and are, therefore, less affected by noisy points, but this comes at the price of inferior performance. Furthermore, we observe that denoising can have a small negative impact on the performances of SVM and XGBoost, which is not surprising as both SVM and XGBoost already have good generalization capabilities without denoising. On the other hand, because logic minimization tends to overfit, it can greatly benefit from our denoising framework, as shown in Table VII.

V. RELATED WORK
Satisfiability-based (SAT-based) logic minimization has been proposed before [22], [23] to derive minimized DNF formulas that translate to decision sets. However, these methods assume there is no inconsistency in the training data, meaning that there are no overlapping examples that have different class labels. They resolve inconsistencies by taking the majority class such that the largest consistent subset of nonoverlapping training instances is retained. In turn, these SAT-based methods act as a logic minimizer to exactly fit the resulting dataset. As explained throughout this article, logic minimization performed this way (corresponding to the "no-denoise" case in our paper) often fails to produce accurate decision models because the resulting dataset often still contains noisy points that lead to detrimental overfitting and poor generalization. How best to remove training instances to create a consistent dataset is highly nontrivial, which is precisely our key contribution: a novel theoretically grounded denoising framework that substantially improves the performance of two-level logic minimization in the learning of accurate decision sets, not simply the use of logic minimization (whether ESPRESSO [10] or a SAT-based approach) to derive decision sets.
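The majority-class resolution of inconsistencies mentioned above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the procedure of [22], [23]: the function name is hypothetical, and ties between labels are broken toward the smaller label, a choice the text does not specify.

```python
from collections import Counter, defaultdict

def keep_majority(dataset):
    """Keep, for each feature vector, only the instances with its majority label.

    dataset is a list of (features, label) pairs, with features hashable.
    """
    votes = defaultdict(Counter)
    for x, y in dataset:
        votes[x][y] += 1
    # Assumption: ties are broken toward the smaller label value.
    majority = {x: max(sorted(counts), key=lambda label: counts[label])
                for x, counts in votes.items()}
    return [(x, y) for x, y in dataset if y == majority[x]]

toy = [((1, 0), 1), ((1, 0), 1), ((1, 0), 0), ((0, 1), 0)]
consistent = keep_majority(toy)
print(consistent)  # [((1, 0), 1), ((1, 0), 1), ((0, 1), 0)]
```

The result is consistent in the sense that no feature vector appears with two different labels, but, as argued above, a mislabeled point that has no conflicting duplicate survives this step untouched.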
The consistency of learning algorithms is a central question in statistics and machine learning. For parametric classifiers, such as linear separators, the desired outcome is convergence to the best model in the function class as the number of training points grows. This typically depends upon the boundedness of some complexity measure, such as the Vapnik-Chervonenkis (VC) dimension [24], and holds broadly. However, such function classes might not be rich enough to capture all the intricacies of the underlying classification task. For nonparametric models, it is possible to hope for better: convergence to the Bayes-optimal model. This has been established for various popular methods, including k-nearest neighbor (with suitably growing k) [25], [26], boosting with certain base classes [12], and families of kernel machines including the support vector machine with RBF kernel [14].
There is also existing literature devoted to label noise, focusing mainly on unreliable or erroneous labels. Frénay and Verleysen [27] provided an extensive analysis of label noise and the potential problems that it can cause in classification, and reviewed the existing literature on algorithms for filtering erroneously labeled instances. For example, Thongkam et al. [28] proposed to use SVM as the filtering method to identify misclassified instances in a breast cancer survivability dataset. Jeatrakul et al. [29] proposed to use neural networks to detect misclassification patterns in the training data. Northcutt et al. [30] introduced a model-agnostic denoising framework that identifies the noisy samples using the out-of-sample predicted probabilities of the training instances by a user-specified classifier. However, these works either do not provide insights into why their choice of classifiers will work in the given scenarios or do not provide guidance regarding the choice of a user-specified classifier. In contrast, our denoising framework takes a different view of the problem by assuming that the true labels of the training instances are Bayes-optimal labels. This assumption provides us with guidance on what classifiers to select to give an accurate estimate of the underlying true labels, which is grounded in theory, as discussed in Section III. That is, our denoising framework can in theory correctly identify all noisy instances since we use classifiers with provable Bayes-optimal convergence properties to identify noise in the dataset. This is not the case, for example, in prior works like [29], [30]. To the best of the authors' knowledge, our work is the first to synthesize the ideas of logic minimization, convergence to the Bayes-optimal model for ensuring consistency, and label noise removal to simultaneously achieve state-of-the-art classification performance in tabular learning problems with explainable rule-based predictions.
Finally, the learning of rule sets has received considerable attention due to their ability to provide human-understandable explanations. Contrary to greedy rule mining methods developed before [2], [31], recently proposed methods [3], [4], [5] explicitly consider the tradeoff between the explanation complexity and the predictive performance and aim to get the best training accuracy under a certain complexity constraint. However, there is still a noticeable performance discrepancy between decision rule sets and black-box models, such as RF [8] and gradient boosted trees [6], even if there is no constraint on the complexity of the rule set. Although our work also generates human-understandable IF-THEN rules, and therefore falls into the same category of decision rule set learning, we show that our method significantly improves on modern decision rule learners and bridges the gap with black-box models.

VI. CONCLUSION
In this article, we explore the use of two-level logic minimization as a machine learning paradigm for tabular datasets. Although tabular datasets can be viewed as logic functions that can be simplified with two-level logic minimization to derive minimal logic formulas in DNF, this has not been a successful approach in the past, leading to complex models with poor generalization. Our conjecture is that these problems are caused by the presence of noisy instances in the training data. Because logic minimization exactly fits all data points, these noisy instances can cause detrimental overfitting, leading to models that both generalize poorly and are far more complex than necessary. We propose a statistical framework for denoising the training data, corresponding to the removal of noisy data points that have anomalous labels or are close to the decision boundary. This denoising approach allows logic minimization to be effective in deriving simple DNF formulas that have good generalization properties. The DNF formulas can in turn be readily converted to explainable decision rules. Using this approach, we are able to obtain performance comparable to that of gradient boosted and ensemble decision trees, which have been the winning hypothesis classes in tabular data learning competitions, but with human-understandable explanations in the form of decision rules. We hope our successful results will open the door to further fruitful research in the underexplored area of logic minimization as a viable machine learning direction.

Fig. 2. Impact of noise on logic minimization on synthetic data. (a) Impact on test accuracy with increasing noise. (b) Impact on model complexity with increasing noise.

TABLE II TOY EXAMPLE OF AN INCOMPLETELY SPECIFIED LOGIC FUNCTION, WHERE x1, x2, x3, AND x4 CORRESPOND TO AGE ≤ 50, SMOKER, CHOLESTEROL ≤ 130, AND BLOOD PRESSURE ≤ 120, RESPECTIVELY, AND f(x) REPRESENTS LOW HEART DISEASE RISK

TABLE III NUMBER OF TRAINING INSTANCES FOR EACH DATASET (SIZE) AND THE PERCENTAGE OF THE NOISY INSTANCES REMOVED BY EACH CLASSIFIER FOR EACH DATASET

TABLE V DIFFERENCE IN TEST ACCURACY (AS A PERCENTAGE) BETWEEN THE CLASSIFIERS USED TO REMOVE NOISY INSTANCES AND THEIR CORRESPONDING LOGIC MINIMIZATION RESULTS

TABLE VI DENOISING STATISTICS ON SYNTHETIC DATASETS WITH FIVE RULES, EACH AS A CONJUNCTION OF THREE RANDOMLY CHOSEN FEATURES