A Regularized Attribute Weighting Framework for Naive Bayes

The Bayesian classification framework has been widely used in many fields, but the covariance matrix is usually difficult to estimate reliably. To alleviate the problem, many naive Bayes (NB) approaches with good performance have been developed. However, the assumption of conditional independence between attributes in NB rarely holds in reality. Various attribute-weighting schemes have been developed to address this problem. Among them, class-specific attribute weighted naive Bayes (CAWNB) has recently achieved good performance by using classification feedback to optimize the attribute weights of each class. However, the derived model may be over-fitted to the training dataset, especially when the dataset is insufficient to train a model with good generalization performance. This paper proposes a regularization technique to improve the generalization capability of CAWNB, which could well balance the trade-off between discrimination power and generalization capability. More specifically, by introducing the regularization term, the proposed method, namely regularized naive Bayes (RNB), could well capture the data characteristics when the dataset is large, and exhibit good generalization performance when the dataset is small. RNB is compared with the state-of-the-art naive Bayes methods. Experiments on 33 machine-learning benchmark datasets demonstrate that RNB outperforms the compared methods significantly.

INDEX TERMS Attribute weighting, classification, naive Bayes, regularization.


I. INTRODUCTION
The Bayesian classification framework is fundamental to statistical pattern recognition and widely deployed in many machine-learning tasks [1]-[6]. The Bayesian decision rule with a 0/1 loss function leads to the optimal classification in statistical pattern recognition [7]. However, the estimated covariance matrix in Bayesian classification often deviates from that of the data population due to the curse of dimensionality, which may reduce classification performance [7]. To tackle the problem, many naive Bayes (NB) approaches [8]-[11] have been developed, which regularize the covariance matrix to a diagonal matrix. In these methods, it is assumed that the feature dimensions are conditionally independent, and then the posterior probability can be estimated separately for each feature dimension. NB classifiers are competitive with many recent classifiers, as shown in [12], [13].
However, NB may be oversimplified as the assumption of strong independence is often invalid, resulting in a decrease in classification performance [14]. Many improved naive Bayes classifiers have been developed to alleviate the conditional independence assumption, which can be broadly divided into five categories: 1) Structure extension [15], [16]; 2) Instance selection [17], [18]; 3) Instance weighting [19]; 4) Feature selection [20], [21]; 5) Feature weighting [22]-[36]. Among these methods, attribute-weighting methods [22]-[36] relieve the independence assumption by assigning different weights to different attributes so that the discriminative features will have a larger weight.
The associate editor coordinating the review of this manuscript and approving it for publication was Xian Sun.
Attribute-weighting methods can be further divided into filter-based methods [22]-[27] and wrapper-based methods [28]-[36]. The former determine the attribute weights in advance by using the general characteristics of the data, while the latter determine the attribute weights by using classification feedback to minimize the classification error. In most cases, the filter-based methods calculate weights faster than the wrapper-based ones, but the classification accuracy of the latter is higher than that of the former.
Attribute-weighting methods often assign the same weight to each attribute in different classes, e.g. Zaidi et al. weighted the attributes to alleviate naive Bayes' independence assumption (WANBIA) [34]. In class-specific attribute weighted naive Bayes (CAWNB) [35], attributes of different classes are weighted differently to enhance the discrimination power of the model. CAWNB better captures the characteristics of the dataset and achieves significant performance improvements compared with other attribute-weighting methods. However, with more weights to be optimized, the model complexity increases and hence over-fitting may occur, especially if the dataset is small. To alleviate the problem, we propose to add a regularization term to the formulation of CAWNB to penalize the model complexity, which favors simpler models to avoid over-fitting, similar to [7], [37], [38]. Naive Bayes can be regarded as a regularized form of the Bayesian classification framework by restricting the covariance matrix to be diagonal [7]. L1- or L2-regularization has been widely used in machine-learning tasks [39], [40]. L2-regularization [40] could be applied to the model parameters to encourage the attribute weights with poor effect to decay towards zero and assign higher weights to attributes with higher effect. Alternatively, L1-regularization could be applied to the model parameters of CAWNB, which is more robust to noise and outliers than L2-regularization. L1-regularization in general produces better results, but at a higher computational cost [39]. Sparse representation is an example of L1-regularization [39].
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Both L1-regularization and L2-regularization will introduce a significant computational overhead. In this paper, a simple yet effective way is proposed to regularize CAWNB, i.e. adding a simpler model to constrain CAWNB. Simpler models usually achieve better generalization performance [41]. WANBIA is simpler than CAWNB, as the number of weights estimated in WANBIA is smaller than that in CAWNB. Hence, integrating CAWNB with the simpler model WANBIA will improve its generalization capability. Furthermore, integrating these two models will not significantly increase the computational complexity, as both share similar procedures to solve the optimization problem [34], [35]. The proposed approach is named regularized naive Bayes (RNB).
In the proposed RNB, the target is to find the optimal model parameters $M = \{W, w, \alpha\}$ to minimize the difference between the posterior derived from the ground-truth label and the posterior $P(M)$ estimated from the data, where
$$P(M) = \alpha P_D(W) + (1 - \alpha) P_I(w). \tag{1}$$
$P_D(W)$ is the posterior probability with attributes weighted on a per-class basis, and $W$ is the matrix to weight the attributes differently for different classes. $P_I(w)$ is the posterior probability with attributes weighted the same for all classes, and $w$ is the weight vector for the attributes. $P_D(W)$ is a more complex model than $P_I(w)$, as more weights need to be optimized in $W$ than in $w$. Thus, $P_I(w)$ is a simpler model that can provide better generalization capabilities. Now the challenge is how to jointly find the optimal model parameters including $W$, $w$, and $\alpha$. To achieve this, a gradient-based optimization procedure is proposed, similar to L-BFGS-M [42] used in CAWNB and WANBIA. More specifically, the partial derivatives of $P(M)$ w.r.t. $W$, $w$ and $\alpha$ are derived, and a gradient-descent-based method is utilized to iteratively update $W$, $w$ and $\alpha$ respectively, towards the objective of minimizing the classification error. Compared with other regularization methods, the proposed method requires minimal modifications to the optimization problem of CAWNB, and it does not significantly increase the computational complexity.
In the proposed formulation, α is used to automatically adjust the trade-off between discrimination power and generalization capability. More specifically, when the dataset is small and hence a simpler model is preferred, α will be smaller and hence a larger weight will be assigned to $P_I(w)$, which will ensure better generalization capabilities. This is verified by the experiments shown in Section IV.
To validate the effectiveness of the proposed RNB, a series of empirical comparisons have been conducted with state-of-the-art naive Bayes methods on a collection of 33 benchmark classification datasets from the University of California at Irvine (UCI) repository [43]. Experimental results show that the performance of RNB is significantly better than that of all compared methods [8], [21]-[23], [33]-[36].
The contributions of this paper are summarized as follows: 1) The poor generalization capability of CAWNB is identified and RNB is proposed to address the problem. 2) An optimization procedure is designed to derive the optimal model of the proposed RNB. 3) The proposed RNB improves the generalization performance of previous methods and automatically balances the discrimination power and the generalization capability, so that better performance can be obtained regardless of the size of datasets.
The rest of the paper is organized as follows. Section II reviews related work. Then, the proposed regularized naive Bayes is introduced in Section III. In Section IV, experimental comparisons with state-of-the-art naive Bayes methods are conducted to demonstrate the effectiveness of the proposed method. Finally, this work is concluded in Section V.

II. RELATED WORKS
Naive Bayes classifiers have been widely used in many applications [9]-[11]. As the strong assumption of feature independence in NB is often invalid, many improvements have been developed, which can be broadly divided into five categories. The first category is structure extension [15], [16], which extends the structure of naive Bayes to represent the feature dependencies. The second category is instance selection [17], [18], which employs the principle of local learning to build a set of local naive Bayes classifiers using a subset of the dataset. The third category is instance weighting [19], which weights the instances differently in order to maximize the discriminant power. The fourth category is feature selection [20], [21], which removes the strongly correlated or irrelevant features, as those features are harmful to reliable classification, and/or selects the most discriminative feature subset. The fifth category is weighted naive Bayes, which tackles the problem by assigning different weights to attributes so that the discriminative features have a larger weight and hence the discriminative power will increase [22]-[36]. The attribute-weighting methods can be further categorized into filter-based methods [22]-[27] and wrapper-based methods [28]-[36].
Filter-based methods [22]-[27] utilize the characteristics of the data to determine attribute weights. Lee et al. determined the weights by using the Kullback-Leibler (KL) divergence between attributes and class labels [25]. In [24], Hall defined the weights by utilizing the minimum depth in a decision tree. In [22], the conditional probabilities of naive Bayes are estimated by deeply computing feature weighted frequencies. Recently, Jiang et al. developed a correlation-based attribute-weighting NB, which defines the weight of each attribute as a sigmoid transformation of the difference between mutual relevance and average mutual redundancy [23]. Filter-based approaches determine the weights in advance by measuring the relationship between features and classification variables, such as mutual information, KL divergence and correlation.
Wrapper-based methods [28]-[36] utilize the classification feedback to optimize attribute weights. Due to the iterative process, wrapper-based methods usually have higher time complexity and better classification performance than filter-based ones. In [28], Zhang and Sheng updated attribute weights based on a hill-climbing strategy to maximize the classification accuracy. Wu and Cai utilized a differential evolution algorithm to determine the weights [33]. In [36], Yu et al. developed a hybrid attribute-weighting method by initializing the weights through a correlation-based filter and then adjusting them through a wrapper. Zaidi et al. optimized attribute weights by minimizing the mean squared error between predicted and ground-truth labels [34]. Very recently, Jiang et al. developed CAWNB [35], which determines the optimal weight for each attribute of different classes to capture more characteristics of the dataset, instead of ignoring the class dependency as in [34]. Hence it achieves excellent classification performance on many benchmark datasets.
Unlike WANBIA [34] that assigns the same attribute weight for all classes, CAWNB [35] assigns different weights to different classes. Thus, the CAWNB model is more complicated and more prone to over-fitting, especially when the dataset is small. Some form of regularization to CAWNB is required to improve its generalization performance.

III. REGULARIZED ATTRIBUTE-WEIGHTED NAIVE BAYES
A. PROBLEM ANALYSIS OF PREVIOUS NAIVE BAYES METHODS
In the Bayesian classification framework, the posterior probability is defined as:
$$P(c|x) = \frac{P(x|c)\,P(c)}{P(x)}, \tag{2}$$
where $x$ is the feature vector and $c$ is the classification variable. Because it is difficult to reliably estimate the likelihood $P(x|c)$ due to the curse of dimensionality, in naive Bayes methods, the likelihood is estimated by assuming that the attributes are independent given the classification variable $c$, which results in the following formulation:
$$P(x|c) = \prod_{j=1}^{m} P(x_j|c), \tag{3}$$
where $x_j$ is the j-th dimension of the feature vector $x$, and $m$ is the feature dimensionality. Then, the posterior probability can be estimated by:
$$P(c|x) = \frac{P(c)\prod_{j=1}^{m} P(x_j|c)}{\sum_{c'} P(c')\prod_{j=1}^{m} P(x_j|c')}. \tag{4}$$
Naive Bayes regularizes the Bayesian framework by assuming that each attribute is independent conditioned on the classification variable, but this assumption is often invalid. To alleviate the problem, weights are assigned to attributes in WANBIA [34], and the weights are optimized via minimizing the mean squared error between the estimated posteriors and the posteriors derived using ground-truth labels.
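As an illustration of the naive Bayes posterior described above, the following is a minimal sketch, assuming the priors and the per-attribute likelihoods of the observed values are already available. The function name `nb_posterior` and the numbers are hypothetical, and the log-space evaluation is an implementation choice made here for numerical stability, not something the paper prescribes.

```python
import math

def nb_posterior(priors, likelihoods):
    """Naive Bayes posterior P(c|x) for every class.

    priors[c]         : P(c)
    likelihoods[c][j] : P(x_j | c) for the observed value of attribute j
    """
    # Sum logs instead of multiplying probabilities to avoid underflow.
    log_joint = [
        math.log(p) + sum(math.log(t) for t in row)
        for p, row in zip(priors, likelihoods)
    ]
    # Shift by the maximum before exponentiating (log-sum-exp trick).
    top = max(log_joint)
    unnorm = [math.exp(v - top) for v in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Two classes, two attributes (illustrative numbers only).
posterior = nb_posterior([0.6, 0.4], [[0.5, 0.2], [0.1, 0.9]])
```

Here the unnormalized scores are 0.6 x 0.5 x 0.2 = 0.06 and 0.4 x 0.1 x 0.9 = 0.036, so the normalized posterior is 0.625 for the first class and 0.375 for the second.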
Jiang et al. showed that attribute weighting should be class-specific to enhance the discrimination power of naive Bayes [35]. Thus, different weights are assigned to the attributes for different classes in CAWNB [35]. CAWNB is more complicated than WANBIA considering the number of model parameters. Class-specific attribute weights provide CAWNB with greater discrimination. However, the model complexity is considerably increased, so the generalization capability may decrease. The problem will be severe when the dataset is small, so the training samples are not enough to derive a reliable naive Bayes model.
To improve the generalization capability of CAWNB, we propose to add a simpler model, WANBIA, to constrain CAWNB. Besides, CAWNB is an improved version of WANBIA, and both share a similar optimization procedure. Hence, integrating WANBIA into CAWNB will not significantly increase the computational complexity.

B. OVERVIEW OF PROPOSED REGULARIZED NAIVE BAYES
In the proposed method, the target is to use the classification feedback to optimize the attribute weights. More precisely, the target is to find the optimal attribute weights to minimize the difference between the estimated posteriors and the posteriors derived from the ground-truth labels. The mean squared error $f$ in (5) is used to capture this difference: it sums, over every instance $x_i$ in the whole dataset $D$ and every class $c$, the squared difference between $P(c|x_i)$ and $\hat{P}(c|x_i)$. Here $\hat{P}(c|x_i)$ is the estimated posterior of class $c$ given $x_i$, and the posteriors derived from the ground-truth labels are defined as:
$$P(c|x_i) = \begin{cases} 1, & \text{if } c = c_i, \\ 0, & \text{otherwise,} \end{cases} \tag{6}$$
where $c_i$ is the class label of $x_i$. The posterior $\hat{P}(c|x_i)$ consists of two parts. The first part, which emphasizes the discriminative power of the model and whose attributes are weighted on a class-dependent basis, is defined as:
$$\hat{P}_D(c|x) = \frac{\pi_c \prod_{j=1}^{m} \theta_{c,j}^{w_{c,j}}}{\sum_{c'} \pi_{c'} \prod_{j=1}^{m} \theta_{c',j}^{w_{c',j}}}, \tag{7}$$
where $\pi = [\pi_1, \pi_2, \ldots, \pi_l]$ are the prior probabilities, and $\pi_c$ is the prior probability that sample $x$ belongs to class $c$. The matrix of likelihood probabilities is defined as $\Theta = [\theta_{c,j}]_{l \times m}$, where $\theta_{c,j}$ is the likelihood of the j-th attribute of $x$ given the class $c$. $\pi$ and $\Theta$ are estimated from training samples using (13) and (14) respectively, as shown in Section III-C later on.
$W = [w_{c,j}]_{l \times m}$ is the attribute-weighting matrix on a per-class basis and $w_{c,j}$ is the weight of the j-th attribute for class $c$. The other posterior probability $\hat{P}_I(c|x)$, which emphasizes the generalization capability of the model and whose attributes are weighted on a class-independent basis, is defined as:
$$\hat{P}_I(c|x) = \frac{\pi_c \prod_{j=1}^{m} \theta_{c,j}^{w_j}}{\sum_{c'} \pi_{c'} \prod_{j=1}^{m} \theta_{c',j}^{w_j}}, \tag{8}$$
where $w = [w_1, w_2, \ldots, w_m]$ is the weight vector and $w_j$ is the weight of the j-th attribute.
In the proposed RNB, the regularized posterior probability is defined as:
$$\hat{P}(c|x) = \alpha \hat{P}_D(c|x) + (1 - \alpha)\hat{P}_I(c|x), \tag{9}$$
where $M = \{W, w, \alpha\}$ consists of class-dependent attribute weights $W$, class-independent attribute weights $w$ and a hyper-parameter $\alpha$. $\alpha$ is used to balance the trade-off between the discrimination power and the generalization capability. The block diagram of the proposed regularized naive Bayes is shown in Fig. 1. In the training process, the elements in $W$ and $w$ are all initialized to 1 and $\alpha$ is initialized to 0.5, so that the initial model is the original naive Bayes. Then, $\hat{P}_D(c|x)$ and $\hat{P}_I(c|x)$ are estimated using training samples and these two posteriors are integrated as the regularized posterior $\hat{P}(c|x)$ with the weighting factor $\alpha$, as shown in (9). Then, $f$ is calculated as the sum of the squared differences between $P(c|x)$ and $\hat{P}(c|x)$, as shown in (5). The model parameters are optimized iteratively by using a gradient-descent-based method to minimize $f$ until convergence. The detailed procedures to derive the optimal model parameters are given in Section III-D. The class-independent weights significantly improve the generalization capability of the model, as evidenced in Section IV.
In the testing process, the estimated prior probabilities $\pi$, the likelihood probabilities $\Theta$ and the optimal model parameters $M^* = \{W^*, w^*, \alpha^*\}$ are used to compute the posterior probability $\hat{P}(c|t)$ for a given test instance $t$ by using (9). Finally, the class label of $t$ is estimated by using MAP estimation as follows:
$$\hat{c}(t) = \arg\max_{c \in C} \hat{P}(c|t), \tag{10}$$
where $C$ is the set of labels for all classes.
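The combination of the two posteriors can be sketched as follows. This is a minimal illustration, assuming the regularized posterior is the convex combination of the class-dependent and class-independent posteriors weighted by α and 1 - α, which matches the limiting cases the paper describes for α = 1 and α = 0. The function names and the toy numbers are hypothetical.

```python
import math

def weighted_posterior(priors, likelihoods, weights):
    """Posterior proportional to pi_c * prod_j theta_{c,j} ** weights[c][j]."""
    log_gamma = [
        math.log(p) + sum(wt * math.log(t) for wt, t in zip(w_row, t_row))
        for p, t_row, w_row in zip(priors, likelihoods, weights)
    ]
    top = max(log_gamma)
    gamma = [math.exp(v - top) for v in log_gamma]
    total = sum(gamma)
    return [g / total for g in gamma]

def rnb_posterior(priors, likelihoods, W, w, alpha):
    """Blend the class-dependent and class-independent posteriors with alpha."""
    p_dep = weighted_posterior(priors, likelihoods, W)                  # one weight row per class
    p_ind = weighted_posterior(priors, likelihoods, [w] * len(priors))  # same row for every class
    return [alpha * d + (1.0 - alpha) * i for d, i in zip(p_dep, p_ind)]

def squared_error(posterior, true_class):
    """Sum over classes of (ground-truth posterior - estimated posterior) ** 2."""
    return sum(((1.0 if c == true_class else 0.0) - p) ** 2
               for c, p in enumerate(posterior))

# With all weights equal to 1 and alpha = 0.5 the model reduces to plain naive Bayes.
priors = [0.6, 0.4]
likelihoods = [[0.5, 0.2], [0.1, 0.9]]
initial = rnb_posterior(priors, likelihoods, [[1, 1], [1, 1]], [1, 1], 0.5)
```

With unit weights both parts coincide with the naive Bayes posterior, which is why the initialization described above starts from the original naive Bayes regardless of the value of α.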

C. ESTIMATION OF PRIOR PROBABILITIES AND LIKELIHOOD PROBABILITIES
Firstly, the prior probabilities $\pi$ and the likelihood probabilities $\Theta$ are estimated based on training samples. Traditionally, the prior probability $\pi_c$ for class $c$ is estimated as follows:
$$\pi_c = \frac{1}{n}\sum_{i=1}^{n} \delta(c_i, c), \tag{11}$$
where $n$ is the number of training samples, $c_i$ is the class label of the i-th training instance, and $\delta(\cdot)$ is a binary function, which is 1 if its two parameters are identical and 0 otherwise. The likelihood $\theta_{c,j}$ for the j-th attribute of class $c$ is estimated as follows:
$$\theta_{c,j} = \frac{\sum_{i=1}^{n} \delta(x_{ij}, x_j)\,\delta(c_i, c)}{\sum_{i=1}^{n} \delta(c_i, c)}, \tag{12}$$
where $x_{ij}$ is the j-th attribute value of the i-th training instance and $x_j$ is the j-th attribute.
To make the estimation numerically stable, e.g. to avoid estimating $\pi_c$ as 0 due to insufficient training samples, the proposed method estimates the prior probability $\pi_c$ and the likelihood $\theta_{c,j}$ with an added regularization term, as given in (13) and (14), where $n_j$ is the number of discretized values for the j-th attribute.
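The smoothed frequency estimates can be sketched as below. Add-one (Laplace) smoothing is used here purely as an assumption for illustration; the exact regularization constants of (13) and (14) may differ. The function name `estimate_parameters`, the `smooth` parameter and the toy data are hypothetical.

```python
from collections import Counter

def estimate_parameters(X, y, n_values, smooth=1.0):
    """Estimate smoothed priors and likelihoods from discrete training data.

    X[i][j]     : discrete value of attribute j in training instance i
    y[i]        : class label of training instance i
    n_values[j] : number of distinct (discretized) values of attribute j
    """
    classes = sorted(set(y))
    class_count = Counter(y)
    # Smoothed prior: no class receives probability 0.
    priors = {
        c: (class_count[c] + smooth) / (len(y) + smooth * len(classes))
        for c in classes
    }
    # Joint counts of (class, attribute index, attribute value).
    joint = Counter((c, j, v) for row, c in zip(X, y) for j, v in enumerate(row))

    def likelihood(c, j, v):
        # Smoothed estimate of P(x_j = v | c); unseen values get a small mass.
        return (joint[(c, j, v)] + smooth) / (class_count[c] + smooth * n_values[j])

    return priors, likelihood

X_toy = [[0, 1], [0, 0], [1, 1]]
y_toy = ["a", "a", "b"]
priors_toy, lik_toy = estimate_parameters(X_toy, y_toy, n_values=[2, 2])
```

For this toy set the smoothed prior of class "a" is (2 + 1) / (3 + 2) = 0.6, and a value never observed with class "b" still receives probability 1/3 rather than 0.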
The aforementioned procedures work for discrete features. Continuous features are transformed into discrete features by using Fayyad & Irani's MDL method [44]. Then, (13) and (14) are used to compute the prior probabilities and likelihood probabilities of continuous features, respectively, in the same way as discrete ones.
FIGURE 1. Proposed regularized attribute weighting framework for naive Bayes. In the training process, the model parameters are initialized and the posteriors $\hat{P}(c|x)$ are estimated from training samples, which consist of two parts: the posteriors with attributes weighted on a class-dependent basis, and the posteriors with attributes weighted on a class-independent basis. Then, the model parameters are optimized iteratively through a gradient-descent-based algorithm using the classification feedback. When the classification error is small enough, the optimized model parameters will then be used in the testing process. Finally, during testing, the posterior for each testing sample $t$ will be estimated and the class label for $t$ is derived by using MAP estimation.

D. SOLVING THE OPTIMIZATION PROBLEM
Now the challenge is how to jointly find the optimal model parameters M including W , w, and α. To achieve this, a gradient-descent-based optimization procedure is proposed, similar to L-BFGS-M [42] used in CAWNB and WANBIA. More specifically, the target is to find the gradient direction of the objective function w.r.t. the model parameters W , w, and α, respectively. Then, the model parameters are updated iteratively along the gradient direction to minimize the error function defined in (5).
The partial derivative of $f$ w.r.t. each element of $W$, $w_{c,j}$, is given in (15). Similarly, the partial derivative of $f$ w.r.t. each element of $w$, $w_j$, is given in (16). The detailed derivations are omitted here and a brief derivation is described in the Appendix. Finally, the partial derivative of $f$ w.r.t. $\alpha$ is given in (17). After deriving the partial derivatives of the objective function $f$ w.r.t. the model parameters, the model parameters $W$, $w$, and $\alpha$ are iteratively updated to minimize the classification error. After the i-th iteration of optimization, the model parameters $W_i$, $w_i$, $\alpha_i$ are updated using (18)-(20), where $\nabla W_i$ is the gradient matrix whose elements are defined in (15), $\nabla w_i$ is the gradient vector whose elements are defined in (16), $\nabla \alpha_i$ is the partial derivative defined in (17), and each gradient is scaled by the learning rate. The iteration stops when the criterion in (21) is met, where $\eta$ is a predefined small constant. The optimal model is denoted as $M^* = \{W^*, w^*, \alpha^*\}$. The learning algorithms for training and testing are summarized in Algorithm 1 and Algorithm 2, respectively.
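The optimization loop can be sketched as follows. This is only an illustration under several assumptions that the text does not confirm: the objective carries a factor 1/2, the gradients are those obtained by differentiating that objective directly, plain gradient descent with a fixed learning rate `lr` replaces the line search, α is clipped to [0, 1], and iteration stops when the change in f falls below η. All names and the toy data are hypothetical.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def posteriors(log_pi, L, W, w, alpha):
    """L[c][j] = log(theta_{c,j}) for one instance; returns (P_D, P_I, P_hat)."""
    m = len(w)
    p_d = softmax([lp + sum(W[c][j] * L[c][j] for j in range(m)) for c, lp in enumerate(log_pi)])
    p_i = softmax([lp + sum(w[j] * L[c][j] for j in range(m)) for c, lp in enumerate(log_pi)])
    return p_d, p_i, [alpha * d + (1.0 - alpha) * i for d, i in zip(p_d, p_i)]

def loss_and_grads(log_pi, data, labels, W, w, alpha):
    n_cls, m = len(log_pi), len(w)
    f, g_alpha = 0.0, 0.0
    g_W = [[0.0] * m for _ in range(n_cls)]
    g_w = [0.0] * m
    for L, y in zip(data, labels):
        p_d, p_i, p_hat = posteriors(log_pi, L, W, w, alpha)
        err = [(1.0 if c == y else 0.0) - p_hat[c] for c in range(n_cls)]
        f += 0.5 * sum(e * e for e in err)  # assumed 1/2 factor
        err_d = sum(e * p for e, p in zip(err, p_d))
        err_i = sum(e * p for e, p in zip(err, p_i))
        g_alpha -= sum(e * (d - i) for e, d, i in zip(err, p_d, p_i))
        for j in range(m):
            mean_log = sum(p_i[k] * L[k][j] for k in range(n_cls))
            weighted = sum(err[c] * p_i[c] * L[c][j] for c in range(n_cls))
            g_w[j] -= (1.0 - alpha) * (weighted - err_i * mean_log)
            for c in range(n_cls):
                g_W[c][j] -= alpha * p_d[c] * L[c][j] * (err[c] - err_d)
    return f, g_W, g_w, g_alpha

def train(log_pi, data, labels, lr=0.05, eta=1e-7, max_iter=2000):
    n_cls, m = len(log_pi), len(data[0][0])
    W = [[1.0] * m for _ in range(n_cls)]  # all weights start at 1
    w = [1.0] * m
    alpha = 0.5                            # alpha starts at 0.5
    prev = float("inf")
    for _ in range(max_iter):
        f, g_W, g_w, g_alpha = loss_and_grads(log_pi, data, labels, W, w, alpha)
        if abs(prev - f) < eta:            # assumed stopping rule
            break
        prev = f
        W = [[W[c][j] - lr * g_W[c][j] for j in range(m)] for c in range(n_cls)]
        w = [w[j] - lr * g_w[j] for j in range(m)]
        alpha = min(1.0, max(0.0, alpha - lr * g_alpha))  # clipping is an assumption
    return W, w, alpha

# Hypothetical toy problem: THETA[i][c][j] = P(x_ij | c) for instance i.
LOG_PI = [math.log(0.5), math.log(0.5)]
THETA = [[[0.8, 0.6], [0.2, 0.4]], [[0.2, 0.6], [0.8, 0.4]],
         [[0.8, 0.4], [0.2, 0.6]], [[0.2, 0.4], [0.8, 0.6]]]
DATA = [[[math.log(t) for t in row] for row in inst] for inst in THETA]
LABELS = [0, 1, 0, 1]
W_OPT, w_OPT, ALPHA_OPT = train(LOG_PI, DATA, LABELS)
```

Working with log-likelihoods turns each weighted posterior into a softmax over linear scores, so the gradients reduce to sums of simple products. A line search, as used in the paper, would replace the fixed `lr`.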

Algorithm 2 Testing Algorithm
Input: $t$: a test instance, $M^* = \{W^*, w^*, \alpha^*\}$: the set of the optimal model parameters, $\pi$: the prior probabilities, $\Theta$: the likelihood probabilities.
Output: the class label of the test instance $t$.
1: Derive the class-dependent posterior $\hat{P}_D(c|t)$ using (7).
2: Derive the class-independent posterior $\hat{P}_I(c|t)$ using (8).
3: Derive the regularized posterior $\hat{P}(c|t)$ using (9).
4: Determine the class label $\hat{c}(t)$ of the test instance $t$ using (10).
5: Return the predicted class label $\hat{c}(t)$.

$\alpha$ is initialized to 0.5 so that the initial model is biased towards neither the discrimination power nor the generalization capability. $\alpha$ is optimized to achieve the best trade-off between discrimination power and generalization capability. A small value of $\alpha$ means that a small weight is assigned to $\hat{P}_D(c|x)$, and a large weight is assigned to $\hat{P}_I(c|x)$. As a result, a better generalization capability is expected. Note that in the extreme cases, the model is reduced to $\hat{P}_D(c|x)$ for $\alpha = 1$, or $\hat{P}_I(c|x)$ for $\alpha = 0$. All the weights of $W$ and $w$ are initialized to 1, which means that the model is initialized to naive Bayes at the beginning. In the proposed regularized naive Bayes, not only are the prior probabilities and the likelihood probabilities regularized to avoid numerical instability, as shown in (13) and (14), but the posterior is also regularized to improve the generalization capability, as shown in (9).
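The steps of Algorithm 2 can be sketched as follows, again assuming the regularized posterior is the α-weighted combination of the two weighted posteriors. The function names, class labels and numbers are hypothetical and chosen so that the two extreme values of α give different predictions.

```python
import math

def weighted_posterior(priors, likelihoods, weights):
    """Posterior proportional to pi_c * prod_j theta_{c,j} ** weights[c][j]."""
    log_gamma = [
        math.log(p) + sum(wt * math.log(t) for wt, t in zip(w_row, t_row))
        for p, t_row, w_row in zip(priors, likelihoods, weights)
    ]
    top = max(log_gamma)
    gamma = [math.exp(v - top) for v in log_gamma]
    total = sum(gamma)
    return [g / total for g in gamma]

def predict(classes, priors, likelihoods, W, w, alpha):
    p_dep = weighted_posterior(priors, likelihoods, W)                    # step 1, cf. (7)
    p_ind = weighted_posterior(priors, likelihoods, [w] * len(classes))   # step 2, cf. (8)
    p_hat = [alpha * d + (1.0 - alpha) * i for d, i in zip(p_dep, p_ind)] # step 3, cf. (9)
    best = max(range(len(classes)), key=lambda c: p_hat[c])               # step 4, MAP
    return classes[best], p_hat                                           # step 5

classes = ["a", "b"]
priors = [0.5, 0.5]
likelihoods = [[0.9, 0.2], [0.3, 0.6]]   # likelihoods[c][j] for the test instance
W = [[1.0, 0.0], [1.0, 1.0]]             # class-dependent weights (illustrative)
w = [0.0, 1.0]                           # class-independent weights (illustrative)

label_dep, _ = predict(classes, priors, likelihoods, W, w, alpha=1.0)
label_ind, _ = predict(classes, priors, likelihoods, W, w, alpha=0.0)
label_mix, p_mix = predict(classes, priors, likelihoods, W, w, alpha=0.5)
```

In this toy case the class-dependent part alone favors "a" while the class-independent part alone favors "b"; the blended posterior at α = 0.5 still selects "a".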

IV. EXPERIMENTAL RESULTS
The proposed approach is compared with the original naive Bayes [45], Gaussian naive Bayes [8] and several state-of-the-art NB algorithms. TCSFS-NB improves the performance of naive Bayes through feature selection [21]. DAWNB [22] and CFW [23] are two recent filter-based attribute-weighting methods. The comparisons with them illustrate the performance gain of the proposed RNB over filter-based approaches. DEAWNB [33], WANBIA [34], CAWNB [35] and CWANB [36] are four recent wrapper-based attribute-weighting methods. They provide a comprehensive comparison with wrapper-based attribute-weighting methods. These competitors are summarized in Table 1.

A. EXPERIMENTAL SETTINGS
Comprehensive experiments are conducted on a collection of 33 benchmark datasets from the UCI repository (available from https://archive.ics.uci.edu/ml/index.php), which represent a wide range of domains and data characteristics [43]. Most datasets are from real-world problems, e.g. diabetes, hepatitis, primary tumor, vehicle classification, letter recognition and so on. Besides, the characteristics of the datasets, including the number of instances, attributes and classes, are significantly different. The sizes of the datasets are between 57 and 20000, enough to evaluate how the algorithms perform on datasets of different sizes. For example, smaller datasets such as breast-cancer, heart-c and iris will prefer methods with better generalization capabilities. Attribute-weighting methods with good discrimination power will perform better on larger datasets such as sick, hypothyroid, waveform-5000 and mushroom. In addition, 17 out of 33 datasets have missing values, which simulates the difficulties in real life when collecting datasets, and imposes additional challenges for classifiers. Besides numeric values, the attributes of some datasets are nominal values, which imposes another challenge for classifier design. These 33 benchmark datasets provide a comprehensive evaluation of the effectiveness of the proposed RNB. The dataset descriptions are summarized in Table 2.
The missing values in the datasets are replaced with the average value of the numeric attributes or the mode of the nominal attributes in the available data. CAWNB uses Fayyad & Irani's MDL method [44] to discretize numeric attributes, which may lead to information loss. Thus, in the experiments, Fayyad & Irani's MDL method is fine-tuned to reduce the information loss. Besides, two irrelevant attributes are deleted, i.e. "instance name" in "splice" and "animal" in "zoo".
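The missing-value replacement can be sketched per attribute column as below. The `MISSING` marker and the function name `impute_column` are assumptions for illustration; how missing entries are encoded in the actual datasets is not stated here.

```python
from statistics import mean, mode

MISSING = None  # assumed marker for a missing entry

def impute_column(values, numeric):
    """Fill missing entries with the mean (numeric) or the mode (nominal) of the observed ones."""
    observed = [v for v in values if v is not MISSING]
    fill = mean(observed) if numeric else mode(observed)
    return [fill if v is MISSING else v for v in values]

numeric_filled = impute_column([1.0, MISSING, 3.0], numeric=True)
nominal_filled = impute_column(["a", MISSING, "a", "b"], numeric=False)
```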
The results of NB, DAWNB, DEAWNB, WANBIA and CAWNB are obtained from [35]. The results of TCSFS-NB, DAWNB and CWANB are obtained from [21], [22] and [36], respectively. GNB is implemented using Weka and the proposed RNB is implemented in MATLAB. The classification accuracy of the proposed algorithm on each dataset is derived via 10-fold cross-validation. During optimization, $\eta$ is set to $10^{-7}$ in the stop criterion defined in (21). The learning rate is determined using the linear search programs [46].
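The 10-fold cross-validation protocol can be sketched as follows. Whether the folds are shuffled or stratified is not stated, so the random shuffle below is an assumption, and `fit` and `predict` are hypothetical caller-supplied hooks standing in for any of the compared classifiers.

```python
import random

def k_fold_splits(n_instances, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)       # shuffling is an assumption
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_val_accuracy(X, y, fit, predict, k=10):
    """Average test accuracy over the k folds."""
    scores = []
    for train, test in k_fold_splits(len(y), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)

# Smallest dataset size mentioned in the paper (57 instances) as a split example.
splits = list(k_fold_splits(57))
```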

B. COMPARISON TO STATE OF THE ART
The comparisons to the state-of-the-art algorithms on the 33 datasets are shown in Table 3. The symbol • marks statistically significant improvements achieved by the proposed regularized naive Bayes under a paired one-sided t-test at the p = 0.05 significance level. The average classification accuracy and the Win/Tie/Loss counts on the 33 datasets for all the algorithms are summarized at the bottom of Table 3. The average classification accuracy over all the datasets provides a straightforward comparison of their performance. Each entry of W/T/L in the table indicates that the competitor wins on W datasets, ties on T datasets and loses on L datasets compared to the proposed RNB.
From Table 3, it is obvious that the proposed RNB obtains the highest average classification accuracy. Compared with the original naive Bayes and Gaussian naive Bayes, the proposed RNB achieves 2.34% and 6.15% of improvement respectively on average. Compared with the filter-based approaches, DAWNB [22] and CFW [23], the proposed RNB achieves 2.26% and 1.82% of improvement on average, respectively. Compared with the feature-selection-based approach, TCSFS-NB [21], RNB achieves 2.32% of improvement on average.
Compared with the previous best algorithm, CAWNB, the proposed RNB achieves more than 1% of improvement for the average classification accuracy over the 33 datasets. Among them, the improvements on some datasets are significant. For example, the classification accuracies of RNB on balance-scale, glass, sonar and vowel are more than 5% higher than those of the most recent attribute-weighting method, CAWNB. On relatively small datasets such as glass, iris and sonar, the proposed approach significantly outperforms CAWNB and the others because of the good generalization capability. On relatively large datasets such as segment and letter, the proposed RNB also shows statistically significant improvements. All these demonstrate that the proposed approach could well adapt to datasets of different sizes, and automatically adjust the balance between the discrimination power and the generalization capability.
FIGURE caption (opening words lost in extraction): ... [45], DAWNB [22], DEAWNB [33], WANBIA [34], CAWNB [35], CWANB [36], GNB [8], TCSFS-NB [21] and CFW [23]. It is obvious that overall RNB achieves the best classification accuracy among all approaches. The average classification accuracy of RNB is more than 2% higher than NB's. Besides, RNB obtains more than 1% of improvement on average compared with the previous best attribute-weighting method, CAWNB. The classification accuracies of RNB on some datasets, e.g. balance-scale, glass, sonar, and vowel, achieve about 5% of improvement compared with CAWNB.
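A win/tie/loss decision of this kind can be sketched as below. The sketch assumes that the paired t-test is applied to the 10 per-fold accuracies of two methods on one dataset, which the text does not spell out; with 10 pairs there are 9 degrees of freedom, for which the one-sided critical value at the 0.05 level is 1.833. The function name and the accuracy values are made up for illustration.

```python
import math
from statistics import mean, stdev

T_CRIT_DF9 = 1.833  # one-sided Student's t critical value, 9 degrees of freedom, 0.05 level

def compare(acc_a, acc_b, t_crit=T_CRIT_DF9):
    """Return 'win', 'tie' or 'loss' for method A against method B.

    acc_a, acc_b: paired accuracies (same folds); t_crit must match len(acc_a) - 1
    degrees of freedom, so the default is only valid for 10 pairs.
    """
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    spread = stdev(diffs)
    if spread == 0.0:
        # All differences identical: the sign of the common difference decides.
        t = 0.0 if diffs[0] == 0 else math.copysign(math.inf, diffs[0])
    else:
        t = mean(diffs) / (spread / math.sqrt(len(diffs)))
    if t > t_crit:
        return "win"
    if t < -t_crit:
        return "loss"
    return "tie"

# Made-up per-fold accuracies in percent.
method_a = [92, 93, 91, 92, 94, 92, 93, 91, 92, 93]
method_b = [90] * 10
method_c = [91, 89, 92, 88, 90, 91, 89, 90, 92, 88]
```

For `method_a` against `method_b` the mean difference is 2.3 with a standard error of 0.3, giving a t statistic of about 7.7, well above the critical value.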

C. ANALYSIS OF EXPERIMENTAL RESULTS
In the statistical significance tests shown in Table 3, the proposed approach significantly outperforms CAWNB [35], CWANB [36], WANBIA [34], DEAWNB [33], CFW [23], DAWNB [22], TCSFS-NB [21] and GNB [8] on 8, 9, 10, 12, 14, 17, 17 and 23 datasets, respectively. Compared with the original NB, the proposed RNB achieves statistically significant improvements on more than half of the datasets. Compared with the previous best algorithm, CAWNB [35], the proposed RNB achieves statistically significant improvements on 8 datasets, which demonstrates the effectiveness of the proposed approach. Table 4 summarizes the results of the statistical significance tests. For each entry u(v), u is the number of datasets on which the proposed RNB outperforms the corresponding competitor, and v is the number of datasets on which the performance gain is statistically significant at the significance level p = 0.05. Table 4 shows that, on average, the classification accuracies improve on more than two-thirds of the 33 datasets, and half of these improvements are statistically significant. It can hence be concluded that the proposed RNB outperforms all compared approaches.
From the experimental results, it can be seen that the proposed regularized naive Bayes achieves a remarkable performance improvement. The hyper-parameter $\alpha$ is optimized along with the class-dependent attribute weights and the class-independent attribute weights. The optimal value of $\alpha$ on each dataset is shown in Table 5, together with the number of instances and the number of instances per class. The values of $\alpha^*$ vary across datasets. In general, the larger the dataset, the higher the $\alpha^*$ value.
To better see the trend, the average value of $\alpha^*$ across datasets and the performance gain of the proposed RNB against the second best algorithm, CAWNB [35], are summarized in Table 6. The 33 datasets are divided into small and large datasets according to the number of instances per class: if it is larger than 500, the dataset is considered large, and small otherwise. Table 6 shows that for small datasets, the average $\alpha^*$ value is significantly smaller than that for large datasets. This indicates that $\alpha^*$ could be automatically adjusted during optimization so that for small datasets, $\alpha^*$ will be small to favor the generalization capability, whereas for large datasets, $\alpha^*$ will be large to favor the discrimination power. It can also be seen that the proposed RNB indeed demonstrates good generalization capabilities for small datasets by achieving a larger performance gain than that on large datasets.

V. CONCLUSION
In this paper, after a thorough literature review of the state-of-the-art attribute-weighting naive Bayes methods, we find that class-dependent attribute-weighting naive Bayes has poor generalization capabilities on relatively small datasets. Therefore, we propose to add a regularization term to alleviate the problem. The regularization term is extracted from a simpler naive Bayes which has better generalization capabilities. The proposed regularized naive Bayes is hence derived by integrating the regularization term into CAWNB. A gradient-descent-based optimization procedure has been designed to derive the optimal model parameters including the class-dependent weight matrix $W$, the class-independent weight vector $w$ and the hyper-parameter $\alpha$. Experimental results on the 33 datasets validate the effectiveness of the proposed RNB. The proposed method outperforms the previous best algorithm CAWNB on 21 datasets, of which 8 are statistically significant, and the average performance gain on the 33 datasets is more than 1%.

APPENDIX A
In this section, a brief derivation of the gradients of $f$ w.r.t. $W$ and $w$ is provided. Firstly, the partial derivative of $f$ w.r.t. each element of $W$, $w_{c,j}$, is expanded as in (22). Denote $\gamma_D(W) = \pi_c \prod_j \theta_{c,j}^{w_{c,j}}$. Then, $\hat{P}_D(c|x)$ defined in (7) can be re-written as $\hat{P}_D(c|x) = \frac{\gamma_D(W)}{\sum_c \gamma_D(W)}$. It is easy to show the intermediate results (23) and (24), the latter being $\frac{\partial \gamma_D(W)}{\partial w_{c,j}} = \gamma_D(W)\log(\theta_{c,j})$. Derive $\frac{\partial \hat{P}_D(c|x)}{\partial w_{c,j}}$ using the chain rule by utilizing (23) and (24), and then plug it into (22) to obtain the partial derivative of $f$ w.r.t. $w_{c,j}$ as defined in (15).
Secondly, the partial derivative of $f$ w.r.t. $w_j$ is expanded as in (25). Denote $\gamma_I(w) = \pi_c \prod_j \theta_{c,j}^{w_j}$. Similarly, $\hat{P}_I(c|x)$ defined in (8) can be re-written as $\hat{P}_I(c|x) = \frac{\gamma_I(w)}{\sum_c \gamma_I(w)}$. Note that every term in the summation of the denominator is a function of $w_j$. The partial derivative $\frac{\partial \hat{P}_I(c|x)}{\partial w_j}$ is obtained by differentiating this ratio. Similar to (24), it is easy to show that $\frac{\partial \gamma_I(w)}{\partial w_j} = \gamma_I(w)\log(\theta_{c,j})$. Plugging it into (25), the partial derivative of $f$ w.r.t. $w_j$ shown in (16) can be obtained.
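As a sketch of the intermediate step that the derivation leaves implicit, applying the quotient rule to the two ratio forms gives the derivatives below. The per-class shorthand $\gamma_k$, $\tilde{\gamma}_k$, the sums $S_D$, $S_I$ and the Kronecker delta $\delta_{c'c}$ are introduced here only for illustration and may differ from the notation of the omitted equations.

```latex
% Assumed shorthand: \gamma_k = \pi_k \prod_j \theta_{k,j}^{w_{k,j}},  S_D = \sum_k \gamma_k,
%                    \tilde\gamma_k = \pi_k \prod_j \theta_{k,j}^{w_j}, S_I = \sum_k \tilde\gamma_k.
\begin{align*}
\frac{\partial \hat{P}_D(c'|x)}{\partial w_{c,j}}
  &= \frac{\delta_{c'c}\,\gamma_c \log\theta_{c,j}\, S_D
           - \gamma_{c'}\,\gamma_c \log\theta_{c,j}}{S_D^{2}}
   = \hat{P}_D(c|x)\,\bigl(\delta_{c'c} - \hat{P}_D(c'|x)\bigr)\log\theta_{c,j}, \\
\frac{\partial \hat{P}_I(c'|x)}{\partial w_j}
  &= \frac{\tilde\gamma_{c'} \log\theta_{c',j}\, S_I
           - \tilde\gamma_{c'} \sum_k \tilde\gamma_k \log\theta_{k,j}}{S_I^{2}}
   = \hat{P}_I(c'|x)\Bigl(\log\theta_{c',j} - \sum_k \hat{P}_I(k|x)\log\theta_{k,j}\Bigr).
\end{align*}
```

The first line reflects that only the term of class $c$ depends on $w_{c,j}$, whereas in the second line every class term depends on $w_j$, which is why a weighted average of the log-likelihoods appears.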