On the Effectiveness of Discretizing Quantitative Attributes in Linear Classifiers

Learning algorithms that learn linear models often have high representation bias on real-world problems. In this paper, we show that this representation bias can be greatly reduced by discretization. Discretization is a common procedure in machine learning that is used to convert a quantitative attribute into a qualitative one. It is often motivated by the limitation of some learners to qualitative data. Discretization loses information, as fewer distinctions between instances are possible using discretized data relative to undiscretized data. In consequence, where discretization is not essential, it might appear desirable to avoid it. However, it has been shown that discretization often substantially reduces the error of the linear generative Bayesian classifier naive Bayes. This motivates a systematic study of the effectiveness of discretizing quantitative attributes for other linear classifiers. In this work, we study the effect of discretization on the performance of linear classifiers optimizing three distinct discriminative objective functions --- logistic regression (optimizing negative log-likelihood), support vector classifiers (optimizing hinge loss) and a zero-hidden layer artificial neural network (optimizing mean-square-error). We show that discretization can greatly increase the accuracy of these linear discriminative learners by reducing their representation bias, especially on big datasets. We substantiate our claims with an empirical study on $42$ benchmark datasets.


Introduction
One of the many factors that affect the error of a learning system is its representation bias (van der Putten and van Someren, 2004), or, as it is also called, its hypothesis language bias (Mitchell, 1980). We define representation bias herein as the minimum loss of any model in the space of models available to the learner. It is clearly desirable in the general case to use a space of models that minimizes representation bias for a given problem. Learning algorithms that use linear models, such as logistic regression (LR) (Murphy, 2012) and support vector classifiers (SVC) (Chih-Jen, 2010), are very popular, possibly in part because they lend themselves to convex optimization. In this paper we argue that learning algorithms that learn linear models often have high representation bias on real-world problems, and that this bias can often be reduced by discretization. We illustrate this in Figure 1, which shows on simple synthetic data how a linear classifier cannot create an accurate classifier on the numeric data but can when the data are discretized using simple univariate discretization.
Figure 1: Illustration of the effectiveness of a perceptron after EMD discretization on simple two-dimensional contrived data.
There are two important observations that motivate this work:
• A linear classifier with discretization is not linear with respect to the original data.
• Contrary to what might be thought given that discretization loses information, a linear classifier with discretization can reduce representation bias and may consequently reduce error.
Note that we do not claim that discretization is the only useful feature transformation or a substitute for other approaches to creating non-linear classifiers, such as multi-layer perceptrons. The simple AND problem illustrated in Figure 1 is a case where discretization is as effective as any other method for obtaining non-linear decision surfaces. However, there are problems where discretization will not be this effective. One example is the XOR problem illustrated in Figure 2. Note also that we do not claim that discretization always reduces representation bias. However, linear models make a strong implicit assumption, namely that the data are well modeled by a linear decision boundary. As illustrated in Figure 1, discretization can overcome this assumption. The startling result that we reveal is that doing so is often useful in practice. We show that discretization often reduces the bias of a linear classifier and that this reduction in bias frequently results in lower error when the data quantity is sufficiently large to minimize overfitting.
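The first observation can be made concrete with a small sketch. The code below is an illustration only, not the experiment behind Figure 1: it assumes an AND-style labelling (positive only when both attributes exceed 0.5), a single hypothetical cut-point at 0.5 per attribute, and hand-picked weights rather than learned ones. It shows that a model that is linear in the one-hot encoded intervals reproduces the AND labels exactly.

```python
import numpy as np

def discretize(x, cut=0.5):
    """Map each value to an interval index: 0 if x <= cut, else 1."""
    return (np.asarray(x) > cut).astype(int)

def one_hot(bins, n_bins=2):
    """One-hot encode interval indices."""
    return np.eye(n_bins, dtype=int)[bins]

# A grid of points in the unit square (assumed data, for illustration).
grid = np.linspace(0.05, 0.95, 10)
x1, x2 = np.meshgrid(grid, grid)
x1, x2 = x1.ravel(), x2.ravel()

# Assumed AND labelling: positive only if both attributes are "high".
y = ((x1 > 0.5) & (x2 > 0.5)).astype(int)

# Features: [x1 low, x1 high, x2 low, x2 high].
features = np.hstack([one_hot(discretize(x1)), one_hot(discretize(x2))])

# Hand-picked linear model over the discretized features.
weights = np.array([0.0, 1.0, 0.0, 1.0])
bias = -1.5
pred = (features @ weights + bias > 0).astype(int)
accuracy = (pred == y).mean()
```

The decision function is linear in the indicator features but piecewise constant in the original attributes, which is why it is not linear with respect to the original data.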
The rest of this paper is organized as follows. Some preliminary background and terminology is given in Section 2.1. We discuss discretization in general in Section 2.2. Linear classifiers based on Conditional Log-Likelihood (CLL), Hinge Loss (HL) and Mean-Square-Error (MSE) loss are discussed in Sections 2.3, 2.4 and 2.5 respectively. Optimization strategies for training these linear classifiers are discussed in Section 2.6. An overview of related work is given in Section 3. Experimental results are given in Section 4. We conclude in Section 5 with pointers to directions for future work.
Table 1: List of symbols used in this work.
$\hat{P}(e)$: Actual probability of event e
$P(e)$: Probability of event e
$P(e \mid g)$: Conditional probability of event e given g
$\mathbf{x} = \langle x_0, x_1, \ldots, x_n \rangle$: An object (n-dimensional vector) and $\mathbf{x} \in \mathcal{D}$
$Y$: Random variable associated with class label
$y$: $y \in Y$. Class label for object. Same as $x_0$
$C$: $|Y|$, number of classes
$X_i$: Random variable associated with qualitative attribute i
$x_i$: Actual value that $X_i$ takes
$|X_i|$: (Applicable only to qualitative attributes) Number of values of attribute $X_i$
$\beta$: LR parameter vector to be optimized
$\beta_{y,i}$: LR parameter associated with quantitative attribute i for class y
$\beta_{y,k,j}$: LR parameter associated with qualitative attribute k for class y taking value j
$\beta_{y,0}$: LR intercept term for class y
$\lambda$: Regularization parameter

Terminology
In machine learning and data mining research, there exists variation in the terminology used to characterize the nature of an attribute (or feature), for example, 'continuous vs. discrete', 'numeric vs. categorical' and 'quantitative vs. qualitative'. We believe that the 'quantitative vs. qualitative' distinction is best suited for our study and hence use it throughout the paper. Qualitative attributes are the attributes on which arithmetic operations cannot be applied. The values of a qualitative attribute can be placed or categorized in distinct categories. Sometimes there exists a meaningful rank among these categories, resulting in the distinction between ordinal and nominal qualitative attributes. Quantitative attributes, on the other hand, are the attributes on which arithmetic operations can be applied. They can be either discrete or continuous. For example, Number of Children is a discrete-quantitative attribute (values determined by counting), whereas Temperature is a continuous-quantitative attribute (values determined by measuring).
A list of the various symbols used in this work is given in Table 1.

Discretization
Discretization is a common process in machine learning that is used to convert a quantitative attribute into a qualitative one (Liu et al., 2002; Garcia et al., 2013). The need for discretization originates from the fact that some classifiers can only handle qualitative attributes, and some others sometimes operate better with them. The process involves finding cut-points within the range of the quantitative attribute and grouping values into intervals based on these cut-points. This removes the ability to distinguish between data points falling in the same interval. Therefore, discretization entails information loss. Discretization methods can be divided into two categories: supervised and unsupervised. In the unsupervised case, class information is not used during the cut-point determination process. Popular approaches are equal-width and equal-frequency discretization. Equal-width discretization (EWD) divides the quantitative attribute's range (between the minimum value $x_i^{\min}$ and the maximum value $x_i^{\max}$) into k equal-width intervals, where k is provided by the user. Each interval will have a width of $w = (x_i^{\max} - x_i^{\min})/k$. Equal-frequency discretization (EFD), on the other hand, divides the sorted values of a quantitative attribute into k intervals such that each interval contains approximately N/k data points. It is also important that data points with identical values are placed in the same interval; therefore, in practice, the intervals will have slightly different numbers of data points. The choice between EWD and EFD, and of the number of bins, is problem specific and can have a large impact on the overall performance of any model. Of course, choosing a large k will result in less information loss, but can result in over-fitting on small datasets.
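The two unsupervised schemes can be sketched in a few lines. The following is a minimal illustration, assuming NumPy; the function names are hypothetical and are not taken from any software used in this work. Duplicate quantile cut-points are dropped so that identical values always share an interval.

```python
import numpy as np

def equal_width_cuts(values, k):
    """Return the k-1 cut-points of equal-width discretization (EWD)."""
    lo, hi = np.min(values), np.max(values)
    width = (hi - lo) / k
    return lo + width * np.arange(1, k)

def equal_frequency_cuts(values, k):
    """Return at most k-1 cut-points of equal-frequency discretization (EFD).

    Cut-points are placed at the j/k quantiles; repeated cut-points
    (caused by many identical values) are merged.
    """
    quantiles = np.quantile(values, np.arange(1, k) / k)
    return np.unique(quantiles)

def apply_cuts(values, cuts):
    """Map each value to the index of its interval.

    A value equal to a cut-point falls in the lower interval, so
    identical values always receive the same index.
    """
    return np.searchsorted(cuts, values, side="left")
```

For example, `apply_cuts(v, equal_width_cuts(v, 3))` yields interval indices in {0, 1, 2} that can then be treated as the values of a qualitative attribute.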
Supervised discretization methods, on the other hand, utilize the class information of the data points to better define the cut-points. For example, the state-of-the-art discretization technique Entropy-Minimization Discretization (EMD) sorts the quantitative attribute's values and then finds the cut-point such that the information gain is maximized across the splits (Kohavi and Sahami, 1996). The technique is applied recursively on the successive splits, and the minimum-description-length (MDL) criterion is used to determine when to stop splitting.
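The recursive procedure can be sketched as follows. This is a simplified illustration, not the implementation used in the experiments: it tries every boundary between distinct sorted values, and it uses the MDL stopping rule in the form in which it is commonly stated for entropy-based discretization (accept a split only if the gain exceeds $(\log_2(N-1) + \Delta)/N$). All function names are hypothetical.

```python
import numpy as np

def entropy(labels):
    """Class entropy (in bits) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_cut(values, labels):
    """Return (gain, cut, index) of the best split of sorted values."""
    n = len(values)
    base = entropy(labels)
    best = (0.0, None, None)
    for i in range(1, n):
        if values[i] == values[i - 1]:
            continue  # identical values must stay in the same interval
        left, right = labels[:i], labels[i:]
        gain = base - (i * entropy(left) + (n - i) * entropy(right)) / n
        if gain > best[0]:
            best = (gain, (values[i - 1] + values[i]) / 2, i)
    return best

def mdl_accepts(labels, i, gain):
    """MDL stopping criterion: is the split at index i worth encoding?"""
    n = len(labels)
    left, right = labels[:i], labels[i:]
    k, k1, k2 = (len(np.unique(s)) for s in (labels, left, right))
    delta = np.log2(3 ** k - 2) - (
        k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right)
    )
    return gain > (np.log2(n - 1) + delta) / n

def emd_cuts(values, labels):
    """Return the sorted list of cut-points found by recursive splitting."""
    order = np.argsort(values, kind="stable")
    v, y = np.asarray(values)[order], np.asarray(labels)[order]

    def split(v, y):
        if len(v) < 2:
            return []
        gain, cut, i = best_cut(v, y)
        if cut is None or not mdl_accepts(y, i, gain):
            return []
        return split(v[:i], y[:i]) + [cut] + split(v[i:], y[i:])

    return split(v, y)
```

On an attribute whose class changes once along its range, the sketch returns a single cut-point; on an attribute whose class alternates with no structure, the MDL rule rejects every split and the attribute is left as a single interval.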

Linear Classifier -CLL
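A Logistic Regression classifier optimizes the conditional log-likelihood (CLL), which is defined as:
$$\mathrm{CLL}(\beta) = \sum_{l=1}^{N} \log P(y^{(l)} \mid \mathbf{x}^{(l)}), \tag{1}$$
where
$$P(y \mid \mathbf{x}) = \frac{\exp(\beta_{y,0} + \beta_y^T \mathbf{x})}{\sum_{c=1}^{C} \exp(\beta_{c,0} + \beta_c^T \mathbf{x})}. \tag{2}$$
The term $\beta_{y,0} + \beta_y^T \mathbf{x}^{(l)}$ is expanded as $\beta_{y,0} x_0^{(l)} + \beta_{y,1} x_1^{(l)} + \cdots + \beta_{y,n} x_n^{(l)}$, where $x_0$ can be assumed to be 1 for all data points. Since the model defined by Equations 1 and 2 depends on $\mathbf{x}$ only through a function that is linear in $\mathbf{x}$, it is a linear classifier.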
Equation 2 leads to a multi-class softmax objective function. Since a set of parameters is learned for each class, we have made this distinction explicit with the subscript y in the parameter notation, that is, $\beta_{y,j}$ denotes a parameter for class y and attribute j. Typically, an LR minimizes the negative of the CLL, known as the negative log-likelihood (NLL), which is defined as:
$$\mathrm{NLL}(\beta) = -\sum_{l=1}^{N} \log P(y^{(l)} \mid \mathbf{x}^{(l)}). \tag{3}$$
Note that, in the following, for simplicity, we will drop the superscript (l) notation. It should be noted that many software libraries for multi-class LR either implement the multi-class (softmax) objective function of Equation 3 or optimize a simpler binary objective function of the following form (with $y \in \{-1, +1\}$):
$$\mathrm{NLL}(\beta) = \sum_{l=1}^{N} \log\left(1 + \exp(-y \beta^T \mathbf{x})\right), \tag{4}$$
and solve a one-versus-all classification problem. Note that in the case of binary classifiers, there is only one set of parameters for the two classes, as opposed to the C sets of parameters that need to be optimized for the softmax case. At classification time, one needs to apply C different trained LR classifiers and choose the class with the highest probability. Nonetheless, optimizing a standard LR with NLL based on either Equation 3 or Equation 4 requires substantial input manipulation, i.e., appending 1 to all data points and then converting qualitative attributes using one-hot encoding. For example, a qualitative attribute $X_j$ taking values {a, b, c} will be converted into three attributes $X_j$, $X_{j+1}$, $X_{j+2}$, each taking values either 0 or 1. An alternative to manipulating the input is to modify the model and optimize the following objective function instead:
$$\mathrm{NLL}(\beta) = -\sum_{l=1}^{N} \left( \beta_{y,0} + \sum_{i} \beta_{y,i} x_i + \sum_{k} \sum_{j=1}^{|X_k|} \mathbf{1}_{x_k = j}\, \beta_{y,k,j} - \log \sum_{c=1}^{C} \exp\left( \beta_{c,0} + \sum_{i} \beta_{c,i} x_i + \sum_{k} \sum_{j=1}^{|X_k|} \mathbf{1}_{x_k = j}\, \beta_{c,k,j} \right) \right), \tag{5}$$
where the sums over i run over the quantitative attributes and the sums over k run over the qualitative attributes. Note that the models expressed in Equations 3 and 5 are exactly equivalent and will lead to the same results. The only difference is that the model in Equation 3 requires converting all qualitative attributes into quantitative ones using one-hot encoding, whereas the model in Equation 5 does not.
Equation 5 can be simplified even further. For datasets with only qualitative attributes, and including only terms that are not canceled out, we have:
$$\mathrm{NLL}(\beta) = -\sum_{l=1}^{N} \left( \beta_{y,0} + \sum_{i=1}^{n} \beta_{y,i,x_i} - \log \sum_{c=1}^{C} \exp\left( \beta_{c,0} + \sum_{i=1}^{n} \beta_{c,i,x_i} \right) \right). \tag{6}$$
Instead of converting qualitative attributes into quantitative ones and using the model of Equation 3, one can convert quantitative attributes into qualitative ones using the discretization methods discussed in Section 2.2 and use the model of Equation 6. It can be seen that with Equation 3, the number of parameters optimized is $(C-1) + (C-1)n$, whereas with Equation 6, $(C-1) + (C-1)\sum_{i=1}^{n} |X_i|$ parameters are optimized. Since the two models are not equivalent, this will result in different training time, speed and rate of convergence and, of course, classifications.
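The two parameter counts quoted above are easy to compare numerically. The helper functions below are hypothetical names introduced only for this illustration; they evaluate $(C-1) + (C-1)n$ and $(C-1) + (C-1)\sum_i |X_i|$ directly.

```python
def n_params_quantitative(n_classes, n_attributes):
    """Parameters of the model on n quantitative attributes."""
    return (n_classes - 1) + (n_classes - 1) * n_attributes

def n_params_discretized(n_classes, values_per_attribute):
    """Parameters of the model on discretized attributes.

    values_per_attribute lists |X_i|, the number of intervals
    produced for each attribute.
    """
    return (n_classes - 1) + (n_classes - 1) * sum(values_per_attribute)
```

For instance, with 3 classes and 4 attributes the quantitative model has 10 parameters, while discretizing each attribute into 5 intervals gives 42, so the discretized model is considerably more flexible.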

Linear Classifier -Hinge Loss
Hinge Loss (HL) is widely used as an alternative to CLL and has been the basis of Support Vector Machines. A classifier optimizing either a Hinge Loss objective function or one of its variants is a linear classifier and is known as the Support Vector Classifier (SVC). Here we define L2-Loss HL as:
$$\mathrm{HL}(\beta) = \sum_{l=1}^{N} \max(0, 1 - y \beta^T \mathbf{x})^2. \tag{7}$$
An alternative is L1-Loss HL, which is equal to $\sum_{l=1}^{N} \max(0, 1 - y \beta^T \mathbf{x})$. In this work, we will focus only on the L2-Loss. In practice, a penalty term is also added for regularizing the objective function as:
$$\mathrm{HL}(\beta) = \sum_{l=1}^{N} \max(0, 1 - y \beta^T \mathbf{x})^2 + \lambda \lVert \beta \rVert^2, \tag{8}$$
where λ is the regularization parameter. We will discuss the gradient and Hessian of this objective function later in Section 2.6.

Linear Classifier -Mean-Square-Error
Another linear classifier is based on optimizing the Mean-Square-Error (MSE) objective function, which is defined as:
$$\mathrm{MSE}(\beta) = \frac{1}{2} \sum_{l=1}^{N} \sum_{c=1}^{C} \left( \hat{P}(c \mid \mathbf{x}) - P(c \mid \mathbf{x}) \right)^2, \tag{9}$$
where $P(c \mid \mathbf{x})$ is given in Equation 2 and $\hat{P}(c \mid \mathbf{x})$ is the actual probability of class c given data instance x. The actual probabilities form a vector of size C with all zeros except at the location of the label of x, where it will be 1 (assuming there are no duplicate data points in the dataset). The objective function of Equation 9 is similar to that optimized by artificial neural networks (ANN). However, in an ANN, $P(c \mid \mathbf{x})$ is defined in terms of multiple layers. We can interpret Equation 9 as the objective function of an ANN with zero hidden layers.
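To make the three objective functions of Sections 2.3 to 2.5 concrete, the sketch below evaluates each of them with NumPy. It is an illustration under stated assumptions rather than the code used in the experiments: labels for the softmax objectives are integers in 0..C-1, labels for the hinge loss are in {-1, +1}, the MSE carries a factor of 1/2, and the penalty is taken to be λ times the squared norm of the weights.

```python
import numpy as np

def softmax_probs(beta, X):
    """P(c | x) for every row of X.

    beta has shape (C, n + 1); X has shape (N, n + 1) and is assumed
    to carry a leading column of ones for the intercept.
    """
    scores = X @ beta.T
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def nll(beta, X, y):
    """Negative log-likelihood of the softmax model."""
    P = softmax_probs(beta, X)
    return -np.log(P[np.arange(len(y)), y]).sum()

def mse(beta, X, y):
    """Mean-square-error between one-hot targets and softmax outputs."""
    P = softmax_probs(beta, X)
    targets = np.eye(beta.shape[0])[y]
    return 0.5 * ((targets - P) ** 2).sum()

def l2_hinge(w, X, y_pm, lam=0.0):
    """L2-loss hinge objective with an (assumed) lam * ||w||^2 penalty."""
    margins = np.maximum(0.0, 1.0 - y_pm * (X @ w))
    return (margins ** 2).sum() + lam * (w @ w)
```

All three functions score the same linear function of the input; they differ only in how the resulting scores are penalized, which is why a change of input representation such as discretization affects all of them in a similar way.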

Optimization
There is no closed-form solution for minimizing the negative log-likelihood, hinge loss or mean-square-error objective functions, and, therefore, one has to resort to iterative minimization procedures such as gradient descent or quasi-Newton methods. An iterative optimization procedure generates a sequence $\{\beta_k\}_{k=1}^{\infty}$ converging to an optimal solution. At every iteration k, the following update is made: $\beta_{k+1} = \beta_k + s_k$, where $s_k$ is the search direction vector. The following equation plays the pivotal role, as it holds the key to obtaining $s_k$ by solving a system of linear equations:
$$\nabla^2 f(\beta_k)\, s_k = -\nabla f(\beta_k), \tag{10}$$
where f is the objective function that we are optimizing. There are two very important issues that must be addressed when solving for the search direction vector using Equation 10 (Nocedal and Wright, 2006). First, it can be infeasible to explicitly compute and store the Hessian, especially on high-dimensional data. Second, the solution obtained using Equation 10 does not guarantee convergence. There are three main strategies for addressing the first issue:
• Consider $\nabla^2 f(\beta_k)$ to be an identity matrix, in which case $s_k = -\nabla f(\beta_k)$. This leads to a family of algorithms known as first-order methods, such as Gradient Descent, Coordinate Descent, etc.
• Do not compute $\nabla^2 f(\beta_k)$ directly, but approximate it from the information present in $\nabla f(\beta_k)$ instead. This property is useful for large-scale problems where we cannot store the Hessian matrix. This leads to approximate second-order methods known as quasi-Newton algorithms, for example L-BFGS, which is considered to be the most efficient algorithm (the de-facto standard) for training LR.
• Use standard 'direct algorithms' for solving a system of linear equations, such as Gaussian elimination, to solve for $s_k$, or use one of the iterative algorithms, such as conjugate gradient. For large datasets, iterative methods are generally preferable to direct methods, as the latter require computing the whole Hessian matrix. The optimization method now has two layers of iterations: an outer layer of iterations to update $\beta_k$, and an inner layer of iterations to find the Newton direction $s_k$. In practice, one can use only an approximate Newton direction in the early stages of the outer iterations. This method is known as the 'Truncated Newton method' (Nash, 2000).
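The third strategy can be sketched for a binary logistic objective. The code below is a minimal illustration, assuming NumPy, labels in {-1, +1}, and an L2 penalty of (lam/2)·||β||² (an assumption that keeps the Hessian positive definite). The inner loop solves the Newton system approximately by conjugate gradient using only Hessian-vector products, so the Hessian is never formed. A simple backtracking step is added as a safeguard; it is not the trust-region scheme used in the experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(beta, X, y, lam):
    """Binary logistic loss with labels in {-1, +1} plus L2 penalty."""
    return np.logaddexp(0.0, -y * (X @ beta)).sum() + 0.5 * lam * (beta @ beta)

def gradient(beta, X, y, lam):
    z = y * (X @ beta)
    return X.T @ (-y * sigmoid(-z)) + lam * beta

def hessian_vector(beta, X, lam, v):
    """Product of the Hessian with v, without forming the Hessian."""
    p = sigmoid(X @ beta)
    return X.T @ (p * (1.0 - p) * (X @ v)) + lam * v

def conjugate_gradient(hv, g, max_iter=10, tol=1e-8):
    """Approximately solve H s = -g (inner iterations)."""
    s = np.zeros_like(g)
    r = -g.copy()          # residual of H s = -g at s = 0
    d = r.copy()
    rr = r @ r
    for _ in range(max_iter):
        if np.sqrt(rr) < tol:
            break
        Hd = hv(d)
        alpha = rr / (d @ Hd)
        s += alpha * d
        r -= alpha * Hd
        rr_new = r @ r
        d = r + (rr_new / rr) * d
        rr = rr_new
    return s

def truncated_newton(X, y, lam=1.0, outer=20, inner=10):
    """Outer iterations update beta; inner CG finds the direction."""
    beta = np.zeros(X.shape[1])
    for _ in range(outer):
        g = gradient(beta, X, y, lam)
        if np.linalg.norm(g) < 1e-6:
            break
        s = conjugate_gradient(
            lambda v: hessian_vector(beta, X, lam, v), g, inner
        )
        # Backtracking safeguard: shrink the step until the objective
        # decreases sufficiently.
        step, f0 = 1.0, objective(beta, X, y, lam)
        while (objective(beta + step * s, X, y, lam)
               > f0 + 1e-4 * step * (g @ s)) and step > 1e-8:
            step *= 0.5
        beta = beta + step * s
    return beta
```

Limiting `inner` to a small number of conjugate-gradient iterations is what makes the Newton direction "truncated": early outer iterations receive only a rough direction, which is usually sufficient far from the optimum.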
It should be noted that these methods differ in terms of speed of convergence, cost per iteration, iterations to convergence, etc. For example, Coordinate Descent updates one component of β at every iteration, so the cost per iteration is very low, but the number of iterations to convergence will be very high. On the other hand, Newton methods will have a high cost per iteration, but a very low number of iterations to convergence. The three methods described above are all affected by the scaling of the axes. Therefore, scaling quantitative attributes or converting quantitative into qualitative attributes will affect the speed and the quality of the convergence.
The second issue can be addressed by adjusting the length of the Newton direction. For that, two techniques are mostly used: line search and trust region. Line search methods are standard in optimization research. We can modify the update to $\beta_{k+1} = \beta_k + \eta_k s_k$, where $\eta_k$ is known as the step-size. Standard line searches obtain an optimal step-size as a solution to the following sub-optimization problem: $\eta_k = \arg\min_{\eta} f(\beta_k + \eta s_k)$. Trust-region methods, unlike line search, are relatively new in optimization research. Trust-region methods first find a region around the current solution; in this region, a quadratic (or linear) model is used to approximate the objective function. The step size is determined based on the goodness of fit of the approximate model. If a significant decrease in the objective function is achieved with a forward step, the approximated model is a good representative of the original objective function, and vice versa. The size of the (trust) region is specified as a spherical area of size $\Delta_k$. The convergence of the algorithm is guaranteed by controlling the size of the region, which (in each iteration) is proportional to the reduction in the value of the objective function in the previous iteration.
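In the following we will define the gradient and Hessian of the three objective functions: conditional log-likelihood, hinge loss and mean-square-error. Note that we only define the gradient and the Hessian for qualitative attributes here. For the softmax NLL, $\nabla f(\beta_k)$ and $\nabla^2 f(\beta_k)$ can be written as:
$$\frac{\partial\, \mathrm{NLL}(\beta)}{\partial \beta_{y',i}} = -\sum_{l=1}^{N} \left( \mathbf{1}_{y = y'} - P(y' \mid \mathbf{x}) \right) x_i,$$
$$\frac{\partial^2\, \mathrm{NLL}(\beta)}{\partial \beta_{y',i}\, \partial \beta_{y'',j}} = \sum_{l=1}^{N} \left( \mathbf{1}_{y' = y''} - P(y' \mid \mathbf{x}) \right) P(y'' \mid \mathbf{x})\, x_i x_j.$$
For Hinge Loss, the (sub-)gradients can be written as:
$$\frac{\partial\, \mathrm{HL}(\beta)}{\partial \beta_i} = -2 \sum_{l \in N'} y \left( 1 - y \beta^T \mathbf{x} \right) x_i + 2 \lambda \beta_i,$$
where $N'$ are the instances for which $y \beta^T \mathbf{x} < 1$ is true. Similarly, the Hessian can be written as:
$$\frac{\partial^2\, \mathrm{HL}(\beta)}{\partial \beta_i\, \partial \beta_j} = 2 \sum_{l \in N'} x_i x_j + 2 \lambda\, \mathbf{1}_{i = j}.$$
For MSE, one can write the gradients as:
$$\frac{\partial\, \mathrm{MSE}(\beta)}{\partial \beta_{k,i}} = -\sum_{l=1}^{N} \sum_{c=1}^{C} P(c \mid \mathbf{x}) \left( \mathbf{1}_{y = c} - P(c \mid \mathbf{x}) \right) \left( \mathbf{1}_{k = c} - P(k \mid \mathbf{x}) \right) x_i,$$
and the Hessian is obtained by differentiating this expression once more with respect to the parameters.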

Related Work
Discretization is often motivated by a need to adapt data for a model that cannot handle quantitative attributes. In Statistics and many of its related and applied branches (such as epidemiology, medical research and consumer marketing), it goes by the names of 'dichotomization' and 'categorization' (where the two techniques differ in that the former splits the measurement scale into two while the latter can have more than two categories), and has been examined in many studies (Irwin and McClelland, 2003; MacCallum et al., 2002; Greenland, 1995). However, in most of these studies the majority opinion is against the use of dichotomization, and categorization is advised to be used with caution. The main reason cited for this is that dichotomization and categorization lead to information loss, since the variability among the members of a group is subsumed. For example, Altman and Royston (2006) write: ... Firstly, much information is lost, so the statistical power to detect a relation between variable and patient outcome is reduced ... and considerable variability may be subsumed within each group. Individuals close to but on opposite sides of cut-point are characterized as being very different rather than very similar ...
In practical machine learning, the common practice is to discretize an attribute only if necessary (i.e., if a model expects categorical attributes). An exception is for Bayesian classifiers, where it is common practice to discretize numeric attributes (Yang and Webb, 2009). The ambivalence towards discretization is understandable. Obviously, the quality (and sometimes quantity) of data is the key to training accurate models and hence getting good results. In many cases, the data are the result of costly and time-consuming efforts (for example in breast cancer research, where there are several stakeholders involved just to obtain a few attributes of the data). Losing some of the data (or more precisely, losing some distinction among the instances) due to discretization should be undesirable. However, a number of motivations for discretization have been put forward:
• Discretization can lead to simplification of statistical analysis. For example, if a quantitative attribute is split on the median, then one can compare the two groups based on t, χ² or some other test to estimate the difference between the two groups. This may ease interpretation and presentation of results (Altman and Royston, 2006).
• If there is error in the measurement scale, discretization can improve the performance of the model by reducing the contamination (Flegal and Keyl, 1991;Reade-Christopher and Kupper, 1991;Fung and Howe, 1984;Shentu and Xie, 2010).
• In many domains there exist pre-defined (or standard) thresholds to convert a quantitative to a qualitative scale. In these cases, a discretized attribute might better represent the task at hand as it will be more interpretable or have distinct significance. For example, in medical research doctors might better interpret blood-pressure as high and low rather than on a numeric scale.
• A discretized attribute might be better utilized than the quantitative attribute by the learning system. For example, consider a classifier that relies on estimation of conditional probabilities such as P(x i | y). If X i is quantitative, x i can take infinitely many values, and if the number of training samples is small, reliable estimation of P(x i | y) from the data is not possible. A common approach is to impose a parametric model and estimate the value of P(x i | y) based on this model, in which case the accuracy will depend on the appropriateness of the parametric model selected. Discretization can obviate this problem. By converting a quantitative attribute X i into a qualitative one X * i , the probabilities will take the form of P(x * i | y), which may be reliably estimated from the data as there will be many x i values falling into the same interval (Yang and Webb, 2009).
• The final reason for discretization has to do with overcoming a model's assumptions. It might be the case that discretization helps avoid some strong assumption that the learner makes about the data. If those assumptions are correct, discretization will have a negative impact, but if those assumptions are false, discretization may lead to better results (Altman et al., 1994). It is this final motivation that we examine herein.
The effect of discretization on various classification algorithms such as naive Bayes, Support Vector Machines and Random Forest is discussed in Lustgarten et al. (2008). On many biomedical datasets, it is shown that discretization can greatly improve the performance of the learning algorithm. The role of discretization as a feature selection technique is also explored. On various contrived datasets, Maleki et al. (2009) studied the effect of discretization on the precision and recall of various classification methods.
The effectiveness of discretization for the naive Bayes classifier is relatively well studied (Hsu et al., 2000; Dougherty et al., 1995; Yang and Webb, 2009). Dougherty et al. (1995) conducted an empirical study of naive Bayes with four well-known discretization methods and found that all the discretization methods significantly reduce error relative to a naive Bayes that assumes a Gaussian distribution for the continuous variables. Hsu et al. (2000) attribute this to the perfect aggregation property of Dirichlet distributions. In naive Bayes settings, a discretized continuous distribution is assumed to have a categorical distribution with Dirichlet priors. The perfect aggregation property of the Dirichlet implies that we can learn the class-conditional probability of the discretized interval with arbitrary accuracy. It is also shown that there exists a partition independence assumption, by virtue of which the Dirichlet parameters corresponding to a certain interval depend only on the area below the curve of the probability distribution function, and are independent of the shape of the curve in that interval.

Experiments
In this section, we compare the performance of linear classifiers with that of discretized linear classifiers on various datasets from the UCI repository (Lichman, 2017).
We denote a linear classifier optimizing the conditional log-likelihood as LR, a linear classifier optimizing the Hinge loss as SVC (support vector classifier) and a linear classifier optimizing the mean-square-error as ANN 0 (artificial neural network with zero hidden layers). Their discrete counterparts are denoted as LR(d), SVC(d) and ANN 0 (d) respectively.
In the remainder of this paper, when discussing results, we will collectively refer to LR, SVC and ANN 0 as linear classifiers and denote them by LC. We will collectively refer to LR(d), SVC(d) and ANN 0 (d) as discretized linear classifiers and denote them by LC d .
The details of the datasets used in this work are given in Appendix B. For discretized linear classifiers, different supervised and unsupervised discretization techniques were considered. Since this is not a comparative study of the relative efficacy of various discretization techniques for linear classifiers, we only report results with the supervised entropy-based discretization of Fayyad and Irani (1992), which we found gives better results than other discretization methods such as equal-frequency and equal-width.
Each algorithm is tested on each dataset using either 5 or 10 rounds of 2-fold cross validation.
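The evaluation protocol can be sketched as a generator of train/test index pairs. This is an illustrative sketch assuming NumPy; the function name, the seed argument and the unstratified random split are assumptions, not details taken from the experimental software.

```python
import numpy as np

def repeated_two_fold(n_instances, rounds, seed=0):
    """Yield (train_idx, test_idx) for `rounds` rounds of 2-fold CV.

    In each round the data are randomly split into two halves; each
    half is used once for training and once for testing.
    """
    rng = np.random.default_rng(seed)
    for _ in range(rounds):
        perm = rng.permutation(n_instances)
        half = n_instances // 2
        first, second = perm[:half], perm[half:]
        yield first, second
        yield second, first
```

With 5 rounds every instance is classified exactly 5 times as a test instance, each time by a model trained on a different random half of the data.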
During the presentation of results, we split our datasets into two categories: Big and Little. The Big category comprises datasets with more than 100,000 instances and the Little category comprises the remaining datasets (at most 100,000 instances).
We compare four different metrics: 0-1 Loss, RMSE, Bias and Variance. We also compare training-time, testing time, and rate of convergence. As discussed in Section 1, the reason for performing bias-variance estimation is that it provides insights into how the learning algorithm might be expected to perform with varying amounts of data. We expect low variance algorithms to have relatively low error for small data and low bias algorithms to have relatively low error for large data (Brain and Webb, 2002). There are a number of different bias-variance decomposition definitions. In this research, we use the bias and variance definitions of Kohavi and Wolpert (1996) together with the repeated cross-validation bias-variance estimation method proposed by Webb (2000).
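The bias and variance terms can be estimated from the predictions collected over repeated cross-validation. The sketch below follows the form of the Kohavi and Wolpert decomposition under two simplifying assumptions that are ours: the target is treated as deterministic (zero noise term), and no finite-sample correction is applied. The function name and array layout are hypothetical.

```python
import numpy as np

def kohavi_wolpert(predictions, truth, n_classes):
    """Estimate (bias^2, variance) averaged over test instances.

    predictions: int array of shape (R, M), the class predicted for
        each of M instances in each of R repetitions.
    truth: int array of shape (M,), the true class of each instance.
    """
    predictions = np.asarray(predictions)
    truth = np.asarray(truth)
    # Empirical distribution of predicted classes per instance, (M, C).
    p_hat = np.stack(
        [(predictions == c).mean(axis=0) for c in range(n_classes)], axis=1
    )
    target = np.eye(n_classes)[truth]
    bias2 = 0.5 * ((target - p_hat) ** 2).sum(axis=1)
    variance = 0.5 * (1.0 - (p_hat ** 2).sum(axis=1))
    return bias2.mean(), variance.mean()
```

Under the zero-noise assumption the two terms sum to the average 0-1 loss, which gives a convenient sanity check on any implementation.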
We report Win-Draw-Loss (W-D-L) results when comparing the 0-1 Loss, RMSE, bias and variance of two models. A two-tail binomial sign test is used to determine the significance of the results. Results are considered significant if p ≤ 0.05 and shown in bold. For hinge-loss, a dataset with more than two classes was transformed into a binary dataset. Data points belonging to the majority class were assigned to class A and the remaining data points were assigned to class B.
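The significance computation for a win-draw-loss record can be sketched with the standard library alone. The sketch assumes that draws are discarded and that wins and losses are equally likely under the null hypothesis; the function name is hypothetical.

```python
from math import comb

def sign_test_p(wins, losses):
    """Two-tailed binomial sign test p-value for a win/loss record."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of a result at least as extreme in one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

For example, a record of 13 wins and 14 losses is as balanced as possible and gives a p-value of 1, whereas 10 wins and no losses gives a p-value below 0.05.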
Missing values of the quantitative attribute were replaced with the mean of the attribute values whereas missing values of the qualitative attribute were treated as a distinct attribute value.
Quantitative attributes were also normalized between 0 and 1, as this is often recommended for gradient-based optimization methods.
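The two preprocessing steps for quantitative attributes can be sketched as follows. This is a minimal illustration assuming NumPy and assuming that missing values are encoded as NaN; the function names are hypothetical.

```python
import numpy as np

def impute_mean(column):
    """Replace missing (NaN) values with the mean of the observed values."""
    column = np.asarray(column, dtype=float)
    mean = np.nanmean(column)
    return np.where(np.isnan(column), mean, column)

def min_max_scale(column):
    """Rescale a column linearly to the range [0, 1]."""
    column = np.asarray(column, dtype=float)
    lo, hi = column.min(), column.max()
    if hi == lo:
        # A constant attribute carries no information; map it to zero.
        return np.zeros_like(column)
    return (column - lo) / (hi - lo)
```

Applying `min_max_scale(impute_mean(col))` to each quantitative column yields inputs on a common scale, which matters for the optimizers discussed in Section 2.6 because they are sensitive to the scaling of the axes.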
Three optimization methods were used: gradient descent, quasi-Newton, and the Trust-region based Newton method (TRON). We found that TRON converged faster than the other methods. Therefore, in the following, we report results with TRON optimization only. However, it is worth mentioning that a similar pattern of results was seen between LC and LC d for the other optimization methods.

Comparison of the Accuracy of LC d and LC
In this section, we compare the accuracy of LC d and LC in terms of their 0-1 Loss and RMSE on 52 datasets. Results are shown in Figures 3 and 4. It can be seen that the three LC d classifiers result in much better accuracy than their corresponding LC. In the scatter plots, results on Big datasets are shown in green dots, whereas results on Little datasets are shown in red dots. It can be seen that, on almost all Big datasets, LC d leads to higher accuracy (most green-dots are below the diagonal line). It can also be seen that some of the differences are substantial -this shows the effectiveness of discretization on LR, SVC and ANN 0 especially for big datasets.
A comparison of the win-draw-loss between the two models is given in Table 2. It can be seen that on big datasets, LC d wins on all except 2 datasets, a very promising result. This supports our hypothesis that on big datasets, discretization leads to a low-bias non-linear classifier, resulting in far superior results to those of a linear classifier with no discretization. On small datasets, discretization is significantly effective for SVC and non-significantly effective for ANN 0 . Note that on small datasets, LR(d) and LR lead to similar performance, with 13 wins and 14 losses for 0-1 Loss and 12 wins and 15 losses for RMSE. However, one should take into account that the scale of the LR(d) wins is much larger than that of the LR wins. This can be seen from the spread of red dots in the left-most plots of Figures 3 and 4.

Comparison of the Bias and Variance
Figures 5 and 6 present scatter plots of the bias and variance of LC and LC d classifiers. It can be seen that the three LC d classifiers lead to low-bias and high-variance models. Note that we present a bias-variance analysis on only Little datasets. This is because the software we have for obtaining bias-variance estimates is single threaded and could not benefit from the high-performance environment in which most of our experiments were run. As a result it was not feasible to run these experiments on the larger datasets. Nonetheless, results confirm our hypothesis that LC d classifiers tend to have lower bias than LC.

Comparison of the Convergence Curves of LC d and LC
As training of both LC d and LC classifiers is based on iterative optimization algorithms, they produce a sequence of values of their objective function as part of their training, which should ideally decrease with successive iterations until convergence. A technique that leads to the global minimum faster (steeper curve) and in fewer iterations (shorter curve) is desirable. Note that LC and LC d have different models (and parameterizations) and, therefore, the optimization space for the two problems is also very different. In the following, let us compare the convergence of LC and LC d on some sample datasets. A similar trend was observed on all datasets; here we report results on nine representative datasets only. A comparison of the variation in the NLL objective function for LR and LR(d) is shown in Figure 7. It can be seen that LR(d) has a steeper curve, that is, it asymptotes to its global minimum much more quickly. It is also important to see that LR(d) leads to a much lower NLL. The better accuracy of LR(d) is the result of this much lower NLL.
Figure 8 shows the variation in HL for SVC and SVC(d), whereas Figure 9 shows the variation in MSE for ANN 0 (d) and ANN 0 . A similar trend to that for NLL can be seen, that is, SVC(d) and ANN 0 (d) lead to a better value of the objective function while converging more rapidly.

Comparison of the Learning and Classification Time of LC d and LC
In this section, we compare the training and classification time of LC d and LC. It can be seen from Figure 10 that LR(d) and ANN 0 (d) are slightly faster than LR and ANN 0 respectively (majority of points below the diagonal line), whereas SVC(d) and SVC have similar training-time profiles. We have already seen the superior classification performance of LC d classifiers. These training-time results are extremely encouraging as they suggest that LC d can result in much better classification accuracy without compromising computational performance. We have reported the results on only four big datasets. This is because the results were obtained by running the jobs on a local desktop computer (i.e., a controlled set-up), rather than the heterogeneous cluster-computing environment in which most of the experimentation was performed. The scatter plots of classification time results for LC and LC d are presented in Figure 11. It can be seen that LC has slightly faster classification time than LC d . However, in most cases the magnitude of the difference is small.

Conclusion and Future Works
In this paper, we study the role of discretization for linear classifiers in machine learning. Current practice is primarily to apply discretization only when the learner requires qualitative data. Overall, there exists some aversion to discretization as it loses information.
We argue that discretization, despite losing information, can help model non-linear relationships in the data and, therefore, can help reduce the bias of a learner that uses linear models. A linear classifier trained on discretized data is no longer linear with respect to the original data, which has the potential to help in modeling non-linear decision boundaries that might otherwise require the use of kernels or multi-layer networks.
We show that discretization can greatly reduce the error of logistic regression and other discriminative linear classifiers optimizing Hinge Loss and Mean-square-error, especially on large datasets. We compare the performance of linear classifiers trained with both qualitative and quantitative attributes (denoted as LC) with that of linear classifiers trained with qualitative attributes only (denoted as LC d ), where quantitative attributes were discretized first. Our empirical analysis on 52 datasets showed that LC d led to a low-bias model and, therefore, resulted in significantly better 0-1 Loss and RMSE performance on large datasets. Quite surprisingly, it also reduced training time and had more desirable convergence, converging more rapidly to models that better fit the data. These substantial benefits come at the cost of a minor increase in classification time. Given the surprising gains from discretization, it is tempting to include both the original quantitative and the derived discretized features in the data. Doing so avoids losing any information due to discretization. We undertook some preliminary experiments with this approach. They suggested that while it led to slightly lower bias, it did not produce any improvement (in terms of error or convergence) over using only discretized-quantitative features. Further investigation of this research direction has been left as future work.
With faster training, better convergence and low bias, we believe that discretization is worth consideration in any context where linear classifiers are learned from quantitative data.
People used for recording of the data were wearing four tags (ankle left, ankle right, belt and chest). Each instance is a localization data for one of the tags. The tag can be identified by one of the attributes.
TwitterAbsoluteSigma500 140607 0 76 2 N The objective of this dataset is to determine whether or not these time-windows are followed by buzz events. In this dataset, each instance covers seven days of observation for a specific topic. Considering the couple day following this initial observation; If there is at least 500 additional active discussions by day then, the predicted attribute Buzz is True.
MiniBooNE PID 130065 0 50 2 N This dataset is used for classifying signal and background event based on 50 particle ID variables.