A Novel Linear Classifier for Class Imbalance Data Arising in Failure-Prone Air Pressure Systems

An Air Pressure System (APS) is one of the crucial components of an automobile. Its failure leads to financial losses, and it may even lead to loss of lives. Thus, predicting such failures is a critical problem that requires a rigorous solution. Recently, many researchers have presented machine learning techniques to deal with APS failure detection. One of the major challenges in dealing with APS failure data is the presence of high class imbalance. Conventional classification criteria may not be able to handle such data efficiently. In this paper, a new machine learning method for APS failure detection is proposed, designed specifically to deal with the class imbalance. The method learns a linear decision boundary by maximizing the Area Under the ROC Curve (maxAUC) criterion. The proposed method was experimentally validated on an industrial dataset of APS failures, and its results are thoroughly compared with existing linear as well as non-linear classifiers.


I. INTRODUCTION
As one of the crucial components of an automobile, Air Pressure System (APS) plays a vital role in gauging brakes, shifting gears, adjusting seats, and controlling suspensions. Faulty APS leads to improper functioning of brakes, gears and suspension, which may result in undesired and/or unpleasant incidents. Proper functioning of APS involves efficient (adequate and timely) supply of compressed air to the above systems.
Typically, APS works as follows [1]:
• It cleans and dries incoming compressed air.
• It distributes air to different pneumatic circuits.
Its major components in an automobile are: air drier, circuit protection valves, and control unit. The air drier dehumidifies the incoming air generated at the compressor. The circuit protection valves control the different circuits, like the parking brake circuit and service brake circuit, by activating them at different predefined pressures. The control unit consists of temperature and pressure sensors in addition to a circuit board. It decides when to activate the compressor depending on the APS pressure level.
APS failures can lead to huge financial losses, and can sometimes be life-threatening. Thus, their detection before they actually occur is an important area of research. A critical problem in this research area, known as APS failure detection, is to detect whether an APS failure is the cause of the overall system failure or not. With the advent of the Industrial Internet of Things (IIoT) and Industry 4.0, machine-learning based methods for APS failure detection (e.g., [2]) are gaining popularity. One of the major challenges in APS failure detection using machine learning is the presence of a highly imbalanced class distribution. Generally, the number of samples where APS failure is the key reason for the overall system failure is significantly smaller than the number of samples where the system failure is due to other reasons. It is widely known that developing a machine-learning based system using imbalanced data samples is not straightforward. Based on the machine-learning literature, imbalanced data can be handled by: modifying the data (e.g., under-sampling) to reduce the imbalance, or modifying the machine-learning algorithm (e.g., weighted classification) to handle it. In this paper, a new machine-learning method for data classification is proposed that can handle the imbalanced data problem in APS failure detection. The proposed method is designed to efficiently classify highly imbalanced and overlapping data classes. The presented model uses a linear decision boundary for data classification. The parameters of the decision boundary are obtained by maximizing the Area Under the ROC Curve (AUC). Moreover, the key mathematical model of the proposed method is linear, which leads to an O(n) worst-case complexity.
The effectiveness of the method is evaluated on a public dataset of APS failure detection for Scania trucks [3]. The dataset consists of 60,000 samples of failures, where less than 2% of the failures are related to APS. Based on the experiments, our method outperforms previously reported methods on this dataset.
Our contributions can be summarized as: • Developing a new machine-learning prediction/classification method based on an AUC maximization criterion.
• Incorporating a mechanism to deal with imbalanced data present in APS failure detection.
• The worst-case complexity of the proposed method is O(n).
The rest of the paper is organized as follows: In Section II, related work from the literature is presented. The proposed method and a few conventional data classification methods are described in Section III. The experiments, results, and discussions are illustrated in Section IV. Finally, the paper is concluded in Section V.

II. RELATED WORKS
Data analysis and machine learning methods have been frequently applied in the diagnosis of transportation systems [4]-[6]. In this section, our focus is to present the work related to APS failure prediction. We also present some important work in the literature related to dealing with imbalanced datasets for classification.
Gondek et al. [7] analyzed APS failure prediction data using multiple existing data classifiers. A number of classifiers including Naive Bayes, Support Vector Machines (SVM), Multi-Layer Perceptron (MLP), and random forest were deployed. The authors extracted new features from the APS failure prediction data using feature engineering methods. The original as well as the engineered features were used in the analysis. A feature selection approach based on feature ranking was also used in the analysis. Missing values were replaced by medians in each feature. The results are reported in terms of the cost where predicting a failure wrongly has a cost of 10 units and missing a failure has a cost of 500 units. The authors could achieve an average cost of 0.6 with their proposed technique.
Costa and Nascimento [8] handled APS failure prediction data using weighted data classifiers. In order to deal with class imbalance, class-specific weights were incorporated in Logistic Regression (LogR) and SVM classifiers. The weights assigned to the classes were inversely proportional to the number of samples in a class. In the case of random forest and K-Nearest Neighbors (KNN) classifiers, the authors shifted the threshold for predicting a sample. The threshold was shifted according to the proportion of samples in each class. For example, using KNN, if a minority-class sample was among the 60 nearest neighbors, it was classified as positive. In order to deal with missing data, soft-impute, a form of Expectation Maximization (EM) based imputation, was employed. Results are reported in terms of misclassification cost, false positive percentage, and false negative percentage.
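The inverse-frequency weighting described above can be sketched in a few lines. This is a minimal illustration of the idea, not the exact configuration used in [8]; the label values and counts below are hypothetical, and the normalization follows the common "balanced" convention:

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Assign each class a weight inversely proportional to its sample count,
    normalized so that each class contributes equally to a weighted loss."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    total = labels.size
    # weight_c = total / (n_classes * count_c)
    return {c: total / (len(classes) * n) for c, n in zip(classes, counts)}

# Example: 90 majority (-1) and 10 minority (+1) samples
y = np.array([-1] * 90 + [1] * 10)
weights = inverse_frequency_weights(y)
# The minority class receives a 9x larger weight than the majority class
```

Such per-class weights can then be passed to a weighted classifier so that errors on the minority class are penalized more heavily.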
Peruffo [1] investigated the use of different entropy measures as indicators of node impurity, in order to decide splitting in a decision tree classifier. The author argued that for imbalanced data, alternative definitions of entropy (instead of traditional Shannon's definition) lead to better measures for splitting. This can lead to improvement in minority-class precision at the expense of majority-class precision. The author tested the entropy measures on the public APS dataset. Accuracy, false negatives, and false positives were used as the performance measures.
Rafsunjani et al. [9] investigated five different imputation techniques (Expectation Maximization, Mean Imputation, Soft Impute, Multiple Imputation by Chained Equation (MICE), and Iterative Singular Value Decomposition (SVD)) in conjunction with five different classifiers: KNN, Naive Bayes, Gradient Boosted Tree, Random Forest, and SVM. Their results indicated that MICE was the most effective imputation technique, and random under-sampling was the most effective technique to deal with imbalanced data. In addition to the accuracy, true negatives, false positives, false negatives, and true positives were used as the performance measures.
Ranasinghe and Parlikad [2] investigated the use of Conditional Generative Adversarial Network (CGAN) in APS failure prediction. They proposed the usage of CGAN for generating artificial samples of the minority class. They used the CGAN by sampling from joint distributions of auxiliary information related to failures and noise. APS failure dataset for Scania trucks were used for the experimentation. The authors generated 2000 extra samples for the minority class to be used during training. True positives, true negatives, false positives, and false negatives were used as the performance measures apart from the misclassification cost.
Akarte and Hemachandra [10] presented the use of gradient boosting trees for APS failure prediction in Scania trucks dataset. The authors employ weighted samples by assigning more weight to the samples of the minority class. The weight was set based on the ratio of the samples of the positive and the negatives classes in the training set. The hyper-parameters were optimized using cross-validation results. Samples with over 70% missing values were removed. Other missing values in a feature were replaced with feature median. True positives, true negatives, false positives, false negatives, precision, recall, F1-scores were used as measures apart from the misclassification cost.
Jose and Gopakumar [11] presented a modified random forest algorithm for APS fault prediction. Their main contribution was to use bagging with random forest as the training strategy. They also suggested training a separate random forest on the misclassified data, and using it together with the original random forest classifier. They presented their results on the public APS dataset for Scania trucks. A KNN method with a K value of 33 was implemented for imputing the missing values in the dataset. Precision, F-measure, and Matthews Correlation Coefficient (MCC) were used as the performance measures. They reported a precision of 0.46 and an F-measure of 0.62 on the dataset.
Fatlawi et al. [12] presented a hybrid model for classification problems in the presence of a high number of features. They suggested that feature reduction is an important aspect of such problems. Their model is based on K-means clustering and bagging. First, the performance of each feature is recorded using a number of metrics. Clustering is then performed on the metric space to partition it into two clusters: one representing a weak cluster and the other a strong cluster. The idea is that the former's cluster center has values (which represent the specific values of the performance metrics for different features) lower than the latter's. The features whose values are members of the weak cluster are regarded as irrelevant and are removed from further consideration. Finally, bagging-based decision trees are used for training the classification models. The authors report their results on the public APS dataset for Scania trucks. A number of performance measures are used to report the results, including precision, recall, F-measure, and AUC. Furthermore, in [13] a feature reduction approach is embedded within the decision tree framework. The node selection is done based on a weighted Gini index value of the feature.
When it comes to dealing with imbalanced datasets, in general, machine learning techniques can be broadly classified into two categories: techniques dealing at the data level (oversampling or under-sampling) and techniques dealing at the classifier level by modifying the algorithms to suit the imbalanced scenario.
Random under-sampling of the majority-class samples, and over-sampling by duplication of the minority-class samples, are two of the simplest ways of dealing with the classification of imbalanced datasets. However, they also produce unwanted effects such as over-fitting or information loss by duplicating or deleting examples, respectively (cf. [14]). A hybrid technique combines both over-sampling and under-sampling. Synthetic Minority Over-sampling Technique (SMOTE) [15] is another frequently used technique, where instances of the minority class are synthetically created between samples of the class and their neighbors. Borderline-SMOTE [14] is a modification of the SMOTE technique, where oversampling of the minority class is performed only for the samples which are close to the decision boundary. This method considers a minority-class instance to be qualified for oversampling with the SMOTE technique if more than half of its m nearest neighbors come from the majority class. In [16], an under-sampling technique was combined with noise filters in order to handle noise in the minority class.
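The core interpolation step of SMOTE can be sketched as follows. This is a simplified illustration of the idea in [15], not the reference implementation; the toy minority points are hypothetical:

```python
import numpy as np

def smote_sample(X_min, i, k, rng):
    """Create one synthetic minority sample: pick one of the k nearest
    minority neighbors of X_min[i] and interpolate between the two points."""
    d = np.linalg.norm(X_min - X_min[i], axis=1)
    neighbors = np.argsort(d)[1:k + 1]   # skip index 0 (the point itself)
    j = rng.choice(neighbors)
    gap = rng.random()                   # interpolation factor in [0, 1)
    return X_min[i] + gap * (X_min[j] - X_min[i])

rng = np.random.default_rng(0)
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sample(X_min, 0, k=2, rng=rng)
# The synthetic point lies on the segment between X_min[0] and a neighbor
```

Borderline-SMOTE would additionally restrict which indices i are eligible, based on how many of the point's nearest neighbors belong to the majority class.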
Ertekin et al. [17] considered active learning with online SVM to deal with imbalanced data. Online SVM learns incrementally by adding samples one at a time to the training set. The sample to be added to the training set at a given iteration is selected using an active learning strategy: 59 data points are randomly sampled, and the sample closest to the current boundary is added to the training set. An early stopping criterion is used to stop the training. It is based on the idea that once the number of support vectors stabilizes, implying that all possible support vectors have been selected, the training can stop.
Nguyen et al. [18] presented the idea of oversampling the minority class only at the borderline between the class samples in order to deal with classification for imbalanced data. Their justification was that the samples that lie close to the border are more important for the classification problem. Hence oversampling should be at the borderline instead of using all the minority-class samples. The presented method was found to be effective when the overlap between the classes is low.
An ensemble of under-sampled classifiers is another technique to deal with imbalanced data (e.g., [19]). As different batches of under-sampled datasets are created, an ensemble tends to perform more robustly compared to a single classifier. Oh et al. [19] presented an incremental technique based on randomly selecting a balanced subset from the complete data, and then iteratively adding 'useful' samples to the training set. The usefulness of a sample is determined by the improvement in the information gain of the classifier when that sample is added to the existing subset of training examples. Díez-Pastor [20] created an ensemble of classifiers termed RB-Boost. The idea was to combine AdaBoost with random sampling. Here, random sampling refers to the idea that the proportion of classes in the training set for an AdaBoost instance is selected randomly. Then, SMOTE is used for augmenting data for a class having fewer samples, and random undersampling is used for reducing the number of samples for a class having more data. Through these methods, the goal is to achieve the desired ratio between the class samples.
Shao et al. [21] presented the Weighted Lagrangian Twin Support Vector Machine (WLTSVM) for dealing with binary classification of imbalanced data. A graph-based undersampling of the majority class was presented to deal with imbalanced data. Furthermore, a weighted bias was introduced to improve the performance on the class that has fewer samples. Maldonado and López [22] presented a new second-order cone programming formulation for SVM to deal with classification of imbalanced data. The approach is based on cost-sensitive learning, where the cost of misclassifying samples of the minority class is higher than that of the majority class samples, and is applied separately for each class. A linear programming SVM formulation was adapted based on second-order cones, and the problem was split into two margin variables.
Kang et al. [23] illustrated a Weighted Under-sampling SVM (WU-SVM) method based on space geometry distance.
The key idea in WU-SVM is to generate SubRegions (SRs) by grouping majority samples, and to assign weights to the majority samples within each SR based on their Euclidean distance to the SVM's decision plane. This is done in order to retain the data distribution of the original data while undersampling.
From the literature review, it can be seen that the APS failure detection problem is important and has been considered by many researchers. In addition, typical APS data is imbalanced, and the presence of imbalanced data makes the classification problem a challenging task. In the following section, the proposed method for APS failure detection is presented.

III. RELATED AND PROPOSED METHODOLOGIES
In this section, some of the traditional linear classifiers that can be used for binary classification are presented. Next, the proposed method is given. The method is based on maximizing the AUC criterion, and it can handle imbalanced data. Consider classification data (x_i, d_i) for i ∈ Ω, containing two classes P and N, where |Ω| is the total number of observations, and P ∩ N = ∅. Without loss of generality, let us say that |P| ≤ |N|, d_i = 1 iff i ∈ P, and d_i = −1 iff i ∈ N.

A. LINEAR SVM
The basic idea of SVM [24], [25] is to separate two classes (say P and N) by a hyperplane defined as:

w^T x + b = 0, (1)

where w ∈ R^F and b ∈ R. Obviously, there could be infinitely many possible choices of (w, b) in the case of linearly separable classes. Among all these infinite choices, the goal of SVM is to choose the (w, b) that minimizes the risk of misclassifying a new unlabeled data point. In other words, the aim is to find a hyperplane that is sufficiently far from both classes. This can be realized by finding two parallel hyperplanes that separate the classes, such that the following properties are satisfied: the distance (or margin) between the hyperplanes is maximum, and there is no data point in between the hyperplanes. A classifier satisfying the above properties is called a maximum margin classifier. In order to build the maximum margin classifier, without loss of generality, consider the following two parallel (supporting) hyperplanes:

w^T x + b = 1, (2)
w^T x + b = −1. (3)

The distance between the supporting hyperplanes defined in (2) & (3) is given as:

2 / ||w||. (4)

Fig. 1 depicts the notion of supporting hyperplanes and the maximum margin. In order to achieve the maximum margin criterion, the following optimization problem is solved:

max_{w,b} 2 / ||w||
s.t. d_i (w^T x_i + b) ≥ 1, ∀ i ∈ Ω. (5)

The objective of (5) is replaced by minimizing ||w||^2 / 2, i.e., the above formulation is recast as:

min_{w,b} ||w||^2 / 2
s.t. d_i (w^T x_i + b) ≥ 1, ∀ i ∈ Ω. (6)

The above formulations work very well when the data is linearly separable. However, data in most practical problems is imbalanced and overlapping. In order to extend the usability of SVMs to overlapping data, additional slack variables are introduced which capture the degree of overlap for some of the data points. This extended classifier is termed a soft margin classifier, denoted as cSVM, and the changes are incorporated as follows:

min_{w,b,s} ||w||^2 / 2 + c Σ_{i∈Ω} s_i
s.t. d_i (w^T x_i + b) ≥ 1 − s_i, ∀ i ∈ Ω, (7)

where s_i ≥ 0 is a slack variable, and c is a parameter that reflects the cost of the soft margin. Fig. 2 depicts the notion of a soft margin classifier.
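The soft-margin objective of cSVM can be made concrete with a short numerical sketch. This only evaluates the objective for a fixed (w, b), it is not a solver, and the toy points and parameter c are hypothetical:

```python
import numpy as np

def csvm_objective(w, b, X, d, c):
    """Soft-margin SVM objective: ||w||^2 / 2 + c * sum of slacks,
    where the slack for sample i is s_i = max(0, 1 - d_i (w.x_i + b))."""
    margins = d * (X @ w + b)
    slacks = np.maximum(0.0, 1.0 - margins)   # hinge loss per sample
    return 0.5 * w @ w + c * slacks.sum()

X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]])  # last point overlaps
d = np.array([1.0, -1.0, -1.0])
obj = csvm_objective(np.array([1.0, 0.0]), 0.0, X, d, c=1.0)
# margins: 2, 2, -0.5 -> slacks: 0, 0, 1.5 -> objective = 0.5 + 1.5 = 2.0
```

Only the overlapping third point incurs a nonzero slack; increasing c penalizes such overlaps more strongly and narrows the margin.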

B. LOGISTIC REGRESSION
When data from the two classes overlap, it is sometimes desirable to provide a probabilistic interpretation of the classification results in order to quantify the uncertainty of class labels during prediction. The basic idea of Logistic Regression (LogR) is to assign a probability to each observation, defined as:

h(x) = 1 / (1 + exp(−(w^T x + b))), (8)

where w ∈ R^F and b ∈ R. The aim of LogR is to choose (w, b) such that h(x) < 0.5 when x ∈ N, and h(x) ≥ 0.5 when x ∈ P. Fig. 3 depicts the probability function. The optimization model of LogR can be written as a maximum-likelihood problem:

max_{w,b} Π_{i∈P} h(x_i) Π_{i∈N} (1 − h(x_i)). (9)

The above formulation is recast (by taking the negative logarithm) as:

min_{w,b} Σ_{i∈Ω} ξ(h(x_i), d_i), (10)

where ξ() is a cost function or a measure of similarity.
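The probabilistic output of LogR can be illustrated with a minimal sketch of the logistic function; the weight vector and test points below are hypothetical:

```python
import numpy as np

def h(x, w, b):
    """Logistic model: probability that x belongs to the positive class P."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

w, b = np.array([1.0, -1.0]), 0.0
p_on_boundary = h(np.array([0.0, 0.0]), w, b)   # point on the hyperplane
p_positive = h(np.array([3.0, 0.0]), w, b)      # deep on the positive side
```

A point exactly on the hyperplane w^T x + b = 0 gets probability 0.5, and the probability approaches 1 as the point moves into the positive half-space.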

C. PROPOSED APPROACH
The basic idea of the proposed approach is to achieve the maximum Area Under the Curve (maxAUC) criterion. The following LP model, given by the objective (11) and the constraints (12)-(16), can be used to achieve the maxAUC criterion, where y, s ∈ R^|Ω|, w ∈ R^F, and b ∈ R are the variables. In addition, D is a constant that is estimated from the data points. The objective function in (11) is designed to achieve the maxAUC criterion. Similar parameters and variables are used to compare and contrast the proposed model with SVM and LogR. Constraints (12) and (16) are similar to the soft margin constraint. Furthermore, Constraints (13) to (15) linearly scale the predicted class labels, which is similar to LogR. Fig. 5 depicts a linear scaling function. To sum up, the proposed model combines the characteristics of SVM and LogR, and aims at improving AUC.
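Since the full LP (11)-(16) is not reproduced here, the flavor of an AUC-oriented linear program can be conveyed with a simplified pairwise-ranking formulation. This is our own sketch, not the authors' model: over a bounded w, maximize the sum of capped score differences between positive and negative samples, which rewards ranking every positive sample above every negative one. The toy data is hypothetical:

```python
import numpy as np
from scipy.optimize import linprog

# Toy data: 2 positive and 2 negative samples in R^2
P = np.array([[3.0, 3.0], [4.0, 4.0]])
N = np.array([[0.0, 0.0], [1.0, 0.0]])

# Variables: w (2 entries) followed by one t_ij per (positive, negative) pair.
# Maximize sum t_ij  s.t.  t_ij <= w.(x_i - x_j),  t_ij <= 1,  -1 <= w <= 1.
pairs = [(i, j) for i in range(len(P)) for j in range(len(N))]
n_w, n_t = P.shape[1], len(pairs)
c = np.concatenate([np.zeros(n_w), -np.ones(n_t)])   # linprog minimizes
A_ub = np.zeros((n_t, n_w + n_t))
for k, (i, j) in enumerate(pairs):
    A_ub[k, :n_w] = -(P[i] - N[j])   # encodes t_ij - w.(x_i - x_j) <= 0
    A_ub[k, n_w + k] = 1.0
b_ub = np.zeros(n_t)
bounds = [(-1, 1)] * n_w + [(None, 1)] * n_t
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
w = res.x[:n_w]

# Empirical AUC of the learned scores (fraction of correctly ranked pairs)
auc = np.mean([(P[i] @ w > N[j] @ w) for i, j in pairs])
```

On this separable toy data the LP recovers a direction w that ranks every positive sample above every negative one, i.e., an empirical AUC of 1.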
The effectiveness of the proposed model is evaluated in the next section using the standard performance metrics.

IV. EXPERIMENTS, PERFORMANCE METRIC, RESULTS, AND DISCUSSIONS
In this section, the performance of the proposed method is experimentally evaluated. First, a brief description of the performance measures used for comparing the results obtained from the proposed and the existing methods is presented. Next, the performance of the proposed method is illustrated using 2-D simulated data. Finally, the APS Scania truck data is classified using the proposed method, and the results are compared with the existing results.

A. PERFORMANCE METRIC
The following performance metrics are used for evaluating the proposed method and for measuring its effectiveness: True Positive: True Positive (T P ) represents the total number of positive samples classified by the classifier that are actually from the positive class. Typically, the minority class is considered as the positive class.
False Positive: False Positive (F P ) is the total number of samples classified by the classifier as positive while the actual class label of the samples was negative, i.e., from the majority class.
True Negative: True Negative (T N ) is the total number of samples classified by the classifier as negative that are actually from the negative class.
False Negative: False Negative (F N ) is the total number of samples classified by the classifier as negative while the actual label of the samples was positive.
Sensitivity: Sensitivity is computed via:

Sensitivity = T_P / (T_P + F_N).

Specificity: Specificity is computed via:

Specificity = T_N / (T_N + F_P).

AUC: In addition to the above measures, a classifier's performance can also be evaluated in terms of Receiver Operating Characteristics (ROC). Typically, the ROC curve plots sensitivity against 1 − specificity (the false positive rate) for different threshold levels of the classifier. However, for a fixed threshold, the ROC can be approximated by a single point on the graph, as shown in Fig. 4. The ROC depicts the performance of a classifier without regard to class distribution. The Area Under the ROC Curve (AUC) summarizes the quality of classification, and is used as a performance measure. AUC is one of the popular performance measures for evaluating a classifier on an imbalanced dataset.
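These metrics can be computed directly from predictions and scores. A minimal sketch with hypothetical labels and scores follows; the AUC here is the rank-based Wilcoxon-Mann-Whitney estimate, equivalent to the area under the empirical ROC curve:

```python
import numpy as np

def sens_spec(y_true, y_pred):
    """Sensitivity = TP/(TP+FN), Specificity = TN/(TN+FP); positive label is 1."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    return tp / (tp + fn), tn / (tn + fp)

def auc_wmw(scores_pos, scores_neg):
    """AUC as the probability that a positive sample outscores a negative one."""
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

y_true = np.array([1, 1, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1])
sens, spec = sens_spec(y_true, y_pred)        # 1/2 and 2/3
auc = auc_wmw([0.9, 0.4], [0.1, 0.5, 0.2])    # 5 of 6 pairs ranked correctly
```

Unlike overall accuracy, the AUC estimate treats each positive-negative pair equally, which is why it remains informative when one class dominates.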

B. EXPERIMENTATION ON SIMULATED IMBALANCED DATA 1) DATA GENERATION
In order to examine the efficacy of the proposed method for classifying binary imbalanced data, a number of synthetic datasets were generated. Specifically, random binary classification data with various degrees of overlap is generated, where |P| = 10 and |N| = 90. The negative class is Gaussian data with zero mean and unit variance. The positive class is another Gaussian dataset with unit variance. In Experiment-1, the mean of the positive class is 6 units, which implies a no-overlap scenario. In Experiment-2, the mean of the positive class is 2 units, which implies a medium level of overlap. In Experiment-3, the mean of the positive class is 1 unit, which implies a high level of overlap between the samples of the two classes.
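The data generation described above can be sketched as follows. This is an illustrative reconstruction following the stated parameters (|P| = 10, |N| = 90, unit variance, mean shifts of 6, 2, and 1 unit); the dimensionality and random seed are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

def make_imbalanced(mean_shift, n_pos=10, n_neg=90, dim=2):
    """Two Gaussian classes with unit variance: N centered at the origin,
    P shifted by `mean_shift` units along every dimension."""
    X_neg = rng.normal(loc=0.0, scale=1.0, size=(n_neg, dim))
    X_pos = rng.normal(loc=mean_shift, scale=1.0, size=(n_pos, dim))
    X = np.vstack([X_neg, X_pos])
    d = np.concatenate([-np.ones(n_neg), np.ones(n_pos)])
    return X, d

X1, d1 = make_imbalanced(6.0)   # Experiment-1: no overlap
X2, d2 = make_imbalanced(2.0)   # Experiment-2: medium overlap
X3, d3 = make_imbalanced(1.0)   # Experiment-3: high overlap
```

Varying only the mean shift keeps the 1:9 imbalance ratio fixed while controlling how much the two classes overlap.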

2) SOLUTION APPROACH
All three models, i.e., SVM, LogR, and maxAUC, are solved using standard solution methods. The cSVM model is solved using the libsvm toolbox, the LogR model is solved using the glmfit toolbox, and the LP model is solved using the Cplex solver. In addition, for cSVM, the hyperparameter c is cross-validated in powers of 10 from −3 to 3 with an increment of 1. The value of c that gives the maximum AUC is finally selected. Fig. 6 shows the decision boundaries achieved by all three methods considered in this study when no overlap is present between the two classes. As depicted in Fig. 6, for this dataset, the proposed method as well as the other two methods can accurately find a separation boundary between the samples of the two classes. Fig. 7 shows the decision boundaries of the three methods in the presence of a medium level of overlap between the samples of the two classes. We notice that for the medium-overlap imbalanced dataset, the proposed method can separate the minority class with better sensitivity than the LogR and cSVM classifiers, with a low trade-off in specificity. As shown in Fig. 7, the decision boundary (the green colored line) of the proposed method can classify all but one of the data points of the minority class (marked as circles). As mentioned in Section III, for imbalanced data, if a classifier labels all the test data points as the majority class, then the overall classification accuracy would be very high. However, such behavior of the classifier is misleading. From Fig. 7, it can be seen that cSVM and LogR exhibit this misleading behavior. That is, cSVM and LogR push the boundary line to a region such that the maximum overall classification accuracy can be achieved. Clearly, the decision boundary of cSVM and LogR is drawn towards the region of the minority class. As a result, most of the data points within the overlapping region are classified as majority class.
Table 1 summarizes the training performances of the three methods for the medium-overlap scenario over 100 random trials, using the performance metrics considered in this study. For a given measure and a given method, the average value (standard deviation) of the metric over the 100 iterations is presented in Table 1. It is interesting to note that the proposed method was more attentive towards the data points from the minority class, due to the maxAUC criterion. From Table 1, it can be seen that the accuracy of the two conventional methods in classifying the minority class (T_P for LogR is 7.861; T_P for cSVM is 7.91) is worse than that of our proposed method (T_P using our method is 9.50). However, in terms of the overall accuracy, due to the misleading behavior, our method appears to have a lower overall accuracy of 0.9411. Fig. 8 shows the decision boundaries of the three methods in the presence of a high degree of overlap between the samples of the two classes. Correspondingly, Table 2 provides the performance summary of the three methods over 100 random trials. As expected, for the highly overlapping dataset, the minority class accuracies of cSVM (T_P: 1.05) and LogR (T_P: 2.36) are even worse than in the case of medium overlap. This is due to the tendency of these two classifiers to push the decision boundary towards the minority class samples, with the objective of achieving a high overall classification accuracy. The respective AUCs of cSVM and LogR are close to 0.5, which indicates poor performance of these classifiers. Since the proposed method is based on the maxAUC criterion, its accuracy for minority class classification is very encouraging (T_P: 7.961) in the presence of highly overlapping and imbalanced data. Furthermore, the high AUC value of 0.78480 also indicates that our method can classify data points from both classes (majority and minority) efficiently.
Table 3 shows the average (standard deviation) time per trial in seconds taken by the three methods for the three scenarios. For many pragmatic applications that generate imbalanced data, including APS failure prediction and classification, correctly classifying the samples of the minority class is more crucial than correctly classifying the samples of the majority class. The proposed method excels in AUC when classifying synthetically generated imbalanced data. Therefore, we are encouraged to apply the proposed method to classifying the APS failure detection data obtained from Scania.

D. EXPERIMENTS ON APS FAILURE PREDICTION
In this section, the experiments conducted on APS failure detection are described. First, the characteristics of the dataset used for APS failure detection are presented. Next, a summary of the solution approach is provided. Lastly, the results, in-depth comparisons, and discussions are presented.

1) APS FAILURE DATASET
The APS failure dataset presented in [3] is considered in this experiment. The dataset collected by Scania contains 80,000 data points from heavy trucks operating in five European markets. Each data point is labeled (either positive or negative class), where the positive class represents system failure due to APS failure. 171 different measurements (attributes) are collected from different components of the vehicle, and the attribute names are anonymized by Scania. The aim of this experiment is to predict the APS failure (positive) cases with high accuracy. According to Scania, the cost of incorrectly predicting a non-APS failure as an APS failure is $10. Whereas, if an APS failure that leads to system failure goes undetected, then the incurred cost is $500. The former cost corresponds to false alarms, which is mostly related to checks done by the mechanics. The latter cost corresponds to the breakdown of a truck/system. The highly imbalanced nature of the dataset is the key difficulty in developing a good classification mechanism. In the training dataset, 59,000 data points belong to the negative class, and only 1,000 data points belong to the positive class. Thus, the ratio of positive to negative class samples is 1:59 in the training dataset. Similarly, in the test dataset, the ratio of positive to negative class samples is ≈ 1:42, where 375 samples belong to the positive class and 15,725 samples belong to the negative class. In addition, there are many data points with missing values.
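The Scania cost criterion described above maps directly to a one-line function, using the costs stated in the dataset description ($10 per false alarm, $500 per missed APS failure); the example counts are hypothetical:

```python
def scania_cost(false_positives, false_negatives):
    """Total challenge cost: $10 per false alarm, $500 per missed APS failure."""
    return 10 * false_positives + 500 * false_negatives

# Example: a classifier with 100 false alarms and 20 missed failures
cost = scania_cost(100, 20)   # 100*10 + 20*500 = 11000
```

Because a miss is 50 times as expensive as a false alarm, a classifier can afford many extra false positives if doing so recovers even a few additional true positives.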

2) SOLUTION APPROACH
At first, the missing values in each feature are imputed using the KNN method, with the value of K taken to be 33 based on the literature. Next, the training data is scaled to zero mean and unit variance. The scaled data is then used for training the models. The solution approach for the APS failure detection data is similar to the simulated data solution approach (Section IV-B2). Table 4 compares the classification performances of the proposed model with LogR and cSVM. The hyper-parameter (cost c) in cSVM is calibrated by selecting the value that gives the best AUC during the training phase. The proposed method can detect failures due to APS with AUC values of 0.97021 and 0.94339 for training and testing, respectively. Our results clearly outperform the results obtained by LogR and cSVM. As discussed in Section IV-B, for imbalanced data, the overall accuracy is misleading. This misleading behavior is repeated by LogR and cSVM. For the training dataset, cSVM correctly predicts 648 data points out of the total 1,000 data points from the positive class. In contrast, the proposed method correctly predicts 967 out of the 1,000 data points as positive class (i.e., APS failure cases). A similar pattern is seen for the test dataset. Evidently, the proposed model can predict the cases of APS failure better than the two conventional classifiers reported in Table 4. For the test data, the number of true positives identified by the proposed method is 345, which is high in comparison with 283 for LogR and 261 for cSVM. The high number of true positives achieved not only would minimize the chances of any accidental situations due to Scania truck failures, but also would reduce the operational cost. The cost involved in missing any system failure due to APS failure is 50 times the cost of incorrectly attributing a system failure to APS failure. In Table 5, the incurred costs for cSVM, LogR, and the proposed method are presented.
Interestingly, the proposed method reduces the total cost by a significant amount compared to the other two classifiers. As shown in the table, if the proposed model is used for predicting APS failure, the total savings on the training dataset are $144,650 compared with cSVM and $132,439 compared with LogR. Similarly, the savings on the test dataset are $37,290 ($26,910) compared with cSVM (LogR). Furthermore, training time is a big concern when dealing with big data. Our proposed model is a linear programming model; hence its worst-case training time complexity is polynomial in the number of training samples, the number of features in each sample, and the desired accuracy. The time complexity of training cSVM is in the order of O(nd²), assuming that the number of features per sample is smaller than the total number of samples in the training set; this is in addition to the time needed to select the optimal value of the hyper-parameter C. For logistic regression, the training time complexity is in the order of O(nd) per epoch, i.e., it additionally scales with the number of times the algorithm iterates through the complete training set. Hence, the proposed model is better not only in terms of classification performance (as per the AUC measure) but also in terms of training time complexity.
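Scania's asymmetric cost model underlying Table 5 is a simple linear function of the two error counts, which can be stated directly (the false-positive counts below are not taken from the paper's tables; only the true-positive counts reported above are used, so the missed-failure numbers are derived, not quoted):

```python
# Scania's cost model: $10 per false alarm (FP), $500 per missed
# APS failure (FN); the total cost is their weighted sum.
COST_FP, COST_FN = 10, 500

def total_cost(false_positives, false_negatives):
    return COST_FP * false_positives + COST_FN * false_negatives

# Missed failures on the test set (375 positives minus the reported
# true positives for each model).
missed_proposed = 375 - 345   # proposed method
missed_logr = 375 - 283       # logistic regression
missed_csvm = 375 - 261       # cost-sensitive SVM
```

Because each missed failure costs 50 times as much as a false alarm, the 84 additional failures that cSVM misses relative to the proposed method dominate the cost comparison even before false alarms are counted.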

3) RESULTS AND DISCUSSION ON APS FAILURE PREDICTION/CLASSIFICATION
As mentioned in Section II, there exist a few studies that attempt to predict APS failure using the same dataset. Tables 6 and 7 summarize the classification performances reported in those studies. It should be noted that most of these studies use nonlinear classifiers to predict the APS failure, whereas in this study we propose and develop a new linear classifier built on the maxAUC criterion; thereby, the performance comparison of the proposed method with the reported methods is biased towards the existing studies. Nevertheless, the purpose of the comparison is to benchmark the proposed approach against the existing studies. Tables 6 and 7 show that several nonlinear classifiers trained directly on the raw imbalanced data perform poorly. However, as expected, after preprocessing the dataset either by oversampling (e.g., using SMOTE) or by under-sampling, the performances of the nonlinear classifiers improve considerably (results reported in Eleonora Peruffo (2016) [1] and Rafsunjani et al. (2019) [9]). Table 7 also shows that the proposed method outperforms the AUC of rpart gini (AUC: 0.8418), rpart information (AUC: 0.7962), and J48 (AUC: 0.9035) as reported in Eleonora Peruffo (2016) [1]. It should be noted that, in that study, the dataset is first transformed into a balanced one through a SMOTE approach; thereby, the performances of the reported classifiers are enhanced, and row 3 of Table 7 reports the best performances of the respective classifiers. In contrast, the proposed method utilizes a linear decision boundary and maximizes AUC directly; hence, we conjecture that no data preprocessing is required to enhance its classification performance. Nonetheless, a nonlinear decision boundary with the maxAUC criterion could provide further enhanced classification performance; as future work, we plan to develop such a classifier and study its performance on APS failure data. Rafsunjani et al. (2019) [9] report an enhancement of the performance of a few nonlinear classifiers with an under-sampling approach.
Row 2 of Table 7 shows the best performances of the classifiers in [9]. We see that our proposed model outperforms Naive Bayes (AUC: 0.9306), SVM (AUC: 0.8411), and Gradient Boosted Tree (AUC: 0.9224) in terms of AUC. All the classifiers reported in Table 7 are nonlinear except the proposed classifier; for some of them, the AUC values (highlighted in bold font) are higher than that of the proposed model. To sum up, the proposed linear method outperforms 11 of the 18 nonlinear models reported in the literature. Although the performance comparison between linear and nonlinear classifiers on the APS data is biased, the proposed linear method empirically demonstrates its capability of handling the complex imbalanced nature of APS failure data. Furthermore, the proposed method also outperforms many nonlinear classifiers for which the data are tailored into a balanced dataset before classification. For example, Rafsunjani et al. (2019) [9] report that the AUC of SVM is 0.8411 after transforming the dataset into a balanced one using an under-sampling approach, whereas the AUC of the proposed method is 0.94339. Generally, under-sampling and oversampling approaches have the following issues: i) generating new data samples might introduce impurity into the dataset, and ii) valuable information might be lost by removing samples from the majority class. Thus, sampling approaches that transform imbalanced data into balanced data might not always be effective in improving classification performance. The proposed model aims at maximizing AUC directly, and thereby it can suitably handle an imbalanced dataset without any sampling-based transformation.
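The quantity that the maxAUC criterion targets is the Wilcoxon–Mann–Whitney estimate of AUC: the fraction of (positive, negative) pairs that the classifier's score ranks correctly. A minimal sketch of this estimate is shown below; the paper's linear programming formulation for maximizing it is not reproduced here, and the function name is ours.

```python
import numpy as np

def auc_wmw(pos_scores, neg_scores):
    """Wilcoxon–Mann–Whitney estimate of AUC: the fraction of
    (positive, negative) score pairs ranked correctly, with ties
    counted as 1/2. A maxAUC classifier seeks weights whose linear
    scores maximize this pairwise statistic."""
    sp = np.asarray(pos_scores, dtype=float)[:, None]  # column vector
    sn = np.asarray(neg_scores, dtype=float)[None, :]  # row vector
    return float(np.mean((sp > sn) + 0.5 * (sp == sn)))
```

Because every positive sample is compared against every negative sample, this statistic is insensitive to the class ratio itself, which is why optimizing it directly sidesteps the need for under- or over-sampling.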

V. CONCLUSION
APS failure detection is a crucial issue in automobile/vehicle systems. The data collected from such systems is typically imbalanced; thus, conventional classification criteria like maximum margin or maximum likelihood may not handle it efficiently. Generally, sampling approaches (under- or over-sampling) are used to transform the data into balanced data. The proposed method is instead designed based on the maximum AUC (maxAUC) criterion and can therefore, by its very nature, handle imbalanced data. Using synthetic-data experiments, the performance of the proposed method on imbalanced data with varying degrees of class overlap is demonstrated. The proposed method is then applied to the APS failure detection benchmark data collected by Scania. From the performance comparison of the proposed method with the existing methods, it can be stated that the proposed method is a potential candidate for classifying imbalanced datasets arising in APS failure detection. In the future, we will apply deep learning approaches to the APS failure detection scenario and evaluate their performance.

IRFAN AHMAD received the Ph.D. degree in computer science from TU Dortmund, Germany, in 2017. He is currently an Assistant Professor with the Department of Information and Computer Science, KFUPM, Saudi Arabia. He has published several articles in high-quality journals and international conferences. He has also coauthored two book chapters and holds three U.S. patents. His research interests include pattern recognition and machine learning. He regularly reviews articles for well-known journals and conferences in his area of research, in addition to being a program committee member for some of the reputed international conferences.

MOHAMMAD MEHEDI HASSAN (Senior Member, IEEE) received the Ph.D. degree in computer engineering from Kyung Hee University, South Korea, in February 2011.
He is currently a Professor with the Information Systems Department, College of Computer and Information Sciences (CCIS), King Saud University (KSU), Riyadh, Saudi Arabia. He has authored or coauthored more than 260 publications, including refereed IEEE/ACM/Springer/Elsevier journal articles, conference papers, books, and book chapters. His research interests include cloud computing, edge computing, the Internet of Things, body sensor networks, big data, deep learning, mobile cloud, smart computing, wireless sensor networks, 5G networks, and social networks. He has served as the Chair and a Technical Program Committee Member at numerous reputed international conferences/workshops, such as IEEE CCNC, ACM BodyNets, and IEEE HPCC. He was a recipient of a number of awards, including the Best Conference Paper Award from the IEEE International Conference on Sustainable Technologies for