Treating Class Imbalance in Non-Technical Loss Detection: An Exploratory Analysis of a Real Dataset

Non-Technical Loss (NTL) is a significant concern for many electric supply companies due to the financial impact of suspect consumption activities. A range of machine learning classifiers has been tested across multiple synthesized and real datasets to combat NTL. An important characteristic of these datasets is the imbalanced distribution of the classes. When the focus is on predicting the minority class of suspect activities, the classifiers' sensitivity to the class imbalance becomes more important. In this paper, we evaluate the performance of a range of classifiers with under-sampling and over-sampling techniques. The results are compared with the untreated imbalanced dataset. In addition, we compare the performance of the classifiers using a penalized classification model. Lastly, the paper presents an exploratory analysis of using different sampling techniques on NTL detection in a real dataset and identifies the best performing classifiers. We conclude that logistic regression is the most sensitive to the sampling techniques, as its recall changes by around 50% for all sampling techniques. Random forest is the least sensitive to the sampling techniques, with its precision changing by 1% to 6% across all sampling techniques.


I. INTRODUCTION
Typically, many companies from the energy sector face financial crises due to Non-Technical Loss (NTL). It is a loss that is endured by the electric supplier and caused by the unusual suspect activities from the electric consumers. The suspect activities include illegal hooking of the wires, incorrect meter reading, meter bypassing, or even reversing the meters. The objective of these activities is the reduction of the bill amount. These activities are mainly practiced in those cities and industrial areas where manual infrastructure is still used.
The associate editor coordinating the review of this manuscript and approving it for publication was Md Zakirul Alam Bhuiyan.
The Advanced Metering Infrastructure (AMI) has made many illegal activities hard to practice. However, many countries still use manual electricity infrastructure, and hence monthly manual meter reading is still practiced in countries including India, Pakistan, and Brazil. Multibillion-dollar annual losses are reported from such countries. For example, a yearly loss of 12 billion dollars is estimated for India due to NTL. An estimated loss of 58.7 billion dollars occurs every year on account of NTL in the power industries of the top 50 emerging countries [1], which shows the significance of NTL detection.
To combat NTL, many techniques have been tested over the past decade to detect NTL occurrences and to analyze measures for avoiding the resulting losses. Network-oriented, data-oriented, and hybrid techniques are commonly tested for this purpose [2]. The network-oriented techniques include the installation of separate hardware for the detection of NTL. The data-oriented techniques use the consumption data to identify the occurrences of NTL. Data-oriented techniques mainly focus on applying machine learning classifiers to pre-processed consumption data, including Support Vector Machine (SVM), Decision Trees (DT), K-Nearest Neighbors (KNN), CatBoost, XGBoost, LightGBM, etc. These techniques use various performance evaluation metrics to measure the number of potential fraudsters detected correctly. In our previous contribution [2], we identified the best metrics for evaluating the performance of the classifiers, considering the characteristics of the datasets used in NTL detection. In another contribution [3], we identified the best individual classifier and the best type of classifiers for NTL detection.
Many datasets used in machine learning suffer from class imbalance. This is the characteristic of a dataset in which the samples of one class are heavily represented while the samples of the other class are scarcely represented. As reported in [4], this imbalanced distribution of the classes introduces biases in the dataset, resulting in correct prediction of the majority class but incorrect prediction of the minority class. The issue becomes more serious when the focus is on identifying the minority class. NTL detection in the energy sector is one such class imbalance application. As normal electric consumers outnumber the fraudsters, a dataset pertaining to NTL is characterized by data bias and, hence, an imbalanced distribution of classes.
To deal with the problem of class imbalance in datasets, multiple techniques have been proposed in the literature. One technique is under-sampling the majority class. An alternative is over-sampling the minority class [5]. Instance weighting schemes have also been proposed to address class imbalance. As with other class imbalance problems, datasets pertaining to NTL detection have been balanced in the pre-processing step so that the minority and the majority classes have comparable numbers of samples. For this, under-sampling and over-sampling techniques have been tested separately. However, there is still a need to thoroughly explore the impact of under-sampling, over-sampling, a hybrid of both, and cost-sensitive approaches by applying them individually as well as in combination on a real dataset. In this work, we use a real dataset collected by an electric supplier company operating in Pakistan. The dataset comprises 71 features and 80,244 records. These are monthly meter readings of a specified neighborhood for a period of 15 months. The main contribution of the paper is to treat the class imbalance problem in NTL-oriented real datasets using a variety of techniques, namely under-sampling, over-sampling, and penalized classification models. The dataset is then used to evaluate the performance of a range of classifiers. In the end, we present an exploratory analysis of the impact of using different sampling techniques on NTL detection in a real dataset and identify the best-performing classifiers.
The rest of the paper is organized as follows: Section II reviews recent contributions in NTL detection. Section III describes the class imbalance problem. Section IV first describes the dataset used in this contribution and then outlines the class imbalance methodologies and the performance evaluation metrics used. Experiments are explained in Section V. Finally, the conclusion and future work are presented in Section VI.

II. LITERATURE REVIEW
During the past decade, the financial deficit caused by electricity theft has been an alarming situation for many countries. Hence, the problem of NTL detection has become a thoroughly studied area. There are many sub-processes which the research community is trying to use to lessen the impact of NTL. For example, an increasing interest is found in using multiple classifiers, using multiple evaluation metrics to correctly figure out the losses, and the post-processing phase, etc. Multiple combinations of classifiers have been tested for the detection of NTL, which includes wide and deep CNN [6], recurrent neural network (RNN) [7], fuzzy logic [8], CatBoost [3], etc. Apart from classification, several other techniques have been tested, which include association rule mining [9], hierarchical clustering [10], outlier detection techniques [11], etc.

A. RECENT TRENDS IN PRE-PROCESSING STEPS OF NTL DETECTION
With the increase in recognition of the importance of NTL detection over the years, the research community has been equally showing an increasing interest in the pre-processing steps of the dataset before it is used in the training and the testing of the classifiers. The pre-processing steps include handling the imbalance characteristic of the datasets, feature selection, feature extraction, and data merging from the other sources. These recent trends in the pre-processing steps of NTL detection are described in Figure 1.

B. HANDLING CLASS IMBALANCE PROBLEM
Historically, the research community has shown a deep interest in tackling the imbalanced nature of the datasets. For this, generally, two sampling approaches are followed, namely under-sampling and over-sampling. These techniques are defined in Section IV-B1. The use of the synthetic minority over-sampling technique (SMOTE) [5] remains an attraction in many fields. This technique generates synthetic records using the records of the minority class.
There have been some notable contributions in this regard. For example, Buzau et al. [12] have used two under-sampling techniques to combat the imbalanced distribution of the classes. The first technique removes from the training set those normal consumers that caused fraudulent consumers to be wrongly identified as normal consumers. The second technique under-samples the training set randomly. The paper observes that there is not much of a difference in the output of the two under-sampling techniques. In addition, this contribution compares the results of SVM, logistic regression, KNN, and XGBoost using AUC (Area Under the Curve) as the performance evaluation metric. A dataset from a Spanish company is used in the experiments. This contribution has also included some geographical information and the technical information of the meters. However, the paper has not used any over-sampling technique, so the results of under-sampling and over-sampling are not compared.
In contrast, Hasan et al. [13] have used SMOTE as an over-sampling technique in the pre-processing step to overcome the impact of the imbalanced distribution of the classes. The authors have used a dataset provided by an electric supplier in China. They have used long short-term memory (LSTM) and Convolutional Neural Network (CNN) to identify NTL. The performance evaluation measures used are precision, recall, F-1, and accuracy. The paper reports an accuracy of 89%. A somewhat different approach is used in [14]. The authors have combined the data of power and voltage measurements with the normal consumption data to train and test the SVM classifier to detect the faulty consumers as well as the time and the intensity of NTL in kW. This work has used an Irish dataset that includes half-hourly consumption records for 5000 consumers. The authors have used replication as an over-sampling technique to overcome the problem of class imbalance. The results are evaluated using AUC and accuracy. The paper reports an accuracy of 99.4%. However, this work has not applied any under-sampling technique.
The same dataset is used in [15], but the paper focuses on detecting NTL in industrial supplies only. The authors have used a deep learning-based mechanism to extract some advanced features from the dataset as a pre-processing step, which are then used in a semi-supervised autoencoder for theft detection. Their results are compared with SVM, KNN [16], XGBoost [17], and multi-layer perceptron (MLP) using precision, recall, F-1, AUC score, and accuracy. The paper concludes that the proposed framework outperformed the other available classifiers. However, the article has not addressed the imbalance behavior in the pre-processing step. In our recent contribution [18], we have proposed an incremental feature selection algorithm that helps in selecting the minimum number of suitable features for NTL detection. The algorithm uses the feature importance [19] of every feature. The work has identified the top 9 features out of a total of 71 features in a real dataset. The precision, recall, and F-1 scores of the classifiers using the selected features are comparable to or better than those obtained using all features. The classifiers used are CatBoost, KNN, and decision tree.
Another contribution using a dataset from a Chinese company is presented in [20]. The authors have used the time-series data to convert it into image form, which is helpful in the long run for analyzing the consumer's consumption behavior. The paper evaluates its results using precision, recall, F-1 score, and AUC curve and concludes that their work performs best when the labeled classes are few in the dataset. The imbalance behavior of the dataset is, however, not discussed in the paper.
A similar dataset from China is used in [21]. The authors have not dealt with the imbalance behavior of the data. However, they have taken into consideration the effect of climatic changes on the occurrence of NTL. The authors have combined the electric data with the climate data and concluded that NTL occurs more often in regions of extreme weather. They have used different classifiers belonging to the neural networks (NN) and ensemble methods. The paper concludes that the ensemble methods perform better than NN.
Recent work in NTL detection [22] uses a binary masking scheme in the pre-processing step to fill the missing values. However, the paper has not dealt with the imbalance behavior of the data. This work has used CNN as a base classifier and evaluated the performance using AUC and F-1 scores. The paper concludes with a higher F-1 score as compared to other techniques.
There is a need to further explore the pre-processing steps in NTL detection through a comprehensive study comparing the effect of under-sampling and over-sampling approaches on a real imbalanced dataset. A summary of the literature review is presented in Table 1.

III. THE PROBLEM OF CLASS IMBALANCE
The imbalanced distribution of the classes is typical behavior in classification problems ([23], [24]). This class imbalance occurs in datasets having a disproportionate ratio of instances or examples in each class. In other words, it is caused by a skewed distribution of data between classes. Many real-life problems such as fake review detection, fake news detection, fraud detection, customer churn prediction, electricity loss prediction, and others appearing in different domains are prone to imbalanced data. Most classifiers are sensitive to real-life imbalanced data and struggle to achieve accurate results because state-of-the-art classification algorithms expect a balanced class distribution ([25]-[28]).
Class imbalance manifests as a difference in the class distribution between the training and the test sets. In contrast, given the same class label, the conditional distribution of X in the training set is the same as that in the test set. Let X be the observation sample and Y be the target variable; then the class imbalance relation is shown in Equations 1 and 2 [29]:
$$P_{\text{train}}(X \mid Y = y) = P_{\text{test}}(X \mid Y = y) \tag{1}$$
$$P_{\text{train}}(Y) \neq P_{\text{test}}(Y) \tag{2}$$
where $P_{\text{train}}(X \mid Y = y)$ is the distribution of X given Y in the training set, $P_{\text{test}}(X \mid Y = y)$ is the distribution of X given Y in the test set, $P_{\text{train}}(Y)$ is the distribution of the target variable in the training set, and $P_{\text{test}}(Y)$ is the distribution of the target variable in the test set.

IV. METHODOLOGY

A. DATASET
Many countries use the traditional metering infrastructure, which relies on on-site meter readings every month. For this, a monthly consumption record is entered in the database for each consumer. The dataset used in this contribution is a real dataset taken from a power supply company in Pakistan. It contains monthly readings of the electricity consumption of a neighborhood. The dataset contains 71 features, which include numeric, string, and date data types. The total number of records used is 80,244. The company also provided the consumption records in which NTL has been identified, which makes this a labeled dataset where the classes are already known. The dataset is split into training and test sets with an 80%-20% ratio. The training set contains 61,456 records with negative class and 2,739 records with positive class. Similarly, the test set contains 15,366 records with negative class and 683 records with positive class. As Figure 2 depicts, a clear imbalance in the representation of the two classes is observed in both the training and the test sets, making this dataset a suitable example for exploring the impact of class imbalance in NTL detection.
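The degree of imbalance follows directly from the record counts reported above. The short sketch below only redoes that arithmetic; the function names are illustrative and not part of the original study. The near-identical positive shares in the two sets would be consistent with a stratified split, although the text does not state how the split was performed.

```python
# Record counts reported in the text for the 80%-20% split.
train = {"negative": 61_456, "positive": 2_739}
test = {"negative": 15_366, "positive": 683}


def positive_share(counts):
    """Fraction of records that belong to the positive (NTL) class."""
    return counts["positive"] / (counts["positive"] + counts["negative"])


def imbalance_ratio(counts):
    """Number of negative records per positive record."""
    return counts["negative"] / counts["positive"]


total = sum(train.values()) + sum(test.values())
print(f"total records: {total}")
print(f"train positive share: {positive_share(train):.2%}")
print(f"test positive share:  {positive_share(test):.2%}")
print(f"train imbalance ratio: {imbalance_ratio(train):.1f} : 1")
```

In both sets, fewer than one record in twenty belongs to the positive class.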

B. CLASS IMBALANCE METHODOLOGIES
To achieve high performance of the machine learning classifiers in the imbalanced datasets, care must be taken to balance out the representation of all the classes before the dataset is used for classification. The techniques to deal with the class imbalance problem belong to two main types. One type deals with the situation in the data-level phase. The second type, also termed cost-sensitive learning, deals with the imbalance problem at the algorithmic level [30].

1) DATA-LEVEL TECHNIQUES
In the pre-processing phase, the following three techniques are widely used:
1) Under-Sampling Techniques: These techniques attempt to reduce the number of samples of the majority class. The most widely used under-sampling method is random under-sampling [31]. The drawback of under-sampling techniques is the loss of potentially useful records. However, the training time of the classifiers improves as the training set shrinks.
2) Over-Sampling Techniques: These techniques attempt to synthetically generate new samples or randomly duplicate the existing samples of the minority class. The synthetic generation of minority-class samples is termed SMOTE [5]. The drawback of over-sampling techniques is the increase in the size of the training set and, hence, in the computation time of training the classifiers. However, in contrast with the under-sampling techniques, the over-sampling
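The data-level techniques described in this subsection can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the toy data, the function names, and the choice to balance the classes exactly 1:1 are all hypothetical. The synthetic variant is deliberately simplified, since SMOTE [5] interpolates toward one of the k nearest minority neighbours, whereas this sketch picks the partner record at random.

```python
import random


def random_under_sample(majority, minority, seed=0):
    """Keep every minority record and an equally sized random subset of the majority."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)


def random_over_sample(majority, minority, seed=0):
    """Keep every majority record and duplicate random minority records until sizes match."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra


def synthetic_over_sample(majority, minority, seed=0):
    """Simplified SMOTE-style sampling: interpolate between two random minority records."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(len(majority) - len(minority)):
        a, b = rng.sample(minority, 2)
        t = rng.random()  # position along the segment from a to b
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return list(majority), list(minority) + synthetic


# Hypothetical toy data: 20 "normal" and 4 "suspect" records with two features each.
data_rng = random.Random(42)
normal = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(20)]
suspect = [(data_rng.gauss(3, 1), data_rng.gauss(3, 1)) for _ in range(4)]

for name, sampler in [("under", random_under_sample),
                      ("over", random_over_sample),
                      ("synthetic", synthetic_over_sample)]:
    maj, mino = sampler(normal, suspect)
    print(f"{name}: {len(maj)} majority, {len(mino)} minority")
```

Under-sampling shrinks the training set to the size of the minority class, while both over-sampling variants grow it to the size of the majority class, which mirrors the training-time trade-off noted above.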

2) ALGORITHMIC-LEVEL TECHNIQUES
One of the schemes to counter the biases in the dataset is cost-sensitive learning (also termed a penalized classification model). This technique can be applied at the algorithmic level as well as at the data level. It assigns weights (costs) to the training observations that are responsible for misclassification. For example, if the ratio of class imbalance is 1 : 10 in favor of the negative class, then the cost of misclassification of the positive class will be nine times the cost of misclassification of the negative class [28]. These costs can be calculated for every observation using Equation 3 [4], where $P_{\text{train}}(Y)$ is the distribution of the target variable in the training set and $P_{\text{test}}(Y)$ is the distribution of the target variable in the test set. As the dataset that we have used is an application of class imbalance in which normal users far outnumber fraudsters, we have tested under-sampling, over-sampling, and cost-sensitive techniques to counter the class imbalance problem in our dataset.
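To make the idea of class-dependent costs concrete, the sketch below derives weights from the training-set class counts given in Section IV-A and uses them to score a handful of toy predictions. It assumes inverse-frequency weighting, a common heuristic that is not necessarily the weighting defined by Equation 3; the function names and the toy labels are hypothetical.

```python
def class_weights(counts):
    """Inverse-frequency weights, scaled so the largest class has weight 1."""
    largest = max(counts.values())
    return {label: largest / n for label, n in counts.items()}


def weighted_error_cost(y_true, y_pred, weights):
    """Sum the weight of the true class over every misclassified record."""
    return sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)


# Training-set class counts from Section IV-A (0 = normal, 1 = NTL).
weights = class_weights({0: 61_456, 1: 2_739})
print(weights)

# Toy predictions: one false alarm (FP) and one missed NTL case (FN).
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1]
cost = weighted_error_cost(y_true, y_pred, weights)
print(f"weighted cost: {cost:.2f}")
```

Under this assumed weighting, a missed NTL case costs roughly 22 times as much as a false alarm, so a classifier trained to minimize the weighted cost is pushed toward higher recall.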

C. CLASSIFICATION METHODS FOR NTL DETECTION
As discussed in Section II, many classification methods have been used to detect NTL in the electric power industry. The current work extends and advances our previous contributions ([2], [3]), where the same dataset has been used. The dataset is described in detail in Section IV-A. The earlier contributions ([2], [3]) identified the best performing classifiers and highlighted the suitable performance metrics that can be utilized to investigate NTL efficiently and effectively. In this work, we analyze how the performance of nine classifiers changes after applying different sampling techniques and compare it with their performance on the original, untreated class imbalance dataset. The untreated class imbalance dataset is the one in which no sampling technique has been applied in the pre-processing step and the original distribution of positive and negative classes is utilized, as shown in Figure 2.
SVM is well known for maximizing the margin between the classes. It shows good performance when used with high-dimensional datasets. One of the linear learning models used is Stochastic Gradient Descent (SGD). Although sensitive to feature scaling, SGD also shows good results on high-dimensional data. Despite its tendency to overfit, a Decision Tree (DT) is a good option for some datasets due to its simple method of constructing if-else rules. Random Forest (RF) is an ensemble of different decision trees. Overfitting is reduced in RF by using a different training set for each DT [33]. In KNN, each instance is classified by majority voting among its k nearest neighbors. Care must be taken in setting the value of k, since the computation time increases with k. With multiple hidden layers, the Multi-Layer Perceptron (MLP) is heavily used in classification problems. Although it requires several hyperparameters to be tuned, its ability to model non-linear relationships remains an attraction. Logistic Regression (LR) can be viewed as the limiting case of an MLP with no hidden layer, in which the inputs feed directly into a sigmoid output unit. CatBoost and XGBoost are boosting techniques; CatBoost handles categorical data on its own, whereas XGBoost requires all data to be converted to a numerical data type [34].

D. PERFORMANCE METRICS
Three performance evaluation metrics are chosen to evaluate the classifiers, namely precision, recall, and F-score.
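Precision measures the correctly classified True Positive (TP) instances out of all instances predicted as positive, i.e., TP plus False Positives (FP). The formula for precision is shown in Equation 4.
$$\text{Precision} = \frac{TP}{TP + FP} \tag{4}$$
Recall measures the correctly classified TP instances out of all actual positive instances, i.e., TP plus False Negatives (FN). The formula for recall is shown in Equation 5.
$$\text{Recall} = \frac{TP}{TP + FN} \tag{5}$$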
F-score is the metric used to prioritize between recall and precision. The formula for F-score is shown in Equation 6. When β is 0.5, recall and precision have equal priorities. When β is greater than 0.5, recall has the higher priority, and when β is less than 0.5, precision has the higher priority.
As noted in our previous contribution [2], it is necessary to prioritize between FN and FP for NTL detection. A higher FP results in an additional cost of on-site checking for NTL occurrence, but a higher FN directly affects the identification of NTL. We need the number of FN to be as low as possible. There is an inverse relation between FN and recall, i.e., as FN decreases, recall increases. This leads to an important conclusion about NTL detection: since a lower FN is needed, the classifier with the higher recall should be preferred.
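The relation between FN and recall can be checked numerically. The sketch below uses the standard definitions of precision and recall together with the balanced F-score (the harmonic mean of the two, i.e., the equal-priority case). The FN and FP counts are illustrative values rather than results from the experiments; only the 683 positive test records come from Section IV-A.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Share of actual positives that the classifier found."""
    return tp / (tp + fn) if tp + fn else 0.0


def balanced_f_score(p, r):
    """Harmonic mean of precision and recall (equal priority)."""
    return 2 * p * r / (p + r) if p + r else 0.0


# The test set contains 683 positive records (Section IV-A).
positives = 683

# Illustrative FN counts: recall rises as FN falls.
for fn in (40, 8, 0):
    tp = positives - fn
    print(f"FN={fn:>2}: recall={recall(tp, fn):.3f}")

# Hypothetical FP count, used only to show the precision side of the trade-off.
fn, fp = 8, 30
tp = positives - fn
p, r = precision(tp, fp), recall(tp, fn)
print(f"precision={p:.3f} recall={r:.3f} F={balanced_f_score(p, r):.3f}")
```

Because TP + FN is fixed by the number of actual positives, every avoided FN raises recall, whereas precision depends on how many false alarms the classifier raises to achieve it.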

V. RESULTS AND DISCUSSION
Detailed results containing TP, TN, FP, FN, precision, recall, and F-score of the nine classifiers tested with imbalanced data and different sampling techniques are presented in Table 3 of Appendix A. A comparison between the confusion matrices of selected classifiers is shown in Table 2. The four classifiers with the best recall are chosen from each category: imbalanced data, SMOTE, ADASYN, random over-sampling, and random under-sampling. As discussed in Section IV-D, the classifier with the highest recall should be given priority for NTL detection. Considering this factor, with the original, untreated class imbalance dataset, MLP, CatBoost, and XGBoost have the best recall of 0.99 each. Their corresponding numbers of FN are 4, 6, and 8, respectively.

A. IMPACT OF SAMPLING TECHNIQUES ON IMBALANCE DATA
A detailed comparison of the confusion matrices obtained from the classifiers achieving maximum recall on balanced and imbalanced datasets is presented in Table 2. It is evident from the table that the recall of KNN and XGBoost jumps to 1 when they are applied to the sampled dataset obtained through SMOTE, since the corresponding number of false negatives is reduced to zero. The best classifiers identified after applying ADASYN are CatBoost, KNN, LR, and XGBoost. Considering recall, the most sensitive classifiers after applying random over-sampling are CatBoost, KNN, XGBoost, and MLP. Finally, the top four sensitive classifiers for random under-sampling are SVM, SGD, MLP, and KNN.
The classifier that is most sensitive to the sampling techniques is LR. As shown in Figure 3, the difference in LR recall is around 50% for SMOTE, ADASYN, ROS, and RUS. The next most sensitive classifier is SGD, for which the difference in recall is around 23% for all the sampling techniques. One of the reasons behind this increase is the low recall of LR and SGD on imbalanced data, which is 0.49 and 0.81, respectively. A difference of 8% is observed in the recall of KNN for all sampling techniques. KNN is susceptible to an imbalanced class distribution because it classifies an instance by a majority vote amongst its k nearest neighbors. For this reason, KNN struggles to produce accurate results on imbalanced data. The recall improves once the imbalanced distribution of classes is removed by the sampling techniques.
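The effect of majority voting on a minority class can be reproduced with a toy example. The following sketch is a hypothetical one-dimensional illustration, not the KNN configuration used in the experiments: two suspect records sit among many normal ones, and the same query is classified before and after over-sampling the suspect class by replication.

```python
from collections import Counter


def knn_predict(train, query, k):
    """Label `query` by majority vote among its k nearest training points (1-D)."""
    nearest = sorted(train, key=lambda item: abs(item[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (feature value, label).
normal = [(x, "normal") for x in range(30)]
suspect = [(20, "suspect"), (21, "suspect")]
query = 20.5  # lies exactly between the two suspect records

# Imbalanced training set: the suspect records are outvoted by nearby normal ones.
imbalanced = normal + suspect
print(knn_predict(imbalanced, query, k=5))

# Over-sampling by replication: each suspect record now appears five times.
balanced = normal + suspect * 5
print(knn_predict(balanced, query, k=5))
```

With k = 5, the query is labeled normal on the imbalanced data because three of its five nearest neighbors are normal records; after replication, suspect records dominate the neighborhood and the prediction flips, which corresponds to a false negative being removed.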
A significant decrease of 63% in the precision of SVM is observed when ADASYN is applied. The three largest percentage decreases in precision occur for SVM, SGD, and MLP when ADASYN is applied, as depicted in Figure 4. Considering precision, the classifier least sensitive to sampling techniques is RF, which has a decrease of 1%, 0%, 0%, and 6% for SMOTE, ADASYN, ROS, and RUS, respectively.
As F-score evaluates the classifier with respect to both the precision and the recall, the most sensitive classifier for sampling techniques is LR, which has an increase of around 30% to 34% for all sampling techniques, as shown in Figure 5. In contrast, a decrease of 44.9% in F-score is observed for SVM when ADASYN is applied. Considering F-score, the two classifiers least sensitive to sampling techniques are RF and CatBoost.
Considering recall as the relevant performance metric, RUS is the best sampling technique. Eight out of nine classifiers achieved a recall of 1 after applying RUS, while LR achieved a recall of 0.99. On the other hand, as shown in Figure 4, a noticeable decrease in precision is observed in those classifiers after applying RUS, hinting at an increase in FP. However, it is important to note that these findings are tied to the specific dataset used in this study. The sampling methods implemented here might behave differently on other datasets, depending on the nature of the dataset under investigation.
The precision on imbalanced data remains the highest compared to all sampling techniques because sampling techniques try to contract the decision boundaries. In doing so, many values which were treated as FN on imbalanced data are treated as FP, resulting in a decrease in precision.

B. IMPACT OF COST-SENSITIVE LEARNING ON IMBALANCE DATA
In our experiments, we also performed cost-sensitive learning by applying a weighted SVM strategy. Every misclassification penalized the training samples responsible for it. The recall of weighted SVM increased from 0.97 to 1.00, while the precision decreased from 0.98 to 0.95. The F-score remained stable at 0.97. Considering recall as the preferred performance evaluation metric for NTL detection, cost-sensitive learning resulted in an increase of 3% in SVM recall.

VI. CONCLUSION AND FUTURE WORK
In this paper, a real dataset of monthly consumption records was used for the experiments. The dataset includes approximately 80,000 records with 71 features. The paper compares the performance of nine classifiers with four sampling techniques applied to imbalanced data. The results are compared with the performance on untreated imbalanced data. Additionally, the impact of cost-sensitive learning on imbalanced data is analyzed for NTL detection.
One of the findings is that, considering recall, logistic regression is the most sensitive classifier for all sampling techniques. The difference in its recall is around 50% for SMOTE, ADASYN, ROS, and RUS. A decrease of 63% is observed in the precision of SVM when ADASYN is applied. Another finding is that the three largest percentage decreases in precision occur for SVM, SGD, and MLP when ADASYN is applied. The random forest is observed to be the least sensitive classifier, with a decrease in precision of 1% to 6% for all four sampling techniques.
The best sampling technique found is RUS, for which eight out of nine classifiers achieved a recall of 1. Cost-sensitive learning was also applied by experimenting with a weighted SVM. Its recall increased from 0.97 to 1.00, while the precision decreased from 0.98 to 0.95. The F-score remained constant at 0.97.
In the future, we plan to explore the impact of sampling techniques on selected features for NTL detection. The features on which the class label depends most can be filtered out and tested with the sampling techniques. Another potential future direction for NTL detection is to compare deep learning on the all-feature dataset with deep learning on the selected features.

APPENDIX A EXPERIMENTAL RESULTS
Detailed experimental results containing TP, TN, FP, FN, precision, recall, and F-score of selected classifiers using imbalanced data and different sampling techniques are presented in Table 3.