Employing a Machine Learning Boosting Classifiers Based Stacking Ensemble Model for Detecting Non Technical Losses in Smart Grids

The widespread shift from legacy metering infrastructure to advanced metering infrastructure (AMI) has created numerous opportunities to detect electricity theft in power grids. Detection is performed by studying the consumers' energy consumption (EC) readings provided by smart meters (SMs). The literature introduces a variety of machine learning (ML) and deep learning (DL) strategies that use EC data to identify power theft in smart grids (SGs). However, the existing schemes perform poorly in electricity theft detection (ETD) because they are trained on imbalanced data and employ classifiers individually. Moreover, the existing detectors are validated using a limited number of performance evaluation measures, which is insufficient for a comprehensive validation of a model. To tackle these problems, an ML boosting classifiers-based stacking ensemble model (MLBCSM), combined with the adaptive synthetic sampling technique (ADASYN), is proposed in this work. Data preprocessing, data balancing, and classification are the three major parts of the proposed model, and the EC data acquired from the consumers' SMs is used for detecting electricity theft. MLBCSM combines the benefits of adaptive boosting (AdaBoost), extreme gradient boosting (XGBoost), histogram boosting (HistBoost), categorical boosting (CatBoost), and light gradient boosting (LGBoost). Additionally, the model is validated via multiple metrics. Extensive simulations show that the proposed model outperforms the individual models in terms of ETD.


I. INTRODUCTION
Electricity is one of the most significant resources in human life and is provided to consumers by electric utilities, which are paid in return. However, electricity losses occur while dispatching energy from the generation side to the consumption side. These losses heavily reduce the economic benefits of both utilities and electricity consumers. Typically, electric losses are divided into two groups: technical and non-technical [1]. Technical losses (TLs) are caused by the physical state of the power system's devices. They can be reduced to some extent (but not fully) by changing the hardware components of the power system. Non-technical losses (NTLs) are caused by electricity theft, non-payment of energy consumption (EC) bills, meter installation flaws, accounting errors, etc. Among these, electricity theft is responsible for a remarkable share of the loss [2]. Electricity theft refers to the illegal use of energy in any of several ways [3]. The principal means of energy theft is tapping, which is responsible for almost 80% of the total NTLs [1]. The rate of electricity theft in developing countries is more than 30% [4]. Due to energy theft, China incurs a loss as high as 20 billion Chinese yuan per year [4]. The US loses around 6 billion US dollars per year to electricity theft [5]. Conventionally, technicians were hired to study the monthly meter readings over several months to identify abnormal energy usage. Afterwards, they visited each consumer in person to inspect the connection and status of each energy meter [6]. However, this method of detecting energy theft requires expert knowledge.
Moreover, the decisions that domain experts can make are scarce compared with the volume of EC readings, which grows from day to day [5]. The advanced metering infrastructure (AMI) is a significant element of the smart grid (SG) and comprises the smart meters (SMs) that record and monitor EC readings. With the emergence of AMI and SGs, a massive amount of EC data has become available. This has raised new hope for employing machine learning (ML) and deep learning (DL) algorithms to detect abnormal EC patterns in massive data [5]. These techniques reduce the workload of technicians and achieve better NTL detection accuracy. ML and DL techniques are also used in other fields such as transport, healthcare, and agriculture. Recently, many ML and DL-based models have been proposed to tackle the problem of NTLs [7], [8], [9], [10], [11]. These approaches analyze the consumers' EC history to detect NTLs in SGs. However, some of these techniques have low detection accuracy, and some yield high false positive rate (FPR) values. This poor performance has several causes. First, the ML classifiers are employed individually, and no stacking ensemble model is built from multiple heterogeneous techniques to improve performance in electricity theft detection (ETD). Second, the class imbalance problem is not handled, which biases the results towards the majority class. Third, some articles consider very few performance metrics, which are not enough for a comprehensive and accurate validation of the proposed approaches.
The associate editor coordinating the review of this manuscript and approving it for publication was Alba Amato.
To address the abovementioned limitations, we developed an ML boosting classifiers-based stacking ensemble model (MLBCSM), combined with an adaptive synthetic (ADASYN) sampling technique, for detecting NTLs in SGs. In addition, the proposed MLBCSM model is comprehensively validated using eight popular performance measures, namely, accuracy, precision, recall, receiver operating characteristic-area under the curve (ROC-AUC), precision recall-AUC (PR-AUC), F1 score, FPR, and false negative rate (FNR).
The following points highlight the vital contributions made in the underlying research article.
• An MLBCSM stacking ensemble model for detecting NTLs in SGs is proposed, similar to the model proposed in [12] for financial market forecasting. It comprises five boosting classifiers: four are considered as base learners, and one is selected as a meta-learner.
• We tackle the data imbalance issue through an ADASYN approach. The approach is employed to oversample the minority class samples to achieve balanced data and avoid biased NTLs' detection results.
• Extensive simulations are conducted on a substantial realistic EC dataset by considering eight performance evaluation measures for comprehensive validation of our proposed model. Simulation results show that our MLBCSM combined with ADASYN provides markedly better NTL detection performance than the baseline models.
The remaining sections of this research paper are organized as follows. Section II presents the related work, while the problem statement is presented in Section III. The proposed model is discussed in Section IV. The simulation findings are provided in Section V, while the concluding text is provided in Section VI.

II. RELATED WORK
Most of the current electricity theft detection (ETD) research is based on ML and DL classifiers because of the development of advanced metering infrastructure (AMI) and SGs. The authors in [1] suggest ensemble classifiers for ETD in SGs. They employ the EC data gathered from the smart meters (SMs) of the Commission for Energy Regulation. The techniques they use for ETD include adaptive boosting (AdaBoost), light gradient boosting (LGBoost), XGBoost, categorical boosting (CatBoost), etc. In addition, the EC data is normalized via the min-max normalization technique. However, all the ensemble classifiers are used individually, and the FPR and TPR or detection rate (DR) are reported for each ensemble model separately. The authors in [4] propose an adaptive time series recurrent neural network (TSRNN) to identify theft consumers in time series EC data. To tackle the data imbalance issue, the synthetic minority oversampling technique (SMOTE) is employed, whereas the grid search method is leveraged to optimize the hyperparameters of the TSRNN. The analysis is performed using 820 days of EC data from 01 January 2017 to 31 March 2019. Moreover, accuracy, false alarm rate, and true positive rate (TPR) are considered to validate the TSRNN. However, these metrics are limited and do not provide a comprehensive and fair evaluation of the model. The authors in [5] propose an extreme gradient boosting (XGBoost) classifier to detect electricity theft using EC data from the Irish smart energy trials dataset. The EC data is also preprocessed to deal with missing or not a number (NaN) values, outliers, and unscaled data. Furthermore, a data-oriented approach is proposed in [7], in which a gradient-boosting classifier is employed as a data-oriented model for ETD. One new synthetic theft attack is also added to the six previously designed theft attacks. The authors intended to improve FPR and accuracy values using this combined strategy. However, the FPR calculation is ignored, and the achieved accuracy of 88% leaves room for improvement.
VOLUME 10, 2022
To identify energy theft in SGs, a DL model, a modified convolutional neural network (CNN), is developed in [8]. Additionally, EC privacy is protected via a Paillier technique. The EC data of users acquired from the State Grid Corporation of China (SGCC) is used to analyze electricity theft. According to the simulation findings, the modified CNN achieves 92.67% accuracy on the ETD task. However, the CNN is used as a standalone classifier for ETD, and the data imbalance problem is ignored. Furthermore, in [9], a specialized theft detector, named deep neural network with low FPR (LFPR-DNN), is proposed to achieve a minimum FPR. The data imbalance issue is tackled with the help of the focal loss, an extension of the cross-entropy function. In addition, the particle swarm optimization (PSO) technique is leveraged to optimize the FPR value, and the hyperparameters of the LFPR-DNN model are tuned via the grid search method. Moreover, recall or TPR, FPR, AUC, and Bayesian detection rate (BDR) are used for the validation of the LFPR-DNN model. However, these performance measures are insufficient for a proper and extensive validation. In addition, a comparatively poor recall value is obtained, which needs to be improved. Moving further, a stacked sparse denoising autoencoder (SSDAE) is used to identify electricity theft in [10]. To enhance the robustness and feature extraction abilities of the SSDAE, noise and sparsity parameters are added to the autoencoder. In addition, the PSO technique is used to optimize both the sparsity and noise hyperparameters. Moreover, only the DR and FPR measures are selected to validate the SSDAE, which are not enough for a fair and comprehensive performance validation. A CatBoost-based theft detector is used in [13] to classify honest and dishonest consumers.
In addition, k nearest neighbors (KNN) imputation and the min-max scaler are used to fill in the missing data and normalize the unscaled data in the EC dataset, respectively. Furthermore, the SGCC dataset is used for theft analysis. The proposed model obtains 92% TPR, 93% accuracy, and 95% precision in ETD.
A gradient-boosting theft detection model is proposed in [14], based on three modern gradient-boosting models: LGBoost, XGBoost, and CatBoost. The Irish EC data released by two Irish organizations is used: (1) Electric Ireland, which is an Irish gas and electricity utility company, and (2) the Sustainable Energy Authority of Ireland (SEAI), which is an Irish governmental agency. Since the dataset contains only honest consumers' EC data, six revised synthetic theft attacks are proposed and employed to generate the theft class data synthetically. Afterwards, SMOTE is used to balance the data. However, the three boosting classifiers are employed individually. Furthermore, in [15], a pattern-based context-aware ETD (PCETD) strategy is introduced. KNN and dynamic time warping are used to check the consumers' anomalousness and to examine the relationship between two different EC patterns, respectively. In addition, seven novel theft attacks are introduced in this work to address the imbalanced data problem. A stacked autoencoder (SAE) and under-sampling and resampling-based random forest (UaRe-RF) ETD approach is put forward in [16]. The essential features are extracted via the SAE, while data balancing and ETD are performed via UaRe-RF. Furthermore, for identifying energy theft at the transformers installed on the distribution side, the authors in [17] put forward an approach that uses EC curve comparison. It relies on a static state estimator applied to the EC data obtained from SMs. In addition, self-organizing maps and a multilayer perceptron artificial neural network are leveraged to identify energy theft users. The Irish EC dataset released by Electric Ireland and SEAI is employed. The results show that the model's energy theft detection capability is inferior.

III. PROBLEM STATEMENT
Electricity theft is a hazardous activity for electric grids, causing problems such as economic damage, energy scarcity, and grid instability. The main goal of an electricity thief is to underpay for the consumed energy. Therefore, efficient detection of electricity theft in SGs is crucial, as it saves the utilities from major losses and helps avoid all the abovementioned issues. Conventionally, individual classifiers were deployed for detecting energy theft in SGs. However, low classification results were obtained [1]. In addition, the proportion of theft consumers is tiny in most cases, so the existing EC datasets are often imbalanced. As a result, the class with more samples is given more consideration in the training stage, which negatively affects and biases the prediction results [18]. Furthermore, in most cases, very few validation metrics are considered for the evaluation of the proposed models. Using so few validation metrics does not yield an accurate, appropriate, fair, and detailed validation of a model, and it leaves the model's efficiency in detecting electricity theft in SGs unproven.

IV. PROPOSED SYSTEM MODEL
In this work, an ETD approach is put forward to identify electricity consumption abnormalities. The proposed ETD approach consists of data preprocessing, data balancing, and classification. Fig. 1 presents a comprehensive view of the proposed system model. All of these components are discussed in the subsequent subsections.

A. DATA PREPROCESSING
The SGCC data [19] is used for performing ETD in SGs. The SGCC dataset's metadata is provided in Table 1. The dataset contains the EC data of 38757 honest (non-theft) consumers and 3615 dishonest (theft) consumers over two years and ten months, which is equivalent to 1034 days. The original dataset contains a large amount of NaN data, outliers, and unscaled data. A dataset with missing values, outliers, and unscaled data is not suitable for training a model, so the dataset must be preprocessed first.
EC data contains missing values due to various reasons, such as signal communication errors and hardware device malfunction [20]. To compute the missing data and impute it into the dataset, a simple imputer method (SIM) is used in this article. SIM is implemented using the SimpleImputer class with strategy = mean from the scikit-learn library in Python [21]. In addition, as previously mentioned, the dataset also contains some outlier values, which affect the model's predictive performance. Therefore, outliers are reset via the three sigma rule of thumb [22], implemented using Equation 1 [22].
In Equation 1, $x_{i,a}$ indicates the EC of the $i$-th electricity consumer at the $a$-th time slot (i.e., day). In our scenario, $i$ ranges over the 42372 consumers and $a$ over the 1034 days. $X$ is a dataframe created in Python that contains the $x_{i,a}$ EC values. $X_{\sigma}$ represents the standard deviation of the dataframe $X$, and $X_{avg}$ represents the average value of $X$. After dealing with the outliers and missing values, the differing scales of the data remain to be handled. Thus, the min-max scaler is employed for normalizing the data [23] via Equation 2 [23].
$$x'_{i,a} = \frac{x_{i,a} - X_{min}}{X_{max} - X_{min}} \tag{2}$$
where the minimum and maximum values of $X$ are denoted by $X_{min}$ and $X_{max}$, and $x'_{i,a}$ is the normalized reading.
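The three preprocessing steps can be illustrated with a minimal plain-Python sketch. It is not the authors' code: the paper uses scikit-learn's SimpleImputer, whereas the helper names below (`impute_mean`, `clip_outliers`, `min_max_scale`) are hypothetical, the statistics are computed over a single toy series, and the threshold multiplier `n_sigma` is an assumption that should be set to whatever Equation 1 prescribes.

```python
import math
from statistics import mean, pstdev

def impute_mean(values):
    """Replace NaN readings with the mean of the observed readings."""
    observed = [v for v in values if not math.isnan(v)]
    fill = mean(observed)
    return [fill if math.isnan(v) else v for v in values]

def clip_outliers(values, n_sigma=3.0):
    """Reset readings above avg + n_sigma * std to that threshold (assumed form)."""
    threshold = mean(values) + n_sigma * pstdev(values)
    return [min(v, threshold) for v in values]

def min_max_scale(values):
    """Map readings linearly onto [0, 1] using the series minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant series: avoid division by zero
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

# Toy daily EC readings with one missing value and one spike (made-up numbers).
readings = [2.0, float("nan"), 4.0, 3.0, 3.0, 60.0]
imputed = impute_mean(readings)
clipped = clip_outliers(imputed, n_sigma=2.0)  # 2.0 only so the toy spike is caught
scaled = min_max_scale(clipped)
```

On a series this short a single spike can never exceed three standard deviations, which is why the toy call lowers `n_sigma`; on the full 1034-day series the default would be the natural starting point.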

B. DATA BALANCING
In this subsection, the data balancing component is used to tackle the severe class imbalance in the SGCC dataset. Generally, ETD datasets are severely imbalanced: they contain many honest users' samples, while theft users' samples are limited. This work employs ADASYN [24] to deal with the data imbalance [25]. To counter this problem, the data must be balanced so that fraud and non-fraud samples occur with the same or a similar frequency.
Traditionally, data balancing is performed with oversampling and undersampling techniques. In undersampling techniques, majority class examples are deleted until their count equals that of the minority samples. However, this is data inefficient, since the deleted data may contain important information about the majority (non-fraud) class. To deal with this data inefficiency issue, oversampling is employed. In oversampling, the examples of the minority (theft) class are duplicated (copied) N times until their count becomes similar or equal to the majority class sample count. However, the model then overfits the minority class because of the multiple copies of the same samples. Therefore, to combat the issues created by both oversampling and undersampling, ADASYN [25] is introduced. It is a data balancing method that does not copy existing minority class samples; instead, it creates synthetic examples and generates more of them for harder-to-learn samples. The steps for creating synthetic data using ADASYN are given below [25].
1) Compute the ratio between minority and majority data samples using Equation 3.
$$d = \frac{S_{min}}{S_{maj}} \tag{3}$$
where the numbers of majority and minority class samples are denoted by $S_{maj}$ and $S_{min}$, respectively.
2) Compute the total number of synthetic minority class data samples to be generated using Equation 4.
$$TM = (S_{maj} - S_{min}) \times \beta \tag{4}$$
where $\beta$ represents the desired post-ADASYN ratio of minority:majority data samples. $\beta = 1$ denotes a perfectly balanced post-ADASYN dataset. $TM$ indicates the total number of minority class samples to be generated.
3) Find the K nearest neighbours (KNNs) of each minority sample and compute $r_i$ using Equation 5.
$$r_i = \frac{N_{maj}}{K} \tag{5}$$
where $N_{maj}$ is the number of majority samples among the KNNs of a specific minority sample, and $K$ is the number of nearest neighbours chosen for that sample. $r_i$ measures the dominance of the majority class among the KNNs of a specific minority sample. A high value of $r_i$ indicates a minority sample that is surrounded mostly by majority samples and is therefore hard to learn.
4) Normalize the $r_i$ values using Equation 6 so that they sum to 1.
$$\hat{r}_i = \frac{r_i}{\sum_{i} r_i} \tag{6}$$
5) Compute the number of synthetic samples to be generated for each neighbourhood using Equation 7.
$$TM_i = \hat{r}_i \times TM \tag{7}$$
If the value of $r_i$ is high, the neighbourhood of the specific minority sample is largely composed of majority class samples, so more artificial minority samples are generated for it. In this sense ADASYN is adaptive: more synthetic samples are created for difficult-to-learn minority samples.
6) Generate $TM_i$ samples for every minority neighbourhood. First, select the minority sample $x_i$. Afterwards, randomly select another minority sample in the neighbourhood of $x_i$. Then, create the new sample using Equation 8.
$$NX_i = x_i + (rx_i - x_i) \times \lambda \tag{8}$$
where $\lambda$ represents a random value between 0 and 1, $NX_i$ is the newly generated minority sample, and $rx_i$ and $x_i$ are minority samples in the same neighbourhood.
Using the above steps, the dataset is balanced, and the balanced data is then passed to the proposed MLBCSM for accurate detection of electricity theft with a low FPR value.
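The six ADASYN steps can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the paper: the function name `adasyn`, the toy 2-D points, the rounding of the per-sample counts, and the choice to draw the interpolation partner from the K nearest minority samples are all assumptions made for the example.

```python
import math
import random

def adasyn(minority, majority, beta=1.0, k=3, seed=0):
    """Generate synthetic minority samples following the six ADASYN steps."""
    rng = random.Random(seed)
    # Step 1: imbalance ratio (minority : majority).
    ratio = len(minority) / len(majority)
    # Step 2: total number of synthetic samples, TM.
    total = (len(majority) - len(minority)) * beta
    # Step 3: r_i = share of majority points among the K nearest neighbours.
    points = [(p, 0) for p in majority] + [(p, 1) for p in minority]
    r = []
    for i, x in enumerate(minority):
        others = [pl for j, pl in enumerate(points) if j != len(majority) + i]
        others.sort(key=lambda pl: math.dist(x, pl[0]))
        r.append(sum(1 for _, label in others[:k] if label == 0) / k)
    # Step 4: normalise the r_i values so that they sum to 1.
    r_sum = sum(r)
    if r_sum == 0:
        return [], ratio  # no minority sample has majority neighbours
    r_hat = [ri / r_sum for ri in r]
    # Step 5: number of synthetic samples per minority neighbourhood, TM_i.
    counts = [round(rh * total) for rh in r_hat]
    # Step 6: interpolate between x_i and a random nearby minority sample.
    synthetic = []
    for i, (x, n_new) in enumerate(zip(minority, counts)):
        mates = sorted((m for j, m in enumerate(minority) if j != i),
                       key=lambda m: math.dist(x, m))[:k]
        for _ in range(n_new):
            rx = rng.choice(mates)
            lam = rng.random()  # lambda in [0, 1)
            synthetic.append(tuple(a + (b - a) * lam for a, b in zip(x, rx)))
    return synthetic, ratio

# Toy data: 16 "honest" points on a grid and 4 "theft" points (made-up values).
majority = [(float(i), float(j)) for i in range(4) for j in range(4)]
minority = [(1.5, 1.5), (1.6, 1.4), (6.0, 6.0), (6.2, 6.1)]
new_points, ratio = adasyn(minority, majority)
```

With `beta = 1.0` the toy call requests 16 - 4 = 12 synthetic samples, so the two classes end up with 16 samples each. Because of the rounding in step 5, the generated count can differ from TM by a few samples on other data.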

C. CLASSIFICATION
After completing the data preparation and balancing steps, the proposed MLBCSM classifies consumers as theft or non-theft. We selected the stacking ensemble strategy because it outperforms the techniques employed in the literature. Moreover, stacking ensembles have recently won many data science competitions, notably the Kaggle and Netflix competitions, for classification problems [26]. Hence, stacking ensembles are widely regarded as some of the strongest classifiers, and we chose this strategy to maximize ETD performance. The stacking ensemble strategy combines multiple standalone classifiers at two levels (level-0 and level-1), where the level-0 classifiers are called base-learners and the level-1 classifier is called the meta-learner. Our proposed MLBCSM uses ML boosting classifiers at both levels: AdaBoost, XGBoost, HistBoost, and CatBoost are selected as level-0 learners, while LGBoost is chosen as the level-1 learner. The predictions of the level-0 learners, along with the original labels, are then passed to the level-1 classifier for training [27]. The pseudo-code of the proposed MLBCSM for theft and non-theft consumers' classification is given in Algorithm 1. More in-depth details about these base and meta-learners are provided in the subsequent subsections.
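The two-level data flow of a stacking ensemble can be made concrete with a small plain-Python sketch. Everything in it is a stand-in chosen for brevity: the one-feature threshold rules play the role of the four boosting base-learners, the vote table plays the role of the LGBoost meta-learner, and the use of out-of-fold level-0 predictions as level-1 training data is an assumption, since the text does not state how the level-0 predictions are produced.

```python
def fit_stump(X, y, feature):
    """Toy level-0 learner: best single-threshold rule on one feature."""
    best_err, best_rule = None, None
    for t in sorted({row[feature] for row in X}):
        for above in (1, 0):  # label predicted when the value is >= t
            preds = [above if row[feature] >= t else 1 - above for row in X]
            err = sum(p != yi for p, yi in zip(preds, y))
            if best_err is None or err < best_err:
                best_err, best_rule = err, (t, above)
    t, above = best_rule
    return lambda row: above if row[feature] >= t else 1 - above

def fit_vote_table(Z, y):
    """Toy level-1 learner: majority label for each pattern of base outputs."""
    votes = {}
    for z, yi in zip(Z, y):
        votes.setdefault(tuple(z), []).append(yi)
    default = round(sum(y) / len(y))
    table = {z: round(sum(v) / len(v)) for z, v in votes.items()}
    return lambda z: table.get(tuple(z), default)

def fit_stacking(X, y, n_folds=3):
    """Train base learners per fold, then the meta-learner on their predictions."""
    n, n_features = len(X), len(X[0])
    folds = [list(range(f, n, n_folds)) for f in range(n_folds)]
    Z = [[0] * n_features for _ in range(n)]  # out-of-fold level-0 predictions
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        for f in range(n_features):
            model = fit_stump(X_tr, y_tr, f)
            for i in held_out:
                Z[i][f] = model(X[i])
    meta = fit_vote_table(Z, y)  # level-1 is trained on (predictions, labels)
    bases = [fit_stump(X, y, f) for f in range(n_features)]  # refit on all data
    return lambda row: meta([b(row) for b in bases])

# Toy two-feature data on a 6 x 6 grid; label 1 when the features sum high.
X = [(i / 5, j / 5) for i in range(6) for j in range(6)]
y = [1 if i + j >= 5 else 0 for i in range(6) for j in range(6)]
predict = fit_stacking(X, y)
accuracy = sum(predict(row) == label for row, label in zip(X, y)) / len(X)
```

Training the meta-learner on held-out predictions rather than on predictions for samples the base-learners have already seen is a common way to keep the level-1 model from simply trusting overfitted level-0 outputs.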
A detailed description of each of the four base-learners is provided below.
• Adaptive boosting
AdaBoost [28] is a popular model in data science. It was first developed by Freund and Schapire in 1996 [29]. It is built on the boosting type of ensemble strategy, whose main idea is that multiple weak learners can be combined through a voting strategy to create a robust algorithm. A learner whose predictions are only slightly better than tossing a coin is regarded as a weak learner. Such a learner achieves 55% accuracy or any other value close to 50%. In other words, a classifier with a loss of less than but close to 50% is called a weak learner. In this scenario, the in-sample loss rate is the count of wrongly classified samples, i.e., those with $y_i \neq G(x_i)$, divided by the total number of data samples ($N$), as given in Equation 9 [28].
$$\overline{err} = \frac{1}{N} \sum_{i=1}^{N} I(y_i \neq G(x_i)) \tag{9}$$
In boosting, multiple weak learners are trained sequentially on successively modified versions of the data. In the first boosting cycle, a weak learner ($G_1(x)$) is trained and prediction results are generated, in which some of the examples are misclassified. In the second boosting cycle, a weight ($W_i$) is assigned to each example; the previously misclassified records are weighted more than the correctly classified records to force the second weak learner to learn and correctly classify them. The second weak learner then tends to classify the previously misclassified observations correctly. However, it may misclassify observations that were previously classified correctly. After iterating this process $M$ times, the weak learners are combined into a robust meta-learner ($G(x)$). The final meta-learner assigns a prediction label to each record using the weighted majority voting mechanism provided in Equation 10 [28].
$$G(x) = \mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m G_m(x)\right) \tag{10}$$
where $\alpha_m$ is the weight of the $m$-th weak learner in the final majority voting mechanism. In AdaBoost, multiple weak learners (i.e., stumps) are trained sequentially. These weak learners form a meta-learner that obtains prediction results through a weighted majority voting strategy. In every boosting round, more weight is assigned to the previously wrongly predicted samples. This process is summarized in Algorithm 2 [28].
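The reweighting and weighted-vote mechanism described above can be sketched in a few lines of plain Python. The sketch follows the common AdaBoost.M1 formulation with decision stumps and labels in {+1, -1}; the function names, the toy data, and the clamping of the error rate are assumptions made for the example, and the details of Algorithm 2 may differ.

```python
import math

def stump_predict(row, feature, threshold, sign):
    """Weak learner G_m(x): predicts `sign` at or above the threshold, else -sign."""
    return sign if row[feature] >= threshold else -sign

def fit_stump(X, y, w):
    """Return (weighted error, feature, threshold, sign) of the best stump."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for sign in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, f, t, sign) != yi)
                if best is None or err < best[0]:
                    best = (err, f, t, sign)
    return best

def adaboost(X, y, rounds):
    """Train `rounds` stumps sequentially; labels must be +1 or -1."""
    w = [1.0 / len(X)] * len(X)
    ensemble = []
    for _ in range(rounds):
        err, f, t, sign = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = math.log((1 - err) / err)      # voting weight of this stump
        ensemble.append((alpha, f, t, sign))
        # Up-weight the samples this stump misclassified, then renormalise.
        w = [wi * math.exp(alpha) if stump_predict(row, f, t, sign) != yi else wi
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    """Weighted majority vote: sign of the alpha-weighted sum of stump outputs."""
    score = sum(alpha * stump_predict(row, f, t, s) for alpha, f, t, s in ensemble)
    return 1 if score >= 0 else -1

def error_rate(ensemble, X, y):
    """In-sample loss: number of misclassified samples divided by N."""
    return sum(predict(ensemble, r) != yi for r, yi in zip(X, y)) / len(X)

X = [(0,), (1,), (2,), (3,), (4,), (5,)]
y = [1, 1, -1, -1, 1, 1]  # no single threshold separates these labels
one_stump = adaboost(X, y, rounds=1)
three_stumps = adaboost(X, y, rounds=3)
```

On this toy series a single stump misclassifies two of the six samples, while the weighted vote of three stumps classifies all six correctly, which is the effect the reweighting is designed to produce.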
• Extreme gradient boosting
XGBoost [30] was introduced by Tianqi Chen and Carlos Guestrin at the University of Washington. In gradient boosting, the weak learners are trained using a gradient descent optimizer and a differentiable loss function, which is why it is called a gradient boosting algorithm. XGBoost is a computationally faster and highly effective type of gradient boosting algorithm [31]. As XGBoost is fast to execute and obtains better predictive results, we chose it as one of the base-learners in our proposed MLBCSM. In XGBoost, a learner is trained using the gradient of the loss from the previous learner. Moreover, in XGBoost, the gradient boosting technique is modified so that it works with any differentiable loss function [32]. XGBoost integrates decision trees with a gradient boosting mechanism. At each training round of a tree (weak learner), the residual of the previous tree is used in the next tree to minimize the loss function [33]. XGBoost also avoids overfitting problems and minimizes computational complexity. The final classification result is acquired by combining all the weak learners, i.e., decision trees. The final output is predicted using Equation 11 [33].
In Equation 11, $g_j(X_i)$ represents the output generated by the $j$-th weak learner, and $X_i$ denotes the data given to a weak learner. The algorithmic description of XGBoost is given in Algorithm 3 [34]. Unlike LGBoost and CatBoost, XGBoost cannot deal with categorical data and only handles numerical data, like a random forest bagging classifier. To process categorical data using XGBoost, an encoding method, such as one-hot encoding or label encoding, must be applied first.
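The two encoding methods mentioned above can be illustrated with a short plain-Python sketch. The `tariff` feature is hypothetical and used only for illustration (the SGCC readings themselves are numerical), and the helper names are assumptions rather than library functions.

```python
def label_encode(values):
    """Map each distinct category to an integer code (alphabetical order)."""
    codes = {c: i for i, c in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes

def one_hot_encode(values):
    """Map each value to a 0/1 indicator vector, one column per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values], categories

# Hypothetical categorical feature describing the consumer type.
tariff = ["residential", "commercial", "residential", "industrial"]
labels, codes = label_encode(tariff)
one_hot, categories = one_hot_encode(tariff)
```

Label encoding keeps a single column but imposes an arbitrary order on the categories, whereas one-hot encoding avoids that order at the cost of one column per category.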
In Algorithm 3, $f_m(x_i)$ is the best weak learner in the $m$-th iteration. Besides, $g_i$ is the first derivative (gradient) of the loss function and $h_i$ is its second derivative; they are used to define the objective of the XGBoost algorithm, which requires a loss function that is twice differentiable. The closing steps of Algorithm 3 are as follows.
6: Add the best tree $f_m(x_t)$ into the current model to update $\hat{y}$
7: end for
8: Until all M weak learners are executed
9: Create a strong classification model based on weak trees.
10: Result in a prediction value, i.e., 0 or 1

• Histogram boosting
The gradient-boosting decision tree algorithm requires more training time when dealing with a big dataset, and sometimes the prediction accuracy is compromised.
HistBoost is an effective model for dealing with a huge dataset [12]. It minimizes the training time without degrading the accuracy. Consequently, HistBoost can be described as an algorithm that rapidly trains weak learners in a gradient-boosting framework. Its splitting procedure is different from that of other gradient boosting methods. Instead of determining the splitting points on the raw feature values, HistBoost buckets the continuous feature values into discrete bins, from which feature histograms are created. As HistBoost is efficient in both training time and memory consumption, we select it as one of the base (level-0) classifiers of our proposed MLBCSM. More details about how the histogram-based algorithm works can be found in Algorithm 4 [35]. In this way, the training of the gradient-boosting decision tree ensemble is expedited in the proposed work. Big training datasets containing tens of thousands of rows or more make the creation of decision trees extremely slow, as the splitting points on each value of each dimension must be taken into account while creating the trees [36]. The training of the weak learners appended to the ensemble, usually decision trees, can be expedited by binning (discretizing) the continuous input features into only a few hundred unique values. Gradient boosting ensemble models that implement this binning method and tailor the training to the binned input features are known as histogram-based gradient boosting ensemble models. Furthermore, suitable data structures such as histograms can be employed to represent the discretized data. The tree construction algorithm can then be adjusted to use these histograms effectively and efficiently when creating every decision tree.
Hence, from the above discussion, it is concluded that a gradient-boosting technique supporting histogram data structures is referred to as the HistBoost technique. The closing steps of Algorithm 4 are as follows.
12: Find the optimal split on H
13: end for
14: end for
15: Update the Node set and Row set based on the optimal splitting points
16: end for
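The binning idea can be sketched for a single feature in plain Python. This is an illustration only: equal-width bins, a squared-error style gain, and the function names are assumptions made for the example, and Algorithm 4 may use different binning and gain definitions.

```python
def build_histogram(values, gradients, n_bins, lo, hi):
    """Accumulate per-bin gradient sums and sample counts for one feature."""
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # constant feature: everything falls into bin 0
    grad_sum, count = [0.0] * n_bins, [0] * n_bins
    for v, g in zip(values, gradients):
        b = min(int((v - lo) / width), n_bins - 1)
        grad_sum[b] += g
        count[b] += 1
    return grad_sum, count

def best_split(grad_sum, count):
    """Scan the bin boundaries; return (gain, last bin on the left side)."""
    G, N = sum(grad_sum), sum(count)
    best = (0.0, None)
    g_left, n_left = 0.0, 0
    for b in range(len(count) - 1):
        g_left += grad_sum[b]
        n_left += count[b]
        n_right = N - n_left
        if n_left == 0 or n_right == 0:
            continue
        # Assumed gain: reduction in squared error when splitting after bin b.
        gain = g_left ** 2 / n_left + (G - g_left) ** 2 / n_right - G ** 2 / N
        if gain > best[0]:
            best = (gain, b)
    return best

# Toy feature values with per-sample gradients (made-up numbers).
values = [0.05, 0.1, 0.15, 0.2, 0.8, 0.85, 0.9, 0.95]
gradients = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
grad_sum, count = build_histogram(values, gradients, n_bins=4, lo=0.0, hi=1.0)
gain, left_bin = best_split(grad_sum, count)
```

The split search touches only the bins (three candidate boundaries here) instead of every distinct feature value, which is where the training-time saving comes from.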

• Categorical boosting
CatBoost was introduced by Yandex (a technology company in Russia) in 2018 [37]. It improves on other gradient-boosting models through its ability to handle categorical features directly (without applying any encoding scheme) and its faster training [12]. Besides categorical features, it can also handle textual and numerical features; however, its handling of categorical data is its main strength [38]. Because CatBoost directly supports categorical features, it is named CatBoost [39]. Generally, gradient boosting-based models perform well on both huge and small datasets. The algorithmic details of CatBoost are provided in Algorithm 5 [40].
2: for i = 1 to n do
3: for j = 1 to i − 1 do
4: $g_j = \frac{d}{da} loss(y_j, a) \big|_{a = M_i(x_j)}$
5: end for
6: M = LearnOneTree(($X_j$, $g_j$) for j = 1 to i − 1)
7: end for
9: end for
10: return $M_1, M_2, \ldots, M_n$; $M_1(X_1), M_2(X_2), \ldots, M_n(X_n)$
... estimation for this sample. Using $M_k$, we compute (estimate) the gradient on $X_k$ and leverage this computed gradient to get the output tree. $loss(y_j, a)$ is the loss function to be optimized, in which $y_j$ is the label (target) value and $a$ represents the formula (predicted label) value.

2) LEVEL-1 LEARNER
In order to make the final classification decision based on the results generated by the level-0 learners, the level-1 learner is selected.
LGBoost is chosen as level-1 classifier for the proposed MLBCSM.
• Light gradient boosting
LGBoost is one of the most widely used classifiers. It was introduced by Guolin Ke in 2017 [41]. LGBoost enhances the basic gradient boosting technique with two additions, automatic feature bundling and a focus on samples with comparatively large gradients, which lead to faster training as well as better prediction results [42]. Exclusive feature bundling (EFB) is an automatic feature selection technique used for bundling sparse (i.e., mostly zero and rarely nonzero), mutually exclusive features. Gradient-based one-sided sampling (GOSS) focuses on the training samples with comparatively high gradients and excludes a notable portion of the samples with low gradients when estimating the information gain.
As the data samples with large gradient values play a significant role in the computation of the information gain, GOSS can yield an accurate estimate of the information gain from a smaller dataset. Since GOSS focuses on examples with larger gradients, it results in faster learning and lowers the computational cost of the algorithm. Together, the two modifications speed up the training of the model by up to 20 times. Due to these qualities, we selected LGBoost as the meta-learner for our proposed MLBCSM. In summary, LGBoost is a gradient-boosting algorithm that combines EFB and GOSS. The algorithm of LGBoost is shown in Algorithm 6 [43].

Algorithm 6 LGBoost Algorithm
Input: Training set: $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$; Loss function: $L(y, \theta(x))$; Maximum iterations: $M$; High gradient data sampling ratio: $a$; Low gradient data sampling ratio: $b$
1: Combine mutually exclusive features using the EFB method
2: Put $\theta_0(x) = \arg\min_c \sum_{i=1}^{N} L(y_i, c)$
3: for m = 1 to M do
4: Compute absolute gradient values
5: Resample the data using GOSS
6: Compute information gains
In Algorithm 6, step 5 shows that GOSS first sorts the data samples by the absolute value of their gradients and chooses the top $a \times len(D)$, i.e., the top $a \times 100\%$, of the samples. $D$ represents the training data. $\frac{1-a}{b}$ is the constant by which the summation of the gradient values of the sampled low-gradient data is multiplied when computing the information gain in step 6 of the algorithm.
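The GOSS resampling step can be sketched in plain Python as follows. The function name `goss_sample`, the toy gradients, and the default ratios are assumptions made for the example; the sketch only shows the sampling and re-weighting, not the full information-gain computation.

```python
import random

def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """Return (indices, weights) chosen by gradient-based one-sided sampling."""
    rng = random.Random(seed)
    n = len(gradients)
    # Sort sample indices by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_n, rand_n = int(round(a * n)), int(round(b * n))
    top = order[:top_n]                       # keep every large-gradient sample
    rest = rng.sample(order[top_n:], rand_n)  # subsample the low-gradient ones
    amplify = (1 - a) / b                     # compensates for the dropped samples
    weights = [1.0] * len(top) + [amplify] * len(rest)
    return top + rest, weights

# Toy gradients whose absolute value grows with the index (made-up numbers).
grads = [(-1) ** i * i / 20 for i in range(20)]
idx, weights = goss_sample(grads, a=0.2, b=0.1)
```

With 20 samples, `a = 0.2`, and `b = 0.1`, the sketch keeps the 4 samples with the largest gradients plus 2 randomly drawn low-gradient samples, and the latter are weighted by (1 - 0.2) / 0.1 = 8 when gradient sums are accumulated.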

V. SIMULATION RESULTS AND DISCUSSION
This section discusses the simulation settings of the proposed model, the performance evaluation measures, and the simulation results of the proposed and baseline (individual) classifiers with respect to eight performance measures. More details are given in the following subsections.

A. SIMULATIONS' SETTINGS
The proposed model for ETD employs the EC readings provided by the SGCC [19] for analyzing electricity theft. The simulations are performed on a DELL system with an Intel Core i5-2450M processor, a 500 GB hard drive, and 8 GB of RAM in two slots. The proposed MLBCSM is implemented in Python using the scikit-learn, lightgbm, xgboost, and catboost ML libraries. Google Colab is employed to execute the Python code on Google's cloud servers. By default, the SGCC dataset contains 42372 data instances and 1034 columns (features). Of the 42372 instances, only 3615 belong to the theft (abnormal) class, and the remaining 38757 belong to the non-theft (normal) class. The class distribution is therefore 8.53% theft and 91.47% non-theft, which shows that the dataset is severely imbalanced. To balance the dataset, ADASYN is used. It oversamples the minority (theft) class, raising the total number of instances from 42372 to 77050. As a result, the class distribution becomes almost equal, and the dataset is balanced.
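The class proportions quoted above can be verified with a few lines of arithmetic. The post-balancing theft count is derived under the assumption that ADASYN only adds synthetic minority samples and leaves the non-theft class untouched; the text does not report this count explicitly.

```python
total_before = 42372   # instances in the original SGCC dataset
theft_before = 3615    # theft (minority) instances
normal = total_before - theft_before

theft_share = 100 * theft_before / total_before
normal_share = 100 * normal / total_before

total_after = 77050    # instances after ADASYN oversampling
# Assumption: only the theft class grows, so the non-theft count is unchanged.
theft_after = total_after - normal
theft_share_after = 100 * theft_after / total_after

print(f"before: {theft_share:.2f}% theft, {normal_share:.2f}% non-theft")
print(f"after:  {theft_after} theft instances, {theft_share_after:.2f}% of the data")
```

Under this assumption, the theft class grows from 3615 to 38293 instances, which corresponds to roughly half of the balanced dataset.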

B. PERFORMANCE EVALUATION MEASURES
This subsection discusses the selected performance evaluation measures. In supervised ML, data with proper labels and features are passed to the algorithm for training purposes. Afterwards, the trained classifier is tested to evaluate its ability to predict and generalize to unlabeled data. Accuracy, F1 score, FPR, FNR, ROC-AUC, precision, recall, and PR-AUC [44], [45] are considered the most appropriate and reliable metrics for a fair and comprehensive evaluation. However, in [7], [9], and [10], only a few performance evaluation metrics are employed, some of which are inappropriate, and they are not sufficient for a fair and comprehensive evaluation of those models. Therefore, to conduct a fair and extensive evaluation of our proposed MLBCSM, accuracy, ROC-AUC, F1 score, FPR, FNR, precision, recall, and PR-AUC are considered. The calculation of all the selected metrics is based on the confusion matrix [26], which consists of four values: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

1) ACCURACY
Accuracy is one of the most frequently used performance measures. It is the proportion of all classified samples that are correctly classified [46]. It is suitable when samples from all classes occur with equal frequency. Accuracy is calculated via Equation 12 [26].
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{12}$$
where TP, TN, FP, and FN are the confusion matrix counts.
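The following sketch illustrates why accuracy is only suitable for balanced classes. It evaluates a hypothetical classifier that labels every consumer as non-theft, using the original SGCC class counts (3615 theft and 38757 non-theft instances); the function name is an assumption.

```python
def accuracy(tp, tn, fp, fn):
    """Proportion of all samples that are classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical "always non-theft" classifier on the imbalanced SGCC data:
# every normal consumer is a true negative, every thief is a false negative.
always_normal = accuracy(tp=0, tn=38757, fp=0, fn=3615)
print(f"{always_normal:.4f}")
```

Such a classifier detects no theft at all, yet its accuracy is above 91% because of the class imbalance, which motivates both the data balancing step and the additional metrics below.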

2) ROC-AUC
The ROC curve, where the y-axis displays the TPR and the x-axis displays the FPR, illustrates how well a binary classifier performs on the positive class. Recall and sensitivity are other names for TPR [45]. The TPR and FPR are calculated as follows [26], [45].
$$\text{TPR} = \frac{TP}{TP + FN} \tag{13}$$
$$\text{FPR} = \frac{FP}{FP + TN} \tag{14}$$
A model has no discriminative ability between the theft and non-theft classes if its ROC curve is the diagonal line from the coordinate (0, 0), where the TPR and FPR are both 0 (all samples are classified as negative, i.e., honest), to the coordinate (1, 1), where the TPR and FPR are both 1 (all samples are classified as positive, i.e., theft). Thus, ROC-AUC shows a classifier's power to discriminate between the theft and non-theft classes.
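The construction of the ROC curve and its area can be sketched as follows. This is a minimal illustration on hypothetical labels and scores; it assumes that all scores are distinct and uses the trapezoidal rule for the area, and the function names are not taken from the paper.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Visit the samples from the highest score to the lowest.
    for _, label in sorted(zip(scores, y_true), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels (1 = theft) and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
roc_auc = area_under_curve(roc_points(y_true, scores))
no_skill_auc = area_under_curve([(0.0, 0.0), (1.0, 1.0)])
print(roc_auc, no_skill_auc)
```

In this toy case, eight of the nine theft/non-theft pairs are ranked correctly, so the area is 8/9, whereas the diagonal of a no-skill model gives an area of 0.5.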

3) PR-AUC
Precision is the proportion of the positive (theft) predictions made by a classifier that are correct. It is calculated by the following formula [26], [45].
$$\text{Precision} = \frac{TP}{TP + FP} \tag{15}$$
Precision ranges between 0 and 1, where 1 indicates perfect precision and 0 indicates no precision. Furthermore, out of all actual positive (theft) samples, recall measures the proportion that is correctly predicted as positive. The computation formula for recall is given in Equation 13. Its output also ranges between 0 and 1, where 1 means perfect recall and 0 means no recall. Both recall and precision focus only on the theft samples [45]. In the precision-recall (PR) plot, recall is shown on the abscissa and precision on the ordinate. A no-skill classifier, whose precision equals the proportion of positive (minority) samples in the data, is represented by a horizontal line. In contrast, the curve of a perfect (skillful) classifier bends towards the (1, 1) coordinate. The area under the PR curve is denoted by PR-AUC. For balanced data, the no-skill PR-AUC is 0.5. The PR curve is the most appropriate measure for binary classification on imbalanced data since it pays attention to the positive class [45].
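A short sketch makes the no-skill baseline of the PR plot concrete. The function names and the confusion matrix counts in the example are hypothetical; the class counts are those of the SGCC dataset given in Section V-A.

```python
def precision(tp, fp):
    """Share of positive (theft) predictions that are correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positive (theft) samples that are detected."""
    return tp / (tp + fn)


def no_skill_precision(n_pos, n_neg):
    """Height of the horizontal no-skill line in the PR plot."""
    return n_pos / (n_pos + n_neg)


# Hypothetical confusion matrix counts for illustration.
p = precision(tp=90, fp=10)
r = recall(tp=90, fn=30)

baseline_imbalanced = no_skill_precision(3615, 38757)  # original SGCC classes
baseline_balanced = no_skill_precision(100, 100)       # equal class sizes
print(p, r, baseline_imbalanced, baseline_balanced)
```

On the original imbalanced data the no-skill line lies at roughly 0.085, while for equal class sizes it lies at 0.5, so a PR-AUC value should always be read relative to the class proportion of the evaluated data.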

4) F1 SCORE
For the evaluation of binary classification models, the F1 score is a widely used metric [47]. It is computed from the precision and recall scores, and its value ranges between 0 and 1. The metric is computed using Equation 16 [26].
$$F1 = \frac{2 \times P \times R}{P + R} \tag{16}$$
where R denotes recall and P denotes precision.
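As a worked example, the F1 score can be recomputed from the precision (91.458%) and recall (93.222%) that this paper reports for the proposed model in its conclusion. The function name is an assumption.

```python
def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)


# Precision and recall of the proposed model as reported in the conclusion.
f1 = f1_score(p=0.91458, r=0.93222)
print(f"{f1:.5f}")
```

The result agrees with the reported F1 score of 92.332% up to rounding.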

5) FNR
FNR is another important performance metric that is rarely employed in evaluating the models designed for ETD in SGs, although it is very significant for evaluating ETD models. A high FNR is a considerably bigger problem than a high FPR because it is risky to wrongly classify a theft consumer as an honest consumer [48]. A high FNR leads to high electricity loss, financial loss, energy supply quality loss, and grid safety loss for the electric utility. The mathematical formula to calculate FNR is provided in Equation 17 [49].
$$\text{FNR} = \frac{FN}{FN + TP} \tag{17}$$
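Since FNR and recall share the same denominator, FNR equals one minus recall. The sketch below checks this identity against the recall of 93.222% reported for the proposed model; the function name and the counts in the second example are hypothetical.

```python
def fnr(fn, tp):
    """Share of actual theft samples that are wrongly predicted as honest."""
    return fn / (fn + tp)


# FNR is the complement of recall: FN / (FN + TP) = 1 - TP / (TP + FN).
reported_recall = 0.93222
fnr_from_recall = 1 - reported_recall
print(f"{fnr_from_recall:.5f}")

# Hypothetical counts: 10 missed thieves out of 100 actual thieves.
example_fnr = fnr(fn=10, tp=90)
```

The complement of the reported recall matches the reported FNR of 0.06778, so the two metrics always rank a set of classifiers in the same way.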

C. PROPOSED MLBCSM's PERFORMANCE RESULTS
Table 2 and Fig. 2 show the performance comparison of the proposed MLBCSM and the baseline (individual) models. All the individual classifiers are implemented with their default parameters. These standalone classifiers are then combined using a stacking ensemble mechanism to develop our proposed MLBCSM for detecting electricity theft in SGs. The dataset is split using the train_test_split function of the scikit-learn library in Python, with 80% of the data used for training and 20% for testing. Table 2 and Fig. 2 show that our proposed MLBCSM outperforms the standalone models, i.e., CatBoost, HistBoost, XGBoost, AdaBoost, and LGBoost, on most of the performance measures. However, in terms of recall and FNR, CatBoost generates slightly better results than our proposed model and all the other models used in the paper. Hence, our proposed model is slightly behind CatBoost in terms of FNR and recall, whereas it achieves better performance in accuracy, F1 score, ROC-AUC, precision, PR-AUC, and FPR. Besides its default overfitting detection mechanism, two key characteristics enable CatBoost to provide better results than our proposed model and the other baselines with respect to FNR and recall. The first one is building balanced trees, which helps CatBoost control overfitting and leads to improved performance. The second one is ordered boosting (a permutation-based approach that fits a model on one subset of the data while computing the residuals on another subset), which protects CatBoost from overfitting and target leakage problems. These characteristics result in improved theft detection in terms of recall and FNR.
Furthermore, the reason for the improved results of our proposed model is that MLBCSM exploits and benefits from multiple well-performing individual classifiers (i.e., CatBoost, AdaBoost, HistBoost, LGBoost, and XGBoost). After our proposed model and CatBoost, LGBoost provides promising results, as seen in Table 2 and Fig. 2. This classifier generates good results because it has some unique properties compared with the other boosting models, i.e., the abilities to concentrate on data samples with higher gradients and to perform feature selection, which are achieved using the GOSS and EFB methods, respectively. These properties enable LGBoost to produce better predictive results with faster training. On the other hand, AdaBoost is the worst-performing classifier. It is sensitive to distorted (noisy) data and performs well only on a quality dataset (i.e., a dataset free of outliers and noisy data). With noisy data, AdaBoost is prone to overfitting and provides poor classification results. The SGCC dataset may contain some noisy data, due to which AdaBoost overfits and generates the worst results among the models implemented in this article.
Fig. 3 shows the ROC curves of the proposed MLBCSM and the baselines. The proposed scheme achieves a ROC-AUC value of 0.92396, which is better than all the baselines, i.e., HistBoost, AdaBoost, CatBoost, LGBoost, and XGBoost. It means that our proposed model effectively separates the normal and abnormal classes, as it can exploit multiple individual classifiers and perform better than single learners. Furthermore, the PR curves are presented in Fig. 4. In the case of ETD, both precision and recall are crucial for utility companies, and a higher PR-AUC score indicates a more effective model. The proposed MLBCSM yields a PR-AUC value of 0.94129, which is the maximum among all the models under consideration. This indicates that our proposed model is advantageous to electric utilities in pinpointing energy fraudsters and reducing energy losses.
FNR is also regarded as an important metric. It captures the cases in which abnormal energy consumers are predicted as normal by the classifier, which is very dangerous and negatively affects the power utilities in terms of financial loss, energy loss, energy supply quality, and power system safety. Thus, FNR needs to be minimized. Therefore, we considered and calculated this metric, as shown in Fig. 5. The proposed model yields an FNR value of 0.06778, which is among the lowest values; only CatBoost achieves a slightly lower FNR, as discussed above. On the other hand, AdaBoost has an FNR value of 0.25753, which is the highest among all the employed classifiers.
FPR is an important performance measure. It captures the cases in which non-theft energy consumers are classified as theft, which increases the misclassification rate of the classifier. There exists a direct relation between the FPR score and the on-site inspection cost, and one of the key objectives of ETD is to reduce this cost. Therefore, FPR is considered a significant measure for ETD in SGs. We computed FPR in this work, and it is shown in Fig. 6. Our proposed MLBCSM obtains the minimum FPR value of 0.08405, whereas AdaBoost has the maximum FPR of 0.29932. It is worth mentioning that in stacking ensemble models, the performance of the meta-learner depends upon the level-0 (base) learners. Therefore, we select powerful boosting classifiers for level-0, which provide accurate predictions to the meta-learner and thus have a positive impact on the classification results of the overall stacking ensemble model. This is why our proposed stacking model obtains better results than the baselines. Moreover, AdaBoost's worst performance in terms of FPR is attributed to its need for a good-quality dataset to perform well; otherwise, it faces an overfitting problem due to noisy data.
Finally, extensive simulations demonstrate the superiority of our proposed MLBCSM, which achieves the highest accuracy, ROC-AUC, F1 score, PR-AUC, and precision and the lowest FPR as compared to the baselines.

VI. CONCLUSION
In this work, an MLBCSM is introduced as a binary classifier for classifying electricity consumers. The publicly available SGCC data is used in this study. ADASYN is leveraged as a data balancing technique that oversamples the minority samples that are difficult to learn. For classification, the data balanced through ADASYN is forwarded to MLBCSM. The proposed model is extensively validated using eight performance evaluation measures and compared with five standalone classifiers. The proposed model achieves 92.395% accuracy, 92.396% ROC-AUC, 91.458% precision, 92.332% F1 score, 94.129% PR-AUC, 8.405% FPR, 6.778% FNR, and 93.222% recall. We conclude from the simulations that our proposed model obtains enhanced performance compared to its baselines in terms of ETD. Thus, our proposed MLBCSM, combined with ADASYN, is well-suited for correctly classifying the actual positive (theft) cases.