A PLSTM, AlexNet and ESNN based Ensemble Learning Model for Detecting Electricity Theft in Smart Grids

The problem of electricity theft is increasing rapidly around the globe, which is harmful to both power sectors and consumers. Recent developments in the advanced metering infrastructure bring opportunities for experts to identify electricity thieves in the smart grid community. Many advancements have been made in the area of the smart grid for Electricity Theft Detection (ETD), where the data collected from smart meters is utilized. However, the problems of imbalanced data distribution and inaccurate classification are not efficiently addressed. Therefore, to overcome these problems, machine learning and deep learning models are proposed for ETD. Initially, pre-processing methods are used to refine the smart meters' data. Then, the class imbalance problem is solved through Synthetic Minority Oversampling Tomek Links (ST-Links), which resolves the classifier's bias caused by imbalanced data and achieves the benefits of both data oversampling and undersampling. Afterwards, an AlexNet and peephole long short-term memory network based feature extractor with an attention layer is developed to extract, from electricity consumption profiles, the features that are most suitable for classifying honest and theft consumers. After the extraction of suitable features, the classification of consumers is performed by an echo state neural network. Moreover, an evolutionary grey wolf optimization technique is utilized to tune the hyper-parameters of the proposed model. A paired t-test is also applied on the final classification results for a reliable assessment of the proposed model. The simulations are conducted on a real smart meter dataset from China to check the performance of the proposed model. In addition, different benchmark models are implemented to perform a comparative analysis.
Different meaningful performance metrics are considered for a fair evaluation of the proposed model: Matthews Correlation Coefficient (MCC), F1-score, Area Under Curve (AUC), precision and recall. The simulation results show that the proposed model obtains accuracy, recall, F1-score, AUC, PR-AUC, precision and MCC scores of 96.3%, 92.1%, 92.0%, 96.4%, 97.3%, 90.0% and 84.0%, respectively. It is worth mentioning that the application of the proposed solution is quite general. Therefore, it can be used by power companies to reduce the power losses in the energy sector.


I. INTRODUCTION
Electricity Theft (ET) is the stealing of electric power from electricity distribution systems. It is performed by meter tampering, bypassing meters, billing irregularities, etc., [1]. Recently, the power sectors have raised concerns about ET because they have to bear huge financial losses and energy inefficiencies. Worldwide, electric utilities lose $25 billion per year due to ET [2]. The negative financial impact of this act is faced by both developed and developing nations. Table 1 shows the statistics of power losses incurred due to ET in different countries. In the USA, the electricity losses exceed $6 billion per year due to ET [3], whereas the electric utilities of developing countries like India incur a loss of $4.5 billion every year, which is almost 30% of the total energy produced in India [4]. Likewise, in Pakistan, the commercial losses exceed $0.89 billion per annum due to ET [5]. In the literature, many research works have been conducted to resolve the aforementioned challenges. For instance, in [6], Electricity Theft Detection (ETD) is performed using a gradient boosting algorithm. Moreover, a specific theft window is designed to detect the suspicious activity of electricity consumers in peak hours. However, an efficient and accurate ETD mechanism is still required. To overcome the ET problem, researchers have implemented various solutions. The most popular ETD solutions are categorized into state based, game theory based and Machine Learning (ML) based solutions. State based solutions involve hardware devices, i.e., smart meters, sensors, distribution transformers, etc., to detect ET. In [7], a physically inspired state based solution is presented to detect energy theft in smart grids. A special type of transformer is integrated with the smart meters to investigate the EC patterns of electricity consumers.
Moreover, a statistical linear regression algorithm is developed, which captures the relationship between electricity consumption values and voltage magnitudes. The simulation results exhibit that the proposed solution outperforms the baseline models. In game theory based solutions, a game is played between the electricity thieves and the power company. In [8], a game theory based solution is introduced for smart homes to optimize the energy cost in peak hours. Furthermore, the proposed solution establishes coordination among smart appliances and mitigates the energy losses to a minimal level. On the other hand, the ML based solutions use the Electricity Consumption (EC) data acquired from smart meters to detect dishonest electricity consumers. In [9], a Deep Convolutional Neural Network (DCNN) is developed to detect power quality disturbances in power grids. The DCNN model intelligently extracts the potential features from the high dimensional data through convolution operations and performs classification of power quality disturbances using a softmax layer. The performances of the above mentioned solutions are acceptable. However, the existing solutions have their drawbacks, which are itemized as follows.
1) The state based solutions are expensive because they require additional hardware cost to perform ETD [7].
2) The game theory based solutions are complex. They need to define strategies and sets of rules for all the players in the game, which makes the solutions very difficult to implement and operate [8].
3) In conventional ML models, the overfitting issue occurs because they are trained on high dimensional and complex data [9].
4) The imbalanced nature of the dataset leads to model bias, which raises the misclassification rate. The authors of [10] conduct a survey to highlight the challenges of data imbalance.
5) The performance of the classifier is primarily dependent on the input data. In most cases, the available data has missing values and outliers, and is unscaled. As a result, the accuracy of the classification models is reduced [11].
A new solution is proposed in this study, which has four main stages: classification, feature extraction, data balancing and data pre-processing. The main contributions of this study are listed as follows.

1) A Synthetic Minority Oversampling Tomek Links (ST-Links) technique is used to resolve the imbalanced data problem. The ST-Links technique achieves the benefits of both data oversampling via the Synthetic Minority Oversampling Technique (SMOTE) and undersampling through the Tomek links method. Moreover, it efficiently synthesizes minority class instances that mimic the behavior of real electricity thieves, which significantly improves the performance of the classifier.
2) Essential statistical features, such as mean, median, mode, min and max, are calculated to enhance the ETD performance.
3) An AlexNet and Peephole LSTM (APLSTM) based feature extractor along with an attention layer is presented to extract the potential features from the EC data and avoid overfitting.
4) An Echo State Neural Network (ESNN) is utilized for the classification of honest and dishonest electricity consumers.
5) An evolutionary Grey Wolf Optimization (GWO) technique is utilized to optimize the hyper-parameters of the proposed model.
6) A paired t-test is applied to the final classification results to provide statistical support.
7) Thorough simulations are conducted and the performance of the proposed model is enhanced in terms of ETD. The proposed model is evaluated using suitable performance metrics, such as Precision Recall Area Under the Curve (PR-AUC), Matthews Correlation Coefficient (MCC), AUC, recall and precision.
The remaining part of this paper is organized as follows. The related work is presented in Section II. The proposed system model is discussed in Section III. The simulation results are provided in Section IV. Finally, the conclusion and future work are given in Section V.
II. RELATED WORK
In this section, the details of the work performed in the ETD domain are provided. The work done in the existing literature is classified into non-hardware and hardware based solutions. The state based solutions are also known as hardware based solutions [12]. The non-hardware solutions are based on ML and game theory [13]. The ML techniques are further categorized into unsupervised learning (clustering), semi-supervised learning and supervised learning (classification) techniques. In the proposed solution, the supervised techniques are adopted. Moreover, the details of the recent advancements made in the ETD domain are studied.
The smart grids are helpful for efficient energy management and provisioning of electricity at lower prices [14], [15]. Moreover, the smart meters' data helps in detecting the anomalous consumption behaviour of electricity thieves. In [14], a Deep Long Short Term Memory (DLSTM) network is developed to forecast the electricity price and load in smart grids. The proposed DLSTM model mitigates the energy losses to a minimal level. The authors of [15] propose a novel framework, which combines a Finite Mixture Model (FMM) and a Genetic Algorithm (GA) to perform ETD. In the proposed work, the FMM model is exploited for the customers' segmentation and GA is used for the selection of potential features. In [16], electricity thieves are detected using an LSTM model. The LSTM model maintains the temporal correlation between the current and the previous EC time sequences to distinguish the honest consumers' data from the dishonest consumers' data. However, the data imbalance issue is not addressed, which increases the misclassification rate. In [17], the authors address the problem of inaccurate detection of Non Technical Loss (NTL) through Support Vector Machine (SVM) and Random Under Sampling (RUS) based Extreme Gradient Boosting (XGBoost). The aim of the study is to achieve high True Positive Rates (TPRs) and low False Positive Rates (FPRs). This methodology is tested on the Endesa dataset and validated through AUC and PR-AUC performance measures. In [18], the authors address the problem of high FPR in NTL detection by proposing a new model, which is based on Bidirectional Gated Recurrent Unit (BiGRU) and Synthetic Over Sampling Tomek Links (SOSTLink). For feature extraction, the BiGRU technique is utilized while the SOSTLink technique is used to handle the imbalanced data. This model achieves the lowest FPR when it is compared with SVM, Convolutional Neural Network (CNN) and LSTM models.
In [19], a Maximum Overlap Discrete Wavelet Packet Transform (MODWPT) with Random Undersampling Boosting (RUSBoost) based scheme is proposed. The MODWPT scheme is used for selecting the relevant features from the data. On the other hand, data classification is performed using RUSBoost. However, this model loses important information by randomly reducing the data size, which causes an under-fitting issue. In [20], LSTM and Bat optimization based RUSBoost techniques are used to solve the issues of overfitting and imbalanced data in ETD. Moreover, several performance metrics are used to show the effectiveness of the model.
In recent times, most researchers modify the internal structure of the ML algorithms to enhance their performance. For instance, the structure of LSTM is changed by using the softplus activation function instead of a tangent hyperbolic activation function [21]. Along with the improved LSTM, a Gaussian Mixture Model (GMM) is used for classification. It achieves satisfactory values of accuracy and F1-score. However, the model consumes high computational time. In [22], a Smart Energy Theft System (SETS) is proposed for ETD. In the system, smart meters are integrated that digitally transmit the data to reduce the manual work. However, the hardware installation cost problem is not resolved. The authors in [23], [24] use supervised learning techniques to tackle the class imbalance and the overfitting issues. To resolve the class imbalance issue, SMOTE is used. The ETD model in [24] is developed through a hybrid scheme that combines LSTM and CNN. In [23], the output layer of CNN is replaced by Random Forest (RF) to avoid the overfitting problem. Apart from deep learning techniques, the traditional ML techniques like SVM are also popular for handling binary classification problems. Likewise, the authors in [25] use SVM for ETD. Besides SVM, a Principal Component Analysis (PCA) model is applied for dimensionality reduction. Although SVM performs well on linearly separable data, it requires a complicated kernel function for nonlinear data, which makes it computationally expensive.
In most cases, the EC data is taken from smart meters. However, this data contains outliers and missing values, which raise the misclassification rate. Therefore, the authors in [26] focus on pre-processing to refine the input data and improve the classification accuracy. Moreover, they propose a Gradient Boosting Decision Trees (GBTD) model that helps in selecting the relevant information from the smart meters' data, which also reduces the execution time. The losses incurred due to ET in Brazil are alarming and the amount of commercial losses faced in 2011 reached $4 billion [27]. To address this problem, the authors utilize the Binary Black Hole Algorithm (BBHA). The algorithm beats the existing optimization techniques, i.e., genetic and particle swarm optimization techniques, in terms of accurate NTL detection and execution time. However, the model is not evaluated using reliable performance metrics, such as recall and precision. A reliable evaluation measure is very necessary for assessing the performance of models in the case of imbalanced data classification problems. To address the NTL problems in Spain, the authors in [28] propose a hybrid model, which consists of LSTM and Multi Layer Perceptron (MLP). In the model, two types of data are utilized to detect the electricity thieves, i.e., sequential data and non-sequential data. The sequential data is processed using the LSTM model and the non-sequential data is learned using MLP. The authors in [29] change the internal structure of CNN to enhance the accuracy of ETD. The enhanced version of CNN is composed of wide and deep components. The former extracts the local features from one-dimensional (1D) data while the latter derives the global features from the two-dimensional (2D) EC data. In [30], the authors use Gradient Boosting Machine (GBM) to detect the electricity thieves.
In [31], the authors use SMOTE based Adaboost algorithm to perform ETD. The boosting algorithm is an ensemble strategy in which multiple weak models are joined to create a strong model to boost the classification performance. The overview of the existing ETD based supervised techniques is given in Table 2, which contains objectives, proposed solutions, datasets, performance metrics and their limitations.
With the development of the advanced metering infrastructure, electric utilities can record high dimensional data from smart meters, which is useful for researchers to design data driven approaches [23], [24]. However, there are some limitations, which need to be addressed.
• The distribution of classes in the data is not balanced. The authors in [18], [23], [24] generate synthetic data to overcome this problem. However, synthetic data does not represent the behavior of real data.
• The features available in the EC dataset are not sufficient for efficient ETD [29].
• The overfitting issue is raised due to the presence of high dimensional, noisy and overlapping features in the EC dataset [25]. Moreover, ML techniques like Linear Regression (LR) [17] require manual feature extraction and are not suitable for classifying high dimensional data.
• Limited generalization ability of the traditional ML classifiers adversely affects the ETD process [25].
• Traditional neural networks increase computational complexity due to the excessive learning of weights for each hidden layer [9].
• The selection of suitable hyper-parameters for deep learning models is a challenging task [24], [29].
• Without applying proper statistical tests, the performance results are not reliable [20]-[22], [25].
In this study, a new model is proposed to resolve the aforementioned issues in ETD. The flowchart diagram of the proposed model is depicted in Figure 1. In the figure, the red color depicts the limitations labeled as L.1-L.7 while the blue color highlights the proposed solutions, which are labeled as S.1-S.7.

III. THE PROPOSED SYSTEM MODEL
The proposed model is discussed in this section and is presented in Figure 2. It comprises four components: classification, feature extraction, data balancing and data preprocessing, which are discussed in the following subsections.

A. DATA COLLECTION
The proposed methodology is designed using the real EC data released by the State Grid Corporation of China (SGCC) [29]. The dataset details are given in Table 3. It consists of the EC information of 42,372 consumers, of which 91% are honest and 9% are fraudulent. The labels of the fraudulent consumers are ground truth. The ratio of honest to theft consumers shows the imbalanced nature of the data. Figure 3 shows the EC patterns of two consumers: an honest consumer and an electricity thief. It shows that the electricity thief has irregular EC patterns and lower EC values because of meter tampering, whereas the honest consumer has normal EC patterns.

B. DATA PRE-PROCESSING
In most cases, the data contains missing values and outliers, and is unscaled. These issues occur due to faulty measuring equipment like defective smart meters and sensors. In this regard, data pre-processing steps are required. Therefore, in this paper, a series of data pre-processing steps is applied, i.e., data interpolation, outlier removal and data normalization.

1) Removing the missing values
The missing instances present in the original data mislead the model into inaccurately classifying electricity thieves and honest consumers. Therefore, it is required to fill the missing values in the EC data. For this purpose, the data interpolation method is exploited to handle the missing values. This method takes the mean of the neighboring values and fills the missing value. If a neighboring value is also missing, the missing value is replaced with 0. Equation (1), taken from [29], is used to fill in the missing values:

f(x_i) = (x_{i-1} + x_{i+1}) / 2,  if x_i ∈ NaN, x_{i-1} ∉ NaN and x_{i+1} ∉ NaN,
f(x_i) = 0,                        if x_i ∈ NaN, x_{i-1} ∈ NaN or x_{i+1} ∈ NaN,
f(x_i) = x_i,                      otherwise,    (1)

where x_i represents the EC value of a consumer and NaN denotes the values that are non-numeric.
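As an illustrative sketch of this interpolation rule (not the authors' code; `None` stands in for NaN, and the function name is hypothetical):

```python
def interpolate(series):
    """Fill missing EC readings with the mean of the two neighbours
    when both exist, otherwise with 0, as described for Equation (1)."""
    out = []
    n = len(series)
    for i, v in enumerate(series):
        if v is not None:
            out.append(v)
            continue
        left = series[i - 1] if i > 0 else None
        right = series[i + 1] if i + 1 < n else None
        if left is not None and right is not None:
            out.append((left + right) / 2.0)
        else:
            out.append(0.0)  # a neighbour is also missing -> 0
    return out
```

Note that the neighbours are read from the original series, so a run of consecutive missing readings collapses to zeros rather than propagating interpolated values.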

2) Identification and removal of outliers
The outliers are the values present in the dataset, which show abnormal behavior. In this study, a three-sigma rule is utilized for handling the outliers in the dataset, as defined in Equation (2), which is taken from [29]:

f(x_i) = avg(x) + 2 · std(x),  if x_i > avg(x) + 2 · std(x),
f(x_i) = x_i,                  otherwise,    (2)

where avg(x) represents the average EC value of a consumer and the standard deviation is denoted as std(x). To handle the outliers, a threshold of 2 standard deviations is set and the values above it are replaced by the threshold value. A general rule for selecting this threshold is not specified.
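A minimal sketch of this rule, assuming (as in [29]) that values above avg(x) + 2·std(x) are capped at the threshold rather than deleted; the function name and the use of the population standard deviation are illustrative choices, not the authors' code:

```python
import math

def cap_outliers(series, k=2.0):
    """Cap values above avg + k*std at the threshold (three-sigma rule, k=2)."""
    n = len(series)
    avg = sum(series) / n
    std = math.sqrt(sum((v - avg) ** 2 for v in series) / n)
    limit = avg + k * std  # threshold of k standard deviations
    return [min(v, limit) for v in series]
```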

3) Data normalization
Data normalization is an important step. In neural networks, it plays a key role in increasing the convergence rate and reducing the execution time. The value ranges of different features in the dataset are not the same. Therefore, the min-max normalization technique is used to scale the values. This technique transforms the values of features between 0 and 1. The min-max normalization is done using Equation (3), which is mentioned in [29]:

x_nor = (x − min(x)) / (max(x) − min(x)),    (3)

where x_nor represents the normalized value while the minimum and maximum values of the data are represented as min(x) and max(x), respectively.
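The min-max transform can be sketched as follows (illustrative, not the authors' code; the constant-series guard is an added assumption, since the formula is undefined when max(x) = min(x)):

```python
def min_max_normalize(series):
    """Scale a list of values into [0, 1] via min-max normalization."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0] * len(series)  # constant series: map everything to 0
    return [(v - lo) / (hi - lo) for v in series]
```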

C. FEATURE GENERATION
Initially, the EC dataset is univariate and has only one feature, i.e., EC. However, it is important to generate more statistical features for efficient ETD. In this regard, significant statistical features, such as mean, median, mode, min and max, are computed from the original EC data to enhance the ETD performance.
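As a sketch, the five statistics can be computed per consumer with Python's standard `statistics` module (the function name and dictionary layout are illustrative assumptions):

```python
import statistics

def consumption_features(readings):
    """Compute the five statistical features from one consumer's EC readings."""
    return {
        "mean": statistics.mean(readings),
        "median": statistics.median(readings),
        "mode": statistics.mode(readings),
        "min": min(readings),
        "max": max(readings),
    }
```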

D. DATA BALANCING THROUGH ST-LINKS
The major problem in ETD is the imbalanced distribution of honest consumers and electricity thieves in the data. The number of honest consumers is much higher than the number of theft consumers. Due to this problem, the classifier gives biased results [32].
The ST-Links technique is proposed to address the imbalanced class issue. In this technique, SMOTE increases the theft class instances after the k-nearest neighbours of each minority class sample are identified. If (x_1, x_2) are the instances of the minority class with their nearest neighbours selected as (x̂_1, x̂_2), then the data instances synthesized by SMOTE are given in Equation (4) [24]:

x_new = x_i + random(0, 1) × (x̂_i − x_i),    (4)

where random(0, 1) is a function that generates a random number in the range 0 to 1, so that the synthetic samples lie between the minority class instances and their neighbours. On the other hand, the Tomek links step removes instances from both classes. This method identifies pairs of nearest neighbours that belong to different classes in the same dataset. It creates a boundary line by identifying the pairs of nearest neighbours of the two classes. The generation of new data by the ST-Links technique is shown in Figure 4.
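The SMOTE interpolation step of Equation (4) can be sketched as follows; a minority instance and one of its nearest neighbours are linearly interpolated (an illustrative sketch, not the authors' implementation):

```python
import random

def smote_sample(x, x_hat):
    """Synthesize one minority instance between sample x and its nearest
    neighbour x_hat: x_new = x + random(0, 1) * (x_hat - x)."""
    lam = random.random()  # random(0, 1)
    return [xi + lam * (xh - xi) for xi, xh in zip(x, x_hat)]
```

Each synthetic point lies on the line segment joining the two minority samples, which is what keeps the oversampled data close to the behavior of real electricity thieves.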

E. FEATURE EXTRACTION USING AN ALEXNET AND PEEPHOLE LSTM BASED HYBRID TECHNIQUE WITH ATTENTION LAYER
Feature extraction is a significant stage in accurate identification of electricity thieves. In this process, the most relevant features are extracted from the EC data. A simple neural network fails to store the dependencies for a longer time in its memory to predict future information. Therefore, in this paper, an AlexNet [33] and PLSTM based hybrid solution is used along with an attention layer for feature extraction. The architecture of the proposed feature extractor is given in Figure 2. The AlexNet module has three fully connected layers, five pooling layers and five convolutional layers. These layers operate as a sequential model to perform feature extraction. Initially, the balanced data is fed to the convolutional layer. Afterwards, the convolutional layers' output is sent to the pooling layers. In this study, the activation function used to deal with the non-linear data is a Rectified Linear Unit (ReLU) [34], which is given as:

ReLU(x) = max(0, x).    (6)

In Equation (6), x is the input data. ReLU performs activation if x is positive, otherwise, it returns 0. The max pooling layer is used in this paper, which computes the highest value for every feature map. The convolutional and pooling layers minimize the data dimensions to avoid the overfitting problem. The max pooling layer is used after each convolutional layer and then three fully connected layers (also called dense layers) are utilized to give the final output. Afterwards, the extracted feature set of AlexNet is passed to PLSTM for extracting more granular temporal correlated features. The PLSTM model is an advanced variant of the traditional LSTM in which the concept of peephole connections is introduced. The peephole connections link the value of the long-term memory state with every gate at each time step to increase the memorization power of the model.
Furthermore, the integration of the attention layer with the PLSTM model guides it to focus only on the relevant information. This process increases the convergence speed of PLSTM to a greater extent. The mathematical representations of PLSTM's memory gates are given below [28]:

i_t = σ(W_i · [C_{t−1}, h_{t−1}, x_t] + b_i),
f_t = σ(W_f · [C_{t−1}, h_{t−1}, x_t] + b_f),
ĉ_t = tanh(W_c · [h_{t−1}, x_t] + b_c),
C_t = f_t ⊙ C_{t−1} + i_t ⊙ ĉ_t,
o_t = σ(W_o · [C_t, h_{t−1}, x_t] + b_o),
h_t = o_t ⊙ tanh(C_t),

where i_t, f_t and o_t denote the input gate, forget gate and output gate, respectively. Similarly, W_i, W_f and W_o represent the weight matrices of the input gate, forget gate and output gate, respectively. σ indicates the sigmoid activation function. The original input at time step t is expressed by x_t. C_{t−1} and h_{t−1} denote the cell state and hidden state at time step t − 1. C_t and ĉ_t represent the information in the cell state at the overall and current time step, respectively. Whereas, b_i, b_f, b_o and b_c indicate the bias values of the input gate, forget gate, output gate and cell state, respectively. Afterwards, the extracted feature set of both AlexNet and PLSTM is passed as an input to the ESNN model for the classification of honest and dishonest consumers. A more detailed description of the ESNN model is provided in Section III-F. Moreover, early stopping is used for monitoring the loss rate to avoid the overfitting problem. If the model achieves an optimal generalized performance, then the training is stopped and the relevant features are extracted.
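A scalar sketch of one peephole LSTM step may clarify the gating described above: the previous cell state feeds the input and forget gates, and the freshly updated cell state feeds the output gate. The parameter dictionary and function name are illustrative assumptions, not the paper's implementation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def plstm_step(x, h_prev, c_prev, p):
    """One peephole-LSTM step with scalar weights in dict p.
    Peephole terms couple c_prev into i and f, and the new c into o."""
    i = sigmoid(p["wi_x"] * x + p["wi_h"] * h_prev + p["wi_c"] * c_prev + p["bi"])
    f = sigmoid(p["wf_x"] * x + p["wf_h"] * h_prev + p["wf_c"] * c_prev + p["bf"])
    c_hat = math.tanh(p["wc_x"] * x + p["wc_h"] * h_prev + p["bc"])
    c = f * c_prev + i * c_hat          # new cell state
    o = sigmoid(p["wo_x"] * x + p["wo_h"] * h_prev + p["wo_c"] * c + p["bo"])
    h = o * math.tanh(c)                # new hidden state
    return h, c
```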

F. CLASSIFICATION USING ECHO STATE NEURAL NETWORK
A novel version of recurrent neural network is utilized, termed as ESNN [35], for the final classification of honest and dishonest electricity consumers. The basic architecture of ESNN is based on recurrent memory networks. The ESNN model is introduced to perform time series prediction and classification tasks. It contains a reservoir where multiple nodes (neurons) are placed in such a way that they are partially connected with each other to store important time series information. These nodes maintain temporal correlated patterns in their hidden states, which decay over time. The basic structure of the ESNN model consists of three layers: input, reservoir (hidden layer) and output. The extracted feature set of APLSTM is aligned in a 1D vector and is fed to the input layer where random weights are assigned to each input. Afterwards, the feature vector is passed to the hidden layer (reservoir). The nodes in the reservoir capture the necessary temporal patterns from the received input. In particular, they also maintain non-linear variance in EC patterns, which increases the generalization ability of the classification model. The output layer performs the classification of honest and dishonest consumers accordingly. The major advantage of the ESNN model over the deep models is that it updates the weights of only those neurons which are connected with the output layer neurons. By doing this, the computational overhead of the model is reduced and the convergence speed is increased to a greater extent. The mathematical formula used for updating the reservoir [35] nodes is given as follows.
x(n + 1) = 1 / (1 + exp(−α(W_in u(n + 1) + W_r x(n) + W_back y(n) + v))),

where α and v indicate a coefficient and a small proportion of noise, respectively. W_in and W_r represent the weight matrices of the input layer and reservoir, respectively. W_back depicts the weight matrix that is sent back to the reservoir for updation. The expression u(n + 1) represents the input that is given to the network at time step n + 1. The activation of the output layer's neurons is updated as follows.
y(n + 1) = W_out(u(n + 1), x(n + 1), y(n)),

where W_out depicts the weight matrix of the output layer. Algorithm 1 describes the feature extraction and classification processes of the proposed model.

Algorithm 1: Feature extraction and classification through the proposed model
1  Input: Balanced dataset X_data = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, x, y ∈ R
2  Output: Det_model
3  Variables' initialization: W_l(.) and b_l(.) ∀ l, W, W_in, W_r, W_back, W_out, b, contex_v
4  x = X_data
5  Feature extraction:
6  while not converged do
7    for each training sample x do
8      AlexNet layers:
9      for each hidden layer l = 1, ..., L do
       Peephole LSTM layers:
15     for each recurrent layer l_p do
16       for each time step t do
       Weights and biases' updates:
39 for epoch = 1 to n do
40   Hidden state weights' updation: x(n + 1) = 1 / (1 + exp(−α(W_in u(n + 1) + W_r x(n) + W_back y(n) + v)))
41   Output state weights' updation: y(n + 1) = W_out(u(n + 1), x(n + 1), y(n))
42   Fully connected layer: Det_model = σ(W * y(t) + b)
43 end

Lines 1-4 represent the input, output and variables' initialization of the algorithm. The feature extraction process of the AlexNet and PLSTM models is shown in lines 5-36. Furthermore, lines 37-42 exhibit the final classification of honest and dishonest energy consumers using the ESNN model. In addition, a more detailed working of the attention layer and the back propagation process is given in [36].
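The reservoir update described above can be sketched in plain Python; each reservoir neuron combines the input, the other reservoir states and the fed-back output through a sigmoid. The function name and the list-based weight layout are illustrative assumptions, not the authors' implementation:

```python
import math

def esnn_update(x, u, y_prev, W_in, W_r, W_back, alpha=1.0, v=0.0):
    """One reservoir update:
    x(n+1) = sigmoid(alpha * (W_in*u + W_r*x + W_back*y + v)), per neuron."""
    n = len(x)
    new_x = []
    for i in range(n):
        s = (W_in[i] * u
             + sum(W_r[i][j] * x[j] for j in range(n))  # recurrent term
             + W_back[i] * y_prev                        # output feedback
             + v)                                        # small noise term
        new_x.append(1.0 / (1.0 + math.exp(-alpha * s)))
    return new_x
```

Only the readout weights W_out are trained in an echo state network; W_in, W_r and W_back stay fixed, which is why the model is cheap to train compared with deep recurrent networks.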

G. HYPER-PARAMETER OPTIMIZATION USING GREY WOLF OPTIMIZATION
Inspired by [37], where the authors propose an ensemble learning and Jaya based solution to predict the post-fault transient stability status of power systems, a GWO and deep learning based ensemble learning solution is proposed to detect the energy thieves in smart grids. The utilization of heuristic techniques has a great influence on the performance of classification algorithms. Moreover, hyper-parameter tuning is the most important step taken to enhance the performance of deep learning models. In this regard, an evolutionary metaheuristic technique, known as GWO [38], is employed to tune the hyper-parameters of the APLSTM and ESNN models and enhance the ETD performance. The GWO technique is based on the hunting behavior of grey wolves. Moreover, it converges quickly towards a globally optimal solution due to the random and diverse assignment of individuals in the exploitation and exploration phases.
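A compact sketch of the GWO position update may help: every wolf moves toward the three best solutions (alpha, beta, delta) using the standard encircling coefficients of [38], while the exploration coefficient a decays linearly from 2 to 0. The bound clipping and global-best tracking are added assumptions for a runnable minimization example, not the authors' tuning code:

```python
import random

def gwo(fitness, dim, n_wolves=10, iters=50, lo=-5.0, hi=5.0, seed=0):
    """Minimal Grey Wolf Optimization sketch (minimizes `fitness`)."""
    rng = random.Random(seed)
    wolves = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_wolves)]
    best = min(wolves, key=fitness)[:]
    for it in range(iters):
        wolves.sort(key=fitness)
        alpha, beta, delta = (w[:] for w in wolves[:3])
        a = 2.0 - 2.0 * it / iters           # decays from 2 (explore) to 0 (exploit)
        for w in wolves:
            for d in range(dim):
                pos = 0.0
                for leader in (alpha, beta, delta):
                    A = 2.0 * a * rng.random() - a   # encircling coefficient
                    C = 2.0 * rng.random()
                    D = abs(C * leader[d] - w[d])    # distance to the leader
                    pos += leader[d] - A * D
                w[d] = min(hi, max(lo, pos / 3.0))   # average of the three moves
        cand = min(wolves, key=fitness)
        if fitness(cand) < fitness(best):
            best = cand[:]
    return best
```

In the paper's setting, `fitness` would wrap a training-and-validation run of the APLSTM/ESNN model and each dimension would encode one hyper-parameter.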

IV. SIMULATION RESULTS AND DISCUSSION
The simulation results for the proposed solution are discussed in this section. The proposed solution is compared with other benchmark schemes to demonstrate its effectiveness. Moreover, it is worth mentioning that APLSTM is combined with ESNN, motivated by the articles [23] and [24], where similar work is done in the context of ETD in smart grids. In these articles, initially, the feature extraction is performed and then the classification is carried out to identify the honest and dishonest electricity consumers. The simulation results of these studies prove that a combined model performs better than a standalone model. Keeping this in view, a combined ensemble model is proposed in this study, where APLSTM is employed to extract the temporal correlated features from the EC data. Afterwards, the extracted feature set is passed to the ESNN model for final ETD. The classification results show that the combined APLSTM and ESNN based model is a better solution to detect energy thieves as compared to other baseline models.

A. SIMULATION SETUP
The simulations are performed using open-source libraries of Python programming language, known as Tensorflow and Keras. The simulations are performed to train and validate the model. The data is grouped into 75% and 25% for training and testing the proposed model, respectively.

B. MODEL EVALUATION METRICS
In ETD, a major concern is the validation of the classifier using imbalanced data because accuracy alone does not give a reliable evaluation for imbalanced classification problems.
In this regard, more suitable performance metrics are used, i.e., PR-AUC, AUC, precision, MCC, recall and F1-score, to assess the performance of the proposed model. Moreover, similar performance metrics are used in [39] to evaluate the performance of imbalanced classification models. The performance metrics are measured using the confusion matrix, which gives the information about True Positives (TPs), True Negatives (TNs), False Positives (FPs) and False Negatives (FNs). Precision [40], recall [41] and F1-score [42], [43] are calculated using Equations (15)-(17):

precision = TP / (TP + FP),    (15)
recall = TP / (TP + FN),    (16)
F1-score = 2 × (precision × recall) / (precision + recall),    (17)

where recall and precision are computed from the values of FN, FP and TP. The precision shows the proportion of consumers identified as thieves that are actually thieves, while recall determines the positive class instances that are accurately identified by the model. In Equation (17), the F1-score is calculated, which is a more reliable metric than recall or precision alone. It balances both recall and precision to provide a single score. The PR-AUC is another useful metric, which is a graphical representation that demonstrates the recall values on the x-axis and the precision values on the y-axis. The outcome of the PR-AUC is between 0 and 1. Among all of the above mentioned performance metrics, MCC [44] is more reliable because it takes the correlation of all the four possible confusion matrix outcomes, i.e., FN, FP, TN and TP. It is computed using Equation (18):

MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)).    (18)

To get a graphical representation of the evaluation model, the AUC metric is used. It is calculated from TPR and FPR. In ETD, TPR is the number of electricity thieves which are found guilty while FPR is the number of legitimate energy consumers which are misclassified as electricity thieves. AUC gives a performance value between 0 and 1. It is calculated using Equation (19), which is taken from [24]:

AUC = (recall + specificity) / 2,    (19)

where specificity is a performance metric that shows the percentage of honest users that are correctly classified as honest by the model. This metric is also known as the True Negative Rate (TNR).
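The metric definitions above can be computed directly from the four confusion matrix counts. This is an illustrative helper (the function name is an assumption), with the AUC line using the balanced (TPR + TNR)/2 form described for Equation (19):

```python
import math

def etd_metrics(tp, tn, fp, fn):
    """Compute ETD evaluation metrics from confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # TPR: thieves found guilty
    specificity = tn / (tn + fp)       # TNR: honest users cleared
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    auc = (recall + specificity) / 2   # balanced form of Equation (19)
    return {"precision": precision, "recall": recall, "f1": f1,
            "mcc": mcc, "auc": auc}
```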

C. RESULTS OF THE PROPOSED MODEL
In this section, the performance of the proposed solution is verified in terms of handling the imbalanced data and overfitting problems. Moreover, the performance of the model is also evaluated using different performance metrics.

1) Performance on imbalanced dataset
The theft users are remarkably fewer than the honest users in the SGCC dataset, which biases the model towards the majority class during the training process. In this regard, the ST-Links technique is used. For comparison, the proposed model is also evaluated with different data balancing techniques, i.e., SMOTE [31], Adaptive synthetic (Adasyn) [45] and Near Miss (NM). The Adasyn technique performs relatively better than SMOTE, with 85.2% precision, 87.1% recall, 88.3% F1-score, 86.0% AUC, 85.1% PR-AUC and 77.4% MCC, as shown in Figure 5. In the SMOTE and Adasyn algorithms, the minority class data is replicated to balance the dataset. However, the unintelligent replication of the minority class data causes an overfitting issue, which leads to a high misclassification rate. The aim of using NM is two-fold. Firstly, it creates a boundary line and removes those majority samples that are considered as noise and lie far from the decision boundary. Secondly, it reduces the execution time of the model. However, NM causes the loss of information due to random under-sampling, which under-fits the model. Figure 5 shows that the NM technique achieves 75.2% precision, 78.4% recall, 80.2% F1-score, 74.6% AUC, 75.3% PR-AUC and 68.1% MCC. ST-Links is a good solution to balance the dataset. It avoids the loss of information by intelligently selecting the instances in the data and therefore prevents classifier bias. It is worth noticing that the performance of ST-Links is better than that of the other data balancing techniques. Figure 5 shows that the ST-Links technique improves the model's performance by securing 90.0% precision, 92.1% recall, 92.0% F1-score, 96.4% AUC, 97.3% PR-AUC and 84% MCC.
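As an illustration of the undersampling half of ST-Links, the pure-Python sketch below detects Tomek links, i.e., mutual nearest-neighbour pairs with opposite labels, and drops the majority-class member of each pair; the oversampling half would first add SMOTE-style synthetic minority samples. The function names and the toy data are hypothetical, not taken from the paper:

```python
def euclidean(a, b):
    """Plain Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nearest(i, X):
    """Index of the nearest neighbour of sample i (excluding itself)."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: euclidean(X[i], X[j]))

def remove_tomek_links(X, y, majority=0):
    """Drop majority-class samples that form Tomek links: pairs of
    opposite-class samples that are each other's nearest neighbours."""
    drop = set()
    for i in range(len(X)):
        j = nearest(i, X)
        if y[i] != y[j] and nearest(j, X) == i:   # mutual NN, opposite class
            drop.add(i if y[i] == majority else j)
    keep = [k for k in range(len(X)) if k not in drop]
    return [X[k] for k in keep], [y[k] for k in keep]

# Toy data: one majority point (label 0) sits right next to a minority point
X = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (6.0, 5.0)]
y = [0, 1, 0, 0]
Xb, yb = remove_tomek_links(X, y)
```

Removing such borderline majority samples cleans the class boundary, which is how ST-Links avoids the information loss of purely random under-sampling.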
I also compare the computational time of the ST-Links technique with the other data balancing techniques. The ST-Links based ensemble learning model has less execution time than the SMOTE and Adasyn methods. Although its execution time is greater than that of NM, the NM method simply removes the samples that are considered noise in the data. Because the model is trained on less data, NM is beneficial in reducing the computational time. In contrast, SMOTE takes more execution time due to the addition of synthetic samples to the dataset, and the execution time of Adasyn is the highest because it adds more random values to the data, which are linearly correlated to the main data with less variance. Figure 6 shows the execution time of the four data balancing techniques: SMOTE, Adasyn, NM and ST-Links. It is seen that NM has less execution time as compared to Adasyn, SMOTE and ST-Links.

On the other hand, the hyper-parameters' tuning through GWO enhances the model's performance by increasing the recall, precision, F1-score, AUC, PR-AUC and MCC by 4.1%, 3.0%, 2.8%, 6.2%, 15% and 6.8%, respectively. Moreover, it is important to mention that the GWO technique selects the best solution by comprehensively assessing the three best candidate solutions during the exploration phase [46]. The simulation results demonstrate that this property of GWO efficiently avoids the problem of premature convergence. It is seen that the hyper-parameter set selected by GWO increases the performance of the proposed model to a great extent. The term generalization describes the ability of the model to perform on unseen data. In Figure 8, the loss values of APLSTM-ESNN are shown. When a small number of epochs is selected, the model trains effectively on the high dimensional data. However, as the number of epochs rises to 40, overfitting occurs: the model shows 38% and 41% loss on the training and testing data, respectively.
These results show good prediction capability of the proposed model on testing data. Similarly, Figure 9 exhibits the accuracy curve of the model. It is seen that the model's accuracy gradually improves with each epoch. This implies that the model efficiently captures EC patterns from consumers' profiles and improves its generalization ability.
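The GWO search used above for hyper-parameter tuning can be sketched in pure Python. In the sketch below, the three best wolves (alpha, beta and delta) jointly guide every position update, and the control parameter a decays from 2 to 0 to move the search from exploration to exploitation. The function name, pack size, iteration count and toy objective are illustrative assumptions, not the paper's actual settings:

```python
import random

def gwo_minimize(f, dim, bounds, wolves=12, iters=150, seed=1):
    """Minimal grey wolf optimization sketch: each wolf moves towards an
    average of positions dictated by the three best solutions found so far."""
    rng = random.Random(seed)
    lo, hi = bounds
    pack = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(wolves)]
    for t in range(iters):
        pack.sort(key=f)                          # best wolves first
        leaders = [list(w) for w in pack[:3]]     # copies of alpha, beta, delta
        a = 2 - 2 * t / iters                     # decays from 2 to 0
        for w in pack:
            for d in range(dim):
                pos = 0.0
                for leader in leaders:
                    r1, r2 = rng.random(), rng.random()
                    A = a * (2 * r1 - 1)          # exploration/exploitation factor
                    C = 2 * r2
                    pos += leader[d] - A * abs(C * leader[d] - w[d])
                w[d] = min(hi, max(lo, pos / 3))  # average of the three pulls
    return min(pack, key=f)

# Toy objective (sphere function) standing in for the ESNN validation loss
best = gwo_minimize(lambda x: sum(v * v for v in x), dim=3, bounds=(-5.0, 5.0))
```

Averaging over three leaders rather than following a single best wolf is the property the text credits with avoiding premature convergence.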
In Figure 10, ESNN shows the highest AUC score of 96.4% and 99.8% on the testing and training data, respectively. It shows that the model has accurately identified the honest consumers and electricity thieves. Moreover, the model does not show overfitting. Therefore, the proposed model has enough capability to identify the dishonest users even from the unseen data.
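The AUC discussed above can also be computed threshold-free from raw classifier scores: it equals the probability that a randomly chosen theft sample is scored higher than a randomly chosen honest sample (the Mann-Whitney formulation of the ROC AUC, whereas Equation (19) evaluates a single operating point). A minimal sketch with hypothetical scores:

```python
def roc_auc(scores, labels):
    """Rank-based ROC AUC: the probability that a random positive sample
    (label 1, theft) outscores a random negative one (label 0, honest);
    ties count as half a win."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores for four consumers (perfect separation)
auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```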

D. THE COMPARISON BETWEEN THE PROPOSED AND BENCHMARK MODELS
To show the effectiveness of the proposed model as compared to other benchmark models, I give the performance comparison with CNN-GRU, MLP-LSTM, XGBoost, LR and LSTM-RUSBoost. These state-of-the-art models are used for the classification of electricity thieves in the literature.
1) The results of CNN-GRU model
CNN and GRU are two popular deep learning models that are used for the identification of electricity thieves in the literature [18], [23]. CNN is used for feature extraction; it has the ability to extract relevant features from the data, which help to perform accurate ETD. GRU then performs the final classification. Using the SGCC dataset, I evaluate the CNN-GRU model on the selected performance measures. This model achieves 77.0% accuracy, 85.4% precision, 68.0% recall, 80.1% F1-score, 84.6% AUC, 82% PR-AUC and 64.0% MCC, as given in Table 4.

2) The results of LR model
LR is used for ETD in [17]. It is based on the same underlying principle as neural networks. The results of LR are given in Table 4. It achieves 73.0% accuracy, 70.9% precision, 72.9% recall, 71.9% F1-score, 69.4% PR-AUC and 45.0% MCC. The AUC of the LR model, shown in Figure 11, is 70.3%. Thus, it is not suitable for binary classification with a large dataset.

3) The results of XGBoost model
In this paper, XGBoost [17] is used as a benchmark model. It is a highly popular ML algorithm that won 17 out of 29 Kaggle competitions in 2015 [45]. The performance of the XGBoost model is evaluated on different performance metrics, which are given in Table 4.

5) The results of LSTM-RUSBoost model
The authors in [20] use a hybrid of RUSBoost and LSTM for ETD. The LSTM model is used for extracting the long term correlations between the consumption patterns, whereas the purpose of RUSBoost is two-fold. Firstly, it balances the data using RUS. Secondly, it performs the classification through ensemble learning. The results of this model are given in Table 4.

The comparison of the proposed model with the benchmark models is depicted in Figures 11 and 12. It is seen that the proposed model surpasses the existing models in terms of AUC and PR-AUC. The average execution time of the proposed and benchmark schemes is depicted in Figure 14. From the figure, it is seen that the ESNN model has the least training time of 17 seconds because its weight updating process is quite simple: it updates the weights of only those neurons that participate in the output layer. Due to this property, the proposed model reduces the computational overhead to a minimal level. Furthermore, it obtains the highest performance results as compared to the other baseline models. Therefore, in practical applications, the proposed solution is capable of detecting energy thieves accurately within a minimum time. In contrast, as depicted in Table 4 and Figure 13, the complex layering structure of the baseline deep models increases the training time to a greater extent.

Moreover, the mapping of the addressed problems with the proposed solutions and their respective validation is given in Table 5. The limitations are labeled as L.1-L.7, the proposed solutions as S.1-S.7 and the validation of results as V.1-V.7. Firstly, the model is validated in terms of handling the imbalanced data. The results in Figure 5 show that ST-Links achieves better results than the other data balancing techniques, i.e., SMOTE, Adasyn and NM: it obtains 90% precision, 92.1% recall, 92% F1-score, 96.4% AUC and 84% MCC.
Secondly, Figure 13 depicts that the ESNN model performs well with the newly generated statistical features. Thirdly, Figures 7 and 10 exhibit that the proposed model avoids overfitting and achieves the best classification performance on the unseen data. Fourthly, Figure 10 shows that GWO-ESNN gives improved generalized performance, as it achieves the highest AUC scores, i.e., 96.4% and 99% on the testing and training data, respectively. Fifthly, Figure 14 illustrates that the ESNN model has the least training time. Sixthly, Figure 7 shows that the GWO technique increases the classification accuracy by selecting the best hyper-parameters' values for the ESNN model. Seventhly, the classification results of the ESNN model are verified by applying a statistical test, which is known as the paired t-test. Finally, the proposed model is compared with the benchmark methods, i.e., CNN-GRU, MLP-LSTM, XGBoost, LR and LSTM-RUSBoost.

The accuracy metric alone is not suitable to differentiate between honest and dishonest consumers; it is misleading in a classification problem where the classes of the data are imbalanced. So, more reliable performance metrics are used, i.e., AUC, PR-AUC, precision, F1-score and recall, to assess the proposed model. Using these reliable metrics, it is observed that the performance of the proposed model is better than that of the existing models. Moreover, it is worth mentioning that the proposed ETD model is trained on a realistic and massive dataset of China. In addition, the proposed approach is based on deep neural networks, which perform efficiently with huge datasets. In this regard, the proposed ETD approach is applicable in scenarios where the number of electricity consumers is either high or low. Finally, it is concluded that the proposed ETD approach is scalable.

However, relying on the final classification accuracy of the proposed model is not justifiable without applying a suitable statistical test.
Therefore, a well-known statistical test, termed the paired t-test, is used for the fair evaluation of the final classification results. First of all, I compute 60 independent accuracy values by changing the hyper-parameters' settings of the APLSTM and ESNN models with different training and testing distributions (e.g., 70:30 and 80:20). Afterwards, the paired t-test is applied on these values to ensure that the obtained accuracy of the classification model does not occur by chance. The test achieves a t-statistic of 1.983723 and a p-value of 0.045256; since the p-value is below the 0.05 significance level, the null hypothesis is rejected. Hence, the classification results of the proposed model are reliable and are not obtained by chance. The average accuracy value of the proposed model in the 60 different settings is shown in Figure 15.
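The paired t-statistic itself is straightforward to compute, as the sketch below shows on hypothetical accuracy pairs; obtaining the p-value additionally requires the CDF of the t-distribution with n - 1 degrees of freedom, e.g., from a statistics library. The function name and the sample values are illustrative, not the paper's 60 measured accuracies:

```python
from math import sqrt

def paired_t(xs, ys):
    """Paired t-statistic on matched samples: the mean of the pairwise
    differences divided by its standard error (n - 1 degrees of freedom)."""
    n = len(xs)
    d = [x - y for x, y in zip(xs, ys)]
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)   # sample variance
    return mean / sqrt(var / n)

# Hypothetical accuracy pairs from two hyper-parameter settings
t = paired_t([0.96, 0.95, 0.97, 0.96], [0.94, 0.93, 0.96, 0.95])
```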

V. CONCLUSION AND FUTURE WORK
In the proposed study, the real EC data of SGCC is used to identify the electricity thieves. Initially, necessary statistical features are computed and data pre-processing is performed to refine the input data. Afterwards, the imbalanced data problem is addressed by ST-Links, which intelligently balances the data classes. In addition, the relevant features are extracted by APLSTM to improve the ETD performance. The extracted feature set is given to the ESNN model for classifying the electricity thieves. Moreover, the hyper-parameters of the proposed model are tuned through GWO. The hyper-parameters selected through GWO increase the performance results of the APLSTM-ESNN model to a great extent. Furthermore, a statistical test, termed the paired t-test, is applied on the classification results for the reliable assessment of the proposed model. The simulation results depict that the proposed model achieves MCC, AUC, PR-AUC, F1-score, recall, precision and accuracy of 84.0%, 96.4%, 97.3%, 92.0%, 92.1%, 90.0% and 96.3%, respectively. The proposed model performs better than the existing models in terms of classification. Although the proposed model is an effective solution for ETD, its performance shows some abrupt changes with respect to the input data. Furthermore, the proposed model is trained on data with a low sampling rate, which limits its ability to capture more granular information about EC patterns. Therefore, in future, EC data with a high sampling rate and some other factors, such as different usage behaviors of consumers, temperature and seasonality, will be considered to develop a robust model.