Electricity Theft Detection in Smart Grids based on Deep Neural Network

Electricity theft is a global problem that negatively affects both utility companies and electricity users. It destabilizes the economic development of utility companies, causes electrical hazards and raises the cost of energy for users. The development of smart grids plays an important role in electricity theft detection, since smart grids generate massive data, including customer consumption data, which can be utilized through machine learning and deep learning techniques to detect electricity theft. This paper introduces a theft detection method that uses comprehensive features in the time and frequency domains in a deep neural network-based classification approach. We address dataset weaknesses such as missing data and class imbalance through data interpolation and synthetic data generation. We analyze and compare the contribution of features from both the time and frequency domains, run experiments in combined and reduced feature spaces using principal component analysis, and finally incorporate the minimum redundancy maximum relevance scheme to validate the most important features. We improve the electricity theft detection performance by optimizing hyperparameters using a Bayesian optimizer, and we employ an adaptive moment estimation optimizer to carry out experiments with different values of key parameters to determine the optimal settings that achieve the best accuracy. Lastly, we show the competitiveness of our method in comparison with other methods evaluated on the same dataset. On validation, we obtained a 97% area under the curve (AUC), which is 1% higher than the best AUC in existing works, and 91.8% accuracy, which is the second-best on the benchmark.


I. INTRODUCTION
Electricity theft is a problem that affects utility companies worldwide. More than $96 billion is lost by utility companies worldwide due to Non-Technical Losses (NTLs) every year, of which electricity theft is the major contributor [1]. In sub-Saharan Africa, 50% of generated energy is stolen, as reported by the World Bank [2].
The ultimate goal of electricity thieves is to consume energy without being billed by utility companies [3], or to pay bills amounting to less than the consumed amount [4]. As a result, utility companies suffer huge revenue losses due to electricity theft. [5] reports that in 2015, India lost $16.2 billion, Brazil lost $10.5 billion and Russia lost $5.1 billion. It is estimated that the approximately $1.31 billion (R20 billion) revenue loss incurred by South Africa (through Eskom) per year is due to electricity theft [2]. Electricity theft is carried out both by physically tampering with metering infrastructure and through cyber attacks [4], [7]. Recently, researchers have worked towards detecting electricity theft by utilizing machine learning classification techniques on readily available smart meter data. These theft detection methods have proved to be of relatively lower cost [8]. However, existing classification techniques consider time-domain features and disregard frequency-domain features, thereby limiting their performance.
Despite active ongoing research on electricity theft detection, electricity theft is still a problem. A major cause of delay in solving this problem may be that smart grid deployment has largely been realized in developed nations, while developing nations lag behind [9]. The challenges of deploying smart grids include the lack of communication infrastructure and users' privacy concerns over the data reported by smart meters [10]. However, [10] reports that smart meters are being considered by many developed and developing countries, with aims that include solving NTLs. [11] predicted the global smart grid market to triple in size between 2017 and 2023, with the following key regions leading smart grid deployment: North America, Europe and Asia.
In this paper, we present an effective electricity theft detection method based on carefully extracted and selected features in a Deep Neural Network (DNN)-based classification approach. We show that employing frequency-domain features in addition to time-domain features enhances classification performance. We use a realistic electricity consumption dataset released by the State Grid Corporation of China (SGCC), accessible at [12]. The dataset consists of electricity consumption data taken from January 2014 to October 2016. The main contributions are as follows:
• Based on the literature, we propose a novel DNN classification-based electricity theft detection method using comprehensive time-domain features. We further propose using frequency-domain features to enhance performance.
• We employ Principal Component Analysis (PCA) to perform classification with a reduced feature space and compare the results with classification done with all input features, in order to interpret the results and simplify the future training process.
• We further use the Minimum Redundancy Maximum Relevance (mRMR) scheme to identify the most significant features and validate the importance of frequency-domain features over time-domain features for detecting electricity theft.
• We optimize the hyperparameters of the model for overall improved performance using a Bayesian optimizer. We further employ an adaptive moment estimation (Adam) optimizer to determine the ranges of values of the other key parameters that achieve good results with optimal model training speed.
• Lastly, we show a 1% improvement in AUC and competitive accuracy of our model in comparison to other data-driven electricity theft detection methods in the literature evaluated on the same dataset.
The remainder of this paper is organized as follows. Section II covers the related work done in the literature to tackle the electricity theft problem. In Section III, we briefly introduce the techniques used in this paper. Section IV covers the step-by-step method taken in this work, which includes dataset analysis, work done to improve the dataset quality, and customers' load profile analysis, which leads to feature extraction and classification. In Section V, we show and discuss the results. We finally conclude the paper in Section VI.

II. RELATED WORK
Research on electricity theft detection in smart grids has attracted many researchers, who have devised methods to mitigate electricity theft. Methods used in the literature can be broadly grouped into the following three categories: hardware-based methods, combined hardware and data-based detection methods, and data-driven methods.
Hardware-based methods [13]-[19] generally require hardware devices such as specialized microcontrollers, sensors and circuits to be installed on power distribution lines. These methods are generally designed to detect electricity theft done by physically tampering with distribution components such as distribution lines and electricity meters. They cannot detect cyber attacks. A cyber attack is a form of electricity theft whereby energy consumption data is modified by hacking the electricity meter [7].
For instance, in [13], an electricity meter was re-designed. It used components that include a Global System for Mobile Communications (GSM) module, a microcontroller, and an Electrically Erasable Programmable Read-Only Memory (EEPROM). A simulation was done and the meter was able to send a Short Message Service (SMS) message whenever an illegal load was connected by bypassing the meter. This approach is limited to detecting electricity theft done by physically tampering with distribution components such as distribution lines and electricity meters. Authors in [16] used a GSM module, an ARM Cortex-M3 processor and other hardware components to address electricity theft done in the following four ways: bypassing the phase line, bypassing the meter, disconnecting the neutral line, and tampering with the meter to make unauthorized modifications. A prototype was built to test all four possibilities. The GSM module was able to notify with an SMS for each theft case.
Authors in [17] designed an ADE7953 chip-based smart meter which is sensitive to current, voltage and mechanical tampering. The ADE7953 was used to detect overvoltage, voltage drops, overcurrent, the absence of load and other irregularities in voltage and current. It sent an interrupt signal to the Microcontroller Unit (MCU), which reported the tampering status. Mechanical tampering was addressed by connecting a tampering switch to the MCU's IO ports so that it sends an alarm signal to the MCU once tampered with. The design was tested with tampering cases such as connecting the neutral and phase lines, connecting the meter input and output in reverse, and bypassing the phase line to the load. The probability of detection failure was 2.13%.
Authors in [15] used a step-down transformer, a voltage divider circuit, a microchip and other hardware components to design circuitry that detects electricity theft by comparing the forward current on the main phase line with the reverse current on the neutral line. The circuitry was installed before the meter. The design was tested on Proteus software and on actual hardware. When the meter was bypassed, the problem was detected and an alarm sounded. In [14], a circuit to detect electricity theft done by bypassing the meter was designed. Transformers, rectifiers, a microcontroller, a GSM module and other hardware components were used. The GSM controller notified the operator with an SMS when the meter was bypassed.
Authors in [18] proposed putting Radio Frequency Identification (RFID) tags on ammeters and capturing unique data about each ammeter. Ammeters were to be tracked and managed in real time, and electricity theft was to be inspected on-site. A damaged or removed tag, or a tag carrying information different from the original, indicated a high possibility that electricity theft had occurred. The evaluation was based on an analysis of deployment cost; with a case study made on a utility company in China, the Return on Investment (ROI) was found to be >1. In [19], an Arduino-based real-time electricity theft detector was designed. The following hardware was used: an Arduino Uno, a GSM module, current sensors and an LCD. The Arduino Uno obtained measurements from current sensors, one located on the secondary side of the transformer and the other on the electric service cap. If the difference between the current sensors' measurements exceeded a set threshold, a message would be sent to the operator via the GSM module. The simulation was done using Proteus 8 software and the prototype was built on hardware, which was able to report theft cases when tested.
Apart from their inability to detect cyber attacks, these methods are also expensive due to their need for special hardware deployment and maintenance. Combined hardware and data-based electricity theft detection methods [20]- [22] employ the use of hardware, machine learning and/or deep learning techniques to tackle the electricity theft problem. Due to hardware requirements, these methods also pose the challenge of being expensive to deploy and maintain.
In [20], a method to measure the total consumption of a neighbourhood and compare the results with the usage reported by the smart meters in that neighbourhood was proposed. A significant difference between smart meters' and transformers' measurements would mean the presence of unfaithful customers in the neighbourhood. To locate the unfaithful customers in the neighbourhood, the authors proposed using a Support Vector Machine (SVM) classifier. The classifier was tested on a dataset of 5000 (all faithful) customers. A maximum detection rate of 94% and a minimum false positive rate of 11% were achieved.
Authors in [22] developed a predictive model to calculate Technical Losses (TLs); to get the NTL, the TLs would be subtracted from the total distribution network losses. Based on the assumption that distribution transformers and smart meters send data to the utility every 30 minutes, a smart meter simulator was used to generate data for 30 users in 30-minute intervals for 6 days. On the simulator, unfaithful users stole electricity by bypassing the meter, with stolen electricity varied between 1% and 10% of the total consumption. For stolen electricity above 4%, the detection rate was 100%, diminishing as the stolen percentage was decreased.
In [21], a method was proposed which uses an observer meter, installed on a pole away from households, to record the total amount of electricity supplied to n households where one or more meters are suspected to have been tampered with. The observer meter would have camera surveillance to protect it from being tampered with. A mathematical algorithm that utilizes data from the observer meter and the smart meters to detect a tampered smart meter was developed. The algorithm was tested on a real-world consumption dataset by increasing the consumption of some randomly picked meters, and it was able to detect the meters with altered consumption.
Due to the high costs demanded by the above categories, many researchers work on data-driven methods to overcome the electricity theft problem. For instance, the authors in [3] designed an electricity theft detection system by employing three algorithms in a pipeline: the Synthetic Minority Oversampling Technique (SMOTE), Kernel Principal Component Analysis (KPCA) and SVM. They used SMOTE to generate synthetic data for balancing an unbalanced dataset, KPCA to extract features and SVM for classification. They obtained a maximum overall classifier quality characterized by an Area Under the Curve (AUC) of 89% on validation.
Authors in [4] used a wide and deep Convolutional Neural Network (CNN) model to detect electricity theft. Based on the observation that normal electricity consumption is periodic while stolen electricity consumption is not, the wide component was used to learn multiple co-occurrences of features in 1-D series data, while the deep CNN was used to capture periodicity with data aligned in a 2-D manner by weeks. They varied training and validation data ratios to obtain a maximum AUC of 79%. By utilizing the same dataset used in [3] and [4], the method we present in this paper achieves AUC results beyond 90% on both validation and testing.
In [23], PCA was used to transform the original high-dimensional consumption data by extracting Principal Components (PCs) which retained the desired variance. An anomaly score parameter defined between set minimum and maximum thresholds was introduced. For each test sample, the anomaly score parameter was calculated; if the result was not between the set thresholds, the sample was treated as malicious. The true positive rate (TPR) was used to evaluate the method, which hit a best-recorded value of 90.9%. Authors in [24] used One-Class SVM (O-SVM), Cost-Sensitive SVM (CS-SVM), Optimum Path Forest (OPF) and the C4.5 tree. From customer consumption data, different features were selected, and the performance of each classifier was analyzed independently on a different set of features, followed by combining all classifiers for the best results. The best results were achieved when all classifiers were combined, with 86.2% accuracy.
Authors in [25] employed a combination of CNN and Long Short-Term Memory (LSTM) recurrent neural network deep learning techniques. Seven hidden layers were used, of which four were used by the CNN and three by the LSTM. This method relied on the CNN's automatic feature extraction ability on a given dataset. Features were extracted from 1-D time-series data. On model validation, the maximum accuracy achieved was 89%. The authors in [26] used a combination of the Local Outlier Factor (LOF) and k-means clustering to detect electricity theft. They used k-means clustering to analyze the load profiles of customers, and LOF to calculate the anomaly degrees of customers whose load profiles deviated from their cluster centres. On evaluation of the method, they attained an AUC of 81.5%. Our model achieves a maximum accuracy of 91.8% and an AUC of 97% on validation.
In [27], two electricity theft models were developed. The first model is based on the Light Gradient Boosting (LGB) classifier. A combination of SMOTE and Edited Nearest Neighbour (ENN) was used to balance the dataset. Feature extraction was done using AlexNet, followed by classification with LGB. This model was named SMOTEENN-AlexNet-LGB (SALM). The second model is based on the Adaptive Boosting classifier. A Conditional Wasserstein Generative Adversarial Network with gradient penalty (CWGAN-GP) was used to generate synthetic data resembling the minority class to balance the unbalanced classes. Feature extraction was performed using GoogleNet, followed by classification with AdaBoost. This model was named GAN-NetBoost. The models were evaluated with the SGCC data used in this work. SALM and GAN-NetBoost attained accuracies of 90% and 95%, and AUCs of 90.6% and 96% respectively, on validation.
Although these models achieved impressive results, their consideration of time-domain features alone limited their performance. Our solution shows that adding frequency-domain features to time-domain features improves classification performance.

III. PRELIMINARIES
In this section, we give a summary of the main techniques used, which are: Deep Neural Networks (DNNs), Principal Component Analysis (PCA), and Minimum Redundancy Maximum Relevance (mRMR).

A. DEEP NEURAL NETWORKS
Artificial Neural Networks (ANNs) are a class of machine learning techniques built to imitate biological human brain mechanisms [28], [29]. They are typically used for extracting patterns or detecting trends that are difficult to detect with other machine learning techniques [30].
They consist of multiple layers of nodes/neurons which are connected to subsequent layers [29]. A neuron is the basic element of a neural network, and originates from the McCulloch-Pitts neuron, a simplified model of a neuron in the human brain [31]. Figure 1 shows a model diagram of a neuron in the first layer following the input of the ANN.

FIGURE 1. First hidden layer neuron model (inputs, weights, bias and activation function).
A neuron consists of an activation function f, which takes a weighted sum of the real-valued input signals and gives a real-valued output y, given by Equation (1):

y = f(∑_i w_i x_i + b),    (1)

where x is the input vector, w is the weights vector and b is the bias [31]. Neural network nodes mimic the brain's neurons, while connection weights mimic the connections between neurons, which are unique for each connection [28], [29]. A neural network stores information in the form of its weights and biases.
The Deep Neural Networks (DNNs) concept originates from research on ANNs [32]. DNNs are characterized by two or more hidden layers [28]. They are able to learn more complex and abstract features than shallow ANNs [33]. Often in classification problems, the output layer is constructed such that one neuron represents a certain class [29]. All neural network layers are used to filter and learn the complicated features, except for the output layer, which classifies based on the learnt features [29], [34]. Before the development of DNNs, most machine learning techniques explored shallow architectures which commonly contain a single layer of non-linear transformation [32]. Examples of these architectures include SVMs, logistic regression and ANNs with one hidden layer.
DNNs have different architectures, which are used to solve different problems. Examples of DNN architectures include the feed-forward DNN, convolutional DNN and recurrent DNN. In this research work, a fully connected feed-forward DNN was used. The typical structure of a fully connected feed-forward DNN is shown in Figure 2.
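As a minimal sketch, a forward pass through a small fully connected feed-forward DNN can be written as follows. The layer sizes, random weights and the ReLU/softmax choice here are illustrative assumptions, not the architecture trained in this work:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

# Illustrative weights and biases of a tiny fully connected feed-forward DNN:
# input (4 features) -> hidden (8) -> hidden (8) -> output (2 classes)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 8)), np.zeros(8)
W3, b3 = rng.normal(size=(2, 8)), np.zeros(2)

def forward(x):
    h1 = relu(W1 @ x + b1)           # first hidden layer
    h2 = relu(W2 @ h1 + b2)          # second hidden layer
    return softmax(W3 @ h2 + b3)     # class probabilities

y = forward(rng.normal(size=4))
print(y)                             # two class probabilities summing to 1
```

For a given input, the output is fully determined by the stored weights and biases, matching the feed-forward property described below.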
The DNN given in Figure 2 has the following major parts:
• Input layer (x): the layer that comprises the input data features or representation.
• Hidden layers: the layers of neurons between the input and output layers. They are used to analyse the relationship between the input and output signals [30].
• Hidden weights: the weights of the connections between the hidden layers.
• Output weights: the weights between the last hidden layer and the output layer.
• Output layer (y): the last layer of a DNN. It gives the output of the network from the network inputs.
In a feed-forward architecture, computation is a sequence of operations on the output of a previous layer. The final operations generate the output. For a given input, the output stays the same; it does not depend on previous network inputs [33].

1) History of DNNs Development
[33] reports that ANNs were first proposed in the 1940s, and research on DNNs emerged in the 1960s. In 1989, the LeNet network, which used many digital neurons, was built for recognizing hand-written digits. Major breakthroughs were seen in the years beyond 2010, with examples such as Microsoft's speech recognition system, the AlexNet image recognition system, and DNN accelerator research such as Neuflow and DianNao brought into play.
The following reasons are reported by [30], [32], [33] as major contributors to DNNs' rapid development:
• Advancements in semiconductor devices and computer architecture, leading to parallel computing and lower costs of computer hardware.
• Huge amounts of data obtained by cloud providers and other businesses, providing large datasets that train DNNs effectively.
• Advances in machine learning and signal/information processing research, which lead to the evolution of techniques that improve accuracy and broaden the domain of DNN applications.
With present technology, DNNs can have more than a thousand layers [33].

2) DNN Training
A large dataset and high computational capability are the major requirements for training a DNN, since weight updates require multiple iterations [33]. The DNN training process is concerned with adjusting the weights between the neurons [30]. Through the training process, the DNN learns information from the data. Learning can be done in the following four major ways: supervised, semi-supervised, unsupervised or reinforcement learning [33]-[36].
In this work, supervised learning was used. The typical procedure for supervised learning in DNNs, as given by [28], [34], is as follows:
1) The network weights are initialized.
2) Labelled training data is fed forward through the network to produce an output.
3) The output error is calculated, and the weights are then adjusted with the aim of reducing the error.
4) Steps 2 and 3 are repeated for all training data.

3) Backpropagation
A loss function of a multi-layered ANN is composed of the weights of successive layers between the input and output layers [36]. Backpropagation uses the chain rule to obtain the gradient of the loss function as a summation of local gradient products over the node connections between the input and output layers [28], [29], [36]. Backpropagation algorithms typically use gradient-based optimization algorithms to update the neural network parameters at each layer [37].
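A minimal numerical sketch of backpropagation with gradient descent on a one-hidden-layer sigmoid network; the XOR toy task, the learning rate and all sizes here are illustrative assumptions, not taken from this work:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy task (illustrative only): learn XOR with one hidden layer
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

losses = []
for _ in range(5000):
    # forward pass
    H = sigmoid(X @ W1 + b1)              # hidden activations
    Y = sigmoid(H @ W2 + b2)              # network output
    losses.append(float(((Y - T) ** 2).mean()))
    # backward pass: chain rule through the sigmoid derivatives
    dY = (Y - T) * Y * (1 - Y)            # local gradient at the output layer
    dH = (dY @ W2.T) * H * (1 - H)        # gradient propagated to the hidden layer
    # gradient-descent updates of weights and biases
    W2 -= lr * H.T @ dY; b2 -= lr * dY.sum(0)
    W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(0)

print(losses[0], losses[-1])              # the loss decreases with training
```

Each update applies exactly the pattern described above: a local gradient (the sigmoid derivative) multiplied along the connections between layers.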

4) Activation functions
An activation function takes an input signal and, by simulating the response of a biological neuron, transforms it into an output signal which may be an input to another neuron [38], [39]. There are many activation functions, which can be generally divided into two kinds: linear and non-linear activation functions. The type of activation function used in a DNN plays a major role in the prediction accuracy of the model [39]. The selection of an activation function depends on considerations such as computational power, analytic flexibility and whether the desired output should be continuous or discrete [30]. Let z = ∑_i w_i x_i + b. Then Equation (1) can be re-written as shown in Equation (2):

y = f(z).    (2)
Linear activation functions
Linear activation functions have an activation that is directly proportional to the input. They can be expressed in the form of Equation (3):

f(z) = Cz,    (3)

where C is a constant. The output of the linear activation function is in the range (−∞, ∞) and its derivative is f′(z) = C. Since the gradient is not related to the input, the error cannot be minimized by the use of the gradient [40]. This activation function is normally used in regression problems [41].
Non-linear activation functions
Non-linear activation functions are widely used in DNNs because of their ability to adapt to data variations and differentiate outputs [40]. Among the many developed non-linear activation functions, the most popular are described as follows [38]-[41].
• Sigmoid activation function
The sigmoid activation function is given by Equation (4) and its derivative by Equation (5):

f(z) = 1 / (1 + e^{−z}),    (4)

f′(z) = f(z)(1 − f(z)).    (5)

The input z ∈ (−∞, ∞) and the activation f ∈ (0, 1). Due to the low computational cost of finding its derivative, this activation function is widely used in shallow neural networks. It is rarely used in DNNs' hidden layers because of its soft saturation property, which delays convergence during training.
• Hyperbolic tangent activation function
Like the sigmoid, the hyperbolic tangent is continuous and differentiable everywhere. It is given by Equation (6) and its derivative by Equation (7):

f(z) = (e^z − e^{−z}) / (e^z + e^{−z}),    (6)

f′(z) = 1 − f(z)^2.    (7)
The input z ∈ (−∞, ∞) and the activation f ∈ (−1, 1). Using the hyperbolic tangent for activation makes neural networks converge faster than using the sigmoid; therefore the hyperbolic tangent is preferred over the sigmoid.

• Rectified linear unit activation function
The Rectified Linear Unit (ReLU) activation function is given by Equation (8) and its derivative by Equation (9):

f(z) = max(0, z),    (8)

f′(z) = 1 if z > 0, and 0 if z < 0.    (9)
Compared to the sigmoid and hyperbolic tangent activation functions, ReLU is the simplest and the most commonly used in DNNs because of its good property of being close to linear, hence better convergence. It is more efficient since it activates fewer neurons at the same time. For z > 0, its gradient is constant, thereby avoiding the vanishing gradient problem. Its gradient is also cheaper to compute, as there are no calculations that involve exponents.
• Softmax activation function
The softmax activation function is given by Equation (10):

f(z_j) = e^{z_j} / ∑_{k=1}^{K} e^{z_k},    (10)
where K is the number of classes. Softmax is typically used in the output layer of a DNN for classification purposes. The output of the softmax is the probability of a particular class j; therefore, if the softmax activation function is used in the output layer, all of the output layer activations sum to 1.
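The activation functions above can be sketched in a few lines of NumPy; the test values are illustrative and the equation numbers refer to the text:

```python
import numpy as np

def linear(z, C=1.0):
    return C * z                      # Equation (3): output range (-inf, inf)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # Equation (4): output in (0, 1)

def tanh(z):
    return np.tanh(z)                 # Equation (6): output in (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # Equation (8): zero for z < 0

def softmax(z):
    e = np.exp(z - np.max(z))         # shift by the max for numerical stability
    return e / e.sum()                # Equation (10): probabilities over K classes

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))                        # negative inputs clamped to zero
print(softmax(z))                     # three probabilities summing to 1
```

Note the stability shift inside `softmax`: subtracting the maximum before exponentiating leaves the result unchanged mathematically while avoiding overflow for large inputs.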

B. PRINCIPAL COMPONENT ANALYSIS
PCA [42] is used to extract important information from a data table of inter-correlated features/variables that represent observations. This extracted information is represented as a new set of orthogonal variables known as Principal Components (PCs). In this work, PCA uses the Singular Value Decomposition (SVD) algorithm [43], which works in the following manner: for an input feature matrix X, SVD decomposes it into three matrices, i.e., X = PER^T, such that:
• P contains the normalized eigenvectors of the matrix XX^T,
• E is a diagonal matrix whose entries are the square roots of the eigenvalues of XX^T, and
• R contains the normalized eigenvectors of the matrix X^T X.
When PCA is applied to a matrix X of size m × n, n PCs {c_i}_{i=1}^n are obtained, which are ordered in descending order with respect to their variances [23]. The PC at position p is given by the projection Xc_p, and its variance is obtained by evaluating ||Xc_p||^2.
The main goals achieved with PCA are as follows: • Extraction of most important information from data/feature table, thereby compressing and simplifying dataset description, and • Analysis of observations and variables' structure.
For dimensionality reduction purposes, the first r ≤ n PCs that retain acceptable variance can accurately represent feature matrix X in a reduced r-dimensional subspace.
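The SVD-based procedure above can be sketched as follows; the random data matrix and the 95% variance threshold are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)               # centre the features before PCA

# SVD: X = P E R^T, with the rows of Rt holding the eigenvectors of X^T X
P, E, Rt = np.linalg.svd(X, full_matrices=False)

# PC variances ||X c_p||^2 / (m - 1), ordered in descending order
variances = (E ** 2) / (X.shape[0] - 1)
scores = X @ Rt.T                    # project the data onto the PCs

# Keep the first r PCs that retain, e.g., 95% of the total variance
ratio = np.cumsum(variances) / variances.sum()
r = int(np.searchsorted(ratio, 0.95)) + 1
X_reduced = scores[:, :r]
print(r, X_reduced.shape)
```

Because `np.linalg.svd` returns singular values in descending order, the PCs come out already sorted by variance, as the text requires.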

C. MINIMUM REDUNDANCY MAXIMUM RELEVANCE
An mRMR [44], [45] is a feature selection scheme that selects features that have a high correlation with the response variable and a low correlation with each other. It ranks features based on the mutual information between a feature and the response variable, and the pairwise mutual information between features. The mutual information between variables A and B is given by

I(A; B) = ∑_{a,b} p(a, b) log( p(a, b) / (p(a) p(b)) ).

For a feature set X = {X_i}, maximum relevance R_l is implemented using the mean value of the features' mutual information with the output class O, i.e.,

R_l = (1/|X|) ∑_{X_i ∈ X} I(X_i; O).

Minimum redundancy R_d helps to select features that are mutually maximally dissimilar. It is given by

R_d = (1/|X|^2) ∑_{X_i, X_j ∈ X} I(X_i; X_j),

where X_i, X_j ∈ X. The mRMR feature selection goal is achieved by optimizing relevance and redundancy jointly: max(R_l − R_d).

IV. DNN-BASED ELECTRICITY THEFT DETECTION METHOD
The electricity theft detection method outlined in this section consists of the following three steps: Data Analysis and Preprocessing, Feature Extraction, and Classification. Figure 3 shows the workflow diagram.

A. DATA ANALYSIS AND PRE-PROCESSING
In this sub-section, we present the dataset used and its quality improvement by identifying and removing observations that had no consumption data. In this work, an observation refers to a single instance/record in the dataset covering the whole duration of measured consumption; i.e., given a dataset A of size N, each observation a ∈ A is a vector of daily consumption values. We show customers' load profile analysis. We further present the data interpolation and synthetic data generation processes that were undertaken. As stated in Section I, we used a realistic electricity consumption dataset released by SGCC, which is accessible at [12]. The dataset consists of daily electricity consumption data taken from January 2014 to October 2016, summarized in Table I. The sampling rate of the data is uniform for every customer: one measurement per day, which corresponds to the total power consumption for that day. The dataset consists of 42372 observations, of which 3615 are electricity consumption data of unfaithful customers and the remaining observations are electricity consumption data of faithful customers.

1) Dataset Analysis and Preparation
As with many datasets used in the literature, the data comes with many errors caused by factors such as smart meter failures, data storage problems, data transmission issues and unscheduled system maintenance [4]. The dataset used in this work is no exception: it contains traces of non-numerical or null values. Using data analysis methods, we found approximately 5.45% of observations in this dataset to have only null values, only zeros, or a combination of both, for the whole duration of 1034 days. These observations were regarded as empty observations; i.e., an observation a is regarded as empty if a_i = 0 or a_i ∉ R for all a_i ∈ a. These observations do not have any differentiating characteristics between the classes, since they do not have any consumed electricity record greater than 0 kWh. Although they were labeled with either of the classes, they could not be meaningfully identified with any class, therefore they were discarded to improve the dataset quality. The third column of Table I shows a summary of the observations left after the removal of empty observations. Figure 4 shows line plots of the consumption data of a faithful customer and an unfaithful customer against the consumption days, for a duration of three months. Comparing the two graphs, we observed that the consumption behaviour of the honest customer is mostly uniform and has a predictable trend, while the electricity thief's consumption behaviour takes different forms and is not predictable. We further carried out histogram analyses for both classes of customers, as shown in Figure 5.
From the histograms shown, we observe that for the faithful customer's consumption data, the statistical parameters mean, mode and median are generally closer to the histogram centre than for the unfaithful customer's consumption data. We did a similar analysis for many customers and found that this observation holds for most of the dataset. From these observations, we argue that by defining outliers as values beyond three Median Absolute Deviations (MAD), honest customers can be characterized as having a smaller percentage of outliers in a given data record than unfaithful customers.
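The MAD-based outlier characterization above can be sketched as follows; the two synthetic load profiles are illustrative assumptions, not samples from the dataset:

```python
import numpy as np

def outlier_fraction(consumption, k=3.0):
    """Fraction of readings lying beyond k Median Absolute Deviations."""
    x = np.asarray(consumption, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    if mad == 0:
        return 0.0                    # constant series has no outliers
    return float(np.mean(np.abs(x - med) > k * mad))

rng = np.random.default_rng(0)
steady = 5.0 + 0.5 * rng.normal(size=365)      # uniform, predictable usage
erratic = 5.0 * rng.exponential(size=365)      # spiky, unpredictable usage
print(outlier_fraction(steady), outlier_fraction(erratic))
```

The erratic, heavy-tailed profile yields a noticeably larger outlier fraction than the steady one, mirroring the faithful/unfaithful distinction argued above.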

2) Data Interpolation
For all observations consisting of a combination of null or non-numerical values and real consumption values, the data was interpolated. The Piecewise Cubic Hermite Interpolating Polynomial (PCHIP) [46] was used to fill in missing data while preserving consumption patterns. A cubic Hermite interpolating polynomial H(x) is a shape-preserving interpolant which preserves data monotonicity on each sub-interval x_i ≤ x ≤ x_{i+1} it is applied to. For consumption vectors containing NaN values at the beginning, the mean of the raw data was evaluated excluding NaN values and inserted as the first vector element; the rest of the elements were filled in using PCHIP. This helped to maintain the consumption shape and avoided adding outliers to the data. Figure 6 shows an example of one observation, taken randomly, before and after interpolation. A consumption duration of 200 days around the days with missing consumption data is shown for clear presentation. The interpolated data points make a smooth curve that lies between the minimum and maximum of the neighbouring points with no overshooting, as can be seen from Figure 6b. In this manner, the consumption data is preserved from the addition of outliers and of data points that could make the interpolated data pattern resemble the consumption pattern of the minority class of unfaithful customers, such as that shown in Figure 4b.
FIGURE 6. Plots of consumption data (a) before and (b) after interpolation.
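The gap-filling step can be sketched with SciPy's shape-preserving PCHIP interpolant; the short consumption vector below is an illustrative assumption:

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

# Toy daily-consumption vector with missing (NaN) readings
consumption = np.array([5.1, 5.3, np.nan, np.nan, 4.9, 5.0, np.nan, 5.2])

x = np.arange(consumption.size)
known = ~np.isnan(consumption)

# If the series starts with NaN, seed the first value with the mean of the
# known readings, as described above
if not known[0]:
    consumption[0] = np.nanmean(consumption)
    known[0] = True

# Shape-preserving cubic Hermite interpolation over the gaps
pchip = PchipInterpolator(x[known], consumption[known])
filled = np.where(known, consumption, pchip(x))
print(filled)
```

Because PCHIP preserves monotonicity on each sub-interval, the filled values stay between the neighbouring known readings with no overshoot, which is exactly the property used to avoid injecting outliers.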

3) Synthetic Data Generation
After eliminating empty observations and interpolating data, we carried out an initial classification process. Experimenting with the dataset as is, we observed that the classifier classified faithful customers satisfactorily but performed badly on unfaithful customers due to a class imbalance problem [20], [25]. A class imbalance problem is a situation whereby the number of observations in one class is much greater than the number of observations in the other class. In a class-imbalanced problem, classification models classify the majority class successfully while performing badly on the minority class [25]. The dataset used in this work contains far more faithful customers than unfaithful ones. We solved the class imbalance problem in the following manner:
1) Define q and r as the numbers of faithful and unfaithful customers respectively and evaluate the difference p = q − r.
2) From the set of faithful customers' observations, randomly select p observations, represented by the p × 1034 matrix O defined by Equation (15).
3) Inspired by [20] and the dataset analysis observations in IV-A, evaluate the synthetic observations O_s by the Hadamard product in Equation (16), O_s = O ∘ C,
where C is a p × 1034 matrix of randomly generated numbers with elements between 0 and 1. This distorts the consumption patterns observed in the faithful customers' consumption line plots shown in IV-A; hence the result better represents unfaithful customers' consumption data. This approach to generating synthetic data is cheap and fast, as it uses the available data of the faithful customers' class to generate data for the opposite class. It involves a single operation on the measured data: multiplication of the measured data by a matrix of randomly generated numbers. The resulting data was added to the original dataset and labeled as belonging to the unfaithful customers' consumption class. The fourth column in Table I summarizes the observations after synthetic data generation.
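The three steps above can be sketched as follows (function name and NumPy usage are assumptions; the matrices O and C correspond to the paper's Equations (15) and (16)):

```python
import numpy as np

def generate_synthetic_theft(honest, p, rng=None):
    """Sample p rows from the honest-customer observation matrix and
    distort them element-wise (Hadamard product) with uniform random
    factors in [0, 1), yielding synthetic 'unfaithful' observations."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(honest.shape[0], size=p, replace=False)
    O = honest[idx]                # p selected honest observations
    C = rng.random(O.shape)        # random factors in [0, 1)
    return O * C                   # O_s = O ∘ C
```

Because every factor is below 1, each synthetic value stays non-negative and no larger than the measured value it was derived from, while the day-to-day pattern is broken.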

B. FEATURE EXTRACTION
Electricity consumption data used in this project is univariate time-series data. A univariate measurement is a single measurement frequently taken over time [47]. For solving classification problems, data can be represented by its features (properties), which can then be fed as input to the classifier, as is the case in [29], [34] and [48]. Data is classified based on the similarity between features [47] given a dataset of different samples. In this work, time-domain and frequency-domain features were extracted and used as input to a deep neural network for classification. Classification performance was compared between time-domain features, frequency-domain features and combined features from both domains.

1) Time-domain Feature Extraction
As shown in IV-A, faithful and unfaithful customers' consumption data differ clearly in their consumption patterns, as shown by the line plots and histograms. Based on this information, the time-domain features stipulated in Table II can collectively be used to differentiate between the two classes of customers. Apart from the observation that faithful customers' consumption roughly follows a predictable pattern while unfaithful customers' consumption behaviour is not predictable, as shown in Figure 4, customers do not consume an equal amount of energy per given time. Energy needs differ between customers for various reasons, such as the number of appliances used, the kind of appliances per household, household size, etc. To achieve higher accuracy in classifying features, all observations are made to fit within the same axes. This is achieved by normalizing the data for each observation using the Min-Max method [49] given by Equation (17). The Min-Max method shrinks the data between 0 and 1 while keeping the original consumption pattern.
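The Min-Max rescaling of Equation (17) is the standard mapping to [0, 1]; a small illustrative helper (the zero-range guard is an assumption, not in the paper):

```python
import numpy as np

def min_max_normalize(x):
    """Rescale a consumption vector to [0, 1] while preserving its shape."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        # Constant series: no spread to normalize, return zeros.
        return np.zeros_like(x)
    return (x - lo) / (hi - lo)
```

Since the transform is affine, relative differences between days are preserved; only the scale is removed, putting all customers on the same axes.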

2) Frequency-domain Feature Extraction
The Fourier theorem states that a periodic signal x(t) can be represented by a summation of complex sinusoidal signals with frequencies that are integer multiples of the fundamental frequency f_T [50]. Using the Fourier theorem, the consumption data shown in IV-A can be viewed as a time-series signal and transformed into the frequency domain using the Fourier transform. From this frequency-domain representation, we extracted frequency-domain features from each observation. Since neural networks are sensitive to diverse input data, the features were normalized after extraction using Equation (17) so that they could be fed as input to the classifier. Table II shows the features extracted from both domains.
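As an illustration, frequency-domain features can be computed from the power spectrum of a consumption series; the two features below (mean and peak frequency) are examples only, not the paper's full Table II list:

```python
import numpy as np

def frequency_features(signal, fs=1.0):
    """Example frequency-domain features from the one-sided power
    spectrum of a real signal sampled at rate fs (samples per day
    for daily consumption data)."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2          # power spectrum
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)     # bin frequencies
    total = spectrum.sum()
    mean_freq = float((freqs * spectrum).sum() / total)  # spectral centroid
    peak_freq = float(freqs[spectrum.argmax()])          # dominant frequency
    return {"mean_frequency": mean_freq, "peak_frequency": peak_freq}
```

For a regular (faithful) consumption pattern the power concentrates in a few low-frequency bins, whereas irregular tampered series spread power across the spectrum, shifting such features.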

1) Network Architecture
A fully connected feed-forward DNN architecture shown in Figure 7 was used for the classification process.
In order to avoid network underfitting and overfitting [35], the following rule-of-thumb methods [35], [51] were considered in the design of the hidden layers of the deep neural network classifier shown in Figure 7. The Rectified Linear Unit (ReLU) activation function was used in the hidden neurons because of its better convergence properties in comparison to other activation functions [28].
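A toy forward pass for a fully connected feed-forward network with ReLU hidden layers can be sketched as below; the layer sizes, He initialization and softmax output are assumptions for illustration, not the exact Figure 7 architecture:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def init_dnn(layer_sizes, rng=None):
    """Create (weights, biases) per layer, e.g. [20, 16, 8, 2] for a
    20-feature input, two hidden layers and a two-class output."""
    rng = np.random.default_rng(rng)
    return [(rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
            for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(params, x):
    """ReLU in the hidden layers, softmax over the two classes
    (faithful / unfaithful) at the output."""
    a = x
    for W, b in params[:-1]:
        a = relu(a @ W + b)
    W, b = params[-1]
    logits = a @ W + b
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

Each row of the output sums to 1 and can be thresholded to assign a class label.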

2) Training
The maximum number of training iterations was limited to 1000. The classification approach was split into four parts. In the first part, only time-domain features were used for classification. In the second part, only frequency-domain features were used. The third part used combined features from both domains, while in the last part classification was performed in a reduced feature space by incorporating PCA.
A holdout validation scheme was used as follows: in all procedures, as a rule of thumb, 80% of the whole data was used for training and validation, while 20% was used for testing. Within the training and validation data, 80% was used for training and 20% for validation. Similar results were obtained when using a k-fold cross-validation scheme with k = 5; see [52] for an example of this scheme.
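The nested 80/20 split sizes can be computed as follows (an illustrative helper; only the split ratios come from the text):

```python
def holdout_split_sizes(n):
    """Nested holdout: 20% of all data for test, then 20% of the
    remainder for validation, the rest for training."""
    n_test = round(0.2 * n)
    n_trainval = n - n_test
    n_val = round(0.2 * n_trainval)
    n_train = n_trainval - n_val
    return n_train, n_val, n_test
```

For example, 1000 observations yield 640 training, 160 validation and 200 test samples, i.e. 64/16/20 percent of the whole dataset.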
3) Performance Metrics
In the following, TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives respectively.
Recall/True Positive Rate (TPR): the fraction of positive examples that are correctly labeled. It is given by:
Recall = TP / (TP + FN)
Precision/Positive Predictive Value (PPV): the fraction of examples classified as positive that are truly positive. It is given by:
Precision = TP / (TP + FP)
F1-Score: shows the balance between precision and recall. It is given by:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Accuracy: the fraction of predictions classified correctly by the model. It is given by:
Accuracy = Number of correct predictions / Total number of predictions
Matthews Correlation Coefficient (MCC): a single figure that measures a binary classifier's performance. Its value ranges from -1 to +1, with values closer to +1 signifying good performance and values closer to -1 signifying bad performance. MCC is given by:
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
Area Under the Curve (AUC): measures the classifier's overall quality. Larger AUC values indicate better classifier performance.
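These standard metrics can be computed directly from the confusion-matrix counts; a small sketch (the function name is illustrative):

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Recall, precision, F1, accuracy and MCC from confusion-matrix
    counts of a binary classifier."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"recall": recall, "precision": precision,
            "f1": f1, "accuracy": accuracy, "mcc": mcc}
```

AUC is omitted here since it requires the classifier's ranked scores rather than a single confusion matrix.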

4) Hyperparameters Optimization
To achieve the best classification performance in a reasonable amount of time, we used the Bayesian optimization method [57] to tune the following hyperparameters: number of hidden layers, size of each layer, regularization strength and activation function. Bayesian optimization is derived from Bayes' theorem, which states that for events A and B, P(A|B) = P(B|A)P(A)/P(B). This optimization method models the distribution of hyperparameters by assuming that the optimization function obeys a Gaussian distribution. To find the best combination of hyperparameters, 100 optimization steps were conducted. The resulting optimized network was trained and tested in the same manner as the network in Figure 7.
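As a stand-in sketch of the hyperparameter search loop (the paper uses a Gaussian-process Bayesian optimizer; plain random search is substituted here for brevity, and the search-space values are illustrative assumptions):

```python
import random

# Hypothetical search space over the four tuned hyperparameters.
SEARCH_SPACE = {
    "num_hidden_layers": [1, 2, 3],
    "layer_size": [8, 16, 32, 64],
    "regularization_strength": [1e-6, 1e-4, 1e-2],
    "activation": ["relu", "tanh"],
}

def random_search(objective, steps=100, seed=0):
    """Sample hyperparameter combinations and keep the one with the
    lowest objective value (e.g. validation error)."""
    rng = random.Random(seed)
    best = None
    for _ in range(steps):
        cand = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        score = objective(cand)
        if best is None or score < best[0]:
            best = (score, cand)
    return best
```

A Bayesian optimizer would replace the uniform sampling with a surrogate-model-guided choice of the next candidate, typically needing far fewer evaluations.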

5) Impact of Key Parameters Investigation
Using the adaptive moment estimation (Adam) optimizer [58], the impact of the following three key parameters on the optimized network was investigated: initial learning rate, minibatch size and L2-regularization parameter. Data was divided into two parts: training data and validation data.
The volume of the training and validation/test data plays an important role in classification success. The higher the correlation between input features and the class label, the less data is needed for training [59]. However, given a dataset, a training data portion of less than 50% is not advised, as it will negatively affect the test results [59]. With this in mind, we determined the parameters' impact with different training data percentages.
We carried out the following procedure for 60%, 70% and 80% training data portions. For each parameter, its impact was investigated by determining training and validation accuracies over a range of parameter values. Parameters were varied logarithmically in 100 steps between the initial and final values. For each step, the number of training epochs was limited to 30. While a given parameter was varied, the other parameters were held at fixed values. Table III shows the investigated parameters' initial, step, final and fixed values.
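The logarithmic sweep between an initial and final value can be sketched as (an illustrative helper, equivalent to NumPy's logspace):

```python
import math

def log_sweep(start, stop, steps=100):
    """Return `steps` values spaced logarithmically from start to stop,
    both endpoints included."""
    lo, hi = math.log10(start), math.log10(stop)
    step = (hi - lo) / (steps - 1)
    return [10 ** (lo + i * step) for i in range(steps)]
```

For example, sweeping the initial learning rate from 10^-5 to 10^-2 in 100 steps multiplies the value by a constant factor of 10^(3/99) at each step.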

V. RESULTS AND DISCUSSION
In this section, we show and discuss the experimental results. In Section V-A, we present results obtained before synthetic data generation. In Section V-B, we compare classification performance when using time-domain features, frequency-domain features and combined features from both domains as inputs to the classifier. We analyze the impact of PCA dimensionality reduction on the experimental results in Section V-C. We present Bayesian optimization results, as well as the best results attained with the optimized classifier, in Section V-D, and we finally present an investigation of the optimal settings of key parameters for best classification performance, varied using the Adam optimizer, in Section V-E.

A. VALIDATION RESULTS BEFORE SYNTHETIC DATA GENERATION
As stated in Section IV, when there was an imbalance in the number of observations between the two classes, the classifier performed badly on the class with a relatively lower number of observations. The classifier shown in Figure 7 was trained with features extracted from the original dataset with no augmented synthetic data. 80% of the data was used for training while 20% was used for validation. The third column of Table IV shows the validation results. For the faithful customers' class, validation results are much better than for the unfaithful class, as can be seen by comparing the two classes' recall, precision and F1-score.
Compared with the validation results in combined domains before the incorporation of PCA, there was no significant change in recall, precision and F1-score for the faithful customers' class, since the difference in corresponding values was within a 1% margin. However, for the unfaithful class, which was the minority class, validation results in terms of recall, precision and F1-score were poor before balancing the classes, and a significant improvement was obtained after balancing. This shows that the classifier's sensitivity to the minority class was much lower than its sensitivity to the majority class.
The subsequent subsections show the results which were obtained after augmenting synthetic data to the original dataset to balance classes.

B. DIFFERENT DOMAINS FEATURES' CONTRIBUTION ANALYSIS
To ensure the reliability and robustness of the method introduced in this work, we present experimental results based on widely accepted performance metrics, summarized in Table IV. To simplify the analysis, the classification performance of time-domain, frequency-domain and combined features from both domains is graphically presented in Figure 8.
From Table IV and Figure 8, it can be seen that classification with time-domain features gave impressive validation and test results for both faithful and unfaithful customers' classes. The best results were obtained when all features from both domains were combined. For example, on validation, accuracy was 87.5% with time-domain features, improved to 89.9% with frequency-domain features, and finally reached 91.1% when all features from both domains were used. The red trend line in the Figure 8 graphs portrays this significant improvement across the experiments with time-domain features, frequency-domain features and all features from both domains. The improvement can be explained by the bar chart of predictors, ordered by prominence, shown in Figure 9, which was produced through the mRMR scheme. As shown by the Figure 9 bar chart, there are more frequency-domain features on the left of the chart (i.e., among the features with the best scores) than time-domain features, with mean frequency achieving the highest predictor score. We verified the feature ranking produced by the mRMR scheme by performing classification tasks using the top 3, middle 3 and bottom 3 features on the same network of Figure 7. The Figure 10 bar chart shows the resulting classification accuracy and AUC-ROC.
Comparing the results in Figure 10, we observed that accuracy and AUC-ROC are best for the top 3 features and worst for the bottom 3 features, as expected. MCC was determined in the last experiment, when all features were combined. Its values were 0.84 and 0.75 on validation and test respectively, which are closer to +1 than to -1. AUC-ROC values were 97% and 93% on validation and test respectively. These results portray a satisfactory overall classification performance.

C. ANALYSIS OF COMPONENTS REDUCTION WITH PCA
When PCA was incorporated with the component-reduction criterion of retaining enough components to explain 95% of the data variance, from Figure 11 we observed that frequency-domain features contributed more to the principal components. This was also confirmed by the feature importance score analysis shown in Figure 9, based on the mRMR scheme. The last two columns of Table IV show both the validation and test results obtained after component reduction with PCA. We observed that with just seven principal components, we achieved results very close to those obtained when no feature reduction criterion was used.
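The 95%-variance reduction criterion can be sketched via the SVD of the centered feature matrix (the function name and implementation are assumptions, not the paper's code):

```python
import numpy as np

def pca_reduce(X, var_target=0.95):
    """Project X onto the fewest principal components whose cumulative
    explained variance reaches var_target. Returns (projected X, k)."""
    Xc = X - X.mean(axis=0)                      # center features
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    var_ratio = (S ** 2) / (S ** 2).sum()        # variance per component
    k = int(np.searchsorted(np.cumsum(var_ratio), var_target)) + 1
    return Xc @ Vt[:k].T, k
```

Applied to the 20 extracted features, such a criterion selected seven components in the paper's experiments.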

D. HYPERPARAMETERS OPTIMIZATION RESULTS
Following the hyperparameter optimization procedure stipulated in Section IV-C4, Figure 12 shows the observed objective function values versus optimization steps. The best hyperparameter combination was obtained at the 26th optimization step and remained unchanged until the 100th step; the values are shown in Table V. An improved classification network constructed with the optimized hyperparameters achieved maximum validation and test accuracies of 91.8% and 88.1% respectively, which are 0.7% and 0.8% higher than those of the unoptimized architecture. The classifier obtained a maximum AUC-ROC value of 97%.

E. KEY PARAMETERS' IMPACT ANALYSIS
1) Impact of initial learning rate
To determine the impact of the initial learning rate on training and validation accuracies, the initial learning rate was varied between 10^-5 and 10^-2 in 100 steps. Figure 13 shows scatter plots of the results with fitted curves to simplify the analysis. For all tested training data portions, training and validation accuracies were lowest for the lowest initial learning rates, with recorded values below 90%. Significant improvement in both accuracies was seen at higher initial learning rate values.

2) Impact of minibatch size
To determine the impact of the minibatch size on the accuracy, the minibatch size was varied between 10^1 and 10^5 in 100 steps. We present plots of training and validation accuracy versus minibatch size in Figure 14.
FIGURE 14. Impact of varying minibatch size on accuracy at different training ratios: (a) training accuracy, (b) validation accuracy, each for 60%, 70% and 80% training data.
For all tested training data portions, the average training and validation accuracies were slightly above 90% for minibatch sizes below 10^3. For minibatch sizes close to 10^1, the training accuracy varied significantly between 80% and 100% for each training task; however, this did not affect validation, as validation accuracy stayed just above 90%. Both training and validation accuracies declined drastically as the minibatch size increased beyond 10^4, because with larger minibatch sizes the model had to learn from a larger data chunk per update, resulting in poor generalization. However, smaller minibatch sizes required relatively more time to train the model. A minibatch size less than but close to 10^3 is recommended to balance efficiency and generalization.
3) Impact of L2-regularization parameter
To determine the impact of the L2-regularization parameter on validation accuracy, the L2-regularization parameter was varied between 10^-8 and 10^-2 in 100 steps. Figure 15 shows the results. For all training data portions, training accuracy lay between 83% and 99%, with an average value of around 91%, for L2-regularization parameter values in the range [10^-8, 10^-4). Unstable average training accuracy was observed for L2-regularization parameter values ≥ 10^-4. On the other hand, validation accuracy decreased significantly for L2-regularization parameter values ≥ 10^-4.
This may be caused by the fact that when the L2-regularization parameter is ≥ 10^-4, at each training iteration a significantly large number of weights is left not updated, making it hard for the model to converge to a good solution. The best results were obtained when L2-regularization parameter values were in the range [10^-8, 10^-4). For all investigated parameters, the best validation accuracy was obtained for the 80% training data portion, followed by the 70% portion and lastly the 60% portion. This shows that the more data is available for training the model, the more accurate the model becomes in detecting electricity theft.

F. COMPARISON WITH EXISTING DATA-BASED ELECTRICITY THEFT DETECTION METHODS
Based on electricity customers' consumption data, different data-driven methods have been used to tackle the electricity theft problem. Due to the scarcity of datasets containing both faithful and unfaithful customers' consumption data, many methods have been evaluated on different, uncommon datasets. In Table VI, we present an analysis of the differences between our work and recent works in the literature. For each work, the dataset details are given. We look at the techniques and/or algorithms used, as well as the features extracted from the data in the respective methods.
For the four methods which used the same dataset as ours (References [3], [4], [27]), we compare the results in terms of AUC and accuracy percentages. We obtained an AUC that is 1% higher than the best AUC in the benchmark and an accuracy that is the second best. The results show that our work is very competitive with other recently proposed methods.

VI. CONCLUSION
In this work, the detection of electricity theft in smart grids was investigated using time-domain and frequency-domain features in a DNN-based classification approach. Isolated classification tasks based on time-domain, frequency-domain and combined-domain features were investigated on the same DNN network. Widely accepted performance metrics such as recall, precision, F1-score, accuracy, AUC-ROC and MCC were used to measure the performance of the model. We observed that classification with frequency-domain features outperforms classification with time-domain features, and that classification with combined features from both domains outperforms both.
The classifier was able to achieve 87.3% accuracy and 93% AUC-ROC when tested. We used PCA for feature reduction. With 7 out of 20 components used, the classifier was able to achieve 85.8% accuracy and 92% AUC-ROC when tested. We further analyzed individual features' contribution to the classification task and confirmed with the mRMR algorithm the importance of frequency-domain features over time-domain features towards a successful classification task. For better performance, a Bayesian optimizer was also used to optimize hyperparameters, which realized accuracy improvement close to 1%, on validation. Adam optimizer was incorporated and optimal values of key parameters were investigated.
In comparison with other data-driven methods evaluated on the same dataset, we obtained 97% AUC, which is 1% higher than the best AUC in existing works, and 91.8% accuracy, which is the second best on the benchmark. The method used here exploits consumption data patterns; apart from its application in power distribution networks, it can be used in anomaly detection applications in any field. Our work brings a modest contribution towards accurately detecting energy theft, as our method detects only theft that has already taken place over time.
We wish to extend our method to detect real-time electricity theft in the future. Since this method was evaluated based on consumption patterns of SGCC customers, it can further be validated against datasets from different areas to ensure its applicability anywhere.