Non-technical Loss Detection Using Deep Reinforcement Learning for Feature Cost Efficiency and Imbalanced Dataset

One of the problems of the electricity grid system is electricity loss due to energy theft, known as non-technical loss (NTL). Unexpected electricity losses threaten the sustainability and stability of the grid system. Energy theft detection based on data analysis is one solution to alleviate the drawbacks of NTL. The main difficulty of data-based NTL detection is that the collected electricity usage dataset is imbalanced. In this paper, we approach the NTL detection problem using deep reinforcement learning (DRL) to solve the class imbalance problem of NTL data. The advantage of the proposed method is that the classifier operates on partial input features without a separate pre-processing step for input feature selection. Moreover, unlike conventional NTL detection algorithms, no extra pre-processing step to balance the dataset is necessary. Simulation results show that the proposed method outperforms the conventional algorithms under various simulation environments.


I. INTRODUCTION
Since the advent of advanced metering infrastructure (AMI) in the smart grid (SG), electricity consumption information of users is analyzed at the data centers of power utilities. Malfunction detection, demand forecasting, and electricity overload detection are applications that the utilities provide to grid users, all derived from analysis of the electricity data. Moreover, AMI has enabled the utilities to detect and report non-technical loss (NTL) in their power transmission and distribution (TD) networks [1]. Unlike technical loss (TL), which arises in the TD process as power transmission loss, NTL is caused by illegal actions of consumers or malfunctions of metering devices. The malicious actions to exploit energy from the utilities are known as energy theft or electricity theft. These threats degrade the overall quality of distributed power and the stability of the grid system. In particular, NTL caused by fraudulent behavior imposes a direct economic loss on the power utilities and legal users. Such losses have been reported in various countries including Turkey [2], Jamaica [3], and India [4]. A wide range of approaches has been taken to mitigate the NTL problem.
In [6], the solutions for NTL problems are broadly categorized into three groups: theoretical studies, hardware solutions, and non-hardware solutions. Theoretical studies are based on socio-economic and demographic knowledge. Hardware solutions are the approaches of employing metering equipment and constructing different electricity network frameworks or infrastructures. Non-hardware solutions are the methods to solve NTL problems with the collected data from the grid system.
In [7], non-hardware solutions are divided into three subcategories: network-oriented methods, data-oriented methods, and hybrid methods. The standard of dividing each subcategory is defined by the data to be analyzed.
Network-oriented methods observe the presence of deceiving behavior in the TD network with the aid of network configurations or network topologies. They are implemented with extra network units to control electricity flows, as observed in various studies such as the deployment of feeder remote terminal units with dynamic programming to build network configurations [8] and the deployment of intermediate monitor meters for load flow control [9]. Data-oriented methods examine the personal information of people connected to the TD line of the AMI; the electricity data are energy consumption records or client registration information. Hybrid methods detect theft by combining data and network methods, as in studies on observer meters with support vector machine (SVM)-based data analysis [10] and an NTL detection system adopting voltage sensitivity analysis, power system optimization, and SVM [11].
Adaptation of new processes is continuously necessary for network-oriented methods because the infrastructure of the electricity grid system and the amount of electricity demand keep changing [12]. On the other hand, data-oriented methods are more usable and flexible approaches to NTL detection because they are unconstrained by hardware upgrades.
Data-oriented methods are implemented with traditional machine learning algorithms [28]. However, in previous articles, data pre-processing steps are required to balance the dataset unless a balanced dataset is used. In most of the previous works, this problem is solved by data augmentation methods such as oversampling or generating synthetic data, as in [5], [13], [18], [28]. Moreover, the previous solutions assume that all the gathered features of the AMI are usable at the instant of theft detection without additional cost [13]. Feature cost efficiency (FCE) is an important problem in classification tasks, as discussed in [20].
In this paper, a DRL-based NTL detection method is applied to use partial input features for classification. The proposed method is a data-oriented method that identifies the existence of NTL from the electricity usage dataset. The proposed DRL algorithm solves the class imbalance problem of the NTL dataset. The advantage of the proposed classification method is that the imbalanced dataset is used without changes. The two datasets used in this paper are gathered from Korea Electric Power Corporation (KEPCO) [27] and the Irish Social Science Data Archive [30].
The major contributions and innovations of the proposed method are summarized as follows: 1) The DRL algorithm is applied to classify abnormal users from the imbalanced dataset.
3) One dimensional-convolutional neural network (1D-CNN) is applied as the neural network structure to improve the detection performance of time-series data.
The rest of this paper is organized as follows. Section II describes the proposed system model for NTL detection based on deep Q-learning (DQL), which is one of the DRL algorithms. The DQL-based NTL detection method is explained in Section III. The simulation environments and results for the KEPCO dataset are presented in Section IV, where the performance metrics and conventional NTL detection algorithms are also explained. In Section V, the convergence behaviors for the imbalanced Irish datasets are represented by the average rewards of reinforcement learning (RL). Finally, the paper is concluded in Section VI.

II. SYSTEM MODEL FOR NTL DETECTION
The overall system model for NTL detection is illustrated in Fig. 1. Load profiles of grid systems are normalized to range from 0 to 1 using a min-max scaler, given by

$\hat{x}_i = \frac{x_i - \min(\mathbf{x})}{\max(\mathbf{x}) - \min(\mathbf{x})},$  (1)

where $\mathbf{x}$ is one sample of the recorded electricity usage data from the dataset $\mathcal{X}$, $x_i$ is the $i$-th recorded value of $\mathbf{x}$, $\min(\mathbf{x})$ is the minimum value of $\mathbf{x}$, and $\max(\mathbf{x})$ is the maximum value of $\mathbf{x}$.
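The per-sample normalization of (1) can be sketched as follows; the function name is illustrative, and a constant profile (zero range) is mapped to all zeros, an edge case the paper does not specify.

```python
import numpy as np

def min_max_scale(x):
    """Scale one load profile to [0, 1]: x_hat_i = (x_i - min(x)) / (max(x) - min(x))."""
    x = np.asarray(x, dtype=float)
    rng = x.max() - x.min()
    if rng == 0:                 # constant profile: map every reading to 0 (assumption)
        return np.zeros_like(x)
    return (x - x.min()) / rng

profile = np.array([120.0, 80.0, 200.0, 80.0])
scaled = min_max_scale(profile)  # min maps to 0, max maps to 1
```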
After the normalization process, a pre-processed dataset is generated with a false data injection (FDI) process. The dataset is divided into two datasets to train and test deep Q-learning network (DQN). The network is to decide an action of the DRL algorithm. The trained neural network is used to detect NTL from the test dataset.
KEPCO and Irish datasets are used in the simulation process. The two datasets are comprised of legally registered users. Therefore, the FDI process is necessary to generate simulation datasets for NTL detection.
FDI has been implemented in various research projects, as in [10], [11], [13]-[16], and [19]. The total number of FDI types is set to 6 in this paper, as in the previous papers. The first five types in [10], which are widely adopted FDI methods, are used in this paper to inject false data. The reason for selecting these five types without type 6 is to generate practical theft patterns, as mentioned in [13]. In addition to the five types, FDI type 6 in [10] is replaced with a revised version of the cut-off error, which is type 2 in [14], [15]. The six FDI types used in this paper are referred to as FDI1, FDI2, FDI3, FDI4, FDI5, and FDI6, respectively.
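Two attack patterns common in the FDI literature can be sketched as below. These are illustrations only: the function names and parameter ranges are assumptions, not the exact six types of [10], [14], [15] used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def fdi_scaling(x, low=0.1, high=0.8):
    """Illustrative FDI: under-report usage by one random factor per profile
    (hypothetical parameter range)."""
    return x * rng.uniform(low, high)

def fdi_cutoff(x, threshold):
    """Illustrative cut-off FDI: clip all readings above a threshold."""
    return np.minimum(x, threshold)

x = np.array([1.0, 0.5, 0.9, 0.2])   # a normalized load profile
tampered = fdi_scaling(x)
clipped = fdi_cutoff(x, 0.6)
```

Both attacks only lower the reported consumption, which is what makes them plausible theft patterns.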
The details of DQN-based classification method are explained in the following sections.

A. CLASSIFICATION METHOD USING DQN FOR NTL DETECTION
The main problem of NTL detection is that the number of illegal users is smaller than the number of legal users in the collected dataset. In this paper, the classification of thieves among users is executed with a DQN. The proposed NTL detection method is an improved version of the previous DQN-based FCE classification model [20] for an imbalanced dataset.
The purpose of using the DQN algorithm for NTL detection is to optimize the trade-off between the classification error on the imbalanced dataset and FCE. The DQN algorithm optimizes

$\min_{\hat{\mathcal{F}}} \;\; \epsilon_{+} + \epsilon_{-} + \sum_{f \in \hat{\mathcal{F}}} c(f),$  (2)

where the input dataset is denoted as $(\mathbf{x}, y)$, the size of the dataset is $N$, and $y \in \mathcal{Y}$ is a specific class in the class set $\mathcal{Y}$. Each time step of $\mathbf{x}$ is denoted as the $t$-th feature $f_t \in \mathcal{F}$, where $\mathcal{F}$ is the input feature space and $K$ is the total number of features. $\epsilon_{+}$ is the classification error of the majority class and $\epsilon_{-}$ is the classification error of the minority class. $\hat{\mathcal{F}} \subseteq \mathcal{F}$ denotes the selected feature set for NTL classification. $c(f)$ is the feature cost function applied as a negative reward of the DRL during feature selection; its value is uniform in this NTL problem.

The components of DRL are the agent, the environment, the state $s$, the reward $r$, and the action $a$. Each load profile of a user is represented as an episode of the DQN. For each state $s = (\hat{\mathbf{x}}, y, \hat{\mathbf{f}})$ of an episode, the agent performs one of the actions in the action space $\mathcal{A} = \{\mathcal{A}_f, \mathcal{A}_c\}$, where $\mathcal{A}_f$ contains the feature-selecting actions and $\mathcal{A}_c$ contains the classification actions that terminate the episode. The agent observes the partial information $(\hat{\mathbf{x}}, \hat{\mathbf{f}})$ of $\mathbf{x}$, where $\hat{\mathbf{x}}$ represents the selected data corresponding to the selected feature vector $\hat{\mathbf{f}}$. In return for a specific $a \in \mathcal{A}$, the reward $r$ is given by the environment. In this paper, the reward in [20] is modified to (3) for the imbalanced dataset. For each state, the reward $r$ of a classification action is defined as

$r = \begin{cases} \rho & \text{if } y = y_{+} \text{ and the classification is correct} \\ -\rho & \text{if } y = y_{+} \text{ and the classification is wrong} \\ 1 & \text{if } y = y_{-} \text{ and the classification is correct} \\ -1 & \text{if } y = y_{-} \text{ and the classification is wrong,} \end{cases}$  (3)

where $y_{+}$ denotes the label of the majority class and $y_{-}$ denotes the label of the minority class. The reward of the majority class, which is the abnormality ratio (AR), is denoted as $\rho$; the concept of $\rho$ is the same as the imbalance ratio in [21]. The AR is represented as

$\rho = N_{-} / N_{+},$  (4)

where $N_{-}$ and $N_{+}$ are the numbers of minority- and majority-class samples. The transition function $T(s, a)$ of the environment, which returns the next state $s'$, is denoted as

$T(s, a) = \begin{cases} (\hat{\mathbf{x}}', y, \hat{\mathbf{f}}') & \text{if } a \in \mathcal{A}_f \\ s_T & \text{if } a \in \mathcal{A}_c, \end{cases}$  (5)

where $s_T$ is the terminal state and $\hat{\mathbf{f}}'$ contains the newly selected feature in addition to $\hat{\mathbf{f}}$. The aim of the DQN is to find the optimal Q-function $Q^{*}$ with an optimal policy $\pi^{*}$ that chooses $a$ at $s$ to maximize the Q-value using a neural network. The Bellman equation with discount factor $\gamma$ is expressed as

$Q^{*}(s, a) = \mathbb{E}\!\left[ r + \gamma \max_{a'} Q^{*}(s', a') \right].$  (6)

The Q-function of the neural network is expressed as $Q(s, a; \theta)$ with network parameter $\theta$. The Huber loss is used as the loss function of the neural network to find $\theta$; $Q$ converges to $Q^{*}$ as the error between the target $y_Q$ and $Q(s, a; \theta)$ decreases. The cost function over a batch $B$ for training the neural network is

$L(\theta) = \frac{1}{|B|} \sum_{(s, a, r, s') \in B} H(\delta),$  (7)

where $|B|$ is the size of $B$ and the Huber loss $H(\delta)$ for NTL detection is

$H(\delta) = \begin{cases} \frac{1}{2}\delta^{2} & \text{if } |\delta| \le 1 \\ |\delta| - \frac{1}{2} & \text{otherwise,} \end{cases}$  (8)

with $\delta = y_Q - Q(s, a; \theta)$. The target $y_Q$ is defined as

$y_Q = r + \gamma \max_{a'} Q'(s', a'; \theta'),$  (9)

where $Q'$ is the Q-function of the target network with target network parameter $\theta'$.

B. Neural network model for Q-function based on 1D-CNN
The neural network configuration of [20] is replaced with a 1D-CNN because the KEPCO and Irish datasets consist of time-series electricity usage data. A 1D-CNN has an advantage in capturing local patterns of time-series data [23] and has been applied to classification problems on time-series data, as in [24], [25].
The experiments compare the performance of 1D-CNNs and fully connected layer-based neural networks (FNNs). The dataset for the simulations is the 1% imbalanced dataset with mixed FDI types. The neural network in [20] is referred to as the baseline model. The simulations are conducted 100 times for each neural network structure to examine the efficiency of the 1D-convolutional layer for electricity data. Seven neural networks are constructed for the simulations: three FNNs and four 1D-CNNs. The performance metrics and details of the simulation environment are explained in Section IV-A.
The baseline model is comprised of three fully connected hidden layers with 1,024 units each and the rectified linear unit (ReLU) activation function. The structure of the 1D-CNN is comprised of two convolution layers with average pooling, followed by fully connected layers, as illustrated in Fig. 2. The number of units of the fully connected layers in the 1D-CNN is 1,024, as in the baseline model. All 1D-convolution layers have an output channel size of 128 with kernel size 2 and stride 1, followed by average pooling with kernel size 2 and stride 2. The ReLU activation function is used for all layers. The input channel size of the proposed 1D-CNN is 2; the two input channels are for $\hat{\mathbf{x}}$ and $\hat{\mathbf{f}}$, respectively. In the proposed 1D-CNN, the correlation between $\hat{\mathbf{x}}$ and $\hat{\mathbf{f}}$ is captured by the two input channels and the average pooling process, which is underexploited by the input structure of the baseline. $Q(s, a; \theta)$ is the output of the network.
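The layer arithmetic described above can be checked with a short calculation. Assuming no padding, an input length of 36 (the KEPCO time steps), and two stages of convolution (kernel 2, stride 1) followed by average pooling (kernel 2, stride 2), the flattened size feeding the fully connected layers works out as:

```python
def conv1d_out(length, kernel, stride):
    """Output length of a 1-D convolution or pooling stage without padding."""
    return (length - kernel) // stride + 1

L = 36                            # KEPCO profile: 36 monthly readings
for _ in range(2):                # two conv + average-pool stages
    L = conv1d_out(L, 2, 1)       # convolution: kernel 2, stride 1
    L = conv1d_out(L, 2, 2)       # average pooling: kernel 2, stride 2
flattened = 128 * L               # 128 output channels
```

Under these assumptions the temporal length shrinks 36 → 35 → 17 → 16 → 8, so the flattened vector has 128 × 8 = 1,024 elements.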
According to the simulation results listed in Table 1, the average F1-scores of the 1D-CNNs are higher than those of the FNNs. Moreover, the values are similar across the FNNs regardless of the number of hidden layers. The average F1-score increases by at least 4.32% when the network structure changes from FNN to CNN. From the perspective of the average value, increasing the hidden layers of the FNN is ineffective for improving NTL detection performance. Although the maximum and average F1-scores of C2F5 are the highest among the neural networks, its standard deviation is higher than that of C2F4. C2F4 is therefore selected as the neural network structure in the following simulations because it has the lowest standard deviation with a prominent average F1-score.

C. ALGORITHMS OF NTL DETECTION USING DQN
Steps to pre-process the energy datasets are summarized in Algorithm 1. First, a min-max scaler is applied to the original dataset to normalize it for the neural network. The normalized data are modified with the selected FDI types. The scaled dataset is combined with FDI data sampled according to the abnormality ratio. The combined dataset is shuffled and separated into training and test datasets. The validation dataset is included in the training dataset.
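The pre-processing steps of Algorithm 1 can be sketched end to end. This is a simplified version under stated assumptions: the function name is illustrative, the FDI injection is passed in as a callable, injection is applied after normalization as the text describes, and the 3:1:1 split of Section IV-A is used.

```python
import numpy as np

def prepare_dataset(profiles, ar, inject, seed=0):
    """Sketch of Algorithm 1: normalize, inject FDI into a fraction `ar` of the
    samples, shuffle, then split into train/validation/test with a 3:1:1 ratio."""
    rng = np.random.default_rng(seed)
    # min-max normalize each load profile to [0, 1]
    X = np.array([(p - p.min()) / (p.max() - p.min() + 1e-12) for p in profiles])
    y = np.zeros(len(X), dtype=int)
    # tamper with an `ar` fraction of the samples and label them as theft (1)
    idx = rng.choice(len(X), int(round(ar * len(X))), replace=False)
    X[idx] = inject(X[idx])
    y[idx] = 1
    order = rng.permutation(len(X))          # shuffle before splitting
    X, y = X[order], y[order]
    a, b = 3 * len(X) // 5, 4 * len(X) // 5  # 3:1:1 split points
    return (X[:a], y[:a]), (X[a:b], y[a:b]), (X[b:], y[b:])
```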
The DQN-based training algorithm is presented as Algorithm 2. In the initialization stage, the neural network parameters $\theta$ and $\theta'$ are randomly initialized. The parallel environments are initialized with new episodes with the initial state $s_0 = (\hat{\mathbf{x}}_0, y, \hat{\mathbf{f}}_0)$, where $\hat{\mathbf{f}}_0$ is the initial selected feature vector with all zeros. The buffer $D$ is filled with transitions $(s, a, r, s')$ following an $\epsilon$-greedy policy until the number of transitions reaches the buffer size. For each epoch, transitions are continuously collected and recorded, replacing the oldest ones. After saving the transitions, a batch $B$ is randomly sampled from $D$ with the predetermined batch size. The target $y_Q$ is calculated by (9) for the batch. The neural network parameter $\theta$ is obtained from (7) and (9) using the Adam optimizer [26]. The target network parameter $\theta'$ is updated with soft replacement.
The mechanism of the step function $T(s, a)$ is explained in Algorithm 3. For a feature-selecting action, the newly selected feature is added to the previously selected feature vector $\hat{\mathbf{f}}$: the value of the selected entry of $\hat{\mathbf{f}}$ is turned from 0 to 1. The returns of $T(s, a)$ are $s'$ and $r$. The next state $s'$ has three components: the vector of selected data $\hat{\mathbf{x}}'$ corresponding to $\hat{\mathbf{f}}'$, the label $y$, and the vector of selected features $\hat{\mathbf{f}}'$. $\hat{\mathbf{f}}'$ is updated from $\hat{\mathbf{f}}$ to reflect the result of the action, and $\hat{\mathbf{x}}'$ is derived from the element-wise product of $\mathbf{x}$ and $\hat{\mathbf{f}}'$. The reward of a newly selected feature is $-c(f)$. Four different reward values are used for the classification actions on the imbalanced dataset, whereas two different reward values suffice for the balanced dataset; each reward value for the imbalanced dataset is defined as in (3). After a classification action, the environment is reset with a new episode from the electricity dataset; the classification action terminates the episode, returning the classification reward.
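The environment step of Algorithm 3 can be sketched as a small class. This is a minimal single-episode illustration under stated assumptions: the class and attribute names are hypothetical, actions 0..K-1 select feature f_t, and the last two actions classify as normal (majority) or theft (minority).

```python
import numpy as np

class NTLEnv:
    """Sketch of the DQN environment step (Algorithm 3). Actions 0..K-1 select a
    feature; action K classifies as normal, action K+1 classifies as theft."""
    def __init__(self, x, label, feature_cost, rho):
        self.x, self.label = x, label
        self.cost, self.rho = feature_cost, rho
        self.K = len(x)
        self.mask = np.zeros(self.K)            # selected-feature vector, all zeros

    def state(self):
        # observed channels: masked data x*f_hat and the mask f_hat itself
        return np.stack([self.x * self.mask, self.mask])

    def step(self, action):
        if action < self.K:                      # feature-selecting action
            self.mask[action] = 1.0
            return self.state(), -self.cost, False
        pred = action - self.K                   # 0 = normal, 1 = theft
        if self.label == 1:                      # minority (theft) sample: reward +/-1
            r = 1.0 if pred == 1 else -1.0
        else:                                    # majority sample: reward +/-rho
            r = self.rho if pred == 0 else -self.rho
        return self.state(), r, True             # classification ends the episode
```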

A. SIMULATION ENVIRONMENT
All simulations are executed on a PC with an i5-9400F CPU, a GeForce GTX 1650 GPU, and 32 GB of RAM. The development environment is PyCharm 2019 with Python 3.6. The KEPCO dataset is comprised of local home and business electricity usage profiles from 2017 to 2019. The number of grid users in the dataset is 6,782, keeping the users whose records cover more than half of the total period. The number of time steps of the data is 36 because the company provides monthly consumption data. The feature cost $c(f)$ of DQN-based NTL detection is set to 1/3,600 for all the simulation settings. The cost affects the number of selected features: an increase in feature cost leads to a decrease in the number of selected features. In this paper, the lowest ratio of abnormal users for simulation is 1%. The total feature cost is set to 0.01, which is 100 times smaller than the reward of the minority class. The discount factor $\gamma$ is 0.99 and the soft-replacement rate is 0.01. The buffer size is 30,000 and the batch size is 2,000.
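As a quick sanity check of the feature-cost setting, the total cost of selecting every feature follows directly from the per-feature cost; the Irish value of 1/4,800 (Section V) is included for comparison.

```python
# Total feature cost if the agent selected every available feature: K * c(f)
kepco_total = 36 * (1 / 3600)   # KEPCO: 36 monthly readings, c(f) = 1/3600
irish_total = 48 * (1 / 4800)   # Irish: 48 half-hour readings, c(f) = 1/4800
# Both equal 0.01, i.e. 100 times smaller than the minority-class reward of 1
```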
The training dataset is further separated into training and validation datasets. The proportion of training, validation, and test datasets is 3:1:1. More detailed settings follow the parameter settings in [20]. The specification of the implemented C2F4 is summarized in Table 2.
Three different experiments are simulated under various FCE conditions. The first simulation measures the performance of the proposed method while changing the AR; the theft data are made of every FDI type, denoted as mixed types. The second simulation is executed with different FDI types on the balanced dataset. The last experiment addresses the extremely imbalanced case. The details of the data for each simulation are listed in Table 3.

B. PERFORMANCE METRICS AND DIFFERENT DATA ANALYTIC METHODS FOR NTL DETECTION
In this paper, four classification metrics are used to demonstrate the performance of the proposed method compared to the conventional algorithms. The metrics are the true-positive rate (TPR), false-positive rate (FPR), false omission rate (FOR), and F1-score (F1), given by

$\mathrm{TPR} = \frac{TP}{TP + FN}, \quad \mathrm{FPR} = \frac{FP}{FP + TN}, \quad \mathrm{FOR} = \frac{FN}{FN + TN}, \quad \mathrm{F1} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},$

where positive denotes the theft class and negative denotes the normal class; the condition is the actual class of the data and the prediction is the classified state of the data. Recall is equal to TPR, and the precision in F1 is given by

$\mathrm{precision} = \frac{TP}{TP + FP}.$

A direct comparison of our method with other algorithms is impossible due to the difference in the number of input features. Therefore, the input features of the conventional algorithms are selected by a feature-selection function based on the best scores of the analysis of variance (ANOVA). The number of selected features is chosen to be close to the number of features used by our method.
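The four metrics can be computed from the confusion-matrix counts; the function name is illustrative, and zero-division guards are omitted for brevity.

```python
def metrics(tp, fp, fn, tn):
    """TPR, FPR, FOR, and F1 from confusion-matrix counts (positive = theft)."""
    tpr = tp / (tp + fn)            # recall: detected thieves among all thieves
    fpr = fp / (fp + tn)            # normal users flagged as thieves
    f_or = fn / (fn + tn)           # undetected thieves among predicted-normal users
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)
    return tpr, fpr, f_or, f1
```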
Six different classification methods are selected for comparison with the proposed method: four conventional methods and two neural network-based models, as below. 1) Support Vector Machine (SVM). 2) K-Nearest Neighbors (KNN). 3) XGBoost (XGB): in [19], the performance of XGB is the highest on a balanced dataset among 11 algorithms, so it is selected as a comparison algorithm. The parameters of XGB are set as in [19]: the booster is 'gbtree', the learning rate is 0.05, the maximum depth is 8, and the lambda for L2 regularization is 2. 4) Random Forest (RF): RF is also listed in [19] and its overall performance is highly ranked. The parameters are as follows: the number of estimators is 50 and the maximum depth of a tree is 6. 5) Simple C2F4: a neural network-based classifier constructed without RL. The detailed structure is the same as the proposed C2F4 except for the input: the input size of Simple C2F4 equals the number of used input features (1 × the used input size), because the inputs are comprised of the selected input features without $\hat{\mathbf{f}}'$. The early-stopping tolerance is 5 with a total of 1,000 epochs. 6) Simple Baseline: a neural network-based classifier constructed without RL. The detailed structure is the same as the Baseline model except for the input size; the other implementation details are equal to Simple C2F4.
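The ANOVA-based feature selection used for the baselines can be sketched without a library: for each feature, the one-way ANOVA F-score (the quantity scikit-learn's `f_classif` computes) is the between-class variance divided by the within-class variance, and the top-scoring features are kept. Function names here are illustrative.

```python
import numpy as np

def anova_f(X, y):
    """One-way ANOVA F-score for each column of X given class labels y."""
    classes = np.unique(y)
    grand = X.mean(axis=0)
    # between-class and within-class sums of squares, per feature
    ssb = sum(len(X[y == c]) * (X[y == c].mean(axis=0) - grand) ** 2 for c in classes)
    ssw = sum(((X[y == c] - X[y == c].mean(axis=0)) ** 2).sum(axis=0) for c in classes)
    dfb, dfw = len(classes) - 1, len(X) - len(classes)
    return (ssb / dfb) / (ssw / dfw)

def top_k_features(X, y, k):
    """Indices of the k features with the best ANOVA F-scores."""
    return np.argsort(anova_f(X, y))[::-1][:k]
```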

C. SIMULATION RESULTS OF DIFFERENT ABNORMALITY RATIOS
Under mixed FDI types, the ratio difference between normal and abnormal users has an impact on the results. The ARs for the simulations are 0.1, 0.2, 0.3, and 0.5. For each AR, the numbers of features used for classification are 22, 20, 14, and 12, respectively, out of the total feature number; the numbers are listed below each AR. The numbers are selected automatically by the DRL classification algorithm. The simulation results are shown in Table 4.
From the TPR values, our method marks the highest TPR for all ARs, with the lowest value of 93.08% for AR = 0.1 and the highest value of 98.3% for AR = 0.5. The TPR is the ratio of correctly classified thieves among all thieves, which demonstrates that abnormal users are classified more effectively by our method than by the other methods.
The FPR values of our method remain around 5 to 6.5% for all ARs. In contrast, the FPR of the other methods tends to decrease as the AR decreases. A higher FPR indicates more misclassified normal users among all normal users. The classification of normal data is less affected by the AR for our method, where the difference between the minimum and maximum FPR is 1.49; the smallest such difference among the other methods is 7.71, for the Simple Baseline. This result can be interpreted as follows: for the other methods, the number of incorrectly classified normal users increases as the number of thieves increases.
The FOR values of our method are below 1% for all ratios, the lowest among all methods. A higher FOR means a higher ratio of thefts that go undetected. Therefore, the ratio of undetected energy thieves among the predicted normal users is lower for our method than for the conventional algorithms.
F1 reflects the overall classification performance on the imbalanced data. For F1, our method shows the highest performance for AR = 0.3 and 0.5. For AR = 0.1 and 0.2, the total number of correctly classified users of the proposed method is lower than that of the Simple Baseline model. However, the ratios of correctly classified thieves are the highest for the proposed method, as shown by the TPR and FOR. These results indicate that the proposed method has strength in detecting theft data at the expense of some misclassified registered users. Algorithms with a low FPR but also a low TPR are defective because they tend to assign every input to the majority class, and a high FOR is directly related to undetected energy theft. FOR is the most important metric for NTL detection because undetected abnormal users directly degrade the performance of the overall grid system. Therefore, the performance metrics of Table 4 show that the proposed method is robust to imbalanced datasets, whereas the other methods classify most of the negative-class data into the positive class. From the FCE perspective, the number of used features increases as the AR decreases.

D. SIMULATION RESULTS OF DIFFERENT FDI TYPES FOR BALANCED DATASET
For each FDI type, the simulation results are shown in Table 5. The numbers of used features for the FDI types are 15, 19, 15, 9, 9, and 22, respectively, out of the total feature number; the numbers are listed below each FDI type. Each method has strengths for particular FDI types. Our method is robust on the balanced dataset compared to the conventional methods, achieving the highest F1 for all FDI types. The minimum F1 difference between the proposed method and the other methods is 0.33, occurring for FDI4 compared with C2F4, and the maximum is 11.49, occurring for FDI6 compared with KNN.
For FDI1, the XGB method detects thieves most effectively among all methods, with TPR = 99.42 and FOR = 0.7. However, many normal users are incorrectly determined to be illegal users by XGB, as demonstrated by its FPR of 15.15. The proposed method correctly classifies relatively more normal users than XGB, as its FPR is 5.67.
For FDI2, thefts are detected most effectively by the proposed method, with the highest TPR (97.68) and an FOR of 2.31. The fewest normal users are misclassified by the RF, which has the lowest FPR (2.39). The minimum difference in FOR between the proposed method and the conventional methods is 5.32, the highest among all FDI types. This can be interpreted as the ratio of undetected thefts being higher with the conventional methods for FDI2.
For FDI3 and FDI4, the numbers of properly classified normal users are the highest with the proposed method among all classification methods, as demonstrated by its lowest FPR values of 5.6 for FDI3 and 0.15 for FDI4; the corresponding F1-scores are 96.79 and 99.89. From the TPR and FOR, all thieves are correctly classified by the RF for FDI3, and the SVM, KNN, and Simple Baseline methods detect all illegal users for FDI4. For these two FDI types, the proposed method efficiently detects legally registered users at the expense of some thieves. For FDI5, all values of the performance metrics are the best for the proposed method, which means the proposed method is the most efficient for FDI5.
All classification methods have difficulty with FDI6, as shown by the relatively low F1 values. The number of correctly detected thefts is the highest with XGB for FDI6, from its TPR. However, more normal users are misclassified by XGB than by the proposed method, as shown by the FPR values, which differ by 10.38. Judging from the FPR, normal users are classified effectively by the Simple Baseline, but at the cost of many misclassified thefts: its FPR is the lowest at 4.1 while its FOR is the highest at 21.36.
The numbers of used features for FDI4 and FDI5 are lower than for the other types, with better performance on all four metrics; for our method, the F1 values are 99.89% and 98.63% for FDI4 and FDI5, respectively. This can be interpreted from the perspective of FCE: for these two FDI types, using all input features to detect NTL is relatively wasteful for the grid system, because the proposed method needs only 25% of the total input features to detect NTL.

E. RESULTS OF THE EXTREME IMBALANCED NTL CASE
In the extreme case, as in [19], the AR is set to 1% with mixed FDI types. The number of features used by the proposed method is 3, which is 8.3% of the total input features. In this simulation environment, the performance of the proposed method is compared with the SVM, which is relatively robust to the imbalanced dataset in the previous section. Fig. 3(a) is the confusion matrix of the proposed DQN-based detection method, and Fig. 3(b) is the result of the SVM. From the confusion matrix of our method, about 85% of thieves are detected by the DRL method, and only 7.9% of the normal users are incorrectly determined to be abnormal. Although the total number of incorrectly classified samples by the SVM is lower than by our method, every theft sample in the test dataset is classified as normal by the SVM. From the confusion matrices, the SVM-based NTL detection method is inadequate for an extremely imbalanced dataset compared to the proposed method.

F. CONVERGENCE BEHAVIOR FOR 1% IMBALANCED DATASET
Convergence behavior is an important aspect of a neural network. In this paper, the performance metric adopted for convergence behavior is the average reward, an overall performance metric of the RL algorithm. The tests are executed on the 1% imbalanced KEPCO dataset with mixed FDI types to investigate the behavior of the proposed 1D-CNN structures compared to the FNN structures, including the baseline model of Section III-B. For each structure, the experiments are executed ten times, as in [20]. The mean of the simulated average rewards over training epochs is shown in Fig. 4, and the summaries of the simulations are listed in Table 6. The average reward is calculated every 200 epochs.
In Fig. 4, solid lines denote the 1D-CNN models and dashed lines denote the FNN models. As mentioned in [20], the performance of the networks improves as training proceeds. From Table 6, the overall average rewards of the 1D-CNNs are higher than those of the FNNs, except for C2F5. The highest average reward is achieved by C2F4, which is the main network structure in this paper. The average reward of C2F5 converges first to its optimal value, which leads to the shortest training epochs.

V. CONVERGENCE BEHAVIOR FOR IRISH DATASET
In this section, the performance of the proposed DQN method is verified on the Irish smart meter dataset [32] under various AR conditions. The Irish dataset consists of daily electricity consumption recordings of more than 5,000 consumers from 2009 to 2010. From the total dataset, 100 consumers are randomly selected to prepare five datasets for simulation. The daily consumption record of each user is one sample of the dataset. The consumption amounts are recorded every 30 minutes, so the total input feature size is 48. The details of each dataset are listed in Table 7. A daily consumption record is removed when null values occupy more than 3/4 of the total features, and the remaining empty entries are filled with zero. Samples are randomly selected for FDI according to AR = 0.1.
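The Irish cleaning rule described above can be sketched as follows; the function name is illustrative, and missing readings are assumed to appear as NaN.

```python
import numpy as np

def clean_daily_records(records, total_features=48):
    """Sketch of the Irish pre-processing: drop a daily record when null values
    occupy more than 3/4 of its 48 half-hour readings, fill remaining nulls with 0."""
    kept = []
    for rec in records:
        rec = np.asarray(rec, dtype=float)
        if np.isnan(rec).sum() > 3 * total_features / 4:
            continue                              # too sparse: discard the whole day
        kept.append(np.nan_to_num(rec, nan=0.0))  # fill remaining nulls with zero
    return np.array(kept)
```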
The feature cost $c(f)$ is set to 1/4,800. The simulations are conducted ten times for each dataset to obtain average values. The neural network for the simulations is C2F3. The average reward is calculated every 200 epochs.
The average convergence behaviors of the proposed method for each dataset are shown in Fig. 5. For each Irish dataset, the average rewards increase at a similar rate and converge to around 0.8. Although the performance of the neural network on the Irish datasets is slightly degraded compared to the KEPCO dataset, the suggested settings of the neural network work stably for the Irish datasets.

VI. CONCLUSIONS
A DRL-based NTL detection method for imbalanced datasets has been proposed in this paper to detect energy theft and has been simulated in environments with costly features. The simulation results demonstrated that the proposed method is robust on the NTL dataset compared to various detection methods. This was accomplished by assigning different rewards to the classification actions of the positive and negative classes. For the imbalanced dataset, the proposed method can effectively detect energy theft without data augmentation. The strength of our method is that the pre-processing step to select input features is performed automatically within the classification step. In addition, the average rewards of the 1D-CNNs converge to higher values than those of the FNNs for the 1% extreme case with the KEPCO dataset.
Estimation bias is incurred by the overestimation of the DQN, as mentioned in [29]. This bias causes the large standard deviation values of the proposed method in Table 2. Therefore, advanced DQNs such as Double DQN will be applied to the DQN-based NTL detection method in future work to improve its stability. Our proposed method will also be applied to similar classification problems on imbalanced datasets, such as detecting malfunctions of electrical appliances or devices whose functions are provided by a home management system or building management system in the SG.