Deep Reinforcement Learning for Anomaly Detection: A Systematic Review

Anomaly detection has been used for years to detect and analyze anomalous elements in data, and various techniques have been developed for this purpose. The most conventional of these is machine learning (ML), which performs well but still has limitations for large-scale unlabeled datasets. Deep Reinforcement Learning (DRL) based techniques have been reported to outperform existing supervised, unsupervised, and other alternative techniques for anomaly detection. This study presents a Systematic Literature Review (SLR) that analyzes DRL models for anomaly detection. The SLR examines the DRL frameworks used in anomaly detection applications, the proposed DRL methods, and their performance compared with alternative methods. We identified 32 research articles published from 2017–2022 that discuss DRL techniques for various anomaly detection applications. From these articles, we identified 13 different applications of anomaly detection, 50 different datasets used in anomaly detection experiments, and 17 distinct DRL models used to detect anomalies. Finally, we analyzed and reviewed the performance of these DRL models. We observed that detecting anomalies using DRL frameworks is a promising area of research and that DRL has shown better performance for anomaly detection in settings where other models fall short. Based on this review, we provide researchers with recommendations and guidelines.


I. INTRODUCTION
The associate editor coordinating the review of this manuscript and approving it for publication was Mohamed Elhoseny. VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Anomaly detection is a significant problem that has been researched for decades, and a variety of techniques have been proposed and employed to identify anomalies for various purposes. Anomaly detection is the task of detecting patterns in data that do not match expected behavior [1], [2]. It is applied in a wide range of areas, including cyber security and network intrusion detection [3], [4], [5], detection of unusual activity in videos such as road crimes or robberies, fault detection, streaming, and hyperspectral imaging, among other applications. Identifying anomalies matters in many application areas because anomalous data may carry valuable, relevant, and essential information. For example, detecting an anomalous network traffic pattern may reveal an intrusion from a hacked machine [6]. Anomaly detection is also used in medical applications. Another instance is identifying abnormalities in bank or credit card transaction data, which might suggest fraud [7]. Furthermore, identifying an anomaly in an aircraft sensor reading may lead to discovering a defect in one or more of the airplane's systems. Many techniques have been used for anomaly detection. Statistical anomaly detection techniques are some of the oldest [8]; they use a statistical model to detect unusual patterns in the data. Machine Learning (ML) has become the most conventional and popular approach to detecting anomalies, and it has been successful to some extent.
ML techniques include supervised methods, which use labeled data; unsupervised methods, which use unlabeled data; and semi-supervised methods, which use a small labeled set together with a large unlabeled set to detect anomalies. In each case, the goal is to build a model that separates the normal and anomalous classes [9]. In supervised learning, the agent (the ML algorithm) learns the input-output mapping (the model) from labeled training data and generalizes across training cases to predict labels for new data. Labels are not always correct. In the process sector, the subject matter expert who provides the labels often acts as an unreliable and noisy sensor of the process's present status (temperature, pressure, etc.). Because the supervised learning agent copies the expert's labeling behavior, it cannot outperform the expert. The agent's performance limit is called the Bayes error rate. Unsupervised learning is commonly used for tasks such as similarity-based data separation; segregating data according to the components of the dataset is one example. It aims at dimensionality reduction, feature extraction, and clustering. Semi-supervised learning combines supervised and unsupervised approaches. Manually labeling datasets is costly in the process industry, but many applications, such as defect detection, require labels. Semi-supervised learning can learn from both labeled and unlabeled data, but it cannot outperform the supervisor. These older approaches can therefore reduce costs, but they do not extend capabilities.
Reinforcement learning (RL) is a sub-domain of ML that does not need labeled data. Unlike supervised ML, it uses an intelligent agent that learns to make optimal decisions by maximizing rewards to achieve a goal [10]. RL is closely related to dynamic programming. Deep Reinforcement Learning (DRL) combines deep learning (DL) and reinforcement learning. Incorporating DL helps the RL agent make optimal decisions from unstructured data and removes the need to engineer the state space manually. DRL algorithms can perform well on very large datasets and are useful in diverse applications, including anomaly detection, video games, robotics, transportation, natural language processing (NLP), healthcare, computer vision, and finance [11].
Anomaly detection is an important application of DRL. DRL combines the representation-learning ability of deep learning with the decision-making ability of reinforcement learning [12]. It addresses the critical yet largely unsolved problem of detecting anomalous data. A DRL approach actively seeks novel classes of anomalies that lie beyond the scope of the labeled dataset. It has been reported to outperform other models at detecting anomalies in massive datasets, which are hard to handle with alternative unsupervised methods [13].
The primary objective of this research is to conduct a systematic review that provides a comprehensive study of the DRL frameworks proposed for anomaly detection and its applications. In addition, this review presents DRL models and their performance compared with alternative models, and suggests DRL models for various anomaly detection applications. It also presents all anomaly datasets used in the research articles selected for this SLR.
The remainder of this paper is organized as follows: Section II discusses the related work, Section III describes the methodology used to conduct this research, Section IV presents the results and discussion, and Section V addresses limitations, conclusions, and suggested future work.

II. LITERATURE REVIEW
Anomaly detection is a critical topic that has been researched and implemented in various disciplines. Many anomaly detection systems have been adapted to specific purposes, while others are more generic. The following subsections address the concepts of anomaly detection and DRL and examine prior work, anomaly detection types, methods, and applications.

A. ANOMALY DETECTION
Anomaly detection is the process of identifying anomalous patterns that do not conform to expected behavior; these anomalous patterns are commonly known as anomalies and outliers [62]. Anomaly detection has been applied to various fields of study, including data breaches, identity theft, networking, manufacturing, video surveillance, and IoT anomaly detection.
Solid knowledge of the nature of anomalies is essential for the development of anomaly detection systems. Anomalies are divided into three classes:
• Point Anomalies: A point anomaly is a single data instance that is regarded as an aberration compared to the rest of the data. This sort of anomaly is the simplest and is the focus of most research on anomaly detection. This category is shown in Figure 1(a), which depicts the discharge capacity data collected from a lithium-ion battery and the anomaly locations.
• Contextual Anomalies: A contextual anomaly is a data instance that is anomalous in a particular context but not in another. Figure 1(b) illustrates a time series of the average monthly temperature for a region. At time t1 (winter), a temperature of 20 °F is typical. However, a temperature of 20 °F at time t2 (summer) may be anomalous.
• Collective Anomalies: A collective anomaly is a group of related data instances that is out of the ordinary relative to the overall dataset. Figure 1(c) illustrates an ECG output; the highlighted zone is a collective anomaly because the human ECG output should not remain low for an extended period.
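The difference between a point anomaly and a contextual anomaly can be illustrated with a minimal Python sketch. It assumes a simple z-score rule with a hypothetical threshold of 2.0 and made-up temperature readings tagged with a season; the function names, the threshold, and the data are illustrative assumptions and are not taken from any of the reviewed papers.

```python
from statistics import mean, stdev

def zscore_flags(values, threshold=2.0):
    """Flag values whose z-score over the whole series exceeds the threshold."""
    mu, sigma = mean(values), stdev(values)
    return [abs(v - mu) / sigma > threshold for v in values]

def contextual_flags(readings, threshold=2.0):
    """Flag (context, value) readings using statistics of their own context only."""
    by_context = {}
    for context, value in readings:
        by_context.setdefault(context, []).append(value)
    stats = {c: (mean(v), stdev(v)) for c, v in by_context.items()}
    return [
        abs(value - stats[context][0]) / stats[context][1] > threshold
        for context, value in readings
    ]

# Hypothetical temperatures in degrees Fahrenheit; the last summer reading is 20.
winter = [("winter", t) for t in (18, 20, 22, 19, 21, 20, 23, 17)]
summer = [("summer", t) for t in (78, 80, 82, 79, 81, 80, 83, 20)]
readings = winter + summer

# Global view: 20 degrees is common in the series, so nothing stands out.
global_flags = zscore_flags([t for _, t in readings])
# Contextual view: 20 degrees is judged against other summer readings only.
context_flags = contextual_flags(readings)
```

In this toy series the global rule flags nothing, because a reading of 20 °F occurs often in winter, whereas the per-season rule flags only the 20 °F summer reading. A z-score rule is used purely for brevity; it stands in for whatever detector an application would actually use.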
ML-based anomaly detection is becoming more prevalent; it is used to construct a model that differentiates between normal and abnormal classes [59]. Based on how the training data are used, anomaly detection approaches can be grouped into the following three categories:
• Supervised Anomaly Detection: requires all dataset instances to be labelled "normal" or "anomalous". This method is essentially a binary classification task [64].
• Semi-Supervised Anomaly Detection: requires only the "normal" cases in a dataset to be labelled. In this method, the model is trained to represent only normal occurrences [65].
• Reinforcement Learning: is a learning model comparable to supervised learning, except that the algorithm is not taught using a labelled dataset. Instead, the reinforcement learning paradigm acquires knowledge from external feedback provided by a thinking entity or the environment [68].
The authors in [1], for instance, gave a comprehensive overview of anomaly detection approaches and their applications. They reviewed in detail several machine learning and non-machine learning algorithms, including statistical and spectral detection methods. In addition, the review covers a variety of anomaly detection applications, such as cyber intrusion detection, fraud detection, medical anomaly detection, industrial damage detection, image processing, textual anomaly detection, and sensor networks. However, this survey does not discuss recent, more powerful anomaly detection algorithms and does not focus on DRL. The same researchers also published a survey [8] of anomaly detection in discrete patterns, giving a thorough and well-organized review of the available research on identifying anomalies in symbolic patterns.
Nevertheless, the survey in [8] was limited to classical methods, and DRL-based anomaly detection was not discussed. The authors in [14] also gave an overview of ML and statistical anomaly detection methods and compared the benefits and drawbacks of each technique. DRL-based anomaly detection, by contrast, remains an active area that attracts considerable interest from both academia and industry.
Agrawal and Agrawal [7], on the other hand, offered a survey on anomaly detection using data mining approaches. The methods surveyed in [7] still have limitations for large-scale unlabeled datasets, on which they do not perform well. The authors in [9] presented an SLR of anomaly detection using machine learning techniques. This SLR includes comprehensive coverage of supervised, unsupervised, and semi-supervised methods for anomaly detection. They compared the performance of all models and made recommendations for researchers in this domain. Moreover, they presented all anomaly datasets used in the papers included in their SLR. However, this SLR also did not focus on the methods and applications of DRL in the anomaly detection domain.
The same applies to the systematic literature review in [59], whose authors focused only on anomaly detection using ML methods in smart shirts. The SLR in [59] does not include or discuss DRL methods for anomaly detection; instead, it explores only classical ML methods targeting smart shirt anomaly detection. A different survey, on RL algorithms for dynamically varying environments, was conducted by the authors in [60]. It presents the various categories of MDP-based RL, decision rules, policies, and value functions. It does not explain the hybridization of deep neural networks (DNNs) and RL or their benefits, performance, and challenges in the field of anomaly detection.
Numerous studies aimed at identifying anomalies in specific areas and applications. In [15], the researchers gave an overview of broad clustering-based fraud detection approaches and evaluated them from various viewpoints. The authors in [16] presented several frameworks and classification techniques for anomaly detection in automated surveillance, examining research papers based on the issue, scope, technique, and strategy. Furthermore, the researcher in [17] presented an overview of the most used anomaly detection approaches in the area of geochemical data analysis, including fractal models, compositional data analysis, and machine learning (ML), with the main emphasis on ML algorithms. The authors in [18], by contrast, looked at models for log-based anomaly detection. They examined six different anomaly detection algorithms, ranked them, and compared their accuracy and efficiency on two production log datasets.
Many studies focused on anomaly-based intrusion detection. For example, the author in [19] published thorough research on anomaly-based intrusion detection approaches such as statistics, ML, neural networks (NNs), and data mining. The authors in [20] also looked at intrusion detection, although their emphasis was on ML approaches. They presented a review of ML approaches for solving intrusion detection problems that were published between 2000 and 2007, and examined the studies based on classifier design types, datasets, and other criteria. The authors in [21] conducted a comprehensive analysis of anomaly detection and intrusion detection strategies, while those in [22] examined ML and data mining approaches for cyber intrusion detection, describing each approach and discussing the difficulties of using ML and data mining for cyber security. Finally, the researcher in [23] showed how to enhance the effectiveness of detecting anomalies in network intrusion systems by combining several ML approaches with particle swarm optimization.
Identifying network anomalies has long been a focus of study [24], [25], and several surveys have been conducted on the subject. In [26], a detailed study of network anomaly detection was published. The authors defined the types of attacks that intrusion detection systems (IDSs) are most likely to encounter and then explained and evaluated the efficiency of several anomaly detection approaches. They also examined the techniques used in network security. The authors in [6] comprehensively analyzed distance-based, density-based, supervised, and unsupervised learning approaches in network anomaly detection. The authors in [27], on the other hand, emphasized DL approaches, including machine-based DNN, DRNN, and ML approaches for network anomaly detection systems. Furthermore, the article reviews studies that show how deep learning algorithms may be used to analyze network traffic data.

B. DRL FOR ANOMALY DETECTION IN DIFFERENT DOMAINS
1) VIDEO ANOMALY DETECTION
In surveillance videos, the primary action is usually commonplace, unproblematic behavior. The more critical and challenging task of a smart video surveillance system is to locate and detect anomalous actions, which occur with a lower likelihood than regular activity [32]. Smart video surveillance, which uses computer vision algorithms to analyze and understand long video streams, has greatly enhanced public security. Abnormal activity detection is a crucial component of smart video surveillance because it automatically recognizes anomalies in a constantly changing scene and acts when necessary to deal with emergencies. Owing to numerous efforts to flag violent activity in surveillance videos, anomaly detection systems have made considerable progress in recent years in helping to resolve security issues [34], [61]. The introduction of deep reinforcement learning has had a significant impact on region and action recognition in videos.

2) NETWORK INTRUSION DETECTION
Network Intrusion Detection Systems (NIDSs) are among the most essential security protection techniques used today to monitor computer networks or systems for network-based threats or harmful attacks that might impair system functionality [38], [40]. A misuse-based NIDS relies on a large database of malicious activity. Furthermore, such a system has a slow processing speed and is vulnerable to zero-day attacks. An anomaly-based IDS uses atypical traffic patterns to spot concealed threats to computer systems. Reinforcement learning (RL) is another machine learning technique that shows promise in a variety of applications, including robotics and gaming. Recently, several articles have examined the effects of RL in NIDS applications; however, less research has examined the effects of RL on the NIDS problem with unbalanced datasets [43], [49].

3) NETWORK INTRUSION IN IOT
An intrusion detection system (IDS) is widely regarded as one of the effective tools for protecting the critical data of an Internet of Things (IoT) network. The ongoing expansion of interconnected IoT devices has greatly increased network traffic and complexity in a constantly shifting Internet environment, which makes IoT devices more susceptible to security attacks. To secure the IoT environment, a strong and sophisticated IDS based on cutting-edge machine learning techniques is needed. Reinforcement learning (RL) is one of the promising ways to protect the IoT in hostile environments because it incorporates environmental behavior into the learning process. RL maximizes the overall reward by having the agent interact with the environment. The agents generate the data themselves through this interaction and then use it to train their models. Using a strategic selection of pertinent features, the RL agent recognizes and categorizes various attacks. Exploring the environment and receiving positive or negative feedback helps the agent perform better. After gathering feedback from the environment, the agent learns certain attack behaviors, at which point it creates a strategy to safeguard the IoT against intrusion [53].
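The feedback loop described above can be sketched in a few lines of Python. The sketch is a deliberately simplified, one-step (bandit-style) version with a lookup table in place of a deep network. The traffic states, their labels, the reward of +1 or -1, and all hyperparameters are hypothetical choices for illustration and do not reproduce the method of [53].

```python
import random

# Hypothetical discretised traffic states with ground-truth labels
# (0 = benign, 1 = attack). Illustrative only, not drawn from any dataset.
TRUE_LABEL = {"low_rate": 0, "normal_rate": 0, "port_scan": 1, "flood": 1}
ACTIONS = (0, 1)  # 0 = allow the traffic, 1 = raise an alert

def train(episodes=50, alpha=0.5, epsilon=0.1, seed=0):
    """Learn action values from positive/negative feedback only."""
    rng = random.Random(seed)
    q = {state: [0.0, 0.0] for state in TRUE_LABEL}
    for _ in range(episodes):
        for state, label in TRUE_LABEL.items():
            # Epsilon-greedy: mostly exploit the current estimate, sometimes explore.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[state][a])
            # The environment answers with positive or negative feedback.
            reward = 1.0 if action == label else -1.0
            # Move the estimate for the chosen action towards the observed reward.
            q[state][action] += alpha * (reward - q[state][action])
    return q

q_table = train()
# Greedy policy derived from the learned action values.
policy = {s: max(ACTIONS, key=lambda a: q_table[s][a]) for s in q_table}
```

After training, the greedy policy raises an alert for the two attack states and allows the two benign ones. A DRL-based IDS would replace the table with a neural network over raw traffic features and would usually account for the next state as well, which this sketch omits.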

4) CYBER ATTACK INTRUSION DETECTION
Cybersecurity is the collection of procedures and techniques created to defend computers, networks, programmes, and data against attacks, unauthorized access, alteration, and damage. Cyber security systems consist of network security systems and computer (host) security systems. Each of these has, at the very least, a firewall, an antivirus programme, and an intrusion detection system (IDS). IDSs assist in finding, determining, and identifying the unlawful use, duplication, alteration, and destruction of information systems. Security breaches include attacks from outside the organization (external intrusions) as well as internal intrusions. In a few recent studies, DRL has been used to defend systems against network intrusion attacks [51], [58].

5) INTRUSION DETECTION IN CLOUD
Cloud computing offers a very adaptable and scalable platform for pay-per-use, on-demand access to computing power, data storage, and infrastructure components. Due to its distributed structure, cloud computing is a prime target for attackers, who frequently use new techniques to exploit its flaws [35]. Novel attacks and ongoing changes in attack patterns in today's cloud environment make it more challenging to identify breaches. To remain viable in such situations, current systems require regular updates via retraining on a fresh dataset together with the old one, which is not always practicable given the computing cost and resources required. In addition, the specific attack types directed at a cloud network depend on its context. As a result, there is a need for a low-cost IDS that automatically detects and adapts to changes in attack patterns in the environment while requiring minimal human involvement. In this regard, a cloud IDS architecture based on deep reinforcement learning is adaptable and maintains a balance between accuracy and false positive rate (FPR) [37].
Although some literature reviews are available, none of them has adequately addressed DRL-based methods. To the best of our knowledge, this study is among the first SLRs on anomaly detection using deep reinforcement learning techniques, which is the primary motivation behind this research. Our systematic literature review differs considerably from the reviews described above, as we present extensive research on detecting anomalies using DRL techniques. Our SLR includes:
• Various DRL models for anomaly detection.
• A performance comparison of those models with alternative techniques.
• The applications of anomaly detection addressed in the research articles selected for this SLR.
• All anomaly datasets used in the research articles selected for this SLR.
• Coverage of research articles from 2017–2022.

III. METHODOLOGY
This research follows the Kitchenham and Charters methodology [28] to conduct this Systematic Literature Review. The process consists of three phases: planning, conducting, and reporting the review. Each phase has several stages. The planning phase is broken down into six stages. The first stage is to formulate research questions that are relevant to the review's goals. The second stage is to determine the appropriate search keywords and devise a search strategy for gathering research articles that answer the research questions. The third stage identifies the study selection process, which comprises inclusion and exclusion criteria. The fourth stage lays out a data extraction approach to address the previously stated research questions. Finally, the fifth stage synthesizes the extracted data. The following subsections illustrate how we implemented the review procedure.

A. RESEARCH QUESTION
In this SLR, we aim to present a comprehensive study of DRL models for anomaly detection, which includes an examination of DRL models and their performance from 2017-2022.
The research questions raised for this purpose are:

RQ1: What anomaly detection applications are discussed using DRL techniques?
RQ1 aims to discuss the anomaly detection applications addressed with DRL in the papers selected for this SLR.

RQ2: What anomalous datasets are used for anomaly detection using DRL techniques?
RQ2 aims to present the various anomalous datasets that are used in the papers selected for this SLR.

RQ3: What algorithms of DRL are used to detect anomalies?
The purpose of RQ3 is to identify precisely which DRL algorithms are proposed for detecting anomalies in the selected papers.

RQ4: What is the performance of the DRL model compared with the alternative method?
RQ4 focuses on the models' performance, including estimation and prediction accuracy in detecting anomalies using DRL, and on how that performance compares with alternative models.

B. SEARCH STRATEGY
The search scope is defined and restricted to computer science, social science, information systems, and information security (behavioral aspect). This research uses both automated and manual search techniques to retrieve as many research papers as feasible to meet the study's goals. The manual search was carried out using search engines and the reference lists of similar publications. To conduct this SLR, we followed the procedure listed below. We used the following libraries to collect research papers, which include conference and journal papers:

1) INCLUSION AND EXCLUSION CRITERIA
The inclusion criteria to select a paper for this SLR are given below:
• Articles written in English, including scientific journals and conference proceedings.
• Articles on anomaly detection or its applications.
• Articles which use the DRL technique to detect anomalies.
• Articles published from 2017 to 2022.
The exclusion criteria to reject a paper for this SLR are given below:
• Papers with no clear publication information.
• Papers related to DRL that do not mention anomaly detection.
• Papers related to anomaly detection that do not discuss DRL.

C. STUDY SELECTION
To conduct this SLR, we collected 46 papers based on the search terms discussed earlier. After screening them against the selection criteria, we discarded 4 review papers, 6 unrelated papers that do not meet the inclusion criteria, and 5 duplicate articles. After this filtration, we finally selected 32 papers to review for this SLR. The filtration steps used to select papers are given below:
1. Remove duplicate research papers collected from different digital libraries.
2. Apply the inclusion and exclusion criteria discussed in Section III-B.
3. Remove review papers.
4. Apply quality assessment rules to include the best selected papers for this SLR.
5. Search related articles from the references of selected papers and repeat the steps above.
Figure 2 shows the study selection criteria utilized in this SLR, and Figure 3 illustrates the identified 32 research articles written from 2017-2022 that discuss DRL techniques for various applications of anomaly detection.

D. DATA EXTRACTION STRATEGY
In this SLR, we aim to present the various DRL techniques for anomaly detection and specify their applications. We also aim to present the different anomalous datasets used for anomaly detection. For this purpose, the information we extracted from the selected papers includes the title of the research paper, the year of publication, the type of anomaly detection, the DRL models proposed to detect anomalies, the datasets used, and the performance of the DRL model. All of these items correspond to the RQs.

E. SYNTHESIS OF EXTRACTED DATA
To complete this SLR, we employed several techniques to address the RQs by synthesizing the information from the chosen publications. To answer RQ1, we identified all anomaly detection applications from the selected papers and presented them in tabular form with the corresponding paper IDs. To answer RQ2, we extracted all the datasets from the selected papers and presented them in tabular form, as shown in TABLE 1. To address RQ3, we list the DRL models used in each selected paper in TABLE 2. To address RQ4, we present a performance comparison of the DRL models discussed in the selected papers in TABLE 3.

IV. RESULTS AND DISCUSSION
This section provides an overview of the chosen papers. The results for each research question are discussed in depth in the following four subsections. A total of 32 papers that implemented and discussed deep reinforcement learning for anomaly detection applications were chosen for this SLR. These research articles were published from 2017 to 2022, which is relatively recent. The list of chosen papers for this SLR is given in Table 1.

A. ANOMALY DETECTION APPLICATION (RQ1)
In this section, we address Research Question 1 (RQ1), which concerns the anomaly detection applications that are implemented using DRL techniques. Anomaly detection may be applied in a wide range of applications. In this research, we found 13 different applications in the DRL-based anomaly detection publications gathered from the literature. Table 2 lists these applications and the papers discussing them.
As shown in Table 2, our selected articles discuss general anomaly detection, network anomaly detection, intrusion detection, network intrusion detection, cloud intrusion detection, video anomaly detection, building anomaly detection, wireless network security, and the internet of things (IoT). In addition, Figure 4 shows the percentage of each anomaly detection application in the selected papers. General anomaly detection and applications related to intrusion detection, which include network and cloud intrusion detection, are the most popular applications of deep reinforcement learning techniques. DRL has been reported to outperform other popular techniques, such as ML and statistical models, in anomaly detection applications that involve extensive unlabeled data or signal data, as in network, wireless signal, or cloud intrusion detection. DRL is also popular and performs well for video anomaly detection because video datasets are high dimensional and contain raw and unlabeled anomalies, which has been a problem for other models.

B. ANOMALY DATASETS (RQ2)
This section addresses RQ2, which aims to present all the datasets used for anomaly detection using DRL. Different datasets exist depending on the anomaly detection application. We have presented 48 different datasets utilized by the selected research papers for this SLR, as given in Table 3. Table 3 shows that the authors in P1 used four databases for anomaly detection, named NB15, Thyroid, HAR, and Cover type. These databases include 12 different anomalous datasets. In P2, the author used 24 different datasets for anomaly detection. The datasets used in P1 and P2 can be used for general anomaly detection models. P3, P11, P12, P13, P14, and P15 used different network anomaly datasets, which can be used for models built for other network anomaly detection tasks. ISOT-CID, NSL-KDD, AWID, and UNSW-NB15 are the intrusion detection datasets in P8, P9, P10, P11, P15, and P17, which are used in ML to detect network intrusions or attacks. P4 used a dataset of hand-designed objects in Gazebo to detect the position and shape of objects in robotics using anomaly detection. P5 and P7 used the video datasets UCF and UCSD, respectively. P3 used connected and automated vehicle (CAV) sensor data to detect anomalies using the DRL model. MedbIoT, a dataset containing Internet of Things (IoT) traces, was used in P18, and P19 used aerial computing network data for anomaly detection using DRL. The UCF anomaly detection dataset consists of about 1900 untrimmed real-world surveillance videos, 128 hours long in total, containing 13 cases of real video anomalies. UCSD is an anomaly detection video dataset that was acquired with a camera overlooking pedestrian walkways and is used to detect anomalous pedestrian motion patterns. P6 used a building-specific anomaly detection dataset to detect anomalies in buildings and to check the performance of the parameters from all sensors.

C. TYPES OF DEEP REINFORCEMENT LEARNING TECHNIQUES (RQ3)
This section addresses RQ3, in which we aim to specify the DRL algorithms used to detect anomalies in the selected papers, which is one of the primary goals of this review. Table 4 presents 17 deep reinforcement learning algorithms used for anomaly detection from 2017 to 2022, along with their applications. DRL models combine artificial neural networks with reinforcement learning to help the agent learn to achieve its goal. Deep Q-learning, actor-critic, deep policy gradients, and neural networks with RL are the popular algorithms used for anomaly detection in the selected papers; they are explained in the following subsections:

1) DEEP Q-NETWORK (DQN) AND DOUBLE DEEP Q-NETWORK (DDQN)
DQN is an RL algorithm that combines Q-learning with deep neural networks (DNNs) to make reinforcement learning effective in settings with large feature spaces and complex dynamics, such as video games and automation. The DQN can be trained as long as the prediction error is kept low. Despite being efficient, deep Q-learning is known to have serious problems, such as overestimating action values in some circumstances. Because the max operator in both Q-learning and DQN uses the same values to select and to evaluate an action, overestimated values are more likely to be chosen, which leads to overoptimistic value estimates. Researchers developed an enhanced technique to address this issue: double deep Q-learning (DDQN). By decomposing the max operation in the target into action selection and action evaluation, DDQN aims to reduce overestimation. Double DQN differs from DQN solely in the Q-value update phase.
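The difference between the two update targets can be illustrated with a minimal sketch. The function names, the discount factor, and the Q-value arrays below are hypothetical and chosen only for illustration; they are not taken from any of the reviewed papers. In the DQN target, the target network both selects and evaluates the next action, whereas in the double DQN target the online network selects the action and the target network evaluates it.

```python
import numpy as np

def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """DQN target: the same (target) values select and evaluate the next action."""
    if done:
        return reward
    return reward + gamma * np.max(next_q_target)

def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN target: online values select the action, target values evaluate it."""
    if done:
        return reward
    best_action = int(np.argmax(next_q_online))
    return reward + gamma * next_q_target[best_action]

# Hypothetical Q-value estimates for the next state over three actions.
next_q_online = np.array([1.0, 2.0, 1.5])
next_q_target = np.array([1.2, 0.8, 3.0])

y_dqn = dqn_target(1.0, next_q_target, gamma=0.9)
y_ddqn = double_dqn_target(1.0, next_q_online, next_q_target, gamma=0.9)
```

In this toy case the target network's largest value belongs to an action that the online network does not prefer, so the double DQN target is lower than the DQN target, which is the intended damping of overestimation.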

2) ACTOR-CRITIC (AC)
The actor-critic approach discussed here is a conceptually simple and lightweight framework for deep RL that optimizes deep neural network controllers via asynchronous gradient descent. The original study was carried out on asynchronous variants of four standard RL algorithms. Its findings demonstrate that parallel actor-learners stabilize learning and enable all four methods to train neural network controllers effectively. According to the results, the best-performing method, an asynchronous variant of actor-critic, surpasses the state-of-the-art algorithms available at the time. The asynchronous actor-critic was also reported to work well on a wide range of continuous motor control problems.
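As a rough illustration of the actor-critic idea, the sketch below performs a single one-step update for one learner. It is deliberately simplified: it is synchronous rather than asynchronous, and it uses tabular parameters instead of deep networks. The function names, learning rates, and the two-state toy problem are assumptions made for this example only. The critic's temporal-difference (TD) error serves as the signal that scales the actor's policy-gradient step.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def actor_critic_step(theta, v, s, a, r, s_next, done,
                      gamma=0.99, lr_actor=0.1, lr_critic=0.1):
    """Update tabular actor preferences (theta) and critic values (v) in place.

    Returns the TD error used to scale the actor update.
    """
    target = r if done else r + gamma * v[s_next]
    td_error = target - v[s]
    # Critic: move the value estimate toward the bootstrapped target.
    v[s] += lr_critic * td_error
    # Actor: gradient of log softmax policy with respect to theta[s].
    grad_log_pi = -softmax(theta[s])
    grad_log_pi[a] += 1.0
    theta[s] += lr_actor * td_error * grad_log_pi
    return td_error

# Hypothetical problem with two states and two actions.
theta = np.zeros((2, 2))
v = np.zeros(2)
delta = actor_critic_step(theta, v, s=0, a=1, r=1.0, s_next=1, done=False)
```

After this step, the rewarded action in state 0 receives a higher preference than the alternative. An asynchronous variant would run several such learners in parallel and apply their gradients to shared parameters.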

3) POLICY GRADIENT (PG)
The foundation of policy gradient is the training of a policy function, which specifies the action to be taken for each possible state. The policy function is approximated by a basic NN with a few layers, using ReLU activation for all layers except the last, which uses softmax activation to produce a probability distribution over the actions. The technique employs generalized trajectories, each consisting of a list of pairs formed by a state and its associated ground-truth label. These trajectories are grouped into mini-batches of n trajectories. Each training iteration processes one mini-batch of trajectories, and a new mini-batch is created for each iteration. The algorithm first predicts the actions from the states using the policy function. Actions are predicted for all the states in a trajectory, which results in a list of predicted actions. These predicted actions are produced by sampling the probability distribution over actions defined by the policy function; this step is referred to as the ''Prob. Distribution Sampler''.
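The procedure described above can be sketched as follows. This is a minimal illustration under several assumptions that do not come from the reviewed paper: a single ReLU hidden layer, a reward of +1 when the sampled action matches the ground-truth label and -1 otherwise, a plain REINFORCE-style gradient step, mini-batches of individual state-label pairs rather than whole trajectories, and a synthetic two-feature dataset. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_policy(n_features, n_hidden, n_actions):
    return {
        "W1": rng.normal(0.0, 0.1, (n_features, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, n_actions)),
        "b2": np.zeros(n_actions),
    }

def policy_probs(params, states):
    """ReLU hidden layer followed by a softmax over the actions."""
    hidden = np.maximum(0.0, states @ params["W1"] + params["b1"])
    logits = hidden @ params["W2"] + params["b2"]
    logits -= logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    return hidden, exp / exp.sum(axis=1, keepdims=True)

def reinforce_step(params, states, labels, lr=0.5):
    """Sample one action per state, score it against the label, update the policy."""
    hidden, probs = policy_probs(params, states)
    # "Prob. Distribution Sampler": draw each action from the policy output.
    actions = np.array([rng.choice(probs.shape[1], p=p) for p in probs])
    # Assumed reward: +1 for matching the ground-truth label, -1 otherwise.
    rewards = np.where(actions == labels, 1.0, -1.0)
    # Gradient of mean(reward * log pi(a|s)) with respect to the logits.
    one_hot = np.eye(probs.shape[1])[actions]
    d_logits = (one_hot - probs) * rewards[:, None] / len(states)
    d_hidden = (d_logits @ params["W2"].T) * (hidden > 0)
    # Gradient ascent on the expected reward.
    params["W2"] += lr * hidden.T @ d_logits
    params["b2"] += lr * d_logits.sum(axis=0)
    params["W1"] += lr * states.T @ d_hidden
    params["b1"] += lr * d_hidden.sum(axis=0)
    return rewards.mean()

# Synthetic data: label 1 ("anomalous") when the first feature is positive.
states = rng.normal(size=(256, 2))
labels = (states[:, 0] > 0).astype(int)
params = init_policy(2, 16, 2)
for _ in range(300):
    batch = rng.choice(len(states), size=32, replace=False)
    mean_reward = reinforce_step(params, states[batch], labels[batch])
```

Each loop iteration draws a fresh mini-batch, mirroring the description that a new mini-batch is created for every training iteration.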

4) DEEP DETERMINISTIC POLICY GRADIENT (DDPG)
DDPG is an actor-critic, off-policy, and sample-efficient deep RL algorithm. It combines DQN and QAC, using a deterministic policy and off-policy updates from a replay buffer. The deterministic policy serves as an approximate maximizer of the Q-value over the action space. DDPG uses target networks with delayed updates and adds Gaussian noise to the actions for exploration. DDPG has some weaknesses and instability, which can be attributed partly to an overestimation bias in the critic updates. It is also well known to be challenging to tune because of its sensitivity to hyper-parameter settings. These problems can be mitigated using well-tuned code baselines that include many state-of-the-art methods.
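Three of the ingredients named above (target networks, Gaussian exploration noise, and the critic's bootstrapped target) can be sketched in a few lines. The sketch assumes a soft (Polyak) update as the way the target networks lag behind the online networks, and it uses hypothetical linear functions in place of the actor and critic networks; the function names and parameter values are illustrative only.

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.005):
    """Return target parameters moved a fraction tau toward the online ones."""
    return {name: (1.0 - tau) * target_params[name] + tau * online_params[name]
            for name in online_params}

def noisy_action(action, sigma, rng, low=-1.0, high=1.0):
    """Add Gaussian exploration noise to a deterministic action and clip it."""
    noise = rng.normal(0.0, sigma, size=np.shape(action))
    return np.clip(action + noise, low, high)

def critic_target(reward, next_state, done, target_actor, target_critic, gamma=0.99):
    """Bootstrapped critic target y = r + gamma * Q'(s', mu'(s'))."""
    if done:
        return reward
    next_action = target_actor(next_state)
    return reward + gamma * target_critic(next_state, next_action)

# Hypothetical linear actor and critic, used only to exercise the functions.
target_actor = lambda s: 0.5 * s
target_critic = lambda s, a: s + a

y = critic_target(1.0, 2.0, False, target_actor, target_critic, gamma=0.9)
new_target = soft_update({"w": np.array([0.0])}, {"w": np.array([2.0])}, tau=0.5)
a = noisy_action(np.array([0.9, -0.9]), sigma=0.2, rng=np.random.default_rng(0))
```

A small `tau` keeps the target networks changing slowly, which is one reason the critic targets stay comparatively stable between updates.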

5) META POLICY ACTIVE LEARNING (META-AAD)
Active Anomaly Detection with Meta-Policy (Meta-AAD) is an active anomaly detection method based on deep reinforcement learning. It is a new approach that learns a meta-policy for query selection. Meta-AAD models active anomaly detection as a Markov decision process and uses deep reinforcement learning to train the meta-policy to select the best instance to query in each iteration, so that the number of anomalies found during the querying procedure is explicitly maximized. Because it inherently optimizes both short-term and long-term rewards, Meta-AAD can serve as a general framework for active anomaly detection. It is also simple to deploy, since a learned meta-policy can be applied directly to new datasets without additional tuning.

6) DEEP AUTOENCODER Q-NETWORK (DAEQ-N)
This model framework is built on an unconventional approach to experience replay, comparable to recently published research. The rewards in the proposed model are determined by summing all the discrepancies between encoding and decoding. The average total reward in RL tends to fluctuate dramatically because small changes in the weights can produce larger changes in the state distribution, and a typical deep neural network exhibits oscillating average-reward curves. The model therefore employs an auto-encoder, and training aims to increase the average total reward. With the auto-encoder, there is a very high probability of making consistent, steady improvements.

7) RCNN DRNN
Deep reinforcement learning (DRL) uses deep neural networks (DNNs) to train an autonomous agent to interact with a given environment and achieve specified objectives. Recurrent neural network (RNN) based DRL has been shown to perform better than other approaches because RNNs are more adept at capturing the temporal dynamics of the environment and producing the right agent responses. Despite this strong performance, the RNNs' internal representation of the environment and their long-term memory are still poorly understood. For deep learning practitioners, revealing these details is crucial to understanding and improving DRL models. However, doing so is difficult because these models involve complex data transformations.

D. PERFORMANCE ANALYSIS FOR DEEP REINFORCEMENT LEARNING ALGORITHMS (RQ4)
In this section, we address RQ4, which is concerned with the performance of the DRL models and their comparison with the alternative models used in the papers selected for this review. Table 5 lists all the DRL models, specifying the anomaly detection application, the dataset, and the models' performance in terms of accuracy. Some papers report the accuracy of the model; others evaluate their models by comparison with the state of the art. As the table shows, the NSL-KDD dataset is used by papers P8, P9, P11, and P15. In P11, deep reinforcement learning for network intrusion detection systems (DRL-NIDS) proved to be the better DRL algorithm, detecting network anomalies in the NSL-KDD dataset with 91.4% accuracy. Both P10 and P11 used the UNSW-NB15 dataset, but DRL-NIDS in P11 performed better, with 91.8% accuracy. Concerning the application type, P5 and P7 performed video anomaly detection on large real-time video datasets. In comparison, the IVADC-FDRL and Deep Q-learning Network (DQN) model in P7 performed better than the other model for video anomaly detection, with up to 98% accuracy. Each DRL-based model proposed in the selected papers from 2017 to 2022 outperformed the competing or alternative models it was compared against. DRL algorithms have thus performed well for anomaly detection in settings where other techniques fall short.
According to our SLR, researchers utilize a variety of evaluation metrics to assess the performance of various RL models. Table 6 provides a collection of commonly employed evaluation metrics.
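For readers who want to reproduce such comparisons, the sketch below computes a few standard metrics from binary anomaly labels. Accuracy is the metric reported in the discussion above; precision, recall, and F1 are included here as common companions in anomaly detection and are an assumption of this example, not a restatement of Table 6. The labels in the usage example are made up.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN, with 1 = anomaly and 0 = normal."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def detection_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1; undefined ratios become 0.0."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up ground truth and predictions for eight samples.
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 0, 1]
scores = detection_metrics(y_true, y_pred)
```

Because anomalies are usually rare, accuracy alone can look high even when most anomalies are missed, which is why recall and F1 are often reported alongside it.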

V. CONCLUSION

A. THEORETICAL IMPLICATIONS
First, this study is among the first systematic literature reviews on anomaly detection using deep reinforcement learning (DRL) techniques. Although some literature reviews are available, none has addressed these methods appropriately. Our systematic literature review focuses on presenting extensive research on detecting anomalies using DRL techniques, the datasets used, and the performance of each DRL model. Our SLR reviews the various DRL models for anomaly detection, compares their performance with that of alternative techniques, and presents the anomaly detection applications and datasets used in the research articles selected for this SLR, which cover 2017 to 2022.
In recent years, DRL has outperformed deep learning and machine learning in many ways. DRL models combine artificial neural networks with reinforcement learning to help the agent learn to achieve its goal. As far as anomaly detection is concerned, the main DRL techniques are deep Q-learning, double deep Q-learning, deep auto-encoder Q-learning, policy gradient, and actor-critic models. In the selected papers, these DRL models have outperformed the other deep and machine learning techniques for detecting anomalies in various applications. This research shows that a deep Q-network can be used if the researcher is dealing with intrusion or video anomaly data. Deep policy gradient techniques have been used for building anomaly detection, and the actor-critic has been used for intrusion detection.
In the anomaly detection studies, DRL has covered various kinds of data, such as network anomaly, industrial network anomaly, wireless network anomaly, network intrusion, cloud intrusion, general anomaly, video anomaly, building anomaly, signal anomaly, and unknown anomaly detection. We have shown that each DRL technique is applied to a different application and anomaly dataset and yet outperforms the other models. Therefore, DRL appears to perform best across the anomaly detection applications covered by the selected papers.

B. LIMITATIONS
This research concerns anomaly detection using deep reinforcement learning, for which only a limited number of research articles are available because it is a new technique whose use began in 2017. Therefore, our systematic literature review covers 2017 to 2022. This SLR is also limited to journal and conference papers that use only a DRL framework for anomaly detection; several other anomaly detection methods were excluded to meet the selection criteria. We believe this systematic literature review could be improved by broadening its scope and sources.

C. FUTURE AVENUE FOR RESEARCHERS
This review presents DRL models for anomaly detection based on only 32 papers published from 2017 to 2022. Therefore, we highly recommend that other researchers conduct more research on deep reinforcement learning for anomaly detection to gather evidence about its performance. RL is an emerging field with broad scope. Moreover, we observed that DRL has been applied to only a limited number of anomaly detection applications. A possible future avenue for researchers is to explore DRL techniques for anomaly detection applications not listed in this SLR.
Anomaly detection can be applied to a wide range of applications. We found 13 different applications in this SLR. Most of the research on DRL concerns network or intrusion-type anomaly detection. Researchers can experiment with DRL techniques for other anomaly detection applications, e.g., video anomaly, wireless anomaly, or industrial anomaly detection, as DRL has performed well for these applications.
There are various anomaly datasets available in the literature. Most of the anomaly data used with DRL techniques in the research articles identified in this SLR consists of medical and network datasets, on which DRL techniques have outperformed the ML and DL techniques. Another future avenue for researchers is to apply these DRL techniques to other anomaly datasets to test whether they outperform other techniques for each type of anomaly data.
As we can see from Table 4, deep Q-learning, actor-critic, policy gradient, and reinforcement learning with RNNs are the most valuable deep reinforcement learning techniques. Researchers should therefore explore more of their variants, e.g., double deep Q-learning, Q-networks with autoencoders, and meta-policies, combine DL models with RL techniques, and experiment on different applications to gain more evidence on DRL for anomaly detection.

VI. CLOSING REMARKS
This systematic literature review presents anomaly detection through Deep reinforcement learning (DRL). We collected a total of 32 research papers that used the DRL framework for anomaly detection from 2017 to 2022. We reviewed and analyzed these papers from these four perspectives: the type of anomaly detection application, the anomaly detection dataset, the proposed DRL techniques, and the DRL model performance over other alternative models.
For RQ1, we observed 13 different applications of anomaly detection that have been addressed with DRL in the selected papers. The most popular anomaly detection applications of DRL include network intrusion detection, video anomaly detection, and general anomaly detection. For RQ2, we identified 50 different anomaly detection datasets from different specific anomaly detection applications. Most datasets are real-time datasets, and some are public datasets. As for RQ3, we demonstrated 17 different DRL models that have been used for anomaly detection in the selected papers from 2017 to 2022. The most popular DRL methods are deep Q-learning, actor-critic, deep policy gradient, and neural networks with RL. Finally, for RQ4, we presented a performance comparison of the DRL techniques with the alternative models from the selected papers.
KINZA ARSHAD received the bachelor's degree in computer science from the University of Management and Technology, Lahore, Pakistan, in 2020, where she is currently pursuing the M.S. degree in data science. Her projects and research interests include machine learning, deep learning, and machine translation.
RAO FAIZAN ALI received the bachelor's degree in computer science from COMSATS University Islamabad, Pakistan, the M.Phil. degree in computer science from the University of Management and Technology, Lahore, Pakistan, and the Ph.D. degree from Universiti Teknologi PETRONAS, Malaysia. He has ten years of experience in teaching and research. He is currently an Assistant Professor with the University of Management and Technology. He has held various computer science positions in the financial, consulting, academic, and government sectors.
AMGAD MUNEER received the B.Eng. degree (Hons.) in mechatronic engineering from the Asia Pacific University of Technology and Innovation (APU), in 2018, and the master's degree in information technology from Universiti Teknologi PETRONAS, Malaysia, in 2022, where he is currently pursuing the Ph.D. degree. Currently, he is a Research Assistant II with the Department of Imaging Physics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA. He has authored several high-impact articles in well-reputed journals and conferences. His research interests lie in AI applications for cancer data sciences, manufacturing data analytics, the Internet of Things, medical imaging, and bioinformatics. He is a reviewer of many international impact-factor journals.
IZZATDIN ABDUL AZIZ received the Ph.D. degree in information technology from Deakin University, Australia, working on the domain of hydrocarbon exploration and cloud computing. He is currently a Researcher with the High-Performance Cloud Computing Centre (HPC3), Universiti Teknologi Petronas (UTP), where he focuses on solving complex upstream oil and gas (O&G) industry problems from the viewpoint of computer sciences. He is the Deputy Head of the Computer and Information Sciences Department, UTP. He is working closely with O&G companies in delivering solutions for complex problems, such as offshore O&G pipeline corrosion rate prediction, O&G pipeline corrosion detection, securing data on clouds, designing and implementing Metocean prediction system, and bridging upstream and downstream oil and gas businesses through data analytics. He is also working on big data transmission, security, and optimization problems on high performance clouds.
SHERAZ NASEER received the M.S. degree in information security and the Ph.D. degree in computer science. He has 15 years of experience in industry and academia. He is currently an Assistant Professor with the Department of Computer Science, University of Engineering and Technology, Pakistan. His research interests include bioinformatics, data driven information security, and anomaly detection. He received the professional certifications of IT including, CISSP, CoBit, and ITIL.
NABEEL SABIR KHAN was born in Lahore, Pakistan. He received the M.C.S. and M.S. degrees from the University of Central Punjab and the Ph.D. degree from the University of Management and Technology, Pakistan, in 2020. He is currently an Assistant Professor with the University of Central Punjab, Lahore, Pakistan. He has more than 13 years of teaching experience. He is also the Regional Director of ACM-ICPC ASIA, Lahore. His research interests include theory of programming language, machine translation, and computer science education.
SHAKIRAH MOHD TAIB (Member, IEEE) received the bachelor's degree in information technology from Universiti Utara Malaysia and the M.Comp. degree from the University of Tasmania, Australia. She is a Lecturer and a Researcher with the Centre for Research in Data Science (CeRDaS), Universiti Teknologi PETRONAS (UTP), Malaysia. She has more than 15 years of working experience at Universiti Teknologi PETRONAS (UTP). Her research interests include data science, machine learning, knowledge discovery, and information retrieval using artificial intelligence techniques. She is a member of international organizations, such as the Malaysia Board of Technologists (MBOT) and the Association for Information Systems (AIS).