Double Deep Q-Learning with Prioritized Experience Replay for Anomaly Detection in Smart Environments

Anomaly detection in smart environments is important when dealing with rare events, which can be safety-critical to individuals or infrastructure. Safety-critical means in this case that these events can be a threat to the safety of individuals (e.g. a person falling to the ground) or to the security of infrastructure (e.g. unauthorized access to protected facilities). However, recognizing abnormal events in smart environments is challenging because of the complex and volatile nature of the data recorded by monitoring sensors. Methodologies proposed in the literature are frequently domain-specific and are subject to biased assumptions about the underlying data. In this work, we propose the adaptation of a deep reinforcement learning algorithm, namely double deep Q-learning (DDQN), for anomaly detection in smart environments. Our proposed anomaly detector directly learns a decision-making function, which can classify rare events based on multivariate sequential time series data. With an emphasis on improving the performance in rare event classification tasks, we extended the algorithm with a prioritized experience replay (PER) strategy and show that the PER extension yields an increase in detection performance. The adaptation of this improved version of the DDQN reinforcement learning algorithm for anomaly detection in smart environments is the major contribution of this work. Empirical studies on publicly available real-world datasets demonstrate the effectiveness of our proposed solution. Specifically, we use one dataset for fall detection and one for occupancy detection to evaluate the proposed solution. Our solution yields detection performance comparable to previous work and has the additional advantages of being adaptable to different environments and capable of online learning.


I. INTRODUCTION
The Internet of Things (IoT) refers to smart objects that are connected to the internet. Smart objects are sensor-enabled devices that operate within an environment. The term smart refers to the capability of automatically acquiring knowledge about an environment and its surroundings and applying it according to the user's needs [1]. Smart environments comprise interconnected smart objects. By integrating IoT with smart environments, the environment can be controlled and monitored remotely [2]. Abnormal situation recognition in smart environments is important, especially when dealing with rare events, which can be safety-critical for individuals or infrastructure. Smart environments can be prone to malfunctions, which can occur due to internal or external factors and may have fatal consequences for the users that interact with them [3]. Recently, various deep learning approaches have been applied to detect these rare situations by analyzing and detecting anomalous patterns in data originating from homogeneous and heterogeneous sensor sources. Anomaly detection methods have been investigated and proven to be well suited for the recognition of unwanted behaviors and safety-critical situations within smart environments; such safety-critical situations can also be the consequence of physical or cyber-attacks. Among others, applications of anomaly detection algorithms include network intrusion detection [4], fraud detection [5], system health monitoring [6] and smart sensor networks [7]. By enabling and improving the ability to recognize rare events, anomaly detection algorithms contribute to the comfort and safety of inhabitants and help to prevent fatal consequences. However, recognizing rare events in smart environments is challenging because of the complex and volatile nature of the data recorded by monitoring sensors, especially if the data is noisy, multivariate and time-dependent. The methodologies proposed in the literature are frequently domain-specific and are subject to biased assumptions about the underlying data [8]. By solving continuous Markov decision processes (MDPs), reinforcement learning (RL) approaches can overcome this challenge. The conditions under which data patterns are categorized are often modeled as rule-based and manually defined by human experts. This raises the need for methodologies that do not rely on any explicit assumption about the data and the events that can occur within smart environments. Our proposed method directly learns to detect rare events by creating experiences in a data-driven fashion. Recent work on anomaly detection has considered the combination of deep learning methodologies with the paradigms of reinforcement learning.
In this work, we propose the novel use of DDQN for anomaly detection in smart environments. We adapted and extended the algorithm with a prioritized experience replay (PER) strategy, with an emphasis on increasing the performance in rare event classification tasks. The result is a novel anomaly detector for multivariate sequential time series data based on the paradigms of reinforcement learning. The proposed solution is evaluated on two independent datasets, which contain real-world sensory data originating from smart environments. Specifically, a dataset for fall detection [9] and a dataset for occupancy detection [10] have been chosen. These datasets provide a good foundation for the development of methodologies that can detect events that are safety-critical for individuals (e.g. a person falling to the ground) or for the security of infrastructure (e.g. unauthorized access to protected facilities). The evaluations indicate that the use of the DDQN algorithm with PER for anomaly detection in smart environments results in accurate predictive performance. Our proposed method additionally achieves superior performance on the task of fall detection compared to the state-of-the-art. Moreover, our proposed method does not rely on assumptions made by human experts; such assumptions are often applied in rule-based approaches, are inherently biased, and require human intervention. Instead, our proposed solution is an online learning algorithm that is adaptive in the sense that it continually learns from experiences captured by monitoring sensors.
The next section presents work on applications of deep reinforcement learning for anomaly detection. Section III presents the anomaly detector proposed in this work, including the definition of the anomaly detection problem in terms of reinforcement learning as well as a description of the DDQN algorithm that we adapted. In Section IV, the datasets used for the experiments conducted in this work are described. The experimental setup and hyperparameter configurations are described in Section V. Section VI presents the results achieved by our detection algorithm, including a comparison to state-of-the-art approaches. Finally, Section VII concludes this work.

II. RELATED WORK
In recent years, anomaly detection has emerged and caught the attention of the research community for the recognition and prevention of unforeseen events in smart environments [11]. Ensuring the security of IoT architectures is one of the main application areas of anomaly detection in smart environments, as shown by many surveys focusing on intrusion detection systems (IDS) [12]–[15]. These literature reviews include a critical review of IDSs for IoT architectures [12], security and privacy challenges in different IoT layers [13], IDS research for IoT networks [14], as well as detailed categorizations of IDSs in the IoT domain [15].
Anomaly detection based on machine learning has been proven to perform well on the vast amount of sensory data provided by smart sensor networks. However, work on reinforcement learning methods constitutes only a minor part of the current research and is scarcely considered for the task of anomaly detection and situation recognition, as shown in many surveys on deep learning based anomaly detection [8], [16]–[18]. This is despite reinforcement learning methods achieving superior performance compared to humans in decision-making tasks like game playing [19], [20].
Deep reinforcement learning has been indirectly applied for anomaly detection in buildings. By using the deep deterministic policy gradient (DDPG) algorithm, Wu and Ortiz [21] explored the hyperparameter space of a building-specific anomaly detection algorithm. However, the authors proposed an indirect application of reinforcement learning. The DDPG algorithm has not been used to perform the actual detection task, but to optimize the hyperparameters of the anomaly detection algorithm.
Similarly, the methodology suggested by Zha et al. [22] is an indirect application of deep reinforcement learning for anomaly detection. The authors investigated a policy selection task that can be solved by proximal policy optimization (PPO). They suggest re-ranking possible anomalies based on information gained from anomaly verification procedures performed by human anomaly analysts. The proposed approach aims to support human domain specialists in ranking anomalous sequences in time series so that more true anomalies can be discovered.
In contrast to [21], [22], Kurt et al. [23] suggested a direct application of reinforcement learning for anomaly detection. The authors investigated a model-free reinforcement learning approach for the online detection of cyberattacks in smart grid applications. They modeled the anomaly detection problem as a partially observable Markov decision process (POMDP) and evaluated the effectiveness of a model-free SARSA algorithm to produce the optimal anomaly detector for small problem sizes. In [24], Zhong et al. proposed a deep actor-critic reinforcement learning framework for anomaly detection on sensory data. The deep actor-critic agent proposed by the authors dynamically selects the sensor to be tested based on sequential process data. Oh and Iyengar [25] investigated the effectiveness of inverse reinforcement learning (IRL) for anomaly detection based on sequential data in safety-critical environments. Their approach determines the agent's reward function by using a neural network, which they inferred via IRL. The proposed method adopts a Bayesian approach to take the confidence of the predicted anomaly scores into account. The authors evaluated their method on a real-world dataset that contains car trajectory data. Their results show that the proposed approach performs well in detecting anomalous data patterns in GPS data; however, it only works in a low-dimensional feature space. The work of Yu and Sun [26] takes a more general approach. The authors proposed a general policy gradient anomaly detector based on the asynchronous advantage actor-critic (A3C) algorithm. Although the A3C algorithm allows for continuous action spaces, the authors stated that they did not make use of a continuous action space. In [27], the authors proposed a general experience-collecting framework for time series anomaly detection based on deep Q-learning (DQN). They adopted a long short-term memory (LSTM) network to model the temporal dependencies in the data and applied the Q-learning algorithm with memory replay. The authors achieved competitive detection performance on the Numenta dataset. However, a shortcoming of the proposed approach is that it is limited to a one-dimensional feature space.
The previously mentioned work emphasizes the use of reinforcement learning for anomaly detection in particular domains. The developments and investigations described in this work are mainly based on the method described in [27], but our work substantially extends it by adapting it to multivariate sequential time series learning scenarios. Additionally, we propose the use of DDQN for anomaly detection to make policy estimation more stable. Moreover, we propose to extend DDQN with PER to emphasize learning from rare data patterns, and show that our DDQN-PER solution yields a performance increase.

III. METHODOLOGY
This section presents the definitions of our anomaly detector and its components. Figure 1 shows a complete overview of the processing pipeline of our proposed methodology. Motivated by [27], we present the adaptation of an improved reinforcement learning algorithm for anomaly detection in smart environments. Our detector is based on the DDQN algorithm [28], which dynamically improves its detection performance based on experiences. In Section III-A, we describe the DDQN algorithm we applied; in Section III-B, we present how we extended the DDQN algorithm with PER; finally, in Section III-C, we describe the processing pipeline in further detail.

ANOMALY DETECTOR → π
The anomaly detector follows the policy π. Equation (1) defines the policy π as a conditional probability distribution:

π(s, a) = p(A = a | S = s)    (1)

where S is the set of states and A is the set of actions of the system; π(s, a) denotes the probability of choosing action a in the provided state s.

DETECTOR PERFORMANCE → V_π
The performance of the anomaly detector is given by (2):

V_π = Σ_{s∈S} d_π(s) Σ_{a∈A} π(s, a) Q(s, a)    (2)

where d_π(s) is the probability of the system being in state s when acting according to the policy π, and Q(s, a) represents the accumulated reward from state s with action a. The average accumulated reward following the policy π is a measure of the performance of the anomaly detector.
OPTIMAL ANOMALY DETECTOR → π*
An optimal anomaly detector aims at maximizing its performance. The maximal performance is achieved by following the optimal policy as given by (3):

π* = argmax_π V_π    (3)
In the case where d_π(s) is approximately the same for all s ∈ S, i.e. d_π(s) ≈ 1/|S| with |S| being the number of states in S, it follows that:

V_π ≈ (1/|S|) · Σ_{s∈S} Σ_{a∈A} π(s, a) Q(s, a)    (4)

Equation (4) shows that the optimal anomaly detector following the optimal policy π* is determined by the accumulated Q-value function Q(s, a): in every state, π* selects the action with the maximal Q-value. This holds true under the assumptions that 1) the anomaly detection problem is deterministic and 2) d_π(s) is approximately uniform.

Q-LEARNING
Q-learning is a variant of temporal difference (TD) learning in which the agent evaluates the utility of an action, rather than a state. Although methods of dynamic programming can be used in model-based environments to derive the optimal policy π*, they require the full dynamics of the MDP to be known beforehand. Q-learning represents a model-free approach, which can be used to solve environments without a complete environmental model.
In Q-learning, the agent performs an action a_t for the current state s_t according to its policy π and receives the resulting reward r_t. From the subsequent state s_{t+1}, the agent takes the most promising action a_{t+1} according to its current evaluation function to estimate the future reward. Based on this principle, the agent adjusts its evaluation function according to equation (5):

Q(s_t, a_t) ← Q(s_t, a_t) + α · θ_t    (5)
where α is the learning rate and θ_t is the TD-error, i.e. the difference between the estimated optimal future reward and the current reward estimate. The TD-error is given by (6):

θ_t = r_t + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t)    (6)
where γ is the discount factor, which determines how much the reinforcement learning agent cares about distant future rewards relative to those in the immediate future [29].
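To make equations (5) and (6) concrete, the following minimal tabular sketch shows a single TD update; the learning rate alpha and discount gamma are assumed example values, and the tabular representation is for illustration only (our detector uses a neural Q-function estimator, see Section III-A).

```python
import numpy as np

def q_learning_update(Q: np.ndarray, s: int, a: int, r: float, s_next: int,
                      alpha: float = 0.1, gamma: float = 0.99) -> float:
    """One tabular Q-learning step implementing equations (5) and (6)."""
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]  # eq. (6): TD-error
    Q[s, a] += alpha * td_error                         # eq. (5): update
    return td_error
```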

EXPERIENCE
Gaining experience from data observations is the major factor that enables the Q-learning algorithm to improve the policy π. We define an experience as the tuple

e_t = (s_t, a_t, r_t, s_{t+1}, θ_t)

where s_t is the state at time step t, a_t the chosen action, r_t the reward gained by choosing action a_t in state s_t, s_{t+1} the subsequent state and θ_t the TD-error. The collected experience includes all past behaviors of the anomaly detector. The aim of the anomaly detector to consistently learn from experience corresponds to gaining a better estimation of Q(s, a).
With the replay methodology of Q-learning and the PER extension, the estimation of Q(s, a) can be improved more effectively when dealing with unequally experienced state-action transitions. Unequally experienced state-action transitions can be a result of unbalanced training data, as in the case of anomaly detection tasks, where commonly fewer annotated anomalous training samples exist than normal samples. In the default experience replay approach, transitions are replayed at the same frequency with which they were originally experienced. The PER extension, in contrast, takes the significance of transitions into account, such that important transitions are replayed more frequently [30].

MARKOV DECISION PROCESS
An abstract MDP is defined by the 4-tuple (S, A, P_a, R_a). The MDP as formulated in this work has been adapted to be suitable for multivariate sequential time series anomaly detection. The elements of the adapted MDP are given below; a minimal code sketch of the state construction and the reward assignment follows after this list.
• State space S: A state s includes a feature vector v which consists of all available time series features x_i, where x is a numerical value and i is the feature index. Furthermore, s keeps track of the actions a, such that the anomaly detector can be extended and applied to partially observable MDPs. The concatenation of the feature vectors v and the actions a over time represents the state of the MDP, such that the state s at time t is given by

s_t = (v_{t−h+1}, a_{t−h+1}, …, v_{t−1}, a_{t−1}, v_t)

where the horizon h is a hyperparameter defining the number of past observations and actions that are considered in a state.
• Action space A: The anomaly detector described in this work differentiates between two actions that it can choose in a specific state. This corresponds to a binary anomaly classification task, represented by the action space

A = {0, 1}

where the action 0 indicates a normal state and the action 1 indicates an anomalous state.
• Transition probabilities P_a: Under the assumption that the anomaly detection process is deterministic, it can be concluded that P_a(s_t, s_{t+1}) = 1 for all actions a ∈ A. However, in real-world scenarios, there is uncertainty about the determinism of anomaly detection tasks: choosing the same action in recurring states might lead to different outcomes.
• Reward function R_a: In general, the reward function of an MDP is a sensitive factor, because the reward values directly influence the performance of the anomaly detector. The reward function for the approach proposed in this work is defined by

R(s_t, a_t) = r_TP if a_t = 1 and s_t is anomalous (TP)
              r_TN if a_t = 0 and s_t is normal (TN)
              r_FP if a_t = 1 and s_t is normal (FP)
              r_FN if a_t = 0 and s_t is anomalous (FN)

Notice that the recognition of an anomalous state, denoted by true positive (TP), results in the highest reward r_TP.
In the case an anomalous state is mistakenly classified as normal, denoted by false negative (FN), the lowest reward is obtained. A false positive (FP) denotes a normal state that is falsely classified as abnormal. A true negative (TN) denotes a correctly classified normal state. By using the proposed reward function, our learning algorithm is guided towards the recognition of anomalous states, while avoiding misclassifications of anomalous states. Avoiding misclassifications of anomalous states is very important in safety-critical applications because an unrecognized anomalous state could have fatal consequences for individuals or infrastructure.
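The sketch below illustrates, under stated assumptions, how the state construction and reward function of the adapted MDP could be realized. The concrete reward values are illustrative placeholders (only their ordering, with the true positive reward highest and the false negative reward lowest, follows from the text), and the exact layout of the state vector is an assumption about the concatenation described above.

```python
import numpy as np
from collections import deque
from typing import Optional

# Illustrative reward values: the text only fixes that a true positive yields
# the highest and a false negative the lowest reward.
R_TP, R_TN, R_FP, R_FN = 5.0, 1.0, -1.0, -5.0

def reward(action: int, is_anomalous: bool) -> float:
    """Reward for a binary decision, where action 1 flags an anomaly."""
    if is_anomalous:
        return R_TP if action == 1 else R_FN   # TP vs. FN
    return R_FP if action == 1 else R_TN       # FP vs. TN

class StateBuilder:
    """Concatenates the last h feature vectors v and the preceding actions a
    into the state s_t (one possible reading of the state definition above)."""
    def __init__(self, horizon: int):
        self.h = horizon
        self.features = deque(maxlen=horizon)                          # v_{t-h+1}..v_t
        self.actions = deque([0] * (horizon - 1), maxlen=horizon - 1)  # a_{t-h+1}..a_{t-1}

    def observe(self, v: np.ndarray, prev_action: int) -> None:
        self.features.append(v)
        self.actions.append(prev_action)

    def state(self) -> Optional[np.ndarray]:
        if len(self.features) < self.h:
            return None  # no prediction before h samples have been observed
        return np.concatenate([*self.features,
                               np.asarray(self.actions, dtype=float)])
```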

A. DOUBLE DEEP Q-LEARNING
Q-learning [31] is one of the most popular RL algorithms, although it is subject to unexpectedly high action values under certain conditions. The overestimation problem of action values arises from the maximization step in the Q-learning update function, where overestimated values are automatically preferred. The default DQN algorithm cannot deal with this problem, as the Q-function estimator, a multilayer perceptron (MLP), directly represents the bootstrapped value function for each sequential update operation. While the addition of an uncorrelated replay memory works as an antagonist, overestimation still occurs and has negative effects on the learned policy [28]. Van Hasselt et al. [28] proposed a delayed-update target network to further decorrelate bootstrapped experience from the value function update. In standard deep Q-learning, the target values are calculated in the following way:

y_t = r_t + γ · max_a Q(s_{t+1}, a)    (7)

In DDQN [28], the target network's weights are updated in a delayed fashion and with uncorrelated experience, by bootstrapping the target values from a periodically updated target estimator Q_θ. Using (8), we compute the target values in DDQN:

y_t = r_t + γ · Q_θ(s_{t+1}, argmax_a Q(s_{t+1}, a))    (8)
where Q_θ denotes the target network. We use Q for action selection and Q_θ for action evaluation. The pseudocode of the DDQN algorithm with the experience replay addition is given by Algorithm 1; the changes applied to the default DQN algorithm are highlighted in yellow. The adaptation of DQN towards DDQN is the minimal possible change: the remaining part of the algorithm remains as proposed in the original method [31], and the adaptation adds only a small computational overhead [28]. Furthermore, it has been shown in [28] that for large-scale problems with deterministic MDPs the inherent estimation error is a prevalent problem, for which DDQN offers a rather simple solution. In the case of anomaly detection, overestimation is particularly important to consider: on large time series datasets, policy estimation might be unstable during training, and overestimated Q-function values on time series patterns can result in bad anomaly detection strategies. As a result, the anomaly detection algorithm might perform weakly on volatile datasets. Therefore, we improved the DDQN algorithm with PER such that it can learn policies on MDPs in a more stable way. This contributes to the novel RL-based anomaly detector presented in this work. To the best of our knowledge, DDQN with PER has not been investigated in the context of anomaly detection before, although it leads to significant performance improvements, as will be shown in Section VI.
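As a minimal sketch of the difference between (7) and (8), the following functions compute batched target values; q_online and q_target are assumed to be callables that map a batch of states to an array of Q-values per action, and terminal-state handling is omitted for brevity.

```python
import numpy as np

def ddqn_targets(q_online, q_target, rewards, next_states, gamma=0.99):
    """DDQN targets (eq. 8): the online network Q selects the action,
    the target network Q_theta evaluates it."""
    best_actions = np.argmax(q_online(next_states), axis=1)  # action selection
    q_next = q_target(next_states)                           # action evaluation
    return rewards + gamma * q_next[np.arange(len(rewards)), best_actions]

def dqn_targets(q_online, rewards, next_states, gamma=0.99):
    """Standard DQN targets (eq. 7): select and evaluate with the same network."""
    return rewards + gamma * np.max(q_online(next_states), axis=1)
```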

B. PRIORITIZED EXPERIENCE REPLAY
Experience replay is one of the main features of deep Q-learning. With an emphasis on rare event classification tasks, we suggest extending the DDQN algorithm with PER. Schaul et al. [30] originally proposed PER to outperform the default DQN algorithm with its uniform experience replay strategy on game playing tasks. In this work, we propose to extend DDQN with PER for rare event detection tasks. The PER extension enables the learning algorithm to adjust the frequency and importance of learning experiences by priority, such that past transitions between states are remembered and reused according to their relevance.
Normally, experiences are sampled uniformly from the replay memory, regardless of their significance. PER shows that the replay frequency and importance of experiences can instead be adjusted by priority [30]. The main idea of PER is that some experiences are more important to the agent than others. Therefore, the relevance of each transition is measured and stored with its experience. In the replay phase, experiences are sampled from the replay memory with a certain strategy. Nevertheless, the pitfalls of losing diversity and gaining bias towards certain transitions have to be avoided. Schaul et al. [30] recommended stochastic prioritization and importance sampling as strategies against these pitfalls.
The importance factor for transitions was first proposed by Andre et al. [32]. The authors stated that the TD-error δ is a measure of the unexpectedness of transitions; a prioritized sampling strategy is therefore an intuitive extension. In the default Q-learning algorithm, the TD-error is computed for value-function updates anyway, so extracting the importance factor δ does not add any computational overhead. However, to efficiently decide which transition to replay, a feasible data structure is necessary. In [30], a binary heap is recommended to efficiently select the transition to replay by priority in a memory buffer; the effort to sample by maximum error in a buffer of size N can then be estimated by O(log N). Schaul et al. [30] also highlighted that PER can be prone to overfitting. The reason for this behavior is that priority updates are only applied to the sampled transitions. Hence, less important transitions (δ ≈ 0) might never be replayed in the agent's lifetime.
Equation (9) defines stochastic sampling of a transition t by a probability value P(t):

P(t) = p_t^β / Σ_k p_k^β    (9)

Making use of stochastic sampling results in an unbiased sampling distribution. With the exponent β, the ratio of sampling can be adjusted based on priority. The prioritization value p_t can be defined either by the proportional variant p_t = |δ_t| + ε, where ε is a small positive constant, or by the rank-based variant p_t = 1/rank(t). When sampling based on the sorted rank of each transition in the replay memory, P(t) becomes a power-law distribution with exponent β [30]. Schaul et al. [30] suggest using the rank-based sampling variant because it is more robust and scores a higher mean in most experiments; it is also less sensitive to outliers and error magnitudes. The experience replay strategy we use is computed by (9) with rank-based prioritization. Algorithm 2 shows the extension we applied to Algorithm 1; the PER sampling variant we applied in each update step is highlighted in yellow. Experiences are prioritized based on the sorted rank of their importance.
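The following sketch shows a simplified list-based variant of the rank-based strategy in (9); Schaul et al. [30] use a binary heap for efficiency, which this sketch trades for readability, beta = 0.7 is an assumed value, and the importance-sampling weight correction is omitted.

```python
import numpy as np

class RankBasedPER:
    """Rank-based prioritized replay memory following eq. (9):
    p_t = 1 / rank(t) and P(t) = p_t^beta / sum_k p_k^beta."""

    def __init__(self, capacity: int, beta: float = 0.7):
        self.capacity = capacity
        self.beta = beta          # assumed prioritization exponent
        self.buffer = []          # experience tuples (s, a, r, s_next)
        self.abs_td = []          # |delta| per stored experience

    def add(self, experience, td_error: float) -> None:
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)    # drop the oldest experience
            self.abs_td.pop(0)
        self.buffer.append(experience)
        self.abs_td.append(abs(td_error))

    def sample(self, batch_size: int):
        # Rank 1 is assigned to the largest |TD-error|; the resulting
        # priorities follow a power-law distribution over ranks.
        order = np.argsort(self.abs_td)[::-1]
        ranks = np.empty(len(order), dtype=np.int64)
        ranks[order] = np.arange(1, len(order) + 1)
        priorities = (1.0 / ranks) ** self.beta
        probs = priorities / priorities.sum()
        idx = np.random.choice(len(self.buffer), size=batch_size, p=probs)
        return [self.buffer[i] for i in idx], idx

    def update(self, idx, new_td_errors) -> None:
        # Priorities are refreshed only for the sampled transitions [30].
        for i, delta in zip(idx, new_td_errors):
            self.abs_td[int(i)] = abs(float(delta))
```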

C. PROCESSING PIPELINE
This section presents the processing pipeline of our adaptation of the DDQN algorithm. The processing pipeline is visualized in Figure 1. Each training episode of the algorithm consists of two distinct phases: in the first phase, the replay memory is filled; in the second phase, the parameters of the Q-network and the target network are updated. In step 1, a state s_t is observed. In step 2, an action a_t is executed, depending on the state s_t. In step 3, the state s_t, the selected action a_t, the obtained reward r_t, and the subsequent state s_{t+1} are stored in the replay memory. Steps 1-3 are repeated until the replay memory is filled. In step 4, experiences are sampled from the replay memory. In step 5, the expected reward is computed based on the observed reward r_t and the Q-network estimations for the subsequent state-action pair. In step 6, gradient descent is performed and the Q-network parameters are updated; the squared distance between the observed reward and the estimated reward serves as the loss function. Steps 4-6 are repeated depending on the target network update frequency, which denotes the number of Q-network parameter updates performed before the Q-network parameters are copied to the target Q-network. Finally, in step 7, the network parameters of the Q-network are copied to the target Q-network. On the right-hand side of the figure, the neural network architectures of the Q-network and the target Q-network are visualized. The input provided to the networks is a state s_t, which consists of h samples and their corresponding features. The output layer returns a Q-value for each action; the Q-value is a measure of how good a certain action is in a state. The Q-values are used to select an action a_t that is executed in state s_t, so that a reward is obtained from the environment.
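A condensed sketch of how steps 1-7 could fit together is given below. Here `env`, `select_action`, `sgd_step` and `copy_parameters` are hypothetical helper interfaces (not part of the original description), `memory` is a prioritized replay buffer as sketched in Section III-B, and `ddqn_targets` is the target computation sketched in Section III-A.

```python
import numpy as np

def train_episode(env, q_net, target_net, memory, batch_size=32, gamma=0.99,
                  target_update_freq=256):
    """One training episode following steps 1-7 of the pipeline above."""
    # Phase 1: fill the replay memory (steps 1-3).
    s, done = env.reset(), False                              # step 1: observe state
    while not done:
        a = select_action(q_net(s[np.newaxis])[0], step=env.t)  # step 2: act
        s_next, r, done = env.step(a)                          # obtain reward
        memory.add((s, a, r, s_next), td_error=abs(r))         # step 3: store (provisional priority)
        s = s_next
    # Phase 2: update the Q-network (steps 4-6), then sync the target (step 7).
    for _ in range(target_update_freq):
        batch, idx = memory.sample(batch_size)                 # step 4: sample experiences
        states, actions, rewards, next_states = map(np.array, zip(*batch))
        targets = ddqn_targets(q_net, target_net, rewards, next_states, gamma)  # step 5
        td_errors = sgd_step(q_net, states, actions, targets)  # step 6: MSE loss + gradient descent
        memory.update(idx, td_errors)                          # refresh priorities
    copy_parameters(src=q_net, dst=target_net)                 # step 7: sync target network
```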

IV. DATASETS
This section presents the database that we use for the experiments and evaluations conducted in this work. The database consists of two independent datasets: a dataset for fall detection [9] and a dataset for occupancy detection [10]. The datasets contain noisy real-world sensory data originating from smart environments and represent an important foundation for the development of methodologies that can detect safety-critical events for individuals or infrastructure. The database enables us to show the ability of the proposed approach to cope with noisy real-world sensory data while achieving high detection performance in safety-critical event detection tasks.

A. OCCUPANCY DETECTION
Candanedo and Feldheim [10] published a dataset that contains real-world sensory data for the purpose of developing methodologies that can accurately detect the occupancy of office rooms. Light, humidity, CO2 and temperature sensor readings are included; the dataset additionally contains timestamps and humidity ratios. A combination of these sensors already exists in many smart buildings today. In [10], the authors manually engineered features by exploiting the timestamps of the sensor measurements: they extracted the number of seconds from midnight for each day and classified each timestamp as either a weekend or a weekday. For the experiments conducted in this work, the timestamps are not considered as training data, because the methodology should be able to reliably predict room occupancy by observing the sensor measurements only, independent of the time of day. Table 1 shows the overall distribution of training and testing samples as well as the fraction of anomalous data samples contained in the dataset. The ground truth labels contained in the dataset have been automatically gathered by a video surveillance system. For the development and evaluation of the solution proposed in this work, only the training data has been used for training and only the testing data for testing. Effectively detecting room occupancy can contribute to lower energy consumption in future smart buildings, as well as to the security of protected environments (e.g. facilities with access control).

B. FALL DETECTION
The dataset created by Kaluza et al. [9] contains local position data of persons. The localization system Ubisense has been used to track the position of persons, using a set of four localization tags placed at four distinct body positions: chest, belt, left ankle and right ankle. The data has been collected with the intention of enabling the development of mechanisms for activity recognition and elderly healthcare; the major objective of this dataset is to increase the safety of independently living elderly people. For the evaluation of the methodology proposed in this work, the dataset has been modified to be suitable for fall detection in a binary classification scenario. The dataset consists of 134229 training samples and 30030 testing samples and is split into 25 parts, of which 20 parts are used exclusively for training and 5 parts exclusively for testing. High volatility was observed in the sensor readings, which were acquired and transmitted wirelessly in a real-world scenario. Anomalous samples are less prevalent than in the occupancy detection dataset: on average, each time series contains 5% anomalous samples. Table 2 shows the overall distribution of training and testing samples, as well as the fraction of anomalous data samples contained in the dataset.

V. EXPERIMENTAL SETUP
In all experiments, the values contained in the respective datasets have been min-max scaled such that the values range from 0.0 to 1.0. The Q-function estimator Q and the target estimator Q_θ, which are modeled as MLPs, have the same number of feed-forward layers and neurons. Layer normalization is enabled and the hidden layers use the rectified linear unit (ReLU) activation function. The final layers consist of two output neurons and use the linear activation function, such that normal and anomalous states can be differentiated. Optimization is performed by the Adam optimizer, based on the mean squared error (MSE) loss function. Table 3 provides an overview of the relevant hyperparameters that have been investigated. Additionally, Table 4 lists the hyperparameter configurations used for the fall and occupancy detection experiments that resulted in the best performance. The number of past actions a_t and feature vectors v_t represented by the state s_t is defined by the horizon hyperparameter h. The data samples recorded between t_0 and t_{h−1} in each time series have not been considered for the evaluation of the method proposed in this work. The first prediction of our DDQN-PER approach happens after h samples have been observed in each time series.
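The following sketch shows, under assumptions about layer sizes and learning rate, how the described preprocessing and network setup can be realized in PyTorch. The hidden sizes and learning rate are illustrative assumptions, while the min-max scaling, layer normalization, ReLU hidden layers, linear two-unit output, Adam optimizer and MSE loss follow the text.

```python
import numpy as np
import torch
from torch import nn

def min_max_scale(X: np.ndarray) -> np.ndarray:
    """Scale every feature column into the range [0.0, 1.0]."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / np.maximum(x_max - x_min, 1e-12)  # guard constant columns

def build_q_network(state_dim: int, hidden=(64, 64)) -> nn.Module:
    """MLP Q-function estimator: layer-normalized ReLU hidden layers and a
    linear two-unit output (one Q-value per action: normal / anomalous)."""
    layers, in_dim = [], state_dim
    for width in hidden:  # hidden sizes are assumed example values
        layers += [nn.Linear(in_dim, width), nn.LayerNorm(width), nn.ReLU()]
        in_dim = width
    layers.append(nn.Linear(in_dim, 2))
    return nn.Sequential(*layers)

q_net = build_q_network(state_dim=20)                       # state_dim depends on h and the features
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)   # assumed learning rate
loss_fn = nn.MSELoss()                                      # MSE loss, as described above
```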

EXPERIMENT: OCCUPANCY DETECTION
The hyperparameters listed in Table 3 have been investigated in this experiment. In general, ε is an important parameter to produce feasible results in ε-greedy Q-learning [33]. By default, the ε-fraction defines the fraction of randomly chosen experiences which the agent gathers during training; by randomly choosing experiences, the agent can explore the environment. The experiments conducted cover 28 different hyperparameter configurations: 14 trials use the prioritized experience sampling strategy, while the other half uses random sampling. Out of the 14 trials, 50% use a high target network update frequency of 256 steps, while the other 50% of the trials use a lower frequency of 1500 update steps. The target network update frequency is a crucial factor for stable policy improvement: when choosing a too high frequency, the learned policy might suffer from unstable conditions during training, which should be avoided. Table 5 lists the neural network architecture of our best performing instance. Although previous work did not report on the computational complexity of the proposed approaches, we list the number of neural network parameters and the model size of our solution.

EXPERIMENT: FALL DETECTION
The exploration factor ε has been set to 85.0% for most of the experiments, because a high exploration percentage during the training phase is necessary to learn an optimal policy. The training scenario uses an annealing exploration factor to ensure that the policy learner becomes more greedy in its action selection over time. At the beginning of a learning task, especially when replay prioritization is used, it is necessary to experience a broad range of transitions; the DDQN-PER algorithm then ensures that relevant transitions are replayed more frequently. Furthermore, when choosing bigger horizon sizes, it is necessary to scale up the capacity of the underlying neural networks. Table 6 lists the neural network architecture of our best performing instance, as well as the model complexity.
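As a sketch of the annealing ε-greedy scheme just described, the following could serve as the action-selection helper assumed in the pipeline sketch of Section III-C; the starting value of 0.85 follows the text, while the final value and annealing length are assumptions.

```python
import numpy as np

def epsilon(step: int, eps_start: float = 0.85, eps_end: float = 0.05,
            anneal_steps: int = 10_000) -> float:
    """Linearly annealed exploration factor: greedier action selection over time."""
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(q_values: np.ndarray, step: int) -> int:
    """Epsilon-greedy selection over the Q-values of a single state."""
    if np.random.rand() < epsilon(step):
        return int(np.random.randint(len(q_values)))  # explore: random action
    return int(np.argmax(q_values))                   # exploit: greedy action

# Example: epsilon decays from 0.85 towards 0.05 over the first 10,000 steps.
```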

RESULTS: OCCUPANCY DETECTION
The best results achieved by our DDQN approach on the occupancy detection dataset are listed in Table 7. The performance metrics of our approach are reported on both testing series, with and without PER. The corresponding model hyperparameter configuration used to achieve these results is listed in Table 4. An accuracy of 96.4% and an F1-Score of 95.1% is the best result we achieved on testing series 0. On testing series 2, we achieved an accuracy of 98.2% and an F1-Score of 96.0%. Overall, our adaptation of the DDQN algorithm with PER performs better on both testing series. Only the recall of the DDQN algorithm with PER is 0.1 percentage points lower on testing series 0 and 0.3 percentage points lower on testing series 2; the other performance metrics increased with the PER extension on both testing series. On the larger testing series 2, the observed performance improvements with PER are more significant: the overall accuracy improved by 1.5 percentage points, the precision by 6.0 percentage points, and the F1-Score by 3.1 percentage points. This indicates the superiority of the PER sampling strategy over the random sampling strategy. In the fall detection experiment, conducted on the more volatile fall detection dataset, the performance improvements gained by the PER sampling strategy are even more significant.

COMPARISON: OCCUPANCY DETECTION
In this subsection, the occupancy detection performance of the approaches found in previous work is compared to our DDQN-PER approach. Table 8 lists the performance metrics with the decimal precision as reported in previous work. The models evaluated in [10] include random forest (RF), linear discriminant analysis (LDA), classification and regression trees (CART) and gradient boosted models (GBM). The achieved results vary from 93.06% to 97.90% accuracy on testing series 0 and from 95.14% to 98.76% accuracy on testing series 2. Accuracy-wise, only their LDA performs slightly better than our DDQN-PER method, with a difference of 1.5 percentage points on testing series 0 and 0.56 percentage points on testing series 2. In [34], two different approaches have been suggested. The authors proposed a multivariate convolutional neural network (MVCNN) that performs better on testing series 0, but is outperformed by DDQN-PER on testing series 2. Their random forest (RF) approach is outperformed accuracy-wise by our DDQN-PER approach on both testing series. Unfortunately, the performance metrics reported in [10], [34] are limited to the accuracy metric and do not include precision, recall and F1-Score. These metrics, however, are important in order to compare the predictive performance of the models on the respective dataset, because the dataset is highly unbalanced in its class distribution. In [35], the authors improved the traditional radial basis function network (RBFN) by applying their multicolumn radial basis function network (MCRN) mechanism. The MCRN mechanism divides the training set of the dataset into smaller subsets using the k-d tree algorithm. The results reported in [35], however, only consider testing series 0 of the dataset. The authors achieved an accuracy of 97.60% and 93.20%, depending on the number of subsets they used for training their method. However, the recall of 95.00% they reported is outperformed by our DDQN-PER approach by 2.1 percentage points. The authors also evaluated the performance of a support vector machine (SVM). The SVM scored 1.5 percentage points higher accuracy-wise compared to our DDQN-PER approach; however, its recall has not been reported.
In addition to the reported performance metrics, Table 8 lists whether or not the respective approaches are adaptable to different environments or capable of online learning. We consider an approach adaptable to different environments if it does not require hand-crafted rules that are defined by human experts. We consider an approach capable of online learning when the underlying model can be trained on data that becomes available in a sequential fashion; for approaches that are not capable of online learning, the entire training dataset must be available at once during training to generate the best predictor. As indicated in Table 8, compared to previous work that reported on the occupancy detection dataset, only our adaptation of the DDQN algorithm is capable of online learning, because our method is purely based on the paradigms of reinforcement learning. Although the other approaches suggested in previous work are adaptable to different environments, they require that the entire training dataset be available at once to create the best predictor.
In conclusion, our DDQN-PER approach performs by no means inferior to the approaches found in previous work, although a better comparison would require the authors of the respective works to include the precision, recall and F1-Score metrics. On testing series 0, our DDQN-PER approach achieved competitive results. On testing series 2, only the LDA performs better accuracy-wise, by 0.56 percentage points; the other approaches are outperformed by DDQN-PER on testing series 2. Additionally, compared to previous work that reported on the occupancy detection dataset, our method has the advantage of being capable of online learning.

TABLE 8. Accuracy (%) on the occupancy detection dataset, as reported in previous work (testing series 0 / testing series 2; "-" = not reported).

RF [10]              93.06          95.14
CART [10]            95.57          96.47
LDA [10]             97.90          98.76
SVM [35]             97.90          -
KNN [35]             95.90          -
RBFN [35]            97.00          -
MCRN [35]            97.60 / 93.20  -
MVCNN [34]           97…            -
DDQN-PER (ours)      96.4           98.2

RESULTS: FALL DETECTION
The best results achieved by our DDQN-PER approach on the fall detection dataset are listed in Table 9. The results indicate the superiority of the PER strategy in rare event classification tasks, compared to the results achieved using the default random sampling strategy. The difference in detection performance in this experiment is rather large. This could be due to the fact that the dataset for fall detection is larger and the fraction of anomalous samples is lower compared to the occupancy detection dataset. An overall accuracy of 92.6% and an F1-Score of 70.5% has been achieved using the PER sampling strategy. The corresponding model hyperparameter configuration is listed in Table 4. Although the recall and balanced accuracy achieved using the PER sampling strategy are lower, the overall accuracy, precision and F1-Score significantly dominate the results achieved using the random sampling strategy. Using the PER sampling strategy, the DDQN algorithm learns from more important state transitions. The performance improvements gained by the PER strategy are particularly reflected by the precision and F1-Score: the precision increased by 27.5 percentage points, while the F1-Score increased by 13.0 percentage points.

TABLE 10. Fall detection performance (%) as reported in previous work ("-" = not reported).

Machine learning (SVM + C4.5) [9]       72.0    -
Rule-based (Expert knowledge) [9]       88.0    -
HMM (Meta-prediction) [9]               91.3    -
Confidence system (one tag) [36]        90.1    -
Confidence system (four tags) [36]      94.7    -
J48 (Decision Trees) [37]               52.0    58.0
JRip (Rule-based) [37]                  51.0    57.0
SMO [37]                                53.0    59.0
RF [37]                                 52.0    58.0
NaiveBayes [37]                         40.0    53.0
CDKML (initial) [37]                    63.0    64.0
CDKML (refined) [37]                    66.0    65.0
CDKML (adapted) [37]                    81.0    71.0
Confidence system (one tag) [38]        94.2    -
Confidence system (four tags) [38]      95.3    -
DDQN-PER (ours)                         92.6    70.5

COMPARISON: FALL DETECTION
In this section, the performance of previous work that reports on the fall detection dataset is compared to our DDQN-PER approach. Table 10 lists the performance metrics with the decimal precision as reported in the respective research. In [9], the authors propose an approach that is based on a set of distinct agents. Their machine-learning agents are based on an SVM and the C4.5 decision tree algorithm; only if both agents output a fall event is the event considered a fall. The authors report that their machine-learning agents yield an accuracy of 72.0%. Their expert-knowledge agents can detect four types of emergency situations using a set of handcrafted rules. However, handcrafted rules can potentially be biased by assumptions, and human experts are necessary in order to define them. Additionally, the expert-knowledge agents make use of information about the location of objects, such as beds, chairs and tables. This object location information, however, is not contained in the respective dataset and therefore could not be used for the development of our approach. The authors report that their expert-knowledge agents yield an accuracy of 88.0%. Their meta-prediction agents merge the outputs of the machine-learning and expert-knowledge agents and increase the detection accuracy to 91.3%. Unfortunately, the authors do not report precision, recall and F1-Score. In comparison, our adaptation of the DDQN-PER algorithm yields a fall detection accuracy of 92.6% and outperforms the agents proposed in [9] accuracy-wise. Additionally, our DDQN-PER approach does not rely on handcrafted rules defined by human experts and was developed without object location information. Similar to [9], the authors of [36] conduct fall detection based on the data captured from the location tags. Their confidence system is a complex multi-agent system that consists of seven groups of intelligent agents. The authors report 90.1% and 94.7% fall detection accuracy for one and four location tags, respectively. Compared to our DDQN-PER approach, their confidence system scores 2.1 percentage points higher accuracy-wise based on four location tags. However, the authors did not report the precision and recall values that are necessary for a fair comparison. In [38], Lustrek et al. improve their confidence system and provide insight into the usability of the proposed system. According to the authors, the confidence system appears to be sufficiently accurate for real-life applications and is accepted by its users. The authors improve the detection performance using additional accelerometers. They report an accuracy of 95.3%, which improves upon our DDQN-PER approach by 2.7 percentage points. Similarly to [9], [36], however, their confidence system requires human experts to define handcrafted rules. Additionally, their confidence system makes use of context information (i.e. the location of beds, chairs, and tables in the test environment).
In [37], the authors propose a method called combining domain knowledge and machine learning (CDKML). Their CDKML system consists of multiple phases, specifically an initialization, a refinement, and an online adaptation phase. In the initialization phase, a human-understandable classifier is generated by making use of traditional rule-based and decision tree algorithms. Refinement of the initial classifier is performed using genetic algorithms under expert supervision; the genetic algorithm then outputs the final general rule-based classifier. In the final phase, an online learning process is performed and the classifier is adapted based on user feedback. Their method yields an accuracy of 81.0% and an F1-Score of 71.0%. In comparison to our DDQN-PER adaptation, only the adapted version of their CDKML method yields a slightly higher F1-Score. Similarly to our approach, their method is based on online adaptation using an MDP. However, their approach has a major drawback because it is based on hand-crafted rules that require the intervention of human experts. In comparison, our adaptation of the DDQN-PER algorithm does not require humans to define any handcrafted rules. All other methods reported in [37] are outperformed by our DDQN-PER approach.
In conclusion, our adaptation of the DDQN-PER algorithm outperforms the majority of approaches proposed in previous work and has a unique set of advantages. Our approach is both adaptable to different environments and capable of online learning. The approaches suggested in [9], [36]–[38] require human supervision and are thus not adaptable to different environments. In addition, compared to previous work, only the CDKML system [37] is capable of online learning; however, it requires expert supervision as well as feedback from the users to obtain adequate results. Moreover, our approach does not make use of additional contextual information (i.e. the location of objects in the test environment).

VII. CONCLUSION
This work presents the novel use of an improved reinforcement learning algorithm for anomaly detection in smart environments. We adapted the DDQN algorithm for anomaly detection to make policy estimation more stable. Additionally, we proposed to extend DDQN with the PER sampling strategy to emphasize learning from rare data patterns, and showed that our DDQN-PER solution yields an increase in detection performance. Using PER, the problem of class imbalance in the respective datasets is less pronounced, resulting in a more robust detector. Moreover, our work substantially extends the method of [27] by adapting it to multivariate sequential time series learning scenarios. The evaluations conducted in this work show that the use of PER based on stochastic sampling yields a detection improvement on rare event classification tasks. Our solution yields 98.2% accuracy and an F1-Score of 96.0% on the occupancy detection dataset. On the larger and more volatile fall detection dataset, our solution yields 92.6% accuracy and an F1-Score of 70.5%, outperforming the majority of approaches proposed in previous work. Additionally, our solution is adaptable to different environments, because it does not rely on hand-crafted rules defined by human experts. Moreover, our adaptation of the DDQN-PER algorithm is an online learning algorithm that directly learns a decision-making function by creating experiences in a data-driven fashion. The underlying model can be trained by observing data in sequential order, without the need to retrain the model on the complete training dataset.

NILS JOREK started his pursuit of a computer science degree in 2014, one year after his high-school diploma. He finished his bachelor's degree in computer science in 2018 during his practical experience at the Lufthansa Global Business Services company. While working on the management of SAP systems, he felt the need to dive deeper into software engineering and the roots of machine learning methods. After achieving his bachelor's degree, he continued his studies in a master's program, which he finished with a master's degree in computer science in 2021. His passion is driven by the principles of software engineering and the urge for clean code. For the past three years he has been developing ERP software for small, private companies. The application of these principles can be seen in the development of an extensible anomaly detection framework based on reinforcement learning.
NASER DAMER is a senior researcher at the competence center Smart Living & Biometric Technologies, Fraunhofer IGD. He received his PhD in computer science from the Technical University of Darmstadt (2018). He has been a researcher at Fraunhofer IGD since 2011, performing research management, applied research, scientific consulting, and system evaluation. His main research interests lie in the fields of biometrics, machine learning and information fusion. He has published more than 100 scientific papers in these fields. Dr. Damer is a Principal Investigator at the National Research Center for Applied Cybersecurity ATHENE in Darmstadt, Germany. He lectures on biometric recognition and security, as well as on ambient intelligence, at the Technical University of Darmstadt. Dr. Damer is a member of the organizing teams of a number of conferences, workshops, and special sessions. He serves as a reviewer for a number of journals and conferences and as an associate editor for the Visual Computer journal. He represents the German Institute for Standardization (DIN) in the ISO/IEC SC37 international biometrics standardization committee.
FLORIAN KIRCHBUCHNER was trained as an information and telecommunication systems technician and served as an IT expert for the German Army from 2001 to 2009. Afterwards, he studied computer science at the Technical University of Darmstadt and graduated with a Master of Science degree in 2014. He has been working at Fraunhofer IGD since 2014, most recently as head of the department for Smart Living & Biometric Technologies. He is also a Principal Investigator at the National Research Center for Applied Cybersecurity ATHENE. Mr. Kirchbuchner participated in Software Campus, a management program of the Federal Ministry of Education and Research (BMBF), and is currently pursuing his PhD at the Technical University of Darmstadt on the topic "Electric Field Sensing for Smart Support Systems: Applications and Implications".