Unsupervised Anomaly Video Detection via a Double-Flow ConvLSTM Variational Autoencoder

With the rapid increase in the number of video surveillance points in recent years, video anomaly detection has gained extensive attention in the security field. In unlabeled video data, the distribution of normal and anomalous data is highly unbalanced. The variational autoencoder (VAE), one of the typical deep generative models, has become increasingly popular in unsupervised anomaly detection. However, this model is not good at processing time-series data, especially video data. In addition, many autoencoder-based works generalize so strongly that they also reconstruct anomalous behavior well, which leads to missed detections. To solve these problems, we present a double-flow convolutional long short-term memory variational autoencoder (DF-ConvLSTM-VAE) that models the probabilistic distribution of normal video in an unsupervised learning scheme and reconstructs videos without anomalous objects for anomaly video detection. Experiments verify the effectiveness and competitiveness of our DF-ConvLSTM-VAE on multiple public benchmark datasets. In particular, our model achieves state-of-the-art performance on anomalous event count.


I. INTRODUCTION
Anomaly detection has a wide range of practical applications in campus monitoring, intelligent transportation, and banking transactions. Nowadays, in an era of data explosion, unlabeled data, especially unlabeled surveillance video data, pervades every aspect of life. Compared to other algorithms [1], [2], unsupervised learning algorithms are becoming the future trend and are of great interest to scientists [3]-[6]. As an essential area of anomaly detection, anomaly video detection provides pattern classification of normal and anomalous behaviors in the respective domains [7]-[9]. In fact, the anomaly video detection task faces several challenges. Among the large amounts of existing video data, there is bound to be a large number of normal videos in which no event occurs. Finding the time periods in which major events occur is of great significance for the storage and review of videos. Therefore, it is of great research value and practical significance to detect anomalous videos using unsupervised learning methods. In many cases, whether a real-life event is normal or anomalous depends on its surrounding circumstances. For example, a person running on a sports field is perfectly normal, but in a court of law it is clearly abnormal. Another example is the presence of a speeding truck on a campus sidewalk, which is clearly unusual and potentially dangerous. These cases show that deciding whether an event is anomalous is difficult. In addition, video representation learning is well known to be the most basic problem in video processing technology. Compared with static images, video involves richer dynamic information about events.
In addition, due to the diversity and variability of video, it has become an urgent problem to study algorithms that can find the internal spatio-temporal correlations and discriminative features of video.
The associate editor coordinating the review of this manuscript and approving it for publication was Zahid Akhtar.
Over the past few years, researchers have usually extracted handcrafted video features to detect anomalies. Traditional methods rely on low-level features, such as histograms of optical flow (HOF) [10], spatio-temporal gradients [11], and mixtures of dynamic textures (MDTs) [12], to complete anomaly classification tasks. These models based on manual feature classification are inefficient, and their accuracy cannot meet actual requirements. With the development of deep learning, many neural networks have been proposed and used for detecting anomalies. These networks learn more discriminative video features through unsupervised learning. Hasan et al. [13] employ the Convolutional Autoencoder (ConvAE) to construct an anomalous event detection model. Although the network input consists of multiple consecutive frames, 2D convolution is adopted, which fails to fully utilize the temporal information between video frames. Considering the motion characteristics, Yiru et al. [14] use an autoencoder with 3D convolution for anomaly video detection. Although the model is capable of both reconstruction and prediction, it is not good at modeling long videos. Xu et al. [15] leverage stacked denoising autoencoders to learn both appearance and motion features; based on the learned features, multiple one-class SVM models are used to predict the anomaly score of each input. However, this method is time-consuming because it divides the spatio-temporal features of video into optical flow and appearance features. The Long Short-Term Memory (LSTM) network, a typical recurrent neural network (RNN) architecture, was proposed by Hochreiter et al. [16] and is widely used in many research tasks. Taking video as an example, this network has been applied to action recognition [17]-[19], video retrieval [20], [21], video segmentation [22], [23], and video captioning [24], [25]. The LSTM-Autoencoder, a typical sequence-to-sequence [26] framework, was proposed by Srivastava et al. [17] and applied to video action recognition. LSTM also performs well in the video anomaly detection task [27]. Medel and Savakis [28] perform video anomaly detection by combining an LSTM network with ConvLSTM units. Based on the ConvAE structure, Yong and Yong [29] add three ConvLSTM layers to learn the spatio-temporal information of video events and detect video anomalies. Lin et al. [30] explore a hybrid autoencoder architecture, composed of a ConvAE and an LSTM-Autoencoder with ConvLSTM units, to improve the extrapolation capability of the corresponding decoder through a shortcut connection. The prediction branch of the hybrid autoencoder is used for anomaly detection. These models are built on the autoencoder structure and use the reconstruction error to detect anomalies. This reconstruction-based approach, one of the common techniques for anomaly detection, calculates the maximum reconstruction error of test samples to determine whether they are anomalous. In practice, these methods still generate the anomalous object in the reconstructed image, only relatively blurry or lower in pixel intensity compared with the original image. Anomalous data would be easier to detect if the reconstructed image contained only normal objects, and no abnormal ones, during the test phase.
In recent years, the variational autoencoder (VAE) [31] has become increasingly popular. In particular, a VAE can not only generate output close to the original input and reflect the shared information of similar data, but also learn a latent feature vector. An and Cho [32] propose an anomaly detection method using the reconstruction probability from the variational autoencoder. The reconstruction probability is a probabilistic measure that takes into account the variability of the distribution of variables. Their experimental results show that this method outperforms autoencoder-based methods on the MNIST dataset [33]. Compared with the reconstruction error used by autoencoder-based and principal-component-based anomaly detection methods, the reconstruction probability has a theoretical background and is therefore more principled and objective. However, the applicability of the VAE to time series, and especially to video, is limited, because it does not take the temporal characteristics of the data into account. For processing time-series data, Sölch et al. [34] utilize RNNs and variational inference to learn time-series data for anomaly detection. Park et al. [35] use a long short-term memory-based variational autoencoder (LSTM-VAE) for multimodal anomaly detection. These two papers demonstrate that VAE-based models perform better than the other approaches, and they inspire us to apply a recurrent VAE to anomaly detection in video.
In this work, to solve the above problems, our two models use ConvLSTM units instead of LSTM units to learn the internal spatio-temporal relations of video. These two asymmetric models blend ConvLSTM with the VAE architecture to reconstruct videos without anomalous objects for anomaly detection. One is called ConvLSTM-VAE(Asymmetric); the other is named DF-ConvLSTM-VAE. More information about the structures of these two models is given in Section III. We use the reconstruction error probability, which is different from the reconstruction probability, to detect anomalies. Experiments verify the effectiveness and competitiveness of our DF-ConvLSTM-VAE on multiple public benchmark datasets. In particular, our model achieves state-of-the-art performance on anomalous event count. The key contributions of our work can be summarized as follows:
• Many autoencoder-based models generalize too strongly, and the VAE does not take the temporal dependence in data into account, which limits its applicability to time series, especially video sequences. We present two models, ConvLSTM-VAE(Asymmetric) and DF-ConvLSTM-VAE, to address these disadvantages. Both models consist of ConvLSTM and VAE and model the probability distribution of video sequences by capturing the spatio-temporal features of crowds. The experimental results verify the validity of these two asymmetric models.
• Based on the analysis and verification of the ConvLSTM-VAE(Asymmetric) model, we propose an improved network, namely DF-ConvLSTM-VAE, to detect anomalies. The DF-ConvLSTM-VAE model adopts the idea of an asymmetric structure and increases the width of the network structure to achieve high training efficiency and short test time.
• The DF-ConvLSTM-VAE model is successfully utilized for anomaly detection in videos. The experimental results demonstrate that the DF-ConvLSTM-VAE model is competitive with current leading methods on benchmark datasets.
The remainder of this paper is organized as follows. In Section II, we briefly review related works. Section III describes the proposed approach. Experiments are conducted and analyzed in Section IV. We discuss the limitations of our work in Section V. Finally, we draw conclusions and present future research directions in Section VI.

II. RELATED WORKS

A. CONVOLUTIONAL LSTM UNIT
The Convolutional Long Short-Term Memory (ConvLSTM) unit, a variant of the LSTM unit, was first proposed by Shi et al. [36]. Compared to the usual fully connected LSTM (FC-LSTM) [17], ConvLSTM encodes spatial information in both the input-to-state and state-to-state transitions when dealing with spatio-temporal data. In predicting future video sequences on the synthetic Moving-MNIST dataset [37], ConvLSTM exhibits better performance than FC-LSTM.
The formulation of the ConvLSTM unit can be summarized with Equation (1), where the symbol '*' denotes the convolution operation and '•' denotes the Hadamard product. The input gate, forget gate, cell state, output gate, and hidden state at each timestep are denoted by i, f, C, o, and H, respectively; the activation is denoted by σ, and the weighted connections between states are given by a set of weights W. The input is fed in as images, while the set of weights for every connection is replaced by convolutional filters.
This operation allows ConvLSTM to work better with images than FC-LSTM, because the model can propagate spatial characteristics through time via each ConvLSTM state. Inspired by this, our two models use the ConvLSTM unit as the basic block for recurrent connections inside the VAE model.
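For reference, the ConvLSTM unit as formulated in [36] can be written as below. This is the standard form with peephole connections, given here as a sketch; the input symbol X_t and the bias terms b are notation assumed from [36] rather than taken from the text above, and the exact variant used in our models may differ.

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \bullet C_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \bullet C_{t-1} + b_f\right) \\
C_t &= f_t \bullet C_{t-1} + i_t \bullet \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \bullet C_t + b_o\right) \\
H_t &= o_t \bullet \tanh\left(C_t\right)
\end{aligned}
```

Here '*' is the convolution and '•' the Hadamard product, so every state keeps its spatial layout from one timestep to the next.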

B. AUTOENCODER
An autoencoder (AE), composed of an encoder and a decoder, aims to reconstruct input data x from a learned hidden representation z. The objective function of an AE is given in Equation (2), where φ and θ denote the parameters of the encoder E and the decoder G, and L_AE denotes the loss of the AE.
$$L_{AE}(\phi, \theta) = \left\| x - G_\theta\left(E_\phi(x)\right) \right\|_2^2 \tag{2}$$
We use the reconstruction error of each test sample to calculate its anomaly score, and we consider samples with a high anomaly score to be anomalies. The AE reconstructs normal data well, while failing to do so with anomalous data that it has not encountered.

C. VARIATIONAL AUTOENCODER
The Variational Autoencoder (VAE) was proposed in [31]. The structure of a VAE is similar to that of an AE. The essential difference between them is that the encoder of a VAE forces the representation z to obey a prior probability distribution p(z) (e.g., N(0, I)). The decoder then generates new realistic data from a code z sampled from p(z). p_θ(z) is the prior distribution of the latent variable z. Inheriting the architecture of an AE, a VAE consists of the following three parts.
(1) Recognition network (encoder network): a probabilistic encoder E_φ, which maps the input x to the latent representation z to approximate the true posterior distribution p(z|x). This recognition network can be represented as the approximate posterior q_φ(z|x).
(2) Sampling process: the latent representation z is sampled from q_φ(z|x).
(3) Generative network (decoder network): a generative decoder G_θ, which reconstructs the input value x from the latent representation z and does not rely on any particular input x. This generative network can be represented as p_θ(x|z).
Here φ and θ denote the parameters of the recognition and generative networks, respectively. The data distribution p_θ(x) is intractable by analytic methods, so variational inference is introduced to maximize the log-likelihood log p_θ(x). The loss of the VAE is represented as Equation (6).
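For reference, the evidence lower bound in its standard form from [31] is given below. We assume, without being able to confirm it from the surrounding text alone, that this matches Equation (6): its first term is the KL divergence and its second term is the expected reconstruction log-likelihood.

```latex
\log p_\theta(x) \;\ge\; L_{VAE}(\theta, \phi; x)
  = -D_{KL}\left(q_\phi(z|x) \,\|\, p_\theta(z)\right)
    + \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right]
```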
In order to estimate this maximum likelihood, a VAE maximizes the evidence lower bound (ELBO) L_VAE. The KL divergence (KLD) measures the dissimilarity between two distributions. To optimize the KLD between q_φ(z|x) and p_θ(z), the encoder estimates the parameter vectors of the Gaussian distribution q_φ(z|x), namely the mean µ and the standard deviation σ. Because both q_φ(z|x) and p_θ(z) are Gaussian, their KL divergence has an analytical expression. To optimize the second term of Equation (6), the VAE minimizes the reconstruction error between the inputs and the outputs. The objective function of the VAE can be rewritten as a loss to be minimized:
$$L_{VAE} = L_{MSE} + L_{KLD} \tag{7}$$
where the first term L_MSE is the reconstruction error (MSE, the mean squared error) between the inputs and their reconstructions. The second term L_KLD is the Kullback-Leibler divergence between the inference model q_φ(z|x) and the prior p_θ(z); it regularizes the encoder by encouraging the approximate posterior q_φ(z|x) to match the prior p_θ(z). Using the "reparameterization trick", φ and θ can be obtained by optimizing Equation (7) via stochastic gradient variational Bayes. An AE uses the reconstruction error as the anomaly score in the test phase, while a VAE defines a reconstruction probability for anomaly detection. To estimate this probabilistic anomaly score, a VAE samples z L times from the approximate posterior q_φ(z|x) and takes the average reconstruction likelihood as the reconstruction probability. Because this score accounts for the variability of the latent distribution, the VAE works more robustly than the traditional AE in the anomaly detection domain.
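The two ingredients of this objective can be illustrated with a minimal sketch. It assumes a diagonal Gaussian posterior N(µ, σ²) and a standard normal prior N(0, I), and it sums the two terms without weights; the function names are hypothetical and are not taken from our implementation.

```python
import math
import random


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions.

    Per dimension: 0.5 * (mu^2 + sigma^2 - 1) - log(sigma). Requires sigma > 0.
    """
    return sum(0.5 * (m * m + s * s - 1.0) - math.log(s)
               for m, s in zip(mu, sigma))


def reparameterize(mu, sigma, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def mse(x, x_hat):
    """Mean squared error between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def vae_loss(x, x_hat, mu, sigma):
    """Reconstruction error plus KL term (unweighted sum, an assumption)."""
    return mse(x, x_hat) + kl_to_standard_normal(mu, sigma)
```

Because the noise is drawn separately from µ and σ, the sample z stays differentiable with respect to the encoder outputs, which is what makes gradient-based training possible.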

A. THE CONVLSTM-VAE(ASYMMETRIC) MODEL
In this work, we combine ConvLSTM units with the VAE to model video sequences for anomaly detection. With a traditional symmetric VAE structure, the network easily degenerates into a plain AE during training. We therefore deliberately weaken the decoder structurally to obtain an asymmetric model. Figure 1 shows the structure of the ConvLSTM-VAE(Asymmetric) model, which is composed of the following three parts: encoder, sample, and decoder. The encoder consists of two modules, Conv and ConvLSTM, where Conv represents a set of convolutional layers for extracting spatial features from each frame, and ConvLSTM denotes convolutional long short-term memory units for learning the temporal patterns of video sequences from the spatial features. In the sampling process, z is sampled from the encoder of the ConvLSTM-VAE(Asymmetric) model, so the sampled data z has both temporal and spatial properties. The decoder is made up of only one module, Deconv, which represents a set of deconvolutional layers corresponding to the Conv module of the encoder and generates new realistic input.
The objective function of the ConvLSTM-VAE(Asymmetric) model can be expressed in Equation (8).
More details and the configuration of our ConvLSTM-VAE(Asymmetric) model are presented in Table 1 of Section IV, and the algorithm for training the ConvLSTM-VAE(Asymmetric) model is shown in Algorithm 1.

B. THE DF-CONVLSTM-VAE MODEL
We believe that the decoder of the ConvLSTM-VAE(Asymmetric) model, which consists only of deconvolutional layers, cannot adequately decode the sampled spatio-temporal information. Meanwhile, inspired by the traditional symmetric structures of many VAE-based networks, we propose an improved model, namely the DF-ConvLSTM-VAE model, to improve network performance. The DF-ConvLSTM-VAE model is a variational autoencoder with a non-traditional symmetric structure for processing time-series data. Figure 2 displays the structure of the DF-ConvLSTM-VAE model, which consists of two flows: the left flow and the right flow. In Figure 2, the blue arrows represent the left flow, and the black arrows denote the right flow. Note that the structure of the right flow is the same as that of the ConvLSTM-VAE(Asymmetric) model. The left flow differs from the right flow: it is composed of the following three parts: encoder Conv, sample, and decoder Deconv. In particular, the left flow skips the ConvLSTM module entirely.

Algorithm 2 Training Algorithm for the DF-ConvLSTM-VAE Network
Input: Normal training dataset X with frames x_t, t = 1, . . . , T.
Output: probabilistic encoders E_φ, E_φ′; probabilistic decoders G_θ, G_θ′.
θ ← update parameters using gradients of L = Σ_{t=1}^{T} L_t
until convergence of parameters
Many networks improve their performance by increasing the depth and width of the spatial view. At the same network depth, we increase the network width from both the spatial and the temporal views to improve the utilization of features, and thus the performance of the model. This offers a new option for learning the temporal patterns of video sequences. The DF-ConvLSTM-VAE model is composed of the following three parts: encoder, sample, and decoder. In contrast to the three parts of the ConvLSTM-VAE(Asymmetric) model, the encoder of the DF-ConvLSTM-VAE model comprises two modules, Conv and the ConvLSTM of the right flow; the sampling process consists of two sampling steps, with the data z of the right flow sampled from N(µ, σ²) and the data z′ of the left flow sampled from N(µ′, σ′²); and the decoder is a single module, Deconv.
The objective function of DF-ConvLSTM-VAE model can be represented in Equation (9).
where the second term L_KLD and the third term L′_KLD represent the Kullback-Leibler divergence of the right and left flow, respectively. The algorithm for training the DF-ConvLSTM-VAE model is shown in Algorithm 2. More details and configurations of our DF-ConvLSTM-VAE model are provided in Table 1 and Table 2 of Section IV.
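A plausible reading of this three-term objective is sketched below. It assumes that the first term is the reconstruction error L_MSE, as in Equation (7), and that the three terms are summed without weights; the prime marks the left flow, and p_θ(z) is taken to be the same prior for both flows. None of these assumptions is confirmed by the text above.

```latex
L = L_{MSE} + L_{KLD} + L'_{KLD},
\qquad
L_{KLD} = D_{KL}\left(\mathcal{N}(\mu, \sigma^2) \,\|\, p_\theta(z)\right),
\quad
L'_{KLD} = D_{KL}\left(\mathcal{N}(\mu', \sigma'^2) \,\|\, p_\theta(z)\right)
```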

C. ANOMALY DETECTION
In this paper, our video anomaly detection models calculate the anomaly score from the reconstruction error probability (REP). Given a frame x_t of the test video clip as input, the encoder estimates the parameters µ and σ of the latent Gaussian variables as output. The reparameterization trick is then used to sample z L times according to the latent distribution N(µ, σ²), i.e., z^(l) = µ + σ ⊙ ε^(l), where ε^(l) ∼ N(0, I) and l = 1, . . . , L. The generative network receives z^(l) as input and outputs the reconstructed frame x_t^(l). We compute the reconstruction error probability of a pixel's intensity value I at location (u, v) in frame x_t of a given video sequence by Equation (10),
where I^(l)(u,v,t) denotes the pixel's intensity value at location (u, v) in the reconstructed frame x_t^(l). For each frame x_t, we compute its REP by summing all pixel-wise error probabilities over the locations (u, v). We compute the regularity score s(t) of a video sequence through Equation (11). In addition, in order to count the abnormal events in a given video, we examine the local minima in the time series of the regularity score. These local minima are very noisy and not all of them are meaningful; distinct local minima indicate the video frames that are most likely to contain anomalies. We use the Persistence1D [39] algorithm to identify the meaningful local minima. In this step, if the distance between two local minima is less than 50 frames, they are considered part of the same abnormal event.
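The scoring pipeline described above can be sketched as follows. Several parts are assumptions made for illustration only: the pixel-wise error is taken to be the squared intensity difference averaged over the L reconstructions, the regularity score is taken to be a min-max normalization of the frame errors, and a naive neighbor comparison stands in for the Persistence1D algorithm. All function names are hypothetical.

```python
def frame_rep(frame, reconstructions):
    """Frame-level error: squared pixel differences summed over all pixels
    and averaged over the L reconstructions (assumed error form).

    frame: 2D list of intensities; reconstructions: list of L 2D lists.
    """
    total = 0.0
    for rec in reconstructions:
        for row, rec_row in zip(frame, rec):
            for p, q in zip(row, rec_row):
                total += (p - q) ** 2
    return total / len(reconstructions)


def regularity_scores(errors):
    """Map frame errors to [0, 1]; 1 is most regular (assumed normalization)."""
    lo, hi = min(errors), max(errors)
    if hi == lo:
        return [1.0] * len(errors)
    return [1.0 - (e - lo) / (hi - lo) for e in errors]


def local_minima(scores):
    """Naive local-minimum test, a simple stand-in for Persistence1D."""
    return [t for t in range(1, len(scores) - 1)
            if scores[t] < scores[t - 1] and scores[t] <= scores[t + 1]]


def group_events(minima, min_gap=50):
    """Merge minima closer than min_gap frames into one abnormal event."""
    events = []
    for t in minima:
        if events and t - events[-1][-1] < min_gap:
            events[-1].append(t)
        else:
            events.append([t])
    return events
```

The number of detected abnormal events is then `len(group_events(local_minima(scores)))`. Unlike Persistence1D, the naive test does not filter out shallow minima, so a real pipeline would still need a persistence threshold.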

A. DATASETS
To test our two methods, we conduct experiments on several challenging datasets, namely the UCSD Ped1, UCSD Ped2, and Avenue datasets.

1) UCSD DATASET
The UCSD ped dataset [12] consists of two sub-datasets, namely UCSD ped1 and UCSD ped2. The UCSD ped1 dataset contains 34 video clips for training and 36 video clips for testing. The resolution of each frame is 238 × 158 pixels. The UCSD ped2 dataset consists of 16 training and 12 testing video clips, each with 360 × 240 resolution. Anomalous events in the UCSD ped dataset fall mainly into two categories: the movement of non-pedestrian entities and anomalous pedestrian motions. They include bikers, skaters, carts, wheelchairs, and people walking off the walkway.

2) AVENUE DATASET
There are 16 training and 21 testing video clips in AVENUE dataset [40]. The resolution of each frame is 640 × 360 pixels. Each video clip is around 2 minutes long. The training video clips contain mostly normal activities, but do include a few anomalous events. There are several typical anomalous events, including running, throwing objects and walking in the wrong direction in testing video clips. In addition, it is worth noting that the camera in this dataset has jitter problems, while the other datasets are from stationary cameras.

B. IMPLEMENTATION DETAILS
In order to verify the performance of our asymmetric models, two symmetric models are designed for comparison: VAE and ConvLSTM-VAE(Symmetric). As shown in Figure 3, the left model is a VAE model consisting only of symmetric Conv and Deconv modules. The right model is the ConvLSTM-VAE(Symmetric) model, which symmetrically adds one ConvLSTM layer compared with the VAE model. The parameters of the corresponding modules in the two symmetric models are the same as in our models.

1) EVALUATION METRIC
In the field of video anomaly detection, two commonly used evaluation criteria are the Equal Error Rate (EER) and the Area Under the Receiver Operating Characteristic Curve (AUC). Both criteria are derived from the Receiver Operating Characteristic (ROC) curve, which is well suited for comparing algorithm performance. The ROC curve evaluates how well abnormal events are detected. It takes the false positive rate (FPR) as abscissa and the true positive rate (TPR) as ordinate. Here, TP denotes true positives, FN denotes false negatives, FP denotes false positives, and TN denotes true negatives. We compute TPR and FPR through Equation (12):
$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN} \tag{12}$$
We select different thresholds and calculate the TPR and FPR for each to plot the ROC curve. The EER is the point on the ROC curve where the false positive rate equals the false negative rate (1 − TPR), namely, the intersection of the ROC curve and the diagonal from (0, 1) to (1, 0) in the ROC space. A smaller EER and a larger AUC indicate better performance.
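The computation of these metrics can be sketched in a few lines. The sketch assumes that a higher score means more anomalous, that both classes are present in the labels, and that the EER is approximated by the ROC point closest to the diagonal from (0, 1) to (1, 0) rather than by interpolation; the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return ROC points (FPR, TPR) from (0, 0) to (1, 1).

    scores: anomaly scores, higher means more anomalous.
    labels: 1 for anomalous frames, 0 for normal frames.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    # Sweep the threshold from the highest score down to the lowest.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def eer(points):
    """Approximate EER: the ROC point where FPR is closest to FNR = 1 - TPR."""
    fpr, tpr = min(points, key=lambda p: abs(p[0] - (1.0 - p[1])))
    return (fpr + (1.0 - tpr)) / 2.0
```

A perfect detector yields an AUC of 1 and an EER of 0, while a detector that interleaves normal and anomalous frames in its ranking moves both metrics toward 0.5.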

2) CONFIGURATIONS OF OUR MODELS
The input images are resized to 224 × 224 pixels and converted to grayscale. The input length of the two networks is ten frames (T = 10). Figure 4 compares the average L_MSE per sequence of the ConvLSTM-VAE(Asymmetric) model for different learning rates (Figure 4(a)), mini-batch sizes (Figure 4(b)), and optimizers (Figure 4(c)) on the UCSD ped1 dataset. The three blue curves show the hyperparameters with which our asymmetric model performs best. Based on Figure 4, we use the Adam optimizer with a learning rate of 10^-4 to train our two networks from a Xavier uniform random weight initialization. Our two networks are L2-regularized with a weight decay of 5 × 10^-4. On UCSD ped1 and Avenue, the batch size is set to 4, and on UCSD ped2, it is set to 8. Figure 5 compares the AUC and EER for different dimensions of the learned hidden representation z on the UCSD ped1 dataset. As can be seen from Figure 5, the ConvLSTM-VAE(Asymmetric) model performs best when the dimension of the hidden representation z is set to 256. Figure 6 and Table 1 provide the structure and the corresponding parameters of the ConvLSTM-VAE(Asymmetric) model, respectively. The ConvLSTM-VAE(Asymmetric) model concatenates the outputs of three recurrent ConvLSTM layers and sends them to the next two fully connected layers to calculate the mean and the variance (ConvLSTM4, ConvLSTM5, ConvLSTM6 → FC7, FC8). Table 2 provides the corresponding parameters of the DF-ConvLSTM-VAE model.
Figure 9(b) is likewise a partial magnification of Figure 9(c) on the right. From Figures 7-9, it is easy to see that the curves of the average L_MSE per sequence and the average KL divergence per sequence of the ConvLSTM-VAE(Symmetric) model differ markedly from those of the other three models.
From Figures 7(a), 8(a), and 9(a), we find that early in training the ConvLSTM-VAE(Symmetric) model tends to fall into a local optimum or saddle point and lingers there for a long time before escaping and continuing to optimize. The other three models do not show this behavior. Consequently, the ConvLSTM-VAE(Symmetric) model converges more slowly than the other three models. In addition, the ConvLSTM-VAE(Symmetric) model sometimes fails to converge when trained on the Avenue dataset. As can be seen from Figures 7(a), 8(a), and 9(a), the curve of the average L_MSE per sequence of the DF-ConvLSTM-VAE model lies below the other curves. From Figures 7(b), 8(b), and 9(b), the curve of the average KL divergence per sequence of the DF-ConvLSTM-VAE model lies between those of the VAE and the ConvLSTM-VAE(Asymmetric) model. Therefore, the structural design of the DF-ConvLSTM-VAE model, which is composed of the VAE and the ConvLSTM-VAE(Asymmetric) model, is effective. Compared with the ConvLSTM-VAE(Asymmetric) model, the DF-ConvLSTM-VAE model, which also avoids lingering at a saddle point for a long time, requires relatively less training time.
In Table 3, the four models are evaluated on three test datasets. Overall, the experimental results show that the ConvLSTM-VAE(Symmetric) model performs better than the other three models, and the DF-ConvLSTM-VAE model performs better than the ConvLSTM-VAE(Asymmetric) and VAE models. Although the DF-ConvLSTM-VAE model is not the best performer, it is worth considering for anomaly detection in view of its training behavior and time consumption.
We implement our four models using the TensorFlow framework. All test experiments are conducted on a GeForce RTX 2080 Ti GPU. We measure the time (in seconds) consumed per frame by the four models on the UCSD ped1 dataset, and the results are shown in Table 4. The ConvLSTM-VAE(Symmetric) model has the longest running time, due to its two symmetric ConvLSTM layers. Our DF-ConvLSTM-VAE model has two sampling processes in the data stream of each frame and thus takes longer than the ConvLSTM-VAE(Asymmetric) model, but it is less time-consuming than the ConvLSTM-VAE(Symmetric) model, which has one sampling process.
Table 5 compares the anomaly detection accuracy of our DF-ConvLSTM-VAE model against other state-of-the-art methods on three datasets. In Table 5, Adam, SF, MPPCA, MPPCA+SF, and HOFME are traditional methods. It is easy to see that our DF-ConvLSTM-VAE method is significantly better than these traditional methods in terms of AUC and EER on the UCSD dataset. In Table 5, ConvAE, ST-AE, two-stage, and ISTL are unsupervised deep learning methods, among which ConvAE, ST-AE, ISTL, and our DF-ConvLSTM-VAE algorithm are one-stage models. Among these four models, our algorithm ranks second on the UCSD ped1 dataset and first on the Avenue dataset in terms of AUC and EER. The two-stage, Ada-net, and ST-CaAE models are more complex networks; Ada-net and ST-CaAE in particular have high complexity because they are designed with a GAN model. ST-CaAE uses extra optical flow information, employs 2D/3D convolution to extract short-time spatio-temporal features, and integrates the classical dual-flow model for video anomaly detection. Our algorithm uses ConvLSTM units to extract long-time spatio-temporal features of video sequences without additional optical flow information, which would increase the computation. Compared with the ST-CaAE model, our DF-ConvLSTM-VAE algorithm performs well on EER.
Compared to these state-of-the-art deep learning methods, our algorithm ranks third in terms of EER on the UCSD datasets and second in terms of AUC and EER on the Avenue dataset. In summary, our DF-ConvLSTM-VAE model is competitive in EER with other advanced deep learning models.

2) QUANTITATIVE ANALYSIS: ROC AND ANOMALOUS EVENT COUNT
The comparisons of anomalous event and false alarm counts are provided in Table 6. We use our DF-ConvLSTM-VAE model together with the Persistence1D [39] algorithm to count true positives and false alarms. Table 6 shows that our algorithm performs very well on all three datasets in terms of true positives. As for false alarms, our algorithm performs well on Ped2 and Avenue, but not on the Ped1 dataset. In summary, the performance of our DF-ConvLSTM-VAE model is comparable to that of state-of-the-art anomalous event detection methods.
As seen in Table 5 and Table 6, our DF-ConvLSTM-VAE model is competitive with other state-of-the-art methods. Figures 10-12 show three examples of videos generated by our DF-ConvLSTM-VAE network for ground truth video sequences that contain anomalous objects.

3) QUALITATIVE ANALYSIS a: VISUALIZING THE RECONSTRUCTED IMAGES
In Figure 10, the first and third rows are the ground truth video sequences of frames 70-80 from UCSD Ped1 testing clip #20, while the second and fourth rows show the corresponding reconstructed images. We can observe that the pedestrians in the generated images differ from those in the ground truth images, because the data generated by the network is not identical to the original data but follows the same distribution. The network attends to the spatio-temporal characteristics of the videos it has learned and generates continuous foreground information. The ground truth images show a person in a wheelchair, whereas at the same position our DF-ConvLSTM-VAE model generates a walking person in the reconstructed images.
In Figure 11, the first and third rows are the ground truth video sequences of frames 70-80 from UCSD Ped2 testing clip #4, while the second and fourth rows show the corresponding reconstructed images. The ground truth images show a moving truck, while our DF-ConvLSTM-VAE model generates nothing at the same place in the reconstructed images.
In Figure 12, the first and third rows are the ground truth video sequences of frames 20-30 from Avenue testing clip #20, while the second and fourth rows show the corresponding reconstructed images. The ground truth video shows a person walking in the wrong direction (toward the camera). This behavior does not occur in the training set, and therefore our DF-ConvLSTM-VAE model generates nothing at the corresponding position in the generated images. In addition, in the ground truth video the pillar is obscured by the abnormal object, but it is generated well in the images produced by our DF-ConvLSTM-VAE model. This is because a VAE-based model is essentially a probabilistic graphical model. Since the samples generated by the DF-ConvLSTM-VAE model follow the same distribution as the training data, there is no anomalous object in our reconstructed images.

b: VISUALIZING TEMPORAL REGULARITY
In Figures 13-15, we compare our two asymmetric models in terms of the regularity scores on clips from different datasets. The anomalous ground truth regions are highlighted in red, and each distinct local minimum is marked by a blue dot. The lower the regularity score under anomalous conditions and the higher it is under normal circumstances, the better the model performs. Figure 14 shows that the DF-ConvLSTM-VAE model is more capable than the ConvLSTM-VAE(Asymmetric) model. There are two anomalous objects (a moving truck and a person with a bike) in video #4. When two anomalies occur at the same time, the curve only shows that the video frame is anomalous; it cannot indicate that two anomalous objects exist in this video frame. In Figure 15, the curves show that both of our asymmetric models can detect the anomalous behavior of a person throwing papers into the air.
From these figures, it is easy to see that when there are irregular motions, the regularity score curve drops significantly and forms a nail shape, and that our DF-ConvLSTM-VAE model performs slightly better than the ConvLSTM-VAE(Asymmetric) model.

V. DISCUSSION
Our method takes the whole video frame as input, which is very advantageous for extracting global features. However, when extracting features, we find that the foreground targets are relatively small, which makes it challenging to extract their detailed features. Therefore, in subsequent studies, we suggest removing background information unrelated to the foreground and extracting the relevant features in the form of patches.

VI. CONCLUSION AND FUTURE WORK
In this paper, we propose the ConvLSTM-VAE(Asymmetric) model and the DF-ConvLSTM-VAE model, both of which consist of ConvLSTM and VAE, to learn the distribution of the training data for video anomaly detection. The ConvLSTM-VAE(Asymmetric) model is designed by weakening the decoder. Compared with the ConvLSTM-VAE(Symmetric) model, the ConvLSTM-VAE(Asymmetric) model has some advantages in terms of training time and difficulty. Experiments show that the DF-ConvLSTM-VAE model is superior to the ConvLSTM-VAE(Asymmetric) model. Compared with other typical methods, the experiments verify the validity and competitiveness of our DF-ConvLSTM-VAE on multiple public benchmark datasets. Since a simple Gaussian model cannot capture the complexity of real data, in the future we will try to construct a new probabilistic graphical model for this task by forcing the representation z to obey a more complex distribution.