Abnormal Event Detection via Feature Expectation Subgraph Calibrating Classification in Video Surveillance Scenes

At present, existing abnormal event detection models based on deep learning mainly focus on data represented in vectorial form and pay little attention to the internal structure characteristics of the feature vector. In addition, it is difficult for a single classifier to ensure the accuracy of classification. To address these issues, we propose an abnormal event detection hybrid modulation method via feature expectation subgraph calibrating classification in video surveillance scenes. Our main contribution is to calibrate the classification of a single classifier by constructing feature expectation subgraphs. First, we employ convolutional neural network and long short-term memory models to extract the spatiotemporal features of video frames, and then construct a feature expectation subgraph for each key frame of every video, which captures the internal sequential and topological relational characteristics of the structured feature vector. Second, we project the expectation subgraphs onto sparse vectors and combine them with a support vector classifier to calibrate the results of a linear support vector classifier. Finally, experiments on the common UCSDped1 dataset and a coal mining video dataset, in comparison with existing works, demonstrate that the proposed method outperforms several state-of-the-art approaches.


I. INTRODUCTION
In recent years, abnormal event detection in intelligent video surveillance has gained increasing attention in the academic and industrial communities [1], [2]. It has become an important task in intelligent video surveillance since it is related to visual saliency [3], interestingness prediction [4], dominant behavior detection [5] and other topics in computer vision. Abnormal event detection in video sequences is a difficult challenge because of the volatility of the definitions of normality and abnormality [6] and the dependence of those definitions on the context scenario. Nevertheless, it can generally be considered that abnormal behavior, or an activity caused by unexpected events, occurs less often than normal (familiar) events [7]. In order to detect abnormal events in surveillance videos, various kinds of modeling techniques have been proposed in the literature, such as trajectory-based models [8], spatiotemporal feature-based models [9], [10] and sparse reconstruction-based models [11]. The majority of these approaches first learn and extract hand-crafted or deep appearance features of video from given samples, and then classify events as abnormal if they deviate from the model of normal events.
At present, feature extraction is regarded as one key factor for abnormal event detection in existing models. From the feature representation point of view, abnormal event detection models are mainly classified into hand-crafted features-based models and deep features-based models.
In hand-crafted features-based models, trajectory modeling [12], optical flow modeling [13] and vision modeling [14] can be used to describe the dynamic and spatiotemporal information of video sequences; besides trajectory modeling, appearance-oriented representations such as color, texture, optical flow and bag-of-words (BOW) [15] modeling are also common. The models based on color and texture features can describe appearance features in a video sequence, but they ignore motion representations. Optical flow modeling can describe the dynamic information of video, but it is susceptible to illumination. The bag-of-words approach computes an unordered histogram of visual word occurrences that encodes only the global distribution of low-level descriptors, but it ignores the local structural organization of salient points [16]. Although trajectory modeling can represent the motion characteristics of foreground objects, it is not robust to complex video scenes. In general, hand-crafted features-based models depend on prior knowledge and do not generalize well to complex video surveillance scenes.
With the development of machine learning, various approaches based on deep learning have achieved remarkable progress in abnormal event detection. For example, convolutional neural networks (CNNs) [17], recurrent neural networks [11] and other deep learning models can learn better feature representations than hand-crafted feature modeling, which is conducive to determining the occurrence of abnormal events in video sequences.
Nevertheless, most abnormal event detection models based on deep learning mainly focus on data represented in a vectorial form and pay little attention to the impact of the internal structure characteristics of the feature vector on classifying and determining abnormal events in video sequences. Moreover, it is difficult for a single classifier to ensure the accuracy of classification. Especially in complex video surveillance scenes, disturbances from the light source, occlusions and other factors obviously affect the data represented in vectorial form and the accuracy of the algorithm. Hence, how to utilize the structure characteristics of feature vectors to filter unexpected eigenvalues that correspond to disturbances, so as to improve the accuracy of abnormal event detection, remains a challenging task.
In this paper, we propose an abnormal event detection hybrid modulation method via feature expectation subgraph calibrating classification (DF-ESCC) in video surveillance scenes to address the above issues. Our method consists of three parts: deep feature extraction, feature expectation subgraph construction and expectation subgraph-based calibration classification. First, we employ a convolutional neural network and a long short-term memory (LSTM) model to extract features in video surveillance scenes. Second, we construct a feature expectation subgraph for each key frame of every video, which captures the internal sequential and topological relational characteristics of the structured feature vector. Finally, we project the expectation subgraph onto a sparse vector, which is combined with support vector classifiers to calibrate the classification of the linear support vector classifier and determine whether abnormal events exist in the video surveillance scenes. The common UCSDped1 dataset [18] and a coal mining video dataset [19] are used to verify the effectiveness of our proposed method. Our contributions are summarized as follows: (1) we introduce the feature expectation subgraph to represent the internal sequential and topological relational characteristics of a structured feature vector; (2) we propose the DF-ESCC method, which combines feature expectation subgraphs with support vector classifiers to calibrate the classification of linear support vector classifiers; (3) the proposed method is validated on the challenging UCSD dataset and a coal mining video dataset, where the coal mining video dataset has complex context scenarios.
The rest of this paper is organized as follows. In Section 2, we present a brief review of related abnormal event detection methods. In Section 3, we propose an abnormal event detection hybrid modulation method via feature expectation subgraph calibrating linear support vector classifier classification. In Section 4, experiments are performed to verify the performance of the proposed DF-ESCC method. Finally, we conclude the paper.

II. RELATED WORK
In this section, we briefly review previous abnormal event detection models from the point of view of appearance features of images in video. We first recall some hand-crafted features-based models, then review deep features-based models for abnormal event detection, and finally analyze the shortcomings of the above methods.

A. HAND-CRAFTED FEATURES-BASED MODELS
In the past decade, trajectory features have been widely used for abnormal event detection. For example, the study in [20] presents a complex event processing method based on trajectories. Song et al. [21] propose an approach that first obtains the trajectories of vehicles and pedestrians, and then detects abnormal events using the trajectory features. However, these methods rely on a single type of feature. Serhan et al. [22] propose to incorporate object trajectory analysis and pixel-based analysis for abnormal behavior inference and event detection, but this method is not suitable for images of poor quality. In [23], a multi-feature fusion method is used to obtain characteristic information of pedestrians, and motion information is then attained by trajectory analysis. The limitation of this method is that it is susceptible to feature changes. Moreover, the graph-based representation and learning of relevant features are combined and correlated with target behaviors to detect abnormalities in moving object trajectories, so as to determine whether the events of interest are normal or abnormal [24]. Fu et al. [25] utilize reference points as well as a piecewise linear segmentation algorithm to compress the trajectories, and then propose a time-aware and spatially correlated collaborative algorithm to increase the density of the trajectories and improve the accuracy of abnormal event detection. However, this method suffers from large cumulative errors in trajectory calculation. The work in [26] presents a survey of trajectory-based surveillance applications with a focus on abnormal event detection.
In general, the representation of trajectory features is sensitive to noise interference, and target trajectories can be discontinuous. Thus, models based on trajectory features are not completely reliable and are not robust for crowded scenes and other complex scenes.
To overcome the drawbacks of trajectory-based models, spatiotemporal features [27]-[29] are extracted from low-level appearance and motion cues. For example, the study in [30] proposes an approach that relies mainly on spatial abstractions of each object, mining frequent temporal patterns in a sequence of video frames to form a regular temporal pattern, which is used to detect abnormal events. However, it is difficult for this approach to describe the spatial abstraction of each object accurately when the image features are transformed. In [31], spatiotemporal information and the slow feature analysis method are combined to represent the discriminative information in videos to detect abnormal crowd motion, but the high-level semantic features of this method have limited ability to represent nonlinear features. The work in [32] proposes a local interest frame descriptor based on the distribution of magnitude and orientation, which is used to train a binary support vector machine classifier to detect violent events. Moreover, in [33] a feature descriptor is proposed by encoding the covariance matrix of optical flow in multiple regions of interest to represent motion information, and a one-class support vector machine is then applied to detect abnormal events. The limitation of these approaches is that it is difficult for a single classifier to ensure the accuracy of classification. Wang et al. [34] propose to learn the histograms of optical flow orientations of the observed video frames with a hidden Markov model to detect abnormal events in a crowded scene. To alleviate the impact of label information on supervised or semi-supervised models, the study in [35] proposes an unsupervised algorithm that combines a manifold-based feature with a graph density search mechanism to detect abnormal network events. However, these algorithms need to know the data distribution in advance.
In summary, models based on spatiotemporal features have good recognition ability for moving objects with linear or nearly linear features, but they require prior knowledge and have limited ability to represent nonlinear features.

B. DEEP APPEARANCE FEATURES-BASED MODELS
In order to overcome the drawback of hand-crafted features, many models based on deep appearance features are proposed to detect abnormal events. The above deep appearance features can be obtained by using convolutional neural networks [36], [37], recurrent neural networks [38], [39] and autoencoder networks [40], [41].
For example, the work in [42] first combines saliency information with multi-scale histograms of optical flow of video frames to represent spatiotemporal information, and then adopts a deep learning network named PCANet to extract high-level features of video to detect abnormal events. As an extension of the above model, Damla et al. [43] explore different convolutional neural networks to model patterns in a video sequence to detect abnormal behavior. In [44], a temporal convolutional neural network and optical flow models are combined to detect local anomalies. The study in [45] integrates a one-class support vector machine into a convolutional neural network to implement a novel end-to-end model. The above approaches pay more attention to the extraction of spatial features, but they do not model the spatiotemporal relationships closely enough.
In order to address that issue, it is necessary to introduce recurrent neural networks to capture temporal features. For example, in [46] a convolutional autoencoder is integrated with a long short-term memory model to detect abnormal events in video surveillance. Kothapalli et al. [47] first use a mixture of Gaussians to subtract the background of each frame, and then a convolutional neural network is used to extract spatial features that are fed into a long short-term memory network to learn temporal features; finally, a linear support vector machine is used for classification to detect abnormal events. In [48], a novel recurrent neural network is constructed to learn a sparse representation and dictionary to detect anomalous events by proposing an adaptive iterative hard-thresholding algorithm. The work in [49] combines body shape, depth and optical flow features with a long short-term memory network to implement fall detection. The limitation of the above approaches based on long short-term memory networks is that noise features continue to propagate through the recurrent neural network, which affects the accuracy of feature representation.
Moreover, autoencoder networks are used to detect anomalous events. For example, in [6] an unsupervised deep feature learning algorithm is proposed that uses a deep three-dimensional convolutional network and multi-level similarity trees after sparse coding to detect abnormal events. Wang et al. [50] use a hybrid spatiotemporal autoencoder to solve the problem that the long short-term memory encoder-decoder framework fails to account for the global context of the learned representation with a fixed-dimension representation. The work in [51] uses a two-stream recurrent variational autoencoder to detect abnormal events in video streams. However, these approaches pay little attention to the impact of the internal structure characteristics of feature vectors on classifying and determining abnormal events in video sequences.
In summary, models based on deep appearance features have good recognition ability for moving objects with nonlinear features, but disturbance features from the light source, occlusions and other factors in video propagate through the deep neural networks, which seriously affects the accuracy of feature representation. In addition, it is difficult for a single classifier or activation function to ensure the accuracy of classification. Hence, we utilize the structure characteristics of deep appearance features to filter unexpected feature representations, and combine a support vector classifier to calibrate the results of a single classifier.

III. THE PROPOSED METHOD
In this section, we describe how to utilize the structure characteristics of the feature vector to improve the performance of abnormal event detection. Recently, a large number of works have focused on key points or feature vectors to classify and detect abnormal events in video sequences. The key insight of these works is to exploit appearance feature representations and utilize probabilistic statistical models or clustering approaches to determine events as abnormal if they deviate from the model of normal events in video surveillance scenes. However, it is not easy for feature representations in vectorial form to describe the topological, geometric and other complex relational characteristics of real-world data, and disturbances from the light source, occlusions and other factors in video can affect the feature representations. Moreover, it is difficult for a single classifier to ensure the accuracy of classification.
In this paper, we construct a feature expectation subgraph to filter unexpected feature representations that arise from various disturbances in video, and combine feature expectation subgraphs with support vector classifiers to improve the recognition results of a single classifier. The advantage of using feature expectation subgraphs is that they capture the principal component of the feature vector while retaining the sequential and topological relational characteristics inside the feature vector, which is conducive to the classification and recognition involved in abnormal event detection. Therefore, we first employ the convolutional neural network and long short-term memory models to extract the features in video surveillance scenes, and then construct expectation subgraphs by measuring the distance between eigenvalues in a feature vector. Subsequently, we combine expectation subgraphs with support vector classifiers to calibrate the classification of linear support vector classifiers and determine whether abnormal events exist in a video surveillance scene. Fig. 1 illustrates the overview of our method, which contains three parts: CNN-LSTM feature extraction, feature expectation subgraph construction and calibration classification based on feature expectation subgraphs. In the following, we describe these parts separately.

A. CNN-LSTM FEATURE EXTRACTION
Deep neural network models have more powerful learning and representational capacity than hand-crafted feature models. Convolutional neural networks are a common class of deep neural network that is well suited to learning spatial relationships from raw input data. Among the various convolutional neural network models, the VGG-16 network can be employed to extract spatial features with high image recognition accuracy because of its depth [47], and therefore it can be applied to feature extraction in complex video surveillance scenes. However, the VGG-16 network has difficulty representing the temporal relationships of the input video sequences accurately. To overcome this limitation, we employ a long short-term memory network to extract dynamic temporal behavior features from the video stream. Considering the spatiotemporal features of video, we first select several video clips as training samples and input them into the VGG-16 network to extract spatial features; the obtained feature maps are then fed into the LSTM to further extract the temporal features of the input video clips. Suppose that the above video clips have a size of w × h × c × l, where w × h denotes the size of a video frame, c denotes the number of channels of each frame, and l denotes the number of frames in the clip. We set w and h to 224 and c = 3 before training the VGG-16 network. Moreover, we fix a 3 × 3 convolutional kernel with stride 1 in the convolution layers, and a 2 × 2 pooling window with stride 2 in the pooling layers, to implement the convolution and max-pooling operations. During the convolution operation, the feature matrix Y_ij can be obtained by the following formulation:

Y_ij = f(W · X_ij + b), (1)

where f(·) denotes the activation function, X_ij is the window matrix around the pixel in the i-th row and j-th column of the video frame, i ∈ [0, h − 1] and j ∈ [0, w − 1].
Moreover, W denotes the weight matrix and b is the bias. In the VGG-16 network, we select a rectified linear unit to represent f(·). Letting z denote an element of the feature matrix Y_ij, f(·) is described as follows:

f(z) = max(0, z). (2)

Through five groups of convolution and max-pooling layers, we use three fully connected layers to extract spatial feature vectors of size [4096, 1]. In addition, the cross-entropy loss function is used to optimize the convolutional neural network. Taking a video frame from the coal mine video as an example, we visualize the feature maps of different convolution layers in Fig. 2. It can be seen from Fig. 2 that the edge features of the video frame are salient in the first convolution layer. However, as the depth of the convolution layers increases, the feature maps become more and more abstract, and finally the high-level features of the video frames are obtained.
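To make the convolution and activation steps concrete, the following NumPy sketch implements the element-wise formulation above for a single channel; the kernel values, input size and bias are illustrative placeholders, not the trained VGG-16 parameters.

```python
import numpy as np

def relu(z):
    """Rectified linear unit used for the activation f(.)."""
    return np.maximum(0.0, z)

def conv_single_channel(frame, W, b, stride=1):
    """Valid convolution of one channel: Y_ij = f(W . X_ij + b),
    where X_ij is the window around row i, column j."""
    h, w = frame.shape
    kh, kw = W.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    Y = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            X_ij = frame[i * stride:i * stride + kh, j * stride:j * stride + kw]
            Y[i, j] = relu(np.sum(W * X_ij) + b)
    return Y

def max_pool(Y, size=2, stride=2):
    """2 x 2 max pooling with stride 2, matching the fixed pooling window."""
    h, w = Y.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    P = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            P[i, j] = Y[i * stride:i * stride + size, j * stride:j * stride + size].max()
    return P
```

The 3 × 3 kernel with stride 1 followed by 2 × 2 pooling with stride 2 mirrors the configuration fixed above; a full VGG-16 stacks many such layers per channel.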
Subsequently, the extracted feature vectors are fed into a long short-term memory network to further extract temporal features. Here we employ a two-layer long short-term memory network, where the network in each layer has the same architecture, consisting of an input gate, a forget gate and an output gate. In training the long short-term memory network, we set the learning rate to 0.01, the number of input nodes to 64, and the number of nodes in the hidden layer to 256. Moreover, we utilize the cross-entropy function as the training loss, i.e.,

L = − Σ_{i=1}^{1024} ŷ_i log(y_i), (3)

where y_i is the i-th eigenvalue in the feature vector from the output gate, ŷ_i denotes the label corresponding to y_i, and i ∈ [1, 1024]. After we complete the training of the VGG-LSTM networks, we can obtain feature vectors of size [1024, 1] from the output layer of the long short-term memory network to represent the video clips. The concrete architecture of the VGG-LSTM networks is described in Fig. 3.
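As a small numerical illustration of the cross-entropy loss used above, the sketch below computes it for generic predicted probabilities and labels; the clipping constant `eps` is our own numerical-stability choice, not a parameter from the paper.

```python
import numpy as np

def cross_entropy(y_pred, y_true, eps=1e-12):
    """Cross-entropy loss L = -sum_i yhat_i * log(y_i), where y_pred holds
    predicted probabilities and y_true the corresponding labels.
    Predictions are clipped away from zero to keep the log finite."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred))
```

For a one-hot label, the loss reduces to the negative log-probability the model assigns to the correct class, so it shrinks as the prediction sharpens toward the label.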

B. FEATURE EXPECTATION SUBGRAPH CONSTRUCTION
Disturbances from the light source, occlusions and other factors affect feature extraction in both normal and complex video surveillance scenes, and they are also reflected in the feature representations. Although principal component analysis algorithms can preserve the main features of video frames while reducing the impact of disturbances, the structure characteristics of the feature vector will change. At present, most studies mainly focus on data represented in a vectorial form and pay little attention to the impact of the internal structure characteristics of the feature vector on abnormal event detection in video surveillance scenes. In this section, we describe how to construct a feature expectation subgraph to represent the sequential and topological relational characteristics between eigenvalues in a structured feature vector.
Suppose that we obtain a set of feature vectors S = {V_i}_{i=1}^n by using the VGG-LSTM networks, where the i-th feature vector V_i ∈ R^(1024×1). Since the distribution of feature points has sequential and topological relationships in a video frame, the distance between two eigenvalues can reflect these relationships. In order to represent the internal sequential and topological relationships of a feature vector, we first transform the feature vector into a two-dimensional matrix by using the following formulation:

A^(i)(t, l) = V_i((t − 1) × m + l), (4)

where t denotes the t-th row and l denotes the l-th column in matrix A^(i), m denotes the number of columns of A^(i), and the i-th matrix A^(i) corresponds to the i-th feature vector V_i. Second, we use a mapping ϕ: y_{t,l} → P(y_{t,l}, l) to obtain an eigenvalue point in two-dimensional space if the value of an element in A^(i) is not 0. Therefore, each eigenvalue y^(i)_{t,l} corresponds to an eigenvalue point P(y^(i)_{t,l}, l) in two-dimensional space. Suppose that we have two eigenvalue points P(y^(i)_{t1,l1}, l_1) and P(y^(i)_{t2,l2}, l_2). We can measure the distance between the two eigenvalue points by using

dis(P(y^(i)_{t1,l1}, l_1), P(y^(i)_{t2,l2}, l_2)) = α_1 ψ_1(y^(i)_{t2,l2}, y^(i)_{t1,l1}) + α_2 k ψ_2(l_2, l_1), (5)

where the parameters t_1, t_2, l_1, l_2 ∈ [1, 1024], α_1 and α_2 are constraint factors, and y^(i)_{t1,l1}, y^(i)_{t2,l2} ∈ A^(i). According to [16], besides the eigenvalue itself, the position of an eigenvalue point in two-dimensional space is also a main factor in measuring the internal sequential and topological relationships of a feature vector. Therefore, the first term ψ_1(y^(i)_{t2,l2}, y^(i)_{t1,l1}) measures the similarity of the eigenvalues of two eigenvalue points, and the second term ψ_2(l_2, l_1) measures the similarity of the positions of two eigenvalue points. Moreover, we calculate k using Eq. (6) to roughly balance the contributions of the two terms to the distance measurement:

k = (max_{t,l} y^(i)_{t,l} − min_{t,l} y^(i)_{t,l}) / dim(V_i), (6)
where dim(V_i) denotes the dimension of feature vector V_i. On this basis, we use the Euclidean distance function to represent ψ_1(y^(i)_{t2,l2}, y^(i)_{t1,l1}) and ψ_2(l_2, l_1); thus we can further describe Eq. (5) as

dis(P(y^(i)_{t1,l1}, l_1), P(y^(i)_{t2,l2}, l_2)) = α_1 |y^(i)_{t2,l2} − y^(i)_{t1,l1}| + α_2 k |l_2 − l_1|, |l_2 − l_1| ≤ r, (7)

where r denotes the range of the neighborhood. We employ the idea of the K-nearest-neighbor algorithm to calculate the distance only within the range r (we set r = 100 in our experiments), which not only reduces the computational cost but also decreases the influence of eigenvalue points at distant positions in the feature vector on the distance calculation. If dis(P(y^(i)_{t1,l1}, l_1), P(y^(i)_{t2,l2}, l_2)) ≤ µ_T, where µ_T is a given threshold, we consider the two points P(y^(i)_{t1,l1}, l_1) and P(y^(i)_{t2,l2}, l_2) to be similar eigenvalue points, and utilize an edge to represent the incidence relation between them. In this way, several eigenvalue points are related to each other by edges, and several edge sets are generated to represent the incidence relations of all eigenvalue points in the feature vector. Through the above eigenvalue points and edge sets, we can construct a graph G = (v, ε(v)), where v denotes the set of eigenvalue points and ε(v) denotes the corresponding edge set. In order to utilize the structure characteristics of deep feature vectors to filter unexpected eigenvalues that correspond to disturbances, and thereby improve the accuracy of abnormal event detection, we construct a feature expectation subgraph for each key frame of the video. First, we calculate the expected value of the edge sets in graph G as

E(G) = Σ_{ε(v)} f(ε(v)) P(ε(v)), (8)

where f(ε(v)) denotes a discrete function on ε(v). Since the occurrence of any ε(v) is equally likely, we can further describe it as

E(G) = (1/q) Σ_{j=1}^{q} f(ε_j(v)), (9)

where q is the number of edge sets. After that, we can obtain the feature expectation subgraph G′, as shown in Fig. 4. Fig. 4(a) shows the eigenvalue points in the feature vector generated from the VGG-LSTM networks, and Fig. 4(b) shows a feature expectation subgraph G′. From Fig. 4(b), we can see that the eigenvalue points that do not satisfy the condition dis(P(y^(i)_{t1,l1}, l_1), P(y^(i)_{t2,l2}, l_2)) ≤ µ_T are filtered out, while the others are retained. Moreover, the graph composed of the retained eigenvalue points preserves the main part of the internal sequential and topological relational characteristics of the structured feature vector. When there are few feature expectation subgraphs, we use all feature subgraphs as feature expectation subgraphs. When a feature subgraph contains all eigenvalue points, we regard it as the maximum feature expectation subgraph.
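The construction above can be sketched at toy scale as follows; the reshape width, constraint factors, neighborhood range and threshold are illustrative placeholders for the paper's parameters, and one-dimensional absolute differences stand in for the Euclidean distance terms.

```python
import numpy as np

def build_expectation_subgraph(V, side, alpha1=1.0, alpha2=1.0, r=2, mu_T=1.5):
    """Toy-scale sketch of the feature expectation subgraph construction.
    V is a feature vector of length side*side; alpha1/alpha2, r and mu_T
    play the roles of the constraint factors, neighborhood range and
    threshold (the values here are illustrative, not the paper's)."""
    A = V.reshape(side, side)                        # vector -> matrix
    # Map every non-zero element to an eigenvalue point (value, row, column).
    points = [(A[t, l], t, l) for t in range(side)
              for l in range(side) if A[t, l] != 0]
    # Contribution factor balancing eigenvalue vs. position terms.
    k = (A.max() - A.min()) / V.size
    edges = []
    for a in range(len(points)):
        for b in range(a + 1, len(points)):
            y1, _, l1 = points[a]
            y2, _, l2 = points[b]
            if abs(l2 - l1) > r:                     # K-nearest-neighbor range
                continue
            # Weighted eigenvalue distance + position distance.
            dis = alpha1 * abs(y2 - y1) + alpha2 * k * abs(l2 - l1)
            if dis <= mu_T:                          # similar eigenvalue points
                edges.append((a, b))
    # Points touched by at least one edge survive; isolated ones are filtered.
    kept = sorted({i for e in edges for i in e})
    return points, edges, kept
```

In this sketch an eigenvalue far from its neighbors (e.g. a disturbance spike) ends up isolated and is filtered out, which is the behavior the expectation subgraph is designed to produce.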

C. CALIBRATION CLASSIFICATION BASED ON FEATURE EXPECTATION SUBGRAPH
Once the frames of a video are represented by feature expectation subgraphs, we can use them to classify and recognize anomalies. In this section, we combine feature expectation subgraphs with support vector classifiers to calibrate the classification of a single linear support vector classifier.
First, let {G′_i, y_i}_{i=1}^n be the labeled feature expectation subgraphs for n frames from N training videos, where the label y_i is −1 for feature expectation subgraphs of abnormal events and +1 for feature expectation subgraphs of normal events. Second, we utilize a support vector classifier to classify G′ and detect abnormal events. In this paper, we solve the classification problem of the support vector classifier for feature expectation subgraphs based on the improved support vector machine model in [16], formulated as

max_α Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j K(G′_i, G′_j), s.t. Σ_{i=1}^n α_i y_i = 0, 0 ≤ α_i ≤ C, (10)

where α_i and α_j are Lagrange multipliers, y_i ∈ {−1, +1}, K(G′_i, G′_j) is the graph kernel function, and C is the box constraint parameter. Since we can use an inverse mapping ϕ^(−1): P(y_{t,l}, l) → y_{t,l} to obtain a sparse vector V^(i)_S that corresponds to a feature expectation subgraph G′_i, we can establish the conversion relationship G′_i ↔ V^(i)_S. On this basis, we adopt the linear kernel function to measure the similarity between any two subgraphs G′_i and G′_j. The decision function for a test subgraph G′ is

f(G′) = sign(Σ_{i=1}^n α_i y_i K(G′_i, G′) + b), (11)

where b is the bias and the prediction f(·) takes values in {−1, +1}. Although feature expectation subgraphs can be used to obtain the principal component of a feature vector while preserving the main sequential and topological relational characteristics inside the feature vector, it is difficult for a single classifier to ensure the accuracy of classification. In addition, the sparse vectors obtained from feature expectation subgraphs cannot completely represent the features of a video frame. Hence, we combine the subgraph classifier with the linear support vector classifier to detect abnormal events in video scenes as follows:

F = f(G′, G′_i) ∨ f(V, V_i), (12)

where V is the feature vector extracted from the VGG-LSTM networks for the test samples, and ∨ denotes the logical OR operation. In this way, we can utilize the result of f(G′, G′_i) to calibrate the classification of f(V, V_i).
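A minimal sketch of the calibration step, assuming two already-trained linear decision functions; reading the logical OR as "flag abnormal (−1) when either classifier does" is our interpretation of the combination rule, and the weights below are illustrative rather than learned.

```python
import numpy as np

def linear_svc_predict(x, w, b):
    """Decision function of a trained linear SVC: sign(w . x + b),
    returning +1 for normal and -1 for abnormal."""
    return 1 if np.dot(w, x) + b >= 0 else -1

def calibrated_predict(v_feat, v_sparse, w_feat, b_feat, w_graph, b_graph):
    """Logical-OR calibration: combine the feature-vector classifier
    f(V, V_i) with the subgraph classifier f(G', G'_i) applied to the
    sparse vector, flagging the frame abnormal when either says so."""
    p_feat = linear_svc_predict(v_feat, w_feat, b_feat)
    p_graph = linear_svc_predict(v_sparse, w_graph, b_graph)
    return -1 if (p_feat == -1 or p_graph == -1) else 1
```

With this reading, the subgraph classifier can only add abnormal detections that the raw feature-vector classifier missed, trading some precision for higher recall on disturbed frames.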

IV. EXPERIMENTAL EVALUATION
We conduct extensive experiments on a widely used abnormal event detection dataset and a coal mining video dataset to evaluate the performance of the proposed DF-ESCC method, and compare it with several state-of-the-art methods such as SURF+BoW, SIFT+BoW [16], HMM with optical flow [34], CNN-2D+LSTM [43] and CNN-2D+LSTM+SVM [47]. All the experiments are conducted on a machine with an Intel Core(TM) i7-7700HQ processor and 8 GB of memory, and on a Huawei server with four Intel Xeon processors and 8 GB of memory, respectively. The programs are written in Python 3.5. In what follows, we describe the details of the experiments and results.

A. DATASET AND EVALUATION CRITERIA
In real life, collected video data suffer from quality issues and contain much repetitive information in each frame, which is not conducive to detecting abnormal events in video surveillance scenes. In order to verify the effectiveness and performance of the proposed method in common scenarios, we choose the UCSDped1 dataset [18] to evaluate the proposed method, since it is one of the most commonly used benchmarks for abnormal event detection in videos. Moreover, we also focus on the coal mine video dataset [19], as its complex scenes make it more challenging than the UCSDped1 dataset; the abnormal event of coal accumulation is also common in coal mine production. By using the coal mine video dataset, the validity and performance of our proposed method can be verified in complex scenarios. The UCSDped1 dataset provides 34 training clips and 36 testing clips, and each clip has around 200 frames with a resolution of 238 × 158 pixels. In our experiments, we utilize a total of 1393 video frames from 4 videos to detect ''biker'', ''cart'', ''wheelchair'' and ''skater'' abnormal events. In the coal mining video dataset, there are 73 videos that can be used for training and testing, and each frame is a 3-channel image with a resolution of 224 × 224 pixels. In our experiments, we select a total of 6879 frames from 6 videos in 3 scenes to detect the abnormal event of coal accumulation. In order to evaluate the performance of the proposed approach, the following metrics [43] are used for the evaluation of abnormal event detection: accuracy, precision and recall, which are expressed by

accuracy = (TP + TN) / (TP + TN + FP + FN),
precision = TP / (TP + FP),
recall = TP / (TP + FN),

where TP is the number of true positive samples, FN the number of false negative samples, FP the number of false positive samples and TN the number of true negative samples.
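The three evaluation metrics can be computed directly from the confusion counts, for example:

```python
def detection_metrics(tp, fn, fp, tn):
    """Accuracy, precision and recall from the confusion counts of an
    abnormal event detector (positives = abnormal frames)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall
```

Precision penalizes false alarms while recall penalizes missed abnormal frames, so reporting both alongside accuracy gives a fuller picture on datasets where abnormal events are rare.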

B. RESULTS ON DIFFERENT DATASETS
For the UCSDped1 dataset and the coal mining video dataset, the weight and bias variables are initialized with random values when training the VGG-LSTM networks. Moreover, we utilize dropout, parameter sharing [52] and data enhancement [53] to address the overfitting problem, and adopt the Adam algorithm to optimize the loss function. The changes of the weight variables and the loss during training on the UCSDped1 dataset and the coal mining video dataset are shown in Figs. 5 and 6. From Fig. 5, we can see that the loss function converges, the different weights in the CNN-LSTM networks vary within the range of −0.2 to 0.2, and the different biases vary within the range of −1 to 1. According to Fig. 6, the loss function also converges, the different weights vary within the range of −0.2 to 0.2, and the different biases vary within the range of −0.7 to 1.3. Therefore, there is no overfitting during training and validation in our experiments.
In our experiments, different feature expectation subgraphs are constructed with different thresholds µT, as shown in Figs. 7 and 8. From these figures, we can see that the number of eigenvalue points in the feature expectation subgraph increases as µT increases; the topological structure of the subgraphs in Fig. 7 changes until µT = 1.6, and that of the subgraphs in Fig. 8 changes until µT = 2.0. In addition, Figs. 9 and 10 show that, in some cases, fewer eigenvalue points lead to higher accuracy: the accuracy is highest when µT = 1.3 in Fig. 7 and when µT = 2.0 in Fig. 8. However, too few eigenvalue points in a feature expectation subgraph cannot represent the features of a video frame completely, while some feature expectation subgraphs or feature graphs may contain eigenvalue points that correspond to disturbance factors; both situations affect the accuracy of abnormal event detection.
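The role of the threshold µT can be illustrated with a simplified sketch: eigenvalue points whose magnitude does not exceed µT are retained as subgraph nodes, so a larger µT admits more nodes, consistent with the behavior described above, and consecutive retained points are linked to preserve the sequential structure of the feature vector. The exact selection and edge rules of the paper's construction are not reproduced here; this selection rule, the edge rule, and all names are our own assumptions.

```python
def expectation_subgraph(features, mu_t):
    """Retain eigenvalue points with magnitude at most mu_t as nodes and
    connect consecutive retained points to keep the sequential order."""
    nodes = [i for i, v in enumerate(features) if abs(v) <= mu_t]
    edges = [(a, b) for a, b in zip(nodes, nodes[1:])]
    return nodes, edges

# Toy feature vector: indices 0, 1, 2 and 5 survive the threshold 1.6
nodes, edges = expectation_subgraph([0.2, 1.5, 0.4, 2.1, 1.8, 0.1], mu_t=1.6)
```

Under this rule, raising µT toward the maximum feature magnitude recovers the full feature graph, which mirrors the trade-off discussed above between incomplete representation and admitting disturbance points.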
To further study the performance of the proposed approach, we compare DF-ESCC with several state-of-the-art approaches. The results, shown in Tables 1 and 2, demonstrate that models based on hand-crafted features perform worse than models based on deep appearance features, and that our approach improves the performance effectively. Eigenvalue points corresponding to disturbances, such as the light source in the coal mining video dataset and the dense crowd in UCSDped1, are filtered out, which reduces the influence of these disturbances on abnormal event detection. However, when the disturbances strongly affect the image features, the maximum feature expectation graph performs better, as in the case of µT = 2.0 in Fig. 10. Finally, the results of abnormal event detection are shown in Fig. 11, where Fig. 11(a) shows the abnormal event of a car on the pedestrian walkway, and Fig. 11(b) shows the abnormal event of coal accumulating on the belt conveyor during coal mining.
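The calibration idea, using the subgraph-based classification to correct a single linear SVM, can be sketched as a simple fusion of the two decision scores. This is only an illustrative simplification under our own assumptions (the weighting scheme, the sign convention where positive means abnormal, and all names are hypothetical, not the paper's formulation):

```python
def calibrated_decision(svm_score, subgraph_score, alpha=0.5):
    """Fuse the linear SVM decision value with the subgraph-based score;
    positive scores indicate abnormal. Returns 1 (abnormal) or 0 (normal)."""
    fused = alpha * svm_score + (1 - alpha) * subgraph_score
    return 1 if fused > 0 else 0

# A weak negative SVM score can be overturned by a confident subgraph score
label = calibrated_decision(svm_score=-0.2, subgraph_score=0.8)
```

The intuition is that the subgraph classification, which encodes sequential and topological structure, can overturn borderline decisions of the vectorial SVM, which is where a single classifier is least reliable.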

V. CONCLUSION
In this paper, we present an abnormal event detection hybrid modulation method via feature expectation subgraph calibrating classification (DF-ESCC) in video surveillance scenes. The proposed method, built on feature extraction with VGG-16 and long short-term memory networks, can accurately extract salient features from surveillance videos. Moreover, some unexpected eigenvalues can be filtered out by constructing feature expectation subgraphs and mapping them to sparse vectors. Finally, the accuracy of abnormal event detection can be improved by using the classification of feature expectation subgraphs to calibrate the results of a single classifier. The experimental results on two challenging datasets indicate the effectiveness of DF-ESCC and show performance competitive with existing approaches. In summary, the accuracy of abnormal event detection can be improved by exploiting the internal sequential and topological relational characteristics of structured deep appearance features.
Although our approach can learn effective discriminative features from CNN-LSTM networks, its performance still needs improvement in complex video surveillance scenes, as does the graph kernel model. In the future, we plan to use inception networks and other graph kernel methods to further improve the performance of our method.
OU YE received the B.S. degree in computer science and engineering and the M.S. and Ph.D. degrees in computer software and theory and mechanical engineering from the Xi'an University of Technology, China, in 2007, 2010, and 2014, respectively.
He is currently an Associate Professor with the College of Computer Science and Technology, Xi'an University of Science and Technology. His current research interests include data cleansing, video retrieval, and image processing.
JUN DENG received the B.S. degree in mining engineering from the Xiangtan Mining College, in 1993, and the M.S. and Ph.D. degrees in mining engineering from the Department of Mining Engineering and Active College, Xi'an University of Technology, in 1996 and 2004, respectively.
He is currently a Professor with the College of Safety Science and Engineering, Xi'an University of Science and Technology, China. His current research interests include coal fire safety and public safety. He is currently a Professor with the College of Computer Science and Technology, Institute of Systems Security and Control, Xi'an University of Science and Technology, Xi'an. He has authored more than 20 technical articles for conferences and journals, and holds two invention patents. His research interests include cyber-physical systems and system security.
TAO LIU received the B.S. degree in computer science and technology from the Xi'an University of Science and Technology, in 2017. He is currently pursuing the M.S. degree in software engineering with the College of Computer Science and Technology, Xi'an University of Science and Technology. His research interests include image/video abnormal event detection and machine learning.
LIHONG DONG received the B.S. degree in computer education from the Xi'an Mining College, and the Ph.D. degree from the China University of Mining and Technology. She is currently a Professor with the College of Computer Science and Technology, Xi'an University of Science and Technology, China. Her current research interests include software engineering and mining industry internet technology.

VOLUME 8, 2020