Vision-Based Fall Event Detection in Complex Background Using Attention Guided Bi-directional LSTM

Falls are among the greatest risks to the elderly, and fall event detection in solitary scenes has been an active research topic in recent years. Nevertheless, few studies address fall event detection in complex backgrounds. Unlike most conventional background subtraction methods, which depend on background modeling, the deep learning-based Mask R-CNN method can clearly extract the moving object from a noisy background. We further propose an attention guided Bi-directional LSTM model for the final fall event detection. To demonstrate its effectiveness, the proposed method is verified on a public dataset and a self-built dataset. Comparison with other state-of-the-art methods indicates that the proposed design is accurate and robust, which makes it suitable for fall event detection in complex situations.


I. INTRODUCTION
With the growth of the elderly population, the safety of elders living alone has become a rising issue for society [1]. Falls are among the most common potential dangers for elders living alone indoors, because older people suffer from deteriorating physical function, slow sensory response, and loss of balance [2]. In general, falls cause injury, loss of mobility, and even more serious health problems. Therefore, the detection of fall events is essential to safety in solitary indoor scenes.
In recent years, various methods based on advanced devices have been proposed for the detection of fall events. Wearable devices are developing rapidly. They rely on sensors attached to the person's body, such as tilt sensors, accelerometers, gyroscopes, interface pressure sensors, and magnetometers, and they are widely used in previous works [3][4][5][6][7]. Although these approaches have achieved high performance in fall event detection for elder care, the user has to wear the sensors in daily life. They are therefore not a practical solution, since it is difficult for elders to wear specific devices over the long term. Vision-based methods are increasingly used in different scenarios. A variety of vision sensors have been applied to fall detection tasks, including RGB cameras, depth sensors, and infrared sensors [8]. Among them, RGB cameras are the cheapest and the easiest to set up, since surveillance systems are already well established in daily life.
Many works focus on camera-based fall detection methods and perform well on existing datasets [9][10][11][12]. These algorithms consist of two stages: background subtraction and feature classification. In the background subtraction stage, the methods are divided into traditional methods and deep learning methods. The former include inter-frame difference, Gaussian Mixture Model (GMM), and Geometric Multi-grid (GMG) [13][14][15], whereas the latter commonly use Yolov3 [16,17]. However, the conventional techniques do not perform well under lighting changes, shadow changes, and background changes caused by short-term movements, so they can hardly meet the current need for fall detection in complicated scenes. Besides, Yolov3 must cooperate with tracking technology to extract the bounding box of a single person, and it cannot obtain contour information from the foreground object. Therefore, we introduce a new method, Mask R-CNN [18], into the fall detection task, which not only is more robust in noisy backgrounds but also allows end-to-end training of the model.
In this paper, we propose an attention guided Bi-directional LSTM fall detection method to handle complex background environments, which integrates the information of the spatial and temporal domains in complex scenes. We show that using our model for fall detection gives better results than the existing models [17,19,20].
In our method, an effective person detector, Mask R-CNN, is first used for background subtraction, and then a deep learning-based method is used for feature extraction. After the features are obtained, an attention guided Bi-directional LSTM model is developed to detect solitary fall events. Spatially, the visual attention model tends to focus on the most significant regions of fall events. The Bi-directional LSTM fuses the forward and backward temporal information to predict the classification results of continuous sequences.
The rest of this paper is organized as follows. Section 2 presents an overview of the related works on fall detection. The proposed fall detection framework is explained in Section 3. Section 4 presents the experimental results, the evaluation of our technique, and a comparison with other state-of-the-art methods. Section 5 concludes the paper with future research directions.

II. RELATED WORK AND CONTRIBUTION
The fall detection algorithms proposed over the last few years can be divided into three main categories: wearable sensor-based methods, ambience sensor-based methods, and computer vision-based methods.
A wearable sensor-based fall detection system determines the wearer's motion status or position from sensors worn on the older person's body. Most of this research is based on accelerometers [21]. An acceleration-based detection system decides whether a fall has occurred by collecting and analyzing the acceleration along multiple axes. The most common sensor in the literature is the three-axis accelerometer. Mathie et al. fixed the sensor at the waist to obtain acceleration data of the human body from walking to falling. After these data are analyzed and compared with a set threshold, falls and non-falls of the human body can be distinguished [22]. Purwar et al. proposed a fall detection device composed of a three-axis acceleration sensor, a gyroscope, and an inclination sensor, in which the data from the multiple sensors are fused to detect whether the human body has fallen [23]. In these systems, most of the wearable sensors (such as accelerometers and gyroscopes) are cheaper, more accurate, and easier to operate than environmental sensors. However, they are highly invasive, which is their biggest drawback.
Environmental sensor-based methods place sensors around the environment [24,25], typically indoors. The signals used for detection include pressure, vibration, audio, infrared arrays, Wi-Fi, radar, and so on. The main idea of this kind of fall detection method is to use wireless techniques to identify changes in the environment and to build the relationship between the wireless signal and human activities [24]. It not only allows elderly people to perform daily activities naturally without wearing any equipment on the body [25], but also protects their privacy.
However, environmental sensor-based methods are relatively costly and are limited by their detection range.
With the development of computer vision and image processing technology, computer vision-based fall detection has become an important method, as such systems are less invasive to the elderly and offer higher precision and robustness. The algorithm includes background subtraction and feature classification. In traditional methods, after the moving object is separated from the background, handcrafted features and classical machine learning methods are used to detect falling events in isolated scenes [26][27][28][29][30][31].
In [26,27], fall detection is based on the width-to-height aspect ratio of the human body. Mirmahboub et al. [28] utilize a simple background subtraction method to create the silhouette of the human body, and several features are then extracted from the silhouette area. Finally, an SVM classifier is employed to perform the classification based on these silhouette-related features. Rougier et al. [29] applied a shape matching technique to track the silhouette of the human body in the object video clip. The shape deformation is then quantified from these silhouettes, and the classification is based on the shape deformation using a Gaussian mixture model. An adaptive background Gaussian mixture model (GMM) is used to obtain the moving object in [30], and an ellipse shape is built from the moving object for body modeling. Several features are then extracted from the ellipse model. Unlike [29], two Hidden Markov Models (HMMs) are applied to classify falls and activities of daily living. Leila et al. [31] employed two shape features and one 2D position feature to distinguish different postures, including standing, sitting, bending, squatting, lying on the side, and lying toward the camera. Posture classification is completed by an SVM algorithm. Rougier et al. proposed a method based on Motion History Image (MHI) and shape features [32]. This method used MHI to characterize the intensity of the action. Fall events were detected according to the features provided by the direction of the fitted ellipse.
Deep learning methods have been widely used in computer vision [33][34][35][36]. Unlike most conventional vision-based fall detection methods, which rely on hand-crafted features, methods based on deep learning can learn features automatically and have therefore attracted wide attention recently. In [17], a VGG-16 network combined with an attention guided LSTM was applied to capture spatial-temporal features for fall detection. In [37], an extremely deep residual network and an LSTM network were used for fall detection. Taramasco et al. [38] used a thermal sensor array for older people who live alone. They classified fall and non-fall events with three recurrent classifiers (Bi-directional LSTM, LSTM, and GRU). Each classifier achieved accurate performance, but the Bi-directional LSTM performed best among them.
The principal contribution of our work, with respect to the state of the art, is the application of Mask R-CNN to background subtraction for fall detection.
Moreover, in contrast to the method of [17], we take into account not only the forward information of the LSTM but also the backward information in the feature classification stage. Our method achieves state-of-the-art results in the experiments.

III. FALL DETECTION BASED ON ATTENTION GUIDED BI-DIRECTIONAL LSTM MODEL
We propose a new attention guided Bi-directional LSTM method for fall detection in indoor environments. As shown in Figure 1, it is divided into three parts: a Mask R-CNN layer, a Bi-directional LSTM layer, and an attention layer. We first use Mask R-CNN [18] to detect the person in each frame. After detection, features are extracted from each detected binary image of the human body contour. We use the output of the last convolutional layer of VGG-16 [39] and feed the features of each binary image into the attention guided Bi-directional LSTM model for the final fall detection.

A. Mask-RCNN FOR BACKGROUND SUBTRACTION
Background subtraction is a key step in the process of fall detection. Falling behavior can be correctly classified from the foreground contour only when a good human body contour has been extracted. In the task of fall detection, the most commonly used algorithms are inter-frame difference [13], GMM [14], and GMG [15], but these methods are sensitive to light, shadow, and ghosting of moving objects.
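To illustrate why such methods are sensitive to lighting, the following is a minimal NumPy sketch of inter-frame difference on toy grayscale frames. The function name `frame_difference_mask`, the threshold of 25, and the synthetic frames are assumptions made for illustration; they are not taken from [13].

```python
import numpy as np

def frame_difference_mask(prev_frame, curr_frame, threshold=25):
    """Return a binary foreground mask from two grayscale uint8 frames."""
    # Cast to a signed type so the subtraction cannot wrap around.
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return (diff > threshold).astype(np.uint8)

# Toy scene: a static background with one bright "moving" block.
prev_frame = np.full((6, 6), 100, dtype=np.uint8)
curr_frame = prev_frame.copy()
curr_frame[2:4, 2:4] = 200                 # the moving object
lit_frame = curr_frame + np.uint8(40)      # same scene after a global lighting change

mask_clean = frame_difference_mask(prev_frame, curr_frame)
mask_lit = frame_difference_mask(prev_frame, lit_frame)

print(mask_clean.sum())  # only the 4 object pixels are flagged
print(mask_lit.sum())    # every pixel is flagged once the lighting shifts
```

In this toy case a uniform brightness shift of 40 gray levels marks the whole frame as foreground, which is the kind of failure that motivates a learned person detector.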
To address these challenges, we introduce the deep learning-based Mask R-CNN to replace these conventional methods, as it is faster and more robust. Mask R-CNN won the COCO 2016 challenge, and its structure is shown in Figure 2. It outperforms many two-stage and one-stage algorithms, such as R-CNN [40], Fast R-CNN [41], Faster R-CNN [42], Retina Net [43], the Yolo series [44][45][46], and SSD [47].
Mask R-CNN extends Faster R-CNN by adding a parallel branch that predicts the object mask and by replacing ROI pooling with ROI Align, which not only improves the detection performance on small objects but also lets the model obtain more semantic information. Among the background subtraction algorithms considered here, Mask R-CNN obtains the most accurate human body contour.
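As a hedged sketch of how instance-segmentation output can be turned into the binary human contour image used in the later stages, the snippet below merges per-instance soft masks into one silhouette. The array layout (`masks` of shape N×H×W with `labels` and `scores`), the person label id of 1, and both thresholds are assumptions for illustration; the paper does not specify its post-processing.

```python
import numpy as np

PERSON_LABEL = 1  # assumed class id for "person"

def person_silhouette(masks, labels, scores, score_thr=0.5, mask_thr=0.5):
    """Merge per-instance soft masks (N, H, W) into one binary person silhouette."""
    keep = (labels == PERSON_LABEL) & (scores >= score_thr)
    if not keep.any():
        # No confident person detection: return an empty silhouette.
        return np.zeros(masks.shape[1:], dtype=np.uint8)
    # A pixel is foreground if any kept instance mask exceeds the threshold.
    return (masks[keep] > mask_thr).any(axis=0).astype(np.uint8)

# Toy detections: one person (label 1) and one non-person object (label 3).
masks = np.zeros((2, 4, 4))
masks[0, 1:3, 1:3] = 0.9
masks[1, 0, 0] = 0.9
labels = np.array([1, 3])
scores = np.array([0.95, 0.99])

silhouette = person_silhouette(masks, labels, scores)
```

Only the person instance contributes to `silhouette`; the non-person detection is discarded even though its score is higher.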

B. Bi-directional LSTM LAYER
The CNN layer in this paper extracts the features of the binary human body contour image produced by Mask R-CNN. These features pass through the fully connected layer to form a deep feature matrix of size T × 4096, which is transmitted to the Bi-directional LSTM layer (T is the number of video frames, which is 15 here).
Bi-directional LSTM consists of a forward and backward direction LSTM. In the LSTM [48], the memory controller is used to determine which information is forgotten and retained. It is implemented through three structures: input gate, forget gate and output gate. The unit structure is shown in Figure 3, and the input and output parameters of the proposed method are given in Table 1.
The operation process is expressed by Eqs. (1)-(6), where $w_f$, $w_i$, $w_g$, $w_o$ and $u_f$, $u_i$, $u_g$, $u_o$ are the weights that each control gate applies to the output of the previous feature vector and to the input of the current feature vector, and $b_f$, $b_i$, $b_g$, and $b_o$ are the bias terms of the control gates.
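For reference, the standard LSTM formulation written with these symbols is sketched below. It is a reconstruction under stated assumptions, not a copy of the paper's equations: it assumes that the $w$ weights act on the previous output $h_{t-1}$ and the $u$ weights on the current input $x_t$, and it introduces $c_t$ for the unit state, $\sigma$ for the logistic sigmoid, and $\odot$ for the element-wise product.

```latex
\begin{align}
f_t &= \sigma\left(w_f h_{t-1} + u_f x_t + b_f\right) \tag{1} \\
i_t &= \sigma\left(w_i h_{t-1} + u_i x_t + b_i\right) \tag{2} \\
g_t &= \tanh\left(w_g h_{t-1} + u_g x_t + b_g\right) \tag{3} \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \tag{4} \\
o_t &= \sigma\left(w_o h_{t-1} + u_o x_t + b_o\right) \tag{5} \\
h_t &= o_t \odot \tanh\left(c_t\right) \tag{6}
\end{align}
```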
Eq. (1) calculates the information discarded by the forget gate. Eqs. (2), (3), and (4) determine which new information is stored in the unit state through the input gate. Eqs. (5) and (6) determine which part of the unit state is output through the output gate. The traditional LSTM network can only learn in one direction and thus ignores the reverse information. In Bi-directional LSTM [49], however, the output at the current moment depends not only on the previous video frames but also on the subsequent ones.
The combination of the two LSTM units fully considers the temporal information before and after the video frame, and the model structure is shown in Figure 4. In the figure, $w_i$ ($i = 1, \dots, 6$) denotes the weight from one unit layer to another, $x_t$ is the 1×4096 feature vector obtained by extracting deep features from the video frame through the VGG layer, h denotes the LSTM units that take the input feature sequence $(\dots, x_{t-1}, x_t, x_{t+1}, \dots)$, $g_t'$ denotes the LSTM units that take the input feature sequence $(\dots, x_{t+1}, x_t, x_{t-1}, \dots)$, and $o_t$ is the corresponding output after the feature vector passes through the Bi-directional LSTM network.
In the Bi-directional LSTM network, $b_t^{(1)}$, $b_t^{(2)}$, $b_t^{(3)}$, and $b_t^{(4)}$ are the biases at time t, and $o_t'$ and $o_t''$ are the results of the two LSTM units processing the feature vectors output from the VGG layer at the corresponding time. As shown in Eq. (11), the output feature vector $o_t$ is the average of the two vectors at the corresponding time, i.e. $o_t = (o_t' + o_t'')/2$. This vector is fed into the attention mechanism to learn the network weights.
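The forward pass, backward pass, and averaging step can be sketched in NumPy as follows. This is a minimal illustration under assumptions, not the authors' implementation: the weights are random, the gates are stacked in the order [f, i, g, o], the hidden size and the reduced input dimension are arbitrary, and the helper names `init_params`, `lstm_forward`, and `bilstm` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(input_dim, hidden_dim, rng):
    """Random weights for one LSTM direction; gates stacked as [f, i, g, o]."""
    return {
        "w": rng.normal(0.0, 0.1, (4 * hidden_dim, hidden_dim)),  # acts on previous output
        "u": rng.normal(0.0, 0.1, (4 * hidden_dim, input_dim)),   # acts on current input
        "b": np.zeros(4 * hidden_dim),
    }

def lstm_forward(xs, p):
    """Run one LSTM over xs of shape (T, input_dim); return outputs (T, hidden_dim)."""
    hidden_dim = p["b"].size // 4
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    outputs = []
    for x in xs:
        z = p["w"] @ h + p["u"] @ x + p["b"]
        f, i, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # update the unit state
        h = sigmoid(o) * np.tanh(c)                    # gated output
        outputs.append(h)
    return np.stack(outputs)

def bilstm(xs, p_fwd, p_bwd):
    """Average the forward and time-reversed LSTM outputs at each time step."""
    out_fwd = lstm_forward(xs, p_fwd)
    out_bwd = lstm_forward(xs[::-1], p_bwd)[::-1]  # re-align to original time order
    return 0.5 * (out_fwd + out_bwd)

rng = np.random.default_rng(0)
T, D, H = 15, 32, 8   # T = 15 as in the paper; D is reduced from 4096 for the demo
xs = rng.normal(size=(T, D))
p_fwd, p_bwd = init_params(D, H, rng), init_params(D, H, rng)
o = bilstm(xs, p_fwd, p_bwd)   # shape (T, H)
```

Because the backward pass reads the sequence in reverse, the output at the first time step already depends on the last frame, which a forward-only LSTM cannot achieve.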

C. ATTENTION LAYER
The attention mechanism [50] is similar to the signal processing that the brain applies in human vision. It highlights important features by calculating weights for the feature vectors output from the Bi-directional LSTM network at different time steps, which enables the whole network model to perform better.
The attention model of this paper is shown in Figure 5. At each time step t, the feature vector of each region is weighted by the attention mechanism. Thus, the output of the attention layer at time t is the attention-weighted feature vector, where $\alpha_t$ is a softmax over the $x_t$ locations.
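One common form of soft attention, a softmax-weighted sum of feature vectors, can be sketched as follows. The scoring rule (a dot product with a vector `v`) and all names are assumptions for illustration; the paper's own attention equations are not reproduced in this text and may differ.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def soft_attention(features, v):
    """features: (N, H) feature vectors; v: (H,) hypothetical scoring vector."""
    alpha = softmax(features @ v)   # one non-negative weight per feature vector
    context = alpha @ features      # weighted sum of the vectors, shape (H,)
    return context, alpha

features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
v = np.array([2.0, 0.0])
context, alpha = soft_attention(features, v)
```

The weights in `alpha` sum to one, and vectors with higher scores (here the first and third) contribute more to `context`.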

IV. EXPERIMENT
A. UR FALL DATASET AND IMPLEMENTATION
The experiments are conducted on the URFall dataset detailed in [12], of which 70% is used for training and the remaining 30% for testing. The evaluation criteria are defined in terms of TP, FP, FN, and TN, which are the numbers of true positives, false positives, false negatives, and true negatives, respectively.
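The usual metrics built from these four counts can be computed as in the sketch below. Which of them the paper reports is an assumption (recall and precision appear in the later discussion), and the counts in the example are hypothetical values chosen for illustration, not results from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    def ratio(num, den):
        # Guard against empty denominators, e.g. no predicted positives.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }

# Hypothetical counts for a 60-clip test split.
m = classification_metrics(tp=18, fp=2, fn=2, tn=38)
```

For these hypothetical counts, precision and recall are both 18/20, and specificity is 38/40.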

B. EXPERIMENTAL RESULTS
Background subtraction is an important step in fall detection, which greatly affects the validity of feature extraction. To investigate the effectiveness of background subtraction algorithms, we selected three common methods as baselines for comparison in the fall detection task: inter-frame difference [13], GMM [14], and GMG [15]. In the method proposed by Panahi [31], the ATC features (composed of the aspect ratio, tilt angle, and centroid height of the contour of the human body) were extracted from the binary image of the human foreground, and the feature vectors were used to train an SVM to detect fall events. Here, we use inter-frame difference, GMM, and GMG in place of Panahi's background subtraction algorithm. As shown in Figure 6, traditional background subtraction is greatly affected by ground shadows and object movement. In contrast, Mask R-CNN is more robust to noise and thus achieves better performance on the URFall dataset [12], as shown in Table 2.
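A rough sketch of ATC-style features computed from a binary silhouette is given below. The definitions are assumptions for illustration: the aspect ratio is taken as bounding-box width over height, the tilt angle as the major-axis orientation from second-order central moments, and the centroid height as the distance above the bottom image row. The exact definitions in [31] may differ.

```python
import numpy as np

def atc_features(mask):
    """Aspect ratio, tilt angle (degrees), and centroid height of a binary silhouette."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("empty silhouette")
    width = xs.max() - xs.min() + 1
    height = ys.max() - ys.min() + 1
    aspect_ratio = width / height
    # Major-axis orientation relative to the horizontal, from central moments.
    x0, y0 = xs.mean(), ys.mean()
    mu20 = ((xs - x0) ** 2).mean()
    mu02 = ((ys - y0) ** 2).mean()
    mu11 = ((xs - x0) * (ys - y0)).mean()
    tilt = abs(0.5 * np.degrees(np.arctan2(2 * mu11, mu20 - mu02)))
    # Image rows grow downward, so measure the centroid from the bottom row.
    centroid_height = (mask.shape[0] - 1) - y0
    return aspect_ratio, tilt, centroid_height

standing = np.zeros((20, 20), dtype=np.uint8)
standing[4:20, 8:12] = 1     # tall, narrow blob
lying = np.zeros((20, 20), dtype=np.uint8)
lying[16:20, 2:18] = 1       # wide, flat blob near the floor

stand_feat = atc_features(standing)
lying_feat = atc_features(lying)
```

In this toy example the lying blob has a larger aspect ratio, a tilt near 0 degrees instead of 90, and a lower centroid, which is the kind of separation a classifier such as an SVM can exploit.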
According to the experiments in [17], their method achieves better performance in complex scenarios when detection and tracking modules are included. After the detection module is added, the objects in the frame can be extracted separately. After tracking, the trajectory of each person becomes an independent image sequence. When fall events are detected in each independent image sequence, the detection performance can be greatly improved because mutual interference is eliminated. Nonetheless, the network cannot be trained end to end once the tracking module is added. For this reason, we use Mask R-CNN to replace the detection and tracking modules.
As shown in Table 2, the attention guided Bi-directional LSTM improves the fall detection performance. This is because the current fall behavior depends not only on the previous video frames but also on the subsequent ones, and the Bi-directional LSTM considers the temporal information before and after the video frame at the same time step.

C. NOISE EXPERIMENTAL RESULTS
To investigate the effectiveness of our method in noisy backgrounds (see Figure 7), we add Gaussian noise to our self-built dataset, where the noise is drawn from the distribution $N(0, 0.3 \times 255)$. The dataset was collected with an ordinary camera and contains 60 fall video clips and 180 non-fall video clips in indoor scenes. As shown in Table 3, the baseline methods perform poorly in recall and precision. The reasons are as follows. The results of the method proposed by Feng et al. depend on the human body extracted from bounding boxes, but the detection method is not robust enough in noisy backgrounds. Specifically, under strong Gaussian noise, the person detection contains cracks and a large number of missed detections under insufficient light or during backward falls, as shown in Figure 8. This ultimately results in false positives of fall behavior. By contrast, Mask R-CNN offers more robust detection and richer contour information of the human body in noisy environments.
The experimental results show that, compared with the other algorithms, Mask R-CNN can extract the contour information of the human body precisely under stronger noise. This method therefore has good anti-noise performance and stability and is suitable for person detection. These advantages are important for improving the anti-noise performance of a fall event detection system.
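The noise injection can be sketched as follows. The text does not state whether 0.3 × 255 is the standard deviation or the variance; the snippet assumes it is the standard deviation, and the function name `add_gaussian_noise`, the clipping to [0, 255], and the synthetic gray frame are likewise assumptions for illustration.

```python
import numpy as np

def add_gaussian_noise(frame, sigma=0.3 * 255, rng=None):
    """Add zero-mean Gaussian noise to a uint8 frame and clip to the valid range."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, sigma, size=frame.shape)   # sigma assumed to be a std
    noisy = frame.astype(np.float64) + noise
    return np.clip(noisy, 0, 255).astype(np.uint8)

# Synthetic mid-gray RGB frame standing in for a video frame.
frame = np.full((64, 64, 3), 128, dtype=np.uint8)
noisy = add_gaussian_noise(frame, rng=np.random.default_rng(0))
```

Passing a seeded generator keeps the corruption reproducible across runs, which helps when the same noisy clips must be shown to several detectors.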

V. CONCLUSION
A complex-scene fall event detection method based on an attention guided Bi-directional LSTM is proposed in this paper. Mask R-CNN is adopted to extract the moving object from the noisy background. The VGG16 model is used to extract features of each detected object. Furthermore, the subsequent Bi-directional LSTM provides the attention model with both forward and backward behavior information and is better suited to classification, which improves the performance of fall event detection. The experimental results demonstrate that our method achieves accurate fall detection in video and outperforms state-of-the-art methods. In the future, we will attempt to apply our method to multi-person environments to further protect the lives of older people.