Implementation of a Fall Detection System Based on 3D Skeleton with a Deep Learning Technique

In recent years, fall detection has become an important topic in homecare systems. Compared with traditional fall detection algorithms, neural-network-based methods are more robust and achieve higher accuracy. However, a neural network consumes a large amount of energy due to its huge number of computations and needs more memory to store parameters than traditional algorithms do. In this paper, we propose a fall detection system that combines a traditional algorithm with a neural network. First, we propose a skeleton information extraction algorithm, which transforms depth information into skeleton information and extracts the important joints related to fall activity. We also modify the skeleton-based method with seven highlighted feature points. Second, we propose a highly robust deep convolutional neural network architecture, which uses a pruning method to reduce the parameters and calculations in the network. The low number of parameters and calculations makes the system suitable for implementation on an embedded system. The experimental results show high accuracy and robustness on the popular benchmark dataset NTU RGB+D. The proposed system has been implemented on the NVIDIA Jetson TX2 platform with real-time processing.


I. INTRODUCTION
Generally, two approaches are applied to detect a person falling down. One is based on a device worn on the body. The main drawback of this approach is the inconvenience to the user, since the device must be worn all the time. The other approach is non-wearable: typically, a video sensor is used to monitor and detect cases of people falling.
In this paper, we propose a video-based fall detection system. Our goal is to implement it on the embedded NVIDIA Jetson TX2 system for a real-time demonstration. We use a depth camera to capture the actions of people; Microsoft's Kinect V2 module is used to acquire the depth information. Since the current Kinect V2 SDK does not support ARM64 processors, we use the OpenNI2 framework to read 640x480 depth images from the Kinect V2 module. Then a 3D skeleton extraction algorithm is applied and combined with a deep neural network to determine whether a person has fallen.

II. THE PROPOSED SYSTEM
The skeleton extraction system is divided into four steps, shown in a blue box in Fig. 1. The input of our algorithm is the depth information captured by Microsoft's Kinect V2 module, from which the foreground is obtained.
We then scan through the depth information of each pixel to find the closest point, which is most likely the position of the person, and mark the 600 mm region along the camera direction as foreground. We resize this foreground result into a 320x240 binary image to reduce the error after the skeleton points are obtained, as shown in Fig. 2.
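The closest-point foreground step can be sketched as follows. This is a minimal illustration under our own assumptions: the function name `extract_foreground`, the toy depth map, and the treatment of depth value 0 as "no return" are ours, not the paper's; the 320x240 resize that follows in the pipeline is omitted to keep the sketch self-contained.

```python
import numpy as np

def extract_foreground(depth, band_mm=600):
    """Segment the person from a depth frame (sketch of the paper's step).

    The closest valid pixel is taken as the person's position, and every
    pixel within band_mm of that depth along the camera axis is marked
    as foreground. Depth 0 is treated as an invalid (missing) reading.
    """
    valid = depth > 0
    nearest = depth[valid].min()                 # closest point ~ the person
    mask = valid & (depth <= nearest + band_mm)  # 600 mm slab behind it
    return mask.astype(np.uint8) * 255           # binary image

# Toy 4x4 "depth map" in millimetres (0 = invalid pixel)
depth = np.array([[0, 1500, 1520, 3000],
                  [0, 1490, 1510, 3100],
                  [0, 2500, 1505, 3050],
                  [0, 2400, 1495, 2900]], dtype=np.int32)
fg = extract_foreground(depth)
```

Here the person cluster around 1490-1520 mm survives, while pixels deeper than 1490 + 600 mm (the background wall) and invalid zeros are discarded.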
After acquiring the binary image, we use our previous labeling work [1] to remove the remaining noise and to calculate the coordinates and size of the foreground object, which determine whether the object is human or not. This avoids mistaking a non-human object that is closer to the camera for a person. Next, we use the thinning algorithm [2] to extract the skeleton. In the searching part, we calculate only 7 of the joint points defined by the Kinect V2 skeleton; the result is listed in TABLE I. The reason is that when a fall occurs, the joints above the waist undergo an enormous change, so the relative positions of these 7 joint points are sufficient for the deep neural network to classify falling. Finally, in order to port the model to the embedded system, we first refer to AlexNet, which has 5 convolution layers and 3 fully connected layers, to construct the model. Because the skeleton information is analog data, the horizontal axis and vertical axis are not strongly correlated.
Thus we propose a simple network, MyNet1D-D, which uses one-dimensional convolution to replace the convolutions of AlexNet and applies a pruning method to reduce parameters, for example by reducing the kernel size, filter number, and input size. After optimization, the number of model parameters is reduced from 60M to 35K, and the amount of calculation is decreased to only 48K. The modified network, with its low parameter count and low calculation cost, is more suitable for embedded systems.
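The core idea of replacing 2-D convolutions with 1-D convolutions over the frame axis can be sketched in plain numpy. This is not MyNet1D-D itself: the channel layout (7 joints with two coordinates each, flattened to 14 input channels over a 30-frame window), the 8 filters, and the kernel size 3 are illustrative assumptions, not the paper's pruned configuration.

```python
import numpy as np

def conv1d(x, kernels):
    """Valid 1-D convolution of a (C_in, L) signal with (C_out, C_in, K) kernels."""
    c_out, c_in, k = kernels.shape
    length = x.shape[1] - k + 1
    out = np.zeros((c_out, length))
    for o in range(c_out):
        for t in range(length):
            # each output sample sums a K-wide window across all input channels
            out[o, t] = np.sum(kernels[o] * x[:, t:t + k])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((14, 30))   # 7 joints x 2 coordinates, 30 frames
w = rng.standard_normal((8, 14, 3)) # 8 small filters of size 3 (illustrative)
y = conv1d(x, w)                    # y.shape == (8, 28)
```

A 1-D kernel of size K over C input channels costs C*K parameters per filter, versus C*K*K for a square 2-D kernel, which is one way such a pruning step shrinks the model.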

III. EXPERIMENT RESULTS
For training and testing purposes, we used the NTU RGB+D dataset. We trained on an Intel Core i7 3.60 GHz processor with 16 GB of RAM and an NVIDIA GTX1080Ti on a Linux system.

A. Evaluation and Discussion
We first test the neural network's accuracy in fall detection. From the benchmark NTU RGB+D dataset [3], all falling videos of skeleton information are treated as positive samples, and skeleton information from other actions is randomly picked as negative samples. In total, 946 videos belong to the positive samples and 1419 videos are negative samples.
In this paper, we use 70 percent of the samples as training data and 30 percent as testing data. We use the training data to train MyNet1D. The proposed network is trained using a gradient descent optimizer.
Weights are initialized randomly. The learning rate is 0.01; to attenuate it, an exponential decay of 0.96 every 2000 iterations is applied. The iteration number is set to 50000 and the dropout rate is set to 0.5. The experimental result is shown in TABLE II. After optimization, MyNet1D-D achieves almost the same accuracy, while the parameters and calculations are greatly reduced.
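The learning-rate schedule above can be sketched as a staircase exponential decay. Whether the authors used a staircase or a continuous decay is not stated, so the staircase form and the function name `decayed_lr` are our assumptions.

```python
def decayed_lr(step, base_lr=0.01, decay_rate=0.96, decay_steps=2000):
    """Staircase exponential decay: multiply the rate by 0.96 every 2000 iterations."""
    return base_lr * decay_rate ** (step // decay_steps)
```

Over the 50000 training iterations, the rate decays 25 times, falling from 0.01 to roughly 0.0036.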

B. Embedded System Design
We use the skeleton extraction algorithm to obtain skeleton information, as in the snapshot shown in Fig. 4. Our system uses 7 important nodes in each frame and concatenates 30 frames into a sequence for the network input. When the network detects a fall, a red box appears in our display interface to indicate the dangerous occurrence, as shown in Fig. 4(c). The system is implemented on the NVIDIA Jetson TX2 platform. It achieves 10 frames per second, which satisfies the real-time consideration.
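The per-frame joints can be accumulated into the 30-frame sequence with a simple sliding buffer. This is a sketch under our own assumptions: the class name `SkeletonBuffer` and the (x, y)-per-joint layout are ours; the paper does not specify the exact tensor layout fed to the network.

```python
from collections import deque
import numpy as np

class SkeletonBuffer:
    """Collects per-frame joint coordinates and emits a 30-frame network input."""
    def __init__(self, n_frames=30, n_joints=7):
        self.n_joints = n_joints
        self.frames = deque(maxlen=n_frames)  # old frames drop off automatically

    def push(self, joints):
        # one (x, y) pair per joint -- an assumed layout
        assert joints.shape == (self.n_joints, 2)
        self.frames.append(joints)

    def ready(self):
        return len(self.frames) == self.frames.maxlen

    def sequence(self):
        return np.stack(self.frames)  # shape (30, 7, 2)

buf = SkeletonBuffer()
for _ in range(30):
    buf.push(np.zeros((7, 2)))
```

Because the deque has a fixed maximum length, the detector can be re-evaluated on every new frame with the most recent 30 frames, rather than waiting for disjoint 30-frame chunks.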

IV. CONCLUSIONS
In this paper, we propose a fall detection system. Our system is based on a vision sensor, without any wearable device, to detect falls. A depth image is used to extract the skeleton information, and different techniques are applied to refine the skeleton extraction results. Moreover, the parameters and the number of calculations of the neural network model have been decreased. Finally, the whole system has been implemented on the NVIDIA Jetson TX2 platform and tested in a real-time environment. In future work, the system will be implemented on smaller embedded systems, such as the Raspberry Pi, so that it can be used efficiently in health-care applications.
TABLE I. Search rules for the joint points

SHOULDER_LEFT: Search for the first point whose pixel value is 255, scanning from left to right and from top to bottom, subject to two conditions: 1. x is located between the x of the head and the point 30 px to the left of the head. 2. y is within 30 px below the y of the head.
SHOULDER_RIGHT: Search for the first point subject to two conditions: 1. x is located between the x of the head and the point 30 px to the right of the head. 2. y is within 30 px below the y of the head.
ELBOW_LEFT: This point is the midpoint of SHOULDER_LEFT and HAND_LEFT.
ELBOW_RIGHT: This point is the midpoint of SHOULDER_RIGHT and HAND_RIGHT.
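The shoulder-search and elbow-midpoint rules above can be sketched directly. The function names `find_shoulder` and `midpoint`, the scan order, and the toy image are our illustrative assumptions; the hand position used in the elbow example is hypothetical.

```python
import numpy as np

def find_shoulder(img, head, side, dx=30, dy=30):
    """Scan left-to-right, top-to-bottom for the first 255 pixel inside the
    30 px search window beside and below the head, per the TABLE I rules."""
    hx, hy = head
    x_lo, x_hi = (hx - dx, hx) if side == 'left' else (hx, hx + dx)
    for y in range(hy, hy + dy):              # within 30 px below the head
        for x in range(x_lo, x_hi + 1):
            if 0 <= y < img.shape[0] and 0 <= x < img.shape[1] and img[y, x] == 255:
                return (x, y)
    return None

def midpoint(p, q):
    """ELBOW_* rule: midpoint of the shoulder point and the hand point."""
    return ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)

# Toy binary image with one skeleton pixel to the lower-left of the head
img = np.zeros((100, 100), dtype=np.uint8)
img[50, 35] = 255
shoulder = find_shoulder(img, head=(40, 45), side='left')
elbow = midpoint(shoulder, (20, 80))  # (20, 80) is a hypothetical HAND_LEFT
```

Deriving the elbows as midpoints avoids a second image scan, at the cost of assuming a roughly straight arm.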

TABLE II. Performance comparison of accuracy, recall rate, and precision on the NTU RGB+D dataset