Using ResNet Transfer Deep Learning Methods in Person Identification According to Physical Actions

Today, biometric technologies are one of the areas of information security that are increasingly used wherever human security is required. Subjects such as person identification (PI), age prediction, and gender recognition are among the topics of human-computer interaction that have been widely researched in recent years, both in academia and elsewhere. PI is the process of identifying a person according to acquired biometric features. In this study, PI was carried out with ResNet transfer deep learning methods using the signals from accelerometer, magnetometer, and gyroscope sensors attached to 5 different regions of the body. The persons were identified based on different physical actions, and the actions most effective for PI were determined. Furthermore, the most informative body regions for PI were also identified. High success rates were generally observed with the ResNet architectures. This study has shown that the signals of wearable accelerometer, gyroscope, and magnetometer sensors can be used as a new biometric modality to prevent identity fraud attacks. In summary, the proposed method can be greatly beneficial for the effective use of wearable sensor signals in biometric applications.


I. INTRODUCTION
The problem of person identification has been one of the popular areas that researchers have approached with various methods over the last decades. There are several unique characteristics of the human body, and there are systems that can detect these characteristics and distinguish one person from another. Systems identifying a person according to their physical or behavioral characteristics are called biometric systems. Biometric person identification is the recognition of an individual based on his/her physiological and/or behavioral characteristics [1]. Biometrics examines personally identifiable physical or behavioral characteristics. Biometric systems operate in two modes: (1) recognition (also called ''identification'') and (2) authentication (also known as ''identity verification''). In the first, an individual's identity is established by searching the records of all persons in the database for a match (a one-to-many comparison strategy).
The associate editor coordinating the review of this manuscript and approving it for publication was Shagufta Henna.
In the latter, the biometric information of a person is compared with his/her template stored in the system database to authenticate that person.
The physiological properties are those that generally remain stable and do not change easily over time. However, properties based on behavioral characteristics may vary over time and with the environment. In the past decades, biometric identification systems have been developed based on physiological properties such as the face, fingerprint, hand geometry, iris, and retina, or behavioral characteristics such as gait, signature, and speech. However, these methods have a major disadvantage in that they can be imitated [2]-[4]. Imitation of voice, the use of lenses copied from the iris, and disguise are examples of such frauds.
Therefore, in recent years, other identification systems have been developed that rely on signals measured from different regions of the body [5], [6]. According to the literature, different medical signals have also been used as biometric data. Signals such as EEG [6]-[9], ECG [10]-[15], and accelerometer readings [16], [17] have been used in the development of biometric systems. Studies have been conducted to demonstrate that medical signals are unique to individuals [8], [10], [18].
In the study of Alyasseri et al. [19], the person identification was performed using multichannel EEG signals. In addition, active EEG channels were detected in the study. In the study of Sun et al., the person identification process was realized by using EEG signals [9]. They reported a 99.56% success rate by applying the conventional 1D-LSTM deep learning method to 16-channel EEG signals. In their study of person identification via EEG signal, Rodrigues et al. reported an 87% success rate [6].
Altan et al. built a biometric system using ECG signals [20]. People's general physical condition, stress level, activity level, and further characteristics can considerably alter the waves in the ECG. Both the waveforms and the temporal features have consistently shown different characteristics across different people [20]. In another study, the authors performed person identification by using the cardiac dynamics in ECG signals as features with an RBF network [11]. They revealed that ECG signals can be used as biometric information by demonstrating that the temporal relationships and forms of ECG signals are different for each individual. In the study conducted by Lie et al., person identification was performed using five ECG datasets obtained from the PhysioNet database through a convolutional neural network (CNN) (Lie et al., 2020). The average success rate was reported as 94.3%. Goshvarpour and Goshvarpour developed a biometric system by using the MP (matching pursuit) coefficients of ECG signals with different machine learning methods such as PNN, k-NN, and LDA. They reported the success rate of the system as 99.68% [12]. In another study, person identification was performed by evaluating the pressure signals of people who were instructed to walk on the ground [21]. In that study, a success rate of 92% was observed.
Accelerometer signals have also been used for person identification. In the study of San-Segundo et al. [16], a biometric system was built based on signals drawn from smartphone accelerometer sensors. They performed person identification by applying Gaussian mixture models to the signals obtained while people walked. In another of their studies, person identification was carried out on the same dataset with a proposed i-vector analysis and PLDA-based approach [17].
In this study, a biometric system has been developed with transfer deep learning methods, using the accelerometer, gyroscope, and magnetometer signals acquired via Xsens MTx sensors. Transfer deep learning methods have become one of the basic building blocks of machine learning with the increase in the processing power of computers and particularly the development of GPU-assisted technologies. On big datasets, deep learning methods have achieved high success rates.
Deep learning has achieved great success in several machine vision and person identification problems such as handwriting recognition, object classification, object detection, scene understanding, and face recognition. However, millions of parameters must be configured in deep learning methods, and costly hardware such as graphics processing units (GPUs), tensor processing units (TPUs), etc. is required. One of the most significant problems of deep networks is weight assignment. Deep learning methods require big datasets to assign the correct weights, and the run-time of this process is long. Pretrained networks are utilized to overcome this problem; these networks are generally trained on the ImageNet dataset. In pretrained networks, the final calculated weights are reused. Therefore, high classification rates are achieved in a short run-time by using pretrained networks.
The signals from the sensors were first transformed into images, and then person identification was performed with transfer deep learning methods. This study puts forward that the ResNet convolutional neural network model, previously used for object recognition, can also be applied to the person identification problem. The ResNet model was taken with its existing layer weights and then subjected to transfer learning with a different dataset. It is shown that the deep learning approach can be employed for problems with small datasets as well. Accordingly, the ResNet architecture developed for object recognition has been adapted to the problem of person identification. The main contributions of this study are: a biometric system created from signals measured with wearable sensors (accelerometer, gyroscope, and magnetometer); a novel approach for this biometric system; and the use of transfer deep learning methods for person identification.
The rest of this study is organized as follows: In Section 2, the dataset used in the study is described. In Section 3, the method of converting the signals into images and the deep learning (ResNet) methods are introduced. The experiments and results are discussed in Section 4. The key findings are discussed in Section 5.

II. DATA SET
In this study, the Daily and Sports Activities Data Set obtained from the UCI database was used [22]-[24]. The data for 19 predetermined actions (activities) were collected via Xsens MTx sensors attached to designated regions of the subjects. The sensors were mounted on 5 different regions: at chest level, on the right wrist, on the left wrist, and on the right and left legs (above the knee) (Figure 1). There are 9 sensors in each Xsens MTx unit (x, y, z accelerometers; x, y, z gyroscopes; and x, y, z magnetometers).
The data were acquired from 4 women and 4 men performing the 19 predetermined activities. Each subject performed the specified activities for 5 minutes. The activities performed by the subjects are listed in Table 1.

III. THEORETICAL INFORMATION AND METHODOLOGY

A. PERSON IDENTIFICATION BY ResNet DEEP TRANSFER LEARNING TECHNIQUE
The proposed approach for person identification with ResNet deep learning methods is shown in the figure below. The processes performed at each stage are given briefly.
Block 1: At this stage, wearable sensors were attached to the subjects. Xsens MTx sensor units were mounted in 5 different regions of the subjects.

Block 2: The signals from a total of 45 channels were recorded through the accelerometer, gyroscope, and magnetometer sensors in the 5 different regions of the subjects.
Block 3: The signals for each activity in Table 1 were recorded for 5 minutes. Afterward, these signals were divided into 5-second segments. Since the sampling frequency is 25 Hz, segments of 5 × 25 = 125 samples were obtained. The signal values were then scaled into the range 0-255. Because the number of channels is 45, images of size 125 × 45 were obtained.
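The segmentation in Block 3 can be sketched as follows in numpy. Since the exact scaling equation is not reproduced in the text, plain per-window min-max scaling to 0-255 is assumed here; the function and constant names are illustrative.

```python
# Sketch of Block 3: segment 45-channel recordings into 5 s windows
# and min-max scale each window to a 125 x 45 grayscale image.
import numpy as np

FS = 25          # sampling frequency (Hz)
WINDOW_S = 5     # window length (s)
N_CHANNELS = 45  # 9 sensor channels x 5 body regions

def signals_to_images(signal):
    """signal: (n_samples, 45) array -> list of (125, 45) uint8 images."""
    win = FS * WINDOW_S  # 125 samples per window
    images = []
    for start in range(0, signal.shape[0] - win + 1, win):
        seg = signal[start:start + win]
        lo, hi = seg.min(), seg.max()
        # Assumed normalization: min-max scale the window into 0-255.
        scaled = (seg - lo) / (hi - lo + 1e-12) * 255.0
        images.append(scaled.astype(np.uint8))
    return images

# A 5-minute recording (7500 samples) yields 60 images per activity.
demo = np.random.randn(5 * 60 * FS, N_CHANNELS)
imgs = signals_to_images(demo)
```

Applied to all 19 activities of all 8 subjects, this windowing yields the 9120 images used later in the experiments.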
For instance, the images created by the signals of the standing activity for each subject are depicted in Figure 3.

B. RESNET DEEP TRANSFER LEARNING TECHNIQUES
The most significant characteristic that distinguishes deep learning architectures from conventional artificial neural network models is their ability to learn feature maps of the data within the layers of the deep network itself. Thanks to these architectures, it is possible to analyze the feature maps obtained by applying filters of different sizes and numbers to the data. As a result, when an image is presented for testing, its class can be predicted [25]. After the architecture that came to the fore with the success of ''AlexNet'' in the 2012 ImageNet challenge [26], convolutional neural networks have started to display great performance in areas such as object detection, image segmentation, image classification, face recognition, and human pose estimation and tracking. In addition to these developments, as network architectures have evolved, the essential best practices to follow when forming new architectures have been put forward [27]-[30]. In recent years, deep convolutional neural networks have made a series of breakthroughs in image classification [26], [31], [32]. They were inspired by Hubel and Wiesel's discovery of simple cells and receptive fields in neuroscience [33]. Deep convolutional neural networks (CNNs) have a layered structure, and each layer consists of convolutional filters. The feature maps for the next layer are generated by convolving these filters with the input, and the filters can be learned efficiently through parameter sharing. While the first layers in convolutional neural networks represent low-level local features such as edges and color contrasts, the deeper layers capture more complex shapes and are more specific [31]. The classification performance of CNNs can be enhanced by deepening the model and enriching the diversity and specificity of these convolutional filters [28].
Even though deep networks can often outperform shallower ones in classification, they are difficult to train, mainly for two reasons. The first is vanishing/exploding gradients: sometimes a neuron dies in the training process and may never come back due to the activation function [34]. This problem can be mitigated by initialization techniques that start the optimization process with an active set of neurons. The second is harder optimization: as the model gains more parameters, the network becomes more difficult to train. This is not simply an overfitting problem, as adding more layers sometimes leads to even higher training error [35].
Therefore, deep CNNs are more difficult to train, although they have better classification performance. An effective way to solve these problems is residual networks (ResNets) [27]. The main difference of ResNets is that they have shortcut links parallel to their normal convolutional layers. Unlike the convolution layers, these shortcut links are always active, and gradients can easily propagate back through them, resulting in faster training.
There is a simple difference between ResNets and normal ConvNets. The purpose is to provide an open path for gradients to propagate into the early layers of the network. This makes the learning process faster by avoiding the vanishing gradient problem and dead neurons. In 2015, ResNet came first in the ImageNet competition in the categories of image classification and object recognition by delivering a solution to the vanishing gradient problem. Whereas the human error rate is considered to be between 5-10%, ResNet decreased the error rate to 3.57% in the competition [8].
Since the ResNet architecture is the winning model of the ILSVRC 2015 and COCO 2015 competitions, its use on different datasets is quite easy. Five convolutional blocks were used in the construction of the ResNet-50 architecture, which comprises 50 layers. These blocks consist of 1 × 1, 3 × 3, and 1 × 1 convolution layers. The input feature maps are reduced to a lower dimension with the 1 × 1 convolutions, and the filtering is carried out at the higher dimension with the 3 × 3 convolutions. A global average pooling layer is utilized to reduce the spatial size in the architecture. In the fully-connected layer of the architecture, the Softmax activation function is used, and an output of 1000 categories is produced for the classification of images. The ResNet-50 architecture is composed of 25.6 million parameters. The ResNet-50 convolutional neural network is generally made up of groups of convolution layers, activation layers, pooling layers, and a fully-connected layer.
ResNet can be defined as multiple basic blocks connected to each other in series, where each basic block has a parallel shortcut link added to its output. A basic block of ResNet is shown in Figure 1 [27]. If the input and output sizes of a basic block are equal, the shortcut link is simply an identity mapping. Otherwise, average pooling (for reduction) or zero padding (for enlargement) can be adopted to adjust the size. Different placements of the shortcut link in a ResNet basic block have been compared (Figure 4), and it has been shown that adding a parameterized layer after the addition can diminish the advantages of ResNet, since there is no longer a fast path for gradients to propagate. However, adding a non-parameterized layer such as ReLU after the addition module does not provide a major advantage or disadvantage.
In classical convolutional neural networks, it has been observed that as the depth increases, the error starts to grow after a certain number of layers. In this model, a new approach called the ''residual block'' has been introduced to prevent this. The major feature that distinguishes it from other architectures is that shortcut connections feed the residual values forward to the next layers. In this approach, an input x produces an output F(x) after the operations convolution (weight layer) -> ReLU -> convolution (weight layer). This output is not given directly to the next block; instead, the sum x + F(x) is passed through the ReLU operation, and the result, denoted H(x), is transferred to the next layer [36]. That is, before the F(x) output is given to the next block, it is combined with the x input and put into the ReLU function. In layer notation, the activation a^([l]) from a previous layer is added to the computation of a^([l+2]) [37]. Normally, increasing the number of layers in a model should mean higher performance; in practice, however, the situation is different. In a plain network, if w^([l+2]) = 0, then a^([l+2]) depends only on b^([l+2]), and the derivative collapses toward zero, which is an unwanted situation [27]. With the residual connection, even if the weighted contribution of the two previous layers is zero, the output reduces to a^([l+2]) = g(a^([l])) = a^([l]), so the block defaults to the identity mapping, the learning error is not worsened, and the network trains faster. The internal structure of the residual block is demonstrated in Figure 5.
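The data flow described above can be illustrated numerically: H(x) = ReLU(F(x) + x), where F is two weight layers with a ReLU in between. This numpy sketch uses plain matrix multiplications as stand-ins for the convolutional weight layers and demonstrates the identity behavior when the weights are zero; all names are illustrative.

```python
# Minimal numeric illustration of a residual block:
# H(x) = ReLU(F(x) + x), with F = weight -> ReLU -> weight.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, w1, w2):
    f = relu(x @ w1) @ w2      # F(x): weight -> ReLU -> weight
    return relu(f + x)         # H(x) = ReLU(F(x) + x)

d = 4
x = np.abs(np.random.randn(d))  # non-negative input, as after a ReLU
zeros = np.zeros((d, d))
# With zero weights the block reduces to the identity: H(x) = ReLU(x) = x,
# which is the property that keeps deeper networks trainable.
assert np.allclose(residual_block(x, zeros, zeros), x)
```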
As can be understood from Figure 5, the input ''x'' passes through the weight (W) layers to produce a result ''F(x)'', and after this process is completed, the ''F(x)'' term is added to the ''x'' term [27].
In the original ResNet article, the authors suggested model configurations with 18, 34, 50, 101, and 152 layers (Figure 6). The ResNet-50 architecture is formed by replacing each 2-layer block in the 34-layer network with a 3-layer bottleneck block; this 50-layer network involves 3.8 billion FLOPs. The ResNet-101 and ResNet-152 architectures use more of these 3-layer blocks; ResNet-152 comprises 11.3 billion FLOPs, yet still has lower complexity than the VGG16/19 networks [27]. The ResNet layer architectures are given in Table 2. ResNet architectures have two versions, V1 and V2, depending on whether the activation functions are applied before or after the weight layers of the blocks. As shown in Figure 7, in ResNet V1 a second nonlinear function is applied after the addition of x and F(x); ResNet V2 eliminates this nonlinearity after the addition. Another significant difference is that ResNet V2 applies batch normalization and ReLU activation before the weight calculations.
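The ordering difference between the two versions can be contrasted with numpy stand-ins for the weight layers. Batch normalization is omitted for brevity, so this only illustrates the placement of the nonlinearities, not a full ResNet V2 block; the names are illustrative.

```python
# V1 (post-activation) vs. V2 (pre-activation) block ordering.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def block_v1(x, w1, w2):
    # V1: weight -> ReLU -> weight, then add, then a final ReLU.
    f = relu(x @ w1) @ w2
    return relu(x + f)

def block_v2(x, w1, w2):
    # V2: ReLU (pre-activation) -> weight -> ReLU -> weight, then add
    # with NO nonlinearity afterwards, leaving the shortcut path clean.
    f = relu(relu(x) @ w1) @ w2
    return x + f
```

With zero weights, `block_v2` passes x through unchanged, while `block_v1` still applies the final ReLU to x; this is precisely the extra nonlinearity that V2 removes from the shortcut path.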

IV. RESULTS
The dataset in this study consists of the signals obtained for 19 different activities from 8 individuals, 4 males and 4 females. Each activity was divided into 60 segments. Thus, the dataset consists of 19 × 8 × 60 = 9120 signal matrices. After these signal matrices were converted into images, the ResNet deep transfer learning techniques were applied. As a result, 9120 images were used to test the success of our system. 5 different ResNet architectures were used. The success rate is calculated as Success rate (%) = 100 × (# true classified) / (# true classified + # false classified).
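The success-rate formula above is plain classification accuracy; a minimal helper for computing it from true and predicted subject labels (the function name is illustrative):

```python
# Success rate (%) = 100 * correct / (correct + incorrect).
def success_rate(y_true, y_pred):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)

# Example: 3 of 4 test images assigned to the right subject -> 75.0%.
print(success_rate([1, 1, 2, 2], [1, 2, 2, 2]))
```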
The success rates of the ResNet version 1 (ResNetV1) and ResNet version 2 (ResNetV2) architectures for person identification are given in Table 3. As seen in Table 3, person identification was performed by using all signals belonging to the 19 different activities with the ResNet V1 and V2 methods. The optimal result, 98.02%, was observed with ResNetV2-101 using the images of all activities. The optimal result among the ResNetV1 networks was 96.92%, obtained with ResNetV1-101 and ResNetV1-50. In person identification, the ResNetV2 models were thus concluded to be more successful than the ResNetV1 models. Person identification was also conducted for each activity separately, to specify the impact of each activity on person identification. There are 480 images for each activity. The performance values of person identification for each activity are given in Table 4, and the optimal results for each activity are shown in Figure 9. When examining Table 4, it is seen that high person identification results are obtained with the ResNet transfer deep learning methods across the different architectures. The success rate of person identification for a major part of the activities was observed as 100%. Comparing the different architectures, the most successful technique is ResNet-152V2. A 100% success rate was observed with this method for the activities of A1 (sitting), A2 (standing), A3 (lying on the back), A4 (lying on the right side), A9 (walking in the parking lot), A11 (treadmill walking at an angle of 15 degrees to the ground at a speed of 4 km/h), A12 (running fast at a speed of 8 km/h), A13 (step exercise), and A16 (cycling in a vertical position).
Based on the results, the activities of A8 (standing while the elevator is moving), A17 (rowing), A18 (leaping), and A19 (playing basketball) were found to be less successful for person identification than the other activities. The sensors were attached to the body (chest level), the right and left arms, and the right and left legs of each person. The signals from each region were transformed into images to indicate the effect of the regions on person identification, and then the ResNet deep transfer techniques were applied. When images by region were used, the image dimensions were a smaller 125 × 9. The resulting success rates are given in Table 5, and the highest success rates obtained with the different ResNet architectures are shown in Figure 10. As seen in Table 5, the best person identification performance was achieved with the images created from the sensors on the right leg. The signals from the right leg provided the most distinctive information for person identification; a 95% success rate was observed with the images from this region. In general, all regions displayed similar success rates in person identification.

V. CONCLUSION
In recent years, various biometric systems have been developed. Biometric systems based on the face, voice, fingerprint, palm print, ear structure, and gait have found wide use in security systems. However, most of these systems have a major disadvantage in that they can be imitated. New biometric systems based on medical signals have been set up to overcome these problems. In this study, a biometric system has been created for person identification from wearable sensor signals. The main purpose of this study is to show that accelerometer, gyroscope, and magnetometer signals obtained from wearable sensors are effective in person identification. After the signals were converted into images, the person was predicted with ResNet transfer deep learning architectures. When all 19 different activity images were used together, the highest success was observed as 94.21%. Moreover, person identification was performed per action in the developed system.
A 100% success rate was observed with this method for the activities of sitting, standing, lying on the back, lying on the right side, walking in the parking lot, treadmill walking at an angle of 15 degrees to the ground at a speed of 4 km/h, running fast at a speed of 8 km/h, step exercise, and cycling in a vertical position. The activities of standing while the elevator is moving, rowing, leaping, and playing basketball were found to be less successful for person identification than the other activities.
In the study, people were also identified according to the regions from which the signals were obtained. The signals from the right leg provided the most distinctive information for person identification; a 95% success rate was observed with the images from this region.
In conclusion, biometric systems based on a person's physical activities may be more secure than other systems such as fingerprint, face, palm, and iris recognition. The reason is that copying is easier in systems such as fingerprint, face, and palm recognition, whereas gait and similar physical activities are more difficult to imitate.