Deep Learning Based Person Authentication Using Hand Radiographs: A Forensic Approach

Biometric radiographs have gained importance in recent times owing to the rise in crime and disaster incidents. Authentication and identification of a person have become an essential part of most computer vision automation systems. Security, robustness, privacy, and non-forgery are the critical aspects of any person authentication system. Conventional fingerprint, iris, face, and palm print systems fail to recognize a person when the external biometric parts have been damaged by rashes, wounds, or severe burns. In such situations, identification based on radiographs of the skull, hand, and teeth is an effective replacement. In this paper, a novel forensic hand radiograph based human authentication method is proposed using a deep neural network. A three-layered convolutional deep neural network architecture is used for feature extraction from hand radiographs, and KNN and SVM classifiers are used for recognition. For the experiments, a total of 750 hand radiographs acquired from 150 subjects of different age groups, professions, and genders are considered. The performance of the algorithm is evaluated based on cross-validation accuracy by varying the striding pixels, pooling window size, kernel size, and number of filters. Our experiments reveal that hand radiographs contain biometric information that can be used to identify humans in disaster victim identification. The experimental study also indicates that the proposed approach is significantly more effective than conventional methods for person authentication using hand radiographs.


I. INTRODUCTION
Authentication is the process of automatically recognizing the correct person using computational algorithms based on features stored in computer systems. Presently, biometric identification systems are based on static features of the user, such as face [1], iris [2], palm print [3], voice [4], and fingerprint impression [5], which mostly remain unchanged over time. In contrast, dynamic biometric features of the user, such as those used in electrocardiogram-based systems [6] and in keystroke and touch dynamics [7], may change over a period of time. Whichever technique is used to make these identifiers work for identification, the procedure is the same. A record of an individual's characteristic trait is kept in a database, and when identification is required, a recent or on-hand record is compared with the record in the database. The performance of a biometric identification system is measured based on accuracy, efficiency, security, and privacy. Biometric systems can be unimodal or multimodal. A unimodal biometric system is less reliable, less secure, and has limited usability, whereas a multimodal biometric system combines multiple sensors, multiple algorithms, and numerous instances, making it more accurate, reliable, secure, and robust [8]. These systems are subject to impersonation and spoofing attacks because the biometric traits can be easily replicated, which degrades the quality and reliability of the person recognition system [9]. Often, catastrophes such as tsunamis, earthquakes, and fatal accidents damage the biometric parts and make it challenging to identify the person. In addressing this problem, forensic radiography plays a vital role.
The associate editor coordinating the review of this manuscript and approving it for publication was Zhan-Li Sun.
Forensic radiography is a part of forensic medicine concerned with identifying people using post-mortem radiological images of different parts of the body, including the skeleton, skull, and teeth [10]. Radiographs acquired before and after death are termed antemortem (AM) and postmortem (PM), respectively. Generally, in radiograph-based human recognition, the PM radiographs are compared with the AM radiographs stored in the database [10]. Dental records have been extensively used in disaster victim identification, such as after the 9/11 attacks and the Asian tsunami [11], [12]. Many authors [13]-[17] have successfully demonstrated the use of dental radiographs for human identification. However, a number of challenges remain, such as poor image quality and changes in the dentition over time, including tooth eruption and loss and the emergence, abrasion, falling out, and replacement of dental restorations [13]-[17]. In such cases, hand radiographs can be considered for authentication because bones are not easily damaged by burns, rashes, or wounds. The human hand consists of proximal phalanges, middle phalanges, distal phalanges, metacarpals, and carpal bones. The different views for capturing hand radiographic images are the posteroanterior (PA) view, lateral view, oblique view, and anteroposterior view [18]. This paper presents a novel method for person authentication based on hand radiographs using deep learning. A three-layered convolutional deep-learning architecture is used for feature extraction from the radiographs, and k-nearest neighbor (KNN) and support vector machine (SVM) classifiers are used for the retrieval of subjects for different striding pixels, filter sizes, pooling windows, and kernel sizes.
Deep learning has become popular in recent years because of its ability to extract hidden features from images. Deep learning algorithms have been successfully applied to the detection, recognition, classification, segmentation, and retrieval of image data. Learning is an important step, which is used to obtain the optimum weights of the convolution filters. There are different learning algorithms for deep learning, such as gradient descent, stochastic gradient descent, momentum learning, the Levenberg-Marquardt algorithm, and back-propagation through time [19]. In R-CNN, the problem of variable feature length at the fully connected layer is reduced by using a selective search algorithm. However, R-CNN takes substantial time to train, and the fixed selective search algorithm is challenging to train [20], [21]. Alhussein et al. implemented transfer learning using two types of convolutional neural network, namely deep and shallow learning models. The fusion of the two resulted in 87.96% accuracy for EEG pathology classification [22]. Zhou et al. used Faster R-CNN for object detection on the ImageNet, PASCAL VOC, and COCO datasets. Faster R-CNN is quicker than R-CNN because convolution is performed only once per image rather than once per candidate region [23]. Zhao et al. [24] used a capsule convolutional neural network for robust and efficient iris recognition. Wang et al. [25] presented rail surface area detection using cascade sampling and dilated convolution. Dilated convolution is used for multi-scale feature learning of rail surfaces within a CNN architecture, and cascade sampling combines average pooling, maximum pooling, and convolution to downsample the image. Hassan and Mahmood [26] successfully applied a convolutional recurrent deep neural network architecture to sentence classification. They used a CNN to train word embeddings at the initial level and an RNN to tune the parameters. A recurrent layer was used instead of a pooling layer to avoid loss of information.
In recent years, many researchers [27]-[34] have focused on bone age assessment, rheumatoid arthritis detection, bone segmentation, human identification, and osteoporosis detection based on hand radiographs using advanced algorithms such as convolutional and deep neural networks. El Soufi et al. [27] proposed a system for human identification using hand x-ray images, which consists of segmentation of the phalanges, complex Fourier transforms, and a KNN classifier. Kauffman et al. [28] extracted 64 shape features per bone from the proximal phalanges, middle phalanges, and metacarpals using an active appearance model (AAM); principal component analysis with a likelihood ratio classifier is used for classification. In our previous approach [29], dual cross pattern (DCP) features along with KNN (N=3 and N=5) and classification tree classifiers were successfully applied to human authentication based on hand radiographs. Harmsen et al. [30] presented a novel method for bone age assessment employing a support vector machine combined with a cross-correlation prototype. Cross-correlation features were extracted for the 14 epiphyseal regions of the finger bone images, resulting in 96.16% accuracy for bone age assessment in the age group 1 to 19. Huo et al. [31] proposed joint space width quantification in radiographic finger bone images for the automatic early detection of rheumatoid arthritis. Bone regions were segmented using a second-order derivative filter, and the Hausdorff distance was used for the joint space width measurement. Areeckal et al. [32] presented a method for the diagnosis of osteoporosis from geometric features of the third metacarpal, using the watershed technique for metacarpal segmentation. For the detection of rheumatoid arthritis (RA) using hand radiographic images, Mihail et al. [33] estimated the hand skeletal shape using a deep CNN and conditional random fields. Wang et al. [34] presented skeletal maturity recognition based on hand radiographs using a deep neural network on a dataset of 1101 hand and wrist radiographs, with recognition accuracies of 92% and 90% for the radius and ulna, respectively. Sánchez et al. [35] developed a new human recognition method using benchmark ear and face databases. For the optimization of modular granular neural networks, firefly and grey wolf [36] algorithms are suggested. Gupta and Gupta [37] proposed a multi-biometric authentication system using the fusion of palm slap fingerprints, palm dorsal vein, and hand geometry; the fusion improved the results. Afifi [38] proposed a gender recognition and biometric identification technique using the dorsal and palmar sides of 11K human hand images. The experimental results showed that the dorsal side also contains distinctive features similar to, if not better than, those available on the palmar side. Only a few researchers [28], [29], [34] have focused on person authentication based on hand radiographs, which can be primarily used in post-mortem identification or recognition of the person [39].
VOLUME 8, 2020

II. MATERIALS AND METHODS
Many authors [32]-[34], [40]-[45] used datasets of children under the age of 18 for bone age assessment (BAA), because segmentation of the carpals and phalanges is easier at those ages. After the age of 18, the bones are fully grown and fusion is complete, so selecting a particular segmentation algorithm, or a combination of algorithms, becomes a time-consuming and challenging task. The primary advantage of the proposed study is that it overcomes the segmentation problem experienced by state-of-the-art BAA methods and supports the identification of victims where traditional biometric techniques cannot be utilized. The deep neural network gives an internal connectivity map of the image that retains the spatial and temporal information of the image and is sufficient to discriminate one image from another. The features are extracted by convolving the input images with the filter kernels, which captures the interconnectivity between the contents of the image, and can be considered the best features for the image object. The word "deep" refers to the number of convolution layers used in the implementation. The flow diagram of the proposed system with three convolution layers is shown in Fig. 1.

A. CONVOLUTION NEURAL NETWORK (CNN)
CNN is inspired by the animal visual cortex, in particular by the connectivity pattern between its neurons. When using a CNN, slight preprocessing, such as enhancement or filtering of the image, is still important, as it is in traditional handcrafted feature extraction techniques. CNNs have a wide range of applications, such as image processing, video processing, speech processing, and natural language processing. A CNN consists of four main stages: the convolution layer, the rectified linear unit (ReLU), the maximum pooling layer, and the fully connected layer. The architecture of a single CNN layer is shown in Fig. 2.

B. CONVOLUTION LAYER
The convolution layer shares parameters across image regions and maintains sparse connections between them. Equation (1) refers to the convolution process, which gives the relationship between the input image ($I_m$) and the filter kernel ($W$). Here, $b_o$ denotes the bias used to adjust the output according to the weighted sum of the input neurons.
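Equation (1) is not reproduced in this text. A commonly used discrete form of the operation, written with the symbols defined above, is sketched here for orientation only. The kernel size $k$, the index names $u$ and $v$, the output symbol $I_{conv}$ at pixel $(x, y)$, and the correlation-style indexing are assumptions and are not taken from the original equation.

```latex
I_{conv}(x, y) = \sum_{u=1}^{k} \sum_{v=1}^{k} I_m(x + u - 1,\; y + v - 1)\, W(u, v) + b_o
```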
The original image is first converted to grayscale, and convolution is then applied. The weights of the filter kernels are updated using the back-propagation learning algorithm. The image is zero-padded on all sides so that the convolution filter fits over it. The size of the grayscale image is 200 × 150 and the filter kernel size is 3 × 3 × 6, that is, six 3 × 3 kernels. The convolution layer therefore generates an output of size 200 × 150 × 6, which is fed to the ReLU layer.
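The size bookkeeping described above can be illustrated with a minimal pure-Python sketch. It assumes a stride of one pixel and odd-sized square kernels, and the function names `conv2d_same` and `conv_layer`, the constant test image, and the averaging kernels are all hypothetical; the reported system was implemented in MATLAB and its code is not shown in the paper.

```python
def conv2d_same(image, kernel, bias=0.0):
    """Convolve one grayscale image with one square kernel (stride 1, zero padding)."""
    rows, cols = len(image), len(image[0])
    k = len(kernel)
    pad = k // 2  # a 3 x 3 kernel needs one ring of zeros
    padded = [[0.0] * (cols + 2 * pad) for _ in range(rows + 2 * pad)]
    for r in range(rows):
        for c in range(cols):
            padded[r + pad][c + pad] = float(image[r][c])
    output = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            # Weighted sum of the k x k neighbourhood plus the bias term.
            total = bias
            for u in range(k):
                for v in range(k):
                    total += padded[r + u][c + v] * kernel[u][v]
            out_row.append(total)
        output.append(out_row)
    return output


def conv_layer(image, kernels, biases):
    """Apply each kernel to the same image, giving one feature map per kernel."""
    return [conv2d_same(image, w, b) for w, b in zip(kernels, biases)]


# Illustrative values only: a constant 200 x 150 image and six averaging kernels.
image = [[1.0] * 150 for _ in range(200)]
kernels = [[[1.0 / 9.0] * 3 for _ in range(3)] for _ in range(6)]
feature_maps = conv_layer(image, kernels, [0.0] * 6)
print(len(feature_maps[0]), len(feature_maps[0][0]), len(feature_maps))  # 200 150 6
```

Because of the zero padding, each of the six feature maps keeps the 200 × 150 size of the input, which matches the 200 × 150 × 6 output stated above.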

C. ReLU LAYER
In the convolution layer, the image is multiplied by the filter kernel, and the result may contain negative values. The rectified linear unit layer removes them by setting all the negative values to zero using (2).
$$I_{ReLU} = \max(0, I_{conv}) \tag{2}$$
where $I_{ReLU}$ is the ReLU layer image and $I_{conv}$ is the convolutional layer image. The ReLU layer output size is 200 × 150 × 6, which is equal to the size of the convolution layer output. The output of the ReLU layer is given to the max pooling layer.

D. MAX POOLING LAYER
Pooling minimizes the computing cost by reducing the dimensions of the ReLU layer output. Pooling is also used to retain position- and rotation-invariant features of the image. There are two types of pooling: maximum and average pooling. In maximum pooling, the maximum value within the given window is selected, whereas in average pooling, the average value of the window is selected (see Fig. 3). Unlike average pooling, maximum pooling suppresses noise in the image in addition to reducing the features. Pooling also helps to maintain the localization of the shape of local objects in the image. A window size of 2 × 2 is selected for maximum pooling, which reduces the output to half the size of the ReLU layer output in each spatial dimension (100 × 75 × 6). A larger window size for maximum pooling may result in the loss of fine local information of the image.
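The rectification and the two pooling variants described above can be sketched on a small example. This is an illustration under assumptions rather than the authors' implementation: the windows are taken as non-overlapping, leftover rows or columns are dropped, and the names `relu` and `pool2d` and the 4 × 4 input values are hypothetical.

```python
def relu(feature_map):
    """Set every negative value to zero; positive values pass through unchanged."""
    return [[max(0.0, value) for value in row] for row in feature_map]


def pool2d(feature_map, window=2, mode="max"):
    """Pool non-overlapping window x window blocks; leftover rows/columns are dropped."""
    out_rows = len(feature_map) // window
    out_cols = len(feature_map[0]) // window
    pooled = []
    for r in range(out_rows):
        pooled_row = []
        for c in range(out_cols):
            block = [feature_map[r * window + i][c * window + j]
                     for i in range(window) for j in range(window)]
            if mode == "max":
                pooled_row.append(max(block))
            else:
                pooled_row.append(sum(block) / len(block))
        pooled.append(pooled_row)
    return pooled


conv_output = [[1.0, -2.0, 3.0, 0.0],
               [4.0, 5.0, -6.0, 7.0],
               [0.0, 0.0, 1.0, 1.0],
               [-1.0, 2.0, 3.0, -4.0]]
rectified = relu(conv_output)
print(pool2d(rectified, 2, "max"))      # [[5.0, 7.0], [2.0, 3.0]]
print(pool2d(rectified, 2, "average"))  # [[2.5, 2.5], [0.5, 1.25]]
```

With a 2 × 2 window, each spatial dimension is halved, and an odd dimension such as 75 is reduced to 37 because the last column has no partner.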

E. FULLY CONNECTED LAYER
In the fully connected layer, the output matrix of the max-pooling layer is converted into a one-dimensional column vector, because most classifiers need the input data as a one-dimensional vector. A fully connected layer connects each neuron in one layer to every neuron in the next layer, similar to a multi-layer perceptron.
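The conversion to a one-dimensional vector can be sketched as follows. The function name `flatten` and the map-by-map, row-by-row ordering are assumptions, since the ordering used in the original implementation is not stated; the example uses 216 maps of 25 × 18 values, the third-layer size reported in the results.

```python
def flatten(feature_maps):
    """Concatenate all maps, row by row, into a single one-dimensional list."""
    return [value for fmap in feature_maps for row in fmap for value in row]


# 216 placeholder maps of 25 rows x 18 columns.
maps = [[[0.0] * 18 for _ in range(25)] for _ in range(216)]
print(len(flatten(maps)))  # 97200
```

Under this assumption, each radiograph would be represented by a vector of 25 × 18 × 216 = 97200 values before classification.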

F. LEARNING
The weights of the filters are adjusted by back-propagation learning. To adjust the weights, a loss function based on the mean squared error is used (3), where target is the class label and output refers to the weighted sum of the fully connected layer. The main aim of learning is to minimize the loss function. Equations (4) and (5) are used to update the weights and biases of the network. Here, $\omega_i(\ell)$ and $\beta_i(\ell)$ are the initial weight and bias of the network and $\eta$ is the learning rate. The gradients in (4) and (5) are calculated by (6) and (7), where $a_j(\ell)$ is the activation map of the $j$-th neuron, $H(z_i(\ell))$ is the output map of the hidden layer, and $\delta_i(\ell)$ is the error delta.
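Equations (3) to (7) are not reproduced in this text, so the following is only a generic sketch of the idea they describe: a squared-error loss minimized by moving the weights and the bias against the gradient with learning rate η. It uses a single linear unit rather than the full network, and the function name `gradient_step`, the factor 0.5 in the loss, and the numerical values are assumptions.

```python
def gradient_step(weights, bias, inputs, target, learning_rate=0.1):
    """One gradient-descent update of a single linear unit under a squared-error loss."""
    output = sum(w * x for w, x in zip(weights, inputs)) + bias
    error = output - target
    loss = 0.5 * error ** 2  # the factor 0.5 is a convention assumed here
    # d(loss)/d(w_i) = error * x_i and d(loss)/d(bias) = error
    new_weights = [w - learning_rate * error * x for w, x in zip(weights, inputs)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias, loss


weights, bias = [0.0, 0.0], 0.0
for step in range(3):
    weights, bias, loss = gradient_step(weights, bias, [1.0, 2.0], target=1.0)
    print(step, round(loss, 4))
```

The printed loss shrinks at every step, which is the behaviour the update rules are meant to produce; in the full network the same rule is applied layer by layer using the back-propagated error.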

G. CLASSIFIERS
As the deep neural network (DNN) feature map has enough discriminative power, KNN and one-versus-all SVM classifiers with a linear kernel are used for classification.
In general, a classifier is used to find patterns in the data and assign each sample to a class. KNN is easy to implement and requires little time. The value of k is selected as an odd number, as an even value of k may cause ties and hence ambiguity in recognition. KNN is known as a lazy learner because its training time is essentially zero: during training, the features are only stored, and at test time the nearest points are selected based on feature distance. The standard Euclidean distance (9) is used to calculate the similarity distance.
$$\delta_i = \sqrt{\sum (\text{test} - \text{train}_i)^2}, \quad i = 1, \ldots, n \tag{9}$$
where $\delta_i$ is the Euclidean distance between the testing feature and the $i$-th training sample, test is the testing feature, $\text{train}_i$ is the $i$-th training sample, $n$ is the total number of training samples, and the sum runs over the feature components. A larger value of k results in crowded neighborhoods and may give a false decision when the number of samples is large. In this study, after experimentation, the value of k was set to 3. SVM is widely used for pattern recognition and regression. An SVM with a radial basis function kernel can be employed as a classifier, with the γ and C parameters selected by five-fold cross-validation. One-versus-all SVM is used to classify multiple subjects, as each subject has different characteristics and therefore represents a separate class. In the proposed CNN, the fully connected layer converts the multidimensional feature map into a one-dimensional feature vector. Instead of the conventional softmax classifier of the CNN, we use an SVM classifier with a linear kernel, which reduces the classification time. The linear kernel is used to create the class-separating hyperplane, and one SVM is trained for each class against the features of all other classes.
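The KNN decision rule described above, with Euclidean distance and k = 3, can be sketched in a few lines. The function names, the two-dimensional toy features, and the subject labels are hypothetical; the actual feature vectors are the flattened CNN outputs.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_features, train_labels, test_feature, k=3):
    """Return the majority label among the k stored samples nearest to test_feature."""
    distances = sorted(
        (euclidean(test_feature, feature), label)
        for feature, label in zip(train_features, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy data: two subjects, three stored feature vectors each.
train_features = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
train_labels = ["subject_A"] * 3 + ["subject_B"] * 3
print(knn_predict(train_features, train_labels, [0.2, 0.1]))  # subject_A
print(knn_predict(train_features, train_labels, [5.5, 5.2]))  # subject_B
```

There is no training step beyond storing the features, which is why the cost of KNN falls entirely on the classification stage.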

H. DATASET COLLECTION
The dataset used in the proposed work consists of a total of 750 right-hand radiographs recorded at different positions for 150 different subjects (5 radiographs per subject) belonging to different age groups, professions, and genders (see Table 1). Because it is challenging to collect a large number of PM radiographs for validation of the proposed system, AM hand radiographs captured from the same subjects at different times and positions have been used to evaluate the system on a larger dataset. Some sample hand radiographs from the dataset are shown in Fig. 4. Keeping the hand position fixed ensures a higher level of accuracy in an experiment, but it does not reflect real-time use. Therefore, all the radiographs captured from the same subject have different finger postures and distances between the fingers, in an attempt to improve the accuracy when matching is done on a real-time basis rather than to serve the experiment alone. All radiographs were acquired with Siemens and Vision-C x-ray machines with all the necessary safety precautions and controlled radiation exposure. "All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards" [46]. Radiographs from different x-ray machines have different dimensions; hence, they were resized before feature extraction.

III. RESULT & DISCUSSION
The system is implemented using the computer vision and image processing toolboxes of MATLAB in a Windows environment. The performance of the system is evaluated based on the percentage cross-validation accuracy, which is calculated using (10).

$$\%\ \text{Cross Validation Accuracy} = \frac{\text{Correctly Recognized Samples}}{\text{Total Number of Samples}} \times 100 \tag{10}$$
Increasing the number of CNN layers in the deep learning architecture increases the discriminative power of the feature map, but a large number of layers increases the computation cost. In this study, three CNN layers are selected, and the feature map sizes for the different layers of the architecture are shown in Table 2. Convolving the original image of size 200 × 150 (see Fig. 5) with the 3 × 3 × 6 filter kernel in the first convolution layer results in six different feature maps. The output of the ReLU layer of the first CNN layer has the same dimensions as the convolution layer output, but it contains only non-negative values, since all negative activations are set to zero. The maximum pooling layer downsamples its input to half the original image size in each dimension, which reduces the dimensions of the feature map. The output of the first CNN layer is again convolved with a filter kernel of size 3 × 3 × 6 and passed to the ReLU layer for rectification. After the max-pooling of the second CNN layer, the feature map has dimensions 50 × 37 × 36; that is, the second CNN layer generates 36 interconnected maps of the image. The output of CNN layer 2 is given to CNN layer 3, where it is further convolved with a learned filter kernel of size 3 × 3 × 6. After rectification, the feature map is given to the max-pooling layer, which again subsamples the image maps by half. The feature map size of CNN layer 3 is 25 × 18 × 216; that is, the third CNN layer generates 216 interconnected maps of the image (see Fig. 6). As the number of filters increases, the connectivity showing discrimination between the different parts of the image increases. Tables 3, 4, 5 and 6, 7, 8 show the cross-validation accuracy for KNN and SVM for different striding pixels and filter sizes.
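The feature map sizes quoted above can be checked with a short calculation. The sketch assumes that each convolution keeps the spatial size (zero padding), that every incoming map is convolved with each of the six filters so the number of maps is multiplied by six per layer, and that 2 × 2 pooling rounds odd sizes down; the function name is hypothetical.

```python
def cnn_feature_map_sizes(height=200, width=150, filters_per_layer=6, layers=3, pool=2):
    """Return (height, width, maps) after each convolution + ReLU + pooling stage."""
    sizes = []
    maps = 1  # a single grayscale input image
    for _ in range(layers):
        maps *= filters_per_layer  # every incoming map meets every filter
        height //= pool            # pooling halves each dimension, rounding down
        width //= pool
        sizes.append((height, width, maps))
    return sizes


print(cnn_feature_map_sizes())  # [(100, 75, 6), (50, 37, 36), (25, 18, 216)]
```

Under these assumptions the calculation reproduces the 100 × 75 × 6, 50 × 37 × 36, and 25 × 18 × 216 sizes reported for the three layers.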
It is observed that six filter kernels give better accuracy (97.60% for KNN and 99.20% for SVM) than two (95.60% for KNN and 97.60% for SVM) or four (94.40% for KNN and 98.80% for SVM) filter kernels. KNN requires almost no time for training, but it requires more time at the classification stage and performs poorly on noisy data. As the multi-layered CNN is itself time-consuming, the SVM classifier is applied to minimize the classification time. In SVM, the testing data are compared only with the support vectors, which minimizes the classification time and gives better performance than the KNN classifier.
A large number of filters increases the size of the convolution layer output, which results in a higher computation cost. Therefore, larger numbers of filters are not considered. The 3 × 3 filter window for convolution gives better spatial and temporal connectivity information of the image and performs better than 2 × 2 and 5 × 5 windows. The smaller and larger windows lose the coarse and fine edge information, respectively. The filter window strides over the original image during the convolution process. Striding the window by one pixel is time-consuming, whereas with larger strides the filtering may lose the connectivity between different regions of the image. Therefore, two-pixel striding is used, which gives better connectivity between the subtle gradients of the image regions. For an N × N pooling region, the feature maps are scaled down to 1/N of their input size in each dimension. For larger scaling, the maximum pooling feature maps lose the internal connectivity of the image. Therefore, a 2 × 2 pooling region is used, which scales the feature maps to half of their original size. Also, unlike average pooling, max-pooling acts as a noise suppressor, which makes it robust to noisy data. Increasing the number of convolution layers in the deep learning architecture increases the connectivity of the neurons in the architecture and makes it more discriminative.
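The effect of window size, stride, and padding on the output size follows the standard relation floor((n + 2p - w) / s) + 1 along each axis. The sketch below applies it to a few cases; the function name is hypothetical, and the specific combinations of stride and padding are illustrative assumptions, since the paper does not state how they were combined.

```python
def output_size(n, window, stride=1, padding=0):
    """Number of window positions along one axis of length n."""
    return (n + 2 * padding - window) // stride + 1


print(output_size(200, 3, stride=1, padding=1))  # 200
print(output_size(200, 3, stride=2, padding=1))  # 100
print(output_size(75, 2, stride=2))              # 37
```

A one-pixel stride visits every position and preserves the size when padded, a two-pixel stride visits half as many positions, and a 2 × 2 pooling window moved two pixels at a time reduces 75 columns to 37.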
The experimental results reveal that the proposed technique can be used for the identification of victims where traditional biometric approaches cannot be utilized. Hand radiographs cannot be replicated or spoofed, which makes the proposed system more robust and reliable [9]. The accuracy of the deep learning system increases with the number of layers (from CNN-1 to CNN-3), but the computation time also increases, which makes the algorithm slower and less reliable for real-time applications. It is observed that after the CNN-3 layer, a further increase in the number of layers (CNN-4) decreases the retrieval accuracy.
There is no public hand radiograph dataset against which to compare the proposed human identification system, so it is difficult to make a rational comparison. The main aim of the state-of-the-art papers is to predict bone age in the range of 0-18 years [28], [32], [40]-[45], because the bones are entirely matured and fusion is complete after the age of 18 years. In most of these papers, the assessment is done successfully; however, the methods are not suitable for the identification of adult victims. Kauffman et al. [28] presented an automated radiographic assessment of the hand in rheumatoid arthritis (RA). The method was tested on 100 plain left- and right-hand radiographs of 40 different patients for joint space width segmentation using an active appearance model (AAM). In our previous work [29], we performed human identification based on the dual cross pattern (DCP) of hand radiographs. The dataset consisted of 100 right-hand radiographs of adults aged 18 to 42 years, and an average classification accuracy of 89.1% was achieved. The method of [32] was tested on 157 left-hand x-ray images, collected from the Image Processing and Informatics Lab, University of Southern California, in the age group 17-18 years, for automatic segmentation of the third metacarpal bone for the diagnosis of osteoporosis; the accuracy achieved was 94.9%. Wang et al. [34] proposed a CNN-based technique for radius and ulna bone classification, achieving classification accuracies of 92% and 90% for 400 radius images and 600 ulna images, respectively. The method of [40] was tested on 120 images of children below 18 years; an accuracy of 90% with a discrepancy of a 2-year error rate between PROI and CROI was achieved. Pietka et al. [41] proposed a bone age assessment method based on epiphyseal/metaphyseal region of interest (ROI) extraction. The feature extraction accuracies achieved were 91%, 83%, and 75% for the distal, middle, and proximal ROIs, respectively, from 200 left-hand images of subjects below the age of 14.
The method proposed by Simu et al. [42] for the segmentation of the radius and ulna bones from hand radiographs was tested on 19 images (one for each age group from 0 to 18 years) collected from Children's Hospital Los Angeles, USA. The method proposed by Yuh et al. [43] for later-stage bone age assessment using the wavelet transform, singular value decomposition (SVD), and SVM was evaluated on 21 hand radiographs of subjects from 7 to 12 years old, with an average accuracy of 92.41%. Niemeijer et al. [44] developed a technique for skeletal maturity estimation of children using the Tanner-Whitehouse method, with an error of not more than one stage for 71 cases. Pathak et al. [45] presented an automatic skeletal maturity identification technique using three hierarchical stages of syntactic recognition for the structural development of 128 × 145 radiographs of 10-12-year-old boys.
To compare our method, we implemented human identification systems based on established networks. Table 9 compares the average accuracy of the proposed method with that of established deep learning architectures such as AlexNet, ResNet, VGGNet, and InceptionNet. Because of the minimization of non-linearity and the increased number of filter maps, the proposed method outperforms the existing architectures on the captured hand radiograph database.
The proposed method includes many steps (Fig. 1), and the running time depends on a number of parameters, such as the number of layers, kernel size, window size, sample size, and striding pixels. Hence, it is not straightforward to report the running time of every step. In this paper, instead of the conventional softmax classifier of the CNN, we used an SVM classifier with a linear kernel, which minimizes the classification time. Table 10 shows that increasing the number of training samples increases the number of support vectors for the SVM classifier and thus increases the cross-validation accuracy. Throughout the experimentation, the testing images were not part of the training images.
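The separation of training and testing images and the accuracy measure of (10) can be sketched as a k-fold evaluation loop. This is an illustration under assumptions: the fold layout used in the experiments is not stated, so the interleaved assignment, the function names, the one-dimensional toy features, and the nearest-neighbour stand-in classifier are all hypothetical.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train, test) index lists; sample i is held out in fold i % n_folds."""
    folds = [list(range(start, n_samples, n_folds)) for start in range(n_folds)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_samples) if i not in held]
        yield train, held_out


def cross_validation_accuracy(features, labels, n_folds, fit_predict):
    """Percentage of correctly recognized samples over all held-out folds."""
    correct = 0
    for train_idx, test_idx in k_fold_indices(len(features), n_folds):
        predictions = fit_predict([features[i] for i in train_idx],
                                  [labels[i] for i in train_idx],
                                  [features[i] for i in test_idx])
        correct += sum(p == labels[i] for p, i in zip(predictions, test_idx))
    return 100.0 * correct / len(features)


def nearest_neighbour(train_x, train_y, test_x):
    """Stand-in classifier: label of the closest stored one-dimensional feature."""
    predictions = []
    for x in test_x:
        best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
        predictions.append(train_y[best])
    return predictions


# Toy data: two subjects with five samples each, stored subject by subject.
features = [0.0, 0.1, 0.2, 0.3, 0.4, 10.0, 10.1, 10.2, 10.3, 10.4]
labels = ["a"] * 5 + ["b"] * 5
print(cross_validation_accuracy(features, labels, 5, nearest_neighbour))  # 100.0
```

With samples stored subject by subject and five samples per subject, the interleaved assignment holds out exactly one sample of every subject in each of the five folds, so no test sample ever appears in the corresponding training set.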

IV. CONCLUSION
A novel human identification method using a deep neural network for matching hand radiographs is presented in this paper. The initial results on a primary dataset indicate that hand radiographs are appropriate for human identification. Three convolutional neural network layers are used to obtain deeper connectivity of the image and to construct the deep learning architecture. For classification, KNN and SVM classifiers are used. Extensive experiments are performed on 750 right-hand radiographs acquired from 150 subjects. The performance of the algorithm, evaluated based on the percentage cross-validation accuracy, shows that the DNN with 3 layers, 3 × 3 filter kernels, and a 2 × 2 pooling window resulted in 97.60% and 99.20% cross-validation accuracy for KNN and SVM, respectively. Experimental results show that the proposed method outperforms the existing architectures on the captured hand radiograph database. The experimental results also show that the proposed technique can be used as a substitute for a dental-radiograph-based identification system in disaster victim identification. Compared with conventional biometric identification, hand-radiograph-based recognition is challenging because hand bones change over time owing to fractures and hard manual work. Several hurdles were faced during the retrieval of hand images from the database, such as pose, lighting, and variations related to profession. It is observed that people who do hard manual work have more variation in their bone texture. Experimental results also reveal that variability in hand positioning and in the image capturing setup, in particular the finger postures and the distance between the fingers while capturing radiographs, can create nonconformity in the overall performance of the system. It is recommended to develop a method to standardize the way in which the hand is positioned during radiographic acquisition.
In future, a unified system based on hand radiographs can be designed to improve pose correction by combining quality-based frame selection and mark-based matching techniques. Different digital x-ray machines have different resolutions but generally display a high-contrast image owing to various digital filtering techniques. How this affects the overall performance of the proposed system is not clear. In addition, the algorithm is sensitive to the number of training samples. To investigate this, it is important to acquire a larger set of new data over a period of several years from different x-ray machines and different age groups.