Palmprint-Palmvein Fusion Recognition Based on Deep Hashing Network

Palmprint has attracted increasing attention in the biometrics field because of its several advantages. Deep learning has achieved remarkable performance in computer vision, so a large number of deep-learning-based methods have been proposed for palmprint recognition. The output of a deep hashing network (DHN) can be represented as a binary bit string, so DHN reduces storage and accelerates matching and retrieval. In this paper, DHN is employed to extract binary templates for palmprint and palmvein verification. A spatial transformer network is used to overcome rotation and dislocation. Palmprint can be acquired in the visible-light spectrums, namely red (R), green (G), and blue (B), whereas palmvein can be acquired in the near-infrared (NIR) spectrum. Since the features in different spectrums differ, their complementary advantages can be fully exploited by fusion. Image-level fusion and score-level fusion are developed for palmprint-palmvein fusion recognition. The experiments demonstrate that score-level fusion improves the accuracy effectively.

The associate editor coordinating the review of this manuscript and approving it for publication was Yizhang Jiang.
In texture coding-based methods, some filters are typically employed to extract and encode the texture features, and generate the binary code as the template. The storage cost is low. In addition, the dissimilarity between two binary templates can be measured by Hamming distance, so the XOR computation in matching is fast.
Coding-based methods are suitable for ideal environments, such as contact acquisition. However, coding-based methods are not robust to some interferences in contactless acquisition, such as deformation due to hand gesture, placement movement, rotation, and tilt due to imperfect segmentation and localization.
Deep learning has achieved remarkable performance in many computer vision tasks, and deep learning models can cope with the aforementioned interferences. Deep hashing network (DHN) is a state-of-the-art deep learning model that inherits the advantages of deep learning and the strengths of coding-based methods, including strong robustness to interference, low storage cost, and fast matching speed. The binary template of DHN is much smaller than those of the traditional coding-based methods, so this paper employs DHN for palmprint/palmvein recognition to further reduce the storage cost and accelerate matching without degrading accuracy, even in severe environments.
The spectrums for palmprint acquisition include red (R), green (G), blue (B), and their combinations. The spectrums for palmvein acquisition include infrared (IR) and near infrared (NIR). Palmprint and palmvein can be acquired simultaneously, and they can use the same recognition algorithms, so they are suitable for fusion recognition. Accordingly, the main contributions of this paper are as follows:
(1) Spatial transformer network (STN) is embedded into DHN to align the images. Since STN overcomes the problems of dislocation and rotation, the verification accuracy is improved.
(2) Two image-level fusion modes, namely spatial concatenation and channel concatenation, are developed to improve the accuracy of palmprint-palmvein fusion recognition. Different combinations of the spectrums in the fusions are compared and analyzed to confirm the optimal scheme.
(3) Score-level fusion is performed for the fusion recognition of palmprint and palmvein. Compared with the recognition of a single modality, the recognition accuracy is greatly improved.
The rest of this paper is organized as follows: Section II reviews the related work. Section III describes the proposed method, and Section IV presents the experimental results. Finally, the conclusions are drawn in Section V.

II. RELATED WORK
Deep learning has achieved remarkable performance in many computer vision tasks. A limited number of deep-learning-based methods have been proposed for palmprint and palmvein recognition, which can be briefly categorized into four classes as follows:

B. TRAINING NETWORKS FOR PALMPRINT/PALMVEIN RECOGNITION
Svoboda et al. [21] trained AlexNet to enlarge the separability between the genuine and imposter distributions, but it requires supervised training. Wen et al. [22] proposed a novel loss function, which aimed to increase the inter-class distances. Inspired by this work, Zhong and Zhu [23] proposed centralized large margin cosine loss on the benchmark structure in [24], which enhanced the intra-class compactness. Matkowski et al. [25] proposed a CNN framework for palmprint recognition in an uncontrolled environment, which included two sub-networks for segmenting the region of interest (ROI) and extracting features, respectively. Chai et al. [26] pre-trained a network with gender as a soft biometric, and then trained the network for palmprint classification. Du et al. [27] proposed a CNN-based regularized adversarial domain model for cross-domain recognition. Liu et al. [28] used a fully convolutional network (FCN) with a soft-shifted triplet loss function to learn discriminative palmprint features.

C. COMBINATION OF SPECIAL FILTERS AND DEEP LEARNING
Minaee and Wang [29] used deep scattering network and a bank of special filters to extract palmprint features. SVM was selected as the classifier. Genovese et al. [30] proposed PalmNet, which combined CNN, Gabor filters and principal component analysis (PCA). The significant advantage of PalmNet is that it can be trained in an unsupervised mode.

D. HASHING-BASED NETWORK
The classical CNN frameworks are suitable for classification (identification), but they are not good at verification (authentication). DHN was specially proposed for verification to enlarge the inter-class variance and reduce the intra-class variance. The outputs of DHN can be represented as a binary bit string, so DHN can reduce the storage and accelerate matching/retrieval speed.
Chen et al. [31] proposed discriminant spectral hashing for compact palmprint representation. Cheng et al. [32] combined supervised hashing with deep convolutional features for palmprint recognition. Zhong et al. [33] recognized hand-based multi-biometrics by using DHN and biometric graph matching to separately extract the features of palmprint and dorsal hand veins. A trained SVM was used as the classifier. Zhong et al. [34] employed deep hashing network for palmvein recognition; their method was named as deep hashing palmvein network (DHPN). The equal error rate of DHPN was close to 0% on NIR. Li et al. [35] used softmax classification loss and improved ternary losses to learn the hashing code to maintain the consistency with high-dimensional features. Liu et al. [36] used deep self-taught hashing to generate pseudo labels and then used DHN to generate hashing code.
The deep-learning-based palmprint/palmvein recognition methods are summarized in Table 1.

III. METHODOLOGY
A. SPATIAL TRANSFORMER NETWORK
CNN encounters some restrictions due to the lack of spatial invariance. Jaderberg et al. [37] first proposed STN, which can be used for spatial transformation and data alignment according to specific tasks. STN is a differentiable module that can be embedded into existing convolutional structures to shift, scale, rotate, and distort the feature maps without any additional training supervision or modification of the optimization process. STN can generate models that are invariant to translation, zoom, rotation, and deformation [38]. STN consists of three parts, namely localization network, grid generator, and sampler, as shown in Figure 1. The localization network is a regression network, which aims to generate the parameter θ. The input feature map passes through several convolutional layers and a fully connected regression layer to output θ. The size of θ depends on the transformation type, e.g., θ is 6-dimensional in affine transformation. The grid generator uses the coordinates of the target feature map to calculate the corresponding coordinates in the source feature map through the transformation $T_\theta$. For the affine transformation,
$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = T_\theta(G_i) = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$$
where $(x_i^s, y_i^s)$ are the source coordinates in the input feature map, $(x_i^t, y_i^t)$ are the target coordinates, and θ is the transformation parameter. The sampler samples the original feature map according to the coordinates in $T_\theta(G)$, and copies the pixels in the source feature map S to the target feature map.
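The grid generator and the sampler can be illustrated with a minimal sketch for the affine case. It assumes coordinates normalized to [−1, 1], a 2 × 3 parameter matrix `theta`, and bilinear sampling with zero padding outside the source map; the function names `affine_grid` and `bilinear_sample` are illustrative and are not taken from the paper.

```python
import math

def affine_grid(theta, height, width):
    """Map every target pixel to source coordinates (normalized to [-1, 1])."""
    grid = []
    for r in range(height):
        y_t = -1.0 + 2.0 * r / (height - 1) if height > 1 else 0.0
        row = []
        for c in range(width):
            x_t = -1.0 + 2.0 * c / (width - 1) if width > 1 else 0.0
            # Affine transform of the homogeneous target coordinate (x_t, y_t, 1).
            x_s = theta[0][0] * x_t + theta[0][1] * y_t + theta[0][2]
            y_s = theta[1][0] * x_t + theta[1][1] * y_t + theta[1][2]
            row.append((x_s, y_s))
        grid.append(row)
    return grid

def bilinear_sample(image, grid):
    """Sample the source image at the grid locations (zero outside the image)."""
    h, w = len(image), len(image[0])

    def pixel(r, c):
        return image[r][c] if 0 <= r < h and 0 <= c < w else 0.0

    out = []
    for row in grid:
        out_row = []
        for x_s, y_s in row:
            # Convert normalized coordinates back to pixel indices.
            x = (x_s + 1.0) * (w - 1) / 2.0
            y = (y_s + 1.0) * (h - 1) / 2.0
            x0, y0 = math.floor(x), math.floor(y)
            dx, dy = x - x0, y - y0
            out_row.append(pixel(y0, x0) * (1 - dx) * (1 - dy)
                           + pixel(y0, x0 + 1) * dx * (1 - dy)
                           + pixel(y0 + 1, x0) * (1 - dx) * dy
                           + pixel(y0 + 1, x0 + 1) * dx * dy)
        out.append(out_row)
    return out

image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
mirror = [[-1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # horizontal flip
warped = bilinear_sample(image, affine_grid(identity, 3, 3))
flipped = bilinear_sample(image, affine_grid(mirror, 3, 3))
```

With the identity parameters the sampled map equals the input, while the second parameter set mirrors each row. In a trained network θ would come from the localization network rather than being fixed by hand.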

B. NETWORK ARCHITECTURE
CNN-F in [34] is selected as the backbone. The architecture of the modified CNN-F is shown in Figure 2, and the network configuration is shown in Table 2. Compared with the network in [34], the network in this paper adds STN after the input layer, uses 5 convolutional layers instead of 4, and uses PReLU as the activation function after each convolutional layer. The STN module is embedded directly after the network input layer. It does not need a separately designed loss function; its parameters are learned through the loss function of the whole network. To avoid vanishing gradients when the convolution result is negative, the PReLU activation function follows each convolutional layer. The tanh function is used as the activation function of the output layer, and the sign function is used to quantize the output. Finally, a hashing code whose entries are −1 or 1 is obtained.
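The activation and quantization steps described above can be sketched in a few lines. This is only an illustration: the PReLU slope of 0.25 is an assumed initial value (in the network it is a learned parameter), mapping an exact zero to +1 is an assumed tie-breaking convention, and the names `prelu` and `hash_code` are hypothetical.

```python
import math

def prelu(x, slope=0.25):
    """PReLU: identity for positive inputs, a (learned) slope for negative ones."""
    return x if x > 0 else slope * x

def hash_code(last_layer_outputs):
    """Apply tanh, then the sign function, to obtain a code in {-1, +1}."""
    relaxed = [math.tanh(v) for v in last_layer_outputs]
    return [1 if v >= 0 else -1 for v in relaxed]

code = hash_code([2.3, -0.7, 0.1, -4.0])
```

Because PReLU keeps a nonzero slope for negative inputs, a gradient still flows through units whose convolution result is negative, which is the motivation given in the text.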

C. LOSS FUNCTION
Controlling quantization error and cross entropy loss can effectively improve network performance. Therefore, the proposed loss function is based on two parts, namely distance loss and quantization loss.

1) DISTANCE LOSS
The purpose of the distance loss is to enlarge inter-class variance and reduce intra-class variance. The distance loss is defined over pairs of training samples, where M is the number of palmprint images in the training set, $x_i$ and $x_j$ are two input samples, $f_i$ and $f_j$ are their binary hashing codes, and $D(f_i, f_j)$ is the Hamming distance between the two hashing codes. If $x_i$ and $x_j$ are of the same class, $l_{ij} = 1$; otherwise $l_{ij} = 0$. T is the threshold: if $x_i$ and $x_j$ are of different classes and $D(f_i, f_j) > T$, there is no need to further enlarge the distance between the two samples. The value of T is determined by the length of the hashing code output from the network.
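As a rough illustration, one contrastive form consistent with the symbol definitions above penalizes same-class pairs by their distance and penalizes different-class pairs only while their distance is below T. The summation over unordered pairs, the absence of a scaling constant, and the function names are assumptions and may differ from the exact formula used in the paper; the sketch only evaluates the loss on already binarized codes.

```python
def hamming(f_i, f_j):
    """Hamming distance between two codes with entries in {-1, +1}."""
    return sum(1 for a, b in zip(f_i, f_j) if a != b)

def distance_loss(codes, labels, threshold):
    """Assumed contrastive form over all unordered pairs (i, j), i < j."""
    total = 0.0
    m = len(codes)
    for i in range(m):
        for j in range(i + 1, m):
            d = hamming(codes[i], codes[j])
            l_ij = 1 if labels[i] == labels[j] else 0
            # Same class: pull together. Different class: push apart up to T.
            total += l_ij * d + (1 - l_ij) * max(threshold - d, 0)
    return total

codes = [[1, 1, -1, -1], [1, 1, -1, 1], [-1, -1, 1, 1]]
labels = [0, 0, 1]
loss_t3 = distance_loss(codes, labels, threshold=3)
loss_t4 = distance_loss(codes, labels, threshold=4)
```

Raising the threshold from 3 to 4 makes the different-class pair at distance 3 contribute to the loss, which shows how T controls when inter-class pairs stop being pushed apart.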

2) QUANTIZATION LOSS
In the output of DNN, if the output of the last layer is randomly distributed, binarization by the tanh and sgn functions will inevitably lead to a large quantization error. To reduce the quantization error, the quantization loss should push each entry of the output closer to 1 or −1. In the definition of the quantization loss, $|f_i|$ is the absolute value of $f_i$ and $\|\cdot\|$ denotes the L2 norm. The overall optimization combines the distance loss and the quantization loss, where α is the scaling factor, empirically set to 0.5.
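A minimal sketch of the quantization term and the combined objective is given here under stated assumptions: the quantization loss is taken as the L2 norm of $|f_i| - 1$ summed over samples, and the total loss as the distance loss plus α times the quantization loss. Both forms are inferred from the definitions in the text rather than copied from the paper, and the function names are hypothetical.

```python
import math

def quantization_loss(outputs):
    """Sum over samples of the L2 norm of (|f_i| - 1), computed element-wise."""
    return sum(math.sqrt(sum((abs(v) - 1.0) ** 2 for v in f)) for f in outputs)

def total_loss(distance_term, outputs, alpha=0.5):
    """Assumed combination: distance loss + alpha * quantization loss."""
    return distance_term + alpha * quantization_loss(outputs)

# The first output is already binary; the second is far from +/-1.
outputs = [[1.0, -1.0], [0.4, -0.2]]
q = quantization_loss(outputs)
combined = total_loss(2.0, outputs)
```

An output that already sits at ±1 contributes nothing, so minimizing this term drives the tanh outputs toward the values the sign function will produce.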

D. IMAGE-LEVEL FUSION
Image-level fusion includes two modes, namely spatial concatenation and channel concatenation. The ROI size of palmprint and palmvein is 128 × 128. Assume $I_1$, $I_2$, $I_3$, and $I_4$ are the ROI images of the R, G, B, and NIR spectrums of the same class. The images can be concatenated, i.e., they are fused at image level.

1) SPATIAL CONCATENATION
Two spectral ROI images, $I_i$ and $I_j$, of the same class can be concatenated as $I = [I_i, I_j]$ with the size of 128 × (2 × 128) = 128 × 256. Three spectral ROI images, $I_i$, $I_j$, and $I_k$, of the same class can be concatenated as $I = [I_i, I_j, I_k]$ with the size of 128 × (3 × 128) = 128 × 384. Four spectral ROI images, $I_i$, $I_j$, $I_k$, and $I_l$, of the same class can be concatenated as $I = [I_i, I_j, I_k, I_l]$ with the size of (2 × 128) × (2 × 128) = 256 × 256.
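The resulting sizes can be checked with a small sketch that uses nested lists as stand-ins for ROI images. The helper names and the constant-valued placeholder images are hypothetical, and the order of the four images inside the 2 × 2 grid is an assumption, since the text only states the final size.

```python
def make_roi(value, size=128):
    """Hypothetical stand-in for a size x size single-spectrum ROI image."""
    return [[value] * size for _ in range(size)]

def hconcat(images):
    """Concatenate images of equal height side by side."""
    return [[px for img in images for px in img[r]]
            for r in range(len(images[0]))]

def vconcat(images):
    """Stack images of equal width vertically."""
    return [row for img in images for row in img]

def shape(img):
    return (len(img), len(img[0]))

r, g, b, nir = (make_roi(v) for v in range(4))
two = hconcat([r, nir])                                  # 128 x 256
three = hconcat([r, g, b])                               # 128 x 384
four = vconcat([hconcat([r, g]), hconcat([b, nir])])     # 256 x 256
```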

2) CHANNEL CONCATENATION
Each spectral image can be regarded as a single channel in channel concatenation. The concatenated images, which contain multi-spectral information through image-level fusion, are input to DHN for training, so the accuracy can be improved.
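Channel concatenation keeps the spatial size at 128 × 128 and grows only the channel dimension. The following sketch illustrates this with nested lists; the height-width-channel layout, the function name `channel_concat`, and the constant-valued placeholder images are assumptions made for the example.

```python
def channel_concat(images):
    """Stack single-channel H x W images into one H x W x C image."""
    height, width = len(images[0]), len(images[0][0])
    return [[[img[r][c] for img in images] for c in range(width)]
            for r in range(height)]

# Hypothetical constant-valued 128 x 128 ROIs standing in for R, G, B, NIR.
rois = [[[v] * 128 for _ in range(128)] for v in (10, 20, 30, 40)]
fused = channel_concat(rois)
dims = (len(fused), len(fused[0]), len(fused[0][0]))
```

Unlike spatial concatenation, every pixel location now carries the responses of all fused spectrums, so the first convolutional layer can combine them directly.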

E. SCORE-LEVEL FUSION
The output of DHN is two-valued, −1 and 1, and it can be converted to binary, 0 and 1. The Hamming distance can then be computed quickly by an XOR operation on each bit. The proposed model is trained on the samples of each spectrum independently. There are four spectrums in the multi-spectral database, so four proposed models are trained.
The dissimilarity between two deep hashing codes is measured by Hamming distance. Assume the Hamming distance of one spectrum is $h_i$, $1 \le i \le 4$. In score-level fusion, the weighted sum of Hamming distances is
$$h = \sum_{i=1}^{n} w_i h_i$$
where $w_i$ is the weight of the i-th fused spectrum and n is the number of spectrums in the fusion at score level, n = 2, 3 or 4. If all the weights are equal, $w_i = 1/n$.
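The matching and fusion steps can be sketched as follows: each {−1, +1} code is packed into an integer bit string, the per-spectrum Hamming distance is obtained by XOR and a bit count, and the distances are combined by a weighted sum. The helper names, the 4-bit toy codes, and the choice of R and NIR as the fused spectrums are illustrative assumptions.

```python
def to_bits(code):
    """Pack a {-1, +1} code into an int, mapping -1 to 0 and +1 to 1."""
    bits = 0
    for v in code:
        bits = (bits << 1) | (1 if v > 0 else 0)
    return bits

def hamming(a_bits, b_bits):
    """Hamming distance via XOR followed by a population count."""
    return bin(a_bits ^ b_bits).count("1")

def fuse(distances, weights=None):
    """Weighted sum of per-spectrum distances; equal weights 1/n by default."""
    n = len(distances)
    if weights is None:
        weights = [1.0 / n] * n
    return sum(w * h for w, h in zip(weights, distances))

probe = {"R": [1, -1, 1, 1], "NIR": [-1, -1, 1, 1]}
gallery = {"R": [1, 1, 1, -1], "NIR": [-1, -1, 1, -1]}
per_spectrum = [hamming(to_bits(probe[s]), to_bits(gallery[s])) for s in probe]
score = fuse(per_spectrum)
```

Here the two spectrums give distances 2 and 1, so the equal-weight fused score is 1.5; a pair is then accepted or rejected by comparing the fused score with a threshold.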

IV. EXPERIMENTS
A. DATABASES
In order to confirm that the methods in this paper have satisfactory performance, the experiments are conducted on four public databases, which are PolyU [39], multi-spectral [39], IITD [40] and Tongji [41], [42]. The left and right hands of one person are considered as two different classes. The samples of some classes are removed so that each class has the same sample number. Table 3 lists the information of the databases. Figure 3 shows the image samples of these databases. The palmprint and palmvein look dissimilar because they are acquired with different devices.

B. VERIFICATION
When false rejection rate (FRR) is equal to false acceptance rate (FAR), their common value is the equal error rate (EER). The proposed method, namely DHN+STN, is compared with coding-based methods and deep-learning-based methods. Coding-based methods include PalmCode [43], OrdinalCode [44], FusionCode [45], Competitive Code (CompCode) [46], Robust Line Orientation Code (RLOC) [47], Half-orientation Code (HOC) [48], Double Orientation Code (DOC) [49], Discriminative Competitive Code (DCC) [50], Discriminative Robust Competitive Code (DRCC) [50], and Binary Orientation Co-occurrence Vector (BOCV) [51]. Deep-learning-based methods include DHPN [34] and PalmNet [30]. For all the deep-learning-based methods, the ratio between the sample numbers in training set and testing set is 3:1. The hashing code length is set to 128. Table 4 compares the EERs of different methods on different databases. Our method yields the best results on all databases. The accuracies on contact databases are better than those on contactless databases since the acquisition conditions can be controlled satisfactorily on contact databases.
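The EER definition above can be made concrete with a short sketch that scans candidate distance thresholds and reports the point where FAR and FRR are closest. The acceptance rule (distance at most the threshold), the averaging of FAR and FRR at that point, and the toy distance lists are assumptions for illustration; they are not the evaluation protocol or data of the paper.

```python
def far_frr(genuine, impostor, threshold):
    """A pair is accepted when its distance is at most the threshold."""
    frr = sum(1 for d in genuine if d > threshold) / len(genuine)
    far = sum(1 for d in impostor if d <= threshold) / len(impostor)
    return far, frr

def equal_error_rate(genuine, impostor):
    """Return (EER estimate, threshold) where |FAR - FRR| is smallest."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]

# Toy intra-class (genuine) and inter-class (impostor) Hamming distances.
genuine = [5, 8, 12, 20, 30]
impostor = [25, 50, 55, 60, 64]
eer, threshold = equal_error_rate(genuine, impostor)

# Non-overlapping distributions give an EER of zero.
zero_eer, _ = equal_error_rate([1, 2], [10, 11])
```

In the toy data one genuine pair and one impostor pair fall on the wrong side of the threshold 25, giving FAR = FRR = 0.2. When the two distance distributions do not overlap, a threshold between them yields FAR = FRR = 0.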
The receiver operating characteristic (ROC) curves of the proposed method and the state-of-the-art methods on the contact and contactless databases are shown in Figure 4 and Figure 5, respectively. The ROC curves of the proposed method are almost always the highest.
To confirm that our method can yield zero EER, Figure 6 shows the results on Red, including the intra-class and inter-class matching distance distribution curves as well as the FAR-FRR curves using 128-bit and 256-bit encoding. The ratio between the training set and the test set is 3:1. In Figure 6(a), the two curves do not overlap when the distance threshold is about 40; correspondingly, FAR = FRR = 0 in Figure 6(b). The same conclusions can be drawn from Figure 6(c) and (d). The failure of other methods to reach EER = 0 is possibly caused by the number of test samples and the difficulty of network training. The test set of the coding-based methods is the entire data set, while the method in this paper and the other deep-learning-based methods are tested only on the test set, which is a subset of the former. The method in this paper takes full advantage of the fusion of multiple spectrums to enhance the discrimination, so it can yield zero EER.
The data sizes of the outputs of the methods are compared in Table 5. The size of all the outputs is measured in bits. The number of filter orientations is typically 6, so the orientation index, an integer within the range [0, 5], can be represented by 3 bits. Our method and DHPN require much less storage than the other methods. Table 6 compares the EERs of our method with different ratios between the training and testing sample numbers. Since the sample numbers per class differ across databases, the ratios differ as well. The longer the hashing code is, the lower the EER will be. The EER can be 0 in some databases with good acquisition conditions.
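The 3-bit figure for the orientation index follows from the number of bits needed to store an integer in [0, 5]. The short sketch below works through this arithmetic and, for comparison, the storage of a 128-bit hashing code; the function name `bits_for_index` is hypothetical.

```python
def bits_for_index(num_values):
    """Bits needed to store an integer index in [0, num_values - 1]."""
    return max(1, (num_values - 1).bit_length())

orientation_bits = bits_for_index(6)   # index in [0, 5]
hash_bytes = 128 // 8                  # a 128-bit hashing code in bytes
```

Six orientations need 3 bits per coded position, whereas a whole 128-bit hashing template occupies 16 bytes regardless of the image size.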

C. IMAGE-LEVEL FUSION
Image-level fusion includes two modes, namely spatial concatenation and channel concatenation. Figure 7 shows the spatially concatenated images with different spectrum numbers. Table 7 shows the EERs of spatial concatenation. The ratio between the training and testing sample numbers is 1:1, i.e., 6 samples of each class are for training, while the other 6 samples of each class are for testing. The hashing code length is set to 128 and 64. The best results on one spectrum and on two-, three-, and four-spectrum spatial concatenation are in bold. For single spectrum, NIR (N) outperforms the other three spectrums. Spatial concatenation cannot remarkably improve the accuracy. Table 8 shows the EERs of channel concatenation. The ratio between the training and testing sample numbers is 1:1. The hashing code length is set to 128 and 64. Compared with single spectrum, channel concatenation improves the accuracy. In addition, channel concatenation is better than spatial concatenation according to Tables 7 and 8.

D. SCORE-LEVEL FUSION
Table 9 shows the EERs of multi-spectral fusion at score level. The ratio between the training and testing sample numbers is 1:1, i.e., 6 samples of each class are for training, while the other 6 samples of each class are for testing. The hashing code length is set to 128 and 64. n is the number of spectrums in the fusion at score level, n = 2, 3 or 4. All the weights are equal. n = 1 means a single spectrum is used without fusion. The following observations can be made:
(1) All the results when 2 ≤ n ≤ 4 are better than those when n = 1, i.e., any score-level fusion can yield a better result than that of any single spectrum.
(2) The best results appear when n = 3, but they are unstable, i.e., they vary with the spectrums that are fused. Thus, it is not easy to select the spectrums when n = 3.
(3) The results when n = 4 are highly close to the best results.
Table 10 shows the EERs of Tongji palmprint and Tongji palmvein fusion at score level. The ratio between the training and testing sample numbers is 1:1, i.e., 10 samples of each class are for training, while the other 10 samples of each class are for testing. The hashing code length is set to 128 and 64. The weights of palmprint and palmvein are set to 0.5. From the experimental results, the score-level fusion improves the accuracy significantly.

V. CONCLUSION AND FUTURE WORK
In this paper, DHN is employed to extract the binary template for palmprint and palmvein verification. Spatial transformer network is used to overcome the rotation and dislocation, and accordingly improve the accuracy. The spectrums include R, G, B and NIR. Palmprint and palmvein can be acquired from visible-light spectrums and NIR spectrums, respectively. Since the features in different spectrums carry different information, their complementary advantages can be fully exploited by fusion. According to our investigation, image-level fusion by channel concatenation and score-level fusion can improve the accuracy. In our future work, we will try to select the discriminative bits from the hashing code to reduce the data size without degrading the accuracy, and will explore a deep hashing network recognition framework for open data sets.