Towards Unconstrained Palmprint Recognition on Consumer Devices: a Literature Review

As a biometric, palmprints have been largely under-utilized, but they offer some advantages over fingerprints and facial biometrics. Recent improvements in imaging capabilities on handheld and wearable consumer devices have re-awakened interest in the use of palmprints. The aim of this paper is to provide a comprehensive review of state-of-the-art methods for palmprint recognition, including Region of Interest extraction methods, feature extraction approaches and matching algorithms, along with an overview of available palmprint datasets, in order to understand the latest trends and research dynamics in the palmprint recognition field.


I. INTRODUCTION
The last decade has seen the migration of biometric recognition approaches onto mobile devices by using fingerprint [1], face [2] or iris [3] as an alternative to conventional authentication using PINs or patterns. Two-factor authentication, multi-modal and multi-biometrics are all considered to be viable options for improving the security of a system, as they considerably increase the spoofing effort for an attacker [4]. Jain et al. [5] evaluated several biometric features and reached the conclusion that there is no ideal biometric. Alongside the previously mentioned features is another biometric which has not received as much attention: the palmprint. However, there are several advantages which palmprint recognition can offer regarding its deployment on consumer devices:
• The features contained in a palmprint are similar to fingerprints, but cover a much larger surface. For this reason they are generally considered to be more robust than fingerprints [5].
• Palmprints are more difficult to spoof than faces, which are a public feature, or fingerprints, which leave traces on many smooth surfaces.
• There is no extra cost required for acquisition, as long as the device is fitted with a camera (optical sensor) and a flash source (LED or screen).
• It has potential for multi-biometric recognition, as it can be used with other hand-based features (fingerprints [6], finger knuckles [7], wrist [8]).
• It can be seamlessly integrated into the use case of many consumer devices, such as AR/VR headsets [9], smartphones [10], gesture control systems, driver monitoring systems, etc.

Adrian-S. Ungureanu is with the National University of Ireland, Galway, email: a.ungureanu1@nuigalway.ie
Saqib Salahuddin is with the National University of Ireland, email: s.salahud-din1@nuigalway.ie
Prof. Peter Corcoran is with the National University of Ireland, Galway, email: peter.corcoran@nuigalway.ie
The aim of this paper is to provide a comprehensive review focusing on the pipeline of palmprint recognition in order to clarify the current trends and research dynamics in palmprint-based biometric systems. The paper discusses in detail the available datasets of palmprint images and reviews the state-of-the-art methods for palmprint recognition.
A particular emphasis is placed on the improvement in imaging subsystems on handheld and wearable devices and on recent developments in unconstrained palmprint analysis, including the recent availability of new datasets and Region of Interest (ROI) extraction methodologies.
The rest of the paper is organized as follows. Section II describes existing datasets of palmprint images. Section III provides an overview of approaches developed for palmprint ROI extraction from various palmprint datasets. Section IV presents an overview of feature extraction approaches and matching algorithms. Section V presents discussions and concludes the paper.

II. PALMPRINT DATASETS
This section presents an overview of palmprint datasets used for the recognition of palmprints in the visible spectrum (hyperspectral imaging at various wavelengths is not considered, nor 3D acquisition).
The currently available palmprint datasets can be split into three categories, based on the restrictions imposed on the user during the acquisition process (as represented in Fig. 1 and summarized in Table I):
1) Constrained acquisition: This category includes the most popular palmprint datasets, which place the main focus on the feature extraction and matching stages, simplifying the acquisition as much as possible (for the recognition system). Images tend to display hands with a specific hand pose (fingers straight and separated) against a uniform background with no texture, usually black.
2) Partly unconstrained acquisition:
• Unconstrained environment: The background is unconstrained, which corresponds to the use case of consumer devices. The hand pose is required to follow a specific protocol, generally consisting of presenting the fingers spread out in front of the sensor (preferably the center of the image).

A. Constrained Palmprint Datasets
The Hong Kong Polytechnic University Palmprint dataset (HKPU) [11] was the first to provide a large-scale constrained palmprint dataset to compare recognition performance. The images were acquired using a scanner (A1 in Table I) having a cropped guide around the palm, reducing the impact of the fingers' position. A similar approach for acquiring palmprints, but including the entire hand, can be found in the Bosphorus Hand dataset [12]. The earliest touch-less palmprint datasets (A2 in Table I) were the ones released by the Chinese Academy of Sciences (CASIA) [13] and by the Indian Institute of Technology in Delhi (IIT-D) [14]. Both used a digital camera for acquisition in an environment with uniform lighting. The main differences are the scale and color information contained in IIT-D. The hand images in CASIA are gray scale and have cropped fingers. The College of Engineering Pune (COEP) [15] released a touch-less dataset of palmprints, but the acquisition relied on pegs to direct the position of fingers relative to the camera.
Another touch-less dataset was released by Las Palmas de Gran Canaria University under the name GPDS [16]. They used two webcams to acquire palmprint images in two sessions. One of the webcams was adapted to acquire NIR images by removing its IR filter and replacing it with an RGB filter. The dataset is split into images acquired in visible range (GPDS-CL1) and in NIR range (GPDS-CL2). In 2017, Zhang et al. [17] released a large-scale dataset (12,000 images) of palmprints acquired with a dedicated device containing a digital camera (Tongji). The acquisition environment was dark with a controlled light source illuminating the palm area. Recently, Kumar [18] released a large-scale dataset of palmprints entitled PolyU-IITD Contactless Palmprint Database v3, introducing a variety of challenges. Firstly, it contains hand images from two ethnicities (Chinese and Indian). Secondly, the palmprints were acquired from both rural and urban areas. The physical appearance of the hands varies significantly, there being instances of birth defects, cuts and bruises, calluses from manual labour, ink stains and writing, jewelry and henna designs. The dataset also contains a second acquisition session after 15 years, for 35 subjects.
[Fig. 2: sample images from (a) NUIG Palm1 [10], (b) XJTU-UP [28] (source: © 2019 IEEE), (c) MPD [29] and (d) NTU-CP-v1 [30].]

B. Partly Unconstrained Palmprint Datasets
Moving away from constrained scenarios, several datasets introduced at least one challenging factor in the context of palmprint recognition systems.
Considering an unconstrained environment for acquisition (B1 in Table I) leads to both variable background and lighting conditions. An initial step was made for palmprint matching in the context of smartphones by Aoyama et al. [20] in 2013 with a small dataset of images (called DevPhone). Unfortunately, the conditions of acquisition are not clear (how many backgrounds were considered, whether the flashlight was enabled), besides the fact that users were required to use a square guide to align the palm with the center of the acquired image. A much larger dataset was acquired by Kim et al. [21] both indoors and outdoors (BERC DB1 and DB2). Both DB1 and DB2 included a scenario where the smartphone's flashlight was enabled. As in the case of DevPhone, the images in BERC DB1/DB2 contained hands with a specific hand pose (open palm with spread fingers). A different approach to acquisition was provided by Tiwari et al. [22], who recorded videos of palmprints with a smartphone, with the video centered on the user's palmprint. Recently, Izadpanahkakhk et al. [23] introduced two palmprint datasets acquired with a smartphone camera: the Birjand University Mobile Palmprint Database (BMPD) and the Sapienza University Mobile Palmprint Database (SMPD). The variation considered for investigation was the rotation of the hands (in both datasets), both in-plane and out-of-plane.
The first dataset of palmprints acquired with multiple devices (B2 in Table I), albeit of reduced size, was developed by Choras et al. [24] using three smartphones. Jia et al. [25] developed a large dataset of images entitled Palmprint Recognition Across Different Devices (PRADD) using two smartphones and one digital camera. The background used was a black cloth and the hand's posture was restricted. From the images provided in [25], it appears that the acquisition was performed by someone other than the participants. Unfortunately, the datasets developed by Choras et al. [24] and Wei et al. [31] are currently not available to the research community.
The first palmprint dataset to consider hand pose variation (B3 in Table I), understood as open palms with spread fingers versus closed fingers, was collected by Afifi et al. and released under the name 11K Hands [26]. It contains over 11,000 hand images, both palmar and dorsal (about 5,500 images each). The images were acquired against a white background, using a digital camera. An auxiliary palmprint dataset exploring various hand poses was released in 2019 by the authors under the name NUIG Palm2 (NUIGP2) [27]. NUIGP2 was designed to support the development of ROI extraction algorithms.

C. Fully Unconstrained Palmprint Datasets
This category of palmprint datasets attempts to bring to researchers conditions as close as possible to a realistic deployment of a palmprint recognition system on consumer devices. An overview is presented in Table I for categories C1 and C2.
The first dataset to provide such palmprint images was released in 2017 by Ungureanu et al. [10] under the name NUIG Palm1 (NUIGP1). It contains images from several devices in unconstrained scenarios (both background and hand pose, as presented in Fig. 2a). Recently, a large-scale dataset of palmprint images acquired in similar conditions to NUIGP1 was released by Shao et al., entitled Xian Jiaotong University Unconstrained Palmprint database (XJTU-UP) [28]. The dataset contains 30,000+ images (200 hands) acquired using five smartphones, making it the largest currently available palmprint dataset acquired with smartphone cameras. Several samples are provided in Fig. 2b. Another large-scale palmprint dataset acquired with smartphones was released recently by Zhang et al. [29]. They used two smartphones to collect 16,000 hand images in unconstrained conditions.
Representing the next step of this trend, the NTU-Palmprints from Internet (NTU-PI-v1) [30] was released in late 2019, where severe distortions in the hand pose represent the main challenge to palmprint recognition. The dataset is especially large in terms of the number of hand classes (2,035), with a total of 7,781 images. Matkowski et al. [30] also released a dataset of more conventional hand images where the hand pose varies significantly, with acquisition against a white background. This dataset, entitled 'NTU-Contactless Palmprint Database' (NTU-CP-v1), also contains a relatively large number of hand classes (655), with 2,478 hand images in total.

III. ROI TEMPLATE DETECTION AND EXTRACTION
This section presents a general overview of existing approaches for palmprint ROI extraction. The process of ROI extraction is an essential part of the palmprint recognition system, as any inconsistencies in ROI templates will affect the recognition task.
The existing ROI extraction techniques can be grouped into four categories, based on the cues contained in the hand images, as shown in Fig. 3:
• Standard palmprint ROI extraction: algorithms based on separating the hand from the background (segmentation) and performing measurements to determine the landmarks (or palm region) required for ROI extraction. This family of techniques relies on accurate segmentation, as well as a specific hand pose (open palm with spread fingers).
• ROI extraction based on conventional Machine Learning (ML) algorithms: ML approaches are used for the detection of palmprints or for key-point regression. Key-point regression is a method that takes a hand image as input and returns a set of points used for ROI extraction.
• ROI extraction based on Deep Neural Networks (DNNs): approaches relying on DNN solutions to perform the detection or key-point regression task.
• Avoiding ROI detection altogether: based on specific acquisition protocols.

A. Standard Palmprint ROI Extraction
Standard palmprint ROI extraction algorithms rely on accurate segmentation of the hand region from the background. The most used approaches include using Otsu's thresholding method [32] applied to grayscale images, or using a skin-color model [33]. The segmentation is a pre-processing stage that characterizes the shape of the hand and determines the keypoints required for ROI extraction.
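As a rough illustration of the thresholding step, the sketch below implements Otsu's method on a flat list of 8-bit gray levels in plain Python. The function name otsu_threshold and the toy pixel values are assumptions made for illustration only; they are not taken from [32] or from any of the reviewed pipelines, and a practical system would use an optimized image-processing library.

```python
def otsu_threshold(pixels):
    """Return the 8-bit threshold that maximizes the between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t (background class)
    sum0 = 0  # sum of the gray levels in the background class
    for t in range(256):
        w0 += hist[t]
        if w0 == 0:
            continue            # no pixels in the lower class yet
        w1 = total - w0
        if w1 == 0:
            break               # all pixels are in the lower class
        sum0 += t * hist[t]
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Toy example (assumed values): dark background and a brighter hand region.
pixels = [10] * 50 + [200] * 50
t = otsu_threshold(pixels)
hand_mask = [p > t for p in pixels]  # True where the pixel is assigned to the hand
```

The resulting binary mask is what the contour-based and valley-detection steps discussed in this section would operate on.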
The most popular ROI extraction approach was introduced by Zhang et al. [34] in 2003, and relies on the constrained environment of the images in databases (A1, A2) in Table I, either touch-based or touch-less. The approach of Zhang et al. determines the tangent line between the two side finger valleys in order to normalize the palmprint's rotation and provide a reference point from which to extract a square region. This step is made possible thanks to the constrained environment of acquisition (black background, constant lighting), characteristic of palmprint datasets (A1, A2) in Table I. Recently, Xiao et al. [19] proposed an approach based on the intersection of the binarized hand with lines of specific orientations, resulting in several candidate points for the finger valleys. They then used K-means clustering to obtain the center of each cluster.
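The geometric idea behind the tangent-line approach can be sketched as follows, assuming the two finger-valley points have already been located. The function roi_square, the offset and side ratios, and the direction chosen for the normal are hypothetical choices made for illustration; they are not the parameters used in [34].

```python
import math


def roi_square(v1, v2, offset=0.2, side=1.0):
    """Rotation angle and corners of a square ROI aligned with the valley line.

    v1, v2 : (x, y) finger-valley points in image coordinates.
    offset : gap between the valley line and the ROI, as a fraction of |v1 v2|.
    side   : ROI side length, as a fraction of |v1 v2|.
    Both ratios are illustrative assumptions.
    """
    dx, dy = v2[0] - v1[0], v2[1] - v1[1]
    d = math.hypot(dx, dy)
    ux, uy = dx / d, dy / d   # unit vector along the valley-to-valley line
    nx, ny = -uy, ux          # unit normal, assumed to point towards the palm
    angle = math.degrees(math.atan2(dy, dx))  # in-plane rotation to undo

    # Centre of the ROI edge closest to the fingers.
    mx, my = (v1[0] + v2[0]) / 2, (v1[1] + v2[1]) / 2
    cx, cy = mx + nx * offset * d, my + ny * offset * d

    s = side * d
    tl = (cx - ux * s / 2, cy - uy * s / 2)
    tr = (cx + ux * s / 2, cy + uy * s / 2)
    br = (tr[0] + nx * s, tr[1] + ny * s)
    bl = (tl[0] + nx * s, tl[1] + ny * s)
    return angle, [tl, tr, br, bl]


# Valleys on a horizontal line: no rotation is needed and the ROI is axis-aligned.
angle, corners = roi_square((0, 0), (100, 0))
```

Because the square is expressed relative to the valley line, the same code yields a rotated ROI when the hand is rotated in-plane, which is the normalization effect described above.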
A second category of approaches defines the contour of the extracted hand, and the distance from a point of reference (the geometric center [18], [35] or the wrist [36], etc.) to the pixels found on the contour [20], [37], [38], [39], [40], [41], [42], [43]. Considering this distribution of distances, the peaks generally correspond to the tips of the fingers, while the local minima correspond to the finger valleys. These types of approaches are extremely sensitive to segmentation artifacts and generally apply smoothing to the distribution of distances.
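A minimal sketch of this distance-profile idea is given below. It assumes an ordered, closed contour and a known reference point; the synthetic five-lobed contour, the moving-average window and all function names are illustrative assumptions rather than details of the cited methods.

```python
import math


def distance_profile(contour, ref):
    """Distance from the reference point to every contour point, in contour order."""
    return [math.hypot(x - ref[0], y - ref[1]) for x, y in contour]


def smooth(values, window=5):
    """Circular moving average, used to suppress segmentation artifacts."""
    n, half = len(values), window // 2
    return [sum(values[(i + k) % n] for k in range(-half, half + 1)) / window
            for i in range(n)]


def local_extrema(values):
    """Indices of local maxima (finger-tip candidates) and minima (valley candidates)."""
    n = len(values)
    peaks, valleys = [], []
    for i in range(n):
        prev, nxt = values[i - 1], values[(i + 1) % n]
        if values[i] > prev and values[i] > nxt:
            peaks.append(i)
        elif values[i] < prev and values[i] < nxt:
            valleys.append(i)
    return peaks, valleys


# Synthetic closed contour with five lobes standing in for a real hand contour.
contour = []
for k in range(360):
    t = 2 * math.pi * k / 360
    r = 100 + 30 * math.cos(5 * t)
    contour.append((r * math.cos(t), r * math.sin(t)))

profile = smooth(distance_profile(contour, (0.0, 0.0)))
peaks, valleys = local_extrema(profile)
```

On real contours the profile is far noisier, so the window size (and possibly a minimum peak prominence) would need tuning.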
A third category traverses all the contour pixels and counts the pixels belonging to the hand region (a circle was considered for sampling). Balwant et al. [44] introduced specific rules to determine the finger valleys and finger tips, followed by the correct selection of finger valley points that form an isosceles triangle. Goh Kah Ong et al. [45] considered sampling with fewer points using 3 stages corresponding to circles with greater radius. The outliers resulting from segmentation artifacts were removed with specific rules. Franzgrote et al. [46] further developed the approach proposed by Goh Kah Ong et al. by classifying the angles of remaining lines in order to provide a rough rotation normalization step. The finger valley points were then determined with a horizontal/vertical line (depending on the orientation of the hand), having 8 points of transition from non-hand region to hand region. Morales et al. [47] fitted a circle inside the binarized hand, with its center found equidistantly from the finger valleys (previously determined with the center-to-contour distances).
A fourth category uses the convex hull to describe the concavity of the binarized hand map and finger valleys [48], [49].
The following are methods that are hard to classify into one category or another, as they either employ very different techniques or combine several of the previously mentioned approaches. Khan et al. [50] determined the finger tips and the start of the palm by counting the hand-region pixels along the columns. After determining the pixels corresponding to finger valleys, several 2nd order polynomials were used to extrapolate the middle of the finger valleys. The palm's width was used to determine the size of the ROI (70% of palm size). This approach requires a specific hand pose, with hands always rotated towards the left with spread fingers. Han et al. [51] successively cropped the binarized hand image regions corresponding to fingers (after rotation normalization with PCA) by determining the number of transitions from background to hand area. Leng et al. [33] determined the finger valleys by computing differential maps upward, to the right and to the left. The AND operator was applied on these maps, resulting in 4 regions corresponding to all finger valleys. Ito et al. [40] considered an approach based on line detection after determining the binarized hand region, and subtracting the major lines corresponding to finger edges. Then a distance was computed from the center of the palm, allowing the detection of finger valleys even with closed fingers (not relying on spread fingers). Ito et al. compared the effectiveness of their approach with three other algorithms [33], [34], [51]. Liang et al. [52] used an ROI extraction approach loosely based on [34] and [53], where the tip of the middle finger was determined and then extended to the center of the palm 1.2 times. This point was then used as a reference to determine the distance to all contour points, allowing the detection of both finger valleys and tips. Wei et al. [25] exploited the constrained nature of acquisition (hand position, pose, scale and rotation) to base the ROI extraction on the accurate detection of the heart line's intersection with the edge of the hand (using the MFRAT defined in [54]), performing specific pixel operations to decide on the ROI's center and size. Kim et al. [21] combined several elements for ROI extraction, such as the use of a distance based on a YCbCr model, a specific hand pose (fingers spread) indicated by a guide displayed during acquisition, as well as validating finger valley points by sampling 10 pixels from the determined hand region. Shang et al. [55] modified the original Harris corner detection algorithm [56] in order to locate the points at the middle of finger valleys. However, this approach relied on constrained acquisition, as the background was not overly complex. Another approach using Harris corners was proposed by Javidnia et al. [57]. After obtaining an initial candidate for the hand region based on skin segmentation, the palm region was located using an iterative process based on the strength of the Harris corners.
However, none of the standard approaches for palmprint ROI extraction can be used in circumstances where the background's color remotely resembles skin color or the hand's pose is not constrained (such as the (C1, C2) datasets in Table I). Furthermore, one can point out the limitation of skin color segmentation regardless of the chosen color space, based on the inherent inability to classify a pixel as skin or non-skin [58].

B. Palmprint ROI Extraction based on Conventional ML Algorithms
There are few approaches using ML algorithms for ROI extraction, regressing either a predefined shape or a set of points. Initially, Doublet et al. [59] proposed fitting an Active Shape Model (ASM) to a number of points describing the shape of a hand (with spread fingers). The model regressed the output of a skin segmentation step, after which the centers of the two finger valleys were used to normalize the hand's rotation. Ferrer et al. [16] used a similar ASM to extract the hand from the background in the GPDS-CL1 dataset. Aykut et al. [60] considered an Active Appearance Model (AAM), which also took into account the texture information from the hand's surface. They also provided the first evaluation of predicted key-points. Because the acquisition of images was performed in a considerably constrained environment, no normalization was required relative to the palmprint's scale. Aykut et al. preferred to report the error in terms of pixels (from the ground truth points).
Recently, Shao et al. [28] employed a complex pipeline for ROI extraction for unconstrained palmprint recognition.
The approach included an initial stage of palmprint region detection using Histogram of Oriented Gradients (HOG) and a sliding window providing candidate regions at several scales to a pre-trained SVM classifier for palmprint detection. A tree regressor [61] (initially developed for face key-point detection) was then used for the landmark regression task applied to all 14 key-points. Unfortunately, Shao et al. did not provide details regarding the performance of their ROI extraction, how its accuracy influences the recognition task, or any comparison with prior algorithms.

C. Palmprint ROI Extraction based on Neural Networks
There have been only a handful of attempts to use Convolutional Neural Networks (CNNs) for ROI extraction, and most have consisted solely of experiments on gray-level images. Bao et al. [62] used the CASIA palmprint database [13] to determine the positions of a hand's finger valley points. They used a shallow network composed of 4 Convolutional and 2 Fully-Connected layers, including several Dropout and MaxPooling layers. The CNN architecture achieved results comparable to Zhang et al. [34] in stable conditions, but surpassed it when noise was added. Whereas a CNN can adapt to noisy or blurred images, the pixel-based approach used by Zhang et al. is vulnerable to any kind of image quality degradation.
Izadpanahkakhk et al. [63] trained a similar shallow network based on an existing model proposed by Chatfield et al. [64]. The network determined a point in the hand image and the corresponding width/height of the palmprint ROI. The network was composed of 5 Convolutional and 2 Fully-connected layers, including several MaxPooling layers and one Local Response Normalization Layer (LRN). The reported results are good for constrained images from HKPU [11], but the case of in-plane rotated hands was not considered.
Jaswal et al. [65] trained a Faster R-CNN [66] model based on Resnet-50 (87 layers) on three palmprint datasets (HKPU, CASIA and GPDS-CL1). They reported lower Accuracy and Recall rates for CASIA (up to 5% less) than for HKPU and GPDS-CL1. This can be explained by slightly larger variation in rotation. Similar to [63], the predicted bounding boxes (considered as ROIs) do not include measures for rotation normalization, which considerably affects the recognition rate for the scenario using images from CASIA, as they contain significant rotation variation. Comparatively, images from HKPU and GPDS-CL1 are already normalized rotation-wise.
Recently, Liu et al. [67] also considered a Fast R-CNN [68] for palmprint ROI detection. They acquired several videos of palmprints in 11 environments (no other details provided) where the hand pose was varied (from spread to closed fingers, with several hand orientations). These acquisition sessions resulted in 30,000 images that were used for training and testing. having 60% IoU (with the ground truth) affects the recognition task.
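For reference, the Intersection over Union (IoU) between a predicted ROI and its ground truth can be computed as in the short sketch below. It assumes axis-aligned boxes given as (x1, y1, x2, y2); the function name and box format are illustrative and are not taken from [67].

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; zero when the boxes do not intersect.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# A prediction shifted by half the ROI width overlaps only a third of the union.
score = iou((0, 0, 10, 10), (5, 0, 15, 10))
```

Note that IoU alone does not capture rotation errors, which matter for palmprint matching when the ROI is not rotation-normalized.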
An especially promising approach was proposed by Matkowski et al., who integrated a Spatial Transformer Network (STN) into ROI-LAnet, an architecture performing the palmprint ROI extraction. The STN was initially proposed by Jaderberg et al. [69] to improve the recognition of distorted digits. This is achieved by learning a thin plate spline transform based on a collection of points, a grid generator and a bilinear sampler. The STN learns a transformation T_θ that is differentiable with respect to the predicted coordinates θ based on the input feature map. ROI-LAnet uses a feature extraction network (based on the first 3 MaxPooling stages from the VGG16 network [70]) to obtain the feature map, followed by a regression network providing estimates for the 9 points used for describing the palmprint region (trained initially using L2 loss). The output of ROI-LAnet is a palmprint ROI of fixed size, which is normalized w.r.t. the hand's pose. The authors then include ROI-LAnet in a larger architecture to train it end-to-end using Softmax as the loss function.

D. Avoiding the ROI Detection Altogether
Tiwari et al. [22] provided a guide on the screen of the smartphone during acquisition, avoiding the need for an ROI step. Tiwari then used an algorithm to determine the best frames for feature extraction. Similar to Tiwari's approach, Leng et al. [71] presented a guide on the smartphone's screen, indicating a specific hand pose and orientation for the hand.
Afifi et al. [26] considered a different approach, having the entire image as the input to a CNN, thus removing any need for an ROI extraction phase. This approach is only feasible because all other parameters in the acquisition environment (background, lighting and hand orientation/scale) are kept constant.

IV. PALMPRINT FEATURE EXTRACTION AND MATCHING
This section presents a general overview of approaches used for palmprint feature extraction, with emphasis being placed on the more recent advancements. In this section, the algorithms are split into two categories, based on how the kernels used for feature extraction were obtained (as visualized in Fig. 4):
1) Conventional approaches:
   a) Encoding the line orientation at pixel-level with:
      i) Generic texture descriptors
      ii) Palmprint-specific descriptors.
   b) Encoding the line orientation at region-level, with:
      i) Generic texture descriptors, a special category including descriptors such as SIFT, SURF and ORB, which are treated separately
      ii) Palmprint-specific descriptors.
2) Neural Networks approaches:
   a) Having fixed kernels, such as ScatNet [72]
   b) Kernels learned based on a training distribution:
      i) With no non-linearities, such as PCANet [73]
      ii) Deep Learning approaches:
         A) Classifying with Softmax
         B) Using Siamese network architectures.
An overview of the more conventional approaches to palmprint feature extraction is presented in Table II, whereas an overview of the more recent approaches based on Neural Networks is presented in Table III.

A. Palmprint Feature Extraction - Conventional Approaches
Conventional palmprint recognition approaches are mainly focused on line-like feature detection, subspace learning or texture-based coding. Of these, the best performing approaches have been the texture-based ones [74], which will represent the main focus of this overview. For a broader description of the other groups, please refer to the work of Zhang et al. [74], Kong et al. [75] and Dewangan et al. [76].
Jia et al. [77] defined a framework that generalized the palmprint recognition approaches. The stages of feature encoding are broken down and populated with various approaches. The following sub-sections describe these approaches and provide results in the form of either Equal Error Rate (EER) or Recognition Rate (RR) corresponding to popular palmprint datasets such as HKPU [11], CASIA [13] or IITD [14].
1) Extracting Palmprint Features with Texture Descriptors: Chen et al. [78] used a 2D Symbolic Aggregate approximation (SAX) for palmprint recognition. The SAX represents a real valued data sequence using a string of discrete symbols or characters. Applied to grayscale images, it encodes the pixel values, essentially performing a form of compression. The low complexity and high efficiency of SAX make it suitable for resource-constrained devices.
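To make the SAX idea concrete, the sketch below applies the standard one-dimensional SAX steps (z-normalization, piecewise aggregate approximation, quantization against Gaussian breakpoints) to a single row of gray levels. This is a simplified illustration, not the 2D variant of Chen et al. [78]; the function name, the four-symbol alphabet and the assumption that the series length is divisible by the number of segments are all choices made here.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax(series, n_segments, alphabet="abcd"):
    """Encode a numeric series as a string of len(alphabet)-ary symbols."""
    # 1) z-normalize the series.
    mu, sigma = mean(series), pstdev(series)
    z = [(v - mu) / sigma if sigma > 0 else 0.0 for v in series]

    # 2) Piecewise aggregate approximation: mean of equal-length segments
    #    (assumes len(series) is divisible by n_segments).
    seg = len(z) // n_segments
    paa = [mean(z[i * seg:(i + 1) * seg]) for i in range(n_segments)]

    # 3) Breakpoints that split the standard normal into equiprobable bins.
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(k / a) for k in range(1, a)]

    # 4) Map every segment mean to the symbol of the bin it falls into.
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in paa)


# A row of gray levels that increases steadily from dark to bright.
word = sax([0, 0, 10, 10, 20, 20, 30, 30], n_segments=4)
```

Eight pixel values are reduced to a four-character word, which illustrates the compression effect mentioned above.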
Ramachandra et al. [79] employed a series of BSIF filters that were trained for texture description on a large dataset of images. The ROI is convolved with the bank of filters and then binarized (using a specific threshold value), allowing for an 8-bit encoding.
Jia et al. [80] investigated the potential use of HOG [81], which were successfully used in the past for robust object detection, especially pedestrians and faces. Furthermore, the Local Directional Pattern (LDP) [82] was evaluated in the context of palmprint feature extraction.
Zheng et al. [83] described the 2D palmprint ROI with a descriptor recovering 3D information, a feature entitled Difference of Vertex Normal Vectors (DoN). The DoN represents the response of the palmprint ROI to a specific filter containing several sub-regions (of 1 or -1) intersecting in the center of the filter (borders are made up of 0s), with various orientations. In order to match two DoN templates, a weighted sum of AND, OR and XOR operators was used.
Li et al. [84] extracted the Local Tetra Pattern (LTrP) [85] from a palmprint image that was initially filtered with a Gabor [86] or MFRAT [54] filter. Only the real component from the Gabor convolution was taken into consideration, after the winner-take-all rule of arg min was applied at pixel level between all filter orientations. Then, block-wise histograms of the LTrP values were concatenated in order to determine the final vector describing a palmprint image. Wang et al. [87] used the Local Binary Pattern (LBP), which encodes the value of a pixel based on a neighborhood around it [88]. Generally, the 3x3 kernel is used, allowing codes that range in value from 0 to 255.
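The basic 3x3 LBP operator mentioned above can be sketched in a few lines. The neighbor ordering (clockwise from the top-left) and the use of a greater-or-equal comparison are common conventions assumed here; they are not necessarily those used by Wang et al. [87].

```python
def lbp_3x3(img):
    """Basic 3x3 Local Binary Pattern codes (0..255) for all interior pixels.

    img is a 2D list of gray levels; the output is smaller by one pixel
    on each border because border pixels lack a full neighborhood.
    """
    # Clockwise neighbor offsets starting at the top-left pixel (assumed order).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = len(img), len(img[0])
    out = [[0] * (w - 2) for _ in range(h - 2)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # Set the bit when the neighbor is at least as bright as the center.
                if img[r + dr][c + dc] >= img[r][c]:
                    code |= 1 << bit
            out[r - 1][c - 1] = code
    return out


flat = [[7, 7, 7], [7, 7, 7], [7, 7, 7]]   # every neighbor equals the center
peak = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]   # the center is brighter than all neighbors
codes_flat = lbp_3x3(flat)
codes_peak = lbp_3x3(peak)
```

In a recognition pipeline the code image would typically be summarized by block-wise histograms before matching.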
An overview of these approaches is detailed in Table II under category (A0).
2) Encoding Palmprint Line Orientation at Pixel Level: One of the first approaches to extract the palmprint features from an ROI relied on only one Gabor filter oriented at π/4, entitled PalmCode [34]. Three values were used in the matching stage of PalmCode, namely the real and imaginary components of the filter response, as well as a segmentation mask to reduce the influence of poor ROI segmentation. Several approaches following a similar rationale were proposed in the years after PalmCode, with the introduction of Competitive Code (CompCode) [86] and Robust Line Orientation Code (RLOC) [54]. Both CompCode and RLOC used a competitive rule (arg min) between a bank of filters having 6 orientations. Every pixel from the palmprint ROI was considered to be part of a line, and as the lines in the palmprint correspond to black pixels, the minimum response was chosen. Whereas CompCode used the filter response from Gabor filters, RLOC used the response of a modified filter that Jia et al. called MFRAT (Modified Finite Radon Transform), as it was inspired by the Radon transform. In the case of CompCode only the real component was used.
Gaussian filters were also used, either as the derivative of two 2D Gaussian distributions (DoG [91]) or as the difference between two orthogonal 2D Gaussian filters (OLOF [90]).
Guo et al. [92] introduced the Binary Orientation Co-occurrence Vector (BOCV), which obtained the filter response of a Gabor filter bank and encoded every pixel relative to a specific threshold (0 or another threshold, chosen based on the distribution of values after convolution with a specific filter). Every filter response was L1 normalized prior to the encoding, after which the thresholded values from each orientation were used to encode an 8-bit number corresponding to every pixel. An extension of this approach was introduced by Zhang et al. [95] with EBOCV, which included masking the 'fragile' bits obtained after convolution with the Gabor filter bank (as performed previously on IrisCode [107] in the context of iris recognition). In this context, 'fragile' bits are interpreted as those of pixels whose filter response is close to 0 (after convolution). Khan et al. [50] introduced ContourCode, obtained by convolving the input ROI in two distinct stages. Initially, the filter response corresponding to a Non-subsampled Contourlet Transform (uniscale pyramidal filter) was obtained, after which the ROI was convolved with a directional filter bank. The strongest sub-band was determined (arg max) and the resulting code was binarized into a hash table structure. Fei et al. [96] introduced the Double-orientation Code (DOC), which encodes the two lowest responses (to a Gabor filter bank). In order to compute the distance between two ROIs, a non-linear angular distance measuring the dissimilarity of the two responses was determined.
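The competitive (arg min) rule can be illustrated with the plain-Python sketch below, which evaluates the real part of six Gabor filters at a single pixel and keeps the index of the weakest response. The kernel size, sigma, wavelength and aspect ratio are arbitrary values assumed for the toy example and are not the parameters of CompCode [86]; a practical implementation would filter the whole ROI with an array library.

```python
import math


def gabor_real(theta, half=8, sigma=3.0, wavelength=8.0, gamma=0.5):
    """Zero-mean real part of a Gabor kernel whose carrier varies along theta."""
    kernel = []
    for dy in range(-half, half + 1):
        row = []
        for dx in range(-half, half + 1):
            xr = dx * math.cos(theta) + dy * math.sin(theta)    # carrier axis
            yr = -dx * math.sin(theta) + dy * math.cos(theta)   # elongated axis
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            row.append(envelope * math.cos(2 * math.pi * xr / wavelength))
        kernel.append(row)
    # Remove the DC component so that uniform regions give a zero response.
    mean = sum(map(sum, kernel)) / (2 * half + 1) ** 2
    return [[v - mean for v in row] for row in kernel]


def competitive_code(img, r, c, n_orient=6, half=8):
    """Index of the orientation with the minimum filter response at pixel (r, c)."""
    responses = []
    for j in range(n_orient):
        k = gabor_real(j * math.pi / n_orient, half=half)
        resp = sum(k[dy + half][dx + half] * img[r + dy][c + dx]
                   for dy in range(-half, half + 1)
                   for dx in range(-half, half + 1))
        responses.append(resp)
    return min(range(n_orient), key=responses.__getitem__)


# Toy 17x17 ROIs: bright background (1.0) crossed by a one-pixel dark line (0.0).
size = 17
horizontal = [[0.0 if r == 8 else 1.0 for c in range(size)] for r in range(size)]
vertical = [[0.0 if c == 8 else 1.0 for c in range(size)] for r in range(size)]
code_h = competitive_code(horizontal, 8, 8)  # carrier across the line: theta = pi/2, index 3
code_v = competitive_code(vertical, 8, 8)    # carrier across the line: theta = 0, index 0
```

The minimum is taken because a dark line lying under the positive central lobe of the matching kernel produces the most negative response.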
Zheng et al. [97] investigated the effect of the number of filter orientations on the efficiency of CompCode [86] and RLOC [54]. A single orthogonal pair of Gabor or MFRAT filters was found to perform better than 6 orientations. This encoding approach was called Fast-CompCode/Fast-RLOC due to its increase in speed, mostly due to a reduction in complexity.
An interesting approach was introduced by Tabejamaat et al. [99], who described the concavity of a 2D palmprint ROI by convolving it with several Banana wavelet filters [108]. Three pairs of filters (positive and negative concavity) were convolved with the ROI and a competitive rule (arg min) was used for encoding. The joint representation was called Concavity Orientation Map (COM). An angular Hamming distance was then used for matching COMs.
An overview of these approaches is detailed in Table II under category (A1).
3) Region-based Palmprint Line Orientation Encoding: Jia et al. [80] introduced an analysis of region-based methods applied to palmprint recognition. They extended the RLOC encoding capabilities to the region-level by using the histogram of dominant orientations (after the arg min rule). The histograms of orientations were then concatenated. This approach essentially replaced the gradient information used in HOG with the dominant MFRAT filter response. For matching two palmprint templates, the L2 distance was used. Zhang et al. [17] used a similar approach to retrieve the blockwise histograms of CompCode orientations, but a Collaborative Representation Classifier (CRC) was used to perform the classification.
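A region-level descriptor of this kind can be sketched as below. This assumes a precomputed map of dominant orientation indices (integers in the range 0 to `n_orientations` - 1) and non-overlapping square blocks. The block size and function names are illustrative assumptions rather than the settings of [80].

```python
import numpy as np


def blockwise_orientation_histogram(code_map, n_orientations=6, block=16):
    """Concatenate normalized histograms of dominant orientations per block."""
    h, w = code_map.shape
    features = []
    for top in range(0, h - block + 1, block):
        for left in range(0, w - block + 1, block):
            patch = code_map[top:top + block, left:left + block]
            hist = np.bincount(patch.ravel(), minlength=n_orientations).astype(float)
            features.append(hist / hist.sum())  # each block histogram sums to 1
    return np.concatenate(features)


def l2_distance(a, b):
    """Euclidean distance between two concatenated histogram templates."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
```

For a 64x64 orientation map with 16x16 blocks and 6 orientations, the template has 16 x 6 = 96 elements.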
Kim et al. [21] used a modified version of CompCode, where a segmentation map was first determined by using the real values of the filter responses. This segmentation map was then used to compute the strongest gradients and compute the corresponding HOG. The Chi-square distance was used for matching palmprint templates.
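For reference, the Chi-square distance between two histogram templates can be sketched as below. Several variants of this distance exist (some include a factor of 1/2). The form shown here, and the small `eps` term guarding against empty bins, are assumptions and not necessarily the exact form used in [21].

```python
import numpy as np


def chi_square_distance(h1, h2, eps=1e-12):
    """Chi-square distance between two non-negative histograms of equal length."""
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    # Bins that are empty in both histograms contribute (almost) nothing.
    return float(np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))
```

Compared with the L2 distance, this weighting reduces the influence of differences in heavily populated bins.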
Li et al. [84] extended the general approach of Local Tetra Patterns [85] by replacing the derivative along the width and length with the filter response to MFRAT [54] or Gabor [86] filter banks. Furthermore, the encoding method was modified to take into account the thickness of the palm lines. The image was then separated into regions and histograms were computed for each region. Finally, they were concatenated and passed through a Kernel PCA filter to reduce the dimensionality of the template.
Luo et al. [89] introduced the Local Line Directional Pattern (LLDP), which represented an extension of general region encoding approaches (LDP [82], ELDP [109] and LDN [110]). The convolution stage replaced the use of Kirsch filters with Gabor or MFRAT filter banks. This step corresponds to replacing the general gradient information in a region with palmprint-specific line information. A similar approach was employed by Fei et al. [111] to encode the 2D information in the context of a 3D palmprint recognition system. The response to the Gabor bank of filters was encoded using the LBP [88] strategy. The system used a feature-level fusion technique. Fei et al. [101] introduced the Local Multiple Directional Pattern (LMDP) as a way of representing two strong line orientations when these were present, instead of choosing only the dominant line orientation. The block-wise histograms of LMDP codes were computed and matching was performed using the Chi-square distance. In a similar manner, Xu et al. [102] introduced SideCode as a robust form of CompCode, representing a combination of the dominant orientation with the side orientations in a weighted manner. Fei et al. [100] used the Neighboring Direction Indicator (NDI) to determine the dominant orientation for each pixel, along with its relation to the orientations of the neighboring regions in the image.
Jia et al. [77] introduced the Complete Directional Representation (CDR) code, encoding the line orientation information at 15 scales with 12 MFRAT filters. From these images 6 overlapping regions were extracted, resulting in 1080 regions. These features were then matched using Band Limited Phase-only Correlation (BLPOC) [112]. This approach was based on the average cross-phase spectrum of the 2D Fast Fourier Transforms (FFT) corresponding to two palmprint templates. The impulse centered on (x_0, y_0) corresponds to the probability of the two templates belonging to the same class (large if intra-class, low if inter-class).
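The principle of band-limited phase-only correlation can be sketched for a single pair of images as follows. This is a simplified illustration under assumptions: it does not average over several regions as described above, and the `band` fraction and the function name `blpoc_peak` are hypothetical.

```python
import numpy as np


def blpoc_peak(f, g, band=0.5, eps=1e-12):
    """Peak of the band-limited phase-only correlation between two images."""
    F = np.fft.fftshift(np.fft.fft2(f))
    G = np.fft.fftshift(np.fft.fft2(g))
    cross = F * np.conj(G)
    phase = cross / (np.abs(cross) + eps)  # keep only the phase of the cross spectrum
    h, w = phase.shape
    ch, cw = h // 2, w // 2  # position of the DC term after fftshift
    # Half-sizes of the retained low-frequency band, clamped to stay symmetric.
    kh = max(1, min(int(h * band / 2), (h - 1) // 2))
    kw = max(1, min(int(w * band / 2), (w - 1) // 2))
    limited = phase[ch - kh:ch + kh + 1, cw - kw:cw + kw + 1]
    corr = np.real(np.fft.ifft2(np.fft.ifftshift(limited)))
    return float(corr.max())
```

For two identical images the retained phase spectrum is constant, so the correlation surface is a single impulse of height close to 1. For unrelated images the energy is spread out and the peak is much lower.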
An overview of these approaches is detailed in Table II under category (A2).
4) Image Descriptors used for Palmprint Feature Extraction: Image descriptors such as the Scale Invariant Feature Transform (SIFT) [113] represented a major breakthrough for object detection in unconstrained conditions because of the rotation and scale invariance of SIFT key-points. This brought much interest to SIFT descriptors, which were applied to palmprint images either directly, such as in [105], [22], [114], or with modifications to one of their stages. Morales et al. [104] replaced the DoG with the Ordinal Line Oriented Feature (OLOF) in the key-point detection stage. Furthermore, the score determined from matching SIFT descriptors was fused with the OLOF matching prediction, making the prediction more robust. Zhao et al. [105] improved the initial key-point detection stage by filtering the palmprint image with a circular Gabor filter. The corresponding SIFT descriptors were then matched using a modified version of the RANSAC algorithm which used several iterations.
Kang et al. [106] introduced a more stable modified SIFT, called RootSIFT. Furthermore, histogram equalization of the gray-level image was added as a pre-processing stage. A mismatch removal algorithm (for SIFT descriptors) based on neighborhood search and LBP histograms further reduced the number of outliers.
Charfi et al. [43] used a sparse representation of the SIFT descriptors to perform the matching, as well as rank-level fusion with an SVM. Similarly, a rank-level fusion was performed by Chen et al. [103] matching SAX and SIFT descriptors.
Tiwari et al. matched SIFT and ORB [115] descriptors extracted from images acquired using smartphone cameras. As with most other approaches using SIFT descriptors, a dissimilarity function was defined, counting the number of inlier matches between two images. Srinivas et al. [116] used Speeded Up Robust Features (SURF) [117] to match two palmprint ROIs. They further improved the matching speed by only matching the SURF descriptors extracted from specific subregions of the ROI, instead of the entire surface of the ROI.
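RootSIFT is commonly described as an L1 normalization of each SIFT descriptor followed by an element-wise square root. A minimal sketch under that assumption is given below. The function name and `eps` term are illustrative, and the exact variant used in [106] may differ.

```python
import numpy as np


def root_sift(descriptors, eps=1e-12):
    """Map non-negative SIFT descriptors (one per row) to RootSIFT descriptors."""
    d = np.asarray(descriptors, dtype=float)
    # L1-normalize each descriptor, then take the element-wise square root.
    d = d / (np.abs(d).sum(axis=1, keepdims=True) + eps)
    return np.sqrt(d)
```

After this mapping each descriptor has unit L2 norm, and comparing descriptors with the Euclidean distance corresponds to comparing the original descriptors with the Hellinger kernel.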
An overview of these approaches is detailed in Table II under category (B).

B. CNN-based Approaches
One of the great advantages of using CNNs is that the filters are learned from a specific training distribution, which makes them relevant to the task of palmprint recognition. As opposed to traditional (hand-crafted) features, learned features can be trained to describe any distribution. The main disadvantage of this approach lies in its requirement for abundant and accurately labeled training data, which is often difficult to obtain.
The existing approaches for palmprint feature extraction relying on CNNs can be split into three categories:
• Using models pre-trained on ImageNet, where the network's output is considered to be the extracted feature. These approaches also rely on a classifier such as an SVM.
• Networks of filters optimised using various approaches.
• Training DNNs from scratch (or using transfer-learning) to determine embeddings that minimize intra-class distance and maximize inter-class distance.
1) Using pre-trained DNNs: Dian et al. [118] used AlexNet [134] pre-trained on ImageNet to extract deep features. These were then matched using the Hausdorff distance. In a similar fashion, Tarawneh et al. [119] used several networks pretrained on ImageNet (AlexNet, VGG16 [70] and VGG19). The extracted deep features from the images in two hand datasets (COEP [15] and MOHI [120]) were then matched using a multi-class SVM. Ramachandra et al. [121] used transfer-learning (AlexNet) to match palmprints acquired from infants. The class decision was obtained through a fusion rule, which took into consideration the prediction from an SVM, as well as the Softmax prediction of the network.
An overview of these approaches is presented in Table III under category (C1).
2) PCANet, ScatNet and PalmNet: Minaee et al. [72] employed a scattering network (ScatNet) that was first introduced by Bruna et al. [135] for pattern recognition tasks, notably because of its invariance to transformations such as translation and rotation. ScatNet uses Discrete Wavelet Transforms (DWT) as filters and considers the output(s) at each layer as the network outputs (not just the last layer), providing information regarding the interference of frequencies in a given image [135]. Minaee et al. used a filter bank of 5 scales and 6 orientations, the network having an architecture composed of 2 layers. The palmprint ROIs were split into blocks of 32x32 pixels and passed through the network, resulting in 12,512 scattering features. PCA was applied to reduce the dimensionality to the first 200 components. A linear SVM was then used for the classification task.
Chan et al. [136] initially introduced PCANet for general pattern recognition applications. Unlike DNNs, which make use of the Rectified Linear Unit (ReLU), PCANet does not contain any non-linearity. Instead, the filters are determined from a distribution of training images. Specifically, a series of overlapping blocks are extracted from every input image, after which the mean is removed. Based on the derived covariance matrix a number of eigenvectors are extracted (after being sorted, the top 8) and considered as filters belonging to the first layer. The input to the second layer is the set of input images to the first layer, convolved with the filters computed in layer 1. This process is repeated for any given number of layers, but generally architectures with 2 layers are commonplace. PCANet was used for palmprint feature extraction by Meraoumia et al. [73] on two datasets: CASIA Multispectral [94] and HKPU-MS [93]. For classification, both SVM and KNN reported 0% EER across all spectral bands for HKPU-MS and 0.12% EER for CASIA-MS. However, after applying a score-fusion scheme where the first 3 bands are used, the EER for CASIA-MS drops to 0%.
Recently, Genovese et al. [122] expanded the PCANet approach to include convolutions with fixed-size and variable-sized Gabor filters in the 2nd layer. The described architecture, entitled 'PalmNet', determines the Gabor filters with the strongest response, followed by a binarization layer. An alternative architecture is considered, entitled 'PalmNet-GaborPCA', where the filters of the first layer are configured using the PCA-based tuning procedure used in PCANet, whereas the kernels in the 2nd layer are configured using the Gabor-based tuning procedure. For classification, a simple KNN classifier is used. PalmNet represents an interesting approach for quickly training on large datasets of palmprints, at the same time requiring fewer resources than DNNs.
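The filter-learning step of the first PCANet layer, as described above, can be sketched as follows. This is a minimal sketch assuming grayscale images of equal size. The patch size of 7 and the function name `learn_pca_filters` are illustrative assumptions, while the choice of the top 8 eigenvectors follows the text.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def learn_pca_filters(images, patch=7, n_filters=8):
    """Learn first-layer filters as the leading eigenvectors of the patch covariance."""
    rows = []
    for img in images:
        # All overlapping patch x patch blocks, flattened to vectors.
        win = sliding_window_view(img, (patch, patch)).reshape(-1, patch * patch)
        rows.append(win - win.mean(axis=1, keepdims=True))  # remove each patch's mean
    X = np.concatenate(rows, axis=0)
    cov = X.T @ X / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_filters]
    return eigvecs[:, order].T.reshape(n_filters, patch, patch)
```

The second layer would repeat the same procedure on the images obtained by convolving the inputs with these filters.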
An overview of these approaches is presented in Table III under category (C2).
3) Training DNNs: The main distinction separating approaches in this category is the training strategy being used. If the classification task is borrowed from the standard pattern recognition problem (like the ImageNet challenge), then the CNN is required to predict the class to which an input palmprint belongs. The network's last layer is fully connected, with a number of units corresponding to the number of classes (the target being a one-hot vector whose length depends on the number of classes in the dataset), and with Softmax as the activation function (expressing the probability of the input image belonging to each class). In this case, the loss function is the cross-entropy. Example implementations include [23], [26], [63], [124], [126], [127].
Fei et al. [126] compared the performance of several networks like AlexNet, VGG16, InceptionV3 and ResNet50. Izadpanahkakhk et al. [23] trained and evaluated four networks (GoogLeNet, VGG16, VGG19 and a CNN developed by Chatfield et al. [64] for the ImageNet challenge) on two novel palmprint datasets. Alternatively, after training with cross-entropy loss, the output from the logits layer (the layer preceding the Softmax layer) can be considered as the extracted feature, which is then used to train a classifier such as SVM [26], CRC [127] or Random Forest Classifier (RFC) [63]. Zhang et al. [125] used a combination of cross-entropy and center-loss functions during training for multi-spectral palmprint matching. After learning a representation of palmprints, they then fed the embeddings (output of the logits layer) to an SVM. Afifi et al. also considered separating the input image's information into high-frequency and low-frequency components, resulting in a two-stream CNN. The two branches are later concatenated, to allow training based on classification. The outputs of several of these layers are then concatenated and classified using an SVM which employs a SUM rule for fusion.
Matkowski et al. [30] provided the first CNN-based solution for palmprint recognition which was trained End-to-End (EE-PRnet) for palmprint feature extraction. This architecture was composed of the previously mentioned ROI-LAnet and FERnet, which was also based on a pre-trained VGG16 architecture (pruned after the 3rd Maxpool). This was followed by two fully connected (FC) layers benefiting from Dropout regularization. The network is trained using cross-entropy (a 3rd FC layer was added to the network, corresponding to palmprint classes), but the authors explore several training scenarios regarding the Dropout layers, or fine-tuning specific blocks in FERnet. Furthermore, a color augmentation protocol, consisting of randomly shifting the saturation and contrast of images, was performed on-the-fly during training. After obtaining the palmprint embeddings (from the 2nd FC layer), they are matched using Partial Least Squares regression (PLS) [128], linear SVM, KNN-1 and Softmax. The best results were obtained using PLS. Overall, the EE-PRnet provides the best results, showing that training both networks (ROI-LAnet and FERnet) together allows the architecture to reach a better understanding of the features contained in the palmprint, as well as the distortions brought by the hand's pose. Furthermore, this setup provides a considerable advantage, as the input to the network is the full image, not a cropped image of the hand.
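The Softmax activation and cross-entropy loss mentioned above can be written compactly as below. This is a generic NumPy sketch operating on a batch of logits and integer class labels, independent of any particular network. The function names are illustrative.

```python
import numpy as np


def softmax(logits):
    """Row-wise Softmax, shifted by the row maximum for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(logits, labels):
    """Mean cross-entropy between Softmax probabilities and integer class labels."""
    probs = softmax(np.asarray(logits, dtype=float))
    picked = probs[np.arange(len(labels)), labels]  # probability of the true class
    return float(-np.mean(np.log(picked + 1e-12)))
```

With uniform logits over C classes the loss equals log(C), which is the value expected from an untrained classifier.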
An overview of these approaches is presented in Table III under category (C3-A).
Another training approach is to use the Siamese architecture (overview presented in Table III), characterized by two inputs (or several) resulting in two embeddings (usually 128 units corresponding to the last fully-connected layer) that are then compared with a loss function to determine how similar they are versus how similar they should be. This architecture, where the same network outputs the two embeddings, relies on a similarity estimation function, such as the Contrastive loss [137], or the Center loss [138], where the distance between inputs is minimized (intra-class) or increased (inter-class). When three inputs (triplets) are considered, the distance between the anchor and the positive sample is reduced while increasing the distance between the anchor and the negative sample [139]. Svoboda et al. [129] introduced a loss function called 'discriminative index', aimed at separating genuine-impostor distributions. Zhong et al. [130] used transfer-learning based on VGG16 (initially trained on ImageNet) and Contrastive loss.
Zhang et al. [29] used a Siamese architecture of two MobileNets [140] outputting feature vectors that are then fed to a sub-network tasked with estimating the intra-class probability (0 for inter-class and 1 for intra-class, with 0.5 as a decision threshold). It is not clear, however, what loss function they used (most likely contrastive loss). Du et al. [133] used a similar architecture trained using the few-shot strategy. Shao et al. [141] used the output of a 3-layer Siamese network, and matched the palmprints from two datasets (HKPU-Multispectral and a dataset collected with a smartphone camera) with a Graph Neural Network (GNN). Unfortunately, the training details of the Siamese network are not clear. Liu et al. [67] introduced the soft-shifted triplet loss as a 2D embedding specifically developed for palmprint recognition (instead of a 1D embedding). Furthermore, translations on x and y axes were used to determine the best candidates for triplet pairs (at batch level). Recently, Shao et al. [28] introduced an approach based on hash coding, where the embeddings used to encode the palmprint classes are binary (each element is either 0 or 1). Furthermore, similar matching performances were obtained using a much smaller network, obtained via Knowledge Distillation [131]. These are worthwhile directions for development, as they represent solutions to the limitations of resource-constrained devices.
A promising strategy for cross-device palmprint matching was recently proposed by Shao et al. [132] with PalmGAN, where a cycle Generative Adversarial Network (cycle GAN) [142] was used to perform cross-domain transformation between palmprint ROIs. A proof of concept was evaluated on the HKPU-Multispectral (HKPU-MS) palmprint dataset containing palm images acquired at several wavelengths, as well as a semi-unconstrained dataset acquired with several devices.
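The pairwise and triplet objectives discussed above can be sketched on precomputed embeddings as follows. The margin values, the use of squared Euclidean distances in the triplet term and the function names are assumptions for illustration. Individual papers use different variants.

```python
import numpy as np


def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    """Contrastive loss for a batch of pairs; same[i] is 1 for genuine pairs, else 0."""
    same = np.asarray(same, dtype=float)
    d = np.linalg.norm(emb_a - emb_b, axis=1)
    pull = same * d ** 2  # genuine pairs are pulled together
    push = (1.0 - same) * np.maximum(margin - d, 0.0) ** 2  # impostors pushed past the margin
    return float(0.5 * np.mean(pull + push))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss using squared Euclidean distances."""
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))
```

In both cases the loss is zero once impostor samples are farther from the anchor than the margin requires, so only hard pairs or triplets contribute to the gradient.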
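The practical appeal of binary embeddings for resource-constrained devices can be illustrated with a small sketch of matching by Hamming distance. Thresholding a real-valued embedding at 0 is an assumption made here for illustration only. In [28] the hash codes are learned by the network, and the function names below are hypothetical.

```python
import numpy as np


def binarize(embeddings, threshold=0.0):
    """Turn real-valued embeddings into 0/1 codes (illustrative thresholding)."""
    return (np.asarray(embeddings) > threshold).astype(np.uint8)


def hamming_match(query_code, gallery_codes):
    """Return the index of the closest gallery code and its Hamming distance."""
    dists = np.count_nonzero(gallery_codes != query_code, axis=1)
    best = int(np.argmin(dists))
    return best, int(dists[best])


def packed_size_bytes(code):
    """Storage needed for one code when 8 bits are packed into each byte."""
    return int(np.packbits(code).nbytes)
```

A 128-bit code stored this way occupies 16 bytes, compared with 512 bytes for a 128-dimensional 32-bit floating point embedding.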
An overview of these approaches is presented in Table III under category (C3-B).

A. Palmprint Datasets
The advancement of palmprint recognition relies on the release of relevant datasets which reflect specific sets of requirements. Initially the main focus was placed on recognition, allowing little to no flexibility in terms of interaction with the system (e.g. HKPU [11]). As the sensor technology progressed (and new consumer devices appeared on the market), there was more room for various aspects, i.e. contactless systems (IITD [14], CASIA [13]). Then invariance to various factors of the acquisition encouraged the introduction of datasets like BERC [21] (background), or 11K Hands [26] (hand pose) and PRADD [25] (devices used for acquisition). Unfortunately there are several datasets that are no longer available to researchers, such as PRADD [25] or DevPhone [20]. Some recently introduced datasets are yet to be released to the research community (e.g. HFUT [19], MPD [29] or XJTU-UP [28]).
Following the general trend of biometric recognition migrating to consumer devices, recent years have seen the introduction of several large-scale palmprint datasets (e.g. XJTU-UP [28]) reflecting the challenging operating conditions brought by a mobile environment. A new category of unconstrained palmprint datasets was recently introduced with NTU-PI-v1 [30], adding palmprints acquired with conventional cameras to the list of forensic applications. This collection of palmprints gathered from the Internet proved to be especially challenging, given the low resolution of images, the high degree of distortion, as well as the large number of hand classes. It is our opinion that these will be the most meaningful palmprint datasets for the upcoming 5-10 years, anticipating the adoption of palmprint recognition on smartphones and other devices. An overview of this transition was presented, the culmination of which is represented by the class of fully unconstrained datasets, initiated with the introduction of NUIG Palm1 [10] in 2017.

B. Palmprint ROI Extraction
The approaches used for palmprint region of interest extraction are linked directly with the operating conditions of devices used for acquisition. In palmprint datasets where the background is fixed (e.g. HKPU, CASIA, IITD, COEP) the task of segmentation is a straightforward procedure. However, when the background is unconstrained, as is the case with images from BERC, skin color thresholding provides limited results, even when the skin model is computed for every image based on a distribution of pixels [21].
With the migration of palmprint recognition onto consumer devices, the general pipeline for ROI extraction needs to take into consideration more challenging factors such as lighting conditions, hand pose and camera sensor variation. It is in this context that more powerful approaches based on machine learning or deep learning can provide robust solutions without imposing strict protocols for acquisition onto the user of consumer devices. A complete evaluation of these approaches is yet to be made in terms of:
1) The prediction error of the key-points used for ROI extraction/alignment. This seems to have been a commonly overlooked step in most research papers, with some exceptions (e.g. [50]).
2) Recognition rate and the main sources of error (from the ROI extraction) affecting recognition.
3) Running time and resource requirements, especially for CNN-based approaches. Low inference time is expected from all solutions running on consumer devices.
Furthermore, at the time of writing of this literature review, there are no CNN-based solutions to detect the palmprint in unconstrained environments, besides the Fast R-CNN approach demonstrated by Liu et al. [67]. The recent use of a CNN for the normalization of palmprint ROIs with respect to hand pose by Matkowski et al. [30] has opened up exciting new possibilities for unconstrained palmprint ROI extraction (although they do not address the task of palmprint detection). The Spatial Transformer Network learns a non-affine transform applied to the ROI, defined by the palmprint's labeled key-points. Alternatively, pose correction could be made using 3D information, similar to the work of Kanhangad et al. [143]. Although a special 3D sensor is used in [143], the hand's 3D structure can be recovered from the 2D image with hand pose estimation algorithms (as was developed by Mueller et al. [144]).

C. Palmprint Feature Extraction
Since palmprint recognition took off in the early 2000s with the introduction of the HKPU [11] dataset, the pipeline stage that has received the most attention from the research community has been palmprint feature extraction. As was the case for iris and face recognition, CNNs have become the current state of the art in palmprint recognition (Section IV-B). The general trend is to train either a network using cross-entropy or center loss (e.g. [26], [23], [125], [126], [30]) or a Siamese network (e.g. [129], [67], [29], [132]), but there are also entirely linear networks (PCANet [73] and PalmNet [122]).
It is important to note that most of these works use images acquired with smartphones in their training/evaluation scenarios (on datasets such as XJTU-UP [28] and MPD [29]). Cross-device training and matching will become a main focus, especially for device-independent palmprint recognition solutions, as demonstrated by [30]. This was first investigated in [10], with impressive results being obtained in [67] and [30]. The cross-domain conversion of a palmprint ROI using a generative approach [132] also represents a promising direction of research. A GAN-based architecture might benefit from the ROI pose-normalization approach introduced by Matkowski et al. [30], where the ROI extraction network contains a Spatial Transformer Network [69].
The complexity of architectures becomes an important factor to optimize for devices with limited resources, as in [28], where the network is distilled (number of layers is reduced) and the network's output is a discrete hash code (binary values). This not only reduces the processing requirements (including matching), but also the storage space necessary when dealing with a large number of classes. An alternative approach would be to consider the ternarization of networks [145].
As in the case of ROI extraction algorithms, the feature extraction approaches (especially the CNN-based solutions) require an evaluation in terms of processing time, as this aspect is touched upon in only a few papers (e.g. [21] and [67]).