A Hybrid Model Combining Learning Distance Metric and DAG Support Vector Machine for Multimodal Biometric Recognition

Metric learning has significantly improved machine learning applications such as face re-identification and image classification using K-Nearest Neighbor (KNN) and Support Vector Machine (SVM) classifiers. However, to the best of our knowledge, it has not yet been investigated for the multimodal biometric recognition problem, especially in immigration, forensic, and surveillance applications with uncontrolled ear datasets. This motivates a novel framework for multimodal biometric recognition based on Learning Distance Metric (LDM) via kernel SVM. This paper considers metric learning for SVM by investigating a hybrid Learning Distance Metric and Directed Acyclic Graph SVM (LDM-DAGSVM) model for multimodal biometric recognition, where LDM and DAGSVM are two emerging techniques for classification problems. Different from existing multimodal biometric recognition methods, the proposed approach learns a Mahalanobis distance metric via kernel SVM to simultaneously maximize the inter-class variations and minimize the intra-class variations. Experimental results on uncontrolled datasets such as the AR face and AWE ear datasets show that the proposed approach achieves competitive performance compared with models working on individual modalities and outperforms the state-of-the-art multimodal methods. The proposed model achieves a five-fold classification accuracy of around 99.85% for the face and ear images.


I. INTRODUCTION
The associate editor coordinating the review of this manuscript and approving it for publication was Andrea F. Abate.
Unimodal biometric authentication systems have received growing attention in the last decades for intelligent applications such as the Internet of Things (IoT), Automated Teller Machines (ATMs), surveillance, immigration, and mobile applications. However, some unimodal biometric authentication systems are potentially vulnerable to forgery, which means that the biometric system can be cheated [1]. For example, the fingerprint is the most popular biometric trait due to its perceived uniqueness and persistence [2]. However, the Apple iPhone Touch ID fingerprint reader may be deceived by non-authentic fingers [3]. Furthermore, a major challenge for unimodal biometric recognition systems is to select discriminative features that accomplish personal authentication in the presence of large variations among biometric samples of the same person. Thus, one type of feature is usually neither efficient nor sufficient to predict the right subject, especially for images collected under varying conditions such as illumination, rotation, and occlusion. Therefore, most researchers pay more attention to multimodal biometric recognition to increase identification performance and to provide more security. To deal with multimodal biometric recognition, feature fusion has become a very important research aspect [4]-[12], because fusing different types of features provides complementary information. Most of the recent human recognition works [10], [13]-[17] utilized feature-level fusion to overcome the challenges of constrained resources and to increase system security and performance. The well-known traditional feature fusion methods are serial and parallel fusion.
Serial fusion is applied by concatenating two or more feature vectors into a single feature vector. Parallel fusion, on the other hand, is difficult to apply to more than two feature sets, because it builds a complex vector by taking one feature set as the real component and another feature set as the imaginary component, which limits its application.
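To make the two fusion schemes concrete, the following minimal Python sketch contrasts them on two toy feature vectors. The function names, the toy values, and the zero-padding of the shorter vector in the parallel case are illustrative assumptions and are not taken from the cited methods.

```python
import numpy as np

def serial_fusion(a, b):
    """Serial fusion: concatenate two real feature vectors into one longer vector."""
    return np.concatenate([a, b])

def parallel_fusion(a, b):
    """Parallel fusion: one vector becomes the real part, the other the imaginary part.

    Assumption: the shorter vector is zero-padded so both parts have equal length.
    """
    n = max(a.size, b.size)
    real = np.pad(a, (0, n - a.size))
    imag = np.pad(b, (0, n - b.size))
    return real + 1j * imag

# Toy stand-ins for two descriptors of the same image (hypothetical values).
feat_a = np.array([0.2, 0.5, 0.3])
feat_b = np.array([0.1, 0.9])

serial = serial_fusion(feat_a, feat_b)      # length 3 + 2 = 5, real-valued
parallel = parallel_fusion(feat_a, feat_b)  # length max(3, 2) = 3, complex-valued
```

The sketch also shows why parallel fusion does not extend naturally beyond two feature sets: a complex number has only two components to fill.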
Recently, recognition metrics have become an important research point in the development of multimodal biometric recognition systems. To the best of our knowledge, most of the previous works [4], [9], [10], [13], [15], [16] have adopted traditional distance metrics and classifiers. Because the number of images per person is usually limited to 3∼5 noisy images, traditional distance metrics and classifiers cannot achieve the desired performance. Moreover, they are sensitive to noise. To deal with classification problems and achieve better generalization ability, metric learning for SVM is considered. As the LDM and the DAGSVM are two emerging techniques for classification problems and can achieve better generalization ability than traditional distance metrics and classifiers, the LDM-DAGSVM model is investigated for the multimodal biometric recognition system.
In addition, there is no reported work based on LDM for SVM. Learning models have received much attention in the last decades, especially for machine learning applications such as classification [18]-[23], face recognition [24], [25], and human re-identification [26]. Metric learning aims to learn a valid distance or a similarity function for a given problem from the given training data. However, most existing metric learning methods are based on convex and non-convex optimization algorithms or multiple kernel classification. In this paper, we investigate a kernel classification model that can be used to improve state-of-the-art multimodal biometric recognition systems based on metric learning algorithms. In addition, multimodal biometric recognition systems can be enhanced using metric learning approaches that utilize well-known efficient classifiers such as K-Nearest Neighbor (KNN) and SVM. Therefore, the main motivation of our proposed multimodal biometric recognition system includes two aspects: the representation of face and ear images, and classification. This paper exploits the advantages of local feature fusion to represent face and ear images by using Discriminant Correlation Analysis (DCA) as a feature fusion algorithm that enhances efficiency and effectiveness [11], [15]. The proposed work presents a kernel DAGSVM that improves SVM performance by learning the Mahalanobis distance with a Radial Basis Function (RBF) kernel for a multimodal biometric recognition system.
Hence, a multimodal biometric recognition system based on LDM is investigated in this paper to achieve higher classification rates for face and ear images. The experiments are performed on the Annotated Web Ear (AWE) dataset, the Mathematical Analysis of Images (AMI) ear dataset, the Georgia Tech and Olivetti Research Laboratory (ORL) face datasets, and the AR face dataset. The experimental results indicate that fusing individual modalities can improve the overall performance of the human biometric recognition system, even in the case of low-quality data. The results also demonstrate that the proposed model performs better than classical and traditional multimodal biometric models. The adopted LDM-DAGSVM model is particularly useful for two reasons. First, it achieves classification accuracies comparable with those of the state-of-the-art methods, and it clearly outperforms the other multimodal biometric methods not explicitly geared towards LDM. Second, it provides researchers in multimodal biometric recognition with a convenient way of using various metric learning algorithms via SVM. The major contributions of this work are five-fold.
1. This paper presents a novel framework for multibiometric recognition through LDM and kernel SVM. [...] biometric recognition model outperforms other state-of-the-art methods. In addition, this work indicates a new direction for the multimodal biometric recognition problem by developing new systems using metric learning approaches with various well-known classifiers such as KNN and SVM.
This paper is organized as follows. In Section II, we give a brief review of the related work. In Section III, we present the proposed model in detail. In Section IV, we evaluate and compare the proposed model with other state-of-the-art methods on three complicated datasets. Finally, the conclusion and future work are presented in Section V.

II. RELATED WORK
Automated biometric authentication refers to automated human recognition based on physiological and/or behavioral characteristics. The biometric authentication task is to predict whether two biometric traits belong to the same person or not. Human biometric authentication has recently received increasing attention in many intelligent applications such as IoT, mobile applications, ATMs, and surveillance. Human biometric traits include physiological traits such as face, fingerprint, palm print, iris, and ear, and behavioral traits such as gait, keystrokes, voice, and signature. Face and ear recognition have received a lot of attention in the last decade, as these two traits have proved to be promising biometric traits due to their uniqueness, collectability, permanence, and universality [2], [27]. However, each has its own advantages and limitations [28], [29]. Both face and ear traits are large and visible for acquisition, which means they are passive, non-intrusive traits that can be collected with one sensor. However, face and ear recognition systems face many challenges such as facial expression, makeup, masks, rotation, and occlusion from hair, glasses, or rings. On the other hand, human ear images carry information that is complementary to the face image. The human ear has a stable structure across expression and age, and it is visible and large, which means it can be collected without user cooperation; furthermore, the ears of identical twins and triplets are different [30].
Chang et al. [28] presented a multimodal biometric recognition system for human identification based on standard Eigen-faces and Eigen-ears to represent face and ear images, respectively. They adopted feature-level fusion based on a simple concatenation, which may trigger the dimension problem. The authors in [9] motivated the use of Kernel Principal Component Analysis (KPCA) to solve the dimension problem, but with a modest accuracy of around 94.5% [28]. Therefore, several attempts [10], [15], [31]-[34] have presented multimodal biometric recognition systems based on face and ear images with fusion at various levels to improve the recognition accuracy and the system performance. To overcome the above-mentioned limitations, the authors of [34] and [31] exploited 3D ear images and sparse representation for classification, respectively.
Mahoor et al. [34] fused 3D ear and 2D face images to build a multimodal biometric recognition system. An active shape model and a Gabor filter were used to extract landmarks from face images. Then, to obtain the 3D ear shape, they exploited the Shape-From-Shading (SFS) algorithm. More recently, 3D ear shape images have also been used in [35] with block-wise statistics to improve the ear recognition system. However, the computational cost of 3D ear recognition systems is still high [36], [37], which limits their use in real-time applications. We can conclude that previous research has shown that multimodal biometric authentication increases system security and recognition accuracy, and compensates for the limited resources of unimodal biometric recognition systems [15], [28], [32], [38]. Multimodal biometric recognition systems are better protected against forgery and theft and can achieve a higher security level than unimodal biometric recognition systems, which can be cheated. It has been well established that multimodal biometric recognition systems can overcome the respective limitations of unimodal systems. It is therefore worthwhile to exploit these advantages to develop robust, efficient, and effective multimodal human recognition systems.
The main goal of metric learning is to learn a valid distance or a similarity function for a given problem from the training data. Many existing metric learning approaches are based on non-convex and convex optimization techniques or multi-kernel classification. The Mahalanobis distance is the most widely used distance in metric learning research. The Mahalanobis distance $M_A$ between two samples $x_i$ and $x_j$ is the squared Euclidean distance in the new mapping space, where $A$ is a positive semi-definite matrix. SVM is a popular classifier that can utilize the Mahalanobis distance, and it has two main composition methods for multiple classes: one-versus-one (1-v-1) and one-versus-rest (1-v-r). The 1-v-1 SVM is a binary classifier. However, many applications require multi-class classification with SVM. Hence, several multi-class classification methods have been presented in [39]-[42]. The DAGSVM is one such multi-class SVM classification method. It works as the 1-v-1 method in the training phase and adopts a rooted binary Directed Acyclic Graph (DAG) in the testing phase [39], [43], [44]. In contrast, the 1-v-r SVM classifier takes more time in the learning process and scales linearly with the number of classes. In addition, some of these methods struggle in some applications, especially when the feature dimension is high or the training set is large. For non-linear SVMs, which separate the data points with a non-linear decision boundary, several kernels have been proposed, such as the Polynomial, Quadratic, and RBF kernels. Among them, the RBF kernel is the most popular; it is used in many applications and achieves good classification performance.
In our work, feature-level fusion, which retains more details and information, is adopted. It leads to better recognition performance. The LBP [45] and HOG [46] features are used to represent the face and ear images. The LBP is known as a powerful texture descriptor that improves system performance and recognition accuracy when used together with HOG features [47]. Moreover, the proposed model is based on learning a distance metric and kernel SVM, which can be easily integrated into a multimodal biometric recognition system to increase the system performance. However, selecting and weighting features remain major challenges in biometric applications. Selecting discriminative features may be quite challenging when the feature vectors $x \in \mathbb{R}^{p}$ lie in a space of high dimension $p$. Hence, it is necessary to find good discriminant features of lower dimension to represent face and ear images. The DCA algorithm was proposed for this purpose in [16]. Here, it is used to combine discriminative local features and to reduce the dimensionality of the data matrix before learning the distance metric.

III. THE PROPOSED LEARNING FRAMEWORK: LEARNING MAHALANOBIS DISTANCE VIA DAGSVM
The proposed framework depends on LBP and HOG features for face and ear image representation. To select discriminative features of lower dimension for each image, DCA is exploited both as a dimension reduction technique and as a feature fusion method that incorporates the class associations into correlation analysis to reflect the class structure information. After DCA is conducted on the whole face, ear, or multimodal dataset, N − 1 correlation components are left, where N is the number of subjects in each dataset. Algorithm 1 summarizes the DCA feature fusion algorithm; for more details, see [16].
A hybrid LDM-DAGSVM model has been developed to achieve effective and robust multimodal biometric recognition based on face and ear images. Mahalanobis distance metric learning aims to search for a square matrix learned from the training set. The SVM has a better generalization ability than traditional classifiers such as the KNN using the Euclidean distance. In the following subsections, we introduce each part of the proposed model.

A. SUPPORT VECTOR MACHINE (SVM)
Algorithm 1 (DCA feature fusion, surviving steps): [...] where r is the first largest non-zero eigenvalue and $U_r$ is the corresponding eigenvector. 5. Project $X_1$ and $X_2$ to $Z_1 = P^{T} X_1$ and $Z_2 = P^{T} X_2$. 6. Compute the between-set covariance matrix of the transformed feature set.
The original SVM is designed for 2-class classification. It aims to find a hyper-plane that separates the classes and maximizes the margin, where the margin is the distance between the hyper-plane and the closest points of both classes. In addition, SVM adopts structural risk minimization, which aims to keep the misclassification risk low on unseen data. Therefore, the SVM has a better generalization ability than traditional classifiers. Assume that the training dataset consists of pairs $(X_i, y_i)$ with $y_i \in \{-1, 1\}$, where $y_i$ is the class label corresponding to $X_i$. The classification mapping is then considered as in Eq. (1).
In Eq. (1), $x_i$ results from mapping $X_i$ according to $V : \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}$, a feature mapping that maps the input features nonlinearly to a high-dimensional space [39]; $W \in \mathbb{R}^{m}$ is the weight vector, and $b$ is a bias term.
The optimization problem of SVM for separating the classes is given in Eq. (2), where $\rho$ is the geometric margin between the two classes, which is to be maximized.
Eq. (2) is equivalent to Eq. (3).
In Eq. (3), $\|W\|$ is the norm of the normal (weight) vector of the hyper-plane. To solve the constrained optimization problem, the primal Lagrange form is given in Eq. (4).
In Eq. (4), $\alpha_i \geq 0$ are the Lagrange multipliers, and $V$ represents the mapping. The decision function for classifying a test sample $x$ is then formed as in Eq. (5), in which $K(x, x_i)$ is the kernel function [39], which can be any function that satisfies Mercer's conditions [48]. In this work, the most common kernel function, the Gaussian RBF kernel, is adopted for the SVM classifier, as given in Eq. (6).
In Eq. (6), $\sigma$ is the standard deviation of the Gaussian distribution.
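For reference, the quantities named in this subsection can be written in the standard hard-margin form below. This is a sketch that assumes the textbook formulation and reuses the symbols defined above ($W$, $b$, $\rho$, $\alpha_i$, $V$, $K$, $\sigma$); the real-valued output $D(x)$ is assumed to be thresholded at zero to obtain the class label, and the factor $2\sigma^{2}$ in the kernel is likewise an assumption.

```latex
\begin{align}
f(X_i) &= W^{T} x_i + b, \qquad x_i = V(X_i) \tag{1} \\
\max_{W,\,b}\ \rho \quad &\text{s.t.}\quad y_i\,(W^{T} x_i + b) \ge \rho,\quad \|W\| = 1 \tag{2} \\
\min_{W,\,b}\ \tfrac{1}{2}\,\|W\|^{2} \quad &\text{s.t.}\quad y_i\,(W^{T} x_i + b) \ge 1 \tag{3} \\
L_P(W, b, \alpha) &= \tfrac{1}{2}\,\|W\|^{2} - \sum_{i} \alpha_i \left[\, y_i\,(W^{T} x_i + b) - 1 \,\right] \tag{4} \\
D(x) &= \sum_{i} \alpha_i\, y_i\, K(x, x_i) + b \tag{5} \\
K(x, x_i) &= \exp\!\left( -\frac{\|x - x_i\|^{2}}{2\sigma^{2}} \right) \tag{6}
\end{align}
```

Setting the derivative of (4) with respect to $W$ to zero gives $W = \sum_i \alpha_i y_i x_i$, which is how the kernel $K(x, x_i) = V(x)^{T} V(x_i)$ enters (5).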

B. DIRECTED ACYCLIC GRAPH SUPPORT VECTOR MACHINE (DAGSVM)
In graph theory, a DAG is a graph with directed edges and no cycles among its vertices; directed edges go only one way from one vertex to another. The DAGSVM [49] and DAGKNN adopt a rooted binary decision DAG learning structure [39], [43], [44]. However, DAGSVM has shown greater success than DAGKNN, especially with RBF kernel functions. The training model of DAGSVM [49] is the same as the 1-v-1 model, which is solved by n(n−1)/2 binary SVM classifiers for n classes. For testing, on the other hand, DAGSVM uses a rooted binary DAG that has n(n−1)/2 internal nodes organized in a diamond shape and n leaves labeled by classes, as shown in Figure 1, which illustrates the decision diagram of DAGSVM for 4-class data. Suppose X is a test sample. Classification starts from the root node, and each node works as a binary SVM classifier for a pair of classes i and j. We then move along either the left edge or the right edge, depending on the SVM output value. Following this path, we finally reach a leaf node that indicates the predicted class. DAGSVMs have several advantages: analysis shows that they generalize well, and the DAGSVM testing time is less than that of 1-v-1 SVM, since each node is trained only with the labeled classes of that node, as shown at the left side of each diamond node in Fig. 1.
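The testing path described above can be sketched in Python as a list-elimination walk: each node compares the first and last classes still in play and discards the loser, so n classes need n − 1 binary evaluations, compared with n(n−1)/2 for 1-v-1 voting. The pairwise decision function here is a hypothetical nearest-centroid stand-in for the trained binary SVMs, and all names and values are illustrative assumptions.

```python
import numpy as np

def dag_predict(x, classes, pairwise_decide):
    """Walk the decision DAG from the root to a leaf.

    pairwise_decide(i, j, x) must return the preferred class, i or j.
    Returns the predicted class and the number of binary evaluations used.
    """
    remaining = list(classes)
    evaluations = 0
    while len(remaining) > 1:
        i, j = remaining[0], remaining[-1]
        winner = pairwise_decide(i, j, x)
        evaluations += 1
        if winner == i:
            remaining.pop()      # class j is eliminated
        else:
            remaining.pop(0)     # class i is eliminated
    return remaining[0], evaluations

# Hypothetical stand-in for the trained binary classifiers: nearest class centroid.
centroids = {
    0: np.array([0.0, 0.0]),
    1: np.array([4.0, 0.0]),
    2: np.array([0.0, 4.0]),
    3: np.array([4.0, 4.0]),
}

def nearest_centroid_decide(i, j, x):
    di = np.linalg.norm(x - centroids[i])
    dj = np.linalg.norm(x - centroids[j])
    return i if di <= dj else j

label, n_evals = dag_predict(np.array([3.5, 0.2]), [0, 1, 2, 3], nearest_centroid_decide)
```

For the 4-class case of Figure 1, the walk visits three of the six internal nodes before reaching a leaf.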

C. MAHALANOBIS DISTANCE VIA SVM
For simplicity, this work considers learning the Mahalanobis distance via SVM by using DAGSVM and RBF kernel functions. For n given training samples, the Mahalanobis distance $M_A$ between $x_i$ and $x_j$ is defined as in Eq. (7) [50]:
$$M_A(x_i, x_j) = (x_i - x_j)^{T} A \,(x_i - x_j) = \| L (x_i - x_j) \|^{2}, \qquad A = L^{T} L \tag{7}$$
where $x_i, x_j \in \mathbb{R}^{d}$ and $A$ is a positive semi-definite matrix [50]. Furthermore, $L \in \mathbb{R}^{d \times d}$, and if $A = I_{d \times d}$, then Eq. (7) reduces to the (squared) Euclidean distance metric.
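The role of the matrix $L$ can be illustrated with a short NumPy sketch, given here under the assumption that $A$ is factorized as $A = L^{T}L$. The second function shows one way such a distance can be plugged into a Gaussian-type kernel to produce a Gram matrix; the function names, the toy data, and the kernel normalization are assumptions for illustration only.

```python
import numpy as np

def mahalanobis_sq(xi, xj, L):
    """Squared Mahalanobis distance ||L (xi - xj)||^2, i.e. with A = L^T L."""
    diff = L @ (xi - xj)
    return float(diff @ diff)

def mahalanobis_rbf_gram(X, Z, L):
    """Gram matrix with entries exp(-||L (X[i] - Z[j])||^2)."""
    XL = X @ L.T  # row i holds L @ X[i]
    ZL = Z @ L.T
    sq = (np.sum(XL ** 2, axis=1)[:, None]
          + np.sum(ZL ** 2, axis=1)[None, :]
          - 2.0 * XL @ ZL.T)
    return np.exp(-np.maximum(sq, 0.0))  # clip tiny negative round-off

xi = np.array([1.0, 2.0])
xj = np.array([3.0, 5.0])

# With L = I the distance is the squared Euclidean distance.
d_euclid = mahalanobis_sq(xi, xj, np.eye(2))

# A non-trivial L rescales each direction before measuring the distance.
L = np.array([[2.0, 0.0], [0.0, 0.5]])
A = L.T @ L
d_scaled = mahalanobis_sq(xi, xj, L)
d_via_A = float((xi - xj) @ A @ (xi - xj))

# An isotropic L = I / (sqrt(2) * sigma) recovers a plain Gaussian RBF kernel.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
sigma = 1.5
K_iso = mahalanobis_rbf_gram(X, X, np.eye(3) / (np.sqrt(2.0) * sigma))
pairwise = X[:, None, :] - X[None, :, :]
K_rbf = np.exp(-np.sum(pairwise ** 2, axis=2) / (2.0 * sigma ** 2))
```

A Gram matrix of this kind can be passed to SVM implementations that accept a precomputed kernel, which is one convenient way to experiment with a learned $L$ without modifying the solver.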
To transfer the Mahalanobis distance to SVM learning, Eq. (7) is integrated into the kernel function in Eq. (6). The resulting kernel function is given in Eq. (8).
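One way to write the kernel that results from this substitution is sketched below. It assumes the factorization $A = L^{T}L$ and a Gaussian RBF of the form $\exp(-\|x_i - x_j\|^{2} / (2\sigma^{2}))$ in Eq. (6); the exact normalization is an assumption.

```latex
K_L(x_i, x_j) = \exp\!\left( -(x_i - x_j)^{T} L^{T} L \,(x_i - x_j) \right),
\qquad
L = \frac{1}{\sqrt{2}\,\sigma}\, I
\;\Longrightarrow\;
K_L(x_i, x_j) = \exp\!\left( -\frac{\|x_i - x_j\|^{2}}{2\sigma^{2}} \right)
```

Under this form, the ordinary RBF kernel is the special case of an isotropic $L$, and learning a full matrix $L$ generalizes the single bandwidth $\sigma$.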
In Eq. (8), $I$ is the identity matrix, and $\sigma$ is the standard deviation of the Gaussian distribution. Finally, the training data is divided into a training set $T$ and a validation set $V$. The SVM parameters are trained on $T$, and the outcome of the SVM is evaluated on the validation set $V$. The objective in learning $L$ is therefore to maximize the inter-class variations and minimize the intra-class variations, which amounts to minimizing the classification error $\varepsilon_V$ on the validation set $V$.
Because Eq. (5) is non-continuous, performing the minimization to find $L$ is non-trivial. The minimization is instead performed on a smooth loss function $\phi_V(L)$ built from the sigmoid function $s_a(z)$, where $a$ is a parameter that adjusts the steepness of the curve. For example, if $a \gg 0$, the function $\phi_V$ becomes identical to $\varepsilon_V$. Since $\phi_V$ is continuous and differentiable, its derivative with respect to $L$ can be computed. To differentiate Eq. (5) and find $\partial D / \partial L$, which depends on $\alpha$, $K$, and $b$, the chain rule is applied. The terms $\partial D / \partial \alpha$, $\partial D / \partial K$, $\partial K / \partial L$, and $\partial D / \partial b$ are straightforward, so it remains to compute $\partial b / \partial L$ and $\partial \alpha / \partial L$, which can be derived from the matrix inverse rule using $H = \begin{pmatrix} \bar{K} & y \\ y^{T} & 0 \end{pmatrix}$ with $\bar{K}_{ij} = y_i y_j K(x_i, x_j)$.
Based on the above, our evaluations are restricted to the binary classification case, and multi-class classification problems are converted to binary ones. Additionally, we apply cross-validation to each binary SVM classifier to prevent overfitting. Furthermore, the implementation of the proposed model depends on a modification of the SVM software LIBSVM [51].
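The quantities described in this subsection can be sketched as follows. The sketch assumes a real-valued decision output $D(x)$ as in Eq. (5), labels $y \in \{-1, 1\}$, and an SVM solution that satisfies the linear system $H(\alpha, b)^{T} = (\mathbf{1}, 0)^{T}$ on the support vectors; the indicator bracket and the exact sigmoid parameterization are assumptions chosen to be consistent with the text.

```latex
\begin{align*}
\varepsilon_V(L) &= \frac{1}{|V|} \sum_{(x,\,y) \in V} \big[\, y\, D(x) < 0 \,\big] \\
\phi_V(L) &= \frac{1}{|V|} \sum_{(x,\,y) \in V} s_a\big( y\, D(x) \big),
  \qquad s_a(z) = \frac{1}{1 + e^{a z}} \\
\frac{\partial D}{\partial L} &=
  \frac{\partial D}{\partial \alpha}\,\frac{\partial \alpha}{\partial L}
  + \frac{\partial D}{\partial K}\,\frac{\partial K}{\partial L}
  + \frac{\partial D}{\partial b}\,\frac{\partial b}{\partial L} \\
\frac{\partial (\alpha, b)}{\partial L_{ij}} &=
  -H^{-1}\, \frac{\partial H}{\partial L_{ij}}\, (\alpha, b)^{T},
  \qquad H = \begin{pmatrix} \bar{K} & y \\ y^{T} & 0 \end{pmatrix}
\end{align*}
```

With this choice of $s_a$, a correctly classified sample ($y\,D(x) > 0$) contributes a loss that tends to 0 and a misclassified one a loss that tends to 1 as $a$ grows, which is why $\phi_V$ approaches $\varepsilon_V$ for $a \gg 0$. The last line follows from differentiating $H(\alpha, b)^{T} = (\mathbf{1}, 0)^{T}$ with respect to $L_{ij}$.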

IV. DATABASES AND EXPERIMENTAL RESULTS
To evaluate the feasibility and effectiveness of the proposed framework, extensive experiments are conducted on publicly available face and ear datasets, and on their fusion constructed as multimodal datasets. For all experiments, cropped facial images are used alongside the ear images. In addition, the face and ear images are converted to gray-scale and resized to 100 × 100 pixels. A fusion of LBP and HOG descriptors is adopted to represent face and ear images. The proposed framework uses uniform LBP with a radius of 2 pixels, a neighborhood size of 8, and a block size of 8 × 8 without block overlapping. For the HOG descriptor, we adopt a cell size of 8 × 8 pixels and a block size of 16 × 16 pixels with an overlap of 8 pixels. Furthermore, DCA is exploited as a feature-level fusion algorithm that performs an effective feature fusion based on maximizing the pairwise correlation across the two feature vectors. The proposed model thus exploits the advantages of DCA for feature fusion to avoid the limitations of using face or ear images separately. This section includes the dataset description, the performance evaluation metrics, and the experimental results of unimodal face and ear recognition. Finally, the proposed multimodal biometric recognition based on face and ear images is implemented by LDM with kernel SVM.
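As a rough guide to the descriptor sizes implied by these settings, the following Python sketch counts blocks on a 100 × 100 image. It assumes 9 HOG orientation bins, a 59-bin histogram for uniform LBP with 8 neighbors (58 uniform patterns plus one bin for all others), and that border pixels not filling a whole block are discarded; none of these three choices is stated in the text, so the resulting lengths are only indicative.

```python
def grid_count(image_size, window, stride):
    """Number of window positions along one image side (partial windows dropped)."""
    return (image_size - window) // stride + 1

def lbp_length(image_size=100, block=8, bins=59):
    """Concatenated uniform-LBP histograms over non-overlapping blocks."""
    per_side = grid_count(image_size, block, block)
    return per_side * per_side * bins

def hog_length(image_size=100, cell=8, block=16, stride=8, orientations=9):
    """Concatenated HOG block histograms with overlapping blocks."""
    blocks_per_side = grid_count(image_size, block, stride)
    cells_per_block = (block // cell) ** 2
    return blocks_per_side ** 2 * cells_per_block * orientations

lbp_dim = lbp_length()   # 12 * 12 blocks * 59 bins
hog_dim = hog_length()   # 11 * 11 blocks * 4 cells * 9 bins
```

Under these assumptions the raw descriptors have several thousand dimensions each, which is consistent with the need for a dimension-reducing fusion step such as DCA before metric learning.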

A. DATASET DESCRIPTION
Different biometric traits are adopted to perform our experiments. The ear datasets include two available and challenging ones: AWE and AMI. The Georgia Tech, ORL, and the challenging AR face datasets are adopted as face datasets. Moreover, virtual multimodal biometric datasets, called MD1 to MD6, are constructed by fusing ear datasets with face datasets. In the following points, we describe each dataset separately.
• Mathematical Analysis of Images (AMI) ear dataset [52]. The AMI dataset was collected from 100 persons, with 7 images per person, for a total of 700 images. The AMI ear images were collected from teachers, students, and staff at Universidad de Las Palmas de Gran Canaria (ULPGC), Spain. All persons are in the age range of 19 to 65 years, and the ear images were taken in an indoor environment at a resolution of 492 × 702 pixels.
• Annotated Web Ear (AWE) dataset [53]. It includes 1,000 images of 100 subjects, with 10 images per subject. All images were collected from the Internet with various degrees of variability and illumination, and with different image scales and rotations. Hence, the AWE ear dataset is considered one of the most challenging ear datasets.
• Olivetti Research Laboratory (ORL) face dataset [54]. The ORL dataset includes 400 images, 10 face images for each of 40 individual subjects. Face images were acquired from Cambridge employees and students. This dataset was collected with no restrictions imposed on expression. In addition, most of the images of each person were captured at different times under different lighting conditions. All images have the same resolution of 92 × 112 pixels, and some subjects wear glasses in their face images.
• Georgia Tech (GT) face dataset [55]. To provide more face images for training, the Georgia Tech face dataset is used, which contains 750 images of 50 subjects, with 15 color images per subject. Face images were collected at different scales and orientations. Furthermore, most of the face images were taken in two or three different sessions, which introduces variations in illumination conditions, appearance, and facial expression (open or closed eyes, smiling or not smiling). The average size of the face images is 150 × 150 pixels.
• AR face dataset [56], [57]. As a more complicated set of face images, the AR dataset has over 3000 RGB images with an average size of 768 × 576 pixels for 126 subjects; 74 subjects are men and the rest are women. Everyone participated in two sessions separated by two weeks, without restrictions on hair style, make-up, accessories, and scarves. Each subject has at least 26 images, including face images under different conditions such as facial expressions, illumination and pose variations, accessories, and partial occlusion by scarves, sunglasses, hair, and beards. Therefore, the AR face dataset is considered a challenging face dataset. All images are cropped to 128 × 128 pixels.
Figure 2 illustrates sample cropped face images and the original ear images for each dataset. In addition, Table 1 lists the number of images and the number of subjects for all datasets (face, ear, and virtual multimodal biometric datasets) used in our experiments. The multimodal datasets are described later in more detail.

B. EXPERIMENTAL SETTINGS AND EVALUATION METRICS
Combined LBP and HOG features are adopted as standard local features to represent face and ear images. Moreover, LDM via kernel SVM is investigated for ear classification based on an RBF-SVM classifier. To validate the proposed model, every dataset is divided randomly into two disjoint groups, one for training and another for testing, and 5-fold cross-validation is adopted. The performance metrics are computed as the average Rank-1 recognition rates and standard deviations over all 5 folds for each experiment. To reduce the influence of the particular split into training and testing data and to evaluate the SVM training efficiency, all experiments are repeated 10 times with randomly chosen training and testing data. The final recognition rate is the average Rank-1 recognition accuracy of the 10 random experiments. The standard deviation is presented to assess the robustness on each face and ear dataset and on their fusion. Furthermore, to validate the performance of the proposed model, all experiments rely on two widely used metrics: system accuracy (the average Rank-1 recognition rate) and the Receiver Operating Characteristic (ROC) curve. Model accuracy, which represents the overall model performance on all available subjects, is formulated in Eq. (13), i.e., accuracy is the ratio of the number of correctly classified subjects to the test dataset size.
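The evaluation protocol described above can be sketched in Python as follows. The nearest-centroid classifier is only a placeholder for the LDM-DAGSVM model, the data are synthetic, and the per-class (stratified) fold assignment is an assumption, since the text only states that the split is random. The sketch reports the mean and standard deviation over the repeated runs.

```python
import numpy as np

def stratified_folds(labels, k, rng):
    """Assign a fold index to each sample so that every class is spread over k folds."""
    folds = np.empty(labels.size, dtype=int)
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        folds[idx] = np.arange(idx.size) % k
    return folds

def nearest_centroid_accuracy(X_tr, y_tr, X_te, y_te):
    """Placeholder classifier: Rank-1 accuracy of nearest class centroid."""
    classes = np.unique(y_tr)
    centroids = np.stack([X_tr[y_tr == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_te[:, None, :] - centroids[None, :, :], axis=2)
    pred = classes[np.argmin(dists, axis=1)]
    return float(np.mean(pred == y_te))

def repeated_cv(X, y, k=5, repeats=10, seed=0):
    """Average k-fold accuracy over several random repetitions."""
    rng = np.random.default_rng(seed)
    run_means = []
    for _ in range(repeats):
        folds = stratified_folds(y, k, rng)
        accs = [
            nearest_centroid_accuracy(X[folds != f], y[folds != f],
                                      X[folds == f], y[folds == f])
            for f in range(k)
        ]
        run_means.append(np.mean(accs))
    return float(np.mean(run_means)), float(np.std(run_means))

# Synthetic stand-in data: 4 subjects with 10 well-separated samples each.
rng = np.random.default_rng(1)
y = np.repeat(np.arange(4), 10)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0], [5.0, 5.0]])
X = rng.normal(scale=0.1, size=(40, 2)) + centers[y]

mean_acc, std_acc = repeated_cv(X, y)
```

Replacing `nearest_centroid_accuracy` with a function that trains and scores the actual classifier leaves the rest of the protocol unchanged.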
$$\text{Accuracy (Acc)} = \frac{TP + TN}{TP + FP + TN + FN} \tag{13}$$
where TP and FP refer to the true positive and the false positive subjects, respectively, and TN and FN refer to the true negative and the false negative matched subjects. The proposed model adopts the RBF kernel for the SVM classifier to calculate the system accuracy, i.e., the recognition rate. In addition, the ROC curve is used to evaluate and compare the model performance for each biometric trait and their fusion.
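Eq. (13) translates directly into code. The short Python sketch below uses hypothetical counts purely to illustrate the arithmetic.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy as in Eq. (13): correct decisions divided by all decisions."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("at least one decision is required")
    return (tp + tn) / total

# Hypothetical counts: 45 + 50 correct decisions out of 100.
acc = accuracy(tp=45, tn=50, fp=3, fn=2)
```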

C. EFFICIENCY AND ROBUSTNESS OF DAGSVM TRAINING MODEL
This section examines the training efficiency of the DAGSVM model in the proposed approach, which learns the Mahalanobis distance metric via kernel DAGSVM. In this subsection, we report the performance of the proposed biometric model over 10 random repeated tests. Figure 3 shows the variation of the model accuracy across the repeated tests for the AWE and AMI ear datasets. As shown in Figure 3, the proposed model achieves high performance across the repetitions for unimodal ear recognition. It gives around 95.50% and 96.50% on the AWE and AMI ear datasets, respectively. The model performance on the AMI ear images is better than that on the AWE images, which were collected from the web as an unconstrained dataset.
Furthermore, for face recognition, the proposed model is evaluated on three standard public face datasets (ORL, Georgia Tech, and AR) with 10 random repetitions. The results support the efficiency and robustness of the DAGSVM training model, as shown in Figure 4 for the ORL, Georgia Tech, and AR face datasets. Figures 3 and 4 indicate the stability of the proposed model, which achieves good results across repetitions for the unimodal recognition systems. The proposed face recognition model gives an accuracy of around 99.00%, 98.50%, and 99.70% for the ORL, Georgia Tech, and AR face datasets, respectively. These results for unimodal ear and face recognition over several random repetitions indicate that the proposed model is stable and more efficient than other classification models. Later, we present the classifier performance over 10 repetitions for the proposed multimodal biometric recognition.

D. PERFORMANCE COMPARISON WITH OTHER MODELS
Shu et al. [58], Kar et al. [59], and others [60]-[62] presented different models for face recognition. However, the performance of these models is not satisfactory, as shown in Table 2 for [58], [59], [62]. Tables 2 and 3 show that the proposed model achieves better performance than some recent state-of-the-art face and ear recognition models.
Moreover, the proposed model produces better recognition rates with lower computational complexity than the models in [59], [63], [64]. These models adopt deep convolutional neural networks to improve face and ear image representation and achieve good accuracy. However, deep features are high-dimensional, which may lead to more computational complexity than the hand-crafted features (LBP and HOG) used in this work. Furthermore, most of the previous works on biometric recognition were based on traditional distances and classifiers [16], [53], [62], [64], [65].
Multimodal biometric recognition has received more attention in the last decades, and it has been used in many intelligent applications such as immigration systems, access control systems, and surveillance systems. To improve the performance of multimodal biometric recognition, the proposed model combines ear images with face images to construct robust and efficient multimodal feature vectors. To evaluate the proposed model for multimodal biometric recognition, we first constructed six virtual multimodal datasets, called MD1 to MD6, from face and ear images. However, the number of ear images is limited to 7∼10 images per subject in the ear datasets. Therefore, the proposed method takes 7∼10 face images corresponding to the available ear images.
VOLUME 9, 2021
• MD1 (ORL+AWE) and MD2 (ORL+AMI). To build the virtual multimodal datasets MD1 and MD2, the proposed method combines the ORL face dataset with the ear databases AWE and AMI, respectively. These two multimodal databases have 40 individuals; for each subject, ORL face images and the corresponding ear images from AWE and AMI are randomly selected. Therefore, MD1 has 400 multimodal images.
In order to evaluate the efficiency and robustness of the proposed DAGSVM model, experiments have been conducted on the virtual multimodal datasets MD1 to MD6. All experiments have been repeated 10 times, with 5-fold cross-validation applied to each multimodal dataset. Figures 5(a), 5(b), and 5(c) show the performance of the proposed model over 10 repetitions on MD1 to MD6. Figure 5 indicates that the proposed DAGSVM model is stable, robust, and efficient for multimodal biometric recognition. Furthermore, the results on MD5 and MD6 are the best, as these datasets have the largest sizes, as shown in Table 1. This suggests that the proposed model can achieve excellent performance, especially when more images are provided for training.
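The evaluation protocol above (10 repetitions of 5-fold cross-validation) can be illustrated with a minimal sketch. Several parts are assumptions made for illustration only: the folds are drawn by plain random shuffling (the text does not state whether they are stratified), a 1-nearest-neighbor rule stands in for the DAGSVM classifier, and the toy data are hypothetical.

```python
import random
from statistics import mean

def kfold_indices(n_samples, n_splits, rng):
    """Shuffle indices and split them into n_splits nearly equal test folds."""
    order = list(range(n_samples))
    rng.shuffle(order)
    return [order[i::n_splits] for i in range(n_splits)]

def nearest_neighbor_predict(train_x, train_y, test_x):
    """Stand-in classifier (1-NN, squared Euclidean); NOT the paper's DAGSVM."""
    preds = []
    for x in test_x:
        dists = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in train_x]
        preds.append(train_y[dists.index(min(dists))])
    return preds

def repeated_cv_accuracy(features, labels, n_splits=5, n_repeats=10, seed=0):
    """Return one mean accuracy per repetition of n_splits-fold cross-validation."""
    rng = random.Random(seed)
    per_repeat = []
    for _ in range(n_repeats):
        fold_acc = []
        for test_idx in kfold_indices(len(features), n_splits, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(features)) if i not in held_out]
            preds = nearest_neighbor_predict(
                [features[i] for i in train_idx],
                [labels[i] for i in train_idx],
                [features[i] for i in test_idx],
            )
            correct = sum(p == labels[i] for p, i in zip(preds, test_idx))
            fold_acc.append(correct / len(test_idx))
        per_repeat.append(mean(fold_acc))
    return per_repeat

# Toy data: two well-separated "subjects", 10 samples each (hypothetical values).
toy_x = [[0.01 * i, 0.0] for i in range(10)] + [[5.0 + 0.01 * i, 5.0] for i in range(10)]
toy_y = [0] * 10 + [1] * 10
accuracies = repeated_cv_accuracy(toy_x, toy_y)
```

Each entry of `accuracies` corresponds to one repetition, which is the kind of per-repetition value that a stability plot over 10 repetitions would display.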
The experimental results show that the proposed model also improves multimodal recognition based on the fusion of face and ear images compared with using the individual traits, as shown in Table 4. Accuracies of 99.00 % and 96.00 % can be noticed from Table 2 for face and ear biometric recognition, respectively.
Moreover, the ROC curves of the proposed model on the MD1 to MD6 datasets are shown in Figures 7(a), 7(b), and 7(c). The proposed model gives better performance for multimodal biometric recognition, as shown in Table 4 and Figures 7(a), 7(b), and 7(c).
In addition, Figures 6(a) and 6(b) show the ROC curves for face and ear recognition. The ROC curve summarizes the model performance by plotting the Acceptance Positive Rate (APR), i.e., the true positive rate, against the False Positive Rate (FPR). The model performance on the AR face dataset is the best, as this dataset is sufficiently large. In addition, the model performs better on the AMI ear dataset than on the AWE dataset, as presented in Table 4 and Figure 6(b). AWE is considered a challenging ear dataset [53], [63]-[65], [67], as it contains ear images with high degrees of variability in pose, illumination, and resolution, as shown in Figure 2.
We have evaluated the proposed multimodal biometric recognition model and compared it with some state-of-the-art multimodal biometric models, especially those based on face and ear images, as shown in Table 5. Table 5 illustrates that the performance of the proposed multimodal biometric recognition model (LDM-DAGSVM) is superior to that of recent state-of-the-art multi-biometric models using face and ear images [10], [16], [31], [68], [70]. Moreover, we compared the proposed multimodal recognition model based on face and ear images with other multimodal biometric recognition models [13], [69] that use different biometric traits. The proposed model achieves competitive results compared with these multi-biometric models, which use three biometric traits and therefore require more sensors to collect the data, not to mention the computational cost of processing three traits.
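As a concrete reading of the ROC curves in Figures 6 and 7, the following minimal sketch computes ROC points by sweeping a decision threshold over matching scores. It assumes a binary genuine-versus-impostor setting with hypothetical toy scores; how the multi-class outputs of the proposed model are reduced to such scores is not specified here, so this is not presented as the procedure used to produce the figures.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping a threshold over the scores.

    scores: higher means "more likely genuine"; labels: 1 = genuine, 0 = impostor.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    # Visit each distinct score as a threshold, from strictest to most lenient.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical toy scores: three genuine and three impostor comparisons.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(toy_scores, toy_labels)
area = auc(curve)
```

A curve that rises quickly toward the top-left corner, and hence has an area close to 1, corresponds to a model that accepts most genuine samples at a low false positive rate.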

V. CONCLUSION
The main motivation of the proposed multimodal biometric model includes two aspects: the representation of face and ear images, and human classification. Therefore, an efficient framework based on a hybrid model of learning distance metric (LDM) and directed acyclic graph (DAG) support vector machine (SVM) has been proposed in this paper for multimodal biometric recognition. Mahalanobis distance metric learning is used to seek a square matrix from the training set. In addition, kernel SVM is used to achieve better generalization ability than that of traditional classifiers such as K-Nearest Neighbor (KNN) using Euclidean distance. Extensive experiments have been conducted on publicly available face and ear datasets and on their fusions, which are constructed as multimodal datasets. The experimental results on both unimodal and multimodal datasets demonstrate that the proposed model is more effective than other recent state-of-the-art human recognition models.
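To make the role of the learned square matrix concrete, the sketch below evaluates the standard Mahalanobis form d_M(x, y) = sqrt((x - y)^T M (x - y)) for a positive semidefinite matrix M. The matrix named `learned` is a hypothetical hand-picked example, not a matrix produced by the proposed learning procedure; with M equal to the identity, the expression reduces to the Euclidean distance used by traditional KNN.

```python
import math

def mahalanobis(x, y, M):
    """d_M(x, y) = sqrt((x - y)^T M (x - y)) for a square PSD matrix M."""
    diff = [a - b for a, b in zip(x, y)]
    # Row-by-row product M @ diff, then the inner product with diff.
    m_diff = [sum(m_ij * d_j for m_ij, d_j in zip(row, diff)) for row in M]
    return math.sqrt(sum(d_i * md_i for d_i, md_i in zip(diff, m_diff)))

x, y = [1.0, 2.0], [4.0, 6.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical "learned" matrix that down-weights the second feature.
learned = [[1.0, 0.0], [0.0, 0.25]]

d_euclid = mahalanobis(x, y, identity)   # Euclidean distance as a special case
d_learned = mahalanobis(x, y, learned)   # distance under the re-weighted metric
```

Changing M changes which feature differences count most, which is how a learned metric can shrink intra-class distances while enlarging inter-class ones.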