Learning Discriminative Factorized Subspaces With Application to Touchscreen Biometrics

Information fusion is a challenging problem in biometrics, where data comes from multiple biometric modalities or from multiple feature spaces extracted from the same modality. Learning from heterogeneous data sources is, in general, termed multi-view learning, where a view is an encompassing term that refers to different sets of observations having distinct statistical properties. Most existing approaches to learning from multiple views assume that the views are either independent or fully dependent. However, in real scenarios, these assumptions are almost never truly satisfied. In this work, we relax these assumptions. We propose a feature fusion method called Discriminative Factorized Subspaces (DFS) that learns a factorized subspace consisting of a single shared subspace (which captures the common information) and view-specific subspaces (which capture information specific to each view). DFS jointly learns these subspaces by posing the optimization problem as a constrained Rayleigh Quotient formulation, whose solution is efficiently obtained using generalized eigenvalue decomposition. Our method does not require large amounts of training data, and we show how it is apt for domains characterized by limited training data and high intra-class variability. As an application, we tackle the challenging problem of touchscreen biometrics, which is based on the study of users' interactions with their touch screens. Through extensive experimentation and thorough evaluation, we demonstrate that DFS learns a better discriminatory boundary and provides superior performance to state-of-the-art methods for touchscreen biometric verification.


I. INTRODUCTION
In recent years, users increasingly store private data and perform security-critical operations, such as banking, on their mobile devices. It thus becomes increasingly important to devise new ways to authenticate the user. Touchscreen biometrics provides a novel way to learn a unique signature of an individual's interaction with the device by studying their swipe gestures. Encouraging results have been reported using touchscreen signals as a biometric, especially on mobile systems [1]-[3].
Recent work [1] explores the use of multimodal data for touch biometrics, where different modalities correspond to different types of features extracted from user swipes, and score-level fusion is employed to merge information from multiple sources. However, touchscreen biometrics is characterized by specific challenges, namely limited training data (in the form of swipes) per user, which is obtained over a small device area. Also, human-device interaction suffers from high variation due to its dependence on the emotional state of the user, device orientation and temporal variance. To mitigate these challenges, we propose a generalized learning method that works well with limited training data characterized by high intra-class variation. Additionally, our proposed method fuses information at the feature level, as feature fusion has been found to be more effective than late-fusion techniques, like score fusion [4], but is at the same time more challenging due to the different statistical properties of the individual feature spaces [5]. We handle the challenge of multiple disparate feature spaces by learning a factorized subspace that is composed of a common subspace and private subspaces (details given below).

(The associate editor coordinating the review of this manuscript and approving it for publication was Jinjia Zhou.)
In general, learning from heterogeneous data sources or modalities is termed multi-view learning, where ''view'' is an encompassing term referring to different sets of observations having distinct statistical properties [6]. Views can correspond to data from different sensors, sources, modalities or feature spaces extracted using different algorithms. Learning from multiple views involves either concatenating them or utilizing correlations among them to get a joint subspace. With limited per-subject samples and long feature vectors, concatenation leads to overfitting, owing to the curse of dimensionality. Other multi-view learning algorithms assume that the various views either share all information (like shared latent space modeling) or no information (like multi-kernel learning). In real-life problems, such assumptions are too restrictive [6], [7].
The focus of this work is to learn ''factorized'' representations from multi-view data, which map the original data to a low-dimensional single shared subspace (that captures the common information among the different views) and private (or view-specific) subspaces (that capture the individual information specific to each view) in a supervised setting. Figure 1 shows an illustration of the proposed framework for two views, where the factorized representation consists of a shared subspace and two view-specific subspaces. We pose our optimization problem as a constrained Rayleigh Quotient formulation [8], which can be efficiently solved using a generalized eigenvalue decomposition. Our method has several desirable properties for touchscreen biometric tasks - it is efficient, optimal and fast; it can be generalized to admit different data properties, as well as ''kernelized'' to learn non-linear mappings.

FIGURE 1. Illustration of the proposed framework for factorized representation learning for two views - feature spaces 1 and 2, which are extracted from the biometric database. Our framework jointly learns a single shared subspace, capturing the common components across the views, and view-specific subspaces, capturing the individual information from the views, in a supervised setting.
Although getting a factorized representation from multiple views is useful for many learning and visualization tasks, we employ it for prediction and show how our method learns a better classifier than using a single subspace or other score-based state-of-the-art fusion methods in touchscreen biometrics. We provide results on several touchscreen benchmark data sets, and show that the proposed factorized representation yields superior performance for both verification and identification tasks, compared to state-of-the-art methods.
The intuition behind our approach is as follows: we assume that the observed views are generated by a latent subspace that has a shared component, common to all the views, and view-specific components. Our idea is to provide a methodology to unravel this subspace from the observed data, in an efficient and generalized manner. The rest of the paper is organized as follows: Section II surveys the related work in this area. Sections III, IV and V discuss the research background and methodology, while Section VI details the experimental setup and results on touch analytics, for biometric verification (genuine-impostor) and identification (1:n) studies. Section VII concludes this work.

II. RELATED WORKS
We first discuss research done in the areas of feature fusion and touchscreen biometrics; and then discuss relevant research in broadly related domains.

A. RELATED WORKS IN BIOMETRICS
In biometrics, multi-view learning usually has the flavor of multi-modal learning. Information from multiple views is either fused ''late'' (score- or decision-level fusion) or fused ''early'' as in feature-level fusion. Since the focus of our work is on feature fusion, we concentrate on the research efforts in that direction. For feature fusion from multiple modalities, efforts are focused on learning a single subspace from either multiple modalities (iris, fingerprints, gait, etc.) or multiple features extracted from a single modality. One method to learn a common subspace is the Discriminative Correlation Analysis (DCA) method [9], which employs class information for feature fusion, while restricting the correlations among points belonging to the same class. Another approach is to learn sparse dictionaries from multi-modal data [4], [10], such that the sparsity is shared across the modalities. However, none of these deal with the concept of generating a factorized subspace, and they perform poorly in comparison with the proposed method, as shown in Section VI.
For touchscreen biometrics, researchers have mostly relied on score-level fusion, e.g., Fierrez et al. [1] consolidate past work [2], [11], [12] and use score-based fusion based on discriminative (Support Vector Machines - SVM) and statistical (Gaussian Mixture Model - GMM) methods for user verification. To the best of our knowledge, none of the existing research focuses on feature fusion for touchscreen biometrics; in our experimental evaluation, these approaches perform poorly compared to our method.

B. RELATED WORKS ELSEWHERE
The approaches to learning a single subspace from multiview data can be broadly classified into unsupervised and supervised settings. Learning a single shared subspace is performed either by forcing view agreement [13], [14] or via latent variables [7], [15]-[17]. In the unsupervised setting, Canonical Correlation Analysis (CCA) [18] is one of the earliest techniques to learn a single shared subspace from multiple views, and it has been extended to factorized subspaces as Non-consolidated Correlated Analysis (NCCA) [16] and Factorized Orthogonal Latent Subspaces (FOLS) [7].
In the supervised setting, recent works [19]- [22] focus on jointly learning the shared and view-specific representations by exploiting partial correlations among the views while imposing additional constraints to ensure that the resulting representations are discriminative. However, none of these methods have all the desirable properties to be useful for our task of biometric identification and verification -they are not consistently efficient in terms of run time complexity and producing an optimal solution. In our experimental evaluation, our method is shown to be superior to methods belonging to this class [21], [22].
A related field is cross-modal retrieval [23], where the task is to get meaningful retrieval of objects of one modality, given a test query from another modality, e.g., retrieving an image for a given text, and vice-versa. In biometrics, this is referred to as cross-modal recognition [24], [25]. Several solutions developed in multi-view learning lend themselves naturally to the task of cross-modal retrieval. Existing approaches are divided into unsupervised and supervised learning scenarios. Since our work deals with supervised subspace learning, we focus on that setting. One representative work is the Generalized Multi-view Analysis (GMA) algorithm [26], which seeks to learn a single subspace that maximizes the separation between different classes; it has been expanded upon by others [27].
Some recent papers have introduced the idea of learning the common and modality-specific representations [25], [28]- [30]. However, there are important distinctions. Many methods in this category do not learn factorized representations, consisting of shared and private components. Instead, they employ deep neural architectures that first learn modality-specific representations, which are then combined to obtain a common representation, which is eventually used for discrimination between classes [25], [28]. Others [29], [30] use the modality-specific representations for generative tasks and use the shared representation for discriminating among classes. Generally, these works use deep neural network architectures and typically require millions of training samples. In contrast, our method aims to learn a factorized subspace/representation with very few training samples (100 or less per class), and so the deep learning based methods are not directly applicable.

III. PROBLEM SETTING
We consider n views, such that each view is specified by an (N × D_v) data matrix, Y_v, where N is the number of data instances and D_v is the number of features for the v-th view. The class labels for the multi-view instances are provided in an (N × 1) vector, l, such that l[i] ∈ {1, . . . , k}, where k is the total number of classes.

FIGURE 2. (a), (b) The two observed views contain a shared dimension (in black) and view-specific private information (red and blue). Each data instance has a class label denoted as a '+' or '×'. (c) Learning a single subspace from multiple views is unable to find a discriminative boundary. (d) A better discriminative representation is learnt using factorized subspaces, encapsulating both shared and view-specific subspaces. Note that the 1D signals in (a), (b), and (c) are plotted as functions of t (x-axis).
Our task is to learn a factorized representation, guided by the labels, l. Besides allowing dimensionality reduction, the representation allows mapping the original data to a single shared subspace (capturing the common components among all the views) and n view-specific subspaces (capturing the individual information). We denote the mapped representation of the data in the d_s-dimensional shared subspace as an (N × d_s) matrix, X_s. The view-specific representation for the v-th view is denoted as an (N × d_v) matrix, P_v.

To understand what is meant by the factorized representation that will be unraveled by our algorithm, we use a synthetic data set. The data set consists of two observed views, Y and Z, each containing 2000 points, generated by combining a 1-D shared representation, X_s, with two 1-D view-specific representations, P_y and P_z. The ''latent'' representations are generated as sinusoidal signals of t, where t is a vector consisting of values uniformly distributed in the interval (−1, 1). The first observed view, Y, is generated by projecting the concatenated space, [X_s P_y], to a 20-dimensional space using a random projection and adding Gaussian noise with variance 0.01. View Z is generated similarly from the concatenated space, [X_s P_z]. A class label l ∈ {−1, +1} is assigned to each data instance, where l = sign(exp(w^T [x_si; p_yi; p_zi]) − 0.5). The relative weights in w control the proportion of discriminative information in the shared and view-specific private signals. The two views, along with the label information, are shown in Fig. 2a and 2b.
Methods that force a single shared representation (Figure 2c) fail to effectively learn a discriminatory representation. On the other hand, using our proposed method, the shared and view-specific subspaces are learnt, and provide a better discriminatory representation of the multi-view data, as shown in Figure 2d.
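The synthetic construction above can be sketched as follows. The specific sinusoid frequencies, phases, projection matrices and weight vector `w` are illustrative assumptions, since the text does not list them:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000

# Latent 1-D signals: sinusoids of t, with assumed frequencies and phases.
t = rng.uniform(-1, 1, N)
x_s = np.sin(2 * np.pi * t)          # shared component
p_y = np.sin(3 * np.pi * t + 0.5)    # private component of view Y
p_z = np.sin(5 * np.pi * t + 1.0)    # private component of view Z

# Each observed view projects its [shared, private] pair to 20-D
# and adds Gaussian noise with variance 0.01 (std 0.1).
A_y = rng.normal(size=(2, 20))
A_z = rng.normal(size=(2, 20))
Y = np.column_stack([x_s, p_y]) @ A_y + rng.normal(scale=0.1, size=(N, 20))
Z = np.column_stack([x_s, p_z]) @ A_z + rng.normal(scale=0.1, size=(N, 20))

# Labels depend on the shared and private parts; w balances their influence.
w = np.array([1.0, 0.5, 0.5])        # assumed weights
latent = np.column_stack([x_s, p_y, p_z])
labels = np.sign(np.exp(latent @ w) - 0.5)
```

Varying the entries of `w` shifts discriminative information between the shared and private signals, which is what Figures 2c and 2d contrast.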

IV. BACKGROUND
We start with a brief description of Linear Discriminant Analysis [31], [32], or LDA, which learns a linear projection to map data into a new space, such that the class separation in the projected space is maximized.
LDA finds the linear projection, denoted by the basis vectors in a (D × d) matrix, W, by maximizing the ratio of the between-class scatter to the within-class scatter. D is the number of features in the observed data (denoted as an (N × D) matrix Y) and d is the number of features in the projected space. The between- and within-class scatter are described by two (D × D) matrices, S_b and S_w, respectively, which are defined as:

S_b = Σ_{c=1}^{k} N_c (μ_c − μ̄)(μ_c − μ̄)^T
S_w = Σ_{c=1}^{k} Σ_{i : l[i]=c} (y_i − μ_c)(y_i − μ_c)^T

where μ̄ is the overall mean of Y, μ_c is the class-specific mean for class c, and N_c is the number of samples in class c. To obtain the optimal W, LDA maximizes the following optimization criterion:

W* = argmax_W  tr(W^T S_b W) / tr(W^T S_w W)

The above formulation is known as the Generalized Rayleigh Quotient or GRQ, which occurs in many optimization problems within engineering and pattern recognition domains, and can be efficiently solved as a generalized eigenvalue problem [8]. The solution can be obtained by solving the following generalized eigenvalue problem (assuming S_w is positive definite):

S_b w = λ S_w w

Choosing the largest d eigenvalues of the above problem, one can construct the matrix W that consists of the d eigenvectors with the largest eigenvalues.

LDA assumes a Gaussian distribution for each class, which might be violated in practical settings. Non-parametric methods such as Marginal Fisher Analysis (MFA) [33] have been proposed to handle this issue, in which the between-class and within-class scatter matrices (S_b and S_w) are replaced by inter-class separability and intra-class compactness graphs, respectively. The adjacency matrices, G_b and G_w, of the two graphs are defined as:

G_w[i, j] = 1 if i ∈ N^+_{k1}(j) or j ∈ N^+_{k1}(i), and 0 otherwise
G_b[i, j] = 1 if (i, j) ∈ P_{k2}(c_i) or (i, j) ∈ P_{k2}(c_j), and 0 otherwise

Here, c_i indicates the class of the i-th training sample, N^+_{k1}(i) indicates the set of k_1 nearest neighbors of the i-th data sample in the same class, and P_{k2}(c) refers to the set of data pairs that are the k_2 nearest pairs among the set {(i, j) : c_i = c, c_j ≠ c}.
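The LDA solution described above can be sketched as follows; `lda_projection` is a hypothetical helper name, and the small ridge added to S_w is our addition to guarantee positive definiteness:

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(Y, labels, d):
    """LDA sketch: build S_b and S_w, solve S_b w = lam * S_w w with
    scipy's generalized symmetric eigensolver, keep the top-d eigenvectors."""
    mu = Y.mean(axis=0)
    D = Y.shape[1]
    S_b = np.zeros((D, D))
    S_w = np.zeros((D, D))
    for c in np.unique(labels):
        Yc = Y[labels == c]
        mu_c = Yc.mean(axis=0)
        diff = (mu_c - mu)[:, None]
        S_b += len(Yc) * (diff @ diff.T)           # between-class scatter
        S_w += (Yc - mu_c).T @ (Yc - mu_c)         # within-class scatter
    S_w += 1e-6 * np.eye(D)                        # ridge: ensure S_w is PD
    eigvals, eigvecs = eigh(S_b, S_w)              # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:d]]
```

On well-separated classes, projecting onto the returned directions pushes class means far apart relative to the within-class spread.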
The objective criterion for MFA has a similar trace-ratio form as LDA:

W* = argmax_W  tr(W^T Y^T (D_b − G_b) Y W) / tr(W^T Y^T (D_w − G_w) Y W)

Here, D_b and D_w are diagonal matrices where each diagonal term contains the sum of the corresponding row in G_b and G_w, respectively. Many multi-view extensions of LDA have been proposed in the literature [26], [34], [35]. All have a similar trace-ratio formulation, with some modifications, and with the single aim of learning a discriminative subspace from multiple views. However, none of them handles factorized subspaces.
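The MFA graph construction can be sketched as follows. As a simplification (and an assumption on our part), we take each point's k_2 nearest other-class neighbours rather than the k_2 nearest cross-class pairs per class; `mfa_graphs` is a hypothetical helper name:

```python
import numpy as np

def mfa_graphs(Y, labels, k1=5, k2=20):
    """Sketch of MFA's graphs: G_w links same-class k1-nearest neighbours
    (intra-class compactness); G_b links nearby points of different
    classes (inter-class separability)."""
    N = Y.shape[0]
    d = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    G_w = np.zeros((N, N))
    G_b = np.zeros((N, N))
    for i in range(N):
        same = np.where(labels == labels[i])[0]
        same = same[same != i]
        nn = same[np.argsort(d[i, same])[:k1]]     # same-class neighbours
        G_w[i, nn] = G_w[nn, i] = 1
        other = np.where(labels != labels[i])[0]
        nb = other[np.argsort(d[i, other])[:k2]]   # nearby other-class points
        G_b[i, nb] = G_b[nb, i] = 1
    return G_w, G_b
```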

V. METHODOLOGY
We propose a method, called Discriminative Factorized Subspaces (DFS), whose objective is to combine multiple views and produce a factorized latent representation of the data. The goal is to devise an objective criterion which allows for simultaneous learning of a discriminative shared subspace, which captures the common components among multiple views, and view-specific subspaces, which extract information that could not be captured by the shared subspace. Two versions of DFS are proposed here - linear and non-linear (using kernels).
In linear DFS, the latent representations (shared and view-specific) can be obtained from the observed views via linear transformations, using a set of bases. For each view, we learn two sets of orthonormal bases, W_v and V_v, such that:

X_sv = Y_v W_v,    P_v = Y_v V_v

The bases are learnt such that the between-class separation is maximized in each of the latent spaces, while ensuring that the shared and view-specific subspaces are mutually orthogonal (W_v^T V_v = 0), so that the view-specific subspaces capture the information that is specific to the view and is not shared across the views. This can be enforced using the following joint optimization formulation.

VOLUME 8, 2020
Consider a matrix W, which is the vertical concatenation of the individual basis matrices, {W_v}_{v=1}^{n}, i.e.,

W = [W_1; W_2; . . . ; W_n]

Thus, W will have D (= Σ_{v=1}^{n} D_v) rows and d_s columns, where d_s is the dimensionality of the shared subspace. Let S_bv be a symmetric (D_v × D_v) matrix that describes the between-class scatter for the v-th view, and let S_wv be a symmetric (D_v × D_v) matrix that describes the within-class scatter for the v-th view. Similarly, let S_w be a symmetric and positive definite (D × D) matrix that describes the within-class scatter for all the observed views, and let S_b be a symmetric (D × D) matrix that describes the between-class scatter for all the observed views. We will discuss different approaches to construct these scatter matrices in Section V-B.
To obtain the desired bases, {W_v, V_v}_{v=1}^{n}, we seek to optimize the following objective function:

max_{W, {V_v}}  tr(W^T S_b W) / tr(W^T S_w W) + Σ_{v=1}^{n} tr(V_v^T S_bv V_v) / tr(V_v^T S_wv V_v)
s.t.  W^T W = I,   V_v^T V_v = I,   W_v^T V_v = 0,  ∀v

The two terms in the above objective criterion are functions of W_v and V_v, respectively. However, the third constraint, which depends on both, means that we cannot optimize each term independently. Instead, we first optimize the first term, ignoring the constraints, and then optimize the second term, using the values obtained from the first in the constraint. The two optimization criteria will, thus, become:

max_W  tr(W^T S_b W) / tr(W^T S_w W)   s.t.  W^T W = I

and,

max_{V_v}  tr(V_v^T S_bv V_v) / tr(V_v^T S_wv V_v)   s.t.  V_v^T V_v = I,  W_v^T V_v = 0

The first objective criterion (in (11)) is similar to the objective criterion for LDA or MFA,² as discussed in Section IV. Assuming S_w is positive definite, the solution will be the eigenvectors corresponding to the d_s largest eigenvalues obtained by solving the following generalized eigenvalue decomposition problem:

S_b w = λ S_w w

Solving the second optimization problem (in (12)) is not as straightforward due to the presence of an additional constraint (W_v^T V_v = 0, ∀v). We show that, if S_bv and S_wv are symmetric and positive semi-definite, the objective function in (12) can be solved by posing it as a constrained Generalized Rayleigh Quotient optimization problem, which has an efficient closed-form solution.

A. SOLVING THE CONSTRAINED GENERALIZED RAYLEIGH QUOTIENT OPTIMIZATION PROBLEM
Dropping the view-specific subscript v for notational simplicity, and noting the equivalence between the ratio-trace form of (12) and the classical Rayleigh quotient form of (4), we consider the following form of the objective criterion in (12):

max_v  (v^T S_b v) / (v^T S_w v)   s.t.  W^T v = 0,  v^T v = 1

where S_b and S_w are (D × D) symmetric and positive definite matrices and v is a (D × 1) vector. The (D × d_s) constraint matrix W is full-rank, since it consists of d_s orthogonal basis vectors obtained by solving (11).
For the constrained problem, we adapt the solution provided by Golub and Underwood [8]. Note that since v occurs in both the numerator and denominator of the objective function, the constraint v^T v = 1 can be ignored during the optimization. At the end, we normalize the optimal vector v, as v = v / ||v||_2. In the original formulation [8], the matrix W is assumed to be of rank r (r ≤ d_s). Let Q be a matrix such that, by permuting the columns of W, the following holds:

Q^T W Π = [R  S; 0]

where Π is a permutation matrix, R is an (r × r) upper triangular matrix, S is an (r × (d_s − r)) matrix, and Q^T Q = I. Q can be constructed as a product of r Householder transformations. Since d_s = r, we can use the QR-decomposition of the matrix W to obtain Q. Writing Q = [Q_1 Q_2], where the columns of Q_2 span the null space of W^T, the two matrices, G and H, are constructed as follows:

G = Q_2^T S_b Q_2,    H = Q_2^T S_w Q_2

It has been shown that the stationary values for the constrained GRQ problem can be obtained using the eigenvalues of the following generalized eigenvalue problem [8]:

G z = λ H z

In fact, the eigenvector, z_max, corresponding to the largest eigenvalue, λ_max, can be used to get the solution of the maximization problem as:

v = Q_2 z_max

The corresponding eigenvalue, λ_max, is the optimal value for the objective function in (12).

² By converting the trace-ratio problem into a ratio-trace problem, which can be analytically solved using the generalized eigenvalue decomposition.
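A minimal sketch of this constrained solver, assuming d_s = r so that a plain QR decomposition suffices (`constrained_grq` is a hypothetical name):

```python
import numpy as np
from scipy.linalg import eigh, qr

def constrained_grq(S_b, S_w, W):
    """Maximize (v' S_b v) / (v' S_w v) subject to W' v = 0 and ||v|| = 1,
    assuming the constraint matrix W has full column rank."""
    D, d_s = W.shape
    Q, _ = qr(W)                # full QR of the constraint matrix
    Q2 = Q[:, d_s:]             # columns spanning the null space of W'
    G = Q2.T @ S_b @ Q2         # projected between-class scatter
    H = Q2.T @ S_w @ Q2         # projected within-class scatter
    eigvals, eigvecs = eigh(G, H)
    z_max = eigvecs[:, -1]      # eigenvector of the largest eigenvalue
    v = Q2 @ z_max
    return v / np.linalg.norm(v), eigvals[-1]
```

Any v of the form Q_2 z automatically satisfies W^T v = 0, which is why the constrained problem reduces to an unconstrained generalized eigenvalue problem in z.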

B. CONSTRUCTING SCATTER MATRICES
The first optimization criterion in (11) requires two (D × D) matrices, S_b and S_w, that are used to measure the between-class and within-class scatter, respectively. Similarly, the second optimization criterion in (12) requires a pair of (D_v × D_v) matrices, S_bv and S_wv, for each view. One option for the per-view matrices is to use the scatter matrices as defined in classical Linear Discriminant Analysis (LDA) (see (20) and (21)). The matrices, S_b and S_w, are constructed from the corresponding view-specific matrices, using a block construction approach [9], [26]. For instance, S_w is defined as the block-diagonal matrix:

S_w = diag(S_w1, S_w2, . . . , S_wn)

S_b is constructed in the same way, except that we introduce additional terms in the matrix to ensure a cross-view alignment when mapping data into the shared subspace. This means that Y_1 W_1, Y_2 W_2, . . . , Y_n W_n should be aligned. We encode the cross-view alignment for a pair of views as a (D_i × D_j) matrix, A_ij (we call these the cross-view alignment matrices), and define the unified matrix, S_b, as:

S_b = [ S_b1    λ A_12  . . .  λ A_1n ;
        λ A_21  S_b2    . . .  λ A_2n ;
        . . .
        λ A_n1  λ A_n2  . . .  S_bn  ]

where the scalar λ controls the relative impact of the within-view and the cross-view effects. We set A_ij = Y_i^T Y_j [26].
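The block construction can be sketched as follows (`unified_scatter` is a hypothetical helper name; A_ij = Y_i^T Y_j as above):

```python
import numpy as np

def unified_scatter(S_w_views, S_b_views, views, lam=1.0):
    """Assemble the unified matrices: S_w is block-diagonal in the per-view
    within-class scatters; S_b has the per-view between-class scatters on
    its diagonal and scaled cross-view alignment blocks off the diagonal."""
    n = len(views)
    sizes = [S.shape[0] for S in S_w_views]
    offs = np.concatenate([[0], np.cumsum(sizes)])
    D = int(offs[-1])
    S_w = np.zeros((D, D))
    S_b = np.zeros((D, D))
    for i in range(n):
        si = slice(int(offs[i]), int(offs[i + 1]))
        S_w[si, si] = S_w_views[i]
        S_b[si, si] = S_b_views[i]
        for j in range(n):
            if i != j:
                sj = slice(int(offs[j]), int(offs[j + 1]))
                S_b[si, sj] = lam * (views[i].T @ views[j])  # A_ij = Y_i' Y_j
    return S_w, S_b
```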

C. GENERALIZATION CAPABILITY OF THE METHODOLOGY
Our methodology can be generalized to varying notions of scatter and cross-view (Xview) alignment matrices, as, depending on the data sets and problem domains, different notions of similarity and Xview alignment might be appropriate. For example, LDA-encoded scatter matrices might not be appropriate in a problem where the distributions are significantly non-Gaussian, as LDA assumes a Gaussian likelihood. In such cases, non-parametric variants of LDA could be used. One such variant is Marginal Fisher Analysis (MFA), which builds graphs to capture intra-class compactness and inter-class separability [33]. Please see (6) for details of how it is used in our framework.
The generalization capability of our methodology also extends to different notions for the construction of A_ij. One alternative is to compute the similarity for each pair of features in the two views, using a Gaussian radial basis function.

D. INFERENCE
After learning the basis matrices, {W_v, V_v}_{v=1}^{n}, by optimizing (10) on the training data, one can infer the factorized representation for a test instance, denoted as {y*_v}_{v=1}^{n}, where y*_v ∈ R^{D_v}, as follows. The shared subspace representation is given by:

x*_s = C(y*_1 W_1, y*_2 W_2, . . . , y*_n W_n)

where C() is a combiner function that merges the shared representations obtained from each view. A possible option for C() would be a simple feature-wise average, which would only be reasonable if the subspaces represented by each W_v are aligned, i.e., the representations produced from each view have the same scale. However, since this is not necessarily ensured in (11), averaging will not always give optimal results. Another option would be to simply concatenate each subspace representation to yield an (n · d_s)-dimensional subspace.
Other methods, such as using a dimensionality reduction method such as Principal Component Analysis (PCA) or Multi-Dimensional Scaling (MDS) can be employed as well.
The view-specific latent representation for the test instance is obtained as:

p*_v = y*_v V_v
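The inference step, with feature-wise averaging as the combiner C(), might look like this sketch (`infer_factorized` is a hypothetical name; averaging is only one of the combiner options discussed above):

```python
import numpy as np

def infer_factorized(y_stars, W_list, V_list):
    """Map one test instance (one row vector per view) into the factorized
    representation: a combined shared part and per-view private parts."""
    shared = [y @ W for y, W in zip(y_stars, W_list)]
    x_s = np.mean(shared, axis=0)        # combiner C(): average across views
    privates = [y @ V for y, V in zip(y_stars, V_list)]
    return x_s, privates
```

Replacing `np.mean` with concatenation yields the (n · d_s)-dimensional alternative mentioned above.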

E. SUPERVISED LEARNING USING DFS
Though the representations can be used for multiple tasks, like visualization or clustering, in this work, we show how these representations can assist in building a better classifier.
Here, we concatenate the learned shared and private representations of the training data, [x*_s, p*_1, . . . , p*_n]. Given that these are low-dimensional representations, and are mutually orthogonal to each other, concatenation is a reasonable approach. Any classifier can be trained on this data, though we use K-nearest neighbor (with K = 5) as the classifier.
During testing, the test data is mapped to these factorized representations, using (22) and (23), and the classifier model is used to predict the unseen labels.
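A minimal K-nearest-neighbour classifier over the concatenated representations, as a sketch (Euclidean distance is assumed here; the text does not prescribe a metric):

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k=5):
    """Minimal K-nearest-neighbour classifier (K = 5, as in the paper),
    applied to the concatenated [x_s, p_1, ..., p_n] representations."""
    dists = np.linalg.norm(test_X[:, None, :] - train_X[None, :, :], axis=2)
    idx = np.argsort(dists, axis=1)[:, :k]         # k nearest training rows
    preds = []
    for neighbours in train_y[idx]:
        vals, counts = np.unique(neighbours, return_counts=True)
        preds.append(vals[np.argmax(counts)])      # majority vote
    return np.array(preds)
```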

F. DISCRIMINATIVE FACTORIZED SUBSPACE (DFS) -ALGORITHM DESCRIPTION
Algorithm 1 shows the training and inference steps for the proposed DFS method. In the training phase, the algorithm first constructs unified and view-specific between-class and within-class scatter matrices. A small positive value (η) is added to the diagonal entries of the within-class scatter matrices to ensure that they are positive definite (Step 6). After obtaining the scatter matrices, the algorithm solves the generalized eigenvalue decomposition problem (See (13)) and finds the eigenvectors whose corresponding eigenvalues are greater than a specified threshold (tol).
The shared-space projection matrices, W_v, for each view are extracted from the resulting matrix, W, using the construction in (9). Finally, the view-specific projection matrices, V_v, are extracted by solving the constrained Generalized Rayleigh Quotient problem discussed in Section V-A and identifying the eigenvectors of the generalized eigenvalue problem in (18) whose eigenvalues are greater than the threshold, tol. Note that the dimensionality of the shared and view-specific subspaces is automatically determined by applying the threshold to the corresponding eigenvalues.

G. KERNEL DISCRIMINATIVE FACTORIZED SUBSPACE (KDFS) LEARNING
The linear DFS algorithm can be adapted to learn non-linear representations using a kernel-based approach, referred to as Kernel Discriminative Factorized Subspaces or KDFS. It assumes that data in each view is mapped to a Hilbert space F_v by a non-linear mapping, φ_v. The revised objective function has the same form as (10), where the matrices, W_v, V_v, S_bv, and S_wv, are obtained from the mapped data (using φ_v), in the same fashion as for the DFS algorithm in (10). However, instead of using explicit mappings, we employ the ''kernel trick''. For this, we first use a known result from the theory of reproducing kernels, which states that each basis vector in the two sets of matrices, W_v and V_v, can be expressed as a linear combination of the mapped features. For instance, any column of the matrix W_v can be written as:

w_v = Σ_{i=1}^{N} α_vi φ_v(Y_vi)

where Y_vi is the i-th data instance in the v-th view, and α_vi is the corresponding scalar coefficient for that data instance. This permits us to rewrite the objective function in a form where all mapped instances, φ_v(Y_vi), occur in a dot product with other instances, and thus can be replaced by a kernel function. The rest of the KDFS algorithm follows the methodology of kernel LDA [36], where the scatter matrices, S_bv and S_wv, are replaced with kernel matrices that can be obtained by applying a kernel function (e.g., radial basis function or RBF) to every pair of input instances in each view.

VI. RESULTS
We show the efficacy of our method on problems motivated by identification and verification tasks for touchscreen biometrics. Subsequent sections describe the data sets, features, experimental set-up and results.

A. DATA DESCRIPTION
We use four publicly available datasets [1],³ namely Serwadda, Frank, Antal and UMDAA. The Serwadda database has 190 subjects across 2 sessions, at least one day apart [11]. The users were asked questions and they responded with their answers on the smartphone. They were also allowed free interaction with the device in landscape and portrait mode. The metrics that were recorded from the touch interactions were the x and y coordinates, the time stamp, the area covered by the finger, the pressure on the screen and the device orientation. Only gestures obtained by swiping one finger on the screen were recorded. Multi-touch gestures, e.g., zooms, were ignored.
Frank's data set [2] has 41 users over two sessions held one week apart. There were two applications for user interaction - one for image comparison and another for reading texts. Multiple devices were used, recording for each data point the x and y coordinates, the time stamp, the area covered by the finger, the pressure, the device orientation and the finger orientation. Most of the data is available in portrait orientation, hence the results are available only for that orientation.
The Antal data set [12] spans 71 users, over eight devices. Similar to Frank's data set, two applications were developed for data acquisition, where subjects had to read texts and choose their favorite image. The data was obtained during a four-week period.
The UMDAA data set [37] has data from 48 users, collected over a two-month period. For data acquisition in this case, users were not given a task and were allowed free use of the devices during this period. Thus, data acquisition starts when the device is unlocked, and can spread over days, until the device is locked again.

1) FEATURE DESCRIPTION
For each data set, two different types of feature vectors were extracted - swipe-based and signature-based.

Swipe-based features: These consist of 28 features, including the velocity and acceleration vectors for every pair of adjacent points in a stroke. For the velocity, acceleration, pressure and area measurements, the following features are calculated - mean, standard deviation, the first, second, and third quartiles, the left-most and right-most coordinates in horizontal strokes, the upper-most and bottom-most coordinates in vertical strokes, the distance between the start and end points, the angle of the straight line that joins the start and end points, the total duration of the stroke, and the summation of the distance between every pair of adjacent points.

Signature-based features: Their use is motivated by their relevance in signature biometrics and graphical touch passwords, and includes 5 features as published previously [38].

³ Available at http://atvs.ii.uam.es/atvs
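A handful of the swipe-based statistics above can be sketched as follows, computed from one stroke's (x, y, t) samples; the function and feature names are illustrative, not the paper's exact 28-feature implementation:

```python
import numpy as np

def swipe_features(x, y, t):
    """Illustrative subset of swipe-level features from a stroke's
    coordinate and timestamp arrays."""
    dx, dy, dt = np.diff(x), np.diff(y), np.diff(t)
    step = np.hypot(dx, dy)                 # distance between adjacent points
    speed = step / dt                       # per-segment velocity magnitude
    return {
        "mean_velocity": speed.mean(),
        "std_velocity": speed.std(),
        "stroke_length": step.sum(),        # sum of adjacent-point distances
        "end_to_end": np.hypot(x[-1] - x[0], y[-1] - y[0]),
        "direction": np.arctan2(y[-1] - y[0], x[-1] - x[0]),
        "duration": t[-1] - t[0],
    }
```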

B. COMPETING METHODS
We compare with the following related methods: Partially Shared Latent Factor (PSLF) method [21] simultaneously learns a common subspace and view-specific subspaces by exploiting the correlation and complementarity among views within an optimization formulation, with an additional constraint to ensure that the learned subspaces are discriminative. This method operates in a semi-supervised setting and we adapt it for our problem setting.
Generalized Multi-view Analysis (GMA) [26] is a generalized feature extraction approach for cross-view retrieval and multi-view classification. It learns transformation matrices that map the views into a common subspace and uses a Linear Discriminant Analysis (LDA) type constraint to ensure that the mapped data is discriminative according to the label information.
Score Level Fusion for Mobile Authentication (SCF) [1] is a score-level fusion mechanism that uses an SVM on swipe-level features and a Gaussian Mixture Model (GMM) on signature-level features to obtain similarity scores, which are normalized and then combined into a final averaged score.
Discriminative Correlation Analysis [9] is a supervised extension of CCA (Canonical Correlation Analysis) that learns a joint discriminative subspace from multiple views by retaining only the correlations among data points that belong to the same class within a view, while eliminating the correlations between data points belonging to different classes within each view.

Sparse Multimodal Biometric Recognition (SMBR) [4] is a feature-level fusion technique motivated by sparse coding. It imposes common sparsity constraints within each biometric modality and also across different modalities. During testing, each test instance is represented by a sparse linear combination of the training data, with the constraint that the test subject shares a sparse representation across different modalities.
Multi-view Metric Learning (MVML) [22] aims at jointly learning shared and view-specific similarity metrics in a supervised setting by learning transformation matrices that map data into a factorized representation. The discriminative aspect is handled by imposing pairwise constraints among instances that belong to the same class.

C. EXPERIMENTAL SETUP
We conducted experiments for both verification (genuine-impostor) and identification (1:n search) settings. For each data set, we considered the following setups to construct the training and test data: i) Intra-Session: both training and testing data are sampled from the first session; ii) Inter-Session: training data is obtained from the first session, and testing is done on the second session. For mobile verification, we experimented with both inter- and intra-session scenarios, while for mobile identification we worked with the combined setting, where data from both sessions are merged. We chose these settings because they are more challenging. For Frank, Antal, and UMDAA, the data was available for a single time period, hence experiments were performed only in the intra-session setting.
We use the four different types of touch operations: sliding up, down, left, and right. Additionally, for Serwadda, experiments were performed for both orientations, landscape and portrait. For the other data sets, only the portrait orientation is used, since not enough data was available in the landscape orientation.
Verification: We use the Equal Error Rate (EER), which is the error rate at which the false positive rate equals the false negative rate; the lower the EER, the better the verification system. For each algorithm, the EER was calculated for each user, and then the mean of the EERs over all users is used. This procedure is repeated 20 times, and the mean and standard deviation across the runs are reported.
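The per-user EER can be computed by sweeping a decision threshold over the genuine and impostor scores and finding where the false rejection and false acceptance rates cross. A minimal illustrative sketch (not the paper's evaluation code, which may use interpolation or a library ROC routine):

```python
import numpy as np

def eer(genuine_scores, impostor_scores):
    """Equal Error Rate: the operating point where the false rejection
    rate (FRR) and false acceptance rate (FAR) are closest to equal."""
    g = np.asarray(genuine_scores)
    i = np.asarray(impostor_scores)
    thresholds = np.unique(np.concatenate([g, i]))
    best_gap, best_eer = 1.0, None
    for thr in thresholds:
        frr = np.mean(g < thr)    # genuine scores rejected at this threshold
        far = np.mean(i >= thr)   # impostor scores accepted at this threshold
        if abs(frr - far) < best_gap:
            best_gap, best_eer = abs(frr - far), (frr + far) / 2
    return best_eer
```

With perfectly separated score distributions this returns 0; overlap between the genuine and impostor scores pushes the EER up.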
For each experimental run, a user was randomly sampled, and a randomly selected subset of that user's data instances was taken as the genuine class. For the impostor class, a subset of the remaining users was selected, and a subset of their samples was randomly chosen. For the test data, the remaining data instances of the selected user were used as genuine instances, and data instances belonging to a sample of the remaining users were used as impostors. Similar to previous work [1], during testing we combine the classification scores of 10 successive strokes and use the average as the final score for a user.
Identification: We pose the problem of identification with n users as a multi-class classification problem. We use stratified cross-validation accuracy as the evaluation metric; for each cross-validation fold, stratified sampling ensures that the training data is balanced across all classes. Since there is limited data per user, we only consider users with a significant number of strokes (> 100). For robustness, during testing we combine predictions over 10 randomly selected input strokes per user to calculate the accuracy of the classifier. To test the stability of the method, the cross-validation procedure (with 10 folds) was repeated 5 times, and the mean and standard deviation are reported.
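The stratified fold construction and the stroke-level vote described above can be sketched as below. These are generic, hypothetical helpers (in practice a library routine such as scikit-learn's StratifiedKFold would serve the same purpose):

```python
import numpy as np

def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each instance to a fold so that every fold contains a
    roughly equal share of each class (stratified cross-validation)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    folds = np.empty(len(labels), dtype=int)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)                       # randomize within the class
        folds[idx] = np.arange(len(idx)) % n_folds
    return folds

def majority_over_strokes(per_stroke_preds):
    """Combine per-stroke predictions (e.g. over 10 randomly selected
    strokes) into a single user-level prediction by majority vote."""
    vals, counts = np.unique(per_stroke_preds, return_counts=True)
    return vals[np.argmax(counts)]
```

Voting over several strokes smooths out individual noisy predictions, which is why combining 10 strokes gives a more robust accuracy estimate than scoring single strokes.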
Parameter Settings: For the results in the next section, we use the following parameter settings: γ = 1, where γ is the scalar weight associated with the XView matrices in (21); tol = 10^−6; η = 10^−6. We used the LDA approach to calculate the scatter matrices, and the covariance method to calculate the XView matrices. We show comparative results for other approaches in Section VI-E.

D. RESULTS
Table 1 details the results of verification on the data sets, with the various orientations. We note that the proposed DFS algorithm performs the best across data sets and settings, with low EER values. We also report very low standard deviations, meaning that the results were stable across runs. In some cases, the sparsity-based method (SMBR) also performs comparably or better. However, the run-time complexity of SMBR is prohibitive for continuous authentication, as also reported elsewhere [9]. We ran timing experiments for testing on a data set with 2000 test instances: while the DFS algorithm (and the other competitors) takes less than 1 second, the SMBR algorithm takes more than 10 minutes for the same task. The intra-session scenario yields lower EERs than inter-session, because there is less variability in the data within one session. We also perform better than the score-level fusion method (SCF), partly because we fuse at the feature level, which is known to perform better than ''late'' fusion [5]. Table 2 details the results for the identification scenario, where cross-validation accuracy is used as the performance metric. We note that the proposed DFS algorithm is consistently superior to all other methods across nearly all data sets. While the proposed method scales well with the number of users, other methods such as SMBR, which showed good performance for the verification task, do not scale as well.
The MVML method also performs well on many data sets; however, due to the pairwise constraints, which are quadratic in the data size, the time complexity of the method is significantly high, and it cannot scale to larger data sets such as UMDAA.

TABLE 1. EERs (equal error rates), along with standard deviations, reported for both inter- and intra-session scenarios for verification (genuine-impostor scenario) on the Serwadda, Antal, Frank, and UMDAA data sets. Rows marked with (-) mean that not enough data/users were present to run the experiment.

TABLE 2. Accuracy for the identification task, along with standard deviations, for the four data sets. Rows marked with (-) mean that not enough data/users were present to run the experiment. For UMDAA, the experiments with the SMBR and MVML algorithms could not be completed in a reasonable amount of time.
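As stated in the abstract, the DFS optimization reduces to a constrained Rayleigh quotient solved by generalized eigenvalue decomposition. A minimal sketch of extracting the top-k directions for a pencil (A, B), where A and B are generic stand-ins for the discriminative scatter and constraint matrices (assumed symmetric, with B positive definite), not the paper's actual matrices:

```python
import numpy as np

def rayleigh_directions(A, B, k):
    """Top-k generalized eigenvectors of (A, B): directions w maximizing
    the Rayleigh quotient (w^T A w) / (w^T B w). Solved by whitening
    with B^{-1/2}, which reduces it to a standard symmetric problem."""
    evals_b, evecs_b = np.linalg.eigh(B)
    B_inv_sqrt = evecs_b @ np.diag(1.0 / np.sqrt(evals_b)) @ evecs_b.T
    M = B_inv_sqrt @ A @ B_inv_sqrt            # symmetric whitened matrix
    evals, evecs = np.linalg.eigh(M)           # eigenvalues in ascending order
    W = B_inv_sqrt @ evecs[:, ::-1][:, :k]     # map back, keep top-k directions
    return W, evals[::-1][:k]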

E. STUDYING PARAMETER CHOICES IN DFS
In this section we study the following aspects of the proposed DFS algorithm. Benefit of the factorized representation: Figure 3(a) shows the benefit of using a factorized representation over using only the common representation, on one particular data set (Serwadda in portrait orientation) for the verification task. Using the factorized representation offers significantly better discrimination between the genuine and impostor data instances, which supports its use.
Generalization ability: The impact of the choice of method to create the within-view scatter and cross-view alignment matrices is illustrated in Figures 3(b) and 3(c), respectively. The LDA strategy to construct the between- and within-class scatter matrices using (20) and (21) gives marginally better results than the non-parametric MFA strategy (see (6)). However, this could differ for other data sets; for instance, if the class distribution is non-Gaussian, one would expect the MFA method to work better. Figure 3(c) shows the results for the two strategies to construct the cross-view alignment matrices: linear covariance and non-linear radial basis function (RBF) based similarity. For this particular orientation and data set there is no clear winner; the results are again data dependent. Using covariance is a reasonable choice when there is a linear dependence between the observed and hidden representations, whereas the RBF-based similarity is applicable when the dependence is non-linear.
Non-linear formulation: Finally, we study the impact of using the non-linear, or kernel, version of DFS (KDFS) over the linear method. For the data sets investigated here, the linear DFS method consistently performs better than KDFS, as shown in Figure 3(d). This is probably because the assumption of a linear dependence between the observed and latent subspaces holds for this data. In other scenarios, KDFS could be used to learn a non-linear mapping.

VII. CONCLUSION AND FUTURE WORK
We propose the idea of generating factorized subspaces from multi-view data, consisting of a single shared subspace (capturing the information common to the different views) and view-specific subspaces (capturing the information individual to each view). Our methodology learns the factorized space jointly, using a constrained Rayleigh quotient based formulation that provides an efficient and globally optimal solution.
We show how our method can be generalized to different notions of similarity, cross-view alignment matrices, and kernel settings. On the challenging data of touchscreen biometrics, we demonstrate the efficacy of our methodology, achieving performance superior to state-of-the-art methods for the problems of biometric verification and identification. In the future, we plan to extend it to novel biometric problems and settings with multi-view data sets.