Commonality and Individuality-Based Subspace Learning

Subspace learning (SL) plays a key role in various learning tasks, especially those with a huge feature space. When processing multiple high-dimensional learning tasks simultaneously, it is of great importance to make use of the subspace extracted from some tasks to help learn others, so that the learning performance of all tasks can be enhanced together. To achieve this goal, it is crucial to answer the following question: How can the commonality among different learning tasks and, of equal importance, the individuality of each single learning task, be characterized and extracted from the given datasets, so as to benefit the subsequent learning, for example, classification? Existing multitask SL methods have usually focused on the commonality among the given tasks while neglecting the individuality of the learning tasks. In order to offer a more general and comprehensive framework for multitask SL, in this article, we propose a novel method dubbed commonality and individuality-based SL (CISL). First, we formally define the notions and objective functions of both commonality and individuality with respect to multiple SL tasks. Then, we design an iterative algorithm to solve the formulated objective functions, with the convergence of the algorithm being guaranteed. To show the generality of the proposed method, we theoretically analyze its connections to existing single-task and multitask SL methods. Finally, we demonstrate the necessity and effectiveness of incorporating both commonality and individuality by interpreting the learned subspaces and comparing the performance of CISL (in terms of the subsequent classification accuracy) with that of classical and state-of-the-art SL approaches on both synthetic and real-world multitask datasets. The empirical evaluation validates the effectiveness of the proposed method in characterizing the commonality and individuality for multitask SL.


Jinfu Ren, Yang Liu, Senior Member, IEEE, and Jiming Liu, Fellow, IEEE

I. INTRODUCTION
SUBSPACE learning (SL), also known as dimensionality reduction or feature extraction, aims to uncover the intrinsically low-dimensional representation of the original high-dimensional data.

The authors are with the Department of Computer Science, Hong Kong Baptist University, Hong Kong, SAR, China (e-mail: jinfuren@comp.hkbu.edu.hk; csygliu@comp.hkbu.edu.hk; jiming@comp.hkbu.edu.hk).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCYB.2022.3206064.
In many real-world applications, such as image/face recognition [26], [48], [49], natural language processing [9], [10], [21], and healthcare [5], [18], [55], multiple related learning tasks exist simultaneously. Take the application of recognizing facial images as an example. Given the datasets of facial images, the learning tasks could be recognizing faces under various pose and illumination conditions, or recognizing occluded faces, or recognizing different emotions from human faces. The above tasks are somewhat similar in the sense that they all use face images as the input. Moreover, some common features, such as the facial contours, could play an important role in all these tasks. On the other hand, the above tasks are different in terms of their nature or/and objective, which may require some task-specific features for the respective recognition tasks. For instance, discriminative facial features that are robust to the external variations are required to accomplish the task of face recognition under various pose and illumination conditions, whereas a subspace with distinguishable emotional representations yet insensitive to different face identities is more important to the task of facial emotion recognition.
When processing multiple related learning tasks with high-dimensional data, it is of crucial importance to make use of the subspace extracted from some tasks to help learn others, so that the learning performance of all tasks can be enhanced together. To do so, it is crucial to answer the following question: How can the commonality (i.e., the task-shared features) among multiple learning tasks and the individuality (i.e., the task-specific features) of different learning tasks be characterized and extracted from the given datasets, so as to benefit the subsequent learning targets?

A. Related Work
Extensive research has been conducted on single-task SL. According to the availability of the label information, SL methods can be roughly divided into two categories: 1) unsupervised SL and 2) supervised SL. Principal component analysis (PCA) [15] is the most classical unsupervised SL algorithm, which maximizes the total variance of the low-dimensional projection. As another category of unsupervised SL technologies, subspace clustering aims to preserve the grouping information in the learned subspaces to identify meaningful clusters [6], [40], [43]. Linear discriminant analysis (LDA) [12], which is the most typical supervised SL algorithm, aims to seek the discriminative subspace that maximizes the between-class scatter and minimizes the within-class scatter simultaneously. To deal with nonlinear datasets, some nonlinear SL methods have been developed, including kernel PCA (KPCA) [34], kernel LDA (KLDA) [1], manifold learning [32], [38], [52], and deep neural network-based methods [14], [17], [27], [29], [31].
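To make the two classical baselines above concrete, here is a minimal numpy sketch (our own illustration, not code from the paper) that computes a PCA direction from the total scatter and an LDA direction from the between-/within-class scatters on a toy two-class problem.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy two-class data: classes separated along the first axis.
X1 = rng.normal([0.0, 0.0], 0.5, size=(50, 2))
X2 = rng.normal([3.0, 0.0], 0.5, size=(50, 2))
X = np.vstack([X1, X2])
mu = X.mean(axis=0)

# PCA: top eigenvector of the total scatter S_t (maximizes total variance).
S_t = (X - mu).T @ (X - mu)
w_pca = np.linalg.eigh(S_t)[1][:, -1]

# LDA: top eigenvector of inv(S_w) S_b (maximizes between- over within-class scatter).
S_w = sum((Xc - Xc.mean(0)).T @ (Xc - Xc.mean(0)) for Xc in (X1, X2))
gap = (X1.mean(0) - X2.mean(0))[:, None]
S_b = gap @ gap.T
vals, vecs = np.linalg.eig(np.linalg.inv(S_w) @ S_b)
w_lda = vecs[:, np.argmax(vals.real)].real
```

On this toy problem both directions align with the discriminative first axis; they diverge once the direction of largest variance differs from the discriminative one.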
In order to handle multiple related learning tasks, some representative SL algorithms have been extended to multitask scenarios. Yamane et al. [46] developed a multitask PCA (MPCA) by casting the original PCA into a multitask framework based on a distance measurement between subspaces. Saha et al. [33] addressed the inefficient computation and suboptimal performance issues in the single-task setting of subspace clustering by extending it to the multitask setting via exploiting the structural sharing between tasks and data points. Zhang and Yeung [51] generalized the classical LDA to a multitask version (MTDA), aiming at handling heterogeneous feature spaces and multiclass classification problems. In addition to the extension of existing single-task SL algorithms, some methods have been designed directly for multitask SL. Gu and Zhou [13] presented a multitask clustering method, aiming at learning a subspace shared by all tasks to enhance the performance of multitask clustering and transductive transfer classification. Zhu et al. [54] proposed a sparse multitask learning method, which regularizes the subspace representation to preserve the local structural information. Liu et al. [19], [20] introduced two multitask feature selection methods that select important features from the original feature spaces of different learning tasks by imposing sparse constraints.
Metric learning [16], [44], under some mild conditions, can be considered as one category of SL methods, because the learned distance matrix for measuring the similarity/relatedness between data samples can sometimes (e.g., if positive semidefinite) be decomposed into two transformation matrices that project the original data to a lower-dimensional feature space. Bhattarai et al. [4] presented a multitask metric learning method (MTSML), in which the projections are coupled with each other by enforcing them to be a combination of one common projection matrix and one task-specific projection matrix. Suo et al. [37] introduced a low-rank constraint on the learned transformation matrices to model the task relation.
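The decomposition mentioned above can be illustrated with a small sketch (ours, using a synthetic metric rather than one produced by any of the cited methods): a positive semidefinite metric matrix M factors as M = LᵀL, so distances under M equal squared Euclidean distances after projecting with L.

```python
import numpy as np

rng = np.random.default_rng(1)
# A synthetic positive semidefinite metric matrix (illustrative, not a learned one).
A = rng.normal(size=(4, 4))
M = A.T @ A

# Factor M = L.T @ L via eigendecomposition.
vals, vecs = np.linalg.eigh(M)
L = np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

x, y = rng.normal(size=4), rng.normal(size=4)
d_metric = (x - y) @ M @ (x - y)         # squared distance under metric M
d_euclid = np.sum((L @ x - L @ y) ** 2)  # squared Euclidean distance after projection by L
```

The two quantities coincide, which is why such a metric can be read as a linear subspace projection.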

B. Motivation
Although various methods have been proposed for multitask SL, they may not fully explore the commonality and individuality of the subspaces of multiple learning tasks in an appropriate way. Specifically, most of the aforementioned multitask SL methods focused on modeling the strong commonality or similarity between multiple learning tasks and thus neglected the individuality of different learning tasks, which is, however, of equal importance to the commonality in characterizing the nature of the learning tasks. Moreover, when modeling the commonality of the learned subspaces of multiple tasks, the shared transformation matrices of different tasks are usually forced to be exactly the same as each other. This constraint could be too strong in many situations because, even if there exist similar or common properties across multiple learning tasks, these properties may not necessarily be exactly the same across different tasks. To provide a general and comprehensive framework for multitask SL, it is necessary to characterize both the commonality and individuality of multiple tasks. As a result, we have to answer the following three questions.
1) How can we define the commonality and individuality in multitask SL and formulate the corresponding objective functions to characterize them? 2) How can we quantify and analyze the proposed method's capacity in capturing the commonality and individuality of multiple learning tasks? 3) How can we validate the performance of the proposed method in various applications of multitask SL?

C. Our Contribution
This article aims to tackle the challenging issue of characterizing and extracting the commonality and individuality subspaces of multiple learning tasks by answering the three questions above. The contributions of this article can be highlighted as follows.
1) We start with a mathematical statement of the problem of multitask SL, including the necessary definitions and notations. Based on the problem statement, we formulate the commonality and individuality of multiple subspaces in terms of the similarity/dissimilarity between transformation matrices, and propose a novel method called commonality and individuality-based SL (CISL) to characterize both commonality and individuality.
2) We develop an iterative algorithm to optimize the objective functions of CISL. We further provide a comprehensive theoretical analysis of the proposed method to examine its learning behavior and capacity from different perspectives, including the algorithm convergence analysis, the computational complexity analysis, and the connection between the proposed CISL and other representative single-task/multitask subspace methods.
3) We design systematic experimentation to validate the effectiveness of the proposed method. First, we provide an illustrative example to explain the effect of CISL in an intuitive way. Then, we validate the performance of CISL (in terms of the subsequent classification accuracy) by comparing it with classical and state-of-the-art approaches on both synthetic and real-world multitask datasets. Experimental results demonstrate CISL's capacity in capturing the commonality and individuality for multitask SL.

D. Organization of This Article
The remainder of this article is organized as follows. Section II introduces the details of the proposed method, including the problem definition, objective function formulation, and optimization procedure. Section III presents the theoretical analysis of the proposed method, including the convergence analysis, complexity analysis, and the connections to other methods. Section IV provides extensive experimental results on illustrative examples as well as synthetic and real-world datasets to validate the effectiveness of the proposed method. Section V concludes this article.

Fig. 1. Schematic illustration of the idea behind the proposed CISL method. (a) Multiple related but different learning tasks. Task 1 is the face recognition task under various pose and illumination conditions; Task m is the task of occluded face recognition; and Task M is to recognize emotions from different human faces. These tasks are related in the sense that they all use face images as the input and some of the common facial features, such as the facial contours, play an important role in all the above tasks. On the other hand, these tasks are different in terms of the task nature or/and task objective, which may require some task-specific features for the respective recognition tasks. (b) Developed CISL method. For each task T_m (m = 1, ..., M), CISL aims to learn two subspaces: one commonality subspace C_m and one individuality subspace V_m. The transformation matrix C_m is constructed to capture the commonality shared by multiple learning tasks, while the transformation matrix V_m is designed to characterize the task-specific features of each individual learning task. (c) Learning results of multiple tasks. With the task-shared and task-specific features being well captured by the proposed CISL, in each learning task, the within-class data samples are expected to be mapped together while the between-class samples are expected to be projected far away from each other in the learned subspace, thus benefiting the subsequent recognition task.

II. COMMONALITY AND INDIVIDUALITY-BASED SUBSPACE LEARNING
In this section, we elaborate on the CISL method. First, we introduce the notations and define the problem of multitask SL. Based on that, we formulate the objective functions for capturing commonality and individuality, respectively. Finally, we describe an iterative strategy to solve the optimization problems formulated in the objective functions of CISL.

A. Problem Definition
Given M multiclass classification tasks, {T_m} (m = 1, ..., M), as shown in Fig. 1(a), let G_m denote the number of classes in T_m. The target of multitask SL is to learn M transformation matrices to project the data in the given M tasks to the corresponding subspaces, where the discriminative information is well preserved for all tasks. In order to capture both commonality and individuality of all learning tasks, the proposed CISL, as shown in Fig. 1(b), learns two transformation matrices for each task: a commonality matrix C_m and an individuality matrix V_m. The objective function of CISL is designed to maximize the correlation between the M commonality subspaces and minimize the correlation between the M individuality subspaces. Meanwhile, the between-class scatter in each task is maximized and the within-class scatter in each task is minimized. By doing so, the data samples in the same class are expected to be mapped together while the samples from different classes are expected to be projected far away from each other in the subspaces of all tasks, as shown in Fig. 1(c). The notations used in this article and the corresponding explanations are provided in Table I.
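The per-task scatter matrices that the objective functions rely on can be computed as in the following sketch (the function name and toy data are ours); it also verifies the standard decomposition S_t = S_b + S_w.

```python
import numpy as np

def scatter_matrices(X, y):
    """Between-class (S_b), within-class (S_w), and total (S_t) scatter
    matrices of one task's data X (n x d) with integer labels y."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_b, S_w = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        gap = (Xc.mean(axis=0) - mu)[:, None]
        S_b += len(Xc) * (gap @ gap.T)                      # class-mean spread
        S_w += (Xc - Xc.mean(axis=0)).T @ (Xc - Xc.mean(axis=0))  # within-class spread
    S_t = (X - mu).T @ (X - mu)                             # total scatter
    return S_b, S_w, S_t

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = rng.integers(0, 3, size=30)
S_b, S_w, S_t = scatter_matrices(X, y)
# The classical decomposition S_t = S_b + S_w holds exactly.
```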

B. Objective Function of CISL
Based on the notations and problem statement, we can formulate the objective function of the proposed CISL.

1) Characterizing Task Commonality:
To capture the commonality of subspaces learned from multiple tasks, we propose the following objective function:

$$\max_{\{C_m\}_{m=1}^{M}} \frac{\sum_{m=1}^{M} \operatorname{tr}\!\left(C_m^T S_b^m C_m\right) + \sum_{m=1}^{M} \sum_{p \neq m} \left\|C_m^T C_p\right\|_F^2}{\sum_{m=1}^{M} \operatorname{tr}\!\left(C_m^T S_t^m C_m\right)} \tag{1}$$

where $S_b^m$ and $S_t^m$ denote the between-class and total scatter matrices of the mth task, respectively. In (1), we preserve the discriminative information of all M tasks in the commonality subspaces by maximizing the overall between-class scatter, $\sum_{m} \operatorname{tr}(C_m^T S_b^m C_m)$, and minimizing the overall total scatter, $\sum_{m} \operatorname{tr}(C_m^T S_t^m C_m)$, simultaneously. Moreover, the second term in the numerator of (1) measures the summation of pairwise correlations between the commonality subspaces of any two tasks. As $\{C_m\}_{m=1}^{M}$ represent the commonality of the given learning tasks, we expect them to be as closely correlated to each other as possible. Therefore, we maximize the correlation of all pairs of $C_m$ and $C_p$ ($m \neq p$), that is, $\|C_m^T C_p\|_F^2$. Here, we use the F-norm of the matrix product to measure the similarity/correlation between $C_m$ and $C_p$ because of the universality of this formulation. Other metrics can also be utilized according to various learning requirements. By maximizing this term, the commonality of multiple learning tasks, which is generally hidden in the original feature space, can be extracted to the maximum extent.
Unlike existing methods for multitask SL that enforce the common or shared transformation matrices to be exactly the same [4], [22], [51], we aim to maximize the correlation/similarity (i.e., minimize the distance) between common transformation matrices of different tasks but allow them to be different, as in many real-world applications, the existence of commonality does not necessarily mean exactly the same.By relaxing this strong constraint, we expect the learning behavior of the proposed method to be more flexible.
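The effect of the correlation term can be checked numerically with a toy evaluation of a ratio of the form described above (our own sketch; the exact scaling of the terms in (1) may differ): identical orthonormal transformation matrices score higher than mutually orthogonal ones.

```python
import numpy as np

def commonality_objective(C, S_b, S_t):
    """Ratio of (overall between-class scatter + pairwise subspace
    correlations) to overall total scatter, for transformation matrices
    C[m] of shape (d, k). The unweighted sum is our assumption."""
    M = len(C)
    between = sum(np.trace(C[m].T @ S_b[m] @ C[m]) for m in range(M))
    corr = sum(np.linalg.norm(C[m].T @ C[p], "fro") ** 2
               for m in range(M) for p in range(M) if p != m)
    total = sum(np.trace(C[m].T @ S_t[m] @ C[m]) for m in range(M))
    return (between + corr) / total

d, k, M = 6, 2, 3
S_b = [np.eye(d)] * M           # toy scatter matrices
S_t = [2.0 * np.eye(d)] * M
Q = np.linalg.qr(np.random.default_rng(0).normal(size=(d, k)))[0]
I6 = np.eye(d)
same = commonality_objective([Q] * M, S_b, S_t)                             # identical C's
diff = commonality_objective([I6[:, :2], I6[:, 2:4], I6[:, 4:]], S_b, S_t)  # orthogonal C's
```

As expected, `same` exceeds `diff`, since correlated commonality matrices inflate the numerator.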
2) Characterizing Task Individuality: To capture the individuality of subspaces learned from different tasks, we introduce the following objective function:

$$\max_{\{V_m\}_{m=1}^{M}} \frac{\sum_{m=1}^{M} \operatorname{tr}\!\left(V_m^T S_b^m V_m\right)}{\sum_{m=1}^{M} \left( \operatorname{tr}\!\left(V_m^T S_t^m V_m\right) + \sum_{p \neq m} \left\|V_m^T V_p\right\|_F^2 + \left\|C_m^T V_m\right\|_F^2 \right)} \tag{2}$$

Similar to (1), the discriminative information of all M learning tasks is also well preserved in the individuality subspaces by simultaneously maximizing the corresponding overall between-class scatter $\sum_{m} \operatorname{tr}(V_m^T S_b^m V_m)$. Meanwhile, the second term in the big parenthesis in the denominator of (2) measures the summation of pairwise correlations between the individuality subspaces of any two learning tasks. Since $V_m$ (m = 1, ..., M) represent the individuality of different learning tasks, we expect them to be as independent of each other as possible. Therefore, we minimize the correlation of all pairs of $V_m$ and $V_p$ ($m \neq p$). By doing so, we expect that the transformation matrices $V_m$ can represent the individuality of different learning tasks to the maximum extent. Moreover, we introduce the last term in the denominator of (2) to minimize the redundancy between the commonality subspace and the individuality subspace for each learning task.
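Analogously, an individuality score of the form described above rewards mutually independent transformation matrices, as the following toy sketch shows (our own evaluation; the unweighted sums and all names are assumptions):

```python
import numpy as np

def individuality_objective(V, C, S_b, S_t):
    """Between-class scatter over (total scatter + pairwise V-V correlation
    + C-V redundancy), for matrices V[m], C[m] with columns in R^d."""
    M = len(V)
    between = sum(np.trace(V[m].T @ S_b[m] @ V[m]) for m in range(M))
    total = sum(np.trace(V[m].T @ S_t[m] @ V[m]) for m in range(M))
    vv = sum(np.linalg.norm(V[m].T @ V[p], "fro") ** 2
             for m in range(M) for p in range(M) if p != m)   # V-V correlation
    cv = sum(np.linalg.norm(C[m].T @ V[m], "fro") ** 2 for m in range(M))  # C-V redundancy
    return between / (total + vv + cv)

d, M = 6, 3
S_b = [np.eye(d)] * M
S_t = [2.0 * np.eye(d)] * M
I6 = np.eye(d)
C = [I6[:, :1]] * M                               # a shared commonality direction
V_dist = [I6[:, 1:3], I6[:, 3:5], I6[:, [5, 1]]]  # largely distinct directions
V_same = [I6[:, 1:3]] * M                         # identical individuality matrices
hi = individuality_objective(V_dist, C, S_b, S_t)
lo = individuality_objective(V_same, C, S_b, S_t)
```

Distinct individuality matrices (`hi`) score above identical ones (`lo`), matching the intent of minimizing the correlation terms.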

C. Optimization Procedure
In this section, we introduce iterative strategies to optimize (1) and (2). Since there is no closed-form solution, we alternately optimize each C_m while fixing the others, until the entire procedure converges. After having obtained $\{C_m\}_{m=1}^{M}$, we then update $\{V_m\}_{m=1}^{M}$ in (2) using a similar strategy, that is, we optimize each V_m with the others fixed until the convergence of the whole procedure. The proof of convergence will be provided in Section III-A.

1) Optimizing $\{C_m\}_{m=1}^{M}$: The term $\|C_m^T C_p\|_F^2$ in (1) can be equivalently represented in the trace form, that is, $\operatorname{tr}(C_m^T C_p C_p^T C_m)$. Therefore, we rewrite (1) as follows:

$$\max_{\{C_m\}_{m=1}^{M}} \frac{\sum_{m=1}^{M} \operatorname{tr}\!\left(C_m^T \left(S_b^m + \sum_{p \neq m} C_p C_p^T\right) C_m\right)}{\sum_{m=1}^{M} \operatorname{tr}\!\left(C_m^T S_t^m C_m\right)} \tag{3}$$

We iteratively optimize the M commonality matrices $\{C_m\}_{m=1}^{M}$ in the above objective function. Specifically, in the mth step, we update C_m with the remaining C_1, ..., C_{m−1}, C_{m+1}, ..., C_M being fixed. Thus, the objective function for optimizing C_m can be represented as follows:

$$\max_{C_m} \frac{\operatorname{tr}\!\left(C_m^T \left(S_b^m + \sum_{p \neq m} C_p C_p^T\right) C_m\right) + a_c^m}{\operatorname{tr}\!\left(C_m^T S_t^m C_m\right) + b_c^m} \tag{4}$$

Then, we can further rewrite (4) as follows:

$$\max_{C_m} \frac{\operatorname{tr}\!\left(C_m^T A_c^m C_m\right) + a_c^m}{\operatorname{tr}\!\left(C_m^T B_c^m C_m\right) + b_c^m} \tag{5}$$

where $A_c^m = S_b^m + \sum_{p \neq m} C_p C_p^T$, $B_c^m = S_t^m$, and $a_c^m$ and $b_c^m$ collect the terms that involve only the fixed tasks. Here, $I_d$ denotes the d × d identity matrix, with d being the original dimension of data samples. Since all tasks except the mth one are fixed, $a_c^m$ and $b_c^m$ are constants. To solve the optimization problem in (5), we employ an iterative procedure introduced in [41]. Specifically, in the tth iteration (t = 1, 2, ...), we first calculate the trace ratio in (5):

$$\lambda_c^m(t) = \frac{\operatorname{tr}\!\big(C_m^{(t-1)T} A_c^m C_m^{(t-1)}\big) + a_c^m}{\operatorname{tr}\!\big(C_m^{(t-1)T} B_c^m C_m^{(t-1)}\big) + b_c^m} \tag{6}$$

where $\lambda_c^m(t)$ denotes the value of $\lambda_c^m$ after the tth iteration, and $C_m^{(t-1)}$ denotes the value of C_m after the (t − 1)th iteration. Then, we construct the matrix

$$S_c^m(t) = A_c^m - \lambda_c^m(t)\, B_c^m \tag{7}$$
Based on the constructed $S_c^m(t)$, we can obtain the optimal C_m in the tth iteration by solving the following problem:

$$C_m^{(t)} = \arg\max_{C_m^T C_m = I} \operatorname{tr}\!\big(C_m^T S_c^m(t)\, C_m\big) \tag{8}$$

which can be solved by the eigendecomposition of $S_c^m(t)$, taking the eigenvectors corresponding to the largest eigenvalues. We repeat the steps in (6)-(8) until C_m converges.
2) Optimizing $\{V_m\}_{m=1}^{M}$: The terms $\|V_m^T V_p\|_F^2$ and $\|C_m^T V_m\|_F^2$ can be equivalently represented in the trace forms $\operatorname{tr}(V_m^T V_p V_p^T V_m)$ and $\operatorname{tr}(V_m^T C_m C_m^T V_m)$, respectively. Then, we can rewrite (2) as follows:

$$\max_{\{V_m\}_{m=1}^{M}} \frac{\sum_{m=1}^{M} \operatorname{tr}\!\left(V_m^T S_b^m V_m\right)}{\sum_{m=1}^{M} \operatorname{tr}\!\left(V_m^T \left(S_t^m + \sum_{p \neq m} V_p V_p^T + C_m C_m^T\right) V_m\right)} \tag{9}$$

We iteratively optimize the M individuality matrices $\{V_m\}_{m=1}^{M}$ in the above objective function. In the mth step, we update V_m with the remaining V_1, ..., V_{m−1}, V_{m+1}, ..., V_M being fixed. Thus, the objective function for optimizing V_m can be represented as follows:

$$\max_{V_m} \frac{\operatorname{tr}\!\left(V_m^T S_b^m V_m\right) + a_v^m}{\operatorname{tr}\!\left(V_m^T \left(S_t^m + \sum_{p \neq m} V_p V_p^T + C_m C_m^T\right) V_m\right) + b_v^m} \tag{10}$$

where $a_v^m$ and $b_v^m$ collect the terms involving only the fixed tasks. Then, we rewrite (10) as follows:

$$\max_{V_m} \frac{\operatorname{tr}\!\left(V_m^T A_v^m V_m\right) + a_v^m}{\operatorname{tr}\!\left(V_m^T B_v^m V_m\right) + b_v^m} \tag{11}$$

where $A_v^m = S_b^m$ and $B_v^m = S_t^m + \sum_{p \neq m} V_p V_p^T + C_m C_m^T$. Here, $a_v^m$ and $b_v^m$ are constants. Similar to the optimization for C_m, we employ the iterative procedure to solve the optimization problem in (11). In the tth iteration, we first calculate the trace ratio in (11) as follows:

$$\lambda_v^m(t) = \frac{\operatorname{tr}\!\big(V_m^{(t-1)T} A_v^m V_m^{(t-1)}\big) + a_v^m}{\operatorname{tr}\!\big(V_m^{(t-1)T} B_v^m V_m^{(t-1)}\big) + b_v^m} \tag{12}$$

where $\lambda_v^m(t)$ denotes the value of $\lambda_v^m$ after the tth iteration, and $V_m^{(t-1)}$ denotes the value of V_m after the (t − 1)th iteration. Then, we construct the matrix

$$S_v^m(t) = A_v^m - \lambda_v^m(t)\, B_v^m \tag{13}$$

Finally, we can obtain the optimal V_m in the tth iteration by solving the following optimization problem:

$$V_m^{(t)} = \arg\max_{V_m^T V_m = I} \operatorname{tr}\!\big(V_m^T S_v^m(t)\, V_m\big) \tag{14}$$

We repeat the steps in (12)-(14) until V_m converges. After convergence of the entire learning procedure, we obtain the final transformation matrix W_m for the mth task by concatenating C_m and V_m, that is, $W_m = [C_m, V_m]$.
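The inner updates above follow the classical trace-ratio iteration. The following self-contained sketch implements that generic scheme (the additive constants $a_c^m$, $b_c^m$ are omitted for simplicity, and all names are ours):

```python
import numpy as np

def trace_ratio(A, B, k, iters=20):
    """Maximize tr(W^T A W) / tr(W^T B W) over W with orthonormal columns:
    fix lambda, eigendecompose A - lambda*B, keep the top-k eigenvectors,
    and repeat until the ratio stops changing."""
    d = A.shape[0]
    W = np.linalg.qr(np.random.default_rng(0).normal(size=(d, k)))[0]
    lam = np.trace(W.T @ A @ W) / np.trace(W.T @ B @ W)
    for _ in range(iters):
        vecs = np.linalg.eigh(A - lam * B)[1]
        W = vecs[:, -k:]                  # eigenvectors of the largest eigenvalues
        lam_new = np.trace(W.T @ A @ W) / np.trace(W.T @ B @ W)
        if abs(lam_new - lam) < 1e-12:
            lam = lam_new
            break
        lam = lam_new
    return W, lam

# A toy problem where the best 1-D ratio direction is the first axis.
A = np.diag([5.0, 1.0, 1.0])
B = np.diag([1.0, 2.0, 2.0])
W, lam = trace_ratio(A, B, k=1)
```

On this diagonal toy problem the optimal 1-D ratio is 5, attained along the first axis.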

III. THEORETICAL ANALYSIS OF THE CISL METHOD
In this section, we analyze the algorithm convergence and the computational complexity of CISL, as well as the connections between CISL and other representative SL methods.

A. Convergence Analysis
To prove the convergence of CISL, we need to show that: 1) the objective functions of commonality/individuality SL are upper bounded and 2) the objective functions are nondecreasing during the optimization procedure in Algorithm 1.
1) Convergence of Optimization on $\{C_m\}_{m=1}^{M}$: For the objective function of commonality SL, we first derive its upper bound. We denote the objective function in (1) as J(C), where C = {C_1, ..., C_M} is the set of commonality transformation matrices learned from all tasks. For each task, since C_m has orthonormal columns, the per-task scatter ratio satisfies

$$\frac{\operatorname{tr}\!\left(C_m^T S_b^m C_m\right)}{\operatorname{tr}\!\left(C_m^T S_t^m C_m\right)} \leq \frac{\sum_{j} \mu_{b(j)}^m}{\sum_{j} \mu_{t(j)}^m} \tag{15}$$

where $\mu_{b(j)}^m$ is the jth largest eigenvalue of $S_b^m$, and $\mu_{t(j)}^m$ is the jth smallest eigenvalue of $S_t^m$. Combining (1) and (15), we
obtain the upper bound of J(C). In particular, since each $\|C_m^T C_p\|_F^2$ is bounded (by the subspace dimension k, given the orthonormality constraint in (8)), we have

$$J(\mathbf{C}) \leq \frac{\sum_{m=1}^{M}\left(\sum_{j} \mu_{b(j)}^{m} + (M-1)k\right)}{\sum_{m=1}^{M}\sum_{j} \mu_{t(j)}^{m}} \tag{16}$$

Second, we show that the objective function of commonality SL is nondecreasing during the optimization procedure. In the for loop (lines 4-10) of Algorithm 1, once we fix the other tasks to update the mth one, we denote J(C) as J(C_m). The following theorem shows that in the inner while loop, the value of J(C_m) is nondecreasing.

Theorem 1: Following the iterative procedure in (6)-(8), we have $J(C_m^{(t)}) \geq J(C_m^{(t-1)})$ in the first inner while loop of Algorithm 1.
Proof: Let $g(C_m) = \operatorname{tr}\!\big(C_m^T S_c^m(t)\, C_m\big) + a_c^m - \lambda_c^m(t)\, b_c^m$. By the definition of $\lambda_c^m(t)$ in (6), we have $g(C_m^{(t-1)}) = 0$. Since $C_m^{(t)}$ maximizes $\operatorname{tr}(C_m^T S_c^m(t)\, C_m)$ in (8), we have $g(C_m^{(t)}) \geq g(C_m^{(t-1)}) = 0$, which implies

$$\frac{\operatorname{tr}\!\big(C_m^{(t)T} A_c^m C_m^{(t)}\big) + a_c^m}{\operatorname{tr}\!\big(C_m^{(t)T} B_c^m C_m^{(t)}\big) + b_c^m} \geq \lambda_c^m(t)$$

that is, $\lambda_c^m(t+1) \geq \lambda_c^m(t)$, and hence $J(C_m^{(t)}) \geq J(C_m^{(t-1)})$.

We then show that for the outer while loop, the value of the objective function J(C) is nondecreasing. Specifically, we have the following theorem.
Theorem 2: In the first outer while loop (lines 3-10) of Algorithm 1, we have J l (C) ≥ J l−1 (C), where J l (C) denotes the value of J(C) after the lth outer iteration.
Proof: Since $J_l(\mathbf{C})$ denotes the value of J(C) after the lth outer iteration, it can be rewritten as $J_l(\mathbf{C}) = J(C_1^{[l]}, \ldots, C_M^{[l]})$, where $C_m^{[l]}$ (m = 1, ..., M) denotes the updated value of C_m after the lth outer iteration. We further denote the intermediate objective value after the first m tasks have been updated in the lth outer iteration as $J(C_1^{[l]}, \ldots, C_m^{[l]}, C_{m+1}^{[l-1]}, \ldots, C_M^{[l-1]})$. By Theorem 1, updating each single task does not decrease the objective value, so we can easily get the following result:

$$J_{l}(\mathbf{C}) \geq J\big(C_1^{[l]}, \ldots, C_{M-1}^{[l]}, C_M^{[l-1]}\big) \geq \cdots \geq J\big(C_1^{[l]}, C_2^{[l-1]}, \ldots, C_M^{[l-1]}\big) \geq J_{l-1}(\mathbf{C}).$$

Equation (16) shows that J(C) is upper bounded, and Theorems 1 and 2 indicate that J(C) is nondecreasing during the optimization procedure in Algorithm 1. Therefore, the convergence of the optimization for commonality subspaces can be guaranteed.
2) Convergence of Optimization on $\{V_m\}_{m=1}^{M}$: Similar to the proof for commonality SL, we denote the objective function of individuality SL in (2) as J(V), where V = {V_1, ..., V_M} is the set of individuality transformation matrices learned from all tasks. Then, following the same per-task eigenvalue bound as in (15), where $\mu_{b(j)}^m$ and $\mu_{t(j)}^m$ are the same as those defined immediately after (15), J(V) is upper bounded by

$$J(\mathbf{V}) \leq \frac{\sum_{m=1}^{M}\sum_{j} \mu_{b(j)}^{m}}{\sum_{m=1}^{M}\sum_{j} \mu_{t(j)}^{m}} \tag{19}$$

To further show that J(V) is nondecreasing during the optimization procedure, we denote it as J(V_m) when fixing the other tasks to update the mth one in the for loop (lines 14-20) of Algorithm 1. Similar to the properties of J(C) in Theorems 1 and 2, we have the following theorems for J(V).

Theorem 3: Following the iterative procedure in (12)-(14), we have $J(V_m^{(t)}) \geq J(V_m^{(t-1)})$ in the second inner while loop (lines 16-20) of Algorithm 1.
Theorem 4: In the second outer while loop (lines 13-20) of Algorithm 1, we have J l (V) ≥ J l−1 (V), where J l (V) denotes the value of J(V) after the lth outer iteration.
Since the proofs of Theorems 3 and 4 are analogous to those of Theorems 1 and 2, we omit them here to save space. From (19) and Theorems 3 and 4, we know that J(V) is upper bounded and nondecreasing during the optimization procedure in Algorithm 1. Therefore, the convergence of the optimization for individuality subspaces can be guaranteed.
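The nondecreasing behavior established by Theorems 1-4 can also be checked numerically. The sketch below (our own toy instance of the generic trace-ratio update, without the additive constants) records the objective value across iterations:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 8, 3
R1, R2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
A = R1 @ R1.T                   # PSD numerator matrix (plays the role of A_c / A_v)
B = R2 @ R2.T + d * np.eye(d)   # PD denominator matrix (plays the role of B_c / B_v)

W = np.linalg.qr(rng.normal(size=(d, k)))[0]
history = []
for _ in range(15):
    lam = np.trace(W.T @ A @ W) / np.trace(W.T @ B @ W)
    history.append(lam)
    W = np.linalg.eigh(A - lam * B)[1][:, -k:]   # top-k eigenvectors
nondecreasing = all(b >= a - 1e-10 for a, b in zip(history, history[1:]))
```

The recorded trace-ratio values never decrease, mirroring the monotonicity argument in the proofs.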

B. Computational Complexity
In this section, we analyze the computational complexity of the proposed method. The most computationally demanding step is the update of the transformation matrices in each iteration, which requires eigendecomposition (lines 10 and 20 in Algorithm 1). For the optimization of commonality subspaces, the complexity of a single eigendecomposition is $O(d^3)$ (line 10), so the complexity for updating the mth task's commonality transformation matrix is $O(l_c t_c^m d^3)$, where $t_c^m$ is the number of iterations in the inner while loop for the mth task, $l_c$ is the number of iterations in the outer while loop for all tasks, and d is the original dimension. For the optimization of individuality subspaces, the procedure is similar, that is, the complexity of a single eigendecomposition is $O(d^3)$ (line 20), so the complexity for updating the mth task's individuality transformation matrix is $O(l_v t_v^m d^3)$, where $t_v^m$ is the number of iterations in the inner while loop for the mth task and $l_v$ is the number of iterations in the outer while loop for all tasks. Therefore, the total computational complexity for learning all tasks in Algorithm 1 is $O\big(\sum_{m=1}^{M} (l_c t_c^m + l_v t_v^m)\, d^3\big)$. In fact, $l_c$, $l_v$, $t_c^m$, and $t_v^m$ are very small integers in practice (generally less than 5, see the experiment section for more details) and thus can be regarded as constants.
C. Connections to Other Representative SL Methods

1) Connections to PCA: If we set M = 1 and G = N, which means that there is only one learning task and each class has only one sample, then the mean vector of each class in the task is equal to the feature vector of the only sample in that class. Accordingly, we have $S_w = 0$ and $S_b = S_t$. Since $S_w = 0$, there is no need to minimize the term $\operatorname{tr}(W^T S_w W)$; then, the objective function of the proposed method is reduced to that of PCA:

$$\max_{W^T W = I} \operatorname{tr}\!\left(W^T S_t W\right).$$

2) Connections to LDA: If we set M = 1, there is no need to consider the commonality and individuality of multiple tasks; then, the objective function of the proposed method becomes

$$\max_{W} \frac{\operatorname{tr}\!\left(W^T S_b W\right)}{\operatorname{tr}\!\left(W^T S_w W\right)}$$

which is exactly the objective of LDA (in its trace-ratio form). So the proposed method reduces to classical LDA under the single-task setting.
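The single-task reduction can be verified numerically: on toy two-class data, a 1-D trace-ratio iteration on (S_b, S_w) recovers the classical LDA direction obtained from the generalized eigenproblem (data and names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0, 0.0], 1.0, size=(40, 3))
X2 = rng.normal([2.0, 0.0, 0.0], 1.0, size=(40, 3))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
gap = (m1 - m2)[:, None]
S_b = gap @ gap.T
S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Classical LDA direction from the generalized eigenproblem inv(S_w) S_b.
vals, vecs = np.linalg.eig(np.linalg.inv(S_w) @ S_b)
w_lda = vecs[:, np.argmax(vals.real)].real
w_lda /= np.linalg.norm(w_lda)

# The same direction via a 1-D trace-ratio iteration (the M = 1 case).
w = np.linalg.qr(rng.normal(size=(3, 1)))[0]
for _ in range(50):
    lam = (w.T @ S_b @ w).item() / (w.T @ S_w @ w).item()
    w = np.linalg.eigh(S_b - lam * S_w)[1][:, -1:]
w = w.ravel() / np.linalg.norm(w.ravel())
aligned = abs(float(w @ w_lda))  # close to 1 when the two directions coincide
```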
3) Connections to MPCA: MPCA is the multitask extension of the classical PCA. It aims to maximize the total variance of the low-dimensional projections of all tasks simultaneously. The objective function of MPCA can be formulated as follows:

$$\max_{\{W_m\}_{m=1}^{M}} \sum_{m=1}^{M} \operatorname{tr}\!\left(W_m^T H_m W_m\right) \tag{24}$$

where $H_m$ denotes the covariance (total scatter) matrix of the data in the mth task. If we let the class number in each task equal the number of training samples (i.e., $G_m = N_m$, m = 1, ..., M), the between-class scatter matrix $S_b^m$ is exactly the same as $H_m$. Moreover, similar to the case in (21), the within-class scatter matrix $S_w^m = 0$ for any m = 1, ..., M. Therefore, the objective function in (1) is reduced to that of MPCA in (24).
4) Connections to MTDA: MTDA is the multitask extension of classical LDA, which aims to handle heterogeneous feature spaces and multiclass classification problems. In MTDA, no matter how many tasks are given, only one common matrix is modeled for all the tasks to represent the common information. For the proposed method, if we restrict the objective of maximizing $\|C_m^T C_p\|_F^2$ to the hard constraint $C_m = C_p$ for all m, p = 1, ..., M, it reduces to the setting in MTDA, that is, all tasks share exactly the same commonality representation. If we further reformulate the way of obtaining the transformation matrix from the concatenation of the commonality and individuality matrices to the multiplication of them, that is, $W_m = C V_m$ with a shared C, then the objective function takes the same form as that of MTDA.

IV. EXPERIMENTATION
In this section, we conduct a series of experiments to systematically evaluate and demonstrate the effectiveness of the proposed method. We start by motivating multitask SL through a comparison with single-task SL. Then, we visually illustrate what can be learned by the proposed method and explain what the commonality and individuality mean in this example. After that, we examine the convergence of the proposed CISL on multiple learning tasks, validating its efficiency. Finally, we compare the proposed method with classical and state-of-the-art single-task/multitask SL methods on both synthetic and real-world datasets, showing its superiority over existing methods in terms of classification accuracy.

A. Multitask SL Versus Single-Task SL
Multitask learning was proposed to address the data insufficiency issue existing in some, if not all, of the given learning tasks.In this section, we show the superiority of multitask SL over single-task SL on both synthetic and real-world datasets.
1) Experiment on Synthetic Datasets: In the first experiment, we generate three synthetic datasets for three multiclass classification tasks, respectively. The first dataset (for Task 1) has three classes, with 80 samples in each class. The second dataset (for Task 2) has four classes, with 10 samples in each class. The third dataset (for Task 3) has three classes, with 10 samples in each class. The dimension of the original feature space is 100. Obviously, Tasks 2 and 3 have very small data sizes compared to Task 1.
For the first dataset, we first generate 2-D (low-dimensional) data points for classes 1, 2, and 3 by sampling from the normal distributions N(0, I), N(1, I), and N(2, I), respectively. Then, we create a 100 × 2 inverse transformation matrix, with each element sampled from N(0, √2). Finally, we map all 2-D data points back to the 100-D feature space via the inverse transformation matrix to generate the original high-dimensional data points for the first dataset. The purpose of using such an inverse transformation matrix to create high-dimensional data is to ensure that all data points have the "ground truth" subspace.
For the second dataset, the 2-D data points of the first three classes are the same as those in the first dataset, while the points of the fourth class are sampled from N(2.5, I). For this task's inverse transformation matrix, its first 50 rows are the same as those in the inverse transformation matrix of Task 1, while the elements of its last 50 rows are independently generated from the uniform distribution U(0, 1). Then, we map all 2-D samples to the 100-D feature space using the inverse transformation matrix to generate the original high-dimensional data.
For the third dataset, the 2-D data points are the same as those in the last three classes of the second dataset. For the inverse transformation matrix, the elements of its first 50 rows are independently generated from U(0, 1), while its last 50 rows are the same as those in the inverse transformation matrix of Task 1. After that, we use this inverse transformation matrix to map all 2-D data points back to the 100-D feature space to form the original high-dimensional samples for the third dataset.
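The Task 1 recipe above can be sketched as follows (our paraphrase, interpreting the second parameter of N(0, √2) as the standard deviation; Tasks 2 and 3 are generated analogously by reusing rows of the inverse transformation matrix):

```python
import numpy as np

rng = np.random.default_rng(0)

# Task 1: three classes (80 samples each) drawn in 2-D around means 0, 1, 2,
# then mapped to 100-D via a 100 x 2 "inverse transformation" matrix P.
Z = np.vstack([rng.normal(mean, 1.0, size=(80, 2)) for mean in (0.0, 1.0, 2.0)])
y = np.repeat([0, 1, 2], 80)
P = rng.normal(0.0, np.sqrt(2.0), size=(100, 2))
X = Z @ P.T   # 240 x 100 high-dimensional samples with a known 2-D subspace
```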
We compare the proposed CISL with the classical single-task SL method LDA. For each task, we randomly choose 20% of the dataset for training and use the remaining 80% for testing. After SL, we use the nearest neighbor classifier for the final classification. We repeat the experiment ten times and report the average classification accuracy in Table II. As can be seen, by incorporating the information learned from Task 1, the proposed CISL outperforms LDA on both Tasks 2 and 3, which have relatively small data sizes. Compared with Task 3, Task 2 is more closely related to Task 1 in terms of the low-dimensional representation of the data. Therefore, the improvement of CISL over LDA on Task 2 is more substantial than that on Task 3. An interesting observation is that although Task 1 already has a relatively large data size in each class, its learning performance can still be further enhanced by jointly modeling with tasks with small data sizes (Tasks 2 and 3), demonstrating the power of multitask SL in improving all learning tasks' performance.
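The LDA baseline pipeline used above (random split, subspace projection, 1-NN classification, averaged over trials) can be sketched with scikit-learn; the function name and parameters are illustrative:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

def evaluate_lda_1nn(X, y, train_ratio=0.2, n_trials=10, seed=0):
    """Average 1-NN accuracy in the LDA subspace over random splits."""
    accs = []
    for t in range(n_trials):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=train_ratio, stratify=y, random_state=seed + t)
        lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
        knn = KNeighborsClassifier(n_neighbors=1).fit(lda.transform(X_tr), y_tr)
        accs.append(knn.score(lda.transform(X_te), y_te))
    return float(np.mean(accs))
```

The same protocol (replacing the LDA projection with the learned CISL subspaces) applies to the multitask evaluations in the later sections.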
1) PIE dataset [35] contains facial images of 68 persons under different poses with various lighting and illumination conditions. In our experiment, we use 49 frontal-view images of each person for training and testing. 2) AR dataset [24] contains frontal-view images of 100 individuals, and each individual has 26 images with different expressions and illuminations. 3) JAFFE dataset [23] contains 213 frontal-view pictures of 10 Japanese females in 7 emotion categories: a) angry; b) surprise; c) sad; d) happy; e) disgust; f) fear; and g) neutral. The statistics of the tasks on these datasets are given in Table III. Specifically, the task on the PIE dataset is face recognition under various pose and illumination conditions. There are 68 classes in total, with 49 samples in each class. The task on the AR dataset is occluded face recognition. We randomly select 20 classes from the original dataset and use 10 samples for each class. The task on the JAFFE dataset is emotion recognition. In this task, we have 7 emotion classes, and we randomly select 10 samples for each class. Obviously, the number of data samples of the tasks on the AR and JAFFE datasets is relatively small compared with that on the PIE dataset. For each task, we randomly select 10% of the data for training and use the rest for testing. All the images in this experiment are resized to 30 × 40 pixels, and the original 1200-D data are projected to the (G_m − 1)-dimensional subspace by following the standard setting of the classical LDA, where G_m is 68, 20, and 7 for the tasks on the PIE, AR, and JAFFE datasets, respectively. Again, we use the nearest neighbor classifier for the final classification in the learned subspace and report the average classification accuracy over ten trials with randomly selected training data.
As shown in Table IV, the proposed CISL outperforms LDA on all tasks. Moreover, the performance improvement of CISL over LDA on the AR and JAFFE datasets is more significant than that on the PIE dataset, validating the effectiveness of multitask learning in enhancing the performance of tasks with insufficient data.

B. Illustration of Learned Commonality and Individuality
In this section, we illustrate (Fig. 2) and explain the commonality and individuality learned from the tasks on the PIE, AR, and JAFFE datasets. In this experiment, we use all classes and all samples from the PIE, AR, and JAFFE datasets. For each task, we randomly select 20% from each class for training. Then, we select several images from the remaining samples, project them to the learned commonality and individuality subspaces, and reconstruct them using the corresponding subspaces, respectively. The commonality-reconstructed and individuality-reconstructed images are calculated by C_m C_m^T x_i^m and V_m V_m^T x_i^m, respectively. We show several samples from the PIE, AR, and JAFFE datasets in Fig. 2(a), and the corresponding commonality-reconstructed and individuality-reconstructed images in Fig. 2(b) and (c), respectively. The reconstructions in Fig. 2(b) show that the commonality subspaces are able to capture task-shared features of different learning tasks, for instance, the contour of facial images. The reconstructions in Fig. 2(c) demonstrate that the individuality subspaces are capable of characterizing task-specific features of different learning tasks; for example, the eyes' features are extracted for the face recognition task on the PIE dataset; the nose region, which is generally not occluded, is highlighted for the task of occluded face recognition on the AR dataset; and the cheek and mouth regions, which are informative in indicating human emotional responses, are identified as important features for the emotion recognition task on the JAFFE dataset. With the commonality and individuality subspaces learned via CISL, complementary features (i.e., task-shared and task-specific features) can be extracted for the given learning targets.
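Under the assumption that the learned bases have orthonormal columns, projecting a sample onto a subspace and mapping it back gives the reconstruction described above; a minimal sketch (the helper name is ours, and the optional mean-centering is an assumption, not stated in the paper):

```python
import numpy as np

def reconstruct(X, W, mean=None):
    """Project samples onto a basis W (d x d_m, orthonormal columns)
    and map them back: x_hat = mean + W W^T (x - mean)."""
    mu = np.zeros(X.shape[1]) if mean is None else mean
    Xc = X - mu
    return mu + Xc @ W @ W.T
```

Passing the learned C_m gives the commonality-reconstructed images, and passing V_m gives the individuality-reconstructed ones.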

C. Validation of CISL's Convergence
In this section, we validate the convergence of the proposed CISL on real-world datasets. In addition to the PIE, AR, and JAFFE datasets, we include one more dataset for this experiment as well as for all the following experiments.
1) ORL dataset [3] contains 400 images of 40 persons, with 10 images for each person. The task on the ORL dataset is a regular face recognition task. The images of the ORL dataset are resized to 30 × 40 pixels. For ease of reference, we number the tasks corresponding to the different datasets: Task 1 for AR, Task 2 for ORL, Task 3 for PIE, and Task 4 for JAFFE. The data in all tasks are projected to the (G_m − 1)-dimensional subspace. For all tasks, we randomly select 10% of the data from each class for training. Figs. 3 and 4 show the objective function values of commonality SL and individuality SL, respectively, as the iteration number increases. The objective values increase steadily and converge in fewer than five iterations in all scenarios, showing the efficiency of CISL.
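The convergence check described here amounts to iterating until the objective gain drops below a tolerance. A generic sketch of such a loop follows; the names and the tolerance are illustrative and this is not the paper's exact update rule:

```python
def iterate_until_converged(step, init, tol=1e-6, max_iter=50):
    """Generic ascent loop: `step` maps the current variables to
    (new_variables, objective_value); stop when the objective gain
    falls below `tol` or `max_iter` is reached."""
    vars_, prev = init, -float("inf")
    history = []  # objective trajectory, e.g., for convergence plots
    for _ in range(max_iter):
        vars_, obj = step(vars_)
        history.append(obj)
        if obj - prev < tol:
            break
        prev = obj
    return vars_, history
```

Plotting `history` against the iteration index produces curves of the kind shown in Figs. 3 and 4.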

D. Statistical Evaluations
In this section, we compare the performance of the proposed method with classical and state-of-the-art SL methods on both synthetic and real-world datasets. Specifically, we compare the proposed CISL with nine SL methods, including PCA [28], LDA [12], Robust SL with L2-Graph (RSLLG) [30], MPCA [46], MTDA [51], multitask sparse metric learning (MTSML) [37], sparse exclusive lasso multitask feature selection (SPEL) [19], graph-clustered feature sharing (GCFS) [20], and multitask learning for classification via a new tight relaxation of rank minimization (MSVMnew) [8]. Here, PCA and LDA are two classical single-task SL methods, RSLLG is a representative graph-guided single-task SL method, MPCA and MTDA are two representative multitask SL methods, MTSML is a recent multitask sparse metric learning method, SPEL and GCFS are two state-of-the-art multitask feature selection methods, and MSVMnew is a state-of-the-art multitask learning method with a low-rank subspace constraint.
1) Experiment on Synthetic Datasets: We generate four datasets for four multiclass classification tasks, respectively. Each task has three classes, and each class has 40 samples whose original dimension is 100. Similar to the synthetic experiment in Section IV-A1, for all tasks, we generate the three-class 2-D data points by sampling from N(0, I), N(1, √2 I), and N(1.5, √1.5 I), respectively. Note that in this experiment, we let all tasks share the same low-dimensional data points in order to ensure the intrinsic relatedness of all the tasks.
The difference between these tasks is reflected by different inverse transformation matrices. For the first task, we generate the 100 × 2 matrix by independently sampling each element from N(0, √2). For the second task, the first 50 rows of its inverse transformation matrix are the same as those in the matrix of the first task, and the elements of its last 50 rows are generated independently from U(0, 1). For the third task, the last 50 rows of its inverse transformation matrix are the same as those in the matrix of the first task, and the elements of its first 50 rows are generated independently from U(0, 1). For the fourth task, we randomly select and copy 30 rows from the inverse transformation matrix of the first task and then independently generate the elements of the remaining 70 rows from U(0, 1). For all four tasks, we map their 2-D data points to the 100-D feature space using the corresponding inverse transformation matrices to generate the original high-dimensional data points.
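The fourth task's matrix, which copies 30 randomly chosen rows from the first task's matrix, could be built as follows (a sketch with illustrative names):

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_low = 100, 2

# First task's inverse transformation matrix, entries from N(0, sqrt(2))
A1 = rng.normal(0.0, np.sqrt(2), size=(d, d_low))

# Fourth task: copy 30 randomly chosen rows of A1, draw the rest from U(0, 1)
shared = rng.choice(d, size=30, replace=False)
A4 = rng.uniform(0.0, 1.0, size=(d, d_low))
A4[shared] = A1[shared]
```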
We test the recognition accuracy of the ten methods with the variation of the task combination (Tasks 1 and 2, Tasks 1-3, and Tasks 1-4) and that of the training ratio (5% and 10%). The average results over ten trials are listed in Tables V-VII. The recognition accuracy of all methods increases when the training ratio increases from 5% to 10%. Moreover, the proposed CISL performs the best among the ten methods in most scenarios, validating its effectiveness.
An interesting observation is that the performance of CISL on Tasks 1 and 2 declines rather than improves after adding Task 3. This is possibly due to the relation between Task 3 and Tasks 1 and 2. From the construction of the inverse transformation matrices of these tasks, we can see that the data in Task 1 and those in Task 2 are closely related, as they share half of the feature space, and thus their commonality can be easily extracted by the proposed algorithm. However, the relation of Task 3 to Task 2 is not as close, as they do not share any feature space. Moreover, the shared features between Tasks 1 and 2 (the first 50 dimensions) and those between Tasks 1 and 3 (the last 50 dimensions) are totally different. These two facts make the overall commonality among Tasks 1-3 hard to capture, and thus the accuracy of the proposed method on Tasks 1 and 2 decreases after adding Task 3. Fortunately, the further inclusion of Task 4, which is constructed to be related to Tasks 1-3, links up all the previous tasks and makes the overall commonality easier to characterize, thus improving the performance.
2) Experiment on Real-World Datasets: In addition to the evaluation on synthetic datasets, we further test the performance of the proposed method for multitask learning on four real-world datasets.
1) Task 1: Occluded face recognition on the AR dataset. 2) Task 2: Face recognition on the ORL dataset. 3) Task 3: Face recognition with various lighting, illumination, and pose conditions on the PIE dataset. 4) Task 4: Emotion recognition on the JAFFE dataset. We examine the recognition accuracy of the ten methods with the variation of the task combination (Tasks 1 and 2, Tasks 1-3, and Tasks 1-4) and that of the training ratio (10%, 20%, 30%, and 40%). We perform ten random trials and calculate the average results for all ten methods.
We first learn Tasks 1 and 2 jointly. As can be seen in Table VIII, almost all the methods achieve much worse performance on AR than on ORL. The reason might be that the occlusions in the AR dataset, as shown in Fig. 1, mask some important facial features, such as the eyes and mouth, and thus introduce additional difficulties in recognizing human faces. In this experiment, an interesting observation is that the classical single-task SL method, LDA, performs better than all the existing multitask SL methods on the AR dataset, indicating that depending only on the information learned from the ORL dataset, which is a face dataset without any occlusion, might not necessarily be helpful in recognizing the occluded faces in the AR dataset. The proposed method, which aims to preserve both the task-shared and task-specific discriminative information, inherits the advantages of both LDA and multitask SL methods, and thus achieves the best performance.
Then, we add Task 3 of face recognition on the PIE dataset into the multitask learning procedure. Since the PIE dataset contains unoccluded facial images, it can share some useful information with the task of face recognition on ORL. As a result, the performance of most methods on the ORL dataset improves when jointly learning with the PIE dataset, especially when the training size is small. Under this three-task setting, the proposed method again outperforms the other methods in all scenarios, as shown in Table IX, demonstrating its effectiveness in capturing the common information from the given learning tasks to improve the performance on all tasks.
Finally, we include Task 4 of emotion recognition on the JAFFE dataset into the multitask learning procedure. Different from the tasks of face recognition, the emotion recognition task has different individuals' faces (with the same emotion) in the same class and the same individual's faces (with different emotions) in different classes. This can make the within-class distances larger than the between-class distances, bringing challenges to the supervised SL task. In this situation, the common facial features, such as the facial contour, that play an important role in regular face recognition tasks might not be that useful in differentiating human emotions.
On the other hand, some task-specific features for emotion recognition, such as the cheek and mouth regions, could be quite informative in this specific task. As shown in Table X, with the joint objectives of learning both the commonality and individuality, CISL obtains the highest recognition accuracy among the ten methods in all scenarios, further validating that, in addition to capturing the common features of different tasks in multitask learning, characterizing the individuality or task-specific features is of equal importance in extracting the key discriminative information from the given tasks.
3) Running Time Evaluation: In this section, we evaluate the running time of all ten SL methods on the aforementioned four real-world tasks with the AR, ORL, PIE, and JAFFE datasets. We conduct the experiment on a Linux server with an Intel Xeon Gold 6230R (2.10 GHz) and 768-GB RAM, and report the results in Table XI. As can be seen, the eigendecomposition process indeed dominates the computational cost: the multitask SL methods that involve eigendecomposition (MPCA, MTDA, GCFS, and the proposed method) generally run slower than those without eigendecomposition (SPEL).² This result is consistent with our computational complexity analysis provided in Section III-B. Furthermore, the running time of the proposed method in all settings is within 1 min, which is considered acceptable in practice. In addition, the training time of our method does not increase with the size of the training set, showing that the computational cost of the proposed method is not seriously affected by the size of the dataset. This is an advantage of the proposed method, especially when dealing with large datasets.
² The high computational cost of MTSML is due to its ADMM optimization.
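Running time can be measured with a simple wall-clock wrapper like the following (an illustrative sketch, not the exact timing protocol used in the paper):

```python
import time

def time_fit(fit_fn, *args, n_runs=3):
    """Wall-clock a training routine and report the mean over runs."""
    times = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        fit_fn(*args)
        times.append(time.perf_counter() - t0)
    return sum(times) / n_runs
```

Averaging over several runs reduces the influence of transient system load on the reported numbers.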

V. CONCLUSION
In this article, we investigated an important research question, that is, how to characterize and extract the commonality and individuality from multiple learning tasks simultaneously.To tackle this problem, we proposed a method called CISL, which represents the transformation matrix for each task as the concatenation of a commonality matrix and an individuality matrix.We developed an iterative strategy to solve the formulated optimization problem.Through a systematic theoretical analysis and a comprehensive experimental evaluation, we validated the effectiveness of the proposed method.
In the future, we plan to extend our work from the following three perspectives. First, we will investigate alternative ways to model and integrate the commonality and individuality of the learning tasks. Moreover, we plan to explore other initialization strategies, so as to provide a good starting point for the algorithm. Last but not least, we aim to generalize the proposed method to other learning tasks, such as clustering and regression, and to consider more complicated nonlinear formulations.

Manuscript received 19 June 2022; accepted 4 September 2022. Date of publication 4 October 2022; date of current version 16 February 2024. This work was supported in part by the General Research Fund from the Research Grants Council of Hong Kong, SAR, under Project RGC/HKBU12202220; in part by the Research Fund from the Guangdong Basic and Applied Basic Research Foundation under Project 2022A1515010124; and in part by the HKBU/CSD Departmental Start-Up Fund for New Assistant Professors. This article was recommended by Associate Editor S. Ventura. (Corresponding author: Jiming Liu.)

Fig. 1. Schematic illustration of the idea behind the proposed CISL method. (a) Multiple related but different learning tasks. Task 1 is the face recognition task under various pose and illumination conditions; Task m is the task of occluded face recognition; and Task M is to recognize emotions from different human faces. These tasks are related in the sense that they all use face images as the input, and some common facial features, such as the facial contours, play an important role in all the above tasks. On the other hand, these tasks differ in terms of the task nature and/or task objective, which may require some task-specific features for the respective recognition tasks. (b) Developed CISL method. For each task T_m (m = 1, . . ., M), CISL aims to learn two subspaces: one commonality subspace C_m and one individuality subspace V_m. The transformation matrix C_m is constructed to capture the commonality shared by multiple learning tasks, while the transformation matrix V_m is designed to characterize the task-specific features of each individual learning task. (c) Learning results of multiple tasks. With the task-shared and task-specific features being well captured by the proposed CISL, in each learning task, the within-class data samples are expected to be mapped together while the between-class samples are expected to be projected far away from each other in the learned subspace, thus benefiting the subsequent recognition task.
(b), aims to learn two transformation matrices C_m, V_m ∈ R^{d×d_m} for the mth task (m = 1, . . ., M; d_m ≪ d). Then, a high-dimensional data sample x_i^m can be mapped to the low-dimensional subspace by z_i^m = [C_m, V_m]^T x_i^m. Here, S_b^m and S_t^m denote the between-class scatter matrix and the total-class scatter matrix of the mth task, respectively; I_d is the d × d identity matrix; ε is a very small positive number introduced to avoid the singularity of S_t^m; tr(·) denotes the matrix trace operator; x̄^m = (∑_{i=1}^{N_m} x_i^m)/N_m denotes the mean of all samples in the mth task; x̄_k^m denotes the mean of the kth class in the mth task; and I_{d_m} is the d_m × d_m identity matrix, with d_m being the dimensionality of the learned subspace for the mth task.
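For reference, the per-task scatter matrices defined above could be computed as follows. This is a hedged NumPy sketch: the function name and the exact value of the regularizer ε are ours, not the paper's.

```python
import numpy as np

def scatter_matrices(X, y, eps=1e-6):
    """Between-class scatter S_b and regularized total scatter S_t + eps*I
    for one task; X is an N x d data matrix and y holds the class labels."""
    d = X.shape[1]
    mean_all = X.mean(axis=0)               # mean of all samples in the task
    S_b = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        diff = (Xc.mean(axis=0) - mean_all).reshape(-1, 1)
        S_b += len(Xc) * (diff @ diff.T)    # class-size-weighted mean offsets
    Xd = X - mean_all
    S_t = Xd.T @ Xd + eps * np.eye(d)       # eps*I avoids a singular S_t
    return S_b, S_t
```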

The commonality subspaces are obtained by maximizing the term of overall between-class scatter, ∑_{m=1}^{M} tr(C_m^T S_b^m C_m), and minimizing the term of overall total-class scatter, ∑_{m=1}^{M} tr(C_m^T S_t^m C_m).
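Objectives of this trace form are typically maximized via a generalized eigendecomposition; a minimal sketch under that assumption follows (this is the standard LDA-style solver, not necessarily the paper's exact update):

```python
import numpy as np
from scipy.linalg import eigh

def top_subspace(S_b, S_t, d_m):
    """Maximize tr(C^T S_b C) subject to the S_t-whitening constraint:
    take the d_m leading generalized eigenvectors of the pair (S_b, S_t)."""
    w, V = eigh(S_b, S_t)        # eigenvalues returned in ascending order
    return V[:, ::-1][:, :d_m]   # keep the d_m leading directions
```

`scipy.linalg.eigh` normalizes the eigenvectors so that C^T S_t C = I, which matches the usual scatter-normalization constraint.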

Fig. 2. Illustration of the commonality and individuality learned from the recognition tasks on the PIE, AR, and JAFFE datasets. (a) Facial samples from the three datasets. (b) Facial representations reconstructed from the commonality subspaces, capturing task-shared features (e.g., the contour) of facial images from different tasks. (c) Facial representations reconstructed from the individuality subspaces, characterizing task-specific features, for example, the eyes for face recognition on the PIE dataset, the unoccluded nose region for occluded face recognition on the AR dataset, and the cheek and mouth regions for emotion recognition on the JAFFE dataset.

TABLE I NOTATIONS AND DESCRIPTIONS

TABLE II CLASSIFICATION ACCURACY OF LDA AND CISL ON SYNTHETIC DATASETS. BETTER PERFORMANCES ARE HIGHLIGHTED IN BOLD

TABLE III STATISTICS OF TASKS ON AR, ORL, PIE, AND JAFFE DATASETS

TABLE V RECOGNITION ACCURACY OF THE PROPOSED METHOD AND NINE SL METHODS, INCLUDING TWO CLASSICAL SINGLE-TASK SL METHODS (PCA AND LDA), ONE GRAPH-GUIDED SINGLE-TASK SL METHOD (RSLLG), TWO REPRESENTATIVE MULTITASK SL METHODS (MPCA AND MTDA), ONE RECENT MULTITASK SPARSE METRIC LEARNING METHOD (MTSML), TWO STATE-OF-THE-ART MULTITASK FEATURE SELECTION METHODS (SPEL AND GCFS), AND ONE STATE-OF-THE-ART MTL METHOD WITH LOW-RANK SUBSPACE CONSTRAINT (MSVMNEW), ON TWO TASKS WITH SYNTHETIC DATASETS UNDER TRAINING RATIOS OF 5% AND 10%. THE BEST PERFORMANCES ARE HIGHLIGHTED IN BOLD

TABLE VI RECOGNITION ACCURACY OF THE PROPOSED METHOD AND NINE SL METHODS, INCLUDING TWO CLASSICAL SINGLE-TASK SL METHODS (PCA AND LDA), ONE GRAPH-GUIDED SINGLE-TASK SL METHOD (RSLLG), TWO REPRESENTATIVE MULTITASK SL METHODS (MPCA AND MTDA), ONE RECENT MULTITASK SPARSE METRIC LEARNING METHOD (MTSML), TWO STATE-OF-THE-ART MULTITASK FEATURE SELECTION METHODS (SPEL AND GCFS), AND ONE STATE-OF-THE-ART MTL METHOD WITH LOW-RANK SUBSPACE CONSTRAINT (MSVMNEW), ON THREE TASKS WITH SYNTHETIC DATASETS UNDER TRAINING RATIOS OF 5% AND 10%. THE BEST PERFORMANCES ARE HIGHLIGHTED IN BOLD

TABLE VII RECOGNITION ACCURACY OF THE PROPOSED METHOD AND NINE SL METHODS, INCLUDING TWO CLASSICAL SINGLE-TASK SL METHODS (PCA AND LDA), ONE GRAPH-GUIDED SINGLE-TASK SL METHOD (RSLLG), TWO REPRESENTATIVE MULTITASK SL METHODS (MPCA AND MTDA), ONE RECENT MULTITASK SPARSE METRIC LEARNING METHOD (MTSML), TWO STATE-OF-THE-ART MULTITASK FEATURE SELECTION METHODS (SPEL AND GCFS), AND ONE STATE-OF-THE-ART MTL METHOD WITH LOW-RANK SUBSPACE CONSTRAINT (MSVMNEW), ON FOUR TASKS WITH SYNTHETIC DATASETS UNDER TRAINING RATIOS OF 5% AND 10%. THE BEST PERFORMANCES ARE HIGHLIGHTED IN BOLD

TABLE VIII RECOGNITION ACCURACY OF THE PROPOSED METHOD AND NINE SL METHODS, INCLUDING TWO CLASSICAL SINGLE-TASK SL METHODS (PCA AND LDA), ONE GRAPH-GUIDED SINGLE-TASK SL METHOD (RSLLG), TWO REPRESENTATIVE MULTITASK SL METHODS (MPCA AND MTDA), ONE RECENT MULTITASK SPARSE METRIC LEARNING METHOD (MTSML), TWO STATE-OF-THE-ART MULTITASK FEATURE SELECTION METHODS (SPEL AND GCFS), AND ONE STATE-OF-THE-ART MTL METHOD WITH LOW-RANK SUBSPACE CONSTRAINT (MSVMNEW), ON TWO TASKS WITH AR AND ORL DATASETS UNDER VARIOUS TRAINING RATIOS (10%, 20%, 30%, AND 40%). THE BEST PERFORMANCES ARE HIGHLIGHTED IN BOLD

TABLE IX RECOGNITION ACCURACY OF THE PROPOSED METHOD AND NINE SL METHODS, INCLUDING TWO CLASSICAL SINGLE-TASK SL METHODS (PCA AND LDA), ONE GRAPH-GUIDED SINGLE-TASK SL METHOD (RSLLG), TWO REPRESENTATIVE MULTITASK SL METHODS (MPCA AND MTDA), ONE RECENT MULTITASK SPARSE METRIC LEARNING METHOD (MTSML), TWO STATE-OF-THE-ART MULTITASK FEATURE SELECTION METHODS (SPEL AND GCFS), AND ONE STATE-OF-THE-ART MTL METHOD WITH LOW-RANK SUBSPACE CONSTRAINT (MSVMNEW), ON THREE TASKS WITH AR, ORL, AND PIE DATASETS UNDER VARIOUS TRAINING RATIOS (10%, 20%, 30%, AND 40%). THE BEST PERFORMANCES ARE HIGHLIGHTED IN BOLD

TABLE X RECOGNITION ACCURACY OF THE PROPOSED METHOD AND NINE SL METHODS, INCLUDING TWO CLASSICAL SINGLE-TASK SL METHODS (PCA AND LDA), ONE GRAPH-GUIDED SINGLE-TASK SL METHOD (RSLLG), TWO REPRESENTATIVE MULTITASK SL METHODS (MPCA AND MTDA), ONE RECENT MULTITASK SPARSE METRIC LEARNING METHOD (MTSML), TWO STATE-OF-THE-ART MULTITASK FEATURE SELECTION METHODS (SPEL AND GCFS), AND ONE STATE-OF-THE-ART MTL METHOD WITH LOW-RANK SUBSPACE CONSTRAINT (MSVMNEW), ON FOUR TASKS WITH AR, ORL, PIE, AND JAFFE DATASETS UNDER VARIOUS TRAINING RATIOS (10%, 20%, 30%, AND 40%). THE BEST PERFORMANCES ARE HIGHLIGHTED IN BOLD

TABLE XI RUNNING TIME (IN SECONDS) OF ALL TEN SL METHODS ON FOUR TASKS WITH AR, ORL, PIE, AND JAFFE DATASETS UNDER VARIOUS TRAINING RATIOS (10%, 20%, 30%, AND 40%)