Learning Discriminative Projection With Visual Semantic Alignment for Generalized Zero Shot Learning

Zero-Shot Learning (ZSL) aims to solve the classification problem when no training samples of the target classes are available, and is realized by transferring knowledge from source classes to target classes through the bridge of semantic embeddings. Generalized ZSL (GZSL) enlarges the search scope of ZSL from only the unseen classes to all classes. A large number of methods have been proposed for these two settings and achieve competitive performance. However, most of them still suffer from the domain shift problem caused by the domain gap between the seen and unseen classes. In this article, we propose a novel method to learn discriminative features with visual-semantic alignment for GZSL. We define a latent space, where the visual features and semantic attributes are aligned, and assume that each prototype is a linear combination of the others, where the coefficients are constrained to be the same in all three spaces. To make the latent space more discriminative, a linear discriminative analysis strategy is employed to learn the projection matrix from the visual space to the latent space. Five popular datasets are exploited to evaluate the proposed method, and the results demonstrate the superiority of our approach compared with state-of-the-art methods. Besides, extensive ablation studies also show the effectiveness of each module in our method.


I. INTRODUCTION
With the development of deep learning techniques, image classification has been extended to large-scale datasets, such as ImageNet [1], and has reached human-level performance [2]. Does this mean we are ready to solve large-scale classification problems? Two questions should be answered: 1) Can we collect enough training samples for all the classes that exist in the world? 2) Can a model trained on a limited set of classes be transferred to other classes without retraining? The first question cannot be answered affirmatively because there are 8.7 million classes in animal species alone [3] and over 1000 new classes emerge every day. Therefore, many researchers have shifted their focus to the second question by employing transfer learning [4], [5] and Zero-Shot Learning (ZSL) [6].
ZSL tries to recognize classes that have no labeled data available during training, and is usually implemented by employing auxiliary semantic information, such as semantic attributes [7] or word embeddings [8], which is similar to how humans recognize new categories. For example, a child who has never seen a ''zebra'' but knows that a ''zebra'' looks like a ''horse'' and has ''white and black stripes'' will be able to recognize a ''zebra'' very easily when he/she actually sees one.
Since the concept of ZSL was first proposed [9], many ZSL methods have been developed, and most of them try to solve the inherent domain shift problem [10]-[13], which is caused by the domain gap between the seen and unseen classes. Although these methods can alleviate the domain shift problem to a certain extent, their performance is limited due to their negligence of unseen classes. To further address the domain shift problem, Fu et al. [14] assumed that both the labeled seen samples and the unlabeled unseen samples can be utilized during training, which is often called transductive learning. This type of method can significantly alleviate the domain shift problem and achieve state-of-the-art performance [15]-[17], but the unlabeled unseen data are usually inaccessible during training in realistic scenarios.
In addition, conventional inductive learning often assumes that the upcoming test data belong to the unseen classes, which is also unreasonable in reality because we cannot know the ascription of future data in advance. Therefore, Chao et al. suggested enlarging the search scope for test data from only the unseen classes to all classes [18], including both seen and unseen categories, as illustrated in Fig. 1. To better solve the domain shift problem under the more realistic GZSL setting, many synthetic based methods have been proposed [19]-[22]. They often train a deep generative network to synthesize unseen data from the corresponding attributes by applying the frameworks of the Generative Adversarial Network (GAN) [23] or the Variational Auto-Encoder (VAE) [24]; the synthesized data and the labeled seen data are then combined to train a supervised closed-set classification model. Synthetic based methods can also achieve state-of-the-art performance, but they have a serious problem: when a totally new object emerges, the trained model will inevitably fail unless new synthetic samples are generated and the model is retrained together with the previous samples. To solve the above problems, in this article we propose a novel method to learn discriminative projections with visual-semantic alignment in a latent space for GZSL; the proposed framework is illustrated in Fig. 2. In this framework, to solve the domain shift problem, we define a latent space to align the visual and semantic prototypes, which is realized by assuming that each prototype is a linear combination of the others, including both seen and unseen ones. With this constraint, the seen and unseen categories are coupled together, which reduces the domain gap between them.
Besides, to make the latent space more discriminative, a Linear Discriminative Analysis (LDA) strategy is employed to learn the projection matrix from the visual space to the latent space, which can significantly reduce the within-class variance and enlarge the between-class variance. Finally, we conduct experiments on five popular datasets to evaluate the proposed method. The contributions of our method are summarized as follows:
1) We propose a novel method to solve the domain shift problem by learning discriminative projections with visual-semantic alignment in a latent space;
2) A linear discriminative analysis strategy is employed to learn the projection from the visual space to the latent space, which makes the projected features in the latent space more discriminative;
3) We assume that each prototype in all three spaces (visual, latent and semantic) is a sparse linear combination of the other prototypes, with the same sparse coefficients in all three spaces. This strategy establishes a link between the seen and unseen classes, reduces the domain gap between them and eventually alleviates the domain shift problem;
4) Extensive experiments are conducted on five popular datasets, and the results show the superiority of our method. Besides, detailed ablation studies also show that the proposed method is reasonable.
The main content of this article is organized as follows: in Section II we briefly introduce some related existing methods for GZSL; Section III describes the proposed method in detail; Section IV gives the experimental results and comparisons with state-of-the-art methods on several metrics; finally, Section V concludes this article.

II. RELATED WORKS
In this section, we briefly review some related ZSL and GZSL works addressing the domain shift problem.

A. COMPATIBLE METHODS
Since the ZSL concept was first proposed [9], many ZSL methods have emerged in recent years. Due to the gap between the seen and unseen classes, an inherent problem, called the domain shift problem, limits the performance of ZSL. Compatible methods often project a visual sample into the semantic space, where a Nearest Neighbor Search (NNS) is conducted to find the nearest semantic prototype, whose label is assigned to the test sample. Kodirov et al. used an autoencoder structure to preserve the semantic meaning of visual features and thus alleviate the domain shift problem [13]. Zhang et al. exploited a triple verification, including an orthogonal constraint and two reconstruction constraints, to address the problem and achieved a significant improvement. Akata et al. proposed to view attribute-based image classification as a label-embedding problem in which each class is embedded in the space of attribute vectors [25]; they employed a pair-wise training strategy in which a projected positive pair in the attribute space should have a shorter distance than a negative pair. However, the performance of these methods is limited due to their negligence of unseen classes during training.
In addition, conventional ZSL assumes that the upcoming test sample belongs to the target classes, which is often unreasonable in realistic scenarios. Therefore, Chao et al. extended the search scope from only the unseen classes to all classes, including both seen and unseen categories [18]. Furthermore, Xian et al. re-split the five popular benchmark datasets to prevent the unseen classes from overlapping with the categories in ImageNet [26]. Besides, they also proposed a new harmonic metric to evaluate the performance of GZSL and released the performance of some state-of-the-art methods on the new metric and datasets. Since then, many methods have been proposed for this more realistic setting. For example, Zhang et al. proposed a probabilistic approach to solve the problem within the NNS strategy [27]. Liu et al. designed a Deep Calibration Network (DCN) to enable simultaneous calibration of deep networks on the confidence of source classes and the uncertainty of target classes [28]. A pseudo distribution of seen samples over unseen classes has also been employed to address the domain shift problem in GZSL [29]. Besides, many other methods have been developed for this more realistic setting [30], [31].

B. SYNTHETIC BASED METHODS
To solve the domain shift problem, synthetic based methods have attracted wide interest among researchers, since they can obtain very significant improvements compared with traditional compatible methods.
Long et al. [32] first tried to utilize the unseen attributes to synthesize the corresponding visual features, and then trained a fully supervised model by combining both the seen data and the synthesized unseen features. Since then, more and more synthetic based methods have been proposed [8], [30], [31], [33], and most of them are based on GANs [23] or VAEs [24], because adversarial learning and variational auto-encoding facilitate generating more realistic samples [34], [35]. CVAE-ZSL [36] exploits a conditional VAE (cVAE) to realize the generation of unseen samples. Xian et al. proposed f-CLSWGAN to generate sufficiently discriminative CNN features by training a Wasserstein GAN with a classification loss [19]. Huang et al. [37] learned a visual generative network for unseen classes by training three components to evaluate the closeness of an image feature and a class embedding, under the combination of a cyclic consistency loss and a dual adversarial loss. The Dual Adversarial Semantics-Consistent Network (DASCN) [20] learns two GANs, namely a primal GAN and a dual GAN, in a unified framework, where the primal GAN learns to synthesize semantics-preserving and inter-class discriminative visual features and the dual GAN enforces the synthesized visual features to represent prior semantic knowledge via semantics-consistent adversarial learning.
Although these synthetic based methods can achieve excellent performance, they all suffer from a common serious problem: when an object of a new category emerges, the model must be retrained with newly synthesized samples of that category. Different from these GAN- or VAE-based synthetic methods, our approach is a compatible one, which does not have the aforementioned problem and can still accept new categories without retraining, albeit with a slight performance degradation.

C. TRANSDUCTIVE METHODS
Fu et al. tried to include the unlabeled unseen data in training, which is often called transductive learning, to solve the domain shift problem, and achieved a surprising improvement [14]. Unsupervised Domain Adaptation (UDA) [38] formulates a regularized sparse coding framework, which utilizes the projections of the unseen class labels in the semantic space to regularize the learned unseen-class projections, thus effectively overcoming the projection domain shift problem. QFSL [15] maps the labeled source images to several fixed points specified by the source categories in the semantic embedding space, while the unlabeled target images are forced to be mapped to other points specified by the target categories. Zhang et al. proposed an explainable Deep Transductive Network (DTN) trained on both labeled seen data and unlabeled unseen data; the network exploits a KL-divergence constraint to iteratively refine the probability of classifying unlabeled instances by learning from their high-confidence assignments with the assistance of an auxiliary target distribution [17]. Although these transductive methods can achieve significant performance and outperform most conventional inductive ZSL methods, the target unseen samples are usually inaccessible in realistic scenarios.

III. METHODOLOGY

A. PROBLEM DEFINITION
Let Y = {y_1, ..., y_s} and Z = {z_1, ..., z_u} denote the sets of s seen and u unseen class labels, respectively, which are disjoint: Y ∩ Z = ∅. Similarly, let A_Y = {a_{y_1}, ..., a_{y_s}} ∈ R^{l×s} and A_Z = {a_{z_1}, ..., a_{z_u}} ∈ R^{l×u} denote the corresponding seen and unseen attributes. The training data are given as 3-tuples from N seen images: {(x_1, a_1, y_1), ..., (x_N, a_N, y_N)}. When testing, the preliminary knowledge is the set of u pairs of attributes and labels: {(a_1, z_1), ..., (a_u, z_u)} ⊆ A_Z × Z. Zero-shot learning aims to learn a classification function f: X_u → Z to predict the label of an input image from the unseen classes, where x_i ∈ X_u is totally unavailable during training.

B. OBJECTIVE
In this subsection, we propose a novel approach to learn discriminative projections with visual-semantic alignment for generalized zero-shot learning; the whole architecture is illustrated in Fig. 2.

1) SAMPLING FROM PROTOTYPES
Suppose we already know the prototypes of the seen classes; the seen features should then be sampled from these prototypes, so we have the following constraint:

min_{P_s} ||X_s - P_s Y_s||_F^2, (1)

where X_s is the matrix of seen visual features, P_s contains the prototypes of the seen categories, Y_s is the one-hot label matrix of the seen samples, and ||·||_F^2 denotes the squared Frobenius norm.

2) PROTOTYPE SYNTHESIS
Here we assume that each class prototype can be described as a linear combination of the other prototypes with corresponding reconstruction coefficients. The reconstruction coefficients are sparse because each class is only related to certain other classes. Moreover, to make the combination more flexible, we define another latent space and construct a sparse graph in all three spaces as

L_syn = ||P - PH||_F^2 + α||C - CH||_F^2 + β||A - AH||_F^2, s.t. diag(H) = 0, (2)

where H is the coefficient matrix; P = [P_s, P_u], with P_s and P_u the visual prototypes of the seen and unseen classes, respectively; C = [C_s, C_u], with C_s and C_u the latent prototypes of the seen and unseen classes, respectively; A = [A_Y, A_Z] is the matrix of semantic attributes; and α and β are balancing parameters. The constraint diag(H) = 0 avoids the trivial solution.

3) VISUAL-SEMANTIC ALIGNMENT
In the latent space, the prototypes are projections from both the visual space and the semantic space, so the alignment can be represented as

L_align = ||C - W_1 P||_F^2 + ||C - W_2 A||_F^2, (3)

where W_1 and W_2 are the projection matrices from the visual space and the semantic space, respectively.

4) LINEAR DISCRIMINATIVE PROJECTION
In the visual space, the features might not be discriminative, as illustrated in Fig. 2, so a direct strategy is to cluster them within each class and scatter them between classes. Linear Discriminative Analysis is a proper choice, and we can maximize the following function to achieve this purpose:

L_LDA = tr(W_1 S_B W_1^T) / tr(W_1 S_W W_1^T), (4)

where S_B and S_W are the between-class scatter matrix and within-class scatter matrix, respectively.
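As a concrete sketch of this step, the projection maximizing the trace ratio above can be obtained from the leading eigenvectors of S_W^{-1} S_B. The snippet below is a simplified numpy-only illustration; the function name `lda_projection` and the ridge regularizer `reg` (added to keep S_W invertible) are our assumptions, not the paper's exact formulation.

```python
import numpy as np

def lda_projection(X, y, dim, reg=1e-4):
    """Return a (d, dim) matrix whose columns are the top eigenvectors of
    S_W^{-1} S_B, i.e. the LDA directions.

    X: (d, n) feature matrix; y: (n,) integer class labels.
    `reg` slightly regularizes S_W so it is always invertible (assumption).
    """
    d, n = X.shape
    mu = X.mean(axis=1, keepdims=True)
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[:, y == c]
        mu_c = Xc.mean(axis=1, keepdims=True)
        S_W += (Xc - mu_c) @ (Xc - mu_c).T          # within-class scatter
        S_B += Xc.shape[1] * (mu_c - mu) @ (mu_c - mu).T  # between-class scatter
    # eigenvectors of S_W^{-1} S_B with the largest eigenvalues
    evals, evecs = np.linalg.eig(np.linalg.solve(S_W + reg * np.eye(d), S_B))
    order = np.argsort(-evals.real)[:dim]
    return evecs[:, order].real
```

In the paper's notation, W_1 would correspond to the transpose of the returned matrix, so that W_1 x projects a visual feature into the latent space.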

C. SOLUTION
Having defined the loss function for each constraint, we combine them to obtain the final objective:

min_{P, C, H, W_1, W_2} ||X_s - P_s Y_s||_F^2 + γ L_syn + κ L_align - λ L_LDA, (5)

where L_syn, L_align and L_LDA denote the losses in Eqs. 2, 3 and 4, and γ, κ and λ are the balancing coefficients.

1) INITIALIZATION
Since Eq. 5 is not jointly convex over all variables, there is no closed-form solution for all of them simultaneously. Thus, we propose an iterative optimization strategy that updates one variable at a time while fixing the others. Because proper initialization can not only improve the model performance but also increase the convergence speed, we further split the solution into two sub-problems: initializing the parameters with reduced constraints, and iteratively optimizing them with the full constraints.
Initializing H: Since A is known in advance, we first initialize H with the last term of Eq. 2, i.e., by minimizing ||A - AH||_F^2 with a regularizer on H. To satisfy the constraint diag(H) = 0, we solve H one column at a time:

H_i = (A_{\i}^T A_{\i} + θI)^{-1} A_{\i}^T a_i, (7)

where H_i is the i-th column of H with its i-th entry removed, a_i is the i-th column of A, and A_{\i} is the matrix A excluding the i-th column.
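The column-wise initialization of H can be sketched as follows. This is a minimal numpy illustration under our own assumptions: the function name `init_H` is hypothetical, and the ridge penalty `theta` stands in for the sparsity regularizer in a simplified form.

```python
import numpy as np

def init_H(A, theta=1e-4):
    """Initialize H column by column so that a_i ≈ A_{\\i} H_i and diag(H) = 0.

    A: (l, m) attribute matrix with one column per class.
    Each column of H is a regularized least-squares reconstruction of the
    corresponding attribute from all the *other* attributes.
    """
    l, m = A.shape
    H = np.zeros((m, m))
    for i in range(m):
        idx = [j for j in range(m) if j != i]   # drop the i-th column
        A_rest = A[:, idx]                      # A excluding column i
        h = np.linalg.solve(A_rest.T @ A_rest + theta * np.eye(m - 1),
                            A_rest.T @ A[:, i])
        H[idx, i] = h                           # the i-th entry stays 0
    return H
```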
Initializing P_s: We use Eq. 1 to initialize P_s, whose closed-form solution is

P_s = X_s Y_s^T (Y_s Y_s^T + θI)^{-1}. (8)

Initializing P_u: Since there is no training data for the unseen classes, we cannot initialize P_u with the same strategy as P_s. However, having already obtained H with Eq. 7, it is easy to utilize P_s and H to calculate P_u: the simplified loss function is L_{P_u} = ||P - PH||_F^2 with P = [P_s, P_u] (Eq. 10). By computing the derivative of Eq. 10 with respect to P_u and setting it to zero, we obtain its closed-form solution.
Initializing W_1: Since C is unknown at this point, we cannot calculate W_1 with Eq. 3. The only way to initialize W_1 is to optimize Eq. 4: if we define tr(W_1 S_B W_1^T) / tr(W_1 S_W W_1^T) = τ (Eq. 11), then W_1 can be solved by taking the leading eigenvectors of S_W^{-1} S_B.
Initializing C: Since W_1 and P are already known, it is easy to initialize C with the first item of Eq. 3, and the solution is C = W_1 P (Eq. 12).
Initializing W_2: By employing the second item of Eq. 3, W_2 can be solved in closed form as W_2 = C A^T (A A^T + θI)^{-1}.
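The closed-form initialization of P_s can be sketched in a few lines; with one-hot labels it amounts to (slightly regularized) per-class means. The function name `init_Ps` is our own, and the ridge term `theta` mirrors the small regularizer mentioned in the experimental settings.

```python
import numpy as np

def init_Ps(X_s, Y_s, theta=1e-4):
    """Closed-form least-squares solution of min ||X_s - P_s Y_s||_F^2.

    X_s: (d, N) seen visual features; Y_s: (s, N) one-hot label matrix.
    Because Y_s Y_s^T is diagonal with the per-class sample counts, the
    result is essentially the mean feature of each seen class.
    """
    s = Y_s.shape[0]
    return X_s @ Y_s.T @ np.linalg.inv(Y_s @ Y_s.T + theta * np.eye(s))
```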

2) OPTIMIZATION
Since the initialized values of the variables have been obtained, they can be optimized iteratively by updating each one while fixing the others.
Updating H: As in the initialization of H, we update H one column at a time. By taking the derivative of L_{H_i} with respect to H_i and setting the result to zero, we obtain the closed-form solution of H_i (Eq. 15). Updating P_s: By fixing all variables except P_s, we obtain a loss function in P_s from Eq. 5, which can be expanded and simplified to the form A P_s + P_s B = C (Eq. 17), where A, B and C here denote constructed coefficient matrices rather than the attribute and latent-prototype matrices. This is the well-known Sylvester equation and can be solved efficiently by the Bartels-Stewart algorithm [39]; Eq. 17 can therefore be implemented with a single line of code, P_s = sylvester(A, B, C), in MATLAB.
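Outside MATLAB, the same Sylvester update A P + P B = C can be checked with a few lines of numpy via the Kronecker-product identity; this is suitable only for small sizes, while SciPy's `solve_sylvester` provides the efficient Bartels-Stewart implementation. The function name below is our own, and the matrices in the test are random stand-ins, not the paper's actual coefficient matrices.

```python
import numpy as np

def solve_sylvester_dense(A, B, C):
    """Solve A P + P B = C by vectorization (small matrices only).

    Uses vec(A P) = (I ⊗ A) vec(P) and vec(P B) = (B^T ⊗ I) vec(P)
    with column-major (Fortran-order) vectorization.
    """
    m, n = C.shape
    K = np.kron(np.eye(n), A) + np.kron(B.T, np.eye(m))
    p = np.linalg.solve(K, C.flatten(order="F"))
    return p.reshape((m, n), order="F")
```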
Updating P_u: Similarly, we fix all variables except P_u, take the derivative of L_{P_u} with respect to P_u, and set the result to zero. The resulting equation (Eq. 19) can likewise be simplified to the Sylvester form A P_u + P_u B = C and solved efficiently with P_u = sylvester(A, B, C) in MATLAB.
Updating C: Fixing all variables except C, Eq. 5 reduces to a loss L_C; taking its derivative with respect to C and setting the result to zero yields the closed-form solution of C (Eq. 21). Updating W_2: As for W_2, Eq. 5 simplifies to a loss L_{W_2}; taking its derivative with respect to W_2 and setting the result to zero yields the closed-form solution of W_2 (Eq. 23). Updating W_1: Similarly, Eq. 5 can be reduced to a loss in W_1. Because directly differentiating Eq. 22 would produce negative orders of W_1, we rewrite it with a coefficient η, set here to the maximum eigenvalue of S_W^{-1} S_B; taking the derivative of L_{W_1} with respect to W_1 and setting the result to zero then yields the solution of W_1 (Eq. 26). After these steps, a test sample can be classified by projecting it into the latent space and finding its nearest neighbor among the columns of C. The full procedure of the proposed method is described in Alg. 1.
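The final classification rule described above can be sketched as follows; this is a minimal illustration in which the function name and the Euclidean distance metric are our assumptions.

```python
import numpy as np

def classify(x, W1, C):
    """Project a test feature into the latent space with W1 and return the
    column index of the nearest latent prototype in C.

    x: (d,) visual feature; W1: (dim, d) projection; C: (dim, K) prototypes.
    """
    z = W1 @ x                                    # latent embedding of x
    d = np.linalg.norm(C - z[:, None], axis=0)    # distance to each prototype
    return int(np.argmin(d))
```

The predicted label is then the class associated with the returned prototype, which may be either a seen or an unseen class under the GZSL setting.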

IV. EXPERIMENTS
In this section, we first briefly review the datasets used in our experiments, then give the experimental settings, and finally present the experimental results and ablation studies to demonstrate the performance of the proposed method.

A. DATASETS
In this experiment, we utilize five popular datasets to evaluate our method, i.e., SUN (SUN attribute) [40], CUB (Caltech-UCSD-Birds 200-2011) [41], AWA1 (Animals with Attributes) [42], AWA2 [42] and aPY (attribute Pascal and Yahoo) [43]. Among them, SUN and CUB are fine-grained datasets while AWA1/2 and aPY are coarse-grained ones. The detailed information of the datasets is summarized in Tab. 1, where ''SS'' denotes the number of seen samples for training, and ''TS'' and ''TR'' refer to the numbers of unseen-class samples and seen-class samples for testing, respectively.

Algorithm 1
Input: the set of visual features of seen classes: X_s; the set of one-hot labels of X_s: Y_s; the set of semantic attributes of both seen and unseen classes: A; the number of optimization iterations: iter; the hyper-parameters: α, β, γ, λ, κ and θ.
Output: the projection matrices: W_1 and W_2; the visual prototypes of both seen and unseen classes: P_s and P_u; the latent prototypes of both seen and unseen classes: C_s and C_u.
1: Initialize H with Eq. 7, one column at a time;
2: Initialize P_s and P_u with Eq. 8 and Eq. 10, respectively;
3: Initialize W_1 with the eigenvectors of S_W^{-1} S_B from Eq. 11;
4: Initialize C with Eq. 12;
5: for k = 1 → iter do
6: Update H with Eq. 15, one column at a time;
7: Update P_s with Eq. 17 by applying P_s = sylvester(A, B, C);
8: Update P_u with Eq. 19 by applying P_u = sylvester(A, B, C);
9: Update C with Eq. 21;
10: Update W_2 with Eq. 23;
11: Update W_1 with Eq. 26;
12: end for
13: return W_1, W_2, P_s, P_u and C.
Moreover, we use the same split setting, proposed by Xian et al. in [26], for all the comparisons with the state-of-the-art methods listed in Tab. 2.

B. EXPERIMENTAL SETTING
We exploit features extracted with ResNet [2] as our training and testing samples, which are released by Xian et al. [26], and all the settings, including both the attributes and the class splits, are the same as those in [26]. In addition, there are six hyper-parameters: α, β, γ, λ, κ and θ. Since θ only controls the regularization terms, we set it to a small value of 1 × 10^{-4}. As for the other five parameters, since different datasets usually perform best with different values, we choose the hyper-parameters from the set {0.001, 0.01, 0.1, 1, 10, 100, 1000} by adopting a cross-validation strategy. To be specific, we hereby note how ZSL cross-validation differs from conventional cross-validation in machine learning: instead of inner splits of the training samples within each class, the ZSL problem requires inter-class splits that in turn regard part of the seen classes as unseen. For example, 20% of the seen classes are selected as validation unseen classes in our experiments, and the parameters with the best average performance over 5 runs are selected as the final optimal parameters for each dataset. It should be noted that these parameters may not be the most suitable for the test set, because the labels of the test data are strictly inaccessible during training.
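The class-wise validation split described above can be sketched as follows; the function name and interface are hypothetical, and only the 20% class-level hold-out mirrors the text.

```python
import numpy as np

def zsl_cv_split(seen_classes, val_frac=0.2, seed=0):
    """Split *classes* (not samples) for ZSL cross-validation.

    A fraction `val_frac` of the seen classes plays the role of the unseen
    classes during validation; the rest remain as training seen classes.
    Returns (train_classes, val_unseen_classes) as disjoint arrays.
    """
    rng = np.random.default_rng(seed)
    classes = np.array(seen_classes)
    perm = rng.permutation(len(classes))
    n_val = max(1, int(round(val_frac * len(classes))))
    return classes[perm[n_val:]], classes[perm[:n_val]]
```

Repeating this split with different seeds and averaging the validation accuracy gives the 5-run selection procedure used for the hyper-parameters.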

C. COMPARISON WITH BASELINES
In this subsection, we conduct experiments to compare our method with some baseline methods. In addition to the methods evaluated in [26], we also compare our method with some newly proposed frameworks, such as GFZSL [50], LAGO [51], PSEUDO [52], KERNEL [53], TRIPLE [54], LESAE [55], LESD [56] and VZSL [22]. To be specific, we directly cite the results from [26] or from the corresponding papers where feasible; otherwise we re-implement the methods as described in their papers. We exploit the harmonic mean H to evaluate our model under the GZSL setting, defined as

H = (2 × acc_tr × acc_ts) / (acc_tr + acc_ts),

where acc_tr and acc_ts are the accuracies of test samples from the seen classes and unseen categories, respectively, and we adopt the average per-class top-1 accuracy as the final result. Since our method utilizes both seen and unseen semantic attributes and focuses on the more realistic GZSL setting, we do not report results under the conventional ZSL setting. The results of our method and the compared methods are recorded in Tab. 2, and the best result in each column is highlighted in bold. From this table, we can clearly see that our method outperforms the state-of-the-art methods on both ts and H. Concretely, our method improves ts by 0.8% on SUN, 4.0% on CUB, 6.5% on AWA1, 5.7% on AWA2 and 5.2% on aPY, and enhances H by 0.6% on SUN, 0.2% on CUB, 9.8% on AWA1, 7.7% on AWA2 and 8.0% on aPY, respectively, compared with the best methods LESAE and TRIPLE. Besides, compared to existing methods that have a high tr but low ts and H, such as DAP and CONSE, our method achieves a more balanced performance on ts and tr and eventually obtains a significant improvement on H.
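The harmonic mean used above can be computed as follows; the function name is our own, and the edge-case handling for zero accuracies is an assumption.

```python
def harmonic_mean(acc_tr, acc_ts):
    """GZSL harmonic mean H = 2 * acc_tr * acc_ts / (acc_tr + acc_ts).

    H is high only when both the seen-class accuracy (acc_tr) and the
    unseen-class accuracy (acc_ts) are high, penalizing imbalanced models.
    """
    if acc_tr + acc_ts == 0:
        return 0.0
    return 2 * acc_tr * acc_ts / (acc_tr + acc_ts)
```

This is why methods with a high tr but a near-zero ts, such as DAP and CONSE in Tab. 2, still obtain a low H.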
We ascribe this improvement to the discriminative projection with LDA and to the prototype synthesis over both seen and unseen classes: the former makes the projected features of the same class cluster together and those of different classes disperse, while the latter combines both seen and unseen classes into a unified framework to alleviate the domain shift problem.

D. ABLATION STUDY 1) EFFECT OF LATENT SPACE
In our method, we utilize the latent space as the intermediate space for both visual and semantic features, and we have claimed that this space yields more discriminative projections and alleviates the domain shift problem. Therefore, it is necessary to verify this claim. In this subsection, we remove the latent prototypes from Eq. 2 and Eq. 3, modify the discriminative projection item with LDA accordingly, and redefine the three loss functions L_syn, L_align and L_LDA without the latent space. We replace these three items in Eq. 5 and re-optimize it; the performance with the new loss function is illustrated in Fig. 3, from which it can be clearly seen that the accuracies with the latent space are higher than those without it on all five datasets. To be specific, we obtain larger improvements on SUN and CUB than on AWA and aPY, especially on the metric ts. We attribute this phenomenon to the fact that the learned vectors in the latent space preserve more discriminative characteristics, and that the unified synthesis framework over both seen and unseen classes can well alleviate the domain shift problem.

2) EFFECT OF LDA
In our method, we utilize the LDA strategy to project visual features into the latent space to make them more discriminative, so it is necessary to determine how much this mechanism improves the final performance. In this experiment, we remove the loss item L_LDA from Eq. 5 and conduct the evaluation on the five popular datasets. The experimental results are illustrated in Fig. 4, from which it can be clearly observed that the method with LDA significantly outperforms that without the LDA constraint. This reveals that the LDA constraint plays a very important role in improving the performance due to its powerful ability to learn discriminative features in the latent space. Moreover, to display the improvement of our method more intuitively, we also show the distributions of unseen samples on AWA1 with and without LDA in the latent space using t-SNE [57]. The results are illustrated in Fig. 5, from which it can be observed that the distribution of each class is more compact with LDA than without it, especially for the classes at the bottom of the figure. This further proves that LDA is necessary for our method to learn discriminative features in the latent space.

3) DIFFERENT DIMENSION OF LATENT SPACE
Since we apply a latent space in our method, it is necessary to discuss the effect of its dimension on the final performance. In our experiment, we take AWA1 as an example and vary the dimension of the latent space from 5 to 60. The performance curves are recorded in Fig. 6, from which it can be clearly seen that the curves monotonically increase for both ts and H, and nearly stop increasing when the dimension is larger than 50. This reveals that a larger latent-space dimension yields better performance, but the increase stops when the dimension reaches the number of classes.

E. ZERO SHOT IMAGE RETRIEVAL
In this subsection, we conduct experiments to show the zero-shot retrieval performance of our proposed method. In this task, we apply the semantic attributes of each unseen category as the query vector and compute the mean Average Precision (mAP) of the returned images. mAP is a popular metric for evaluating retrieval performance; it comprehensively evaluates both the accuracy and the ranking of the returned results and is defined as

mAP = (1/u) Σ_{i=1}^{u} (1/r_i) Σ_{j=1}^{r_i} j / p_i(j),

where r_i is the number of correct images returned from the dataset for the i-th query attribute, and p_i(j) is the position of the j-th retrieved correct image among all the images returned for the i-th query attribute. In this experiment, the number of returned images equals the number of samples in the unseen classes. For the convenience of comparison, we employ the standard split of the four datasets, including SUN, CUB, AWA1 and aPY, which can be found in [26], and the results are shown in Tab. 3. The values of the baseline methods listed in Tab. 3 are directly cited from [58]. The results show that our method outperforms the baselines on all four datasets, especially on the coarse-grained dataset AWA1, which reveals that our method makes the prototypes in the latent space more discriminative.
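The mAP computation above can be sketched as follows; the function name and input format (per-query lists of 1-based positions of the correct images in the ranked results) are our assumptions.

```python
def mean_average_precision(rankings):
    """Compute mAP from the positions of correct images per query.

    rankings: for each query attribute, a list of the 1-based positions of
    its correct images in the returned ranking, e.g. [[1, 3], [2]].
    AP averages precision-at-correct-position over a query's correct
    images; mAP averages AP over all queries.
    """
    aps = []
    for pos in rankings:
        # j-th correct image (1-based) found at position p gives precision j/p
        ap = sum((j + 1) / p for j, p in enumerate(sorted(pos))) / len(pos)
        aps.append(ap)
    return sum(aps) / len(aps)
```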

V. CONCLUSION
In this article, we have proposed a novel method to learn discriminative features with visual-semantic alignment for generalized zero-shot learning. In this method, we defined a latent space where the visual features and semantic attributes are aligned. We assumed that each prototype is a linear combination of the others, with the same coefficients in all three spaces: visual, latent and semantic. To make the latent space more discriminative, a linear discriminative analysis strategy was employed to learn the projection matrix from the visual space to the latent space. Five popular datasets were exploited to evaluate the proposed method, and the results demonstrated its superiority compared with the state-of-the-art methods. Besides, extensive ablation studies also showed the effectiveness of each module of the proposed method.
PENGZHEN DU received the Ph.D. degree from the Nanjing University of Science and Technology, in 2015. He is currently an Assistant Professor with the School of Computer Science and Engineering, Nanjing University of Science and Technology. His research interests include computer vision, evolutionary computation, robotics, and deep learning.
HAOFENG ZHANG received the B.Eng. and Ph.D. degrees from the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China, in 2003 and 2007, respectively. From December 2016 to December 2017, he was an Academic Visitor with the University of East Anglia, Norwich, U.K. He is currently a Professor with the School of Computer Science and Engineering, Nanjing University of Science and Technology. His research interests include computer vision and robotics.
JIANFENG LU (Member, IEEE) received the B.S. degree in computer software and the M.S. and Ph.D. degrees in pattern recognition and intelligent system from the Nanjing University of Science and Technology, Nanjing, China, in 1991, 1994, and 2000, respectively. He is currently a Professor and the Vice Dean of the School of Computer Science and Engineering, Nanjing University of Science and Technology. His research interests include image processing, pattern recognition, and data mining.