Joint Learning of Generative Translator and Classifier for Visually Similar Classes

In this paper, we propose a Generative Translation Classification Network (GTCN) for improving visual classification accuracy in settings where classes are visually similar and data is scarce. For this purpose, we propose joint learning from scratch to train a classifier and a generative stochastic translation network end-to-end. The translation network is used to perform on-line data augmentation across classes, whereas previous works have mostly involved domain adaptation. To help the model further benefit from this data augmentation, we introduce an adaptive fade-in loss and a quadruplet loss. We perform experiments on multiple datasets to demonstrate the proposed method's performance in varied settings. Of particular interest, training on 40% of the dataset is enough for our model to surpass the performance of baselines trained on the full dataset. When our architecture is trained on the full dataset, we achieve performance comparable to state-of-the-art methods despite using a light-weight architecture.


Introduction
Generative models have received significant interest in the past years. Although recent models can generate realistic and diverse data, more study is needed to ascertain whether these methods can be useful for enhancing classification accuracy under hard conditions such as visually similar classes or a lack of data. For example, face liveness detection in biometrics is a crucial problem where it is hard to distinguish between real faces and printed fake faces, because examples from the two classes are very similar (Akbulut et al. 2018; Jourabloo and Liu 2018). In this application, obtaining a high true acceptance rate (TAR) and a low false acceptance rate (FAR) is important: a high TAR is essential for user convenience, whereas a low FAR results in better security. This paper is motivated by two questions:
• If two classes, A and B, are visually very similar, how can we improve classifiers by employing cross-class generative models?
The research of this paper was conducted from March 2018 to February 2019.
• When data is scarce, how can we use generative models to learn better representations?
In practice, most past approaches to these two questions fall into two categories. The first is using complex models, which are often hard to train and make fast inference difficult in settings with limited computing resources, such as on a smartphone. The second is to collect large amounts of training data, which is costly, time-consuming, and not always straightforward. In this paper, we propose a Generative Translation Classification Network that uses a translation model between visual classes to assist the training of the classifier via joint learning. If the translation model is able to effectively augment the quantity of training samples, we expect both the issue of closely distributed classes and the lack of sufficient training data to be mitigated. We should note the similarity of our proposed method to the way the brain learns. There is biological evidence that long-term memory is formed by collaboration between the hippocampus and the prefrontal cortex (Preston and Eichenbaum 2013). The hippocampus recalls slightly distorted memories, which inspired our translation network generating variations on training examples. The prefrontal cortex uses such memories and has a role analogous to the classifier we use. In summary, our contributions are the following:
• To augment mini-batch data (AMB) during training, we use inter-class translated samples based on joint learning of a translation model and a classifier. Specifically, half the training samples seen by the classifier are inter-class translated samples that are stochastically generated (ST), while the other half are real samples of the training data, to preserve the data distribution. This is a novel attempt to couple a generative translation
network and a classifier in a unified architecture for improving classification of visually similar classes.
• Early on during joint learning, translated samples are of poor quality. We use adaptive fade-in training (AF) for the classifier, automatically adjusting the importance of real and translated samples to gradually adapt the influence of translated samples.
• We design a novel quadruplet loss (QL) that helps preserve intra-class distributions and push inter-class distributions apart, even though generated samples are used for training. This is because the loss encourages similarity between the embeddings of real and intra-class translated images, and dissimilarity between the embeddings of inter-class samples.

Related work
Since the introduction of generative adversarial networks (GANs), generative models have been widely applied to semi-supervised classification (Salimans et al. 2016; Odena 2016). Typically, semi-supervised classifiers take a tiny portion of labeled data and a much larger amount of unlabeled data from the same domain. The goal is to use both labeled and unlabeled data to train a neural network so that it can map a new data point to its correct class (Odena 2016).
A simple yet effective idea for semi-supervised learning is to turn a classification problem with n classes into one with n + 1 classes, the additional class corresponding to fake images (Salimans et al. 2016). However, such methods do not directly augment the data of existing classes, since they treat generated data as a new class.
In terms of cost functions, the feature matching loss (Salimans et al. 2016) was introduced to prevent the instability that arises when a GAN overtrains its discriminator. Specifically, the generator is trained to match the expected value of the features on an intermediate layer of the discriminator. While feature matching losses bring benefits, a cost function that can account for inter- and intra-class relationships is required in our case. Deep networks with variants of a triplet loss have become common for face verification (Schroff, Kalenichenko, and Philbin 2015) and person re-identification (Chen et al. 2017). Despite this trend, applying such losses to generative models is rarely considered.
From another perspective, between-class data augmentation methods have been proposed (Tokozume, Ushiku, and Harada 2018). Images are generated by mixing two training images belonging to different classes with a random ratio. The training procedure seeks to minimize the KL-divergence between the outputs of a trained model and a target computed by interpolating the two one-hot target vectors of the initial examples using the same ratio. Even though these methods achieve superior results on visual recognition, it is not clear whether the classifier benefits from sharper and more diverse data when classes are visually similar.
In this paper, we use generated translation images to augment the data in each mini-batch during the training of a classifier. Even though many previous methods have tackled generative models, this work focuses on joint learning that improves classification accuracy for visually similar images.

Proposed methods
We design a unified deep network architecture that combines a classifier C and a generative translation model G to perform on-line data augmentation of the mini-batch.The proposed architecture is explained in Figure 2.

Formulation of proposed methods
Let {x_y^i : 1 ≤ i ≤ n, y ∈ Y} be a dataset such that x_y^i ∈ X is the i-th sample belonging to class y ∈ Y. We consider a learning algorithm that trains a classifier by jointly using a generative translation model G(x̃|x). Our goal is to improve the classifier C by utilizing the on-line translated samples x̃ as additional training data. More formally, C is a classifier that discriminates between two classes, Y = {A, B}. Generative translation models G are employed to produce x̃ from x, and the augmented mini-batch at iteration k is

AMB_k = { x_A^i, x̃_A^i, x_B^i, x̃_B^i : 1 ≤ i ≤ m },  with x̃_A = G_BA(x_B, z_BA) and x̃_B = G_AB(x_A, z_AB),

where k is the training iteration and m is the number of samples in each set. The x̃'s in AMB_k increase the diversity of the training data, while the x's preserve the original distribution of the training data.
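As an illustration, the augmented mini-batch construction can be sketched as follows. This is a minimal numpy sketch under assumptions: the translator callables `G_AB`/`G_BA` taking `(images, noise)` are a hypothetical API, and the noise shape follows the R^{32×32×3} uniform noise described in Figure 2.

```python
import numpy as np

def build_augmented_minibatch(x_A, x_B, G_AB, G_BA, rng):
    """Build AMB_k: m real samples per class plus m translated samples per class.

    x_A, x_B: arrays of m real images per class.
    G_AB, G_BA: translator callables taking (images, noise) -- hypothetical API.
    """
    m = len(x_A)
    # Stochastic translation (ST): fresh noise each iteration drives diversity.
    z_AB = rng.uniform(-1.0, 1.0, size=(m, 32, 32, 3))
    z_BA = rng.uniform(-1.0, 1.0, size=(m, 32, 32, 3))
    x_B_t = G_AB(x_A, z_AB)  # translated A -> B, labelled as class B
    x_A_t = G_BA(x_B, z_BA)  # translated B -> A, labelled as class A
    images = np.concatenate([x_A, x_A_t, x_B, x_B_t])
    labels = np.array([0] * (2 * m) + [1] * (2 * m))  # 0 = class A, 1 = class B
    return images, labels
```

Half of the resulting batch is real and half is translated, matching the mini-batch structure shown on the right of Figure 2.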
Loss of stochastic translator. The objective of visual translation across classes is that a source class borrows underlying structure from a target class while maintaining the style of the source class. To meet these needs, cycle consistency losses, L_cyc^A and L_cyc^B, are used to train the stochastic translators G_AB and G_BA:

L_cyc^A = E_{x_A} [ || G_BA(G_AB(x_A, z_AB), z_BA) − x_A ||_1 ],
L_cyc^B = E_{x_B} [ || G_AB(G_BA(x_B, z_BA), z_AB) − x_B ||_1 ].

For generating realistic images, an adversarial loss L_adv^A is applied to train D_A and G_BA:

L_adv^A = E_{x_A} [ log D_A(x_A) ] + E_{x_B} [ log(1 − D_A(G_BA(x_B, z_BA))) ].

Similarly, an adversarial loss L_adv^B is applied to train D_B and G_AB.

Adaptive fade-in learning. In the early stages, translated samples in AMB_k are relatively poor, and relying on them too much could be detrimental for training the classifier C. We therefore design an adaptive fade-in loss (AF) that adapts the importance given to real and translated samples during training:

L_cls = α L_CE(C(x_A), A) + (1 − α) L_CE(C(x̃_A), A) + β L_CE(C(x_B), B) + (1 − β) L_CE(C(x̃_B), B),

where α and β are parameters between 0 and 1, and L_CE is the categorical cross-entropy loss on the Softmax output of C.

Quadruplet loss for inter/intra-class. We design a quadruplet loss (Equation 10) for the proposed architecture that enforces explicit relationships between classes: hinge terms with margins η_a, η_b, and η_c penalize large embedding distances between intra-class pairs and small embedding distances between inter-class pairs among the four elements x_A, x̃_A, x_B, and x̃_B.
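To make the fade-in weighting concrete, here is a minimal numpy sketch of the adaptive fade-in classification loss. It assumes class A is label 0 and class B is label 1, and that α and β trade off real versus translated samples as described; the exact form of the paper's Equation 9 may differ in detail.

```python
import numpy as np

def cross_entropy(probs, labels):
    # Mean categorical cross-entropy over softmax outputs of shape (m, 2).
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def adaptive_fadein_loss(p_real_A, p_trans_A, p_real_B, p_trans_B, alpha, beta):
    """Weighted cross-entropy: alpha trades off real vs. translated class-A
    samples, beta does the same for class B; both lie in [0, 1] and are
    trainable, so the influence of translated samples fades in gradually."""
    y_A = np.zeros(len(p_real_A), dtype=int)  # class A -> label 0
    y_B = np.ones(len(p_real_B), dtype=int)   # class B -> label 1
    return (alpha * cross_entropy(p_real_A, y_A)
            + (1.0 - alpha) * cross_entropy(p_trans_A, y_A)
            + beta * cross_entropy(p_real_B, y_B)
            + (1.0 - beta) * cross_entropy(p_trans_B, y_B))
```

With alpha = beta = 1 the loss ignores translated samples entirely; values near 0.5 weight real and translated samples equally.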

Overall training objective
The training objective of the classifier C in GTCN is:

L_C = L_cls + L_quad.

The training objective of the translator G in GTCN is:

L_G = L_adv^A + L_adv^B + λ (L_cyc^A + L_cyc^B) + L_cls + L_quad,

where λ is a weight parameter to adjust the relative importance of the cyclic consistency losses. Finally, we aim to jointly optimize L_C and L_G, with the adversarial terms trained in the usual minimax fashion against the discriminators D_A and D_B.

Networks model and parameters setting
We trained all the models in the experiments from scratch, without pre-training or extra datasets. The mini-batches used to train the CNNs contained eight real samples, while four real samples and four translated samples were present in the mini-batches used for GTCNs and the other compared models. All of the networks' parameters were optimized using the Adam optimizer. We keep the learning rate fixed for the first half of the epochs and linearly decay it to zero over the second half. The base learning rate is 0.0002 and the number of training epochs was set to 100. To prevent over-fitting, data augmentation transforms were applied, such as rotation, intensity and color adjustment, and scaling variation. All of the classifiers in the experiments use a simple architecture consisting of six convolutional layers.

Outputs of the FC layer of the classifier are used to compute a score for class A, where FC_A is the logit value for class A, FC_B is the logit value for class B, and SC(y = A) is the score for class A. The calculated SC(y = A) is used to decide on the class: x is assigned to class A if SC(y = A) ≥ th, where th is an acceptance threshold. For example, if th is set to be high, then we can calculate the TAR over all test samples at a low FAR.
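The learning-rate schedule described above (constant for the first half of training, then linear decay to zero) can be sketched as:

```python
def learning_rate(epoch, base_lr=0.0002, total_epochs=100):
    """Constant base rate for the first half of the epochs,
    then linear decay to zero over the second half."""
    half = total_epochs // 2
    if epoch < half:
        return base_lr
    # Fraction of the decay phase already completed.
    progress = (epoch - half) / float(total_epochs - half)
    return base_lr * (1.0 - progress)
```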
Training with a small volume of the dataset
First, we evaluate the performance of baseline CNNs on different subsets of the training set to study the effect of data scarcity. In addition, the other compared methods and the proposed GTCN are trained with only 40% of the training data. We compared our method to between-class learning (BC/BC+) (Tokozume, Ushiku, and Harada 2018) and semi-supervised learning employing N + 1 classes (Semi-sup.) (Salimans et al. 2016). In terms of generative models, we consider a least-squares GAN (LSGAN) (Mao et al. 2017) and a variational autoencoder (VAE) (Kingma and Welling 2013) as alternative methods to be compared.

Training with the full dataset
In this experiment, we used 100% of the training dataset to verify the scalability of the proposed methods. We added CNN-256 models that are trained and tested on images of size 256×256 pixels, while the other models use images of size 128×128 pixels. Table 4 and Table 5 show evaluation results. As in the experiments using reduced versions of the datasets, the data augmentation methods BC/BC+ and the compared deep generative models, which mainly perform intra-class data augmentation, show limited performance. VAEs and BC methods show relatively good performance on the face liveness dataset, since faces are highly structured images. However, our method achieves the results in Table 6 without pre-training on extra datasets, without large networks, and without combining an SVM with traditional hand-crafted features. Note that our two-patches-based model is lighter and faster than the others, since we used a light-weight CNN that has just 73,904 parameters. Figure 4 shows the ROC curves for single models and two-patches-based models. GTCNs achieve superior true acceptance rates in the range of low false acceptance rates.
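The TAR-at-fixed-FAR values reported in the tables can be computed from the class-A scores of live (positive) and fake (negative) test samples. The sketch below picks the threshold as a quantile of the negative scores, which is one common convention; the paper's exact thresholding procedure is not specified.

```python
import numpy as np

def tar_at_far(scores_pos, scores_neg, target_far):
    """Pick the threshold th such that a fraction target_far of negative
    (fake) samples scores >= th, then report the fraction of positive
    (live) samples accepted at that threshold (the TAR)."""
    th = np.quantile(scores_neg, 1.0 - target_far)
    return float(np.mean(scores_pos >= th))
```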

Discussion
Ablation Study We perform an ablation study, the results of which are shown in Table 10. First, we trained G_AB and G_BA separately and used translators with fixed parameters to augment data for training C. Separate learning shows better accuracy than the baseline CNN, because the translators generate diverse data. However, the proposed Joint learning is superior to Separate learning. One of the disadvantages of Separate learning is that it takes longer to train, since the two models are trained sequentially. To analyze the effect of each component, we incrementally enable Joint, AF, QL, and ST (Table 10).

Visualization of train and test features
We provide the results of a t-distributed stochastic neighbor embedding (t-SNE; Maaten and Hinton 2008) visualization in Figure 5. Since GTCNs produce translated data from given real data, the feature space of the augmented training samples is visualized. Additionally, the feature space of the classifiers for test samples of the face liveness dataset is visualized. The t-SNE of the GTCN with our proposed methods results in a sharper distinction between examples of similar classes, highlighting that the learned representation is of higher quality. Fisher's score J in the experiment is defined as

J = (μ_A − μ_B)^2 / (σ_A^2 + σ_B^2),

where μ_A and μ_B are the means of the logits for distributions A and B respectively, and σ_A and σ_B denote the standard deviations of the same logits. The comparison of Fisher's criterion scores for face liveness detection is shown in Figure 8. The proposed GTCN shows the best score, while the other generative models show better scores than the baseline CNNs. However, BC achieves a lower score than the CNNs. In the ablation study results, utilizing all the proposed methods, Joint/AF/QL/ST, results in the best score. To clarify the scores, histogram analyses of logit values for the liveness test dataset are shown in Figure 6 and Figure 7. CNN-256×256 seems to outperform CNN-128×128. In the case of BC, intra-class variance and the mean difference between classes are both reduced.
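Fisher's criterion as defined above is straightforward to compute from the per-class logits:

```python
import numpy as np

def fisher_score(logits_A, logits_B):
    """Fisher's criterion J = (mu_A - mu_B)^2 / (sigma_A^2 + sigma_B^2):
    large when class means are far apart relative to within-class spread."""
    mu_A, mu_B = np.mean(logits_A), np.mean(logits_B)
    var_A, var_B = np.var(logits_A), np.var(logits_B)
    return (mu_A - mu_B) ** 2 / (var_A + var_B)
```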
In the VAE and GAN, intra-class variance and inter-class margin are both increased. The proposed GTCN enlarges the margin between the live class and the fake class well. In the ablation study, Joint, AF, QL, and ST show different characteristics in how they shape the class distributions. Joint learning yields smaller variance within each class than Separate learning, while AF enlarges the margin between classes. QL markedly reduces the variance of each class while enlarging the margin between the classes. For multi-class classification, we compute the softmax probability P(y = i | x; θ_C), where 1 ≤ i ≤ k is a class id, x is a given image, and k is the number of classes. We choose the class id with the maximum probability among the k probabilities as the recognized class id of x.
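The multi-class decision rule just described is a softmax over the k logits followed by an argmax; a minimal sketch (the logit vector stands in for the classifier's FC outputs):

```python
import numpy as np

def predict_class(logits):
    """Softmax over k logits, then pick the class id with maximum probability."""
    e = np.exp(logits - np.max(logits))  # shift logits for numerical stability
    probs = e / e.sum()
    return int(np.argmax(probs)), probs
```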
Evaluation results are shown in Table 9, and GTCNs outperform the compared methods. The accuracy for the four classes improved evenly for the GTCN, as shown in Figure 10, because translated samples were used for training. We perform an ablation study, the results of which are shown in Table 10. Note that this is a simple modification for multi-class classification rather than a carefully designed extension. However, it is noteworthy that, because of the joint training of the GAN and classifier, the proposed method achieves 88.81% accuracy on the artist dataset, significantly outperforming the baseline CNN (83.21%) and separate training of the GAN and CNN (82.61%). In terms of within- and between-class characteristics, 2D PCA of logit values for the artist test dataset is shown in Figure 15 and Figure 16.

Effects of fine-tuning for the multi-class task After training GTCNs, we additionally fine-tune the classifier with a cross-entropy loss for a few epochs. The method quickly stabilizes and improves the overall accuracy of the classifier in GTCNs for a multi-class problem. We assume that, by fine-tuning, weights related to noisy samples of the multi-class setting may be truncated. To perform a fair comparison, we tried applying fine-tuning to the other compared methods, such as CNNs and BC/BC+. The fine-tuning is only useful for GTCN: for the other methods there was no meaningful improvement, and performance was sometimes degraded, as shown in Table 11.

Hyper-parameter search for the quadruplet loss The proposed quadruplet loss uses hyper-parameters η_a, η_b, and η_c. To achieve the best performance, we perform a grid search over the hyper-parameters on the dogs vs. cats and artist datasets, as shown in Table 12 and Table 13. The hyper-parameters for the quadruplet loss were set to {η_a/η_b = 0.25, η_c = 1.5} for the artist dataset. The hyper-parameters for the quadruplet loss have an impact on accuracy.

Visual examples of trained images
The GTCN uses progressively translated images as part of the mini-batches. Note that the main purpose of the GTCN is not to generate realistic images but rather to improve the classifier's accuracy. Nonetheless, a visual inspection of translated/generated samples gives insights into why certain methods work better than others. As shown in Figure 12, all of the compared models generate images visually different from the given inputs to construct the augmented mini-batch. The images generated by the VAE are blurry and seem to interpolate wrongly across classes. BC attempts to mix up images, but struggles to generate images that realistically belong to the target class, especially for images of the artist dataset that display less overall structure. In contrast, the GTCN attempts to borrow structure and shape information from samples of different classes while usually preserving texture and style, so the classifier can learn from sharper and more diverse training data. To show more examples of augmented training images, visually informative pairs were chosen in Figure 14.

Miscellaneous design of GTCN Light-weight classifier C
We chose to adopt an architecture that is sufficiently light-weight to run on performance-critical systems such as smartphones. The light-weight model is a CNN consisting of six convolutional layers (in addition to pooling and fully-connected components). The parameters are denoted θ_C = {w_1, w_2, w_3, w_4, w_5, w_6, w_s}. The specific parameters for the network are given in Table 14. As θ_C only consists of 73,904 (k=2) / 75,952 (k=4) parameters, the trained networks C have a small memory footprint and can be run on smart devices in real-time without GPU acceleration.

Score fusion models
As described in the main paper, we use score fusion models to achieve the best accuracy in face liveness detection. In detail, two kinds of image patches are used, one per classifier C in the GTCNs. Examples of patches are shown in Figure 13. For the face liveness detection experiments, we employed a face detector (Viola and Jones 2001) and resized the detected face region. The first classifier, which discriminates local texture, computes

SC_rs = C(x_rs; θ_rs),  with x_rs = resize(FD(I)),

where SC_rs is the liveness score of a patch x_rs that is a resized image of the detected face region, FD is a face detector, and θ_rs are learned parameters. Similarly, the classifier C_ct that discriminates global shape and contextual patterns computes

SC_ct = C(x_ct; θ_ct),

where SC_ct is the liveness score of a patch x_ct that is a resized contextual image from an input image I, and θ_ct are learned parameters. To decide whether the given image I is live or not, the final liveness score is calculated as

SC = α_rs · SC_rs + α_ct · SC_ct,

where α_rs and α_ct are coefficients combining the scores of the two models in a late-fusion manner. In this paper, we set α_rs = 1 and α_ct = 0.6.
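The late fusion of the two patch scores can be sketched as a weighted sum of the two classifier outputs, which is the natural reading of the coefficient description above; the decision threshold `th` below is a hypothetical parameter.

```python
def fused_liveness_score(sc_rs, sc_ct, alpha_rs=1.0, alpha_ct=0.6):
    """Late fusion: combine the face-patch score sc_rs and the
    contextual-patch score sc_ct with the coefficients from the paper."""
    return alpha_rs * sc_rs + alpha_ct * sc_ct

def is_live(sc_rs, sc_ct, th):
    # Accept the image as live if the fused score reaches the threshold.
    return fused_liveness_score(sc_rs, sc_ct) >= th
```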

Conclusion
We propose novel joint learning methods for the GTCN that train a generative translation model and a classifier via an augmented mini-batch technique, adaptive fade-in learning, and a quadruplet loss. Since the translators provide increasingly challenging data during the training of the classifier, accuracy can be improved by employing the augmented inter-class data. After training, we perform inference using only the light-weight classifier of the GTCN. Our method trained on a small subset of the whole dataset achieves greater accuracy than the baselines trained on the full dataset. When training on the full dataset, we achieve performance comparable to state-of-the-art methods despite using low-resolution images and a smaller network architecture. We believe our work can benefit classification tasks that suffer from visual similarity, limited diversity, and lack of data.

Figure 1 :
Figure 1: A generative translation model progressively generates samples to augment the data in a mini-batch for joint training of a classifier.

Figure 2 :
Figure 2: Overview of the proposed Generative Translation Classification Network. (Left) G_AB is a translator from class A to B, while G_BA is the corresponding translator from class B to A. G_AB consists of three main blocks: an encoder En_AB (three convolutional layers), a transformer Tr_AB (nine residual blocks), and a decoder De_AB (two deconvolutional layers and one convolutional layer). Likewise, G_BA consists of En_BA, Tr_BA, and De_BA. Random noise {z_AB, z_BA} ∈ R^{32×32×3} for stochastic translation is sampled from a uniform distribution over [−1, 1] and concatenated to the output maps of En_AB and En_BA. D_A and D_B are the corresponding discriminators for adversarial training on classes A and B respectively. C is a simple classifier consisting of six convolutional layers over classes A and B. α and β are the trainable parameters of the adaptive fade-in loss. x_A and x_B are real samples from classes A and B respectively. The corresponding translated samples are denoted by x̃_B and x̃_A. The cyclic reconstructed samples are x̂_A = G_BA(x̃_B) and x̂_B = G_AB(x̃_A). Note that only C is used at test time. Details of the loss functions L_* are in the proposed methods section. (Right) Comparison of mini-batch structure. A mini-batch used for training C in the GTCN contains real and translated samples, while the baseline CNNs are trained on real samples only.
We use a categorical cross-entropy loss L_cls on the Softmax output of C. The classification loss L_cls is used to train C, G_AB, and G_BA by backpropagation. In Equation 9, the parameters α and β control the relative importance of the four training inputs x_A, x̃_A = G_BA(x_B, z_BA), x_B, and x̃_B = G_AB(x_A, z_AB) in AMB_k. Specifically, α controls the relative weight given to real data x_A and augmented data x̃_A when training the classifier C. Similarly, β controls the relative importance given to real data x_B and augmented data x̃_B.
[·]_+ = max(·, 0), and f_C(·) denotes the embedding features of x_A, x̃_A, x_B, and x̃_B in AMB_k at the last feature layer of the classifier C. The thresholds η_a, η_b, and η_c are margins enforced between positive and negative pairs. Like L_cls, L_quad is used to train C, G_AB, and G_BA. The quadruplet loss encourages the similarity of intra-class samples and the dissimilarity of inter-class samples, which is useful for classification. Specifically, the quadruplet loss is designed to consider all six pairs among the four elements x_A, x_B, x̃_A, and x̃_B. The first and second terms in Equation 10 aim to minimize intra-class variation while separating the inter-class means. The final term acts differently: it provides auxiliary regularization of the inter-class means.
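Since Equation 10 is defined over all six pairs of the four embeddings, one plausible instantiation, consistent with the description above but not necessarily identical to the paper's exact formula, pulls the two intra-class pairs together and pushes the four inter-class pairs apart by a margin:

```python
import numpy as np

def sq_dist(u, v):
    return float(np.sum((u - v) ** 2))

def quadruplet_loss(f_A, f_At, f_B, f_Bt, eta_a, eta_b, eta_c):
    """Hinge-based quadruplet loss over embeddings f(x_A), f(x~_A),
    f(x_B), f(x~_B): intra-class pairs are pulled within margins
    eta_a/eta_b, inter-class pairs are pushed beyond margin eta_c.
    The pairing/margin placement here is an assumption."""
    hinge = lambda v: max(v, 0.0)
    intra = hinge(sq_dist(f_A, f_At) - eta_a) + hinge(sq_dist(f_B, f_Bt) - eta_b)
    inter_pairs = [(f_A, f_B), (f_A, f_Bt), (f_At, f_B), (f_At, f_Bt)]
    inter = sum(hinge(eta_c - sq_dist(u, v)) for u, v in inter_pairs)
    return intra + inter
```

When the two classes are already well separated in embedding space, every hinge term is inactive and the loss is zero.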

Figure 3 :
Figure 3: Examples from face liveness and dogs vs. cats. (First row) The first column shows real face images. The second, third, and fourth columns show fake face images corresponding to video displays, printed photos, and face masks with real eyes. (Second row) The first column shows cat images; the others show dog images.
where do is dropout regularization, FC is a fully connected layer for k classes, pool is a pooling layer, and conv^m_{n×n} consists of m convolution kernels of size n × n with batch normalization and rectified linear units. As the baseline translator network, we adapt the implementation of CycleGAN. Regarding the hyper-parameters, λ was set to 10 in all experiments. The hyper-parameters for the quadruplet loss were set to {η_a/η_b = 2, η_c = 6} for face liveness and {η_a/η_b = 0.5, η_c = 8} for dogs vs. cats.

Evaluation methods Binary classification is the problem of deciding to which class y ∈ Y = {A, B} a given test image x belongs. Instead of using the Softmax output that was used for training, we utilize the logit values obtained before the Softmax layer.

Figure 4 :
Figure 4: Receiver Operating Characteristic (ROC) comparison of models trained with the full volume of the face liveness dataset. (Left) ROC curves of single models. (Right) ROC curves of two-patches-based models.

Figure 5 :
Figure 5: The first two plots show examples of t-SNE analysis for training data of the GTCN: (1) face liveness and (2) dogs vs. cats. The augmented data points x̃ achieve good coverage of the area of the embedding space corresponding to their true class in binary classification, so we consider them effective at augmenting the training data. The second two plots show a comparison of methods by t-SNE analysis on the LV40 test dataset: (3) CNN and (4) the proposed GTCN.

Figure 9 :
Figure 9: Examples from the artist dataset. The first column shows Van Gogh's paintings. The second, third, and fourth columns show Monet, Cezanne, and Ukiyo-e examples respectively.

Figure 10 :
Figure 10: Comparison of confusion matrices for the compared methods on the artist dataset, to analyze multi-class classification results. The characteristics of the method comparison and the ablation study are very similar to the binary classification cases. Overall, the results show that the proposed GTCN enlarges the margin between classes and reduces within-class variance.

Figure 11 :
Figure 11: Example variation of α and β for adaptive fade-in learning

Figure 11 shows the values taken by α and β for the AF method during training on the face liveness dataset. At the start of training, α and β were higher than 0.5, because real images are more confidently classified than generated images; by the end of training they had almost converged to 0.5 in this example.

Figure 12 :
Figure 12: Qualitative comparison of data augmentation methods used to train a classifier. The first two rows are binary-class examples from the face liveness and dogs vs. cats datasets. The last three rows are multi-class examples from the artist dataset. In the examples, BC uses 0.33 and 0.66 as mixing ratios between two images.

Figure 13 :
Figure 13: Examples of image patches: the resized detected face x_rs and the resized contextual patch x_ct from input image I. Note that x_rs and x_ct are images of 128 × 128 pixels.

Figure 14 :
Figure 14: Examples of augmented training images for GTCN: (Top) face liveness dataset, (Middle) dogs vs. cats dataset, (Bottom) artist dataset. In the examples, translated samples attempt to borrow structure and shape information from samples of different classes while usually preserving texture and style. Even though the translated images are sometimes visually odd, images in the inter-class space also contribute to improving the classifier's accuracy for visually similar images in terms of shape and style.

Table 1 :
Configuration of experimental datasets and subsampled variants. LV* and DC* are reconfigured training datasets.

Table 2 :
Evaluation results of training with a small volume of the face liveness dataset. ACC is mean accuracy. Cells in the FAR columns show the percentage value of TAR. Outputs of the FC layer in the classifier C are utilized to calculate a score for class A. Table 2 and Table 3 show evaluation results. As the two binary classes are visually similar and lack diverse data, baseline CNNs have low accuracy on both datasets despite using deep networks and 100% of the training data. The overall sparsity of training data causes a poor true acceptance rate. Interestingly, BC/BC+ did not fare better than CNNs in the range of low false acceptance rates, although their mean accuracy was higher than that of CNNs. We hypothesize that it may be hard for BC/BC+ to produce good mixed data when classes are very similar, owing to the proximity of the manifolds corresponding to those classes. All deep generative models outperformed the CNN baseline on the LV40 dataset. In particular, VAE-based methods achieve good accuracy, most likely because such methods generate diverse samples. For all evaluation datasets, GTCNs trained with 40% of the dataset clearly outperformed all other compared methods, including CNNs trained on 100% of the training data.

Table 3 :
Evaluation results of training with a small volume of the dogs vs. cats dataset. EER is the equal error rate. * denotes results of the score-fusion-based model.

Table 7 :
Ablation experimental results of the proposed methods on the datasets

Table 9 :
Comparison of mean accuracy on multi-class artist dataset

Table 10 :
Ablation experimental results of the proposed methods on the artist dataset.

Method: CNN / Separate / +Joint / +AF / +AF/QL / +AF/QL/ST
Accuracy (%): 83.21 / 82.61 / 85.14 / 85.88 / 86.21 / 88.81

We extend the GTCN to multi-class classification via a simple trick: the mini-batch size is set to four, consisting of only one x_A, one x̃_A, one x_B, and one x̃_B, meaning that each mini-batch handles only two classes at a time. To calculate the probability of class i for a given test image x, we employ the softmax output of the network:

P(y = i | x; θ_C) = exp(FC_i) / Σ_{j=1}^{k} exp(FC_j).

Table 11 :
Comparison of the effects of fine-tuning on the artist dataset. † denotes that fine-tuning is applied.

Table 12 :
Examples of hyper-parameters obtained by grid search on the dogs vs. cats dataset.

Table 13 :
Examples of hyper-parameters obtained by grid search on the artist dataset.