Boosting Network Weight Separability via Feed-Backward Reconstruction

This paper proposes a new evaluation metric and a boosting method for weight separability in neural network design. In contrast to general visual recognition methods, which are designed to encourage both intra-class compactness and inter-class separability of latent features, we focus on estimating the linear independence of the column vectors in a weight matrix and on improving the separability of the weight vectors. To this end, we propose an evaluation metric for weight separability based on the semi-orthogonality of a matrix and the Frobenius distance, and a feed-backward reconstruction loss that explicitly encourages weight separability between the column vectors of the weight matrix. Experimental results on image classification and face recognition demonstrate that boosting weight separability by minimizing the feed-backward reconstruction loss can improve visual recognition performance across the evaluated tasks.

Representation learning based on deep learning methods has achieved remarkable performance in various visual recognition studies such as image classification (LeCun et al. 1998; Krizhevsky, Sutskever, and Hinton 2012; He et al. 2016), object recognition (Eitel et al. 2015; Socher et al. 2012), face recognition (Schroff, Kalenichenko, and Philbin 2015; Sun, Wang, and Tang 2014), and person re-identification (Li et al. 2014; Ding et al. 2015). The key to these successes is effective feature extraction via the non-linear, cascaded kernel structure of deep neural networks. Moreover, beyond feature extraction with the locally connected, shared-weight structure of convolutional neural networks, the decision metrics of neural networks, which are based on Euclidean geometry, have demonstrated that features embedded in an inner product space are sufficient to achieve recognition accuracies superior to those of conventional discriminative approaches (Dalal and Triggs 2005; Lowe 1999; Zhang et al. 2006) based on hand-crafted features in various recognition tasks.
In recent years, researchers have actively studied not only ways to improve the representation learning capabilities of convolutional neural networks by modifying network structures (He et al. 2016; Huang et al. 2017), but also discriminative methods for embedding latent features into Euclidean space (Liu et al. 2016; 2017). Feature learning constrained to an l2-norm space (Taigman et al. 2014) was proposed to improve the discriminative power of learned features by regularizing the vector scale of each data point. Angular cost functions (Deng, Guo, and Zafeiriou 2018), the large-margin softmax function (Liu et al. 2016), and Sphereface were proposed to improve the discriminative properties of learned features based on the principle of cosine similarity. The 'center loss', which is based on clustering methodology, showed that even though the function is non-differentiable, it can improve the discriminative power of learned features during network training. Intuitively, these approaches concentrate on embedding latent features into a constrained space by reinforcing intra-class compactness and inter-class separability (Liu et al. 2016). Even though these approaches have achieved remarkable performance in diverse visual recognition tasks, improving the separability of learned weight kernels remains a challenging issue. In recognition tasks that compute vector similarities between weights and latent features, the inner product correlation between weight vectors can significantly affect the performance of recognition models.
In this paper, we formulate an evaluation metric for weight separability and propose a method to boost the separability of the network weights in the last fully connected layer. Figure 1 shows the intuitive concepts of weight separability, inter-class separability, and intra-class compactness. Although one-hot encoded label vectors already induce the weight vectors of the last fully connected layer to be orthogonal in general approaches, the discriminative power of learned features may be improved further by revising loss functions or structural details. We focus on the semi-orthogonalization of a weight matrix, which is the process of finding a set of orthogonal vectors that span a specific subspace. The vectors in such an orthogonal set are linearly independent. The orthogonalization of weights in a neural network is considered a regularization method that reduces the correlation between the features detected by the network (Rodríguez et al. 2016). Our main hypothesis is that the separability between the vectors of a weight matrix is related to recognition performance and that it can be evaluated by the linear independence of the weight matrix. The purpose of this paper, therefore, is to verify this hypothesis and to apply this intuition to improve the representation learning capability of deep neural networks for various visual recognition tasks.
Figure 1: The intuitions of intra-class compactness, inter-class separability, and weight separability. $\sigma_j^c$ and $\sigma_j^k$ are the centroid and the $k$-th sample of the $j$-th class. $\theta_{ij}$ is the angle between the $i$-th and $j$-th weight vectors.
Our key contributions are as follows. First, we define and demonstrate a quantitative evaluation metric for weight separability that can be used for high-dimensional features without any dimensionality reduction or visualization. Second, we propose a straightforward method to explicitly boost the separability of the weight vectors during network training. The experimental results show that the proposed method can improve performance on image classification and face recognition tasks.

Linearity and Separability
In commonly used deep learning structures for visual recognition tasks, a fully connected network assigns the label by calculating a confidence based on vectorial or probabilistic approaches. The column vectors of the weight matrix in the last fully connected layer are used to decide the recognition class of an input through the vector similarity given by the inner product: $w_i \cdot \alpha = \|w_i\| \|\alpha\| \cos\theta_i$, where $w_i$ is the $i$-th column vector of the weight matrix $W = [w_1, w_2, w_3, \ldots, w_n] \in \mathbb{R}^{m \times n}$, $m$ and $n$ are the row and column dimensionalities of the weight matrix, and $\alpha$ and $\theta_i$ are a latent feature vector and the angle between $w_i$ and $\alpha$, respectively. For a fully connected layer positioned at the end of the network, $m$ and $n$ are the dimensionality of the input feature and the number of classes. In a recognition task using a fully connected layer, the class of a latent feature is assigned as the index of the column vector that yields the largest inner product:
$$\arg\max_i \, (\alpha \cdot w_i + b), \qquad (1)$$
where $i$ is the index of a column vector in the weight matrix, $\alpha$ is a latent feature, and $w_i$ and $b$ are the $i$-th column vector of the weight matrix and a bias term, respectively. The expression above can also be written as $\arg\max_i f(\alpha \cdot w_i + b)$, where $f$ is an activation function in the network. We omit the bias term and use the augmented vector form to simplify the experimental process. In this paper, we argue that the linear independence of the column vectors in a weight matrix is related to the separability of the weight vectors, which can influence the performance of various recognition tasks based on vector similarities. To support this argument, we conduct a simple experiment using the MNIST dataset (LeCun et al. 1998). In this experiment, we use samples of classes 0, 1, and 5 only. We compare two neural networks that have the same structure but are trained in different ways. We employ the LeNet (LeCun et al. 1998) structure in this experiment.
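The decision rule described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the function name `predict_class` and the toy values of `W` and `alpha` are assumptions chosen for the example.

```python
import numpy as np

def predict_class(alpha, W, b=None):
    """Return the index of the column of W with the largest inner product with alpha."""
    scores = alpha @ W              # shape (n,): one inner product per column vector w_i
    if b is not None:
        scores = scores + b         # optional bias term
    return int(np.argmax(scores))

# Toy values (illustrative only): m = 3 input features, n = 3 classes.
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
alpha = np.array([0.2, 0.9, 0.1])
predicted = predict_class(alpha, W)
```

With these toy values the second column has the largest inner product with `alpha`, so the feature is assigned to the class with index 1.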
One network is trained with linearly dependent column vectors in its final layer, and the other with linearly independent column vectors. We initially assign random real numbers between −1 and 1, and apply QR decomposition to obtain a weight matrix composed of linearly independent column vectors:
$$W = \tilde{W} R, \qquad (2)$$
where $W \in \mathbb{R}^{m \times n}$ is the randomly initialized weight matrix, $\tilde{W} \in \mathbb{R}^{m \times n}$ is an orthogonal matrix composed of linearly independent column vectors, and $R \in \mathbb{R}^{n \times n}$ is an upper triangular matrix. We employ a square matrix ($W \in \mathbb{R}^{10 \times 10}$) in this experiment, although QR decomposition applies to any $m \times n$ matrix with $m \ge n$. To maintain the linear independence of the weight vectors during learning, the parameters of the final weight matrix are not updated while training each model.
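A minimal NumPy sketch of the initialization described above, assuming a 10 × 10 matrix drawn uniformly from [-1, 1). The random seed and the variable names are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)                   # seed is an arbitrary choice for reproducibility
W_init = rng.uniform(-1.0, 1.0, size=(10, 10))   # random real numbers in [-1, 1)

# QR decomposition: W_init = Q @ R, where Q has orthonormal (hence linearly
# independent) columns and R is upper triangular.
Q, R = np.linalg.qr(W_init)

# Q would then be used as the frozen weight matrix of the final layer.
orthogonality_error = np.abs(Q.T @ Q - np.eye(10)).max()
```

Here `orthogonality_error` should be on the order of floating-point round-off, which confirms that the columns of `Q` are orthonormal.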
We reduce the dimensionality of the latent features to 3 using principal component analysis (PCA) to visualize the results. As the visualization in Figure 2 shows, the network whose final weight matrix is composed of linearly independent column vectors exhibits better discriminative power in the distribution of its latent features than the network for which linear independence was not enforced during training.
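The projection to three dimensions could be done along the following lines. The sketch uses an SVD-based PCA written directly in NumPy; the function name `pca_project` and the random stand-in features are assumptions, since the paper does not state which PCA implementation was used.

```python
import numpy as np

def pca_project(X, k=3):
    """Project the rows of X onto their first k principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:k].T

# Stand-in for latent features: 50 samples with 10 dimensions (illustrative only).
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 10))
projected = pca_project(features, k=3)   # shape (50, 3), ready for a 3-D scatter plot
```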

Weight Separability Evaluation
Intuition
As illustrated in Figure 1 and by the experimental results in Figure 2, the linearity of the column vectors in a weight matrix can influence recognition performance. We evaluate the weight separability using the orthogonality of a matrix. An orthogonal matrix satisfies $QQ^T = Q^TQ = I$, where $Q$ is a square matrix and $I$ is the identity matrix of the same size. However, the commonly used weight matrix $W$ is not square, and we cannot guarantee that $W$ is invertible in practical situations. Therefore, in this work, we employ the concept of a semi-orthogonal matrix. A non-square matrix $A$ is semi-orthogonal if either $AA^T = I$ or $A^TA = I$, which implies that $A$ is an isometry. With this notation, the linearity of a weight matrix $W \in \mathbb{R}^{m \times n}$ is evaluated by calculating an error $E$ defined as follows:
$$E(W, I_n) = W^TW - I_n, \qquad (3)$$
where $W$ is a weight matrix and $I_n$ is the corresponding $n \times n$ identity matrix. The result of this subtraction is a matrix. The closer $E(W, I_n)$ is to the zero matrix, the stronger the linearity of $W$. However, a matrix is not appropriate as a quantitative value for estimating linearity. Moreover, in practice, the above equation is not completely equivalent to mathematical semi-orthogonality. The cause of this inequivalence is the matrix structure of a neural network. The matrix notation for the final fully connected layer is
$$\alpha W = \alpha [w_1, w_2, \ldots, w_n] = o, \qquad (4)$$
where $\alpha \in \mathbb{R}^{1 \times m}$ is the latent feature output by the previous layer, consisting of $m$ elements, $w_i \in \mathbb{R}^{m \times 1}$ is the $i$-th column vector of the weight matrix $W$ of the final layer, $o \in \mathbb{R}^{1 \times n}$ is the output of the network, and $n$ is the number of classes. In the above notation, each output $o_i$, where $i = 1, 2, 3, \ldots, n$, is calculated as follows:
$$o_i = \alpha \cdot w_i = \sum_{j=1}^{m} \alpha_j w_{ij}, \qquad (5)$$
where $\alpha_j$ is the $j$-th element of $\alpha$ and $w_{ij}$ is the $j$-th element of the $i$-th column vector $w_i$.
In the above notation, the column vectors of the weight matrix play the role of kernels that assign a specific class by computing the vector similarity between the given feature $\alpha$ and each column vector $w_i$. In this work, we consider the separability of the weight kernels, so we only consider the linear independence of the column vectors of the weight matrix $W$. However, the same principle can be applied to networks in which the row vectors are used as the decision kernels.

Metric Definition and Mathematics
Since the matrix form in Eq. (3) is not suitable for evaluating weight separability quantitatively, we employ the Frobenius distance, which converts the matrix form into a real number. We define a quantitative metric based on the Frobenius distance to evaluate the linearity of the column vectors in a weight matrix. The metric $e(W)$ for the separability of a weight matrix $W \in \mathbb{R}^{m \times n}$, $m > n$, is defined by
$$e(W) = \frac{1}{n} \left\| W^TW - I_n \right\|_F, \qquad (6)$$
where $n$ is the number of column vectors in the weight matrix and $I_n$ is the $n \times n$ identity matrix. The proposed metric computes the weight separability using the Frobenius distance and normalizes it by dividing by the number of classes. The reason for this normalization is to provide a generalized evaluation metric that is invariant to the number of classes and to prevent the evaluation values from fluctuating with the problem domain. In Eq. (6), the $(j, k)$ element of $W^TW - I_n$ is
$$(W^TW - I_n)_{jk} = \sum_{i=1}^{m} w_{ij} w_{ik} - \delta_{jk}, \qquad (7)$$
where $w_{ij}$ is the element in the $i$-th row and $j$-th column of the weight matrix, and $\delta_{jk}$ is 1 if $j = k$ and 0 otherwise. By the properties of the transpose, $(W^TW - I_n)^T = W^TW - I_n$. By this property, the metric in Eq. (6) can be represented as follows:
$$e(W) = \frac{1}{n} \sqrt{Tr\left( (W^TW - I_n)^2 \right)}, \qquad (8)$$
where $Tr(\cdot)$ is the trace of a square matrix, defined as the sum of the elements on its main diagonal. Intuitively, when the value of $e(W)$ converges to zero, the column vectors of the weight matrix become linearly independent and the separability of the column vectors is maximal. We omit the bias in the fully connected layer because it complicates our visualization-based analysis and has almost no influence on recognition accuracy (Liu et al. 2016). Figure 3 shows the trends of the classification precision, the cost function, and the kernel linearity evaluated by Eq. (8), based on ResNet-32 and the CIFAR-10 dataset. As shown in Figure 3, the kernel linearity value gradually decreases and the classification precision increases during training.
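The metric can be illustrated with a short NumPy sketch. It assumes that the metric is the (unsquared) Frobenius norm of $W^TW - I_n$ divided by the number of columns $n$, which is our reading of the description above; the function names and the toy matrices are hypothetical.

```python
import numpy as np

def weight_separability(W):
    """Frobenius norm of W^T W - I_n, divided by the number of columns n (assumed form)."""
    n = W.shape[1]
    residual = W.T @ W - np.eye(n)
    return float(np.linalg.norm(residual, ord="fro") / n)

def weight_separability_trace(W):
    """Same quantity through the trace identity ||A||_F = sqrt(Tr(A^T A))."""
    n = W.shape[1]
    residual = W.T @ W - np.eye(n)
    return float(np.sqrt(np.trace(residual.T @ residual)) / n)

# Toy matrices (illustrative only).
W_orthonormal = np.eye(4)[:, :3]            # 4 x 3 with orthonormal columns
W_dependent = np.array([[1.0, 1.0],
                        [0.0, 0.0],
                        [0.0, 0.0]])        # 3 x 2 with two identical columns

e_orthonormal = weight_separability(W_orthonormal)
e_dependent = weight_separability(W_dependent)
```

Under this assumed form, orthonormal columns give a value of zero, while the matrix with two identical columns gives a clearly larger value, matching the intuition that smaller values indicate better-separated weight vectors.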
Additionally, we conducted simple experiments using the MNIST dataset to verify our metric. We trained LeNet using various loss functions, including l2-norm softmax (Taigman et al. 2014), center loss, and large-margin softmax (Liu et al. 2016), and cross-checked the accuracy and weight separability of each model. Figure 4 illustrates the results of these experiments. As Figure 4 shows, models with more accurate recognition performance exhibit greater weight separability as evaluated by our metric. One interesting observation is that the weight separability measured by our metric reflects the recognition performance even when it is difficult to judge which model performs better from the visualization results.

Feed-backward Reconstruction
Motivation
Consider commonly used optimization objectives such as softmax cross-entropy, and suppose we have a latent feature $\alpha$ and a corresponding annotation label $\hat{o}$. If the latent feature $\alpha$ has to be classified into the $i$-th class, these methods concentrate on encouraging $w_i \cdot \alpha > w_j \cdot \alpha$ for $j = 1, 2, \ldots, n$ and $j \neq i$, where $n$ is the number of classes and $w_i$ is the $i$-th column vector of the weight matrix $W$. In this work, we want not only to improve intra-class compactness and inter-class separability but also to boost the separability between the column vectors of the weight matrix. Current loss functions such as softmax cross-entropy, l2-distance loss, cosine angular loss, and large-margin softmax do not consider weight separability explicitly. A new method is therefore required to improve weight separability directly.

Feed-backward Reconstruction Loss
Following the notation for the weight separability evaluation in Section 3, the weight separability is maximal when $W^TW - I$ is the zero matrix. In this case, we essentially assume that $W^T = W^{-1}$. However, the proposed evaluation metric is unsuitable as an objective function for training a model because of a problem in computing gradients when the back-propagation algorithm (Hecht-Nielsen 1992) is used to update the network parameters. The evaluation metric is composed of the weight matrix of the final layer only; therefore, the gradient of the proposed metric for weight separability vanishes when the gradients for the other layers are calculated. Consequently, it is necessary to develop an objective function that is suitable for the training procedure of networks.
Figure 5: (a) Normally trained models contain a single mapping pipeline for classification, $f$, and the associated classification loss $L_{cls}$. (b) Models trained with the proposed reconstruction loss contain two mapping pipelines, classification $f$ and reconstruction $f^{-1}$, with associated losses $L_{cls}$ and $L_{re}$. $o$ and $\hat{o}$ are the network output and the corresponding annotation. $\alpha$ and $\hat{\alpha}$ are the latent feature and the latent feature reconstructed from the given annotation $\hat{o}$, respectively. The red and blue arrows in the first row indicate the classification and reconstruction pipelines, respectively. The red and blue dots represent the activation units of the output and previous layers.
To address this issue, we propose the feed-backward reconstruction loss, which can improve the weight separability directly. The feed-backward reconstruction loss is defined as
$$L_{re}(\hat{o}; \alpha, W) = \sum_i P(\alpha_i) \log \frac{P(\alpha_i)}{Q(\hat{o} w_i^T)}, \qquad (9)$$
where $\alpha$, $w_i^T$, and $\hat{o}$ are a latent feature, the transposed $i$-th column vector of the weight matrix $W$, and the label corresponding to the latent feature, respectively. $P$ and $Q$ are the distributions of the latent features and of the reconstruction results. The proposed loss function is mathematically equivalent to the Kullback-Leibler divergence; that is, it measures the difference between the distribution of the latent features and that of the reconstruction results. Intuitively, if the proposed loss $L_{re}$ converges to zero, then $P(\alpha_i)$ is equivalent to $Q(\hat{o} w_i^T)$, which can be written as $P(\alpha) \log \frac{P(\alpha)}{Q(\hat{o} W^T)} = 0$. In this situation, $W^T$ can be regarded as $W^{-1}$, and it can also be regarded as a solution that maximizes the weight separability. Reconstruction losses based on the l1-norm or l2-norm force the Euclidean distance to be minimized even when the angular difference is tiny. Such approaches cannot be used with various activation functions, since an activation function may change the Euclidean distance. We therefore require a method that is invariant to parameter transformations and is based on the difference between probability distributions.
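A rough NumPy sketch of a KL-divergence reconstruction loss of this kind is given below. The text does not specify how the distributions P and Q are obtained from the latent feature and the reconstruction, so the softmax normalization used here is purely an assumption, as are the function names, the small constant `eps`, and the toy values.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax, used here (as an assumption) to form P and Q."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def reconstruction_loss(alpha, o_hat, W, eps=1e-12):
    """KL(P || Q) between the latent feature and its reconstruction o_hat @ W^T."""
    p = softmax(alpha)            # P: distribution derived from the latent feature
    q = softmax(o_hat @ W.T)      # Q: distribution derived from the reconstruction
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Toy setting (illustrative only): m = 5 latent dimensions, n = 3 classes.
W = np.eye(5)[:, :3]                       # 5 x 3 weight matrix with orthonormal columns
o_hat = np.array([0.0, 1.0, 0.0])          # one-hot label for class 1

alpha_matching = o_hat @ W.T               # latent feature equal to its reconstruction
alpha_mismatch = np.array([1.0, 0.0, 0.0, 0.0, 0.0])

loss_matching = reconstruction_loss(alpha_matching, o_hat, W)
loss_mismatch = reconstruction_loss(alpha_mismatch, o_hat, W)
```

In this toy setting the loss is zero when the latent feature coincides with the reconstruction from the label and positive otherwise, which is the behaviour the text attributes to the loss.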
When we apply the proposed loss to train a model, it is added to an ordinary loss function $L_{cls}$ such as softmax cross-entropy, center loss, or large-margin softmax loss (Liu et al. 2016). The total loss function is therefore defined as
$$L_{total}(\hat{o}, o; \alpha, \theta) = L_{cls}(\hat{o}, o; \theta) + \lambda L_{re}(\hat{o}; \alpha, W), \qquad (10)$$
where $o$ and $\hat{o}$ are the output of the model and the corresponding label, $\alpha$ is the output of the layer preceding the final recognition layer, $W$ is the weight of the final layer, and $\theta$ is the set of network parameters including $W$. $\lambda$ is a hyper-parameter that sets the weight of the proposed reconstruction loss during training. In our experiments, $\lambda$ is set to 0.001, the value that gave the best performance over several experiments.
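The weighting in Eq. (10) can be illustrated as follows. The sketch uses softmax cross-entropy as the classification loss and a placeholder scalar for the reconstruction term; the function names, the logits, and the placeholder value are hypothetical, and only the value of λ comes from the text.

```python
import numpy as np

LAMBDA = 0.001  # weight of the reconstruction term reported in the text

def softmax_cross_entropy(logits, o_hat):
    """Cross-entropy between softmax(logits) and a one-hot label o_hat."""
    z = logits - np.max(logits)
    log_probs = z - np.log(np.sum(np.exp(z)))
    return float(-np.sum(o_hat * log_probs))

def total_loss(l_cls, l_re, lam=LAMBDA):
    """L_total = L_cls + lambda * L_re."""
    return l_cls + lam * l_re

# Illustrative values only.
logits = np.array([2.0, 0.5, -1.0])      # network output o for one sample
o_hat = np.array([1.0, 0.0, 0.0])        # corresponding one-hot label
l_cls = softmax_cross_entropy(logits, o_hat)
l_re = 0.25                              # placeholder standing in for a computed L_re
l_total = total_loss(l_cls, l_re)
```

With λ = 0.001 the reconstruction term contributes only a small fraction of the total, so it acts as a mild auxiliary objective next to the classification loss.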

Interpretation
A model trained with the feed-backward reconstruction loss contains two mapping processes: 1) the determination process $f: \alpha \rightarrow o$ and 2) the reconstruction process $f^{-1}: \hat{o} \rightarrow \hat{\alpha}$. Both processes share the weight parameter $W$. The determination process $f$ encourages $W$ to translate $\alpha$ into an encoded output $o$, and the reconstruction process $f^{-1}$ forces $W^T$ to recover $\hat{\alpha}$ from the given label $\hat{o}$. Figure 5 compares a normal model with a model applying the feed-backward reconstruction process in a classification task. During optimization, the two processes affect each other in achieving their objectives.
The objective of the determination process is to maximize the accuracy of visual recognition tasks by minimizing the geometric or probabilistic difference between the model output $\alpha W = o$ and the given annotation $\hat{o}$. The reconstruction process aims to minimize the difference between the distribution of the latent feature, $P(\alpha)$, and that of the reconstruction result, $Q(\hat{o} W^T)$. The reconstruction process can be optimized when the determination process is highly accurate, and it in turn enables more accurate recognition as the weight separability improves. This cooperation between the two processes is similar to the cycle consistency loss (Zhu et al. 2017). Consequently, these processes not only boost the weight separability but also improve cyclic consistency through the dual minimization of the classification and latent feature reconstruction objectives.
The bolded values represent the lowest error rate on LFW and YTF datasets.
2009). The CIFAR-10 dataset is composed of 50,000 training images and 10,000 test images in 10 classes. The CIFAR-100 dataset consists of 100 classes, and each class contains 500 training images and 100 test images. Our work concentrates on demonstrating the effectiveness of the feed-backward reconstruction loss rather than on achieving state-of-the-art performance. Our experiments are therefore intentionally based on several baseline models and focus on the comparison between normally trained models and models trained using the feed-backward reconstruction loss. The baseline models used in the image classification experiments are as follows: Network in Network (Lin, Chen, and Yan 2013), VGG-16 (Simonyan and Zisserman 2014), Highway Network (Srivastava, Greff, and Schmidhuber 2015), Residual Network (ResNet) (He et al. 2016), and Densely Connected Convolutional Neural Network (DenseNet) (Huang et al. 2017). To improve experimental efficiency, we use the shallowest structures of ResNet and DenseNet, namely ResNet-32 and DenseNet-40. All networks are trained using stochastic gradient descent (SGD) (Bottou 2010) with a batch size of 128 for 300 epochs. During training, we employ a learning rate decay of 0.0001 and a momentum of 0.9. The learning rate is initially set to 0.1 and divided by 10 at epochs 100, 200, and 250.
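The step schedule described above can be written as a small helper. This is only a sketch: the function name is hypothetical, and it assumes zero-based epoch indices with each drop taking effect at the start of the listed epoch, a detail the text does not specify.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(100, 200, 250), factor=0.1):
    """Return the learning rate for a given epoch under a step schedule."""
    lr = base_lr
    for milestone in milestones:
        if epoch >= milestone:      # assumption: the drop applies from this epoch onward
            lr *= factor
    return lr

# Learning rate at a few representative epochs of the 300-epoch run.
schedule = {epoch: learning_rate(epoch) for epoch in (0, 99, 100, 200, 250, 299)}
```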
The experimental results on the CIFAR-10 and CIFAR-100 datasets are shown in Table 1. The densely connected convolutional network trained with simple data augmentation and the proposed reconstruction loss achieved an error rate of 5.62% on CIFAR-10 and 20.75% on CIFAR-100. These are the best results in our image classification experiments. The weight separability values for these experiments are 2.16e-08 and 1.42e-08, respectively. The results show that the models trained with the feed-backward reconstruction loss outperformed the normally trained models. Most notably, the models trained with our loss consistently achieve better performance, whether the performance difference is small or large.

Face Recognition
We conducted additional experiments on face recognition to demonstrate the effectiveness of the proposed method for improving weight separability. These experiments follow the unrestricted, labelled-outside-data protocol: all models were trained using only the CASIA-Webface dataset and tested on the Labeled Faces in the Wild (LFW) dataset (Huang et al. 2007) and the Youtube Faces (YTF) dataset (Wolf, Hassner, and Maoz 2011). The CASIA-Webface dataset consists of 494,414 face images labelled with 10,575 different identities, and the dataset also contains horizontally flipped images for data augmentation. The performance evaluation is carried out on 6,000 face pairs from the LFW dataset and 5,000 video pairs from the YTF dataset.
The network models used in these experiments are as follows: DeepFace (Taigman et al. 2014), Facenet (Schroff, Kalenichenko, and Philbin 2015), DeepID2+, DDRL (Yu et al. 2018), and other methods, including that of Liu et al. (2016). These methods are initially trained in a classification setting and evaluated using a verification scheme. We added the feed-backward reconstruction loss to the total loss when training the models. Table 2 compares the normally trained models with the models applying the proposed loss.
The face recognition results show that, in most cases, the models trained with the proposed loss achieved better performance than the normally trained models. The highest recognition accuracies on the LFW and YTF datasets are achieved by the DDRL and DeepID frameworks trained with the proposed reconstruction loss. These models achieve error rates of 0.87% and 4.53% on the LFW and YTF datasets, respectively. The weight separability values for these experiments are 6.30e-08 and 8.76e-08. However, in the experiments using DDRL and the center loss, the proposed method degraded the recognition accuracy. In the experiment using the YTF dataset and DDRL, the weight separability value (3.54e-08) was reduced, but DDRL with the proposed reconstruction loss achieved an error rate of 7.15%, which is worse than the 5.98% of the original model. Additionally, in the experiment using the center loss, the model trained with the proposed reconstruction loss achieved lower accuracy than the original model.
The overall experimental results on the face recognition tasks show a trend similar to that of the image classification results. Even though the accuracies in our experiments are slightly lower than those listed in the original studies, they are comparable to the reported performance (Schroff, Kalenichenko, and Philbin 2015) and close to those of state-of-the-art methods trained only on the CASIA-Webface dataset.

Analysis
The experimental results show clear advantages over current deep neural network models and many of the compared baselines. Our interpretation of these performance improvements is as follows. First, as mentioned in Section 2 and Section 3, the weight separability can influence the recognition performance of a neural network model. We tried to improve the weight separability via the feed-backward reconstruction loss, which encourages linear independence between the column vectors of a weight matrix. In the learning procedure, the proposed reconstruction loss plays an important role in explicitly improving the weight separability. The error rates and weight separability values in Table 1 show that the classification performance is likely proportional to the weight separability. The experimental results for face recognition show a similar tendency.
Second, the feed-backward reconstruction can improve not only weight separability but also intra-class compactness. Figure 6 compares the neuron activation patterns with the values of the corresponding column vectors of the weight matrix in our classification experiment using ResNet. The values at the top of each bar graph indicate the Euclidean distance and cosine similarity between the neuron activation and the corresponding column vector of the weight matrix.
Figure 6: Pattern comparison of neuron activation and the corresponding weight vector for the 'Airplane', 'Bird', and 'Deer' classes in the CIFAR-10 dataset. The x-axis shows the index of each neuron, and the y-axis represents the activation output. The graphs on the right side are the pattern comparison for the normally trained ResNet, and the graphs on the left side are the comparison for the ResNet applying the proposed reconstruction loss. The blue bars indicate the expectation of the neuron activation, and the red bars represent the corresponding weight vector. The values beside each class name are the vector similarities, based on Euclidean distance and cosine similarity, between the expected neuron activation and the corresponding weight vector.
These values are regarded as the similarities between the neuron activations and the corresponding weight vectors. A common point is that the values for the model applying the proposed reconstruction loss are smaller than those for the normal model. In Figure 6, the Euclidean distance and cosine similarity for the 'Deer' class are 0.616 and 0.190 for the model applying our reconstruction loss. In contrast, the corresponding values for the normal model are 0.666 and 0.223, which are larger. The results for the 'Airplane' and 'Bird' classes show the same phenomenon. These results suggest that the proposed reconstruction loss can help the network learn a more discriminative representation.
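The two similarity measures mentioned above can be computed as in the following NumPy sketch. The function names and the toy vectors are hypothetical, and the paper does not state whether the activations and weight vectors were normalized before comparison, so no normalization is applied here.

```python
import numpy as np

def euclidean_distance(activation, weight):
    """Euclidean distance between a mean activation vector and a weight vector."""
    return float(np.linalg.norm(activation - weight))

def cosine_similarity(activation, weight):
    """Cosine of the angle between a mean activation vector and a weight vector."""
    denom = np.linalg.norm(activation) * np.linalg.norm(weight)
    return float(activation @ weight / denom)

# Toy vectors (illustrative only): two orthogonal unit vectors.
a = np.array([1.0, 0.0])
w = np.array([0.0, 1.0])
distance = euclidean_distance(a, w)
similarity = cosine_similarity(a, w)
```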

Conclusion
In this paper, we presented a metric for evaluating weight separability and proposed the feed-backward reconstruction loss, which directly improves weight separability and can be used for various visual recognition tasks. The evaluation metric represents the linear independence of the column vectors in a weight matrix. With the feed-backward reconstruction loss, the separability of the column vectors in the weight matrix was improved. The experimental results show that the proposed feed-backward process and loss function contribute to performance improvements in recognition tasks.