Soft Label With Channel Encoding for Dependent Facial Image Classification

In classification tasks, training labels are usually specified as one-hot targets, which represent each class equally and exclusively. However, this labeling rule is not suitable in some situations: for dependent classes, one-hot targets cannot represent the relations among them. Existing label smoothing methods split the target response among neighboring classes, but they apply only to ordinal classification, not to dependent but non-ordered classes. In this paper, we propose a novel labeling rule that decomposes the one-hot target into several bases to reflect relationships among classes while maintaining a balanced target space. The rule adopts channel encoding from communication systems, in particular Bose-Chaudhuri-Hocquenghem (BCH) encoding. Moreover, the BCH encoder has an error-correcting mechanism that is expected to lift accuracy. In theory, training with BCH targets ensures improved classification performance provided the original accuracy is not less than 50%. To verify the proposed method on dependent classification, we conduct experiments on two facial tasks: age recognition and face anti-spoofing. The former is an ordinal classification task; the latter can also be regarded as a specific dependent classification problem because the varying attack types are ultimately classified as one class and real faces as the other. Experimental results show that the proposed method improves accuracy by 6.33% on age recognition and reduces HTER by 3.63% for face anti-spoofing. In addition, because BCH targets divide the original response into a higher-dimensional space, the model is driven to learn more delicate sub-features. Hence, BCH targets also enhance model generalizability, improving performance on cross-domain evaluations. We further perform an assessment on the PACS dataset to evaluate domain generalizability.
The results show that training with BCH targets enhances domain generalizability, increasing average accuracy by over 2%.


I. INTRODUCTION
Image classification commonly involves taking a set of one-hot targets as training labels; the output feature is compared to each target in terms of cosine similarity, after which the results are mapped to the corresponding class probabilities. One-hot targets represent each class equally and exclusively, which, however, is not suitable in some cases: in dependent classification tasks, there are similarities or gradual changes among classes.
The associate editor coordinating the review of this manuscript and approving it for publication was Sudhakar Radhakrishnan .
For instance, in ordinal classification, classes are ordered. Hence, soft labeling has been proposed to address this shortcoming: it distributes the target probability among neighboring classes with certain functions to satisfy the class characteristics and reflect relations among classes in a detailed manner. This label smoothing rule ensures that training targets represent gradual changes in conditions [1], [2]. To address the finer relation among classes, Kats et al. construct targets by considering the distance among classes as weights for probability sharing [3]. However, existing label smoothing methods cannot handle dependent but non-ordered classification tasks. In some facial image classification tasks, the classes are related but not sequential, as in face anti-spoofing, where the varying types of attacks all belong to the fake class while the other class is the real one. Another example is binary facial emotion classification, which roughly divides facial emotions into positive and negative sides; the emotions within each side are similar but not in a certain order.
To extend the usage of soft labels to all dependent classification tasks, a novel labeling rule is proposed in this study. The proposed soft labeling rule must preserve two properties of the targets. First, the new targets should maintain a balanced target space; in other words, the distances between new targets should be as equal as possible. This ensures the classifier is fair to each dependent but non-ordered class. Second, the new targets must also be able to represent gradual changes for ordered classes. The basic idea is to determine a set of bases that decomposes the feature centers in the classifier: the subspace of features reflects the similarities and differences among classes. That is, classes are further merged or split into sub-classes to represent gradual changes or relations among classes. For instance, in terms of feature similarity in the face anti-spoofing task, a photo-attack sample shares more components with a printing-attack sample but few with a real face. Therefore, we decompose the response of one-hot targets into several targets whose positions are arranged to achieve component sharing and mutual exclusion. Given the proposed training targets, the feature extractor (FE) is forced to distinguish higher-level features. Simultaneously, the proposed method guarantees a uniform distribution of sub-class centers and thus strikes a balance in performance among classes. In addition, as the feature description is mapped into a higher dimension, it can represent more subtle and sufficient characteristics for each class, which leads to better domain generalizability.
The channel encoding technique from communication systems is leveraged to encode the training targets. We treat the original targets, represented in binary form, as the messages to be encoded, and the codewords as the new targets. Finally, a target selection procedure yields new targets with the desired attributes. When we train the model with the encoded targets, the output is decoded by multiplying it with the codeword table; the final result is the class probability, as in the original classification task. Channel encoding also brings a collateral benefit: by concatenating redundancy after the message, mutual information between messages is embedded into each encoded codeword, which ensures fault tolerance and facilitates error correction. Here we use the Bose-Chaudhuri-Hocquenghem (BCH) encoder due to its efficiency for short messages. The BCH encoder also guarantees the characteristics mentioned above for the encoded targets. First, BCH targets are fault-tolerant, and the similarities between them are kept equal. Furthermore, with the BCH encoder, the length of the encoded targets is not limited by the number of classes.
This flexibility makes it easy to select target sets for different tasks.
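As a concrete illustration, the encoding step above can be sketched with the smallest BCH code, the (7, 4) code (equivalent to the Hamming(7, 4) code). The function and variable names here are illustrative sketches, not the paper's implementation:

```python
def encode_7_4(msg):
    """Encode a 4-bit message into a 7-bit systematic codeword
    (data bits first, then three parity bits)."""
    d1, d2, d3, d4 = msg
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [d1, d2, d3, d4, p1, p2, p3]

def class_to_target(class_idx, k=4):
    """Map a class index to a k-bit binary message, then to a codeword.
    The index is shifted by 1 so the all-zero codeword is never used."""
    msg = [(class_idx + 1) >> (k - 1 - i) & 1 for i in range(k)]
    return encode_7_4(msg)

# Eight classes -> eight 7-bit targets instead of 8-dim one-hot vectors.
targets = [class_to_target(c) for c in range(8)]
```

Since the code is linear with minimum distance 3, any two of these targets differ in at least 3 bits, which is what gives the labels their fault tolerance.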
This paper is organized as follows. Section II introduces related work on encoding, labeling adjustment for facial dependent classification tasks, and domain generalization. The problem formulation, theoretical derivation, and details of the proposed method are addressed in Section III. The experimental results are shown in Section IV: two facial dependent classification tasks, age recognition and face anti-spoofing, and a domain generalization task are selected to verify the efficiency of the proposed method. We further conduct an ablation study to verify the preliminary assumption of the proposed method. Finally, the conclusion is given in Section V.

II. RELATED WORK
A. ENCODING
Channel coding is a technique for detecting and correcting bit errors in digital communication systems. At the transmitter side, the structure is called an encoder, which generates redundant information according to the message information and then combines the message and redundancy into one package before transmission. At the receiver side, the decoder detects and corrects errors to enhance the robustness of the communication system. A Bose-Chaudhuri-Hocquenghem code is a well-known linear block code with a finite codeword length [4]. In BCH code, the message information is partitioned into k-bit blocks, from which the encoder generates n-bit codewords. These are called (n, k) codes.
In image classification, the label dimension usually corresponds to the number of classes; this corresponds to short block lengths in digital communications. In [5], simulation results on short block lengths show that BCH codes provide the highest reliability under optimal decoding techniques, since they have the highest minimum Hamming distance of all well-known channel codes. Another key feature of such a code is that the Hamming distance between any two individual valid codewords is approximately the same: for a BCH (n, k) code, two randomly chosen codewords among the 2^k valid codewords usually have the same Hamming distance. This key feature therefore perfectly matches the requirement in deep learning tasks that the differences among all labels should be balanced.
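The balanced-distance property can be checked numerically by enumerating all 2^4 codewords of the (7, 4) code; `hamming_encode` below is an illustrative systematic encoder, not code from [5]:

```python
from itertools import product

def hamming_encode(msg):
    # Systematic (7, 4) encoding: four data bits, then three parity bits.
    d1, d2, d3, d4 = msg
    return (d1, d2, d3, d4, d1 ^ d2 ^ d4, d1 ^ d3 ^ d4, d2 ^ d3 ^ d4)

codewords = [hamming_encode(m) for m in product([0, 1], repeat=4)]
distances = sorted(
    sum(a != b for a, b in zip(u, v))
    for i, u in enumerate(codewords)
    for v in codewords[i + 1:]
)
# For a linear code, each pairwise distance equals the weight of some
# nonzero codeword, so the spectrum here is {3, 4, 7}, with 112 of the
# 120 pairs at distance 3 or 4.
print(min(distances), max(distances))  # -> 3 7
```

Most pairs sit at distance 3 or 4, i.e., nearly equidistant, which is the balance the labels need.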

B. AGE RECOGNITION
Due to their advantages in feature extraction, convolutional neural networks (CNNs) have been the main approach for age recognition in recent years. Geng et al. [27] propose facial aging pattern extraction to estimate age, and Guo et al. [6] expand the task from regression to classification by using bio-inspired models in age recognition. Rothe et al. [8] treat age recognition as 100-class classification and post-process output neuron scores to approximate age. Yang et al. [7] propose using a multi-classification model for ranking, based on which they calculate the regression prediction. Ranking can also be implemented with a CNN model [9], [10]. Diaz et al. [11] propose soft labels for ordinal classification (SORD) by considering the relationship between classes. Zeng et al. [12] apply soft labels to a ranking CNN to enhance robustness. Generative adversarial networks (GANs) have been applied to some specific problems: Nam et al. [13] improve performance on low-resolution images by reconstructing the image with a conditional GAN, and Kim et al. [14] consider the domain issue of different ethnic and age distributions between databases, enhancing robustness by training a CycleGAN to learn the transfer relation between different image styles.

C. FACE ANTI-SPOOFING
In conventional face anti-spoofing, motion-based and texture-based FEs are used to classify differences between real and fake faces. Dynamic mode decomposition (DMD) has been proposed to capture facial movements [15]. Wen et al. protect against attacks using image distortion analysis (IDA), which considers illumination, sharpness, chromatic moment, and color diversity features [16]. Another texture-based approach exploits chromatic aberration in facial images, computing color moment and ranked histogram features to distinguish artifacts caused by various attacks [17]. Boulkenafet et al. expand the feature descriptor by combining the luminance and chrominance channels to analyze joint color-texture information [18].
Face anti-spoofing can also be based on CNNs [19]. A region-based CNN model [21] is implemented as a three-way classifier to separate the real face, the fake face, and the background. In addition to spatial features, temporal changes can also be considered [25]: temporal features extracted with a recurrent neural network (RNN) capture physiological changes within genuine samples. In [24], a CNN-RNN-based model is used to simultaneously estimate depth and heart-rate signals. With an expanded CNN, a combination of shearlet-based features and stacked auto-encoders is used for the task [20]. Sun et al. [22] apply depth-based spatial aggregation of a pixel-level local classifier to deal with multiple attacks. The meta-learning technique is leveraged for few-shot face anti-spoofing in [23]. With domain adaptation in mind, Lv et al. [26] propose dynamic image generation and a prediction ensemble to elevate performance in cross-domain assessment.

D. DOMAIN GENERALIZATION
In computer vision, CNNs have come to the fore in recent years. However, these models perform poorly when tested on data from other domains. This is sometimes termed the domain adaptation problem [28]-[31]: labeled data is available for the source domain, but only unlabeled data is available for the target domain. By learning the transfer relation between the source and target domains during training, the model adapts to more domains with unlabeled data. Other work treats the domain issue as a domain generalization problem. In contrast to domain adaptation, domain generalization takes all unseen data as the target domain, which is not available during training. DeVries et al. [32] apply cutouts to images during training to prevent the model from overfitting to the training domain. Li et al. [33] propose episodic training, which trains several models on the source domains and shares the FE parts of each model to improve the generalizability of the final domain-agnostic model. Huang et al. [34] randomly drop the neurons with the highest gradients during training; the model thus learns other minority features in the source domain, improving target-domain accuracy.

III. PROPOSED METHOD
A. PROBLEM FORMULATION
In this subsection, we derive the theoretical improvements yielded by training a model with BCH targets. We also show that models trained with BCH targets are more robust to data from different domains.

1) THEORETICAL IMPROVEMENTS
To determine the theoretical improvements attainable when training with BCH targets, we first assume that the output neurons immediately after the trainable weights have the same probabilities and that the central features trained by the one-hot targets are evenly distributed in the feature space. (This assumption may not hold in some cases, but we adopt it first to simplify the derivation; the conflicts are discussed in Section IV-D.) As shown in Fig. 1, f is the output of the FE, f_BCH ∈ R^n is the input of the BCH table, and y_BCH ∈ R^m and y_one-hot ∈ R^m are the corresponding outputs of the classifiers trained with BCH and one-hot targets, respectively; n denotes the BCH target length and m the number of classes.
When the model is trained with one-hot targets, given an image sample x in the sample space X, where X_i denotes the sample space of the i-th class, a correct prediction requires that f · ŷ > f · y_j for all j ≠ i, where ŷ is the one-hot target corresponding to x. Considering the average over the entire sample space, we obtain the probability p of a correct prediction with one-hot targets. As it is more difficult to make a correct prediction than to activate a single neuron correctly, the probability that an individual output neuron responds correctly is no less than p. When training with BCH targets, f_BCH is the input of the BCH target table, denoted as T = [t_1, t_2, . . . , t_m] ∈ R^{n×m}. To derive the relation between p and p_BCH, we also compute the probability of a correct prediction with BCH targets. Given that y_BCH_i belongs to the i-th class, a correct prediction occurs when the cosine similarity between f_BCH_i and t_i is larger than that with t_j for all j ≠ i (Eq. 5).
For bits where t_i and t_j are the same, whether the corresponding neurons of f_BCH_i are activated or not does not affect the relation between f_BCH_i · t_i and f_BCH_i · t_j. Hence, we only need to focus on the bits where t_i and t_j are complementary; among these bits, the fault tolerance rate is 50%. That is, as long as more than half of the corresponding bits of f_BCH_i are correct, Eq. 5 is satisfied. The length of a BCH target is n, and l is the number of 1s in it. Let d denote the cosine similarity of two BCH targets t_i and t_j. In terms of the characteristics of BCH targets, there are four types of relationships between l and d; they are expressed in terms of n in Table 4. The probability p_BCH can be regarded as a summation of Bernoulli distributions and depends on the parameters d and l. Hence, we address the probability of a correct prediction for each of the four types as follows.
For type A, the maximum number of error bits that maintains a correct prediction is (n − 3)/4. Similarly, type B tolerates at most (n − 3)/4 error bits. Type C is the special case of two complementary BCH targets, so its maximum error-bit tolerance is the largest. For type D, the maximum error-bit tolerance is again (n − 3)/4. We compute the occurrence probability of these four types of situations as weightings: the total number of classes is m, and the occurrence weightings w are shown in Table 5.
The probability of making a correct prediction with BCH targets is then the weighted sum over the four types. With a numerical approach, we find that if the original probability of a correct prediction with one-hot targets exceeds 0.5, there exists more than one set of BCH targets that ensures improvement, as shown in Fig. 2 and Fig. 3, which plot the improvement ratio R of p_BCH over p.
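The Bernoulli-sum argument above can be sketched numerically. Assuming each output bit is correct independently with probability q (a symbol introduced here only for illustration), the chance of beating one competing target that differs in n_diff bits is a binomial tail:

```python
from math import comb

def pairwise_win_prob(q, n_diff):
    """Probability that the true target still beats one competing target
    that differs from it in n_diff bits, when each output bit is correct
    independently with probability q. More than half of the differing
    bits must be correct (ties are counted as losses, pessimistically)."""
    need = n_diff // 2 + 1  # strictly more than half correct
    return sum(comb(n_diff, e) * q**e * (1 - q)**(n_diff - e)
               for e in range(need, n_diff + 1))

# Example: per-bit accuracy 0.9, two targets differing in 7 bits.
p_win = pairwise_win_prob(0.9, 7)
```

The tail probability grows with n_diff for a fixed q > 0.5, which is one way to see why longer codewords (larger n) help once the per-bit accuracy is above chance.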

2) CROSS DOMAIN IMPROVEMENT
The cross-domain problem is an open problem for CNN models, especially in facial image classification tasks. Model performance degrades because the model handles well only data from the domains it was trained on. In cross-domain evaluation, misclassifications are most likely to occur between similar or neighboring classes, so the perturbed output can be regarded as a linear combination of several targets. As we apply cosine similarity to judge the corresponding class between candidates t_i and t_j for a given model output u belonging to the i-th class, correct classification occurs when u · t_i > u · t_j. Let ũ denote the perturbed model output, which can be regarded as a linear combination of targets with weights ε_r,
where t_r is the target of the r-th class and ε_r is the corresponding weight, that is, ũ = Σ_r ε_r t_r. For one-hot targets, if ũ belongs to the i-th class, correct classification implies that ũ · (t_i − t_j) > 0. Taking the expectation of both sides and normalizing by the expectation of t_i − t_j, we obtain the one-hot condition in Eq. 18. For BCH targets, if ũ belongs to the i-th class, the same condition applies; considering all types of BCH targets and applying the normalization similarly yields Eq. 21. We assume that ε_r = 0 for r ≠ i, j, which means that only one class causes the perturbation, and thus simplify Eq. 21 into Eq. 22. Comparing the right-hand sides of Eq. 18 and Eq. 22 shows that the condition for BCH targets is looser. Hence, compared with one-hot targets, the proposed method has a higher probability of producing a correct classification.
To further examine the cross-domain condition, we use a generative adversarial network (GAN) to simulate and visualize reconstruction results from different domains by adding noise. Here, we utilize MNIST [35] and infoGAN [36] for the comparison. We fix the noise part of the latent variable for infoGAN and determine the remaining part by combining two targets, using the weight of the perturbing target as the noise energy. The weight is set from 0.5 to 1.0 in steps of 0.05, and we juxtapose the reconstructions using one-hot targets and BCH targets. The results are shown in Fig. 4, in which every column should match the digits in the first row; otherwise, a misclassification has occurred due to cross-domain perturbation. There are far more misclassifications in the results with one-hot targets. Hence, under noise perturbation from different domains, training with BCH targets enhances robustness and generalizability.

B. BCH TARGET GENERATION
The generation of BCH targets is addressed in this section. There are two target selection schemes, for dependent but non-ordered classification and for ordered classification, respectively.

1) INPUTS FOR BCH ENCODER
The BCH encoder uses an (n, k) pair for the message and the codeword. When one-hot targets are used directly for encoding, fewer (n, k) pairs are available as the number of classes grows; in addition, there is less redundancy, leading to a poorer capability to carry mutual information between targets. For better flexibility in selecting (n, k) pairs, we use targets in binary form as BCH encoder inputs instead of one-hot form. We remove the all-zero and all-one codewords to prevent the neuron responses from being identical for every class.

2) SELECTION FOR DEPENDENT BUT NOT ORDERED CLASSES
The targets must be suitable for CNN models; that is, there are no all-zero or all-one targets, and the target space is shared equally. Simultaneously, the targets must represent the dependence among classes, so we select the middle of the BCH codeword set symmetrically. The resulting table of BCH targets has almost the same number of 1s in every row and column. The same number of 1s in each row eliminates the need for normalization during training and makes the training procedure more stable. The same summation over columns ensures that each neuron in the output layer is trained equally, given that the training data is shuffled and balanced among classes. A sample of BCH target generation for this type of classification task is shown in Table 7-(c).
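One way the symmetric middle-of-the-set selection could be sketched, with the (7, 4) code as a toy codebook; `select_balanced` is an illustrative name, not the paper's implementation:

```python
from itertools import product

def encode_7_4(msg):
    # Systematic (7, 4) encoding: four data bits, then three parity bits.
    d1, d2, d3, d4 = msg
    return (d1, d2, d3, d4, d1 ^ d2 ^ d4, d1 ^ d3 ^ d4, d2 ^ d3 ^ d4)

# Full codebook minus the all-zero and all-one codewords.
codebook = [encode_7_4(m) for m in product([0, 1], repeat=4)]
codebook = [c for c in codebook if 0 < sum(c) < 7]

def select_balanced(codebook, m):
    """Pick the m codewords whose number of 1s is closest to n/2,
    i.e. the symmetric middle of the codeword set."""
    n = len(codebook[0])
    return sorted(codebook, key=lambda c: abs(sum(c) - n / 2))[:m]

targets = select_balanced(codebook, 3)  # e.g. genuine / printed / video
```

For this small code, every surviving codeword already has weight 3 or 4, so the selected rows have near-equal numbers of 1s, matching the balance argument above.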

3) SELECTION FOR DEPENDENT AND ORDERED CLASSES
For tasks with dependent and ordered classes, we further consider the representation of gradual changes, and the BCH targets are generated as follows. We calculate the density center of the 1s as

ρ = Σ_{j=1}^{n} j · c_j, (24)

where ρ is the center of density of the 1s and c_j is the j-th bit of the BCH codeword. The BCH codeword set is then sorted by ρ, and we pick the first half of the targets at intervals in arithmetic progression. The other half of the BCH targets consists of the complements of the selected first half. The final step makes the selected BCH targets softer: we take SORD targets [11] as weights for a linear combination of the BCH targets,

t̃_j = Σ_i s_j^i · t_i, (25)

where t̃_j is the final BCH target of the j-th class for ordinal dependent classification tasks, s_j^i is the i-th bit of the SORD target for class j, and t_i is the BCH target for class i after selection according to the density of 1s. Table 7 shows the complete flow of generating a sample with (n, k) = (7, 4).
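The SORD-weighted blending can be sketched as follows; the weight function `sord_weights` and its scale `alpha` are hypothetical stand-ins for the distance metric of [11]:

```python
import math

def sord_weights(j, num_classes, alpha=1.0):
    """SORD-style soft label for class j: softmax over the negative
    absolute distance to every class (alpha is a hypothetical scale)."""
    logits = [-alpha * abs(i - j) for i in range(num_classes)]
    z = sum(math.exp(v) for v in logits)
    return [math.exp(v) / z for v in logits]

def soften(bch_targets, alpha=1.0):
    """Blend density-sorted BCH targets with SORD weights, i.e.
    t~_j = sum_i s_j^i * t_i, so neighbouring ordered classes
    share components."""
    m, n = len(bch_targets), len(bch_targets[0])
    soft = []
    for j in range(m):
        s = sord_weights(j, m, alpha)
        soft.append([sum(s[i] * bch_targets[i][b] for i in range(m))
                     for b in range(n)])
    return soft

# Toy 3-class example with already-selected 7-bit targets.
bch_targets = [[1, 1, 1, 0, 0, 0, 0],
               [0, 0, 1, 1, 1, 0, 0],
               [0, 0, 0, 0, 1, 1, 1]]
soft_targets = soften(bch_targets)
```

Each soft target is dominated by its own codeword but leaks probability mass into its ordinal neighbours, which is the gradual-change behaviour the selection scheme aims for.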

C. MODEL TRAINING
To integrate BCH targets into CNN-based model training, we leverage soft maximum-likelihood decoding to match the BCH targets rather than the conventional BCH decoding algorithm. We use the fixed decoding table of BCH targets as a second fully-connected (FC) layer immediately after the first one, which yields a matching result for each target in the table. To keep the layer structure and loss design consistent with the original image classification model, we use cosine similarity as the metric for target matching. Softmax is then applied to map the matching result to the probability of the class that the given sample belongs to. Overall, the difference between training with one-hot targets and with BCH targets lies mainly in the added fixed-weight FC layer; the number of parameters in each layer is shown in Table 6. The training loss design and the inference metric are the same in both cases. That is, both the training settings and the execution time remain nearly unchanged under the proposed labeling compared with one-hot targets. The modified objective function L in this study is the cross-entropy loss over the decoded class probabilities, where N is the number of training samples in a batch and W and b are the weights and biases within the FE.
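A minimal sketch of the fixed decoding-table head, using NumPy in place of the paper's PyTorch layers; `bch_head` and the toy table are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def bch_head(f_bch, table):
    """Fixed-weight 'decoding' layer: cosine similarity between the
    n-dim feature and each of the m BCH targets, mapped to class
    probabilities. `table` has shape (n, m) and is never updated."""
    f = f_bch / np.linalg.norm(f_bch, axis=-1, keepdims=True)
    t = table / np.linalg.norm(table, axis=0, keepdims=True)
    return softmax(f @ t)

# Toy example: 7-bit targets, 3 classes; the table acts as the second,
# frozen FC layer described above.
table = np.array([[1, 1, 1, 0, 0, 0, 0],
                  [0, 0, 1, 1, 1, 0, 0],
                  [0, 0, 0, 0, 1, 1, 1]], dtype=float).T  # (n, m) = (7, 3)
probs = bch_head(np.array([[1.0, 1, 1, 0, 0, 0, 0]]), table)
```

Because the table is frozen, gradients flow through it into the FE exactly as through any linear layer, so the training loop itself is unchanged.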

IV. EXPERIMENTAL RESULTS
In this section, to ensure a fair comparison, the FE and all training settings are fixed to be the same. We use the PyTorch deep learning framework [37]. For dependent facial image classification, age recognition and face anti-spoofing are considered: age recognition illustrates how the proposed method works on tasks with dependent and ordered classes, while face anti-spoofing covers dependent-only classes. As both tasks suffer performance degradation on cross-domain data, they are ideal sample tasks for Section III-A2. Hence, to check generalizability, all evaluations are tested across datasets. For a more specific demonstration of cross-domain evaluation, we also perform an assessment on the PACS dataset [38] for domain generalization. In addition, we conduct an ablation study on CIFAR10 [39] to specify when the preliminary assumption of the theoretical improvement holds and to identify a collateral benefit of training a model with BCH targets.

A. AGE RECOGNITION EVALUATION
To evaluate cross domain performance on the age recognition task, we used the Adience dataset [40] for model training and the UTKFace dataset [41] for testing in the experiments.
The Adience dataset provides 19K processed images from Flickr, uploaded from mobile devices and labeled with gender and age group. The images were aligned with facial landmarks and divided into five folds for training and testing.
The UTKFace consists of over 20K faces labeled with age, gender, and race. It provides facial images with a wide range of ages from newborns to 116, and its images were crawled from search engines such as Google and Bing. Thus its domain is quite different from that of Adience. The difference between the two datasets increases the difficulty of the task.
We compare the performance of the BCH targets with the one-hot and the SORD targets. For the SORD targets, we apply the distance metric function mentioned in [11], which is also used to generate the BCH targets. We followed the steps described in Section III-B to generate the BCH targets for dependent classes. When choosing the (n, k) pairs, we took the smallest k for each n in the BCH pair list in [4]; therefore, we experimented with (n, k) = (7, 4), (15, 5), and (31, 6). Note that for an 8-class task, 3 bits are enough to represent all classes; however, as we remove the all-zero and all-one targets, we start k from 4. We used ResNet-18 [42] pre-trained on ImageNet [43] as the FE and the stochastic gradient descent (SGD) optimizer with a learning rate (LR) of 10^−3. If the model did not converge under this LR setting, we changed the LR to 5 × 10^−4. Cross-entropy loss was used as the optimization criterion. All models were trained with a batch size of 32, and training stopped when one of the following conditions was reached: (1) the accuracy on the training data reached 100% for more than 5 epochs, (2) the training loss converged to 0.0 for more than 5 epochs, or (3) the number of training epochs reached 50. For each model trained on the five Adience folds, we evaluated cross-domain performance on UTKFace. The performance is shown in Table 8. Both SORD and BCH targets outperform the one-hot targets under this cross-dataset evaluation. The proposed BCH targets achieve the highest average accuracy, 36.28%, with (n, k) = (15, 5), whereas the accuracy is 24.48% for the one-hot targets and 29.95% for the SORD targets. Clearly, the BCH targets yield better performance under cross-dataset evaluation: further considering the relationship between dependent classes gives the model better adaptability across domains.

B. FACE ANTI-SPOOFING EVALUATION
For cross domain evaluation on face anti-spoofing, we used the CASIA-MFSD [44] and OULU-NPU [45] datasets. We trained the model on one and tested it on the other, and vice-versa. Both have three classes: genuine, printed, and video.
CASIA-MFSD provides videos at three image resolutions: low, normal, and high. There are two types of printed attacks in CASIA-MFSD: normal and with cropped eye regions. The printed paper attacks are warped continuously. There are 50 subjects in total, each of which includes 12 videos (4 attacks at 3 resolutions). Of the three testing protocols, we adopted the overall protocol, which is the evaluation for general cases.
The OULU-NPU dataset contains printed and video attacks. There are 55 subjects in total: 20 for training, 15 for development, and 20 for testing. The dataset specifies four protocols for different aspects of evaluations. In our experiment, we adopted protocol I, designed to evaluate the performance for general cases. We used the most general protocol for both datasets to focus on the cross-domain influence and control other factors.
We used ResNet-18 pre-trained on ImageNet as the FE and the SGD optimizer with an LR of 10^−4. We treated this task as 3-class classification and post-processed the model output into a binary prediction specifying whether the input is genuine or not. The facial image was cropped by the multi-task cascaded convolutional networks (MTCNN) face detector [46], aligned with the eye and mouth-corner facial landmarks, and resized to 224 × 224 as input.
For prediction, we decided the prediction of every video by voting. Because we used the general protocol for both datasets, we assume that the performance is mainly affected by the difference between domains. As shown in Table 9, in both cases the proposed method is superior in terms of both accuracy and half total error rate (HTER). Thus, the BCH targets help increase adaptability under cross-domain attacks for exclusive tasks such as anti-spoofing.
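The per-video voting step can be sketched as a simple majority vote; the genuine-class index and helper names are assumptions for illustration:

```python
from collections import Counter

def video_prediction(frame_preds):
    """Majority vote over per-frame class predictions for one video."""
    return Counter(frame_preds).most_common(1)[0][0]

def is_genuine(frame_preds, genuine_class=0):
    # Collapse the 3-class result (genuine / printed / video attack)
    # into the binary decision used for HTER.
    return video_prediction(frame_preds) == genuine_class
```

For example, a video whose frames are mostly classified as printed attacks is rejected even if a few frames are labeled genuine.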

C. DOMAIN GENERALIZATION EVALUATION
The PACS dataset [38] is designed for the evaluation of domain generalization. It covers the domains of art painting, cartoon, sketch, and photo; each domain includes seven categories: dog, elephant, giraffe, guitar, horse, house, and person. We followed the evaluation protocol in [38], training the model on three domains and testing on the remaining domain over four splits. In this experiment, we used EfficientNet-B0 [47] pre-trained on ImageNet with an image size of 224 × 224 as the FE. We used the SGD optimizer with an LR initialized to 10^−4 and reduced by 90% at epochs 50 and 125. Model training stopped once the training loss converged. The results are shown in Table 10. For all target domains, the proposed BCH target approach improves accuracy starting from n = 7, and the average accuracy is higher for larger values of n.

D. ABLATION STUDY
To further verify the theoretical derivation in Section III-A, CIFAR10 [39] is leveraged for simulation. The following discussion covers three aspects: the adaptation and collateral benefit of the method for independent classification tasks, the preliminary assumption behind the ideal improvement level, and the recommended length for BCH target selection. We used ResNet-18 as the FE. The input image resolution is 32 × 32, and SGD is used as the optimizer with an initial LR of 0.1 that drops to one-tenth of its current value at the quarter and half epochs. The experimental results in Table 12 show that training the FE with BCH targets indeed improves accuracy, even for an independent classification task.
The improvement in average accuracy is marginal. However, examining the per-class precision reveals a different picture, as shown in Table 11. The average precision rises by about 2%, and the minimum precision is lifted by about 8%, which is why the standard deviation of precision among classes is reduced to nearly half of the original. At comparable accuracy, training with BCH targets makes the classification results more precise; that is, misclassification is reduced significantly, which matches expectations.
With this observation, we can also clarify the theoretical derivation in Section III-A1. The empirical results fall short of the theoretical ones because the performance elevation is spread over more than a single metric: the model becomes more accurate, more precise, and more robust at once. On the other hand, the preliminary assumption for the theoretical improvement, namely that the original accuracy is above 50%, affects whether Eq. 1 holds. In other words, when the accuracy is below 50%, the feature centers within a classifier trained on one-hot targets do not share the entire feature space in a balanced manner, so the per-neuron probabilities cannot be treated as equal. Conversely, this also explains why BCH targets lead to more equal performance among classes. For instance, with a model trained on one-hot targets, some feature centers of difficult classes are sacrificed and overlap with others; the average performance may be acceptable, but the accuracy of some classes is lower. BCH targets enforce balanced learning for each target and avoid this problem.
Finally, we give a recommendation for selecting a suitable length of BCH targets based on the analysis of the empirical results. The longer the BCH targets, the more parameters there are to train, which increases the training burden. Hence, the shortest suitable message length is k ≥ ⌈log₂(m + 2)⌉, accounting for the removal of the all-zero and all-one codewords.
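Consistent with removing the all-zero and all-one codewords (so that 2^k − 2 ≥ m must hold), the smallest usable message length can be computed as follows; `smallest_k` is an illustrative helper, not the paper's code:

```python
from math import ceil, log2

def smallest_k(num_classes):
    """Smallest message length k such that the 2**k codewords, minus
    the all-zero and all-one ones, still contain num_classes distinct
    targets: 2**k - 2 >= num_classes."""
    return ceil(log2(num_classes + 2))

# 8 classes: 3 bits would suffice in principle, but removing the
# all-zero and all-one codewords forces k = 4 (cf. Section IV-A).
```

This reproduces the choice in the age-recognition experiments, where an 8-class task starts from k = 4 rather than 3.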

V. CONCLUSION
In this paper, we propose a novel labeling rule, especially for dependent image classification, that takes the relations between classes into account by applying channel encoding to the training targets. Due to its efficiency for short messages, in this work we use BCH encoding. Theoretically, given the same FE, the proposed method guarantees performance improvements whenever the original accuracy is not less than 50%. Moreover, BCH targets ensure that the CNN-based model learns mutual information between classes, which enhances the generalizability of the model. Thus, models trained with BCH targets yield more robust and precise results for samples from different domains.
In practice, for dependent facial image classification tasks evaluated with a cross-dataset protocol, including age recognition and face anti-spoofing, the proposed method yields better results than the baseline. Furthermore, on the domain generalization task, the proposed method solidly outperforms the model trained on one-hot targets. We also applied BCH targets to an independent image classification task (CIFAR10 as an instance); the experimental results show higher and more uniform per-class precision at comparable accuracy.
In this work, we have applied channel encoding to the training targets and changed the architecture of the fully connected layers. One extension of this work in the future would be to integrate this encoding into the convolutional layer.