Credal Transfer Learning With Multi-Estimation for Missing Data

Transfer learning (TL) has grown popular in recent years. It can effectively improve classification accuracy in the target domain by exploiting training knowledge from a related domain (called the source domain). However, the classification of missing data (or incomplete data) is a challenging task for TL because different imputation strategies may have strong impacts on learning models. To address this problem, we propose credal transfer learning (CTL) with multi-estimation for missing data, based on belief function theory, which introduces uncertainty and imprecision into the data imputation procedure. CTL mainly consists of three steps. First, each query pattern is reasonably mapped into multiple versions in the source domain to characterize the uncertainty caused by missing values. Next, the multiple mapped patterns are classified in the source domain to obtain the corresponding outputs with different discounting factors. Finally, the discounted outputs, represented by basic belief assignments (BBAs), are submitted to a new belief-based fusion system to obtain the final classification result for the query pattern. Three comparative experiments illustrate the interest and potential of the CTL method.


I. INTRODUCTION
Traditional machine learning algorithms have achieved great success under the assumption that the training and test sets are drawn from the same feature space and data distribution [1]. In many real-world situations, however, this assumption is not satisfied, which usually makes the performance of traditional classifiers unsatisfactory. Recently, a new paradigm called transfer learning (TL) [1]-[3] has been proposed, which can effectively address this problem and is widely used in many fields, such as indoor WiFi localization [4], text classification [5], and sentiment analysis [6].
According to the availability of training patterns in the target domain, TL methods are categorized into three types: supervised TL [7]-[9], semi-supervised TL [10]-[12] and unsupervised TL [13]-[15]. Supervised TL methods utilize labeled target-domain patterns in addition to the source-domain patterns for training. For example, the quite typical Transfer Adaboost (TrAdaboost) method [7] extends the Adaboost algorithm by adding a weighting mechanism corresponding to the similarity of patterns in the source and target domains. In [8], a heterogeneous feature augmentation (HFA) method is proposed. HFA transforms the patterns of the two domains into a common subspace to augment the mapped patterns. Semi-supervised TL methods further use unlabeled target-domain patterns to help with classification. An extended version of HFA, called semi-supervised heterogeneous feature augmentation (SHFA) [10], addresses heterogeneous situations with sufficient labeled source patterns and limited target patterns. Interestingly, a semi-supervised TL method based on manifold regularization is proposed in [11], which exploits similarity constraints in the target domain to improve performance. Unsupervised TL is an interesting but challenging task, since it is applicable to a target domain without labeled patterns. A representative work is transfer component analysis (TCA) [13], which minimizes the distance between the two domain distributions by mapping the patterns of both domains into a reproducing kernel Hilbert space. In [14], a joint domain adaptation (JDA) strategy is presented to simultaneously adapt the marginal and conditional distribution differences between the labeled source domain and the unlabeled target domain. However, all these TL methods are designed for complete patterns, without considering missing-data situations.
Unfortunately, missing data is a common issue in many real-world data sets. For example, in UCI, one of the standard repositories commonly used for machine learning algorithms, about 45% of the data sets contain missing values [16]. Under such circumstances, these classical TL classification methods are no longer applicable. Therefore, pre-processing the missing data before classification is necessary.
A number of methods [17], [18] have been developed to deal with traditional classification problems with missing data. Generally, these methods assume one of three missing-randomness mechanisms: missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR). The simplest method is to discard incomplete patterns, which is acceptable when the missing values account for only a small proportion (less than 5%) of the whole data set. Imputation, in many situations, is a popular strategy for incomplete pattern classification [19]-[23], [25], [26]. For instance, in mean imputation (MI) [19], the missing values are simply replaced by the mean values of the complete attributes in the same dimension. The commonly used K-nearest neighbor imputation (KNNI) method [20], [21] uses the K-nearest neighbors (KNNs) of a pattern to estimate its missing values. In the fuzzy c-means imputation (FCMI) method [22], [23], the missing values are filled using the cluster centers produced by fuzzy c-means (FCM) [24] and the distances between the pattern and the centers.
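As a concrete illustration of the two simplest strategies mentioned above, the following sketch (not from the paper; missing values are encoded as NaN in a numpy array) implements mean imputation and a basic KNN imputation:

```python
import numpy as np

def mean_impute(X):
    """Mean imputation (MI): replace each NaN with the mean of the
    observed values of the same attribute (column)."""
    X = X.astype(float).copy()
    col_means = np.nanmean(X, axis=0)          # NaN-aware column means
    nan_r, nan_c = np.where(np.isnan(X))
    X[nan_r, nan_c] = col_means[nan_c]
    return X

def knn_impute(X, k=2):
    """KNN imputation (KNNI): fill the missing values of each incomplete
    row with the average of its k nearest complete rows, where distance
    is the Euclidean distance over the jointly observed attributes."""
    X = X.astype(float).copy()
    complete = X[~np.isnan(X).any(axis=1)]     # fully observed rows
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        obs = ~np.isnan(X[i])                  # observed-attribute mask
        d = np.linalg.norm(complete[:, obs] - X[i, obs], axis=1)
        nn = complete[np.argsort(d)[:k]]       # k nearest complete rows
        X[i, ~obs] = nn[:, ~obs].mean(axis=0)
    return X
```

Both functions leave the input array untouched and return an imputed copy.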
In particular, local linear approximation (LLA) [25] uses the KNNs with optimal weights obtained by local linear reconstruction to estimate the missing values. Interestingly, some recent research works have been dedicated to multiple estimation or non-estimation of missing values [27]-[29]. These methods have achieved satisfying results to some extent, but they cannot be directly employed in TL because of the inconsistency of feature spaces and distributions. In addition to missing values, knowledge transfer itself also brings uncertainty into TL, since calculation errors caused by the transfer rules inevitably occur in the transfer process. Thus, reasonably characterizing the transfer and classification uncertainty caused by missing values is a meaningful problem.
In this paper, we propose a credal transfer learning (CTL) method for missing data, which introduces uncertainty and imprecision while imputing missing values, based on belief function theory. Belief function theory [30] has been widely used for modeling and reasoning about uncertain information in pattern classification [31]-[34], pattern clustering [35], [36] and information fusion [37], [38]; such approaches are conventionally called credal methods. In credal classification methods, one pattern may belong to multiple classes (i.e., a particular disjunction of several singleton classes), named meta-classes, with different degrees of belief. Such a representation is able to characterize the imprecision of classification for uncertain patterns. Some methods have been developed for missing data based on belief functions [31], [32]. For instance, a prototype-based credal classification (PCC) method is proposed in [31], where the missing values are estimated separately with the class prototypes obtained from the training patterns, so that the patterns can be classified by traditional classifiers. More recently, a new transfer classification method, named evidence-based heterogeneous transfer classification (EHTC), is proposed in [34] to deal with the uncertainty in the feature-mapping process across heterogeneous domains. Nevertheless, there is no relevant literature on transfer learning for incomplete patterns based on belief function theory.
The CTL method applies belief function theory to represent uncertain information in transfer learning on incomplete data. In CTL, we assume that patterns in the source domain carry ground-truth labels, while the labels of patterns in the target domain are not observed. The feature distributions of the two domains are different, and the attributes of the patterns in the target domain are partially missing. Specifically, for each pattern with missing values in the target domain, CTL first uses the observed attributes to estimate multiple mapping patterns in the source domain based on a KNN technique. Afterwards, a basic classifier that handles complete patterns (such as K-NN [39], EK-NN [40] or NB [41]) is selected; in this step, the labeled patterns in the source domain are used to classify the multiple mapped versions of each query pattern. Finally, different discounting factors for the multiple classification results are obtained depending on the distances between the query patterns and the corresponding KNNs, and the final classification results are derived from these multiple results. Non-conflicting classification results are directly fused by a discounted averaging method, while an adaptive fusion method is designed to aggregate the remaining conflicting results. In this step, patterns that are difficult to classify are automatically submitted to a reasonable meta-class, which effectively reduces the misclassification rate. The classification of the uncertain patterns in a meta-class can eventually be identified (refined) using other (possibly costly) techniques or extra information sources.
The contributions of this work mainly concern three aspects.
1) A multi-estimation strategy in different distribution domains is proposed. In this strategy, the unobserved attributes of incomplete patterns are estimated based on observed ones, with an uncertainty degree reasoned by the belief function theory. The capability of uncertainty reasoning is one of the advantages over traditional imputation methods.
2) A new adaptive global fusion method for decision-making in classification is designed. Thanks to the uncertainty reasoned by belief function theory, CTL is able to make decisions more cautiously by considering the imprecision and uncertainty of the learning results. Such a decision-making method effectively reduces the error rate in practice, which is justified by experiments on real-world data. 3) Evidential theory (belief function theory) is introduced into transfer learning for the first time, with the effectiveness of its classification application justified on incomplete data sets.
This paper is organized as follows. Preliminaries on transfer learning and belief function theory are briefly reviewed in Section II, and the CTL method is introduced in Section III. The performance of CTL is tested and compared with several other methods in Section IV. The conclusion is finally given in Section V.

II. PRELIMINARIES
In this section, a brief introduction to transfer learning (TL) and belief function theory is given, along with the corresponding notations.

A. TRANSFER LEARNING
In TL, the representation of patterns is transferred between different domains. A domain D consists of two components: a feature space X and a marginal probability distribution P(X). Formally, D = {X, P(X)}, with X = {x_1, . . . , x_n} ∈ X, where x_i is the i-th pattern of D. For a given domain D, a task T is composed of two elements: a label space Y and a decision function f(·). Formally, T = {Y, f(·)}, with Y = {y_1, . . . , y_n} ∈ Y, where y_i denotes the corresponding output label.
Given a source domain D_s with a corresponding source task T_s and a target domain D_t with a corresponding target task T_t, where D_s ≠ D_t or T_s ≠ T_t, TL is the process of improving the decision function f_t(·) in the target domain D_t by using relevant knowledge from the source domain D_s and the source task T_s.
The existing TL methods can be divided into three categories: supervised TL [7]-[9], semi-supervised TL [10]-[12] and unsupervised TL [13]-[15]. Supervised TL requires training classifiers with sufficient labeled patterns. Semi-supervised TL uses a few labeled patterns together with a large number of unlabeled patterns to train the classifier. In this paper, we mainly focus on unsupervised TL, where no labeled target patterns are available for training. More detailed introductions and examples of TL are available in [1], [2] and [3].

B. BELIEF FUNCTION THEORY AND CREDAL PARTITION
Belief function theory, also known as Dempster-Shafer theory (DST) or evidence theory [30], was originally proposed by Dempster and later formalized and generalized by Shafer. It is a theoretical framework for reasoning with partial and unreliable information, notably for uncertain problems, and it has been successfully applied in many fields [31]-[38]. In this theory, a finite set of mutually exclusive and exhaustive elements Ω = {ω_1, ω_2, . . . , ω_c} is defined as the frame of discernment of the problem under study, usually a decision problem. The uncertainty is expressed on the power-set of Ω, denoted 2^Ω, whose disjunctive elements convey imprecise information.
A basic belief assignment (BBA) m(·) on the frame of discernment is a function m: 2^Ω → [0, 1] such that m(∅) = 0 and Σ_{A⊆Ω} m(A) = 1. A credal partition [35] is defined as the n-tuple M = (m_1, . . . , m_n), where m_i is the BBA of the pattern x_i ∈ X, i = 1, 2, . . . , n, associated with the different elements of the power-set 2^Ω.
In classification problems, the output of each classifier can be regarded as a piece of evidence on all possible classes represented by a BBA. The DS rule [30] is used in many applications to combine multiple pieces of evidence from different independent sources because of its commutative and associative properties. The DS combination of evidence m_1(·) and m_2(·) from two independent sources over a frame of discernment Ω is defined, for A ≠ ∅, by

m(A) = (1 / (1 − K)) Σ_{B∩C=A} m_1(B) m_2(C), with K = Σ_{B∩C=∅} m_1(B) m_2(C).

In the DS rule, all the conflicting belief mass K = Σ_{B∩C=∅} m_1(B) m_2(C) is proportionally redistributed back to the focal elements through the normalization by 1 − K. However, the DS rule can produce very unreasonable results in high-conflict situations and in some special low-conflict situations. Thus, a number of alternative combination rules have emerged to overcome the limitations of the DS rule, such as the well-known Yager's rule [42], the Dubois-Prade (DP) rule [43], and more recently the more complex Proportional Conflict Redistribution (PCR) rules [44].
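For readers unfamiliar with the DS rule, a minimal sketch is given below (an illustration, not from the paper; it assumes BBAs are stored as dicts mapping frozensets of class labels to masses):

```python
from itertools import product

def dempster_combine(m1, m2):
    """Dempster's (DS) rule of combination for two BBAs over the same
    frame. The conflicting mass K (products falling on the empty set) is
    removed by normalizing the remaining masses by 1 - K."""
    combined = {}
    conflict = 0.0
    for (B, v1), (C, v2) in product(m1.items(), m2.items()):
        A = B & C                       # intersection of focal elements
        if A:
            combined[A] = combined.get(A, 0.0) + v1 * v2
        else:
            conflict += v1 * v2         # mass committed to the empty set
    if conflict >= 1.0:
        raise ValueError("totally conflicting sources")
    return {A: v / (1.0 - conflict) for A, v in combined.items()}
```

For example, combining m1 = {{a}: 0.6, {a, b}: 0.4} with m2 = {{b}: 0.5, {a, b}: 0.5} gives a conflict K = 0.3 and normalized masses {a}: 3/7, {b}: 2/7, {a, b}: 2/7.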

III. CREDAL TRANSFER LEARNING
The CTL method is developed for incomplete-data classification in transfer learning, where data is partially missing (or unobserved). It mainly consists of the following steps.
First, CTL estimates multiple mapping patterns in the source domain for each pattern with missing values in the target domain, assuming that a small number of parallel (one-to-one) pattern pairs link the two domains; the labels of these bridge patterns are unknown. Afterwards, the mapped versions of each (incomplete) pattern are classified by the corresponding trained classifier to obtain multiple classification results, which are submitted with different weights (reliabilities) to a new belief-based fusion system to get the final classification result for each pattern with missing values. In this step, multiple labels may be assigned to one pattern, which is interpreted as imprecision; such patterns are submitted to a corresponding meta-class representing this imprecision. Finally, the classification of the uncertain patterns in a meta-class can eventually be identified using some cautious techniques or extra information sources. Therefore, the CTL method is able to prevent fatal erroneous decisions by cautiously partitioning the classification results when necessary.

A. MULTI-ESTIMATION AND CLASSIFICATION
Consider a data set X̃_s of abundant labeled complete patterns in the source domain and a data set X̃_t of unlabeled, partially observed patterns in the target domain, where the feature distributions of the two domains are completely different but the label spaces are identical, i.e., the patterns in X̃_s and X̃_t share the class partition frame Ω = {ω_1, . . . , ω_c}. Here, we assume that some one-to-one pattern pairs are available to build cross-domain connections, respectively denoted X^t and X^s, with bridge pairs (x^t_k, x^s_k). The labels of these bridge patterns are not available, since there is no prior information in the target domain. For a query (incomplete) pattern x̃_i in the target domain, a K-nearest neighbors (KNNs) strategy is applied to estimate multiple mapping patterns in the source domain.
In the multi-estimation process, the KNNs of the query pattern x̃_i are first searched using its observed attributes. Hence, the calculation of the distance between an incomplete pattern in X̃_t and a complete bridge pattern in X^t is critical. In CTL, the distance between the pattern x̃_i ∈ X̃_t and a complete pattern x^t_k ∈ X^t is given by

||x̃_i, x^t_k|| = sqrt( Σ_{s=1}^{p} (x̃_{is} − x^t_{ks})² ),

where || · || denotes the Euclidean distance computed over the observed attributes, x̃_{is} and x^t_{ks} are the s-th (observed) attributes of the patterns, and p is the number of observed attributes in x̃_i.
Afterwards, the K smallest distances ||x̃_i, x^t_k|| (k = 1, . . . , K) and the corresponding complete patterns x^t_k ∈ X^t (k = 1, . . . , K) are retained. Since x^t_k and x^s_k are one-to-one pattern pairs connecting the target and source domains (called the bridge), K mapping versions in the source domain can be estimated for the pattern x̃_i with missing values according to its KNNs as follows:

x̃^k_i = x^s_k, k = 1, . . . , K,

where x^s_k is the mapping value of x^t_k in the source domain.
A simple example illustrates the process of estimating multiple mapping values in the source domain.
Example 1: Consider a source domain with three attributes and a target domain with four. The 2nd attribute of a pattern x̃_i is unobserved. We assume that three nearest neighbors are found in X^t according to the observed attributes of x̃_i, denoted as follows.
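The multi-estimation step above can be sketched in code as follows (a simplified illustration under the assumption that the bridge sets are given as row-aligned numpy arrays and that missing attributes are encoded as NaN):

```python
import numpy as np

def multi_estimate(x_query, Xt_bridge, Xs_bridge, K=3):
    """Multi-estimation sketch: search the K nearest bridge patterns in
    the target domain using only the observed attributes of the
    incomplete query pattern, and return their one-to-one source-domain
    counterparts as the K mapping versions, with the K distances."""
    obs = ~np.isnan(x_query)                  # observed-attribute mask
    # Euclidean distance restricted to the observed attributes
    d = np.linalg.norm(Xt_bridge[:, obs] - x_query[obs], axis=1)
    idx = np.argsort(d)[:K]                   # indices of the K NNs
    return Xs_bridge[idx], d[idx]
```

Note that the source-domain bridge patterns may have a different number of attributes than the target-domain ones; only the row alignment (the one-to-one pairing) matters.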
Each mapping pattern x̃^k_i in the source domain can be classified by any classifier suited to complete patterns. The K sub-classification results for x̃^k_i are given by

P^k_i = C(x̃^k_i),

where C(·) represents the chosen classifier. P^k_i is a Bayesian BBA if the chosen classifier works under the probability framework (e.g., K-NN [39], NB [41]), or a regular BBA with ignorance if the classifier works under evidence theory (e.g., EK-NN [40]).
In the CTL method, we combine the K classification results to obtain the credal classification of an incomplete pattern. Since the distances between a pattern and its KNNs differ, the results are not equally weighted in the fusion process. Thus, discounting techniques are required for the K classification results. The details are given in the next section.

B. DISCOUNTING CLASSIFICATION RESULTS
For one pattern, the weighting factors of the K classification results correspond to the distances between the pattern x̃_i and its KNNs. In general, a larger distance from the pattern to a neighbor implies a less reliable estimated mapping value, i.e., a larger distance ||x̃_i, x^t_k|| corresponds to a smaller discounting factor γ^k_i. The relative discounting factor γ^k_i is defined by

γ^k_i = w^k_i / w^max_i,

with w^max_i = max{w^1_i, . . . , w^K_i}, where the weight w^k_i decreases with the distance ||x̃_i, x^t_k||. The K classification results are discounted according to the factors γ^k_i. The well-known discounting rule introduced by Shafer in [30] is applied here; more precisely, the discounted masses of belief are obtained as follows:

m^k_i(A) = γ^k_i · P^k_i(A), for all A ⊂ Ω, A ≠ Ω,
m^k_i(Ω) = 1 − γ^k_i + γ^k_i · P^k_i(Ω),

where m^k_i(·) denotes the BBA over the different classes (focal elements) after discounting the classification result of the mapping pattern x̃^k_i in the source domain by the factor γ^k_i. In this way, one obtains K mass functions m^k_i(·) for the pattern x̃_i, and they will be fused by the global fusion method we design to obtain the final class information of x̃_i.
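The discounting step can be sketched as follows. Shafer's discounting rule is standard; the concrete weight form w_k = exp(−d_k) is an assumption for illustration only, since the paper fixes only the normalization by the maximum weight:

```python
import math

def discount_factors(distances):
    """Relative discounting factors gamma_k = w_k / w_max. The weight
    form w_k = exp(-d_k) is an assumed instantiation: the nearest
    neighbor gets factor 1, farther neighbors smaller factors."""
    w = [math.exp(-d) for d in distances]
    w_max = max(w)
    return [wk / w_max for wk in w]

def shafer_discount(bba, gamma, frame):
    """Shafer's discounting rule: scale the mass of every focal element
    other than the whole frame by gamma, and transfer the remaining
    1 - gamma to total ignorance (the whole frame)."""
    frame = frozenset(frame)
    out = {A: gamma * v for A, v in bba.items() if A != frame}
    out[frame] = 1.0 - gamma + gamma * bba.get(frame, 0.0)
    return out
```

The discounted BBA still sums to 1; the belief removed from the specific classes is simply pushed to the ignorance element Ω.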

C. GLOBAL FUSION OF DISCOUNTED CLASSIFICATION RESULTS
After the estimation and classification steps, the most supported class of a pattern x̃_i in the k-th result is defined by

ω^k_{ij} = arg max_{ω_j ∈ Ω} m^k_i(ω_j),

where ω^k_{ij} denotes the most supported class ω_j of the pattern x̃_i according to m^k_i(ω_j). In the fusion process, we propose an adaptive fusion strategy that distinguishes between the two following situations for the K discounted classification results: for one pattern, the results for different k = 1, . . . , K may be either identical or different.
To introduce this adaptive fusion method, we assume that the K discounted classification results most strongly support ρ (1 ≤ ρ ≤ min(K, c)) classes, and that ν (1 ≤ ν ≤ ρ) of these ρ classes are each supported by ϕ (2 ≤ ϕ ≤ K) discounted classification results; that is, the remaining ρ − ν classes are each supported by only one discounted classification result.
The adaptive fusion strategy consists of two steps: fusion of non-conflict discounted results and global fusion of conflict discounted results.

Step 1: Fusion of non-conflict discounted results
Consider that ς (2 ≤ ς ≤ ϕ) classification results of the pattern x̃_i strongly support the same class ω^k_{ij} (j = 1, . . . , ρ, k = 1, . . . , ς), indicating that these ς classification results are not conflicting. These results are therefore directly fused with a simple averaging rule: the fused mass assigned to a focal element A is

m^{1,...,ς}_i(A) = (1/ς) Σ_{k=1}^{ς} m^k_i(A). (10)

The fusion results obtained from Eq. (10) need to be normalized for the convenience of credal classification. For computational convenience, we use the classical normalization

m̄^{1,...,ς}_i(A) = m^{1,...,ς}_i(A) / Σ_{B⊆Ω, B≠∅} m^{1,...,ς}_i(B). (11)

The final fusion result can be obtained directly with this rule if the K discounted classification results of a pattern x̃_i all strongly support one specific class (i.e., ς = K), which indicates that the K discounted classification results are consistent and non-conflicting. However, if ς ≠ K (i.e., 2 ≤ ς < K), this rule cannot properly deal with the conflicting information, and a more cautious method is essential to model the conflicts between the different discounted classification results; it is introduced in the next step.
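Step 1 can be sketched as follows (an illustration assuming the discounted BBAs are dicts over frozensets: element-wise averaging followed by the classical normalization):

```python
def average_fuse(bbas):
    """Discounted averaging of non-conflicting BBAs: average the masses
    element-wise over all the input BBAs, then normalize so that the
    total mass of the fused result is 1."""
    fused = {}
    for m in bbas:
        for A, v in m.items():
            fused[A] = fused.get(A, 0.0) + v / len(bbas)
    total = sum(fused.values())               # may be < 1 after discounting
    return {A: v / total for A, v in fused.items()}
```

Because the input BBAs were discounted, their masses may not sum to one individually; the final normalization restores a unit total mass.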

Step 2: Global fusion of conflict (discounted) results
After the non-conflict fusion, ρ mutually conflicting (discounted) fused results are obtained for a pattern x̃_i, of which ν results come from Step 1 and ρ − ν results are single discounted classification results. It should be noted that although the classes supported by the ρ results are different, their supports for the most likely classes may be close, and it is in general difficult to accurately assign such a pattern to a small number of classes (e.g., 2 or 3 classes). Therefore, we first determine the most likely meta-class, composed of the highly supported and hard-to-separate singleton classes the pattern may belong to, and generate for the pattern a new frame consisting of this meta-class and the singleton classes. The most likely meta-class for a pattern is obtained by means of a threshold parameter ε, as defined in Eq. (14), where ω^k_{ij} (1 ≤ k ≤ ρ) denotes one of the most supported classes ω_j for the pattern x̃_i among the ρ results, and ψ_{x̃_i} is the most likely meta-class of the pattern x̃_i, composed of the highly supported and hard-to-separate singleton classes (such as ω_j, ω_t, etc.). For a specific pattern x̃_i, the new global fusion rule is then defined by Eq. (15), where K is the normalization factor and |A| is the number of singleton elements included in A. It is not difficult to see from Eq. (14) that the precision of the classification of one pattern (i.e., whether the pattern is classified into a meta-class or not) depends mainly on the parameter ε. The parameter ε is a conflict measure factor, which essentially characterizes the degree of conflict between the different pieces of evidence (classification results), and it affects the number of meta-classes in the fusion process. In order to reduce the risk of misclassification, all subsets of the set ψ_{x̃_i} are retained and the corresponding conflicting information is assigned to the meta-classes. A guideline for adjusting the parameter ε is given below.
1) Guideline for Choosing the Parameter ε: In practice, the threshold ε is used for meta-class selection in classification. A bigger ε corresponds to fewer misclassified patterns but a more ambiguous classification result, i.e., more patterns belong to meta-classes. A smaller ε results in fewer patterns in the meta-classes, but may cause more misclassifications of imprecise patterns. Therefore, ε should be tuned according to the acceptable degree of imprecision. In this paper, the interval ε ∈ [0, 0.3] is recommended, and ε = 0.1 is regarded as a suitable default value in most situations.
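One plausible instantiation of the threshold-based meta-class selection is sketched below. The criterion used here, that a singleton class joins the meta-class when its support is within ε of the best supported class, is an assumption for illustration, since the exact form of Eq. (14) is not reproduced here:

```python
def select_meta_class(supports, eps=0.1):
    """Return the set of hard-to-separate singleton classes: all classes
    whose support differs from the maximum support by at most eps.
    `supports` maps each candidate class to its (discounted) belief mass."""
    best = max(supports.values())
    return {c for c, s in supports.items() if best - s <= eps}
```

With ε = 0.1 and supports {ω1: 0.45, ω2: 0.40, ω3: 0.15}, the pattern would be committed to the meta-class ω1 ∪ ω2, while ω3 is excluded.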
For the convenience of implementation, the CTL method is outlined in Algorithm 1.

IV. EXPERIMENT APPLICATIONS
In this section, we test and evaluate the CTL method through extensive experiments on twelve real data sets from the UCI repository [16] and five public high-dimensional data sets (i.e., MNIST + USPS, COIL20 and Office + Caltech). To fully assess the CTL method, we consider different ways of combining classical missing-value estimation methods with traditional transfer learning methods. The imputation methods and transfer learning methods used in the comparison are listed as follows.
• Imputation Methods: 1) Mean Imputation (MI) [19]: In MI, the missing values are replaced by the mean value of the same attribute over the data set in the target domain. 2) K-Nearest Neighbor Imputation (KNNI) [20], [21]: In KNNI, the missing values are estimated using the K-nearest neighbors of the incomplete pattern.
3) Locally Linear Approximation (LLA) [25]: In LLA, the missing values are estimated using the KNNs with optimal weights obtained by locally linear reconstruction.
• Transfer Learning Methods: 1) Single-Value Mapping-Based Transfer Learning (SVMTL) [34]: In SVMTL, only one mapping value is found for each incomplete pattern after estimation in the target domain; that is, once the nearest neighbor of a pattern is found, its corresponding pattern in the source domain is directly taken as the mapping value. 2) Weighted Mapping-Based Transfer Learning (WMTL) [34]: In WMTL, KNNs are found for each incomplete pattern after estimation in the target domain, and the source-domain patterns corresponding to these KNNs are then weighted, according to the distances between the pattern and its KNNs, to synthesize a new pattern as the mapping value.
In our simulations, a misclassification is declared (counted) for a pattern truly originating from ω_i if it is classified into A with ω_i ∩ A = ∅. If ω_i ∩ A ≠ ∅ and A ≠ ω_i, it is considered an imprecise classification. The error rate, denoted R_e, is calculated by R_e = U_e/N, where U_e is the number of misclassification errors and N is the number of patterns in the target domain. The imprecision rate, denoted R_i, is calculated by R_i = U_i/N, where U_i is the number of patterns committed to meta-classes. The experiments are conducted in Matlab.
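These two evaluation indexes can be computed as follows (a sketch; each prediction is modeled as a set of class labels, a singleton for a precise decision or a larger set for a meta-class):

```python
def evaluate(predictions, truths):
    """Error rate Re and imprecision rate Ri: a prediction A for a
    pattern of true class w is an error if w is not in A, and an
    imprecise (but not erroneous) classification if w is in A and A is
    not the singleton {w}."""
    N = len(truths)
    errors = sum(1 for A, w in zip(predictions, truths) if w not in A)
    imprecise = sum(1 for A, w in zip(predictions, truths)
                    if w in A and len(A) > 1)
    return errors / N, imprecise / N
```

For instance, with predictions [{a}, {a, b}, {b}] for three patterns all of true class a, the first is correct, the second is imprecise, and the third is an error, so R_e = R_i = 1/3.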

A. EXPERIMENT 1
Twelve well-known UCI data sets are used to test the performance of CTL, and the basic information of these data sets, including the number of classes (#Class.), attributes (#Attr.) and instances (#Inst.), is shown in Table 2. We divide the attributes of each data set into two parts, corresponding to the source domain and the target domain, to fit our transfer learning scenario. For instance, if a data set has 15 attributes, we take 7 attributes as source-domain attributes, while the remaining 8 attributes are regarded as the target domain. We segment each data set into three partitions by the following steps.
1) In the two domains, 5% of the patterns are selected as one-to-one corresponding pattern pairs serving as the bridge.
2) The remaining patterns in the source domain are used as labeled training patterns.
3) The remaining patterns in the target domain are the test patterns with missing values, each test pattern randomly losing n attributes. Table 2 shows the numbers of attributes of the source domain and target domain, denoted N_s and N_t respectively. Our CTL method and the other comparison methods are used to classify the test patterns with missing values in the target domain. For each data set, ten source/target-domain splits are randomly generated, and the average values of the evaluation indexes are reported.
In this experiment, K-NN is selected as the basic classifier. The average error rate R_e and imprecision rate R_i (for CTL) of the different methods with different meta-class thresholds ε (i.e., ε = 0.05, 0.1, 0.15, 0.2) are given in Table 3. (The number of dimensions of each data set in the target domain is shown in Table 2, and the n value in Table 3 is the number of missing dimensions in the target domain.) One can see from Table 3 that the CTL method generally yields a lower error rate than the other methods, but meanwhile some imprecision appears in the classification results due to the introduction of meta-classes, which indicates that some incomplete patterns are very difficult to classify for lack of attribute information. It is worth noting that an increase in the number of missing values n in the target domain generally results in an increase of the error rate and of the imprecision in CTL, since more missing values cause greater uncertainty in the classification problem. The credal classification fusion method, which includes meta-classes in CTL, is thus very effective in characterizing the degree of imprecision and helps reduce the misclassification rate. In CTL, as ε increases from 0.05 to 0.2, the error rate decreases but the imprecision rate increases. These results show that the available pattern attributes are not sufficient to accurately classify the patterns in meta-classes; to obtain more precise classification results, other information sources with complementary characteristics would be needed. Figure 1 shows the effect of different ε on the CTL classification results, where the x-axis denotes the meta-class threshold ε, ranging from 0.05 to 0.25, and the y-axis represents the average of the classification results on a [0, 1] scale.
One can intuitively see that as the meta-class threshold ε increases, the error rate of the CTL method tends to decrease, whereas the imprecision rate tends to increase, which is consistent with the trend in Table 3. Figure 2 shows the average classification error rate of CTL compared with the other methods when the meta-class threshold ε = 0, where the x-axis denotes the number of missing attributes and the y-axis denotes the error rate. In this situation, the CTL method produces only specific (precise) results. One can see that the CTL method still performs well on these data sets. In real applications, the parameter ε should be tuned according to the imprecision rate one can accept in the classification. The CTL method allows the patterns that are truly difficult to classify correctly to be assigned to a proper meta-class, and such patterns should be treated cautiously in applications.
B. EXPERIMENT 2
In Tables 4-6, we can see that the error rates of the CTL method with the EK-NN, Adaboost and NB classifiers are smaller than those of the other methods in most situations. In parallel, some incomplete patterns that are very difficult to classify into a specific class have been submitted to meta-classes. As the number of missing values n increases, the error rates of the classifiers may increase, and the imprecision rate of CTL generally becomes higher, which is reasonable. In the credal transfer learning process, meta-classes are introduced to reasonably characterize the imprecision caused by missing values, so the proposed method is able to effectively reduce the classification error. The average error rate and imprecision rate (denoted Ave) of the different methods over the different data sets with the same classifier are given in the last row of Tables 4-6 to express the general performance of the corresponding method. It can be seen that the CTL method adapts well to the three basic classifiers EK-NN, Adaboost and NB; in other words, CTL is robust and can be used with various basic classifiers. However, with large amounts of data, we find that the NB classifier takes less time than EK-NN and Adaboost, because the EK-NN and Adaboost classifiers bring a heavy computational burden.
C. EXPERIMENT 3
In this experiment, we adopt five public high-dimensional data sets: MNIST + USPS, COIL20 and Office + Caltech. These data sets have been widely used in most existing TL works. Table 7 shows the details of the data sets. MNIST (M) and USPS (U) are two handwritten-digit recognition data sets that follow very different distributions: MNIST includes 60000 training images and 10000 test images, while USPS contains 7291 training images and 2007 test images. The Office-Caltech data set contains 10 classes of images from four domains, of which we select Amazon (A) and Caltech (C) for testing. COIL20 (CO) contains 20 classes and 1440 images, with 72 images per class. Detailed descriptions of these data sets can be found in [14]. Here, we use A → B to denote knowledge transfer from source domain A to target domain B.
In this experiment, we randomly select 10 patterns per class from both the source domain and the target domain as one-to-one corresponding pattern pairs. The remaining patterns in the source domain are then used as the labeled training set, and the patterns in the target domain as the test set, in which each pattern randomly loses n attributes. EK-NN is selected as the base classifier, and its parameter is set to 0.1. To fully demonstrate the validity of the CTL method for high-dimensional data, we also apply two other classic TL methods (i.e., TCA [13] and JDA [14]) to the complete patterns. In TCA and JDA, we uniformly choose the RBF kernel, and the implementation details of TCA and JDA follow [14]. The average error rate R_e and imprecision rate R_i (for CTL) of the different methods are given in Table 8.
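The incomplete-pattern setup described above can be reproduced with a short data-preparation sketch. This is a minimal illustration of the masking step only, assuming NumPy arrays and NaN as the missing-value marker; the function name and interface are ours, not the paper's.

```python
import numpy as np

def drop_n_attributes(X, n, seed=None):
    """Randomly remove n attributes (set to NaN) from each test
    pattern, mimicking the incomplete-pattern protocol of the
    experiments. Each row loses exactly n distinct attributes."""
    rng = np.random.default_rng(seed)
    X = X.astype(float).copy()
    for row in X:
        idx = rng.choice(X.shape[1], size=n, replace=False)
        row[idx] = np.nan
    return X

X_test = np.arange(12.0).reshape(4, 3)
X_miss = drop_n_attributes(X_test, n=1, seed=0)
print(np.isnan(X_miss).sum(axis=1))  # exactly one NaN per row
```

Drawing the n masked indices without replacement per row guarantees the number of missing values is the same for every test pattern, which keeps the error/imprecision curves comparable across n.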
From Table 8, we observe that CTL performs better than the other methods in high-dimensional applications. The average classification error rate of CTL on these data sets is 34.82%, which is 5.26% lower than that of the best baseline method, KSTL. This shows that CTL can construct a more effective representation for the task of cross-domain incomplete pattern classification. Figure 3 shows more clearly and intuitively the influence of different K values of the EK-NN classifier on the classification results. The x-axis corresponds to the K value, ranging from 3 to 13, and the y-axis shows the average error rate of the different methods, within the interval [0, 1]. SVMTL, WMTL, TCA and JDA respectively denote the average results under the three imputation methods. We can observe that the error rate of CTL is always lower than that of the other methods, and that different K values have little impact on the classification results of CTL. This shows that the CTL method is robust to the selection of K, which is a desirable property in practical classification applications.

V. CONCLUSION
In this paper, a new credal transfer learning (CTL) method based on belief function theory is proposed to classify missing data. The CTL method is able to effectively address the classification of patterns with missing values when the training and test sets come from different distribution domains. CTL uses the observed attributes to search for the K-nearest neighbors (KNNs) and, according to some given one-to-one pattern pairs, estimates K versions of mapping patterns in the source domain for each incomplete pattern in the target domain, which effectively represents the estimation uncertainty caused by missing values. The K classification results are then discounted by reliability factors depending on the distance between the corresponding KNNs and the (incomplete) pattern, and they are adaptively fused by an originally proposed method under the framework of belief function theory. Patterns that cannot be precisely classified are reasonably assigned to the relevant meta-class, regarded as the union of some specific classes, representing classification with imprecision. This reasoning with imprecision is able to reduce the risk of error and to characterize the uncertainty due to the lack of attribute information. Further (possibly costly) techniques or extra informative sources can be used if more precise results are required. Finally, the effectiveness of CTL is justified by three experiments, in which it is compared with other methods on real data sets. The results show that CTL is able to reduce misclassification rates, and captures and represents well the imprecision of classification caused by missing values.
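The distance-based discounting step summarized above corresponds to the classical Shafer discounting operation of belief function theory. The sketch below assumes BBAs are represented as dictionaries mapping frozensets of labels to masses (an illustrative representation, not the paper's implementation): each mass is scaled by the reliability factor alpha, and the remaining mass 1 - alpha is transferred to the whole frame of discernment (total ignorance).

```python
def discount_bba(bba, alpha, frame):
    """Shafer discounting of a basic belief assignment (BBA).

    Each focal element's mass is multiplied by the reliability
    factor alpha in [0, 1]; the remaining mass 1 - alpha is added
    to the full frame of discernment (total ignorance), so the
    discounted masses still sum to 1.
    """
    out = {A: alpha * m for A, m in bba.items()}
    omega = frozenset(frame)
    out[omega] = out.get(omega, 0.0) + (1.0 - alpha)
    return out

bba = {frozenset({'a'}): 0.7, frozenset({'b'}): 0.3}
print(discount_bba(bba, 0.8, {'a', 'b'}))
# masses: {'a'}: 0.56, {'b'}: 0.24, {'a','b'}: 0.20
```

With alpha = 1 the source is fully reliable and the BBA is unchanged; with alpha = 0 the output is vacuous (all mass on the frame), which is how a distant, unreliable mapping pattern is prevented from influencing the fusion.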
In this work, we assume that some one-to-one pattern pairs are given. In some situations, however, the pattern pairs may not be available. Therefore, in the future, we will consider a more general TL method instead of using pattern pairs. In parallel, we will further study the problem of cross domain incomplete pattern classification from the perspective of deep learning and data-driven methods [46], [47].