Multi-Source Transfer Learning via Ensemble Approach for Initial Diagnosis of Alzheimer’s Disease

Alzheimer’s disease (AD) is one of the most common progressive neurodegenerative diseases, and the number of AD patients has increased year after year with the global aging trend. The onset of AD has a long preclinical stage. If doctors can make an initial diagnosis in the mild cognitive impairment (MCI) stage, it is possible to identify and screen those at high risk of developing full-blown AD, and thus the number of new AD patients can be reduced. However, medical datasets, including AD data, suffer from problems such as an insufficient number of samples and differing data distributions. Transfer learning, which can mitigate both the distribution discrepancy between training and test data and the shortage of target samples, has attracted increasing attention in recent years. In this paper, we propose a multi-source ensemble transfer learning (METL) approach that combines ensemble learning with our tri-transfer model, which is based on Tri-Training: the tri-transfer model ensures the transferability of the source data, and ensemble learning delivers high performance. The experimental results on the benchmark and AD datasets demonstrate that our proposed approach has effective transferability, robustness, and feasibility, and is superior to existing algorithms. Based on METL, we propose an auxiliary diagnosis system for the initial diagnosis of AD, which helps doctors identify patients in the MCI stage as quickly as possible and with high accuracy so that measures can be taken to prevent or delay the occurrence of AD.


I. INTRODUCTION
Machine learning has shown great success in a variety of application fields, including computer vision, object recognition, and natural language processing [1], [2]. Some scholars have applied machine learning in the medical field, which has led to the emergence of machine learning-driven intelligent auxiliary diagnostic systems [3], [4].
Alzheimer's disease (AD) is one of the most common progressive neurodegenerative diseases, and with the global aging trend, the number of patients with AD has increased year after year. It is estimated that by 2050, the number of AD patients will triple [5]. Medical research shows that in the early stage of AD, patients present with mild cognitive impairment (MCI) [6], which lies between the normal state and the diseased state and has begun to appear in younger patients. Many studies are based on the hope that potential AD patients can be detected during the MCI stage, so that effective measures can be taken to prevent the disease from worsening. If early prevention and treatment are available, the number of new patients will be reduced. If the MCI stage can be studied in depth, it is hoped that the population at high risk of AD can be discovered and screened, thus providing an optimal treatment time window for preventing or delaying the occurrence of AD. The Alzheimer's Disease Neuroimaging Initiative (ADNI) [7] provides research data to researchers committed to determining the progression of AD. ADNI research resources and data include MRI images, PET images, genetic data, and clinical data from the North American ADNI Study; the collected samples include patients with Alzheimer's disease, subjects with mild cognitive impairment, and elderly controls.
Traditional machine learning still suffers from two defects: 1) labeling data is labor-intensive, which is especially problematic given the scarcity of AD samples, and 2) data distributions differ across regions and age groups and across multi-source medical datasets such as MRI images, PET images, genetic data, and clinical data. Due to these problems, it is difficult to obtain accurate classifiers directly by using traditional machine learning. Transfer learning was proposed to address these issues by imitating the learning of human beings. The core idea of transfer learning is to transfer knowledge from a well-trained source domain to a target domain where training data is insufficient. Due to its advantages, transfer learning has been widely used in various cross-domain fields and has been attracting increasing attention in recent years [8], [9].
The key issue in transfer learning is inappropriate domain adaptation that results from different data distributions across domains [10]. Moreover, some transfers do not improve the performance of the target classifier and may even reduce it; this is called negative transfer [11]. Many approaches aim to address this issue, and a comprehensive review of transfer learning is given in [10]. However, most existing approaches, such as TrAdaBoost [12] and Co-Clustering [12], only utilize a single source domain. In actual medical problems, the target domain often involves knowledge from multiple source domains. Therefore, transfer learning involving multiple source domains, referred to as multi-source transfer learning (MSTL), has been proposed to effectively utilize the knowledge from different domains [13], [14].
Multiple source domains bring not only benefits but also a new challenge, i.e., how to identify and select the useful knowledge from multiple source domains. The knowledge from multiple source domains usually has different distributions, and thus not all of it can be reused to improve performance. Consequently, inappropriate selection and deployment of source domains will exacerbate negative transfer [13]. Several approaches have been proposed to address this issue [13], [14]. Although these approaches were developed to alleviate negative transfer, they can still reduce performance because most of them are unable to explore the distribution similarity between the source and target domains or to handle imbalanced data.
In this paper, we propose a multi-source ensemble transfer learning (METL) approach. METL consists of two phases: (1) single-source tri-transfer learning, which improves the transferability of the classifier trained on a single source domain, and (2) mutual information-based (MI-based) multi-source ensemble learning, which ensembles multiple classifiers into a robust final classifier. To validate METL, we conduct four sets of experiments on a variety of multi-source transfer tasks. The experimental results show that METL outperforms existing algorithms in medical fields and has practical capability in the initial diagnosis of AD. To help prevent or delay the occurrence of AD, we propose an METL-based auxiliary diagnosis system, which helps doctors identify patients in the MCI stage as quickly and accurately as possible.
The rest of this paper is organized as follows. Section II discusses the related work. Section III presents the details of our approach. Section IV reports on and analyzes the experimental results on benchmark and AD datasets. Section V discusses the results, the limitations of our approach, and future work. Finally, Section VI concludes this paper.

II. RELATED WORK
In recent years, many improved transfer learning algorithms have been proposed by combining with other methods. In this part, we discuss algorithms related to our work.

A. TRANSFER LEARNING BASED ON ENSEMBLE LEARNING
Ensemble learning occurs when tasks are learned by combining the strengths of a collection of simpler base models [15]-[17]. In general, ensembled learners outperform a single algorithm in three aspects: (1) Accuracy: An ensembled solution has better average performance. (2) Novelty: An ensembled solution is unattainable by any single algorithm. (3) Robustness: An ensembled solution has lower sensitivity to noise, outliers, or sampling variations. Dai et al. [12] proposed a classic correlation-based TrAdaBoost algorithm, which reasonably adjusted the weights of examples. Liu [18] presented a transfer learning algorithm that dynamically reassembled the main training dataset and quickly eliminated redundant data. Xiao et al. [19] proposed a dynamic transfer ensemble model based on clustering and selection. Meanwhile, Mei [20] proposed a transfer learning framework for large-scale membrane protein identification based on the SVM ensemble.

B. MULTI-SOURCE TRANSFER LEARNING
Yao and Doretto [13] proposed Multi-Source-TrAdaBoost (MTrA), which extends TrAdaBoost to utilize multiple sources. However, MTrA selects only the one source domain that is most closely related to the target domain at each iteration. Qian et al. [14] proposed an algorithm based on multi-source dynamic TrAdaBoost (MSDTrA), which ensembles all knowledge but does not consider unbalanced classes. Ge et al. [21] proposed the Supervised Local Weight (SLW) method, which effectively transfers knowledge even if there are unrelated source domains and unbalanced classes; however, it is not applicable to the classification of high-dimensional data. Eaton and Desjardins [22] presented a novel set-based boosting technique that boosts each source task and assigns higher weights to source tasks with positive transferability.

VOLUME 8, 2020

C. TRANSFER LEARNING FOR AD AUXILIARY DIAGNOSIS
Cheng et al. [23] presented a novel domain transfer learning approach for MCI conversion prediction, which contains three transfer components and uses data from both the target domain (i.e., MCI) and source domains (i.e., AD and normal control). Because 2D convolutional neural networks (CNNs) make decisions on each 2D image slice independently and therefore cannot consider the relationship between the slices in an MRI volume, Ebrahimi-Ghahnavieh et al. [24] proposed applying a recurrent neural network after the CNN, together with transfer learning, to capture this relationship. Li et al. [25] presented a knowledge transfer method that reduces the differences between datasets and improves the classification accuracy on datasets with insufficient training samples; it was tested on a small dataset from a local hospital and a large shared dataset.

III. MULTI-SOURCE ENSEMBLE TRANSFER LEARNING
In this section, we describe the details of METL. The framework of METL is shown in Fig. 1. METL consists of two phases: single-source tri-transfer learning and mutual information-based (MI-based) multi-source ensemble learning. Single-source tri-transfer learning improves the transferability of the classifier trained by a single source domain, while MI-based multi-source ensemble learning combines multiple classifiers into a final robust classifier.
According to the definition of transfer learning, data in the source domain $D_S$ has the same feature space $X$ as data in the target domain $D_T$ but has a different data distribution. $D_S = \{(x_1^S, y_1^S), \cdots, (x_m^S, y_m^S)\}$, where $x_i^S \in X_S$ is an instance, and $y_i^S \in Y_S$ is the corresponding label. Analogously, for the target domain $D_T$, $x_i^T \in X_T$ is an instance, and $y_i^T \in Y_T$ is the corresponding class label. In our approach, substantial labeled examples are available in the source domains, and only a few labeled examples are available in the target domain.
Phase 1 (single-source tri-transfer learning): In this phase, each source domain (i.e., one of $D_{S_1}, D_{S_2}, \cdots, D_{S_i}, \cdots, D_{S_m}$) is first combined with the target domain $D_T$ to generate a new training dataset (correspondingly, one of $D_1, D_2, \cdots, D_i, \cdots, D_m$). Then, three heterogeneous classifiers are iteratively trained on the new training dataset until a stopping criterion is satisfied. Here, we propose a novel source data sampling method to effectively sample high-confidence data from the source domains. As soon as the iterations stop, the three classifiers are ensembled to generate a robust classifier for that source domain, e.g., $f_i(x)$ for $D_{S_i}$. The main objective of this phase is to enhance the transferability from one source domain to the target domain. (Details are in Section III.A)

Phase 2 (MI-based multi-source ensemble learning): After phase 1, many classifiers are obtained, each corresponding to one source domain. We propose a novel approach to weight these classifiers based on the correlation between each source domain and the target domain. By means of our proposed weight assignment, each source classifier is given an optimal weight. Finally, all classifiers are ensembled to generate the final classifier $f^*(x)$ for the target domain. (Details are in Section III.B)

A. SINGLE-SOURCE TRI-TRANSFER LEARNING
Tri-Training [26] is a semi-supervised learning algorithm that uses three different classifiers to exploit unlabeled data for enhancing learning performance. Inspired by Tri-Training, we derive three heterogeneous classifiers $f_{1i}$, $f_{2i}$, $f_{3i}$ from different "views," i.e., using different features. In phase 1, the core concept is to check the consistency between these classifiers. We assume that if they make the same prediction for an instance $x_j$, the transferability of $x_j$ is high, and $x_j$ should be included to improve the prediction performance for the target domain. Tri-Training relies on bootstrap sampling, but it is meaningless here to divide one source domain into multiple source domains with the same data distribution; single-source tri-transfer learning therefore employs a new source data sampling method for the multi-view ensemble. In this way, we improve the transferability of a single source domain to the target domain, helping to avoid negative transfer.
Softmax [27], Support Vector Machine (SVM) [28], and Deep Neural Network (DNN) [29] are chosen as our three heterogeneous base classifiers. The Softmax classifier is a linear classifier whose input is an example's feature vector and whose output is the probability that the example belongs to each category; it is flexible, efficient, and fast to train. SVM uses a nonlinear mapping to transform low-dimensional training data into a higher-dimensional feature space, in which it builds an optimal hyperplane based on structural risk minimization theory. Therefore, it is robust, accurate, and less prone to overfitting. Finally, DNN mimics the learning mechanism of the brain, automatically combining simple features into more complex features and using these combined features to solve problems. Thus, the DNN has strong generalization ability.
Therefore, we can identify useful source samples with high confidence by checking the predictive consistency of the three heterogeneous classifiers. However, checking the consistency between the three classifiers only once may not yield a good sample of source data for transfer learning. We therefore use an iterative approach to refine the data sampled from the source domain.
The pseudo-code of phase 1 is given in Algorithm 1. As shown in Algorithm 1, we initially combine the target training dataset $D_T$ with the data in the $i$-th source domain $D_{S_i}$ to form a new training dataset $D_i^1$. Three classifiers are trained on $D_i^1$ from different views. We sample all examples on which the three classifiers give consistent results into $D_{S_i}^{n+1}$. Then, $D_{S_i}^{n+1}$ and $D_T$ form a new training set. We update the three classifiers and repeat the above steps. The algorithm terminates when the training dataset no longer changes and outputs the latest classifiers.
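The iterative sampling described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Stump` class is a hypothetical single-feature stand-in for the Softmax, SVM, and DNN base learners; "consistent" is read as agreement among the three classifiers; and each round resamples only from the previously kept source samples, which guarantees that the loop stops.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, binary label)


class Stump:
    """Hypothetical stand-in base learner that looks at one feature (one 'view')."""

    def __init__(self, feature: int) -> None:
        self.feature = feature
        self.threshold = 0.0
        self.sign = 1.0

    def fit(self, data: List[Sample]) -> "Stump":
        # Assumes both classes are present in `data`.
        pos = [x[self.feature] for x, y in data if y == 1]
        neg = [x[self.feature] for x, y in data if y == 0]
        mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
        self.threshold = (mean_pos + mean_neg) / 2
        self.sign = 1.0 if mean_pos >= mean_neg else -1.0
        return self

    def predict(self, x: Sequence[float]) -> int:
        return int(self.sign * (x[self.feature] - self.threshold) > 0)


def tri_transfer(
    source: List[Sample],
    target: List[Sample],
    make_classifiers: Callable[[], List[Stump]],
    max_iter: int = 20,
) -> Tuple[List[Stump], List[Sample]]:
    """Iteratively keep the source samples on which all classifiers agree."""
    selected = list(source)  # first round: the whole source domain
    classifiers: List[Stump] = []
    for _ in range(max_iter):
        train = target + selected  # new training set: D_T plus kept source data
        classifiers = [clf.fit(train) for clf in make_classifiers()]
        # Assumption: resample from the previously kept samples only.
        kept = [
            (x, y) for x, y in selected
            if len({clf.predict(x) for clf in classifiers}) == 1
        ]
        if kept == selected:  # training set no longer changes
            break
        selected = kept
    return classifiers, selected
```

Calling `tri_transfer(source, target, lambda: [Stump(0), Stump(1), Stump(2)])` returns the three final classifiers together with the source samples that survived the consistency check. Passing a factory rather than fitted models keeps each round's classifiers freshly trained on the current training set.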
Once the final classifiers are derived, we use a multi-view ensemble method to train a more robust classifier for one source domain. The strong classifier of the $i$-th source domain is denoted as $f_i(x)$ and can be calculated as follows:

B. MI-BASED MULTI-SOURCE ENSEMBLE LEARNING
After the first phase, we have obtained one classifier for each source domain. Any single one of these classifiers is unlikely to be robust on the target domain, but ensemble learning can improve robustness by combining several classifiers. In ensemble learning, the member classifiers need to be weighted according to their correlation so that the final classifier achieves the best performance. We therefore utilize ensemble learning to combine all classifiers from the source domains to produce a more robust and predictive classifier for the target domain. Inspired by the distribution weighted combination rule [30], the ideal target classifier can be treated as a mixture of multiple source classifiers weighted by normalized source distributions. In other words, the multi-source transfer learning problem is viewed as finding the "mean" predicted labels of all possible predicted labels that are generated by the corresponding source classifiers.
The pseudo-code of phase 2 is given in Algorithm 1. In phase 2, we use mutual information to assign different classifier weights. Mutual information from information theory [31] is widely used to describe the mutual dependence between two random variables. In METL, the source domains and the target domain may have diverse data distributions. The source domains with a data distribution similar to that of the target domain should contribute more to our ensemble learning in terms of improving performance.
Let $p(x, y)$ denote the joint distribution of two random variables $(X, Y)$, and let $p(x)$ and $p(y)$ denote the marginal distributions of $X$ and $Y$, respectively. The mutual information of $X$ and $Y$ is expressed as $I(X; Y)$, which is the relative entropy between $p(x, y)$ and the product distribution $p(x)p(y)$, as shown in Eq. (2).

$$I(X; Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \tag{2}$$
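For discrete variables, the mutual information described above, i.e., the relative entropy between $p(x, y)$ and $p(x)p(y)$, can be computed directly from a joint probability table. The following sketch is illustrative only: the function name and the table layout (`joint[x][y]`) are assumptions, and it does not show how the paper estimates these distributions from samples.

```python
import math
from typing import Sequence


def mutual_information(joint: Sequence[Sequence[float]]) -> float:
    """Return I(X;Y) in nats for a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]        # marginal distribution of X
    py = [sum(col) for col in zip(*joint)]  # marginal distribution of Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:  # terms with p(x, y) = 0 contribute nothing
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi
```

For independent variables the result is zero, and it grows as the two variables become more dependent, which is the property the weighting scheme relies on.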
The mutual information value between the source sample $x_m^{S_i}$ in the $i$-th source domain after iterations, $D_{S_i}$, and the target sample $x_n^T$ is obtained from Eq. (3).
For $D_{S_i}$ and $D_T$, the mutual information value between the two data distributions is calculated by Eq. (4), which computes the mean over all relevant source and target samples. We use the mutual information $I(D_{S_i}; D_T)$ to indicate the weight of source domain $D_{S_i}$ with respect to the target domain $D_T$. Hence, for each source domain $D_{S_i}$, we have weight $w_i = I(D_{S_i}; D_T)$. We normalize the weights to obtain $w_i^*$, where $w_i^* \in [0, 1]$ and $\sum_{i=1}^{m} w_i^* = 1$. The target classifier is treated as a linear combination of the multiple classifiers with weights $w_i^*$, and the weights for all source classifiers collectively form a weight vector $w^* = \{w_i^*\}_{i=1}^{m}$. Finally, we combine the weighted classifiers from the multiple source domains to obtain an ensemble transfer learning effect with high performance and robustness. According to the above description, the final classifier $f^*(x)$ is described as follows:
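As a hedged illustration of the weighting scheme, the sketch below normalizes per-source mutual information values so that they sum to one and combines the source classifiers linearly. It is a sketch under assumptions rather than the paper's exact formulation: dividing each weight by the total is one normalization that satisfies the stated constraints, each classifier is assumed to return a vector of class probabilities, and the function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Classifier = Callable[[Sequence[float]], List[float]]  # returns class probabilities


def normalize_weights(mi_values: Sequence[float]) -> List[float]:
    """Scale non-negative weights so that they lie in [0, 1] and sum to 1."""
    total = sum(mi_values)
    if total <= 0:  # degenerate case: fall back to equal weights
        return [1.0 / len(mi_values)] * len(mi_values)
    return [w / total for w in mi_values]


def ensemble_predict(
    classifiers: Sequence[Classifier],
    weights: Sequence[float],
    x: Sequence[float],
) -> Tuple[int, List[float]]:
    """Weighted linear combination of the source classifiers' outputs."""
    outputs = [clf(x) for clf in classifiers]
    n_classes = len(outputs[0])
    scores = [
        sum(w * out[c] for w, out in zip(weights, outputs))
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=lambda c: scores[c])
    return label, scores
```

With equal weights the same function reduces to a plain average of the source classifiers, so the effect of the mutual information weights can be isolated by swapping the weight vector.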

C. AD INITIAL DIAGNOSIS WITH METL
We combine the proposed approach with the traditional medical diagnosis process to achieve practical application value. The ultimate goal is to help solve medical problems and facilitate early diagnosis of AD. The METL-based auxiliary diagnosis system is shown in Fig. 2; the system simulates the traditional diagnosis process. It has four phases: collecting the new patient's medical records, data preprocessing, METL auxiliary diagnosis, and final diagnosis by the doctor. The first phase is to use medical devices to examine the new patient, collecting information such as MRI images, PET images, and clinical data. Then, we generate an inspection report, present this report to the patient, and upload the data to doctors' computers and servers. The second phase is the preprocessing of the new patient data and of the source and target domain datasets from ADNI, including cleaning, integration, reduction, transformation, and class balancing. The third phase aims to generate an METL classification model and use the model to generate an auxiliary diagnosis; the results are displayed as either healthy or sick, the latter meaning the patient is in the MCI stage. In the fourth phase, the doctor refers to the auxiliary diagnosis result, makes the diagnosis, and informs the patient.
Different from the traditional diagnosis process, the proposed METL-based auxiliary diagnosis system not only reduces human error, but also improves accuracy, enabling doctors to make accurate judgments as soon as possible. If patients are found to be in the MCI stage, then the occurrence of AD can be prevented or delayed, thereby reducing the number of AD patients [32].

D. THEORY ANALYSIS

1) PHASE 1 (SINGLE SOURCE TRI-TRANSFER LEARNING)
Let $p_{S_k}(x)$, $p_{S_k}(y \mid x)$, and $p_{S_k}(x, y)$ denote the marginal, conditional, and joint distributions of the source domains, respectively, and let $p_T(x)$, $p_T(y \mid x)$, and $p_T(x, y)$ denote those of the target domain. If the predictions of classifiers $f_1$, $f_2$, and $f_3$ for the source sample $x_i^{S_k}$ are the same, then this source sample is deemed to have a distribution highly similar to that of the target domain and is marked with a high confidence value, and vice versa. Here, we use $\beta_i$ to represent the transferability of a source sample $x_i^{S_k}$, which is defined in Eq. (7).
Here, $p_T(x_i^{S_k})$ denotes the probability of sample $x_i^{S_k}$ being generated under the target domain distribution, and $\alpha_i$ is an indicator of whether the three classifiers make the same prediction.
The large distribution difference between the source and target domains is an important cause of negative transfer. To reduce this difference, we weight each source sample by its transferability: $\hat{p}_{S_k}(x, y) = \beta\, p_{S_k}(x, y)$ is defined as the estimated joint distribution of the source domain. Based on the Kullback-Leibler (KL) divergence [33], we define the objective function in Eq. (8) for minimizing the distribution difference.
The objective function contains two terms, and the first term is fixed when the dataset is known. Hence, to minimize Eq. (8), we just need to maximize the second part (within the parentheses). The second term can be maximized by training a better classifier. Consequently, optimizing Eq. (8) is equivalent to maximizing the third term, given in (9), in which a sample $x_i^{S_k} \in s^+$ is helpful for learning the target task, whereas a sample $x_i^{S_k} \in s^-$ plays a negative role. Note that $s_i^+, s_i^- \geq 0$, so we can maximize the function shown in (9) by selecting source samples $x_i^{S_k} \in s^+$ with better transferability.

2) PHASE 2 (MI-BASED MULTI-SOURCE ENSEMBLE LEARNING)
In phase 2, we denote $f^*(X_T)$ and $\{f_i(X_T)\}_{i=1}^{m}$ as the target labels predicted by the ideal target classifier and by the source classifiers, respectively. As mentioned above, the ideal target classifier can be derived by minimizing a loss function $L$ in which $w_i$ refers to the weight of the source classifier $f_i(x)$ and $d$ is a distance metric. A classification function $f(\cdot)$ can be written as $p(y \mid x)$ from a probabilistic viewpoint. For each source classifier, the predicted labels $f_i(X_T)$ are mathematically represented as probability distributions: $p_i(x_n) = \sum_{y} p_i(y)\, p_i(x_n \mid y)$, where $p_i(y)$ is the prior probability of the labels, and $p_i(x_n \mid y)$ is the class-conditional probability of instance $x_n$. Using the KL distance, the loss function $L$ can be further derived in terms of the entropy $H(X) = -\sum_{n} p(x_n) \log p(x_n)$, which measures uncertainty, and can then be divided into two parts, $L_1$ and $L_2$.
The performance of the ensemble learning approach depends on the predicted results of both the source classifiers and the ensemble classifier. As $L_1$ decreases, the source classifiers achieve better performance. The loss function $L_1$ defined in Eq. (12) reflects the confidence of the classification results: since information entropy measures the disorder of a system, better classification results have smaller dissimilarity. For $L_2$, shown in Eq. (12), in order to guarantee the performance of the ensembled classifier, its members should have higher accuracy and dissimilarity on the classification task.

IV. EXPERIMENTAL EVALUATION
To validate METL, we conduct extensive evaluations and experiments via a variety of multi-source transfer tasks. We first use a standard benchmark dataset to evaluate the following: (i) the efficacy of individual single-source tri-transfer learning and multi-source ensemble learning, (ii) the transferability of our approach, and (iii) the classification performance of our approach in comparison with other algorithms. Then, we use the AD dataset from ADNI to verify the feasibility of our proposed approach. Through these experiments, we comprehensively evaluate the performance of METL and the practical application capabilities in AD diagnosis.

A. BENCHMARK DATASETS
We first conducted experiments on 12 representative medical datasets from the UCI repository [34]. These 12 datasets, widely used for comparison between different algorithms, represent diverse domains and data features and have been preprocessed.
To form multiple sources for our problem, we divide each dataset from UCI into four sets, i.e., one target domain and three source domains. We select a multi-valued attribute and use K-means [35] on one attribute to cluster data into four sets, each set corresponding to one domain. The resultant four domains have different data distributions. TABLE 1 presents the attribute that is used to split each dataset and the details of each domain after splitting.
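The domain split can be illustrated with a small one-dimensional K-means written in plain Python. This is a sketch under assumptions, not the authors' preprocessing code: the initial centers are taken from evenly spaced order statistics to keep the example deterministic, the function names are hypothetical, and which of the four resulting sets serves as the target domain is left open.

```python
from typing import List, Sequence


def kmeans_1d(values: Sequence[float], k: int = 4, max_iter: int = 100) -> List[int]:
    """Cluster scalar values into k groups and return a cluster index per value."""
    ordered = sorted(values)
    # Deterministic initialization from evenly spaced order statistics.
    centers = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    assign: List[int] = []
    for _ in range(max_iter):
        new_assign = [
            min(range(k), key=lambda c: abs(v - centers[c])) for v in values
        ]
        if new_assign == assign:  # converged: assignments are stable
            break
        assign = new_assign
        for c in range(k):
            members = [v for v, a in zip(values, assign) if a == c]
            if members:  # leave the center unchanged if a cluster is empty
                centers[c] = sum(members) / len(members)
    return assign


def split_into_domains(
    rows: Sequence[Sequence[float]], attr_index: int, k: int = 4
) -> List[List[Sequence[float]]]:
    """Group rows into k domains by clustering a single attribute."""
    assign = kmeans_1d([row[attr_index] for row in rows], k)
    domains: List[List[Sequence[float]]] = [[] for _ in range(k)]
    for row, a in zip(rows, assign):
        domains[a].append(row)
    return domains
```

Because the split is driven by a single attribute, the resulting domains differ systematically in that attribute, which is what produces the distribution shift between source and target domains.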

1) EVALUATION OF THE TWO PHASES
Algorithm 1 METL
Input: Source domain data $D_{S_i}$ and target domain dataset $D_T$ with labels $y_i \in Y$.
Phase 1: Use three heterogeneous classifiers, $f_{1i}^n(x)$, $f_{2i}^n(x)$, and $f_{3i}^n(x)$, to train on data $D_i^n$. Initialize $D^{n+1}$ … 5).

In order to demonstrate the effectiveness of the two phases of METL, we design two baseline approaches for comparison against METL. The first approach (prototype1) uses TrAdaBoost to replace the proposed tri-transfer model, and the SVM is selected as the base classifier. The second approach (prototype2) replaces the MI-based ensemble method with an equal-weighted ensemble method.
Experiments using the three approaches are conducted on the 12 medical datasets. For comparison purposes, we chose 70% of the labeled examples in the target domain as the test dataset, and 3%, 10%, and 30% of the remainder as the training data. The number of source domains is 3. The experiments are repeated 10 times, and we average the results to obtain an accurate error estimate.
The experimental results are summarized in TABLE 2. We can see that METL outperforms the two baseline approaches in the majority of cases. The results indicate that tri-transfer learning is generally better than TrAdaBoost, and that the MI-based ensemble method generally surpasses the equal-weighted ensemble method. As the percentage of labeled data in the target domain increases, the accuracy of all three approaches improves. However, on the mammographic_masses and sani datasets, prototype1 outperforms METL. This is because with 3% and 10% labeled training data in the target domain, only a few source samples can be obtained; tri-transfer learning may discard some samples that are still useful even if they are not strongly correlated with the target domain. Therefore, the sampled source data may cause underfitting, and the value of mutual information is not able to correctly measure the similarity of data distributions between the source domain and the target domain. When the percentage of labeled data in the target domain reaches 30%, prototype1 and prototype2 are worse than METL, which means that METL's classification performance improves when the amount of training data is sufficient.
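The evaluation protocol can be sketched as follows. This is only an illustration under assumptions: the training ratio is read literally as a fraction of the non-test remainder, the split is a simple random shuffle, and `evaluate` is a hypothetical callback that trains a model on the training part and returns its accuracy on the test part.

```python
import random
from typing import Callable, List, Sequence, Tuple


def split_target(
    target: Sequence, train_ratio: float, test_fraction: float = 0.7, seed: int = 0
) -> Tuple[List, List]:
    """Hold out test_fraction for testing; train on a fraction of the remainder."""
    rng = random.Random(seed)
    idx = list(range(len(target)))
    rng.shuffle(idx)
    n_test = round(test_fraction * len(target))
    test = [target[i] for i in idx[:n_test]]
    remainder = [target[i] for i in idx[n_test:]]
    n_train = max(1, round(train_ratio * len(remainder)))
    return remainder[:n_train], test


def repeated_accuracy(
    target: Sequence,
    train_ratio: float,
    evaluate: Callable[[List, List], float],
    repeats: int = 10,
) -> float:
    """Average the accuracy over several random splits (one seed per repeat)."""
    accuracies = [
        evaluate(*split_target(target, train_ratio, seed=r)) for r in range(repeats)
    ]
    return sum(accuracies) / len(accuracies)
```

Using a different seed for each repeat means the ten runs see different train and test partitions, so the averaged accuracy reflects sampling variation rather than a single lucky split.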

2) EVALUATION OF MULTIPLE SOURCES
To verify the transferability of our approach, i.e., that it not only makes full use of data in the multiple source domains  when there is little labeled data in the target domain but also avoids negative transfer, we conduct evaluations with multiple sources. We choose 3%, 10%, and 30% of labeled data in the target domain, with 0, 1, 2, and 3 source domains. We employ METL with different ratios of labeled data and different numbers of source domains, and then repeat the experiments 10 times and average the results.
As shown in TABLE 3, the average classification accuracy of the case with 3% labeled data and zero source domains is the worst, while the case with 30% labeled data and three source domains is the best. In general, as the ratio of labeled target data increases, the accuracy of METL increases. This means that an increase in the amount of labeled data in the target domain describes the data distribution more fully, and hence the three heterogeneous classifiers have better generalization ability to ensure that the examples sampled from the source domains have transferability. Furthermore, the accuracy rises with the number of source domains, indicating that METL can use samples from multiple source domains to assist in learning the target task.
The experimental results in TABLE 3 show that multi-source transfer learning outperforms single-source transfer learning. When the ratio of labeled data is 3%, the growth rate of the accuracy is the largest, which means that the less training data there is in the target domain, the more useful the transferred knowledge. However, for a fixed ratio of labeled data, the accuracy growth rate slowly decreases as source domains are added, which means that when there is enough source data, increasing the number of source domains will not significantly improve the performance.

3) COMPARISON WITH EXISTING APPROACHES
To further demonstrate the performance of METL, we compare it with three transfer learning algorithms: Multi-Source-TrAdaBoost (MTrA) [13], Multi-Source Dynamic TrAdaBoost (MSDTrA) [14], and Multi-Source Tri-Training Transfer Learning (MST$^3$L) [36]. The main settings of the algorithms are shown in TABLE 4.
To obtain an accurate error estimate, cross-validation is repeated 10 times for each algorithm, and the mean is taken as the final result. As indicated by the average classification accuracy in TABLE 5, when the ratio of labeled data is 10%, METL is superior to MTrA, MSDTrA, and MST$^3$L; moreover, MTrA performs the worst. In METL, three heterogeneous classifiers learn the same target task from different views, which gives strong generalization ability. Furthermore, METL reasonably estimates the correlation between each source domain and the target domain by employing mutual information. MTrA and MSDTrA are both based on TrAdaBoost, but MSDTrA surpasses MTrA because it introduces a dynamic factor that alleviates the problem of weight entropy drifting from the source samples to the target samples, which is caused by the convergence of the source weights.

B. ALZHEIMER'S DISEASE DATASET
To further validate the feasibility of the proposed approach, we conduct extensive experiments on a real-world AD medical dataset. More than 30 million people worldwide suffer from AD, and with the increase in life expectancy, the number of patients is expected to triple by 2050. Medical research has shown that during MCI, timely detection and effective measures can prevent the disease from worsening. Therefore, the early diagnosis of AD is very important, and determining the patient's stage in the disease has become the focus of current research.
ADNI provides researchers with research data as they work to determine the progression of AD. The data collection is divided into four phases: ADNI1, ADNI-GO, ADNI2, and ADNI3; ADNI3 is the latest phase. We used the AD diagnostic summary dataset obtained from ADNI, which includes the time phase, ID, multiple attributes of the inspection item, and diagnostic result labels. Attributes of the inspection item include DXCURREN, DXCONV, DXCONTYP, DXREV, DXNORM, DXMCI, DXMDES, etc. We used data from the ADNI3 phase with label 1 or 2, and then employed data preprocessing and SMOTE techniques [37] to balance the number of samples per class so that the training data in the AD dataset was easy to learn. The partitions of the AD dataset are shown in TABLE 6.
On the AD dataset, METL was compared with the three aforementioned algorithms. The main settings of the four algorithms are the same as those listed in TABLE 4. The ratios of labeled data in the target domain are selected as 3%, 10%, and 30%. For each of the four algorithms, cross-validation is repeated 10 times, and the experimental results are averaged.
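The idea behind SMOTE-style class balancing can be sketched in a few lines: each synthetic minority sample is an interpolation between a minority sample and one of its nearest minority neighbors. The code below is a simplified illustration, not the configuration used in the experiments; the function name, the neighbor count `k`, and the use of squared Euclidean distance are assumptions, and at least two minority samples are required.

```python
import random
from typing import List, Sequence


def smote(
    minority: Sequence[Sequence[float]], n_new: int, k: int = 3, seed: int = 0
) -> List[List[float]]:
    """Create n_new synthetic samples by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic: List[List[float]] = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of `base`, excluding `base` itself.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic
```

Setting `n_new` to the difference between the majority and minority class sizes yields a balanced training set while leaving the original samples untouched.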
In Fig. 3, the x axis is the accuracy and the y axis is the ratio of the labeled data in the target domain. As indicated by the overall classification accuracy in Fig. 3, METL and MST 3 L are better than MSDTrA, but MTrA is worse than MSDTrA. When the ratio of labeled data in the target domain is 3%, METL and MST 3 L significantly outperform MTrA and MSDTrA, which demonstrates that METL has better transferability when there is very little training data in the target domain. MTrA and MSDTrA have similar accuracies when the ratio of labeled data in the target domain reaches 30%, which means that MTrA and MSDTrA have similar performance when the training data in the target domain is sufficient.
Moreover, Fig. 3 shows that METL has good feasibility in the initial diagnosis of AD and can help solve practical problems. As the ratio of labeled data in the target domain increases, the accuracies of the four algorithms increase. However, the growth in accuracy when the ratio of labeled data in the target domain rises from 3% to 10% is higher than that from 10% to 30%. This means that the less labeled data there is in the target domain, the more useful transfer learning is.

V. DISCUSSION
As demonstrated in the reported experiments, our approach achieves high-quality transfer performance. Based on the mathematical analysis and overall observation of the experimental results, we summarize the advantages of our approach as follows.
First, we proposed a single-source tri-transfer learning model that was proved mathematically feasible in Sect. III.D, and we tested it on a variety of datasets. The experimental results are shown in TABLE 2, TABLE 3, TABLE 5, and Fig. 3. Tri-transfer learning not only ensures that the sampled source data has better transferability than that of general transfer learning algorithms but also enhances robustness. Second, the MI-based ensemble method was first proved feasible via a mathematical derivation, and its effectiveness was then demonstrated through experiments. Finally, the experimental results show that our approach can assist in the initial diagnosis of AD.
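To make the idea of an MI-based ensemble concrete, the following sketch weights each classifier by the empirical mutual information between its predictions and the true labels on a labeled validation set, and then combines the classifiers by weighted voting. This is one plausible weighting scheme chosen for illustration and is not claimed to be the exact METL formulation; the function names and the toy predictions are hypothetical.

```python
import math
from collections import Counter, defaultdict


def mutual_information(a, b):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log((c * n) / (pa[x] * pb[y]))
        for (x, y), c in pab.items()
    )


def mi_weights(val_predictions, val_labels):
    """Normalise each classifier's MI with the labels into a voting weight."""
    raw = [mutual_information(p, val_labels) for p in val_predictions]
    total = sum(raw)
    if total == 0:
        # No classifier is informative: fall back to uniform weights.
        return [1 / len(raw)] * len(raw)
    return [r / total for r in raw]


def weighted_vote(test_predictions, weights):
    """For each sample, return the label with the largest total weight."""
    combined = []
    for i in range(len(test_predictions[0])):
        votes = defaultdict(float)
        for preds, w in zip(test_predictions, weights):
            votes[preds[i]] += w
        combined.append(max(votes, key=votes.get))
    return combined


# Toy validation data (hypothetical): three classifiers, six labeled samples.
val_labels = [1, 1, 2, 2, 1, 2]
val_predictions = [
    [1, 1, 2, 2, 1, 2],  # classifier A: agrees with every label
    [1, 1, 1, 1, 1, 1],  # classifier B: constant output, carries no information
    [1, 1, 2, 2, 2, 1],  # classifier C: partly correct
]
weights = mi_weights(val_predictions, val_labels)
ensemble = weighted_vote([[1, 2, 2], [1, 1, 1], [2, 2, 1]], weights)
```

Mutual information is invariant to relabeling, so a classifier that is consistently wrong would also receive a high weight under this scheme; a practical implementation would need to guard against that case, for example by also checking validation accuracy.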
Liu et al. [38] designed an ensemble transfer learning framework that uses a weighted resampling method on the source and target data. However, the framework is designed for a single source domain, and its base learners are trained by the resampling method and TrAdaBoost. In contrast, METL learns three classifiers via the sampling scheme to ensure the transferability of the sampled source data. Therefore, our approach improves not only the interaction between multiple learners but also the reliability of the source data.
Although the experimental results demonstrate that our approach achieves a certain level of superiority on a variety of datasets, there are some issues that could directly affect its practical application. Like all existing transfer algorithms, our approach may perform poorly when there are very few target examples. Furthermore, while our approach uses Softmax, the SVM, and the DNN as the base learners, selecting appropriate classifiers for datasets with different data characteristics remains worthy of further research. Moreover, because data in the medical field is heterogeneous, obtaining a shared feature space between the source and target domains will be a direction for our future work.
The rapid aging of the population and the high incidence of chronic diseases, especially AD, are increasingly serious social problems worldwide. By supporting earlier identification of at-risk patients, our approach can help clinicians intervene in and slow the clinical conversion from MCI or normal control to AD, thereby enabling faster and safer monitoring and treatment in dementia care.

VI. CONCLUSION
In this paper, we propose a multi-source ensemble transfer learning approach, referred to as METL, to learn an accurate and robust classifier for the target domain. In METL, the source data sampling method ensures the transferability of the samples drawn from the source domain. Then, three heterogeneous classifiers are ensembled to obtain a robust classifier. Finally, multiple classifiers are combined to further improve the performance by utilizing mutual information and ensemble learning. Extensive experiments show that METL is accurate, effective, and robust. Moreover, METL surpasses the existing algorithms when the target training data is insufficient. Experiments on the AD dataset show that our approach can effectively improve the classification accuracy, address the two problems of medical datasets (an insufficient number of samples and differing data distributions), and assist doctors in making a diagnosis. We propose a METL-based auxiliary diagnosis system for the initial diagnosis of AD. This system helps doctors accurately identify patients in the MCI stage as soon as possible so that measures can be taken to prevent or delay the occurrence of AD.