Transfer Learning Under Conditional Shift Based on Fuzzy Residual

Abstract: Transfer learning has received much attention recently and has proven effective in a wide range of applications, whereas studies on regression problems are still scarce. In this article, we focus on the transfer learning problem for regression under conditional shift, where the source and target domains share the same marginal distribution but have different conditional probability distributions. We propose a new framework called transfer learning based on fuzzy residual (ResTL), which learns the target model by preserving the distribution properties of the source data in a model-agnostic way. First, we formulate the target model by adding a fuzzy residual to a model-agnostic source model and reusing the antecedent parameters of the source fuzzy system. Then, two methods for bias computation are provided for different considerations, which lead to two ResTL methods called ResTL_LS and ResTL_RD. Finally, we conduct a series of experiments on a toy example and several real-world datasets to verify the effectiveness of the proposed method.

conditional shift [10]; and so on. In this article, we are mainly concerned with transfer learning for regression under conditional shift, which is a common and challenging data shift problem. Conditional shift in the multistream learning problem has received great attention for a long time [11], [12]. By the nature of data streams, both the source and target streams are subject to their own concept drift over time: $P_S(y|x)_t \neq P_S(y|x)_{t-1}$, $P_T(y|x)_t \neq P_T(y|x)_{t-1}$ [13]. However, in this article, we focus on conditional shift across domains rather than within a domain, where the source and target domains have different conditional distributions while sharing the same marginal distribution, that is, $P(y_s|x_s) \neq P(y_t|x_t)$, whereas $P(x_s) = P(x_t)$.
In the engineering field, many kinds of training data need to be collected and labeled by inefficient manual measurement with expensive instruments [14], [15]. Fig. 1 shows an example of constructing the prediction model of pose-dependent tooltip dynamics of a computer numerical control (CNC) machining center. Tooltip dynamics, including natural frequency w, damping ratio ξ, and stiffness K, are very important for chatter suppression in machining complex parts like an aeroengine impeller. In the machine's workspace defined by three linear axes (X, Y, Z) and two rotating axes (A, C), (w, ξ, K) usually varies with position; thus, a model f that maps (X, Y, Z, A, C) to (w, ξ, K) is required. Since the workspace is huge, for each tool-holder assembly the machine has to be taken out of production for a couple of weeks to carry out a large number of impact testing experiments manually for label collection [16]. Moreover, since a CNC machine usually has more than one hundred frequently used tool-holder assemblies, it is impossible to collect enough labeled tooltip dynamics data in real manufacturing. Although the data distributions produced by different subsettings of a system are different, for example, the tooltip dynamics distributions of the different tool-holder assemblies discussed in Fig. 1, they are usually similar. By transferring the knowledge learned from previous subsettings when training on a target subsetting, the required labeled data can be greatly reduced. More specifically, the data generated under different subsettings of a system usually have the same marginal distributions but different conditional distributions; thus, transfer learning among this kind of data can be categorized as a conditional shift problem. There are many other regression problems under conditional shift in real industrial applications, including kinematic parameter prediction of robots with different end effectors [17] and tool wear estimation [18]. It is also a common issue in many other fields, such as predicting the yield of vineyards [19], the delay of a particular flight [20], and particulate matter (PM2.5) data [21].
To bridge different domains under substantial distribution discrepancy, great efforts have been made recently. Feature matching methods like transfer component analysis (TCA) [22] and joint distribution adaptation (JDA) [23] were proposed to extract domain-invariant information for the covariate shift problem. Other methods, including kernel mean matching (KMM) [8], the Kullback-Leibler importance estimation procedure (KLIEP) [26], and domain adaptation under generalized target shift (GeTarS) [10], resample the source instances by reweighting or transforming them to deal with covariate shift or prior probability shift problems. Deep neural networks have also been explored in this area owing to their excellent feature extraction capability [24]. For the conditional shift problem discussed in this article, existing work falls into two main categories: instance-based methods and model-based methods.
Instance-based methods are widely used for distribution shift correction and can be divided into two scenarios, namely, instance reweighting and instance transformation. In instance reweighting methods, each instance in the source domain is weighted to indicate its contribution to constructing the target model. The weights can be predetermined by minimizing the distribution discrepancy between the source and target data [8], [20], [26], or adjusted adaptively according to the training error on the two datasets, as in TrAdaBoost [27]-[29]. To succeed, this kind of method usually requires that the distributions of the two domains be very close in some connected subsets of X, so that larger weights can be assigned to the source instances in these subsets [20]. However, this assumption may not always be satisfied because many real-world cases have global conditional shifts, that is, $P(y_s|x) \neq P(y_t|x)$ for all $x \in X$. Instance transformation methods try to transform the source data so that its conditional distribution matches that of the target data, that is, $P^{new}_s = P_t$ [19], [30]. Nevertheless, if the target data are inadequate for estimating the distribution, it is difficult to solve for the coefficients of the transformation accurately [26].
Two kinds of model-based transfer learning methods have also been investigated for conditional shift situations, namely, parameter-based methods and hypothesis-based methods. Parameter-based methods rest on the assumption that the models in the source and target domains have similar structures or share some common parameters. To make the idea work, the prediction model has to be specified beforehand. The Gaussian process (GP) regression model is a popular choice for parameter-based methods, transferring through shared parameters [31], a shared representation of instances [32], or a learned transfer covariance function [33]. Recently, a more intuitive and interpretable modeling method, the Takagi-Sugeno-Kang (TSK) fuzzy system, has attracted more attention. Target models can be learned by transferring the consequent parameters [35], [36], or by mapping the label space under fuzzy rules [37]-[39]. Though parameter-based methods show promising advantages for some applications, their reliance on model-specific parameters makes them inapplicable to many scenarios [40].
Hypothesis-based methods aim to incorporate the hypothesis learned in the source domain instead of estimating the target model directly [42]. Different frameworks, including the offset and the scale, have been proposed to transform the source model into the target model [19], [42]. However, since the transformation function is trained only on the target data, the final performance is highly dependent on the amount of labeled target data. A major challenge in the conditional shift problem is how to train a reasonable target model with very limited target data.
In this article, we propose a new framework called transfer learning based on fuzzy residual (ResTL) which learns the target model for regression problems under conditional shift situation by preserving the distribution properties of the source data in a model-agnostic way. In ResTL, the target hypothesis h t (x) is obtained by adding a residual function r(x) to the source hypothesis h s (x). Traditional hypothesis-based methods train r(x) on the target data directly. However, it is usually difficult to find a reasonable solution in a considerable hypothesis space if the target data are quite limited. In this article, we follow the idea that the target model trained by transfer learning would preserve the distribution properties of the source data, especially when the target data are far from enough. Therefore, ResTL first models the source data into a TSK fuzzy system. Then, the fuzzy residual r(x) is obtained by multiplying the antecedent parameters of the source fuzzy system and the biases trained on the target data. As the fuzzy partition could be regarded as reflections to the marginal distribution of X, ResTL can learn a much more reasonable target model by preserving the data properties of the source task, even the target data are quite limited. Furthermore, we relax h s (x) from the TSK fuzzy system to a model-agnostic form, which means that the target model could be generalized as a combination of h s (x) trained by any supervised regression model and a fuzzy residual component r(x).
In Section II, we will introduce the main idea of ResTL and the model-agnostic formulation of the target model. Then two algorithms for bias computation are provided to construct fuzzy residual for different considerations. In Section III, we will describe a series of experiments carried out on toy examples and several real-world datasets to verify the effectiveness of the proposed method. Finally, we will conclude this article in Section IV.

II. MODEL-AGNOSTIC TRANSFER LEARNING BASED ON FUZZY RESIDUAL
Suppose that we have a regression task with a large amount of training data $D_s = \{(x^s_1, y^s_1), \ldots, (x^s_{n_s}, y^s_{n_s})\}$ as a source task, where $x^s_i \in X_S$ is the data instance and $y^s_i \in Y_S$ is the corresponding label, a continuous variable. There is also another regression task with a small amount of training data $D_t = \{(x^t_1, y^t_1), \ldots, (x^t_{n_t}, y^t_{n_t})\}$ as a target task, where $x^t_i \in X_T$ is the input and $y^t_i \in Y_T$ is the corresponding output. This article focuses on the transfer learning problem between the two tasks under conditional shift, where the marginal distributions are the same, $P(x_s) = P(x_t)$, but the conditional distributions are different, $P(y_s|x_s) \neq P(y_t|x_t)$.
There are two crucial procedures in ResTL: preserving the shared knowledge across domains and reducing the discrepancy across domains. First, the TSK fuzzy system of the source task $h^{FS}_s$ is constructed using the source data. Given that there are only small amounts of labeled target data, the TSK fuzzy system of the target task $h^{FS}_t$ cannot be constructed directly. We propose to reuse the antecedent information of $h^{FS}_s$ and to describe the discrepancy between the two models by adding a bias to each fuzzy rule. The details of the proposed ResTL are introduced in the rest of this section. First, we formulate the target model by adding a fuzzy residual to a model-agnostic source model. Then, two methods for bias computation are provided, which lead to two ResTL methods called ResTL_LS and ResTL_RD.

A. Formulating the Target Model With Bias of Fuzzy Rules
The target hypothesis h t (x) is obtained by adding a residual function r(x) to the source hypothesis h s (x). First, we model the source data using the TSK fuzzy system. Then, the target model can be expressed with a model-agnostic source model and a fuzzy residual term.

1) Constructing the TSK Fuzzy System of the Source Task: For the source data $D_s$, denote by $X_s \in \mathbb{R}^{n_s \times d}$ the matrix whose rows are the input vectors and by $y_s \in \mathbb{R}^{n_s \times 1}$ the vector containing the outputs

$$X_s = [x^s_1, \ldots, x^s_{n_s}]^T, \quad y_s = [y^s_1, \ldots, y^s_{n_s}]^T.$$

The TSK fuzzy system can represent nonlinear dynamical systems with a high degree of precision using TSK fuzzy rules. Denote the TSK fuzzy system of the source task as $FS_s$; then the output of $FS_s$ can be represented by a combination of submodels as

$$h^{FS}_s(x) = \sum_{k=1}^{K} \lambda_k(x) f_k(x)$$

where $K$ is the number of fuzzy rules, and $\lambda_k(x)$ is the normalized membership degree, which can be calculated from the antecedent parameters as in Appendix A [43], [44]. The antecedent parameters are obtained by fuzzy partition, which can be regarded as a reflection of the marginal distribution of X. Because of the assumption that $P(x_s) = P(x_t)$, the antecedent parameters can be shared across domains to preserve the distribution information of the source domain. $f_k(x)$ is the kth fuzzy rule, namely, a linear function of the input variables

$$f_k(x) = p_k^T x_e, \quad x_e = [x^T, 1]^T$$

where $p_k \in \mathbb{R}^{d+1}$, and $p = [p_1^T, \ldots, p_K^T]^T \in \mathbb{R}^{(d+1)K \times 1}$ collects the consequent parameters of the TSK fuzzy system. The consequent parameters of $FS_s$ can be solved with the labeled dataset $D_s$.
To solve for the consequent parameters $p$, we can construct

$$X_g = [\Lambda_1 X^s_e, \Lambda_2 X^s_e, \ldots, \Lambda_K X^s_e] \in \mathbb{R}^{n_s \times (d+1)K}$$

where $X^s_e = [X_s, \mathbf{1}] \in \mathbb{R}^{n_s \times (d+1)}$ is the extended matrix obtained by appending a column of ones to $X_s$, and $\Lambda_k \in \mathbb{R}^{n_s \times n_s}$ is a diagonal matrix having the normalized membership degree $\lambda_k(x^s_i)$ as its ith diagonal element

$$\Lambda_k = \mathrm{diag}\big(\lambda_k(x^s_1), \ldots, \lambda_k(x^s_{n_s})\big).$$

Then, the consequent parameters $p \in \mathbb{R}^{(d+1)K}$ can be solved with the least-square (LS) method as

$$p = (X_g^T X_g)^{-1} X_g^T y_s.$$

At this point, the TSK fuzzy system of the source task $h^{FS}_s$ is completely constructed and can be used to construct the target model.
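The source-model construction described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the Gaussian membership form with a factor of 2 in the denominator is an assumption, and the rule centers and spreads are fixed on a grid here instead of being produced by a fuzzy partition algorithm such as FCM.

```python
import numpy as np

def normalized_memberships(X, centers, spreads):
    """Return the (n, K) matrix of normalized Gaussian membership degrees."""
    # diff has shape (n, K, d): every sample against every rule center.
    diff = X[:, None, :] - centers[None, :, :]
    mu = np.exp(-0.5 * np.sum((diff / spreads[None, :, :]) ** 2, axis=2))
    return mu / mu.sum(axis=1, keepdims=True)

def fit_consequents(X, y, centers, spreads):
    """Least-squares estimate of the stacked consequent vector p, shape (K*(d+1),)."""
    lam = normalized_memberships(X, centers, spreads)       # (n, K)
    X_e = np.hstack([X, np.ones((X.shape[0], 1))])           # (n, d+1)
    # Column block k is Lambda_k X_e, i.e. X_e with each row scaled by lambda_k.
    X_g = np.hstack([lam[:, [k]] * X_e for k in range(lam.shape[1])])
    # lstsq is used instead of an explicit inverse for numerical stability.
    p, *_ = np.linalg.lstsq(X_g, y, rcond=None)
    return p

def predict_tsk(X, p, centers, spreads):
    """Evaluate sum_k lambda_k(x) * f_k(x) with f_k linear in [x, 1]."""
    lam = normalized_memberships(X, centers, spreads)
    X_e = np.hstack([X, np.ones((X.shape[0], 1))])
    P = p.reshape(centers.shape[0], -1)                      # (K, d+1)
    return np.sum(lam * (X_e @ P.T), axis=1)

# Illustrative source data (assumed input range [0, 1]).
rng = np.random.default_rng(0)
X_s = rng.uniform(0, 1, size=(200, 1))
y_s = np.sin(7 * X_s[:, 0]) + 1
centers = np.linspace(0, 1, 5).reshape(-1, 1)   # assumed grid, not FCM
spreads = np.full((5, 1), 0.15)                 # assumed spread
p = fit_consequents(X_s, y_s, centers, spreads)
mse = np.mean((predict_tsk(X_s, p, centers, spreads) - y_s) ** 2)
```

Using `np.linalg.lstsq` rather than forming the normal equations is a design choice of this sketch; it returns the same solution when the stacked matrix has full column rank and a minimum-norm solution otherwise.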
2) Constructing the Target Model With Fuzzy Residual: Given that there are only small amounts of labeled target data, the TSK fuzzy system of the target task $h^{FS}_t$ cannot be constructed directly. We propose to reuse the antecedent parameters of the source fuzzy system to preserve the distribution properties of the source data. For the labeled target data $D_t$, denote by $X_t \in \mathbb{R}^{n_t \times d}$ and $y_t \in \mathbb{R}^{n_t \times 1}$ the input and output, respectively

$$X_t = [x^t_1, \ldots, x^t_{n_t}]^T, \quad y_t = [y^t_1, \ldots, y^t_{n_t}]^T.$$

For an instance $(x^t_i, y^t_i)$, since $P(y_s|x) \neq P(y_t|x)$, in general $h^{FS}_s(x^t_i) \neq y^t_i$. Here, we define the target fuzzy system $FS_t$ by appending a constant bias $z_k$ to each fuzzy rule of $FS_s$ as

$$h^{FS}_t(x) = \sum_{k=1}^{K} \lambda_k(x)\big(f_k(x) + z_k\big)$$

where the normalized membership degrees $\lambda_k(x)$ $(k = 1, \ldots, K)$, determined by the antecedent parameters, are reused to preserve the distribution properties of the source data, while the bias $z_k$ $(k = 1, \ldots, K)$ is defined to match the discrepancy between $P(y_s|x)$ and $P(y_t|x)$. $h^{FS}_t(x)$ can be decomposed as

$$h^{FS}_t(x) = \sum_{k=1}^{K} \lambda_k(x) f_k(x) + \sum_{k=1}^{K} \lambda_k(x) z_k.$$

The former component is the TSK fuzzy system of the source task, but it can easily be replaced by any supervised regression model. Therefore, the target model obtained with ResTL can be generalized as

$$h_t(x) = h_s(x) + r(x) \tag{9}$$

where $h_s(x)$ is the source model, which is model agnostic, and

$$r(x) = \sum_{k=1}^{K} \lambda_k(x) z_k \tag{10}$$

is the fuzzy residual. After obtaining the antecedent parameters, and hence $\lambda_k(x)$, together with $h_s(x)$, the task of learning the target model reduces to computing the bias $z_k$. With the proper bias, the source data can be transformed with (10) to mimic the distribution of the target data and facilitate learning on the target domain. Therefore, the bias parameters should be calculated with the labeled target data.
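To make the model-agnostic composition concrete, the following sketch wraps an arbitrary source predictor with a fuzzy residual. The helper names, the Gaussian membership form, and the example centers, spreads, and biases are illustrative assumptions; the only thing taken from the text is that the target model is the source model plus a membership-weighted sum of per-rule biases.

```python
import numpy as np

def normalized_memberships(X, centers, spreads):
    """(n, K) normalized Gaussian membership degrees (assumed form)."""
    diff = X[:, None, :] - centers[None, :, :]
    mu = np.exp(-0.5 * np.sum((diff / spreads[None, :, :]) ** 2, axis=2))
    return mu / mu.sum(axis=1, keepdims=True)

def make_target_model(h_s, z, centers, spreads):
    """Return h_t(X) = h_s(X) + sum_k lambda_k(X) * z_k for any callable h_s."""
    def h_t(X):
        return h_s(X) + normalized_memberships(X, centers, spreads) @ z
    return h_t

# Any regressor with a predict-like callable can play the role of h_s.
def h_s(X):
    return np.sin(7 * X[:, 0]) + 1

centers = np.array([[0.1], [0.5], [0.9]])   # hypothetical antecedent centers
spreads = np.full((3, 1), 0.2)              # hypothetical spreads
z = np.array([0.4, 0.4, 0.4])               # equal biases for every rule
h_t = make_target_model(h_s, z, centers, spreads)

X = np.linspace(0, 1, 11).reshape(-1, 1)
shift = h_t(X) - h_s(X)
```

Because the normalized membership degrees sum to one at every input, equal biases produce a constant offset, so `shift` equals 0.4 everywhere in this example; unequal biases give an input-dependent residual.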

B. Computing the Bias
We propose two methods to compute the bias z k : 1) the LS method by empirical risk minimization and 2) residual defuzzification (RD). The bounds of fuzzy residual with the two methods are also analyzed.
1) Computing the Bias With the LS Method: From a probabilistic viewpoint, the hypothesis $h(x)$ can be regarded as an estimate of the conditional distribution $P(y|x)$. Conditional distribution matching can therefore be converted into minimizing the empirical risk of $h_t(x)$ on the target data, and the bias $z_k$ can be obtained by minimizing the risk of the hypothesis on $D_t$. In general, the empirical risk is defined as the average loss on the training set

$$R_{emp}(h) = \frac{1}{n_t} \sum_{i=1}^{n_t} L\big(h(x^t_i), y^t_i\big).$$

The goal is to find a hypothesis $\hat{h}$ among a fixed class of functions $\mathcal{H}$ for which the risk is minimal

$$\hat{h} = \arg\min_{h \in \mathcal{H}} R_{emp}(h).$$

The hypothesis $h_t$ is formulated as (9); therefore, the bias vector $z = [z_1, z_2, \ldots, z_K]^T$ can be represented as

$$\hat{z} = \arg\min_{z} R_{emp}(h_t).$$

The loss function can be defined as the squared error

$$L\big(h_t(x^t_i), y^t_i\big) = \big(h_t(x^t_i) - y^t_i\big)^2. \tag{14}$$

Substituting (9) into (14), it follows that:

$$L\big(h_t(x^t_i), y^t_i\big) = \Big(h_s(x^t_i) + \sum_{k=1}^{K} \lambda_k(x^t_i) z_k - y^t_i\Big)^2. \tag{15}$$

Denote $\phi_i = [\lambda_1(x^t_i), \ldots, \lambda_K(x^t_i)]^T$, $e_i = y^t_i - h_s(x^t_i)$, $\Phi = [\phi_1, \ldots, \phi_{n_t}]^T \in \mathbb{R}^{n_t \times K}$, and $e = [e_1, e_2, \ldots, e_{n_t}]^T$; then (15) can be simplified as

$$L\big(h_t(x^t_i), y^t_i\big) = \big(\phi_i^T z - e_i\big)^2.$$

Dropping the constant factor $1/n_t$, which does not affect the minimizer, we have the minimization problem

$$\min_{z} \|\Phi z - e\|_2^2.$$

As the limited target data may be distributed unevenly, a regularization term should be included, and the minimization problem becomes

$$\min_{z} \|\Phi z - e\|_2^2 + \eta \|z\|_2^2$$

where $\eta$ is a parameter that controls the importance of the regularization term. For those rules with no target data, the corresponding $z_k$ will be kept close to zero with the help of $\eta$. Therefore, the target model will approach the source model in regions without target data. The LS solution for $z$ can be computed as

$$z = (\Phi^T \Phi + \eta I)^{-1} \Phi^T e$$

where $I$ is the $K \times K$ identity matrix. Note that the solution for the bias vector $z$ contains the normalized membership degree $\lambda_k(x^t_i)$, which is calculated from the antecedent parameters of the source domain.
2) Computing the Bias With RD: In $FS_s$, the crisp input $x^t_i \in X_t$ is mapped to a fuzzy set. Similarly, we can design a fuzzy set of residuals for the kth rule as

$$B_k = \big\{\big(e_i, \mu_{B_k}(e_i)\big) \mid x^t_i \in X_t\big\}$$

where $e_i$ and $\mu_{B_k}(e_i)$ are functions of $x^t_i$, $e_i = y^t_i - h_s(x^t_i)$ is the residual of the target data, and $\mu_{B_k}(e_i)$ is the membership function. We assume that the target model and the source model share the same antecedent parameters. Therefore, $\mu_{B_k}(e_i)$, namely, $\mu_{B_k}(x^t_i)$, can be computed by

$$\mu_{B_k}(x^t_i) = \prod_{j=1}^{d} \exp\left(-\frac{\big(x^t_{i(j)} - c^k_j\big)^2}{2\big(\sigma^k_j\big)^2}\right)$$

where $c^k_j$ and $\sigma^k_j$ are the shared antecedent parameters as in Appendix A, $d$ is the dimension of $x^t_i$, and $x^t_{i(j)}$ denotes the jth dimension of the vector $x^t_i$. Since the fuzzy set $B_k$ is the residual of the kth fuzzy rule and the universe of discourse is denoted as $X_t$, the relationship between the residual of the target data $e_i$ and the bias $z_k$ can be established as a defuzzification process. A defuzzifier produces a crisp output from a fuzzy set. Therefore, the crisp output of the fuzzy set $B_k$ is the bias $z_k$. Using the centroid defuzzifier, the final bias $z_k$ can be computed as

$$z_k = \frac{\sum_{i=1}^{n_t} \mu_{B_k}(x^t_i)\, e_i}{\sum_{i=1}^{n_t} \mu_{B_k}(x^t_i)}. \tag{23}$$

Note that the solution for the bias vector $z$ contains the membership degree $\mu_{B_k}(x^t_i)$, which is calculated from the antecedent parameters of the source domain.
3) Bounds of Fuzzy Residual: The ResTL methods that compute the bias with LS and RD are called ResTL_LS and ResTL_RD, respectively. The bounds of the fuzzy residual $r(x)$ influence the final performance of the target model learned by ResTL_LS and ResTL_RD. There are no strict bounds for the LS fuzzy residual. For those rules with no target data, the corresponding $z_k$ will be kept close to zero with the help of the regularization term. However, the RD fuzzy residual has strict upper and lower bounds even with sparse target data.
Denote $e^t_i = y^t_i - h_s(x^t_i)$, $e^t_{max} = \max[e^t_1, e^t_2, \ldots, e^t_{n_t}]$, and $e^t_{min} = \min[e^t_1, e^t_2, \ldots, e^t_{n_t}]$. If there is only one target sample, then $e^t_{max} = e^t_{min} = e^t_1$. Then, (23) leads to the following inequality:

$$e^t_{min} \le z_k \le e^t_{max}, \quad k = 1, \ldots, K.$$

Then, we have

$$e^t_{min} \le r(x) \le e^t_{max}$$

where $r(x) = \sum_{k=1}^{K} \lambda_k(x) z_k$ with $\lambda_k(x) \ge 0$ and $\sum_{k=1}^{K} \lambda_k(x) = 1$. The RD fuzzy residual thus has strict upper and lower bounds, which are the bounds of the generalization errors of the source model on the target data.
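The regularized LS step can be illustrated with a small ridge solve. This is a sketch under assumptions: the function names, the Gaussian membership form, and the toy centers, spreads, and residuals are hypothetical, and the regularizer is taken to be the squared norm of the bias vector weighted by η.

```python
import numpy as np

def normalized_memberships(X, centers, spreads):
    """(n, K) normalized Gaussian membership degrees (assumed form)."""
    diff = X[:, None, :] - centers[None, :, :]
    mu = np.exp(-0.5 * np.sum((diff / spreads[None, :, :]) ** 2, axis=2))
    return mu / mu.sum(axis=1, keepdims=True)

def ls_bias(lam_t, e, eta):
    """Solve (Phi^T Phi + eta I) z = Phi^T e for the bias vector z."""
    K = lam_t.shape[1]
    return np.linalg.solve(lam_t.T @ lam_t + eta * np.eye(K), lam_t.T @ e)

# Three rules spread over [0, 1]; target samples exist only near the first rule.
centers = np.array([[0.1], [0.5], [0.9]])
spreads = np.full((3, 1), 0.1)
X_t = np.linspace(0.0, 0.2, 10).reshape(-1, 1)
e = np.ones(10)                       # residuals y_t - h_s(x_t), all equal to 1

lam_t = normalized_memberships(X_t, centers, spreads)
z = ls_bias(lam_t, e, eta=0.01)
```

In this toy setting the first bias comes out close to 1 while the bias of the third rule, which sees essentially no target data, stays near zero, which mirrors the statement that the target model approaches the source model where target data are absent.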
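The centroid defuzzification step amounts to a membership-weighted average of the target residuals for each rule. The sketch below assumes a Gaussian membership function with a factor of 2 in the denominator and uses hypothetical names and toy values; it is meant only to show the computation, not to reproduce the authors' code.

```python
import numpy as np

def rd_bias(X_t, e, centers, spreads):
    """Centroid defuzzifier: z_k = sum_i mu_k(x_i) e_i / sum_i mu_k(x_i)."""
    diff = X_t[:, None, :] - centers[None, :, :]
    # Unnormalized membership of every target sample in every rule, shape (n_t, K).
    mu = np.exp(-0.5 * np.sum((diff / spreads[None, :, :]) ** 2, axis=2))
    # Note: if a rule is so far from all samples that its memberships underflow
    # to zero, this division is undefined; that case is not handled here.
    return (mu.T @ e) / mu.sum(axis=0)

# Same toy layout as a sparse-target scenario: samples only near the first rule.
centers = np.array([[0.1], [0.5], [0.9]])
spreads = np.full((3, 1), 0.1)
X_t = np.linspace(0.0, 0.2, 10).reshape(-1, 1)
e = np.ones(10)                       # residuals y_t - h_s(x_t), all equal to 1

z = rd_bias(X_t, e, centers, spreads)
```

Here every rule receives a bias of 1, including rules far from any target sample, because a weighted average of identical residuals is that same value. This is the global-residual behavior that distinguishes RD from the regularized LS solution.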
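The bound on the RD residual follows from two convex-combination arguments. The short derivation below spells out the intermediate steps; the weights $w_{ki}$ are notation introduced here for illustration, and the argument assumes nonnegative membership degrees with a nonzero sum for each rule.

```latex
% Step 1: each bias is a convex combination of the target residuals.
w_{ki} = \frac{\mu_{B_k}(x^t_i)}{\sum_{l=1}^{n_t} \mu_{B_k}(x^t_l)} \ge 0,
\qquad \sum_{i=1}^{n_t} w_{ki} = 1
\;\Longrightarrow\;
z_k = \sum_{i=1}^{n_t} w_{ki}\, e^t_i \in \big[e^t_{min},\, e^t_{max}\big].

% Step 2: the residual is a convex combination of the biases.
\lambda_k(x) \ge 0, \qquad \sum_{k=1}^{K} \lambda_k(x) = 1
\;\Longrightarrow\;
r(x) = \sum_{k=1}^{K} \lambda_k(x)\, z_k \in \big[e^t_{min},\, e^t_{max}\big]
\quad \text{for all } x.
```

No analogous argument applies to the LS solution, since the entries of a regularized least-squares solution are not constrained to be convex combinations of the residuals.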

III. EXPERIMENTS
In this section, we evaluate the performance of ResTL through experiments on toy examples, three public datasets, and our own pose-dependent tooltip dynamics dataset. The source code for ResTL is available on GitHub.

A. Experimental Setup
1) Methods for Comparison: To provide a more comprehensive study of the proposed method, we compare its performance with four state-of-the-art methods for the conditional shift problem.
Transfer Learning by Boosting (TLB) [28]: The TLB method, called two-stage TrAdaBoost.R2, which is a famous boosting-based regression transfer algorithm, trains the source data and target data together by adaptively adjusting the instance weights.
Residual Approximation (RA) [19]: The RA method builds the target model by computing the offset between the target data and the source model with GP. This method is widely used in conditional shift multisource fusion problems, such as the yield of vineyards [19] or surface measurement [41]. Note that the residual function of RA is trained on the target data directly without using the distribution information of the source data.
General Transformation Function (GTF) [42]: GTF is an algorithm-dependent hypothesis transfer learning method, which characterizes the relationship between the source and the target domains by establishing a GTF.
Domain Adaptation Under GeTarS [10]: GeTarS proposes to resample the source instances by reweighting or transforming to reproduce the distribution on the target domains. And the marginal distribution and conditional distribution are embedded to a reproducing kernel Hilbert space (RKHS).
ResTL_LS: The proposed ResTL method that computes the bias with the LS method.
ResTL_RD: The proposed ResTL method that computes the bias with the RD method.
Target: GP prediction using only the target data.
Source: GP prediction using only the source data.
2) Datasets: In this experiment, we first demonstrate the properties of the proposed ResTL and other methods on toy examples. Then, four real datasets about conditional shift transfer learning problems are adopted for performance comparison.
The TOOL dataset consists of tooltip dynamics of three milling tools (T 1 , T 2 , T 3 ) of a five-axes CNC machining center for 381 postures [15]. The features are the angle of axes of the machining center. The labels to be predicted are three dynamic parameters at the tooltip: w n (natural frequency), ξ (damping ratio), and K(stiffness). For each dynamic parameter prediction task, there are six transfer tasks: 1) T 1 → T 2 ; 2) T 1 → T 3 ; 3) T 2 → T 3 ; 4) T 2 → T 1 ; 5) T 3 → T 1 ; and 6) T 3 → T 2 . Therefore, there are 18 transfer tasks for TOOL datasets.
The PM2.5 dataset from UCI contains PM2.5 data for five Chinese cities: 1) Beijing; 2) Shanghai; 3) Guangzhou; 4) Chengdu; and 5) Shenyang. There is significant compatibility of yearly changes in air conditions due to the characteristics of each city, such as the weather or geographical configuration [21]. However, the PM2.5 data in each city can shift from year to year under the influence of traffic or the economy. Therefore, five transfer learning tasks were constructed from 2014 to 2015 for the five cities.
The Kin dataset from Delve aims to predict the distance of the end effector of a robot from 8-D inputs consisting of joint positions, twist angles, and other dynamics parameters. We build two transfer tasks: 1) kin8fm→kin8nm and 2) kin8nm→kin8fm.
The Bank dataset from Delve has the objective of predicting the rate of rejections of banks. We build two transfer tasks: 1) bank8fm→bank8nm and 2) bank8nm→bank8fm.
In total, there are 27 transfer learning tasks. For each task, experiments were carried out with target data sizes n_t = (5%/10%/15%/20%) × n_s.
3) Other Settings: The hyperparameters of ResTL, including the number of fuzzy rules K, the spread of the fuzzy set, and others, were determined with five-fold cross-validation on the source data. The fuzzy partition method for the toy examples and the TOOL dataset was FCM. However, FCM performed poorly on the high-dimensional datasets, because the prototypes tended to collapse to the center of gravity of the input data. Therefore, a deterministic method, Var-Part [45], was adopted for the Kin, Bank, and PM2.5 datasets. In addition, the TSK fuzzy system was selected as the base model of ResTL for the PM2.5 dataset, whereas a GP regressor was selected as the base model for the other datasets.

B. Toy Example
The toy example mainly serves as an illustrative demonstration of the characteristics of ResTL and other methods. The goal is to recover the target model with data from the source model (shown as the red curves in Figs. 2-5) and a few target data (blue points in Figs. 2-5). The synthetic curve functions are formulated based on previous research [10], [19], [28], [42]. For Figs. 2-5(a), the source model is designed as y = sin(7x) + 1 and the target data are drawn from y = x * sin(7x) + 0.2. For Figs. 2-5(b), the source model is designed as y = cos(x) * x + N(0, 0.15) and the target data are drawn from y = cos(x) * x + x + N(0, 0.15). Then, the characteristics of different algorithms can be explored clearly. More results of toy examples can be found in Appendix B.
1) As shown in Fig. 2(a), for those rules with no target data, the results of ResTL_LS almost approach the source model because the corresponding residual is kept close to zero. However, ResTL_RD can only provide a global residual with limited target data. The RA method [Fig. 2(b)] can provide similar results in regions where the target data are sufficient, but it becomes unreliable in regions where the target data are limited.
2) TLB can minimize the training error of both the source data and the target data, but the model easily overfits when the distribution of the target data is far from that of the source data, as shown in Fig. 3(a) and (b).
3) GTF can provide a good result when the assumption of a specific transformation is satisfied [Fig. 4(a)], but the simple function is usually not enough to minimize the training error on the target data, as shown in Fig. 4(b).
4) GeTarS can match the marginal distribution and conditional distribution in RKHS by reweighting or transforming the samples, as shown in Fig. 5(a) and (b). The stability of the distribution matching process suffers from the limited number of target data.

... with bold. From those tables, the following observations can be summarized.
1) Overall, the performance of ResTL_RD is better than that of ResTL_LS in most tasks. As discussed in Section II, this suggests that the source and target data in most transfer tasks of this experiment have a global conditional shift, for which ResTL_RD is more applicable, since ResTL_LS forces the target model to approach the source model in regions without target data.
2) ResTL_RD works better than the other methods in most tasks when the target data ratio is 5% (18/27 tasks), 10% (18/27 tasks), and 15% (16/27 tasks). However, neither ResTL_RD nor ResTL_LS shows an advantage when the target data size is up to 20% × n_s. The results demonstrate that ResTL is more effective when the target data are scarce.
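For readers who want to reproduce curves of this kind, the two synthetic source/target pairs can be generated as follows. Several details are assumptions rather than facts from the text: the input ranges, the sample counts, the random seed, and the reading of N(0, 0.15) as Gaussian noise with standard deviation 0.15 (it could equally denote a variance).

```python
import numpy as np

def source_a(x):
    """Source curve for example (a)."""
    return np.sin(7 * x) + 1

def target_a(x):
    """Target curve for example (a)."""
    return x * np.sin(7 * x) + 0.2

def source_b(x, rng, noise_std=0.15):
    """Source curve for example (b); noise_std = 0.15 is an assumed reading."""
    return np.cos(x) * x + rng.normal(0.0, noise_std, size=x.shape)

def target_b(x, rng, noise_std=0.15):
    """Target curve for example (b), shifted from the source by +x."""
    return np.cos(x) * x + x + rng.normal(0.0, noise_std, size=x.shape)

rng = np.random.default_rng(0)               # assumed seed
x_src = np.linspace(0.0, 1.0, 200)           # assumed input range and size
x_tgt = rng.uniform(0.0, 1.0, size=10)       # a few target points

y_src_a, y_tgt_a = source_a(x_src), target_a(x_tgt)
y_src_b, y_tgt_b = source_b(x_src, rng), target_b(x_tgt, rng)

# Residual of the target points against the noiseless source curve in (a).
e_a = y_tgt_a - source_a(x_tgt)
```

In example (a) the residual `e_a` equals (x - 1) sin(7x) - 0.8, so it varies with the input, while in example (b) the noise-free residual is simply x.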
For further comparison, the boxplots of MSEs are also drawn by 20 trials with random target data (5% and 10%). Some of them are shown in Figs. 6 and 7, and more results can be found in Appendix B. In these figures, the vertical axis refers to the MSEs. It can be found that the proposed ResTL is more effective for most tasks, although RA and GTF show advantages for some specific tasks. In addition, it is obvious that the distribution of the target data has a great influence on the performance of the transfer learning result.
2) Performance on Different Sizes of the Target Data: To further reveal the relationship between the transfer learning performance and the size of the target data, we performed experiments with the size of the target data from 2% up to 50% of n s . Part of the results is shown in Fig. 8 and others can be found in Appendix B.
For the transfer learning methods applied in the experiment, the MSEs went down as the target data increased, which indicates that more target data are beneficial to the performance of transfer learning. The MSEs of learning with only target data are much larger than those of the transfer learning methods when the size of the target data is limited, which means that transfer learning is indispensable in this condition. However, when the target data size increases to a certain level, most of the transfer learning methods lose their advantage and negative transfer may happen. Negative transfer is widely described as the situation in which transferring knowledge from the source domain has a negative impact on the target learner [46]; thus, it is defined relative to the target-only baseline. If there is an abundance of labeled target data, the target-only baseline may already be quite good. Thus, the knowledge from the source domain may be insignificant and the difference between domains could hurt generalization.
The influence of the target data size on the transfer learning result differs considerably among these methods. As shown in Fig. 8(a), ResTL_LS and ResTL_RD show great advantages when the target data are less than 25%; beyond that, RA and GTF perform better. In Fig. 8(b) and (c), the target-only model outperforms all other methods when the target data size is more than 20% of n_s, which means there is a clear boundary for negative transfer in these cases. In Fig. 8(d), ResTL_RD and ResTL_LS take the leading position for almost all sizes of target data. It can also be found that the performance of ResTL_RD is better than that of ResTL_LS for most tasks. As discussed in Section II, this suggests that the source and target data in most transfer tasks of this experiment have a global conditional shift, for which ResTL_RD is more applicable. In summary, ResTL was much more effective for most tasks in this experiment when the target data were limited.

3) Performance on Different Sizes of Fuzzy Rules:
In this section, we show the performance of ResTL_LS and ResTL_RD with different numbers of fuzzy rules. Eight transfer learning tasks were selected and performed with K = 1, 3, 4, 5, 6, 8, 10, 15, 20, respectively. In this experiment, the size of the target data was set to 10% of n_s. Fig. 9 shows the MSEs obtained by averaging 20 random trials. The experimental results show that ResTL_RD stayed robust over a wide range of the number of fuzzy rules, whereas ResTL_LS was much more sensitive to K. To guarantee the result of ResTL, K should be chosen properly; in practice, the number of fuzzy rules can be selected by cross-validation on the TSK fuzzy model of the source data.

IV. CONCLUSION
In this article, we proposed a new framework called ResTL that learns the target model for regression problems under conditional shift by preserving the distribution properties of the source data in a model-agnostic way. ResTL builds the target model by learning the residual between the two domains and reusing the antecedent parameters of the source fuzzy system. Furthermore, two algorithms, ResTL_RD and ResTL_LS, were introduced for different situations under the proposed framework. The proposed method has been evaluated comprehensively on both the toy example and several real-world datasets, including our own tooltip dynamics dataset. The results show that the proposed ResTL can provide better performance than existing methods in global conditional shift cases. In addition, we found that the proposed method was more effective when the target data were limited, but the performance was not satisfactory when the target data size further increased. Thus, in the future, we will focus on the negative transfer problem of the proposed method. Moreover, we will extend the proposed framework to multisource cases.

APPENDIX A TSK FUZZY SYSTEM
The TSK fuzzy system is an intelligent model defined with fuzzy logic, which is essentially a combination of submodels with good interpretability. It is a powerful tool for modeling complex nonlinear systems [43]. Suppose a TSK fuzzy system has $d$ inputs $x_1 \in X_1, \ldots, x_d \in X_d$, one output $y \in Y$, and $K$ rules. Then, the kth TSK fuzzy rule of the system has the IF-THEN form

$$\text{IF } x_1 \text{ is } A^k_1 \wedge \cdots \wedge x_d \text{ is } A^k_d \text{ THEN } y = f_k(x)$$

where $A^k_j$ is the fuzzy set of the jth input dimension under the kth rule. A fuzzy set is described by its membership function as

$$\mu_{A^k_j}: X_j \to [0, 1].$$

The shape of the membership function is up to the designer. A commonly used choice is the Gaussian membership function

$$\mu_{A^k_j}(x_j) = \exp\left(-\frac{\big(x_j - c^k_j\big)^2}{2\big(\sigma^k_j\big)^2}\right)$$

where $c^k_j$ and $\sigma^k_j$ are the center and the spread of the fuzzy set $A^k_j$. These two parameters are defined as the antecedent parameters, which can be calculated by fuzzy partition algorithms such as fuzzy c-means (FCM).
The membership degree of the fuzzy set $A^k$ can be taken as the multiplicative conjunction of the membership degrees of each dimension

$$\mu_k(x) = \prod_{j=1}^{d} \mu_{A^k_j}(x_j).$$

The normalized membership degree is given by

$$\lambda_k(x) = \frac{\mu_k(x)}{\sum_{k'=1}^{K} \mu_{k'}(x)}$$

where $\lambda_k$ denotes the degree of activation of the kth rule. Then, the output of the TSK fuzzy system can be formulated as a combination of submodels [44]

$$y = \sum_{k=1}^{K} \lambda_k(x) f_k(x).$$
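As a small worked example of the quantities defined in this appendix, the sketch below evaluates a two-rule, one-input TSK system by hand. The function names, the centers, spreads, and consequent values are all hypothetical, and the Gaussian membership form with a factor of 2 in the denominator is an assumption.

```python
import math

def gaussian_membership(x, c, sigma):
    """Membership degree of a scalar x in a Gaussian fuzzy set (assumed form)."""
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))

def normalized_memberships(x, centers, spreads):
    """Product over dimensions per rule, then normalization across rules."""
    mu = [
        math.prod(gaussian_membership(xj, cj, sj) for xj, cj, sj in zip(x, c, s))
        for c, s in zip(centers, spreads)
    ]
    total = sum(mu)
    return [m / total for m in mu]

def tsk_output(x, centers, spreads, consequents):
    """Weighted sum of linear submodels f_k(x) = p_k . [x, 1]."""
    lam = normalized_memberships(x, centers, spreads)
    x_e = list(x) + [1.0]
    return sum(
        l * sum(pj * xj for pj, xj in zip(p, x_e))
        for l, p in zip(lam, consequents)
    )

centers = [[0.0], [1.0]]                   # rule centers on a single input
spreads = [[0.5], [0.5]]
consequents = [[1.0, 0.0], [-1.0, 2.0]]    # f_1(x) = x, f_2(x) = -x + 2

lam = normalized_memberships([0.5], centers, spreads)
y = tsk_output([0.5], centers, spreads, consequents)
```

At x = 0.5 the input is equidistant from both centers, so both rules fire with degree 0.5 and the output is the average of f_1(0.5) = 0.5 and f_2(0.5) = 1.5, namely 1.0.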

APPENDIX B SUPPLEMENTARY RESULTS
In the supplementary results, Tables BI-BIII are the overall MSEs on different datasets with n t = (10%/15%/20%) × n s , respectively. Figs. B1 and B2 are the boxplots with 5% and 10% target data.