Explainable CNN with Fuzzy Tree Regularization for Respiratory Sound Analysis

Abstract—Auscultation is an important tool for diagnosing respiratory diseases. Unfortunately, the quality of auscultation is limited by the expertise of the physician and the environment in which it is performed. Some studies have focused on automated auscultation techniques; however, existing approaches suffer from two challenges: (1) the models cannot learn from data distributed among multiple hospitals; and (2) the predictions of the models are difficult for physicians to interpret. To address these issues, this work proposes a novel explainable respiratory sound analysis framework with fuzzy decision tree regularization. The framework develops an ensemble knowledge distillation technique to learn from distributed data and achieves good performance in terms of model efficiency and accuracy. Fuzzy decision trees are used to explain the predictions of the model and produce decision rules that are readily accepted by physicians. The effectiveness of the framework is thoroughly validated on the Respiratory Sound Database and compared with other existing approaches.

I. INTRODUCTION

By examining respiratory sounds during auscultation, medical practitioners can identify adventitious sounds (e.g., crackles or wheezes) within the respiratory cycle. For example, crackles are the earliest sign of idiopathic pulmonary fibrosis, and wheezes are usually associated with COPD and asthma [5]. However, the traditional way of detecting abnormalities in lung sounds with a stethoscope can be affected by various factors such as environmental noise, hearing fatigue, and lack of experience among junior doctors. In addition, traditional stethoscopes are not user-friendly when doctors must wear full Personal Protective Equipment (PPE) in highly infectious environments.
To overcome the limitations of conventional auscultation, researchers have developed digital signal processing techniques for analyzing lung sounds, which can be digitized and converted into signals in the time domain, the frequency domain, or a combination of both [6]. Because analyzing lung sounds solely in the time domain or the frequency domain is less effective, time-frequency techniques are more commonly applied to lung sound signal analysis [7], [8]. With the development of automated adventitious sound detection, algorithms such as the k-Nearest Neighbors (KNN) method, genetic algorithms, fuzzy logic, and the wavelet transform have been used to analyze respiratory sounds [9], [10]. In addition, several studies focus on automated adventitious sound detection or the characteristics of lung sounds [11]. Although traditional machine learning methods have contributed to the development of automated auscultation, their predictive accuracy in lung sound analysis is limited and needs to be improved further.
To achieve better predictive accuracy, deep learning-based methodologies have been proposed for respiratory sound detection [12]-[15]. Existing deep learning studies of respiratory sounds mainly convert lung sounds into spectrograms and then analyze them with Convolutional Neural Networks (CNN) or Recurrent Neural Networks (RNN) [15]. As CNN models have been successful in computer vision, image recognition, and audio analysis, many studies analyze respiratory sound data by first extracting Mel Frequency Cepstral Coefficients [12], Mel spectrograms [13], or local binary patterns from lung sound spectrograms [14], and then feeding these features into CNN models for training and prediction. In addition, RNNs are designed to discover temporal patterns and can be used for lung sound analysis. As an advanced extension of the RNN, the Long Short-Term Memory (LSTM) network has been developed for the detection of abnormal respiratory sounds and chronic/non-chronic diseases [15]. Although deep learning models can improve the predictive accuracy of respiratory sound analysis, their real-world deployment is constrained by a lack of interpretability. Deep learning models are intrinsically black boxes, and it is difficult for healthcare practitioners to fully trust predictions for which no explanation is available. In fact, as discussed in [16], the interpretability of a model is crucial in the medical and healthcare field.
To explain deep learning models, existing explainable machine learning research [17]-[21] can be categorized into intrinsic interpretable models and post-hoc interpretable models. For intrinsic interpretable models, the key is to directly construct a model that can be explained by its internal structure. Methods in this category include semantic representation constraints [17], rule-based methods [18], and attention mechanisms [19]. Although intrinsic interpretability has good explanatory power, it also degrades the discriminative capability of the model [20]. In contrast, post-hoc interpretable methods suffer less from this degradation by learning an additional explainer network [20] or by using traditional explainable models (e.g., decision trees) to approximate the original black-box model [21]. However, these explainable methods cannot be directly applied in healthcare scenarios for several reasons. First, the well-annotated data are distributed among multiple hospitals and cannot be shared directly due to privacy constraints. If the model learns from only one data source, the features learned from a small dataset typically fail to generalize to numerous patients [22]. Second, these models remain difficult for healthcare practitioners to understand.
In this work, we propose a novel framework of ensemble knowledge distillation with fuzzy logic that achieves both interpretability and predictive accuracy in auscultation scenarios where respiratory sound data are distributed (see Fig. 1). To handle the distributed data, multiple teacher models are trained on datasets from different sources, and these teacher models then transfer the learned patterns to a student model. Each teacher model can learn a large network structure from a single data source. The student model, with a smaller network structure, obtains the knowledge of the multiple teacher models through knowledge distillation, which exploits the distributed lung sound data and achieves more efficient data analysis. Owing to ensemble knowledge distillation, the student model can integrate and learn knowledge from multiple sources without directly touching the teacher models' datasets. For better explanation, the decision boundaries of the student model are approximated by a small fuzzy decision tree. In real-world practice, medical personnel reason about the condition of a disease using vague concepts rather than specific values [23]. As fuzzy logic is good at managing uncertain information [24], the fuzzy decision tree is a suitable method that is easy for people in other fields, such as doctors, to understand. Unlike decision trees with crisp decision boundaries, fuzzy decision trees handle uncertain information better and are more interpretable [25]. To use fuzzy logic in respiratory sound analysis, the respiratory sound data are first converted into Mel spectrograms, which represent the magnitude of sound energy. The fuzzy decision tree then uses the different frequency bands of the lung sound as features to approximate the predictions of the student model, which makes the decisions of the student model interpretable in a step-by-step manner.
To sum up, the contributions of this work are as follows:
• To learn an interpretable model from medical data that is distributed among multiple hospitals, this work proposes ensemble knowledge distillation to learn from multiple sources. The proposed model overcomes the poor generalization caused by training on a small single-source dataset. The distilled student model, with a more compact network structure, is also more efficient.
• To introduce interpretability into respiratory sound analysis, the proposed method incorporates fuzzy decision tree regularization into a CNN model such that each lung sound prediction can be explained by a sequence of fuzzy logic rules.
• The proposed framework is tested on real datasets, and the experimental results show that it outperforms state-of-the-art counterparts.

II. RELATED WORK

A. Respiratory Sound Analysis
To overcome the limitations of traditional lung sound auscultation, many machine learning algorithms are currently used in lung sound analysis, such as Artificial Neural Networks (ANN), Hidden Markov Models, and KNN [10]. Some research was motivated by cepstral features, using statistical properties of the cepstral coefficients for feature extraction and an ANN with a Multilayer Perceptron (MLP) as a classifier for three types of lung sounds: normal, wheezes, and crackles [26]. The automated lung sound analysis process needs to be patient-friendly; for this reason, a non-invasive electronic stethoscope system based on support vector machine and CNN algorithms was constructed [12]. To overcome the challenge of labeling large amounts of data, a semi-supervised algorithm has been proposed that trains two support vector machine classifiers to identify wheezes and crackles [22]. In addition, some work proposed an automated lung sound analysis platform to identify abnormal lung sounds that does not require labeled respiratory cycles [27]. However, these methods assume that there is only one source of lung sound data, and it is difficult to train a model from multiple data sources if these sources cannot contribute to each other. Unlike these studies, we use multiple teacher models to learn datasets from different sources, making the student model generalize better than learning from a single dataset. Moreover, the use of complex classification techniques and multiple computationally expensive features would limit their use on lightweight devices [1]. Through knowledge distillation, the knowledge of the teacher models is distilled into a student model, which reduces the number of parameters while retaining classification ability.

B. Knowledge Distillation with Multiple Teachers
The development of deep learning has produced complex models with large overhead, and applying these models in production requires substantial inference time. Knowledge distillation, based on the teacher-student framework, is an effective model compression method that seeks a trade-off between model accuracy and inference efficiency [28]. To improve the applicability of the knowledge distillation framework and the accuracy of the model, some researchers have extended distillation from a single teacher to multiple teachers. For example, multiple pre-trained teacher models can be assigned fixed weights to integrate their predictions [29]. Some researchers have used different teacher models to learn different types of inputs and then used a weighted average to teach the student model [30]. In addition, to accelerate the training of the student model in a word embedding task, multiple teacher models were used to train a student by combining their logit values such that the student no longer needs the teachers during decoding [31]. However, these approaches treat all teacher models equally, without taking the differences between them into account. To resolve the conflicts and competition among teachers, Shangchen et al. formulated ensemble knowledge distillation as a multi-objective optimization problem and assigned dynamic weights to each teacher model [32]. You et al. proposed a voting mechanism to unify multiple relative dissimilarity measures, which can be transferred into the student network [33]. These existing methods use multiple teacher models to learn category-fixed training samples. In this paper, we use multiple teacher models to learn local data distributed across multiple hospitals with flexible categories.

C. Explainable Machine Learning
There have been some interpretability works based on knowledge distillation in recent years. Some studies describe features via linear projections and univariate functions based on the additive index model [16]. However, this additive model may be biased toward selecting a few visual concepts rather than all of them. To overcome this typical bias-interpreting problem, researchers distilled knowledge from a pre-trained model into an explainable additive model [20]. Sometimes the interpretability of a network comes at the expense of its power; an explainer network has been trained to explain features inside a CNN in order to trade off network interpretability against network performance [34]. Knowledge distillation also provides an opportunity to explain the variables ultimately learned by a model. Other methods attempt to distill the knowledge of neural networks into tree structures. Gradient Boosting Trees (GBT) were used to mimic deep learning models and provide interpretable features and decision rules [35]. Some studies learned filters that make hierarchical decisions by training a soft decision tree [36]. Some researchers used model distillation to learn global additive explanations, which describe the relationship between input features and model predictions [37]. Wu et al. [21] approximated the decision boundaries of a deep model via a decision tree. Besides, a decision tree has been built to summarize an approximate explanation of CNN predictions at the semantic level [38]; however, it requires each CNN filter to represent a semantic concept, which limits network performance. Compared to previous work [21], our proposed framework can learn from multiple data sources through knowledge distillation while protecting data privacy. Moreover, the fuzzy decision tree in our work is close to the human reasoning process and more interpretable.

III. BACKGROUND
The combination of fuzzy set theory and decision trees produces the fuzzy decision tree, which handles uncertainty [39]. Fuzzy ID3 is a top-down algorithm for constructing fuzzy decision trees. In this paper, the fuzzy decision tree is constructed by the fuzzy ID3 algorithm [25].
For a dataset D with l attributes A = {A_1, ..., A_l}, n classes, and e fuzzy sets F_{i1}, F_{i2}, ..., F_{ie} for attribute A_i, the information entropy is

Entropy(D) = -\sum_{k=1}^{n} r_k \log_2 r_k,    (1)

where r_k denotes the membership-weighted proportion of the k-th class of samples in the current dataset D. The fuzzy ID3 algorithm splits on the attribute that maximizes the fuzzy information gain

G(A_i, D) = Entropy(D) - \sum_{j=1}^{e} \frac{|D_{F_{ij}}|}{\sum_{j=1}^{e} |D_{F_{ij}}|} Entropy(D_{F_{ij}}),    (2)

where \sum_{j=1}^{e} |D_{F_{ij}}| denotes the sum of the membership values for attribute A_i and D_{F_{ij}} is the fuzzy subset of D induced by fuzzy set F_{ij}. The pseudocode is shown in Algorithm 1.
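The fuzzy information gain of Eq. 2 can be sketched in a few lines of numpy. This is an illustrative implementation under the usual fuzzy ID3 definitions, not the paper's code; the function names and array layout are our own choices. Class proportions are membership-weighted, and each fuzzy set induces a child dataset whose memberships are the products of the parent memberships and the set's membership degrees.

```python
import numpy as np

def fuzzy_entropy(memberships, labels, n_classes):
    """Entropy of a fuzzy dataset (Eq. 1): the class proportions r_k
    are membership-weighted rather than simple counts."""
    total = memberships.sum()
    if total == 0:
        return 0.0
    ent = 0.0
    for k in range(n_classes):
        r_k = memberships[labels == k].sum() / total
        if r_k > 0:
            ent -= r_k * np.log2(r_k)
    return ent

def fuzzy_information_gain(memberships, fuzzy_sets, labels, n_classes):
    """Gain of splitting on an attribute with e fuzzy sets (Eq. 2).
    fuzzy_sets: (e, N) array of membership degrees of each of the
    N samples in each fuzzy set F_1..F_e."""
    base = fuzzy_entropy(memberships, labels, n_classes)
    child_memberships = memberships * fuzzy_sets          # (e, N)
    weights = child_memberships.sum(axis=1)
    weights = weights / weights.sum()                     # |D_Fij| / sum_j |D_Fij|
    cond = sum(w * fuzzy_entropy(m, labels, n_classes)
               for w, m in zip(weights, child_memberships))
    return base - cond
```

A split whose fuzzy sets perfectly separate the classes recovers the full entropy of the parent node as gain, matching the crisp ID3 special case.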

IV. OUR APPROACH
In this section, we first formally describe the problem encountered in respiratory sound analysis, and then propose ensemble knowledge distillation to learn from multi-source data and fuzzy decision tree regularization to provide model interpretation.

Algorithm 1 Fuzzy decision tree generation (fuzzy ID3).
1: procedure TREEGENERATE(node, D, A)
2:   if node satisfies the leaf node condition then
3:     node ← mark node as a leaf.
4:   else
5:     A* ← Select the attribute that maximizes the information gain (Eq. 2).
6:     for each fuzzy set F_j* of A* do
7:       D_j ← fuzzy subset of D induced by F_j*.
8:       if D_j is null then
9:         return node.
10:      end if
11:      child_j ← create a new child node from D_j and F_j*.
12:      node ← Connect TREEGENERATE(child_j, D_j, A\{A*}) as a branch node.
13:    end for
14:  end if
15:  return node
16: end procedure

A. Problem Formulation
Assume that well-annotated lung sound data are distributed across M different sources. For the i-th source, the dataset is defined as (X_{T_i}, Y_{T_i}), where each element x_n^{T_i} is composed of the time domain (td) and the frequency domain (fd). In the medical scenario, the lung sound categories of each source differ; we assume the category set of the i-th source is C_i ⊆ {1, ..., M}. For a hospital with dataset (X, Y), we want to learn a function f_S : R^{fd×td} → {0, 1} and thus obtain predictions together with some explanation mechanism. The problem is how to use these distributed data, which cannot be shared directly due to privacy.
For this purpose, we design a framework that guides the student model to learn the knowledge of multiple teacher models and then distills the knowledge from the student model into an explainable model. We are given M teacher models {T_i | i ∈ {1, ..., M}} and a student model S. In the learning task, teacher T_i learns dataset (X_{T_i}, Y_{T_i}), where the label of x_n^{T_i} is y_n^{T_i} ∈ Y_{T_i}. Let x_n^S denote an example in the dataset X of student model S, where y_n^S is its label. The number of elements in X is N.
Merging the knowledge of multiple teacher models and constructing an explainable strategy requires the following four steps: (i) train the teacher models, calculate the soft predictions with the pre-trained teachers, and expand the dimensions of the soft predictions calculated by the multiple teacher models; (ii) train the student model S with the soft labels and the fuzzy tree regularization term; (iii) update the tree regularization term with the surrogate model; and (iv) iterate the training process. The details of our framework are described in the following subsections.

B. Ensemble Knowledge Distillation
Teacher model. The teacher model is a classification network, and each teacher model T_i corresponds to the lung sound category set C_i. We can train the teacher model via the following loss minimization objective:

\min_{\theta_{T_i}} \frac{1}{N_{T_i}} \sum_{n=1}^{N_{T_i}} \ell\big(f_{T_i}(x_n^{T_i}), y_n^{T_i}\big),    (3)

where N_{T_i} is the number of elements in X_{T_i} and f_{T_i}(.) is the predictive function of T_i. Because the datasets from multiple sources cannot be shared directly, joint training of the teacher and student models is not applicable here [40]. Instead, the teacher models acquire their predictive ability during their own training, and we distill their knowledge via soft predictions. For each element x_n^S, the soft prediction q_k^{(i)} at temperature T is

q_k^{(i)} = \frac{\exp(z_k^{(i)} / T)}{\sum_{k' \in C_i} \exp(z_{k'}^{(i)} / T)},    (4)

where z_k^{(i)} is the logits-layer output of teacher model T_i for the k-th (k ∈ C_i) category. The larger the value of T, the smoother the soft prediction distribution [28].
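Eq. 4 is the standard temperature-scaled softmax; a minimal numpy sketch (the function name is ours) illustrates how larger T smooths the distribution:

```python
import numpy as np

def soft_prediction(logits, T=2.0):
    """Temperature-scaled softmax (Eq. 4): divide the logits by T
    before normalizing. Larger T gives a smoother distribution."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()
```

At T = 1 this is the ordinary softmax; as T grows, the output approaches the uniform distribution over the teacher's categories C_i.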
Transform soft labels for the student model. Since the category sets of the teacher models are not fixed, the dimensions of their soft predictions need to be transformed to the dimension of the student model before knowledge distillation. First, each soft prediction is extended by

q_j^{(T_i)}(x_n^S) = \begin{cases} q_j^{(i)} & \text{if } j \in C_i, \\ 0 & \text{otherwise}, \end{cases}    (5)

where j ∈ {1, ..., M} and q_j^{(T_i)}(x_n^S) denotes the transformed soft prediction. After extending the dimensions, the soft predictions are grouped together as the ensemble soft label π_j(x_n^S) for the student model:

\pi_j(x_n^S) = \sum_{i=1}^{M} w_i \, q_j^{(T_i)}(x_n^S),    (6)

where w_i ∈ [0, 1] denotes the weight for q_j^{(T_i)}(x_n^S). Correspondingly, the softened prediction of the student model is

p_j(x_n^S) = \frac{\exp(h_j(x_n^S) / T)}{\sum_{j'} \exp(h_{j'}(x_n^S) / T)},    (7)

where h_j(.) is the logits-layer output of the student model for the j-th category.
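The extension and weighted combination of Eqs. 5 and 6 can be sketched as follows. This is an illustrative implementation; the function names are our own, and the final renormalization is a choice we make here (it is a no-op when the weights w_i already sum to one):

```python
import numpy as np

def extend_soft_prediction(q_i, categories_i, n_total):
    """Eq. 5 (sketch): place teacher T_i's soft prediction over its own
    category set C_i into the student's full label space, zeros elsewhere."""
    q_full = np.zeros(n_total)
    for q_k, k in zip(q_i, categories_i):
        q_full[k] = q_k
    return q_full

def ensemble_soft_label(teacher_preds, teacher_cats, weights, n_total):
    """Eq. 6 (sketch): weighted combination of the extended predictions."""
    pi = np.zeros(n_total)
    for q_i, C_i, w_i in zip(teacher_preds, teacher_cats, weights):
        pi += w_i * extend_soft_prediction(q_i, C_i, n_total)
    return pi / pi.sum()   # harmless renormalization to a distribution
```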
On the one hand, we want to reduce the information gap between teacher and student through knowledge distillation; on the other hand, we do not want the student model to be biased toward the teachers, so its predictions should also stay close to the ground truth. The loss of ensemble knowledge distillation is

L(\theta) = \frac{1}{N} \sum_{n=1}^{N} \Big[ \lambda \, \ell\big(f(x_n^S, \theta), y_n^S\big) + (1 - \lambda) \, KL\big(\pi(x_n^S) \,\|\, p(x_n^S)\big) \Big],    (8)

where N is the number of elements in the dataset of S, f(., θ) is the predictive function of S, and θ denotes the parameters of S. λ ∈ [0, 1] is a constant calibrating the relative importance of the ground truth and the soft labels, and KL(.) denotes the KL divergence.
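A per-example sketch of the distillation loss (Eq. 8) follows. This is illustrative, not the paper's code; the T² factor on the KL term is the usual gradient rescaling from Hinton et al. [28] and is our assumption here:

```python
import numpy as np

def kd_loss(student_logits, hard_label, soft_label, lam=0.5, T=2.0):
    """Ensemble distillation loss for one example (sketch of Eq. 8):
    lam weighs the cross-entropy to the ground truth against the
    KL divergence from the ensemble soft label to the softened
    student prediction."""
    def softmax(z, t=1.0):
        z = np.asarray(z, dtype=float) / t
        z -= z.max()
        e = np.exp(z)
        return e / e.sum()
    p = softmax(student_logits)          # prediction at T = 1
    p_T = softmax(student_logits, T)     # softened prediction (Eq. 7)
    ce = -np.log(p[hard_label] + 1e-12)
    soft = np.asarray(soft_label, dtype=float)
    kl = np.sum(soft * (np.log(soft + 1e-12) - np.log(p_T + 1e-12)))
    return lam * ce + (1 - lam) * (T ** 2) * kl
```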

C. Fuzzy Tree Regularization for Student Model
The student model learns the knowledge of multiple teacher models through ensemble knowledge distillation. One problem remains: how to explain the student model, whose predictions cannot be easily simulated. Inspired by [21], we distill the knowledge from the student model into a decision tree via a tree regularization term Ω(θ). The tree regularization penalizes the student model and requires that its behavior can be simulated by a decision tree in a step-by-step manner.
Since the fuzzy decision tree approximates the predictions of the student model, its target is ŷ_n^S, the output of the student model for x_n^S. When the feature dimension of x_n^S is high, a fuzzy decision tree trained directly on it is too complex to simulate. We therefore replace x_n^S with a low-dimensional feature using a mapping function ∇ : X → D, where x_n^D ∈ D: we divide x_n^S into multiple regions and calculate the mean value (or maximum, standard deviation, etc.) of each region to obtain the example x_n^D. In the process of calculating Ω(θ), the example x_n^D is fuzzified into e fuzzy sets; for instance, for the strength of energy we can define the fuzzy sets low, medium, and high. The membership degree of the i-th (i ∈ {1, ..., e}) fuzzy set is calculated by a triangular membership function µ_i(.):

\mu_i(x) = \begin{cases} \dfrac{x - m_{i-1}}{m_i - m_{i-1}} & m_{i-1} \le x \le m_i, \\[4pt] \dfrac{m_{i+1} - x}{m_{i+1} - m_i} & m_i < x \le m_{i+1}, \\[4pt] 0 & \text{otherwise}, \end{cases}    (9)-(11)

where the m_i denote the parameters of the membership functions. The fuzzy decision tree is trained to fit the examples (x_n^D, ŷ_n^S). We prune the trained tree by Algorithm 2 to improve performance and reduce its size. When making a prediction for an input example x_n^D, the average number of decision nodes that must be traversed is the Average Decision Path Length (APL). The APL, as the output of Ω(θ), measures the complexity of the fuzzy decision tree; we define the function for calculating it as apl(.).
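The fuzzification step can be sketched with a small triangular-membership routine. This is an illustrative implementation; the saturation to full membership outside the outermost centers is our assumption for handling boundary values:

```python
import numpy as np

def triangular_membership(x, centers):
    """Membership degrees of scalar x in e fuzzy sets with triangular
    functions centered at `centers` (sketch of Eqs. 9-11). Between two
    adjacent centers the memberships interpolate linearly and sum to 1;
    beyond the first/last center, the boundary set saturates at 1."""
    c = np.asarray(centers, dtype=float)
    mu = np.zeros(len(c))
    if x <= c[0]:
        mu[0] = 1.0
    elif x >= c[-1]:
        mu[-1] = 1.0
    else:
        j = np.searchsorted(c, x)          # c[j-1] < x <= c[j]
        t = (x - c[j - 1]) / (c[j] - c[j - 1])
        mu[j - 1], mu[j] = 1.0 - t, t
    return mu
```

With the experiment-1 clustering centers from Section V, an energy value of exactly −71.9 dB belongs fully to the second ("medium") set, while a value halfway between two centers is shared equally between them.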

Algorithm 2 Fuzzy decision tree pruning.
1: procedure PRUNE(T)
2:   error ← mse(T).
3:   nodes ← Sort nodes from leaf to root.
4:   for each node in nodes do
5:     T' ← Remove node from T.
6:     error' ← mse(T').
7:     if error' < error then
8:       T ← T'; error ← error'.
9:     end if
10:  end for
11:  return T
12: end procedure

The loss function of the student model with the tree regularization term is

\min_{\theta} L(\theta) + \eta \, \Omega(\theta),    (12)

where η is a constant. However, tree regularization poses a challenge: Ω(θ) is not differentiable. If this term were added directly to the loss function of the student model, the loss could not be optimized by gradient descent. A remedy is to train a surrogate model to approximate Ω(θ) [21]. We use an MLP as the surrogate model, whose prediction Ω̂(θ) is used to estimate Ω(θ). In other words, the MLP is required to fit the dataset {θ_k, Ω(θ_k)}:

\min_{\phi} \frac{1}{K} \sum_{k=1}^{K} \big(\Omega(\theta_k) - \hat{\Omega}(\theta_k, \phi)\big)^2 + \varepsilon \, \|\phi\|_2^2,    (13)

where φ denotes the parameters of the MLP and ε > 0 is a regularization strength.
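The surrogate idea of Eq. 13 can be illustrated compactly. As an assumption for brevity, the sketch below fits a ridge-regularized linear model on flattened parameter vectors in place of the paper's MLP; the mechanism (fit a differentiable Ω̂ to observed (θ_k, APL_k) pairs) is the same:

```python
import numpy as np

def fit_surrogate(thetas, apls, eps=1e-3):
    """Fit a differentiable surrogate Omega_hat(theta) ~ APL(theta)
    (sketch of Eq. 13). A ridge-regularized linear model stands in
    for the paper's MLP; eps plays the role of the regularization
    strength epsilon."""
    X = np.column_stack([np.ones(len(thetas)), np.asarray(thetas, float)])
    y = np.asarray(apls, dtype=float)
    # Closed-form ridge solution: (X'X + eps*I)^-1 X'y
    w = np.linalg.solve(X.T @ X + eps * np.eye(X.shape[1]), X.T @ y)
    return lambda theta: float(np.r_[1.0, theta] @ w)
```

During student training, the returned Ω̂ replaces the non-differentiable APL in the regularized objective, so gradients can flow through it.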
In practice, Ω̂(θ) is substituted into Eq. 12 as the tree regularization term:

\min_{\theta} L(\theta) + \eta \, \hat{\Omega}(\theta, \phi).    (14)

Wu et al. [21] provide a detailed derivation for the tree-regularized MLP; here we directly use the solution in Eq. 14. The learning process of ensemble knowledge distillation is summarized in Algorithm 3.

Algorithm 3 Summary of our approach.
Input: data {x_n^{T_i}, y_n^{T_i}}_{n=1}^{N_{T_i}} for each teacher T_i; data {x_n^S, y_n^S}_{n=1}^{N} for the student model S.
Output: the student model S and the fuzzy decision tree T.
1: Initialize the teachers' parameters, the student parameters θ, and the MLP parameters φ; set l = 0.
2: Train each teacher T_i on its own dataset by Eq. 3.
3: repeat
4:   π(x_n^S) ← Compute the ensemble soft labels with Eq. 5 and Eq. 6.
5:   L(θ) ← Get the loss of ensemble distillation with Eq. 8.
6:   ŷ_n^S ← Get the prediction of the student by f(x_n^S, θ).
7:   x_n^D ← Get the data for T by the mapping function ∇ and the membership functions Eq. 9, Eq. 10, and Eq. 11.
8:   Fit a decision tree T on (x_n^D, ŷ_n^S) by Algorithm 1 and prune it by Algorithm 2.
9:   Update φ by fitting the surrogate with Eq. 13, and update θ by minimizing Eq. 14.
10: until convergence
11: return S and T

V. EXPERIMENTS

A. Dataset
The dataset we use is the Respiratory Sound Database created by two research teams in Portugal and Greece [41]. The database consists of a total of 5.5 h of recordings containing 6898 respiratory cycles, of which 1864 contain crackles, 886 contain wheezes, and 506 contain both crackles and wheezes, from eight types of subjects. The respiratory sound audio was recorded from 126 subjects, covering healthy subjects and seven respiratory diseases: Asthma, Bronchiectasis, Bronchiolitis, COPD, Lower Respiratory Tract Infection, Upper Respiratory Tract Infection, and Pneumonia.

B. Network Architecture
The network architectures of the teacher model and the student model are shown in Table I.

C. Experimental Setup
We designed two experiments to validate our method. In experiment 1, we select three categories of respiratory cycles from the Respiratory Sound Database: crackles, wheezes, and cycles that contain neither crackles nor wheezes (normal cycles). In experiment 2, we classify between three categories: COPD, Healthy, and Pneumonia. During data preprocessing, the lung sound recordings are first resampled to a target rate of 8000 Hz and converted to mono audio.
In all experiments, Mel spectrogram time-frequency features [42] are extracted from the audio with an FFT window length of 2048, 512 samples between successive frames, 128 Mel bands, and a Hann window. The time-frequency results are then split into 128 × 79 segments for model input. We use a learning rate of 0.001, a batch size of 128, 100 epochs, λ of 0.5, T of 2.0, and η of 1000. The training and test sets are divided using 5-fold cross-validation. The number of output categories C of the teacher models and the student model are both set to 3. In all experiments, the training set is augmented by changing the dynamic range and adding distribution noise.
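The Mel spectrogram extraction with the parameters above can be sketched in plain numpy. This is a self-contained illustration, not the paper's pipeline; note that the frame count depends on the framing convention (a library with centered framing would yield the 79 frames per 5 s mentioned above, while the non-centered framing below yields slightly fewer):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular Mel filters mapping FFT bins to n_mels bands."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mel_spectrogram(y, sr=8000, n_fft=2048, hop=512, n_mels=128):
    """Power Mel spectrogram in dB: Hann-windowed frames, rfft power,
    Mel filterbank projection, log scaling."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(y) - n_fft + 1, hop):
        seg = y[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(seg)) ** 2)
    S = np.array(frames).T                       # (n_fft//2+1, n_frames)
    M = mel_filterbank(sr, n_fft, n_mels) @ S    # (n_mels, n_frames)
    return 10.0 * np.log10(np.maximum(M, 1e-10))
```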
For each time-frequency result, we divide it into 16 parts in the frequency domain, each representing a frequency band as shown in Table II, and then calculate the mean value of each part to obtain 16 features (B_0, ..., B_15) as input to the decision tree. Each feature is then fuzzified into four fuzzy sets, indicating low, medium, high, and very high energy. In addition, we apply the self-organizing map algorithm [43] to determine four clustering centers as the parameters of the membership functions. The clustering centers are at −78.9, −71.9, −63.1, and −53.9 for experiment 1 and at −78.9, −58.7, −46.3, and −29.2 for experiment 2; the resulting fuzzy membership functions are shown in Fig. 2. In our experiments, we compared three models [12]-[14] on our preprocessed data. Similar to our processing, these methods use CNNs to analyze lung sounds through spectrograms.
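The band-feature mapping ∇ can be sketched as follows. One detail is our assumption: the text says the mean of each part is taken, and we average over both the Mel bins in a band and the time axis to obtain one scalar per band:

```python
import numpy as np

def band_features(mel_db, n_bands=16):
    """Collapse a (128, T) Mel spectrogram in dB into n_bands frequency
    bands (B_0..B_15): average over the Mel bins in each band and over
    time, giving one scalar feature per band for the fuzzy tree."""
    n_mels = mel_db.shape[0]
    per_band = n_mels // n_bands            # 8 Mel bins per band here
    return np.array([mel_db[i * per_band:(i + 1) * per_band].mean()
                     for i in range(n_bands)])
```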

D. Evaluation Metrics
The evaluation uses the metrics Sensitivity (SE), Specificity (SP), and Average Score (AS), as in [15], [44]. In experiment 1, for the confusion matrix shown in Table III, N_{(j)}^{(i)}, i, j ∈ {c, n, w} denote the numbers of classification results for the three categories, where the subscript j represents the ground truth and the superscript i represents the prediction of the model. SE and SP are calculated as

SE = \frac{N_{(c)}^{(c)} + N_{(w)}^{(w)}}{N_{(c)} + N_{(w)}}, \qquad SP = \frac{N_{(n)}^{(n)}}{N_{(n)}},    (15)

where N_{(j)} denotes the total number of samples whose ground truth is j.
In experiment 2, for the confusion matrix shown in Table III, co, h, and p denote the COPD, Healthy, and Pneumonia categories, respectively. SE and SP are calculated as

SE = \frac{N_{(co)}^{(co)} + N_{(p)}^{(p)}}{N_{(co)} + N_{(p)}}, \qquad SP = \frac{N_{(h)}^{(h)}}{N_{(h)}}.    (16)
AS is the mean of SE and SP:

AS = \frac{SE + SP}{2}.    (17)

In addition, we indicate whether each method requires the data to be assimilated before training.
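The three metrics can be computed directly from a confusion matrix. A small sketch (function name ours; `conf[truth][pred]` row convention assumed) covering Eqs. 15-17:

```python
import numpy as np

def se_sp_as(conf, abnormal, normal):
    """SE, SP and AS from a confusion matrix conf[truth][pred]:
    SE is the hit rate over the abnormal classes, SP the hit rate
    on the normal class, and AS their mean (Eqs. 15-17)."""
    conf = np.asarray(conf, dtype=float)
    se = sum(conf[i, i] for i in abnormal) / conf[abnormal].sum()
    sp = conf[normal, normal] / conf[normal].sum()
    return se, sp, (se + sp) / 2.0
```

For experiment 1 the abnormal classes are crackles and wheezes and the normal class is the normal cycle; for experiment 2 they are COPD and Pneumonia versus Healthy.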

E. Experiment 1: Respiratory Sounds Classification
In experiment 1, since 97% of the respiratory cycles are within five seconds, we extract sound clips with a fixed duration of five seconds; shorter cycles are zero-padded to reach this duration. Table IV evaluates the performance of the different methods for adventitious sounds. In Table IV, 'Aykanat', 'Bardou', and 'Tariq' denote our three comparison networks [12], [14], [13], respectively; 'Teacher' denotes the teacher model fit to the student dataset for three categories; 'No Distill' denotes a model with the student network architecture but without knowledge distillation or tree regularization; 'Distill' denotes a model with the student network architecture and knowledge distillation but without tree regularization; and 'Student' is our proposed framework. Compared with the other models, the teacher model performs well on all three metrics SP, SE, and AS. Distilling knowledge from the teacher models improves the performance of the student model, and our student model achieves a higher AS than 'Distill'. The tree regularization prevents the model from overfitting, as seen in the comparison with 'No Distill'. With both knowledge distillation and tree regularization, our student model balances the various evaluation metrics better than 'No Distill' and 'Distill'. Compared with the other three networks, our student model outperforms 'Aykanat', 'Bardou', and 'Tariq' by 3%, 2%, and 5% in AS, respectively. Besides, the student model is only 0.01 lower in SE than the teacher model and higher than the other models. The student model's performance on this metric, together with its ability to learn from multi-source data, makes it well suited to medical care.
Although the student model scores lower on the first three metrics than the teacher model, it has far fewer parameters and is therefore more efficient at prediction.
The decision tree generated by the model is shown in Fig. 3. Each leaf node of the decision tree gives the class with the highest prediction probability and that probability. We use the rate at which the predictions of the decision tree on the test set agree with the predictions of the student model as the confidence level of the decision tree; in this task, the confidence level of the fuzzy decision tree is 68%. The decision tree is an interpretable model whose decision rules can be extracted directly. Branches with a probability of less than 70% are pruned in order to focus on high-probability rules. To show the decision rules more visually, we list the rules with probability greater than 90% in Fig. 4.
The three types of test samples (crackles, normal, and wheezes) are shown in Fig. 7.

Fig. 3. Fuzzy decision tree of respiratory sounds classification. For branches with a probability greater than 90%, a branch number is added at the end to associate the branch with the relevant rule.

Each rule lists the frequency band, fuzzy energy level, and decision path of the sample. Through such a decision rule, the prediction of the model can easily be simulated by the doctor.

F. Experiment 2: Lung Diseases Classification
In this experiment, since the duration of most respiratory cycles is about 20 s, we split the recordings into fixed five-second sound clips to obtain segments of equal length. Table V reports the performance of our model and the comparison networks on the various evaluation metrics. Since our student model uses knowledge distillation and tree regularization, it achieves a higher AS than 'Aykanat' and 'Tariq'. More importantly, the student model is only 0.01 and 0.02 lower than 'Bardou' and the teacher model, respectively. Similarly, the student model achieves better SP than 'No Distill' and 'Distill' due to the combined effect of knowledge distillation and tree regularization. The student model does not require the data to be assimilated in order to learn from multiple data sources.
The decision tree generated by the model is shown in Fig. 5. The confidence level of the fuzzy decision tree is 84%, which indicates that its decisions can be trusted for most samples. The pruning process is the same as in experiment 1. As above, we list the decision paths with a prediction probability greater than 90% in Fig. 6. The three types of test samples (COPD, Healthy, and Pneumonia) are shown in Fig. 8. The fuzzy decision tree allows us to ignore the specific values of the decision boundaries, which is more in line with how we make decisions. Each decision path translates directly into a decision rule that is easy for physicians to understand.

VI. CONCLUSION
This work proposes a novel explainable CNN framework based on fuzzy decision tree regularization for respiratory sound analysis, which can learn distributed data from multiple hospitals and provide decision rules that can be simulated by physicians. In this framework, teacher models can learn non-fixed categories of data and transfer their knowledge to a student model. More importantly, fuzzy decision trees are used to handle the uncertainty in the decision process, thus improving the interpretability of the model. Decision rules generated by fuzzy decision trees in this form are more easily accepted by physicians. The framework is evaluated on the Respiratory Sound Database. The experimental results show that, compared to other methods, our framework can learn from distributed data, reduce the number of model parameters with little performance loss, and provide easily accepted, interpretable fuzzy decision trees. In future work, we aim to apply this framework to hospital auscultation systems.