Diagnose Like a Radiologist: Hybrid Neuro-Probabilistic Reasoning for Attribute-Based Medical Image Diagnosis

During clinical practice, radiologists often use attributes, e.g. morphological and appearance characteristics of a lesion, to aid disease diagnosis. Effectively modeling attributes as well as all relationships involving attributes could boost the generalization ability and verifiability of medical image diagnosis algorithms. In this paper, we introduce a hybrid neuro-probabilistic reasoning algorithm for verifiable attribute-based medical image diagnosis. There are two parallel branches in our hybrid algorithm, a Bayesian network branch performing probabilistic causal relationship reasoning and a graph convolutional network branch performing more generic relational modeling and reasoning using a feature representation. Tight coupling between these two branches is achieved via a cross-network attention mechanism and the fusion of their classification results. We have successfully applied our hybrid reasoning algorithm to two challenging medical image diagnosis tasks. On the LIDC-IDRI benchmark dataset for benign-malignant classification of pulmonary nodules in CT images, our method achieves a new state-of-the-art accuracy of 95.36\% and an AUC of 96.54\%. Our method also achieves a 3.24\% accuracy improvement on an in-house chest X-ray image dataset for tuberculosis diagnosis. Our ablation study indicates that our hybrid algorithm achieves a much better generalization performance than a pure neural network architecture under very limited training data.


INTRODUCTION
Due to their rapid progress in the past ten years, deep neural networks have achieved tremendous success in boosting the performance of image recognition [18], [19], [27], [50] and other visual computing tasks [38], [48]. Such progress has also propelled forward many related research areas, including medical image analysis. Nowadays, deep neural networks are widely used for medical image analysis and diagnosis. They have demonstrated unprecedented accuracy on a variety of tasks, such as the detection of acute intracranial hemorrhage in head CT images [29], breast cancer diagnosis using mammography [39], the detection of diabetic retinopathy in retinal fundus photographs [14], and the classification of skin cancers [9].
Despite the high accuracy deep neural networks have achieved in medical image diagnosis, they still exhibit two major limitations: insufficient generalization ability and insufficient verifiability. It is well known that deep neural networks require large amounts of training data. When high quality training data is scarce, as is typically the case in the medical image domain, the performance of trained deep neural networks on novel testing data may deteriorate significantly. In addition, deep neural networks typically produce a final result without showing how and why the result was reached. This is undesirable because very often computer-generated diagnostic results only serve as decision support for human doctors, who need to verify their credibility before accepting them.
To improve verifiability, a medical image diagnosis algorithm needs to learn the knowledge reasoning process followed by radiologists during clinical practice. They start from visual evidence in medical images and reach diagnostic conclusions by referring to causal relationships between diseases and the visual evidence. For example, radiologists use intuitive morphological and appearance characteristics of a pulmonary nodule, such as lobulation, spiculation, and texture, as visual evidence to assess whether the nodule is benign or malignant [21] (Fig. 1). In this paper, such characteristics of a lesion or, in general, radiological abnormalities in a medical image are defined as the attributes of the lesion or image. Thus, equipping a machine learning model with the capability of simultaneously predicting attributes as well as underlying diseases improves the causality and verifiability of the model, since radiologists can verify the predicted attributes as well as the knowledge reasoning process that goes from attributes to diagnostic conclusions.
In addition to causal relationships between diseases and attributes, there also exist relationships, such as co-occurrence, among the attributes themselves. Effectively modeling and exploiting the hidden relationships among these attributes, as well as the relationships between attributes and the underlying diseases, can boost the accuracy of attribute and disease prediction. Nonetheless, few methods in medical image diagnosis have exploited such relationships.
To address generalization ability, we need to look at the strengths and limitations of both neural and probabilistic learning algorithms for relational modeling and reasoning. Examples of neural algorithms include graph neural networks [26] and relation networks [51], while examples of probabilistic learning algorithms include Bayesian networks [47] and factor graphs [11]. Neural algorithms are good at learning from large amounts of empirical data. The trained models can interpolate and approximate known information very well to reach high accuracy, but have relatively poor generalization ability, causality and explainability. On the other hand, probabilistic learning algorithms, which can be considered a type of symbolic AI method, can learn from smaller datasets. The trained models have better generalization ability, causality and explainability, but lower accuracy when facing real-world empirical data.
[Fig. 1: a pulmonary nodule with annotated attribute grades — Subtlety (5), Internal Structure (1), Calcification (5), Sphericity (5), Margin (5), Lobulation (1), Spiculation (4.25), Texture (5).]
In this paper, to exploit the complementary strengths of neural and probabilistic learning algorithms and overcome the limitations of each, we introduce a hybrid neuro-probabilistic reasoning algorithm for verifiable attribute-based medical image diagnosis. Specifically, the proposed hybrid algorithm consists of multiple stages. It first employs a deep neural network backbone to extract features from an input image. These features build the foundation of the reasoning that happens in later stages. Next, a Bayesian network (BN) branch and a graph convolutional network (GCN) branch process the features extracted by the backbone and perform relational reasoning in parallel. The BN branch first converts features into attributes as evidence, then performs probabilistic reasoning using the causal relationships between the disease and attributes, and finally produces their marginal posterior distributions. On the other hand, the GCN branch performs feature-based relational modeling. It represents attributes and generic relationships among them using a graph. The GCN branch exploits the relationships among the graph nodes to enhance the feature of every attribute. More accurate diagnoses and attribute classifications can be achieved using the enhanced features from the GCN. Although the BN and GCN belong to two parallel branches, they are not completely independent. We tightly couple the BN and GCN branches so that the BN is utilized to improve the causality and generalization ability of the GCN while the accuracy of the GCN is retained. Such tight coupling is achieved via a cross-network attention mechanism and the fusion of their classification results. By fusing the results from the two branches, causality enforced by the BN branch may be partially compromised. Therefore, a second Bayesian network module is appended at the end of the entire network to reinforce causality.
The entire network except the Bayesian network modules can be trained using gradient back-propagation: although the Bayesian networks themselves are not updated by gradient descent, the inference phase of a Bayesian network is differentiable, so gradients can be propagated through the two Bayesian networks in our algorithm. However, we notice that the compatibility between the structure of the Bayesian networks and the rest of our hybrid network significantly affects the performance of the entire network. Thus, once we learn the initial structure of the Bayesian networks using ground-truth annotations, the training procedure proceeds by alternating between two phases. The first phase updates the CNN and GCN weights using stochastic gradient descent while the second phase updates the structures and parameters of the two Bayesian networks using the attribute classification results most recently predicted by their preceding stages in our hybrid network as training labels.
Note that verifiability is more stringent than explainability. For computer-generated diagnoses to be verifiable, the internal decision process of an algorithm not only should be explainable, but also needs to comply with the domain knowledge of human doctors. As mentioned earlier, in medical image diagnosis, radiologists start from visual evidence in medical images and reach diagnostic conclusions by referring to causal relationships between diseases and the visual evidence. Thus the algorithm needs to collect the same types of visual evidence as a doctor would, and also learn the causal relationships between evidence and diseases as a doctor would. The proposed algorithm in this paper models visual evidence as attributes and collects such evidence through attribute prediction. It further models causal and non-causal relationships using a BN and a GCN, which quantitatively encode such medical domain knowledge as "the presence of attribute A in this CT scan increases/decreases the likelihood that the patient has disease D" or "the occurrence of attribute A in this CT scan increases/decreases the chance that attribute B also occurs in the same scan". Every node in these networks (BN and GCN) represents a high-level concept (a disease or an attribute), and a disease diagnosis exploits the relationships between the disease and a set of other high-level concepts. In comparison to previous medical image diagnosis methods, radiologists may find the knowledge reasoning process of the proposed algorithm more consistent with their own practice and, consequently, the disease diagnoses made by our method more credible.
We have successfully applied our hybrid reasoning algorithm to two challenging medical image diagnosis tasks. The first task is benign-malignant classification of pulmonary nodules in chest computed tomography (CT) images in the LIDC-IDRI benchmark dataset. As mentioned earlier, for pulmonary nodule classification, morphological and appearance attributes have been commonly used to assist clinical diagnosis [21]. The second task is tuberculosis (TB) diagnosis using chest X-ray images. For this task, we adopt an in-house dataset carefully annotated by senior medical experts. Every image in this TB dataset is annotated not only with a disease diagnosis, but also with multiple types of radiological abnormalities, such as 'Pulmonary Consolidation', 'Pulmonary Cavitation', 'Diffuse Nodules', and 'Fibrotic Appearance', potentially caused by tuberculosis. According to the experimental results of pulmonary nodule classification on the LIDC-IDRI benchmark, our method achieves a new state-of-the-art accuracy of 95.36% and an AUC of 96.54%. On our in-house dataset for tuberculosis classification, our algorithm also achieves a 4.98% improvement in accuracy in comparison to the previously best-performing algorithm for attribute-based medical image classification.
Our main contributions in this paper can be summarized as follows.
• We introduce a hybrid neuro-probabilistic reasoning algorithm for attribute-based medical image diagnosis. To support causal and verifiable relational modeling and reasoning, this algorithm tightly couples Bayesian networks and a graph convolutional network. Such coupling is achieved via a cross-network attention mechanism and a classification result fusion scheme.
• We devise an effective training procedure for the proposed hybrid network. It alternates between two phases. The first phase updates the CNN backbone and GCN weights while the second phase updates the structures and parameters of the Bayesian networks.
• We have successfully applied the proposed hybrid reasoning algorithm to two representative medical image diagnosis tasks, benign-malignant classification of pulmonary nodules in chest CT images and tuberculosis diagnosis using chest X-ray images. The proposed algorithm achieves state-of-the-art performance on the LIDC-IDRI benchmark dataset for the first task and an in-house dataset for the second task.

Attribute Learning
Attributes, such as texture, color, and shape, are of great importance for describing objects. Attribute learning has been studied in computer vision for many years [1], [10], [28], [30], [33], [34], [42]. Ferrari et al. [10] proposed to use low-level semantic features for attribute representation, and they presented a probabilistic generative model for visual attributes together with an image likelihood learning algorithm. Human faces have many attributes and remain a challenging subject for attribute learning. Kumar et al. [28] trained binary classifiers to recognize the presence or absence of describable aspects of facial visual appearance using traditional hand-crafted features. Liu et al. [37] proposed a CNN framework that performs face localization and attribute prediction. Attributes have also been exploited in tasks such as zero-shot learning [23], [30]. Effectively modeling the hidden relationships among attributes is useful for improving the accuracy of attribute prediction and causal association. Nonetheless, most of the early works in attribute learning did not model relationships among attributes or explore such relationships for attribute reasoning. The development of graph neural networks (GNN) [26] made it possible to learn relationships among attributes. For example, Meng et al. [41] used message passing to perform end-to-end learning of image representations, their relationships, as well as the interplay among different attributes. They observed that relative attribute learning naturally benefits from exploiting the graph of dependencies among different image attributes. In this paper, we not only utilize a graph neural network to model the correlations among attributes, but also embed Bayesian networks into the whole framework to better model causality.

Bayesian Networks
Bayesian networks (BN), introduced by Judea Pearl [46], represent a natural approach to modeling causality and performing logical reasoning. Bayesian network learning includes two phases, structure learning and parameter learning. The most intuitive method for structure learning is 'search and score,' where one searches the space of directed acyclic graphs (DAGs) using dynamic programming and identifies the one that minimizes the objective function [58]. For parameter learning, the most frequently used method is maximum likelihood estimation (MLE) [46]. In recent years, improved BN models have been proposed for new applications in computer vision. For example, Barik [3] improved BN using low-rank conditional probability tables. Elidan et al. [8] used the Copula Bayesian Network model for representing multivariate continuous distributions. The method in this paper differs from the above work in that it exploits the representation power of deep neural networks and the causality modeling capability of BN by embedding them into a unified framework.

Bayesian Neural Networks
In addition to poor interpretability, deep neural networks also suffer from overfitting on limited labeled data and an inability to perform uncertainty analysis. To tackle these issues, researchers have combined Bayesian methods with deep learning to construct Bayesian neural networks (BNN) for a probabilistic representation of the uncertainty of the latter [12], [25], [56]. Specifically, a BNN estimates posterior distributions for the weights in a neural network. There exist a few estimation methods, such as variational inference [25] and dropout variational inference [12]. However, the proposed method in this paper differs significantly from Bayesian neural networks. Unlike BNNs, our proposed method aims to improve the causality and interpretability of deep learning via the fusion of deep learning and traditional graph-based probabilistic reasoning. It cannot perform uncertainty analysis. Nonetheless, due to the few trainable parameters in Bayesian networks, the proposed method can clearly mitigate overfitting on small training datasets, as demonstrated in one of our ablation studies.

Neuro-symbolic Learning
The integration of connectionist (neural network) and symbolic AI has long been a key issue in machine learning [32], [53], [54], [65], [66]. Early works [53], [54] primarily focused on the injection of symbolic information as prior knowledge into a neural network. More recently, with the widespread adoption of deep learning, researchers [32], [65], [66] started to combine the merits of deep feature representation with symbolic reasoning to build strong, reliable and explainable neural-symbolic AI systems. Relevant studies have been conducted on visual question answering (VQA) [65], semantic parsing [66] and handwritten formula recognition [32]. One of the key issues in neural-symbolic learning is that most symbolic approaches are not differentiable, making end-to-end training difficult. In this study, we partially solve this issue by performing gradient back-propagation through both Bayesian networks, which serve as the symbolic reasoning module in our hybrid network, for a better feature representation and a more efficient training procedure.

Overview
The proposed hybrid neuro-probabilistic reasoning algorithm takes an image patch as the input and outputs a classification result, as shown in Fig. 2. The proposed algorithm consists of the following three stages.
The Backbone Stage: a backbone based on ResNet [18] or EfficientNet [57] is utilized for deep feature representation, which builds the foundation for reasoning in later stages. Moreover, a feature pyramid network (FPN) [36] is integrated into the backbone for multi-scale feature aggregation (Fig. 2). A distinct global average pooling (GAP) operator is applied to the feature map at each scale of the FPN to reduce the number of parameters. To reduce information loss in low-level feature maps, we ensure that the dimension of the resulting feature from GAP is kept the same at different scales of the FPN. For instance, when the size of a feature map in the FPN is 64 × 64 × 256, we first reshape it to 32 × 16 × 2048 by concatenating the third dimensions of every 2 × 4 neighborhood in the first two dimensions, then use GAP to produce a 2048-dimensional feature vector. All GAP-reduced features go through parallel FC layers before they are fused via a sum operator and a subsequent FC layer to obtain the final feature, F_0, at the backbone stage.
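The reshape-then-pool step described above can be sketched in NumPy as follows (a minimal sketch; the function name and the fixed 2 × 4 folding pattern are our illustrative choices, not part of the method):

```python
import numpy as np

def scale_preserving_gap(fmap: np.ndarray, target_dim: int = 2048) -> np.ndarray:
    """Fold spatial neighbors into the channel axis, then global-average-pool.

    For a 64 x 64 x 256 map, every 2 x 4 spatial neighborhood is concatenated
    along the channel axis (256 * 2 * 4 = 2048), yielding a 32 x 16 x 2048
    map whose GAP produces a 2048-d vector regardless of the input depth.
    """
    h, w, c = fmap.shape
    factor = target_dim // c          # spatial cells folded per output vector
    fh = 2                            # fold 2 rows ...
    fw = factor // fh                 # ... and factor/2 columns (4 for 256 -> 2048)
    assert h % fh == 0 and w % fw == 0
    # (h, w, c) -> (h/fh, fh, w/fw, fw, c) -> (h/fh, w/fw, fh*fw*c)
    folded = fmap.reshape(h // fh, fh, w // fw, fw, c)
    folded = folded.transpose(0, 2, 1, 3, 4).reshape(h // fh, w // fw, fh * fw * c)
    return folded.mean(axis=(0, 1))   # global average pooling

feat = scale_preserving_gap(np.random.rand(64, 64, 256))
print(feat.shape)  # (2048,)
```

Because the folding is a pure permutation of the entries, no information is discarded before pooling, which is the stated motivation for keeping the GAP output dimension the same at every FPN scale.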
The Reasoning Stage: both a Bayesian network and a graph convolutional network are adopted for relational reasoning in parallel. The BN branch first converts deep features into attributes through attribute classification, then performs probabilistic causal reasoning using these attributes, and finally outputs marginal posterior distributions. The GCN branch first transforms the deep features from the backbone into a feature representation for individual attributes and represents the relationships among the attributes using an undirected graph, then enhances the feature of each attribute via graph convolutions. The BN and GCN are tightly coupled together through a cross-network attention mechanism and the fusion of their classification results, which will be described in Section 3.4.
The Second BN Module: to reinforce causality of the proposed hybrid algorithm, a second BN is appended at the end of the entire network.

Integration between BN and Backbone
A Bayesian network is incorporated in the reasoning stage to model probabilistic dependencies among various attributes for disease diagnosis. Specifically, a Bayesian network is defined by a directed acyclic graph (DAG) $G_B = (V_B, E_B)$, a conditional probability table (CPT) for each node, and $\Theta$, the set of all parameters (CPTs), which together encode the joint probability distribution of the BN. Each node $v_i \in V_B$ stands for a random variable, and a directed edge $e \in E_B$ between two nodes $(v_i, v_j)$ indicates that $v_i$ probabilistically depends on $v_j$. The training of a Bayesian network undergoes two phases, structure learning and parameter learning [7], [46], [47]. We adopt dynamic programming with the Bayesian information criterion (BIC) [58] for structure learning to determine the topology of the DAG and maximum likelihood estimation for parameter learning to determine $\Theta$. Our BN module first learns its structure and then updates all parameters, i.e. the conditional probability tables at all nodes in the network. In this paper, each attribute has a corresponding node in the Bayesian network. Without loss of generality, assume there is only one disease we wish to diagnose; that disease is an extra node in the network.
To integrate the Bayesian network with our backbone, the final feature from the backbone, $F_0$, is fed to another FC layer with weight matrix $W_B$ to obtain a feature map
$$F_B = W_B F_0 \in \mathbb{R}^{(N+1) \times C},$$
where $N$ is the number of attribute categories and $C$ is the number of grades for each attribute or disease. Each attribute must have at least two grades to indicate whether that attribute is present or not. In practice, there is often a need to have more than two grades for each attribute or disease to indicate intermediate levels of certainty or severity. Note that different grades of the same attribute are mutually exclusive while different attributes are not, because more than one attribute could be present at the same time. Thus, $F_B$ can be interpreted as the concatenation of $N+1$ $C$-dimensional vectors, one for each node in the Bayesian network. We further define
$$P^0_B = \sigma(F_B) \in [0, 1]^{(N+1) \times C},$$
where the softmax operator $\sigma$ is applied to each $C$-dimensional vector in $F_B$, and the result represents a discrete distribution over the grades of an attribute or disease. This distribution becomes the input evidence at the corresponding node in the Bayesian network.
Once we perform inference in the Bayesian network using the conditional probability tables as well as the input evidence at all nodes, we obtain marginal posterior distributions at all nodes, including the extra node for disease diagnosis, as the output of the Bayesian network. Let the nodes in the Bayesian network be $v_i, i \in \{0, 1, ..., n\}$. Without loss of generality, according to [46], the marginal posterior distribution $P_B(v_0)$ at node $v_0$ is formulated as
$$P_B(v_0) = \sum_{v_1, \ldots, v_n} P(v_0, v_1, \ldots, v_n), \qquad (1)$$
where
$$P(v_0, v_1, \ldots, v_n) = \prod_{i=0}^{n} P(v_i \mid Parents(v_i)), \qquad (2)$$
where $Parents(v_i)$ is NULL if $v_i$ does not have any parent nodes. The equation in (2) is derived using the local Markov property of Bayesian networks. In practice, to evaluate (1), we use the belief propagation algorithm in [45].
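To make this inference step concrete, the following is a minimal sketch for a two-node network (disease → attribute) with soft evidence at both nodes, computed by exact enumeration; the CPT values are hypothetical, and on a tree-structured network like this one belief propagation returns the same marginals:

```python
import numpy as np

# A minimal two-node Bayesian network: disease D -> attribute A.
# CPTs (hypothetical numbers, for illustration only):
p_d = np.array([0.7, 0.3])                 # P(D): benign, malignant
p_a_given_d = np.array([[0.9, 0.1],        # P(A | D = benign)
                        [0.2, 0.8]])       # P(A | D = malignant)

def posterior_d(evidence_d: np.ndarray, evidence_a: np.ndarray) -> np.ndarray:
    """Marginal posterior of D under soft (virtual) evidence at both nodes,
    computed by exact enumeration over the joint distribution."""
    joint = np.zeros(2)
    for d in range(2):
        for a in range(2):
            # local Markov factorization P(d, a) = P(d) P(a | d),
            # weighted by the soft evidence vectors at each node
            joint[d] += p_d[d] * p_a_given_d[d, a] * evidence_d[d] * evidence_a[a]
    return joint / joint.sum()

# uninformative disease evidence, attribute evidence strongly "present":
post = posterior_d(np.array([0.5, 0.5]), np.array([0.1, 0.9]))
print(post)  # posterior shifts toward malignant
```

Enumeration is exponential in the number of nodes, which is why the method uses belief propagation [45] in practice; the sketch only illustrates what quantity is being computed.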

GCN Module
In addition to the Bayesian network, we also adopt a graph convolutional network in the reasoning stage to perform feature-based relational modeling and reasoning among attributes to facilitate medical image diagnosis. Specifically, a GCN is based on an undirected graph G = (V G , E G ), where V G and E G are the set of nodes and edges respectively. Here, there is a node corresponding to each attribute or disease, and the node holds a feature vector associated with that attribute or disease. Again, assume there is only one disease we wish to diagnose, and the total number of nodes is N +1.
If vertices $v_i$ and $v_j$ are connected in the graph, certain types of relations may exist between them. We associate a learnable weight with every edge in the graph to indicate the strength of the connection. The GCN in our algorithm consists of $L$ layers, all of which share the same number of nodes, but the edge weights in these layers may differ. A general graph convolution operation $\mathcal{F}$ at the $l$-th layer can be formulated as follows,
$$G_{l+1} = \mathcal{F}(G_l, W_l) = Update(Aggregate(G_l, W^{agg}_l), W^{update}_l),$$
where $G_l$ is the graph at the $l$-th layer; $W^{agg}_l$ and $W^{update}_l$ are the learnable weights of the aggregation and update functions of the $l$-th layer, respectively. We use a max-pooling node feature aggregator to pool the pairwise differences of features at node $v_i$ and all of its neighbors, where the feature difference between $v_i$ and one of its neighbors is modulated by the edge weight between them before the pooling operation. The node feature updater is a multi-layer perceptron (MLP) with batch normalization and ReLU as the activation function. This MLP concatenates the original node feature with the aggregated feature as its input.
We adopt the residual graph convolution introduced in [31]. Let $H^G_l \in \mathbb{R}^{(N+1) \times D_l}$ be the feature map holding the features at all nodes of $G_l$, where $N$ is the number of attributes and $D_l$ is the feature dimension in the $l$-th layer. The residual graph convolution between the $l$-th and $(l+1)$-th layers of the GCN is formulated as follows,
$$H^G_{l+1} = \mathcal{F}(H^G_l, W_l) + H^G_l.$$
To integrate the above GCN with our backbone, we need to obtain $N+1$ features from the final feature in the backbone, one for each node in the GCN. For this purpose, $F_0$ is fed to another FC layer with weight matrix $W_G$ to obtain feature map
$$H_G = W_G F_0 \in \mathbb{R}^{(N+1) \times D_0},$$
where $N$ is the number of attribute categories and $D_0$ is the input feature dimension for every node in the GCN. $H_G$ thus can be interpreted as the concatenation of $N+1$ $D_0$-dimensional feature vectors, one for each node in the first layer of the GCN as input.
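One layer of this GCN, with the edge-weight-modulated pairwise-difference aggregator, the update step, and the residual connection, can be sketched as follows (a simplified NumPy sketch: the real updater is an MLP with batch normalization, which we collapse into a single ReLU layer here):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gcn_layer(H, adj_w, W_update, residual=True):
    """One residual graph convolution in the style described above.

    H: (n, d) node features; adj_w: (n, n) learnable edge weights (0 = no
    edge); W_update: (2d, d) weights of the (simplified) update MLP.
    """
    n, d = H.shape
    agg = np.zeros_like(H)
    for i in range(n):
        neighbors = np.nonzero(adj_w[i])[0]
        if len(neighbors) == 0:
            continue
        # pairwise differences, modulated by edge weights before pooling
        diffs = adj_w[i, neighbors, None] * (H[neighbors] - H[i])
        agg[i] = diffs.max(axis=0)                         # max-pooling aggregator
    # updater takes the concatenation of original and aggregated features
    out = relu(np.concatenate([H, agg], axis=1) @ W_update)
    return out + H if residual else out                    # residual connection

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))
adj = np.ones((4, 4)) - np.eye(4)      # fully connected graph, unit edge weights
H1 = gcn_layer(H, adj, rng.normal(size=(16, 8)) * 0.1)
print(H1.shape)  # (4, 8)
```

The residual form keeps the feature dimension fixed across layers, so stacking $L$ such layers leaves the $(N+1) \times D$ shape of the node feature map unchanged.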
The final feature map of the GCN is $H^G_L \in \mathbb{R}^{(N+1) \times D_L}$. We further use an FC layer followed by a softmax operator to obtain attribute and disease classification results,
$$P_G = \sigma(\mathrm{FC}(H^G_L)),$$
where $P_G \in [0, 1]^{(N+1) \times C}$ denotes the attribute and disease predictions.

Coupling between BN and GCN
Although both the BN and GCN are utilized for relational reasoning, they are not completely independent. In this work, we tightly couple the BN and GCN via a cross-network attention mechanism and the fusion of classification results. As for the former, the BN provides attention for the node features in each layer of the GCN. Specifically, the marginal posterior distributions at all nodes of the BN are taken as the input of $L$ node attention modules, one for each layer of the GCN, to figure out the relative importance of attributes in that layer. Node attention values prescribed by the $l$-th attention module are defined as follows,
$$M^l_{attn} = \phi_l(P_B(\mathcal{V})) \in \mathbb{R}^{N+1},$$
where $M^l_{attn}$ is the node attention vector for the $l$-th layer of the GCN, $\phi_l$ denotes the $l$-th node attention module, $\mathcal{V}$ includes all attribute and disease nodes of the BN, and $P_B(\mathcal{V}) \in \mathbb{R}^{(N+1) \times C}$ represents the concatenated marginal distributions at all nodes of the BN. Note that there is one attention value in $M^l_{attn}$ for each node in the $l$-th layer of the GCN. We name $M^l_{attn}$ the spatial attention map for GCN nodes since nodes are spatial entities of a graph, similar to pixels in a CNN.
We also extend the squeeze-and-excitation based channel-wise attention in SENet [20] to graph convolutional networks to further enhance the features of every layer in the GCN,
$$C^l_{attn} = \sigma_e\big(W^{ex}_l\, \delta(W^{sq}_l\, \bar{h}_l)\big),$$
where $\bar{h}_l \in \mathbb{R}^{D_l}$ denotes the node-averaged (squeezed) feature of the $l$-th layer, $\delta$ is the ReLU function, $\sigma_e$ is the sigmoid function, $C^l_{attn} \in \mathbb{R}^{D_l}, l \in \{1, 2, ..., L\}$, is the channel-wise attention vector for the $l$-th layer of the GCN, and $W^{sq}_l$ and $W^{ex}_l$ are the weights for squeezing and excitation, respectively.
Once both spatial and channel-wise attention vectors have been calculated, the same channel of all node features in $G_l$ is first multiplied by its corresponding channel-wise attention value, and then the entire feature at each node in $G_l$ is multiplied by its corresponding spatial attention value. Such spatially and channel-wise modulated features at the $l$-th layer of the GCN serve as the input to the next graph convolution, which produces the feature map $H^G_{l+1}$ at the $(l+1)$-th layer.
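The two-step modulation itself is a pair of broadcast multiplications, sketched below (the attention values are arbitrary placeholders, not learned):

```python
import numpy as np

def modulate(H, chan_attn, node_attn):
    """Apply channel-wise attention, then per-node (spatial) attention, to a
    GCN feature map H of shape (n_nodes, d)."""
    H = H * chan_attn[None, :]    # same channel of all node features scaled together
    H = H * node_attn[:, None]    # each node's whole feature scaled by its attention
    return H

H = np.ones((3, 4))
out = modulate(H,
               np.array([1.0, 0.5, 2.0, 1.0]),   # channel-wise attention C_attn
               np.array([0.1, 1.0, 2.0]))        # spatial (node) attention M_attn
print(out[0])  # first node's feature: channel weights scaled by 0.1
```

Because both operations are element-wise scalings, their composition is commutative here; the ordering stated above matters only when the attention modules themselves consume intermediate features.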
As for the fusion of the disease classification results from the BN and GCN, a residual fusion scheme is formulated as follows,
$$P^d_{fusion} = \sigma\big(P^d_G + w_B\, W_0\, \mathrm{Concat}(P^d_B, P^d_G)\big). \qquad (8)$$
For attribute classification results, the same residual fusion scheme is adopted as follows,
$$P^a_{fusion} = \sigma\big(P^a_G + w'_B\, W'_0\, \mathrm{Concat}(P^a_B, P^a_G)\big). \qquad (9)$$
In these equations, $P^d_B$ and $P^a_B$ are the marginal posterior distributions at the disease and attribute nodes in the BN, $P^d_G$ and $P^a_G$ are the probabilistic disease and attribute classification results from the GCN, $w_B$ and $w'_B$ are two learnable trade-off coefficients, $W_0$ and $W'_0$ are the weight matrices of two fully connected layers, and Concat() is the concatenation operator.
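One plausible instantiation of this residual fusion for the disease prediction is sketched below, under the assumption that the GCN prediction forms the residual path and the concatenated predictions feed one FC layer; the exact parameterization may differ:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def residual_fuse(p_g, p_b, w_b, W0):
    """Residual fusion sketch: the GCN prediction p_g is the residual path,
    and a learned combination of both predictions (an FC layer on their
    concatenation, scaled by the trade-off coefficient w_b) is added before
    renormalizing with softmax."""
    return softmax(p_g + w_b * (W0 @ np.concatenate([p_b, p_g])))

p_b = np.array([0.3, 0.7])               # BN disease posterior
p_g = np.array([0.6, 0.4])               # GCN disease prediction
fused = residual_fuse(p_g, p_b, 0.5, np.ones((2, 4)) * 0.25)  # toy W0
print(fused)
```

The final softmax guarantees the fused output is again a valid distribution over disease grades, which the second BN module then consumes as soft evidence.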

The Second BN Module
It should be noted that by fusing the results of both the BN and GCN branches, the causality enforced by the BN branch may be compromised to some extent. To tackle this issue, we append a second Bayesian network at the end of the entire network to reinforce causality. The fused attribute and disease soft classification results in (8) and (9) are used as the input evidence at the nodes of this second Bayesian network. The marginal posterior distribution at the disease node, denoted as $P^d_{final}$, can also be defined using equations similar to those in (1) and (2). In practice, it is also evaluated using the belief propagation algorithm. This marginal posterior distribution at the disease node of the second BN becomes the final disease prediction of the entire network. Note that the two Bayesian networks are distinct in both structure and parameters.

Training Scheme
Our entire network except the BN modules can be trained using stochastic gradient descent. Meanwhile, we observe that the inference phase of a BN is differentiable with respect to its inputs and parameters, but not its structure. That implies gradients can be backpropagated through a BN from its outputs (i.e. marginal posterior distributions) to its inputs (i.e. evidence). Parameter learning in a BN is tightly coupled with structure learning, and is driven by global statistics of the training set. Performing parameter learning separately from structure learning using stochastic gradient descent is unlikely to be very effective. Moreover, we discovered that the compatibility between the Bayesian network structures and the neural parts of our hybrid network significantly affects the overall performance. Therefore, we propose the following two alternating phases to train the entire hybrid network.
• The first phase updates the weights of the CNN backbone and the GCN branch using gradient back-propagation while fixing the structures and parameters of the two BNs. Gradients are simply backpropagated through these two BNs.
• The second phase updates the structures and conditional probability tables of the BN modules using the attribute and disease classification results most recently predicted by their preceding stages in the hybrid network as training labels. As for structure learning, dynamic programming is used with the same score function as in [58]. Once the structures of the BN modules are determined, their conditional probability tables are updated immediately using maximum likelihood estimation.
It should be noted that we set a maximum on the number of times the structures and parameters of the two BNs are updated (20 times in practice). Once the two phases have been alternated that many times, we fix the structures and parameters of the two BNs and only update the weights in the neural parts of our hybrid network. In this way, we can guarantee the convergence of our training process.
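The control flow of this alternating scheme, including the cap on BN refreshes, can be sketched as a toy schedule (the phase bodies stand in for real SGD updates and BN structure/parameter refitting):

```python
def alternating_schedule(total_rounds: int, max_bn_updates: int = 20):
    """Toy schedule of the alternating training procedure: each round runs a
    neural phase, then a BN refresh phase until the cap is reached, after
    which only the neural parts keep training (so the process converges)."""
    log = []
    bn_updates = 0
    for _ in range(total_rounds):
        log.append("neural")           # phase 1: SGD on CNN backbone + GCN, BNs frozen
        if bn_updates < max_bn_updates:
            log.append("bn_refresh")   # phase 2: BN structure (BIC + DP) and CPTs (MLE)
            bn_updates += 1
    return log

log = alternating_schedule(25)
print(log.count("bn_refresh"))  # 20
```

After the 20th refresh, every remaining round contains only the neural phase, matching the convergence guarantee stated above.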
The ground-truth annotations of the training samples are only used to train the initial structure and parameters of the BN modules. In subsequent iterations, for the first BN, the ground-truth attribute and disease annotations of a training sample are replaced with the attribute and disease classification results most recently produced by the backbone, i.e. $P^0_B$; for the second BN, they are replaced with the most recent fused attribute and disease classification results, $P^d_{fusion}$ and $P^a_{fusion}$ in (8) and (9). The effectiveness of this training strategy has been empirically verified. Potential reasons for the effectiveness are twofold: first, attribute classification results produced at intermediate stages of our hybrid network are soft labels, which are actually distributions and carry more complete information than ground-truth attribute labels represented as one-hot vectors; second, using soft attribute labels generated within the hybrid network makes the trained Bayesian network structures more compatible with the neural parts of the hybrid network.
We rely on the belief propagation algorithm to perform inference in a Bayesian network and compute the marginal posterior distributions for its variables. Since we train the weights in the neural networks by propagating gradients through the Bayesian networks, we need to obtain the gradient of the belief propagation algorithm. A method for computing this gradient is described in the Appendix.

Training Loss
During the first phase of the above alternating training procedure, we adopt the deep supervision strategy by distributing multiple supervision signals to various stages of our hybrid network. Most of the time, each supervision signal is primarily responsible for training a subset of the modules in the network. The overall loss function during the first phase of training is a weighted combination of five loss functions, where Σ_{i=1}^{5} w_i = 1. In our experiments, we always set w_i = 0.2, i ∈ {1, 2, ..., 5}.
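A minimal sketch of the weighted loss combination, assuming the five stage-wise losses have already been computed as scalars:

```python
# Equal weights w_i = 0.2 for the five supervision signals, summing to 1.
weights = [0.2] * 5
assert abs(sum(weights) - 1.0) < 1e-9

def overall_loss(losses):
    # losses: the five stage-wise loss values (e.g. L_B^0, L_G^0, L_d, L_a, ...)
    return sum(w * l for w, l in zip(weights, losses))

total = overall_loss([0.7, 0.6, 0.5, 0.4, 0.3])
```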
The first two supervision signals, corresponding to losses L_B^0 and L_G^0, are located at the beginning of the BN and GCN branches, respectively. Two classifiers are involved. One of them is simply P_B^0 = σ(F_B) ∈ [0, 1]^{(N+1)×C}, the input evidences at the nodes of the first Bayesian network. The cross-entropy loss for this classifier is written as

L_B^0 = − Σ_i Σ_j y_ij log p_ij^{B,0}.

The other classifier takes H_G as the input and performs both attribute and disease classification, producing P_G^0. Its cross-entropy loss is analogous,

L_G^0 = − Σ_i Σ_j y_ij log p_ij^{G,0}.

In the above two loss functions, the y_ij's denote the one-hot encoding of the ground-truth attribute or disease grades of training samples, and p_ij^{B,0} and p_ij^{G,0} are components of P_B^0 and P_G^0, respectively. These two classification losses disentangle attributes from the deep features of the backbone under the supervision of ground-truth labels. These two supervision signals are primarily responsible for training the backbone as well as the connection layers between the backbone and the BN or GCN.
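The stage-wise cross-entropy loss can be sketched in plain Python; averaging over the N + 1 classification heads is an assumed normalization, not one confirmed by the paper.

```python
import math

# Sketch of a loss such as L_B^0. P is the (N+1) x C matrix of predicted
# grade probabilities (N attributes plus the disease), Y the matching
# one-hot ground truth. Averaging over the N+1 heads is an assumption.
def cross_entropy(P, Y):
    total = 0.0
    for p_row, y_row in zip(P, Y):
        for p, y in zip(p_row, y_row):
            if y > 0:  # one-hot labels: only the true grade contributes
                total -= y * math.log(p)
    return total / len(P)
```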
The second group of two supervision signals is applied to the fused results of the BN and GCN in (8) and (9). Two cross-entropy losses, L_d and L_a, are imposed on the classification results of diseases and attributes, respectively. The first loss is imposed on the fused disease classification result, P_fusion^d in (8), and is formulated as

L_d = − Σ_j y_j^d log p_j^d,

where y^d denotes the one-hot encoding of the ground-truth disease grades of training samples, and p_j^d represents a component of P_fusion^d. The second loss, L_a, is imposed on the fused attribute classification results, P_fusion^a in (9), and is formulated as

L_a = − Σ_i Σ_j y_ij^a log p_ij^a,

where y^a denotes the one-hot encoding of the ground-truth attribute grades of training samples, and p_ij^a represents a component of P_fusion^a. Since we do not update the parameters of the BNs during the first phase of the training scheme, these two supervision signals are primarily responsible for training the backbone and the GCN.

MEDICAL IMAGE DIAGNOSIS
In this section, we apply our attribute-based hybrid reasoning algorithm to two challenging medical image diagnosis tasks. The first task is benign-malignant classification of pulmonary nodules in chest computed tomography (CT) images. The second task is tuberculosis diagnosis using chest X-ray images. We define attributes and perform attribute relationship learning and reasoning for both tasks. Following the common practice of previous attribute-based classification methods, not all attributes are used for training: we select attributes that have strong causal relations with the diseases and discard the unrelated ones.

Pulmonary Nodule Classification
Lung cancer causes the most cancer-related deaths worldwide [4]. Early diagnosis and treatment are of great importance to the long-term survival of lung cancer patients. Benign-malignant classification of pulmonary nodules in chest CTs is vital for the timely diagnosis of lung cancer [43]. In practice, radiologists often gain clinical insights from the connections between domain knowledge and clinical symptoms, as well as from the dependencies between diverse attributes and the underlying disease. This also applies to pulmonary nodule diagnosis. For instance, when a nodule has an irregular shape and presents obvious spiculation, it has a high probability of being malignant.

LIDC-IDRI Dataset
The LIDC-IDRI dataset [2] from the Cancer Imaging Archive (TCIA) is one of the largest publicly available lung cancer datasets. It has 1018 clinical chest CT scans obtained from seven institutions. Each scan is associated with an XML file that details the locations of nodules on each 512×512 slice. The diameters of the nodules range from 3mm to 30mm. Each suspicious lesion is categorized as a non-nodule, a nodule < 3mm, or a nodule ≥ 3mm in diameter. Adopting the same setting as in [6], [15], [61], [62], [64], we only consider nodules ≥ 3mm in diameter since nodules < 3mm are not considered to be clinically relevant by current screening protocols [6], [15], [16], [22], [52], [55]. The malignancy of each nodule was evaluated on a 5-point scale, from benign to malignant, by up to four experienced thoracic radiologists. Following the same procedure used in previous studies [6], [15], [61], [62], [64], we select those nodules annotated by at least one radiologist and calculate the median malignancy level (MML) for each of them. Nodules with an MML less than or higher than 3 are labeled as benign or malignant, respectively. To reduce the uncertainty of nodule malignancy evaluation, nodules with an MML of 3 (noted as 'uncertain') are excluded from our experiments, as in previous studies. Thus there are a total of 1301 benign, 612 uncertain and 644 malignant nodules. The distribution of nodules over their MML is shown in Table 3. Apart from malignancy, eight attributes are also graded for each nodule, i.e., subtlety, internal structure, calcification, sphericity, margin, lobulation, spiculation, and radiographic solidity. The grade of each attribute is between 1 and 5 except for internal structure (1-4) and calcification (1-6). We scale the grades of all attributes to 1-5 to maintain consistency, where 1 means the least obvious and 5 means the most obvious. Fig. 1 shows sample 2D slices with distinct attributes.
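Assuming a simple linear min-max mapping (the paper does not specify the exact rescaling), the grade normalization to the common 1-5 range could look like:

```python
# Rescale an attribute graded on g_min..g_max (e.g. internal structure 1-4,
# calcification 1-6) to the common 1..5 range used for all LIDC attributes.
# Linear min-max rescaling is an assumption, not the paper's stated mapping.
def rescale_grade(g, g_min, g_max):
    return 1 + (g - g_min) * 4 / (g_max - g_min)
```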
We also normalize all chest CTs to a unified voxel size of 1.0 × 1.0 × 1.0 mm³ using spline interpolation. In addition, following the same setting as in [61], [62], we take the location of a nodule to be the mean of the nodule centers annotated by the radiologists. For each nodule, we crop a 64 × 64 × 64 patch centered on its location.

Experimental Setup
In all experiments, we perform 10-fold cross validation, and each model is independently trained and tested 5 times with randomly initialized weights on each fold. The overall performance of a model is assessed with several commonly used metrics, including the mean and standard deviation of accuracy, sensitivity/recall, specificity, precision with the cut-off value of 0.5, F-score and AUC (area under the receiver operating characteristic curve).
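For reference, the listed metrics can be computed from binary confusion counts at the 0.5 cut-off:

```python
# Binary classification metrics from confusion counts:
# tp/fp/tn/fn = true/false positives and negatives at the 0.5 threshold.
def metrics(tp, fp, tn, fn):
    acc  = (tp + tn) / (tp + fp + tn + fn)
    sens = tp / (tp + fn)                    # sensitivity / recall
    spec = tn / (tn + fp)                    # specificity
    prec = tp / (tp + fp)                    # precision
    f1   = 2 * prec * sens / (prec + sens)   # F-score
    return acc, sens, spec, prec, f1
```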
All models are trained for 160 epochs from scratch using PyTorch [44] on NVIDIA Titan X Pascal GPUs, with Adam [24] as the optimizer; the initial learning rate is set to 1e-3 and reduced by a factor of 10 after every 30 epochs. The weight decay is set to 1e-4. Both ResNet and EfficientNet are adopted as backbone models. We have implemented their 3D versions for the LIDC dataset by revising the original 2D versions. The size of the input image patches is 64 × 64 × 64, and the batch size is 6 on a single GPU. Separate models are trained with and without data augmentation, which consists of standard operations: flipping, rotation, and random cropping.
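The step learning-rate schedule described above, as a small helper:

```python
# Step decay: start at 1e-3 and divide by 10 after every 30 epochs.
def lr_at_epoch(epoch, base_lr=1e-3, step=30, gamma=0.1):
    return base_lr * gamma ** (epoch // step)
```

This is equivalent in spirit to PyTorch's `StepLR(optimizer, step_size=30, gamma=0.1)`.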

Comparison with the State of the Art
We have compared our hybrid algorithm with several state-of-the-art models for pulmonary nodule classification on the LIDC-IDRI dataset, and the results are shown in Table 1, where O2 and O2* represent our models trained without and with standard data augmentation, respectively. The proposed method achieves the best performance under every evaluation metric when data augmentation is applied, and it is still ranked first under every evaluation metric except specificity when data augmentation is not used. Such performance demonstrates the effectiveness of the hybrid neural and probabilistic reasoning proposed in this paper, as well as the important role of attribute learning in medical image diagnosis. Specifically, method A adopts a multi-crop convolutional neural network trained on a subset of 825 nodules from LIDC and achieves an accuracy of 87.14%. Method B trains a 3D CNN with a subset of 1144 nodules and achieves a higher accuracy (91.26%). All methods from C to H train models using the same number of nodules, which is 1945. However, method H introduces an additional 1839 unlabeled nodules for semi-supervised learning and achieves the highest accuracy (92.53%) among methods C to H. Method I proposes a multi-scale 3D ResNet with a cost-sensitive loss and delivers higher performance (92.64%) than method H using a subset of 1712 nodules. In comparison, when adopting the same backbone (3D ResNet) as method I, our method (O1) achieves better performance (93.74%), indicating the importance of attribute relationship modeling in enhancing the performance of pulmonary nodule classification. Most importantly, Table 1 shows that when using an EfficientNet-B4-FPN backbone, our model with data augmentation (O2*) achieves the highest accuracy of 95.36%, and our model without data augmentation (O2) gains 4.41% over the baseline using the same backbone (M4).
In addition, Table 2 shows that under the training setting and testing protocol adopted in [59], our algorithm also achieves the highest accuracy, further demonstrating the effectiveness of our method. In this comparison, we adopt the same backbone (DenseNet) as in [59], and do not apply any data augmentation during training. During testing, we adopt the same "off-by-one" accuracy as in [59], which considers attribute/malignancy grading within ±1 of the ground truth as correct. Note that although we do not exploit the additional supervision signal provided by the nodule segmentation masks used in [59], our method still achieves the highest accuracy of 98.12%.
We have further compared with attribute relationship modeling methods including M31, M41, M32, and M42. For a fair comparison, methods in the same comparison always adopt the same backbone, which is either 3D ResNet-50 or EfficientNet-B4. The relational network module proposed in [51] and the GRU module proposed in [41] respectively achieve an improvement of 1.59% and 1.30% in accuracy when compared with the 3D ResNet-50 baseline while our method improves the accuracy of the same baseline by 3.63%. The same conclusion can be drawn for the EfficientNet-B4 backbone. We have also compared with two classic attribute learning methods T1 and T2, and they deliver a lower accuracy of 88.73% and 91.01%, respectively. In summary, all the above well-designed experiments demonstrate that the proposed hybrid neuro-probabilistic reasoning algorithm achieves clearly better performance than existing methods on the LIDC-IDRI dataset.
Sample input images and their associated attribute and disease classification results from our method and a few other baseline methods are shown in Fig. 3. In addition, we provide the importance of individual attributes for pulmonary nodule classification, i.e., quantitatively evaluating how important individual attributes are in making the prediction. To be specific, we measure how much the final disease classification probability is affected by deactivating one attribute at a time in the input signals to BN-1 and GCN. For the input signals to GCN, an attribute can be deactivated by setting its corresponding feature vector to an average feature for that attribute over the samples where that attribute attains the lowest possible grade. For the input signals to BN-1, since each attribute in the LIDC-IDRI dataset is associated with 5 grades, deactivation means assigning the lowest grade (1) to the attribute.
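The deactivation-based importance measurement can be sketched as below; `predict_disease` is a hypothetical stand-in for a forward pass of the hybrid network, and for the BN-1 inputs deactivation assigns the lowest grade (1):

```python
# Hedged sketch of attribute-importance measurement: deactivate one
# attribute at a time and record the change in the disease probability.
def attribute_importance(predict_disease, attributes):
    base = predict_disease(attributes)
    importance = {}
    for name in attributes:
        perturbed = dict(attributes)
        perturbed[name] = 1  # deactivation: assign the lowest possible grade
        importance[name] = abs(base - predict_disease(perturbed))
    return importance

# Toy stand-in model: a weighted sum of two attribute grades.
toy_model = lambda a: 0.1 * a["spiculation"] + 0.02 * a["margin"]
imp = attribute_importance(toy_model, {"spiculation": 4, "margin": 3})
```

With this toy model, spiculation unsurprisingly comes out as the more important attribute.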

Ablation Study
We have performed a systematic ablation study on the LIDC dataset to verify the effectiveness of the components in our framework. The results are shown in Table 4. Specifically, we have explored the effects of both the first and second BN (BN-1 and BN-2), cross-network attention and residual fusion (CNA-RES), SE channel-wise attention (SEatt) [20], gradient backpropagation through BN (GradBN), BN and DNN alternating training (AlterTrain) and the Relational Network Module (Relation) [51] on the EfficientNet-B4 backbone. Table 4 shows that all proposed components or strategies contribute to the final performance of our algorithm. More specifically, we found that 'CNA-RES' is the most important component: the performance would drop by 1.67% if it were removed, which verifies the effectiveness of our proposed network coupling strategy. When both 'BN-2' and 'CNA-RES' are removed and the final result of the network is computed as a simple average of the results from BN-1 and GCN, the network becomes an ensemble model integrating the BN-1 and GCN branches. The performance drops significantly, by 3.2%, under this ensemble setting. This result indicates that the components of our hybrid framework are highly coupled, and together they deliver performance much better than a simple ensemble model. 'GradBN' and 'AlterTrain' also have a significant impact: the performance drops by 1.5% and 1.0%, respectively, if either of them is removed. In addition, completely removing BN-1 and its coupling with GCN would decrease the performance by 2.90%, while removing BN-2 alone would decrease the performance by 0.57%. Moreover, the performance would drop by 3.16% when both BN modules are removed and a relational network is appended after the GCN module. This result demonstrates the strong dependency modeling ability of the BN modules. Fig. 4 shows the structure of the first BN module (BN-1) learned from the LIDC-IDRI dataset.
This learned BN structure indicates that the node for malignancy has one parent node (i.e., spiculation) and four children nodes (i.e., subtlety, calcification, margin and lobulation).
We can draw the following conclusions from the results given in Table 4:
• All the proposed components and training strategies are important to the classification performance on the LIDC-IDRI dataset. Note that removing 'AlterTrain' actually means using ground-truth attribute labels instead of soft attribute labels for training the BN modules, because the BN structures will not change if we always use ground-truth attribute labels during alternating training.
• BN modules are vital for attribute relational modeling. Their performance surpasses that of the advanced relational network.
• Although a simple ensemble model consisting of BN and GCN can already improve the performance, our proposed cross-network attention mechanism takes full advantage of both BN and GCN and thus performs better.

Tuberculosis Diagnosis
Chest radiographs are a type of medical image that can be conveniently acquired for disease diagnosis. Radiologists can quickly identify many diseases by examining chest X-ray images, which is useful for early diagnosis and intervention. Among infectious diseases, tuberculosis [35] has the highest mortality worldwide. However, if it is detected at an early stage from chest radiographs, the death rate can decrease by 70% [17]. Tuberculosis is a bacterial infection that can give rise to multiple types of radiological abnormalities in chest radiographs, such as diffuse nodules and fibrotic appearance. Just like the morphological characteristics of lung nodules, these radiological abnormalities can be treated as attributes that help medical experts diagnose tuberculosis. Therefore, we choose tuberculosis diagnosis as an additional task to further evaluate the proposed hybrid neuro-probabilistic reasoning algorithm, which models the dependencies between tuberculosis and radiological abnormalities to improve the reasoning capability of deep learning methods.

TB-Xatt Dataset
For the task of tuberculosis diagnosis, we use an in-house dataset named TB-Xatt containing a total of 14200 frontal chest X-ray images. Every image in this dataset has been carefully annotated by certified and experienced radiologists, and there are 1326 annotated images for each type of disease or radiological abnormality. The distribution of images over the disease and radiological abnormalities is presented in Table 5. In this study, we aim to improve the classification performance for 'Pulmonary Tuberculosis' with the help of seven types of radiological abnormalities, which are treated as the attributes of a chest X-ray image in our hybrid algorithm, since these seven types of radiological abnormalities are the main manifestations of active tuberculosis in the lung. Fig. 5 shows sample images with these attributes and the disease from the TB-Xatt dataset. Note that the image samples in the dataset are pre-processed versions of the original X-ray images. We first perform lung segmentation in the original X-ray images and extract bounding boxes of the left and right lungs as shown in Fig. 6, then resize these bounding boxes to 512 × 256 and place each pair of left and right bounding boxes side by side as a single image sample. In all experiments, we perform 10-fold cross validation, and each model is independently trained and tested 5 times with randomly initialized weights on each fold, the same as on LIDC. The performance of a model is also assessed with the same set of metrics as for LIDC, including accuracy, sensitivity/recall, specificity, precision, F-score and AUC.

Experimental Setup
Different from LIDC, the ResNet and EfficientNet backbones are pre-trained on ImageNet. Adam is again the optimizer, with the initial learning rate set to 1e-3, and the batch size is 32. All models are trained for 60 epochs, and the learning rate is reduced by a factor of 10 after 20 and 40 epochs.

Comparison with the State of the Art
On our in-house TB-Xatt dataset for tuberculosis diagnosis, we have also compared our hybrid algorithm with existing methods, including two state-of-the-art models (S1 and S2), two attribute learning methods (A1 and A2) and two relationship modeling methods on two backbones (M31, M41 and M32, M42). The results are shown in Table 6, where O2 and O2* represent our models trained without and with standard data augmentation, respectively. Regardless of whether data augmentation is used, the proposed algorithm achieves the best performance under all evaluation metrics, including accuracy, sensitivity/recall, specificity, precision with the cut-off value of 0.5, F-score and AUC. Specifically, both A1 and A2 are classic attribute learning methods that use handcrafted features, and they achieve accuracies of 86.14% and 87.23%, respectively. When compared with the state-of-the-art disease classification models S1 and S2, the proposed hybrid algorithm demonstrates better performance with 4.01% and 3.70% improvements in accuracy, respectively. Furthermore, as in the pulmonary nodule classification task, we implemented the two attribute relationship modeling methods (M31, M41 and M32, M42) based on the ResNet-50 and EfficientNet-B4 backbones. The results indicate that our method achieves better performance due to its strong causality modeling and relationship reasoning capabilities. The classification accuracies of our method with the ResNet-50-FPN and EfficientNet-B4-FPN backbones are 94.67% and 97.13%, which respectively represent gains of 3.44% and 2.94% over the two baselines using the same backbones. The results in Table 6 demonstrate the effectiveness of our proposed algorithm on the tuberculosis diagnosis task. Sample input images and their associated attribute and disease classification results from our method and a few other baseline methods are shown in Fig. 7, where we have binary attribute and disease classifications. We also evaluate the importance of individual attributes for pulmonary tuberculosis diagnosis.
Similarly, we measure how much the final disease classification probability is affected by deactivating one attribute at a time in the input signals to BN-1 and GCN. In this case, for the input signals to BN-1, deactivation means setting the classification probability of a specific attribute to zero. The deactivation scheme for the input signals to GCN is the same as that for the LIDC-IDRI dataset.

Ablation Study
We have also conducted an ablation study on the TB-Xatt dataset to verify the effectiveness of the components in our proposed framework. The results are shown in Table 7. As in the pulmonary nodule classification task, EfficientNet-B4 is the chosen backbone, and performance is measured when individual components are removed or replaced. Similar patterns in performance have been found here as well, and all the considered components contribute to the final performance of our hybrid algorithm.
Moreover, we have also tested the performance of our algorithm under the condition of a limited training set by sampling increasingly smaller subsets of the TB-Xatt dataset as the training set while keeping the size of the validation and testing sets unchanged. The sampling settings are shown in Table 9, where P1 means all original training samples, and P2, P3 and P4 mean 75%, 50% and 25% of the original training samples, respectively. For a fair comparison, we perform stratified sampling and ensure the same ratio between the number of positive and negative samples under all settings. This ratio is always 1/8 for the training sets, and 1/3 for the validation and testing sets. All results are shown in Table 8, which indicates that the proposed algorithm still performs better than the baselines when given a decreasing number of training samples. Note that all baselines adopt EfficientNet-B4 as the backbone since we empirically found it to be the best backbone under different settings. Furthermore, the performance gap between our algorithm and its corresponding baseline widens as the number of training samples decreases. When only 25% of the original training samples are used, this performance gap reaches 8.87% for the EfficientNet-B4 backbone. More surprisingly, although the number of training samples used in P4 is only half of that in P3, the performance of our models trained in the P4 setting is comparable and sometimes even better than that of the corresponding baseline models in the P3 setting. These results demonstrate the strong generalization ability of our models trained using a small dataset.
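A sketch of the stratified subsampling used to build the P2-P4 settings, preserving the 1/8 positive-to-negative training ratio (toy sample ids, not the actual dataset):

```python
import random

# Sample a fraction of the training set while preserving the ratio between
# positive and negative samples (1/8 for the TB-Xatt training sets).
def stratified_subsample(positives, negatives, fraction, seed=0):
    rng = random.Random(seed)
    n_pos = int(len(positives) * fraction)
    n_neg = int(len(negatives) * fraction)
    return rng.sample(positives, n_pos), rng.sample(negatives, n_neg)

pos = list(range(100))   # toy positive ids (ratio 100:800 = 1/8)
neg = list(range(800))   # toy negative ids
p3_pos, p3_neg = stratified_subsample(pos, neg, 0.5)  # the 50% (P3) setting
```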
We further investigate the classification performance of the individual output signals from the BN-1 and GCN modules to show how the two mechanisms work collaboratively as the size of the training data varies. As shown in Table 10, (1) when the training dataset is small, BN-1 achieves better performance than GCN; (2) when the training dataset is large, GCN achieves better performance than BN-1. The explanation is twofold: (1) when the training dataset is small, BN-1 probabilistically models the causal relationships between attributes and diseases to mitigate the problem of insufficient data; (2) when the training dataset is large, GCN can learn expressive and powerful feature representations from the large-scale annotated dataset. Moreover, we notice that our residual fusion scheme always achieves higher accuracy than either individual network (BN-1 or GCN). It has become clear that the strong generalization ability of our trained models is made possible by the complementarity between our neural and probabilistic reasoning algorithms. The neural branch realizes its full power when the training set is large, while the probabilistic branch prevents overfitting when the training set is small, as probabilistic models are better at learning from smaller datasets.

CONCLUSIONS AND DISCUSSIONS
In this paper, we have introduced a hybrid neuro-probabilistic reasoning algorithm for verifiable attribute-based medical image diagnosis. There are two parallel branches in our hybrid algorithm, a Bayesian network branch performing probabilistic causal relationship reasoning and a graph convolutional network branch performing more generic relational modeling and reasoning using a feature representation. Tight coupling between these two branches is achieved via a cross-network attention mechanism and residual fusion of classification results. We train the hybrid network by alternately updating the neural network weights and the structures/parameters of the Bayesian networks. We have successfully applied our hybrid reasoning algorithm to two challenging medical image diagnosis tasks, benign-malignant classification of pulmonary nodules in chest computed tomography images and tuberculosis diagnosis using chest X-ray images. On the LIDC-IDRI benchmark dataset for pulmonary nodule classification, our method achieves a new state-of-the-art accuracy of 95.36% and an AUC of 96.54%. Our method also achieves a 2.94% accuracy improvement on the in-house tuberculosis diagnosis dataset. Our ablation study indicates that our hybrid algorithm achieves much more robust performance than a pure neural network architecture under very limited training data.

APPENDIX A GRADIENT OF BELIEF PROPAGATION
Belief propagation is an iterative algorithm, where nodes exchange messages to update their marginal distributions until convergence. For a predefined Bayesian network without any loops, there exists a maximum number of iterations that guarantees convergence. The messages exchanged during belief propagation are computed using sums of products, which only involve additions and multiplications. Therefore, the entire belief propagation process is differentiable.
To be able to propagate gradients through Bayesian networks, we need to compute the gradient of all converged marginal distributions with respect to the network inputs, which are evidences at a subset of the nodes. Since both network inputs and outputs are multi-dimensional, the gradient we wish to compute is a Jacobian matrix. In the rest of this section, we outline an approach to compute this gradient. As there exist multiple variants of the belief propagation algorithm, without loss of generality, we formulate the gradient for the original belief propagation algorithm proposed by Pearl [45], [47] for polytrees. Let x be the variable at a typical node X, which has M children nodes, {Y_1, ..., Y_M}, and N parents, {U_1, ..., U_N}. The current marginal distribution at X is denoted as P_B(x). There are two types of incoming messages at X. The first type, λ_{Y_j}(x), is received from one of its children, Y_j. It represents the current strength of the diagnostic support contributed by the outgoing link X → Y_j. The second type, π_X(u_i), is received from one of its parents, U_i. It represents the current strength of the causal support contributed by the incoming link U_i → X. P_B(x), λ_{Y_j}(x) and π_X(u_i) are all vectors of probabilities, representing discrete marginal or conditional distributions. Note that the evidence at a node can be modeled as an incoming message from an auxiliary child node. To facilitate gradient formulation, we synchronize all messages using time steps, and every message has a time stamp. At each time step, every node updates its marginal distribution and outgoing messages as follows.
P_B^t(x) = α [ ∏_{j=1}^{M} λ_{Y_j}^{t−1}(x) ] Σ_{u_1,...,u_n} P(x|u_1, ..., u_n) ∏_{i=1}^{N} π_X^{t−1}(u_i),    (16)

λ_X^t(u_i) = β Σ_x [ ∏_{j=1}^{M} λ_{Y_j}^{t−1}(x) ] Σ_{u_k: k≠i} P(x|u_1, ..., u_n) ∏_{k≠i} π_X^{t−1}(u_k),    (17)

π_{Y_j}^t(x) = α [ ∏_{k≠j} λ_{Y_k}^{t−1}(x) ] Σ_{u_1,...,u_n} P(x|u_1, ..., u_n) ∏_{i=1}^{N} π_X^{t−1}(u_i),    (18)

where α and β are normalizing constants, λ_X^t(u_i) is an outgoing message from X to one of its parents, π_{Y_j}^t(x) is an outgoing message from X to one of its children, and P(x|u_1, ..., u_n) is the predetermined conditional probability table at X.
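As a toy instance of these updates, the belief at a node X with a single parent U and a single child Y combines the causal support π(x) (the conditional probability table marginalized over the parent message) with the diagnostic support λ_Y(x):

```python
# Toy Pearl-style belief update at a node X with one parent U and one
# child Y. cpt[u][x] = P(x | u); pi_u is the incoming parent message
# pi_X(u); lam_y is the incoming child message lambda_Y(x).
def belief(cpt, pi_u, lam_y):
    n_x = len(cpt[0])
    # causal support: pi(x) = sum_u P(x|u) * pi_X(u)
    pi_x = [sum(cpt[u][x] * pi_u[u] for u in range(len(pi_u)))
            for x in range(n_x)]
    # combine with diagnostic support and normalize (alpha)
    bel = [lam_y[x] * pi_x[x] for x in range(n_x)]
    alpha = 1.0 / sum(bel)
    return [alpha * b for b in bel]
```

With a uniform parent message and a non-informative child message, the belief reduces to the CPT averaged over the parent states.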
The marginal distribution and outgoing messages are defined as the variables associated with a node. Suppose the Bayesian network has N_B nodes and N_E edges, and each marginal or conditional distribution is defined over L discrete grades. Since a polytree with N_B nodes has N_B − 1 edges and each edge is associated with two messages propagated in opposite directions, the total number of variables during belief propagation is (N_B + 2N_E)L = (3N_B − 2)L. According to (16)-(18), at node X, the marginal distribution and outgoing messages at time step t are all functions of the messages received from its parents and children at the previous time step. Thus, we can define a (3N_B − 2)L × (3N_B − 2)L Jacobian matrix, J_t, between every two consecutive time steps. Note that J_t is a sparse matrix. Suppose it takes T time steps for belief propagation to reach convergence. The Jacobian matrix for all the time steps together, J_all, is the product of the Jacobian matrices for the individual time steps, J_all = ∏_{t=1}^{T} J_t. The gradient of the entire belief propagation algorithm, which consists of the partial derivatives of the final marginal distributions with respect to the input evidences, forms a submatrix of J_all and can be extracted from it.
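The chaining of per-step Jacobians can be illustrated with small dense matrices (the real J_t are sparse (3N_B − 2)L × (3N_B − 2)L matrices):

```python
# Compose per-time-step Jacobians J_t into J_all by the chain rule:
# J_all = J_T * J_{T-1} * ... * J_1, so each later step is multiplied
# onto the left of the accumulated product.
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def chain_jacobian(jacobians):
    # jacobians[t] is J_{t+1} in the paper's notation (t = 0 .. T-1)
    J_all = jacobians[0]
    for J_t in jacobians[1:]:
        J_all = matmul(J_t, J_all)
    return J_all
```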