Graph-Assisted Bayesian Node Classifiers

Many datasets can be represented by attributed graphs on which classification methods may be of interest. The problem of node classification has attracted the attention of scholars due to its wide range of applications. The problem consists of predicting nodes' labels based on their intrinsic features, the features of their neighboring nodes, and the graph structure. Graph Neural Networks (GNN) have been widely used to tackle this task. Thanks to the graph structure and the node features, they are able to propagate information over the graph and aggregate it to improve classification performance. Their performance is however sensitive to the graph topology, especially its degree of impurity, a measure of the proportion of connected nodes belonging to different classes. Here, we propose a new Graph-Assisted Bayesian (GAB) classifier, designed for the problem of node classification. By using Bayes' theorem, GAB takes into consideration the degree of impurity of the graph when classifying the nodes. We show that the proposed classifier is less sensitive to graph impurity and less complex than GNN-based classifiers.


I. INTRODUCTION
Attributed graphs are useful tools for representing interactive phenomena such as social networks [1], financial market fluctuations [2], road or air traffic, human scene [3], brain activity [4], or gene interaction [5]. Graphs are described by i) a set of nodes associated with the entities (subscribers in social networks, planes in air traffic, etc.) and ii) a set of edges connecting them in order to represent their relationships. Graphs are said to be attributed when nodes and/or edges have assigned values, called features. While this type of data is rich for inference, it is not well suited to standard signal processing or machine learning techniques. The adaptation of these techniques to graph data has been the subject of numerous studies that aim to perform graph classification (e.g., molecules, handwritten characters) [6], [7], [8], edge prediction (e.g., social network links) [9], node regression or classification (e.g., citation networks) [10], [11].
This work focuses on node classification, which is one of the most important machine learning tasks on graphs. Its objective is to predict the class (also called label) of each unlabeled node of the graph by relying on both the nodes' features and their connections within the graph.
In most real world graphs, connections between nodes are far from arbitrary. In social networks for instance, people are more likely to connect with those who share similar characteristics or areas of interest [12]. A citation network has connections between articles if one cites another, so links between articles addressing the same research topic are more likely than links between those addressing different topics [13]. Following the social sciences literature, a homophilic graph can be described as one where similar nodes tend to be connected to each other [12]. Using this principle, one should make use of the topology of the graph to determine the class of a node, rather than relying solely on intrinsic features, as is commonly done with standard machine learning approaches.
When taking into account the graph topology in a classifier, the first question is: what kind of information to share over the graph? There are two approaches.
• Label Propagation (LP): only the labels (true or estimated) are propagated through the adjacent nodes in the graph to make a new decision. Several methods relying on voting have been developed to merge labels' information [14]. The labels at the initial step are either provided or estimated using the node's features only, i.e., ignoring the graph.
• Feature Propagation (FP): the nodes' features are propagated through the adjacent nodes at each step. Typically, a weighted average of the current features of adjacent nodes (sometimes followed by a nonlinear function) is computed for each node at each step; a minimal sketch of one such propagation step is given below.

In this paper, we focus only on the FP approach. Graph Neural Networks (GNN) have recently been developed to adapt neural networks to attributed graphs. Typically, at each layer, they linearly combine the features of adjacent nodes and then apply an activation function. The training on labelled nodes consists of finding the best weights of each linear combination. In this paper, we instead derive a classifier in closed form from (Bayesian) decision theory. As a result, we obtain an interpretable algorithm based on a few parameters whose estimation step replaces the training phase of a GNN.
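For concreteness, here is a minimal sketch (our own illustration, not code from the paper) of one feature-propagation step, using a simple degree-normalized average of neighbor features followed by a ReLU; the names `propagate`, `adj`, and `feats` are ours:

```python
import numpy as np

def propagate(adj: np.ndarray, feats: np.ndarray) -> np.ndarray:
    """One FP step: average each node's features with its neighbors', then ReLU."""
    a_hat = adj + np.eye(adj.shape[0])            # add self-loops
    deg = a_hat.sum(axis=1, keepdims=True)        # node degrees (with self-loop)
    return np.maximum(a_hat @ feats / deg, 0.0)   # mean aggregation + nonlinearity
```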
Before we go any further, let us briefly review the works on FP based on GNN. The most common way has been to extend the convolutional neural network (CNN) to the graph structure [15] by redefining the convolution operator on the graph domain. These CNN-based methods suffer from a high computational cost due to the necessity of performing an eigendecomposition of the graph Laplacian. To overcome this problem, different approaches avoid explicit computation of this eigendecomposition by using a polynomial expansion to represent the filters [16], [17]. More specifically, in [17], the authors consider a first-order polynomial approximation to build a neural network, which they call the Graph Convolutional Network (GCN), to perform semi-supervised node classification. The GCN can also be seen as an aggregation operator, i.e., the representation of a node is obtained by averaging its intrinsic features with those of its first-order neighbors. The authors in [18] and [19] respectively introduced the Graph Attention Network (GAT) and the Attention-based Graph Neural Network (AGNN), where different weights are assigned to neighbors based on nodes' and edges' features. Other researchers explored higher-order (or equivalently, multi-hop) information of the graph by repeatedly mixing features of neighbors at various distances [20] or by modifying the propagation strategy of GCN [21]. Although these approaches have achieved remarkable results on a number of benchmark datasets, we notice that their performance varies significantly across datasets. For instance, the gain compared to a simple logistic regression (i.e., no contribution from the neighbors) highly depends on the dataset.
We use the following example to explain the underlying reason for the performance variations over datasets. We define $\bar{p}$ as the average probability of intra-class connection (i.e., the probability that two nodes from the same class are connected), and $\bar{q}$ as the average probability of inter-class connection (i.e., the probability that two nodes from different classes are connected). The degree of impurity of the graph may be represented by the ratio $\bar{q}/\bar{p}$. Datasets with a degree of impurity less than 1 are called assortative [22] and correspond to graphs containing communities (nodes with similar features are connected to each other). In Table 1, we show the classification accuracies of a logistic regression classifier and a two-layer GCN for two widely-used datasets used for benchmarking GNN algorithms. We also display the estimated values for $\bar{p}$, $\bar{q}$ and the degree of impurity $\bar{q}/\bar{p}$. As expected, node classification using the graph structure is easier with graphs offering a low degree of impurity (like Cora). This may explain the performance variations over datasets.
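As an illustration (our own sketch, not from the paper), $\bar{p}$, $\bar{q}$ and the degree of impurity can be estimated from a labeled graph as follows; `degree_of_impurity` is a hypothetical helper name:

```python
import numpy as np
from itertools import combinations

def degree_of_impurity(adj: np.ndarray, labels: np.ndarray) -> float:
    """Estimate p_bar, q_bar and their ratio from a 0/1 adjacency matrix."""
    intra_pairs = intra_links = inter_pairs = inter_links = 0
    for u, v in combinations(range(len(labels)), 2):
        if labels[u] == labels[v]:
            intra_pairs += 1
            intra_links += adj[u, v]
        else:
            inter_pairs += 1
            inter_links += adj[u, v]
    p_bar = intra_links / intra_pairs   # average intra-class connection probability
    q_bar = inter_links / inter_pairs   # average inter-class connection probability
    return q_bar / p_bar
```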
In this paper, a different point of view is taken by proposing a Bayesian classifier which does not rely on a neural network structure and is able to better adapt to the level of impurity than GNN-based classifiers.
Our approach to tackling the node classification problem is related to so-called collective classification [23], [24], which refers to the classification of a set of connected nodes using their intrinsic features and/or labels and their relative connections. Optimal collective classification is carried out by maximizing the joint likelihood. However, optimal (and so exact) inference is an NP-hard problem in general, and is thus generally not well suited to real-world networks. As a consequence, most collective classifiers rely on approximate inference [23], [25]. Some recent studies combine methods from collective classification with neural networks to ensure better end-to-end learning, e.g., [26]. However, all the above-mentioned algorithms only make use of the features of first-hop neighbors and thus rely on a propagation step to exploit higher-hop nodes' information in an iterative manner. We instead introduce a new classifier that directly takes higher-hop nodes into account, which has the additional advantage of being more interpretable. In our previous work [27], we proposed a first-order version of the Bayesian-based classifier, which is only able to take into account first-order neighbors. In this paper, we conduct detailed derivations of the classifier which allow us to consider higher orders. We also conduct a deeper comparison between our classifier and GNN-based classifiers in terms of performance and complexity.
Let us define some notations. Let G = (V, E) be an undirected graph, where V is the set of nodes and E ⊆ V × V is the set of edges. Each node u ∈ V is represented by a feature vector x_u ∈ R^{F×1}, where F is the number of node features. An adjacency matrix A ∈ R^{N×N} represents the topological structure of the graph, where N = |V| is the number of nodes in the graph. Without loss of generality, we assume [...].

[Figure caption: We compute D_1(x_u), . . . , D_K(x_u) either by training a multi-layer perceptron or by estimating the parameters of the generating distributions of the nodes' features. Then we compute P_u = (P_u(1), . . . , P_u(K)) using R^(1) as a transition matrix for nodes in N_1(u) and R^(2) for nodes in N_2(u) (see Eq. (14)).]
Let y u denote the label of the u-th node and let K denote the number of classes, i.e. y u ∈ {1, . . . , K }.
The objective of node classification in this paper is to predict the class of all unlabeled nodes in the graph given the adjacency matrix A, the feature matrix X and the set of available labels.
The paper is organized as follows: in Section II, we introduce our new graph-assisted Bayesian classifier (GAB). In Section III, we compare our approach with GNN-based classifiers; in particular, we show that our classifier has the shape of a GNN only in the noise-free case (i.e., $\bar{q} = 0$) and under further conditions on the distribution of the features. In Section IV, complexity issues are discussed. In Section V, numerical results are provided; comparisons with existing GNN-based methods are done on real datasets whose degree of impurity has been modified by injecting artificial noise (i.e., introducing fictitious edges between classes), and we see that our proposed Bayesian classifier offers better performance than GNN-based classifiers. In Section VI, concluding remarks are drawn.

II. THE GRAPH-ASSISTED BAYESIAN CLASSIFIER (GAB)
A. THE CLASSIFIER DERIVATION
In this section, we derive our new GAB classifier based on the Maximum A Posteriori (MAP) principle. We develop this classifier for a node u, called the node of interest in the rest of the paper. Obviously, in practice, every node in the graph will in turn be treated as the node of interest in a round-robin manner. We first consider the entire graph and then, in order to simplify the derivations, we consider only the information provided by the hop distance between each node and the node of interest. Let
• V_u be the set of nodes involved in the classification of node u (node u is not included in this set);
• X_u = {x_v : v ∈ V_u ∪ {u}} be the set of feature vectors of node u and its ''helping'' nodes;
• D_k be the distribution generating the feature vectors of nodes belonging to class k. We do not assume that these distributions are known; we instead either assume a shape for these distributions and then estimate their parameters, or approximate them using a neural network.
The objective is to compute the posterior probability that a node u belongs to class k given information on the graph I_G (typically its partial connectivity through the set V_u) and X_u. Consequently, the classifier makes the following decision:
$$\hat{y}_u = \arg\max_{k \in \{1,\dots,K\}} P_u(k),$$
where the posterior probability that needs to be computed is defined as $P_u(k) = P(y_u = k \mid X_u, I_G)$.
Using Bayes' rule, we obtain
$$P_u(k) = P(y_u = k \mid X_u, I_G) = \frac{P(X_u \mid y_u = k, I_G)\, P(y_u = k \mid I_G)}{P(X_u \mid I_G)}.$$
Since the denominator does not depend on k and the prior probability of an individual node u belonging to class k does not depend on the graph information, the above posterior probability can be rewritten as
$$P_u(k) \propto Q_u(k)\, \pi_k, \qquad (2)$$
with $Q_u(k) = P(X_u \mid y_u = k, I_G)$ and $\pi_k = P(y_u = k)$.
Let $V$ be the size of the set $\mathcal{V}_u$ and let $\{v_1, \cdots, v_V\}$ be the nodes of $\mathcal{V}_u$. In Appendix A, we show that
$$Q_u(k) = D_k(x_u) \sum_{k_{v_1}=1}^{K} \cdots \sum_{k_{v_V}=1}^{K} \Big( \prod_{i=1}^{V} D_{k_{v_i}}(x_{v_i}) \Big)\, p(y_{v_1} = k_{v_1}, \cdots, y_{v_V} = k_{v_V} \mid y_u = k, I_G), \qquad (3)$$
and that Eq. (3) cannot be simplified further, since the term $p(y_{v_1} = k_{v_1}, \cdots, y_{v_V} = k_{v_V} \mid y_u = k, I_G)$ cannot be split into individual posterior probabilities. Indeed, according to Example 1 below, one can see that in general this joint probability is not the product of the marginal probabilities $p(y_{v_i} = k_{v_i} \mid y_u = k, I_G)$.

Example 1: Consider three nodes A, B and C. We set the number of classes to K = 3. We assume that only nodes belonging to the same class can be connected. We now compute $P(y_B = k, y_C = k' \mid y_A = 1, I_G)$ in two different cases. In the first case, we consider that $I_G$ provides only information on the connections between the pairs (A, B) and (A, C); consequently, we only know that A is not connected to B and A is not connected to C, which leads to expressions (4) and (5) for $k \in \{2, 3\}$. In the second case, we consider that $I_G$ provides information on all possible connections. Once again, for $k \in \{2, 3\}$, the resulting joint probability is different from $P(y_B = k \mid y_A = 1, I_G) \times P(y_C = k' \mid y_A = 1, I_G)$. □

Consequently, in order to pursue analytical derivations offering practical algorithms, we make the following simplifying assumption:
$$p(y_{v_1} = k_{v_1}, \cdots, y_{v_V} = k_{v_V} \mid y_u = k, I_G) = \prod_{i=1}^{V} p(y_{v_i} = k_{v_i} \mid y_u = k, I_G), \qquad (8)$$
which corresponds to assuming statistical independence between the classes of node u's neighboring nodes given the graph. As seen in Example 1, this independence generally does not hold true, but it is required to pursue closed-form derivations. Using Eqs. (3) and (8), we finally get
$$Q_u(k) = D_k(x_u) \prod_{v \in \mathcal{V}_u} \sum_{k'=1}^{K} r_{u,v}(k, k')\, D_{k'}(x_v), \qquad (9)$$
with $r_{u,v}(k, k') = P(y_v = k' \mid y_u = k, I_G)$. The term $r_{u,v}(k, k')$ is the probability for node v to be in class k' given that node u is in class k and that we have partial (or total) information on the graph $I_G$. Notice that the term $r_{u,v}(k, k')$ depends on the graph statistics, as we will see in Eq. (13). In general, deriving a closed-form expression of $r_{u,v}(k, k')$ with respect to the graph statistics is very difficult due to combinatorial complexity. Before going further, we make the following remark.

Remark 1 (How to Take Into Account the Labelled Nodes?): Some of the nodes in $\mathcal{V}_u$ may have already been labeled, so their classes are known. Eq. (9) should then be adapted to account for this knowledge. Let us consider that node $v_1$ is labelled and belongs to class $k_1$. The term $\sum_{k'} r_{u,v_1}(k, k')\, D_{k'}(x_{v_1})$ is then replaced with $\sum_{k'} r_{u,v_1}(k, k')\, \delta_{k',k_1} = r_{u,v_1}(k, k_1)$, where δ is the so-called Kronecker index. Consequently, if $\mathcal{V}_u = \mathcal{L}_u \cup \mathcal{U}_u$, with $\mathcal{L}_u$ being the set of labelled nodes and $\mathcal{U}_u$ the set of unlabelled nodes, we have that
$$Q_u(k) = D_k(x_u) \prod_{v \in \mathcal{L}_u} r_{u,v}(k, k_v) \prod_{v \in \mathcal{U}_u} \sum_{k'=1}^{K} r_{u,v}(k, k')\, D_{k'}(x_v),$$
where $k_v$ is the class of the labelled node v.
As an example, if we consider $\mathcal{L}_u = \{v_1\}$ and $\mathcal{U}_u = \emptyset$, we obtain $Q_u(k) = D_k(x_u)\, r_{u,v_1}(k, k_{v_1})$, which is the likelihood function for $x_u$ assuming class k, corrected by a term depending on the probability of connection between class k and that of the labelled neighbor node.
In order to obtain a practical algorithm, $I_G$ for each node of interest u will be made to consist of only the hop distances between this node and the other nodes of the graph. Therefore, we order $\mathcal{V}_u$ as
$$\mathcal{V}_u = \mathcal{N}_1(u) \cup \mathcal{N}_2(u) \cup \cdots \cup \mathcal{N}_{\Delta_u}(u),$$
where $\mathcal{N}_d(u)$ is the set of d-hop neighbors of u and $\Delta_u$ is the maximum considered distance from node u; we have that $\Delta_u \leq V$. Hence, Eq. (9) can be rewritten as
$$Q_u(k) = D_k(x_u) \prod_{d=1}^{\Delta_u} \prod_{v \in \mathcal{N}_d(u)} \sum_{k'=1}^{K} r_{u,v}(k, k')\, D_{k'}(x_v).$$
The above expression will next be simplified further.
We recall that $r_{u,v}(k, k') = P(y_v = k' \mid y_u = k, I_G)$. As $I_G$ now merely consists of the distances between each node and the node of interest, $r_{u,v}(k, k')$ no longer depends on the specific path(s) connecting u to v but only on the hop distance between node v and node u. Moreover, we assume that the probability of connection between two nodes only depends on the nodes' classes and not on the specific nodes. Consequently,
$$r_{u,v}(k, k') = r^{(d)}(k, k') \quad \text{for } v \in \mathcal{N}_d(u).$$
To control the contributions of the neighbors to the computation of the posterior probability, depending on their distance to node u, we propose to modify the expression of $Q_u(k)$ as follows:
$$Q_u(k) = D_k(x_u) \prod_{d=1}^{\Delta_u} \Big( \prod_{v \in \mathcal{N}_d(u)} \sum_{k'=1}^{K} r^{(d)}(k, k')\, D_{k'}(x_v) \Big)^{\gamma_d}, \qquad (11)$$
where $\{\gamma_d\}_{d=1,\cdots,\Delta_u}$ are hyper-parameters to tune. Simulation results show that the hyper-parameter $\gamma_d$ decreases with d, as expected. Setting $\gamma_d = 1$ for all values of d implies that all the nodes in $\mathcal{V}_u$ have the same weight and so contribute equally. This can be counter-productive, especially when the degree of impurity is high: information from distant nodes may not be reliable, mostly because of the simplifications made to $I_G$. The optimization of the hyper-parameters should counteract this kind of phenomenon.
The goal now is to derive the terms $r^{(d)}(k, k')$ in closed form. Let $R^{(d)}$ denote the $K \times K$ matrix whose $(k, k')$ entry is $r^{(d)}(k, k')$. It is worth pointing out that, in general, $R^{(1)}$ is not symmetric but is non-negative with row-sums equal to 1 (while the column-sums are not necessarily equal to 1). Consequently,
$$r^{(d)}(k, k') = P(y_v = k' \mid y_u = k, C_d),$$
where $I_G$ has been replaced with $C_d$, which is defined as the knowledge that the two considered nodes are connected in d hops.
In Appendix B, we show that
$$r^{(d)}(k, k') = \big[ (R^{(1)})^d \big]_{k,k'}, \qquad (12)$$
i.e., $R^{(d)} = (R^{(1)})^d$. Since we are able to express $r^{(d)}(k, k')$ with respect to $r^{(1)}(k, k')$, we now need to derive $r^{(1)}(k, k')$ in closed form. We first define the following parameters:
• Let p(k) denote the probability that any two randomly selected nodes belonging to the same class k are directly connected.
• Let q(k, k') denote the probability that two randomly selected nodes belonging respectively to classes k and k' (k ≠ k') are connected. We let q(k, k) = p(k). We also assume symmetry, i.e., q(k, k') = q(k', k).
• Let $\bar{p}$ denote the average probability of connection between nodes of the same class.
Similarly, let $\bar{q}$ denote the average probability of connection between nodes not belonging to the same class.
• The degree of impurity (DoI), roughly evoked in Section I, is defined as $\mathrm{DoI} = \bar{q}/\bar{p}$.
This defines a general stochastic block model (SBM), a widely used random graph model for community detection and clustering [28], [29]. Note that we use the SBM only for analytical tractability and that, unlike the work on community detection, we are interested in node classification given some labeled samples. By using Bayes' rule, we have that
$$r^{(1)}(k, k') = \frac{q(k, k')\, \pi_{k'}}{\sum_{k''=1}^{K} q(k, k'')\, \pi_{k''}}, \qquad (13)$$
with the convention q(k, k) = p(k). As an example, if the probabilities of connection do not depend on the classes, i.e., p = p(k) for any k and q = q(k, k') for any k ≠ k', we obtain
$$r^{(1)}(k, k') = \frac{q\, \pi_{k'}}{p\, \pi_k + q\, (1 - \pi_k)} \ (k' \neq k), \qquad r^{(1)}(k, k) = \frac{p\, \pi_k}{p\, \pi_k + q\, (1 - \pi_k)}.$$
Hence, according to Eqs. (2) and (11), we obtain
$$P_u(k) \propto \pi_k\, D_k(x_u) \prod_{d=1}^{\Delta_u} \Big( \prod_{v \in \mathcal{N}_d(u)} \sum_{k'=1}^{K} r^{(d)}(k, k')\, D_{k'}(x_v) \Big)^{\gamma_d}, \qquad (14)$$
with $r^{(d)}(k, k')$ given by Eq. (12). We notice that the shape of the function to maximize is simple and corresponds to a product, over all the considered nodes, of a weighted sum of the possible distributions. The weights are perfectly characterized thanks to our derivations.
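To make Eqs. (12) and (13) concrete, here is a minimal Python sketch (our own illustration; the helper names `transition_matrix`, `hop_transition` and `conn_prob` are hypothetical) that builds the one-hop transition matrix and its d-hop counterpart:

```python
import numpy as np

def transition_matrix(conn_prob: np.ndarray, pi: np.ndarray) -> np.ndarray:
    """Eq. (13): R1[k, k'] = q(k,k') * pi[k'] / sum_k'' q(k,k'') * pi[k''],
    where conn_prob[k, k] = p(k) and conn_prob[k, k'] = q(k, k') for k != k'."""
    unnorm = conn_prob * pi[None, :]
    return unnorm / unnorm.sum(axis=1, keepdims=True)

def hop_transition(R1: np.ndarray, d: int) -> np.ndarray:
    """Eq. (12): the d-hop transition matrix is the d-th power of R1."""
    return np.linalg.matrix_power(R1, d)
```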

B. PARAMETERS' ESTIMATION
In order to implement our node classifier, i.e., to compute Eq. (14), we need π_k, p(k), q(k, k'), and the distributions D_k(·) for all k and k'. Since these are unknown, they have to be estimated from the data. Here, in order to perform this estimation with simple algorithms, we consider only the available labeled nodes of the graph. These estimators are described below (a minimal code sketch is given after the list):
• Estimation of π_k: the number of labeled nodes belonging to class k divided by the total number of labeled nodes.
• Estimation of p(k) and q(k, k'): we count all the pairs of labeled nodes belonging to classes k and k', and then count how many of these pairs are connected in one hop. The estimate of q(k, k') is obtained by dividing the latter count by the former. If these values are assumed not to depend on k and k', we also average over all the involved pairs (over k ≠ k' to obtain q(k, k') = $\bar{q}$, and over k = k' to obtain p(k) = $\bar{p}$).
• Estimation of the classes' statistics D_k(·): we assume a shape for the distribution, which depends on a set of parameters. For instance, in the Cora dataset the features are binary, so they are modeled as independent Bernoulli random variables; the probability associated with each feature is estimated using the proportion of non-zero elements of this feature among the labeled nodes, together with Laplace smoothing. If the features are continuous-valued, we use the Gaussian distribution.
It is worth pointing out that a semi-supervised estimation approach may also be possible, but this would require an iterative procedure that cycles between parameter estimation and node classification. It is also worth noting that the estimation step plays the role of the learning phase in GNN-based classifiers or in the classifiers developed in [26].
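The following sketch (ours; it assumes binary Bernoulli features as for Cora, and the `laplace` constant is an illustrative smoothing value) summarizes the estimators described in the list above:

```python
import numpy as np

def estimate_parameters(adj, feats, labels, train_idx, K, laplace=1.0):
    """Estimate priors pi_k, Bernoulli feature probabilities, and the
    connection probabilities p(k) (diagonal) and q(k, k') (off-diagonal)."""
    pi = np.zeros(K)
    bern = np.zeros((K, feats.shape[1]))
    conn_prob = np.zeros((K, K))
    nodes = {k: [u for u in train_idx if labels[u] == k] for k in range(K)}
    for k in range(K):
        pi[k] = len(nodes[k]) / len(train_idx)
        # Laplace-smoothed proportion of non-zero entries for each feature
        bern[k] = (feats[nodes[k]].sum(axis=0) + laplace) / (len(nodes[k]) + 2 * laplace)
    for k in range(K):
        for kp in range(K):
            if k == kp:  # unordered same-class pairs
                pairs = [(u, v) for i, u in enumerate(nodes[k]) for v in nodes[k][i + 1:]]
            else:        # one node from each class
                pairs = [(u, v) for u in nodes[k] for v in nodes[kp]]
            conn_prob[k, kp] = sum(adj[u, v] for u, v in pairs) / max(len(pairs), 1)
    return pi, bern, conn_prob
```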
The values of the hyperparameters γ_d are set to 1 by default. The optimization of these hyperparameters is addressed in Sections IV and V.
This optimization will be shown to improve classification performance. Indeed, these hyperparameters will adjust the degree to which neighbors of different orders should contribute to the classification of a node.

C. WHEN DOES GRAPH STRUCTURE NOT HELP GAB
Thanks to Eq. (14), we are able to characterize conditions on the graph's parameters under which the graph, through the proposed GAB, helps each node improve its classification performance. Reasoning by contradiction, one can see that the classifier does not take the neighbors into account if and only if (iff) the function
$$x \mapsto \sum_{k'=1}^{K} r^{(d)}(k, k')\, D_{k'}(x)$$
is independent of k for any d. Indeed, if this holds, the a posteriori distribution to maximize depends on k only through $D_k(x_u)$.
The above-mentioned function leads to the same output regardless of k, for any feature values, iff the weights $r^{(d)}(k, k')$ are identical for any k. Therefore, the condition for the neighbors to be useless is
$$r^{(d)}(k, k') = r^{(d)}(k'', k') \quad \text{for all } k, k'', k' \text{ and all } d.$$
According to Eq. (12), it is easy to prove that if this condition is satisfied for d = 1, then it remains true for any d. Therefore, we only need to focus on d = 1. According to Eqs. (13) and (15), the condition can be expressed with respect to the probabilities of connection p(k), q(k, k') and the priors π_k. We will now inspect some particular cases:
• In the case of constant intra-class and inter-class probabilities of connection, we obtain the following constraint for any pair (k, k') such that k ≠ k':
$$\pi_{k'} + \frac{\bar{q}}{\bar{p}}\,(1 - \pi_{k'}) = \frac{\bar{p}}{\bar{q}}\,\pi_k + (1 - \pi_k).$$
By setting $\nu = \sum_{k=1}^{K} \pi_k$, we have $\pi_{k'}(1 - \bar{q}/\bar{p}) + \pi_k(1 - \bar{p}/\bar{q}) = \nu\,(1 - \bar{q}/\bar{p})$, which implies that
$$\tilde{\pi}_{k'} + \tilde{\pi}_k\, \frac{1 - \bar{p}/\bar{q}}{1 - \bar{q}/\bar{p}} = 1,$$
where $k \mapsto \tilde{\pi}_k := \pi_k/\nu$ is a probability mass function. As $(1 - \bar{p}/\bar{q})/(1 - \bar{q}/\bar{p})$ is negative, this equation does not hold except if $\bar{p} = \bar{q}$. Consequently, the neighbors are not involved in the GAB classifier when the degree of impurity is 1, since there is then no community structure.
• In the case of two classes with possibly different intra-class probabilities of connection p(1) and p(2), the condition becomes $\bar{q} = \sqrt{p(1)\,p(2)}$. In this setup, the neighbors are not involved in the GAB classifier when the inter-class probability of connection is the geometric mean of the intra-class probabilities of connection, and so not necessarily when the degree of impurity (defined through the arithmetic mean) is equal to 1. Obviously, if p(1) = p(2), we go back to the first item, leading to $\bar{p} = \bar{q}$, i.e., a degree of impurity equal to 1. But when p(1) ≠ p(2), the degree of impurity leading to a graph-agnostic classifier is equal to the ratio between the geometric mean and the arithmetic mean of the intra-class probabilities of connection, which is strictly smaller than 1 in the general case.

III. LINK TO GRAPH NEURAL NETWORKS
A. RECAP OF GRAPH NEURAL NETWORKS
GNNs are a class of graph embedding architectures which use the graph structure, in addition to node and edge features, to generate a representation vector (i.e., embedding) for each node. GNNs learn node representations by aggregating the features of neighboring nodes and edges. The output of the ℓ-th layer of these GNNs is generally expressed as
$$h_u^{(\ell)} = \sigma^{(\ell)}\Big( \phi^{(\ell)}\big( h_u^{(\ell-1)}, \{ h_v^{(\ell-1)} \}_{v \in \mathcal{N}_1(u)} \big) \Big),$$
where $h_u^{(\ell)}$ is the feature vector of node u at the ℓ-th layer, initialized by $h_u^{(0)} = x_u$, and $\mathcal{N}_1(u)$ is the set of first-order neighbors of node u. Different GNNs use different formulations of the nonlinear function $\sigma^{(\ell)}$ (called activation function) and of the linear function $\phi^{(\ell)}$ [30]. Note that a first-order GNN-based classifier relies on one layer or, equivalently, considers only the 1-hop neighborhood of the graph.

1) GRAPH CONVOLUTIONAL NEURAL NETWORK (GCN)
The convolutional propagation rule used in GCN is defined as follows:
$$h_u^{(\ell)} = \sigma^{(\ell)}\Big( \sum_{v \in \mathcal{N}_1(u) \cup \{u\}} \frac{1}{\sqrt{(d_u + 1)(d_v + 1)}}\, W^{(\ell)}\, h_v^{(\ell-1)} \Big), \qquad (20)$$
where
• $W^{(\ell)}$ is a learnable weight matrix,
• $d_u$ is the degree of node u.
The activation function (for any layer except the last one) is a rectified linear unit (ReLU). For the last layer, we consider the softmax, which for each node u outputs the probability that node u belongs to class k. Node u is then assigned to the class with the highest probability [17].
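As an illustration (our own sketch, not the authors' implementation), the GCN rule of Eq. (20) can be written in its usual dense matrix form $H^{(\ell)} = \sigma(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} H^{(\ell-1)} W^{(\ell)})$:

```python
import numpy as np

def gcn_layer(adj, H, W, activation=lambda z: np.maximum(z, 0.0)):
    """One GCN layer: symmetric normalization with self-loops, then ReLU."""
    a_tilde = adj + np.eye(adj.shape[0])              # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))   # D~^{-1/2}
    a_norm = a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return activation(a_norm @ H @ W)
```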

2) GRAPH CONVOLUTION OPERATOR NETWORK (GON)
In [31] and [32], GON is defined as GCN where Eq. (20) is replaced with
$$h_u^{(\ell)} = \sigma^{(\ell)}\Big( W_1^{(\ell)}\, h_u^{(\ell-1)} + W_2^{(\ell)} \sum_{v \in \mathcal{N}_1(u)} h_v^{(\ell-1)} \Big). \qquad (21)$$
Unlike GCN, GON applies to the central node a transformation matrix that is different from the one applied to its neighbors.

3) GRAPH ISOMORPHISM NETWORK (GIN)
In [33], GIN is defined as GCN or GON where Eqs. (20)–(21) are replaced with
$$h_u^{(\ell)} = \sigma^{(\ell)}\Big( W^{(\ell)} \big( (1 + \alpha)\, h_u^{(\ell-1)} + \sum_{v \in \mathcal{N}_1(u)} h_v^{(\ell-1)} \big) \Big), \qquad (22)$$
where α is a positive hyper-parameter. GIN thus attributes a different weight to the central node (through α) when combining information from its neighbors.

4) GRAPH ATTENTION NETWORK (GAT)
In [34], GAT is defined as GCN, GON or GIN but with the following layer update:
$$h_u^{(\ell)} = \sigma^{(\ell)}\Big( \sum_{v \in \mathcal{N}_1(u) \cup \{u\}} \alpha_{u,v}^{(\ell)}\, W^{(\ell)}\, h_v^{(\ell-1)} \Big), \qquad (23)$$
where $\alpha_{u,v}^{(\ell)}$ are normalized attention coefficients computed by an attention mechanism as follows:
$$\alpha_{u,v}^{(\ell)} = \frac{\exp\Big( \varsigma\big( w^{(\ell)} \big[ W^{(\ell)} h_u^{(\ell-1)} \,\|\, W^{(\ell)} h_v^{(\ell-1)} \big] \big) \Big)}{\sum_{v' \in \mathcal{N}_1(u) \cup \{u\}} \exp\Big( \varsigma\big( w^{(\ell)} \big[ W^{(\ell)} h_u^{(\ell-1)} \,\|\, W^{(\ell)} h_{v'}^{(\ell-1)} \big] \big) \Big)}, \qquad (24)$$
with ς the leaky ReLU function and $w^{(\ell)} \in \mathbb{R}^{2H}$ a weighting row vector, where H is the size of the hidden layer and ∥ corresponds to column concatenation.

B. RELATIONSHIP WITH A GNN BASED CLASSIFIER
In this section, we compare the shape of the proposed GAB and that of a GNN. In a GNN, there is an activation function between consecutive layers, which implies that the multi-hop information has undergone several nonlinear functions before arriving at the node of interest. In GAB, the multi-hop mixture is done prior to making the final decision and does not follow a successive concatenation of linear combinations and activation functions. All the operations are intermixed; therefore, GNN and GAB are very different in terms of structure, except in the one-layer/one-hop case. We therefore focus here on the relation between the first-order GAB classifier and a first-order GNN-based classifier. To do so, we consider a binary classification problem (i.e., K = 2). According to Eq. (14), we assign node u to class 1 if $P_u(1) > P_u(2)$, which implies that
$$\pi_1\, D_1(x_u) \prod_{v \in \mathcal{N}_1(u)} \sum_{k'=1}^{2} r^{(1)}(1, k')\, D_{k'}(x_v) \;>\; \pi_2\, D_2(x_u) \prod_{v \in \mathcal{N}_1(u)} \sum_{k'=1}^{2} r^{(1)}(2, k')\, D_{k'}(x_v).$$

By setting $S(x) = \frac{D_1(x)}{D_2(x)}$ and taking the log, we obtain a test T (where T > 0 means ''decide class 1'') given by the log-ratio of the two sides of the above inequality. As q(1, 2) = q(2, 1), having two classes in the system leads to $\bar{q} = q(1, 2) = q(2, 1)$. Consequently, according to Eq. (13),
$$r^{(1)}(1, 1) = \frac{p(1)\,\pi_1}{p(1)\,\pi_1 + \bar{q}\,\pi_2}, \quad r^{(1)}(1, 2) = \frac{\bar{q}\,\pi_2}{p(1)\,\pi_1 + \bar{q}\,\pi_2}, \quad r^{(1)}(2, 1) = \frac{\bar{q}\,\pi_1}{\bar{q}\,\pi_1 + p(2)\,\pi_2}, \quad r^{(1)}(2, 2) = \frac{p(2)\,\pi_2}{\bar{q}\,\pi_1 + p(2)\,\pi_2}.$$
Therefore, the test T can be split into three parts:
$$T = \log\frac{\pi_1}{\pi_2} + \log S(x_u) + \sum_{v \in \mathcal{N}_1(u)} \log \frac{r^{(1)}(1, 1)\, S(x_v) + r^{(1)}(1, 2)}{r^{(1)}(2, 1)\, S(x_v) + r^{(1)}(2, 2)}. \qquad (25)$$
The first part corresponds to a constant term and so is connected to the threshold. The second part is the contribution of the node of interest. The third part, which is the most interesting, corresponds to the contribution of the neighbors in the test. Clearly, in general, this third term is not linear with respect to the nodes' features, and so the test cannot be seen as a one-layer GNN.
If, in addition, p(1) = p(2) = $\bar{p}$, the test can be written directly with respect to the DoI = $\bar{q}/\bar{p}$ as follows:
$$T = \log\frac{\pi_1}{\pi_2} + \log S(x_u) + \sum_{v \in \mathcal{N}_1(u)} \left( \log\frac{\mathrm{DoI}\cdot\pi_1 + \pi_2}{\pi_1 + \mathrm{DoI}\cdot\pi_2} + \log\frac{\pi_1\, S(x_v) + \mathrm{DoI}\cdot\pi_2}{\mathrm{DoI}\cdot\pi_1\, S(x_v) + \pi_2} \right).$$
For instance, we remark that if the DoI is much larger than the pdf ratio between the classes (which may be roughly related to the Kullback-Leibler divergence), then the third term is almost independent of the nodes' features and the information provided by the graph is not used, since it is not reliable. Consider now that the graph is pure (i.e., $r^{(1)}(1, 2) = r^{(1)}(2, 1) = 0$ or, equivalently, $\bar{q} = 0$); we obtain
$$T = \log\frac{\pi_1}{\pi_2} + \log S(x_u) + \sum_{v \in \mathcal{N}_1(u)} \log S(x_v),$$
where we consider that p(1) may be different from p(2). Once again, the proposed classifier does not in general boil down to a one-layer GNN. Actually, it can be a GNN if the term $\log S(x_v)$ is a linear combination of $x_v$. This can be achieved if the function $S(x_v)$ is at least a power function of $x_v$, as for the Gaussian or the Binomial distribution. Let us first consider the Gaussian case, i.e., $x_v \sim \mathcal{N}(\mu(v), \Sigma)$, where μ(v) is either μ_1 (if class 1) or μ_2 (if class 2) and the covariance matrix Σ is independent of the class (if not, a second-order polynomial occurs and the GAB is different from a GNN). According to Eq. (25), the first-order GAB test is then equal to
$$T = \log\frac{\pi_1}{\pi_2} + w^\top x_u + c_0 + \sum_{v \in \mathcal{N}_1(u)} \big( w^\top x_v + c_0 \big), \qquad (26)$$
where $w^\top = (\mu_1 - \mu_2)^\top \Sigma^{-1}$ and $c_0 = -\frac{1}{2}\big( \mu_1^\top \Sigma^{-1} \mu_1 - \mu_2^\top \Sigma^{-1} \mu_2 \big)$. Consequently, this test is a GNN-based test.
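For completeness, here is a short derivation (our own addition, using standard Gaussian densities) of why $\log S(x)$ is linear in x when the covariance matrix is shared between the two classes:
$$\log S(x) = \log \frac{\mathcal{N}(x;\, \mu_1, \Sigma)}{\mathcal{N}(x;\, \mu_2, \Sigma)} = (\mu_1 - \mu_2)^\top \Sigma^{-1} x \;-\; \tfrac{1}{2}\big( \mu_1^\top \Sigma^{-1} \mu_1 - \mu_2^\top \Sigma^{-1} \mu_2 \big),$$
so each neighbor contributes $w^\top x_v + c_0$ to the test, which is exactly the kind of linear aggregation of neighbor features performed by a one-layer GNN.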
Let us now consider the Binomial case. We assume that the features $x_{v,f}$ are independent binary random variables with probabilities $p_{k,f} = P(x_{v,f} = 1 \mid y_v = k)$. Eq. (25) can then be written in the form of Eq. (26) with
$$w_f = \log\frac{p_{1,f}\,(1 - p_{2,f})}{p_{2,f}\,(1 - p_{1,f})}, \qquad c_0 = \sum_{f=1}^{F} \log\frac{1 - p_{1,f}}{1 - p_{2,f}}.$$
Consequently, this test is a GNN-based test as well.

IV. DISCUSSION ON COMPLEXITY ISSUES
In this section, we compare the different versions of the proposed GAB and the GAT in terms of parameters/weights to tune and the number of flops for doing this tuning during the training phase. We will consider only the two-hops case for the GAB and the two-layers case for the GAT. First of all, we evaluate the number of parameters to be tuned. Let H and d max be respectively the size of the hidden layer and the maximum degree of the considered graph.
For a GAB classifier, we need to estimate:
• K² parameters for the transition matrix R, through the terms {p(k)}_k and {q(k, k')}_{k≠k'};
• the parameters of the classes' distributions (obviously, this number depends on the shape of the assumed distributions):
-- KF parameters when each feature of each class is Bernoulli-distributed;
-- (KF + KF²) parameters when each class is arbitrarily Gaussian-distributed. The number of parameters can be reduced if the Gaussian distribution per class is structured. For instance, if the covariance is independent of the class, we only have (KF + F²) parameters. If, in addition, the covariance matrix is diagonal or is assumed diagonal for the sake of simplicity, we have (KF + F) parameters;
• K parameters for the priors {π_k}_k.
When the Cora or PubMed datasets are considered (see Table 5 for more details), the distribution is assumed to be Bernoulli in our numerical evaluations (even if this is not exact), which implies that the total number of parameters to estimate, N_p, is given by
$$N_p = K^2 + KF + K.$$
For GAT, we need to tune/learn the weights and the attention parameters of the neural network. As only two layers are considered, we obtain FH weights for the first layer, HK weights for the second layer, and 2H (resp. 2F) weights for the attention mechanism of the first layer (resp. the second one). As a consequence, the number of weights to tune, denoted by N_w, is as follows:
$$N_w = FH + HK + 2H + 2F.$$
According to the numbers of classes and features given in Table 5 and by assuming a hidden layer of size H = 256, we report in Table 2 the values of N_p and N_w for the Cora and PubMed datasets.
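As a quick sanity check (our own sketch; the helper names `gab_num_params` and `gat2_num_weights` are hypothetical), the two counts above can be evaluated as follows:

```python
def gab_num_params(K: int, F: int) -> int:
    """N_p for GAB with Bernoulli features: transition matrix + feature
    probabilities + priors."""
    return K * K + K * F + K

def gat2_num_weights(K: int, F: int, H: int = 256) -> int:
    """N_w for a two-layer GAT: two weight matrices plus the attention vectors."""
    return F * H + H * K + 2 * H + 2 * F

# Example with the commonly reported Cora sizes (7 classes, 1433 binary features):
print(gab_num_params(K=7, F=1433), gat2_num_weights(K=7, F=1433))
```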
We observe that the number of parameters for GAB is much smaller (by more than an order of magnitude) than for GAT. In addition, the parameters in GAB are interpretable. Moreover, since for GAT, parameters are learnt using an iterative process, the computational complexity in terms of the number of epochs may significantly vary with the chosen optimization algorithm.
We here consider two variants of the GAB. The first one (denoted by GAB2) corresponds to the case where γ_1 = γ_2 = 1, while the second one (denoted by GAB2γ) is optimized with respect to the pair γ = (γ_1, γ_2). Let N_t, N_g, and N_v be the number of training nodes, the number of tested pairs γ, and the number of validation nodes, respectively. To estimate the N_p parameters during the training phase, we need N_p N_t flops, since each parameter corresponds to counts or sums over the training nodes. To select the best pair of hyperparameters γ, we evaluate our GAB on the validation nodes. A GAB evaluation leads to at most $K^2 d_{\max}^2 F$ flops by looking at Eq. (14): to compute the test for one class, we need $d_{\max}^2 K F$ flops, where F corresponds to the flops required to compute D_k(·), as is the case for the Bernoulli distribution or the Gaussian distribution with diagonal covariance matrix. Finally, we have that
$$N_{\mathrm{flops}}^{\mathrm{GAB2\gamma}} = N_p N_t + N_g N_v K^2 d_{\max}^2 F. \qquad (30)$$
When we only consider GAB2, the second term on the Right Hand Side (RHS) has to be omitted. Note that the 2D grid for the hyperparameter γ is [0 : 0.1 : 1]², leading to N_g = 121.
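For illustration, the grid search over the pair γ = (γ_1, γ_2) can be sketched as follows (our own code; `validation_accuracy` is a hypothetical callback that runs the GAB of Eq. (14) with a given pair on the validation nodes and returns its accuracy):

```python
import itertools
import numpy as np

def select_gamma(validation_accuracy):
    """Exhaustive search over the 2D grid [0:0.1:1]^2 (N_g = 121 pairs)."""
    grid = np.round(np.arange(0.0, 1.01, 0.1), 1)
    best = max(itertools.product(grid, repeat=2),
               key=lambda pair: validation_accuracy(pair))
    return best
```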
For GAT2, we just consider the number of flops required for learning the weights (so the hyperparameters such as the learning rate are assumed to be obtained for free). To learn the weights, we apply a gradient-descent like algorithm with N e epochs. Hereafter, we consider only the neural network's  weights which are dominant. So the weights related to the attention mechanism are ignored.
We consider one epoch. For the first layer (resp. second layer), we have FH (resp. HK) weights to update, and so FH (resp. HK) sums have to be computed once the gradient is known. For estimating the gradient, we average over the N_t training nodes, sum over the neighbors (at most d_max of them), and perform a matrix computation of size H × F (resp. K × H) with the current feature vector of size F (resp. H). Consequently, we have
$$N_{\mathrm{flops}}^{\mathrm{GAT2}} = N_e\, N_t\, d_{\max}\, \big( (FH)^2 + (HK)^2 \big). \qquad (31)$$
In Table 3, we report the rough number of flops for Cora (with a supervision of 30% and 5%) and PubMed (with a supervision of 5%) with the three classifiers. We set N_e = 500, and N_t = 140 (Cora-5%) or N_t = 1000 (Cora-30% and PubMed-5%). Moreover, d_max = 168 for Cora and d_max = 171 for PubMed.
In the inference phase, applying our classifier (GAB2) leads to $K^2 d_{\max}^2 F$ flops (which corresponds to the second term on the RHS of Eq. (30) without N_g and N_v). For GAT2, we have $d_{\max}\big( (FH)^2 + (HK)^2 \big)$ flops (actually, we apply Eq. (31) by removing N_e and N_t). In Table 4, we report the rough number of flops during the inference phase for Cora and PubMed.
We remark that the numbers of flops for our GAB classifiers are much smaller (by many orders of magnitude) than for the GAT in both the training and inference phases. Consequently, the structure imposed by the derivations of the GAB enables us to obtain interpretability and lower complexity than the black-box GAT.

V. NUMERICAL RESULTS
In this section, we conduct experiments with two kinds of datasets: synthetic ones, where we especially analyze the robustness to the degree of impurity, and some real benchmark ones, where we compare our classifier to several other approaches based on GNN. The performance of our classifier (and its ranking with respect to standard GNN-based approaches) depends on the dataset and the level of supervision.

A. SYNTHETIC DATASETS
For the sake of simplicity, we consider two classes, i.e., K = 2. Each class is associated with a different statistical distribution of the nodes' feature vectors x_v. We assume here that the two distributions are multivariate Gaussian, i.e., x_v follows $\mathcal{N}(\mu_1, \Sigma_1)$ in class 1 and $\mathcal{N}(\mu_2, \Sigma_2)$ in class 2, with unknown means and covariances. We assume that the features are uncorrelated in the two classes, i.e., $\Sigma_1 = \mathrm{diag}(\sigma_1^2)$ and $\Sigma_2 = \mathrm{diag}(\sigma_2^2)$, where $\sigma_1^2$ and $\sigma_2^2$ are (F × 1) vectors. To average over different configurations, different distributions are generated by drawing μ_1 and μ_2 randomly from $\mathcal{U}([0, 1]^F)$ and drawing σ_1 and σ_2 randomly from $\mathcal{U}([0.5, 3.5]^F)$. Moreover, the links are generated randomly using different values of the probability of intra-connectivity, $\bar{p}$, and the probability of inter-connectivity, $\bar{q}$, where $\bar{q} \leq \bar{p}$.
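For reproducibility of the setup described above, here is a minimal generation sketch (our own code; the name `generate_synthetic` and the balanced class assignment are assumptions, since the class proportions are not specified here):

```python
import numpy as np

def generate_synthetic(N, F, p_bar, q_bar, rng=np.random.default_rng(0)):
    """Two-class SBM with diagonal-covariance Gaussian features."""
    labels = rng.integers(0, 2, size=N)                  # class assignment
    mu = rng.uniform(0.0, 1.0, size=(2, F))              # class means ~ U([0,1]^F)
    sigma = rng.uniform(0.5, 3.5, size=(2, F))           # class std devs ~ U([0.5,3.5]^F)
    feats = mu[labels] + sigma[labels] * rng.standard_normal((N, F))
    same = labels[:, None] == labels[None, :]
    probs = np.where(same, p_bar, q_bar)                 # SBM edge probabilities
    adj = np.triu(rng.random((N, N)) < probs, k=1)
    adj = (adj | adj.T).astype(int)                      # symmetric, no self-loops
    return adj, feats, labels
```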
In each experimental setting, we evaluate the average node classification accuracy using a Monte Carlo simulation with 1,000 runs. We set the number of nodes to N = 5,000 and the number of features to F = 500. For each run, we train each classifier on 500 (already-labeled) nodes and test its accuracy on 2,000 nodes. We use the remaining nodes for validation.
In Fig. 2, we plot the classification accuracy versus the degree of impurity ($\bar{q}/\bar{p}$) for the first-order GAB and the one-layer GCN. Both approaches perform well when the DoI is small, and their performance decreases as the DoI increases. We observe, however, that the GAB is more robust to the DoI than the GCN, especially beyond a DoI of one half. As expected, the GAB is graph-agnostic when Eq. (18) is satisfied. For instance, when p(1) = p(2) = 0.05, the agnosticity occurs at $\bar{q}/\bar{p} = 1$, while when p(1) = 0.025 and p(2) = 0.075, the agnosticity is reached at $\bar{q}/\bar{p} = 0.866$. We remark that the GNN has a similar behavior with respect to the usefulness of the information coming from the graph.

B. REAL DATASETS
Unless otherwise stated, we use the attributed graphs described in Table 5, with a training/validation/testing split of 30%/20%/50%. To implement the GNN-based classifiers, we use PyTorch, where we initialize all the GNN weights with Glorot initialization and train them to minimize the cross-entropy loss using the Adam optimizer with an initial learning rate of 0.005. In Fig. 3, we plot the accuracy versus the DoI: on the top panel, the GAB at several orders as well as the best combination of powers γ_d in Eq. (14) (the training phase enables us to estimate the parameters of the GAB except the γ_d, which are optimized with a grid-search approach during the validation phase); on the middle panel, the GAT with several layers; and on the bottom panel, the GCN with several layers. The dataset here is always Cora, which implies that the assumed distributions are Bernoulli. The x-axis starts at the real value of the DoI, and we add links between previously unconnected nodes belonging to different classes so as to gradually increase the DoI towards 1. The above-mentioned classifiers are trained on each ''impure'' graph. We remark that, for any approach, the information on the graph is less accurate when we consider more hops; actually, the confidence in the ''far'' data is smaller. Nevertheless, by properly weighting the hops as in GABγ, the performance is slightly improved. In any case, GABγ is better than the best of the other approaches, since it adapts better to the impurity. Moreover, GABγ outperforms the graph-agnostic classifier, so it always manages to benefit from the graph.
In Table 6, we compare our approaches GABγ and GAB⋆ up to five hops (in GAB⋆, we approximate the unknown distributions by two-layer NNs which are learnt during the training phase) with the methods presented and analyzed in [26]. We reproduce the values given in [26], where the best number of layers and the best version of each approach have been selected. We likewise select the best version of GAB, in the sense that we optimize the hyperparameters γ while allowing at most fifth-order neighborhoods. In GABγ, the shape of the distributions is given a priori; for instance, for PubMed, we consider a Bernoulli distribution even though it is inaccurate. We report the accuracy rate and the ranking (in brackets) for a 30%-supervised graph. For computing the average accuracy and the average ranking (over all the considered datasets) of our GAB approach, we select the best one. We remark that the performance of our approach is close to that of GBPN and GAT, which are the best ones in the literature. When GABγ performs poorly, it corresponds to the PubMed case, where the assumed distribution does not fit the true one well; improvement is therefore possible by choosing a better approximation. In addition to the good performance of our proposed approach, our algorithms are interpretable, since they are written in closed form and we are able to understand the meaning of each element.
In Fig. 4, we plot the accuracy rate versus the training size (in percentage) for our approach (GAB⋆) and for GCN and GAT with two layers/hops. Solid lines correspond to the PubMed dataset and dashed lines to the Cora dataset. We observe that for PubMed, our approach outperforms the state of the art regardless of the training size. We should note that, as PubMed is a large dataset, even a small training percentage still provides a large amount of data to estimate the statistics required by GAB. In contrast, for Cora, our approach outperforms GCN and GAT only when the training set is large enough (and as Cora is a small dataset, ''large enough'' also means a large enough training percentage).
According to all the previous experiments, we remark that our GAB approach is more robust to the degree of impurity and its performance is close to or better than the tested GNN approaches on a number of benchmark graphs.

VI. CONCLUSION
In this work, we tackled the problem of node classification on attributed graphs. We started from the observation that the performance of GNN-based models depends on the graph topology, and more specifically on the degree of impurity of the graph. We adopted a different path and used Bayes' theorem to propose a new graph-assisted Bayesian node classifier. This classifier is able to take into account the degree of impurity of the graph. It is also shown to significantly outperform some GNN-based classifiers, in addition to providing more interpretability and requiring a lower computational complexity (provided that the model of the nodes' distributions is well approximated in closed form).

APPENDIX A
DERIVATIONS FOR EQ. (3)
We first focus on the term $Q_u(k)$. We get
$$Q_u(k) = P(X_u \mid y_u = k, I_G) = \sum_{k_{v_1}=1}^{K} \cdots \sum_{k_{v_V}=1}^{K} P(X_u \mid y_u = k, y_{v_1} = k_{v_1}, \cdots, y_{v_V} = k_{v_V}, I_G)\; p(y_{v_1} = k_{v_1}, \cdots, y_{v_V} = k_{v_V} \mid y_u = k, I_G).$$
As the classes of the neighbors are given, the information on the graph becomes redundant and so useless; therefore, $I_G$ can be removed from the first term. Given the classes, the feature vectors of the nodes are generated independently, so we get
$$P(X_u \mid y_u = k, y_{v_1} = k_{v_1}, \cdots, y_{v_V} = k_{v_V}) = P(x_u \mid y_u = k) \prod_{i=1}^{V} P(x_{v_i} \mid y_{v_i} = k_{v_i}) = D_k(x_u) \prod_{i=1}^{V} D_{k_{v_i}}(x_{v_i}),$$
which, after a re-ordering of the terms and by induction over the neighbors, concludes the derivation of Eq. (3).

Algorithm 1: The Inference Algorithm for GAB

Input: graph topology G = (V, E); nodes' features {x_u}_{u∈V}; labels of the set of labeled nodes {y_u}_{u∈L}; a feature-based classifier g; prior probabilities π = (π_1, . . . , π_K); transition matrix R.
Output: a prediction ŷ_u for each unlabeled node u ∈ U.

▷ Step 1: Initialize predictions
for u ∈ V do
  for k ∈ {1, . . . , K} do
    if u ∈ L then
      D_u(k) ← 1(y_u = k)
    else
      D_u(k) ← g(x_u; k)
    end if
  end for
end for

▷ Step 2: Update predictions
for u ∈ U do
  for k ∈ {1, . . . , K} do
    P_u(k) ← Q_u(k) π_k   ▷ Q_u(k) is defined in Eq. (11)
  end for
  ŷ_u ← argmax_k P_u(k)
end for

We recall that V = L ∪ U, where L and U are respectively the sets of labeled and unlabeled nodes. The feature-based classifier g is obtained either by training a multi-layer perceptron or by estimating the parameters of the generating distributions of the nodes' features; g(x_u; k) returns the probability that node u belongs to class k based on its intrinsic features. The prior probabilities π and the transition matrix R are estimated as explained in Section II-B.
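Below is a minimal Python sketch (ours, not the authors' code) of Algorithm 1; `g`, `neighbors_by_hop`, and the other argument names are assumptions, and `R1` is the one-hop transition matrix estimated as in Section II-B:

```python
import numpy as np

def gab_inference(nodes, labeled, y, g, neighbors_by_hop, pi, R1, gamma, K):
    """Sketch of Algorithm 1. Step 1 initializes D_u(k) from the labels (one-hot)
    or from the feature-based classifier g; Step 2 applies Eqs. (11)-(14)."""
    # Step 1: initialize predictions
    D = {}
    for u in nodes:
        if u in labeled:
            D[u] = np.eye(K)[y[u]]                       # indicator 1(y_u = k)
        else:
            D[u] = np.array([g(u, k) for k in range(K)])
    # Step 2: update predictions for unlabeled nodes
    preds = {}
    for u in nodes:
        if u in labeled:
            continue
        score = pi * D[u]
        for d, hop in enumerate(neighbors_by_hop[u], start=1):
            Rd = np.linalg.matrix_power(R1, d)           # Eq. (12)
            term = np.ones(K)
            for v in hop:
                term *= Rd @ D[v]                        # inner sum of Eq. (11)
            score *= term ** gamma[d - 1]
        preds[u] = int(np.argmax(score))                 # argmax_k P_u(k)
    return preds
```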