Graph Convolutional Architectures via Arbitrary Order of Information Aggregation



I. INTRODUCTION
Networks, or graphs, are a ubiquitous data structure and have been extensively employed to capture interactions (i.e., edges) between individual units (i.e., nodes) of complex systems in biology, neuroscience, engineering, and social science. Accordingly, machine learning tasks on networks have attracted considerable attention, with applications ranging from drug design to recommendation systems in social networks [1]-[3]. However, because network structure is nonlinear and high-dimensional, it is difficult to apply classical machine learning methods to networks directly. To solve this problem, various graph representation learning (GRL) methods have been proposed in recent years [4]-[6]. One typical GRL method is graph embedding, whose purpose is to learn a map that encodes network nodes in a low-dimensional vector space such that certain structural properties of the network are preserved [7]-[9]. The resulting node embeddings (or representations) can then be used as feature inputs of downstream machine learning methods to solve specific network analytic tasks, such as community mining [10], [11], node classification [12], and link prediction [13]. Nevertheless, two shortcomings remain. First, most existing graph embedding algorithms focus merely on preserving structural properties of networks, such as structural proximity [14], [15], equivalence [16], [17], and identity [18], [19]. They are limited in learning node representations of attributed networks, where each node is associated with additional heterogeneous information, such as node attributes and labels. Second, most graph embedding algorithms learn the map function independently of the subsequent machine learning tasks. In other words, the representations are learned without the supervision of downstream machine learning outputs (e.g., node labels).

The associate editor coordinating the review of this manuscript and approving it for publication was Juan Wang.
Because unsupervised GRL methods do not leverage label information in the learning process, they have inherent limitations in solving (semi)supervised machine learning tasks on networks. In the past, to solve the node classification problem, many community-preserving embedding algorithms have been designed under the assumption that densely connected nodes tend to have the same label [15]. In reality, however, it is difficult to determine in advance which structural properties are related to node labels [17]. The situation becomes even more complicated for attributed networks. Therefore, to solve (semi)supervised machine learning tasks on attributed networks, the primary challenge lies in finding a supervised way to learn representations of network nodes and their associated attributes in an end-to-end manner.
By extending the idea of deep learning on networks, graph neural networks (GNNs) have been recognized as a useful framework to tackle supervised network analytic problems (see detailed surveys [6], [9], [20], [21]). In essence, GNNs treat the network structure as a computational graph, and train the whole neural network model in an end-to-end manner [22]. For example, inspired by convolutional neural networks in the field of computer vision, a variety of graph convolutional networks (GCNs) have been proposed in recent years [23]- [26]. By adopting appropriate message passing mechanisms in each convolutional layer of a GCN, each node can aggregate attribute information from its neighboring nodes in the network. However, as the depth of a GCN increases, nodes will aggregate information from other nodes with higher-order proximity. In doing so, node representations will be projected towards a steady state after several aggregation steps [27]. As a result, the depth of the existing GCNs cannot be too large.
In many real-world applications, node labels are relevant to their structural roles in a network [28]- [30]. Moreover, nodes with the same/similar structural roles may be far away from each other in the network [17]. In this case, GCNs with limited depth cannot aggregate information of nodes with similar roles but far away from each other. Therefore, it would be desirable to develop a computationally efficient convolutional architecture such that information of nodes with higher-order of proximity can be exploited through appropriate aggregators while keeping their heterogeneity. In this paper, instead of increasing the depth of GNNs, we aim to develop a novel graph convolutional architecture by enriching the information channels to support arbitrary order of information aggregation through the network.
Specifically, in this paper, we focus on developing a graph convolutional architecture that solves the semi-supervised node classification problem on attributed networks in an end-to-end manner. The novelty and contributions of this paper are summarized as follows:
1) We propose a multi-channel graph convolutional network (MCGCN) that allows higher-order information aggregation by enriching the number of input channels. Based on the notion of the Katz index, the proposed model can further achieve an arbitrary order of information aggregation without increasing the computational overhead.
2) We introduce a shared weight mechanism to assess the relative importance of different attributes, which are weighted and aggregated among nodes with a certain order of proximity.
3) We carry out experiments on several benchmark datasets to evaluate the performance of the proposed MCGCN architecture by comparing it with state-of-the-art GRL methods in terms of classification accuracy and computational efficiency.
The remainder of this paper is organized as follows.
In Section II, we briefly review the related work of this paper. In Section III, we formulate the semi-supervised node classification problem on attributed networks. In Section IV, we present a multi-channel convolutional architecture that can aggregate information from nodes with arbitrary order of proximity. In Section V, we carry out experiments to evaluate the performance of the proposed MCGCN method by comparing it with several state-of-the-art methods. Finally, we conclude this work in Section VI.

II. RELATED WORK
In recent years, the graph embedding approach has been proposed with the purpose of automatically representing, or encoding network elements into a low-dimensional vector space by preserving certain network properties [4], [5], [7], [8]. Generally, existing node embedding methods can be classified into three categories: factorization-based approach [14], [31], [32], random walk-based approach [33]- [35], and deep learning-based approach [16]. The first two categories focus mainly on encoding network elements by preserving the structural properties of networks. For example, concerning community-based embeddings [15], the basic idea is to learn embeddings of each node such that the inner product between any two learned vectors approximates certain measures of structural proximity. Once node embeddings (or representations) are obtained, they can be treated as feature inputs for downstream machine learning methods to solve specific network analytic tasks. In doing so, such methods are considered as an unsupervised graph representation learning approach, where the representations are learned independently from the downstream machine learning tasks.
The deep learning-based approach usually learns the encoding and decoding functions through an end-to-end learning process [36], [37]. To deal with (semi)supervised machine learning problems on attributed networks, graph neural network models have been proposed, where the network structure is treated as a computational graph in each layer of a GNN [6], [20], [21]. By leveraging information about node labels and attributes, they can achieve much better performance than hand-engineered network analytic methods and unsupervised GRL methods [38]. For example, by adopting message-passing mechanisms, GNNs can aggregate node/edge attributes from neighboring nodes with a certain order of proximity in the network. By training model parameters or aggregators under the supervision of label information, GNNs have the potential to deal with both inductive and transfer learning tasks on networks.
As a typical GNN approach, graph convolutional networks are inspired by the powerful convolutional neural networks in the field of computer vision. Recently, researchers have proposed a variety of graph convolutional networks (GCNs), including spectral-based GCNs [24], [39], spatial-based GCNs [26], [27], [40], and graph attention networks (GAT) [25]. The differences lie in how the filters are defined to aggregate the information through the network. For example, the Chebyshev method defined graph convolutions using a K -degree polynomial of the Laplacian to avoid the huge computational cost of the Laplacian eigendecomposition [39]. Then, the vanilla GCN simplified graph convolutions with a specified renormalization trick [24]. By extending the vanilla GCN framework, the GraphSAGE method proposed different aggregator architectures [40]. To reduce the computational complexity, the GraphSAGE chose to aggregate information from a fixed number of neighboring nodes, instead of from all neighbors. Meanwhile, the Monet model presented a generalized CNN architecture, which aggregated the local information from networked data [41]. The GAT networks incorporated the well-known attention mechanism into each layer of the GCN framework such that the weight of information from different neighbors can be learned [25]. More recently, the SGC network simplified the vanilla GCN framework by reducing the number of non-linear activations and aggregation layers [26].
From a computational perspective, spectral-based GCNs often incur a relatively high computational cost, because exact spectral filtering requires the eigendecomposition of the graph Laplacian, which polynomial approximations such as the Chebyshev method [39] were designed to avoid. To reduce the computational burden, the spatial-based GCNs have introduced various message-passing mechanisms, where each node aggregates information from its neighbors in the network. The objective is then to learn the weight matrix and/or aggregators through an end-to-end learning process [26]. However, the drawback is that as the depth of a GCN increases, the learned representations are projected towards a steady state [27]. The underlying assumption is that neighboring nodes in a network tend to have similar representations, which in turn limits the depth of GNNs. Paradoxically, if the depth of a GCN is not large enough, nodes cannot aggregate the information of distant nodes. This limits the applicability of GCNs to many real-world problems, for example, classifying nodes with respect to their structural roles [28]-[30]. To tackle this problem, it would be desirable to develop novel graph convolutional architectures that can (i) aggregate information from nodes with arbitrary order of proximity, and (ii) reflect the relative importance of information from nodes with different orders of proximity.

III. PROBLEM STATEMENT
In this section, we first introduce the notation and definitions used in the remainder of this work. Then, we introduce the semi-supervised multi-class node classification problem on networks. Let $G = (V, E)$ denote a graph with node set $V$, where $|V| = N$, and edge set $E$, and let $A \in \mathbb{R}^{N \times N}$ be its adjacency matrix. Further, we denote $I_N$ as the $N \times N$ identity matrix, and denote $D$ as the diagonal degree matrix, where $D_{ii} = \sum_j A_{ij}$ represents the degree of node $i$.
Suppose that each node $i$ in $G$ is associated with a $d$-dimensional feature vector $x_i \in \mathbb{R}^{d}$. Then, the attributed graph can be defined as $G = (V, E, X)$, where the feature matrix $X \in \mathbb{R}^{N \times d}$ stacks the $N$ feature vectors on top of one another. In other words, $x_i$ is the $i$-th row of matrix $X$.
In this paper, we assume that each node $i$ in an attributed graph $G = (V, E, X)$ belongs to one of $C$ classes, which can be denoted as a $C$-dimensional one-hot vector $y_i \in \{0, 1\}^{C}$. For example, if the label of $i$ is $c$, then we have $y_i = [0_{(1)}, \cdots, 0_{(c-1)}, 1, 0_{(c+1)}, \cdots, 0_{(C)}]$. When stacking the labels $y_i$, we have a label matrix $Y \in \{0, 1\}^{N \times C}$, where $y_i$ is the $i$-th row of $Y$. Suppose only a subset of nodes $V_L \subset V$ in a network have labels $Y_L$, and the rest of the nodes $V_U = V \setminus V_L$ are unlabeled. Formally, the semi-supervised node classification problem can be defined as follows:

Definition 2 (Semi-Supervised Node Classification on Networks): Given an attributed graph $G = (V, E, X)$ and the label $y_j$ of each node $j \in V_L$, the objective of the node classification problem is to (i) learn a model $f$ such that its predictions agree with the known labels $y_j$ for $j \in V_L$, and (ii) use $f$ to infer the labels of the unlabeled nodes in $V_U$.

In the following, we will propose a graph convolutional architecture to solve the semi-supervised multi-class node classification problem on networks.
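The notation of this section can be made concrete with a small NumPy sketch. The toy path graph, the label vector, and the helper names `degree_matrix` and `one_hot` are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def degree_matrix(A):
    """Diagonal degree matrix D with D_ii = sum_j A_ij."""
    return np.diag(A.sum(axis=1))

def one_hot(labels, num_classes):
    """Stack one-hot rows y_i into a label matrix Y of shape (N, C)."""
    Y = np.zeros((len(labels), num_classes), dtype=int)
    Y[np.arange(len(labels)), labels] = 1
    return Y

# Hypothetical toy data: a path graph 0-1-2-3 with two classes.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
labels = np.array([0, 1, 1, 0])                   # class index of each node
labeled = np.array([True, True, False, False])    # mask of V_L; the rest is V_U

D = degree_matrix(A)
Y = one_hot(labels, num_classes=2)
Y_L = Y[labeled]          # the only labels a semi-supervised model may see
```

In this setting a model is fitted on `Y_L` and evaluated on the rows of `Y` that the mask hides.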

IV. THE PROPOSED GCN ARCHITECTURE
In this section, we build upon graph convolutional networks to learn node representations for semi-supervised node classification in an end-to-end fashion. We first introduce the message-passing architecture in GCNs. Then, we present our higher-order GCN architecture, which can aggregate information of any order over the network.

FIGURE 1. A schematic depiction of the proposed higher-order convolutional architecture for the semi-supervised node classification task on networks. The model takes the feature matrix $X$ as input, each row of which represents the feature vector of a node. Instead of increasing the number of convolutional layers, the architecture allows higher-order information aggregation by enriching the number of information channels based on $\hat{A}$, $\hat{A}^2$, $\hat{A}^3$, and so on. Each node aggregates its information from all channels through a shared weight mechanism. Finally, a fully connected layer together with the softmax function is used for the multi-class node classification task.

A. A MESSAGE-PASSING MECHANISM
In general, the message-passing architecture can be described as a multi-layer convolutional network. Here, we follow Kipf & Welling to introduce the architecture of GCNs in the context of node classification [24]. Specifically, for the $k$-th convolutional layer in an $l$-layer GCN, the message-passing model is formulated as follows:

$$H^{k} = \sigma\left(\hat{A} H^{k-1} W^{k}\right), \tag{1}$$

where $H^{0} = X$ represents the feature matrix $X$ as the input of the model, and $W^{k}$ is the learnable weight matrix of the $k$-th layer. Moreover, $H^{k} \in \mathbb{R}^{N \times d_k}$ are the output node representations (i.e., the ''message'' or ''information'') of the $k$-th layer, and subsequently the input node representations of the $(k+1)$-th layer. Consequently, the nodal information is aggregated through the message-passing model. Here, $\sigma$ represents the message propagation function for information aggregation over the network. There are many possible implementations of information aggregators. For example, in [24], the information is aggregated through a combination of linear transformations and ReLU activation. In addition, $\hat{A}$ is the renormalized adjacency matrix of the graph. Formally, it is calculated as follows:

$$\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}, \tag{2}$$

where $\tilde{A} = A + I_N$ is the adjacency matrix with self-loops added and $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$. The renormalization trick constrains the number of model parameters. In doing so, the model and its variants can address the overfitting problem and reduce the computational overhead of GNNs [24], [39]. With this renormalization trick, every node in the graph aggregates information from its direct neighbors, also known as the first-order neighbors, in each forward propagation layer of a GCN. Accordingly, if we want to pass messages to, or aggregate information from, nodes with high-order proximity, one intuitive way is to increase the number of convolutional layers (i.e., the model depth). Nonetheless, this may result in overfitting because it introduces more learnable parameters $W$. Kipf & Welling have shown in their experiments that increasing the model depth deteriorates the performance of downstream learning tasks, even when residual connections are used between GCN layers [42]. Therefore, the vanilla GCN and its variants usually have fewer than three convolutional layers. As a consequence, they cannot aggregate information from distant nodes in the network.
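The renormalization trick and one propagation step can be sketched in NumPy as follows. This is a minimal dense illustration, assuming ReLU as the propagation function as in [24]; the function names, the toy path graph, and the random weights are hypothetical and not taken from the authors' code.

```python
import numpy as np

def renormalized_adjacency(A):
    """A_hat = D~^{-1/2} (A + I_N) D~^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])          # add self-loops
    d_tilde = A_tilde.sum(axis=1)             # degrees of A~, always >= 1
    d_inv_sqrt = 1.0 / np.sqrt(d_tilde)
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def gcn_layer(A_hat, H, W):
    """One message-passing layer: ReLU(A_hat @ H @ W)."""
    return np.maximum(A_hat @ H @ W, 0.0)

# Hypothetical toy input: path graph 0-1-2-3 with identity features.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.eye(4)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(3, 2))

A_hat = renormalized_adjacency(A)
H1 = gcn_layer(A_hat, X, W1)     # each node sees its first-order neighbours
H2 = gcn_layer(A_hat, H1, W2)    # a second layer reaches second-order neighbours
```

Entry `(0, 2)` of `A_hat` is zero while the same entry of `A_hat @ A_hat` is positive, which is why each extra layer extends the reach of a node by one hop.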

B. A MULTI-CHANNEL CONVOLUTIONAL LAYER
Instead of increasing the number of convolutional layers, in this paper, we propose a multi-channel convolutional architecture that enriches the number of input channels such that higher-order information can be aggregated. As shown in Figure 1, the model takes the feature matrix $X \in \mathbb{R}^{N \times d}$ as input, each row of which represents the feature vector of a node. Then, node information is aggregated separately in different channels. Specifically, the propagation network in channel $i$ corresponds to a specific matrix $\hat{A}^{i}$, which is the $i$-th power of the renormalized adjacency matrix $\hat{A}$. For $k$ channels, our forward model then takes the following form:

$$H = \mathrm{AGG}\left(\hat{A} X W_1, \hat{A}^{2} X W_2, \ldots, \hat{A}^{k} X W_k\right), \tag{3}$$

where $\hat{A}^{i} X W_i$ represents a high-order GCN channel that captures the information from the $i$-th order neighbors. The operator AGG aggregates node-wise information from all channels.

VOLUME 8, 2020
In this paper, we use the SUM operator as the aggregator function. The reasons are twofold. First, the aggregation scheme in the vanilla GCN can be regarded as a class of functions over the sets of neighbor nodes [40], [43]. Among different aggregator functions, such as SUM, MEAN, and MAX, only the SUM operator can capture the full multiset (a generalization of the set) [43]. Consequently, with respect to the proposed architecture, the SUM operator is more powerful than the others in distinguishing diverse network structures. Second, the SUM operator yields a weighted summation over different convolutional channels, which can magnify the relatively important information. In doing so, the forward model can be formulated as follows:

$$H = \sum_{i=1}^{k} \hat{A}^{i} X W_i, \tag{4}$$

where $k$ represents the total number of channels in the architecture.
From the above equation, the learnable weight $W_i$ in each channel can be regarded as a pre-processing operation on node features [26]. To reduce the number of model parameters and avoid overfitting, in this paper, we use a shared weight matrix $W_S$ among different channels. Further, we use a non-linear function $\sigma$ to boost the expressive power of our model. In doing so, the feature aggregation rule can be rewritten as:

$$H = \sigma\left(\sum_{i=1}^{k} \hat{A}^{i} X W_S\right). \tag{5}$$

Finally, we add a parameter $\alpha$ to allow a flexible adjustment among different channels:

$$H = \sigma\left(\sum_{i=1}^{k} \alpha^{i} \hat{A}^{i} X W_S\right). \tag{6}$$

The parameter $\alpha$ determines how quickly the weight of a channel decays as its order increases. In doing so, the proposed architecture can aggregate information up to the $k$-th order through a network without increasing the depth of GCNs. We name such a $k$-channel GCN model ''MCGCN-k'' for comparison and evaluation in Section V. The detailed forward process is shown in Algorithm 1.
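The multi-channel aggregation with a shared weight and the decay parameter α can be sketched directly from its description. The code below is an illustrative dense NumPy version, assuming ELU for the non-linearity σ (the activation the paper adopts later); the name `mcgcn_k_layer`, the toy graph, and the random inputs are hypothetical.

```python
import numpy as np

def renormalized_adjacency(A):
    """A_hat = D~^{-1/2} (A + I_N) D~^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def elu(x):
    """ELU with unit scale: x for x > 0, exp(x) - 1 otherwise."""
    return np.where(x > 0, x, np.expm1(np.minimum(x, 0.0)))

def mcgcn_k_layer(A_hat, X, W_S, alpha, k):
    """sigma( sum_{i=1..k} alpha^i * A_hat^i @ X @ W_S ), W_S shared by all channels."""
    XW = X @ W_S                                   # shared weight applied once
    channels = [alpha**i * np.linalg.matrix_power(A_hat, i) @ XW
                for i in range(1, k + 1)]          # channel i uses A_hat^i
    return elu(np.sum(channels, axis=0))           # SUM aggregator over channels

# Hypothetical toy input: path graph on 5 nodes, random features and weights.
rng = np.random.default_rng(0)
A = np.diag(np.ones(4), 1)
A = A + A.T
A_hat = renormalized_adjacency(A)
X = rng.normal(size=(5, 3))
W_S = rng.normal(size=(3, 2))

H_k1 = mcgcn_k_layer(A_hat, X, W_S, alpha=0.5, k=1)
H_k2 = mcgcn_k_layer(A_hat, X, W_S, alpha=0.5, k=2)

# Perturb the features of node 2 and see which channel count lets node 0 notice.
X_perturbed = X.copy()
X_perturbed[2] += 1.0
H_k1_p = mcgcn_k_layer(A_hat, X_perturbed, W_S, alpha=0.5, k=1)
H_k2_p = mcgcn_k_layer(A_hat, X_perturbed, W_S, alpha=0.5, k=2)
```

With one channel the representation of node 0 ignores node 2, which is two hops away; adding the second channel makes it sensitive to that node without adding a layer.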

C. INFORMATION AGGREGATION WITH ARBITRARY ORDER OF PROXIMITY
Algorithm 1 The Forward Process of the MCGCN-k Algorithm
Input: The adjacency matrix $A$, the feature matrix $X$, and the parameter $\alpha$
Output: The predicted label distribution matrix $\hat{Y}$

Along this line, to aggregate information of arbitrary order for each node, the number of channels should be no less than the diameter of the network $G$. However, for large-scale networks, the diameter becomes very large, that is, $k \to \infty$ in Equation 6. Alternatively, we introduce the Katz index as follows:

$$S = \sum_{i=1}^{\infty} \alpha^{i} \hat{A}^{i}. \tag{7}$$

As a result, the Katz index can be formulated as:

$$S = \left(I_N - \alpha \hat{A}\right)^{-1} - I_N. \tag{8}$$

Accordingly, the feature aggregation rule becomes:

$$H = \sigma\left(\left(\left(I_N - \alpha \hat{A}\right)^{-1} - I_N\right) X W_S\right). \tag{9}$$

Notably, the parameter $\alpha$ should be set properly to ensure convergence [44]. In doing so, we can aggregate information of arbitrary order with just a one-layer GCN. We name this model ''MCGCN-Katz''. The detailed forward process is shown in Algorithm 2.
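The relation between the infinite channel sum and its closed form can be checked numerically. The sketch below is an assumption-laden illustration rather than the authors' implementation: it uses dense NumPy matrices, a toy path graph, and an explicit convergence guard based on the standard condition that α times the spectral radius of the propagation matrix must be below one. For the renormalized adjacency matrix the largest eigenvalue is 1, so the guard amounts to α < 1.

```python
import numpy as np

def renormalized_adjacency(A):
    """A_hat = D~^{-1/2} (A + I_N) D~^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def katz_closed_form(A_hat, alpha):
    """(I - alpha*A_hat)^{-1} - I, the limit of sum_{i>=1} (alpha*A_hat)^i."""
    rho = np.abs(np.linalg.eigvalsh(A_hat)).max()    # spectral radius (A_hat is symmetric)
    if alpha * rho >= 1.0 - 1e-9:
        raise ValueError("series diverges: need alpha * spectral_radius < 1")
    I = np.eye(A_hat.shape[0])
    return np.linalg.inv(I - alpha * A_hat) - I

def katz_truncated(A_hat, alpha, k):
    """Finite sum over the first k channels: sum_{i=1..k} (alpha*A_hat)^i."""
    S = np.zeros_like(A_hat)
    P = np.eye(A_hat.shape[0])
    for _ in range(k):
        P = alpha * (P @ A_hat)      # P is now (alpha*A_hat)^i
        S += P
    return S

# Hypothetical toy input: path graph on 6 nodes, so nodes 0 and 5 are 5 hops apart.
A = np.diag(np.ones(5), 1)
A = A + A.T
A_hat = renormalized_adjacency(A)

S_exact = katz_closed_form(A_hat, alpha=0.5)
S_k3 = katz_truncated(A_hat, alpha=0.5, k=3)
S_k60 = katz_truncated(A_hat, alpha=0.5, k=60)
```

With three channels the two end nodes of the path do not interact at all, whereas the closed form assigns them a small positive weight, which is the sense in which the Katz variant reaches arbitrary order.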

Algorithm 2 The Forward Process of the MCGCN-Katz Algorithm
Input: The adjacency matrix $A$, the feature matrix $X$, and the parameter $\alpha$
Output: The predicted label distribution matrix $\hat{Y}$
1: $\tilde{A} \leftarrow A + I_N$; $\tilde{D}_{ii} \leftarrow \sum_j \tilde{A}_{ij}$;

Once node features are aggregated through the message-passing layer, they are treated as the input of a fully connected layer for multi-class node classification (see Figure 1). Formally, the output of the forward propagation model can be estimated as follows:

$$O = \sigma\left(\sum_{i=1}^{k} \alpha^{i} \hat{A}^{i} X W_S\right) W_F, \tag{10}$$

where $W_S \in \mathbb{R}^{d \times d_h}$ is a learnable input-to-hidden weight matrix in the convolutional layer, $W_F \in \mathbb{R}^{d_h \times C}$ is a learnable hidden-to-output weight matrix in the fully connected layer, and $\sigma$ denotes an activation function. For the MCGCN-Katz model, the finite sum over channels is replaced by the Katz index. Many activation functions can be adopted for $\sigma$, such as the rectified linear unit (ReLU) and the leaky ReLU (LReLU). In this paper, we employ the Exponential Linear Unit (ELU) as the activation function [45]:

$$\mathrm{ELU}(x) = \begin{cases} x, & x > 0, \\ a\left(e^{x} - 1\right), & x \le 0, \end{cases} \tag{11}$$

where $a > 0$ is the scale hyperparameter of the ELU. The output of the fully connected layer is then normalized by a row-wise softmax activation, which transforms the output into a probability distribution over the classes:

$$\hat{y}_{ic} = \frac{\exp(o_{ic})}{\sum_{c'=1}^{C} \exp(o_{ic'})}, \tag{12}$$

where $o_i$ is the output vector of node $i$ after the information propagation and activation. The estimated label distribution $\hat{Y} \in \mathbb{R}^{N \times C}$ stacks the $N$ class distribution vectors $\hat{y}_i$ on top of one another. For multi-class node classification, the cross-entropy loss is used to train the model parameters:

$$\mathcal{L} = -\sum_{i \in V_{train}} \sum_{c=1}^{C} y_{ic} \ln \hat{y}_{ic}, \tag{13}$$

where $V_{train}$ is the set of labeled nodes used for training, and $C$ is the number of classes. The terms $y_{ic}$ and $\hat{y}_{ic}$ denote the ground-truth value and the predicted probability of node $i$ with respect to class $c$. It is worth noting that the node set $V_{train}$ differs from the labeled set $V_L$ because some nodes with known labels are used for validation.
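The output stage described above (ELU activation, row-wise softmax, and a cross-entropy loss restricted to the training nodes) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the loss is summed rather than averaged, a small `eps` is added inside the logarithm for numerical safety, and the toy scores, labels, and mask are hypothetical.

```python
import numpy as np

def elu(x, a=1.0):
    """ELU: x for x > 0, a * (exp(x) - 1) otherwise."""
    return np.where(x > 0, x, a * np.expm1(np.minimum(x, 0.0)))

def softmax_rows(O):
    """Row-wise softmax; the max is subtracted for numerical stability."""
    Z = O - O.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def masked_cross_entropy(Y, Y_hat, train_mask, eps=1e-12):
    """-sum over training nodes i and classes c of y_ic * ln(y_hat_ic)."""
    return -np.sum(Y[train_mask] * np.log(Y_hat[train_mask] + eps))

# Hypothetical toy scores for 3 nodes and 2 classes.
O = np.array([[2.0, 0.0],
              [0.0, 0.0],
              [0.0, 3.0]])
Y = np.array([[1, 0],
              [0, 1],
              [0, 1]])
train_mask = np.array([True, True, False])   # node 2 is held out

Y_hat = softmax_rows(O)
loss = masked_cross_entropy(Y, Y_hat, train_mask)
```

Only the first two rows contribute to `loss`; the held-out node affects neither the value nor, in a trained model, the gradient.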

E. COMPUTATIONAL COMPLEXITY
For the MCGCN-k model, the computational complexity of evaluating Equation 6 is $O(k \times |E| \times d)$, where $k$ is the number of channels in the convolutional architecture, $|E|$ is the number of non-zero elements in $\hat{A}$, and $d$ is the dimension of the node features. Generally, we have $|E| \ll N^2$ due to the sparsity of real-world networks. Moreover, we have $k \ll |E|$ in the proposed model because $k$ is less than the diameter of the network. For the MCGCN-Katz model, the computational overhead is relatively high because evaluating the Katz index requires calculating the inverse of a matrix. However, because of the hyperparameter $\alpha$, the entries of $(\alpha \hat{A})^{i}$ become small as the order $i$ increases. Therefore, an alternative implementation is to approximate the Katz index with a relatively large $k$, chosen, for example, such that $\alpha^{k} < 0.1$. In summary, with this approximation, the computational complexity of both models is linear in the number of edges in the network.
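The linear cost and the truncation rule can be illustrated with a short sketch. It assumes the channel sum is evaluated by repeated multiplication with the propagation matrix instead of forming matrix powers, which is one common way to obtain a cost proportional to the number of non-zeros per channel when the matrix is stored sparsely; dense NumPy arrays are used here only to keep the example self-contained. The helper names and toy data are hypothetical.

```python
import numpy as np

def renormalized_adjacency(A):
    """A_hat = D~^{-1/2} (A + I_N) D~^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def channels_for_tolerance(alpha, tol=0.1):
    """Smallest number of channels k with alpha**k < tol."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    k = 1
    while alpha ** k >= tol:
        k += 1
    return k

def propagate(A_hat, XW, alpha, k):
    """sum_{i=1..k} alpha^i A_hat^i XW using k matrix products, no matrix powers."""
    P = XW
    total = np.zeros_like(XW)
    for _ in range(k):
        P = alpha * (A_hat @ P)   # one product per channel
        total += P
    return total

# Hypothetical toy input: path graph on 5 nodes and random projected features.
rng = np.random.default_rng(0)
A = np.diag(np.ones(4), 1)
A = A + A.T
A_hat = renormalized_adjacency(A)
XW = rng.normal(size=(5, 2))

alpha = 0.5
k = channels_for_tolerance(alpha)          # 0.5**4 = 0.0625 < 0.1
fast = propagate(A_hat, XW, alpha, k)
slow = sum(alpha**i * np.linalg.matrix_power(A_hat, i) @ XW for i in range(1, k + 1))
```

Both routes give the same result, but the iterative one never materializes the dense powers of the propagation matrix.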

V. EXPERIMENTS
In this section, we first carry out experiments on several benchmark networks to evaluate the performance of our proposed architecture by comparing it with the state-of-the-art methods with respect to the semi-supervised node classification problem on networks. Then, we conduct experiments to verify the computational efficiency of our architecture. Finally, we show how the different settings of hyperparameter contribute to the performance of our methods in terms of classification accuracy.

A. DATASETS
To evaluate the advantage of higher-order information aggregation, we conduct experiments on two types of network datasets that have been widely used in the field of graph representation learning: academic paper citation networks and air-traffic networks.

1) CITATION NETWORKS
For the citation networks, each node represents a published paper and each edge between two nodes indicates the existence of a citation relationship between them. The ground-truth class of each node represents the main subject of the paper. Intuitively, densely connected nodes are more likely to have the same class. In this case, it is not necessary to aggregate information from distant nodes. In our experiments, we adopt the following three citation networks:
1) Cora Network [46]: This dataset contains a selection of the Cora database, which consists of 2,708 papers in the field of machine learning with 5,229 citation links. All papers have been classified into 7 classes according to their main subject, including Genetic Algorithms, Neural Networks, Probabilistic Methods, Reinforcement Learning, Rule Learning, and so on. In addition, each paper is associated with a binary Bag-of-Words (BoW) feature vector, which is extracted from the paper abstract. The dimension of the feature vector is 1,433.
2) CiteSeer Network [46]: This dataset is a part of the CiteSeer database, which consists of 3,327 papers in the field of machine learning and 4,732 citation links. The class of each node is also determined by its main subject. In total, there are 6 classes, including Agents, Artificial Intelligence, Databases, Information Retrieval, and so on. The feature vector of each node is generated using the Bag-of-Words (BoW) method. The dimension of the feature vector is 3,703.
3) Pubmed Network [47]: The Pubmed dataset consists of 19,717 scientific publications and 44,338 citation links from the PubMed database. All papers have been classified into 3 classes according to the disease type. Each node in this network is associated with a TF-IDF weighted word vector built from a dictionary consisting of 500 unique words.

2) AIR-TRAFFIC NETWORKS
For the air-traffic networks, each node represents a real airport and each edge between two nodes indicates the existence of airlines between them. Node labels of these networks are generated based on the airport activity, which is measured by the total number of landings plus takeoffs in 2016. Intuitively, the class of an airport is related to its structural role in the corresponding network. Therefore, nodes that are far away from each other may belong to the same class. In our experiments, we focus on the following two air-traffic networks:
1) Brazilian air-traffic network [18]: Data was collected from the National Civil Aviation Agency (ANAC) of Brazil from January to December 2016. The network contains 131 nodes and 1,038 edges. The ground-truth label has 4 classes.
2) European air-traffic network [18]: Data was collected from the Statistical Office of the European Union (Eurostat) from January to November 2016. The network has 399 nodes and 5,995 edges. In total, there are 4 ground-truth classes.
It is noteworthy that there are no additional node features in the air-traffic networks. Therefore, we treat the identity matrix $I_N$ as the input feature matrix $X$ of our methods.
Finally, the statistic characteristics of all datasets are summarized in Table 1. Based on existing studies [17], [18], [34], the labels of citation networks are more relevant to structural proximity among different nodes, while the labels in air-traffic networks are more relevant to structural equivalence (or roles) of different nodes. To better understand their differences, we illustrate the Cora citation network and the European air-traffic network in Figure 2. Nodes with the same labels are shown in the same color. It can be observed that nodes with the same color are inclined to exhibit the community structure in the Cora citation network, for example, the red nodes in Figure 2(a). On the contrary, in the European air-traffic network, nodes with the same color are far apart from each other, such as the yellow nodes in Figure 2(b). In this case, one goal of our experiments is to evaluate whether the proposed methods can achieve better performance on both types of datasets.

B. BASELINE METHODS
Although a large number of GRL methods have been designed for graph learning, such as spectral clustering [49], label propagation (LP) [36], manifold regularization [50], and semi-supervised embedding (SemiEmb) [51], evidence shows that most of them perform relatively poorly compared with recent GCN methods. As a result, we omit their results here.
The baseline methods used for comparison in our experiments are described as follows:
1) DeepWalk [33]: This is a classical random walk-based network embedding method, where the word2vec model is first introduced to learn node embeddings [52], [53].
2) Chebyshev [39]: The Chebyshev GCN defines graph convolutions using a k-degree polynomial of the Laplacian to approximate the output of the Laplacian eigendecomposition.
3) Planetoid [54]: The Planetoid GCN jointly trains node representations for classification and graph context prediction, and does not depend on graph Laplacian regularization.
4) GCN [24]: The vanilla GCN method simplifies the Chebyshev GCN via a localized first-order approximation of spectral graph convolutions, i.e., the renormalization trick.
5) GCN-3: For a fair comparison, we add an additional GCN layer to the vanilla GCN in order to aggregate information from more distant nodes.
6) Monet [41]: This is a kind of Laplacian-based approach, which designs a unified generalization of CNN architectures to graph data.
7) GAT [25]: The graph attention network incorporates the attention mechanism into each layer of the GCN framework in order to weight the information from different neighbors.
8) SGC [26]: The simplified GCN simplifies the vanilla GCN by reducing the number of non-linear activations and convolutional layers.

C. EXPERIMENTAL SETTINGS
Our experiments were run on a single PC with one NVIDIA RTX 2070 GPU (8 GB of memory) and one Intel Core i7 CPU (i7-4771 @ 3.50GHz). All models were implemented in Keras with the TensorFlow-GPU backend. For the three citation networks, we split the datasets in the same way as in [24], [25], [54]. For each network, 20 nodes per class were used for training, 500 nodes were used for validation, and another 1000 nodes were used for testing. In contrast, for each air-traffic network, a quarter of all nodes were first randomly sampled for training, another quarter was randomly sampled for validation, and the rest were used for testing. For comparison, all results were averaged over 50 experimental runs.
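The two splitting schemes can be sketched as index sampling. Note the assumption: the citation-network splits of [24] are fixed public splits, so the random per-class draw below only mimics their sizes, while the quarter split follows the description given for the air-traffic networks. The function names, the synthetic label vector, and the seed are hypothetical.

```python
import numpy as np

def per_class_split(labels, per_class, n_val, n_test, rng):
    """Draw `per_class` training nodes per class, then validation and test nodes."""
    labels = np.asarray(labels)
    train = np.concatenate([
        rng.choice(np.flatnonzero(labels == c), size=per_class, replace=False)
        for c in np.unique(labels)
    ])
    rest = rng.permutation(np.setdiff1d(np.arange(len(labels)), train))
    return train, rest[:n_val], rest[n_val:n_val + n_test]

def quarter_split(n_nodes, rng):
    """A quarter for training, a quarter for validation, the rest for testing."""
    perm = rng.permutation(n_nodes)
    q = n_nodes // 4
    return perm[:q], perm[q:2 * q], perm[2 * q:]

rng = np.random.default_rng(0)

# Synthetic stand-in for a Cora-sized label vector (2,708 nodes, 7 classes).
labels = rng.integers(0, 7, size=2708)
train, val, test = per_class_split(labels, per_class=20, n_val=500, n_test=1000, rng=rng)

# Sizes matching the European air-traffic network (399 nodes).
air_train, air_val, air_test = quarter_split(399, rng)
```

Repeating the quarter split with a fresh seed per run is one way to realize the averaging over 50 runs mentioned above.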
Specifically, the Glorot method [55] was used for model initialization and the cross-entropy loss was minimized using the Adam optimizer [56]. During training, the learning rate was fixed at 0.01. Additionally, a maximum of 200 epochs was set for training and an early stopping strategy was implemented with a patience of 30 epochs. In other words, training is interrupted automatically if the loss does not decrease for 30 consecutive epochs. Moreover, dropout [57] and $L_2$ regularization were applied to all layers. Finally, the hyperparameters of all models were tuned by grid search on the validation data in the following ranges: dropout rate $p \in [0.1, 0.9]$ with an interval of 0.1, the $L_2$ regularization coefficient $\lambda \in \{1\mathrm{e}{-3}, 5\mathrm{e}{-4}, 1\mathrm{e}{-4}, 5\mathrm{e}{-5}\}$, the number of first hidden units $h \in \{16, 24, 32, 64\}$, and $\alpha \in [0.1, 1.0]$ with an interval of 0.1. The number of hidden units in the output layer is set equal to the number of classes. In Table 2, we summarize the values of all hyperparameters used by our methods.
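The search space and the stopping rule can be written down in a few lines of plain Python. This is a sketch under assumptions: the dictionary keys, the function names, and the synthetic loss curve are hypothetical, and which loss is monitored (training or validation) is not specified in the text, so the function simply takes a sequence of monitored losses.

```python
import itertools

# Search ranges as listed in the experimental settings.
GRID = {
    "dropout": [round(0.1 * i, 1) for i in range(1, 10)],   # 0.1 ... 0.9
    "l2": [1e-3, 5e-4, 1e-4, 5e-5],
    "hidden": [16, 24, 32, 64],
    "alpha": [round(0.1 * i, 1) for i in range(1, 11)],     # 0.1 ... 1.0
}

def grid_points(grid):
    """Yield one dict per hyperparameter combination."""
    keys = list(grid)
    for values in itertools.product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))

def stopping_epoch(losses, patience=30, max_epochs=200):
    """1-based epoch at which training stops under early stopping."""
    best = float("inf")
    wait = 0
    last_epoch = 0
    for epoch, loss in enumerate(losses[:max_epochs], start=1):
        last_epoch = epoch
        if loss < best:
            best, wait = loss, 0        # improvement resets the counter
        else:
            wait += 1
            if wait >= patience:        # no improvement for `patience` epochs
                return epoch
    return last_epoch

n_configs = sum(1 for _ in grid_points(GRID))

# Synthetic curve: improves for 50 epochs, then plateaus above the best value.
curve = [1.0 - 0.01 * e for e in range(50)] + [0.6] * 150
stopped_at = stopping_epoch(curve)
```

The full grid is sizeable, which is one reason to keep each training run short through early stopping.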

D. PERFORMANCE COMPARISON

1) CLASSIFICATION ACCURACY
We first evaluate the performance of our methods on the three citation networks. Table 3 summarizes the classification accuracy reported in existing studies [25], [26]. It can be observed that both the Multi-Layer Perceptron (MLP) and DeepWalk perform worse than the GCN methods on the three citation networks. The reason is that the MLP uses only node features and ignores the network structure, whereas DeepWalk, as a random walk-based network embedding method, uses only the structural information of a network and ignores node features. The other methods are different variants of GCNs, which incorporate both network structure and node features into an end-to-end learning process [24]-[26]. Among them, the GCN, GAT, and SGC methods perform better than the other three methods. Therefore, in our experiments, we mainly compare the performance of our methods with these three GCNs. Table 4 summarizes the classification accuracy, with standard deviation, of each method implemented under our experimental settings. It can be observed that our MCGCN-Katz model has the best performance on the Cora network, with a 2.1% improvement over the GCN method. The MCGCN-3 method also achieves higher accuracy than the GCN-3 method. Similarly, on the Pubmed network, the MCGCN-Katz method achieves the best performance. As for the CiteSeer network, the performance of our models is slightly worse than that of the SGC and GAT methods. However, as stated in [26], SGC reduces to a multi-class logistic regression on the network, because it removes the non-linear functions and a learnable kernel from the vanilla GCN. Consequently, the generalization ability of SGC is relatively weak. As for the GAT method, its computational overhead and memory usage are much higher than those of our methods due to the attention mechanism [25].

Table 5 shows the classification results on the two air-traffic networks. It can be observed that our MCGCN-Katz method achieves much better performance than all baseline methods. The reason is that, for the air-traffic networks, node labels are more relevant to the structural roles of nodes in the network, such as transportation hubs or mediators [17], [18]. In this case, two airports with the same label may be far apart from each other in the network. Among all compared methods, only the MCGCN-Katz method allows information aggregation of arbitrary order through the network. As a result, nodes with the same label are more likely to share their structural features.
In summary, evidence has revealed that nodes far apart from each other may have similar functions in complex networks [58]. Therefore, it would be essential for GCNs to enable higher-order information aggregation without increasing the computational overhead. In this paper, the proposed multi-channel graph convolutional architecture allows information aggregation of nodes with arbitrary order of proximity in a network, where other GCN-based GRL algorithms cannot leverage in practice. Moreover, the shared weight mechanism can avoid the overfitting of the proposed 1 Codes are available at https://github.com/zhouchunpong/GCN_Keras. MCGCN architecture by reducing the complexity of the model. In doing so, the aggregation of higher-order information together with the shared weight mechanism ensures the efficiency of the proposed MCGCN algorithm in solving the semi-supervised node classification problem on attributed networks.
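To make the idea of channel-wise higher-order aggregation with shared weights concrete, the following NumPy sketch shows one possible reading of it. It is an illustration under stated assumptions, not the implementation used in our experiments: the channels are taken to be successive powers of the symmetrically normalized adjacency matrix of the vanilla GCN, the channel outputs are combined by summation, and all function names are hypothetical.

```python
import numpy as np


def normalized_adjacency(A):
    """Symmetrically normalized adjacency with self-loops, as in the vanilla GCN."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def channel_operators(A, num_channels):
    """Return [P, P^2, ..., P^K]; channel k mixes features up to k hops away."""
    P = normalized_adjacency(A)
    ops, current = [], np.eye(A.shape[0])
    for _ in range(num_channels):
        current = current @ P
        ops.append(current)
    return ops


def multi_channel_layer(A, X, W, num_channels=3):
    """One layer in which every channel uses the same weight matrix W.

    Summing the channel outputs is an assumption made for this sketch.
    """
    XW = X @ W  # computed once because W is shared across channels
    out = sum(P_k @ XW for P_k in channel_operators(A, num_channels))
    return np.maximum(out, 0.0)  # ReLU


# Path graph 0-1-2-3: nodes 0 and 3 are three hops apart.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
ops = channel_operators(A, 3)
```

On this path graph, the entry linking nodes 0 and 3 is zero in the first- and second-order channels and becomes positive only in the third-order channel, so a single layer with three channels already lets the two end nodes exchange information. Because the weight matrix is shared, adding channels does not add learnable parameters in this sketch.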

2) COMPUTATIONAL OVERHEAD
We compare the training time of our methods (MCGCN-3 and MCGCN-Katz) with that of the state-of-the-art GAT method on the three citation networks. Specifically, all code was run on a GPU with TensorFlow under the same experimental settings and environment (see Section V-C). We measure the running time from the first epoch to the last epoch, repeat the measurement 50 times, and use the average running time for comparison. Figure 3 shows the comparison of training time between GAT and our methods on the three citation networks. It can be observed that both MCGCN-Katz and MCGCN-3 train faster than the GAT method. On the Cora network, the computational overhead of MCGCN-Katz is just 31.5% of that of the GAT method, and the MCGCN-3 method takes only 26.8% of GAT's overhead. Similar results are observed on the Citeseer network. On the Pubmed network, however, the GAT method failed to train the model because it requires more GPU memory. The reasons are threefold. First, the attention mechanism in the GAT method increases the requirements for memory and computing resources. Second, to stabilize the learning process of the attention mechanism, multiple attention heads are computed independently, which aggravates the computational burden. Third, the output features of the first layer are concatenated, which increases the dimension of the hidden layers as well as the number of learnable parameters. In summary, with much less computational overhead, our methods achieve classification accuracy comparable to that of the GAT method.
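The measurement protocol above can be sketched in a few lines of Python. The function names and the `train_fn` callable are hypothetical placeholders for a full training run from the first to the last epoch; the sketch only illustrates the averaging and the percentage comparison, not our actual benchmarking script.

```python
import statistics
import time


def mean_training_time(train_fn, repeats=50):
    """Average wall-clock time of `train_fn` over `repeats` independent runs."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        train_fn()  # hypothetical callable: trains from the first to the last epoch
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations)


def relative_overhead(time_ours, time_baseline):
    """Express our training time as a percentage of the baseline's."""
    return 100.0 * time_ours / time_baseline
```

For instance, a value of 31.5 returned by `relative_overhead` corresponds to the statement that one method needs 31.5% of the baseline's training time.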

E. PARAMETER SENSITIVITY
In this part, we study how different settings of the hyperparameter α affect the performance of the semi-supervised node classification task. In the proposed architecture, α is a decay parameter that determines the relative importance of information channels with higher-order proximity. As α increases, features from higher-order channels receive relatively higher weights. However, to keep the Katz index convergent, α cannot be increased indefinitely. Figure 4 shows the sensitivity analysis of the MCGCN-Katz method on the Cora and Citeseer networks as α increases. It can be observed that the best choice of α differs across networks: the best performance on the Cora and Citeseer networks is achieved at α = 0.7 and α = 0.6, respectively. In the future, the value of α could also be learned automatically during training.
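The convergence constraint on α can be illustrated with the textbook form of the Katz index, the series Σ_{k≥1} α^k M^k, which converges to (I − αM)^{-1} − I only when α times the spectral radius of M is below one. In the NumPy sketch below, M stands for whichever propagation matrix the channels are built from; the sketch does not assume the exact normalization used in the proposed architecture, and the function names are hypothetical.

```python
import numpy as np


def spectral_radius(M):
    """Largest absolute eigenvalue of M."""
    return float(np.max(np.abs(np.linalg.eigvals(M))))


def katz_closed_form(M, alpha):
    """(I - alpha*M)^{-1} - I, valid only when alpha * rho(M) < 1."""
    if alpha * spectral_radius(M) >= 1.0:
        raise ValueError("alpha is too large: the Katz series diverges")
    n = M.shape[0]
    return np.linalg.inv(np.eye(n) - alpha * M) - np.eye(n)


def katz_truncated(M, alpha, order):
    """Partial sum of alpha^k M^k for k = 1, ..., order."""
    total = np.zeros_like(M, dtype=float)
    term = np.eye(M.shape[0])
    for _ in range(order):
        term = alpha * (term @ M)  # term now equals alpha^k M^k
        total += term
    return total
```

For the unnormalized adjacency matrix of a triangle, whose spectral radius is 2, the truncated sum approaches the closed form for α = 0.3, whereas α = 0.6 violates the condition and is rejected. The admissible range of α therefore depends on the spectrum of the matrix used, which is consistent with the best α differing between networks.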

VI. CONCLUSION
In this paper, we have proposed a multi-channel graph convolutional architecture to tackle the semi-supervised node classification problem on attributed networks in an end-to-end manner. Different from existing graph neural networks, the proposed architecture allows higher-order information aggregation through the network by increasing the number of input channels rather than the depth of a GCN (i.e., the number of layers). Further, based on the notion of the Katz index, we have also introduced a variation of the architecture that supports arbitrary order of information aggregation without aggravating the computational overhead. Moreover, we have introduced a shared weight mechanism to assess the relative importance of different attributes. To evaluate the performance of the proposed GCN architecture, we have carried out experiments on several benchmark networks, including three citation networks and two air-traffic networks. By comparing with state-of-the-art GRL methods, we have shown that our framework can achieve better performance in terms of both node classification and computational efficiency. The proposed multi-channel architecture offers new insight into the function of long-range weak connections in complex networks.
In the future, the proposed MCGCN architecture can be extended in the following two directions. First, in this paper, we only considered the information aggregation of nodes with higher-order proximity. In the field of complex networks, evidence has shown that higher-order connectivity patterns play essential roles in understanding the fundamental functions of complex systems [58]. Therefore, it is worthwhile to construct the input channels of the proposed MCGCN based on other higher-order structures, such as motifs, graphlets, Weisfeiler-Lehman isomorphism, and shortest path length. Second, an appropriate attention mechanism can be introduced to automatically assess the relative importance of information channels. In doing so, we can quantify which type of higher-order structure is more relevant to the machine learning task on networks. The relationship between higher-order structures and network functions can further help improve the interpretability of the GRL architecture.