Domain Adaptive Graph Infomax via Conditional Adversarial Networks

The emerging graph neural networks (GNNs) have demonstrated impressive performance on the node classification problem in complex networks. However, existing GNNs are mainly devised to classify nodes in a (partially) labeled graph. To classify nodes in a newly-collected unlabeled graph, it is desirable to transfer label information from an existing labeled graph. To address this cross-graph node classification problem, we propose a graph infomax method that is domain adaptive. Node representations are computed through neighborhood aggregation. Mutual information is maximized between node representations and global summaries, encouraging node representations to encode the global structural information. Conditional adversarial networks are employed to reduce the domain discrepancy by aligning the multimodal distributions of node representations. Experimental results on real-world datasets validate the performance of our method in comparison with state-of-the-art baselines.


I. INTRODUCTION
Node classification is a vital task in various real applications, such as the prediction of user characteristics in social networks [1], the classification of protein roles in protein-protein interaction (PPI) networks [2], and the assignment of research topics for publications in citation networks [3]. Since real-world networks are usually sparse, nonlinear, and high-dimensional, it is challenging to learn meaningful information from these networks to facilitate node classification [4].
A general way is to encode graph information, like graph structure and node attributes, into low-dimensional node representations (i.e., embedding vectors) [4], [5]. On top of node representations, node classification can be easily performed by employing classical machine learning techniques, e.g., a logistic regression classifier.
Existing studies mainly focus on node classification in a (partially) labeled graph [2], [3]. In real applications, it is frequently encountered that the nodes are unlabeled in a newly collected graph. A node classification model, which is learned in an existing labeled graph, can sometimes be directly applied to an unlabeled new graph. However, different graphs usually have diverse data distributions of node connections, attributes, and labels, which would degrade the performance of such direct application.
Therefore, it is desirable if a learning model can effectively transfer label information from a labeled graph to assist node classification in an unlabeled graph. Otherwise, when a new graph is collected, to obtain satisfactory node classification performance, we have to label the nodes before rebuilding a learning model. For example, in a newly-formed unlabeled social network, to enable user prediction, it would be beneficial to transfer knowledge from a mature social network which is well annotated. In addition, to classify proteins in a newly-collected PPI network, it would be favorable if the label information in an existing biological network can be utilized. Moreover, the abundant label information in well-established citation databases would be helpful to enable the classification of papers in a newly-constructed citation network.
In this work, we consider the cross-graph node classification problem [6], [7], [8], [9] illustrated in Fig. 1. The research goal is to facilitate node classification in an unlabeled target graph by transferring label information from a labeled source graph. There are no edge connections or common nodes between the source and target graphs. With only a part of the node attributes in common, the discrepancy between the source and target graphs is further enlarged.
Since the source and target graphs can be treated as two independent domains, cross-graph node classification has a close relation with domain adaptation research. As a subtopic of transfer learning [10], domain adaptation studies are mostly conducted in the fields of computer vision [11], [12], [13] and natural language processing [14], [15]. The input data (i.e., images and text) is usually assumed to be independent and identically distributed (i.i.d.). The classical domain adaptation approaches are not directly applicable to the graph-structured data, since the nodes in a graph are highly correlated by edges, thus violating the i.i.d. assumption [16].
There are a few recent studies addressing the challenging problem of cross-graph node classification, such as CDNE [6], ACDNE [7], AdaGCN [8], and UDA-GCN [9]. Although they perform better than methods designed for single-graph learning, three open questions remain to be further explored.
First, from the global view of a graph, if distant nodes have similar structural roles, they are likely to perform the same function [17], which can be an important predictor for the node labels. Existing methods [6], [7], [8], [9] make nodes have similar representations if they are close in a graph, such as adjacent nodes or nodes within K steps. This is achieved by the Laplacian smoothing of graph convolution [18] or by an additional constraint based on the PPMI matrix [19]. Although these methods reach a certain level of local or global consistency, they still lack consideration of the global structural role of a node.

Second, existing studies align the source and target representations by minimizing the domain classification error [7], [9] or distribution discrepancy metrics such as the Wasserstein distance [8]. Since the domain alignment in these methods is category agnostic, the aligned node representations are possibly not classification friendly. Moreover, due to the nature of the classification problem, the distribution of node representations would be multimodal, and these domain adaptation techniques may fail to capture the multimodal structure [20].

Third, AdaGCN [8] and UDA-GCN [9] employ graph neural networks (GNNs) that apply spectral graph theory [21] to define filters for graph convolutions, processing the whole graph in one go. This may hinder their application to real-world large-scale graphs, which can be directed, signed, or heterogeneous [16].

To address the above three issues, we propose a novel method, named domain Adaptive Graph Infomax via conditional adversarial networks (AdaGIn). Given a labeled source graph and an unlabeled target graph, the objective of our method is to classify nodes in the target graph by jointly utilizing the information in the source and target graphs. Our method consists of three modules: a representation learner, a node classifier, and a domain discriminator.
Representation learner employs the spatial GNN layers to compute node representations for the source and target graphs. The spatial GNN layers repeatedly aggregate the local neighborhood information [22]. The generated node representation summarizes a patch of graph centered around this node, thus containing the local structural information. Inspired by DGI [23], the node representation is then driven to preserve the mutual information with the graph-level global representation. Maximization of the local-global mutual information could encourage node representations to capture the global structural properties [23]. Taking node representations as inputs, node classifier produces label predictions. Representation learner and node classifier are trained to minimize the cross-entropy loss of labeled nodes in the source graph, so that the learned node representations can be label-discriminative.
To reduce the domain discrepancy between the source and target graphs, our method applies the conditional domain adversarial networks [20], which model the domain adaptation process as a two-player game. Domain discriminator is trained to distinguish whether an input sample comes from the source or target graph. In contrast, representation learner is optimized to "fool" domain discriminator by generating domain-invariant node representations. To enhance the alignment of multimodal distributions, domain discriminator is conditioned on the discriminative information in the classifier predictions. Fig. 2 illustrates that, with the help of domain adaptation, the generated node representations are more meaningful and separable in the 2D space, benefiting the subsequent node classification task.
The proposed method is evaluated using five real-world networks, including three citation networks and two social networks. The main contributions are summarized as follows.
- A novel spatial GNN model is proposed to tackle the challenging problem of cross-graph node classification. Its overall performance is superior to the state-of-the-art baselines in the benchmark transfer tasks.
- Mutual information maximization is enabled in cross-graph learning, which encourages node representations to encode the global structural information.
- Conditional adversarial networks are introduced to reduce the domain discrepancy by matching the multimodal distributions of node representations.

Fig. 1. An example of cross-graph node classification. The research goal is to facilitate node classification in an unlabeled target graph by transferring knowledge from a labeled source graph.

Fig. 2. Visualization of node representations in the target graph (source graph: Citationv1; target graph: ACMv9). Each point represents one node. Five colors distinguish the node classes. The sign "-" means the domain adaptation (DA) component has been removed from the AdaGIn model.
The rest of this paper is organized as follows. Section II reviews the related literature. Section III presents the proposed method. Section IV reports the experiments and results. Finally, Section V concludes this paper.

II. RELATED WORK
In this section, we introduce the related work in three aspects, including the research on domain adaptation, graph representation learning, and cross-graph learning.

A. Domain Adaptation
As a subtopic of transfer learning, domain adaptation [10] aims at transferring knowledge from a source domain with abundant labels to a target domain that is short of labels. The feature-based domain adaptation approaches, which attract lots of research interest due to their effectiveness, can be categorized into three kinds: the discrepancy-based [12], [24], the reconstruction-based [25], and the adversarial-based [20], [26], [27].
The adversarial-based methods are considered in this work. To learn domain-invariant features, DANN [26] builds an adversarial training platform, in which representation learner and domain discriminator play a minimax game. WDGRL [27] introduces the Wasserstein distance [28] to measure the domain divergence, in order to improve the gradient property and the generalization bound. Inspired by the recent progress in conditional generative adversarial networks (CGANs) [29], CDAN [20] conditions the domain discriminator on the discriminative information contained in the classifier predictions, which helps match the multimodal distributions in the classification problem.
There are also some recent advances worth noticing. For example, ATM [30] devises a novel loss, named as MDD, to quantify the distribution gap. By optimizing this additional loss, ATM alleviates the equilibrium challenge issue [31], thus improving the performance of adversarial domain adaptation. In addition, Li et al. [32] innovatively employed the adversarial attacks to enable domain adaptation in the absence of source data or target data.
The existing domain adaptation methods usually assume the input samples, such as images and text, to be independent and identically distributed (i.i.d.). They are not directly applicable to solving learning problems on graph-structured data, where highly-correlated nodes violate the i.i.d. assumption [16].
A few recent studies construct a graph adjacency matrix based on the pairwise distance of image embeddings/features. Graph matching is then employed to align the domains of image samples. For example, by minimizing the spectral distance of graph Laplacians, LGA [33] forces the manifold of the target domain to have a connectivity structure similar to that of the source domain. In addition, Das and Lee [34] applied first-, second-, and third-order hyper-graph matching to match the images in different domains. These approaches are devised for the alignment of image samples rather than data that can be naturally represented as a graph, such as citation and social networks.
B. Graph Representation Learning

GNNs are classified into two categories: spectral approaches and spatial ones. The spectral GNNs devise spectral filters to perform graph convolutions. The classical GCN model [3] simplifies the spectral convolutions and processes the whole graph at once. Modern GNNs usually update node representations by iteratively aggregating neighborhood information [22]. This stream of approaches relies on the spatial relations of nodes to propagate information [16], and the computation of node representations can be conducted on a batch of nodes rather than the whole graph. GraphSAGE [2] pioneered the design of various aggregation functions to aggregate information from the sampled neighborhood. In graph attention networks (GAT) [42], a node representation is the weighted sum of the representations of all neighboring nodes. Compared with spectral approaches, spatial ones have advantages in scalability to large graphs and generality to various graph types [16].
To capture the statistical similarities among data, contrastive learning [44] has been introduced to encode graphs into informative representations. Inspired by Deep InfoMax (DIM) [45], some recent studies (e.g., DGI [23] and InfoGraph [46]) maximize the mutual information between the local representations of substructures (e.g., a patch of graph centered around a node) and the global representations of a graph. By doing so, the local and global representations can be mindful of each other, leading to improved performance on node classification or graph classification.
Existing studies mainly focus on learning in a single graph, ignoring knowledge transfer across graphs. Although a model learned in one graph can sometimes be adapted to perform learning tasks in a new graph, to improve model performance, more efforts are needed to overcome the discrepancy between graphs. To this end, our work jointly explores contrastive learning and domain adaptation to perform transfer learning across graphs.

C. Cross-Graph Learning
Some recent research on cross-graph learning assumes the existence of inter-graph connections [47], [48] or common nodes across graphs [49], [50]. Besides, with the target graph partially labeled, the basic assumption of DASGA [51] is that the source and target graphs have similar frequency contents in the label function. DASGA then performs domain adaptation on graphs by learning aligned graph Fourier bases.
Without the above assumptions, some other studies [6], [7], [8], [9] focus on transfer learning between two independent graphs. The abundant label information in a source graph is expected to be transferred to facilitate node classification in a target graph lacking labels.
NetTr [52] projects the label propagation matrices of the source and target graphs into a common latent space utilizing nonnegative matrix tri-factorization (NMTF) [53]. Although node attributes are also combined to train the node classifier, the transferrable representations are learned solely based on graph topology, in order to capture structural patterns shared by the source and target graphs. CDNE [6] learns node representations in an autoencoder architecture. The maximum mean discrepancy (MMD) [54] is minimized between the source and target representations to mitigate domain divergence. ACDNE [7] preserves attribute affinity and topology proximity separately using two feature extractors. A gradient reversal layer [26] is employed in ACDNE to learn domain-invariant node representations.
Recently, a few GNN-based methods are proposed to perform cross-graph node classification. AdaGCN [8] employs GCN [3] to learn node representations. This method minimizes the Wasserstein distance [28] between the source and target representations to enable domain adaptation. UDA-GCN [9] develops a dual graph convolutional network to jointly preserve the local and global consistency of a graph. Gradient reversal layer [26] is also used in UDA-GCN to achieve domain adaptation. Unlike these spectral GNNs (i.e., AdaGCN and UDA-GCN), this work devises a spatial GNN model to address this challenging problem. Two recent techniques, mutual information maximization [23] and conditional adversarial networks [20], are employed to capture global graph information and multimodal embedding distribution, respectively.
It is worth noting that a few recent GNN models transfer knowledge following a pre-training and fine-tuning paradigm, without utilizing adversarial domain adaptation. As one typical example, GCC [55] pre-trains a model for discovering common structural patterns in the absence of node attributes and node labels. Specifically, the pre-training is designed as subgraph instance discrimination in and across multiple source graphs. GCC leverages contrastive learning to distinguish similar subgraph instances from dissimilar ones. The pre-trained model is then fine-tuned on a partially-labeled target graph for the node classification task. Differing from GCC, in this work, we consider a transfer learning problem involving two attributed graphs: one fully labeled source graph and one unlabeled target graph. Contrastive learning is employed to maximize the mutual information between node representations and the graph-level representation within a single graph (i.e., the source graph or the target graph). To reduce domain divergence, we make use of conditional adversarial networks.

III. PROPOSED METHOD
In this section, we first introduce the research problem and the main notations. Then we present the model architecture and elaborate each module. Next, we provide a theoretical analysis about the adversarial domain adaptation. Finally, we describe the training algorithm followed by a time complexity analysis.

A. Problem Definition and Notations
An information network can be represented as an attributed graph G = (V, A, X, Y), where V, A ∈ R^{N×N}, X ∈ R^{N×L}, and Y ∈ R^{N×C} are the node set, adjacency matrix, attribute matrix, and label matrix, respectively. N is the number of nodes, L is the dimension of node attributes, and C is the number of node labels. In A, X, and Y, the i-th row contains binary values indicating the edge connections, attributes, and labels of the i-th node v ∈ V, respectively. Specifically, A_ij = 1 means there is an edge connecting the i-th and j-th nodes; X_id = 1 means the i-th node has the d-th attribute; and Y_ic = 1 means the i-th node is associated with the c-th label. The undirected graph is considered in this paper. The degree of the i-th node v is the number of its connected edges, i.e., Σ_j A_ij. The average degree is calculated by Σ_i Σ_j A_ij / N, indicating the density of a graph. The main notations of this paper are summarized in Table I.
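For concreteness, the degree and average-degree quantities above follow directly from the adjacency matrix; a minimal NumPy sketch with a toy undirected graph:

```python
import numpy as np

# Toy undirected graph with N = 4 nodes: binary, symmetric adjacency matrix A.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

degrees = A.sum(axis=1)            # degree of node i: sum_j A_ij
avg_degree = A.sum() / A.shape[0]  # (sum_i sum_j A_ij) / N

print(degrees)     # [2 2 3 1]
print(avg_degree)  # 2.0
```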
Graph embedding aims at mapping a node v ∈ V to a low-dimensional node representation e_v (i.e., an embedding vector). e_v^T is one row of the representation matrix E ∈ R^{N×l}, where l is the embedding dimension. Since the label space of nodes is Y = {1, ..., C}, the data distribution e_v ~ P(e_v) would be multimodal, with each mode corresponding to one class. The multimodal structure can be seen in the t-SNE visualization shown in Fig. 2: there are five clusters in total (i.e., C = 5), with each cluster indicating one mode. On top of node embeddings, a classifier can be learned to conduct node classification. In this work, we investigate the cross-graph node classification problem. The source graph is fully labeled, whereas in the target graph the label matrix Y^t is unknown, remaining to be predicted. We use two superscripts, s and t, to denote the source and target graphs, respectively.
In the case that the attribute sets of the source and target graphs (i.e., 𝒳^s and 𝒳^t) are not exactly the same, we construct a union attribute set 𝒳 = 𝒳^s ∪ 𝒳^t. With U = |𝒳| representing the total number of attributes, the attribute matrices of the source and target graphs can be reformulated as X^s ∈ R^{N^s×U} and X^t ∈ R^{N^t×U}, respectively. The union attribute set enables parameter sharing when processing the source and target graphs, since the same learning model can be applied to generate node representations for both graphs. The parameter sharing attempts to learn domain-invariant node representations and to facilitate knowledge transfer across graphs [56]. Through sharing parameters, the overall model architecture can also be more compact, with a reduced number of learnable parameters. Based on the union attribute set, we can further define a common attribute rate R_a = |𝒳^s ∩ 𝒳^t| / |𝒳^s ∪ 𝒳^t|, showing the percentage of common attributes shared by the source and target graphs.
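As an illustration, the union attribute set and the common attribute rate can be computed as follows (a minimal Python sketch; the attribute vocabularies are hypothetical):

```python
# Hypothetical attribute vocabularies of the source and target graphs.
attrs_s = {"w1", "w2", "w3", "w4"}
attrs_t = {"w3", "w4", "w5"}

union = sorted(attrs_s | attrs_t)  # shared vocabulary of size U
U = len(union)
# Common attribute rate R_a = |X^s ∩ X^t| / |X^s ∪ X^t|.
R_a = len(attrs_s & attrs_t) / len(attrs_s | attrs_t)

print(U)    # 5
print(R_a)  # 0.4
```

Both attribute matrices would then be re-indexed against `union`, yielding X^s and X^t with the same column dimension U, so that one shared model can process either graph.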
There are no shared common nodes or edge connections existing between the source and target graphs. Therefore, the source and target graphs can be treated as two independent domains. In addition to the varying graph scale, these two domains are also diverse in the distributions of node connections, attributes, and labels. Note that, as the settings in many prior arts [6], [7], [8], [9], the source and target graphs are required to have the same set of labels, that is, the categories of nodes in these two graphs are the same.
The goal of this work is to mitigate the graph divergence, so that label information in the source graph can be exploited to learn a node classifier that is readily applicable to classify nodes in the unlabeled target graph. To achieve this goal, node representations generated in the learning process need to be domain-invariant and label-discriminative.

B. Overview of Model Architecture
To achieve cross-graph node classification, there are two major challenges. First, the available information (i.e., graph structure, node attributes, and node labels) shall be well exploited to learn node representations that are informative enough for the subsequent node classification task. Second, the domain gap between source and target graphs shall be mitigated, so that the node representations can be shared across domains. Consequently, the node classifier trained on source representations can be readily applicable to predict target nodes using the target representations.
To address the above challenges, we propose a method named as domain Adaptive Graph Infomax via conditional adversarial networks (i.e., AdaGIn). Fig. 3 shows the proposed model architecture. The three main modules of this model are as follows.
Representation Learner. The node representations are learned by the GNN layers, which aggregate information from the local neighborhood, thus capturing the graph structure and node attributes simultaneously. To make a node representation mindful of the global structure and of other nodes with similar structural roles, inspired by DGI [23], the local-global mutual information is calculated and maximized in the learning process.

Node Classifier. On top of the learned node representations, the node classifier, which is a logistic regression, makes predictions on the node labels.

Domain Discriminator. The domain discriminator aims at telling apart the input samples from the source and target graphs. It competes with the representation learner during adversarial training using the gradient reversal layer, so that the representation learner can generate domain-invariant node representations and the discrepancy between the source and target graphs is reduced. As a result, the node classifier trained on the source representations can be more desirable for classifying target nodes with the target representations.

C. Node Representation Learning
The graph neural networks (GNNs) are applied to encode nodes into vector representations (i.e., embedding vectors). In the modern GNNs, a common strategy is to iteratively update the representation of a node by aggregating information from its neighboring nodes [22]. Considering node v in a sampled minibatch of nodes B, the k-th step/layer of the neighborhood aggregation is formulated as follows:

h_v^(k) = AGG_k(h_v^(k-1), {h_u^(k-1), ∀u ∈ S_v}),   (1)

where AGG_k is the aggregation function at step k, and S_v is a set of sampled nodes adjacent to node v. After K aggregation steps, the node representations of the minibatch can be written compactly as

E = f_g(x_S),   (3)

where f_g is the representation learner (i.e., the spatial GNN layers) and x_S is the attribute matrix of the sampled neighboring nodes.
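As a concrete illustration, one mean-aggregation step in this spirit can be sketched as follows (a NumPy sketch of a GraphSAGE-style aggregator, not necessarily the paper's exact AGG_k; the weight matrices and neighbor sets are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_aggregate(h, neighbors, W_self, W_neigh):
    """One mean-aggregation step: combine each node's own state with the
    mean of its sampled neighbors' states, then apply a ReLU nonlinearity."""
    out = np.empty((h.shape[0], W_self.shape[1]))
    for v, nbrs in neighbors.items():
        neigh_mean = h[nbrs].mean(axis=0)
        out[v] = np.maximum(0.0, h[v] @ W_self + neigh_mean @ W_neigh)
    return out

# Toy minibatch: 4 nodes with 3-dim states h^(k-1); sampled neighbor sets S_v.
h0 = rng.normal(size=(4, 3))
S = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
W_self = rng.normal(size=(3, 5))
W_neigh = rng.normal(size=(3, 5))

h1 = mean_aggregate(h0, S, W_self, W_neigh)
print(h1.shape)  # (4, 5)
```

Stacking K such steps gives each node a representation that summarizes its K-hop neighborhood, matching the "patch of graph centered around this node" interpretation.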
To enhance the structural constraint applied during the learning of node representations, the local-global mutual information is calculated and maximized within a graph. Considering a minibatch of nodes B, the representation of each node is first computed using the GNN layers, i.e., E = [e_1; ...; e_|B|]. Since the neighborhood information is repeatedly aggregated in the learning process, the produced representation of a node summarizes a patch of the graph centered around this node. Therefore, the node representation can be treated as a local representation of this patch in the graph. A summary vector r is obtained by a readout function R which simply averages the node representations within this minibatch:

r = R(E) = σ_2((1/|B|) Σ_{i=1}^{|B|} e_i),

where σ_2 is the logistic sigmoid nonlinearity (i.e., σ_2(x) = 1/(1 + exp(−x))). This summary vector is a kind of global information regarding the nodes within the minibatch and their sampled neighbors. A local-global pair is denoted as (e_i, r). The local-global pairs are then treated as positive samples. The local-global mutual information is maximized by classifying these positive samples and their negative counterparts. The negative samples are obtained by pairing the summary vector with the node representations of a corrupted graph. To encourage the positive samples to encode the structural similarities of different nodes, as in DGI [23], the graph is corrupted by row-wise shuffling of the attribute matrix X, while the adjacency matrix A is preserved. The corrupted attributes and the original adjacency matrix are fed into the GNN layers to obtain the node representations of the corrupted graph, ẽ_i. A discriminator scores a local-global pair as

D(e_i, r) = σ_2(e_i^T W_b r),

where W_b is a learnable scoring matrix. Finally, the objective for this binary classification is as follows.
L_MI = −(1/(2|B|)) Σ_{i=1}^{|B|} [log D(e_i, r) + log(1 − D(ẽ_i, r))].

The mutual information between e_i and r will be maximized by minimizing this loss function.
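The positive/negative sampling and the binary classification objective can be sketched numerically (a NumPy sketch; here the corrupted representations are emulated by permuting the rows of E, whereas the model actually shuffles attributes before the GNN layers):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Node representations of a minibatch (positives) and of a corrupted graph
# (negatives). In the model, corruption row-wise shuffles the attribute
# matrix before the GNN layers; we emulate the effect by permuting rows of E.
E = rng.normal(size=(6, 8))             # e_i, i = 1..|B|
E_neg = E[rng.permutation(E.shape[0])]  # ~e_i from the corrupted graph

r = sigmoid(E.mean(axis=0))             # readout: sigmoid of the batch mean
W_b = rng.normal(size=(8, 8))           # learnable scoring matrix

pos_scores = sigmoid(E @ W_b @ r)       # D(e_i, r)
neg_scores = sigmoid(E_neg @ W_b @ r)   # D(~e_i, r)

# Binary cross-entropy over positive and negative local-global pairs.
L_MI = -(np.log(pos_scores).mean() + np.log(1.0 - neg_scores).mean()) / 2.0
print(L_MI > 0)  # True
```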
Note that node representations of both the source and target graphs (E^s and E^t) are generated in the same way, so the above descriptions are provided without mentioning the specific graph. The loss regarding mutual information (i.e., the unsupervised loss) is calculated for the source and target graphs independently, i.e., L^s_MI and L^t_MI. The total unsupervised loss is the summation of the source and target losses, L_UN = L^s_MI + L^t_MI.
Furthermore, the same GNN layers are utilized to generate node representations for the source and target graphs; that is, parameter sharing is enabled when processing these two graphs.

D. Node Label Prediction
To make the node embeddings label-discriminative, the supervised signals (i.e., the node labels) in the source graph are incorporated in the learning process. Specifically, we construct a node classifier which is a logistic regression. Node embeddings learned by the representation learner are fed into this classifier to obtain label predictions.
ŷ_v = f_c(e_v) = σ_3(W_c e_v + b_c),   (9)

where f_c is the node classifier; W_c and b_c are a learnable weight matrix and bias vector, respectively. ŷ_v^T is one row in the prediction score matrix Ŷ. The nonlinear activation σ_3 is usually a sigmoid activation for multilabel classification or a softmax activation for multiclass classification.
Since only the label information of the source graph is available during training, the cross-entropy loss (i.e., the supervised loss) is calculated using the source labels Y^s and the corresponding predictions Ŷ^s. For multilabel classification, the cross-entropy loss is defined as follows:

L_CE = −(1/|B^s|) Σ_{v∈B^s} Σ_{k=1}^{C} [Y^s_vk log Ŷ^s_vk + (1 − Y^s_vk) log(1 − Ŷ^s_vk)],   (10)

where Y^s_vk indicates whether node v ∈ B^s belongs to class k, and Ŷ^s_vk is the corresponding element in the prediction score matrix Ŷ^s. For multiclass classification, the cross-entropy loss is

L_CE = −(1/|B^s|) Σ_{v∈B^s} Σ_{k=1}^{C} Y^s_vk log Ŷ^s_vk.   (11)
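Both supervised losses follow directly from their definitions; a minimal NumPy sketch with toy labels and predictions:

```python
import numpy as np

def multiclass_ce(Y, Y_hat, eps=1e-12):
    """Multiclass cross-entropy over source-batch nodes:
    L = -(1/|B^s|) * sum_v sum_k Y_vk * log(Yhat_vk)."""
    return -(Y * np.log(Y_hat + eps)).sum(axis=1).mean()

def multilabel_ce(Y, Y_hat, eps=1e-12):
    """Multilabel (per-class binary) cross-entropy."""
    return -(Y * np.log(Y_hat + eps)
             + (1 - Y) * np.log(1 - Y_hat + eps)).sum(axis=1).mean()

Y = np.array([[1, 0, 0], [0, 1, 0]])                  # one-hot source labels
Y_hat = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])  # prediction scores

print(round(multiclass_ce(Y, Y_hat), 4))  # 0.2899
```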

E. Adversarial Domain Adaptation
Conditional adversarial networks [20] are leveraged to mitigate the domain gap between the source and target graphs. The domain discriminator f_d is constructed as a multilayer perceptron (MLP) followed by a sigmoid activation. It is conditioned on the discriminative information contained in the classifier prediction. We construct a joint variable h_v, which is a multilinear map defined as the outer product of the classifier prediction ŷ_v and the embedding vector e_v, that is, h_v = ŷ_v ⊗ e_v. The domain prediction is obtained by feeding the joint variable into the domain discriminator:

d̂_v = f_d(h_v),   (12)

where d̂_v is a score indicating the probability that a joint variable comes from the source graph. Multilinear conditioning is employed to capture the cross-covariance between node representation and classifier prediction. By conditioning, the domain divergence in both the node representation and the classifier prediction can be modeled simultaneously. The domain discriminator is then optimized to tell apart whether a joint variable is from the source or target graph. The domain adaptation loss L_DA is calculated as follows:

L_DA = −(1/|B^s|) Σ_{v∈B^s} log f_d(h^s_v) − (1/|B^t|) Σ_{v∈B^t} log(1 − f_d(h^t_v)).   (13)

To reduce the domain discrepancy, the representation learner and the domain discriminator are trained in an adversarial manner. Specifically, the domain discriminator is trained to minimize the domain adaptation loss L_DA, thus improving its capability of distinguishing the source and target samples. In contrast, the representation learner is trained to maximize the same loss, so that the produced node representations can be domain-invariant. We apply a gradient reversal layer (GRL) [26] to simultaneously update the representation learner and the domain discriminator. When minimizing the domain adaptation loss, the GRL flips the gradient with respect to the model parameters in the representation learner.
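The multilinear conditioning can be illustrated as follows (a NumPy sketch; the discriminator here is reduced to a single linear layer with hypothetical weights, whereas the model uses an MLP):

```python
import numpy as np

rng = np.random.default_rng(2)

C, l = 5, 8  # number of classes, embedding dimension

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multilinear_map(y_hat, e):
    """Joint variable h_v = y_hat_v (outer product) e_v, flattened to a C*l
    vector, so the discriminator sees prediction and representation jointly."""
    return np.outer(y_hat, e).ravel()

y_hat = np.array([0.7, 0.1, 0.1, 0.05, 0.05])  # classifier prediction
e = rng.normal(size=l)                          # node representation
h = multilinear_map(y_hat, e)
print(h.shape)  # (40,)

# A minimal stand-in for the domain discriminator: one linear layer + sigmoid.
w_d = rng.normal(size=C * l)
d_hat = sigmoid(h @ w_d)  # probability the node comes from the source graph
print(0.0 < d_hat < 1.0)  # True
```

During training, a gradient reversal layer would multiply the gradient flowing from `d_hat` back into the representation learner by a negative factor, which is what makes the learner maximize L_DA while the discriminator minimizes it.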

F. Overall Loss and Model Training
The overall loss of AdaGIn comprises the supervised loss L_CE, the unsupervised loss L_UN, and the domain adaptation loss L_DA:

min_{θ_g, θ_c} max_{θ_d}  L = L_CE + λ_1 L_UN − λ_2 L_DA,   (14)

where the balance coefficients λ_1 and λ_2 are named the mutual information coefficient and the domain adaptation coefficient, respectively; θ_g, θ_c, and θ_d are the sets of learnable parameters in the representation learner, node classifier, and domain discriminator, respectively.

Algorithm 1 outlines the training and testing procedures of AdaGIn. During the minibatch training, we first independently sample a batch of nodes from the source graph and a batch of nodes from the target graph. Then the same GNN layers are employed to compute representations for the nodes in these two batches. Next, we calculate the unsupervised loss, the supervised loss, and the domain adaptation loss one by one. Finally, the model parameters of AdaGIn are updated via gradient descent based on the overall loss in (14). When the model converges after a number of epochs, the generated node representations would be label-discriminative and domain-invariant. In the testing stage, the classifier trained on source nodes can be applied to classify target nodes with the target node representations.
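The composition of the overall objective in (14) can be sketched schematically (a Python sketch; the loss values and coefficients are hypothetical):

```python
def overall_loss(L_CE, L_UN, L_DA, lambda1, lambda2):
    """Schematic overall objective L = L_CE + lambda1*L_UN - lambda2*L_DA.
    The representation learner and node classifier minimize this objective;
    via the gradient reversal layer, the domain discriminator effectively
    maximizes it, i.e., minimizes L_DA on its own parameters."""
    return L_CE + lambda1 * L_UN - lambda2 * L_DA

print(overall_loss(0.5, 1.2, 0.7, 0.1, 0.3))  # 0.5 + 0.12 - 0.21 ≈ 0.41
```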
The time complexity of AdaGIn depends on its three modules: the representation learner (3), the node classifier (9), and the domain discriminator (12). As the GNN layers follow the neighborhood aggregation strategy, the time complexity of the representation learner is similar to that of GraphSAGE [2], which is O(∏_{k=1}^{K} s_k) per node, where s_k is the neighborhood sample size at search depth k and K is the maximum search depth. The time complexity of the node classifier and the domain discriminator is proportional to the number of nodes to be processed. Therefore, the overall complexity of AdaGIn is linear in the total number of nodes, that is, the scale of a graph.

G. Theoretical Analysis on Conditional Adversarial Networks
As introduced in Section III-E, a joint variable, h_v, is constructed as the outer product of the classifier prediction, ŷ_v, and the embedding vector, e_v. The label information conveyed by ŷ_v potentially reveals the multimodal structure behind the data distribution e_v ∼ P(e_v). Therefore, the joint variable h_v is expected to capture this multimodal structure. Following CDAN [20], we provide a generalization error analysis of the conditional adversarial networks. For simplicity, we omit the subscript v in the following discussion.
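The multilinear map itself is a flattened outer product. Below is a minimal, unbatched sketch assuming a C-way prediction and an l-dimensional embedding; real implementations operate on batched tensors, and CDAN additionally replaces the exact outer product with a randomized projection when C·l is large.

```python
def multilinear_map(y_hat, e):
    """Joint variable h = y_hat (outer product) e, flattened to length C * l,
    where y_hat is a C-way classifier prediction and e is an
    l-dimensional embedding (illustrative, unbatched version)."""
    return [p * x for p in y_hat for x in e]

# A confident prediction weights the embedding into the corresponding
# class-specific block of h, exposing the multimodal structure.
h = multilinear_map([0.25, 0.75], [1.0, 2.0])  # length 2 * 2 = 4
```

Conditioning the discriminator on h rather than on e alone lets it see which mode (class) each representation belongs to, which is what enables multimodal alignment.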
Algorithm 1: Algorithm of AdaGIn
Input: source graph; target graph; batch size N_b; maximum training epoch n_e; maximum iterations per epoch n_i; coefficients λ_1 and λ_2.
1: Initialize the model parameters θ_g of the representation learner, θ_c of the node classifier, and θ_d of the domain discriminator.
2: for epoch < n_e do
3:   for iteration < n_i do
4:     Sample a batch of source nodes, B_s, and a batch of target nodes, B_t;
5:     Compute source representations E_s and target representations E_t using (3);
6:     Calculate the unsupervised loss L_UN using (8);
7:     Calculate the supervised loss L_CE using (10) or (11);
8:     Calculate the domain adaptation loss L_DA using (13);
9:     Backpropagate the overall loss in (14) and update θ_g, θ_c, and θ_d.
10:   end
11: end
Testing: With the optimized model parameters θ_g and θ_c, classify the nodes in the target graph.

We consider the source and target domains over a fixed node representation space, e, and a family of source node classifiers, G, in a hypothesis space H [26]. Let ε_S(G) = E_{(e,y)∼S}[G(e) ≠ y] be the risk of a hypothesis G ∈ H with respect to the source domain distribution S. The disagreement between two hypotheses G_1, G_2 ∈ H is denoted by ε_S(G_1, G_2) = E_{(e,y)∼S}[G_1(e) ≠ G_2(e)]. Let G* = argmin_G [ε_S(G) + ε_T(G)] be the ideal hypothesis that explicitly embodies the notion of adaptability. Note that the target domain distribution, T, differs from the source domain distribution, S, that is, S ≠ T. The probabilistic bound [57] on the target risk ε_T(G) of a hypothesis G can be given in terms of the source risk ε_S(G) and the domain discrepancy |ε_S(G, G*) − ε_T(G, G*)|.
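The probabilistic bound just mentioned (presumably the paper's (15)) takes the standard form from domain adaptation theory [57]; the following is a reconstruction of that standard form, not necessarily the authors' exact statement:

```latex
\epsilon_T(G) \;\le\; \epsilon_S(G)
  \;+\; \bigl[\epsilon_S(G^\ast) + \epsilon_T(G^\ast)\bigr]
  \;+\; \bigl|\epsilon_S(G, G^\ast) - \epsilon_T(G, G^\ast)\bigr|
```

The middle bracket is the risk of the ideal hypothesis, which is assumed small when adaptation is feasible, so the controllable terms are the source risk and the discrepancy.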
The goal of adversarial domain adaptation is to reduce the domain discrepancy |ε_S(G, G*) − ε_T(G, G*)|.
The domain discrepancy can be upper-bounded by a supremum over H_disc, the family of domain discriminators f_d, of the discriminator's ability to separate the joint variables of the two domains. This upper bound is closely related to the domain adaptation loss, L_DA, defined in (13). Referring to the minimax paradigm in (14), the supremum is attained by training the optimal discriminator f_d (i.e., max_{θ_d}(−L_DA)), thus giving an upper bound on the domain discrepancy |ε_S(G, G*) − ε_T(G, G*)|. Simultaneously, the node representation, e, is generated by the representation learner, f_g, so as to minimize the domain discrepancy (i.e., min_{θ_g}(−L_DA)). Therefore, the conditional adversarial networks can theoretically bound and reduce the domain discrepancy, thus aligning the multimodal distributions.
Furthermore, by minimizing the cross-entropy loss (i.e., min_{θ_g, θ_c} L_CE; see (14)), the node classifier, f_c, is optimized to reduce the source risk, ε_S(G). With the help of adversarial domain adaptation, the domain discrepancy |ε_S(G, G*) − ε_T(G, G*)| is bounded and minimized, encouraging the source risk to approximate the target risk more closely (see (15)). Therefore, when the node classifier trained on the source graph is applied to the target graph, the target risk, ε_T(G), is also reduced, resulting in improved node classification performance in the target graph.

IV. EXPERIMENTS
In this section, we first evaluate the proposed method on the task of cross-graph node classification. An ablation study is then conducted on the contributions of mutual information maximization, multilinear conditioning, and domain adaptation. Next, we investigate the performance under varying common attribute rates. Finally, we analyze the influences of the critical hyperparameters, including initial learning rate, embedding dimension, and mutual information coefficient.
A. Experimental Setup

1) Datasets: As shown in Table II, following [6], [7], [8], experiments are conducted on five real-world networks. The three citation networks (i.e., ACMv9, Citationv1, and DBLPv7) are provided by ArnetMiner [59] and are extracted from ACM, Microsoft Academic Graph, and DBLP, respectively. Papers in these three networks are published in different periods, i.e., after 2010, before 2008, and between 2004 and 2008, respectively. Each citation network is modeled as an undirected graph: a node represents one paper, and an edge indicates a citation link between two papers, ignoring the direction. The attributes of a node are represented by a bag-of-words vector indicating keywords extracted from the title of the corresponding paper. The number of union attributes (i.e., 6775) is the total number of attributes in these three graphs. According to its research topics, each node is categorized into one or more of the five classes: "Database", "Artificial Intelligence", "Computer Vision", "Information Security", and "Networking".
Since the papers in the three citation networks are extracted from different databases and published in different time periods, the corresponding data distributions of these networks are also diverse. Although the label categories are the same, the statistics of graph scale, node attributes, and edge connections vary across graphs, indicating an intrinsic discrepancy between them. These graphs share no common nodes or edges, which further enlarges the divergence. Therefore, it is challenging to transfer knowledge from one graph to facilitate node classification in another.
The two social networks (i.e., Blog1 and Blog2) are disjoint subnetworks of the BlogCatalog dataset [60]. A node represents one blogger, and an undirected edge indicates the friendship between two bloggers. The node attribute vector is obtained from the keywords in the blogger's self-description. Each node belongs to one class, designated by the blogger's interest group. Compared with the citation networks, the social networks have higher average degrees, meaning that each node has a larger number of neighbors. Nodes in the social networks also have richer attributes. Since the two social networks are extracted from the same network, they have close attribute distributions.
The proposed method is evaluated by conducting multilabel node classification in six transfer tasks: D → A, A → D, C → A, A → C, D → C, and C → D, where A, C, and D denote ACMv9, Citationv1, and DBLPv7, respectively. The arrow "→" indicates the direction of knowledge transfer, that is, from a fully labeled source graph to a completely unlabeled target graph. The evaluation is further conducted by performing multiclass classification in transfer tasks B1 → B2 and B2 → B1, where B1 and B2 denote the two social networks, Blog1 and Blog2, respectively. The common attribute rate of each transfer task can be found in Table III, where the number of union attributes is the total number of attributes in the designated source and target graphs.
2) Baselines: The baselines are of three kinds, including graph neural networks for learning on a single graph, typical domain adaptation methods, and transfer learning methods specifically designed for graph data.
GCN [3], GAT [42], and GraphSAGE [2]: These are well-known graph neural networks for single-graph node classification. GCN devises a simplified spectral filter to perform convolution on graphs. GraphSAGE first samples fixed-size neighborhoods and then generates node representations by aggregating neighborhood information. Instead of neighborhood sampling, GAT assigns learnable weights to all neighboring nodes in the aggregation process. These three GNN models are trained on the labeled source graph and then directly evaluated on the unlabeled target graph.

DANN [26], CDAN [20], and WDGRL [27]: These adversarial domain adaptation approaches are originally designed for transfer learning on images or text. To process graph data, their representation learning modules are constructed as multilayer perceptrons (MLPs), which take only node attributes as input, ignoring the graph structure.

GCC [55], NetTr [52], CDNE [6], ACDNE [7], AdaGCN [8], and UDA-GCN [9]: These methods are designed for cross-graph learning. The GCC model is pre-trained to discover common structural patterns across multiple source graphs and then fine-tuned on the target graph. NetTr discovers structural representations that are transferable across graphs. CDNE achieves domain adaptation by incorporating the MMD loss [54] across graphs in an autoencoder architecture. ACDNE utilizes two feature extractors to separately preserve attribute affinity and topological proximity. UDA-GCN develops a dual graph convolutional network. Both ACDNE and UDA-GCN employ the gradient reversal layer [26] to perform domain adaptation. AdaGCN improves the transferability of GCN [3] by minimizing the Wasserstein distance [28] between the source and target representations.

3) Implementation Details: The proposed method, AdaGIn, is implemented using PyTorch. Table IV presents the hyperparameters selected for each transfer task.
The representation learner, f_g, consists of the GNN layers, whose layer sizes are selected from {128, 256, 512, 1024, 2048}. For example, in transfer task D → A, the dimensions of the three GNN layers are 1024, 512, and 128 in sequence, denoted as "1024/512/128". The dropout rate of each GNN layer is 0.5.
The node classifier, f_c, is a logistic regression classifier. The domain discriminator, f_d, is a multilayer perceptron (MLP) containing three hidden layers, with the layer dimensions set as 512, 64, and 16. Its output is one-dimensional, followed by a sigmoid activation.
We train the AdaGIn model over shuffled minibatches using the Adam optimizer [61], with a batch size of 32. The initial learning rate, η_0, is chosen from {0.005, 0.010, 0.015}. To prevent overfitting, L2 regularization is imposed on all learnable weights, with the weight decay term set as 5 × 10^-5. Following DANN [26], the learning rate, η_p, is decayed as η_p = η_0 (1 + 10p)^-0.75, where the training progress p increases from 0 to 1. In the overall loss function (i.e., (14)), the mutual information coefficient, λ_1, is set as 0.1 for the citation graphs and 1.0 for the social graphs. Starting from 0, the domain adaptation coefficient, λ_2, is progressively increased as λ_2 = 2(1 + exp(−10p))^-1 − 1. The maximum value of λ_2 is set as 0.2 for the citation graphs.
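The two schedules above can be written as small helper functions. This is a sketch: scaling the λ_2 ramp by its stated maximum of 0.2 is one plausible reading of the text, which gives the formula and the cap separately.

```python
import math

def learning_rate(eta0, p):
    """Decayed learning rate eta_p = eta_0 * (1 + 10 p)^(-0.75),
    with training progress p in [0, 1] (following DANN)."""
    return eta0 * (1.0 + 10.0 * p) ** -0.75

def da_coefficient(p, max_value=0.2):
    """Domain adaptation coefficient ramped from 0 toward max_value:
    lambda_2 = max_value * (2 / (1 + exp(-10 p)) - 1)."""
    return max_value * (2.0 / (1.0 + math.exp(-10.0 * p)) - 1.0)
```

The sigmoidal ramp keeps λ_2 near zero early in training, so the discriminator sees reasonably stable representations before the adversarial signal is turned up.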
The setup of GAT follows the original paper [42], except that two attention heads are applied in the first layer due to the memory constraint. We adapt GraphSAGE to the transductive setting and employ its max-pooling variant (i.e., GS-pool) due to its preferable performance and computational efficiency [2]. The architectures of DANN and CDAN are similar to that of AdaGIn, except that their representation learners are MLPs. WDGRL also employs an MLP to generate node representations, with settings similar to those of AdaGCN.
With only one source graph in our case, it is infeasible to pre-train a GCC model, since its pre-training requires multiple source graphs as input. Therefore, we use the pre-trained model provided by the authors to generate node representations for the source and target graphs. We then train a logistic regression classifier using the labeled source graph, and the trained classifier is subsequently applied to classify nodes in the unlabeled target graph. The results of NetTr are taken directly from the ACDNE paper [7]. In order to calculate standard deviations and perform t-tests, we execute the official source code to evaluate CDNE and ACDNE; we find their results to be slightly better than, or very close to, those reported in the ACDNE paper. UDA-GCN is also tested using its source code, with almost every key hyperparameter tuned: two or three GNN layers with layer sizes selected from {128, 256, 512, 1024, 2048}, the learning rate swept over {0.0001, 0.001, 0.01}, and the weight decay coefficient chosen from {0.00005, 0.0005, 0.005, 0.05, 0.5}.
Following ACDNE, the embedding dimension (i.e., l) is set as 128 for every method except GCC, whose embedding dimension is fixed as 64 by the provided pre-trained model. For each method, we report the means and standard deviations of the F1 scores over five runs with different random seeds. Note that the standard deviations of NetTr are not reported in the ACDNE paper. All the experiments are conducted on a computer with one NVIDIA GeForce GTX 1080Ti GPU (11 GB of RAM), an Intel(R) Core(TM) i7-8700K CPU (6 cores, 3.70 GHz), and 32 GB of RAM.

B. Cross-Graph Node Classification
Experiments are conducted under the unsupervised domain adaptation setting, in which the source graph is fully labeled, and the target graph is completely unlabeled. The F1 scores (i.e., Micro-F1 score and Macro-F1 score) are reported to quantify the node classification performance in the target graph. We introduce the experimental results in the citation and social graphs separately.
1) Node Classification in Citation Graphs: As shown in Table V and Table VI, the first group of baselines consists of the GNNs, including GCN [3], GAT [42], and GraphSAGE [2]. These GNN models are trained on the source graph and then directly evaluated on the target graph, without any domain adaptation technique. Since the information of the target graph is unavailable during training, following DANN [26], these baselines are referred to as source-only models.
GCN performs best in this group. The strengths of the GNN models are twofold. First, the GNN models explore and capture structural information, such as edge connections, to learn meaningful node representations. Such structural relations are commonly seen across graphs; thus, a GNN model trained on the source graph has the potential to encode important structural properties of the target graph. Second, the three citation graphs share some common attributes (see Table III). The GNN models considered here can jointly encode node attributes and graph structure to compute node representations. The common attributes help shrink the domain divergence between graphs, thus improving the model performance when tested on the target graph. Further investigation of the common attribute rate is presented in Section IV-D.
In the second group of baselines (i.e., DANN [26], CDAN [20], and WDGRL [27]), the models take only node attributes as input, ignoring the structural relations between nodes. This is similar to processing image data, which is assumed to be independent and identically distributed (i.i.d.). Even though domain adaptation techniques are employed in these methods, their performance is generally inferior to the other groups, including the source-only models. Therefore, when solely utilizing the node attributes of citation graphs, the generated node representations are not that label-discriminative. Furthermore, in such cases, the domain discrepancy would be too large to be effectively reduced by these domain adaptation techniques. Since the nodes in a graph are correlated by edge connections, violating the i.i.d. assumption, it is crucial to devise methods that incorporate structural information to improve classification performance.
In the third group of baselines (i.e., GCC [55], NetTr [52], CDNE [6], ACDNE [7], AdaGCN [8], and UDA-GCN [9]), clear underperformance is observed for GCC. As introduced in Section IV-A3, although node representations can be generated using the pre-trained GCC model, a node classifier cannot be trained directly on the unlabeled target graph. Instead, we have to train the classifier on the labeled source graph first and then test it on the target graph, which degrades the performance of the GCC model. More importantly, GCC cannot exploit node attributes, which are crucial for representation learning in attributed graphs. Specifically, the affinity of node attributes can be a predictor of node labels [7], and neighborhood attribute information also contributes to learning meaningful node representations [2]. Similarly, NetTr learns transferable latent representations solely based on graph topology, consequently leading to underperformance.
In Table V and Table VI, AdaGIn ranks first in all six transfer tasks, yielding the highest average F1 scores. ACDNE is the best-performing baseline. In Table V, the relative performance gains of AdaGIn over ACDNE are 2.32%, 1.05%, and 1.24% in transfer tasks D → A, A → D, and D → C, respectively. In the t-test, the corresponding p-values are 1.57 × 10^-3, 4.85 × 10^-2, and 3.81 × 10^-2, respectively. As these p-values are smaller than 0.05, the improvements of AdaGIn over ACDNE are statistically significant in these transfer tasks. In summary, AdaGIn surpasses the prior art by incorporating local-global mutual information and conditional adversarial domain adaptation.
2) Node Classification in Social Graphs: As reported in Table VII, the methods discussed above are further tested on the social graphs. The domain discrepancy between Blog1 and Blog2 is reduced by their close attribute distributions (see Section IV-A1); therefore, the F1 scores of AdaGIn are much higher than those in the citation graphs. AdaGIn outperforms ACDNE in transfer tasks B1 → B2 and B2 → B1. The corresponding p-values (i.e., 4.95 × 10^-3 and 3.57 × 10^-2) are smaller than 0.05, indicating that the improvements are statistically significant. Note that AdaGIn improves over CDAN by incorporating structural information into the domain adaptation process.
As shown in Table II, compared with the citation graphs, the social graphs have richer node attributes and higher average degrees, meaning the nodes have more attributes and more neighbors. Therefore, when evaluated on the social graphs, the methods exhibit distinct relative performance. The spectral GNN methods, AdaGCN and UDA-GCN, clearly underperform AdaGIn. Spectral GNNs exploit the whole neighborhood of a node, which might introduce certain noise and degrade the model performance; AdaGIn instead reduces noise by sampling some of the neighboring nodes. Similarly, among the source-only models, GraphSAGE largely outperforms GCN and GAT with the help of neighborhood sampling. Although GAT is a spatial GNN model, it also exploits the entire neighborhood without sampling [42]. In addition, even though DANN and CDAN ignore the neighborhood information, they still perform impressively, which reveals that node attributes in the social graphs are informative for learning meaningful node representations.
The benefit of neighborhood sampling in the social graphs can be supported by analyzing the labels of neighboring nodes. As shown in Fig. 4, in the citation graphs, most neighboring nodes belong to the class of the central node. In the social graphs, however, only around 50%, or even fewer, of the neighboring nodes share the class of the central node. For clarity, we use the same set of class indices for the citation and social graphs, although the two kinds of graphs define each class differently. Note that there are five classes in the citation graphs and six classes in the social graphs.
GCN and GAT compute the representation of a node as the weighted average of its previous representation and those of its neighbors. This smoothing operation makes connected nodes have close representations, which leads to similar label predictions in the subsequent classification task [18]. The homophily assumption behind this smoothing operation is that nodes with edge connections are likely to be of the same class [3]. Fig. 4 shows that more neighboring nodes violate this assumption in the social graphs. GCN and GAT utilize the complete neighbor set, which leads to their underperformance in the social graphs. AdaGCN and UDA-GCN follow GCN in computing node representations and thus also encounter performance degradation. Instead, AdaGIn and GraphSAGE sample nodes from the neighborhood; the sampled neighbor set contains fewer nodes belonging to classes different from that of the central node, and the classification performance is consequently improved.
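The neighbor-label analysis behind Fig. 4 can be approximated by an edge-level homophily ratio, i.e., the fraction of edges connecting same-class nodes. Below is a toy sketch; the graph is invented for illustration, and the paper's per-neighborhood statistic is a finer-grained variant of the same idea.

```python
def homophily_ratio(labels, edges):
    """Fraction of edges whose endpoints share a class label."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

# Toy graph: 4 nodes in two classes, half of the edges cross classes.
labels = [0, 0, 1, 1]
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
ratio = homophily_ratio(labels, edges)  # 2 of 4 edges are same-class
```

A ratio near 1 matches the homophily assumption of GCN-style smoothing; a ratio near 0.5, as reported for the social graphs, suggests that full-neighborhood aggregation will mix in many wrong-class neighbors.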
3) Loss Curve During Training: Fig. 5 shows the curve of each loss with respect to the training epoch. Two representative transfer tasks, C → A and D → A, are presented as examples. According to (14), the overall loss for training the GNN layers can be computed as L = L_CE + λ_1 L_UN − λ_2 L_DA. The overall loss decreases steadily at a diminishing rate, indicating that the model training gradually converges. Similar trends can be seen in the supervised loss, L_CE, and the unsupervised loss, L_UN. In contrast, as the overall loss is minimized during training, the domain adaptation loss, L_DA, first increases and then converges after around five epochs. Note that the GNN layers are trained to raise the domain adaptation loss, thus generating domain-invariant node representations.

C. Ablation Study
In this section, we investigate three key components in the AdaGIn model, including mutual information maximization (MI), multilinear conditioning (CD), and domain adaptation (DA). The investigation is conducted by removing some of the components to construct a variant of AdaGIn (see the left side of Table VIII or Table IX). The contribution of one component can be indicated by the changes in F1 scores caused by its removal. The following AdaGIn variants are constructed for comparison.
AdaGIn-MI: A variant of AdaGIn without mutual information maximization (MI); the unsupervised loss is removed from the overall loss function (i.e., (14)). The remaining variants (e.g., AdaGIn-DA and AdaGIn-CD) are constructed analogously by removing the corresponding components. Note that when the domain discriminator is removed, the multilinear conditioning disappears simultaneously; therefore, AdaGIn-MI-DA is a variant without all three key components.
1) Performance of Model Variants: As shown in Table VIII and Table IX, the average Micro-F1 score of AdaGIn drops by 3.09% (absolute difference) when the MI component is removed (i.e., AdaGIn-MI), which empirically validates the importance of preserving local-global mutual information. Similarly, a 1.75% drop (absolute difference) in the average Micro-F1 score is caused by the removal of the DA component from the AdaGIn model (see AdaGIn-DA). Mutual information maximization is applied in both the source and target graphs; therefore, although the DA component has been removed, AdaGIn-DA is still optimized with the unsupervised loss calculated in the target graph. Compared with AdaGIn, this helps keep the performance from dropping greatly. To validate this point, we further remove the MI component to obtain a source-only model, AdaGIn-MI-DA, for which the information of the target graph is entirely unknown during training. In comparison with AdaGIn-MI, there is a 3.37% decrease (absolute difference) in the average Micro-F1 score due to the lack of the DA component.
The domain adaptation component of AdaGIn conditions domain discriminator on the label predictions. The efficacy of multilinear conditioning is empirically validated by the performance drop due to its removal (see AdaGIn-CD). However, when MI is removed, the improvement of AdaGIn-MI over AdaGIn-MI-CD cannot be consistently observed in all transfer tasks. Mutual information maximization contributes to learning more informative node representations and more accurate label predictions, which is likely to enhance the capability of multilinear conditioning.
In Table VIII and Table IX, we observe that the MI component produces larger performance gains when the target graph is of large scale (i.e., ACMv9 and Citationv1). For example, considering the absolute difference between AdaGIn and AdaGIn-MI, when the mutual information is maximized, the Micro-F1 score increases by 5.16% in D → A and 4.33% in D → C, which are much higher than the 1.73% in C → D. As introduced in Section III-C, neighborhood information is repeatedly aggregated to generate the representation of a node; the node representation can therefore be treated as a local patch representation of the graph. To avoid memory overflow, we have to restrict the neighborhood search depth and the sample size at each depth, which results in a limited patch size. Consequently, if a graph consists of a large number of nodes, it is more difficult for the node representation to capture the global structural information. Under such circumstances, the model performance is more likely to benefit from the maximization of local-global mutual information.

2) Visualization of Node Representations: As a qualitative evaluation, Fig. 6 visualizes the node representations in the 2D space using t-SNE [62]. For clarity, two transfer tasks are taken as examples (i.e., C → A and D → A), in which ACMv9 serves as the target graph. We only show the nodes from two classes, "Computer Vision" and "Information Security". The node representations generated by AdaGIn have the most preferable visualization. Specifically, nodes of the same class are projected together, regardless of whether they come from the source or the target graph, and the clusters of the two classes are separated more clearly. By comparing the visualizations of AdaGIn and its variants, we make the following observations.
Mutual information maximization (MI) improves the label predictions, consequently making the clusters of the two classes more separable in most cases. Domain adaptation (DA), which includes a conditional discriminator, improves the multimodal alignment in the 2D space. Specifically, the clusters of the two classes have a clearer separation. Meanwhile, a larger overlap can be seen in the clusters of the same class. Multilinear conditioning (CD) contributes to distinguishing the clusters of different classes and aligning the clusters of the same class together.

3) Effect of Global Information:
We further analyze the effect of global information encoded from various domains. Specifically, we use "SMI" and "TMI" to denote the mutual information maximization applied in the source graph and in the target graph, respectively. Then two AdaGIn variants are constructed for comparison.
AdaGIn-TMI: A variant of AdaGIn without mutual information maximized in the target graph. In other words, MI is merely applied in the source graph to encode the global structural information. AdaGIn-SMI: A variant of AdaGIn without applying mutual information maximization in the source graph. From another perspective, MI is solely applied in the target graph to encode the global structural information.
As shown in Table X, when the MI component is removed from one of the graphs, there is a clear performance drop (see AdaGIn-TMI and AdaGIn-SMI). This indicates that the model performance benefits from the MI component regardless of which graph it is applied to. Furthermore, the decrease in Micro-F1 score observed for AdaGIn-TMI is larger than that for AdaGIn-SMI; since the model is evaluated on the target graph, it is more beneficial to directly maximize the mutual information in the target graph. With the MI component removed from both the source and target graphs, the Micro-F1 score shows the largest drop (see AdaGIn-MI). Therefore, it is desirable to jointly apply the MI component in the source and target graphs. These observations are consistent across the citation graphs and the social graphs.

D. Distribution Discrepancy Analysis
In this section, we investigate the influence of the common attribute rate (i.e., R_a) on the model performance. As introduced in Section III-A, the common attribute rate, R_a, is the percentage of common attributes shared by the source and target graphs. We conduct the investigation by removing some of the common attributes; a lower common attribute rate indicates a larger distribution discrepancy between the source and target graphs.

Fig. 6. Node representations visualized in the 2D space using t-SNE (best viewed in color). Each point represents one node. The points in orange and brown are from the source graph, and the points in pink and blue are from the target graph. The orange and pink points belong to "Computer Vision", whereas the brown and blue points belong to "Information Security".

In the two representative transfer tasks (i.e., C → A and D → A), the original common attribute rates are 64.29% and 56.92%, respectively (see Table III). In Fig. 7, by comparing with the source-only model, AdaGIn-MI-DA, we showcase the performance gain of AdaGIn. A positive performance gain indicates the improvement obtained by mutual information maximization (MI) and domain adaptation (DA). To investigate these two learning strategies individually, we further present the variant AdaGIn-MI. Note that multilinear conditioning is applied when performing domain adaptation.
In Fig. 7, the Micro-F1 score of each model decreases with the common attribute rate, because reducing the common attributes enlarges the domain gap between the source and target graphs. Furthermore, the total number of node attributes in each graph also decreases due to the removal of common attributes. As described in Section III-C, node representation learning relies on node attributes, and their reduction increases the difficulty of learning informative node representations, consequently degrading the node classification performance. AdaGIn outperforms the source-only model, AdaGIn-MI-DA, in all cases except R_a = 0 in transfer task D → A. We explain the reasons by discussing mutual information maximization and domain adaptation individually.
Mutual information maximization improves the classification performance over a wide range of common attribute rates. However, without any common attributes (i.e., R_a = 0), we observe a marginal improvement in D → A and no improvement in C → A. Local-global mutual information is calculated and maximized using node attributes (see Section III-C). If all common attributes are removed, the remaining node attributes are not that informative. For example, in transfer task C → A, when R_a = 0, only 20.34% and 23.08% of the node attributes remain in Citationv1 and ACMv9, respectively (see Table II and Table III). Under such conditions, mutual information maximization is weakened, which can even result in underperformance.
Domain adaptation also contributes to performance improvement in some cases. However, the performance gain yielded by domain adaptation sometimes shrinks, or even becomes negative (see transfer task D → A), as the common attribute rate gets smaller. As stated in [10], domain adaptation techniques might be unsuccessful when the source and target domains share few similarities. In the worst case, such techniques may have negative impacts on the learning tasks in the target domain, which is referred to as negative transfer.

E. Sample Complexity Analysis
In this section, we analyze how the model performance is influenced by the number of nodes in the source or target graph. We follow [63] to conduct the experiments under the domain adaptation scenario. In Fig. 8(a), the source graph is downsampled by selecting the same percentage of nodes from each class. The percentage, R_n, represents the proportion of nodes selected from the graph; when R_n = 100%, no nodes are removed. We report the Micro-F1 score on a target graph that is left unchanged. Similarly, in Fig. 8(b), we fix the source graph and report the F1 scores on a downsampled target graph in which only a percentage of nodes per class is preserved. Note that, unlike the i.i.d. images in computer vision studies, when a node is removed from a graph, its connected edges disappear simultaneously.
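The per-class (stratified) downsampling described above can be sketched as follows. This is a hypothetical helper, as the paper does not specify its exact sampling code; the point is simply that each class keeps the same proportion of its nodes.

```python
import random
from collections import defaultdict

def downsample_per_class(labels, rate, seed=0):
    """Select the same fraction `rate` of nodes from each class.
    labels[i] is the class of node i; returns the sorted kept node ids."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for node, c in enumerate(labels):
        by_class[c].append(node)
    kept = []
    for c, nodes in by_class.items():
        k = max(1, round(rate * len(nodes)))  # keep at least one node per class
        kept.extend(rng.sample(nodes, k))
    return sorted(kept)
```

Stratifying by class keeps the label distribution of the downsampled graph unchanged, so the experiment isolates the effect of graph size rather than label shift. In a graph, removing a node also removes its incident edges, which a full implementation would handle when rebuilding the adjacency structure.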
We summarize the findings in the transfer tasks of citation graphs (i.e., C → A and D → A) first. As can be seen in Fig. 8, the Micro-F1 score increases gradually with the percentage of nodes selected. As shown in Fig. 4, in the citation graphs, most neighboring nodes share the same class as the central node, which is in line with the homophily assumption shared by many GNN models. An increased number of neighboring nodes in the same class could support the GNN layers in learning more informative node representations. In the multimodal distribution of node representations, each mode corresponds to one class of nodes (see Section III-A). If there are more nodes of the same class to characterize the features of a mode, the distribution alignment would possibly be improved.
In Fig. 8, the downsampling of the target graph leads to a steeper performance drop. Since the model performance in the target graph is reported, when a part of the target nodes is removed, the downsampled target graph becomes less informative, resulting in a large drop in performance. When the source nodes are partially removed, the complete and informative target graph prevents the Micro-F1 score from dropping greatly.
Similar observations are also found in the transfer tasks of social graphs (i.e., B1 → B2 and B2 → B1). However, the Micro-F1 score only slightly increases when R_n is greater than 40%. Since the social graphs have higher average degrees (see Table II), when R_n reaches 40%, the nodes already have abundant neighborhood information to assist representation learning. Moreover, as the homophily assumption does not strictly hold in the social graphs (see Fig. 4), the increase of neighboring nodes might introduce certain noise. In addition, the rich node attributes in social graphs are informative for representation learning. Therefore, with only a few nodes left in the target graph (i.e., R_n = 10%), the performance drops in social graphs are much smaller than those in citation graphs.

Fig. 7. Micro-F1 score under varying common attribute rate.
Fig. 8. Analysis of the model performance with the source or target graph downsampled.
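The homophily contrast between the citation and social graphs can be quantified by the edge homophily ratio, the fraction of edges whose endpoints share a label. A minimal sketch on a toy labeled graph:

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share a label; values
    near 1 indicate strong homophily (as in the citation graphs),
    lower values indicate weaker homophily (as in the social graphs)."""
    if not edges:
        return 0.0
    same = sum(labels[u] == labels[v] for u, v in edges)
    return same / len(edges)

labels = {0: 'A', 1: 'A', 2: 'B', 3: 'B'}
ratio = edge_homophily([(0, 1), (1, 2), (2, 3)], labels)  # two of three edges match
```

Under low homophily, adding neighbors adds cross-class edges, which is exactly the noise source described above.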

F. Hyperparameter Sensitivity Analysis
In this section, we investigate the performance of AdaGIn with regard to three hyperparameters, i.e., the initial learning rate, the embedding dimension d, and the mutual information coefficient. The goal is to shed light on the hyperparameter configuration. When investigating one hyperparameter, the others are fixed to their default values introduced in Section IV-A3. Fig. 9 displays the Micro-F1 scores of transfer tasks in the citation and social graphs. In general, the performance of AdaGIn is more sensitive to these hyperparameters when evaluated in the citation graphs. In B1 → B2 and B2 → B1, the Micro-F1 scores become stable when the initial learning rate is larger than 0.005. In C → A and D → A, the Micro-F1 score first increases with the initial learning rate, and then decreases after reaching a maximum. Similar trends can be seen for the other two hyperparameters under all transfer tasks. With the embedding dimension set to 128, the highest Micro-F1 score is observed in each transfer task. The optimal mutual information coefficients are 0.1 and 1.0 for transfer tasks in the citation and social graphs, respectively.
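The one-at-a-time sensitivity protocol described here (vary one hyperparameter over its grid while pinning the others to defaults) can be sketched as below. The default and grid values are illustrative placeholders, not the paper's actual settings from Section IV-A3:

```python
# Hypothetical defaults and search grids for illustration only.
defaults = {"lr": 0.01, "dim": 128, "mi_coef": 0.1}
grids = {"lr": [0.001, 0.005, 0.01, 0.05],
         "dim": [32, 64, 128, 256],
         "mi_coef": [0.01, 0.1, 1.0]}

def one_at_a_time(defaults, grids):
    """Yield (varied_name, config) pairs: each config changes exactly
    one hyperparameter from its default, as in a sensitivity sweep."""
    for name, values in grids.items():
        for v in values:
            yield name, dict(defaults, **{name: v})

configs = list(one_at_a_time(defaults, grids))
```

Compared with a full grid search, this sweep costs only the sum of the grid sizes rather than their product, at the price of ignoring interactions between hyperparameters.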

V. CONCLUSION
A novel GNN model has been proposed to address the cross-graph node classification problem. This method enables mutual information maximization in cross-graph learning, which encourages node representations to capture the global structural information. Conditional adversarial networks are employed to align the multimodal distributions of node representations. Experimental results demonstrate that the proposed method is superior to the state-of-the-art approaches on the benchmark transfer tasks. As a spatial GNN model, the proposed method is promising for many high-impact applications in real-world large-scale networks, such as protein-protein interaction networks and recommender systems.