Deep Inductive Graph Representation Learning

This paper presents a general inductive graph representation learning framework called <inline-formula><tex-math notation="LaTeX">$\text{DeepGL}$</tex-math><alternatives><mml:math><mml:mtext>DeepGL</mml:mtext></mml:math><inline-graphic xlink:href="rossi-ieq1-2878247.gif"/></alternatives></inline-formula> for learning deep node <italic>and</italic> edge features that generalize across-networks. In particular, <inline-formula><tex-math notation="LaTeX">$\text{DeepGL}$</tex-math><alternatives><mml:math><mml:mtext>DeepGL</mml:mtext></mml:math><inline-graphic xlink:href="rossi-ieq2-2878247.gif"/></alternatives></inline-formula> begins by deriving a set of base features from the graph (e.g., graphlet features) and automatically learns a multi-layered hierarchical graph representation where each successive layer leverages the output from the previous layer to learn features of a higher-order. Contrary to previous work, <inline-formula><tex-math notation="LaTeX">$\text{DeepGL}$</tex-math><alternatives><mml:math><mml:mtext>DeepGL</mml:mtext></mml:math><inline-graphic xlink:href="rossi-ieq3-2878247.gif"/></alternatives></inline-formula> learns <italic>relational functions</italic> (each representing a feature) that naturally generalize across-networks and are therefore useful for graph-based transfer learning tasks. Moreover, <inline-formula><tex-math notation="LaTeX">$\text{DeepGL}$</tex-math><alternatives><mml:math><mml:mtext>DeepGL</mml:mtext></mml:math><inline-graphic xlink:href="rossi-ieq4-2878247.gif"/></alternatives></inline-formula> naturally supports attributed graphs, learns interpretable inductive graph representations, and is space-efficient (by learning sparse feature vectors). 
In addition, <inline-formula><tex-math notation="LaTeX">$\text{DeepGL}$</tex-math><alternatives><mml:math><mml:mtext>DeepGL</mml:mtext></mml:math><inline-graphic xlink:href="rossi-ieq5-2878247.gif"/></alternatives></inline-formula> is expressive, flexible with many interchangeable components, efficient with a time complexity of <inline-formula><tex-math notation="LaTeX">$\mathcal {O}(|E|)$</tex-math><alternatives><mml:math><mml:mrow><mml:mi mathvariant="script">O</mml:mi><mml:mo>(</mml:mo><mml:mo>|</mml:mo><mml:mi>E</mml:mi><mml:mo>|</mml:mo><mml:mo>)</mml:mo></mml:mrow></mml:math><inline-graphic xlink:href="rossi-ieq6-2878247.gif"/></alternatives></inline-formula>, and scalable for large networks via an efficient parallel implementation. Compared with recent methods, <inline-formula><tex-math notation="LaTeX">$\text{DeepGL}$</tex-math><alternatives><mml:math><mml:mtext>DeepGL</mml:mtext></mml:math><inline-graphic xlink:href="rossi-ieq7-2878247.gif"/></alternatives></inline-formula> is (1) <italic>effective</italic> for across-network transfer learning tasks <italic>and</italic> large (attributed) graphs, (2) <italic>space-efficient</italic> requiring up to 6x less memory, (3) <italic>fast</italic> with up to 106x speedup in runtime performance, and (4) <italic>accurate</italic> with an average improvement in AUC of 20 percent or more on many learning tasks and across a wide variety of networks.


INTRODUCTION
Learning a useful graph representation lies at the heart and success of many machine learning tasks such as node and link classification [1], [2], anomaly detection [3], link prediction [4], dynamic network analysis [5], community detection [6], role discovery [7], visualization and sensemaking [8], network alignment [9], and many others. Indeed, the success of machine learning methods largely depends on data representation [10], [11]. Methods capable of learning such representations have many advantages over feature engineering in terms of cost and effort. For a survey and taxonomy of relational representation learning, see [11].
Recent work has largely been based on the popular skip-gram model [12], originally introduced for learning vector representations of words in the natural language processing (NLP) domain. In particular, DeepWalk [13] applied the successful word embedding framework from [14] (called word2vec) to embed the nodes such that the co-occurrence frequencies of pairs in short random walks are preserved. More recently, node2vec [15] introduced hyperparameters to DeepWalk that tune the depth and breadth of the random walks. These approaches have been extremely successful and have been shown to outperform a number of existing methods on tasks such as node classification.
However, past work has focused on learning only node features [13], [15], [16] for a specific graph. Features from these methods do not generalize to other networks and thus cannot be used for across-network transfer learning tasks. 1 In contrast, DeepGL learns relational functions that generalize for computation on any arbitrary graph, and therefore naturally supports across-network transfer learning tasks such as across-network link classification, network alignment, and graph similarity, among others. Existing methods are also not space-efficient, as the node feature vectors are completely dense. For large graphs, the space required to store these dense features can easily become too large to fit in memory. The features are also notoriously difficult to interpret and explain, which is becoming increasingly important in practice [17], [18]. Furthermore, existing embedding methods are unable to capture higher-order subgraph structures. Finally, these methods are inefficient, with runtimes that are orders of magnitude slower than the algorithms presented in this paper (as shown later in Section 4). Other key differences and limitations are discussed below.
In this work, we present a general, expressive, and flexible deep graph representation learning framework called DeepGL that overcomes many of the above limitations. 2,3 Intuitively, DeepGL begins by deriving a set of base features using the graph structure and any attributes (if available). 4
1. The terms transfer learning and inductive learning are used interchangeably.
2. This manuscript first appeared in [19].
3. Note that a deep learning method as defined by Bengio et al. [20], [21] is one that learns multiple levels of representation, with higher levels capturing more abstract concepts through a deeper composition of computations [10], [22]. This definition includes neural network approaches as well as DeepGL and many other deep learning paradigms.
4. The base graph features computed using the graph are functions since they have precise definitions and can be computed on any graph.
The base features are iteratively composed using a set of learned relational feature operators that operate over the feature values of the (distance-$\ell$) neighbors of a graph element (node, edge; see Table 1) to derive higher-order features from lower-order ones, forming a hierarchical graph representation where each layer consists of features of increasingly higher orders. At each feature layer, DeepGL searches over a space of relational functions defined compositionally in terms of a set of relational feature operators applied to each feature given as output in the previous layer. Features (or relational functions) are retained if they are novel and thus add important information that is not captured by any other feature in the set. See below for a summary of the advantages and properties of DeepGL.

Summary of Contributions
The proposed framework, DeepGL, overcomes many limitations of existing work and has the following key properties:
Novel framework: This paper presents a general inductive graph representation learning framework called DeepGL for learning inductive node and edge relational functions (features) that generalize for across-network transfer learning tasks in large (attributed) networks.
Inductive representation learning: Contrary to existing node embedding methods [13], [15], [16], DeepGL naturally supports across-network transfer learning tasks as it learns relational functions that generalize for computation on any arbitrary graph.
Sparse feature learning: It is space-efficient by learning a sparse graph representation that requires up to 6x less space than existing work.
Efficient, parallel, and scalable: It is fast with a runtime that is linear in the number of edges. It scales to large graphs via a simple and efficient parallelization. Notably, strong scaling results are observed in Section 4.
Attributed graphs: DeepGL is also naturally able to learn node and edge features (relational functions) from both attributes (if available) and the graph structure.

FRAMEWORK
This section presents the DeepGL framework. Since the framework naturally generalizes for learning node and edge representations, it is described generally for a set of graph elements (e.g., nodes or edges). 5 A summary of notation is provided in Table 1.

Base Graph Features
The first step of DeepGL (Algorithm 1) is to derive a set of base graph features 6 using the graph topology and attributes (if available). Initially, the feature matrix $\mathbf{X}$ contains only the attributes given as input by the user. If no attributes are provided, then $\mathbf{X}$ will consist of only the base features derived below. Note that DeepGL can use any arbitrary set of base features, and thus it is not limited to the base features discussed below.
Given a graph $G = (V, E)$, we first derive simple base features such as in/out/total/weighted degree and k-core numbers for each graph element (node, edge) in $G$. For edge feature learning we derive edge degree features for each edge $(v, u) \in E$ and each operator in $\{+, \times\}$ as follows: [equation missing in the source]. Recall from Table 1 that $d_v^{+}$, $d_v^{-}$, and $d_v$ denote the out/in/total degree of $v$. In addition, egonet features are also used [23]. Given a node $v$ and an integer $\ell$, the $\ell$-egonet of $v$ is defined as the set of nodes $\ell$-hops away from $v$ (i.e., distance at most $\ell$) and all edges and nodes between that set. More generally, we define the $\ell$-egonet of a graph element $g_i$ as follows:
Definition 1 ($\ell$-EGONET). Given a graph element $g_i$ (node, edge) and an integer $\ell$, the $\ell$-egonet of $g_i$ is defined as the set of graph elements $\ell$-hops away from $g_i$ (i.e., distance at most $\ell$) and all edges (or nodes) between that set.
For massive graphs, one may set $\ell = 1$ hop to balance the tradeoffs between accuracy/representativeness and scalability. The $\ell = 1$ external and within-egonet features for nodes are provided in Fig. 1 and used as base features in DeepGL-node. For all the above base features, we also derive variations based on direction (in/out/both) and weights (weighted/unweighted). Observe that DeepGL naturally supports many other graph properties including efficient/linear-time properties such as PageRank. Moreover, fast approximation methods with provable bounds can also be used to derive features such as the local coloring number and largest clique centered at the neighborhood of each graph element (node, edge) in $G$. All of the above base features are concatenated (as column vectors) to the feature matrix $\mathbf{X}$.
TABLE 1. Summary of Notation (only the following rows survive in the source)
[symbol lost]: shortest distance between $g_i$ and $g_j$
$S$: set of graph elements related to $g_i$, e.g., [...]
$\mathbb{K}$: a feature score function
$\lambda$: tolerance/feature similarity threshold
$\alpha$: transformation hyperparameter (e.g., bin size in log binning, $0 \le \alpha \le 1$)
$\Phi$: relational operator applied to each graph element
Matrices are bold upright roman letters; vectors are bold lowercase letters.
5. For convenience, DeepGL-edge and DeepGL-node are sometimes used to refer to the edge and node representation learning variants of DeepGL, respectively.
6. The term graph feature refers to an edge or node feature.
In addition, we decompose the input graph $G$ into its smaller subgraph components called graphlets (network motifs, induced subgraphs) [24] using local graphlet decomposition methods [25] and concatenate the graphlet count-based feature vectors to the feature matrix $\mathbf{X}$.
Definition 2 (GRAPHLET). A graphlet $G_t = (V_k, E_k)$ is an induced subgraph consisting of a subset $V_k \subset V$ of $k$ vertices from $G = (V, E)$ together with all edges whose endpoints are both in this subset, $E_k = \{\forall e \in E \mid e = (u, v) \wedge u, v \in V_k\}$.
A $k$-graphlet is defined as an induced subgraph with $k$ vertices. Alternatively, the nodes of every graphlet can be partitioned into a set of automorphism groups called orbits [26]. Each unique node (edge) position in a graphlet is called an automorphism orbit, or just orbit. More formally,
Definition 3 (ORBIT). An automorphism of a $k$-vertex graphlet $G_t = (V_k, E_k)$ is defined as a permutation of the nodes in $G_t$ that preserves edges and non-edges. The automorphisms of $G_t$ form an automorphism group denoted as $\mathrm{Aut}(G_t)$. A set of nodes $V_k$ of graphlet $G_t$ defines an orbit iff (i) for any node $u \in V_k$ and any automorphism $\pi$ of $G_t$, $u \in V_k \iff \pi(u) \in V_k$; and (ii) if $v, u \in V_k$ then there exists an automorphism $\pi$ of $G_t$ and a $\gamma > 0$ such that $\pi^{\gamma}(u) = v$.
This work derives such features by counting all node or edge orbits with up to 4 and/or 5-vertex graphlets. Orbits (graphlet automorphisms) are counted for each node or edge in the graph based on whether a node or edge-based feature representation is warranted (as our approach naturally generalizes to both). Note there are 15 node and 12 edge orbits with 2-4 nodes; and 73 node and 68 edge orbits with 2-5 nodes. Unless otherwise mentioned, this work uses all 15 node orbits shown in Fig. 2.
A key advantage of DeepGL lies in its ability to naturally handle attributed graphs. In particular, any set of initial attributes given as input can simply be concatenated with $\mathbf{X}$ and treated the same as any other initial base feature derived from the topology.
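As a rough illustration of how input attributes and simple topological base features can share one feature matrix, the following minimal sketch computes out/in/total degree for the nodes of a directed graph and appends them to any user-supplied attributes. The function name `degree_base_features`, the edge-list input, and the row layout are assumptions made for this example; they are not part of DeepGL, which also uses k-core, egonet, and graphlet features.

```python
from collections import defaultdict

def degree_base_features(edges, attributes=None):
    """Return (nodes, X); each row of X is [attributes..., out, in, total degree].

    Hypothetical helper for illustration only: `edges` is a list of directed
    pairs (v, u) meaning v -> u, and `attributes` maps a node to a list of
    input attribute values (or is None when no attributes are given).
    """
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    nodes = set()
    for v, u in edges:
        out_deg[v] += 1
        in_deg[u] += 1
        nodes.update((v, u))
    nodes = sorted(nodes)
    X = []
    for v in nodes:
        # Attributes come first, then the base features derived from topology.
        row = list(attributes[v]) if attributes else []
        row += [out_deg[v], in_deg[v], out_deg[v] + in_deg[v]]
        X.append(row)
    return nodes, X

edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
nodes, X = degree_base_features(edges, attributes={0: [1.0], 1: [0.0], 2: [1.0]})
```

Here the attribute column and the three degree columns are treated identically once they are in `X`, which mirrors the statement that attributes are handled like any other base feature.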

Relational Function Space & Expressivity
In this section, we formulate the space of relational functions 7 that can be expressed and searched over by DeepGL.

Definition 4 (RELATIONAL FUNCTION). A relational function (feature) is defined as a composition of relational feature operators applied to an initial base feature $\mathbf{x}$, where $\mathbf{x}$ is an arbitrary base feature to which $h$ relational feature operators are applied.
Note the relational feature operators can obviously be repeated in the relational function. For instance, in the extreme case all h relational feature operators may represent the mean relational feature operator defined in Table 2.
Recall that unlike recent node embedding methods [13], [15], [16], the proposed approach learns graph functions that are transferable across-networks for a variety of important graph-based transfer learning tasks such as across-network prediction, anomaly detection, graph similarity, matching, among others.

Composing Relational Functions
The space of relational functions searched via DeepGL is defined compositionally in terms of a set of relational feature operators $\boldsymbol{\Phi} = \{\Phi_1, \ldots, \Phi_K\}$. 8 A few relational feature operators are defined formally in Table 2; see [11] (p. 404) for a wide variety of other useful relational feature operators. The expressivity of DeepGL (space of relational functions expressed by DeepGL) depends on a few flexible and interchangeable components including:
i) the initial base features derived using the graph structure, initial input attributes (if available), or both,
ii) a set of relational feature operators $\boldsymbol{\Phi} = \{\Phi_1, \ldots, \Phi_K\}$,
iii) the sets of "related graph elements" $S \in \mathcal{S}$ (e.g., the in/out/all neighbors within $\ell$ hops of a given node/edge) that are used with each relational feature operator $\Phi_p \in \boldsymbol{\Phi}$, and finally,
iv) the number of times each relational function is composed with another (i.e., the depth).
Observe that under this formulation each feature vector $\mathbf{x}'$ from $\mathbf{X}$ (that is not a base feature) can be written as a composition of relational feature operators applied over a base feature. For instance, given an initial base feature $\mathbf{x}$, by abuse of notation let $\mathbf{x}' = \Phi_k(\Phi_j(\Phi_i\langle\mathbf{x}\rangle)) = (\Phi_k \circ \Phi_j \circ \Phi_i)(\mathbf{x})$ be a feature vector given as output by applying the relational function constructed by composing the relational feature operators $\Phi_k \circ \Phi_j \circ \Phi_i$ to every graph element $g_i \in G$ and its set $S$ of related elements. 9 More complex relational functions are easily expressed, such as those involving compositions of different relational feature operators (and possibly different sets of related graph elements). Furthermore, DeepGL is able to learn relational functions that often correspond to increasingly higher-order subgraph features based on a set of initial lower-order (base) subgraph features (Fig. 3). Intuitively, just as filters are used in Convolutional Neural Networks (CNNs) [10], one can think of DeepGL in a similar way, but instead of simple filters, we have features derived from lower-order subgraphs being combined in various ways to capture higher-order subgraph patterns of increasing complexity at each successive layer.
TABLE 2 (note, partially recovered). The term relational operator is used more generally to refer to any relational function applied over the neighborhood of a node or edge (or more generally any set $S$). Note that DeepGL is flexible and generalizes to any arbitrary set of relational feature operators. The set of relational feature operators can be learned via cross-validation. Recall the notation from Table 1. For generality, $S$ is defined in Table 1 as a set of related graph elements (nodes, edges) of $g_i$ and thus $s_j \in S$ may be an edge $s_j = e_j$ or a node $s_j = v_j$; in this work [...] The relational operators generalize to $\ell$-distance neighborhoods (e.g., [...] is a vector and $x_i$ is the $i$th element of $\mathbf{x}$ for $g_i$.
7. The terms graph function and relational function are used interchangeably.
8. Note DeepGL may also leverage traditional feature operators used for i.i.d. data.
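The composition of relational feature operators described above can be sketched in a few lines. This is only an illustration under stated assumptions: the operators `mean_op` and `sum_op`, the helpers `apply_operator` and `compose`, and the adjacency-list representation of the related sets $S$ are hypothetical names chosen for the example, and only distance-1 neighbors are used.

```python
def mean_op(values):
    """Mean relational operator over the feature values of the related set S."""
    return sum(values) / len(values) if values else 0.0

def sum_op(values):
    """Sum relational operator over the feature values of the related set S."""
    return float(sum(values))

def apply_operator(op, neighbors, x):
    """Apply `op` to every graph element i, using S = neighbors[i]."""
    return [op([x[j] for j in neighbors[i]]) for i in range(len(x))]

def compose(*ops):
    """compose(op_k, op_j, op_i) applies op_i first, then op_j, then op_k."""
    def relational_function(neighbors, x):
        for op in reversed(ops):
            x = apply_operator(op, neighbors, x)
        return x
    return relational_function

# Path graph 0 - 1 - 2 with an arbitrary base feature x.
neighbors = [[1], [0, 2], [1]]
x = [1.0, 2.0, 4.0]
f = compose(mean_op, sum_op)      # corresponds to (mean o sum)(x)
x_prime = f(neighbors, x)
```

In this example the inner sum yields `[2.0, 5.0, 2.0]`, and the outer mean over neighbors yields `[5.0, 2.0, 5.0]`. Because `f` is defined only in terms of operators and a base feature, the same definition can be evaluated on a different adjacency list.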

Summation and Multiplication of Relational Functions
We can also construct a wide variety of functions compositionally by adding and multiplying relational functions (e.g., $\Phi_i + \Phi_j$ and $\Phi_i \times \Phi_j$). More specifically, any class of functions that is closed under addition and multiplication can be used as base functions in this context. A sum of relational functions is similar to an OR operation in that two instances are "close" if either has a large value, and similarly, a product of relational functions can be viewed as an AND operation as two instances are close if both relational functions have large values. This is similar to many existing architectures for learning complex functions such as the AND-like or OR-like units in convolutional networks [27], among others [10].
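The OR-like and AND-like behavior can be seen on a toy example. The sketch below assumes the two relational functions have already been evaluated into feature vectors `xa` and `xb` (one value per graph element); the helper names are hypothetical.

```python
def feature_sum(xa, xb):
    """Elementwise sum: large if either input feature is large (OR-like)."""
    return [a + b for a, b in zip(xa, xb)]

def feature_product(xa, xb):
    """Elementwise product: large only if both inputs are large (AND-like)."""
    return [a * b for a, b in zip(xa, xb)]

# Outputs of two relational functions on three graph elements.
xa = [0.0, 3.0, 3.0]
xb = [0.0, 0.0, 2.0]
or_like = feature_sum(xa, xb)
and_like = feature_product(xa, xb)
```

The second element is nonzero under the sum but zero under the product, since only one of the two functions responds to it.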

Searching the Relational Function Space
A general and flexible framework for DeepGL is given in Algorithm 1. Recall that DeepGL begins by deriving a set of base features which are used as a basis for learning deeper and more discriminative features of increasing complexity (Line 2). The base feature vectors are then transformed (Line 3). For instance, one may transform each feature vector $\mathbf{x}_i$ using logarithmic binning as follows: sort $\mathbf{x}_i$ in ascending order and set the $\alpha N$ graph elements (nodes/edges) with smallest values to 0 where $0 < \alpha < 1$, then set an $\alpha$ fraction of the remaining graph elements with smallest values to 1, and so on. Observe that we only need to store the nonzero feature values. Thus, we avoid explicitly storing $\alpha N$ values for each feature by storing each feature as a sparse feature vector. See Section 3.2 for further details. Many other techniques exist for transforming the feature vectors, and the selected technique will largely depend on the graph structure.
The framework proceeds to learn a hierarchical graph representation (Fig. 3) where each successive layer represents increasingly deeper higher-order (edge/node) graph functions. In particular, the feature layers $\mathcal{F}_2, \mathcal{F}_3, \ldots, \mathcal{F}_t$ are derived as follows (Algorithm 1, Lines 4-10): First, we derive the feature layer $\mathcal{F}_t$ by searching over the space of graph functions that arise from applying the relational feature operators $\boldsymbol{\Phi}$ to each of the novel features $f_i \in \mathcal{F}_{t-1}$ learned in the previous layer (Algorithm 1, Line 5). An algorithm for deriving a feature layer is provided in Algorithm 2. Further, an intuitive example is provided in Fig. 4. Next, the feature vectors from layer $\mathcal{F}_t$ are transformed in Line 6 as discussed previously.
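The logarithmic binning transformation described above can be sketched as follows. The function name `log_binning`, the use of `math.ceil` to round bin sizes, and the handling of ties by sort order are assumptions made for this illustration; the text does not specify them. The result is stored sparsely as a dictionary that omits the elements assigned to bin 0.

```python
import math

def log_binning(x, alpha=0.5):
    """Logarithmic binning of a feature vector, returned as a sparse dict.

    The alpha fraction of elements with the smallest values gets bin 0,
    the alpha fraction of the remaining elements gets bin 1, and so on.
    Only nonzero bins are stored: {element index: bin id}.
    Assumption: bin sizes are rounded up and are at least 1.
    """
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])  # indices, ascending by value
    binned = {}
    start, bin_id = 0, 0
    while start < n:
        remaining = n - start
        size = max(1, math.ceil(alpha * remaining))
        if bin_id > 0:  # bin 0 is implicit and never stored
            for i in order[start:start + size]:
                binned[i] = bin_id
        start += size
        bin_id += 1
    return binned

x = [5, 1, 9, 3, 7, 2, 8, 4]
sparse_bins = log_binning(x, alpha=0.5)
```

With `alpha=0.5`, the four smallest values fall in bin 0 and are not stored, so only four of the eight entries are kept.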
The resulting features in layer $t$ are then evaluated. The feature evaluation routine (Algorithm 1, Line 7) chooses the important features (relational functions) at each layer $t$ from the space of novel relational functions (at depth $t$) constructed by applying the relational feature operators to each feature (relational function) learned (and given as output) in the previous layer $t - 1$. Notice that DeepGL is extremely flexible, as the feature evaluation routine (Algorithm 3) called in Line 7 of Algorithm 1 is completely interchangeable and can be fine-tuned for specific applications and/or data. This approach derives a score between pairs of features. Pairs of features $\mathbf{x}_i$ and $\mathbf{x}_j$ that are strongly dependent, as determined by the hyperparameter $\lambda$ and evaluation criterion $\mathbb{K}$, are assigned $W_{ij} = \mathbb{K}(\mathbf{x}_i, \mathbf{x}_j)$, and $W_{ij} = 0$ otherwise (Algorithm 3, Lines 2-6). More formally, let $E_F$ denote the set of connections representing dependencies between features:
$$E_F = \{(i, j) \mid \mathbb{K}(\mathbf{x}_i, \mathbf{x}_j) > \lambda\}.$$
The result is a weighted feature dependence graph $G_F = (V_F, E_F)$ where a relatively large edge weight $\mathbb{K}(\mathbf{x}_i, \mathbf{x}_j) = W_{ij}$ between $\mathbf{x}_i$ and $\mathbf{x}_j$ indicates a potential dependence (or similarity/correlation) between these two features. Intuitively, $\mathbf{x}_i$ and $\mathbf{x}_j$ are strongly dependent if $\mathbb{K}(\mathbf{x}_i, \mathbf{x}_j) = W_{ij}$ is larger than $\lambda$. Therefore, an edge is added between features $\mathbf{x}_i$ and $\mathbf{x}_j$ if they are strongly dependent. An edge between features represents (potential) redundancy. Now, $G_F$ is used to select a subset of important features from layer $t$. Features are selected as follows: First, the feature graph $G_F$ is partitioned into groups of features $\{C_1, C_2, \ldots\}$ where each set $C_k \in \mathcal{C}$ represents features that are dependent (though not necessarily pairwise dependent). To partition the feature graph $G_F$, Algorithm 3 uses connected components, though other methods are also possible, e.g., a clustering or community detection method. Next, one or more representative features are selected from each group (cluster) of dependent features.
Figure caption (partially recovered): [...] be a matrix of feature weights where $w_{ij}$ (or $W_{ij}$) is the weight between the feature vectors $\mathbf{x}_i$ and $\mathbf{x}_j$. Notice that $\mathbf{W}$ has the constraint that $i < j < k$ and $\mathbf{x}_i$, $\mathbf{x}_j$, and $\mathbf{x}_k$ are increasingly deeper. Each feature layer $\mathcal{F}_h \in \mathcal{F}$ defines a set of unique relational functions $\mathcal{F}_h = \{\ldots, f_k, \ldots\}$ of order $h$ (depth) and each $f_k \in \mathcal{F}_h$ denotes a relational function. Moreover, the layers are ordered where $\mathcal{F}_1 < \mathcal{F}_2 < \cdots < \mathcal{F}_t$ such that if $i < j$ then $\mathcal{F}_j$ is said to be a deeper layer w.r.t. $\mathcal{F}_i$. See Table 1 for a summary of notation.
9. For simplicity, we use $\Phi\langle\mathbf{x}\rangle$ (whenever clear from context) to refer to the application of $\Phi$ to all sets $S$ derived from each graph element $g_i \in G$, and thus the output of $\Phi\langle\mathbf{x}\rangle$ in this case is a feature vector with a single feature value for each graph element.
Alternatively, it is also possible to derive a new feature from the group of dependent features, e.g., finding a low-dimensional embedding of these features or taking the principal eigenvector. In Algorithm 3 the earliest feature in each connected component $C_k = \{\ldots, f_i, \ldots, f_j, \ldots\} \in \mathcal{C}$ is selected and all others are removed. After pruning the feature layer $\mathcal{F}_t$, the discarded features are removed from $\mathbf{X}$ and DeepGL updates the set of features learned thus far by setting $\mathcal{F} \leftarrow \mathcal{F} \cup \mathcal{F}_t$ (Algorithm 1, Line 8). Next, Line 9 increments $t$ and sets $\mathcal{F}_t \leftarrow \emptyset$. Finally, we check for convergence, and if the stopping criterion is not satisfied, then DeepGL learns an additional feature layer (Lines 4-10).
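The scoring and pruning step can be sketched as below: feature pairs whose score exceeds a threshold are linked, the resulting feature graph is partitioned into connected components, and the earliest feature of each component is kept. The `agreement` score used here (fraction of graph elements on which two binned features agree) is only a stand-in chosen for the example, not the agreement scoring the paper defines in Eq. (12); the function names and the union-find implementation are likewise assumptions.

```python
def agreement(xi, xj):
    """Stand-in score: fraction of graph elements where two features agree."""
    return sum(1 for a, b in zip(xi, xj) if a == b) / len(xi)

def prune_features(features, lam=0.7, score=agreement):
    """Return the indices of the features kept after pruning.

    An edge (i, j) is added when score(features[i], features[j]) > lam.
    Connected components are tracked with union-find, always keeping the
    smallest index as the root so the earliest feature represents each group.
    """
    f = len(features)
    parent = list(range(f))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(f):
        for j in range(i + 1, f):
            if score(features[i], features[j]) > lam:
                ri, rj = find(i), find(j)
                parent[max(ri, rj)] = min(ri, rj)
    return sorted({find(i) for i in range(f)})

features = [
    [0, 1, 1, 2],  # f0
    [0, 1, 1, 2],  # f1: identical to f0
    [2, 0, 0, 1],  # f2: unrelated to the others
    [0, 1, 1, 3],  # f3: agrees with f0 and f1 on 3 of 4 elements
]
kept = prune_features(features, lam=0.7)
```

Here f0, f1, and f3 form one connected component and f2 forms another, so only the earliest feature of each group survives.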
An important aspect of DeepGL is the specific convergence criterion used to decide when to stop learning. In Algorithm 1, DeepGL terminates when either of the following conditions is satisfied: (i) no new features emerge (in the current feature layer $t$, and thus $|\mathcal{F}_t| = 0$), or (ii) the maximum number of layers is reached. However, DeepGL is not tied to any particular convergence criterion and others can easily be used in its place.
Algorithm 1 (the listing survives only from the end of Line 5 onward)
5: [...] given as output in the previous layer $\mathcal{F}_{t-1}$ (via Algorithm 2). Add feature vectors to $\mathbf{X}$ and functions/def. to $\mathcal{F}_t$.
6: Transform feature vectors of layer $\mathcal{F}_t$
7: Evaluate the features (functions) in layer $\mathcal{F}_t$, e.g., using a criterion $\mathbb{K}$ to score feature pairs along with a feature selection method to select a subset (Algorithm 3) or by finding a low-rank embedding of the feature matrix and concatenating the embeddings to $\mathbf{X}$.
8: Discard features from $\mathbf{X}$ that were pruned (not in $\mathcal{F}_t$) and set $\mathcal{F} \leftarrow \mathcal{F} \cup \mathcal{F}_t$
9: Set $t \leftarrow t + 1$ and initialize $\mathcal{F}_t$ to $\emptyset$ for next feature layer
10: until no new features emerge or the max number of layers (depth) is reached
11: return $\mathbf{X}$ and the set of relational functions (definitions) $\mathcal{F}$
In contrast to node embedding methods that output only a node feature matrix $\mathbf{X}$, DeepGL also outputs the (hierarchical) relational functions (definitions) $\mathcal{F} = \{\mathcal{F}_1, \mathcal{F}_2, \ldots\}$ where each $f_i \in \mathcal{F}_h$ is a learned relational function of depth $h$ for the $i$th feature vector $\mathbf{x}_i$. Maintaining the relational functions is important not only for transferring the features to another arbitrary graph of interest, but also for interpreting them. Moreover, DeepGL is an inductive representation learning approach, as the relational functions can be used to derive embeddings for new nodes or even graphs.
There are many methods to evaluate and remove redundant/noisy features at each layer and DeepGL is not tied to any particular approach. For the experiments, we use the relational function evaluation and pruning routine in Algorithm 3, as it is computationally efficient and the features remain interpretable. There are two main classes of techniques for evaluating and removing redundant/noisy features at each layer. The first class of techniques uses a criterion $\mathbb{K}$ to score the feature pairs along with any feature selection method (e.g., see Algorithm 3) to select a subset of representative and non-redundant features. The second class of techniques computes a low-rank embedding of the feature matrix at each layer and concatenates the embeddings as features for learning the next layer. Alternatively, one can also use a hybrid approach that combines the advantages of both by simply concatenating the features from each. The above approaches are applied at each feature layer (iteration). Notice that these methods all have the same general objective of reducing noise and removing redundant features (minimality condition).
While previous work learns node embeddings (features), DeepGL instead learns complex relational functions that represent compositions of relational operators. Hence, these relational functions naturally generalize across graphs for inductive learning tasks. Both the relational operators and base graph features (which can be thought of as functions themselves) are independent of the particular graph topology G (i.e., they can be computed on any graph G) and therefore any arbitrary composition of relational operators applied to any base feature can be computed on any graph G.

Feature Diffusion
We introduce the notion of feature diffusion where the feature matrix at each layer can be smoothed using an arbitrary feature diffusion process. As an example, suppose $\mathbf{X}$ is the resulting feature matrix from a given layer; then we can set $\bar{\mathbf{X}}^{(0)} \leftarrow \mathbf{X}$ and solve
$$\bar{\mathbf{X}}^{(t)} = \mathbf{D}^{-1} \mathbf{A} \bar{\mathbf{X}}^{(t-1)},$$
where $\mathbf{D}$ is the diagonal degree matrix and $\mathbf{A}$ is the adjacency matrix of $G$. The diffusion process above is repeated for a fixed number of iterations $t = 1, 2, \ldots, T$ or until convergence, and corresponds to a simple feature propagation. More complex feature diffusion processes can also be used in DeepGL, such as the normalized Laplacian feature diffusion defined as
$$\bar{\mathbf{X}}^{(t)} = (1 - \theta) \mathbf{L} \bar{\mathbf{X}}^{(t-1)} + \theta \mathbf{X}, \quad \text{for } t = 1, 2, \ldots,$$
where $\mathbf{L}$ is the normalized Laplacian:
$$\mathbf{L} = \mathbf{I} - \mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2}.$$
The resulting diffused feature vectors are effectively smoothed by the features of related graph elements (nodes/edges) governed by the particular diffusion process. Notice that feature vectors given as output at each layer can be diffused (e.g., after Line 5 or 8 of Algorithm 1). Note that the diffused matrix $\bar{\mathbf{X}}$ can be leveraged in a variety of ways: $\mathbf{X} \leftarrow \bar{\mathbf{X}}$ (replacing the previous features) or concatenated by $\mathbf{X} \leftarrow [\mathbf{X} \;\; \bar{\mathbf{X}}]$. Feature diffusion can be viewed as a form of graph regularization, as it can improve the generalizability of a model learned using the graph embedding.
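For an unweighted graph, multiplying by $\mathbf{D}^{-1}\mathbf{A}$ replaces each row of the feature matrix with the mean of its neighbors' rows, so the simple propagation step can be sketched without any matrix library. The function name `diffuse`, the adjacency-list input, and the choice to leave isolated nodes unchanged (where $\mathbf{D}^{-1}$ is undefined) are assumptions made for this example.

```python
def diffuse(neighbors, X, iterations=1):
    """Simple feature propagation: each row becomes the mean of its neighbors' rows.

    neighbors[i] lists the neighbors of graph element i (unweighted graph).
    Assumption: an element with no neighbors keeps its current row.
    """
    n_cols = len(X[0]) if X else 0
    for _ in range(iterations):
        new_X = []
        for i, row in enumerate(X):
            nbrs = neighbors[i]
            if not nbrs:
                new_X.append(list(row))
                continue
            new_X.append(
                [sum(X[j][c] for j in nbrs) / len(nbrs) for c in range(n_cols)]
            )
        X = new_X
    return X

# Path graph 0 - 1 - 2 with a single feature column.
neighbors = [[1], [0, 2], [1]]
X = [[1.0], [2.0], [4.0]]
X_one = diffuse(neighbors, X, iterations=1)
X_two = diffuse(neighbors, X, iterations=2)
```

The diffused matrix can then either replace the original features or be appended to them column-wise, as described above.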

Supervised Representation Learning
The DeepGL framework naturally generalizes for supervised representation learning by replacing the feature evaluation routine (called in Algorithm 1, Line 7) with an appropriate objective function, e.g., one that seeks to find a set of features that (i) maximize relevancy (predictive quality) with respect to $\mathbf{y}$ (i.e., observed class labels) while (ii) minimizing redundancy among the features in that set. The objective function capturing both (i) and (ii) is:
$$\mathbf{x}_i = \arg\max_{\mathbf{x}_i \notin \mathcal{X}} \Big\{ \mathbb{K}(\mathbf{y}, \mathbf{x}_i) - \beta \sum_{\mathbf{x}_j \in \mathcal{X}} \mathbb{K}(\mathbf{x}_i, \mathbf{x}_j) \Big\}, \quad (7)$$
where $\mathbb{K}$ is a measure such as mutual information; $\mathcal{X}$ is the current set of selected features; and $\beta$ is a hyperparameter that determines the balance between maximizing relevance and minimizing redundancy. The first term in Eq. (7) seeks to find $\mathbf{x}_i$ that maximizes the relevancy of $\mathbf{x}_i$ to $\mathbf{y}$, whereas the second term attempts to minimize the redundancy between $\mathbf{x}_i$ and each $\mathbf{x}_j \in \mathcal{X}$ of the already selected features. Initially, we set $\mathcal{X} \leftarrow \{\mathbf{x}_0\}$ where $\mathbf{x}_0 = \arg\max_{\mathbf{x}_i} \mathbb{K}(\mathbf{y}, \mathbf{x}_i)$. Afterwards, we solve Eq. (7) to find $\mathbf{x}_i$ (such that $\mathbf{x}_i \notin \mathcal{X}$), which is then added to $\mathcal{X}$ (and removed from the set of remaining features). This is repeated until the stopping criterion is reached.
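The greedy relevance-versus-redundancy selection described above can be sketched as follows. Two things here are assumptions rather than the paper's choices: the score is the absolute Pearson correlation (`abs_corr`) instead of mutual information, and the stopping criterion is a fixed number of selected features (`max_features`). All function and parameter names are hypothetical.

```python
import math

def abs_corr(a, b):
    """Absolute Pearson correlation, used as a stand-in score function."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    if va == 0 or vb == 0:
        return 0.0
    return abs(cov) / math.sqrt(va * vb)

def select_features(features, y, beta=1.0, max_features=2, score=abs_corr):
    """Greedily pick feature indices: relevance to y minus beta * redundancy."""
    remaining = list(range(len(features)))
    # Start from the single most relevant feature.
    first = max(remaining, key=lambda i: score(y, features[i]))
    selected = [first]
    remaining.remove(first)
    while remaining and len(selected) < max_features:
        def objective(i):
            redundancy = sum(score(features[i], features[j]) for j in selected)
            return score(y, features[i]) - beta * redundancy
        best = max(remaining, key=objective)
        selected.append(best)
        remaining.remove(best)
    return selected

y = [0, 0, 0, 1, 1, 1]
features = [
    [0, 0, 0, 1, 1, 1],  # f0: perfectly relevant
    [0, 0, 0, 2, 2, 2],  # f1: perfectly relevant but redundant with f0
    [0, 1, 0, 1, 0, 1],  # f2: weakly relevant, weakly redundant
]
relevance_only = select_features(features, y, beta=0.0)
balanced = select_features(features, y, beta=2.0)
```

With `beta=0.0` the redundant duplicate f1 is chosen second, whereas a larger `beta` penalizes it enough that the less redundant f2 is chosen instead.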
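Algorithm 2. Derive a Feature Layer Using the Features from the Previous Layer and the Set of Relational Feature Operators $\boldsymbol{\Phi} = \{\Phi_1, \ldots, \Phi_K\}$
1: procedure FeatureLayer($G$, $\mathbf{X}$, $\boldsymbol{\Phi}$, $\mathcal{F}$, $\mathcal{F}_t$, $\lambda$)
2: parallel for each graph element $g_i \in G$ do
3: Set $t \leftarrow |\mathcal{F}|$
4: for each feature $\mathbf{x}_k$ s.t. $f_k \in \mathcal{F}_{t-1}$ in order do
5: for each relational operator $\Phi \in \boldsymbol{\Phi}$ do
6: [...]
7: $X_{it} = \Phi(S, \mathbf{x}_k)$ and $t \leftarrow t + 1$
8: Add feature definitions to $\mathcal{F}_t$
9: return feature matrix $\mathbf{X}$ and $\mathcal{F}_t$

Algorithm 3. Score and Prune the Feature Layer (Lines 1-7 are missing from the source)
8: Partition $G_F$ using conn. components $\mathcal{C} = \{C_1, C_2, \ldots\}$
9: parallel for each $C_k \in \mathcal{C}$ do (remove features)
10: Find $f_i$ s.t. $\forall f_j \in C_k : i < j$.
11: Remove $C_k$ from $\mathcal{F}_t$ and set $\mathcal{F}_t \leftarrow \mathcal{F}_t \cup \{f_i\}$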

ANALYSIS
Let $N = |V|$ denote the number of nodes, $M = |E|$ be the number of edges, $F$ be the number of relational functions learned by DeepGL, and $K$ be the number of relational feature operators.

Time Complexity
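The time complexity for learning edge features using the DeepGL framework is:
$$\mathcal{O}(F(M + MF)), \quad (8)$$
and the time complexity for learning node features using the DeepGL framework is:
$$\mathcal{O}(F(M + NF)). \quad (9)$$
Hence, the time complexity of both edge (Eq. 8) and node (Eq. 9) feature learning in DeepGL is linear in the number of edges.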
Proof. The time complexity of each of the main steps is provided below. Recall that DeepGL is a general and flexible framework for inductive graph-based feature learning. The particular instantiation of DeepGL used in this analysis corresponds to using Algorithm 3 for feature evaluation and pruning, where $\mathbb{K}$ is the agreement scoring defined in Eq. (12) with logarithmic binning. We also assume the relational functions are composed of any relational feature operator (aggregator function) with a worst-case time complexity of $\mathcal{O}(|S|)$, where $S$ is the set of related graph elements (e.g., the $\ell = 1$ hop neighbors of a node or edge). Further, the set of related graph elements $S$ given as input to a relational feature operator $\Phi \in \boldsymbol{\Phi} = \{\Phi_1, \ldots, \Phi_K\}$ is the $\ell = 1$ hop neighborhood of a node in the worst case. Hence, if $G$ is directed, then we consider the worst case where [...]
It is straightforward to select a subset $J \subseteq S$ of related graph elements for a given node (or edge) using an arbitrary uniform or weighted distribution $P$ and derive a feature value for that node (edge) using $J$. Moreover, the maximum size of $J$ can be set by the user as done in [25], [28]. $\square$
Base Graph Features. Recall that all base features discussed in Section 2.1 can be computed in $\mathcal{O}(M)$ time (by design). While DeepGL is not tied to a particular set of base features and can use any arbitrary set of base features, including those that are more computationally intensive, we nevertheless restrict our attention to base features that can be computed in $\mathcal{O}(M)$ to ensure that DeepGL is fast and efficient for massive networks. For deriving the graphlet (network motif) frequencies and egonet features (that correspond to 3-node motif variations), we use recent provably accurate estimation methods [25], [28]. As shown in [25], [28], we can achieve estimates within a guaranteed level of accuracy and time by setting a few simple parameters in the estimation algorithm.
The time complexity to estimate the frequency of all 4-node graphlets is $\mathcal{O}(M \Delta_{ub})$ in the worst case, where $\Delta_{ub} \ll M$ is a small constant set by the user that represents the maximum sampled degree [25], [28].
Searching the Space of Relational Functions. The time complexity to derive a novel candidate feature $\mathbf{x}$ using an arbitrary relational feature operator $\Phi \in \boldsymbol{\Phi}$ is at most $\mathcal{O}(M)$.
For a feature $f_k \in \mathcal{F}$ with vector $\mathbf{x}_k$, DeepGL derives $K = |\boldsymbol{\Phi}|$ new candidate features to search. This is repeated at most $F$ times. Therefore, the worst-case time complexity to search the space of relational functions is $\mathcal{O}(KFM)$ where $K \ll M$. Since $K$ is a small constant, it is disregarded, giving $\mathcal{O}(FM)$.
Scoring and Pruning Relational Functions. First we score the feature pairs using a score function $\mathbb{K}$. Assigning a score $\mathbb{K}(\mathbf{x}_i, \mathbf{x}_j)$ to an arbitrary pair of features $f_i, f_j \in \mathcal{F}$ takes at most $\mathcal{O}(M)$ time (or $\mathcal{O}(N)$ time for node feature learning). Further, if $\mathbb{K}(\mathbf{x}_i, \mathbf{x}_j) > \lambda$, then we set $W_{ij} = \mathbb{K}(\mathbf{x}_i, \mathbf{x}_j)$ and add an edge $E_F = E_F \cup \{(i, j)\}$ in $\mathcal{O}(1)$ constant time. Therefore, the worst-case time complexity to score all such feature pairs is $\mathcal{O}(F^2 M)$ for edge feature learning where $F \ll M$, and $\mathcal{O}(F^2 N)$ for nodes where $F \ll N$. The time complexity for pruning the feature graph is at most $F^2 + F$ and thus can be ignored, since the term $F^2 M$ (or $F^2 N$) dominates as $M$ (or $N$) grows large.
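The scoring and pruning step can be illustrated with a minimal Python sketch. It is not Algorithm 3 itself: the pruning rule used here (link feature pairs whose score exceeds the threshold, then keep the earliest feature in each connected component of the feature graph) is an assumption made for illustration, and the names `agreement` and `prune_features` are hypothetical.

```python
from itertools import combinations


def agreement(x_i, x_j):
    """Fraction of positions at which two (binned) feature vectors agree."""
    return sum(a == b for a, b in zip(x_i, x_j)) / len(x_i)


def prune_features(features, lam):
    """Return the indices of the features kept after pruning.

    features: list of equal-length, non-empty binned feature vectors.
    lam: score threshold; pairs scoring above it are linked in the feature graph.
    Assumed rule: keep the earliest feature of each connected component.
    """
    n = len(features)
    parent = list(range(n))  # union-find forest over feature indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Scoring: F(F-1)/2 pairs, each scored in time linear in the vector length.
    for i, j in combinations(range(n), 2):
        if agreement(features[i], features[j]) > lam:
            ri, rj = find(i), find(j)
            if ri != rj:
                # The smaller index becomes the root, so every root is the
                # earliest feature of its component.
                parent[max(ri, rj)] = min(ri, rj)

    return sorted({find(i) for i in range(n)})


features = [[0, 1, 2, 3], [0, 1, 2, 0], [3, 3, 3, 3], [3, 3, 3, 1]]
kept = prune_features(features, lam=0.5)  # features 0,1 and 2,3 are linked
```

The double loop makes the quadratic dependence on the number of features visible, while each call to `agreement` is linear in the number of graph elements, which matches the $F^2 M$ (or $F^2 N$) term in the analysis.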

Inductive Relational Functions
We now state the time complexity of directly computing the set of inductive relational functions $\mathcal{F}$ (e.g., that were previously learned on another arbitrary graph). The set of relational functions $\mathcal{F}$ can be used in two important ways. First, the relational functions are useful for inductive across-network transfer learning tasks where one uses the relational functions that were previously learned from a graph $G_i$ and wants to extract them on another graph $G_j$ of interest (e.g., for graph matching or similarity, across-network classification). Second, given new nodes or edges in the same graph on which the relational functions were learned, we can use $\mathcal{F}$ to derive node or edge feature values (embeddings, encodings) for the new nodes/edges without having to relearn the relational functions each time a new node or edge appears in the graph.
Hence, the time complexity is linear in the number of edges.
Computing the set of inductive relational functions on another arbitrary graph requires less work than learning the set of inductive relational functions (Section 3.1.1), since no learning is involved: we simply derive the previously learned relational functions $\mathcal{F}$ directly from their definitions. We therefore avoid Algorithm 3 completely, since we do not need to score or prune any candidate features from the potential space of relational functions. The time complexity is thus $\mathcal{O}(FM)$ where $F = |\mathcal{F}|$, as shown previously in Section 3.1.1.
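To make the extraction step concrete, the following Python sketch evaluates a fixed set of relational function definitions on two different graphs without any scoring or pruning. The representation is an assumption made for illustration only: a function is stored as a base feature plus an ordered list of operator names, graphs are undirected adjacency lists, and `OPERATORS`, `degree`, `apply_function`, and `extract` are hypothetical names rather than part of DeepGL.

```python
from math import prod

# Hypothetical operator table: each maps a list of neighbor values to one value.
OPERATORS = {
    "mean": lambda vals: sum(vals) / len(vals) if vals else 0.0,
    "sum": lambda vals: float(sum(vals)),
    "prod": lambda vals: float(prod(vals)) if vals else 0.0,
}


def degree(adj):
    """Base feature: number of neighbors of each node."""
    return {v: float(len(nbrs)) for v, nbrs in adj.items()}


def apply_function(adj, base, ops):
    """Evaluate one relational function: ops applied in order over a base feature."""
    x = base(adj)
    for name in ops:
        op = OPERATORS[name]
        # One pass over all adjacency lists, i.e., time linear in the edges.
        x = {v: op([x[u] for u in adj[v]]) for v in adj}
    return x


def extract(adj, functions):
    """Evaluate every stored definition on a graph; nothing is scored or pruned."""
    return {label: apply_function(adj, base, ops)
            for label, (base, ops) in functions.items()}


# Definitions "learned" on one graph (made up here) ...
learned = {
    "mean(degree)": (degree, ["mean"]),
    "sum(mean(degree))": (degree, ["mean", "sum"]),
}
train = {0: [1, 2], 1: [0], 2: [0]}
# ... are reused verbatim on a graph with a disjoint node set.
test = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
X_train = extract(train, learned)
X_test = extract(test, learned)
```

Because the definitions never mention node identities, the same dictionary `learned` produces a feature for every node of either graph, which is the property the inductive setting relies on.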

Space Complexity
Proof. The space complexity stated in Eq. (11) assumes logarithmic binning is used to encode the feature values of each feature vector $\mathbf{x} \in \mathbb{R}^N$. Notice that without any compression, the space complexity is $\mathcal{O}(NF)$. However, since log binning is used with a bin width of $\alpha$, we can avoid storing approximately $\lceil \alpha N \rceil$ values for each $N$-dimensional feature by mapping these values to 0 and explicitly storing only the remaining nonzero feature values. This is accomplished using a sparse feature vector representation. Furthermore, we can store these even more efficiently by leveraging the fact that the same nodes that appear within the $\lceil \alpha N \rceil$ largest (or smallest) feature values for a particular feature often appear within the $\lceil \alpha N \rceil$ largest (or smallest) feature values for other arbitrary features as well. This is likely due to the power law observed in many real-world graphs. Similarly, the space complexity of the sparse edge feature matrix $\mathbf{X}$ is $\mathcal{O}(F \lceil \alpha M \rceil)$. $\square$

EXPERIMENTS
This section demonstrates the effectiveness of the proposed framework. In particular, we investigate the predictive performance of DeepGL compared to the state-of-the-art methods across a wide variety of graph-based learning tasks as well as its scalability, runtime, and parallel performance.

Experimental Settings
In these experiments, we use the following instantiation of DeepGL: Features are transformed using logarithmic binning and evaluated using a simple agreement score function where $\mathbb{K}(\mathbf{x}_i, \mathbf{x}_j)$ = fraction of graph elements that agree. More formally, agreement scoring is defined as:
$$\mathbb{K}(\mathbf{x}_i, \mathbf{x}_j) = \frac{\left|\{ k : x_{ik} = x_{jk},\; k = 1, \ldots, N \}\right|}{N} \qquad (12)$$
where $x_{ik}$ and $x_{jk}$ are the $k$th feature values of the $N$-dimensional vectors $\mathbf{x}_i$ and $\mathbf{x}_j$, respectively. As an aside, recall that a key advantage of DeepGL is that the framework has many interchangeable components, including the above evaluation criterion, base features (Section 2.1), and relational operators (Section 2.2), among others. The expressiveness and flexibility of DeepGL make it well-suited for a variety of application domains, graph types, and learning scenarios [29]. In addition, DeepGL can even leverage features from an existing method (or any approach proposed in the future) to discover a more discriminative set of features.
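As a rough illustration of this instantiation, the sketch below bins a feature vector and then scores the agreement of two binned vectors. The binning rule is an assumption, since the exact procedure is not restated in this section: values are sorted in ascending order, the smallest $\lceil \alpha N \rceil$ go to bin 0, the next $\lceil \alpha \cdot \text{remaining} \rceil$ go to bin 1, and so on. The function names are hypothetical.

```python
import math


def log_bin(values, alpha=0.5):
    """Map feature values to integer bins (assumed logarithmic binning rule).

    The smallest ceil(alpha * N) values get bin 0, then ceil(alpha * remaining)
    values get bin 1, and so on. Requires 0 < alpha <= 1. Ties are broken by
    position, so equal values may fall into adjacent bins.
    """
    order = sorted(range(len(values)), key=lambda k: values[k])
    bins = [0] * len(values)
    start, b = 0, 0
    while start < len(order):
        size = math.ceil(alpha * (len(order) - start))
        for k in order[start:start + size]:
            bins[k] = b
        start += size
        b += 1
    return bins


def agreement_score(x_i, x_j):
    """Fraction of positions k with x_i[k] == x_j[k]."""
    if len(x_i) != len(x_j) or not x_i:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum(1 for a, b in zip(x_i, x_j) if a == b) / len(x_i)


binned = log_bin([5, 1, 9, 3, 7, 2, 8, 4], alpha=0.5)
# binned == [1, 0, 3, 0, 1, 0, 2, 0]: half of the values land in bin 0
score = agreement_score([1, 0, 3, 0], [1, 0, 2, 0])  # 0.75
```

With $\alpha = 0.5$, half of the entries of each binned vector are zero, which is what makes a sparse representation attractive.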
Unless otherwise mentioned, we use the base graph features mentioned in Section 2.1; set $\alpha = 0.5$ (bin size of logarithmic binning) and perform a grid search over $\lambda \in \{0.01, 0.05, 0.1, 0.2, 0.3\}$ and $\Phi \in \{\Phi_{\text{mean}}, \Phi_{\text{sum}}, \Phi_{\text{prod}}, \{\Phi_{\text{mean}}, \Phi_{\text{sum}}\}, \{\Phi_{\text{prod}}, \Phi_{\text{sum}}\}, \{\Phi_{\text{prod}}, \Phi_{\text{mean}}\}, \{\Phi_{\text{sum}}, \Phi_{\text{mean}}, \Phi_{\text{prod}}\}\}$. See Table 2. Note $\Phi_{\text{prod}}$ refers to the Hadamard relational operator defined formally in Table 2. As an aside, DeepGL has fewer hyperparameters than node2vec, DeepWalk, and LINE used in the comparison below. The specific model defined by the above instantiation of DeepGL is selected using 10-fold cross-validation on 10 percent of the labeled data. Experiments are repeated for 10 random seed initializations. All results are statistically significant with p-value < 0.01.

Within-Network Link Classification
We first evaluate the effectiveness of DeepGL for link classification. To be able to compare DeepGL to node2vec and the other methods, we focus in this section on within-network link classification. For comparison, we use the same set of binary operators to construct features for the edges indirectly using the learned node representations: Given the feature vectors $\mathbf{x}_i$ and $\mathbf{x}_j$ for nodes $i$ and $j$, $(\mathbf{x}_i + \mathbf{x}_j)/2$ is the MEAN, $\mathbf{x}_i \odot \mathbf{x}_j$ is the (Hadamard) PRODUCT, and $|\mathbf{x}_i - \mathbf{x}_j|$ and $(\mathbf{x}_i - \mathbf{x}_j)^{\circ 2}$ are the WEIGHTED-L1 and WEIGHTED-L2 binary operators, respectively.$^{11}$ Note that these binary operators (used to create edge features) are not to be confused with the relational feature operators defined in Table 2. In Table 3, we observe that DeepGL outperforms node2vec, DeepWalk, and LINE with an average gain between 18.09 and 20.80 percent across all graphs and binary operators.
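The four binary operators used to build edge features from node vectors (mean, Hadamard product, weighted-L1, weighted-L2, as in node2vec [15]) are all element-wise, so they can be sketched in a few lines of plain Python. The function names are chosen here for illustration.

```python
def mean_op(x_i, x_j):
    """MEAN: element-wise average of the two node vectors."""
    return [(a + b) / 2 for a, b in zip(x_i, x_j)]


def product_op(x_i, x_j):
    """PRODUCT: element-wise (Hadamard) product."""
    return [a * b for a, b in zip(x_i, x_j)]


def weighted_l1(x_i, x_j):
    """WEIGHTED-L1: element-wise absolute difference."""
    return [abs(a - b) for a, b in zip(x_i, x_j)]


def weighted_l2(x_i, x_j):
    """WEIGHTED-L2: element-wise squared difference (Hadamard power of 2)."""
    return [(a - b) ** 2 for a, b in zip(x_i, x_j)]


x_i, x_j = [1.0, 4.0], [3.0, 2.0]
edge_features = {
    "mean": mean_op(x_i, x_j),        # [2.0, 3.0]
    "product": product_op(x_i, x_j),  # [3.0, 8.0]
    "l1": weighted_l1(x_i, x_j),      # [2.0, 2.0]
    "l2": weighted_l2(x_i, x_j),      # [4.0, 4.0]
}
```

Each operator returns a vector with the same dimensionality as the node vectors, so an edge feature vector is obtained for any node pair that has node features.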
Notice that node2vec, DeepWalk, and LINE all require that the training graph contain at least one edge incident to each node in $G$. However, DeepGL overcomes this fundamental limitation and can predict the class label of edges that are not in the training graph, as well as the class labels of edges in an entirely different network.

Inductive Across-Network Transfer Learning
Recall from Section 2 that a key advantage of DeepGL over existing methods [13], [15], [16] lies in its ability to learn features that naturally generalize for inductive across-network transfer learning tasks. Unlike existing methods [13], [15], [16], DeepGL learns relational functions that generalize for extraction on another arbitrary graph and therefore can be used for graph-based transfer (inductive) learning tasks such as across-network link classification. In contrast, node2vec [15], DeepWalk [13], and LINE [16] cannot be used for such graph-based transfer learning tasks, as the features from these methods are fundamentally tied to node identity, as opposed to the general relational functions learned by DeepGL that can be computed on any arbitrary graph. Further, these methods require the training graph to be connected (e.g., see [15]), which implies that each node in the original graph has at least one edge in the training graph. This assumption is unrealistic but is required by these approaches; otherwise, they would be unable to construct edge features for any edges containing nodes that did not appear in the training graph.
For each experiment, the training graph is fully observed with all known labels available for learning. The test graph is completely unlabeled and each classification model is evaluated on its ability to predict all available labels in the test graph. Given the training graph $G = (V, E)$, we use DeepGL to learn the feature matrix $\mathbf{X}$ and the relational functions $\mathcal{F}$ (definitions). The relational functions $\mathcal{F}$ are then used to extract the same features on an arbitrary test graph $G' = (V', E')$, giving as output a feature matrix $\mathbf{X}'$. Notice that each node (or edge) is embedded in the same $F$-dimensional space, even though the sets of nodes/edges of the graphs could be completely disjoint, and even from different domains. Thus, an identical set of features is used for all train and test graphs.
11. Note $\mathbf{x}^{\circ 2}$ is the element-wise Hadamard power; $\mathbf{x}_i \odot \mathbf{x}_j$ is the element-wise product.
In these experiments, the training graph $G_1$ represents the first week of data from yahoo-msg,$^{12}$ whereas the test graphs $\{G_2, G_3, G_4\}$ represent the next three weeks of data (e.g., $G_2$ contains edges that occur only within week 2, and so on). Hence, the test graphs contain many nodes and edges not present in the training graph. As such, the predictive performance is expected to decrease significantly over time as the features become increasingly stale due to the constant changes in the graph structure with the addition and deletion of nodes and edges. However, we observe the performance of DeepGL for across-network link classification to be stable, with only a small decrease in AUC as a function of time, as shown in Fig. 5a. This is especially true for edge features constructed using mean. As an aside, the mean operator gives the best performance on average across all test graphs, with an average AUC of 0.907 over all graphs. Now we investigate the performance as a function of the amount of labeled data used. In Fig. 5b, we observe that DeepGL performs well with very small amounts of labeled data for training. Strikingly, the difference in AUC scores from models learned using 1 percent of the labeled data is insignificant at p < 0.01 w.r.t. models learned using larger quantities.

Analysis of Space-Efficiency
Learning sparse space-efficient node and edge feature representations is of vital importance for large networks where storing even a modest number of dense features is impractical (especially when stored in-memory). Despite the importance of learning a sparse space-efficient representation, existing work has been limited to discovering completely dense (node) features [13], [15], [16]. To understand the effectiveness of the proposed framework for learning sparse graph representations, we measure the density of each representation learned from DeepGL and compare these against the state-of-the-art methods [13], [15]. We focus first on node representations since existing methods are limited to only node features. Results are shown in Fig. 6. In all cases, the node representations learned by DeepGL are extremely sparse and significantly more space-efficient than node2vec [15]. DeepWalk and LINE use nearly the same space as node2vec, and thus are omitted for brevity. Strikingly, DeepGL uses only a fraction of the space required by existing methods. Moreover, the density of the representations from DeepGL lies in $[0.162, 0.334]$ for nodes and $[0.164, 0.318]$ for edges, making them up to 6× more space-efficient than existing methods.
Notably, recent node embedding methods not only output dense node features, but their features are also real-valued and often negative (e.g., [13], [15], [16]). Thus, they require 8 bytes per feature value, whereas DeepGL requires only 2 bytes, which can sometimes be reduced to even 1 byte if needed by adjusting $\alpha$ (i.e., the bin size of the log binning transformation). To understand the impact of this, assume both approaches learn a node representation with 128 dimensions (features) for a graph with 10,000,000 nodes. In this case, node2vec, DeepWalk, and LINE require 10.2 GB, whereas DeepGL uses only 0.768 GB (assuming a modest 0.3 density), a significant reduction in space by a factor of 13.
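The arithmetic behind these figures can be reproduced with a short Python sketch. It counts only the bytes of the stored feature values, as the comparison above does, and ignores any index overhead of a sparse format; the helper names are hypothetical and 1 GB is taken as $10^9$ bytes.

```python
def dense_bytes(num_nodes, num_features, bytes_per_value=8):
    """Bytes needed to store every value of a dense feature matrix."""
    return num_nodes * num_features * bytes_per_value


def sparse_bytes(num_nodes, num_features, density, bytes_per_value=2):
    """Bytes needed for the nonzero values only (index overhead ignored)."""
    return density * num_nodes * num_features * bytes_per_value


GB = 10 ** 9
dense_gb = dense_bytes(10_000_000, 128) / GB         # about 10.24 GB
sparse_gb = sparse_bytes(10_000_000, 128, 0.3) / GB  # about 0.768 GB
factor = dense_gb / sparse_gb                        # about 13.3
```

The factor combines two effects: values are four times smaller (2 bytes instead of 8) and only 30 percent of them are stored, giving $4 / 0.3 \approx 13.3$.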

Runtime & Scalability
To evaluate the performance and scalability of the proposed framework, we learn node representations for Erdős-Rényi graphs of increasing size (from 100 to 10,000,000 nodes) such that each graph has an average degree of 10. We compare the performance of DeepGL against LINE [16] and node2vec [15], which is designed specifically to be scalable for large graphs. Default parameters are used for each method. In Fig. 7a, we observe that DeepGL is significantly faster and more scalable than node2vec and LINE. In particular, node2vec takes about 1.9 days (45.3 hours) for 10 million nodes, whereas DeepGL finishes in only 15 minutes; see Fig. 7a. Strikingly, this is 182 times faster than node2vec and 106 times faster than LINE. In Fig. 7b, we observe that DeepGL spends the majority of time in the search and optimization phase. In Fig. 8, we also investigate the effect on the number of relational functions learned, sparsity, and runtime of DeepGL as $\alpha$ and $\lambda$ vary.

Parallel Scaling
This section investigates the parallel performance of DeepGL. To evaluate the effectiveness of the parallel algorithm, we measure speedup defined as $S_p = T_1 / T_p$, where $T_1$ and $T_p$ are the execution times of the sequential and parallel algorithms (with $p$ processing units), respectively. In Fig. 9, we observe strong parallel scaling for all DeepGL variants, with the edge representation learning variants performing slightly better than the node representation learning methods from DeepGL. Results are reported for soc-gowalla.
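For reference, the speedup measure can be computed as in the Python sketch below. The timings are invented purely to exercise the formula and are not measurements from the paper; parallel efficiency $S_p / p$ is a standard companion quantity added here for illustration and is not used in the text above.

```python
def speedup(t_sequential, t_parallel):
    """S_p = T_1 / T_p for positive execution times."""
    if t_sequential <= 0 or t_parallel <= 0:
        raise ValueError("execution times must be positive")
    return t_sequential / t_parallel


def efficiency(t_sequential, t_parallel, p):
    """Speedup per processing unit, S_p / p (1.0 means ideal linear scaling)."""
    return speedup(t_sequential, t_parallel) / p


# Illustrative timings only (seconds): sequential run vs. run on 16 units.
s_16 = speedup(120.0, 10.0)         # 12.0
e_16 = efficiency(120.0, 10.0, 16)  # 0.75
```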

Effectiveness on Link Prediction
Given a graph $G$ with a fraction of missing edges, the link prediction task is to predict these missing edges. We generate a labeled dataset of edges as done in [15]. Positive examples are obtained by removing 50 percent of edges randomly, whereas negative examples are generated by randomly sampling an equal number of node pairs that are not connected with an edge, i.e., each node pair $(i, j) \notin E$. For each method, we learn features using the remaining graph that consists of only positive examples. To construct edge features from the node embeddings, we use the mean operator defined as $(\mathbf{x}_i + \mathbf{x}_j)/2$. Using the feature representations from each method, we then learn a model to predict whether a given edge in the test set exists in $E$ or not. Notice that node embedding methods such as node2vec require that each node in $G$ appear in at least one edge in the training graph (i.e., the graph remains connected); otherwise, these methods are unable to derive features for such nodes. This is a significant limitation, especially in practice where nodes (e.g., representing users) may be added in the future. Furthermore, most graphs contain many nodes with only a few edges (due to the small world phenomenon), which are also the most difficult to predict, yet are avoided in the evaluation of node embedding methods due to the above restriction. Results are provided in Table 4. We report both AUC and F1 scores. In all cases, DeepGL achieves better predictive performance than the other methods across a wide variety of graphs from different domains with fundamentally different structural characteristics. In Fig. 10, we summarize the gain in AUC and F1 score of DeepGL over the baseline methods. Strikingly, DeepGL achieves an overall gain in AUC of 32 percent and an overall gain in F1 of 37 percent, averaged over all baseline methods and across a wide variety of graphs with different structural characteristics.
As an aside, common neighbors and other simple measures are known to perform significantly worse than node2vec and other embedding approaches (see [15]) and thus are not included here for brevity.
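The construction of the labeled edge set described in this section can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the graph is undirected with no self-loops, node labels are sortable, and, unlike the procedure in [15], no attempt is made to keep the remaining graph connected. The function name and signature are hypothetical.

```python
import random


def make_link_prediction_split(nodes, edges, frac=0.5, seed=0):
    """Split an undirected graph into a training graph and labeled node pairs.

    Returns (remaining_edges, positives, negatives), where positives are the
    removed edges and negatives are sampled non-edges of equal number.
    """
    rng = random.Random(seed)
    edge_set = {(min(u, v), max(u, v)) for u, v in edges if u != v}
    node_list = sorted(nodes)

    shuffled = sorted(edge_set)
    rng.shuffle(shuffled)
    n_pos = int(frac * len(shuffled))
    positives, remaining = shuffled[:n_pos], shuffled[n_pos:]

    # Guard against an endless loop when too few non-edges exist.
    n = len(node_list)
    if n_pos > n * (n - 1) // 2 - len(edge_set):
        raise ValueError("not enough non-edges to sample negatives")

    negatives = set()
    while len(negatives) < n_pos:
        u, v = rng.sample(node_list, 2)  # two distinct nodes
        pair = (min(u, v), max(u, v))
        if pair not in edge_set:
            negatives.add(pair)
    return remaining, positives, sorted(negatives)


cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
remaining, positives, negatives = make_link_prediction_split(range(6), cycle)
```

Features are then learned on `remaining` only, and a classifier is trained to separate `positives` from `negatives` using edge features built from the node vectors.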

Sensitivity & Perturbation Analysis
Many real-world networks are partially observed and noisy (e.g., due to limitations in data collection) and often have both missing and noisy/spurious edges. In this section, we analyze the sensitivity of DeepGL under such conditions using the DD242 network. Results are shown in Fig. 11. In particular, DeepGL is shown to be robust to missing and noisy edges while the number of features (dimensionality) learned by DeepGL remains relatively stable with only a slight increase as a function of the number of missing or additional edges. While one must define the number of dimensions (features) in existing node embedding methods, DeepGL automatically learns the appropriate number of dimensions and the depth (number of layers). Furthermore, we find the number of features learned by DeepGL is roughly correlated with the complexity and randomness of the graph (i.e., as the graph becomes more random, the number of features increases to account for such randomness).

Interpretability of Learned Features
The features (embeddings, representations) learned by most existing approaches are notoriously difficult to interpret and explain, even though interpretability is becoming increasingly important in practice [17], [18]. In contrast, the features learned by DeepGL are relatively more interpretable and can be explained by examining the actual relational functions learned. While existing work primarily outputs the feature/embedding matrix $\mathbf{X}$, DeepGL also outputs the relational functions $\mathcal{F}$ that correspond to each of the learned features. It is these relational functions that allow us to gain insight into the meaning of the learned features.
In Table 5, we show a few of the relational functions learned from an email network (ia-email-EU; nodes represent users and directed edges indicate an email communication from one user to another). Recall a feature in DeepGL is a relational function representing a composition of relational feature operators applied over a base feature. Therefore, the interpretability of a learned feature in DeepGL depends on two aspects. First, we need to understand the base feature, e.g., in-degree.$^{13}$ Second, we need to understand the relational feature operators that are used in the relational function. As an example, $(\Phi^{-}_{\text{mean}})(\mathbf{x})$ in Table 5, where $\mathbf{x} = \mathbf{d}^{-}$ (in-degree), is interpreted as "the mean in-degree of the in-neighbors of a node." Using the semantics of the graph, it can be interpreted further as "the mean number of emails received by individuals that have sent emails to a given individual." This relational function captures whether individuals that send a given user emails also receive a lot of emails or not. It is straightforward to interpret the other relational functions in a similar fashion, and due to space constraints we leave this to the reader. DeepGL also learns relational functions involving either frequent (3-stars, 4-paths) or rare (triangles) induced subgraphs, as these are highly discriminative and characterize the behavioral roles of nodes in the graph. As shown in Table 5, composing relational operators allows DeepGL to learn structured and interpretable relational functions from well-understood base components. The learned relational functions yield decompositions of a signal into interpretable components that facilitate model checking by domain experts.
Fig. 7. Runtime comparison on Erdős-Rényi graphs with an average degree of 10. Left: The proposed approach is shown to be orders of magnitude faster than node2vec [15] and LINE [16]. Right: Runtime of the main DeepGL phases.
13. Notice that base features derived from the graph $G$ are also functions (e.g., degree is a function that sums the adjacent nodes).
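The example relational function, the mean in-degree of the in-neighbors of a node, can be traced on a toy email graph with the Python sketch below. The graph and user names are invented for illustration, in-degree is counted over distinct senders, and assigning 0.0 to a node with no in-neighbors is a convention chosen here rather than something specified in the text.

```python
def mean_in_degree_of_in_neighbors(edges):
    """For each node, average the in-degree of the nodes that point to it.

    edges: iterable of (sender, receiver) pairs of a directed graph.
    Nodes without in-neighbors get 0.0 (a convention assumed here).
    """
    nodes = {u for edge in edges for u in edge}
    in_nbrs = {v: set() for v in nodes}
    for sender, receiver in edges:
        in_nbrs[receiver].add(sender)
    in_degree = {v: len(in_nbrs[v]) for v in nodes}  # base feature
    return {
        v: (sum(in_degree[u] for u in in_nbrs[v]) / len(in_nbrs[v])
            if in_nbrs[v] else 0.0)
        for v in nodes
    }


# Toy email graph: (sender, receiver).
emails = [("ann", "bob"), ("cat", "bob"), ("bob", "ann"),
          ("dan", "ann"), ("ann", "cat")]
feature = mean_in_degree_of_in_neighbors(emails)
# bob hears from ann (in-degree 2) and cat (in-degree 1): feature["bob"] == 1.5
```

Read with the semantics of the graph, `feature["bob"] == 1.5` says that the people who email bob themselves receive email from 1.5 distinct senders on average.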

Visualization
We now explore the properties of the graph that are captured by the feature matrix $\mathbf{X}$ from DeepGL and node2vec. In particular, we use k-means to group nodes that are similar with respect to the node feature vectors given as output by each method. The number of clusters $k$ is selected automatically using MDL. In Fig. 12, we visualize the graph structure and color the nodes (and edges in the case of DeepGL) by their cluster assignments. Strikingly, we find in Fig. 12a that DeepGL captures the node and edge roles [7] that represent the important structural behaviors of the nodes and edges in the graph. In contrast, node2vec captures the community of a node, as observed in Fig. 12b. Notably, the roles given by DeepGL are intuitive (Fig. 12a). For example, the red role represents authors (and co-author links) from the CS PhD co-authorship graph [30] that are gate-keepers (bridge roles) connecting different groups of authors. Furthermore, the green role represents nodes at the periphery (or edge) of the network, which are star-edge nodes (as opposed to nodes at the center of a large star, i.e., hub nodes). This demonstrates the effectiveness of DeepGL for capturing and revealing the important structural properties (structural roles [7]) of the nodes and edges.

RELATED WORK
Related research is categorized below.
Node Embedding Methods. There has been a lot of interest recently in learning a set of useful node features from large-scale networks automatically [13], [15], [16], [31]. Many of the techniques initially proposed for solving the node embedding problem were based on graph factorization [32], [33], [34]. Recently, the skip-gram model [12] was introduced in NLP to learn vector representations for words. Inspired by the success of the skip-gram model, various methods [13], [15], [16], [35], [36] have been proposed to learn node embeddings using skip-gram by sampling ordered sequences of node ids (random walks) from the graph and applying the skip-gram model to these sequences. However, these methods are not inductive, nor are they able to learn sparse space-efficient node embeddings as achieved by DeepGL. More recently, Chen et al. [37] proposed a hierarchical approach to network representation learning called HARP, whereas Ma et al. proposed MINES for multi-dimensional network embedding with hierarchical structure. Other work has focused specifically on community-based embeddings [13], [15], [36], [38] as opposed to role-based embeddings [39]. In contrast, DeepGL learns features that capture the notion of roles (i.e., structural similarity) defined in [7] as opposed to communities (Fig. 12).
Heterogeneous networks [40] have also been recently considered [41], [42], [43], [44], as well as attributed networks [45], [46], [47]. Huang et al. [45] proposed an approach for attributed networks with labels, whereas Yang et al. [48] used text features to learn node representations. Liang et al. [49] proposed a semi-supervised approach for networks with outliers. Bojchevski et al. [50] proposed an unsupervised rank-based approach. Coley et al. [51] introduced a convolutional approach for attributed molecular graphs that learns graph embeddings as opposed to node embeddings. Similarly, Lee et al. [52] proposed a graph attention model that embeds graphs as opposed to nodes for graph classification. There has also been some recent work on learning node embeddings in dynamic networks [33], [53], [54], [55], semi-supervised network embeddings [56], [57], and methods for improving the learned representations [36], [47], [58], [59], [60], [61]. However, these approaches are designed for entirely different problem settings than the one focused on in this work. Notably, none of the above methods is inductive (for graph-based transfer learning tasks), nor are they able to learn sparse space-efficient node embeddings as achieved by DeepGL. Other key differences were summarized previously in Section 1.
Inductive Embedding Methods. While most work has focused on transductive (within-network) learning, there has been some recent work on graph-based inductive approaches [19], [62]. Yang et al. [62] proposed an inductive approach called Planetoid. However, Planetoid is an embedding-based approach for semi-supervised learning and does not use any structural features. DeepGL first appeared in a manuscript published in April 2017 as R. Rossi et al., "Deep Feature Learning for Graphs" [19]. A few months later, Hamilton et al. [63] proposed GraphSAGE, which shared many of the ideas proposed in [19] for learning inductive node embeddings. Moreover, GraphSAGE is a special case of DeepGL when the concatenated features at each layer are simply fed into a fully connected layer with nonlinear activation function $\sigma$. However, that work focused on node classification, whereas our work focuses on link prediction and link classification.
Table 5. Examples of a few important relational functions learned from the ia-email-EU network are shown below. Recall from Section 2 that a relational function (feature) of order-$h$ is composed of $h$ relational feature operators and each of the $h$ relational feature operators takes a set $S$ of related graph elements (Table 2). In this work, $S \in \{\Gamma^{+}_{\ell}(g_i), \Gamma^{-}_{\ell}(g_i), \Gamma_{\ell}(g_i)\}$ where $\ell = 1$. For simplicity, we denote $\Phi^{+}$, $\Phi^{-}$, $\Phi$ as the relational feature operators that use out, in, and total (both in/out) 1-hop neighbors, respectively. See text for discussion.
Fig. 12. Left: Application of DeepGL for edge and node role discovery (ca-PhD). Link color represents the edge role and node color indicates the corresponding node role. Right: However, since node2vec uses random walks it captures communities [39] as shown in (b) where the color depicts the community of a node. See text for discussion.
Many node embedding methods are based on traditional random walks (using node ids) [13], [15], [35], [36] and therefore are not inductive, nor do they capture roles. Recently, Ahmed et al. [39] proposed an inductive network representation framework called role2vec that learns inductive role-based node embeddings by first mapping each node to a type via a function $\Phi$ and then using the proposed notion of attributed (typed) random walks to derive inductive role-based embeddings that capture structural similarity [39]. The role2vec framework [39] was shown to generalize many existing random walk-based methods for inductive learning tasks on networks. Other work by Lee et al. [64] uses a technique to construct artificial graph data heuristically from input data that can then be used for transfer learning. However, that work is fundamentally different from our own, as it constructs a graph from non-relational data, whereas DeepGL is designed for actual real-world graph data such as social, biological, and information networks. Moreover, graphs derived from non-relational data are often dense and most certainly different in structure from the real-world graphs investigated in our work [65].
Higher-Order Network Analysis. Graphlets (network motifs) are small induced subgraphs and have been used for graph classification [2] and visualization/exploratory analysis [24]. However, DeepGL uses graphlet frequencies as base features for learning higher-order node and edge functions from large networks that generalize for inductive learning tasks.
Sparse Graph Feature Learning. This work proposes the first practical space-efficient approach that learns sparse node/edge feature vectors. Notably, DeepGL requires significantly less space than existing node embedding methods [13], [15], [16] (see Section 4). In contrast, previous work learns completely dense feature vectors which is impractical for any relatively large network, e.g., they require more than 3TB of memory for a 750 million node graph with 1K features.

CONCLUSION
This work introduced the notion of a relational function representing a composition of relational feature operators applied over an initial base graph feature. Using this notion as a basis, we proposed DeepGL, a general, flexible, and highly expressive framework for learning deep node and edge relational functions (features) that generalize for (inductive) across-network transfer learning tasks. The framework is flexible with many interchangeable components, expressive, interpretable, parallel, and both space- and time-efficient for large graphs, with runtime that is linear in the number of edges. DeepGL has all the following desired properties: (1) effective for learning features that generalize for graph-based transfer learning and large (attributed) graphs; (2) space-efficient, requiring up to 6x less memory; (3) fast, with up to 106x speedup in runtime performance; (4) accurate, with a mean improvement in AUC of 20 percent or more on many applications; and (5) expressive and flexible, with many interchangeable components making it useful for a range of applications, graph types, and learning scenarios.