GDLL: A Scalable and Shared-Nothing Architecture-Based Distributed Graph Neural Network Framework

Deep learning has recently been shown to be effective at uncovering hidden patterns in non-Euclidean spaces, where data are represented as graphs with complex object relationships and interdependencies. Because of the implicit data dependence in big graphs with millions of nodes and billions of edges, it is hard for industrial communities to exploit these methods to address real-world challenges at scale. The skewness of big graphs, the distributed file system performance penalty on small k-hop neighborhood subgraphs, and the varying size of subgraphs make graph neural network training even more challenging in a distributed environment using parameter servers. To address these issues, we propose a scalable, layered, fault-tolerant, in-memory distributed-computing-based graph neural network framework called GDLL. The base layer utilizes an optimized distributed file system and a scalable graph data store to reduce the performance penalty. The second layer provides distributed graph processing using in-memory graph programming models while optimizing and hiding the underlying complexity of information-complete subgraph computation. In the third layer, graph neural network modules are deployed on top of the first two layers for efficient distributed training using parameter servers. Finally, we evaluate GDLL against state-of-the-art solutions and outperform them significantly in terms of efficiency while maintaining similar GNN convergence.


I. INTRODUCTION
Graph topologies naturally represent real-world data in a variety of applications. In Euclidean data, deep learning is efficient at finding hidden patterns. But what about applications that use data from non-Euclidean domains and represent it as graphs with high interdependencies and compound object relationships? This is where Graph Neural Networks (GNNs), one of the most prominent deep learning frameworks for graph data modeling and analysis, come in as a substantial solution [1]. The core idea of this promising technique is to obtain the target node's embedded representation by preparing, aggregating, and merging information from its local neighborhood, which is akin to graph embedding [2]. GNNs have been widely utilized in many industrial problems, such as recommendation systems [3], social network analysis, and anomaly detection [4], because of their great expressiveness and impressive performance.
A social network graph, for example, can contain up to two billion nodes and a trillion edges [5]. Counting the features associated with those nodes and edges, graph data at such a scale may amount to 100 terabytes, which is infeasible to store and process on a single computer for GNN training. To use GNN methods to solve real-world problems by processing industrial-scale graphs, we need to build a graph learning system with scalability, fault tolerance, and the integrity of fully functional GNN training computations. Nevertheless, the computation graph of GNNs is fundamentally different from that of conventional learning tasks because of the data dependency within a graph. Existing traditional Parameter Server (PS) frameworks [6], [7] assume data parallelism, where each sample's computation is independent of other samples. But for convergence, GNNs are highly dependent on information-complete independent subgraphs. Furthermore, because real-world graphs follow a power law, the independent-subgraph-based solution results in load-balancing issues in a distributed environment. Similarly, for quick data access, we must keep the graph data in memory; we can no longer keep training or inference samples on disk and retrieve them via pipelines [5]. As a result, we cannot create a learning and inference system for graph learning tasks based on existing PS frameworks that merely maintain model integrity in the PS and execute workloads data-parallel on each worker.
In the recent past, several attempts have been made to develop scalable systems, both in the literature and in industry. The present scalable learning solutions may be divided into two categories: shared-memory-based distributed GNN frameworks (DGL [8], PyG, AliGraph [9], FeatGraph [10], and [11]) and shared-nothing-architecture-based distributed GNN frameworks (NeuGraph [12] and AGL [5]). The former frameworks have inherent issues (such as limited scalability) because of their design architecture and lack suitability for some real-world scenarios. For example, PBG fails when graphs have rich attributes on nodes and edges. PyG and DGL are designed as standalone powerful-machine systems to deal with industrial-scale graphs. AliGraph implements a distributed in-memory graph storage engine, which requires standalone deployment before training a GNN model. In the latter class, NeuGraph [12] is based on the Scatter-Apply-Gather [13] graph processing model. This approach is suitable for information-complete-subgraph PS distributed training. Likewise, AGL is a general-purpose learning framework based on the MapReduce distributed computing architecture. AGL also attempts inference in the distributed environment, but it utilizes a disk-based storage approach and does not use the distributed file system efficiently.
Overall, existing solutions require in-memory storage of graph data either on a standalone powerful machine that cannot manage real-world big graphs or in a customized graph store that can lead to a massive amount of communication between graph stores and workers. This prevents them from scaling to bigger graph data. Further, state-of-the-art solutions do not fully exploit existing shared-nothing distributed architectures, such as the MapReduce framework, for scalability, for deploying GNNs against big graphs, and for fault tolerance. Finally, existing frameworks focus more on the training of graph learning models but overlook system integrity and generalizability.
To address these issues and fill the research gap, in this paper we present the Graph Distributed Learning Library (GDLL), a scalable, layered, fault-tolerant, in-memory, shared-nothing-architecture-based framework for distributed GNN training.
The proposed framework is composed of three layers, i.e., the Graph Data Layer (GDL), the Graph Optimization Layer (GOL), and the Graph Learning Layer (GLL). The GDL is the base layer (Section IV-A) and has been designed on top of an optimized distributed file system (for k-hop-based subgraphs) called Flock-HDFS for Subgraphs (F-HDFS) and a scalable graph data store to manage large-scale raw data, big graphs, and subgraphs. F-HDFS is a novel file aggregation approach designed on top of the Hadoop Distributed File System (HDFS) [14]. A Flock is designed as the I/O unit merging a group of related k-hop-based subgraphs, aiming to access subgraphs effectively for efficient GNN training in a distributed environment. The second layer (Section IV-B), called GOL, provides distributed graph processing on top of an in-memory MapReduce framework (Apache Spark [15]) while optimizing and hiding the underlying complexity of advanced message passing, k-hop-based subgraph computation, and graph sampling techniques. The third layer (Section IV-C), called GLL, deploys GNN modules on top of the first two layers for fast and efficient training using PS.
The main contributions of this paper are manifold and can be summarized as follows:
• We propose a scale-out, shared-nothing-architecture-based distributed GNN framework.

II. RELATED WORK
The past few years have seen notable advancement in GNNs for solving problems from diverse domains, e.g., social network analysis [16], computer vision [17], chemistry [18], medicine [19], and health [20]. However, deploying GNNs against big graphs is challenging [21]. Efforts have been made to propose scalable systems for GNNs. Existing methods adopt two types of approaches for scalable GNN training: shared-memory-based distributed GNN frameworks (systems that allow multiple processors to share the same memory locations) and shared-nothing-architecture-based distributed GNN frameworks (where each server operates independently and controls its own memory and disk resources toward a common goal).
In the former case, attempts have been made on a single powerful machine by fully utilizing its computing resources, i.e., GPU, memory, and CPU, to train GNN models quickly. This includes PyTorch Geometric (PyG) [22] and the Deep Graph Library (DGL) [8], both of which are based on the message-passing paradigm. They have been optimized to maximize the utilization of the CPU and GPUs. However, these systems cannot scale and are not applicable to big graphs with substantial features. Similarly, PinSage [23], inspired by GraphSAGE, samples the neighborhood of a node by exploiting localized convolutions and deploys a MapReduce pipeline for fast inference, but it is also based on a standalone machine.
In multiple-machine setups, researchers have recently tended to opt for distributed designs. In this context, Facebook offered PyTorch-BigGraph (PBG) [11], a large-scale graph embedding framework. PBG can generate node embeddings in an unsupervised manner from multi-relation data. Yet it falls short in dealing with rich attributes on nodes and edges, which is largely the case in real-world graph applications. DistDGL [24], based on DGL, proposed mini-batch-based scalable GNN training on a cluster of machines. It adopts METIS [25] to partition and distribute graph data across all machines. Furthermore, to achieve better GNN convergence, it follows a synchronous training approach on indigenously designed key-value store servers. However, its graph partitioning method results in severely imbalanced partitions and thus needs extensive optimization for load balancing. DistGNN [26], also based on DGL, proposed a full-batch training strategy on an optimized shared-memory implementation. It reduces communication by employing a family of delayed-update algorithms and minimum vertex cuts. AliGraph [9], currently deployed at Alibaba for product recommendation, presents a three-layer system to support GNN algorithms, consisting of a storage layer, a sampling layer, and an operator layer. The storage layer provides a caching mechanism for the neighbors of important nodes. The sampling layer provides neighborhood, traverse, and negative sampling methods in a distributed setting. The operator layer provides optimized aggregate and combine operators. Still, AliGraph cannot scale out, and no such capability has been reported in the paper.
Likewise, scale-out systems have been proposed based on shared-nothing architecture, i.e., NeuGraph [12] and AGL [5]. NeuGraph is based on a GPU-based distributed-memory design and uses a parallel, multi-GPU standalone system for GNN training on large graphs. It proposes SAGA-NN (Scatter-ApplyEdge-Gather-ApplyVertex with Neural Networks), a programming model for GNNs inspired by the vertex-centric parallel graph abstraction GAS [27]. It uses min-cut METIS partitioning for the input graph. Similarly, AGL [5] is a framework designed to process industrial-scale GNNs while utilizing the Hadoop MapReduce framework [28]. However, AGL exploits the default structure of the distributed file system, which can be a significant bottleneck. AGL and AliGraph attempt to solve the scalability issue, yet they are still far from near-real-time inference for large industrial graphs.
To summarize, recent solutions demand in-memory graph data storage, either on a standalone powerful computer that cannot manage real-world enormous graph data or in a customized graph store that can lead to huge communication overhead between graph stores and workers. As a result, they are unable to scale to massive graph data. Furthermore, MapReduce-based GNN solutions do not fully utilize existing shared-nothing distributed architectures. Finally, existing frameworks focus more on graph learning model training while neglecting system integrity and generalizability. Also, they do not address the issues of k-hop-based GNN training, i.e., distributed GNN training on varying-size subgraphs. In this research work, we propose the GDLL framework aiming to fill this research gap.
VOLUME 5, 2021

III. PRELIMINARIES
To better understand this research work, in this section we introduce some notation and background on GNNs and their computation paradigm, such as message passing; we then articulate the foundations of HDFS [14] and Apache Spark [15]. Finally, we introduce the concept of the k-hop neighborhood to help realize the data independence and the formation of the information-complete subgraphs required in graph learning tasks.

A. NOTATIONS
We start from a directed, attributed, and weighted graph, defined as G = {V, E, A, X, E}, where V is the vertex set and E ∈ V × V is the edge set. A represents a weighted, sparse adjacency matrix, A ∈ R^{|V|×|V|}, where A_{v,u} > 0 denotes a weighted directed edge from vertex u to vertex v (i.e., (v, u) ∈ E), and A_{v,u} = 0 denotes the absence of such an edge (i.e., (v, u) ∉ E). X and E are the vertex and edge feature matrices, X ∈ R^{|V|×n} and E ∈ R^{|V|×|V|×e}, where n and e are the respective feature sizes. We denote the feature vector of vertex v as x_v and the feature vector of edge (v, u) ∈ E as e_{v,u}. To represent an undirected edge, we decompose it into two directed edges (v, u) and (u, v) with the same edge feature e_{v,u}. Furthermore, we write D for the diagonal degree matrix, h for graph embeddings, I for the identity matrix, a for attention weights, AGGR for the aggregation function, and σ for the activation function. Finally, we use N⁺_v = {u ∈ V : (v, u) ∈ E} to define the set of vertices directly pointing at v; in simple words, N⁺_v is the set of in-edge neighbors of v, while N⁻_v is the set of out-edge neighbors of v. We use the term in-edges for the edges pointing at a target vertex v and out-edges for the edges pointed away from the same vertex v.
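As a concrete toy instance of this notation, the sketch below (with made-up weights and a made-up feature size) builds A and X for a three-vertex graph and derives the in-edge and out-edge neighbor sets:

```python
import numpy as np

# Vertices V = {0, 1, 2}; a directed edge keyed (v, u) means "u points at v".
V = [0, 1, 2]
edges = {(1, 0): 0.5, (2, 0): 1.0, (2, 1): 0.25}   # (v, u) -> weight (illustrative)

A = np.zeros((len(V), len(V)))                      # A in R^{|V| x |V|}
for (v, u), w in edges.items():
    A[v, u] = w                                     # A[v, u] > 0 iff (v, u) in E

X = np.random.rand(len(V), 4)                       # vertex feature matrix, n = 4

def in_neighbors(v):
    """N+_v: vertices directly pointing at v (in-edge neighbors)."""
    return [u for u in V if A[v, u] > 0]

def out_neighbors(v):
    """N-_v: vertices that v points at (out-edge neighbors)."""
    return [u for u in V if A[u, v] > 0]

print(in_neighbors(2))   # [0, 1]
print(out_neighbors(0))  # [1, 2]
```

Here vertex 2 has two in-edge neighbors (0 and 1), while vertex 0 points at both other vertices, matching the adjacency convention above.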

B. GNNS AS MESSAGE PASSING
GNNs are deep learning methods that can be applied directly to graphs. Over the years, several GNNs have been proposed, which we divide into two broad categories: message-passing-based GNNs and GNNs based on the Weisfeiler-Lehman test (WL-GNNs). In this paper, our focus is on message-passing-based GNNs, as WL-GNNs are not scalable and their complexity increases polynomially with graph size. Message-passing-based GNNs generalize the convolution operation to the irregular graph domain, expressed as neighborhood aggregation, also called the message-passing scheme, as shown in Figure 1. The scheme consists of two major steps, i.e., aggregate and update, and can be formulated as

z_v^k = AGGR({h_u^{k-1} : u ∈ N_v})   (1)

where AGGR must be a differentiable and permutation-invariant function, e.g., mean, sum, or max; h_v^{k-1} denotes the embedding of node v at layer k−1 (or its feature vector when k = 1); z_v^k denotes the neighborhood aggregation at the k-th layer; and N_v denotes the neighbors of node v. The next step is to update the aggregated neighborhood and propagate:

h_v^k = σ(W^k · (h_v^{k-1} ∥ z_v^k))   (2)

where ∥ denotes concatenation, W^k is the learnable weight matrix, and σ is an activation function, e.g., ReLU.
To execute GNNs on a large graph, GraphSAGE proposed the notion of sampling before aggregation and update. In this paper, we implement the Graph Convolutional Network (GCN), GraphSAGE, and Graph Attention Networks (GAT) in this same message-passing scheme.
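A minimal NumPy sketch of one aggregate-and-update step may clarify the scheme; the mean aggregator and the concatenation-based update below are illustrative choices, not the framework's exact operators:

```python
import numpy as np

def message_passing_layer(H, neighbor_lists, W, aggr=np.mean):
    """One message-passing layer: aggregate neighbor embeddings, then update.
    H: (num_nodes, d_in) embeddings at layer k-1.
    neighbor_lists: list where entry v holds the neighbor ids of node v.
    W: (2 * d_in, d_out) learnable weight matrix."""
    relu = lambda x: np.maximum(x, 0)
    out = []
    for v in range(H.shape[0]):
        nbrs = neighbor_lists[v]
        # aggregate step: z_v = AGGR({h_u : u in N_v}); zeros if v is isolated
        z_v = aggr(H[nbrs], axis=0) if nbrs else np.zeros(H.shape[1])
        # update step: h_v' = sigma(W . [h_v ; z_v])
        out.append(relu(np.concatenate([H[v], z_v]) @ W))
    return np.vstack(out)
```

With unit embeddings and a constant weight matrix, the output of a node with neighbors differs from that of an isolated node only through the aggregated half of the concatenation, which makes the two steps easy to inspect.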

C. HDFS AND APACHE SPARK
HDFS, inspired by the Google File System [29], is an open-source distributed file system designed to store large-scale data reliably and stream it to user applications at high speed. HDFS is highly fault-tolerant and can scale out to thousands of machines, even on commodity hardware. The main components of HDFS are the NameNode, DataNodes, and the Client, as shown in Figure 2. The NameNode is in charge of managing the file system namespace and the metadata, and it responds to various requests from clients. DataNodes are used for storage and provide a data replication mechanism to guarantee the availability of the system. When a client needs to write a file to HDFS, an RPC request is made to the NameNode, which grants the request if the file does not already exist. The file is divided into blocks by the client. For each block, the NameNode allocates several DataNodes according to its replication policy and creates metadata, including the location information, in its memory. The client then writes packets to the DataNodes. Similarly, for reading, the client sends a read request to the NameNode, which returns the metadata (containing the location information of the file blocks), followed by the actual data from the DataNodes. Finally, all file blocks are assembled into a complete file at the client.
Likewise, Apache Spark is an in-memory cluster computing platform that generalizes the MapReduce model to provide efficient computation. Apache Spark vastly outperforms traditional MapReduce procedures on platforms like Hadoop due to its heavy use of in-memory computing. The Resilient Distributed Dataset (RDD), which is stored in memory, is the mechanism mainly responsible for this efficient memory abstraction. An RDD in Spark is essentially an immutable distributed collection of elements. To perform any operation in Spark, all work is expressed by defining new RDDs, transforming existing RDDs, or calling methods on RDDs.

Every RDD is divided into multiple partitions, which are computed on numerous nodes of the cluster. Furthermore, Spark processes are fault-tolerant; thanks to this feature, large numbers of worker nodes can be killed without affecting the overall status of a job.

D. K-HOP NEIGHBORHOOD
The k-hop neighborhood of a target node v consists of all nodes u such that d(u, v) ≤ k, where d(u, v) represents the length of the shortest path from u to v. Its edge set from graph G consists of the edges that have both endpoints in its node set, i.e., E_v^k = {(u, u′) ∈ E : u ∈ V_v^k, u′ ∈ V_v^k}. Moreover, it contains the feature vectors of the nodes and edges in the computed k-hop neighborhood, X_v^k and E_v^k. The 0-hop neighborhood of a target node v is the node v itself (together with its feature vector).
The generated k-hop neighborhood provides sufficient and necessary information without compromising the data interdependency among nodes in the subgraph (see Theorem 1 in [5]). It can be denoted like the original graph as G_v^k = {V_v^k, E_v^k, A_v^k, X_v^k, E_v^k}, where A_v^k is the k-hop neighborhood's adjacency matrix. Thus, workers can train the GNN model independently in parallel under parameter server settings without compromising the effectiveness of the GNN models.
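The k-hop neighborhood construction can be sketched as a plain breadth-first search that keeps nodes within distance k of the target and then induces the edge set; the function name and the `adj` dictionary shape are ours, for illustration only:

```python
from collections import deque

def k_hop_subgraph(adj, v, k):
    """Extract the information-complete k-hop neighborhood of target node v.
    adj: dict mapping each node to the list of its neighbors.
    Returns (node set, induced edge set)."""
    nodes = {v}
    dist = {v: 0}
    frontier = deque([(v, 0)])
    while frontier:
        u, d = frontier.popleft()
        if d == k:                        # stop expanding at distance k
            continue
        for w in adj.get(u, []):
            if w not in dist:             # first visit gives the shortest path
                dist[w] = d + 1
                nodes.add(w)
                frontier.append((w, d + 1))
    # induced edges: both endpoints must lie in the k-hop node set
    edges = {(a, b) for a in nodes for b in adj.get(a, []) if b in nodes}
    return nodes, edges
```

On the chain 0 → 1 → 2 → 3, the 2-hop neighborhood of node 0 contains nodes {0, 1, 2} and only the two edges between them, since node 3 lies outside the hop limit.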

IV. PROPOSED GDLL FRAMEWORK
To tackle the complex problems of highly interdependent graph structures with a scalable and efficient graph learning framework, we propose the GDLL framework, as shown in Figure 3. In this section, we present an abstract overview of the proposed framework; the technical details are elaborated in the following subsections.
The effectiveness of our proposed framework lies in its layered design: we propose a three-layer framework, i.e., GDL, GOL, and GLL. This encourages progress in each layer separately while achieving scalable graph learning when the layers are integrated. To better comprehend the proposed GDLL framework, we elaborate the simple steps to perform scalable GNN training, as shown in Figure 3. To handle large-scale graphs effectively during the GNN training process, we first develop the GDL layer, which provides methods to load and store the large raw graph on top of HDFS and big graphs in an indexed, efficient Graph DS. Further, this layer is equipped with F-HDFS, which keeps related subgraphs in the same blocks for fast loading during GNN training.
The GOL layer is designed on top of an in-memory distributed computing engine, i.e., Apache Spark. This layer provides the functionality of k-hop neighborhood (information-complete subgraph) computation, along with graph re-indexing and sampling strategies, while hiding the complexity of the distributed computing engine.
Finally, the GLL layer is responsible for PS-based GNN training in a distributed environment. The beauty of our proposed approach is that all variants of message-passing-based GNNs can be easily integrated (on top of GDL and GOL) with our proposed shared-nothing framework.

A. GRAPH DATA LAYER
The Graph Data Layer is the base layer of the framework and deals with the efficient persistence and retrieval of big graphs. It consists of two main modules, i.e., the Graph Access Controller and Graph Data Persistence. The Graph Access Controller allows the upper layers to establish a connection with Graph Data Persistence and to store and retrieve data in the Graph Data Persistence store; it is responsible for all communication between Graph Data Persistence and the upper layers. Graph Data Persistence provides three kinds of data stores, i.e., the Unstructured Data Store, the Graph Data Store, and F-HDFS. The Unstructured Data Store manages the storage and retrieval of raw graph data and GNN models on top of HDFS. The Graph Data Store facilitates live, on-demand graph data storage and retrieval. F-HDFS is an HDFS optimized for subgraph-based GNNs.
Real-world graph data arrives in raw form and is stored in the RAW DS. For this data to be processed, it needs to be mapped to a graph representation. This task is carried out by the Raw Data to Graph Mapper (R2G Mapper), which transforms the raw graph data into graph format for distributed processing, as shown in Algorithm 1. Algorithm 1 reads raw graph data from the RAW DS. The node data consist of a node id and its attributes, and the edge data consist of source and destination node ids and edge attributes. The algorithm constructs a node object from each record in the node data and puts the node objects in the graph's node list. Then the source and destination node ids and edge attributes are obtained from the edge data, an edge object is created, and it is put in the edge list of the graph object. After loading, the graph is represented as an in-memory object consisting of nodes with properties and edges with properties. This representation is handed over to the GOL to compute the information-complete k-hop subgraphs.
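A hedged sketch of what Algorithm 1's flow might look like; the `Graph` class and the record shapes are our assumptions, since the paper gives only the algorithm outline:

```python
class Graph:
    """Minimal in-memory graph object: nodes with properties, edges with properties."""
    def __init__(self):
        self.nodes = {}    # node_id -> attribute dict
        self.edges = []    # (src_id, dst_id, attribute dict)

def r2g_map(node_records, edge_records):
    """Illustrative R2G Mapper: turn raw RAW DS records into a graph object.
    node_records: iterable of (node_id, attrs).
    edge_records: iterable of (src_id, dst_id, attrs)."""
    g = Graph()
    for node_id, attrs in node_records:
        g.nodes[node_id] = attrs                 # node object with properties
    for src, dst, attrs in edge_records:
        if src in g.nodes and dst in g.nodes:    # drop dangling edges
            g.edges.append((src, dst, attrs))
    return g
```

The dangling-edge check is our addition for robustness; raw dumps often reference node ids that never appear in the node data.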

1) F-HDFS
The produced k-hop subgraphs are large in number and need to be indexed to speed up the training process. The design principle behind HDFS is to work with large files, which makes it suitable for big-data platforms, mostly on commodity hardware. However, the k-hop subgraphs are mainly composed of small files in huge numbers. A massive number of small files causes two problems. First, for a large number of small files (as is the case with subgraphs), the NameNode gets overburdened and the DataNodes fail. Second, subgraph-based GNN training demands rapid read access to the subgraphs stored in files, which results in an excessive number of read requests and hopping between DataNodes. This severely damages I/O performance, as shown in Figure 4: the throughput is very low for small files. Therefore, for a scalable distributed GNN computing platform, HDFS must be adapted to subgraph workloads.

Figure 3: The proposed GDLL framework has three layers. The first layer, GDL, is responsible for graph data management throughout the training lifecycle. In Step-1, the raw data are persisted into the RAW DS; the R2G Mapper then maps the data into graph format (Step-2) for storing in the Graph DS (Step-3). The GOL then loads the graph into distributed memory (Spark RDDs) to compute the information-complete subgraphs (Step-4). These subgraphs are indexed in F-HDFS (Step-5) and retrieved by the GLL from F-HDFS for distributed training utilizing parameter servers (Step-6). Finally, the trained model is persisted in the Model DS for inference (Step-7).

To this end, we design F-HDFS as a scheme for optimized subgraph loading operations for GNN training. The main notion is to systematically merge the generated subgraphs into data units called Flocks, which are then stored on HDFS. We designed the Flock format around the requirements of k-hop subgraphs in such a way that the required subgraphs packed in a Flock are loaded in bulk for the training operation, which not only reduces the I/O cost but also considerably reduces the network overhead and, as a result, speeds up distributed GNN training. The size of a Flock is fixed to the HDFS block size, which we determined from the HDFS throughput shown in Figure 4: there is a significant improvement beyond 128 MB, which is the HDFS block size in our case, so we fixed the Flock size to 128 MB.
The storage operation for Flocks is outlined in Algorithm 2. An index record is generated for each Flock, consisting of the Flock id, the number of subgraphs, the HDFS path where the Flock is stored, and the subgraph target vertices that the Flock contains. This index record is used for the retrieval operation. Algorithm 3 shows the Flock retrieval operation. First, the index record of the Flock containing the required vertex is retrieved. Then the corresponding HDFS location of the Flock is obtained from the index record, and the actual Flock is read. Finally, the Flock is unpacked, and the subgraphs are retrieved.
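The store/retrieve pair of Algorithms 2 and 3 can be sketched as follows; a `write_block`/`read_block` callable stands in for the actual HDFS block I/O (an assumption on our part), while the index record fields mirror the description above:

```python
import pickle

FLOCK_SIZE = 128 * 1024 * 1024   # Flock size fixed to the HDFS block size (128 MB)

def store_flocks(subgraphs, write_block):
    """Sketch of Algorithm 2: pack related subgraphs into Flocks and index them.
    subgraphs: dict of target_vertex -> serializable subgraph.
    write_block: callable standing in for an HDFS block write; returns a path."""
    index, current, size = [], {}, 0
    for vertex, sg in subgraphs.items():
        blob = pickle.dumps(sg)
        if current and size + len(blob) > FLOCK_SIZE:   # flush a full Flock
            path = write_block(current)
            index.append({"id": len(index), "count": len(current),
                          "path": path, "vertices": set(current)})
            current, size = {}, 0
        current[vertex] = sg
        size += len(blob)
    if current:                                         # flush the last Flock
        path = write_block(current)
        index.append({"id": len(index), "count": len(current),
                      "path": path, "vertices": set(current)})
    return index

def retrieve_subgraph(vertex, index, read_block):
    """Sketch of Algorithm 3: find the Flock holding `vertex`, read it in bulk,
    unpack it, and return the requested subgraph."""
    for record in index:
        if vertex in record["vertices"]:
            flock = read_block(record["path"])
            return flock[vertex]
    return None
```

Because an entire Flock is read in one request, subgraphs that will be trained together pay a single block fetch instead of many small-file reads, which is the I/O saving the section describes.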

B. GRAPH OPTIMIZATION LAYER
Once large graph datasets are acquired and persisted via the GDL, training GNNs in a distributed environment using PS requires independent subgraphs with complete information. To accomplish this, we compute k-hop-based independent subgraphs in the MapReduce environment utilizing Apache Spark. For managing the k-hop-based subgraphs, we need a subgraph indexer to index the subgraphs and hold information about them, i.e., access id, subgraph size, and the numbers of nodes and edges. Further, as real-world graphs are skewed, a k-hop subgraph may be large and become challenging to process on a single machine in a MapReduce cluster; therefore, sampling strategies are required. Likewise, the k-hops generated for target nodes contain overlapping nodes, which puts extra computation pressure on GNN training. To handle this problem, we develop a graph pruner to discard nodes from similar overlapping k-hop subgraphs for faster training. To address these issues, we design the GOL on top of the MapReduce framework [28]. The GOL is composed of four components, i.e., the k-hop subgraph generator, subgraph indexer, subgraph sampler, and subgraph pruner.
Another aim of this layer is to provide Application Programming Interfaces (APIs) abstraction on top of the Spark MapReduce framework for graph operations while optimizing and hiding the underlying complexity. The overall performance of the GNN training and inferences is proportional to the design of the GOL. The GOL main components are further elaborated in the following subsections.

1) K-Hop-based Subgraph Computation
We observe that the (k−1)-hop neighborhood of node u is part of the k-hop neighborhood of v if there exists an edge from u to v. Therefore, to obtain the k-hop neighborhood of a node v, we merge the (k−1)-hop neighborhoods of the nodes in v's outgoing neighborhood with the 1-hop neighborhood of v.
We build two main functions: in-edge merging and out-edge propagation. The in-edge merging function simply merges in-edges into one set of in-edges. Out-edge propagation provides the in-edges at hop k+1 to the neighbors along out-edges [5].
To implement the k-hop generator using Spark MapReduce, we first define (key, value) pairs, where the node identity is the key and the node information is the value. The node information consists of the node's current k-hop, its in-edge list, and its out-edge list. The reducer produces the k-hop neighborhood by executing k times: at round k, it merges the (k−1)-hop with the current in-edges to produce the k-hop. The k-hop generation process is described in Figure 5. Given a graph dataset and k, the map and reduce stages are repeated k times. In the first round, we use MapToPair to produce the (key, value) pairs, then call ReduceByKey to merge the in-edges of each node. To provide the input for the next round, we use the out-edge propagation function to provide the in-edges at hop k+1 to the neighbors along the out-edges. The details of the k-hop-based subgraph generator are shown in Algorithm 4.
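For illustration, the k rounds can be emulated driver-side with plain dictionaries standing in for the (key, value) RDD; in the framework these steps run as Spark MapToPair/ReduceByKey stages, so this is only a semantic sketch of the in-edge-merging and out-edge-propagation loop:

```python
from collections import defaultdict

def khop_generator(in_edges, out_edges, k):
    """Driver-side sketch of the k-hop generator rounds (cf. Algorithm 4).
    in_edges / out_edges: node -> list of in-/out-neighbor ids.
    Returns node -> set of nodes in its k-hop in-neighborhood."""
    # round 1: each node's 1-hop is its merged in-edge set
    khop = {v: set(in_edges.get(v, [])) for v in in_edges}
    for _ in range(k - 1):
        # out-edge propagation: send each node's current hop set (plus itself)
        # along its out-edges ...
        messages = defaultdict(set)
        for v, hop in khop.items():
            for dst in out_edges.get(v, []):
                messages[dst] |= hop | {v}
        # ... then in-edge merging reduces the messages per destination key
        khop = {v: khop[v] | messages[v] for v in khop}
    return khop
```

On a chain 1 → 2 → 3, two rounds leave node 3 with the 2-hop in-neighborhood {1, 2}, matching the observation that the 1-hop of node 2 becomes part of the 2-hop of node 3.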

2) Subgraph Sampler and Pruner
When a k-hop subgraph is too large and difficult to process on a single machine of the cluster, we use the subgraph sampler. Instead of analyzing the whole neighborhood, we can sample a small subgraph similar to the original one. We use subgraph sampling to reduce the scale of the k-hop neighborhoods, especially for "hub" nodes. In Algorithm 4, subgraph sampling can be invoked inside the inEdgeMerging() function if the neighborhood of the target node is large. Several sampling techniques can be used, such as the random walk (select neighbors uniformly at random) and the biased random walk (select neighbors by graph properties like degree). Figure 6 shows the flowchart for sampling to return the neighborhoods of a hub node i.
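A possible sketch of the sampler's two strategies; the function name and threshold handling are our assumptions, not the framework's API:

```python
import random

def sample_neighbors(neighbors, threshold, degree=None, seed=None):
    """Cap a hub node's neighborhood before in-edge merging.
    Uniform selection by default; pass a `degree` map for a degree-biased
    random-walk-style selection."""
    rng = random.Random(seed)
    if len(neighbors) <= threshold:
        return list(neighbors)                          # small neighborhoods pass through
    if degree is None:
        return rng.sample(list(neighbors), threshold)   # uniform, without replacement
    # bias the selection toward high-degree neighbors
    weights = [degree.get(u, 1) for u in neighbors]
    return rng.choices(list(neighbors), weights=weights, k=threshold)
```

Only neighborhoods above the threshold are touched, so non-hub nodes keep their exact k-hop information.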
Graph pruning reduces unnecessary computation in the aggregation step of the k-th layer during training. The pruning strategy reduces the number of non-zero values in the adjacency matrix of each layer. We include graph pruning as an optional function for the training phase.

C. GRAPH LEARNING LAYER
This layer aims to provide implementations of popular GNN models, facilitate distributed GNN training on a graph, and integrate with the APIs of the previous two layers for efficient GNN training. The GLL consists of two main components, i.e., the k-hop-based GNNs Trainer and the PS.

1) K-hop-based GNNs Trainer
In this section, our focus is on expediting efficient large-scale GNN training. Large-scale GNN training is a complicated problem because the high interdependency of graph data makes traditional distributed training challenging, contrary to independent batch training on images and text. There are essentially two strategies to train a GNN model: whole-graph training [30], [31] and sample-based training [9], [32]. The latter is generally adopted for large-scale graph learning and is further divided into full-batch training [26] and mini-batch training [24]. Yet both share one disadvantage: they fall short on the interdependency among the nodes in samples, compromising the effectiveness of GNN models. To compensate, researchers proposed GNN training based on prior computation of k-hop neighborhoods to overcome the issue of interdependency among nodes in the graph [5]. The k-hop neighborhood provides the complete information of the graph and also facilitates sample-based traditional distributed training of GNNs using PS. However, it induces new challenges, such as k-hops of varying size for different nodes, the "hub" node issue (i.e., one node may get a large chunk of the k-hop neighborhood relative to the original complete graph), and overlapping subgraphs among the computed k-hop neighborhoods.
The varying size of k-hop neighborhoods induces inconsistency in PS-based distributed training; e.g., workers with fewer nodes in their k-hops would have to wait for other workers to complete their computation. To overcome this issue, we order the k-hops by the number of nodes they contain, so that all workers at a certain step get k-hops of near-identical size. The "hub" node issue is resolved through sampling in the GOL, where larger k-hops are sampled based on a certain threshold (i.e., the maximum number of nodes allowed in a k-hop neighborhood subgraph), set according to the main memory of the systems. Finally, for overlapping subgraphs among the k-hops, we devise a setting in the training flow where we pick similar adjacent k-hops in the same batch for training, with the help of our proposed F-HDFS.
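The size-ordering step can be sketched as sorting subgraphs by node count and slicing off consecutive, similarly sized groups, one subgraph per worker per step (the names and the exact grouping policy here are illustrative, not the framework's implementation):

```python
def balanced_batches(subgraph_sizes, num_workers):
    """Order k-hop subgraphs by node count so that the subgraphs dispatched to
    workers in the same step are of near-identical size, reducing straggler
    wait in PS-based training.
    subgraph_sizes: dict of target_vertex -> number of nodes in its k-hop."""
    ordered = sorted(subgraph_sizes, key=subgraph_sizes.get)
    # consecutive slices now hold similarly sized subgraphs, one per worker
    return [ordered[i:i + num_workers] for i in range(0, len(ordered), num_workers)]
```

After sorting, adjacent subgraphs in each slice differ as little as possible in node count, so no worker in a step waits long for the others.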
Training message-passing-based GNNs on k-hop neighborhoods in a distributed environment has its own challenges. To make it fully parallel and distributed, we initially pre-computed the aggregations over each node's direct neighbors and used them during model computation in GNN training, avoiding the matrix-multiplication method of message aggregation. However, this overloads memory while yielding only slight improvements in training efficiency. We therefore moved from that implementation to the standard way of GNN model computation. Yet this brings a new challenge in k-hop subgraph vectorization, because subgraphs are loaded extensively during the training phase. To overcome this, we run the two processes, model computation and subgraph vectorization, in parallel during training. The batch size is kept small so that both processes fit in memory, avoiding out-of-memory errors. Also, F-HDFS's storage and indexing of identical subgraphs makes subgraph loading and vectorization more efficient. Additional improvement comes from the Apache Spark-based implementation, which is several times faster than Hadoop MapReduce.
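The overlap of model computation and subgraph vectorization can be realized with a bounded producer-consumer pipeline. The sketch below is a generic prefetching pattern, not the framework's actual code; `vectorize` stands in for whatever routine turns a raw k-hop subgraph batch into tensors, and the queue capacity bounds memory use, matching the small-batch constraint described above.

```python
import queue
import threading

def prefetch_batches(subgraph_batches, vectorize, capacity=2):
    """Vectorize subgraph batches on a background thread while the
    main thread runs model computation on the previous batch.

    subgraph_batches: iterable of raw k-hop subgraph batches.
    vectorize: hypothetical function batch -> tensors.
    capacity: max vectorized batches held in memory at once.
    """
    q = queue.Queue(maxsize=capacity)
    DONE = object()  # sentinel marking end of stream

    def producer():
        for batch in subgraph_batches:
            q.put(vectorize(batch))   # blocks when the queue is full
        q.put(DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is DONE:
            break
        yield item
```

A training loop would then iterate `for tensors in prefetch_batches(batches, vectorize): train_step(tensors)`, so vectorization of batch i+1 overlaps the model computation on batch i.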
Our implementation is based on the message-passing paradigm, where we divide a GNN into three parts: 1) Sampling, which in our case is the k-hop neighborhood, plus neighbor sampling in the case of a hub node; 2) Aggregation, which aggregates the features of all incoming neighbor nodes from the k-hop neighborhood; and 3) Combine, e.g., a fully connected neural network operation that obtains the updated feature of node v at the k-th layer of the GNN. The GCN, inspired by eq 1, performs neighborhood aggregation as

H^(k+1) = σ(Â H^(k) W^(k)) (3)

where Â is obtained using eq 4 and the self-loop is added using eq 5. X is the vertex feature matrix (with H^(0) = X), W^(k) is the learnable parameter matrix, and σ is any activation function, such as ReLU or ELU:

Â = D̃^(-1/2) Ã D̃^(-1/2) (4)

Ã = A + I (5)

Here Ã is the adjacency matrix A with vertex self-loops added to it by eq 5, and D̃^(-1/2) is the inverse square root of the diagonal degree matrix of Ã.
GNNs are a special type of deep learning method designed specifically for graphs. The most prevalent methods are GCN [30], GraphSage [32], and GAT [33]. Work on GNNs dates back to the first decade of the 2000s, but the breakthrough came with the proposal of GCN in 2017 [30]. We discuss these methods and their integration with the information-complete k-hop subgraphs in the coming section.

a: Graph Convolution Network
The GCN introduced a semi-supervised deep learning approach that can be applied directly to graphs. The general idea of GCN is to aggregate information from a node and its direct neighbors and then combine it to generate new embeddings. The first step is to add self-information by adding an identity matrix I, as in eq 5. Next, for direct neighborhood aggregation, the self-looped adjacency matrix is normalized with the inverse diagonal degree matrix, as in eq 4. Finally, the aggregated direct-neighborhood feature information is fed into an MLP layer, as in eq 3. The major limitation of GCN is its transductive nature: it trains on the complete graph, which in practice is incapable of processing large graphs at once in memory. Thus we train it on k-hop subgraphs iteratively.
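The three steps above can be written out in a few lines of NumPy. This is a minimal dense-matrix sketch of one GCN layer following eqs 3-5, not the framework's actual (sparse, distributed) implementation.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One GCN layer: add self-loops (eq 5), symmetrically normalize
    the adjacency matrix (eq 4), then aggregate and transform with a
    ReLU activation (eq 3).

    A: dense adjacency matrix, X: node feature matrix,
    W: learnable weight matrix.
    """
    A_tilde = A + np.eye(A.shape[0])            # eq 5: A~ = A + I
    d = A_tilde.sum(axis=1)                     # degrees of A~
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # eq 4: A^ = D~^(-1/2) A~ D~^(-1/2)
    return np.maximum(A_hat @ X @ W, 0.0)       # eq 3 with sigma = ReLU
```

Stacking two such calls (with different weight matrices) gives the usual two-layer GCN; in GDLL the same computation is applied iteratively to each k-hop subgraph instead of the whole graph.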

b: GraphSage
GraphSage [32] proposed a neighbor-sampling approach that enables large-scale graph training. For inductive learning, it provides three different aggregation functions as the convolution operation: average aggregation, max-pooling, and LSTM, where average aggregation is a non-parametric aggregator and the other two are learnable, parametric aggregators. These aggregation choices make GraphSage, unlike GCN, adaptable to graphs that grow over time. However, both of these methods are isotropic, i.e., direct neighbors contribute equally. In reality the opposite holds, and neighbors may have an unequal impact on a target node; thus GAT [33] introduces an anisotropic method (unequal weighting) for neighbor impact.
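The non-parametric average aggregator can be sketched as below. This is an illustrative simplification of GraphSage-style mean aggregation over sampled neighbors, with invented argument names, not the library's API.

```python
import numpy as np

def sage_mean_layer(X, neighbors, W_self, W_neigh):
    """GraphSage-style mean aggregation: average the sampled
    neighbors' features, then combine with the node's own features
    through separate learnable weights and a ReLU.

    X: node feature matrix.
    neighbors: dict node index -> list of sampled neighbor indices.
    """
    H = np.zeros((X.shape[0], W_self.shape[1]))
    for v in range(X.shape[0]):
        nbrs = neighbors.get(v, [])
        agg = X[nbrs].mean(axis=0) if nbrs else np.zeros(X.shape[1])
        H[v] = np.maximum(X[v] @ W_self + agg @ W_neigh, 0.0)
    return H
```

Swapping the `mean` for an element-wise max or an LSTM over the neighbor sequence yields the max-pooling and LSTM aggregators mentioned above.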

c: Graph Attention Networks
GAT introduces an attention strategy similar to the multi-head self-attention of the transformer [34], giving more weight to important nodes, unlike GCN and GraphSage, which treat all neighboring nodes as equally important. It thus introduces a new learnable parameter that determines how much weight should be given to each neighbor. Though it achieves better accuracy, it is computationally more expensive than GCN and GraphSage.
To formulate these methods we start from GCN, where a graph convolution operation computes the normalized sum of the neighbors' node features:

h_v^(k+1) = σ( Σ_{u∈N(v)} (1/c_vu) W^(k) h_u^(k) ) (6)

Here N(v) is the set of direct neighbors of node v, together with a self-loop to include information from the node itself; c_vu = sqrt(|N(v)|) sqrt(|N(u)|) is a normalization constant based on the graph structure; σ is an activation function (e.g., ReLU); and W^(k) is a shared weight matrix for node-wise feature transformation. The GraphSage model applies an identical update rule except that it sets c_vu = |N(v)|.
GAT replaces the normalized convolution operation with an attention mechanism and computes the node embedding h_v^(k+1) of layer k+1 from the layer-k embeddings as follows:

z_v^(k) = W^(k) h_v^(k) (7)

e_vu^(k) = LeakyReLU( a^(k)T [ z_v^(k) ∥ z_u^(k) ] ) (8)

α_vu^(k) = exp(e_vu^(k)) / Σ_{w∈N(v)} exp(e_vw^(k)) (9)

h_v^(k+1) = σ( Σ_{u∈N(v)} α_vu^(k) z_u^(k) ) (10)

Here z_v^(k) is a linear transformation of the lower-layer embedding h_v^(k), and W^(k) is its learnable weight matrix (eq 7). Equation 8 computes the additive attention score e_vu^(k) between two neighbors: ∥ denotes the concatenation of the two nodes' z embeddings, the concatenated embedding is dotted with a learnable weight vector a^(k)T, and LeakyReLU is applied. Equation 9 employs a softmax to normalize the attention scores over each node's incoming edges. Finally, equation 10 is similar to the GCN update, where the embeddings from neighbors are aggregated collectively and scaled by the attention scores.
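Eqs 7-10 can be traced directly in a single-head NumPy sketch. This is an illustrative dense implementation under the paper's notation (edge list of directed (u, v) pairs, weight vector `a` for the attention), not an actual GAT library.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    """Scalar LeakyReLU used in the attention score (eq 8)."""
    return x if x > 0 else slope * x

def gat_layer(X, edges, W, a):
    """Single-head GAT layer: project features (eq 7), score each
    incoming edge (eq 8), softmax-normalize per node (eq 9), and
    take the attention-weighted sum with a ReLU (eq 10).

    edges: list of directed (u, v) pairs meaning u sends to v.
    a: learnable attention vector of length 2 * out_dim.
    """
    Z = X @ W                                   # eq 7: z_v = W h_v
    n = X.shape[0]
    H = np.zeros_like(Z)
    incoming = {v: [u for (u, w) in edges if w == v] for v in range(n)}
    for v in range(n):
        srcs = incoming[v] or [v]               # fall back to self-loop
        e = np.array([leaky_relu(a @ np.concatenate([Z[v], Z[u]]))
                      for u in srcs])           # eq 8
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()                    # eq 9: softmax
        H[v] = np.maximum(
            sum(al * Z[u] for al, u in zip(alpha, srcs)), 0.0)  # eq 10
    return H
```

Multi-head attention repeats this with independent (W, a) pairs and concatenates or averages the resulting embeddings.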

2) Parameter Server
The PS is the principal component of distributed machine learning applications because a single machine alone cannot cope with large-scale data. It is essential when the data is distributed independently across workers, as it maintains the globally shared parameters of the GNN model during distributed training. However, the PS assumes the data is independently distributed among workers, whereas graph data, due to its inter-dependency, cannot be directly partitioned; information-sufficient subgraphs (i.e., k-hop neighborhoods) must therefore be computed beforehand and then distributed among the workers [5].
In GLL, we employ multiple PS machines [35], because the GNN parameters are too numerous for a single parameter server. Hence, we partition the parameters equally among multiple PS machines, as shown in Figure 7, where each PS is responsible only for holding the parameters in its partition. Accordingly, before sending a gradient, each worker partitions its gradient vector and sends the chunks to the affiliated parameter servers. Likewise, workers receive the related chunks of the updated model from the PS machines. This setting is suitable for GNN training on a large graph because a single PS tends to get overloaded.
Under GDLL, the defined GNN model is distributed among workers and servers for training, using stochastic gradient descent (SGD) to train the distributed model. During each iteration, each worker performs model computation and sends model parameter updates to the PS. These are accumulated on the server for the next iteration, and the server maintains the current version of the GNN model parameters. Because each k-hop contains the information necessary to train the GNN model, the workers are independent of one another, which makes GNN training on a graph similar to conventional machine learning model training. Moreover, most k-hops are small subgraphs (larger k-hops, such as those of "hub" nodes, are also sampled down), which enables us to deploy simple personal computers as workers in GNN training. We also provide helper functions for obtaining data, including getter/setter methods for gradients and weights.
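The parameter-partitioning scheme can be sketched as a sharded store where workers push gradient chunks and pull parameter chunks per server. This is a minimal single-process illustration (class and method names are invented); the real PS is a distributed service with network transport.

```python
import numpy as np

class ShardedParameterServer:
    """Sketch of multiple PS machines: the flat parameter vector is
    split evenly into shards, one per server; each server applies
    SGD updates only to its own shard.
    """

    def __init__(self, params, num_servers, lr=0.1):
        self.shards = np.array_split(params, num_servers)
        self.lr = lr

    def push(self, server_id, grad_chunk):
        # worker pushes a gradient chunk; the owning server updates
        # its shard with a plain SGD step
        self.shards[server_id] -= self.lr * grad_chunk

    def pull(self, server_id):
        # worker pulls the current parameters of one shard
        return self.shards[server_id]

    def full_params(self):
        # reassemble the global model (for evaluation/checkpointing)
        return np.concatenate(self.shards)
```

Because each worker only exchanges the chunks it needs with the servers owning them, no single PS machine carries the full parameter or gradient traffic.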

D. GDLL LIBRARY AND SCENARIOS
In this section, we describe the GDLL library. We show APIs for every layer of GDLL (i.e., GDL, GOL, and GLL), as listed in Tables 1, 2, and 3. The flow and interaction between GDLL's layers are elaborated in the sequence diagram shown in Figure 8 for better understanding. For example, one GDL API, given an inputPath of a graph, a number k, and an outputPath, produces k partitions of the edges such that edges sharing the same destination node are located in the same partition; gol.GraphSampler.graphSampler(graph, strategy function object), given a graph and a strategy function name, returns a smaller sample of the original data.
Numerous GNN-based domain-specific applications can be developed using the proposed GDLL framework. To illustrate, we present some real-world domain-specific scenarios. Overall, graph applications can be divided into three main tasks: node classification, link classification, and graph classification. Each of these tasks has countless applications; e.g., in retail applications, products and customers can be viewed as nodes, so recommending products to a customer can be cast as node classification. However, GNNs have their challenges, especially when deployed on large-scale graphs. Thus, the proposed system can be utilized for GNN-based applications by industries ranging from small to mid-sized and even large ones with resource constraints.

V. GDLL RESULTS AND EVALUATION
In this section, we first present the experimental environment and the datasets used. We then evaluate the scalability and performance of the proposed GDLL in terms of accuracy and speed, showing that the time cost per iteration is improved while maintaining accuracy comparable to the state of the art.

A. EXPERIMENTAL SETUP
We set up an in-house cluster of 11 machines running the Hortonworks Data Platform (HDP) [36], as shown in Figure 9. Each node runs Ubuntu 18 with a Core i5 processor (4 cores), 32 GB of RAM, and a 1 TB disk. The nodes are connected with a high-speed switch. The cluster consists of five types of nodes: one Server node, one NameNode, one Graph DB Server, four worker nodes, and four Parameter Servers. The Server node runs the Ambari Server, F-HDFS Index, and F-HDFS Generator instances. The F-HDFS Manager and Spark Server run on the NameNode. The worker nodes run the Spark client, and the Parameter Server instances run on the Parameter Servers. GDLL instances are deployed on the respective nodes. In Figure 9, Client and Data Node services are configured on the worker nodes; Data Node is the HDFS node, whereas the clients are instances of Spark, Yarn [37], and Zookeeper [38]. Further, we set the values of different GDLL parameters as shown in Table 5. We use three datasets of different sizes to evaluate our proposed framework: two small datasets in standalone settings (Cora [39] and PPI [40]) and one big dataset in distributed settings (OGBN-product [41]). Details of these datasets are summarized in Table 6. Cora: The Cora dataset consists of 2708 scientific publications belonging to seven classes, with 5429 connections among them, corresponding to 2708 nodes and 5429 edges.
PPI: The PPI dataset is protein-protein interaction (PPI) graphs. There are 24 independent graphs with 56944 nodes belonging to 121 classes and 818716 edges in total, where each graph expresses different human tissue.
OGBN: The OGBN-product dataset is created by collecting Amazon products. There are 2,449,029 nodes and 61,859,140 edges in total, where nodes represent products, and edges connect two products that are purchased together. Each node belongs to one of 47 classes and is described by 100-dimensional features. Figure 10 shows HDFS performance in terms of read/write time with respect to data size. For smaller data sizes, both read and write operations take very little time, but the time increases as the data size grows beyond the HDFS block size. From the experimental results, we can see that I/O time increases steeply once the data size exceeds the HDFS block size, which is 128 MB in our case. It is worth mentioning that these experiments were conducted for a single read/write operation rather than a batch of operations. One bottleneck of HDFS is the number of simultaneous I/O operations it can handle, which is why we designed F-HDFS. Figure 11 shows F-HDFS performance in terms of read time with respect to the number of subgraphs and Flocks. The results show that the read time increases linearly with the number of individual subgraphs, whereas it remains significantly small in the case of Flocks.

D. GOL PERFORMANCE
The k-hop computation has been evaluated both on a single system and in a distributed environment. Figure 12 shows k-hop performance on a standalone machine on the Cora and PPI datasets, with k ranging from 1 to 5. The figure makes clear that the Spark-based k-hop computation performs significantly better than Hadoop. AGL utilizes Hadoop MapReduce, whereas our proposed GDLL is based on Apache Spark. This experiment shows that this performance improvement also carries over to the overall performance of the proposed GDLL in a distributed environment.

E. DISTRIBUTED GNN TRAINING
In this section, we show that our proposed GDLL is significantly more efficient than existing frameworks such as AGL while maintaining similar effectiveness in terms of GNN convergence. We demonstrate this by evaluating GNN training in both standalone and distributed settings. In standalone mode, we performed experiments on two public datasets (Cora and PPI), as shown in Table 7. We achieved comparable accuracy on Cora by training GCN and GAT in a standalone environment with a configuration similar to that used in AGL. Correspondingly, we achieved a similar micro-F1 on the multi-label PPI dataset using GraphSage and GAT. This shows that the proposed GDLL framework matches the convergence of existing graph learning frameworks such as PyG, DGL, AliGraph, and AGL. To demonstrate the scalability of GDLL, we also experimented on OGBN-product for distributed GNN training and achieved similar effectiveness, as shown in Table 7.
To show the efficiency of the proposed GDLL framework in terms of speed, we report per-epoch time in the standalone environment on Cora and PPI, and in distributed mode on OGBN-product. For a fair comparison, we kept a configuration similar to AGL's. Table 8 shows the efficiency in standalone mode, where GDLL's superior training speed can be seen. In distributed mode, we employed our in-house setup of 4 workers and 4 servers in the parameter server for OGBN-product, evaluating AGL and then our proposed GDLL. Figure 13 shows the per-batch time (similar to AGL, i.e., sampling 20 neighbors for a given node) in distributed mode in comparison to AGL. For a fair comparison, we also expanded the original 100-dimensional features to 1000 dimensions by appending zeroes. Figure 13 demonstrates GDLL's superior efficiency in comparison to the existing state-of-the-art framework, AGL. With similar effectiveness, GDLL demonstrates superior efficiency in terms of speed in both standalone and distributed modes. The optimization of the GDL and GOL layers in the context of GNN training played a significant role in the overall efficiency of the proposed framework.