Boosting Graph Embedding on a Single GPU

Graphs are ubiquitous, and they can model unique characteristics and complex relations of real-life systems. Although using machine learning (ML) on graphs is promising, their raw representation is not suitable for ML algorithms. Graph embedding represents each node of a graph as a d-dimensional vector which is more suitable for ML tasks. However, the embedding process is expensive, and CPU-based tools do not scale to real-world graphs. In this work, we present GOSH, a GPU-based tool for embedding large-scale graphs with minimum hardware constraints. GOSH employs a novel graph coarsening algorithm to enhance the impact of updates and minimize the work for embedding. It also incorporates a decomposition schema that enables any arbitrarily large graph to be embedded with a single GPU. As a result, GOSH sets a new state-of-the-art in link prediction both in accuracy and speed, and delivers high-quality embeddings for node classification at a fraction of the time compared to the state-of-the-art. For instance, it can embed a graph with over 65 million vertices and 1.8 billion edges in less than 30 minutes on a single GPU.


INTRODUCTION
G RAPHS are widely adopted structures to model unique characteristics and complex relations among objects in a network, e.g., a social or citation network, or networks of financial transactions. This modeling potential can be leveraged by machine learning (ML) tasks to predict or recommend new or missing links, classify nodes into different groups, and detect fraudulent users or connections. However, the sparse, raw graph connectivity does not readily lend itself to be used in ML tasks. Instead, a regular, d-dimensional representation is more appropriate for learning. Graph embedding is the process of transforming a graph from its raw form to a latent d-dimensional representation, and it has received a surge of interest in the literature.
The popularity of graph embedding made way for several techniques in the literature [1], [2], [3], [4]. However, most of these approaches do not scale to large-scale graphs. Even for small-and medium-scale graphs, VERSE [1], DEEP-WALK [3], node2vec [4], and LINE [2] require hours of training. For instance, VERSE takes over three hours for a graph containing 4M vertices and 34M edges on 16 CPU cores. Various techniques have been proposed in the literature to make embedding faster. HARP [5] and MILE [6] proposed to shrink the graphs into smaller ones and applied a multilevel training procedure. However, due to the overhead of their heavy-weight coarsening, these approaches still fall short for large graphs. By using accelerators, such as GPUs, one can design and implement scalable graph embedding tools. To the best of our knowledge, the most successful GPU-based attempt is introduced by GV [7].
In this paper, we present GOSH 1 , a CPU-GPU parallel, fast, graph embedding algorithm for memory-restricted devices. GOSH applies a novel, parallel graph-coarsening that mitigates forming giant vertex sets. The coarsening algorithm iteratively shrinks the graph into multiple levels. Then a multi-level, unsupervised training is performed. The training starts at the coarsest level, and it continues with upper levels until the training on the original graph is completed, and the final embedding is obtained. GOSH is designed to handle large-scale graphs with a single GPU by utilizing a judiciously devised partitioning and scheduling algorithm. The contribution of this paper is threefold: • When the embedding does not fit into the GPU, state-ofthe-art tools usually require multiple GPUs. On the contrary, GOSH applies a judicious partitioning and is faster than the multi-GPU tools thanks to its efficient, smart task scheduling, and its multi-level coarsening. Compared to its CPU-based counterpart, on graph com-orkut, it is approximately three orders of magnitude faster while being more accurate. On the same graph, it is 36× faster and scores 0.44% higher AUCROC for link prediction compared to state-of-the-art. For node classification, it has comparable accuracy while being 2.7× faster. • In the literature, coarsening has been applied by MILE [6] and HARP [5] to improve both the embedding performance and accuracy. Although the coarsening time is negligible for CPU-based embedding, this is not the case for GPUs. In this work, we propose a parallel and efficient coarsening algorithm that accelerates the embedding by 73× and improves the AUCROC by 2% on com-orkut. • For different embedding dimensions, the best GPUimplementation, that utilizes the device better, also differs. GOSH performs different strategies to further increase the performance, especially for small d values.
The paper is organized as follows: In Section 2, the notation and the essential background information are given. Section 3 describes GOSH in detail including the techniques applied for link prediction and node classification. Section 4 describes the parallel, multi-level coarsening algorithm, and Section 5 describes our design and implementation proposed to handle large graphs. The experimental results are presented in Section 6, and the related work is summarized in Section 7. Section 8 concludes the paper.

NOTATION AND BACKGROUND
A graph is a pair G = (V, E) where V is the set of vertices and E ⊆ (V × V ) is the set of edges among them. Edges in undirected graphs are unordered vertex pairs, while in directed graphs, the order is significant. A d-dimensional embedding of G is a |V | × d matrix M. The vector M[v] is the embedding of vertex v ∈ V , and the value M[v] [j] where 0 ≤ j < d captures some information about v. Having a regular structure, the matrix M can be used in many ML tasks.
There are many algorithms that embed a graph into a ddimensional space. We model GOSH after VERSE [1], that is characterized by its small memory overhead and fast runtime, in addition to its ability to produce embeddings that reflect any vertex-to-vertex similarity measure Q. VERSE employs two distributions for each vertex v ∈ V over all vertices. The first, sim v Q , is obtained from the empirical similarity between v and V \ {v} based on Q. The second, sim v E , is derived from the embedding matrix M using the cosine similarities between v's embedding vector and those of u ∈ V \ {v}. VERSE's objective function is minimizing the Kullback-Leibler (KL) divergence between these two distributions. In this paper, we choose Q to be the adjacency similarity measure [1] that is defined as Q(u, v) which is equal to 1 for (u, v) ∈ E, and 0, otherwise.
Training of the above objective, which employs the Noise Contrastive Estimation [1], is carried out by training a binary classifier to separate positive samples that are drawn from sim Q and negative samples that are drawn from a noise distribution N , where the parameters of this classifier are the corresponding embedding vectors. More precisely, for a total of e training epochs, we iterate through all v ∈ V and process each v by drawing a single positive sample u from sim v Q , and n s negative samples s 1 , s 2 , · · · , s ns from N . Then, we minimize the negative log-likelihood of observing the positive sample and not observing the negative ones using a logistic regression model that updates the embedding vectors of v and all the sampled vertices. Algorithm 1 shows a single update of the source vertex v and a (positive or negative) sample vertex u. In the algorithm, b is a binary value that is equal to 1 if the sample is positive, and 0 if it is negative, is the dot-product operation, σ is the sigmoid function, and lr is the classifier's learning rate. Table 1 summarizes the notation used in the paper.

GOSH: GRAPH EMBEDDING ON A GPU
To describe GOSH, we begin with a high-level explanation of the tool, followed by the GPU acceleration details in Section 3.1. Section 3.2 explains the random-walk-based version of GOSH to perform embedding for node classification. The coarsening algorithm and the techniques applied to handle large graphs are described in Sections 4 and 5.

Symbol
Definition G 0 = (V 0 , E 0 ) The original graph to be embedded.
Represents a graph, which is coarsened i times.
The set of mappings used in coarsening.
# embedding parts to be placed on the GPU. S GP U # sample pools to be placed on the GPU. B # positive samples per vertex in a sample pool. K i,j embedding kernel for sub-graphs i and j.
Algorithm 2, the main body of GOSH, takes a graph G 0 as input and computes the embedding M 0 . The embedding is performed in two stages: 1) creating a set G = {G 0 , G 1 , . . . , G D−1 } of graphs coarsened iteratively starting from G 0 as in the left of Fig. 1.a, where one or more nodes in G i−1 are uniquely represented by a super node in G i (Line 1 of Alg. 2), 2) starting from the coarsest graph, G D−1 , training the embedding M i of graph G i and projecting it to M i−1 , to later train it as shown in Fig. 1.a (Lines 3-11 of Alg. 2). The algorithm stops once M 0 is obtained. We obtain M i−1 from M i by using the mapping information for

Algorithm 2: GOSH
Data: G 0 , n s , lr, lr d , p, e, threshold, P GP U , S GP U , B Result: GOSH is able to accelerate the embedding of large-scale graphs whose memory footprint exceeds the capabilities of contemporary GPUs. Real-world graphs can have hundreds of millions of vertices; for |V | = 128M , the embedding matrix M with d = 128 takes 64GB in single precision and 128GB in double precision. For each G i ∈ G, GOSH checks whether the GPU can store both G i and M i (Line 5). If this is the case, they are both copied to the GPU, and the training proceeds in a single step (Lines 6-7). Otherwise, the large-graph embedding schema described in Section 5 is Fig. 1: An overview of GOSH: (a) The input graph, G0, is iteratively coarsened into smaller graphs until an exit condition is met. Then, starting from the smallest, GD−1, each graph Gi is embedded. (b) To embed Gi, the embedding Mi+1 is projected onto Gi to initialize Mi which is then fine tuned. If Gi and Mi fits to the GPU memory GOSH carries out Sequential-Kernel Execution (Section 3.1). Otherwise, it carries out Directed Acyclic Graph Execution (Section 5). (c) In Sequential-Kernel Execution, CPU dispatches kernels to GPU. Meanwhile, GPU executes the kernels sequentially. If random-walk sampling is used CPU generates positive samples and sends them to GPU. On the other hand, in DAG execution, the CPU continuously generates positive samples, sends them to GPU, and exchanges submatrices with it. Meanwhile, the GPU concurrently performs embedding. used, in which M i is partitioned into sub-parts to be copied and processed on the GPU. In this schema, the samples are generated on the host and dynamically sent to the GPU.
Multilevel coarsening introduces a peculiar work distribution problem. Let e be the total number of training epochs performed on all levels; one can simply evenly distribute them. However, if more epochs are reserved for the coarser levels, the overall embedding becomes faster. Besides, training on coarser levels significantly impacts the final embedding since the updates are projected to many finer-level vertices. On the other hand, more epochs for the higher levels potentially make the final embedding more fine-tuned. Therefore, one must distribute the epoch budget cleverly. GOSH employs a mixed strategy; we distribute a portion p of the epochs uniformly across all levels and distribute the remaining (1−p) portion of epochs geometrically. More formally, the ith level training uses e i = e/D + e i epochs where e i is half of e i+1 . We leave the value p, dubbed the smoothing ratio, as a configurable parameter to allow an interplay between accuracy and performance. The learning rate is another important parameter that has a significant impact on embedding quality. GOSH uses the same initial learning rate lr for all levels but decreases it linearly in each epoch. More formally, at training level i, in epoch j, the learning rate is equal to lr × max 1 − j ei , 10 −4 .

Graph embedding in GOSH
GOSH implements a GPU-accelerated, lock-free training step, and similar to VERSE, it uses SGD-based optimization. As shown in [8], a lock-free SGD implementation which does not take race conditions, i.e., simultaneous updates, into account does not negatively impact the learning quality on multi-core CPUs significantly. However, on a GPU with millions of parallel threads, especially with coarsening, we observed that race conditions deteriorate the quality. To mitigate the impact of race conditions, we serialize the training epochs, i.e., no two training epochs are executed in parallel. Section 6.3 describes this effect in more detail.
In an epoch for M i , GOSH traverses V i and assigns a single source vertex to a single GPU warp. For each source vertex, multiple positive and negative samples are processed sequentially by the warp assigned to that vertex, and updates are performed as in Algorithm 1. With this vertex-towarp mapping, no vertex v ∈ V i can be a source vertex for two concurrent updates, but v may be (positively or negatively) sampled by one or more vertices. Furthermore, its assigned warp can also be active. Therefore, reads and writes on M i are not completely race free. However, our experiments show that synchronizing epochs in this way is sufficient to perform the parallel embedding process robustly.
The positive and negative samples for source vertices are generated on the GPU as shown in Algorithm 3. A positive sample for source v ∈ V i , is a vertex u ∈ V i s.t. u ∈ Γ Gi (v). As mentioned in Section 2, the negative samples are drawn from a noise distribution, which we model as a uniform random distribution over V i similar to [1]. When a warp carries out the updates for a source vertex src, its threads perform (1 + n s ) × d accesses to M i [src]. Therefore, larger values of n s and d increase global memory accesses and deteriorates performance dramatically. To counter this issue, before src is processed, M i [src] is copied from the global memory of the GPU to the shared memory of a streaming multiprocessor. This way, for all consecutive positive and negative updates, reads and writes to M i [src] are performed on the shared memory. After all the updates are executed, M i [src] is copied back to global memory. For a sampled vertex u, on the other hand, M i [u] is always kept in global memory since it is only read and written once. Accesses to M i [u] for a vertex u are coalesced, where a warp's threads perform the accesses in a round-robin fashion. More precisely, M i [u][j + (32 × k)] is accessed by thread j on its kth access where 32 is the number of threads within a warp.

Embedding for small dimensions
When embedding with d ≤ 16 dimensions and with 32 threads in a warp, a single source vertex does not occupy all the threads in a warp. When d is small, since 32 − d threads in each warp remain idle, the device will be underutilized. For this reason, GOSH employs a specialized vertex-warp distribution for small d. Depending on the dimension, we assign 2 or 4 source vertices to a single warp, and hence, 16 or 8 threads, respectively, are assigned to a single source.

Random-walk-based sampling for classification
Our preliminary experiments show that VERSE-based sampling can not compete with the state-of-the-art on node classification. Instead, we employ random-walk-based sampling, which is a common technique in embedding literature, and implement GOSHW which samples from random walks on the host machine and sends them to the GPU in an online fashion. To sample positive edges for node classification, we start a random walk of length and sample every pair of vertices within a distance of g.
We opted to carry out the random walks and sampling on the CPU for three reasons. First, GOSH now utilizes CPU and GPU at the same time. Furthermore, a single walk of length and sampling distance g require storing a minimum of g elements. By sampling on the host, we do not impose the burden on GPU memory, which is already the main bottleneck. Second, we want to exploit the relative efficiency of CPUs over GPUs on irregular memory accesses, which happen due to the nature of random walks and the graph sparsity. Last, Zhu et al. show that shuffling the samples before using them produces better results in downstream ML tasks [7]. Acquiring the samples from walks on the host allows them to be shuffled more easily.

COARSENING GRAPHS FOR FAST EMBEDDING
GOSH coarsens a graph G i = (V i , E i ) while retaining the structure of the coarse graph and maximizing coarsening efficiency and effectiveness. We measure the ith level efficiency by the rate of shrinking, On the other hand, the effectiveness is measured in terms of embedding quality. Our approach, MULTIEDGECOLLAPSE, is agglomerative and generates the clusters similar to [5]. The vertices in V i are processed one by one. When processing v ∈ V i , if it is not yet marked it is marked and mapped to a cluster, i.e., a new super vertex in V i+1 , and its edges are processed. If an edge (u, v) ∈ E i with an unmarked u exists, u is added to v's cluster. Finally, all the vertices in v's cluster are shrunk into a super vertex v sup ∈ G i+1 . MULTIEDGECOLLAPSE can preserve both the firstand second-order proximities of vertices as defined in [2]. The former proximity measures the pairwise connection between two vertices, while the latter measures the similarity between vertices' neighborhoods. It accomplishes that by collapsing vertices belonging to the same neighborhood around a local hub vertex. However, we observed that a naive implementation usually merges two giant, hub vertices. This degrades the effectiveness and efficiency of the coarsening. The effectiveness degrades since the coarsened graphs are not structurally similar in lower coarsening levels because a small number of giant super vertices represent almost all the vertices on the finer graphs. Having a small set of giant super vertices also inhibits coarsening the graph further, degrading the efficiency. To delay and if possible, avoid such cases, we introduce a new merging condition, |Vi| . This way, two hub vertices whose degrees are greater than the density of G i will no longer be in the same cluster. Our experiments show that this simple rule significantly improves both coarsening effectiveness and efficiency.
When a vertex is marked and added to a cluster, its neighbors are not processed and do not contribute to the coarsening. If vertices are processed in an arbitrary order, the efficiency may degrade since vertices with small neighborhoods can lock large vertices. Hence, when an edge (u, v) ∈ E i is used for coarsening via a hub-vertex v ∈ V i , we prefer u to be inserted into v's cluster to maximize efficiency. We achieve this by a new heuristic, dubbed ordering, resulting in substantial improvements in coarsening efficiency. With ordering, the vertices are ordered to process the vertices with larger neighborhoods earlier.
The coarsening process is shown in Alg. 4. The algorithm takes an input graph G = G 0 and returns a set G of graphs and the mapping information that will be used to project the embedding matrices M. We initialize G and M as {G 0 } and ∅, respectively. Coarsening starts with i = 0 and terminates either when a graph G i+1 with a vertex count that is less than threshold is obtained or |V i+1 | is larger than 80% of |V i |. We observed that 80% works well for scale-free graphs, however it is experimentally tuned and left as a parameter. Additionally, we store the mapping information map i which is used to shrink G i to G i+1 , and later use it to project the embedding matrix M i+1 and initialize M i . GOSH uses threshold = 100 for all the experiments.

Complexity analysis
GOSH employs the Compressed Storage by Rows (CSR) graph data structure which stores the vertex neighborhoods in an adjacent manner. There are three stages in MULTIEDGECOLLAPSE; sorting (Line 3), mapping (Lines 7-14) and coarsening (Line 15). The sorting stage uses counting sort whose time complexity is O(|V | + |E|). For mapping, the algorithm must traverse all the edges, leading to a time complexity of O(|V | + |E|). Finally, coarsening the graph requires the vertices to be sorted with respect to their mappings and then going through all the vertices and their edges within the CSR, incurring a time complexity of O(|V |+|E|).

Parallelization of GOSH coarsening
On a CPU, the embedding process dominates the total time. However, this is not the case when a fast, GPU-based embedding is employed, such as the one in GOSH. Then the coarsening overhead can no longer be neglected. Hence, a parallel coarsening is a must for GOSH.
For coarsening, traversing V i in parallel and mapping with no synchronization can yield inconsistencies. We avoid the race conditions by using a single lock per each map i . While updating map i [v] and map i [u] as in Lines 9-14 of Alg. 4, a coarsening thread tries to lock both map i [v] and map i [u]. If the locks are acquired, the thread continues. Otherwise, it skips the current candidate and continues with the next vertex. Besides, to avoid race conditions on the counter cluster, the algorithm uses the hub-vertex id for mapping. More precisely, map i [v] is set to v as opposed to Line 9 in the sequential algorithm. It should be noted that with the parallel implementation, map i does not provide a mapping to vertex IDs in G i+1 . This can be corrected in O(|V |) time through a sequential traversal of map i , which first detects/counts the vertices that have map i [v] = v and resets the values of map i for all the vertices.
Constructing the coarsened graph in parallel is not easy since after the mapping is generated, we do not yet know the degrees of the (super) vertices in G i+1 . To mitigate this issue, a private region of memory E j i+1 is allocated for each thread t j where 1 ≤ j ≤ τ . Each thread creates edge lists of the new vertices in these private regions, which are merged on a different location with size |E i+1 |. This happens via a sequential scan to find the region in E i+1 for each thread's private information, and copying that private information to E i+1 . An important issue with this approach is load imbalance. A static vertex-to-thread assignment can reduce embedding performance since the degree distribution of the original graph can be skewed and more skewed for the coarsened graphs. For this reason, a dynamic scheduling strategy that uses small batch sizes is used in all the steps above.

EMBEDDING LARGE GRAPHS ON A GPU
GOSH can bypass the memory limitations of the GPU. Unlike other GPU-based approaches such as [7], [9], whose hardware requirements increase with the memory requirements, GOSH requires only a single GPU to accelerate the embedding of any arbitrarily large graph. The embedding algorithm adopted in this work so far requires storing the embedding matrix M as well as the graph on the GPU, whose memory complexities are O(d × |V |) and O(|V | + |E|), respectively. For a graph with 10 9 vertices and d = 10 2 , this is more than any modern GPU can provide. Here we describe a novel partitioning schema and DAG execution model, which can be used when the memory requirement of graph embedding exceeds the GPU memory size.
Instead of storing M as a whole on the GPU, we employ a partitioning schema as in [7], [9], [10] which partitions the vertex set These submatrices are copied in and out of the GPU to be embedded. As shown in Alg. 3, a single update accesses no more than two embedding vectors. Hence, the correctness of an embedding requires that for every edge (u, v) ∈ E, the embedding vectors of the vertices u, v ∈ V must be simultaneously on the GPU at some time point. For that reason, GOSH splits the embedding procedure into rounds where each round is processed by an embedding kernel performs up to B positive updates, and for each positive update, s negative updates. The positive samples are generated on the host in parallel and sent to the GPU as they are needed, as opposed to the base algorithm, in which they are generated on the GPU. This removes the need to store the graph on the GPU, and concurrently utilizes the host's resources. The negative samples are generated on the GPU by sampling randomly from the opposite embedding matrix. This execution schema implies that in a single embedding round, at most B × K positive samples are executed per vertex. For a vertex v ∈ V i , no updates will be carried out by a K i,j or K j,i if no neighbor of v appears in V j . In this schema, we run r = e B rounds and use an atomic global counter to keep the number of samples from exceeding e × |V |.

Kernel execution order
The order in which kernels are executed in an embedding has a direct impact on the embedding efficiency. Although it does not change the amount of updates performed, the order directly affects the number of submatrix transfers happening between the host and the GPU. We use the insideout ordering [10] as it minimizes the number of submatrix copies to and from the GPU and produces high quality embeddings. We define the part pairs on which the kernels operate as A toy example of a kernel order for a graph with six parts is shown in Fig. 2a.

Number of submatrix and sample pool bins
While partitioning the embedding matrix, there are two important hyper-parameters; (1) the number of submatrices P GP U , and (2) the number of sample pools S GP U simultaneously stored on the GPU. For correctness, we require that P GP U ≥ 2. Although fewer submatrices are easier to handle, more would allow more overlap between memory transfers and computation. For example, assume that M 0 , M 1 and M 5 are currently on the GPU, and the three upcoming kernels are K 5,0 , K 5,1 , and K 5,2 . After K 5,0 is done, while K 5,1 is running, M 0 can be switched out and replaced with M 2 . Increasing P GP U also increases K and decreases the number of vertices in a single vertex set. Hence, more memory copies would be required in a single round. We empirically find that P GP U = 3 provides the best performance. As mentioned before, positive samples are generated on the host and sent to the GPU as they are needed. We allocate S GP U sample pool bins on the GPU such that each pool can store the positive samples used by a single kernel K i,j . Having multiple pools on the GPU allows further concurrency between kernels. However, it also incurs a memory overhead reducing the available memory size for embedding submatrices.We find that S GP U = 2 exposes enough concurrency without increasing K too much.

Directed acyclic graph execution model
Embedding a graph using the proposed partitioning framework involves various parallel task executions and a myriad of dependencies among them. The correctness of the algorithm can be maintained using global synchronization, but this can potentially degrade the performance. Instead, we adopt a Directed Acyclic Graph (DAG) execution model to resolve dependencies and further optimize the algorithm. In this model, a task must only synchronize with its incoming neighbors. We first define three task types running on the host that encompass the embedding: 1) Submatrix-swap task (M ST k i,j ): dispatches a GPU-tohost transfer of submatrix M i out of the kth submatrix bin, where k < P GP U . It also dispatches a host-to GPU transfer of M j to the same bin.
2) Sample-pool-copy task (SCT i,j ): dispatches a memory copy of the sample pool S i,j to the GPU. The exact bin on the GPU is resolved at runtime. 3) Kernel-execution task (KT i,j ): dispatches a kernel K i,j to the GPU. We construct an embedding DAG by associating these tasks and modeling the dependencies among them as directed edges. We associate an execution queue with the DAG to hold the ready-to-be-executed nodes whose incoming neighbors are completed. This queue uses a team of threads to execute the tasks in parallel.
When a task is executed, it asynchronously dispatches the associated GPU job and terminates without waiting for its execution. Such a loose coupling reduces the GPUhost synchronization overhead, reduces the host workload, and saves more resources for the far more expensive sampling. We used native GPU constructs, cudaStreams and cudaEvents, to realize the DAG dependencies on the GPU. cudaStreams are used to create sequences of synchronous GPU jobs. Since the DAG cannot be partitioned into linear sequences/streams, we set inter-stream dependencies via cudaEvents among the jobs on different streams.
GOSH allocates a cudaStream for every sample pooland submatrix-bin on the GPU since copy operations on the same bin require a natural sequential dependency. Also, it allocates two cudaStreams for embedding kernels to exploit their potential overlap. Each task is associated with a cudaEvent that is triggered at the end of the dispatched GPU job. When a task dispatches a GPU job, it also instructs it to wait for the cudaEvents of its incoming neighbors' GPU jobs, and this makes the dependencies of the GPU completely isolated from the host's. The only exceptions to this are the dependencies among kernel-execution tasks; we want the embedding kernels to overlap whenever possible.
Overall, the task dependencies are summarized as follows: the task KT i,j depends on the tasks which bring its sample pool and submatrices to the GPU, i.e., SCT i,j , M ST * * ,i , and M ST * * ,j . The tasks KT i,j and SCT i,j depend on KT m,n and SCT m,n , respectively, where (m, n) is the pair preceding (i, j) in the kernel order (the dependency between KT m,n and KT i,j does not involve cudaEvents as described before). Finally, the task M k i,j depends on all KT s which use the submatrix i, stored in submatrix bin k. A sample DAG is shown in Fig. 2b.
Unlike submatrix copy tasks, sample pool tasks do not know the exact copy bin on the GPU. This is a deliberate design choice made since the processing times of sample pools can vary, and dynamically allocating the GPU pool bins to the first available sample-pool-copy task improves the utilization of bins. To implement this, we leverage mutex-protected host variables that are used by the samplepool-copy and kernel-execution tasks to communicate.

EXPERIMENTAL RESULTS
The embedding quality of GOSH, VERSE, MILE, PBG, and GV are evaluated with link prediction and node classification tasks, which are widely used for evaluation in the literature [1], [4], [7], [10].

Details of the experiments
The details, e.g., datasets and their use for ML tasks, hardware, and the baseline tools are given below.

Hardware specifications:
We run our experiments on three servers. The first, Arch-1, is equipped with an Nvidia V100 GPU with 16GB memory, 348GB RAM, and 2 Intel Xeon Gold 6148 processors, each with 20 cores running at 2.40GHz. The second, Arch-2, uses a single Titan X Pascal GPU with 12GB of global memory, 198GB RAM, and two Intel Xeon E5-2620 v4 processors, each with 8 CPU cores running at 2.10 GHz. Lastly, Arch-3 has 4 Nvidia A100 GPUs with 80GB of memory, 64 cores on an AMD 7742 CPU that runs on 2.24GHz, and 512GB of memory.
The servers have CentOS 7.3.1611 (Core) and Ubuntu 4.4.0-159, and CentOs 7.9 as operating systems, respectively. All the CPU codes are compiled with gcc 7.0.1 with -O3 as the optimization parameter. For CPU parallelization, OpenMP multithreading is used in general. The GPU codes are compiled with nvcc with CUDA 10.1 and -O3 optimization flag. The GPU is connected to the server via PCIe 3.0 x16. For GPU implementations, all the relevant data structures are stored on the GPU memory. Datasets used for the experiments: We use six mediumscale and three large-scale graphs to fairly and thoroughly evaluate the tools. We choose graphs that differ in terms of their origin, the number of vertices, and density. We classify the graphs that do not fit the GPU memory as large-scale and display them separately in Table 2. Data preparation for ML tasks: For link prediction, we split the input G into two; G train = (V train , E train ) with 80% of the edges, and G test = (V test , E test ) with the remaining 20%. In order to ensure V test ⊆ V train , we remove all the isolated vertices in G train , and also all the edges (u, v) ∈ G test where u / ∈ G train or v / ∈ G train . We generate the embedding matrices using G train employing the tools mentioned above. Then we utilize a Logistic Regression model to predict the existence of the edges in G test . To train the model, we generate |E train | amount of positive and negative samples. The positive samples are directly taken from E train , and the negative samples are randomly selected from (V train × V train ) \ E train . A sample is generated by doing an element-wise multiplication of M 0 [u] and M 0 [v], where (u, v) is selected as a training sample. The same procedure is used to generate the test samples. Finally, we report the AUCROC score obtained from the test set [15]. For medium-scale graphs, we use LogisticRegression from scikit-learn. However, logistic regression becomes too expensive for large-scale graphs. Thus, for such graphs, we use SGDClassifier from scikit-learn with a logistic regression solver.
For node classification, we train a one-vs-rest multi-label logistic regression classifier from scikit-learn. We limit our evaluation scope to the top 100 labels on all graphs and report Macro-, and Micro-F1 scores using 10% labeled data. Since the label information is hidden from the embedding process, the entire G is used for embedding. The embedding vector of a vertex v, M 0 [v], and its corresponding labels are directly used as samples without any preprocessing. State-of-the-art tools used for evaluation: The following state-of-the-art tools are used as the baselines for GOSH.
• VERSE is a CPU parallel graph embedding tool [1] that contains three different vertex-similarity measures, including adjacency similarity and personalized page rank (PPR). For the experiments, we use PPR with an α = 0.85, and 0.0025 as the learning rate. For link prediction, we run with e = 600, 1000, and 1400 and report the best AUCROC. For node classification, we use e = 1000. • GV [7] (GraphVite) is a multi-GPU embedding tool that matches the embedding quality of the CPU based tools while being much faster. We created two settings for the tool used on medium scaled graphs, fast and slow, with e = 600 and 1000, respectively. For large-scale graphs, we use e = 100.  We report four different configurations of GOSH; NoCoarse, fast, normal, and slow. We report the details of these configurations in Table 3. From slow to fast, we decrease the smoothing ratio and the number of epochs and increase the learning rate for compensation. We report the NoCoarse version, which does not apply coarsening to understand the impact of coarsening explicitly. Since the epoch definitions of the respective tools are not aligned, we define a single epoch as performing |E| updates, given by GV [7], and adjust epochs accordingly for fairness.

Experiments on coarsening performance
Recently, we showed that the coarsening approach used has a significant impact on the embedding performance, both in terms of accuracy and runtime. For completeness, here we also briefly analyze the coarsening phase. For a detailed evaluation, we refer the reader to [16]. We first use the coarsening algorithm naively with no heuristics, then only with ordering, and lastly with both ordering and hub 2 -restriction and report the results in Table 4. We then investigate the impact of coarsening depth on embedding and report the results in Table 5.
The naive variant displays a poor performance regarding both coarsening efficiency and runtime. Although we observe a slight improvement in coarsening efficiency with ordering, the embedding quality degrades substantially, i.e., from 0.964 to 0.921 for com-friendster. We believe sorting amplifies an already existing problem; as explained in Section 4, the algorithm ends up with a couple of huge hub-vertices that prevent further coarsening. With hub 2 -restriction, this problem is mitigated; the largest graph in our dataset, com-friendster, is reduced from 63M vertices to 823 vertices in only 10 iterations. Furthermore, the heuristics positively affect the embedding quality; we observe a 1.5% increase in AUCROC on average. These show that a carefully designed coarsening algorithm is critical for both speed and accuracy. As described in Section 3, when D, the number of coarsening levels increases, the amount of work reserved for higher levels decreases. Naturally, with larger D, GOSH gets faster. One may expect that this embodies a trade-off between speed and accuracy. On the contrary, as shown in Table 5, the AUCROC scores tend to improve. In fact, for all of the graphs, we observe a consistent increase (by an average of 0.7%) in AUCROC as D increases from D = 3 to D = 7. This phenomenon shows that the updates on the coarser levels have a significant positive impact on embedding quality. We believe, with the cumulative effect of the updates on the deeper levels, coarsening is able to guide the embedding algorithm to find better local minimas. However, there is no strict relationship between AUCROCs and D. Indeed, with many levels, most of the epochs will be spent on similar graphs due to the exponential nature of GOSH's epoch distribution. We found that around 10 levels, where each level is at least 80% smaller than its predecessor, yield a good performance.  Table 6 on graph com-orkut that has 3 million vertices and 100 million edges. GOSH's parallel coarsening uses 16 threads, yet MILE's coarsening is not parallel. Since MILE does not have a stopping criterion for coarsening, we use 8 levels of coarsening for both algorithms. GOSH is more efficient than MILE regarding the number of vertices obtained in each level while being 264 times faster. For instance, GOSH shrinks the graph into 230 vertices, where MILE obtains 12,062 vertices. As our experiments show, a high-performance, parallel coarsening with a high efficiency and effectiveness is particularly important.

Experiments on epoch synchronization
As mentioned in Section 3.1, GOSH carries out the updates of a single training epoch completely in parallel. Yet, it serializes the training epochs to avoid further race conditions. Figure 3 shows the effect of increasing epoch parallelism on the AUCROC score of link prediction on youtube.
Each point on the x-axis represents the number of epochs executed in parallel on the GPU without synchronization, with 1 being the current implementation with sequential epochs, and 1000 being the other extreme, where all training epochs are executed with no synchronization. Fig. 3 shows different coarsening granularities with different lines, where D = 1 implies no coarsening, and D = 8 is the maximum number of coarsening levels. Although epoch synchronization has little to no effect with a small number of coarsening steps, its benefit heaves in sight as the number of levels increases and graphs become smaller in lower levels; a small number of vertices implies higher probability for potential race conditions. These results clearly demonstrate the importance of synchronizing epochs -especially in a multi-level setting. They also show that multilevel, GPU-based embedding tools may be improved in terms of runtime for large graphs by allowing a number of concurrent epochs on the upper levels but forcing full synchronization elsewhere. We leave this as a future study.

Experiments on medium-scale graphs
Tables 7, 8, and 9 provide the runtimes, AUCROC (for link prediction), and Micro-, and Macro-F1 scores (for node classification), for the tools evaluated on medium-scale graphs.   In addition to GOSH (for link prediction) and GOSHW (for node classification), we use GV, MILE, PBG, and VERSE.
To evaluate the runtime performance on link prediction, we use VERSE as the baseline and present the speedups of each tool in Table 7. The results show that GOSH-fast is an ultra-fast solution that produces very accurate embeddings at a fraction of the time compared to all the systems under evaluation. It can achieve speedups over VERSE of nearly three orders of magnitude while sacrificing only an average of 0.28% in AUCROC. When compared to MILE, it is superior in terms of AUCROC in all four graphs while being at least two orders of magnitude faster. Although PBG is on par with GOSH-fast on link prediction quality, it is 40× slower on average. Finally, when compared to GV, we find that GOSH-normal is two orders of magnitude faster and scores 0.06% higher AUCROC on average, setting the new state-of-the-art in link prediction. Relative to link prediction, node classification is a more challenging problem. For both GOSHW and GV, node classification requires more work to converge, where fast configurations are not as successful as they are for link prediction. The adjacency similarity measure, which respects only the first-order proximity, is sufficient for link prediction; however, a more sophisticated measure is a must for classification. For this purpose, we experimented with VERSE using the personalized page rank (PPR) similarity measure, and report the results in Tables 8 and 9. Overall, VERSE performs poorly with respect to GV and GOSHW, which shows that PPR similarity measure also falls short for node classification. We found that the best option for node classification is random walks [3]. The biggest difference between random walks and PPR is related to the number of samples collected from a single walk. Compared to a single sample per walk by PPR, with [3], tens of samples are collected from a single walk depending on the walk distance and sampling window size. These results highlight the importance of coherent sampling.
We compare GOSHW with the state-of-the-art, GV and PBG, using the best Micro-and Macro-F1 scores for both tools. As Table 8 shows, GOSHW-slow achieves an average improvement of 1.88%-4.38% and 5.63%-6.59% in Microand Macro-F1 scores while achieving an average speed-up of 2.7× and 9.3× with respect to GV and PBG. Although GOSHW-fast obtains significant speedups over GV and PBG, it falls short on node classification accuracy on all the graphs besides com-amazon. Furthermore, GOSH could saturate with the normal configuration on link prediction; however GOSHW, on node classification, requires more training to converge. For instance, switching from fast to normal results in an average AUCROC score increase of 0.3% in link prediction (Table 7), and Micro and Macro-F1 increase of 3.2% and 3.0% in node classification, while reducing the speed on average only by a factor of 3× (Tables 7 and 8). Table 9 shows the node-classification results with different percentages of labeled nodes. In terms of Micro-and Macro-F1 scores, unlike link prediction, GV and GOSHW are in the same ballpark. Hence, coarsening does not have the same positive impact as on link prediction on node classification. However, even for node classification, coarsening is still needed for a fast embedding. Increasing B leads to a trade-off between embedding performance and quality; an increase in B decreases the runtime due to the reduction in number of embedding rounds and hence, number of submatrix copies. However, increasing B also results in more updates to be carried out on subsets of the graph in isolation from the remaining subsets. Thus, it delays the propagation of these updates and hurts the accuracy. In GOSH, we use a default value of B = 5 as it achieves a performance boost without a significant negative impact on the accuracy. The results are acquired using a single Titan X GPU on Arch-2. Figure 5 shows the relationship between the number of submatrices kept on the GPU, i.e., P GP U , the number of parts V is partitioned into, i.e., K, and the embedding time.

Experiments on handling large graphs
When P GP U increases from 2 to 4, the embedding time is reduced drastically. This is because the embedding kernels are overlapped with the submatrix transfers between the host and the GPU. However, the reduction stops after P GP U = 4. We suspect that this is due to the increase in K. To elaborate, as the number of submatrices on the GPU increases, the size of a single submatrix decreases and hence, the number of parts increases. More parts translate to more submatrix communications per embedding round, and the increased communication cost deteriorates the overall performance.  results on large-scale graphs. Every data-point is the average of 3 embeddings. Single-and multi-GPU experiments are conducted on Arch-1 and Arc-3, respectively.
We report the results for large-scale graphs in Table 10. VERSE timed out (12 hours) on all the large-scale graphs. MILE timed out on hyperlink2012 and could not embed the remaining two due to insufficient memory. Similar to MILE, GV ran out of GPU memory (on a single A100 GPU) on all the large-scale graphs. For this reason, we run it on the Arch-3 server equipped with four A100 GPUs. However, we used three out of four GPUs since GV had (node-level) out-of-memory issues with four GPUs. As Table 10 shows, although it is using a single V100, GOSH surpassed GV's AUCROC scores (on 3× A100s) in link prediction while being, on average, 3× faster. Similarly, GOSH-slow is always better than PBG in terms of AUCROC score and runtime.
For node classification, GOSHW-slow on a single V100 is approximately 20% slower compared to GV using three A100s. In fact, GOSHW-slow generates embeddings with similar F1 scores on a single A100 GPU approximately in 2000 seconds leveraging a larger, 80GB global memory. In terms of F1 scores, on com-friendster, GOSHW-slow obtains a worse Micro F1 score compared to GV but have a slightly better Macro F1 score on average.

Experiments with smaller dimensions
As explained in Section 3.1, to utilize the GPU's full power, we use a warp-based implementation; a single warp performs a single update at a time. Since a warp contains 32 threads; when d < 32, 32 − d threads remain idle. In other words, GOSH takes almost same amount of time for all runs where d < 32. To increase efficiency, we apply the technique described in Section 3.1.1 for d < 32. With this assignment, as shown in Table 11, we observe 2.63× and 1.8× speedup for d = 8, 16 on com-orkut compared to d = 32.

Speedup breakdown
We run intermediate GPU-accelerated versions of GOSH and report their speedup over our optimized, 16-thread CPU implementation. We show the results in Figure 6. In the figure, the first two results are for large-scale graphs, and the remaining four are for medium-scale graphs. We omit the results of the non-coarsened versions of large-scale graphs as they take a very long time. The first version, Naive GPU, is 3.3× slower than the multi-core CPU version. The next, Optimized GPU, improves the first by leveraging GPU-based optimizations, e.g., memory coalescing, and utilizing shared memory to reduce global memory usage. It is, on average, 5.4× faster than the multi-core version. These versions do not use coarsening.
Sequential Coarsening introduces single-thread coarsening. It achieves an average speedup of 45× over the multicore version and maintains the quality as shown in Tables 7-10. Note that a single update on a super vertex in a coarsened graph is propagated to all the vertices it contains.
The final version, Parallel Coarsening, employs multithreaded coarsening. The improvement is more pronounced for larger graphs. For instance, sequential and parallel coarsening take 2469 and 235 seconds, respectively, on com-friendster. Meanwhile, the total runtime of GOSHnormal is 2721 secs. This means that parallel coarsening improves performance by 80%. However, it improves the runtime of com-dblp by only 3% since the embedding takes 2.2 seconds, and the sequential and parallel coarsening times are 0.13 and 0.06 seconds, respectively.

RELATED WORK
In recent years, there has been a surge of interest in graph embedding. The literature is carefully documented in [17] and can be categorized into three branches: factorization-, deep-learning-, and sampling-based embedding. The first class of tools factorizes the adjacency/similarity matrices [18], [19], [20], [21], [22], [23], [24]. Deep-learningbased approaches utilize deep autoencoders to reduce the dimensionality of the graph [25], [26]. Lastly, in samplingbased embedding, a single-layer neural network is trained by choosing samples from the graph [1], [2], [3], [4]. Samples are chosen with respect to a similarity measure or a sampling strategy, varying throughout the works. Although there have been attempts to generalize the sampling process [1], [20], a common ground has not been reached yet.
Graph coarsening is often used to reduce the cost of large-scale graph problems. Most recently, Gilbert et al. presented CPU and GPU optimizations on many graph coarsening algorithms and demonstrated significant performance improvements on graph partitioning [27]. There have been attempts to use coarsening for embedding [5], [6], however, they do not utilize specialized processing units like GPUs. Distributed embedding approaches are also proposed to make the embedding faster [10], [28], [29]. Most notably, [10] can embed multi-relational graphs and utilizes a parameter partitioning schema similar to GOSH that provides distributed training on many multi-core and multi-GPU nodes. However, under hardware restrictions, such methods cannot be used for large-scale graph embedding.
Several studies have focused on accelerating graph algorithms by exploiting GPUs [7], [9], [30]. Both [7] and [9] require the embedding matrix to fit in device memory, which constitutes a restriction on the graphs' size to be embedded when working with limited hardware. GOSH relaxes this limitation by leveraging a novel coarsening algorithm and a judiciously devised CPU and GPU parallel scheduling algorithm. Similar to GOSH, [30] is able to tackle limited-hardware restrictions by utilizing a memory partitioning schema that can embed large-scale graphs on a single GPU. However, we did not include it as a baseline since its knowledge-graph focused scoring functions make it incomparable to the other baselines in this work.

CONCLUSION AND FUTURE WORK
This paper introduces a high-quality, fast, and flexible graph embedding algorithm GOSH that outperforms the state-ofthe-art in terms of both efficiency and accuracy by exploiting a parallel, multi-level coarsening. GOSH can handle realworld, large-scale graphs with a single GPU via a partitioning schema and CPU sampling which are designed to minimize GPU idling. We achieve a new state-of-the-art on link prediction in terms of both speed and AUCROC. Moreover, we match the state-of-the-art on node classification in terms of F1-scores while being twice as fast. We leave multi-node and multi-GPU support of GOSH as future work.