Evaluating Representation Learning and Graph Layout Methods for Visualization



Introduction
Visualization of data is useful both for understanding data and cross-checking models trained on that data. Consequently, it may lead to new insights, better models, the detection of outliers, etc. Visualization is frequently used in data science. However, it is not straightforward to visualize a graph. The vertices and edges typically do not have a real-valued representation, hence the primary problem in graph visualization is to find a good representation of the vertices in two-dimensional space.
Traditionally this area was known as graph drawing, and methods that produce a representation of the nodes of a graph in a two-dimensional real-valued space were referred to as graph-layout algorithms. Early research considered problems such as finding planar embeddings (no edge crossings); later research also addressed embeddings of large graphs, leading to the development of still popular force-directed approaches [3,18,6]. In contrast, graph representation learning methods (sometimes also referred to as network embedding methods) have not been developed specifically for visualization; rather, they aim to embed graphs into a low-dimensional space (depending on the method, generally between 8 and 128 dimensions) to enable subsequent use of common machine learning techniques.
There exist many graph-layout algorithms, each with a different objective or optimization procedure. Someone who wants to inspect or analyze a graph visually thus already faces a difficult choice of which method is most appropriate for their goal. This difficulty is amplified by the introduction of dozens of representation learning methods for graphs, which have been claimed to also produce good visualizations (see, e.g., [15,9,17,10]).
It is not clear which method to use, or even which aspects vary by choosing one method instead of another. For example, in Figure 1 we show visualizations from different representation learning methods. It is not obvious which visualization is better than another or how they differ.
Representation learning methods are often evaluated by coloring the vertices according to a class label, and a good visualization is then one where nodes from the same class are (visually) grouped together. But such evaluations have limited scope, because whether an algorithm does well depends strongly on what type of structure from the graph aligns with the node labels.
New graph layout methods are evaluated on more immediate quality measures, but are not compared to representation learning methods (see, e.g., [9,21]).
It is an open question how to evaluate graph layouts in general and whether some currently used methods outperform others. Contributions. In this paper, we perform an empirical analysis of graph visualizations produced by representation learning and graph layout algorithms. We investigate the differences in their visualizations both visually and quantitatively using a diverse set of quality measures. Representation learning methods are often used to compute node embeddings in more than two dimensions. We thus evaluate native two-dimensional as well as higher-dimensional node representations that we embed into two dimensions using the well-known t-distributed stochastic neighbor embedding (t-SNE) method. We base our analysis on publicly available methods and provide the full source code, which can easily be extended to include other methods or datasets. Lastly, we give recommendations on how to apply t-SNE to retain important graph structure and how different methods could be improved to yield better visualizations.

Setup of the study
In this section we introduce our choice of quality measures, graphs, and evaluated methods. The definitions of the quality measures are provided in the supplement.

Quality measures
There exist a variety of scores to measure the quality of graph layouts. The graph representation learning community typically evaluates graph embeddings on tasks such as clustering, link prediction, graph reconstruction, or node classification. These are inspired by and suitable for evaluating embeddings as models of the data, considering the performance of an embedding towards either reconstruction or prediction. In contrast, recent methods for graph visualization ([9,10,21]) are mostly evaluated by means of readability, shape-based, or distance-based measures. Such measures inform more directly about the quality of a graph layout, as they are easy to interpret, but they may not capture to what extent we can do inference and prediction using a layout.
As the measures may be complementary to each other, we evaluate graph layouts on measures from each of these groups: Crosslessness [16,10] measures the proportion of edge crossings that are avoided in a graph layout with respect to all possible pairwise edge crossings. Minimum angle [16] measures how much the smallest angle between incident edges at each node deviates from the optimal angle that could be achieved when all edges are evenly spaced. Edge length variation [4] quantifies the variability of the edge lengths relative to their mean. Gabriel graph similarity [2] is a shape-based quality measure for layouts of large graphs. We use the GLAM (Graph Layout Aesthetic Metrics) toolbox [10] to compute these four readability measures. The distance-based measures that we evaluate are neighborhood preservation [9] and stress [21].
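As a small illustration of how such readability measures are computed, the sketch below derives the upper bound on the number of edge crossings (the count of edge pairs that do not share an endpoint, used as the denominator of crosslessness) from a graph's degree sequence alone; the helper function is our own, not part of GLAM.

```python
# Upper bound c_max on the number of edge crossings: all edge pairs minus
# pairs of edges that share an endpoint (those cannot count as crossings).
def c_max(degrees):
    m = sum(degrees) // 2                            # |E| in an undirected graph
    all_pairs = m * (m - 1) // 2                     # every pair of edges
    shared = sum(d * (d - 1) for d in degrees) // 2  # pairs incident to a node
    return all_pairs - shared

# 4-cycle: only the two opposite edge pairs can cross.
print(c_max([2, 2, 2, 2]))  # 2
```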
Finally, we use the EvalNE toolbox [13] to quantify the suitability of the layouts for link prediction. For link prediction, a random connected subset of 80% of the edges is used to compute the layout and the remaining 20%, together with the same number of randomly sampled non-existing edges, are used for testing. While some methods have built-in edge embeddings (AROPE, CNE, GAE), others only result in node embeddings. In the latter case, we use EvalNE to transform the node embeddings into edge embeddings using their elementwise average, Hadamard product, L1, or L2 distance. The measure reported is the AUC-ROC of a logistic classifier that predicts the test edge probabilities.
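The node-to-edge transformations applied in this step can be sketched as follows; only the four operators come from the text, while the helper function and variable names are ours.

```python
import numpy as np

def edge_embedding(z_i, z_j, op):
    """Combine two node embeddings into one edge embedding."""
    if op == "average":
        return (z_i + z_j) / 2.0
    if op == "hadamard":
        return z_i * z_j          # elementwise product
    if op == "l1":
        return np.abs(z_i - z_j)  # elementwise absolute difference
    if op == "l2":
        return (z_i - z_j) ** 2   # elementwise squared difference
    raise ValueError(f"unknown operator: {op}")

z_i, z_j = np.array([1.0, 2.0]), np.array([3.0, 0.0])
print(edge_embedding(z_i, z_j, "hadamard"))  # [3. 0.]
```

The resulting edge features are what the logistic classifier is trained on.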

Datasets
We evaluate visualizations of the largest connected components of six undirected and unweighted graphs listed in Table 1. The karate club graph is a commonly used example and models the social interactions of the sports club members. can96 is an artificial mesh structure, netscience a co-authorship graph from network scientists, and powergrid represents the electricity grid of the western United States. The facebook friendship graph is commonly evaluated on the link prediction task. The largest graph, a twitter feed graph, has been used in the experiments by Zellmann et al. [19] and provided to us by the authors.

Evaluated methods
In this study we compare graph visualizations from seven methods, with parameter settings and links to their source code in Table 11 of the Supplement. We followed the parameter recommendations of the original publications where possible. We selected four graph representation learning methods: Arbitrary-order Proximity Preserved Network Embedding (AROPE), Conditional Network Embedding (CNE), DeepWalk (DW), and Graph Auto-Encoders (GAE), which are state-of-the-art methods each using a distinct approach [14]. These methods are mostly used to compute low-dimensional node embeddings in more than two dimensions and the two-dimensional graph layouts for visualization are then obtained by t-SNE [12]. We denote their node embeddings by, e.g., AROPE128, indicating the dimensionality before applying t-SNE. We also evaluate native two-dimensional node representations for all graph representation learning methods denoted by e.g., AROPE2. The representatives for graph layout methods are the Fruchterman-Reingold (FR) algorithm and DRGraph. We chose FR due to its popularity and evaluate its naive implementation as well as a recent more scalable grid-based optimization. DRGraph is a recently published method that is faster than existing layout techniques but produces visualizations with comparable readability and neighborhood preservation.
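The reduction step shared by these pipelines, projecting a higher-dimensional node embedding to two dimensions with t-SNE, can be sketched as follows; we use random data as a stand-in for a real 128-dimensional embedding, and the perplexity value is an assumed setting rather than the one used in the study.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
Z = rng.normal(size=(300, 128))   # placeholder: 300 nodes, 128 dimensions

# Project to 2D; perplexity and init are illustrative choices.
Y = TSNE(n_components=2, perplexity=30, init="pca",
         random_state=0).fit_transform(Z)
print(Y.shape)  # (300, 2) layout coordinates
```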
Fig. 2: Placement of low-degree nodes (dark) around a high-degree hub node (yellow, marked with a red arrow) for the twitter graph. While nodes are clustered according to their degree in AROPE128 and GAE16, and to some extent in CNE16, leaf nodes are close to their parent in DW128.
Arbitrary-order Proximity Preserved Network Embedding [20] preserves higher-order proximities by factorizing a linear combination of powers of the adjacency matrix. It is efficient through the use of eigen-decomposition reweighting.
Conditional Network Embedding [7] is a probabilistic network embedding method based on Maximum Likelihood Estimation that can factor out prior knowledge about the graph structure using a Bayesian approach.
DeepWalk [15] captures the neighborhood structure of nodes using random walks. The node embeddings are updated using the Skip-gram model, maximizing the co-occurrence probability of nodes in the same neighborhood.
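The random-walk corpus behind DeepWalk can be sketched minimally as below (the walks would subsequently be fed to a Skip-gram model); the walk count and length are illustrative, not the settings used in this study.

```python
import random
import networkx as nx

def build_walks(G, num_walks=10, walk_length=5, seed=0):
    """Generate `num_walks` uniform random walks starting from every node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        nodes = list(G.nodes())
        rng.shuffle(nodes)                       # new visit order per pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                walk.append(rng.choice(list(G.neighbors(walk[-1]))))
            walks.append(walk)
    return walks

G = nx.karate_club_graph()                       # the study's smallest graph
walks = build_walks(G)
print(len(walks), len(walks[0]))  # 340 5
```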
Graph Auto-Encoders [8] embed the graph into a latent space using a graph convolutional network encoder and an inner product decoder.

Fruchterman-Reingold [3]
is a force-directed approach with attraction forces between connected nodes and repulsion between all nodes. Node positions are updated iteratively with a temperature cooling scheme that reduces the displacement of nodes over time. We compare the O(|V|^2) implementation of Fruchterman-Reingold by Hagberg et al. [5] (version 2.2) with a GPU-optimized implementation by Zellmann et al. [19], denoted as FR-RTX.
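The core update of the Fruchterman-Reingold scheme can be sketched as follows; the constants and the single-step helper are illustrative and do not reproduce either evaluated implementation.

```python
import numpy as np

def fr_step(pos, edges, k=0.1, temp=0.05):
    """One iteration: repulsion k^2/d between all pairs, attraction d^2/k
    along edges, displacement capped by the current temperature."""
    disp = np.zeros_like(pos)
    for i in range(len(pos)):                    # repulsion from every node
        delta = pos[i] - pos
        dist = np.maximum(np.linalg.norm(delta, axis=1), 1e-9)
        disp[i] += ((delta / dist[:, None]) * (k**2 / dist)[:, None]).sum(axis=0)
    for i, j in edges:                           # attraction along edges
        delta = pos[i] - pos[j]
        dist = max(np.linalg.norm(delta), 1e-9)
        force = (delta / dist) * (dist**2 / k)
        disp[i] -= force
        disp[j] += force
    length = np.maximum(np.linalg.norm(disp, axis=1), 1e-9)
    return pos + disp * (np.minimum(length, temp) / length)[:, None]

rng = np.random.default_rng(0)
pos = fr_step(rng.uniform(size=(4, 2)), [(0, 1), (1, 2), (2, 3)])
print(pos.shape)  # (4, 2)
```

In the full algorithm this step is repeated while the temperature is lowered, so node movement shrinks over time.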
DRGraph [21] preserves the first-order neighbors from the graph in the low-dimensional representation by minimizing the difference between node similarities and layout proximities. It is based on the idea of tsNET [9] but improves the computational and memory complexity by introducing a sparse distance matrix, negative sampling, and a multi-level optimization process.

Empirical results
We first compare the graph layouts qualitatively based on the visualizations in Table 2 and then evaluate the quantitative measures. The code and the embeddings are available at https://github.com/aida-ugent/graph-vis-eval. The Supplement contains node-link diagrams (Table 12) and scores for all layouts (Tables 1 to 10).

Visual inspection of the embeddings
AROPE. The first embedding coordinate of graph layouts by AROPE2 is proportional to the eigenvector centrality of the nodes, making the embedding unlike the others. Central nodes may be identified on one side and leaf nodes on the other side of the embedding for karate club, can96, and powergrid. The nodes for the other three graphs are mostly aligned along two axes, thus hiding most of their connections. Embeddings by AROPE128 reveal more details of the graph structure. In the twitter graph layout in Figure 2a we observe a clustering according to node degree.
CNE. The native two-dimensional embeddings of CNE2 exhibit a proximity-based arrangement of the nodes where hub nodes are placed in the center of their connections. The embeddings by CNE16 are similar to AROPE128, DW128, and GAE16. For the twitter graph, we find the t-SNE embedding is more readable as it shows cluster structure. We see in Figure 2b that CNE16 also clusters nodes by degree.
DeepWalk. The embeddings by DW2 conceal the underlying shape of the graph as nodes are mostly arranged on a curved line. DW128 produces readable embeddings with a clear cluster structure. In Figure 2c we note that DW128 embeds low-degree nodes close to their connections resulting in a star shape around the hub node.
DRGraph. The graph layouts by DRGRAPH for netscience and powergrid are appealing and very similar to the layouts by DW128. The different communities of the facebook graph are well visible, but in the node-link diagram (Supplement, Table 13) we notice some long edges that have a leaf node on one end. Long edges also dominate the visualization of the twitter graph and make it difficult to observe any structure in the center of the layout. We do not know whether the parameter settings or the fact that DRGRAPH only preserves first-order graph distances cause the 'hairball' structure of this visualization.
FR. Both implementations result in graph layouts with similar global structure but different local node arrangement. For FR, we observe that clusters are more compact and nodes are pushed away from the center. For FR-RTX, nodes are distributed more evenly around a shared connection, which hinders the formation of visible clusters for powergrid but improves the readability of twitter. We assume this is the effect of approximating the repulsive forces in FR-RTX, causing only the closest nodes to repel each other.
GAE. The embeddings from GAE2 have a distinct circular shape due to the inner product decoder. High-degree node embeddings have large coordinates and low-degree nodes are placed near the origin. In this graph layout, we can easily identify the most central hub nodes, e.g., for facebook or twitter. The embeddings by GAE16 and AROPE128 have a similar local structure, as shown in Figure 2d. In both embeddings, nodes from the same cluster are arranged by degree.

Readability measures
We show the scores for crosslessness, edge-length uniformity, minimum angle, and Gabriel graph similarity, averaged over four runs, in Figure 3. Averaging over all graph datasets, the graph layout methods outperform the representation learning methods (Supplement Table 10). DW128 achieves high scores for the layouts of larger graphs. The high minimum-angle scores of the layouts by FR-RTX and DW128 stem from the star-shaped arrangement of hub-nodes and their connections (see Figure 2c). The layouts by AROPE2, DW2, and GAE2 generally have small angles between incident edges as they optimize the dot-product similarity of connected nodes. Further, we observe that AROPE2, DW2, and GAE2 score poorly for the Gabriel shape measure. These methods optimize the node embeddings for dot-product similarity whereas the Gabriel graph similarity measure retrieves neighbors based on Euclidean distance.

Distance-based measures
We show the scores for second-order neighborhood preservation and stress in Figures 4a and 4b, and the scores of the first-order neighborhood preservation measure in the Supplement; they are highly similar to the Gabriel similarity measure. DW128 and DRGRAPH preserve the second-order neighborhood best. Considering all methods, the neighborhoods of the facebook graph are better preserved than the neighborhoods of powergrid or twitter. We presume that the community structure of the graph aligns well with the second-order neighborhoods. FR places these communities far apart and thus achieves the highest score. The differences in the stress measure are subtle. Averaged over all datasets, FR and FR-RTX result in layouts with the lowest stress. The t-SNE based embeddings from AROPE128, DW128, and GAE16 are more distance faithful than their native two-dimensional node embeddings, but the opposite holds for CNE2 and CNE16.

Link-prediction
Results are presented in Figure 4c. No single method achieves the highest link prediction score on more than two datasets. Averaging over all datasets, the graph layouts based on the Fruchterman-Reingold algorithm result in the highest link-prediction scores. While it is difficult to identify much structure in the hairball-shaped twitter graph layouts by FR and DRGRAPH, they score highly on the link-prediction task.

Runtime
The average runtime to embed the whole graphs using an Intel Xeon CPU E5-2620 v4 @ 2.10GHz with one GeForce GTX 1080Ti is depicted in Figure 4d. We ran the experiments for GAE on twitter on a different machine with 256GB RAM, resulting in mean runtimes of 5020s for GAE2 and 4635s for GAE16. We notice that AROPE2 is the fastest method and that the runtime increase for AROPE128 is mainly caused by t-SNE, which took about 22, 43, and 660 seconds to reduce the dimensionality from 128 to 2 for facebook, powergrid, and twitter, respectively. FR-RTX has a slightly larger runtime than the other methods on the smaller graphs, possibly due to a small startup cost for a user interface. Notably, on the twitter dataset, FR-RTX runs in less than a minute, while FR takes almost 24 hours. CNE16 has a lower runtime than CNE2 on the larger graphs, as the optimization of the 16-dimensional node representations stabilizes earlier than that of the two-dimensional representations.

Discussion
In this study we have shown that visualizations by graph layout methods scored higher on the chosen quality measures than the native two-dimensional node embeddings by representation learning methods. The combination of DeepWalk with t-SNE resulted in visualizations with the best local neighborhood preservation and highest scores in Gabriel graph similarity but does not scale to larger graphs. We believe that there is great potential in comparing a wider range of scalable graph layout and representation learning methods on real-world graphs with millions of nodes.
The standard definitions of Gabriel graph similarity, neighborhood preservation, and stress all assume that the graph-theoretical distances are reflected by the Euclidean distances in the low-dimensional embedding. Methods such as AROPE, DeepWalk, and GAE optimize the embedding based on dot-product similarity. From this perspective, it is not surprising that the graph layouts by AROPE2, DW2, and GAE2 score worse on these measures. Although distances other than Euclidean may be interpretable, it is not obvious to what extent that is the case. In addition to the quality measures, the standard version of t-SNE also defines high- and low-dimensional neighborhoods based on Euclidean distance. To retain the graph neighborhoods, we would have to adjust the similarity definition of t-SNE.

Recommendations for practical use
It is important to note that the choice of method is not universal and depends on the task for which the visualization is used. The designer of the visualization should ask which quality measure is most important for judging the effectiveness of the visualization for that task. For example, out-of-sample quality measures such as link prediction performance may be unimportant when a static graph without missing edges is to be analyzed. In other contexts, generalizability of the proximity of nodes, for which link prediction performance is a proxy, could be desirable.
No single winner emerged from the comparisons of graph representation learning with graph layout methods. Nonetheless, some general patterns did emerge: Graph layout methods. We found that FR and FR-RTX resulted in readable layouts, scored best on stress minimization, and achieved the highest link-prediction scores. The latter is especially surprising as the representation learning methods are focused on generalization performance, through the use of higher-order information (paths, convolutions), supervised learning (use of negative edges), or both.
DRGRAPH is a fast method that preserves the local graph neighborhoods better than both versions of Fruchterman-Reingold. While the neighborhood preservation decreases for larger graphs, it may still be the best choice to use either of the computationally efficient graph layout methods instead of a representation learning method.
Graph representation learning methods. While the native two-dimensional embeddings of CNE2 perform well in terms of quality, it appears that the only reason to consider the AROPE2, DW2, and GAE2 embeddings is to better understand the method itself. They are outperformed, typically by a wide margin, by the t-SNE embeddings of higher-dimensional node representations. DW128 is arguably the best representation learning method included in the evaluation. Compared to DRGRAPH, it provides a more readable visualization for the difficult twitter graph, but is also 1000 times slower. For larger graphs, we expect the relative difference in runtime to grow further.

Future work
Improving the visual quality. We observed that the naive Fruchterman-Reingold algorithm results in embeddings where cluster structure is more pronounced compared to the grid-optimized version, where nodes are distributed evenly across the layout. Adjusting the repulsive forces to depend on node degrees, as in ForceAtlas2 [6], could improve the perceived clusteredness of the layout by embedding leaf nodes closer to their parent.
Combination of graph representation learning with t-SNE. The t-SNE visualizations of the large twitter graph (CNE16, DW128, and GAE16) are of high quality and suggest that preserving only the small distances of a low-dimensional node representation avoids the 'hairball' embedding. This suggests that an integrated method could be derived that outperforms existing methods. Besides, it would be worthwhile to improve the scalability of the DW128 combination.
Finally, we hope this evaluation inspires follow-up work to further connect the representation learning with the graph drawing community.

Evaluation measures
We denote an undirected, unweighted, and connected graph $G = (V, E)$ by the set of nodes $V$ with $|V| = n$ and the set of edges $E = \{(i, j)\} \subseteq V \times V$. We refer to the graph-theoretical shortest-path distance between nodes $i$ and $j$ by $d_{ij}$.
With $N(i, h) = \{j \in V \mid d_{ij} \le h,\ i \ne j\}$ we denote the set of nodes in the $h$-order neighborhood of node $i$. With the following readability measures we evaluate two-dimensional graph layouts $Y \in \mathbb{R}^{n \times 2}$, where the embedding of a node $i$ is denoted as $y_i \in \mathbb{R}^2$ and the edges are drawn as straight lines. Crosslessness [6] measures the proportion of edge crossings that are avoided in a graph layout with respect to all possible pairwise edge crossings. We adopt the definition by Kwon et al. [4], where $c$ is the number of crossings and $c_{\max}$ is the upper bound on the number of crossings, defined as
$$c_{\max} = \frac{|E|(|E| - 1)}{2} - \frac{1}{2} \sum_{i \in V} |N(i, 1)|\,(|N(i, 1)| - 1),$$
where $|N(i, 1)|$ is the degree of node $i$. Minimum angle [6] measures how much the smallest angle between incident edges at each node deviates from the optimal angle that could be achieved when all edges are evenly spaced.
It is defined as
$$1 - \frac{1}{n} \sum_{i \in V} \left| \frac{\theta(i) - \theta_{\min}(i)}{\theta(i)} \right|,$$
where $\theta(i) = \frac{360^{\circ}}{|N(i, 1)|}$ is the optimal angle at node $i$ and $\theta_{\min}(i)$ is the minimum angle between any two edges incident to node $i$. Edge length variation [2] measures the variability of the edge lengths relative to their mean. It is defined as the coefficient of variation $l_{cv} = l_{\sigma} / l_{\mu}$, where $l_{\sigma}$ and $l_{\mu}$ are the standard deviation and the mean of the edge lengths. To be in line with the other readability measures, where higher values are better, we define
$$1 - \frac{l_{cv}}{\sqrt{|E| - 1}}$$
for graphs with $|E| > 1$, where the coefficient of variation is normalized by its upper bound $\sqrt{|E| - 1}$. Gabriel graph similarity [1] is a shape-based quality measure for layouts of large graphs. It is based on the intuition that a representation of the global graph structure can be accurate despite a high number of edge crossings. It is defined as the mean Jaccard similarity between the first-order neighborhood $N_G$ in the original graph and the neighborhood $N_S$ in the Gabriel graph. The Gabriel graph over the set of nodes $V$ has an edge connecting two nodes $i$ and $j$ if the disk with $y_i$ and $y_j$ on its boundary and diameter $\|y_i - y_j\|$ does not contain any other node. Neighborhood preservation [3] is defined as the mean Jaccard similarity
$$\frac{1}{n} \sum_{i \in V} \frac{|N_G(i, h) \cap N_Y(y_i, k_i)|}{|N_G(i, h) \cup N_Y(y_i, k_i)|},$$
where $N_G(i, h)$ is the graph neighborhood of node $i$ with distance at most $h$ and $N_Y(y_i, k_i)$ are the $k_i = |N_G(i, h)|$ nearest neighbors based on Euclidean distances in the embedding. We evaluate the neighborhood preservation for $h \in \{1, 2\}$.
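A direct implementation of the neighborhood preservation measure, assuming a networkx graph and a layout matrix Y, might look as follows.

```python
import numpy as np
import networkx as nx

def neighborhood_preservation(G, Y, h=1):
    """Mean Jaccard similarity between N_G(i, h) and the k_i nearest
    embedding neighbors (Euclidean), k_i = |N_G(i, h)|."""
    nodes = list(G.nodes())
    index = {v: t for t, v in enumerate(nodes)}
    scores = []
    for v in nodes:
        lengths = nx.single_source_shortest_path_length(G, v, cutoff=h)
        n_g = {u for u in lengths if u != v}     # graph neighborhood
        dist = np.linalg.norm(Y - Y[index[v]], axis=1)
        dist[index[v]] = np.inf                  # exclude the node itself
        n_y = {nodes[t] for t in np.argsort(dist)[:len(n_g)]}
        scores.append(len(n_g & n_y) / len(n_g | n_y))
    return float(np.mean(scores))

# A path graph laid out exactly on a line preserves all 1st-order neighborhoods.
G = nx.path_graph(4)
Y = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
print(neighborhood_preservation(G, Y, h=1))  # 1.0
```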
Stress measures how accurately a graph layout represents the actual graph-theoretic node distances. We define the stress similarly to Zhu et al. [7] as
$$\text{stress}(Y) = \sum_{i < j \in V} \frac{\left( \alpha \|y_i - y_j\| - d_{ij} \right)^2}{d_{ij}^2},$$
where $\alpha \ge 0$ scales the embedding to minimize the stress:
$$\arg\min_{\alpha} \text{stress}(Y) = \arg\min_{\alpha} \sum_{i < j \in V} \frac{\left( \alpha \|y_i - y_j\| - d_{ij} \right)^2}{d_{ij}^2}.$$
Taking the derivative with respect to $\alpha$ and equating it to zero yields
$$\alpha = \frac{\sum_{i < j \in V} \|y_i - y_j\| / d_{ij}}{\sum_{i < j \in V} \|y_i - y_j\|^2 / d_{ij}^2}.$$
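The scale-invariant stress with its closed-form optimal scaling can be sketched as follows; the helper is ours, and D is assumed to hold the pairwise shortest-path distances.

```python
import numpy as np

def scaled_stress(Y, D):
    """Stress of layout Y against graph distances D, after scaling Y by the
    optimal alpha = sum(e/d) / sum(e^2/d^2)."""
    iu = np.triu_indices(len(Y), k=1)
    d = D[iu]                                                 # graph distances
    e = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)[iu]  # layout distances
    alpha = np.sum(e / d) / np.sum(e**2 / d**2)               # optimal scaling
    return float(np.sum((alpha * e - d) ** 2 / d**2))

# Two nodes at graph distance 1: rescaling makes any layout stress-free.
Y = np.array([[0.0, 0.0], [5.0, 0.0]])
D = np.array([[0.0, 1.0], [1.0, 0.0]])
print(round(scaled_stress(Y, D), 6))  # 0.0
```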