SNoRe: Scalable Unsupervised Learning of Symbolic Node Representations

Learning from real-life complex networks is a lively research area, with recent advances in learning information-rich, low-dimensional network node representations. However, state-of-the-art methods offer little insight, as the features that constitute the learned node representations are not interpretable. They are therefore less applicable to sensitive settings such as biomedical or user profiling tasks, where bias detection is highly relevant. The proposed SNoRe (Symbolic Node Representations) algorithm is capable of learning symbolic, human-understandable representations of individual network nodes based on the similarity of neighborhood hashes to nodes chosen as features. SNoRe's interpretable features are suitable for direct explanation of individual predictions. We demonstrate this by coupling it with the widely used instance explanation tool SHAP to obtain nomograms representing the relevance of individual features for a given classification, which is, to our knowledge, one of the first such attempts in a structural node embedding setting. In the experimental evaluation on 11 real-life datasets, SNoRe proved to be competitive with strong baselines, such as variational graph autoencoders, node2vec and LINE. The vectorized implementation of SNoRe scales to large networks, making it suitable for many contemporary network analysis tasks.


Introduction
Networks can be used to model numerous real-world systems, spanning from biological protein interaction networks to social and transportation networks Costa et al. (2011); Lü and Zhou (2011). By representing a real-life system as a network, it is possible to study network properties, such as the key network nodes, why they are relevant, how sets of nodes group together, and how network nodes are classified Bhagat et al. (2011); Cai et al. (2018). The latter task is the focus of this research.
The problem of node classification was already considered in the 1990s Farmer and Rodkin (1996). However, it was popularized only in recent years due to the increase in available computational power. A well-known method capable of node classification is label propagation Xiaojin and Zoubin (2002), an algorithm that asynchronously assigns labels to neighboring nodes, eventually reaching an equilibrium state that corresponds to the final classification. Albeit efficient, label propagation and similar approaches operate in a relatively naïve manner, not accounting for the rich structure of a given network that spans beyond simple neighborhoods. To mitigate this issue, novel representation learning methods emerged, offering efficient ways of constructing real-valued representations of individual nodes, suitable for down-stream learning such as classification.
Contemporary structural node representation algorithms are mostly concerned with down-stream performance and do not emphasize interpretability, which is of utmost importance when the user attempts to understand why the system classified a given instance the way it did. To mitigate this issue, we developed SNoRe, an algorithm that compares node neighborhoods and is capable of learning interpretable feature sets describing a given node. These feature sets can be used to obtain explanations of individual predictions, which is an improvement over state-of-the-art low-dimensional, black-box node representations.
The contributions of this work are summarized as follows:
• We propose SNoRe, an efficient algorithm capable of learning symbolic representations of nodes by accounting for global network topology.
• Theoretical and empirical comparisons with state-of-the-art indicate competitive performance, whilst offering the interpretability of individual predictions, explained by the contributions of the neighboring nodes.
• We show that SNoRe scales to real-life networks with tens of thousands of nodes, and does not require dedicated hardware for effective performance.
• SNoRe is implemented as a simple-to-use Python library, compiled to lower-level code via the Numba framework Lam et al. (2015) for maximum efficiency.

Related work
This section discusses the related work, describing the state-of-the-art methods capable of solving the node classification task along with their properties. Note that there are two main variants of learning from networks: transductive and inductive classification. In the transductive setting, node classification is performed within the same network: part of the network is initially labeled, while the remainder is not. The task addresses the issue of extrapolating the information from the known part of the network to the unknown (unlabeled) one. Common examples of this task include gene function prediction and social network-based tasks such as user profiling. On the other hand, inductive learning corresponds to a setting where independent networks are fed as input and are also classified on the network level. The focus of this work is on transductive learning.
The types of learning algorithms can further be split based on the information they are capable of exploiting during learning. An algorithm can perform solely by exploiting the network structure, or can also incorporate potentially interesting features assigned to either nodes or edges. The focus of this work is on structural classification (no features).

Structural node embedding
The notion of structural node embedding corresponds to the process of learning a given node's latent representation (most commonly real-valued), based on its neighborhood within a given network. The first branch of methods was inspired by the widely known word2vec algorithm Mikolov et al. (2013): DeepWalk Perozzi et al. (2014) was one of the first node representation learners, and remains state-of-the-art to this date. DeepWalk creates a network representation by using sequences of nodes representing random walks as input sentences for the word2vec algorithm. Random walks are created in a depth-first search manner and intuitively map nodes with similar second-order proximity close together.
Following similar ideas, methods such as node2vec, struc2vec, LINE and PTE emerged, each exploiting additional properties (e.g., network-topological ones) during representation learning. The node2vec algorithm Grover and Leskovec (2016) uses hyperparameters p and q to guide random walks. Parameter p dictates the return probability, while q dictates the probability of exploration away from the previous node. If p and q are both set to 1, we get the special case where the node2vec algorithm can be seen as DeepWalk. LINE Tang et al. (2015b) derives an objective function for first and second-order proximity that is computationally intensive and thus not scalable. The algorithm is then made scalable with the adoption of negative sampling. Function parameters of the classification model are optimized with asynchronous stochastic gradient descent.
NetMF is presented in Qiu et al. (2018) along with a theoretical analysis of DeepWalk Perozzi et al. (2014), node2vec Grover and Leskovec (2016), LINE Tang et al. (2015b) and PTE Tang et al. (2015a). This theoretical analysis shows that all aforementioned methods approximate matrix factorization and that the closed forms of these matrices are intrinsically connected to the graph Laplacian. NetMF factorizes these closed-form matrices, potentially offering consistent improvement in performance over the other methods mentioned above. Personalized Page Rank with Shrinking (PPRS) was introduced as part of the HINMINE methodology Kralj et al. (2017). This algorithm creates vectors representing personalized node ranks by using power iteration. Such vectors can be used directly for learning purposes, or further compressed by an autoencoder Škrlj et al. (2019a), offering small, compact representations trained in an end-to-end manner.
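The role of p and q can be illustrated with a minimal Python sketch of the unnormalized second-order transition weights commonly used to describe node2vec. The function name, the dictionary-of-sets graph layout and the toy network are illustrative assumptions, and unit edge weights are assumed.

```python
def node2vec_weights(graph, previous, current, p=1.0, q=1.0):
    """Unnormalized node2vec weights for stepping away from `current`.

    `graph` maps each node to the set of its neighbors (undirected,
    unweighted); `previous` is the node visited just before `current`.
    """
    weights = {}
    for candidate in graph[current]:
        if candidate == previous:
            weights[candidate] = 1.0 / p  # return to the previous node
        elif candidate in graph[previous]:
            weights[candidate] = 1.0      # stay close to the previous node
        else:
            weights[candidate] = 1.0 / q  # explore away from the previous node
    return weights


# Hypothetical toy network: a triangle 0-1-2 with a pendant node 3 on node 1.
toy_graph = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
biased = node2vec_weights(toy_graph, previous=0, current=1, p=2.0, q=0.5)
uniform = node2vec_weights(toy_graph, previous=0, current=1)
```

With p = q = 1 all weights are equal, which corresponds to the uniform neighbor choice of DeepWalk mentioned above.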

Graph neural networks
Since networks as such are not bound to a given coordinate system, direct input of e.g., adjacency matrices into neural networks proves to be problematic. As a result, in parallel with the aforementioned structural node embedding methods, which are useful for representation learning in domains with a well structured spatial structure (such as images), the area of graph neural networks (GNNs) has emerged Bojchevski. GNNs create a representation of the network by propagating and transforming node representations until an equilibrium is reached. These algorithms are mostly divided into three subgroups: graph recurrent neural networks, graph convolutional neural networks and graph autoencoders. Graph recurrent neural networks try to capture and learn recursive and sequential patterns by taking advantage of recurrent neural networks. Graph convolutional neural networks learn local and global patterns through designed convolution and readout functions.
Graph convolutional neural networks are divided into spectral and spatial based algorithms, based on how they define convolution. Spectral based algorithms define graph convolution using filters from graph signal processing, while the convolution in spatial algorithms relies on information propagation. Graph autoencoders are often used for unsupervised representation learning by assuming that the networks have low-rank structures that are potentially nonlinear.
Graph convolutional network (GCN) Kipf and Welling (2017) is one of the most influential works in graph-based deep learning since it bridges the gap between spectral and spatial based graph convolutional neural networks. The algorithm simplifies filtering by only focusing on first-order neighbors. Since the number of neighbors can vary, GraphSAGE Hamilton et al. (2017) samples a fixed number of neighbors and aggregates them. Graph attention network (GAT) Veličković et al. (2018) further improves both previously mentioned approaches by introducing the attention mechanism. The attention mechanism allows the neural network to learn how much each neighbor contributes instead of assuming that all neighbors contribute the same amount (as in GraphSAGE) or that this amount is predetermined (as in GCN). Another interesting graph convolutional neural network is the Graph Isomorphism Network (GIN) Xu et al. (2019), which presents a readout function that uses summation and a multi-layer perceptron to provably achieve the maximum discriminative power.
A popular graph autoencoding algorithm is the Variational graph autoencoder (VGAE) Kipf and Welling (2016), which uses latent variables to create a representation for undirected networks. The algorithm encodes the network into mean and variance matrices and decodes them with the dot product. The parameters of the model are learned by maximizing the variational lower bound.

Algorithm SNoRe
This section presents the SNoRe algorithm. We first define some essential components and present the key ideas of the algorithm (Section 3.1). The algorithm is separated into four steps: random walk generation (Section 3.2), random walk hashing (Section 3.3), feature selection (Section 3.4), and similarity calculation (Section 3.5). For each step, we present its description and how we implemented it. We also present an extension of the algorithm that chooses the number of features based on the embedding size (Section 3.6), show an overview of the algorithm (Section 3.7) and present its theoretical properties (Section 3.8).

Definitions and key ideas
Let us first define the key terms that we use throughout this paper.
Definition 1 (Network). A network is a tuple $G = (N, E)$, where $N$ represents the set of nodes and $E$ represents the set of edges. An edge can be represented as an ordered pair (e.g., $(n_1, n_2) \in N \times N$), in which case the network is directed. Alternatively, an edge can be represented as a subset of size 2 (e.g., $\{n_1, n_2\} \subseteq N$), in which case the network is undirected.
For generality, we will use directed networks since we can also represent the undirected ones using the same formalism. On a network, we define a walk as follows.
Definition 2 (Walk). A walk of length $k$ in the directed network is any sequence of $k$ nodes $n_1, n_2, \ldots, n_k \in N$, such that each pair of consecutive nodes $n_i$ and $n_{i+1}$ has a connection $(n_i, n_{i+1}) \in E$.
We extend the definition and define the notion of a random walk.
Definition 3 (Random walk). A random walk is a walk generated in such a way that at step $i$, node $n_{i+1} \in \{a : (n_i, a) \in E\}$ is chosen with some probability.
The result of our algorithm is a symbolic node embedding of a network, defined as follows.
Definition 4 (Symbolic node embedding). A symbolic node embedding of a network $G = (N, E)$ is a matrix $M \in \mathbb{R}^{|N| \times d}$, where $d$ is the dimension of the embedding. Such an embedding is considered symbolic when each column represents a symbolic expression which, when evaluated against a given node's neighborhood information, returns a number representing that node.
Note that the type of symbolic node embedding defined above can also be referred to as propositionalization (see a recent review for more details). Step 1 generates random walks that are then hashed in Step 2. These hashes are represented as sparse vectors and used to calculate the similarity between two node neighborhoods in Step 4. The similarity is calculated between all nodes and the nodes that are chosen as features in Step 3 based on their PageRank score.
We use these definitions to outline the proposed algorithm illustrated in Figure 1. In the figure, we highlight a node and mark it red to present how an arbitrary node in the network gets embedded. The first step generates random walks, marked as a collection of red edges in the first step of the algorithm shown in Figure 1. We then aggregate walks starting in the red node into a vector of node occurrences in step 2.
Step 3 then selects the features based on weights assigned by the PageRank algorithm Page et al. (1999). In the final step, we generate the embedding of any given node by calculating the cosine similarity between the hash values of nodes selected as features and the red (considered) node.

Random walk generation
Sampling the neighborhood of a given node can give us information about the network's structure and connectivity patterns in its vicinity. We can sample the neighborhood using short random walks. These offer many advantages, such as ease of computation and parallelization, a bound on the distance to the farthest node, and ease of representation.
The first step of the algorithm generates random walks and represents them with a data structure such as a list of visited nodes. We use the random walk generation scheme (and vectorized implementation) presented in Škrlj et al. (2019b). Let $w \in \mathbb{R}^s$, $\|w\|_1 = 1$ be the distribution vector, where $s$ is the maximum length of the walk and $w_i$ denotes the probability that the walk is of length $i$. We sample a random walk length $i$ from $w$ and create a random walk of length $i$ using Algorithm 1. In line 4 of the algorithm, we append the current node $c$ (together with some information) to the walk representation structure. Function neighbor in line 5 returns a randomly chosen neighbor of the given node. This algorithm is repeated $nw$ times for each node, giving us $nw$ random walks per node.
Algorithm 1 Classical walk
Input: Starting node n_i, walk length wl
Output: Random walk structure Ws
1: c ← n_i
2: Ws ← ∅
3: for i = 0 to wl do
4:     Ws ← Ws ∪ c
5:     c ← neighbor(c)
6: end for
7: return Ws
In our implementation, we represented a random walk with a list of tuples denoting the node and step, $(n_i, s_i) \in N \times \{j : j = 0, \ldots, s\}$. The final random walk structure consists of concatenated random walk lists $l_i$ for each node separately.
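The walk generation step can be sketched in plain Python as follows. This is a minimal illustration rather than the vectorized Numba implementation: the adjacency-list dictionary, the function names, the early stop at nodes without outgoing edges and the toy network are all assumptions made for the example.

```python
import random


def classical_walk(graph, start, walk_length, rng):
    """Return a list of (node, step) tuples, mirroring Algorithm 1."""
    walk = []
    current = start
    for step in range(walk_length + 1):  # "for i = 0 to wl" is inclusive
        walk.append((current, step))
        neighbors = graph[current]
        if not neighbors:  # assumption: stop early at a dead end
            break
        current = rng.choice(neighbors)
    return walk


def generate_walks(graph, start, num_walks, length_dist, rng):
    """Concatenate num_walks walks; length_dist[i] is P(length = i + 1)."""
    lengths = list(range(1, len(length_dist) + 1))
    walks = []
    for _ in range(num_walks):
        wl = rng.choices(lengths, weights=length_dist, k=1)[0]
        walks.extend(classical_walk(graph, start, wl, rng))
    return walks


# Hypothetical toy directed network given as adjacency lists.
graph = {0: [1, 2], 1: [2], 2: [0], 3: [0]}
rng = random.Random(0)
walks = generate_walks(graph, 0, num_walks=4, length_dist=[0.2, 0.3, 0.5], rng=rng)
```

The result is one concatenated list of (node, step) tuples for the starting node, matching the representation described above.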
The time and space complexity of computing the random walk structure is O(|N |·nw·s), where s represents the mean length of the walk. We get this time complexity because for each node we create nw walks that make s steps on average. Since only the random walk hashing step uses this representation of random walks, the space complexity can drop to a constant if we merge the first two steps by incrementally calculating the hash value after each walk. This way the walks do not have to be stored.

Random walk hashing
We represent the neighborhood of node $n_i$ numerically by hashing random walks starting in $n_i$. The hashing function can incorporate different sources of information about the network to make a vector $h \in \mathbb{R}^{d_h}$, where $d_h$ is the dimension of the hashing function output. Some examples include normalized occurrences of nodes, the normalized number of nodes with a given degree, or normalized occurrences of communities. We will denote the hash value (vector) for the $i$-th node as $h_i$.
Our implementation uses only neighborhood-level information about the network, i.e. how often a node appears in random walks that start at node $n_i$. We also use a threshold $\varepsilon$ as the lower bound for occurrences. Any node that occurs fewer than $\text{length}(l_i) \cdot \varepsilon$ times is not included in $h_i$. The final hash is a normalized sparse row vector, where values represent how frequently an included node was encountered during a random walk.
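A minimal sketch of this occurrence-based hashing is given below. The dictionary representation of the sparse vector, the function name and the choice of L1 normalization are assumptions; the text above only states that the final vector is normalized.

```python
from collections import Counter


def hash_walks(walk, epsilon=0.005):
    """Turn concatenated (node, step) tuples into a sparse frequency vector.

    Nodes seen fewer than len(walk) * epsilon times are dropped and the
    remaining counts are scaled to sum to 1 (assumed L1 normalization).
    """
    counts = Counter(node for node, _ in walk)
    min_count = len(walk) * epsilon
    kept = {node: c for node, c in counts.items() if c >= min_count}
    total = sum(kept.values())
    return {node: c / total for node, c in kept.items()} if total else {}


# Hypothetical walk list for one starting node (10 visited positions).
example_walk = [(0, 0), (1, 1), (2, 2), (0, 0), (2, 1),
                (0, 2), (0, 0), (1, 1), (0, 0), (2, 1)]
h = hash_walks(example_walk, epsilon=0.25)  # node 1 (2 visits < 2.5) is dropped
```

Because every kept node accounts for at least a fraction epsilon of the visited positions, the resulting vector has at most 1/epsilon nonzero entries.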

Feature selection
Features of the node embedding created by our algorithm are symbolic expressions that can be easily interpreted. We use a subset of nodes as our features to satisfy this goal. The feature values represent the similarity between the neighborhoods of a given node and the node that represents the feature. We will use ind : N → N as the function mapping feature index to the corresponding node.
Feature selection can be done in a supervised or unsupervised manner Saeys et al. (2007). We focus on unsupervised feature selection so that the whole algorithm can remain unsupervised. In feature selection, we want to select nodes that are important for the network structure. We assign a score to each node using the PageRank algorithm Page et al. (1999), then sort them based on this score in the descending order, and select top d nodes as our features.
The PageRank algorithm computes a probability distribution $pr \in \mathbb{R}^{|N|}$, $\|pr\|_1 = 1$, where $pr_i$ approximates the probability of a random walker being at node $i$. When $pr_i$ is high, node $i$ is more likely to be visited and therefore it is likely more important for the structure of the network. Let $r \in \mathbb{R}^{|N|}$ represent a vector of PageRank values for each node. Let $d_j$ represent the degree of the $j$-th node. If the adjacency matrix of the considered network ($A$) is normalized as follows:
$$A_{i,j} = \begin{cases} \frac{1}{d_j} & \text{if } (j, i) \in E, \\ 0 & \text{otherwise,} \end{cases}$$
the computation of PageRank can be formulated as an eigenvalue problem:
$$r = A \cdot r.$$
For larger networks, power iteration is used to approximate the final solution. This procedure first initializes $r_0 = \left[\frac{1}{|N|}, \frac{1}{|N|}, \ldots, \frac{1}{|N|}\right]^T$ (i.e. a discrete uniform distribution), and iterates by computing
$$r_{k+1} = A \cdot r_k$$
until the difference between $r_k$ and $r_{k+1}$ is smaller than some predetermined threshold $\mu$. The final $r$ represents the collection of PageRank values considered in this work. Note that in practice, about 10-50 iterations are needed for convergence, rendering this method highly scalable.
We chose PageRank as the scoring function used in feature selection because it is fast, unsupervised and gives a good approximation for node importance. In Section 3.6 we further use feature ranking to develop an extension to the algorithm that estimates $d$ such that the embedding we get is equivalent in size to a chosen dense embedding.
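The power iteration and the subsequent top-d selection can be sketched as follows. This is an illustrative plain-Python version under simplifying assumptions: no damping factor, every node has at least one outgoing edge, and the L1 norm is used for the stopping criterion. The function names and the toy network are hypothetical.

```python
def pagerank_power_iteration(out_edges, mu=1e-10, max_iter=100):
    """Iterate r_{k+1} = A r_k, where A[i][j] = 1/d_j for each edge (j, i).

    `out_edges` maps every node to a non-empty list of successors
    (assumption: no dangling nodes, no damping).
    """
    nodes = list(out_edges)
    r = {n: 1.0 / len(nodes) for n in nodes}  # uniform initial distribution
    for _ in range(max_iter):
        nxt = dict.fromkeys(nodes, 0.0)
        for j, successors in out_edges.items():
            share = r[j] / len(successors)
            for i in successors:
                nxt[i] += share
        diff = sum(abs(nxt[n] - r[n]) for n in nodes)
        r = nxt
        if diff < mu:  # stop once successive iterates agree
            break
    return r


def top_d_features(scores, d):
    """Nodes sorted by descending score, truncated to the first d."""
    return sorted(scores, key=scores.get, reverse=True)[:d]


# Hypothetical strongly connected, aperiodic toy network.
toy_edges = {0: [1, 2], 1: [2], 2: [0]}
scores = pagerank_power_iteration(toy_edges)
features = top_d_features(scores, d=2)
```

On this toy network the stationary distribution is (0.4, 0.2, 0.4), so nodes 0 and 2 are selected as the two features.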

Similarity calculation
The proposed algorithm creates a symbolic node embedding matrix M where row m i represents the similarity of the i-th node to nodes chosen as features. This similarity is calculated in the final step from hash values h i generated in the random walk hashing step. We compare the hash value h i of the i-th node to the hash value h ind(j) of the j-th feature.
Cosine similarity is a metric of similarity defined as the cosine of the angle between two non-zero vectors:
$$\cos(a, b) = \frac{a \cdot b}{\|a\|_2 \, \|b\|_2},$$
where $a$ and $b$ represent the two vectors.¹ The similarity score between two vectors without common features is 0, and between two vectors pointing in the same direction is 1. Because of this, the similarity between two vectors can easily be interpreted. Further, since the score can be 0, this metric works well with sparse representations.
In Section 5.4 we further demonstrate the advantage of cosine similarity and show how different measures of distance compare against it.
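The similarity calculation step can be sketched with sparse vectors stored as dictionaries. This stands in for the sparse-matrix routine used in the actual implementation; the function names, the dictionary layout and the example hashes are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {index: value}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def embed(hashes, features):
    """Row of node i: similarity of h_i to the hash of every feature node."""
    return {node: [cosine_similarity(h, hashes[f]) for f in features]
            for node, h in hashes.items()}


# Hypothetical hashes: nodes 0 and 2 share a neighborhood, node 1 does not.
hashes = {0: {0: 0.5, 1: 0.5}, 1: {2: 1.0}, 2: {0: 0.5, 1: 0.5}}
M = embed(hashes, features=[0, 1])
```

Each column of `M` corresponds to one feature node, so an entry can be read directly as "how similar is this node's neighborhood to the neighborhood of that feature node".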

Estimating the representation dimension
One of the key features of SNoRe is its ability to construct sparse representations of individual nodes. Compared to e.g., DeepWalk and similar methods, where the dimension is predetermined, SNoRe exploits the following theoretical insight to construct a high dimensional representation with the same (or lower) memory footprint than the comparative methods.
As the dimensions in SNoRe can be computed independently (walks w.r.t. individual nodes are independent), this feature offers an iterative expansion of the representation until a sufficient number of e.g., floating-point values are obtained.
The following example demonstrates the mentioned functionality. Consider a situation where SNoRe is to be compared against a dense representation learning algorithm, which learns d dimensional representations of nodes. Assuming |N | instances, the total space required to store the representation can be denoted with τ = |N | · d (floating-point values). The SNoRe algorithm constructs the representation requiring the same (or less) space.
We follow the first three steps of the algorithm to create hash values and the list of nodes sorted in the descending order by their PageRank score. We then add features incrementally, calculating the similarity between the added feature and all nodes. After each calculation, we subtract the number of nonzero values for the feature from $\tau$ and return the created embedding when $\tau$ drops below zero.
¹ We use the scikit-learn implementation Pedregosa et al. (2011) for efficient cosine similarity calculation between sparse vectors.
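The budgeted feature expansion can be sketched as below. The sketch makes one conservative assumption that differs slightly from the description: it stops before adding a column that would exceed the budget, so the result never holds more than τ nonzero values. The function names and example data are hypothetical, and a dot product on unit-length vectors stands in for cosine similarity.

```python
def dot(a, b):
    """Dot product of two sparse {index: value} vectors."""
    return sum(v * b.get(k, 0.0) for k, v in a.items())


def size_dependent_features(hashes, ranked_nodes, dense_dim, similarity=dot):
    """Add features in rank order while the nonzero budget tau allows it."""
    tau = len(hashes) * dense_dim  # budget: |N| * d stored values
    columns = []
    for feature in ranked_nodes:
        column = {}
        for node, h in hashes.items():
            score = similarity(h, hashes[feature])
            if score > 0:  # only nonzero similarities are stored
                column[node] = score
        if len(column) > tau:  # assumption: skip the column that overflows
            break
        columns.append((feature, column))
        tau -= len(column)
    return columns


# Hypothetical unit-length hashes, so the dot product equals cosine similarity.
hashes = {0: {0: 1.0}, 1: {0: 1.0}, 2: {2: 1.0}}
columns = size_dependent_features(hashes, ranked_nodes=[0, 2, 1], dense_dim=1)
selected = [feature for feature, _ in columns]
```

Here the budget is three values: feature 0 uses two, feature 2 uses one, and feature 1 no longer fits.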
During testing, we observed that the quality of the embedding is not affected much by small changes in the similarity score and that digitizing it sometimes helps classification (showcased in Figures 15 and 16 in the Appendix). Because of this, we divide the interval $[0, 1]$ into sub-intervals and use them to discretize the similarity score between two hashes. We replace the similarity score $M_{i,j}$ with idx d, where idx denotes the index of the sub-interval containing $M_{i,j}$. This allows us to store values using fewer bits and consequently create an embedding with more features that takes up the same amount of space.
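One possible reading of this discretization is uniform quantization of the similarity score, sketched below. The number of sub-intervals (`bins`) and the one-byte storage are hypothetical choices made for illustration, since the text here does not fix them.

```python
from array import array


def discretize(score, bins=256):
    """Index of the equal-width sub-interval of [0, 1] containing `score`."""
    return min(int(score * bins), bins - 1)  # score == 1.0 falls in the last bin


scores = [0.0, 0.3, 0.75, 1.0]
indices = [discretize(s, bins=4) for s in scores]
packed = array("B", (discretize(s) for s in scores))  # one byte per value
```

Storing the sub-interval index instead of a floating-point score is what reduces the number of bits per stored value.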
The automatic sparse representation construction is outlined in lines 12-25 of Algorithm 2. This extension is presented as SNoRe with Size Dependent Features (SNoRe SDF) in Section 5.

SNoRe overview
The pseudocode of SNoRe is presented in Algorithm 2. Function SAMPLE takes a distribution vector described in Section 3.2 as the input and returns an integer representing the walk length sampled from it. Function WALK returns a structure that represents a random walk and takes as arguments the starting node and the walk length. Function HASH returns the hash value of the inputted walks. Function PAGE RANK returns a sorted list of nodes based on their PageRank score. Function SIM returns a number between 0 and 1 that represents the similarity between two hashes given as input (distance between the obtained walk distributions).
Lines 1-6 show the random walk generation step. The outer loop iterates over nodes and the inner loop over random walks for each node. In line 4 the generated random walk is transformed into a suitable representation and appended to the ones already generated. In the implementation, we use memoization to sample walk lengths once and use them for all nodes instead of sampling the length for each walk. The generated walks are used in the random walk hashing step that is outlined in lines 7-10.
Hash values (vectors) are generated in the loop shown in lines 8-10. Since hashes are independent between nodes we parallelized this step in the implementation.
The version of the algorithm described in the pseudocode also estimates the representation dimension as shown in Section 3.6. This is done in lines 11-25. Line 11 calculates the PageRank score of nodes and sorts them. The embedding is iteratively calculated in lines 14-25, adding one feature in each pass until $\tau < 0$ or we run out of features that can be added. The estimation also uses the similarity calculation step, denoted in lines 17-23. The algorithm finishes in line 26, where it returns the embedding of size $|N| \times l$ with at most $\tau = |N| \cdot d$ floating-point values.
[Algorithm 2: SNoRe pseudocode. Recoverable step labels: random walk generation (for j = 1 to num walks), random walk hashing, unsupervised feature ranking.]

Theoretical properties
For an algorithm to be useful it has to have time and space complexities that are not too resource-intensive. Using the definitions from the previous sections and the understanding of how the algorithm works we next derive the time and space complexity of SNoRe.

Time complexity
To present the time complexity we describe how each step of the algorithm behaves and sum the gathered complexities. We simultaneously describe the time complexity of random walk creation and hashing step, since they can be implemented together efficiently as shown in Section 3.2.
Random walk creation and hashing are computed in O(|N | · nw · s), since we need to create nw walks with an average of s steps for |N | nodes, whilst assuming that every step takes O(1) time. Hashing maintains this complexity since each of |N | · nw walks needs O(s) time to be hashed.
The time complexity of feature selection depends mostly on the algorithm used for selecting the representative subset of nodes. For feature ranking we used the PageRank algorithm, that has time complexity O(c · |E|) when networks are represented with a sparse adjacency matrix. In the aforementioned time complexity c represents the maximum number of iterations. We also need additional O(|N | · log |N |) to sort feature scores and gather first d features. This can be done more efficiently by only selecting top d features, but we rank all nodes for use in the extension.
To calculate the time complexity of the last step, we focus on the time needed to calculate the similarity between two hash values, since this has to be calculated $|N| \cdot d$ times to create the final node embedding matrix. We use a sparse implementation of the cosine similarity function with sparse vectors containing at most $\frac{1}{\varepsilon}$ non-zero values. Because of this, we need $O\left(\frac{1}{\varepsilon}\right)$ time to compute the similarity between two hashes. Consequently, we need $O\left(|N| \cdot d \cdot \frac{1}{\varepsilon}\right)$ to calculate the similarity between each node and each feature.
The algorithm extension shown in Section 3.6 only impacts the size of $d$, since the other operations it uses do not contribute significantly to the time complexity and can therefore be omitted. Since we use nodes as features, $d \leq |N|$ still holds.
Summing the time complexity of all steps, we get the following time complexity:
$$O\left(|N| \cdot nw \cdot s + c \cdot |E| + |N| \cdot \log |N| + |N| \cdot d \cdot \frac{1}{\varepsilon}\right).$$

Space complexity
The space complexity can be calculated similarly to the time complexity by considering the four parts of the algorithm and merging the random walk creation and hashing steps. Furthermore, we need $O(|E|)$ for the sparse adjacency matrix to represent the network. We can compute the random walk creation and hashing steps in $O\left(\frac{|N|}{\varepsilon}\right)$ space. Since random walks and hash value calculation can be done for each node independently, we need $O(nw \cdot s)$ space for random walk creation and $O\left(\frac{1}{\varepsilon}\right)$ space to store the sparse vector that represents the hash value for this node. This holds because at most $\frac{1}{\varepsilon}$ values can be greater than the threshold $\varepsilon$. Since node occurrence is usually not uniform and many nodes occur more frequently than $\varepsilon$, the used space is usually smaller than this. We get the space complexity $O\left(\frac{|N|}{\varepsilon}\right)$ for these two steps by concatenating the hash representations of each node.
The space complexity of feature selection depends on $d$ and the space complexity of the algorithm used for feature selection. We use PageRank, which uses $O(|E|)$ space to store a sparse adjacency matrix. We also need $O(d)$ to store the selected features.
The similarity calculation step creates a (sparse) matrix of size $|N| \cdot d$, where $d \leq |N|$. To calculate the similarity between two hashes we only need constant additional space. If we put the space complexity of all steps together, we get the final space complexity:
$$O\left(|E| + nw \cdot s + \frac{|N|}{\varepsilon} + |N| \cdot d\right).$$
The analysis can be tightened for the algorithm extension in Section 3.6, since it generates a sparse matrix that uses at most $\tau = |N| \cdot d$ space, where $d$ is the dimension of a dense embedding.

Datasets and experimental setting
In this section, we describe the datasets used to evaluate the performance of the proposed embedding algorithm, the experimental setting and the baselines we used to compare the results with.

Datasets
The datasets used for the evaluation of the embedding algorithms consist of 11 real-world complex networks. The summary of these datasets is shown in Table 1. We also show visualizations of the Cora and Pubmed datasets in Figure 2. In the figure, target classes are represented using different colours.
• Ions Škrlj et al. (2018, 2019a) is a network of ion binding sites, linked by their structural similarity. The target class is the type of ion that binds to a given protein substructure (node).
• Cora Lu and Getoor (2003) is a network of scientific publications and the citations between them. The labels represent the topic categories of the publication.
• CiteSeer Lu and Getoor (2003) is a network of scientific publications and the citations between them. The labels represent the topic categories of the publication.
• Homo sapiens (PPI) (as used in Grover and Leskovec (2016)) is a network of the proteome, i.e. a set of proteins which interact with each other. The labels represent the protein functions.
• BlogCatalog Zafarani and Liu (2009) is a network of social relationships on the Blogger website. The labels represent interests inferred from metadata provided by the authors.
• Coauthor-CS Shchur et al. (2018) is a computer science co-authorship network where nodes represent authors and edges represent that two authors co-authored a paper. The labels represent the authors' most active fields of study.
• Pubmed (as used in Wang and Leskovec (2020)) is a network of scientific publications and the citations between them. The labels represent the topic categories of the publication.
• Coauthor-PHY Shchur et al. (2018) is a physics co-authorship network where nodes represent authors and edges represent that two authors co-authored a paper. The labels represent the authors' most active fields of study.

Experimental setting
When comparing the proposed method to the baselines, we evaluated the performance of a given embedding algorithm in the following way.
• We embedded the network to a low-dimensional representation.
• We made ten copies of the embedding and corresponding labels and shuffled each.
• We evaluated the performance on each copy using logistic regression and a training set of increasing size, i.e. from 10% to 90%. We classified each node into the top $k_i$ classes based on the probability returned from the classifier, where $k_i$ represents the number of classes of a given node.
• We calculated micro and macro F1 score and averaged the results for each percentage.
• We performed the described test for each embedding algorithm ten times.
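The top-k classification and the two F1 variants used in this protocol can be sketched as follows. The classifier itself is omitted: the probability rows are hypothetical example data, the function names are illustrative, and treating an undefined per-class F1 as 0 is an assumption.

```python
def top_k_predictions(probabilities, k):
    """Indices of the k most probable classes for one node."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda c: probabilities[c], reverse=True)
    return set(ranked[:k])


def f1_scores(true_sets, pred_sets, num_classes):
    """Return (micro F1, macro F1) for multi-label predictions."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for truth, pred in zip(true_sets, pred_sets):
        for c in range(num_classes):
            if c in pred and c in truth:
                tp[c] += 1
            elif c in pred:
                fp[c] += 1
            elif c in truth:
                fn[c] += 1

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0  # assumption: 0 when undefined

    micro = f1(sum(tp), sum(fp), sum(fn))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in range(num_classes)) / num_classes
    return micro, macro


# Hypothetical true label sets and classifier probabilities for three nodes.
truth = [{0}, {1, 2}, {2}]
probs = [[0.7, 0.2, 0.1], [0.1, 0.5, 0.4], [0.6, 0.3, 0.1]]
preds = [top_k_predictions(p, len(t)) for p, t in zip(probs, truth)]
micro, macro = f1_scores(truth, preds, num_classes=3)
```

Micro F1 pools the counts over all classes, whereas macro F1 averages the per-class scores, so rare classes weigh more heavily in the latter.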
The exception to this method of testing is the Label Propagation algorithm that does not use an embedding. To test it we ran the algorithm 100 times with the randomly selected training set of increasing size from 10% to 90%, similarly to how we tested the other embedding algorithms.
All experiments were conducted on a machine with 128 GB RAM, Intel(R) Xeon(R) Gold 6150 @ 2.7 GHz with a NVIDIA Tesla V100 SXM3 32 GB GPU. The approaches that consumed more than 128 GB of RAM were marked as unsuccessful and are shown as Out Of Memory (OOM) in the results. We added this constraint because we use medium-sized datasets for testing and the methods that need more memory would probably not scale well to larger networks.
As default parameters for SNoRe we use = 0.005, maximum walk length = 5, number of walks = 1024 and 2048 pivot nodes (d). For SNoRe SDF we use the same parameters except that we use d equivalent to a dense representation with 256 features (τ = |N | · 256). We have chosen 256 features because other embedding algorithms we tested use 32-bit floating-point numbers with 128 features while we use 16-bit floating-point values, making the size of the embedding the same.
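The size argument behind the choice of 256 features can be checked with a short calculation. The snippet below is only an illustration: the node count is a hypothetical value, and the comparison covers the stored values alone, ignoring any index overhead that a sparse matrix format would add.

```python
import numpy as np

n_nodes = 1000  # hypothetical network size, for illustration only

# Dense baseline embedding: 128 features stored as 32-bit floats.
dense_baseline = np.zeros((n_nodes, 128), dtype=np.float32)
# Value budget quoted for SNoRe SDF: 256 features stored as 16-bit floats.
snore_budget = np.zeros((n_nodes, 256), dtype=np.float16)

# 128 * 4 bytes and 256 * 2 bytes are both 512 bytes per node.
bytes_per_node_baseline = dense_baseline.nbytes // n_nodes
bytes_per_node_snore = snore_budget.nbytes // n_nodes

# tau = |N| * 256 is the total number of values allowed in the representation.
tau = n_nodes * 256
```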

Baselines
We compared the results of the proposed approach against the results of eight other baselines outlined below. Seven of these are embedding algorithms, the exception being Label Propagation that performs classification directly on the network structure.
• Random baseline creates an embedding of size |N| × 64 with random numbers drawn from Unif(0, 1).
• Label Propagation (LP) Xiaojin and Zoubin (2002) propagates labels of annotated nodes through the network until convergence or the maximum number of iterations. We used alpha = 0.9 as parameter.
• VGAE Kipf and Welling (2016) is a variational auto-encoder that uses latent variables to learn a model that can be interpreted. This auto-encoder is used mostly for link prediction. We used default parameters epochs = 200, learning rate = 0.01, 32-dim hidden layer and 16-dim latent variables in the experiments.
• Personalized Page Rank with Shrinking (PPRS). This variant of Personalized PageRank was developed as part of HINMINE methodology Kralj et al. (2017). The algorithm, for each node, computes its representation by iteratively obtaining a discrete stationary distribution of walk visits. The shrinking offers additional speedups. We use probability threshold = 0.0005 and number of important = 1000 that are the default parameters for testing.
• DeepWalk Perozzi et al. (2014) equates random walks to sentences. These sentences are used to learn the network representation using a language model together with deep learning. We use default parameters: representation size = 128, walk length = 80, and the number of walks = 10 in the experiments.
• NetMF tries to approximate the closed form of DeepWalk's implicit latent walk matrix. The re-implementation is suitable for highly sparse matrices and is optimized for running on GPUs, offering substantial performance improvements. We use the default parameters: dimension = 128, window size = 10, rank = 248 and negative = 1 in the experiments.
• LINE Tang et al. (2015b) is one of the first network embedding algorithms. It uses an objective function that preserves first and second-order proximities. We use default parameters: embedding dimension = 200 and the number of negative samples = 5 in the experiments.
• node2vec Grover and Leskovec (2016) learns a low dimensional representation of nodes that maximizes the likelihood of neighborhood preservation using random walks. We use default parameters: embedding dimension = 128, walk length = 80, number of walks = 10 and window size = 10 in the experiments.

Results
We next present the results of the empirical evaluation. We begin with the classification results across the considered real-life datasets, followed by a series of ablation studies, where we explored SNoRe's behaviour in more detail, ranging from its explainability capabilities to behaviour w.r.t. different hyperparameter settings.

Classification results
Classification results are visualized in Figures 3 and 4 and are also presented in tabular format, where average performances across different training percentages are reported alongside the corresponding standard deviations (Tables 2 and 3). It can be observed that the proposed SNoRe algorithm performs competitively with, or even outperforms, the considered baselines. We can see that SNoRe and its extension SNoRe SDF work well on co-authorship networks, citation networks, Cora and the Ions dataset. Their results on both co-authorship networks are interesting since the F1 scores are low at first, but then rise quickly and become the best among all baselines once enough training instances are used. Our algorithm performs poorly compared to other baseline methods on datasets such as Wikipedia and BlogCatalog, where nodes of the same class do not necessarily have similar neighborhoods; the co-authorship datasets, in contrast, potentially do have this property. We can see that all embedding algorithms perform similarly to the random baseline on both Bitcoin datasets. This potentially shows that some datasets may not be suitable for learning. Similar results can be observed in the averaged-out results (Tables 2 and 3), indicating that SNoRe and its extension offer state-of-the-art performance while offering fundamentally different representation learning capabilities (sparse and symbolic).

Statistical analysis
This section presents the statistical comparison of embedding algorithms using average rank diagrams Demšar (2006) and Bayesian comparison Benavoli et al. (2017).
Average rank diagrams are shown in Figures 5 and 6. These diagrams display the mean rank of algorithms over all datasets along the horizontal line. The ranks used in these diagrams are assigned to the algorithms based on their best performing percentage on a given dataset. We assigned ranks in this way because we usually only want to classify a few new instances using models trained on vast amounts of labeled data. More diagrams showing the performance where ranks are assigned based on mean results over all percentages on a dataset can be found in the Appendix (Figures 17 and 18).
We see that for both the micro and macro F1 metrics SNoRe SDF performs best out of all algorithms, and that when a constant number (2048) of features is used, SNoRe performs noticeably worse, ranking fifth overall in the macro F1 metric.
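The mean ranks displayed in such diagrams can be computed as in the sketch below. The score matrix is made up for illustration and does not reproduce any result from the tables; the function name is likewise an assumption. Each row would hold the best score of every algorithm on one dataset, for example the maximum over training percentages.

```python
import numpy as np
from scipy.stats import rankdata


def average_ranks(scores):
    """scores: (n_datasets, n_algorithms), higher is better.

    Returns the mean rank of each algorithm; rank 1 is best and ties share
    the average of the ranks they span.
    """
    scores = np.asarray(scores, dtype=float)
    ranks = np.vstack([rankdata(-row, method="average") for row in scores])
    return ranks.mean(axis=0)


# Hypothetical F1 scores of three algorithms on three datasets.
demo_scores = [
    [0.80, 0.75, 0.60],
    [0.70, 0.72, 0.50],
    [0.90, 0.85, 0.85],
]
mean_ranks = average_ranks(demo_scores)
```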
Bayesian approaches to comparing classifier performance were recently introduced as a way to combat the shortcomings of methods like null hypothesis significance testing (NHST) Benavoli et al. (2017). We use the Bayesian variant of the hierarchical t-test to determine differences in the performance of the compared classifiers. This test distinguishes between three scenarios: two where one of the classifiers outperforms the other, and one in which the difference in classifier performance lies in the region of practical equivalence (rope). The size of the rope is a free parameter set to 0.01 in our experiments, which means that two performances are considered the same if they differ by less than 0.01.
As Bayesian multiple classifier correction cannot be intuitively visualized for more than two classifiers, we show the comparison between SNoRe SDF and node2vec as well as Label Propagation in Figure 7. The two comparisons are used to demonstrate the performance against a strong and a weak baseline. The data used to make these comparisons was collected over all datasets using ten repetitions of ten-fold cross-validation. Green dots in the triangles represent samples obtained from the hierarchical model. As the sampling procedure is governed by the underlying data, each green dot falls into one of three categories: classifier one dominates (left), classifier two dominates (right), or the difference in the classifiers' performance lies in the region of practical equivalence (up). Upon model convergence, some areas of the triangle are more densely populated, showing a higher probability that one classifier outperformed the other. We can see that in our experiment SNoRe SDF significantly outperformed the Label Propagation algorithm in both the micro and macro F1 metrics, with almost all green dots in the far left corner. More interesting are the comparisons against node2vec, where SNoRe still outperforms node2vec, whose probability of outperforming SNoRe is only 2%. Here a lot of dots are in the region of practical equivalence, showing that the two algorithms often perform similarly.
Figure 6: Macro F1 average rank diagram.
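The three-way reading of the sampled differences can be illustrated with a few lines of code. This sketch does not implement the hierarchical model itself; it only assumes that samples of the performance difference (classifier one minus classifier two) are already available and shows how they are split by the rope. The function name and the sample values are hypothetical.

```python
import numpy as np


def rope_probabilities(diff_samples, rope=0.01):
    """Split sampled performance differences into three outcomes.

    Returns (P(first better), P(practically equivalent), P(second better)),
    estimated as the fraction of samples falling into each region.
    """
    diff_samples = np.asarray(diff_samples, dtype=float)
    p_first = float(np.mean(diff_samples > rope))
    p_second = float(np.mean(diff_samples < -rope))
    p_rope = 1.0 - p_first - p_second
    return p_first, p_rope, p_second


# Made-up differences in F1 between two classifiers.
demo_samples = [0.05, 0.03, 0.005, -0.002, -0.04]
p_first, p_rope, p_second = rope_probabilities(demo_samples, rope=0.01)
```

In the triangle plots described above, these three fractions correspond to the share of dots in the left, upper and right regions respectively.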

Ablation study - representation dimension
Having shown that the default hyperparameter setting = 0.005, maximum walk length of 5, number of walks = 1024 and 2048 pivot nodes performs competitively with the state of the art, we conducted additional experiments to better understand how SNoRe's behaviour depends on the number of pivot nodes across the datasets used. Figures 8 and 9 show the relative impact of different numbers of pivot nodes. From both heatmaps, we can distinguish two types of datasets: those where the score rises gradually with the number of features, and those where the score is similar no matter the number of features. From Figures 3 and 4 we can further observe that on datasets where the number of features does not affect the score, the results are usually similar no matter which embedder we use, and in some cases are even close to those given by the random baseline. These datasets are less susceptible to classification and are therefore harder, or possibly not suitable, for the type of features our algorithm learns.

Ablation study - effect of different distances (metrics)
In our approach, we selected cosine similarity to calculate the distance between two vectors because it offers good results and works well with sparse representations in both the calculation and the final embedding matrix. Since the choice of this distance metric is arbitrary, we next show how the choice of distance metric affects the results. Table 4 shows the different distance metrics we compared using the same default parameters. In the formula for standardized Euclidean distance, v_i represents the variance of the i-th feature in the hash vector. These metrics were chosen because they represent different groups of distance metrics. Euclidean distance is a special case of Minkowski distance where p = 2. It measures the distance between two points in Euclidean space. By taking the variance of each dimension into account during the calculation of Euclidean distance, we get the Standardized Euclidean distance, which is usually more robust when dimensions are scaled differently. Canberra distance is mostly used in intrusion detection and computer security and is a metric suitable for data scattered around the origin. The last metric we showcase is the Jaccard similarity. This metric works on binary data and calculates similarity based on whether a feature is present or not, but can also be generalized for use with numeric values.
The results for the different distance metrics are shown in Figures 10 and 11. We can see that most metrics perform similarly on the Bitcoin datasets, Citeseer, Cora, Homo sapiens, Ions and Pubmed. On BlogCatalog, both Euclidean metrics performed a lot better than the other three. On both co-authorship datasets, cosine similarity performed worse than the other metrics, but the byte size of the embedding is significantly smaller since the embedding matrix is very sparse. Using SNoRe SDF where the size of the representation is less than τ = |N| · 128, we get results that are better than those of the other metrics.
Using different distance metrics also helps on the Wikipedia dataset, where the score is a lot higher for the Jaccard and Canberra metrics. As expected, the Euclidean and Standardized Euclidean distances perform very similarly, since our hash function already normalizes the hash values.
It should also be noted that the cosine and Jaccard similarities give us sparse embeddings, which perform significantly better when compared to embeddings calculated using other metrics of the same size in bytes.
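The metrics compared in this section are all available in SciPy, which makes it easy to see how they differ on the same vectors. The snippet below is a sketch only: the small matrix stands in for neighborhood hash vectors and is made up, the first two rows are arbitrarily treated as pivot nodes, and SciPy returns distances, so a similarity is obtained as one minus the distance.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical hash vectors of four nodes (rows sum to 1).
hashes = np.array([
    [0.50, 0.50, 0.00, 0.00],
    [0.00, 0.50, 0.50, 0.00],
    [0.00, 0.00, 0.50, 0.50],
    [0.25, 0.25, 0.25, 0.25],
])
pivots = hashes[:2]  # assumption: the first two nodes act as features

# Per-feature variance v_i used by the standardized Euclidean distance.
feature_variance = hashes.var(axis=0, ddof=1)

distances = {
    "cosine": cdist(hashes, pivots, metric="cosine"),
    "euclidean": cdist(hashes, pivots, metric="euclidean"),
    "seuclidean": cdist(hashes, pivots, metric="seuclidean", V=feature_variance),
    "canberra": cdist(hashes, pivots, metric="canberra"),
    # Jaccard operates on presence or absence, so the vectors are binarized.
    "jaccard": cdist(hashes > 0, pivots > 0, metric="jaccard"),
}

cosine_similarity = 1.0 - distances["cosine"]
```

Note that only the cosine and Jaccard similarities are exactly zero for nodes whose hashes share no entries with a pivot, which is what keeps the resulting embedding sparse.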

Ablation study - evaluation time
In Section 3.8 we give the theoretical bounds on time complexity. Here we give further empirical results and compare them to other baselines and distance metrics. We also show how the number of features affects calculation time. The results for the different baselines are shown in Figure 12. We can see that both SNoRe and SNoRe SDF need a similar amount of time to finish and are usually the fastest, just ahead of NetMF (SCD). We can see that SNoRe SDF is the fastest on small datasets but needs a little more time than SNoRe on larger datasets, where more features need to be chosen. Since we use sparse matrices to store random walk hashes, the implementation of the similarity calculation step is crucial for obtaining good performance. This can be seen in Figure 20, where Euclidean distance and cosine similarity, which are optimized for sparse matrices, are significantly faster than the others.
Lastly, we show how the execution time varies with the number of pivot nodes. We see in Figure 19 that the number of features impacts the execution time, but that the difference is not that significant and that the impact of the number of nodes is far greater. This gives us further reason to use SNoRe SDF, since the execution time is not impacted much if more features are used.
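To illustrate why a sparse-aware similarity step matters, the sketch below computes cosine similarity between all hash vectors and a set of pivot rows using only sparse matrix products. It is not the implementation used in the experiments: the function name, the toy matrix and the choice of pivot rows are assumptions made for the example.

```python
import numpy as np
from scipy import sparse


def sparse_cosine_similarity(hashes, pivot_rows):
    """Cosine similarity of every row to the pivot rows, kept sparse throughout."""
    hashes = sparse.csr_matrix(hashes, dtype=np.float64)
    # Row norms, with empty rows guarded against division by zero.
    norms = np.sqrt(np.asarray(hashes.multiply(hashes).sum(axis=1)).ravel())
    norms[norms == 0] = 1.0
    normalized = sparse.diags(1.0 / norms) @ hashes
    # Rows without shared non-zero entries produce no stored value at all.
    return (normalized @ normalized[pivot_rows].T).tocsr()


# Hypothetical hash vectors of four nodes; rows 0 and 1 act as pivots.
demo_hashes = np.array([
    [0.50, 0.50, 0.00, 0.00],
    [0.00, 0.50, 0.50, 0.00],
    [0.00, 0.00, 0.50, 0.50],
    [0.25, 0.25, 0.25, 0.25],
])
similarity = sparse_cosine_similarity(demo_hashes, [0, 1])
```

The cost of the product depends on the number of non-zero entries rather than on the full matrix size, which is consistent with the observation that metrics with sparse-optimized implementations run much faster.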

Ablation study - Explainability
In the final set of experiments, we demonstrate how SNoRe can be coupled with existing model explanation approaches such as SHapley Additive exPlanations (SHAP) Lundberg and Lee (2017); Štrumbelj and Kononenko (2014). SHAP is a game-theoretic approach used to explain any type of classification or regression model. The algorithm perturbs subsets of input features to take into account the interactions and redundancies between features. The explanation model can then be visualized, showing how the feature values of an instance impacted its classification.
We use the following methodology to explain how different feature values representing nodes impact how the classifier assigns a label to a node. This process is showcased using SNoRe on the Pubmed dataset. First, we create the embedding and save the indexes of the nodes used as features. We then train the XGBoost model and pass it to the SHAP tree explainer. We can then explain how different feature values impact an instance, or create a summary of the impact for all instances. We created such a summary using the SHAP library Lundberg et al. (2020) and visualized the results in Figure 13. In the figure, the features are already renamed to the indexes of the corresponding nodes (feature index i is renamed to node ind(i)). Red and blue dots represent the feature value, red being 1 and blue 0. We can see that usually only high (non-zero) values impact how the model classifies a given instance, since only those give information about a node's neighborhood. This can be seen in the figure, especially for the first eight features of class 2. From the first feature in the summary table for class 0 (node 13475), we can see that sometimes even low feature values (merely their presence) can have a big impact on the classification. The plot in the bottom right of Figure 13 shows how much impact a feature has on average. We can see that node 11449 has the biggest impact on the classification of nodes and that, usually, when its value is high the node is classified into class 1.
Similarly, we can show which nodes impacted the classification of a single instance to explain why the node was classified as it was.

Ablation study - latent clustering with UMAP
We also look at how nodes cluster together, using the UMAP algorithm McInnes et al. (2018) to transform embedding vectors into 2D space. We saved the embedding of SNoRe SDF and used the default parameters of the unsupervised UMAP algorithm to generate node positions, as shown in Figure 14. The class to which a node belongs is shown as colour in the plot and is added only for visualization. In general, we see that nodes that belong to the same class are embedded near each other, as best seen on the Coauthor PHY dataset. On the Pubmed dataset, we can observe that the classes coloured red and blue cluster well, while the green one is scattered all over the plot and does not cluster well. Nodes in the Cora embedding cluster well, but the classes are close together and sometimes overlap. The worst example we show is the Citeseer dataset, where nodes do not cluster well and classes overlap a lot, although some clusters can still be seen.

Discussion
In this section we summarize the main results and their implications, and discuss the limitations of the proposed SNoRe approach.
As empirically shown in Section 5, SNoRe and SNoRe SDF outperform state-of-the-art methods on most datasets and perform comparably or slightly worse on others (e.g., Homo sapiens, Wikipedia). Coupled with the ability to use different distance metrics, and with the speed and explainability of the embedding, this algorithm provides a very good alternative to state-of-the-art algorithms. We further back this claim in Section 5.2, where a pairwise Bayesian performance comparison shows SNoRe outperforming node2vec.
In both execution time and classification results, we show that the proposed algorithm is scalable since it achieves best results on both the smallest and the largest dataset while using the same amount of space or less than the baselines we compared it to. This is further shown in Figures 19 and 20 where the different number of features and distance metrics display the benefits of an efficient implementation. By observing the classification results of embeddings that have a different number of features we have observed another interesting phenomenon. On datasets where all baselines achieved results that were similar to the random baseline, the number of features did not matter and an embedding with 4 features achieved similar results to the one with 4096. This gives us the ability to judge how susceptible a dataset is for classification.
While observing classification results across different fixed numbers of features can give us an idea of how susceptible a dataset is to classification, observing the number of features returned by SNoRe SDF can give us some insight into the structure of the network. This is most notable on the Wikipedia dataset, where SNoRe SDF gives us a dense embedding since all nodes have at least one node in common. On the other hand, when using the same amount of space as a dense embedding on the Coauthor PHY dataset, our algorithm generates an embedding with all nodes used as sparse features. This shows that the Coauthor PHY network is a lot more decentralized than the Wikipedia one.
Figure 14: UMAP clustering on Cora, Citeseer, Pubmed and Coauthor PHY datasets.
SNoRe uses nodes as features, making it possible to explain the reasoning behind why an instance was classified in a certain way. This can be done with the use of tools such as SHAP and allows us to use this embedding algorithm in situations where explainability is crucial such as medicine.
In Section 5.7 we show that our algorithm creates an embedding that places nodes belonging to the same class close together. We do this by using the UMAP algorithm to transform each instance into 2D space and coloring each node according to the class it belongs to. In the corresponding figure, we can easily see how nodes of the same class cluster on the Pubmed and Coauthor PHY datasets and, although less clearly, also on the other ones.
Some of the limitations of our algorithm can be seen on datasets like Wikipedia and BlogCatalog, where the neighborhood of a node is not necessarily important and distinctive enough. Since the algorithm is modular, this can probably be avoided in some cases by changing the hashing function in such a way that it encodes the relevant network structure better.
Although PageRank works very well on most networks, selecting features that lead to good results, we cannot guarantee that the chosen features will span the whole network. This can drastically decrease the performance on some parts of the network, since some nodes may not have neighborhoods that overlap with the neighborhoods of the features.
The last problem to highlight is the number of features in the final embedding. A small number of features is usually not descriptive enough, and therefore the embedding performs badly. On the other hand, a large number of features may give good results but requires more time to train the classifier. Related to this, many classifiers are not optimized for sparse matrices.

Conclusion and Further work
We introduced a scalable unsupervised algorithm for learning symbolic node representations of networks. The algorithm is fast, achieves results that are comparable to or better than those of state-of-the-art algorithms, and can be interpreted when coupled with methods like SHAP.
In further work, we plan to explore how to incorporate different high-level network structures and the effect of different hashing functions. We also want to explore how different feature selection algorithms affect the performance and whether the difference is significant when supervised algorithms are used. Another avenue worth exploring is the use of different walk length distributions, which is not explored in this paper. Lastly, SNoRe's behavior in the inductive and dynamic settings could be explored to further show the algorithm's usefulness.

Availability
The proposed methodology will be available as a Python library at: https://github.com/smeznar/SNoRe.

Acknowledgements
The work of the last author (BŠ) was funded by the national research agency (ARRS)'s grant for junior researchers. The work of other authors was supported by the Slovenian Research Agency (ARRS) core research program P2-0103 and P6-0411, and research projects J7-7303, L7-8269, and N2-0078 (financed under the ERC Complementary Scheme).

SNoRe embeddings on both Co-authorship datasets, Homo sapiens (PPI) dataset and Pubmed dataset.
Figures 17 and 18 show the average rank diagrams when we average classification results of every training set size. Here we see that SNoRe SDF and SNoRe achieve similar ranks in both micro and macro F1. We can see that here SNoRe SDF achieves worse results than on Figures 5 and 6. The reason behind this can be seen on Coauthor CS and Coauthor PHY datasets in Figure 3, where the algorithm performs poorly compared to others when a small amount of training data is used.
Execution time plots can be seen in Figures 19 and 20. We can see that the execution time is not affected significantly by the number of features used, but that it is greatly affected by the distance metric used. Cosine similarity and Euclidean distance, which are optimized in scikit-learn, are significantly faster than the other, non-optimized distance metrics.