A Novel Global Prototype-Based Node Embedding Technique

Node embedding refers to learning or generating low-dimensional representations for the nodes of a given graph. In the era of big data and large graphs, interest in node embedding has grown across a wide range of applications, from social media to healthcare. Numerous research efforts have sought node embeddings that maximally preserve the associated graph properties; however, each embedding technique has its own limitations. This paper presents a method for generating deep neural node embeddings that encode dissimilarity scores between pairs of nodes with the help of prototype nodes spread throughout the target graph. The proposed technique is adaptable to various notions of dissimilarity and yields efficient embeddings capable of estimating the dissimilarity between any pair of nodes in a graph. We compare our technique against relevant state-of-the-art embedding techniques and demonstrate superior results in a number of experiments on several benchmark datasets.


I. INTRODUCTION
A node embedding technique seeks to project nodes into a latent space where geometric relations in that space correspond to relationships (e.g., edges) in the original graph or network [1]. However, no embedding technique ''fully'' preserves both the topological graph properties (adjacency relations between nodes) and the node feature relations (problem-dependent information). Research has produced many embedding techniques that vary in complexity, performance, and the graph properties to be preserved. In general, node embedding methods fall under one of three groups: (i) factorization-based, (ii) random walk-based, and (iii) deep learning-based [2], [3]. Factorization-based algorithms aim to reconstruct some polynomial function of the adjacency matrix from the node embeddings. Random walk-based methods follow a different approach: rather than directly reconstructing a deterministic function of the graph's adjacency matrix, they optimize node embeddings such that two nodes are more likely to have similar embeddings if they tend to co-occur on short random walks over the graph [1], [4]. Deep neural methods generate node representations that depend on the structure of the graph as well as the available feature information; complex deep encoders address the non-linearity and irregularity in the structure of the input graph.
In this paper, we do not consider factorization-based embedding methods because they usually involve costly eigenvalue decomposition operations, which makes them inapplicable to large graphs and therefore of limited use [5], [6], [7]. Instead, we focus on random walk-based methods, specifically the node2vec scheme, and deep neural methods, specifically the GraphSage [8] scheme. The two schemes are widely used, state-of-the-art embedding schemes. However, each scheme has its own limitations. For example, node2vec does not leverage node features in the embedding learning process and cannot generate embeddings for nodes unseen during the training process, a property called transduction [1], [9]. On the other hand, GraphSage assumes that nodes in the same neighborhood should have similar embeddings and depends mainly on node features, so missing or partly noisy node features may significantly degrade downstream task performance [10], [11]. The following highlights the main contributions of this paper:
• A novel deep prototype-based node embedding technique is presented. The proposed technique aims at alleviating some of the shortfalls of relevant state-of-the-art node embedding techniques, including node2vec, GraphSage, and relevant representatives of random walk-based techniques.
• The performance of the proposed technique has been evaluated in node classification and link prediction tasks. Competitive results have been demonstrated against the state-of-the-art node2vec and GraphSage techniques.
• A thorough discussion and analysis is provided to explain the obtained results.
The remainder of this paper is organized as follows: Section II briefly overviews key node embedding techniques, specifically the node2vec and GraphSage embedding methods. Section III gives a detailed description of the proposed embedding technique. Section IV describes the eight benchmark datasets used in the experimental work and compares the performance of the proposed technique against state-of-the-art methods on both node classification and link prediction tasks. Section V discusses the relative merits of the proposed technique. Finally, Section VI concludes the paper.

II. RELATED WORK
This section presents a brief overview of the baseline node embedding methods most relevant to the method proposed in this paper.

A. RANDOM WALK-BASED EMBEDDINGS
Node2vec is chosen as a representative of the random walk-based methods. Node2vec derives its idea from word2vec [12], an embedding method in the natural language processing domain. Word2vec places words having similar context close in the embedding space. In a similar fashion, node2vec takes short random walks in a graph and treats the sequence of nodes in a random walk as a sequence of words in a sentence. The random walks gathered are then forwarded to a single hidden-layer neural network, commonly known as the Skip-gram model [13]. The Skip-gram model is trained to predict the surrounding words given a current word, or in the case of node2vec, to predict the surrounding neighbor nodes given a current node. Node2vec has been preceded by DeepWalk [14], an embedding scheme that follows the same aforementioned approach yet with unbiased random walks. Node2vec extends DeepWalk by employing parameters to control the random walk behavior. Specifically, node2vec utilizes a combination of breadth-first and depth-first node sampling to explore nearby and far nodes in the graph, respectively [15], [16].
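To make the sampling concrete, the following minimal Python sketch generates one second-order biased walk over a toy adjacency list; the graph, parameter values, and function names are illustrative and not taken from the node2vec reference implementation.

```python
import random

def biased_walk(adj, start, length, p=1.0, q=1.0):
    """One node2vec-style second-order biased random walk (minimal sketch on a
    toy adjacency list).

    p: return parameter -- larger p makes revisiting the previous node less likely.
    q: in-out parameter -- q > 1 keeps the walk near the start (BFS-like),
       q < 1 pushes it outward (DFS-like).
    """
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = adj[cur]
        if not nbrs:
            break
        if len(walk) == 1:
            walk.append(random.choice(nbrs))
            continue
        prev = walk[-2]
        # node2vec's unnormalized transition weights for a step prev -> cur -> x
        weights = [(1.0 / p) if x == prev
                   else 1.0 if x in adj[prev]   # x is one hop from prev
                   else (1.0 / q)               # x moves farther from prev
                   for x in nbrs]
        walk.append(random.choices(nbrs, weights=weights)[0])
    return walk

adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
random.seed(7)
walk = biased_walk(adj, start=0, length=5, p=2.0, q=0.5)
```

The walks gathered this way are then treated as sentences and fed to the Skip-gram model.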

B. GRAPH NEURAL NETWORKS (GNNs)
Over the past decade, graph neural network methods have received much attention. This class of methods was pioneered by the seminal work of [17], which introduced the Graph Convolutional Network (GCN) node embedding technique. GNN architectures share the following steps [18], [19]: the node features are initialized to the attribute vectors; then each node gathers its neighbors' feature vectors, and the aggregated vectors are compressed to update the node's features.
Among various methods in this category, GraphSage is a prominent example of GNNs. The feature aggregation function of GraphSage is given in (1):

h^k_{N(v)} = AGG_k({h^{k-1}_u, ∀ u ∈ N(v)}),   h^k_v = σ(W^k · CONCAT(h^{k-1}_v, h^k_{N(v)}))   (1)

where N(v) is the neighborhood of node v (usually N(v) does not exceed 2 hops), h^k_{N(v)} is the aggregated feature vector of node v's neighborhood at the k-th iteration, h^k_v is node v's representation at the k-th iteration, W^k is a trainable parameter matrix, and σ is the sigmoid function. The aggregator functions AGG_k include the mean, the max, and the LSTM [20] (long short-term memory). GraphSage tries to maximize the similarity between the embeddings of u and v, where u and v are neighbors, and to minimize that between u and n, where n is a sampled non-neighbor of u. The similarity between the embeddings of two nodes, by GraphSage's convention, is the dot product of their embeddings.
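The mean-aggregator update can be sketched in plain Python as follows; the toy vectors and weight matrix are hypothetical, and a real implementation operates on batched tensors rather than lists.

```python
import math

def mean_aggregate(h_v, neighbor_feats, W):
    """One GraphSage update with a mean aggregator (illustrative sketch, not
    the reference implementation). The node's own state is concatenated with
    the mean of its sampled neighbors' states, multiplied by the trainable
    matrix W, then passed through the sigmoid."""
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    d = len(neighbor_feats[0])
    mean = [sum(f[j] for f in neighbor_feats) / len(neighbor_feats)
            for j in range(d)]
    concat = h_v + mean
    return [sigmoid(sum(w * x for w, x in zip(row, concat))) for row in W]

h_v = [1.0, 0.0]                      # current representation of node v
neighbors = [[0.0, 2.0], [2.0, 0.0]]  # sampled neighbor representations
W = [[0.5, 0.5, 0.5, 0.5],            # 2 x 4: output dim 2, input dim 2 + 2
     [0.1, 0.2, 0.3, 0.4]]
h_next = mean_aggregate(h_v, neighbors, W)
```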
Position-aware Graph neural networks (PGNNs) is another relevant method introduced in [21] which aims to address some serious limitations of typical GNNs. Existing GNNs are partially capable of detecting the location of a node with respect to other nodes in a graph. In a formal sense, a typical GNN usually fails to reconstruct the shortest path between two nodes given their generated embeddings. A PGNN learns a non-linear distance-weighted aggregation strategy over carefully sampled random subsets of nodes called anchor-sets and computes the distance from a particular target node to each anchor-set. As a result, PGNNs can accurately estimate node locations relative to anchor nodes besides being inductive, scalable, and able to incorporate node features.

III. PROPOSED EMBEDDING TECHNIQUE
This section is divided into two parts, A and B. Part A presents the foundation on which our work is based. Part B describes the proposed embedding technique in detail.

A. BACKGROUND 1) RADIAL BASIS FUNCTION NETWORKS (RBFNs)
Our work draws some inspiration from typical RBFNs [22]. By default, an RBFN is a single hidden-layer feed-forward neural network. For classification tasks, the RBFN operates as follows: an input sample given to the network undergoes a nonlinear mapping at the hidden layer neurons, where each neuron stores a ''prototype'' example belonging to some class in the training set. The activation function of each hidden layer neuron computes a similarity measure against the prototype the neuron stores. Conventionally, Gaussian kernels are used as the activation functions of the hidden layer neurons. The mapping from the hidden to the output layer is linear, i.e., output neurons act as linear summation units. Each output node corresponds to a class featured in the dataset.

FIGURE 1. In the graph shown, node d is the target training instance and nodes a, b, c, and g are the chosen prototypes. S^k_ij denotes the dissimilarity score fetched from dissimilarity matrix S^k for the two nodes i and j. The dissimilarity scores between d and all prototype nodes are fetched from two graph-extracted dissimilarity matrices S^1 and S^2 (not shown), then weighted by the trainable weights α_1 and α_2. The output of each prototype-layer node is a convex linear combination of its inputs (see (2)). The final embedding of d is the product of the dissimilarity scores and the trainable weight matrix between the prototype and output layers. Both the prototype and output layers are linear.
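For reference, the Gaussian hidden-neuron activation of a classical RBFN can be written as a short sketch (the gamma width parameter and the sample vectors are illustrative):

```python
import math

def rbf_activation(x, prototype, gamma=1.0):
    """Gaussian kernel of a classical RBFN hidden neuron: the activation peaks
    at 1 when the input equals the stored prototype and decays with the
    squared Euclidean distance between them."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, prototype))
    return math.exp(-gamma * sq_dist)

activation = rbf_activation([1.0, 2.0], [1.0, 2.0])  # input matches prototype
```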
In our work, we employ a modified version of RBFNs to suit the task of node embedding. First, the dataset in our problem corresponds to all nodes in the graph under consideration; the dataset is randomly split into training and testing subsets. Second, the prototypes of the hidden layer neurons are chosen at random from the training set of nodes. Third, dissimilarity scores between the input nodes and the hidden layer prototypes are obtained directly from node-to-node dissimilarity matrices rather than through the conventional basis functions of RBFNs.

2) SIAMESE NEURAL NETWORKS (SNNs)
A Siamese neural network [23] (SNN) is a neural network with two or more identical branches or sub-networks. The sub-networks share the same architecture and parameter configuration. A basic Siamese network example is commonly trained on a pair of images sent at a time to the network's branches. The Siamese network goal is to learn a low-dimensional embedding representation that best captures the relation between the two input images. A distance-based cost function can then be used to decide if the two images are from the same or different classes.
The network described in this paper is an RBFN-like network, as shown in Fig. 1, trained using a Siamese-like training technique. Note that a Siamese network with b branches can be simulated by forwarding b dataset samples back-to-back through a single branch; this is the view adopted in this paper. The proposed framework, the adopted dissimilarity metrics, and the training algorithm are described in Part B.

B. PROPOSED METHODOLOGY
This section presents the details of the proposed deep prototype-based node embedding technique. The proposed method aims to encode relational topological node information as well as available node features. The proposed method is unsupervised which makes it well suited for large graphs as well as small graphs.
Let G be a given graph with set of vertices V = {v_1, v_2, ..., v_n}, n = |V|, and let F be the n × f matrix of node features, where f is the size of a node's feature vector. Several dissimilarity matrices S^1, S^2, ..., S^k are extracted from G. In our basic formulation, the node-to-node shortest-path matrix and the pairwise cosine dissimilarity matrix based on F are computed; other dissimilarity matrices can be incorporated as well. Note that the network can adapt to either similarity or dissimilarity matrices, but we use the term ''dissimilarity'' throughout the paper to avoid confusion. Therefore, we convert any similarity measure, such as the cosine similarity, into a dissimilarity one. As we utilize multiple dissimilarity matrices, the hidden layer is preceded by a softmax layer of size 1 × k, where k is the number of dissimilarity matrices.

Our target is to generate, for each node v_i in V, a corresponding embedding vector e_i of size 1 × e, where e is a hyper-parameter that stands for the desired embedding size. To generate an embedding e_i for node v_i, the node index i is forwarded to the hidden layer to fetch, from the precomputed dissimilarity matrices S^1, S^2, ..., S^k, the dissimilarity scores against the hidden neurons' prototypes. For a given dissimilarity matrix S, a node v_i, and a prototype node v_j, the dissimilarity score is the element at index (i, j) in S. Let S^1_ij, S^2_ij, ..., S^k_ij be the dissimilarity scores between node v_i and prototype node v_j for each of S^1, S^2, ..., S^k. These scores are multiplied by the normalized weights α̂_1, α̂_2, ..., α̂_k, so that the hidden neuron storing prototype v_j outputs the convex combination

h_ij = Σ_{m=1}^{k} α̂_m S^m_ij.    (2)

Each α̂_m is normalized according to the softmax rule

α̂_m = exp(α_m) / Σ_{l=1}^{k} exp(α_l),    (3)

where each α_m is a trainable parameter. Now assume a hidden layer of z neurons and an output layer of e neurons. Each output neuron acts as a linear summation unit over the incoming dissimilarity scores from the hidden layer, weighted by the z × e weight matrix between the hidden and output layers. The embedding vector e_i of node v_i is the output of the output layer.
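The forward pass described above can be sketched as follows; the dissimilarity matrices, prototype indices, and weight values below are toy inputs chosen for illustration, not values from the paper's experiments.

```python
import math

def softmax(alphas):
    exps = [math.exp(a) for a in alphas]
    total = sum(exps)
    return [x / total for x in exps]

def embed(i, S_list, prototypes, alphas, W_out):
    """Forward pass of the proposed prototype-based network (sketch).

    i          : index of the node to embed
    S_list     : precomputed dissimilarity matrices S^1..S^k as nested lists
    prototypes : indices of the z prototype nodes
    alphas     : trainable mixing parameters, one per dissimilarity matrix
    W_out      : z x e trainable matrix between prototype and output layers
    """
    a_hat = softmax(alphas)  # normalized weights, as in the softmax rule
    # Each prototype-layer neuron emits a convex combination of the scores
    # fetched from the k dissimilarity matrices for (node i, prototype j).
    hidden = [sum(a_hat[k] * S_list[k][i][j] for k in range(len(S_list)))
              for j in prototypes]
    # Linear output layer: weighted sums of the hidden activations.
    e_dim = len(W_out[0])
    return [sum(hidden[z] * W_out[z][d] for z in range(len(hidden)))
            for d in range(e_dim)]

S1 = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]  # e.g. shortest paths
S2 = [[0.0, 0.5, 0.9], [0.5, 0.0, 0.4], [0.9, 0.4, 0.0]]  # e.g. cosine dissim.
emb = embed(0, [S1, S2], prototypes=[1, 2], alphas=[0.0, 0.0],
            W_out=[[0.1, 0.2], [0.3, 0.4]])
```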
To update the network's weights, we follow the Siamese network training approach: given two nodes v_a and v_b, we forward them back-to-back to obtain their respective embeddings e_a and e_b. We compute the cosine dissimilarity between e_a and e_b,

D_cos(e_a, e_b) = 1 − (e_a · e_b) / (||e_a|| ||e_b||),

and minimize the L2-norm distance loss function

L = (D_cos(e_a, e_b) − S_true)^2,    (4)

where S_true is the equally weighted sum of the dissimilarity scores between v_a and v_b defined in the dissimilarity matrices S^1, S^2, ..., S^k.
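The training signal for one forwarded node pair can be sketched as below; the squared form is our reading of the L2-norm loss over the scalar dissimilarity, and gradient computation with respect to the shared weights is omitted.

```python
import math

def cosine_dissimilarity(ea, eb):
    """1 minus the cosine similarity of two embedding vectors."""
    dot = sum(a * b for a, b in zip(ea, eb))
    na = math.sqrt(sum(a * a for a in ea))
    nb = math.sqrt(sum(b * b for b in eb))
    return 1.0 - dot / (na * nb)

def pair_loss(ea, eb, s_true):
    """Squared difference between the embeddings' cosine dissimilarity and the
    target score s_true (the equally weighted sum of the pair's
    dissimilarity-matrix entries)."""
    return (cosine_dissimilarity(ea, eb) - s_true) ** 2
```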

IV. EXPERIMENTS
The proposed method is evaluated in two graph-related tasks, namely, unsupervised node classification and link prediction using eight benchmark graph datasets.

A. DATASETS DESCRIPTION
Our experiments use the following graph datasets:
• Cora [24]: The Cora dataset contains 2708 scientific publications divided into seven categories. There are 5429 edges, each indicating a citation of one paper by another. A binary word vector describes each publication, indicating the presence or absence of the corresponding word from a dictionary of 1433 unique words.
• BlogCatalog [25]: A social blog directory hosting the blogs of users (bloggers). A node corresponds to a blogger in the network, and an edge indicates a friendship between two bloggers. The attribute vector of each node records occurrences of specific keywords in the blogger's blog descriptions.
• LastFM [26]: The dataset is collected from LastFM social network where nodes are LastFM users from Asian countries and edges are mutual follower ties between nodes. The users' favorite artists are used to extract the node features. Node labels represent the countries the users belong to.
• MIT [27]: Facebook friendship social network for the Massachusetts Institute of Technology (MIT), in which seven properties are attached to each Facebook user: status flag, gender, major, second major, dorm, year, and high school. The 'year' property is taken as the label while the other six properties are used as features via one-hot encoding.
• Flickr [28]: Flickr is an online network where people can share photos and follow each other. Nodes and edges represent users and followers respectively. The feature vector of a node reflects a user's interests based on a list of tags whereas the label of the node is the category of images the user is interested in.
• WikiCS [29]: The dataset is derived from Wikipedia articles and consists of nodes and edges corresponding to Computer Science articles and hyperlinks, respectively. The articles correspond to 10 classes representing different fields.
• Amazon [30]: A dataset of two networks: Amazon Computers (A. Computers) and Amazon Photos (A. Photos). Nodes represent goods, and edges indicate that two goods are frequently purchased together. Node features are bag-of-words representations of product reviews and may be used to map goods to their product categories.
Dataset statistics are summarized in Table 1. The experiments compare the performance of our model against node2vec and GraphSage on the node classification and link prediction tasks. Note that we apply unsupervised node classification across all experiments and all embedding schemes.

B. UNSUPERVISED NODE CLASSIFICATION TASK
For a fair comparison between the different node embedding methods, we standardized the factors involved in each experiment where possible. We conducted our experiments with an embedding vector of length 16, 20 epochs with a batch size of 50, a 50% train-test dataset split, and the Adam optimizer with a learning rate of 0.005. For the GraphSage method, for each sampled node we sampled up to the 2nd neighborhood (i.e., 2 hops), with 25 and 10 nodes for the first and second hops, respectively. GraphSage with mean aggregators is the variant considered in this experiment. Table 2 shows the node classification results.

C. LINK PREDICTION TASK
The results for link prediction are shown in Table 3, reported using the AUC [31] (area under the ROC curve) metric. The link prediction experiment was conducted as follows: we randomly remove 10% of the edges from each dataset's graph as positive testing data and sample the same number of nonexistent edges (unconnected node pairs) at random as negative testing data. To create the training data, we use the remaining 90% of existing links along with the same number of sampled nonexistent links. To obtain the node embeddings for each training graph, all embedding methods are trained on the training graph (after removing the positive edges) of each dataset.
Next, we compute the embedding of existing (positive) or nonexistent (negative) edges between pairs of nodes by applying one of four operators to the embeddings of the considered node pairs: the Hadamard (element-wise) product, averaging, the L1 norm, or the L2 norm. A logistic regression classifier is then applied to the edge embeddings to classify them into two classes: positive (a link between the two nodes is likely) and negative.
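The four edge embedding operators can be sketched as below; the operator names are ours, and the logistic regression stage that consumes these vectors is not shown.

```python
def edge_embedding(zu, zv, op="hadamard"):
    """Combine two node embeddings into one edge embedding with one of the
    four operators used in the link prediction experiment."""
    if op == "hadamard":                     # element-wise product
        return [a * b for a, b in zip(zu, zv)]
    if op == "average":
        return [(a + b) / 2 for a, b in zip(zu, zv)]
    if op == "l1":                           # element-wise absolute difference
        return [abs(a - b) for a, b in zip(zu, zv)]
    if op == "l2":                           # element-wise squared difference
        return [(a - b) ** 2 for a, b in zip(zu, zv)]
    raise ValueError(f"unknown operator: {op}")
```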
We transform the task into a supervised one by defining a node-to-node edge indicator ŷ with value 1 for no edge and value 0 for an existing edge. The following term is added to the loss objective function given in (4):

ℓ_edge = (1 − ŷ) D(z_u, z_v)^2 + ŷ max(0, m − D(z_u, z_v))^2,    (5)

where ŷ is a binary indicator of whether an edge exists between two nodes u and v, D() is the Euclidean distance between the two (normalized) embeddings z_u and z_v of u and v, respectively, and m = 0.33. Equation (5) tends to minimize the Euclidean distance between the embeddings of two connected nodes.
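This supervised term can be sketched as follows, under our reading of the indicator convention (ŷ = 1 for no edge, ŷ = 0 for an existing edge); the exact closed form is our reconstruction from the surrounding description, not a verbatim formula.

```python
def edge_loss(dist, y_hat, m=0.33):
    """Contrastive link prediction term. dist is the Euclidean distance
    between the normalized embeddings z_u and z_v of a node pair."""
    if y_hat == 0:
        return dist ** 2                # connected pair: pull embeddings together
    return max(0.0, m - dist) ** 2      # unconnected pair: enforce margin m
```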

V. DISCUSSION
The method proposed in this paper can be viewed as a global rather than a local embedding technique: during the embedding generation process, we capture relations among nodes no matter how distant those nodes are within the graph. This is an advantage over local embedding techniques, which may fail to look beyond a node's direct neighborhood. In our method, the embedding of any node in the graph is acquired by matching the similarities between that node and a set of fixed reference nodes (prototypes) spread across the entire graph. In that sense, the direct neighborhood of a node is not our focus, as opposed to the strategies adopted in GraphSage and node2vec. Although GraphSage and node2vec, in light of our presentation, would be considered local embedding methods, extra costly measures can be adopted to model more global views of the input graph; for example, in node2vec, random walks with long path lengths can model higher-order relations among nodes. Hence, GraphSage and node2vec can be considered ''inherently'' local. Being a global embedding technique is an added advantage when dealing with complex graph datasets.
The performance of node classification, link prediction, and many other downstream tasks depends on the nature of the presented graph and on how informative the external information given in the form of feature vectors is. The results in Table 2 show the advantage of our method over the other methods on 50% of the datasets in the node classification task. Consider node classification on the LastFM dataset in particular. Compared with GraphSage and node2vec, our method benefits from the nature of the LastFM dataset: the feature vectors contain information about the favorite artists each user listens to, which is at least as important as user-to-user connections in determining a user's country. Consequently, node2vec's performance suffers, since it takes into account only the structural information. Moreover, higher-order connections among a group of users, or between a user and other users spread across the network, can provide richer information about an individual's identity than low-order connections with immediate neighbors. This may explain why our method performs better than GraphSage on the social networks LastFM and BlogCatalog.
The global property of an embedding technique may not always be a favorable aspect. In some datasets, the close connections between nodes are sufficient to produce embeddings that effectively encode relevant graph information. Further work to draw higher-order relations among nodes may be redundant and sometimes disturb the performance of the downstream task being considered. Supporting this point is the Amazon Computers dataset. It is a reasonable conclusion that products that are bought together (as indicated by the graph links) usually belong to the same category. Hence, and in similar contexts, a local embedding technique may outperform a global one in predicting the nodes' classes.
Regarding link prediction, our method and GraphSage dominate with the Hadamard and average edge embedding operators, respectively. We attribute this to the core of both methods: GraphSage learns useful node embeddings by aggregating (in our case, averaging) the features of their neighborhoods, whereas our method constrains the cosine similarity (an unnormalized sum of the Hadamard product) between two node embeddings to be directly proportional to the similarity defined between the two nodes. On the other hand, node2vec is the clear winner with the L1 and L2 operators in particular, and in general if overall performance, regardless of the edge operator, is the only criterion. However, this last finding contradicts [15], which reports that node2vec performs best using the Hadamard operator.
The choice of prototype nodes in the graph is made at random. This factor, along with several others, allows noise to enter the embedding process relative to GraphSage. In general, GNNs are sensitive to even slight perturbations of graph properties, such as modifications of the graph topology or noise introduced into the node features [32], [33]. Table 4 shows the node classification accuracy of our method against GraphSage after adding noise to the node features of the Flickr dataset. The noise consists of random binary vectors added to specific ratios of the node features.
The results in Table 4 show a nearly 10% drop in node classification accuracy for our method after introducing noise into 75% of the node features, while GraphSage experienced about a 34% drop. The results also show that our method remained resilient compared to GraphSage, which suffered with each increase in the percentage of noise-infected feature vectors.
There are many alternatives to choosing prototype nodes at random. One scheme is to use celebrity nodes, i.e., the nodes with the highest number of connections. Aside from prototype assignment, the percentage of prototype nodes remains an important hyper-parameter. In our experiments, we found setting 30% of the graph nodes as prototypes to be the best choice, as demonstrated in Fig. 2.
Our method supports multiple dissimilarity measures, provided each measure can be defined in matrix form. We tested node classification accuracies, reported in Table 5, after applying mixtures of dissimilarity measures to the LastFM dataset. We use the (i) shortest path, (ii) node features cosine dissimilarity, (iii) SimRank [34], and (iv) pairwise label dissimilarity (a score of 0 for two nodes of the same class and 1 for two nodes of different classes) measures.
The results show that combining the shortest path and node features cosine dissimilarity metrics yields superior performance to the shortest path metric alone. However, combining different dissimilarity metrics does not always boost performance, as indicated by the third row in Table 5. The last result indicates that including the pairwise label dissimilarity significantly increased performance. Note that utilizing label dissimilarity among nodes is illogical or unfair in a node classification task; nevertheless, this result demonstrates that a carefully selected dissimilarity metric can greatly affect the performance of the applied downstream task. A final note is that some dissimilarity metrics can be costly to compute. For instance, computing all-pairs shortest-path distances has a computational complexity of O(n^3) for a graph with n vertices. Fortunately, we may resort to relaxations such as limiting the shortest-path computation to some k-hop neighborhood around the target nodes.
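The k-hop relaxation just mentioned can be sketched with a truncated breadth-first search; the adjacency list and the cap placed on out-of-range distances are illustrative choices, not prescribed by the paper.

```python
from collections import deque

def khop_shortest_paths(adj, k, cap=None):
    """Shortest-path dissimilarity matrix truncated to a k-hop neighborhood.

    Distances beyond k hops are clipped to cap (default k + 1), so each BFS
    visits only the k-hop ball around its source instead of paying the
    all-pairs cost. adj: dict node -> list of neighbors, nodes 0..n-1.
    """
    n = len(adj)
    cap = k + 1 if cap is None else cap
    S = [[cap] * n for _ in range(n)]
    for src in range(n):
        S[src][src] = 0
        queue = deque([(src, 0)])
        seen = {src}
        while queue:
            node, d = queue.popleft()
            if d == k:          # do not expand past the k-hop frontier
                continue
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    S[src][nbr] = d + 1
                    queue.append((nbr, d + 1))
    return S

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}  # a path graph 0-1-2-3
S = khop_shortest_paths(adj, k=2)
```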
Finally, it is worth mentioning that PGNNs and our proposed method share the idea of employing anchor nodes (we call them prototypes) across the graph in the embedding generation process. However, the details and purposes of the two schemes differ; for instance, the anchor-set sampling strategies and the architectures are not the same. Moreover, unlike our embedding framework, PGNNs depend mainly on noise-free node features, since PGNNs are a variant of GNNs.

VI. CONCLUSION
In this paper, we propose a new method for generating node representations in graphs. The proposed method is evaluated on two of the most prominent graph-related tasks: node classification and link prediction. Results are presented on a number of benchmark datasets. In conclusion, we recommend using our method when dealing with complex, multi-dissimilarity-measure, or noise-afflicted networks. In future work, we plan to support supervised downstream tasks and enhance the performance of the proposed technique.