JONNEE: Joint Network Nodes and Edges Embedding

Recently, graph embedding models have significantly improved the quality of graph machine learning tasks, such as node classification and link prediction. In this work, we propose a model called JONNEE (JOint Network Nodes and Edges Embedding), which learns node and edge embeddings under self-supervision via joint constraints in a given graph and its edge-to-vertex dual representation as a Line graph. The model uses two graph autoencoders with additional structural feature engineering and several regularization techniques to train for an adjacency matrix reconstruction task in an unsupervised setting. Experimental results show that our model performs on par with state-of-the-art embedding models for undirected attributed graphs and requires fewer epochs to achieve the same quality due to Line graph self-supervision under a unified embedding framework.


I. INTRODUCTION
Networks appear in many real-world tasks that require describing important relations between objects and their attributes. Effective network representation provides valuable information on how graph structure can improve understanding and feature extraction of structural information accompanied by non-network features.
Nowadays, machine learning methods demand that information be represented in vectorized form to follow standard frameworks for classic machine learning tasks. Graph machine learning is usually associated with particular node-level machine learning problems, such as node classification, link prediction, and node clustering, followed by network visualization. To feed graph data to machine learning frameworks, network embedding has emerged as an effective and efficient approach to solving machine learning problems on networks. It maps graph motifs, such as nodes and edges, into a low-dimensional space while preserving certain graph information and constraints related to the graph.
Today, there exists a large variety of graph embeddings that automatically extract vector representations for networks [1]-[8]. All network embeddings have different training and inference complexity, various construction ideas, and diverse applications to network data domains. However, the concept of network representation learning allows combining them in a general pipeline and verifying them in terms of the quality of graph machine learning tasks on benchmark networks.
The associate editor coordinating the review of this manuscript and approving it for publication was Vicente Alarcon-Aquino.
Different models have their own advantages and shortcomings. For example, sequence-based models can be efficiently trained for large graphs but do not take into account features in nodes and edges. Adjacency matrix factorization frameworks have high quality while being very costly and slow to train. Models based on deep learning methods can efficiently incorporate features but are hard to train and interpret in terms of which factors and neighbors impact model performance the most.
Moreover, all of these models mostly rely on node embeddings for constructing edge embeddings: while node embeddings are trained, edge embeddings used for link prediction task are usually presented by certain symmetric operators of incident node embeddings [9] or their low-dimensional bi-linear representation [10]. Such an approach does not consider the noise in the edge data, nor does it describe the fact that edge formation and features may be obtained independently from graph structure and jointly learned under the self-supervised framework.
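The symmetric operators mentioned above can be illustrated with a minimal Python sketch. The four component-wise operators (average, Hadamard, weighted-L1, weighted-L2) follow the binary operators described for node2vec [9]; the function name `edge_embedding` and the toy vectors are illustrative assumptions and are not part of JONNEE.

```python
from typing import Callable, Dict, List

Vector = List[float]

# Symmetric binary operators mapping two node embeddings to one edge embedding.
EDGE_OPERATORS: Dict[str, Callable[[float, float], float]] = {
    "average": lambda a, b: (a + b) / 2.0,
    "hadamard": lambda a, b: a * b,
    "weighted_l1": lambda a, b: abs(a - b),
    "weighted_l2": lambda a, b: (a - b) ** 2,
}


def edge_embedding(f_u: Vector, f_v: Vector, operator: str = "hadamard") -> Vector:
    """Build an edge embedding component-wise from two incident node embeddings."""
    if len(f_u) != len(f_v):
        raise ValueError("node embeddings must have the same dimension")
    op = EDGE_OPERATORS[operator]
    return [op(a, b) for a, b in zip(f_u, f_v)]


# Toy node embeddings (hypothetical values).
f_u = [1.0, 2.0, -1.0]
f_v = [3.0, 0.0, 1.0]
for name in EDGE_OPERATORS:
    print(name, edge_embedding(f_u, f_v, name))
```

Because every operator is symmetric in its arguments, the resulting edge embedding does not depend on the order of the two endpoints, which suits undirected graphs. It is also a deterministic function of the node embeddings, which is the limitation discussed in this paragraph.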
In our study, we aim to construct a model that combines the best of all three classical approaches for network embeddings under Line graph self-supervision: sequence-based models for fast feature generation, capturing structural information and high-order dependencies; deep learning model autoencoder for its expressiveness and effective use of generated features; and matrix factorizations for simplicity in model regularization.
The new model embeds nodes and edges into a common vector space as latent codes of the original graph and Line graph autoencoders (a similar idea was used in [11] and [12]), with a shared loss function similar to the edge embedding operators proposed in [13]-[15] for ''pooling'' the first-order neighborhood of source and target nodes.
Our goal is to learn better node representations by using an edge-to-node direction and explicitly regularizing a node embedding to be close to the average of the embeddings of the edges incident to it as an objective instead of using deterministic mapping. This idea expresses that a node should be not only a center of its node neighborhood, but also a center of its dual edge neighborhood, and explicitly support edge reconstruction from incident nodes as their local averages based on Line graph representation.
Since our focus is on the idea that edges may carry additional information that cannot be represented by incident node embeddings alone, we first focus on the link prediction problem to guide our model evaluation. The link prediction problem is one of the core machine learning tasks on graphs, which can be solved by our model better than by other state-of-the-art solutions. Secondly, we test the JONNEE model on the multi-class node classification problem under unsupervised and semi-supervised settings. Finally, JONNEE embeddings are used to demonstrate superior quality on the network visualization problem.
Overall, we show that JONNEE is a viable model that often outperforms state-of-the-art unsupervised solutions for classical machine learning problems on graphs.
In this work, the following contributions were made:
1) A novel JONNEE model for joint network node and edge embedding via Line graph self-supervision is suggested.
2) Extensive experiments show the importance of each component of the JONNEE model for the resulting quality of the fine-tuned model.
3) A comprehensive comparison of existing methods is performed on node classification, link prediction, and network visualization problems. Our model shows equal quality compared to state-of-the-art models while converging faster in an unsupervised setting.
The rest of the paper is organized as follows. We start by defining the problem field and reviewing existing network embedding models in Section II. Section III contains a detailed description of the proposed model and the motivation behind each component, followed by evidence of their importance in Section IV. Next, the fine-tuned model is compared to state-of-the-art network embeddings in Section V. Section VI discusses experiment results, main takeaways, and future work.

II. RELATED WORK
We start by introducing essential definitions and establishing general notation for our paper related to network science and network representation learning.
Let us consider a graph $G = (V, E)$ defined as a set of vertices $V$ and a set of edges $E \subseteq V \times V$. Additionally, certain graph substructures may be equipped with attributive features conveying information that cannot be expressed by graph structure alone. For example, each node of a citation network is a paper that may have a vector description of its textual or semantic content. If these features are available, we denote them as $X_0 \in \mathbb{R}^{|V| \times d_0}$, where $d_0$ is the dimension of the original space of node attributive features.
The procedure for constructing a vector representation of a graph is called graph embedding, which is a mapping from a collection of substructures (most commonly, these include either all nodes, all edges, or certain subgraphs) to $\mathbb{R}^d$. We will mostly consider node embeddings $f: V \to \mathbb{R}^d$ with $d \ll |V|$. For many graph-based tasks, the most natural task formulation is unsupervised learning: embeddings must be learned using only the adjacency matrix $A$, which contains information on structural similarity, and, possibly, the features $X_0$, without a task-specific part of the loss. Another possible case is that labels are available for some graph substructures, and one wishes to recover the missing labels in a semi-supervised approach. One example of this is the node classification task, in which all nodes are available from the start, but only a fraction is labeled.
To describe network embedding definitions and training settings, it is important to clarify what is meant by a good network embedding. During the embedding procedure, one should aim to compress the data while retaining most of the essential information about direct node similarities, and at the same time extract important features from the structural information usually described by high-order node proximities, which are similarity measures over node-related graph substructures, e.g., neighborhoods. Higher-order proximities can be defined similarly, usually at a higher computational cost. Apart from the cosine measure usually adopted to quantify similarity, other metrics like Rooted PageRank, Katz Index, Common Neighbors, Jaccard coefficient, and Adamic-Adar can also be used (see [16] for more details). Now that the task of constructing efficient node embeddings is stated, let us describe the main approaches for defining model architectures and optimization frameworks and then describe the improvements suggested for edge embeddings.
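Three of the neighborhood-based proximity measures named above can be sketched in a few lines of Python. This is a minimal illustration on a toy graph; the function names and the adjacency-set representation are assumptions made for the example, and the formulas are the standard definitions of Common Neighbors, the Jaccard coefficient, and the Adamic-Adar index.

```python
import math
from typing import Dict, Set

Graph = Dict[str, Set[str]]  # adjacency sets of a simple undirected graph


def common_neighbors(g: Graph, u: str, v: str) -> int:
    """Number of nodes adjacent to both u and v."""
    return len(g[u] & g[v])


def jaccard(g: Graph, u: str, v: str) -> float:
    """Shared neighbors divided by the size of the joint neighborhood."""
    union = g[u] | g[v]
    return len(g[u] & g[v]) / len(union) if union else 0.0


def adamic_adar(g: Graph, u: str, v: str) -> float:
    """Sum of 1 / log(degree) over shared neighbors; low-degree neighbors count more."""
    return sum(1.0 / math.log(len(g[w])) for w in g[u] & g[v] if len(g[w]) > 1)


# Toy graph with edges a-b, a-c, b-c, b-d, c-d (hypothetical example).
toy = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c"},
}
print(common_neighbors(toy, "a", "d"), jaccard(toy, "a", "d"), adamic_adar(toy, "a", "d"))
```

All three scores depend only on the immediate neighborhoods of the two nodes, so they capture second-order proximity; measures such as the Katz Index or Rooted PageRank aggregate over longer paths and are costlier to compute.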
In order to describe the rapidly growing field of network embeddings, we use one of the possible taxonomies covering the idea of network embedding construction via optimization problem. There are usually three general categories related to matrix factorizations, sequential models, and graph neural networks. Below, we provide a brief overview of existing approaches for each category following the detailed survey in [8] and review existing edge embedding models constructed over node embeddings which leads to the main motivation behind our model.

A. MATRIX FACTORIZATION AND LAPLACIAN METHODS
Matrix factorization methods usually factorize a large adjacency matrix into a product of two matrices containing network and context representations. Factorization models are common techniques for approaching a dimension reduction problem in many domains.
The goal for Laplacian Eigenmaps [23]- [25] class of models lies in preserving first-order similarities via representing each node by graph Laplacian eigenvectors associated with its first k nontrivial eigenvalues.
Thus, using graph Laplacian, a model gives a more significant penalty if two nodes with larger similarity are embedded far apart in the embedding space.
Another approach is to directly factor the node proximity matrix into a product of embedding F and context F c matrices, with symmetric models using only F (e.g., GraRep [26]), or asymmetric models (e.g., HOPE [16]) concatenating representations from F and F c rows.
In most cases, factorization approaches are not able to learn on which order proximities to focus attention, require a transductive learning paradigm and have high computational complexity.

B. SEQUENCE-BASED APPROACHES
Overcoming the time complexity limitations of factorization models, sequence-based embeddings use different random walk strategies to sample the neighborhood (context) of a node.
The idea is to maximize the probability of observing this sampled neighborhood given the node embedding, similar to the Skip-gram model [27]. The most well-known random walk based graph embeddings are DeepWalk [28] and node2vec [9].
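The sampling step behind these models can be sketched as follows. This is a minimal illustration assuming uniform random walks in the spirit of DeepWalk; node2vec additionally biases the transition probabilities, which is not shown. The helper names `random_walk` and `context_pairs` are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def random_walk(adj: Dict[int, List[int]], start: int, length: int,
                rng: random.Random) -> List[int]:
    """Uniform random walk of at most `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        neighbors = adj[walk[-1]]
        if not neighbors:  # dead end: stop early
            break
        walk.append(rng.choice(neighbors))
    return walk


def context_pairs(walk: List[int], window: int) -> List[Tuple[int, int]]:
    """(center, context) pairs that a Skip-gram style objective would train on."""
    pairs = []
    for i, center in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        pairs.extend((center, walk[j]) for j in range(lo, hi) if j != i)
    return pairs


# Toy undirected graph: a 4-cycle 0-1-2-3-0 (hypothetical example).
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
walk = random_walk(adj, start=0, length=5, rng=random.Random(0))
print(walk, context_pairs(walk, window=1)[:4])
```

Each walk plays the role of a sentence and each node the role of a word, so the pairs produced here are what the Skip-gram objective would consume.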
One of the most efficient sequential models, diff2vec [37], is based on graph Laplacian approximation and samples the neighborhood around a node using graph diffusion.
In general, sequence-based models are efficient for representing structural information alone. They can be used for feature engineering in conjunction with other models, taking node and edge attributes into account.

C. GRAPH NEURAL NETWORKS
Graph neural networks incorporate graph structure into classic deep learning models, thus connecting neural networks and automated graph feature engineering.
Before the beginning of the deep learning era, graph signal processing techniques using graph spectral decomposition were suggested in [38] and [39]. Advances in deep learning allow applying deep neural networks to graph data [40]-[44].
In [45], authors propose Graph Convolutional Layer that further simplifies approximation to spectral convolution and achieves improved computational efficiency for semi-supervised multi-class node classification. A model combining several such convolutions is referred to as Graph Convolutional Network (GCN). Improvements over speed and optimization of training GCNs were suggested in [46]- [48]. Different convolutions via spatial or graph spectral methods were suggested in [49]- [58].
Similar to the other fields, autoencoders were proposed to train unsupervised node-level representations [59]. Graph Autoencoder (GAE) is based on several graph convolutional layers to encode graph structure and inner-product decoder to reconstruct the adjacency matrix. Variational graph autoencoder (VGAE) [59] extends the previous model with different GCN encodings for mean and standard deviation.
In recent work, authors of GraphSAGE [60] offer an extension of GCN for inductive unsupervised representation learning via trainable aggregation functions instead of simple convolutions applied to neighborhoods in GCN. GraphSAGE learns aggregation functions for a different number of hops applied to sampled neighborhoods of different depths, which are then used for obtaining node representations from initial node features.
Another direction of research is to extend attention from random walks and factorization techniques. GAT model [61] uses masked self-attention layers for learning weights balancing the impact of neighbors on node embedding and supporting both inductive and transductive learning settings.
Recent deep learning based models readily exploit node features when those are available and are flexible and generic, but they are still the least understood and, in terms of optimization, not as efficient as conventional Euclidean deep learning models with independent samples.

D. EDGE EMBEDDINGS
There exists a variety of models used to construct edge embeddings along with downstream tasks involving graphs and their embeddings. HARP [31] incorporates several hierarchical layers, and constructs node embedding from edge embedding. Considering attribute networks, CANE [62] and LANE [63] directly incorporate edge features and labels into network embedding. Multi-edge network embedding for event graphs (in which an event is described by several edges) was presented in HEBE [64]. Interestingly, Knowledge Graph (KG) completion solves link prediction between entities [65], however these methods are not applicable to homogeneous networks with one type of edges. Approaches for componentwise edge embedding operators based on node embeddings were suggested in many research papers [9], [10], [13], [15], [66]- [68] while other studies just concatenate node embeddings [45], [60], [69].
Recently, self-supervised graph machine learning arose as a popular direction on network embeddings via improving models under limited data constraints. It aims to use different low-level augmentation strategies for training model under self-supervised setting [70], [71]. However, there is no approach to include both node and edge embeddings directly under one optimization problem.
Definition 1 (Dual Graph): For a graph $G = (V, E)$ defined as a set of vertices $V$ and a set of edges $E \subseteq V \times V$, we denote by $G^* = (V^*, E^*)$ its line (dual) graph, whose nodes are the edges of $G$ and whose edges correspond to the nodes of $G$, in the sense that two nodes of $G^*$ are connected by an edge if the corresponding edges of $G$ share an incident node.
Remark 1: Of course, by ''dual'' we mean ''edge-to-vertex dual'' or congruent graph, also known as Line graph. The ''edge-to-vertex'' duality concept is preserved throughout the paper as it gives additional motivation for our graph representation. There is also a correspondence, albeit not a bijective one, between nodes of $G$ and edges of $G^*$. Finally, it is important to mention that only simple undirected graphs without loops and multiple edges will be considered in our study.
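Definition 1 can be made concrete with a short Python sketch that builds the Line graph of a simple undirected graph from its edge list. The function name `line_graph` and the dictionary-of-sets output are assumptions made for this illustration.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]


def line_graph(edges: List[Edge]) -> Dict[Edge, Set[Edge]]:
    """Adjacency sets of the Line graph of a simple undirected graph.

    Every edge of G becomes a node of G*; two such nodes are adjacent
    whenever the corresponding edges of G share an endpoint.
    """
    canonical = sorted({tuple(sorted(e)) for e in edges})  # (u, v) with u < v
    incident: Dict[int, List[Edge]] = {}
    for e in canonical:
        for node in e:
            incident.setdefault(node, []).append(e)
    dual: Dict[Edge, Set[Edge]] = {e: set() for e in canonical}
    for edge_group in incident.values():
        # All edges meeting at one node of G are pairwise adjacent in G*.
        for e1, e2 in combinations(edge_group, 2):
            dual[e1].add(e2)
            dual[e2].add(e1)
    return dual


# A path 0-1-2-3 has a path on three nodes as its Line graph.
print(line_graph([(0, 1), (1, 2), (2, 3)]))
```

Each node of G with degree k contributes a clique on k nodes to G*, which is why the correspondence between nodes of G and edges of G* is not bijective, and why the Line graph can be much denser than the original graph.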
The Line graph usage was suggested for joint node and edge structure learning in Dual-Primal GCN [12] and ELAINE [72] in semi-supervised learning setting. However, these models were not applicable to any undirected weighted networks and graph machine learning tasks.
With the present work, we propose a model that combines the best of three core approaches for node embeddings: sequence-based models for fast feature generation, capturing structural information and high-order dependencies; a deep learning autoencoder model for its expressiveness and effective use of generated features; and matrix factorizations for simplicity of model regularization. In addition, self-supervision in the matrix reconstruction task via Line graph embedding is a novel idea, which is also one of the main contributions of our work.

III. MODEL: DUAL GRAPH AUTOENCODER
A novel approach is required to address the problems discussed above and further improve upon the existing models. We propose a model of Dual Graph Autoencoder, which learns JOint Nodes and Edges Embedding, called JONNEE. The model name was chosen specifically to distinguish it from another approach presented in the paper on Dual-GCN [73], in which duality concept stands for local and global network features. Our model is designed to be especially effective when only structural information is available, but it also accommodates node and edge weights and features. The model is shown in Figure 1. We start by describing the model and its essential components. Learning in JONNEE proceeds in two steps.
1) Feature learning: during this stage, features for the graph and its dual graph are generated as embeddings through a sequence-based method.
2) Joint learning of two autoencoders on a graph and on its dual graph, learning both representations for nodes and edges simultaneously and in a consistent manner.
Remark 2: With minor modifications, both steps can be combined in an end-to-end pipeline. Feature generation may be skipped or substituted with end-to-end modeling of random walks with a graph neural network [36].
As with traditional autoencoders, our model has a deterministic version as well as its probabilistic analog. To keep the exposition simple, we focus on describing the deterministic GAE-based option here, implying that the extension to the VGAE-based setting is carried out in the same manner as described in Section III-B and is thus straightforward.
As input to JONNEE, a graph $G = (V, E)$ is available, represented by an adjacency matrix $A \in \mathbb{R}_{+}^{|V| \times |V|}$ (binary, or nonnegative-valued for weighted graphs) and, optionally, node attributes $X_0 \in \mathbb{R}^{|V| \times d_0}$ and edge attributes $X_0^* \in \mathbb{R}^{|E| \times d_0^*}$. Prior to learning, we obtain the dual graph $G^* = (E, E')$, $E' \subseteq E \times E$, in the form of its dual adjacency matrix. In the basic case, this graph just incorporates structural information and is thus binary. However, it can be weighted as well: for example, one can introduce node weights as averages of edge weights and use them as weights for the dual edges that correspond to these nodes. We leave experimental testing of this idea, suggested in [13] in a study of first-order proximity link embedding operators, for future work. If the graph is weighted, its weights are normalized to be bounded by 1 by using a normalized adjacency matrix.
The structure of the model description in Section III is as follows. We start with a discussion of feature learning from sequential models and the idea behind this process. Next, we discuss graph autoencoders for the graph and its Line graph. After the model is determined, we explain the loss function and the regularization tricks used in it. Finally, we discuss the training procedure and its convergence in the unsupervised setting and also present the semi-supervised counterpart of JONNEE with natural modifications of the original model.

A. FEATURE LEARNING
During the first stage, the algorithm linearizes the graph using an algorithm Seq ∈ {node2vec, diff2vec}, which can be based on either random walks or diffusion. Seq learns features $X_1 = \mathrm{Seq}(G, u)$ and $X_1^* = \mathrm{Seq}(G^*, u^*)$ of size $u$ for $G$ and $u^*$ for $G^*$ separately. If node and dual node (edge) features $X_0$ and $X_0^*$ are available from the outset, they are concatenated with the learned sequence embeddings to form augmented feature sets:
$$X = [X_0 \mid X_1] \in \mathbb{R}^{|V| \times D}, \qquad X^* = [X_0^* \mid X_1^*] \in \mathbb{R}^{|E| \times D^*}. \tag{1}$$
In (1) we denote by $D = d_0 + u$ the total output number of features per node and by $D^* = d_0^* + u$ the total output number of features per edge. We will assume that the same augmented feature dimensionality $u$ is sufficient to learn sequence embeddings from both $G$ and $G^*$, that is, $u^* = u$. In our experiments, we set both node2vec and diff2vec parameters to the default values proposed in the reference implementations (see [9] and [37]), and leave parameter tuning for future work.
As has been discussed in Section II, informative node features can improve learning representations, especially for graph convolutional networks and graph autoencoders that employ a sequence of graph convolutions as an encoder. Indeed, these models could be interpreted to operate on node features, effectively passing messages between them to update a feature based on convolving its local neighborhood until they are sufficiently consistent. This means that a good initial approximation will most likely not only speed up convergence but also ensure a significantly better local optimum [74].
One way to generate such an approximation is to employ a more lightweight algorithm for learning embeddings. For this purpose, we use sequence-based algorithms (node2vec and diff2vec), as they have an essential property for light initialization, which is the lowest complexity among all the other models described. Moreover, there is an established intuition that ensembling, which we implicitly perform in the model, works better if it combines sufficiently different models. That supports our decision to take sequence-based algorithms that rely on node contexts and sampling to complement graph convolutions. Experiments show the importance of such a feature generation procedure.
Another goal is to explore the best way to generate features in various settings: for weighted and unweighted graphs, and for graphs with and without additional node features. We observe that although GAE works with any adjacency matrix, not necessarily a binary-valued one, it is helpful to make use of edge weights during feature learning as well. While node2vec supports edge weights, we found that the original diff2vec does not, with diffusion from a given point being performed uniformly over neighbors. We generalize this model by weighting the choice over neighbors, thus making diffusion more likely in the direction of heavier or wider edges.
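The weighted choice over neighbors can be sketched as follows. This is only an illustration of the sampling step, assuming that the probability of moving to a neighbor is proportional to the edge weight; it is not the diff2vec implementation, and the name `choose_neighbor` and the nested-dictionary graph format are hypothetical.

```python
import random
from typing import Dict

WeightedGraph = Dict[str, Dict[str, float]]  # node -> {neighbor: edge weight}


def choose_neighbor(graph: WeightedGraph, node: str, rng: random.Random) -> str:
    """Sample a neighbor of `node` with probability proportional to the edge weight."""
    neighbors = list(graph[node])
    weights = [graph[node][n] for n in neighbors]
    return rng.choices(neighbors, weights=weights, k=1)[0]


# Toy weighted graph: the edge a-b is nine times heavier than a-c.
graph = {
    "a": {"b": 9.0, "c": 1.0},
    "b": {"a": 9.0},
    "c": {"a": 1.0},
}
rng = random.Random(42)
samples = [choose_neighbor(graph, "a", rng) for _ in range(10_000)]
print(samples.count("b") / len(samples))  # close to 0.9
```

With all weights equal, this reduces to the uniform choice of the unweighted case, so the weighted variant is a strict generalization.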

B. GRAPH AUTOENCODERS
Following feature generation, the second stage receives as input the data $A$, $X$, and the dual data $A^*$, $X^*$. Two graph autoencoders GAE and GAE$^*$ are then trained on this data to learn hidden representations $F \in \mathbb{R}^{|V| \times d}$ and $F^* \in \mathbb{R}^{|E| \times d}$. These representations should have the same dimensionality $d$, so that node and edge embeddings reside in the same vector space. The autoencoders have the same simple architecture, which consists of an encoder GCN with two graph convolutional layers, regularized with dropout, and a β-masked inner-product decoder obtaining the reconstruction $\hat{A}$ of the adjacency matrix $A$ from the graph embedding representations.
A two-layer GCN architecture is defined in (2)-(3), where the weight matrices of the two convolutional layers are the trainable parameters of $\mathrm{GCN}(X, A)$. Following equations (2)-(3), the graph and dual graph embeddings and reconstructions are specified in (4)-(5), where $\odot$ denotes the Hadamard (point-wise) product of matrices. The matrices $B = (1 + \beta a_{ij})_{i,j}$ and $B^* = (1 + \beta^* a^*_{ij})_{i,j}$ with $\beta, \beta^* > 0$ allow us to soften learning and regularize the model, especially on unweighted graphs. This defines the forward pass of the JONNEE model. Below, we discuss the choice of GCN architecture and then go into details of the training and regularization procedures.
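As an illustration of the forward pass, the following pure-Python sketch assumes the standard two-layer GCN of [45] with symmetric normalization and self-loops, and the sigmoid inner-product decoder of GAE [59]. Whether JONNEE uses exactly these forms is an assumption, since equations (2)-(5) are not reproduced here. The sketch also builds the mask $B$, but deliberately does not apply it, because its exact placement in the decoder cannot be recovered from the surrounding text. All function names and toy weights are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(a: Matrix) -> Matrix:
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    n = len(a)
    a_hat = [[a[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_encoder(x: Matrix, a: Matrix, w0: Matrix, w1: Matrix) -> Matrix:
    """Two graph convolutions with a ReLU in between (assumed standard GCN form)."""
    a_norm = normalize_adjacency(a)
    hidden = [[max(0.0, v) for v in row] for row in matmul(a_norm, matmul(x, w0))]
    return matmul(a_norm, matmul(hidden, w1))


def inner_product_decoder(f: Matrix) -> Matrix:
    """Reconstruct pairwise scores as sigmoid(F F^T)."""
    f_t = [list(col) for col in zip(*f)]
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in matmul(f, f_t)]


def beta_mask(a: Matrix, beta: float) -> Matrix:
    """Mask B with entries 1 + beta * a_ij, as defined in the text."""
    return [[1.0 + beta * v for v in row] for row in a]


# Toy path graph 0-1-2 with one-hot features and hypothetical weights.
a = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
x = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w0 = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]
w1 = [[1.0, 0.0], [0.5, -1.0]]
f = gcn_encoder(x, a, w0, w1)
a_rec = inner_product_decoder(f)
print(f, a_rec, beta_mask(a, 0.5))
```

The same functions apply unchanged to the dual data $A^*$, $X^*$, which is what allows both autoencoders to share one architecture and one embedding dimensionality $d$.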

1) TRAINING WITH β-MASKING
Initially, an opposite idea was proposed in [43], where authors weigh higher the actual positive-class terms in MSE reconstruction loss in order to penalize their model more for missing a possible connection than for incorrectly predicting an absent edge as likely. In VGAE, this is implemented by sampling the positive class multiple times until the balance is corrected. However, in our case, β-masking worked better with respect to threshold-independent metrics, such as F1-score, by fitting the model more loosely on the provided adjacency matrix.

2) CHOICE OF THE ENCODER ARCHITECTURE
In our experiments, we consider an encoder composed of two GCN layers. Although the depth of this model could vary, we consider a shallow model to be sufficient in many real-world applications for graphs. Compared to images, graphs have no regular filter structure, and considering distant relations between nodes is computationally hard, while such relations are not meaningful in many practical scenarios. For example, a well-known social graph rule of six handshakes states that any person knows any other person through an average of five other people; in most actual cases of local social networks, this number (effective graph diameter) is even smaller. Thus, any number of encoding layers beyond the graph diameter would be redundant. We also assume that the number of edges is linear in the number of nodes in most large real-world networks. It is therefore reasonable to reduce the number of hyperparameters and take the same number of first-layer hidden units for the primal and dual encoders.

C. TRAINING

1) GRAPH RECONSTRUCTION LOSS
As the reconstruction loss function, we employ the mean squared error (MSE), which is proportional to the squared Frobenius norm of the error matrix, as given in (6)-(7). In the case of an unweighted graph, one can also use the standard cross-entropy loss function instead of the MSE used in (6)-(7).
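For reference, the relation between the MSE and the Frobenius norm can be written explicitly. The form below assumes that the error is averaged over all $|V|^2$ entries of the adjacency matrix; the normalization used in the paper's own equations (6)-(7) may differ.

```latex
\mathrm{MSE}(\hat{A}, A)
  = \frac{1}{|V|^2} \sum_{i,j} \left( \hat{a}_{ij} - a_{ij} \right)^2
  = \frac{1}{|V|^2} \left\lVert \hat{A} - A \right\rVert_F^2 .
```

The dual loss has the same form with $\hat{A}^*$, $A^*$, and $|E|$ in place of $\hat{A}$, $A$, and $|V|$.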

2) JOINT LEARNING
Let us denote again by $f$ and $f^*$ the node and edge representation mappings learned by GAE and GAE$^*$, respectively, and by the matrices $F \in \mathbb{R}^{|V| \times d}$ and $F^* \in \mathbb{R}^{|E| \times d}$ their images on the training set. Now, we describe how the learning of the two models is coupled, so that node representations benefit from edge representations and vice versa. We suggest that a good representation for a node should be similar to the average of the learned representations of the edges incident to this node, expressing what these edge representations have in common. Conversely, an edge representation should be similar to what its incident nodes share in the representations learned by the primal autoencoder. Informally, we want to achieve some form of meaningful arithmetic for edge and node representations, akin to that learned by word2vec: by averaging over the edges incident to a node, we would ideally extract some common features. Another way to describe this is to note that passing to a dual graph twice (from nodes to edges and then back to nodes, combining edges so that $G^{**} = G$) should not affect the representation too much, since we wish the edge representations to retain as much information about nodes as possible, and vice versa.
Based on this idea, we add dual terms to our loss function, given in (8)-(9). In (8)-(9), $f(v)$ stands for the embedding of node $v$ in graph $G$, $f^*((u, v))$ stands for the embedding of an edge $e = (u, v)$ in the dual graph $G^*$, and $N_{G'}(w)$ denotes the neighborhood of a node $w$ in the corresponding graph $G'$, which is either $G$ or $G^*$ in our case.
For convenience of setting up the model, we observe that $L_{G^* \to G}$ can be rewritten in matrix form using the incidence matrix $M \in \mathbb{R}^{|V| \times |E|}$, where each element is defined as $M_{v,e} = \mathbb{I}\{\text{node } v \text{ is incident to edge } e\}$.
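The edge-to-node coupling can be sketched with the incidence matrix as follows. This is a minimal illustration, assuming the term is a sum over nodes of the squared distance between a node embedding and the mean of its incident edge embeddings; the exact normalization in equations (8)-(9) is not recoverable here, and the function names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Edge = Tuple[int, int]


def incidence_matrix(num_nodes: int, edges: List[Edge]) -> Matrix:
    """M[v][e] = 1 if node v is an endpoint of edge e, else 0."""
    m = [[0.0] * len(edges) for _ in range(num_nodes)]
    for col, (u, v) in enumerate(edges):
        m[u][col] = 1.0
        m[v][col] = 1.0
    return m


def edge_to_node_loss(f: Matrix, f_star: Matrix, m: Matrix) -> float:
    """Sum over nodes of ||f(v) - mean of incident edge embeddings||^2."""
    total = 0.0
    for v, row in enumerate(m):
        degree = sum(row)
        if degree == 0:
            continue  # isolated node: nothing to average
        for k in range(len(f[v])):
            # Row v of (D^{-1} M F*), computed component by component.
            mean_k = sum(row[e] * f_star[e][k] for e in range(len(row))) / degree
            total += (f[v][k] - mean_k) ** 2
    return total


# Toy path 0-1-2 with two edges and 2-dimensional embeddings (hypothetical values).
edges = [(0, 1), (1, 2)]
m = incidence_matrix(3, edges)
f_star = [[1.0, 0.0], [0.0, 1.0]]
f_centered = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
print(edge_to_node_loss(f_centered, f_star, m))  # nodes sit at the edge averages
```

In matrix terms, the averaged edge embeddings are the rows of $D^{-1} M F^*$, where $D$ is the diagonal matrix of node degrees, so the whole term can be evaluated with two matrix products instead of a loop over neighborhoods.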

3) LAPLACIAN REGULARIZATION
We use knowledge of the typical structure of network problems and add some tricks to regularize the learned representations and make them more robust. Namely, we draw inspiration from previous research, such as [43] and [75], and add Laplacian regularization to both node and edge representations. Intuitively, it seems plausible that a model that places more emphasis on first-order proximity (through direct linkage) is simpler than a model that is more reliant on neighborhood similarities for second-order proximities. Thus, a way to regularize our model and enforce preserving the direct relationship is to add the Laplacian term, with a coefficient controlling the strength of the regularizer. We impose this regularization on both the primal and dual GAE by adding appropriately weighted losses $L_{reg}$ and $L^*_{reg}$, defined as follows:
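A Laplacian regularizer of this kind can be sketched as follows. The sketch assumes the standard form $\mathrm{tr}(F^\top L F)$ with the unnormalized Laplacian $L = D - A$; the paper's equations (10)-(11) may use a different normalization. It also checks numerically the well-known identity that this trace equals half the weighted sum of squared distances between embeddings of connected nodes. Function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def laplacian(a: Matrix) -> Matrix:
    """Unnormalized graph Laplacian L = D - A."""
    n = len(a)
    return [[(sum(a[i]) if i == j else 0.0) - a[i][j] for j in range(n)] for i in range(n)]


def laplacian_penalty(f: Matrix, a: Matrix) -> float:
    """tr(F^T L F): small when strongly connected nodes have close embeddings."""
    lap = laplacian(a)
    n, d = len(f), len(f[0])
    return sum(f[i][k] * lap[i][j] * f[j][k]
               for i in range(n) for j in range(n) for k in range(d))


def pairwise_penalty(f: Matrix, a: Matrix) -> float:
    """0.5 * sum_ij a_ij * ||f_i - f_j||^2, equal to tr(F^T L F) for symmetric A."""
    n = len(f)
    return 0.5 * sum(a[i][j] * sum((x - y) ** 2 for x, y in zip(f[i], f[j]))
                     for i in range(n) for j in range(n))


# Toy weighted path 0-1-2 with weights 1 and 2 (hypothetical values).
a = [[0.0, 1.0, 0.0], [1.0, 0.0, 2.0], [0.0, 2.0, 0.0]]
f = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
print(laplacian_penalty(f, a), pairwise_penalty(f, a))
```

The pairwise form makes the intuition explicit: the heavier the edge between two nodes, the more the regularizer penalizes placing their embeddings far apart.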

4) FINAL MODEL
We combine the individual GAE loss functions (6)-(7), the individual regularizers (10)-(11), and the joint components (8)-(9) into the loss function (12). In our experiments, we have only tested adding one term, $L_{G^* \to G}$, due to computational limitations on large graphs. Thus we omit the $L_{G \to G^*}$ term from descriptions in this work, setting $C^* = 0$. However, we believe that a more symmetric model would learn better quality embeddings.
In (12) we denote by $\Theta$ and $\Theta^*$ the learnable parameters of GAE and GAE$^*$, respectively. These are essentially the parameters of their GCN encoders, as defined in (2)-(3). During optimization, we iteratively update these weights with back-propagation and gradient descent, updating them synchronously: first, the gradients are computed with respect to the fixed weights from the previous update step, and then the parameters are updated (for more information on optimization schemes, see Section III-D). The model is shown in Figure 1.

D. OPTIMIZATION
In our implementation, we experiment with different training regimes for the joint optimization of the two networks. The standard training uses synchronous updates, computing gradients for both GAE and GAE$^*$ and renewing both parameter sets with the dual parameters held fixed at their previous values. As an alternative to simultaneously updating model weights, we propose to alternate updates for $G$ and $G^*$: gradients of the loss are propagated to the parameters of the model for $G$ on even steps and to the parameters of the model for $G^*$ on odd steps, keeping the embeddings of the other model fixed. The idea is similar to the training of a generative adversarial network (GAN), in which training a generator is alternated with training a discriminator. We expect this regime to have a similar effect of stabilizing training.
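The difference between the two regimes can be illustrated on a toy problem. The sketch below replaces the two autoencoders with two scalar parameters coupled through a quadratic loss; this is purely an assumption made for illustration and says nothing about the convergence of JONNEE itself. The names `gradients` and `train` are hypothetical.

```python
from typing import Tuple


def gradients(theta: float, theta_star: float, c: float = 1.0) -> Tuple[float, float]:
    """Gradients of the toy loss (theta-1)^2 + (theta_star-3)^2 + c*(theta-theta_star)^2."""
    g = 2.0 * (theta - 1.0) + 2.0 * c * (theta - theta_star)
    g_star = 2.0 * (theta_star - 3.0) - 2.0 * c * (theta - theta_star)
    return g, g_star


def train(mode: str, steps: int = 500, lr: float = 0.1) -> Tuple[float, float]:
    """Run gradient descent with synchronous or alternating parameter updates."""
    theta, theta_star = 0.0, 0.0
    for step in range(steps):
        # Both gradients are computed from the values of the previous step.
        g, g_star = gradients(theta, theta_star)
        if mode == "synchronous":
            theta, theta_star = theta - lr * g, theta_star - lr * g_star
        elif step % 2 == 0:
            theta -= lr * g  # even step: primal parameters only
        else:
            theta_star -= lr * g_star  # odd step: dual parameters only
    return theta, theta_star


print(train("synchronous"), train("alternating"))
```

On this convex toy loss both regimes reach the same minimum at (5/3, 7/3); the alternating schedule simply spends each step on one block of parameters while the other block stays fixed.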
Following the authors' implementation of GAE [59], we use the Adam optimizer with a learning rate of 0.01. In general, larger graphs (over 10000 nodes) require more training epochs; however, we observe that a number of epochs on the order of hundreds, or even just 100 to 200, worked well on our test cases.
Due to differences in scale, some components of the composite loss may outweigh others and disrupt optimization. To cope with that, we initially scale the additional part of the loss, $\lambda^* L^*_{reg}(\Theta^*) + C \cdot L_{G^* \to G}(\Theta, \Theta^*)$, to make it equal to the general part, $L_G(\Theta) + L_{G^*}(\Theta^*)$, times $\lambda$, and then keep this scaling, effectively normalizing gradients on every step based on the first step. In our tests, a hyper-parameter search showed that $\lambda = 0.1$, $\lambda^* = 0.2$ work best. We also use standard deep learning regularization tactics, adding a small dropout of 0.1 after the ReLU activation following the first convolutional layer. This means that we randomly drop a specified small share of the layer's neurons on each forward pass during training and turn our network into an ensemble, thus preventing excessive co-adaptation of different neurons.
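The scaling rule described above can be sketched as a small helper that fixes the scale on the first step and reuses it afterwards. The class name `LossBalancer` and the scalar loss values are hypothetical; the sketch only assumes that the scale is chosen so that the additional part equals $\lambda$ times the general part on the first step.

```python
from typing import Optional


class LossBalancer:
    """Fix a scale on the first step so that additional == lam * general, then reuse it."""

    def __init__(self, lam: float) -> None:
        self.lam = lam
        self.scale: Optional[float] = None

    def combine(self, general: float, additional: float) -> float:
        """Return general + scale * additional, setting the scale on the first call."""
        if self.scale is None:
            # Guard against a zero additional loss on the first step.
            self.scale = self.lam * general / additional if additional != 0 else 1.0
        return general + self.scale * additional


balancer = LossBalancer(lam=0.1)
print(balancer.combine(general=2.0, additional=50.0))  # first step fixes the scale
print(balancer.combine(general=1.0, additional=30.0))  # later steps reuse it
```

Keeping the scale constant after the first step means the relative weight of the additional terms can still drift during training, but their initial magnitude no longer dominates the reconstruction losses.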

E. JOINT DUAL GRAPH CONVOLUTIONAL NETWORK FOR SEMI-SUPERVISED CLASSIFICATION
Additionally, JONNEE admits modifications for incorporating labeled data in semi-supervised $K$-class node classification with labeled nodes $V_L \subseteq V$ (the dual construction also allows semi-supervised edge classification, usually used as entity recognition for knowledge graphs). This model, JONNLEE (with L for 'Labeled'), is even simpler: instead of reconstructing the adjacency matrix with an inner-product decoder, we use one more GCN layer on top of the representations $F$, with input size $d$ and output size $K$, and train its softmax-normalized output to approximate the target class probabilities, so that class probabilities $\approx \mathrm{Softmax}(\mathrm{GCNLayer}(F)) \in \mathbb{R}^{|V_L| \times K}$.
If we have edge labels in addition to node labels, we can use them in the same way. Otherwise, for the dual model to be semi-supervised as well, we label an edge as 1 if it connects two nodes with the same label, and as 0 if the labels differ or if one of the nodes is not in the training set and its label is unknown. With this construction, we force an edge to indicate whether its incident vertices belong to the same class, again using softmax: class probabilities ≈ $\sigma(\mathrm{GCNLayer}(F^{*})) \in \mathbb{R}^{|V^{*}_L| \times 2}$. This way, we can use our model with almost the same loss function, in which the $L_{G}$ and $L_{G^{*}}$ terms are now multinomial cross-entropies between the one-hot label distribution and the last layer output in the case of single-label multi-class classification.
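The edge labeling rule for the dual model can be written down directly. The following is a minimal sketch; the function name `edge_labels` and the toy graph are hypothetical.

```python
def edge_labels(edges, node_label, train_nodes):
    """Binary labels for the dual (edge) model derived from node labels.

    An edge gets 1 only if both endpoints are in the training set and share
    a label; it gets 0 if the labels differ or any endpoint is unlabeled.
    """
    labels = []
    for u, v in edges:
        both_known = u in train_nodes and v in train_nodes
        labels.append(1 if both_known and node_label[u] == node_label[v] else 0)
    return labels


# Toy example: node 3 is outside the training set, so its label is unknown.
toy_edges = [(0, 1), (1, 2), (2, 3)]
toy_node_label = {0: "a", 1: "a", 2: "b", 3: "b"}
toy_edge_labels = edge_labels(toy_edges, toy_node_label, train_nodes={0, 1, 2})
```

Note that the last edge receives 0 even though its endpoints share a class, because one endpoint is unlabeled during training.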
For the training set, we take the nodes that have labels. This way, we learn node and edge representations consistent with the training labels and also have them preserve first-order proximities with Laplacian regularization.
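Restricting the classification loss to labeled nodes can be sketched as a masked cross-entropy. This is an illustration only: the logits stand in for the output of the final GCN layer, and the names `softmax` and `masked_cross_entropy` are hypothetical helpers, not part of the described implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_cross_entropy(logits, labels, labeled_nodes):
    """Mean cross-entropy over labeled nodes only; other rows are ignored."""
    losses = [-math.log(softmax(logits[i])[labels[i]]) for i in labeled_nodes]
    return sum(losses) / len(losses)


# Three nodes, two classes; node 2 has no label and does not affect the loss.
toy_logits = [[0.0, 0.0], [2.0, 0.0], [0.0, 5.0]]
toy_labels = {0: 0, 1: 0}
toy_loss = masked_cross_entropy(toy_logits, toy_labels, labeled_nodes=[0, 1])
```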

IV. EXPERIMENTS WITH THE MODEL
In this section, we report on the experiments conducted to validate the proposed model. We start by describing the experimental setup, model implementation, and machine learning settings. Next, we describe extensive experiments on the impact of the model components and on hyperparameter optimization. Finally, we show the effect of the two training modes on model performance.

A. EXPERIMENTAL SETUP
1) MODEL IMPLEMENTATION AND HARDWARE
The model is written in Python 3 with the PyTorch framework, version 0.3.1.post2. The code to reproduce the results will become available after the paper is accepted. All computations were conducted on a single MacBook Pro 2013 laptop with a 2.3 GHz quad-core Intel Core i7 processor and 16 GB of RAM, with only one core used so as not to exploit differences in the models' ability to parallelize.

2) EVALUATION FOR MACHINE LEARNING TASKS ON GRAPHS
Since we maintain that edges can carry additional information that cannot be represented by the embeddings of incident nodes alone, we mainly focus on the link prediction problem to guide our model evaluation. Link prediction is one of the core machine learning tasks on graphs, and one that our model can solve better than other state-of-the-art embedding-based solutions. Secondly, we test our model on a multi-class node classification problem.
In these experiments, we run JONNEE for 100 epochs on 5 random 2000-node subgraphs with 5 random train-validation-test splits for all datasets except HSE and Cora, for which only different splits were used. All seeds are kept the same across algorithms and hyperparameter choices within one test suite. The threshold applied to the sigmoid activation of the output reconstruction layer was selected on the validation set and set to 0.6.
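Selecting the decision threshold on a validation set can be sketched as a small grid search. The selection criterion is an assumption: the text only states that the threshold was chosen on the validation set and fixed at 0.6, so validation accuracy is used here purely for illustration, and the function name `best_threshold` and the toy scores are hypothetical.

```python
def best_threshold(scores, labels, candidates):
    """Pick the candidate threshold with the highest validation accuracy.

    scores : sigmoid outputs in [0, 1] for validation node pairs
    labels : 1 for a true edge, 0 for a non-edge
    Ties are resolved in favor of the earliest candidate.
    """
    def accuracy(threshold):
        hits = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
        return hits / len(labels)

    return max(candidates, key=accuracy)


# Toy validation set: two non-edges and two edges.
val_scores = [0.20, 0.55, 0.65, 0.90]
val_labels = [0, 0, 1, 1]
chosen = best_threshold(val_scores, val_labels, candidates=[0.4, 0.5, 0.6, 0.7])
```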

3) DATASETS
In Table 1, we provide a summary of datasets we use for experiments.

B. COMPONENTS TESTING: FEATURE GENERATION
One of the central components of the model is seeding the autoencoder with informative node features by learning them with a sequence-based algorithm from the training data. In this set of experiments, we verify that this procedure is indeed useful and observe some other details regarding the choice of sequence-based algorithm for feature generation.

1) GENERATED VS. DUMMY FEATURES
By dummy features, we refer to the identity matrix $I_n$ used in place of node features, which can be viewed as a set of uninformative features, each expressing only the identity of its node. The other approaches are based on random walks from the node2vec (n2v in the tables) and diff2vec (d2v in the tables) models, as described previously in Section III. The results are shown in Table 2. They show that the proposed procedure significantly improves performance over dummy features. The choice between random walks and diffusion, in general, depends on the graph's properties: sparser graphs may be better handled by diffusion, whereas random walks generally yield better and more stable results on dense graphs.

2) CASE OF WEIGHTED GRAPH
Next, we analyze the performance of feature generation on a weighted network. While the biased random walk performed by node2vec can use edge weights by placing additional weight onto 'heavier' edges where appropriate, the original diff2vec does not use weights. We were able to slightly improve diff2vec performance on weighted graphs by modifying its diffusion graph construction stage to account for weights: we make propagation probabilities proportional to the respective edge weights (see Table 3). The HSE dataset itself was collected in [13] for the HSE University co-authorship recommender system, and its weights represent the aggregated quality of the papers published by researchers.
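Making propagation probabilities proportional to edge weights can be sketched as weighted neighbor sampling. This is not the modified diff2vec code: it only illustrates the proportional-sampling step, and the names `propagation_probabilities` and `sample_neighbor` are hypothetical.

```python
import random


def propagation_probabilities(neighbor_weights):
    """Normalize edge weights of one node's neighbors into probabilities."""
    total = sum(neighbor_weights.values())
    return {node: weight / total for node, weight in neighbor_weights.items()}


def sample_neighbor(neighbor_weights, rng):
    """Draw the next node to infect, proportionally to edge weight."""
    nodes = list(neighbor_weights)
    weights = [neighbor_weights[node] for node in nodes]
    return rng.choices(nodes, weights=weights, k=1)[0]


# Neighbor "b" is connected by an edge three times heavier than "a".
probs = propagation_probabilities({"a": 1.0, "b": 3.0})
rng = random.Random(0)
picked = sample_neighbor({"a": 1.0, "b": 3.0}, rng)
```

With unit weights this reduces to uniform sampling of neighbors, so the unweighted behavior is recovered as a special case.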

3) CASE OF NONTRIVIAL ATTRIBUTIVE NODE FEATURES X 0
In cases when additional information for each node is available, we face a potential trade-off: our learned sequence features may dilute the information provided by these extra features, which are valuable as they contain information additional to the structure data. Thus, we compare different options for making features on the first stage: we either concatenate two kinds of features or do not use sequence-based features at all (for the reasons mentioned above, we assume these features are valuable). Results on the Cora dataset are reproduced in Table 4. We find that using learned features X 1 in addition to extrinsic ones X 0 allows us to improve the results over all metrics by a significant margin.

4) NODE2VEC VS. DIFF2VEC
Although node2vec proves to be more stable and reliable, our experiments reveal that diffusion-based features result in better clustering in the embedding space. We visualize d = 16 dimensional embeddings in 2 dimensions using the TSNE algorithm [78], which is a common dimensionality reduction tool for vector-valued data. We also compute the silhouette score with K-Means-detected clusters for the original JONNEE embeddings on a 1000-node subgraph of the HSE dataset (see Figure 2). The observation that clustering is induced by diffusion is not surprising: whereas a truncated random walk with generic parameters p, q ≈ 1 most often moves away from the starting node within a few hops, diffusion stays local because nodes in the diffusion graph are sampled with equal probability.

C. COMPONENTS TESTING: β-MASKING
In these experiments, we aim to find out whether masking aids learning. For this purpose, both positive and negative values were tested, with positive values employed to smooth learning and negative values to force closer fitting to the training data. The results for selected representative mask intensities for Cit-HepTh and Ca-GrQc are shown in Table 5. They show that a small positive β-value improves generalization by regularizing the model, whereas negative masking is unstable and requires further investigation.
We also try larger masks for the dual optimizer, since in node-based tasks the dual model is only required to learn representations consistent with the primal graph rather than to perform well on edge-based tasks. Table 6 shows some results for varying β* with β fixed at the optimal value found for each dataset. We find that taking β* > β can improve quality. Overall, however, tuning the masking coefficients offers only marginal improvements once the appropriate order of magnitude is found, so defaulting to β = 0.01, β* = 0.02 is a good choice.

D. COMPONENTS TESTING: LAPLACIAN REGULARIZATION
Similar to the previous experiment, we test Laplacian regularization for λ* = λ, choosing the best parameter (see Table 7), and then vary the ratio between the two coefficients (see Table 8). The experiments show that adding Laplacian regularization slightly improves the quality of our model. The improvement that λ*-tuning offers is fairly small, so it is enough to set λ = 0.1 and λ* = 0.2.

E. COMPONENTS TESTING: LEARNING WITH DUAL GRAPH CONSTRAINT
In this set of experiments, we test one of the central features of our model: learning joint embeddings for nodes and for edges as nodes of the dual graph. From the results reported in Table 9, we see that dual learning with a small coefficient provides an improvement over node-based link prediction.
Remark 3: It is important to say a few words on hyperparameter choice and scaling. We observe that imposing a joint loss benefits the quality of the resulting node embeddings. However, training with it should be managed more carefully by tuning the step size so that one component of the loss function does not outweigh the others. In particular, the common loss should not outweigh the respective losses of the two autoencoders. Otherwise, the model would still learn to approximate averages, but these would be of low quality, not fit to reconstruct edges, and useless altogether. To this end, we compute the ratio between the loss component $L_{G}$ and the remaining part and then scale the remainder down by that constant factor on each subsequent iteration. This way, we achieve an initial composition of losses on the order of 1 : λ for the primal model's weights and on the order of 1 : λ* for the dual model's weights, with optimization weighing gradient directions in the same way. During optimization, these ratios may change, but the initial calibration is essential.

F. ALTERNATING OPTIMIZATION
We conduct two tests of the alternating optimization (see Table 10) and observe that, although the mean classification scores are generally worse, the sample variance across different runs is consistently reduced. This suggests that, by selecting appropriate learning rates and an appropriate number of steps for each optimizer to perform before passing control to the dual one, this procedure could achieve better performance.

V. EVALUATION FOR GRAPH MACHINE LEARNING
We compare our model with state-of-the-art graph embedding models on link prediction (unsupervised setting), node classification (unsupervised and semi-supervised settings), and network visualization (unsupervised setting). Each table shows the results of our best model after component testing. The compared models are run with the default hyperparameters stated in the corresponding research papers. A single unified pipeline for training and evaluation was applied to all models in each experiment.

A. LINK PREDICTION OVERALL
We compare our model to a selection of other popular structural embedding models based on different principles: HOPE [16] and GraphFactorization (GrF in the tables) [79], which are based on matrix factorization; node2vec (N2v in the tables) and unmodified diff2vec (D2v in the tables), which learn representations from sampled sequences; and GAE [59], GAT [61], and GraphSAGE (GrS in the tables) [60], which are deep learning models for graphs. With this, we want to verify that JONNEE compares favorably along all dimensions, against a diverse set of models, across various metrics and datasets. The results of our experiments are shown in Table 11. In the experiments, we take the embedding dimension to be d = 16 and sample several subgraphs while holding the train/validation/test split constant with proportions 0.85/0.05/0.1. Other hyperparameters of the benchmark models were left at their defaults. We use the implementations of GAT and GraphSAGE from the DGL framework [80]. The default number of layers for both models is 2. GraphSAGE uses GCN aggregation. The results show that tuned JONNEE with diffusion features and 400 epochs mostly outperforms the baselines.
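A seeded 0.85/0.05/0.1 partition of this kind can be sketched as follows. The sketch assumes that the split is taken over edges and uses a plain shuffle; negative edge sampling, which link prediction evaluation also needs, is omitted. The function name `split_edges` is hypothetical.

```python
import random


def split_edges(edges, seed, fractions=(0.85, 0.05, 0.10)):
    """Shuffle edges with a fixed seed and cut them into train/val/test."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_train = round(fractions[0] * len(shuffled))
    n_val = round(fractions[1] * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, about 10% of the edges
    return train, val, test


# A path graph with 100 edges as toy input.
toy_edges = [(i, i + 1) for i in range(100)]
train_e, val_e, test_e = split_edges(toy_edges, seed=0)
```

Reusing the same seed for every compared model keeps the split identical across algorithms, which is what holding the split constant requires.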
Despite the larger number of parameters (due to accounting for the Line graph), Figure 3 shows that our model converges faster than GAT and GraphSAGE on the HSE, Cit-HepTh, and Ca-GrQc datasets.

B. MULTI-CLASS CLASSIFICATION FOR NODES
In this set of experiments, we verify that the model's performance generalizes well from link prediction to other tasks such as multi-class node classification. Additionally, we offer a simplified version of our Dual Autoencoder JONNEE, named JONNLEE, directly adapted to incorporate available node labels during training in a semi-supervised manner.
When evaluating our model on multi-class classification, we no longer report sample standard deviations, to keep the tables clean. We train models on an 80% subsample and validate on the remaining 20%.

1) UNSUPERVISED REPRESENTATION EXTRACTION
In these experiments, we train models on the entire graph G in an unsupervised way to generate node embeddings and then fit a Random forest with default parameters for multi-class classification on a random subsample of node representations to evaluate the quality of the embeddings. By producing embeddings, we reduce the node classification problem to a standard classification problem with vector input. We choose Random forest as a baseline classifier because it was shown to perform better than other standard methods, such as support vector machines (SVM) or gradient boosting (XGBoost), on multi-class classification of embedding vectors for co-authorship networks [14]; we leave the comparison of other classifiers on JONNEE representations for future work. All models except GraphFactorization perform better than random guessing (which in the case of Cora is ≈ 0.14 accuracy). The results are reported in Table 12.

2) JONNLEE: SEMI-SUPERVISED ANALOGUE OF JONNEE
By exploiting knowledge of labels during the training stage, we incorporate task knowledge into the embeddings and expect better performance on the relevant task. For a fair comparison, we take another semi-supervised model, GCN, and compare it to JONNLEE, a semi-supervised modification of JONNEE. From this experiment, we verify that the proposed modification of JONNEE for the semi-supervised setting outperforms the model without supervision and also compares favorably to the original GCN (see Table 13).

C. VISUALIZATION AND COMMUNITY DETECTION
Finally, we briefly demonstrate that the embeddings learned by JONNEE are suitable for visualization. For this, we compare it to diff2vec [37], known for accurate and highly modular cluster visualization. We use the Cora dataset, which is convenient due to having only a single label per node, which can be interpreted as a community/cluster label. In this case, communities are topic groups of papers on the "machine learning" topic, with features containing their TF-IDF encoded content and links representing citations.
We learn 100-dimensional embeddings with both models to retain more information and use TSNE [78] to obtain planar embeddings. We then plot the embeddings and the edges between them and color the points according to their class labels. The results for both models are displayed in Figure 4, where we see that the embeddings learned by diff2vec in an unsupervised way are nicely aligned with the true labels, while JONNEE not only identifies thematic communities by label but also forms cleaner and more modular clusters.

D. DISCUSSION
We see that the quality of the embeddings learned by JONNEE is almost always superior to state-of-the-art structural models. Moreover, it is exceptionally flexible and is able to beneficially incorporate both node and edge features and weights while being stable to hyperparameter perturbations. However, our model has a certain drawback of longer training, similar to matrix factorizations but less parallelizable.
Our model has complexity $O(|E|^{2})$ in the worst case, which is arguably far from efficient. However, the paper is intended as a proof of concept for (edge-to-vertex) 'dual' graphs to be used and explored in tandem with regular graphs to learn more accurate representations. Moreover, for sparse real-world networks, this complexity is lower than the complexity of some factorizations, as mentioned above. Additional memory consumption is at worst $|E|^{2}$ at runtime, but storing all of the extra data is unnecessary. We believe these problems to be solvable with clever approximations to the dual graph similar to [11], [12], for which we have not found open-source code to conduct a comparison. The code of our model and experiments can be found on GitHub (see footnote 1).
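The quadratic worst case comes from the size of the Line graph itself: two edges of G become adjacent in the dual whenever they share an endpoint, so a node of degree k contributes k(k-1)/2 dual edges. A minimal sketch that builds the dual edge set for a simple undirected graph follows; the function name `line_graph` is hypothetical and the code is not taken from the model's implementation.

```python
from collections import defaultdict
from itertools import combinations


def line_graph(edges):
    """Edge set of the Line graph; dual nodes are indices into `edges`."""
    incident = defaultdict(list)
    for idx, (u, v) in enumerate(edges):
        incident[u].append(idx)
        incident[v].append(idx)
    dual_edges = set()
    for edge_ids in incident.values():
        # Every pair of edges meeting at the same node is adjacent in the dual.
        dual_edges.update(combinations(edge_ids, 2))
    return dual_edges


# A star is the worst case: all edges meet at the hub, so the dual is complete.
star = [(0, 1), (0, 2), (0, 3), (0, 4)]
# A path is the sparse case: the dual is again a path, one edge shorter.
path = [(0, 1), (1, 2), (2, 3)]
star_dual, path_dual = line_graph(star), line_graph(path)
```

For the star with 4 edges the dual has 6 edges, that is |E|(|E|-1)/2, whereas for the path it has only |E|-1, which illustrates why sparse graphs without high-degree hubs stay far from the worst case.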

VI. CONCLUSION
In this work, we have developed a new embedding model JONNEE which learns high-quality node representations in tandem with edge representations. We have implemented the model and conducted an extensive range of experiments demonstrating that the model performs well compared to other state-of-the-art benchmarks.
The quality of the embeddings learned by JONNEE is almost always superior in the link prediction problem to that of embeddings learned by state-of-the-art matrix factorization-based and sequence-based models, and it is competitive with other deep learning methods. The model performs well in both unsupervised and semi-supervised settings for the node classification task. In addition, embeddings learned by JONNEE have a well-clustered structure and are suitable for visualization. JONNEE is exceptionally flexible and is able to beneficially incorporate both node and edge features and weights while being stable to hyperparameter perturbations. However, our model has the drawback of longer training, similar to matrix factorizations but less parallelizable. This can be overcome by changing core architecture components: graph autoencoders may be replaced with graph sampling models or inductive models so that JONNEE will be able to process large and temporal networks. It is also interesting to see how the construction of joint constraints may be generalized to arbitrary embedding operators that can be approximated by a deep neural network over the embeddings of incident nodes. We believe that the presented Line graph self-supervision for learning network representations is a promising approach that generalizes to complex networks.