Local Structure and High-Order Feature Preserved Network Embedding Based on Non-negative Matrix Factorization

Network embedding, as an effective method of learning low-dimensional representations of nodes, has been widely applied to various complex network analysis tasks, such as node classification, community detection, link prediction and evolution analysis. Existing embedding methods usually focus on the local structure of the network by capturing community structure, first-order or second-order proximity, etc. Some methods have been proposed to model the high-order proximity of networks to capture more effective information. However, they are incapable of preserving the similarity among nodes that are not very close to each other in the network but have similar structures. For instance, nodes with similar local topology should be similar in the embedding space even if they are not in the same community. Herein, we regard these structural characteristics as high-order features, which reveal that structural similarity between nodes can be spatially unrelated. In light of the above limitations of existing methods, we construct a high-order feature matrix that mutually reinforces the embedding which preserves the local structure. To integrate these features effectively, we propose LHO-NMF, which fuses the high-order features into a non-negative matrix factorization framework while capturing the local structure. The proposed LHO-NMF can effectively learn node representations by preserving both the local structure and the high-order feature information. Specifically, the high-order features are learned based on a random walk algorithm. The experimental results show that the proposed LHO-NMF method is very effective and outperforms other state-of-the-art methods on multiple downstream tasks.


I. INTRODUCTION
Large-scale networks with complex topology structure, node content, node labels and other side information are becoming ubiquitous, such as social networks, biological networks, protein interaction networks, citation networks and telecommunication networks [1]-[6]. In recent years, complex network analysis has attracted increasing attention. Network embedding is a fundamental method in complex network analysis, which refers to the approach of embedding the nodes and links of the network into a low-dimensional space while preserving internal structure and attribute affinity [7]. Nodes that are similar in the network should also be similar, and kept close, in the low-dimensional space. In this way, a variety of network analysis tasks such as node classification [8], [9], community detection [10]-[12], visualization [13] and link prediction [14], [15] can be well supported by vector-based machine learning algorithms.
Recently, many network embedding methods have mainly paid attention to neighborhood structure, community structure, as well as high-order proximity, all of which are typical network structures. By treating nodes as words and generating short random walks as sentences, the random-walk-based DeepWalk [7] applies a natural language processing model to learn node representations that preserve the neighborhood structure. To capture the diversity of connectivity patterns in the network, Node2vec [16] improves the random walk strategy by utilizing breadth-first sampling and depth-first sampling to preserve neighborhood structures at different levels. LGSFA [17], proposed by Huang et al., captures the local geometric structure by reconstructing current nodes with neighbors and enhancing intra-class compactness and inter-class separability. In addition, to improve classification performance, LNSPE [18] captures the neighbors of samples and defines an adjustable loss function to learn the optimal adjacency graph.
Furthermore, it is necessary to capture richer neighborhood structures to learn more connotative and discriminative node representations. At present, a number of methods have been proposed to preserve the first-order, second-order and even higher-order node proximity of the network. DeepWalk learns the high-order proximity of nodes through a hyper-parameter, i.e., the window size. LINE [19] integrates the first-order and second-order proximity between nodes into the embedding learning phase. However, LINE defines two separate loss functions, so it can only capture the local and the global network structure information independently. SDNE [20] then preserves the local and global structure simultaneously using a deep learning algorithm, and maps the network to a non-linear latent space to capture the highly non-linear network structure. Besides, to capture more global structure information, a number of methods have been proposed to capture the higher-order proximity of nodes [20], [21]. The methods mentioned above can merely preserve fixed-order proximity and cannot perform well on all networks or tasks. Hence, Zhang et al. [22] propose AROPE, based on an SVD framework. AROPE obtains arbitrary-order proximity of nodes while revealing the intrinsic relationship between proximities of different orders.
The above methods have achieved excellent performance on a variety of network analysis tasks. Even so, they are incapable of capturing the similarity among nodes that are not very close to each other in the network but have similar structures. For example, nodes that are not in the same community may have a similar surrounding topology; in that case, they should be similar in the embedding space too. In this paper, we regard these spatially unrelated structural similarities between nodes as high-order features. In addition, the above methods also ignore the mutual interaction between local structural similarity and higher-order features. In fact, it is essential to explore the relationships between them while preserving these characteristics, and to examine how they affect the performance of network embedding. Methods based on matrix factorization that decompose the adjacency matrix to learn node embeddings can only capture the local structure of the network. Moreover, most networks are very sparse, so node embeddings learned by factorizing the adjacency matrix lose a lot of structural information. Algorithms that define a polynomial function of the adjacency matrix to capture high-order proximity make the network denser and increase computational complexity and noise. They are also not accurate enough, because for an unweighted network the k-th power of the adjacency matrix may include plenty of repetitive edges. In light of the above limitations, we aim to capture the local structure and the high-order features at the same time and exploit their consensus relationship in how they impact node embedding. One of the main obstacles we face is how to effectively fuse the high-order features and the local structure into the embedding space so as to learn more informative and discriminative representations for each node.
Towards this goal, we propose LHO-NMF (Local structure and High-Order features preserved network embedding based on Non-negative Matrix Factorization), a novel approach that preserves both the high-order features learned from a random walk algorithm and the local structure proximity in node representations. Owing to its interpretability and additivity, NMF can map multiple components to a low-dimensional space effectively while maintaining the internal correlations between the components. Therefore, we construct the high-order feature matrix and the local structure extraction matrix, and then learn the low-dimensional node embeddings preserving multiple structures through the NMF framework. Specifically, the high-order features are captured by factorizing a random-walk matrix, which preserves the global network structure. Eventually, we optimize the model via an iterative multiplicative updating algorithm. Comprehensive and extensive experiments on various real networks demonstrate the effectiveness of our proposed model.
In summary, our major contributions are as follows:
• We propose LHO-NMF, an efficient and scalable algorithm for network embedding, which effectively integrates the high-order features of nodes into a non-negative matrix factorization framework while capturing the local structure.
• We construct the high-order feature matrix based on random walk and adopt the combined NMF framework to permit users to control the loss weight between different structure features.
• We conduct extensive experiments, including multi-label classification, clustering and link prediction, on several real-world datasets to evaluate the performance of the model. The results show that our method achieves significant accuracy improvements over baseline methods.
The rest of this article is organized as follows. In Section 2, we introduce related work. In Section 3, we present the proposed framework LHO-NMF in detail. In Section 4, we provide experimental results and analysis to evaluate our method. Finally, this article is concluded in Section 5.

II. RELATED WORK
In this section, we first give a brief description of related work on network embedding. Then we formally introduce related work on non-negative matrix factorization to help understand our model better.
A. NETWORK EMBEDDING
Our work aims to learn representations of nodes in a network, which is also called network embedding. Recently, network embedding has been an active area of frontier research [23]-[25]. From the perspective of the research method, network embedding algorithms can be roughly divided into the following categories: random walk, deep learning and matrix factorization [26].
Early works such as IsoMAP [27] and Locally Linear Embedding (LLE) [28] decompose a constructed affinity matrix to obtain eigenvectors as low-dimensional representations of the network. With the success of random walk in the field of natural language processing, some researchers proposed applying this method to graphs. For example, DeepWalk [7] generates random walk sequences for each node of the network and treats these sequences as sentences in word2vec [29]. It then feeds the sequences into Skip-Gram [30] to learn the node embeddings. To offset the arbitrariness of uniform random walks, node2vec [16] defines two hyper-parameters to balance depth-first sampling (DFS) and breadth-first sampling (BFS), capturing more comprehensive neighborhood structure and global information of the network. More recently, Tang et al. [19] define two loss functions to capture first-order and second-order proximity respectively, preserving the local and global structure of the network. Here, first-order proximity is defined between two nodes directly connected by an edge, and second-order proximity is defined by shared neighbors. The limitation of LINE is that it learns the local and global information independently. SDNE [20] defines a semi-supervised neural network model to jointly capture the first-order and second-order proximity, which increases the ability of the model to capture nonlinear structure. The second-order proximity is learned by the unsupervised part, while the local information is captured by using the first-order proximity as a supervision signal. Further, in addition to jointly integrating the first- and second-order proximity via a deep variational model, DVNE [31] maps the network into a Wasserstein space to capture more of the uncertainty of the nodes.
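As a concrete illustration, the walk-generation step shared by DeepWalk-style methods can be sketched as follows. This is a minimal Python sketch, assuming the graph is given as an adjacency-list dict; `generate_walks` and all other names are illustrative, not the authors' implementation:

```python
import random

def generate_walks(adj, num_walks=10, walk_length=40, seed=0):
    """Generate DeepWalk-style uniform random walks.

    adj: dict mapping each node to a list of its neighbors.
    Returns a list of walks (each a list of nodes) that would then be
    fed to a Skip-Gram model as "sentences".
    """
    rng = random.Random(seed)
    nodes = list(adj)
    walks = []
    for _ in range(num_walks):
        rng.shuffle(nodes)            # start one walk per node, in random order
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adj[walk[-1]]
                if not neighbors:     # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Toy 4-node path graph 0-1-2-3
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
walks = generate_walks(adj, num_walks=2, walk_length=5)
```

Node2vec replaces `rng.choice` with a biased choice controlled by its return parameter p and in-out parameter q.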
Recently, Graph Neural Networks, which encode nodes by aggregating neighborhood information in the spatial or spectral domain, have become a popular trend in network embedding. The aggregation mechanism of GCN [32] is defined as a localized first-order approximation of spectral graph convolutions. GAT [33] first applies a self-attention mechanism to neighborhood aggregation. Both GCN and GAT are trained in a semi-supervised manner. GAE [34] has the same convolutional architecture as GCN but is trained by reconstructing the adjacency relationships. GraphSAGE [35] aggregates neighborhood information with sum-pooling, mean-pooling or LSTMs on sampled neighbors and is trained by reconstructing both edges and sampled non-edges.
Moreover, to capture more comprehensive internal structure information, many researchers extend these works by utilizing high-order information [20]-[22], [36]. For example, GraRep [21] defines different loss functions to capture k-hop relational information, then combines the k-proximities learned from the different models to obtain global representations of nodes. As mentioned before, the above methods only preserve fixed-order proximities and cannot perform well on all networks or tasks. Hence, Zhang et al. [22] propose AROPE (ARbitrary-Order Proximity preserved Embedding), based on an SVD framework. AROPE can select an arbitrary order of proximity according to the network characteristics while revealing the intrinsic relationship between proximities of different orders.
Furthermore, matrix factorization is also widely applied in network embedding. Common methods such as Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) are used in a series of matrix factorization models [15], [37]. Non-negative Matrix Factorization (NMF) [38], as a variant of standard matrix factorization, adds non-negativity constraints on the matrices. A number of studies indicate that NMF has better additivity and interpretability, so it is often used to learn embeddings. However, plain matrix factorization is unable to address the higher-order recovery problem; WATITF [39] was proposed to overcome this issue. For more on network embedding, see the recent review by Cui et al. [26].

B. NON-NEGATIVE MATRIX FACTORIZATION.
Given n nonnegative observations x_i ∈ R m +, i = 1, 2, ..., n, we can arrange them into a data matrix X = [x_1, x_2, ..., x_n] ∈ R m×n +. NMF seeks two nonnegative matrices U ∈ R m×d + and V ∈ R d×n + so that their product approximates the original data matrix X: X ≈ UV, where d ≪ min(m, n) is the dimension of the latent space or the rank of the underlying data.
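The factorization X ≈ UV is typically found by minimizing ||X − UV||_F^2 under the non-negativity constraints, e.g., with Lee and Seung's multiplicative updates. A minimal numpy sketch of this generic solver follows; it is illustrative only (the `nmf` function and its parameters are our naming), not the exact solver used later in this paper:

```python
import numpy as np

def nmf(X, d, n_iter=500, eps=1e-9, seed=0):
    """Basic NMF via Lee-Seung multiplicative updates: X ≈ UV.

    X: (m, n) nonnegative data matrix; d: latent dimension.
    Returns nonnegative factors U (m, d) and V (d, n).
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, d))
    V = rng.random((d, n))
    for _ in range(n_iter):
        # Each multiplicative step does not increase ||X - UV||_F^2,
        # and keeps the factors nonnegative.
        V *= (U.T @ X) / (U.T @ U @ V + eps)
        U *= (X @ V.T) / (U @ V @ V.T + eps)
    return U, V

# A nonnegative matrix of rank 2 is recovered almost exactly with d = 2.
rng = np.random.default_rng(1)
X = rng.random((6, 2)) @ rng.random((2, 8))
U, V = nmf(X, d=2)
err = np.linalg.norm(X - U @ V)
```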
As a popular dimension reduction model, NMF has been used in diverse network mining tasks such as community detection, link prediction and network embedding. A number of studies have confirmed the effectiveness of NMF on these tasks [40]-[48]. Recently, PPNMF (Proximity Preserving Non-negative Matrix Factorization) [49] combines the first-order and second-order proximity into an NMF framework to learn node representations. In addition to integrating first-order and second-order similarity, Wang et al. [10] propose a Modularized Non-negative Matrix Factorization (M-NMF) framework that adds community constraints through a modularity constraint term. M-NMF then utilizes NMF to incorporate the similarity information and community structure into the embedding space.
Further, in consideration of the significance of higher-order proximity for grasping the global characteristics of the network, some studies take higher-order proximity into account to capture more comprehensive information. GraRep [21] applies matrix factorization to preserve the high-order proximity and concatenates the representations of nodes learned from different orders. To be more effective and flexible, AROPE (ARbitrary-Order Proximity preserved Embedding) [22], based on an SVD framework, can capture the similarity of any order according to the network characteristics while retaining the correlation between different orders. However, the above NMF-based methods are shallow models, so they are unable to reveal the nonlinear correlation between the original network and the embedding space. In light of these issues, Ye et al. [50] propose Deep Autoencoder-like NMF (DANMF), which combines an encoder and a decoder component with NMF to preserve the hidden structure characteristics.
It is obvious that almost all of the above network embedding frameworks consider the local neighbor structure, community property, and low-order or high-order proximity between nodes independently; how these factors jointly affect network embedding performance remains to be explored. In this paper, we construct an NMF framework to exploit the consensus relationship among neighbor structure, community attributes and k-order structure information, and how they impact network embedding.

III. THE PROPOSED MODEL
In this section, we present our proposed method LHO-NMF in detail. Firstly, we introduce the notations used in this paper. Then, we describe how to capture the high-order features and the local structure, respectively. Finally, we integrate them with NMF to learn the low-dimensional representations of nodes, followed by the optimization algorithm.
LHO-NMF includes two vital parts: a local structure information encoder and a high-order feature encoder. As we know, the adjacency matrix A preserves most of the network's topological structure and directly represents the first-order proximity. Our model first maps the original network into a low-dimensional embedding space to obtain X by factorizing the adjacency matrix A. Because the network is very sparse, X loses a lot of pivotal information when preserving the local structure. At the same time, we generate random walk sequences for each node and learn the k-order features by controlling the window size. Further, we integrate the high-order features learned from the random walks into the NMF framework while capturing the local structure, so that the learned node representations jointly preserve the high-order feature information and the local structure. The overall architecture of our model is shown in Fig. 1.

A. NOTATIONS
Suppose there is an undirected network G = (V, E) with n nodes and e edges, where V = {v_1, v_2, ..., v_n} denotes the set of nodes and E denotes the set of edges among the nodes. G is represented by the adjacency matrix A ∈ R n×n, where A(i, :) represents the connections between node i and the other nodes in G. For unweighted networks, A_ij = 1 if there exists an edge between node i and node j, otherwise A_ij = 0. Since network G is undirected, A is a symmetric matrix, i.e., A_ij = A_ji. In this paper, we use bold uppercase characters to denote matrices. The matrix B ∈ R n×d preserves the high-order structure features of the nodes in the network, where d is the dimension of the features, and B(i, :) describes the structure feature of node i. Our purpose is to learn the representations of nodes W ∈ R n×k (k ≤ n), where k is the dimension of the representations. The terms and notations used later are shown in Table 1.

TABLE 1. Terms and notations.
G = (V, E)   The graph G with node set V and edge set E
A ∈ R n×n    The adjacency matrix of G
B ∈ R n×d    The high-order structure feature matrix of nodes
W ∈ R n×k    The representations of nodes
X ∈ R n×m    The local structure matrix of nodes
vol(G)       The volume of the graph G
T            The size of the window

B. LOCAL STRUCTURE EXTRACTION
In the real world, information networks often have much missing information: many nodes without direct connections in the adjacency matrix are in essence highly similar. In addition, the adjacency matrix preserves the directly connected edges, which characterize the first-order proximity of nodes; that is, the adjacency matrix preserves the topology structure of the network. Hence, we factorize the adjacency matrix A to capture the local topology structure of the network. To preserve the local structure in a low-dimensional space, we adopt the NMF method, which minimizes the following objective function:

min_{X ≥ 0} L_1 = ||A − XX^T||_F^2,   (1)

where X ∈ R n×m is the local feature matrix and m is the dimension of the space. As mentioned before, the embedding matrix X actually loses a lot of pivotal information. Therefore, we integrate higher-order features to mutually enhance the learning of X. At the same time, we obtain the low-dimensional representations of nodes preserving local and community structures by decomposing the local feature matrix X, which minimizes the following objective function:

min_{W, U ≥ 0} L_2 = ||X − WU||_F^2,   (2)

where W ∈ R n×k is the embedding matrix, U ∈ R k×m is an auxiliary matrix and k is the embedding dimensionality.

C. HIGH-ORDER STRUCTURE EXTRACTION
In this paper, we extract the high-order structure of nodes based on random walk. Qiu et al. [51] prove that models based on random walk and skip-gram can be regarded as factorizing a matrix with a closed form, and verify their effectiveness for conventional network mining tasks. The matrix that is implicitly approximated and factorized is denoted as the following:

log( vol(G)/(bT) ( Σ_{r=1}^{T} (D^{−1}A)^r ) D^{−1} ),   (3)

where D is the degree matrix, b is the number of negative samples and T is the window size. Higher-order structures can be obtained as the window size T increases. Inspired by this, we construct the high-order structure feature matrix B ∈ R n×d by factorizing this matrix, and then extract the high-order structure from the matrix B by minimizing the following objective function:

min_{W, H ≥ 0} L_3 = ||B − WH||_F^2,   (4)

where H ∈ R k×d is an auxiliary matrix.
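For intuition, the closed-form matrix of Qiu et al. can be computed directly on a small dense graph. The sketch below is illustrative: `high_order_matrix` is our naming, and the truncated element-wise logarithm log max(M, 1) follows common NetMF practice rather than anything stated in this paper:

```python
import numpy as np

def high_order_matrix(A, T=4, b=1):
    """NetMF-style matrix implicitly factorized by random-walk methods.

    A: dense (n, n) adjacency matrix of an undirected graph.
    T: window size; b: number of negative samples.
    Returns log(max(M, 1)) with M = vol(G)/(bT) * (sum_r P^r) D^{-1},
    where P = D^{-1} A is the random-walk transition matrix.
    """
    deg = A.sum(axis=1)
    vol = deg.sum()                          # volume of the graph
    P = A / deg[:, None]                     # row-normalized transition matrix
    S = np.zeros_like(P)
    Pr = np.eye(len(A))
    for _ in range(T):                       # accumulate P^1 + ... + P^T
        Pr = Pr @ P
        S += Pr
    M = (vol / (b * T)) * S / deg[None, :]   # right-multiply by D^{-1}
    return np.log(np.maximum(M, 1.0))        # truncated logarithm

# 4-node path graph 0-1-2-3: a window of size 1 only sees direct
# neighbors, while larger T brings in higher-order structure.
A = np.array([[0., 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
M1 = high_order_matrix(A, T=1)
M3 = high_order_matrix(A, T=3)
```

A rank-d factorization of this n × n matrix would then supply the n × d feature matrix B used in (4).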

D. UNITED FRAMEWORK
Together with the objective functions (1), (2) and (4), the final loss function of our method can be denoted as:

min_{X, W, U, H ≥ 0} L = α ||A − XX^T||_F^2 + β ||X − WU||_F^2 + γ ||B − WH||_F^2,   (5)

where α, β and γ are positive parameters for adjusting the contribution of the corresponding terms.
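The combined objective can be evaluated directly. The small sketch below is illustrative; the weighted three-term Frobenius form (adjacency reconstruction, local-structure factorization, high-order feature factorization) is our reconstruction from the surrounding text, and `lho_nmf_loss` is our naming:

```python
import numpy as np

def lho_nmf_loss(A, B, X, W, U, H, alpha, beta, gamma):
    """Weighted sum of the three Frobenius-norm reconstruction terms."""
    fro2 = lambda M: np.linalg.norm(M, "fro") ** 2
    return (alpha * fro2(A - X @ X.T)   # local structure term
            + beta * fro2(X - W @ U)    # embedding factorization term
            + gamma * fro2(B - W @ H))  # high-order feature term

# Sanity check: exact factorizations make every term vanish.
rng = np.random.default_rng(0)
W = rng.random((5, 2)); U = rng.random((2, 3)); H = rng.random((2, 4))
X = W @ U
A = X @ X.T
B = W @ H
zero_loss = lho_nmf_loss(A, B, X, W, U, H, 1.0, 1.0, 1.0)
```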

E. OPTIMIZATION
Because the loss function in (5) is not convex, it is infeasible to obtain the global optimal solution directly by setting its derivative to zero. In this paper, we divide the loss function into four subproblems to separately optimize the four parameter matrices (X, W, H, U). We then use the Majorization-Minimization framework [52] to compute a local optimal solution for each subproblem. The update strategy we adopt is alternating optimization, i.e., fixing the other three matrices when updating one. Algorithm 1 shows the pseudo-code of the optimization process. The specific formulas are as follows.
X-subproblem: Updating X with the other parameters W, U, H fixed leads to the following sub-optimization problem:

min_{X ≥ 0} α ||A − XX^T||_F^2 + β ||X − WU||_F^2.   (6)

Herein, X is a nonnegative matrix, so we introduce the Lagrange multiplier matrix Θ, which results in the following equivalent function:

L(X) = α ||A − XX^T||_F^2 + β ||X − WU||_F^2 + Tr(ΘX^T).   (7)

Further, setting the derivative of (7) to 0, i.e., ∂L(X)/∂X = 0, we have:

−4αAX + 4αXX^T X + 2βX − 2βWU + Θ = 0.   (8)

Following the Karush-Kuhn-Tucker (KKT) condition Θ_ij X_ij = 0 for the non-negativity of X, we obtain the following equation:

(−4αAX + 4αXX^T X + 2βX − 2βWU)_ij X_ij = 0.   (9)

Initializing the value of X, the updating rule of X is:

X ← X ⊙ (2αAX + βWU) / (2αXX^T X + βX),   (10)

where ⊙ denotes element-wise (Hadamard) multiplication and the division is element-wise.

W-subproblem: When updating W with the other parameters X, H, U fixed, we need to solve the following objective function:

min_{W ≥ 0} β ||X − WU||_F^2 + γ ||B − WH||_F^2.   (11)

Similar to the optimization computation of X, we define the following updating rule for W:

W ← W ⊙ (βXU^T + γBH^T) / (βWUU^T + γWHH^T).   (12)

H-subproblem: Updating H with the other parameters X, W, U fixed results in the following function:

min_{H ≥ 0} γ ||B − WH||_F^2.   (13)

In the same way, the updating rule of H is then given as follows:

H ← H ⊙ (W^T B) / (W^T WH).   (14)

U-subproblem: Updating U with the other parameters X, W, H fixed leads to the following problem:

min_{U ≥ 0} β ||X − WU||_F^2.   (15)

Likewise, optimizing U by the same method as X, we use the following rule to update it:

U ← U ⊙ (W^T X) / (W^T WU).   (16)

Algorithm 1: Optimization of LHO-NMF
Input: The network G, the high-order structure feature matrix B, the embedding dimension k, convergence coefficient δ, and balance parameters α, β, γ
Output: X ∈ R n×m, W ∈ R n×k, U ∈ R k×m, H ∈ R k×d
1 Initialize X, W, U, H;
2 while not converged do
3   Update X according to (10);
4   Update W according to (12);
5   Update H according to (14);
6   Update U according to (16);
7   Compute the loss function L using (5);
8 end

The optimization flowchart of our method is described in Algorithm 1. The input data of the optimization include the network G, the high-order feature matrix B, the embedding dimension k, the convergence coefficient δ, and the balance parameters α, β, γ. First of all, we randomly initialize X, U, W, H with a uniform distribution. Then, we update X, U, W, H iteratively until convergence (Alg. 1, Lines 2-8).
The output is the embedding matrix W for all nodes in the network. The representations of nodes learned by LHO-NMF preserve both the high-order features and the local structure.
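Putting the pieces together, the alternating multiplicative updates can be sketched in a few lines of numpy. This is an illustrative sketch: the update rules are our reconstruction of (10), (12), (14) and (16), and the `lho_nmf` function, the toy graph and all parameter choices are assumptions, not the authors' implementation:

```python
import numpy as np

def lho_nmf(A, B, m=16, k=8, alpha=1.0, beta=1.0, gamma=1.0,
            n_iter=300, eps=1e-9, seed=0):
    """Alternating multiplicative updates for the combined objective.

    A: (n, n) symmetric adjacency matrix; B: (n, d) high-order features.
    Returns X, W, U, H; the rows of W are the node embeddings.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape[0], B.shape[1]
    X = rng.random((n, m))
    W = rng.random((n, k))
    U = rng.random((k, m))
    H = rng.random((k, d))
    for _ in range(n_iter):
        # Each rule multiplies by the ratio of the negative to the
        # positive gradient part, so factors stay nonnegative.
        X *= (2 * alpha * A @ X + beta * W @ U) / \
             (2 * alpha * X @ (X.T @ X) + beta * X + eps)           # (10)
        W *= (beta * X @ U.T + gamma * B @ H.T) / \
             (beta * W @ (U @ U.T) + gamma * W @ (H @ H.T) + eps)   # (12)
        H *= (W.T @ B) / (W.T @ W @ H + eps)                        # (14)
        U *= (W.T @ X) / (W.T @ W @ U + eps)                        # (16)
    return X, W, U, H

# Two 4-cliques joined by one edge; nodes within a clique should end
# up with similar rows of W.
n = 8
A = np.zeros((n, n))
A[:4, :4] = 1; A[4:, 4:] = 1
np.fill_diagonal(A, 0)
A[3, 4] = A[4, 3] = 1
B = A.copy()            # stand-in for the high-order feature matrix
X, W, U, H = lho_nmf(A, B, m=4, k=2)
```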

F. COMPUTATIONAL COMPLEXITY ANALYSIS
The whole computational complexity of our model depends on the matrix multiplications in the updating rules. As we know, given two matrices A ∈ R m×r + and B ∈ R r×n +, the computational complexity of AB is O(mrn). Based on this, the computational complexities of the four updating equations in Algorithm 1 (Lines 3-6) are O(n²m + nkm + nm²), O(nmk + ndk + 2nk² + mk² + dk²), O(nkd + nk² + dk²) and O(nkm + nk² + mk²), respectively. Since m, k, d can be regarded as input constants with m ≤ n, k ≪ n, d ≪ n, the computational complexity is O(n² + nm² + nkm + nkd). In practice, most networks are very sparse, hence only the non-zero values are computed in the matrix multiplications. Based on this, the computation is reduced to O(ne + nm² + nkm + nkd), where e is the number of edges in the network. In addition, in our model the matrices X ∈ R n×m, W ∈ R n×k, U ∈ R k×m, H ∈ R k×d are parameter matrices, so the space complexity is O(nm + nk + km + kd). Because k, m and d are smaller than n, the space complexity reduces to O(n). We can see that the complexity of our model is of the same order of magnitude as most NMF-based algorithms.
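The sparse speed-up mentioned above comes from multiplying the adjacency matrix, stored in a compressed format such as CSR, by a dense factor: the product touches only the stored nonzeros. The sketch below is illustrative (a real implementation would use an optimized sparse library rather than Python loops):

```python
import numpy as np

def csr_matmul(indptr, indices, data, X):
    """Multiply an n x n sparse matrix in CSR format by a dense (n, m)
    matrix. Only the e stored nonzeros are visited, so the cost is
    O(e * m) rather than the dense O(n^2 * m)."""
    n, m = len(indptr) - 1, X.shape[1]
    Y = np.zeros((n, m))
    for i in range(n):                       # row i of the sparse matrix
        for p in range(indptr[i], indptr[i + 1]):
            Y[i] += data[p] * X[indices[p]]
    return Y

# Build a small random sparse matrix and its CSR representation.
rng = np.random.default_rng(0)
A = np.where(rng.random((50, 50)) < 0.05, rng.random((50, 50)), 0.0)
indptr, indices, data = [0], [], []
for row in A:
    nz = np.flatnonzero(row)
    indices.extend(nz); data.extend(row[nz]); indptr.append(len(indices))
X = rng.random((50, 4))
Y = csr_matmul(np.array(indptr), np.array(indices), np.array(data), X)
```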
To illustrate the time complexity of our model empirically, we learn the low-dimensional node embeddings for several datasets of different sizes and report the computation time in Table 2. The results show that as the scale of the network grows by multiples, the computing time also increases by multiples.

IV. EXPERIMENTS
In this section, we conduct extensive experiments on multi-label node classification, clustering, link prediction and visualization tasks to evaluate the effectiveness of our model. Firstly, we describe the datasets and experimental settings used in this paper. Then, to demonstrate the effectiveness of our model, we compare it with state-of-the-art methods through extensive experiments. Finally, we analyze its sensitivity to the parameters used.

A. DATASETS AND EXPERIMENTAL SETTINGS
We employ four widely-used network datasets for the multi-label node classification task and three real-world networks with ground-truth labels for clustering and link prediction. The statistical characteristics of these datasets are shown in Table 3, and detailed introductions follow.
• BlogCatalog [53]: An online blogging network. The nodes represent bloggers and the edges represent the relationships between bloggers. Labels indicate categories of interest to a blogger; a blogger may have multiple interests, so a node may have multiple labels.
• Protein-Protein Interactions [54]: A part of the PPI network for Homo Sapiens. The class labels represent biological states.
• Wikipedia (http://mattmahoney.net/dc/text.html): A co-occurrence network of words on Wikipedia. The class labels are the Part-of-Speech (POS) tags inferred by the Stanford POS-Tagger [55].
• Flickr [53]: An online community platform where users can subscribe to a diversity of interest groups. Class labels represent the interest groups.
• Polblog [56]: A social network whose nodes represent American politicians' blogs; an edge exists between two nodes if the blogs link to each other. Labels indicate the categories of the politicians.
• Livejournal [57]: An online social network whose nodes represent bloggers; an edge exists between two nodes if they are friends. Bloggers are divided into groups according to their friendships, and the groups are treated as labels.
• Orkut [57]: An online friendship network with users as nodes and an edge between two users if they are friends. The groups organized by users are the ground truth.

Experimental settings. The parameters of our model LHO-NMF include the three hyperparameters α, β and γ, the local structure dimension m and the embedding dimension k. In the experiments, we search α, β and γ in [1, 101] or in [0, 1] and m in {100, 200, 300, 400, 500} to find the optimal parameters of the model; k changes according to the number of labels. The experimental results show that the performance is better when α, β and γ ∈ [1, 101]. In the process of extracting the high-order structure matrix B, we set the dimension d = 128. When m is greater than 200, the clustering performance gradually decreases, so we set the dimension m = 200. The details of the parameter analysis are shown in Fig. 4.
The effects of these parameters on the experimental results are analyzed below. In general, we set the values of α, β and γ depending on the dataset in the experiments below.

B. BASELINE METHODS
We compare LHO-NMF with several matrix-factorization-based methods and other state-of-the-art network embedding methods. Their details are listed as follows.
• M-NMF [10]: M-NMF incorporates community structure and the first two orders of proximity into an NMF framework to learn node embeddings. In the experiments, we set the embedding dimension to 128 and use the default values from the paper for the other parameters.
• NetMF [51]: NetMF proves that models with negative sampling such as DeepWalk, LINE, node2vec and PTE can be regarded as factorizing a matrix with a closed form, and demonstrates that the proposed model outperforms DeepWalk and LINE on conventional network mining tasks.
• AROPE [22]: AROPE, based on an SVD framework, shifts embedding vectors across arbitrary orders and reveals the intrinsic relationship between them to learn arbitrary high-order proximity of nodes.
• DeepWalk [7]: DeepWalk generates random walk paths for each node and treats these paths as sentences in a language model. It then uses Skip-Gram [29] to learn the embedding vectors. In the experiments, we set the parameters suggested by the paper.
• Node2vec [16]: Node2vec extends DeepWalk by using a biased random walk, introducing two bias parameters p, q to optimize the walk. We adopt the default settings for all parameters.
• LINE [19]: LINE preserves the first-order and second-order proximity separately by defining two loss functions to learn node embeddings. We adopt the default parameter settings except for setting the negative ratio to 5.
• SDNE [20]: SDNE utilizes a deep auto-encoder to optimize first-order and second-order proximity simultaneously. We set the parameters as in the authors' article.
• GAE [34]: GAE is based on the variational auto-encoder and has the same convolutional architecture as GCN, but is trained by reconstructing the adjacency relationships.

C. MULTI-LABEL CLASSIFICATION
In this subsection, we evaluate the performance of multi-label node classification in terms of the Micro-F1 and Macro-F1 metrics. To reduce the randomness of the experimental results, we repeat the classification procedure 10 times and take the average as the result. Table 4 shows the node classification performance of our method and the baselines on the four datasets when T = 1. All datasets share the same window size T in the same table. In these tables, bold numbers represent the best results.
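For reference, the two metrics can be computed as follows for multi-label predictions. The sketch is illustrative (`micro_macro_f1` is our naming; standard libraries provide equivalent functions):

```python
import numpy as np

def micro_macro_f1(y_true, y_pred):
    """Micro- and Macro-F1 for multi-label predictions.

    y_true, y_pred: (n_samples, n_labels) binary indicator matrices.
    Micro-F1 pools TP/FP/FN over all labels; Macro-F1 averages the
    per-label F1 scores, weighting rare labels equally.
    """
    tp = np.sum((y_true == 1) & (y_pred == 1), axis=0).astype(float)
    fp = np.sum((y_true == 0) & (y_pred == 1), axis=0).astype(float)
    fn = np.sum((y_true == 1) & (y_pred == 0), axis=0).astype(float)
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    per_label = 2 * tp / np.maximum(2 * tp + fp + fn, 1e-12)
    return micro, per_label.mean()

# Toy example: 4 samples, 3 labels.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]])
micro, macro = micro_macro_f1(y_true, y_pred)
```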
As we can see, LHO-NMF achieves significantly better performance on BlogCatalog, PPI and Flickr in terms of both Micro-F1 and Macro-F1, demonstrating the effectiveness of our model for network embedding. On Wikipedia, AROPE performs better than our method on both Micro-F1 and Macro-F1. This phenomenon implies that a comparatively low order is enough to model Wikipedia's network structure: Wikipedia is a dense word co-occurrence network whose average degree is approximately 77, since two words have an edge whenever they co-occur in a window of size 2. In addition, we show the relative performance of LHO-NMF(X), where X is used as the embedding matrix. The results show that methods based only on matrix factorization perform poorly on classification tasks, demonstrating the effectiveness of incorporating the high-order features to learn node representations.
As mentioned before, the window size T determines the order of structure captured. We therefore explored the effect of the window size on multi-label classification performance, varying T from 1 to 10. Table 5 and Fig. 2 exhibit the related results and variation tendencies. Because Table 4 has already shown the results for T = 1, T starts at 2 in Table 5. On BlogCatalog, PPI and Flickr, as the window size T increases, the proposed LHO-NMF achieves significantly better classification performance as measured by both Micro-F1 and Macro-F1, but the performance gradually stabilizes after T reaches 4. On Wiki, when T is greater than 3, the classification performance gradually degrades. This phenomenon indicates that a larger window is not always better; the window size should be set dynamically according to the network sparsity to reduce computation. Our method can dynamically adjust the window size according to network sparsity to learn better node representations.

D. NODE CLUSTERING
In this subsection, we evaluate the node clustering performance using a typical metric, Normalized Mutual Information (NMI). We evaluate the clustering performance on real-world datasets with ground truth, including Polblog, Livejournal and Orkut. NMI ranges from 0 to 1, and a larger value indicates better clustering performance. In our experiments, we apply the standard k-means algorithm to obtain the clustering results for the other network embedding methods. Since the initial value has a great influence on the clustering result, we repeat the clustering 10 times and report the mean. Our method achieves a clear improvement on NMI compared with the second-best method. This benefit comes from the fact that our method integrates the high-order structure with neighbor and community attributes, capturing diverse and comprehensive structural characteristics of networks. DeepWalk and node2vec, which are based on random walks, can capture second- or even higher-order proximities, but they ignore the community structure. In addition, SDNE and LINE preserve only the proximity between nodes, which is likewise incapable of effectively preserving the community structure. AROPE can capture similarities of different orders; although more global structural information is captured as the order increases, module information is still ignored. M-NMF adds a modularity term to learn the node embeddings. However, for networks that are very sparse and whose community structure is not obvious, the modularity constraint of NMF makes the node representations similar to each other, so it shows relatively low performance. In addition, we show the relative performance of LHO-NMF(X), where X is the embedding matrix, and of LHO-NMF, where W is the embedding matrix. The results show that the methods based only on matrix factorization perform poorly on clustering tasks.
The above results demonstrate the superior power of fusing high-order features into embedding when preserving local structure.
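For reference, the NMI used in this evaluation can be computed from the joint distribution of two partitions. A dependency-free sketch (our own helper, not the authors' code):

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized Mutual Information between two clusterings, in [0, 1].
    1.0 means identical partitions up to relabeling; 0 means independence."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # mutual information from joint and marginal cluster counts
    mi = sum(p / n * math.log((p * n) / (ca[a] * cb[b]))
             for (a, b), p in joint.items())
    # entropies of each partition, used for normalization
    ha = -sum(c / n * math.log(c / n) for c in ca.values())
    hb = -sum(c / n * math.log(c / n) for c in cb.values())
    return mi / math.sqrt(ha * hb) if ha and hb else 1.0

# identical partitions up to relabeling score 1.0
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

Because NMI is invariant to cluster relabeling, it is well suited to comparing k-means output against ground-truth communities.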

E. LINK PREDICTION
Given a network from which a portion of the edges has been removed, link prediction aims to predict which pairs of nodes are likely to form edges. In our experiments, we randomly hide 50%, 40%, 30%, 20% and 10% of the edges as test data while ensuring that the remaining network is connected, and use the remaining edges to train the node embedding vectors. A typical metric, the Area Under Curve (AUC) score, is used to evaluate the performance of LHO-NMF and the baseline methods. First, we show the results of removing 10% of the edges on all network datasets to verify the performance of LHO-NMF. As shown in Table 7, our model achieves 19.6%, 7.5% and 9.5% improvements on Polblog, Orkut and Livejournal, respectively. We notice that M-NMF is second only to our model in predictive power on all datasets. In the same way, we also show the relative performance of LHO-NMF(X), where X is the embedding matrix, demonstrating that the methods based only on matrix factorization perform poorly on link prediction.
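The AUC score used here can be estimated by repeatedly sampling one held-out edge and one non-edge and checking which receives the higher similarity score. A minimal sketch with illustrative names (scores would typically be dot products of embedding vectors):

```python
import random

def auc_link_prediction(pos_scores, neg_scores, trials=10000, seed=0):
    """Sampling estimator of AUC: the probability that a held-out edge
    scores above a non-edge, counting ties as 0.5."""
    rng = random.Random(seed)
    hits = 0.0
    for _ in range(trials):
        p = rng.choice(pos_scores)   # score of a true (hidden) edge
        q = rng.choice(neg_scores)   # score of a sampled non-edge
        hits += 1.0 if p > q else (0.5 if p == q else 0.0)
    return hits / trials

# perfect separation of edges from non-edges gives AUC = 1.0
print(auc_link_prediction([0.9, 0.8], [0.1, 0.2]))  # 1.0
```

An AUC of 0.5 corresponds to random guessing, so improvements over baselines are reported relative to this scale.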
Specifically, we take the results on Livejournal and Orkut as examples to explore the effect of the ratio of training data. The results in Fig. 3 show that our method achieves a significant and consistent advantage over all the baselines on the two datasets under different portions of removed edges. Due to differences in network structure, some datasets reach the optimal level at a training ratio of 80%, while others reach it at 90%. In general, the results demonstrate that our method achieves superior performance in link prediction, indicating the effectiveness of preserving the high-order feature and local structure information for network embedding.

F. PARAMETER ANALYSIS
We explore the effect of each parameter by varying two of them while fixing the others. For example, we vary α and β and fix γ and m to observe the effect of α and β, and similarly for the other pairs. Specifically, we vary m over 100, 200, 300, 400 and 500. Fig. 4(a)-(c) and Fig. 6 show the NMI performance as these parameters change. In Fig. 4(a), the clustering performance is worst when both α and β are less than 10. Within a certain range, the value of NMI tends to be stable as α and β increase, and the clustering performance is best when α is greater than 50 and β is less than 30. As shown in Fig. 4(b), at the horizontal level, NMI does not vary much when α is in [1, 20], which suggests that the clustering performance is relatively stable as γ increases when α is within a certain range. In Fig. 4(c), we notice that, within a certain range, NMI tends to be stable when β and γ are linearly correlated, and NMI reaches its maximum when γ and β are in the range [20, 101]. Fig. 6 shows the effect of the dimensionality m of the local structure embedding space on three datasets. NMI is maximal when m is 200 and decreases as m increases further, which suggests that when m = 200 the low-dimensional space is sufficient to capture the local structure of most networks.
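The vary-two-fix-the-rest procedure above amounts to a grid search over one pair of hyper-parameters at a time. A toy sketch of that loop; the objective here is a stand-in peaked near the reported best region, not the actual NMI evaluation:

```python
import itertools

def grid_search(evaluate, alphas, betas, gamma, m):
    """Vary alpha and beta over a grid while fixing gamma and m,
    returning the best (score, alpha, beta) triple found."""
    best = None
    for a, b in itertools.product(alphas, betas):
        score = evaluate(a, b, gamma, m)
        if best is None or score > best[0]:
            best = (score, a, b)
    return best

# hypothetical stand-in objective, maximal at alpha=50, beta=20
obj = lambda a, b, g, m: -((a - 50) ** 2 + (b - 20) ** 2)
print(grid_search(obj, [1, 10, 50, 100], [10, 20, 30], gamma=10, m=200))
# (0, 50, 20)
```

Sweeping pairs rather than the full four-dimensional grid keeps the number of evaluations linear in the grid size per pair.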

G. VISUALIZATION
To show the performance of our model more intuitively, we visualize the embeddings generated by our method and the baselines on the Orkut graph. We map the node representations learned by the models into a two-dimensional space, where nodes with the same color share the same label. The closer the nodes with the same color are, and the farther apart the nodes with different colors are, the more effective the model is. Here, the label is the community ground truth.
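A minimal sketch of the 2-D mapping step; we use a PCA projection here to stay dependency-free, though t-SNE is a common alternative for such visualizations (the paper does not specify which projection it uses):

```python
import numpy as np

def project_2d(embeddings):
    """Project node embeddings onto their top two principal components.
    The SVD of the centered matrix gives the principal directions."""
    X = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T

rng = np.random.default_rng(0)
emb = rng.normal(size=(20, 8))    # 20 toy nodes with 8-dim embeddings
coords = project_2d(emb)
print(coords.shape)  # (20, 2)
```

The resulting 2-D coordinates are then scatter-plotted with one color per ground-truth community.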
The detailed results of the visualization are exhibited in Fig. 5. We have the following observations and insights:
• It is intuitive that our proposed model achieves significantly better clustering performance. This observation implies that preserving high-order proximity, neighbor structure and community structure yields more distinguishable representations for nodes.
• AROPE captures the similarity of nodes at different orders, and more information can be captured as the order increases. However, the module information is ignored, which results in node representations that are too similar to distinguish for the Orkut network, where different communities are closely related.

V. CONCLUSION
In this paper, we propose an efficient and scalable algorithm, LHO-NMF, to learn low-dimensional node representations.
To mutually enhance the embedding learning of nodes, LHO-NMF effectively integrates the high-order features and the local structure into a non-negative matrix factorization framework. Specifically, the high-order features are captured based on random walks. We further introduce hyper-parameters to control the loss weights among different structural features. Our extensive experiments, including multi-label classification, clustering and link prediction on several real-world datasets, demonstrate the effectiveness of our model for network embedding. However, for networks with millions or hundreds of millions of nodes, efficiency may degrade because the model is based on matrix factorization.
In the future, we will focus on designing better algorithms for preserving different structures and on improving computational efficiency while maintaining high accuracy.