Network Embedding Using Deep Robust Nonnegative Matrix Factorization



I. INTRODUCTION
Real-world complex networks (e.g., online social networks, co-authorship networks and hyperlink networks) often contain much valuable information, which has made network analysis a hot research topic. A large number of researchers have engaged in studying various tasks of network analysis, such as node classification [1], node clustering [2], link prediction [3], visualization [4], etc. Because complex network data are very sparse and high dimensional, these network analysis tasks often suffer from high computational cost and low performance.
To overcome these problems, network embedding has been proposed as an effective technique. This technique, also known as graph embedding or network representation learning, aims at learning low-dimensional node feature representations of a given network while preserving its structural and inherent properties. The learnt representations can then be fed into analysis tasks as feature vectors. Many existing works have shown that better network embeddings greatly improve the performance of downstream analysis tasks [5]-[7].
As a promising technical field, network embedding has attracted numerous endeavors in algorithms and methodologies. From an algorithmic perspective, the existing methods for network embedding can be roughly summarized into three main categories: deep learning based methods (e.g., SDNE [8] and DNGR [9]), random walk based methods (e.g., DeepWalk [10] and Node2vec [11]) and matrix factorization based methods (e.g., M-NMF [12] and GraRep [13]). The former two categories, especially the deep learning based ones, seem to have become more and more popular. However, matrix factorization based methods have their own distinct advantages: better interpretability, fewer or no parameters, better flexibility to incorporate prior knowledge, etc. In [14], Qiu et al. proved that DeepWalk and Node2vec can be closely unified into the matrix factorization framework. Moreover, in [12] and [13], both M-NMF and GraRep demonstrate strong competitiveness compared with other types of methods. All of this has stimulated great interest in matrix factorization based network embedding methods.
Recently, some network embedding methods based on matrix factorization have been proposed with improved performance, but they still suffer from the following problems:
• Most of the existing methods rely on a single-layer mapping from the original network to the final network embedding space, which limits their capability to learn complex and useful features from the network. After all, real-world complex networks usually contain considerably complex hierarchical features, including microscopic node similarities and macroscopic community structures, which are difficult to discover with shallow methods.
• The plain adjacency matrix of the network is normally used as the original feature matrix for factorization; however, this matrix cannot capture enough local and global structure information of the network, which also reduces the performance of network embedding to some extent.
• Actual complex networks generally contain noise, such as casual followships in online social networks and unreal links in hyperlink networks, yet the squared Frobenius norm is typically selected to measure the factorization error in existing matrix factorization based network embedding methods, which makes them insufficiently robust to noise.
To solve the problems mentioned above and improve the performance of matrix factorization in network embedding, we propose a novel embedding method called DRNMF. More specifically, the main work can be summarized as follows:
• We devise an embedding approach using deep robust nonnegative matrix factorization (DRNMF), which combines multi-layer NMF with a combination of high-order proximity matrices as the original factorization feature matrix. Thus, more informative and discriminative embeddings can be obtained; meanwhile, the $\ell_{2,1}$ norm is applied to construct the objective function, improving the robustness against noise.
• We develop effective iterative update rules to optimize the DRNMF model and prove their convergence. Besides, we introduce a pre-training strategy to expedite the iterative process.
• We evaluate the DRNMF model on several benchmark datasets and different analysis tasks, including node classification, clustering and visualization. The results demonstrate that DRNMF is clearly superior to the state-of-the-art matrix factorization based network embedding methods.
The rest of this article is organized as follows. A brief review of related research on NMF, deep NMF (DNMF) and matrix factorization based network embedding is given in Section II. In Section III, the proposed network embedding method DRNMF is presented in detail. Experiments and specific analysis are reported in Section IV. Finally, conclusions are given in Section V.

II. RELATED WORK
In this section, we briefly review the related work regarding nonnegative matrix factorization (NMF), deep NMF (DNMF) and network embedding based on matrix factorization.

A. NMF AND DNMF
NMF is a popular low-rank matrix decomposition model that focuses on the analysis of data matrices whose elements are nonnegative [15]. Mathematically, it can be formulated as follows: given a data matrix $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}_+^{m \times n}$ composed of $n$ data samples as columns, each with $m$ features, $X$ can be approximately decomposed into the product of two matrices as $X \approx WH$, where $W = [w_1, w_2, \ldots, w_r] \in \mathbb{R}_+^{m \times r}$ is the basis matrix, $H = [h_1, h_2, \ldots, h_n] \in \mathbb{R}_+^{r \times n}$ is called the coefficient matrix or the encoding matrix, $r \ll \min(m, n)$, and $\mathbb{R}_+$ denotes the set of nonnegative real numbers. By applying nonnegativity constraints on $W$ and $H$, each data sample $x_i$ can be represented as an additive linear combination of nonnegative basis vectors, i.e., $x_i \approx \sum_{j=1}^{r} w_j h_{ji}$. NMF provides more interpretable parts-based decompositions than other matrix factorization models, because it naturally complies with the human intuition of ''combining parts to form a whole''. This characteristic makes NMF widely used in various representation learning tasks, such as image representation [16], [17], microbiome data representation [18], and even network embedding [12], [19], [20].
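For concreteness, the following minimal NumPy sketch factorizes a nonnegative matrix with the classical multiplicative update rules of [37]; the toy data, function name and iteration count are illustrative and not taken from any cited implementation.

```python
import numpy as np

def nmf(X, r, n_iter=200, eps=1e-10):
    """Basic NMF via multiplicative updates: X (m x n) ~= W (m x r) @ H (r x n)."""
    m, n = X.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, r))
    H = rng.random((r, n))
    for _ in range(n_iter):
        # Multiplicative updates preserve nonnegativity of W and H.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy usage: approximate a random 6 x 5 nonnegative matrix with rank r = 2.
X = np.random.default_rng(1).random((6, 5))
W, H = nmf(X, r=2)
print(np.linalg.norm(X - W @ H, 'fro'))  # reconstruction error
```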
NMF is essentially a shallow method containing only a single-layer mapping between $X$ and the factors $W$ and $H$, so it cannot reveal the more complex hierarchical features hidden in complex data objects. Inspired by deep learning, DNMF [21] was proposed to solve this problem left by traditional NMF, with the principal idea of stacking single-layer NMF into $l$ ($l > 1$) layers, thereby obtaining hierarchical mappings $(W_1, W_2, \ldots, W_l)$ and corresponding features $(H_1, H_2, \ldots, H_l)$:
$$X \approx W_1 W_2 \cdots W_l H_l. \quad (1)$$
This hierarchical factorization and the comparison between NMF and DNMF are respectively described in Eq. (1) and Fig. 1.
Recently, there have been several successful applications of DNMF to learning hierarchical features from complex data objects. For example, Trigeorgis et al. [21] used DNMF to automatically learn low-to-high level feature representations from face data, including pose, expression and identity features. Each representation level is suitable for clustering according to the corresponding uncovered attributes of the data; similar work can be found in [22]. In [23], Song et al. applied DNMF to a document classification task and found that the process of hierarchical feature learning eventually leads to better classification performance. All of the above works compared DNMF with NMF and other single-layer matrix factorization methods, and the results demonstrate that DNMF has a stronger feature learning ability, which motivates us to apply DNMF to network embedding.
It should be noted that these works on DNMF all used the Frobenius norm to construct the objective function, written as $\|X - W_1 W_2 \cdots W_l H_l\|_F^2$. Noisy samples with large errors in $X$ tend to dominate an objective function in the form of squared errors, which means that DNMF with the Frobenius norm is not robust enough against noise. Using the $\ell_{2,1}$ norm instead, robust NMF (RNMF) is generally considered more robust to noise than regular NMF methods based on the Frobenius norm. Therefore, DRNMF is expected to perform better than DNMF because of the advantage of the $\ell_{2,1}$ norm over the Frobenius norm, which will be verified in the experiments.
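The robustness difference can be made concrete numerically: the squared Frobenius loss sums squared entries, so a single corrupted sample (column) dominates it, whereas the $\ell_{2,1}$ loss sums the unsquared Euclidean norms of the columns and grows only linearly in the outlier. A small synthetic sketch (all data illustrative):

```python
import numpy as np

def l21_norm(E):
    # ||E||_{2,1}: sum of the Euclidean norms of the columns of E.
    return np.linalg.norm(E, axis=0).sum()

rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(20, 50))  # small residuals on all 50 samples
E_noisy = E.copy()
E_noisy[:, 0] += 10.0                     # one heavily corrupted sample

# Squared Frobenius loss explodes on the outlier; l2,1 grows far more mildly.
print((E_noisy ** 2).sum() / (E ** 2).sum())  # very large ratio
print(l21_norm(E_noisy) / l21_norm(E))        # modest ratio
```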

B. NETWORK EMBEDDING BASED ON MATRIX FACTORIZATION
Since our goal is to boost the performance of matrix factorization for network embedding, here we are only concerned with matrix factorization based methods. Detailed information on other types of methods (e.g., deep learning based methods and random walk based methods) can be found in survey papers such as [24], [25] and [26]. In general, the existing matrix factorization based methods for network embedding mainly focus on two aspects, feature matrix construction and dimension reduction, which are introduced respectively as follows.
The feature matrix of a given network is its original high-dimensional representation matrix used for factorization. At an early stage, most methods selected the adjacency matrix as the feature matrix, such as SocioDim (Social dimensions) [27] and GF (Graph factorization) [28]. However, many recent studies have pointed out that a high-order proximity matrix enhances the performance of network embedding better than the adjacency matrix and other low-order proximity matrices, because a high-order proximity matrix conveys more local and global structure information, which is very useful for obtaining more informative and discriminative embeddings. For example, GraRep [13] used the k-step probability transition matrix as the high-order proximity matrix, and experimental results showed that it performs better than methods using a first-order or second-order proximity matrix, such as LINE [29]. In [30], the MMDW method further improved the performance by using the average k-step probability transition matrix as the high-order proximity matrix. Similar works can also be found in [31]-[34].
Dimension reduction is used to obtain the final low-dimensional network representations by factorizing the original feature matrix. Factorization strategies vary according to matrix properties. If the obtained feature matrix is positive semi-definite, one can use SVD (singular value decomposition), as in GraRep [13] and HOPE [33]. However, due to the complexity of calculating eigenvalues and eigenvectors, SVD is very time consuming on large-scale matrices. For unstructured feature matrices, one can devise alternating optimization methods to obtain the embedding, as in M-NMF [12], MMDW [30] and TADW [32], which are generally more efficient than SVD.
Although the aforementioned matrix factorization based methods have achieved performance improvement at different levels, they are all shallow methods and still need to be improved. Our proposed method DRNMF has multi-layer factorization structure. Thus, DRNMF is fundamentally different from them. Moreover, to the best of our knowledge, at present there are still no works about network embedding using multi-layer matrix factorization.

III. METHODOLOGY
In this section, the proposed method DRNMF is described, beginning with the statement of the problem. Then, DRNMF is presented in detail, including the network embedding model, optimization solution, convergence proof and the embedding algorithm.

A. STATEMENT OF THE PROBLEM
Throughout this paper, matrices are denoted by bold uppercase letters. For a given matrix $X$, its $i$-th column vector, $(i, j)$-th element, trace, and Frobenius norm are denoted by $x_i$, $x_{ij}$, $\mathrm{tr}(X)$ and $\|X\|_F$, respectively. Meanwhile, the identity matrix is denoted by $I$.
Without loss of generality, a given network can be formally represented as a directed and unweighted graph $G = (V, E)$, where $V = \{v_1, v_2, \ldots, v_n\}$ is the set of $n$ nodes, $E \subseteq V \times V$ is the set of edges, and $A = [a_{ij}]_{n \times n}$ is the adjacency matrix. Following the idea presented in [13] and [29], we define the $k$-order proximity matrix of the network as:
$$S^k = \underbrace{S \cdot S \cdots S}_{k}, \quad (2)$$
where $S = [s_{ij}]_{n \times n}$ is also called the first-order proximity matrix, with $s_{ij} = a_{ij} / \sum_{j=1}^{n} a_{ij}$, i.e., the one-step probability transition matrix. Thus, the problem of network embedding here can be formally stated as: given a network $G$ with matrices $S, S^2, \ldots, S^k$, the aim is to learn a low-dimensional representation matrix $H \in \mathbb{R}_+^{r \times n}$ ($r \ll n$) whose columns preserve the structural properties of the corresponding nodes.

B. NETWORK EMBEDDING MODEL
The framework of the DRNMF network embedding model is depicted in Fig. 2. As we can see, this model comprises two key components, proximity matrix construction and DNMF with the $\ell_{2,1}$ norm, which are described as follows.

1) PROXIMITY MATRIX CONSTRUCTION
Motivated by the work on the equivalence of DeepWalk and matrix factorization presented in [32], we select the mean of all the $k$-order proximity matrices as the proximity matrix $M$:
$$M = \frac{1}{k}\left(S + S^2 + \cdots + S^k\right). \quad (3)$$
In Eq. (3), $M$ combines multiple high-order proximities, so it can be expected to capture more local and global structural features. In [32], $k$ is advised to be set to 2 to balance computational speed and accuracy. Our method DRNMF also follows this suggestion.
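As an illustration, here is a minimal NumPy sketch of Eq. (3); it assumes an adjacency matrix without zero-degree rows, and the function name is ours:

```python
import numpy as np

def proximity_matrix(A, k=2):
    """M = (S + S^2 + ... + S^k) / k, where S is the one-step transition
    matrix with s_ij = a_ij / sum_j a_ij (assumes every row of A is nonzero)."""
    S = A / A.sum(axis=1, keepdims=True)
    M = np.zeros_like(S)
    S_p = np.eye(A.shape[0])
    for _ in range(k):
        S_p = S_p @ S      # accumulates S^p for p = 1, ..., k
        M += S_p
    return M / k
```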

2) DNMF WITH ℓ2,1 NORM
After obtaining the proximity matrix $M$, we factorize $M$ using DNMF with the $\ell_{2,1}$ norm to produce an $l$-layer hierarchical feature representation of the original network:
$$M \approx W_1 W_2 \cdots W_l H_l.$$
Besides, the dimensionality of $H_i$ becomes much smaller as the layer number increases, which implies a more abstract and more compact representation of the network. Similar to other feature learning methods based on deep neural networks, this deep structure can be expected to lead to more accurate network representation results, i.e., a better $H_l$. In order to learn $H_l$ and the other factor matrices (e.g., $W_i$), we derive the following objective function using the $\ell_{2,1}$ norm, where $\|X\|_{2,1} = \sum_{j=1}^{n} \|x_j\|_2$:
$$\min \; \|M - W_1 W_2 \cdots W_l H_l\|_{2,1} \quad \text{s.t.} \; W_i \ge 0 \; (1 \le i \le l), \; H_l \ge 0, \quad (4)$$
where $H_l \in \mathbb{R}_+^{r \times n}$ is treated as the final network embedding representation matrix $H$ and $W_i \in \mathbb{R}_+^{r_{i-1} \times r_i}$ (with $r_0 = n$ and $r_l = r$).

C. OPTIMIZATION SOLUTION
The minimization of the objective function in Eq. (4) is a typical constrained optimization problem, which we solve with an alternating minimization strategy: at each iteration, all variables are fixed except one, which is updated. Repeating these updates until the objective function converges yields the final optimized results. Next, we present the specific update rules for the factor matrices $W_i$ ($1 \le i \le l$) and $H_l$.

1) UPDATE RULE FOR $W_i$
By fixing all the variables except $W_i$, the objective function in Eq. (4) simplifies to:
$$J(W_i) = \|M - P_{i-1} W_i Q_{i+1} H_l\|_{2,1}, \quad (5)$$
where $P_{i-1} = W_1 W_2 \cdots W_{i-1}$ and $Q_{i+1} = W_{i+1} W_{i+2} \cdots W_l$. When $i = 1$, we set $P_0 = I$; similarly, when $i = l$, we set $Q_{l+1} = I$. To solve Eq. (5), we first use the Lagrange multiplier method to devise the Lagrange function of $J(W_i)$:
$$L(W_i) = J(W_i) + \mathrm{tr}(\Psi_i W_i^{\mathrm{T}}), \quad (6)$$
where $\Psi_i$ is the Lagrange multiplier for $W_i$. Then, $L(W_i)$ can be rewritten in the form of matrix traces as
$$L(W_i) = \mathrm{tr}\big((M - P_{i-1} W_i Q_{i+1} H_l)\, D_i\, (M - P_{i-1} W_i Q_{i+1} H_l)^{\mathrm{T}}\big) + \mathrm{tr}(\Psi_i W_i^{\mathrm{T}}), \quad (7)$$
where $D_i$ is a diagonal matrix whose $j$-th diagonal element is $1 / \|e_j\|_2$, with $e_j$ the $j$-th column of the residual $M - P_{i-1} W_i Q_{i+1} H_l$. Using the Karush-Kuhn-Tucker (KKT) conditions [35] (in particular $\Psi_i \odot W_i = 0$), we have
$$\big(-2 P_{i-1}^{\mathrm{T}} M D_i H_l^{\mathrm{T}} Q_{i+1}^{\mathrm{T}} + 2 P_{i-1}^{\mathrm{T}} P_{i-1} W_i Q_{i+1} H_l D_i H_l^{\mathrm{T}} Q_{i+1}^{\mathrm{T}}\big) \odot W_i = 0, \quad (9)$$
where $\odot$ denotes the element-wise product. Finally, by solving Eq. (9), we obtain the following update rule for $W_i$:
$$W_i \leftarrow W_i \odot \frac{P_{i-1}^{\mathrm{T}} M D_i H_l^{\mathrm{T}} Q_{i+1}^{\mathrm{T}}}{P_{i-1}^{\mathrm{T}} P_{i-1} W_i Q_{i+1} H_l D_i H_l^{\mathrm{T}} Q_{i+1}^{\mathrm{T}}}. \quad (10)$$

2) UPDATE RULE FOR $H_l$
Similarly, by fixing all the variables except $H_l$, the objective function in Eq. (4) simplifies to:
$$J(H_l) = \|M - P_l H_l\|_{2,1}, \quad (11)$$
where $P_l = W_1 W_2 \cdots W_l$. Following the Lagrange multiplier method presented above, we can also obtain the update rule for $H_l$:
$$H_l \leftarrow H_l \odot \frac{P_l^{\mathrm{T}} M D}{P_l^{\mathrm{T}} P_l H_l D}, \quad (12)$$
where $D$ is the diagonal reweighting matrix defined as above for the residual $M - P_l H_l$.
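For concreteness, the sketch below implements one alternating sweep of the update rules reconstructed above. It follows the standard $\ell_{2,1}$ reweighting (diagonal entries $1/\|e_j\|_2$, as in [36]); helper names are ours, and the exact constants may differ from the authors' implementation.

```python
import numpy as np

def l21_weights(E, eps=1e-10):
    # Diagonal reweighting matrix D of the l2,1 norm: D_jj = 1 / ||e_j||_2.
    return np.diag(1.0 / (np.linalg.norm(E, axis=0) + eps))

def fine_tune_step(M, Ws, Hl, eps=1e-10):
    """One alternating sweep: Eq. (10) for each W_i, then Eq. (12) for H_l.
    Ws is the list [W_1, ..., W_l]; all matrices are nonnegative ndarrays."""
    n = M.shape[0]
    for i in range(len(Ws)):
        P = np.eye(n)
        for W in Ws[:i]:                  # P_{i-1} = W_1 ... W_{i-1} (P_0 = I)
            P = P @ W
        Q = np.eye(Ws[i].shape[1])
        for W in Ws[i + 1:]:              # Q_{i+1} = W_{i+1} ... W_l (Q_{l+1} = I)
            Q = Q @ W
        QH = Q @ Hl
        D = l21_weights(M - P @ Ws[i] @ QH, eps)
        num = P.T @ M @ D @ QH.T
        den = P.T @ P @ Ws[i] @ QH @ D @ QH.T + eps
        Ws[i] = Ws[i] * (num / den)       # Eq. (10): multiplicative, stays nonnegative
    Pl = np.eye(n)
    for W in Ws:                          # P_l = W_1 ... W_l
        Pl = Pl @ W
    D = l21_weights(M - Pl @ Hl, eps)
    Hl = Hl * ((Pl.T @ M @ D) / (Pl.T @ Pl @ Hl @ D + eps))   # Eq. (12)
    return Ws, Hl
```

In practice, such sweeps are repeated until the objective in Eq. (4) stops decreasing.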

D. CONVERGENCE PROOF
In this subsection, the convergence of the update rules in Eq. (10) and Eq. (12) is proved according to the following theorems.
Theorem 1: Updating $W_i$ using the rule of Eq. (10) while fixing all the variables except $W_i$, the objective function $J(W_i)$ in Eq. (5) monotonically decreases to a minimum.
Theorem 2: Updating $H_l$ using the rule of Eq. (12) while fixing all the variables except $H_l$, the objective function $J(H_l)$ in Eq. (11) monotonically decreases to a minimum.
The proof of convergence for W i is similar to that for H l , thus we only focus on W i here (i.e., the proof of Theorem 1). To prove Theorem 1, we need to employ the following lemma.
Lemma 1: For any nonnegative matrices $A \in \mathbb{R}_+^{n \times n}$, $B \in \mathbb{R}_+^{k \times k}$, $U \in \mathbb{R}_+^{n \times k}$ and $U' \in \mathbb{R}_+^{n \times k}$, with $A$ and $B$ symmetric, the following inequality holds:
$$\sum_{p=1}^{n} \sum_{q=1}^{k} \frac{(A U' B)_{pq}\, U_{pq}^2}{U'_{pq}} \ge \mathrm{tr}(U^{\mathrm{T}} A U B),$$
and the equality holds when $U = U'$. The detailed proof of this lemma can be found in [36].
Next, we adopt the widely used auxiliary function approach proposed in [37] to prove Theorem 1. First, according to Lemma 1, we can construct a function $Z(W_i, W_i')$ that upper-bounds $J(W_i)$, with equality when $W_i = W_i'$; hence $Z(W_i, W_i')$ is an auxiliary function of $J(W_i)$. Examining the first-order and second-order derivatives of $Z(W_i, W_i')$ with respect to $W_i$, and noting that every matrix involved is nonnegative, the minimizer of $Z(W_i, W_i')$ is obtained by setting the first-order derivative to zero, which gives
$$W_i \leftarrow W_i \odot \frac{P_{i-1}^{\mathrm{T}} M D_i H_l^{\mathrm{T}} Q_{i+1}^{\mathrm{T}}}{P_{i-1}^{\mathrm{T}} P_{i-1} W_i Q_{i+1} H_l D_i H_l^{\mathrm{T}} Q_{i+1}^{\mathrm{T}}}. \quad (17)$$
Obviously, Eq. (17) is the same as Eq. (10). Because $Z(W_i, W_i')$ is the auxiliary function of $J(W_i)$, Eq. (17) also makes $J(W_i)$ converge to a minimum. Therefore, Theorem 1 holds.

E. NETWORK EMBEDDING ALGORITHM
After repeating the update rules in Eq. (10) and Eq. (12) until convergence, the final $H_l$ is the result of network embedding. To expedite the iterative process, we pre-train each layer in the DRNMF model to attain an initial approximation of the factor matrices $W_i$ and $H_i$. The effectiveness of pre-training is validated in the experimental part. To perform the pre-training, we first decompose $M \approx W_1 H_1$ using robust NMF (RNMF), i.e., by minimizing $\|M - W_1 H_1\|_{2,1}$. Then, we decompose $H_1 \approx W_2 H_2$ by minimizing $\|H_1 - W_2 H_2\|_{2,1}$. Performing the decomposition step by step, we finish the pre-training for all the layers.
Afterwards, each layer is fine-tuned by using iterative update rules to minimize the proposed objective function described in Eq. (4). Based on the above knowledge, we devise DRNMF network embedding algorithm composed of two stages: pre-training and fine-tuning. The entire algorithmic framework is shown in Algorithm 1.

Algorithm 1 DRNMF Network Embedding Algorithm
Let $d$ denote the maximal layer size over all layers, and let $t_p$ and $t_f$ respectively denote the number of iterations needed to reach convergence in the pre-training stage and in the fine-tuning stage; then we can approximately analyze the time complexity of Algorithm 1. In the pre-training stage, the time complexity of computing the proximity matrix $M$ is $O(n^2)$ when we set $k = 2$ following the suggestion in [32], and that of the RNMF process is $O(l t_p (nd^2 + n^2 d) + t_p n^2 d)$. We can thus deduce the time complexity of the pre-training stage to be $O(l t_p (nd^2 + n^2 d))$. In the fine-tuning stage, the time complexities of computing $P_{i-1}$, $D_i$ and $P_l$ are all $O(nd^2)$, that of computing $Q_{i+1}$ is $O(d^3)$, and those of updating $W_i$ and $H_l$ are both $O(n^2 d)$. Therefore, the time complexity of the fine-tuning stage simplifies to $O(l t_f (nd^2 + n^2 d))$ under the general condition $d \ll n$. To sum up, the overall time complexity of Algorithm 1 is $O(l (t_p + t_f)(nd^2 + n^2 d))$, which has the same order of magnitude as many state-of-the-art matrix factorization based network embedding methods, such as M-NMF [12] and GraRep [13]. Since $W_i$ and $H_l$ are very sparse, the practical running time can be reduced significantly if only the non-zero entries of the matrices involved are computed.
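To make the two-stage structure of Algorithm 1 concrete, here is a hedged Python sketch: `rnmf` is an assumed single-layer RNMF helper, `fine_tune_step` is the sweep sketched after Eq. (12), and fixed iteration counts stand in for the convergence tests of the actual algorithm.

```python
import numpy as np

def rnmf(X, r, n_iter=100, eps=1e-10):
    # Single-layer NMF with l2,1 loss (illustrative stand-in for the RNMF step).
    rng = np.random.default_rng(0)
    W, H = rng.random((X.shape[0], r)), rng.random((r, X.shape[1]))
    for _ in range(n_iter):
        D = np.diag(1.0 / (np.linalg.norm(X - W @ H, axis=0) + eps))
        W *= (X @ D @ H.T) / (W @ H @ D @ H.T + eps)
        D = np.diag(1.0 / (np.linalg.norm(X - W @ H, axis=0) + eps))
        H *= (W.T @ X @ D) / (W.T @ W @ H @ D + eps)
    return W, H

def drnmf(M, layer_sizes, t_p=100, t_f=500):
    """Algorithm 1 sketch: layer-wise pre-training, then joint fine-tuning."""
    Ws, X = [], M
    for r in layer_sizes:          # pre-training: X ~= W @ H, then recurse on H
        W, H = rnmf(X, r, n_iter=t_p)
        Ws.append(W)
        X = H
    Hl = X
    for _ in range(t_f):           # fine-tuning: minimize Eq. (4) via Eq. (10)/(12)
        Ws, Hl = fine_tune_step(M, Ws, Hl)
    return Hl                      # the final embedding matrix H = H_l
```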

IV. EXPERIMENTAL STUDY
In this section, the effectiveness of our proposed method DRNMF is evaluated. First, we provide an overview of the experimental datasets, baseline methods and parameter settings. Then we conduct detailed comparative analysis against the baseline methods on three network analysis tasks: node classification, node clustering and visualization. Finally, we validate the robustness of DRNMF and the effectiveness of the pre-training strategy. All experiments are conducted on a PC with a 64-bit Windows 7 system, a 3.4GHz Intel i7-6700 CPU and 32GB RAM.

A. DATASETS
In our experiments, we use the following four widely-used benchmark complex networks as datasets:
• Political blog network (Polblog). Polblog is a hyperlink network of websites, composed of 1224 blogs about US politics and 19025 hyperlinks among them. Each blog is associated with a political label: liberal or conservative.
• Citeseer. Citeseer is a citation network of academic papers from the Citeseer digital library. It contains 3312 nodes (papers) and 4732 edges (citation links among papers). Each node has one category label indicating its topic.
• Cora. Cora is also a citation network of academic papers. It consists of 2708 nodes and 5429 edges, and every node is assigned a unique topic label.
• BlogCatalog. BlogCatalog is an online social network of bloggers. It contains 10312 bloggers and 333983 friendships between them. Every blogger is assigned a group label indicating its topic interest. Although some bloggers have multiple labels, we only keep their foremost labels for the convenience of comparisons.
These four datasets all contain complex hierarchical features and are thus suitable for validating the effectiveness of DRNMF. Besides, on these network datasets, the label associated with each node can also be treated as a cluster label indicating its group/cluster membership. Therefore, these datasets allow us to conduct performance evaluation for node classification and node clustering at the same time. Detailed statistics of the datasets are summarized in Table 1, where #labels denotes the number of class labels.

B. BASELINE METHODS
Since the motivation of this paper is to improve the performance of matrix factorization for network embedding, we specifically select three state-of-the-art matrix factorization based network embedding methods as baselines: M-NMF [12], DNMF [21] and DANMF [38]. Besides, we also consider comparisons with two classical network embedding methods, DeepWalk [10] and Node2vec [11], which are closely related to matrix factorization. Therefore, we have five baseline methods in total, introduced as follows:
• M-NMF. M-NMF is based on modularized nonnegative matrix factorization and can incorporate community structure into network embedding. It uses the combination of first-order and second-order proximity matrices as the feature matrix for factorization, and belongs to single-layer matrix factorization methods.
• DNMF. DNMF is the first method that formally introduces the deep learning concept into the NMF model. Although it was originally used to learn attribute representations of images, its multi-layer matrix factorization structure can also help it learn complex hierarchical features hidden in a network. Here, we select the adjacency matrix as its feature matrix and use the Frobenius norm to devise its cost function.
• DANMF. DANMF is a deep autoencoder-like NMF method. It was designed for community discovery, but it can also be used for network embedding by naturally treating its community membership matrix as the network embedding representation matrix. Like DNMF, DANMF uses the adjacency matrix as the feature matrix and a cost function based on the Frobenius norm.
• DeepWalk. DeepWalk is a classical network embedding method. It first transforms a network into linear sequences by truncated random walks and then uses the Skip-gram natural language model to obtain node representations. Theoretical analysis and proofs in [14] show that DeepWalk essentially produces a low-rank transformation of a network's normalized Laplacian matrix.
• Node2vec. Similar to DeepWalk, Node2vec is also based on random walks and can learn richer node representations by exploring diverse node neighborhoods. Theoretically, Node2vec can be regarded as factorizing a matrix related to the stationary distribution and transition probability tensor of a 2nd-order random walk, which has also been proved in [14].

C. PARAMETER SETTINGS
For fair comparisons, the parameters of all methods are tuned to be optimal or set to their suggested values. For M-NMF, the regularizer parameters α and β are set to 0.5 and 5, respectively. As suggested in [10], for DeepWalk we set walks per vertex γ = 80, window size ω = 10 and walk length t = 40. For Node2vec, we use the hyperparameter settings γ = 10, ω = 10 and t = 80, and employ a grid search over the return parameter p and in-out parameter q with p, q ∈ {0.25, 0.5, 1, 2} for training. For GraRep, we set k = 3 on the Polblog, Citeseer and Cora datasets, and k = 6 on the BlogCatalog dataset. For DANMF, we set the graph regularizer parameter λ to 1, as suggested in [38]. It should be noted that all methods use the same representation dimension r = 10 × (#labels) on the same dataset. For the sake of fairness, the three methods with a multi-layer learning structure (i.e., DNMF, DANMF and DRNMF) share the same layer configuration on the same dataset, which is shown in Table 2.
Although we have tried more layers, the performance improvement is not significant while much more computation time is spent. Under each parameter setting, the experiments for each method are repeated 10 times and the average results are reported.

D. COMPARISONS ON NETWORK ANALYSIS TASKS 1) NODE CLASSIFICATION
Node classification uses the representations generated by network embedding methods to classify each given node into the category it belongs to. Here, the representation of each node is used as its feature vector and its associated label is treated as the true class label. In the experiments, we use a support vector machine (SVM) implemented in Weka as the classifier. For every dataset, when training the classifier, we randomly sample 10% to 90% of the labeled nodes as training data and use the rest as test data (a code sketch of this protocol is given after the observations below). Since each node in each dataset has only one class label, we simply use classification accuracy (i.e., the proportion of correctly classified nodes) as the performance metric. The evaluation results on each dataset are shown in Fig. 3, from which we make the following observations:
• The curve of our method DRNMF is consistently above the curves of the baseline methods, which means that the node representations learned by DRNMF are more informative and discriminative, and thus improve the performance of node classification considerably.
• Compared with shallow methods M-NMF, DeepWalk and Node2vec, multi-layer matrix factorization based methods DNMF, DRNMF and DANMF perform better, which demonstrates that the multi-layer factorization structure is indeed able to obtain better node representations from the process of learning hierarchical features of the original networks.
• Although DNMF, DRNMF and DANMF all have multi-layer factorization structures, DRNMF performs better than DNMF and DANMF. The reasons can be summarized in two aspects. First, DRNMF uses the combination of high-order proximity matrices as the original feature matrix, which contains richer information about the network structure than the adjacency matrix used by DNMF and DANMF. Second, DRNMF employs the $\ell_{2,1}$ norm, which makes it more robust against noise in networks than DNMF and DANMF, which use the Frobenius norm.
It can be noted that the first reason undoubtedly plays the more important role here, because the network datasets used contain almost no noise. In Section IV.E, we specifically test the robustness of DRNMF by manually adding noise to the datasets.
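For reference, the evaluation protocol described above can be sketched as follows; the paper uses Weka's SVM classifier, so scikit-learn's LinearSVC serves only as a stand-in, and all names are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def classify_eval(H, y, train_ratio, seed=0):
    """H is the r x n embedding matrix (one column per node), y the class labels."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        H.T, y, train_size=train_ratio, random_state=seed, stratify=y)
    clf = LinearSVC().fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

# Average accuracy over 10 runs at training ratios from 10% to 90%:
# accs = [np.mean([classify_eval(H, y, r, s) for s in range(10)])
#         for r in np.arange(0.1, 1.0, 0.1)]
```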

2) NODE CLUSTERING
Since the node labels of the network datasets used here can also be treated as cluster labels, performance evaluation for node clustering can be conducted as well.
To perform node clustering, we first obtain node representations using each network embedding method, and then apply the K-means algorithm to the learnt representations to obtain clusters. As each node has a ground-truth cluster membership, we use the widely-used Purity and NMI (normalized mutual information) [39] as metrics to measure performance; the larger the Purity and NMI values, the better the node clustering. We run every method on the four datasets and report the evaluation results in Table 3. As shown in Table 3, on the Polblog dataset, only DRNMF achieves Purity and NMI values of 1.0. Besides, DRNMF is the only approach whose Purity values are larger than 0.9 on every dataset. In terms of performance improvement, compared with M-NMF, DNMF, DANMF, DeepWalk and Node2vec on the Citeseer dataset, the Purity value of DRNMF improves by 38%, 13.9%, 11.4%, 28.9% and 25.6% respectively, and the NMI value improves by 44.8%, 22.8%, 16.9%, 36.6% and 32.9% respectively. Similar results can also be found on the Cora and BlogCatalog datasets. All the results demonstrate that DRNMF performs best at node clustering, with considerable advantages over the baseline methods.
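A sketch of this clustering protocol is given below; it assumes integer-coded ground-truth labels held in NumPy arrays, and the Purity function follows the metric's standard definition rather than any code released with the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def purity(y_true, y_pred):
    # Each cluster is credited with its dominant ground-truth label.
    return sum(np.bincount(y_true[y_pred == c]).max()
               for c in np.unique(y_pred)) / len(y_true)

def clustering_eval(H, y_true, n_clusters, seed=0):
    """K-means on the learnt representations (columns of H), scored as in Table 3."""
    y_pred = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(H.T)
    return purity(y_true, y_pred), normalized_mutual_info_score(y_true, y_pred)
```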

3) VISUALIZATION
Visualization is another important application of network embedding. It aims to display a given network in a two-dimensional space. If the learned node representations are more discriminative, the visualization results display clearer boundaries between nodes with different labels, which indicates that the corresponding network embedding method performs better.
In our experiments, we use node representations learned from different network embedding methods as the input to the t-SNE visualization tool [4], where the nodes with the same label are highlighted with the same color.
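A minimal sketch of this visualization step (function and variable names are ours; t-SNE hyperparameters are left at scikit-learn defaults):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def visualize(H, labels, title=""):
    """Project the node representations (columns of H) to 2-D with t-SNE [4]
    and color each node by its ground-truth label."""
    xy = TSNE(n_components=2, random_state=0).fit_transform(H.T)
    plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=5, cmap="tab10")
    plt.title(title)
    plt.show()
```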
As representatives, the visualization results on the Polblog and Citeseer datasets are shown in Fig. 4 and Fig. 5, respectively. From Fig. 4, it can be seen that the visualization results of M-NMF, DNMF, DANMF, DeepWalk and Node2vec are not satisfactory, because many nodes with different labels are mixed together and the boundaries between different groups are unclear. For DRNMF, nodes with the same label aggregate together with clear boundaries between clusters. Moreover, the number of clusters is consistent with the number of ground-truth classes. Similar observations can be made in Fig. 5. In general, the representations learned by DRNMF are more identifiable, which makes the visualization results of DRNMF better than those of the baseline methods.

E. ROBUSTNESS TEST
The DRNMF network embedding model uses the $\ell_{2,1}$ norm instead of the widely used Frobenius norm to devise its objective function, which can be expected to improve the model's robustness against noise. To validate the effectiveness of this mechanism, we conduct comparative experiments between DRNMF using the $\ell_{2,1}$ norm (DRNMF_L21) and DRNMF using the Frobenius norm (DRNMF_Fro). In our experiments, for each dataset, we first define a set of so-called cannot-link pairwise constraints, which consists of pairs of nodes with different labels. The number (#CL) of possible cannot-link pairwise constraints on a network with the ground-truth classification result set $C = \{c_1, c_2, \ldots, c_{\#labels}\}$ can be denoted as:
$$\#CL = \sum_{i=1}^{\#labels-1} \sum_{j=i+1}^{\#labels} |c_i| \cdot |c_j|,$$
where $|c_i|$ is the number of nodes in class $c_i$. Then, we randomly extract fixed percentages (at 6 levels: 0%, 1%, 2%, 4%, 6% and 8%) of the cannot-link pairwise constraints to establish virtual links, which can be regarded as noise. Finally, we evaluate the performance of DRNMF_L21 and DRNMF_Fro on node classification tasks under the different noise levels. Note that on each dataset, we only utilize 10% of the labeled nodes for the convenience of comparisons. The results on the different datasets are shown in Fig. 6.
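The noise-injection step can be sketched as follows, under our reading of the protocol (sampling cannot-link pairs uniformly without replacement; names are illustrative and the pairwise comparison assumes the network fits in memory):

```python
import numpy as np

def add_cannot_link_noise(A, labels, rate, seed=0):
    """Add virtual links between a `rate` fraction of all cannot-link pairs
    (pairs of nodes with different labels), as in the robustness test."""
    rng = np.random.default_rng(seed)
    pairs = np.argwhere(labels[:, None] != labels[None, :])
    pairs = pairs[pairs[:, 0] < pairs[:, 1]]       # count each pair once (#CL total)
    picked = pairs[rng.choice(len(pairs), size=int(rate * len(pairs)),
                              replace=False)]
    A_noisy = A.copy()
    A_noisy[picked[:, 0], picked[:, 1]] = 1        # virtual (noise) edges
    A_noisy[picked[:, 1], picked[:, 0]] = 1
    return A_noisy
```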
As we can see from Fig. 6, on each dataset, the performance degradation of DRNMF_L21 is much smaller than that of DRNMF_Fro as the percentage of noise increases. For example, at an 8% noise rate on the BlogCatalog dataset, the accuracy of DRNMF_L21 is 0.21, which is 25.8% lower than without noise, whereas the accuracy of DRNMF_Fro is 0.13, which is 46.8% lower than without noise. Similar results are obtained on the other three datasets. All the results illustrate that DRNMF_L21 is more robust than DRNMF_Fro, which makes DRNMF with the $\ell_{2,1}$ norm more suitable for real-world complex networks, which contain noise in most cases.

F. CONVERGENCE EFFICIENCY ANALYSIS USING PRE-TRAINING
In the proposed DRNMF network embedding algorithm (i.e., Algorithm 1), the purpose of pre-training stage is to expedite the iterative process in the subsequent fine-tuning stage. To validate the effectiveness of this strategy, we conduct convergence efficiency comparisons in the fine-tuning stage by using DRNMF with pre-training and DRNMF without pre-training on every dataset. We evaluate the convergence efficiency from two aspects: the number of iterations (#Iterations) and the convergence time (Time). The results are shown in Fig. 7 and Table 4.
As shown in Fig. 7, DRNMF with pre-training needs 9, 28 and 67 iterations to converge on the Polblog, Citeseer and Cora datasets respectively, while DRNMF without pre-training needs 28, 69 and 121 iterations respectively. On the larger BlogCatalog dataset, DRNMF with pre-training, at 251 iterations, is also more advantageous than DRNMF without pre-training, at 592 iterations. Fewer iterations mean less convergence time. As shown in Table 4, compared with DRNMF without pre-training, DRNMF with pre-training saves 66.1%, 44.8%, 48.7% and 66.3% of the time on the Polblog, Citeseer, Cora and BlogCatalog datasets respectively. In particular, it takes about 1.05 hours on the BlogCatalog dataset, far less than the 3.14 hours taken by DRNMF without pre-training. All these results demonstrate that the proposed pre-training strategy accelerates DRNMF, requiring fewer iterations and less time to converge, and thereby effectively improves its convergence efficiency.

V. CONCLUSION
In order to further improve the performance of matrix factorization for network embedding, in this paper we propose a method called DRNMF, which has a multi-layer factorization structure. This structure enables DRNMF to learn more useful hierarchical features hidden in complex networks. Meanwhile, we select the combination of high-order proximity matrices of the network as the original feature matrix for factorization, and use the $\ell_{2,1}$ norm to improve the robustness of the network embedding model. These strategies make our method perform better than the state-of-the-art matrix factorization based methods, as verified by extensive experiments. In the future, the following issues remain to be addressed.
• There is no doubt that the performance of DRNMF is affected by the configuration of layers, including the number of layers and the size of each layer. As with most deep learning based methods, these settings currently need to be tuned manually according to the observed performance. Therefore, an in-depth exploration of layer configuration is quite necessary.
• The time complexity of DRNMF is $O(l(t_p + t_f)(nd^2 + n^2 d))$, which restricts its efficient application to large-scale complex networks. Hence, more efficient optimization algorithms for the DRNMF model need to be developed.
• At present, it is generally believed that deep learning methods are all built on the basis of neural networks. However, DRNMF also has a multi-layer learning structure, which makes it resemble a conventional deep learning method. It will be quite interesting to investigate whether DRNMF can reach or even outperform network embedding methods built on deep neural networks, which requires further research with comparative experiments.

Center of Guangdong Province. He has completed more than 30 research and development projects. He has authored or coauthored more than 100 publications in these areas. His research interests include database and cooperative software, temporal information processing, social network, and big data analytics.
XIANG FEI received the B.Sc. and Ph.D. degrees from Southeast University China, in 1992 and 1999, respectively. After graduation, he worked, as a Postdoctoral Research Fellow, on a number of projects including European IST Programs and EPSRC. He is currently working as a Senior Lecturer with the School of Computing, Electronics and Maths, Coventry University. His current research interests include machine learning and data mining in cyber-physical systems.
HANCHAO LI received the B.Sc. degree in mathematics from the University of Warwick, in 2013, and the M.Sc. degree in computing from Coventry University, in 2015, where he is currently pursuing the Ph.D. degree, working on music information retrieval, i.e., data mining in music subject area. He has published several conference and journal articles. His research interests are big data, data mining, machine learning, and any mathematics-related researches.
QIONG ZHANG received the Ph.D. degree from Laval University, QC, Canada, in 2016. He worked as a Visiting Researcher with McGill University, QC, Canada, from 2015 to 2016. He is currently an Adjunct Professor with the Nanchang Institute of Technology and also a Research Director with Shenzhen Qihang Academy Company, Ltd. His research interests include computer vision, image processing, natural language processing, artificial intelligence, data mining, etc.