Inter-Intra Information Preserving Attributed Network Embedding

Attributed network embedding has attracted increasing attention as a way to alleviate the sparsity of network structure, which is common in large-scale networks. Some existing attributed network embedding models integrate linkage structure and node attributes by adding a consistency criterion on the structure representation and the attribute representation, which ensures that the two are similar. However, forcing the structure and attribute representations to be similar may distort information because of the inherent difference between these two kinds of information. Additionally, the existing models mainly focus on learning the inter-relation between structure and attribute information, while the intra-characteristic of each kind of information is ignored, leading to information loss. To address these two problems, we propose a novel model named Inter-Intra Information Preserving Attributed Network Embedding (IINE) to learn the node representations of an attributed network. IINE not only captures the inter-relation between structure and attribute information with less information distortion but also preserves the intra-characteristic of each kind of information. The proposed model is composed of a primary model named coupled autoencoder and two auxiliary models named structure miner and attribute miner. The coupled autoencoder trains the node representation by smoothly combining both structure and attribute information, while the structure miner and attribute miner further mine the intra-characteristic from the corresponding information so as to assist the primary model. Extensive experiments are conducted on seven real-world datasets, and the results confirm the superior performance of IINE over several state-of-the-art models.

challenging due to problems such as data sparsity and the curse of dimensionality caused by the explosive growth of network scale. In this case, network embedding, which aims to learn effective low-dimensional network representations [6]-[13], plays a key role in solving network analysis problems. For instance, in [14], an effective network embedding model is proposed to integrate different kinds of information in a heterogeneous network to learn node representations, which can both reduce the dimensionality and enrich the knowledge inside the representations. Besides, network embedding can be useful in real life. For example, Weibo, one of the most popular social networks, has more than 300 million users, while one user may interact with only hundreds of them. This makes traditional network representations such as the adjacency matrix memory-wasting and inefficient, due to the inevitable problems of high dimensionality and sparsity. In this case, network embedding algorithms aim to learn a low-dimensional and dense representation that can not only reduce the computational complexity but also better reflect the specific features of different users.
Conventional network embedding models mainly focus on exploiting the network structure information when learning node representations. Some of them attempt to preserve the global proximity of the network structure [6], [7], [15], some pay more attention to preserving the local proximity of each node [8], [16], and some learn both the global and local proximity by jointly integrating the lower- and higher-order proximities to generate the node representations [17]-[20]. However, due to the increasing scale of networks, most of them suffer seriously from the sparsity of network structure.
VOLUME 7, 2019 This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/
Taking the Citeseer dataset (footnote 3) as an example, there are 3,312 nodes but only 4,732 edges in this network, which means each node is connected with only a few neighbors on average. Learning efficient node representations from such sparse structural information is therefore a great challenge. Fortunately, in many sparse networks, the nodes are associated with specific attributes, such as user profiles in social networks and paper abstracts in citation networks [21], [22]. These attributes contain rich semantic information which can reflect the inherent properties of the corresponding node [14]. Therefore, the data sparsity problem of node representation learning can be efficiently alleviated by properly utilizing both the node structure information and network attribute information.
One of the most intuitive ideas is to train the structure representation and the attribute representation separately and then concatenate them to generate the final representation. However, according to [9], this method cannot learn the inter-relation between structure and attribute information. Recently, some attempts have been made to train node representations by coupling network structure and node attribute information [9], [10], [23]-[27]. However, many existing models integrate the two kinds of information by keeping the node structure representation and the node attribute representation consistent, which is usually implemented by adding a constraint that forces the two representations to be similar. Nevertheless, the structure information of each node reflects the interacting relationship between the node and its neighbors, while the attribute information represents the inherent features of each node. From this perspective, the consistency criterion may cause information distortion, because it forces representations learned from different information to be similar. Besides, the existing attributed network embedding methods mainly focus on learning how to combine these two kinds of information so as to obtain the inter-relation between them, and thus ignore the significance of preserving the intra-characteristic inside each single kind of information. Therefore, it remains an open challenge to not only incorporate both structure and attribute information to capture the inter-relation without causing information loss but also preserve the intra-characteristic of each single kind of information simultaneously when learning node representations.
Footnote 3: http://linqs.cs.umd.edu/projects/projects/lbc/index.html
In this paper, we provide a solution by proposing a novel network embedding model named Inter-Intra Information Preserving Attributed Network Embedding (IINE). IINE is composed of a primary model named coupled autoencoder and two auxiliary models named structure miner and attribute miner. The coupled autoencoder integrates both structure and attribute information to generate node representations which can capture the inter-relation between them with less information distortion. Meanwhile, the structure miner and attribute miner learn node structure representations and node attribute representations from the corresponding information respectively, so as to further capture the intra-characteristic. In particular, the representations learned by the structure miner and attribute miner serve as auxiliary representations for the node representations learned by the coupled autoencoder, which helps preserve the intra-characteristic while learning the inter-relation.
The main contributions of this paper are summarized as follows:
• We define a novel and meaningful problem which is to not only capture the inter-relation between structure and attribute information with less information distortion but also preserve the intra-characteristic of each single information.
• To solve the above new problem, a novel model called IINE is proposed, which is composed of a primary model named coupled autoencoder and two auxiliary models named structure miner and attribute miner.
• Two case study experiments are conducted to investigate the influence of intra-characteristic and consistency criterion on the attributed network embedding problem, and the results empirically confirm the significance of the novel problem studied in this paper.
• Extensive experiments are conducted on seven real-world datasets through the classification and clustering tasks to demonstrate the superior performance of IINE.
The rest of this paper is organized as follows. We briefly introduce related work in Section II. Then some important notations and formal definitions of the studied problem are presented in Section III. Section IV describes the proposed IINE model in detail. The experimental results are reported in Section V. Finally, the conclusion is drawn in Section VI.

II. RELATED WORK
Network embedding, which aims to learn low-dimensional representations for networks, has attracted an increasing amount of attention in recent years. The history of network embedding can be traced back to eigen-based methods such as LLE [28], IsoMap [29] and Laplacian Eigenmaps [30]. These works represent nodes in a lower dimension by utilizing the leading eigenvectors, which can preserve the local structure and spectral properties. However, these approaches are often optimized in closed form and need to extract the leading eigenvectors, which is often not feasible for large networks. To this end, some efforts have been made to describe the network structure in a computationally feasible way [6]-[8].
To keep the global structure of the original network, and inspired by the word2vec model [31], DeepWalk [6] employs random walks to extract both the global structure of the network and the sequential features of each node, and then adopts the Skip-Gram algorithm to learn the node representations. Subsequently, Node2Vec [7] extends DeepWalk by introducing a biased random walk to control the path sampling process, which can better preserve the global structure. To keep the local structure of the network, LINE [8] focuses on preserving the neighbor structure of each node by proposing the first- and second-order proximities to constrain the training process. Based on the equivalence between matrix factorization and skip-gram based embedding proved in [32], both GraRep [17] and M-NMF [18] utilize matrix factorization to integrate both the global and local structure into the node representation. GraRep [17] preserves them by concatenating the representations obtained by factorizing several lower- and higher-order node similarity matrices. The M-NMF model [18] obtains low-dimensional node representations which preserve both the community structure and node local proximity simultaneously. With the development of deep models, DNGR [33] and SDNE [34] utilize deep autoencoders to capture the non-linearity of the network when learning node representations. In [35], an attention-based network embedding model is proposed to exploit the social structures for user alignment. Furthermore, ANE [36] adopts the adversarial learning principle of GAN [37] to train the node representations.
However, the aforementioned methods only exploit the topological structure information of networks, and may therefore suffer seriously from the data sparsity problem. To alleviate this problem, node attribute information, which contains rich semantic information, is integrated into the network embedding model. The recently developed attributed network embedding models [10], [25], [38], [39] empirically show that the performance of representation learning can be improved by taking the node attributes into account. In TADW [25], the textual features are integrated into the matrix factorization so as to guide the node representations to learn attribute information. LANE [10] jointly adopts the label, attribute, and structure information to learn node representations. Besides, DANE [38] adopts two deep autoencoders to train structure and attribute features separately and then keeps the consistency between them. Furthermore, node representations in ANEM [39] are required to preserve both the attribute information and the structural information, including the local proximity structure and the mesoscopic community structure. Despite their success, many attributed network embedding methods adopt a consistency criterion to integrate structure and attribute information, which may cause information distortion. Additionally, the intra-characteristic of each kind of information is ignored, leading to information loss.

III. PROBLEM FORMULATION
In this section, we will present the important notations and formally define the studied problem. Let $V = \{v_1, \ldots, v_n\}$ and $F = \{f_1, \ldots, f_d\}$ represent the node set and the attribute set of the attributed network respectively. Mathematically, two important information matrices are introduced to describe this information as follows. One of them is the structure proximity matrix $A \in \mathbb{R}^{n \times n}$, which is composed of a series of structure matrices of different orders. Each element $a_{ij}$ represents the transition probability between node $i$ and node $j$, and the sum of each row vector $a_i$ is equal to 1. In practice, we sum up the first-, second- and third-order structure matrices and then normalize the result to obtain the structure proximity matrix. The other is the attribute matrix $P \in \mathbb{R}^{n \times d}$, where each element $p_{ij}$ represents whether node $i$ contains the $j$-th attribute $f_j$, and each row $p_i$ is the attribute vector of a node. Finally, the definition of edge set is
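The construction of the structure proximity matrix described above (sum the first-, second- and third-order structure matrices, then normalize the rows) can be illustrated with a small sketch. It assumes that the k-th order structure matrix is the k-th power of the one-step transition matrix, which the text does not state explicitly; the function names are hypothetical and the plain-list implementation is only meant for toy graphs.

```python
def row_normalize(m):
    """Scale each row to sum to 1; all-zero rows are left as zeros."""
    out = []
    for row in m:
        s = sum(row)
        out.append([x / s if s else 0.0 for x in row])
    return out


def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def structure_proximity(adj, max_order=3):
    """Sum the 1st..max_order-th order transition matrices, then row-normalize.

    Assumption: the k-th order structure matrix is the k-th power of the
    one-step transition matrix.
    """
    t = row_normalize(adj)            # first-order transition probabilities
    power = t
    total = [row[:] for row in t]
    for _ in range(max_order - 1):
        power = matmul(power, t)      # next-order transition probabilities
        total = [[x + y for x, y in zip(r1, r2)]
                 for r1, r2 in zip(total, power)]
    return row_normalize(total)


# Toy path graph 0 - 1 - 2
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
A = structure_proximity(adj)
```

In this toy graph, nodes 0 and 2 are not directly connected, yet the resulting row for node 0 assigns a non-zero probability to node 2 because of the second-order term.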

B. PROBLEM DEFINITION
Different from the network consisting of only topological structure, nodes in attributed network are usually associated with different attributes, for instance, user profile in social network, paper abstract in citation network, and so on. Therefore, for better analyzing the real-world network, attributed network embedding is of vital significance.
Definition 1 (Attributed Network Embedding): Given the structure proximity matrix $A$ and the attribute matrix $P$ of the graph $G$, attributed network embedding is to learn an appropriate model which can map both structure and attribute information into low-dimensional network representations.
How to learn efficient node representations from these two kinds of information is the key to attributed network embedding. The existing attributed network embedding methods can be generally categorized into two classes. One is the consistency based network embedding model like [10], and the other is the combination based network embedding model like [9].
Definition 2 (Consistency Based Attributed Network Embedding): Consistency based Attributed Network Embedding adopts a similarity constraint on the structure and attribute representations of the same node, so as to keep the consistency between structure information and attribute information.
Definition 3 (Combination Based Attributed Network Embedding): Combination based Attributed Network Embedding aims at combining two kinds of information into the learned representation, which usually adopts a combination function to incorporate the structure representation with attribute representation.
However, both kinds of methods may, to some extent, cause node embeddings to lose information: the first trains the final representation by forcing the latent representations of two different kinds of information to be similar, while the second focuses on learning the inter-relation between structure information and attribute information and ignores the intra-characteristic of each single kind of information. Based on this, we define a novel problem called 'Inter-Intra Information Preserving Attributed Network Embedding' as follows:
Definition 4 (Inter-Intra Information Preserving Attributed Network Embedding): Given the structure proximity matrix $A$ and the attribute matrix $P$, the Inter-Intra Information Preserving Attributed Network Embedding is to smoothly incorporate $A$ and $P$ to train the low-dimensional representation $H$ that can not only capture the inter-relation between the two kinds of information with less information distortion but also preserve the intra-characteristic of each single kind of information.

IV. THE PROPOSED MODEL
For solving the above problem, we propose the IINE model, whose general architecture is shown in Figure 1. As shown in the dashed line rectangle, the coupled autoencoder is proposed to integrate the structure and attribute information to generate the node representations which can capture the inter-relation between them with less information distortion.
For preserving the intra-characteristic, two auxiliary models named structure miner and attribute miner are established to further mine the intra-characteristic of the corresponding information so as to assist the coupled autoencoder.

A. COUPLED AUTOENCODER
In attributed network, structure information and attribute information describe the network from two different aspects. For smoothly integrating the structure and attribute information to train the node representations, we propose a novel unsupervised model named Coupled Autoencoder. Without loss of generality, the coupled autoencoder consists of encoding and decoding processes:

1) ENCODING
The objective of the encoding process is to mine the latent information of network structure and node attributes. To this end, the structure and attribute information, namely the structure proximity matrix $A$ and the attribute matrix $P$, are first encoded as the latent structure representation $U^A \in \mathbb{R}^{n \times \frac{k}{2}}$ and the latent attribute representation $U^P \in \mathbb{R}^{n \times \frac{k}{2}}$ by the corresponding encoding functions. Then, the node representation $H$ is generated by the weighted combination of $U^A$ and $U^P$, where the combination weight is set as $\gamma$.
Although there are many rational combination methods, the concatenating combination is adopted to preserve as much information as possible. Therefore, the encoding process can be formulated as follows:
where $\theta_A$, $\theta_P$ are the parameters of the structure encoder and the attribute encoder respectively. Then the node representation is generated by

2) DECODING
As shown in the upper part of the dashed line rectangle of Figure 1, during the decoding process, the decoder should reconstruct both structure and attribute information from $H$ by using different decoding functions, which can be formulated as follows:
where $\phi_A$ and $\phi_P$ represent the parameters of the structure decoder and the attribute decoder respectively.
In summary, the encoding process maps both the structure and attribute information into a latent space and combines them to obtain the node representations $H$. In contrast, the decoding process decodes the node representation $H$ back to the original structure and attribute spaces by using different decoding functions so as to reconstruct both kinds of information. The more completely the node representations can reconstruct the information, the more representative they are. Therefore, the reconstruction losses from structure information and attribute information are adopted to guide the training of the coupled autoencoder:
However, due to the sparsity of the given network, the reconstruction error is mainly controlled by zero elements. To balance the losses caused by zero and non-zero elements, the confident matrix [40] should be taken into consideration:
where $\|\cdot\|_F$, $\odot$, $C^A \in \mathbb{R}^{n \times n}$ and $C^P \in \mathbb{R}^{n \times d}$ denote the Frobenius norm, the element-wise product, the structure confident matrix and the attribute confident matrix respectively. In
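As a rough illustration of two ingredients described above, the weighted concatenation of the two latent codes and the confidence-weighted reconstruction loss, consider the following minimal sketch. Several points are assumptions rather than the authors' exact formulation: the structure half is scaled by 1 - γ and the attribute half by γ, the confident matrix takes the value β on non-zero entries and 1 elsewhere, and the loss is the squared Frobenius norm. The encoders and decoders themselves are left out, and all function names are hypothetical.

```python
def combine(u_a, u_p, gamma):
    """Concatenate per-node structure and attribute codes with weight gamma.

    Assumption: structure codes are scaled by (1 - gamma) and attribute
    codes by gamma before concatenation.
    """
    return [[(1.0 - gamma) * x for x in ra] + [gamma * x for x in rp]
            for ra, rp in zip(u_a, u_p)]


def confidence(m, beta):
    """Assumed confident matrix: beta on non-zero entries, 1 on zeros."""
    return [[beta if x != 0 else 1.0 for x in row] for row in m]


def weighted_recon_loss(m, m_hat, conf):
    """Squared Frobenius norm of (m - m_hat) multiplied element-wise by conf."""
    return sum(((x - y) * c) ** 2
               for rm, rh, rc in zip(m, m_hat, conf)
               for x, y, c in zip(rm, rh, rc))


# One node with a 2-d structure code and a 2-d attribute code
h = combine([[1.0, 2.0]], [[3.0, 4.0]], gamma=0.6)

# A sparse target and a reconstruction that is equally wrong everywhere
a = [[0.0, 1.0], [1.0, 0.0]]
a_hat = [[0.5, 0.5], [0.5, 0.5]]
plain = weighted_recon_loss(a, a_hat, confidence(a, beta=1.0))
weighted = weighted_recon_loss(a, a_hat, confidence(a, beta=6.0))
```

With β = 1 all four entries contribute equally to the loss, whereas with β = 6 the errors on the two non-zero entries dominate, which is the balancing effect the confident matrix is meant to provide.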

B. STRUCTURE MINER AND ATTRIBUTE MINER
Although the coupled autoencoder can guide node representations to learn both structure information and attribute information, the intra-characteristic of each single information like the local proximity may be missed during the training process. For addressing this problem, we propose the structure miner and the attribute miner to further mine the intra-characteristic inside the corresponding information so as to assist the training of node representations.

1) STRUCTURE MINER
As shown in the left part of Figure 1, the structure miner is a deep neural network. This model takes $A$ as input and encodes it into the latent space to obtain the structure auxiliary representation $V^A \in \mathbb{R}^{n \times \frac{k}{2}}$, which is able to keep the intra-characteristic of structure information at both global and local aspects. Globally, the matrix decoded from $V^A$ is expected to reconstruct the original topological structure $A$. Locally, node representations in $V^A$ are supposed to reflect the proximity between a node and its neighbors, and the dissimilarity between a node and a part of its non-neighbors. Therefore, the loss of the structure miner is composed of both the reconstruction loss and the negative-sampling based pairwise proximity loss:
where $N_E$ is a sampled set of node pairs $(i, j) \notin E$, $\sigma(\cdot)$ is the sigmoid function, and $v^A_i$ is the structure auxiliary representation of the $i$-th node. Minimizing this loss is equivalent to minimizing the reconstruction loss of structure information, minimizing the dissimilarity (the negative of the similarity) between neighbor nodes, and maximizing the dissimilarity between the negatively sampled non-neighbor node pairs.
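The local part of the structure miner loss can be sketched as follows. This is only one common form of a negative-sampling objective built from the sigmoid of inner products; the exact expression used by the model is not reproduced here, so the form below, the function names and the toy vectors are all assumptions. The reconstruction term would be added on top of this pairwise term.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pairwise_proximity_loss(v, edges, neg_pairs):
    """Negative-sampling loss over inner-product similarities (assumed form).

    Connected pairs are pulled together via -log sigmoid(v_i . v_j);
    sampled non-neighbor pairs are pushed apart via -log sigmoid(-v_i . v_j).
    """
    pos = -sum(math.log(sigmoid(dot(v[i], v[j]))) for i, j in edges)
    neg = -sum(math.log(sigmoid(-dot(v[i], v[j]))) for i, j in neg_pairs)
    return pos + neg


edges = [(0, 1)]          # nodes 0 and 1 are neighbors
neg_pairs = [(0, 2)]      # nodes 0 and 2 are a sampled non-neighbor pair

# Embedding that agrees with the graph: 0 close to 1, far from 2
v_good = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
# Embedding that contradicts the graph: 0 far from 1, close to 2
v_bad = [[1.0, 0.0], [-1.0, 0.0], [1.0, 0.0]]

loss_good = pairwise_proximity_loss(v_good, edges, neg_pairs)
loss_bad = pairwise_proximity_loss(v_bad, edges, neg_pairs)
```

The embedding that places neighbors together and the sampled non-neighbor apart receives the lower loss, which matches the behavior described in the paragraph above.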

2) ATTRIBUTE MINER
The attribute miner aims to learn node attribute representations which further capture the intra-characteristic of attribute information. Similar to the structure auxiliary representation, the attribute auxiliary representation $V^P \in \mathbb{R}^{n \times \frac{k}{2}}$ is expected to both reconstruct the global attribute information and keep the local proximity. However, unlike the structure proximity matrix, the attribute matrix cannot directly reflect the pairwise relationship between nodes. Fortunately, it is intuitive that nodes usually share similar features with their neighbors, while non-neighbors may not. For instance, connected papers in a paper citation network share similar research topics, while a part of the disconnected papers are likely to be in different research areas. Inspired by this observation, it is better to keep the attribute representations of nodes and their neighbors close while keeping them at a distance from some of the non-neighbors. By integrating the negative sampling method, the loss function of the attribute miner can be written as follows:
where $v^P_i$ is the attribute auxiliary representation of the $i$-th node. Similar to the loss of the structure miner, optimizing the loss of the attribute miner minimizes the reconstruction loss of attribute information and the dissimilarity (the Euclidean distance) between neighbor nodes, while maximizing the dissimilarity between the negatively sampled non-neighbor node pairs.
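The local part of the attribute miner loss, together with the sampling of non-neighbor pairs, can be sketched as below. The sketch assumes the simplest possible form, the sum of Euclidean distances over edges minus the sum over sampled non-edges, and an exhaustive sampler that is only practical for toy graphs; neither is claimed to be the authors' formulation, and the function names are hypothetical.

```python
import math
import random


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def sample_non_edges(n, edges, k, seed=0):
    """Draw up to k distinct unordered node pairs that are not connected."""
    linked = {frozenset(e) for e in edges}
    candidates = [(i, j) for i in range(n) for j in range(i + 1, n)
                  if frozenset((i, j)) not in linked]
    rng = random.Random(seed)
    return rng.sample(candidates, min(k, len(candidates)))


def attribute_proximity_loss(v, edges, neg_pairs):
    """Assumed form: distances on edges minus distances on sampled non-edges."""
    near = sum(euclidean(v[i], v[j]) for i, j in edges)
    far = sum(euclidean(v[i], v[j]) for i, j in neg_pairs)
    return near - far


edges = [(0, 1), (2, 3)]
# Attribute codes in which each connected pair sits close together
v = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]

neg_pairs = sample_non_edges(4, edges, k=2)
loss = attribute_proximity_loss(v, edges, neg_pairs)
```

Because the connected pairs are close and every non-connected pair is far apart in this toy layout, the loss is negative regardless of which two non-edges are drawn.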

C. INTER-INTRA INFORMATION PRESERVING ATTRIBUTED NETWORK EMBEDDING
By incorporating the coupled autoencoder with the structure miner and the attribute miner, the IINE model is established. In IINE, the reconstruction loss from the coupled autoencoder guides the node representations to learn the coupled information. Meanwhile, the two auxiliary representations $V^A$ and $V^P$ generated by the structure miner and the attribute miner are utilized to guide the structure component $U^A$ and the attribute component $U^P$ of the node representations to preserve the intra-characteristic of each single kind of information, and the auxiliary loss can be written as follows:
where $u^A_i$ and $u^P_i$ are the structure component and the attribute component of the representation of the $i$-th node respectively.
Finally, the objective function of our IINE model can be written as follows:
where $\alpha$ is the weight controlling the strength of the auxiliary loss. In our implementation, the loss function in Eq. (11) is optimized with the Adam optimizer [41] in Python.
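The way the auxiliary loss enters the objective can be illustrated with a small sketch. It assumes that the auxiliary loss is the sum of squared distances between each component of the node representation and the matching auxiliary representation, and that the objective simply adds the auxiliary loss, scaled by α, to a main loss; both are assumptions for illustration, and the function names are hypothetical.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def auxiliary_loss(u_a, v_a, u_p, v_p):
    """Assumed form: per-node squared distances between the components
    (u_a, u_p) and the auxiliary representations (v_a, v_p)."""
    return (sum(sq_dist(a, b) for a, b in zip(u_a, v_a))
            + sum(sq_dist(a, b) for a, b in zip(u_p, v_p)))


def total_objective(main_loss, aux_loss, alpha):
    """Assumed combination: main loss plus alpha times the auxiliary loss."""
    return main_loss + alpha * aux_loss


# One node: the structure component is off by 1, the attribute component matches
aux = auxiliary_loss([[1.0, 0.0]], [[0.0, 0.0]], [[2.0]], [[2.0]])
total = total_objective(main_loss=3.0, aux_loss=aux, alpha=0.6)
```

Under this form, a larger α pulls the components of the node representation more strongly towards the auxiliary representations, at the expense of the main reconstruction objective.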

V. EXPERIMENTS
In this section, extensive experiments are conducted on seven publicly available networks. First of all, we introduce in detail the experimental settings, including the datasets and evaluation measures. Secondly, parameter analyses on both classification and clustering tasks are conducted to investigate the sensitivity of the parameters on the IINE model. Then, comparison experiments are conducted to compare the proposed IINE model with several existing network embedding models, and the results confirm the superiority of the proposed model. As an auxiliary experiment, visualization of the embedding results on the Citeseer dataset generated by different algorithms is also presented. Finally, two case study experiments are conducted to investigate the influence of intra-characteristic and consistency criterion on the attributed network embedding tasks, and the results empirically confirm the significance of the novel problem studied in this paper.

A. DATASETS
In the experiments, we adopt seven publicly available attributed networks, which are the Citeseer network [42], the Wiki network [43], the Terrorist network [44], and four subnetworks (Cornell, Texas, Washington, Wisconsin) from the WebKB network (footnote 4). The datasets are described as follows:

B. EVALUATION MEASURES
The performance of each network embedding model is evaluated in the tasks of node classification and node clustering, where the LinearSVM classifier and k-means clustering are used respectively (footnote 5). In addition to the classification and clustering results, a 2-d visualization is also adopted to assess the quality of the embeddings generated by each algorithm.

1) CLASSIFICATION
During the classification task, 20%, 50% and 80% of the labels are used as training data to learn a classifier on the node representations, and the remaining labels are used as testing data. The classification results are evaluated in terms of both Micro-F1 (Mic-F1) and Macro-F1 (Mac-F1) scores [6] by comparing the predicted labels with the ground-truth labels. The calculations of Mic-F1 and Mac-F1 are presented as follows:

$$\text{Mac-F1} = \frac{2 \times recall_{ma} \times precision_{ma}}{recall_{ma} + precision_{ma}} \qquad (13)$$

where the precision and recall terms are defined through Eq. (17), with $N$ being the number of nodes. Generally, higher Mic-F1 and Mac-F1 scores indicate better classification performance.
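For single-label classification, the two scores can be computed as in the following sketch. Mac-F1 follows the form of Eq. (13), combining macro-averaged precision and recall, and Mic-F1 is assumed to take the analogous form on counts pooled over classes. The definitions of the averaged precision and recall used here (unweighted means over classes) are assumptions, and note that some libraries define Macro-F1 differently, as the mean of per-class F1 scores. The function name is hypothetical.

```python
def f1_scores(y_true, y_pred):
    """Return (Mic-F1, Mac-F1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted class gains a false positive
            fn[t] += 1   # true class gains a false negative

    def ratio(num, den):
        return num / den if den else 0.0

    def harmonic(p, r):
        return ratio(2 * p * r, p + r)

    # Micro: pool the counts over all classes first
    tp_all, fp_all, fn_all = sum(tp.values()), sum(fp.values()), sum(fn.values())
    p_mi = ratio(tp_all, tp_all + fp_all)
    r_mi = ratio(tp_all, tp_all + fn_all)

    # Macro: average per-class precision and recall, then combine as in Eq. (13)
    p_ma = sum(ratio(tp[c], tp[c] + fp[c]) for c in labels) / len(labels)
    r_ma = sum(ratio(tp[c], tp[c] + fn[c]) for c in labels) / len(labels)

    return harmonic(p_mi, r_mi), harmonic(p_ma, r_ma)


y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]
mic_f1, mac_f1 = f1_scores(y_true, y_pred)
```

In this toy example the rare class 2 is never predicted correctly, which lowers Mac-F1 much more than Mic-F1; this is why both scores are usually reported together.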

2) CLUSTERING
In the clustering task, the number of clusters is set to the number of labels of the corresponding dataset. The clustering results are evaluated by both Purity and Normalized Mutual Information (NMI), where $I(\cdot;\cdot)$ and $H(\cdot)$ are the mutual information and entropy functions respectively. The mutual information and the two entropy terms are calculated as follows:
Higher values of Purity and NMI [45] indicate better clustering performance.
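The two clustering measures can be computed from cluster assignments and ground-truth labels as in the sketch below. Purity uses its standard definition; for NMI, the normalization by the arithmetic mean of the two entropies is an assumption, since other normalizations (for example the geometric mean) are also in common use. The function names are hypothetical.

```python
import math
from collections import Counter


def purity(clusters, classes):
    """Fraction of nodes that belong to the majority class of their cluster."""
    joint = Counter(zip(clusters, classes))
    best = {}
    for (k, _), cnt in joint.items():
        best[k] = max(best.get(k, 0), cnt)
    return sum(best.values()) / len(classes)


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(clusters, classes):
    n = len(classes)
    joint = Counter(zip(clusters, classes))
    nk, nc = Counter(clusters), Counter(classes)
    return sum((cnt / n) * math.log(n * cnt / (nk[k] * nc[c]))
               for (k, c), cnt in joint.items())


def nmi(clusters, classes):
    """Mutual information divided by the mean of the two entropies (assumed)."""
    denom = (entropy(clusters) + entropy(classes)) / 2.0
    if denom == 0.0:
        return 0.0  # degenerate case: a single cluster and a single class
    return mutual_information(clusters, classes) / denom


pur = purity([0, 0, 0, 1, 1, 1], ['a', 'a', 'b', 'b', 'b', 'b'])
nmi_perfect = nmi([1, 1, 0, 0], ['a', 'a', 'b', 'b'])
nmi_unrelated = nmi([0, 1, 0, 1], ['a', 'a', 'b', 'b'])
```

The second call shows that NMI does not depend on how the clusters are numbered, and the third shows that a clustering independent of the labels scores zero.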

C. PARAMETER ANALYSIS
We first study the performance of IINE in the classification task (80% of samples as the training set) and the clustering task under different values of the model parameters α, β and γ, which are the weight of the auxiliary loss, the confidence value of the reconstruction loss, and the combination weight of the embeddings respectively. The ranges of the three parameters are set as 0.1 ≤ α ≤ 0.9, 1 ≤ β ≤ 9, and 0.1 ≤ γ ≤ 0.9 respectively.

1) ANALYSIS OF α
The influence of the weight parameter α is investigated first. By fixing γ = 0.6 and β = 6, Figure 2 shows the performance of IINE as a function of α. As shown in Figure 2(a) and Figure 2(b), in the classification task, the Mic-F1 and Mac-F1 scores on the 'Citeseer', 'Wiki' and 'Terrorist' networks are insensitive to the value of α, while those on the remaining four networks are affected more obviously. Moreover, the classification results tend to be better at the edges of the interval (about α ≤ 0.3 or α ≥ 0.6). For clustering, as shown in Figure 2(c) and Figure 2(d), the results on each dataset are better in the middle of the interval (about 0.2 ≤ α ≤ 0.8). This demonstrates that the performance of classification and clustering can be significantly enhanced by a proper auxiliary loss (e.g. 0.2 ≤ α ≤ 0.3 or 0.6 ≤ α ≤ 0.8), but when it becomes too large, the major loss may be adversely affected, which weakens the embedding performance.

2) ANALYSIS OF β
We then analyze the influence of the confidence parameter β. By fixing α = 0.6 and γ = 0.6, the variation of the Mic-F1 and Mac-F1 values in the classification task as a function of β is shown in Figure 3(a) and Figure 3(b). From Figure 3(a) and Figure 3(b), it can be observed that when β ≤ 3, the classification performance in terms of both Mac-F1 and Mic-F1 grows dramatically, and then it tends to be stable as β increases. As can be seen from Figure 3(c) and Figure 3(d), the clustering performance increases or decreases dramatically at the edges of the interval (β ≤ 3 or β ≥ 7), while it stays relatively better and stable in the middle of the interval (4 ≤ β ≤ 6). This shows that it is suitable to increase the confidence value, which assigns a larger loss to the non-zero elements, so as to enhance the utilization of sparse information. However, when the confidence value is too large, the information of zero elements may be neglected, and in the extreme case, the node representations may be misled into learning to reconstruct an all-1 matrix.

3) ANALYSIS OF γ
Finally, the influence of the concatenating parameter γ is explored. By fixing α = 0.6 and β = 6, the variation of the Mic-F1 and Mac-F1 values as a function of γ is shown in Figure 4(a) and Figure 4(b), while Figure 4(c) and Figure 4(d) present the influence of γ in the clustering task in terms of Purity and NMI. From Figure 4(a) and Figure 4(b), it can be seen that the classification performance stays at a higher level in terms of both Mac-F1 and Mic-F1 scores when γ is in the middle of the interval (0.3 ≤ γ ≤ 0.7), while somewhat worse results are obtained at the edges of the interval. As can be observed in Figure 4(c) and Figure 4(d), the clustering performance is insensitive to γ when it is relatively small (γ ≤ 0.7), but decreases dramatically when γ becomes too large (γ ≥ 0.8). This shows that properly combining structure and attribute information can enhance the performance of the node representation. However, a larger γ means less importance of structure information in the node representation, which weakens the utilization of structure information.
Overall, according to the above analysis, we choose the parameter combination α = 0.6, β = 6 and γ = 0.6 for all the datasets in the following experiments.

D. COMPARISON RESULTS
In this section, we conduct several experiments to compare the performance of IINE with six state-of-the-art network embedding algorithms.
1) DeepWalk [6]: Inspired by Word2Vec [31], it generates truncated random walks based on the topological structure and then feeds them into the Skip-gram model to generate the node representations.
2) Node2Vec [7]: It extends DeepWalk by introducing a biased random walk to explore diverse neighborhoods more efficiently, which can better preserve the global structure.
3) M-NMF [18]: It preserves both community structure and node local proximity simultaneously when learning the node representations.
4) SDNE [34]: In SDNE, the information of the network topology is embedded into a deep autoencoder so that the locality, globality and non-linearity of the network are captured when learning node representations.
5) LANE [10]: It jointly adopts the label, attribute, and structure information to generate node representations. In our experiments, the version without the label information is used.
6) SNE [46]: SNE learns node representations which can preserve the global network structure and the homophily effect by utilizing both structure and attribute information.
Parameters of these six compared algorithms are either set to the default values suggested by the authors or tuned by trial to obtain better results. For each algorithm, the dimensionality of node representations is set as 128 for the four subnetworks of WebKB, and 256 for the other networks. The best result is denoted by bold font.
From Table 1, it can be observed that, in the classification task, IINE outperforms the compared algorithms in terms of both Mic-F1 and Mac-F1 on all the datasets at almost all the training ratios, which confirms that the node representations trained by IINE are more discriminative. Moreover, the average ranks of IINE in terms of Mic-F1 and Mac-F1 are 1 and 1.10, while those of the second best model SNE are 3.19 and 2.90, which empirically demonstrates the superior robustness of IINE over the compared algorithms in the classification task.

2) NODE CLUSTERING
In the clustering task, we adopt the classical k-means as the clustering algorithm, where the number of clusters is set to the number of labels, and the clustering results are evaluated by both Purity and NMI. As in the classification task, we adopt the average rank to assess the general performance of each algorithm.
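The two clustering metrics named above can be sketched in a few lines of plain Python. The sketch takes the ground-truth labels and the cluster assignments (for example, the output of k-means) as two equal-length lists. The NMI normalization used here, the arithmetic mean of the two entropies, is an assumption, since the text does not state which variant is used; the function names are hypothetical.

```python
import math
from collections import Counter


def purity(labels_true, labels_pred):
    """Fraction of nodes that belong to the majority class of their cluster."""
    clusters = {}
    for t, c in zip(labels_true, labels_pred):
        clusters.setdefault(c, Counter())[t] += 1
    return sum(max(cnt.values()) for cnt in clusters.values()) / len(labels_true)


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((k / n) * math.log(k / n) for k in Counter(labels).values())


def nmi(labels_true, labels_pred):
    """Mutual information divided by the mean of the two entropies
    (assumed normalization; other variants use sqrt, min, or max)."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    count_t = Counter(labels_true)
    count_c = Counter(labels_pred)
    mi = sum(
        (k / n) * math.log(k * n / (count_t[t] * count_c[c]))
        for (t, c), k in joint.items()
    )
    denom = (entropy(labels_true) + entropy(labels_pred)) / 2
    # If both assignments are constant, treat them as identical partitions.
    return mi / denom if denom > 0 else 1.0
```

Both metrics are invariant to how the clusters are numbered, so the arbitrary cluster indices returned by k-means need no alignment with the label indices.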
From Table 2, it can be observed that, in the clustering task, IINE achieves the best performance on all the datasets in terms of both Purity and NMI, which indicates that the learned node embeddings are well distributed in the low-dimensional space. On average, in terms of Purity and NMI, the clustering results of IINE are 18.89% and 21.50% higher than those of the second-best model, SNE, respectively. Overall, IINE ranks first on all the datasets, which demonstrates its strong robustness in the clustering task.
The experimental results in Table 1 and Table 2 also reflect, to some extent, the preference of each algorithm. Among the compared algorithms, the random-walk-based models, DeepWalk and Node2Vec, are more suitable for the 'Wiki' and 'Citeseer' datasets, which contain larger networks, while the others are more suitable for the relatively smaller networks, namely 'Terrorist' and the subnetworks of 'WebKB'. A possible explanation is that random-walk-based methods consider neighbors along a long path and may therefore neglect the micro structure of a smaller network, whereas the other methods mainly use adjacent neighbors as structure information and may therefore capture only a limited part of the macro structure of a larger network. Besides, although both SNE and SDNE adopt a deep model, SNE achieves better results than SDNE on almost all the datasets, which to some extent reflects the effectiveness of utilizing attribute information. Finally, although both SNE and IINE use attribute information and a deep model, IINE significantly outperforms SNE, which empirically demonstrates that keeping the intra-characteristic of each single information when combining them can effectively enhance the quality of node embeddings.

E. NETWORK VISUALIZATION
To show the embedding results of different approaches more clearly, t-SNE [47] is utilized to further map the obtained embeddings to 2-d space, where nodes with the same label are marked with the same color. Specifically, we plot the visualization results of the Citeseer dataset in Figure 5.
As shown in Figure 5, the nodes with the same label are relatively more compact in the visualization of IINE than in those of the compared algorithms, which is beneficial to downstream tasks such as classification and clustering. Besides, several obvious problems can be found in the visualizations of the compared algorithms. As can be seen from Figure 5(a) and Figure 5(b), both DeepWalk and Node2Vec produce a confused region in their visualizations, which may be caused by the random walk. In Figure 5(c), the visualization of M-NMF shows that many nodes with different labels lie too close together while several node groups with the same label are widely separated, which weakens the discriminability of the node representations. As can be seen from Figure 5(d), Figure 5(e), and Figure 5(f), the nodes in the visualizations of SDNE, LANE, and SNE are somewhat messy, which is unlikely to reflect a reasonable distribution of the nodes.

F. CASE STUDY
In this part, the following two questions are investigated:
• Does the intra-characteristic of each single information matter?
• Does the consistency criterion cause information distortion in attributed network embedding?
To answer these questions, two case study experiments are conducted on both the classification task (with a training ratio of 80%) and the clustering task.

1) THE INFLUENCE OF INTRA-CHARACTERISTIC OF SINGLE INFORMATION
In our model, the intra-characteristic of each single information is captured by the structure miner and the attribute miner, respectively. To investigate the influence of the intra-characteristic, the performance of the original IINE is compared with that of IINE without the auxiliary models (denoted as IINE_w) in both the classification and clustering tasks, and the results are reported in Table 3 and Table 4. From Table 3 and Table 4, we can see that IINE significantly outperforms IINE_w on all the datasets in both tasks. In the classification task, the improvements in terms of Mic-F1 and Mac-F1 are 4.38% and 15.14%, respectively, whereas in the clustering task, the improvements in terms of Purity and NMI are 10.39% and 38.42%, respectively. These results show that the performance of the learned representations degenerates sharply, especially in the clustering task, when no measure is taken to preserve the intra-characteristic of each single information. This is mainly because the unsupervised node clustering task relies on the intra-characteristic information to make nodes more informative and distinctive. This case study empirically demonstrates that the performance of node representations can be significantly enhanced when the intra-characteristic of each information is properly preserved.

2) THE INFLUENCE OF CONSISTENCY CRITERION
The consistency criterion assumes that the structure representation and the attribute representation of the same node should be similar. Therefore, adding the consistency criterion is equivalent to imposing a similarity constraint on U A and U P in the form of an additional loss term. With this loss added, we obtain the consistency-criterion-based model, denoted as IINE_c. The comparison results of IINE and IINE_c are shown in Table 5 and Table 6. As can be seen from both tables, the classification and clustering results of IINE_c degenerate dramatically. This negative effect suggests that the consistency criterion distorts the information learned in the node representations, leading to the degradation of performance. Therefore, this study to some extent demonstrates that the consistency criterion is ill-suited to combining different kinds of information.
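A similarity constraint of this kind is commonly written as a squared Frobenius-norm penalty between the two representation matrices. The following is a generic sketch under that assumption and is not necessarily the exact loss used to build the consistency-criterion-based variant. The superscript placement in $U^{A}$ and $U^{P}$, the row vectors $u_i^{A}$ and $u_i^{P}$ of node $i$, the number of nodes $n$, the base objective $\mathcal{L}_{\mathrm{IINE}}$, and the weight $\lambda$ are all notational assumptions introduced here for illustration.

```latex
% Assumed generic form of a consistency (similarity) penalty:
% it pulls the two representations of every node toward each other.
\mathcal{L}_{\mathrm{con}}
  = \left\lVert U^{A} - U^{P} \right\rVert_F^{2}
  = \sum_{i=1}^{n} \left\lVert u_i^{A} - u_i^{P} \right\rVert_2^{2}

% Assumed way of attaching the penalty to the original objective,
% with a hypothetical trade-off weight \lambda > 0:
\mathcal{L}_{\mathrm{total}}
  = \mathcal{L}_{\mathrm{IINE}} + \lambda \, \mathcal{L}_{\mathrm{con}}
```

Under this form, the penalty is zero only when the two representations of every node coincide, which illustrates why such a constraint can suppress the information that is specific to either the structure or the attributes.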

VI. CONCLUSIONS
In this paper, we study a novel problem of attributed network embedding, namely how to learn node representations that reflect the inter-relation between structure and attribute information while avoiding information distortion and preserving the intra-characteristic of each single information. To solve this problem, we propose the IINE model. The basic idea is to design a coupled autoencoder that learns the inter-relation inside the deeply coupled information from both the network structure and the node attributes. In addition, two auxiliary models, called the structure miner and the attribute miner, are designed to learn single-information representations so as to further mine the intra-characteristic inside them. By jointly correlating the coupled-information representations and the single-information representations, an overall objective function is proposed. Extensive experiments conducted on seven real-world attributed networks show that our model outperforms most of the state-of-the-art network embedding methods in both the node classification and clustering tasks. Additionally, two case study experiments are conducted by comparing IINE with two variant models, namely IINE_w and IINE_c, which empirically demonstrate the significance of the novel problem proposed in this paper.
KAI WANG received the bachelor's degree from Sun Yat-sen University, in 2018. He has been a graduate student at Sun Yat-sen University since September 2018. His current research interests include data mining.
LEI XU received the bachelor's degree from Sun Yat-sen University, in 2018. He has been a graduate student at Sun Yat-sen University since September 2018. His current research interests include data mining.
LING HUANG received the undergraduate and master's degrees from South China University of Technology, in 2009 and 2013, respectively. She is currently pursuing the Ph.D. degree with Sun Yat-sen University. She has published several papers in international journals and conferences such as Pattern Recognition, IEEE Access, Information Sciences, Knowledge-Based Systems, KDD, AAAI, IJCAI, IEEE ICDM, IEEE BIBM, and DASFAA. Her research interests include data mining.
YONG TANG received the B.S. degree from Wuhan University, in 1985, and the Ph.D. degree from the University of Science and Technology of China, in 2001, both in computer science. He is currently a Professor and the Dean of the School of Computer Science at South China Normal University, and also serves as the Director of the Services Computing Engineering Research Center of Guangdong Province. His research focuses on databases and cooperative software, temporal information processing, social networks, and big data analytics. He has completed more than 30 research and development projects, and has authored or coauthored more than 100 publications in these areas.
CHENGZHOU FU received the master's degree in software engineering, in 2012, and the Ph.D. degree in computer science, in 2017, both from South China Normal University, China. His current research interests include social networks and data mining.