Identification of Key Nodes in Complex Networks Based on Network Representation Learning

Recently, some research has utilized machine learning methods to identify critical nodes in complex networks. However, existing approaches often lack a comprehensive consideration of network structural features during node feature extraction. Benefiting from the powerful feature extraction capability of network representation learning methods, a simple and effective algorithm for identifying key nodes in complex networks, termed Network Representation Learning and Key Node Identification (NRL_KNI), is proposed. The NRL_KNI algorithm utilizes network embedding techniques for learning node feature representations, followed by clustering and the utilization of quota-based limited sampling to obtain sampled nodes. Subsequently, these sampled nodes are employed to train a regression model for predicting the diffusion capability of unsampled nodes. To rank node influences, a Local Structure Influence Score (LSIS) based on the local structure is introduced to evaluate nodes’ final impact. Experimental results on eight real-world datasets demonstrate that the NRL_KNI algorithm generally outperforms traditional centrality methods and network representation learning-based methods in terms of the Jaccard similarity coefficient and Kendall’s Tau correlation coefficient evaluation metrics.


I. INTRODUCTION
With the rapid growth of social media platforms, the widespread dissemination of information [1] has spawned many applications based on social networks. For example, the influence maximization problem aims to identify the most influential key nodes in a social network [2], so that under a specific propagation model (such as the Susceptible-Infectious-Recovered model) these key nodes can influence as many other nodes as possible. Identifying critical nodes in complex networks is also crucial in other scenarios, such as viral marketing, personalized recommendation, and information dissemination. Therefore, how to effectively identify key nodes in a complex network has become a research hotspot in different fields. (The associate editor coordinating the review of this manuscript and approving it for publication was Dominik Strzalka.)
Traditional methods for identifying key nodes typically rely on node centrality indices. For instance, degree centrality [3] assesses a node's significance by the count of its first-order neighbors; betweenness centrality [4] quantifies importance by measuring how frequently a node serves as an intermediary on shortest paths; PageRank centrality [5] gauges a node's significance by considering both its connections to other nodes and those nodes' own importance scores; the K-shell coefficient [6], on the other hand, is a network-centric method for globally classifying the importance of nodes based on structural characteristics; and Berahmand et al. [7] introduced a semi-local, parameter-free centrality measure for identifying the most influential key nodes. While these methods are straightforward and extensively employed, their drawback is that they exploit only part of the network's intricate structural traits.
In the pursuit of more effectively identifying the most influential key nodes, machine learning-based algorithms for key node identification are emerging as promising tools. These approaches can be categorized as either supervised or unsupervised. Supervised methods generally transform the challenge of identifying key nodes into a regression or classification problem, leveraging the SIR epidemic model to simulate the actual spreading capability of each node and using it as the node's label. For instance, Asgharian Rezaei et al. [8] constructed an eigenvector for each node whose dimension equals the total number of nodes in the network and employed 0.5% of the network's nodes to train a regression model. A drawback of this method arises as the network scale increases: the dimension of the eigenvector becomes excessively large, resulting in considerable time and space consumption at runtime. Similarly, Yu et al. [9] extracted feature representations from the nodes in the network, used Convolutional Neural Networks (CNNs) to train a regression model, and then employed the trained model to predict the influence of nodes. Conversely, Zhao et al. [10] used an entire network to train a classification model and applied the trained model to predict the importance categories of nodes in a separate test network.
The crux of the supervised method lies in obtaining propagation-ability labels for nodes, a process whose cost escalates significantly as the network grows. Hence, unsupervised learning methods have gained popularity among researchers for the task of key node identification. Unsupervised approaches first employ network representation learning, also known as network embedding, to extract feature vectors of nodes; these vectors are then combined with other techniques to identify key nodes within the network. For instance, the DeepIM method [11] utilizes network representation learning to tackle the key node identification challenge. In this method, the CARE algorithm [12] is first employed to acquire a representation vector for each node. These vectors are then used to compute the similarity between pairs of nodes, constructing a matrix of correlations among nodes, and key nodes are selected with a counting-based approach over this correlation matrix. Similarly, some studies [13] have used node embeddings as a foundation, coupled with clustering algorithms that partition network nodes into multiple clusters; additional methods are then employed within these clusters to select key nodes.
The above methods do not effectively combine the feature extraction ability of network representation learning with the generalization ability of machine learning models, resulting in poor adaptability and generalization performance. At the same time, some algorithms construct node feature dimensions that are excessively large, leading to significant time and space consumption at runtime. Therefore, this paper proposes a simple and effective key node identification algorithm, NRL_KNI, based on the network embedding method. The network embedding algorithm can greatly reduce the representation dimension of nodes while preserving as much node structure information as possible. In addition, to account for the structural diversity of training samples, a quota-based approach is employed for sampling the partitioned node clusters, and the regression model is trained on a sample of 5% of the network's nodes, which not only saves model training time but also helps enhance the model's generalization capacity.
Overall, the main contributions of this paper are as follows:
• A simple and effective key node identification algorithm, NRL_KNI, is proposed. It combines network embedding with a regression model, making more effective use of node structure information and improving the model's predictive performance. Only 5% of the nodes in the network are used to train the model, which improves the training efficiency and generalization of the model.
• In assessing the ultimate influence of nodes, this paper introduces a localized structure-based influence score (LSIS). LSIS integrates the propagation capability of first-order neighboring nodes with the respective node's local features, providing an effective evaluation of node influence.
• The proposed NRL_KNI method was evaluated on eight real datasets. Comparative analysis against nine benchmark methods indicated that, in the majority of cases, the NRL_KNI method outperformed the baselines in terms of the Jaccard similarity coefficient. Simultaneously, under the Kendall's Tau correlation coefficient metric, the NRL_KNI algorithm demonstrated performance improvements of up to 10% compared to the second-best benchmark method. Additionally, parameter sensitivity analysis experiments indicated that NRL_KNI exhibits strong robustness.

II. RELATED WORKS
In this section, we primarily focus on the relevant techniques associated with the NRL_KNI model, namely the SIR propagation model for obtaining node labels and the network embedding methods for learning node representation features.

A. SIR PROPAGATION MODEL
In the context of influence maximization, the identification of key nodes in complex networks aims to select a certain number of nodes that maximize the spread range under a specific propagation model. In graph G, the expected number of nodes that node v can influence under a specific propagation model is defined as its propagation capability (influence spread range).
In this study, the SIR epidemic model is employed to assess the propagation capability of nodes [8]–[10], [14], [15]. The propagation process of this model can be succinctly described as follows. Initially, node v is in the infected (activated) state (I), while the rest of the nodes remain susceptible (inactive) (S). Nodes in the infected state (I) activate each of their susceptible (S) neighbors with probability β; simultaneously, infected nodes transition to the recovered state (R) with probability α. The propagation process terminates when no new nodes are activated (infected).
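As an illustration, the discrete-time spreading process described above can be sketched as follows; the neighbour-list graph encoding and the function name are ours, not the paper's:

```python
import random

def sir_spread(adj, seed, beta=0.1, alpha=1.0, rng=None):
    """Run one SIR simulation from a single seed node and return the
    number of nodes that were ever infected (the spread size).
    adj: dict mapping node -> list of neighbours."""
    rng = rng or random.Random(0)
    infected = {seed}    # currently infectious (I)
    recovered = set()    # recovered (R); everything else is susceptible (S)
    while infected:
        new_infected = set()
        for u in infected:
            for v in adj[u]:
                if v not in infected and v not in recovered and v not in new_infected:
                    if rng.random() < beta:     # infect a susceptible neighbour
                        new_infected.add(v)
        # each infectious node recovers with probability alpha
        still_infected = {u for u in infected if rng.random() >= alpha}
        recovered |= infected - still_infected
        infected = still_infected | new_infected
    return len(recovered)

# toy 4-node path graph: 0-1-2-3; with beta = alpha = 1 the whole path is reached
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
size = sir_spread(adj, seed=1, beta=1.0, alpha=1.0)
```

Averaging `sir_spread` over many runs (the paper uses 1000) gives the propagation-capability label of the seed node.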

B. NETWORK REPRESENTATION LEARNING METHODS
Network representation learning, also referred to as network embedding, aims to learn a low-dimensional vector representation v_i ∈ R^d for each node in the graph G(V, E), where d ≪ |V| is the dimensionality of the embedding vector (the full embedding matrix thus lies in R^{|V|×d}).
In recent years, the application of deep learning to learning feature representations of network nodes has gained significant attention. The DeepWalk algorithm [16] was the first to introduce the Skip-gram natural language model and a random walk strategy for learning node feature representations. It utilized a Depth-First Search (DFS) strategy to generate random walk sequences and employed these sequences as contextual information input to the Skip-gram model to learn node representations. Building upon DeepWalk, Node2Vec [17] utilized a biased random walk approach to sample the neighborhoods of nodes. This method incorporated two hyperparameters to control the sampling strategy, using both Depth-First Search (DFS) and Breadth-First Search (BFS) to generate walk sequences. Subsequently, the Skip-gram natural language model [18] was employed to learn node embeddings.
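The walk-generation step shared by DeepWalk-style methods can be sketched as follows; this is a minimal uniform random walk generator (feeding the resulting sequences into a Skip-gram model is omitted), and the names are ours:

```python
import random

def random_walks(adj, walk_len=5, walks_per_node=2, rng=None):
    """Generate uniform random walks, the node 'sentences' that
    DeepWalk feeds to the Skip-gram model.
    adj: dict mapping node -> list of neighbours."""
    rng = rng or random.Random(42)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_len:
                nbrs = adj[walk[-1]]
                if not nbrs:              # dead end: stop this walk early
                    break
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks

# toy 3-node path graph: 0-1-2
adj = {0: [1], 1: [0, 2], 2: [1]}
walks = random_walks(adj)
```

Node2Vec differs only in the transition step, where the choice of the next node is biased by its two hyperparameters rather than uniform.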
However, this first-order-neighborhood-based random walk strategy fails to capture richer contextual information. To capture higher-order context, researchers have integrated community information and role information into network representation learning.
The CARE algorithm [12] integrates community information into network representation learning. It accomplishes this by employing a community detection algorithm to partition network nodes into multiple communities. Then, by setting a hyperparameter, it samples either the community members of the current node or its immediate neighbors. The obtained walk sequences are subsequently fed into a Skip-gram model to obtain node representations. Similarly, algorithms such as MNMF [19], GEMSEC [20], CNRL [21], NRL_RWCE [22], and CRARE [23], among others, also incorporate community information into network representation learning. On the other hand, the role2Vec algorithm [24] incorporates role information into single-layer network representation learning, and Zhang et al. proposed the RMNE algorithm [25], which integrates role information into multi-layer network representation learning. In this paper, the NRL_KNI model employs the SEGK algorithm [26] for learning node representations. In the following sections, we elucidate the details of learning node representations with the SEGK algorithm.

III. METHODOLOGY
According to convention, in this paper an undirected network is defined as G(V, E), where V represents the set of nodes, E ⊆ V × V represents the set of edges, and (v_i, v_j) ∈ E indicates the presence of an edge between node v_i and node v_j.
The NRL_KNI model proposed in this paper consists of three parts. The middle section represents the second part, involving model training on the training dataset and prediction for non-sample nodes. The lowermost section corresponds to the third part, which assesses the final influence of nodes by combining their local structural features with their propagation capabilities; ultimately, an influence ranking list for the nodes is generated. The following sections provide a detailed explanation of the theoretical concepts involved in each of these components.

A. NETWORK EMBEDDING AS FEATURE LEARNING IN NRL_KNI
The structural features of nodes in networks are widely utilized for identifying key nodes in complex networks. Consequently, recognizing structurally equivalent nodes holds paramount significance for downstream network analysis tasks. In this paper, we employ a network embedding algorithm called SEGK [26] to learn low-dimensional vector representations of nodes. This method integrates the structural information of nodes into the representation learning process by comparing nodes' structural information across multiple scales. Specifically, the algorithm defines an R-hop neighborhood for each node, where the R-hop neighborhoods of node v_i are denoted G_i^1, G_i^2, ..., G_i^R, and establishes a kernel between pairs of nodes. The algorithm initializes a symmetric positive semidefinite matrix K ∈ R^{n×n}, whose entries are computed as K_{i,j} = k(v_i, v_j). The kernel matrix K is then factorized as K = SS^T, where S ∈ R^{n×n}, and each row of S can be interpreted as the structural representation of a node in a latent space. However, when the network becomes large, the number of nodes n is substantial, and factorizing K exactly incurs considerable computational cost. The algorithm therefore employs an approximation technique (the Nyström method [27]) to obtain K ≈ SS^T with S ∈ R^{n×d}, where d ≪ n is the dimension of the embedding vectors.
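The Nyström step can be illustrated with a small sketch. The kernel below is a toy inner-product kernel rather than SEGK's graph kernel, and all names are ours; the point is only how a PSD matrix K is approximated as SS^T with S ∈ R^{n×d}:

```python
import numpy as np

def nystrom_embedding(K, d, rng=None):
    """Approximate a PSD kernel matrix K (n x n) as S S^T with S in R^{n x d},
    so each row of S is a low-dimensional structural embedding.
    A simplified sketch of the Nystrom step used inside SEGK."""
    rng = rng or np.random.default_rng(0)
    n = K.shape[0]
    idx = rng.choice(n, size=d, replace=False)   # landmark nodes
    C = K[:, idx]                                # n x d block of kernel columns
    W = K[np.ix_(idx, idx)]                      # d x d landmark submatrix
    # W^{-1/2} via eigendecomposition (clip tiny eigenvalues for stability)
    vals, vecs = np.linalg.eigh(W)
    vals = np.clip(vals, 1e-10, None)
    W_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return C @ W_inv_sqrt                        # S, with K ~ S S^T

# toy rank-3 PSD kernel built from random features
X = np.random.default_rng(1).normal(size=(20, 3))
K = X @ X.T
S = nystrom_embedding(K, d=3)
```

Because the toy K has rank 3 and we use 3 landmarks, the approximation here is essentially exact; on a real kernel with d ≪ n it is a trade-off between cost and accuracy.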

B. OBTAINING SAMPLE NODES AND NODE LABELS AS THE TRAINING SET
For machine learning models, constructing training samples with diverse structures contributes to enhancing the model's generalization ability. Therefore, in this section, building upon the node feature representations, we extract sample nodes with varying structural characteristics through a clustering algorithm and quota sampling to promote the model's generalization capacity.
The NRL_KNI model obtains the node feature (embedding) matrix through the SEGK algorithm. Since the SEGK algorithm maps structurally similar nodes into nearby positions in the embedding space, the K-means algorithm can assign structurally similar nodes to the same cluster. In essence, this algorithm partitions n data points into k (k ≪ n) clusters so that the sum of squared distances between each sample point within a cluster and the cluster's centroid is minimized.
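A plain k-means over the embedding rows might look like the following sketch; this is not the paper's implementation (any standard k-means library would serve), and the blob data stands in for real node embeddings:

```python
import numpy as np

def kmeans(X, k, iters=50, rng=None):
    """Plain k-means on an embedding matrix X (n x d); returns cluster labels."""
    rng = rng or np.random.default_rng(0)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # init from data points
    for _ in range(iters):
        dists = ((X[:, None] - centers[None]) ** 2).sum(-1)  # n x k squared distances
        labels = dists.argmin(1)
        new = np.array([X[labels == j].mean(0) if (labels == j).any() else centers[j]
                        for j in range(k)])
        if np.allclose(new, centers):   # converged
            break
        centers = new
    return labels

# two well-separated blobs of toy "embeddings"
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
labels = kmeans(X, k=2)
```

Nodes mapped close together in the embedding space (here, each blob) end up in the same cluster, which is exactly the property the quota sampling step relies on.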
However, the node clusters obtained through the k-means algorithm do not guarantee an even distribution of nodes across clusters. Therefore, the algorithm employs a quota sampling approach within each cluster. Assuming the total number of nodes in cluster c_i is m_i and the total number of sampled nodes is h, the number of nodes sampled from cluster c_i is (m_i / n) × h. The advantage of this approach lies in avoiding excessive or insufficient sampling across clusters of different sizes, thus accommodating the structural diversity of the sample nodes.
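A minimal sketch of this quota sampling rule, assuming per-cluster quotas are rounded to integers (the paper does not specify the rounding):

```python
import random

def quota_sample(clusters, h, rng=None):
    """Draw h sample nodes proportionally to cluster size:
    a cluster c_i of size m_i contributes round(m_i / n * h) nodes."""
    rng = rng or random.Random(0)
    n = sum(len(c) for c in clusters)
    sample = []
    for c in clusters:
        quota = round(len(c) / n * h)
        sample.extend(rng.sample(c, min(quota, len(c))))
    return sample

# three clusters of sizes 60, 30, 10 -> quotas 6, 3, 1 for h = 10
clusters = [list(range(0, 60)), list(range(60, 90)), list(range(90, 100))]
sample = quota_sample(clusters, h=10)
```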
For the sampled data points, the SIR model is utilized to simulate their propagation capabilities as node labels. Subsequently, the embedding vectors of the sample nodes are combined with their corresponding propagation capabilities to construct the training set.

C. MODEL TRAINING AND PREDICTION FOR NON-SAMPLE NODES
In this paper, we set the proportion of sampled nodes to 5% of the total number of network nodes. Considering that the smallest dataset used in this study contains approximately 1000 nodes, this implies that the number of sampled nodes is around 50. Therefore, selecting a machine learning model that can train efficiently on small-scale datasets is crucial. In other words, we aim to train a model on a small-scale dataset and subsequently apply that model to a large-scale dataset.
In this study, we opt to use Support Vector Regression (SVR) [8] to train the regression model. The SVR model has been extensively employed in the literature [8], [28] for modeling small-scale datasets. The central objective of SVR is to identify an optimal hyperplane within a high-dimensional feature space, minimizing the error between predicted and actual values while keeping this error within a permissible tolerance range. Diverging from conventional regression techniques, SVR is robust to outliers and nonlinear relationships: through the incorporation of kernel functions, SVR can map data into higher-dimensional spaces, thereby accommodating intricate nonlinear patterns.
Given the obtained training dataset, a regression model is trained using SVR on the sampled dataset. For non-sampled nodes, their feature representation vectors are obtained and used as inputs to the SVR model to obtain their predicted values.
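The train-then-predict step can be sketched as below. To keep the example dependency-light, kernel ridge regression with an RBF kernel stands in for SVR (in practice one would use, e.g., scikit-learn's `SVR`), and the embeddings and labels are synthetic:

```python
import numpy as np

def fit_kernel_ridge(X, y, gamma=1.0, lam=1e-3):
    """Fit a kernel regressor on the sampled nodes' embeddings.
    Kernel ridge regression is used here as a lightweight stand-in for SVR."""
    K = np.exp(-gamma * ((X[:, None] - X[None, :]) ** 2).sum(-1))  # RBF kernel
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)           # dual weights
    return X, alpha, gamma

def predict(model, X_new):
    """Predict spread values for (non-sample) node embeddings X_new."""
    X, alpha, gamma = model
    K = np.exp(-gamma * ((X_new[:, None] - X[None, :]) ** 2).sum(-1))
    return K @ alpha

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 8))     # ~5% sampled node embeddings
y_train = X_train.sum(axis=1)          # synthetic stand-in "spread" labels
model = fit_kernel_ridge(X_train, y_train)
pred = predict(model, X_train)
```

The workflow mirrors the paper's: fit on the small quota-sampled set, then call `predict` on the embeddings of all remaining nodes.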

D. COMPUTE THE FINAL INFLUENCE OF NODES
However, the final ranking of network node influence is not obtained by directly sorting the values predicted by the SVR model or simulated by the SIR model. Instead, this paper introduces the Local Structural Influence Score (LSIS), based on node-local structures, which combines the propagation capability of first-order neighboring nodes with their local structural characteristics. The computation of the LSIS score of node v is given in equation (2), where τ(v) denotes the set of first-order neighbors of node v, d(u) denotes the degree of node u, and vitality[u] denotes the propagation capability of node u, which is either predicted by the SVR model or simulated by the SIR model.
In the above equation, a node's importance is measured by incorporating the propagation capabilities of its first-order neighbors. If a node's neighbors exhibit strong propagation capabilities, the node has a greater capacity for outward diffusion through them. Simultaneously, the neighbors are weighted by their degrees, on the assumption that the larger a node's degree, the more significant it is within the network. By scaling each neighbor's contribution to LSIS by its degree, this approach efficiently evaluates the significance of nodes within the network.
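As a sketch, one plausible reading of equation (2) consistent with this description is a degree-weighted sum over first-order neighbours. The exact form should be taken from the paper's equation (2), so treat the formula below as an assumption:

```python
def lsis(adj, vitality):
    """Local Structural Influence Score (sketch): each first-order neighbour u
    of v contributes its predicted/simulated spread vitality[u], weighted by
    its degree d(u). One plausible reading of the paper's equation (2)."""
    score = {}
    for v in adj:
        score[v] = sum(len(adj[u]) * vitality[u] for u in adj[v])
    return score

# toy graph and toy vitality values
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
vitality = {0: 2.0, 1: 2.0, 2: 3.0, 3: 1.0}
scores = lsis(adj, vitality)
ranking = sorted(scores, key=scores.get, reverse=True)  # final influence ranking
```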
In Algorithm 1, the process begins with learning the node embedding matrix through the SEGK algorithm in line 3, and k node clusters are obtained using a clustering algorithm in line 6. Lines 7 to 9 then apply quota sampling within the node clusters to acquire sample nodes. The propagation capabilities of the sample nodes are simulated with the SIR model in line 10, and a regression model is trained on the sample set in line 11. Lines 12 to 14 predict the propagation capabilities of non-sample nodes, and the Local Structural Influence Score (LSIS) of each node in the network is computed in lines 15 to 17. Finally, line 18 outputs a ranked list of nodes based on LSIS.

IV. EXPERIMENTS
To evaluate the performance of different methods, this section first employs the various approaches to generate node rankings. These rankings are then compared with the actual node ranking using two different metrics. In recent years, researchers have proposed novel node influence ranking algorithms [29], [30]. To obtain the true ranking of node influence, this study runs the SIR model 1000 times for each node and takes the average propagation capability as the node's propagation-ability label. Nodes are then ranked by propagation capability in descending order to establish the baseline ranking.

A. DATASETS
The study validates the performance of the proposed methods using networks from various domains, including citation networks (cora, citeseer), collaboration networks (CA-GrQc), social networks (Socfb-Reed98), interaction networks (ia-fbmessage), biological networks (bio-CE-GT, bio-CE-LC), and language networks (wiki) [31]. These datasets can be obtained from the NetworkRepository [32]. Detailed statistics are provided in Table 2, where the symbol β_th denotes the theoretical diffusion threshold of the network, calculated as β_th ≈ ⟨k⟩ / ⟨k²⟩. In this equation, ⟨k⟩ = (1/N) Σ_i d_i represents the network's average degree, ⟨k²⟩ indicates the second-order average degree of the network [33], and d_i denotes the degree of node i. Regarding the infection probability within the SIR model, this study consistently sets β slightly greater than β_th, ensuring that large-scale propagation can be triggered at the corresponding β value [34].
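Computing the theoretical threshold from a graph's degree sequence is straightforward; the helper below follows the formula above (names are ours):

```python
def diffusion_threshold(adj):
    """beta_th ~ <k> / <k^2>: mean degree over second-order mean degree.
    adj: dict mapping node -> list of neighbours."""
    degs = [len(nbrs) for nbrs in adj.values()]
    mean_k = sum(degs) / len(degs)              # <k>
    mean_k2 = sum(d * d for d in degs) / len(degs)  # <k^2>
    return mean_k / mean_k2

# star graph with 4 leaves: degrees [4, 1, 1, 1, 1]
adj = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
beta_th = diffusion_threshold(adj)   # <k> = 1.6, <k^2> = 4.0 -> 0.4
```

The experiments then set the SIR infection probability β slightly above this value for each dataset.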

B. EVALUATION METRICS
In this paper, the Jaccard similarity and Kendall's Tau correlation coefficient are utilized to quantify the similarity and correlation between the ordered lists generated by the algorithms and the ground truth lists.
1) JACCARD SIMILARITY COEFFICIENT [8]
This metric compares the similarity or disparity between two samples, with values ranging from 0 to 1; higher values indicate greater similarity. For two ordered lists PR and TR, the Jaccard similarity coefficient of the top-k elements is defined as JS(k) = |PR(k) ∩ TR(k)| / |PR(k) ∪ TR(k)|, where PR is the node ranking list obtained by a specific algorithm, TR is the true node ranking list, and PR(k) and TR(k) denote their top-k prefixes.
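A minimal top-k Jaccard implementation (names are ours):

```python
def jaccard_topk(pr, tr, k):
    """Jaccard similarity of the top-k prefixes of two ranking lists:
    |intersection| / |union| of the two top-k sets."""
    a, b = set(pr[:k]), set(tr[:k])
    return len(a & b) / len(a | b)

pr = [3, 1, 4, 2, 5]   # ranking produced by an algorithm
tr = [1, 3, 2, 4, 5]   # ground-truth SIR ranking
js = jaccard_topk(pr, tr, 3)   # top-3 sets {3,1,4} vs {1,3,2} -> 2/4 = 0.5
```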
2) KENDALL'S TAU CORRELATION COEFFICIENT [9]
This metric quantifies the degree of concordance between two lists. It is defined as τ(X, Y) = (N⁺ − N⁻) / (n(n − 1)/2), where X and Y denote two distinct lists of length n, and N⁺ and N⁻ represent the counts of concordant and discordant pairs between X and Y. For instance, let the joint ranks obtained from X and Y be (x_1, y_1), (x_2, y_2), ..., (x_n, y_n). If x_i > x_j and y_i > y_j, or x_i < x_j and y_i < y_j, then (x_i, y_i) and (x_j, y_j) form a concordant pair; if x_i > x_j and y_i < y_j, or x_i < x_j and y_i > y_j, the pair is discordant. The value of τ(X, Y) ranges from −1 to 1; a larger value indicates a stronger correlation between the two ranking lists.
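A direct pairwise implementation of this definition (O(n²), ignoring ties, i.e. the tau-a variant; `scipy.stats.kendalltau` offers an optimized alternative):

```python
def kendall_tau(x, y):
    """tau = (N+ - N-) / (n(n-1)/2), counting concordant and
    discordant pairs between two equal-length lists."""
    n = len(x)
    n_plus = n_minus = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                n_plus += 1      # concordant pair
            elif s < 0:
                n_minus += 1     # discordant pair
    return (n_plus - n_minus) / (n * (n - 1) / 2)

tau = kendall_tau([1, 2, 3, 4], [1, 2, 4, 3])   # 5 concordant, 1 discordant
```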

C. BASELINE METHODS
The baseline algorithms in this paper include four centrality-based heuristic algorithms and five network embedding-based counting methods, detailed as follows:
1) DEGREE CENTRALITY (DC) [3]
Degree centrality measures the importance of a node by counting its first-order neighbors: intuitively, a node with more neighboring nodes is more significant in the network. Let d_v represent the degree of node v in a network G with n nodes; the normalized degree centrality of node v can be expressed as DC(v) = d_v / (n − 1).
2) BETWEENNESS CENTRALITY (BC) [4]
Betweenness centrality measures the importance of a node by the frequency at which it appears as an intermediary on shortest paths. The betweenness centrality of node v can be defined as BC(v) = Σ_{f≠v≠h} n^v_{fh} / g_{fh}, where g_{fh} represents the total number of shortest paths between nodes f and h, and n^v_{fh} represents the count of those paths passing through node v. A higher betweenness centrality of node v indicates stronger control and information-transmission capacity within the network.
3) PageRank CENTRALITY (PG) [5]
The core idea of this algorithm is that the importance of a node is not solely determined by the number of nodes it is connected to, but also by the importance of those nodes. In other words, a node's PageRank centrality is influenced by both its degree and the quality of its connections, emphasizing the collaborative impact of these two factors.
4) KATZ CENTRALITY (KC) [35]
This algorithm assesses the centrality of nodes based on the length of the relationship paths between them. Unlike other centrality metrics, Katz centrality takes into account not only a node's direct connections but also multi-hop connections between nodes. The Katz centrality of node v_i satisfies KC(v_i) = α Σ_j A_{ij} KC(v_j) + ξ, where α is the attenuation factor, strictly less than the reciprocal of λ_max, the largest eigenvalue of the adjacency matrix A, and the parameter ξ provides an initial value for all nodes, typically set to 1.
5) DeepIM ALGORITHM [11]
This method introduces graph representation learning techniques for identifying critical nodes in complex networks. Initially, the algorithm employs the CARE algorithm [12] to learn node embedding vectors. It then calculates the cosine similarity between every pair of nodes, constructs a matrix of node correlations, and finally selects the top-K nodes as critical nodes using a counting-based approach. In this paper, nodes are arranged in descending order by their frequency of appearance, yielding an ordered list reflecting the nodes' influence.
Building upon the approach of the DeepIM method, this paper also extends several other methods.

6) SEGK ALGORITHM [26]
This algorithm learns node embedding vectors by approximating and decomposing the kernel matrix encoding the structural similarity between nodes. For the graph kernel used in this algorithm, the Weisfeiler-Lehman (WL) subtree kernel is selected in this paper. After obtaining node embeddings, the selection of critical nodes follows the same approach as the DeepIM algorithm.

7) DeepWalk ALGORITHM [16]
This method starts by sampling nodes with a random walk approach. The resulting node sequences are used as inputs to train node embedding vectors with the Skip-Gram model [18]. The selection of the critical node set aligns with the approach of the DeepIM algorithm.

8) Node2Vec ALGORITHM [17]
This method introduces two hyperparameters during sampling to control the balance between depth-first and breadth-first search when generating random walk sequences of nodes. These sequences are then used as inputs for training node embedding vectors with the Skip-Gram model [18]. Similarly, the selection of critical nodes follows the approach of the DeepIM algorithm.

9) NRL_RWCE ALGORITHM [22]
This algorithm is a node embedding method based on community-aware random walks and incorporates an automatic parameter-determination strategy for the random walks so that the learned node representations preserve community information. The selection of the critical node set aligns with the approach of the DeepIM algorithm.

D. EXPERIMENTAL TOOLS AND PARAMETER SETTINGS
In this paper, all experiments were conducted on a computer running the Windows operating system and executed on the CPU. The software tools used for experimental data processing included Python (version 3.8) and Networkx (version 2.6). In the SIR model, the infection probability β for each dataset is set greater than the threshold β_th, and the recovery probability is consistently set to 0.01. The clustering parameter k is chosen from the set {5, 10, 15, 20}. For all datasets, the sampling ratio r is uniformly set to 0.05.

E. EXPERIMENTAL RESULTS
NRL_KNI achieved the highest Jaccard similarity scores for the top-50 nodes in the CA-GrQc, wiki, and citeseer datasets. Furthermore, as the "Ranks" values increased, the performance of NRL_KNI also improved. In most scenarios, the NRL_KNI method consistently demonstrated the best outcomes. These results suggest that the NRL_KNI algorithm holds an advantage over the baseline algorithms in identifying critical nodes within complex networks, affirming the effectiveness of the proposed approach.
Table 3 presents the results for the Kendall's Tau correlation coefficient metric, where bold type indicates the best-performing result and underlined results denote the second-best outcomes. In the CA-GrQc dataset, the performance of NRL_KNI improved by approximately 10% compared to the second-best Katz method. In the cora dataset, NRL_KNI improved by around 7% compared to the second-best algorithm, and in the citeseer dataset by approximately 10%. Moreover, NRL_KNI exhibited a performance improvement of approximately 1% over the Katz algorithm in the Socfb-Reed98 and bio-CE-GT datasets. Similarly, on the bio-CE-LC dataset, NRL_KNI improved by around 1% over the second-best Betweenness Centrality algorithm. Additionally, Table 3 reveals that, in the majority of datasets, the performance of the network embedding-based counting methods is subpar, whereas certain centrality-based heuristic algorithms perform favorably.

F. PARAMETER SENSITIVITY ANALYSIS
In the clustering process, the number of clusters k can have a significant impact on the clustering outcomes. Therefore, this section performs a parameter sensitivity analysis on six datasets. To evaluate the influence of the number of clusters k, only the value of k is varied while all other parameters are held constant. The results are depicted in Fig. 3. For the citeseer, wiki, Socfb-Reed98, and ia-fb-message datasets, the Jaccard similarity coefficient curve shows no noticeable fluctuation as the parameter k changes. This indicates that the results on these datasets are not highly sensitive to the clustering parameter k, and that the algorithm presented in this paper produces relatively stable outcomes. For the cora and CA-GrQc datasets, the Jaccard similarity coefficient curve exhibits slight fluctuations as k changes, but these variations do not significantly alter the algorithm's results, which show an overall trend towards stability. These results indicate that, in most cases, the parameter k has a minor effect on the Jaccard similarity coefficient values.

Table 4 presents the impact of different cluster numbers k on Kendall's Tau correlation coefficient. Across the cora, Socfb-Reed98, wiki, and ia-fb-message datasets, the values of Kendall's Tau correlation coefficient exhibit minimal fluctuation as k varies, indicating that the results on these datasets are not highly sensitive to the parameter k. In the CA-GrQc dataset, despite slight fluctuations in the correlation coefficient values due to changes in k, the overall performance of the NRL_KNI algorithm remains relatively stable. For the citeseer dataset, k = 10 yields an approximately 3% performance improvement over k = 15 or k = 20; nevertheless, considering the overall trend, the amplitude of this improvement is not substantial, and the proposed algorithm continues to outperform the baselines. The experimental results from Fig. 3 and Table 4 collectively demonstrate that the proposed method is not highly sensitive to the clustering parameter k, further highlighting the robustness of the NRL_KNI algorithm.
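The Kendall's Tau correlation coefficient used throughout this sensitivity analysis compares a predicted influence ranking with the ground-truth ranking. A minimal pure-Python sketch of the coefficient is shown below; it ignores tie handling, which library implementations such as scipy.stats.kendalltau treat more carefully, and the function name is illustrative.

```python
from itertools import combinations

def kendall_tau(score_a, score_b):
    """Kendall's Tau between two rankings of the same items.

    score_a and score_b map each item to an influence score
    (higher = more influential). Returns a value in [-1, 1]:
    1.0 for identical orderings, -1.0 for fully reversed ones.
    """
    items = list(score_a)
    concordant = discordant = 0
    for u, v in combinations(items, 2):
        # A pair is concordant if both rankings order u and v the same way.
        prod = (score_a[u] - score_a[v]) * (score_b[u] - score_b[v])
        if prod > 0:
            concordant += 1
        elif prod < 0:
            discordant += 1
    n = len(items)
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A value near 1 therefore means the predicted ranking nearly reproduces the true ordering of node influences.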
The optimal values of k for different datasets can be read from Fig. 3 and Table 4: the optimal k for the CA-GrQc dataset is 20, while the optimal k for the citeseer dataset is 10. This indicates that different datasets have different optimal values of k. Tables 5 and 6 present the impact of different values of the sampling ratio r and the embedding dimension d, respectively, on Kendall's Tau correlation coefficient. The optimal values for r and d also vary across datasets. For instance, the optimal r for the citeseer dataset is 0.03, while for the cora dataset it is 0.07; the optimal d for the citeseer dataset is 64, whereas for the wiki dataset it is 200. To maintain consistency with the baseline methods, d is uniformly set to 128 in this study.
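The sampling ratio r governs the quota-based limited sampling over node clusters. One way such sampling could be sketched is below; the function name and the per-cluster quota rule (proportional allocation with a minimum of one node per cluster) are illustrative assumptions, not the paper's exact procedure.

```python
import random

def quota_sample(clusters, ratio, seed=0):
    """Sample roughly ratio * |cluster| nodes from each cluster,
    guaranteeing at least one node per non-empty cluster."""
    rng = random.Random(seed)
    sampled = []
    for members in clusters.values():
        quota = max(1, round(ratio * len(members)))
        sampled.extend(rng.sample(members, min(quota, len(members))))
    return sampled
```

Drawing the quota from every cluster keeps the small training set representative of all regions of the embedding space, rather than concentrating samples in the largest cluster.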

V. CONCLUSION
This paper presents a simple and effective key node identification algorithm named NRL_KNI. The algorithm leverages network representation learning techniques to acquire node embeddings, efficiently capturing the structural information of nodes within the network while significantly reducing the dimensionality of node representations. It also introduces the Local Structural Influence Score (LSIS) to evaluate the final impact of nodes, taking full account of the influence of their local structures. The NRL_KNI approach employs a quota-based sampling technique on node clusters, using only 5% of the nodes in the network for model training, thus effectively reducing computation time and enhancing the algorithm's generalization performance. Experimental results on multiple real-world datasets indicate that the NRL_KNI method significantly outperforms the majority of baseline methods in terms of metrics such as the Jaccard similarity coefficient and Kendall's Tau correlation coefficient.
128184 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

However, there is still significant room for improvement for NRL_KNI. One shortcoming of this work is the reliance of the node propagation capability labels on repeated simulations of the SIR epidemic model. Although this approach is widely used in the literature, and this paper attempts to mitigate its cost by reducing the size of the training set (using only 5% of the network nodes), the influence of the simulated propagation capability on NRL_KNI is unavoidable. Exploring a more efficient method to approximate node propagation capability therefore becomes a direction for improving this algorithm. Furthermore, the proposed algorithm can be extended to multi-layer networks, temporal networks, attribute networks, etc. It is also possible to explore alternative methods for constructing node feature vectors, or to employ different machine-learning models for training on the sample set.
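The SIR-based labelling discussed above estimates a node's propagation capability by averaging outbreak sizes over repeated simulations seeded at that node. A minimal sketch is given below; the function name, the infection rate beta, and the recovery rule (every infected node recovers after one step) are illustrative assumptions, since the paper's exact SIR parameters are not restated here.

```python
import random

def sir_capability(adj, seed_node, beta=0.1, runs=200, rng_seed=0):
    """Estimate seed_node's propagation capability as the mean number of
    nodes ever infected across independent SIR runs (adj: node -> neighbors)."""
    rng = random.Random(rng_seed)
    total = 0
    for _ in range(runs):
        infected, recovered = {seed_node}, set()
        while infected:
            newly = set()
            for u in infected:
                for v in adj.get(u, ()):
                    # A susceptible neighbor is infected with probability beta.
                    if v not in infected and v not in recovered and rng.random() < beta:
                        newly.add(v)
            recovered |= infected   # infected nodes recover after one step
            infected = newly - recovered
        total += len(recovered)
    return total / runs
```

Because each label requires many Monte Carlo runs per node, restricting the training set to 5% of the nodes, as the paper does, keeps the labelling cost manageable.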

FIGURE 1. The overall framework of the NRL_KNI model.

The model can be divided into three main components. The uppermost section in Fig. 1 represents the first part of the model, which is primarily responsible for acquiring feature representations of nodes along with their corresponding propagation capability labels to form the training dataset.
3), Matplotlib (version: 3.7.1), NumPy (version: 1.16.6), and Pandas (version: 1.4.3). Each experiment involved training a support vector regression model to predict the diffusion capability of non-sampled nodes. The model was configured with the following parameters: the regularization parameter C was set to 100, and the kernel function selected was the Radial Basis Function. Furthermore, for the network embedding-based methods, the node embedding dimension d is uniformly set to 128, while the remaining parameters use the default values from the original literature. When constructing the correlation matrix, only the top 10 most relevant nodes are considered for each node.
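The regression setup described above (support vector regression with an RBF kernel and C = 100) can be sketched as follows. The training data here is a random toy stand-in for the sampled node embeddings and their simulated propagation-capability labels, purely for illustration.

```python
import numpy as np
from sklearn.svm import SVR

# Toy stand-ins: embeddings of sampled nodes and their simulated SIR labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 128))   # d = 128 node embeddings
y_train = rng.uniform(size=50)         # propagation-capability labels

# Configuration stated in the text: RBF kernel, regularization C = 100.
model = SVR(kernel="rbf", C=100)
model.fit(X_train, y_train)

# Predict the diffusion capability of (toy) non-sampled nodes.
X_rest = rng.normal(size=(10, 128))
preds = model.predict(X_rest)
```

The trained model then scores every non-sampled node from its embedding alone, avoiding SIR simulation for the remaining 95% of the network.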

FIGURE 3. The influence of different cluster numbers k on the Jaccard similarity coefficient.
E. EXPERIMENTAL RESULTS ANALYSIS
Across the different datasets, the performance of the various methods was compared based on the ranking of the top-k nodes, where k ranged from 10 to 290, with the corresponding ''Ranks'' values shown on the graph. Fig. 2 illustrates the variation curves of the Jaccard similarity coefficient values across six datasets. As depicted in Fig. 2, the proposed NRL_KNI method attained the highest Jaccard similarity scores for the top-30 nodes in the wiki, ia-fb-message, and citeseer datasets.
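The top-k Jaccard similarity used in this comparison measures the overlap between the top-k nodes of a predicted ranking and of the ground-truth ranking. A minimal sketch (the function name is illustrative):

```python
def topk_jaccard(ranking_a, ranking_b, k):
    """Jaccard similarity between the top-k node sets of two rankings
    (each ranking is a list of node ids in descending influence order)."""
    a, b = set(ranking_a[:k]), set(ranking_b[:k])
    return len(a & b) / len(a | b)
```

A value of 1 means the two methods select exactly the same top-k key nodes, regardless of their internal order.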

TABLE 1. The symbols used in this article and their meanings.

TABLE 2. Detailed statistics for the real-world datasets.

TABLE 3. The Kendall's Tau correlation coefficient between the rankings obtained from different algorithms and the true rankings.

TABLE 4. The influence of different cluster numbers k on Kendall's Tau correlation coefficient.

TABLE 5. The influence of different sampling ratios r on Kendall's Tau correlation coefficient.

TABLE 6. The influence of different embedding dimensions d on Kendall's Tau correlation coefficient.