Benefits of Bias in Crawl-Based Network Sampling for Identifying Key Node Set

We study the problem of identifying a set of key nodes from a network when limited knowledge about its structure is available. Most studies assume complete knowledge of the given network when identifying a set of key nodes, but in current practice, networks of interest are often too large for their entire topological structures to be obtained. When the complete structure of a network is not available, network sampling strategies are often used to obtain a partial structure of the network. We investigate how network sampling strategies affect the problem of identifying a key node set. Specifically, we investigate the effect of conventional network sampling strategies on the solutions found for two types of key node set identification problems: the minimum $p$-median problem and the influence maximization problem. Our results show that when the network is obtained using crawl-based network sampling strategies, both the minimum $p$-median and the influence maximization problems are effectively solved by simple heuristic algorithms with sampling ratios in the 10–20% range. We also find that among the three conventional sampling strategies examined in this paper (random sampling, random walk sampling, and sample edge counts), random walk sampling is the most robust strategy in terms of effectively identifying the key node sets of diverse types of networks.


I. INTRODUCTION
Identifying a set of key nodes in a given network is a fundamental problem in network science, and it has broad applications [1]-[6]. Examples of the key node set identification problem include classical problems in graph theory such as the minimum p-median (MM) and minimum p-center problems [3], [7]. The MM problem is motivated by applications to city planning, and is useful for determining facility locations [3]. The influence maximization (IM) problem is another popular key node set identification problem, which is expected to be useful for so-called "viral" marketing in social networks [4], [8]-[15]. IM aims to identify a small set of influential nodes (called seed nodes) for which the expected size of the influence cascade triggered by the seed nodes is maximized [8]. Note that the problem of identifying the k most important nodes in a network using centrality or other metrics [16]-[19] is also considered to be a key node set identification problem.
The associate editor coordinating the review of this manuscript and approving it for publication was Ghufran Ahmed.
One of the main research challenges in identifying key node sets has been to develop computationally efficient algorithms [4], [7], [20]. Because key node set identification problems are typically NP-hard [7], [8], naive algorithms for the problems are infeasible for use with large-scale networks. Many researchers have proposed efficient algorithms for key node set identification. Some algorithms offer theoretical guarantees on the quality of the solutions, and others are heuristic algorithms without theoretical guarantees [4], [7]. Thanks to the efforts of these researchers, scalable key node set identification algorithms are widely available [4], [7], [9], [13], [14].
However, some issues remain open in key node set identification. Notably, most existing studies assume complete knowledge is available for a given network whose key nodes are to be identified, but modern networks of interest are often too large for their entire topological structures to be efficiently known [21]-[29]. For instance, social networks, which represent relationships among social media users, are often very large, and access to network data is typically limited, so only a part of the network structure can be known [21], [24], [30]. It is also fundamentally difficult to obtain the current complete structure of the Internet due to both its scale and its distributed, heterogeneous nature [31], [32].
When the complete structure of a network is not available, network sampling is often used to obtain a partial structure of the network [22], [23], [26]-[30], [33]. If arbitrary access to any node is allowed, random sampling seems to be a natural choice because of its simplicity and neutrality. However, in many real applications, random access to nodes is not allowed [30], [33], so crawl-based sampling techniques have been widely used for analyzing the structure of several types of large-scale networks, such as online social networks [23], the world wide web [34], and peer-to-peer (P2P) networks [35]. When using crawl-based network sampling, it is assumed that only one node can be visited initially, but the neighbors of already visited nodes can be visited at each step. Popular crawl-based network sampling strategies include random walk (RW) sampling [36], breadth-first search (BFS), depth-first search (DFS), and sample edge counts (SEC) [22]. It is known that crawl-based sampling strategies have a bias toward high-degree nodes; that is, when using crawl-based network sampling strategies, the probability of visiting high-degree nodes is much higher than that of visiting low-degree nodes [22]. This bias in crawl-based network sampling is generally not desirable when estimating network characteristics, and therefore, considerable effort has been devoted to eliminating the bias in crawl-based sampling strategies [23], [29], [37]. In contrast with the problem of general characterization of network topology, the bias in crawl-based sampling might be beneficial when finding key nodes in a network. A pioneering work by Maiya and Berger-Wolf [22] suggested the benefit of the bias of crawl-based sampling strategies in identifying high-degree nodes. Their findings suggest the hypothesis that the bias in crawl-based sampling is also beneficial for identifying the set of key nodes.
This paper revisits the benefit of the biases in crawl-based network sampling strategies and examines how crawl-based network sampling applied to a given network affects identification of the key node set in the network. It is naturally expected that when the sample size is small, identifying a key node set from such a partial network will be quite difficult or even impossible. However, because of the bias in crawl-based sampling, we expect that the key node set can be identified even from the limited knowledge obtained with crawl-based sampling. We address the following research questions in particular. (1) How do network sampling strategies affect the effectiveness of key node set identification algorithms? (2) How large a sample do we need in order to obtain a reasonable solution for key node set identification problems? To answer these questions, we formulate two types of key node set identification problems, assuming limited knowledge about the network structures. These two problems are variants of popular key node set identification problems: the MM and the IM problems. We apply simple heuristic algorithms to the problems, and examine the effects of network sampling on both the MM and IM problems.
Our main contributions are summarized as follows.
• We show that the biases in crawl-based sampling strategies are beneficial in both the MM and IM problems. Although the definitions of the key nodes in the MM and IM problems are notably different, similar simple heuristic algorithms can achieve reasonable solutions to both problems when the partial structure of the network is obtained via crawl-based sampling.
• We demonstrate that crawl-based sampling strategies require only 10-20% sample sizes to obtain reasonable solutions of both the IM and MM problems. When a 10-20% sample size is available, heuristic algorithms can find a key node set that is comparable to the key node set obtained from examining the complete network.
• We reveal that a moderate level of bias in crawl-based sampling strategies is central to identifying the key node set effectively and robustly in an unknown network; that is, weak bias in the sampling strategies degrades the quality of solutions, and strong bias deteriorates the stability of the solutions.
This paper is organized as follows. In Section II, we provide definitions and give the formulations of the problems studied in this paper. In Section III, we explain the research methodology. In Sections IV and V, we present the results for the MM problem and the IM problem, respectively. In Section VI, we discuss the implications and future directions of this work. Finally, in Section VII, we conclude this paper.

A. DEFINITIONS
This paper considers the class of problems that require finding a set of key nodes in a ground truth network G = (V, E) using only an incomplete subnetwork G' = (V', E'). The incomplete network G' is obtained by applying network sampling strategies to the ground truth network G. The graph G can be either directed or undirected, but for simplicity, in what follows, G is assumed to be undirected. Note that we consider network sampling and key node set identification independently. Although the problem of jointly optimizing network sampling and key node set identification can be formulated, studying such a problem is beyond the scope of this paper.
A network sampling strategy probes nodes to obtain a node set S ⊆ V (|S| = M), where M is called the sample size. Probing node v reveals the nodes adjacent to v. Let T be the set of nodes adjacent to nodes in S. Then, the set of nodes in the incomplete network G' is V' = S ∪ T. The set of links in G' is E' ⊆ E, the links revealed by probing the nodes in S.

B. MINIMUM p-MEDIAN PROBLEM FOR INCOMPLETE NETWORKS
Before introducing the MM problem for incomplete networks, we explain the original MM problem. Given an undirected unweighted network G = (V, E), each node v ∈ V has a demand w(v). Let d(u, v, G) be the shortest path length between nodes u and v in the network G, and let D(v, X, G) = min[d(u, v, G) : u ∈ X]. Then, the MM problem is defined as follows.
Problem 1 Minimum p-Median Problem [3]: Given a network G and w(v) for each node v ∈ V, find a set of p nodes X (|X| = p, X ⊆ V) such that the objective function (i.e., total cost) $f(X) = \sum_{v \in V} w(v) D(v, X, G)$ is minimized.
We now define the minimum p-median (MM) problem for incomplete networks. To the best of our knowledge, the MM problem under limited knowledge about the network is a novel problem that has not been studied before. In the original MM problem, the network G and the demand for all nodes are available. In contrast, in the MM problem for incomplete networks, only the network G' is available for finding the median node set X that minimizes the total cost f(X). The MM problem for incomplete networks is defined as follows.
Problem 2 Minimum p-Median Problem for Incomplete Networks: Given an incomplete network G' and w(v) for each node v ∈ V', find a set of p nodes X (|X| = p, X ⊆ V') such that the total cost f(X) is minimized.
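As an illustration of the total cost, the following minimal Python sketch evaluates the demand-weighted sum of D(v, X, G) over all nodes with a multi-source breadth-first search. The adjacency-dictionary representation and the function names `distances_to_set` and `total_cost` are assumptions made for this example, as is the choice to return infinity when some node cannot reach any median.

```python
from collections import deque


def distances_to_set(adj, sources):
    """Multi-source BFS: shortest hop distance from each reachable node to the set."""
    dist = {s: 0 for s in sources}
    queue = deque(dist)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def total_cost(adj, demand, medians):
    """Sum of w(v) * D(v, X, G) over all nodes v of the network."""
    dist = distances_to_set(adj, medians)
    if len(dist) < len(adj):
        # Assumption: a node that cannot reach any median makes the cost infinite.
        return float("inf")
    return sum(demand[v] * dist[v] for v in adj)


# Usage on a five-node path 0-1-2-3-4 with unit demand.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
unit = {v: 1 for v in path}
print(total_cost(path, unit, [2]))
```

Because all distances are computed in a single search, one evaluation costs time linear in the number of nodes and links.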

C. INFLUENCE MAXIMIZATION PROBLEM FOR INCOMPLETE NETWORKS
The IM problem for incomplete networks is formulated analogously to the MM problem for incomplete networks. We first explain the original IM problem. The influence maximization problem is a combinatorial optimization problem on a graph that aims to identify a small set of influential nodes (known as seed nodes) such that the expected size of the influence cascade triggered by the seed nodes is maximized. While the IM problem has been studied under several types of influence cascade, the independent cascade (IC) model [8] is the most popular. This paper focuses on the IM problem using the IC model, although our problem formulation can be extended to the IM problem with other types of influence cascade. In the IC model, each node is either active or inactive. When node u becomes active at time step t, node u will influence each inactive neighbor node v ((u, v) ∈ E) with probability $p_{u,v}$ at the next time step t + 1. Namely, node v becomes active with probability $p_{u,v}$. The parameter $p_{u,v}$ of the IC model is the probability of spreading influence from node u to node v. Note that each node has a single chance to influence each of its neighbors. At time step 0, the nodes selected as seed nodes (U ⊆ V) become active, and the other nodes are set as inactive. Then, the stochastic process explained above is repeated until it ends (i.e., no nodes are newly activated in the time step). Let W be a set of link weights representing the probability of influence spread, U ⊆ V be a subset of nodes in graph G, and σ(U, G, W) be the expected number of active nodes at the end of the process of the IC model on network G with probabilities W when U is the set of seed nodes. The IM problem is then defined as follows [8].
Problem 3 Influence Maximization Problem: Given a social network G, influence spread probabilities W, and an integer k, the aim is to find a set of seed nodes U (U ⊆ V, |U| = k) that maximizes σ(U, G, W).
In contrast with the original IM problem, in the IM problem for incomplete networks, the ground truth network G is not available for use in finding a set of seed nodes. Only a subnetwork G' = (V', E') of G is available. The IM problem for incomplete networks is then defined as follows.

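Problem 4 Influence Maximization Problem for Incomplete Networks: Given an incomplete network G', influence spread probabilities for the links in E', and an integer k, the aim is to find a set of seed nodes U (U ⊆ V', |U| = k) that maximizes σ(U, G, W).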
The IM problem under limited knowledge about the network was first proposed in our previous conference papers [24], [38] and has also been studied by other research groups [39], [40]. In this paper, through extensive experiments, we comprehensively investigate the effects of network sampling on both the IM problem and the MM problem.
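The IC process described in this section can be sketched as a Monte Carlo estimate of σ(U, G, W). This is a minimal illustration rather than the implementation used in our experiments: the adjacency dictionary `adj`, the dictionary `prob` keyed by ordered node pairs, the function names, and the default number of runs are all assumptions made for the example.

```python
import random


def simulate_ic(adj, prob, seeds, rng):
    """Run one independent cascade and return the final number of active nodes."""
    active = set(seeds)
    frontier = list(active)
    while frontier:
        newly_active = []
        for u in frontier:
            for v in adj[u]:
                # u gets a single chance to activate each inactive neighbor v.
                if v not in active and rng.random() < prob[(u, v)]:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active
    return len(active)


def estimate_spread(adj, prob, seeds, runs=1000, seed=0):
    """Average cascade size over independent runs (an estimate of sigma)."""
    rng = random.Random(seed)
    return sum(simulate_ic(adj, prob, seeds, rng) for _ in range(runs)) / runs


# Usage on a three-node path with probability 0.5 on every link.
adj = {0: [1], 1: [0, 2], 2: [1]}
prob = {(u, v): 0.5 for u in adj for v in adj[u]}
print(estimate_spread(adj, prob, [0]))
```

Only nodes activated in the previous step are kept in the frontier, which enforces the rule that each node tries to influence each neighbor exactly once.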

A. GENERATING INCOMPLETE NETWORKS
We generate an incomplete network G' from a given ground truth network G using the following three network sampling strategies.
1) SAMPLE EDGE COUNT (SEC) [22]
SEC aims to obtain high-degree nodes without global knowledge of the network. Let S be the set of chosen nodes. Initially, S contains a randomly selected node. SEC then repeatedly adds the known but unselected node with the most links from the nodes in S, that is, the node with the highest expected degree. SEC is intended to have a strong bias toward high-degree nodes.
2) RANDOM WALK (RW) [36]
Initially, RW obtains and visits a randomly selected node. Then, RW repeatedly obtains and visits a randomly selected unvisited neighbor of the most recently visited node until a specified number of nodes is obtained. If a visited node has no unvisited neighbor, then a randomly selected unvisited neighbor of some other visited node is obtained. RW does not intentionally visit high-degree nodes, but visited nodes still have a higher degree on average than the nodes that would be obtained from unbiased random sampling.
3) RANDOM SAMPLING
Random sampling repeatedly obtains a node uniformly at random from all nodes in a network until a specified number of nodes is obtained.
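The three strategies can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in our experiments: the network is an adjacency dictionary, all function names are hypothetical, ties in SEC are broken by insertion order, and `incomplete_network` assumes that the known links are exactly those with at least one probed endpoint.

```python
import random


def random_sampling(adj, m, rng):
    """Pick m distinct nodes uniformly at random."""
    return set(rng.sample(list(adj), m))


def rw_sampling(adj, m, rng):
    """Walk to unvisited neighbors; on a dead end, restart from any visited node."""
    current = rng.choice(list(adj))
    visited = {current}
    while len(visited) < m:
        candidates = [v for v in adj[current] if v not in visited]
        if not candidates:
            candidates = [v for u in visited for v in adj[u] if v not in visited]
            if not candidates:
                break  # the reachable component is exhausted
        current = rng.choice(candidates)
        visited.add(current)
    return visited


def sec_sampling(adj, m, rng):
    """Greedily add the known node with the most links from chosen nodes."""
    start = rng.choice(list(adj))
    chosen = {start}
    links = {}  # known but unchosen node -> number of links from chosen nodes
    for v in adj[start]:
        links[v] = links.get(v, 0) + 1
    while len(chosen) < m and links:
        best = max(links, key=links.get)
        del links[best]
        chosen.add(best)
        for v in adj[best]:
            if v not in chosen:
                links[v] = links.get(v, 0) + 1
    return chosen


def incomplete_network(adj, sampled):
    """Build the sampled network: probed nodes, their neighbors, and the links between them."""
    sub = {u: set() for u in sampled}
    for u in sampled:
        for v in adj[u]:
            sub[u].add(v)
            sub.setdefault(v, set()).add(u)
    return sub


# Usage: probe three nodes with RW, then build the incomplete network.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0, 4], 4: [3]}
probed = rw_sampling(adj, 3, random.Random(0))
print(sorted(incomplete_network(adj, probed)))
```

Note that the two crawl-based strategies always return a connected set of probed nodes, whereas random sampling may return nodes scattered across the network.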

B. ALGORITHMS FOR FINDING THE KEY NODE SET
We use simple heuristic algorithms for both the MM and IM problems. We apply existing MM and IM algorithms for complete networks to an incomplete network G'. Then, we investigate how the network sampling strategies affect the effectiveness of the existing key node set identification algorithms. This paper focuses on the effectiveness of the algorithms and does not experimentally investigate their computational cost. Note, however, that key node sets can be obtained from incomplete networks at lower computational cost than from complete networks, because the computational cost of key node set identification algorithms depends on the size of the network.
For the MM problem, we use a greedy algorithm for the original MM problem [41]. Since the MM problem is NP-hard, the greedy algorithm is often used for solving it [41]. In the original greedy algorithm, the objective function f(X) is calculated using the ground truth network G and the demand for all nodes, but in the MM problem for incomplete networks, this information is not available. Therefore, in the MM algorithm for an incomplete subnetwork, the following objective function g(X) is used instead of f(X): $g(X) = \sum_{v \in V'} w(v) D(v, X, G')$. This g(X) can be calculated from the incomplete network G' and w(v) (v ∈ V'). Similar to the original greedy algorithm, the heuristic algorithm iteratively adds to the median node set X a node u such that g(X ∪ {u}) is minimized. Pseudocode for the algorithm is shown in Algorithm 1.
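The greedy heuristic can be sketched in Python as follows, assuming that g(X) is the demand-weighted sum of distances to the nearest median, taken over the nodes of the incomplete network only. This is an illustrative sketch, not a transcription of Algorithm 1: the function names are hypothetical, p is assumed not to exceed the number of known nodes, and the penalty charged to nodes that cannot reach any median inside the incomplete network is an assumption made so that disconnected samples (as random sampling can produce) still yield a finite cost.

```python
from collections import deque


def cost_on_sample(sub_adj, demand, medians):
    """g(X): demand-weighted distance to the nearest median within the sample."""
    dist = {x: 0 for x in medians}
    queue = deque(dist)
    while queue:
        u = queue.popleft()
        for v in sub_adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    # Assumption: unreachable nodes are charged the number of known nodes,
    # which exceeds any finite shortest path length in the sample.
    penalty = len(sub_adj)
    return sum(demand[v] * dist.get(v, penalty) for v in sub_adj)


def greedy_medians(sub_adj, demand, p):
    """Add, p times, the known node that minimizes g(X plus that node)."""
    medians = []
    for _ in range(p):
        candidates = [u for u in sub_adj if u not in medians]
        best = min(
            candidates,
            key=lambda u: cost_on_sample(sub_adj, demand, medians + [u]),
        )
        medians.append(best)
    return medians


# Usage on a five-node path with unit demand: the middle node is chosen first.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
unit = {v: 1 for v in path}
print(greedy_medians(path, unit, 1))
```

Each greedy step evaluates g once per candidate, so the sketch costs on the order of p times |V'| breadth-first searches; this is affordable precisely because the incomplete network is much smaller than the ground truth network.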
For the IM problem, we use TIM+ as an efficient approximation algorithm [10]. TIM+ is a state-of-the-art algorithm that combines low computational cost with high effectiveness. We apply TIM+ to the incomplete network G'.

IV. RESULTS OF MINIMUM p-MEDIAN PROBLEM
A. DATASET AND PRELIMINARIES
As the ground truth networks G, we use (1) a network of autonomous systems (AS) [42], (2) a P2P network of Gnutella (P2P) [43], and (3) a network of the US power grid (PowerGrid) [44]. Characteristics of each network are shown in Table 1, and the degree distributions for each network are shown in Fig. 1.
We randomly generated the demand of the nodes using a Zipf distribution, a normal distribution, and an exponential distribution. The results in [41] show that the degree of a node and the demand of the node are correlated, and therefore we use the following procedure to generate the demand of each node. We first generate |V| random variables according to the given distribution. We then assign the i-th largest variable as the demand of the node with the i-th highest degree. We next swap the demand of each node with the demand of some other randomly selected node with probability 1 − q (0 ≤ q ≤ 1). The parameter q controls the strength of the correlation between node demand and node degree. We used γ = 2 as the parameter of the Zipf distribution, mean µ = 1 and standard deviation σ = 0.1 for the normal distribution, and mean λ = 1 for the exponential distribution.
FIGURE 2. Total cost vs. number of median nodes (sample size: 10%; distribution of the demand: Zipf; parameter determining the correlation between degree of a node and its demand: q = 0.9): Total cost when using incomplete networks obtained with RW is comparable with the cost when using the complete network.
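The demand-generation procedure can be sketched in Python as follows. The function name `generate_demands` and the `draw` callback are assumptions made for this example; the callback takes a random generator and returns one sample, so any of the three distributions can be plugged in. The standard library provides `expovariate` and `gauss`, whereas a Zipf sampler would have to be supplied separately.

```python
import random


def generate_demands(degree, q, draw, rng):
    """Assign rank-matched demands, then swap each with probability 1 - q."""
    # Nodes from highest to lowest degree, values from largest to smallest.
    nodes = sorted(degree, key=degree.get, reverse=True)
    values = sorted((draw(rng) for _ in nodes), reverse=True)
    demand = dict(zip(nodes, values))
    for v in nodes:
        if len(nodes) > 1 and rng.random() < 1 - q:
            other = rng.choice([u for u in nodes if u != v])
            demand[v], demand[other] = demand[other], demand[v]
    return demand


# Usage: exponential demand with mean 1 and correlation parameter q = 0.9.
degree = {"a": 3, "b": 2, "c": 1}
demand = generate_demands(degree, 0.9, lambda r: r.expovariate(1.0), random.Random(1))
print(demand)
```

With q = 1 no swap ever occurs, so demand is perfectly rank-correlated with degree; with q = 0 every node is swapped once with another node, which largely destroys the correlation.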
We apply the algorithm introduced in Section III to the incomplete network G' obtained by SEC, RW, and random sampling, and obtain the set of median nodes X for each. To obtain the median node set X, we assume that the demand w(v) is available for each node v ∈ V'. We then calculate the total cost f(X) for the median node set X while changing the sample size, sampling strategy, and distribution of node demand. We generated the demands and obtained sample subnetworks for each parameter setting 20 times. The results shown from here are averaged over the 20 simulation runs for each configuration.

B. RESULTS
We first fixed the sample size as 10% of nodes (M = 0.1|V|), and investigated the total cost while changing the number of median nodes (Fig. 2). The distribution of node demand is the Zipf distribution. As the parameter of correlation between degree and demand, we used q = 0.9. The results when using the complete network and the demand of all nodes are included in the figures (denoted as greedy). These results show that the total cost when selecting median nodes from incomplete networks is comparable with the cost when using the complete network. When using the incomplete network obtained via SEC, the increase in cost relative to the cost for the complete network is only 10% for AS and 15% for P2P. In contrast, for PowerGrid, the cost when using the incomplete network obtained via SEC is significantly higher than the cost when using other sampling strategies. As shown in Fig. 1, there are no strong hubs that have significantly high degree in PowerGrid, and therefore the benefit of finding high-degree nodes is smaller in PowerGrid than in AS and P2P. In the MM problem, selecting median nodes that are far from each other is generally desirable to achieve lower cost, but SEC typically traverses the network only near the starting node. This drawback of SEC also detrimentally affects the cost when applying SEC to PowerGrid for subnetwork selection.
We next investigate the relation between the sample size and the total cost. Fig. 3 shows the normalized cost when selecting 50 median nodes from the incomplete networks, compared against the sample size (characterized as the fraction of sampled nodes). The normalized cost is defined as the cost when selecting 50 median nodes from the incomplete networks divided by the cost when selecting 50 median nodes from the complete networks. These results show that when the incomplete network is obtained via RW, a 20% sample size achieves a normalized cost of 1.1-1.3 for all three networks. For AS and P2P, a 10% sample size is large enough to achieve a normalized cost of 1.2. This result suggests that when the subnetwork is obtained via RW, reasonable solutions for the MM problem can be obtained from only limited observations of the networks. The cost when using SEC is lower than that when using RW for the AS and P2P networks, but it is much higher than the cost when using RW for PowerGrid. These results suggest that when we crawl completely unknown networks to determine the median nodes, using RW and collecting 10-20% of samples is a good approach.
VOLUME 8, 2020
We next investigate the effects of node demand on the total cost. Fig. 4 shows the normalized cost for each demand distribution. For comparison purposes, the results when the demands of all nodes are fixed to 1 (denoted as Fixed) are also shown. Here, we use SEC for the AS and P2P networks, and RW for the PowerGrid network. The number of median nodes is fixed to 50. The results show that the total cost when the node demand follows a Zipf distribution, which has a heavy-tailed distribution, is higher than the cost when the node demand follows other distributions. Moreover, we also change the parameter q that controls the correlation between node degree and the node demand (Fig. 5). Here, the fraction of sampled nodes is 0.1, and the number of median nodes is 50. From these results, when the node demand follows the Zipf distribution, the total cost increases as the correlation between degree and demand of node decreases. When the correlation between node degree and node demand is low, low-degree nodes tend to have higher demand than when the correlation is strong. Low-degree nodes are more difficult to discover by sampling strategies than high-degree nodes are. As a consequence, high-demand and low-degree nodes are likely to be unknown when selecting the median nodes in the low-correlation scenario. This is why the correlation affects the total cost. We can also find that there is little difference between exponential and normal distributions. This could be explained by the fact that both distributions have an exponential tail, which implies that there are no extremely high-demand nodes. From this observation, the MM problem for incomplete networks requires a larger sample when the correlation between degree and demand is very low and the demand distribution is heavy-tailed.
We now tackle the problem of finding median nodes when the demand of all nodes is available but the topology is only partially known. To do this, we examine the benefit of knowledge about the network topology for the MM problem. We use h(X), defined as follows, as the objective function of the greedy algorithm instead of g(X): $h(X) = \sum_{v \in V} w(v) D(v, X, G')$. To calculate h(X) for node v ∉ V', we let D(v, X, G') = d_max + 1, where d_max is the diameter of G'. Fig. 6 shows the normalized total cost when the node demand follows the Zipf distribution, with q = 0 for the situation where the demand of all nodes is available. For comparison purposes, the results when selecting the 50 nodes having the highest demand as the median nodes are included in the figures (denoted as high demand). These results show that if node demand is available, a good solution can be obtained even when there is no correlation between degree and demand. Comparing the results of high demand with the results when using incomplete networks shows that network topology is useful for achieving a lower total cost.
FIGURE 6. Total cost when the demand of all nodes is available (distribution of the demand: Zipf; the parameter determining the correlation between degree of a node and its demand: q = 0; the number of median nodes: p = 50): Using incomplete networks achieves a lower total cost than the high demand heuristic.
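The objective h(X) can be sketched in Python as follows, assuming that it sums w(v) D(v, X, G') over all nodes of the ground truth network and charges d_max + 1 to every node outside the incomplete network. The function names are hypothetical, and two further choices are assumptions of this example: known nodes that cannot reach any median are also charged d_max + 1, and the diameter is taken as the longest finite shortest path length in the incomplete network.

```python
from collections import deque


def bfs(adj, sources):
    """Shortest hop distances from a set of source nodes."""
    dist = {s: 0 for s in sources}
    queue = deque(dist)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter(adj):
    """Longest finite shortest path length (one BFS per node)."""
    return max(max(bfs(adj, [s]).values()) for s in adj)


def cost_with_all_demands(sub_adj, demand, medians):
    """h(X): like g(X), but summed over every node that has a known demand."""
    dist = bfs(sub_adj, medians)
    fallback = diameter(sub_adj) + 1
    return sum(w * dist.get(v, fallback) for v, w in demand.items())


# Usage: node 9 has a known demand but does not appear in the sampled path 0-1-2.
sub = {0: [1], 1: [0, 2], 2: [1]}
demand = {0: 1, 1: 1, 2: 1, 9: 5}
print(cost_with_all_demands(sub, demand, [1]))
```

Because the fallback distance is the same for every unknown node, their demand adds a constant to h(X) regardless of X, so the choice of medians is driven by the known part of the topology.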

V. RESULTS OF INFLUENCE MAXIMIZATION PROBLEM
A. DATASET AND PRELIMINARIES
As the ground truth network G, we use four real social networks: NetHEPT [12], NetPHY [12], Facebook-small [45], and Facebook-large [46]. NetHEPT and NetPHY represent co-authorship among researchers, and Facebook-small and Facebook-large represent friendships among Facebook users. These are widely used as benchmark datasets for IM problems [8], [10]-[12], [47]-[51]. Multiple links are simply converted to a single link [24], [52]. Characteristics of each network are shown in Table 2, and the degree distribution of each network is shown in Fig. 7.
We synthetically generated the influence-spread probabilities of each link, using the weighted cascade (WC) model [12]. Specifically, for each link (u, v), we let $p_{u,v} = 1/d_v$, where $d_v$ is the degree of node v. The WC model is widely used for generating influence-spread probabilities for the evaluation of IM algorithms [10]-[12], [47], [53]. We also used $p_{u,v} = 0.01$ for all node pairs (u, v) for comparison.
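The two probability settings can be sketched in Python as follows, assuming the usual WC definition in which the probability on link (u, v) is the reciprocal of the degree of v. The adjacency-dictionary representation and the function names are assumptions made for this example.

```python
def weighted_cascade(adj):
    """WC model: the probability on (u, v) is 1 / degree(v)."""
    return {(u, v): 1.0 / len(adj[v]) for u in adj for v in adj[u]}


def uniform_probabilities(adj, p=0.01):
    """Uniform model: the same probability on every link."""
    return {(u, v): p for u in adj for v in adj[u]}


# Usage on a star: the hub is hard to activate, the leaves are easy.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
wc = weighted_cascade(star)
print(wc[(1, 0)], wc[(0, 1)])
```

Under the WC model the incoming probabilities of each node sum to 1, so high-degree nodes are individually harder to activate from any single neighbor.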
We apply the algorithm introduced in Section III to the incomplete networks G' obtained via SEC, RW, and random sampling and obtain the set of seed nodes U. As a parameter for TIM+, we used ε = 0.1. For the obtained seed node set U, we calculate the influence spread σ(U, G, W) through simulation of the IC model on the ground-truth network G. We run each simulation 1,000 times and take the average of the influence spread.

B. RESULTS
Similar to the approach in the previous section, we fixed the sample size as 10% of nodes (M = 0.1|V|) and investigated the influence spread while changing the number of seed nodes. Figs. 8 and 9 show the results when using, respectively, the WC model and the uniform (p = 0.01) model for influence-spread probabilities. The results when applying TIM+ to the complete network are included in the figures (denoted as TIM+). For Facebook-large with p = 0.01, we failed to obtain the results from TIM+ due to the scalability limits already reported in [54]. We therefore added the results when using degree discount IC [12] (denoted as DDIC), which is a lightweight heuristic algorithm, instead of the results of TIM+ for Facebook-large with p = 0.01. The obtained results show that influence spread when selecting seed nodes from incomplete networks is comparable with influence spread when using the complete network. Except for Facebook-small, using incomplete networks obtained via RW and SEC achieved higher influence spread than using incomplete networks obtained via random sampling. This result confirms the benefit of crawl-based sampling strategies for finding a key node set, as also seen in the results for the MM problem. For Facebook-small, using incomplete networks obtained via random sampling achieved higher influence spread than using networks obtained via RW and SEC. In particular, when using SEC, the influence spread is much smaller than when using other strategies. As shown in Table 2, Facebook-small has a higher density than the other networks. The benefit of finding hub nodes is low in dense networks since influence can be easily spread, even from low-degree nodes. In contrast, the drawback of high locality with SEC can affect the selection of good seed nodes.
We next investigate the effects of sample size. Figs. 10 and 11 show the normalized influence spread when selecting 50 seed nodes from the incomplete networks. The normalized influence spread is defined as the influence spread when selecting 50 seed nodes from the incomplete networks divided by the influence spread when selecting 50 seed nodes from the complete network. For Facebook-large with p = 0.01, seed nodes in the complete network were selected using DDIC; for other settings, seed nodes in the complete network were selected using TIM+. Fig. 10 shows the results for the WC model, and Fig. 11 shows the results for p = 0.01. From these results, a normalized influence spread of 0.8-0.9 can be achieved by using only a small sample size of 10-20%.  These results suggest that we can obtain a set of influential seed nodes from a small sample. They also suggest that the crawl-based sampling strategies RW and SEC are effective for obtaining the partial structure of a network when identifying an influential node set. Only for Facebook-small was random sampling more effective than SEC and RW. However, when the sample size was 20%, RW achieved a normalized influence spread of 0.7 for the WC model and approximately 0.9 for p = 0.01. This suggests that when RW is used and 20% of nodes are sampled, sufficient influence spread can be achieved, and in many cases a smaller sample is sufficient.

VI. DISCUSSION
An important implication of our results is that the partial structure of a network obtained via crawl-based network sampling is sufficient for identifying key node sets. Our results suggest that a 10% sample size is enough in many cases. This is a good result for real applications since access to real networks is typically limited (e.g., by restrictions on the number of API calls for social media graphs).
Another implication is that using RW sampling with a moderate level of bias is a robust strategy when the ground truth network is completely unknown. SEC sampling is designed to find high-degree nodes, and the bias of visiting high-degree nodes in SEC is stronger than in RW sampling. Our results show that SEC outperformed RW in several networks. However, SEC was not robust, and sometimes performed poorly for several types of networks. For instance, for the PowerGrid network in the MM problem, and for the Facebook-small network in the IM problem, using the SEC strategy resulted in considerably worse results than using other strategies. When using the SEC strategy, the crawling process tends to be confined to a strongly clustered subnetwork, which sometimes results in very poor outcomes. These results suggest that a strong bias in SEC degrades its robustness. Therefore, when we do not have any knowledge about the ground truth network, RW sampling should be used rather than SEC sampling.
We recognize some limitations of this study and suggest them as future research directions. First, the generalizability of our findings to other types of key node set identification problems is still not clear. There are several types of key node set identification problems, such as several variants of the IM problem and the minimum p-center problem. Our results show that similar heuristics (i.e., applying algorithms for the complete network to the incomplete network) can be effective for both IM and MM problems on incomplete networks. We therefore expect that other types of key node set identification problems for incomplete networks can be solved with a similar approach. We are interested in validating our results against other types of key node set identification problems in future research. We are also interested in key node set identification problems for other types of networks such as dynamic networks and multi-layer networks. Second, our study is purely experimental, and theoretical verification of our results is also an important direction for future research. In particular, it would be worthwhile to derive the sample sizes necessary to achieve specific targets for practitioners using key node set identification algorithms. Third, we only consider some combinations of existing sampling strategies and existing key node set identification algorithms, and there is room for the development of more effective algorithms. We are interested in exploring how to choose pairs of sampling strategies and key node set identification algorithms for more robust key node set identification. Using simpler key node set identification algorithms, such as those based on centrality measures of nodes, would be an option for more robust key node set identification. Moreover, the sampling strategy could be adaptively changed during sampling. Such a new adaptive sampling strategy could be effective for key node set identification.

VII. CONCLUSION
We studied two variants of the problem of identifying a set of key nodes from a network under limited knowledge about its structure. Specifically, we investigated how conventional network sampling strategies affect the solutions obtained for the MM and IM problems, which are popular key node set identification problems, when only partial networks are known. The conventional crawl-based sampling strategies are known to have a bias toward high-degree nodes. Although these biases are not desirable for estimating the general characteristics of unknown networks, they are expected to be beneficial for identifying key node sets. Our results have shown the benefits of biases in crawl-based sampling strategies for the IM and MM problems. We showed that both the IM and MM problems are effectively solved by similar simple heuristic algorithms when the subnetworks were obtained by crawl-based sampling strategies. For many cases, we showed that a 10-20% sample size is enough to find key node sets that are comparable with the key node sets obtained from the complete networks. We also examined which sampling strategy should be used for identifying the key node sets. Our results suggest that using RW sampling is a good option. SEC sampling is sometimes better than RW but is sometimes considerably worse.