Predicted max degree sampling: Sampling in directed networks to maximize node coverage through crawling


Abstract:

When studying large-scale networks, including the Web and computer networks, it is first necessary to collect appropriate data. There is a large body of literature on web crawling and network sampling in general; however, this work typically assumes that a query on a node reveals either all edges incident to that node (in the case of an undirected graph) or all edges outgoing from that node (e.g., in the case of a web crawler). There is relatively little work on sampling directed networks where incoming and outgoing edges may be obtained with separate queries. This type of sampling is relevant to networks like Twitter, in which the 'Friends' and 'Follower' connections are reciprocal relationships (i.e., if A is a follower of B, then B is a friend of A). In this paper we present Predicted Max Degree Sampling (PMD), a new method of sampling a directed network with the objective of maximizing the total number of nodes observed. We consider two types of crawling, corresponding to the scenarios in which there is or is not a limit on the number of nodes returned by a query. We compared PMD against three baseline algorithms on three real networks and saw large improvements: with a budget of 2000, PMD found 15% to 170.2% more nodes than the closest baseline algorithm for the different datasets considered.
Date of Conference: 01-04 May 2017
Date Added to IEEE Xplore: 23 November 2017
Conference Location: Atlanta, GA, USA

1. Introduction

Collecting network data is often a time-consuming task: among other things, one must deal with bandwidth limitations or API rate limits. The research literature contains a large amount of work on network sampling, but most of this work has been performed either on undirected networks or on directed networks in which a query returns the edges outgoing from a node (e.g., the web crawling scenario).
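
To make the query model concrete, the following is a minimal sketch of the setting described in the abstract: incoming and outgoing edges are fetched with separate queries, each query costs one unit of budget, and a query may or may not be capped in how many nodes it returns. The class `DirectedGraphOracle`, the function `crawl`, and the selection rule (query the node with the largest degree observed so far) are illustrative assumptions. The rule is a naive greedy stand-in, not the PMD algorithm, whose details are not given in this excerpt.

```python
from collections import defaultdict


class DirectedGraphOracle:
    """Hypothetical stand-in for a rate-limited API (not from the paper)."""

    def __init__(self, edges, limit=None):
        self.out_nbrs = defaultdict(list)
        self.in_nbrs = defaultdict(list)
        for u, v in edges:
            self.out_nbrs[u].append(v)
            self.in_nbrs[v].append(u)
        # limit=None models the scenario with no cap on nodes per query.
        self.limit = limit

    def query(self, node, direction):
        """Return out-neighbors or in-neighbors of node, truncated to the cap."""
        table = self.out_nbrs if direction == "out" else self.in_nbrs
        return table.get(node, [])[: self.limit]


def crawl(oracle, seed, budget):
    """Spend at most `budget` queries; return the set of nodes observed."""
    observed = {seed}
    pending = {(seed, "out"), (seed, "in")}  # queries not yet issued
    seen_degree = defaultdict(int)           # degree observed so far

    for _ in range(budget):
        if not pending:
            break
        # Assumed greedy rule: largest observed degree first,
        # with a deterministic tie-break on the query's string form.
        choice = max(pending, key=lambda q: (seen_degree[q[0]], str(q)))
        pending.remove(choice)
        node, direction = choice
        for nbr in oracle.query(node, direction):
            seen_degree[node] += 1
            seen_degree[nbr] += 1
            if nbr not in observed:
                observed.add(nbr)
                pending.update({(nbr, "out"), (nbr, "in")})
    return observed
```

With a toy edge list such as `[("a", "b"), ("a", "c"), ("d", "a"), ("c", "e")]`, node `d` is reachable from seed `a` only through an incoming-edge query, which is the case that a crawler following outgoing edges alone would miss.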
