2017 conference paper
Challenging the limits: Sampling online social networks with cost constraints
Ieee infocom 2017 - ieee conference on computer communications.
Graph sampling techniques via random walk crawling have been popular for analyzing statistical characteristics of large online social networks due to simple implementation and provable guarantees on unbiased estimates. Despite the growing popularity, the ‘cost’ of sampling and its true impact on the accuracy of estimates still have not been carefully studied. In addition, the random walk-based methods inherently suffer from the sluggish nature of random walks and the ‘slow-mixing’ structure of social graphs, thereby leading to high correlation in the samples obtained. With these in mind, in this paper, we develop a mathematical framework such that the cost of sampling is properly taken into account, which in turn re-defines a widely used asymptotic variance into a cost-based asymptotic variance. Our new metric enables us to compare a class of sampling policies under the same cost constraint, integrating “random skipping” (bypassing nodes without sampling) into the random walk-based sampling. We obtain an optimal policy striking the right balance between sampling quality (less correlation) and sampling quantity (higher cost per sample), which greatly improves over the usual skip-free crawling-based samplers. We further extend our framework, enabling one to design more sophisticated sampling strategies with an array of control knobs, which all produce unbiased estimates under the same cost constraint.