Record-replay is an effective way to reproduce non-deterministic bugs and has gained attention in the research community. However, current approaches fall short of handling non-deterministic bugs on multi-processor platforms and in distributed systems, for several reasons. First, multi-threaded programs on multi-processor platforms, which are common in today's distributed systems, are difficult to record and replay because of data races. Second, increasing system scale makes production environments more sensitive to perturbation from recording. Even modifying control scripts has become unacceptable because of the growing complexity that comes from the variety of programs and the large number of computing cores. Third, when record-replay is deployed in distributed systems, large scale also multiplies the recording traces, which overwhelms developers and slows down the whole system dramatically. To address these issues, we propose the following mechanisms for efficient record-replay in multi-processor distributed systems: control-flow based record-replay, low-perturbation loading, and proportion sampling. We have implemented these mechanisms in ReBranch, a practical record-replay system for debugging multi-threaded programs on multi-processor platforms and in distributed systems. ReBranch has already proven effective on real bugs. We also present our debugging experience with ReBranch through a case study on a bug in memcached, an important component in many commercial systems.