Scalable group-based checkpoint/restart for large-scale message-passing systems | IEEE Conference Publication