DMTracker: finding bugs in large-scale parallel programs by detecting anomaly in data movements

Authors: Qi Gao (The Ohio State University, Columbus, OH); Feng Qin; Dhabaleswar K. Panda

As software reliability in large-scale systems becomes increasingly important, debugging large-scale parallel systems remains a daunting task. This paper proposes a technique that automatically finds hard-to-detect software bugs in parallel programs, bugs that can cause severe problems such as data corruption and deadlocks, by detecting abnormal behavior in their data movements. Based on the observation that data movements in parallel programs typically follow certain patterns, our idea is to extract data movement (DM)-based invariants at program runtime and check for violations of these invariants. Such violations indicate potential bugs, such as data races and memory corruption bugs, that manifest themselves in data movements. We have built a tool, called DMTracker, based on this idea: it automatically extracts DM-based invariants and detects violations of them. Our experiments with two real-world bug cases in MVAPICH/MVAPICH2, a popular MPI library, show that DMTracker can effectively detect them and report abnormal data movements to help programmers quickly diagnose the root causes of the bugs. In addition, DMTracker incurs very low runtime overhead, from 0.9% to 6.0%, in our experiments with High Performance Linpack (HPL) and NAS Parallel Benchmarks (NPB), which indicates that DMTracker can be deployed in production runs.
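The general idea of invariant-based detection can be illustrated with a toy sketch. This is not DMTracker's algorithm, which the abstract does not describe in detail; it only assumes that a trace of data movements is available as simple records. The `Movement` record, the `min_support` threshold, and both function names are hypothetical and chosen for illustration.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Set


class Movement(NamedTuple):
    """One observed data movement (hypothetical record format)."""
    src: str     # source buffer, e.g. "rank0:sendbuf"
    dst: str     # destination buffer, e.g. "rank1:recvbuf"
    nbytes: int  # number of bytes moved


def extract_invariants(trace: Iterable[Movement], min_support: int = 3) -> Set[Movement]:
    """Treat movement patterns seen at least min_support times as invariants.

    Assumption: a frequently repeated (src, dst, nbytes) pattern is "normal".
    """
    counts = Counter(trace)
    return {pattern for pattern, seen in counts.items() if seen >= min_support}


def find_violations(trace: Iterable[Movement], invariants: Set[Movement]) -> List[Movement]:
    """Return the movements that match no extracted invariant, in trace order."""
    return [m for m in trace if m not in invariants]


if __name__ == "__main__":
    normal = Movement("rank0:sendbuf", "rank1:recvbuf", 1024)
    odd = Movement("rank0:sendbuf", "rank1:recvbuf", 4096)
    invariants = extract_invariants([normal] * 5 + [odd])
    for violation in find_violations([normal, odd, normal], invariants):
        print("abnormal data movement:", violation)
```

In this sketch a violation is only a hint that something may be wrong; a real tool would need richer invariants and a way to rank or filter reports before presenting them to a programmer.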

Published in: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07)

Date of Conference: 10-16 Nov. 2007