Recent paradigm shifts in distributed computing, such as the advent of cloud computing, pose new challenges to the analysis of distributed executions. One important new characteristic is that the management staff of computing platforms and the developers of applications are separated by corporate boundaries. As a result, when applications go wrong, the most readily available debugging aids for developers are the visible output of the application and any log files collected during its execution. In this paper, we propose the concept of task graphs as a foundation for representing distributed executions, and present a low-overhead algorithm to infer task graphs from event log files. Intuitively, a task represents an autonomous segment of computation inside a thread. Edges between tasks represent their interactions and preserve programmers' notion of data and control flows. Our technique leverages existing logging support where available, and otherwise augments it with aspect-based instrumentation to collect events of a set of predefined types. We show how task graphs can improve the precision of anomaly detection in a request-oriented analysis of field software and help programmers understand the runtime behavior of the Hadoop Distributed File System (HDFS).
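
To make the notion of a task graph concrete, the following is a minimal sketch of how tasks and edges might be recovered from an event log. It is an illustration only and not the inference algorithm presented in the paper. The `Event` record, the four event kinds (`task_start`, `task_end`, `send`, `recv`), the thread names, and the assumption that timestamps are comparable across threads are all hypothetical choices made for this example.

```python
from collections import namedtuple

# Hypothetical event record; field names and event kinds are assumptions
# made for this sketch, not the predefined event types used in the paper.
Event = namedtuple("Event", ["time", "thread", "kind", "msg_id"])


def infer_task_graph(events):
    """Return (tasks, edges) inferred from an iterable of Event records.

    A task is the segment of one thread between a task_start and the next
    task_end on that thread. An edge (a, b) is added when task b receives
    a message that task a sent.
    """
    tasks = []     # [thread, start_time, end_time]; list index is the task id
    current = {}   # thread -> id of the task currently open on that thread
    senders = {}   # msg_id -> id of the task that sent the message
    edges = set()  # (from_task_id, to_task_id)

    # Assumes timestamps are comparable across threads (no clock skew).
    for ev in sorted(events, key=lambda e: e.time):
        if ev.kind == "task_start":
            current[ev.thread] = len(tasks)
            tasks.append([ev.thread, ev.time, None])
        elif ev.kind == "task_end":
            task_id = current.pop(ev.thread, None)
            if task_id is not None:
                tasks[task_id][2] = ev.time
        elif ev.kind == "send":
            if ev.thread in current:
                senders[ev.msg_id] = current[ev.thread]
        elif ev.kind == "recv":
            src = senders.get(ev.msg_id)
            dst = current.get(ev.thread)
            if src is not None and dst is not None:
                edges.add((src, dst))

    return [tuple(t) for t in tasks], sorted(edges)


# Hypothetical log: a client task sends one message to a worker task.
log = [
    Event(1, "client-main", "task_start", None),
    Event(2, "client-main", "send", "m1"),
    Event(3, "datanode-worker", "task_start", None),
    Event(4, "datanode-worker", "recv", "m1"),
    Event(5, "datanode-worker", "task_end", None),
    Event(6, "client-main", "task_end", None),
]
tasks, edges = infer_task_graph(log)
```

In this sketch the client segment becomes task 0, the worker segment becomes task 1, and the matched message yields a single edge from task 0 to task 1. Matching on a message identifier is only one possible way to relate tasks; real logs may need other correlation keys, and skewed clocks would need a different ordering strategy.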