Ever-increasing demand for computing capability is driving the construction of ever-larger computer clusters, typically comprising commodity compute nodes, ranging in size up to thousands of processors, with each node hosting an instance of the operating system (OS). Recent studies [E. Hendriks (2002); F. Petrini et al. (2003)] have shown that even minimal intrusion by the OS on user applications (e.g., a slowdown of user processes of less than 1.0% on each OS instance) can result in dramatic performance degradation, 50% or more, when the applications are executed on thousands of processors. The contribution of this paper is the explication and demonstration, by way of a case study, of a methodology for analyzing and evaluating the impact of system activity (all software and hardware other than user applications) on application performance. Our methodology has three major components: 1) a set of simple benchmarks that quickly measure and identify the impact of intrusive system events; 2) the kernel-level profiling tool OProfile, used to characterize all relevant events and their sources; and 3) a kernel module that provides timing information for in-depth modeling of the frequency and duration of each relevant event, determining which sources have the greatest impact on performance (and are therefore the most important to eliminate). The paper provides a collection of experimental results obtained on a state-of-the-art dual AMD Opteron cluster running GNU/Linux 2.6.5. While our work was performed on this specific OS, we argue that our contribution readily generalizes to other open-source and commercial operating systems.