Skip to Main Content
This paper characterizes "queue storms" in supercomputer systems and discusses methods for quelling them. Queue storms are anomalously large queue lengths dependent upon the job size mix, the queuing system, the machine size, and correlations and dependencies between job submissions. We use synthetic data generated from actual job log data from the ASCI Blue Mountain supercomputer combined with different long-range dependencies. We show the distribution of times from the first storm to occur, which is in a sense the time when the machine becomes obsolete because it represents the time when the machine first fails to provide satisfactory turnaround. To overcome queue storms, more resources are needed even if they appear superfluous most of the time. We present two methods, including a grid-based solution, for reducing these correlations and their resulting effect on the size and frequency of queue storms.