High-density computer racks are becoming increasingly commonplace in supercomputing centers and data centers. Because these racks tightly integrate high-powered computing components, hot spots (pockets of elevated temperature on chips and across the system) form easily when room air circulation is ineffective. Hot spots reduce the reliability of high-density systems and increase the likelihood of thermal emergencies, which in turn trigger system slowdowns or shutdowns. Today's systems offer techniques such as dynamically scaling down CPU voltage and controlling fan speed to reduce heat generation and dissipate heat. Unfortunately, these techniques operate independently, without coordination. As a result, to prevent thermal emergencies, systems may run at reduced capacity precisely when full capacity is required. We propose a combined in-band and out-of-band approach to reduce the likelihood of thermal emergency slowdowns and improve system reliability. Our thermal control framework unifies a system's temperature control mechanisms to balance temperature, power consumption, and performance. More precisely, we balance the use of in-band dynamic voltage and frequency scaling (DVFS) with out-of-band proactive fan control. Our results on a power-aware cluster indicate that the coordinated use of fan control and DVFS is more effective than either technique in isolation at reducing average system operating temperatures while maintaining expected performance.
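
To make the idea of coordinating the two mechanisms concrete, the following is a minimal sketch of one possible policy, not the framework evaluated in this work. It assumes a simple threshold rule: when the system is hot, raise fan speed first and lower CPU frequency only once the fan is saturated; when the system is cool, restore frequency before slowing the fan. The frequency levels, fan range, thresholds, and the names `State` and `step` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# All constants below are hypothetical, for illustration only.
FREQ_LEVELS_MHZ = [1000, 1800, 2000, 2200, 2400]  # assumed DVFS steps
FAN_MIN, FAN_MAX, FAN_STEP = 30, 100, 10          # fan duty cycle, percent
T_HIGH, T_LOW = 65.0, 55.0                        # thresholds, degrees C


@dataclass(frozen=True)
class State:
    freq_idx: int = len(FREQ_LEVELS_MHZ) - 1  # start at the highest frequency
    fan_pct: int = FAN_MIN                    # start at the lowest fan speed


def step(state: State, temp_c: float) -> State:
    """Return the next control state for one temperature sample."""
    if temp_c > T_HIGH:
        # Too hot: prefer the out-of-band knob (fan), which costs no performance.
        if state.fan_pct < FAN_MAX:
            return State(state.freq_idx, min(FAN_MAX, state.fan_pct + FAN_STEP))
        # Fan is saturated: fall back to in-band DVFS and drop one frequency level.
        return State(max(0, state.freq_idx - 1), state.fan_pct)
    if temp_c < T_LOW:
        # Cool: recover performance first, then reduce fan speed.
        if state.freq_idx < len(FREQ_LEVELS_MHZ) - 1:
            return State(state.freq_idx + 1, state.fan_pct)
        return State(state.freq_idx, max(FAN_MIN, state.fan_pct - FAN_STEP))
    # Between the thresholds: hold the current settings (hysteresis band).
    return state


if __name__ == "__main__":
    s = State()
    for temp in [70, 70, 70, 70, 70, 70, 70, 70, 60, 50, 50]:
        s = step(s, temp)
        print(temp, FREQ_LEVELS_MHZ[s.freq_idx], s.fan_pct)
```

In this sketch the gap between `T_LOW` and `T_HIGH` acts as a hysteresis band so the controller does not oscillate around a single threshold. A real deployment would read temperatures from hardware sensors and apply the settings through the platform's own DVFS and fan interfaces, which are outside the scope of this illustration.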