A Novel Mitigation Method for Noise-Induced Temperature Error in CPU Thermal Control

It has been reported that in the thermal control of real-time computing systems, zero-mean thermal sensor noise can induce a significant steady-state error between the target and actual temperatures of a CPU. Unlike the usual case of zero-mean sensor noise resulting in zero-mean temperature fluctuations around the target value, this noise-induced temperature error manifests in the form of a bias, i.e., the mean of the error is not zero. Existing work has analyzed the main cause of this error and produced a solution, known as TCUB-VS. However, this existing solution has a few drawbacks: the transient response is sluggish, and the exact value of the noise standard deviation is necessary in the design stage. In this paper, we propose a novel method of avoiding noise-induced temperature error while overcoming the limitations of the existing work. The proposed method uses an estimated CPU temperature for the part of the controller that is sensitive to noise while using actual measurements for the other part of the controller. In this way, our proposed method eliminates noise-induced temperature error and overcomes the drawbacks of the existing work. To show the efficacy of our proposed method, theoretical results are obtained using a stochastic averaging approach, and experimental results are presented along with simulations.


The associate editor coordinating the review of this manuscript and approving it for publication was Jenny Mahoney.

I. INTRODUCTION
In real-time computing systems, it is crucial to maintain the CPU temperature at a certain desired value because overheated CPUs suffer serious performance degradation, whereas if the CPU is cooler than the target temperature, this often implies resource underutilization. The goal of CPU thermal control is to maintain the temperature of a CPU at the desired value. The main challenges in CPU thermal control design for real-time systems include (i) satisfying both the real-time constraint and the thermal constraint, (ii) doing so in the face of uncertainty in the system dynamics, and (iii) overcoming the effect of thermal sensor noise. Here, the thermal constraint refers to the need to keep the CPU temperature below a given value to prevent CPU overheating. Previous studies on thermal-aware real-time scheduling have been reported by Wang and Bettati [1], [2] and Hung et al. [3], who addressed problem (i) by using feedforward-based schemes. In feedforward-based schemes, an accurate system model is required to prevent CPU overheating. However, such a system model is usually subject to uncertainties in various aspects, such as thermal dynamics, power consumption, task execution time, and ambient temperature; consequently, feedforward-based schemes cannot effectively solve the CPU overheating problem. Therefore, to avoid CPU overheating (problem (i)) while addressing uncertain system dynamics (problem (ii)), feedback-based thermal control schemes have been proposed by Yue et al. [4], Hettiarachchi et al. [5], and Fu et al. [6]. Kim et al. [7] revealed that problem (iii) is also of significant importance because zero-mean noise may result in CPU overheating under utilization constraints. Thermal noise in CPU temperature measurements is very common, and the level of noise can often be substantial in practice.
Indeed, the sensor noise amplitudes in the work of Rotem et al. [8] and Long et al. [9] were as large as 10 °C. Kim et al. [7] addressed problems (i), (ii), and (iii) within the Thermal Control under Utilization Bound (TCUB) framework, which was proposed by Fu et al. [6]. In the work of Fu et al. [6], the CPU utilization level is controlled based on CPU temperature feedback to maintain the temperature at a certain target. Kim et al. [7] showed that in TCUB systems, thermal sensor noise can induce a significant steady-state error between the desired and actual CPU temperatures. This phenomenon is referred to as noise-induced temperature error (NITE).
A similar phenomenon has been reported in toner concentration control in printing systems by Eun and Hamby [10] and Kabamba et al. [11], and the feedback control system structure that is conducive to this phenomenon has been generalized by Lee and Eun [12].
Fortunately, Kim et al. [7] proposed a method called Thermal Control under Utilization Bound with Virtual Saturation (TCUB-VS) to eliminate the NITE phenomenon. However, this method requires accurately measuring the standard deviation of the sensor noise, which varies over time and with the measurement conditions. In addition, TCUB-VS can result in a sluggish transient response of the CPU temperature.
Because NITE occurs due to measurement noise, an alternate mitigation method is to place a low-pass filter in the feedback loop. This, however, also causes a sluggish response of the control system and can affect the system stability (see Choi and Sul [13], Satici et al. [14], Li et al. [15], and Park and Kim [16]).
The main goal of this paper is to propose a novel method to address problems (i), (ii), and (iii) mentioned above while overcoming the drawbacks of TCUB-VS. The proposed method is named TCUB-NR, which stands for Thermal Control under Utilization Bound with Noise Reduction. According to the analysis presented by Kim et al. [7], Eun and Hamby [10], and Lee and Eun [12], the mechanism underlying NITE relies on a proportional control term, which is sensitive to noise, and a utilization bound in the thermal controller. Therefore, the main idea of TCUB-NR is to use an estimated CPU temperature, instead of noisy temperature measurements, for the proportional control term of the thermal controller. This estimated temperature is obtained by using a nominal thermal model of the CPU, whose input is the CPU utilization. The novel feature of TCUB-NR is that, to eliminate NITE, the thermal controller is divided into two parts. One part includes a proportional control term that is sensitive to noise, and the other part is not sensitive to noise. TCUB-NR uses the estimated temperature for the noise-sensitive proportional control term but uses the actual sensor output for the noise-insensitive part of the controller. In this manner, TCUB-NR eliminates the effect of sensor noise in the part of the controller that is critical for NITE mitigation while maintaining the feedback control function by using the actual sensor output for the other part of the controller.
If the estimated temperature were to be used for all aspects of control, the real-time feedback nature of the controller would be lost; however, this is avoided in the TCUB-NR scheme. In turn, because the effect of noise on the NITE is reduced, the extended linear range of the virtual saturation block used in TCUB-VS is no longer needed, thus eliminating the main cause of the sluggish response. In addition, our proposed method does not require the accurate standard deviation of the noise, which is used to set limits for the virtual saturation in TCUB-VS. Hence, the drawbacks of the existing approach are alleviated. To prove that our proposed method is effective in mitigating NITE, a theoretical analysis of NITE elimination with TCUB-NR is presented using the stochastic averaging approach of Skorokhod [17]. Additionally, a performance comparison between TCUB-VS and TCUB-NR is shown to illustrate the improvement achieved with TCUB-NR.
The outline of this paper is as follows. In Section II, we describe the thermal control problem for real-time systems and show that the NITE phenomenon occurs in TCUB. Additionally, the TCUB-VS mechanism is explained. Section III shows that TCUB-NR mitigates the NITE phenomenon while overcoming the limitations of the TCUB-VS method. Moreover, the NITE mitigation effect of TCUB-NR is analyzed using the stochastic averaging approach of Skorokhod [17]. The TCUB-NR scheme proposed in this paper for eliminating NITE is validated with both simulated and experimental results in Section IV. Finally, our conclusions are given in Section V.

II. NITE IN CPU THERMAL CONTROL SYSTEMS
A. OVERVIEW OF TCUB
TCUB is a feedback-based control approach for maintaining a desired CPU temperature by controlling CPU utilization based on measurements. In this approach, the CPU temperature is measured, and then the utilization level is adjusted based on the measured value, the desired temperature setpoint, and a model of the thermal dynamics of the unit.
The overall structure of TCUB is shown in Figure 1. It consists of a controller and a processor. The controller has two feedback loops. The inner loop, shown in red, is responsible for utilization control, and the outer loop, in blue, is responsible for thermal control. Since the processor utilization dynamics are much faster than the temperature dynamics, the outer loop runs at much lower sampling and control rates than the inner loop. The index k represents the sampling instances for the outer thermal control loop, and k′ represents the sampling instances for the inner utilization loop.
The setpoint temperature and the maximum and minimum utilization bounds, denoted by T_R, β, and α, respectively, are provided as input to the thermal controller. The minimum utilization bound α may be defined as the sum of the product of each minimum achievable task execution time with the corresponding minimum allowable task rate. The maximum utilization bound β may be determined by the scheduler-dependent utilization bound at which real-time tasks can miss a deadline.
In the k-th sampling period of the outer loop, based on the measured temperature T(k) provided by the thermal sensor, the thermal controller calculates the utilization setpoint u_s(k) for the utilization controller in the inner loop. In the k′-th sampling period of the inner loop, based on the measured utilization u(k′), the utilization controller adjusts the change in the task rate r(k′) and then sends it to the actual processor to drive the processor utilization to converge to the setpoint u_s(k) given by the thermal controller [6], [7].
For the thermal dynamics of the processor, combining the thermal RC model [18], [19] with the relationship between CPU power and utilization [6] yields the discrete-time model employed in thermal control [6]. In this model, T(k) is the temperature of the processor, T_0 is the ambient temperature, R_th is the heat resistance, G_p is the ratio between the actual active power at run time and the estimated active power P_a, and P_idle is the power when the CPU is idle. In addition, the model includes the coefficient exp(−T_s/(R_th C_th)), where T_s is the control sampling interval and C_th is the heat capacity. Figure 2 shows a block diagram of a real-time TCUB system [6], where C(z), G(z), and H(z) denote a proportional-integral (PI) controller, an anti-windup controller, and the transfer function between the CPU utilization and CPU temperature, respectively. All transfer functions are given in the form of discrete-time transfer functions. Here, the transfer function H(z) models the combined dynamics of the utilization controller (the inner loop) and the processor in Figure 1, whereas C(z) and G(z) are components of the thermal controller in Figure 1. Figure 2 also shows the thermal sensor noise, denoted by n(k). The thermal sensor noise, along with other noise sources, is modeled as zero-mean Gaussian noise with a standard deviation of σ_n, following the Gaussian approximation in [20]. The upper and lower bounds on the CPU utilization in TCUB are represented by the saturation sat_α^β(u), with the following definition:
$$\mathrm{sat}_{\alpha}^{\beta}(u) = \begin{cases} \beta, & u > \beta \\ u, & \alpha \le u \le \beta \\ \alpha, & u < \alpha \end{cases}$$
where α and β denote the lower and upper utilization bounds, respectively. The signals T(k), T_R, and u(k) are the CPU temperature deviation, the CPU temperature setpoint, and the utilization command, respectively, all obtained around a designed operating point of T(k) = 67 °C. Additionally, the signal u_s(k) corresponds to sat_α^β(u). The difference between u(k) and u_s(k) is used as an input to the anti-windup controller. The transfer functions H(z) and G(z) are given by expressions in which K = K_i(1 + w_I T_s/2) and b = (2 − w_I T_s)/(2 + w_I T_s). The constants K_p, K_i, and w_I are the original gains of the PI controller in a continuous-time design, rather than the discrete-time design; for details, see [6], [7].

B. NITE IN TCUB
Reference [7] shows that in the thermal control feedback system shown in Figure 2, due to measurement noise, the mean value T̄(k) of the CPU temperature deviation in the steady state deviates from the setpoint T_R. The implications of this are illustrated in Figure 3, where the mean value of the CPU temperature deviation, as obtained through simulation, is plotted as a function of the CPU temperature setpoint. Clearly, in the presence of n(k) with σ_n = 3.5, the CPU temperature is either higher or lower than the setpoint. For comparison, Figure 3 also shows the response when n(k) = 0 for all k ≥ 0, represented by the black line. In this noise-free case, no error exists for setpoints from 51 °C to 83 °C; the temperature deviation outside of this setpoint range is due to the limits on the utilization level imposed by α and β. However, even if the noise is small, there are a few conditions under which NITE occurs (see [10] for details).

C. TCUB-VS
The analysis presented in [7] reveals that NITE is caused by undesired persistent triggering of G(z) due to measurement noise amplified by the proportional term of the controller. Hence, the analysis in [7] predicts that eliminating G(z) should eliminate NITE. However, since G(z) is necessary for stability and for the transient response (see [21]-[26] for a discussion of the adverse effect without G(z)), an alternative solution has been suggested that uses a virtual saturation block with a linear range that is wider than the actually allowed utilization range. The proposed scheme is illustrated in Figure 4, where all transfer functions, gains, and signals are defined similarly to those in Figure 2. To avoid the continuous triggering of G(z) by noise, the limits of the virtual saturation block are widened by an amount that depends on σ_n (see [7]). In this way, 99% of the undesired activation of G(z) is avoided because n(k) follows a Gaussian distribution with a standard deviation of σ_n. Note that for the above setup, an accurate estimate of σ_n is required to compute the new limits. Figure 5 shows the mean CPU temperature deviation as a function of the setpoint. In contrast to Figure 3, the CPU temperature is now controlled without errors. Figure 5 also shows the response when there is a mismatch between the actual noise standard deviation and the estimate. The green line shows the CPU temperature error for the case in which the virtual saturation limits are chosen using a value of σ_n = 3.5 when the actual value is 6. Clearly, some error in the CPU temperature reappears. Therefore, the standard deviation of the measurement noise must be exactly known for TCUB-VS. However, this requirement might be difficult and impractical to fulfill because the sensor characteristics vary over time and with the measurement conditions. This is one of the disadvantages of TCUB-VS.
Another disadvantage is that TCUB-VS may result in a sluggish transient response when recovering from controller windup, because the linear range of the virtual saturation is wider than the actually allowed range of utilization and therefore delays the activation of anti-windup. A novel NITE mitigation method that avoids the disadvantages of TCUB-VS is the topic of the next section.
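The 99% coverage argument, and its sensitivity to a wrong noise estimate, can be checked numerically. The following is a minimal sketch, not the TCUB-VS design rule itself: it assumes that each utilization bound is padded by z·κ·σ, where z is the two-sided Gaussian quantile and κ is the gain through which the noise reaches the saturation. The function names and the padding form are hypothetical.

```python
from statistics import NormalDist


def coverage_multiplier(coverage: float = 0.99) -> float:
    """Return z such that P(|n| <= z * sigma) = coverage for zero-mean Gaussian n."""
    return NormalDist().inv_cdf(0.5 + coverage / 2.0)


def virtual_limits(alpha, beta, kappa, sigma_est, coverage=0.99):
    """Hypothetical widened limits: pad each bound by z * kappa * sigma_est.

    This padding form is an assumption made for illustration only.
    """
    pad = coverage_multiplier(coverage) * kappa * sigma_est
    return alpha - pad, beta + pad


def actual_coverage(sigma_est, sigma_true, coverage=0.99):
    """Fraction of noise samples that stay inside a margin sized for sigma_est
    when the true standard deviation is sigma_true."""
    z = coverage_multiplier(coverage)
    return 2.0 * NormalDist().cdf(z * sigma_est / sigma_true) - 1.0
```

Under these assumptions, a margin designed for σ_n = 3.5 covers noticeably less than 99% of the noise samples when the actual value is 6, which is consistent with the reappearance of the error described above.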

III. A NOVEL METHOD FOR ELIMINATING NITE
We propose an effective method for avoiding NITE in feedback-based thermal control systems. Figure 6 shows the block diagram of a system with the proposed noise reduction method, referred to as Thermal Control under Utilization Bound with Noise Reduction (TCUB-NR). Most signals and transfer functions are defined similarly to those in Figure 2. In addition, H_n(z) is the transfer function of the nominal thermal dynamics of the CPU, which is assumed to be identical to the plant H(z). An estimated CPU temperature T̂(k), which is not affected by the noise, is obtained from H_n(z), and C_∞ is equal to C(∞).
TCUB-NR is implemented by using the estimated CPU temperature for the proportional control part of the thermal controller and the actual measured CPU temperature for the other part of the controller. As noted in [7], the NITE is due to undesired triggering of anti-windup caused by the fluctuation of the utilization level induced by the action of measurement noise through the proportional control term. Therefore, TCUB-NR uses an estimated temperature for proportional control, which is sensitive to noise, but uses the actual sensor output for the other part of the controller, which is not sensitive to noise. In this way, TCUB-NR is expected to avoid NITE. Note that the noise-insensitive part of the controller uses the actual temperature measurement in order to maintain the feedback control function. Consequently, TCUB-NR does not lose its real-time feedback nature while effectively mitigating NITE.
Mathematical support for the above argument is presented as follows. The utilization command u(k) in the TCUB scheme illustrated in Figure 2 is given by (5). We decompose C(z) into the proportional control part, C_∞, and the remaining part, denoted by C_1(z). In other words, we write C(z) = C_∞ + C_1(z). For the case of the controller given in (4), C_∞ = (K + K_p). Equation (5) shows that in TCUB, both parts of the controller use the actual measured CPU temperature as input: both are multiplied by T(k) + n(k). Specifically, the term (K_p + K)n(k) represents the sensor noise amplified by the proportional control term, which results in the occurrence of NITE [7].
In contrast, the utilization command u(k) in the TCUB-NR scheme illustrated in Figure 6 is given by (6), from which (7) is obtained. A comparison of (5) and (7) shows that in TCUB-NR, the noise n(k) is not multiplied by C_∞ but instead is multiplied only by C_1(z). Hence, this scheme eliminates the effect of the noise in the proportional control term, i.e., the gain of (K + K_p), but still retains the feedback on the actual measured CPU temperature for the other part of the controller, C_1(z).
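The comparison can be made concrete with a short sketch. It assumes that the tracking error is formed as T_R − (T(k) + n(k)), omits the anti-windup term, and writes the model-based estimate as \hat{T}(k). These forms are assumptions made for illustration and are not a restatement of the numbered equations of this paper.

```latex
% TCUB (sketch): both controller parts act on the noisy measurement,
% so the noise is scaled by the proportional gain C_\infty.
\begin{align*}
u(k) &= \bigl(C_\infty + C_1(z)\bigr)\bigl(T_R - T(k) - n(k)\bigr) \\
     &= C_\infty\bigl(T_R - T(k)\bigr) - C_\infty\, n(k)
        + C_1(z)\bigl(T_R - T(k) - n(k)\bigr)
\end{align*}

% TCUB-NR (sketch): the proportional part acts on the estimate \hat{T}(k),
% so n(k) reaches the command only through C_1(z).
\begin{align*}
u(k) &= C_\infty\bigl(T_R - \hat{T}(k)\bigr)
        + C_1(z)\bigl(T_R - T(k) - n(k)\bigr)
\end{align*}
```

In the first expression the term C_∞ n(k) passes the noise directly to the saturation input, whereas the second expression contains no such term.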
Similarly to what is done in [7], [10], [12], the system in Figure 6 is analyzed below using stochastic averaging to mathematically demonstrate the elimination of the NITE. Applying stochastic averaging to the system in Figure 6 with respect to the measurement noise yields the system depicted in Figure 7. The symbols used for the signals involved follow the same notation used in Figure 6 but with an upper bar, which indicates that they represent mean values of the corresponding signals with respect to the randomness caused by noise. Additionally, note that the zero-mean measurement noise n(k) is averaged away and does not appear in the system of Figure 7. The function h_α^β(ū(k); κσ_n) is given by (8) and (9), where the parameter κ represents the part of u(k) that is directly proportional to n(k). In particular, the temperature tracking error signal is denoted by ē. The NITE in TCUB-NR (the system depicted in Figure 6) is evaluated by analyzing ē in the system depicted in Figure 7. The accuracy of this approximation has been proven to be high [7], [10], [12]. The NITE in TCUB-NR is given by (10), where G(1) is the dc gain of the transfer function G(z). The derivation of (10) is given in the Appendix. According to equation (8) and the controller C(z) given in (4), the function h_α^β(ū; κσ_n) is the same as sat_α^β(ū) since κ = C_∞ − (K + K_p) = 0. Therefore, when the steady-state utilization command ū is in the linear region, the term ū − h_α^β(ū; κσ_n) in (10) is zero; accordingly, ē = 0 in Figure 7. Therefore, this analysis of (10) reveals that TCUB-NR eliminates the NITE.
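The role of κ can be illustrated numerically. The sketch below assumes that the averaged nonlinearity is the expected value of the saturation applied to a Gaussian-perturbed command, E[sat(ū + κσ_n w)] with w a standard normal variable, and evaluates it in closed form. This is an assumed form for illustration and not necessarily identical to (8); the function names are hypothetical.

```python
import math


def sat(u, alpha, beta):
    """Saturation of u to the interval [alpha, beta]."""
    return min(max(u, alpha), beta)


def _pdf(x):
    """Standard normal probability density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _cdf(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def averaged_sat(u_bar, alpha, beta, kappa_sigma):
    """Closed form of E[sat(u_bar + kappa_sigma * w)] for w ~ N(0, 1).

    Assumed form of the averaged nonlinearity; with kappa_sigma = 0 it
    reduces to the plain saturation.
    """
    s = abs(kappa_sigma)
    if s == 0.0:
        return sat(u_bar, alpha, beta)
    a = (alpha - u_bar) / s
    b = (beta - u_bar) / s
    return (alpha * _cdf(a)
            + beta * (1.0 - _cdf(b))
            + u_bar * (_cdf(b) - _cdf(a))
            - s * (_pdf(b) - _pdf(a)))
```

With κσ_n = 0 the function returns sat(ū), so ū − h vanishes in the linear region. With κσ_n > 0 and ū near a bound, the averaged output differs from ū, and under this assumed form that gap is the quantity that keeps the anti-windup path active.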
According to (10), unlike TCUB-VS, TCUB-NR eliminates the NITE without requiring the exact standard deviation of the noise. In addition, the utilization bounds of TCUB-NR are the same as those of TCUB. Therefore, since there is no delay in the activation of the anti-windup mechanism in TCUB-NR, it is expected that the transient response of the thermal control system should not be degraded compared to the original design.
Regarding system stability, the stability of the system shown in Figure 6 is investigated as follows: When H (z) = H n (z) = G(z), the stability of the system in Figure 6 is identical to that in Figure 2, which was discussed in [6]. Thus, the system depicted in Figure 6 remains stable.

IV. EXPERIMENTAL VALIDATION
A. EXPERIMENTAL SETUP
For validation, we have implemented TCUB in a computing system. The hardware platform is a Samsung DB-P205 desktop with an Intel i7-870 CPU @2.93 GHz, and the operating system is Linux kernel 4.13 with Ubuntu 16.04.3 LTS. The implemented setup is shown in Figure 8. To implement TCUB, we divided it into three components: the temperature monitoring module, the utilization controller, and the thermal controller. Here, the CPU temperature is measured with lm-sensors. The lm-sensors package provides tools for monitoring CPU temperatures using digital thermal sensors on CPU cores.
94004 VOLUME 8, 2020
We generate a workload to vary the CPU utilization by using a cyclic redundancy check (CRC) benchmark. For each core, we run a thread that controls the CPU utilization by regulating the work time and sleep time. We use the system tick counter for the work time and implement the sleep time using the usleep function.
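The work/sleep regulation described above can be sketched as follows. This is an illustration in Python rather than the implementation used in the experiments: the period length, the function names, and the arithmetic stand-in for the CRC benchmark are assumptions.

```python
import time


def split_period(utilization, period_s):
    """Split one control period into (work, sleep) durations in seconds.

    The utilization is clamped to [0, 1] before the split.
    """
    u = min(max(utilization, 0.0), 1.0)
    work_s = u * period_s
    return work_s, period_s - work_s


def run_one_period(utilization, period_s):
    """Busy-work for the work share of the period, then sleep for the rest."""
    work_s, sleep_s = split_period(utilization, period_s)
    deadline = time.perf_counter() + work_s
    x = 0
    while time.perf_counter() < deadline:
        # Hypothetical stand-in for the CRC benchmark workload.
        x = (x * 31 + 7) & 0xFFFFFFFF
    time.sleep(sleep_s)
```

Calling run_one_period(0.25, 0.1) in a loop on one thread per core would approximate a 25% utilization on that core, under the assumption that the thread is not preempted for long.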
We acquire the thermal resistance from a linear regression of the temperature output with respect to the CPU utilization. Next, the thermal capacitance is obtained from the discrete thermal RC model [18], [19], the CPU power consumption [27], and the transient temperature graph. Table 1 shows the parameters related to the thermal behavior in the experimental setup. Based on Table 1 and the TCUB method as presented in [6], [7], the thermal control system parameters are obtained as given in (11). The nominal plant H_n(z) for TCUB-NR is assumed to be identical to H(z).
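The identification steps above can be sketched as follows. The sketch assumes that the steady-state temperature is linear in the utilization with slope R_th·G_p·P_a, and that the thermal time constant read from the transient curve equals R_th·C_th. The function names and all numeric values used with them are hypothetical and are not the entries of Table 1.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def thermal_resistance(slope, g_p, p_a):
    """R_th from the temperature-vs-utilization slope (assumed slope = R_th * G_p * P_a)."""
    return slope / (g_p * p_a)


def thermal_capacitance(r_th, tau_s):
    """C_th from an assumed time constant tau = R_th * C_th."""
    return tau_s / r_th
```

For example, steady-state temperatures of 38, 46, 54, and 62 °C at utilizations of 0.2, 0.4, 0.6, and 0.8 give a slope of 40 °C per unit utilization, which is then divided by G_p·P_a to obtain R_th.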

B. SIMULATION RESULTS
By running Simulink with the transfer functions and parameters given in (11), we obtain MATLAB simulation results for the thermal control systems shown in Figures 2, 4, and 6 for various scenarios. Under the assumption that the standard deviation σ_n of the measurement noise is 3.5 and the ambient temperature is 15 °C, Figure 9 shows the response of the thermal control system of Figure 2. Clearly, the CPU temperature shown deviates from the target temperature setpoint. The CPU is underutilized relative to the desired level due to the effect of the noise. In addition, the response of TCUB in the absence of measurement noise is shown for comparison, from which it is clear that the deviation results from the zero-mean measurement noise. Figure 10 shows the responses of the thermal control systems of Figure 4 (TCUB-VS) and Figure 6 (TCUB-NR). The response of TCUB-VS, shown in red, exhibits no bias, as expected. However, if the estimate of the standard deviation σ_n is off from the true value, bias appears again, as shown in blue. The response of TCUB-NR is shown in green; here, no bias exists, and the fluctuation is much smaller. Hence, TCUB-NR eliminates the NITE without requiring knowledge of the noise characteristics.
In the next simulation scenario, the CPU temperature setpoint is 70 °C, and the ambient temperature changes due to a sudden shift in the operating environment. After 1800 seconds, the ambient temperature rises. Figure 11 shows the responses of TCUB-VS and TCUB-NR in this scenario. For the first 1800 seconds, neither control system can achieve the desired CPU temperature because of the upper limit on the utilization. During this time, controller windup occurs. At 1800 seconds, the environment changes such that the desired temperature can be achieved. Note that TCUB-VS, as shown in blue, takes thousands of seconds to recover from the windup. This is because the anti-windup activation is delayed due to the wide linear range of the virtual saturation block. On the other hand, TCUB-NR, as shown in green, recovers immediately. Hence, it is shown that TCUB-NR overcomes the sluggish transient response of TCUB-VS, although neither TCUB-VS nor TCUB-NR exhibits any NITE.
Note that NITE occurs in TCUB when the CPU power is doubled [7]. In addition, [7], [30] show the occurrence of NITE in various CPUs (Alpha 21264, Pentium 4, i7-2620M). Therefore, NITE can occur regardless of the CPU type or power. Moreover, the combined outcome of all sources of noise, including power spikes, is approximated as the thermal sensor noise n. Therefore, TCUB-NR remains effective in avoiding NITE even though power spikes become larger as the CPU power P_a increases. For validation, Figure 12 shows the responses of TCUB and TCUB-NR as the CPU power increases from the existing 95 W to 123 W or 205 W by overclocking the CPU in accordance with [31], [32]. Specifically, the responses of TCUB and TCUB-NR are shown in solid and dotted green, respectively, when the CPU power P_a is 123 W. When the CPU power P_a is 205 W, the responses of TCUB and TCUB-NR are shown in solid and dotted red, respectively. As shown in Figure 12, NITE is mitigated in TCUB-NR regardless of the CPU power.

C. EXPERIMENTAL RESULTS
First, Figure 13 shows experimental results for thermal control systems implementing the TCUB, TCUB-VS, and TCUB-NR schemes. For these experimental results, the target temperature is 65 °C. In Figure 13, the response in red shows that TCUB exhibits NITE. The response in blue is that of TCUB-VS designed with an estimate of σ_n = 1 while the actual value is about 4.5. Clearly, NITE is not well mitigated. By contrast, the response in green shows that TCUB-NR, our proposed method, effectively eliminates NITE without requiring the noise standard deviation.
Similar to the simulation results in Figure 11, Figure 14 includes a change in the ambient temperature due to a sudden shift in the operating environment. The purpose of this experiment is to demonstrate the improved transient response of TCUB-NR. For this reason, we use a TCUB-VS design with an accurate σ_n. The three responses in Figure 14 are almost the same until 1800 seconds, before which they cannot achieve the desired CPU temperature due to the low ambient temperature.
A sluggish transient response means that the CPU temperature takes longer to reach the setpoint. The reason for this is that it takes time for the controller states to come out of windup [21]. In Figure 14 (experimental results), TCUB-VS (blue line) takes about 1500 seconds (3300 − 1800) to recover from windup and reach the setpoint, while TCUB-NR (green dotted line) takes only about 700 seconds (2500 − 1800). Note that TCUB (red line) never reaches the setpoint due to NITE.
Notably, the fluctuation with TCUB-NR is also much smaller than that with TCUB or TCUB-VS due to the noise reduction mechanism of TCUB-NR. Both the simulations and the experimental results show that the proposed TCUB-NR method of thermal control effectively eliminates NITE. Additionally, the experimental validation and the simulation results support the claim that TCUB-NR overcomes the two main drawbacks of TCUB-VS.
FIGURE 14. Experimental responses of thermal control systems with the system parameters given in (11) in the case of a sudden shift in the operating environment.
In order to verify the accuracy of the CPU core temperature measurement using the existing thermal sensor (lm-sensors package), we conducted an experiment using a thermocouple device. According to [6], the CPU core temperature T can be calculated from the measured CPU ambient temperature T_0, that is, T = R_th P_th + T_0. Hence, we placed the thermocouple device in close proximity to the CPU case to measure T_0. Figure 15 shows the CPU core temperature obtained from lm-sensors and that calculated using the above equation and the measurement of T_0 when the CPU is idle. In Figure 15, the CPU ambient temperature measured by the thermocouple is shown in black. Based on the ambient temperature, the calculated CPU core temperature, shown in blue, is obtained. Finally, the CPU core temperature obtained from lm-sensors is shown in red. Clearly, the CPU core temperature from the existing thermal sensor (red line) is very close to the calculated CPU core temperature (blue line).
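The steady-state relation T = R_th P_th + T_0 used above is the fixed point of a first-order thermal RC model. The sketch below assumes the update T(k+1) = Φ T(k) + (1 − Φ)(R_th P + T_0) with Φ = exp(−T_s/(R_th C_th)), which is a standard discretization and not necessarily the exact model of this paper. The function names and numeric values are hypothetical.

```python
import math


def rc_step(temp_c, power_w, ambient_c, r_th, c_th, ts_s):
    """Advance the assumed first-order RC model by one sampling interval."""
    phi = math.exp(-ts_s / (r_th * c_th))
    return phi * temp_c + (1.0 - phi) * (r_th * power_w + ambient_c)


def steady_state_temp(power_w, ambient_c, r_th):
    """Core temperature predicted from the ambient temperature: T = R_th * P + T_0."""
    return r_th * power_w + ambient_c


def simulate(temp_c, power_w, ambient_c, r_th, c_th, ts_s, steps):
    """Iterate rc_step for a fixed number of steps at constant power and ambient."""
    for _ in range(steps):
        temp_c = rc_step(temp_c, power_w, ambient_c, r_th, c_th, ts_s)
    return temp_c
```

With a hypothetical idle power of 20 W, R_th = 0.4, and an ambient temperature of 25 °C, the calculated core temperature is 33 °C, and the simulated temperature converges to the same value.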

V. CONCLUSION
In thermal control for real-time computing systems, noise-induced temperature error occurs. In this paper, a new approach called Thermal Control under Utilization Bound with Noise Reduction (TCUB-NR) is proposed and shown to effectively eliminate this noise-induced bias while simultaneously overcoming a few drawbacks of an existing solution. The proposed method achieves CPU thermal control in the presence of model uncertainty and measurement noise. It is conjectured that similar errors induced by noise will occur for other types of noise distributions. Therefore, for mitigating the error induced by non-Gaussian-distributed noise, the potential of existing methods, such as minimum entropy [28], [29], is an open problem that could be investigated in future work.