A Novel Reliability Index to Assess the Computational Resource Adequacy in Data Centers

The energy demand of data centers is increasing globally with the growing demand for computational resources needed to ensure quality of service. It is therefore important to quantify the resources required to serve the computational workloads at the rack level. In this paper, a novel reliability index called loss of workload probability is presented to quantify rack-level computational resource adequacy. The index defines the right-sizing of the rack-level computational resources so that they match both the computational workloads and the reliability level desired by the data center investor. The outage probability of the power supply units and the workload duration curve of the servers are analyzed to define the loss of workload probability. The workload duration curve of the rack, and hence the power consumption of the servers, is modeled as a function of server workloads. The server workloads are taken from a publicly available data set published by Google. The power consumption models of the major components of the internal power supply system are also presented; they show that the power loss of the power distribution unit is the highest among the components of the internal power supply system. The proposed reliability index and the power loss analysis could be used for rack-level computational resource expansion planning and to support energy-efficient operation of the data center.


I. INTRODUCTION
Data centers (DCs) are becoming an essential part of the modern information technology industry with the increasing popularity of cloud-based services. With the growth in computational resource capacity and size of DCs, their energy demand and operational costs are continuously increasing [1]. According to a report published by Lawrence Berkeley National Laboratory, the total energy consumption of the DCs in the US was approximately 70 billion kWh, or 1.8% of national consumption. Their energy demand was projected to double to roughly 140 billion kWh annually by 2020, corresponding to a $13 billion annual electricity bill [2]. A recent study from Google indicates that a typical cluster utilizes only 10% to 50% of its installed computational capacity [3]. Due to the overprovisioning of computational resources, the DCs incur unnecessary electrical energy and associated operational costs.
The associate editor coordinating the review of this manuscript and approving it for publication was Zhiyi Li.
In addition, it is important to limit the energy losses in the internal power supply system (IPSS) of a DC to enhance its overall efficiency [4], [5]. The losses of the IPSS depend on its structure and on the electrical load demands of the load sections. DCs typically have two major load sections, namely the IT loads and the cooling loads, which are fed by the power conditioning equipment of the IPSS [6]. The uninterruptible power supply (UPS), the power distribution unit (PDU), and the power supply unit (PSU) are the main components of the power conditioning section of the IPSS. All these devices consume a significant amount of power, which is considered as the power loss of the IPSS in this paper. The power losses of these components increase with increasing IT loads [5], [6]. The right-sizing of computational resources for energy-efficient operation of DCs is addressed in [7]-[11]. The number of idle servers can be reduced by considering data traffic and negotiated service level agreements (SLAs), as proposed in [7]. The number of active servers is optimized by consolidating workloads through virtualization in [9]-[11]. However, the reliability of the rack-level PSUs is not included in these studies. This paper therefore quantifies computational resource adequacy considering the outage probability of the PSUs at the rack level. The power consumption models for the UPS, PDU, and PSU are also considered, since the increasing power consumption of these devices degrades the reliability of the IPSS in a DC, as explained in [5].
In this paper, a novel index called loss of workload probability (LOWP) is introduced. The index addresses the adequacy of the rack-level computational resources in terms of the number of required servers per rack. The required number of servers per rack is calculated considering the computational workloads taken from a data set published by Google [12] and the outage probability of the PSUs. The time series of the electrical power consumption of the rack is calculated from the assigned workloads of the servers. Further, the electrical load duration curves of the racks are constructed to calculate the LOWP, considering the outage probability of the associated rack-level PSUs. The power consumption model of the server and the power loss models of the major components of the IPSS are presented as functions of the server utilization. The server utilizations from the Google data set are fed into the proposed models to obtain the overall IT load profile. Besides the power consumption of the IT loads, the percentage of power loss of the IPSS devices and the aggregated power loss of the IPSS are also analyzed using the server utilizations from the Google data set. Processing the data to obtain a useful utilization time series for each server and applying it to the proposed models is computationally challenging.
The contributions and findings of this paper are listed below:
• A novel reliability index called LOWP is introduced to analyze the adequacy of the rack-level computational resources considering the computational workloads. The LOWP defines the probability of computational workloads that cannot be satisfied due to failures of PSUs at the rack. The application of the LOWP is also discussed with respect to computational resource expansion planning and the design of clusters of racks for latency-sensitive workloads.
• To the best of our knowledge, this is the first attempt to use the server utilization data published in the Google data set, without any scaling, to model the aggregated power consumption of the IT loads and the IPSS of the DC. The analysis shows that the total loss and the percentage of loss of the PDUs are higher than those of the UPSs and the PSUs at both the rack level and the aggregated level. The power conditioning system of the IPSS consumes more than 10% of the rated IT load in the DC.
The remainder of this work is organized as follows: Section II describes the state of the art of the power consumption modeling approaches for the components of a DC and the loss of load probability analysis that is traditionally used in power systems. Section III presents the formulation of the power consumption models of the IT load section and the power conditioning equipment, including the structure of the Google data set. Section IV describes the results and analysis of the proposed power consumption models and the reliability index for rack-level computational resources. Section V explains the application of the proposed LOWP index for planning DC expansion and discusses the limitations of this work. Finally, Section VI contains the conclusions and recommendations.

LIST OF SYMBOLS
rotational speed of local fan
φ_PDU    power loss coefficient of PDU
φ_UPS    power loss coefficient of UPS
C    the cumulative rated power of the available PSUs at the rack
C_j    the computational rated power of the available PSUs at the rack for hour j
L    the IT load demand of the rack
n    the total number of PSUs per rack
N_f    number of total local fans
N_R    number of total racks
N_S    number of total servers
P[C = C_j]    the probability to have remaining power supply capacity of C_j after failures of PSUs at rack
P[L > C_j]    the probability to have IT load demand L more than the remaining power supply capacity C_j
p_a    the availability of PSU
p_j    the probability to have a power supply capacity C_j for percentage of measurement time t_j
P_j^fan    power consumption of j th local fan
P^fan    power consumption of individual local fan
P_i^idle    idle power consumption of i th server

According to The Uptime Institute's classification, the IPSS infrastructure of DCs has evolved through at least four distinct stages in the last 40 years, known as the ''Tiers of data center'' [13]-[15]. The tiers of the DC are distinguished by the redundant components in the power flow paths to the IT loads, without considering the computational workloads and hence the electrical load of the IT devices. The methodologies related to the reliable operation of the DC that lead towards the dynamic right-sizing of computational resources are addressed in [7]-[11]. These research works focus on reducing the number of idle servers based on data traffic and negotiated Service Level Agreements (SLAs) [7], optimizing the number of active servers through virtualization [9]-[11], and location-dependent dynamic resource allocation to control the active server population at each location [8]. However, none of these research works considers the reliability aspects of the IPSS, even though the performance of the IPSS degrades with increasing power losses in the IPSS [5].
Therefore, the outage probabilities of the rack-level PSUs are important to consider for the adequacy of the computational resources in DCs, which can be identified by the LOWP index.
Regarding the reliability indices for maintaining SLA, quality of service (QoS), and capacity management, the key quality indicator (KQI) and key performance indicator (KPI) are described in [45], [46]. A further application of reliability assessment is proposed in [47], which uses the ''service availability'' of the servers in a performance-optimized DC. The reliability indices ''service availability'' and ''service reliability'' are also used in [48], [49] as functions of the up-time and down-time of DC components. Another index named ''service latency'', used to assess system reliability especially for edge and internet DCs, is studied in [50], [51]; the transaction latency impacts the quality of experience of end users [45]. Another index named Defects Per Million Operations (DPM) is discussed in [46]; it assesses system reliability by measuring the number of failed operations per million operations. These indices account for component up-time and down-time, operation or task failures, and computational abilities to assess DC performance. However, it is important for DC operators to assess the computational resource adequacy beforehand, to ensure the operational efficiency and the SLA of the DC, as explained in Section I. The proposed LOWP index could address the computational resource adequacy at the rack level.
The LOWP index is inspired by the commonly used power system reliability index called ''loss of load probability (LOLP)'', which was proposed by Booth et al. in 1972 [16]. LOLP quantifies the expected load demands that will not be met by the available generation capacity [17]. A similar probabilistic approach is used in this paper to quantify the computational workloads that will not be handled due to failures of PSUs at the racks in a DC. The novelty of the index lies in its conceptualization: the LOWP is applied at the rack level to quantify computational resource adequacy in the DC, while LOLP is used in power systems to address generation adequacy. The LOLP considers the forced outage rate of generators and the forecasted load of the system, whereas the LOWP index is calculated using the computational workloads and the outage probability of the PSUs at the rack level.

B. POWER CONSUMPTION MODELING OF IT LOADS AND POWER CONDITIONING DEVICES
The power consumption of the rack is needed to get the electrical equivalent workload duration curve, which is also needed for the power consumption models of the components of the IPSS in DCs [6].
The power consumption models of different components of the IT load section and the IPSS are presented in [18]-[20]; however, these models are limited in their ability to identify component-level consumption and losses under the dynamic structures and topologies of DCs. Modular modeling approaches can take the dynamic structures of the load sections into account when modeling power consumption from the component level to the aggregated level [6], [21], [22]. The earliest processor utilization based power consumption model of a CPU appeared in 2002 [23]. This model was later extended to model the consumption of the whole server [24]-[26]. Moreover, the UPS, PDU, and PSU, as the power conditioning equipment in the IPSS, are responsible for ensuring back-up supply and power quality for the IT loads. These essential components consume a significant amount of energy during the transformation processes, which is considered as power loss in this study. The power consumption models of the UPS and PDU are analyzed in [18], [27].

C. APPLICATION OF REAL DATA SET IN POWER CONSUMPTION MODELS
Due to the scarcity of publicly available data sets with real server workloads, it is difficult to use the server utilization based models for other purposes. Google and Alibaba are the only DC owners that have released data sets of their DCs with server utilization data [12], [28]. The Google data set has been used in research on workload characterization, server classification, and server failure analysis [29]-[32]. Recently, the server utilization table of this data set was used to characterize server power consumption and to validate a proposed power oversubscription model in a benchmarked cloud interface [33]. The authors of [33] consider only some selected servers and scale up their utilizations. In this paper, however, the utilization factors of all the reported servers in the data set are used, without any scaling, to find the time series of the server power consumption.
VOLUME 9, 2021

III. METHODS AND PROCEDURES
A. LOSS OF WORKLOAD PROBABILITY (LOWP)
The probability that the computational workload will not be served by the available computational capacity at the rack, due to random failures of the rack-level PSUs, is defined as the loss of workload probability (LOWP), as given in (1). The computational workloads are converted into electrical loads, i.e., the load demand of the rack-level IT loads, as explained in Section III-B. The cumulative load curve, known as the load-duration curve, is modeled from the hourly IT load of the rack. Meanwhile, the computational capacity is defined by the power supply capacity, i.e., the cumulative rated power of the PSUs at the rack. It is assumed that every server in the rack is connected to a PSU, as shown in Figure 1. Therefore, if the remaining power supply capacity C j is less than the IT load demand L for a certain percentage of the observation time t j , the overall probability that the IT load demand will not be met is given by the LOWP, as defined in (1).
where C is the cumulative rated power of the available PSUs at the rack; C j is the computational rated power of the available PSUs at the rack for hour j; L is the IT load demand of the rack; P[C = C j ] is the probability of having a remaining power supply capacity of C j after failures of PSUs at the rack; P[L > C j ] is the probability that the IT load demand L exceeds the remaining power supply capacity C j ; p j is the probability of having a power supply capacity C j during the observation time t j ; and t j is the percentage of measurement time with an IT load demand equal to or greater than the remaining power supply capacity.
The IT load demand L is obtained from the load-duration curve of the rack, which is constructed from the power consumption of the rack. The power consumption model of the rack therefore needs to be addressed first. The power consumption of the servers and the local cooling fans is considered in this paper to characterize the power consumption of the rack, as these are the major power consuming components in a rack [6]. A flowchart with the steps to calculate the LOWP index is shown in Figure 2. The power consumption models of these components, and hence of the rack and the other load sections, are presented in Section III-B.
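Equation (1) itself does not appear in this text. A form consistent with the symbol definitions above, and with the standard LOLP formulation on which the index is modeled, would be the following. It is a reconstruction under that assumption, and the division by 100 assumes that t_j is expressed as a percentage.

```latex
\begin{equation}
\mathrm{Pr}_{\mathrm{LOWP}}
  = \sum_{j} P[C = C_j]\, P[L > C_j]
  = \sum_{j} \frac{p_j\, t_j}{100}
\tag{1}
\end{equation}
```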

B. POWER CONSUMPTION MODELS OF LOAD SECTIONS
1) SERVER POWER CONSUMPTION
In this paper, blade servers are considered among the different server types (e.g., blade, tower, and rack-able), since all of them contain similar basic hardware blocks, i.e., processors, memory, chipset, input/output (I/O) devices, storage, voltage regulators, and cooling systems (fans and heat sinks) [34], [35]. The power consumption of a server as a function of utilization is given in (2), where u i is the utilization of the i th server unit, which varies depending on the workload assigned and the resources allocated to this server. The server utilization u i is a unitless quantity that ranges from 0% (no workload) to 100% (maximum workload).
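Equation (2) is not reproduced in this text. As a minimal sketch, the widely used linear utilization model can stand in for it. The linear form and the name `server_power` are assumptions made for illustration; the 400 W idle and 800 W rated values are the ones quoted later in Section IV-A.

```python
def server_power(utilization, p_idle=400.0, p_rated=800.0):
    """Server power in watts for a utilization in [0, 1].

    Assumed linear form: idle power plus a utilization-proportional share
    of the gap between rated and idle power. This is an illustrative
    stand-in for (2), not necessarily the authors' exact model.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must lie in [0, 1]")
    return p_idle + (p_rated - p_idle) * utilization


# Example: a server at 25% utilization under the assumed linear model.
example_power_w = server_power(0.25)
```

Under this assumed form, an idle server draws the idle power and a fully utilized server draws the rated power, which matches the 400 W and 800 W endpoints stated in the paper.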

2) LOCAL COOLING FAN POWER CONSUMPTION
The local cooling fans are attached to the servers, together with heat sinks, to handle the heat generated by the IT loads. In current technology, servers have variable airflow control to ensure the reliable operation of the server cooling system. The required airflow rate depends on the heat generated by the servers, which determines the required rotational speed and the power consumption of the fans, as shown in (3)-(5). The equivalent thermal resistance, R, can be expressed as the ratio of the difference between the die and ambient temperatures to the heat generated by the CPUs [6], as shown in (3). The thermal resistance, R, is also the sum of the equivalent thermal resistance of the heat sinks, R hs , and the thermal resistance of the CPU case, R case [36], [37], as shown in (4). Here, R hs depends on its convective heat transfer as a function of the wind speed at the surface of the heat sink, which is determined by the rotational speed of the cooling fans (in revolutions per minute), while R case is assumed to be constant in [36]. The constants a 1 and a 2 depend on the properties of the airflow and the CPU package, and the parameter a 3 depends on the level of turbulence in the airflow [36].
In this paper, the thermal resistance of the server, R, is first calculated using (3) and is then used to calculate the required rotational speed of the fans, as shown in (4). Finally, the power consumption of the fans is calculated as a function of the rotational speed, as shown in (5). The constants a 1 to a 7 of the equations are taken from the regression models presented in [36], [37] and shown in Section IV-2.
where R is the thermal resistance, P server is the server power consumption, T die is the CPU die temperature, and T amb is the ambient temperature of the server hall. R hs and R case represent the equivalent thermal resistance of the heat sinks and the thermal resistance of the CPU case, and P fan is the power consumption of the local fan, which depends on its rotational speed.
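Equations (3)-(5) are not reproduced in this text. The two relations that the prose states explicitly can be written as below. This is a sketch of only those parts: the regression forms that link R_hs to the fan speed (constants a_1 to a_3) and the fan power to the fan speed (remaining constants) cannot be recovered from the text and are deliberately left out.

```latex
\begin{align}
R &= \frac{T_{die} - T_{amb}}{P_{server}} \tag{3} \\
R &= R_{hs} + R_{case} \nonumber
\end{align}
```

The second line is only the additive part of (4) as described in the text, with R_case treated as a constant.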

3) COMPLETE RACK MODEL
The blade servers are mounted in racks at high density, and the fans keep the server equipment within its thermal limit. It is not typical to measure the power consumption of every blade server; instead, the aggregated power consumption of the servers and fans is measured at the rack level [6], as shown in (6). The total power consumed by the IT loads is then the sum of the consumption of all racks, as shown in (7).
where P Rack is the total rack power consumption with a total of N S servers and N f local fans. With N R racks, the IT load power consumption is P IT total .
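Equations (6) and (7) are not reproduced in this text, but the prose describes them as plain sums. The sketch below follows that description; the function names and the example wattages are hypothetical and chosen only to illustrate the aggregation.

```python
def rack_power(server_powers_w, fan_powers_w):
    """Rack power: sum over the N_S servers plus sum over the N_f local fans."""
    return sum(server_powers_w) + sum(fan_powers_w)


def total_it_power(racks):
    """Total IT load: sum of rack power over all N_R racks.

    racks: iterable of (server_powers_w, fan_powers_w) pairs, one per rack.
    """
    return sum(rack_power(servers, fans) for servers, fans in racks)


# Hypothetical rack: 10 servers at 450 W each and 40 fans at 5 W each.
example_rack = ([450.0] * 10, [5.0] * 40)
example_rack_w = rack_power(*example_rack)
example_total_w = total_it_power([example_rack, example_rack])
```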

4) CONSUMPTION MODELS OF POWER CONDITIONING DEVICES
The main task of the power conditioning devices in the IPSS, i.e., the UPS, PDU, and PSU, is to maintain a continuous power supply and ensure power quality for the IT load section. Different UPS architectures with battery backup are available to supply power to the IT loads during short interruptions. The PDUs and PSUs are used to maintain specific voltage levels for the IT loads [6], [18]. Backup generators are also present in the system to supply power to the DC during long interruptions of the public grid; since the focus of this study is on the IPSS of the DC, the backup generators are not considered. The IPSS considered in this paper is shown in Figure 1, where every rack is assumed to host 10 servers with 10 PSUs and one PDU, while 100 PDUs are connected to 1 UPS. The working principles of the UPS, PDU, and PSU are explained in our previous work [5]. The detailed power consumption models of the UPS and PDU are taken from [5], [6] and are given in (8) and (9), where P Loss UPS and P idle UPS represent the UPS power loss and the idle power loss of the UPS. φ UPS is the power loss coefficient of the UPS, which is unitless according to (8).
where P Loss PDU and P idle PDU represent the PDU power loss and the idle power loss of the PDU. φ PDU represents the PDU power loss coefficient, with a unit of per watt, as in (9).
The power losses of PSUs are treated as independent of the supplied power in [6], [18], [36], whereas in [38] the power consumed by the PSU is described as load-dependent without any constant losses. In this study, it is assumed that the PSU consumes 1% of its server electrical load.
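Equations (8) and (9) are not reproduced in this text. The sketch below assumes forms that agree with the stated units and with the later remark about a square term in (9): an idle loss plus a term linear in the load for the UPS (unitless coefficient), and an idle loss plus a term quadratic in the load for the PDU (coefficient per watt). These forms, the function names, and every coefficient value are assumptions for illustration. Only the 1% PSU loss comes from the text.

```python
def ups_loss(load_w, p_idle_w, phi_ups):
    """Assumed UPS loss: idle loss plus a load-proportional term (phi_ups unitless)."""
    return p_idle_w + phi_ups * load_w


def pdu_loss(load_w, p_idle_w, phi_pdu):
    """Assumed PDU loss: idle loss plus a term in load squared (phi_pdu in 1/W)."""
    return p_idle_w + phi_pdu * load_w ** 2


def psu_loss(server_load_w, fraction=0.01):
    """PSU loss as 1% of the server electrical load, as assumed in this study."""
    return fraction * server_load_w


# Hypothetical coefficients, chosen only to exercise the functions.
example_ups_w = ups_loss(load_w=1000.0, p_idle_w=50.0, phi_ups=0.05)
example_pdu_w = pdu_loss(load_w=1000.0, p_idle_w=10.0, phi_pdu=1e-5)
example_psu_w = psu_loss(800.0)
```

The quadratic PDU term is what makes its loss grow faster than the load, which is the behaviour the results section attributes to the PDUs.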

C. GOOGLE DATA SET STRUCTURE AND PROCESSING METHODOLOGY
The Google data set contains 6 tables with various information about the operation of 12,583 servers over a period of 29 days, from May 1, 2011 to May 30, 2011. The data set contains a table called ''task-usage'' with resource utilization information (resource refers to CPU, memory, or disk). The data are sampled every 5 min, i.e., at a rate of 12 samples/hr. The data set is divided into 501 comma-separated (.csv) files (approximately 166 GB in size). A significant amount of data processing capacity is needed to execute complex queries based on unique server IDs. MATLAB R2019b on a Windows server with an 8-core Intel Xeon CPU E5-4603 and 96 GB of memory has been used to extract the utilization time series of all individual servers. Additionally, the limited information in the schema [12] and the unprocessed nature of the data require an extensive understanding of the relationships between the different attributes. The following are some specific cases and related assumptions that are considered during the data processing step.
• The measurement starts on May 1, 2011 at 5 PM; we therefore discard the data of this day and analyze only the remaining 28 days.
• If no record exists for a server in a measurement interval, this means that no task was assigned to that server, and the CPU utilization of that server is assumed to be zero.
• Out of all the server utilization records, 583 records (less than 0.00005%) have a CPU utilization of more than 1 due to measurement errors. All those values are truncated to 1.
• No scaling factor is used to bias the reported utilization factor of the individual server, so that the actual trend of the overall IT load consumption could be replicated.
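The cleaning rules above can be sketched in a few lines. The authors used MATLAB; the Python below is only an illustration. It assumes a hypothetical, already aggregated record layout of `(server_id, sample_index, cpu)` with one record per server and 5-minute interval, which is simpler than the raw ''task-usage'' table.

```python
from collections import defaultdict

SAMPLES_PER_HOUR = 12  # 5-minute sampling, as stated in the text


def hourly_utilization(records, n_hours):
    """Return {server_id: [mean CPU utilization per hour]}.

    records: iterable of (server_id, sample_index, cpu) tuples, where
    sample_index counts 5-minute intervals from the start of the analysis
    window (a hypothetical layout used only for this sketch).
    """
    sums = defaultdict(lambda: [0.0] * n_hours)
    for server_id, sample_index, cpu in records:
        hour = sample_index // SAMPLES_PER_HOUR
        if 0 <= hour < n_hours:
            # Values above 1 are treated as measurement errors and capped at 1.
            sums[server_id][hour] += min(cpu, 1.0)
    # Dividing by the full sample count makes intervals without a record
    # count as zero utilization. No scaling factor is applied.
    return {sid: [s / SAMPLES_PER_HOUR for s in hours]
            for sid, hours in sums.items()}


example = hourly_utilization(
    [("a", 0, 0.5), ("a", 1, 1.5), ("a", 12, 0.6)], n_hours=2
)
```

A server with no records at all does not appear in the returned dictionary, so a caller that needs the full server population would have to add all-zero series for such servers.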

IV. RESULTS AND ANALYSIS
In order to compute the load duration curve of the rack, the computational workloads of the servers need to be translated to power consumption. The power consumption of servers is calculated using the servers' utilization factor from the Google data set according to (2). Thus the analysis of the power consumption of servers, local fans, racks, and the power losses of the IPSS components are presented in Section IV-A. The load demands of the racks are used later to get the electrical IT load-duration curve of rack for the LOWP in Section IV-B.

A. POWER CONSUMPTION MODELS
1) POWER CONSUMPTION OF SERVERS, FANS, AND RACKS
As a use-case study, it is assumed that each rack of the DC hosts 10 blade servers with all the computational resources described in Section III. Additionally, each rack has 40 local fans, i.e., 4 fans per blade server, to provide sufficient airflow in the rack [6], [39]. The rated power of a blade server, and hence the rating of a PSU, is assumed to be 800 W, and the server consumes 400 W in idle mode [6]. The servers' utilization data are converted into a time series sampled every 1 hr for all 12,583 servers to obtain the power consumption of the servers, as given in (2). The day-wise analysis of the power consumption of individual servers is shown in Figure 3. The 25th and 50th percentiles in Figure 3 show that the power consumption of most servers is below 450 W. Some servers are highly utilized every day, as represented by the 90th percentile in Figure 3, and hence their power consumption is also high. This suggests that the measurements of the servers' workloads are taken from different clusters with priority-based task scheduling. The server utilization, power consumption, and fan power consumption of one such highly utilized server (server ID 4820240534) are shown in Figure 4. This server has higher utilization during the first week, varying between 20% and 100%. The power consumption pattern follows the utilization. This is not the case for the local fans, however, since the power consumption of a fan depends on the temperatures (T die , T amb ) and on its rotational speed, although the rotational speed of the fans depends on the equivalent thermal resistance (R), which is inversely proportional to P IT , as given in (3).
The servers are distributed among 1,259 racks: 1,258 racks with 10 servers each and a last rack with only the three remaining servers and 12 local fans. The total power consumption of the IT loads, including servers and local fans, is shown in Figure 5. The total IT load has a weekly consumption pattern, as depicted in Figure 5.
The weekly consumption pattern of the IT load is also reported for enterprise and hyper-scale DCs in [4], [18]. Therefore, the modeled IT load power consumption profile can replicate the scenarios of a real-world DC and can be used further to identify the power losses of the IPSS.

2) POWER CONDITIONING EQUIPMENT LOSSES
The rating of the IPSS components depends on the power demand of the rack-level IT loads. For this study, we assume that every server is equipped with an 800 W PSU that consumes 1% of the supplied power. The racks with their PSUs are distributed among PDUs, where each PDU supplies a maximum of 10 servers, i.e., one rack. Further, each UPS is connected to 100 racks, i.e., 100 PDUs, for backup power supply, with a rated power of 900 kVA. The power losses of a PSU, a PDU, and a UPS are shown in Figure 6. As a single unit, the UPS consumes on average almost 5806 times more energy than a PSU and 61 times more than a PDU. However, the overall performance of these devices cannot be judged from the power consumed by individual devices, as shown in Figure 6, because the power losses of these devices depend on the supplied power, as given in (8) and (9). The rated power capacity of a UPS is 100 times higher than that of a PDU, so it consumes more energy than a PDU, as also claimed in [18], [40]. Thus, the percentage of losses with respect to the rated power of the rack has been analyzed for the PSU, PDU, and UPS at the rack level, as shown in Figure 7. The percentage of energy loss of a device is the ratio of its power loss to the rated power of the rack, which is 8 kW in this case. The average power loss of a PDU is 5.8% of the rack rated power, compared with 2.8% for a UPS and only 0.65% for the 10 PSUs of a specific rack. The percentage of loss indicates the efficiency of a device in supplying a rack. So, the power loss of a PDU is higher than that of a UPS, as depicted in Figure 7. The total loss of all PDUs is also higher than the total loss of the UPSs and PSUs, as shown in Figure 8. Because the number of PDUs in the system is higher than the number of UPSs, and because of the series loss of the PDU represented by the square term in (9), the overall power consumption of the PDUs is higher than that of the UPSs and PSUs.
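The percentage-of-loss definition used here is a simple ratio, and a short sketch makes the reported averages concrete in watts. The helper names are hypothetical. The 8 kW rack rating and the three percentages are the values quoted above, and the conversion to watts is plain arithmetic rather than a result reported by the authors.

```python
RACK_RATED_W = 10 * 800.0  # ten 800 W PSUs per rack, i.e. 8 kW


def percent_of_rated(loss_w, rated_w=RACK_RATED_W):
    """Loss of a device as a percentage of the rack rated power."""
    return 100.0 * loss_w / rated_w


def loss_from_percent(percent, rated_w=RACK_RATED_W):
    """Inverse mapping: average loss in watts for a given percentage."""
    return percent / 100.0 * rated_w


# Reported rack-level averages converted to watts per rack.
pdu_loss_w = loss_from_percent(5.8)    # one PDU per rack
ups_loss_w = loss_from_percent(2.8)    # UPS share attributed to the rack
psu_loss_w = loss_from_percent(0.65)   # ten PSUs of the rack together
```

With these inputs the PDU share corresponds to roughly 464 W per rack, against about 224 W for the UPS share and about 52 W for the ten PSUs.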
The percentage of losses of the power conditioning devices with respect to the total IT load at the aggregated level of the DC is shown in Figure 9. The percentage of loss at the aggregated level is calculated as the ratio of the total loss of the devices to the total rated IT load demand (10.07 MW) for the last week of the time series. The total power loss of the power conditioning devices is more than 10% of the total IT load on every day of that week, as depicted in Figure 9. The percentage of loss of the PDUs at the aggregated level is also higher than that of the UPSs and the PSUs, as shown in Figure 9. It is a contribution of this study to identify the higher percentage of loss of the PDUs, compared to the UPSs and PSUs, at both the rack level and the aggregated level of the DC. The power loss of a PDU increases with increasing IT load demand, as given in (9), which could cause a shortage of its rated power supply capacity, as explained in [5]. In that case, some of the PSUs that are connected to the PDU will be switched off, which will lead to an outage of servers at the rack. The outage probability of the PSUs is considered when calculating the LOWP in Section IV-B. Meanwhile, to enhance the efficiency of the IPSS, the number of PDUs in the IPSS should be minimized depending on the rack-level electrical demand.
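The rated figures used as denominators follow directly from the assumed 800 W rating. The check below assumes that the 10.07 MW value is simply the sum of the PSU ratings over all 12,583 servers, which the text does not state explicitly.

```latex
\begin{align*}
P_{\text{rack,rated}} &= 10 \times 0.8~\text{kW} = 8~\text{kW} \\
P_{\text{IT,rated}} &= 12583 \times 0.8~\text{kW} = 10066.4~\text{kW} \approx 10.07~\text{MW}
\end{align*}
```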

3) CONSTRUCTION OF LOAD DURATION CURVE OF IT LOAD
The IT load at the rack level is the sum of the power consumed by the servers and the local fans in the rack, as explained in Section IV-A1. The hourly power consumption of a rack is shown in Figure 10. The cumulative load curve, also called the load-duration curve of the rack, is constructed from the power consumed by the IT loads of the rack, as shown in Figure 11. Therefore, the total rack power consumption P Rack in (6) gives the load demand L in (1) and the IT load duration curve of the rack, as shown in Figure 11.
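The construction of a load-duration curve from an hourly series can be sketched as follows: sort the hourly loads in descending order and pair each with the percentage of time for which the load is at least that high. The function names and sample values are hypothetical. The second helper returns the quantity that plays the role of t_j in (1).

```python
def load_duration_curve(hourly_loads_w):
    """Return [(percent_of_time, load_w)] with loads sorted in descending order."""
    ordered = sorted(hourly_loads_w, reverse=True)
    n = len(ordered)
    return [(100.0 * (i + 1) / n, load) for i, load in enumerate(ordered)]


def percent_time_at_or_above(hourly_loads_w, capacity_w):
    """Percentage of hours in which the load is equal to or above capacity_w."""
    n = len(hourly_loads_w)
    if n == 0:
        raise ValueError("at least one hourly load is required")
    hours = sum(1 for load in hourly_loads_w if load >= capacity_w)
    return 100.0 * hours / n


# Hypothetical four-hour series for one rack, in watts.
sample_loads_w = [5400.0, 5600.0, 6000.0, 5500.0]
sample_curve = load_duration_curve(sample_loads_w)
sample_t = percent_time_at_or_above(sample_loads_w, capacity_w=5600.0)
```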

B. RACK LEVEL COMPUTATIONAL RESOURCE ADEQUACY ANALYSIS
The loss of workload probability (LOWP) defines the probability that the IT load demand of a rack cannot be supplied by the cumulative power supply capacity of the rack-level PSUs, as explained in Section III-A. According to the IPSS architecture shown in Figure 1, each rack hosts ten servers connected to ten PSUs, where the PSUs are assumed to be identical with a rated power of 0.8 kW.
To calculate Pr LOWP of the rack in (1), two quantities are needed: the probability p j of having a certain cumulative power supply capacity after failures of the rack-level PSUs, and the percentage of time t j with IT load L. These are explained in the following sections.

1) AVAILABILITY CALCULATION METHOD OF PSU
The parameters needed to calculate the availability and unavailability of the rack-level PSU, i.e., the mean time to failure (MTTF) and the mean time to repair (MTTR), are taken from the IEEE Gold Book [43]. The IEEE Gold Book is a standard practice guide for industrial applications that includes reliability-related statistical data for common industrial equipment. It has therefore been assumed that the IEEE Gold Book data can also be used to represent similar new equipment that shares the same working principle. As an example, the PSU is not explicitly mentioned in the IEEE Gold Book; given that the working principle of industrial rectifiers is the same as that of the PSU, the data for industrial rectifiers have been used instead in this paper. The availability and unavailability of the PSU are given in (10), where t MTTF = 1960032 hr and t MTTR = 16 hr.
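Equation (10) is not reproduced in this text. The sketch below assumes the standard steady-state definitions, availability = MTTF / (MTTF + MTTR) and unavailability = MTTR / (MTTF + MTTR), which is the usual reading of these two parameters. The function names are hypothetical.

```python
def availability(mttf_hr, mttr_hr):
    """Steady-state availability p_a under the assumed standard definition."""
    return mttf_hr / (mttf_hr + mttr_hr)


def unavailability(mttf_hr, mttr_hr):
    """Steady-state unavailability q_u, computed directly to avoid cancellation."""
    return mttr_hr / (mttf_hr + mttr_hr)


# Values quoted in the text for the industrial rectifier used as a PSU proxy.
p_a = availability(1960032, 16)
q_u = unavailability(1960032, 16)
```

Computing q_u from its own ratio rather than as 1 - p_a keeps its precision, which matters because q_u is later raised to the third power.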

2) STOCHASTIC OUTAGES OF PSUs IN THE RACK
The number of available PSUs in the rack follows the binomial distribution, since all the PSUs are assumed to be identical [17], [41], [42]. The ''k-out-of-n'' configuration is used to assess the reliability of the rack-level PSUs, where k PSUs out of n should be available to serve the computational workloads [41], [42]. The expansion of the binomial equation is given in (11). The terms in (11) define the outage probabilities of the PSUs at different stages, which are summarized in Table 1. The related parameters, i.e., the availability $p_a$ and the unavailability $q_u$ of the PSU, are calculated as shown in (10).
$$(p_a + q_u)^n = p_a^n + n\,p_a^{n-1} q_u + \frac{n(n-1)}{2!}\,p_a^{n-2} q_u^2 + \dots + q_u^n = 1 \tag{11}$$
where $p_a$ and $q_u$ are the availability and unavailability of a PSU, respectively, and $n$ is the total number of PSUs. The minimum number of spare PSUs and servers needed to ensure a certain level of service availability can be found from the capacity outage table, shown in Table 1. The probability of four simultaneous PSU failures in a rack is much lower than that of the preceding cases, as shown in Table 1. Therefore, with an acceptable risk, the rack can be designed with three spare PSUs and servers; otherwise, a simultaneous failure of three PSUs will cause an outage equivalent to 2.4 kW of computational resources.
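The terms of the binomial expansion can be tabulated with a short script. The sketch below is illustrative only: it assumes the standard steady-state availability formula for (10), and the printed values are not claimed to reproduce Table 1. The function name `capacity_outage_table` is hypothetical.

```python
from math import comb

N_PSU = 10          # PSUs per rack (from the text)
PSU_RATED_KW = 0.8  # rated power of one PSU in kW (from the text)
# ASSUMPTION: availability follows the standard MTTF / (MTTF + MTTR) form.
P_A = 1_960_032.0 / (1_960_032.0 + 16.0)
Q_U = 1.0 - P_A


def capacity_outage_table(n, p_a, q_u, rated_kw):
    """Return rows of (failed PSUs, capacity out, capacity remaining, probability)."""
    rows = []
    for failed in range(n + 1):
        # Binomial term: exactly `failed` of the n identical PSUs are out of service.
        prob = comb(n, failed) * p_a ** (n - failed) * q_u ** failed
        rows.append((failed, failed * rated_kw, (n - failed) * rated_kw, prob))
    return rows


table = capacity_outage_table(N_PSU, P_A, Q_U, PSU_RATED_KW)
for failed, out_kw, remaining_kw, prob in table[:5]:
    print(f"{failed:>2} failed | out {out_kw:.1f} kW | "
          f"remaining {remaining_kw:.1f} kW | p = {prob:.3e}")
```

The row for three failed PSUs corresponds to the 2.4 kW outage and 5.6 kW remaining capacity discussed in the text, and the probabilities over all rows sum to one, as required by (11).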

3) LOSS OF WORKLOAD PROBABILITY (LOWP)
According to the load duration curve of the rack under consideration, the IT base load demand of the rack is about 5.4 kW, which persists for 100% of the measurement time, while the peak consumption lasts for less than 1% of the time, as depicted in Figure 11. In this case, if three of the PSUs fail simultaneously, the remaining power supply capacity of the rack will be 5.6 kW for serving the computational workloads, as shown in Table 1. Meanwhile, a load demand of 5.6 kW persists for 60% of the time in the rack, as shown in Figure 11. Therefore, the LOWP of the rack follows from (1), using the probability of this outage state from Table 1 and the 60% time share from Figure 11.
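The calculation described above can be sketched as follows. This is a hedged illustration rather than the paper's own evaluation: it assumes that (1) has the LOLP-like form of a sum of $p_j t_j$ over outage states, that the state probability is the corresponding binomial term of (11), and that $q_u$ follows from the quoted MTTF and MTTR. The function names are hypothetical.

```python
from math import comb


def outage_state_probability(n: int, failed: int, q_u: float) -> float:
    """Binomial probability that exactly `failed` of n identical PSUs are out."""
    return comb(n, failed) * (1.0 - q_u) ** (n - failed) * q_u ** failed


def lowp(state_probs, time_fractions) -> float:
    """ASSUMED form of (1): sum over outage states j of p_j * t_j."""
    return sum(p * t for p, t in zip(state_probs, time_fractions))


# ASSUMPTION: standard steady-state unavailability, MTTR / (MTTF + MTTR).
q_u = 16.0 / (1_960_032.0 + 16.0)

# Single outage state considered in the text: three of ten PSUs failed,
# leaving 5.6 kW, a demand level that persists for 60% of the time.
p_3 = outage_state_probability(10, 3, q_u)
lowp_three = lowp([p_3], [0.60])
print(f"p_3 = {p_3:.3e}, LOWP (three-PSU outage state) = {lowp_three:.3e}")
```

Passing several outage states and their time fractions to `lowp` extends the same sum to more than one failure scenario.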

4) USE CASES
The LOWP of the rack under consideration, though low, could be an issue for DC operators who handle latency-sensitive workloads (e.g., banks, financial institutions, cryptocurrency mining). DCs used for cryptocurrency mining might face financial loss and data security issues from losing even such a small amount of workload, as addressed in [44]. In this case, the proposed LOWP index could help DC operators ensure reliable operation of the computational resources. The LOWPs for all the racks in the DC, considering the simultaneous failure of any three rack-level PSUs, are shown in Figure 12. The same capacity outage table given in Table 1 is used for the PSUs, since all the PSUs are assumed to be identical. Because the LOWP depends on the computational workloads, its value is higher when a higher workload demand persists for a longer time. In Figure 12, the red dots depict the racks with the highest LOWP, whose workload demands exceed the remaining capacity (5.6 kW) for a long period of time. Meanwhile, the pink dots close to the horizontal line depict a lower probability of loss of workload, because the workloads of the servers in these racks are low compared to the remaining capacity. Thus, the LOWP index can be used for clustering the racks for latency-sensitive workloads: sensitive workloads should be served by the racks with low LOWP. Moreover, the LOWP index can be used for expansion planning of the rack-level computational resources, so that overprovisioning of the computational resources can be avoided.
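The rack ranking described in this use case could be sketched as below. The rack names and load samples are hypothetical placeholders, not values from the Google data set, and the per-rack LOWP is assumed to be the three-PSU outage probability multiplied by the fraction of time the rack load exceeds the remaining 5.6 kW.

```python
from math import comb

REMAINING_KW = 5.6  # capacity left after three PSU failures (from the text)
# ASSUMPTION: standard steady-state unavailability, MTTR / (MTTF + MTTR).
Q_U = 16.0 / (1_960_032.0 + 16.0)
P_THREE_FAILED = comb(10, 3) * (1.0 - Q_U) ** 7 * Q_U ** 3


def exceedance_fraction(load_kw, capacity_kw):
    """Share of samples in which demand is strictly above the capacity."""
    return sum(1 for x in load_kw if x > capacity_kw) / len(load_kw)


def rank_racks_by_lowp(rack_loads, p_state, capacity_kw):
    """Return (rack, LOWP) pairs sorted from lowest to highest LOWP."""
    scores = {
        rack: p_state * exceedance_fraction(series, capacity_kw)
        for rack, series in rack_loads.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


# HYPOTHETICAL rack power samples in kW, for illustration only.
rack_loads = {
    "rack_a": [5.4, 5.5, 5.5, 5.6, 5.4],
    "rack_b": [5.4, 5.7, 5.9, 6.1, 5.8],
    "rack_c": [5.5, 5.7, 5.5, 5.4, 5.6],
}

ranking = rank_racks_by_lowp(rack_loads, P_THREE_FAILED, REMAINING_KW)
for rack, value in ranking:
    print(f"{rack}: LOWP = {value:.3e}")
```

Racks at the head of the ranking would be the candidates for latency-sensitive workloads. The strict comparison treats a demand exactly equal to the remaining capacity as still being served, which is a modeling choice.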

V. DISCUSSION AND CONTRIBUTION
A novel reliability index called loss of workload probability (LOWP) is introduced in this paper to define the rack-level computational resource adequacy. The LOWP index is inspired by the power system reliability index called LOLP. It gives the probability that the rack-level computational workloads cannot be served by the servers due to stochastic outages of PSUs. The workload duration curves of the racks are formed from the power consumption of the servers in the rack, which is calculated using the power consumption models of the servers and local fans in the rack. The server utilization data of the Google data set are used to obtain the real power consumption of these components of the IT load. However, the utilization of other components of the IT load, e.g., memory, hard disk, and network equipment, is not considered here, since the CPUs consume most of the energy in a server. The scope of the LOWP in the operation of DCs is wide, since it can be used for computational resource expansion planning and for designing groups of servers for latency-sensitive workloads. The LOWP gives the number of spare PSUs and servers per rack needed to ensure a certain level of computational resource availability, which also leads towards the right-sizing of the rack-level computational resources. However, the spare servers will incur idle power, and the additional PSUs will increase the power loss of the IPSS. Additionally, the racks can be grouped based on their LOWP index for important, latency-sensitive workloads, which is also demonstrated in this paper for the racks modeled from the Google data set. However, some specific information about the groups or clusters of the servers is not given in the data set. Thus, ten servers are randomly chosen for modeling the rack-level power consumption and, hence, the workload duration curve.
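To illustrate how a workload duration curve is formed from a rack power time series, a minimal sketch is given below. The sample values are hypothetical, and the server and fan power models that produce the time series are outside the scope of this sketch.

```python
def load_duration_curve(load_kw):
    """Return (percent_of_time, load_kw) points sorted by decreasing load.

    Each point reads: the load is at least `load_kw` for `percent_of_time`
    percent of the samples.
    """
    ordered = sorted(load_kw, reverse=True)
    n = len(ordered)
    return [(100.0 * (i + 1) / n, value) for i, value in enumerate(ordered)]


# HYPOTHETICAL rack power samples in kW, for illustration only.
samples_kw = [5.4, 6.0, 5.6, 5.8, 5.5]
curve = load_duration_curve(samples_kw)

peak_point = curve[0]    # highest load and the share of time it lasts
base_point = curve[-1]   # load level that persists for 100% of the time
for percent, value in curve:
    print(f"{percent:5.1f}% of time: load >= {value:.1f} kW")
```

The last point of the curve is the base load, and reading the curve at a given capacity level gives the time share used in the LOWP calculation.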
The power consumption models of the IT loads and the major components of the IPSS are also presented in this paper as a function of server utilization. The power consumed by the components of the IPSS is considered as power loss in this study. The percentage of loss shows how much each component's power loss adds to the power needed to supply the IT load demand of a rack.
The total power loss of the PDUs is found to be higher than the power losses of the UPSs and PSUs. It is typically claimed in the literature that UPSs are less efficient than PDUs, which is true for a single device but not at the aggregated level. The overall performance of the IPSS is also analyzed at the aggregated level of the DC, where the IPSS consumes more than 10% of the rated IT load. However, the IPSS architecture analyzed in this paper does not consider the redundant power flow paths between UPSs and PDUs, or the conduction losses of the cables. The conduction losses of the cables are neglected because of the short distances between the IPSS devices.
The server utilization factors of the Google data set are fitted into the proposed power consumption models, which show a weekly pattern of the IT load in the time series. A similar weekly consumption pattern for hyper-scale DCs has been reported before. To the best of our knowledge, the data set is used for the first time in this research to model the energy consumed by IT loads and the IPSS without any scaling factor. It is computationally challenging to extract the time series of the utilization factor of individual servers, because the data set is split into 501 files and contains measurement errors. Some required information about the cluster and servers is not available in the data set; thus, some assumptions are made to model the racks with the reported servers.

VI. CONCLUSION AND RECOMMENDATIONS
With the increasing popularity of cloud-based services, the demand for computational resources in data centers (DCs) is also increasing. It is challenging to identify the amount of computational resources required to comply with the workloads. This paper has introduced a new index called loss of workload probability (LOWP) that defines the adequacy of the rack-level computational resources of the DC. This index can be used by the DC operator for efficient operation planning, since it can identify the group of server racks suited to latency-sensitive workloads. Meanwhile, DC owners can also benefit from this index, because it quantifies the minimum number of spare servers in a rack needed to ensure a certain level of service availability. With this approach, overprovisioning of computational resources could be avoided, which lowers the operational cost and supports energy-efficient operation of the DC.
A modular modeling approach for the IT loads and the internal power supply system (IPSS) is presented in this paper. The modeling approach can also include other load sections, e.g., cooling loads, whose energy demand also depends on the IT loads. The energy losses of the UPS, PDU, and PSU in the IPSS are also analyzed, which shows that the PDUs consume more energy than the UPSs and the PSUs at both the rack level and the aggregated DC level. Moreover, the IPSS consumes more than 10% of the rated IT load (10.07 MW) at the aggregated level of the DC, which is a significant amount of energy. Reducing the number of PDUs, and hence limiting the overall power losses of the IPSS, will improve the overall efficiency of the DC.