Reliability Assessment of a Large Diesel Generator Fleet

PowerSecure (PS), a subsidiary of Southern Company, operates over 1600 microgrids including a large fleet of standby diesel generators. Machines range from 125 to 2800 kW. There are two missions: standby power after utility failure and load management when the utility is available. This report updates generator reliability data published over 20 years ago. These results are the first derived from uniform records from a distributed data acquisition system. The unique aspects of this effort include automated data collection, analysis, and storage, producing a standardized record for all machines and events. This greatly reduces the effort required to prepare reports and analyses. The automated collection largely eliminates human errors in data entry and prevents post hoc adjustment of mission success or failure. We are not aware of previously published generator data using automated data acquisition. This study counts all failures and excludes none, whereas previous studies excluded more than half. This analysis expresses reliability as experienced by the customers. The number of machines, years in service, and years of operation greatly exceed the sum of previously published studies, enabling calculation of useful confidence limits. The fleet is heterogeneous, with machines from over a dozen manufacturers and many power ratings. Machine reliability was consistent across brands, and there is no significant difference in reliability for different size machines. This common assumption was not supported by the prior studies. High-quality data enables reliability growth management practices. The fleet reliability for outage demands of all durations increased from 95% in 2011 to 98% in 2014–18. Analysis of failure arrival times shows that the failure rate is strongly dependent on mission duration. The failure rates after 14 h are approximately 1% of the value for the first 30 min of operation. This has important implications for designing systems to meet specified reliability targets, and in interpreting and comparing results from fleets with different mission durations.


I. INTRODUCTION
AS OUR society grows ever more dependent on continuous availability of high-quality electric power, critical facilities are increasingly protected by arrays of standby generators. A defensible design of a system intended to reach or exceed a well-defined reliability objective is impossible without data on component reliability. Previous efforts to provide this data were based on surveys, which required intensive labor and suffered from reporting errors, selection bias, and limited numbers of trials and/or failures.
MTechnology, Inc. (MTech) was retained to determine the reliability of the PS fleet of nearly 2000 standby diesel generators. MTech was given unrestricted access to the database of several hundred thousand records. Every start, stop, repair, and maintenance activity has been recorded in two redundant network operation centers since 2007.
End users are provided with standby power during utility outages (standby demands), whereas their associated utilities can use the machines as nonspinning substitutes for spinning reserves (load management demands). The load management applications include demand response, coincident peak capacity markets, automatic frequency response, and generation during peak spot market prices.
The standby mission requires successful generator start after utility failure and provision of power to the load until utility power returns. The load management missions include generator start, synchronizing and operating in parallel with the local utility, and safe disconnect from the utility to conclude the mission. Both standby and load management missions can be of any duration.
A benefit of the dual missions is that the average machine runs 40 h/year. This is a more thorough test than typical 1-h/month standby machine test protocols. Frequent turnover of stored fuel greatly reduces the contribution of aged and contaminated fuel to failures.
The fleet has grown from 1874 load management (LM) demands in 2007 to 15 560 in 2018. The standby demands have grown from 343 to 3264 over the same period.
We began by assessing the reliability of the fleet in 2011. At that time the reliability for both missions was 95%, consistent with previously published studies [1], [2].
A reliability growth management program was instituted in 2012. By 2014, the fleet reliability had increased to 98%. This article utilizes data from January 1, 2016 through February 28, 2019 [8]. This work does not report on fleet availability, the long-term average fraction of time that the diesel engine generators are in service and satisfactorily performing their function.
The cause(s) of machine failures are not analyzed here. Failure to provide acceptable power for the duration of the mission for any cause constitutes failure. Failure causes include but are not limited to infant mortality, fuel starvation, fuel exhaustion, fuel contamination, operator errors, wear, failures in system controls, extreme weather events, fires, configuration errors, spurious trips of protective relays and circuit breakers, animal intrusions, damage from projectiles, shutdown of the engine due to control or sensor malfunction, and failures in the communications links that provide the signal to start. Failures where operators were able to restart machines were counted as mission failures.

II. PREVIOUS STUDIES
Several previous studies were based on records from the US nuclear power stations, which employ emergency diesel generators (EDGs) to provide power when offsite utility power is not available.
In 1986, Wyckoff at EPRI published a survey of 75 nuclear units at 54 stations for the years 1983 through 1985. He evaluated a fleet of 154 EDGs and 454 EDG-years of experience [1]. Data were self-reported by operators, with four plants providing no or partial information. Record keeping varied between utilities, and adjustments were required in classifying EDG missions. The effort required to produce the data from operating records was estimated to be 100 to 200 h/unit, or roughly three person-years spent collecting and transcribing data into a common reporting format.
There were multiple post hoc exceptions to failures. If the machine failed a start but the cause was any of the following, the failure was excluded: 1) operating error that did not preclude restart; 2) incorrect trip signal; 3) malfunction of equipment that would not be operative in emergency mode; 4) minor oil and fuel leaks. Demands and failures were also excluded if they were conducted when the EDG had been declared out of service or was operated for troubleshooting and diagnostic purposes. Retries after the initial failure were not counted. Load-run tests that failed in less than 1 h were counted as start failures and not load-run mission failures.
The study apparently excluded 292 incidents where the machine suffered a failure but was later judged able to restart and/or be manually loaded in an actual emergency. This left 183 start and load-run failures.
Table VIII summarizes the results. Wyckoff did not present unreliability with excluded failures; they are included here for the purposes of comparison. Unreliability of unplanned demands is higher than that of test demands, but the small number of unplanned events limits the confidence in this result. Excluding most failures increased the reported reliability, as shown in Fig. 13.
Failures were classified as either failure to start (FTS), failure to run (FTR), or maintenance out of service (MOOS). MOOS was further classified as to whether or not the plant was in a shutdown condition at the time of the demand. In addition, recoveries of EDG trains from failures during unplanned demands were identified. The unreliability estimate includes consideration of recovery of EDG train failures. If recovery was excluded, EDG train unreliability was 6.9%.
Failures caused by MOOS were higher than anticipated, accounting for approximately 45% of the unreliability.
After MOOS, the major contributors to unreliability were electrical hardware malfunctions. The authors noted that the failures during unplanned demands "appeared to be difficult for operators to diagnose and recover." This finding suggests that operators working under the stress of an actual unplanned demand might not perform as well as Wyckoff assumed when excluding failures.
The INEL study identified substantial variations in the failure rate as a function of time. The failure rate in the first 30 min of operation was nearly 10 times higher than the failure rate for 30 min to 14 h. The failure rate for runs over 14 h was 1% of the failure rate for the first 30 min. See Table I.
Cooling system failures dominated early failures, whereas the electrical or fuel subsystems participated in half of the failures between 0.5 and 14 h. Beyond 14 h the only failures observed occurred in the electrical subsystem.
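The practical effect of a time-dependent failure rate can be illustrated with a piecewise-constant hazard model. The sketch below is not taken from the INEL report: the three hourly rates are assumptions built from the ratios quoted above (roughly 10:1 and 100:1) and anchored on the INEL long-run value of 2.5E-4/h cited later in this article. The function name `run_reliability` and the interval table are hypothetical.

```python
import math

# Illustrative hourly failure rates (assumed, not the Table I values).
RATE_LONG = 2.5e-4              # beyond 14 h, INEL value quoted in this article
RATE_EARLY = RATE_LONG / 0.01   # first 0.5 h: long-run rate is about 1% of this
RATE_MID = RATE_EARLY / 10      # 0.5 h to 14 h: about one tenth of the early rate

# (start hour, end hour, failures per hour)
INTERVALS = [
    (0.0, 0.5, RATE_EARLY),
    (0.5, 14.0, RATE_MID),
    (14.0, math.inf, RATE_LONG),
]


def run_reliability(mission_hours, intervals=INTERVALS):
    """Probability of no failure to run over the mission, given a successful start."""
    cumulative_hazard = 0.0
    for start, end, rate in intervals:
        # Hours of the mission that fall inside this interval.
        overlap = max(0.0, min(mission_hours, end) - start)
        cumulative_hazard += rate * overlap
    return math.exp(-cumulative_hazard)


for hours in (0.5, 14.0, 48.0):
    print(f"{hours:5.1f} h mission: run reliability {run_reliability(hours):.4f}")
```

Under these assumed rates, most of the cumulative hazard is accrued in the first 14 h, so extending a mission from 14 h to 48 h changes the result only slightly.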
IEEE Std 493-2007 [3] (popularly known as "The Gold Book") presents failure rates for standby diesel generators in Tables IV-X. Footnote 2 of this  [4] presents failure rates for standby diesel generators in Table II. This information is also from the PREP and includes additional data collected since the publication of the 2007 Gold Book. See Figs. 14 and 15.
Smith et al. [5] presents failure rate information on auxiliary standby diesel and package standby diesel generators. The data source was a US Army Engineering and Housing Support Center study of the reliability, availability, and maintainability characteristics of diesel and gas-turbine power systems.
As with Wyckoff [1], Smith reported that significant effort was required to collect, organize, and report the necessary data. Plants were from a variety of applications including airfields, military bases, hospitals, utilities, cogenerators, and computer and communications. Where possible, plants with at least ten years of operating history were selected. A total of 22 plants participated in the survey, representing 20 standby diesel machines (see footnote 2) and 277 standby unit-years. See Figs. 16 and 17.

III. MISSIONS AND SYSTEM BOUNDARIES
Diesel engine/generators are complex systems with many components and interfaces. The system boundary diagram is shown in Fig. 2. Most but not all machines in the PS fleet operate with two distinct missions. The load management missions require the machines to start, accelerate to synchronous speed, synchronize with the utility, close the generator output breaker, and gradually load the machines to a specified power level over a period of approximately 30 s or more. At the conclusion of the event the machine power is gradually reduced to zero, the output breaker is opened, and the engine is cooled down and then shut off.
Load management missions are generally scheduled in advance although utility emergencies and frequency support applications cause unscheduled demands. The utility is always present during load management demands; the machines run "grid connected." The gradual loading and relatively long run times of the LM events are roughly analogous to the nuclear power plant 24-h run tests using slow starts.
The load management demands often occur when the utility is being stressed by ambient temperature extremes, generating station outages, transmission outages, and other factors. This adds an additional challenge to load management missions.
The standby missions are unscheduled and begin when the utility service to the site fails. All standby missions operate with machines in "island mode." The machines must start, accelerate to synchronous speed, synchronize with other standby machines at the site, close the output breaker, and power customer load. Many sites have one machine, but sites with as many as 40 machines are in service. Multiple machines generally synchronize while accelerating and load in less than 9 s. Standby missions exclude momentary voltage nonconformities and brief power interruptions. If utility power is restored in less than 10 s, the event is not defined as a standby demand and no record of mission success or failure is generated.

IV. DESCRIPTION OF DATA COLLECTION
Most PS customers elect to participate in the remote monitoring program. The system information is sent to two independent, geographically separated operations centers. Centers are staffed 24/7 by dispatchers and analysts who respond to utility outages, generator alarms, emergency events, and severe weather threats. Data collection technology consists of multiple sensors integrated into each system and software that logs, transmits, and receives data across an encrypted private cell network. Multiple services receive, decrypt, and decode the data into a common database. All data points from each system component are stored for subsequent analysis and trending.
Footnote 2: Results from continuously operated diesels and gas turbines in the Smith paper were excluded from this report.
Dispatchers review every demand and classify the mission as a success or failure. Most sites with multiple machines use redundancy to improve system reliability. Failure of a single machine does not adversely affect customers. Some sites employ up to 40 machines for the LM applications, and failure of one or several machines does not compromise the mission.
In this study we compared run times for each machine during a "failed" demand to identify the failed machine(s) and scored the machines at the site that completed the mission as successes.
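The run-time comparison can be sketched as follows. The article does not state the exact rule used, so the rule here is an assumption for illustration only: a machine is scored as failed when its recorded run time falls short of the longest run time at the site by more than a tolerance. The function name `score_machines`, the machine IDs, and the 0.1 h tolerance are all hypothetical.

```python
def score_machines(run_hours, tolerance_hours=0.1):
    """Score each machine at one site for one demand flagged as failed.

    run_hours maps a machine ID to its recorded run time in hours.
    Assumed rule: a machine that ran noticeably less than the
    longest-running machine at the site is scored as a failure.
    """
    longest = max(run_hours.values())
    return {
        machine: "success" if longest - hours <= tolerance_hours else "failure"
        for machine, hours in run_hours.items()
    }


# Hypothetical three-machine site: G3 tripped early, the others completed the run.
print(score_machines({"G1": 3.0, "G2": 3.0, "G3": 0.4}))
```

A rule of this kind cannot identify the failed machine when every machine at the site stops at the same time, so such demands would still need dispatcher review.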
Additional services distribute information supporting fuel tracking, system diagnostics, remote system control, historical trending, and proactive response to storm warnings.

A. Fleet Composition
As of February 28, 2019, the PS fleet consisted of 1956 machines of all major brands. Power ratings range from 125 to 2800 kW, as shown in Fig. 1. Machines rated 500, 600, and 625 kW comprise 85% of the fleet.
PowerBlocks are based on Volvo Penta heavy truck engines mated with a generator, controls, switchgear, integral fuel and

B. Cohorts for Analysis
The machines were grouped into five cohorts for analysis.
A separate "Smith Cohort" was constructed for comparison of machines to the study by Smith et al. [5]. The cohort contained 791 machines rated from 600 to 1800 kW.
Two separate cohorts were created for comparison to the IEEE Std 493 (Gold Book) and IEEE Std 3006.8 studies:

VI. SUMMARY FLEET STATISTICS
The fleet accumulated 5237 machine-years in service in the period January 1, 2016 to February 28, 2019. Load management and standby operations totaled 24.5 years running time.
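As a quick consistency check, the two totals above can be converted to average running hours per machine-year and compared with the 40 h/year figure given in the Introduction. The conversion factor of 8766 h per year (365.25 days) is an assumption; the article does not state which value was used.

```python
HOURS_PER_YEAR = 8766   # 365.25 days x 24 h (assumed conversion)

machine_years = 5237    # machine-years in service, from the text
running_years = 24.5    # cumulative running time in years, from the text

running_hours = running_years * HOURS_PER_YEAR
hours_per_machine_year = running_hours / machine_years

print(f"{running_hours:,.0f} h total, "
      f"{hours_per_machine_year:.1f} h per machine-year")
```

The result is about 41 h per machine-year, in line with the stated average of 40 h/year.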
A total of 829 failures were observed during this period. Failures with a recorded total run time of zero were classified as failure to start (FTS); all other failures were classified as FTR.

A. Failure Modes
FTS events are failed demands with zero recorded run time. Failures with nonzero run time are FTR events.
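The classification rule, and the two statistics derived from it, can be written as a short sketch. The function names and the sample numbers are hypothetical; only the zero-run-time rule comes from the text.

```python
def classify_failure(run_time_hours):
    """Return "FTS" for a failure with zero recorded run time, else "FTR"."""
    return "FTS" if run_time_hours == 0 else "FTR"


def summarize(failed_run_times, total_demands, total_run_hours):
    """Compute FTS probability per demand and FTR rate per operating hour."""
    fts = sum(1 for t in failed_run_times if classify_failure(t) == "FTS")
    ftr = len(failed_run_times) - fts
    return {
        "fts_per_demand": fts / total_demands,
        "ftr_per_hour": ftr / total_run_hours,
    }


# Hypothetical cohort: four failures in 400 demands and 1000 operating hours.
print(summarize([0, 0, 0.2, 5.0], total_demands=400, total_run_hours=1000))
```

FTS is normalized by the number of demands, while FTR is normalized by operating hours, which is why the two are reported on different axes in the figures.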

B. Load Management Fleet Performance
Figs. 3 and 4 show the probability of failure to start and the hourly failure rate, respectively, for load management events.  The error bars in these and all subsequent figures represent 90% confidence limits. The smaller number of machines rated 700-1750 kW results in larger confidence limits.
All ratings except the 125-470 kW group are consistent with a 1% probability of failure to start per LM demand.
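The article does not state which method was used to compute the 90% confidence limits. One common choice for a failure probability per demand is the exact (Clopper-Pearson) binomial interval, sketched below using only the standard library. The function names and the example counts are hypothetical.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)


def _solve(func, target):
    """Bisection on [0, 1] for a function that decreases as p increases."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if func(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(failures, demands, confidence=0.90):
    """Two-sided exact confidence limits for a failure probability per demand."""
    alpha = 1.0 - confidence
    lower = 0.0 if failures == 0 else _solve(
        lambda p: binom_cdf(failures - 1, demands, p), 1.0 - alpha / 2)
    upper = 1.0 if failures == demands else _solve(
        lambda p: binom_cdf(failures, demands, p), alpha / 2)
    return lower, upper


# Hypothetical cohort: 12 failures to start in 1000 demands.
low, high = clopper_pearson(12, 1000)
print(f"90% limits: {low:.4f} to {high:.4f}; contains 1%: {low <= 0.01 <= high}")
```

A cohort is "consistent with 1%" in this sense when 0.01 lies between its lower and upper limits; cohorts with few demands produce wide intervals.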
There is greater variation in the failure to run rates. As discussed below, the failure to run rate is a strong function of the mission duration.

C. Standby Fleet Performance
Figs. 5 and 6 show the probability of failure to start and the hourly failure rate, respectively, for standby demands. The 600-625 kW machines are powered by Volvo heavy truck engines that produce a lower FTS probability than any other cohort.
The failure rate for all durations of standby missions is less than 0.2% per hour for every cohort except the 700-1750 kW machines. The small population of these machines means that one or two additional failures have a large impact on the observed failure rate.

D. FTS Unreliability of Different Sized Machines
The probability of FTS is approximately 1% for most machines. The nuclear power plant EDGs in the INEL study differ from most of the fleet analyzed in this article. Some nuclear EDGs are of opposed-piston design, with an upper and lower piston in each cylinder. Nuclear EDGs are generally medium voltage, with power ratings from 4500 to 6000 kW, and usually operate at 600-1200 r/min instead of 1800 r/min. Most nuclear EDGs are started with compressed air, whereas all PS machines are started with battery-powered electric starters.
Despite these differences, Fig. 8 shows that the INEL results are similar to the PS fleet. There is also relatively little variation in FTS in the PS fleet with rated power. Most machines show slightly higher FTS probability in load management applications, possibly due to the extra steps required to synchronize with the utility.

E. Failure Rates Versus Time
The INEL report found a strong dependence of the FTR failure rate on run time. The PS fleet exhibited the same behavior but was significantly less likely to fail at all times.
Unlike the FTS data, there were significant differences in the early (0 to 0.5 h) failure rates for LM versus standby events.
The vertical scale in Fig. 9 is four times that of Fig. 10. In both figures, one group of machines had zero failures. These groups are shown with error bars reflecting the upper 90% confidence limit.
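When no failures are observed, the point estimate of the failure rate is zero but an upper confidence limit can still be computed. The sketch below assumes a constant failure rate (Poisson model), which is an assumption and not necessarily the method used for the figures. The function name and the example hours are hypothetical, and the default corresponds to the upper end of a two-sided 90% interval.

```python
import math


def zero_failure_upper_rate(operating_hours, one_sided_confidence=0.95):
    """Upper limit on the hourly failure rate after zero failures.

    Solves exp(-rate * hours) = 1 - confidence for the rate.
    """
    return -math.log(1.0 - one_sided_confidence) / operating_hours


# Hypothetical cohort with 1000 failure-free operating hours.
print(f"{zero_failure_upper_rate(1000):.2e} failures per hour")
```

At 95% one-sided confidence the limit is approximately 3 divided by the operating hours, so a cohort with few hours can show a large upper error bar despite having no failures.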
Once the machines have been running for more than 30 min, the differences between the LM and standby missions are much smaller, so the results are plotted for both types of mission in Fig. 11. There was only one failure in 1031 demands and 58 589 operating hours for missions longer than 14 h. The most likely failure rate is 1.7E-5/h, with 90% confidence limits from 1.6E-5 to 6.4E-5/h. The corresponding INEL failure rate is nearly 15 times larger at 2.5E-4/h, with 90% confidence limits from 2.4E-4 to 7.3E-4/h. See Fig. 12. This has important implications for designing systems to meet specified reliability targets [7].
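The point estimate and the ratio quoted above follow directly from the counts in the text, as this short check shows. Only the point estimates are reproduced; the confidence limits are not recomputed here because the method used is not stated.

```python
failures = 1
operating_hours = 58_589    # PS fleet hours for missions longer than 14 h

ps_rate = failures / operating_hours   # failures per hour
inel_rate = 2.5e-4                     # INEL rate quoted in the text
ratio = inel_rate / ps_rate

print(f"PS rate: {ps_rate:.1e} /h, INEL/PS ratio: {ratio:.1f}")
```

The ratio of about 14.6 corresponds to the "nearly 15 times" stated in the text.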

VIII. COMPARISON WITH PREVIOUS STUDIES
The PS fleet is substantially larger than those in previously published analyses. Both the number of machines and the fleet total years in service are several times the combined total of the previous studies listed. The PS analysis did not exclude failures for any reason, whereas most previous studies excluded failures for a variety of reasons.
The results shown in Fig. 13 are in general agreement with Wyckoff [1] for total unreliability, but the start and FTR contributions are different. This is due in part to Wyckoff classifying any failure in the first hour as a start failure, and excluding over half of all failures. If failures excluded in that study are included the PS fleet is significantly less likely to fail. No failures were excluded from the PS dataset.
There is a large change in failure rate as a function of mission duration. This had been previously reported by INEL [2]. Both INEL and this study demonstrate a substantial reduction in failure rate after the first 30 min, and after 14 h. Table VII shows the failure rates for the three time intervals defined in the INEL report. Fig. 7 shows the same results with the error bars representing 90% confidence intervals. The PS fleet failure rates are significantly lower for all periods. There was only one failure in 1047 demands lasting more than 14 h, so the uncertainty for long runs is large.
The 2007 Gold Book [3] and IEEE 3006.8 [4] presented failure data in terms of period time (calendar time in service) rather than operating time, as shown in Fig. 14. The data was collected via surveys and presented for two overlapping power ranges, 250-1500 and 750-7000 kW. The PS fleet data for machines rated 250-1500 kW is consistent with the early Gold Book data but about 40% lower than the revised dataset published in IEEE 3006.8.
In the higher power cohort from 750 kW up to 7000 kW the PS fleet has significantly lower failure rates, shown in Fig. 15. The largest machine in the PS fleet is rated 2800 kW.
Smith et al. [5] also reported failures in terms of period hours. The machines were rated from 600 to 1800 kW; they distinguished packaged systems from those with auxiliary support systems. See Fig. 16.
The PS fleet has only 6% to 8% of the failure rates reported for Aux Standby and Package Standby machines in Smith et al. [5]. The difference may be due in part to the amount of operating time on each machine, and the small number of machines in the      The smaller number of operating hours implies a greater number of starts for the PS fleet, and more time spent in the first 14 h of run time, where failure rates are much higher. This is supported by the data shown in Fig. 17, where a comparison based on operating hours shows significantly lower failure rates for Smith. The rapid decline in failure rates for longer missions skews comparisons between fleets with major differences in annual operating hours.

IX. RELIABILITY GROWTH MANAGEMENT
A program of reliability growth management (RGM) was instituted for the PS fleet in early 2012. Monthly, quarterly, and annual fleet statistics are reviewed and common failure modes are identified.
Failures are classified by probable origin: component, design, manufacturing, installation and commissioning, and operations. The reports for relevant failures are submitted to the responsible organization. The size of the fleet allows rapid identification of relatively rare failure modes. Component defects must be addressed by the supplier; design defects by the engineering organization; manufacturing defects by factory personnel; and installation, commissioning, and operating errors by the field service organization.
This program creates a set of feedback loops where each team within the enterprise strives to reduce or completely eliminate defects that have caused failures.
The result has been a substantial reduction in fleet unreliability for all demands, both load management and standby.

X. FUTURE WORK
Service data will be reviewed to assess the following: 1) cause(s) of failure; 2) mean time to repair; 3) machine availability; 4) effectiveness of restoration efforts after failure; 5) effect of MOOS; 6) exclusion of certain types of failures; 7) changing duration of FTS failures; 8) calculating utility outage rates. This study defined "failure to start" as a machine failure resulting in zero recorded run time. Alternate definitions should be considered, such as machine failure in less than 30 s, or failure before the machine is fully loaded in LM demands.
Utilities provide data regarding outage frequency and duration with the SAIFI and SAIDI metrics. These are of limited use in gauging the need for standby power in many applications because only outages longer than 5 min are included [6]. The PS fleet database could be used to extract utility outage rates from the standby demand data.