Driving cycle synthesis, aiming for realness, by extending real-world driving databases

In order to transform conventional buses into electric ones, exact knowledge of the energy consumption of the vehicles is essential. Furthermore, for a proper design of the transition and to avoid inefficiencies and excessive costs, this information must be adjusted to real operating scenarios. However, a recurring problem in this context is the lack of data to address all these issues. Previous studies have focused on the use of standard driving cycles or on the synthesis of cycles from a single route. This paper presents a methodology for extending real-world driving databases to perform massive simulations, thereby narrowing the confidence interval of estimates. As a case study, the method was applied to a municipal bus operator’s database in a project to assess the feasibility of retrofitting a diesel to an electric bus. The proposed framework is useful for generating a valid database for research on energy consumption distribution and powertrain optimization, as well as to support public transport bus operators and manufacturers.


I. INTRODUCTION
Electric buses are the paradigm of 'green' vehicles and there is a goal that, no later than December 2025, they will make up at least 45% of public fleets [1]. Electric powered vehicles come along with a number of benefits, such as superior drivetrain efficiency, lower local emissions, less noise et cetera. Actually, widespread electrification of transport is absolutely necessary to achieve the ambitious goals of environmental policies [2]. However, the majority of city buses are still combustion engine/diesel vehicles since their overall costs are comparably lower [3]. Thus, for cost efficiency, not only new ecological buses should be put into service, but conventional ones should also be retrofitted to electric (EV) or hybrid electric vehicles (HEV). As the task is very complex, there is a strong motivation to build tools and methods to enhance the development and optimization process of battery electric buses (BEB).
However, the changeover from conventional diesel buses to 'clean vehicles' is a complex problem. It has to be kept in mind that we are talking about vehicles that often have to operate more than 10 h and over 200 km daily. The design requirements may also change completely depending on the characteristics of the city, as the range of the vehicle has to match the travelers' needs [4]. Obviously, a good understanding of the specific use case is essential to provide an appropriate technological solution that ensures efficient vehicle fleet operation [5]. To this end, simulations are an adequate tool to explore the multiple possibilities and options for infrastructure design and drivetrain layout. So-called 'driving cycles', which are synthetic 'speed vs. time' profiles that mimic typical vehicle velocity patterns, are often used in this field as the input data to the simulators. Several techniques exist for producing sets of synthetic cycles from scarce real-world data. Of these, probably the most popular is the so-called 'micro-trip cycle construction' approach. Micro-trips are defined as driving intervals between two consecutive stops, with or without the idle periods. In general, micro-trip cycle construction comprises four steps: (1) measure real-world driving data with an instrumented vehicle, (2) segment the driving data into micro-trips, (3) construct as many cycles as desired by concatenating micro-trips according to a predefined criterion, and (4) evaluate the final results. For the reader's information, a survey of variants of this technique is presented in Section I-A. Whatever the approach chosen, the original measured data should ideally cover all cases and the whole parameter space. Most researchers, however, use a limited number of carefully selected routes in their experimental work, which can be justified because data collection campaigns are time and cost intensive.
Consequently, previous contributions do not address the inherent problems that arise when dealing with a fleet operator's own database, including the lack of both plausibility of data and synchronization between the measurements. Furthermore, valuable information is often sensitive, and therefore classified, or at least it is not available in the needed quantity and quality.
A novel aspect of the present work is that we analyze a large set of measurements taken on a fleet of conventional internal combustion buses, which operate under real conditions. This paper describes the difficulties encountered in working with such a large-scale measurement database. The proposed approach, with a strong engineering component, was initially motivated by a feasibility study on the decision to retrofit city buses reaching the end of their operational life, with the purpose of transforming them into BEBs. Along these lines, the paper presents an appropriate methodology for the exploitation of the data and their use to evaluate the performance of electric vehicles. After conducting an in-depth analysis of our large real-world driving data set, we study which variables are most representative, how realistic synthetic driving cycles can be generated, and how to use them to investigate energy demands through massive simulations. The paper is organized as follows: after providing background in Section I-A, Section II reviews the data collection process and the methodology used in the research. In Section III, we present several experimental results illustrating the performance of the proposed approach. Finally, Section IV presents the conclusions.

A. RELATED WORK
The importance of incorporating real-world driving conditions into the design of energy management control strategies is well-known [6]-[10]. Traffic agencies have thereby proposed standard driving cycles (SDCs), also named standardized on-road test cycles (SORTs), which include, among others, specific vehicle speed versus time profiles. These profiles are useful for quick comparisons but allow only vague estimates of energy economics or emissions. The Manhattan Bus Cycle (MBC), the Braunschweig City Driving Cycle (BCDC) or the European Transient Cycle (ETC) are some of the most popular ones in the area of heavy duty vehicles [11]. However, standardized driving cycles have a number of disadvantages, which makes their use for electric bus design more than questionable:
• Most of them have been designed primarily for passenger cars or vehicles using roller platforms, but not for urban public transport buses.
• The few bus driving cycles available were designed in the 1980s or 1990s for diesel vehicles.
• They are based on local scenarios that often differ from the actual use case.
• They have not always been created on the basis of measurement campaigns and therefore lack real driving characteristics.
For the above reasons, it is of great interest to develop more appropriate methods to synthesize realistic driving cycles. This is, in many ways, a complex task since, obviously, actual driving profiles depend on environmental conditions, traffic situations, driving styles and other related aspects [12], [13]. Furthermore, since many different use cases have to be taken into account, the development of methods that assess the quality of speed profiles, in a reproducible way, is another challenge [14]. In the following paragraphs we will review the extensive existing bibliography, citing some of the most relevant references.
Ho et al. [15] showed the shortcomings of SDCs in assessing both emissions and energy consumed in road transport. To alleviate this deficiency, they developed a representative driving cycle by concatenating micro-trips, which were synthesized from actual measurements in Singapore. For this purpose, a chase car with measuring equipment followed the target vehicle, adapting its driving maneuvers to those of the latter. Though the chase car method can provide locally representative data, it requires a lot of time and resources. Alternatively, B. Zhang et al. [16] collected real driving data in the city of Dalia using GPS trackers. After sequencing the kinematic parameters and performing component and cluster analyses, they created a representative driving cycle, parameterized by a certain correlation coefficient, for typical Chinese cities. A similar approach was also presented by K.S. Nesamani et al. for Chennai in India [17]. Zhao et al. [18] proposed a micro-trip cycle construction method in which principal component analysis was employed to reduce the number of kinetic parameters and micro-trips were classified into 'congested', 'unimpeded' and 'stable flow' segments. In this regard, Zhou et al. [19] offer a comprehensive review of existing driving pattern recognition techniques based on artificial intelligence. A. Lajunen et al. [20] carried out simulations of energy consumption on six different bus routes, of which five were SDCs and only one was a real measured track. A cost-benefit analysis of electric and hybrid buses was also given. A complementary approach, based on the prediction of the driver behavior and the speed profiles, was presented in [21]. In most of these previous works, driving cycles were generated using data from passenger cars, which limits their validity when transferring them to other use cases such as studies focusing on buses.
Recently, the generation of large-scale synthetic profiles based on stochastic techniques has also been a subject of research interest. As a representative approach, K. Kivekäs et al. [22] used Monte Carlo methods to artificially generate a large number of driving cycles from a single suburban bus route. Based on this synthetic data set, a drivetrain comparison and a passenger-load sensitivity analysis were carried out. They also presented another method for synthesizing driving cycles in [23], which was used for analyzing the energy consumption of BEBs. In both cases, the cycle synthesis was performed segment-wise and was based on randomly chosen sections of a measured route. Their approach required the manual inclusion of bus stops. Moreover, environmental traffic conditions and crossings were not taken into account in this work.
Finally, a complementary approach to generate driving cycles, with the motivation of running realistic simulations to analyze drivetrain designs and/or evaluate vehicle energy economy, is to tackle the problem through Markov matrix formulations. For example, Zhao et al. [24] focused on developing a methodology for testing route selection. Data were obtained by chasing vehicles in Xian city and, to increase accuracy, also by on-board measurements. The construction of the driving cycle was based on Markov and Monte Carlo methods. The typical methodology is exemplified by G. Souffran et al. [25], [26]. These works questioned the quality of SDCs and proposed the following approach for modeling real-world vehicle missions: driving patterns were first characterized by three time-dependent variables, namely, vehicle speed, acceleration and road slope. Secondly, this definition of the vehicle state was used to calculate the probability of the next one and thus create a complete, new vehicle mission. This approach was mainly used to compare the fuel consumption associated with the sizing of different propulsion chains, where the new European standard driving cycle (NEDC) was the gold standard. Brady et al. [27] also generated driving cycles by using Markov process theory, where in this case only two state variables, i.e., vehicle speed and acceleration, were used to represent vehicle dynamics. A deep analysis of a hyper-heuristic Markov chain evolution approach for driving cycle generation was also presented by M. Zhang et al. [28]. Their work underlined the importance of generating representative driving cycles for the automotive industry and introduced a method that handles multiple-parameter driving cycles, thus improving vehicles' adaptability.
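To make the Markov-based idea concrete, the following minimal Python sketch fits a first-order chain to a measured speed trace and samples a new cycle from it. It is an illustration only and not the implementation of [24]-[28]: the state is simplified to a discretized speed (the cited works also use acceleration and, in some cases, road slope), and the function names `fit_transitions` and `generate_cycle`, the 5 km/h bin width and the restart-at-standstill rule are assumptions made for this example.

```python
import random
from collections import defaultdict


def fit_transitions(speeds_kmh, bin_width=5.0):
    """Count transitions between consecutive discretized speed states."""
    states = [int(v // bin_width) for v in speeds_kmh]
    counts = defaultdict(lambda: defaultdict(int))
    for s, s_next in zip(states, states[1:]):
        counts[s][s_next] += 1
    return counts


def generate_cycle(counts, n_steps, bin_width=5.0, seed=0):
    """Random walk over the fitted chain, starting from standstill (state 0)."""
    rng = random.Random(seed)
    state, cycle = 0, [0.0]
    for _ in range(n_steps - 1):
        successors = counts.get(state)
        if not successors:
            # Assumption: a state without observed successors restarts at standstill.
            state = 0
        else:
            nxt, weights = zip(*successors.items())
            # Sample the next state with probability proportional to its count.
            state = rng.choices(nxt, weights=weights)[0]
        cycle.append(state * bin_width)  # lower edge of the speed bin, in km/h
    return cycle
```

A generated cycle only contains transitions that were observed in the measured trace, which is the property that makes such chains attractive for synthesizing plausible profiles.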

II. MATERIAL AND METHODS
Operation measurements for 11 consecutive days, from June 24 to July 4, 2019, were recorded on 30 conventional diesel city buses of two different lengths (standard rigid buses of 12 m long and 18 m long articulated buses). The vehicles were equipped with data acquisition units to monitor up to 77 variables. These included 'major' physical quantities such as vehicle velocity, engine speed, engine torque, and fuel consumption, but also 'minor' ones such as the brake pedal position and the ABS Control Status. Data were provided by the public bus transport operator in Seville (Spain). The operator also provided us with technical information about the buses and questionnaires and operation logs describing commercial velocities, average number of passengers per km and tour, trip lengths, et cetera. Seville's buses may operate several different lines and routes per day, changing drivers every few hours. Consequently, our records include a wide variety of driving profiles and traffic situations [29]. This distinguishes our project from others where only individual route data are collected (see Section I-A). By contrast, Seville is a very flat city and there are no significant differences between route profiles (as an example, see the topography of a randomly selected route in Fig. 1).
Since buses may run several different routes per day, battery sizing is challenging and requires specific treatment. Next, we describe the data pre- and post-processing steps we have taken to address this problem. A summary of the main individual steps is shown in Figure 2. The analysis and preprocessing of the data are described in subsection II-A, while the procedure to extend the data set is explained in subsection II-B.

A. DATA ANALYSIS AND PRE-PROCESSING
Some elementary preprocessing steps (see Fig. 2) are inevitable since we manage a very heterogeneous database. Let us describe all of them in some detail:

1) Data overview and selection
The original database consists of 30 vehicles, but six of them were excluded from the analysis because their records were incomplete, ending up in a final sample of 24 buses. Each vehicle was equipped with numerous sensors which capture up to 77 different variables: some of them correspond to physical magnitudes (e.g., speed); there are also categorical variables that are only active when the driver performs a given action (such as pressing the brake pedal) and, finally, some variables are derived from others (for example, an on-off vehicle-in-motion indicator is obtained by applying a threshold to the speed).
One sees in Fig. 3 that the number of samples varies greatly from variable to variable. For this reason, after appending the data from all the buses, we sorted the variables according to their sample size, from largest to smallest. Table 1 shows the variables at the top of the list: all of them are numerical and have a similar number of samples, on the order of millions. The rest of the variables have a size at least one order of magnitude smaller and, as a consequence, we exclude them from the analysis. Fig. 3 (bottom) also shows that buses may have prolonged periods of inactivity (e.g. between 27 and 30 June). After removing these long idle periods from the database, we still retain on average about 118 hours of data per bus (minimum value, 39.75 h; maximum, 167.25 h; standard deviation, 38.1 h).
Fig. 3 (top) strongly suggests resampling the different variables to a sample rate of 1 Hz. To this end, we filled the missing values using linear interpolation between the measured points. However, interpolation may generate random artifacts in the resulting data. This is illustrated in Fig. 4: observe that interpolation has created seven intermediary fake (nonzero) points between 21:00:08 and 21:00:16 UTC, where there is a data gap caused by the stopping of the bus. To fix this, we can use the binary 'vehicle motion' channel of the database. As illustrated in Fig. 4, this variable switches between 'off' and 'on' at approximately the transitions from/to standstill. However, simply setting the 'fake' speed values to zero will produce sudden speed jumps at these transitions, yielding unrealistically high accelerations. The solution involves a second step: after setting fake values to zero, the variables are smoothed using a moving average low-pass filter with Hamming kernel [30], [31], [32]. The window length, five samples, was chosen by manual tuning. Fig. 5 shows the acceleration values for a resampled profile from the database, drawn at random, before and after filtering.
One sees that extremely high acceleration values, above any reasonable limit, occur around zero speed, confirming that implausibly high values may appear when calculating acceleration parameters. The figure also shows an accumulation of infeasible acceleration points, outside meaningful limit values, in the range of low vehicle speed ($|v_{veh}| < 10$ km/h). For the filtered speed data, on the contrary, the distribution remains within the plausible physical bounds expected for a city bus. Finally, Table 2 lists the relative occurrences of remaining acceleration values above and below reasonable limits in the complete data set. Only a few (≈ 0.5 %) extraordinarily high values remain after this step, and hence can be neglected.
Table 2. Relative occurrence of acceleration values beyond the upper and lower limits (columns: Orig. data, Filt. data; rows: Upper limit, Lower limit).
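The resampling and cleaning steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names (`interpolate_to_1hz`, `hamming_kernel`, `clean_speed`) are hypothetical, timestamps are assumed to be strictly increasing seconds, the motion flag is assumed to be already available on the 1 Hz grid, and the edge handling of the filter (repeating the first and last samples) is a choice made here because the text does not specify it.

```python
import math


def interpolate_to_1hz(times_s, speeds):
    """Linear interpolation of irregularly sampled speeds onto a 1 s grid."""
    out, k = [], 0
    for t in range(math.ceil(times_s[0]), math.floor(times_s[-1]) + 1):
        while times_s[k + 1] < t:          # advance to the bracketing interval
            k += 1
        t0, t1 = times_s[k], times_s[k + 1]
        w = (t - t0) / (t1 - t0)
        out.append((1 - w) * speeds[k] + w * speeds[k + 1])
    return out


def hamming_kernel(length=5):
    """Hamming window normalized to unit sum, so it acts as a weighted average."""
    w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1)) for n in range(length)]
    total = sum(w)
    return [x / total for x in w]


def clean_speed(speed_1hz, in_motion, length=5):
    """Zero the speed where the motion flag is off, then smooth the result."""
    zeroed = [v if m else 0.0 for v, m in zip(speed_1hz, in_motion)]
    kernel, half, n = hamming_kernel(length), length // 2, len(zeroed)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for j, k in enumerate(kernel):
            idx = min(max(i + j - half, 0), n - 1)   # assumption: repeat edge samples
            acc += k * zeroed[idx]
        smoothed.append(acc)
    return smoothed
```

Because the kernel is non-negative and sums to one, the largest sample-to-sample speed jump after smoothing cannot exceed the largest jump before smoothing, which is why the filter tames the acceleration spikes introduced by zeroing.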

3) Analysis of the data
Let us now study the relationships between the data points. We handle several million measurements, say $N$, of the 19 variables given in Table 1. Suppose that they are stored in a matrix $D = (d_{ij})$, $D \in \mathbb{R}^{N \times 19}$, where $d_{ij}$ is the $i$-th observation of the $j$-th variable. Let $X \in \mathbb{R}^{N \times 19}$ be the standardized version of $D$, obtained by subtracting from each column its mean value and dividing the resulting column by its standard deviation. To facilitate the visualization of the data, the structure of $X$ can be described by two unitary matrices $U$, $V$ and a diagonal matrix $\Sigma = \mathrm{diag}\{\sigma_i\}$, $\Sigma \in \mathbb{R}^{19 \times 19}$, in what is called the 'singular value decomposition' (SVD). The SVD of $X$ is a factorization of the form:
$$X = U \Sigma V^T = \sum_{i=1}^{19} \sigma_i \vec{u}_i \vec{v}_i^T \tag{1}$$
where $\sigma_i \in \mathbb{R}$, and $\vec{u}_i \in \mathbb{R}^{N \times 1}$ and $\vec{v}_i \in \mathbb{R}^{19 \times 1}$ are the $i$-th columns of matrices $U$ and $V$, respectively, and
$$X X^T \vec{u}_i = \sigma_i^2 \vec{u}_i \tag{2a}$$
$$X^T X \vec{v}_i = \sigma_i^2 \vec{v}_i \tag{2b}$$
$$X \vec{v}_i = \sigma_i \vec{u}_i \tag{2c}$$
It is also assumed that the singular values $\sigma_1, \sigma_2, \ldots, \sigma_{19}$ are sorted in nonincreasing order ($\sigma_i \ge \sigma_{i+1}$ for all $i$).
Interestingly, the SVD is linked to classical principal component analysis (PCA) [33]. In particular, eqn. (2b) shows that $\vec{v}_1, \ldots, \vec{v}_{19}$ are the eigenvectors of the data correlation matrix $\frac{1}{N-1} X^T X$ and therefore point in the directions where the data values are more spread out.
For visualization purposes, let us consider the truncated formula:
$$\tilde{X} = \sigma_1 \vec{u}_1 \vec{v}_1^T + \sigma_2 \vec{u}_2 \vec{v}_2^T$$
This choice is not arbitrary. Thanks to the properties of the SVD, $\tilde{X} = (\tilde{x}_{ij})$ is the best rank-two approximation of $X$ in the Frobenius norm sense. In plain words, $\vec{v}_1$ and $\vec{v}_2$ constitute a basis for the plane that best fits the data points represented by the rows of $X$. From (2c), $\sigma_1 \vec{u}_1$ and $\sigma_2 \vec{u}_2$ contain the coordinates of the projections in this new coordinate system. In the PCA language, these coordinates are usually referred to as 'principal components'. Writing:
$$\vec{c}_i = (\sigma_1 u_{i1}, \sigma_2 u_{i2})^T, \qquad \vec{d}_j = (v_{j1}, v_{j2})^T, \qquad \tilde{x}_{ij} = \vec{c}_i^T \vec{d}_j$$
where $u_{ik}$ and $v_{jk}$ denote the $i$-th entry of $\vec{u}_k$ and the $j$-th entry of $\vec{v}_k$, respectively. This is a convenient representation, as $\tilde{x}_{ij}$, which is the $i$-th value of the $j$-th variable, can be easily visualized by considering it as the product of the length of $\vec{d}_j$ times the length of $\vec{c}_i$'s projection onto it. We put this idea into practice in the biplot shown in Fig. 6. It includes two scatter plots: the first one shows the points $\vec{c}_i$. The second plot is formed from the vectors $\vec{d}_j$, which are drawn as arrows from the origin and have been labelled with the PIDs of the corresponding variables. One sees that the vectors $\vec{d}_j$ are mainly related to the speed in the upper half-plane, and to the engine's torque in the lower half-plane. For a deeper insight, we split each of the quadrants into two halves by diagonal dashed lines, which allows us to subdivide each group into smaller ones ('green', 'red', 'blue', 'pink' and 'orange'). Variables grouped together are highly correlated or redundant with one another [33] and, consequently, one can reduce the size of the dataset by selecting one of them from each subgroup. As a criterion, we mainly considered the relative contribution of the principal components to the individual redundant variables: from each group of correlated variables, the one whose variance was best explained by one of the components was selected in general. In doing so, we select the four representative signals shown in Table 3.
FIGURE 6. Biplot showing the representation of 1000 randomly selected data points onto the plane of best fit, as well as the directions onto which it is necessary to project the data points to reconstruct the original variables (Table 1 shows the link between PIDs and variable names). This biplot allows one to see easily which variables are related among themselves. For interpretation of the references to colour in this Figure, the reader is referred to the web version of this paper.
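The biplot coordinates discussed in this subsection can be computed with a few lines of NumPy. The sketch below is illustrative only: the function name `biplot_coordinates` is hypothetical, and it assumes that the singular values are attached to the row points (the principal-component scores), while the variable arrows are given by the first two right singular vectors.

```python
import numpy as np


def biplot_coordinates(D):
    """Return the standardized data X, row points C and variable arrows Dj."""
    # Standardize each column: zero mean, unit (sample) standard deviation.
    X = (D - D.mean(axis=0)) / D.std(axis=0, ddof=1)
    # Thin SVD; NumPy returns the singular values in nonincreasing order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    C = U[:, :2] * s[:2]      # row i: (sigma_1 * u_i1, sigma_2 * u_i2)
    Dj = Vt[:2, :].T          # row j: (v_j1, v_j2)
    return X, C, Dj
```

With these outputs, `C @ Dj.T` is the rank-two approximation of `X`, and `X @ Dj` reproduces `C`, i.e., the scores are the projections of the rows of `X` onto the first two right singular vectors.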

B. EXTENDING THE DATASET
Unfortunately, to buttress the credibility and robustness of simulation results, more data are actually needed than can be measured in practice (even when the measurements are made by a large fleet operator). As anticipated in Fig. 2, our data set was extended to enable massive simulations. Specifically, additional speed profiles were created via 'bootstrapping', whereas auxiliary power and cargo weight were generated by 'inverse sampling'. Bootstrap is a random sampling technique with replacement. It is used when a small sample size results in large variance estimates of the parameters of interest (in this case, the energy consumption of the buses), raising doubts about their quality. Bootstrapping is known to be a solution to this problem in that it can be used to find accurate confidence intervals, even if the underlying data are incomplete or do not fulfil assumptions such as Gaussianity or homogeneity [34]. In our case, a variant of the block bootstrap approach, known as simple block bootstrapping, was implemented. Here, a time series is split into non-overlapping blocks, which are randomly drawn with replacement and concatenated into a new sequence. In our case, the beginning and end of each block occur at time instants when the vehicle comes to a standstill, though a minimum block length is required. In this way, a block can be randomly concatenated with any other block, guaranteeing the smoothness and continuity of the generated profile. After dividing the data of a bus into, let us say, $b$ blocks, we draw samples, each of size $b$, with replacement from these blocks. These synthetic profiles will be used later as inputs in the Monte Carlo simulations. The advantage of this bootstrap-based method is that it preserves the probability of the different use cases.
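The simple block bootstrap described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the names `split_at_standstill` and `bootstrap_profile` and the default minimum block length of 30 samples are hypothetical, the speed trace is assumed to be a 1 Hz list that starts and ends at standstill, and a remainder shorter than the minimum length is simply merged into the last block.

```python
import random


def split_at_standstill(speed, min_len=30):
    """Cut a speed trace into blocks that begin and end with a zero-speed sample."""
    blocks, start = [], 0
    for i in range(1, len(speed)):
        # Cut between two consecutive standstill samples once the block is long enough.
        if speed[i] == 0 and speed[i - 1] == 0 and i - start >= min_len:
            blocks.append(speed[start:i])
            start = i
    tail = speed[start:]
    if blocks and len(tail) < min_len:
        blocks[-1] = blocks[-1] + tail   # avoid a final block that is too short
    else:
        blocks.append(tail)
    return blocks


def bootstrap_profile(blocks, rng):
    """Draw len(blocks) blocks with replacement and concatenate them."""
    drawn = rng.choices(blocks, k=len(blocks))
    return [v for block in drawn for v in block]
```

Calling `bootstrap_profile(blocks, random.Random(seed))` with different seeds yields as many synthetic profiles as desired. Since every block starts and ends at zero speed, no speed discontinuity is introduced at the joins.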
Another factor that strongly influences the operating range of any electric vehicle is the auxiliary power demand [20], [35]. In extreme climate regions, typically in hot or cold countries, heating, ventilation and air conditioning (HVAC) systems may require an additional consumption of 1.2 kWh/km at an average velocity of 20 km/h [5]. Compared to that, the driving resistances a vehicle has to overcome at a velocity of 20 km/h are about 10 kW. In the end, depending on the speed profile, the energy consumed by non-traction auxiliary devices can make up as much as 50 % of the vehicle's total energy consumption. However, the auxiliary power demand is seldom recorded in conventional combustion vehicles. For this reason, we reconstruct it using a simple model described by equation (3), which uses the variables listed in Table 3. In this formula, the auxiliary power $P_{Aux}$ is calculated as a fraction of the power output of the engine, which equals its torque $T_{Eng}$ multiplied by the rotational speed $N_{Eng}$ of the axis. The term $1 - \frac{T_{Dmd}}{100}$, where $T_{Dmd}$ is the driver's demanded engine torque (in percent), takes into account that increasing the demand for engine power, by stepping on the gas pedal, decreases the amount of power available for auxiliary systems. Finally, $\alpha_{Loss}$ is an empirical loss factor. The formula is as follows:
$$P_{Aux} = \alpha_{Loss} \left(1 - \frac{T_{Dmd}}{100}\right) T_{Eng} N_{Eng} \tag{3}$$
Inverse sampling was applied for obtaining extra auxiliary power demand profiles. Inverse sampling is a technique for generating independent and identically distributed random observations from a univariate probability distribution, given its cumulative distribution function [36]. The results of this approximation can be seen in Fig. 7. An average auxiliary power demand of either 7.5 kW or 12.5 kW, depending on the bus length, is obtained, which is in harmony with the state of the art [5] and supports this approximation. Finally, another important factor influencing the energy consumption of a battery electric bus (BEB) is the passenger load [37].
As pointed out in [35], the relationship between energy consumption and passenger load is roughly linear, with consumption increasing by approximately 5 kWh/100 km for each additional 1000 kg of weight. To account for the variation of mass during a trip, caused by passengers boarding and alighting, the vehicle mass is recalculated at each bus stop. To this end, the fleet operator provided us with statistics on the passenger distribution, including the minimum, maximum and average number of passengers per line. This information was also taken into account via inverse sampling. Now, we claim that the triplet formed by speed profile, auxiliary power and mass variation, generated as previously described, is sufficient to carry out high-quality simulation studies in the field of battery electric bus design. This will be shown in the next Section.
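Inverse sampling, as used above for the auxiliary power and the passenger load, can be illustrated with the empirical distribution of a set of observations. The sketch below is an assumption-laden example, not the authors' procedure: the function name `make_inverse_sampler` is hypothetical, and the empirical (step-shaped) cumulative distribution function is used because the text does not state which distribution model was fitted.

```python
import random


def make_inverse_sampler(observations):
    """Build a sampler that inverts the empirical CDF of the observations."""
    xs = sorted(observations)
    n = len(xs)

    def sample(rng):
        u = rng.random()                 # u is uniform on [0, 1)
        # First order statistic whose empirical CDF value (k + 1) / n exceeds u.
        k = min(int(u * n), n - 1)
        return xs[k]

    return sample
```

For instance, `sample = make_inverse_sampler(aux_power_kw)` followed by repeated calls to `sample(random.Random(0))`-style generators produces new auxiliary power values that follow the distribution of the reconstructed ones; here `aux_power_kw` stands for any list of observed values.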

III. EXPERIMENTAL VALIDATION
As part of the strategy, a validation step has been implemented. Specifically, the extended sets of speed profiles, auxiliary power and mass variations are used as inputs to a simplified forward longitudinal model of a retrofitted city bus (see [38]), implemented in Matlab and Simulink. This type of modeling and simulation is commonly used to investigate various topics, from energetic studies such as the present work to advanced control strategies [39], [40]. The differences lie in the level of detail and the corresponding focus of the analysis. The criteria used to evaluate the proposed approach are the degree of plausibility of the results obtained with the simulations and their coherence with the state of the art. The model and its sub-components were fully validated against real operating data by both the manufacturer and the individual subsystem suppliers. First, we describe this model in Section III-A. A preliminary analysis of the extended dataset is then shown in Section III-B. Finally, the results of the simulations using the extended dataset are given in Section III-C.

A. SIMULATION MODEL
The general technical specification of the electric bus and the vehicle model is given in Table 4. The basic principle of the forward simulation is sketched in Fig. 8 and described in the following paragraphs.
The driver model is a speed reference tracking controller which provides a torque request to the vehicle model as a function of the difference between reference velocity and actual speed, $v_{ref} - v_{act}$. Here, $v_{ref}$ is the speed according to the simulation's input profile and $v_{act}$ is the actual vehicle speed calculated by the vehicle model. This torque is then translated into accelerator and brake pedal angles (relative deflections). These values are processed by an operation strategy and the resulting set-points are passed to a simplified torque path model, which computes the required wheel torque after taking into account the losses of individual drivetrain components. In combination with vehicle and environmental parameters, the velocity is calculated by integrating the equation of motion (4):
$$F_{tot} = F_d - F_r = m_{tot} \dot{v}_{veh} \tag{4}$$
where $F_{tot}$ is the net force acting on the vehicle, $F_d$ is the driving force exerted by the drivetrain at the driven wheels, $F_r$ is the sum of driving resistance forces, $m_{tot}$ the total mass of the vehicle and $\dot{v}_{veh}$ is the derivative of the vehicle speed with respect to time, i.e., the vehicle acceleration. The driving force is determined by the powertrain output torque at the driven axle $T_{em}$ and the dynamic radius of the vehicle wheels $r_{dyn}$:
$$F_d = \frac{T_{em}}{r_{dyn}}$$
The total resistive force $F_r$ comprises slope, rolling and air-drag forces acting on the vehicle and is calculated according to:
$$F_r = m_{tot}\, g \sin\alpha + m_{tot}\, g\, f_r \cos\alpha + \frac{1}{2} \rho\, c_d\, A\, v_{veh}^2$$
where $g$ is the gravitational acceleration, $f_r$ is the friction coefficient, $\alpha$ the road incline, $\rho$ the air density, $c_d$ the aerodynamic drag coefficient, $A$ the frontal area of the vehicle, and $v_{veh}$ the vehicle speed. The overall mass $m_{tot}$ is given by the curb weight $m_{curb}$, including equivalent rotational inertia terms, and the varying passenger load $m_{passenger}$:
$$m_{tot} = m_{curb} + m_{passenger}$$
The vehicle speed is fed back to the driver, closing the control loop. Auxiliary power demand was taken into account for the calculation of energy consumption, and the weight of the bus plus passengers was updated at each bus stop.
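As a rough illustration of how the equation of motion is integrated in a forward simulation, the following Python sketch performs one explicit Euler step with slope, rolling and air-drag resistances. It is not the Matlab/Simulink model of [38]: the function names are hypothetical, and every numerical default (rolling coefficient, drag coefficient, frontal area, wheel radius, time step) is a placeholder assumption chosen only to make the example run, not a value from Table 4.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def resistive_force(v, m_tot, f_r=0.008, slope=0.0, rho=1.2, c_d=0.7, area=8.0):
    """Slope + rolling + aerodynamic drag force in N (placeholder parameters)."""
    return (m_tot * G * math.sin(slope)
            + m_tot * G * f_r * math.cos(slope)
            + 0.5 * rho * c_d * area * v ** 2)


def step_speed(v, torque, m_tot, r_dyn=0.48, dt=1.0, **params):
    """One forward-Euler step of m_tot * dv/dt = F_d - F_r (speeds in m/s)."""
    f_d = torque / r_dyn                              # driving force at the wheels
    accel = (f_d - resistive_force(v, m_tot, **params)) / m_tot
    return max(v + accel * dt, 0.0)                   # assumption: no rolling backwards
```

In a full forward model, `torque` would come from the driver controller tracking the reference speed, and `m_tot` would be updated at each bus stop with the sampled passenger load.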
Finally, the overall power demand is obtained by adding the auxiliary power demand $P_{Aux}$ to the traction power. The study simulates a battery state of charge (SOC) window depletion use case with fixed initial and end SOC conditions. All simulations start with an SOC of 95% and end either when the end of the track is reached or when the battery SOC drops to 15%. With due caution, the results can be assumed to represent real operation characteristics, especially in terms of operating time, range or energy demand.

B. COMPARISON OF ACTUAL AND EXTENDED DATA
A selection of parameters derived in the ARTEMIS project [41], [42], listed in Table 5, was chosen to compare the original and extended datasets with each other and with standard driving cycles (SDCs). The values are computed for both the original data and the cycles from the extended dataset. In the latter case, we used 1100 synthetic profiles for estimating the statistics. As can be seen in Table 5, as well as in Figures 9 and 10, our extended data are comparable to the real data, although there are slight differences, which are discussed next. Comparing the parameters of real and extended data, one sees that the artificially created data set remains within the boundaries provided by the original database, as can be seen by comparing the extreme values of, for example, the 'maximum speed' parameter. In particular, the speed-related and dynamic parameters describing the driving profiles, e.g. acceleration, are noticeably similar. Furthermore, the percentage of time spent either standing still or driving remains exactly the same, which is obviously due to the decision to use moments of zero speed as linkage points for the sections. In general, the mean values of the original and extended sets of driving profiles do not differ much, whereas the standard deviation is smaller for the extended dataset. This is caused by the considerably increased volume of the extended dataset and is a natural consequence of bootstrapping. Fig. 10 provides an additional overview of the distributions of the different parameters for the original database and the extended one. The distributions of the extended data set have nearly the same characteristics as those of the original real set but contain far more data (in this case about 7.5 times as much). This increased data density allows massive simulations and variations of the experiment, leading to results with high credibility. Furthermore, the results presented show considerable deviations between the SDCs and our real-world measurements.
It is also worth noting that the extreme values of the parameters of the SDCs, such as maximum speed or acceleration, are far away from what happens in reality. For example, the 'European transient cycle' (ETC) has a driving percentage of more than 96% and an average distance between stops of more than 7 km, which is unrealistic. The same applies to the average driving speed and speed percentiles, as well as the high frequency of stops (every 50 meters in the 'Manhattan bus cycle'), which is not considered realistic in an urban scenario. That said, not everything in the SDCs is unrealistic: for example, a similarity in the number of stops and in the average distance between them can be observed between the 'Braunschweig City Driving Cycle' (BCDC) and our dataset. All these observations reinforce our starting hypothesis and can be interpreted as an additional argument encouraging the use of real measurements and operational data as a starting point for feasibility studies and the design of electric buses.

C. VEHICLE SIMULATION RESULTS
Figures 11 and 12 present the overall simulation results for the real data and for our extended synthetic data, formed by 1100 profiles, demonstrating the overall validity of the proposed approach to synthesize velocity profiles via bootstrapping and additional mass and auxiliary load via inverse sampling. The simulations have been carried out using the usual methodology employed in this field for working with micro-trip cycles (see e.g. [23]): the power consumption was calculated for each profile with the simulator and histograms were constructed from the results. In addition to the feature comparison performed in the previous subsection, which confirmed the similarity of real and synthetic data, the final results are highly credible when considering the overall bus performance, taking into account the operating time and range, but also the energetic and dynamic behaviour.
All parameters, such as consumption or recovery, are plausible and remain within the limits of the original experiment. Moreover, the range achieved as a function of relative consumption almost resembles a Pareto curve, as expected. Fig. 12 shows the bounds of the 95% confidence intervals for various simulation results. If we look at one of the most representative variables in this context, the spatial rated energy consumption, i.e. the energy consumed per kilometre, the average results for the real and synthetic data are around 2.11 kWh/km. This is a good result in several respects. First, this absolute value reflects very well the reality for this class of vehicles. Moreover, the similarity of real and synthetic data for this parameter increases the confidence in the data synthesis procedure analyzed and proposed here. Secondly, we observe that the interval bounds for the 95% confidence level are noticeably wider for the original data, despite the fact that this is a relative parameter, unlike, for example, the total energy consumption. The larger variance of the results based on the real data compared to the synthetic ones is partly due to the fact that some real trips did not complete a full trip (battery state of charge (SOC) variation of less than 80 percent). In reality, the smaller variance of the synthetic results is not a drawback, but quite the opposite, as the fleet operator needs to identify the most suitable layout conservatively. Exaggerating a bit, the desirable goal may be a result with near-zero variance. Considering, for example, the energy demand as the key factor for the design of the vehicle's batteries, the real data set shows an absolute variation of 20 kWh, i.e. 10 %, which obviously leads to difficulties in sizing. Assuming battery costs of currently about 500 €/kWh in automotive applications, the conflict between costs and achievement of objectives is inevitable. In contrast, the use of massive synthetic data narrows the spread of all uncertain variables to a satisfactory range.
Table 6 supplements Fig. 12 by explicitly stating the absolute numbers of the corresponding confidence intervals. For the original data, the expected value of the overall energy consumption was 205.37 kWh, with 95% confidence bounds of 195.87 kWh (lower) and 214.87 kWh (upper). The same calculation for the results based on the extended data set shows an expected value of 227.75 kWh, a lower boundary of 227.74 kWh and an upper boundary of 227.76 kWh. Further analysis revealed an average auxiliary power of 7.5 kW for 12 m long buses and 12.5 kW for 18 m long buses in this hot region, which is in line with the state of the art. The influence of passenger load is comparatively small, which also agrees with other investigations (see e.g. [23]).
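The narrowing of the interval with growing sample size can be sketched as follows. The sketch assumes a normal-approximation interval for the mean (mean ± z·s/√n); the symmetric bounds reported above are consistent with an interval of this form, but the formula actually used is not stated here. The function name and the randomly generated consumption values are hypothetical.

```python
import math
import random
import statistics

def mean_confidence_interval(values, level=0.95):
    """Normal-approximation confidence interval for the mean:
    mean +/- z * s / sqrt(n), with z taken from the standard normal."""
    n = len(values)
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)     # standard error of the mean
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * sem, mean, mean + z * sem

# Hypothetical energy consumption results in kWh (random, not simulation output).
rng = random.Random(0)
small = [rng.gauss(205.0, 25.0) for _ in range(30)]
large = [rng.gauss(205.0, 25.0) for _ in range(3000)]

lo_s, mean_s, hi_s = mean_confidence_interval(small)
lo_l, mean_l, hi_l = mean_confidence_interval(large)
print(f"n=30:   interval width {hi_s - lo_s:.2f} kWh")
print(f"n=3000: interval width {hi_l - lo_l:.2f} kWh")
```

Since the standard error scales with 1/√n, a hundredfold increase in the number of profiles shrinks the interval width by roughly a factor of ten in this sketch.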

IV. CONCLUSION AND OUTLOOK
A methodology has been presented to construct large-scale profiles of driving cycles, auxiliary consumption and passenger number variation from the database of a real urban bus operator. In addition, a comprehensive characterization of the driving profiles has been performed to assess the representativeness of the synthetic profiles. The main objectives were to preserve the original characteristics of the measured data, to ensure physical plausibility, and to generate a database large enough to minimize stochastic uncertainty. In fact, the analysis revealed the lack of key features in SDCs, reinforcing the urgent need for alternative approaches such as the one presented here. The extended data set was used to analyze the energy consumption of the buses, which is useful to study the feasibility of retrofitting them into an electric bus fleet. Our work has focused on the parameters that characterize a driving profile, taking into account that it will be used to analyze nominal, global and spatial energy consumption, which in turn is related to running time, autonomy and recovery. The chosen triplet of input parameters (speed, mass and auxiliaries) is usually sufficient to obtain good and plausible results and synthetic profiles have proven to be representative of this class of vehicles.
Uncertainty about achieving objectives often leads to conservative design, which results in high vehicle costs and inefficient fleet operation. Considering real-world driving conditions is essential for EV design. Energy storage systems, particularly vehicle batteries, make up a significant portion of the total cost of ownership (TCO). To reduce these costs, it is necessary to size these systems accurately. Synthetic databases allow a precise estimation of energy demand and enable a proper powertrain design even for vehicles whose route is unknown. Additionally, by analyzing worst-case scenarios (the ultimate boundaries of the experiment), the introduced methodology helps to identify an appropriate vehicle topology and operation mode (electric, fuel-cell, hybrid). The presented work makes it possible to address all these socio-economic problems by narrowing the confidence intervals of the results to a minimum.
In any case, the creation of 'real' driving profiles is a very broad field, and the method presented in this paper has left open several questions that will be the subject of further research. For example, a suitable measure still has to be defined to identify the minimum size of the data set needed for the design of experiments, and the synthesis of profile features could also be approached with methods such as hidden Markov models, non-negative matrix factorization or deep neural networks. Future articles will also incorporate more extensive analyses of the data using multivariate data analysis techniques.

ACKNOWLEDGMENT
We are grateful to TUSSAM, public bus transport operator in Seville, for providing us with the experimental data for this research.