Loss-Aware Histogram Binning and Principal Component Analysis for Customer Fleet Analytics

We propose a method to estimate information loss when conducting histogram binning and principal component analysis (PCA) sequentially, as is usually done in practice for fleet analytics. Coarser-grained histogram binning results in less data volume and fewer dimensions, but more information loss. Considering fewer principal components (PCs) likewise results in fewer data dimensions but increased information loss. Although the information loss of each step is well understood, little guidance exists on the overall information loss when conducting both steps sequentially. We use Monte Carlo simulations to regress information loss on the number of bins and PCs, given a few parameters of a dataset related to its scale and correlation structure. A sensitivity study shows that information loss can be approximated well given sufficiently large datasets. Using the number of bins, the number of PCs, and two correlation measures, we derive an empirical loss model with high accuracy. Furthermore, we demonstrate the benefits of estimating information losses and the representativeness of the total loss in evaluating the accuracy of k-means clustering for a real-world customer fleet dataset. For preprocessing sensor data that are aggregated from a sufficient number of samples, continuously distributed, and representable by Beta-distributions, we recommend not coarsening the histogram binning before PCA.

NOMENCLATURE
Variable for vehicle i and sensor j
$(\bullet)^{(x)}$: The x-th element of a set
$(\bullet)_c$: Column-wise variable
Set element for vehicle i and sensor j
$(\bullet)$: Reconstructed variable using the first κ principal components

I. INTRODUCTION
To understand the usage of vehicles across their lifetime, companies acquire and analyze measurements from various sensors inside the vehicle, providing data-driven decision support for customer-centric automotive development. For instance, Wilberg et al. [1] highlighted the potential of sensor data analysis in supporting requirement engineering and reliability evaluation. Albers et al. [2] showed the prospects of automotive development processes driven by sensor data. More specifically, Reicherts et al. [3] used naturalistic driving studies to identify vehicle dynamics. Tanshi and Söffker [4] proposed a method for determining the takeover time budget based on the analysis of driver behavior.
However, companies aiming to exploit customer data face policies related to data protection and privacy preservation. To ensure privacy by design, data thriftiness and obligating procedures are usually required. For example, Viktoriya et al. [5] identified ethical issues in the automobile industry. Enev et al. [6] showed that driver fingerprinting is possible with sensor data, which would strongly violate customer privacy. Furthermore, multivariate sensor data with fine-grained temporal resolution prohibit conducting analytics on the raw data. These data need to be reduced, often by orders of magnitude, in respective preprocessing procedures.
A common approach to both prepare the data for analytics and preserve privacy is to aggregate the operational data from customers and keep solely the aggregated data as historical sensor data, e.g., by binning the data [7], [8]. Binning temporal data is typically an initial step in preprocessing sensor data and is often conducted directly on vehicle control units, i.e., on the customer side.
It is, however, still challenging to use the heterogeneous, high-dimensional data, even when binned, in exploratory or supervised analytical models, a problem coined the "curse of dimensionality" [9]. As each histogram bin accounts for one dimension, the average number of bins times the number of sensors considered easily results in a dimensionality in the thousands, prohibiting the application of most visual procedures and analytical models. Hence, the aggregated dataset needs to be shrunk to a manageable number of dimensions. One of the most widespread means of further reducing the dimension is to perform principal component analysis (PCA), an unsupervised low-rank matrix approximation technique, as the second step. Ultimately, together with binning, we would like to reduce the number of principal components (dimensionality) without losing much information.
With a sufficiently large number of samples, coarser-grained histogram binning results in fewer dimensions but loses more information. However, for the same outcome dimensionality, should we perform coarser or finer binning first to optimize the performance of upstream analytical models? Although evaluation metrics for PCA already exist, do they really represent the influence of binning on the preprocessed dataset? If not, which evaluation metric can consider the whole process and provide representative decision support for configuring the binning? So far, understanding the mechanism of the decision support problem that exists in the two-step process remains a research gap. On the one hand, large amounts of raw data from customer fleets are lacking, as the data are already binned on board (inside the control units) [10]. On the other hand, binning and PCA interact with the spread of distribution patterns across various customer fleets. This makes it especially complex to investigate the research questions using theoretical analysis and mathematical proofs.
In this paper, we tackle these challenges from the perspective of information losses, measured by the Kullback-Leibler divergence between original, binned, and PCA-approximated data. Considering the difficulty of theoretical analysis, we estimate the information losses by simulating the raw data based on Monte Carlo approaches. In the following, we highlight the contributions of our work.
• For the purpose of estimating the loss of binning and PCA, we model raw customer fleet data using three scale parameters and the correlation structures between histograms within a row and between rows. Based on simulations with sensor data drawn from Beta-distributions with varying degrees of correlation between and within rows, a sensitivity study shows the influence of each parameter on the information losses.
• Based on the sensitivity study, we derive an empirical model that guides how to set the number of bins in combination with the number of dimensions to be considered. The model can determine an appropriate value for any one of the number of bins, the number of principal components, and the total loss, given that the other two are set.
• Using a case study of real-world fleet data from 1454 vehicles, we found that the benchmark evaluation metrics for binning (Kolmogorov-Smirnov statistic) and PCA (variance unexplained) cannot serve as decision support metrics. Taking k-means clustering as an example, neither of the two metrics is capable of representing the accuracy of fleet analytics right after binning and PCA. Instead, we demonstrate how the estimated total information loss outperforms those metrics.
• For fleet analytics with a sufficient number of samples, it is recommended not to coarsen the finely binned histograms before performing PCA.
Note that all the findings in this paper are valid only when the raw fleet sensor data are continuously distributed and can be represented by Beta-distributions.
The remainder of this article is organized as follows. In Section II, we review the related work on data binning followed by PCA and on loss estimation, and highlight our position in the research field. In Section III, we introduce the notation, followed by an overview of our methodology, the model assumptions, the proposed algorithms, and the loss functions considered. In Section IV, we present the results of various investigations, including sensitivity studies with the proposed model, empirical modeling of the information loss mechanism throughout the procedures, and a case study on k-means clustering. In Section V, we summarize the key findings, conclude, and outline future research directions in the realm of sensor data preprocessing.

II. RELATED WORK
Histogram binning and subsequent principal component analysis (PCA) are popular data aggregation techniques for preprocessing customer fleet data. To locate the scope of our work, we review the related work in three parts: histogram binning for fleet analytics, PCA for histogram data, and the loss-aware perspective for determining the number of principal components after PCA. Afterward, we highlight the contribution of our work to the related fields.
Histogram binning allows the removal of the temporal dimension of sensor information, reduces the cardinality of sensor values, and can mitigate the effects of noise (observation inaccuracy). Numerous research activities have been conducted with binned customer data for customer-centric decision support, especially in the automobile industry. Schoch et al. [11] optimized the charging strategy for longer battery cell lifetimes using binned customer data. Huang and Meng [12] proposed a decision support framework for pricing automobile insurance based on binned telematics driving data. Ling et al. [13] used binned data for customer vehicle usage profiling.
Currently, binning strategies are mainly investigated using available raw samples; e.g., Boulle [14] considered the influence of samples on the bins and on frequency-based binning. However, if the raw data samples are not available, the information loss cannot be computed.
Hence, in this paper, we enable binning loss estimation by simulating the samples and reconstructing the correlation structure.
PCA based on histogram data has been systematically investigated, where the histograms are treated as symbolic data [15], driven by the methodological approach for symbolic data analytics [16], [17]. Billard and Le-Rademacher [18] proposed a PCA method for interval data. Makosso-Kallyth [19] extended the scope from interval data to symbolic histogram variables.
Yet, this research field has not been widely applied in industrial contexts such as customer fleet analytics, where interpretability and robustness of methods are essential. Hence, we focus on a relatively more conventional but more popular approach, i.e., concatenating the histogram bins and then performing matrix factorization using PCA. Haselgruber et al. [20] used PCA to aggregate engine load data for evaluating reliability testing. Bartłomiejczyk [21] analyzed the driving behavior of bus drivers by aggregating the measurement signals followed by PCA. Schoch et al. [22] implemented PCA for binned sensor data to reduce the features for electric vehicle service analytics. Ling et al. [23] applied PCA to improve the representativeness of customer sampling based on aggregated usage data and thus identified fringe customers.
From the perspective of variances, the approximation error of a PCA-reconstructed matrix can, for instance, be measured by the Frobenius norm of the approximation error matrix. Its square can be computed as the sum of all squared singular values minus the sum of the squared singular values of the selected components, as a squared singular value captures the variance explained by the associated dimension. It is common practice to determine the number of principal components according to a predefined fraction of variance explained [24].
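The relation between the reconstruction error and the singular values can be checked numerically. The following is a minimal sketch on a synthetic matrix; the matrix size, the random data, and the choice of three retained components are illustrative assumptions and not values taken from this study.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(50, 12))          # synthetic stand-in for a binned data matrix
M_n = M - M.mean(axis=0)               # column-wise centering

# Thin SVD: M_n = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(M_n, full_matrices=False)

kappa = 3                              # number of retained components (illustrative)
M_rec = (U[:, :kappa] * s[:kappa]) @ Vt[:kappa]   # rank-kappa reconstruction

# Squared Frobenius norm of the approximation error matrix
frob_sq = np.linalg.norm(M_n - M_rec, "fro") ** 2
# Sum of all squared singular values minus those of the retained components
discarded = np.sum(s ** 2) - np.sum(s[:kappa] ** 2)
# Fraction of variance explained by the first kappa components
var_explained = np.sum(s[:kappa] ** 2) / np.sum(s ** 2)
```

Both quantities, `frob_sq` and `discarded`, agree up to floating-point error, which is why a threshold on the fraction of variance explained directly bounds the squared reconstruction error.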
From the perspective of information theory, there is a principle for guiding model selection, namely minimum description length (MDL) based on Kullback-Leibler divergences [25]. Tavory integrated the MDL principle into PCA [26]. Bruni et al. [27] reviewed the methodology using three test cases and pointed out that MDL outperforms most model selection methods, such as variance explained. However, specifying the model parameters has unclear influences on MDL performance. These influences are sometimes explicit and sometimes implicit.
However, when binning and PCA are conducted in sequence, the binning resolution and the number of PCs influence each other. In addition, changing the data acquisition strategy of customer fleets with various control units is difficult, which makes big data analytics of customer fleets increasingly time-consuming. Still, the choice of parameters for histogram binning followed by PCA has rarely been investigated, although it remains relevant for fleet analytics.
So far, Vaiciukynas et al. [28] investigated histogram binning followed by dimensionality reduction using fleet data with more than 20,000 vehicles. Although they thoroughly investigated the effect of various dimensionality reduction techniques on the performance of feature representation, the influence of the binning resolution remains unknown, especially in the context of customer fleet analytics.
Therefore, we address the problem from the information theory perspective, but with simpler metrics, i.e., Kullback-Leibler divergences. We focus on the estimation and comparison of the information losses between the raw data, the binned data, and the PCA-approximated data.

III. METHODOLOGY
This section introduces our simulation-based research design to estimate the information loss of preprocessing datasets with varying characteristics using Monte Carlo sampling. First, we describe binning and principal component analysis (PCA) together with the notation. Then, we present our simulation procedure and algorithms. Afterward, we define loss functions based on Kullback-Leibler divergences to evaluate the information losses of the binned data.

A. PRELIMINARIES

1) HISTOGRAM BINNING
Consider a target dataset $X$ for $n$ vehicles. Each vehicle is equipped with $m$ sensors. For vehicle $i = 1, \ldots, n$ and sensor $j = 1, \ldots, m$, we represent the acquired data using a random variable $X_{ij}$, so that the samples of sensor values can be allocated to discrete intervals (binning).
Assuming that (i) the sensor data of each vehicle are binned with the same number of intervals $k$, and (ii) the number of samples $p$ is the same for each sensor and each vehicle, we represent the random variable $X_{ij}$ by $p$ samples, collected in the set $b_{ij} = \{b_{ij}^{(1)}, \ldots, b_{ij}^{(p)}\}$. Considering the limited feasibility of storing the whole sample set across the entire life of a vehicle, we aggregate $b_{ij}$ and represent it as a histogram $h_{ij}$ with intervals $e_{ij}$ and histogram values $\pi_{ij} = (\pi_{ij}^{(1)}, \ldots, \pi_{ij}^{(k)})$, where $\sum_{t=1}^{k} \pi_{ij}^{(t)} = p$ and $\pi_{ij}^{(t)} \geq 0$. In this way, only $\pi_{ij}$ needs to be stored on board, e.g., in the control units. In total, $m$ histograms from each of the $n$ vehicles can be acquired and collected in a dataset $D$ that contains the histogram values alone. We refer to this procedure as $k$-fold binning.
After $k$-fold binning, the sequential information is completely lost. Moreover, as the distributions are discretized into intervals, part of the distribution information is lost as well.
Histogram $h_{ij}$ can be described by an empirical cumulative distribution function (ECDF) $F_{ij}(x)$, where the range of $x$ is identical to the range of $e_{ij}$. The difference between two ECDFs can be quantified by the Kolmogorov-Smirnov (K-S) statistic $D_{\text{K-S}}$, i.e., the supremum of the absolute ECDF difference [29]. The supremum represents the maximum across all $x$ values. For $D$, the overall K-S statistic is the mean over all $mn$ histograms, i.e.,
$$D_{\text{K-S}} = \frac{1}{mn} \sum_{i=1}^{n} \sum_{j=1}^{m} \sup_{x} \left| F_{ij}(x) - F_{ij}^{\text{ref}}(x) \right|, \tag{2}$$
where the reference ECDF $F_{ij}^{\text{ref}}(x)$ could be the original ECDF of the raw samples $b_{ij}$ or that of another, more finely binned histogram.
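As an illustration of $k$-fold binning and the K-S statistic for a single histogram, consider the following minimal sketch. It assumes equal-width bins on [0, 1] and treats the histogram ECDF as linear within each bin; both are assumptions made for this example, and the function names are hypothetical. The reference is a more finely binned histogram of the same samples.

```python
import numpy as np

def k_fold_binning(samples, k):
    """Count samples in k equal-width bins on [0, 1] (assumed sensor range)."""
    counts, _ = np.histogram(samples, bins=k, range=(0.0, 1.0))
    return counts

def hist_ecdf(counts, x):
    """ECDF of a histogram on [0, 1], assumed linear within each bin."""
    edges = np.linspace(0.0, 1.0, len(counts) + 1)
    cdf = np.concatenate(([0.0], np.cumsum(counts) / np.sum(counts)))
    return np.interp(x, edges, cdf)

def ks_statistic(counts, ref_counts):
    """Largest absolute ECDF difference between two histograms.

    Both ECDFs are piecewise linear, so the supremum is attained at a bin edge
    of one of the two histograms.
    """
    edges = np.union1d(np.linspace(0.0, 1.0, len(counts) + 1),
                       np.linspace(0.0, 1.0, len(ref_counts) + 1))
    return np.max(np.abs(hist_ecdf(counts, edges) - hist_ecdf(ref_counts, edges)))

rng = np.random.default_rng(1)
b = rng.beta(2.0, 5.0, size=10_000)    # p = 10000 samples of one sensor (illustrative)
counts8 = k_fold_binning(b, 8)         # coarse 8-fold binning
counts64 = k_fold_binning(b, 64)       # finer reference binning
d = ks_statistic(counts8, counts64)
```

Averaging `ks_statistic` over all vehicles and sensors would give the overall statistic for the dataset.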

2) PRINCIPAL COMPONENT ANALYSIS OF BINNED DATA
Once the sensor data are aggregated through $k$-fold binning, the dataset can be represented as a matrix. Each row represents a vehicle. The number of columns (also termed the dimension of the data) equals the average number of bins per sensor times the number of sensors. However, the high dimensionality might prohibit the direct usage of the matrix for data mining purposes. By applying principal component analysis (PCA), we can exploit correlations within the matrix to represent its principal structure more concisely.
Consider a $k$-fold binned dataset $M$ consisting of $k \cdot m$ column vectors, obtained by concatenating the vector elements of $D$ in the row direction, i.e.,
$$M = \begin{bmatrix} \pi_{11}^{(1)} & \cdots & \pi_{11}^{(k)} & \cdots & \pi_{1m}^{(1)} & \cdots & \pi_{1m}^{(k)} \\ \vdots & & \vdots & & \vdots & & \vdots \\ \pi_{n1}^{(1)} & \cdots & \pi_{n1}^{(k)} & \cdots & \pi_{nm}^{(1)} & \cdots & \pi_{nm}^{(k)} \end{bmatrix}. \tag{3}$$
To simplify the notation, we define $r$ as the number of columns of $M$, i.e., $r = k \cdot m$. We now represent $M \in \mathbb{R}^{n \times r}$ in a $\kappa$-dimensional latent space using PCA. To balance the weights between the columns, we normalize $M$ to $M_n \in \mathbb{R}^{n \times r}$ by scaling the histogram values into probability densities and centering them, where each element $\pi_{ij}^{(t)}$, $t = 1, \ldots, k$, is transformed accordingly. Centering the elements by subtracting the mean of the respective column entries moves the new basis vectors toward the directions of maximum variance.
After normalizing $M$ to $M_n$, we approximate the matrix with a lower rank to determine latent basis vectors for the column and row space of $M_n$, as done with truncated singular value decomposition [30]. These latent basis vectors point in the directions where the data are most widely spread, capturing the maximum amount of information in the matrix (the variance of its elements) with as few latent dimensions as possible. The intuition is to then consider only the primary latent dimensions and discard the higher-order dimensions as noise. Hence, $M_n$ is decomposed into three parts, i.e.,
$$M_n = U \Sigma V^{\top}. \tag{5}$$
Here, $U \in \mathbb{R}^{n \times r}$ represents the coordinate matrix, consisting of $r$ vertical coordinate vectors. Keeping only the first $\kappa$ singular values and their associated vectors yields a rank-$\kappa$ reconstruction of $M_n$ and a projection of the data onto $\kappa$ latent dimensions, which allows us to work with fewer dimensions in further analytical tasks. We refer to this procedure as $\kappa$-rank PCA.
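The κ-rank PCA of a binned matrix could be sketched as follows. The normalization used here (scaling counts by $k/p$ to obtain densities for equal-width bins on [0, 1], then subtracting the column means) is an assumption made for illustration, since it is only described verbally above; the function name and the synthetic multinomial data are hypothetical as well.

```python
import numpy as np

def kappa_rank_pca(M, kappa, k, p):
    """Sketch of kappa-rank PCA on a binned count matrix M of shape (n, r)."""
    density = M * k / p                    # counts -> probability densities (assumed scaling)
    col_mean = density.mean(axis=0)
    M_n = density - col_mean               # column-wise centering
    U, s, Vt = np.linalg.svd(M_n, full_matrices=False)
    M_proj = U[:, :kappa] * s[:kappa]      # coordinates in the latent space, shape (n, kappa)
    M_n_rec = M_proj @ Vt[:kappa]          # rank-kappa reconstruction of M_n
    M_rec = (M_n_rec + col_mean) * p / k   # de-normalize back to histogram values
    return M_proj, M_rec

# Synthetic binned data: n vehicles, m sensors, k bins, p samples per histogram
n, m, k, p = 20, 3, 4, 1000
rng = np.random.default_rng(2)
D = rng.multinomial(p, np.full(k, 1.0 / k), size=(n, m))   # shape (n, m, k)
M = D.reshape(n, m * k)                                     # concatenate bins row-wise

M_proj, M_rec = kappa_rank_pca(M, kappa=2, k=k, p=p)
_, M_full = kappa_rank_pca(M, kappa=M.shape[1], k=k, p=p)
```

With all components retained (`M_full`), the de-normalized reconstruction reproduces `M`; with fewer components, `M_rec` is only an approximation, and this difference is what the PCA loss quantifies.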
As mentioned in the related work, the most popular selection metric for $\kappa$ is the variance explained [24], i.e., the ratio of the sum of the first $\kappa$ squared singular values to the sum of all $r$ squared singular values. For consistency with the losses after binning, we regard the variance unexplained $K_{\text{var}}$, the complementary indicator, as the standard metric to quantify PCA performance, i.e.,
$$K_{\text{var}} = 1 - \frac{\sum_{l=1}^{\kappa} \sigma_l^2}{\sum_{l=1}^{r} \sigma_l^2}, \tag{8}$$
where $\sigma_l$ denotes the $l$-th singular value of $M_n$.

3) METHODOLOGY OVERVIEW
Based on the formulation of histogram binning and PCA for customer fleet analytics, we formulate our objective and give an overview of the methodology in Fig. 1. We provide decision support in determining the preprocessing parameters $k$ and $\kappa$ from the perspective of information losses, in particular the total information loss $L$ after preprocessing. As the raw samples of dataset $X$ are not available, our hypothesis is that customer fleet data exhibit a general behavior. First, we characterize $X$ by three scale parameters and two structure parameters. Then, we simulate the data distribution via GenBeta and obtain the Beta-parameter sets $A$ and $B$. After generating the binned histogram data $D$ in SimBinning, we perform PCA in SimPCA. With the first $\kappa$ principal components, we project $D$ to $M_{\text{proj}}$ for further analytics and approximate the generated binned dataset $D$ by $\tilde{D}$. The core lies in estimating the information losses between the generated Beta distributions characterized by $A$ and $B$, the binned dataset $D$, and the PCA-approximated dataset $\tilde{D}$.

B. DATASET CHARACTERIZATION
We represent the time-series sensor data measurements by Beta-distributed random variables, i.e., $X_{ij} \sim \text{Beta}(\alpha_{ij}, \beta_{ij})$, as (i) according to the studies of Greene [31] and Lin et al. [32], sensor information in vehicles can be well synthesized with unimodal Gamma distributions, which can also be represented using Beta-distributions; (ii) compared to normal distributions, Beta-distributions can describe different shapes; (iii) their probability density can be zero, as is apparent in many practical settings; (iv) realizations lie between 0 and 1, facilitating the data preprocessing in our simulation study and the interpretation of results.
We restrict the Beta-parameters $\alpha_{ij} \in A$, $\beta_{ij} \in B$ to the interval [1, 10]. This is because distributions with $\alpha_{ij} < 1$ and $\beta_{ij} < 1$ are U-shaped (and J-shaped if only one of the two parameters is below 1), shapes that hardly occur for sensor data, as the sensor intervals are wider than the range covered under real conditions.
We model k as a global parameter such that the binning strategy, in terms of the number of bins, is identical for all vehicles and sensors.
The samples from a single vehicle $i$ and a single sensor $j$ describe a random variable $X_{ij}$, which we aggregate into a histogram $h_{ij}$ to obtain matrix $M$.
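We model sensor correlation by letting the sensor values follow Beta distributions with a varying similarity of their probability density functions (PDFs), and then calculate the column-wise Pearson correlation matrix $R_c \in \mathbb{R}^{m \times m}$ of $D$, whose elements are the covariance of the two associated column-wise variables divided by the square root of the product of their variances. Here, cov stands for the covariance of two vectors and var represents the variance of a vector. Hence, we get the coefficient $\rho_c \in [0, 1]$ as the total average of all correlation coefficients without the diagonal elements, i.e.,
$$\rho_c = \frac{\mathbf{1}^{\top} R_c \mathbf{1} - m}{m(m-1)}, \tag{10}$$
where $\mathbf{1} \in \mathbb{R}^{m \times 1}$ represents the unity vector. Correspondingly, the row-wise Pearson correlation matrix of $D$ is represented by $R_r \in \mathbb{R}^{n \times n}$, whose elements are defined analogously for pairs of vehicles. The row-wise correlation coefficient $\rho_r \in [0, 1]$ can be calculated with
$$\rho_r = \frac{\mathbf{1}^{\top} R_r \mathbf{1} - n}{n(n-1)}, \tag{12}$$
where $\mathbf{1} \in \mathbb{R}^{n \times 1}$.
In a nutshell, the previous formulation shows that dataset $D$ can be characterized by the scale parameters $n$, $m$, $p$ as well as the structure parameters $\rho_c$, $\rho_r$.
Algorithm 1 Generating Beta-Parameters
function GenBeta($n$, $m$, $\lambda$, $\mu$)    ▷ Get Beta-parameters for each $X_{ij}$.
    Initialize $A$ and $B$ with random numbers uniformly distributed in [1, 10]
    for $i = 2, \ldots, n$ do
        Blend row $i$ of $A$ and $B$ towards the first row with factor $\mu$
    end for
    for $j = 2, \ldots, m$ do
        Blend column $j$ of $A$ and $B$ towards the first column with factor $\lambda$
    end for
    Return $A$, $B$
end function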
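A possible implementation of GenBeta and of the mean off-diagonal correlation is sketched below. The exact blending rule is not recoverable from the text, so the sketch assumes a linear interpolation toward the reference row and column; this rule and the function names are assumptions for illustration. For brevity, the correlation coefficient is evaluated on the parameter matrix $A$, whereas the coefficients $\rho_c$ and $\rho_r$ in this study refer to $D$.

```python
import numpy as np

def gen_beta(n, m, lam, mu, rng):
    """Sketch of GenBeta: uniform parameters in [1, 10], blended toward references."""
    A = rng.uniform(1.0, 10.0, size=(n, m))
    B = rng.uniform(1.0, 10.0, size=(n, m))
    for P in (A, B):
        # Row-wise blending (between vehicles) toward the first row, factor mu
        P[1:, :] = (1.0 - mu) * P[1:, :] + mu * P[0, :]
        # Column-wise blending (between sensors) toward the first column, factor lam
        P[:, 1:] = (1.0 - lam) * P[:, 1:] + lam * P[:, [0]]
    return A, B

def mean_offdiag_corr(R):
    """Average of all off-diagonal entries of a square correlation matrix R."""
    q = R.shape[0]
    return (R.sum() - np.trace(R)) / (q * (q - 1))

# Same random seed, two different row-wise blending factors
A_lo, B_lo = gen_beta(100, 50, lam=0.3, mu=0.2, rng=np.random.default_rng(0))
A_hi, B_hi = gen_beta(100, 50, lam=0.3, mu=0.8, rng=np.random.default_rng(0))

# np.corrcoef treats rows as variables, giving an n x n row-wise correlation matrix
rho_lo = mean_offdiag_corr(np.corrcoef(A_lo))
rho_hi = mean_offdiag_corr(np.corrcoef(A_hi))
```

Under this assumed blending rule, a larger row-wise factor pulls all rows closer to the reference row, so `rho_hi` exceeds `rho_lo`, while all parameters remain within [1, 10] because each blend is a convex combination.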

C. SIMULATION PROCEDURE
We simulate dataset $D$ with combinations of the parameters described above using Monte Carlo methods. For each simulation run, the procedure consists of two parts: Beta-parameter generation to derive $D$ as GenBeta in Algorithm 1, and preprocessing as SimBinning and SimPCA in Algorithm 2.
In Algorithm 1, we generate $n \times m$ random numbers uniformly distributed in the range [1, 10] and assign them to matrix $A$. Subsequently, we repeat this process, generating a new set of random numbers and assigning them to matrix $B$. At this point, the Beta-parameter matrices $A$ and $B$ are initialized for further processing. Due to the shape property of Beta-parameters [33], the closer the parameters are, the stronger their correlation is. Hence, we control the correlation structure by blending towards the first column or row. Regarding the first row and the first column as references, we subsequently blend the other rows and columns towards the references, controlled by the column-wise and row-wise blending factors $\lambda, \mu \in [0, 1]$. The larger the blending factors are, the nearer the rows (columns) are to the reference row (column). However, $\lambda$ and $\mu$ are not identical to the correlation coefficients. A parameter study confirms that the blending factors, as expected, exhibit strong positive associations with the correlation coefficients, i.e., $\lambda \propto \rho_c$ and $\mu \propto \rho_r$. In the sensitivity study, we use $\lambda$ and $\mu$ to represent $\rho_c$ and $\rho_r$. When estimating information loss with an observed matrix of sensor values, we can empirically determine estimates of the corresponding $\lambda$ and $\mu$ values.
Algorithm 2 SimBinning and SimPCA
function SimBinning    ▷ Generate the samples for each $X_{ij}$ and perform $k$-fold binning.
    for $i = 1, \ldots, n$ do
        for $j = 1, \ldots, m$ do
            $b_{ij}$ ← randomly generate $p$ samples according to $X_{ij} \sim \text{Beta}(\alpha_{ij}, \beta_{ij})$;
            $\pi_{ij}$ ← perform $k$-fold binning of $b_{ij}$;
        end for
    end for
    Return $D$
end function
function SimPCA
    $M_n$ ← normalize $M$, the row-wise concatenation of $D$;
    $M_{\text{proj}}$ ← Compress $M$ using the first $\kappa$ principal components according to (7);
    $\tilde{M}$ ← de-normalize the reconstructed matrix $\tilde{M}_n$ from (6) by solving (4);
    Return $M_{\text{proj}}$, $\tilde{D}$
end function
As shown in Algorithm 2, we randomly generate the samples that obey the Beta-distributions from Algorithm 1. By binning the samples, we simulate the histogram values and approximate them with a given lower rank. Here, $k$ and $\kappa$ serve as the variables that control the preprocessing procedure. Typically, we project the data into the latent space to reduce their dimension according to (7). In this case, however, the evaluation of information losses requires the reconstructed dataset. Hence, we take the reconstructed normalized matrix $\tilde{M}_n$ from (6) and de-normalize it into $\tilde{M}$ by solving (4). In this way, $M$ is approximated by $\tilde{M}$ with $\kappa$ principal components. As a result, we obtain the binned dataset $D$ and the approximated dataset $\tilde{D}$, which has the same nested structure as $D$ according to (1).
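The binning part of Algorithm 2 can be sketched as below. The function signature, the equal-width bins on [0, 1], and the small example sizes are assumptions for illustration; the resulting nested array plays the role of $D$, and reshaping it yields the concatenated matrix $M$ on which the PCA step would operate.

```python
import numpy as np

def sim_binning(A, B, p, k, rng):
    """Draw p Beta samples per (vehicle, sensor) and count them in k equal-width bins."""
    n, m = A.shape
    D = np.empty((n, m, k), dtype=np.int64)
    for i in range(n):
        for j in range(m):
            # b_ij: p samples of X_ij ~ Beta(alpha_ij, beta_ij)
            b_ij = rng.beta(A[i, j], B[i, j], size=p)
            # pi_ij: histogram values after k-fold binning
            D[i, j], _ = np.histogram(b_ij, bins=k, range=(0.0, 1.0))
    return D

rng = np.random.default_rng(3)
n, m, p, k = 4, 3, 500, 8                      # illustrative sizes
A = rng.uniform(1.0, 10.0, size=(n, m))        # Beta-parameters without blending
B = rng.uniform(1.0, 10.0, size=(n, m))

D = sim_binning(A, B, p, k, rng)               # nested structure: (n, m, k)
M = D.reshape(n, m * k)                        # one row per vehicle, r = k * m columns
```

Each histogram in `D` sums to `p`, and row `i` of `M` lists the `k` bins of sensor 1, followed by those of sensor 2, and so on.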
The simulation parameters are summarized in Table 1. Column-wise parameters are the parameters among the sensors. Row-wise parameters stand for the parameters among the vehicles.
After each simulation case, we keep the Beta-parameters $A$, $B$, the binned dataset $D$, and the approximated dataset $\tilde{D}$ for the evaluation of the information losses.

D. LOSS ESTIMATION
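To compare the distribution of $X_{ij}$ before and after each preprocessing step, we use their probability density functions (PDFs). For the Beta-distributions, we represent the PDF of $X_{ij}$ by $f_{X_{ij}}(x)$, $0 \le x \le 1$. After $k$-fold binning, we represent the empirical PDF of the binned $X_{ij}$ as
$$p_{X_{ij}}(x) = \frac{k}{p}\, \pi_{ij}^{(\lceil xk \rceil)}, \tag{13}$$
where $\lceil xk \rceil$ represents the ceiling function, and $\pi_{ij}^{(t)}$ is an element of $M$, i.e., of the de-normalized $M_n$. Similarly, we get the empirical PDF for the reconstructed $X_{ij}$ after its $\kappa$-rank approximation as
$$\tilde{p}_{X_{ij}}(x) = \frac{k}{p}\, \tilde{\pi}_{ij}^{(\lceil xk \rceil)}, \tag{14}$$
where $\tilde{\pi}_{ij}^{(t)}$ represents the reconstructed histogram value as an element of $\tilde{M}$, de-normalized from $\tilde{M}_n$.
As shown in Fig. 2, we can determine the loss of the binning and the PCA steps.
Hence, we have
- the binning loss $l_{k,ij}$ between the original PDF $f_{X_{ij}}$ and the binned PDF $p_{X_{ij}}$,
- the PCA loss $l_{\kappa,ij}$ between the binned PDF $p_{X_{ij}}$ and the reconstructed PDF $\tilde{p}_{X_{ij}}$, and
- the total loss $l_{ij}$ between the original PDF $f_{X_{ij}}$ and the reconstructed PDF $\tilde{p}_{X_{ij}}$.
The information loss function used between two PDFs is the Kullback-Leibler divergence (relative entropy), i.e., the expectation of the logarithmic difference [34]. In this paper, the base of the logarithm is 2, and the unit is the shannon (Sh).
Let us first describe the binning loss. The loss function is defined as
$$l_{k,ij} = \int_{0}^{1} f_{X_{ij}}(x) \log_2 \frac{f_{X_{ij}}(x)}{p_{X_{ij}}(x)} \, \mathrm{d}x. \tag{15}$$
Here, the divergence is defined only if $p_{X_{ij}}(x) = 0$ implies $f_{X_{ij}}(x) = 0$ for all $x$. To compute the integral, we approximate it by a sum of the function values at $N$ equidistant samples of the variable $x$, i.e., we apply a quasi-Monte Carlo approach [35]. Hence, we can approximate the loss with a finite value of $N$, which acts as a hyper-parameter. Over our $n \times m$ random variables, we evaluate the whole dataset using the average of their absolute values, yielding the average binning loss
$$L_k = \frac{1}{mn} \sum_{i=1}^{n} \sum_{j=1}^{m} \left| l_{k,ij} \right|. \tag{16}$$
Correspondingly, with $l_{\kappa,ij}$ and $l_{ij}$ defined analogously to (15), we evaluate the PCA loss and the total loss by computing
$$L_\kappa = \frac{1}{mn} \sum_{i=1}^{n} \sum_{j=1}^{m} \left| l_{\kappa,ij} \right| \tag{17}$$
and
$$L = \frac{1}{mn} \sum_{i=1}^{n} \sum_{j=1}^{m} \left| l_{ij} \right|. \tag{18}$$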
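The binning loss of a single random variable could be approximated as in the following sketch. It assumes that the divergence is taken from the Beta density to the piecewise-constant histogram density, that bins are equal-width on [0, 1], and that the $N$ equidistant evaluation points are cell midpoints; these choices and the function names are assumptions for illustration.

```python
import numpy as np
from math import lgamma

def beta_pdf(x, a, b):
    """Beta(a, b) density on (0, 1), computed in log space for numerical stability."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return np.exp(log_norm + (a - 1.0) * np.log(x) + (b - 1.0) * np.log1p(-x))

def binning_loss(a, b, counts, N=1000):
    """Approximate KL divergence (in Sh) from Beta(a, b) to a histogram density."""
    k, p = len(counts), counts.sum()
    x = (np.arange(N) + 0.5) / N                         # N equidistant midpoints in (0, 1)
    f = beta_pdf(x, a, b)
    t = np.minimum(np.ceil(x * k).astype(int), k) - 1    # zero-based bin index via ceiling
    q = counts[t] * k / p                                # piecewise-constant histogram density
    mask = f > 0
    if np.any(q[mask] == 0):
        return np.inf                                    # divergence undefined (infinite)
    return np.sum(f[mask] * np.log2(f[mask] / q[mask])) / N

rng = np.random.default_rng(4)
samples = rng.beta(2.0, 2.0, size=100_000)               # illustrative Beta(2, 2) sensor
counts_4, _ = np.histogram(samples, bins=4, range=(0.0, 1.0))
counts_20, _ = np.histogram(samples, bins=20, range=(0.0, 1.0))
loss_4 = binning_loss(2.0, 2.0, counts_4)
loss_20 = binning_loss(2.0, 2.0, counts_20)
```

In this example, the coarser 4-fold binning yields a larger loss than the 20-fold binning, and a uniform histogram compared against Beta(1, 1) yields a loss of zero.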

IV. RESULTS AND DISCUSSION
Before putting our approach into application, we analyze the sensitivities of the information losses (binning, PCA, and total losses) to the parameters in Table 1. Based on the sensitivity study, we derive an empirical model for estimating these losses without simulation. Furthermore, we demonstrate how the loss model benefits decision-making when choosing the aggregation parameters.

A. SENSITIVITY STUDY
Table 1 shows the parameters of the simulation procedure. However, with the data generation and simulation procedure described, both correlation structure parameters $\rho_c$, $\rho_r$ can only be calculated when the dataset already exists. Instead, to generate the dataset, we use the treatment parameters $\lambda$, $\mu$ to control the correlation structure.
To start the sensitivity analysis, we define the reference case configuration, shown in the rightmost column of Table 1. Then, we change the parameters individually and derive their impact on the information loss.
As shown in the preprocessing reference configuration, we choose eight-fold binning and a rank-three approximation. With eight-bin histograms, it is relatively easy to identify the distribution shape at lower noise, and a rank of three allows us to visualize the data intuitively. The scale and structure parameters represent a typical dataset exhibiting a weak correlation structure. The solver settings play a role in the sampling resolution. The number $N$ affects the accuracy of computing the loss functions in (16)-(18), due to the quasi-Monte Carlo approximation of the integrals. Based on extensive preliminary studies, we found $N = 1000$ to be sufficient to approximate the Kullback-Leibler divergence with an accuracy of at least 99%. Furthermore, as the procedure is stochastic, we repeat each simulation case 50 times, i.e., $n_{\text{sim}} = 50$.
In the following, we present the results from the simulations with respect to the type of parameters: resolution of the preprocessing, scale of the dataset, and the correlation structure.

1) PREPROCESSING PARAMETERS
Based on the reference case, we modify κ to values between one and ten.The number of bins, k, is varied from three to 20, representing an increase in granularity.As k rises, more detailed information about the data distribution is captured.Fig. 3 shows the resulting losses (binning, PCA, and total losses) over various levels.
With increasing $k$, a reduction of the binning loss and an increase of the PCA loss can be observed. The aggregated histograms contain more information, resulting in a smaller difference from the original distributions. However, for an identical target dimension (e.g., $\kappa = 3$), a higher input dimension carries more information and more linearly uncorrelated dimensions, so more information is discarded by the PCA. Furthermore, $\kappa$ has no influence on the binning loss, as binning is the step before PCA. Due to the growing variance explained with a higher $\kappa$, the PCA loss decreases.
Another interesting aspect is that $L_\kappa$ surpasses $L_k$ at higher $k$. This indicates that with a higher $k$, the information loss in the PCA is higher than that induced by the aggregation process. The turning point at which $L_\kappa$ exceeds $L_k$ is positively correlated with $\kappa$, implying that if we accept a higher number of dimensions after preprocessing, it makes sense to aggregate the data with a higher number of bins.
Additionally, the trend of the total loss $L$ is dominated by the binning loss for up to 20-fold aggregation. It seems that we can minimize the overall information loss by merely increasing $k$. However, although $L$ decreases with increasing $k$, it approaches $L_\kappa$. At high values of $k$, the dimensionality reduction contributes more information loss to the dataset. In this case, the PCA dominates the divergence induced by the whole preprocessing.

2) SCALE PARAMETERS
In our study, the scale parameters $n$, $m$, and $p$ are systematically varied within the range of 1 to 10000, which allows us to explore the sensitivity of the information losses to the scale of the input data. Their effects on the information losses are presented in Fig. 4.
As shown in Fig. 4(a), no correlation between $n$ and $L_k$ is found. Similarly, $L_\kappa$ and $L$ become less dependent on $n$ when $n$ exceeds approximately 100. After reaching this threshold, $L_\kappa$ yields $L_k$. With small $n$, the smaller losses are due to the correlation structure. With identical correlation coefficients or treatment parameters, the linear dependency of the whole matrix after aggregation becomes stronger if the matrix is small. According to Fig. 4(b), a similar phenomenon is observed between $m$ and the losses, but the yield threshold is found slightly above ten. This threshold is much lower than that for $n$, as the number of dimensions is $k$ times $m$.
The trends are different for $p$. Fig. 4(c) shows a reduction of all information losses with a larger number of samples, which saturates at more than 1000 samples. With lower $p$, the sampling affects the noise and the accuracy of the histogram binning, which further influences $L_k$. Another effect is that the matrix has 80 columns after aggregation (according to the reference case). With $p$ smaller than the number of columns, the aggregated histogram values are noisy. At the same time, large amounts of missing values exist in the dataset. Based on this matrix, another level of noise is included after PCA. The PCA performed here indirectly approximates the missing values, resulting in a lower total loss. With higher $p$, this missing-value effect disappears, and the losses stabilize. In summary, before decision-making for the data acquisition, it is necessary to keep the scale parameters above the thresholds and to avoid the noise resulting from missing values.

3) STRUCTURE PARAMETERS
From the reference case described in Table 1, we modify the column-wise (between sensors) structure treatment λ from 0.3 to 0.9 at three levels of row-wise (between vehicles) structure treatment μ, namely 0.3, 0.5, and 0.7.Their information losses are presented in Fig. 5.
It is observed that a stronger correlation between the sensors reduces the PCA loss, resulting in a decrease in the total loss. Since the aggregation is performed for each sensor and each vehicle individually, the matrix structure does not affect the aggregated histogram values. Hence, no clear dependence of the binning loss on the structure parameters is observed in this study. These findings are also valid for the correlation between the vehicles.

B. LOSS MODEL DERIVATION
According to the findings in Section IV-A, when $p \gg mk$ and $m$ and $n$ are sufficiently large, we can outline the dependence between the parameters and the information losses empirically in (19).
The binning loss term is shown in (20).
The PCA loss term is a function of preprocessing parameters and structure parameters. According to Fig. 5, the effect of the structural parameters is considered as a multiplier in the loss function, applied to a function of the preprocessing parameters. Hence, the PCA loss term takes the form given in (21). The left term C represents the structural multiplier, which combines the row-wise and column-wise correlation effects as given in (22). Based on the mean values from the sensitivity study cases, we perform curve-fitting on the equations by maximizing the coefficient of determination R^2. The empirical functions from (19)-(22) predict the information losses with an accuracy over 97.8%. This further supports the findings about the parameter dependencies.
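To illustrate the kind of curve fitting referred to here, the following is a minimal sketch that fits a negative exponential to a set of binning losses and reports R^2 as the goodness-of-fit measure (higher is better). The functional form a * exp(-b * k), the function names, and the synthetic data are assumptions made for illustration only; they are not the paper's equations (19)-(22) or its fitted coefficients. The sketch uses ordinary least squares on the logarithm of the loss, which is a simple stand-in for a direct optimization of R^2.

```python
import math

def fit_negative_exponential(ks, losses):
    """Fit loss ~ a * exp(-b * k) by least squares on log(loss).

    Hypothetical helper: the functional form is an assumption.
    Requires strictly positive losses.
    """
    xs = [float(k) for k in ks]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope  # (a, b)

def r_squared(observed, predicted):
    """Coefficient of determination on the original (non-log) scale."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Synthetic, noise-free data generated from the assumed form (not real results).
bins = [2, 4, 8, 16, 32]
synthetic_losses = [0.8 * math.exp(-0.1 * k) for k in bins]

a_hat, b_hat = fit_negative_exponential(bins, synthetic_losses)
fitted = [a_hat * math.exp(-b_hat * k) for k in bins]
fit_quality = r_squared(synthetic_losses, fitted)
```

With noisy simulation means, the log-linear fit and a fit that optimizes R^2 directly can differ slightly; the sketch is meant only to show the mechanics.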

C. CASE STUDY
In this section, we demonstrate how to use the derived information loss model and evaluate whether the estimated losses reflect the accuracy of downstream analytics. First, we present the data basis of the real-world fleet data and the evaluation case. Subsequently, we project the data onto the first two components and visualize it with five binning levels. This step investigates how the binning level, i.e., the granularity, affects the visual representation and thus the interpretation of the data. Afterward, we evaluate the two-step preprocessing using our loss model. Furthermore, we compare our loss-aware method to two benchmark reference metrics, i.e., the variance unexplained for PCA and the K-S statistic for histogram binning.

1) CASE DESCRIPTION
Let us take an example from exploratory data analysis based on usage statistics from customer fleets. The dataset comprises 1454 customer vehicles in two segments: 727 BMW 740i limousine vehicles from the United Arab Emirates (UAE), denoted as segment 1, and 727 BMW 540i limousine vehicles from Japan, denoted as segment 2. In this case, both segments have identical cardinality, namely 727 vehicles each. The long-term statistical data is acquired from dealers via on-board diagnostics [36], [37] or vehicle telemetries [10]. As shown in Table 2, the statistical usage data includes binned histogram values from ten measurements, with 24 bins for each measurement (sensor) histogram on average, i.e., 240 dimensions in total.
According to prior knowledge, the usage behavior of the two given customer segments is quite different. Based solely on the binned data, we try to separate the whole dataset into two segments after the preprocessing steps introduced in Section III-A. As the number of segments (two) is known, we group the 1454 vehicles by minimizing the squared Euclidean distances between them and their cluster centroids or means, i.e., k-means clustering [38]. According to Tselentis and Papadimitriou [39], k-means is one of the most commonly used methodologies for driver profile identification and driving pattern detection.
Consider the two cluster sets that result from clustering the binned data after κ-rank PCA. After clustering, we count the proportion of correctly clustered customers among all customers. As the clusters are obtained without supervision, their mapping to our reference segments can be either the direct assignment (cluster 1 to segment 1 and cluster 2 to segment 2) or the swapped one. We take the larger of the two resulting proportions as the clustering accuracy A.
To evaluate the relationship between estimated information loss and the accuracy of k-means clustering, we compare them under different numbers of bins k and principal components κ. As the time-series raw data before binning are not available, we coarsen the fine-binned data by adding up every two adjacent bins. If k for a sensor is odd, we add up the last three bins at the end to make the coarsening more conservative. If there is only one bin, we keep it as it is without further coarsening. Regarding the original data as binning level one (L1), we repeat the coarsening steps until all bins are concatenated, yielding five binning levels (L1, ..., L5), whose numbers of bins for each random variable are listed in Table 2. Furthermore, we test the clustering with κ from one up to five, a range in which the "curse of dimensionality" hardly occurs.
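The coarsening rule and the accuracy measure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, histograms are plain lists of bin values, and segment and cluster labels are encoded as 0 and 1.

```python
def coarsen_bins(hist):
    """Merge adjacent bins pairwise; for an odd bin count the last
    three bins are merged; a single bin is returned unchanged."""
    k = len(hist)
    if k == 1:
        return list(hist)
    n_out = k // 2
    # All output bins except the last are sums of exactly two input bins.
    merged = [hist[2 * i] + hist[2 * i + 1] for i in range(n_out - 1)]
    # The last output bin takes the remaining two (even k) or three (odd k) bins.
    merged.append(sum(hist[2 * (n_out - 1):]))
    return merged

def binning_levels(hist):
    """Repeat the coarsening until a single bin remains."""
    levels = [list(hist)]
    while len(levels[-1]) > 1:
        levels.append(coarsen_bins(levels[-1]))
    return levels

def clustering_accuracy(segment_labels, cluster_labels):
    """Share of correctly clustered vehicles for two segments.

    Cluster numbering is arbitrary, so both possible mappings
    (direct and swapped) are evaluated and the larger share is used.
    """
    n = len(segment_labels)
    direct = sum(1 for s, c in zip(segment_labels, cluster_labels) if s == c)
    swapped = n - direct
    return max(direct, swapped) / n

# A 24-bin histogram passes through 24, 12, 6, 3, and 1 bins.
level_sizes = [len(level) for level in binning_levels([1] * 24)]
```

Because the larger of the two mappings is taken, the accuracy for two segments never falls below 0.5, which is worth keeping in mind when reading values slightly under 70%.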
For this dataset, the number of observations (customer fleets) n equals the total number of vehicles in both segments, 727 + 727 = 1454, the number of sensors (measurement variables) m is ten, and the number of samples p → ∞, as these are long-term sensor measurements that could span several years. As the structural parameters act on the losses as a multiplier, they have no impact on the trends with different k and κ. Without the raw data before binning, we approximate the column-wise and row-wise blending factors λ and μ with the correlation coefficients of the L1 dataset. Hence, λ = 0.117 and μ = 0.822, which implies that the bin values are weakly correlated and the customers are strongly correlated.
Given the scale and structure parameters, we estimate the average information loss per measurement variable per vehicle (in short, "information loss" in the following) using our empirical loss model according to (19)-(22) and their parameters.
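One plausible way to obtain such structure estimates from a binned data matrix is to average the pairwise Pearson correlation coefficients over columns and over rows. The sketch below is an assumption about the procedure, not a description of the exact computation used for the values above; the function names and the toy matrix are hypothetical.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

def mean_pairwise_correlation(vectors):
    """Average of the off-diagonal entries of the correlation matrix."""
    total, count = 0.0, 0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            total += pearson(vectors[i], vectors[j])
            count += 1
    return total / count

def structure_estimates(matrix):
    """Return (column-wise estimate, row-wise estimate) for a matrix
    whose rows are vehicles and whose columns are bin values."""
    rows = [list(r) for r in matrix]
    cols = [list(c) for c in zip(*matrix)]
    return mean_pairwise_correlation(cols), mean_pairwise_correlation(rows)

# Toy matrix with three vehicles and three columns (illustrative values only).
toy = [[1.0, 2.0, 1.0],
       [2.0, 4.0, 0.0],
       [3.0, 6.0, 1.0]]
lam_est, mu_est = structure_estimates(toy)
```

Rows or columns with zero variance would make the coefficient undefined and need to be excluded beforehand.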

2) DATA VISUALIZATION
On each aggregation level (L1 to L5), we plot the scatter snapshots of the reduced fleet data in Fig. 6(a)-(e), using the first two principal components (PC1 and PC2) with given segment information. Corresponding to the clustering accuracies shown in Fig. 6h, we can perform exploratory data analysis by observing those snapshots and trying to identify the two segments visually, assuming that the segments are not given.
The information losses and accuracies of L1 and L2 are close to each other. Their snapshots also show nearly identical patterns despite their opposite directions of PC1. Comparing the snapshot of L2 to that of L3, we observe that the two segments can still be identified reasonably well, although their centroids are closer. This is consistent with the tiny increase in loss and the small drop in accuracy. Moving to higher aggregation levels (fewer bins and lower dimensions before PCA), the sensitivity of loss estimation to clustering performance slightly decreases. However, with a tripled information loss from L3 to L4, the accuracy decreases from a generally acceptable level (over 80%) to less than 70%, i.e., only about two thirds of the customers are correctly clustered, which shows the limited ability of clustering. The snapshot of L4 shows a larger intersection area between the two customer segments. Without given segment information, it is already difficult to identify the two segments by eye. In L5, where the loss is doubled compared to L4, the points are so close that we can hardly identify the distance between the centroids of both segments.

3) LOSS-AWARE EVALUATION
Figure 6h shows the relationship between the accuracies for k-means clustering and our estimated information losses.
When the average number of bins k decreases, the accuracy for clustering the two customer segments decreases significantly. At the same time, as the number of retained principal components (PCs) κ increases, the clustering accuracy A increases from a single PC to two PCs and then converges with minor fluctuation due to the algorithmic uncertainty of k-means clustering and higher dimensionality. Comparing the converged A for each binning level to that of the next level (after pairwise coarsening), the accuracy gained by explaining more variance with more PCs does not compensate for the influence of binning. In other words, the clustering accuracy with lower binning resolution generally decreases to an extent that more principal components cannot recover the original pattern before k-fold binning. Hence, k-fold binning rather than κ-rank PCA dominates the clustering performance.
From the information loss perspective, the empirically estimated total information loss L follows the trends of the clustering accuracy. With higher binning levels and lower k, L is estimated to rise from 0.07 Sh up to over 0.5 Sh, where most of the effective information is lost (two bins). On the other hand, for lower binning levels with higher k (mainly from L3 to L1), the loss converges without further increase, as the variance explained can cover most of the information after binning. With higher κ, L also decreases, but more mildly than with higher k. In addition to the clustering performance, binning thus dominates the information loss as well.
In this practical case, twelve bins per sensor on average (L2) retain most of the information and preserve the clustering performance. Hence, L2 represents a trade-off between clustering accuracy (or visualization) and dimensionality.

4) COMPARISON OF EVALUATION METRICS
As mentioned in Section II, there are already well-known evaluation metrics for histogram binning and PCA. For histogram binning, we usually regard the histograms as empirical distributions and compare them to the original distribution via K-S statistics, expressed in (2). For PCA, the variance unexplained quantifies the fraction of variance lost by projecting the high-dimensional dataset into a low-dimensional latent space described by the principal components (PCs), formulated in (8). When binning and PCA form a sequence in which binning takes place not at the analytics site but in the vehicle control units, data scientists usually focus on PCA guided by the variance explained or unexplained. In this case study, therefore, we first compare our loss-aware approach to the variance unexplained and then to the K-S statistics. As we do not have the original time-series data from customer fleets over the years, we choose the finest binning level of our raw binned data (L1) as the reference for computing K-S statistics. For each comparison, we focus on two trends: a larger number of PCs and finer binning.
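Both benchmark metrics can be sketched in a few lines. The exact definitions in (2) and (8) are not reproduced here, so the following makes explicit assumptions: the variance unexplained is computed from singular values as one minus the share of squared singular values retained, and the binning K-S statistic compares the cumulative distribution of a fine histogram with that of its coarsened version, assuming the mass of each merged bin is spread uniformly over the fine bins it replaces. All function names are hypothetical.

```python
def variance_unexplained(singular_values, kappa):
    """Fraction of variance not captured by the first kappa components.

    Assumes the singular values are sorted in descending order and
    that variance is proportional to the squared singular values.
    """
    squared = [s * s for s in singular_values]
    return 1.0 - sum(squared[:kappa]) / sum(squared)

def ks_binning(fine, group_sizes):
    """Maximum absolute CDF difference between a fine histogram and
    its coarsened version, evaluated at the fine bin edges.

    group_sizes lists how many fine bins each merged bin contains,
    e.g. [2, 2, 3] for pairwise merging of seven bins.
    """
    total = float(sum(fine))
    probs = [v / total for v in fine]
    fine_cdf, coarse_cdf = [], []
    acc_fine = 0.0
    acc_coarse = 0.0  # mass of the merged bins already completed
    start = 0
    for size in group_sizes:
        mass = sum(probs[start:start + size])
        for step in range(1, size + 1):
            acc_fine += probs[start + step - 1]
            fine_cdf.append(acc_fine)
            # Uniform spreading: linear growth of the CDF inside a merged bin.
            coarse_cdf.append(acc_coarse + mass * step / size)
        acc_coarse += mass
        start += size
    return max(abs(f - c) for f, c in zip(fine_cdf, coarse_cdf))

k_var = variance_unexplained([4.0, 3.0], 1)
k_ks = ks_binning([1, 4, 3, 2], [2, 2])
```

Under this reading, the K-S value depends only on the histograms before PCA, which is why it cannot reflect anything that PCA does afterward.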
As shown in Fig. 6f, with more PCs, we observe lower variance unexplained and higher accuracy. However, with less than 60% variance unexplained, we can hardly identify a correlation between the accuracy and the variance unexplained. This implies that the variance unexplained is representative of PCA-induced losses but insensitive at lower values. With lower binning levels (finer binning), higher variance unexplained is estimated for the PCA. However, the clustering accuracy is higher, indicating that the variance unexplained cannot represent the influence of binning granularity on clustering accuracy. To conclude, the variance unexplained is unsuitable for evaluating PCA for histogram data.
For the other benchmark metric (binning K-S statistic), the experimental results are plotted in Fig. 6g. With more PCs, the accuracy changes while the relative binning K-S statistic remains unaffected. With finer binning, the statistic is lower and the clustering accuracy is higher, which means that the K-S statistic can properly identify the impact of binning on clustering accuracy. Although binning dominates the overall impact, the minor impact of PCA on the accuracy cannot be indicated by binning statistics, as PCA happens after binning. In summary, the K-S statistic is partially suitable as a criterion for preprocessing histogram data, but it provides no guidance on how to perform PCA afterward.
Compared to both reference metrics, our loss-aware approach, shown in Fig. 6h, exhibits a good correlation between the estimated total information loss and the clustering accuracy. In summary, our empirical loss model explains the experimental results from k-means clustering well.

D. APPLICABILITY AND LIMITATIONS
From the perspective of overall preprocessing, aggregation with fewer bins is not the proper choice to improve the performance of PCA. If more bins are available without losing more information, fewer dimensions are required due to the enhanced correlation structure between the bins and between the sensors. In practice, if more bins per sensor are aggregated, the long-term statistical data can be preprocessed more compactly, improving the performance of further analysis such as clustering.
However, these findings are typically valid for customer fleet analysis or similar use cases where the raw fleet sensor data are continuously distributed and can be described by Beta distributions. To better indicate the limitations, we illustrate two counterexamples. Their snapshots, characteristics, and relevant indicators are presented in Table 3. The sparkline bar plots in the header show the histogram patterns of the counterexamples A and B. The data for counterexamples A and B have n = 4, m = 1, and p → ∞. Similar to the case study, both cases have been coarsened twice, reducing the number of bins k from twelve to six and then to three.

TABLE 3. An exemplary illustration of limitations of the loss-aware methods. In addition to k, the variance unexplained and total information loss are shown with the first one or two principal components, expressed in (•)@κ (1 or 2).
Counterexample A shows no advantage with more than three bins per histogram. The histograms all have only three different levels, which are cut equidistantly. With the coarser binning resolution k = 3, we do not lose any information. Hence, the estimated losses should not be applied to counterexample A.
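The lossless coarsening in counterexample A can be checked with a small sketch. It assumes a twelve-bin histogram that is constant within each of three equally wide blocks; the bin values below are invented for illustration and are not the values from Table 3. Merging four bins at a time corresponds to two rounds of pairwise coarsening.

```python
def coarsen_uniform(hist, factor):
    """Sum every `factor` consecutive bins."""
    return [sum(hist[i:i + factor]) for i in range(0, len(hist), factor)]

def spread_back(coarse, factor):
    """Reconstruct fine bins by spreading each coarse bin evenly."""
    return [value / factor for value in coarse for _ in range(factor)]

def max_reconstruction_error(hist, factor):
    """Largest absolute difference between the original histogram and
    the histogram reconstructed from its coarsened version."""
    rebuilt = spread_back(coarsen_uniform(hist, factor), factor)
    return max(abs(a - b) for a, b in zip(hist, rebuilt))

# Three levels, each constant over four of the twelve bins (hypothetical counts).
three_level = [2] * 4 + [6] * 4 + [2] * 4
# A histogram that varies inside the blocks, for contrast.
varying = [1, 2, 3, 2, 5, 7, 6, 6, 3, 2, 1, 1]

error_three_level = max_reconstruction_error(three_level, 4)  # 12 -> 3 bins
error_varying = max_reconstruction_error(varying, 4)
```

For the piecewise-constant histogram the reconstruction error is zero, so a loss model that always predicts a higher loss for fewer bins overestimates the loss in this situation.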
Counterexample B shows no advantage with more than two principal components (PCs). All four histogram observations can be described by linear combinations of the first two histograms. Hence, more than two PCs do not bring any additional information gain here. All variance is explained by the first two PCs, whereas our estimated losses still drop slightly with more PCs. Hence, the method proposed in this paper does not reflect the real behavior when preprocessing counterexample B.

V. SUMMARY AND OUTLOOK
This paper presented a comprehensive, information loss-based perspective to support decision-making in configuring preprocessing for customer fleet analytics. The preprocessing includes (i) data binning on the customer side, and (ii) principal component analysis (PCA) based on the binned data without access to the raw sensor measurements. First, we characterized the data using three scale parameters (number of vehicles, number of sensors, and number of samples) and two structural parameters (average correlation coefficients among the vehicles and among the sensors). Based on these parameters, the scale and correlation structure of the dataset can be modeled. To estimate the information losses across binning and PCA, we generated the sample datasets stochastically. We then simulated the preprocessing, parameterized by the number of bins and the number of dimensions retained.
A sensitivity study identified the impact of preprocessing, scale, and structure parameters on the binning, PCA, and total losses. If the scale parameters are sufficiently large, their effects on information losses are negligible. By performing empirical regression, we identified the mechanisms of the loss formulation. The total loss consists of both binning loss and weighted PCA loss. The binning loss depends primarily on the number of bins in a negative exponential fashion. The PCA loss was modeled using two parallel loss terms of the preprocessing parameters, weighted by a structural multiplier in which the structural parameters are combined serially. In a case study based on binned real-world customer fleet data, we manually configured various binning levels by coarsening the bins and applied k-means clustering and exploratory data analysis. The case study demonstrated that the estimation of information loss can support decision-making in properly configuring histogram binning and PCA without access to the raw time-series measurement logs.
When working with histogram binning and PCA, it is valuable to assess information loss (for example, using our derived loss model) and then determine the number of bins and principal components accordingly. Traditional methods like variance explained for PCA may not fully capture the influence of histogram binning on analytical performance. Our approach makes the effect of histogram binning on PCA and on analytical outcomes explicit. With a sufficient number of samples, customers, and sensors, a higher resolution of histogram bins per sensor can generally improve the performance of dimensionality reduction without losing more information overall. Furthermore, we discussed the limitations of our methodology and findings by illustrating two polar cases.
Several avenues for future research stem from the methodology proposed in this study. A promising area involves multi-objective optimization techniques for binning followed by PCA. Such an optimization could aim to simultaneously minimize noise and maximize retained information while avoiding an increase in dimensionality that might compromise interpretability. Another aspect worthy of future research is how the information losses of the original part of the dataset are affected when new sensors are added to the vehicle while the number of bins is kept unchanged. Additionally, when dealing with sensors that share physical relationships, there arises a need for aggregating their joint distributions, such as engine maps. Furthermore, the impact of such a mixed-variate aggregated dataset on information losses remains to be understood.
(resulting) correlation matrices based on the mean value of each random variable X_ij, i.e., $\bar{b}_{ij} = \frac{1}{p}\sum_{x=1}^{p} b_{ij}^{(x)}$. The mean values of the whole dataset D are represented as the matrix $\bar{D} \in \mathbb{R}^{n \times m}$. $\bar{D}$ can be decomposed into m column vectors {d_c1, ..., d_cm} or n row vectors {d_r1, ..., d_rn}. The column-wise Pearson correlation matrix of $\bar{D}$ is represented by $R \in \mathbb{R}^{m \times m}$, in which the element ij of the correlation matrix is the Pearson correlation coefficient between the column vectors d_ci and d_cj.

13: Flatten D to M according to (3);
14: M_n ← Normalize M into probability densities by (4);
15: Perform truncated singular value decomposition for M according to (5);
16: M_n ← Approximate M with κ rank according to (6);
17: D ← Reconstruct M_n into the nested dataset structure similar to D;
18:

As a result, we get two matrices A ∈ R^{n×m} and B ∈ R^{n×m} and perform the simulation in the next step based on these simulated beta parameters.
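Steps 13 to 17 of the listing above can be sketched with NumPy as follows. This is a minimal illustration under assumptions, not the paper's implementation: equations (3)-(6) are not reproduced here, every sensor is assumed to have the same number of bins k, each histogram is normalized to sum to one as a stand-in for the probability density normalization, and the function name is hypothetical.

```python
import numpy as np

def bin_pca_reconstruct(dataset, kappa):
    """Normalize, flatten, rank-kappa approximate, and reshape back.

    dataset: array-like of shape (n vehicles, m sensors, k bins) with
    strictly positive histogram sums (assumption).
    Returns the reconstructed array and the singular values.
    """
    data = np.asarray(dataset, dtype=float)
    n, m, k = data.shape
    # Normalize each histogram so that its bins sum to one.
    normalized = data / data.sum(axis=2, keepdims=True)
    # Flatten the nested structure to an (n, m*k) matrix.
    flat = normalized.reshape(n, m * k)
    # Truncated SVD: keep only the first kappa singular triplets.
    u, s, vt = np.linalg.svd(flat, full_matrices=False)
    approx = (u[:, :kappa] * s[:kappa]) @ vt[:kappa, :]
    # Restore the nested (n, m, k) structure.
    return approx.reshape(n, m, k), s

# Random toy data: 6 vehicles, 2 sensors, 4 bins (illustrative only).
rng = np.random.default_rng(0)
toy_data = rng.random((6, 2, 4)) + 0.1
reconstructed, singular_values = bin_pca_reconstruct(toy_data, 2)
```

The sketch applies the SVD to the uncentered matrix, as the listing mentions only a truncated singular value decomposition; whether the data are centered beforehand is not specified in this excerpt.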

FIGURE 3. Information losses with different resolutions of the preprocessing. The areas around the curves are 95% confidence bands. L_k, L_κ, and L represent binning, PCA, and total losses. κ is the order of the approximated rank, set to 1, 3, or 10.

FIGURE 5. Information losses with different correlation structures. The areas around the curves are 95% confidence bands. L_k, L_κ, and L represent binning, PCA, and total losses. μ is the row-wise blending factor, set to 0.3, 0.5, or 0.7.

FIGURE 6. Experimental results of the case study. (a)-(e) show the scatter plots of the 1454 customer fleets projected onto the first two principal components (PC1 and PC2, κ = 2) with five different binning levels (L1 to L5). For example, Fig. 6b is the scatter plot of PC1 and PC2 with binning level L2, denoted as L2 / 2D. (f)-(h) show three metrics (PCA variance unexplained K_var, relative binning K-S statistic K_K-S, and information loss L in Sh) and their correlation to the accuracy A for clustering the two fleet segments over 25 experiment configurations (L1 to L5 combined with 1D to 5D).