Interval-Valued Features Based Machine Learning Technique for Fault Detection and Diagnosis of Uncertain HVAC Systems

The operation of heating, ventilation, and air conditioning (HVAC) systems is usually disturbed by many uncertainties, such as measurement errors, noise, and temperature variations. This paper therefore proposes a new multiscale interval principal component analysis (MSIPCA)-based machine learning (ML) technique for fault detection and diagnosis (FDD) of uncertain HVAC systems. The main goals of the developed MSIPCA-ML approach are to enhance diagnosis performance, improve indoor environment quality, and minimize energy consumption in uncertain building systems. Model uncertainty is addressed by adopting an interval-valued data representation. The performance of the proposed FDD technique is investigated using sets of synthetic and emulated data extracted under different operating conditions. The presented results confirm the high efficiency of the developed technique in monitoring uncertain HVAC systems, owing to the high diagnosis capabilities of the interval-feature-based support vector machines and k-nearest neighbors classifiers and their ability to distinguish between the different operating modes of the HVAC system.


I. INTRODUCTION
Generally, the energy demand of the residential and tertiary sector represents half of the total energy consumption, and HVAC systems are the most energy-consuming components (66% of a building's energy consumption). However, operational faults in HVAC systems can significantly decrease their efficiency. Research studies have shown that an efficiency increase of 5-15% is attainable simply by repairing faults and optimizing building control systems [1].
The associate editor coordinating the review of this manuscript and approving it for publication was Fanbiao Li.
Thus, the deployment of FDD approaches is essential to guarantee the safe operation of HVAC systems, improve user comfort, enhance energy efficiency, and reduce operating and maintenance costs [2]. Nevertheless, few effective FDD approaches have been proposed for HVAC systems in the literature [3], [4].
FDD approaches can be classified into two main categories: data-driven [5], [6] and model-based approaches [7], [8]. Model-based FDD approaches consist in comparing system measurements with system variables computed from a mathematical model, which is usually derived from fundamental understanding of the system under normal operating conditions [7], [9]. The difference between the measurements and the model predictions (the so-called residuals) can be used as a diagnosis metric for decision making [10], [11].
On the other hand, data-driven FDD methods only make use of the available diagnosis data [12]-[14]. The data are first used to identify the model in the training phase; the model is then used to diagnose faults in the testing phase.
Data-based FDD methods comprise two principal phases: feature extraction and selection, and fault classification [15], [16].
PCA is the most widely employed tool for feature extraction and selection [17]; it is a data-based method that has been applied to diverse complex systems [18], [19]. Feature extraction and selection require building the system PCA model under normal operating conditions (NOC). This model is then used as a reference for system monitoring, and its identification is based on estimating the structure of the process through an eigen-decomposition of the covariance matrix of the training data [20].
Conventional PCA-based feature extraction and selection techniques have only been implemented for single-valued representations. During data mining operations, such data are obtained after several simplification stages, which leads to high imprecision. Indeed, real data are frequently affected by various uncertainties, such as the imprecision linked to the adopted data estimation approach, computation errors, and measurement errors. These uncertainties degrade the PCA model and, therefore, the FDD accuracy [21]. By adopting an interval-valued representation instead of single values, this uncertainty can be treated more accurately while still representing the actual data. Unlike the single-valued case, the interval-valued representation naturally provides extra information, leading to better decision making. Determining the PCA model in this case requires new methods suited to interval data.
Several conventional PCA versions have been extended to interval-valued representations over the past two decades [16]. The first variants are the centers PCA (CPCA) [22] and the vertices PCA (VPCA) [23]. The centers approach relies on the interval centers, while the vertices approach focuses on the vertices of the hyperrectangles formed by the interval-valued data. Another method, named midpoints-radii PCA (MRPCA), deals with both interval radii and interval centers [24]; it improves on the centers PCA by incorporating the radii of the data.
The authors in [25] presented an alternative technique by applying least squares to MRPCA, while an analytic PCA method for interval-valued data, based on an interval-valued covariance matrix, was proposed in [26]. In [27], the authors applied the symbolic covariance to extend the traditional PCA to the interval-valued case. The complete-information principal component analysis (IPCA) presented in [28] is a PCA for interval-valued representations with an improved covariance matrix calculation. More precise monitoring can be obtained by representing the uncertainties in the form of intervals [29], where PCA for interval-valued data is then applied for feature extraction and selection. Nevertheless, this requires extending the monitoring routine to the IPCA model. In this paper, the IPCA approach is applied to extract the most relevant and efficient interval-valued features from the HVAC system data. The selected features are then fed to ML techniques, namely support vector machines (SVM) [30], decision tree [31], K-Nearest Neighbors (KNN) [32], and Naive Bayes (NB) [33], for fault classification. This paper thus proposes a multiscale IPCA-based ML technique offering higher safety and reliability for FDD of uncertain HVAC systems. The uncertainties are handled through an interval-valued representation of the data sets, and a further multiscale/wavelet decomposition is applied for better diagnosis performance. The multiscale representation is an effective technique for separating the important data features from the noise through filtering: small wavelet coefficients usually correspond to random noise, while the important deterministic features are captured by large wavelet coefficients (in the detail signals).
Thus, from the HVAC system measurements, the features are extracted in an appropriate manner via the multiscale IPCA (MSIPCA) approach, where an optimal number of features is selected. Finally, different classifiers are used to classify the various operating modes occurring in HVAC systems.
The rest of the paper is organized as follows: Section II presents a brief background on interval-valued data representation. Feature extraction and selection based on multiscale interval PCA is presented in Section III. The results showing the performance of the developed FDD methodology are described in Section IV, and Section V concludes the paper.

II. INTERVAL-VALUED DATA DESCRIPTION AND NORMALIZATION
In practice, due to possible measurement errors, the actual value x*_i may differ from the measured value x_i. The measurement error is represented by e_i = x*_i - x_i. Usually, a measurement error margin (an upper bound δ_i such that |e_i| ≤ δ_i) is provided by the sensor manufacturer, so that x*_i ∈ [x_i - δ_i, x_i + δ_i]. Thus, for the sake of accuracy, it is better to represent such measurements by an interval value instead of a single value. Since the exact error is unknown, it is assumed that its variation is bounded, so that each measurement can be defined by an interval [x^-, x^+], where x^- and x^+ designate the lower and upper bounds of x, respectively.
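As a concrete illustration, the interval construction above can be sketched in a few lines of NumPy (the readings and the error margin δ below are hypothetical values, not data from the paper):

```python
import numpy as np

def to_intervals(x, delta):
    """Wrap single-valued measurements x with an error margin delta into
    interval-valued samples [x - delta, x + delta]."""
    x = np.asarray(x, dtype=float)
    return np.stack([x - delta, x + delta], axis=-1)

readings = np.array([20.1, 20.4, 19.8])  # hypothetical temperature samples
margin = 0.2                             # hypothetical sensor bound delta
iv = to_intervals(readings, margin)      # shape (3, 2): lower/upper bounds
```

Each row of `iv` is one interval-valued sample whose lower bound never exceeds its upper bound.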

A. INTERVAL-VALUED DATA DESCRIPTION
First, the properties of interval-valued variables are illustrated. An interval-valued variable [X_j] ⊂ R is defined by a sequence of sets of values delimited by ordered pairs of bounds (minimum and maximum): [x_j(k)] = [x_j^-(k), x_j^+(k)], with x_j^-(k) ≤ x_j^+(k). The generic interval [x_j(k)] can equivalently be given by the pair of its center x_j^c(k) = (x_j^-(k) + x_j^+(k))/2 and its radius x_j^r(k) = (x_j^+(k) - x_j^-(k))/2. For any interval-valued variable [X_j], the mean value is given by the interval of the bound means: [m_j] = [(1/N) Σ_k x_j^-(k), (1/N) Σ_k x_j^+(k)]. For any pair of interval-valued variables, an inner product can be defined, and the squared norm of [X_j] is ‖[X_j]‖² = ⟨[X_j], [X_j]⟩ (the definitions used here are those of the IPCA framework of [28]). In order to avoid predicted lower-bound values y^-(k) of a response variable exceeding the upper-bound values y^+(k), Moore's linear combination rule from interval arithmetic is adopted. Let a_j ∈ R be a real scalar; then the interval-valued variable [x] times a_j is given by [34]: a_j[x] = [a_j x^-, a_j x^+] if a_j ≥ 0, and a_j[x] = [a_j x^+, a_j x^-] if a_j < 0. The lower and upper bounds of a linear combination y = Σ_j a_j [x_j] can then be expressed as: y^-(k) = Σ_{a_j≥0} a_j x_j^-(k) + Σ_{a_j<0} a_j x_j^+(k) and y^+(k) = Σ_{a_j≥0} a_j x_j^+(k) + Σ_{a_j<0} a_j x_j^-(k). Generally, some standardization must be carried out before data processing in order to obtain scale-invariant results. Four alternative standardization methods for interval data were developed in [35] and are described below.
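Moore's rule for a full linear combination can be sketched as follows (a minimal NumPy version; the function and variable names are ours):

```python
import numpy as np

def moore_linear_combination(a, lower, upper):
    """Bounds of y = sum_j a_j * [x_j^-, x_j^+] via Moore's rule:
    positive coefficients pair with like bounds, negative ones swap them."""
    a = np.asarray(a, dtype=float)
    pos, neg = np.clip(a, 0, None), np.clip(a, None, 0)
    y_lower = lower @ pos + upper @ neg
    y_upper = upper @ pos + lower @ neg
    return y_lower, y_upper

# y = x1 - x2 with x1 in [1, 2] and x2 in [2, 3]  =>  y in [-2, 0]
lo, hi = moore_linear_combination([1.0, -1.0],
                                  np.array([[1.0, 2.0]]),   # lower bounds
                                  np.array([[2.0, 3.0]]))   # upper bounds
```

Because each bound always combines coefficients of a consistent sign, `y_lower` can never exceed `y_upper`.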

1) STANDARDIZATION USING THE DISPERSION OF INTERVAL CENTER AND RANGE
The interval-valued variables are standardized according to the procedure developed in [36]. The results rely on two basic notions of interval arithmetic: the mean interval and the distance between intervals. The mean interval [m_j] is given by the interval of the bound means, [m_j] = [(1/N) Σ_k x_j^-(k), (1/N) Σ_k x_j^+(k)], while the distance between intervals d([x_j(k)], [y_j(k)]) satisfies the Euclidean distance properties. This distance yields a notion of scalar variance for interval-valued data: the variance is described as the normalized sum of the squared distances from the mean interval, so the variance σ_j² for interval-valued data is σ_j² = (1/N) Σ_k d²([x_j(k)], [m_j]). With a little algebra, this variance can be decomposed into three components: the variance of the midpoints, the variance of the ranges, and twice the covariance between midpoints and ranges. These properties imply that the distance between intervals reduces to a Euclidean distance in the space R^m. A standardized interval is then obtained by centering each bound on the midpoint of the mean interval and dividing by σ_j.
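A minimal sketch of this standardization, assuming the variance decomposition stated above (midpoint variance + radius variance + twice their covariance); the exact formulas of [36] may differ in detail:

```python
import numpy as np

def interval_variance(lower, upper):
    """Scalar variance of one interval variable, decomposed as in the text:
    midpoint variance + radius variance + twice their covariance (sketch)."""
    c = (lower + upper) / 2.0          # midpoints
    r = (upper - lower) / 2.0          # radii (half-ranges)
    cov_cr = np.mean((c - c.mean()) * (r - r.mean()))
    return np.var(c) + np.var(r) + 2.0 * cov_cr

def standardize_intervals(lower, upper):
    """Center each bound on the mean midpoint and divide by the interval
    standard deviation (method 1, sketched)."""
    c = (lower + upper) / 2.0
    sigma = np.sqrt(interval_variance(lower, upper))
    return (lower - c.mean()) / sigma, (upper - c.mean()) / sigma

lower = np.array([0.0, 1.0, 2.0])
upper = np.array([1.0, 2.0, 3.0])
slo, shi = standardize_intervals(lower, upper)
```

Note that the three-term sum equals the variance of the upper bounds (var(c + r)), so it is always non-negative.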

2) STANDARDIZATION USING THE DISPERSION OF THE INTERVAL CENTERS
The second standardization approach considers the mean and the dispersion of the interval centers x_j^c(k) = (x_j^-(k) + x_j^+(k))/2 and standardizes each variable so that the transformed midpoints have zero mean and unit variance. Denoting by m̄_j and s_j the mean and the standard deviation of the midpoints, the standardized interval is defined by the boundaries (x_j^-(k) - m̄_j)/s_j and (x_j^+(k) - m̄_j)/s_j.

3) STANDARDIZATION USING THE DISPERSION OF THE INTERVAL BOUNDARIES
The third standardization approach transforms the N intervals [x_j(k)] of each variable [X_j] such that the mean and the joint dispersion of the rescaled interval boundaries are 0 and 1, respectively. The joint dispersion of a variable [X_j] is the pooled variance of its 2N boundary values around their common mean.

4) STANDARDIZATION USING THE GLOBAL RANGE
The fourth standardization approach transforms the intervals of a given variable so that the range of the N rescaled intervals is the unit interval. Let min_k x_j^-(k) and max_k x_j^+(k) be the global lower and upper boundary values; each interval is transformed into a standardized interval by subtracting min_k x_j^-(k) from both bounds and dividing by the global range max_k x_j^+(k) - min_k x_j^-(k).
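A sketch of the global-range rescaling (our own minimal implementation, not the authors' code):

```python
import numpy as np

def standardize_global_range(lower, upper):
    """Rescale all N intervals of one variable so the pooled bounds span
    [0, 1] (method 4, sketched): subtract the global minimum lower bound,
    then divide by the global range."""
    g_min = lower.min()
    span = upper.max() - g_min
    return (lower - g_min) / span, (upper - g_min) / span

lower = np.array([1.0, 2.0])
upper = np.array([3.0, 5.0])
slo, shi = standardize_global_range(lower, upper)  # spans exactly [0, 1]
```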

III. FEATURE EXTRACTION AND SELECTION USING MULTISCALE INTERVAL PCA

A. MULTISCALE REPRESENTATION
The interval data given by the matrix [X] ∈ R^(N×m), where N is the number of measurements and m the number of variables, are first represented at multiple scales. The original signals are projected onto a set of orthonormal scaling functions [37], which is equivalent to filtering them with a low-pass filter h = [h_1, ..., h_r] of length r. The detail signals are obtained by projecting the original signal onto a set of wavelet basis functions [37], i.e., by filtering with the high-pass filter derived from the wavelet basis. The original signal can then be reproduced by summing the detail signals at all scales and the scaled signal at the coarsest scale [37], where n, k, j, and J denote the original signal length, the translation parameter, the dilation parameter, and the number of scales, respectively [38]. The wavelet transformation achieves an effective separation between the deterministic features and the random noise; it provides a powerful transformation of time-domain signals into the time-frequency domain [37].
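As an illustration, a filter-bank decomposition and its reconstruction can be written directly in NumPy. The Haar wavelet is our choice here for brevity; the paper does not specify the wavelet family:

```python
import numpy as np

def haar_decompose(x, levels):
    """1-D Haar wavelet decomposition: at each scale the signal is split into
    a low-pass 'scaled' part and a high-pass 'detail' part.
    The signal length must be divisible by 2**levels."""
    details = []
    approx = np.asarray(x, dtype=float)
    for _ in range(levels):
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))  # detail coefficients
        approx = (even + odd) / np.sqrt(2.0)         # scaled signal
    return approx, details

def haar_reconstruct(approx, details):
    """Invert the decomposition: combine the coarsest scaled signal with the
    detail signals across all scales."""
    x = approx
    for d in reversed(details):
        out = np.empty(2 * x.size)
        out[0::2] = (x + d) / np.sqrt(2.0)
        out[1::2] = (x - d) / np.sqrt(2.0)
        x = out
    return x

signal = np.arange(8.0)
approx, details = haar_decompose(signal, 3)
```

Thresholding the small entries of `details` before reconstruction is the usual way to suppress random noise while keeping the large deterministic coefficients.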

B. FEATURE EXTRACTION
The interval PCA (IPCA) approach was proposed in [28]; it extends the traditional PCA to process interval-valued data and exploit the additional information contained in interval measurements. Given two interval-valued variables [X_i] and [X_j], IPCA [28] defines their inner product ⟨[X_i], [X_j]⟩ from the interval bounds. In the auto-correlation case ⟨[X_j], [X_j]⟩, this inner product yields the squared interval norm ‖[X_j]‖². Based on these definitions of inner product and interval norm, and with all data pre-processed, the covariance matrix of [X] ∈ R^(N×m) is obtained.
The interval-valued principal components [T] in the IPCA approach are determined using the linear combination algorithm for interval-valued variables [39]. The IPCA approach first computes the covariance matrix of the interval data matrix [X] and then carries out its eigen-decomposition, where λ_1, ..., λ_m and p_1, ..., p_m are the resulting eigenvalues and eigenvectors, respectively. Based on Moore's rule [39], the bounds of the i-th interval-valued principal component are given by: t_i^-(k) = Σ_{p_ji≥0} p_ji x_j^-(k) + Σ_{p_ji<0} p_ji x_j^+(k) and t_i^+(k) = Σ_{p_ji≥0} p_ji x_j^+(k) + Σ_{p_ji<0} p_ji x_j^-(k). From the IPCA model, the interval-valued estimations [x̂(k)] are obtained by applying the same rule with the matrix C = P̂ P̂^T.
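The projection step can be sketched as follows. Note this is a simplified stand-in: it eigen-decomposes the covariance of the interval midpoints rather than the full IPCA covariance of [28], and then applies Moore's rule to the bounds:

```python
import numpy as np

def interval_pca_scores(lower, upper, n_components):
    """Sketch of interval PC scoring: eigen-decompose a covariance matrix
    (here that of the interval midpoints, as a stand-in for the IPCA
    covariance of [28]) and project the interval bounds with Moore's rule."""
    c = (lower + upper) / 2.0
    eigvals, eigvecs = np.linalg.eigh(np.cov(c, rowvar=False))
    order = np.argsort(eigvals)[::-1]            # sort variances descending
    P = eigvecs[:, order[:n_components]]
    pos, neg = np.clip(P, 0, None), np.clip(P, None, 0)
    t_lower = lower @ pos + upper @ neg          # Moore's rule, lower bound
    t_upper = upper @ pos + lower @ neg          # Moore's rule, upper bound
    return t_lower, t_upper

rng = np.random.default_rng(0)
centers = rng.normal(size=(50, 4))
t_lo, t_up = interval_pca_scores(centers - 0.1, centers + 0.1, 2)
```

Since t_upper - t_lower equals the interval widths multiplied by |P|, the resulting score intervals are always properly ordered.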

C. INTERVAL-VALUED PCA MODEL IDENTIFICATION
The selection of the appropriate number of principal components ℓ [40] is the key step in identifying the PCA model. In this study, ℓ is determined by minimizing the reconstruction error variance based on the interval data, as in [40]. Generally, when PCA is applied to single-valued data, the reconstruction method estimates a variable from the others using the PCA model. The accuracy of the reconstruction depends on the capability of the PCA model to capture the relations between all variables [40], [41]. In [18], the authors proposed an IPCA approach using variable reconstruction. The value of ℓ that minimizes the variance of the interval-valued reconstruction error gives the number of principal components to be retained in the IPCA model [18].

D. FEATURE SELECTION
The determination of the IPCA model is based on an eigen-decomposition of the covariance matrix and the selection of the number of components to be retained [18], [20], yielding the matrices of eigenvalues, eigenvectors, and interval-valued principal components. By retaining the first ℓ largest eigenvalues and their corresponding eigenvectors P̂, the matrix [X] is decomposed as [X̂] = [T̂] P̂^T, where [x̂(k)] = C [x(k)] is the estimation of [x(k)], C = P̂ P̂^T, and [t̂(k)] is the vector of the first ℓ interval-valued scores of the latent variables. The vector of the m - ℓ last interval-valued scores, which represents the projection of the interval-valued measurements onto the residual subspace, is defined analogously with the remaining eigenvectors. To achieve good classification performance, it is essential to extract a few informative statistical characteristics from the IPCA model.
In this study, the features extracted and selected from the IPCA model are the first ℓ retained interval-valued principal components, the interval squared weighted error (ISWE) statistic, the interval norm distance D_N, the city-block distance D_B, the Hausdorff distance D_H, the Euclidean distance D_E, and the Wasserstein distance D_W. These features are presented next.

1) INTERVAL SQUARED WEIGHTED ERROR (ISWE) STATISTIC
The ISWE feature is the most essential measure in the residual principal components subspace [18]; it weights the squared interval scores of the residual components by their eigenvalues, ISWE(k) = Σ_{i=ℓ+1}^{m} [t_i(k)]²/λ_i. The interval norm distance D_N is given by the interval norm of the difference between two interval-valued samples. The remaining distances are defined variable-wise from the interval bounds, the midpoints m_{x,j}(k) = (x_j^-(k) + x_j^+(k))/2, and the radii r_{x,j}(k) = (x_j^+(k) - x_j^-(k))/2:
d_B(x_j(k), y_j(k)) = |x_j^-(k) - y_j^-(k)| + |x_j^+(k) - y_j^+(k)| (city-block),
d_H(x_j(k), y_j(k)) = max(|x_j^-(k) - y_j^-(k)|, |x_j^+(k) - y_j^+(k)|) (Hausdorff),
d_E(x_j(k), y_j(k)) = ((x_j^-(k) - y_j^-(k))² + (x_j^+(k) - y_j^+(k))²)^(1/2) (Euclidean),
d_W(x_j(k), y_j(k)) = ((m_{x,j}(k) - m_{y,j}(k))² + (1/3)(r_{x,j}(k) - r_{y,j}(k))²)^(1/2) (Wasserstein).
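The four bound-based distances can be implemented in a few lines (per-sample, summing over variables; a sketch under the reconstructed definitions above):

```python
import numpy as np

def hausdorff(xl, xu, yl, yu):
    """Hausdorff distance between interval vectors, summed over variables."""
    return np.sum(np.maximum(np.abs(xl - yl), np.abs(xu - yu)))

def city_block(xl, xu, yl, yu):
    """City-block (L1) distance on the interval bounds."""
    return np.sum(np.abs(xl - yl) + np.abs(xu - yu))

def euclidean(xl, xu, yl, yu):
    """Euclidean (L2) distance on the interval bounds."""
    return np.sqrt(np.sum((xl - yl) ** 2 + (xu - yu) ** 2))

def wasserstein(xl, xu, yl, yu):
    """Wasserstein distance on midpoints m and radii r:
    sqrt(sum (m_x - m_y)^2 + (1/3)(r_x - r_y)^2)."""
    mx, rx = (xl + xu) / 2.0, (xu - xl) / 2.0
    my, ry = (yl + yu) / 2.0, (yu - yl) / 2.0
    return np.sqrt(np.sum((mx - my) ** 2 + (rx - ry) ** 2 / 3.0))

# single-variable example: [1, 2] vs [2, 4]
xl, xu = np.array([1.0]), np.array([2.0])
yl, yu = np.array([2.0]), np.array([4.0])
```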

IV. SIMULATION RESULTS
The various steps of the developed FDD technique are illustrated in Figure 1. The confusion matrix is used to compute the performance metrics of each classifier, where the classification accuracy is given the highest priority. Moreover, the Recall and Precision metrics are applied as per [42]: Precision = TP/(TP + FP) and Recall = TP/(TP + FN), where TP, FP, and FN denote the number of correctly identified samples of a class, the number of samples erroneously assigned to that class, and the number of samples of that class that were missed, respectively. The Recall metric measures the per-class classification sensitivity.

A. SIMULATED SYNTHETIC DATA
Two datasets were used to generate the database. The first dataset corresponds to a healthy operation scenario, while the second contains the simulated data under three different faulty scenarios; the latter data are manipulated to emulate the behavior of each fault separately. The generated database is categorized by performing the appropriate multiscale pre-processing and is then exploited as training data for the ML algorithms. In addition, the system is used to generate the faulty database by simulating the occurring fault scenarios, and the simulation results are labeled according to the applied fault type. The labeled data sets are used as inputs to the MSIPCA to distinguish between the different types of system operation. The example given in [37] is replicated to generate the simulated synthetic data: two uncorrelated variables are generated as Gaussian measurements with zero mean and unit variance, and the system forms combinations of these first two variables through addition/subtraction with possible scaling [37]. The six measured variables contained in the data matrix X are then disturbed by a white noise of zero mean and standard deviation 0.2 [37]. Six variables are thus generated.
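The precision/recall computation from a confusion matrix can be sketched as follows (rows = true classes, columns = predicted classes; the example matrix is illustrative, not the paper's results):

```python
import numpy as np

def per_class_metrics(conf):
    """Precision = TP/(TP+FP) and recall = TP/(TP+FN) for each class of a
    confusion matrix whose rows are true classes and columns predictions."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp      # predicted as the class, but wrong
    fn = conf.sum(axis=1) - tp      # belonging to the class, but missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = tp.sum() / conf.sum()
    return precision, recall, accuracy

# illustrative 2-class confusion matrix
p, r, acc = per_class_metrics([[8, 2], [1, 9]])
```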
These variables represent one healthy operating mode (assigned to class C_0) and three faulty scenarios of synthetic data (assigned to C_i, i = 1, ..., 3), as described in Table 2. The interval data are generated by considering an imprecision of 1%. The different normalization techniques (Table 1) are used for data pre-processing before the IPCA model identification. In order to reduce the estimation error of the PCA model, more principal components could be kept in the model; however, the first principal components represent the significant variation in the data, while the last ones mostly represent noise. Therefore, retaining more principal components reduces the estimation error on the training data but does not generalize to the testing data set (over-fitting). Thus, the number of principal components must be selected carefully. In this study, the value of ℓ that minimizes the variance of the interval-valued reconstruction error gives the number of principal components retained in the IPCA model. Via the eigen-decomposition, the variances are sorted in descending order.
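A simplified, single-valued sketch of the variable-reconstruction criterion follows (the paper applies the interval-valued analogue of [40]; the data below are synthetic and illustrative):

```python
import numpy as np

def select_n_components(X):
    """Number of PCs minimizing the variable-reconstruction error variance
    (sketch of the criterion in [40]: each variable is reconstructed from
    the others through the PCA model C = P P^T)."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]   # descending variance
    m = X.shape[1]
    best_l, best = 1, np.inf
    for l in range(1, m):               # l = m leaves nothing to reconstruct
        C = eigvecs[:, :l] @ eigvecs[:, :l].T
        total = 0.0
        for j in range(m):
            denom = max(1.0 - C[j, j], 1e-12)   # guard: variable in model
            xhat_j = (Xc @ C[:, j] - C[j, j] * Xc[:, j]) / denom
            total += np.var(Xc[:, j] - xhat_j)
        if total < best:
            best_l, best = l, total
    return best_l

rng = np.random.default_rng(1)
scores = rng.normal(size=(400, 2))
loadings = rng.normal(size=(2, 6))
X = scores @ loadings + 0.05 * rng.normal(size=(400, 6))
ell = select_n_components(X)
```

Unlike the raw training reconstruction error, which decreases monotonically with ℓ, this criterion can reach a minimum at an intermediate ℓ, which is what makes it usable for model-order selection.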
The faulty data are then transformed through the obtained model, and suitable characteristics are extracted and selected in order to represent the different models simultaneously in the two IPCA subspaces. The first generated fault (F_1, assigned to class C_1) represents a step change in the mean of all four variables, the second (F_2, assigned to class C_2) consists of a variance change in the samples, and the third (F_3, assigned to class C_3) represents an incipient fault. The classification performances are reported in Tables 3 and 4 and are compared using the features extracted by the IPCA and the MSIPCA. In the first step, only the first retained principal components are used as inputs to the different classifiers. For these interval-valued data, the previously presented standardization methods are used and compared. According to the accuracy performances reported in Tables 3 and 4, the KNN and SVM classifiers perform best when the interval-valued data-based MSIPCA approach is standardized using the global range of the intervals, compared with the other standardization methods and with the interval-valued data-based IPCA approach.
To further improve the classification accuracy, other features are added to the first ℓ = 4 principal components. Thus, the norm of the first retained interval-valued principal components T̂ is combined with the different interval distances given in equations (41)-(45), respectively. Tables 5 and 6 show the results obtained when using these new features as inputs to the KNN and SVM classifiers. Both classifiers give their best results when T̂ is combined with D_N; the accuracies of all these MSIPCA-based combinations lie between 97.16% and 100%. From the above tables, it can be concluded that the MSIPCA-based ML approaches classify more accurately than the IPCA-based ML approaches. For instance, the MSIPCA-based KNN technique achieves 100% class accuracy for classes 0-3.

B. EMULATED HEATING SYSTEM

1) SYSTEM DESCRIPTION
The TRNSYS (transient system simulation) software, the TRNSYS simulation studio, and the TRNBuild interface are employed to emulate an actual building and to generate the heating system data. The TRNBuild interface allows adding many non-geometrical properties, such as window and door properties, thermal conductivity, wall and layer material properties, and various gains. Based on the existing construction parameters, the TRNSYS model is run with a time step of 1 h, using the meteorological data provided by the US Department of Energy (DOE). The developed FDD method is validated in simulation by modeling in TRNSYS a three-zone building (located in France, in the region of Amiens) during the cold season, where the rooms are loaded with various occupancy profiles and schedules. The data collected during one year of normal operation are used to train and configure the FDD system by building a PCA model, following the method used above for the synthetic data. To generate the faulty database, two fault cases were emulated in TRNSYS. The individual faults are introduced statically by changing existing objects, i.e., schedules. The considered faults are: i) unplanned occupancy, tested by adding unexpected occupants at various hours, and ii) opening a window while the heating system is switched on, causing a waste of energy. As the FDD problem can be treated as a classification problem, three data classes are used: a healthy data class and two faulty data classes. The data time range is set from 0 to 8000 h with a time step of 1 h. The heating system variables are described in Table 7.

2) FAULT CLASSIFICATION RESULTS
To validate the developed FDD method, five different variables are simulated, as given in Table 7. These variables represent one healthy mode (class C_0) and two separate faulty operating modes (C_i, i = 1, 2), as shown in Table 8. Tables 9 and 10 show the classification accuracy of the selected features under the different normalizations of the interval-valued data. One can conclude that the KNN and SVM classifiers perform best when the interval-valued data-based MSIPCA approach is standardized using the global range of the intervals, compared with the other standardization methods and with the interval-valued data-based IPCA approach. Tables 11 and 12 report the results obtained with the combined features for fault diagnosis of the heating system; they show that the MSIPCA-based ML gives a higher classification accuracy than the IPCA-based ML methods. Both classifiers give their best results when T̂ is combined with D_N, D_H, D_E, or D_B, and the accuracies of all these MSIPCA-based combinations lie between 97.53% and 100%. Among the standardization methods, the one using the dispersion of the interval centers improves the KNN and SVM accuracies compared with methods 1, 3, and 4. To further improve the classification accuracy, the city-block distance is added as a feature; the resulting KNN and SVM accuracies compare favorably with the others, as this combination captures the variation of the data in the two IPCA subspaces. Tables 13 and 14 present the SVM confusion matrices in testing when D_B is used as a feature, based on IPCA and MSIPCA, respectively. For the testing healthy data, assigned to class C_0, the IPCA-based (respectively, MSIPCA-based) SVM classifier (see Tables 13 and 14) identifies 2000 samples out of 6000 as true positives.
Moreover, the detection accuracy is 91.78% (respectively, 99.4%) and the recall is 91.05% (respectively, 99.6%), which also represents the classification accuracy. Thus, for this class, only 8.22% (respectively, 0.6%) of misclassifications (false alarms) are observed. For the first fault (F_1), assigned to class C_1, the precision is 100% and the recall is 99.6%, with 0% misclassification for the training data set in the case of MSIPCA. The achieved results confirm the effectiveness of the proposed method for FDD of the heating system.

V. CONCLUSION
In this paper, a novel fault detection and diagnosis (FDD) technique was developed for uncertain HVAC systems. The developed method, called multiscale interval principal component analysis (MSIPCA)-based machine learning (ML), applies MSIPCA for feature extraction and selection and ML for fault classification. The proposed MSIPCA-ML technique was developed for diagnosing uncertain HVAC systems under various operating conditions, and several cases were considered to demonstrate the robustness and efficiency of the proposed FDD method. Its effectiveness was investigated using synthetic and emulated heating system interval-valued data, and it showed good diagnosis efficiency and classification accuracy under the different operating modes.
As future work, improved interval nonlinear feature extraction and selection approaches will be developed to deal with the uncertain and nonlinear nature of HVAC systems. Accordingly, interval kernel PCA and kernel PCA-based machine learning classifiers will be developed for fault detection and diagnosis of uncertain and nonlinear HVAC systems.