Multi-Machine Gaussian Topic Modeling for Predictive Maintenance

In this paper, we propose a coherent framework for multi-machine analysis, using a group clustering model, which can be utilized for predictive maintenance (PdM). The framework benefits from the repetitive structure posed by multiple machines and enables for assessment of health condition, degradation modeling and comparison of machines. It is based on a hierarchical probabilistic model, denoted Gaussian topic model (GTM), where cluster patterns are shared over machines and therefore it allows one to directly obtain proportions of patterns over the machines. This is then used as a basis for cross comparison between machines where identified similarities and differences can lead to important insights about their degradation behaviors. The framework is based on aggregation of data over multiple streams by a predefined set of features extracted over a time window. Moreover, the framework contains a clustering schema which takes uncertainty of cluster assignments into account and where one can specify a desirable degree of reliability of the assignments. By using a multi-machine simulation example, we highlight how the framework can be utilized in order to obtain cluster patterns and inherent variations of such patterns over machines. Furthermore, a comparative study with the commonly used Gaussian mixture model (GMM) demonstrates that GTM is able to identify inherent patterns in the data while the GMM fails. Such result is a consequence of the group level being modeled by the GTM while being absent in the GMM. Hence, the GTM are trained with a view on the data that is not available to the GMM with the consequence that the GMM can miss important, possibly even key cluster patterns. Therefore, we argue that more advanced cluster models, like the GTM, can be key for interpreting and understanding degradation behavior across machines and ultimately for obtaining more efficient and reliable PdM systems.


I. INTRODUCTION
The importance of manufacturing maintenance is continuously increasing in industry and academia due to the future expectations on maintenance as a key enabler of industrial digitalization.Increased levels of automation, sometimes even referred to as ''light-out factories'', can only be realized with novel maintenance solutions.As a result, the academic field has recently increased its focus towards advanced data analysis for predictive maintenance (PdM).Maintenance experts in industry and academia have therefore conceptualized smart maintenance in a digitalized manufacturing [9].

Smart maintenance highlights data-driven decision making as
The associate editor coordinating the review of this manuscript and approving it for publication was Zhaojun Li .
well as internal and external integration as important principles to improve plant performance.This means sharing and analyzing data from multiple machines to predict future maintenance activities will be central in the development of future solutions in maintenance.However, previous research has identified challenges in developing effective and precise algorithms for PdM, often due to the lack of high-quality labeled data [58], i.e., the data has not been annotated with true machine health condition or it has not contained examples of every possible fault type [23].Correctly labeled data might not be available in real world industrial environments due to different reasons.As an example, for critical machines in a manufacturing company, the first aim is not allowing them to break down by carrying out periodic maintenance intervals.This means that no actual break-down data exists and hence, it is difficult to define the often necessary thresholds and tolerance limits for condition monitoring.There is also no observations until the actual end of life which can be used to identify fault patterns [36], [58].On the other hand, non-equipment-based measures such as product quality can also be used for labeling, but it is difficult to ensure that this type of information can be directly related to the corresponding machine operations.Such situations can only be managed through unsupervised machine learning methods that explore hidden structures and patterns without any target specifications [46].Moreover, in reality, machines can generate highly heterogeneous data from multiple streams, e.g. from sensors and other computer systems, which can include a high degree of non-linearity due to the external influencing factors and other sources of uncertainties.Therefore, it is not easy to integrate all the information to provide data-driven decision support for PdM and there is a need of methods and algorithms that are able to transfer learned knowledge from one machine to another similar machine [30], [35].
We present a framework that is able to capture several of the above complexities and it allows one to perform multi-machine analysis.The framework uses an unsupervised hierarchical probabilistic model called Gaussian topic modeling (GTM) [8], [53] for exploring cluster patterns across multiple machines and where the extracted knowledge associated with identified clusters is used for machine health assessment and degradation modeling.The framework builds upon previous research [10] where one aggregates any number of features over multiple streams in order to obtain a data point that represents the state at that particular time.The main contributions are summarized as follows: • a coherent framework where similarities and differences across machines can be utilized for learning and interpreting cluster patterns, • a simulation procedure for exploring the above framework, • experiments that show the benefit of utilizing the framework for exploring cluster patterns across machines.
The remainder of the paper is structured as follows.Section II focuses on related literature of PdM.The proposed framework for multi-machine analysis is described with illustrative examples in Section III.Section IV presents a multi-machine simulation; application of our framework for discovering patterns across multiple machines; results with exploration of shared clusters as well as results from a comparison with another cluster model called Gaussian mixture model (GMM).Section V discusses advantages of the proposed framework in terms of the research gaps highlighted in the introduction.Section VI concludes the paper with a summary and future work.

II. RELATED WORK
PdM has a high potential for improving productivity of manufacturing companies where the vision is failure free production with several expected effects such as: increased efficiency and reliability of operations as well as cost effectiveness.According to the estimation of Daugherty et al. [2], with the implementation of PdM, it is possible that companies can save up to 12% over scheduled repairs, reduce overall maintenance costs by up to 30% and eliminate asset failures up to 70%.To reach this vision and the expected effects, PdM has gained exponentially increasing attention in the context of prognostic and health management (PHM) and has been started to be adopted by many manufacturing companies in industry.Beside potential effects in industrial applications of PdM, it also holds a great interest in academia with rapidly increasing journal and conference contributions [13], [26].In principle, PdM predicts faults and failures by real-time evaluation of the system state in production and thereby enables for efficient maintenance operations [44].This allows for more efficient maintenance planing with reduced downtime and increased efficiency as a result [40].Literature shows that data-driven analytics can be used to gain knowledge from data to facilitate maintenance decision-making in industrial practice of PdM [34].In this context, there are different types of machine learning (ML) techniques that have been used in various phases of PdM implementation [19], [38].Recent review studies performed by Lee et al. [33] and Kim et al. [29], acknowledge that ML approaches provide increasingly effective solutions in these different phases of PdM, facilitated by the growing capabilities of hardware, cloud-based solutions, and newly introduced state-of-the-art algorithms based on different learning methods such as: supervised, unsupervised, semi-supervised, and reinforcement learning [3].
Cluster analysis [27] is one of the most commonly used unsupervised ML technique within exploratory data analysis for identification of hidden patterns.By identifying the cluster structure in the data and augmenting such structure with domain knowledge and meta data, new information can be gained which can further be used in a supervised setting or anomaly detection for the purpose of early detection and prediction of breakdown.In terms of machine knowledge discovery, Diaz-Rozo et al. [17] proposed an unsupervised approach for identifying multidimensional patterns of components within a cyber-physical system (CPS) application.In order to find the behavior patterns without any contextual information like machine status or maintenance history, different clustering algorithms such as k-means, hierarchical agglomerative and GMM were applied by using real data acquired through an embedded electronic based CPS device from a computer numeric control (CNC) machine during high throughput machining operation.Then, these algorithms were compared with each other in terms of their contribution to the knowledge extraction about the component performance.As a result, it has been demonstrated that unsupervised ML algorithms embedded in CPS are the key enablers for working towards highly accurate diagnosis and prognosis tools.In another study proposed by the same authors [18], they presented a new unsupervised learning algorithm based on GMM called Gaussian-based dynamic probabilistic clustering that is capable of coping with data streams and process dynamics.The algorithm was tested by using synthetic data and streams from an industrial test bed.Amruthnath and Gupta [3] proposed an unsupervised learning based approach for early fault detection in PdM.Both GMM and k-means was tested by using real world vibration data collected from an exhaust fan and it was found that GMM is more capable of predicting fault states than k-means.Yuan et al. [55] proposed a new unsupervised approach to overcome challenges in feature extraction and segmentation of high dimensional complex condition-based maintenance (CBM) life cycle data.Within such context, they developed two kinds of autoencoder models based on deep learning and cluster analysis and tested them by using an experimental bearing life cycle data set used for fault diagnosis in PdM.Recently, Zschech et al. [58] proposed a data science based approach including several analytical methods such as agglomerative hierarchical clustering for unsupervised pattern detection and a recurrent neural network for prognostic model training.The developed approach was tested using real world data in order to demonstrate its usability and practical applicability in the context of building prognostic decision models with missing labels.According to these studies, cluster analysis provides high flexibility and effectiveness in PdM applications for handling common challenges in real world machine and process data such as high dimensionality and lack of correct labeling of machine status or maintenance history [16].However, the domain is still quite challenging as it requires rather complex analysis of multiple data streams and their interrelation together with a model that enables for interpretable and actionable patterns that can be used as a decision support system for PdM in manufacturing [30].
Related to our point of foci, i.e., analysis of multiple machines, there is a term ''fleet based PHM'' where a fleet is identified as a set of machines grouped with regards to some characteristics, e.g., performing similar tasks, having similar service times or similar performance and health conditions [35], [52].Fleet based approaches have several advantages.First, it increases the amount of data available for health assessment and remaining useful life calculation [52].Second, based on similarities of machines, conditions and situations within the fleet allow transfer of knowledge which further improve algorithmic capabilities [51].
Third, this type of approach allows analysis in dynamic operational conditions and detection of deviations on machine behavior by comparisons, with the other machines within the fleet, and offers high level of interpretability as demonstrated in a recent study by Hendrickx et al. [23].Several approaches for PHM of fleets have recently been proposed in the literature [39].These approaches are usually designed for specific fleet compositions and characteristics [52].For instance, Lapira [31] proposed a framework for health monitoring and diagnosis of a fleet of wind turbines using clustering.Similarly, Liu [37] proposed a CPS framework for health management of wind turbines and demonstrated the health assessment using fleet-based analytics.
Peysson et al. [41] has recently presented a framework for fleet-wide maintenance.
Within our paper, we prefer the use of the term ''multimachine analysis'', rather than ''fleet'' since the latter term inevitable leads one to think more large scale when it comes to the number of machines whilst our framework can provide significant insights with as little as two machines.In terms of selecting the machines taking part in the analysis, we agree with the characteristic that Wagner et al. [52] and Lee et al. [35] list.As a further specification with respect to our framework, the critical component within the machines should have comparable physical properties and functions since the same cluster patterns are shared over the machines.
In contrast to these previous studies, we deploy a topic model [8], [53] specifically designed to handle group data with shared patterns.Moreover, we integrate this model with an extension to groups of a previous researched framework [10] that are able to handle multiple streams.The result is a coherent framework for multi-machine analysis using multiple data streams and further contributes to the research gap on how to efficiently model a set of machine characteristics in terms of patterns over time in a variety of operation conditions.

III. MULTI-MACHINE ANALYSIS
We first present an overview of our framework for multi-machine analysis.Our main assumption is that we have a set of machines that share some characteristics so that it would be meaningful to have a common language in terms of patterns across those machines.This would allow one to explore the patterns and correlate them with the machines' historical maintenance records, as well as other meta information in order to understand their meaning in terms of degradation [31, Section 2.2.3], as shown in Figure 1.
The main technique used for this purpose is unsupervised learning through cluster analysis which can be defined as grouping collection of data points based on some pattern similarity [27].In order to obtain such cluster patterns over multiple streams within a machine, we first define a set of features and perform a sliding window operation to extract those features over all streams to obtain a joint feature vector [10], as shown in Figure 2.This allows us to describe the current state of the machine for each point in time corresponding to the sliding window, as shown in Figure 3.This approach is repeated for all machines which results in one set of data points from all machines.We apply a group clustering model [53] commonly referred to as topic modeling in the case of text mining [8].In this hierarchical model, the clusters are shared at global level and the data residing within groups, i.e., the different machines, can only be modeled through these clusters although with different proportions.Hence, certain machines might have a high percentage of a certain cluster while another has a low percentage, as shown in Figure 4. Now, since the data points are extracted using a sliding window, we can color time segments of the streams of each machine by the cluster identity for that specific period    in time [10], [25] in order to be used for visual exploration over time.It should be noted that a given cluster model can be utilized in real-time scenarios for both prediction of class, after labeling the clusters, as well as for anomaly detection, as conceptualized in Figure 5.

IV. ILLUSTRATIVE EXAMPLE
In this section, we provide an illustrative example of the core of our framework by simulated data from multiple machines and streams from those machines.We elaborate on the simulation procedure and the framework including the  clustering model.We provide details of the experiment, analyze the results and highlight possible interpretations.Lastly, we perform a comparison with the GMM which is another commonly used cluster model.

A. MULTI-MACHINE SIMULATION
In order to simulate data from multiple streams, we utilize the skewed normal distribution [4] which takes parameters ξlocation, ω -scale, and α -slant. 1 The parameter α regulates the skewness where α = 0 results in the standard normal distribution, as shown in Figure 6.In order to simulate data streams, we assume that each stream is defined by certain basic patterns that repeat with some regularity over time.In our case, we also assume that these patterns are defined in terms of different parameter configurations: of the skewed normal distribution in order to simulate some deviation from the normal assumption often made and which our clustering approach is based on (since we use Gaussian mixture components).Moreover, since in our setting, machines consist of m streams, we define a matrix of basic patterns which capture the behavior for each stream in relation to the other streams, i.e., we assume correlation between the streams: The regularity of patterns is defined in terms of a Markov chain p(A i |A j ) between different patterns by drawing probability distributions for each stream in the following way: where β i = β, ∀i ∈ {1, . . ., h}, i.e., h basic patterns, and β ≤ 1, i.e., a symmetric ''bowl''-shaped Dirichlet density over the patterns emphasizing some stability when transitioning from one pattern to another.Each pattern A i also has a certain duration time which we simulate utilizing a Poisson distribution Pois(λ) by drawing the number of samples for which the pattern is manifested. 2 Lastly, different type of noise may distort the data.In order to reflect it in our simulation model, we implement the concept of -contamination [5]: where N s is the skewed normal density, N is the normal density and where: where Uniform(•) is the uniform distribution over the intervals [• ↑ , • ↓ ].The idea behind this type of contamination is to have some distortion noise according to a degree specified by (a small continuous number).Now, given the set of h basic patterns for the streams: we can construct the following simulation configuration, to be used in Algorithm 1, for obtaining multiple streams from one machine: Since we can generate data streams for a single machine, we utilize this algorithm in order to generate a set of machines, all of which are sharing patterns to different extent, as depicted by 2.

B. GAUSSIAN TOPIC MODELING
By utilizing the algorithms in the previous section, we assume that we have obtained streams of sensor measurements x 1  1:n , . . ., x m 1:n , where x i 1:n x i 1 , . . ., x i n , from each machine.We then proceed by extracting features over a sample point window of size w and extract u features f 1:u ( x) f 1 ( x), . . ., f u ( x) over that window for each stream, yielding joint aggregated feature vectors (in block matrix notation) of 2 We will assume the same sampling frequency for all streams.
the form [10], [25]: T T (8) where k is the starting index for the window.Then, by letting the window move over time and extract features, sets of data vectors are generated.However, if one move the window with only one step, many similar joint feature points may be generated since then only one point within the window has changed.In practice, one can therefore choose to let the window ''jump'' a number of steps r [25], which yields points of the form: and since our simulation procedure, as defined by Algorithm 2, ensures that each machine M • has the same number of streams and samples within those streams, we can obtain a collection/groups of data points, [F 1:v ] T where v is the number of machines under consideration.Given this collection of data, we apply a GTM [8], [53], where each topic is a multivariate Gaussian distribution.For clarity, we utilize the term cluster instead of topic in the description below.
The generative structure of the model (for details, see further Section IV-C) can then be described by: 1) Draw q clusters (topics) in terms of covariance matrices 1:q , and means µ 1:q based on some appropriate prior (further specified later in Section IV-C).
2) Draw group cluster proportions θ 1:v (also further specified in Section IV-C).3) For each machine i ∈ {1, . . ., v}: The key feature of this type of hierarchical model is that it captures the most important patterns on a shared global level.The aggregated data for each machine can only be modeled through these shared clusters and, hence, one obtain a common terminology in terms of patterns across machines which can be further used for exploration and interpretation using some additional meta-information such as logs of machines and historical maintenance reports.Combining such information can yield significant insights in degrading performance and breakdown behavior.We utilize the multivariate Gaussian distribution as a representation for clusters due to the fact that it is a well-known and an easy interpretable distribution which has been found useful in clustering problems within manufacturing [3], [17].It should be noted that we implicitly make the assumption that clusters have multi-dimensional convex elliptic shape by using multivariate Gaussian distributions as cluster components.However, as we will highlight in the next section, we will utilize probabilistic programming techniques [12] where the model is specified through a programming language separated from the training algorithm which

C. EXPERIMENT DESIGN
The simulation procedure is instantiated by defining the configuration parameters according to: Then, we select five machines to simulate by using Algorithm 2, each of them contains three data streams with 1000 samples per data stream.For simplicity, we choose to extract two basic features, namely the mean and standard deviation [25], over a window of size w = 10, and a jump between windows of r = 5.The resulting streams for each machine is depicted in Figure 7.
In order to implement3 the GTM found in Section IV-B, we utilize probabilistic programming through Stan [12] with the No U-turn sampler (NUTS) for Hamiltonian Monte Carlo (HMC) [24].We base the GTM code/model on the latent Dirichlet allocation implementation, found in Section 9.5 of the Stan users' guide [47] where the discrete cluster belonging variables have been averaged out and the components in our case are multivariate Gaussian distributions.
There is a general problem of training for mixture models due to non-identifiable and label switching issues [6].Hence one cannot easily perform posterior analysis without resorting to some additional post processing [21,Section 22.3].Furthermore, the clustering problem itself is multimodal which is a challenge for Markov chain Monte carlo (MCMC) in general since the chain might only explore the surroundings of a single mode dependent on initial conditions and priors.However, for our purpose of exploratory data analysis, we can take a pragmatic view on these issues and therefore, even though only one mode of posterior has been explored, it is acceptable as long as the clustering provides a useful view on the data and is somewhat stable between different runs.This is similar to the implicit view within the topic modeling community as well, since one there often resorts to MCMC by Gibbs sampling [8] which is likely to only explore a single mode [7].
The data is standardized and then, after some experimentation with respect to stability of clustering results, we obtained the following priors (see further Section IV-B): where i ∈ {1, . . ., q}, j ∈ {1, . . ., v}, I is the identity matrix, κ i l denote the l : th row of vector κ i ; µ 1:q , 1:q , 1:q have dimensions in agreement with the number of features and streams, i.e., u × m, corresponding to F and θ have a dimension in correspondence to the number of clusters q where β k = 1, ∀k ∈ {1, . . ., q}.LKJCorr is a prior distribution for a correlation matrix, see further [47].Regarding the standard multivariate normal prior for µ 1:q , it is rather informative in the sense that the cluster centers should be near the neighborhood of zero in all dimensions, however, more weakly informative priors resulted in less stability in clusters over different runs.
The label switching issue is something that needs to be handled in order to obtain useful clustering results.There are several different methods of different complexities to handle this issue [28], [42], [45], [57].We take a simple approach for this issue by inspiration from the label allocation approach, see further Zhu and Fan [57], together with a greedy algorithm for the final cluster assignments and leave evaluation of different approaches for future work when applying our method to a real world scenario.Now, for each sample from HMC/NUTS: µ 1:q , 1:q , θ 1:v , the probability for a feature vector F i belonging to a specific cluster j has the following appearance: (15) where ω(i) and ω(j) denotes the machine identity of F i and F j , respectively; θ ω(i) j denotes component j of θ ω(i) and N (•|•) is the normal density function.Given this we can proceed by formulating Algorithm 3 which simply assigns the component (cluster) with the maximum probability to the data point. 4By running this algorithm on all samples: µ i 1:q , i 1:q , θ i 1:v , i ∈ {1, . . ., s} obtained from HMC/NUTS, we get s different clusterings which we gather in a s×t cluster matrix C where t is the number of points to label.This matrix will then be the input to Algorithm 4 which will create a final clustering given a certain threshold δ that determines the level of acceptable uncertainty regarding the clustering.Such technique is inspired by previous research [50] where multiple labels have been assigned to cluster points but where we here omit such a multi-assignment and instead declare points that do not reach the threshold limit as uncertain with respect to cluster identity.The algorithm proceeds by first counting the number of times any two given points have ended up in the same cluster and then, given a random permutation, cluster points will be assigned to the same cluster if two points have at least appeared in the same cluster over the samples with a probability at least as high as δ.Points that are not assigned to any clusters (in the algorithm assigned to −1) will then be treated as uncertain with regards to cluster belonging.Lastly, as a final step, the clusters will be ordered according to size and then the q largest of these will be assigned to a final cluster label while the remaining points will be assigned to −1 as an indication that there exists some uncertainty regarding those points.
In general, there is an issue in selecting the number of clusters, however, in this illustrative example, for simplicity, we use a simple procedure by a pairs plot where one can see nuances of three clusters and some intermediate points that needs to be modeled by some auxiliary cluster, yielding a total of four clusters in the model, i.e., q = 4, and we also set the threshold to δ = 0.7 and save further exploration of the impact of this threshold to future research.

D. RESULTS
The resulting clusters for each individual machine as well as for all machines together can be seen in Figure 8.We observe that for some machines, a majority of the points have been assigned to a single cluster while others to a larger extent, contain two or more clusters.Due to the nature of data preprocessing by a sliding window, there also exists an intermediate cluster in between the main clusters.The relation between the proportion of the patterns in the data, as defined by A in Equation (10), and the proportions of identified clusters by Algorithm 3 and 4 are shown in Figure 9.We observe that the GTM model / algorithms well identify the proportion of patterns that are evident in each machine (the orange cluster is the intermediate cluster which can in in principle be ignored).Since we also utilize a sliding window, we can color each sliding window part of the streams by their cluster belongings as illustrated in Figure 10 [10], [25].However, due to the size of the windows w = 10 and we have data points for each jump of r = 5, the windows are overlapping which means that individual points might belong to two different clusters.We here choose to color the streams by window indices: 1, 1 + w, 1 + 2w, etc. corresponding to mutually exclusive windows.

E. EXPLORATION AND INTERPRETATION
Running the GTM over aggregated streams within multiple machines is the first step of the exploration phase.By using additional information such as logs, maintenance reports and meta-information one can correlate patterns within the machines with their degradation behaviors.Moreover, by using multi-machine analysis, there can be a possibility to compare gained information from a certain machine pattern found in the reports with the identified patterns across several machines.In this context, the identified cluster patterns becomes a unified language for the set of machines which enables one to project ''degradation hypotheses'' from one machine to another.
As an example, considering Figure 10, if one can identify a certain degradation behavior in the reports for machine 3, one can hypothesize that this behavior is related to the green cluster pattern and hence, from that perspective one can further analyze the degradation behavior of primarily machine 2, since a large proportion of that cluster appears there, then secondly one can further analyze machine 4 and 1, since those machines also contain some small proportion of the cluster.This is a good example where one single machine guides further analysis towards certain other machines with similar behavior in terms of cluster patterns.There might also be cluster patterns that jointly are the cause for degradation behavior.For instance if machine 4 is the main concern when it comes to degradation, then the continuous interchange between the blue, green and red clusters would be the main subject for further exploration.
Depending on the application scenario, the clusters can be interpreted in different ways.As an example in multi-tooling machines, the cluster patterns can capture different modes of operation, e.g., in terms of vibration.In older machines, where the current operation mode cannot be extracted easily from the control units of the machines, this presents a way of obtaining such information and subsequently allows for analysis of the machines' operation mode.In other cases, the cluster patterns can capture the different variations that might occur to different extent within a certain mode of operation and degradation state.Based on such clusters and previous knowledge of machine degradation, one can identify future system degradation states, e.g., healthy, warning, and faulty.Capturing those cluster patterns and explore to what extent they appear in different machines with different degradation states then becomes the point of foci.

F. COMPARISON WITH GAUSSIAN MIXTURE MODEL
Since the GTM is designed to find clusters on a group level, one can obtain cluster information which might not be so easily captured by other clustering methods that do not take individual groups into account.The reason for this is that certain clusters might appear more clearly on certain machines which then helps the model to identify these clusters.This can be illustrated by another simulation scenario, using the same pre-processing as before, where the clusters are not equally well separated, defined by Equation 16.An example of simulated data from this scenario where the clustering results differ between the GTM and the GMM is shown in Figure 11.From Figure 11b, nuances of clusters appear since one can observe some separations between groups of points.This is information that will be beneficial for the GTM during training since it has a group level.On the other hand, the GMM, which is not taking individual groups into account, will be trained with a view on data from all machines, shown in Figure 11f, where the previously mentioned nuances clusters are not at all evident.A simple experiment on this data, running the GTM and GMM 30 times with the same parameters as priors set according to Equation (14), where the GMM implementation is based on the previously described GTM implementation but without the group level and where the number of clusters is set to 3 and threshold set to δ = 0.7, results in that the GTM captures the clusters, with some variations, in all of the cases, while the GMM captures all the clusters in only 8 of the cases.The result of the first run of the experiment is shown in Figure 12.As it is seen, the red and green cluster in Figure 12a are consistent with the previously mentioned nuances of clusters from Figure 11b and 11d.In 22 cases, the GMM only correctly identified one of the cluster, shown in green in Figure 12b, i.e., it failed to separate the actual two clusters that resides within the red cluster area in Figure 12b.This is due to that the model is performing clustering on all data points without the group level, as shown in Figure 11f, where the clusters are not as clearly separated.It should also be noted that other choices of priors can yield different results that might help identifying the cluster for the case of GMM also, however, the point we want to make here, given that the priors are exactly the same for the GMM and GTM in this case, is that the hierarchical structure, taking groups into account, is an advantage of the GTM which could decide whether or not certain clusters are identified.
To summarize, due to the fact that the GMM does not model group information and hence does not consider or strive at being a good fit to data within individual groups; it can lose information for identifying clusters and can therefore lose track of possibly key cluster patterns.It should be noted that training a cluster model with obtained stability between clustering results is a difficult task due to the multimodality of such problems.Hence, different initialization can yield convergence to different modes unless there are indications of clusters as is the case on the group level in our example.

V. DISCUSSION
By using a framework where cluster patterns are identified across machines, such as we propose, one increases the likelihood of taking full benefits of any limited fault event data associated to any of the involved machines.As an example, if certain patterns appear to be rather unique for a subset of machines which have had several faulty conditions, one may use the model to observe any tendency of such patterns in the other machines and act accordingly.In such cases, any contextual information, e.g., in terms of more in-depth knowledge of behavior for certain machines, can also play a crucial role for understanding how the identified patterns can be projected and used for the other machines.The parameters involved in the framework, in particular window and jump size as well as the number of clusters, need to be adjusted to the problem characteristic at hand.As an example, if degradation is expected to appear over several weeks one may aggregate data over days, i.e. a window size corresponding to a day and set the jump size to non-overlapping windows.A smaller jump size can be also set in order to obtain overlapping windows which will increase the smoothness of extracted features.This can be beneficial for visual exploration of cluster results.However, in case of massive data, the window and jump size can be used as a way of reducing the number of extracted feature data points.This can be particular useful in our case where MCMC is used since such approaches are computationally demanding [11].Also for larger data sets, the δ-algorithm may need further adaptions in order to decrease computational complexity.For adjustment of parameters as well as different additions of extracted features, partly determining the dimension, we propose that one uses visual exploration to learn about the effect on the results both in terms of patterns captured as well as with respect to interpretation of those patterns.
Besides being able to capture different patterns over multiple streams and machines, the framework can also be utilized to capture normal operations of machines and be used as an anomaly detector [15], [53].In contrast to clustering models that do not account for the group level information, the framework can also capture anomalies in terms of pattern proportions, i.e., not only deviations from the patterns but also how much of those patterns within machines change.This might reveal significant insights into degradation behavior.
However, as previously noted [18], degradation over time is likely to smoothly change previously identified patterns.Therefore, another important part of the clustering problem in an online setting is to let the model self-learn and self-adapt to normal tolerable changes over time without raising an anomaly alarm.This is commonly referred to as concept drift [20] in the machine learning community, i.e., the process itself is not stable over time which means that learned models trained on data in the beginning and end of the process are different to some extent.By using Bayesian theory [22], this can be handled by setting the posterior to the prior and retrain with a certain period [1].By doing this, the model drifts ''along'' with the degradation behavior, i.e., the changes of patterns over time, without raising unnecessary alarms.These changes of the model might also provide in itself valuable information in terms of machine degradation behavior, since one can then capture how much the model has changed over time and store this information as historical records of drifting behavior.For example, if a certain pattern is highly evident in the start of the analysis, but then at some point starts to change over time, one might correlate such information with other meta information regarding the machine and learn the cause of such behavior.
Lastly, more advanced models, which also have the ability to self-adapt to the number of clusters [48], [49], can allow one to explore machine behavior in terms of splitting and merging of cluster patterns over time, denoted as evolutionary clustering [14], [54], [56].Such splitting and merging behavior might further be matched with other meta information in order to understand why they occur and what they mean in terms of degradation behavior.

VI. CONCLUSION
We have presented a coherent framework which is able to capture information, at different levels, from the repetitive structure that multiple machines constitutes.The framework is based on a hierarchical probabilistic model, denoted GTM [53], where shared cluster patterns are captured at the top level and then used at different proportions for modeling the data, within the machines, on a group level.In order to obtain single data points over time from each machine, we aggregate a number features over all the data streams for a specific time window [10], [25].The framework uses a clustering schema, through Algorithm 1 and 4, which takes uncertainty into account and where a threshold regulates a desirable degree of reliability of assignment to clusters.We have illustrated our framework through synthetic data by simulating multiple machines/streams and where we performed modeling through probabilistic programming in Stan [12].We showed that the model can capture different patterns and proportions of those patterns well across several machines and we highlighted potentially utilization as information gain through association with additional meta information and comparison of the patterns across machines.Moreover, we also performed an experiment that highlights the benefit of the GTM through capturing the group level information to identify clusters.In contrast, the commonly used GMM operates on all data, i.e., does not take individual groups into account, and hence only aims to be a good fit of all data without the group perspective.As it has been seen from the experiment, this can result in the GMM is not able to identify clusters at the group level, and can thereby miss capturing key patterns in the data which might be crucial for the understanding of machine's degradation behavior.As a conclusion, it is important to highlight that such a framework contributes to fill the gap in data driven based PdM application in real word industrial environments since it deals with the challenge of lack of labeled data coming from multiple streams across multiple machines.
For our future work, we will apply our framework on realworld manufacturing case studies.One issue that one needs to handle is the big data aspect [43].Currently, statistical machine learning in the form of Bayesian inference [22] do not enjoy the same scalability as for instance deep learning methods [32] where it is usually easy to parallelize.However, recently proposed methods [11] suggest that there are opportunities to scale down the data size significantly and still obtain useful results.Such methods might provide the ideal complement to probabilistic programming tools such as Stan [47] and would then enable for a fast modeling and training loop, i.e., ''small data'' for efficient and effective exploratory multi-machine analysis.

FIGURE 1 .
FIGURE 1.An illustration of the framework where clusters are utilized for exploration of machine behavior.

FIGURE 2 .
FIGURE 2. An illustration of the data extraction process of the framework.

FIGURE 3 .
FIGURE 3.An illustration of clusters projected on the streams.

FIGURE 4 .
FIGURE 4.An illustration of how shared clusters patterns can be compared across machines.

FIGURE 5 .
FIGURE 5. A figure that illustrates how a cluster model can be utilized for prediction of the cluster label as well as for anomaly detection.

FIGURE 8 .
FIGURE 8. Clusters within machines 1 − 5 visualized in pairwise feature space, where f j i denotes the feature, and i ∈ {1, 2} correspond to the mean and standard deviation and j ∈ {1, 2, 3} to the streams.

FIGURE 11 .
FIGURE 11.Data for scenario as described by Equation (16) within machines 1 − 5 visualized in pairwise feature space, where f j i denotes the feature, and i ∈ 1, 2 corresponds to the mean and standard deviation and j ∈ {1, 2, 3} to the streams.

FIGURE 12 .
FIGURE 12. Results of GMM and GTM clustering on the scenario defined by Equation (16).