A Dataset Fusion Algorithm for Generalised Anomaly Detection in Homogeneous Periodic Time Series Datasets

The generalisation of Neural Networks (NN) to multiple datasets is often overlooked in literature due to NNs typically being optimised for specific data sources. This becomes especially challenging in time-series tasks due to difficulties in fusing temporal data from multiple sources. However, in a commercial environment, generalisation can effectively utilise available data and computational power which is essential to Green AI, the sustainable development of AI models. This paper introduces “Dataset Fusion,” a novel dataset composition algorithm for fusing periodic signals from multiple homogeneous datasets whilst retaining unique features for generalised anomaly detection. The proposed approach, tested on a case study of three-phase current data from two different homogeneous Induction Motor (IM) fault datasets on anomaly detection, outperforms conventional training approaches with an Average F1 score of 0.879 and effectively generalises across all datasets. Furthermore, when tested with varying percentages of the training data, results show that using only 6.25% of the training data, translating to a 93.7% reduction in computational power, results in only a 4.04% decrease in performance, demonstrating the advantages of the proposed approach in terms of both performance and computational efficiency. Moreover, the algorithm’s effectiveness under imperfect conditions highlights its potential for use in real-world applications.


I. INTRODUCTION
G ENERALISATION is a measure of a Neural Network's (NN) performance on data that it has not seen before but that is in the same class as the data that it has been trained on.The idea behind generalisation with Deep Learning (DL) is to transfer domain knowledge from data the NN has been trained on to unseen data in the same class, where the unseen data may contain conditions that slightly vary from the training data.This allows for a NN to be able to maintain performance across the dataset, and potentially transfer across multiple datasets with a similar distribution to the initial trained data.Various studies have been undertaken to understand the factors that affect the generalisation performance of a NN [1], and how to mitigate these factors to achieve the optimal level of performance [2].The underlying concept is as follows: When training a NN on a data sample, the NN learns to represent the function between the input data and the output data through the adjustment of the weights and biases.If This work was sponsored in part by Voltvision the distribution of the data sample used for training is not fully representative of the true distribution population, then the input-output function that is represented by this sample will inevitably vary from the function of the population.Extensive training on this sample will then result in a phenomenon known as overfitting [3], which refers to when the NN has accurately modelled the function represented by the training data but is not able to generalise to data in the same class due to the discrepancy between the functions represented by the sample and population.
Many works have been published with the aim of addressing overfitting and thus maximise the potential generalisation ability of a NN.Many works focus on architectural improvements to the NN to increase the robustness of the NN to overfitting through novel architectures such as Capsule Networks [4], whilst other research directions focus on the manipulation of the training procedure to limit overfitting with techniques such as Dropout [5], Early Stopping [6], Pruning [7] and adding noise to the weights and biases of the NN whilst training [8].
There is much less focus on research concerning the effect of the composition and specification of the dataset on generalisation, especially regarding time series data.A common approach currently used include denoising the training sample to better align the distribution of the sample to the population and hence limit the level of overfitting [9] [10].Additionally, there is a consensus in ML research that increasing the volume of data through various means such as augmentation [11] [12] improves NN generalisation performance; whilst this has been empirically confirmed, more recent research has discovered that this improvement largely comes specific samples within the supplementary data, and a significant volume of this data is essentially redundant and does not contribute to a performance improvement [13] [14].
Furthermore, there is a gap in literature concerning the fusion of multiple time series datasets in a single training set to balance the probability distribution of the training sample so that it better aligns with the true distribution of the problem domain.This is largely due to a lack of necessity in an experimental environment since most ML research tends to optimise the solution for a single dataset source.However, from a commercial standpoint, this can have many benefits with regards to time and computational power saved, as well as an added benefit of reducing the data requirements for training.In addition to this, the dynamic shifting of the distribution of data is often a bottleneck to the performance of the NN; this is a prevalent issue that is encountered when a NN is deployed in a non-stationary environment, which is common with time series data.Some recent works have detected this shift [15], and mitigate the effect this has on the generalisation performance of the NN with both time series data [16] [17] and image data [18].However, the majority of empirical evaluations of NN approaches in literature are mostly conducted on an isolated sample of data, which, in many cases, is not representative of the dynamic shifting of the distribution temporally.
To address the identified gaps in the literature, we propose a novel algorithm, named Dataset Fusion.The proposed method merges multiple homogeneous, periodic time series datasets into a single unified dataset for training anomaly detection NN models.The fusion process is designed to accurately represent population distributions and increase robustness against potential data distribution shifts.Our primary objective in this study is to examine efficient generalisation approaches that can minimise training time and computational demands for neural networks when working with new homogeneous time series data sources.The contributions of this paper can be summarised as follows: • A novel dataset composition algorithm is proposed, referred to as Dataset Fusion • The proposed approach is applied to a case study focused on motor current data, with a qualitative analysis conducted to assess the preservation of features from each individual dataset.• The generalisation performance of the proposed method is empirically evaluated in anomaly detection with the LSTMCaps neural network architecture from previous work [4], and compared to the performance when using conventional training approaches • The potential practical limitations of the proposed method in a real-world environment are discussed and assessed through further experimentation II.RELATED WORKS This section investigates the current literature on different methods of addressing NN generalisation performance, as well as recent works using multiple datasets in different domains and tasks.
A. NN based generalisation techniques 1) Dropout: Dropout [5] is a regularisation technique that can be used to prevent the neural network from overfitting on the training data.This is done by ignoring several randomly selected neurons, with the number of neurons dropped dependent on the "dropout rate" parameter, so they do not affect the output of the neural network on the forward pass.The idea behind dropout is to effectively train many subnets in your network so that your network acts as a sum of many smaller networks that can learn the representation of the data without the presence of the dropped-out neurons.This was found to improve the generalisation performance of the network by reducing data overfitting.
2) Pruning: Pruning is a process whereby a NN selectively removes trainable parameters based on an established criterion with the aim of maintaining the performance of the NN [19].There are two main categories of pruning: Structured and Unstructured pruning.Unstructured pruning directly removes trainable parameters from the network, such as connections to neurons (weights).Structured pruning involves removing entire structures from the network such as neurons and filters.Structured pruning allows for a faster computational time in relation to unstructured pruning, as most frameworks existing for Machine Learning (ML) do not allow for the acceleration of sparse matrix calculations, therefore the NN will be able to reduce the number of calculations for the former but not the latter.
Various criterion has been established in literature for the pruning of NNs.One primitive but popular method is known as the weight magnitude criterion.The criterion dictates that the weights of the smallest magnitude are removed.The idea behind this is to remove all the weights that contribute the least to the function as they are less likely to impact the final prediction.
The most established framework for pruning is known as the train, prune and fine-tune method [19].As the name suggests, the model is first trained, then iteratively pruned and fine-tuned based on the weight magnitude criterion.More recently however, an increasing number of works [20] [21] [7] have evolved this framework with novel methodology that has allowed for further reduction of NN parameters and hence more efficient training and computation whilst maintaining similar performance.

B. Data-based generalisation techniques
Data-based generalisation techniques are largely overlooked in the field of deep learning in comparison to NN-based techniques.This is because the NN structure and learning optimisation algorithms are usually the reason for such weak performance, so improvements can largely be made by improving and optimising how the NN learns as opposed to what the NN is learning.However, it is still important to consider the training data as it can be a bottleneck for learning ability if not composed in the correct manner [22].
An important aspect to consider is the difference in the data distribution between the data used to train the NN and the data that the NN will eventually be applied to.Often the data picked for training is not fully representative of the true distribution of the dataset, which creates a bottleneck to generalisation as the NN is not prepared for the distribution found in the overall population since it has been tuned to the distribution of the training sample.Shuffling the data before taking the training sample often helps with this to increase the likelihood of the sample representing the true distribution.Furthermore, shuffling during training also helps the NN weights to escape local minima and converge towards the global minimum of the function.Many works in deep learning falsely assume that the data distribution is static, which is largely incorrect as in practice data distributions are generally dynamic and tend to shift away from the test data distribution with real-world data; this phenomenon is known as distribution shift.This shift has been detected and quantified in recent works [23] [24], and accounted for with online adaptation to the shift [25], which allows the NN weights to adjust themselves to account for this distribution shift.
Data with excessive noise can inhibit the NNs ability to learn the true input-output function required by a specific application.Mathematically, this can be explained by the biasvariance decomposition [26], specifically the variance component.The variance value refers to the change in prediction accuracy over different subsets of the data.A high variance value indicates that the NN has learnt the noise in the data, or in other words overfit on the training data.The most common method of reducing the variance is increasing the volume of the training data, which will naturally bring the distribution of the data closer to the required distribution that represents the overall dataset as opposed to just the subset of training data.As well as using real-world data, data augmentation has also proved to be an effective method of increasing data volume using data that is already accessible [11] [12].Another effective method that is commonly used is the denoising of data [27].This also contributes to the reduction of the variance: By removing the noise in the data, the training data subset will be cleaner and will more accurately represent the true distribution of the dataset.Whilst both methods are generally proven to work, there are still major issues with both methods that are still being addressed in literature, such as effectiveness in a real-world situation.
The financial and computational costs associated with increasing the volume of training data are not just substantial but also rapidly increasing [28].This makes it increasingly difficult to train models without significant resources.Consequently, only well-funded entities with considerable resources can achieve a performance boost using this method, as their computational power and finances typically surpass those of other research entities and companies.This disparity creates a bottleneck for smaller entities, hindering their competitiveness in the field [29].
Denoising is still a very active field of research due to the many limitations of the currently proposed techniques.Since it is very difficult to determine which aspects of the data features are representative of the true distribution [27], it is very difficult to use denoising techniques such as filtering since features that are potentially important to NNs could be filtered out, limiting the potential performance of the NN.Furthermore, filtering techniques are largely contextual, so it is very difficult to develop filtering methods that work across multiple contexts to the same level of effectiveness.
Transfer learning [30] is also widely regarded in literature as a robust method of improving performance by utilising data from a different set but in the same domain as the target data.Whilst there have been many recent works regarding transfer learning [31], there is a gap in research concerning dataset fusion methods, an alternative to transfer learning that utilises multiple datasets in the same training phase, as opposed to multiple training phases to train a model more robust to the difference in data distributions between the training data and the data encountered during operation.This will be further explored in the present work.

C. Multi-Dataset training
Recently, researchers have identified the potential benefits of training generalisable NNs using multiple datasets to enhance performance and expand the capabilities of the NN models.Many of these approaches primarily focus on image-based applications [32] [33].For example, Yao et al. [34] introduced a novel framework for cross-dataset training, leveraging pre-existing labels from multiple datasets to create a single model capable of detecting the union of labelled features from all contributing datasets.This approach aims to maximise the utility of available labels for distinct classes in each dataset, thereby circumventing the time-consuming and resource-intensive process of labelling a single dataset with new classes that are already present in another dataset.Empirical evidence provided by the authors demonstrates the effectiveness of this approach when applied across multiple datasets, achieving comparable performance levels without sacrificing accuracy.In a related study, Zhou et al. [35] introduce a technique for training a unified object detection model on several extensive datasets by using dataset-specific training methods and losses, but maintaining a shared detection architecture with outputs specific to each dataset.This approach circumvents the necessity for manual taxonomy alignment, as it automatically combines the outputs of different datasets into a unified semantic taxonomy.The authors show that this multi-dataset detector achieves comparable performance to dataset-specific models in their respective domains while effectively generalising to unseen datasets without the need for fine-tuning.By utilising multiple training datasets, these approaches can lead to reduced resource requirements in terms of training data and improved performance.However, it is worth noting that they do not specifically address the aspect of reducing computational power during the training process.
In summary, analysing previous works on multi-dataset utilisation reveals clear benefits, such as reduced dataset labelling requirements and increased generalisation performance.However, there is a lack of exploration in timeseries-based multi-dataset models due to challenges in fusing sequential data from different sensors and collection specifications.The current study aims to address these points by proposing a novel approach for effectively integrating timeseries data from multiple sources.Furthermore, it is essential to consider computational efficiency, as training on multiple datasets may require additional computational resources.By addressing these points, this study aims to contribute valuable insights to the field of multi-dataset modelling and help pave the way for more robust and efficient approaches in various applications, particularly those involving time-series data.

III. DATASET FUSION
A summary of the dataset fusion process is depicted in Figure 1.
The signal in each dataset is first down-sampled to the sampling frequency, F s , of the dataset with the lowest sampling frequency.This step is essential to models with lookback such as Recurrent Neural Networks so that the same The re-sampling is implemented using the Fourier method.This method was chosen over decimation due to the simplicity of implementation and with the assumption that the signals used are periodic in nature.To avoid aliasing and other artefacts, a low-pass windowed-sinc filter is first designed and applied to the signal, with a cutoff frequency based on the target Nyquist frequency.The Hann window, hann(n), was employed in the filter design due to its desirable characteristics for resampling, such as reduced spectral leakage and smooth sidelobes.In the present work, 101 taps were used, giving a filter order of 100.Equation 2 expresses the impulse response, h[n], of this filter.
where K is the normalisation factor, f c is the cutoff frequency in Hz, n is the discrete time index, and M is the filter length or number of taps.
To implement the Fourier Method, the time series signal is first transformed into the frequency domain with a Discrete Time Fourier Transform (DTFT).For a discrete periodic sequence x[n] its DTFT, X[ω] is expressed in Equation 3.
where ω = 2πf and j = √ −1.The spectrum X[ω] is then band-limited to the Nyquist frequency of the lowest F s .The resulting spectra are then inverse-transformed back into the time domain.The downsampling operation is expressed in Equation 4a and the new number of samples, N new , is calculated using Equation 4b.Equation 5shows the Inverse DTFT operation used to obtain the resampled time series signal.
Each dataset is then normalised using Z-score normalisation, to overcome varying motor currents.For the resampled sequence, x[n] resampled , the normalised sequence, x [n] , is calculated using Equation 6.
where xresampled is the mean of the resampled sequence, and σ x resampled is the standard deviation of the re sampled sequence.
To batch the periods together, a zero-crossing algorithm, configured to detect crossings from positive to negative, is employed to first identify a single period, and then concatenate n periods based on the user-defined parameter.Given that zscore normalisation is utilised, any periodic time-series data will exhibit sign changes, making the zero-crossing algorithm applicable.In cases where multiple features are present in the data, the zero-crossing algorithm calculates only the first feature as a reference for period batching, in order to maintain the temporal integrity of the data and preserve the spatial relationship between features.The impact of varying batch sizes on training performance differs across data types and problem domains; hence, it is recommended that this parameter be optimised alongside other training hyperparameters.For a discrete periodic signal x[n] with assumed periodic sign changes, the set of indices for the positive-to-negative zero crossings, c + → − , can be expressed as shown in Equation 7.
For each dataset, the batch order is then shuffled randomly.This is done in order to mitigate the effect of distribution shift and prevent noise in one area of the signal from being prevalent in other areas of the signal.In other words, this step helps to reduce the variance of the NN prediction.A new signal is then constructed using the shuffled batches from each dataset by appending a batch from each dataset in an alternating fashion.
The full process of the Dataset Fusion algorithm is expressed in Algorithm 1.

Algorithm 1 Pseudocode of Dataset Fusion Algorithm
Function: Dataset Fusion(x, F s , P ) Input: A set of s finite discrete periodic sets x where s > 1 A set of sampling frequencies F s corresponding to X Number of periods batched P Output: x f used Determine F snew using Equation 1for x 1 to x s do if F s Xs = F snew then Apply filter shown in Equation 2Calculate X[ω] using Equation 3Calculate X [ω] resampled using Equation 4a Calculate N new using Equation 4b Calculate x[n] resampled using Equation 5 Calculate x using Equation 6Calculate c + → − using Equation 7Calculate x batched through grouping P periods by slicing x at the values where c where % represents the modulo operator and + + represents concatenation. A. Computational Complexity The computational complexity of the Dataset Fusion algorithm, represented in Big O notation, can be determined by breaking down the steps of the algorithm when practically applied.The breakdown of the complexity of each stage is provided in Table I.
As Table I shows, the filtering and resampling step has a complexity of O(nm(1+log(m))), since it involves applying a finite impulse response (FIR) filter with a complexity of O(m) and performing resampling using the Fast Fourier Transform (FFT) with a complexity of O(m * log(m)) for each of the n datasets.The Normalisation step scales each dataset using Z-score normalisation with a complexity of O(m) for each dataset, resulting in a total complexity of O(n * m).The Period Batching step identifies zero-crossings and creates period batches with a complexity of O(m) for each dataset, resulting in a total complexity of O(n * m).The logarithmic factor in the dominating term, O(n * m * log(m)), makes the Dataset Fusion algorithm scale well with increasing input size.This is because logarithmic growth is slow growth, ensuring that the algorithm remains efficient even as the number and size of the datasets (n and m) increase.Additionally, since the complexity is dependent on both the number of datasets (n) and the length of the datasets (m), the algorithm can efficiently handle varying dataset sizes and compositions.This scalability makes the Dataset Fusion algorithm a versatile algorithm and suitable for processing large and diverse datasets, which is essential in the context of real-world applications where data size and complexity are constantly evolving.

B. Requirements for application
Whilst the proposed algorithm is domain-independent, there are requirements regarding the data that must be met for the proposed method to be applicable.These requirements, as well as the reasoning, are detailed in the following sections.
1) Homogeneous Datasets: Although the methodology can be used in varying problem domains, the dataset fusion algorithm can only fuse homogeneous data, since the aim of the algorithm is to capture the data distribution of a problem domain as a whole in order to mitigate overfitting on a specific dataset.Generalisation to multiple problem domains is not in the scope of this algorithm.
2) Data periodicity: As explained in section III, The algorithm relies on the fact that the data is periodic, due to the resampling method used, the zero crossing method, and to be able to create a coherent and usable sequential fused time series dataset.
3) Time Domain Data: The proposed approach will only be applicable in the time domain representation of the datasets, as it relies on the sequential nature of the data to fuse it together in a meaningful way If the datasets being fused meet the requirements detailed above, then there is feasibility in applying the proposed method.Some examples of where the proposed method may be feasible are daily temperatures in a region, electrical power data and vibration data.

C. Proposed Benefits of Dataset Fusion
The Dataset Fusion algorithm seeks to eliminate the necessity of multiple NNs for a single problem domain.This approach theoretically enables the development of an NN that can adapt to unseen data from the same domain, even if originating from different data sources.Moreover, the Dataset Fusion algorithm aims to reduce data requirements from individual sources, as achieving ideal data collection conditions from each source can often prove to be challenging.Potential issues with collected data, such as data corruption, sensor faults, or insufficient data volume, among other data collection complications, further emphasise this need.The present study will experimentally investigate the proposed benefits of the Dataset Fusion algorithm.

IV. CASE STUDY: DATASET FUSION FOR 3-PHASE MOTOR
CURRENT DATA This section will explore the feasibility of applying the proposed method with a case study on motor current signals.
The aim of the case study is to empirically test and validate the effectiveness of our proposed method.The datasets used will first be introduced and the feasibility of the proposed method will be confirmed.The dataset fusion algorithm will then be applied, and the resulting signal will be compared and analysed to the original signals.

A. Dataset Introduction
For the present case study, two homogeneous open-source datasets [36] [37] will be used to confirm the feasibility of the Dataset Fusion methodology.Specifications of the datasets used are detailed in Table II.Both datasets are composed of three-phase motor current signals, however, one dataset, which will be referred to as Dataset A, contains fault data for an interturn short circuit fault, and the other dataset, which will be referred to as Dataset B, contains current signals for a broken rotor bar fault.
1) Dataset A: Dataset A [37] contains files from a motor running at a variable operating frequency F o , ranging from 30Hz to 60Hz with 5Hz increments.The motor has the following specifications: 4 poles, 1HP mechanical power, 220V supply and 3A rated current.The authors simulated both high-impedance and low-impedance short circuits, for different levels of fault severity.For the purpose of this case study, only the files captured at F o = 60Hz were used to meet the limitation of Dataset Fusion of only being applicable to homogeneous data.The new dataset structure is depicted in Table III.
2) Dataset B: The motor used to capture Dataset B [36] is a squirrel cage AC motor, running at a constant F o = 60HZ and has similar specifications to the motor used to capture Dataset A. The breakdown of the dataset is shown in Table IV.At the beginning of each file, for roughly the first 4 seconds, a transient signal representing the motor startup was also recorded.For the purpose of the case study, and for compatibility with the proposed algorithm, the transient subset of the signal, the first 200,000 samples, was discarded from each file, so that only the steady state of the motor remained.This left 801,000 samples left in each file, representing an approximate 20% redundancy of data.

B. Application of Dataset Fusion and Analysis
The Dataset Fusion algorithm was used to fuse the healthy files from Datasets A and B into a single, fused dataset.First, all healthy files were extracted from each dataset and concatenated into a large signal.The files in Dataset B were first sliced to remove the motor startup signature, then resampled to 10,000Hz, the same F s as Dataset A. Each Dataset was split into batches of 4 periods and then concatenated alternating between each Dataset to create the final fused dataset.
For each dataset, an initial analysis was conducted in order to understand the data and enable a more accurate interpretation of the fused dataset experimental results.The time series signal, a Probability Distribution Function (PDF) and FT representations from 0-500Hz were generated for a single phase from a healthy file from each dataset.The plots are illustrated in Figure 2.
A Principal Component Analysis (PCA) was also performed on the healthy data from each dataset as well as the fused data to gain comprehensive insights into the data, wherein the resulting axes are linear combinations of the original variables, defined by the eigenvectors and eigenvalues.This method allows for the identification of the most significant patterns to increase the interpretability of the proposed method.All datasets were uniformly downsampled to 10,000Hz, which corresponds to the minimum sampling frequency in Dataset A. Subsequently, the samples were partitioned into groups of 100,000, aligning with the smallest sample size per file in Dataset A. The data's three features, representing the three phases, were flattened into a single axis before being subjected to the PCA algorithm.The visualisation of the first two Principal Components in a 2D scatter plot can be found in Figure 3.
It is clear to see from Figure 2(d) and Figure 2(e), as well as Figure 2(g) and Figure 2(h) that Dataset B contains a     clearly present in the spectra, a significant feature of Dataset B. Furthermore, the harmonic peaks in the fused frequency spectra contain the same characteristics of both datasets, which is interesting to note as this representation would still be considered a healthy signal.Future work will investigate the use of a fused Dataset in the frequency domain to train a classifier NN.However, the scope of this study is to validate the use of the time series representation to train a generalised time series anomaly detector with reduced data requirements.
Upon examining the PCA plot depicted in Figure 3, it is clear that the healthy data from both datasets exhibit comparable traits and patterns.Interestingly, the fused dataset forms a cluster around the origin, positioning itself at the center of the two datasets.This central location of the fused dataset within the circular arrangement of points from the two datasets signifies that it effectively captures the salient features of both datasets.By doing so, the fused dataset aids in  It is important to note that simply concatenating two healthy files from each Dataset will produce a similar outcome to the representations shown in Figure 2.However, the purpose of this analysis is to show that the Dataset algorithm will still preserve the individual features of each representation in a new signal and still be usable for a data-driven approach.The PCA plot, as displayed in Figure 3, provides more compelling evidence of the impact of the Dataset Fusion technique on the combined dataset.Subsequent experimental results on anomaly detection, utilising the various datasets, will further explore the implications of employing a fused dataset for training an anomaly detection model.

C. Experimental Design
The aim of the experimentation presented in this study is to observe the effectiveness of the Dataset Fusion with training an anomaly detector NN in comparison to commonly used training methods.The following training methods will be compared for all of the following experiments:  (ANOVA), to ensure the reproducibility of the results presented.The ANOVA is generated using custom functions on Python 3.9, and the numpy [38] and pandas [39] libraries, with versions 1.22.0 and 1.3.5 respectively.

1) Neural Network Model:
The multi-channel LSTMCaps autoencoder NN developed in previous work will be used as the anomaly detection models for the following experiments.Further details regarding the architecture are given in [4].For the following experimentation, three input branches were used to accommodate the three features present in the dataset: the three-phase motor current signals.The model hyperparameters were optimised for anomaly detection using iterative sweeps.Table VII details the optimal hyperparameters for this task, which will be used across all training methods.
As Table VII shows, the NN trains best with 4M samples.Since the healthy data from each dataset contains over 4M samples, each file from the training set was randomly sliced to reduce the number of samples from each file, totalling the 4M samples required for each dataset when concatenated.This approach ensures that the model will train on a wide variety of data, and subsequently increase the reliability of the experimental results by avoiding optimising for a specific set of files.
There are various methods of determining the error thresholds of an Autoencoder NN.The most common method is through the use of an unseen validation set.After training the NN, the model is run on a file in the same class as the training set, but which has not been used for training.In this case, a single file containing data for a healthy condition motor is used.The Mean Absolute Error (MAE) for each prediction is then calculated, which provides a baseline for the accuracy of the NN with reconstructing healthy data.When multiple features are present in the dataset, as is the case with the data used in this case study, a threshold is calculated for each feature since the reconstruction performance of the NN model may vary across each feature.The overall threshold can be calculated through different methods, and the method used is determined based on the training performance of the NN as well as the data consistency and behaviour.In the present work, two methods are used: The largest MAE for each feature, or two standard deviations away from the mean MAE of each feature.The reasoning behind using two standard deviations from the mean as a threshold is, assuming a Gaussian distribution of error residual values, two standard deviations from the mean covers 95% of the data.In the case of an inconsistent or noisy dataset, it is better to use this method since there are more likely to be anomalous MAE values in the validation set, and if the largest MAE value is used, the threshold may be too high to consistently detect abnormalities in the test data.In the case of a consistent dataset with contains minimal noise, the largest MAE value is generally a good threshold to use since an MAE value to exceeds this threshold is more likely to indicate an anomaly as opposed to noisy but healthy behaviour.
2) Addressing limitations though experimental design: A potential limitation of the proposed experimentation is the experimentation on static datasets.By experimenting on static datasets, results achieved in experimental conditions may not generalise effectively to real-world conditions or even other data in the same problem domain.One experiment will address this limitation by recreating the dataset for each of the ten test iterations, shuffling the dataset files for each test iteration randomly in each dataset prior to data formulation.
Another potential limitation is the availability of an equal amount of health data from each Dataset tested.For the purpose of this experimentation, an equal amount of data from each dataset will be used, however, it cannot be ensured that there is an equal amount of healthy data available outside of experimental settings.Therefore, one experiment will address this issue by modifying the ratio of data used from each dataset to obverse the difference that is made to the anomaly detection results.

D. Experiment 1 -Training methods comparison
Experiment 1 aims to provide an initial comparison of the various training approaches detailed in Section IV-C.To ensure experimental validity, the experiment will be repeated 10 times.However, each iteration will not utilise the same training data; instead, it will employ a randomly shuffled subset of data from each file to create a dataset comprising 4M samples.This approach further bolsters the validity of the proposed method and mitigates the risk of misrepresenting its performance by only validating it on a single subset of data.
Table VIII presents the Precision, Recall, and F1 score results for each experiment, as well as the average F1 score across the datasets.Table IX displays the ANOVA for the Average F1 score results in VI.Additionally, Figure 4 features a box and whisker plot that visually compares the Average F1 scores from all 10 runs from each training approach, offering a representation of result spread and consistency.

E. Experiment 2 -Varying data volume
Experiment 2 aims to observe the effect of reducing the volume of training data on the performance of the anomaly detection model using the different training approaches.Similar to Experiment 1, the experiment will be repeated 10 times,  -Also T-Dataset A (Using the traditional training approach on Dataset A) results in better performance on Dataset A, the performance on Dataset B is poor, which brings the Average F1 score down, making this model unsuitable for use across homogeneous datasets -Using transfer learning, even when first training using the fused dataset does not result in better performance overall, even on the dataset that was trained on in the second phase -The pre-processing methods used to create the fused dataset result in a significantly higher performance for Dataset B, since the data needed to be downsampled from 55611Hz to 10000Hz as per the requirements of the algorithm.
-Although the performance across the different training approaches on Dataset B is lower than Dataset A, using transfer learning from the fused dataset to Dataset B yields the same results as just training on Dataset B using the traditional approach, but simultaneously results in strong performance on Dataset A -The box and whisker plot in Figure 4 illustrating the spread of the results also shows that the Dataset Fusion approach did not result in any outlying results from 10 runs, showing high levels of consistency.However, it is also clear to see that Transfer Learning approaches have the lowest deviation in results in comparison to a single training phase on 1 dataset, whether that is using traditional training approaches or the proposed Dataset Fusion approach and the dataset will be shuffled to ensure experimental validity.The experiment details for each test are shown in Table X.The estimated FLOPs used for each number of samples are calculated using Equation 8 [40]: where P is the number of trainable parameters in the NN, N is number of training samples, and E is the number of epochs.
Whilst the complexity of the model can undoubtedly affect the complexity of training, since the same model is used across all experiments, it will not be relevant for this calculation.
The tabulated results for experiment 2 can be found in Table XI.A summary of the results in the form of a box and whisker

F. Experiment 3 -Varying dataset ratio
Experiment 3 aims to assess the performance of each training method with an imbalanced dataset containing a different number of samples from each dataset.The purpose of this experiment is to simulate a real-world environment, where the volume of data available from different sources will not be equal in many cases.Similar to Experiments 1 and 2, the experiment will be repeated 10 times, and the dataset will be shuffled to ensure experimental validity.
Table XIII shows a breakdown of the experimental settings used.In the case of traditional training, the anomaly detector model will be trained on the reduced dataset, similar to experiment 2. However, transfer learning approaches will make use of both datasets.
The tabulated results for experiment 3 can be found in Table XIV for results from 10:90 to 50:50 (Dataset A: Dataset B), and in XV for results from 60:40 to 90:10.The tabulated results are summarised in a box and whisker plot, shown in Figure 6.The One-way ANOVA  Average F1 score of 0.879 and a 10-run average of 0.821, DF surpasses the other methods.These findings suggest that the DF approach effectively captures the salient features in both datasets, resulting in consistently strong performance across the datasets compared to the compared methods.Moreover, the results imply a considerable advantage in fusing the datasets, as the performance on individual datasets significantly exceeds that of models specialised in each respective datasets.
In comparison, the traditional training approach on Dataset A (T-Dataset A) exhibits better performance on Dataset A with a mean F1 score of 0.867 but suffers from poor results on Dataset B with a weaker score of 0.423, consequently lowering the average F1 score and making this model unsuitable for use across homogeneous datasets.The superior performance of the DF method on both Dataset A and Dataset B, along with the average F1 score across both datasets, underscores the algorithm's effectiveness in fulfilling the need for a single neural network adaptable to data from various sources within the same problem domain.This outcome aligns with the algorithm's proposed benefits, which aim to eliminate the need for multiple NNs and reduce data requirements from individual sources.
Furthermore, the results indicate that the traditional training approach, while effective for the dataset it was trained on, falls short in terms of generalisability across homogeneous datasets.In contrast, the DF method not only provides a more robust solution for handling data from multiple sources but also proves to be consistent in performance across multiple runs.
Utilising transfer learning, even when first training with the fused dataset, does not lead to better overall performance, even on the dataset that was trained on in the second phase.However, the preprocessing methods employed in the DF algorithm play a crucial role in enhancing performance, particularly for Dataset B, which required downsampling to meet the algorithm's requirements.This demonstrates the algorithm's adaptability to variations in data characteristics and its potential for handling real-world scenarios where data collection specifications may be inconsistent.Fig. 6.Box and whisker plot showcasing the impact of varying dataset ratios on each training method, grouped by training approach and plotted against average F1 score (higher is better).Although the performance across different training approaches on Dataset B is lower than on Dataset A, using transfer learning from the fused dataset to Dataset B yields results that are on par with training solely on Dataset B using the traditional approach.Simultaneously, this approach achieves strong performance on Dataset A. However, the DF algorithm still outperforms all transfer learning approaches in terms of overall performance on both test datasets.
The box and whisker plot in Figure 4, illustrating the spread of the results, shows that the DF approach did not produce any outlying results from the 10 runs, indicating high levels of consistency.However, it is also evident that the transfer learning approaches exhibit the lowest deviation in results compared to a single training phase on one dataset, whether using traditional training approaches or the proposed DF method.This highlights the potential benefits of combining transfer learning with the DF approach to achieve even more consistent and reliable performance across datasets.However, the results show that transfer learning may not be the best approach, especially in this problem domain, to achieve the best performance.
The ANOVA Analysis presented in Table IX demonstrates that the experimental results yield a statistically significant outcome, with a p-value of 1e-16.This highlights the relevance of the differences observed among the various training approaches.

B. Experiment 2 -Varying data volume Analysis
The results of Experiment 2, presented in Table XI and visualised in Figure 5, demonstrate that the DF approach significantly outperforms other training approaches across varying volumes of training data.Furthermore, when utilising only 6.25% of the training data, the model still surpasses other training approaches that use 100% of the data, achieving an Average F1 score of 0.788.As expected, most approaches experience decreased performance when using less data, with the proposed DF approach following this trend closely.However, it is worth noting that the performance reduction is minimal, with only a 4.04% drop compared to the baseline results.
Interestingly, not all approaches follow this pattern.For instance, transfer learning from the fused dataset to dataset B performs best when using 50% of the data, while transfer learning from the fused dataset to dataset A achieves optimal results with only 6.25% of the data.For these approaches, there does not appear to be a clear advantage to using more data, leading to the conclusion that, for training unsupervised anomaly detectors, using more data is not always beneficial.While this may not hold true for other tasks such as fault classification, those tasks are beyond the scope of the present study and will be explored in future works.
The box and whisker plot in Figure 5 reveals that the performance of the anomaly detection model trained with DF is less consistent and has a larger spread when using less data.However, examining the results from other training approaches shows that this is not always the case.In some instances, such as Transfer Learning from Dataset A to Dataset B, the spread is largest when training with the full dataset.Additionally, as the ANOVA table in Table XII shows, the p-value for this experiment suggests the results are statistically significant.
By outperforming all other models while using only a small fraction of the training data, the DF method addresses the common challenge of not having sufficient training data from each data source.In this experiment, as shown in Table X, the estimated FLOPs needed for training the model with 6.25% of the data dramatically decreased from 4.92 × 10 12 to 3.08 × 10 11 , representing a 93.7% reduction in computational power.This significant decrease is especially noteworthy when contrasted with the minor 4.04% reduction in performance.The DF approach effectively utilises less data, showcasing its potential to contribute to more sustainable and environmentally friendly AI development.Aligned with the principles of Green AI, which emphasise efficiency and reducing the environmental impact of training AI models, the results of Experiment 2 highlight the superior performance of DF.Its crucial implications for real-world applications demonstrate its ability to address the issue of limited training data while promoting more sustainable practices in AI development.

C. Experiment 3 -Varying dataset ratio Analysis
The results of the experiment, shown in Table XIV and Table XV, reveal that the DF algorithm outperforms all other training approaches in terms of Average F1 score.The best performance is achieved with a 30:70 ratio (Dataset A: Dataset B), resulting in an Average F1 score of 0.827, closely followed by the baseline experiment (50:50) with an Average F1 of 0.821.In comparison, the next best performing training approach was transfer learning from the fused dataset to Dataset B (TF -DF to Dataset B), which had an Average F1 score of 0.751.The ANOVA in Table XVI confirms that these values are statistically significant.
The box and whisker plot in Figure 6 indicates that the transfer learning approaches involving the fused dataset generally exhibit less spread compared to other methods.Furthermore, the spread of the DF results appears to be the most consistent across different experimental settings.One might hypothesise that the box plot trend would indicate an increase in generalised performance as the balance between samples from each dataset increases.Interestingly, this is not the case.
While the results empirically demonstrate that an imbalance in datasets mostly does not affect the stability of the results for DF, a clear trend is observed, wherein the average F1 score decreases as more of Dataset A is used for training.This trend is more evident in some of the other approaches, such as T-Dataset A, TL-Dataset B to A, and the mixed dataset (MD).For instance, training with the Mixed Dataset at a 40:60 ratio yields an F1 score of 0.889 on Dataset A, 0.509 on Dataset B, and an overall average of 0.699.Conversely, training with a 90:10 ratio results in an F1 score of 0.890 on Dataset A, 0.300 on Dataset B, and an overall average of 0.595.These results show that the performance on Dataset A can be maintained with less data while improving the performance on Dataset B with more data.
The observed trend leads to an intriguing conclusion: there is a higher benefit to training with Dataset B compared to Dataset A to achieve a more generalised anomaly detection performance.Although both Dataset A and Dataset B represent healthy data, Dataset B exhibits a higher level of noise, which can be clearly seen in Figure 2h, the raw FFT of the signal.Additionally, there is a slightly higher level of spread on the PCA shown in Figure 3.These observations suggest that Dataset B is a more difficult dataset to learn and requires more training data compared to Dataset A. This also implies that a larger variety of healthy data in the training dataset contributes to an overall better anomaly detection performance.

D. Further Remarks
In the context of Transfer Learning, it is important to note that better results on the dataset trained on in the second phase are not guaranteed.The results of the experiments demonstrate that swapping the transfer learning training phases produces different outcomes, which indicates that there is no clear approach to determine which dataset should be used in each phase without conducting additional testing.
The conclusions drawn from Experiment 3 suggest that there is a greater benefit to training on a higher number of samples from Dataset B, as opposed to Dataset A, in order to achieve a more generalised performance.However, identifying this preference presents a challenge, as a similar experiment must be conducted to reach this determination.Despite this issue, the DF approach offers a solution by providing a more stable performance across multiple experimental settings.This advantage becomes particularly significant in practical applications where time and resource constraints may render extensive experimentation unfeasible.By addressing these concerns and offering more consistent results, DF shows potential as an effective method for training unsupervised anomaly detection models, particularly in situations where the optimal dataset and training phase cannot be easily determined.The robust performance of the DF approach, combined with its ability to accommodate various experimental settings, positions it as a valuable tool for practical use cases.It is currently unclear whether the findings from these experiments will hold true for other tasks, such as fault classification.Addressing this question is beyond the scope of this paper, but it presents an intriguing avenue for future research.

VI. CONCLUSION
This paper presents a time-series dataset composition approach called the DF algorithm, designed to address challenges in achieving generalised anomaly detection performance across multiple homogeneous data sources.The proposed algorithm was validated using a case study involving motor current signals, demonstrating that the fused dataset retains salient features from both source datasets while clustering in the middle of both datasets when PCA is applied.The algorithm was then tested on an anomaly detection task and compared to conventional training approaches, with empirical results showing that the DF algorithm significantly outperforms other methods in terms of average performance across both datasets.
Additionally, further experiments were conducted to assess the performance of the proposed approach under non-ideal conditions.Experimental results indicate that the DF approach remains superior even when reducing the number of data samples, with only a 4.04% reduction in performance despite using only 6.25% of the training data, resulting in a 93.7% reduction in computational power required for training.When evaluating the model's performance with imbalanced numbers of samples from each dataset, the proposed approach proved stable across different sample ratios.These findings highlight significant benefits in the context of Green AI, which emphasises sustainable AI model development, as well as practical feasibility due to the algorithm's resilience under non-ideal conditions.
Several research directions could be explored based on the present study.For instance, future works might investigate the applicability of the algorithm for training classifier models or examine whether using the frequency domain representation of the fused dataset could enhance performance.Furthermore, it would be worthwhile to explore the relationship between dataset "complexity" and the "usefulness" of data from each fused dataset, enabling the development of a more systematic approach to dataset fusion.

Fig. 1 .
Fig. 1.A summarised illustration of the Dataset Fusion algorithm Finally, the Chaining and Stacking batches step involves filtering, chaining, and stacking the period batches with a total complexity of O(n * m).The overall complexity of the Dataset Fusion algorithm is the sum of the complexities of these steps, which is O(n * m * (1 +log(m)))+ 3 * O(n * m), with the dominating term being O(n * m * log(m)) due to its faster growth as the input size (n and m) increases.

Fig. 2 .
Fig. 2. Time series signal from Dataset A (a), B (b) and fused (c), Probability Distribution Function from Dataset A (d), B (e) and fused (f), and Fourier Transform from Dataset A (g), B (h) and fused (i) healthy files

Fig. 3 .
Fig. 3. Principal Component Analysis of Dataset A, B and the fused dataset B Transfer Learning -Dataset A to Dataset B TL -Dataset A to B Transfer Learning -Dataset B to Dataset A TL -Dataset B to A Mixed Dataset MD Dataset Fusion DF Transfer Learning -Fused Dataset to Dataset A TL -DF to Dataset A Transfer Learning -Fused Dataset to Dataset B TL -DF to Dataset B bringing the training data closer to the population distribution of the problem domain, thereby enhancing the robustness and generalisability of the model derived from this data.Further evidence of this will be given in the experimental results.

Fig. 4 .
Fig. 4. Box and whisker plot comparing the results for Experiment 1 -Training methods comparison

Fig. 5 .
Fig.5.Box and whisker plot showcasing the impact of varying data volumes on each training method, grouped by training approach and plotted against average F1 score (higher is better).

TABLE I ALGORITHM
COMPLEXITY FOR DATASET FUSION ALGORITHM, FOR n DATASETS WITH m LENGTH

TABLE II SPECIFICATION
FOR MOTOR DATASETS USED IN THE CASE STUDY

TABLE V EXPERIMENT
VARIANTS AND CORRESPONDING KEY FOR RESULTS TABLES

TABLE VIII EXPERIMENT 1
RESULTS -A COMPARISON OF THE PERFORMANCE OF THE ANOMALY DETECTION MODEL USING ALL TRAINING APPROACHES.THE EXPERIMENT KEY IS SHOWN IN

TABLE V
Dataset B and naturally the Average F1 score across both datasets, with a best Average F1 score of 0.879, and an average of 0.821 from 10 runs.
-Dataset Fusion (DF) has the best performance overall, in terms of F1 score on Dataset A,

TABLE XI FULL
RESULTS OF EXPERIMENT 2 -VARYING DATA VOLUME table for the Average F1 scores is shown in Table XVI.V. DISCUSSION A. Experiment 1 -Training methods comparison Analysis Examining the experimental results from Experiment 1, as presented in Table VIII, the DF method is empirically proven to consistently deliver the best performance in terms of F1 score for both Dataset A and Dataset B, as well as the average F1 score across both datasets.With the best