A Novel Outlier Detection Model for Vibration Signals Using Transformer Networks

Outlier detection in vibration signals can play an important role in handling structural or environmental changes during vibration testing. In this study, a transformer-based model for outlier detection is proposed. Unlike previous statistical and regression-based outlier detection methods, the proposed model can identify the outlier location in a high dimensional observation space using the self-attention mechanism. The location of outliers within the vibration observation is marked by a combination of a spatial label and a temporal label. The outlier detection performance of the model is verified by a numerical study of a plane wave and an experimental study of a vibrating plate. These two studies show that the proposed model predicts the outlier location labels with good accuracy (all above 85%) for both the plane wave and the vibrating plate observations.


I. INTRODUCTION
OUTLIER analysis of vibration signals has been studied by many researchers to identify novel data caused by environmental or structural variability [1]-[3]. The measured data can deviate from their normal condition when the vibration response carries information about structural changes or environmental factors such as temperature and load. Both types of deviation need to be handled very carefully, lest they lead to false alarms. Therefore, analysis of structure-induced or environment-induced outliers is of interest because it facilitates valid variation detection in measured responses before any further data processing techniques are used to reveal the structural condition.
Detecting outliers in multivariate vibration observations often proves to be more difficult than in univariate data because of the additional dimensionality [4]. Several attempts have been made to extract outliers concealed in the vibration observation. The most commonly used outlier detection method is the Mahalanobis squared-distance (MSD) [5]-[7], which characterises the normal vibration observation by a mean vector and a covariance matrix. A discordance test follows to evaluate whether a new observation has outliers. Despite the accessibility of MSD, there remains a paucity of evidence on its performance over inclusive data (the data with outliers). To improve the detection of inclusive outliers, [8] proposed the minimum volume enclosing ellipsoid (MVEE) and minimum covariance determinant (MCD) methods. Another outlier detection method is dimension reduction [9]. Typically, principal component analysis (PCA) [10] is used to retain information related to outliers. By replacing a group of correlated variables with a smaller group of principal components, PCA can find the component relevant to the variability caused by outliers. In addition, the regression method has also achieved notable results in outlier analysis. Unlike the statistical or dimension reduction methods, the regression method [3] focuses on predicting the next time step of the measured response. By discriminating outliers from the reference regression model, the regression method is advantageous for online monitoring. However, the above conventional methods have some disadvantages in detecting outliers in high dimensional space. For example, in statistical methods, the masking or swamping effects [11] due to the dominant normal component of the high dimensional data can make the variation contributed by outliers invisible. Moreover, if observations that represent the normal condition are inconsistent, they will become dispersed across the feature space. As a result, dimension reduction techniques like PCA may be infeasible for outlier detection due to the masking effects caused by normal variation components. Regarding regression methods, few studies have systematically investigated the correlation of data points within the high dimensional observation, which may carry the high dimensional feature related to the outlier. In short, detecting outliers becomes challenging as the dimensionality of the observation space increases.
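As an illustration of the MSD discordance test described above, the following minimal sketch characterises a reference set by its mean vector and covariance matrix and flags a new observation whose squared distance exceeds a threshold. The synthetic reference data, the threshold rule (the largest distance found within the reference set), and all names are assumptions made for illustration; they are not taken from the cited works.

```python
import numpy as np

def mahalanobis_squared(x, mean, cov):
    """Squared Mahalanobis distance of observation x from the reference set."""
    diff = x - mean
    # Solve cov @ z = diff instead of forming the explicit inverse.
    return float(diff @ np.linalg.solve(cov, diff))

rng = np.random.default_rng(0)

# Hypothetical reference set: 500 normal-condition observations, 4 features each.
normal = rng.normal(size=(500, 4))
mean = normal.mean(axis=0)
cov = np.cov(normal, rowvar=False)

# Assumed threshold: the largest distance observed in the reference set itself.
threshold = max(mahalanobis_squared(row, mean, cov) for row in normal)

# A candidate observation placed far away from the reference cloud.
candidate = np.array([8.0, 8.0, 8.0, 8.0])
is_outlier = mahalanobis_squared(candidate, mean, cov) > threshold
```

Because the mean and covariance are estimated from the reference set, any outliers already present in that set inflate the covariance and shrink the distances, which is the masking effect discussed in this section.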
More recently, artificial neural networks (ANNs) [12] have been utilized for outlier detection on account of their nonlinear approximation capability. An ANN can approximate nonlinear features or classify groups of features divided by nonlinear boundaries. Multilayer perceptrons, convolutional neural networks, and recurrent neural networks (MLP, CNN, RNN) [13]-[15] are the most popular ANNs for outlier detection. [16] proposed a CNN-based model to identify or eliminate abnormal data. The CNN is used to extract temporal features in the vibration time series for abnormal data classification. However, the outliers discussed in that paper clearly exceed the mean and variance of the normal vibration observations. Accordingly, the outlier approximation potential of that model is not fully investigated. [7] proposed an RNN model with long short-term memory (LSTM) cells to approximate the Mahalanobis distances of normal conditions. By subtracting the predicted distances from those of the measured observations, one can monitor the variation caused by outliers. However, the approximation performance of this model is limited by the statistical distance metrics it applies. So far, there has been little discussion about exploiting ANN capacity for locating outliers. A CNN uses convolution windows or filters to transform data into feature maps, and an RNN relies on recurrent cells for sequential feature extraction. CNNs have shown state-of-the-art performance in local feature extraction but remain highly sensitive to adversarial noise. For outlier analysis, this means that a CNN-based model has weak robustness against inconsistent normal conditions. Furthermore, RNNs have been firmly established as the dominant approach in sequence modelling and prediction. The sequence processing mechanism of an RNN-based model is inherently sensitive to the input sequence order, which makes it difficult to generalize to outliers at random sequential positions. Although ANN-based models have achieved significant improvements in approximation capability for outlier features, the challenge of outlier detection in high dimensional space and the fundamental constraints of the CNN and RNN architectures remain.
Transformer architectures, in recent work, have demonstrated impressive performance in the fields of natural language processing and computer vision [17], [18]. This type of architecture relies entirely on an attention mechanism to draw global dependencies between input and output. Previous research has shown that the transformer architecture is highly robust to severe occlusions, perturbations, and domain shifts in images [19]. In terms of outlier analysis, the transformer network is considered a promising ANN candidate for outlier detection in high dimensional space. It has the potential to reveal the location of outliers within a vibration observation with the help of positional embedding and the attention mechanism. In this study, we address the challenges of locating outliers in high dimensional space through a transformer-based machine learning model. The proposed model shows notable outlier detection performance in both a numerical study of plane wave propagation and an experimental study of a vibrating plate. The main contributions of this paper are:
1. A novel transformer-based model for outlier detection is proposed.
2. A multi-output layer is implemented in the proposed model to smooth the outlier location labelling in high dimensional space and exploit the learning capacity of the transformer.
3. Numerical and experimental studies are presented to showcase the performance of the proposed model on outlier detection.
The remainder of this paper is organized as follows. Section II introduces the fundamentals of the transformer architecture. In Section III, the outlier labelling and simulation process, as well as model training and evaluation are described. The numerical and experimental studies of the proposed model are presented in Section IV. Finally, the conclusions are given in Section V.

II. RELATED WORK
In this study, a machine learning model based on a transformer architecture is proposed for outlier detection in high dimensional space. Specifically, the proposed model uses the transformer encoder to replace conventional outlier detection procedures and directly identify outlier features from vibration observations. The input of the transformer network is assumed to consist of patches of flattened representations of the vibration observation at each time step (Fig. 1). The input is first processed by adding position tokens to the flattened vectors of each time frame using the positional embedding layer. The embedded input is then fed into the transformer encoder. Finally, the features extracted by the encoder are processed by a multi-output classification layer to perform outlier identification. The overview of the transformer network is shown in Fig. 2.
[20] during model training. More details about positional embedding or encoding can be found in [21], [22].
In this section, we discuss the fundamentals of the transformer encoder, which consists of a self-attention module, a feed-forward layer, normalization layers, and residual connections [23]. The self-attention module (Fig. 4) has three input layers, namely, the query, key, and value layers. These three layers are linear layers that project each group of embedded vectors into the query, key, and value matrices respectively. The weights of each input layer are updated independently and the projection process can be formulated as:
$$Q = XW_q, \quad K = XW_k, \quad V = XW_v$$
where $X \in \mathbb{R}^{m \times n}$ is the input for all three input layers of the self-attention module, $Q, K, V \in \mathbb{R}^{m \times l}$ are the query, key, and value matrices, and $W_q, W_k, W_v \in \mathbb{R}^{n \times l}$ are the query, key, and value projection weights for $X$. Usually, the size $m \times l$ would be smaller than $m \times n$ to reduce the computation cost. Moreover, an activation function is applied to the scaled dot product of $Q$ and $K$ to obtain weights on $V$.
The output of the self-attention module is computed by:
$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{l}}\right)V$$
where $\mathrm{Att}$ is the self-attention function and $\mathrm{softmax}$ is the activation function [24]. In addition, the multi-head self-attention mechanism is implemented by concatenating the outputs of several self-attention modules, which can be expressed as:
$$\mathrm{MH}(Q, K, V) = W_m \, C(\mathrm{Att}_1, \ldots, \mathrm{Att}_h)$$
where $\mathrm{MH}$ is the multi-head self-attention function, $C$ refers to concatenation (giving a stacked output of size $hm \times l$), $h$ is the number of heads, and $W_m \in \mathbb{R}^{m \times hm}$ is the projection matrix for the concatenated outputs of the self-attention functions.
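The projection and scaled dot-product steps described in this section can be sketched with NumPy as follows. This is a minimal single-head illustration under assumed sizes (m = 5, n = 8, l = 4) and random weights; it is not the authors' implementation, and the function names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max subtraction for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X has shape (m, n); each projection weight has shape (n, l).
    Returns the attended output (m, l) and the attention weights (m, m).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(0)
m, n, l = 5, 8, 4  # assumed sizes: 5 time steps, 8 input features, 4 projected features
X = rng.normal(size=(m, n))
Wq, Wk, Wv = (rng.normal(size=(n, l)) for _ in range(3))

out, weights = self_attention(X, Wq, Wk, Wv)
```

Each row of `weights` sums to one, so every output row is a weighted average of the value vectors over all time steps. A multi-head variant would run several such heads with independent weights and combine their outputs.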
The feed-forward layer provides projection functionality similar to that of the input layers in the self-attention module. The major difference is that the input layers have no activation function, while the feed-forward network has ReLU [25] activation functions in its hidden layers. More specifically, the feed-forward network is a fully connected network consisting of projection layers with several hidden layers in between. Furthermore, there are two residual connections (also known as shortcut connections). One connects the input layer to the normalization layer after the self-attention module, and the other connects the normalization layer after the self-attention module to the normalization layer after the feed-forward layer. These residual connections are implemented to optimize the mapping process of both the self-attention module and the feed-forward network.
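The arrangement of residual connections and normalization layers described above can be sketched as below. The sketch makes several assumptions: the normalization has no learned scale or shift, the feed-forward network has a single hidden layer, and `toy_attention` is a stand-in without learned projections that is used only so the shapes match for the residual sum.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each row to zero mean and (approximately) unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Fully connected network with one ReLU hidden layer."""
    hidden = np.maximum(0.0, x @ W1 + b1)
    return hidden @ W2 + b2

def toy_attention(x):
    """Stand-in attention (no learned projections), keeps the input shape."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def encoder_block(x, attention_fn, ffn_params):
    # First residual connection: block input -> normalization after attention.
    h = layer_norm(x + attention_fn(x))
    # Second residual connection: that normalization -> normalization after the FFN.
    return layer_norm(h + feed_forward(h, *ffn_params))

rng = np.random.default_rng(0)
m, n, hidden = 6, 8, 16  # assumed sizes
x = rng.normal(size=(m, n))
ffn_params = (
    rng.normal(size=(n, hidden)) * 0.1, np.zeros(hidden),
    rng.normal(size=(hidden, n)) * 0.1, np.zeros(n),
)
y = encoder_block(x, toy_attention, ffn_params)
```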

A. ARCHITECTURE OF THE PROPOSED MODEL
In this study, a transformer-based supervised learning model is designed for outlier detection in high dimensional vibration observations. The architecture of the proposed model is shown in Fig. 5. It mainly consists of two parts: a transformer encoder and a multi-branch output layer. The transformer encoder extracts the location features of outliers, and the multi-branch output layer classifies both the spatial location and the temporal location of the detected outlier in the high dimensional vibration measurement.

B. TRAINING LABEL
According to [26], previous works on object detection usually take a classifier for the target and evaluate it at various locations and scales in the observation space. By sliding the classification window in the observation space or using the divide and conquer strategy to decompose the observation space [27], [28], these detection methods can accomplish the object detection task at the cost of time and optimization difficulty. Conversely, the you only look once (YOLO) system reframes object detection as a single regression problem to speed up the detection process. YOLO divides the 2-dimensional input data into grids and predicts the bounding box of each grid as well as the existence of the target in that box. The label for the object in one grid is a combination of the bounding box label with 5 predictions and the conditional class probabilities.
Inspired by the above works, this study proposes an efficient labelling method for outlier detection. The desired output of the proposed model is the spatial and temporal labels of outliers within vibration observations. In this study, it is assumed that outliers occur in certain areas of the testing points and during a certain period of time throughout one observation. One high dimensional vibration observation with outliers can be represented as an $H \times W \times N$ tensor. For illustration, the vibration observation is divided into 2×2 areas of size $\frac{H}{2} \times W \times \frac{N}{2}$ as shown in Fig. 6. Consequently, the related spatial and temporal labels according to one-hot labelling would be: test point area I(
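The labelling idea in this subsection, one one-hot label for the spatial area and one for the temporal area, can be sketched as follows. The helper names and the zero-based area indices are assumptions for illustration, and `area_bounds` assumes the axis length is divisible by the number of areas.

```python
def one_hot(index, num_classes):
    """One-hot label with a 1 at the given zero-based index."""
    label = [0] * num_classes
    label[index] = 1
    return label

def location_labels(spatial_area, temporal_area, n_spatial=2, n_temporal=2):
    """Return the (spatial, temporal) one-hot label pair of an outlier location."""
    return one_hot(spatial_area, n_spatial), one_hot(temporal_area, n_temporal)

def area_bounds(size, index, parts):
    """Start and stop index of one area along an axis split into equal parts."""
    step = size // parts
    return index * step, (index + 1) * step

# Outliers in the first test point area during the second time period.
spatial_label, temporal_label = location_labels(spatial_area=0, temporal_area=1)

# For an axis of length 10 split in two, the second area covers indices 5..9.
second_half = area_bounds(size=10, index=1, parts=2)
```

Separating the two labels keeps the number of output classes at n_spatial + n_temporal rather than n_spatial × n_temporal, which is what a single joint label would require.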

C. SIMULATION OF OUTLIERS
Previous studies mostly defined an outlier as a data point that falls far away from the bulk of the points from a statistical point of view. However, in this study, an outlier is defined as data that deviate from the normal condition, namely a sampling point that strays from the normal vibrating behaviour. A deviation ratio is implemented to introduce outliers in the labelled area or subset of the high dimensional vibration measurement. This is achieved by multiplying a percentage of randomly selected vibration vectors in the labelled area by the following deviation ratio: where $s = 0.1$ is the scaling constant, $r_i$ is the ith element of $r$ and the deviation ratio for the ith element of the selected vibration vector, and $g$ is the probability vector of the same length as $r$ following the discrete uniform distribution.
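A rough sketch of this outlier simulation is given below. The exact deviation-ratio formula is not reproduced in the text above, so the sketch ASSUMES the form r = 1 + s·g with g drawn from a discrete uniform distribution on {0, 0.1, ..., 1}. It also assumes that a "vibration vector" is the time series of one test point and that half of the vectors in the labelled area are selected. None of these choices should be read as the authors' definition.

```python
import numpy as np

def inject_outliers(obs, h_range, t_range, fraction=0.5, s=0.1, seed=0):
    """Scale a random fraction of vibration vectors inside one labelled area.

    obs has shape (H, W, N). The labelled area spans rows h_range and
    time steps t_range. ASSUMED deviation ratio: r = 1 + s * g.
    """
    rng = np.random.default_rng(seed)
    out = obs.copy()
    h0, h1 = h_range
    t0, t1 = t_range
    # All test points inside the labelled spatial area.
    candidates = [(h, w) for h in range(h0, h1) for w in range(obs.shape[1])]
    n_pick = max(1, int(fraction * len(candidates)))
    picked = rng.choice(len(candidates), size=n_pick, replace=False)
    for idx in picked:
        h, w = candidates[idx]
        # Hypothetical g: discrete uniform on {0, 0.1, ..., 1}.
        g = rng.integers(0, 11, size=t1 - t0) / 10.0
        out[h, w, t0:t1] *= 1.0 + s * g
    return out

clean = np.ones((4, 4, 8))  # hypothetical observation
noisy = inject_outliers(clean, h_range=(0, 2), t_range=(4, 8))
```

Only samples inside the labelled rows and time steps are modified, so the spatial and temporal labels of the resulting sample are known by construction.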

D. MULTI-OUTPUT LAYER IN THE MODEL
The proposed model has two output branches after the transformer encoder (Fig. 5). One is for spatial label learning, and the other is for temporal label learning. This multi-output architecture forces the transformer encoder to maintain both spatial and temporal information in the encoded feature. During the training process, pooling layers are employed on these two branches to extract the spatial and temporal location of outliers respectively from the encoded feature produced by the transformer encoder. There are, of course, many possible pooling configurations for the multi-output layer (e.g., max-pooling/average-pooling, average-pooling/max-pooling). Validating these configurations is essential to quantify how the pooling configuration of the multi-output layer influences the model prediction. In Section IV, a detailed discussion about the configuration of the multi-output layer is presented. After the pooling layer in each branch, the learned feature from prior layers is flattened into a 1-dimensional tensor and passed into a fully connected (FC) classification layer for the final prediction. The FC layers in each of the two branches are two stacked dense layers. The upper layer has ten neurons and the lower has four neurons.
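The two-branch head described above can be sketched in NumPy as follows. Several details are assumptions: pooling is taken over the time-step axis of the encoded feature, the ten-neuron layer uses ReLU, the four-neuron layer uses softmax, biases are omitted, and "MAX/AVG" is read as max-pooling in the spatial branch and average-pooling in the temporal branch. The weights are random placeholders, not trained values.

```python
import numpy as np

def softmax(z):
    """Softmax over a 1-dimensional vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def branch(encoded, mode, W1, W2):
    """Pool the encoded feature, then apply two stacked dense layers (10 -> 4)."""
    pooled = encoded.max(axis=0) if mode == "max" else encoded.mean(axis=0)
    hidden = np.maximum(0.0, pooled @ W1)  # ten-neuron layer, ReLU assumed
    return softmax(hidden @ W2)            # four-neuron layer, softmax assumed

rng = np.random.default_rng(0)
steps, width = 20, 16  # assumed encoder output: 20 time steps, 16 features
encoded = rng.normal(size=(steps, width))

weights = {
    name: (rng.normal(size=(width, 10)), rng.normal(size=(10, 4)))
    for name in ("spatial", "temporal")
}

# Assumed MAX/AVG configuration: max-pooling for S, average-pooling for T.
spatial_probs = branch(encoded, "max", *weights["spatial"])
temporal_probs = branch(encoded, "avg", *weights["temporal"])
```

Both branches read the same encoded feature, so the encoder has to keep the information that each branch needs.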

E. LOSS FUNCTION IN THE MODEL
Both output branches use the categorical cross-entropy [29] function as the loss function, which can be formulated as:
$$L_s = -\frac{1}{n}\sum_{i=1}^{n} g_s^{(i)} \cdot \log p_s^{(i)}, \qquad L_t = -\frac{1}{n}\sum_{i=1}^{n} g_t^{(i)} \cdot \log p_t^{(i)}$$
where $L_s$ and $L_t$ are the losses of the spatial label branch and the temporal label branch respectively, $n$ represents the number of training samples, $g_s$ and $g_t$ represent the ground truth spatial and temporal location of outliers within the training sample, $p_s$ is the spatial label branch prediction of the sample, and $p_t$ is the temporal label branch prediction of the sample. In addition, the two output branches have the same loss weights, which means the contributions of $L_s$ and $L_t$ to the loss of the model are balanced.
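A minimal sketch of the balanced two-branch loss follows. It assumes the cross-entropy is averaged over the samples and clips predictions at a small `eps` to avoid taking the logarithm of zero. The equal weights of 0.5 follow the training setup stated in this paper, while the function names are hypothetical.

```python
import math

def categorical_crossentropy(truth, pred, eps=1e-12):
    """Mean categorical cross-entropy over a batch of one-hot labels."""
    total = 0.0
    for g, p in zip(truth, pred):
        total -= sum(gk * math.log(max(pk, eps)) for gk, pk in zip(g, p))
    return total / len(truth)

def total_loss(g_s, p_s, g_t, p_t, w_s=0.5, w_t=0.5):
    """Weighted sum of the spatial and temporal branch losses."""
    return (w_s * categorical_crossentropy(g_s, p_s)
            + w_t * categorical_crossentropy(g_t, p_t))

# One sample with four spatial and four temporal classes.
g_s, g_t = [[1, 0, 0, 0]], [[0, 0, 1, 0]]
uniform = [[0.25, 0.25, 0.25, 0.25]]

loss_uniform = total_loss(g_s, uniform, g_t, uniform)  # equals log(4)
loss_perfect = total_loss(g_s, [[1.0, 0.0, 0.0, 0.0]], g_t, [[0.0, 0.0, 1.0, 0.0]])
```

A uniform prediction over four classes gives a loss of log 4 in each branch, and a perfect prediction gives zero, which bounds the range one would expect to see early and late in training.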

F. TRAINING AND EVALUATION
According to the area division of the input data, as described in Section III-B, a possible outlier location in high-dimensional space is represented by a combination of a spatial label and a temporal label. Samples with simulated outliers are fed into the proposed model for training, and a fraction of the samples is used as validation data for the evaluation at the end of each epoch. By monitoring the evaluation result of each epoch, the model that reaches the best evaluation performance during training is selected as the best model.
The evaluation metric for the proposed model is categorical accuracy. This metric computes the frequency with which the ground truth of the input matches the predicted label pair or probability pair. If the index of the maximal ground truth value is equal to the index of the maximal predicted value, it is counted as a successful prediction for the model being evaluated. In order to evaluate the proposed multi-output model, the categorical accuracy metric is applied to both output branches, and the categorical accuracy of a single branch represents the performance of the corresponding outlier detection task. Additionally, the optimizer used for model training is adaptive moment estimation (Adam) and the loss weights of the two branches are 0.5 and 0.5.
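The categorical accuracy described above compares the index of the largest ground truth value with the index of the largest predicted value. A minimal sketch, with hypothetical helper names and made-up example predictions:

```python
def argmax(values):
    """Index of the largest element (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)

def categorical_accuracy(truth, pred):
    """Fraction of samples whose predicted class index matches the ground truth."""
    hits = sum(argmax(g) == argmax(p) for g, p in zip(truth, pred))
    return hits / len(truth)

# Two samples of one branch: the first is predicted correctly, the second is not.
truth = [[1, 0, 0, 0], [0, 0, 1, 0]]
pred = [[0.7, 0.1, 0.1, 0.1], [0.4, 0.3, 0.2, 0.1]]

accuracy = categorical_accuracy(truth, pred)
```

For the multi-output model, the same function would be applied once to the spatial branch and once to the temporal branch, giving the S and T accuracies separately.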

IV. RESULTS AND DISCUSSION
The proposed model is validated with a numerical study and an experimental study: the 2-dimensional plane wave and the plate structure. The vibration data from both studies can be represented in 3-dimensional form with two spatial dimensions and one temporal dimension. Moreover, there is a notable difference between the vibration pattern of a plane wave and that of a plate, which helps to verify whether the proposed model is capable of detecting outliers within different vibration patterns. The plane wave vibration involves no shear force and its vibrational behaviour is predictable. Conversely, the composite plate has nonlinear characteristics (the discontinuity of mass) and cannot be described analytically. TensorFlow is used in the implementation of the proposed model. The detailed software environment is as follows: TensorFlow-GPU 1.14.0, CUDA 10.0, cuDNN 7.4, Keras 2.2.5, Python 3.7.3.

A. OUTLIER DETECTION IN PLANE WAVE
The governing equation of the 2-dimensional plane wave [30] is where $F(x, y, t)$ is the value of the plane wave field at time $t$ and location $(x, y)$, $A(\omega_0) = 1$ is the amplitude of the wave at frequency $\omega_0$, $k_x = 1$ is the wave number along the x axis, and $k_y = 1$ is the wave number along the y axis.
Table 1 provides the prediction result of each pooling configuration of the multi-output layer; the transformer encoder in this comparison uses only one self-attention head. Although the MAX/AVG configuration achieves the best overall performance, no evidence suggests that this configuration is optimal for both the spatial and the temporal label output branches. For example, the AVG/MAX configuration has better temporal label prediction accuracy than the MAX/AVG configuration. One possible implication of this is that the encoded feature from the single-head transformer is insufficient for the subsequent label prediction tasks. Therefore, the influence of the number of self-attention heads was investigated as well. The performance of the proposed model using different numbers of self-attention heads is tabulated in Table 2, where the MAX/AVG configuration was adopted by the model. The model attains reasonably good S and T accuracy (all above 85%) when the number of self-attention heads is increased to 6. However, when the number of self-attention heads reaches 8, the prediction accuracies of the model drop below 80%. A likely explanation is that the deterioration of the model performance is caused by overcomplicated features from the 8-head transformer encoder. In summary, this numerical study shows that the proposed model is capable of the outlier detection task in the simulated plane wave observation. Additionally, better prediction accuracy can be achieved by tuning the pooling configuration of the multi-output layer and the number of self-attention heads of the transformer encoder.
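For readers who want to reproduce a comparable input, the plane wave field described at the start of this subsection can be sampled on a grid as sketched below. The real-valued cosine form, the sign convention of the phase, the value of `omega0`, and the grid extents are all assumptions for illustration. Only A = 1, k_x = 1, and k_y = 1 come from the text.

```python
import numpy as np

def plane_wave(xs, ys, ts, amplitude=1.0, kx=1.0, ky=1.0, omega0=1.0):
    """Sample an assumed real-valued plane wave on a (len(xs), len(ys), len(ts)) grid.

    ASSUMED form: F(x, y, t) = A * cos(kx * x + ky * y - omega0 * t).
    """
    x, y, t = np.meshgrid(xs, ys, ts, indexing="ij")
    return amplitude * np.cos(kx * x + ky * y - omega0 * t)

# Hypothetical grid: 10 x 10 test points and 100 time steps.
field = plane_wave(
    xs=np.linspace(0.0, 2.0 * np.pi, 10),
    ys=np.linspace(0.0, 2.0 * np.pi, 10),
    ts=np.linspace(0.0, 10.0, 100),
)
```

The resulting array has two spatial dimensions and one temporal dimension, matching the 3-dimensional representation of vibration data used in both studies.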

B. OUTLIER DETECTION IN VIBRATING PLATE
As shown in Fig. 7, the vibration observation of the plate was collected from the measurement area. There are 10 × 10 testing points within this area and the interval between adjacent points is 10 mm. The plate was excited by a handheld exciter (B&K type 5961) at the excitation point and the vibration signal of each testing point was collected by an accelerometer (B&K type 8309). In this experimental study, the vibration data of size 10 × 10 × 100 was divided into 4×4 areas for the simulation of outliers at different locations. Moreover, 50 training samples and 10 evaluation samples were prepared for every possible outlier location. As in the previous numerical study, the outlier detection performance of the proposed model in this experimental study was examined using 4 pooling configurations. The prediction accuracy of the single self-attention head model is shown in Table 3. The AVG/AVG configuration yields the most biased predictions, and the best overall prediction performance is achieved using the MAX/MAX configuration. According to the numerical study, the number of self-attention heads seems to have a positive impact on the prediction accuracy of the proposed model under certain conditions. Therefore, a further performance improvement of the MAX/MAX model was expected from increasing its number of self-attention heads. However, as shown in Table 4, increasing the number of self-attention heads does not necessarily improve the prediction result. This inconsistency may be caused by the change of the vibration pattern of the model input. Nevertheless, the proposed model achieves good outlier detection performance using the MAX/MAX configuration and 6 attention heads.

V. CONCLUSION
This study proposes a transformer-based model for outlier detection. The multi-output layer in the model relieves the outlier labelling complexity in high dimensional space by separating the spatial and temporal labels of outliers. During model training, this multi-output layer urges the transformer encoder to encode the necessary spatial and temporal location of outliers in its output for the subsequent prediction tasks. The proposed model can locate outliers within pre-divided areas of simulated plane wave and vibrating plate observations with accuracies up to 85.6%/93.1% and 99.9%/99.9% respectively. A limitation of the proposed model is that the resolution of the outlier location is fixed, which hinders its application in detecting outliers with irregular distribution. A further study on improving the resolution of the outlier location prediction together with outlier clustering will be considered.
JIE ZHANG received the B.S. degree in measurement and control technology and instrument and the M.Sc. and Ph.D. degrees in measurement technology and automatic equipment from the University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2010, 2013, and 2018, respectively. He is currently a Research Associate with the School of Automation Engineering, UESTC. His current research interests include nondestructive testing and prognostics and health management of electronics.
VOLUME 4, 2016