Detecting Anomalies in System Logs With a Compact Convolutional Transformer

Computer systems play an important role in ensuring the correct functioning of critical systems such as train stations, power stations, emergency systems, and server infrastructures. To ensure the correct functioning and safety of these computer systems, the detection of abnormal system behavior is crucial. For that purpose, monitoring log data (mirroring the recent and current system status) is very commonly used. Because log data consists mainly of words and numbers, recent work has used Transformer-based networks to analyze the log data and predict anomalies. Despite their success in fields such as natural language processing and computer vision, the main disadvantage of Transformers is the huge number of trainable parameters, leading to long training times. In this work, we use a Compact Convolutional Transformer to detect anomalies in log data. Using convolutional layers leads to a much smaller number of trainable parameters and enables the processing of many consecutive log lines. We evaluate the proposed network on two standard datasets for log data anomaly detection, Blue Gene/L (BGL) and Spirit. Our results demonstrate that the combination of convolutional processing and self-attention improves the performance for anomaly detection in comparison to other self-supervised Transformer-based approaches, and is even on par with supervised approaches.


I. INTRODUCTION
Computer systems, such as cyber-physical systems (CPS), industrial control systems (ICS), server systems, IoT services, or supercomputers, create a huge amount of data every day about the current status of the system, processes, network communication, or critical events, recorded in log data. Along with the direct monitoring of network traffic or process data, the automated evaluation of log data to find anomalies or faulty processing has become an important challenge in recent years in order to ensure error-free operation of the system. Log data are mainly created by following a specific syntax to create templates of words, which are filled with numbers reflecting the status of the corresponding system [1], [2]. Thus, log data consist of a number of log lines or log sequences, where each log line consists of multiple items (words or numbers), i.e., a sequence of items. The task of analyzing log data can thus be compared to text understanding, a field of natural language processing where Transformer networks, a recent architecture for artificial neural networks [3], allowed for impressive applications in the last years [4], [5]. This success led to a certain popularity of Transformer-based neural networks for log data analysis [6], [7], [8], [9], [10], [11].
Despite their success, Transformer networks require large amounts of data for training [12]. This is mainly due to the huge number of trainable parameters (weights and biases) in the encoder blocks, which contain the self-attention heads: around 110 million parameters for the BERT model [4] and around 175 billion parameters for GPT-3 [13]. As a consequence, training a Transformer model like BERT can take several days. To avoid this issue, some researchers used a pre-trained Transformer to create a latent representation of the log data and feed it into an additional classification or prediction layer [10]. Another problem with the Transformer approach is the limitation of processing only one line of the log data in one learning step of the network [7], [8], which can lead to a loss of information about possible correlations between the different lines of the log dataset. Previous research addressed this problem by concatenating successive log items into one sequence of specific length, allowing the network to capture the contextual information across different log entries [6], [8], [10]. However, the amount of information preserved across multiple consecutive log entries can vary, due to the different lengths of single log entries and the resulting variation in the number of items from multiple log entries in one sequence.
Most log data are built in a semi-structured way, consisting of elements at fixed positions (like timestamps or process ids) and elements with varying size and position in a log entry (like free text written by a developer or automatically created strings via templates or syntax rules) [1], [2]. Log data can thus vary from a structured [14] to a more unstructured format [15]. Regardless of whether the log data consists of more structured or unstructured content, a standard preprocessing step is pre-parsing, which identifies templates in the data and their corresponding parameters [6], [7], [8], [9]. It has been argued that splitting the dataset into templates and parameters can destroy the context between consecutive log sequences and that falsely created templates and parameter values lead to out-of-vocabulary word errors [10]. These are words that appear only rarely or not at all in the training set; they can lead to a higher false positive rate, as the bigger context in which an anomaly can happen is ignored [10]. Further, the resulting templates and parameters depend mainly on the used pre-parsing approach [2], leading to different template-parameter pairs and making it necessary to identify the best-fitting pre-parsing approach for the respective dataset.
To overcome these limitations, we designed an anomaly detection network using a Compact Convolutional Transformer (CCT) [12]. The usage of convolutional sub-networks for position embedding together with sequence pooling in the CCT leads to a compact architecture. While there exists a broad corpus of models and methods for anomaly detection in log data (see He et al. [16] for a review), we focus here on Transformer-based solutions, as they have been shown to outperform other methods. Additionally, most of the recent Transformer-based approaches utilize a self-supervised learning scheme, not requiring labels during the training process. As our approach also uses a self-supervised learning scheme, supervised methods that are not Transformer-based are not considered.
The compactness of the CCT enables the training of the complete network on two logfile datasets from high-performance computing systems: Blue Gene/L (BGL) and Spirit [15]. While 2D-convolutional neural networks are most common in object recognition [17], [18] and object detection [19], they have also been shown to be useful in other domains, like anomaly detection in sensor data [20] or the analysis of network traffic [21]. Based on these results, we assume that the two-dimensional kernel will be able to detect correlations between multiple rows and columns and encode local information. The self-attention mechanism [3] of the Transformer encoder block in the model [12] can preserve the contextual information between the latent representations from the convolutional sub-network and thereby encode global information.
The log data analysis is performed on the unstructured data without any pre-parsing to identify templates and parameters, as it has already been shown that pre-parsing is not necessary to achieve comparable performances [7], [9], [10].
In this study, we evaluate the capability of the CCT to encode the context within the datasets by systematically varying the 2D-convolutional kernel sizes, the decision threshold, and other hyperparameters. Our results show that a bigger kernel (allowing the parallel processing of multiple log sequences) can increase the performance of the network, indicating the importance of contextual information for anomaly detection. We demonstrate in this paper how the combination of convolutional processing in the early stages of the CCT and the attention-based processing in the Transformer blocks improves anomaly detection in comparison to previously published Transformer-based approaches.
The contributions of this paper are as follows:
1) By transferring the Compact Convolutional Transformer (CCT) [12] from object recognition to a task closer to natural language processing, namely detecting anomalies in supercomputer log data, we demonstrate how the extension of a Transformer-encoder architecture by a 2D-convolutional neural network improves anomaly detection in unstructured log data.
2) We demonstrate how bigger convolutional kernel sizes increase the performance of the network, by incorporating information from multiple log lines and encapsulating contextual information between the convolutional encodings through self-attention.
3) The combination of 2D-convolutional neural networks with Transformer-encoder blocks leads to a network with an overall lower number of trainable parameters than previously published Transformer-based approaches, showing comparable or better performance.

II. METHODS

A. DATASET
To evaluate our approach, we use the open Blue Gene/L (BGL) and Spirit datasets. The first one contains alert and non-alert messages recorded from the Blue Gene/L supercomputer at the Lawrence Livermore National Labs (LLNL) in Livermore, California [15], [24]. The log entries are managed by a Machine Management Control System (MMCS) and provide error messages and warnings from hardware as well as from software for individual chips and computer nodes, errors about inter-processor communication over the network, and temperature emergencies (for example, through a dysfunctional fan) [15], [25]. Only warnings and errors that correspond to a faulty behavior of the system (like a software crash) are marked as alerts in the dataset, and they are the anomalies that we aim to detect in our study. Messages in the dataset are collected with a period of around one millisecond [15], [25]. In this study, we use the version from the Loghub collection [24], consisting of 4,747,963 messages in total, with 348,460 messages marked as anomalies.
The Spirit dataset contains log messages from the Spirit supercomputer installed at the Sandia National Labs (SNL) in Albuquerque, New Mexico [15]. We use 1 GB of log messages from the Spirit dataset as provided by Le and Zhang [10], consisting of 7,983,345 log messages in total, with 768,142 messages marked as anomalies.

B. COMPACT CONVOLUTIONAL TRANSFORMER
Introduced by Hassani et al. [12], the Compact Convolutional Transformer (CCT) is a subsequent development of the Vision Transformer (ViT) model proposed by Dosovitskiy et al. [26]. A sketch of the complete network is illustrated in Fig. 1. In the original ViT, the input image is split into non-overlapping patches, similar to the tokenization of vocabularies in natural language processing. A patch and position embedding is used to retain information about the spatial order of the patches. As a consequence, if neither cropping nor padding is possible, only images that can be divided evenly along height and width can be used. In the case of log sequence analysis, cropping could lead to a loss of contextual information, and padding increases the number of irrelevant items in the input. The sequence of patches is passed to a Transformer encoder, consisting of multi-head attention layers and multi-layer perceptron (MLP) blocks, extracting the input for the classification head.
In contrast to the original ViT, the CCT uses a convolutional block instead of separated image patches, enabling the processing of inputs with a non-quadratic spatial resolution [12]. The convolution operation with a set of $K$ kernels $W = \{W_1, \ldots, W_K\}$ and biases $\{b_1, \ldots, b_K\}$ on an input tensor $X$ is defined pixel-wise by Eq. 1:

$$(W_k * X)_{i,j} = \sum_{m}\sum_{n} W_k[m,n] \, X[i+m,\, j+n] \tag{1}$$

The convolutional layer adds a bias $b_k$ for each filter $W_k$ and applies a non-linear activation function $\sigma$ element-wise on the resulting tensor (Eq. 2):

$$Z_k = \sigma\big((W_k * X) + b_k\big) \tag{2}$$

We use the ReLU (rectified linear unit) activation function for all convolutional layers.
Each convolutional layer is followed by a max-pooling layer, performing a subsampling operation that only keeps the highest value in a 2 × 2 region. The output of a convolutional block ($Z$) is shown in Eq. 3, where the $Z_k$ from Eq. 2 are stacked along the channel dimension:

$$Z = \mathrm{MaxPool}_{2 \times 2}\big([Z_1; \ldots; Z_K]\big) \tag{3}$$
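To make this concrete, the following is a minimal PyTorch sketch of one such convolutional block as we read Eqs. 1-3. The channel count and the token-window shape follow the values used later in this work; the "same" padding is our assumption, as the exact padding scheme is not stated in the text.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """One convolutional block: convolution + bias (Eqs. 1-2), ReLU, 2x2 max-pooling (Eq. 3)."""

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 4):
        super().__init__()
        # nn.Conv2d realizes the kernel sum of Eq. 1 and adds the per-filter bias b_k of Eq. 2.
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, padding="same")
        self.act = nn.ReLU()          # element-wise non-linearity (Eq. 2)
        self.pool = nn.MaxPool2d(2)   # keeps the highest value in each 2x2 region (Eq. 3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pool(self.act(self.conv(x)))

# Example: a BGL log window of 15 sequences x 30 tokens, one input channel, 128 filters.
z = ConvBlock(1, 128)(torch.randn(8, 1, 15, 30))
print(z.shape)  # torch.Size([8, 128, 7, 15])
```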
The usage of a convolutional subnetwork allows the number of log sequences and the length of each log sequence to be changed independently of each other. Furthermore, the convolutional block can preserve local spatial information, which could make the positional embedding obsolete. In fact, Hassani et al. [12] showed that the accuracy of the CCT was only weakly affected by removing the positional embedding, which we can confirm (see Fig. S1). The representations from the convolutional blocks are sent into one or more Transformer blocks. Each Transformer block starts by normalizing the representation of each sample, followed by the multi-head self-attention with $h$ self-attention heads to capture different contextual aspects. Following Vaswani et al. [3], the output of one attention head $A_i$ is the scaled dot-product attention shown in Eq. 4:

$$A_i = \mathrm{softmax}\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i \tag{4}$$

where $Q_i = X W_i^Q$, $K_i = X W_i^K$, and $V_i = X W_i^V$ are called the query, key, and value, respectively, with $X$ being the input representation. $W_i^Q$, $W_i^K$, and $W_i^V \in \mathbb{R}^{d_X \times d_k}$ are the weight matrices of the $i$-th attention head, $d_k$ being the embedding dimensionality and $d_X$ the dimensionality of the input. Each attention head $A_i$, $i \in [1, h]$, follows Eq. 4 and has its own $W_i^Q$, $W_i^K$, and $W_i^V$ matrices. The multi-head attention is a mixing of the $h$ outputs of the attention heads (Eq. 5), using the mixing matrix $W^O$:

$$\mathrm{MultiHead}(X) = [A_1; \ldots; A_h] \, W^O \tag{5}$$

After the output of the multi-head attention block is added to the input (skip connection) and layer-normalized, a multi-layer perceptron block, consisting of dense layers with the ReLU activation and a dropout rate of 0.1 (as suggested by Vaswani et al. [3]), calculates the output of the encoder block, also using a skip connection. Finally, a Sequence Pooling mechanism is applied to the output of the Transformer blocks [12], pooling over the sequence dimension before the latent representation is fed into a classification head. The output of the last encoder block ($x_E$) is passed to a linear layer $g(\cdot)$ and a softmax function is applied to it (Eq. 6):

$$x_S = \mathrm{softmax}\big(g(x_E)^{\top}\big) \tag{6}$$
Applying the resulting $x_S$ to the encoder output ($x_P = x_S \times x_E$) weights it depending on its importance for the task. In the original CCT model, the resulting $x_P$ is then sent to a classification layer [12].
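The Sequence Pooling step can be sketched in a few lines of PyTorch. The module below is our own minimal formulation of Eq. 6 and the weighting $x_P = x_S \times x_E$, not the reference implementation; names and shapes are assumptions.

```python
import torch
import torch.nn as nn

class SeqPool(nn.Module):
    """Sequence Pooling: a linear layer g produces one attention weight per
    sequence position (Eq. 6); the softmax-weighted sum replaces a [class] token."""

    def __init__(self, dim: int):
        super().__init__()
        self.g = nn.Linear(dim, 1)  # g() in Eq. 6

    def forward(self, x_e: torch.Tensor) -> torch.Tensor:
        # x_e: (batch, seq_len, dim), the output of the last encoder block
        x_s = torch.softmax(self.g(x_e), dim=1)        # (batch, seq_len, 1), Eq. 6
        x_p = (x_s.transpose(1, 2) @ x_e).squeeze(1)   # x_P = x_S x x_E
        return x_p                                      # (batch, dim)

# Example: 105 positions (e.g., a flattened 7x15 convolutional output), dim 64.
pooled = SeqPool(64)(torch.randn(8, 105, 64))
print(pooled.shape)  # torch.Size([8, 64])
```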
In the BERT model, as well as in the original ViT, an additional [class] token is added to the input sequence. This additional token accumulates information about the actual input sequence due to the self-attention [4], and the corresponding hidden state is used for classification [4], [26]. By pooling over the sequences, this additional token is not necessary with the CCT [12].
The version of the CCT used in our work consists of two 2D-convolutional layers with 128 feature maps in the first convolutional layer and 64 in the second, an embedding dimensionality ($d_k$) of 64, 5 Transformer blocks, and 5 attention heads. For the purpose of this work, we send the output of the Sequence Pooling layer ($x_P$) to an additional MLP block with 1,800 neurons in the hidden layer and a softmax output layer. The softmax function predicts a probability score for each item in the log sequence and for each possible word in the vocabulary. The network thus predicts an n × v-dimensional matrix, where n is the sequence length and v the vocabulary size of the tokenization. This leads, for example, to 2,571,803 trainable parameters for a convolutional kernel size of 4 × 4. In the experiments, we vary the kernel size of both 2D-convolutional layers to investigate if a bigger spatial resolution, and thus more contextual information, can lead to a better performance.
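Parameter counts such as the 2,571,803 quoted above can be checked with a short, generic PyTorch helper (applicable to any nn.Module, not specific to our model):

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Sum the sizes of all trainable parameter tensors of a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```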

C. PRE-PROCESSING
Previous work advised not to perform the classical log parsing of the dataset, in order to avoid log parsing errors through out-of-vocabulary (OOV) words and losing contextual information between consecutive log messages [9], [10], [11]. Following this advice, the raw data in our work is only pre-processed in the classical way for natural language processing: special characters and numbers are deleted, and all words are set to lowercase [10], [27], [28]. It has to be mentioned that deleting numbers in log data could disturb the detection of anomalies, depending on the system to be analyzed. However, continuous numbers in particular can lead to a high number of tokens, where each token has only a low frequency of appearance, so they are removed.
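This cleanup step can be sketched as follows; the exact regular expression is our assumption, chosen to match the description above (remove special characters and numbers, lowercase the rest).

```python
import re

def clean_log_line(line: str) -> str:
    """NLP-style cleanup: drop everything except letters and whitespace, lowercase."""
    line = re.sub(r"[^A-Za-z\s]", " ", line)        # remove special characters and numbers
    return re.sub(r"\s+", " ", line).strip().lower()  # collapse whitespace

print(clean_log_line("2005-06-03 RAS KERNEL FATAL: data TLB error interrupt!"))
# -> "ras kernel fatal data tlb error interrupt"
```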
Afterwards, single sequences are mapped to numbers (tokens). In this work, log data refers to the complete log file, a log sequence is a single line (a single message) inside the data, a log item is a single token after tokenization, and a log window consists of a number of consecutive log sequences.
To perform the mapping, a pre-trained BERT tokenizer from the Hugging Face library (https://huggingface.com/) is used. It performs a WordPiece [29] tokenization of the words [4], resulting in 451 unique tokens in the BlueGene/L dataset and 920 unique tokens in the Spirit dataset. Additionally, a start token (101) and an end token (102) are added at the beginning and end of each log sequence to signal its start and end to the Transformer. By calculating the length of each sequence (determined by the start and end tokens), we observed that most of the log sequences in both datasets are shorter than 30 items (see Fig. S2a and Fig. S2b), so we use a fixed sequence size of 30 tokens. If a sequence contains fewer than 30 items, it is zero-padded and the first padded zero is replaced with the end token. If a sequence contains more than 30 items, it is truncated and the last token is replaced with the end token. After mapping the complete dataset to tokens, we create log windows with a window size of 15 consecutive log sequences for the BlueGene/L dataset and of 20 consecutive log sequences for the Spirit dataset. The beginning of the next log window is shifted by one, so each log sequence in the training set can be the last log sequence in a window (except the first 14 sequences in the very first log window). In this way, two-dimensional input matrices are created, which preserve the contextual information within one log sequence and also between multiple consecutive log sequences.
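A minimal sketch of the tokenization and window creation, assuming the pre-trained bert-base-uncased checkpoint (the concrete checkpoint is not named in the text) and the BGL values of 30 tokens per line and 15 lines per window:

```python
import numpy as np
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
SEQ_LEN, WINDOW = 30, 15  # 30 tokens per line, 15 lines per BGL window

def encode_line(line: str) -> list[int]:
    """Tokenize one log line to a fixed length of SEQ_LEN ids."""
    ids = tokenizer.encode(line, add_special_tokens=True)  # adds 101 ... 102
    if len(ids) < SEQ_LEN:
        ids = ids + [0] * (SEQ_LEN - len(ids))             # zero padding
    else:
        ids = ids[:SEQ_LEN - 1] + [102]                    # truncate, keep end token
    return ids

def make_windows(lines: list[str]) -> np.ndarray:
    """Build stride-1 windows so that every line (after the first 14) ends one window."""
    tokens = np.array([encode_line(l) for l in lines])
    return np.stack([tokens[i:i + WINDOW] for i in range(len(tokens) - WINDOW + 1)])
```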

D. SPLIT INTO TRAINING, VALIDATION AND TEST SET
Due to the self-supervised nature of our approach, we use two additional hyperparameters to decide whether a data sample is an anomaly or not. The first one is a decision threshold (th), applied to the prediction probability of each single token to decide if it is a valid token or an anomalous token. The second one is the number of anomalous tokens (h) in a data sample required for it to be detected as an anomaly. Together with the size of the convolutional kernel, there are three hyperparameters that influence the anomaly detection performance. To tune these hyperparameters, an additional validation and test set are required to measure the performance values on unseen data [30], [31]. Since neither the BGL dataset nor the Spirit dataset provides a predefined training, validation, or test set, we have to split the dataset ourselves.
After the complete dataset is pre-processed, it is shuffled, and we use 60% for the training set and the remaining 40% to create the validation and test set. A similar split is used in previous studies [7], [10]. To ensure learning only on log sequences representing the normal behavior of the system, we delete from the training set every log window containing a single log sequence originally marked as an anomaly. Due to the pre-processing and the frequent presence of similar entries in the log files, it can be assumed that the dataset is highly redundant and the subsets contain identical samples. To avoid data leakage caused by this [32] and to ensure that the validation and test set contain only unseen samples, we sort out every sample that also exists in the training set. After that, we split the remaining dataset further into a validation and a test set, with 20% used for the validation set and 80% used for the test set. We sort out data samples from the test set if they also exist in the validation set, to guarantee that only unseen data is used for the test set. A schematic overview of how we split the datasets is shown in Fig. 2.
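A sketch of this split-and-deduplicate procedure is shown below. `window_has_anomaly` is a hypothetical per-window label array, and windows are compared byte-wise to detect duplicates; both are our assumptions about the bookkeeping, not taken from the original code.

```python
import numpy as np

def split_dataset(windows: np.ndarray, window_has_anomaly: np.ndarray,
                  rng: np.random.Generator):
    """Return index lists for the training, validation, and test subsets."""
    idx = rng.permutation(len(windows))
    cut = int(0.6 * len(windows))

    # Training set: 60%, keeping only windows without any anomalous log line.
    train = [i for i in idx[:cut] if not window_has_anomaly[i]]
    train_keys = {windows[i].tobytes() for i in train}

    # Validation/test candidates: drop duplicates of training samples.
    rest = [i for i in idx[cut:] if windows[i].tobytes() not in train_keys]
    n_val = len(rest) // 5                      # 20% validation, 80% test
    val, test = rest[:n_val], rest[n_val:]

    # Test set must also not contain duplicates of validation samples.
    val_keys = {windows[i].tobytes() for i in val}
    test = [i for i in test if windows[i].tobytes() not in val_keys]
    return train, val, test
```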
Because the trainable parameters of deep neural networks are initialized randomly, we create ten different training, validation, and test sets for the BGL and Spirit datasets. The median number of samples, together with the standard deviation, in the training, validation, and test sets is shown in Tab. 1.

E. EVALUATION AND TESTING
To evaluate the influence of the different hyperparameters on the anomaly detection performance, we vary them systematically. To do so, we vary the kernel size from 2 × 2 to 6 × 6, test five fixed values for the decision threshold (th), decreasing from 1 × 10⁻³ to 1 × 10⁻⁵, and three values for the h parameter, from one to three. We report the performance values for the different hyperparameter configurations obtained on the validation set.
To allow for a fair comparison among different approaches, we report the performance values obtained on the unseen test set, which is involved neither in training nor in hyperparameter optimization. As the decision threshold operates on the predicted occurrence probability of the tokens, we assume that the anomaly detection performance depends on the exact adjustment of this threshold. Because an incremental change of the threshold yields only a rough estimate of the best value, we additionally determine the decision threshold automatically using the Precision-Recall curve [33], [34]. Since the network predicts an occurrence probability for each token in the last log line, we assign the minimum occurrence probability of a sample as its score for the Precision-Recall curve. After calculating the F1-Score for the different threshold values, we choose the threshold with the highest F1-Score as the decision threshold for the evaluation set. To calculate the Precision-Recall curve, we use the scikit-learn package [35]. While the determination of the best threshold value changes, the other two hyperparameters are tested as before. The hyperparameters that lead to the highest F1-Score on the validation set are used, and the performance values obtained on the test set are reported.
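A sketch of this automatic threshold selection with scikit-learn, assuming each sample is scored by the minimum predicted token probability of its last log line (the sign handling is our assumption about how the scores map onto the curve):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_threshold(min_token_probs: np.ndarray, is_anomaly: np.ndarray) -> float:
    """Pick the decision threshold th that maximizes F1 on the validation set."""
    # A lower minimum probability means "more anomalous", so negate to get a score
    # that increases with abnormality, as precision_recall_curve expects.
    precision, recall, thresholds = precision_recall_curve(is_anomaly, -min_token_probs)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    # precision/recall have one more entry than thresholds; align before argmax.
    return -thresholds[np.argmax(f1[:-1])]  # map the score back to a probability threshold
```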

F. TRAINING
In the training phase, the network should learn the context in which a token appears, in a self-supervised manner. To this end, we adapt the masked training task originally used for the BERT Transformer [4]. We mask 20% of the tokens in the last sequence (except the start, end, and zero tokens) in each log window with the masking token (103) (Fig. 3 shows a simplified illustration of the training), and the training objective is to predict the masked tokens. This procedure is considered a self-supervised training scheme, as it uses automatically generated labels and does not rely on human annotations [36]. It has been suggested by other Transformer-based approaches that masked learning is sufficient to learn contextual information [8], [9], [11].
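A minimal sketch of this masking step, assuming the token ids mentioned above (0 for padding, 101/102 for start/end, 103 for [MASK]):

```python
import numpy as np

SPECIAL = {0, 101, 102}  # padding, start, and end tokens are never masked

def mask_last_line(window: np.ndarray, rng: np.random.Generator, ratio: float = 0.2):
    """Mask 20% of the ordinary tokens in the last log line of a window."""
    masked = window.copy()
    last = masked[-1]
    candidates = [i for i, t in enumerate(last) if t not in SPECIAL]
    if not candidates:
        return masked, np.array([], dtype=int), np.array([], dtype=int)
    n_mask = max(1, int(ratio * len(candidates)))
    targets = rng.choice(candidates, size=n_mask, replace=False)
    labels = last[targets].copy()  # training targets: the original tokens
    last[targets] = 103            # [MASK]
    return masked, targets, labels
```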
The network is trained with the Adam optimizer with weight decay [37], using the cross-entropy loss as the objective function. To deal with the imbalanced appearance frequency of sequence lengths (see Fig. S2), we weight the training samples depending on the appearance frequency of the length of the last log sequence. The weight of one sample ($W_s$) is given in Eq. 7, where $N$ is the number of all samples in the training set, $L_s$ is the number of samples with the same sequence length as the current sample, and $L$ is the number of all different sequence lengths in the training set:

$$W_s = \frac{N}{L \cdot L_s} \tag{7}$$

Sample weighting is only used during training. A pseudocode description of the training is shown in Algorithm 1. We trained the network for 100 epochs, as the loss values saturate by then (see Fig. S3 and Fig. S4).
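The sample weighting of Eq. 7 can be sketched as follows; this is our reading of the equation, and `last_line_lengths` is a hypothetical list holding the length of the last log sequence of each training window:

```python
from collections import Counter

def sample_weights(last_line_lengths: list[int]) -> list[float]:
    """Eq. 7: W_s = N / (L * L_s), so windows with a rare last-line length get a larger weight."""
    n = len(last_line_lengths)           # N: all samples in the training set
    counts = Counter(last_line_lengths)  # L_s: samples sharing each length
    num_lengths = len(counts)            # L: number of distinct lengths
    return [n / (num_lengths * counts[l]) for l in last_line_lengths]
```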
The initialization of the weights and biases in a deep neural network can influence the final performance of the network, as well as the speed of convergence [38]. To enable fast convergence, we initialized the weights in the convolutional subnetwork with randomly chosen weights from a normal distribution based on the initialization proposed by He et al. [39]; all other parameters were chosen randomly from a uniform distribution as proposed by Glorot and Bengio [40]. To ensure that the observed performance is not only caused by the chosen weights, we train each network configuration multiple times with differently initialized weights and biases. The network is trained directly on the log window samples as described above. Neither pre-training nor fine-tuning was used.

G. ANOMALY DETECTION
To identify an anomaly, we assume that the network will learn the context in which a single token of the log sequence appears. If a normal token appears in the right context, as is the case under normal system conditions, the network should be able to predict its appearance with high certainty. If a token appears in the wrong context, as is the case for an anomaly, the network should not be able to predict it. Therefore, we use the softmax output of the network to determine the probability that the appearance of a token is correct. We present the log windows of the respective dataset and receive a probability score for each token in the last sequence. To create the predicted log sequence, we initialize it with zeros and iterate through the probabilities of each token in the last sequence. If the corresponding probability is below a specific threshold th, the token is considered an anomalous token and is set to −1. Otherwise, the original token is set in the prediction sequence. For a complete log sequence to be classified as an anomaly, there must be at least h anomalous tokens in the predicted sequence (Fig. 4 depicts the evaluation process). A pseudocode description of the evaluation is shown in Algorithm 2.
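A sketch of this decision rule as we read it from Algorithm 2; the default values for th and h follow the best BGL configuration reported in the results:

```python
import numpy as np

def is_anomaly(token_probs: np.ndarray, tokens: np.ndarray,
               th: float = 0.505e-3, h: int = 1) -> bool:
    """token_probs: (seq_len, vocab) softmax output for the last log line;
    tokens: (seq_len,) observed token ids of that line."""
    # Probability the network assigns to each observed token in its position.
    prob_of_observed = token_probs[np.arange(len(tokens)), tokens]
    # Tokens below the decision threshold are marked as anomalous (-1).
    predicted = np.where(prob_of_observed < th, -1, tokens)
    # The sequence is an anomaly if at least h tokens were marked.
    return int((predicted == -1).sum()) >= h
```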
Evaluation Metrics: After each sequence is classified as an anomaly or not, we evaluate the performance of the network using the common metrics Recall, Precision, and F1-Score [41]. These metrics are computed as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad \mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where TP refers to the number of true positive samples, FP to the number of false positive samples, and FN to the number of false negative samples. Due to the strong imbalance between normal and abnormal samples, we report the values for both classes (normal and anomaly).
To support our observations on the influence of different convolutional kernel sizes on the performance, we conduct the Friedman test [42] on the models with different kernel sizes for the F1-Score, Recall, and Precision values. Due to the ongoing discussion about correct thresholds for determining statistical significance [43], [44], [45], we did not define a specific threshold and only report the raw p and χ² values.

III. RESULTS
We present the performance of the CCT on the log anomaly detection task for two datasets: Blue Gene/L (BGL) and Spirit [15]. We evaluate the ability of the convolutional kernel to process the information of neighboring tokens by varying the kernel size from 2 × 2 to 6 × 6. Further, we investigate how the recognition performance is influenced by the detection threshold (th) and the minimum number of anomalous tokens (h). Finally, we report the performance values on the test set, obtained via hyperparameter tuning on the validation set.

A. EVALUATION OF ANOMALY DETECTION 1) VARYING DECISION THRESHOLD AND KERNEL SIZES
To evaluate the influence of the convolutional kernel size, the decision threshold (th), and the number of anomalous tokens (h) on the anomaly detection of the CCT, we varied them systematically and evaluated the different combinations on the validation set. Our results for five fixed thresholds show that, with a kernel size of 3 × 3, the network already achieves good overall performance on the BGL dataset, with a higher standard deviation of the Recall values for very low thresholds (Fig. 5). The smaller kernel size of 2 × 2 leads to a decrease of the Precision as well as the Recall values. This suggests that the smaller kernel size does not encapsulate enough contextual information. Bigger kernel sizes (4 × 4, 5 × 5, and 6 × 6) also lead to promising results, but with higher standard deviations and more outliers. Especially at the lowest threshold value (th = 10⁻⁵), the standard deviation of the Recall values increases, indicating an increase in the false negative rate. A lower threshold reduces the probability that a token is flagged as anomalous: the number of falsely detected anomalies decreases (leading to a high Precision score), but the number of missed anomalies increases (a higher number of false negatives).
On the Spirit dataset, however, the model with the 4 × 4 kernel shows the best performance of all kernel sizes (Fig. 6). This suggests that, for the bigger kernels, misleading context information is processed by the network, and it demonstrates that the amount of useful contextual information can vary from dataset to dataset. As on the BGL dataset, a lower threshold also leads to a decrease of the Recall value. If the threshold is lowered, the probability score of a token must fall further before it is classified as a false token, reducing the number of false positives and increasing the Precision value, but also increasing the number of false negatives and decreasing the Recall value.
Because a single token with a low probability is not always an indicator of abnormal behavior, we varied the minimum number of tokens in a single log sequence that may have a low probability (the h value). By increasing this number, the false positive rate should decrease. Our results show a similar behavior for all convolutional kernel sizes and for both datasets. With a higher h, the number of false positives decreases and the Precision increases. However, the number of false negatives increases, leading to a decrease in the Recall and F1-Score (Fig. 7). The Precision, Recall, and F1-Score values for all th and h pairs can be found in the supplementary material (see Fig. S5 and Fig. S6).

2) EVALUATION SCORES
Overall, the best evaluation scores on the BGL dataset are achieved with th = 0.505 × 10⁻³, h = 1, and all bigger kernel sizes (see Tab. 2 and the supplementary material for a deeper comparison of all values of th and h). As mentioned above, the model with the small 2 × 2 kernel yields a low Precision score for the anomaly class (86.65), while a kernel size of 3 × 3 increases the Precision score to 99.26, the overall highest value. Due to one heavy outlier in this model, however, the standard deviation for this kernel size is quite high.
In contrast, the median Recall score is over 99.75 for all four models with bigger kernel sizes, with the highest Recall value achieved by the 4 × 4 and the 6 × 6 model (99.96). The overall highest F1-Score is achieved by the 6 × 6 model (99.58). The performance levels of the networks with the 3 × 3, 4 × 4, 5 × 5, and 6 × 6 kernels differ only little from each other, suggesting that for the BGL dataset it is sufficient to encode the context information shared over three consecutive log lines. Since the networks were trained on the same ten training sets, we assume that the high standard deviation values for the 5 × 5 and 6 × 6 models are mainly caused by the random initialization. Applying the Friedman test over the different kernel sizes in the convolutional subnetwork results in p < 0.0007 (χ² = 19.44) for the F1-Score, p < 0.0001 (χ² = 24.64) for Recall, and p < 0.1955 (χ² = 6.05) for Precision. For the Spirit dataset, the best result is achieved with th = 0.7525 × 10⁻³, h = 1, and a 4 × 4 kernel size, with an F1-Score of 97.69 (see Tab. 3). In contrast to the results on the BGL dataset, the lowest Precision value for the anomaly samples (67.282) occurs with the 3 × 3 kernel, while the 2 × 2 kernel reaches a Precision value of 90.94, close to its value on the BGL dataset. Another difference is the decrease in performance for the bigger kernel sizes 5 × 5 and 6 × 6 (F1-Scores of 85.89 and 94.02, respectively). We assume that this lower performance is caused by irrelevant information being processed by the bigger kernels. Applying the Friedman test over the different kernel sizes results in p < 0.0161 (χ² = 12.18) for the F1-Score, p < 0.0026 (χ² = 16.27) for Recall, and p < 0.0010 (χ² = 18.40) for Precision.

B. COMPARISON WITH OTHER TRANSFORMER-BASED MODELS
After evaluating the influence of the single hyperparameters, we compare the performance of the CCT with previously published Transformer-based approaches for anomaly detection. To avoid a bias in choosing the best hyperparameters for a particular dataset and to enable comparability with other approaches, we used the validation set to find the best-performing hyperparameters. While the decision threshold was obtained automatically, we report here the kernel size and h parameter that achieve the highest F1-Score on the validation set, together with the performance values obtained on the test set.
For the BGL dataset, we observed the highest F1-Score on the validation set with a kernel size of 6 × 6 and h = 1, resulting in an F1-Score of 97.0 on the test set (Tab. 4). With this configuration, our approach shows a slightly lower F1-Score than the best-performing supervised approach from Le and Zhang [10], who used 80% of the dataset for training and 20% for testing. Further, our approach shows a higher F1-Score than all other self-supervised approaches, and the highest Recall value of 99.12 regardless of whether supervised or self-supervised training was used. The lower Precision value of 96.39 indicates a higher number of false positive samples (false alarms), while the high Recall reflects a lower number of false negative samples (missed anomalies), in comparison to other methods.
With a convolutional kernel size of 4 × 4 and h = 1, the CCT achieves an F1-Score of 97.69 on the Spirit test set, which is higher than the F1-Scores reported by Le and Zhang [10] and Nedelkoski et al. [7] (Tab. 5). While the CCT achieves the highest Recall score (99.39), its Precision score (96.0) is again lower. Considering that Le and Zhang [10] trained their network in a supervised fashion, this nevertheless shows the potential of our self-supervised approach.

C. ANOMALY INTERPRETATION
Since the rise of deep neural networks, the understanding and interpretability of their outcomes have become very important [46]. This becomes even more important for deep networks analyzing log data of computer systems in critical infrastructures and cybersecurity-related domains [47]. In the presented approach, it is possible to visualize the input log sequence together with the predicted probabilities to identify which part of the log sequence is the probable cause for an anomaly (see Fig. 8).
This visualization can either help to identify the anomaly in the system faster or allow an expert to verify the correctness of the prediction.

IV. DISCUSSION
We have demonstrated how the Compact Convolutional Transformer (CCT) [12] can be used to detect anomalies in log data, using data from two HPC systems, Blue Gene/L (BGL) and Spirit. Furthermore, we have demonstrated how the detection benefits from the combination of convolutional input processing, capturing local information, and the ability of Transformers to encode more global context information. Our results show that already a small kernel size of 3 × 3 raises the performance of the network to a level comparable with previous Transformer-based approaches for the Precision score and even outperforms them on the Recall score. This highlights the importance of the contextual information between different log sequences and how processing them in parallel improves anomaly detection.
On the BGL dataset, the median performance of all kernel sizes, except the 2 × 2 kernel, shows only small differences. This suggests that a bigger kernel could be used, enabling the processing of more complex information and a more compact representation, making it possible to reduce the size of the Transformer encoder part (by reducing the number of encoder blocks, attention heads, or the input dimensionality) and leading to a reduced training time. However, our results obtained on the Spirit dataset suggest that the correct kernel size depends on the dataset, as the highest F1-Score is achieved with a 4 × 4 kernel size, and both smaller and bigger kernel sizes lead to a decrease in performance and to a higher standard deviation between multiple network instances, suggesting a less robust detection.

A. RECENT WORK
With the growing success of Transformer networks in natural language processing (NLP), they have become more interesting for tasks similar to NLP, like anomaly detection in log data [6], [7], [8], [9], [10]. However, the different model approaches make different pre-processing steps necessary, such as log-data pre-parsing, converting strings into tokens, and sorting out redundant data (to name a few). This can make the comparison between approaches difficult. Nevertheless, in the following, we discuss other Transformer-based models for anomaly detection in system log data.
With the HitAnomaly network, Huang et al. [6] presented one of the first approaches using a Transformer-based network. They used a pre-parsing mechanism to create templates and parameters out of the log data. To process the templates and the corresponding parameters, the network is composed of two encoders. Each encoder consists of Transformer blocks to create a latent representation of the templates and the parameters. Both representations together are used by a classifier to predict normal and abnormal log data. With this supervised model, they achieved an F1-Score of 92.1 on the BGL dataset [6] (see Tab. 4). With a Recall of 90.0, their network produces more false negative predictions than our approach, and its Precision value of 95.0 is slightly lower.
Nedelkoski et al. [7] proposed the Logsy network. For the network to distinguish between normal and abnormal log sequences, they used two different types of inputs for training. The first type are normal log sequences from the original dataset to be analyzed. The second type are auxiliary sequences from a completely different dataset. By labeling the first type as ''normal'' and the second type as ''anomaly'', they train the network in a self-supervised manner. A spherical loss function is used to maximize the difference in the latent representation between ''normal'' and ''anomaly'' messages. An interesting feature of such labels is that they make it possible to later train the network with real ''anomaly'' data labeled by experts [7]. However, their F1-Score of 44.0 on the BGL dataset is much lower than ours with a similar training/test split (60% training and 40% test), which is mainly caused by a very low Precision value, indicating a high rate of false positives. For the Spirit dataset, the authors report an F1-Score of 56.0 with the same split. With a training-to-test split of 80% to 20%, the authors report an increase of the Precision but a slight decrease of the Recall, leading to F1-Scores of 65.0 and 62.0 for the BGL and Spirit datasets, respectively. If 2.1% of correctly labeled anomaly data is added to the training set, the Precision increases to 89.0 and the F1-Score to 80.0 on the BGL dataset (values for the Spirit dataset are not reported). This shows the impact of even a small number of correctly labeled samples on the performance of the network.
Guo et al. [8] introduced the LogBERT network. They use pre-parsing to identify the templates in the log messages and feed the tokenized representation into a Transformer encoder. They conserve correlated information between consecutive log messages by creating log sequences, where one log sequence contains templates of multiple log messages. Similar to the BERT Transformer [4], LogBERT is trained on a masked key prediction task, where some randomly chosen tokens in one sequence are masked and the network must learn to predict them correctly. They additionally used a loss function that minimizes the distance of the representations of normal log sequences to the center of a hypersphere. With this, the encoder representations of normal log sequences should be closer to the center of the sphere, while the representations of anomalous log sequences should have a greater distance and result in a higher loss. To identify an anomaly, they mask a random number of tokens in a sequence and predict the masked ones. For each masked item, a candidate set is created, and if the corresponding input token is not in the candidate set, the token is considered an anomaly. If there are too many anomalous tokens in one sequence, the complete sequence is considered an anomaly. With the proposed LogBERT, the authors achieved a Precision score of 89.4 and a Recall value of 92.3 on the BGL dataset.
A similar approach was used by Lee et al. [9] for the LAnoBERT network, which is also based on a BERT-like Transformer and trained on the masked key prediction task. In contrast to the previously mentioned work, they did not perform pre-parsing on the unstructured log data but learned directly on the log sequences, which correspond more to single log messages than to sequences of consecutive messages. To detect anomalies in the test set, every log key in one sequence is masked one after the other to predict the corresponding masked key. For this, they also created a candidate set consisting of the six tokens with the highest probability scores. Assuming that the distributions of the resulting probabilities differ between normal and abnormal log sequences, they calculated an abnormal score based on the probability values of the candidate set. If the abnormal score of a log sequence exceeds a certain value, the sequence is classified as an anomaly. Additionally, they calculated a second error score, based on the loss value of the candidate set, and compared which of the two scores is better suited to detect anomalies. They achieved their highest F1-Score of 87.49 with the abnormal score on the BGL dataset.
A supervised model using a pre-trained BERT model was presented by Le and Zhang [10]. They used the BERT model to create semantic vectors and learned a binary classification task on them. With that, they achieved an F1-Score of 98.0 on the BGL dataset and of 97.9 on the Spirit dataset. This demonstrates the advantage of learning in a supervised fashion. However, creating labels for datasets of a specific real-world task is very time-consuming, which makes self-supervised or unsupervised learning methods more interesting for real-world scenarios.
Another stumbling block in comparing different approaches with each other is how the dataset is split into, and used as, training, validation, and test sets. To measure the final performance of an approach without bias, the performance should be measured on new (unseen) data, which has been used neither for parameter optimization nor to tune additional hyperparameters (like a decision threshold) [30], [31]. This data is called the test set, while the subset used to optimize the weights is called the training set. The validation set is used to tune additional hyperparameters. If the used dataset does not provide an explicit training, validation, and test set, the subsets must be created by hand, as in our study. One way to create the three subsets is to assign samples randomly to one of the three sets. To ensure that the obtained high performance is not caused by luck in assigning the samples to the subsets, we shuffled the complete dataset ten times and split it each time into a training, validation, and test set. An alternative approach would be k-fold cross-validation [41].
Most model approaches with a self-supervised learning scheme require one or more additional hyperparameters to detect an anomaly (such as a decision threshold). Among the self-supervised model papers to which we compare our approach, only Guo et al. [8] mention the usage of a validation set to determine the hyperparameters, without clarifying how the validation set was obtained. Huang et al. [6] and Le and Zhang [10] mention the division of the dataset into a training and test set, but not the usage of a validation set. When splitting the dataset into training, validation, and test sets, it is necessary to avoid data leakage to guarantee that only unseen data exists in the test set. In Le and Zhang [10], the authors sorted the dataset samples in chronological order and took the first 80% for training and the remaining 20% for the test set. A similar procedure was followed by Nedelkoski et al. [7], who sorted the data samples chronologically and evaluated the split ratio between training and test set, without any validation set. While they argued that this division would guarantee only unseen data in the test set, it must be taken into account that entries in log data can be written with a specific frequency. As a consequence, the same sample could appear in both the training and test set. For this reason, we sorted out duplicated samples between the training, validation, and test sets. How sorting out duplicated samples can influence the performance is observable by comparing the Recall, Precision, and F1-Score between the validation set and the test set. On the BGL dataset, the Precision decreases from 99.18 on the validation set to 96.39 on the test set with a kernel size of 5 × 5, indicating that the ratio between true positive and false positive samples shifted. We assume that sorting out reduces the number of detectable anomaly samples, decreasing the number of true positives relative to false positives. It also has to be mentioned that sorting out duplicated samples can introduce a new imbalance between the number of normal and anomalous samples. This is, for example, observable for the BGL dataset, where the number of anomalous samples is roughly 5% of the number of normal samples in the test set, while the validation set is much more balanced. This underlines how important it is to correctly split the dataset for the final evaluation of the model. The different ways to obtain a training, validation, and test set complicate the comparability of different approaches, cast doubt on the numbers reported in the studies, lead to problems in reproducing study results [32], and make it difficult to choose the right method in a practical situation [30].
In the original publications of the other Transformer-based approaches, no information about the number of Transformer blocks or trainable parameters could be found. As a reference, the BERT BASE model [4] consists of 12 Transformer blocks and around 110 million parameters, which can be considered indicative of the size of the Transformer-based approaches discussed here. In contrast, the CCT used in this work consists of two convolutional layers with 128 and 64 feature maps, respectively, 5 Transformer blocks, and 5 attention heads. In total, with the additional prediction head, the complete network consists of only 2.6 million parameters. Although it uses only a fraction of the parameters, the proposed network outperforms all other self-supervised approaches.

B. POSSIBLE IMPROVEMENTS
Critical metrics for the correct evaluation of an anomaly detection system are the Recall and Precision values for the normal and the anomaly class. With regard to the CCT in this work, the number of convolutional layers, Transformer blocks, and MLP blocks was found experimentally. Further improvements could thus be obtained with a more extensive hyperparameter search using optimization frameworks such as Optuna [48] or skorch [49].
Our results on the BGL dataset show that, for a detection threshold (th) from 1.0 × 10⁻³ to 0.505 × 10⁻³ and a minimum number of anomalous tokens of one (h = 1), the Recall value for the anomaly class is over 99.7 for the 3 × 3 and 4 × 4 models. If the threshold decreases further, the Recall score drops while the Precision score increases. A similar behavior is also observable for the other model variants (see Fig. 5). This shows how the balance between Recall and Precision is determined by the detection threshold. We assume that a higher detection threshold would decrease the Precision, due to a higher number of falsely detected anomalies.
Different methods have been proposed to find a suitable decision threshold automatically, using optimization methods such as grid search or Bayesian optimization [50]. If a deep neural network is used for anomaly detection, the loss on the validation set can be used to determine the decision threshold automatically [9]. We tested a method to obtain the decision threshold automatically using the Precision-Recall curve instead of the Receiver Operating Characteristic (ROC) curve, a common method to determine the quality of a binary classifier. It has been argued that the Precision-Recall curve leads to a more robust evaluation of a binary classification task on an unbalanced dataset [51].
To classify a log sequence as an anomaly, the number of tokens predicted as anomalous in a sequence must reach the value h. With an h of one, a single anomalous token already leads to the prediction of an anomaly. This can lead to a high number of false positives, for example, if a single word is not included in the training set. With an h of 2, a slight increase of the Precision values for the Spirit dataset is notable, but also a strong decrease of the Recall value, as well as an increase of the standard deviation (see Fig. 7).
In masked learning, a part of the input is masked out and the network should learn to predict either only the masked parts or the complete (unmasked) input sample. It has been shown that, for 2D images, a masking ratio of ∼75% can be used if the complete unmasked image is predicted [52], due to the high information redundancy in an image. Although the created log windows in our work are similar to 2D images, our network predicts only the last log sequence in a log window (not the complete log window), and the tokens in the last log sequence are not as redundant as in a language prediction task. To evaluate whether a higher masking ratio leads to better performance, we trained the CCT with a 4 × 4 convolutional kernel and a 50% masking ratio on the Blue Gene/L dataset. With the higher masking ratio, the network obtained a Precision value of 94.52, a Recall value of 57.87, and an F1-Score of 74.27 on average over 5 model repetitions at a decision threshold of th = 0.505 × 10⁻³ (see Fig. S7). In comparison to a CCT trained with an identical convolutional kernel size and a masking ratio of 20%, all three performance values decrease.
Our results show that lower Precision and Recall scores are paired with a higher standard deviation. We hypothesize the following reasons: the weights and biases in neural networks are initialized randomly, so the starting point of the optimization is different for every model run. To train the model, the Adam optimizer with weight decay was used [37], with a learning rate of 0.0001 and a weight decay of 0.00001. This was mainly done to improve the regularization of the network, but the values for the learning rate and weight decay were found experimentally and could be fine-tuned further. It has also been suggested that techniques like ''warmup learning'' can increase the stability of the network training [53].
As in previous work, we treat anomaly detection in log data as a natural language processing task [10]. We performed natural language pre-processing by removing special characters and numbers, as is done, for example, in sentiment analysis [54]. Despite the good performance of the proposed method, it is still unclear whether numbers should be removed for anomaly detection in log data: removing the information on how much free space is left on a hard drive or how high the CPU workload is can lead to a high number of false negatives, as a critical system state caused by a very high CPU workload cannot be detected. On the other hand, it can be assumed that floating-point numbers with a low appearance frequency in the original data will lead to a high number of tokens. To reduce the number of possible tokens, we suggest treating the digits before and after the decimal point as individual elements.
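As a sketch of this suggestion, a float token could be split at the decimal point before tokenization (a hypothetical helper, not part of our pipeline):

```python
def split_number(token: str) -> list[str]:
    """Treat the digits before and after the decimal point as separate tokens."""
    return token.split(".", 1) if "." in token else [token]

print(split_number("87.5"))  # ['87', '5']
```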

V. CONCLUSION
In this study, we presented a log anomaly detection network based on the Compact Convolutional Transformer [12]. Our results show that the use of 2D-convolutional layers leads to the successful learning of contextual correlations between log messages without further pre-parsing. This conserves the contextual information and leads, to the best of our knowledge, to the highest F1-Score a Transformer-based network has achieved so far on the Spirit dataset, comparable to the best supervised approaches. Our study demonstrates how the combination of processing spatial information with 2D-convolutional layers and the encoding of broader contextual information via attention can be used for log data analysis.

FIGURE 1. Schematic view of our network architecture. The Compact Convolutional Transformer (CCT) performs convolutional processing on the log window input and sends the corresponding representation into the Transformer encoder to create a latent representation. The convolutional kernels are represented by colored matrices, the corresponding representations by colored squares. Positional embedding does not influence the order of the representations. The CCT uses the standard Transformer encoder architecture (image on the right). The latent representations from the Transformer encoder are used by the MLP head to create the prediction for the last log sequence in the input window. The softmax activation function in the prediction layer gives a probability score for each item in the sequence and for each possible token in the vocabulary.

FIGURE 2. Schematic view of the data split process. We split the dataset into a training, validation, and test set. After shuffling the dataset, 60% is cleaned of anomaly data and used for training. For the validation set, the remaining 40% of the dataset is cleaned of samples identical to those in the training set, and 20% of it is kept. For the test set, the remaining samples are cleaned of samples identical to those in the validation set.

Algorithm 1 Training Algorithm
Input: training data X_train = {x_1, x_2, x_3, ..., x_m} with m samples
1: preprocess X_train
2: remove windows containing anomalous log lines
3: calculate sample weighting W_S (Eq. 7)
4: for e = 0 : Epochs do
5:   X_train = shuffle(X_train)
6:   for each batch X_b in X_train do
7:     for i = 0 : samples in batch do
8:       Y_b.append(last log line of x_i ∈ X_b)
9:       mask 20% randomly chosen tokens in the last log line of x_i ∈ X_b
10:    end for
11:    P_b = CCT(X_b)
12:    Loss_b = CrossEntropy(Y_b, P_b)
13:    optimize CCT based on Loss_b and W_S,b
14:  end for
15: end for

FIGURE 3. Training with masked predictions. The created log windows of size s are tokenized, and 20% of the tokens in the last sequence of each window are swapped with the [MASK] token. The original tokens of the last sequence have to be predicted as the training task. Training happens only on normal log data.

FIGURE 4. Detection of an anomaly. Sequence windows of the dataset are given to the network to predict the last sequence in the window. The softmax output for each token in the last sequence is interpreted as a probability value. If the probability value of a token is below the decision threshold (th), we assume that the token appears in the wrong context and set it to −1. If the number of −1 tokens in one sequence reaches the value h, the sequence is marked as an anomaly.

FIGURE 5. Precision, Recall, and F1-Score on the Blue Gene/L dataset for different threshold values. Columns indicate the kernel size, from a 2 × 2 to a 6 × 6 convolution kernel. Rows indicate the Precision (blue), Recall (red), and F1-Score (green), respectively. The x-axis represents the changing detection threshold. The median is indicated by a black line and a bigger dot, the smaller dots indicate outliers, and the box covers 50% of the data points. A bigger kernel leads to better performance by integrating more context information. Lower values for the detection threshold lead to an increase in Precision due to a smaller number of false positives. When the threshold becomes too small, the number of false negatives increases. The h value is one.

FIGURE 6. Precision, Recall, and F1-Score on the Spirit dataset for different threshold values. Columns indicate the kernel size, from a 2 × 2 to a 6 × 6 convolution kernel. Rows indicate the Precision (blue), Recall (red), and F1-Score (green), respectively. The x-axis represents the changing detection threshold. The median is indicated by a black line and a bigger dot, the smaller dots indicate outliers, and the box covers 50% of the data points. With a 4 × 4 kernel, the network performs best, while both bigger kernels show higher standard deviations. Lower values for the detection threshold lead to an increase in Precision due to a smaller number of false positives. When the threshold becomes too small, the number of false negatives increases. The h value is one.

FIGURE 7. Higher h leads to lower Recall values. Columns indicate the kernel size, from a 2 × 2 to a 6 × 6 convolution kernel. Rows indicate the Precision (blue), Recall (red), and F1-Score (green), respectively. On the right are shown the values for the BGL validation set and on the left for the Spirit validation set. A higher value for h leads to a decrease in the Recall value, due to a higher number of false negatives. The value for the detection threshold th is 0.505 × 10⁻³.

TABLE 4. Precision, Recall, and F1-Score on the BGL dataset in comparison with previously published methods. Score values of our network (CCT) were obtained on the test set. Due to different splits of the dataset into training, validation, and test sets, reported scores from the literature have to be treated with caution, as detailed in the discussion.

TABLE 5. Precision, Recall, and F1-Score on the Spirit dataset in comparison with previously published methods. Score values of our network (CCT) were obtained on the test set. Due to different splits of the dataset into training, validation, and test sets, reported scores from the literature have to be treated with caution, as detailed in the discussion.

FIGURE 8. Visualization of the prediction probability scores for log sequences that are correctly detected as normal (a) or as an anomaly (b). Green indicates a high probability, yellow a moderate probability, and red a probability below the detection threshold (th = 0.505 × 10⁻³). The red-colored words suggest which word in the log sequence causes the prediction as an anomaly. Example sequences are taken from the Blue Gene/L test set.