Reinforced Transformer Learning for VSI-DDoS Detection in Edge Clouds

Edge-driven software applications, often deployed as online services in the cloud-to-edge continuum, lack adequate protection for services and infrastructures against emerging cyberattacks. The Very-Short Intermittent Distributed Denial of Service (VSI-DDoS) attack is one of the biggest factors in diminishing the Quality of Service (QoS) and Quality of Experience (QoE) of users on the edge. Unlike conventional DDoS attacks, these attacks live in the traffic for a very short time (on the order of a few milliseconds) to deceive users with a seemingly legitimate service experience. In the presence of such attacks, users' demands for ultra-low-latency, high-throughput services deployed on the edge can never be met: the attacks send very short, intermittent bursts of requests towards the target services that impose longer delays on users' responses. To provide protection, we propose a novel and efficient approach for detecting VSI-DDoS attacks using reinforced transformer learning that mitigates the tail-latency and service-availability problems in edge clouds. The assimilation of the transformer with deep reinforcement learning accelerates detection performance under adverse conditions by adapting to the dynamic and most discernible patterns of attacks (e.g., multiplicative temporal dependency, attack dynamism). Extensive experiments with testbed and benchmark datasets demonstrate that the proposed approach is suitable, effective, and efficient for detecting VSI-DDoS attacks in edge clouds, outperforming state-of-the-art methods with 0.9%-3.2% higher accuracy on both datasets.

detection methods.

Transformer learning primarily applies, and performs well, in natural language processing (NLP) and computer vision tasks [10], [11]. A key factor in its success in these areas is how text, images, or videos are represented through representation learning [11]. Transformer models are built on multi-head attention, which helps analyze time-series data because it considers contextual information (past and future), different representation subspaces, and both periodic and non-periodic patterns. The impressive success of transformers inspires us to use a transformer with reinforcement learning to secure edge systems, which remains unexplored. Notably, transformer-based reinforcement learning is known to be unstable and inefficient for downstream applications [12]. Features such as experience replay and multi-head attention are crucial for adapting to dynamic temporal behaviour and discernible patterns in data, inducing contextual information into learning to detect VSI-DDoS attacks in edge clouds. Hence, we propose a transformer-based neural model with a learnable time representation to detect VSI-DDoS attacks on the edge.

Reinforced transformer learning (RTN) is a learning approach in which a transformer-based model is trained in a reinforcement learning environment. It helps model training achieve higher efficacy under multiple settings to detect VSI-DDoS attacks; the transformer is integrated with deep reinforcement learning to employ the said features and mitigate emerging service-targeted attacks in edge clouds. This paper makes the following contributions by combining the requirements of low-rate and VSI-DDoS detection with the capability of autonomy in edge clouds.

1) First, we introduce a transformer-based VSI-DDoS detection approach on the edge with a learnable time representation in its architecture, known as VSI-TN.
2) Second, we introduce a transformer-induced deep reinforcement learning approach, known as VSI-RTN, to make attack detection efficient and autonomous for edge clouds.

3) Third, the integration of the transformer with deep reinforcement learning makes it possible to prioritize learning on context-driven information (e.g., attack dynamism, temporal dependency) for detecting VSI-DDoS attacks under uncertainty.

Organization. The rest of the paper is structured as follows. Section II discusses prior research on transformer and deep reinforcement learning methods for DDoS detection. The proposed system model is reported in Section III, while Section IV presents a detailed experimental analysis. Finally, the conclusion and future work are given in Section V.
to developing mechanisms to counter such attacks early, but this is hard to achieve in edge clouds. Most existing methods were developed for classical DDoS detection based on machine learning [14], deep learning [15], and deep reinforcement learning [13]. Saied [21] proposes a multi-anomaly detection model for cyber threat data. A pretrained transformer variant is used to encode log sequences for learning their structure along with the anomaly types. It employs natural language processing to find cyber threats in system logs, which cannot be used for real-time detection and mitigation of anomalies. Table 1 gives a comparison between existing methods and our proposed methods.

1. The tail latency is defined as the latency of a server's 99th-percentile response, which is the delay that users experience in the worst case.

Many deep reinforcement learning approaches have been developed to detect, protect against, and remain resilient to cyber threats by utilizing experience replay or feedback mechanisms in multiple domains [22]. However, exploring deep Q-learning combined with a transformer for detecting VSI-DDoS attacks remains an open problem. High temporal dependency and dynamic behaviour adopted within a short period of time make VSI-DDoS detection more difficult; the attacks also appear legitimate while targeting multiple services to degrade users' QoS/QoE.

Due to the complex nature of VSI-DDoS attacks (e.g., stealthy, sub-saturating, legitimate utilization of the server's resources, varied data patterns in each slot of an extreme increase in requests), existing detection methods [23] overlook the attacks before they degrade the QoS of web services. For example, a sudden increase of HTTP requests in a short period exceeds the server's queue limit and causes delayed responses to legitimate users. Therefore, a model is needed that captures those patterns to improve detection performance. The transformer plays a vital role in accomplishing such tasks and is advantageous due to its self-attention mechanism. Moreover, to employ the features of deep Q-learning combined with a transformer, we formulate the VSI-DDoS detection problem in terms of learnable time representation, experience replay, and dynamic policy updates for performing detection operations early and efficiently.

Given the VSI-DDoS problem, identifying attacks in services deployed among edge servers is formulated as a classification task with two classes: legitimate and attack. However, multiple categories of attacks exist [2], [8] (e.g., VSI-DDoS vertical, VSI-DDoS horizontal, VSI-DDoS application) to manipulate services at different levels of deployed applications. Therefore, without loss of generality, we assume that $X$ and $Y = \{0, 1\}$ denote an instance space and the set of possible classes with timestamp $t$, where 0 and 1 encode legitimate and attack instances, respectively. Training data are given in the form of a finite set of observations $D = \{(x_t, y_t)\}_{t=1}^{N}$ drawn independently from $p(X, Y)$, i.e., the probability distribution $p$ on $X \times Y$. The goal of detecting VSI-DDoS attacks is to learn a classifier $h$, a mapping $X \rightarrow Y$ that assigns a label to each instance $x_t \in X$. Thus, two instantiations of the classifier $h$ are considered: the transformer ($h_T$) and deep Q-learning with the transformer ($h_{QT}$). Transformer-based models (e.g., BERT [10]) consist of several encoder and decoder layers with multi-head attention.

To solve our problem, we employ the transformer's encoder layer for an intensive and compact feature representation of the input data. We instantiate and train $h_T$ for VSI-TN using multi-head attention layers (as shown in Figure 2), inspired by the self-attention layer of [24]. The input is transformed into three vectors: the query vector $q$, the key vector $k$, and the value vector $v$. The attention is computed as follows [24]:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$
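The scaled dot-product attention referenced above can be sketched in NumPy as a minimal single-head version (batching and masking omitted; shapes are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, as in Vaswani et al. [24].
    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                            # (n_q, d_v) context vectors
```

A multi-head layer runs several such attentions in parallel on learned linear projections of the input and concatenates the results, which is what gives the model its multiple representation subspaces.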
The transformer architecture employs one time-embedding layer (time2vec), three encoder layers, and a classification head placed after the last layer for smooth initiation of the training process.

Second, the Rule Base module is invoked when the DRL policy has found doubtful belief vectors from the classifier module.
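The learnable time representation mentioned above (time2vec) maps a scalar timestamp to one linear component plus several periodic (sine) components; a minimal sketch, where the embedding dimension and weight shapes are illustrative assumptions following the standard time2vec formulation:

```python
import numpy as np

def time2vec(tau, w, b):
    """time2vec of a scalar timestamp tau with weights w and biases b
    (each of length k+1): component 0 is linear, w[0]*tau + b[0];
    components 1..k are periodic, sin(w[i]*tau + b[i])."""
    z = w * tau + b
    return np.concatenate([z[:1], np.sin(z[1:])])
```

In VSI-TN, w and b are learned during training, and the resulting time embedding is concatenated with the input features before the three encoder layers and the sigmoid classification head.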

After assessing the rule-based module, the classifier is updated to improve the reinforced learning process. If no matching rule is found, the request is queued to wait for new rules from the analyst manager (i.e., the next action).

3) Delay the classification task ($a_d$): if the classifier's output is not satisfactory and a similar classification task has already been sent to the Rule Base module, the RL agent verifies the correct classification of the similar task with the Rule Base, followed by a classifier update to produce the expected accuracy.
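The choice among the three actions is made by the DRL policy; a minimal sketch, assuming a standard epsilon-greedy rule over learned Q-values (the exploration scheme itself is not specified in the text and is an assumption here):

```python
import random

def select_action(q_values, epsilon=0.1):
    """Epsilon-greedy choice over the three actions described above:
    a_p (automatic classification), a_c (assign to rule base / analyst),
    a_d (delay the classification task)."""
    actions = ["a_p", "a_c", "a_d"]
    if random.random() < epsilon:
        return random.choice(actions)   # explore
    # exploit: pick the action with the highest estimated Q-value
    return max(actions, key=lambda a: q_values.get(a, 0.0))
```

With epsilon decayed over training, the agent gradually shifts from exploring the rule-base and delay actions to exploiting automatic classification.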

The VSI-DDoS attack classifier (VAC) has three components: a Euclidean distance metric, a memory component, and a transformer model. It estimates similarity scores for a new sample using the distance metric for each class, while the memory component stores already-seen samples. Let $S$ be the set of recently classified samples stored in the classifier, and $S_i \subset S$ be the subset of classified samples from the legitimate ($i = 0$) and attack ($i = 1$) classes (number of classes $k = 2$). Similar to [13], for a new sample $x$, the distance to each class $i$ is measured by

$d_i(x) = \min\big(\min_{z \in S_i} d(z, x),\, d_{max}\big),$

where $d_{max}$ is the maximum distance used and $d(z, x)$ represents the Euclidean distance between samples $z$ and $x$. The belief vector $E_d = \{e_0, \cdots, e_i, \cdots, e_d\}$ is expressed in terms of similarity scores, with $e_i = d_{max} - d_i(x)$. The transformer model is updated independently whenever the DRL policy encounters non-decisive samples, i.e., when it cannot take the automatic classification action $a_p$.
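The belief-vector computation described above can be sketched as follows; capping $d_i(x)$ at $d_{max}$ follows the DeROL-style formulation [13] and the value of the cap is an assumption:

```python
import numpy as np

def belief_vector(x, classes, d_max=10.0):
    """Similarity scores e_i = d_max - d_i(x), where d_i(x) is the
    smallest Euclidean distance from x to any stored sample of class i,
    capped at d_max. `classes` is a list [S_0, S_1] of stored samples
    per class; d_max=10.0 is an illustrative cap."""
    e = []
    for S_i in classes:
        d_i = min((np.linalg.norm(z - x) for z in S_i), default=d_max)
        d_i = min(d_i, d_max)           # cap the distance
        e.append(d_max - d_i)           # high score = close to this class
    return np.array(e)
```

A near-tie between the two scores signals a non-decisive sample, which is when the DRL policy falls back on the rule-base or delay actions.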

The reward function validates a correct automatic classification, with 0 as the legitimate label and 1 as the attack label. The RL agent receives a $-2$ reward for an incorrect classification. The reward for assigning a classification task to the rule base decreases linearly, by a factor of 0.5, with the analyst's present load $L_A(t)$, i.e., $(-0.5) \times L_A(t)$. The reward for delaying a classification task decreases exponentially with each time unit of delay $T_D$; the exact reward is $-2^{T_D/10}$. Accordingly, the Q-value is updated at each time $t$ for every state-action pair as follows.
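A minimal sketch of the reward shaping described above; the reward for a correct automatic classification is assumed to be 0, since the text only specifies the penalty cases:

```python
def reward(action, correct=None, analyst_load=0, delay=0):
    """Reward for the three actions as described in the text:
    a_p: 0 for a correct automatic classification (assumed), -2 otherwise;
    a_c: -0.5 * L_A(t), linear in the analyst's current load;
    a_d: -2 ** (T_D / 10), exponential in the delay."""
    if action == "a_p":
        return 0.0 if correct else -2.0
    if action == "a_c":
        return -0.5 * analyst_load
    if action == "a_d":
        return -(2 ** (delay / 10))
    raise ValueError(f"unknown action: {action}")
```

Note that with these scales, delaying a task for ten time units costs as much as one misclassification, which discourages indefinite delays.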
$Q(s(t), a(t)) \leftarrow Q(s(t), a(t)) + \alpha\big(r(t) + \gamma \max_{a} Q(s(t+1), a) - Q(s(t), a(t))\big)$. Further, we move to the next state-action pair $s(t+1)$ and $a(t+1)$ that maximizes the Q-values seen in the next state and also minimizes the temporal-difference error between the learned value and the current estimated value. Here, the learning rate $\alpha$ is assumed close to zero, i.e., $0 < \alpha < 1$, and the discount factor $\gamma$ is set to 0.5. The loss function for updating the Q-values for each training batch is given below [26].
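The tabular form of this Q-value update can be sketched as follows (function and variable names are illustrative; the full system uses a DQN rather than a table):

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.5):
    """One Q-learning step matching the update above:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    Q is a dict keyed by (state, action); gamma = 0.5 as in the text."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_error = r + gamma * best_next - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q[(s, a)]
```

In the DQN setting, the same temporal-difference target is used, but the squared error is minimized over a training batch drawn from the experience-replay buffer.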
The DRL policy illustrated in Figure 5 receives the following parameters at time t when sample x enters the system: [...]

VOLUME 10, 2022

FIGURE 4. Detailed architecture of VSI-RTN, inspired by DeROL [13], where VSI-RTN uses the Rule Base to accumulate training data for the Model Base to improve real-time model efficacy.

The relevant excerpt of the detection loop reads:

    if action a is a_c then
        send S_ch to the Analyst Manager for correct labelling, VAC update, and training of VSI-TN
        send S_ch to the Sample Scheduler for a further classification attempt
    end if
    obtain reward r(t)
    end while
    end for
    if training phase then
        train DQN using the loss in Eq. (8)
    end if

We conducted experiments with four real-world datasets, including two testbed and two benchmark datasets. The testbed setup and data collection were designed and developed following settings similar to those available in [27]. We configure an edge server with the n-tier web application benchmark RUBiS (i.e., a web server, an application server, and a DB server) to assess our proposed VSI-DDoS detection models.

The 3-tier architecture is followed and deployed on the edge cloud illustrated in Figure 6, with the web application server deployed [...] and accuracy to establish the model's capability for detecting VSI-DDoS attacks. The occurrences of the attack class are rarer than those of the legitimate class, leading to class-imbalance problems (and vice versa, depending on when the data were collected). Hence, we employ AUC as a validation measure to alleviate this problem.

We begin our experiments with a characterization of the data using cumulative density analysis for CPU utilization and tail latency in the UVSI-DDoS-I testbed dataset, as shown in Figure 8. Figure 9 shows the same for CPU utilization and memory usage in the UVSI-DDoS-II testbed dataset. The cumulative difference between legitimate and attack traffic is very close (as seen in the figures), increasing detection difficulty. Figure 10 shows the latency variation of HTTP requests in the presence and absence of VSI-DDoS attacks within the UVSI-DDoS-I dataset. Under normal traffic conditions, latency remains very close to 0; during the attack period, however, it peaks at between 200 ms and 800 ms. We used the Keras library with the TensorFlow backend to implement the proposed VSI-TN and VSI-RTN models.
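For illustration, the window-based inputs fed to these models can be prepared with a simple sliding-window routine; this is a sketch, and labelling each window by its last time step is an assumption:

```python
import numpy as np

def sliding_windows(series, labels, window=25):
    """Build (window, n_features) instances from a multivariate
    time series; window=25 matches the tuned window size reported
    in the hyper-parameter discussion. Each window takes the label
    of its final time step (assumed labelling convention)."""
    X, y = [], []
    for i in range(len(series) - window + 1):
        X.append(series[i:i + window])
        y.append(labels[i + window - 1])
    return np.array(X), np.array(y)
```

Overlapping windows preserve the temporal consistency of the traffic traces, which matters for the very short intermittent bursts that characterize VSI-DDoS attacks.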

We begin with the UVSI-DDoS-I dataset for assessing models with time-representation layers that achieve significant model performance in detecting VSI-DDoS, and iterate for the other datasets. The hyper-parameters of each model are tuned with a grid-search mechanism to obtain optimal model performance. Based on this, we achieve the best results with a sliding window of 25 instances, 12 attention heads, 10 epochs, and a batch size of 32 for the VSI-TN model. The dropout value is set to 0.1, and global average pooling is employed after the encoder layers to prevent model overfitting.

TABLE 3. Hyperparameters of VSI-TN for the UVSI-DDoS-I dataset. In the case of the UVSI-DDoS-II dataset, the sizes of the query, key, and value were reduced to 128 and the number of attention heads was set to 6; the reduced size of the neural net eliminated the overfitting issue caused by the relatively small UVSI-DDoS-II dataset.

The ADAM [32] optimizer was used for our experiments with
'binary-crossentropy' as the loss function and sigmoid as the activation function to obtain an accurate and stable model. The remaining hyper-parameters are given in Table 3.

To examine the behaviour expected from reinforcement settings, we evaluate both the VSI-TN and VSI-RTN models on varying data sizes: 10%, 30%, 50%, 80%, and 100% of both the UVSI-DDoS-I and UVSI-DDoS-II datasets, maintaining temporal consistency. VSI-RTN exhibits learning stability under variable data sizes and data-imbalance ratios (see Figure 16). We observe that VSI-TN does not achieve stable performance with varied data sizes and data-imbalance ratios, as given in Table 6.

Table 5 shows the performance of the proposed models (i.e., VSI-TN and VSI-RTN) using both UVSI-DDoS dataset scenarios. We observe that our proposed models outperform baseline models with 0.9% to 3.2% higher AUC scores. [...] shows the decreasing losses on the UVSI-DDoS-I data, and Figure 14 shows the accumulated rewards along training iterations for two models: the proposed VSI-RTN model and a baseline named RNN-RL, which uses an LSTM instead of a transformer in the reinforcement setting. We report three different runs of loss and rewards for the VSI-RTN model, all showing a similar trend. Compared to the baseline RNN-RL, VSI-RTN achieves higher rewards in fewer iterations. In terms of loss, the baseline model suffers several spikes during training. As shown in Figure 16, the proposed model achieves stable performance even when fed varying amounts of data over time. In Figure 16, Unique Normal RL and Unique Attack RL refer to the normalized total unique normal and attack instances seen by the DRL policy during training.
Unique Normal and Unique Attack show the number of unique normal and attack instances that were sent back to the analyst manager and, in turn, fed to the VSI-TN component for independent training. As we can see, there is steady growth in the AUC score as the data size increases. We also observe that increased data size leads to a steady increase in the unique data received by the DRL policy, while the unique data sent for training VSI-TN initially decreases and eventually remains the same without compromising performance. This illustrates that VSI-TN in reinforcement settings can be trained with fewer instances while still making the expected decisions. This training strategy can eliminate learning instability and data-imbalance problems and train VSI-TN with only relevant training examples; the resultant model is cost-effective and efficient under these constraints.

Figure 15a shows that introducing a high penalty for delayed classification has the maximum effect by progressing model training with lower delay on the UVSI-DDoS-I data, due to the high non-similarity within data instances. A typical case is the fluctuation in performance at the beginning due to wrong classifications; eventually, the model minimizes them. Throughout training, correct classification dominates over requests for labelling from the rule base. Figure 15b shows the results of the same experiment on the UVSI-DDoS-II data, where the training performance of the model fluctuates during the initial steps and eventually minimizes wrong classifications. In addition, classification is requested from the analyst manager in the case of previously unobserved data, due to the RL agent's low confidence in the classifier. During this process, if data similar to already-sent data arrives, the classification task is delayed pending the analyst manager's classification.
In this way, the model does not need to repeatedly send a [...] for UNSW-NB15 and 1358.16 s for CIC-DDoS2019, respectively. The testing time per instance for VSI-RTN remains relatively close across datasets despite increased data size. This analysis shows that the proposed VSI-DDoS detection models perform well on the microsecond scale, implying that the models can improve service availability by controlling these attacks on the edge at a very early stage.

The ROC curves of VSI-TN, along with other baseline methods, for the UVSI-DDoS-I and UVSI-DDoS-II test data are shown in Figure 12. A ROC curve shows the model's ability to differentiate between the target classes in terms of True Positive Rate (TPR) and False Positive Rate (FPR). The proposed VSI-TN outperforms BiLSTM, LSTM, and Gaussian NB, which is also reflected in the Area Under the Curve (AUC) scores reported in Table 5. As a result, VSI-TN achieves stable learning ability, adapts to dynamic and temporal data behaviour, and manages data-imbalance problems when detecting VSI-DDoS attacks in edge clouds.

The topic of his bachelor's thesis was the detection of VSI-DDoS attacks using attention models in edge clouds. He is currently a Visiting Researcher at the Autonomous Distributed Systems Laboratory, Umeå University, Sweden. His research interests include machine learning, distributed systems, and cyber security.

[...] degree in information systems from the University of Engineering and Technology, VNU, Hanoi,