Self-Attentive Models for Real-Time Malware Classification

Malware classification is a critical task in cybersecurity: it offers insight into the threats that malware poses to the victim device and informs the design of countermeasures. Due to the high throughput of modern networks, real-time malware classification faces the challenge of achieving high classification accuracy while maintaining low inference latency. We first introduce two self-attention transformer-based classifiers, SeqConvAttn and ImgConvAttn, as replacements for the currently predominant Convolutional Neural Network (CNN) classifiers. We then devise a file-size-aware two-stage framework that combines the two proposed models, thereby controlling the tradeoff between accuracy and latency for real-time classification. To assess the proposed designs, we conduct experiments on three malware datasets: the Microsoft Malware Classification Challenge (BIG 2015) and two subsets of the BODMAS PE malware dataset, BODMAS-11 and BODMAS-49. We show that our transformer-based designs achieve better classification accuracy than traditional CNN-based designs. Furthermore, we show that the proposed two-stage framework reduces the average model inference latency while maintaining superior accuracy, thereby fulfilling the requirements of real-time classification.

A major challenge for the anti-malware industry today is the large volume of files constantly being transferred through connected networks. To handle this volume of incoming potential malware without disrupting the operations of the target device, endpoint security systems must efficiently identify and analyze transmitted software binaries in real time [1], [2], [3]. Therefore, achieving high accuracy with low latency is the principal objective of real-time malware classification.

The associate editor coordinating the review of this manuscript and approving it for publication was Alberto Cano.
By methodology, malware classification can be broadly separated into two categories: dynamic and static analysis [4]. Dynamic analysis usually requires the execution of suspicious software within a sandbox environment, with the malicious activities observed at runtime. Although this approach can potentially produce detailed information on malicious activities, the analysis is slow to conduct. The requirement of a sandbox also makes this approach inconvenient in scenarios where real-time remediation is required. Furthermore, some malware can differentiate between sandbox and actual runtime environments, and accordingly behaves benignly during sandbox execution to evade dynamic analysis [5], [6]. Alternatively, static analysis examines signatures (patterns in a software byte sequence) and other static features extracted from software files to deduce malicious behaviour. This approach can be significantly faster than dynamic analysis, as malware execution is not required. However, many static analysis methods require converting the file binary into assembly code, which is nevertheless time-consuming. Our contributions are as follows:
• We propose SeqConvAttn, which applies self-attention to malware byte sequences for classification, with 1D convolution used to shorten the encoding sequence.

• We then propose ImgConvAttn, which applies the Vision Transformer [21] to malware classification over malware images. ImgConvAttn achieves good classification accuracy while significantly reducing inference latency compared with SeqConvAttn.
• We integrate both models into a file-size-aware two-stage framework, in which ImgConvAttn occupies the first stage and SeqConvAttn the second. During inference, the framework checks the size of the binary file and the first-stage classification uncertainty score to decide whether the malware needs to undergo second-stage reclassification. By executing the models on a need-to-run basis, this framework efficiently leverages the different information learnt by the two proposed models to achieve superior accuracy while reducing the overall inference latency.

The rest of the paper is organized in the following manner. Section II reviews relevant related work. Section III presents the designs of SeqConvAttn, ImgConvAttn, and the file-size-aware two-stage framework. Section IV discusses experiment designs and results. Section V offers further insights on SeqConvAttn and ImgConvAttn. Section VI addresses limitations of our designs. Section VII concludes the paper.

Static analysis checks the signatures and other static features derived from the examined file to determine the malware type. Technically, such files could contain binaries, assembly code, or even human-readable programming languages. However, in most situations, only software binaries are readily available. Disassembly of binaries into assembly code can often be time-consuming [24]. To reduce the overall file-to-verdict inference latency, file preprocessing time should also be minimized.
We therefore forego any feature extraction techniques involving assembly code, and focus solely on classification using raw binaries. In this section, we present surveyed literature addressing the problem of malware binary classification. We summarize the surveyed literature in Table 1. Note that throughout the paper, the terms ''binary'' and ''byte sequence'' are used interchangeably, with the former used when referring to the content of malware files, and the latter used when addressing the input to a model.

Intuitively, the content of a binary file can be directly examined as a sequence. Malware detection on byte sequences was first proposed by [7] using the Malconv model. Given a byte sequence, Malconv first performs byte embedding, then forwards the embedded sequence through gated convolution. To handle the length of the binary input, the convolution kernel and stride size are set to 512 bytes, aggressively reducing the resultant encoding length. References [25] and [26] further investigated the application of ResNet50 for malware classification. We note a general trend of employing deep models for malware classification. To achieve low inference latency, inference by deep networks has to be conducted on a dedicated GPU. However, since GPU resources are often not available in many deployment scenarios, our research focuses on CPU inference time. We show in Table 2 that inference by common deep models on CPU results in high per-file latency, which makes them unsuitable for real-time application.

In addition to convolution, some publications also integrated attention mechanisms [37] into the model architecture, helping the classifier to focus on important regions in the malware image. Reference [27] employed an attentional convolutional network to visually identify regions-of-interest on a malware greyscale image. Thereafter, binary sequences corresponding to these regions could be reassembled into assembly code, providing insight on how the malware operates. Later, [29] introduced a similar network, but employed a residual attention mechanism instead.

Aside from greyscale images, some recent works have also proposed generating malware images by alternative means. Reference [39] proposed generating a hash map based on the byte size associated with the opcodes found in the malware assembly script. However, this method requires a disassembly process, which would incur unacceptably high latency for classification. Reference [10] proposed generating a malware Markov image (transition matrix). The Markov image has a dimension of 256 × 256, with each pixel representing the transition probability from one byte value to another.

The extent to which a particular source element contributes to the aggregation process is determined by an attentional weight. Colloquially, the weight is the amount of attention the target pays to the source. Mathematically, this aggregation process is referred to as scaled dot-product attention, defined by [19] as

Attention(X) = softmax((XW_Q)(XW_K)^T / √d_kqv) XW_V    (1)

Note that X ∈ R^(N×d_m) is the initial sequence encoding of length N and encoding dimension d_m. The learnable weights W_Q, W_K, W_V ∈ R^(d_m×d_kqv) project X into, respectively, the query XW_Q, key XW_K, and value XW_V sequences, with d_kqv as the encoding dimension of these sequences.
From the perspective of a target element, the scaled dot-product attention computes its encoding through a weighted summation over the encodings of all source elements in the value sequence, with the attentional weights determined by the similarity between its corresponding query element and every key element. By carrying out the entire self-attention computation as a series of matrix multiplications, all element encodings are computed concurrently. Note that the scaled dot-product attention has a computation complexity of O(N²), scaling quadratically with the sequence length. However, the computations can be parallelized, meaning that given sufficient computation resources, the time complexity can become constant, or O(1).
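As an illustration, the scaled dot-product attention of (1) can be sketched in NumPy. This is a generic sketch with toy dimensions, not the paper's implementation:

```python
import numpy as np

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    """Eq. (1): softmax((X W_Q)(X W_K)^T / sqrt(d_kqv)) X W_V."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v            # query, key, value sequences
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # (N, N) query-key similarities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)             # row-wise softmax: attentional weights
    return w @ V                                   # weighted sum over value elements

rng = np.random.default_rng(0)
N, d_m, d_kqv = 8, 16, 4                           # toy sizes for illustration
X = rng.standard_normal((N, d_m))
W_q, W_k, W_v = (rng.standard_normal((d_m, d_kqv)) for _ in range(3))
out = scaled_dot_product_attention(X, W_q, W_k, W_v)
print(out.shape)                                   # (8, 4)
```

The single (N, N) score matrix makes the quadratic cost in N visible directly.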

To learn the different types of inter-dependencies within the sequence X, multiple scaled dot-product attention heads are employed in parallel. As the weights of each attention head are initialized differently, different heads capture different types of inter-dependency. To combine the information learned by the parallel heads, their outputs are concatenated along the encoding dimension and re-projected to a final encoding. The entire design is referred to as multihead attention, mathematically defined as

MultiHead(X) = Concat(head_1, …, head_H) W_O,  where head_h = Attention_h(X)    (2)

Here, H refers to the number of parallel attention heads, and W_O ∈ R^(H·d_kqv×d_m) is the post-concatenation projection weight. Note that in most transformer designs, the model dimension stays invariant after undergoing multihead attention. Thus, in this paper, d_m = H·d_kqv.
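Multihead attention as in (2) can likewise be sketched; again a generic illustration with toy dimensions, where `attention` reimplements a single scaled dot-product head:

```python
import numpy as np

def attention(X, W_q, W_k, W_v):
    # One scaled dot-product head, as in Eq. (1).
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def multihead_attention(X, heads, W_o):
    """Eq. (2): run H heads in parallel, concatenate along the encoding
    dimension, and re-project with W_O back to the model dimension d_m."""
    concat = np.concatenate([attention(X, *h) for h in heads], axis=-1)
    return concat @ W_o                            # (N, H*d_kqv) -> (N, d_m)

rng = np.random.default_rng(1)
N, H, d_kqv = 8, 4, 4
d_m = H * d_kqv                                    # d_m = H * d_kqv, as stated above
X = rng.standard_normal((N, d_m))
heads = [tuple(rng.standard_normal((d_m, d_kqv)) for _ in range(3)) for _ in range(H)]
W_o = rng.standard_normal((H * d_kqv, d_m))
mh_out = multihead_attention(X, heads, W_o)
print(mh_out.shape)                                # (8, 16)
```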

A conventional transformer architecture is composed of multiple encoder blocks connected serially. Each block contains a multihead attention component followed by a feedforward component, with interjecting residual connections. The feedforward block consists of a ReLU-activated expansion layer followed by a restorative layer, as shown by

FFN(X) = max(0, XW_1 + b_1) W_2 + b_2    (3)

Note that the expansion layer parameters are W_1 ∈ R^(d_m×B·d_m), b_1 ∈ R^(B·d_m), and the restorative layer parameters are W_2 ∈ R^(B·d_m×d_m), b_2 ∈ R^(d_m). Here, B is referred to as an expansion factor.
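A sketch of the feedforward component in (3) with a residual connection around it. All names and sizes here are illustrative, and the layer normalization that conventional encoder blocks also include is omitted for brevity:

```python
import numpy as np

def feedforward(X, W1, b1, W2, b2):
    """Eq. (3): ReLU-activated expansion layer, then restorative layer."""
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(2)
N, d_m, B = 8, 16, 4                               # B is the expansion factor
X = rng.standard_normal((N, d_m))
W1, b1 = rng.standard_normal((d_m, B * d_m)), np.zeros(B * d_m)
W2, b2 = rng.standard_normal((B * d_m, d_m)), np.zeros(d_m)
ffn_out = X + feedforward(X, W1, b1, W2, b2)       # residual connection
print(ffn_out.shape)                               # (8, 16): model dimension preserved
```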

Transformers do not have implicit awareness of the positions of elements in a sequence. To remedy this, positional encoding is added to the initial sequence encoding before it enters the transformer module.

VOLUME 10, 2022

During experimentation, we found that reducing the encoding sequence length by 1D convolution alone is inadequate for decreasing the computation resource usage and incurred latency to an acceptable degree. We found it necessary to further truncate the byte sequence to a preset limit before conducting SeqConvAttn inference. However, overly truncating a byte sequence may remove salient byte sections critical to malware class identification, which could cause misclassification. According to [21], the method of positional encoding used is a learnable position embedding.
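A minimal sketch of the positional-encoding addition step. The embedding matrix here is randomly initialized purely for illustration; in a trained ViT-style model it is a learned parameter, and the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_m = 257, 64                                   # illustrative sequence length and model dim
X = rng.standard_normal((N, d_m))                  # initial sequence encoding
pos_embed = rng.standard_normal((N, d_m)) * 0.02   # learnable parameter in practice
X_in = X + pos_embed                               # element-wise addition before the encoder
print(X_in.shape)                                  # (257, 64)
```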

Recall that the computation complexity of the transformer module is O(N²). Comparing the design of ImgConvAttn against SeqConvAttn, we expect the model feedforward time to be much faster. Specifically, based on the experimental settings later presented in Section IV-C, the ImgConvAttn transformer module input length is N = 257, while the SeqConvAttn transformer module input length is N = 800. However, the malware image generation process itself can incur significant latency. As shown later in Figure 3, the time needed to generate a bigram frequency image scales linearly with the malware binary length. Thus, it is not guaranteed that the overall file-to-verdict inference time of ImgConvAttn will always be faster than that of SeqConvAttn. This information is relevant to the design of the subsequent file-size-aware two-stage framework, as discussed in Section III-D2.

Here, we first explain the design of the standard two-stage framework. Afterwards, we present the improved design implementing the file-size-aware mechanism.

Given that ImgConvAttn avoids the direct handling of long byte sequences, this design should have a shorter feedforward time than SeqConvAttn. This satisfies the low-latency requirement of real-time classification. However, as the two models learn to identify different types of features, classification accuracy could be improved by considering the verdicts from both models. To augment model accuracy while minimizing the overall classification latency, we incorporate both ImgConvAttn and SeqConvAttn into a two-stage framework, as shown by the top diagram of Figure 2. We assign ImgConvAttn as the first-stage classifier, with an expected per-file inference latency of t_1. SeqConvAttn is then assigned as the second stage, with a latency of t_2.
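The bigram frequency image generation step can be sketched as follows. This is an illustrative implementation, and the [0, 1] normalization is our assumption rather than a detail taken from the paper; the single pass over consecutive byte pairs is what makes generation time scale linearly with file size:

```python
import numpy as np

def bigram_frequency_image(data: bytes) -> np.ndarray:
    """Build a 256x256 image where pixel (i, j) counts occurrences of the
    byte bigram (i, j). One pass over the binary => linear generation time."""
    img = np.zeros((256, 256), dtype=np.float32)
    for a, b in zip(data, data[1:]):               # consecutive byte pairs
        img[a, b] += 1.0
    if img.max() > 0:
        img /= img.max()                           # scale counts to [0, 1] (assumed)
    return img

img = bigram_frequency_image(bytes([0x00, 0x01, 0x00, 0x01, 0x00]))
print(img.shape, img[0, 1], img[1, 0])             # (256, 256) 1.0 1.0
```

Unlike a greyscale reshape, a given bigram always maps to the same pixel regardless of the file length, which is relevant to the discussion in Section V.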
To classify a malware binary, ImgConvAttn first conducts an initial classification. The classification uncertainty is then compared against a threshold value υ. If the uncertainty is below υ, the classification process concludes. However, if the uncertainty exceeds υ, the binary file undergoes reclassification by SeqConvAttn. Thus, the framework avoids unnecessarily running the slower SeqConvAttn when ImgConvAttn is sufficiently certain of its classification. Assuming that ImgConvAttn classification is sufficiently certain most of the time, the majority of binary files should only incur an inference latency of t_1, while only a minority incur a latency of t_1 + t_2. In our design, the prediction and classification uncertainty are defined as

C_pred = argmax_c p_c,   U_pred = 1 − max_c p_c    (5)

Here, C_pred and U_pred refer to, respectively, the predicted class and the classification uncertainty, with p_c the predicted probability of class c.

The file-size-aware mechanism preemptively redirects large binary files, whose expected latency fulfills the condition t_1 ≥ t_2, to the second-stage classifier. This design is shown by the bottom diagram of Figure 2. Note that by preemptively redirecting larger malware files to the second-stage classifier, the actual proportion of files undergoing SeqConvAttn classification during runtime will likely be greater than p%. However, the average inference latency should also be reduced.
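The routing logic of both two-stage variants can be sketched as follows. This is a schematic, not the paper's code: the stage classifiers are stand-in callables returning class probabilities, and the uncertainty is assumed to be one minus the maximum class probability:

```python
def argmax(probs):
    return max(range(len(probs)), key=probs.__getitem__)

def classify(x, file_size, first_stage, second_stage, upsilon, size_threshold=None):
    """Two-stage routing. With size_threshold set (file-size-aware variant),
    large files are preemptively redirected to the second stage."""
    if size_threshold is not None and file_size >= size_threshold:
        return argmax(second_stage(x))             # skip the first stage entirely
    probs = first_stage(x)
    c_pred = argmax(probs)
    u_pred = 1.0 - probs[c_pred]                   # uncertainty (assumed 1 - max prob)
    if u_pred > upsilon:                           # too uncertain: reclassify
        return argmax(second_stage(x))
    return c_pred

# Toy stand-ins for ImgConvAttn / SeqConvAttn:
img_conv_attn = lambda x: [0.6, 0.4]               # uncertain first-stage verdict
seq_conv_attn = lambda x: [0.1, 0.9]
verdict = classify(b"...", 1_000_000, img_conv_attn, seq_conv_attn, upsilon=0.2)
print(verdict)                                     # 1 (reclassified by the second stage)
```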

For subsequent discussions, we refer to the file-size-aware design as TwoStage-fsa-p%, again with p% indicative of the selected uncertainty threshold υ determined over a set-aside set.

The original BIG 2015 dataset [22] consists of a ''train'' and a ''test'' partition. However, only the ''train'' set is labelled. Thus, we partitioned the 10,868 labelled malware binaries in the ''train'' set into disjoint train, validation, and test sets. Note that the average size of malware binaries in BIG 2015 is 1.29 MB, with only two outlier cases larger than 3.8 MB (the largest file is 15 MB). The statistics of the partitioned dataset are presented in Table 3.

TABLE 4. Statistics of BODMAS-11 dataset.
In addition, we also composed a BODMAS-49 dataset, consisting of the 49 malware classes that contain at least 100 instances. This dataset is also partitioned into disjoint train, validation, and test sets, as shown by Table 5.

We used three metrics to assess the performance of our models: accuracy, weighted-F1, and (file-to-verdict) latency.
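The class-selection rule for such a subset can be sketched as follows (a toy illustration of the "at least 100 instances" criterion, not the actual dataset-construction script):

```python
from collections import Counter

def select_classes(labels, min_count=100):
    """Return the malware classes with at least `min_count` labelled files."""
    counts = Counter(labels)
    return sorted(c for c, n in counts.items() if n >= min_count)

# Toy label list: class "a" has 3 files, "b" has 1.
print(select_classes(["a", "a", "a", "b"], min_count=3))   # ['a']
```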

The former two metrics measure the correctness of the model predictions. For each design, we report the average accuracy and weighted-F1 score over five repetitions.

Note that while we use a GPU for model training, to address portability issues, only the CPU is used to assess classification latency. As our experiments involve latency assessment, we list the relevant specifications of our test environment in Table 6.
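For reference, the weighted-F1 metric averages per-class F1 scores with weights proportional to each class's support, as the following pure-Python sketch shows (equivalent in intent to scikit-learn's `f1_score` with `average='weighted'`):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 scores averaged with weights equal to class support."""
    support = Counter(y_true)
    total, score = len(y_true), 0.0
    for c in support:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += support[c] / total * f1           # weight by class support
    return score

print(weighted_f1([0, 0, 1, 1], [0, 0, 1, 0]))
```

On imbalanced datasets such as BODMAS-49, this weighting prevents minority classes from being drowned out by the majority classes.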

We assign a training batch size of 50. During training, the same Adam optimizer settings as in the SeqConvAttn experiment are used here. Each model is trained for 100 epochs, with the checkpoint yielding the highest validation accuracy selected as the optimal version.
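The checkpoint-selection rule can be sketched framework-agnostically; the training and validation callables below are placeholders, not the paper's training code:

```python
def train_with_best_checkpoint(train_epoch, validate, epochs=100):
    """Train for a fixed number of epochs, keeping the checkpoint with the
    highest validation accuracy."""
    best_acc, best_state = float("-inf"), None
    for epoch in range(epochs):
        state = train_epoch(epoch)                 # one pass over the training set
        acc = validate(state)                      # validation accuracy of this checkpoint
        if acc > best_acc:
            best_acc, best_state = acc, state
    return best_state, best_acc

# Toy run: validation accuracy peaks at epoch 2, then degrades.
accs = [0.90, 0.95, 0.97, 0.96, 0.94]
state, acc = train_with_best_checkpoint(
    lambda e: f"ckpt-{e}", lambda s: accs[int(s.split("-")[1])], epochs=5)
print(state, acc)                                  # ckpt-2 0.97
```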

The baseline for comparison against ImgConvAttn is the 3C2D model, implemented based on the description provided by [12]. The original design, proposed by [13], is a shallow CNN consisting of 3 convolution-and-max-pooling layers and 2 fully connected layers. Reference [12] added dropouts to the fully connected layers during training to improve model robustness. For each model, we experimented with two image types: bigram frequency and greyscale. We forego using deep models as done by [14], [15], [26], and others: as shown in Table 2, these models incur very high latency, which is too slow for real-time classification and renders comparison against them uninformative.

For the two-stage frameworks, we employ the ImgConvAttn model with bigram frequency malware images in the first stage, and the SeqConvAttn model in the second stage. The uncertainty thresholds υ corresponding to the expected proportion of reclassified files p% are determined on the validation set of each dataset. As experiments on the two-stage framework are also conducted with five repetitions (each time using a different pair of ImgConvAttn-Frequency and SeqConvAttn models), separate υ values are used for each of the five runs. However, the magnitude of υ remains broadly consistent throughout the runs. Thus, we report the averaged υ in Table 8 for reference purposes. Note that for the TwoStage-fsa-p% experiments, the file size threshold is set to 5 MB, since that is where the per-file latency of ImgConvAttn-Frequency and the average latency of SeqConvAttn are about equal, as shown in Figure 3.

FIGURE 3. The relation between file size and inference latency for ImgConvAttn-Frequency on the BODMAS-11 validation set. As the model feedforward time should remain relatively constant, the linear latency increase is inferred to be caused by the bigram frequency image generation process.
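One plausible way to derive υ from validation uncertainties is to take the value above which roughly p% of files fall; this is a sketch, as the paper does not specify its exact calibration procedure:

```python
def calibrate_threshold(val_uncertainties, p):
    """Choose υ so that roughly p% of validation files have uncertainty at or
    above it, and would therefore be reclassified by the second stage."""
    ranked = sorted(val_uncertainties, reverse=True)
    k = max(1, round(len(ranked) * p / 100))       # number of files to reclassify
    return ranked[k - 1]                           # the k-th largest uncertainty

u = [i / 10 for i in range(1, 11)]                 # toy uncertainties 0.1, ..., 1.0
print(calibrate_threshold(u, p=20))                # 0.9
```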

From the results in Table 7, the following general observations can be made.

Among the image-based classifiers, ImgConvAttn is demonstrated to be superior to the 3C2D model when given the same image type. Although ImgConvAttn is generally 2-3 ms slower than 3C2D, its average inference latency is nevertheless quite low. Additionally, bigram frequency images are demonstrated to be superior to greyscale images on ImgConvAttn. Note also that the 98.17% accuracy achieved by ImgConvAttn-Frequency is higher than that of SeqConvAttn, while the former also incurred significantly lower latency. This may be explained by the information loss caused by the truncation of the input binary for SeqConvAttn, which potentially removed salient features in the byte sequence beyond the truncation limit.

On BODMAS-11, SeqConvAttn attained the best accuracy and weighted-F1 score of any single-model design. However, the differences in accuracy between SeqConvAttn and the other baselines are quite small. For example, the second-best model, CNN+BiGRU, is only about 0.09% worse than SeqConvAttn. We suspect this is because all sequence-based classifiers likely identified similar features for classification. When compared to Malconv, SeqConvAttn is 0.16% better in accuracy, but its latency is also 14 ms longer. Most likely, the marginal accuracy discrepancy between different models is caused by the truncation of malware binaries. As BODMAS binary files are generally larger than those of BIG 2015, the truncation process removes more salient features, including the distant inter-dependencies within malware binaries that SeqConvAttn is designed to identify.

Among the image-based classifiers, ImgConvAttn-Frequency is the most accurate design. Note that the latency disparity between 3C2D and ImgConvAttn becomes insignificant on BODMAS-11, as the average classification latencies of all models on BODMAS-11 are significantly larger than those on BIG 2015. This is caused by the fact that BODMAS-11 malware binaries are generally larger in size; referring again to Figure 3, this slows down the bigram frequency image generation process. Note that this also causes a significant latency disparity between greyscale and bigram frequency images, likely because greyscale generation time does not scale as strongly with the length of malware binaries.

For both TwoStage-p% and TwoStage-fsa-p%, adjustment of the uncertainty threshold υ results in a tradeoff between accuracy and latency, as both metrics generally increase with a smaller uncertainty threshold (larger p%). However, beyond approximately p = 15, there is a clear effect of diminishing returns, such that further increases in latency no longer correspond to gains in accuracy. This phenomenon occurs because, with smaller uncertainty thresholds, binary files with more confident first-stage classifications are subjected to second-stage reclassification. Since files with higher first-stage confidence are more likely to be classified correctly, reclassification by the second stage becomes increasingly redundant.

Comparing TwoStage-p% against TwoStage-fsa-p%, their classification accuracy scores are similar, with TwoStage-fsa-p% being slightly superior. Furthermore, due to the file-size-aware mechanism, TwoStage-fsa-p% is able to reduce the classification latency by 10 ms.

Some binary files would be misclassified by SeqConvAttn while being correctly (and confidently) classified by ImgConvAttn. The uncertainty-based reclassification process in the two-stage framework thus filters out a portion of these instances. Essentially, the two models somewhat cover each other's weaknesses, causing the accuracy of the two-stage framework to exceed that of either of its component models.

We experimented with BODMAS-49 to determine potential limitations of our designs in an environment with significant class imbalance. Examining the single-model designs, the accuracy and weighted-F1 of ImgConvAttn-Frequency are significantly better than those of the other image-based schemes. However, while SeqConvAttn performs better than most of the baseline sequence-based classifiers, its accuracy merely equals that of Malconv. We attribute this observation to SeqConvAttn being inadequately trained on the minority malware classes, as these classes have an insufficient number of training files. Although SeqConvAttn is able to learn more complex features than Malconv, this advantage cannot be leveraged if there are too few examples to reliably generalize the salient and discriminating features of the minority classes.

Despite the limitations of SeqConvAttn, the proposed two-stage framework nevertheless achieved superior accuracy when compared to all single models. Specifically, we consider TwoStage-fsa-25% the best design. This design achieved an accuracy of 93.42% and incurred a latency of 17.2 ms, which is slightly better than the accuracy of 93.33% and latency of 18.4 ms from Malconv. Note that while TwoStage-20% technically achieved the overall best

Figure 4 displays the gating maps of the respective samples, with the columns corresponding to the sequence elements. We note that it is difficult to visually discern from the gating map the emphasis or suppression of particular information. Unlike attention maps, whose values suppress the information of entire elements, the values of gating maps can independently suppress specific sections of the encoding of a sequence element. This hinders the interpretability of Malconv, as we cannot easily detect the binary sections deemed salient by the model. The best we can discern is the presence of different binary regions, shown by the different textures exhibited on the gating map, which are potentially indicative of the different types of information present in different regions.

We also observed that in most cases, the vertical sections of attention maps, identified by similar highlight densities, tend to correspond to their respective binary regions on the gating map. This suggests that, though by different means, the

The greyscale images, when compared to each other, visually differ greatly despite the samples belonging to the same malware class. This is because the underlying byte sequences from different file instances differ greatly. This difference likely makes acquiring the salient features for classification difficult.
As observed from the corresponding attention maps, 926 the salient patches identified from the three greyscale images 927 have little in common. However, the more fundamental cause 928 for the poor accuracy on greyscale image is two-fold. First, 929 converting from 1D byte sequence to 2D image introduces 930 vertical relationship between bytes that does not exist in the 931 original 1D format. Such relationship could be easily altered 932 by shifting the position of certain byte sections or interjecting 933 additional bytes at some location. Second, the byte value 934 cannot be interpreted as a greyscale value. This is because 935 distinct byte values, such as 12 (0 × 0C) and 255 (0xFF) do 936 not have a comparative relationship that defines one value as 937  information about a particular bigram always occupies the 962 same pixel in the image. Thus, salient information can be 963 captured more easily by ImgConvAttn, leading to higher clas-964 sification accuracy.

966
Regarding the model designs, we identify two limitations of SeqConvAttn that require further investigation. First, the fact that transformer complexity scales quadratically with the input length necessitates the truncation of malware binaries, as otherwise the feedforward time and required computation resources would grow unacceptably large. However, the byte sections lost to truncation may cause SeqConvAttn to misclassify malware files. Thus, it is necessary to develop a process to feed an entire malware binary to SeqConvAttn without truncation, while overcoming the issue of computation complexity. Second, as inferred from the results on BODMAS-49, despite the higher model complexity