SDOT: Secure Hash, Semantic Keyword Extraction, and Dynamic Operator Pattern-Based Three-Tier Forensic Classification Framework

Most traditional digital forensic techniques identify irrelevant files in a corpus using keyword search, frequent hashes, frequent paths, and frequent sizes. The hash-based methods rely on Message Digest 5 (MD5) and Secure Hash Algorithm-1 (SHA-1), both of which are vulnerable to hash collisions. Filtering files by frequent sizes leads to imprecise threshold values, which increases the number of irrelevant files that are evaluated. The blacklisted keywords used in forensic search are matched literally and non-lexically, which increases false-positive search results and fails to disambiguate unstructured text. As a result, many extraneous files are carried forward for further investigation, exacerbating the time lag. Moreover, the non-availability of standardized labeled forensic data results in $O(2^{n})$ time complexity during file classification. This research proposes a three-tier Keyword Metadata Pattern framework to overcome these concerns. Initially, a Secure Hash Algorithm-256 (SHA-256) hash is constructed for the entire corpus, along with custom regex and stop-words modules, to overcome hash collisions and imprecise threshold values and to eliminate recurrent files. Then blacklisted keywords are constructed by identifying vectorized words that lie in close proximity, which overcomes the drawbacks of traditional keyword search and reduces false-positive results. Dynamic, forensically relevant patterns based on massive password datasets are designed to search for unique, relevant patterns, identify the significant files, and overcome the time lag. Based on tier-2 results, files are preliminarily classified automatically in O(log n) complexity, and the system is trained with a machine learning model. Finally, when experimentally evaluated, the proposed system was found to be very effective, outperforming the existing two-tier model in finding relevant files by automated labeling and classification in O(n log n) complexity.
Our proposed model could eliminate 223K irrelevant files and reduce the corpus by 4.1% in tier-1, identify 16.06% of sensitive files in tier-2, and classify files with 91% precision, 95% sensitivity, 91% accuracy, and 0.11% Hamming loss compared to the two-tier system.

forensics, cloud forensics, database forensics, multimedia forensics, and mobile forensics [1]. In general, when a system or any digital device is compromised or attacked, it is seized, which is substantiated by taking a bit-by-bit image copy followed by hashing, resulting in twice the size of the actual data. The number of devices seized for analysis is proportional to the number of cases registered, as is the rationale for the pending cases. According to national crime records bureau statistics from 2014 to 2018, there was a four-fold increase in pending cyber cases, with identity theft and transmission of obscene content factoring for a higher percentage [2]. Furthermore, according to the Federal Bureau of Investigation Regional Computer Forensics Laboratories (CFLs) annual report [FBI08], it processed more data than the previous year [3]. These statistics demonstrate that massive data is unavoidable.
Due to the limited availability of forensic resources and the inability to distinguish between relevant and irrelevant forensic files, investigations suffer a significant time delay [4]. Technically, files that do not aid forensic investigations are deemed irrelevant; these include software installation files, operating system-related files, and standard temporary files. Relevant or sensitive files, in contrast, bear on the investigation and reveal personal information such as a person's name, address, password, emails, bank information, and so on. From the perspective of a forensic investigator, it is facile to differentiate the relevant and irrelevant files in millions of RDC (Real Drive Corpus) files.
The National Software Reference Library (NSRL) introduced a known-good hash set with Message Digest (MD5) and Secure Hash Algorithm (SHA-1) methods to eliminate irrelevant files in a corpus [5]. These hash sets are also known as whitelisted hashes and are used to eliminate files matched against known hashes. Hash values of any entity in NSRL using MD5 and SHA-1 are calculated by (1). For example, refer to (1) to evaluate the hash value of a file $F_i$ and $S_i$ in NSRL.
Then the condition to eliminate a matched file is given in (2), where $f_i$ is the predefined hash in NSRL, $s_i$ is the computed hash for new files in RDC, a is a parameter in the rolling hash function larger than | |, and p is a parameter in the rolling hash function that should be a big prime.
The problems with this approach are threefold. First, it is static, and the hash data set must be updated periodically. Second, even though NSRL had identified over 202 million uninteresting hashes globally as of December 2021 [5], millions of irrelevant files remain unidentified and are still processed, consuming expensive computational resources [6]. Third, MD5 and SHA-1 are highly vulnerable to attacks by active adversaries and are not recommended for forensic activities [7], [8]. To address these hashing issues, [9] implemented forensic hash matching using side information and evaluated a denseness index that characterizes hash data by adding file sizes and pre-hashes to reduce NSRL hash comparisons. The denseness index is calculated using (3), where s represents file size and n represents slots. The problem with (3) is that it reduces file comparison by considering side information alone, such as file sizes within a range, thereby resulting in an imprecise threshold value. Frequent hashes, frequent paths, frequent bottom-level pairs, frequent sizes, clustered creation times, contextually irrelevant files, and known irrelevant extension techniques are evaluated by [10] to identify and eliminate the most irrelevant files. According to [10], if $y = H(x)$ where $x \in \mathbb{R}$, $y \in [0, 9]$, and $\lfloor x \rfloor$ is the floor function, then the criterion to eliminate irrelevant forensic files is given in (4). This approach has several problems. First, these techniques are confined to RDC corpora and are not applicable to other drives. Second, the use of MD5 and SHA-1 hashes in this work exposes it to hash collision attacks [11]. Third, due to the function defined in (4), many files with altered extensions could not be validated and eliminated, resulting in false negatives.
To overcome this issue, [6] evaluated a hybrid methodology with redefined parametric threshold values to eliminate far more forensically irrelevant files in RDC, yet the time complexity remains exponentially high. Another way to identify a file's interestingness or relevancy is to consider its metadata properties, specifically the file's magic number. A colossal list of magic numbers for most file signatures, to be used in cross-validating any file, is developed by [12]. A combination of file signature and keyword search is developed based on these file signatures [13]. Since keywords are selected based on the repetition of words in the corpus, many sensitive words that appear only a few times in the corpus go unrecognized. Further, no developments are made in the file signature. To solve this issue, [14] developed a digital forensic toolkit for extracting file metadata and generating a timeline, but this toolkit extracts information based on file attributes rather than on the magic number of a file, so files with altered extensions bypass it. With the same file signatures, [15] analyzed anti-forensic capabilities by modifying a file's magic number and demonstrated that such alteration results in a damaged file and misleads investigators. To solve this issue, [16] developed a forensic toolkit based on a magic number extension checker, but this toolkit is confined to a limited set of pre-defined modules.
Apart from the techniques mentioned above, keyword search, pattern matching, and trained machine learning classifiers have been implemented in the past to find sensitive or relevant files. The first two methods necessitate manual intervention to compile keywords and patterns, culminating in exponential time complexity [17], [18].
The latter approach includes three families of methods. (1) Clustering algorithms that cluster similar documents based on relevancy, date of creation, file size, and Self-Organizing Maps (SOM) [19], [20], [21]. (2) Topic classification models such as Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA), which detect latent topics so that an investigator can flag a file as sensitive or not by inspecting those topics [22], [23], [24]. These approaches work with unsupervised data and extract latent topics in addition to the central theme of a document in a corpus. The major problems are that, even after the topics are extracted, manual intervention is required to decide whether a file is sensitive, and that the extracted topics are unstable [25]. (3) Data classification algorithms such as Naive Bayes, support vector machines, and decision trees, which are used in digital forensics [26], [27] to classify a file as sensitive or not [28], [29]. The problem with these approaches is that they only work on supervised data, which must be labeled according to the forensics domain so that classification can be performed on the labels. Another issue is that one must have extensive knowledge of the forensic domain to train on the data; otherwise a significant loss results.
This research proposes a ''three-tier Keyword Metadata Pattern'' framework to overcome these significant concerns. SHA-256 hash for the entire corpus is constructed with custom regex and stop-words modules to overcome hash collision, approximate threshold values, and eliminate recurrent files in tier-1. In stage 1 of tier 2, the new blacklisted keywords dataset is constructed using the word-to-vector and latent Dirichlet allocation algorithm. The metadata module of stage 2 is proposed to overcome false positive results in an existing system. The unique pattern module at stage 3 is designed to search for dynamic, unique, relevant patterns to identify the significant forensic relevant files and overcome the time lag.
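The tier-1 hashing step can be illustrated with a minimal Python sketch. It assumes the corpus is available as (name, binary stream) pairs and that a set of known irrelevant SHA-256 digests already exists; the function names `sha256_of_stream` and `filter_corpus` are hypothetical and are not taken from the authors' implementation.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def filter_corpus(named_streams, known_irrelevant):
    """Split a corpus into (kept, dropped) file names.

    A file is dropped when its digest is in the known-irrelevant set
    (assumed input) or when an identical file has already been kept,
    which removes recurrent files.
    """
    seen = set()
    kept, dropped = [], []
    for name, stream in named_streams:
        digest = sha256_of_stream(stream)
        if digest in known_irrelevant or digest in seen:
            dropped.append(name)
        else:
            seen.add(digest)
            kept.append(name)
    return kept, dropped
```

Reading in chunks keeps memory use constant for large disk images; streams such as `io.BytesIO` objects or files opened in binary mode can be passed in the same way.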
This paper is organized in the following manner. Section 2 discusses related work, while Section 3 describes the proposed three-tier KMP classifier. Section 4 evaluates the proposed approach and presents the results, and finally, Section 5 summarises the conclusion and effectiveness of the proposed classifier.

II. RELATED WORK
Technology-aided forensic investigation advancements use machine learning, artificial intelligence, and natural language processing. Finding relevant files has piqued the interest of forensic researchers, and some of the methods are discussed in this section. Using standard hash sets from NSRL and virus share to eliminate common irrelevant files has become a traditional approach in Forensic Toolkit (FTK). In forensic corpus reduction techniques, NSRL known-good hashes, Rowe's hash, peka torrent, virus share, and OS forensic hashes are used to implement hybrid filtering techniques that use MD5 and SHA-1 hashes for each entity [6]. The major problem with the above hash dataset lies in the compressed hash function and bitwise function f(p,q,r), as given in Fig. 1. Since p, q, and r are 32-bit words (bits 0-31), the output generated from the four rounds is 128 bits, which is subsequently prone to known differential attacks or hash collision attacks in $2^{64}$ rounds. Furthermore, [6] defines a set of algorithms with redefined threshold values to eliminate uninteresting files. Nevertheless, the most significant issue lies in the function f(x) defined with static parameters as given in (5) and (6). In (5), $\sum_{k=0}^{n} del\_size_i$ is a predefined static function where the threshold values of specific files are initialized. If a file matches its threshold criterion, it is assumed to be irrelevant. For example, the threshold value for a word document is defined as file_ext[w_i] = 4 * 1024 bytes, and for a text file as file_ext[t_i] = 4 * 1024 bytes. To validate the results, we created a word document and a text file containing an 8-digit word, then used Huffman coding to compress the data, which resulted in a document size of 3000 bits and a text file size of 2000 bits, which is 1000 bits less than the existing. Likewise, for the remaining extensions, we found many imprecisely defined threshold values that lead to false negatives.
Another problem exists in $\sum_{i=0}^{n} extlist_{i,j}$ of (6), which predefines some relevant and irrelevant file extensions. Some of the irrelevant extensions predefined in [6] are [.inf, .ext, .bin, .cab, .cfg, .cpl, .cur, .drv], and the relevant extensions are [.dll, .bat, .csv, .doc, .txt, .xls, .rtf, .ppt]. If a file matches an extension in the irrelevant category, it is treated as uninteresting or irrelevant, and any file that matches the relevant category is sent for further processing. The problem with this function is that if a file's header is altered or deleted, this approach fails to identify that file as sensitive, as the new extension might not be defined in their list. As a result, we observed that many significant files were removed from further investigation, raising the question of reliability. Also, as extensions are predefined based on (5) and (6), this work can identify uninteresting files more than interesting files in RDC, as explained above. The shortcomings of these two functions are addressed in our proposed work by the file_ext(D_i) module of the KMPT classifier, where new threshold values are defined after corpus evaluation, and a novel algorithm to detect deleted or altered file extensions is proposed in Tier-2.
Forensic keyword search (FKS), an essential function of any forensic tool, is another approach to identifying relevant files. FKS aims to identify suspicious files in massive data using predefined keywords. For this purpose, the DHS released blacklisted keywords under eight categories that are widely used in CFTs, apart from expert case-specific keywords. However, improperly devised keywords or keywords formulated by experts result in a high false-positive rate [18], [30]. Another problem with DHS keywords is that they are non-lexical; therefore, exact matching bypasses meaningful or related words.
As a result, a pipelined keyword-enhancing technique is implemented in [31] by integrating seed keywords with pipelines. In their work, the authors state that the words most relevant to a particular topic are identified as seed keywords. The major problem lies in determining seed words using (7). For any term that is rarely present in a document (D), (7) reduces to (8). Therefore, if a word (w) is not frequently present in the corpus (C), that word is not treated as a seed word. Another problem lies in the identification of seed keywords using W2V as given in (9), where D is the corresponding data, $D^{0}_{m,a}$ is the initial seed keyword with topic m for level 1, and N = 1, ..., 10.
Equation (9) identifies the top N keywords as seed keywords, which are then labeled manually. The problem is that there is no guarantee that essential words occur most frequently, so many significant keywords are bypassed in this framework. To overcome the above issues, a blacklisted keyword search module based on the ground truth dataset is proposed in tier 2 by integrating the W2V model and the LDA topic modeling algorithm.
The unavailability of labeled datasets is an increasing concern in digital forensics (DF). Before the data is processed, it should be labeled, as most available data is unstructured or unsupervised [32]. Manual labeling was adopted first and is still used in some cases, but it is quite tedious; later, with the help of feature extraction algorithms, unstructured data came to be converted to either semi-supervised or supervised data. As a result, text mining is gaining traction, as it uses NLP to transform unstructured content into structured content [33], [34]. Because many inconsistencies appear in the data during transformation, GapFinder [35] is implemented for this purpose; it primarily focuses on extracting structured data from semi-structured data. The authors of [35] state that they implemented a topic classifier that uses the D2V model for word representation and the SVM model for article classification. The major problem with this approach is that it applies the D2V model specifically to the headlines of articles and classifies the data after vectorization. When we implemented the same technique, applying D2V to headlines alone produced many false negatives in our work; how the existing system produces false negatives is given in (10). For example, consider an article in vector space containing a sequence of words; the context for the word p is given as $P(w_t)$ with window size 2. In the given probability, $Z(w_j)$ is the normalized frequency of occurrence and K is a scale factor. As headlines are unique to each document and do not appear elsewhere, $Z(w_j)$ will always be 1, and the D2V model classifies the article as non-cyber security, resulting in a false negative. This happens because the subject of an article does not necessarily align with its headline, and we identified many such pages.
To address this, we propose KMPT-based D2V, which performs full-text vectorization in tier-2, and an SVM with a linear kernel in tier-3 to reduce false negatives significantly.

III. KEYWORD METADATA PATTERN CLASSIFIER
To address the above shortcomings, we propose a three-tier framework depicted in Fig. 2. Tier 1 consists of data extraction and a pre-processing engine [36]. The pre-processing engine includes tokenization, stop-words, and lemmatization modules so that the input for Tier 2 is cleaned data [33], [37]. Tier 2 includes the KMP classifier, a novel forensic text-relevance classifier system for labeling and preliminary classification of forensically relevant data. Tier 2 identifies and classifies the most relevant files in an RDC. As previously stated, suspicious files can contain any type of personal information, including a person's name, location, account details, user IDs, passwords, SSN, credit/debit card details, MAC address, IP address, email IDs, and so on [38]. Keyword or string search [1], regex search, and hash value search are each used individually as primary means of searching for sensitive files. In this work, however, we combine keyword search, magic file number search [39], and pattern search [40] into a single entity for optimal results. The preliminary classification in this phase is evaluated with the help of the bkw(D_i), file_ext(D_i), and pattern(D_i) modules. The resultant data from tier 2 is vectorized using the D2V model and given as input to tier 3. Tier 3 includes the linear SVM learning classifier that automatically detects and classifies forensically relevant files. The advantages of the proposed system are: better accuracy in finding and classifying forensically relevant files; less manual work during investigations; three-fold evaluation to find suspicious files and avoid false negatives; and better performance evaluation metrics such as precision, recall, F1-score, accuracy, specificity, Hamming loss, and Matthews correlation coefficient (MCC). In tier 1, we enhanced the existing pre-processing engine module, which converts raw data into a machine-understandable format by performing morpho-syntactic analysis and inverted indexing for full-text search [42].
In this work, data acquisition is performed on various sources such as personal computers, mechanical tape drives, and internet sources and then sent to pre-processing engine for data cleaning purposes. In morpho-syntactic analysis, the first step is tokenization, where each word in the corpus is converted into an individual token. After conversion, these tokens are sent to the stop words module custom-tailored for digital forensics.
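To make the pre-processing flow concrete, the following Python sketch tokenizes documents, removes stop words, and builds an inverted index for full-text search. It is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical stand-in for the custom forensic list, the tokenizer is a simple regular expression, and lemmatization is omitted.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; the custom forensic list is not reproduced here.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def clean_tokens(text, stop_words=STOP_WORDS):
    """Tokenize and drop stop words (lemmatization is left out of this sketch)."""
    return [token for token in tokenize(text) if token not in stop_words]


def build_inverted_index(documents):
    """Map each remaining token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in clean_tokens(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every non-stop-word query term."""
    terms = clean_tokens(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

With an index of this kind, a multi-term query is answered by intersecting posting sets instead of rescanning every document, which is what gives full-text search its quick response.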

A. DATASETS DESCRIPTION AND ACQUISITION
Stop words are words common in spoken language; a default list is defined in the nltk library. The stop words that impede the computing process in RDC are identified with the help of inverse document frequency, where $s_i$ is a specific term in the corpus (T), n is the document, and m is the frequency of a word. The process for defining a custom stop-words module is as follows: all the stop words in T that are defined in f_stwords(T, f_s) are removed. Following that, we used lemmatization, which considers a word's context and reduces it to its root word. Even though lemmatization takes longer than stemming, this text normalization technique is highly regarded in the current context of forensic analysis. Finally, in tier-1, after performing morpho-syntactic analysis, we indexed our data using an inverted index algorithm for full-text search with quick response [42]. The pseudo code for the tier-1 model is provided in Algorithm 1.
This section combines blacklisted keywords, metadata, and pattern searches into a single entity. We started with a ground-truth dataset of BKW defined by the DHS and forensic experts for keyword searches. These are the keywords that are presumed sensitive worldwide for surveilling terrorist activities, emails, digital chats, or any unethical or illegal activity. We then created a repository containing file extensions, associated magic numbers, and relevant ASCII codes for metadata search. Finally, we created a set of unique patterns to identify suspicious files at the word level for pattern search. Fig. 3 shows the workflow of tier 2.

1) BLACKLISTED KEYWORD MATCH
The blacklisted keyword (BKW) match technique is used in this research to detect illegal, unethical, or unlawful activities in RDC or on an individual's device. DHS keywords have two significant limitations. First, they are scant in both quantity and diversity. Second, similar words or words in proximity are not considered. For example, "terrorism" is a blacklisted word in DHS, but ISIS, a major terrorist organization, is not mentioned anywhere in the DHS list.
As a consequence, many undefined sensitive keywords may slip through. Moreover, "site" is one of the blacklisted words in the DHS list, yet it is unclear whether "site" refers to a location or a website. This study overcomes both drawbacks by proposing a method for identifying BKW and similar words. We used the W2V model with LDA, a topic modeling algorithm, and the flow for identifying new BKW is illustrated in Fig. 4. The BKW ground truth dataset is constructed with the help of forensic experts, and the entire corpus T is trained with the W2V and LDA algorithms to obtain DHS-relevant keywords with their context. We then transformed the pre-processed data to vectors using the W2V module with the continuous skip-gram model and negative sampling. The vectorized words from W2V are then fed into the LDA(V_i, K) model, as shown in Algorithm 2.
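The idea of extending a blacklist with words that lie close to a seed keyword in vector space can be sketched as follows. This is not the W2V and LDA pipeline itself: the embeddings are passed in as a plain dictionary (in practice they would come from a trained model), the LDA step is omitted, and the function names and the similarity threshold are assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_keywords(seeds, embeddings, threshold=0.8):
    """Return non-seed words whose similarity to any seed meets the threshold.

    `embeddings` maps word -> vector and `threshold` is an assumed cut-off.
    Words are returned from most to least similar.
    """
    seed_set = set(seeds)
    best = {}
    for seed in seed_set:
        if seed not in embeddings:
            continue
        for word, vector in embeddings.items():
            if word in seed_set:
                continue
            score = cosine(embeddings[seed], vector)
            if score >= threshold:
                best[word] = max(score, best.get(word, 0.0))
    return sorted(best, key=best.get, reverse=True)
```

Under this scheme a word such as "isis" can be added to the blacklist when its vector lies near "terrorism", even though a literal search for the seed word would never match it.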
Furthermore, we stringently trained the model to find relevant words in the test data. The output of the LDA model is stored in a central repository and can be compared to any forensic dataset to identify keywords. On RDC, we compared our proposed BKW model with the DHS BKW over 25 topics, and we present results for cyber security and terrorism. Our work identified 16 keywords in the cyber security topic, whereas the existing method could identify only seven keywords. Similarly, for the topic "terrorism," our proposed work identified 22 keywords, whereas the existing work identified 13 words. Even though DHS keywords are the base for any keyword identification in FTK, the major problem is that investigators search these keywords literally. To overcome this issue, we proposed Algorithm 2 to identify new sensitive keywords based on semantics and relevancy. When the proposed BKW identification method is compared to the existing keyword techniques, it is evident that it improves keyword identification by at least 40%. The comparison result is shown below.
This module attempts to recognize and classify files based on changes to their magic number. Each file has its own identification number in a unique hex form. In this module, we first passed our data set (T) to the checker.py module, which provides three types of checking functionality. First, the first 24 bytes, along with the file name and the extension, are extracted from each file and compared to a central repository containing over 25K registered file-type extensions. Second, the same file is queried against a dynamic web server, where the list of file extensions is updated at periodic intervals. Finally, to avoid false negatives, we deployed the XXD tool [43] to confirm the file alteration. Additionally, this module identifies files with incompatible extensions in RDC. For example, the (boot.dl_?) file can be interpreted as boot.dll, which is an essential file in loading the Windows operating system. Fig. 5 shows the approach for identifying suspicious files in terms of extensions. The proposed methodology identified 1.2k files with altered and mismatched extensions.
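The first of the three checks, comparing a file's leading bytes with its declared extension, can be sketched as below. This is not the authors' checker.py: the signature table holds only a handful of well-known magic numbers instead of the 25K-entry repository, the function names are hypothetical, and the web-server query and XXD confirmation are not modeled.

```python
# Small illustrative signature table (well-known magic numbers only).
SIGNATURES = {
    "pdf": [b"%PDF-"],
    "png": [b"\x89PNG\r\n\x1a\n"],
    "jpg": [b"\xff\xd8\xff"],
    "zip": [b"PK\x03\x04"],
    "gif": [b"GIF87a", b"GIF89a"],
}
HEADER_BYTES = 24  # the text states that the first 24 bytes are inspected


def detect_type(header):
    """Return the file type whose magic number prefixes the header, if any."""
    for file_type, magics in SIGNATURES.items():
        if any(header.startswith(magic) for magic in magics):
            return file_type
    return None


def check_extension(file_name, data):
    """Return (declared, detected, status) for one file.

    status is "match", "mismatch" (possible altered extension), or
    "unknown" when no signature in the table applies.
    """
    declared = file_name.rsplit(".", 1)[-1].lower() if "." in file_name else ""
    detected = detect_type(data[:HEADER_BYTES])
    if detected is None:
        status = "unknown"
    elif declared == detected:
        status = "match"
    else:
        status = "mismatch"
    return declared, detected, status
```

A production table would also need extension aliases (for example "jpeg" for "jpg") and container formats such as office documents stored as ZIP archives, which this sketch would report as mismatches.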

3) PATTERN DESIGNED
In existing methods, a traditional technique like regex searches for pre-defined patterns such as credit card numbers, bank accounts, person names, zip codes, email addresses, and so on. A file is labeled as suspicious if any of this information matches. The primary downside of this approach is that the patterns are static and cannot recognize dynamic patterns. These dynamic patterns include dictionary patterns and operator patterns such as emoticons (punctuation, letters, or numbers that represent pictorial icons); emojis [44]; collocation passwords (sequences of words or terms that occur more often than would be anticipated by coincidence) [45]; and character substitutions (replacing characters with numbers or special characters, or converting them to upper case). To address these shortcomings, we propose several patterns, derived from corpus analysis, for detecting user-defined or machine-generated passwords. For this, we gathered password breach incidents from across the globe and created a password dictionary database for password verification. The password dictionary is built from global password breaches and contains over 50 million passwords from 67 leaked databases. Furthermore, after carefully analyzing millions of passwords, we designed novel password patterns to detect new passwords that are not present in data breaches. The proposed patterns are stored in a centralized repository and compared to files in RDC. Upon matching, files are flagged as relevant and sent for further evaluation as per the flow given in Fig. 6.

Algorithm 2 Construction of Sensitive Keywords Using W2V and LDA
Input: where P(W, Z, θ, ϕ, α, β) is the total probability of LDA, T is the corpus, k denotes topics, j is the document, and W is the word.

Furthermore, the file encoding type from the file metadata aids us in identifying sensitive files. Since providing all of the regex patterns is impractical, a few pattern-matching expressions are shown in Fig. 6. We simulated a training dataset incorporating most patterns to validate the accuracy and detection rate of the traditional and proposed techniques. The comparison results are given in Table 2.
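Two of the dynamic pattern families discussed in this subsection, character substitutions and emoticons, can be sketched in Python as follows. The substitution map, the three-word dictionary standing in for the breach-derived password dictionary, and the emoticon expression are all assumptions made for illustration; they are not the patterns of Fig. 6.

```python
import re

# Assumed substitution map: symbol or digit -> the letter it commonly replaces.
SUBSTITUTIONS = str.maketrans(
    {"@": "a", "4": "a", "3": "e", "1": "i", "!": "i",
     "0": "o", "$": "s", "5": "s", "7": "t"}
)

# Hypothetical stand-in for the breach-derived password dictionary.
DICTIONARY = {"password", "letmein", "dragon"}

# Simple western-style emoticons: eyes, optional nose, mouth.
EMOTICON = re.compile(r"[:;=8][\-^']?[)(DPpOo/\\|\]\[]")


def normalise(token):
    """Lower-case a token and undo the assumed character substitutions."""
    return token.lower().translate(SUBSTITUTIONS)


def is_substituted_dictionary_word(token):
    """True when a token becomes a dictionary word only after normalisation."""
    plain = normalise(token)
    return plain in DICTIONARY and token != plain


def suspicious_tokens(text):
    """Return whitespace-separated tokens that look like disguised passwords."""
    return [t for t in text.split() if is_substituted_dictionary_word(t)]


def find_emoticons(text):
    """Return all emoticon-like substrings found in the text."""
    return EMOTICON.findall(text)
```

A static regex for "password" would miss "P@ssw0rd", whereas normalising first lets one dictionary entry cover many disguised spellings. The emoticon expression is deliberately loose and will produce false positives on text such as URLs, so it would need tightening before real use.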
From Table 2, we can see that existing forensic pattern-matching algorithms and classic regex cannot identify math passwords, dictionary patterns, emoticons, emojis, and collocation passwords. In contrast, our proposed forensic pattern-matching algorithm overcomes these shortcomings and yields better results, even though further improvement is needed. In the credit card category, 100% accuracy could not be achieved due to the updated number lengths of 2021 and 2022 Mastercard cards. As new email domains frequently emerge, a 100% email detection rate could not be achieved. Our proposed patterns identify emoticons and emojis with more than 92% accuracy. In the collocation passwords category, the existing approach failed to identify any passwords, whereas our approach identified passwords with 34.6% accuracy. So far, data cleaning, normalization, and preliminary suspicious data identification using the KMP classifier have been performed on RDC. After tier 2, the KMP classifier yields suspicious (label 1) and non-suspicious (label 0) files. The authors labeled the data into two categories based on this classification, and the dataset is prepared for automated machine learning classification. Since tier 3 includes a machine learning classifier, the KMP classifier's output is vectorized using the D2V model, and how the data is trained is given in the next section.
The complete pseudo code for tier 2 is given in Algorithm 3.

Algorithm 3 KMPT_classifier()
Input: processed data
Output: set of forensic relevant files
begin
n = fseek - 1

This section explains how the KMPT result is vectorized and how the data is trained with a machine-learning classifier. Since a machine learning text classifier works with supervised data, the unstructured corpus is transformed into structured data with the help of the KMP classifier, which uses D2V vectorization [46]. Though other models exist, such as BOW [47], TF-IDF, W2V with skip-gram, Continuous-BOW, and Distributed-BOW, D2V is considered suitable in the current forensic analysis context because D2V adds one more vector in space. Furthermore, D2V can be used to acquire document similarities, label representations, and word embeddings. The D2V model with skip-gram is used for vectorization because skip-gram represents words better than the C-BOW model. Now, for a given context word $W_c$ and a given focus word $W_b$, the conditional probability is computed as shown in (11), where $W_c$ is the context word being generated, with c as the index, $W_b$ is the center word with b as the index, $u_c$ represents the context word and $v_b$ represents the center word, $C_o$ represents the corpus, i represents the index, and $V_{oc}$ represents the vocabulary.
Assuming that the context word $W_c$ is generated independently for any center word $v_b$ with window size s, the probability of generating context words given the focus words over $V_{oc}$ is calculated using the maximum likelihood function given in (12), where s is the window size, p is the position of a word, $E_T$ is the total corpus, and θ is the model parameter.
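For orientation, the skip-gram quantities described here can be sketched in Python, assuming the standard softmax form $P(w_c \mid w_b) = \exp(u_c^{\top} v_b) / \sum_{i \in V_{oc}} \exp(u_i^{\top} v_b)$. Because the paper's own equations are not reproduced in this text, that form, the function names, and the dictionary-based vector storage are all assumptions.

```python
import math


def softmax_probability(center, context, center_vecs, context_vecs):
    """P(context | center) under the assumed skip-gram softmax.

    center_vecs maps word -> v (center vector) and context_vecs maps
    word -> u (context vector); the vocabulary is the key set of context_vecs.
    """
    v_b = center_vecs[center]
    scores = {w: sum(a * b for a, b in zip(u, v_b))
              for w, u in context_vecs.items()}
    peak = max(scores.values())  # subtract the maximum for numerical stability
    denom = sum(math.exp(s - peak) for s in scores.values())
    return math.exp(scores[context] - peak) / denom


def log_likelihood(tokens, window, center_vecs, context_vecs):
    """Sum of log P(context | center) over every window position."""
    total = 0.0
    for p, center in enumerate(tokens):
        lo, hi = max(0, p - window), min(len(tokens), p + window + 1)
        for q in range(lo, hi):
            if q != p:
                total += math.log(softmax_probability(
                    center, tokens[q], center_vecs, context_vecs))
    return total


def gradient_wrt_center(center, context, center_vecs, context_vecs):
    """d log P(context | center) / d v_b = u_c - sum_j P(w_j | center) u_j."""
    dim = len(center_vecs[center])
    expected = [0.0] * dim
    for word, u in context_vecs.items():
        prob = softmax_probability(center, word, center_vecs, context_vecs)
        for k in range(dim):
            expected[k] += prob * u[k]
    return [context_vecs[context][k] - expected[k] for k in range(dim)]
```

A stochastic gradient step would move $v_b$ along this gradient scaled by a learning rate. Practical implementations replace the full softmax with negative sampling, since the denominator sums over the whole vocabulary.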
Since the skip-gram model parameters are $v_b$ and $u_c$ for each word in $V_{oc}$, these parameters are learned in the current context by maximizing the likelihood function given in (13). We used stochastic gradient (SG) updates of the model parameters to minimize the loss in (13). To determine the stochastic gradient, the log conditional probability for $v_b$ and $u_c$ is calculated according to (14). Through differentiation with respect to $v_b$ and all other word vectors, the gradient is obtained from (15).
To determine whether the data is relevant or irrelevant, a supervised data classification algorithm is used in the next step. Supervised text classification algorithms such as Multinomial Naive Bayes, Logistic Regression, SVM, and Random Forest are widely used in machine learning. When these models are evaluated, SVM with a linear kernel delivers the best results for text classification, with improved accuracy, precision, recall, and F1-score. Table 8 compares different text classification models. All models are compared on a data set split 70:30 between training and testing data. Equation (16) is used in this work to polarise data into two major classes: suspicious and non-suspicious.
where $b$ is the bias term defining the boundary and $w$ is the weight. In other words, the hyperplane function $h(x)$ for linearly separable classes is calculated using (17), where $w$ is the weight vector, $x$ is the variable, and $b$ is the bias term. The functional margin is taken to be 1 for all support vectors, as mentioned in (17), such that $r_i$ is 1 for all suspicious files and $-1$ for all non-suspicious files, as given in (18). Equations (16), (17), and (18) apply to linearly separable classes; because many of the files in RDC are not linearly separable, these equations are in turn used to classify the n-dimensional data into m-dimensional data, where n > m. Although RDC is multidimensional data with two classes (class 0 and class 1), the linear kernel produced the best results in this work when compared to other kernels. This work is also evaluated using other kernels, namely the Polynomial kernel ($P_k$), the Sigmoid kernel ($S_k$), and the Gaussian Radial Basis Function kernel (RBF), as shown below. This study also compared the KMPT classification model with other kernels in terms of text classification, and the results are shown in Section IV.
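For the Polynomial kernel $P_k$, $b$ is the bias term, $a$ is a constant, and $x_i$, $x_j$ are the variables.
For the Sigmoid kernel $S_k$, $b$ is the bias term, $a$ is a constant, and $x_i$, $x_j$ are the variables.
For the RBF kernel, $\sigma$ is the variance and $|x_i - x_j|$ is the Euclidean distance between the variables.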
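The linear decision rule and the kernels named above can be sketched in a few lines. The forms below are the textbook definitions and are an assumption about what this work uses; in particular the `degree` parameter of the polynomial kernel, the default values, and all function names are hypothetical.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x_i, x_j):
    return dot(x_i, x_j)


def polynomial_kernel(x_i, x_j, a=1.0, b=1.0, degree=3):
    # (a * <x_i, x_j> + b) ** degree; degree is an assumed extra parameter
    return (a * dot(x_i, x_j) + b) ** degree


def sigmoid_kernel(x_i, x_j, a=1.0, b=0.0):
    # tanh(a * <x_i, x_j> + b)
    return math.tanh(a * dot(x_i, x_j) + b)


def rbf_kernel(x_i, x_j, sigma=1.0):
    # exp(-||x_i - x_j||^2 / (2 * sigma^2))
    squared_distance = sum((p - q) ** 2 for p, q in zip(x_i, x_j))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def linear_decision(w, x, b):
    """Sign of h(x) = w . x + b: 1 for suspicious, -1 for non-suspicious."""
    return 1 if dot(w, x) + b >= 0 else -1
```

For example, `linear_decision([1, -1], [2, 0], -1)` falls on the suspicious side of the hyperplane, while `linear_decision([1, -1], [0, 2], -1)` falls on the non-suspicious side.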

IV. RESULTS AND DISCUSSION
The proposed methodology is applied to the T5 corpus, RDC, and documents related to security events gathered by a data crawler, consisting of approximately 1.4 million documents. Table 3 presents a summary of the datasets gathered from various sources. This dataset is passed to the pre-processing engine, which tokenizes the documents, removes unnecessary common words, and identifies the root words. RDC contains over 5K orphaned documents, of which 28 percent are user related. Regex patterns are used in conjunction with the pre-processing engine to identify common patterns such as email addresses, phone numbers, and SSIDs, and to remove noisy content such as hyperlinks and tags. Fig. 7 depicts the workflow of tier-1 and tier-2. The files extracted by the KMPT classifier are saved in a database and annotated with 0 or 1, with 0 indicating an irrelevant file and 1 indicating a relevant file. By revisiting the output of the KMP classifier, the authors went one step further and meticulously annotated the data. The data is vectorized using the D2V model rather than headlines alone, as in [35]. The authors also used the BOW, TF-IDF, and W2V embedding models with the KMPT classifier to vectorize the data. The D2V+KMP model produced the best results for Tier-2, as document-level embedding was used in this study. After the vectorized data is obtained from Tier-2, it is fed into SVM for automated classification of suspicious or interesting files, as SVM is a good text classifier [48]. Different vectorization models are trained with the corpus and tested with RDC to evaluate each model's classification metrics. In this scenario, each vectorized model is experimentally evaluated with all available SVM kernels and compared with our classification model, which is built upon D2V+linear SVM. In this work, the Linear kernel is denoted as L-SVM, the Polynomial kernel as P-SVM, the Gaussian Radial Basis kernel as G-SVM, and the Sigmoid kernel as S-SVM.
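The regex-based part of the pre-processing described above can be sketched as follows. The patterns are deliberately simple and hypothetical; they are not the custom regex module of this work, and they cover only email addresses, phone numbers, hyperlinks, and tags.

```python
import re

# Illustrative patterns only (assumptions, not the patterns used in this work)
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
URL = re.compile(r"https?://\S+")
TAG = re.compile(r"<[^>]+>")


def preprocess(text):
    """Extract common patterns, then strip tags and hyperlinks as noise."""
    emails = EMAIL.findall(text)
    phones = PHONE.findall(text)
    # Remove markup tags first so that a closing tag glued to a URL is not lost
    cleaned = TAG.sub(" ", text)
    cleaned = URL.sub(" ", cleaned)
    # Collapse the whitespace left behind by the removals
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return {"emails": emails, "phones": phones, "text": cleaned}


sample = "<p>Contact bob@example.com or +1 555-123-4567 via https://example.com/login</p>"
result = preprocess(sample)
```

In a fuller pipeline, the cleaned text would then be tokenized, filtered for stop words, and reduced to root words, as the paragraph above describes.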
We compared our KMPT model, vectorized with the BoW model, against L-SVM, G-SVM, and S-SVM classifiers that are also vectorized using the BoW model. Table 4 compares the BoW-based KMPT with the Polynomial (P), Sigmoid (S), and Gaussian (G) SVM kernels. When comparing KMP with other classifiers, this work reports True Positive (TP) and True Negative (TN) results in terms of Precision, Recall, F1-score, Accuracy, Specificity, Hamming Loss, and MCC. The proposed KMPT with the D2V vectorization model outperforms stand-alone models such as TF-IDF, W2V, and BoW. Table 5 provides a detailed comparison of the proposed methodology with L-SVM and other SVM kernels using word-level vectorization. Table 6 compares KMPT to other SVM kernels using the TF-IDF vectorization method. Table 7 compares KMPT to other SVM kernels using vectorization at the document level. Compared to the other SVM kernel classifications, KMPT with the linear SVM kernel yielded the best results across all the vectorization models.

A. PERFORMANCE METRICS IN EVALUATING KMPT
True Positive: The file is correctly classified as a relevant or sensitive category and denoted with binary 1.
True Negative: The file is correctly classified as an irrelevant or non-sensitive category and denoted with binary 0.
False Positive: The file is falsely classified as a relevant category, even though it is an irrelevant category by nature.
False Negative: The file is falsely classified into an irrelevant category, even though it is a relevant category by nature.
Recall or sensitivity: The fraction of actual positive (class-1) files that the model classifies as positive. A higher recall value indicates that fewer relevant files are missed and hence that the model performs better.

$$\text{Recall} = \frac{TP}{TP + FN}$$

Precision: The fraction of files classified as class-1 that actually belong to the class-1 category.

$$\text{Precision} = \frac{TP}{TP + FP}$$

Specificity: The fraction of actual class-0 files that the model correctly classifies as class-0.

$$\text{Specificity} = \frac{TN}{TN + FP}$$

Accuracy: The base metric used to evaluate our model, computed as the number of correct classifications over the total number of classifications.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

F1-Score: The harmonic mean of Precision and Recall. It is an instrumental performance measure, widely used when a model produces high recall and low precision or low recall and high precision. In such scenarios, measuring the model's performance is complicated; the F1-score makes Precision and Recall comparable by using the harmonic mean instead of the arithmetic mean.

ROC curve: A receiver operating characteristic curve is a graph that illustrates a classification model's performance across all classification thresholds by plotting the true positive rate against the false positive rate. It evaluates a classifier's capacity to distinguish between the classes in a balanced classification.

Hamming loss: The Hamming loss (HL) metric is the fraction of misclassified entities, ranging from 0 (best classification) to 1 (worst classification). In its definition, n is the number of training examples, and $Q_{x,y}$ and $P_{x,y}$ are Boolean values for the x-th prediction and the y-th label.
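The metrics defined in this subsection can be computed directly from the four confusion-matrix counts. The sketch below uses the standard formulas; the Hamming loss is written for the single-label binary case, where it reduces to the misclassification rate, and MCC uses its usual definition, which is assumed here. The counts in the example are made up for illustration and are not results from this work.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return the evaluation metrics for a binary confusion matrix.

    No guard is included against empty classes, so every denominator
    is assumed to be non-zero.
    """
    total = tp + tn + fp + fn
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / total
    # Harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    # Single-label binary case: fraction of misclassified files
    hamming_loss = (fp + fn) / total
    # Matthews correlation coefficient, in the range [-1, 1]
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
        "hamming_loss": hamming_loss,
        "mcc": mcc,
    }


# Hypothetical counts, chosen only to exercise the formulas
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
```

In this binary setting the Hamming loss equals one minus the accuracy, which is why the F1-score and MCC are the more informative figures when the two classes are imbalanced.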
In Table 4, we experimentally evaluated our KMPT model based on the BoW vectorization model and compared it with different SVM kernels based on the BoW model. Even though P-SVM identified class-1 entities with 98% accuracy, it identified class-0 with only 54% accuracy. Furthermore, the same model achieved a 100% sensitivity rate in detecting class-0, yet it could identify class-1 with only 21% sensitivity. The results show that our proposed model yielded a balanced classification of class-0 and class-1 in terms of accuracy, sensitivity, and F1-score. As accuracy does not account for class imbalance, considering accuracy alone would be misleading. Therefore, we consider the F1-score and ROC curves as the primary metrics to evaluate our model.
In Table 5, we experimentally evaluated our KMPT model based on the W2V vectorization model and compared it with different SVM kernels based on the W2V model. Even though W2V-based classification led to poor classification results overall, our proposed model based on W2V gave good classification results in terms of accuracy, sensitivity, and F1-score.
In Table 6, we experimentally evaluated our KMPT model based on TF-IDF and compared it with different SVM kernels based on the TF-IDF model. Our model achieved 85% precision, 86% recall, and 86% F1-score, which is better than the other models.
In Table 7, we experimentally evaluated our KMPT model based on the D2V model and compared it with different SVM kernels based on the D2V model. Compared to the previous vectorization models (BoW, TF-IDF, and W2V), our model with D2V yielded the best results, with 91% precision, 95% recall, and 92% F1-score. Since the corpus is imbalanced, the F1-score is a more reliable metric than accuracy for checking model consistency. Finally, the overall proposed system is evaluated with various values of feature_size(X), where X = 40, 30, 20, and 10; performance metrics such as precision, recall, F1-score, and accuracy are calculated and shown in Fig. 8 for feature size f(x) = 0.30. Across the different feature sizes, feature_size(X = 30) yielded the best performance metrics.

In Table 8, BOW+SVM denotes that the tier-1 result is vectorized using the BOW model and labeled according to DHS keywords and existing regex patterns. The resultant data is trained with L-SVM for automated classification, resulting in 74% accuracy. Similarly, TF-IDF+SVM, W2V+SVM, and D2V+SVM denote that the tier-1 result is vectorized according to each model and labeled according to existing DHS keywords and regex patterns. The resultant data is trained with linear SVM for data classification, resulting in 74%, 61%, and 80% accuracy, respectively. KMPT denotes the result of the overall three-tier proposed system, which yields the best precision, recall, F1-score, and accuracy.
Fig. 9 shows the resultant ROC curve for the proposed methodology, and Fig. 10 shows the confusion matrix. The TP and TN counts clearly exceed the FP and FN counts, which indicates the best classification.

B. EVALUATION OF TIME COMPLEXITY
Time complexity is the computational complexity that is often approximated by counting the number of elementary operations executed by the algorithm. In tier 1, the whole corpus is matched against the SHA-256 hash corpus, which contains nearly 10 million hashes. Because sorting is recommended for such an enormous hash corpus, we sorted our corpus, which took $O(n \log n)$ time; this is better than the $O(n^{2})$ of a generic search over unsorted hashes. For tier 2, n is the number of examples, D is the dimensionality of the data (640), and V is the size of the vocabulary (1 million). Training runs in O(n log N), as each operation on the input data has logarithmic time complexity. The running time, or prediction time, for linear SVM is O(k*d), where k is the number of support vectors and d is the total number of data points.
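The benefit of sorting the hash corpus can be illustrated with a short sketch: sorting once costs $O(n \log n)$, after which each membership check is a binary search in $O(\log n)$. The function names and the in-memory list are assumptions made for illustration; they do not describe the storage used in this work.

```python
import bisect
import hashlib


def sha256_hex(data: bytes) -> str:
    """SHA-256 digest of the file content as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def build_index(known_contents):
    """Hash every known file once and sort the digests (O(n log n))."""
    return sorted(sha256_hex(content) for content in known_contents)


def is_known(index, data: bytes) -> bool:
    """Binary search for the digest in the sorted index (O(log n))."""
    digest = sha256_hex(data)
    position = bisect.bisect_left(index, digest)
    return position < len(index) and index[position] == digest


# Toy corpus of three "known" files
index = build_index([b"alpha", b"beta", b"gamma"])
known = is_known(index, b"beta")
unknown = is_known(index, b"delta")
```

A hash set would give constant expected lookup time instead; the sorted list is shown because it matches the sorted-search approach described in the text.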

V. CONCLUSION
The three-tier framework proposed in this work reduced forensic investigation delay by avoiding the evaluation of unwanted files in massive data. The system also solved problems associated with two-tier models, such as identifying and labeling the relevant forensic files. In Tier-1, owing to inverse document frequency, the custom stop-words module identified an additional 248 stop words in the corpus, eliminating 4.1% of the tokens in RDC and accelerating the forensic classification process. In Tier-2, W2V with the LDA model detected 3,709 novel sensitive keywords in RDC and 845 from DHS, totaling 4,554 blacklisted keywords. As a result, 11.48% of the files in RDC are identified as suspicious. The three-stage metadata extension module identified 970 file extensions in addition to the 25K extensions and classified 1.81% of the files in RDC as suspicious. We created 280 unique patterns in addition to the existing patterns to identify the dynamic patterns in RDC. Our unique pattern module identified 34.6% collocation passwords, 92% emoticons, 94.5% emojis, 22% character-substitution passwords, 66.8% dictionary patterns, and 52.4% math passwords, apart from 461 user passwords and 268 machine-generated passwords. Due to this, the share of suspicious files in RDC is extended to 16.06%. From tier 2, the data was labeled as per the KMP classifier, and the linear SVM model was trained on the KMPT classifier output and classified the forensic relevant data with 91% accuracy, 91% precision, 95% recall, and 92% F1-score, with 0.75% MCC relevancy and 0.11% Hamming loss, at quasi-linear complexity.