An AI-powered Network Threat Detection System

This work develops a network threat detection system, AI@NTDS, that uses the behavioral features of attackers and intelligent techniques. The proposed AI@NTDS system combines data analysis, feature extraction, and feature evaluation to construct a detection model, which supports a more straightforward strategy by which the operating system or its operators can defend against network attacks. The Linux system interaction information of SSH (Secure Shell) and Telnet sessions is obtained from the Cowrie Honeypot and labeled according to the Enterprise Tactics of MITRE ATT&CK to ensure dataset credibility. The proposed AI@NTDS system has three threat levels, which depend on the attacker’s actions and the user’s risk of damage. Fifty-two features are used to detect the network threat level. These comprise message-based features for all kinds of Linux operating instructions, host-based features for all types of information in the network connection process, and geography-based features that are related to the attacker’s location. The LightGBM, Random Forest, and K-NN algorithms are used to verify the discriminative ability of the custom features. Finally, the detection model that is trained using the best combination of features is used to predict the test dataset. The accuracy of the proposed AI@NTDS system reaches 99%, 95.66%, and 94.08% with the LightGBM, Random Forest, and K-NN algorithms, respectively. The mutual dependencies of features and network threats are evaluated. Results of a performance analysis reveal that the proposed AI@NTDS system has an accuracy of 99.20% and an F1-score of 99.80%. It is superior to existing detection mechanisms, which it outperforms by 4% and 1% in accuracy and F1-score, respectively.


I. INTRODUCTION
THE Internet of Things (IoT) is utilized in various industries, and more IoT devices are being connected to the Internet every day. By 2021, 35 billion IoT devices had been installed worldwide [1]. The global volume of data is increasing exponentially as the IoT grows. Remote control of devices is frequently used because it is convenient and supports resource sharing in the IoT environment. Most IoT devices are based on Linux, and they are remotely controlled using Telnet or SSH. Password authentication is used to protect these remote devices. However, hackers can use brute force to search for passwords in insecure situations. Hackers can break into a system through remote connections such as Secure Shell (SSH) or Telnet. A hacker who enters a control system will perform reconnaissance or download and execute malicious files to obtain system permissions and, ultimately, to steal sensitive information from an enterprise or organization.
Most devices communicate using remote access services that are based on the SSH protocol. Because this protocol encrypts the communication between the SSH client and the server, an attacker who has connected to the server can execute various malicious services.
Venafi, Inc. collects real-world examples of SSH threats [2]. For example, Sony Pictures was hacked in 2014 and SSH keys were stolen, leading to leaks of executive salaries and copies of unreleased Sony movies. The 2019 Kinsing Malware included several shell scripts that download and install, remove, or reinstall various services and programs. The 2020 Kaiji malware detected poorly configured SSH services and performed a brute force attack [3]. The above examples show that SSH attacks involve various behaviors. Therefore, SSH security is critical and user access to remote systems must be carefully monitored. Commands that are executed by a remote connection must be analyzed.
In this study, AI-powered techniques are used to solve the command-based content problem and to design a network threat detection system, AI@NTDS. Since an enormous amount of information is collected daily, manual defense against remote connection threats may lead to irreversible damage. The malicious command dataset for AI model training is collected and organized by the Honeypot. Most importantly, the problem of detecting malicious commands is solved herein.

A. PROBLEM STATEMENT
Many researchers have presented solutions to protect users against the command-line-based threat. The main task that will be addressed in this work is the detection of the hacker's malicious intent; 52 features will be provided for the analysis of the AI model. These include message-based, host-based, and geography-based features.

B. CONTRIBUTIONS
This work contributes to the field by developing an AI-powered network threat detection system, AI@NTDS, which has three levels.

II. RELATED WORK
This section reviews the latest SSH-based intrusion systems, techniques, and experiments. Descriptions of the experiments have been published in different scientific articles, and various threats have been detected.

A. THREAT INDICATOR WITH THE HONEYPOT
Daniel et al. [4] found that criminal activity on the Internet is becoming more sophisticated and that traditional information security technologies can barely cope with recent trends in such activity. In their investigation, several Honeypots were combined to form a Honeynet. The Honeynet ran for 222 days and captured 12 million attack attempts. The captured data were examined and evaluated, and the experimental results identified and quantified the dependences and distributions of the data. New threats are constantly emerging, so capturing the features of attacks and analyzing them effectively is essential [5]. Several Honeypot sensors were deployed to monitor and study the attackers' behavior. The Honeypot types were Cowrie, Dionaea, and Glastopf, deployed in Linux host, Windows host, and web application environments. These Honeypots attract various attacks from different environments.
Sanjeev et al. [6] addressed the deployment and maintenance of Honeypots across various IT systems and their intensive resource requirements. Security researchers and security companies use Honeypots extensively because they trap attackers and reveal their tools and strategies. Deep learning-based analysis, inspired by neural networks, was integrated to classify threat events. Jason et al. introduced a tool for evaluating Honeypots [7]. Honeypots are used to capture traces of malicious activity. They can be used to study an attacker's behavior, but they can be challenging to implement and maintain. Their study outlined a complete Honeypot design, conducted experiments on data, and presented the results thus obtained. The design of the evaluation tool was outlined, and the results were provided as quantitative calibration data.

B. ANALYSIS OF ATTACKERS' BEHAVIORS BASED ON SSH SESSIONS
Following the above discussion of Honeypots, this subsection discusses how the collected information is analyzed. The definition and analysis of attacker behavior in Honeypots using previously developed research methods are described.
Esmaeil et al. proposed a Honeypot technique to investigate brute-force SSH attacks on academic networks [8]. The most common attack is the brute-force password-guessing attack that targets SSH, FTP, and Telnet servers. Experimental results demonstrate that preset lists of user names and passwords are widely shared and form the basis of brute-force attacks. Craig et al. [9] used the Kippo SSH Honeypot system to identify the activity in the Honeypot. The system runs on the same hardware and software configuration as above. Data over 75 days were collected as experimental data. An analysis yields the attackers' behaviors and patterns. The experimental results show that the number and range of attacks vary, so the content of the attacks merits further discussion.
Georgios et al. [10] discussed the current state of botnets affecting the Internet of Things and the reasons for the success of attacks. They provided detailed information on the operating principles of malware in the Internet of Things, examined their interrelationships, and proposed preventative strategies against malware. Critical steps concerning the operation and communication of botnets have been proposed, and six sets of features of Mirai botnets have been identified [11]. That study used the above features to secure IoT devices and protect Internet infrastructure from destructive distributed denial-of-service attacks.
Tomas et al. [12] observed botnets and described the behavior of the first two stages of their life cycle, which are initial infection and secondary infection. They identified the behavioral attributes in each stage and designed a model to determine whether a threat is a botnet. They found that some network sessions and credential guesses are easily collected and usable attributes for profiling threat agents. The system in [13] uses a machine-learning algorithm and observes network traffic to detect attacks and find their targets in real time. A prototype, including a graphical user interface, was implemented as a plugin for the popular NfSen monitoring tool. G. Kannan et al. [14] grouped SSH attacks into two types: severe and non-severe. A severe attack is any attack that follows the successful compromise of an SSH server. A non-severe attack is any attack that fails. Their study presents 14 features that are used in the real-time classification of attacks using machine learning algorithms. Tomas et al. [15] focused on infection by botnets, which they grouped into nine series by the features of the collected samples. They experimentally identified dependencies between commands and directories. R. M. Arifianto et al. [16] proposed an SSH Honeypot architecture that uses an Intrusion Detection System. Their work addressed SSH service attacks by observing the number of login attempts between two Honeypots, and the attack risk was determined using category weights and port scanner detection results. R. Kumar et al. [17] used virustotal.com and the relevant literature to categorize attacks into four classes: malicious, SSH, XOR DDoS, and spying. Pierre et al. [18] proposed a fully implemented binary classifier that used machine learning algorithms to differentiate between malicious and benign shell commands. The classification results thus obtained were combined with the results obtained using the 1-Command and n-Command classifiers. Shreya et al. [19] analyzed an SSH-based Honeypot to identify automated and human attackers. The method used the number of requests, the target of the attack, the frequency of requests, and passwords.
T. H. Lee et al. [20] used the packet length in an SDN switch in deep learning models to identify anomalous and malicious packets. J. M. Jorquera et al. proposed a method for classifying threats that was based on the properties of Linux commands [21]. They used machine learning algorithms to identify and classify an attacker's malicious intentions in executing a cyber threat based on the severity of the command. Bryson et al. [22] showed that the Cowrie Honeypot is an effective system for collecting samples of malicious sessions. Sessions produced by the same loader can be identified by computing the edit distance between command sequences. J. T. Martínez et al. [23] designed an approach to detecting botnets that is based on an SSH-based Honeypot. Their dataset contained 93 functions, including commands, session status, and network statistics. They used the random forest algorithm in experiments.
B.X. Wang et al. [24] used their previously obtained research results to evaluate the threats, applied AI techniques to the Zenodo CyberLab Honeypot dataset, and compared the LightGBM algorithm with the random forest and K-NN algorithms using different feature sets in the SSH-based Honeypot for cyber-threat intelligence. A heuristic distributed scheme (HIDE) was proposed to validate the falsification of traffic data. Its calculations were based on a homogeneous semi-Markov process that predicted the accuracy of mobility patterns, and a cloudlet with a weight factor was used to determine whether a vehicle is malicious [26].
M. Sewak et al. sought to fill the gap between AI-based accomplishments and a comprehensive review of the cyber security threat landscape. They proposed and reviewed a machine-learning solution for threat detection and endpoint protection using deep reinforcement learning [27].

III. PROPOSED NETWORK THREATS DETECTION SYSTEM-AI@NTDS
An intelligent threat detection system, called the AI@NTDS system, is designed to investigate network threats and analyze them using specific features and algorithms, primarily for SSH sessions. This section describes the automatic data acquisition process and defines the features. A multilabel classification model is also introduced.

A. SYSTEM ARCHITECTURE
The proposed AI@NTDS system architecture that is shown in Figure 1 has five parts, which perform data collection, data preprocessing, feature-based analysis, model training, and model output.
The dataset of Cowrie Honeypots was obtained from the CyberLab Honeypot dataset on Zenodo [28]. The various attack features are identified from variations among attacks. Fifty-two features were extracted from the Cowrie Honeypot dataset and then grouped into message-based, host-based, and geography-based features. The data preprocessing part ensures that the labels and contents in the samples are correct. The learning algorithm is used to evaluate the importance of various feature combinations. To ensure the stability of the AI model and prevent overfitting, the model should not fit the training dataset too closely, so the model is validated during the training process. The performance of the presented model is determined at various times, and the results thus obtained are presented in the following section.

B. DATA COLLECTION
The CyberLab Honeypot collected attackers' data from June 2019 to February 2020 for use in this study. Approximately 50 Cowrie Honeypot nodes, mostly at universities and companies in the European Union and the United States, were used. Each file in the dataset is based on reports of daily intrusions. Sessions are grouped according to when the attacker enters and leaves. Each group of sessions contains various events and explicit intentions. The goal of this work is to reduce the complexity and to collect results automatically every day. The system automates the process by applying the concept of a crawler. The data collection program automatically decompresses the downloaded files and converts the extracted JSON file into a CSV file. Figure 2 presents the flow chart.
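The decompress-and-convert step described above can be sketched as follows. This is a minimal illustration, not the authors' crawler: it assumes that each decompressed file holds one JSON event per line, and the function names and the sample field names (`session`, `eventid`, `input`, `duration`) are hypothetical.

```python
import csv
import gzip
import io
import json


def events_to_csv(json_lines: str) -> str:
    """Convert newline-delimited JSON events into CSV text."""
    events = [json.loads(line) for line in json_lines.splitlines() if line.strip()]
    # Use the union of all keys so that events with different fields share one table.
    columns = sorted({key for event in events for key in event})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(events)
    return buffer.getvalue()


def convert_archive(gz_path: str, csv_path: str) -> None:
    """Decompress one daily .gz file and write its events as a CSV file."""
    with gzip.open(gz_path, "rt", encoding="utf-8") as src:
        text = src.read()
    with open(csv_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(events_to_csv(text))


# Two hypothetical events from the same session.
sample = (
    '{"session": "a1", "eventid": "cowrie.command.input", "input": "uname"}\n'
    '{"session": "a1", "eventid": "cowrie.session.closed", "duration": 2.5}\n'
)
csv_text = events_to_csv(sample)
```

Fields that are missing from an event are written as empty cells, which is where the later removal of empty fields would apply.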

C. DATA PROCESSING
Data processing first removes irrelevant information from the dataset to ensure data quality. The first step in this process is the removal of data associated with failed intrusions. Empty fields are deleted to save storage space and increase computing efficiency. Then, the cleaned data are labeled in a manner consistent with the Enterprise Tactics in MITRE ATT&CK. The Enterprise Tactics comprise 14 groups of tactics, of which those used herein are indicated below. Table 1 presents the results obtained using the indicated tactics. Nine labels are used in the dataset, including one for sessions with no intention: No Action, Execution, Persistence, Privilege Escalation, Defense Evasion, Credential Access, Discovery, Command and Control, and Impact. These tactics are associated with three malicious levels based on severity.
Level 1 refers to actions that may damage the system, such as the execution of malicious files that stop the system. It is the most dangerous level for a system and applies, for example, when a hacker inputs the command "kill" or "rm", or executes unknown executable binaries. Level 2 refers to setting file permissions for personal accounts. For example, a hacker may input the "chmod" or "chattr" command to change file permissions. Level 3 refers to the absence of action or to scouting actions. For example, if a hacker inputs a command such as "cat /etc/passwd" or "lscpu" to obtain system information, the command is assigned to Level 3.
Figure 3 presents the distribution of tactics in the labeled data. Defense Evasion is the most common tactic at Level 1. The attacker's purpose is to leave no records, for example by removing downloaded programs, destroying files, and obfuscating the system. Malicious programs are commonly used to run scripts that set permissions and perform other actions that are typical of attackers in Honeypots. The most common tactic at Level 2 is Persistence. The attacker's purpose is to escalate the privileges of an account so that, when a connection break occurs, the attacker maintains access to the system to support malicious operations or to change the system configuration. The figure shows that the most common tactic following login at Level 3 is Discovery because, when an attacker accesses the system, the first task is always to perform reconnaissance. The distribution of the tactics of an attacker who enters a Honeypot can thus be identified from statistical data.
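The three levels can be illustrated with a small rule-based sketch built only from the example commands named above. It is an assumption-laden toy, not the labeling procedure of this work, which follows the MITRE ATT&CK tactics: the keyword sets, the `./` test for unknown binaries, and the function names are all hypothetical.

```python
# Example commands taken from the text; real labeling covers many more.
LEVEL_1_COMMANDS = {"kill", "rm"}        # may damage or stop the system
LEVEL_2_COMMANDS = {"chmod", "chattr"}   # change file permissions


def first_words(message: str) -> list:
    """Return the first word of each command in a shell input line."""
    normalized = message.replace("&&", ";").replace("|", ";")
    return [part.split()[0] for part in normalized.split(";") if part.split()]


def threat_level(message: str) -> int:
    """Return 1 (most severe), 2, or 3 (no action or reconnaissance)."""
    commands = first_words(message)
    # Assumption: a command that starts with "./" runs an unknown binary.
    if any(c in LEVEL_1_COMMANDS or c.startswith("./") for c in commands):
        return 1
    if any(c in LEVEL_2_COMMANDS for c in commands):
        return 2
    return 3
```

The most severe matching rule wins, so a session line that both changes permissions and runs a binary, such as `chmod 777 a; ./a`, is assigned Level 1.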
The data in our work were collected from the 4th of June, 2019, to the 29th of February, 2020. The data were automatically collected using a web crawler, which grabbed 153,665,690 samples. After the invalid samples were removed, 298,667 valid sample entries remained for the subsequent processing. The data from the 4th of June, 2019, to the 31st of December, 2019, formed the training dataset, while those from the 1st of January, 2020, to the 28th of February, 2020, formed the test dataset. Of the training data, 15% were allocated to the validation dataset that was used to evaluate the model. Table 2 presents the applications of the split dataset.
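A date-based split of this kind can be sketched as below. The text does not say how the 15% validation share is drawn, so random sampling with a fixed seed is an assumption here, as are the function name and the toy records.

```python
import random
from datetime import date, timedelta

TRAIN_END = date(2019, 12, 31)  # last day of the training period in the text


def split_dataset(samples, validation_fraction=0.15, seed=42):
    """Split (day, record) pairs into training, validation, and test lists."""
    train_pool = [s for s in samples if s[0] <= TRAIN_END]
    test_set = [s for s in samples if s[0] > TRAIN_END]
    # Assumption: validation samples are drawn at random from the 2019 data.
    random.Random(seed).shuffle(train_pool)
    n_val = round(len(train_pool) * validation_fraction)
    return train_pool[n_val:], train_pool[:n_val], test_set


# Toy data: 20 sessions in 2019 and 5 sessions in 2020.
samples = [(date(2019, 6, 4) + timedelta(days=i), f"session-{i}") for i in range(20)]
samples += [(date(2020, 1, 1) + timedelta(days=i), f"late-{i}") for i in range(5)]
train_set, val_set, test_set = split_dataset(samples)
```

Splitting by date rather than at random keeps every test sample later than every training sample, which matches the stated aim of checking the model on a later period of attacks.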

D. FEATURE DEFINITION
This subsection introduces and describes in detail the 52 features. These features were divided into message-based, host-based, and geography-based groups, as shown in Table 3. The algorithms that are used for machine learning focus on the weights of feature data, so feature extraction is crucial. In addition, the features proposed by the authors are marked in red font.

1) Message-based Features
F34 (Count_base64): Attackers often use Base64 encoding to obfuscate malicious behavior. One of the most common patterns is that attack scripts are encoded and then decoded at execution time. Therefore, this feature is used to determine whether the Base64 command is present in messages. Figure 4 indicates the results of a comparison between Base64 encoding and decoding. F37 (Message_length) and F38 (Messages/sec): These two features measure the total length of a message and the number of characters entered per second, respectively. A message without any intention is shorter in total than one with a particular intention. The essential purpose of long commands is to evade detection or to achieve multiple goals. The number of characters entered per second is used to determine whether the input comes from a bot or a script. Table 4 presents the details of the message-based features.
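The three message-based features discussed above can be sketched as follows. The exact counting rules of this work are not given in the text, so counting the `base64` keyword with a word-boundary pattern and dividing the message length by the input time are assumptions, and the function and key names are hypothetical.

```python
import re


def message_features(message: str, input_seconds: float) -> dict:
    """Compute illustrative versions of F34, F37, and F38 for one message."""
    return {
        # F34: occurrences of the base64 command keyword in the message.
        "Count_base64": len(re.findall(r"\bbase64\b", message)),
        # F37: total number of characters in the message.
        "Message_length": len(message),
        # F38: characters entered per second; 0.0 if no duration is known.
        "Messages_per_sec": len(message) / input_seconds if input_seconds > 0 else 0.0,
    }


msg = "echo aGk= | base64 -d | sh"
features = message_features(msg, input_seconds=2.0)
```

A very high value of the third feature suggests pasted or scripted input, since a human is unlikely to type a long obfuscated command in a fraction of a second.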

2) Host-based Features
Features F39 to F41 are the communication protocol of the connection, information about the connection, and the version of the connection client, respectively. Features F42 and F43 are related to login information. Features F44 to F46 are the duration, the average string length of the response, and the presence or absence of a file during the connection, respectively. The authors proposed one of these features, F45. F45 (Received_Size (AVG)): The attacker often queries content through Linux commands. For example, when the user types "uname", the string "Linux" is returned, and the resulting size is six. The act of stealing information is determined from the length of the returned string. Feature F45 is evaluated using Eq. (1).
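Since the equation itself is not reproduced here, the following sketch only illustrates one plausible reading of F45: the total length of the returned strings divided by the number of discovery commands. That formula, the function name, and the choice to count the trailing newline (which would explain the size of six for "Linux") are all assumptions.

```python
def received_size_avg(responses) -> float:
    """Average length of the strings returned to discovery commands."""
    if not responses:
        return 0.0  # no discovery command was answered in this session
    return sum(len(text) for text in responses) / len(responses)


# "uname" returns "Linux" plus a newline, which gives a size of six.
single = received_size_avg(["Linux\n"])
# A second hypothetical response, e.g. from "nproc", lowers the average.
pair = received_size_avg(["Linux\n", "4\n"])
```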
Table 4. Message-based features (excerpt)
Keyword_free: Calculate how many "free" keywords are in the message.
Keyword_lscpu (Get_sys_info): Calculate how many "lscpu" keywords are in the message.
Keyword_nproc (Get_sys_info): Calculate how many "nproc" keywords are in the message.
Keyword_uptime (Get_sys_info): Calculate how many "uptime" keywords are in the message.
Keyword_wget (Network_connect): Calculate how many "wget" keywords are in the message.
Keyword_tftp (Network_connect): Calculate how many "tftp" keywords are in the message.
Keyword_scp (Network_connect): Calculate how many "scp" keywords are in the message.
Keyword_ping (Network_connect): Calculate how many "ping" keywords are in the message.
Keyword_kill (Shutdown_action): Calculate how many "kill" keywords are in the message.
Keyword_reboot (Shutdown_action): Calculate how many "reboot" keywords are in the message.
F34 Count_base64: Calculate the number of times Base64 appears in the message.
F35 Count_Hex: Calculate the number of times hexadecimal appears in the message.
F36 Count_url: Calculate how many URLs are in the message.
F37 Message_length: Calculate the total length of the message.
F38 Messages/sec: Calculate the length of message input per second.

Table 6. Host-based features (excerpt)
F45 Received_Size (AVG): The length of the data returned after a command is entered, and the number of discovery commands in the Honeypot.
F46 File: All files collected by the Honeypot.

Table 5 presents an example of the command calculation. Table 6 presents the details of the host-based features.

3) Geography-based Features
Features F47 to F50 are the names of the continent, country, region, and city, which were analyzed globally. Longitude and latitude (F51 and F52) are used to determine the location of an attack. The corresponding geographic location can be used to determine whether an abnormal attack has occurred. Table 7 presents the details of the geography-based features.

Table 7. Geography-based features
F47 Continent_Code: Codes for global continents.
F48 Country_Name: The name of a unique territorial subject or political entity.
F49 Region_Name: The name of a part of a country.
F50 City_Name: The name of a more densely populated and developed area.
F51 Longitude: A geographic representation of the east-west position of a point on the Earth's surface.
F52 Latitude: A geographic representation of the north-south position of a point on the Earth's surface.

E. MODEL TRAINING
The proposed AI@NTDS system is designed using the LightGBM algorithm. XGBoost and LightGBM are both based on the Tree Boosting mechanism [29]. The LightGBM algorithm is well known for its better training efficiency and lower memory usage than the XGBoost algorithm. It differs from the traditional Gradient Boosting Decision Tree (GBDT) algorithm and is optimized using various strategies. It has the following four main characteristics: a histogram algorithm, Gradient-based One-Side Sampling (GOSS), Exclusive Feature Bundling (EFB), and Leaf-Wise Tree Growth. The GOSS algorithm is described in Algorithm 1. In LightGBM processing, feature analysis is performed based on the parameter settings that are shown in Table 8. Following the concept of GBDT, each new decision tree is generated to fit the calculated residuals. The most effective learning rate in this work is 0.1. The iterative process revealed that the best number of LightGBM estimators was 100. Since LightGBM grows the tree model leaf by leaf, the number of leaves here is set to 31.
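The idea behind GOSS can be sketched in a few lines of plain Python. This is a simplified illustration of the sampling step only, written for this text rather than taken from LightGBM or from Algorithm 1; the parameter names `top_rate` and `other_rate` mirror LightGBM's options, while the function name, the default rates, and the example gradients are assumptions.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {sample_index: weight} chosen by one-side sampling."""
    n = len(gradients)
    # Rank samples by the absolute value of their gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    # Keep every large-gradient sample and a random subset of the rest.
    sampled = random.Random(seed).sample(rest, min(n_other, len(rest)))
    # Up-weight the sampled small-gradient data to compensate for the
    # samples that were dropped.
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in sampled})
    return weights


grads = [0.1, -0.9, 0.2, 0.8, -0.05, 0.3, 0.0, -0.4]
selected = goss_sample(grads, top_rate=0.25, other_rate=0.25)
```

Only the selected samples, with their weights, would then be used to estimate the split gain of the next tree, which is what reduces the training cost.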

IV. PERFORMANCE ANALYSIS
This section concerns the performances of the machine learning (ML) mechanism, feature-based analysis, and AI@NTDS system analysis. Features are analyzed and discussed. The most effective detection model algorithm is identified. The following section provides experimental proof of the result. Tables 2, 3, and 8 present the dataset, features, and learning parameters that are used, respectively. The authors evaluate the AI prediction model with different multi-classification algorithms to assign malicious payloads to three levels. Table 9 provides various evaluation indexes and the operation time of each machine learning mechanism.

A. ANALYSIS OF ML MECHANISM
Many machine learning classifiers in the security domain are based on SVM and Random Forest [30], [31]. In previous work, Decision Tree and naive Bayes algorithms were used to address this issue [14]. XGBoost and LightGBM are also well-known classification algorithms. In this work, the algorithms are compared with their default parameters in terms of accuracy, precision, recall, F1-score, and computation time.
Following a comprehensive evaluation, LightGBM was selected as the learning model of the proposed AI@NTDS system because all of its measured indicators were better than average. It also required the least computation time, making it easier to deploy on various devices for real-time detection.
Naive Bayes and SVM perform poorly. Although these two algorithms are widely used, they are not suitable for the detection of malicious shell commands in this study.
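For reference, the four quality metrics used in this comparison can be computed for a three-level classifier as sketched below. The text does not state how the per-class values are averaged, so macro-averaging is an assumption here, and the function name and the toy labels are hypothetical.

```python
def evaluate(y_true, y_pred, labels=(1, 2, 3)):
    """Return accuracy and macro-averaged precision, recall, and F1-score."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    per_class = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class.append((precision, recall, f1))
    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(p for p, _, _ in per_class) / n,
        "recall": sum(r for _, r, _ in per_class) / n,
        "f1": sum(f for _, _, f in per_class) / n,
    }


# Toy example: one Level 2 sample is wrongly predicted as Level 3.
scores = evaluate([1, 1, 2, 2, 3, 3], [1, 1, 2, 3, 3, 3])
```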

B. ANALYSIS OF FEATURE-BASED
The features of the AI@NTDS system are divided into host-based, message-based, and geography-based groups. One purpose of this grouping is to find the best classification model and identify the essential features. Four case studies were performed, and the results of the relevant analysis are provided in Table 10. In Case 1, message-based features alone resulted in good accuracy and precision. An attacker may attack a target by such means as obfuscation, reconnaissance, and deletion. These actions cause the attacker to input more characters than in a low-level threat. In Case 2, only host-based features were used. The returned strings and SSH-related information contributed significantly. Case 3 identified the essential features by observing the distribution of attackers based on geographical features. Latitude and longitude were the critical features, but the accuracy and other indicators of the model that used these features were not as high as those in the preceding two case studies. The results show that risks and hazards cannot be assessed using only the features that are associated with geographical location. In Case 4, all of the features were used, and the best values of all indicators were obtained. Finally, all 52 dimensions of the three types of features are used to obtain the best model of the detection system, based on feature engineering with a gradient boosting machine, as shown in Figure 5. The message-based features account for about 50% of the top ten features; these include the Message_length (F37) and \.\w* (F5) features. The host-based features account for 40% of the top ten features; these include the Received_size (F45) and duration (F41) features. A precision of 99.75%, a recall of 99.85%, and an F1-score of 99.80% are achieved. Therefore, the performance of the AI@NTDS model depends mainly on host-based and message-based features. The proposed features are shown to be very effective.
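The feature sets of the four case studies can be written down compactly as below. The ID ranges are an assumption inferred from the feature descriptions (host-based F39 to F46, geography-based F47 to F52, and therefore message-based F1 to F38); the dictionary and function names are hypothetical.

```python
# Assumed ID ranges for the three feature groups (52 features in total).
FEATURE_GROUPS = {
    "message": range(1, 39),     # F1 to F38
    "host": range(39, 47),       # F39 to F46
    "geography": range(47, 53),  # F47 to F52
}

# Feature groups used in each of the four case studies.
CASES = {
    1: ["message"],
    2: ["host"],
    3: ["geography"],
    4: ["message", "host", "geography"],
}


def feature_ids(case: int) -> list:
    """Return the feature IDs that are used to train the model in one case."""
    return [f"F{i}" for group in CASES[case] for i in FEATURE_GROUPS[group]]


all_features = feature_ids(4)
```

Training the same classifier once per case on the columns returned by `feature_ids` would reproduce the structure of the comparison, although not necessarily the reported figures.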

C. ANALYSIS OF AI@NTDS SYSTEM
The test dataset comprises 23% of the data in all experiment datasets. The training set comprises data from 2019, and the test set consists of data from 2020. The AI@NTDS classifier predicts the classification of each threat in the test dataset, yielding the results in Table 11. From the confusion matrix, the total misclassification ratio of the classifier is 0.17% for threat level 1, 0.37% for threat level 2, and 0.86% for threat level 3. The F1-score reaches 99.80%, indicating that the AI@NTDS effectively detected samples of various threats. The AUC (Area Under the Curve) reaches 98.53%, the precision reaches 99.75%, and the recall reaches 99.85%. Therefore, the detection model that is trained using the LightGBM algorithm can detect changes in malicious samples across various periods of attack and has excellent efficiency and performance. Table 12 compares the performance of the proposed AI@NTDS classifier with those of related methods and algorithms in previous studies. Since previous studies have not provided detailed parameter settings and features of each mechanism, the parameter settings of Random Forest and K-NN were those used in the closest methods when they were originally developed. These mechanisms were compared using the same dataset. The accuracy of the AI@NTDS model is 4% higher than those of Random Forest and K-NN, and the F1-score is 1% better. Therefore, the AI@NTDS classifier with the LightGBM algorithm is the most effective in classifying threat levels and identifying the attacker's intent. The difficulty of implementation and the number of data dimensions that must be handled are also compared. Although the model herein yielded better results than both of the others, it requires more features to be extracted from the dataset. Therefore, the proposed mechanism requires more time for data preprocessing. All of the experiments in Table 12 use the Zenodo dataset, which is based on the Cowrie Honeypots.
The AI@NTDS system proposed in this study can be applied to any real-world scenario that involves IoT devices and the analysis of Linux server shell commands.
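A per-level misclassification ratio can be read from a confusion matrix as sketched below. The text does not define the ratio precisely, so taking it as the share of samples of a true level that are predicted as another level is an assumption, and the counts in the example matrix are invented for illustration and are not the values of Table 11.

```python
def misclassification_by_level(matrix) -> dict:
    """matrix[i][j] counts samples of true level i+1 predicted as level j+1."""
    rates = {}
    for i, row in enumerate(matrix):
        total = sum(row)
        wrong = total - row[i]  # everything off the diagonal in this row
        rates[i + 1] = wrong / total if total else 0.0
    return rates


# Hypothetical counts for three threat levels.
example_matrix = [
    [990, 6, 4],
    [3, 995, 2],
    [5, 5, 990],
]
rates = misclassification_by_level(example_matrix)
```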

D. COMPARISON WITH OTHER STUDIES
Many studies of related issues have been performed. Table 13 presents the problems that these studies address and the features that they use. The mechanisms that are listed in the first, fourth, and fifth columns have similar purposes and are analyzed in detail in Table 12.

V. CONCLUSION
This study proposed an AI@NTDS detection system that incorporates the LightGBM machine learning algorithm for identifying and classifying threats. Attackers' intentions are analyzed using collected data, and the degree of harm that is caused by malicious instructions is determined. Three threat levels of attack are identified using the Enterprise Tactics of MITRE ATT&CK. A total of 52 features of three types (message-based, host-based, and geography-based) are ultimately identified. The results of an analysis demonstrate that our model performed best when all features were used. Message-based and host-based features contribute the most to the accuracy of the model.