Transformer-Based Named Entity Recognition on Drone Flight Logs to Support Forensic Investigation

The growing public use of drones has increased the number of drone incidents and attacks. Sophisticated preventive mechanisms, as well as post-incident procedures and frameworks, are needed. A forensic investigation is performed after a drone incident to uncover the incident scenario, mitigate the risk, and report the examination results. A standard drone forensic procedure generally consists of three stages: evidence acquisition, evidence analysis, and reporting. Existing research has largely addressed framework proposal and evaluation, case studies, and tool proposal and evaluation. However, less research focuses on utilizing specific data artifacts from the drone forensic image, such as telemetry, dataflash, and flight log data. Therefore, this research proposes using log message data to discover and extract incident-related information with a deep learning-based NLP technique, namely named entity recognition using the Transformer. Cosine similarity is proposed as a substitute for the dot-product in the self-attention mechanism of the Transformer encoder layer. Additionally, we propose an NER architecture built from a mix of several existing methods and report its performance evaluation. We extract the DJI drone forensic image from a publicly available dataset using Autopsy and DJI Phantom Help and collect the decrypted log messages. Six entity types are defined after carefully reading the log messages. These entity types are used as labels in the manual annotation process with the IOB2 scheme. The constructed dataset is used to evaluate the proposed model along with several baseline models. The proposed method outperforms the previous baseline model with a 91.348% F1 score. Finally, we conclude the experiment and mention several future directions.


I. INTRODUCTION
The associate editor coordinating the review of this manuscript and approving it for publication was Alba Amato.
UAV technology has significantly impacted several sectors, such as industry, film, and advertising, as seen in the growth of consumer drone usage in recent years. A survey from Statista [1] states that global consumer drone shipments reached approximately 5 million units in 2020. This number is expected to keep increasing to 9.6 million units in 2030. The increase in drone employment in many fields opens new challenges in securing drone devices. Other than consumer drones, there are other types of drones, i.e., military, terrorist, and criminal drones [2]. Any failure, error, or malfunction is not tolerated for these types of drones, just as for consumer drones. Therefore, it is critical to guarantee the security of the device. To this end, more sophisticated security and forensic procedures need to be developed to diminish the risk caused by any attack or incident [3]. In the digital forensic research area, drone forensics is a fairly new research topic. Generally, it is categorized into two sub-topics based on the evidence used to perform the investigation, i.e., digital and physical investigation. Both types of research aim to find relevant information regarding the incident, uncover the attack scenario, and diminish the risk resulting from an incident [4]. To perform a digital forensic investigation, several artifacts can be utilized, such as image, video, telemetry log, and flight log data. The physical examination aims to achieve several objectives, including but not limited to identifying any unique identifiers, notable features, or damage. It also determines the model and class of the device, along with the capability of the storage system, and then lists the available options to perform extraction [5].
On the other hand, digital evidence is analyzed to achieve other objectives, such as mapping the link between the components of the UAV, identifying and matching the ownership to get a suspect user, and obtaining and inferring some information to prove that the device was used to commit a crime [5].
While the drone is flying, any event that happens to the drone is recorded in a log file, including the state of components such as sensors, motors, GPS, and links. These data are stored in telemetry and dataflash logs located in the persistent storage attached to the device [6]. Mantas and Patsakis [6] utilized telemetry and dataflash logs for drone forensic investigation by performing UAV integrity checks, anomaly detection on the visual flight path, command verification, error reporting, and hardware error detection. They proposed GRYPHON as an open-source tool to perform these tasks [6]. Previously, DROP, the first open-source tool for parsing DJI drone flight log data, was proposed to help acquire the plain information encrypted in the proprietary .DAT and .TXT log files of DJI models [7]. Other than that, most drone forensic research takes the form of case studies, which proceed from scenario design through data generation and acquisition and data analysis to reporting. Several references use DJI [8], [9], [10], Cheerson [11], Parrot [4], and Yuneec [12] models as experimental devices. However, there has been no attempt to utilize one specific type of data, in this case the log message, to perform a drone forensic investigation. For this reason, we propose a deep learning-based Natural Language Processing (NLP) technique to perform information extraction from the log messages to assist the forensic investigation process.
Information Extraction (IE) is one of the sub-topics in the NLP research domain and aims to infer knowledge from a large pool of text data. Information extraction involves several steps; after data source collection and preprocessing, Named Entity Recognition (NER) is one of the initial steps in IE [13]. Researchers have taken advantage of the power of NER to recognize and extract mentioned entities in several domains, such as agriculture [13], [14], biomedical [15], [16], chemical [16], [17], food and dietary [14], and cybersecurity [18], [19], [20]. Inspired by the success of NER in those domains, we are motivated to investigate the usability of NER in the drone forensic domain, considering that the characteristics of the data are unique to each specific domain. Fig. 1 illustrates a forensic timeline constructed from the flight log messages along with the entities mentioned within them. A well-constructed forensic timeline exposes the sequential events experienced by a system regarding a particular security incident [21]. In this research, we use flight log data to construct a forensic timeline that consists of the log message and the timestamp.
To perform NER, two common families of deep learning models can be used: RNN-based and Transformer-based models. The latest state-of-the-art methods include BiLSTM-CRF and pre-trained Transformer-based language models (LMs). The rise of Transformer-based LMs has significantly impacted the NLP research landscape, including NER. Examples include BERT [22], one of the first pre-trained LMs; RoBERTa [23], an optimized version of BERT; DistilBERT [24], a smaller, faster, cheaper, and lighter version of BERT; and GPT [25], a type of pre-trained LM that employs only the decoder part of the Transformer architecture. However, every domain-specific problem has its own unique problems and data characteristics. In general NER, the common entity types are Organization, Person, or Location, and the sentence structure follows natural language semantics. The sentence structure of drone flight log messages, however, does not necessarily adhere to that of natural language as found in public news, for instance. For this reason, we aim to investigate the success of the Transformer-based technique in recognizing the region of interest in drone flight log messages.
The contributions of this paper are summarized as follows.
1) This research constructs a new NER dataset in the drone forensic domain. To the best of our knowledge, there is no publicly available NER dataset for drone forensic problems yet. We identify and propose six entity types as the tagset in the annotation process. The proposed model achieves a competitive score compared to the state-of-the-art methods, with a 91.146% F1 score, while one of the scenarios reaches a 91.348% F1 score.
2) This work showcases how to utilize specific evidence data to perform forensic information extraction to assist a forensic investigation, including a simple framework for data extraction and annotation.
3) We propose and investigate cosine similarity as a substitute for the dot-product in the self-attention sub-layer of the Transformer encoder to model contextual dependency in a sequence. We also propose a new NER architecture consisting of CNN character embedding, BERT word embedding, a Transformer with scaled dot-product attention as the encoder, and CRF as the decoder.
The paper comprises five sections, and the remainder is organized as follows. Section II reviews the recent related works on drone forensic research, deep learning for named entity recognition, and the use of named entity recognition in cybersecurity. The proposed method is elucidated in Section III. Section IV explains the experimental results and analysis. We conclude the paper in Section V with several future directions.

II. RELATED WORKS
The advancement of the Unmanned Aerial Vehicle (UAV), commonly called a drone, followed by constantly increasing drone usage in society, has brought drone forensic research to the surface and attracted researchers' interest. Case study-based papers are the most common among the published papers on drone forensics. This section briefly discusses and summarizes the published works related to drone forensics. The following sub-section explains other researchers' works on employing deep learning models for named entity recognition. Since our research is a sub-field of cybersecurity, we also recap several related attempts at utilizing NER in the cybersecurity domain in the last subsection.

A. DRONE FORENSIC INVESTIGATION
The field of drone forensics is a relatively recent research topic. The growth and development of Unmanned Aerial Vehicle (UAV) technologies have brought the subject of drone forensics to the surface and piqued academics' attention. Most published drone forensics research takes the form of case studies. In this research category, a forensic examination is performed on a drone device after a staged flight in a controlled environment. The procedure starts with the data collection stage and ends with the investigation report. Yousef and Iqbal [10] proposed a series of guidelines to help forensic investigators conduct forensic investigations using the DJI Mavic Air drone model. Various techniques are used to gather the evidence and explain the successfully collected data, which may aid the investigation process. Several similar studies were done on other drone models, such as the Yuneec Typhoon [12], the DJI Spark [26], and the DJI Phantom [27]. According to those case studies, the drone's controller devices store valuable data that can be compared to the other artifacts, enabling a correlation study between the UAV and the mobile application used to control the drone. Some studies also suggest a technical procedure for drone forensic inquiry. To perform an end-to-end analysis from the preparation to the reporting of the findings, the ten procedures proposed by Salamh et al. [12] must be followed.
Analyzing the encrypted files is one of the obstacles in the data collection phase. For the DJI models, encrypted evidentiary data is a certainty. DJI offers a closed-source, paid decryptor tool, but investigators do not always have access to DJI's proprietary tools, so the data must sometimes be decrypted without them. Hence, some studies develop tools to help researchers and investigators conduct a forensic analysis. The DROP (Drone Open source Parser) tool developed by Clark et al. [7] is a parser for .DAT files that can also decrypt the encrypted file to obtain the plain data within. Furthermore, DROP can link what the .DAT file holds and match it with the contents of the .TXT flight log file. After successfully decrypting those two files, GRYPHON [6] can be used for dataflash and telemetry log analysis. The program can perform timeline analysis, analyze flight data to discover anomalies, map the GPS coordinates, and offers many other features. Beyond these tools, several others were identified and described in a survey conducted by Viswanathan et al. [28]. Among the existing tools, Salamh et al. [8] carried out a case study examining their features to aid the forensic investigator in selecting the most suitable tool for a particular type of task. In the general digital forensics domain, log2timeline is commonly used to construct a forensic timeline from log records. The result is in .CSV format and consists of log records with corresponding timestamps. Timeline2GUI can be used to parse and analyze the contents of the log2timeline output file. It offers automatic analysis that would be too complex to conduct manually. It is also equipped with sophisticated visualization features to highlight critical information and assist the forensic investigator in analyzing, interpreting, and drawing conclusions [29].
Understanding the drone device and its parts is a crucial step before beginning a forensic investigation [30]. Accordingly, Jain et al. [30] proposed a framework comprising 12 phases. The first five phases are used to locate and validate the drone's sensors and data. The other seven phases are used to analyze physical evidence, like fingerprints, and digital evidence from several sources, such as memory cards, flight logs, and network logs. However, the proposed framework does not explain the preparation phase in detail. Therefore, another framework consisting of four investigative phases with a more comprehensive analysis was proposed by Al-Dhaqm et al. [3]. One of the primary distinctions of this framework is a more extensive preparation phase consisting of pre- and post-incident preparation. Pre-incident preparation is an important step that is not yet covered in most forensic frameworks. This step aims to understand several possible indicators of compromise, define potential forensic evidence, and measure a drone device's forensic readiness before a flight. The remaining phases, including post-incident preparation, data acquisition, and data analysis, are largely the same as in the other frameworks but more thorough.
VOLUME 11, 2023
Physical and digital evidence are the sources of evidence used in a forensic investigation. Analyzing these two types of evidence requires different techniques. For digital evidence, computer-assisted tools are needed to read and present the evidence in a format that humans can comprehend. Studies on evidence analysis are dominated by reconstructing and visualizing the flight path taken by the drone during a flight [5]. This is done by utilizing the GPS coordinates recorded in the flight log with the help of the CsvView tool. A similar study conducted by Kumar and Agrawal [31] utilizes GPS data to reconstruct the flight path, with three different drone makes as experimental devices. They also propose the FlyLog Converter Tool, which converts the .TXT or .JSON flight log file from Parrot drones into a .CSV file that is easy to understand.

B. NAMED ENTITY RECOGNITION IN CYBERSECURITY
NER plays a vital role in general NLP as well as in domain-specific areas, where it serves as an initial task that supports downstream tasks such as event extraction and relation extraction [32]. The capability of NER to obtain valuable information from text data can speed up an information extraction process and make its results more accurate. The extraction is accomplished by processing and recognizing tokens that may be associated with a specific type of entity [33]. NER is not a new research topic, and there have been many studies on the development of NER models. BiLSTM, with its capability of capturing bidirectional relations among words in a sentence, has been the state-of-the-art method in sequence labeling for many years. Pairing it with a statistical model, CRF, which maximizes the probability of a label sequence, has been shown to improve BiLSTM performance [32]. More advanced models with richer input representations obtained from pre-trained word embedding models, such as Word2vec [34], GloVe [35], ELMo [36], and BERT [13], further improved the BiLSTM-CRF architecture. However, the arrival of the Transformer has revolutionized many NLP tasks, including NER. Since its publication, many variants of pre-trained Transformer-based models have become available. The main advantage of the Transformer architecture is that the attention mechanism in the encoder sub-layer can model the context and relations between the words in a sentence.
TENER [37] attempts to utilize the Transformer encoder to perform NER by incorporating relative positional encoding to make the model distinguish the direction of a particular context. CNN char embedding is added to the embedding vector to represent the char-level feature. The proposed architecture outperformed the RNN-based state-of-the-art models in the benchmark NER dataset, CoNLL2003. Working with deep learning models is primarily a matter of providing decent input to the neural model. Several efforts have been made to increase the performance of the Transformer encoder by incorporating more features into the embedding vector of the word representation. Instead of solely relying on the word embedding as the input source, adding a dictionary feature embedding to the input vector can improve the BiLSTM-CRF model equipped with an attention mechanism [19].
Supervised deep learning models can take advantage of the label information attached to the data points in the training process. Besides being used to update the weight parameters, label information can also be injected into the input vector, as proposed in LUKE [38], to provide a rich input. An entity-aware self-attention mechanism is proposed to separate the token-to-token and token-to-entity context parameters. A masked language model objective is employed in the pre-training phase to predict randomly masked tokens and entities. LUKE became the state-of-the-art for five well-known entity-related benchmarks: CoNLL2003 for NER, Open Entity for entity typing, TACRED for relation classification, ReCoRD for cloze-style question answering, and SQuAD 1.1 for extractive question answering.
The capability of NER to recognize and extract the region of interest in unstructured text data has been applied in various domains. Several efforts have been made to utilize NER in the cybersecurity domain. In a process-aware system, valuable information is stored in log files and commonly written in a less human-readable format. Because the records are large volumes of text data, researchers use particular NLP techniques to process the data and perform the analysis automatically.
One of the most severe difficulties has been dealing with the complexity of cybersecurity data. Different systems and devices generate logs in different formats. The main challenges in cybersecurity data are the lack of a consistent naming system, numerous acronyms, technical terminology, frequent use of conjunctions, and extensive nesting structures [33]. The prior state-of-the-art model employed the XBiLSTM-CRF architecture to conduct NER on a publicly available cybersecurity dataset [18]. The model's performance was enhanced by concatenating the word's vector representation with the output of the Bidirectional Long Short-Term Memory (LSTM) layer. The Conditional Random Field (CRF) layer is used to decode the concatenated output since it can determine how the sequence labels relate to one another.
Most of the work involved in implementing deep learning models is spent on figuring out how to prepare adequate input representations. Most often, word embedding techniques like GloVe [39], Word2vec [40], BERT [22], and ELMo [41] are used to learn the input representation. Pre-trained language models, such as BERT and ELMo, can generate contextualized representational vectors after going through a pre-training procedure on a large corpus. In contrast, GloVe employs local and global statistics to build a word vector, yielding a static lookup dictionary after being trained on a relatively large corpus. To provide additional information to the word embedding vectors, Gao et al. [19] developed a domain-specific knowledge base and a data-driven NER system. Additionally, an attention mechanism is utilized to apply greater weights to more valuable information in a sentence. The experiment demonstrates that the designed model is more capable of identifying rare entities.
In line with the rise of attention-based models, Zhou et al. [20] suggest further NER system development for cybersecurity data using BERT. Instead of masking a random word piece, as in BERT, Whole Word Mask is employed. Since the masking is applied to the entire word rather than at the word-piece level, this masking mechanism copes better with cybersecurity data. This solution addresses the frequent use of conjunctions in terms like "buffer-overflow" or "man-in-the-middle attack," which is one of the main issues in cybersecurity NER.
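The difference between word-piece masking and whole-word masking can be illustrated with a minimal sketch. The word-piece split of "buffer-overflow" below is hypothetical (a real BERT vocabulary may split it differently), and the function name and parameters are illustrative, not taken from [20].

```python
import random

def whole_word_mask(pieces, mask_prob=0.15, seed=0, mask_token="[MASK]"):
    """Mask all word pieces of a selected word together.

    A piece starting with '##' is treated as a continuation of the previous
    word, following the usual WordPiece convention.
    """
    rng = random.Random(seed)
    # Group piece indices into whole words.
    words = []
    for index, piece in enumerate(pieces):
        if piece.startswith("##") and words:
            words[-1].append(index)
        else:
            words.append([index])
    masked = list(pieces)
    for word in words:
        # One random draw per word, so its pieces are masked jointly.
        if rng.random() < mask_prob:
            for index in word:
                masked[index] = mask_token
    return masked

# Hypothetical word-piece split, for illustration only.
pieces = ["buffer", "##-", "##over", "##flow", "attack"]
print(whole_word_mask(pieces, mask_prob=0.5, seed=1))
```

With word-piece masking, "##over" could be hidden while "buffer" stays visible; here the four pieces of the compound term are always hidden or shown as a unit.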
Among the published literature, there are still not many studies that specifically work on examining a particular type of drone forensic artifacts, especially the human-readable message within a drone flight log, as evidence to perform forensic analysis. Therefore, we are motivated to utilize the log message and perform information extraction to assist in a forensic investigation.

III. PROPOSED METHOD
The architecture of the proposed model is depicted in Fig. 2. In this research, we employ a modern deep learning model, the Transformer, to encode and model the dependency between words in a sentence and perform named entity recognition in the drone forensic domain. Overall, our proposed method consists of positional encoding, character embedding, word embedding, an encoder, and a decoder. We explain the details in the following sub-sections.
Existing studies show that most drone forensic research is based on case studies, tool development, and tool testing and evaluation. Few studies currently analyze particular drone data artifacts, specifically log message data. Inspired by the success of Transformer-based NER models in various domains, including cybersecurity, this paper takes advantage of NER to recognize the entities mentioned in drone flight log messages. To fill the research gap, this work investigates the use of information extraction techniques to obtain insight from unstructured evidentiary data. The retrieved information is expected to help the forensic investigator pinpoint the critical incident-related information in the flight logs faster.
A. DATA PREPROCESSING
Drone flight log data contains a number of columns with various information regarding the drone's condition and state. From those columns, we take the data from the message, tip, and warning columns. Not every log entry has a message, as log messages are generated by certain events or incidents. These messages contain useful information for the forensic investigator to conduct forensic analysis and investigation. Therefore, the other columns in the drone flight log are ignored.
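The column selection described above can be sketched as follows. The column names (`time`, `message`, `tip`, `warning`) and the sample rows are assumptions made for illustration; the headers and messages in a real decrypted DJI flight log may differ.

```python
import csv
import io

# Hypothetical column names; real decrypted flight logs may label them differently.
MESSAGE_COLUMNS = ("message", "tip", "warning")

def collect_messages(csv_file, timestamp_column="time"):
    """Yield (timestamp, text) for every non-empty message-like cell."""
    for row in csv.DictReader(csv_file):
        for column in MESSAGE_COLUMNS:
            text = (row.get(column) or "").strip()
            if text:
                yield row.get(timestamp_column, ""), text

# Made-up sample rows: most entries carry no message at all.
sample = io.StringIO(
    "time,altitude,message,tip,warning\n"
    "00:00:01,0.0,,,\n"
    "00:00:05,1.2,Data Recorder File Index is 12.,,\n"
    "00:01:10,30.5,,,Strong wind. Fly with caution.\n"
)
messages = list(collect_messages(sample))
for timestamp, text in messages:
    print(timestamp, text)
```

Keeping the timestamp next to each message preserves what is needed later to place the message on a forensic timeline.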
After collecting all the log messages from the flight log files, each message is tokenized without lowercasing. As observed in the dataset, many entities are written in capital letters. To give the model a chance to see the difference between upper and lower case, we preserve the original messages without converting them to lowercase. We keep the dot and comma as tokens, as these two punctuation marks act as context separators in a sentence. We use the spaCy tokenizer to tokenize all the messages. The tokenized messages are then converted into the CoNLL format, a standard NER dataset format. Finally, equal-length token and label sequences are fed to the embedding layer to obtain a representational vector.
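The preprocessing steps above can be sketched as follows. The regular-expression tokenizer is only a dependency-free stand-in for the spaCy tokenizer, the entity type names `COMPONENT` and `ISSUE` are hypothetical placeholders rather than the six types defined in this work, and the sample message is made up.

```python
import re

def tokenize(message):
    """Regex stand-in for the spaCy tokenizer: keeps case, splits off punctuation."""
    return re.findall(r"\w+|[^\w\s]", message)

def iob2_tags(tokens, spans):
    """Build IOB2 tags from (start_token, end_token_exclusive, entity_type) spans."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for index in range(start + 1, end):
            tags[index] = f"I-{label}"      # remaining tokens of the entity
    return tags

def to_conll(tokens, tags):
    """One 'token tag' pair per line; a blank line would separate sentences."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have equal length")
    return "\n".join(f"{token} {tag}" for token, tag in zip(tokens, tags))

message = "GPS Signal Weak. Fly with caution."   # made-up example message
tokens = tokenize(message)
# Hypothetical manual annotation: tokens 0-1 are one entity, token 2 another.
tags = iob2_tags(tokens, [(0, 2, "COMPONENT"), (2, 3, "ISSUE")])
conll = to_conll(tokens, tags)
print(conll)
```

The original casing survives tokenization, and the dot is kept as its own token, matching the choices described in the paragraph above.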

B. CHARACTER AND WORD-LEVEL EMBEDDING
Named entity recognition is part of a long process in the Information Extraction pipeline. NER is the initial step in performing information extraction from text data and recognizes the region of interest and mentioned entities in text data or documents [13]. Since neural networks cannot process raw text directly, the data must be converted into numerical values. This process is called embedding. Two levels of embedding are used in this research: char-level and word-level embedding. Char-level embedding is used to tackle the out-of-vocabulary problem, which is common in NLP; each character therefore has its own embedding vector. CNN-based [43] and LSTM-based [44] char-level embeddings are common approaches in NER. Besides CNN and LSTM, AdaTrans [37] is also used as the char-level embedding in this research to provide rich comparisons.
Despite the parallelism offered by the Transformer architecture, it has no information about a word's position in a sentence. However, words in a sentence are arranged in sequential order, and the order determines the contextual information. Thus, positional encoding is used to inject a representation of the word's position. Let $t$ be the index position of a word in a sequence; then $f : t \in \mathbb{N} \rightarrow PE_t \in \mathbb{R}^{d}$ is a deterministic function that maps each index position into
$$PE_t^{(2i)} = \sin(\omega_i t), \qquad PE_t^{(2i+1)} = \cos(\omega_i t) \tag{1}$$
$$\omega_i = \frac{1}{10000^{2i/d}} \tag{2}$$
Word embedding is a representational vector used in NLP tasks to represent the features in text data. This vector not only contains the features in text data but also mimics the behavior of text data, such as semantics. A well-constructed word embedding can be used to estimate the similarity between two words with $d$-dimensional embedding vectors $u$ and $v$ using cosine similarity [45], as defined in (3).
$$\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} = \frac{\sum_{i=1}^{d} u_i v_i}{\sqrt{\sum_{i=1}^{d} u_i^2} \, \sqrt{\sum_{i=1}^{d} v_i^2}} \tag{3}$$
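As a minimal sketch, the fixed sinusoidal positional encoding of the original Transformer and the cosine similarity between two embedding vectors can be computed with the standard library alone. The interleaved sine/cosine layout and the base of 10000 follow the original Transformer paper; the function names are illustrative.

```python
import math

def positional_encoding(t, d):
    """Fixed sinusoidal encoding for position t with an even dimension d."""
    encoding = []
    for i in range(d // 2):
        omega = 1.0 / (10000 ** (2 * i / d))   # frequency of the i-th sine/cosine pair
        encoding.extend([math.sin(omega * t), math.cos(omega * t)])
    return encoding

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(positional_encoding(0, 4))
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```

Because the encoding depends only on the position index and the dimension, it can be precomputed once for the longest sequence in the dataset.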
There are two types of word embedding: static and contextual. Word2vec [40], fastText [46], and GloVe [39] are common static embeddings, while ELMo [41] and BERT [22] are examples of contextual embeddings. The difference between static and contextual embedding lies in the lookup process. A static embedding has a fixed dictionary that maps each word to a vector. Therefore, a word has exactly one representational vector, no matter what the context is. In contrast, a contextual embedding generates a different representational vector for each distinct context in which a word appears.
In this research, GloVe, $E \in \mathbb{R}^{d_{vocab} \times d_{glove}}$, is used as the static embedding, and BERT, $E_s \in \mathbb{R}^{d_s \times d_{bert}}$, is used as the contextual embedding for sequence $s$. The final embedding vector is the concatenation of the positional embedding vector, the char-level features extracted by AdaTrans, and the pre-trained word embedding from GloVe or BERT.
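The static lookup and the final concatenation can be illustrated with a toy sketch. All vectors and dimensions below are made-up stand-ins; in practice the word vector would come from GloVe or BERT and the char-level features from a trained character encoder.

```python
# Toy stand-in for a static embedding table such as GloVe (values are made up).
STATIC_TABLE = {"GPS": [0.1, 0.3], "Signal": [0.7, -0.2]}
UNKNOWN = [0.0, 0.0]

def static_lookup(word):
    """A static embedding returns the same vector regardless of context."""
    return STATIC_TABLE.get(word, UNKNOWN)

def token_representation(positional, char_features, word_vector):
    """Concatenate positional, char-level, and word-level features for one token."""
    return [*positional, *char_features, *word_vector]

representation = token_representation(
    positional=[0.0, 1.0],             # e.g. a 2-dimensional positional encoding
    char_features=[0.5, 0.5, 0.5],     # e.g. output of a char-level encoder
    word_vector=static_lookup("GPS"),
)
print(representation)
```

The resulting dimension is the sum of the three parts, which is what the encoder's input projection has to accept.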

C. TRANSFORMER ENCODER LAYER
The development of research on the topic of natural language processing reached a significant stage after the presence of an attention-based deep learning architecture called Transformer in 2017 [47]. This architecture was first introduced by the Google research team for English-German and English-French translation problems. The ability to understand and model the language is the main advantage of the Transformer. Transformer architecture is divided into two major parts: the Encoder and the Decoder. In this study, the only part used was the Encoder. In general, the elements that build the Encoder block include Input Embedding, Positional Encoding, Multihead Attention, and Feed-forward Networks [42].
The attention mechanism in the Transformer architecture models the way data are retrieved from a database system. The attention mechanism was first introduced by Bahdanau et al. [48] in 2015 as additive attention, which was then modified by Luong et al. [49] in 2015 by proposing dot-product attention. Both papers use language translation as the experimental case and model contextual learning using the attention mechanism. In dot-product attention, each token in the sequence is transformed into three different representational vectors, i.e., query, key, and value, as shown in Fig. 3. To obtain the contextual representation of the currently processed token, there are five steps to follow [42].
1) Project each token's vector in the sequence into three representational vectors, i.e., $q \in \mathbb{R}^{d_k}$, $k \in \mathbb{R}^{d_k}$, and $v \in \mathbb{R}^{d_v}$. These three vectors are computed by multiplying the embedding vector $e \in \mathbb{R}^{d_{model}}$ with three weight matrices, $W^Q$, $W^K$, and $W^V$.
2) Take the dot-product between the query vector of the current token $q_t$ and the key vector of each context token $k_j$ in the sequence, yielding the vector $s_t = [s_{t1}, s_{t2}, s_{t3}, \ldots, s_{tn}]$ with one score $s_{tj}$ for each $j = 1, 2, 3, \ldots, n$, where $n$ is the number of tokens in the sequence.
3) Scale the output of the dot-product by dividing it by $\sqrt{d_k}$. This is the main difference between Luong's attention and Vaswani's attention.
4) Normalize the scaled dot-product output using softmax. The output of this step is a probability distribution used to weigh the $v$ vectors as the target context, yielding a vector $w_t = [w_{t1}, w_{t2}, w_{t3}, \ldots, w_{tn}]$ with $\sum_{j=1}^{n} w_{tj} = 1$.
5) Finally, perform the Hadamard product ($\odot$) between the probability distribution and the $v$ vectors to get the weighted value vectors, as the weight indicates the amount of attention that exists between the query and key vectors.
Vector $y_t \in \mathbb{R}^{d_v}$ is the output of the scaled dot-product self-attention mechanism explained above and contains the contextual representation of the current token. Mathematically, the self-attention output of $q_t$ against each $k_j$ and $v_j$ in a sequence of $n$ tokens is formulated as (8).
$$y_t = \sum_{j=1}^{n} \frac{\exp\left(q_t \cdot k_j / \sqrt{d_k}\right)}{\sum_{l=1}^{n} \exp\left(q_t \cdot k_l / \sqrt{d_k}\right)} \, v_j \tag{8}$$
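Steps 2 to 5 for a single query token can be sketched in plain Python. The sketch assumes the query, key, and value vectors have already been projected (step 1), and the function names are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

def attend(q_t, keys, values):
    """Return the contextual vector y_t and the attention weights for one query."""
    d_k = len(q_t)
    scores = [sum(a * b for a, b in zip(q_t, k_j)) for k_j in keys]   # step 2: dot-products
    scaled = [score / math.sqrt(d_k) for score in scores]             # step 3: scale by sqrt(d_k)
    weights = softmax(scaled)                                         # step 4: normalize
    d_v = len(values[0])
    # Step 5: weigh each value vector and sum the weighted vectors.
    y_t = [sum(w * v_j[i] for w, v_j in zip(weights, values)) for i in range(d_v)]
    return y_t, weights

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[2.0, 0.0], [0.0, 2.0]]
y_t, weights = attend([1.0, 0.0], keys, values)
print(y_t, weights)
```

The query `[1.0, 0.0]` matches the first key more closely, so the first value vector receives the larger weight.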
In practice, forward propagation in neural networks is computed as matrix multiplication. Instead of taking the dot-product between the vectors one by one, the whole self-attention mechanism can be wrapped into a single matrix multiplication operation by building $Q$, $K$, and $V$ for the query, key, and value matrices, respectively. These matrices are obtained by multiplying the word embedding matrix of a sequence, $E_s \in \mathbb{R}^{d_s \times d_{model}}$, with the weight matrices $W^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W^K \in \mathbb{R}^{d_{model} \times d_k}$, and $W^V \in \mathbb{R}^{d_{model} \times d_v}$. The projected matrices $Q \in \mathbb{R}^{d_s \times d_k}$, $K \in \mathbb{R}^{d_s \times d_k}$, and $V \in \mathbb{R}^{d_s \times d_v}$ are the representational matrices for the sequence. Therefore, the self-attention mechanism for a single sequence can be formulated as (9).
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V \tag{9}$$
In the self-attention mechanism, each token has one self-attention score against each token in the sequence. As a result, the output vector tends to contain only a single context for each token in the sentence. However, a word may have contextual relations with more than one word in the sentence. Multi-head attention addresses this by learning several contexts for each token, each modeled in the weights of one attention head. The number of attention heads is a hyperparameter in the Transformer architecture and can be set as needed based on the data and the case. In this paper, most sequences (more than 80%) are shorter than ten words, and the longest sequence has length 33, so it is less likely that a word has several contexts in a sequence. To keep the model's complexity low, $d_k$ and $d_v$ are set to $d_{model}/H = 128$. Thus, the complexity of multi-head attention with $d_k = d_{model}/H$ is the same as that of a single head with $d_k = d_{model}$. The multi-head attention mechanism can be formulated as a matrix multiplication operation, as shown in (10), where $h$ denotes an attention head, $H$ is the number of attention heads, $a$ is the index of the attention head, and $W^O$ is a weight matrix for the concatenated output of all attention heads.
$$\mathrm{MultiHead}(E_s) = \mathrm{Concat}(h_1, h_2, \ldots, h_H) \, W^O \tag{10}$$
$$h_a = \mathrm{Attention}\left(E_s W_a^Q, \; E_s W_a^K, \; E_s W_a^V\right) \tag{11}$$
The matrices $W^Q_a \in \mathbb{R}^{d_{model} \times d_k}$, $W^K_a \in \mathbb{R}^{d_{model} \times d_k}$, and $W^V_a \in \mathbb{R}^{d_{model} \times d_v}$ are different weight matrices for each attention head. In order to obtain the multi-head attention score, the results of all attention heads are concatenated and then multiplied by a weight matrix $W^O \in \mathbb{R}^{H d_v \times d_{model}}$. The resulting matrix is then passed as an input to the next sublayer, which is a fully-connected layer. The overall multi-head attention mechanism is depicted in Fig. 3.
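The concatenate-then-project step can be illustrated with a short NumPy sketch. It assumes that each head applies the standard scaled dot-product attention and uses hypothetical toy dimensions with $d_k = d_v = d_{model}/H$; it is not the implementation used in the experiment.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(E_s, W_Q, W_K, W_V, W_O):
    # W_Q, W_K: (H, d_model, d_k); W_V: (H, d_model, d_v); W_O: (H * d_v, d_model).
    heads = []
    for W_Q_a, W_K_a, W_V_a in zip(W_Q, W_K, W_V):
        # Every head has its own projection matrices.
        Q, K, V = E_s @ W_Q_a, E_s @ W_K_a, E_s @ W_V_a
        weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        heads.append(weights @ V)
    # Concatenate the head outputs, then project back to d_model.
    return np.concatenate(heads, axis=-1) @ W_O

# Toy dimensions (hypothetical).
rng = np.random.default_rng(0)
d_s, d_model, H = 6, 32, 4
d_k = d_v = d_model // H
E_s = rng.normal(size=(d_s, d_model))
W_Q = rng.normal(size=(H, d_model, d_k))
W_K = rng.normal(size=(H, d_model, d_k))
W_V = rng.normal(size=(H, d_model, d_v))
W_O = rng.normal(size=(H * d_v, d_model))
output = multi_head_attention(E_s, W_Q, W_K, W_V, W_O)
```

Because the concatenated width $H d_v$ equals $d_{model}$ in this setting, the output keeps the shape of the input sequence and can be fed directly to the next sublayer.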
The fixed sinusoidal positional encoding proposed in the Transformer is not representative enough, since it only represents distinct positions and distances but lacks direction information. Inspired by the success of bidirectional LSTM, TENER incorporates direction-aware positional encoding so that the attention mechanism can model the direction from which a certain context comes [50], [51]; the resulting encoder is called AdaTrans. The modified formula to obtain the attention score between the query and key vectors is shown in (13) [37], where $t$ is the index position of the current token and $j$ is the index position of the context token. The fixed sinusoidal positional encoding $PE_t$ in (1) becomes $R_{t-j}$ in (12) to represent the relative positional encoding, with $R_{t-j} \in \mathbb{R}^{d_k}$ to make it compatible with the word embedding vector dimension. $u$ and $v$ are learnable parameters that give the model the ability to distinguish the representations of $e_{t,j}$ and $e_{t+1,j}$ at different distances, and $\omega_i$ is the same term as in (2).
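To make the notion of direction awareness concrete, the following sketch builds a relative encoding from sines and cosines of the offset $t - j$. It assumes $\omega_i = 1/10000^{2i/d_k}$ and a four-term score (content-content, content-position, and two learned bias terms), which is our reading of the formulation in [37]; the ordering of the sine and cosine components is an illustrative choice.

```python
import math

def relative_position_encoding(offset, d_k):
    # Assumed frequencies: omega_i = 1 / 10000^(2i / d_k).
    omegas = [1.0 / (10000 ** (2 * i / d_k)) for i in range(d_k // 2)]
    sines = [math.sin(offset * w) for w in omegas]
    cosines = [math.cos(offset * w) for w in omegas]
    # Sines first, then cosines (illustrative layout).
    return sines + cosines

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def relative_score(q_t, k_j, r, u, v):
    # Assumed four-term form; u and v play the role of the learnable parameters.
    return dot(q_t, k_j) + dot(q_t, r) + dot(u, k_j) + dot(v, r)

d_k = 8
r_forward = relative_position_encoding(3, d_k)    # context three tokens before t
r_backward = relative_position_encoding(-3, d_k)  # context three tokens after t
```

Since sine is an odd function and cosine is an even one, `r_forward` and `r_backward` share their cosine components but have sine components of opposite sign, so the same distance in the two directions yields different encodings.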
Several attention mechanism modifications focus on injecting more linguistic features into the embedding vector and the attention computation. However, to the best of our knowledge, there has been no attempt yet to control the behavior of the attention output. Inspired by [52], where cosine is used as a normalization function in a neural network architecture, we use cosine similarity to normalize the attention score. Originally, the output of the attention is scaled by $\sqrt{d_k}$ [42] and then fed to the softmax function to get the probability distribution. However, the resulting probability distribution tends to have only one significant element, which is then used to weigh the context value vectors. Thus, the attention score represents essentially one context only. Multi-head attention overcomes this issue by projecting the key, query, and value vectors into several distinct attention heads that do not share their parameters. Since cosine can smooth the probability distribution of the softmax output, we investigate the use of cosine normalization as a substitute for the dot-product operation in the self-attention mechanism. As illustrated in Fig. 4, the probability distribution on the smaller scale tends to have several significant values compared to the one on the larger scale. This flatter probability distribution captures attention from the vectors of several context words. Additionally, from the existing NER architectures, we explore several possible arrangements to find the architecture with the best performance on our dataset.
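The effect of the score scale on the softmax output can be reproduced with a few lines of Python. The raw scores, the two scale factors, and the 0.1 threshold for a "significant" element are hypothetical values chosen only for illustration.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def significant(probs, threshold=0.1):
    # Count the elements that carry a non-negligible share of the attention.
    return sum(p > threshold for p in probs)

raw_scores = [4.0, 3.0, 2.0, 1.0]                        # hypothetical attention scores
large_scale = softmax([4.0 * s for s in raw_scores])     # larger score range
small_scale = softmax([s / 4.0 for s in raw_scores])     # smaller score range
```

On the larger scale almost all probability mass falls on a single element, whereas on the smaller scale every element stays above the threshold. Cosine similarity bounds each score to the interval [-1, 1], which places the softmax input in the small-scale regime.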
Before performing matrix multiplication between the key and query vectors in the self-attention mechanism, our proposed method first divides each of these vectors by its respective norm and then reconstructs the matrices. The modified self-attention mechanism is depicted in Fig. 5. Since (3) can be written in the form of (15), the normalized counterparts of $Q$ and $K$ are the query and key matrices constructed from the vectors that have been divided by their respective norms. Consequently, we can fully exploit the optimizable matrix multiplication operation as in the vanilla Transformer architecture. Thus, the forward propagation is nearly the same, except for the additional step of dividing each key and query vector by its norm before performing the matrix multiplication. Therefore, the attention score between the query and key vectors using cosine similarity can be computed using (16).
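A minimal NumPy sketch of this normalization step is given below, assuming the unscaled variant in which the cosine scores are passed directly to the softmax. The names `Q_hat` and `K_hat`, the small `eps` guard against zero norms, and the toy dimensions are our own illustrative choices rather than details taken from the experimental code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cosine_attention(E_s, W_Q, W_K, W_V, eps=1e-12):
    Q, K, V = E_s @ W_Q, E_s @ W_K, E_s @ W_V
    # Divide every query and key vector (row) by its own L2 norm.
    Q_hat = Q / (np.linalg.norm(Q, axis=-1, keepdims=True) + eps)
    K_hat = K / (np.linalg.norm(K, axis=-1, keepdims=True) + eps)
    # One matrix multiplication now yields all pairwise cosine similarities.
    scores = Q_hat @ K_hat.T
    return softmax(scores) @ V, scores

# Toy dimensions (hypothetical).
rng = np.random.default_rng(0)
d_s, d_model, d_k = 5, 16, 8
E_s = rng.normal(size=(d_s, d_model))
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))
output, scores = cosine_attention(E_s, W_Q, W_K, W_V)
```

The only difference from dot-product attention is the row-wise normalization, so the computation still reduces to a single matrix product followed by the softmax.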
The output of the multi-head attention sub-layer is then passed to the Add + Norm sub-layer, as shown in Fig. 2 (b). The term Add in the Add + Norm sub-layer refers to a residual connection [53] between the previous sub-layer output and the current sub-layer output before it is propagated to the next sub-layer. This residual connection retains the positional information from the embedding layer as the computation proceeds to the upper layers of the architecture. The term Norm refers to LayerNorm [54], which controls the value of each sub-layer output. The next sub-layer is the Feed Forward Network (FFN), which consists of two linear transformations with a ReLU [55] activation function in between. This sub-layer is formulated in (17), where $W$ and $b$ are the weight and bias parameters of each linear layer in the FFN, and $x$ is the input vector. The output of this sub-layer is then passed to a linear layer before being propagated to the decoder.
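The two sub-layers can be sketched in NumPy as follows. The sketch assumes that (17) has the usual form $\max(0, xW_1 + b_1)W_2 + b_2$, that a second Add + Norm follows the FFN as in the vanilla Transformer, and it omits the learnable gain and bias of LayerNorm for brevity. All dimensions and weights are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance
    # (learnable gain and bias are omitted in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def ffn(x, W1, b1, W2, b2):
    # Two linear transformations with a ReLU in between.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def encoder_sublayers(x, attention_output, W1, b1, W2, b2):
    # Add + Norm: residual connection followed by LayerNorm.
    h = layer_norm(x + attention_output)
    # FFN followed by a second Add + Norm (assumed, as in the vanilla Transformer).
    return layer_norm(h + ffn(h, W1, b1, W2, b2))

# Toy dimensions (hypothetical).
rng = np.random.default_rng(0)
d_s, d_model, d_ff = 5, 16, 32
x = rng.normal(size=(d_s, d_model))
attention_output = rng.normal(size=(d_s, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
out = encoder_sublayers(x, attention_output, W1, b1, W2, b2)
```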

D. CONDITIONAL RANDOM FIELD LAYER
In sequence labeling tasks such as NER, CRF is a commonly used method. According to previous studies, the Hidden Markov Model and the Maximum Entropy Markov Model (MEMM) are less effective at analyzing sentence-level sequences than the CRF approach [15]. The main feature of CRF, its ability to compute the probability of cross-position label combinations, has drawn researchers to apply this method to the NER problem. Combining the previous state-of-the-art NER model, BiLSTM, with CRF has been shown to improve performance [56]. In this paper, CRF is used as the decoder for all encoder combinations in our experiment. For an observed sequence $x = x_1 x_2 \ldots x_n$ with the corresponding target label sequence $y = y_1 y_2 \ldots y_n$, let $Y$ be the set of all valid label sequences in the dataset. The probability of the predicted label from the encoder is computed using (18), where $f(x, y_{t-1}, y_t, t)$ is an arbitrary feature function that computes the transition score from $y_{t-1}$ to $y_t$ in the sequence $x$. Let $d$ be the number of feature functions used; $\mathrm{Feat}(x, y_{t-1}, y_t, t)$ is then the weighted sum of the transition scores from $y_{t-1}$ to $y_t$ in the sequence $x$ over all feature functions. After obtaining all possible paths and their corresponding probabilities, the Viterbi algorithm is used to discover $\hat{y}$, which denotes the path with the highest probability, as written in (21).
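The decoding step can be illustrated with a small, self-contained Viterbi sketch. It assumes the common simplification in which the weighted feature functions collapse into a per-token emission score and a label-to-label transition score; the score values below are hypothetical. A brute-force search over all paths is included only to check the result on this tiny example.

```python
from itertools import product

def path_score(emissions, transitions, path):
    # Sum of emission scores plus transition scores along one label path.
    score = emissions[0][path[0]]
    for t in range(1, len(path)):
        score += transitions[path[t - 1]][path[t]] + emissions[t][path[t]]
    return score

def viterbi(emissions, transitions):
    num_labels = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for cur in range(num_labels):
            candidates = [score[prev] + transitions[prev][cur] for prev in range(num_labels)]
            best_prev = max(range(num_labels), key=candidates.__getitem__)
            pointers.append(best_prev)
            new_score.append(candidates[best_prev] + emissions[t][cur])
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label.
    last = max(range(num_labels), key=score.__getitem__)
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1], score[last]

# Hypothetical scores for a 3-token sequence and 3 labels.
emissions = [[2.0, 0.5, 0.1], [0.3, 1.5, 0.2], [0.1, 0.4, 1.8]]
transitions = [[0.5, 1.0, -1.0], [-0.5, 0.2, 1.2], [0.3, -2.0, 0.4]]
best_path, best_score = viterbi(emissions, transitions)
brute_force = max(product(range(3), repeat=3),
                  key=lambda p: path_score(emissions, transitions, p))
```

Viterbi reaches the same path as the exhaustive search while only keeping, for every position, the best score that ends in each label.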

IV. EXPERIMENTAL RESULT AND ANALYSIS
In this section, we give the details of the long process of dataset preparation which consists of data collection, decryption, extraction, cleansing, entity type identification, annotation rules definition, data annotation, and train test splitting. We then describe the experiment settings we used to get the experimental results. Furthermore, we discuss the performance of our proposed method with several attention mechanism arrangements. We then compare the performance of our proposed method with other baseline models. Finally, we disclose the research challenges and limitations we encounter throughout the experiment. The experimental code along with the dataset is available on a GitHub repository. 2

A. DATASET PREPARATION
To the best of our knowledge, there are no publicly available NER datasets in the drone forensic domain yet. For this reason, a new dataset is constructed for the experiment in this paper. However, an open drone forensic image dataset is publicly available from the VTO Labs Drone Forensic Program. 3 Therefore, the first step in dataset preparation was the data extraction process. From the total of 82 drone images from 10 different models, we chose 60 drone images from three drone models, i.e., DJI, Parrot, and Yuneec, to extract, simply because these three models are the majority among the available models. As of March 2021, DJI had a market share of 76% based on sales volume. Thus, most of the consumer and commercial drones on the market are DJI-made [57]. The drone images are stored in several different formats, such as .ZIP, .001, and .BIN. These images are acquired from the controller devices, which are considered the primary evidence close to the owner and contain incident-related information [58]. After exploring the drone images with the help of Autopsy 4 and DJI Phantom Help 5 for extracting and decrypting, Autopsy was used to extract the drone image files from the Android-based controller with .001 and .BIN extensions. Autopsy is also used to decrypt the files inside the .ZIP files obtained from the iOS-based controller devices. Fig. 6 shows the Autopsy interface when extracting a drone forensic image acquired from an Android-based controller device. The green boxes denote the paths where the flight log files are stored. Sometimes, /dji.go.v4/ appears under a different folder name, i.e., /dji.pilot/. Both folders may exist at the same time in a single drone forensic image.
3 https://www.vtolabs.com/drone-forensics
The only data taken from the drone images were human-readable log messages, which are needed to perform entity recognition. To find this kind of data, we explored the drone image directories that potentially contain human-readable log data and found it in the flight log data. Then, we searched for all the locations of flight log data in all directories of every drone image we had downloaded and extracted. Unfortunately, we did not find the expected data in the Yuneec and Parrot models. Therefore, the only model that contains the data is DJI. After collecting the flight log files, we used the DJI Phantom Help tools to decrypt the files and obtain the plain data, which was then parsed to get the log message data. The number of messages from every drone image is shown in Table 1.
The next step is identifying the entity types mentioned in the drone log messages. Before reading the log messages, we removed duplicate messages and kept only the unique messages to read. After carefully reading the unique log messages and comprehensively studying what every log message indicates, we categorized the entity types mentioned in the flight log messages into six groups, i.e., Component, Action, Parameter, Function, State, and Issue. These entity types are used as the labels for every word in a message during data annotation.
Annotation is the process of assigning a label to each data point in order to train a supervised model. In this case, the log message is the data to be annotated. To demonstrate the power of contextual learning in the Transformer encoder, two annotation procedures are used to label the data, i.e., contextual tagging and consistent tagging. Consistent tagging assigns a label to a word by considering only the token itself and ignoring the context within the sentence. In contrast, contextual tagging assigns a label to each word in a sentence by considering the present context. The following are the criteria for the annotation process for each entity type.
A ''Component'' label is assigned to a span that indicates drone components, such as motors, sensors, and batteries. If the span indicates an action taken by the drone, it is assigned the ''Action'' label. The ''Parameter'' label is assigned to a span that indicates variables stored in the drone, such as maximum flight distance, maximum flight altitude, and battery temperature. Every drone type has features or functions supporting the tasks given to it. Examples of spans that indicate a function are obstacle avoidance, obstacle sensing, and remote controller settings. This type of span is assigned the ''Function'' label. Some spans indicate a drone's mode, such as sport mode, auto landing mode, and quick shot mode. These spans get the ''State'' label. Lastly, the ''Issue'' label is assigned to a span that indicates a flight issue that happens to the drone during a flight.
Before assigning the label to each word, we first tokenize the sentence into tokens using the tecoholic 6 tool. Then, the same tool is used to perform the data annotation process. IOB2 is used as the annotation scheme since it is one of the typical schemes in the NER task [59]. However, the BIOES scheme has been shown to improve the NER model's performance [37]. Therefore, after finishing the annotation using the IOB2 scheme, a Python script is used to convert the annotation into the BIOES scheme. We manually annotate only the unique messages, carefully reading the context of each sentence first. Fig. 7 shows a sample of annotated data using the IOB2 and BIOES schemes in CoNLL format. Sometimes, a particular span can belong to two or more alternative entity types. For such an ambiguous span, we chose the longest span as the context of the mentioned entity. For example, the ''battery temperature'' span is given the Parameter label under contextual tagging. Under consistent tagging, however, the word ''battery'' is assigned the Component label and the word ''temperature'' is assigned the Outside label. Similarly, for the ''battery signal error'' span, the Issue label is assigned to all three tokens when the context is considered, whereas consistent tagging assigns the tokens the Component, Outside, and Issue labels, respectively. After completing the labels for all unique messages, we annotate all messages by using the labeled unique messages as a lookup dictionary. Fig. 8 shows the annotation results using the consistent and contextual tagging procedures.
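The conversion from IOB2 to BIOES is mechanical and can be sketched as below. This is an illustrative reimplementation under the usual definition of the two schemes, not the script used to build the dataset; the example tag sequences are hypothetical.

```python
def iob2_to_bioes(tags):
    """Convert one sentence of IOB2 tags to BIOES tags."""
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, entity_type = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The span continues only if the next tag is I- of the same type.
        continues = next_tag == f"I-{entity_type}"
        if prefix == "B":
            converted.append(("B-" if continues else "S-") + entity_type)
        else:  # prefix == "I"
            converted.append(("I-" if continues else "E-") + entity_type)
    return converted

# A three-token Issue span, as in the ''battery signal error'' example.
issue_span = iob2_to_bioes(["B-Issue", "I-Issue", "I-Issue"])
# A two-token span, an outside token, and a single-token span.
mixed = iob2_to_bioes(["B-Parameter", "I-Parameter", "O", "B-Component", "O"])
```

Single-token entities receive the S- prefix and the last token of a longer span receives the E- prefix, which gives the decoder explicit span-boundary information.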
After completing the annotation process, the dataset is split into train and test sets. Unlike the usual splitting method, we split the dataset based on the drone types. The first nine drone models in Table 1 form the train set, while the last four are used as the test set. In this way, the train and test sets are generated from completely different drone models. Assuming that every model has its own features and functionalities, the generated log messages differ among the models as well. However, since all the drones are made by DJI, the test set is chosen from the most advanced types so that it contains log messages that do not exist in the train set. The features and functionalities of a more advanced model are not available in a less advanced model, and neither are the log messages they generate.
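A model-based split of this kind can be sketched as follows. The model names, the message records, and the `split_by_model` helper are hypothetical placeholders; the actual models and their order are those listed in Table 1.

```python
def split_by_model(messages, model_order, n_train_models):
    # The first n_train_models models form the train set, the rest the test set.
    train_models = set(model_order[:n_train_models])
    train = [m for m in messages if m["model"] in train_models]
    test = [m for m in messages if m["model"] not in train_models]
    return train, test

# Hypothetical placeholders for the thirteen drone models of Table 1.
model_order = [f"model_{i:02d}" for i in range(1, 14)]
messages = [{"model": name, "text": f"log message from {name}"} for name in model_order]
train, test = split_by_model(messages, model_order, n_train_models=9)
```

Because the split is driven by the model list rather than by a target ratio, the resulting train and test proportions follow from the number of messages each model happens to contain.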
As the final result of data preparation, the distribution of every entity type in the train and test set of the annotated datasets is shown in Table 2 and Table 3. The final composition of the dataset is 76:24 for train and test sets, respectively, from a total of 1850 log messages. The final proportion is uncontrollable since the splitting is done based on the drone models instead of directly dividing the message into a certain common ratio used in existing research.

B. EXPERIMENT SETTINGS
The experiment was conducted using the publicly available code provided with TENER's original paper [37]. Therefore, the only library that needs to be installed is fastNLP. 7 We modified the code to implement the proposed method. The hardware and software specification is as follows: Intel Core i7-8700 @ 3.2GHz, 16GB RAM, NVIDIA GeForce GTX 1060 6GB, and the Ubuntu 20.04 LTS operating system.
The dimension of the input vector is 768, divided into eight attention heads with a dimension of 96 for each head, and the Transformer architecture has three encoder layers. Both the train and test batch sizes are 8, with a learning rate of 0.001 and a warm-up step of 0.01. We set the dropout to 0.15 except for the fully-connected layers, which used 0.4. The intermediate fully-connected layers have 1536 dimensions. These parameters are inspired by the original Transformer [42] and TENER [37] papers. The number of epochs we used is 50 because the models have already converged by then, as shown in Fig. 11. Three char-level embeddings, i.e., LSTM, CNN, and AdaTrans, were combined with two word-level embeddings, GloVe and BERT, to provide input for the encoder layer. In the Transformer encoder, three different attention mechanisms are employed, combined with the option of whether or not to scale the attention score in the self-attention mechanism. Finally, CRF is the only decoder used.
7 https://fastnlp.readthedocs.io/
Several scenarios which used BiLSTM as the encoder are designed for the experiment based on the published reference as the baseline methods for comparison. The combination of arrangements from the available options of word embedding, char embedding, and the attention type are presented in the following subsection, along with the results. We freeze the BERT embedding to avoid the domination of the attention mechanism used in the BERT pretraining phase. Therefore, BERT parameters will not be updated during the training. We ran the experiment three times for each scenario and took the average as the final evaluation score.
The evaluation mechanism used in this experiment is the span-oriented paradigm. It means the predicted tag is evaluated on the entity type level instead of on the tag level. Therefore, if the predicted entity type is correct, even if the tag is not strictly correct, the predicted token is considered True Positive. For example, if the true label is B-Component, while the predicted label is I-Component or vice versa, we count the predicted label as True Positive.
Precision, Recall, and F1 score are used as the evaluation metrics after counting the true positives (TP), false positives (FP), and false negatives (FN) for each label. Since there are seven labels in the dataset with an imbalanced proportion, we use the micro-average approach to compute the final evaluation score. The formulas for per-entity-type precision, recall, and F1 are shown in (22), (23), and (24), respectively, where $c$ is the entity type and $C$ is the total number of entity types that exist in the dataset. The micro-average precision and recall are identical to the per-class formulas, but TP, FP, and FN are summed over all classes as in (25) and (26), while (27) is used to calculate the micro-average F1 score. For all of these evaluation metrics, we used the pre-defined function SpanFPreRecMetric in the fastNLP library. 8 The ε symbol is a small number that avoids division-by-zero errors, while β weighs precision against recall in the F1 score. In this paper, we use ε = 1e-13 and β = 1.

TABLE 4. Performance evaluation of all scenarios on the dataset annotated using consistent tagging with AdaTrans as the char-level embedding. The best score is indicated in bold font. BERT and GloVe are used as word embedding.
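As a worked illustration of the micro-average computation, the sketch below sums TP, FP, and FN over the entity types before applying the usual precision, recall, and F-beta formulas with ε in the denominators. The exact placement of ε is our assumption about (22) to (27), and the per-type counts are hypothetical.

```python
def f_pre_rec(tp, fp, fn, beta=1.0, eps=1e-13):
    # eps keeps the denominators non-zero when a class has no predictions or no gold spans.
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f_score = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall + eps)
    return f_score, precision, recall

def micro_average(counts, beta=1.0, eps=1e-13):
    # Sum the counts over all entity types, then apply the same formulas once.
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    return f_pre_rec(tp, fp, fn, beta, eps)

# Hypothetical counts for two entity types.
counts = {
    "Component": {"tp": 8, "fp": 2, "fn": 2},
    "Issue": {"tp": 2, "fp": 0, "fn": 2},
}
micro_f1, micro_precision, micro_recall = micro_average(counts)
```

With these counts the pooled totals are TP = 10, FP = 2, and FN = 4, so the micro-average precision is 10/12, the recall is 10/14, and the F1 score is 20/26. Frequent entity types therefore contribute more to the final score than rare ones.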

C. RESULTS ON DIFFERENT ANNOTATION RULES
This subsection presents all the possible arrangements of the available options for character embedding, word embedding, and attention mechanism. Since two types of datasets are constructed, every architectural arrangement is tested on these two datasets. Tables 4 to 6 show the evaluation scores for the first dataset, which was annotated using the non-contextual (consistent) tagging procedure. The model with the best performance is highlighted in bold font. The scores presented in the tables cover both the proposed and the baseline models. Each table represents a scenario grouped by the character embedding used: Table 4 shows the models employing AdaTrans for extracting character embedding, while Tables 5 and 6 show the models in which LSTM and CNN were used as the character embedding, respectively. From the evaluation scores presented in Tables 4 to 6, the best performance was achieved by the GloVe - Scaled AdaTrans combination with an 87.771% F1 score. AdaTrans attention consistently achieves the highest score for all character embedding and word embedding options. GloVe outperforms the BERT embedding on the non-contextual dataset in all scenarios. This is because the first dataset has consistent tagging, meaning that a word has the same tag in all contexts in the dataset. This matches the behavior of GloVe, where each word has exactly one representational vector. For the second dataset, the BERT - Scaled Transformer achieved the best overall performance with a 91.348% F1 score. The annotation procedure of the second dataset matches the contextual representation produced by the BERT embedding, where every word has one representational vector for each distinct context in the dataset.
8 https://fastnlp.readthedocs.io/zh/latest/fastNLP.core.metrics.html
This claim is supported by the experimental results shown in Table 7 to 9, where the scenarios that utilized BERT as the word embedding outperform the models that employed GloVe as the word embedding. As the dataset annotated using contextual tagging procedures better represents the semantics of a span, the following subsection discusses only the results of the contextual dataset.

D. ATTENTION MECHANISM IN COMPARISON
Attention-based models have been widely used in NLP research to model the context between words within a sentence. In this experiment, we investigate three different attention mechanisms to recognize mentioned entities in drone log messages. After conducting an extensive experiment, we obtained the results shown in Tables 4 to 9. The architecture arrangements are inspired by the TENER paper, which proposed AdaTrans to extract character-level features and to incorporate relative positional encoding into the attention layer. However, several scenarios have not been reported yet. Therefore, we experiment to find the best architecture to use on our dataset. The detailed explanation of every scenario is as follows.
Overall, the model's architecture consists of five layers, i.e., positional encoding, char embedding, word embedding, encoder, and decoder. Relative positional encoding proposed in TENER is used to reproduce the unexplored scenarios of AdaTrans. To obtain character-level embedding, either CNN, LSTM, or AdaTrans is utilized in every scenario, as listed in Tables 7 to 9. For the pre-trained word embedding, either GloVe or BERT is used to obtain the words' vector representation. Unscaled attention is reported to be better suited to NER since mentioned entities commonly consist of a few words only [37]. Thus, we experiment with every attention type in both the scaled and unscaled scenarios in the encoder layer, including the AdaTrans encoder. The employment of CRF can undoubtedly improve the performance of a NER model [56]. Thus, CRF is used as the decoder for all scenario arrangements.

TABLE 8. Performance evaluation of all scenarios on the dataset annotated using contextual tagging with LSTM as the char-level embedding. The best score is indicated in bold font. BERT and GloVe are used as word embedding.

TABLE 9. Performance evaluation of all scenarios on the dataset annotated using contextual tagging with CNN as the char-level embedding. The best score is indicated in bold font. BERT and GloVe are used as word embedding.
The CNN-BERT-Scaled Transformer outperforms the other scenarios with a 91.348% F1 score. This score is slightly higher than that of the unscaled Transformer. We assume this slight difference is due to the dataset size, so the effect of using scaling or not is insignificant. The AdaTrans-based encoder is considered a baseline model, which will be discussed in the following subsection. Our proposed model, which uses cosine similarity instead of the dot-product operation in the self-attention mechanism, falls slightly behind the Transformer with a competitive F1 score of 91.146%. This shows that cosine similarity is able to model the context between words in a sentence, just like the dot-product in the self-attention mechanism.
Contextual pre-trained word embedding has had a positive impact on recent NLP research. The main advantage of contextual over static pre-trained word embedding is the ability to generate a unique representational vector of a word for each distinct context in which it appears across different sentences. In this experiment, both types of pre-trained word embeddings were employed. From the results reported in Tables 7 to 9, the choice between contextual and static embedding significantly affects the model's evaluation score. This can be seen from the dot-product attention performance, with a 3.782% difference between the best dot-product attention models that use BERT and GloVe. A similarly significant difference occurs for the cosine attention, with a 3.899% difference. The other scenarios show consistent results, where models with static and contextual pre-trained word embedding differ significantly in the evaluation score. The process of capturing the context that exists between words in a sentence also occurs in the encoder layer, where it is performed by the self-attention mechanism. Therefore, the representational vector obtained from the static embedding undergoes a refinement process in the encoder layer. Eventually, the words' vectors from the embedding layer go through a contextual learning pipeline, just as in the contextual pre-trained word embedding. The contextual vector representation from BERT fits the intuition of contextual learning in the self-attention mechanism, in the sense that contextually related words within a sentence are close to one another in the representational space. Therefore, using a static or a contextual embedding in a Transformer-based model results in significantly different evaluation scores. Nevertheless, the experimental results show that using different character-level embeddings has an insignificant effect on the models' performance.
It has been argued that the scale factor implemented in scaled dot-product attention, as proposed in the vanilla Transformer, is better omitted in the NER task [37]. The reason is that omitting it sharpens the probability distribution yielded by the softmax in the self-attention mechanism. The sharper the attention, the fewer contexts are captured by the attention, which aligns with the short span length that mentioned entities commonly have [37]. However, the experimental results in Tables 8 and 9 point in the opposite direction. Consider the following points. First, the average sentence length is 6.3 and 8.8 in the train and test sets, respectively. Secondly, more than 80% of the sentences have a length between one and ten, as shown in Fig. 9. Thus, it is unlikely that a sentence contains several contexts, and unscaled attention would be expected to perform better than scaled attention. However, the experimental results demonstrate the contrary for the dot-product and AdaTrans attention, where the scaled variants achieved better performance than the unscaled ones. In contrast, the cosine attention performs better when unscaled. As shown in Fig. 5, the element-wise norm scale operation plays the same role as the scale factor in scaled dot-product attention. Therefore, a scaling factor is needed in the self-attention mechanism and is shown to improve the model's performance compared to unscaled attention.

E. COMPARISON WITH OTHER BASELINE MODELS
To assess our proposed method, we compare the proposed model with several baseline models, as shown in Fig. 10. The detailed architecture for each encoder is as follows: CNN-BERT-Scaled Transformer, AdaTrans-BERT-Unscaled Cosine, AdaTrans-BERT-Scaled AdaTrans, and AdaTrans-BERT-BiLSTM. For all of these encoders, CRF is used as the decoder. In terms of convergence speed, as depicted in Fig. 11, the proposed method converges as fast as the Transformer and AdaTrans. Moreover, cosine attention outperforms the BiLSTM model. In terms of the F1 score, our proposed method achieves the second-best performance with a 91.146% F1 score. This model uses AdaTrans as the character-level feature extractor, concatenated with the output of BERT as the pre-trained word embedding to get a word-level feature vector. The unscaled cosine attention is used as the encoder and CRF as the decoder. The scaled Transformer model achieved the best performance, with a 91.348% F1 score. This scenario consists of CNN char embedding, BERT word embedding, scaled dot-product attention, and CRF as the decoder. In comparison, the unscaled AdaTrans attention is in the third position, accompanied by AdaTrans as the character embedding and BERT as the word embedding, with a 90.514% F1 score. This suggests that the relative attention mechanism is less suitable for our case since our dataset has relatively short sequences, with 6.3 and 8.8 words on average in the train and test data, respectively.
Our proposed model underperforms the scaled dot-product attention by 0.202% in the F1 score. However, in terms of recall, our proposed model outperforms all the baseline models with a 93.612% recall score. This shows that the proposed method has the lowest false negative rate, i.e., few of the mentioned entities are missed. It means that the mentioned entities in the datasets are mostly correctly classified. Therefore, the proposed method successfully recognizes the regions of interest in the log messages.
The evaluation score indicates that the proposed model can recognize mentioned entities in flight log message data. When a forensic investigator conducts an evidence analysis process, plenty of evidence must be examined, analyzed, and evaluated. To this end, presenting the NER result in a sophisticated visualization can help the investigator pinpoint the region of interest in flight log data faster. Fig. 12 shows a sample log message that has been fed to the NER model in a visualization form to assist the forensic investigation. Having the mentioned entities highlighted with a particular color, the investigator can ignore the message with no highlights and focus only on those with color highlights. The color can be set to represent a level of importance. For instance, red can be used to highlight the Issue entity type. The highlight can help the forensic investigator find a message containing words or phrases with the Issue label.
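A color-coded view of this kind could be produced with a few lines of Python that wrap every tagged token in an HTML span. Only the choice of red for the Issue type comes from the discussion above; the other colors, the `highlight` helper, and the sample message are hypothetical.

```python
import html

# Only "Issue" -> red follows the text; the remaining colors are placeholders.
COLORS = {
    "Issue": "red",
    "Component": "lightblue",
    "Parameter": "khaki",
    "Action": "lightgreen",
    "Function": "plum",
    "State": "orange",
}

def highlight(tokens, tags):
    """Render one tagged log message as an HTML fragment."""
    parts = []
    for token, tag in zip(tokens, tags):
        text = html.escape(token)
        if tag == "O":
            parts.append(text)
            continue
        # Drop the IOB2/BIOES prefix and keep the entity type.
        entity_type = tag.split("-", 1)[1]
        color = COLORS.get(entity_type, "lightgray")
        parts.append(f'<span style="background:{color}" title="{entity_type}">{text}</span>')
    return " ".join(parts)

# Hypothetical message with one contextual Issue span.
tokens = ["battery", "signal", "error", "detected"]
tags = ["B-Issue", "I-Issue", "I-Issue", "O"]
fragment = highlight(tokens, tags)
```

Messages whose fragment contains no highlighted span can be skipped at a glance, while those carrying a red Issue span can be prioritized by the investigator.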

F. CHALLENGES AND LIMITATIONS
Having conducted the experiment, we describe several challenges below. Since drone forensics is a relatively new research domain, few open drone image datasets are available. We only found one drone image dataset, the VTO Labs Drone Forensic dataset. Even from 15 different models and more than 20 datasets, we discovered fewer than 2000 log messages. Moreover, the dataset does not contain any specific drone incident scenario. This implies that no ground truth can be used to test the proposed method from the perspective of forensic investigation, finding, and reporting. Since this is an initial attempt at NER for drone forensics, there are few references, datasets, and sources of domain-specific knowledge, such as entity types related to incidents and regions of interest in drone log messages. Therefore, this attempt opens many opportunities for future work, which will be our next project. Considering the time needed to perform a thorough analysis for one drone model, it is unrealistic to include other drone models. Besides, DJI has the largest market share, and the availability of a public dataset is one of the considerations that make this research verifiable and reproducible.

V. CONCLUSION AND FUTURE WORKS
In this research, we have experimented with the employment of cosine similarity as a substitute for dot-product self-attention in the encoder sub-layer of Transformer architecture. To evaluate our proposed approach, we construct our own NER dataset by manually extracting several drone forensic image datasets that are publicly available from the VTO Labs. For a relatively small dataset, we obtain a good result indicated by the F1 score of 91.348% achieved by the dot-product attention supported by CNN character embedding and BERT word embedding. Our proposed approach outperforms the RNN-based state-of-the-art by achieving the F1 score of 91.146%. The proposed model can achieve high scores even if the test data are generated from different drone models. This proves that NER can be used as an extraction tool to assist the forensic investigation by only utilizing the log message data to recognize some incident-related information.
We plan to further analyze the trade-off between the convergence speed with the decrease in the number of parameters in the simpler architecture when using cosine as the attention type. Since the number of epochs is a one-time cost, and the inference time is a repetitive cost, we plan to explore further how many epochs are needed to train the simpler model having fewer parameters after employing the cosine in the self-attention sublayer without losing performance. As this research is still an initial step in information extraction, we plan to deploy the NER model so that it can be used as a practical solution for the forensic investigator.